
[v8,5/5] multifd: Implement zero copy write in multifd migration (multifd-zero-copy)

Message ID 20220201062901.428838-6-leobras@redhat.com (mailing list archive)
State New, archived
Series MSG_ZEROCOPY + multifd

Commit Message

Leonardo Bras Feb. 1, 2022, 6:29 a.m. UTC
Implement zero copy send on nocomp_send_write(), by making use of QIOChannel
writev + flags & flush interface.

Change multifd_send_sync_main() so flush_zero_copy() can be called
after each iteration in order to make sure all dirty pages are sent before
a new iteration is started. It will also flush at the beginning and at the
end of migration.

Also make it return -1 if flush_zero_copy() fails, in order to cancel
the migration process, and avoid resuming the guest in the target host
without receiving all current RAM.

This will work fine on RAM migration because the RAM pages are not usually freed,
and there is no problem on changing the pages content between writev_zero_copy() and
the actual sending of the buffer, because this change will dirty the page and
cause it to be re-sent on a next iteration anyway.

A lot of locked memory may be needed in order to use multid migration
with zero-copy enabled, so disabling the feature should be necessary for
low-privileged users trying to perform multifd migrations.

Signed-off-by: Leonardo Bras <leobras@redhat.com>
---
 migration/multifd.h   |  4 +++-
 migration/migration.c | 11 ++++++++++-
 migration/multifd.c   | 41 +++++++++++++++++++++++++++++++++++------
 migration/ram.c       | 29 ++++++++++++++++++++++-------
 migration/socket.c    |  5 +++--
 5 files changed, 73 insertions(+), 17 deletions(-)
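
In short, the interface this series builds on, condensed from the hunks below (not additional code, just the relevant calls pulled together for readability):

    /* Condensed sketch from the patch: send with the zero-copy flag when
     * enabled, then flush at each sync point so the kernel has finished
     * reading the guest pages before they can be dirtied again. */
    p->write_flags = migrate_use_zero_copy_send() ?
                     QIO_CHANNEL_WRITE_FLAG_ZERO_COPY : 0;

    ret = qio_channel_writev_full_all(p->c, p->iov, p->iovs_num, NULL,
                                      0, p->write_flags, &local_err);

    /* Later, in multifd_send_sync_main(): */
    if (flush_zero_copy) {
        ret = qio_channel_flush(p->c, &err);  /* wait for pending zero-copy sends */
    }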

Comments

Peter Xu Feb. 8, 2022, 2:22 a.m. UTC | #1
On Tue, Feb 01, 2022 at 03:29:03AM -0300, Leonardo Bras wrote:
> -void multifd_send_sync_main(QEMUFile *f)
> +int multifd_send_sync_main(QEMUFile *f)
>  {
>      int i;
> +    bool flush_zero_copy;
>  
>      if (!migrate_use_multifd()) {
> -        return;
> +        return 0;
>      }
>      if (multifd_send_state->pages->num) {
>          if (multifd_send_pages(f) < 0) {
>              error_report("%s: multifd_send_pages fail", __func__);
> -            return;
> +            return 0;

I've not checked how it used to do if multifd_send_pages() failed, but.. should
it return -1 rather than 0 when there will be a return code?

>          }
>      }
> +
> +    /*
> +     * When using zero-copy, it's necessary to flush after each iteration to
> +     * make sure pages from earlier iterations don't end up replacing newer
> +     * pages.
> +     */
> +    flush_zero_copy = migrate_use_zero_copy_send();
> +
>      for (i = 0; i < migrate_multifd_channels(); i++) {
>          MultiFDSendParams *p = &multifd_send_state->params[i];
>  
> @@ -591,7 +600,7 @@ void multifd_send_sync_main(QEMUFile *f)
>          if (p->quit) {
>              error_report("%s: channel %d has already quit", __func__, i);
>              qemu_mutex_unlock(&p->mutex);
> -            return;
> +            return 0;

Same question here.

>          }

The rest looks good.  Thanks,
Leonardo Bras Feb. 8, 2022, 2:49 a.m. UTC | #2
Hello Peter, thanks for reviewing!

On Mon, Feb 7, 2022 at 11:22 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Tue, Feb 01, 2022 at 03:29:03AM -0300, Leonardo Bras wrote:
> > -void multifd_send_sync_main(QEMUFile *f)
> > +int multifd_send_sync_main(QEMUFile *f)
> >  {
> >      int i;
> > +    bool flush_zero_copy;
> >
> >      if (!migrate_use_multifd()) {
> > -        return;
> > +        return 0;
> >      }
> >      if (multifd_send_state->pages->num) {
> >          if (multifd_send_pages(f) < 0) {
> >              error_report("%s: multifd_send_pages fail", __func__);
> > -            return;
> > +            return 0;
>
> I've not checked how it used to do if multifd_send_pages() failed, but.. should
> it return -1 rather than 0 when there will be a return code?

Yeah, that makes sense.
The point here is that I was trying not to modify much of the current behavior.

I mean, multifd_send_sync_main() would previously return void, so any
other errors would not matter to the caller of this function, which
will continue to run as if nothing happened.

Now, if it fails with flush_zero_copy, the operation needs to be aborted.

Maybe, I should make it different:
- In any error, return -1.
- Create/use a specific error code in the case of a failing
flush_zero_copy, so I can test the return value for it on the caller
function and return early.

Or alternatively, the other errors could also return early, but since
this will change how the code currently works, I would probably need
another patch for that change. (so it can be easily reverted if
needed)

What do you think is better?
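
Just to make the first option concrete, here is a rough sketch of what I have in mind (only the changed returns are shown; this is not what v8 does, and the ram.c callers in this patch already treat any negative value as fatal):

    int multifd_send_sync_main(QEMUFile *f)
    {
        int i;

        if (!migrate_use_multifd()) {
            return 0;
        }
        if (multifd_send_state->pages->num) {
            if (multifd_send_pages(f) < 0) {
                error_report("%s: multifd_send_pages fail", __func__);
                return -1;    /* instead of returning 0 */
            }
        }
        for (i = 0; i < migrate_multifd_channels(); i++) {
            MultiFDSendParams *p = &multifd_send_state->params[i];

            /* ... */
            if (p->quit) {
                error_report("%s: channel %d has already quit", __func__, i);
                qemu_mutex_unlock(&p->mutex);
                return -1;    /* instead of returning 0 */
            }
            /* ... queue the sync packet and flush as in this patch ... */
        }
        /* ... wait on sem_sync as before ... */
        return 0;
    }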


> >          }
> >      }
> > +
> > +    /*
> > +     * When using zero-copy, it's necessary to flush after each iteration to
> > +     * make sure pages from earlier iterations don't end up replacing newer
> > +     * pages.
> > +     */
> > +    flush_zero_copy = migrate_use_zero_copy_send();
> > +
> >      for (i = 0; i < migrate_multifd_channels(); i++) {
> >          MultiFDSendParams *p = &multifd_send_state->params[i];
> >
> > @@ -591,7 +600,7 @@ void multifd_send_sync_main(QEMUFile *f)
> >          if (p->quit) {
> >              error_report("%s: channel %d has already quit", __func__, i);
> >              qemu_mutex_unlock(&p->mutex);
> > -            return;
> > +            return 0;
>
> Same question here.

Please see above,

>
> >          }
>
> The rest looks good.  Thanks,

Thank you!

Best regards,
Leo
Peter Xu Feb. 8, 2022, 3:05 a.m. UTC | #3
On Mon, Feb 07, 2022 at 11:49:38PM -0300, Leonardo Bras Soares Passos wrote:
> Hello Peter, thanks for reviewing!
> 
> On Mon, Feb 7, 2022 at 11:22 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Tue, Feb 01, 2022 at 03:29:03AM -0300, Leonardo Bras wrote:
> > > -void multifd_send_sync_main(QEMUFile *f)
> > > +int multifd_send_sync_main(QEMUFile *f)
> > >  {
> > >      int i;
> > > +    bool flush_zero_copy;
> > >
> > >      if (!migrate_use_multifd()) {
> > > -        return;
> > > +        return 0;
> > >      }
> > >      if (multifd_send_state->pages->num) {
> > >          if (multifd_send_pages(f) < 0) {
> > >              error_report("%s: multifd_send_pages fail", __func__);
> > > -            return;
> > > +            return 0;
> >
> > I've not checked how it used to do if multifd_send_pages() failed, but.. should
> > it return -1 rather than 0 when there will be a return code?
> 
> Yeah, that makes sense.
> The point here is that I was trying not to modify much of the current behavior.
> 
> I mean, multifd_send_sync_main() would previously return void, so any
> other errors would not matter to the caller of this function, which
> will continue to run as if nothing happened.
> 
> Now, if it fails with flush_zero_copy, the operation needs to be aborted.

Right, so my understanding is that we'll fail anyway, and this allows us to fail
(probably) sooner.

> 
> Maybe, I should make it different:
> - In any error, return -1.
> - Create/use a specific error code in the case of a failing
> flush_zero_copy, so I can test the return value for it on the caller
> function and return early.
> 
> Or alternatively, the other errors could also return early, but since
> this will change how the code currently works, I would probably need
> another patch for that change. (so it can be easily reverted if
> needed)

Yeah, adding a patch before this one should work too.

> 
> What do you think is better?

I just don't see how it could continue if e.g. multifd_send_pages() failed.

The other thing is that returning zero looks odd when there's obviously an
error.  Normally we could allow that, but it would be better with a comment explaining why.
For this case it's more natural to me if we could just fail early.

Juan?
Juan Quintela Feb. 18, 2022, 4:57 p.m. UTC | #4
Leonardo Bras <leobras@redhat.com> wrote:
> Implement zero copy send on nocomp_send_write(), by making use of QIOChannel
> writev + flags & flush interface.
>
> Change multifd_send_sync_main() so flush_zero_copy() can be called
> after each iteration in order to make sure all dirty pages are sent before
> a new iteration is started. It will also flush at the beginning and at the
> end of migration.
>
> Also make it return -1 if flush_zero_copy() fails, in order to cancel
> the migration process, and avoid resuming the guest in the target host
> without receiving all current RAM.
>
> This will work fine on RAM migration because the RAM pages are not usually freed,
> and there is no problem on changing the pages content between writev_zero_copy() and
> the actual sending of the buffer, because this change will dirty the page and
> cause it to be re-sent on a next iteration anyway.
>
> A lot of locked memory may be needed in order to use multid migration
                                                       ^^^^^^
multifd.

I can fix it on the commit.


> @@ -1479,7 +1479,16 @@ static bool migrate_params_check(MigrationParameters *params, Error **errp)
>          error_prepend(errp, "Invalid mapping given for block-bitmap-mapping: ");
>          return false;
>      }
> -
> +#ifdef CONFIG_LINUX
> +    if (params->zero_copy_send &&
> +        (!migrate_use_multifd() ||
> +         params->multifd_compression != MULTIFD_COMPRESSION_NONE ||
> +         (params->tls_creds && *params->tls_creds))) {
> +        error_setg(errp,
> +                   "Zero copy only available for non-compressed non-TLS multifd migration");
> +        return false;
> +    }
> +#endif
>      return true;
>  }

Test is long, but it is exactly what we need.  Good.


>  
> diff --git a/migration/multifd.c b/migration/multifd.c
> index 43998ad117..2d68b9cf4f 100644
> --- a/migration/multifd.c
> +++ b/migration/multifd.c
> @@ -568,19 +568,28 @@ void multifd_save_cleanup(void)
>      multifd_send_state = NULL;
>  }
>  
> -void multifd_send_sync_main(QEMUFile *f)
> +int multifd_send_sync_main(QEMUFile *f)
>  {
>      int i;
> +    bool flush_zero_copy;
>  
>      if (!migrate_use_multifd()) {
> -        return;
> +        return 0;
>      }
>      if (multifd_send_state->pages->num) {
>          if (multifd_send_pages(f) < 0) {
>              error_report("%s: multifd_send_pages fail", __func__);
> -            return;
> +            return 0;
>          }
>      }
> +
> +    /*
> +     * When using zero-copy, it's necessary to flush after each iteration to
> +     * make sure pages from earlier iterations don't end up replacing newer
> +     * pages.
> +     */
> +    flush_zero_copy = migrate_use_zero_copy_send();
> +
>      for (i = 0; i < migrate_multifd_channels(); i++) {
>          MultiFDSendParams *p = &multifd_send_state->params[i];
>  
> @@ -591,7 +600,7 @@ void multifd_send_sync_main(QEMUFile *f)
>          if (p->quit) {
>              error_report("%s: channel %d has already quit", __func__, i);
>              qemu_mutex_unlock(&p->mutex);
> -            return;
> +            return 0;
>          }
>  
>          p->packet_num = multifd_send_state->packet_num++;
> @@ -602,6 +611,17 @@ void multifd_send_sync_main(QEMUFile *f)
>          ram_counters.transferred += p->packet_len;
>          qemu_mutex_unlock(&p->mutex);
>          qemu_sem_post(&p->sem);
> +
> +        if (flush_zero_copy) {
> +            int ret;
> +            Error *err = NULL;
> +
> +            ret = qio_channel_flush(p->c, &err);
> +            if (ret < 0) {
> +                error_report_err(err);
> +                return -1;
> +            }
> +        }
>      }
>      for (i = 0; i < migrate_multifd_channels(); i++) {
>          MultiFDSendParams *p = &multifd_send_state->params[i];
> @@ -610,6 +630,8 @@ void multifd_send_sync_main(QEMUFile *f)
>          qemu_sem_wait(&p->sem_sync);
>      }
>      trace_multifd_send_sync_main(multifd_send_state->packet_num);
> +
> +    return 0;
>  }

We are leaving pages in flight for potentially a lot of time. I *think*
that we can sync at shorter intervals than that.

>  static void *multifd_send_thread(void *opaque)
> @@ -668,8 +690,8 @@ static void *multifd_send_thread(void *opaque)
>              p->iov[0].iov_len = p->packet_len;
>              p->iov[0].iov_base = p->packet;
>  
> -            ret = qio_channel_writev_all(p->c, p->iov, p->iovs_num,
> -                                         &local_err);
> +            ret = qio_channel_writev_full_all(p->c, p->iov, p->iovs_num, NULL,
> +                                              0, p->write_flags, &local_err);
>              if (ret != 0) {
>                  break;
>              }

I still think that it would be better to have a sync here in each
thread.  I don't know the size, but once every couple megabytes of RAM
or so.
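
For illustration, something along these lines is what I mean; the 16 MiB threshold and the bytes_since_flush counter are made up, it is only a sketch:

    /* Sketch: flush from inside multifd_send_thread() every few megabytes.
     * bytes_since_flush would be a new per-channel counter (made-up name),
     * and the 16 MiB threshold is arbitrary. */
    ret = qio_channel_writev_full_all(p->c, p->iov, p->iovs_num, NULL,
                                      0, p->write_flags, &local_err);
    if (ret != 0) {
        break;
    }

    p->bytes_since_flush += iov_size(p->iov, p->iovs_num);
    if ((p->write_flags & QIO_CHANNEL_WRITE_FLAG_ZERO_COPY) &&
        p->bytes_since_flush >= 16 * MiB) {
        if (qio_channel_flush(p->c, &local_err) < 0) {
            break;    /* treated like any other send error */
        }
        p->bytes_since_flush = 0;
    }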

I did a change on:

commit d48c3a044537689866fe44e65d24c7d39a68868a
Author: Juan Quintela <quintela@redhat.com>
Date:   Fri Nov 19 15:35:58 2021 +0100

    multifd: Use a single writev on the send side
    
    Until now, we wrote the packet header with write(), and the rest of the
    pages with writev().  Just increase the size of the iovec and do a
    single writev().
    
    Signed-off-by: Juan Quintela <quintela@redhat.com>
    Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

And now we need to "preserve" this header until we do the sync,
otherwise we are overwriting it with other things.

What testing have you done after this commit?

Notice that it is not _complicated_ to fix it; I will try to come up with
some idea on Monday, but basically it means having an array of buffers for
each thread, and storing them until we call a sync().
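
Roughly like this (HEADER_RING, packet_ring[] and ring_idx are invented names, just to sketch the idea):

    /* Sketch: keep the last HEADER_RING packet headers per channel alive,
     * and only reuse a slot after a flush guarantees the kernel is done
     * with the previous round of zero-copy writes. */
    #define HEADER_RING 64

    /* New (hypothetical) MultiFDSendParams fields:
     *     MultiFDPacket_t *packet_ring[HEADER_RING];
     *     unsigned int ring_idx;
     * each packet_ring[i] allocated at setup the same way p->packet is today. */

    /* In multifd_send_thread(), before filling in the header: */
    if (p->ring_idx == 0) {
        /* About to wrap around: make sure no in-flight write still reads
         * the headers from the previous round (the first flush is a no-op). */
        if (qio_channel_flush(p->c, &local_err) < 0) {
            break;
        }
    }
    p->packet = p->packet_ring[p->ring_idx];
    p->ring_idx = (p->ring_idx + 1) % HEADER_RING;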

Later, Juan.
Juan Quintela Feb. 18, 2022, 5:36 p.m. UTC | #5
Leonardo Bras Soares Passos <leobras@redhat.com> wrote:
> Hello Peter, thanks for reviewing!
>
> On Mon, Feb 7, 2022 at 11:22 PM Peter Xu <peterx@redhat.com> wrote:
>>
>> On Tue, Feb 01, 2022 at 03:29:03AM -0300, Leonardo Bras wrote:
>> > -void multifd_send_sync_main(QEMUFile *f)
>> > +int multifd_send_sync_main(QEMUFile *f)
>> >  {
>> >      int i;
>> > +    bool flush_zero_copy;
>> >
>> >      if (!migrate_use_multifd()) {
>> > -        return;
>> > +        return 0;
>> >      }
>> >      if (multifd_send_state->pages->num) {
>> >          if (multifd_send_pages(f) < 0) {
>> >              error_report("%s: multifd_send_pages fail", __func__);
>> > -            return;
>> > +            return 0;
>>
>> I've not checked how it used to do if multifd_send_pages() failed, but.. should
>> it return -1 rather than 0 when there will be a return code?
>
> Yeah, that makes sense.
> The point here is that I was trying not to modify much of the current behavior.

    if (qatomic_read(&multifd_send_state->exiting)) {
        return -1;
    }

        if (p->quit) {
            error_report("%s: channel %d has already quit!", __func__, i);
            qemu_mutex_unlock(&p->mutex);
            return -1;
        }

These are the only two cases where the current code can return an
error.  In the first case we are exiting: we are already in the middle of
finishing, so we don't really care.
In the second one, we have already quit, and the error is already quite big.

But I agree with both comments:
- we need to improve the error paths
- Leonardo's changes don't affect what is already there.


> I mean, multifd_send_sync_main() would previously return void, so any
> other errors would not matter to the caller of this function, which
> will continue to run as if nothing happened.
>
> Now, if it fails with flush_zero_copy, the operation needs to be aborted.
>
> Maybe, I should make it different:
> - In any error, return -1.
> - Create/use a specific error code in the case of a failing
> flush_zero_copy, so I can test the return value for it on the caller
> function and return early.

We need to add the check.  It doesn't matter if the problem is zero_copy
or the existing one; we are in a minor catastrophe and migration has
to be aborted.

> Or alternatively, the other errors could also return early, but since
> this will change how the code currently works, I would probably need
> another patch for that change. (so it can be easily reverted if
> needed)
>
> What do you think is better?
>
>
>> >          }
>> >      }
>> > +
>> > +    /*
>> > +     * When using zero-copy, it's necessary to flush after each iteration to
>> > +     * make sure pages from earlier iterations don't end up replacing newer
>> > +     * pages.
>> > +     */
>> > +    flush_zero_copy = migrate_use_zero_copy_send();
>> > +
>> >      for (i = 0; i < migrate_multifd_channels(); i++) {
>> >          MultiFDSendParams *p = &multifd_send_state->params[i];
>> >
>> > @@ -591,7 +600,7 @@ void multifd_send_sync_main(QEMUFile *f)
>> >          if (p->quit) {
>> >              error_report("%s: channel %d has already quit", __func__, i);
>> >              qemu_mutex_unlock(&p->mutex);
>> > -            return;
>> > +            return 0;
>>
>> Same question here.
>
> Please see above,
>
>>
>> >          }
>>
>> The rest looks good.  Thanks,

Later, Juan.
Leonardo Bras Feb. 21, 2022, 7:41 p.m. UTC | #6
Hello Juan, thanks for the feedback!

On Fri, Feb 18, 2022 at 1:57 PM Juan Quintela <quintela@redhat.com> wrote:
>
> Leonardo Bras <leobras@redhat.com> wrote:
> > Implement zero copy send on nocomp_send_write(), by making use of QIOChannel
> > writev + flags & flush interface.
> >
> > Change multifd_send_sync_main() so flush_zero_copy() can be called
> > after each iteration in order to make sure all dirty pages are sent before
> > a new iteration is started. It will also flush at the beginning and at the
> > end of migration.
> >
> > Also make it return -1 if flush_zero_copy() fails, in order to cancel
> > the migration process, and avoid resuming the guest in the target host
> > without receiving all current RAM.
> >
> > This will work fine on RAM migration because the RAM pages are not usually freed,
> > and there is no problem on changing the pages content between writev_zero_copy() and
> > the actual sending of the buffer, because this change will dirty the page and
> > cause it to be re-sent on a next iteration anyway.
> >
> > A lot of locked memory may be needed in order to use multid migration
>                                                        ^^^^^^
> multifd.
>
> I can fix it on the commit.

No worries, fixed for v9.

>
>
> > @@ -1479,7 +1479,16 @@ static bool migrate_params_check(MigrationParameters *params, Error **errp)
> >          error_prepend(errp, "Invalid mapping given for block-bitmap-mapping: ");
> >          return false;
> >      }
> > -
> > +#ifdef CONFIG_LINUX
> > +    if (params->zero_copy_send &&
> > +        (!migrate_use_multifd() ||
> > +         params->multifd_compression != MULTIFD_COMPRESSION_NONE ||
> > +         (params->tls_creds && *params->tls_creds))) {
> > +        error_setg(errp,
> > +                   "Zero copy only available for non-compressed non-TLS multifd migration");
> > +        return false;
> > +    }
> > +#endif
> >      return true;
> >  }
>
> Test is long, but it is exactly what we need.  Good.

Thanks!


>
>
> >
> > diff --git a/migration/multifd.c b/migration/multifd.c
> > index 43998ad117..2d68b9cf4f 100644
> > --- a/migration/multifd.c
> > +++ b/migration/multifd.c
> > @@ -568,19 +568,28 @@ void multifd_save_cleanup(void)
> >      multifd_send_state = NULL;
> >  }
> >
> > -void multifd_send_sync_main(QEMUFile *f)
> > +int multifd_send_sync_main(QEMUFile *f)
> >  {
> >      int i;
> > +    bool flush_zero_copy;
> >
> >      if (!migrate_use_multifd()) {
> > -        return;
> > +        return 0;
> >      }
> >      if (multifd_send_state->pages->num) {
> >          if (multifd_send_pages(f) < 0) {
> >              error_report("%s: multifd_send_pages fail", __func__);
> > -            return;
> > +            return 0;
> >          }
> >      }
> > +
> > +    /*
> > +     * When using zero-copy, it's necessary to flush after each iteration to
> > +     * make sure pages from earlier iterations don't end up replacing newer
> > +     * pages.
> > +     */
> > +    flush_zero_copy = migrate_use_zero_copy_send();
> > +
> >      for (i = 0; i < migrate_multifd_channels(); i++) {
> >          MultiFDSendParams *p = &multifd_send_state->params[i];
> >
> > @@ -591,7 +600,7 @@ void multifd_send_sync_main(QEMUFile *f)
> >          if (p->quit) {
> >              error_report("%s: channel %d has already quit", __func__, i);
> >              qemu_mutex_unlock(&p->mutex);
> > -            return;
> > +            return 0;
> >          }
> >
> >          p->packet_num = multifd_send_state->packet_num++;
> > @@ -602,6 +611,17 @@ void multifd_send_sync_main(QEMUFile *f)
> >          ram_counters.transferred += p->packet_len;
> >          qemu_mutex_unlock(&p->mutex);
> >          qemu_sem_post(&p->sem);
> > +
> > +        if (flush_zero_copy) {
> > +            int ret;
> > +            Error *err = NULL;
> > +
> > +            ret = qio_channel_flush(p->c, &err);
> > +            if (ret < 0) {
> > +                error_report_err(err);
> > +                return -1;
> > +            }
> > +        }
> >      }
> >      for (i = 0; i < migrate_multifd_channels(); i++) {
> >          MultiFDSendParams *p = &multifd_send_state->params[i];
> > @@ -610,6 +630,8 @@ void multifd_send_sync_main(QEMUFile *f)
> >          qemu_sem_wait(&p->sem_sync);
> >      }
> >      trace_multifd_send_sync_main(multifd_send_state->packet_num);
> > +
> > +    return 0;
> >  }
>
> We are leaving pages in flight for potentially a lot of time. I *think*
> that we can sync at shorter intervals than that.
>
> >  static void *multifd_send_thread(void *opaque)
> > @@ -668,8 +690,8 @@ static void *multifd_send_thread(void *opaque)
> >              p->iov[0].iov_len = p->packet_len;
> >              p->iov[0].iov_base = p->packet;
> >
> > -            ret = qio_channel_writev_all(p->c, p->iov, p->iovs_num,
> > -                                         &local_err);
> > +            ret = qio_channel_writev_full_all(p->c, p->iov, p->iovs_num, NULL,
> > +                                              0, p->write_flags, &local_err);
> >              if (ret != 0) {
> >                  break;
> >              }
>
> I still think that it would be better to have a sync here in each
> thread.  I don't know the size, but once every couple megabytes of RAM
> or so.

This seems a good idea, since the first iteration may take a while,
and we may take a lot of time to fail if something goes wrong with
zerocopy at the start of iteration 1.

On the other hand, flushing takes some time, and doing it a lot may
take away some of the performance improvements.

Maybe we could move the flushing to a thread of its own, if it
becomes a problem.


>
> I did a change on:
>
> commit d48c3a044537689866fe44e65d24c7d39a68868a
> Author: Juan Quintela <quintela@redhat.com>
> Date:   Fri Nov 19 15:35:58 2021 +0100
>
>     multifd: Use a single writev on the send side
>
>     Until now, we wrote the packet header with write(), and the rest of the
>     pages with writev().  Just increase the size of the iovec and do a
>     single writev().
>
>     Signed-off-by: Juan Quintela <quintela@redhat.com>
>     Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
>
> And now we need to "preserve" this header until we do the sync,
> otherwise we are overwriting it with other things.

Yeah, that is a problem I faced on non-multifd migration, and a reason
why I chose to implement it directly in multifd.

>
> What testing have you done after this commit?

Not much, but this will probably become an issue with bigger guests.

>
> Notice that it is not _complicated_ to fix it; I will try to come up with
> some idea on Monday, but basically it means having an array of buffers for
> each thread, and storing them until we call a sync().

That will probably work; we can use MultiFDRecvParams->packet as this
array, right?
We can start with a somewhat small buffer size and grow if it ever gets full.
(Full means a sync + g_try_malloc0 + g_free)

By my calculations, it should take 1.8kB per element on a 4k PAGESIZE.

What do you think?

Best regards,
Leo
Leonardo Bras Feb. 21, 2022, 7:47 p.m. UTC | #7
On Fri, Feb 18, 2022 at 2:36 PM Juan Quintela <quintela@redhat.com> wrote:
>
> Leonardo Bras Soares Passos <leobras@redhat.com> wrote:
> > Hello Peter, thanks for reviewing!
> >
> > On Mon, Feb 7, 2022 at 11:22 PM Peter Xu <peterx@redhat.com> wrote:
> >>
> >> On Tue, Feb 01, 2022 at 03:29:03AM -0300, Leonardo Bras wrote:
> >> > -void multifd_send_sync_main(QEMUFile *f)
> >> > +int multifd_send_sync_main(QEMUFile *f)
> >> >  {
> >> >      int i;
> >> > +    bool flush_zero_copy;
> >> >
> >> >      if (!migrate_use_multifd()) {
> >> > -        return;
> >> > +        return 0;
> >> >      }
> >> >      if (multifd_send_state->pages->num) {
> >> >          if (multifd_send_pages(f) < 0) {
> >> >              error_report("%s: multifd_send_pages fail", __func__);
> >> > -            return;
> >> > +            return 0;
> >>
> >> I've not checked how it used to do if multifd_send_pages() failed, but.. should
> >> it return -1 rather than 0 when there will be a return code?
> >
> > Yeah, that makes sense.
> > The point here is that I was trying not to modify much of the current behavior.
>
>     if (qatomic_read(&multifd_send_state->exiting)) {
>         return -1;
>     }
>
>         if (p->quit) {
>             error_report("%s: channel %d has already quit!", __func__, i);
>             qemu_mutex_unlock(&p->mutex);
>             return -1;
>         }
>
> These are the only two cases where the current code can return an
> error.  In the first case we are exiting: we are already in the middle of
> finishing, so we don't really care.
> In the second one, we have already quit, and the error is already quite big.
>
> But I agree with both comments:
> - we need to improve the error paths
> - Leonardo's changes don't affect what is already there.
>



>
> > I mean, multifd_send_sync_main() would previously return void, so any
> > other errors would not matter to the caller of this function, which
> > will continue to run as if nothing happened.
> >
> > Now, if it fails with flush_zero_copy, the operation needs to be aborted.
> >
> > Maybe, I should make it different:
> > - In any error, return -1.
> > - Create/use a specific error code in the case of a failing
> > flush_zero_copy, so I can test the return value for it on the caller
> > function and return early.
>
> We need to add the check.  It doesn't matter if the problem is zero_copy
> or the existing one; we are in a minor catastrophe and migration has
> to be aborted.

Ok, I will fix that so we can abort in case of any error.
Maybe it's better to do that in a separate patch, before 5/5, right?

>
> > Or alternatively, the other errors could also return early, but since
> > this will change how the code currently works, I would probably need
> > another patch for that change. (so it can be easily reverted if
> > needed)
> >
> > What do you think is better?
> >
> >
> >> >          }
> >> >      }
> >> > +
> >> > +    /*
> >> > +     * When using zero-copy, it's necessary to flush after each iteration to
> >> > +     * make sure pages from earlier iterations don't end up replacing newer
> >> > +     * pages.
> >> > +     */
> >> > +    flush_zero_copy = migrate_use_zero_copy_send();
> >> > +
> >> >      for (i = 0; i < migrate_multifd_channels(); i++) {
> >> >          MultiFDSendParams *p = &multifd_send_state->params[i];
> >> >
> >> > @@ -591,7 +600,7 @@ void multifd_send_sync_main(QEMUFile *f)
> >> >          if (p->quit) {
> >> >              error_report("%s: channel %d has already quit", __func__, i);
> >> >              qemu_mutex_unlock(&p->mutex);
> >> > -            return;
> >> > +            return 0;
> >>
> >> Same question here.
> >
> > Please see above,
> >
> >>
> >> >          }
> >>
> >> The rest looks good.  Thanks,
>
> Later, Juan.
>

Thanks for the feedback!

Best regards,
Leo
Leonardo Bras Feb. 22, 2022, 4:09 a.m. UTC | #8
On Mon, Feb 21, 2022 at 4:41 PM Leonardo Bras Soares Passos
<leobras@redhat.com> wrote:
>
> Hello Juan, thanks for the feedback!
>
> On Fri, Feb 18, 2022 at 1:57 PM Juan Quintela <quintela@redhat.com> wrote:
> >
> > Leonardo Bras <leobras@redhat.com> wrote:
> > > Implement zero copy send on nocomp_send_write(), by making use of QIOChannel
> > > writev + flags & flush interface.
> > >
> > > Change multifd_send_sync_main() so flush_zero_copy() can be called
> > > after each iteration in order to make sure all dirty pages are sent before
> > > a new iteration is started. It will also flush at the beginning and at the
> > > end of migration.
> > >
> > > Also make it return -1 if flush_zero_copy() fails, in order to cancel
> > > the migration process, and avoid resuming the guest in the target host
> > > without receiving all current RAM.
> > >
> > > This will work fine on RAM migration because the RAM pages are not usually freed,
> > > and there is no problem on changing the pages content between writev_zero_copy() and
> > > the actual sending of the buffer, because this change will dirty the page and
> > > cause it to be re-sent on a next iteration anyway.
> > >
> > > A lot of locked memory may be needed in order to use multid migration
> >                                                        ^^^^^^
> > multifd.
> >
> > I can fix it on the commit.
>
> No worries, fixed for v9.
>
> >
> >
> > > @@ -1479,7 +1479,16 @@ static bool migrate_params_check(MigrationParameters *params, Error **errp)
> > >          error_prepend(errp, "Invalid mapping given for block-bitmap-mapping: ");
> > >          return false;
> > >      }
> > > -
> > > +#ifdef CONFIG_LINUX
> > > +    if (params->zero_copy_send &&
> > > +        (!migrate_use_multifd() ||
> > > +         params->multifd_compression != MULTIFD_COMPRESSION_NONE ||
> > > +         (params->tls_creds && *params->tls_creds))) {
> > > +        error_setg(errp,
> > > +                   "Zero copy only available for non-compressed non-TLS multifd migration");
> > > +        return false;
> > > +    }
> > > +#endif
> > >      return true;
> > >  }
> >
> > Test is long, but it is exactly what we need.  Good.
>
> Thanks!
>
>
> >
> >
> > >
> > > diff --git a/migration/multifd.c b/migration/multifd.c
> > > index 43998ad117..2d68b9cf4f 100644
> > > --- a/migration/multifd.c
> > > +++ b/migration/multifd.c
> > > @@ -568,19 +568,28 @@ void multifd_save_cleanup(void)
> > >      multifd_send_state = NULL;
> > >  }
> > >
> > > -void multifd_send_sync_main(QEMUFile *f)
> > > +int multifd_send_sync_main(QEMUFile *f)
> > >  {
> > >      int i;
> > > +    bool flush_zero_copy;
> > >
> > >      if (!migrate_use_multifd()) {
> > > -        return;
> > > +        return 0;
> > >      }
> > >      if (multifd_send_state->pages->num) {
> > >          if (multifd_send_pages(f) < 0) {
> > >              error_report("%s: multifd_send_pages fail", __func__);
> > > -            return;
> > > +            return 0;
> > >          }
> > >      }
> > > +
> > > +    /*
> > > +     * When using zero-copy, it's necessary to flush after each iteration to
> > > +     * make sure pages from earlier iterations don't end up replacing newer
> > > +     * pages.
> > > +     */
> > > +    flush_zero_copy = migrate_use_zero_copy_send();
> > > +
> > >      for (i = 0; i < migrate_multifd_channels(); i++) {
> > >          MultiFDSendParams *p = &multifd_send_state->params[i];
> > >
> > > @@ -591,7 +600,7 @@ void multifd_send_sync_main(QEMUFile *f)
> > >          if (p->quit) {
> > >              error_report("%s: channel %d has already quit", __func__, i);
> > >              qemu_mutex_unlock(&p->mutex);
> > > -            return;
> > > +            return 0;
> > >          }
> > >
> > >          p->packet_num = multifd_send_state->packet_num++;
> > > @@ -602,6 +611,17 @@ void multifd_send_sync_main(QEMUFile *f)
> > >          ram_counters.transferred += p->packet_len;
> > >          qemu_mutex_unlock(&p->mutex);
> > >          qemu_sem_post(&p->sem);
> > > +
> > > +        if (flush_zero_copy) {
> > > +            int ret;
> > > +            Error *err = NULL;
> > > +
> > > +            ret = qio_channel_flush(p->c, &err);
> > > +            if (ret < 0) {
> > > +                error_report_err(err);
> > > +                return -1;
> > > +            }
> > > +        }
> > >      }
> > >      for (i = 0; i < migrate_multifd_channels(); i++) {
> > >          MultiFDSendParams *p = &multifd_send_state->params[i];
> > > @@ -610,6 +630,8 @@ void multifd_send_sync_main(QEMUFile *f)
> > >          qemu_sem_wait(&p->sem_sync);
> > >      }
> > >      trace_multifd_send_sync_main(multifd_send_state->packet_num);
> > > +
> > > +    return 0;
> > >  }
> >
> > We are leaving pages in flight for potentially a lot of time. I *think*
> > that we can sync at shorter intervals than that.
> >
> > >  static void *multifd_send_thread(void *opaque)
> > > @@ -668,8 +690,8 @@ static void *multifd_send_thread(void *opaque)
> > >              p->iov[0].iov_len = p->packet_len;
> > >              p->iov[0].iov_base = p->packet;
> > >
> > > -            ret = qio_channel_writev_all(p->c, p->iov, p->iovs_num,
> > > -                                         &local_err);
> > > +            ret = qio_channel_writev_full_all(p->c, p->iov, p->iovs_num, NULL,
> > > +                                              0, p->write_flags, &local_err);
> > >              if (ret != 0) {
> > >                  break;
> > >              }
> >
> > I still think that it would be better to have a sync here in each
> > thread.  I don't know the size, but once every couple megabytes of RAM
> > or so.
>
> This seems a good idea, since the first iteration may take a while,
> and we may take a lot of time to fail if something goes wrong with
> zerocopy at the start of iteration 1.
>
> On the other hand, flushing takes some time, and doing it a lot may
> take away some of the performance improvements.
>
> Maybe we could move the flushing to a thread of its own, if it
> becomes a problem.
>
>
> >
> > I did a change on:
> >
> > commit d48c3a044537689866fe44e65d24c7d39a68868a
> > Author: Juan Quintela <quintela@redhat.com>
> > Date:   Fri Nov 19 15:35:58 2021 +0100
> >
> >     multifd: Use a single writev on the send side
> >
> >     Until now, we wrote the packet header with write(), and the rest of the
> >     pages with writev().  Just increase the size of the iovec and do a
> >     single writev().
> >
> >     Signed-off-by: Juan Quintela <quintela@redhat.com>
> >     Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> >
> > And now we need to "preserve" this header until we do the sync,
> > otherwise we are overwriting it with other things.
>
> Yeah, that is a problem I faced on non-multifd migration, and a reason
> why I chose to implement it directly in multifd.
>
> >
> > What testing have you done after this commit?
>
> Not much, but this will probably become an issue with bigger guests.
>
> >
> > Notice that it is not _complicated_ to fix it; I will try to come up with
> > some idea on Monday, but basically it means having an array of buffers for
> > each thread, and storing them until we call a sync().
>
> That will probably work; we can use MultiFDRecvParams->packet as this
> array, right?
> We can start with a somewhat small buffer size and grow if it ever gets full.
> (Full means a sync + g_try_malloc0 + g_free)

Or maybe use the array size as a size limiter before flushing, so it
will always flush after BUFFSIZE headers are sent.

>
> By my calculations, it should take 1.8kB per element on a 4k PAGESIZE.
>
> What do you think?
>
> Best regards,
> Leo
Peter Xu March 1, 2022, 3:57 a.m. UTC | #9
On Fri, Feb 18, 2022 at 05:57:13PM +0100, Juan Quintela wrote:
> I did a change on:
> 
> commit d48c3a044537689866fe44e65d24c7d39a68868a
> Author: Juan Quintela <quintela@redhat.com>
> Date:   Fri Nov 19 15:35:58 2021 +0100
> 
>     multifd: Use a single writev on the send side
>     
>     Until now, we wrote the packet header with write(), and the rest of the
>     pages with writev().  Just increase the size of the iovec and do a
>     single writev().
>     
>     Signed-off-by: Juan Quintela <quintela@redhat.com>
>     Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> 
> And now we need to "preserve" this header until we do the sync,
> otherwise we are overwriting it with other things.
> 
> What testing have you done after this commit?
> 
> Notice that it is not _complicated_ to fix it; I will try to come up with
> some idea on Monday, but basically it means having an array of buffers for
> each thread, and storing them until we call a sync().

Or can we conditionally merge the two write()s?  IMHO the array of buffers
idea sounds too complicated, and I'm not extremely sure whether it'll pay
off at last.  We could keep the two write()s with ZEROCOPY enabled, and use
the merged version otherwise.
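
E.g. something like the sketch below, reusing the iov layout from this patch where iov[0] is the packet header (only an illustration, not tested):

    if (p->write_flags & QIO_CHANNEL_WRITE_FLAG_ZERO_COPY) {
        /* Plain write: the kernel copies the header, so the buffer can be
         * reused immediately without waiting for a flush. */
        ret = qio_channel_write_all(p->c, (const char *)p->packet,
                                    p->packet_len, &local_err);
        if (ret == 0) {
            /* Guest pages still go out as zero-copy. */
            ret = qio_channel_writev_full_all(p->c, &p->iov[1],
                                              p->iovs_num - 1, NULL, 0,
                                              p->write_flags, &local_err);
        }
    } else {
        /* Keep the single merged writev() from commit d48c3a044537. */
        ret = qio_channel_writev_full_all(p->c, p->iov, p->iovs_num, NULL,
                                          0, p->write_flags, &local_err);
    }
    if (ret != 0) {
        break;
    }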

Btw, is there any performance measurements for above commit d48c3a044537?
I had a feeling that the single write() may not help that much, because for
multifd the bottleneck should be on the nic not on the processor.

IOW, we could find that the major time used does not fall into the
user<->kernel switches (which is where the extra overhead of the write()
syscall lies, IIUC), but that we simply blocked on any of the write()s because the
socket write buffer is full...  So we could have saved some CPU cycles by
merging the calls, but performance-wise we may not gain much.

Thanks,
Leonardo Bras March 7, 2022, 2:20 p.m. UTC | #10
On Tue, Mar 1, 2022 at 12:57 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Fri, Feb 18, 2022 at 05:57:13PM +0100, Juan Quintela wrote:
> > I did a change on:
> >
> > commit d48c3a044537689866fe44e65d24c7d39a68868a
> > Author: Juan Quintela <quintela@redhat.com>
> > Date:   Fri Nov 19 15:35:58 2021 +0100
> >
> >     multifd: Use a single writev on the send side
> >
> >     Until now, we wrote the packet header with write(), and the rest of the
> >     pages with writev().  Just increase the size of the iovec and do a
> >     single writev().
> >
> >     Signed-off-by: Juan Quintela <quintela@redhat.com>
> >     Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> >
> > And now we need to "preserve" this header until we do the sync,
> > otherwise we are overwriting it with other things.
> >
> > What testing have you done after this commit?
> >
> > Notice that it is not _complicated_ to fix it; I will try to come up with
> > some idea on Monday, but basically it means having an array of buffers for
> > each thread, and storing them until we call a sync().
>
> Or can we conditionally merge the two write()s?  IMHO the array of buffers
> idea sounds too complicated, and I'm not extremely sure whether it'll pay
> off at last.  We could keep the two write()s with ZEROCOPY enabled, and use
> the merged version otherwise.

I think that's a great idea!
It would optimize the non-zerocopy version while letting us have a
simpler zerocopy implementation.
The array of buffers implementation would either require us to have a
'large' amount of memory for keeping the headers, or make flushes
happen too often.

>
> Btw, is there any performance measurements for above commit d48c3a044537?
> I had a feeling that the single write() may not help that much, because for
> multifd the bottleneck should be on the nic not on the processor.

I am quite curious about those numbers too.

>
> IOW, we could find that the major time used does not fall into the
> user<->kernel switches (which is where the extra overhead of the write()
> syscall lies, IIUC), but that we simply blocked on any of the write()s because the
> socket write buffer is full...  So we could have saved some CPU cycles by
> merging the calls, but performance-wise we may not gain much.
>
> Thanks,
>
> --
> Peter Xu
>

Thanks Peter!

Patch

diff --git a/migration/multifd.h b/migration/multifd.h
index 4dda900a0b..7ec688fb4f 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -22,7 +22,7 @@  int multifd_load_cleanup(Error **errp);
 bool multifd_recv_all_channels_created(void);
 bool multifd_recv_new_channel(QIOChannel *ioc, Error **errp);
 void multifd_recv_sync_main(void);
-void multifd_send_sync_main(QEMUFile *f);
+int multifd_send_sync_main(QEMUFile *f);
 int multifd_queue_page(QEMUFile *f, RAMBlock *block, ram_addr_t offset);
 
 /* Multifd Compression flags */
@@ -96,6 +96,8 @@  typedef struct {
     uint32_t packet_len;
     /* pointer to the packet */
     MultiFDPacket_t *packet;
+    /* multifd flags for sending ram */
+    int write_flags;
     /* multifd flags for each packet */
     uint32_t flags;
     /* size of the next packet that contains pages */
diff --git a/migration/migration.c b/migration/migration.c
index 3e0a25bb5b..1450fd0370 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1479,7 +1479,16 @@  static bool migrate_params_check(MigrationParameters *params, Error **errp)
         error_prepend(errp, "Invalid mapping given for block-bitmap-mapping: ");
         return false;
     }
-
+#ifdef CONFIG_LINUX
+    if (params->zero_copy_send &&
+        (!migrate_use_multifd() ||
+         params->multifd_compression != MULTIFD_COMPRESSION_NONE ||
+         (params->tls_creds && *params->tls_creds))) {
+        error_setg(errp,
+                   "Zero copy only available for non-compressed non-TLS multifd migration");
+        return false;
+    }
+#endif
     return true;
 }
 
diff --git a/migration/multifd.c b/migration/multifd.c
index 43998ad117..2d68b9cf4f 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -568,19 +568,28 @@  void multifd_save_cleanup(void)
     multifd_send_state = NULL;
 }
 
-void multifd_send_sync_main(QEMUFile *f)
+int multifd_send_sync_main(QEMUFile *f)
 {
     int i;
+    bool flush_zero_copy;
 
     if (!migrate_use_multifd()) {
-        return;
+        return 0;
     }
     if (multifd_send_state->pages->num) {
         if (multifd_send_pages(f) < 0) {
             error_report("%s: multifd_send_pages fail", __func__);
-            return;
+            return 0;
         }
     }
+
+    /*
+     * When using zero-copy, it's necessary to flush after each iteration to
+     * make sure pages from earlier iterations don't end up replacing newer
+     * pages.
+     */
+    flush_zero_copy = migrate_use_zero_copy_send();
+
     for (i = 0; i < migrate_multifd_channels(); i++) {
         MultiFDSendParams *p = &multifd_send_state->params[i];
 
@@ -591,7 +600,7 @@  void multifd_send_sync_main(QEMUFile *f)
         if (p->quit) {
             error_report("%s: channel %d has already quit", __func__, i);
             qemu_mutex_unlock(&p->mutex);
-            return;
+            return 0;
         }
 
         p->packet_num = multifd_send_state->packet_num++;
@@ -602,6 +611,17 @@  void multifd_send_sync_main(QEMUFile *f)
         ram_counters.transferred += p->packet_len;
         qemu_mutex_unlock(&p->mutex);
         qemu_sem_post(&p->sem);
+
+        if (flush_zero_copy) {
+            int ret;
+            Error *err = NULL;
+
+            ret = qio_channel_flush(p->c, &err);
+            if (ret < 0) {
+                error_report_err(err);
+                return -1;
+            }
+        }
     }
     for (i = 0; i < migrate_multifd_channels(); i++) {
         MultiFDSendParams *p = &multifd_send_state->params[i];
@@ -610,6 +630,8 @@  void multifd_send_sync_main(QEMUFile *f)
         qemu_sem_wait(&p->sem_sync);
     }
     trace_multifd_send_sync_main(multifd_send_state->packet_num);
+
+    return 0;
 }
 
 static void *multifd_send_thread(void *opaque)
@@ -668,8 +690,8 @@  static void *multifd_send_thread(void *opaque)
             p->iov[0].iov_len = p->packet_len;
             p->iov[0].iov_base = p->packet;
 
-            ret = qio_channel_writev_all(p->c, p->iov, p->iovs_num,
-                                         &local_err);
+            ret = qio_channel_writev_full_all(p->c, p->iov, p->iovs_num, NULL,
+                                              0, p->write_flags, &local_err);
             if (ret != 0) {
                 break;
             }
@@ -910,6 +932,13 @@  int multifd_save_setup(Error **errp)
         /* We need one extra place for the packet header */
         p->iov = g_new0(struct iovec, page_count + 1);
         p->normal = g_new0(ram_addr_t, page_count);
+
+        if (migrate_use_zero_copy_send()) {
+            p->write_flags = QIO_CHANNEL_WRITE_FLAG_ZERO_COPY;
+        } else {
+            p->write_flags = 0;
+        }
+
         socket_send_channel_create(multifd_new_send_channel_async, p);
     }
 
diff --git a/migration/ram.c b/migration/ram.c
index 91ca743ac8..d8c215e43a 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -2902,6 +2902,7 @@  static int ram_save_setup(QEMUFile *f, void *opaque)
 {
     RAMState **rsp = opaque;
     RAMBlock *block;
+    int ret;
 
     if (compress_threads_save_setup()) {
         return -1;
@@ -2936,7 +2937,11 @@  static int ram_save_setup(QEMUFile *f, void *opaque)
     ram_control_before_iterate(f, RAM_CONTROL_SETUP);
     ram_control_after_iterate(f, RAM_CONTROL_SETUP);
 
-    multifd_send_sync_main(f);
+    ret =  multifd_send_sync_main(f);
+    if (ret < 0) {
+        return ret;
+    }
+
     qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
     qemu_fflush(f);
 
@@ -3045,7 +3050,11 @@  static int ram_save_iterate(QEMUFile *f, void *opaque)
 out:
     if (ret >= 0
         && migration_is_setup_or_active(migrate_get_current()->state)) {
-        multifd_send_sync_main(rs->f);
+        ret = multifd_send_sync_main(rs->f);
+        if (ret < 0) {
+            return ret;
+        }
+
         qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
         qemu_fflush(f);
         ram_transferred_add(8);
@@ -3105,13 +3114,19 @@  static int ram_save_complete(QEMUFile *f, void *opaque)
         ram_control_after_iterate(f, RAM_CONTROL_FINISH);
     }
 
-    if (ret >= 0) {
-        multifd_send_sync_main(rs->f);
-        qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
-        qemu_fflush(f);
+    if (ret < 0) {
+        return ret;
     }
 
-    return ret;
+    ret = multifd_send_sync_main(rs->f);
+    if (ret < 0) {
+        return ret;
+    }
+
+    qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
+    qemu_fflush(f);
+
+    return 0;
 }
 
 static void ram_save_pending(QEMUFile *f, void *opaque, uint64_t max_size,
diff --git a/migration/socket.c b/migration/socket.c
index f586f983b7..c49431425e 100644
--- a/migration/socket.c
+++ b/migration/socket.c
@@ -78,8 +78,9 @@  static void socket_outgoing_migration(QIOTask *task,
         trace_migration_socket_outgoing_connected(data->hostname);
     }
 
-    if (migrate_use_zero_copy_send()) {
-        error_setg(&err, "Zero copy send not available in migration");
+    if (migrate_use_zero_copy_send() &&
+        !qio_channel_has_feature(sioc, QIO_CHANNEL_FEATURE_WRITE_ZERO_COPY)) {
+        error_setg(&err, "Zero copy send feature not detected in host kernel");
     }
 
     migration_channel_connect(data->s, sioc, data->hostname, err);