diff mbox series

[1/2] migration/multifd: Further remove the SYNC on complete

Message ID 20241205185303.897010-2-peterx@redhat.com (mailing list archive)
State New
Series migration/multifd: Some VFIO preparations

Commit Message

Peter Xu Dec. 5, 2024, 6:53 p.m. UTC
Commit 637280aeb2 ("migration/multifd: Avoid the final FLUSH in
complete()") removed the FLUSH operation in complete(), which avoids one
global sync on the destination side, because it's not needed.

However, that commit overlooked the multifd_ram_flush_and_sync() part of
things, as that's always used together with the FLUSH message on the main
channel.

For very old binaries (multifd_flush_after_each_section==true), that's
still needed, because each EOS received on the destination will enforce an
all-channel sync once.

For new binaries (multifd_flush_after_each_section==false), the flush and
sync shouldn't be needed just like the FLUSH, with the same reasoning.

Currently, on new binaries we'll still send SYNC messages on the multifd
channels, even if the FLUSH is omitted at the end.  It'll make all recv
threads hang at the SYNC message.

Multifd still works fine because, luckily, the recv side cleanup code
(mostly multifd_recv_sync_main()) is smart enough to make sure that even
if the recv threads are stuck at a SYNC, they'll get kicked out.  And
since this is the completion phase of migration, nothing else will be sent
after the SYNCs.

This may be needed for VFIO in the future, because VFIO can have data to
push after ram_save_complete(), hence we don't want the recv threads to be
stuck in a SYNC message.  Removing this limitation will also make the src
slightly faster, by avoiding some code.

Stable branches do not need this patch, as there is no real bug I can
think of that will go wrong there; hence no Fixes tag is attached, to make
clear that a backport is not needed.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/ram.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

Comments

Fabiano Rosas Dec. 5, 2024, 7:55 p.m. UTC | #1
Peter Xu <peterx@redhat.com> writes:

> Commit 637280aeb2 ("migration/multifd: Avoid the final FLUSH in
> complete()") removed the FLUSH operation on complete() which should avoid
> one global sync on destination side, because it's not needed.
>
> However that commit overlooked multifd_ram_flush_and_sync() part of things,
> as that's always used together with the FLUSH message on the main channel.

Let's please write the full name of the flags, functions, etc. As we've
seen in the discussion with Prasad, we're stumbling over ambiguous
terminology. This is RAM_SAVE_FLAG_MULTIFD_FLUSH.

>
> For very old binaries (multifd_flush_after_each_section==true), that's
> still needed because each EOS received on destination will enforce
> all-channel sync once.

Ok, but why does this patch not reinstate the flag?

>
> For new binaries (multifd_flush_after_each_section==false), the flush and
> sync shouldn't be needed just like the FLUSH, with the same reasoning.
>
> Currently, on new binaries we'll still send SYNC messages on multifd
> channels, even if FLUSH is omitted at the end.  It'll make all recv threads
> hang at SYNC message.

I don't get this part, is this a bug you're describing? There's no SYNC
message on the recv side; I think you mean MULTIFD_FLAG_SYNC and
this code?

    if (flags & MULTIFD_FLAG_SYNC) {
        qemu_sem_post(&multifd_recv_state->sem_sync);
        qemu_sem_wait(&p->sem_sync);
    }

That's not a hang, that's the sync.

>
> Multifd is still all working fine because luckily recv side cleanup
> code (mostly multifd_recv_sync_main()) is smart enough to make sure even if
> recv threads are stuck at SYNC it'll get kicked out.  And since this is the
> completion phase of migration, nothing else will be sent after the SYNCs.

Hmm, a last sync on the recv side might indeed not be needed.

>
> This may be needed for VFIO in the future because VFIO can have data to
> push after ram_save_complete(), hence we don't want the recv thread to be
> stuck in SYNC message. Remove this limitation will make src even slightly
> faster too to avoid some more code.
>
> Stable branches do not need this patch, as no real bug I can think of that
> will go wrong there.. so not attaching Fixes to be clear on the backport
> not needed.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  migration/ram.c | 13 ++++++++++---
>  1 file changed, 10 insertions(+), 3 deletions(-)
>
> diff --git a/migration/ram.c b/migration/ram.c
> index 05ff9eb328..7284c34bd8 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -3283,9 +3283,16 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
>          }
>      }
>  
> -    ret = multifd_ram_flush_and_sync();
> -    if (ret < 0) {
> -        return ret;
> +    if (migrate_multifd() &&
> +        migrate_multifd_flush_after_each_section()) {
> +        /*
> +         * Only the old dest QEMU will need this sync, because each EOS
> +         * will require one SYNC message on each channel.
> +         */
> +        ret = multifd_ram_flush_and_sync();
> +        if (ret < 0) {
> +            return ret;
> +        }

I don't think this works. We need one last flush to not lose the last
few pages of ram. And we need the src side sync of multifd threads to
make sure this function doesn't return before all IO has been put on the
wire.

This also doesn't address what the commit message says about the recv
side never getting the RAM_SAVE_FLAG_MULTIFD_FLUSH message.

>      }
>  
>      if (migrate_mapped_ram()) {
Peter Xu Dec. 5, 2024, 8:23 p.m. UTC | #2
On Thu, Dec 05, 2024 at 04:55:08PM -0300, Fabiano Rosas wrote:
> Peter Xu <peterx@redhat.com> writes:
> 
> > Commit 637280aeb2 ("migration/multifd: Avoid the final FLUSH in
> > complete()") removed the FLUSH operation on complete() which should avoid
> > one global sync on destination side, because it's not needed.
> >
> > However that commit overlooked multifd_ram_flush_and_sync() part of things,
> > as that's always used together with the FLUSH message on the main channel.
> 
> Let's please write the full name of the flags, functions, etc. As we've
> seen in the discussion with Prasad, we're stumbling over ambiguous
> terminology. This is RAM_SAVE_FLAG_MULTIFD_FLUSH.

Sure.

> 
> >
> > For very old binaries (multifd_flush_after_each_section==true), that's
> > still needed because each EOS received on destination will enforce
> > all-channel sync once.
> 
> Ok, but why does this patch not reinstate the flag?

RAM_SAVE_FLAG_MULTIFD_FLUSH?  Because it's not needed?

> 
> >
> > For new binaries (multifd_flush_after_each_section==false), the flush and
> > sync shouldn't be needed just like the FLUSH, with the same reasoning.
> >
> > Currently, on new binaries we'll still send SYNC messages on multifd
> > channels, even if FLUSH is omitted at the end.  It'll make all recv threads
> > hang at SYNC message.
> 
> I don't get this part, is this a bug you're describing? There's not SYNC
> message on the recv side, I think you mean the MULTIFD_FLAG_SYNC and
> this code?
> 
>     if (flags & MULTIFD_FLAG_SYNC) {
>         qemu_sem_post(&multifd_recv_state->sem_sync);
>         qemu_sem_wait(&p->sem_sync);
>     }

Yes.

> 
> That's not a hang, that's the sync.

When the recv side never collects that SYNC (by invoking
multifd_recv_sync_main()), then it is a hang.

> 
> >
> > Multifd is still all working fine because luckily recv side cleanup
> > code (mostly multifd_recv_sync_main()) is smart enough to make sure even if
> > recv threads are stuck at SYNC it'll get kicked out.  And since this is the
> > completion phase of migration, nothing else will be sent after the SYNCs.
> 
> Hmm, a last sync on the recv side might indeed not be needed.
> 
> >
> > This may be needed for VFIO in the future because VFIO can have data to
> > push after ram_save_complete(), hence we don't want the recv thread to be
> > stuck in SYNC message. Remove this limitation will make src even slightly
> > faster too to avoid some more code.
> >
> > Stable branches do not need this patch, as no real bug I can think of that
> > will go wrong there.. so not attaching Fixes to be clear on the backport
> > not needed.
> >
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >  migration/ram.c | 13 ++++++++++---
> >  1 file changed, 10 insertions(+), 3 deletions(-)
> >
> > diff --git a/migration/ram.c b/migration/ram.c
> > index 05ff9eb328..7284c34bd8 100644
> > --- a/migration/ram.c
> > +++ b/migration/ram.c
> > @@ -3283,9 +3283,16 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
> >          }
> >      }
> >  
> > -    ret = multifd_ram_flush_and_sync();
> > -    if (ret < 0) {
> > -        return ret;
> > +    if (migrate_multifd() &&
> > +        migrate_multifd_flush_after_each_section()) {
> > +        /*
> > +         * Only the old dest QEMU will need this sync, because each EOS
> > +         * will require one SYNC message on each channel.
> > +         */
> > +        ret = multifd_ram_flush_and_sync();
> > +        if (ret < 0) {
> > +            return ret;
> > +        }
> 
> I don't think this works. We need one last flush to not lose the last
> few pages of ram. And we need the src side sync of multifd threads to
> make sure this function doesn't return before all IO has been put on the
> wire.

This should be a question for commit 637280aeb2; I thought it got
answered there.  It's almost exactly what the commit message of
637280aeb2 wanted to justify too.

We don't need to flush the last pages here, because we flushed them
already, in the last find_dirty_block() call where the src QEMU finished
scanning the last round.  Then we set complete_round=true, scan one more
round, and the iteration won't complete until the new round sees zero
dirty pages.

So when reaching this line, the multifd send buffer must be empty.  We
need to flush for each round of the RAM scan, we can't avoid the flush
there, but we can save this one in complete().

To explain that with code, I hacked QEMU to assert that.  It all ran
fine here (this is definitely not for merge, but it shows what I meant):

===8<===
diff --git a/migration/multifd-nocomp.c b/migration/multifd-nocomp.c
index f64c4c9abd..0bd42c7627 100644
--- a/migration/multifd-nocomp.c
+++ b/migration/multifd-nocomp.c
@@ -21,7 +21,7 @@
 #include "qemu/error-report.h"
 #include "trace.h"

-static MultiFDSendData *multifd_ram_send;
+MultiFDSendData *multifd_ram_send;

 size_t multifd_ram_payload_size(void)
 {
diff --git a/migration/ram.c b/migration/ram.c
index 7284c34bd8..edeb9e28ff 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -3228,6 +3228,8 @@ out:
     return done;
 }

+extern MultiFDSendData *multifd_ram_send;
+
 /**
  * ram_save_complete: function called to send the remaining amount of ram
  *
@@ -3283,6 +3285,10 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
         }
     }

+    if (migrate_multifd()) {
+        assert(multifd_payload_empty(multifd_ram_send));
+    }
+
     if (migrate_multifd() &&
         migrate_multifd_flush_after_each_section()) {
         /*
===8<===

> 
> This also doesn't address what the commit message says about the recv
> side never getting the RAM_SAVE_FLAG_MULTIFD_FLUSH message.

New binaries now never send RAM_SAVE_FLAG_MULTIFD_FLUSH in complete(),
however they always send the multifd SYNC messages on the wire.  It means
those SYNC messages will always arrive on the dest multifd channels, and
then all these channels will wait for the main thread to collect them.
However, since RAM_SAVE_FLAG_MULTIFD_FLUSH is not there, they'll hang
until multifd recv cleanup.
Fabiano Rosas Dec. 5, 2024, 9:16 p.m. UTC | #3
Peter Xu <peterx@redhat.com> writes:

> On Thu, Dec 05, 2024 at 04:55:08PM -0300, Fabiano Rosas wrote:
>> Peter Xu <peterx@redhat.com> writes:
>> 
>> > Commit 637280aeb2 ("migration/multifd: Avoid the final FLUSH in
>> > complete()") removed the FLUSH operation on complete() which should avoid
>> > one global sync on destination side, because it's not needed.
>> >
>> > However that commit overlooked multifd_ram_flush_and_sync() part of things,
>> > as that's always used together with the FLUSH message on the main channel.
>> 
>> Let's please write the full name of the flags, functions, etc. As we've
>> seen in the discussion with Prasad, we're stumbling over ambiguous
>> terminology. This is RAM_SAVE_FLAG_MULTIFD_FLUSH.
>
> Sure.
>
>> 
>> >
>> > For very old binaries (multifd_flush_after_each_section==true), that's
>> > still needed because each EOS received on destination will enforce
>> > all-channel sync once.
>> 
>> Ok, but why does this patch not reinstate the flag?
>
> RAM_SAVE_FLAG_MULTIFD_FLUSH?  Because it's not needed?
>

Ah, you're saying we need the source side to send the MULTIFD_FLAG_SYNC
packet so that the old QEMU on the recv side gets out of waiting. I
see.

>> 
>> >
>> > For new binaries (multifd_flush_after_each_section==false), the flush and
>> > sync shouldn't be needed just like the FLUSH, with the same reasoning.
>> >
>> > Currently, on new binaries we'll still send SYNC messages on multifd
>> > channels, even if FLUSH is omitted at the end.  It'll make all recv threads
>> > hang at SYNC message.
>> 
>> I don't get this part, is this a bug you're describing? There's not SYNC
>> message on the recv side, I think you mean the MULTIFD_FLAG_SYNC and
>> this code?
>> 
>>     if (flags & MULTIFD_FLAG_SYNC) {
>>         qemu_sem_post(&multifd_recv_state->sem_sync);
>>         qemu_sem_wait(&p->sem_sync);
>>     }
>
> Yes.
>
>> 
>> That's not a hang, that's the sync.
>
> When recv side never collect that SYNC (by invoke multifd_recv_sync_main),
> then it is a hang.
>

Right, I forgot the sync on the recv side is the other way around.  It's
the multifd_recv_state that does the sync between the threads.  The
sem_sync is there so that the channels don't exit.

>> 
>> >
>> > Multifd is still all working fine because luckily recv side cleanup
>> > code (mostly multifd_recv_sync_main()) is smart enough to make sure even if
>> > recv threads are stuck at SYNC it'll get kicked out.  And since this is the
>> > completion phase of migration, nothing else will be sent after the SYNCs.
>> 
>> Hmm, a last sync on the recv side might indeed not be needed.
>> 
>> >
>> > This may be needed for VFIO in the future because VFIO can have data to
>> > push after ram_save_complete(), hence we don't want the recv thread to be
>> > stuck in SYNC message. Remove this limitation will make src even slightly
>> > faster too to avoid some more code.
>> >
>> > Stable branches do not need this patch, as no real bug I can think of that
>> > will go wrong there.. so not attaching Fixes to be clear on the backport
>> > not needed.
>> >
>> > Signed-off-by: Peter Xu <peterx@redhat.com>
>> > ---
>> >  migration/ram.c | 13 ++++++++++---
>> >  1 file changed, 10 insertions(+), 3 deletions(-)
>> >
>> > diff --git a/migration/ram.c b/migration/ram.c
>> > index 05ff9eb328..7284c34bd8 100644
>> > --- a/migration/ram.c
>> > +++ b/migration/ram.c
>> > @@ -3283,9 +3283,16 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
>> >          }
>> >      }
>> >  
>> > -    ret = multifd_ram_flush_and_sync();
>> > -    if (ret < 0) {
>> > -        return ret;
>> > +    if (migrate_multifd() &&
>> > +        migrate_multifd_flush_after_each_section()) {
>> > +        /*
>> > +         * Only the old dest QEMU will need this sync, because each EOS
>> > +         * will require one SYNC message on each channel.
>> > +         */
>> > +        ret = multifd_ram_flush_and_sync();
>> > +        if (ret < 0) {
>> > +            return ret;
>> > +        }
>> 
>> I don't think this works. We need one last flush to not lose the last
>> few pages of ram. And we need the src side sync of multifd threads to
>> make sure this function doesn't return before all IO has been put on the
>> wire.
>
> This should be the question for commit 637280aeb2, I thought it got
> answered there.. It's almost what the commit message there in 637280aeb2
> wanted to justify too.

Yeah, it missed the mark.

>
> We don't need to flush the last pages here, because we flushed it already,
> in the last find_dirty_block() call where src QEMU finished scanning the
> last round.  Then we set complete_round=true, scan one more round, and the
> iteration won't complete until the new round sees zero dirty page.

Ok, let's put this in the commit message.  As it stands, it looks like
this patch is fixing a bug in 637280aeb2, when instead it's doing further
optimization, but it so happens that we still require the backward
compatibility part.

>
> So when reaching this line, multifd send buffer must be empty.  We need to
> flush for each round of RAM scan, we can't avoid the flush there, but we
> can save this one in complete().
>
> To explain that with code, I hacked a QEMU and assert that.  It ran all
> fine here (this is definitely not for merge.. but to show what I meant):
>
> ===8<===
> diff --git a/migration/multifd-nocomp.c b/migration/multifd-nocomp.c
> index f64c4c9abd..0bd42c7627 100644
> --- a/migration/multifd-nocomp.c
> +++ b/migration/multifd-nocomp.c
> @@ -21,7 +21,7 @@
>  #include "qemu/error-report.h"
>  #include "trace.h"
>
> -static MultiFDSendData *multifd_ram_send;
> +MultiFDSendData *multifd_ram_send;
>
>  size_t multifd_ram_payload_size(void)
>  {
> diff --git a/migration/ram.c b/migration/ram.c
> index 7284c34bd8..edeb9e28ff 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -3228,6 +3228,8 @@ out:
>      return done;
>  }
>
> +extern MultiFDSendData *multifd_ram_send;
> +
>  /**
>   * ram_save_complete: function called to send the remaining amount of ram
>   *
> @@ -3283,6 +3285,10 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
>          }
>      }
>
> +    if (migrate_multifd()) {
> +        assert(multifd_payload_empty(multifd_ram_send));
> +    }
> +
>      if (migrate_multifd() &&
>          migrate_multifd_flush_after_each_section()) {
>          /*
> ===8<===
>
>> 
>> This also doesn't address what the commit message says about the recv
>> side never getting the RAM_SAVE_FLAG_MULTIFD_FLUSH message.
>
> The new binaries now always not send RAM_SAVE_FLAG_MULTIFD_FLUSH when
> complete(), however it always sends the multifd SYNC messages on the wire.
> It means those SYNC messages will always arrive on dest multifd channels,
> and then all these channels will wait for the main thread to collect that.
> However since RAM_SAVE_FLAG_MULTIFD_FLUSH is not there, they'll hang until
> multifd recv cleanups.
Peter Xu Dec. 5, 2024, 10 p.m. UTC | #4
On Thu, Dec 05, 2024 at 06:16:05PM -0300, Fabiano Rosas wrote:
> > We don't need to flush the last pages here, because we flushed it already,
> > in the last find_dirty_block() call where src QEMU finished scanning the
> > last round.  Then we set complete_round=true, scan one more round, and the
> > iteration won't complete until the new round sees zero dirty page.
> 
> Ok, let's put this in the commit message. As it stands it looks like
> this patch is fixing a bug with 637280aeb2 when instead it's doing
> further optimization, but so it happens that we still require the
> backward compatibility part.

Yes, and as the commit message said, I didn't attach Fixes and plan not
to backport to stable, because there's no real bug that we can hit: those
SYNCs will only ever be present at the end of migration, so they cannot
harm anything as long as RAM is the only multifd user.

I can add some of the above into the commit message.  Since I already
started looking at the patch you posted on putting all the sync conditions
together, I decided to repost this series with that, and with the rename
you suggested on local/remote.  Though I can't name them LOCAL and REMOTE,
because REMOTE would contain LOCAL too.  So in reality I renamed them to
LOCAL and ALL; a comment will explain that ALL contains both the LOCAL and
REMOTE flushes.

Patch

diff --git a/migration/ram.c b/migration/ram.c
index 05ff9eb328..7284c34bd8 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -3283,9 +3283,16 @@  static int ram_save_complete(QEMUFile *f, void *opaque)
         }
     }
 
-    ret = multifd_ram_flush_and_sync();
-    if (ret < 0) {
-        return ret;
+    if (migrate_multifd() &&
+        migrate_multifd_flush_after_each_section()) {
+        /*
+         * Only the old dest QEMU will need this sync, because each EOS
+         * will require one SYNC message on each channel.
+         */
+        ret = multifd_ram_flush_and_sync();
+        if (ret < 0) {
+            return ret;
+        }
     }
 
     if (migrate_mapped_ram()) {