
[v4,22/34] migration/multifd: Prepare multifd sync for fixed-ram migration

Message ID 20240220224138.24759-23-farosas@suse.de
State New, archived
Series migration: File based migration with multifd and fixed-ram

Commit Message

Fabiano Rosas Feb. 20, 2024, 10:41 p.m. UTC
The fixed-ram migration can be performed live or non-live, but it is
always asynchronous, i.e. the source machine and the destination
machine are not migrating at the same time. We only need some pieces
of the multifd sync operations.

multifd_send_sync_main()
------------------------
  Issued by the ram migration code on the migration thread, causes the
  multifd send channels to synchronize with the migration thread and
  makes the sending side emit a packet with the MULTIFD_FLUSH flag.

  With fixed-ram we want to maintain the sync on the sending side
  because that provides ordering between the rounds of dirty pages when
  migrating live.

MULTIFD_FLUSH
-------------
  On the receiving side, the presence of the MULTIFD_FLUSH flag on a
  packet causes the receiving channels to start synchronizing with the
  main thread.

  We're not using packets with fixed-ram, so there's no MULTIFD_FLUSH
  flag and therefore no channel sync on the receiving side.

multifd_recv_sync_main()
------------------------
  Issued by the migration thread when the ram migration flag
  RAM_SAVE_FLAG_MULTIFD_FLUSH is received, causes the migration thread
  on the receiving side to start synchronizing with the recv
  channels. For backward compatibility, this is also issued when
  RAM_SAVE_FLAG_EOS is received.

  For fixed-ram we only need to synchronize the channels at the end of
  migration to avoid doing cleanup before the channels have finished
  their IO.

Make sure the multifd syncs are only issued at the appropriate
times. Note that due to pre-existing backward compatibility issues, we
have the multifd_flush_after_each_section property that enables an
older behavior of synchronizing channels more frequently (and
inefficiently). Fixed-ram should always run with that property
disabled (default).

Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
 migration/ram.c | 19 ++++++++++++++++---
 1 file changed, 16 insertions(+), 3 deletions(-)

Comments

Peter Xu Feb. 26, 2024, 7:47 a.m. UTC | #1
On Tue, Feb 20, 2024 at 07:41:26PM -0300, Fabiano Rosas wrote:
> The fixed-ram migration can be performed live or non-live, but it is
> always asynchronous, i.e. the source machine and the destination
> machine are not migrating at the same time. We only need some pieces
> of the multifd sync operations.
> 
> multifd_send_sync_main()
> ------------------------
>   Issued by the ram migration code on the migration thread, causes the
>   multifd send channels to synchronize with the migration thread and
>   makes the sending side emit a packet with the MULTIFD_FLUSH flag.
> 
>   With fixed-ram we want to maintain the sync on the sending side
>   because that provides ordering between the rounds of dirty pages when
>   migrating live.
> 
> MULTIFD_FLUSH
> -------------
>   On the receiving side, the presence of the MULTIFD_FLUSH flag on a
>   packet causes the receiving channels to start synchronizing with the
>   main thread.
> 
>   We're not using packets with fixed-ram, so there's no MULTIFD_FLUSH
>   flag and therefore no channel sync on the receiving side.
> 
> multifd_recv_sync_main()
> ------------------------
>   Issued by the migration thread when the ram migration flag
>   RAM_SAVE_FLAG_MULTIFD_FLUSH is received, causes the migration thread
>   on the receiving side to start synchronizing with the recv
>   channels. Due to compatibility, this is also issued when
>   RAM_SAVE_FLAG_EOS is received.
> 
>   For fixed-ram we only need to synchronize the channels at the end of
>   migration to avoid doing cleanup before the channels have finished
>   their IO.
> 
> Make sure the multifd syncs are only issued at the appropriate
> times. Note that due to pre-existing backward compatibility issues, we
> have the multifd_flush_after_each_section property that enables an
> older behavior of synchronizing channels more frequently (and
> inefficiently). Fixed-ram should always run with that property
> disabled (default).

What if the user enables multifd_flush_after_each_section=true?

IMHO we don't necessarily need to attach the fixed-ram loading flush to any
flag in the stream.  For fixed-ram IIUC all the loads will happen in one
shot of ram_load() anyway when parsing the ramblock list, so.. how about we
decouple the fixed-ram load flush from the stream by always doing a sync in
ram_load() unconditionally?

@@ -4368,6 +4367,15 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id)
             ret = ram_load_precopy(f);
         }
     }
+
+    /*
+     * Fixed-ram migration may queue load tasks to multifd threads; make
+     * sure they're all done.
+     */
+    if (migrate_fixed_ram() && migrate_multifd()) {
+        multifd_recv_sync_main();
+    }
+
     trace_ram_load_complete(ret, seq_iter);
 
     return ret;

Then ram_load() always guarantees synchronous loading of pages, and
fixed-ram will completely ignore multifd flushes (then we also skip it for
the ram_save_complete() like what this patch does for the rest).

> 
> Signed-off-by: Fabiano Rosas <farosas@suse.de>
> ---
>  migration/ram.c | 19 ++++++++++++++++---
>  1 file changed, 16 insertions(+), 3 deletions(-)
> 
> diff --git a/migration/ram.c b/migration/ram.c
> index 5932e1b8e1..c7050f6f68 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -1369,8 +1369,11 @@ static int find_dirty_block(RAMState *rs, PageSearchStatus *pss)
>                  if (ret < 0) {
>                      return ret;
>                  }
> -                qemu_put_be64(f, RAM_SAVE_FLAG_MULTIFD_FLUSH);
> -                qemu_fflush(f);
> +
> +                if (!migrate_fixed_ram()) {
> +                    qemu_put_be64(f, RAM_SAVE_FLAG_MULTIFD_FLUSH);
> +                    qemu_fflush(f);
> +                }
>              }
>              /*
>               * If memory migration starts over, we will meet a dirtied page
> @@ -3112,7 +3115,8 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
>          return ret;
>      }
>  
> -    if (migrate_multifd() && !migrate_multifd_flush_after_each_section()) {
> +    if (migrate_multifd() && !migrate_multifd_flush_after_each_section()
> +        && !migrate_fixed_ram()) {
>          qemu_put_be64(f, RAM_SAVE_FLAG_MULTIFD_FLUSH);
>      }
>  
> @@ -4253,6 +4257,15 @@ static int ram_load_precopy(QEMUFile *f)
>              break;
>          case RAM_SAVE_FLAG_EOS:
>              /* normal exit */
> +            if (migrate_fixed_ram()) {
> +                /*
> +                 * The EOS flag appears multiple times on the
> +                 * stream. Fixed-ram needs only one sync at the
> +                 * end. It will be done on the flush flag above.
> +                 */
> +                break;
> +            }
> +
>              if (migrate_multifd() &&
>                  migrate_multifd_flush_after_each_section()) {
>                  multifd_recv_sync_main();
> -- 
> 2.35.3
>
Fabiano Rosas Feb. 26, 2024, 10:52 p.m. UTC | #2
Peter Xu <peterx@redhat.com> writes:

> On Tue, Feb 20, 2024 at 07:41:26PM -0300, Fabiano Rosas wrote:
>> The fixed-ram migration can be performed live or non-live, but it is
>> always asynchronous, i.e. the source machine and the destination
>> machine are not migrating at the same time. We only need some pieces
>> of the multifd sync operations.
>> 
>> multifd_send_sync_main()
>> ------------------------
>>   Issued by the ram migration code on the migration thread, causes the
>>   multifd send channels to synchronize with the migration thread and
>>   makes the sending side emit a packet with the MULTIFD_FLUSH flag.
>> 
>>   With fixed-ram we want to maintain the sync on the sending side
>>   because that provides ordering between the rounds of dirty pages when
>>   migrating live.
>> 
>> MULTIFD_FLUSH
>> -------------
>>   On the receiving side, the presence of the MULTIFD_FLUSH flag on a
>>   packet causes the receiving channels to start synchronizing with the
>>   main thread.
>> 
>>   We're not using packets with fixed-ram, so there's no MULTIFD_FLUSH
>>   flag and therefore no channel sync on the receiving side.
>> 
>> multifd_recv_sync_main()
>> ------------------------
>>   Issued by the migration thread when the ram migration flag
>>   RAM_SAVE_FLAG_MULTIFD_FLUSH is received, causes the migration thread
>>   on the receiving side to start synchronizing with the recv
>>   channels. Due to compatibility, this is also issued when
>>   RAM_SAVE_FLAG_EOS is received.
>> 
>>   For fixed-ram we only need to synchronize the channels at the end of
>>   migration to avoid doing cleanup before the channels have finished
>>   their IO.
>> 
>> Make sure the multifd syncs are only issued at the appropriate
>> times. Note that due to pre-existing backward compatibility issues, we
>> have the multifd_flush_after_each_section property that enables an
>> older behavior of synchronizing channels more frequently (and
>> inefficiently). Fixed-ram should always run with that property
>> disabled (default).
>
> What if the user enables multifd_flush_after_each_section=true?
>
> IMHO we don't necessarily need to attach the fixed-ram loading flush to any
> flag in the stream.  For fixed-ram IIUC all the loads will happen in one
> shot of ram_load() anyway when parsing the ramblock list, so.. how about we
> decouple the fixed-ram load flush from the stream by always do a sync in
> ram_load() unconditionally?

I would like to. But it's not possible because ram_load() is called once
per section. So once for each EOS flag on the stream. We'll have at
least two calls to ram_load(), once due to qemu_savevm_state_iterate()
and another due to qemu_savevm_state_complete_precopy().

The fact that fixed-ram can use just one load doesn't change the fact
that we perform more than one "save". So we'll need to use the FLUSH
flag in this case unfortunately.

>
> @@ -4368,6 +4367,15 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id)
>              ret = ram_load_precopy(f);
>          }
>      }
> +
> +    /*
> +     * Fixed-ram migration may queue load tasks to multifd threads; make
> +     * sure they're all done.
> +     */
> +    if (migrate_fixed_ram() && migrate_multifd()) {
> +        multifd_recv_sync_main();
> +    }
> +
>      trace_ram_load_complete(ret, seq_iter);
>  
>      return ret;
>
> Then ram_load() always guarantees synchronous loading of pages, and
> fixed-ram will completely ignore multifd flushes (then we also skip it for
> the ram_save_complete() like what this patch does for the rest).
>
>> 
>> Signed-off-by: Fabiano Rosas <farosas@suse.de>
>> ---
>>  migration/ram.c | 19 ++++++++++++++++---
>>  1 file changed, 16 insertions(+), 3 deletions(-)
>> 
>> diff --git a/migration/ram.c b/migration/ram.c
>> index 5932e1b8e1..c7050f6f68 100644
>> --- a/migration/ram.c
>> +++ b/migration/ram.c
>> @@ -1369,8 +1369,11 @@ static int find_dirty_block(RAMState *rs, PageSearchStatus *pss)
>>                  if (ret < 0) {
>>                      return ret;
>>                  }
>> -                qemu_put_be64(f, RAM_SAVE_FLAG_MULTIFD_FLUSH);
>> -                qemu_fflush(f);
>> +
>> +                if (!migrate_fixed_ram()) {
>> +                    qemu_put_be64(f, RAM_SAVE_FLAG_MULTIFD_FLUSH);
>> +                    qemu_fflush(f);
>> +                }
>>              }
>>              /*
>>               * If memory migration starts over, we will meet a dirtied page
>> @@ -3112,7 +3115,8 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
>>          return ret;
>>      }
>>  
>> -    if (migrate_multifd() && !migrate_multifd_flush_after_each_section()) {
>> +    if (migrate_multifd() && !migrate_multifd_flush_after_each_section()
>> +        && !migrate_fixed_ram()) {
>>          qemu_put_be64(f, RAM_SAVE_FLAG_MULTIFD_FLUSH);
>>      }
>>  
>> @@ -4253,6 +4257,15 @@ static int ram_load_precopy(QEMUFile *f)
>>              break;
>>          case RAM_SAVE_FLAG_EOS:
>>              /* normal exit */
>> +            if (migrate_fixed_ram()) {
>> +                /*
>> +                 * The EOS flag appears multiple times on the
>> +                 * stream. Fixed-ram needs only one sync at the
>> +                 * end. It will be done on the flush flag above.
>> +                 */
>> +                break;
>> +            }
>> +
>>              if (migrate_multifd() &&
>>                  migrate_multifd_flush_after_each_section()) {
>>                  multifd_recv_sync_main();
>> -- 
>> 2.35.3
>>
Peter Xu Feb. 27, 2024, 3:52 a.m. UTC | #3
On Mon, Feb 26, 2024 at 07:52:20PM -0300, Fabiano Rosas wrote:
> Peter Xu <peterx@redhat.com> writes:
> 
> > On Tue, Feb 20, 2024 at 07:41:26PM -0300, Fabiano Rosas wrote:
> >> The fixed-ram migration can be performed live or non-live, but it is
> >> always asynchronous, i.e. the source machine and the destination
> >> machine are not migrating at the same time. We only need some pieces
> >> of the multifd sync operations.
> >> 
> >> multifd_send_sync_main()
> >> ------------------------
> >>   Issued by the ram migration code on the migration thread, causes the
> >>   multifd send channels to synchronize with the migration thread and
> >>   makes the sending side emit a packet with the MULTIFD_FLUSH flag.
> >> 
> >>   With fixed-ram we want to maintain the sync on the sending side
> >>   because that provides ordering between the rounds of dirty pages when
> >>   migrating live.
> >> 
> >> MULTIFD_FLUSH
> >> -------------
> >>   On the receiving side, the presence of the MULTIFD_FLUSH flag on a
> >>   packet causes the receiving channels to start synchronizing with the
> >>   main thread.
> >> 
> >>   We're not using packets with fixed-ram, so there's no MULTIFD_FLUSH
> >>   flag and therefore no channel sync on the receiving side.
> >> 
> >> multifd_recv_sync_main()
> >> ------------------------
> >>   Issued by the migration thread when the ram migration flag
> >>   RAM_SAVE_FLAG_MULTIFD_FLUSH is received, causes the migration thread
> >>   on the receiving side to start synchronizing with the recv
> >>   channels. Due to compatibility, this is also issued when
> >>   RAM_SAVE_FLAG_EOS is received.
> >> 
> >>   For fixed-ram we only need to synchronize the channels at the end of
> >>   migration to avoid doing cleanup before the channels have finished
> >>   their IO.
> >> 
> >> Make sure the multifd syncs are only issued at the appropriate
> >> times. Note that due to pre-existing backward compatibility issues, we
> >> have the multifd_flush_after_each_section property that enables an
> >> older behavior of synchronizing channels more frequently (and
> >> inefficiently). Fixed-ram should always run with that property
> >> disabled (default).
> >
> > What if the user enables multifd_flush_after_each_section=true?
> >
> > IMHO we don't necessarily need to attach the fixed-ram loading flush to any
> > flag in the stream.  For fixed-ram IIUC all the loads will happen in one
> > shot of ram_load() anyway when parsing the ramblock list, so.. how about we
> > decouple the fixed-ram load flush from the stream by always do a sync in
> > ram_load() unconditionally?
> 
> I would like to. But it's not possible because ram_load() is called once
> per section. So once for each EOS flag on the stream. We'll have at
> least two calls to ram_load(), once due to qemu_savevm_state_iterate()
> and another due to qemu_savevm_state_complete_precopy().
> 
> The fact that fixed-ram can use just one load doesn't change the fact
> that we perform more than one "save". So we'll need to use the FLUSH
> flag in this case unfortunately.

After I re-read it, I found one more issue.

Now the recv side sync is "once and for all" - it doesn't allow a second
sync_main call, because it only syncs when the threads quit.  That IMHO
makes the code much harder to maintain, and we'll need a rich comment to
explain why that happens.

Ideally any "sync main" for recv threads can be called multiple times.  And
IMHO it's not really hard.  Actually it can make the code much cleaner by
merging some logic between socket-based and file-based from that regard.

I tried to play with your branch and propose something like this, just to
show what I meant. This should allow all the new fixed-ram tests to pass here,
meanwhile it should allow sync main on recv side to be re-entrant, sharing
the logic with socket-based as much as possible:

=====
diff --git a/migration/multifd.c b/migration/multifd.c
index a0202b5661..28480f6cfe 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -86,10 +86,8 @@ struct {
     /* number of created threads */
     int count;
     /*
-     * For sockets: this is posted once for each MULTIFD_FLAG_SYNC flag.
-     *
-     * For files: this is only posted at the end of the file load to mark
-     *            completion of the load process.
+     * This is always posted by the recv threads, the main thread uses it
+     * to wait for recv threads to finish assigned tasks.
      */
     QemuSemaphore sem_sync;
     /* global number of generated multifd packets */
@@ -1316,38 +1314,55 @@ void multifd_recv_cleanup(void)
     multifd_recv_cleanup_state();
 }
 
-
-/*
- * Wait until all channels have finished receiving data. Once this
- * function returns, cleanup routines are safe to run.
- */
-static void multifd_file_recv_sync(void)
+static void multifd_recv_file_sync_request(void)
 {
     int i;
 
     for (i = 0; i < migrate_multifd_channels(); i++) {
         MultiFDRecvParams *p = &multifd_recv_state->params[i];
 
-        trace_multifd_recv_sync_main_wait(p->id);
-
+        /*
+         * We play a trick here: instead of using a separate pending_sync
+         * to send a sync request (like what we do on senders), we simply
+         * kick the recv thread once without setting pending_job.
+         *
+         * If there's already a pending_job, the thread will only see it
+         * after it processed the current.  If there's no pending_job,
+         * it'll see this immediately.
+         */
         qemu_sem_post(&p->sem);
-
         trace_multifd_recv_sync_main_signal(p->id);
-        qemu_sem_wait(&p->sem_sync);
     }
-    return;
 }
 
+/*
+ * Request a sync for all the multifd recv threads.
+ *
+ * For socket-based, sync request is much more complicated, which relies on
+ * collaborations between both explicit RAM_SAVE_FLAG_MULTIFD_FLUSH in the
+ * main stream, and MULTIFD_FLAG_SYNC flag in per-channel protocol.  Here
+ * it should be invoked by the main stream request.
+ *
+ * For file-based, it is much simpler, because there's no need for a strong
+ * sync semantics between the main thread and the recv threads.  What we
+ * need is only to make sure all recv threads finished their tasks.
+ */
 void multifd_recv_sync_main(void)
 {
+    bool file_based = !multifd_use_packets();
     int i;
 
     if (!migrate_multifd()) {
         return;
     }
 
-    if (!multifd_use_packets()) {
-        return multifd_file_recv_sync();
+    if (file_based) {
+        /*
+         * File-based multifd requires an explicit sync request because
+         * tasks are assigned by the main recv thread, rather than parsed
+         * through the multifd channels.
+         */
+        multifd_recv_file_sync_request();
     }
 
     for (i = 0; i < migrate_multifd_channels(); i++) {
@@ -1356,6 +1371,11 @@ void multifd_recv_sync_main(void)
         trace_multifd_recv_sync_main_wait(p->id);
         qemu_sem_wait(&multifd_recv_state->sem_sync);
     }
+
+    if (file_based) {
+        return;
+    }
+
     for (i = 0; i < migrate_multifd_channels(); i++) {
         MultiFDRecvParams *p = &multifd_recv_state->params[i];
 
@@ -1420,11 +1440,12 @@ static void *multifd_recv_thread(void *opaque)
             }
 
             /*
-             * Migration thread did not send work, break and signal
-             * sem_sync so it knows we're not lagging behind.
+             * Migration thread did not send work, this emulates
+             * pending_sync, post sem_sync to notify the main thread.
              */
             if (!qatomic_read(&p->pending_job)) {
-                break;
+                qemu_sem_post(&multifd_recv_state->sem_sync);
+                continue;
             }
 
             has_data = !!p->data->size;
@@ -1449,10 +1470,6 @@ static void *multifd_recv_thread(void *opaque)
         }
     }
 
-    if (!use_packets) {
-        qemu_sem_post(&p->sem_sync);
-    }
-
     if (local_err) {
         multifd_recv_terminate_threads(local_err);
         error_free(local_err);

==========

Note that I used multifd_recv_state->sem_sync to send the message rather
than p->sem, not only because the socket-based path has similar logic
using that sem, but also because the main thread shouldn't care about
"which" recv thread has finished, only that "all recv threads are idle".

Do you think this should work out for us in a nicer way?

Then we talk about the other issue, on whether we should rely on migration
stream to flush recv threads.  My answer is still hopefully a no.

In the ideal case, the fixed-ram image format should even be tailored to
not use a live stream protocol.  For example, during ram iterations we
currently flush quite a lot of QEMU_VM_SECTION_PART sections that contain
mostly rubbish, each ending with RAM_SAVE_FLAG_EOS, and we keep doing this
in the iteration loop.  The real meat is that while processing
QEMU_VM_SECTION_PART, the src QEMU updates the guest pages at fixed
offsets in the file.  That however doesn't really contribute anything
valuable to the migration stream itself (things sent over to_dst_file).

AFAIU we chose to still use that logic only for simplicity, even though we
know those EOSs and all the RAM stream flags are garbage.  Now we tend to
add a dependency on part of that garbage, RAM_SAVE_FLAG_MULTIFD_FLUSH in
this case; it is useful for socket-based migration but shouldn't be
necessary for file.

I think I have a solution besides ram_load(): ultimately fixed-ram stores
all guest mem in the ram QEMU_VM_SECTION_START section, via
RAM_SAVE_FLAG_MEM_SIZE (which leads to parse_ramblocks()).  If so,
perhaps we can do a one-shot sync for file at the end of parse_ramblocks()?
Then we completely decouple the recv-side sync_main for file-based
migration from all stream flags.
Fabiano Rosas Feb. 27, 2024, 2 p.m. UTC | #4
Peter Xu <peterx@redhat.com> writes:

> On Mon, Feb 26, 2024 at 07:52:20PM -0300, Fabiano Rosas wrote:
>> Peter Xu <peterx@redhat.com> writes:
>> 
>> > On Tue, Feb 20, 2024 at 07:41:26PM -0300, Fabiano Rosas wrote:
>> >> The fixed-ram migration can be performed live or non-live, but it is
>> >> always asynchronous, i.e. the source machine and the destination
>> >> machine are not migrating at the same time. We only need some pieces
>> >> of the multifd sync operations.
>> >> 
>> >> multifd_send_sync_main()
>> >> ------------------------
>> >>   Issued by the ram migration code on the migration thread, causes the
>> >>   multifd send channels to synchronize with the migration thread and
>> >>   makes the sending side emit a packet with the MULTIFD_FLUSH flag.
>> >> 
>> >>   With fixed-ram we want to maintain the sync on the sending side
>> >>   because that provides ordering between the rounds of dirty pages when
>> >>   migrating live.
>> >> 
>> >> MULTIFD_FLUSH
>> >> -------------
>> >>   On the receiving side, the presence of the MULTIFD_FLUSH flag on a
>> >>   packet causes the receiving channels to start synchronizing with the
>> >>   main thread.
>> >> 
>> >>   We're not using packets with fixed-ram, so there's no MULTIFD_FLUSH
>> >>   flag and therefore no channel sync on the receiving side.
>> >> 
>> >> multifd_recv_sync_main()
>> >> ------------------------
>> >>   Issued by the migration thread when the ram migration flag
>> >>   RAM_SAVE_FLAG_MULTIFD_FLUSH is received, causes the migration thread
>> >>   on the receiving side to start synchronizing with the recv
>> >>   channels. Due to compatibility, this is also issued when
>> >>   RAM_SAVE_FLAG_EOS is received.
>> >> 
>> >>   For fixed-ram we only need to synchronize the channels at the end of
>> >>   migration to avoid doing cleanup before the channels have finished
>> >>   their IO.
>> >> 
>> >> Make sure the multifd syncs are only issued at the appropriate
>> >> times. Note that due to pre-existing backward compatibility issues, we
>> >> have the multifd_flush_after_each_section property that enables an
>> >> older behavior of synchronizing channels more frequently (and
>> >> inefficiently). Fixed-ram should always run with that property
>> >> disabled (default).
>> >
>> > What if the user enables multifd_flush_after_each_section=true?
>> >
>> > IMHO we don't necessarily need to attach the fixed-ram loading flush to any
>> > flag in the stream.  For fixed-ram IIUC all the loads will happen in one
>> > shot of ram_load() anyway when parsing the ramblock list, so.. how about we
>> > decouple the fixed-ram load flush from the stream by always do a sync in
>> > ram_load() unconditionally?
>> 
>> I would like to. But it's not possible because ram_load() is called once
>> per section. So once for each EOS flag on the stream. We'll have at
>> least two calls to ram_load(), once due to qemu_savevm_state_iterate()
>> and another due to qemu_savevm_state_complete_precopy().
>> 
>> The fact that fixed-ram can use just one load doesn't change the fact
>> that we perform more than one "save". So we'll need to use the FLUSH
>> flag in this case unfortunately.
>
> After I re-read it, I found one more issue.
>
> Now recv side sync is "once and for all" - it doesn't allow a second time
> to sync_main because it syncs only until quits.  That is IMHO making the
> code much harder to maintain, and we'll need rich comment to explain why is
> that happening.
>
> Ideally any "sync main" for recv threads can be called multiple times.  And
> IMHO it's not really hard.  Actually it can make the code much cleaner by
> merging some logic between socket-based and file-based from that regard.
>
> I tried to play with your branch and propose something like this, just to
> show what I meant. This should allow all new fixed-ram test to pass here,
> meanwhile it should allow sync main on recv side to be re-entrant, sharing
> the logic with socket-based as much as possible:
>
> =====
> diff --git a/migration/multifd.c b/migration/multifd.c
> index a0202b5661..28480f6cfe 100644
> --- a/migration/multifd.c
> +++ b/migration/multifd.c
> @@ -86,10 +86,8 @@ struct {
>      /* number of created threads */
>      int count;
>      /*
> -     * For sockets: this is posted once for each MULTIFD_FLAG_SYNC flag.
> -     *
> -     * For files: this is only posted at the end of the file load to mark
> -     *            completion of the load process.
> +     * This is always posted by the recv threads, the main thread uses it
> +     * to wait for recv threads to finish assigned tasks.
>       */
>      QemuSemaphore sem_sync;
>      /* global number of generated multifd packets */
> @@ -1316,38 +1314,55 @@ void multifd_recv_cleanup(void)
>      multifd_recv_cleanup_state();
>  }
>  
> -
> -/*
> - * Wait until all channels have finished receiving data. Once this
> - * function returns, cleanup routines are safe to run.
> - */
> -static void multifd_file_recv_sync(void)
> +static void multifd_recv_file_sync_request(void)
>  {
>      int i;
>  
>      for (i = 0; i < migrate_multifd_channels(); i++) {
>          MultiFDRecvParams *p = &multifd_recv_state->params[i];
>  
> -        trace_multifd_recv_sync_main_wait(p->id);
> -
> +        /*
> +         * We play a trick here: instead of using a separate pending_sync
> +         * to send a sync request (like what we do on senders), we simply
> +         * kick the recv thread once without setting pending_job.
> +         *
> +         * If there's already a pending_job, the thread will only see it
> +         * after it processed the current.  If there's no pending_job,
> +         * it'll see this immediately.
> +         */
>          qemu_sem_post(&p->sem);
> -
>          trace_multifd_recv_sync_main_signal(p->id);
> -        qemu_sem_wait(&p->sem_sync);
>      }
> -    return;
>  }
>  
> +/*
> + * Request a sync for all the multifd recv threads.
> + *
> + * For socket-based, sync request is much more complicated, which relies on
> + * collaborations between both explicit RAM_SAVE_FLAG_MULTIFD_FLUSH in the
> + * main stream, and MULTIFD_FLAG_SYNC flag in per-channel protocol.  Here
> + * it should be invoked by the main stream request.
> + *
> + * For file-based, it is much simpler, because there's no need for a strong
> + * sync semantics between the main thread and the recv threads.  What we
> + * need is only to make sure all recv threads finished their tasks.
> + */
>  void multifd_recv_sync_main(void)
>  {
> +    bool file_based = !multifd_use_packets();
>      int i;
>  
>      if (!migrate_multifd()) {
>          return;
>      }
>  
> -    if (!multifd_use_packets()) {
> -        return multifd_file_recv_sync();
> +    if (file_based) {
> +        /*
> +         * File-based multifd requires an explicit sync request because
> +         * tasks are assigned by the main recv thread, rather than parsed
> +         * through the multifd channels.
> +         */
> +        multifd_recv_file_sync_request();
>      }
>  
>      for (i = 0; i < migrate_multifd_channels(); i++) {
> @@ -1356,6 +1371,11 @@ void multifd_recv_sync_main(void)
>          trace_multifd_recv_sync_main_wait(p->id);
>          qemu_sem_wait(&multifd_recv_state->sem_sync);
>      }
> +
> +    if (file_based) {
> +        return;
> +    }
> +
>      for (i = 0; i < migrate_multifd_channels(); i++) {
>          MultiFDRecvParams *p = &multifd_recv_state->params[i];
>  
> @@ -1420,11 +1440,12 @@ static void *multifd_recv_thread(void *opaque)
>              }
>  
>              /*
> -             * Migration thread did not send work, break and signal
> -             * sem_sync so it knows we're not lagging behind.
> +             * Migration thread did not send work, this emulates
> +             * pending_sync, post sem_sync to notify the main thread.
>               */
>              if (!qatomic_read(&p->pending_job)) {
> -                break;
> +                qemu_sem_post(&multifd_recv_state->sem_sync);
> +                continue;
>              }
>  
>              has_data = !!p->data->size;
> @@ -1449,10 +1470,6 @@ static void *multifd_recv_thread(void *opaque)
>          }
>      }
>  
> -    if (!use_packets) {
> -        qemu_sem_post(&p->sem_sync);
> -    }
> -
>      if (local_err) {
>          multifd_recv_terminate_threads(local_err);
>          error_free(local_err);
>
> ==========
>
> Note that I used multifd_recv_state->sem_sync to send the message rather
> than p->sem, not only because socket-based has similar logic on using that
> sem, but also because main thread shouldn't care about "which" recv thread
> has finished, but "all recv threads are idle".
>
> Do you think this should work out for us in a nicer way?
>

I don't really like the interleaving of file and socket logic at
multifd_recv_sync_main(), but I can live with it.

Waiting on multifd_recv_state->sem_sync is problematic because if a
thread hits an error, that wait will hang forever.

Actually, I don't even see this being handled anywhere in _current_ code;
we probably have a bug there. I guess we need to add one more "post this
sem just because" somewhere, probably in multifd_recv_kick_main().

> Then we talk about the other issue, on whether we should rely on migration
> stream to flush recv threads.  My answer is still hopefully a no.
>
> In the ideal case, the fixed-ram image format should even be tailored to
> not use a live stream protocol.  For example, during the ram iterations
> we currently still flush quite a lot of QEMU_VM_SECTION_PART sections
> that contain mostly rubbish, each ending with RAM_SAVE_FLAG_EOS, and we
> keep doing this in the iteration loop.  The real meat is that while
> processing QEMU_VM_SECTION_PART, the src QEMU updates the guest pages at
> fixed offsets in the file.  That, however, doesn't really contribute
> anything valuable to the migration stream itself (things sent over
> to_dst_file).
>
> AFAIU we chose to still use that logic only for simplicity, even though
> we know those EOSs and all the RAM stream flags are garbage.  Now we tend
> to add a dependency on part of that garbage, RAM_SAVE_FLAG_MULTIFD_FLUSH
> in this case, which is useful for socket-based migration but shouldn't be
> necessary for file.
>
> I think I have a solution besides ram_load(): ultimately fixed-ram stores
> all guest mem in the QEMU_VM_SECTION_START section of the ram, through
> the RAM_SAVE_FLAG_MEM_SIZE handling (which leads to parse_ramblocks()).
> If so, perhaps we can do a one-shot sync for file at the end of
> parse_ramblocks()?  Then we decouple sync_main on recv for the file-based
> case completely from all stream flags.

Yeah, that could work. I think I'll blacklist all unused flags using the
invalid_flags logic.

Thanks
Peter Xu Feb. 27, 2024, 11:46 p.m. UTC | #5
On Tue, Feb 27, 2024 at 11:00:44AM -0300, Fabiano Rosas wrote:
> I don't really like the interleaving of file and socket logic at
> multifd_recv_sync_main(), but I can live with it.

The idea was to share the "wait" part and the semaphore.  If you don't
like the form of it, an alternative is to provide three helpers
(file_kick, wait, socket_kick), then:

  if (file) {
    file_kick();
    wait();
  } else {
    wait();
    socket_kick();
  }

> 
> Waiting on multifd_recv_state->sem_sync is problematic because if the
> thread has an error, that will hang forever.
> 
> Actually, I don't even see this being handled in _current_ code
> anywhere, we probably have a bug there. I guess we need to add one more
> "post this sem just because" somewhere. multifd_recv_kick_main probably.

That might be because the dest qemu is even less of a concern? If
something goes wrong on dest, the src is probably already failing the
migration, and then libvirt or an upper layer can directly kill the dest
qemu (while we can't do that to src).  But yeah, we should still fix it
at some point, to make dest qemu quit gracefully in error cases; it will
also help more in the future if multifd supports postcopy, where neither
src nor dst can be killed.
Patch

diff --git a/migration/ram.c b/migration/ram.c
index 5932e1b8e1..c7050f6f68 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1369,8 +1369,11 @@  static int find_dirty_block(RAMState *rs, PageSearchStatus *pss)
                 if (ret < 0) {
                     return ret;
                 }
-                qemu_put_be64(f, RAM_SAVE_FLAG_MULTIFD_FLUSH);
-                qemu_fflush(f);
+
+                if (!migrate_fixed_ram()) {
+                    qemu_put_be64(f, RAM_SAVE_FLAG_MULTIFD_FLUSH);
+                    qemu_fflush(f);
+                }
             }
             /*
              * If memory migration starts over, we will meet a dirtied page
@@ -3112,7 +3115,8 @@  static int ram_save_setup(QEMUFile *f, void *opaque)
         return ret;
     }
 
-    if (migrate_multifd() && !migrate_multifd_flush_after_each_section()) {
+    if (migrate_multifd() && !migrate_multifd_flush_after_each_section()
+        && !migrate_fixed_ram()) {
         qemu_put_be64(f, RAM_SAVE_FLAG_MULTIFD_FLUSH);
     }
 
@@ -4253,6 +4257,15 @@  static int ram_load_precopy(QEMUFile *f)
             break;
         case RAM_SAVE_FLAG_EOS:
             /* normal exit */
+            if (migrate_fixed_ram()) {
+                /*
+                 * The EOS flag appears multiple times on the
+                 * stream. Fixed-ram needs only one sync at the
+                 * end. It will be done on the flush flag above.
+                 */
+                break;
+            }
+
             if (migrate_multifd() &&
                 migrate_multifd_flush_after_each_section()) {
                 multifd_recv_sync_main();