diff mbox series

[1/2] migration: switchover-hold parameter

Message ID 20230602144715.249002-2-peterx@redhat.com (mailing list archive)
State New, archived
Headers show
Series migration: switchover-hold flag | expand

Commit Message

Peter Xu June 2, 2023, 2:47 p.m. UTC
Add a new migration parameter switchover-hold which can block src qemu
migration from switching over to dest from running.

One can set this flag to true so src qemu will keep iterating the VM data,
not switching over to dest even if it can.

It means now live migration works somehow like COLO; we keep syncing data
from src to dst without stopping.

When the user is ready for the switchover, one can set the parameter from
true->false.  That'll contain a implicit kick to migration thread to be
alive and re-evaluate the switchover decision.

This can be used in two cases so far in my mind:

  (1) One can use this parameter to start pre-heating migration (but not
      really migrating, so a migrate-cancel will cancel the preheat).  When
      the user wants to really migrate, just clear the flag.  It'll in most
      cases migrate immediately because most pages are already synced.

  (2) Can also be used as a clean way to do qtest, in many of the precopy
      tests we have requirement to run after 1 iteration without completing
      the precopy migration.  Before that we have either set bandwidth to
      ridiculous low value, or tricks on detecting guest memory change over
      some adhoc guest memory position.  Now we can simply set this flag
      then we know precopy won't complete and will just keep going.

Here we leveraged a sem to make sure migration thread won't busy spin on a
physical cpu, meanwhile provide a timedwait() of 10ms so it can still try
its best to sync with dest QEMU from time to time.  Note that the sem is
prone to outdated counts but it's benign, please refer to the comment above
the semaphore definition for more information.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 qapi/migration.json            | 25 +++++++++++++--
 migration/migration.h          | 16 ++++++++++
 migration/migration-hmp-cmds.c |  3 ++
 migration/migration.c          | 56 ++++++++++++++++++++++++++++++++--
 migration/options.c            | 17 +++++++++++
 5 files changed, 111 insertions(+), 6 deletions(-)

Comments

Avihai Horon June 5, 2023, 12:27 p.m. UTC | #1
Hi Peter,

On 02/06/2023 17:47, Peter Xu wrote:
> External email: Use caution opening links or attachments
>
>
> Add a new migration parameter switchover-hold which can block src qemu
> migration from switching over to dest from running.
>
> One can set this flag to true so src qemu will keep iterating the VM data,
> not switching over to dest even if it can.
>
> It means now live migration works somehow like COLO; we keep syncing data
> from src to dst without stopping.
>
> When the user is ready for the switchover, one can set the parameter from
> true->false.  That'll contain a implicit kick to migration thread to be
> alive and re-evaluate the switchover decision.
>
> This can be used in two cases so far in my mind:
>
>    (1) One can use this parameter to start pre-heating migration (but not
>        really migrating, so a migrate-cancel will cancel the preheat).  When
>        the user wants to really migrate, just clear the flag.  It'll in most
>        cases migrate immediately because most pages are already synced.
>
>    (2) Can also be used as a clean way to do qtest, in many of the precopy
>        tests we have requirement to run after 1 iteration without completing
>        the precopy migration.  Before that we have either set bandwidth to
>        ridiculous low value, or tricks on detecting guest memory change over
>        some adhoc guest memory position.  Now we can simply set this flag
>        then we know precopy won't complete and will just keep going.
>
> Here we leveraged a sem to make sure migration thread won't busy spin on a
> physical cpu, meanwhile provide a timedwait() of 10ms so it can still try
> its best to sync with dest QEMU from time to time.  Note that the sem is
> prone to outdated counts but it's benign, please refer to the comment above
> the semaphore definition for more information.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>   qapi/migration.json            | 25 +++++++++++++--
>   migration/migration.h          | 16 ++++++++++
>   migration/migration-hmp-cmds.c |  3 ++
>   migration/migration.c          | 56 ++++++++++++++++++++++++++++++++--
>   migration/options.c            | 17 +++++++++++
>   5 files changed, 111 insertions(+), 6 deletions(-)
>
> diff --git a/qapi/migration.json b/qapi/migration.json
> index 179af0c4d8..1d0059d125 100644
> --- a/qapi/migration.json
> +++ b/qapi/migration.json
> @@ -779,6 +779,15 @@
>   #     Nodes are mapped to their block device name if there is one, and
>   #     to their node name otherwise.  (Since 5.2)
>   #
> +# @switchover-hold: Whether we should hold-off precopy switchover from src
> +#     to dest QEMU, even if we can finish migration in the downtime
> +#     specified.  By default off, so precopy migration will complete as
> +#     soon as possible.  One can set it to explicitly keep iterating during
> +#     precopy migration until set the flag to false again to kick off the
> +#     final switchover.  Note, this does not affect postcopy switchover,
> +#     because the user can control that using "migrate-start-postcopy"
> +#     command explicitly. (Since 8.1)
> +#

I think this should wrap at col 70, and need double space before (Since 
8.1).
Also in other places below.

>   # Features:
>   #
>   # @unstable: Member @x-checkpoint-delay is experimental.
> @@ -800,7 +809,7 @@
>              'xbzrle-cache-size', 'max-postcopy-bandwidth',
>              'max-cpu-throttle', 'multifd-compression',
>              'multifd-zlib-level' ,'multifd-zstd-level',
> -           'block-bitmap-mapping' ] }
> +           'block-bitmap-mapping', 'switchover-hold' ] }
>
>   ##
>   # @MigrateSetParameters:
> @@ -935,6 +944,10 @@
>   #     Nodes are mapped to their block device name if there is one, and
>   #     to their node name otherwise.  (Since 5.2)
>   #
> +# @switchover-hold: Whether we should hold-off precopy switchover from src
> +#     to dest QEMU.  For more details, please refer to MigrationParameter
> +#     entry of the same field. (Since 8.1)
> +#
>   # Features:
>   #
>   # @unstable: Member @x-checkpoint-delay is experimental.
> @@ -972,7 +985,8 @@
>               '*multifd-compression': 'MultiFDCompression',
>               '*multifd-zlib-level': 'uint8',
>               '*multifd-zstd-level': 'uint8',
> -            '*block-bitmap-mapping': [ 'BitmapMigrationNodeAlias' ] } }
> +            '*block-bitmap-mapping': [ 'BitmapMigrationNodeAlias' ],
> +            '*switchover-hold': 'bool' } }
>
>   ##
>   # @migrate-set-parameters:
> @@ -1127,6 +1141,10 @@
>   #     Nodes are mapped to their block device name if there is one, and
>   #     to their node name otherwise.  (Since 5.2)
>   #
> +# @switchover-hold: Whether we should hold-off precopy switchover from src
> +#     to dest QEMU.  For more details, please refer to MigrationParameter
> +#     entry of the same field. (Since 8.1)
> +#
>   # Features:
>   #
>   # @unstable: Member @x-checkpoint-delay is experimental.
> @@ -1161,7 +1179,8 @@
>               '*multifd-compression': 'MultiFDCompression',
>               '*multifd-zlib-level': 'uint8',
>               '*multifd-zstd-level': 'uint8',
> -            '*block-bitmap-mapping': [ 'BitmapMigrationNodeAlias' ] } }
> +            '*block-bitmap-mapping': [ 'BitmapMigrationNodeAlias' ],
> +            '*switchover-hold': 'bool' } }
>
>   ##
>   # @query-migrate-parameters:
> diff --git a/migration/migration.h b/migration/migration.h
> index 30c3e97635..721b1c9473 100644
> --- a/migration/migration.h
> +++ b/migration/migration.h
> @@ -440,6 +440,22 @@ struct MigrationState {
>
>       /* QEMU_VM_VMDESCRIPTION content filled for all non-iterable devices. */
>       JSONWriter *vmdesc;
> +    /*
> +     * Only migration thread will wait on it when switchover_hold==true.
> +     *
> +     * Only qmp set param will kick it when switching switchover_hold from
> +     * true->false.
> +     *
> +     * NOTE: outdated sem count here is benign.  E.g., when this is posted,
> +     * the 1st migration got cancelled, then start the 2nd migration, or
> +     * when someone sets the flag from true->false->true->false.. because
> +     * any outdated sem count will only let the migration thread to run one
> +     * more loop (timedwait() will eat the outdated count) when reaching
> +     * the completion phase, then in the next loop it'll sleep again.  The
> +     * important thing here OTOH is when the migration thread is sleeping
> +     * we can always kick it out of the sleep, which we will always do.
> +     */
> +    QemuSemaphore switchover_hold_sem;
>   };
>
>   void migrate_set_state(int *state, int old_state, int new_state);
> diff --git a/migration/migration-hmp-cmds.c b/migration/migration-hmp-cmds.c
> index 9885d7c9f7..63a2c8a4a3 100644
> --- a/migration/migration-hmp-cmds.c
> +++ b/migration/migration-hmp-cmds.c
> @@ -338,6 +338,9 @@ void hmp_info_migrate_parameters(Monitor *mon, const QDict *qdict)
>           monitor_printf(mon, "%s: '%s'\n",
>               MigrationParameter_str(MIGRATION_PARAMETER_TLS_AUTHZ),
>               params->tls_authz);
> +        monitor_printf(mon, "%s: %s\n",
> +            MigrationParameter_str(MIGRATION_PARAMETER_SWITCHOVER_HOLD),
> +            params->switchover_hold ? "on" : "off");
>
>           if (params->has_block_bitmap_mapping) {
>               const BitmapMigrationNodeAliasList *bmnal;
> diff --git a/migration/migration.c b/migration/migration.c
> index dc05c6f6ea..076c9f1372 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -2693,6 +2693,55 @@ static void migration_update_counters(MigrationState *s,
>                                 bandwidth, s->threshold_size);
>   }
>
> +static bool
> +migration_should_complete(MigrationState *s, uint64_t pending_size)
> +{
> +    /* We still have large pending data to send? */
> +    if (pending_size && (pending_size >= s->threshold_size)) {

The parenthesis here are redundant.

> +        return false;
> +    }
> +
> +    /* The user doesn't want us to switchover yet */
> +    if (s->parameters.switchover_hold) {

Should we check this only if we are in precopy? I.e., skip this check if 
postcopy is active?

I thought that this parameter is all about precopy switchover, and the 
fact that it prevents stopcopy from reaching migration_completion() is 
just an undesired side effect.
As I see it, if this parameter is set then precopy switchover is hold 
off, and if on top of that a user starts postcopy then this parameter is 
not relevant anymore - we should start postcopy and ignore it for the 
rest of migration.

Does it make sense?

Thanks.

> +        /*
> +         * Note: when reaching here it probably means we've migrated almost
> +         * everything and ready to switchover.  If user asked not to switch
> +         * wait for a short period and respond to kicks immediately.
> +         *
> +         * If we wait too long, there can be a lot of dirty data generated,
> +         * while we could have done something to sync data between src/dst.
> +         *
> +         * If we wait too short, migration thread can eat most/all cpu
> +         * resource looping over switchover_hold.
> +         *
> +         * Make it 10ms which seems to be a good intermediate value.
> +         */
> +        qemu_sem_timedwait(&s->switchover_hold_sem, 10);
> +
> +        /*
> +         * Return false here always even if user changed it, because we'd
> +         * like to re-evaluate everything (e.g. pending_size).
> +         */
> +        return false;
> +    }
> +
> +    return true;
> +}
> +
> +static bool
> +migration_should_start_postcopy(MigrationState *s, uint64_t must_precopy)
> +{
> +    /* If we're already in postcopy phase, don't bother */
> +    if (migration_in_postcopy()) {
> +        return false;
> +    }
> +    /* We still have lots of thing that must be migrated in precopy */
> +    if (must_precopy > s->threshold_size) {
> +        return false;
> +    }
> +    return qatomic_read(&s->start_postcopy);
> +}
> +
>   /* Migration thread iteration status */
>   typedef enum {
>       MIG_ITERATE_RESUME,         /* Resume current iteration */
> @@ -2720,15 +2769,14 @@ static MigIterateState migration_iteration_run(MigrationState *s)
>           trace_migrate_pending_exact(pending_size, must_precopy, can_postcopy);
>       }
>
> -    if (!pending_size || pending_size < s->threshold_size) {
> +    if (migration_should_complete(s, pending_size)) {
>           trace_migration_thread_low_pending(pending_size);
>           migration_completion(s);
>           return MIG_ITERATE_BREAK;
>       }
>
>       /* Still a significant amount to transfer */
> -    if (!in_postcopy && must_precopy <= s->threshold_size &&
> -        qatomic_read(&s->start_postcopy)) {
> +    if (migration_should_start_postcopy(s, must_precopy)) {
>           if (postcopy_start(s)) {
>               error_report("%s: postcopy failed to start", __func__);
>           }
> @@ -3285,6 +3333,7 @@ static void migration_instance_finalize(Object *obj)
>       qemu_sem_destroy(&ms->rp_state.rp_sem);
>       qemu_sem_destroy(&ms->rp_state.rp_pong_acks);
>       qemu_sem_destroy(&ms->postcopy_qemufile_src_sem);
> +    qemu_sem_destroy(&ms->switchover_hold_sem);
>       error_free(ms->error);
>   }
>
> @@ -3307,6 +3356,7 @@ static void migration_instance_init(Object *obj)
>       qemu_sem_init(&ms->rate_limit_sem, 0);
>       qemu_sem_init(&ms->wait_unplug_sem, 0);
>       qemu_sem_init(&ms->postcopy_qemufile_src_sem, 0);
> +    qemu_sem_init(&ms->switchover_hold_sem, 0);
>       qemu_mutex_init(&ms->qemu_file_lock);
>   }
>
> diff --git a/migration/options.c b/migration/options.c
> index b62ab30cd5..2d6b138352 100644
> --- a/migration/options.c
> +++ b/migration/options.c
> @@ -163,6 +163,8 @@ Property migration_properties[] = {
>       DEFINE_PROP_STRING("tls-creds", MigrationState, parameters.tls_creds),
>       DEFINE_PROP_STRING("tls-hostname", MigrationState, parameters.tls_hostname),
>       DEFINE_PROP_STRING("tls-authz", MigrationState, parameters.tls_authz),
> +    DEFINE_PROP_BOOL("switchover-hold", MigrationState,
> +                     parameters.switchover_hold, false),
>
>       /* Migration capabilities */
>       DEFINE_PROP_MIG_CAP("x-xbzrle", MIGRATION_CAPABILITY_XBZRLE),
> @@ -883,6 +885,8 @@ MigrationParameters *qmp_query_migrate_parameters(Error **errp)
>       params->announce_rounds = s->parameters.announce_rounds;
>       params->has_announce_step = true;
>       params->announce_step = s->parameters.announce_step;
> +    params->has_switchover_hold = true;
> +    params->switchover_hold = s->parameters.switchover_hold;
>
>       if (s->parameters.has_block_bitmap_mapping) {
>           params->has_block_bitmap_mapping = true;
> @@ -923,6 +927,7 @@ void migrate_params_init(MigrationParameters *params)
>       params->has_announce_max = true;
>       params->has_announce_rounds = true;
>       params->has_announce_step = true;
> +    params->has_switchover_hold = true;
>   }
>
>   /*
> @@ -1177,6 +1182,9 @@ static void migrate_params_test_apply(MigrateSetParameters *params,
>       if (params->has_announce_step) {
>           dest->announce_step = params->announce_step;
>       }
> +    if (params->has_switchover_hold) {
> +        dest->switchover_hold = params->switchover_hold;
> +    }
>
>       if (params->has_block_bitmap_mapping) {
>           dest->has_block_bitmap_mapping = true;
> @@ -1290,6 +1298,15 @@ static void migrate_params_apply(MigrateSetParameters *params, Error **errp)
>       if (params->has_announce_step) {
>           s->parameters.announce_step = params->announce_step;
>       }
> +    if (params->has_switchover_hold) {
> +        bool old = s->parameters.switchover_hold;
> +        bool new = params->switchover_hold;
> +
> +        s->parameters.switchover_hold = params->switchover_hold;
> +        if (old && !new) {
> +            qemu_sem_post(&s->switchover_hold_sem);
> +        }
> +    }
>
>       if (params->has_block_bitmap_mapping) {
>           qapi_free_BitmapMigrationNodeAliasList(
> --
> 2.40.1
>
Peter Xu June 5, 2023, 4:06 p.m. UTC | #2
On Mon, Jun 05, 2023 at 03:27:53PM +0300, Avihai Horon wrote:
> Hi Peter,
> 
> On 02/06/2023 17:47, Peter Xu wrote:
> > External email: Use caution opening links or attachments
> > 
> > 
> > Add a new migration parameter switchover-hold which can block src qemu
> > migration from switching over to dest from running.
> > 
> > One can set this flag to true so src qemu will keep iterating the VM data,
> > not switching over to dest even if it can.
> > 
> > It means now live migration works somehow like COLO; we keep syncing data
> > from src to dst without stopping.
> > 
> > When the user is ready for the switchover, one can set the parameter from
> > true->false.  That'll contain a implicit kick to migration thread to be
> > alive and re-evaluate the switchover decision.
> > 
> > This can be used in two cases so far in my mind:
> > 
> >    (1) One can use this parameter to start pre-heating migration (but not
> >        really migrating, so a migrate-cancel will cancel the preheat).  When
> >        the user wants to really migrate, just clear the flag.  It'll in most
> >        cases migrate immediately because most pages are already synced.
> > 
> >    (2) Can also be used as a clean way to do qtest, in many of the precopy
> >        tests we have requirement to run after 1 iteration without completing
> >        the precopy migration.  Before that we have either set bandwidth to
> >        ridiculous low value, or tricks on detecting guest memory change over
> >        some adhoc guest memory position.  Now we can simply set this flag
> >        then we know precopy won't complete and will just keep going.
> > 
> > Here we leveraged a sem to make sure migration thread won't busy spin on a
> > physical cpu, meanwhile provide a timedwait() of 10ms so it can still try
> > its best to sync with dest QEMU from time to time.  Note that the sem is
> > prone to outdated counts but it's benign, please refer to the comment above
> > the semaphore definition for more information.
> > 
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >   qapi/migration.json            | 25 +++++++++++++--
> >   migration/migration.h          | 16 ++++++++++
> >   migration/migration-hmp-cmds.c |  3 ++
> >   migration/migration.c          | 56 ++++++++++++++++++++++++++++++++--
> >   migration/options.c            | 17 +++++++++++
> >   5 files changed, 111 insertions(+), 6 deletions(-)
> > 
> > diff --git a/qapi/migration.json b/qapi/migration.json
> > index 179af0c4d8..1d0059d125 100644
> > --- a/qapi/migration.json
> > +++ b/qapi/migration.json
> > @@ -779,6 +779,15 @@
> >   #     Nodes are mapped to their block device name if there is one, and
> >   #     to their node name otherwise.  (Since 5.2)
> >   #
> > +# @switchover-hold: Whether we should hold-off precopy switchover from src
> > +#     to dest QEMU, even if we can finish migration in the downtime
> > +#     specified.  By default off, so precopy migration will complete as
> > +#     soon as possible.  One can set it to explicitly keep iterating during
> > +#     precopy migration until set the flag to false again to kick off the
> > +#     final switchover.  Note, this does not affect postcopy switchover,
> > +#     because the user can control that using "migrate-start-postcopy"
> > +#     command explicitly. (Since 8.1)
> > +#
> 
> I think this should wrap at col 70, and need double space before (Since
> 8.1).
> Also in other places below.

I don't know that rule; anywhere documented?  It seems true for qapi/.

Obviously I forgot to copy Markus and Eric at least, since this is qmp stuff.

> 
> >   # Features:
> >   #
> >   # @unstable: Member @x-checkpoint-delay is experimental.
> > @@ -800,7 +809,7 @@
> >              'xbzrle-cache-size', 'max-postcopy-bandwidth',
> >              'max-cpu-throttle', 'multifd-compression',
> >              'multifd-zlib-level' ,'multifd-zstd-level',
> > -           'block-bitmap-mapping' ] }
> > +           'block-bitmap-mapping', 'switchover-hold' ] }
> > 
> >   ##
> >   # @MigrateSetParameters:
> > @@ -935,6 +944,10 @@
> >   #     Nodes are mapped to their block device name if there is one, and
> >   #     to their node name otherwise.  (Since 5.2)
> >   #
> > +# @switchover-hold: Whether we should hold-off precopy switchover from src
> > +#     to dest QEMU.  For more details, please refer to MigrationParameter
> > +#     entry of the same field. (Since 8.1)
> > +#
> >   # Features:
> >   #
> >   # @unstable: Member @x-checkpoint-delay is experimental.
> > @@ -972,7 +985,8 @@
> >               '*multifd-compression': 'MultiFDCompression',
> >               '*multifd-zlib-level': 'uint8',
> >               '*multifd-zstd-level': 'uint8',
> > -            '*block-bitmap-mapping': [ 'BitmapMigrationNodeAlias' ] } }
> > +            '*block-bitmap-mapping': [ 'BitmapMigrationNodeAlias' ],
> > +            '*switchover-hold': 'bool' } }
> > 
> >   ##
> >   # @migrate-set-parameters:
> > @@ -1127,6 +1141,10 @@
> >   #     Nodes are mapped to their block device name if there is one, and
> >   #     to their node name otherwise.  (Since 5.2)
> >   #
> > +# @switchover-hold: Whether we should hold-off precopy switchover from src
> > +#     to dest QEMU.  For more details, please refer to MigrationParameter
> > +#     entry of the same field. (Since 8.1)
> > +#
> >   # Features:
> >   #
> >   # @unstable: Member @x-checkpoint-delay is experimental.
> > @@ -1161,7 +1179,8 @@
> >               '*multifd-compression': 'MultiFDCompression',
> >               '*multifd-zlib-level': 'uint8',
> >               '*multifd-zstd-level': 'uint8',
> > -            '*block-bitmap-mapping': [ 'BitmapMigrationNodeAlias' ] } }
> > +            '*block-bitmap-mapping': [ 'BitmapMigrationNodeAlias' ],
> > +            '*switchover-hold': 'bool' } }
> > 
> >   ##
> >   # @query-migrate-parameters:
> > diff --git a/migration/migration.h b/migration/migration.h
> > index 30c3e97635..721b1c9473 100644
> > --- a/migration/migration.h
> > +++ b/migration/migration.h
> > @@ -440,6 +440,22 @@ struct MigrationState {
> > 
> >       /* QEMU_VM_VMDESCRIPTION content filled for all non-iterable devices. */
> >       JSONWriter *vmdesc;
> > +    /*
> > +     * Only migration thread will wait on it when switchover_hold==true.
> > +     *
> > +     * Only qmp set param will kick it when switching switchover_hold from
> > +     * true->false.
> > +     *
> > +     * NOTE: outdated sem count here is benign.  E.g., when this is posted,
> > +     * the 1st migration got cancelled, then start the 2nd migration, or
> > +     * when someone sets the flag from true->false->true->false.. because
> > +     * any outdated sem count will only let the migration thread to run one
> > +     * more loop (timedwait() will eat the outdated count) when reaching
> > +     * the completion phase, then in the next loop it'll sleep again.  The
> > +     * important thing here OTOH is when the migration thread is sleeping
> > +     * we can always kick it out of the sleep, which we will always do.
> > +     */
> > +    QemuSemaphore switchover_hold_sem;
> >   };
> > 
> >   void migrate_set_state(int *state, int old_state, int new_state);
> > diff --git a/migration/migration-hmp-cmds.c b/migration/migration-hmp-cmds.c
> > index 9885d7c9f7..63a2c8a4a3 100644
> > --- a/migration/migration-hmp-cmds.c
> > +++ b/migration/migration-hmp-cmds.c
> > @@ -338,6 +338,9 @@ void hmp_info_migrate_parameters(Monitor *mon, const QDict *qdict)
> >           monitor_printf(mon, "%s: '%s'\n",
> >               MigrationParameter_str(MIGRATION_PARAMETER_TLS_AUTHZ),
> >               params->tls_authz);
> > +        monitor_printf(mon, "%s: %s\n",
> > +            MigrationParameter_str(MIGRATION_PARAMETER_SWITCHOVER_HOLD),
> > +            params->switchover_hold ? "on" : "off");
> > 
> >           if (params->has_block_bitmap_mapping) {
> >               const BitmapMigrationNodeAliasList *bmnal;
> > diff --git a/migration/migration.c b/migration/migration.c
> > index dc05c6f6ea..076c9f1372 100644
> > --- a/migration/migration.c
> > +++ b/migration/migration.c
> > @@ -2693,6 +2693,55 @@ static void migration_update_counters(MigrationState *s,
> >                                 bandwidth, s->threshold_size);
> >   }
> > 
> > +static bool
> > +migration_should_complete(MigrationState *s, uint64_t pending_size)
> > +{
> > +    /* We still have large pending data to send? */
> > +    if (pending_size && (pending_size >= s->threshold_size)) {
> 
> The parenthesis here are redundant.
> 
> > +        return false;
> > +    }
> > +
> > +    /* The user doesn't want us to switchover yet */
> > +    if (s->parameters.switchover_hold) {
> 
> Should we check this only if we are in precopy? I.e., skip this check if
> postcopy is active?
> 
> I thought that this parameter is all about precopy switchover, and the fact
> that it prevents stopcopy from reaching migration_completion() is just an
> undesired side effect.
> As I see it, if this parameter is set then precopy switchover is hold off,
> and if on top of that a user starts postcopy then this parameter is not
> relevant anymore - we should start postcopy and ignore it for the rest of
> migration.
> 
> Does it make sense?

Yes, it matches more with the doc update indeed.

I didn't bother with that and added flag=false in the postcopy tests in the
follow up patch because logically a postcopy user shouldn't even use it
(also per the doc), but I can also make it strict here as you said.

Thanks for taking a look!
Avihai Horon June 6, 2023, 12:34 p.m. UTC | #3
On 05/06/2023 19:06, Peter Xu wrote:
> External email: Use caution opening links or attachments
>
>
> On Mon, Jun 05, 2023 at 03:27:53PM +0300, Avihai Horon wrote:
>> Hi Peter,
>>
>> On 02/06/2023 17:47, Peter Xu wrote:
>>> External email: Use caution opening links or attachments
>>>
>>>
>>> Add a new migration parameter switchover-hold which can block src qemu
>>> migration from switching over to dest from running.
>>>
>>> One can set this flag to true so src qemu will keep iterating the VM data,
>>> not switching over to dest even if it can.
>>>
>>> It means now live migration works somehow like COLO; we keep syncing data
>>> from src to dst without stopping.
>>>
>>> When the user is ready for the switchover, one can set the parameter from
>>> true->false.  That'll contain a implicit kick to migration thread to be
>>> alive and re-evaluate the switchover decision.
>>>
>>> This can be used in two cases so far in my mind:
>>>
>>>     (1) One can use this parameter to start pre-heating migration (but not
>>>         really migrating, so a migrate-cancel will cancel the preheat).  When
>>>         the user wants to really migrate, just clear the flag.  It'll in most
>>>         cases migrate immediately because most pages are already synced.
>>>
>>>     (2) Can also be used as a clean way to do qtest, in many of the precopy
>>>         tests we have requirement to run after 1 iteration without completing
>>>         the precopy migration.  Before that we have either set bandwidth to
>>>         ridiculous low value, or tricks on detecting guest memory change over
>>>         some adhoc guest memory position.  Now we can simply set this flag
>>>         then we know precopy won't complete and will just keep going.
>>>
>>> Here we leveraged a sem to make sure migration thread won't busy spin on a
>>> physical cpu, meanwhile provide a timedwait() of 10ms so it can still try
>>> its best to sync with dest QEMU from time to time.  Note that the sem is
>>> prone to outdated counts but it's benign, please refer to the comment above
>>> the semaphore definition for more information.
>>>
>>> Signed-off-by: Peter Xu <peterx@redhat.com>
>>> ---
>>>    qapi/migration.json            | 25 +++++++++++++--
>>>    migration/migration.h          | 16 ++++++++++
>>>    migration/migration-hmp-cmds.c |  3 ++
>>>    migration/migration.c          | 56 ++++++++++++++++++++++++++++++++--
>>>    migration/options.c            | 17 +++++++++++
>>>    5 files changed, 111 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/qapi/migration.json b/qapi/migration.json
>>> index 179af0c4d8..1d0059d125 100644
>>> --- a/qapi/migration.json
>>> +++ b/qapi/migration.json
>>> @@ -779,6 +779,15 @@
>>>    #     Nodes are mapped to their block device name if there is one, and
>>>    #     to their node name otherwise.  (Since 5.2)
>>>    #
>>> +# @switchover-hold: Whether we should hold-off precopy switchover from src
>>> +#     to dest QEMU, even if we can finish migration in the downtime
>>> +#     specified.  By default off, so precopy migration will complete as
>>> +#     soon as possible.  One can set it to explicitly keep iterating during
>>> +#     precopy migration until set the flag to false again to kick off the
>>> +#     final switchover.  Note, this does not affect postcopy switchover,
>>> +#     because the user can control that using "migrate-start-postcopy"
>>> +#     command explicitly. (Since 8.1)
>>> +#
>> I think this should wrap at col 70, and need double space before (Since
>> 8.1).
>> Also in other places below.
> I don't know that rule; anywhere documented?  It seems true for qapi/.

Yes, it's specifically for qapi. It's documented here [1].

Thanks.

[1] 
https://lore.kernel.org/qemu-devel/20230428105429.1687850-16-armbru@redhat.com/

>
> Obviously I forgot to copy Markus and Eric at least, since this is qmp stuff.
>
>>>    # Features:
>>>    #
>>>    # @unstable: Member @x-checkpoint-delay is experimental.
>>> @@ -800,7 +809,7 @@
>>>               'xbzrle-cache-size', 'max-postcopy-bandwidth',
>>>               'max-cpu-throttle', 'multifd-compression',
>>>               'multifd-zlib-level' ,'multifd-zstd-level',
>>> -           'block-bitmap-mapping' ] }
>>> +           'block-bitmap-mapping', 'switchover-hold' ] }
>>>
>>>    ##
>>>    # @MigrateSetParameters:
>>> @@ -935,6 +944,10 @@
>>>    #     Nodes are mapped to their block device name if there is one, and
>>>    #     to their node name otherwise.  (Since 5.2)
>>>    #
>>> +# @switchover-hold: Whether we should hold-off precopy switchover from src
>>> +#     to dest QEMU.  For more details, please refer to MigrationParameter
>>> +#     entry of the same field. (Since 8.1)
>>> +#
>>>    # Features:
>>>    #
>>>    # @unstable: Member @x-checkpoint-delay is experimental.
>>> @@ -972,7 +985,8 @@
>>>                '*multifd-compression': 'MultiFDCompression',
>>>                '*multifd-zlib-level': 'uint8',
>>>                '*multifd-zstd-level': 'uint8',
>>> -            '*block-bitmap-mapping': [ 'BitmapMigrationNodeAlias' ] } }
>>> +            '*block-bitmap-mapping': [ 'BitmapMigrationNodeAlias' ],
>>> +            '*switchover-hold': 'bool' } }
>>>
>>>    ##
>>>    # @migrate-set-parameters:
>>> @@ -1127,6 +1141,10 @@
>>>    #     Nodes are mapped to their block device name if there is one, and
>>>    #     to their node name otherwise.  (Since 5.2)
>>>    #
>>> +# @switchover-hold: Whether we should hold-off precopy switchover from src
>>> +#     to dest QEMU.  For more details, please refer to MigrationParameter
>>> +#     entry of the same field. (Since 8.1)
>>> +#
>>>    # Features:
>>>    #
>>>    # @unstable: Member @x-checkpoint-delay is experimental.
>>> @@ -1161,7 +1179,8 @@
>>>                '*multifd-compression': 'MultiFDCompression',
>>>                '*multifd-zlib-level': 'uint8',
>>>                '*multifd-zstd-level': 'uint8',
>>> -            '*block-bitmap-mapping': [ 'BitmapMigrationNodeAlias' ] } }
>>> +            '*block-bitmap-mapping': [ 'BitmapMigrationNodeAlias' ],
>>> +            '*switchover-hold': 'bool' } }
>>>
>>>    ##
>>>    # @query-migrate-parameters:
>>> diff --git a/migration/migration.h b/migration/migration.h
>>> index 30c3e97635..721b1c9473 100644
>>> --- a/migration/migration.h
>>> +++ b/migration/migration.h
>>> @@ -440,6 +440,22 @@ struct MigrationState {
>>>
>>>        /* QEMU_VM_VMDESCRIPTION content filled for all non-iterable devices. */
>>>        JSONWriter *vmdesc;
>>> +    /*
>>> +     * Only migration thread will wait on it when switchover_hold==true.
>>> +     *
>>> +     * Only qmp set param will kick it when switching switchover_hold from
>>> +     * true->false.
>>> +     *
>>> +     * NOTE: outdated sem count here is benign.  E.g., when this is posted,
>>> +     * the 1st migration got cancelled, then start the 2nd migration, or
>>> +     * when someone sets the flag from true->false->true->false.. because
>>> +     * any outdated sem count will only let the migration thread to run one
>>> +     * more loop (timedwait() will eat the outdated count) when reaching
>>> +     * the completion phase, then in the next loop it'll sleep again.  The
>>> +     * important thing here OTOH is when the migration thread is sleeping
>>> +     * we can always kick it out of the sleep, which we will always do.
>>> +     */
>>> +    QemuSemaphore switchover_hold_sem;
>>>    };
>>>
>>>    void migrate_set_state(int *state, int old_state, int new_state);
>>> diff --git a/migration/migration-hmp-cmds.c b/migration/migration-hmp-cmds.c
>>> index 9885d7c9f7..63a2c8a4a3 100644
>>> --- a/migration/migration-hmp-cmds.c
>>> +++ b/migration/migration-hmp-cmds.c
>>> @@ -338,6 +338,9 @@ void hmp_info_migrate_parameters(Monitor *mon, const QDict *qdict)
>>>            monitor_printf(mon, "%s: '%s'\n",
>>>                MigrationParameter_str(MIGRATION_PARAMETER_TLS_AUTHZ),
>>>                params->tls_authz);
>>> +        monitor_printf(mon, "%s: %s\n",
>>> +            MigrationParameter_str(MIGRATION_PARAMETER_SWITCHOVER_HOLD),
>>> +            params->switchover_hold ? "on" : "off");
>>>
>>>            if (params->has_block_bitmap_mapping) {
>>>                const BitmapMigrationNodeAliasList *bmnal;
>>> diff --git a/migration/migration.c b/migration/migration.c
>>> index dc05c6f6ea..076c9f1372 100644
>>> --- a/migration/migration.c
>>> +++ b/migration/migration.c
>>> @@ -2693,6 +2693,55 @@ static void migration_update_counters(MigrationState *s,
>>>                                  bandwidth, s->threshold_size);
>>>    }
>>>
>>> +static bool
>>> +migration_should_complete(MigrationState *s, uint64_t pending_size)
>>> +{
>>> +    /* We still have large pending data to send? */
>>> +    if (pending_size && (pending_size >= s->threshold_size)) {
>> The parenthesis here are redundant.
>>
>>> +        return false;
>>> +    }
>>> +
>>> +    /* The user doesn't want us to switchover yet */
>>> +    if (s->parameters.switchover_hold) {
>> Should we check this only if we are in precopy? I.e., skip this check if
>> postcopy is active?
>>
>> I thought that this parameter is all about precopy switchover, and the fact
>> that it prevents stopcopy from reaching migration_completion() is just an
>> undesired side effect.
>> As I see it, if this parameter is set then precopy switchover is hold off,
>> and if on top of that a user starts postcopy then this parameter is not
>> relevant anymore - we should start postcopy and ignore it for the rest of
>> migration.
>>
>> Does it make sense?
> Yes, it matches more with the doc update indeed.
>
> I didn't bother with that and added flag=false in the postcopy tests in the
> follow up patch because logically a postcopy user shouldn't even use it
> (also per the doc), but I can also make it strict here as you said.
>
> Thanks for taking a look!
>
> --
> Peter Xu
>
Peter Xu June 6, 2023, 3:58 p.m. UTC | #4
On Tue, Jun 06, 2023 at 03:34:26PM +0300, Avihai Horon wrote:
> Yes, it's specifically for qapi. It's documented here [1].
> 
> Thanks.
> 
> [1] https://lore.kernel.org/qemu-devel/20230428105429.1687850-16-armbru@redhat.com/

Thanks, I'll repost soon.
diff mbox series

Patch

diff --git a/qapi/migration.json b/qapi/migration.json
index 179af0c4d8..1d0059d125 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -779,6 +779,15 @@ 
 #     Nodes are mapped to their block device name if there is one, and
 #     to their node name otherwise.  (Since 5.2)
 #
+# @switchover-hold: Whether we should hold-off precopy switchover from src
+#     to dest QEMU, even if we can finish migration in the downtime
+#     specified.  By default off, so precopy migration will complete as
+#     soon as possible.  One can set it to explicitly keep iterating during
+#     precopy migration until set the flag to false again to kick off the
+#     final switchover.  Note, this does not affect postcopy switchover,
+#     because the user can control that using "migrate-start-postcopy"
+#     command explicitly. (Since 8.1)
+#
 # Features:
 #
 # @unstable: Member @x-checkpoint-delay is experimental.
@@ -800,7 +809,7 @@ 
            'xbzrle-cache-size', 'max-postcopy-bandwidth',
            'max-cpu-throttle', 'multifd-compression',
            'multifd-zlib-level' ,'multifd-zstd-level',
-           'block-bitmap-mapping' ] }
+           'block-bitmap-mapping', 'switchover-hold' ] }
 
 ##
 # @MigrateSetParameters:
@@ -935,6 +944,10 @@ 
 #     Nodes are mapped to their block device name if there is one, and
 #     to their node name otherwise.  (Since 5.2)
 #
+# @switchover-hold: Whether we should hold-off precopy switchover from src
+#     to dest QEMU.  For more details, please refer to MigrationParameter
+#     entry of the same field. (Since 8.1)
+#
 # Features:
 #
 # @unstable: Member @x-checkpoint-delay is experimental.
@@ -972,7 +985,8 @@ 
             '*multifd-compression': 'MultiFDCompression',
             '*multifd-zlib-level': 'uint8',
             '*multifd-zstd-level': 'uint8',
-            '*block-bitmap-mapping': [ 'BitmapMigrationNodeAlias' ] } }
+            '*block-bitmap-mapping': [ 'BitmapMigrationNodeAlias' ],
+            '*switchover-hold': 'bool' } }
 
 ##
 # @migrate-set-parameters:
@@ -1127,6 +1141,10 @@ 
 #     Nodes are mapped to their block device name if there is one, and
 #     to their node name otherwise.  (Since 5.2)
 #
+# @switchover-hold: Whether we should hold-off precopy switchover from src
+#     to dest QEMU.  For more details, please refer to MigrationParameter
+#     entry of the same field. (Since 8.1)
+#
 # Features:
 #
 # @unstable: Member @x-checkpoint-delay is experimental.
@@ -1161,7 +1179,8 @@ 
             '*multifd-compression': 'MultiFDCompression',
             '*multifd-zlib-level': 'uint8',
             '*multifd-zstd-level': 'uint8',
-            '*block-bitmap-mapping': [ 'BitmapMigrationNodeAlias' ] } }
+            '*block-bitmap-mapping': [ 'BitmapMigrationNodeAlias' ],
+            '*switchover-hold': 'bool' } }
 
 ##
 # @query-migrate-parameters:
diff --git a/migration/migration.h b/migration/migration.h
index 30c3e97635..721b1c9473 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -440,6 +440,22 @@  struct MigrationState {
 
     /* QEMU_VM_VMDESCRIPTION content filled for all non-iterable devices. */
     JSONWriter *vmdesc;
+    /*
+     * Only migration thread will wait on it when switchover_hold==true.
+     *
+     * Only qmp set param will kick it when switching switchover_hold from
+     * true->false.
+     *
+     * NOTE: outdated sem count here is benign.  E.g., when this is posted,
+     * the 1st migration got cancelled, then start the 2nd migration, or
+     * when someone sets the flag from true->false->true->false.. because
+     * any outdated sem count will only let the migration thread to run one
+     * more loop (timedwait() will eat the outdated count) when reaching
+     * the completion phase, then in the next loop it'll sleep again.  The
+     * important thing here OTOH is when the migration thread is sleeping
+     * we can always kick it out of the sleep, which we will always do.
+     */
+    QemuSemaphore switchover_hold_sem;
 };
 
 void migrate_set_state(int *state, int old_state, int new_state);
diff --git a/migration/migration-hmp-cmds.c b/migration/migration-hmp-cmds.c
index 9885d7c9f7..63a2c8a4a3 100644
--- a/migration/migration-hmp-cmds.c
+++ b/migration/migration-hmp-cmds.c
@@ -338,6 +338,9 @@  void hmp_info_migrate_parameters(Monitor *mon, const QDict *qdict)
         monitor_printf(mon, "%s: '%s'\n",
             MigrationParameter_str(MIGRATION_PARAMETER_TLS_AUTHZ),
             params->tls_authz);
+        monitor_printf(mon, "%s: %s\n",
+            MigrationParameter_str(MIGRATION_PARAMETER_SWITCHOVER_HOLD),
+            params->switchover_hold ? "on" : "off");
 
         if (params->has_block_bitmap_mapping) {
             const BitmapMigrationNodeAliasList *bmnal;
diff --git a/migration/migration.c b/migration/migration.c
index dc05c6f6ea..076c9f1372 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -2693,6 +2693,55 @@  static void migration_update_counters(MigrationState *s,
                               bandwidth, s->threshold_size);
 }
 
+static bool
+migration_should_complete(MigrationState *s, uint64_t pending_size)
+{
+    /* We still have large pending data to send? */
+    if (pending_size && (pending_size >= s->threshold_size)) {
+        return false;
+    }
+
+    /* The user doesn't want us to switchover yet */
+    if (s->parameters.switchover_hold) {
+        /*
+         * Note: when reaching here it probably means we've migrated almost
+         * everything and ready to switchover.  If user asked not to switch
+         * wait for a short period and respond to kicks immediately.
+         *
+         * If we wait too long, there can be a lot of dirty data generated,
+         * while we could have done something to sync data between src/dst.
+         *
+         * If we wait too short, migration thread can eat most/all cpu
+         * resource looping over switchover_hold.
+         *
+         * Make it 10ms which seems to be a good intermediate value.
+         */
+        qemu_sem_timedwait(&s->switchover_hold_sem, 10);
+
+        /*
+         * Return false here always even if user changed it, because we'd
+         * like to re-evaluate everything (e.g. pending_size).
+         */
+        return false;
+    }
+
+    return true;
+}
+
+static bool
+migration_should_start_postcopy(MigrationState *s, uint64_t must_precopy)
+{
+    /* If we're already in postcopy phase, don't bother */
+    if (migration_in_postcopy()) {
+        return false;
+    }
+    /* We still have lots of thing that must be migrated in precopy */
+    if (must_precopy > s->threshold_size) {
+        return false;
+    }
+    return qatomic_read(&s->start_postcopy);
+}
+
 /* Migration thread iteration status */
 typedef enum {
     MIG_ITERATE_RESUME,         /* Resume current iteration */
@@ -2720,15 +2769,14 @@  static MigIterateState migration_iteration_run(MigrationState *s)
         trace_migrate_pending_exact(pending_size, must_precopy, can_postcopy);
     }
 
-    if (!pending_size || pending_size < s->threshold_size) {
+    if (migration_should_complete(s, pending_size)) {
         trace_migration_thread_low_pending(pending_size);
         migration_completion(s);
         return MIG_ITERATE_BREAK;
     }
 
     /* Still a significant amount to transfer */
-    if (!in_postcopy && must_precopy <= s->threshold_size &&
-        qatomic_read(&s->start_postcopy)) {
+    if (migration_should_start_postcopy(s, must_precopy)) {
         if (postcopy_start(s)) {
             error_report("%s: postcopy failed to start", __func__);
         }
@@ -3285,6 +3333,7 @@  static void migration_instance_finalize(Object *obj)
     qemu_sem_destroy(&ms->rp_state.rp_sem);
     qemu_sem_destroy(&ms->rp_state.rp_pong_acks);
     qemu_sem_destroy(&ms->postcopy_qemufile_src_sem);
+    qemu_sem_destroy(&ms->switchover_hold_sem);
     error_free(ms->error);
 }
 
@@ -3307,6 +3356,7 @@  static void migration_instance_init(Object *obj)
     qemu_sem_init(&ms->rate_limit_sem, 0);
     qemu_sem_init(&ms->wait_unplug_sem, 0);
     qemu_sem_init(&ms->postcopy_qemufile_src_sem, 0);
+    qemu_sem_init(&ms->switchover_hold_sem, 0);
     qemu_mutex_init(&ms->qemu_file_lock);
 }
 
diff --git a/migration/options.c b/migration/options.c
index b62ab30cd5..2d6b138352 100644
--- a/migration/options.c
+++ b/migration/options.c
@@ -163,6 +163,8 @@  Property migration_properties[] = {
     DEFINE_PROP_STRING("tls-creds", MigrationState, parameters.tls_creds),
     DEFINE_PROP_STRING("tls-hostname", MigrationState, parameters.tls_hostname),
     DEFINE_PROP_STRING("tls-authz", MigrationState, parameters.tls_authz),
+    DEFINE_PROP_BOOL("switchover-hold", MigrationState,
+                     parameters.switchover_hold, false),
 
     /* Migration capabilities */
     DEFINE_PROP_MIG_CAP("x-xbzrle", MIGRATION_CAPABILITY_XBZRLE),
@@ -883,6 +885,8 @@  MigrationParameters *qmp_query_migrate_parameters(Error **errp)
     params->announce_rounds = s->parameters.announce_rounds;
     params->has_announce_step = true;
     params->announce_step = s->parameters.announce_step;
+    params->has_switchover_hold = true;
+    params->switchover_hold = s->parameters.switchover_hold;
 
     if (s->parameters.has_block_bitmap_mapping) {
         params->has_block_bitmap_mapping = true;
@@ -923,6 +927,7 @@  void migrate_params_init(MigrationParameters *params)
     params->has_announce_max = true;
     params->has_announce_rounds = true;
     params->has_announce_step = true;
+    params->has_switchover_hold = true;
 }
 
 /*
@@ -1177,6 +1182,9 @@  static void migrate_params_test_apply(MigrateSetParameters *params,
     if (params->has_announce_step) {
         dest->announce_step = params->announce_step;
     }
+    if (params->has_switchover_hold) {
+        dest->switchover_hold = params->switchover_hold;
+    }
 
     if (params->has_block_bitmap_mapping) {
         dest->has_block_bitmap_mapping = true;
@@ -1290,6 +1298,15 @@  static void migrate_params_apply(MigrateSetParameters *params, Error **errp)
     if (params->has_announce_step) {
         s->parameters.announce_step = params->announce_step;
     }
+    if (params->has_switchover_hold) {
+        bool old = s->parameters.switchover_hold;
+        bool new = params->switchover_hold;
+
+        s->parameters.switchover_hold = params->switchover_hold;
+        if (old && !new) {
+            qemu_sem_post(&s->switchover_hold_sem);
+        }
+    }
 
     if (params->has_block_bitmap_mapping) {
         qapi_free_BitmapMigrationNodeAliasList(