diff mbox series

failover: allow to pause the VM during the migration

Message ID 20210930170926.1298118-1-lvivier@redhat.com (mailing list archive)
State New, archived
Headers show
Series failover: allow to pause the VM during the migration | expand

Commit Message

Laurent Vivier Sept. 30, 2021, 5:09 p.m. UTC
If we want to save a snapshot of a VM to a file, we used to follow the
following steps:

1- stop the VM:
   (qemu) stop

2- migrate the VM to a file:
   (qemu) migrate "exec:cat > snapshot"

3- resume the VM:
   (qemu) cont

After that we can restore the snapshot with:
  qemu-system-x86_64 ... -incoming "exec:cat snapshot"
  (qemu) cont

But when failover is configured, it doesn't work anymore.

As the failover needs to ask the guest OS to unplug the card
the machine cannot be paused.

This patch introduces a new migration parameter, "pause-vm", that
asks the migration to pause the VM during the migration startup
phase after the the card is unplugged.

Once the migration is done, we only need to resume the VM with
"cont" and the card is plugged back:

1- set the parameter:
   (qemu) migrate_set_parameter pause-vm on

2- migrate the VM to a file:
   (qemu) migrate "exec:cat > snapshot"

   The primary failover card (VFIO) is unplugged and the VM is paused.

3- resume the VM:
   (qemu) cont

   The VM restarts and the primary failover card is plugged back

The VM state sent in the migration stream is "paused", it means
when the snapshot is loaded or if the stream is sent to a destination
QEMU, the VM needs to be resumed manually.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
---
 qapi/migration.json            | 20 +++++++++++++++---
 include/hw/virtio/virtio-net.h |  1 +
 hw/net/virtio-net.c            | 33 ++++++++++++++++++++++++++++++
 migration/migration.c          | 37 +++++++++++++++++++++++++++++++++-
 monitor/hmp-cmds.c             |  8 ++++++++
 5 files changed, 95 insertions(+), 4 deletions(-)

Comments

Laine Stump Sept. 30, 2021, 8:17 p.m. UTC | #1
On 9/30/21 1:09 PM, Laurent Vivier wrote:
> If we want to save a snapshot of a VM to a file, we used to follow the
> following steps:
> 
> 1- stop the VM:
>     (qemu) stop
> 
> 2- migrate the VM to a file:
>     (qemu) migrate "exec:cat > snapshot"
> 
> 3- resume the VM:
>     (qemu) cont
> 
> After that we can restore the snapshot with:
>    qemu-system-x86_64 ... -incoming "exec:cat snapshot"
>    (qemu) cont

This is the basics of what libvirt does for a snapshot, and steps 1+2 
are what it does for a "managedsave" (where it saves the snapshot to 
disk and then terminates the qemu process, for later re-animation).

In those cases, it seems like this new parameter could work for us - 
instead of explicitly pausing the guest prior to migrating it to disk, 
we would set this new parameter to on, then directly migrate-to-disk 
(relying on qemu to do the pause). Care will need to be taken to assure 
that error recovery behaves the same though.

There are a couple of cases when libvirt apparently *doesn't* pause the 
guest during the migrate-to-disk, both having to do with saving a 
coredump of the guest. Since I really have no idea of how 
common/important that is (or even if my assessment of the code is 
correct), I'm Cc'ing this patch to libvir-list to make sure it catches 
the attention of someone who knows the answers and implications.


> But when failover is configured, it doesn't work anymore.
> 
> As the failover needs to ask the guest OS to unplug the card
> the machine cannot be paused.
> 
> This patch introduces a new migration parameter, "pause-vm", that
> asks the migration to pause the VM during the migration startup
> phase after the the card is unplugged.
> 
> Once the migration is done, we only need to resume the VM with
> "cont" and the card is plugged back:
> 
> 1- set the parameter:
>     (qemu) migrate_set_parameter pause-vm on
> 
> 2- migrate the VM to a file:
>     (qemu) migrate "exec:cat > snapshot"
> 
>     The primary failover card (VFIO) is unplugged and the VM is paused.
> 
> 3- resume the VM:
>     (qemu) cont
> 
>     The VM restarts and the primary failover card is plugged back
> 
> The VM state sent in the migration stream is "paused", it means
> when the snapshot is loaded or if the stream is sent to a destination
> QEMU, the VM needs to be resumed manually.
> 
> Signed-off-by: Laurent Vivier <lvivier@redhat.com>
> ---
>   qapi/migration.json            | 20 +++++++++++++++---
>   include/hw/virtio/virtio-net.h |  1 +
>   hw/net/virtio-net.c            | 33 ++++++++++++++++++++++++++++++
>   migration/migration.c          | 37 +++++++++++++++++++++++++++++++++-
>   monitor/hmp-cmds.c             |  8 ++++++++
>   5 files changed, 95 insertions(+), 4 deletions(-)
> 
> diff --git a/qapi/migration.json b/qapi/migration.json
> index 88f07baedd06..86284d96ad2a 100644
> --- a/qapi/migration.json
> +++ b/qapi/migration.json
> @@ -743,6 +743,10 @@
>   #                        block device name if there is one, and to their node name
>   #                        otherwise. (Since 5.2)
>   #
> +# @pause-vm:           Pause the virtual machine before doing the migration.
> +#                      This allows QEMU to unplug a card before doing the
> +#                      migration as it depends on the guest kernel. (since 6.2)
> +#
>   # Since: 2.4
>   ##
>   { 'enum': 'MigrationParameter',
> @@ -758,7 +762,7 @@
>              'xbzrle-cache-size', 'max-postcopy-bandwidth',
>              'max-cpu-throttle', 'multifd-compression',
>              'multifd-zlib-level' ,'multifd-zstd-level',
> -           'block-bitmap-mapping' ] }
> +           'block-bitmap-mapping', 'pause-vm' ] }
>   
>   ##
>   # @MigrateSetParameters:
> @@ -903,6 +907,10 @@
>   #                        block device name if there is one, and to their node name
>   #                        otherwise. (Since 5.2)
>   #
> +# @pause-vm:           Pause the virtual machine before doing the migration.
> +#                      This allows QEMU to unplug a card before doing the
> +#                      migration as it depends on the guest kernel. (since 6.2)
> +#
>   # Since: 2.4
>   ##
>   # TODO either fuse back into MigrationParameters, or make
> @@ -934,7 +942,8 @@
>               '*multifd-compression': 'MultiFDCompression',
>               '*multifd-zlib-level': 'uint8',
>               '*multifd-zstd-level': 'uint8',
> -            '*block-bitmap-mapping': [ 'BitmapMigrationNodeAlias' ] } }
> +            '*block-bitmap-mapping': [ 'BitmapMigrationNodeAlias' ],
> +            '*pause-vm': 'bool' } }
>   
>   ##
>   # @migrate-set-parameters:
> @@ -1099,6 +1108,10 @@
>   #                        block device name if there is one, and to their node name
>   #                        otherwise. (Since 5.2)
>   #
> +# @pause-vm:           Pause the virtual machine before doing the migration.
> +#                      This allows QEMU to unplug a card before doing the
> +#                      migration as it depends on the guest kernel. (since 6.2)
> +#
>   # Since: 2.4
>   ##
>   { 'struct': 'MigrationParameters',
> @@ -1128,7 +1141,8 @@
>               '*multifd-compression': 'MultiFDCompression',
>               '*multifd-zlib-level': 'uint8',
>               '*multifd-zstd-level': 'uint8',
> -            '*block-bitmap-mapping': [ 'BitmapMigrationNodeAlias' ] } }
> +            '*block-bitmap-mapping': [ 'BitmapMigrationNodeAlias' ],
> +            '*pause-vm': 'bool' } }
>   
>   ##
>   # @query-migrate-parameters:
> diff --git a/include/hw/virtio/virtio-net.h b/include/hw/virtio/virtio-net.h
> index 824a69c23f06..a6c186e28b45 100644
> --- a/include/hw/virtio/virtio-net.h
> +++ b/include/hw/virtio/virtio-net.h
> @@ -210,6 +210,7 @@ struct VirtIONet {
>       bool failover;
>       DeviceListener primary_listener;
>       Notifier migration_state;
> +    VMChangeStateEntry *vm_state;
>       VirtioNetRssData rss_data;
>       struct NetRxPkt *rx_pkt;
>       struct EBPFRSSContext ebpf_rss;
> diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
> index f205331dcf8c..c83364eac47b 100644
> --- a/hw/net/virtio-net.c
> +++ b/hw/net/virtio-net.c
> @@ -39,6 +39,7 @@
>   #include "migration/misc.h"
>   #include "standard-headers/linux/ethtool.h"
>   #include "sysemu/sysemu.h"
> +#include "sysemu/runstate.h"
>   #include "trace.h"
>   #include "monitor/qdev.h"
>   #include "hw/pci/pci.h"
> @@ -3303,6 +3304,35 @@ static void virtio_net_migration_state_notifier(Notifier *notifier, void *data)
>       virtio_net_handle_migration_primary(n, s);
>   }
>   
> +static void virtio_net_failover_restart_cb(void *opaque, bool running,
> +                                           RunState state)
> +{
> +    DeviceState *dev;
> +    VirtIONet *n = opaque;
> +    Error *err = NULL;
> +    PCIDevice *pdev;
> +
> +    if (!running) {
> +        return;
> +    }
> +
> +    dev = failover_find_primary_device(n);
> +    if (!dev) {
> +        return;
> +    }
> +
> +    pdev = PCI_DEVICE(dev);
> +    if (!pdev->partially_hotplugged) {
> +        return;
> +    }
> +
> +    if (!failover_replug_primary(n, dev, &err)) {
> +        if (err) {
> +            error_report_err(err);
> +        }
> +    }
> +}
> +
>   static bool failover_hide_primary_device(DeviceListener *listener,
>                                            QemuOpts *device_opts)
>   {
> @@ -3360,6 +3390,8 @@ static void virtio_net_device_realize(DeviceState *dev, Error **errp)
>           device_listener_register(&n->primary_listener);
>           n->migration_state.notify = virtio_net_migration_state_notifier;
>           add_migration_state_change_notifier(&n->migration_state);
> +        n->vm_state = qemu_add_vm_change_state_handler(
> +                                             virtio_net_failover_restart_cb, n);
>           n->host_features |= (1ULL << VIRTIO_NET_F_STANDBY);
>       }
>   
> @@ -3508,6 +3540,7 @@ static void virtio_net_device_unrealize(DeviceState *dev)
>       if (n->failover) {
>           device_listener_unregister(&n->primary_listener);
>           remove_migration_state_change_notifier(&n->migration_state);
> +        qemu_del_vm_change_state_handler(n->vm_state);
>       }
>   
>       max_queues = n->multiqueue ? n->max_queues : 1;
> diff --git a/migration/migration.c b/migration/migration.c
> index bb909781b7f5..9c96986d4abf 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -901,6 +901,9 @@ MigrationParameters *qmp_query_migrate_parameters(Error **errp)
>                          s->parameters.block_bitmap_mapping);
>       }
>   
> +    params->has_pause_vm = true;
> +    params->pause_vm = s->parameters.pause_vm;
> +
>       return params;
>   }
>   
> @@ -1549,6 +1552,11 @@ static void migrate_params_test_apply(MigrateSetParameters *params,
>           dest->has_block_bitmap_mapping = true;
>           dest->block_bitmap_mapping = params->block_bitmap_mapping;
>       }
> +
> +    if (params->has_pause_vm) {
> +        dest->has_pause_vm = true;
> +        dest->pause_vm = params->pause_vm;
> +    }
>   }
>   
>   static void migrate_params_apply(MigrateSetParameters *params, Error **errp)
> @@ -1671,6 +1679,10 @@ static void migrate_params_apply(MigrateSetParameters *params, Error **errp)
>               QAPI_CLONE(BitmapMigrationNodeAliasList,
>                          params->block_bitmap_mapping);
>       }
> +
> +    if (params->has_pause_vm) {
> +        s->parameters.pause_vm = params->pause_vm;
> +    }
>   }
>   
>   void qmp_migrate_set_parameters(MigrateSetParameters *params, Error **errp)
> @@ -1718,6 +1730,12 @@ void qmp_migrate_start_postcopy(Error **errp)
>                            " started");
>           return;
>       }
> +
> +    if (s->parameters.pause_vm) {
> +        error_setg(errp, "Postcopy cannot be started if pause-vm is on");
> +        return;
> +    }
> +
>       /*
>        * we don't error if migration has finished since that would be racy
>        * with issuing this command.
> @@ -3734,13 +3752,27 @@ static void qemu_savevm_wait_unplug(MigrationState *s, int old_state,
>                               "failure");
>               }
>           }
> -
>           migrate_set_state(&s->state, MIGRATION_STATUS_WAIT_UNPLUG, new_state);
>       } else {
>           migrate_set_state(&s->state, old_state, new_state);
>       }
>   }
>   
> +/* stop the VM before starting the migration but after device unplug */
> +static void pause_vm_after_unplug(MigrationState *s)
> +{
> +    if (s->parameters.pause_vm) {
> +        qemu_mutex_lock_iothread();
> +        qemu_system_wakeup_request(QEMU_WAKEUP_REASON_OTHER, NULL);
> +        s->vm_was_running = runstate_is_running();
> +        if (vm_stop_force_state(RUN_STATE_PAUSED)) {
> +            migrate_set_state(&s->state, MIGRATION_STATUS_ACTIVE,
> +                                         MIGRATION_STATUS_FAILED);
> +        }
> +        qemu_mutex_unlock_iothread();
> +    }
> +}
> +
>   /*
>    * Master migration thread on the source VM.
>    * It drives the migration and pumps the data down the outgoing channel.
> @@ -3790,6 +3822,8 @@ static void *migration_thread(void *opaque)
>       qemu_savevm_wait_unplug(s, MIGRATION_STATUS_SETUP,
>                                  MIGRATION_STATUS_ACTIVE);
>   
> +    pause_vm_after_unplug(s);
> +
>       s->setup_time = qemu_clock_get_ms(QEMU_CLOCK_HOST) - setup_start;
>   
>       trace_migration_thread_setup_complete();
> @@ -4265,6 +4299,7 @@ static void migration_instance_init(Object *obj)
>       params->has_announce_max = true;
>       params->has_announce_rounds = true;
>       params->has_announce_step = true;
> +    params->has_pause_vm = true;
>   
>       qemu_sem_init(&ms->postcopy_pause_sem, 0);
>       qemu_sem_init(&ms->postcopy_pause_rp_sem, 0);
> diff --git a/monitor/hmp-cmds.c b/monitor/hmp-cmds.c
> index b5e71d9e6f52..71bc8c15335b 100644
> --- a/monitor/hmp-cmds.c
> +++ b/monitor/hmp-cmds.c
> @@ -513,6 +513,10 @@ void hmp_info_migrate_parameters(Monitor *mon, const QDict *qdict)
>                   }
>               }
>           }
> +
> +        monitor_printf(mon, "%s: %s\n",
> +            MigrationParameter_str(MIGRATION_PARAMETER_PAUSE_VM),
> +            params->pause_vm ? "on" : "off");
>       }
>   
>       qapi_free_MigrationParameters(params);
> @@ -1399,6 +1403,10 @@ void hmp_migrate_set_parameter(Monitor *mon, const QDict *qdict)
>           error_setg(&err, "The block-bitmap-mapping parameter can only be set "
>                      "through QMP");
>           break;
> +    case MIGRATION_PARAMETER_PAUSE_VM:
> +        p->has_pause_vm = true;
> +        visit_type_bool(v, param, &p->pause_vm, &err);
> +        break;
>       default:
>           assert(0);
>       }
>
Laurent Vivier Oct. 1, 2021, 6:48 a.m. UTC | #2
On 30/09/2021 22:17, Laine Stump wrote:
> On 9/30/21 1:09 PM, Laurent Vivier wrote:
>> If we want to save a snapshot of a VM to a file, we used to follow the
>> following steps:
>>
>> 1- stop the VM:
>>     (qemu) stop
>>
>> 2- migrate the VM to a file:
>>     (qemu) migrate "exec:cat > snapshot"
>>
>> 3- resume the VM:
>>     (qemu) cont
>>
>> After that we can restore the snapshot with:
>>    qemu-system-x86_64 ... -incoming "exec:cat snapshot"
>>    (qemu) cont
> 
> This is the basics of what libvirt does for a snapshot, and steps 1+2 are what it does for 
> a "managedsave" (where it saves the snapshot to disk and then terminates the qemu process, 
> for later re-animation).
> 
> In those cases, it seems like this new parameter could work for us - instead of explicitly 
> pausing the guest prior to migrating it to disk, we would set this new parameter to on, 
> then directly migrate-to-disk (relying on qemu to do the pause). Care will need to be 
> taken to assure that error recovery behaves the same though.

In case of error, the VM is restarted like it's done for a standard migration. I can 
change that if you need.

An other point is the VM state sent to the migration stream is "paused", it means that 
machine needs to be resumed after the stream is loaded (from the file or on destination in 
the case of a real migration), but it can be also changed to be "running" so the machine 
will be resumed automatically at the end of the file loading (or real migration)

> There are a couple of cases when libvirt apparently *doesn't* pause the guest during the 
> migrate-to-disk, both having to do with saving a coredump of the guest. Since I really 
> have no idea of how common/important that is (or even if my assessment of the code is 
> correct), I'm Cc'ing this patch to libvir-list to make sure it catches the attention of 
> someone who knows the answers and implications.

It's an interesting point I need to test and think about: in case of a coredump I guess 
the machine is crashed and doesn't answer to the unplug request and so the failover unplug 
cannot be done. For the moment the migration will hang until it is canceled. IT can be 
annoying if we want to debug the cause of the crash...

> 
>> But when failover is configured, it doesn't work anymore.
>>
>> As the failover needs to ask the guest OS to unplug the card
>> the machine cannot be paused.
>>
>> This patch introduces a new migration parameter, "pause-vm", that
>> asks the migration to pause the VM during the migration startup
>> phase after the the card is unplugged.
>>
>> Once the migration is done, we only need to resume the VM with
>> "cont" and the card is plugged back:
>>
>> 1- set the parameter:
>>     (qemu) migrate_set_parameter pause-vm on
>>
>> 2- migrate the VM to a file:
>>     (qemu) migrate "exec:cat > snapshot"
>>
>>     The primary failover card (VFIO) is unplugged and the VM is paused.
>>
>> 3- resume the VM:
>>     (qemu) cont
>>
>>     The VM restarts and the primary failover card is plugged back
>>
>> The VM state sent in the migration stream is "paused", it means
>> when the snapshot is loaded or if the stream is sent to a destination
>> QEMU, the VM needs to be resumed manually.
>>
>> Signed-off-by: Laurent Vivier <lvivier@redhat.com>
>> ---
>>   qapi/migration.json            | 20 +++++++++++++++---
>>   include/hw/virtio/virtio-net.h |  1 +
>>   hw/net/virtio-net.c            | 33 ++++++++++++++++++++++++++++++
>>   migration/migration.c          | 37 +++++++++++++++++++++++++++++++++-
>>   monitor/hmp-cmds.c             |  8 ++++++++
>>   5 files changed, 95 insertions(+), 4 deletions(-)
>>
...

Thanks,
Laurent
Peter Krempa Oct. 1, 2021, 7:37 a.m. UTC | #3
On Thu, Sep 30, 2021 at 16:17:44 -0400, Laine Stump wrote:
> On 9/30/21 1:09 PM, Laurent Vivier wrote:
> > If we want to save a snapshot of a VM to a file, we used to follow the
> > following steps:
> > 
> > 1- stop the VM:
> >     (qemu) stop
> > 
> > 2- migrate the VM to a file:
> >     (qemu) migrate "exec:cat > snapshot"
> > 
> > 3- resume the VM:
> >     (qemu) cont
> > 
> > After that we can restore the snapshot with:
> >    qemu-system-x86_64 ... -incoming "exec:cat snapshot"
> >    (qemu) cont
> 
> This is the basics of what libvirt does for a snapshot, and steps 1+2 are
> what it does for a "managedsave" (where it saves the snapshot to disk and
> then terminates the qemu process, for later re-animation).
> 
> In those cases, it seems like this new parameter could work for us - instead
> of explicitly pausing the guest prior to migrating it to disk, we would set
> this new parameter to on, then directly migrate-to-disk (relying on qemu to
> do the pause). Care will need to be taken to assure that error recovery
> behaves the same though.

Yup, see below ...

> There are a couple of cases when libvirt apparently *doesn't* pause the
> guest during the migrate-to-disk, both having to do with saving a coredump
> of the guest. Since I really have no idea of how common/important that is

In most cases when doing a coredump the guest is paused because of an
emulation/guest error.

One example where the VM is not paused is a 'live' snapshot. It wastes
disk space and is not commonly used thoug.

Where it might become interesting is with the 'background-snapshot'
migration flag. Ideally failover will be fixed to properly work with
that one too. In such case we don't want to pause the VM (but we have to
AFAIK, the backround-snapshot migration can't be done as part of
'transacetion' yet, so we need to pause the VM to kick off the migration
(memory-snapshot) and then snapshot the disks).

> (or even if my assessment of the code is correct), I'm Cc'ing this patch to
> libvir-list to make sure it catches the attention of someone who knows the
> answers and implications.

Well cc-ing relevant patches to libvirt is always good. Especially if
we'll need to adapt the code to support the new feature.

> > But when failover is configured, it doesn't work anymore.
> > 
> > As the failover needs to ask the guest OS to unplug the card
> > the machine cannot be paused.
> > 
> > This patch introduces a new migration parameter, "pause-vm", that
> > asks the migration to pause the VM during the migration startup
> > phase after the the card is unplugged.

Is there a time limit to this? If guest interaction is required it might
take unbounded time.

In case of snapshots the expectation from the user is that the state
capture happens "reasonably" immediately after issuing the command. If
we introduce an possibly unbounded wait time, it will need an
re-imagining of the snapshot workflow and the feature will need to be an
opt-in.

> > 
> > Once the migration is done, we only need to resume the VM with
> > "cont" and the card is plugged back:
> > 
> > 1- set the parameter:
> >     (qemu) migrate_set_parameter pause-vm on
> > 
> > 2- migrate the VM to a file:
> >     (qemu) migrate "exec:cat > snapshot"
> > 
> >     The primary failover card (VFIO) is unplugged and the VM is paused.
> > 
> > 3- resume the VM:
> >     (qemu) cont
> > 
> >     The VM restarts and the primary failover card is plugged back
> > 
> > The VM state sent in the migration stream is "paused", it means
> > when the snapshot is loaded or if the stream is sent to a destination
> > QEMU, the VM needs to be resumed manually.

This is not a problem, libvirt is already dealing with this internally
anyways.
Daniel P. Berrangé Oct. 1, 2021, 9:01 a.m. UTC | #4
On Thu, Sep 30, 2021 at 04:17:44PM -0400, Laine Stump wrote:
> On 9/30/21 1:09 PM, Laurent Vivier wrote:
> > If we want to save a snapshot of a VM to a file, we used to follow the
> > following steps:
> > 
> > 1- stop the VM:
> >     (qemu) stop
> > 
> > 2- migrate the VM to a file:
> >     (qemu) migrate "exec:cat > snapshot"
> > 
> > 3- resume the VM:
> >     (qemu) cont
> > 
> > After that we can restore the snapshot with:
> >    qemu-system-x86_64 ... -incoming "exec:cat snapshot"
> >    (qemu) cont
> 
> This is the basics of what libvirt does for a snapshot, and steps 1+2 are
> what it does for a "managedsave" (where it saves the snapshot to disk and
> then terminates the qemu process, for later re-animation).
> 
> In those cases, it seems like this new parameter could work for us - instead
> of explicitly pausing the guest prior to migrating it to disk, we would set
> this new parameter to on, then directly migrate-to-disk (relying on qemu to
> do the pause). Care will need to be taken to assure that error recovery
> behaves the same though.

What libvirt does is actually quite different from this in a signficant
way.  In the HMP example here 'migrate' is a blocking command that does
not return until migration is finished.

Libvirt uses QMP and 'migrate' there is a asynchronous command that merely
launches the migration and returns control to the client.

IOW, what libvirt does is

    stop
    migrate
    while status != failed || completed
       query-migrate
       
       ...also receive any QMP migration events...

       ...possibly modify migration parameters...

    cont

With this pattern I'm not seeing any need for a new migration parameter
for libvirt. The migration status lets us distinguish when QEMU is in
the "waiting for unplug" phase vs the "active" phase. So AFAICT, libvirt
can do:

    migrate
    while status != failed || completed
       query-migrate
       
       ...also receive any QMP migration events..

       if status changed wait-for-unplug to active
         stop

       ...possibly modify migration parameters...

    cont


There is a small window here when the guest CPUs are running
but migration is active.  In most cases for libvirt that is
harmless.  If there are cases where libvirt needs a strong
guarantee to synchonize the 'stop' with some other option,
then the new proposed "pause-vm" parameter as the same problem
as libvirt can't sychronize against that either.


> There are a couple of cases when libvirt apparently *doesn't* pause the
> guest during the migrate-to-disk, both having to do with saving a coredump
> of the guest. Since I really have no idea of how common/important that is
> (or even if my assessment of the code is correct), I'm Cc'ing this patch to
> libvir-list to make sure it catches the attention of someone who knows the
> answers and implications.

IIUC, the problem with unplug only happens when libvirt pauses
the guest. So surely if there are some scenarios where we're not
pausing the guest, there's no problem to solve for those.


Regards,
Daniel
Dr. David Alan Gilbert Oct. 14, 2021, 1:20 p.m. UTC | #5
* Laurent Vivier (lvivier@redhat.com) wrote:
> If we want to save a snapshot of a VM to a file, we used to follow the
> following steps:
> 
> 1- stop the VM:
>    (qemu) stop
> 
> 2- migrate the VM to a file:
>    (qemu) migrate "exec:cat > snapshot"
> 
> 3- resume the VM:
>    (qemu) cont
> 
> After that we can restore the snapshot with:
>   qemu-system-x86_64 ... -incoming "exec:cat snapshot"
>   (qemu) cont
> 
> But when failover is configured, it doesn't work anymore.
> 
> As the failover needs to ask the guest OS to unplug the card
> the machine cannot be paused.
> 
> This patch introduces a new migration parameter, "pause-vm", that
> asks the migration to pause the VM during the migration startup
> phase after the the card is unplugged.
> 
> Once the migration is done, we only need to resume the VM with
> "cont" and the card is plugged back:
> 
> 1- set the parameter:
>    (qemu) migrate_set_parameter pause-vm on
> 
> 2- migrate the VM to a file:
>    (qemu) migrate "exec:cat > snapshot"
> 
>    The primary failover card (VFIO) is unplugged and the VM is paused.
> 
> 3- resume the VM:
>    (qemu) cont
> 
>    The VM restarts and the primary failover card is plugged back
> 
> The VM state sent in the migration stream is "paused", it means
> when the snapshot is loaded or if the stream is sent to a destination
> QEMU, the VM needs to be resumed manually.
> 
> Signed-off-by: Laurent Vivier <lvivier@redhat.com>

A mix of comments:
  a) As a boolean, this should be a MigrationCapability rather than a
parameter
  b) We already have a pause-before-switchover capability for a pause
that happens later in the flow; so this would be something like
pause-after-unplug
  c) Is this really the right answer?  Could this be done a different
way by doing the unplugs using (a possibly new) qmp command - so
that you can explicitly trigger the unplug prior to the migration?

Dave

> ---
>  qapi/migration.json            | 20 +++++++++++++++---
>  include/hw/virtio/virtio-net.h |  1 +
>  hw/net/virtio-net.c            | 33 ++++++++++++++++++++++++++++++
>  migration/migration.c          | 37 +++++++++++++++++++++++++++++++++-
>  monitor/hmp-cmds.c             |  8 ++++++++
>  5 files changed, 95 insertions(+), 4 deletions(-)
> 
> diff --git a/qapi/migration.json b/qapi/migration.json
> index 88f07baedd06..86284d96ad2a 100644
> --- a/qapi/migration.json
> +++ b/qapi/migration.json
> @@ -743,6 +743,10 @@
>  #                        block device name if there is one, and to their node name
>  #                        otherwise. (Since 5.2)
>  #
> +# @pause-vm:           Pause the virtual machine before doing the migration.
> +#                      This allows QEMU to unplug a card before doing the
> +#                      migration as it depends on the guest kernel. (since 6.2)
> +#
>  # Since: 2.4
>  ##
>  { 'enum': 'MigrationParameter',
> @@ -758,7 +762,7 @@
>             'xbzrle-cache-size', 'max-postcopy-bandwidth',
>             'max-cpu-throttle', 'multifd-compression',
>             'multifd-zlib-level' ,'multifd-zstd-level',
> -           'block-bitmap-mapping' ] }
> +           'block-bitmap-mapping', 'pause-vm' ] }
>  
>  ##
>  # @MigrateSetParameters:
> @@ -903,6 +907,10 @@
>  #                        block device name if there is one, and to their node name
>  #                        otherwise. (Since 5.2)
>  #
> +# @pause-vm:           Pause the virtual machine before doing the migration.
> +#                      This allows QEMU to unplug a card before doing the
> +#                      migration as it depends on the guest kernel. (since 6.2)
> +#
>  # Since: 2.4
>  ##
>  # TODO either fuse back into MigrationParameters, or make
> @@ -934,7 +942,8 @@
>              '*multifd-compression': 'MultiFDCompression',
>              '*multifd-zlib-level': 'uint8',
>              '*multifd-zstd-level': 'uint8',
> -            '*block-bitmap-mapping': [ 'BitmapMigrationNodeAlias' ] } }
> +            '*block-bitmap-mapping': [ 'BitmapMigrationNodeAlias' ],
> +            '*pause-vm': 'bool' } }
>  
>  ##
>  # @migrate-set-parameters:
> @@ -1099,6 +1108,10 @@
>  #                        block device name if there is one, and to their node name
>  #                        otherwise. (Since 5.2)
>  #
> +# @pause-vm:           Pause the virtual machine before doing the migration.
> +#                      This allows QEMU to unplug a card before doing the
> +#                      migration as it depends on the guest kernel. (since 6.2)
> +#
>  # Since: 2.4
>  ##
>  { 'struct': 'MigrationParameters',
> @@ -1128,7 +1141,8 @@
>              '*multifd-compression': 'MultiFDCompression',
>              '*multifd-zlib-level': 'uint8',
>              '*multifd-zstd-level': 'uint8',
> -            '*block-bitmap-mapping': [ 'BitmapMigrationNodeAlias' ] } }
> +            '*block-bitmap-mapping': [ 'BitmapMigrationNodeAlias' ],
> +            '*pause-vm': 'bool' } }
>  
>  ##
>  # @query-migrate-parameters:
> diff --git a/include/hw/virtio/virtio-net.h b/include/hw/virtio/virtio-net.h
> index 824a69c23f06..a6c186e28b45 100644
> --- a/include/hw/virtio/virtio-net.h
> +++ b/include/hw/virtio/virtio-net.h
> @@ -210,6 +210,7 @@ struct VirtIONet {
>      bool failover;
>      DeviceListener primary_listener;
>      Notifier migration_state;
> +    VMChangeStateEntry *vm_state;
>      VirtioNetRssData rss_data;
>      struct NetRxPkt *rx_pkt;
>      struct EBPFRSSContext ebpf_rss;
> diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
> index f205331dcf8c..c83364eac47b 100644
> --- a/hw/net/virtio-net.c
> +++ b/hw/net/virtio-net.c
> @@ -39,6 +39,7 @@
>  #include "migration/misc.h"
>  #include "standard-headers/linux/ethtool.h"
>  #include "sysemu/sysemu.h"
> +#include "sysemu/runstate.h"
>  #include "trace.h"
>  #include "monitor/qdev.h"
>  #include "hw/pci/pci.h"
> @@ -3303,6 +3304,35 @@ static void virtio_net_migration_state_notifier(Notifier *notifier, void *data)
>      virtio_net_handle_migration_primary(n, s);
>  }
>  
> +static void virtio_net_failover_restart_cb(void *opaque, bool running,
> +                                           RunState state)
> +{
> +    DeviceState *dev;
> +    VirtIONet *n = opaque;
> +    Error *err = NULL;
> +    PCIDevice *pdev;
> +
> +    if (!running) {
> +        return;
> +    }
> +
> +    dev = failover_find_primary_device(n);
> +    if (!dev) {
> +        return;
> +    }
> +
> +    pdev = PCI_DEVICE(dev);
> +    if (!pdev->partially_hotplugged) {
> +        return;
> +    }
> +
> +    if (!failover_replug_primary(n, dev, &err)) {
> +        if (err) {
> +            error_report_err(err);
> +        }
> +    }
> +}
> +
>  static bool failover_hide_primary_device(DeviceListener *listener,
>                                           QemuOpts *device_opts)
>  {
> @@ -3360,6 +3390,8 @@ static void virtio_net_device_realize(DeviceState *dev, Error **errp)
>          device_listener_register(&n->primary_listener);
>          n->migration_state.notify = virtio_net_migration_state_notifier;
>          add_migration_state_change_notifier(&n->migration_state);
> +        n->vm_state = qemu_add_vm_change_state_handler(
> +                                             virtio_net_failover_restart_cb, n);
>          n->host_features |= (1ULL << VIRTIO_NET_F_STANDBY);
>      }
>  
> @@ -3508,6 +3540,7 @@ static void virtio_net_device_unrealize(DeviceState *dev)
>      if (n->failover) {
>          device_listener_unregister(&n->primary_listener);
>          remove_migration_state_change_notifier(&n->migration_state);
> +        qemu_del_vm_change_state_handler(n->vm_state);
>      }
>  
>      max_queues = n->multiqueue ? n->max_queues : 1;
> diff --git a/migration/migration.c b/migration/migration.c
> index bb909781b7f5..9c96986d4abf 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -901,6 +901,9 @@ MigrationParameters *qmp_query_migrate_parameters(Error **errp)
>                         s->parameters.block_bitmap_mapping);
>      }
>  
> +    params->has_pause_vm = true;
> +    params->pause_vm = s->parameters.pause_vm;
> +
>      return params;
>  }
>  
> @@ -1549,6 +1552,11 @@ static void migrate_params_test_apply(MigrateSetParameters *params,
>          dest->has_block_bitmap_mapping = true;
>          dest->block_bitmap_mapping = params->block_bitmap_mapping;
>      }
> +
> +    if (params->has_pause_vm) {
> +        dest->has_pause_vm = true;
> +        dest->pause_vm = params->pause_vm;
> +    }
>  }
>  
>  static void migrate_params_apply(MigrateSetParameters *params, Error **errp)
> @@ -1671,6 +1679,10 @@ static void migrate_params_apply(MigrateSetParameters *params, Error **errp)
>              QAPI_CLONE(BitmapMigrationNodeAliasList,
>                         params->block_bitmap_mapping);
>      }
> +
> +    if (params->has_pause_vm) {
> +        s->parameters.pause_vm = params->pause_vm;
> +    }
>  }
>  
>  void qmp_migrate_set_parameters(MigrateSetParameters *params, Error **errp)
> @@ -1718,6 +1730,12 @@ void qmp_migrate_start_postcopy(Error **errp)
>                           " started");
>          return;
>      }
> +
> +    if (s->parameters.pause_vm) {
> +        error_setg(errp, "Postcopy cannot be started if pause-vm is on");
> +        return;
> +    }
> +
>      /*
>       * we don't error if migration has finished since that would be racy
>       * with issuing this command.
> @@ -3734,13 +3752,27 @@ static void qemu_savevm_wait_unplug(MigrationState *s, int old_state,
>                              "failure");
>              }
>          }
> -
>          migrate_set_state(&s->state, MIGRATION_STATUS_WAIT_UNPLUG, new_state);
>      } else {
>          migrate_set_state(&s->state, old_state, new_state);
>      }
>  }
>  
> +/* stop the VM before starting the migration but after device unplug */
> +static void pause_vm_after_unplug(MigrationState *s)
> +{
> +    if (s->parameters.pause_vm) {
> +        qemu_mutex_lock_iothread();
> +        qemu_system_wakeup_request(QEMU_WAKEUP_REASON_OTHER, NULL);
> +        s->vm_was_running = runstate_is_running();
> +        if (vm_stop_force_state(RUN_STATE_PAUSED)) {
> +            migrate_set_state(&s->state, MIGRATION_STATUS_ACTIVE,
> +                                         MIGRATION_STATUS_FAILED);
> +        }
> +        qemu_mutex_unlock_iothread();
> +    }
> +}
> +
>  /*
>   * Master migration thread on the source VM.
>   * It drives the migration and pumps the data down the outgoing channel.
> @@ -3790,6 +3822,8 @@ static void *migration_thread(void *opaque)
>      qemu_savevm_wait_unplug(s, MIGRATION_STATUS_SETUP,
>                                 MIGRATION_STATUS_ACTIVE);
>  
> +    pause_vm_after_unplug(s);
> +
>      s->setup_time = qemu_clock_get_ms(QEMU_CLOCK_HOST) - setup_start;
>  
>      trace_migration_thread_setup_complete();
> @@ -4265,6 +4299,7 @@ static void migration_instance_init(Object *obj)
>      params->has_announce_max = true;
>      params->has_announce_rounds = true;
>      params->has_announce_step = true;
> +    params->has_pause_vm = true;
>  
>      qemu_sem_init(&ms->postcopy_pause_sem, 0);
>      qemu_sem_init(&ms->postcopy_pause_rp_sem, 0);
> diff --git a/monitor/hmp-cmds.c b/monitor/hmp-cmds.c
> index b5e71d9e6f52..71bc8c15335b 100644
> --- a/monitor/hmp-cmds.c
> +++ b/monitor/hmp-cmds.c
> @@ -513,6 +513,10 @@ void hmp_info_migrate_parameters(Monitor *mon, const QDict *qdict)
>                  }
>              }
>          }
> +
> +        monitor_printf(mon, "%s: %s\n",
> +            MigrationParameter_str(MIGRATION_PARAMETER_PAUSE_VM),
> +            params->pause_vm ? "on" : "off");
>      }
>  
>      qapi_free_MigrationParameters(params);
> @@ -1399,6 +1403,10 @@ void hmp_migrate_set_parameter(Monitor *mon, const QDict *qdict)
>          error_setg(&err, "The block-bitmap-mapping parameter can only be set "
>                     "through QMP");
>          break;
> +    case MIGRATION_PARAMETER_PAUSE_VM:
> +        p->has_pause_vm = true;
> +        visit_type_bool(v, param, &p->pause_vm, &err);
> +        break;
>      default:
>          assert(0);
>      }
> -- 
> 2.31.1
>
Juan Quintela Oct. 29, 2021, 1:49 p.m. UTC | #6
"Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:
> * Laurent Vivier (lvivier@redhat.com) wrote:
>> If we want to save a snapshot of a VM to a file, we used to follow the
>> following steps:
>> 
>> 1- stop the VM:
>>    (qemu) stop
>> 
>> 2- migrate the VM to a file:
>>    (qemu) migrate "exec:cat > snapshot"
>> 
>> 3- resume the VM:
>>    (qemu) cont
>> 
>> After that we can restore the snapshot with:
>>   qemu-system-x86_64 ... -incoming "exec:cat snapshot"
>>   (qemu) cont
>> 
>> But when failover is configured, it doesn't work anymore.
>> 
>> As the failover needs to ask the guest OS to unplug the card
>> the machine cannot be paused.
>> 
>> This patch introduces a new migration parameter, "pause-vm", that
>> asks the migration to pause the VM during the migration startup
>> phase after the the card is unplugged.
>> 
>> Once the migration is done, we only need to resume the VM with
>> "cont" and the card is plugged back:
>> 
>> 1- set the parameter:
>>    (qemu) migrate_set_parameter pause-vm on
>> 
>> 2- migrate the VM to a file:
>>    (qemu) migrate "exec:cat > snapshot"
>> 
>>    The primary failover card (VFIO) is unplugged and the VM is paused.
>> 
>> 3- resume the VM:
>>    (qemu) cont
>> 
>>    The VM restarts and the primary failover card is plugged back
>> 
>> The VM state sent in the migration stream is "paused", it means
>> when the snapshot is loaded or if the stream is sent to a destination
>> QEMU, the VM needs to be resumed manually.
>> 
>> Signed-off-by: Laurent Vivier <lvivier@redhat.com>
>
> A mix of comments:
>   a) As a boolean, this should be a MigrationCapability rather than a
> parameter
>   b) We already have a pause-before-switchover capability for a pause
> that happens later in the flow; so this would be something like
> pause-after-unplug
>   c) Is this really the right answer?  Could this be done a different
> way by doing the unplugs using (a possibly new) qmp command - so
> that you can explicitly trigger the unplug prior to the migration?

Not if you want the wait to be minimal.
What managedsave wants to do is doing the migration with the guest
stopped.  And wait for it until the last moment.

Doing this is qemu is "relatively" simple.  Doing that on libvirt is
extremely complex, because you basically have to :
- unplug the device
- wait for unplug to finish
- stop the guest
- migrate paused
- (restart the guest)

If you do it in libvirt, you are increasing the time betwee wait for
unplug to finish and stop the guest.  But the biggest problem is what
happens if the migration (or anything else fails).
qemu failover code already knows how to handle the stop/continuation of
the vfio device.  It is what happens on a normal run.  If you do this on
libvirt, it needs to be able to recover for all scenarios, what is much
more complex in my hunble opinion.

Later, Juan.
Juan Quintela Oct. 29, 2021, 1:56 p.m. UTC | #7
Laurent Vivier <lvivier@redhat.com> wrote:
> If we want to save a snapshot of a VM to a file, we used to follow the
> following steps:
>
> 1- stop the VM:
>    (qemu) stop
>
> 2- migrate the VM to a file:
>    (qemu) migrate "exec:cat > snapshot"
>
> 3- resume the VM:
>    (qemu) cont
>
> After that we can restore the snapshot with:
>   qemu-system-x86_64 ... -incoming "exec:cat snapshot"
>   (qemu) cont
>
> But when failover is configured, it doesn't work anymore.
>
> As the failover needs to ask the guest OS to unplug the card
> the machine cannot be paused.
>
> This patch introduces a new migration parameter, "pause-vm", that
> asks the migration to pause the VM during the migration startup
> phase after the the card is unplugged.
>
> Once the migration is done, we only need to resume the VM with
> "cont" and the card is plugged back:
>
> 1- set the parameter:
>    (qemu) migrate_set_parameter pause-vm on
>
> 2- migrate the VM to a file:
>    (qemu) migrate "exec:cat > snapshot"
>
>    The primary failover card (VFIO) is unplugged and the VM is paused.
>
> 3- resume the VM:
>    (qemu) cont
>
>    The VM restarts and the primary failover card is plugged back
>
> The VM state sent in the migration stream is "paused", it means
> when the snapshot is loaded or if the stream is sent to a destination
> QEMU, the VM needs to be resumed manually.
>
> Signed-off-by: Laurent Vivier <lvivier@redhat.com>

I agree with Dave that you should use a capability instead of a
parameter.

Other than that, the code for the new parameter looks ok.
> @@ -3734,13 +3752,27 @@ static void qemu_savevm_wait_unplug(MigrationState *s, int old_state,
>                              "failure");
>              }
>          }
> -
>          migrate_set_state(&s->state, MIGRATION_STATUS_WAIT_UNPLUG, new_state);
>      } else {
>          migrate_set_state(&s->state, old_state, new_state);
>      }
>  }


This change is spurious.

And to make this more generic, I think you can consider changing the
name to pause_during_migration.

Because that is basically what managedsave needs (If I understood Laine
correctly).

Later, Juan.
diff mbox series

Patch

diff --git a/qapi/migration.json b/qapi/migration.json
index 88f07baedd06..86284d96ad2a 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -743,6 +743,10 @@ 
 #                        block device name if there is one, and to their node name
 #                        otherwise. (Since 5.2)
 #
+# @pause-vm:           Pause the virtual machine before doing the migration.
+#                      This allows QEMU to unplug a card before doing the
+#                      migration as it depends on the guest kernel. (since 6.2)
+#
 # Since: 2.4
 ##
 { 'enum': 'MigrationParameter',
@@ -758,7 +762,7 @@ 
            'xbzrle-cache-size', 'max-postcopy-bandwidth',
            'max-cpu-throttle', 'multifd-compression',
            'multifd-zlib-level' ,'multifd-zstd-level',
-           'block-bitmap-mapping' ] }
+           'block-bitmap-mapping', 'pause-vm' ] }
 
 ##
 # @MigrateSetParameters:
@@ -903,6 +907,10 @@ 
 #                        block device name if there is one, and to their node name
 #                        otherwise. (Since 5.2)
 #
+# @pause-vm:           Pause the virtual machine before doing the migration.
+#                      This allows QEMU to unplug a card before doing the
+#                      migration as it depends on the guest kernel. (since 6.2)
+#
 # Since: 2.4
 ##
 # TODO either fuse back into MigrationParameters, or make
@@ -934,7 +942,8 @@ 
             '*multifd-compression': 'MultiFDCompression',
             '*multifd-zlib-level': 'uint8',
             '*multifd-zstd-level': 'uint8',
-            '*block-bitmap-mapping': [ 'BitmapMigrationNodeAlias' ] } }
+            '*block-bitmap-mapping': [ 'BitmapMigrationNodeAlias' ],
+            '*pause-vm': 'bool' } }
 
 ##
 # @migrate-set-parameters:
@@ -1099,6 +1108,10 @@ 
 #                        block device name if there is one, and to their node name
 #                        otherwise. (Since 5.2)
 #
+# @pause-vm:           Pause the virtual machine before doing the migration.
+#                      This allows QEMU to unplug a card before doing the
+#                      migration as it depends on the guest kernel. (since 6.2)
+#
 # Since: 2.4
 ##
 { 'struct': 'MigrationParameters',
@@ -1128,7 +1141,8 @@ 
             '*multifd-compression': 'MultiFDCompression',
             '*multifd-zlib-level': 'uint8',
             '*multifd-zstd-level': 'uint8',
-            '*block-bitmap-mapping': [ 'BitmapMigrationNodeAlias' ] } }
+            '*block-bitmap-mapping': [ 'BitmapMigrationNodeAlias' ],
+            '*pause-vm': 'bool' } }
 
 ##
 # @query-migrate-parameters:
diff --git a/include/hw/virtio/virtio-net.h b/include/hw/virtio/virtio-net.h
index 824a69c23f06..a6c186e28b45 100644
--- a/include/hw/virtio/virtio-net.h
+++ b/include/hw/virtio/virtio-net.h
@@ -210,6 +210,7 @@  struct VirtIONet {
     bool failover;
     DeviceListener primary_listener;
     Notifier migration_state;
+    VMChangeStateEntry *vm_state;
     VirtioNetRssData rss_data;
     struct NetRxPkt *rx_pkt;
     struct EBPFRSSContext ebpf_rss;
diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index f205331dcf8c..c83364eac47b 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -39,6 +39,7 @@ 
 #include "migration/misc.h"
 #include "standard-headers/linux/ethtool.h"
 #include "sysemu/sysemu.h"
+#include "sysemu/runstate.h"
 #include "trace.h"
 #include "monitor/qdev.h"
 #include "hw/pci/pci.h"
@@ -3303,6 +3304,35 @@  static void virtio_net_migration_state_notifier(Notifier *notifier, void *data)
     virtio_net_handle_migration_primary(n, s);
 }
 
+static void virtio_net_failover_restart_cb(void *opaque, bool running,
+                                           RunState state)
+{
+    DeviceState *dev;
+    VirtIONet *n = opaque;
+    Error *err = NULL;
+    PCIDevice *pdev;
+
+    if (!running) {
+        return;
+    }
+
+    dev = failover_find_primary_device(n);
+    if (!dev) {
+        return;
+    }
+
+    pdev = PCI_DEVICE(dev);
+    if (!pdev->partially_hotplugged) {
+        return;
+    }
+
+    if (!failover_replug_primary(n, dev, &err)) {
+        if (err) {
+            error_report_err(err);
+        }
+    }
+}
+
 static bool failover_hide_primary_device(DeviceListener *listener,
                                          QemuOpts *device_opts)
 {
@@ -3360,6 +3390,8 @@  static void virtio_net_device_realize(DeviceState *dev, Error **errp)
         device_listener_register(&n->primary_listener);
         n->migration_state.notify = virtio_net_migration_state_notifier;
         add_migration_state_change_notifier(&n->migration_state);
+        n->vm_state = qemu_add_vm_change_state_handler(
+                                             virtio_net_failover_restart_cb, n);
         n->host_features |= (1ULL << VIRTIO_NET_F_STANDBY);
     }
 
@@ -3508,6 +3540,7 @@  static void virtio_net_device_unrealize(DeviceState *dev)
     if (n->failover) {
         device_listener_unregister(&n->primary_listener);
         remove_migration_state_change_notifier(&n->migration_state);
+        qemu_del_vm_change_state_handler(n->vm_state);
     }
 
     max_queues = n->multiqueue ? n->max_queues : 1;
diff --git a/migration/migration.c b/migration/migration.c
index bb909781b7f5..9c96986d4abf 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -901,6 +901,9 @@  MigrationParameters *qmp_query_migrate_parameters(Error **errp)
                        s->parameters.block_bitmap_mapping);
     }
 
+    params->has_pause_vm = true;
+    params->pause_vm = s->parameters.pause_vm;
+
     return params;
 }
 
@@ -1549,6 +1552,11 @@  static void migrate_params_test_apply(MigrateSetParameters *params,
         dest->has_block_bitmap_mapping = true;
         dest->block_bitmap_mapping = params->block_bitmap_mapping;
     }
+
+    if (params->has_pause_vm) {
+        dest->has_pause_vm = true;
+        dest->pause_vm = params->pause_vm;
+    }
 }
 
 static void migrate_params_apply(MigrateSetParameters *params, Error **errp)
@@ -1671,6 +1679,10 @@  static void migrate_params_apply(MigrateSetParameters *params, Error **errp)
             QAPI_CLONE(BitmapMigrationNodeAliasList,
                        params->block_bitmap_mapping);
     }
+
+    if (params->has_pause_vm) {
+        s->parameters.pause_vm = params->pause_vm;
+    }
 }
 
 void qmp_migrate_set_parameters(MigrateSetParameters *params, Error **errp)
@@ -1718,6 +1730,12 @@  void qmp_migrate_start_postcopy(Error **errp)
                          " started");
         return;
     }
+
+    if (s->parameters.pause_vm) {
+        error_setg(errp, "Postcopy cannot be started if pause-vm is on");
+        return;
+    }
+
     /*
      * we don't error if migration has finished since that would be racy
      * with issuing this command.
@@ -3734,13 +3752,27 @@  static void qemu_savevm_wait_unplug(MigrationState *s, int old_state,
                             "failure");
             }
         }
-
         migrate_set_state(&s->state, MIGRATION_STATUS_WAIT_UNPLUG, new_state);
     } else {
         migrate_set_state(&s->state, old_state, new_state);
     }
 }
 
+/* stop the VM before starting the migration but after device unplug */
+static void pause_vm_after_unplug(MigrationState *s)
+{
+    if (s->parameters.pause_vm) {
+        qemu_mutex_lock_iothread();
+        qemu_system_wakeup_request(QEMU_WAKEUP_REASON_OTHER, NULL);
+        s->vm_was_running = runstate_is_running();
+        if (vm_stop_force_state(RUN_STATE_PAUSED)) {
+            migrate_set_state(&s->state, MIGRATION_STATUS_ACTIVE,
+                                         MIGRATION_STATUS_FAILED);
+        }
+        qemu_mutex_unlock_iothread();
+    }
+}
+
 /*
  * Master migration thread on the source VM.
  * It drives the migration and pumps the data down the outgoing channel.
@@ -3790,6 +3822,8 @@  static void *migration_thread(void *opaque)
     qemu_savevm_wait_unplug(s, MIGRATION_STATUS_SETUP,
                                MIGRATION_STATUS_ACTIVE);
 
+    pause_vm_after_unplug(s);
+
     s->setup_time = qemu_clock_get_ms(QEMU_CLOCK_HOST) - setup_start;
 
     trace_migration_thread_setup_complete();
@@ -4265,6 +4299,7 @@  static void migration_instance_init(Object *obj)
     params->has_announce_max = true;
     params->has_announce_rounds = true;
     params->has_announce_step = true;
+    params->has_pause_vm = true;
 
     qemu_sem_init(&ms->postcopy_pause_sem, 0);
     qemu_sem_init(&ms->postcopy_pause_rp_sem, 0);
diff --git a/monitor/hmp-cmds.c b/monitor/hmp-cmds.c
index b5e71d9e6f52..71bc8c15335b 100644
--- a/monitor/hmp-cmds.c
+++ b/monitor/hmp-cmds.c
@@ -513,6 +513,10 @@  void hmp_info_migrate_parameters(Monitor *mon, const QDict *qdict)
                 }
             }
         }
+
+        monitor_printf(mon, "%s: %s\n",
+            MigrationParameter_str(MIGRATION_PARAMETER_PAUSE_VM),
+            params->pause_vm ? "on" : "off");
     }
 
     qapi_free_MigrationParameters(params);
@@ -1399,6 +1403,10 @@  void hmp_migrate_set_parameter(Monitor *mon, const QDict *qdict)
         error_setg(&err, "The block-bitmap-mapping parameter can only be set "
                    "through QMP");
         break;
+    case MIGRATION_PARAMETER_PAUSE_VM:
+        p->has_pause_vm = true;
+        visit_type_bool(v, param, &p->pause_vm, &err);
+        break;
     default:
         assert(0);
     }