Message ID | 1708622920-68779-11-git-send-email-steven.sistare@oracle.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | allow cpr-reboot for vfio | expand |
On Thu, Feb 22, 2024 at 09:28:36AM -0800, Steve Sistare wrote: > When migration for cpr is initiated, stop the vm and set state > RUN_STATE_FINISH_MIGRATE before ram is saved. This eliminates the > possibility of ram and device state being out of sync, and guarantees > that a guest in the suspended state remains suspended, because qmp_cont > rejects a cont command in the RUN_STATE_FINISH_MIGRATE state. > > Signed-off-by: Steve Sistare <steven.sistare@oracle.com> Reviewed-by: Peter Xu <peterx@redhat.com> cpr-reboot mode keeps changing behavior. Could we declare it "experimental" until it's solid? Maybe a patch to document this? Normally IMHO we shouldn't merge a feature if it's not complete, however cpr-reboot is so special that the mode itself is already merged in 8.2 before I started to merge patches, and it keeps changing things. I don't know what else we can do here besides declaring it experimental and not declare it a stable feature.
On 2/25/2024 9:08 PM, Peter Xu wrote: > On Thu, Feb 22, 2024 at 09:28:36AM -0800, Steve Sistare wrote: >> When migration for cpr is initiated, stop the vm and set state >> RUN_STATE_FINISH_MIGRATE before ram is saved. This eliminates the >> possibility of ram and device state being out of sync, and guarantees >> that a guest in the suspended state remains suspended, because qmp_cont >> rejects a cont command in the RUN_STATE_FINISH_MIGRATE state. >> >> Signed-off-by: Steve Sistare <steven.sistare@oracle.com> > > Reviewed-by: Peter Xu <peterx@redhat.com> > > cpr-reboot mode keeps changing behavior. > > Could we declare it "experimental" until it's solid? Maybe a patch to > document this? > > Normally IMHO we shouldn't merge a feature if it's not complete, however > cpr-reboot is so special that the mode itself is already merged in 8.2 > before I started to merge patches, and it keeps changing things. I don't > know what else we can do here besides declaring it experimental and not > declare it a stable feature. Hi Peter, the planned/committed functionality for cpr-reboot changed only once, in: migration: stop vm for cpr Suspension to support vfio is an enhancement which adds to the basic functionality, it does not change it. This was planned all along, but submitted as a separate series to manage complexity, as I outlined in my qemu community presentation, which I emailed you at the time. Other "changes" that arose during review were just clarifications and explanations. So, I don't think cpr-reboot deserves to be condemned to experimental limbo. - Steve
On Thu, Feb 29, 2024 at 10:21:14AM -0500, Steven Sistare wrote: > On 2/25/2024 9:08 PM, Peter Xu wrote: > > On Thu, Feb 22, 2024 at 09:28:36AM -0800, Steve Sistare wrote: > >> When migration for cpr is initiated, stop the vm and set state > >> RUN_STATE_FINISH_MIGRATE before ram is saved. This eliminates the > >> possibility of ram and device state being out of sync, and guarantees > >> that a guest in the suspended state remains suspended, because qmp_cont > >> rejects a cont command in the RUN_STATE_FINISH_MIGRATE state. > >> > >> Signed-off-by: Steve Sistare <steven.sistare@oracle.com> > > > > Reviewed-by: Peter Xu <peterx@redhat.com> > > > > cpr-reboot mode keeps changing behavior. > > > > Could we declare it "experimental" until it's solid? Maybe a patch to > > document this? > > > > Normally IMHO we shouldn't merge a feature if it's not complete, however > > cpr-reboot is so special that the mode itself is already merged in 8.2 > > before I started to merge patches, and it keeps changing things. I don't > > know what else we can do here besides declaring it experimental and not > > declare it a stable feature. > > Hi Peter, the planned/committed functionality for cpr-reboot changed only once, in: > migration: stop vm for cpr > > Suspension to support vfio is an enhancement which adds to the basic functionality, > it does not change it. This was planned all along, but submitted as a separate If VFIO used to migrate without suspension and now it won't, it's a behavior change? > series to manage complexity, as I outlined in my qemu community presentation, > which I emailed you at the time. > > Other "changes" that arose during review were just clarifications and explanations. > > So, I don't think cpr-reboot deserves to be condemned to experimental limbo. IMHO it's not about a feature being condemned, it's about a kindful heads-up to the users that one needs to take risk on using it until it becomes stable, it also makes developers easier because of no limitation on behavior change. If all the changes are landing, then there's no need for such a patch. If so, please propose the planned complete document patch. I had a feeling that cpr is still not fully understood by even many developers on the list. It'll be great that such document will contain all the details one needs to know on the feature, etc. meaning of the name cpr-reboot (what is "cpr"), when to use it, how to use it, how it differs from "normal" mode (etc. lifted limitations on some devices that used to block migration?), what is enforced (vm stop, suspension, etc.) and what is optionally offered (VFIO, shared-mem, etc.), the expected behaviors, etc. When send it, please copy relevant people (Alex & Cedric for VFIO, while Markus could also be a good candidate considering the QMP involvement). Thanks a lot,
On 3/1/24 02:28, Peter Xu wrote: > On Thu, Feb 29, 2024 at 10:21:14AM -0500, Steven Sistare wrote: >> On 2/25/2024 9:08 PM, Peter Xu wrote: >>> On Thu, Feb 22, 2024 at 09:28:36AM -0800, Steve Sistare wrote: >>>> When migration for cpr is initiated, stop the vm and set state >>>> RUN_STATE_FINISH_MIGRATE before ram is saved. This eliminates the >>>> possibility of ram and device state being out of sync, and guarantees >>>> that a guest in the suspended state remains suspended, because qmp_cont >>>> rejects a cont command in the RUN_STATE_FINISH_MIGRATE state. >>>> >>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com> >>> >>> Reviewed-by: Peter Xu <peterx@redhat.com> >>> >>> cpr-reboot mode keeps changing behavior. >>> >>> Could we declare it "experimental" until it's solid? Maybe a patch to >>> document this? >>> >>> Normally IMHO we shouldn't merge a feature if it's not complete, however >>> cpr-reboot is so special that the mode itself is already merged in 8.2 >>> before I started to merge patches, and it keeps changing things. I don't >>> know what else we can do here besides declaring it experimental and not >>> declare it a stable feature. >> >> Hi Peter, the planned/committed functionality for cpr-reboot changed only once, in: >> migration: stop vm for cpr >> >> Suspension to support vfio is an enhancement which adds to the basic functionality, >> it does not change it. This was planned all along, but submitted as a separate > > If VFIO used to migrate without suspension and now it won't, it's a > behavior change? > >> series to manage complexity, as I outlined in my qemu community presentation, >> which I emailed you at the time. >> >> Other "changes" that arose during review were just clarifications and explanations. >> >> So, I don't think cpr-reboot deserves to be condemned to experimental limbo. > > IMHO it's not about a feature being condemned, it's about a kindful > heads-up to the users that one needs to take risk on using it until it > becomes stable, it also makes developers easier because of no limitation on > behavior change. If all the changes are landing, then there's no need for > such a patch. > > If so, please propose the planned complete document patch. I had a feeling > that cpr is still not fully understood by even many developers on the list. > It'll be great that such document will contain all the details one needs to > know on the feature, etc. meaning of the name cpr-reboot (what is "cpr"), > when to use it, how to use it, how it differs from "normal" mode > (etc. lifted limitations on some devices that used to block migration?), > what is enforced (vm stop, suspension, etc.) and what is optionally offered > (VFIO, shared-mem, etc.), the expected behaviors, etc. > > When send it, please copy relevant people (Alex & Cedric for VFIO, while > Markus could also be a good candidate considering the QMP involvement). I second that. If we could have a file under docs/, it would be great. Thanks, C.
On 2/29/2024 8:28 PM, Peter Xu wrote: > On Thu, Feb 29, 2024 at 10:21:14AM -0500, Steven Sistare wrote: >> On 2/25/2024 9:08 PM, Peter Xu wrote: >>> On Thu, Feb 22, 2024 at 09:28:36AM -0800, Steve Sistare wrote: >>>> When migration for cpr is initiated, stop the vm and set state >>>> RUN_STATE_FINISH_MIGRATE before ram is saved. This eliminates the >>>> possibility of ram and device state being out of sync, and guarantees >>>> that a guest in the suspended state remains suspended, because qmp_cont >>>> rejects a cont command in the RUN_STATE_FINISH_MIGRATE state. >>>> >>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com> >>> >>> Reviewed-by: Peter Xu <peterx@redhat.com> >>> >>> cpr-reboot mode keeps changing behavior. >>> >>> Could we declare it "experimental" until it's solid? Maybe a patch to >>> document this? >>> >>> Normally IMHO we shouldn't merge a feature if it's not complete, however >>> cpr-reboot is so special that the mode itself is already merged in 8.2 >>> before I started to merge patches, and it keeps changing things. I don't >>> know what else we can do here besides declaring it experimental and not >>> declare it a stable feature. >> >> Hi Peter, the planned/committed functionality for cpr-reboot changed only once, in: >> migration: stop vm for cpr >> >> Suspension to support vfio is an enhancement which adds to the basic functionality, >> it does not change it. This was planned all along, but submitted as a separate > > If VFIO used to migrate without suspension and now it won't, it's a > behavior change? VFIO could not cpr-reboot migrate without suspension. The existing vfio migration blockers applied to all modes: Error: 0000:3a:10.0: VFIO migration is not supported in kernel Now, with suspension, it will. An addition, not a change. >> series to manage complexity, as I outlined in my qemu community presentation, >> which I emailed you at the time. >> >> Other "changes" that arose during review were just clarifications and explanations. >> >> So, I don't think cpr-reboot deserves to be condemned to experimental limbo. > > IMHO it's not about a feature being condemned, it's about a kindful > heads-up to the users that one needs to take risk on using it until it > becomes stable, it also makes developers easier because of no limitation on > behavior change. If all the changes are landing, then there's no need for > such a patch. > > If so, please propose the planned complete document patch. I had a feeling > that cpr is still not fully understood by even many developers on the list. > It'll be great that such document will contain all the details one needs to > know on the feature, etc. meaning of the name cpr-reboot (what is "cpr"), > when to use it, how to use it, how it differs from "normal" mode > (etc. lifted limitations on some devices that used to block migration?), > what is enforced (vm stop, suspension, etc.) and what is optionally offered > (VFIO, shared-mem, etc.), the expected behaviors, etc. > > When send it, please copy relevant people (Alex & Cedric for VFIO, while > Markus could also be a good candidate considering the QMP involvement). Submitted. - Steve
On 3/13/24 15:18, Steven Sistare wrote: > On 2/29/2024 8:28 PM, Peter Xu wrote: >> On Thu, Feb 29, 2024 at 10:21:14AM -0500, Steven Sistare wrote: >>> On 2/25/2024 9:08 PM, Peter Xu wrote: >>>> On Thu, Feb 22, 2024 at 09:28:36AM -0800, Steve Sistare wrote: >>>>> When migration for cpr is initiated, stop the vm and set state >>>>> RUN_STATE_FINISH_MIGRATE before ram is saved. This eliminates the >>>>> possibility of ram and device state being out of sync, and guarantees >>>>> that a guest in the suspended state remains suspended, because qmp_cont >>>>> rejects a cont command in the RUN_STATE_FINISH_MIGRATE state. >>>>> >>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com> >>>> >>>> Reviewed-by: Peter Xu <peterx@redhat.com> >>>> >>>> cpr-reboot mode keeps changing behavior. >>>> >>>> Could we declare it "experimental" until it's solid? Maybe a patch to >>>> document this? >>>> >>>> Normally IMHO we shouldn't merge a feature if it's not complete, however >>>> cpr-reboot is so special that the mode itself is already merged in 8.2 >>>> before I started to merge patches, and it keeps changing things. I don't >>>> know what else we can do here besides declaring it experimental and not >>>> declare it a stable feature. >>> >>> Hi Peter, the planned/committed functionality for cpr-reboot changed only once, in: >>> migration: stop vm for cpr >>> >>> Suspension to support vfio is an enhancement which adds to the basic functionality, >>> it does not change it. This was planned all along, but submitted as a separate >> >> If VFIO used to migrate without suspension and now it won't, it's a >> behavior change? > > VFIO could not cpr-reboot migrate without suspension. The existing vfio migration blockers applied to all modes: > Error: 0000:3a:10.0: VFIO migration is not supported in kernel > > Now, with suspension, it will. An addition, not a change. Still, I wonder if we should not have a per-device toggle to block migration for CPR_REBOOT mode. This to maintain the pre-9.0 behavior and to manage possible incompatibilities we haven't thought of yet. A config option to deactivate CPR_REBOOT mode in downstream could be useful too. Thanks, C. > >>> series to manage complexity, as I outlined in my qemu community presentation, >>> which I emailed you at the time. >>> >>> Other "changes" that arose during review were just clarifications and explanations. >>> >>> So, I don't think cpr-reboot deserves to be condemned to experimental limbo. >> >> IMHO it's not about a feature being condemned, it's about a kindful >> heads-up to the users that one needs to take risk on using it until it >> becomes stable, it also makes developers easier because of no limitation on >> behavior change. If all the changes are landing, then there's no need for >> such a patch. >> >> If so, please propose the planned complete document patch. I had a feeling >> that cpr is still not fully understood by even many developers on the list. >> It'll be great that such document will contain all the details one needs to >> know on the feature, etc. meaning of the name cpr-reboot (what is "cpr"), >> when to use it, how to use it, how it differs from "normal" mode >> (etc. lifted limitations on some devices that used to block migration?), >> what is enforced (vm stop, suspension, etc.) and what is optionally offered >> (VFIO, shared-mem, etc.), the expected behaviors, etc. >> >> When send it, please copy relevant people (Alex & Cedric for VFIO, while >> Markus could also be a good candidate considering the QMP involvement). > > Submitted. > > - Steve >
diff --git a/include/migration/misc.h b/include/migration/misc.h index e4933b8..5d1aa59 100644 --- a/include/migration/misc.h +++ b/include/migration/misc.h @@ -60,6 +60,7 @@ void migration_object_init(void); void migration_shutdown(void); bool migration_is_idle(void); bool migration_is_active(MigrationState *); +bool migrate_mode_is_cpr(MigrationState *); typedef enum MigrationEventType { MIG_EVENT_PRECOPY_SETUP, diff --git a/migration/migration.c b/migration/migration.c index 37c836b..90a9094 100644 --- a/migration/migration.c +++ b/migration/migration.c @@ -167,11 +167,19 @@ static gint page_request_addr_cmp(gconstpointer ap, gconstpointer bp) return (a > b) - (a < b); } -int migration_stop_vm(RunState state) +static int migration_stop_vm(MigrationState *s, RunState state) { - int ret = vm_stop_force_state(state); + int ret; + + migration_downtime_start(s); + + s->vm_old_state = runstate_get(); + global_state_store(); + + ret = vm_stop_force_state(state); trace_vmstate_downtime_checkpoint("src-vm-stopped"); + trace_migration_completion_vm_stop(ret); return ret; } @@ -1602,6 +1610,11 @@ bool migration_is_active(MigrationState *s) s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE); } +bool migrate_mode_is_cpr(MigrationState *s) +{ + return s->parameters.mode == MIG_MODE_CPR_REBOOT; +} + int migrate_init(MigrationState *s, Error **errp) { int ret; @@ -2454,10 +2467,7 @@ static int postcopy_start(MigrationState *ms, Error **errp) bql_lock(); trace_postcopy_start_set_run(); - migration_downtime_start(ms); - - global_state_store(); - ret = migration_stop_vm(RUN_STATE_FINISH_MIGRATE); + ret = migration_stop_vm(ms, RUN_STATE_FINISH_MIGRATE); if (ret < 0) { goto fail; } @@ -2652,15 +2662,12 @@ static int migration_completion_precopy(MigrationState *s, int ret; bql_lock(); - migration_downtime_start(s); - s->vm_old_state = runstate_get(); - global_state_store(); - - ret = migration_stop_vm(RUN_STATE_FINISH_MIGRATE); - trace_migration_completion_vm_stop(ret); - if (ret < 0) { - goto out_unlock; + if (!migrate_mode_is_cpr(s)) { + ret = migration_stop_vm(s, RUN_STATE_FINISH_MIGRATE); + if (ret < 0) { + goto out_unlock; + } } ret = migration_maybe_pause(s, current_active_state, @@ -3500,15 +3507,10 @@ static void *bg_migration_thread(void *opaque) s->setup_time = qemu_clock_get_ms(QEMU_CLOCK_HOST) - setup_start; trace_migration_thread_setup_complete(); - migration_downtime_start(s); bql_lock(); - s->vm_old_state = runstate_get(); - - global_state_store(); - /* Forcibly stop VM before saving state of vCPUs and devices */ - if (migration_stop_vm(RUN_STATE_PAUSED)) { + if (migration_stop_vm(s, RUN_STATE_PAUSED)) { goto fail; } /* @@ -3584,6 +3586,7 @@ void migrate_fd_connect(MigrationState *s, Error *error_in) Error *local_err = NULL; uint64_t rate_limit; bool resume = s->state == MIGRATION_STATUS_POSTCOPY_PAUSED; + int ret; /* * If there's a previous error, free it and prepare for another one. @@ -3655,6 +3658,14 @@ void migrate_fd_connect(MigrationState *s, Error *error_in) return; } + if (migrate_mode_is_cpr(s)) { + ret = migration_stop_vm(s, RUN_STATE_FINISH_MIGRATE); + if (ret < 0) { + error_setg(&local_err, "migration_stop_vm failed, error %d", -ret); + goto fail; + } + } + if (migrate_background_snapshot()) { qemu_thread_create(&s->thread, "bg_snapshot", bg_migration_thread, s, QEMU_THREAD_JOINABLE); diff --git a/migration/migration.h b/migration/migration.h index aef8afb..65c0b61 100644 --- a/migration/migration.h +++ b/migration/migration.h @@ -541,6 +541,4 @@ int migration_rp_wait(MigrationState *s); */ void migration_rp_kick(MigrationState *s); -int migration_stop_vm(RunState state); - #endif
When migration for cpr is initiated, stop the vm and set state RUN_STATE_FINISH_MIGRATE before ram is saved. This eliminates the possibility of ram and device state being out of sync, and guarantees that a guest in the suspended state remains suspended, because qmp_cont rejects a cont command in the RUN_STATE_FINISH_MIGRATE state. Signed-off-by: Steve Sistare <steven.sistare@oracle.com> --- include/migration/misc.h | 1 + migration/migration.c | 51 +++++++++++++++++++++++++++++------------------- migration/migration.h | 2 -- 3 files changed, 32 insertions(+), 22 deletions(-)