Message ID | 20220512154320.19697-5-avihaih@nvidia.com |
---|---|
State | New, archived |
Series | vfio/migration: Implement VFIO migration protocol v2 |
Avihai Horon <avihaih@nvidia.com> wrote:
> Currently, if IOMMU of a VFIO container doesn't support dirty page
> tracking, migration is blocked completely. This is because a DMA-able
> VFIO device can dirty RAM pages without updating QEMU about it, thus
> breaking the migration.
>
> However, this doesn't mean that migration can't be done at all. If
> migration pre-copy phase is skipped, the VFIO device doesn't have a
> chance to dirty RAM pages that have been migrated already, thus
> eliminating the problem previously mentioned.
>
> Hence, in such case allow migration but skip pre-copy phase.
>
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>

I don't know (TM).
Several issues:
- Patch is ugly as hell (ok, that depends on taste)
- It changes migration_iteration_run() instead of directly
  migration_thread.
- There is already another case where we skip the sending of RAM
  (localhost migration with shared memory)

In migration/ram.c:

static int ram_find_and_save_block(RAMState *rs, bool last_stage)
{
    PageSearchStatus pss;
    int pages = 0;
    bool again, found;

    /* No dirty page as there is zero RAM */
    if (!ram_bytes_total()) {
        return pages;
    }

This is the other place where we _don't_ send any RAM at all.

I don't have a great idea about how to make things clear at a higher
level, I have to think about this.

Later, Juan.

> ---
>  hw/vfio/migration.c   | 9 ++++++++-
>  migration/migration.c | 5 +++++
>  migration/migration.h | 3 +++
>  3 files changed, 16 insertions(+), 1 deletion(-)
>
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 21e8f9d4d4..d4b6653026 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -863,10 +863,17 @@ int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
>      struct vfio_region_info *info = NULL;
>      int ret = -ENOTSUP;
>
> -    if (!vbasedev->enable_migration || !container->dirty_pages_supported) {
> +    if (!vbasedev->enable_migration) {
>          goto add_blocker;
>      }
>
> +    if (!container->dirty_pages_supported) {
> +        warn_report(
> +            "%s: IOMMU of the device's VFIO container doesn't support dirty page tracking, migration pre-copy phase will be skipped",
> +            vbasedev->name);
> +        migrate_get_current()->skip_precopy = true;
> +    }
> +
>      ret = vfio_get_dev_region_info(vbasedev,
>                                     VFIO_REGION_TYPE_MIGRATION_DEPRECATED,
>                                     VFIO_REGION_SUBTYPE_MIGRATION_DEPRECATED,
> diff --git a/migration/migration.c b/migration/migration.c
> index 5a31b23bd6..668343508d 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -3593,6 +3593,11 @@ static MigIterateState migration_iteration_run(MigrationState *s)
>      uint64_t pending_size, pend_pre, pend_compat, pend_post;
>      bool in_postcopy = s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE;
>
> +    if (s->skip_precopy) {
> +        migration_completion(s);
> +        return MIG_ITERATE_BREAK;
> +    }
> +
>      qemu_savevm_state_pending(s->to_dst_file, s->threshold_size, &pend_pre,
>                                &pend_compat, &pend_post);
>      pending_size = pend_pre + pend_compat + pend_post;
> diff --git a/migration/migration.h b/migration/migration.h
> index a863032b71..876713e7e1 100644
> --- a/migration/migration.h
> +++ b/migration/migration.h
> @@ -332,6 +332,9 @@ struct MigrationState {
>       * This save hostname when out-going migration starts
>       */
>      char *hostname;
> +
> +    /* Whether to skip pre-copy phase of migration or not */
> +    bool skip_precopy;
>  };
>
>  void migrate_set_state(int *state, int old_state, int new_state);
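One way to read Juan's "higher level" concern is that both cases (zero shared RAM on a localhost migration and a device without dirty tracking) want the same decision made in a single place. The sketch below is purely illustrative and is not part of QEMU or of the posted patch: the helper name is hypothetical, while ram_bytes_total() is the existing helper quoted above and skip_precopy is the field this patch introduces.

```c
/*
 * Hypothetical sketch only: a single predicate covering both "don't
 * pre-copy RAM" cases mentioned in this thread.  The function name is
 * invented for illustration; ram_bytes_total() and skip_precopy are the
 * names used in the surrounding discussion.
 */
static bool migration_ram_precopy_needed(MigrationState *s)
{
    if (!ram_bytes_total()) {
        /* Localhost migration with shared memory: no RAM to send at all */
        return false;
    }
    if (s->skip_precopy) {
        /* The device cannot report dirty pages; pre-copy can never converge */
        return false;
    }
    return true;
}
```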
On Mon, 16 May 2022 13:22:14 +0200
Juan Quintela <quintela@redhat.com> wrote:

> Avihai Horon <avihaih@nvidia.com> wrote:
> > Currently, if IOMMU of a VFIO container doesn't support dirty page
> > tracking, migration is blocked completely. This is because a DMA-able
> > VFIO device can dirty RAM pages without updating QEMU about it, thus
> > breaking the migration.
> >
> > However, this doesn't mean that migration can't be done at all. If
> > migration pre-copy phase is skipped, the VFIO device doesn't have a
> > chance to dirty RAM pages that have been migrated already, thus
> > eliminating the problem previously mentioned.
> >
> > Hence, in such case allow migration but skip pre-copy phase.
> >
> > Signed-off-by: Avihai Horon <avihaih@nvidia.com>
>
> I don't know (TM).
> Several issues:
> - Patch is ugly as hell (ok, that depends on taste)
> - It changes migration_iteration_run() instead of directly
>   migration_thread.
> - There is already another case where we skip the sending of RAM
>   (localhost migration with shared memory)
>
> In migration/ram.c:
>
> static int ram_find_and_save_block(RAMState *rs, bool last_stage)
> {
>     PageSearchStatus pss;
>     int pages = 0;
>     bool again, found;
>
>     /* No dirty page as there is zero RAM */
>     if (!ram_bytes_total()) {
>         return pages;
>     }
>
> This is the other place where we _don't_ send any RAM at all.
>
> I don't have a great idea about how to make things clear at a higher
> level, I have to think about this.

It seems like if we have devices dictating what type of migrations can
be performed then there probably needs to be a switch to restrict use of
such devices just as we have the -only-migratable switch now to prevent
attaching devices that don't support migration. I'd guess that we need
the switch to opt-in to allowing such devices to maintain
compatibility. There's probably a whole pile of qapi things missing to
expose this to management tools as well. Thanks,

Alex

> > ---
> >  hw/vfio/migration.c   | 9 ++++++++-
> >  migration/migration.c | 5 +++++
> >  migration/migration.h | 3 +++
> >  3 files changed, 16 insertions(+), 1 deletion(-)
> >
> > diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> > index 21e8f9d4d4..d4b6653026 100644
> > --- a/hw/vfio/migration.c
> > +++ b/hw/vfio/migration.c
> > @@ -863,10 +863,17 @@ int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
> >      struct vfio_region_info *info = NULL;
> >      int ret = -ENOTSUP;
> >
> > -    if (!vbasedev->enable_migration || !container->dirty_pages_supported) {
> > +    if (!vbasedev->enable_migration) {
> >          goto add_blocker;
> >      }
> >
> > +    if (!container->dirty_pages_supported) {
> > +        warn_report(
> > +            "%s: IOMMU of the device's VFIO container doesn't support dirty page tracking, migration pre-copy phase will be skipped",
> > +            vbasedev->name);
> > +        migrate_get_current()->skip_precopy = true;
> > +    }
> > +
> >      ret = vfio_get_dev_region_info(vbasedev,
> >                                     VFIO_REGION_TYPE_MIGRATION_DEPRECATED,
> >                                     VFIO_REGION_SUBTYPE_MIGRATION_DEPRECATED,
> > diff --git a/migration/migration.c b/migration/migration.c
> > index 5a31b23bd6..668343508d 100644
> > --- a/migration/migration.c
> > +++ b/migration/migration.c
> > @@ -3593,6 +3593,11 @@ static MigIterateState migration_iteration_run(MigrationState *s)
> >      uint64_t pending_size, pend_pre, pend_compat, pend_post;
> >      bool in_postcopy = s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE;
> >
> > +    if (s->skip_precopy) {
> > +        migration_completion(s);
> > +        return MIG_ITERATE_BREAK;
> > +    }
> > +
> >      qemu_savevm_state_pending(s->to_dst_file, s->threshold_size, &pend_pre,
> >                                &pend_compat, &pend_post);
> >      pending_size = pend_pre + pend_compat + pend_post;
> > diff --git a/migration/migration.h b/migration/migration.h
> > index a863032b71..876713e7e1 100644
> > --- a/migration/migration.h
> > +++ b/migration/migration.h
> > @@ -332,6 +332,9 @@ struct MigrationState {
> >       * This save hostname when out-going migration starts
> >       */
> >      char *hostname;
> > +
> > +    /* Whether to skip pre-copy phase of migration or not */
> > +    bool skip_precopy;
> >  };
> >
> >  void migrate_set_state(int *state, int old_state, int new_state);
On Mon, May 16, 2022 at 02:22:00PM -0600, Alex Williamson wrote:
> On Mon, 16 May 2022 13:22:14 +0200
> Juan Quintela <quintela@redhat.com> wrote:
>
> > Avihai Horon <avihaih@nvidia.com> wrote:
> > > Currently, if IOMMU of a VFIO container doesn't support dirty page
> > > tracking, migration is blocked completely. This is because a DMA-able
> > > VFIO device can dirty RAM pages without updating QEMU about it, thus
> > > breaking the migration.
> > >
> > > However, this doesn't mean that migration can't be done at all. If
> > > migration pre-copy phase is skipped, the VFIO device doesn't have a
> > > chance to dirty RAM pages that have been migrated already, thus
> > > eliminating the problem previously mentioned.
> > >
> > > Hence, in such case allow migration but skip pre-copy phase.
> > >
> > > Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> >
> > I don't know (TM).
> > Several issues:
> > - Patch is ugly as hell (ok, that depends on taste)
> > - It changes migration_iteration_run() instead of directly
> >   migration_thread.
> > - There is already another case where we skip the sending of RAM
> >   (localhost migration with shared memory)
> >
> > In migration/ram.c:
> >
> > static int ram_find_and_save_block(RAMState *rs, bool last_stage)
> > {
> >     PageSearchStatus pss;
> >     int pages = 0;
> >     bool again, found;
> >
> >     /* No dirty page as there is zero RAM */
> >     if (!ram_bytes_total()) {
> >         return pages;
> >     }
> >
> > This is the other place where we _don't_ send any RAM at all.
> >
> > I don't have a great idea about how to make things clear at a higher
> > level, I have to think about this.
>
> It seems like if we have devices dictating what type of migrations can
> be performed then there probably needs to be a switch to restrict use of
> such devices just as we have the -only-migratable switch now to prevent
> attaching devices that don't support migration. I'd guess that we need
> the switch to opt-in to allowing such devices to maintain
> compatibility. There's probably a whole pile of qapi things missing to
> expose this to management tools as well. Thanks,

This is really intended to be a NOP from where things are now, as if
you use mlx5 live migration without a patch like this then it causes a
botched pre-copy since everything just ends up permanently dirty.

If it makes more sense we could abort the pre-copy too - in the end
there will be dirty tracking so I don't know if I'd invest in a big
adventure to fully define non-dirty tracking migration.

Jason
On Mon, 16 May 2022 20:08:32 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Mon, May 16, 2022 at 02:22:00PM -0600, Alex Williamson wrote:
> > On Mon, 16 May 2022 13:22:14 +0200
> > Juan Quintela <quintela@redhat.com> wrote:
> >
> > > Avihai Horon <avihaih@nvidia.com> wrote:
> > > > Currently, if IOMMU of a VFIO container doesn't support dirty page
> > > > tracking, migration is blocked completely. This is because a DMA-able
> > > > VFIO device can dirty RAM pages without updating QEMU about it, thus
> > > > breaking the migration.
> > > >
> > > > However, this doesn't mean that migration can't be done at all. If
> > > > migration pre-copy phase is skipped, the VFIO device doesn't have a
> > > > chance to dirty RAM pages that have been migrated already, thus
> > > > eliminating the problem previously mentioned.
> > > >
> > > > Hence, in such case allow migration but skip pre-copy phase.
> > > >
> > > > Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> > >
> > > I don't know (TM).
> > > Several issues:
> > > - Patch is ugly as hell (ok, that depends on taste)
> > > - It changes migration_iteration_run() instead of directly
> > >   migration_thread.
> > > - There is already another case where we skip the sending of RAM
> > >   (localhost migration with shared memory)
> > >
> > > In migration/ram.c:
> > >
> > > static int ram_find_and_save_block(RAMState *rs, bool last_stage)
> > > {
> > >     PageSearchStatus pss;
> > >     int pages = 0;
> > >     bool again, found;
> > >
> > >     /* No dirty page as there is zero RAM */
> > >     if (!ram_bytes_total()) {
> > >         return pages;
> > >     }
> > >
> > > This is the other place where we _don't_ send any RAM at all.
> > >
> > > I don't have a great idea about how to make things clear at a higher
> > > level, I have to think about this.
> >
> > It seems like if we have devices dictating what type of migrations can
> > be performed then there probably needs to be a switch to restrict use of
> > such devices just as we have the -only-migratable switch now to prevent
> > attaching devices that don't support migration. I'd guess that we need
> > the switch to opt-in to allowing such devices to maintain
> > compatibility. There's probably a whole pile of qapi things missing to
> > expose this to management tools as well. Thanks,
>
> This is really intended to be a NOP from where things are now, as if
> you use mlx5 live migration without a patch like this then it causes a
> botched pre-copy since everything just ends up permanently dirty.
>
> If it makes more sense we could abort the pre-copy too - in the end
> there will be dirty tracking so I don't know if I'd invest in a big
> adventure to fully define non-dirty tracking migration.

How is pre-copy currently "botched" without a patch like this? If it's
simply that the pre-copy doesn't converge and the downtime constraints
don't allow the VM to enter stop-and-copy, that's the expected behavior
AIUI, and supports backwards compatibility with existing SLAs.

I'm assuming that by setting this new skip_precopy flag that we're
forcing the VM to move to stop-and-copy, regardless of any other SLA
constraints placed on the migration.

It seems like a better solution would be to expose to management
tools that the VM contains a device that does not support the
pre-copy phase so that downtime expectations can be adjusted. Thanks,

Alex
On Tue, May 17, 2022 at 10:00:45AM -0600, Alex Williamson wrote:

> > This is really intended to be a NOP from where things are now, as if
> > you use mlx5 live migration without a patch like this then it causes a
> > botched pre-copy since everything just ends up permanently dirty.
> >
> > If it makes more sense we could abort the pre-copy too - in the end
> > there will be dirty tracking so I don't know if I'd invest in a big
> > adventure to fully define non-dirty tracking migration.
>
> How is pre-copy currently "botched" without a patch like this? If it's
> simply that the pre-copy doesn't converge and the downtime constraints
> don't allow the VM to enter stop-and-copy, that's the expected behavior
> AIUI, and supports backwards compatibility with existing SLAs.

It means it always fails - that certainly isn't working live
migration. There is no point in trying to converge something that we
already know will never converge.

> I'm assuming that by setting this new skip_precopy flag that we're
> forcing the VM to move to stop-and-copy, regardless of any other SLA
> constraints placed on the migration.

That does seem like a defect in this patch, any SLA constraints should
still all be checked under the assumption all ram is dirty.

> It seems like a better solution would be to expose to management
> tools that the VM contains a device that does not support the
> pre-copy phase so that downtime expectations can be adjusted.

I don't expect this to be a real use case though..

Remember, you asked for this patch when you wanted qemu to have good
behavior when kernel support for legacy dirty tracking is removed
before we merge v2 support.

Thanks,
Jason
On Tue, 17 May 2022 13:08:44 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Tue, May 17, 2022 at 10:00:45AM -0600, Alex Williamson wrote:
>
> > > This is really intended to be a NOP from where things are now, as if
> > > you use mlx5 live migration without a patch like this then it causes a
> > > botched pre-copy since everything just ends up permanently dirty.
> > >
> > > If it makes more sense we could abort the pre-copy too - in the end
> > > there will be dirty tracking so I don't know if I'd invest in a big
> > > adventure to fully define non-dirty tracking migration.
> >
> > How is pre-copy currently "botched" without a patch like this? If it's
> > simply that the pre-copy doesn't converge and the downtime constraints
> > don't allow the VM to enter stop-and-copy, that's the expected behavior
> > AIUI, and supports backwards compatibility with existing SLAs.
>
> It means it always fails - that certainly isn't working live
> migration. There is no point in trying to converge something that we
> already know will never converge.

If we eliminate the pre-copy phase then it's not so much live migration
anyway. Trying to converge is indeed useless work, but afaik it's that
useless work that generates the data that management tools can use to
determine that SLAs cannot be achieved in a compatible way.

> > I'm assuming that by setting this new skip_precopy flag that we're
> > forcing the VM to move to stop-and-copy, regardless of any other SLA
> > constraints placed on the migration.
>
> That does seem like a defect in this patch, any SLA constraints should
> still all be checked under the assumption all ram is dirty.

The migration iteration function certainly tries to compare remaining
bytes to a threshold based on bandwidth and downtime. The exit path
added here is the same as it would take if we had achieved our
threshold limit. It's not clear to me that we're checking the downtime
limit elsewhere or have the data to do it if we don't transfer anything
to estimate bandwidth.

> > It seems like a better solution would be to expose to management
> > tools that the VM contains a device that does not support the
> > pre-copy phase so that downtime expectations can be adjusted.
>
> I don't expect this to be a real use case though..
>
> Remember, you asked for this patch when you wanted qemu to have good
> behavior when kernel support for legacy dirty tracking is removed
> before we merge v2 support.

Is wanting good behavior a controversial point? Did we define this as
the desired good behavior? Ref? Thanks,

Alex
On Tue, May 17, 2022 at 11:22:32AM -0600, Alex Williamson wrote:

> > > It seems like a better solution would be to expose to management
> > > tools that the VM contains a device that does not support the
> > > pre-copy phase so that downtime expectations can be adjusted.
> >
> > I don't expect this to be a real use case though..
> >
> > Remember, you asked for this patch when you wanted qemu to have good
> > behavior when kernel support for legacy dirty tracking is removed
> > before we merge v2 support.
>
> Is wanting good behavior a controversial point? Did we define this as
> the desired good behavior? Ref?

As I said, this was intended as a NOP, which is what I thought we
agreed to. Missing the SLA checking that existed before seems like
something to fix in this patch.

This is the discussion thread:

https://lore.kernel.org/kvm/20220324231159.GA11336@nvidia.com/

"I guess I was assuming that enabling v2 migration in QEMU was dependent
on the existing type1 dirty tracking because it's the only means we
have to tell QEMU that all memory is perpetually dirty when we have a
DMA device. Is that not correct?"

The only point was to prepare qemu for kernels that don't support the
legacy dirty tracking API but do have a v2 migration interface. IIRC
something undesired happens if you do that right now.

We could also just dirty all memory in qemu and keep it exactly the
same so every SLA detail works. Or completely block pre-copy based
flows.

It is not intended to be a useful configuration, this is just covering
off backwards compat issues - so I'm not keen to do a bunch of
management work to support it.

Thanks,
Jason
On Tue, 17 May 2022 14:39:37 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Tue, May 17, 2022 at 11:22:32AM -0600, Alex Williamson wrote:
>
> > > > It seems like a better solution would be to expose to management
> > > > tools that the VM contains a device that does not support the
> > > > pre-copy phase so that downtime expectations can be adjusted.
> > >
> > > I don't expect this to be a real use case though..
> > >
> > > Remember, you asked for this patch when you wanted qemu to have good
> > > behavior when kernel support for legacy dirty tracking is removed
> > > before we merge v2 support.
> >
> > Is wanting good behavior a controversial point? Did we define this as
> > the desired good behavior? Ref?
>
> As I said, this was intended as a NOP, which is what I thought we
> agreed to. Missing the SLA checking that existed before seems like
> something to fix in this patch.

But even if we have the SLA check, how does a management tool know that
pre-copy will be skipped and that the downtime needs to account for
that? The current solution is obviously non-optimal, it was mainly
meant for backwards compatibility, but this seems like a fail faster
solution, with less useless work, but also providing less indication
how to configure the migration to succeed.

> This is the discussion thread:
>
> https://lore.kernel.org/kvm/20220324231159.GA11336@nvidia.com/
>
> "I guess I was assuming that enabling v2 migration in QEMU was dependent
> on the existing type1 dirty tracking because it's the only means we
> have to tell QEMU that all memory is perpetually dirty when we have a
> DMA device. Is that not correct?"

Doesn't sound like me asking for this patch so much as trying to figure
out how to support migration without DMA dirty tracking, and after the
above comment my concern was whether we lock ourselves into a dirty
tracking requirement in the iommufd type1 compatibility interface if we
rely on this type1 feature. You had spit-balled that QEMU could skip
pre-copy, but this all needs to be vetted with migration folks and
management tools.

> The only point was to prepare qemu for kernels that don't support the
> legacy dirty tracking API but do have a v2 migration interface. IIRC
> something undesired happens if you do that right now.

Per this patch, the container requires dirty tracking or we add a
blocker and can't migrate the device.

> We could also just dirty all memory in qemu and keep it exactly the
> same so every SLA detail works. Or completely block pre-copy based
> flows.
>
> It is not intended to be a useful configuration, this is just covering
> off backwards compat issues - so I'm not keen to do a bunch of
> management work to support it.

Are we then deemphasizing the vfio compatibility support in iommufd?
If we really don't want to put effort towards backwards compatibility,
should migration support only be enabled with native iommufd
support? Thanks,

Alex
Jason Gunthorpe <jgg@nvidia.com> wrote:
> On Tue, May 17, 2022 at 10:00:45AM -0600, Alex Williamson wrote:
>
>> > This is really intended to be a NOP from where things are now, as if
>> > you use mlx5 live migration without a patch like this then it causes a
>> > botched pre-copy since everything just ends up permanently dirty.
>> >
>> > If it makes more sense we could abort the pre-copy too - in the end
>> > there will be dirty tracking so I don't know if I'd invest in a big
>> > adventure to fully define non-dirty tracking migration.
>>
>> How is pre-copy currently "botched" without a patch like this? If it's
>> simply that the pre-copy doesn't converge and the downtime constraints
>> don't allow the VM to enter stop-and-copy, that's the expected behavior
>> AIUI, and supports backwards compatibility with existing SLAs.
>
> It means it always fails - that certainly isn't working live
> migration. There is no point in trying to converge something that we
> already know will never converge.

Fully agree with you here. But not with how this is being done.

I think we need a way to convince the migration code that it shouldn't
even try to migrate RAM. That would fix the current use case, and your
use case.

>> I'm assuming that by setting this new skip_precopy flag that we're
>> forcing the VM to move to stop-and-copy, regardless of any other SLA
>> constraints placed on the migration.
>
> That does seem like a defect in this patch, any SLA constraints should
> still all be checked under the assumption all ram is dirty.

And how are we going to:
- detect the network link speed
- be sure that we are inside the downtime limit

I think that it is not possible, so basically we are skipping the precopy
stage and praying that the other bits are going to be ok.

>> It seems like a better solution would be to expose to management
>> tools that the VM contains a device that does not support the
>> pre-copy phase so that downtime expectations can be adjusted.
>
> I don't expect this to be a real use case though..
>
> Remember, you asked for this patch when you wanted qemu to have good
> behavior when kernel support for legacy dirty tracking is removed
> before we merge v2 support.

I am ignorant on the subject. Can I ask how the dirty memory is
tracked on this v2?

Thanks, Juan.
On Wed, May 18, 2022 at 01:39:31PM +0200, Juan Quintela wrote:

> > That does seem like a defect in this patch, any SLA constraints should
> > still all be checked under the assumption all ram is dirty.
>
> And how are we going to:
> - detect the network link speed
> - be sure that we are inside the downtime limit
>
> I think that it is not possible, so basically we are skipping the precopy
> stage and praying that the other bits are going to be ok.

Like I keep saying, this is not a real use case, we expect dirty
tracking to be available in any real configuration. This is just
trying to make qemu work in some reasonable way if dirty tracking is
not available but a VFIO migration device is plugged in.

Just pick something simple that makes sense. Like if any SLA is set
then just refuse to even start. If no SLA then go directly to
STOP_COPY.

> >> It seems like a better solution would be to expose to management
> >> tools that the VM contains a device that does not support the
> >> pre-copy phase so that downtime expectations can be adjusted.
> >
> > I don't expect this to be a real use case though..
> >
> > Remember, you asked for this patch when you wanted qemu to have good
> > behavior when kernel support for legacy dirty tracking is removed
> > before we merge v2 support.
>
> I am ignorant on the subject. Can I ask how the dirty memory is
> tracked on this v2?

These two RFCs are the current proposal to enable it for the system
iommu:

https://lore.kernel.org/kvm/20220428210933.3583-1-joao.m.martins@oracle.com

And for device internal trackers:

https://lore.kernel.org/kvm/20220501123301.127279-1-yishaih@nvidia.com/

Regards,
Jason
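As a rough illustration of the "refuse if an SLA is set, otherwise go straight to stop-and-copy" idea, the branch added by the patch could be extended along the following lines. This is a sketch only: treating a non-zero downtime_limit as "an SLA is set" is an assumption made here for illustration, not an agreed policy, and skip_precopy is the field introduced by the patch under discussion.

```c
/* Sketch of an extended skip_precopy branch in migration_iteration_run() */
if (s->skip_precopy) {
    /*
     * Assumption for illustration: a configured downtime limit counts as
     * an SLA.  Without dirty tracking, all of RAM would have to fit into
     * the downtime window, which cannot be guaranteed, so fail early.
     */
    if (s->parameters.downtime_limit) {
        migrate_set_state(&s->state, MIGRATION_STATUS_ACTIVE,
                          MIGRATION_STATUS_FAILED);
        return MIG_ITERATE_BREAK;
    }
    /* No SLA: go directly to stop-and-copy */
    migration_completion(s);
    return MIG_ITERATE_BREAK;
}
```

Note that downtime_limit has a non-zero default in QEMU, so a real implementation would need a more deliberate way to distinguish "no SLA" from "default SLA", which is exactly the problem Avihai runs into later in the thread.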
On Tue, May 17, 2022 at 09:46:56PM -0600, Alex Williamson wrote:

> The current solution is obviously non-optimal, it was mainly
> meant for backwards compatibility, but this seems like a fail faster
> solution, with less useless work, but also providing less indication
> how to configure the migration to succeed.

I don't think this is a production configuration. We should not be
expecting that the user is going to carefully fine tune some SLA
parameter.

It may be the "SLA check" we are missing is simply if a SLA exists or
not.

> > It is not intended to be a useful configuration, this is just covering
> > off backwards compat issues - so I'm not keen to do a bunch of
> > management work to support it.
>
> Are we then deemphasizing the vfio compatibility support in iommufd?

I'm viewing iommufd compatibility with VFIO type 1 as only extending
to implemented features.. We are deleting NESTED for instance.

As there is no in-kernel implementation of the type 1 tracker I would
expect to eventually delete it as well once we have consensus this is
what we plan to do. We discussed this in Joao's series and that was
the consensus.

> If we really don't want to put effort towards backwards compatibility,
> should migration support only be enabled with native iommufd
> support?

I'm expecting that the dirty tracking will require native iommufd only
for the system iommu tracker, but the mlx5 on-device tracker will be
fine with either iommu back end.

It is still useful currently for testing the VFIO device part as it
will be some time until the other parts are ready, so I'd rather not
block it completely in qemu.

The goal is for qemu to do something sensible so that when we delete
kernel support for type 1 dirty tracking there is no back compat
concern for existing non-experimental qemu.

Jason
On 5/12/22 16:43, Avihai Horon wrote:
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 21e8f9d4d4..d4b6653026 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -863,10 +863,17 @@ int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
>      struct vfio_region_info *info = NULL;
>      int ret = -ENOTSUP;
>
> -    if (!vbasedev->enable_migration || !container->dirty_pages_supported) {
> +    if (!vbasedev->enable_migration) {
>          goto add_blocker;
>      }
>
> +    if (!container->dirty_pages_supported) {
> +        warn_report(
> +            "%s: IOMMU of the device's VFIO container doesn't support dirty page tracking, migration pre-copy phase will be skipped",
> +            vbasedev->name);

Maybe warn_report_once(...) given that the following N devices would observe the
same thing in theory.

> +        migrate_get_current()->skip_precopy = true;
> +    }
> +

Perhaps it might be easier to reuse the existing knob to disable pre-copy
per device that Kirti added some time ago, rather than changing migration core just
yet (e.g. you might want to bail out of the migration preemptively because the CPU is
dirtying too many pages, for example):

    if (!container->dirty_pages_supported) {
        warn_report_once(...)
        pre_copy_dirty_page_tracking = ON_OFF_AUTO_OFF;
    }

You might also need to tackle the error messages you will get when the dirty_pages
start/stop ioctls fail. The errors are just because log_start() and log_stop() simply
fail because the ioctl doesn't exist, but everything else is fine -- at least that's
what I observed. It should be noted that it's a problem with the existing
`-device vfio-pci host=XX:YY.ZZ,x-pre-copy-dirty-page-tracking=true` regardless of this patch:

    void vfio_listener_log_global_start()
    {
        if (vfio_devices_all_dirty_tracking(container)) {
            vfio_set_dirty_page_tracking(container, true);
        }
    }

... And same for vfio_listener_log_global_stop() -- maybe a worthwhile
predecessor patch.
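For completeness, the stop-side counterpart Joao alludes to ("And same for vfio_listener_log_global_stop()") might look like the sketch below. This is illustrative only and based on the hw/vfio/common.c listener callbacks of that period; whether such a predecessor patch is the right approach is part of what is being discussed.

```c
/* Sketch: mirror the same guard on the stop path (hw/vfio/common.c) */
static void vfio_listener_log_global_stop(MemoryListener *listener)
{
    VFIOContainer *container = container_of(listener, VFIOContainer, listener);

    if (vfio_devices_all_dirty_tracking(container)) {
        vfio_set_dirty_page_tracking(container, false);
    }
}
```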
On 5/20/2022 1:58 PM, Joao Martins wrote:
> On 5/12/22 16:43, Avihai Horon wrote:
>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>> index 21e8f9d4d4..d4b6653026 100644
>> --- a/hw/vfio/migration.c
>> +++ b/hw/vfio/migration.c
>> @@ -863,10 +863,17 @@ int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
>>      struct vfio_region_info *info = NULL;
>>      int ret = -ENOTSUP;
>>
>> -    if (!vbasedev->enable_migration || !container->dirty_pages_supported) {
>> +    if (!vbasedev->enable_migration) {
>>          goto add_blocker;
>>      }
>>
>> +    if (!container->dirty_pages_supported) {
>> +        warn_report(
>> +            "%s: IOMMU of the device's VFIO container doesn't support dirty page tracking, migration pre-copy phase will be skipped",
>> +            vbasedev->name);
>
> Maybe warn_report_once(...) given that the following N devices would observe the
> same thing in theory.

Yes, you are right. Will change.

>> +        migrate_get_current()->skip_precopy = true;
>> +    }
>> +
>
> Perhaps it might be easier to reuse the existing knob to disable pre-copy
> per device that Kirti added some time ago, rather than changing migration core just
> yet (e.g. you might want to bail out of the migration preemptively because the CPU is
> dirtying too many pages, for example):
>
>     if (!container->dirty_pages_supported) {
>         warn_report_once(...)
>         pre_copy_dirty_page_tracking = ON_OFF_AUTO_OFF;
>     }

But this doesn't prevent the VFIO device from dirtying RAM pages during
the pre-copy phase.
The VFIO device can dirty RAM pages during pre-copy and it won't have a
way to report the dirtied pages to QEMU, so migration will be faulty.

Thanks.

> You might also need to tackle the error messages you will get when the dirty_pages
> start/stop ioctls fail. The errors are just because log_start() and log_stop() simply
> fail because the ioctl doesn't exist, but everything else is fine -- at least that's
> what I observed. It should be noted that it's a problem with the existing
> `-device vfio-pci host=XX:YY.ZZ,x-pre-copy-dirty-page-tracking=true` regardless of this patch:
>
>     void vfio_listener_log_global_start()
>     {
>         if (vfio_devices_all_dirty_tracking(container)) {
>             vfio_set_dirty_page_tracking(container, true);
>         }
>     }
>
> ... And same for vfio_listener_log_global_stop() -- maybe a worthwhile
> predecessor patch.
On 5/23/22 07:11, Avihai Horon wrote:
> On 5/20/2022 1:58 PM, Joao Martins wrote:
>>> +        migrate_get_current()->skip_precopy = true;
>>> +    }
>>> +
>>
>> Perhaps it might be easier to reuse the existing knob to disable pre-copy
>> per device that Kirti added some time ago, rather than changing migration core just
>> yet (e.g. you might want to bail out of the migration preemptively because the CPU is
>> dirtying too many pages, for example):
>>
>>     if (!container->dirty_pages_supported) {
>>         warn_report_once(...)
>>         pre_copy_dirty_page_tracking = ON_OFF_AUTO_OFF;
>>     }
>
> But this doesn't prevent the VFIO device from dirtying RAM pages during
> the pre-copy phase.
> The VFIO device can dirty RAM pages during pre-copy and it won't have a
> way to report the dirtied pages to QEMU, so migration will be faulty.
>
You're quite right, sorry for the noise. I was a little too obsessed with
reusing the existing field and forgot that, by letting the iterate stage go
on, the PCI device could also be driven into dirtying RAM too.
On 5/18/2022 6:50 PM, Jason Gunthorpe wrote:
> On Wed, May 18, 2022 at 01:39:31PM +0200, Juan Quintela wrote:
>
>>> That does seem like a defect in this patch, any SLA constraints should
>>> still all be checked under the assumption all ram is dirty.
>>
>> And how are we going to:
>> - detect the network link speed
>> - be sure that we are inside the downtime limit
>>
>> I think that it is not possible, so basically we are skipping the precopy
>> stage and praying that the other bits are going to be ok.
>
> Like I keep saying, this is not a real use case, we expect dirty
> tracking to be available in any real configuration. This is just
> trying to make qemu work in some reasonable way if dirty tracking is
> not available but a VFIO migration device is plugged in.
>
> Just pick something simple that makes sense. Like if any SLA is set
> then just refuse to even start. If no SLA then go directly to
> STOP_COPY.

I tried to follow Jason's suggestion to check if there is any SLA and
block migration, or no SLA at all and allow migration.

It doesn't seem like there is a way to say "no SLA at all".

Migration param downtime_limit takes values between 0 and 2000 seconds.
What does downtime_limit=0 mean? If it's 0 then
MigrationState->threshold_size = 0, so ram_save_pending() will never
call migration_bitmap_sync_precopy(). Only upon entering stop-copy
phase will migration_bitmap_sync_precopy() be called, which will
collect all pages that were dirtied during pre-copy, which could be a
lot and impact downtime.

In libvirt it seems that downtime_limit = 0 is not valid and can't be
set.

Am I missing something here? Is there a way to allow unlimited
downtime?

Thanks.
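For context, the mechanism Avihai describes works roughly as follows. This is a reconstruction from migration/migration.c and migration/ram.c of that era rather than an exact excerpt, so details may differ between QEMU versions.

```c
/* migration_update_counters(): the threshold scales with measured bandwidth */
bandwidth = (double)transferred / time_spent;            /* bytes per ms */
s->threshold_size = bandwidth * s->parameters.downtime_limit;

/*
 * ram_save_pending(): the dirty bitmap is only re-synced during pre-copy
 * once the remaining dirty RAM drops below that threshold (passed in as
 * max_size via qemu_savevm_state_pending()).
 */
remaining_size = rs->migration_dirty_pages * TARGET_PAGE_SIZE;
if (!migration_in_postcopy() && remaining_size < max_size) {
    migration_bitmap_sync_precopy(rs);
    remaining_size = rs->migration_dirty_pages * TARGET_PAGE_SIZE;
}
```

With downtime_limit == 0 the threshold is 0, so the sync branch is never taken during pre-copy and the bitmap is only synchronized at stop-and-copy, which matches the behaviour Avihai points out.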
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 21e8f9d4d4..d4b6653026 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -863,10 +863,17 @@ int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
     struct vfio_region_info *info = NULL;
     int ret = -ENOTSUP;
 
-    if (!vbasedev->enable_migration || !container->dirty_pages_supported) {
+    if (!vbasedev->enable_migration) {
         goto add_blocker;
     }
 
+    if (!container->dirty_pages_supported) {
+        warn_report(
+            "%s: IOMMU of the device's VFIO container doesn't support dirty page tracking, migration pre-copy phase will be skipped",
+            vbasedev->name);
+        migrate_get_current()->skip_precopy = true;
+    }
+
     ret = vfio_get_dev_region_info(vbasedev,
                                    VFIO_REGION_TYPE_MIGRATION_DEPRECATED,
                                    VFIO_REGION_SUBTYPE_MIGRATION_DEPRECATED,
diff --git a/migration/migration.c b/migration/migration.c
index 5a31b23bd6..668343508d 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -3593,6 +3593,11 @@ static MigIterateState migration_iteration_run(MigrationState *s)
     uint64_t pending_size, pend_pre, pend_compat, pend_post;
     bool in_postcopy = s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE;
 
+    if (s->skip_precopy) {
+        migration_completion(s);
+        return MIG_ITERATE_BREAK;
+    }
+
     qemu_savevm_state_pending(s->to_dst_file, s->threshold_size, &pend_pre,
                               &pend_compat, &pend_post);
     pending_size = pend_pre + pend_compat + pend_post;
diff --git a/migration/migration.h b/migration/migration.h
index a863032b71..876713e7e1 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -332,6 +332,9 @@ struct MigrationState {
      * This save hostname when out-going migration starts
      */
     char *hostname;
+
+    /* Whether to skip pre-copy phase of migration or not */
+    bool skip_precopy;
 };
 
 void migrate_set_state(int *state, int old_state, int new_state);
Currently, if IOMMU of a VFIO container doesn't support dirty page
tracking, migration is blocked completely. This is because a DMA-able
VFIO device can dirty RAM pages without updating QEMU about it, thus
breaking the migration.

However, this doesn't mean that migration can't be done at all. If
migration pre-copy phase is skipped, the VFIO device doesn't have a
chance to dirty RAM pages that have been migrated already, thus
eliminating the problem previously mentioned.

Hence, in such case allow migration but skip pre-copy phase.

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 hw/vfio/migration.c   | 9 ++++++++-
 migration/migration.c | 5 +++++
 migration/migration.h | 3 +++
 3 files changed, 16 insertions(+), 1 deletion(-)