
[4/9] vfio/migration: Skip pre-copy if dirty page tracking is not supported

Message ID 20220512154320.19697-5-avihaih@nvidia.com (mailing list archive)
State New, archived
Series vfio/migration: Implement VFIO migration protocol v2

Commit Message

Avihai Horon May 12, 2022, 3:43 p.m. UTC
Currently, if IOMMU of a VFIO container doesn't support dirty page
tracking, migration is blocked completely. This is because a DMA-able
VFIO device can dirty RAM pages without updating QEMU about it, thus
breaking the migration.

However, this doesn't mean that migration can't be done at all. If
migration pre-copy phase is skipped, the VFIO device doesn't have a
chance to dirty RAM pages that have been migrated already, thus
eliminating the problem previously mentioned.

Hence, in such case allow migration but skip pre-copy phase.

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 hw/vfio/migration.c   | 9 ++++++++-
 migration/migration.c | 5 +++++
 migration/migration.h | 3 +++
 3 files changed, 16 insertions(+), 1 deletion(-)

Comments

Juan Quintela May 16, 2022, 11:22 a.m. UTC | #1
Avihai Horon <avihaih@nvidia.com> wrote:
> Currently, if IOMMU of a VFIO container doesn't support dirty page
> tracking, migration is blocked completely. This is because a DMA-able
> VFIO device can dirty RAM pages without updating QEMU about it, thus
> breaking the migration.
>
> However, this doesn't mean that migration can't be done at all. If
> migration pre-copy phase is skipped, the VFIO device doesn't have a
> chance to dirty RAM pages that have been migrated already, thus
> eliminating the problem previously mentioned.
>
> Hence, in such case allow migration but skip pre-copy phase.
>
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>

I don't know (TM).
Several issues:
- Patch is ugly as hell (ok, that depends on taste)
- It changes migration_iteration_run() instead of directly
  migration_thread.
- There is already another case where we skip the sending of RAM
  (localhost migration with shared memory)

In migration/ram.c:

static int ram_find_and_save_block(RAMState *rs, bool last_stage)
{
    PageSearchStatus pss;
    int pages = 0;
    bool again, found;

    /* No dirty page as there is zero RAM */
    if (!ram_bytes_total()) {
        return pages;
    }

This is the other place where we _don't_ send any RAM at all.

I don't have a great idea about how to make things clear at a higher
level, I have to think about this.

Later, Juan.

> ---
>  hw/vfio/migration.c   | 9 ++++++++-
>  migration/migration.c | 5 +++++
>  migration/migration.h | 3 +++
>  3 files changed, 16 insertions(+), 1 deletion(-)
>
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 21e8f9d4d4..d4b6653026 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -863,10 +863,17 @@ int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
>      struct vfio_region_info *info = NULL;
>      int ret = -ENOTSUP;
>  
> -    if (!vbasedev->enable_migration || !container->dirty_pages_supported) {
> +    if (!vbasedev->enable_migration) {
>          goto add_blocker;
>      }
>  
> +    if (!container->dirty_pages_supported) {
> +        warn_report(
> +            "%s: IOMMU of the device's VFIO container doesn't support dirty page tracking, migration pre-copy phase will be skipped",
> +            vbasedev->name);
> +        migrate_get_current()->skip_precopy = true;
> +    }
> +
>      ret = vfio_get_dev_region_info(vbasedev,
>                                     VFIO_REGION_TYPE_MIGRATION_DEPRECATED,
>                                     VFIO_REGION_SUBTYPE_MIGRATION_DEPRECATED,
> diff --git a/migration/migration.c b/migration/migration.c
> index 5a31b23bd6..668343508d 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -3593,6 +3593,11 @@ static MigIterateState migration_iteration_run(MigrationState *s)
>      uint64_t pending_size, pend_pre, pend_compat, pend_post;
>      bool in_postcopy = s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE;
>  
> +    if (s->skip_precopy) {
> +        migration_completion(s);
> +        return MIG_ITERATE_BREAK;
> +    }
> +
>      qemu_savevm_state_pending(s->to_dst_file, s->threshold_size, &pend_pre,
>                                &pend_compat, &pend_post);
>      pending_size = pend_pre + pend_compat + pend_post;
> diff --git a/migration/migration.h b/migration/migration.h
> index a863032b71..876713e7e1 100644
> --- a/migration/migration.h
> +++ b/migration/migration.h
> @@ -332,6 +332,9 @@ struct MigrationState {
>       * This save hostname when out-going migration starts
>       */
>      char *hostname;
> +
> +    /* Whether to skip pre-copy phase of migration or not */
> +    bool skip_precopy;
>  };
>  
>  void migrate_set_state(int *state, int old_state, int new_state);
Alex Williamson May 16, 2022, 8:22 p.m. UTC | #2
On Mon, 16 May 2022 13:22:14 +0200
Juan Quintela <quintela@redhat.com> wrote:

> Avihai Horon <avihaih@nvidia.com> wrote:
> > Currently, if IOMMU of a VFIO container doesn't support dirty page
> > tracking, migration is blocked completely. This is because a DMA-able
> > VFIO device can dirty RAM pages without updating QEMU about it, thus
> > breaking the migration.
> >
> > However, this doesn't mean that migration can't be done at all. If
> > migration pre-copy phase is skipped, the VFIO device doesn't have a
> > chance to dirty RAM pages that have been migrated already, thus
> > eliminating the problem previously mentioned.
> >
> > Hence, in such case allow migration but skip pre-copy phase.
> >
> > Signed-off-by: Avihai Horon <avihaih@nvidia.com>  
> 
> I don't know (TM).
> Several issues:
> - Patch is ugly as hell (ok, that depends on taste)
> - It changes migration_iteration_run() instead of directly
>   migration_thread.
> - There is already another case where we skip the sending of RAM
>   (localhost migration with shared memory)
> 
> In migration/ram.c:
> 
> static int ram_find_and_save_block(RAMState *rs, bool last_stage)
> {
>     PageSearchStatus pss;
>     int pages = 0;
>     bool again, found;
> 
>     /* No dirty page as there is zero RAM */
>     if (!ram_bytes_total()) {
>         return pages;
>     }
> 
> This is the other place where we _don't_ send any RAM at all.
> 
> I don't have a great idea about how to make things clear at a higher
> level, I have to think about this.

It seems like if we have devices dictating what type of migrations can
be performed then there probably needs to be a switch to restrict use of
such devices just as we have the -only-migratable switch now to prevent
attaching devices that don't support migration.  I'd guess that we need
the switch to opt-in to allowing such devices to maintain
compatibility.  There's probably a whole pile of qapi things missing to
expose this to management tools as well.  Thanks,

Alex

> > ---
> >  hw/vfio/migration.c   | 9 ++++++++-
> >  migration/migration.c | 5 +++++
> >  migration/migration.h | 3 +++
> >  3 files changed, 16 insertions(+), 1 deletion(-)
> >
> > diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> > index 21e8f9d4d4..d4b6653026 100644
> > --- a/hw/vfio/migration.c
> > +++ b/hw/vfio/migration.c
> > @@ -863,10 +863,17 @@ int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
> >      struct vfio_region_info *info = NULL;
> >      int ret = -ENOTSUP;
> >  
> > -    if (!vbasedev->enable_migration || !container->dirty_pages_supported) {
> > +    if (!vbasedev->enable_migration) {
> >          goto add_blocker;
> >      }
> >  
> > +    if (!container->dirty_pages_supported) {
> > +        warn_report(
> > +            "%s: IOMMU of the device's VFIO container doesn't support dirty page tracking, migration pre-copy phase will be skipped",
> > +            vbasedev->name);
> > +        migrate_get_current()->skip_precopy = true;
> > +    }
> > +
> >      ret = vfio_get_dev_region_info(vbasedev,
> >                                     VFIO_REGION_TYPE_MIGRATION_DEPRECATED,
> >                                     VFIO_REGION_SUBTYPE_MIGRATION_DEPRECATED,
> > diff --git a/migration/migration.c b/migration/migration.c
> > index 5a31b23bd6..668343508d 100644
> > --- a/migration/migration.c
> > +++ b/migration/migration.c
> > @@ -3593,6 +3593,11 @@ static MigIterateState migration_iteration_run(MigrationState *s)
> >      uint64_t pending_size, pend_pre, pend_compat, pend_post;
> >      bool in_postcopy = s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE;
> >  
> > +    if (s->skip_precopy) {
> > +        migration_completion(s);
> > +        return MIG_ITERATE_BREAK;
> > +    }
> > +
> >      qemu_savevm_state_pending(s->to_dst_file, s->threshold_size, &pend_pre,
> >                                &pend_compat, &pend_post);
> >      pending_size = pend_pre + pend_compat + pend_post;
> > diff --git a/migration/migration.h b/migration/migration.h
> > index a863032b71..876713e7e1 100644
> > --- a/migration/migration.h
> > +++ b/migration/migration.h
> > @@ -332,6 +332,9 @@ struct MigrationState {
> >       * This save hostname when out-going migration starts
> >       */
> >      char *hostname;
> > +
> > +    /* Whether to skip pre-copy phase of migration or not */
> > +    bool skip_precopy;
> >  };
> >  
> >  void migrate_set_state(int *state, int old_state, int new_state);  
>
Jason Gunthorpe May 16, 2022, 11:08 p.m. UTC | #3
On Mon, May 16, 2022 at 02:22:00PM -0600, Alex Williamson wrote:
> On Mon, 16 May 2022 13:22:14 +0200
> Juan Quintela <quintela@redhat.com> wrote:
> 
> > Avihai Horon <avihaih@nvidia.com> wrote:
> > > Currently, if IOMMU of a VFIO container doesn't support dirty page
> > > tracking, migration is blocked completely. This is because a DMA-able
> > > VFIO device can dirty RAM pages without updating QEMU about it, thus
> > > breaking the migration.
> > >
> > > However, this doesn't mean that migration can't be done at all. If
> > > migration pre-copy phase is skipped, the VFIO device doesn't have a
> > > chance to dirty RAM pages that have been migrated already, thus
> > > eliminating the problem previously mentioned.
> > >
> > > Hence, in such case allow migration but skip pre-copy phase.
> > >
> > > Signed-off-by: Avihai Horon <avihaih@nvidia.com>  
> > 
> > I don't know (TM).
> > Several issues:
> > - Patch is ugly as hell (ok, that depends on taste)
> > - It changes migration_iteration_run() instead of directly
> >   migration_thread.
> > - There is already another case where we skip the sending of RAM
> >   (localhost migration with shared memory)
> > 
> > In migration/ram.c:
> > 
> > static int ram_find_and_save_block(RAMState *rs, bool last_stage)
> > {
> >     PageSearchStatus pss;
> >     int pages = 0;
> >     bool again, found;
> > 
> >     /* No dirty page as there is zero RAM */
> >     if (!ram_bytes_total()) {
> >         return pages;
> >     }
> > 
> > This is the other place where we _don't_ send any RAM at all.
> > 
> > I don't have a great idea about how to make things clear at a higher
> > level, I have to think about this.
> 
> It seems like if we have devices dictating what type of migrations can
> be performed then there probably needs to be a switch to restrict use of
> such devices just as we have the -only-migratable switch now to prevent
> attaching devices that don't support migration.  I'd guess that we need
> the switch to opt-in to allowing such devices to maintain
> compatibility.  There's probably a whole pile of qapi things missing to
> expose this to management tools as well.  Thanks,

This is really intended to be a NOP from where things are now, as if
you use mlx5 live migration without a patch like this then it causes a
botched pre-copy since everything just ends up permanently dirty.

If it makes more sense we could abort the pre-copy too - in the end
there will be dirty tracking so I don't know if I'd invest in a big
adventure to fully define non-dirty tracking migration.

Jason
Alex Williamson May 17, 2022, 4 p.m. UTC | #4
On Mon, 16 May 2022 20:08:32 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Mon, May 16, 2022 at 02:22:00PM -0600, Alex Williamson wrote:
> > On Mon, 16 May 2022 13:22:14 +0200
> > Juan Quintela <quintela@redhat.com> wrote:
> >   
> > > Avihai Horon <avihaih@nvidia.com> wrote:  
> > > > Currently, if IOMMU of a VFIO container doesn't support dirty page
> > > > tracking, migration is blocked completely. This is because a DMA-able
> > > > VFIO device can dirty RAM pages without updating QEMU about it, thus
> > > > breaking the migration.
> > > >
> > > > However, this doesn't mean that migration can't be done at all. If
> > > > migration pre-copy phase is skipped, the VFIO device doesn't have a
> > > > chance to dirty RAM pages that have been migrated already, thus
> > > > eliminating the problem previously mentioned.
> > > >
> > > > Hence, in such case allow migration but skip pre-copy phase.
> > > >
> > > > Signed-off-by: Avihai Horon <avihaih@nvidia.com>    
> > > 
> > > I don't know (TM).
> > > Several issues:
> > > - Patch is ugly as hell (ok, that depends on taste)
> > > - It changes migration_iteration_run() instead of directly
> > >   migration_thread.
> > > - There is already another case where we skip the sending of RAM
> > >   (localhost migration with shared memory)
> > > 
> > > In migration/ram.c:
> > > 
> > > static int ram_find_and_save_block(RAMState *rs, bool last_stage)
> > > {
> > >     PageSearchStatus pss;
> > >     int pages = 0;
> > >     bool again, found;
> > > 
> > >     /* No dirty page as there is zero RAM */
> > >     if (!ram_bytes_total()) {
> > >         return pages;
> > >     }
> > > 
> > > This is the other place where we _don't_ send any RAM at all.
> > > 
> > > I don't have a great idea about how to make things clear at a higher
> > > level, I have to think about this.  
> > 
> > It seems like if we have devices dictating what type of migrations can
> > be performed then there probably needs to be a switch to restrict use of
> > such devices just as we have the -only-migratable switch now to prevent
> > attaching devices that don't support migration.  I'd guess that we need
> > the switch to opt-in to allowing such devices to maintain
> > compatibility.  There's probably a whole pile of qapi things missing to
> > expose this to management tools as well.  Thanks,  
> 
> This is really intended to be a NOP from where things are now, as if
> you use mlx5 live migration without a patch like this then it causes a
> botched pre-copy since everything just ends up permanently dirty.
> 
> If it makes more sense we could abort the pre-copy too - in the end
> there will be dirty tracking so I don't know if I'd invest in a big
> adventure to fully define non-dirty tracking migration.

How is pre-copy currently "botched" without a patch like this?  If it's
simply that the pre-copy doesn't converge and the downtime constraints
don't allow the VM to enter stop-and-copy, that's the expected behavior
AIUI, and supports backwards compatibility with existing SLAs.

I'm assuming that by setting this new skip_precopy flag that we're
forcing the VM to move to stop-and-copy, regardless of any other SLA
constraints placed on the migration.  It seems like a better solution
would be to expose to management tools that the VM contains a device
that does not support the pre-copy phase so that downtime expectations
can be adjusted.  Thanks,

Alex
Jason Gunthorpe May 17, 2022, 4:08 p.m. UTC | #5
On Tue, May 17, 2022 at 10:00:45AM -0600, Alex Williamson wrote:

> > This is really intended to be a NOP from where things are now, as if
> > you use mlx5 live migration without a patch like this then it causes a
> > botched pre-copy since everything just ends up permanently dirty.
> > 
> > If it makes more sense we could abort the pre-copy too - in the end
> > there will be dirty tracking so I don't know if I'd invest in a big
> > adventure to fully define non-dirty tracking migration.
> 
> How is pre-copy currently "botched" without a patch like this?  If it's
> simply that the pre-copy doesn't converge and the downtime constraints
> don't allow the VM to enter stop-and-copy, that's the expected behavior
> AIUI, and supports backwards compatibility with existing SLAs.

It means it always fails - that certainly isn't working live
migration. There is no point in trying to converge something that we
already know will never converge.

> I'm assuming that by setting this new skip_precopy flag that we're
> forcing the VM to move to stop-and-copy, regardless of any other SLA
> constraints placed on the migration.  

That does seem like a defect in this patch, any SLA constraints should
still all be checked under the assumption all ram is dirty.

> It seems like a better solution would be to expose to management
> tools that the VM contains a device that does not support the
> pre-copy phase so that downtime expectations can be adjusted.

I don't expect this to be a real use case though..

Remember, you asked for this patch when you wanted qemu to have good
behavior when kernel support for legacy dirty tracking is removed
before we merge v2 support.

Thanks,
Jason
Alex Williamson May 17, 2022, 5:22 p.m. UTC | #6
On Tue, 17 May 2022 13:08:44 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Tue, May 17, 2022 at 10:00:45AM -0600, Alex Williamson wrote:
> 
> > > This is really intended to be a NOP from where things are now, as if
> > > you use mlx5 live migration without a patch like this then it causes a
> > > botched pre-copy since everything just ends up permanently dirty.
> > > 
> > > If it makes more sense we could abort the pre-copy too - in the end
> > > there will be dirty tracking so I don't know if I'd invest in a big
> > > adventure to fully define non-dirty tracking migration.  
> > 
> > How is pre-copy currently "botched" without a patch like this?  If it's
> > simply that the pre-copy doesn't converge and the downtime constraints
> > don't allow the VM to enter stop-and-copy, that's the expected behavior
> > AIUI, and supports backwards compatibility with existing SLAs.  
> 
> It means it always fails - that certainly isn't working live
> migration. There is no point in trying to converge something that we
> already know will never converge.

If we eliminate the pre-copy phase then it's not so much live migration
anyway.  Trying to converge is indeed useless work, but afaik it's that
useless work that generates the data that management tools can use to
determine that SLAs cannot be achieved in a compatible way.
 
> > I'm assuming that by setting this new skip_precopy flag that we're
> > forcing the VM to move to stop-and-copy, regardless of any other SLA
> > constraints placed on the migration.    
> 
> That does seem like a defect in this patch, any SLA constraints should
> still all be checked under the assumption all ram is dirty.

The migration iteration function certainly tries to compare remaining
bytes to a threshold based on bandwidth and downtime.  The exit path
added here is the same as it would take if we had achieved our
threshold limit.  It's not clear to me that we're checking the downtime
limit elsewhere, or that we have the data to do it when we don't
transfer anything from which to estimate bandwidth.

> > It seems like a better solution would be to expose to management
> > tools that the VM contains a device that does not support the
> > pre-copy phase so that downtime expectations can be adjusted.  
> 
> I don't expect this to be a real use case though..
> 
> Remember, you asked for this patch when you wanted qemu to have good
> behavior when kernel support for legacy dirty tracking is removed
> before we merge v2 support.

Is wanting good behavior a controversial point?  Did we define this as
the desired good behavior?  Ref?  Thanks,

Alex
Jason Gunthorpe May 17, 2022, 5:39 p.m. UTC | #7
On Tue, May 17, 2022 at 11:22:32AM -0600, Alex Williamson wrote:

> > > It seems like a better solution would be to expose to management
> > > tools that the VM contains a device that does not support the
> > > pre-copy phase so that downtime expectations can be adjusted.  
> > 
> > I don't expect this to be a real use case though..
> > 
> > Remember, you asked for this patch when you wanted qemu to have good
> > behavior when kernel support for legacy dirty tracking is removed
> > before we merge v2 support.
> 
> Is wanting good behavior a controversial point?  Did we define this as
> the desired good behavior?  Ref?  

As I said, this was intended as a NOP, which is what I thought we
agreed to. Missing the SLA checking that existed before seems like
something to fix in this patch. This is the discussion thread:

https://lore.kernel.org/kvm/20220324231159.GA11336@nvidia.com/

 "I guess I was assuming that enabling v2 migration in QEMU was dependent
  on the existing type1 dirty tracking because it's the only means we
  have to tell QEMU that all memory is perpetually dirty when we have a
  DMA device.  Is that not correct?"

The only point was to prepare qemu for kernel's that don't support the
legacy dirty tracking API but do have a v2 migration interface. IIRC
something undesired happens if you do that right now.

We could also just dirty all memory in qemu and keep it exactly the
same so every SLA detail works. Or completely block pre-copy based
flows.

It is not intended to be a useful configuration, this is just covering
off backwards compat issues - so I'm not keen to do a bunch of
management work to support it.

Thanks,
Jason
Alex Williamson May 18, 2022, 3:46 a.m. UTC | #8
On Tue, 17 May 2022 14:39:37 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Tue, May 17, 2022 at 11:22:32AM -0600, Alex Williamson wrote:
> 
> > > > It seems like a better solution would be to expose to management
> > > > tools that the VM contains a device that does not support the
> > > > pre-copy phase so that downtime expectations can be adjusted.    
> > > 
> > > I don't expect this to be a real use case though..
> > > 
> > > Remember, you asked for this patch when you wanted qemu to have good
> > > behavior when kernel support for legacy dirty tracking is removed
> > > before we merge v2 support.  
> > 
> > Is wanting good behavior a controversial point?  Did we define this as
> > the desired good behavior?  Ref?    
> 
> As I said, this was intended as a NOP, which is what I thought we
> agreed to. Missing the SLA checking that existed before seems like
> something to fix in this patch.

But even if we have the SLA check, how does a management tool know that
pre-copy will be skipped and that the downtime needs to account for
that?  The current solution is obviously non-optimal and was mainly
meant for backwards compatibility; this one fails faster, with less
useless work, but it also provides less indication of how to configure
the migration to succeed.

> This is the discussion thread:
> 
> https://lore.kernel.org/kvm/20220324231159.GA11336@nvidia.com/
> 
>  "I guess I was assuming that enabling v2 migration in QEMU was dependent
>   on the existing type1 dirty tracking because it's the only means we
>   have to tell QEMU that all memory is perpetually dirty when we have a
>   DMA device.  Is that not correct?"

Doesn't sound like me asking for this patch so much as trying to figure
out how to support migration without DMA dirty tracking, and after the
above comment the concern was whether we lock ourselves into a dirty
tracking requirement in the iommufd type1 compatibility interface if we
rely on this type1 feature.  You had spit-balled that QEMU could skip
pre-copy, but this all needs to be vetted with migration folks and
management tools.

> The only point was to prepare qemu for kernel's that don't support the
> legacy dirty tracking API but do have a v2 migration interface. IIRC
> something undesired happens if you do that right now.

Per this patch, the container requires dirty tracking or we add a
blocker and can't migrate the device.
 
> We could also just dirty all memory in qemu and keep it exactly the
> same so every SLA detail works. Or completely block pre-copy based
> flows.
> 
> It it not intended to be a useful configuration, this is just covering
> off backwards compat issues - so I'm not keen to do a bunch of
> management work to support it.

Are we then deemphasizing the vfio compatibility support in iommufd?
If we really don't want to put effort towards backwards compatibility,
should migration support only be enabled with native iommufd support?
Thanks,

Alex
Juan Quintela May 18, 2022, 11:39 a.m. UTC | #9
Jason Gunthorpe <jgg@nvidia.com> wrote:
> On Tue, May 17, 2022 at 10:00:45AM -0600, Alex Williamson wrote:
>
>> > This is really intended to be a NOP from where things are now, as if
>> > you use mlx5 live migration without a patch like this then it causes a
>> > botched pre-copy since everything just ends up permanently dirty.
>> > 
>> > If it makes more sense we could abort the pre-copy too - in the end
>> > there will be dirty tracking so I don't know if I'd invest in a big
>> > adventure to fully define non-dirty tracking migration.
>> 
>> How is pre-copy currently "botched" without a patch like this?  If it's
>> simply that the pre-copy doesn't converge and the downtime constraints
>> don't allow the VM to enter stop-and-copy, that's the expected behavior
>> AIUI, and supports backwards compatibility with existing SLAs.
>
> It means it always fails - that certainly isn't working live
> migration. There is no point in trying to converge something that we
> already know will never converge.

Fully agree with you here.

But not how this is being done.  I think we need a way to convince the
migration code that it shouldn't even try to migrate RAM.  That would
fix the current use case, and your use case.

>> I'm assuming that by setting this new skip_precopy flag that we're
>> forcing the VM to move to stop-and-copy, regardless of any other SLA
>> constraints placed on the migration.  
>
> That does seem like a defect in this patch, any SLA constraints should
> still all be checked under the assumption all ram is dirty.

And how are we going to:
- detect the network link speed
- be sure that we are inside the downtime limit

I think that it is not possible, so basically we are skipping the precopy
stage and praying that the other bits are going to be ok.

>> It seems like a better solution would be to expose to management
>> tools that the VM contains a device that does not support the
>> pre-copy phase so that downtime expectations can be adjusted.
>
> I don't expect this to be a real use case though..
>
> Remember, you asked for this patch when you wanted qemu to have good
> behavior when kernel support for legacy dirty tracking is removed
> before we merge v2 support.

I am ignorant on the subject.  Can I ask how the dirty memory is
tracked on this v2?

Thanks, Juan.
Jason Gunthorpe May 18, 2022, 3:50 p.m. UTC | #10
On Wed, May 18, 2022 at 01:39:31PM +0200, Juan Quintela wrote:

> > That does seem like a defect in this patch, any SLA constraints should
> > still all be checked under the assumption all ram is dirty.
> 
> And how are we going to:
> - detect the network link speed
> - to be sure that we are inside downtime limit
> 
> I think that it is not possible, so basically we are skiping the precopy
> stage and praying that the other bits are going to be ok.

Like I keep saying, this is not a real use case, we expect dirty
tracking to be available in any real configuration. This is just
trying to make qemu work in some reasonable way if dirty tracking is
not available but a VFIO migration device is plugged in.

Just pick something simple that makes sense. Like if any SLA is set
then just refuse to even start. If no SLA then go directly to
STOP_COPY.

> >> It seems like a better solution would be to expose to management
> >> tools that the VM contains a device that does not support the
> >> pre-copy phase so that downtime expectations can be adjusted.
> >
> > I don't expect this to be a real use case though..
> >
> > Remember, you asked for this patch when you wanted qemu to have good
> > behavior when kernel support for legacy dirty tracking is removed
> > before we merge v2 support.
> 
> I am an ignorant on the subject.  Can I ask how the dirty memory is
> tracked on this v2?

These two RFCs are the current proposal to enable it for the system
iommu:

https://lore.kernel.org/kvm/20220428210933.3583-1-joao.m.martins@oracle.com

And for device internal trackers:

https://lore.kernel.org/kvm/20220501123301.127279-1-yishaih@nvidia.com/ 

Regards,
Jason
Jason Gunthorpe May 18, 2022, 5:01 p.m. UTC | #11
On Tue, May 17, 2022 at 09:46:56PM -0600, Alex Williamson wrote:

> The current solution is obviously non-optimal, it was mainly
> meant for backwards compatibility, but this seems like a fail faster
> solution, with less useless work, but also providing less indication
> how to configure the migration to succeed.

I don't think this is a production configuration. We should not be
expecting that the user is going to carefully fine tune some SLA
parameter. It may be the "SLA check" we are missing is simply if a SLA
exists or not.

> > It it not intended to be a useful configuration, this is just covering
> > off backwards compat issues - so I'm not keen to do a bunch of
> > management work to support it.
> 
> Are we then deemphasizing the vfio compatibility support in iommufd?

I'm viewing iommufd compatability with VFIO type 1 as only extending
to implemented features.. We are deleting NESTED for instance. As
there is no in-kernel implementation of the type 1 tracker I would
expect to eventually delete it as well once we have consensus this is
what we plan to do.

We discussed this in Joao's series and that was the consensus.

> If we really don't want to put effort towards backwards compatibility,
> should migration support only be enabled with native iommufd
> support?

I'm expecting that the dirty tracking will require native iommufd only
for the system iommu tracker, but the mlx5 on-device tracker will be
fine with either iommu back end.

It is still useful currently for testing the VFIO device part as it
will be some time until the other parts are ready, so I'd rather not
block it completely in qemu.

The goal is for qemu to do something sensible so when we delete kernel
support for type 1 dirty tracking so there is no back compat concern
for existing non-experimental qemu.

Jason
Joao Martins May 20, 2022, 10:58 a.m. UTC | #12
On 5/12/22 16:43, Avihai Horon wrote:
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 21e8f9d4d4..d4b6653026 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -863,10 +863,17 @@ int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
>      struct vfio_region_info *info = NULL;
>      int ret = -ENOTSUP;
>  
> -    if (!vbasedev->enable_migration || !container->dirty_pages_supported) {
> +    if (!vbasedev->enable_migration) {
>          goto add_blocker;
>      }
>  
> +    if (!container->dirty_pages_supported) {
> +        warn_report(
> +            "%s: IOMMU of the device's VFIO container doesn't support dirty page tracking, migration pre-copy phase will be skipped",
> +            vbasedev->name);

Maybe warn_report_once(...) given that the following N devices would observe the
same thing in theory.

> +        migrate_get_current()->skip_precopy = true;
> +    }
> +

Perhaps it might be easier to reuse the existing knob to disable pre-copy
per device that Kirti added some time ago, rather than changing migration core just
yet (e.g. you might wanna bail out of the migration preemptively because the CPU is
dirtying too many pages):

if (!container->dirty_pages_supported) {
    warn_report_once(...)
    pre_copy_dirty_page_tracking = ON_OFF_AUTO_OFF;
}

You might need to tackle the error messages you will get when the dirty_pages
start/stop ioctls fail. The errors are just because log_start() and log_stop() simply
fail because the ioctl doesn't exist, but everything else is fine -- at least that's
what I observed. It should be noted that it's a problem with the existing
`-device vfio-pci host=XX:YY.ZZ,x-pre-copy-dirty-page-tracking=true` regardless of this patch:

void vfio_listener_log_global_start()
{
	if (vfio_devices_all_dirty_tracking(container)) {
		vfio_set_dirty_page_tracking(container, true);
        }
}

... And same for vfio_listener_log_global_stop() -- maybe a worthwhile predecessor patch.
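A standalone sketch of what that predecessor patch could look like (this is my own simplified model, not QEMU code -- the struct and the boolean stand in for the container state and the ioctl side effect):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the suggested guard: only attempt the dirty-tracking
 * ioctl when the container reports support, so log_global_start()/stop() no
 * longer fail on containers whose IOMMU lacks the feature. */
typedef struct {
    bool dirty_pages_supported;
    bool dirty_tracking_enabled; /* stands in for the ioctl's side effect */
} VFIOContainerModel;

static bool vfio_devices_all_dirty_tracking(VFIOContainerModel *c)
{
    return c->dirty_pages_supported;
}

/* Returns true if the (stubbed) ioctl was attempted. */
static bool vfio_listener_log_global_start(VFIOContainerModel *c)
{
    if (!vfio_devices_all_dirty_tracking(c)) {
        return false; /* skip the ioctl instead of letting it fail */
    }
    c->dirty_tracking_enabled = true;
    return true;
}

static bool vfio_listener_log_global_stop(VFIOContainerModel *c)
{
    if (!vfio_devices_all_dirty_tracking(c)) {
        return false;
    }
    c->dirty_tracking_enabled = false;
    return true;
}
```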
Avihai Horon May 23, 2022, 6:11 a.m. UTC | #13
On 5/20/2022 1:58 PM, Joao Martins wrote:
> External email: Use caution opening links or attachments
>
>
> On 5/12/22 16:43, Avihai Horon wrote:
>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>> index 21e8f9d4d4..d4b6653026 100644
>> --- a/hw/vfio/migration.c
>> +++ b/hw/vfio/migration.c
>> @@ -863,10 +863,17 @@ int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
>>       struct vfio_region_info *info = NULL;
>>       int ret = -ENOTSUP;
>>
>> -    if (!vbasedev->enable_migration || !container->dirty_pages_supported) {
>> +    if (!vbasedev->enable_migration) {
>>           goto add_blocker;
>>       }
>>
>> +    if (!container->dirty_pages_supported) {
>> +        warn_report(
>> +            "%s: IOMMU of the device's VFIO container doesn't support dirty page tracking, migration pre-copy phase will be skipped",
>> +            vbasedev->name);
> Maybe warn_report_once(...) given that the following N devices would observe the
> same thing in theory.

Yes, you are right. Will change.

>> +        migrate_get_current()->skip_precopy = true;
>> +    }
>> +
> Perhaps it might be easier to reuse the existing knob to disable pre-copy
> per device that Kirti added some time ago, rather than changing migration core just
> yet (e.g. you might wanna bail of the migration preemptively because the CPU is dirtying
> too many pages for example?):
>
> if (!container->dirty_pages_supported) {
>      warn_report_once(...)
>      pre_copy_dirty_page_tracking = ON_OFF_AUTO_OFF;
> }

But this doesn't prevent the VFIO device from dirtying RAM pages during 
pre-copy phase.
The VFIO device can dirty RAM pages during pre-copy and it won't have a 
way to report the dirtied pages to QEMU and migration will be faulty.
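A toy model of that failure mode (all names here are illustrative, not QEMU API): pre-copy transfers every page, the device then DMA-writes a page, and with no dirty bitmap the change is never re-sent, so the destination keeps a stale copy. Copying everything at stop-copy, with the device halted, stays correct:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NPAGES 4

typedef struct {
    int src[NPAGES]; /* source RAM */
    int dst[NPAGES]; /* destination RAM */
} ToyMigration;

static void copy_all(ToyMigration *m)
{
    memcpy(m->dst, m->src, sizeof(m->src));
}

/* Pre-copy without dirty tracking: DMA after the copy is silently lost. */
static bool precopy_without_tracking(ToyMigration *m)
{
    copy_all(m);    /* pre-copy phase transfers all RAM */
    m->src[1] = 42; /* device DMA dirties a page; nobody records it */
    /* stop-copy: no dirty bitmap, so nothing more is sent */
    return memcmp(m->src, m->dst, sizeof(m->src)) == 0;
}

/* Skip pre-copy: no RAM is transferred until the device is stopped. */
static bool stop_copy_only(ToyMigration *m)
{
    m->src[1] = 42; /* device DMA happens while still running */
    copy_all(m);    /* all RAM sent in stop-copy, device halted */
    return memcmp(m->src, m->dst, sizeof(m->src)) == 0;
}
```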

Thanks.

>
> You might need to tackle the error messages you will get when the dirty_pages
> start/stop ioctls fail. The errors are just because log_start() and log_stop() simply
> fail because the ioctl doesn't exist, but everything else is fine -- at least that's what
> I observed. Should be noted that it's a problem with the existing
> `-device vfio-pci host=XX:YY.ZZ,x-pre-copy-dirty-page-tracking=true` regardless of this patch:
>
> void vfio_listener_log_global_start()
> {
>          if (vfio_devices_all_dirty_tracking(container)) {
>                  vfio_set_dirty_page_tracking(container, true);
>          }
> }
>
> ... And same for vfio_listener_log_global_stop() -- maybe a worthwhile predecessor patch.
Joao Martins May 23, 2022, 9:45 a.m. UTC | #14
On 5/23/22 07:11, Avihai Horon wrote:
> On 5/20/2022 1:58 PM, Joao Martins wrote:
>>> +        migrate_get_current()->skip_precopy = true;
>>> +    }
>>> +
>> Perhaps it might be easier to reuse the existing knob to disable pre-copy
>> per device that Kirti added some time ago, rather than changing migration core just
>> yet (e.g. you might wanna bail of the migration preemptively because the CPU is dirtying
>> too many pages for example?):
>>
>> if (!container->dirty_pages_supported) {
>>      warn_report_once(...)
>>      pre_copy_dirty_page_tracking = ON_OFF_AUTO_OFF;
>> }
> 
> But this doesn't prevent the VFIO device from dirtying RAM pages during 
> pre-copy phase.
> The VFIO device can dirty RAM pages during pre-copy and it won't have a 
> way to report the dirtied pages to QEMU and migration will be faulty.
> 

You're quite right, sorry for the noise. I was a little too obsessed with reusing the
existing field and forgot that if the iterate stage is allowed to run, the PCI device
could also be driven into dirtying RAM too.
Avihai Horon May 24, 2022, 3:11 p.m. UTC | #15
On 5/18/2022 6:50 PM, Jason Gunthorpe wrote:
> On Wed, May 18, 2022 at 01:39:31PM +0200, Juan Quintela wrote:
>
>>> That does seem like a defect in this patch, any SLA constraints should
>>> still all be checked under the assumption all ram is dirty.
>> And how are we going to:
>> - detect the network link speed
>> - to be sure that we are inside downtime limit
>>
>> I think that it is not possible, so basically we are skipping the precopy
>> stage and praying that the other bits are going to be ok.
> Like I keep saying, this is not a real use case, we expect dirty
> tracking to be available in any real configuration. This is just
> trying to make qemu work in some reasonable way if dirty tracking is
> not available but a VFIO migration device is plugged in.
>
> Just pick something simple that makes sense. Like if any SLA is set
> then just refuse to even start. If no SLA then go directly to
> STOP_COPY.

I tried to follow Jason's suggestion to check if there is any SLA and 
block migration, or no SLA at all and allow migration.

It doesn't seem like there is a way to say "no SLA at all".

Migration param downtime_limit takes values between 0 and 2000 seconds.

What does downtime_limit=0 mean?

If it's 0 then MigrationState->threshold_size = 0, so ram_save_pending() 
will never call migration_bitmap_sync_precopy(). Only upon entering 
stop-copy phase will migration_bitmap_sync_precopy() be called, which 
will collect all pages that were dirtied during pre-copy, which could be 
a lot and impact downtime.
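A simplified sketch of that completion check (my own model, not QEMU code): threshold_size is roughly bandwidth times downtime_limit, and migration moves to completion once the pending bytes fit under it. With downtime_limit == 0 the threshold is 0, so the pre-copy bitmap sync is effectively never triggered and completion only happens with nothing pending:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Roughly mirrors s->threshold_size = bandwidth * downtime_limit. */
static uint64_t threshold_size(uint64_t bandwidth_bytes_per_ms,
                               uint64_t downtime_limit_ms)
{
    return bandwidth_bytes_per_ms * downtime_limit_ms;
}

/* The iteration transitions to completion when pending fits the threshold. */
static bool should_complete(uint64_t pending_size, uint64_t threshold)
{
    return pending_size <= threshold;
}
```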

In libvirt it seems that downtime_limit = 0 is not valid and can't be set.

Am I missing something here? Is there a way to allow unlimited downtime?


Thanks.
Patch

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 21e8f9d4d4..d4b6653026 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -863,10 +863,17 @@  int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
     struct vfio_region_info *info = NULL;
     int ret = -ENOTSUP;
 
-    if (!vbasedev->enable_migration || !container->dirty_pages_supported) {
+    if (!vbasedev->enable_migration) {
         goto add_blocker;
     }
 
+    if (!container->dirty_pages_supported) {
+        warn_report(
+            "%s: IOMMU of the device's VFIO container doesn't support dirty page tracking, migration pre-copy phase will be skipped",
+            vbasedev->name);
+        migrate_get_current()->skip_precopy = true;
+    }
+
     ret = vfio_get_dev_region_info(vbasedev,
                                    VFIO_REGION_TYPE_MIGRATION_DEPRECATED,
                                    VFIO_REGION_SUBTYPE_MIGRATION_DEPRECATED,
diff --git a/migration/migration.c b/migration/migration.c
index 5a31b23bd6..668343508d 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -3593,6 +3593,11 @@  static MigIterateState migration_iteration_run(MigrationState *s)
     uint64_t pending_size, pend_pre, pend_compat, pend_post;
     bool in_postcopy = s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE;
 
+    if (s->skip_precopy) {
+        migration_completion(s);
+        return MIG_ITERATE_BREAK;
+    }
+
     qemu_savevm_state_pending(s->to_dst_file, s->threshold_size, &pend_pre,
                               &pend_compat, &pend_post);
     pending_size = pend_pre + pend_compat + pend_post;
diff --git a/migration/migration.h b/migration/migration.h
index a863032b71..876713e7e1 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -332,6 +332,9 @@  struct MigrationState {
      * This save hostname when out-going migration starts
      */
     char *hostname;
+
+    /* Whether to skip pre-copy phase of migration or not */
+    bool skip_precopy;
 };
 
 void migrate_set_state(int *state, int old_state, int new_state);