
[v11,08/11] vfio/migration: Implement VFIO migration protocol v2

Message ID: 20230216143630.25610-9-avihaih@nvidia.com (mailing list archive)
State: New, archived
Series: vfio/migration: Implement VFIO migration protocol v2

Commit Message

Avihai Horon Feb. 16, 2023, 2:36 p.m. UTC
Implement the basic mandatory part of VFIO migration protocol v2.
This includes all functionality that is necessary to support the
VFIO_MIGRATION_STOP_COPY part of the v2 protocol.

The two protocols, v1 and v2, will co-exist, and the v1 protocol code will be
removed in the following patches.

There are several main differences between the v1 and v2 protocols:
- VFIO device state is now represented as a finite state machine instead
  of a bitmap.

- Migration interface with the kernel is now done using the
  VFIO_DEVICE_FEATURE ioctl and normal read() and write() instead of the
  migration region (a short sketch follows this list).

- Pre-copy is made optional in the v2 protocol. Support for pre-copy will be
  added later on.
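
For illustration, here is a minimal sketch of what a v2 device state transition
looks like through the VFIO_DEVICE_FEATURE ioctl, based on the uAPI in
<linux/vfio.h>. The helper name is made up, and error handling plus the data_fd
used to actually read/write the migration data are omitted:

    #include <errno.h>
    #include <sys/ioctl.h>
    #include <linux/vfio.h>
    #include <glib.h>

    static int vfio_v2_set_state(int device_fd,
                                 enum vfio_device_mig_state new_state)
    {
        /* The feature header is followed by the mig_state payload. */
        size_t argsz = sizeof(struct vfio_device_feature) +
                       sizeof(struct vfio_device_feature_mig_state);
        g_autofree struct vfio_device_feature *feature = g_malloc0(argsz);
        struct vfio_device_feature_mig_state *mig_state =
            (struct vfio_device_feature_mig_state *)feature->data;

        feature->argsz = argsz;
        feature->flags = VFIO_DEVICE_FEATURE_SET |
                         VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE;
        mig_state->device_state = new_state; /* e.g. VFIO_DEVICE_STATE_STOP_COPY */

        if (ioctl(device_fd, VFIO_DEVICE_FEATURE, feature)) {
            return -errno;
        }

        return 0;
    }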

Detailed information about VFIO migration protocol v2 and its differences
compared to the v1 protocol can be found here [1].

[1]
https://lore.kernel.org/all/20220224142024.147653-10-yishaih@nvidia.com/

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 include/hw/vfio/vfio-common.h |   5 +
 hw/vfio/common.c              |  17 +-
 hw/vfio/migration.c           | 479 +++++++++++++++++++++++++++++++---
 hw/vfio/trace-events          |   7 +
 4 files changed, 469 insertions(+), 39 deletions(-)

Comments

Juan Quintela Feb. 16, 2023, 3:43 p.m. UTC | #1
Avihai Horon <avihaih@nvidia.com> wrote:

Reviewed-by: Juan Quintela <quintela@redhat.com>.


1st comment is a bug, but you just remove it anyway.


Not that it matters a lot in the end (you are removing v1 on the
following patch).

> @@ -438,7 +445,13 @@ static bool vfio_devices_all_running_and_mig_active(VFIOContainer *container)
>                  return false;
>              }
>  
> -            if (migration->device_state_v1 & VFIO_DEVICE_STATE_V1_RUNNING) {
> +            if (!migration->v2 &&
> +                migration->device_state_v1 & VFIO_DEVICE_STATE_V1_RUNNING) {
> +                continue;
> +            }

Are you missing here:


               } else {
                   return false;
               }

Or I am misunderstanding the code.


> +            if (migration->v2 &&
> +                migration->device_state == VFIO_DEVICE_STATE_RUNNING) {
>                  continue;
>              } else {
>                  return false;


[...]



> +/*
> + * This is an arbitrary size based on migration of mlx5 devices, where typically
> + * total device migration size is on the order of 100s of MB. Testing with
> + * larger values, e.g. 128MB and 1GB, did not show a performance improvement.
> + */
> +#define VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE (1 * MiB)

Just in case that it makes a difference.  You are using here a 1MB
buffer.
But qemu_file_get_to_fd() uses a 32KB buffer.



>      }
>  }
>  
> +static int vfio_query_stop_copy_size(VFIODevice *vbasedev,
> +                                     uint64_t *stop_copy_size)
> +{
> +    uint64_t buf[DIV_ROUND_UP(sizeof(struct vfio_device_feature) +
> +                              sizeof(struct vfio_device_feature_mig_data_size),
> +                              sizeof(uint64_t))] = {};
> +    struct vfio_device_feature *feature = (struct vfio_device_feature *)buf;

suggestion:

       size_t size = DIV_ROUND_UP(sizeof(struct vfio_device_feature) +
                                 sizeof(struct vfio_device_feature_mig_data_size),
                                 sizeof(uint64_t));
       g_autofree struct vfio_device_feature *feature = g_malloc0(size * sizeof(uint64_t));

I have to reread several times to see what was going on there.


And call it a day?
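
For reference, a sketch of the whole helper with that suggestion folded in.
The part below the allocation is an approximation following the usual
VFIO_DEVICE_FEATURE_GET pattern and the stop_copy_length field from
<linux/vfio.h>, so treat it as illustrative rather than a verbatim copy of
the patch:

    static int vfio_query_stop_copy_size(VFIODevice *vbasedev,
                                         uint64_t *stop_copy_size)
    {
        size_t argsz = sizeof(struct vfio_device_feature) +
                       sizeof(struct vfio_device_feature_mig_data_size);
        g_autofree struct vfio_device_feature *feature = g_malloc0(argsz);
        struct vfio_device_feature_mig_data_size *mig_data_size =
            (struct vfio_device_feature_mig_data_size *)feature->data;

        feature->argsz = argsz;
        feature->flags = VFIO_DEVICE_FEATURE_GET |
                         VFIO_DEVICE_FEATURE_MIG_DATA_SIZE;

        if (ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature)) {
            return -errno;
        }

        *stop_copy_size = mig_data_size->stop_copy_length;

        return 0;
    }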



Later, Juan.
Avihai Horon Feb. 16, 2023, 4:40 p.m. UTC | #2
On 16/02/2023 17:43, Juan Quintela wrote:
> Avihai Horon <avihaih@nvidia.com> wrote:
>
> Reviewed-by: Juan Quintela <quintela@redhat.com>.
>
>
> 1st comment is a bug, but you just remove it anyway.
>
>
> Not that it matters a lot in the end (you are removing v1 on the
> following patch).
>
>> @@ -438,7 +445,13 @@ static bool vfio_devices_all_running_and_mig_active(VFIOContainer *container)
>>                   return false;
>>               }
>>
>> -            if (migration->device_state_v1 & VFIO_DEVICE_STATE_V1_RUNNING) {
>> +            if (!migration->v2 &&
>> +                migration->device_state_v1 & VFIO_DEVICE_STATE_V1_RUNNING) {
>> +                continue;
>> +            }
> Are you missing here:
>
>
>                 } else {
>                     return false;
>                 }
>
> Or I am misunderstanding the code.

I think the code is OK:

If the device uses v1 migration and is running, the first if is true and 
we continue.
If the device uses v1 migration and is not running, the first if is 
false and the second if is false (since the device doesn't use v2 
migration) and we return false.

If the device uses v2 migration and is running, the first if is false 
and the second if is true, so we continue.
If the device uses v2 migration and is not running, first if is false 
and the second if is false, so we return false.
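
Put differently, the two ifs collapse into a single per-device predicate,
something like this sketch (the helper name is made up; the fields are the
ones used in the hunk above):

    static bool vfio_device_mig_running(VFIOMigration *migration)
    {
        if (migration->v2) {
            return migration->device_state == VFIO_DEVICE_STATE_RUNNING;
        }

        return migration->device_state_v1 & VFIO_DEVICE_STATE_V1_RUNNING;
    }

with the loop body then being
"if (!vfio_device_mig_running(migration)) { return false; }".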

>
>> +            if (migration->v2 &&
>> +                migration->device_state == VFIO_DEVICE_STATE_RUNNING) {
>>                   continue;
>>               } else {
>>                   return false;
>
> [...]
>
>
>
>> +/*
>> + * This is an arbitrary size based on migration of mlx5 devices, where typically
>> + * total device migration size is on the order of 100s of MB. Testing with
>> + * larger values, e.g. 128MB and 1GB, did not show a performance improvement.
>> + */
>> +#define VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE (1 * MiB)
> Just in case that it makes a difference.  You are using here a 1MB
> buffer.
> But qemu_file_get_to_fd() uses a 32KB buffer.

Yes, I'm aware of that.
Down the road we will address the performance subject more thoroughly.
In the meantime, it seems like a reasonable size according to the tests 
we did.

>
>
>>       }
>>   }
>>
>> +static int vfio_query_stop_copy_size(VFIODevice *vbasedev,
>> +                                     uint64_t *stop_copy_size)
>> +{
>> +    uint64_t buf[DIV_ROUND_UP(sizeof(struct vfio_device_feature) +
>> +                              sizeof(struct vfio_device_feature_mig_data_size),
>> +                              sizeof(uint64_t))] = {};
>> +    struct vfio_device_feature *feature = (struct vfio_device_feature *)buf;
> suggestion:
>
>         size_t size = DIV_ROUND_UP(sizeof(struct vfio_device_feature) +
>                                   sizeof(struct vfio_device_feature_mig_data_size),
>                                   sizeof(uint64_t));
>         g_autofree struct vfio_device_feature *feature = g_malloc0(size * sizeof(uint64_t));
>
> I have to reread several times to see what was going on there.
>
>
> And call it a day?

I don't have a strong opinion here.
We can do whatever you/Alex like more.

Thanks for reviewing!
Juan Quintela Feb. 16, 2023, 4:52 p.m. UTC | #3
Avihai Horon <avihaih@nvidia.com> wrote:
> On 16/02/2023 17:43, Juan Quintela wrote:
>> Avihai Horon <avihaih@nvidia.com> wrote:
>>
>> Reviewed-by: Juan Quintela <quintela@redhat.com>.
>>
>>
>> 1st comment is a bug, but you just remove it anyway.
>>
>>
>> Not that it matters a lot in the end (you are removing v1 on the
>> following patch).
>>
>>> @@ -438,7 +445,13 @@ static bool vfio_devices_all_running_and_mig_active(VFIOContainer *container)
>>>                   return false;
>>>               }
>>>
>>> -            if (migration->device_state_v1 & VFIO_DEVICE_STATE_V1_RUNNING) {
>>> +            if (!migration->v2 &&
>>> +                migration->device_state_v1 & VFIO_DEVICE_STATE_V1_RUNNING) {
>>> +                continue;
>>> +            }
>> Are you missing here:
>>
>>
>>                 } else {
>>                     return false;
>>                 }
>>
>> Or I am misunderstanding the code.
>
> I think the code is OK:
>
> If the device uses v1 migration and is running, the first if is true
> and we continue.
> If the device uses v1 migration and is not running, the first if is
> false and the second if is false (since the device doesn't use v2
> migration) and we return false.
>
> If the device uses v2 migration and is running, the first if is false
> and the second if is true, so we continue.
> If the device uses v2 migration and is not running, first if is false
> and the second if is false, so we return false.

You win O:-)

I was looking at C level, not at the "semantic" level.

>>
>>         size_t size = DIV_ROUND_UP(sizeof(struct vfio_device_feature) +
>>                                   sizeof(struct vfio_device_feature_mig_data_size),
>>                                   sizeof(uint64_t));
>>         g_autofree struct vfio_device_feature *feature = g_malloc0(size * sizeof(uint64_t));
>>
>> I have to reread several times to see what was going on there.
>>
>>
>> And call it a day?
>
> I don't have a strong opinion here.
> We can do whatever you/Alex like more.

I think it is more readable, and we don't (normally) put big chunks on
the stack, but that said, whoever wrote the code has more to say O:-)

Later, Juan.
Alex Williamson Feb. 16, 2023, 7:53 p.m. UTC | #4
On Thu, 16 Feb 2023 17:52:33 +0100
Juan Quintela <quintela@redhat.com> wrote:

> Avihai Horon <avihaih@nvidia.com> wrote:
> > On 16/02/2023 17:43, Juan Quintela wrote:  
> >> Avihai Horon <avihaih@nvidia.com> wrote:
> >>
> >> Reviewed-by: Juan Quintela <quintela@redhat.com>.
> >>
> >>
> >> 1st comment is a bug, but you just remove it anyway.
> >>
> >>
> >> Not that it matters a lot in the end (you are removing v1 on the
> >> following patch).
> >>  
> >>> @@ -438,7 +445,13 @@ static bool vfio_devices_all_running_and_mig_active(VFIOContainer *container)
> >>>                   return false;
> >>>               }
> >>>
> >>> -            if (migration->device_state_v1 & VFIO_DEVICE_STATE_V1_RUNNING) {
> >>> +            if (!migration->v2 &&
> >>> +                migration->device_state_v1 & VFIO_DEVICE_STATE_V1_RUNNING) {
> >>> +                continue;
> >>> +            }  
> >> Are you missing here:
> >>
> >>
> >>                 } else {
> >>                     return false;
> >>                 }
> >>
> >> Or I am misunderstanding the code.
> >
> > I think the code is OK:
> >
> > If the device uses v1 migration and is running, the first if is true
> > and we continue.
> > If the device uses v1 migration and is not running, the first if is
> > false and the second if is false (since the device doesn't use v2
> > migration) and we return false.
> >
> > If the device uses v2 migration and is running, the first if is false
> > and the second if is true, so we continue.
> > If the device uses v2 migration and is not running, first if is false
> > and the second if is false, so we return false.  
> 
> You win O:-)
> 
> I was looking at C level, not at the "semantic" level.
> 
> >>
> >>         size_t size = DIV_ROUND_UP(sizeof(struct vfio_device_feature) +
> >>                                   sizeof(struct vfio_device_feature_mig_data_size),
> >>                                   sizeof(uint64_t));
> >>         g_autofree struct vfio_device_feature *feature = g_malloc0(size * sizeof(uint64_t));
> >>
> >> I have to reread several times to see what was going on there.
> >>
> >>
> >> And call it a day?  
> >
> > I don't have a strong opinion here.
> > We can do whatever you/Alex like more.  
> 
> I think it is more readable, and we don't (normally) put big chunks on
> the stack, but that said, whoever wrote the code has more to say O:-)

This is not a huge amount of data on the stack, so I'm fine with it as
is.  Thanks,

Alex
Peter Xu Sept. 4, 2024, 1 p.m. UTC | #5
Hello, Avihai,

Reviving this thread just to discuss one issue below..

On Thu, Feb 16, 2023 at 04:36:27PM +0200, Avihai Horon wrote:
> +/*
> + * Migration size of VFIO devices can be as little as a few KBs or as big as
> + * many GBs. This value should be big enough to cover the worst case.
> + */
> +#define VFIO_MIG_STOP_COPY_SIZE (100 * GiB)
> +
> +/*
> + * Only exact function is implemented and not estimate function. The reason is
> + * that during pre-copy phase of migration the estimate function is called
> + * repeatedly while pending RAM size is over the threshold, thus migration
> + * can't converge and querying the VFIO device pending data size is useless.
> + */
> +static void vfio_state_pending_exact(void *opaque, uint64_t *must_precopy,
> +                                     uint64_t *can_postcopy)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    uint64_t stop_copy_size = VFIO_MIG_STOP_COPY_SIZE;
> +
> +    /*
> +     * If getting pending migration size fails, VFIO_MIG_STOP_COPY_SIZE is
> +     * reported so downtime limit won't be violated.
> +     */
> +    vfio_query_stop_copy_size(vbasedev, &stop_copy_size);
> +    *must_precopy += stop_copy_size;

Is this the chunk of data only can be copied during VM stopped?  If so, I
wonder why it's reported as "must precopy" if we know precopy won't ever
move them..

The issue is if with such reporting (and now in latest master branch we do
have the precopy size too, which was reported both in exact() and
estimate()), we can observe weird reports like this:

23411@1725380798968696657 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0)
23411@1725380799050766000 migrate_pending_exact exact pending size 21038628864 (pre = 21038628864 post=0)
23411@1725380799050896975 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0)
23411@1725380799138657103 migrate_pending_exact exact pending size 21040144384 (pre = 21040144384 post=0)
23411@1725380799140166709 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0)
23411@1725380799217246861 migrate_pending_exact exact pending size 21038628864 (pre = 21038628864 post=0)
23411@1725380799217384969 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0)
23411@1725380799305147722 migrate_pending_exact exact pending size 21039976448 (pre = 21039976448 post=0)
23411@1725380799306639956 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0)
23411@1725380799385118245 migrate_pending_exact exact pending size 21038796800 (pre = 21038796800 post=0)
23411@1725380799385709382 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0)

So estimate() keeps reporting zero but the exact() reports much larger, and
it keeps spinning like this.  I think that's not how it was designed to be
used..
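
For context, a simplified sketch of the loop that produces this ping-pong.
The wrapper name is invented and the real logic lives in
migration_iteration_run(), which differs in detail:

    static bool migration_near_convergence(uint64_t threshold_size)
    {
        uint64_t must_precopy = 0, can_postcopy = 0;

        /* Cheap query first; VFIO adds nothing here with this patch. */
        qemu_savevm_state_pending_estimate(&must_precopy, &can_postcopy);

        if (must_precopy + can_postcopy <= threshold_size) {
            /*
             * Looks close, so do the expensive dirty bitmap syncs plus the
             * exact query, where VFIO suddenly adds its stop-copy size.
             */
            qemu_savevm_state_pending_exact(&must_precopy, &can_postcopy);
        }

        /* With the traces above this is false again, and the loop repeats. */
        return must_precopy + can_postcopy <= threshold_size;
    }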

Does this stop copy size change for a VFIO device or not?

IIUC, we may want some other mechanism to report stop copy size for a
device, rather than reporting it with the current exact()/estimate() api.
That's, per my understanding, only used for iterable data, while
stop-copy-size may not fall into that category if so.

> +
> +    trace_vfio_state_pending_exact(vbasedev->name, *must_precopy, *can_postcopy,
> +                                   stop_copy_size);
> +}
Avihai Horon Sept. 4, 2024, 3:41 p.m. UTC | #6
On 04/09/2024 16:00, Peter Xu wrote:
> Hello, Avihai,
>
> Reviving this thread just to discuss one issue below..
>
> On Thu, Feb 16, 2023 at 04:36:27PM +0200, Avihai Horon wrote:
>> +/*
>> + * Migration size of VFIO devices can be as little as a few KBs or as big as
>> + * many GBs. This value should be big enough to cover the worst case.
>> + */
>> +#define VFIO_MIG_STOP_COPY_SIZE (100 * GiB)
>> +
>> +/*
>> + * Only exact function is implemented and not estimate function. The reason is
>> + * that during pre-copy phase of migration the estimate function is called
>> + * repeatedly while pending RAM size is over the threshold, thus migration
>> + * can't converge and querying the VFIO device pending data size is useless.
>> + */
>> +static void vfio_state_pending_exact(void *opaque, uint64_t *must_precopy,
>> +                                     uint64_t *can_postcopy)
>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +    uint64_t stop_copy_size = VFIO_MIG_STOP_COPY_SIZE;
>> +
>> +    /*
>> +     * If getting pending migration size fails, VFIO_MIG_STOP_COPY_SIZE is
>> +     * reported so downtime limit won't be violated.
>> +     */
>> +    vfio_query_stop_copy_size(vbasedev, &stop_copy_size);
>> +    *must_precopy += stop_copy_size;
> Is this the chunk of data only can be copied during VM stopped?  If so, I
> wonder why it's reported as "must precopy" if we know precopy won't ever
> move them..

A VFIO device that doesn't support precopy will send this data only when 
VM is stopped.
A VFIO device that supports precopy may or may not send this data (or 
part of it) during precopy, and it depends on the specific VFIO device.

According to state_pending_{estimate,exact} documentation, must_precopy 
is the amount of data that must be migrated before target starts, and 
indeed this VFIO data must be migrated before target starts.

>
> The issue is if with such reporting (and now in latest master branch we do
> have the precopy size too, which was reported both in exact() and
> estimate()), we can observe weird reports like this:
>
> 23411@1725380798968696657 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0)
> 23411@1725380799050766000 migrate_pending_exact exact pending size 21038628864 (pre = 21038628864 post=0)
> 23411@1725380799050896975 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0)
> 23411@1725380799138657103 migrate_pending_exact exact pending size 21040144384 (pre = 21040144384 post=0)
> 23411@1725380799140166709 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0)
> 23411@1725380799217246861 migrate_pending_exact exact pending size 21038628864 (pre = 21038628864 post=0)
> 23411@1725380799217384969 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0)
> 23411@1725380799305147722 migrate_pending_exact exact pending size 21039976448 (pre = 21039976448 post=0)
> 23411@1725380799306639956 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0)
> 23411@1725380799385118245 migrate_pending_exact exact pending size 21038796800 (pre = 21038796800 post=0)
> 23411@1725380799385709382 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0)
>
> So estimate() keeps reporting zero but the exact() reports much larger, and
> it keeps spinning like this.  I think that's not how it was designed to be
> used..

It keeps spinning and migration doesn't converge?
If so, configuring a higher downtime limit or the 
avail-switchover-bandwidth parameter may solve it.

>
> Does this stop copy size change for a VFIO device or not?

It depends on the specific VFIO device.
If the device supports precopy and all (or part) of its data is 
precopy-able, then stopcopy size will change.
Besides that, the amount of resources currently used by the VFIO device 
can also affect the stopcopy size, and it may increase or decrease as 
resources are created or destroyed.

> IIUC, we may want some other mechanism to report stop copy size for a
> device, rather than reporting it with the current exact()/estimate() api.
> That's, per my undertanding, only used for iterable data, while
> stop-copy-size may not fall into that category if so.

The above situation is caused by the fact that VFIO data may not be 
fully precopy-able (as opposed to RAM data).
I don't think reporting the stop-copy-size in a different API will help 
the above situation -- we would still have to take stop-copy-size into 
account before converging, to not violate downtime.

Thanks.

>
>> +
>> +    trace_vfio_state_pending_exact(vbasedev->name, *must_precopy, *can_postcopy,
>> +                                   stop_copy_size);
>> +}
> --
> Peter Xu
>
Peter Xu Sept. 4, 2024, 4:16 p.m. UTC | #7
On Wed, Sep 04, 2024 at 06:41:03PM +0300, Avihai Horon wrote:
> 
> On 04/09/2024 16:00, Peter Xu wrote:
> > Hello, Avihai,
> > 
> > Reviving this thread just to discuss one issue below..
> > 
> > On Thu, Feb 16, 2023 at 04:36:27PM +0200, Avihai Horon wrote:
> > > +/*
> > > + * Migration size of VFIO devices can be as little as a few KBs or as big as
> > > + * many GBs. This value should be big enough to cover the worst case.
> > > + */
> > > +#define VFIO_MIG_STOP_COPY_SIZE (100 * GiB)
> > > +
> > > +/*
> > > + * Only exact function is implemented and not estimate function. The reason is
> > > + * that during pre-copy phase of migration the estimate function is called
> > > + * repeatedly while pending RAM size is over the threshold, thus migration
> > > + * can't converge and querying the VFIO device pending data size is useless.
> > > + */
> > > +static void vfio_state_pending_exact(void *opaque, uint64_t *must_precopy,
> > > +                                     uint64_t *can_postcopy)
> > > +{
> > > +    VFIODevice *vbasedev = opaque;
> > > +    uint64_t stop_copy_size = VFIO_MIG_STOP_COPY_SIZE;
> > > +
> > > +    /*
> > > +     * If getting pending migration size fails, VFIO_MIG_STOP_COPY_SIZE is
> > > +     * reported so downtime limit won't be violated.
> > > +     */
> > > +    vfio_query_stop_copy_size(vbasedev, &stop_copy_size);
> > > +    *must_precopy += stop_copy_size;
> > Is this the chunk of data only can be copied during VM stopped?  If so, I
> > wonder why it's reported as "must precopy" if we know precopy won't ever
> > move them..
> 
> A VFIO device that doesn't support precopy will send this data only when VM
> is stopped.
> A VFIO device that supports precopy may or may not send this data (or part
> of it) during precopy, and it depends on the specific VFIO device.
> 
> According to state_pending_{estimate,exact} documentation, must_precopy is
> the amount of data that must be migrated before target starts, and indeed
> this VFIO data must be migrated before target starts.

Correct.  It's just that estimate/exact API will be more suitable when the
exact() gets the report closer to estimate(), while the estimate() is only
a fast-path of exact().  It'll cause some chaos like this if it doesn't do
as so.

Since we have a discussion elsewhere on the fact that we currently ignore
non-iterative device states when migration decides to switch over, I was
wondering, while replying, whether this stop-size could be the first thing
that will start to provide such non-iterable pending data.  But that might
indeed need more thoughts, at least we may want to collect more outliers of
non-iterative device states outside VFIO that can cause downtime to be too
large.

> 
> > 
> > The issue is if with such reporting (and now in latest master branch we do
> > have the precopy size too, which was reported both in exact() and
> > estimate()), we can observe weird reports like this:
> > 
> > 23411@1725380798968696657 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0)
> > 23411@1725380799050766000 migrate_pending_exact exact pending size 21038628864 (pre = 21038628864 post=0)
> > 23411@1725380799050896975 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0)
> > 23411@1725380799138657103 migrate_pending_exact exact pending size 21040144384 (pre = 21040144384 post=0)
> > 23411@1725380799140166709 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0)
> > 23411@1725380799217246861 migrate_pending_exact exact pending size 21038628864 (pre = 21038628864 post=0)
> > 23411@1725380799217384969 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0)
> > 23411@1725380799305147722 migrate_pending_exact exact pending size 21039976448 (pre = 21039976448 post=0)
> > 23411@1725380799306639956 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0)
> > 23411@1725380799385118245 migrate_pending_exact exact pending size 21038796800 (pre = 21038796800 post=0)
> > 23411@1725380799385709382 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0)
> > 
> > So estimate() keeps reporting zero but the exact() reports much larger, and
> > it keeps spinning like this.  I think that's not how it was designed to be
> > used..
> 
> It keeps spinning and migration doesn't converge?
> If so, configuring a higher downtime limit or the avail-switchover-bandwidth
> parameter may solve it.

Yes, this is the only way to go, but it's a separate issue on reportings of
estimate()/exact().  More below.

> 
> > 
> > Does this stop copy size change for a VFIO device or not?
> 
> It depends on the specific VFIO device.
> If the device supports precopy and all (or part) of its data is
> precopy-able, then stopcopy size will change.
> Besides that, the amount of resources currently used by the VFIO device can
> also affect the stopcopy size, and it may increase or decrease as resources
> are created or destroyed.

I see, thanks.

> 
> > IIUC, we may want some other mechanism to report stop copy size for a
> > device, rather than reporting it with the current exact()/estimate() api.
> That's, per my understanding, only used for iterable data, while
> > stop-copy-size may not fall into that category if so.
> 
> The above situation is caused by the fact that VFIO data may not be fully
> precopy-able (as opposed to RAM data).
> I don't think reporting the stop-copy-size in a different API will help the
> above situation -- we would still have to take stop-copy-size into account
> before converging, to not violate downtime.

It will help some other situation, though.

One issue with above frequent estimate()/exact() call is that QEMU will go
into madness loop thinking "we're close" and "we're far away from converge"
even if the reality is "we're far away". The bad side effect is when this
loop happens it'll not only affect VFIO but also other devices (e.g. KVM,
vhost, etc.) so we'll do high overhead sync() in an extremely frequent
manner.  IMHO they're totally a waste of resource, because all the rest of
the modules are following the default rules of estimate()/exact().

One simple but efficient fix for VFIO, IMHO, is at least VFIO should also
cache the stop-size internally and report in estimate(), e.g.:

/* Give an estimate of the amount left to be transferred,
 * the result is split into the amount for units that can and
 * for units that can't do postcopy.
 */
void qemu_savevm_state_pending_estimate(uint64_t *must_precopy,
                                        uint64_t *can_postcopy)
{
}

If it's justified that the stop-size to be reported in exact(), it should
also be reported in estimate() per the comment above.  It should also fall
into precopy category in this case.

Then with that we should avoid calling exact() frequently for not only VFIO
but also others (especially, KVM GET_DIRTY_LOG / CLEAR_DIRTY_LOG ioctls),
then we know it won't converge anyway without the help of tuning downtime
upper, or adjust avail-switchover-bandwidth.
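
A rough sketch of that idea: cache the value queried by exact() and report it
from the cheap estimate() callback as well. The stop_copy_size_cache field is
hypothetical, and the exact() body follows the handler quoted earlier in the
thread:

    static void vfio_state_pending_estimate(void *opaque, uint64_t *must_precopy,
                                            uint64_t *can_postcopy)
    {
        VFIODevice *vbasedev = opaque;

        /* Cheap path: reuse the last value queried by exact(). */
        *must_precopy += vbasedev->migration->stop_copy_size_cache;
    }

    static void vfio_state_pending_exact(void *opaque, uint64_t *must_precopy,
                                         uint64_t *can_postcopy)
    {
        VFIODevice *vbasedev = opaque;
        uint64_t stop_copy_size = VFIO_MIG_STOP_COPY_SIZE;

        vfio_query_stop_copy_size(vbasedev, &stop_copy_size);
        /* stop_copy_size_cache would be a new VFIOMigration field. */
        vbasedev->migration->stop_copy_size_cache = stop_copy_size;

        *must_precopy += stop_copy_size;
    }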

This may improve situation but still leave one other issue, that IIUC even
with above change and even if we can avoid sync dirty bitmap frequently,
the migration thread can still spin at 100% calling estimate() and keep
seeing data (which is not iterable...).  For the longer term we may still
need to report non-iterable stop-size in another way so QEMU knows that
iterate() over all the VMState registers won't help in this case, so it
should go into sleep without eating the cores.  I hope that explains why I
think a new API should be still needed for the long run.

Thanks,
Avihai Horon Sept. 5, 2024, 11:41 a.m. UTC | #8
On 04/09/2024 19:16, Peter Xu wrote:
> On Wed, Sep 04, 2024 at 06:41:03PM +0300, Avihai Horon wrote:
>> On 04/09/2024 16:00, Peter Xu wrote:
>>> Hello, Avihai,
>>>
>>> Reviving this thread just to discuss one issue below..
>>>
>>> On Thu, Feb 16, 2023 at 04:36:27PM +0200, Avihai Horon wrote:
>>>> +/*
>>>> + * Migration size of VFIO devices can be as little as a few KBs or as big as
>>>> + * many GBs. This value should be big enough to cover the worst case.
>>>> + */
>>>> +#define VFIO_MIG_STOP_COPY_SIZE (100 * GiB)
>>>> +
>>>> +/*
>>>> + * Only exact function is implemented and not estimate function. The reason is
>>>> + * that during pre-copy phase of migration the estimate function is called
>>>> + * repeatedly while pending RAM size is over the threshold, thus migration
>>>> + * can't converge and querying the VFIO device pending data size is useless.
>>>> + */
>>>> +static void vfio_state_pending_exact(void *opaque, uint64_t *must_precopy,
>>>> +                                     uint64_t *can_postcopy)
>>>> +{
>>>> +    VFIODevice *vbasedev = opaque;
>>>> +    uint64_t stop_copy_size = VFIO_MIG_STOP_COPY_SIZE;
>>>> +
>>>> +    /*
>>>> +     * If getting pending migration size fails, VFIO_MIG_STOP_COPY_SIZE is
>>>> +     * reported so downtime limit won't be violated.
>>>> +     */
>>>> +    vfio_query_stop_copy_size(vbasedev, &stop_copy_size);
>>>> +    *must_precopy += stop_copy_size;
>>> Is this the chunk of data only can be copied during VM stopped?  If so, I
>>> wonder why it's reported as "must precopy" if we know precopy won't ever
>>> move them..
>> A VFIO device that doesn't support precopy will send this data only when VM
>> is stopped.
>> A VFIO device that supports precopy may or may not send this data (or part
>> of it) during precopy, and it depends on the specific VFIO device.
>>
>> According to state_pending_{estimate,exact} documentation, must_precopy is
>> the amount of data that must be migrated before target starts, and indeed
>> this VFIO data must be migrated before target starts.
> Correct.  It's just that estimate/exact API will be more suitable when the
> exact() gets the report closer to estimate(), while the estimate() is only
> a fast-path of exact().  It'll cause some chaos like this if it doesn't do
> as so.
>
> Since we have discussion elsewhere on the fact that we currently ignore
> non-iterative device states during migration decides to switchover, I was
> wondering when reply on whether this stop-size could be the first thing
> that will start to provide such non-iterable pending data.  But that might
> indeed need more thoughts, at least we may want to collect more outliers of
> non-iterative device states outside VFIO that can cause downtime to be too
> large.

Ah, I see.

>
>>> The issue is if with such reporting (and now in latest master branch we do
>>> have the precopy size too, which was reported both in exact() and
>>> estimate()), we can observe weird reports like this:
>>>
>>> 23411@1725380798968696657 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0)
>>> 23411@1725380799050766000 migrate_pending_exact exact pending size 21038628864 (pre = 21038628864 post=0)
>>> 23411@1725380799050896975 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0)
>>> 23411@1725380799138657103 migrate_pending_exact exact pending size 21040144384 (pre = 21040144384 post=0)
>>> 23411@1725380799140166709 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0)
>>> 23411@1725380799217246861 migrate_pending_exact exact pending size 21038628864 (pre = 21038628864 post=0)
>>> 23411@1725380799217384969 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0)
>>> 23411@1725380799305147722 migrate_pending_exact exact pending size 21039976448 (pre = 21039976448 post=0)
>>> 23411@1725380799306639956 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0)
>>> 23411@1725380799385118245 migrate_pending_exact exact pending size 21038796800 (pre = 21038796800 post=0)
>>> 23411@1725380799385709382 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0)
>>>
>>> So estimate() keeps reporting zero but the exact() reports much larger, and
>>> it keeps spinning like this.  I think that's not how it was designed to be
>>> used..
>> It keeps spinning and migration doesn't converge?
>> If so, configuring a higher downtime limit or the avail-switchover-bandwidth
>> parameter may solve it.
> Yes, this is the only way to go, but it's a separate issue on reportings of
> estimate()/exact().  More below.
>
>>> Does this stop copy size change for a VFIO device or not?
>> It depends on the specific VFIO device.
>> If the device supports precopy and all (or part) of its data is
>> precopy-able, then stopcopy size will change.
>> Besides that, the amount of resources currently used by the VFIO device can
>> also affect the stopcopy size, and it may increase or decrease as resources
>> are created or destroyed.
> I see, thanks.
>
>>> IIUC, we may want some other mechanism to report stop copy size for a
>>> device, rather than reporting it with the current exact()/estimate() api.
>>> That's, per my understanding, only used for iterable data, while
>>> stop-copy-size may not fall into that category if so.
>> The above situation is caused by the fact that VFIO data may not be fully
>> precopy-able (as opposed to RAM data).
>> I don't think reporting the stop-copy-size in a different API will help the
>> above situation -- we would still have to take stop-copy-size into account
>> before converging, to not violate downtime.
> It will help some other situation, though.
>
> One issue with above frequent estimate()/exact() call is that QEMU will go
> into madness loop thinking "we're close" and "we're far away from converge"
> even if the reality is "we're far away". The bad side effect is when this
> loop happens it'll not only affect VFIO but also other devices (e.g. KVM,
> vhost, etc.) so we'll do high overhead sync() in an extremely frequent
> manner.  IMHO they're totally a waste of resource, because all the rest of
> the modules are following the default rules of estimate()/exact().
>
> One simple but efficient fix for VFIO, IMHO, is at least VFIO should also
> cache the stop-size internally and report in estimate(), e.g.:
>
> /* Give an estimate of the amount left to be transferred,
>   * the result is split into the amount for units that can and
>   * for units that can't do postcopy.
>   */
> void qemu_savevm_state_pending_estimate(uint64_t *must_precopy,
>                                          uint64_t *can_postcopy)
> {
> }
>
> If it's justified that the stop-size to be reported in exact(), it should
> also be reported in estimate() per the comment above.  It should also fall
> into precopy category in this case.
>
> Then with that we should avoid calling exact() frequently for not only VFIO
> but also others (especially, KVM GET_DIRTY_LOG / CLEAR_DIRTY_LOG ioctls),
> then we know it won't converge anyway without the help of tuning downtime
> upper, or adjust avail-switchover-bandwidth.

Yes, it will solve this problem, but IIRC, the reason I didn't cache and 
return stop-copy-size in estimate from the first place was that it 
wasn't fully precopy-able, and that could cause some problems:
Once we cache and report this size in estimate, it may not decrease 
anymore -- we can't send it during precopy and we might not be able to 
call get stop-copy-size again from the exact path (because estimate 
might not reach below the threshold).

For example, assume the threshold for convergence is 1GB.
A VFIO device may report 2GB stop-copy-size (not precopy-able) in the 
beginning of the migration.
If the VFIO device stop-copy-size changes in the middle of migration 
(e.g., some of its resources are destroyed) and drops to 900MB, we will 
still see 2GB report in estimate.
Only when calling the exact handler again we will get the updated 900MB 
value and be able to converge. But that won't happen, because estimate 
size will always be above the 1GB threshold.

We could prevent these situations and call the get stop-copy-size once 
every X seconds, but it feels a bit awkward.

>
> This may improve situation but still leave one other issue, that IIUC even
> with above change and even if we can avoid sync dirty bitmap frequently,
> the migration thread can still spin at 100% calling estimate() and keep
> seeing data (which is not iterable...).  For the longer term we may still
> need to report non-iterable stop-size in another way so QEMU knows that
> iterate() over all the VMState registers won't help in this case, so it
> should go into sleep without eating the cores.  I hope that explains why I
> think a new API should be still needed for the long run.

Yes, I get your point.
But is the spinning case very common? AFAIU, if it happens, it may 
indicate some misconfiguration of the migration parameters.

Anyway, I think your idea may fit VFIO, but still need to think about 
all the small details.

Thanks.
Peter Xu Sept. 5, 2024, 3:17 p.m. UTC | #9
On Thu, Sep 05, 2024 at 02:41:09PM +0300, Avihai Horon wrote:
> 
> On 04/09/2024 19:16, Peter Xu wrote:
> > On Wed, Sep 04, 2024 at 06:41:03PM +0300, Avihai Horon wrote:
> > > On 04/09/2024 16:00, Peter Xu wrote:
> > > > Hello, Avihai,
> > > > 
> > > > Reviving this thread just to discuss one issue below..
> > > > 
> > > > On Thu, Feb 16, 2023 at 04:36:27PM +0200, Avihai Horon wrote:
> > > > > +/*
> > > > > + * Migration size of VFIO devices can be as little as a few KBs or as big as
> > > > > + * many GBs. This value should be big enough to cover the worst case.
> > > > > + */
> > > > > +#define VFIO_MIG_STOP_COPY_SIZE (100 * GiB)
> > > > > +
> > > > > +/*
> > > > > + * Only exact function is implemented and not estimate function. The reason is
> > > > > + * that during pre-copy phase of migration the estimate function is called
> > > > > + * repeatedly while pending RAM size is over the threshold, thus migration
> > > > > + * can't converge and querying the VFIO device pending data size is useless.
> > > > > + */
> > > > > +static void vfio_state_pending_exact(void *opaque, uint64_t *must_precopy,
> > > > > +                                     uint64_t *can_postcopy)
> > > > > +{
> > > > > +    VFIODevice *vbasedev = opaque;
> > > > > +    uint64_t stop_copy_size = VFIO_MIG_STOP_COPY_SIZE;
> > > > > +
> > > > > +    /*
> > > > > +     * If getting pending migration size fails, VFIO_MIG_STOP_COPY_SIZE is
> > > > > +     * reported so downtime limit won't be violated.
> > > > > +     */
> > > > > +    vfio_query_stop_copy_size(vbasedev, &stop_copy_size);
> > > > > +    *must_precopy += stop_copy_size;
> > > > Is this the chunk of data only can be copied during VM stopped?  If so, I
> > > > wonder why it's reported as "must precopy" if we know precopy won't ever
> > > > move them..
> > > A VFIO device that doesn't support precopy will send this data only when VM
> > > is stopped.
> > > A VFIO device that supports precopy may or may not send this data (or part
> > > of it) during precopy, and it depends on the specific VFIO device.
> > > 
> > > According to state_pending_{estimate,exact} documentation, must_precopy is
> > > the amount of data that must be migrated before target starts, and indeed
> > > this VFIO data must be migrated before target starts.
> > Correct.  It's just that estimate/exact API will be more suitable when the
> > exact() gets the report closer to estimate(), while the estimate() is only
> > a fast-path of exact().  It'll cause some chaos like this if it doesn't do
> > as so.
> > 
> > Since we have discussion elsewhere on the fact that we currently ignore
> > non-iterative device states during migration decides to switchover, I was
> > wondering when reply on whether this stop-size could be the first thing
> > that will start to provide such non-iterable pending data.  But that might
> > indeed need more thoughts, at least we may want to collect more outliers of
> > non-iterative device states outside VFIO that can cause downtime to be too
> > large.
> 
> Ah, I see.
> 
> > 
> > > > The issue is if with such reporting (and now in latest master branch we do
> > > > have the precopy size too, which was reported both in exact() and
> > > > estimate()), we can observe weird reports like this:
> > > > 
> > > > 23411@1725380798968696657 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0)
> > > > 23411@1725380799050766000 migrate_pending_exact exact pending size 21038628864 (pre = 21038628864 post=0)
> > > > 23411@1725380799050896975 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0)
> > > > 23411@1725380799138657103 migrate_pending_exact exact pending size 21040144384 (pre = 21040144384 post=0)
> > > > 23411@1725380799140166709 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0)
> > > > 23411@1725380799217246861 migrate_pending_exact exact pending size 21038628864 (pre = 21038628864 post=0)
> > > > 23411@1725380799217384969 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0)
> > > > 23411@1725380799305147722 migrate_pending_exact exact pending size 21039976448 (pre = 21039976448 post=0)
> > > > 23411@1725380799306639956 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0)
> > > > 23411@1725380799385118245 migrate_pending_exact exact pending size 21038796800 (pre = 21038796800 post=0)
> > > > 23411@1725380799385709382 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0)
> > > > 
> > > > So estimate() keeps reporting zero but the exact() reports much larger, and
> > > > it keeps spinning like this.  I think that's not how it was designed to be
> > > > used..
> > > It keeps spinning and migration doesn't converge?
> > > If so, configuring a higher downtime limit or the avail-switchover-bandwidth
> > > parameter may solve it.
> > Yes, this is the only way to go, but it's a separate issue on reportings of
> > estimate()/exact().  More below.
> > 
> > > > Does this stop copy size change for a VFIO device or not?
> > > It depends on the specific VFIO device.
> > > If the device supports precopy and all (or part) of its data is
> > > precopy-able, then stopcopy size will change.
> > > Besides that, the amount of resources currently used by the VFIO device can
> > > also affect the stopcopy size, and it may increase or decrease as resources
> > > are created or destroyed.
> > I see, thanks.
> > 
> > > > IIUC, we may want some other mechanism to report stop copy size for a
> > > > device, rather than reporting it with the current exact()/estimate() api.
> > > > That's, per my understanding, only used for iterable data, while
> > > > stop-copy-size may not fall into that category if so.
> > > The above situation is caused by the fact that VFIO data may not be fully
> > > precopy-able (as opposed to RAM data).
> > > I don't think reporting the stop-copy-size in a different API will help the
> > > above situation -- we would still have to take stop-copy-size into account
> > > before converging, to not violate downtime.
> > It will help some other situation, though.
> > 
> > One issue with above frequent estimate()/exact() call is that QEMU will go
> > into madness loop thinking "we're close" and "we're far away from converge"
> > even if the reality is "we're far away". The bad side effect is when this
> > loop happens it'll not only affect VFIO but also other devices (e.g. KVM,
> > vhost, etc.) so we'll do high overhead sync() in an extremely frequent
> > manner.  IMHO they're totally a waste of resource, because all the rest of
> > the modules are following the default rules of estimate()/exact().
> > 
> > One simple but efficient fix for VFIO, IMHO, is at least VFIO should also
> > cache the stop-size internally and report in estimate(), e.g.:
> > 
> > /* Give an estimate of the amount left to be transferred,
> >   * the result is split into the amount for units that can and
> >   * for units that can't do postcopy.
> >   */
> > void qemu_savevm_state_pending_estimate(uint64_t *must_precopy,
> >                                          uint64_t *can_postcopy)
> > {
> > }
> > 
> > If it's justified that the stop-size to be reported in exact(), it should
> > also be reported in estimate() per the comment above.  It should also fall
> > into precopy category in this case.
> > 
> > Then with that we should avoid calling exact() frequently for not only VFIO
> > but also others (especially, KVM GET_DIRTY_LOG / CLEAR_DIRTY_LOG ioctls),
> > then we know it won't converge anyway without the help of tuning downtime
> > upper, or adjust avail-switchover-bandwidth.
> 
> Yes, it will solve this problem, but IIRC, the reason I didn't cache and
> return stop-copy-size in estimate from the first place was that it wasn't
> fully precopy-able, and that could cause some problems:
> Once we cache and report this size in estimate, it may not decrease anymore
> -- we can't send it during precopy and we might not be able to call get
> stop-copy-size again from the exact path (because estimate might not reach
> below the threshold).
> 
> For example, assume the threshold for convergence is 1GB.
> A VFIO device may report 2GB stop-copy-size (not precopy-able) in the
> beginning of the migration.
> If the VFIO device stop-copy-size changes in the middle of migration (e.g.,
> some of its resources are destroyed) and drops to 900MB, we will still see
> 2GB report in estimate.
> Only when calling the exact handler again we will get the updated 900MB
> value and be able to converge. But that won't happen, because estimate size
> will always be above the 1GB threshold.
> 
> We could prevent these situations and call the get stop-copy-size once every
> X seconds, but it feels a bit awkward.

IMHO the confusion is caused by an unclear definition of stop-size and
precopy-size in VFIO terms.

What I would hope is the stop-size reported should be constant and not be
affected by precopy process happening.  Whatever _can_ change at all should
be reported as precopy-size.

So I wonder why stop-size can change from a driver, and whether that can be
reported in a more predictable fashion.  Otherwise I see little point in
providing both stop-size and precopy-size, since we'll always add them up
into VFIO's total pending-size.  The line on which data falls into which
bucket doesn't seem clear to me.

> 
> > 
> > This may improve situation but still leave one other issue, that IIUC even
> > with above change and even if we can avoid sync dirty bitmap frequently,
> > the migration thread can still spin at 100% calling estimate() and keep
> > seeing data (which is not iterable...).  For the longer term we may still
> > need to report non-iterable stop-size in another way so QEMU knows that
> > iterate() over all the VMState registers won't help in this case, so it
> > should go into sleep without eating the cores.  I hope that explains why I
> > think a new API should be still needed for the long run.
> 
> Yes, I get your point.
> But is the spinning case very common? AFAIU, if it happens, it may indicate
> some misconfiguration of the migration parameters.

Agreed.

It's just that our QE actively tests such migrations and there're always
weird and similar reports on cpu spinning, so it might be nice to still fix
it at some point.

This shouldn't be super urgent, indeed.  It's just that it can start to
make sense when we have other discussions where reporting non-iterable
pending data might be pretty useful on its own as an idea.  It's just that
we need to figure out above on how VFIO can report predictable stop-size in
the drivers, and I wonder whether the drivers can be fixed (without
breaking existing qemu; after all they currently always add both
counters..).

> 
> Anyway, I think your idea may fit VFIO, but still need to think about all
> the small details.

Thanks,
Avihai Horon Sept. 5, 2024, 4:07 p.m. UTC | #10
On 05/09/2024 18:17, Peter Xu wrote:
> On Thu, Sep 05, 2024 at 02:41:09PM +0300, Avihai Horon wrote:
>> On 04/09/2024 19:16, Peter Xu wrote:
>>> On Wed, Sep 04, 2024 at 06:41:03PM +0300, Avihai Horon wrote:
>>>> On 04/09/2024 16:00, Peter Xu wrote:
>>>>> Hello, Avihai,
>>>>>
>>>>> Reviving this thread just to discuss one issue below..
>>>>>
>>>>> On Thu, Feb 16, 2023 at 04:36:27PM +0200, Avihai Horon wrote:
>>>>>> +/*
>>>>>> + * Migration size of VFIO devices can be as little as a few KBs or as big as
>>>>>> + * many GBs. This value should be big enough to cover the worst case.
>>>>>> + */
>>>>>> +#define VFIO_MIG_STOP_COPY_SIZE (100 * GiB)
>>>>>> +
>>>>>> +/*
>>>>>> + * Only exact function is implemented and not estimate function. The reason is
>>>>>> + * that during pre-copy phase of migration the estimate function is called
>>>>>> + * repeatedly while pending RAM size is over the threshold, thus migration
>>>>>> + * can't converge and querying the VFIO device pending data size is useless.
>>>>>> + */
>>>>>> +static void vfio_state_pending_exact(void *opaque, uint64_t *must_precopy,
>>>>>> +                                     uint64_t *can_postcopy)
>>>>>> +{
>>>>>> +    VFIODevice *vbasedev = opaque;
>>>>>> +    uint64_t stop_copy_size = VFIO_MIG_STOP_COPY_SIZE;
>>>>>> +
>>>>>> +    /*
>>>>>> +     * If getting pending migration size fails, VFIO_MIG_STOP_COPY_SIZE is
>>>>>> +     * reported so downtime limit won't be violated.
>>>>>> +     */
>>>>>> +    vfio_query_stop_copy_size(vbasedev, &stop_copy_size);
>>>>>> +    *must_precopy += stop_copy_size;
>>>>> Is this the chunk of data only can be copied during VM stopped?  If so, I
>>>>> wonder why it's reported as "must precopy" if we know precopy won't ever
>>>>> move them..
>>>> A VFIO device that doesn't support precopy will send this data only when VM
>>>> is stopped.
>>>> A VFIO device that supports precopy may or may not send this data (or part
>>>> of it) during precopy, and it depends on the specific VFIO device.
>>>>
>>>> According to state_pending_{estimate,exact} documentation, must_precopy is
>>>> the amount of data that must be migrated before target starts, and indeed
>>>> this VFIO data must be migrated before target starts.
>>> Correct.  It's just that estimate/exact API will be more suitable when the
>>> exact() gets the report closer to estimate(), while the estimate() is only
>>> a fast-path of exact().  It'll cause some chaos like this if it doesn't do
>>> as so.
>>>
>>> Since we have discussion elsewhere on the fact that we currently ignore
>>> non-iterative device states during migration decides to switchover, I was
>>> wondering when reply on whether this stop-size could be the first thing
>>> that will start to provide such non-iterable pending data.  But that might
>>> indeed need more thoughts, at least we may want to collect more outliers of
>>> non-iterative device states outside VFIO that can cause downtime to be too
>>> large.
>> Ah, I see.
>>
>>>>> The issue is if with such reporting (and now in latest master branch we do
>>>>> have the precopy size too, which was reported both in exact() and
>>>>> estimate()), we can observe weird reports like this:
>>>>>
>>>>> 23411@1725380798968696657 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0)
>>>>> 23411@1725380799050766000 migrate_pending_exact exact pending size 21038628864 (pre = 21038628864 post=0)
>>>>> 23411@1725380799050896975 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0)
>>>>> 23411@1725380799138657103 migrate_pending_exact exact pending size 21040144384 (pre = 21040144384 post=0)
>>>>> 23411@1725380799140166709 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0)
>>>>> 23411@1725380799217246861 migrate_pending_exact exact pending size 21038628864 (pre = 21038628864 post=0)
>>>>> 23411@1725380799217384969 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0)
>>>>> 23411@1725380799305147722 migrate_pending_exact exact pending size 21039976448 (pre = 21039976448 post=0)
>>>>> 23411@1725380799306639956 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0)
>>>>> 23411@1725380799385118245 migrate_pending_exact exact pending size 21038796800 (pre = 21038796800 post=0)
>>>>> 23411@1725380799385709382 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0)
>>>>>
>>>>> So estimate() keeps reporting zero but the exact() reports much larger, and
>>>>> it keeps spinning like this.  I think that's not how it was designed to be
>>>>> used..
>>>> It keeps spinning and migration doesn't converge?
>>>> If so, configuring a higher downtime limit or the avail-switchover-bandwidth
>>>> parameter may solve it.
>>> Yes, this is the only way to go, but it's a separate issue on reportings of
>>> estimate()/exact().  More below.
>>>
>>>>> Does this stop copy size change for a VFIO device or not?
>>>> It depends on the specific VFIO device.
>>>> If the device supports precopy and all (or part) of its data is
>>>> precopy-able, then stopcopy size will change.
>>>> Besides that, the amount of resources currently used by the VFIO device can
>>>> also affect the stopcopy size, and it may increase or decrease as resources
>>>> are created or destroyed.
>>> I see, thanks.
>>>
>>>>> IIUC, we may want some other mechanism to report stop copy size for a
>>>>> device, rather than reporting it with the current exact()/estimate() api.
>>>>> That's, per my understanding, only used for iterable data, while
>>>>> stop-copy-size may not fall into that category if so.
>>>> The above situation is caused by the fact that VFIO data may not be fully
>>>> precopy-able (as opposed to RAM data).
>>>> I don't think reporting the stop-copy-size in a different API will help the
>>>> above situation -- we would still have to take stop-copy-size into account
>>>> before converging, to not violate downtime.
>>> It will help some other situation, though.
>>>
>>> One issue with above freqeunt estimate()/exact() call is that QEMU will go
>>> into madness loop thinking "we're close" and "we're far away from converge"
>>> even if the reality is "we're far away". The bad side effect is when this
>>> loop happens it'll not only affect VFIO but also other devices (e.g. KVM,
>>> vhost, etc.) so we'll do high overhead sync() in an extremely frequent
>>> manner.  IMHO they're totally a waste of resource, because all the rest of
>>> the modules are following the default rules of estimate()/exact().
>>>
>>> One simple but efficient fix for VFIO, IMHO, is at least VFIO should also
>>> cache the stop-size internally and report in estimate(), e.g.:
>>>
>>> /* Give an estimate of the amount left to be transferred,
>>>    * the result is split into the amount for units that can and
>>>    * for units that can't do postcopy.
>>>    */
>>> void qemu_savevm_state_pending_estimate(uint64_t *must_precopy,
>>>                                           uint64_t *can_postcopy)
>>> {
>>> }
>>>
>>> If it's justified that the stop-size to be reported in exact(), it should
>>> also be reported in estimate() per the comment above.  It should also fall
>>> into precopy category in this case.
>>>
>>> Then with that we should avoid calling exact() frequently for not only VFIO
>>> but also others (especially, KVM GET_DIRTY_LOG / CLEAR_DIRTY_LOG ioctls),
>>> then we know it won't converge anyway without the help of tuning downtime
>>> upper, or adjust avail-switchover-bandwidth.
>> Yes, it will solve this problem, but IIRC, the reason I didn't cache and
>> return stop-copy-size in estimate from the first place was that it wasn't
>> fully precopy-able, and that could cause some problems:
>> Once we cache and report this size in estimate, it may not decrease anymore
>> -- we can't send it during precopy and we might not be able to call get
>> stop-copy-size again from the exact path (because estimate might not reach
>> below the threshold).
>>
>> For example, assume the threshold for convergence is 1GB.
>> A VFIO device may report 2GB stop-copy-size (not precopy-able) in the
>> beginning of the migration.
>> If the VFIO device stop-copy-size changes in the middle of migration (e.g.,
>> some of its resources are destroyed) and drops to 900MB, we will still see
>> 2GB report in estimate.
>> Only when calling the exact handler again we will get the updated 900MB
>> value and be able to converge. But that won't happen, because estimate size
>> will always be above the 1GB threshold.
>>
>> We could prevent these situations and call the get stop-copy-size once every
>> X seconds, but it feels a bit awkward.
> IMHO the confusion is caused by an unclear definition of stop-size and
> precopy-size in VFIO terms.
>
> What I would hope is that the reported stop-size should be constant and not be
> affected by the precopy process.  Whatever _can_ change at all should
> be reported as precopy-size.
>
> So I wonder why stop-size can change from a driver, and whether that can be
> reported in a more predictable fashion.  Otherwise I see little point in
> providing both stop-size and precopy-size, since we'll always add them
> up into VFIO's total pending-size.  The line dividing which data falls
> into which bucket doesn't seem clear to me.

Stopcopy-size is reported by VFIO_DEVICE_FEATURE_MIG_DATA_SIZE ioctl, 
which states that this is "[...] the estimated data length that will be 
required to complete stop copy".
So by this definition, stopcopy-size can change during precopy (it can 
also change if device resources are created or destroyed).

Precopy-size is reported by VFIO_MIG_GET_PRECOPY_INFO ioctl, which 
states that this is "[...] estimates of data available from the device 
during the PRE_COPY states".

Maybe the confusion was caused by this line in vfio_state_pending_exact():
    *must_precopy += migration->precopy_init_size +
                     migration->precopy_dirty_size;

Which I think should be removed, as stopcopy-size should already cover 
precopy data.
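
For reference, a rough sketch of how the precopy-size side is queried -- this
is not part of the v11 patch below (pre-copy support came later), and the
vfio_precopy_info field names (initial_bytes / dirty_bytes) are assumed from
the pre-copy uAPI; the stop-copy side is the vfio_query_stop_copy_size()
helper already shown in the patch:

static int vfio_query_precopy_size_sketch(VFIOMigration *migration,
                                          uint64_t *init_size,
                                          uint64_t *dirty_size)
{
    struct vfio_precopy_info precopy = {
        .argsz = sizeof(precopy),
    };

    /* Issued on the migration data_fd, only valid in a PRE_COPY state */
    if (ioctl(migration->data_fd, VFIO_MIG_GET_PRECOPY_INFO, &precopy)) {
        return -errno;
    }

    *init_size = precopy.initial_bytes;  /* estimated initial pre-copy data */
    *dirty_size = precopy.dirty_bytes;   /* estimated additional pre-copy data */

    return 0;
}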

Thanks.

>
>>> This may improve the situation but still leaves one other issue: IIUC, even
>>> with the above change, and even if we can avoid syncing the dirty bitmap frequently,
>>> the migration thread can still spin at 100% calling estimate() and keep
>>> seeing data (which is not iterable...).  For the longer term we may still
>>> need to report the non-iterable stop-size in another way so QEMU knows that
>>> iterating over all the VMState registers won't help in this case, so it
>>> should go to sleep without eating the cores.  I hope that explains why I
>>> think a new API will still be needed in the long run.
>> Yes, I get your point.
>> But is the spinning case very common? AFAIU, if it happens, it may indicate
>> some misconfiguration of the migration parameters.
> Agreed.
>
> It's just that our QE actively tests such migrations and there're always
> weird and similar reports on cpu spinning, so it might be nice to still fix
> it at some point.
>
> This shouldn't be super urgent, indeed.  It's just that it can start to
> make sense when we have other discussions where reporting non-iterable
> pending data might be pretty useful on its own as an idea.  It's just that
> we need to figure out the above, i.e. how VFIO can report a predictable stop-size in
> the drivers, and I wonder whether the drivers can be fixed (without
> breaking existing qemu; after all they currently always add both
> counters..).
>
>> Anyway, I think your idea may fit VFIO, but we still need to think about all
>> the small details.
> Thanks,
>
> --
> Peter Xu
>
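As a concrete illustration of the caching idea quoted above, a minimal sketch
(the cached_stop_copy_size field and the _sketch helpers are hypothetical and
not part of this patch): exact() pays for the MIG_DATA_SIZE ioctl and caches
the result, while estimate() stays a cheap fast path that only reports the
cached value.

static void vfio_state_pending_estimate_sketch(void *opaque,
                                               uint64_t *must_precopy,
                                               uint64_t *can_postcopy)
{
    VFIODevice *vbasedev = opaque;
    VFIOMigration *migration = vbasedev->migration;

    /* Fast path: no ioctl, just report the last value exact() fetched */
    *must_precopy += migration->cached_stop_copy_size;
}

static void vfio_state_pending_exact_sketch(void *opaque,
                                            uint64_t *must_precopy,
                                            uint64_t *can_postcopy)
{
    VFIODevice *vbasedev = opaque;
    VFIOMigration *migration = vbasedev->migration;
    uint64_t stop_copy_size = VFIO_MIG_STOP_COPY_SIZE;

    /* Slow path: refresh the cache from the device */
    vfio_query_stop_copy_size(vbasedev, &stop_copy_size);
    migration->cached_stop_copy_size = stop_copy_size;

    *must_precopy += stop_copy_size;
}
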
Peter Xu Sept. 5, 2024, 4:23 p.m. UTC | #11
On Thu, Sep 05, 2024 at 07:07:28PM +0300, Avihai Horon wrote:
> > So I wonder why stop-size can change from a driver, and whether that can be
> > reported in a more predictable fashion.  Otherwise I see little point in
> > providing both stop-size and precopy-size, otherwise we'll always add them
> > up into VFIO's total pending-size.  The line on provision which data falls
> > into which bucket doesn't seem to be clear to me.
> 
> Stopcopy-size is reported by VFIO_DEVICE_FEATURE_MIG_DATA_SIZE ioctl, which
> states that this is "[...] the estimated data length that will be required
> to complete stop copy".
> So by this definition, stopcopy-size can change during precopy (it can also
> change if device resources are created or destroyed).
> 
> Precopy-size is reported by VFIO_MIG_GET_PRECOPY_INFO ioctl, which states
> that this is "[...] estimates of data available from the device during the
> PRE_COPY states".
> 
> Maybe the confusion was caused by this line in vfio_state_pending_exact():
>     *must_precopy += migration->precopy_init_size +
> migration->precopy_dirty_size;
> 
> Which I think should be removed, as stopcopy-size should already cover
> precopy data.

Would you please help double check on the kernel drivers' side?  If that's
true I agree that should be dropped.

Does it also mean then that the currently reported stop-size - precopy-size
will be very close to the constant non-iterable data size?  If so, is it
cache-able?  It doesn't need to be 100% accurate as the ioctls are not
atomic, but if it's still mostly usable as that (again, I will need you all
to help justify that..), then it looks like we have some way to report it
without the concern you raised before.

Thanks,
Avihai Horon Sept. 5, 2024, 4:45 p.m. UTC | #12
On 05/09/2024 19:23, Peter Xu wrote:
> On Thu, Sep 05, 2024 at 07:07:28PM +0300, Avihai Horon wrote:
>>> So I wonder why stop-size can change from a driver, and whether that can be
>>> reported in a more predictable fashion.  Otherwise I see little point in
>>> providing both stop-size and precopy-size, otherwise we'll always add them
>>> up into VFIO's total pending-size.  The line on provision which data falls
>>> into which bucket doesn't seem to be clear to me.
>> Stopcopy-size is reported by VFIO_DEVICE_FEATURE_MIG_DATA_SIZE ioctl, which
>> states that this is "[...] the estimated data length that will be required
>> to complete stop copy".
>> So by this definition, stopcopy-size can change during precopy (it can also
>> change if device resources are created or destroyed).
>>
>> Precopy-size is reported by VFIO_MIG_GET_PRECOPY_INFO ioctl, which states
>> that this is "[...] estimates of data available from the device during the
>> PRE_COPY states".
>>
>> Maybe the confusion was caused by this line in vfio_state_pending_exact():
>>      *must_precopy += migration->precopy_init_size +
>> migration->precopy_dirty_size;
>>
>> Which I think should be removed, as stopcopy-size should already cover
>> precopy data.
> Would you please help double check on the kernel drivers' side?  If that's
> true I agree that should be dropped.

Sure.

>
> Does it also mean then that the currently reported stop-size - precopy-size
> will be very close to the constant non-iterable data size?

It's not constant; it can change while the VM is running.

Thanks.

>    If so, is it
> cache-able?  It doesn't need to be 100% accurate as the ioctls are not
> atomic, but still if it's mostly usable (again, will need you all to help
> justify that..) as that then it looks like we have some way to report that
> without the concern you raised before.
>
> Thanks,
>
> --
> Peter Xu
>
Peter Xu Sept. 5, 2024, 6:31 p.m. UTC | #13
On Thu, Sep 05, 2024 at 07:45:43PM +0300, Avihai Horon wrote:
> > Does it also mean then that the currently reported stop-size - precopy-size
> > will be very close to the constant non-iterable data size?
> 
> It's not constant, while the VM is running it can change.

I wonder how heavy the VFIO_DEVICE_FEATURE_MIG_DATA_SIZE ioctl is.
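
For what it's worth, a standalone sketch of how one could time that ioctl
from userspace -- assuming an already-open VFIO device fd (the fd setup is
omitted) and a kernel with the MIG_DATA_SIZE feature from this series; the
helper below is hypothetical:

#include <stdio.h>
#include <stdint.h>
#include <time.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

static void time_mig_data_size(int device_fd)
{
    uint64_t buf[(sizeof(struct vfio_device_feature) +
                  sizeof(struct vfio_device_feature_mig_data_size) +
                  sizeof(uint64_t) - 1) / sizeof(uint64_t)] = {};
    struct vfio_device_feature *feature = (struct vfio_device_feature *)buf;
    struct vfio_device_feature_mig_data_size *size =
        (struct vfio_device_feature_mig_data_size *)feature->data;
    struct timespec t0, t1;

    feature->argsz = sizeof(buf);
    feature->flags = VFIO_DEVICE_FEATURE_GET | VFIO_DEVICE_FEATURE_MIG_DATA_SIZE;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    if (ioctl(device_fd, VFIO_DEVICE_FEATURE, feature)) {
        perror("VFIO_DEVICE_FEATURE");
        return;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    printf("stop_copy_length=%llu, ioctl took %ld ns\n",
           (unsigned long long)size->stop_copy_length,
           (long)((t1.tv_sec - t0.tv_sec) * 1000000000L +
                  (t1.tv_nsec - t0.tv_nsec)));
}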

I just gave it a quick shot with a busy VM migrating and estimate() is
invoked only every ~100ms.

VFIO might be different, but I wonder whether we can fetch stop-size in
estimate() somehow, so it's still a pretty fast estimate() while we
avoid the rest of the exact() calls (which are destined to be useless without
VFIO).

IIUC, so far the estimate()/exact() split was because RAM sync is heavy in
exact().  When idle it's 80+ms now for a 32G VM with current master (which
has a bug that I'm fixing up [1]..), even though after the fix it's 3ms (I
think both numbers contain the dirty bitmap sync for both vfio and kvm).  So in
that case maybe we can still try fetching only the stop-size in both
estimate() and exact(), but only sync the bitmap in exact().

[1] https://lore.kernel.org/r/20240904223510.3519358-1-peterx@redhat.com

Thanks,
Avihai Horon Sept. 9, 2024, 12:52 p.m. UTC | #14
On 05/09/2024 21:31, Peter Xu wrote:
> On Thu, Sep 05, 2024 at 07:45:43PM +0300, Avihai Horon wrote:
>>> Does it also mean then that the currently reported stop-size - precopy-size
>>> will be very close to the constant non-iterable data size?
>> It's not constant, while the VM is running it can change.
> I wonder how heavy is VFIO_DEVICE_FEATURE_MIG_DATA_SIZE ioctl.
>
> I just gave it a quick shot with a busy VM migrating and estimate() is
> invoked only every ~100ms.
>
> VFIO might be different, but I wonder whether we can fetch stop-size in
> estimate() somehow, so it's still a pretty fast estimate() meanwhile we
> avoid the rest of exact() calls (which are destined to be useless without
> VFIO).
>
> IIUC so far the estimate()/exact() was because ram sync is heavy when
> exact().  When idle it's 80+ms now for 32G VM with current master (which
> has a bug and I'm fixing it up [1]..), even if after the fix it's 3ms (I
> think both numbers contain dirty bitmap sync for both vfio and kvm).  So in
> that case maybe we can still try fetching stop-size only for both
> estimate() and exact(), but only sync bitmap in exact().

IIUC, the end goal is to prevent the migration thread from spinning uselessly
in pre-copy in such scenarios, right?
If eventually we do call get stop-copy-size in estimate(), we will move 
the spinning from "exact() -> estimate() -> exact() -> estimate() ..." 
to "estimate() -> estimate() -> ...".
If so, what benefit would we get from this? We only move the useless
work to another place.
Shouldn't we directly go for the non precopy-able vs precopy-able report 
that you suggested?

Thanks.
Peter Xu Sept. 9, 2024, 3:11 p.m. UTC | #15
On Mon, Sep 09, 2024 at 03:52:39PM +0300, Avihai Horon wrote:
> 
> On 05/09/2024 21:31, Peter Xu wrote:
> > On Thu, Sep 05, 2024 at 07:45:43PM +0300, Avihai Horon wrote:
> > > > Does it also mean then that the currently reported stop-size - precopy-size
> > > > will be very close to the constant non-iterable data size?
> > > It's not constant, while the VM is running it can change.
> > I wonder how heavy is VFIO_DEVICE_FEATURE_MIG_DATA_SIZE ioctl.
> > 
> > I just gave it a quick shot with a busy VM migrating and estimate() is
> > invoked only every ~100ms.
> > 
> > VFIO might be different, but I wonder whether we can fetch stop-size in
> > estimate() somehow, so it's still a pretty fast estimate() meanwhile we
> > avoid the rest of exact() calls (which are destined to be useless without
> > VFIO).
> > 
> > IIUC so far the estimate()/exact() was because ram sync is heavy when
> > exact().  When idle it's 80+ms now for 32G VM with current master (which
> > has a bug and I'm fixing it up [1]..), even if after the fix it's 3ms (I
> > think both numbers contain dirty bitmap sync for both vfio and kvm).  So in
> > that case maybe we can still try fetching stop-size only for both
> > estimate() and exact(), but only sync bitmap in exact().
> 
> IIUC, the end goal is to prevent migration thread spinning uselessly in
> pre-copy in such scenarios, right?
> If eventually we do call get stop-copy-size in estimate(), we will move the
> spinning from "exact() -> estimate() -> exact() -> estimate() ..." to
> "estimate() -> estimate() -> ...".
> If so, what benefit would we get from this? We only move the useless work to
> other place.

We can avoid exact() calls invoked for other vmstate handlers, e.g. RAM,
which can be much heavier and can require BQL during the slow process,
which can further block more vcpu operations during migration.

And as mentioned previously, VFIO is, AFAIK, the only handler that provides
different definitions of estimate() and exact(), which can be confusing,
and it goes against the "estimate() is the fast-path" logic.

But I agree it's not fundamentally changing much..

> Shouldn't we directly go for the non precopy-able vs precopy-able report
> that you suggested?

Yep, I just thought the previous one would be much easier to achieve.  And
as you said, VFIO is still pretty special in that the user will need manual
involvement anyway to specify e.g. very large downtimes, so this condition
shouldn't be a major case.  That said, if you have a solid idea
on this please feel free to go ahead directly with a complete solution.

Thanks,
Avihai Horon Sept. 12, 2024, 8:09 a.m. UTC | #16
On 09/09/2024 18:11, Peter Xu wrote:
> On Mon, Sep 09, 2024 at 03:52:39PM +0300, Avihai Horon wrote:
>> On 05/09/2024 21:31, Peter Xu wrote:
>>> On Thu, Sep 05, 2024 at 07:45:43PM +0300, Avihai Horon wrote:
>>>>> Does it also mean then that the currently reported stop-size - precopy-size
>>>>> will be very close to the constant non-iterable data size?
>>>> It's not constant, while the VM is running it can change.
>>> I wonder how heavy is VFIO_DEVICE_FEATURE_MIG_DATA_SIZE ioctl.
>>>
>>> I just gave it a quick shot with a busy VM migrating and estimate() is
>>> invoked only every ~100ms.
>>>
>>> VFIO might be different, but I wonder whether we can fetch stop-size in
>>> estimate() somehow, so it's still a pretty fast estimate() meanwhile we
>>> avoid the rest of exact() calls (which are destined to be useless without
>>> VFIO).
>>>
>>> IIUC so far the estimate()/exact() was because ram sync is heavy when
>>> exact().  When idle it's 80+ms now for 32G VM with current master (which
>>> has a bug and I'm fixing it up [1]..), even if after the fix it's 3ms (I
>>> think both numbers contain dirty bitmap sync for both vfio and kvm).  So in
>>> that case maybe we can still try fetching stop-size only for both
>>> estimate() and exact(), but only sync bitmap in exact().
>> IIUC, the end goal is to prevent migration thread spinning uselessly in
>> pre-copy in such scenarios, right?
>> If eventually we do call get stop-copy-size in estimate(), we will move the
>> spinning from "exact() -> estimate() -> exact() -> estimate() ..." to
>> "estimate() -> estimate() -> ...".
>> If so, what benefit would we get from this? We only move the useless work to
>> other place.
> We can avoid exact() calls invoked for other vmstate handlers, e.g. RAM,
> which can be much heavier and can require BQL during the slow process,
> which can further block more vcpu operations during migration.
>
> And as mentioned previously, VFIO is, AFAIK, the only handler that provide
> different definitions of estimate() and exact(), which can be confusing,
> and it's against the "estimate() is the fast-path" logic.
>
> But I agree it's not fundamentally changing much..
>
>> Shouldn't we directly go for the non precopy-able vs precopy-able report
>> that you suggested?
> Yep, I just thought the previous one would be much easier to achieve.

Yes, though I prefer not to add the get stop-copy-size ioctl in the 
estimate() flow because: a) it's guaranteed to be called (possibly many 
times) in every migration (either well configured, which is the probable 
case, or misconfigured, which spins), and b) how "heavy" get 
stop-copy-size is may differ from one VFIO device to another.

Maybe I am being a bit overcautious here, but let's explore other 
options first :)

> And
> as you said, VFIO is still pretty special that the user will need manual
> involvements anyway to specify e.g. very large downtimes, so this condition
> shouldn't be a major case to happen.  Said that, if you have a solid idea
> on this please feel free to go ahead directly with a complete solution.

I think it's possible to do it with what we currently have (VFIO 
uAPI-wise); I will try to think of one.

BTW, I checked again and I think we should drop this line from 
vfio_state_pending_exact():
    *must_precopy += migration->precopy_init_size +
                     migration->precopy_dirty_size;

I can send a patch for that.
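
For clarity, the end state being proposed would be a sketch like the following
(the precopy_init_size / precopy_dirty_size fields come from the later
pre-copy support, not from the v11 patch on this page):

static void vfio_state_pending_exact(void *opaque, uint64_t *must_precopy,
                                     uint64_t *can_postcopy)
{
    VFIODevice *vbasedev = opaque;
    uint64_t stop_copy_size = VFIO_MIG_STOP_COPY_SIZE;

    vfio_query_stop_copy_size(vbasedev, &stop_copy_size);
    /* stop-copy size is expected to already cover remaining pre-copy data */
    *must_precopy += stop_copy_size;

    /*
     * Dropped:
     *     *must_precopy += migration->precopy_init_size +
     *                      migration->precopy_dirty_size;
     */
}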

Thanks.
Cédric Le Goater Sept. 12, 2024, 9:41 a.m. UTC | #17
On 9/12/24 10:09, Avihai Horon wrote:
> 
> On 09/09/2024 18:11, Peter Xu wrote:
>> On Mon, Sep 09, 2024 at 03:52:39PM +0300, Avihai Horon wrote:
>>> On 05/09/2024 21:31, Peter Xu wrote:
>>>> On Thu, Sep 05, 2024 at 07:45:43PM +0300, Avihai Horon wrote:
>>>>>> Does it also mean then that the currently reported stop-size - precopy-size
>>>>>> will be very close to the constant non-iterable data size?
>>>>> It's not constant, while the VM is running it can change.
>>>> I wonder how heavy is VFIO_DEVICE_FEATURE_MIG_DATA_SIZE ioctl.
>>>>
>>>> I just gave it a quick shot with a busy VM migrating and estimate() is
>>>> invoked only every ~100ms.
>>>>
>>>> VFIO might be different, but I wonder whether we can fetch stop-size in
>>>> estimate() somehow, so it's still a pretty fast estimate() meanwhile we
>>>> avoid the rest of exact() calls (which are destined to be useless without
>>>> VFIO).
>>>>
>>>> IIUC so far the estimate()/exact() was because ram sync is heavy when
>>>> exact().  When idle it's 80+ms now for 32G VM with current master (which
>>>> has a bug and I'm fixing it up [1]..), even if after the fix it's 3ms (I
>>>> think both numbers contain dirty bitmap sync for both vfio and kvm).  So in
>>>> that case maybe we can still try fetching stop-size only for both
>>>> estimate() and exact(), but only sync bitmap in exact().
>>> IIUC, the end goal is to prevent migration thread spinning uselessly in
>>> pre-copy in such scenarios, right?
>>> If eventually we do call get stop-copy-size in estimate(), we will move the
>>> spinning from "exact() -> estimate() -> exact() -> estimate() ..." to
>>> "estimate() -> estimate() -> ...".
>>> If so, what benefit would we get from this? We only move the useless work to
>>> other place.
>> We can avoid exact() calls invoked for other vmstate handlers, e.g. RAM,
>> which can be much heavier and can require BQL during the slow process,
>> which can further block more vcpu operations during migration.
>>
>> And as mentioned previously, VFIO is, AFAIK, the only handler that provide
>> different definitions of estimate() and exact(), which can be confusing,
>> and it's against the "estimate() is the fast-path" logic.
>>
>> But I agree it's not fundamentally changing much..
>>
>>> Shouldn't we directly go for the non precopy-able vs precopy-able report
>>> that you suggested?
>> Yep, I just thought the previous one would be much easier to achieve.
> 
> Yes, though I prefer not to add the get stop-copy-size ioctl in the estimate() flow because: a) it's guaranteed to be called (possibly many times) in every migration (either well configured which is the probable case or misconfigured which spins), and b) because how "heavy" get stop-copy-size is may differ from VFIO device to the other.
> 
> Maybe I am being a bit overcautious here, but let's explore other options first :)
> 
>> And
>> as you said, VFIO is still pretty special that the user will need manual
>> involvements anyway to specify e.g. very large downtimes, so this condition
>> shouldn't be a major case to happen.  Said that, if you have a solid idea
>> on this please feel free to go ahead directly with a complete solution.
> 
> I think it's possible to do it with what we currently have (VFIO uAPI-wise), I will try to think of one.
> 
> BTW, I checked again and I think we should drop this line from vfio_state_pending_exact():
>      *must_precopy += migration->precopy_init_size + migration->precopy_dirty_size;
> 
> I can send a patch for that.

Please do. We can then provide a scratch build for further testing
and experiments with vGPUs.

Thanks,

C.
Peter Xu Sept. 12, 2024, 1:45 p.m. UTC | #18
On Thu, Sep 12, 2024 at 11:09:12AM +0300, Avihai Horon wrote:
> 
> On 09/09/2024 18:11, Peter Xu wrote:
> > On Mon, Sep 09, 2024 at 03:52:39PM +0300, Avihai Horon wrote:
> > > On 05/09/2024 21:31, Peter Xu wrote:
> > > > On Thu, Sep 05, 2024 at 07:45:43PM +0300, Avihai Horon wrote:
> > > > > > Does it also mean then that the currently reported stop-size - precopy-size
> > > > > > will be very close to the constant non-iterable data size?
> > > > > It's not constant, while the VM is running it can change.
> > > > I wonder how heavy is VFIO_DEVICE_FEATURE_MIG_DATA_SIZE ioctl.
> > > > 
> > > > I just gave it a quick shot with a busy VM migrating and estimate() is
> > > > invoked only every ~100ms.
> > > > 
> > > > VFIO might be different, but I wonder whether we can fetch stop-size in
> > > > estimate() somehow, so it's still a pretty fast estimate() meanwhile we
> > > > avoid the rest of exact() calls (which are destined to be useless without
> > > > VFIO).
> > > > 
> > > > IIUC so far the estimate()/exact() was because ram sync is heavy when
> > > > exact().  When idle it's 80+ms now for 32G VM with current master (which
> > > > has a bug and I'm fixing it up [1]..), even if after the fix it's 3ms (I
> > > > think both numbers contain dirty bitmap sync for both vfio and kvm).  So in
> > > > that case maybe we can still try fetching stop-size only for both
> > > > estimate() and exact(), but only sync bitmap in exact().
> > > IIUC, the end goal is to prevent migration thread spinning uselessly in
> > > pre-copy in such scenarios, right?
> > > If eventually we do call get stop-copy-size in estimate(), we will move the
> > > spinning from "exact() -> estimate() -> exact() -> estimate() ..." to
> > > "estimate() -> estimate() -> ...".
> > > If so, what benefit would we get from this? We only move the useless work to
> > > other place.
> > We can avoid exact() calls invoked for other vmstate handlers, e.g. RAM,
> > which can be much heavier and can require BQL during the slow process,
> > which can further block more vcpu operations during migration.
> > 
> > And as mentioned previously, VFIO is, AFAIK, the only handler that provide
> > different definitions of estimate() and exact(), which can be confusing,
> > and it's against the "estimate() is the fast-path" logic.
> > 
> > But I agree it's not fundamentally changing much..
> > 
> > > Shouldn't we directly go for the non precopy-able vs precopy-able report
> > > that you suggested?
> > Yep, I just thought the previous one would be much easier to achieve.
> 
> Yes, though I prefer not to add the get stop-copy-size ioctl in the
> estimate() flow because: a) it's guaranteed to be called (possibly many
> times) in every migration (either well configured which is the probable case
> or misconfigured which spins), and b) because how "heavy" get stop-copy-size
> is may differ from VFIO device to the other.
> 
> Maybe I am being a bit overcautious here, but let's explore other options
> first :)

Nope; that's totally ok.

> 
> > And
> > as you said, VFIO is still pretty special that the user will need manual
> > involvements anyway to specify e.g. very large downtimes, so this condition
> > shouldn't be a major case to happen.  Said that, if you have a solid idea
> > on this please feel free to go ahead directly with a complete solution.
> 
> I think it's possible to do it with what we currently have (VFIO uAPI-wise),
> I will try to think of one.
> 
> BTW, I checked again and I think we should drop this line from
> vfio_state_pending_exact():
>     *must_precopy += migration->precopy_init_size +
> migration->precopy_dirty_size;
> 
> I can send a patch for that.

Yes please.

I also wonder whether it'd be good to update the uAPI header in Linux too,
for vfio_device_feature_mig_data_size or VFIO_DEVICE_FEATURE_MIG_DATA_SIZE.
Currently it reads a bit ambiguously to me, while it could make a huge
difference IIUC when the precopy size is non-trivial.

/*
 * Upon VFIO_DEVICE_FEATURE_GET read back the estimated data length that will
 * be required to complete stop copy.
 *
 * Note: Can be called on each device state.
 */

Maybe that'll also be helpful when other drivers implement this, just
to make sure both sides (user / kernel) are crystal clear on the
definition.
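
One possible clarified wording, purely illustrative and pending confirmation
from the driver side that stop copy indeed already covers pre-copy data:

/*
 * Upon VFIO_DEVICE_FEATURE_GET read back the estimated total data length
 * that will be required to complete stop copy, i.e. the remaining device
 * state as a whole -- including any data that could still be transferred
 * while in a PRE_COPY state -- not only the part left over after pre-copy.
 *
 * Note: Can be called on each device state.
 */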
diff mbox series

Patch

diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 2c0fb1d622..c4eab55af9 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -66,6 +66,11 @@  typedef struct VFIOMigration {
     int vm_running;
     Notifier migration_state;
     uint64_t pending_bytes;
+    uint32_t device_state;
+    int data_fd;
+    void *data_buffer;
+    size_t data_buffer_size;
+    bool v2;
 } VFIOMigration;
 
 typedef struct VFIOAddressSpace {
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 1c974e9c5a..54fee2d5de 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -408,10 +408,17 @@  static bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
                 return false;
             }
 
-            if ((vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF) &&
+            if (!migration->v2 &&
+                (vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF) &&
                 (migration->device_state_v1 & VFIO_DEVICE_STATE_V1_RUNNING)) {
                 return false;
             }
+
+            if (migration->v2 &&
+                vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF &&
+                migration->device_state == VFIO_DEVICE_STATE_RUNNING) {
+                return false;
+            }
         }
     }
     return true;
@@ -438,7 +445,13 @@  static bool vfio_devices_all_running_and_mig_active(VFIOContainer *container)
                 return false;
             }
 
-            if (migration->device_state_v1 & VFIO_DEVICE_STATE_V1_RUNNING) {
+            if (!migration->v2 &&
+                migration->device_state_v1 & VFIO_DEVICE_STATE_V1_RUNNING) {
+                continue;
+            }
+
+            if (migration->v2 &&
+                migration->device_state == VFIO_DEVICE_STATE_RUNNING) {
                 continue;
             } else {
                 return false;
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index a5fe285721..452d96a520 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -10,6 +10,7 @@ 
 #include "qemu/osdep.h"
 #include "qemu/main-loop.h"
 #include "qemu/cutils.h"
+#include "qemu/units.h"
 #include <linux/vfio.h>
 #include <sys/ioctl.h>
 
@@ -44,8 +45,114 @@ 
 #define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xffffffffef100003ULL)
 #define VFIO_MIG_FLAG_DEV_DATA_STATE    (0xffffffffef100004ULL)
 
+/*
+ * This is an arbitrary size based on migration of mlx5 devices, where typically
+ * total device migration size is on the order of 100s of MB. Testing with
+ * larger values, e.g. 128MB and 1GB, did not show a performance improvement.
+ */
+#define VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE (1 * MiB)
+
 static int64_t bytes_transferred;
 
+static const char *mig_state_to_str(enum vfio_device_mig_state state)
+{
+    switch (state) {
+    case VFIO_DEVICE_STATE_ERROR:
+        return "ERROR";
+    case VFIO_DEVICE_STATE_STOP:
+        return "STOP";
+    case VFIO_DEVICE_STATE_RUNNING:
+        return "RUNNING";
+    case VFIO_DEVICE_STATE_STOP_COPY:
+        return "STOP_COPY";
+    case VFIO_DEVICE_STATE_RESUMING:
+        return "RESUMING";
+    default:
+        return "UNKNOWN STATE";
+    }
+}
+
+static int vfio_migration_set_state(VFIODevice *vbasedev,
+                                    enum vfio_device_mig_state new_state,
+                                    enum vfio_device_mig_state recover_state)
+{
+    VFIOMigration *migration = vbasedev->migration;
+    uint64_t buf[DIV_ROUND_UP(sizeof(struct vfio_device_feature) +
+                              sizeof(struct vfio_device_feature_mig_state),
+                              sizeof(uint64_t))] = {};
+    struct vfio_device_feature *feature = (struct vfio_device_feature *)buf;
+    struct vfio_device_feature_mig_state *mig_state =
+        (struct vfio_device_feature_mig_state *)feature->data;
+    int ret;
+
+    feature->argsz = sizeof(buf);
+    feature->flags =
+        VFIO_DEVICE_FEATURE_SET | VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE;
+    mig_state->device_state = new_state;
+    if (ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature)) {
+        /* Try to set the device in some good state */
+        ret = -errno;
+
+        if (recover_state == VFIO_DEVICE_STATE_ERROR) {
+            error_report("%s: Failed setting device state to %s, err: %s. "
+                         "Recover state is ERROR. Resetting device",
+                         vbasedev->name, mig_state_to_str(new_state),
+                         strerror(errno));
+
+            goto reset_device;
+        }
+
+        error_report(
+            "%s: Failed setting device state to %s, err: %s. Setting device in recover state %s",
+                     vbasedev->name, mig_state_to_str(new_state),
+                     strerror(errno), mig_state_to_str(recover_state));
+
+        mig_state->device_state = recover_state;
+        if (ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature)) {
+            ret = -errno;
+            error_report(
+                "%s: Failed setting device in recover state, err: %s. Resetting device",
+                         vbasedev->name, strerror(errno));
+
+            goto reset_device;
+        }
+
+        migration->device_state = recover_state;
+
+        return ret;
+    }
+
+    migration->device_state = new_state;
+    if (mig_state->data_fd != -1) {
+        if (migration->data_fd != -1) {
+            /*
+             * This can happen if the device is asynchronously reset and
+             * terminates a data transfer.
+             */
+            error_report("%s: data_fd out of sync", vbasedev->name);
+            close(mig_state->data_fd);
+
+            return -EBADF;
+        }
+
+        migration->data_fd = mig_state->data_fd;
+    }
+
+    trace_vfio_migration_set_state(vbasedev->name, mig_state_to_str(new_state));
+
+    return 0;
+
+reset_device:
+    if (ioctl(vbasedev->fd, VFIO_DEVICE_RESET)) {
+        hw_error("%s: Failed resetting device, err: %s", vbasedev->name,
+                 strerror(errno));
+    }
+
+    migration->device_state = VFIO_DEVICE_STATE_RUNNING;
+
+    return ret;
+}
+
 static inline int vfio_mig_access(VFIODevice *vbasedev, void *val, int count,
                                   off_t off, bool iswrite)
 {
@@ -260,6 +367,18 @@  static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev, uint64_t *size)
     return ret;
 }
 
+static int vfio_load_buffer(QEMUFile *f, VFIODevice *vbasedev,
+                            uint64_t data_size)
+{
+    VFIOMigration *migration = vbasedev->migration;
+    int ret;
+
+    ret = qemu_file_get_to_fd(f, migration->data_fd, data_size);
+    trace_vfio_load_state_device_data(vbasedev->name, data_size, ret);
+
+    return ret;
+}
+
 static int vfio_v1_load_buffer(QEMUFile *f, VFIODevice *vbasedev,
                                uint64_t data_size)
 {
@@ -394,6 +513,14 @@  static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
     return qemu_file_get_error(f);
 }
 
+static void vfio_migration_cleanup(VFIODevice *vbasedev)
+{
+    VFIOMigration *migration = vbasedev->migration;
+
+    close(migration->data_fd);
+    migration->data_fd = -1;
+}
+
 static void vfio_migration_v1_cleanup(VFIODevice *vbasedev)
 {
     VFIOMigration *migration = vbasedev->migration;
@@ -403,8 +530,80 @@  static void vfio_migration_v1_cleanup(VFIODevice *vbasedev)
     }
 }
 
+static int vfio_query_stop_copy_size(VFIODevice *vbasedev,
+                                     uint64_t *stop_copy_size)
+{
+    uint64_t buf[DIV_ROUND_UP(sizeof(struct vfio_device_feature) +
+                              sizeof(struct vfio_device_feature_mig_data_size),
+                              sizeof(uint64_t))] = {};
+    struct vfio_device_feature *feature = (struct vfio_device_feature *)buf;
+    struct vfio_device_feature_mig_data_size *mig_data_size =
+        (struct vfio_device_feature_mig_data_size *)feature->data;
+
+    feature->argsz = sizeof(buf);
+    feature->flags =
+        VFIO_DEVICE_FEATURE_GET | VFIO_DEVICE_FEATURE_MIG_DATA_SIZE;
+
+    if (ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature)) {
+        return -errno;
+    }
+
+    *stop_copy_size = mig_data_size->stop_copy_length;
+
+    return 0;
+}
+
+/* Returns 1 if end-of-stream is reached, 0 if more data and -errno if error */
+static int vfio_save_block(QEMUFile *f, VFIOMigration *migration)
+{
+    ssize_t data_size;
+
+    data_size = read(migration->data_fd, migration->data_buffer,
+                     migration->data_buffer_size);
+    if (data_size < 0) {
+        return -errno;
+    }
+    if (data_size == 0) {
+        return 1;
+    }
+
+    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
+    qemu_put_be64(f, data_size);
+    qemu_put_buffer(f, migration->data_buffer, data_size);
+    bytes_transferred += data_size;
+
+    trace_vfio_save_block(migration->vbasedev->name, data_size);
+
+    return qemu_file_get_error(f);
+}
+
 /* ---------------------------------------------------------------------- */
 
+static int vfio_save_setup(QEMUFile *f, void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+    uint64_t stop_copy_size = VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE;
+
+    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
+
+    vfio_query_stop_copy_size(vbasedev, &stop_copy_size);
+    migration->data_buffer_size = MIN(VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE,
+                                      stop_copy_size);
+    migration->data_buffer = g_try_malloc0(migration->data_buffer_size);
+    if (!migration->data_buffer) {
+        error_report("%s: Failed to allocate migration data buffer",
+                     vbasedev->name);
+        return -ENOMEM;
+    }
+
+    trace_vfio_save_setup(vbasedev->name, migration->data_buffer_size);
+
+    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
+
+    return qemu_file_get_error(f);
+}
+
 static int vfio_v1_save_setup(QEMUFile *f, void *opaque)
 {
     VFIODevice *vbasedev = opaque;
@@ -448,6 +647,17 @@  static int vfio_v1_save_setup(QEMUFile *f, void *opaque)
     return 0;
 }
 
+static void vfio_save_cleanup(void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+
+    g_free(migration->data_buffer);
+    migration->data_buffer = NULL;
+    vfio_migration_cleanup(vbasedev);
+    trace_vfio_save_cleanup(vbasedev->name);
+}
+
 static void vfio_v1_save_cleanup(void *opaque)
 {
     VFIODevice *vbasedev = opaque;
@@ -456,6 +666,35 @@  static void vfio_v1_save_cleanup(void *opaque)
     trace_vfio_save_cleanup(vbasedev->name);
 }
 
+/*
+ * Migration size of VFIO devices can be as little as a few KBs or as big as
+ * many GBs. This value should be big enough to cover the worst case.
+ */
+#define VFIO_MIG_STOP_COPY_SIZE (100 * GiB)
+
+/*
+ * Only exact function is implemented and not estimate function. The reason is
+ * that during pre-copy phase of migration the estimate function is called
+ * repeatedly while pending RAM size is over the threshold, thus migration
+ * can't converge and querying the VFIO device pending data size is useless.
+ */
+static void vfio_state_pending_exact(void *opaque, uint64_t *must_precopy,
+                                     uint64_t *can_postcopy)
+{
+    VFIODevice *vbasedev = opaque;
+    uint64_t stop_copy_size = VFIO_MIG_STOP_COPY_SIZE;
+
+    /*
+     * If getting pending migration size fails, VFIO_MIG_STOP_COPY_SIZE is
+     * reported so downtime limit won't be violated.
+     */
+    vfio_query_stop_copy_size(vbasedev, &stop_copy_size);
+    *must_precopy += stop_copy_size;
+
+    trace_vfio_state_pending_exact(vbasedev->name, *must_precopy, *can_postcopy,
+                                   stop_copy_size);
+}
+
 static void vfio_v1_state_pending(void *opaque, uint64_t *must_precopy,
                                   uint64_t *can_postcopy)
 {
@@ -520,6 +759,42 @@  static int vfio_save_iterate(QEMUFile *f, void *opaque)
     return 0;
 }
 
+static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+    int ret;
+
+    /* We reach here with device state STOP only */
+    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP_COPY,
+                                   VFIO_DEVICE_STATE_STOP);
+    if (ret) {
+        return ret;
+    }
+
+    do {
+        ret = vfio_save_block(f, vbasedev->migration);
+        if (ret < 0) {
+            return ret;
+        }
+    } while (!ret);
+
+    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
+    ret = qemu_file_get_error(f);
+    if (ret) {
+        return ret;
+    }
+
+    /*
+     * If setting the device in STOP state fails, the device should be reset.
+     * To do so, use ERROR state as a recover state.
+     */
+    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP,
+                                   VFIO_DEVICE_STATE_ERROR);
+    trace_vfio_save_complete_precopy(vbasedev->name, ret);
+
+    return ret;
+}
+
 static int vfio_v1_save_complete_precopy(QEMUFile *f, void *opaque)
 {
     VFIODevice *vbasedev = opaque;
@@ -589,6 +864,14 @@  static void vfio_save_state(QEMUFile *f, void *opaque)
     }
 }
 
+static int vfio_load_setup(QEMUFile *f, void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+
+    return vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RESUMING,
+                                   vbasedev->migration->device_state);
+}
+
 static int vfio_v1_load_setup(QEMUFile *f, void *opaque)
 {
     VFIODevice *vbasedev = opaque;
@@ -616,6 +899,16 @@  static int vfio_v1_load_setup(QEMUFile *f, void *opaque)
     return ret;
 }
 
+static int vfio_load_cleanup(void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+
+    vfio_migration_cleanup(vbasedev);
+    trace_vfio_load_cleanup(vbasedev->name);
+
+    return 0;
+}
+
 static int vfio_v1_load_cleanup(void *opaque)
 {
     VFIODevice *vbasedev = opaque;
@@ -658,7 +951,11 @@  static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
             uint64_t data_size = qemu_get_be64(f);
 
             if (data_size) {
-                ret = vfio_v1_load_buffer(f, vbasedev, data_size);
+                if (vbasedev->migration->v2) {
+                    ret = vfio_load_buffer(f, vbasedev, data_size);
+                } else {
+                    ret = vfio_v1_load_buffer(f, vbasedev, data_size);
+                }
                 if (ret < 0) {
                     return ret;
                 }
@@ -679,6 +976,17 @@  static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
     return ret;
 }
 
+static const SaveVMHandlers savevm_vfio_handlers = {
+    .save_setup = vfio_save_setup,
+    .save_cleanup = vfio_save_cleanup,
+    .state_pending_exact = vfio_state_pending_exact,
+    .save_live_complete_precopy = vfio_save_complete_precopy,
+    .save_state = vfio_save_state,
+    .load_setup = vfio_load_setup,
+    .load_cleanup = vfio_load_cleanup,
+    .load_state = vfio_load_state,
+};
+
 static SaveVMHandlers savevm_vfio_v1_handlers = {
     .save_setup = vfio_v1_save_setup,
     .save_cleanup = vfio_v1_save_cleanup,
@@ -694,6 +1002,38 @@  static SaveVMHandlers savevm_vfio_v1_handlers = {
 
 /* ---------------------------------------------------------------------- */
 
+static void vfio_vmstate_change(void *opaque, bool running, RunState state)
+{
+    VFIODevice *vbasedev = opaque;
+    enum vfio_device_mig_state new_state;
+    int ret;
+
+    if (running) {
+        new_state = VFIO_DEVICE_STATE_RUNNING;
+    } else {
+        new_state = VFIO_DEVICE_STATE_STOP;
+    }
+
+    /*
+     * If setting the device in new_state fails, the device should be reset.
+     * To do so, use ERROR state as a recover state.
+     */
+    ret = vfio_migration_set_state(vbasedev, new_state,
+                                   VFIO_DEVICE_STATE_ERROR);
+    if (ret) {
+        /*
+         * Migration should be aborted in this case, but vm_state_notify()
+         * currently does not support reporting failures.
+         */
+        if (migrate_get_current()->to_dst_file) {
+            qemu_file_set_error(migrate_get_current()->to_dst_file, ret);
+        }
+    }
+
+    trace_vfio_vmstate_change(vbasedev->name, running, RunState_str(state),
+                              mig_state_to_str(new_state));
+}
+
 static void vfio_v1_vmstate_change(void *opaque, bool running, RunState state)
 {
     VFIODevice *vbasedev = opaque;
@@ -767,12 +1107,21 @@  static void vfio_migration_state_notifier(Notifier *notifier, void *data)
     case MIGRATION_STATUS_CANCELLED:
     case MIGRATION_STATUS_FAILED:
         bytes_transferred = 0;
-        ret = vfio_migration_v1_set_state(vbasedev,
-                                          ~(VFIO_DEVICE_STATE_V1_SAVING |
-                                            VFIO_DEVICE_STATE_V1_RESUMING),
-                                          VFIO_DEVICE_STATE_V1_RUNNING);
-        if (ret) {
-            error_report("%s: Failed to set state RUNNING", vbasedev->name);
+        if (migration->v2) {
+            /*
+             * If setting the device in RUNNING state fails, the device should
+             * be reset. To do so, use ERROR state as a recover state.
+             */
+            vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RUNNING,
+                                     VFIO_DEVICE_STATE_ERROR);
+        } else {
+            ret = vfio_migration_v1_set_state(vbasedev,
+                                              ~(VFIO_DEVICE_STATE_V1_SAVING |
+                                                VFIO_DEVICE_STATE_V1_RESUMING),
+                                              VFIO_DEVICE_STATE_V1_RUNNING);
+            if (ret) {
+                error_report("%s: Failed to set state RUNNING", vbasedev->name);
+            }
         }
     }
 }
@@ -781,12 +1130,42 @@  static void vfio_migration_exit(VFIODevice *vbasedev)
 {
     VFIOMigration *migration = vbasedev->migration;
 
-    vfio_region_exit(&migration->region);
-    vfio_region_finalize(&migration->region);
+    if (!migration->v2) {
+        vfio_region_exit(&migration->region);
+        vfio_region_finalize(&migration->region);
+    }
     g_free(vbasedev->migration);
     vbasedev->migration = NULL;
 }
 
+static int vfio_migration_query_flags(VFIODevice *vbasedev, uint64_t *mig_flags)
+{
+    uint64_t buf[DIV_ROUND_UP(sizeof(struct vfio_device_feature) +
+                                  sizeof(struct vfio_device_feature_migration),
+                              sizeof(uint64_t))] = {};
+    struct vfio_device_feature *feature = (struct vfio_device_feature *)buf;
+    struct vfio_device_feature_migration *mig =
+        (struct vfio_device_feature_migration *)feature->data;
+
+    feature->argsz = sizeof(buf);
+    feature->flags = VFIO_DEVICE_FEATURE_GET | VFIO_DEVICE_FEATURE_MIGRATION;
+    if (ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature)) {
+        if (errno == ENOTTY) {
+            error_report("%s: VFIO migration is not supported in kernel",
+                         vbasedev->name);
+        } else {
+            error_report("%s: Failed to query VFIO migration support, err: %s",
+                         vbasedev->name, strerror(errno));
+        }
+
+        return -errno;
+    }
+
+    *mig_flags = mig->flags;
+
+    return 0;
+}
+
 static int vfio_migration_init(VFIODevice *vbasedev)
 {
     int ret;
@@ -795,6 +1174,7 @@  static int vfio_migration_init(VFIODevice *vbasedev)
     char id[256] = "";
     g_autofree char *path = NULL, *oid = NULL;
     struct vfio_region_info *info;
+    uint64_t mig_flags = 0;
 
     if (!vbasedev->ops->vfio_get_object) {
         return -EINVAL;
@@ -805,34 +1185,50 @@  static int vfio_migration_init(VFIODevice *vbasedev)
         return -EINVAL;
     }
 
-    ret = vfio_get_dev_region_info(vbasedev,
-                                   VFIO_REGION_TYPE_MIGRATION_DEPRECATED,
-                                   VFIO_REGION_SUBTYPE_MIGRATION_DEPRECATED,
-                                   &info);
-    if (ret) {
-        return ret;
-    }
+    ret = vfio_migration_query_flags(vbasedev, &mig_flags);
+    if (!ret) {
+        /* Migration v2 */
+        /* Basic migration functionality must be supported */
+        if (!(mig_flags & VFIO_MIGRATION_STOP_COPY)) {
+            return -EOPNOTSUPP;
+        }
+        vbasedev->migration = g_new0(VFIOMigration, 1);
+        vbasedev->migration->device_state = VFIO_DEVICE_STATE_RUNNING;
+        vbasedev->migration->data_fd = -1;
+        vbasedev->migration->v2 = true;
+    } else if (ret == -ENOTTY) {
+        /* Migration v1 */
+        ret = vfio_get_dev_region_info(vbasedev,
+                                       VFIO_REGION_TYPE_MIGRATION_DEPRECATED,
+                                       VFIO_REGION_SUBTYPE_MIGRATION_DEPRECATED,
+                                       &info);
+        if (ret) {
+            return ret;
+        }
 
-    vbasedev->migration = g_new0(VFIOMigration, 1);
-    vbasedev->migration->device_state_v1 = VFIO_DEVICE_STATE_V1_RUNNING;
-    vbasedev->migration->vm_running = runstate_is_running();
+        vbasedev->migration = g_new0(VFIOMigration, 1);
+        vbasedev->migration->device_state_v1 = VFIO_DEVICE_STATE_V1_RUNNING;
+        vbasedev->migration->vm_running = runstate_is_running();
 
-    ret = vfio_region_setup(obj, vbasedev, &vbasedev->migration->region,
-                            info->index, "migration");
-    if (ret) {
-        error_report("%s: Failed to setup VFIO migration region %d: %s",
-                     vbasedev->name, info->index, strerror(-ret));
-        goto err;
-    }
+        ret = vfio_region_setup(obj, vbasedev, &vbasedev->migration->region,
+                                info->index, "migration");
+        if (ret) {
+            error_report("%s: Failed to setup VFIO migration region %d: %s",
+                         vbasedev->name, info->index, strerror(-ret));
+            goto err;
+        }
 
-    if (!vbasedev->migration->region.size) {
-        error_report("%s: Invalid zero-sized VFIO migration region %d",
-                     vbasedev->name, info->index);
-        ret = -EINVAL;
-        goto err;
-    }
+        if (!vbasedev->migration->region.size) {
+            error_report("%s: Invalid zero-sized VFIO migration region %d",
+                         vbasedev->name, info->index);
+            ret = -EINVAL;
+            goto err;
+        }
 
-    g_free(info);
+        g_free(info);
+    } else {
+        return ret;
+    }
 
     migration = vbasedev->migration;
     migration->vbasedev = vbasedev;
@@ -845,11 +1241,20 @@  static int vfio_migration_init(VFIODevice *vbasedev)
     }
     strpadcpy(id, sizeof(id), path, '\0');
 
-    register_savevm_live(id, VMSTATE_INSTANCE_ID_ANY, 1,
-                         &savevm_vfio_v1_handlers, vbasedev);
+    if (migration->v2) {
+        register_savevm_live(id, VMSTATE_INSTANCE_ID_ANY, 1,
+                             &savevm_vfio_handlers, vbasedev);
+
+        migration->vm_state = qdev_add_vm_change_state_handler(
+            vbasedev->dev, vfio_vmstate_change, vbasedev);
+    } else {
+        register_savevm_live(id, VMSTATE_INSTANCE_ID_ANY, 1,
+                             &savevm_vfio_v1_handlers, vbasedev);
+
+        migration->vm_state = qdev_add_vm_change_state_handler(
+            vbasedev->dev, vfio_v1_vmstate_change, vbasedev);
+    }
 
-    migration->vm_state = qdev_add_vm_change_state_handler(
-        vbasedev->dev, vfio_v1_vmstate_change, vbasedev);
     migration->migration_state.notify = vfio_migration_state_notifier;
     add_migration_state_change_notifier(&migration->migration_state);
     return 0;
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index c8739449df..b24e448534 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -149,20 +149,27 @@  vfio_display_edid_write_error(void) ""
 
 # migration.c
 vfio_migration_probe(const char *name) " (%s)"
+vfio_migration_set_state(const char *name, const char *state) " (%s) state %s"
 vfio_migration_v1_set_state(const char *name, uint32_t state) " (%s) state %d"
+vfio_vmstate_change(const char *name, int running, const char *reason, const char *dev_state) " (%s) running %d reason %s device state %s"
 vfio_v1_vmstate_change(const char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
 vfio_migration_state_notifier(const char *name, const char *state) " (%s) state %s"
+vfio_save_setup(const char *name, uint64_t data_buffer_size) " (%s) data buffer size 0x%"PRIx64
 vfio_v1_save_setup(const char *name) " (%s)"
 vfio_save_cleanup(const char *name) " (%s)"
 vfio_save_buffer(const char *name, uint64_t data_offset, uint64_t data_size, uint64_t pending) " (%s) Offset 0x%"PRIx64" size 0x%"PRIx64" pending 0x%"PRIx64
 vfio_update_pending(const char *name, uint64_t pending) " (%s) pending 0x%"PRIx64
 vfio_save_device_config_state(const char *name) " (%s)"
+vfio_state_pending_exact(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t stopcopy_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" stopcopy size 0x%"PRIx64
 vfio_v1_state_pending(const char *name, uint64_t precopy, uint64_t postcopy) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64
 vfio_save_iterate(const char *name, int data_size) " (%s) data_size %d"
+vfio_save_complete_precopy(const char *name, int ret) " (%s) ret %d"
 vfio_v1_save_complete_precopy(const char *name) " (%s)"
 vfio_load_device_config_state(const char *name) " (%s)"
 vfio_load_state(const char *name, uint64_t data) " (%s) data 0x%"PRIx64
 vfio_v1_load_state_device_data(const char *name, uint64_t data_offset, uint64_t data_size) " (%s) Offset 0x%"PRIx64" size 0x%"PRIx64
+vfio_load_state_device_data(const char *name, uint64_t data_size, int ret) " (%s) size 0x%"PRIx64" ret %d"
 vfio_load_cleanup(const char *name) " (%s)"
 vfio_get_dirty_bitmap(int fd, uint64_t iova, uint64_t size, uint64_t bitmap_size, uint64_t start) "container fd=%d, iova=0x%"PRIx64" size= 0x%"PRIx64" bitmap_size=0x%"PRIx64" start=0x%"PRIx64
 vfio_iommu_map_dirty_notify(uint64_t iova_start, uint64_t iova_end) "iommu dirty @ 0x%"PRIx64" - 0x%"PRIx64
+vfio_save_block(const char *name, int data_size) " (%s) data_size %d"