Message ID | 20230216143630.25610-9-avihaih@nvidia.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | vfio/migration: Implement VFIO migration protocol v2 | expand |
Avihai Horon <avihaih@nvidia.com> wrote: Reviewed-by: Juan Quintela <quintela@redhat.com>. 1st comment is a bug, but as you just remove it. Not that it matters a lot in the end (you are removing v1 on the following patch). > @@ -438,7 +445,13 @@ static bool vfio_devices_all_running_and_mig_active(VFIOContainer *container) > return false; > } > > - if (migration->device_state_v1 & VFIO_DEVICE_STATE_V1_RUNNING) { > + if (!migration->v2 && > + migration->device_state_v1 & VFIO_DEVICE_STATE_V1_RUNNING) { > + continue; > + } Are you missing here: } else { return false; } Or I am missunderstanding the code. > + if (migration->v2 && > + migration->device_state == VFIO_DEVICE_STATE_RUNNING) { > continue; > } else { > return false; [...] > +/* > + * This is an arbitrary size based on migration of mlx5 devices, where typically > + * total device migration size is on the order of 100s of MB. Testing with > + * larger values, e.g. 128MB and 1GB, did not show a performance improvement. > + */ > +#define VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE (1 * MiB) Just in case that it makes a difference. You are using here a 1MB buffer. But qemu_file_get_to_fd() uses a 32KB buffer. > } > } > > +static int vfio_query_stop_copy_size(VFIODevice *vbasedev, > + uint64_t *stop_copy_size) > +{ > + uint64_t buf[DIV_ROUND_UP(sizeof(struct vfio_device_feature) + > + sizeof(struct vfio_device_feature_mig_data_size), > + sizeof(uint64_t))] = {}; > + struct vfio_device_feature *feature = (struct vfio_device_feature *)buf; suggestion: size_t size = DIV_ROUND_UP(sizeof(struct vfio_device_feature) + sizeof(struct vfio_device_feature_mig_data_size), sizeof(uint64_t)); g_autofree struct vfio_device_feature *feature = g_malloc0(size * sizeof(uint64_t)); I have to reread several times to see what was going on there. And call it a day? Later, Juan.
On 16/02/2023 17:43, Juan Quintela wrote: > External email: Use caution opening links or attachments > > > Avihai Horon <avihaih@nvidia.com> wrote: > > Reviewed-by: Juan Quintela <quintela@redhat.com>. > > > 1st comment is a bug, but as you just remove it. > > > Not that it matters a lot in the end (you are removing v1 on the > following patch). > >> @@ -438,7 +445,13 @@ static bool vfio_devices_all_running_and_mig_active(VFIOContainer *container) >> return false; >> } >> >> - if (migration->device_state_v1 & VFIO_DEVICE_STATE_V1_RUNNING) { >> + if (!migration->v2 && >> + migration->device_state_v1 & VFIO_DEVICE_STATE_V1_RUNNING) { >> + continue; >> + } > Are you missing here: > > > } else { > return false; > } > > Or I am missunderstanding the code. I think the code is OK: If the device uses v1 migration and is running, the first if is true and we continue. If the device uses v1 migration and is not running, the first if is false and the second if is false (since the device doesn't use v2 migration) and we return false. If the device uses v2 migration and is running, the first if is false and the second if is true, so we continue. If the device uses v2 migration and is not running, first if is false and the second if is false, so we return false. > >> + if (migration->v2 && >> + migration->device_state == VFIO_DEVICE_STATE_RUNNING) { >> continue; >> } else { >> return false; > > [...] > > > >> +/* >> + * This is an arbitrary size based on migration of mlx5 devices, where typically >> + * total device migration size is on the order of 100s of MB. Testing with >> + * larger values, e.g. 128MB and 1GB, did not show a performance improvement. >> + */ >> +#define VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE (1 * MiB) > Just in case that it makes a difference. You are using here a 1MB > buffer. > But qemu_file_get_to_fd() uses a 32KB buffer. Yes, I'm aware of that. Down the road we will address the performance subject more thoroughly. In the meantime, it seems like a reasonable size according to the tests we did. > > >> } >> } >> >> +static int vfio_query_stop_copy_size(VFIODevice *vbasedev, >> + uint64_t *stop_copy_size) >> +{ >> + uint64_t buf[DIV_ROUND_UP(sizeof(struct vfio_device_feature) + >> + sizeof(struct vfio_device_feature_mig_data_size), >> + sizeof(uint64_t))] = {}; >> + struct vfio_device_feature *feature = (struct vfio_device_feature *)buf; > suggestion: > > size_t size = DIV_ROUND_UP(sizeof(struct vfio_device_feature) + > sizeof(struct vfio_device_feature_mig_data_size), > sizeof(uint64_t)); > g_autofree struct vfio_device_feature *feature = g_malloc0(size * sizeof(uint64_t)); > > I have to reread several times to see what was going on there. > > > And call it a day? I don't have a strong opinion here. We can do whatever you/Alex like more. Thanks for reviewing!
Avihai Horon <avihaih@nvidia.com> wrote: > On 16/02/2023 17:43, Juan Quintela wrote: >> External email: Use caution opening links or attachments >> >> >> Avihai Horon <avihaih@nvidia.com> wrote: >> >> Reviewed-by: Juan Quintela <quintela@redhat.com>. >> >> >> 1st comment is a bug, but as you just remove it. >> >> >> Not that it matters a lot in the end (you are removing v1 on the >> following patch). >> >>> @@ -438,7 +445,13 @@ static bool vfio_devices_all_running_and_mig_active(VFIOContainer *container) >>> return false; >>> } >>> >>> - if (migration->device_state_v1 & VFIO_DEVICE_STATE_V1_RUNNING) { >>> + if (!migration->v2 && >>> + migration->device_state_v1 & VFIO_DEVICE_STATE_V1_RUNNING) { >>> + continue; >>> + } >> Are you missing here: >> >> >> } else { >> return false; >> } >> >> Or I am missunderstanding the code. > > I think the code is OK: > > If the device uses v1 migration and is running, the first if is true > and we continue. > If the device uses v1 migration and is not running, the first if is > false and the second if is false (since the device doesn't use v2 > migration) and we return false. > > If the device uses v2 migration and is running, the first if is false > and the second if is true, so we continue. > If the device uses v2 migration and is not running, first if is false > and the second if is false, so we return false. You win O:-) I was looking at C level, not at the "semantic" level. >> >> size_t size = DIV_ROUND_UP(sizeof(struct vfio_device_feature) + >> sizeof(struct vfio_device_feature_mig_data_size), >> sizeof(uint64_t)); >> g_autofree struct vfio_device_feature *feature = g_malloc0(size * sizeof(uint64_t)); >> >> I have to reread several times to see what was going on there. >> >> >> And call it a day? > > I don't have a strong opinion here. > We can do whatever you/Alex like more. I think it is more readable, and we don't put (normally) big chunks in the stack, but once told that, who wrotes the code has more to say O:-) Later, Juan.
On Thu, 16 Feb 2023 17:52:33 +0100 Juan Quintela <quintela@redhat.com> wrote: > Avihai Horon <avihaih@nvidia.com> wrote: > > On 16/02/2023 17:43, Juan Quintela wrote: > >> External email: Use caution opening links or attachments > >> > >> > >> Avihai Horon <avihaih@nvidia.com> wrote: > >> > >> Reviewed-by: Juan Quintela <quintela@redhat.com>. > >> > >> > >> 1st comment is a bug, but as you just remove it. > >> > >> > >> Not that it matters a lot in the end (you are removing v1 on the > >> following patch). > >> > >>> @@ -438,7 +445,13 @@ static bool vfio_devices_all_running_and_mig_active(VFIOContainer *container) > >>> return false; > >>> } > >>> > >>> - if (migration->device_state_v1 & VFIO_DEVICE_STATE_V1_RUNNING) { > >>> + if (!migration->v2 && > >>> + migration->device_state_v1 & VFIO_DEVICE_STATE_V1_RUNNING) { > >>> + continue; > >>> + } > >> Are you missing here: > >> > >> > >> } else { > >> return false; > >> } > >> > >> Or I am missunderstanding the code. > > > > I think the code is OK: > > > > If the device uses v1 migration and is running, the first if is true > > and we continue. > > If the device uses v1 migration and is not running, the first if is > > false and the second if is false (since the device doesn't use v2 > > migration) and we return false. > > > > If the device uses v2 migration and is running, the first if is false > > and the second if is true, so we continue. > > If the device uses v2 migration and is not running, first if is false > > and the second if is false, so we return false. > > You win O:-) > > I was looking at C level, not at the "semantic" level. > > >> > >> size_t size = DIV_ROUND_UP(sizeof(struct vfio_device_feature) + > >> sizeof(struct vfio_device_feature_mig_data_size), > >> sizeof(uint64_t)); > >> g_autofree struct vfio_device_feature *feature = g_malloc0(size * sizeof(uint64_t)); > >> > >> I have to reread several times to see what was going on there. > >> > >> > >> And call it a day? > > > > I don't have a strong opinion here. > > We can do whatever you/Alex like more. > > I think it is more readable, and we don't put (normally) big chunks in > the stack, but once told that, who wrotes the code has more to say O:-) This is not a huge amount of data on the stack, so I'm fine with it as is. Thanks, Alex
Hello, Avihai, Reviving this thread just to discuss one issue below.. On Thu, Feb 16, 2023 at 04:36:27PM +0200, Avihai Horon wrote: > +/* > + * Migration size of VFIO devices can be as little as a few KBs or as big as > + * many GBs. This value should be big enough to cover the worst case. > + */ > +#define VFIO_MIG_STOP_COPY_SIZE (100 * GiB) > + > +/* > + * Only exact function is implemented and not estimate function. The reason is > + * that during pre-copy phase of migration the estimate function is called > + * repeatedly while pending RAM size is over the threshold, thus migration > + * can't converge and querying the VFIO device pending data size is useless. > + */ > +static void vfio_state_pending_exact(void *opaque, uint64_t *must_precopy, > + uint64_t *can_postcopy) > +{ > + VFIODevice *vbasedev = opaque; > + uint64_t stop_copy_size = VFIO_MIG_STOP_COPY_SIZE; > + > + /* > + * If getting pending migration size fails, VFIO_MIG_STOP_COPY_SIZE is > + * reported so downtime limit won't be violated. > + */ > + vfio_query_stop_copy_size(vbasedev, &stop_copy_size); > + *must_precopy += stop_copy_size; Is this the chunk of data only can be copied during VM stopped? If so, I wonder why it's reported as "must precopy" if we know precopy won't ever move them.. The issue is if with such reporting (and now in latest master branch we do have the precopy size too, which was reported both in exact() and estimate()), we can observe weird reports like this: 23411@1725380798968696657 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0) 23411@1725380799050766000 migrate_pending_exact exact pending size 21038628864 (pre = 21038628864 post=0) 23411@1725380799050896975 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0) 23411@1725380799138657103 migrate_pending_exact exact pending size 21040144384 (pre = 21040144384 post=0) 23411@1725380799140166709 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0) 23411@1725380799217246861 migrate_pending_exact exact pending size 21038628864 (pre = 21038628864 post=0) 23411@1725380799217384969 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0) 23411@1725380799305147722 migrate_pending_exact exact pending size 21039976448 (pre = 21039976448 post=0) 23411@1725380799306639956 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0) 23411@1725380799385118245 migrate_pending_exact exact pending size 21038796800 (pre = 21038796800 post=0) 23411@1725380799385709382 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0) So estimate() keeps reporting zero but the exact() reports much larger, and it keeps spinning like this. I think that's not how it was designed to be used.. Does this stop copy size change for a VFIO device or not? IIUC, we may want some other mechanism to report stop copy size for a device, rather than reporting it with the current exact()/estimate() api. That's, per my undertanding, only used for iterable data, while stop-copy-size may not fall into that category if so. > + > + trace_vfio_state_pending_exact(vbasedev->name, *must_precopy, *can_postcopy, > + stop_copy_size); > +}
On 04/09/2024 16:00, Peter Xu wrote: > External email: Use caution opening links or attachments > > > Hello, Avihai, > > Reviving this thread just to discuss one issue below.. > > On Thu, Feb 16, 2023 at 04:36:27PM +0200, Avihai Horon wrote: >> +/* >> + * Migration size of VFIO devices can be as little as a few KBs or as big as >> + * many GBs. This value should be big enough to cover the worst case. >> + */ >> +#define VFIO_MIG_STOP_COPY_SIZE (100 * GiB) >> + >> +/* >> + * Only exact function is implemented and not estimate function. The reason is >> + * that during pre-copy phase of migration the estimate function is called >> + * repeatedly while pending RAM size is over the threshold, thus migration >> + * can't converge and querying the VFIO device pending data size is useless. >> + */ >> +static void vfio_state_pending_exact(void *opaque, uint64_t *must_precopy, >> + uint64_t *can_postcopy) >> +{ >> + VFIODevice *vbasedev = opaque; >> + uint64_t stop_copy_size = VFIO_MIG_STOP_COPY_SIZE; >> + >> + /* >> + * If getting pending migration size fails, VFIO_MIG_STOP_COPY_SIZE is >> + * reported so downtime limit won't be violated. >> + */ >> + vfio_query_stop_copy_size(vbasedev, &stop_copy_size); >> + *must_precopy += stop_copy_size; > Is this the chunk of data only can be copied during VM stopped? If so, I > wonder why it's reported as "must precopy" if we know precopy won't ever > move them.. A VFIO device that doesn't support precopy will send this data only when VM is stopped. A VFIO device that supports precopy may or may not send this data (or part of it) during precopy, and it depends on the specific VFIO device. According to state_pending_{estimate,exact} documentation, must_precopy is the amount of data that must be migrated before target starts, and indeed this VFIO data must be migrated before target starts. > > The issue is if with such reporting (and now in latest master branch we do > have the precopy size too, which was reported both in exact() and > estimate()), we can observe weird reports like this: > > 23411@1725380798968696657 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0) > 23411@1725380799050766000 migrate_pending_exact exact pending size 21038628864 (pre = 21038628864 post=0) > 23411@1725380799050896975 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0) > 23411@1725380799138657103 migrate_pending_exact exact pending size 21040144384 (pre = 21040144384 post=0) > 23411@1725380799140166709 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0) > 23411@1725380799217246861 migrate_pending_exact exact pending size 21038628864 (pre = 21038628864 post=0) > 23411@1725380799217384969 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0) > 23411@1725380799305147722 migrate_pending_exact exact pending size 21039976448 (pre = 21039976448 post=0) > 23411@1725380799306639956 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0) > 23411@1725380799385118245 migrate_pending_exact exact pending size 21038796800 (pre = 21038796800 post=0) > 23411@1725380799385709382 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0) > > So estimate() keeps reporting zero but the exact() reports much larger, and > it keeps spinning like this. I think that's not how it was designed to be > used.. It keeps spinning and migration doesn't converge? If so, configuring a higher downtime limit or the avail-switchover-bandwidth parameter may solve it. > > Does this stop copy size change for a VFIO device or not? It depends on the specific VFIO device. If the device supports precopy and all (or part) of its data is precopy-able, then stopcopy size will change. Besides that, the amount of resources currently used by the VFIO device can also affect the stopcopy size, and it may increase or decrease as resources are created or destroyed. > IIUC, we may want some other mechanism to report stop copy size for a > device, rather than reporting it with the current exact()/estimate() api. > That's, per my undertanding, only used for iterable data, while > stop-copy-size may not fall into that category if so. The above situation is caused by the fact that VFIO data may not be fully precopy-able (as opposed to RAM data). I don't think reporting the stop-copy-size in a different API will help the above situation -- we would still have to take stop-copy-size into account before converging, to not violate downtime. Thanks. > >> + >> + trace_vfio_state_pending_exact(vbasedev->name, *must_precopy, *can_postcopy, >> + stop_copy_size); >> +} > -- > Peter Xu >
On Wed, Sep 04, 2024 at 06:41:03PM +0300, Avihai Horon wrote: > > On 04/09/2024 16:00, Peter Xu wrote: > > External email: Use caution opening links or attachments > > > > > > Hello, Avihai, > > > > Reviving this thread just to discuss one issue below.. > > > > On Thu, Feb 16, 2023 at 04:36:27PM +0200, Avihai Horon wrote: > > > +/* > > > + * Migration size of VFIO devices can be as little as a few KBs or as big as > > > + * many GBs. This value should be big enough to cover the worst case. > > > + */ > > > +#define VFIO_MIG_STOP_COPY_SIZE (100 * GiB) > > > + > > > +/* > > > + * Only exact function is implemented and not estimate function. The reason is > > > + * that during pre-copy phase of migration the estimate function is called > > > + * repeatedly while pending RAM size is over the threshold, thus migration > > > + * can't converge and querying the VFIO device pending data size is useless. > > > + */ > > > +static void vfio_state_pending_exact(void *opaque, uint64_t *must_precopy, > > > + uint64_t *can_postcopy) > > > +{ > > > + VFIODevice *vbasedev = opaque; > > > + uint64_t stop_copy_size = VFIO_MIG_STOP_COPY_SIZE; > > > + > > > + /* > > > + * If getting pending migration size fails, VFIO_MIG_STOP_COPY_SIZE is > > > + * reported so downtime limit won't be violated. > > > + */ > > > + vfio_query_stop_copy_size(vbasedev, &stop_copy_size); > > > + *must_precopy += stop_copy_size; > > Is this the chunk of data only can be copied during VM stopped? If so, I > > wonder why it's reported as "must precopy" if we know precopy won't ever > > move them.. > > A VFIO device that doesn't support precopy will send this data only when VM > is stopped. > A VFIO device that supports precopy may or may not send this data (or part > of it) during precopy, and it depends on the specific VFIO device. > > According to state_pending_{estimate,exact} documentation, must_precopy is > the amount of data that must be migrated before target starts, and indeed > this VFIO data must be migrated before target starts. Correct. It's just that estimate/exact API will be more suitable when the exact() gets the report closer to estimate(), while the estimate() is only a fast-path of exact(). It'll cause some chaos like this if it doesn't do as so. Since we have discussion elsewhere on the fact that we currently ignore non-iterative device states during migration decides to switchover, I was wondering when reply on whether this stop-size could be the first thing that will start to provide such non-iterable pending data. But that might indeed need more thoughts, at least we may want to collect more outliers of non-iterative device states outside VFIO that can cause downtime to be too large. > > > > > The issue is if with such reporting (and now in latest master branch we do > > have the precopy size too, which was reported both in exact() and > > estimate()), we can observe weird reports like this: > > > > 23411@1725380798968696657 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0) > > 23411@1725380799050766000 migrate_pending_exact exact pending size 21038628864 (pre = 21038628864 post=0) > > 23411@1725380799050896975 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0) > > 23411@1725380799138657103 migrate_pending_exact exact pending size 21040144384 (pre = 21040144384 post=0) > > 23411@1725380799140166709 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0) > > 23411@1725380799217246861 migrate_pending_exact exact pending size 21038628864 (pre = 21038628864 post=0) > > 23411@1725380799217384969 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0) > > 23411@1725380799305147722 migrate_pending_exact exact pending size 21039976448 (pre = 21039976448 post=0) > > 23411@1725380799306639956 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0) > > 23411@1725380799385118245 migrate_pending_exact exact pending size 21038796800 (pre = 21038796800 post=0) > > 23411@1725380799385709382 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0) > > > > So estimate() keeps reporting zero but the exact() reports much larger, and > > it keeps spinning like this. I think that's not how it was designed to be > > used.. > > It keeps spinning and migration doesn't converge? > If so, configuring a higher downtime limit or the avail-switchover-bandwidth > parameter may solve it. Yes, this is the only way to go, but it's a separate issue on reportings of estimate()/exact(). More below. > > > > > Does this stop copy size change for a VFIO device or not? > > It depends on the specific VFIO device. > If the device supports precopy and all (or part) of its data is > precopy-able, then stopcopy size will change. > Besides that, the amount of resources currently used by the VFIO device can > also affect the stopcopy size, and it may increase or decrease as resources > are created or destroyed. I see, thanks. > > > IIUC, we may want some other mechanism to report stop copy size for a > > device, rather than reporting it with the current exact()/estimate() api. > > That's, per my undertanding, only used for iterable data, while > > stop-copy-size may not fall into that category if so. > > The above situation is caused by the fact that VFIO data may not be fully > precopy-able (as opposed to RAM data). > I don't think reporting the stop-copy-size in a different API will help the > above situation -- we would still have to take stop-copy-size into account > before converging, to not violate downtime. It will help some other situation, though. One issue with above freqeunt estimate()/exact() call is that QEMU will go into madness loop thinking "we're close" and "we're far away from converge" even if the reality is "we're far away". The bad side effect is when this loop happens it'll not only affect VFIO but also other devices (e.g. KVM, vhost, etc.) so we'll do high overhead sync() in an extremely frequent manner. IMHO they're totally a waste of resource, because all the rest of the modules are following the default rules of estimate()/exact(). One simple but efficient fix for VFIO, IMHO, is at least VFIO should also cache the stop-size internally and report in estimate(), e.g.: /* Give an estimate of the amount left to be transferred, * the result is split into the amount for units that can and * for units that can't do postcopy. */ void qemu_savevm_state_pending_estimate(uint64_t *must_precopy, uint64_t *can_postcopy) { } If it's justified that the stop-size to be reported in exact(), it should also be reported in estimate() per the comment above. It should also fall into precopy category in this case. Then with that we should avoid calling exact() frequently for not only VFIO but also others (especially, KVM GET_DIRTY_LOG / CLEAR_DIRTY_LOG ioctls), then we know it won't converge anyway without the help of tuning downtime upper, or adjust avail-switchover-bandwidth. This may improve situation but still leave one other issue, that IIUC even with above change and even if we can avoid sync dirty bitmap frequently, the migration thread can still spinning 100% calling estimate() and keep seeing data (which is not iterable...). For the longer term we may still need to report non-iterable stop-size in another way so QEMU knows that iterate() over all the VMState registers won't help in this case, so it should go into sleep without eating the cores. I hope that explains why I think a new API should be still needed for the long run. Thanks,
On 04/09/2024 19:16, Peter Xu wrote: > External email: Use caution opening links or attachments > > > On Wed, Sep 04, 2024 at 06:41:03PM +0300, Avihai Horon wrote: >> On 04/09/2024 16:00, Peter Xu wrote: >>> External email: Use caution opening links or attachments >>> >>> >>> Hello, Avihai, >>> >>> Reviving this thread just to discuss one issue below.. >>> >>> On Thu, Feb 16, 2023 at 04:36:27PM +0200, Avihai Horon wrote: >>>> +/* >>>> + * Migration size of VFIO devices can be as little as a few KBs or as big as >>>> + * many GBs. This value should be big enough to cover the worst case. >>>> + */ >>>> +#define VFIO_MIG_STOP_COPY_SIZE (100 * GiB) >>>> + >>>> +/* >>>> + * Only exact function is implemented and not estimate function. The reason is >>>> + * that during pre-copy phase of migration the estimate function is called >>>> + * repeatedly while pending RAM size is over the threshold, thus migration >>>> + * can't converge and querying the VFIO device pending data size is useless. >>>> + */ >>>> +static void vfio_state_pending_exact(void *opaque, uint64_t *must_precopy, >>>> + uint64_t *can_postcopy) >>>> +{ >>>> + VFIODevice *vbasedev = opaque; >>>> + uint64_t stop_copy_size = VFIO_MIG_STOP_COPY_SIZE; >>>> + >>>> + /* >>>> + * If getting pending migration size fails, VFIO_MIG_STOP_COPY_SIZE is >>>> + * reported so downtime limit won't be violated. >>>> + */ >>>> + vfio_query_stop_copy_size(vbasedev, &stop_copy_size); >>>> + *must_precopy += stop_copy_size; >>> Is this the chunk of data only can be copied during VM stopped? If so, I >>> wonder why it's reported as "must precopy" if we know precopy won't ever >>> move them.. >> A VFIO device that doesn't support precopy will send this data only when VM >> is stopped. >> A VFIO device that supports precopy may or may not send this data (or part >> of it) during precopy, and it depends on the specific VFIO device. >> >> According to state_pending_{estimate,exact} documentation, must_precopy is >> the amount of data that must be migrated before target starts, and indeed >> this VFIO data must be migrated before target starts. > Correct. It's just that estimate/exact API will be more suitable when the > exact() gets the report closer to estimate(), while the estimate() is only > a fast-path of exact(). It'll cause some chaos like this if it doesn't do > as so. > > Since we have discussion elsewhere on the fact that we currently ignore > non-iterative device states during migration decides to switchover, I was > wondering when reply on whether this stop-size could be the first thing > that will start to provide such non-iterable pending data. But that might > indeed need more thoughts, at least we may want to collect more outliers of > non-iterative device states outside VFIO that can cause downtime to be too > large. Ah, I see. > >>> The issue is if with such reporting (and now in latest master branch we do >>> have the precopy size too, which was reported both in exact() and >>> estimate()), we can observe weird reports like this: >>> >>> 23411@1725380798968696657 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0) >>> 23411@1725380799050766000 migrate_pending_exact exact pending size 21038628864 (pre = 21038628864 post=0) >>> 23411@1725380799050896975 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0) >>> 23411@1725380799138657103 migrate_pending_exact exact pending size 21040144384 (pre = 21040144384 post=0) >>> 23411@1725380799140166709 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0) >>> 23411@1725380799217246861 migrate_pending_exact exact pending size 21038628864 (pre = 21038628864 post=0) >>> 23411@1725380799217384969 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0) >>> 23411@1725380799305147722 migrate_pending_exact exact pending size 21039976448 (pre = 21039976448 post=0) >>> 23411@1725380799306639956 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0) >>> 23411@1725380799385118245 migrate_pending_exact exact pending size 21038796800 (pre = 21038796800 post=0) >>> 23411@1725380799385709382 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0) >>> >>> So estimate() keeps reporting zero but the exact() reports much larger, and >>> it keeps spinning like this. I think that's not how it was designed to be >>> used.. >> It keeps spinning and migration doesn't converge? >> If so, configuring a higher downtime limit or the avail-switchover-bandwidth >> parameter may solve it. > Yes, this is the only way to go, but it's a separate issue on reportings of > estimate()/exact(). More below. > >>> Does this stop copy size change for a VFIO device or not? >> It depends on the specific VFIO device. >> If the device supports precopy and all (or part) of its data is >> precopy-able, then stopcopy size will change. >> Besides that, the amount of resources currently used by the VFIO device can >> also affect the stopcopy size, and it may increase or decrease as resources >> are created or destroyed. > I see, thanks. > >>> IIUC, we may want some other mechanism to report stop copy size for a >>> device, rather than reporting it with the current exact()/estimate() api. >>> That's, per my undertanding, only used for iterable data, while >>> stop-copy-size may not fall into that category if so. >> The above situation is caused by the fact that VFIO data may not be fully >> precopy-able (as opposed to RAM data). >> I don't think reporting the stop-copy-size in a different API will help the >> above situation -- we would still have to take stop-copy-size into account >> before converging, to not violate downtime. > It will help some other situation, though. > > One issue with above freqeunt estimate()/exact() call is that QEMU will go > into madness loop thinking "we're close" and "we're far away from converge" > even if the reality is "we're far away". The bad side effect is when this > loop happens it'll not only affect VFIO but also other devices (e.g. KVM, > vhost, etc.) so we'll do high overhead sync() in an extremely frequent > manner. IMHO they're totally a waste of resource, because all the rest of > the modules are following the default rules of estimate()/exact(). > > One simple but efficient fix for VFIO, IMHO, is at least VFIO should also > cache the stop-size internally and report in estimate(), e.g.: > > /* Give an estimate of the amount left to be transferred, > * the result is split into the amount for units that can and > * for units that can't do postcopy. > */ > void qemu_savevm_state_pending_estimate(uint64_t *must_precopy, > uint64_t *can_postcopy) > { > } > > If it's justified that the stop-size to be reported in exact(), it should > also be reported in estimate() per the comment above. It should also fall > into precopy category in this case. > > Then with that we should avoid calling exact() frequently for not only VFIO > but also others (especially, KVM GET_DIRTY_LOG / CLEAR_DIRTY_LOG ioctls), > then we know it won't converge anyway without the help of tuning downtime > upper, or adjust avail-switchover-bandwidth. Yes, it will solve this problem, but IIRC, the reason I didn't cache and return stop-copy-size in estimate from the first place was that it wasn't fully precopy-able, and that could cause some problems: Once we cache and report this size in estimate, it may not decrease anymore -- we can't send it during precopy and we might not be able to call get stop-copy-size again from the exact path (because estimate might now reach below the threshold). For example, assume the threshold for convergence is 1GB. A VFIO device may report 2GB stop-copy-size (not precopy-able) in the beginning of the migration. If the VFIO device stop-copy-size changes in the middle of migration (e.g., some of its resources are destroyed) and drops to 900MB, we will still see 2GB report in estimate. Only when calling the exact handler again we will get the updated 900MB value and be able to converge. But that won't happen, because estimate size will always be above the 1GB threshold. We could prevent these situations and call the get stop-copy-size once every X seconds, but it feels a bit awkward. > > This may improve situation but still leave one other issue, that IIUC even > with above change and even if we can avoid sync dirty bitmap frequently, > the migration thread can still spinning 100% calling estimate() and keep > seeing data (which is not iterable...). For the longer term we may still > need to report non-iterable stop-size in another way so QEMU knows that > iterate() over all the VMState registers won't help in this case, so it > should go into sleep without eating the cores. I hope that explains why I > think a new API should be still needed for the long run. Yes, I get your point. But is the spinning case very common? AFAIU, if it happens, it may indicate some misconfiguration of the migration parameters. Anyway, I think your idea may fit VFIO, but still need to think about all the small details. Thanks.
On Thu, Sep 05, 2024 at 02:41:09PM +0300, Avihai Horon wrote: > > On 04/09/2024 19:16, Peter Xu wrote: > > External email: Use caution opening links or attachments > > > > > > On Wed, Sep 04, 2024 at 06:41:03PM +0300, Avihai Horon wrote: > > > On 04/09/2024 16:00, Peter Xu wrote: > > > > External email: Use caution opening links or attachments > > > > > > > > > > > > Hello, Avihai, > > > > > > > > Reviving this thread just to discuss one issue below.. > > > > > > > > On Thu, Feb 16, 2023 at 04:36:27PM +0200, Avihai Horon wrote: > > > > > +/* > > > > > + * Migration size of VFIO devices can be as little as a few KBs or as big as > > > > > + * many GBs. This value should be big enough to cover the worst case. > > > > > + */ > > > > > +#define VFIO_MIG_STOP_COPY_SIZE (100 * GiB) > > > > > + > > > > > +/* > > > > > + * Only exact function is implemented and not estimate function. The reason is > > > > > + * that during pre-copy phase of migration the estimate function is called > > > > > + * repeatedly while pending RAM size is over the threshold, thus migration > > > > > + * can't converge and querying the VFIO device pending data size is useless. > > > > > + */ > > > > > +static void vfio_state_pending_exact(void *opaque, uint64_t *must_precopy, > > > > > + uint64_t *can_postcopy) > > > > > +{ > > > > > + VFIODevice *vbasedev = opaque; > > > > > + uint64_t stop_copy_size = VFIO_MIG_STOP_COPY_SIZE; > > > > > + > > > > > + /* > > > > > + * If getting pending migration size fails, VFIO_MIG_STOP_COPY_SIZE is > > > > > + * reported so downtime limit won't be violated. > > > > > + */ > > > > > + vfio_query_stop_copy_size(vbasedev, &stop_copy_size); > > > > > + *must_precopy += stop_copy_size; > > > > Is this the chunk of data only can be copied during VM stopped? If so, I > > > > wonder why it's reported as "must precopy" if we know precopy won't ever > > > > move them.. > > > A VFIO device that doesn't support precopy will send this data only when VM > > > is stopped. > > > A VFIO device that supports precopy may or may not send this data (or part > > > of it) during precopy, and it depends on the specific VFIO device. > > > > > > According to state_pending_{estimate,exact} documentation, must_precopy is > > > the amount of data that must be migrated before target starts, and indeed > > > this VFIO data must be migrated before target starts. > > Correct. It's just that estimate/exact API will be more suitable when the > > exact() gets the report closer to estimate(), while the estimate() is only > > a fast-path of exact(). It'll cause some chaos like this if it doesn't do > > as so. > > > > Since we have discussion elsewhere on the fact that we currently ignore > > non-iterative device states during migration decides to switchover, I was > > wondering when reply on whether this stop-size could be the first thing > > that will start to provide such non-iterable pending data. But that might > > indeed need more thoughts, at least we may want to collect more outliers of > > non-iterative device states outside VFIO that can cause downtime to be too > > large. > > Ah, I see. > > > > > > > The issue is if with such reporting (and now in latest master branch we do > > > > have the precopy size too, which was reported both in exact() and > > > > estimate()), we can observe weird reports like this: > > > > > > > > 23411@1725380798968696657 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0) > > > > 23411@1725380799050766000 migrate_pending_exact exact pending size 21038628864 (pre = 21038628864 post=0) > > > > 23411@1725380799050896975 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0) > > > > 23411@1725380799138657103 migrate_pending_exact exact pending size 21040144384 (pre = 21040144384 post=0) > > > > 23411@1725380799140166709 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0) > > > > 23411@1725380799217246861 migrate_pending_exact exact pending size 21038628864 (pre = 21038628864 post=0) > > > > 23411@1725380799217384969 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0) > > > > 23411@1725380799305147722 migrate_pending_exact exact pending size 21039976448 (pre = 21039976448 post=0) > > > > 23411@1725380799306639956 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0) > > > > 23411@1725380799385118245 migrate_pending_exact exact pending size 21038796800 (pre = 21038796800 post=0) > > > > 23411@1725380799385709382 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0) > > > > > > > > So estimate() keeps reporting zero but the exact() reports much larger, and > > > > it keeps spinning like this. I think that's not how it was designed to be > > > > used.. > > > It keeps spinning and migration doesn't converge? > > > If so, configuring a higher downtime limit or the avail-switchover-bandwidth > > > parameter may solve it. > > Yes, this is the only way to go, but it's a separate issue on reportings of > > estimate()/exact(). More below. > > > > > > Does this stop copy size change for a VFIO device or not? > > > It depends on the specific VFIO device. > > > If the device supports precopy and all (or part) of its data is > > > precopy-able, then stopcopy size will change. > > > Besides that, the amount of resources currently used by the VFIO device can > > > also affect the stopcopy size, and it may increase or decrease as resources > > > are created or destroyed. > > I see, thanks. > > > > > > IIUC, we may want some other mechanism to report stop copy size for a > > > > device, rather than reporting it with the current exact()/estimate() api. > > > > That's, per my undertanding, only used for iterable data, while > > > > stop-copy-size may not fall into that category if so. > > > The above situation is caused by the fact that VFIO data may not be fully > > > precopy-able (as opposed to RAM data). > > > I don't think reporting the stop-copy-size in a different API will help the > > > above situation -- we would still have to take stop-copy-size into account > > > before converging, to not violate downtime. > > It will help some other situation, though. > > > > One issue with above freqeunt estimate()/exact() call is that QEMU will go > > into madness loop thinking "we're close" and "we're far away from converge" > > even if the reality is "we're far away". The bad side effect is when this > > loop happens it'll not only affect VFIO but also other devices (e.g. KVM, > > vhost, etc.) so we'll do high overhead sync() in an extremely frequent > > manner. IMHO they're totally a waste of resource, because all the rest of > > the modules are following the default rules of estimate()/exact(). > > > > One simple but efficient fix for VFIO, IMHO, is at least VFIO should also > > cache the stop-size internally and report in estimate(), e.g.: > > > > /* Give an estimate of the amount left to be transferred, > > * the result is split into the amount for units that can and > > * for units that can't do postcopy. > > */ > > void qemu_savevm_state_pending_estimate(uint64_t *must_precopy, > > uint64_t *can_postcopy) > > { > > } > > > > If it's justified that the stop-size to be reported in exact(), it should > > also be reported in estimate() per the comment above. It should also fall > > into precopy category in this case. > > > > Then with that we should avoid calling exact() frequently for not only VFIO > > but also others (especially, KVM GET_DIRTY_LOG / CLEAR_DIRTY_LOG ioctls), > > then we know it won't converge anyway without the help of tuning downtime > > upper, or adjust avail-switchover-bandwidth. > > Yes, it will solve this problem, but IIRC, the reason I didn't cache and > return stop-copy-size in estimate from the first place was that it wasn't > fully precopy-able, and that could cause some problems: > Once we cache and report this size in estimate, it may not decrease anymore > -- we can't send it during precopy and we might not be able to call get > stop-copy-size again from the exact path (because estimate might now reach > below the threshold). > > For example, assume the threshold for convergence is 1GB. > A VFIO device may report 2GB stop-copy-size (not precopy-able) in the > beginning of the migration. > If the VFIO device stop-copy-size changes in the middle of migration (e.g., > some of its resources are destroyed) and drops to 900MB, we will still see > 2GB report in estimate. > Only when calling the exact handler again we will get the updated 900MB > value and be able to converge. But that won't happen, because estimate size > will always be above the 1GB threshold. > > We could prevent these situations and call the get stop-copy-size once every > X seconds, but it feels a bit awkward. IMHO the confusion is caused by an unclear definition of stop-size and precopy-size in VFIO terms. What I would hope is the stop-size reported should be constant and not be affected by precopy process happening. Whatever _can_ change at all should be reported as precopy-size. So I wonder why stop-size can change from a driver, and whether that can be reported in a more predictable fashion. Otherwise I see little point in providing both stop-size and precopy-size, otherwise we'll always add them up into VFIO's total pending-size. The line on provision which data falls into which bucket doesn't seem to be clear to me. > > > > > This may improve situation but still leave one other issue, that IIUC even > > with above change and even if we can avoid sync dirty bitmap frequently, > > the migration thread can still spinning 100% calling estimate() and keep > > seeing data (which is not iterable...). For the longer term we may still > > need to report non-iterable stop-size in another way so QEMU knows that > > iterate() over all the VMState registers won't help in this case, so it > > should go into sleep without eating the cores. I hope that explains why I > > think a new API should be still needed for the long run. > > Yes, I get your point. > But is the spinning case very common? AFAIU, if it happens, it may indicate > some misconfiguration of the migration parameters. Agreed. It's just that our QE actively tests such migrations and there're always weird and similar reports on cpu spinning, so it might be nice to still fix it at some point. This shouldn't be super urgent, indeed. It's just that it can start to make sense when we have other discussions where reporting non-iteratble pending data might be pretty useful on its own as an idea. It's just that we need to figure out above on how VFIO can report predicatble stop-size in the drivers, and I wonder whether the drivers can be fixed (without breaking existing qemu; after all they currently always add both counters..). > > Anyway, I think your idea may fit VFIO, but still need to think about all > the small details. Thanks,
On 05/09/2024 18:17, Peter Xu wrote: > External email: Use caution opening links or attachments > > > On Thu, Sep 05, 2024 at 02:41:09PM +0300, Avihai Horon wrote: >> On 04/09/2024 19:16, Peter Xu wrote: >>> External email: Use caution opening links or attachments >>> >>> >>> On Wed, Sep 04, 2024 at 06:41:03PM +0300, Avihai Horon wrote: >>>> On 04/09/2024 16:00, Peter Xu wrote: >>>>> External email: Use caution opening links or attachments >>>>> >>>>> >>>>> Hello, Avihai, >>>>> >>>>> Reviving this thread just to discuss one issue below.. >>>>> >>>>> On Thu, Feb 16, 2023 at 04:36:27PM +0200, Avihai Horon wrote: >>>>>> +/* >>>>>> + * Migration size of VFIO devices can be as little as a few KBs or as big as >>>>>> + * many GBs. This value should be big enough to cover the worst case. >>>>>> + */ >>>>>> +#define VFIO_MIG_STOP_COPY_SIZE (100 * GiB) >>>>>> + >>>>>> +/* >>>>>> + * Only exact function is implemented and not estimate function. The reason is >>>>>> + * that during pre-copy phase of migration the estimate function is called >>>>>> + * repeatedly while pending RAM size is over the threshold, thus migration >>>>>> + * can't converge and querying the VFIO device pending data size is useless. >>>>>> + */ >>>>>> +static void vfio_state_pending_exact(void *opaque, uint64_t *must_precopy, >>>>>> + uint64_t *can_postcopy) >>>>>> +{ >>>>>> + VFIODevice *vbasedev = opaque; >>>>>> + uint64_t stop_copy_size = VFIO_MIG_STOP_COPY_SIZE; >>>>>> + >>>>>> + /* >>>>>> + * If getting pending migration size fails, VFIO_MIG_STOP_COPY_SIZE is >>>>>> + * reported so downtime limit won't be violated. >>>>>> + */ >>>>>> + vfio_query_stop_copy_size(vbasedev, &stop_copy_size); >>>>>> + *must_precopy += stop_copy_size; >>>>> Is this the chunk of data only can be copied during VM stopped? If so, I >>>>> wonder why it's reported as "must precopy" if we know precopy won't ever >>>>> move them.. >>>> A VFIO device that doesn't support precopy will send this data only when VM >>>> is stopped. >>>> A VFIO device that supports precopy may or may not send this data (or part >>>> of it) during precopy, and it depends on the specific VFIO device. >>>> >>>> According to state_pending_{estimate,exact} documentation, must_precopy is >>>> the amount of data that must be migrated before target starts, and indeed >>>> this VFIO data must be migrated before target starts. >>> Correct. It's just that estimate/exact API will be more suitable when the >>> exact() gets the report closer to estimate(), while the estimate() is only >>> a fast-path of exact(). It'll cause some chaos like this if it doesn't do >>> as so. >>> >>> Since we have discussion elsewhere on the fact that we currently ignore >>> non-iterative device states during migration decides to switchover, I was >>> wondering when reply on whether this stop-size could be the first thing >>> that will start to provide such non-iterable pending data. But that might >>> indeed need more thoughts, at least we may want to collect more outliers of >>> non-iterative device states outside VFIO that can cause downtime to be too >>> large. >> Ah, I see. >> >>>>> The issue is if with such reporting (and now in latest master branch we do >>>>> have the precopy size too, which was reported both in exact() and >>>>> estimate()), we can observe weird reports like this: >>>>> >>>>> 23411@1725380798968696657 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0) >>>>> 23411@1725380799050766000 migrate_pending_exact exact pending size 21038628864 (pre = 21038628864 post=0) >>>>> 23411@1725380799050896975 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0) >>>>> 23411@1725380799138657103 migrate_pending_exact exact pending size 21040144384 (pre = 21040144384 post=0) >>>>> 23411@1725380799140166709 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0) >>>>> 23411@1725380799217246861 migrate_pending_exact exact pending size 21038628864 (pre = 21038628864 post=0) >>>>> 23411@1725380799217384969 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0) >>>>> 23411@1725380799305147722 migrate_pending_exact exact pending size 21039976448 (pre = 21039976448 post=0) >>>>> 23411@1725380799306639956 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0) >>>>> 23411@1725380799385118245 migrate_pending_exact exact pending size 21038796800 (pre = 21038796800 post=0) >>>>> 23411@1725380799385709382 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0) >>>>> >>>>> So estimate() keeps reporting zero but the exact() reports much larger, and >>>>> it keeps spinning like this. I think that's not how it was designed to be >>>>> used.. >>>> It keeps spinning and migration doesn't converge? >>>> If so, configuring a higher downtime limit or the avail-switchover-bandwidth >>>> parameter may solve it. >>> Yes, this is the only way to go, but it's a separate issue on reportings of >>> estimate()/exact(). More below. >>> >>>>> Does this stop copy size change for a VFIO device or not? >>>> It depends on the specific VFIO device. >>>> If the device supports precopy and all (or part) of its data is >>>> precopy-able, then stopcopy size will change. >>>> Besides that, the amount of resources currently used by the VFIO device can >>>> also affect the stopcopy size, and it may increase or decrease as resources >>>> are created or destroyed. >>> I see, thanks. >>> >>>>> IIUC, we may want some other mechanism to report stop copy size for a >>>>> device, rather than reporting it with the current exact()/estimate() api. >>>>> That's, per my undertanding, only used for iterable data, while >>>>> stop-copy-size may not fall into that category if so. >>>> The above situation is caused by the fact that VFIO data may not be fully >>>> precopy-able (as opposed to RAM data). >>>> I don't think reporting the stop-copy-size in a different API will help the >>>> above situation -- we would still have to take stop-copy-size into account >>>> before converging, to not violate downtime. >>> It will help some other situation, though. >>> >>> One issue with above freqeunt estimate()/exact() call is that QEMU will go >>> into madness loop thinking "we're close" and "we're far away from converge" >>> even if the reality is "we're far away". The bad side effect is when this >>> loop happens it'll not only affect VFIO but also other devices (e.g. KVM, >>> vhost, etc.) so we'll do high overhead sync() in an extremely frequent >>> manner. IMHO they're totally a waste of resource, because all the rest of >>> the modules are following the default rules of estimate()/exact(). >>> >>> One simple but efficient fix for VFIO, IMHO, is at least VFIO should also >>> cache the stop-size internally and report in estimate(), e.g.: >>> >>> /* Give an estimate of the amount left to be transferred, >>> * the result is split into the amount for units that can and >>> * for units that can't do postcopy. >>> */ >>> void qemu_savevm_state_pending_estimate(uint64_t *must_precopy, >>> uint64_t *can_postcopy) >>> { >>> } >>> >>> If it's justified that the stop-size to be reported in exact(), it should >>> also be reported in estimate() per the comment above. It should also fall >>> into precopy category in this case. >>> >>> Then with that we should avoid calling exact() frequently for not only VFIO >>> but also others (especially, KVM GET_DIRTY_LOG / CLEAR_DIRTY_LOG ioctls), >>> then we know it won't converge anyway without the help of tuning downtime >>> upper, or adjust avail-switchover-bandwidth. >> Yes, it will solve this problem, but IIRC, the reason I didn't cache and >> return stop-copy-size in estimate from the first place was that it wasn't >> fully precopy-able, and that could cause some problems: >> Once we cache and report this size in estimate, it may not decrease anymore >> -- we can't send it during precopy and we might not be able to call get >> stop-copy-size again from the exact path (because estimate might now reach >> below the threshold). >> >> For example, assume the threshold for convergence is 1GB. >> A VFIO device may report 2GB stop-copy-size (not precopy-able) in the >> beginning of the migration. >> If the VFIO device stop-copy-size changes in the middle of migration (e.g., >> some of its resources are destroyed) and drops to 900MB, we will still see >> 2GB report in estimate. >> Only when calling the exact handler again we will get the updated 900MB >> value and be able to converge. But that won't happen, because estimate size >> will always be above the 1GB threshold. >> >> We could prevent these situations and call the get stop-copy-size once every >> X seconds, but it feels a bit awkward. > IMHO the confusion is caused by an unclear definition of stop-size and > precopy-size in VFIO terms. > > What I would hope is the stop-size reported should be constant and not be > affected by precopy process happening. Whatever _can_ change at all should > be reported as precopy-size. > > So I wonder why stop-size can change from a driver, and whether that can be > reported in a more predictable fashion. Otherwise I see little point in > providing both stop-size and precopy-size, otherwise we'll always add them > up into VFIO's total pending-size. The line on provision which data falls > into which bucket doesn't seem to be clear to me. Stopcopy-size is reported by VFIO_DEVICE_FEATURE_MIG_DATA_SIZE ioctl, which states that this is "[...] the estimated data length that will be required to complete stop copy". So by this definition, stopcopy-size can change during precopy (it can also change if device resources are created or destroyed). Precopy-size is reported by VFIO_MIG_GET_PRECOPY_INFO ioctl, which states that this is "[...] estimates of data available from the device during the PRE_COPY states". Maybe the confusion was caused by this line in vfio_state_pending_exact(): *must_precopy += migration->precopy_init_size + migration->precopy_dirty_size; Which I think should be removed, as stopcopy-size should already cover precopy data. Thanks. > >>> This may improve situation but still leave one other issue, that IIUC even >>> with above change and even if we can avoid sync dirty bitmap frequently, >>> the migration thread can still spinning 100% calling estimate() and keep >>> seeing data (which is not iterable...). For the longer term we may still >>> need to report non-iterable stop-size in another way so QEMU knows that >>> iterate() over all the VMState registers won't help in this case, so it >>> should go into sleep without eating the cores. I hope that explains why I >>> think a new API should be still needed for the long run. >> Yes, I get your point. >> But is the spinning case very common? AFAIU, if it happens, it may indicate >> some misconfiguration of the migration parameters. > Agreed. > > It's just that our QE actively tests such migrations and there're always > weird and similar reports on cpu spinning, so it might be nice to still fix > it at some point. > > This shouldn't be super urgent, indeed. It's just that it can start to > make sense when we have other discussions where reporting non-iteratble > pending data might be pretty useful on its own as an idea. It's just that > we need to figure out above on how VFIO can report predicatble stop-size in > the drivers, and I wonder whether the drivers can be fixed (without > breaking existing qemu; after all they currently always add both > counters..). > >> Anyway, I think your idea may fit VFIO, but still need to think about all >> the small details. > Thanks, > > -- > Peter Xu >
On Thu, Sep 05, 2024 at 07:07:28PM +0300, Avihai Horon wrote: > > So I wonder why stop-size can change from a driver, and whether that can be > > reported in a more predictable fashion. Otherwise I see little point in > > providing both stop-size and precopy-size, otherwise we'll always add them > > up into VFIO's total pending-size. The line on provision which data falls > > into which bucket doesn't seem to be clear to me. > > Stopcopy-size is reported by VFIO_DEVICE_FEATURE_MIG_DATA_SIZE ioctl, which > states that this is "[...] the estimated data length that will be required > to complete stop copy". > So by this definition, stopcopy-size can change during precopy (it can also > change if device resources are created or destroyed). > > Precopy-size is reported by VFIO_MIG_GET_PRECOPY_INFO ioctl, which states > that this is "[...] estimates of data available from the device during the > PRE_COPY states". > > Maybe the confusion was caused by this line in vfio_state_pending_exact(): > *must_precopy += migration->precopy_init_size + > migration->precopy_dirty_size; > > Which I think should be removed, as stopcopy-size should already cover > precopy data. Would you please help double check on the kernel drivers' side? If that's true I agree that should be dropped. Does it also mean then that the currently reported stop-size - precopy-size will be very close to the constant non-iterable data size? If so, is it cache-able? It doesn't need to be 100% accurate as the ioctls are not atomic, but still if it's mostly usable (again, will need you all to help justify that..) as that then it looks like we have some way to report that without the concern you raised before. Thanks,
On 05/09/2024 19:23, Peter Xu wrote: > External email: Use caution opening links or attachments > > > On Thu, Sep 05, 2024 at 07:07:28PM +0300, Avihai Horon wrote: >>> So I wonder why stop-size can change from a driver, and whether that can be >>> reported in a more predictable fashion. Otherwise I see little point in >>> providing both stop-size and precopy-size, otherwise we'll always add them >>> up into VFIO's total pending-size. The line on provision which data falls >>> into which bucket doesn't seem to be clear to me. >> Stopcopy-size is reported by VFIO_DEVICE_FEATURE_MIG_DATA_SIZE ioctl, which >> states that this is "[...] the estimated data length that will be required >> to complete stop copy". >> So by this definition, stopcopy-size can change during precopy (it can also >> change if device resources are created or destroyed). >> >> Precopy-size is reported by VFIO_MIG_GET_PRECOPY_INFO ioctl, which states >> that this is "[...] estimates of data available from the device during the >> PRE_COPY states". >> >> Maybe the confusion was caused by this line in vfio_state_pending_exact(): >> *must_precopy += migration->precopy_init_size + >> migration->precopy_dirty_size; >> >> Which I think should be removed, as stopcopy-size should already cover >> precopy data. > Would you please help double check on the kernel drivers' side? If that's > true I agree that should be dropped. Sure. > > Does it also mean then that the currently reported stop-size - precopy-size > will be very close to the constant non-iterable data size? It's not constant, while the VM is running it can change. Thanks. > If so, is it > cache-able? It doesn't need to be 100% accurate as the ioctls are not > atomic, but still if it's mostly usable (again, will need you all to help > justify that..) as that then it looks like we have some way to report that > without the concern you raised before. > > Thanks, > > -- > Peter Xu >
On Thu, Sep 05, 2024 at 07:45:43PM +0300, Avihai Horon wrote: > > Does it also mean then that the currently reported stop-size - precopy-size > > will be very close to the constant non-iterable data size? > > It's not constant, while the VM is running it can change. I wonder how heavy is VFIO_DEVICE_FEATURE_MIG_DATA_SIZE ioctl. I just gave it a quick shot with a busy VM migrating and estimate() is invoked only every ~100ms. VFIO might be different, but I wonder whether we can fetch stop-size in estimate() somehow, so it's still a pretty fast estimate() meanwhile we avoid the rest of exact() calls (which are destined to be useless without VFIO). IIUC so far the estimate()/exact() was because ram sync is heavy when exact(). When idle it's 80+ms now for 32G VM with current master (which has a bug and I'm fixing it up [1]..), even if after the fix it's 3ms (I think both numbers contain dirty bitmap sync for both vfio and kvm). So in that case maybe we can still try fetching stop-size only for both estimate() and exact(), but only sync bitmap in exact(). [1] https://lore.kernel.org/r/20240904223510.3519358-1-peterx@redhat.com Thanks,
On 05/09/2024 21:31, Peter Xu wrote: > External email: Use caution opening links or attachments > > > On Thu, Sep 05, 2024 at 07:45:43PM +0300, Avihai Horon wrote: >>> Does it also mean then that the currently reported stop-size - precopy-size >>> will be very close to the constant non-iterable data size? >> It's not constant, while the VM is running it can change. > I wonder how heavy is VFIO_DEVICE_FEATURE_MIG_DATA_SIZE ioctl. > > I just gave it a quick shot with a busy VM migrating and estimate() is > invoked only every ~100ms. > > VFIO might be different, but I wonder whether we can fetch stop-size in > estimate() somehow, so it's still a pretty fast estimate() meanwhile we > avoid the rest of exact() calls (which are destined to be useless without > VFIO). > > IIUC so far the estimate()/exact() was because ram sync is heavy when > exact(). When idle it's 80+ms now for 32G VM with current master (which > has a bug and I'm fixing it up [1]..), even if after the fix it's 3ms (I > think both numbers contain dirty bitmap sync for both vfio and kvm). So in > that case maybe we can still try fetching stop-size only for both > estimate() and exact(), but only sync bitmap in exact(). IIUC, the end goal is to prevent migration thread spinning uselessly in pre-copy in such scenarios, right? If eventually we do call get stop-copy-size in estimate(), we will move the spinning from "exact() -> estimate() -> exact() -> estimate() ..." to "estimate() -> estimate() -> ...". If so, what benefit would we get from this? We only move the useless work to other place. Shouldn't we directly go for the non precopy-able vs precopy-able report that you suggested? Thanks.
On Mon, Sep 09, 2024 at 03:52:39PM +0300, Avihai Horon wrote: > > On 05/09/2024 21:31, Peter Xu wrote: > > External email: Use caution opening links or attachments > > > > > > On Thu, Sep 05, 2024 at 07:45:43PM +0300, Avihai Horon wrote: > > > > Does it also mean then that the currently reported stop-size - precopy-size > > > > will be very close to the constant non-iterable data size? > > > It's not constant, while the VM is running it can change. > > I wonder how heavy is VFIO_DEVICE_FEATURE_MIG_DATA_SIZE ioctl. > > > > I just gave it a quick shot with a busy VM migrating and estimate() is > > invoked only every ~100ms. > > > > VFIO might be different, but I wonder whether we can fetch stop-size in > > estimate() somehow, so it's still a pretty fast estimate() meanwhile we > > avoid the rest of exact() calls (which are destined to be useless without > > VFIO). > > > > IIUC so far the estimate()/exact() was because ram sync is heavy when > > exact(). When idle it's 80+ms now for 32G VM with current master (which > > has a bug and I'm fixing it up [1]..), even if after the fix it's 3ms (I > > think both numbers contain dirty bitmap sync for both vfio and kvm). So in > > that case maybe we can still try fetching stop-size only for both > > estimate() and exact(), but only sync bitmap in exact(). > > IIUC, the end goal is to prevent migration thread spinning uselessly in > pre-copy in such scenarios, right? > If eventually we do call get stop-copy-size in estimate(), we will move the > spinning from "exact() -> estimate() -> exact() -> estimate() ..." to > "estimate() -> estimate() -> ...". > If so, what benefit would we get from this? We only move the useless work to > other place. We can avoid exact() calls invoked for other vmstate handlers, e.g. RAM, which can be much heavier and can require BQL during the slow process, which can further block more vcpu operations during migration. And as mentioned previously, VFIO is, AFAIK, the only handler that provide different definitions of estimate() and exact(), which can be confusing, and it's against the "estimate() is the fast-path" logic. But I agree it's not fundamentally changing much.. > Shouldn't we directly go for the non precopy-able vs precopy-able report > that you suggested? Yep, I just thought the previous one would be much easier to achieve. And as you said, VFIO is still pretty special that the user will need manual involvements anyway to specify e.g. very large downtimes, so this condition shouldn't be a major case to happen. Said that, if you have a solid idea on this please feel free to go ahead directly with a complete solution. Thanks,
On 09/09/2024 18:11, Peter Xu wrote: > External email: Use caution opening links or attachments > > > On Mon, Sep 09, 2024 at 03:52:39PM +0300, Avihai Horon wrote: >> On 05/09/2024 21:31, Peter Xu wrote: >>> External email: Use caution opening links or attachments >>> >>> >>> On Thu, Sep 05, 2024 at 07:45:43PM +0300, Avihai Horon wrote: >>>>> Does it also mean then that the currently reported stop-size - precopy-size >>>>> will be very close to the constant non-iterable data size? >>>> It's not constant, while the VM is running it can change. >>> I wonder how heavy is VFIO_DEVICE_FEATURE_MIG_DATA_SIZE ioctl. >>> >>> I just gave it a quick shot with a busy VM migrating and estimate() is >>> invoked only every ~100ms. >>> >>> VFIO might be different, but I wonder whether we can fetch stop-size in >>> estimate() somehow, so it's still a pretty fast estimate() meanwhile we >>> avoid the rest of exact() calls (which are destined to be useless without >>> VFIO). >>> >>> IIUC so far the estimate()/exact() was because ram sync is heavy when >>> exact(). When idle it's 80+ms now for 32G VM with current master (which >>> has a bug and I'm fixing it up [1]..), even if after the fix it's 3ms (I >>> think both numbers contain dirty bitmap sync for both vfio and kvm). So in >>> that case maybe we can still try fetching stop-size only for both >>> estimate() and exact(), but only sync bitmap in exact(). >> IIUC, the end goal is to prevent migration thread spinning uselessly in >> pre-copy in such scenarios, right? >> If eventually we do call get stop-copy-size in estimate(), we will move the >> spinning from "exact() -> estimate() -> exact() -> estimate() ..." to >> "estimate() -> estimate() -> ...". >> If so, what benefit would we get from this? We only move the useless work to >> other place. > We can avoid exact() calls invoked for other vmstate handlers, e.g. RAM, > which can be much heavier and can require BQL during the slow process, > which can further block more vcpu operations during migration. > > And as mentioned previously, VFIO is, AFAIK, the only handler that provide > different definitions of estimate() and exact(), which can be confusing, > and it's against the "estimate() is the fast-path" logic. > > But I agree it's not fundamentally changing much.. > >> Shouldn't we directly go for the non precopy-able vs precopy-able report >> that you suggested? > Yep, I just thought the previous one would be much easier to achieve. Yes, though I prefer not to add the get stop-copy-size ioctl in the estimate() flow because: a) it's guaranteed to be called (possibly many times) in every migration (either well configured which is the probable case or misconfigured which spins), and b) because how "heavy" get stop-copy-size is may differ from VFIO device to the other. Maybe I am being a bit overcautious here, but let's explore other options first :) > And > as you said, VFIO is still pretty special that the user will need manual > involvements anyway to specify e.g. very large downtimes, so this condition > shouldn't be a major case to happen. Said that, if you have a solid idea > on this please feel free to go ahead directly with a complete solution. I think it's possible to do it with what we currently have (VFIO uAPI-wise), I will try to think of one. BTW, I checked again and I think we should drop this line from vfio_state_pending_exact(): *must_precopy += migration->precopy_init_size + migration->precopy_dirty_size; I can send a patch for that. Thanks.
On 9/12/24 10:09, Avihai Horon wrote: > > On 09/09/2024 18:11, Peter Xu wrote: >> External email: Use caution opening links or attachments >> >> >> On Mon, Sep 09, 2024 at 03:52:39PM +0300, Avihai Horon wrote: >>> On 05/09/2024 21:31, Peter Xu wrote: >>>> External email: Use caution opening links or attachments >>>> >>>> >>>> On Thu, Sep 05, 2024 at 07:45:43PM +0300, Avihai Horon wrote: >>>>>> Does it also mean then that the currently reported stop-size - precopy-size >>>>>> will be very close to the constant non-iterable data size? >>>>> It's not constant, while the VM is running it can change. >>>> I wonder how heavy is VFIO_DEVICE_FEATURE_MIG_DATA_SIZE ioctl. >>>> >>>> I just gave it a quick shot with a busy VM migrating and estimate() is >>>> invoked only every ~100ms. >>>> >>>> VFIO might be different, but I wonder whether we can fetch stop-size in >>>> estimate() somehow, so it's still a pretty fast estimate() meanwhile we >>>> avoid the rest of exact() calls (which are destined to be useless without >>>> VFIO). >>>> >>>> IIUC so far the estimate()/exact() was because ram sync is heavy when >>>> exact(). When idle it's 80+ms now for 32G VM with current master (which >>>> has a bug and I'm fixing it up [1]..), even if after the fix it's 3ms (I >>>> think both numbers contain dirty bitmap sync for both vfio and kvm). So in >>>> that case maybe we can still try fetching stop-size only for both >>>> estimate() and exact(), but only sync bitmap in exact(). >>> IIUC, the end goal is to prevent migration thread spinning uselessly in >>> pre-copy in such scenarios, right? >>> If eventually we do call get stop-copy-size in estimate(), we will move the >>> spinning from "exact() -> estimate() -> exact() -> estimate() ..." to >>> "estimate() -> estimate() -> ...". >>> If so, what benefit would we get from this? We only move the useless work to >>> other place. >> We can avoid exact() calls invoked for other vmstate handlers, e.g. RAM, >> which can be much heavier and can require BQL during the slow process, >> which can further block more vcpu operations during migration. >> >> And as mentioned previously, VFIO is, AFAIK, the only handler that provide >> different definitions of estimate() and exact(), which can be confusing, >> and it's against the "estimate() is the fast-path" logic. >> >> But I agree it's not fundamentally changing much.. >> >>> Shouldn't we directly go for the non precopy-able vs precopy-able report >>> that you suggested? >> Yep, I just thought the previous one would be much easier to achieve. > > Yes, though I prefer not to add the get stop-copy-size ioctl in the estimate() flow because: a) it's guaranteed to be called (possibly many times) in every migration (either well configured which is the probable case or misconfigured which spins), and b) because how "heavy" get stop-copy-size is may differ from VFIO device to the other. > > Maybe I am being a bit overcautious here, but let's explore other options first :) > >> And >> as you said, VFIO is still pretty special that the user will need manual >> involvements anyway to specify e.g. very large downtimes, so this condition >> shouldn't be a major case to happen. Said that, if you have a solid idea >> on this please feel free to go ahead directly with a complete solution. > > I think it's possible to do it with what we currently have (VFIO uAPI-wise), I will try to think of one. > > BTW, I checked again and I think we should drop this line from vfio_state_pending_exact(): > *must_precopy += migration->precopy_init_size + migration->precopy_dirty_size; > > I can send a patch for that. Please do. We can then provide a scratch build for further testing and experiments with vGPUs. Thanks, C.
On Thu, Sep 12, 2024 at 11:09:12AM +0300, Avihai Horon wrote: > > On 09/09/2024 18:11, Peter Xu wrote: > > External email: Use caution opening links or attachments > > > > > > On Mon, Sep 09, 2024 at 03:52:39PM +0300, Avihai Horon wrote: > > > On 05/09/2024 21:31, Peter Xu wrote: > > > > External email: Use caution opening links or attachments > > > > > > > > > > > > On Thu, Sep 05, 2024 at 07:45:43PM +0300, Avihai Horon wrote: > > > > > > Does it also mean then that the currently reported stop-size - precopy-size > > > > > > will be very close to the constant non-iterable data size? > > > > > It's not constant, while the VM is running it can change. > > > > I wonder how heavy is VFIO_DEVICE_FEATURE_MIG_DATA_SIZE ioctl. > > > > > > > > I just gave it a quick shot with a busy VM migrating and estimate() is > > > > invoked only every ~100ms. > > > > > > > > VFIO might be different, but I wonder whether we can fetch stop-size in > > > > estimate() somehow, so it's still a pretty fast estimate() meanwhile we > > > > avoid the rest of exact() calls (which are destined to be useless without > > > > VFIO). > > > > > > > > IIUC so far the estimate()/exact() was because ram sync is heavy when > > > > exact(). When idle it's 80+ms now for 32G VM with current master (which > > > > has a bug and I'm fixing it up [1]..), even if after the fix it's 3ms (I > > > > think both numbers contain dirty bitmap sync for both vfio and kvm). So in > > > > that case maybe we can still try fetching stop-size only for both > > > > estimate() and exact(), but only sync bitmap in exact(). > > > IIUC, the end goal is to prevent migration thread spinning uselessly in > > > pre-copy in such scenarios, right? > > > If eventually we do call get stop-copy-size in estimate(), we will move the > > > spinning from "exact() -> estimate() -> exact() -> estimate() ..." to > > > "estimate() -> estimate() -> ...". > > > If so, what benefit would we get from this? We only move the useless work to > > > other place. > > We can avoid exact() calls invoked for other vmstate handlers, e.g. RAM, > > which can be much heavier and can require BQL during the slow process, > > which can further block more vcpu operations during migration. > > > > And as mentioned previously, VFIO is, AFAIK, the only handler that provide > > different definitions of estimate() and exact(), which can be confusing, > > and it's against the "estimate() is the fast-path" logic. > > > > But I agree it's not fundamentally changing much.. > > > > > Shouldn't we directly go for the non precopy-able vs precopy-able report > > > that you suggested? > > Yep, I just thought the previous one would be much easier to achieve. > > Yes, though I prefer not to add the get stop-copy-size ioctl in the > estimate() flow because: a) it's guaranteed to be called (possibly many > times) in every migration (either well configured which is the probable case > or misconfigured which spins), and b) because how "heavy" get stop-copy-size > is may differ from VFIO device to the other. > > Maybe I am being a bit overcautious here, but let's explore other options > first :) Nop; that's totally ok. > > > And > > as you said, VFIO is still pretty special that the user will need manual > > involvements anyway to specify e.g. very large downtimes, so this condition > > shouldn't be a major case to happen. Said that, if you have a solid idea > > on this please feel free to go ahead directly with a complete solution. > > I think it's possible to do it with what we currently have (VFIO uAPI-wise), > I will try to think of one. > > BTW, I checked again and I think we should drop this line from > vfio_state_pending_exact(): > *must_precopy += migration->precopy_init_size + > migration->precopy_dirty_size; > > I can send a patch for that. Yes please. I also wonder whether it'll be good to update the uAPI header too in Linux, for vfio_device_feature_mig_data_size or VFIO_DEVICE_FEATURE_MIG_DATA_SIZE. Currently it reads a bit ambiguous to me, while it could make a huge difference iiuc when precopy size is non-trivial. /* * Upon VFIO_DEVICE_FEATURE_GET read back the estimated data length that will * be required to complete stop copy. * * Note: Can be called on each device state. */ Maybe that'll also be helpful when other drivers will implement this, then just to make sure both side (user / kernel) are crystal clear on the definition.
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h index 2c0fb1d622..c4eab55af9 100644 --- a/include/hw/vfio/vfio-common.h +++ b/include/hw/vfio/vfio-common.h @@ -66,6 +66,11 @@ typedef struct VFIOMigration { int vm_running; Notifier migration_state; uint64_t pending_bytes; + uint32_t device_state; + int data_fd; + void *data_buffer; + size_t data_buffer_size; + bool v2; } VFIOMigration; typedef struct VFIOAddressSpace { diff --git a/hw/vfio/common.c b/hw/vfio/common.c index 1c974e9c5a..54fee2d5de 100644 --- a/hw/vfio/common.c +++ b/hw/vfio/common.c @@ -408,10 +408,17 @@ static bool vfio_devices_all_dirty_tracking(VFIOContainer *container) return false; } - if ((vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF) && + if (!migration->v2 && + (vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF) && (migration->device_state_v1 & VFIO_DEVICE_STATE_V1_RUNNING)) { return false; } + + if (migration->v2 && + vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF && + migration->device_state == VFIO_DEVICE_STATE_RUNNING) { + return false; + } } } return true; @@ -438,7 +445,13 @@ static bool vfio_devices_all_running_and_mig_active(VFIOContainer *container) return false; } - if (migration->device_state_v1 & VFIO_DEVICE_STATE_V1_RUNNING) { + if (!migration->v2 && + migration->device_state_v1 & VFIO_DEVICE_STATE_V1_RUNNING) { + continue; + } + + if (migration->v2 && + migration->device_state == VFIO_DEVICE_STATE_RUNNING) { continue; } else { return false; diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c index a5fe285721..452d96a520 100644 --- a/hw/vfio/migration.c +++ b/hw/vfio/migration.c @@ -10,6 +10,7 @@ #include "qemu/osdep.h" #include "qemu/main-loop.h" #include "qemu/cutils.h" +#include "qemu/units.h" #include <linux/vfio.h> #include <sys/ioctl.h> @@ -44,8 +45,114 @@ #define VFIO_MIG_FLAG_DEV_SETUP_STATE (0xffffffffef100003ULL) #define VFIO_MIG_FLAG_DEV_DATA_STATE (0xffffffffef100004ULL) +/* + * This is an arbitrary size based on migration of mlx5 devices, where typically + * total device migration size is on the order of 100s of MB. Testing with + * larger values, e.g. 128MB and 1GB, did not show a performance improvement. + */ +#define VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE (1 * MiB) + static int64_t bytes_transferred; +static const char *mig_state_to_str(enum vfio_device_mig_state state) +{ + switch (state) { + case VFIO_DEVICE_STATE_ERROR: + return "ERROR"; + case VFIO_DEVICE_STATE_STOP: + return "STOP"; + case VFIO_DEVICE_STATE_RUNNING: + return "RUNNING"; + case VFIO_DEVICE_STATE_STOP_COPY: + return "STOP_COPY"; + case VFIO_DEVICE_STATE_RESUMING: + return "RESUMING"; + default: + return "UNKNOWN STATE"; + } +} + +static int vfio_migration_set_state(VFIODevice *vbasedev, + enum vfio_device_mig_state new_state, + enum vfio_device_mig_state recover_state) +{ + VFIOMigration *migration = vbasedev->migration; + uint64_t buf[DIV_ROUND_UP(sizeof(struct vfio_device_feature) + + sizeof(struct vfio_device_feature_mig_state), + sizeof(uint64_t))] = {}; + struct vfio_device_feature *feature = (struct vfio_device_feature *)buf; + struct vfio_device_feature_mig_state *mig_state = + (struct vfio_device_feature_mig_state *)feature->data; + int ret; + + feature->argsz = sizeof(buf); + feature->flags = + VFIO_DEVICE_FEATURE_SET | VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE; + mig_state->device_state = new_state; + if (ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature)) { + /* Try to set the device in some good state */ + ret = -errno; + + if (recover_state == VFIO_DEVICE_STATE_ERROR) { + error_report("%s: Failed setting device state to %s, err: %s. " + "Recover state is ERROR. Resetting device", + vbasedev->name, mig_state_to_str(new_state), + strerror(errno)); + + goto reset_device; + } + + error_report( + "%s: Failed setting device state to %s, err: %s. Setting device in recover state %s", + vbasedev->name, mig_state_to_str(new_state), + strerror(errno), mig_state_to_str(recover_state)); + + mig_state->device_state = recover_state; + if (ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature)) { + ret = -errno; + error_report( + "%s: Failed setting device in recover state, err: %s. Resetting device", + vbasedev->name, strerror(errno)); + + goto reset_device; + } + + migration->device_state = recover_state; + + return ret; + } + + migration->device_state = new_state; + if (mig_state->data_fd != -1) { + if (migration->data_fd != -1) { + /* + * This can happen if the device is asynchronously reset and + * terminates a data transfer. + */ + error_report("%s: data_fd out of sync", vbasedev->name); + close(mig_state->data_fd); + + return -EBADF; + } + + migration->data_fd = mig_state->data_fd; + } + + trace_vfio_migration_set_state(vbasedev->name, mig_state_to_str(new_state)); + + return 0; + +reset_device: + if (ioctl(vbasedev->fd, VFIO_DEVICE_RESET)) { + hw_error("%s: Failed resetting device, err: %s", vbasedev->name, + strerror(errno)); + } + + migration->device_state = VFIO_DEVICE_STATE_RUNNING; + + return ret; +} + static inline int vfio_mig_access(VFIODevice *vbasedev, void *val, int count, off_t off, bool iswrite) { @@ -260,6 +367,18 @@ static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev, uint64_t *size) return ret; } +static int vfio_load_buffer(QEMUFile *f, VFIODevice *vbasedev, + uint64_t data_size) +{ + VFIOMigration *migration = vbasedev->migration; + int ret; + + ret = qemu_file_get_to_fd(f, migration->data_fd, data_size); + trace_vfio_load_state_device_data(vbasedev->name, data_size, ret); + + return ret; +} + static int vfio_v1_load_buffer(QEMUFile *f, VFIODevice *vbasedev, uint64_t data_size) { @@ -394,6 +513,14 @@ static int vfio_load_device_config_state(QEMUFile *f, void *opaque) return qemu_file_get_error(f); } +static void vfio_migration_cleanup(VFIODevice *vbasedev) +{ + VFIOMigration *migration = vbasedev->migration; + + close(migration->data_fd); + migration->data_fd = -1; +} + static void vfio_migration_v1_cleanup(VFIODevice *vbasedev) { VFIOMigration *migration = vbasedev->migration; @@ -403,8 +530,80 @@ static void vfio_migration_v1_cleanup(VFIODevice *vbasedev) } } +static int vfio_query_stop_copy_size(VFIODevice *vbasedev, + uint64_t *stop_copy_size) +{ + uint64_t buf[DIV_ROUND_UP(sizeof(struct vfio_device_feature) + + sizeof(struct vfio_device_feature_mig_data_size), + sizeof(uint64_t))] = {}; + struct vfio_device_feature *feature = (struct vfio_device_feature *)buf; + struct vfio_device_feature_mig_data_size *mig_data_size = + (struct vfio_device_feature_mig_data_size *)feature->data; + + feature->argsz = sizeof(buf); + feature->flags = + VFIO_DEVICE_FEATURE_GET | VFIO_DEVICE_FEATURE_MIG_DATA_SIZE; + + if (ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature)) { + return -errno; + } + + *stop_copy_size = mig_data_size->stop_copy_length; + + return 0; +} + +/* Returns 1 if end-of-stream is reached, 0 if more data and -errno if error */ +static int vfio_save_block(QEMUFile *f, VFIOMigration *migration) +{ + ssize_t data_size; + + data_size = read(migration->data_fd, migration->data_buffer, + migration->data_buffer_size); + if (data_size < 0) { + return -errno; + } + if (data_size == 0) { + return 1; + } + + qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE); + qemu_put_be64(f, data_size); + qemu_put_buffer(f, migration->data_buffer, data_size); + bytes_transferred += data_size; + + trace_vfio_save_block(migration->vbasedev->name, data_size); + + return qemu_file_get_error(f); +} + /* ---------------------------------------------------------------------- */ +static int vfio_save_setup(QEMUFile *f, void *opaque) +{ + VFIODevice *vbasedev = opaque; + VFIOMigration *migration = vbasedev->migration; + uint64_t stop_copy_size = VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE; + + qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE); + + vfio_query_stop_copy_size(vbasedev, &stop_copy_size); + migration->data_buffer_size = MIN(VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE, + stop_copy_size); + migration->data_buffer = g_try_malloc0(migration->data_buffer_size); + if (!migration->data_buffer) { + error_report("%s: Failed to allocate migration data buffer", + vbasedev->name); + return -ENOMEM; + } + + trace_vfio_save_setup(vbasedev->name, migration->data_buffer_size); + + qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE); + + return qemu_file_get_error(f); +} + static int vfio_v1_save_setup(QEMUFile *f, void *opaque) { VFIODevice *vbasedev = opaque; @@ -448,6 +647,17 @@ static int vfio_v1_save_setup(QEMUFile *f, void *opaque) return 0; } +static void vfio_save_cleanup(void *opaque) +{ + VFIODevice *vbasedev = opaque; + VFIOMigration *migration = vbasedev->migration; + + g_free(migration->data_buffer); + migration->data_buffer = NULL; + vfio_migration_cleanup(vbasedev); + trace_vfio_save_cleanup(vbasedev->name); +} + static void vfio_v1_save_cleanup(void *opaque) { VFIODevice *vbasedev = opaque; @@ -456,6 +666,35 @@ static void vfio_v1_save_cleanup(void *opaque) trace_vfio_save_cleanup(vbasedev->name); } +/* + * Migration size of VFIO devices can be as little as a few KBs or as big as + * many GBs. This value should be big enough to cover the worst case. + */ +#define VFIO_MIG_STOP_COPY_SIZE (100 * GiB) + +/* + * Only exact function is implemented and not estimate function. The reason is + * that during pre-copy phase of migration the estimate function is called + * repeatedly while pending RAM size is over the threshold, thus migration + * can't converge and querying the VFIO device pending data size is useless. + */ +static void vfio_state_pending_exact(void *opaque, uint64_t *must_precopy, + uint64_t *can_postcopy) +{ + VFIODevice *vbasedev = opaque; + uint64_t stop_copy_size = VFIO_MIG_STOP_COPY_SIZE; + + /* + * If getting pending migration size fails, VFIO_MIG_STOP_COPY_SIZE is + * reported so downtime limit won't be violated. + */ + vfio_query_stop_copy_size(vbasedev, &stop_copy_size); + *must_precopy += stop_copy_size; + + trace_vfio_state_pending_exact(vbasedev->name, *must_precopy, *can_postcopy, + stop_copy_size); +} + static void vfio_v1_state_pending(void *opaque, uint64_t *must_precopy, uint64_t *can_postcopy) { @@ -520,6 +759,42 @@ static int vfio_save_iterate(QEMUFile *f, void *opaque) return 0; } +static int vfio_save_complete_precopy(QEMUFile *f, void *opaque) +{ + VFIODevice *vbasedev = opaque; + int ret; + + /* We reach here with device state STOP only */ + ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP_COPY, + VFIO_DEVICE_STATE_STOP); + if (ret) { + return ret; + } + + do { + ret = vfio_save_block(f, vbasedev->migration); + if (ret < 0) { + return ret; + } + } while (!ret); + + qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE); + ret = qemu_file_get_error(f); + if (ret) { + return ret; + } + + /* + * If setting the device in STOP state fails, the device should be reset. + * To do so, use ERROR state as a recover state. + */ + ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP, + VFIO_DEVICE_STATE_ERROR); + trace_vfio_save_complete_precopy(vbasedev->name, ret); + + return ret; +} + static int vfio_v1_save_complete_precopy(QEMUFile *f, void *opaque) { VFIODevice *vbasedev = opaque; @@ -589,6 +864,14 @@ static void vfio_save_state(QEMUFile *f, void *opaque) } } +static int vfio_load_setup(QEMUFile *f, void *opaque) +{ + VFIODevice *vbasedev = opaque; + + return vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RESUMING, + vbasedev->migration->device_state); +} + static int vfio_v1_load_setup(QEMUFile *f, void *opaque) { VFIODevice *vbasedev = opaque; @@ -616,6 +899,16 @@ static int vfio_v1_load_setup(QEMUFile *f, void *opaque) return ret; } +static int vfio_load_cleanup(void *opaque) +{ + VFIODevice *vbasedev = opaque; + + vfio_migration_cleanup(vbasedev); + trace_vfio_load_cleanup(vbasedev->name); + + return 0; +} + static int vfio_v1_load_cleanup(void *opaque) { VFIODevice *vbasedev = opaque; @@ -658,7 +951,11 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id) uint64_t data_size = qemu_get_be64(f); if (data_size) { - ret = vfio_v1_load_buffer(f, vbasedev, data_size); + if (vbasedev->migration->v2) { + ret = vfio_load_buffer(f, vbasedev, data_size); + } else { + ret = vfio_v1_load_buffer(f, vbasedev, data_size); + } if (ret < 0) { return ret; } @@ -679,6 +976,17 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id) return ret; } +static const SaveVMHandlers savevm_vfio_handlers = { + .save_setup = vfio_save_setup, + .save_cleanup = vfio_save_cleanup, + .state_pending_exact = vfio_state_pending_exact, + .save_live_complete_precopy = vfio_save_complete_precopy, + .save_state = vfio_save_state, + .load_setup = vfio_load_setup, + .load_cleanup = vfio_load_cleanup, + .load_state = vfio_load_state, +}; + static SaveVMHandlers savevm_vfio_v1_handlers = { .save_setup = vfio_v1_save_setup, .save_cleanup = vfio_v1_save_cleanup, @@ -694,6 +1002,38 @@ static SaveVMHandlers savevm_vfio_v1_handlers = { /* ---------------------------------------------------------------------- */ +static void vfio_vmstate_change(void *opaque, bool running, RunState state) +{ + VFIODevice *vbasedev = opaque; + enum vfio_device_mig_state new_state; + int ret; + + if (running) { + new_state = VFIO_DEVICE_STATE_RUNNING; + } else { + new_state = VFIO_DEVICE_STATE_STOP; + } + + /* + * If setting the device in new_state fails, the device should be reset. + * To do so, use ERROR state as a recover state. + */ + ret = vfio_migration_set_state(vbasedev, new_state, + VFIO_DEVICE_STATE_ERROR); + if (ret) { + /* + * Migration should be aborted in this case, but vm_state_notify() + * currently does not support reporting failures. + */ + if (migrate_get_current()->to_dst_file) { + qemu_file_set_error(migrate_get_current()->to_dst_file, ret); + } + } + + trace_vfio_vmstate_change(vbasedev->name, running, RunState_str(state), + mig_state_to_str(new_state)); +} + static void vfio_v1_vmstate_change(void *opaque, bool running, RunState state) { VFIODevice *vbasedev = opaque; @@ -767,12 +1107,21 @@ static void vfio_migration_state_notifier(Notifier *notifier, void *data) case MIGRATION_STATUS_CANCELLED: case MIGRATION_STATUS_FAILED: bytes_transferred = 0; - ret = vfio_migration_v1_set_state(vbasedev, - ~(VFIO_DEVICE_STATE_V1_SAVING | - VFIO_DEVICE_STATE_V1_RESUMING), - VFIO_DEVICE_STATE_V1_RUNNING); - if (ret) { - error_report("%s: Failed to set state RUNNING", vbasedev->name); + if (migration->v2) { + /* + * If setting the device in RUNNING state fails, the device should + * be reset. To do so, use ERROR state as a recover state. + */ + vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RUNNING, + VFIO_DEVICE_STATE_ERROR); + } else { + ret = vfio_migration_v1_set_state(vbasedev, + ~(VFIO_DEVICE_STATE_V1_SAVING | + VFIO_DEVICE_STATE_V1_RESUMING), + VFIO_DEVICE_STATE_V1_RUNNING); + if (ret) { + error_report("%s: Failed to set state RUNNING", vbasedev->name); + } } } } @@ -781,12 +1130,42 @@ static void vfio_migration_exit(VFIODevice *vbasedev) { VFIOMigration *migration = vbasedev->migration; - vfio_region_exit(&migration->region); - vfio_region_finalize(&migration->region); + if (!migration->v2) { + vfio_region_exit(&migration->region); + vfio_region_finalize(&migration->region); + } g_free(vbasedev->migration); vbasedev->migration = NULL; } +static int vfio_migration_query_flags(VFIODevice *vbasedev, uint64_t *mig_flags) +{ + uint64_t buf[DIV_ROUND_UP(sizeof(struct vfio_device_feature) + + sizeof(struct vfio_device_feature_migration), + sizeof(uint64_t))] = {}; + struct vfio_device_feature *feature = (struct vfio_device_feature *)buf; + struct vfio_device_feature_migration *mig = + (struct vfio_device_feature_migration *)feature->data; + + feature->argsz = sizeof(buf); + feature->flags = VFIO_DEVICE_FEATURE_GET | VFIO_DEVICE_FEATURE_MIGRATION; + if (ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature)) { + if (errno == ENOTTY) { + error_report("%s: VFIO migration is not supported in kernel", + vbasedev->name); + } else { + error_report("%s: Failed to query VFIO migration support, err: %s", + vbasedev->name, strerror(errno)); + } + + return -errno; + } + + *mig_flags = mig->flags; + + return 0; +} + static int vfio_migration_init(VFIODevice *vbasedev) { int ret; @@ -795,6 +1174,7 @@ static int vfio_migration_init(VFIODevice *vbasedev) char id[256] = ""; g_autofree char *path = NULL, *oid = NULL; struct vfio_region_info *info; + uint64_t mig_flags = 0; if (!vbasedev->ops->vfio_get_object) { return -EINVAL; @@ -805,34 +1185,50 @@ static int vfio_migration_init(VFIODevice *vbasedev) return -EINVAL; } - ret = vfio_get_dev_region_info(vbasedev, - VFIO_REGION_TYPE_MIGRATION_DEPRECATED, - VFIO_REGION_SUBTYPE_MIGRATION_DEPRECATED, - &info); - if (ret) { - return ret; - } + ret = vfio_migration_query_flags(vbasedev, &mig_flags); + if (!ret) { + /* Migration v2 */ + /* Basic migration functionality must be supported */ + if (!(mig_flags & VFIO_MIGRATION_STOP_COPY)) { + return -EOPNOTSUPP; + } + vbasedev->migration = g_new0(VFIOMigration, 1); + vbasedev->migration->device_state = VFIO_DEVICE_STATE_RUNNING; + vbasedev->migration->data_fd = -1; + vbasedev->migration->v2 = true; + } else if (ret == -ENOTTY) { + /* Migration v1 */ + ret = vfio_get_dev_region_info(vbasedev, + VFIO_REGION_TYPE_MIGRATION_DEPRECATED, + VFIO_REGION_SUBTYPE_MIGRATION_DEPRECATED, + &info); + if (ret) { + return ret; + } - vbasedev->migration = g_new0(VFIOMigration, 1); - vbasedev->migration->device_state_v1 = VFIO_DEVICE_STATE_V1_RUNNING; - vbasedev->migration->vm_running = runstate_is_running(); + vbasedev->migration = g_new0(VFIOMigration, 1); + vbasedev->migration->device_state_v1 = VFIO_DEVICE_STATE_V1_RUNNING; + vbasedev->migration->vm_running = runstate_is_running(); - ret = vfio_region_setup(obj, vbasedev, &vbasedev->migration->region, - info->index, "migration"); - if (ret) { - error_report("%s: Failed to setup VFIO migration region %d: %s", - vbasedev->name, info->index, strerror(-ret)); - goto err; - } + ret = vfio_region_setup(obj, vbasedev, &vbasedev->migration->region, + info->index, "migration"); + if (ret) { + error_report("%s: Failed to setup VFIO migration region %d: %s", + vbasedev->name, info->index, strerror(-ret)); + goto err; + } - if (!vbasedev->migration->region.size) { - error_report("%s: Invalid zero-sized VFIO migration region %d", - vbasedev->name, info->index); - ret = -EINVAL; - goto err; - } + if (!vbasedev->migration->region.size) { + error_report("%s: Invalid zero-sized VFIO migration region %d", + vbasedev->name, info->index); + ret = -EINVAL; + goto err; + } - g_free(info); + g_free(info); + } else { + return ret; + } migration = vbasedev->migration; migration->vbasedev = vbasedev; @@ -845,11 +1241,20 @@ static int vfio_migration_init(VFIODevice *vbasedev) } strpadcpy(id, sizeof(id), path, '\0'); - register_savevm_live(id, VMSTATE_INSTANCE_ID_ANY, 1, - &savevm_vfio_v1_handlers, vbasedev); + if (migration->v2) { + register_savevm_live(id, VMSTATE_INSTANCE_ID_ANY, 1, + &savevm_vfio_handlers, vbasedev); + + migration->vm_state = qdev_add_vm_change_state_handler( + vbasedev->dev, vfio_vmstate_change, vbasedev); + } else { + register_savevm_live(id, VMSTATE_INSTANCE_ID_ANY, 1, + &savevm_vfio_v1_handlers, vbasedev); + + migration->vm_state = qdev_add_vm_change_state_handler( + vbasedev->dev, vfio_v1_vmstate_change, vbasedev); + } - migration->vm_state = qdev_add_vm_change_state_handler( - vbasedev->dev, vfio_v1_vmstate_change, vbasedev); migration->migration_state.notify = vfio_migration_state_notifier; add_migration_state_change_notifier(&migration->migration_state); return 0; diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events index c8739449df..b24e448534 100644 --- a/hw/vfio/trace-events +++ b/hw/vfio/trace-events @@ -149,20 +149,27 @@ vfio_display_edid_write_error(void) "" # migration.c vfio_migration_probe(const char *name) " (%s)" +vfio_migration_set_state(const char *name, const char *state) " (%s) state %s" vfio_migration_v1_set_state(const char *name, uint32_t state) " (%s) state %d" +vfio_vmstate_change(const char *name, int running, const char *reason, const char *dev_state) " (%s) running %d reason %s device state %s" vfio_v1_vmstate_change(const char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d" vfio_migration_state_notifier(const char *name, const char *state) " (%s) state %s" +vfio_save_setup(const char *name, uint64_t data_buffer_size) " (%s) data buffer size 0x%"PRIx64 vfio_v1_save_setup(const char *name) " (%s)" vfio_save_cleanup(const char *name) " (%s)" vfio_save_buffer(const char *name, uint64_t data_offset, uint64_t data_size, uint64_t pending) " (%s) Offset 0x%"PRIx64" size 0x%"PRIx64" pending 0x%"PRIx64 vfio_update_pending(const char *name, uint64_t pending) " (%s) pending 0x%"PRIx64 vfio_save_device_config_state(const char *name) " (%s)" +vfio_state_pending_exact(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t stopcopy_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" stopcopy size 0x%"PRIx64 vfio_v1_state_pending(const char *name, uint64_t precopy, uint64_t postcopy) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64 vfio_save_iterate(const char *name, int data_size) " (%s) data_size %d" +vfio_save_complete_precopy(const char *name, int ret) " (%s) ret %d" vfio_v1_save_complete_precopy(const char *name) " (%s)" vfio_load_device_config_state(const char *name) " (%s)" vfio_load_state(const char *name, uint64_t data) " (%s) data 0x%"PRIx64 vfio_v1_load_state_device_data(const char *name, uint64_t data_offset, uint64_t data_size) " (%s) Offset 0x%"PRIx64" size 0x%"PRIx64 +vfio_load_state_device_data(const char *name, uint64_t data_size, int ret) " (%s) size 0x%"PRIx64" ret %d" vfio_load_cleanup(const char *name) " (%s)" vfio_get_dirty_bitmap(int fd, uint64_t iova, uint64_t size, uint64_t bitmap_size, uint64_t start) "container fd=%d, iova=0x%"PRIx64" size= 0x%"PRIx64" bitmap_size=0x%"PRIx64" start=0x%"PRIx64 vfio_iommu_map_dirty_notify(uint64_t iova_start, uint64_t iova_end) "iommu dirty @ 0x%"PRIx64" - 0x%"PRIx64 +vfio_save_block(const char *name, int data_size) " (%s) data_size %d"
Implement the basic mandatory part of VFIO migration protocol v2. This includes all functionality that is necessary to support VFIO_MIGRATION_STOP_COPY part of the v2 protocol. The two protocols, v1 and v2, will co-exist and in the following patches v1 protocol code will be removed. There are several main differences between v1 and v2 protocols: - VFIO device state is now represented as a finite state machine instead of a bitmap. - Migration interface with kernel is now done using VFIO_DEVICE_FEATURE ioctl and normal read() and write() instead of the migration region. - Pre-copy is made optional in v2 protocol. Support for pre-copy will be added later on. Detailed information about VFIO migration protocol v2 and its difference compared to v1 protocol can be found here [1]. [1] https://lore.kernel.org/all/20220224142024.147653-10-yishaih@nvidia.com/ Signed-off-by: Avihai Horon <avihaih@nvidia.com> --- include/hw/vfio/vfio-common.h | 5 + hw/vfio/common.c | 17 +- hw/vfio/migration.c | 479 +++++++++++++++++++++++++++++++--- hw/vfio/trace-events | 7 + 4 files changed, 469 insertions(+), 39 deletions(-)