
[v1] docs/devel: Add VFIO device migration documentation

Message ID 1603950791-27236-1-git-send-email-kwankhede@nvidia.com (mailing list archive)
State New, archived
Series [v1] docs/devel: Add VFIO device migration documentation

Commit Message

Kirti Wankhede Oct. 29, 2020, 5:53 a.m. UTC
Document interfaces used for VFIO device migration. Added flow of state
changes during live migration with VFIO device.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
---
 MAINTAINERS                   |   1 +
 docs/devel/vfio-migration.rst | 119 ++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 120 insertions(+)
 create mode 100644 docs/devel/vfio-migration.rst

Comments

Cornelia Huck Oct. 29, 2020, 11:52 a.m. UTC | #1
On Thu, 29 Oct 2020 11:23:11 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> Document interfaces used for VFIO device migration. Added flow of state
> changes during live migration with VFIO device.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> ---
>  MAINTAINERS                   |   1 +
>  docs/devel/vfio-migration.rst | 119 ++++++++++++++++++++++++++++++++++++++++++

You probably want to include this into the Developer's Guide via
index.rst.

>  2 files changed, 120 insertions(+)
>  create mode 100644 docs/devel/vfio-migration.rst
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 6a197bd358d6..6f3fcffc6b3d 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -1728,6 +1728,7 @@ M: Alex Williamson <alex.williamson@redhat.com>
>  S: Supported
>  F: hw/vfio/*
>  F: include/hw/vfio/
> +F: docs/devel/vfio-migration.rst
>  
>  vfio-ccw
>  M: Cornelia Huck <cohuck@redhat.com>
> diff --git a/docs/devel/vfio-migration.rst b/docs/devel/vfio-migration.rst
> new file mode 100644
> index 000000000000..dab9127825e4
> --- /dev/null
> +++ b/docs/devel/vfio-migration.rst
> @@ -0,0 +1,119 @@
> +=====================
> +VFIO device Migration
> +=====================
> +
> +VFIO devices use iterative approach for migration because certain VFIO devices

s/use/use an/ ?

> +(e.g. GPU) have large amount of data to be transfered. The iterative pre-copy
> +phase of migration allows for the guest to continue whilst the VFIO device state
> +is transferred to destination, this helps to reduce the total downtime of the

s/to destination,/to the destination;/

> +VM. VFIO devices can choose to skip the pre-copy phase of migration by returning
> +pending_bytes as zero during pre-copy phase.

s/during/during the/

> +
> +Detailed description of UAPI for VFIO device for migration is in the comment
> +above ``vfio_device_migration_info`` structure definition in header file
> +linux-headers/linux/vfio.h.

I think I'd copy that to this file. If I'm looking at the
documentation, I'd rather not go hunting for source code to find out
what structure you are talking about. Plus, as it's UAPI, I don't
expect it to change much, so it should be easy to keep the definitions
in sync (famous last words).
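For reference while reading this thread, below is a minimal sketch of the
structure being discussed, roughly as it appears in the v1 migration uAPI in
linux/vfio.h at the time of this series; the kernel header remains the
authoritative definition, and the field comments here are paraphrased.

struct vfio_device_migration_info {
    __u32 device_state;         /* VFIO device state, written by user space */
#define VFIO_DEVICE_STATE_STOP      (0)
#define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
#define VFIO_DEVICE_STATE_SAVING    (1 << 1)
#define VFIO_DEVICE_STATE_RESUMING  (1 << 2)
    __u32 reserved;
    __u64 pending_bytes;        /* data the vendor driver still has to save */
    __u64 data_offset;          /* offset of the data section in the region */
    __u64 data_size;            /* size of the current data chunk           */
};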

> +
> +VFIO device hooks for iterative approach:
> +-  A ``save_setup`` function that setup migration region, sets _SAVING flag in

s/setup/sets up the/
s/in/in the/

> +VFIO device state and inform VFIO IOMMU module to start dirty page tracking.

s/inform/informs the/

> +
> +- A ``load_setup`` function that setup migration region on the destination and

s/setup/sets up the/

> +sets _RESUMING flag in VFIO device state.

s/in/in the/

> +
> +- A ``save_live_pending`` function that reads pending_bytes from vendor driver
> +that indicate how much more data the vendor driver yet to save for the VFIO
> +device.

"A ``save_live_pending`` function that reads pending_bytes from the
vendor driver, which indicates the amount of data that the vendor
driver has yet to save for the VFIO device." ?

> +
> +- A ``save_live_iterate`` function that reads VFIO device's data from vendor

s/reads/reads the/
s/from/from the/

> +driver through migration region during iterative phase.

s/through/through the/

> +
> +- A ``save_live_complete_precopy`` function that resets _RUNNING flag from VFIO

s/from/from the/

> +device state, saves device config space, if any, and iteratively copies

s/saves/saves the/

> +remaining data for VFIO device till pending_bytes returned by vendor driver
> +is zero.

"...and interactively copies the remaining data for the VFIO device
until the vendor driver indicates that no data remains (pending_bytes
is zero)." ?

> +
> +- A ``load_state`` function loads config section and data sections generated by
> +above save functions.

"A ``load_state`` function that loads the config section and the data
sections that are generated by the save functions above." ?

> +
> +- ``cleanup`` functions for both save and load that unmap migration region.

..."that perform any migration-related cleanup, including unmapping the
migration region." ?
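As a rough illustration of how the hooks listed above plug into QEMU's
migration core, here is a sketch of the handler registration. The
vfio_save_*/vfio_load_* names stand in for the actual implementations in
hw/vfio/migration.c, and the SaveVMHandlers layout is assumed from the QEMU
version under review.

/* Sketch only; assumes "migration/register.h" and illustrative hook names. */
static SaveVMHandlers savevm_vfio_handlers = {
    .save_setup = vfio_save_setup,          /* map region, set _SAVING, start dirty tracking */
    .save_cleanup = vfio_save_cleanup,      /* unmap the migration region */
    .save_live_pending = vfio_save_pending, /* read pending_bytes from the vendor driver */
    .save_live_iterate = vfio_save_iterate, /* copy device data in the iterative phase */
    .save_live_complete_precopy = vfio_save_complete_precopy, /* clear _RUNNING, drain data */
    .load_setup = vfio_load_setup,          /* map region, set _RESUMING */
    .load_cleanup = vfio_load_cleanup,
    .load_state = vfio_load_state,          /* restore config and data sections */
};

/* Called once per VFIO device from the migration init path, e.g.: */
static void vfio_migration_register(VFIODevice *vbasedev)
{
    register_savevm_live("vfio", VMSTATE_INSTANCE_ID_ANY, 1,
                         &savevm_vfio_handlers, vbasedev);
}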

> +
> +VM state change handler is registered to change VFIO device state based on VM
> +state change.

"A VM state change handler is registered to change the VFIO device
state when the VM state changes." ?

> +
> +Similarly, a migration state change notifier is added to get a notification on

s/added/registered/ ?

> +migration state change. These states are translated to VFIO device state and
> +conveyed to vendor driver.
> +
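To make the two registrations discussed above concrete, a minimal sketch
follows; the handler names and bodies are illustrative, and the real code
lives in hw/vfio/migration.c.

/* Sketch only; assumes "sysemu/runstate.h" and "migration/misc.h". */
static void vfio_vmstate_change(void *opaque, int running, RunState state)
{
    /* Translate the VM state into device_state bits: set _RUNNING when the
     * VM runs, clear it when the VM stops. */
}

static void vfio_migration_state_notifier(Notifier *notifier, void *data)
{
    /* e.g. on a cancelled or failed migration, restore _RUNNING and drop
     * _SAVING/_RESUMING in the device state. */
}

static Notifier migration_state = {
    .notify = vfio_migration_state_notifier,
};

static void vfio_register_state_handlers(void *vbasedev)
{
    qemu_add_vm_change_state_handler(vfio_vmstate_change, vbasedev);
    add_migration_state_change_notifier(&migration_state);
}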
> +System memory dirty pages tracking
> +----------------------------------
> +
> +A ``log_sync`` memory listener callback is added to mark system memory pages

s/is added to mark/marks those/ 

> +as dirty which are used for DMA by VFIO device. Dirty pages bitmap is queried

s/by/by the/
s/Dirty/The dirty/

> +per container. All pages pinned by vendor driver through vfio_pin_pages()

s/by/by the/

> +external API have to be marked as dirty during migration. When there are CPU
> +writes, CPU dirty page tracking can identify dirtied pages, but any page pinned
> +by vendor driver can also be written by device. There is currently no device

s/by/by the/ (x2)

> +which has hardware support for dirty page tracking. So all pages which are
> +pinned by vendor driver are considered as dirty.
> +Dirty pages are tracked when device is in stop-and-copy phase because if pages
> +are marked dirty during pre-copy phase and content is transfered from source to
> +destination, there is no way to know newly dirtied pages from the point they
> +were copied earlier until device stops. To avoid repeated copy of same content,
> +pinned pages are marked dirty only during stop-and-copy phase.

Let me take a quick stab at rewriting this paragraph (not sure if I
understood it correctly):

"Dirty pages are tracked when the device is in the stop-and-copy phase.
During the pre-copy phase, it is not possible to distinguish a dirty
page that has been transferred from the source to the destination from
newly dirtied pages, which would lead to repeated copying of the same
content. Therefore, pinned pages are only marked dirty during the
stop-and-copy phase." ?
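A minimal sketch of what such a log_sync callback looks like; the
vfio_get_dirty_bitmap() and vfio_devices_all_stopped_and_saving() helpers are
placeholders for the container-level VFIO_IOMMU_DIRTY_PAGES query and state
check implemented in hw/vfio/common.c.

/* Sketch only; helper names are placeholders. */
static void vfio_listener_log_sync(MemoryListener *listener,
                                   MemoryRegionSection *section)
{
    VFIOContainer *container = container_of(listener, VFIOContainer, listener);

    /* Only query once devices are stopped and saving, per the rationale
     * above; pinned pages are reported dirty by the VFIO IOMMU module. */
    if (vfio_devices_all_stopped_and_saving(container)) {
        vfio_get_dirty_bitmap(container, section);
    }
}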

> +
> +System memory dirty pages tracking when vIOMMU is enabled
> +---------------------------------------------------------
> +With vIOMMU, IO virtual address range can get unmapped while in pre-copy phase

s/IO/an I/O/

> +of migration. In that case, unmap ioctl returns pages pinned in that range and

s/unmap ioctl returns pages pinned/the unmap ioctl returns any pinned pages/ ?

> +QEMU reports corresponding guest physical pages dirty.
> +During stop-and-copy phase, an IOMMU notifier is used to get a callback for
> +mapped pages and then dirty pages bitmap is fetched from VFIO IOMMU modules for
> +those mapped ranges.
> +
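For illustration, a sketch of the unmap-with-dirty-bitmap step mentioned
above, assuming the v1 uAPI layout of struct vfio_iommu_type1_dma_unmap and
struct vfio_bitmap in linux/vfio.h.

/* Sketch only: unmap an IOVA range and fetch its dirty bitmap in one ioctl.
 * The caller supplies 'bitmap' (one bit per page of 'pgsize') and marks the
 * corresponding guest physical pages dirty afterwards. */
static int vfio_dma_unmap_get_dirty(int container_fd, uint64_t iova,
                                    uint64_t size, uint64_t *bitmap,
                                    uint64_t bitmap_bytes, uint64_t pgsize)
{
    size_t sz = sizeof(struct vfio_iommu_type1_dma_unmap) +
                sizeof(struct vfio_bitmap);
    struct vfio_iommu_type1_dma_unmap *unmap = g_malloc0(sz);
    struct vfio_bitmap *vbmap = (struct vfio_bitmap *)&unmap->data;
    int ret;

    unmap->argsz = sz;
    unmap->flags = VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP;
    unmap->iova = iova;
    unmap->size = size;
    vbmap->pgsize = pgsize;
    vbmap->size = bitmap_bytes;
    vbmap->data = (__u64 *)bitmap;

    ret = ioctl(container_fd, VFIO_IOMMU_UNMAP_DMA, unmap);
    g_free(unmap);
    return ret;
}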
> +Flow of state changes during Live migration
> +===========================================
> +Below is the flow of state change during live migration where states in brackets
> +represent VM state, migration state and VFIO device state as:
> +                (VM state, MIGRATION_STATUS, VFIO_DEVICE_STATE)

"The values in the brackets represent the VM state, the migration
state, and the VFIO device state, respectively." ?

> +
> +Live migration save path
> +------------------------
> +                        QEMU normal running state
> +                        (RUNNING, _NONE, _RUNNING)
> +                                    |
> +                       migrate_init spawns migration_thread
> +                Migration thread then calls each device's .save_setup()
> +                        (RUNNING, _SETUP, _RUNNING|_SAVING)
> +                                    |
> +                        (RUNNING, _ACTIVE, _RUNNING|_SAVING)
> +            If device is active, get pending_bytes by .save_live_pending()
> +         if total pending_bytes >= threshold_size, call .save_live_iterate()
> +                  Data of VFIO device for pre-copy phase is copied
> +     Iterate till total pending bytes converge and are less than threshold
> +                                    |
> +   On migration completion, vCPUs stops and calls .save_live_complete_precopy

s/vCPUs/vCPU/ ?

> +   for each active device. VFIO device is then transitioned in _SAVING state

s/VFIO device/The VFIO device/

s/in/into/

> +                    (FINISH_MIGRATE, _DEVICE, _SAVING)
> +                                    |
> +For VFIO device, iterate in .save_live_complete_precopy until pending data is 0

s/For/For the/

> +                    (FINISH_MIGRATE, _DEVICE, _STOPPED)
> +                                    |
> +                    (FINISH_MIGRATE, _COMPLETED, _STOPPED)
> +                Migraton thread schedule cleanup bottom half and exit

s/schedule/schedules/
s/exit/exits/

> +
> +Live migration resume path
> +--------------------------
> +
> +             Incoming migration calls .load_setup for each device
> +                        (RESTORE_VM, _ACTIVE, _STOPPED)
> +                                    |
> +    For each device, .load_state is called for that device section data
> +                        (RESTORE_VM, _ACTIVE, _RESUMING)
> +                                    |
> +    At the end, called .load_cleanup for each device and vCPUs are started                        |

Stray '|', probably should go on the next line?

s/called .load_cleanup/.load_cleanup is called/


> +                        (RUNNING, _NONE, _RUNNING)
> +
> +
> +Postcopy
> +========
> +Postcopy migration is not supported for VFIO devices.

s/not/not yet/, I think? (Disregarding any time frame here.)

Generally, I think this is useful information to have in the docs.
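Read together with the flow above, the device_state transitions on the save
side boil down to a few writes to the migration region header. A sketch under
the v1 uAPI bit definitions (illustrative only; in practice QEMU accesses the
field through region reads/writes rather than a direct pointer, and the real
sequencing is in hw/vfio/migration.c):

/* Sketch only: device_state writes along the save path. 'info' is the
 * vfio_device_migration_info header of the device's migration region. */
static void vfio_save_path_states(struct vfio_device_migration_info *info)
{
    /* save_setup: enter pre-copy, the device keeps running */
    info->device_state = VFIO_DEVICE_STATE_RUNNING | VFIO_DEVICE_STATE_SAVING;

    /* ... iterative phase: poll pending_bytes, read data chunks ... */

    /* save_live_complete_precopy: VM stopped, device stops but keeps saving */
    info->device_state = VFIO_DEVICE_STATE_SAVING;

    /* once pending_bytes reaches zero and the last chunk has been read */
    info->device_state = VFIO_DEVICE_STATE_STOP;
}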
Kirti Wankhede Oct. 29, 2020, 5:41 p.m. UTC | #2
Thanks for the corrections, Cornelia. I have applied the corrections you 
suggested where I have not replied; see my comments on a couple of places 
where I disagree.


On 10/29/2020 5:22 PM, Cornelia Huck wrote:
> On Thu, 29 Oct 2020 11:23:11 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> Document interfaces used for VFIO device migration. Added flow of state
>> changes during live migration with VFIO device.
>>
>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>> ---
>>   MAINTAINERS                   |   1 +
>>   docs/devel/vfio-migration.rst | 119 ++++++++++++++++++++++++++++++++++++++++++
> 
> You probably want to include this into the Developer's Guide via
> index.rst.
> 

Ok.

>>   2 files changed, 120 insertions(+)
>>   create mode 100644 docs/devel/vfio-migration.rst
>>
>> diff --git a/MAINTAINERS b/MAINTAINERS
>> index 6a197bd358d6..6f3fcffc6b3d 100644
>> --- a/MAINTAINERS
>> +++ b/MAINTAINERS
>> @@ -1728,6 +1728,7 @@ M: Alex Williamson <alex.williamson@redhat.com>
>>   S: Supported
>>   F: hw/vfio/*
>>   F: include/hw/vfio/
>> +F: docs/devel/vfio-migration.rst
>>   
>>   vfio-ccw
>>   M: Cornelia Huck <cohuck@redhat.com>
>> diff --git a/docs/devel/vfio-migration.rst b/docs/devel/vfio-migration.rst
>> new file mode 100644
>> index 000000000000..dab9127825e4
>> --- /dev/null
>> +++ b/docs/devel/vfio-migration.rst
>> @@ -0,0 +1,119 @@
>> +=====================
>> +VFIO device Migration
>> +=====================
>> +
>> +VFIO devices use iterative approach for migration because certain VFIO devices
> 
> s/use/use an/ ?
> 
>> +(e.g. GPU) have large amount of data to be transfered. The iterative pre-copy
>> +phase of migration allows for the guest to continue whilst the VFIO device state
>> +is transferred to destination, this helps to reduce the total downtime of the
> 
> s/to destination,/to the destination;/
> 
>> +VM. VFIO devices can choose to skip the pre-copy phase of migration by returning
>> +pending_bytes as zero during pre-copy phase.
> 
> s/during/during the/
> 
>> +
>> +Detailed description of UAPI for VFIO device for migration is in the comment
>> +above ``vfio_device_migration_info`` structure definition in header file
>> +linux-headers/linux/vfio.h.
> 
> I think I'd copy that to this file. If I'm looking at the
> documentation, I'd rather not go hunting for source code to find out
> what structure you are talking about. Plus, as it's UAPI, I don't
> expect it to change much, so it should be easy to keep the definitions
> in sync (famous last words).
> 

I feel it's duplication of documentation. I would like to know others' 
views as well.

>> +
>> +VFIO device hooks for iterative approach:
>> +-  A ``save_setup`` function that setup migration region, sets _SAVING flag in
> 
> s/setup/sets up the/
> s/in/in the/
> 
>> +VFIO device state and inform VFIO IOMMU module to start dirty page tracking.
> 
> s/inform/informs the/
> 
>> +
>> +- A ``load_setup`` function that setup migration region on the destination and
> 
> s/setup/sets up the/
> 
>> +sets _RESUMING flag in VFIO device state.
> 
> s/in/in the/
> 
>> +
>> +- A ``save_live_pending`` function that reads pending_bytes from vendor driver
>> +that indicate how much more data the vendor driver yet to save for the VFIO
>> +device.
> 
> "A ``save_live_pending`` function that reads pending_bytes from the
> vendor driver, which indicates the amount of data that the vendor
> driver has yet to save for the VFIO device." ?
> 
>> +
>> +- A ``save_live_iterate`` function that reads VFIO device's data from vendor
> 
> s/reads/reads the/
> s/from/from the/
> 
>> +driver through migration region during iterative phase.
> 
> s/through/through the/
> 
>> +
>> +- A ``save_live_complete_precopy`` function that resets _RUNNING flag from VFIO
> 
> s/from/from the/
> 
>> +device state, saves device config space, if any, and iteratively copies
> 
> s/saves/saves the/
> 
>> +remaining data for VFIO device till pending_bytes returned by vendor driver
>> +is zero.
> 
> "...and interactively copies the remaining data for the VFIO device
> until the vendor driver indicates that no data remains (pending_bytes
> is zero)." ?
> 
>> +
>> +- A ``load_state`` function loads config section and data sections generated by
>> +above save functions.
> 
> "A ``load_state`` function that loads the config section and the data
> sections that are generated by the save functions above." ?
> 
>> +
>> +- ``cleanup`` functions for both save and load that unmap migration region.
> 
> ..."that perform any migration-related cleanup, including unmapping the
> migration region." ?
> 
>> +
>> +VM state change handler is registered to change VFIO device state based on VM
>> +state change.
> 
> "A VM state change handler is registered to change the VFIO device
> state when the VM state changes." ?
> 
>> +
>> +Similarly, a migration state change notifier is added to get a notification on
> 
> s/added/registered/ ?
> 
>> +migration state change. These states are translated to VFIO device state and
>> +conveyed to vendor driver.
>> +
>> +System memory dirty pages tracking
>> +----------------------------------
>> +
>> +A ``log_sync`` memory listener callback is added to mark system memory pages
> 
> s/is added to mark/marks those/
> 
>> +as dirty which are used for DMA by VFIO device. Dirty pages bitmap is queried
> 
> s/by/by the/
> s/Dirty/The dirty/
> 
>> +per container. All pages pinned by vendor driver through vfio_pin_pages()
> 
> s/by/by the/
> 
>> +external API have to be marked as dirty during migration. When there are CPU
>> +writes, CPU dirty page tracking can identify dirtied pages, but any page pinned
>> +by vendor driver can also be written by device. There is currently no device
> 
> s/by/by the/ (x2)
> 
>> +which has hardware support for dirty page tracking. So all pages which are
>> +pinned by vendor driver are considered as dirty.
>> +Dirty pages are tracked when device is in stop-and-copy phase because if pages
>> +are marked dirty during pre-copy phase and content is transfered from source to
>> +destination, there is no way to know newly dirtied pages from the point they
>> +were copied earlier until device stops. To avoid repeated copy of same content,
>> +pinned pages are marked dirty only during stop-and-copy phase.
> 
> Let me take a quick stab at rewriting this paragraph (not sure if I
> understood it correctly):
> 
> "Dirty pages are tracked when the device is in the stop-and-copy phase.
> During the pre-copy phase, it is not possible to distinguish a dirty
> page that has been transferred from the source to the destination from
> newly dirtied pages, which would lead to repeated copying of the same
> content. Therefore, pinned pages are only marked dirty during the
> stop-and-copy phase." ?
> 

I think the above rephrasing only talks about repeated copying in the 
pre-copy phase. I used "copied earlier until device stops" to indicate both 
pre-copy and stop-and-copy, up to the point where the device stops.

Thanks,
Kirti

>> +
>> +System memory dirty pages tracking when vIOMMU is enabled
>> +---------------------------------------------------------
>> +With vIOMMU, IO virtual address range can get unmapped while in pre-copy phase
> 
> s/IO/an I/O/
> 
>> +of migration. In that case, unmap ioctl returns pages pinned in that range and
> 
> s/unmap ioctl returns pages pinned/the unmap ioctl returns any pinned pages/ ?
> 
>> +QEMU reports corresponding guest physical pages dirty.
>> +During stop-and-copy phase, an IOMMU notifier is used to get a callback for
>> +mapped pages and then dirty pages bitmap is fetched from VFIO IOMMU modules for
>> +those mapped ranges.
>> +
>> +Flow of state changes during Live migration
>> +===========================================
>> +Below is the flow of state change during live migration where states in brackets
>> +represent VM state, migration state and VFIO device state as:
>> +                (VM state, MIGRATION_STATUS, VFIO_DEVICE_STATE)
> 
> "The values in the brackets represent the VM state, the migration
> state, and the VFIO device state, respectively." ?
> 
>> +
>> +Live migration save path
>> +------------------------
>> +                        QEMU normal running state
>> +                        (RUNNING, _NONE, _RUNNING)
>> +                                    |
>> +                       migrate_init spawns migration_thread
>> +                Migration thread then calls each device's .save_setup()
>> +                        (RUNNING, _SETUP, _RUNNING|_SAVING)
>> +                                    |
>> +                        (RUNNING, _ACTIVE, _RUNNING|_SAVING)
>> +            If device is active, get pending_bytes by .save_live_pending()
>> +         if total pending_bytes >= threshold_size, call .save_live_iterate()
>> +                  Data of VFIO device for pre-copy phase is copied
>> +     Iterate till total pending bytes converge and are less than threshold
>> +                                    |
>> +   On migration completion, vCPUs stops and calls .save_live_complete_precopy
> 
> s/vCPUs/vCPU/ ?
> 
>> +   for each active device. VFIO device is then transitioned in _SAVING state
> 
> s/VFIO device/The VFIO device/
> 
> s/in/into/
> 
>> +                    (FINISH_MIGRATE, _DEVICE, _SAVING)
>> +                                    |
>> +For VFIO device, iterate in .save_live_complete_precopy until pending data is 0
> 
> s/For/For the/
> 
>> +                    (FINISH_MIGRATE, _DEVICE, _STOPPED)
>> +                                    |
>> +                    (FINISH_MIGRATE, _COMPLETED, _STOPPED)
>> +                Migraton thread schedule cleanup bottom half and exit
> 
> s/schedule/schedules/
> s/exit/exits/
> 
>> +
>> +Live migration resume path
>> +--------------------------
>> +
>> +             Incoming migration calls .load_setup for each device
>> +                        (RESTORE_VM, _ACTIVE, _STOPPED)
>> +                                    |
>> +    For each device, .load_state is called for that device section data
>> +                        (RESTORE_VM, _ACTIVE, _RESUMING)
>> +                                    |
>> +    At the end, called .load_cleanup for each device and vCPUs are started                        |
> 
> Stray '|', probably should go on the next line?
> 
> s/called .load_cleanup/.load_cleanup is called/
> 
> 
>> +                        (RUNNING, _NONE, _RUNNING)
>> +
>> +
>> +Postcopy
>> +========
>> +Postcopy migration is not supported for VFIO devices.
> 
> s/not/not yet/, I think? (Disregarding any time frame here.)
> 
> Generally, I think this is useful information to have in the docs.
>
Alex Williamson Oct. 29, 2020, 7:05 p.m. UTC | #3
On Thu, 29 Oct 2020 23:11:16 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> Thanks for corrections Cornelia. I had done the corrections you 
> suggested I had not replied, see my comments on couple of places where I 
> disagree.
> 
> 
> On 10/29/2020 5:22 PM, Cornelia Huck wrote:
> > On Thu, 29 Oct 2020 11:23:11 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> >> Document interfaces used for VFIO device migration. Added flow of state
> >> changes during live migration with VFIO device.
> >>
> >> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> >> ---
> >>   MAINTAINERS                   |   1 +
> >>   docs/devel/vfio-migration.rst | 119 ++++++++++++++++++++++++++++++++++++++++++  
> > 
> > You probably want to include this into the Developer's Guide via
> > index.rst.
> >   
> 
> Ok.
> 
> >>   2 files changed, 120 insertions(+)
> >>   create mode 100644 docs/devel/vfio-migration.rst
> >>
> >> diff --git a/MAINTAINERS b/MAINTAINERS
> >> index 6a197bd358d6..6f3fcffc6b3d 100644
> >> --- a/MAINTAINERS
> >> +++ b/MAINTAINERS
> >> @@ -1728,6 +1728,7 @@ M: Alex Williamson <alex.williamson@redhat.com>
> >>   S: Supported
> >>   F: hw/vfio/*
> >>   F: include/hw/vfio/
> >> +F: docs/devel/vfio-migration.rst
> >>   
> >>   vfio-ccw
> >>   M: Cornelia Huck <cohuck@redhat.com>
> >> diff --git a/docs/devel/vfio-migration.rst b/docs/devel/vfio-migration.rst
> >> new file mode 100644
> >> index 000000000000..dab9127825e4
> >> --- /dev/null
> >> +++ b/docs/devel/vfio-migration.rst
> >> @@ -0,0 +1,119 @@
> >> +=====================
> >> +VFIO device Migration
> >> +=====================
> >> +
> >> +VFIO devices use iterative approach for migration because certain VFIO devices  
> > 
> > s/use/use an/ ?
> >   
> >> +(e.g. GPU) have large amount of data to be transfered. The iterative pre-copy
> >> +phase of migration allows for the guest to continue whilst the VFIO device state
> >> +is transferred to destination, this helps to reduce the total downtime of the  
> > 
> > s/to destination,/to the destination;/
> >   
> >> +VM. VFIO devices can choose to skip the pre-copy phase of migration by returning
> >> +pending_bytes as zero during pre-copy phase.  
> > 
> > s/during/during the/
> >   
> >> +
> >> +Detailed description of UAPI for VFIO device for migration is in the comment
> >> +above ``vfio_device_migration_info`` structure definition in header file
> >> +linux-headers/linux/vfio.h.  
> > 
> > I think I'd copy that to this file. If I'm looking at the
> > documentation, I'd rather not go hunting for source code to find out
> > what structure you are talking about. Plus, as it's UAPI, I don't
> > expect it to change much, so it should be easy to keep the definitions
> > in sync (famous last words).
> >   
> 
> I feel its duplication of documentation. I would like to know others 
> views as well.


TBH I don't think it's necessary here either; we're documenting the
QEMU interaction with the uAPI, and the uAPI itself is documented in the
kernel header.  I don't think it would be unreasonable to ask someone
trying to understand this to look at both sources together.


> >> +
> >> +VFIO device hooks for iterative approach:
> >> +-  A ``save_setup`` function that setup migration region, sets _SAVING flag in  
> > 
> > s/setup/sets up the/
> > s/in/in the/
> >   
> >> +VFIO device state and inform VFIO IOMMU module to start dirty page tracking.  
> > 
> > s/inform/informs the/
> >   
> >> +
> >> +- A ``load_setup`` function that setup migration region on the destination and  
> > 
> > s/setup/sets up the/
> >   
> >> +sets _RESUMING flag in VFIO device state.  
> > 
> > s/in/in the/
> >   
> >> +
> >> +- A ``save_live_pending`` function that reads pending_bytes from vendor driver
> >> +that indicate how much more data the vendor driver yet to save for the VFIO
> >> +device.  
> > 
> > "A ``save_live_pending`` function that reads pending_bytes from the
> > vendor driver, which indicates the amount of data that the vendor
> > driver has yet to save for the VFIO device." ?
> >   
> >> +
> >> +- A ``save_live_iterate`` function that reads VFIO device's data from vendor  
> > 
> > s/reads/reads the/
> > s/from/from the/
> >   
> >> +driver through migration region during iterative phase.  
> > 
> > s/through/through the/
> >   
> >> +
> >> +- A ``save_live_complete_precopy`` function that resets _RUNNING flag from VFIO  
> > 
> > s/from/from the/
> >   
> >> +device state, saves device config space, if any, and iteratively copies  
> > 
> > s/saves/saves the/
> >   
> >> +remaining data for VFIO device till pending_bytes returned by vendor driver
> >> +is zero.  
> > 
> > "...and interactively copies the remaining data for the VFIO device
> > until the vendor driver indicates that no data remains (pending_bytes
> > is zero)." ?


Connie, was it intentional to replace "iteratively" with
"interactively"?  "Iteratively" seems correct to me.


> >   
> >> +
> >> +- A ``load_state`` function loads config section and data sections generated by
> >> +above save functions.  
> > 
> > "A ``load_state`` function that loads the config section and the data
> > sections that are generated by the save functions above." ?
> >   
> >> +
> >> +- ``cleanup`` functions for both save and load that unmap migration region.  
> > 
> > ..."that perform any migration-related cleanup, including unmapping the
> > migration region." ?
> >   
> >> +
> >> +VM state change handler is registered to change VFIO device state based on VM
> >> +state change.  
> > 
> > "A VM state change handler is registered to change the VFIO device
> > state when the VM state changes." ?
> >   
> >> +
> >> +Similarly, a migration state change notifier is added to get a notification on  
> > 
> > s/added/registered/ ?
> >   
> >> +migration state change. These states are translated to VFIO device state and
> >> +conveyed to vendor driver.
> >> +
> >> +System memory dirty pages tracking
> >> +----------------------------------
> >> +
> >> +A ``log_sync`` memory listener callback is added to mark system memory pages  
> > 
> > s/is added to mark/marks those/
> >   
> >> +as dirty which are used for DMA by VFIO device. Dirty pages bitmap is queried  
> > 
> > s/by/by the/
> > s/Dirty/The dirty/
> >   
> >> +per container. All pages pinned by vendor driver through vfio_pin_pages()  
> > 
> > s/by/by the/
> >   
> >> +external API have to be marked as dirty during migration. When there are CPU
> >> +writes, CPU dirty page tracking can identify dirtied pages, but any page pinned
> >> +by vendor driver can also be written by device. There is currently no device  
> > 
> > s/by/by the/ (x2)
> >   
> >> +which has hardware support for dirty page tracking. So all pages which are
> >> +pinned by vendor driver are considered as dirty.
> >> +Dirty pages are tracked when device is in stop-and-copy phase because if pages
> >> +are marked dirty during pre-copy phase and content is transfered from source to
> >> +destination, there is no way to know newly dirtied pages from the point they
> >> +were copied earlier until device stops. To avoid repeated copy of same content,
> >> +pinned pages are marked dirty only during stop-and-copy phase.  
>
>
> > Let me take a quick stab at rewriting this paragraph (not sure if I
> > understood it correctly):
> > 
> > "Dirty pages are tracked when the device is in the stop-and-copy phase.
> > During the pre-copy phase, it is not possible to distinguish a dirty
> > page that has been transferred from the source to the destination from
> > newly dirtied pages, which would lead to repeated copying of the same
> > content. Therefore, pinned pages are only marked dirty during the
> > stop-and-copy phase." ?
> >   
> 
> I think above rephrase only talks about repeated copying in pre-copy 
> phase. Used "copied earlier until device stops" to indicate both 
> pre-copy and stop-and-copy till device stops.


Now I'm confused, I thought we had abandoned the idea that we can only
report pinned pages during stop-and-copy.  Doesn't the device need to
expose its dirty memory footprint during the iterative phase regardless
of whether that causes repeat copies?  If QEMU iterates and sees that
all memory is still dirty, it may have transferred more data, but it
can actually predict if it can achieve its downtime tolerances.  Which
is more important, less data transfer or predictability?  Thanks,

Alex


> >> +
> >> +System memory dirty pages tracking when vIOMMU is enabled
> >> +---------------------------------------------------------
> >> +With vIOMMU, IO virtual address range can get unmapped while in pre-copy phase  
> > 
> > s/IO/an I/O/
> >   
> >> +of migration. In that case, unmap ioctl returns pages pinned in that range and  
> > 
> > s/unmap ioctl returns pages pinned/the unmap ioctl returns any pinned pages/ ?
> >   
> >> +QEMU reports corresponding guest physical pages dirty.
> >> +During stop-and-copy phase, an IOMMU notifier is used to get a callback for
> >> +mapped pages and then dirty pages bitmap is fetched from VFIO IOMMU modules for
> >> +those mapped ranges.
> >> +
> >> +Flow of state changes during Live migration
> >> +===========================================
> >> +Below is the flow of state change during live migration where states in brackets
> >> +represent VM state, migration state and VFIO device state as:
> >> +                (VM state, MIGRATION_STATUS, VFIO_DEVICE_STATE)  
> > 
> > "The values in the brackets represent the VM state, the migration
> > state, and the VFIO device state, respectively." ?
> >   
> >> +
> >> +Live migration save path
> >> +------------------------
> >> +                        QEMU normal running state
> >> +                        (RUNNING, _NONE, _RUNNING)
> >> +                                    |
> >> +                       migrate_init spawns migration_thread
> >> +                Migration thread then calls each device's .save_setup()
> >> +                        (RUNNING, _SETUP, _RUNNING|_SAVING)
> >> +                                    |
> >> +                        (RUNNING, _ACTIVE, _RUNNING|_SAVING)
> >> +            If device is active, get pending_bytes by .save_live_pending()
> >> +         if total pending_bytes >= threshold_size, call .save_live_iterate()
> >> +                  Data of VFIO device for pre-copy phase is copied
> >> +     Iterate till total pending bytes converge and are less than threshold
> >> +                                    |
> >> +   On migration completion, vCPUs stops and calls .save_live_complete_precopy  
> > 
> > s/vCPUs/vCPU/ ?
> >   
> >> +   for each active device. VFIO device is then transitioned in _SAVING state  
> > 
> > s/VFIO device/The VFIO device/
> > 
> > s/in/into/
> >   
> >> +                    (FINISH_MIGRATE, _DEVICE, _SAVING)
> >> +                                    |
> >> +For VFIO device, iterate in .save_live_complete_precopy until pending data is 0  
> > 
> > s/For/For the/
> >   
> >> +                    (FINISH_MIGRATE, _DEVICE, _STOPPED)
> >> +                                    |
> >> +                    (FINISH_MIGRATE, _COMPLETED, _STOPPED)
> >> +                Migraton thread schedule cleanup bottom half and exit  
> > 
> > s/schedule/schedules/
> > s/exit/exits/
> >   
> >> +
> >> +Live migration resume path
> >> +--------------------------
> >> +
> >> +             Incoming migration calls .load_setup for each device
> >> +                        (RESTORE_VM, _ACTIVE, _STOPPED)
> >> +                                    |
> >> +    For each device, .load_state is called for that device section data
> >> +                        (RESTORE_VM, _ACTIVE, _RESUMING)
> >> +                                    |
> >> +    At the end, called .load_cleanup for each device and vCPUs are started                        |  
> > 
> > Stray '|', probably should go on the next line?
> > 
> > s/called .load_cleanup/.load_cleanup is called/
> > 
> >   
> >> +                        (RUNNING, _NONE, _RUNNING)
> >> +
> >> +
> >> +Postcopy
> >> +========
> >> +Postcopy migration is not supported for VFIO devices.  
> > 
> > s/not/not yet/, I think? (Disregarding any time frame here.)
> > 
> > Generally, I think this is useful information to have in the docs.
> >   
>
Cornelia Huck Oct. 29, 2020, 8:25 p.m. UTC | #4
On Thu, 29 Oct 2020 13:05:19 -0600
Alex Williamson <alex.williamson@redhat.com> wrote:

> On Thu, 29 Oct 2020 23:11:16 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
> > Thanks for corrections Cornelia. I had done the corrections you 
> > suggested I had not replied, see my comments on couple of places where I 
> > disagree.
> > 
> > 
> > On 10/29/2020 5:22 PM, Cornelia Huck wrote:  
> > > On Thu, 29 Oct 2020 11:23:11 +0530
> > > Kirti Wankhede <kwankhede@nvidia.com> wrote:

> > >> +Detailed description of UAPI for VFIO device for migration is in the comment
> > >> +above ``vfio_device_migration_info`` structure definition in header file
> > >> +linux-headers/linux/vfio.h.    
> > > 
> > > I think I'd copy that to this file. If I'm looking at the
> > > documentation, I'd rather not go hunting for source code to find out
> > > what structure you are talking about. Plus, as it's UAPI, I don't
> > > expect it to change much, so it should be easy to keep the definitions
> > > in sync (famous last words).
> > >     
> > 
> > I feel its duplication of documentation. I would like to know others 
> > views as well.  
> 
> 
> TBH I don't think it's necessary here either, we're documenting the
> QEMU interaction with the uAPI, the uAPI itself is documented in the
> kernel header.  I don't think it would be unreasonable to ask someone
> trying to understand this to look at both sources together.

Ok, I can live with that. But let me correct some nits :)

"A detailed description of the UAPI for VFIO device migration can be
found in the comment for the ``vfio_device_migration_info`` structure
in the header file linux-headers/linux/vfio.h."

(...)

> > >> +remaining data for VFIO device till pending_bytes returned by vendor driver
> > >> +is zero.    
> > > 
> > > "...and interactively copies the remaining data for the VFIO device
> > > until the vendor driver indicates that no data remains (pending_bytes
> > > is zero)." ?  
> 
> 
> Connie, was that intentional to replace "iteratively" with
> "interactively"?  Iteratively seems correct to me.

Eh, should be "iteratively", of course. Too many meetings :/
Kirti Wankhede Nov. 3, 2020, 7:48 p.m. UTC | #5
On 10/30/2020 12:35 AM, Alex Williamson wrote:
> On Thu, 29 Oct 2020 23:11:16 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 

<snip>

>>>> +System memory dirty pages tracking
>>>> +----------------------------------
>>>> +
>>>> +A ``log_sync`` memory listener callback is added to mark system memory pages
>>>
>>> s/is added to mark/marks those/
>>>    
>>>> +as dirty which are used for DMA by VFIO device. Dirty pages bitmap is queried
>>>
>>> s/by/by the/
>>> s/Dirty/The dirty/
>>>    
>>>> +per container. All pages pinned by vendor driver through vfio_pin_pages()
>>>
>>> s/by/by the/
>>>    
>>>> +external API have to be marked as dirty during migration. When there are CPU
>>>> +writes, CPU dirty page tracking can identify dirtied pages, but any page pinned
>>>> +by vendor driver can also be written by device. There is currently no device
>>>
>>> s/by/by the/ (x2)
>>>    
>>>> +which has hardware support for dirty page tracking. So all pages which are
>>>> +pinned by vendor driver are considered as dirty.
>>>> +Dirty pages are tracked when device is in stop-and-copy phase because if pages
>>>> +are marked dirty during pre-copy phase and content is transfered from source to
>>>> +destination, there is no way to know newly dirtied pages from the point they
>>>> +were copied earlier until device stops. To avoid repeated copy of same content,
>>>> +pinned pages are marked dirty only during stop-and-copy phase.
>>
>>
>>> Let me take a quick stab at rewriting this paragraph (not sure if I
>>> understood it correctly):
>>>
>>> "Dirty pages are tracked when the device is in the stop-and-copy phase.
>>> During the pre-copy phase, it is not possible to distinguish a dirty
>>> page that has been transferred from the source to the destination from
>>> newly dirtied pages, which would lead to repeated copying of the same
>>> content. Therefore, pinned pages are only marked dirty during the
>>> stop-and-copy phase." ?
>>>    
>>
>> I think above rephrase only talks about repeated copying in pre-copy
>> phase. Used "copied earlier until device stops" to indicate both
>> pre-copy and stop-and-copy till device stops.
> 
> 
> Now I'm confused, I thought we had abandoned the idea that we can only
> report pinned pages during stop-and-copy.  Doesn't the device needs to
> expose its dirty memory footprint during the iterative phase regardless
> of whether that causes repeat copies?  If QEMU iterates and sees that
> all memory is still dirty, it may have transferred more data, but it
> can actually predict if it can achieve its downtime tolerances.  Which
> is more important, less data transfer or predictability?  Thanks,
> 

Even if QEMU copies and transfers the content of all system memory pages 
during pre-copy (the worst case with an IOMMU-backed mdev device whose vendor 
driver is not smart enough to pin pages explicitly, so all system memory 
pages are marked dirty), its prediction about the downtime tolerance will 
still not be correct, because during stop-and-copy all pages need to be 
copied again, as the device can write to any of those pinned pages.

Thanks,
Kirti
Alex Williamson Nov. 3, 2020, 8:27 p.m. UTC | #6
On Wed, 4 Nov 2020 01:18:12 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 10/30/2020 12:35 AM, Alex Williamson wrote:
> > On Thu, 29 Oct 2020 23:11:16 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> 
> <snip>
> 
> >>>> +System memory dirty pages tracking
> >>>> +----------------------------------
> >>>> +
> >>>> +A ``log_sync`` memory listener callback is added to mark system memory pages  
> >>>
> >>> s/is added to mark/marks those/
> >>>      
> >>>> +as dirty which are used for DMA by VFIO device. Dirty pages bitmap is queried  
> >>>
> >>> s/by/by the/
> >>> s/Dirty/The dirty/
> >>>      
> >>>> +per container. All pages pinned by vendor driver through vfio_pin_pages()  
> >>>
> >>> s/by/by the/
> >>>      
> >>>> +external API have to be marked as dirty during migration. When there are CPU
> >>>> +writes, CPU dirty page tracking can identify dirtied pages, but any page pinned
> >>>> +by vendor driver can also be written by device. There is currently no device  
> >>>
> >>> s/by/by the/ (x2)
> >>>      
> >>>> +which has hardware support for dirty page tracking. So all pages which are
> >>>> +pinned by vendor driver are considered as dirty.
> >>>> +Dirty pages are tracked when device is in stop-and-copy phase because if pages
> >>>> +are marked dirty during pre-copy phase and content is transfered from source to
> >>>> +destination, there is no way to know newly dirtied pages from the point they
> >>>> +were copied earlier until device stops. To avoid repeated copy of same content,
> >>>> +pinned pages are marked dirty only during stop-and-copy phase.  
> >>
> >>  
> >>> Let me take a quick stab at rewriting this paragraph (not sure if I
> >>> understood it correctly):
> >>>
> >>> "Dirty pages are tracked when the device is in the stop-and-copy phase.
> >>> During the pre-copy phase, it is not possible to distinguish a dirty
> >>> page that has been transferred from the source to the destination from
> >>> newly dirtied pages, which would lead to repeated copying of the same
> >>> content. Therefore, pinned pages are only marked dirty during the
> >>> stop-and-copy phase." ?
> >>>      
> >>
> >> I think above rephrase only talks about repeated copying in pre-copy
> >> phase. Used "copied earlier until device stops" to indicate both
> >> pre-copy and stop-and-copy till device stops.  
> > 
> > 
> > Now I'm confused, I thought we had abandoned the idea that we can only
> > report pinned pages during stop-and-copy.  Doesn't the device needs to
> > expose its dirty memory footprint during the iterative phase regardless
> > of whether that causes repeat copies?  If QEMU iterates and sees that
> > all memory is still dirty, it may have transferred more data, but it
> > can actually predict if it can achieve its downtime tolerances.  Which
> > is more important, less data transfer or predictability?  Thanks,
> >   
> 
> Even if QEMU copies and transfers content of all sys mem pages during 
> pre-copy (worst case with IOMMU backed mdev device when its vendor 
> driver is not smart to pin pages explicitly and all sys mem pages are 
> marked dirty), then also its prediction about downtime tolerance will 
> not be correct, because during stop-and-copy again all pages need to be 
> copied as device can write to any of those pinned pages.

I think you're only reiterating my point.  If QEMU copies all of guest
memory during the iterative phase and each time it sees that all memory
is dirty, such as if CPUs or devices (including assigned devices) are
dirtying pages as fast as it copies them (or continuously marks them
dirty), then QEMU can predict that downtime will require copying all
pages.  If instead devices don't mark dirty pages until the VM is
stopped, then QEMU might iterate through memory copy and predict a short
downtime because not much memory is dirty, only to be surprised that
all of memory is suddenly dirty.  At that point it's too late, the VM
is already stopped, the predicted short downtime takes far longer than
expected.  This is exactly why we made the kernel interface mark pinned
pages persistently dirty when it was proposed that we only report
pinned pages once.  Thanks,

Alex
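To make the predictability point concrete, here is a simplified sketch of the
kind of convergence check the migration core performs before deciding to stop
the VM; the names and arithmetic are illustrative, not QEMU's actual code.

#include <stdbool.h>
#include <stdint.h>

/* Sketch: can the remaining data be transferred within the downtime budget
 * at the measured bandwidth? */
static bool migration_can_converge(uint64_t pending_bytes,
                                   uint64_t bandwidth_bytes_per_ms,
                                   uint64_t downtime_limit_ms)
{
    uint64_t threshold_size = bandwidth_bytes_per_ms * downtime_limit_ms;

    /* If pinned pages are reported dirty on every iteration, pending_bytes
     * stays large, this keeps returning false, and QEMU keeps iterating (or
     * aborts) instead of stopping the VM and overshooting the downtime.
     * If pinned pages only show up as dirty after the VM stops,
     * pending_bytes looks small here and the prediction is wrong. */
    return pending_bytes <= threshold_size;
}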
Kirti Wankhede Nov. 4, 2020, 7:55 a.m. UTC | #7
On 11/4/2020 1:57 AM, Alex Williamson wrote:
> On Wed, 4 Nov 2020 01:18:12 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> On 10/30/2020 12:35 AM, Alex Williamson wrote:
>>> On Thu, 29 Oct 2020 23:11:16 +0530
>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>    
>>
>> <snip>
>>
>>>>>> +System memory dirty pages tracking
>>>>>> +----------------------------------
>>>>>> +
>>>>>> +A ``log_sync`` memory listener callback is added to mark system memory pages
>>>>>
>>>>> s/is added to mark/marks those/
>>>>>       
>>>>>> +as dirty which are used for DMA by VFIO device. Dirty pages bitmap is queried
>>>>>
>>>>> s/by/by the/
>>>>> s/Dirty/The dirty/
>>>>>       
>>>>>> +per container. All pages pinned by vendor driver through vfio_pin_pages()
>>>>>
>>>>> s/by/by the/
>>>>>       
>>>>>> +external API have to be marked as dirty during migration. When there are CPU
>>>>>> +writes, CPU dirty page tracking can identify dirtied pages, but any page pinned
>>>>>> +by vendor driver can also be written by device. There is currently no device
>>>>>
>>>>> s/by/by the/ (x2)
>>>>>       
>>>>>> +which has hardware support for dirty page tracking. So all pages which are
>>>>>> +pinned by vendor driver are considered as dirty.
>>>>>> +Dirty pages are tracked when device is in stop-and-copy phase because if pages
>>>>>> +are marked dirty during pre-copy phase and content is transfered from source to
>>>>>> +destination, there is no way to know newly dirtied pages from the point they
>>>>>> +were copied earlier until device stops. To avoid repeated copy of same content,
>>>>>> +pinned pages are marked dirty only during stop-and-copy phase.
>>>>
>>>>   
>>>>> Let me take a quick stab at rewriting this paragraph (not sure if I
>>>>> understood it correctly):
>>>>>
>>>>> "Dirty pages are tracked when the device is in the stop-and-copy phase.
>>>>> During the pre-copy phase, it is not possible to distinguish a dirty
>>>>> page that has been transferred from the source to the destination from
>>>>> newly dirtied pages, which would lead to repeated copying of the same
>>>>> content. Therefore, pinned pages are only marked dirty during the
>>>>> stop-and-copy phase." ?
>>>>>       
>>>>
>>>> I think above rephrase only talks about repeated copying in pre-copy
>>>> phase. Used "copied earlier until device stops" to indicate both
>>>> pre-copy and stop-and-copy till device stops.
>>>
>>>
>>> Now I'm confused, I thought we had abandoned the idea that we can only
>>> report pinned pages during stop-and-copy.  Doesn't the device needs to
>>> expose its dirty memory footprint during the iterative phase regardless
>>> of whether that causes repeat copies?  If QEMU iterates and sees that
>>> all memory is still dirty, it may have transferred more data, but it
>>> can actually predict if it can achieve its downtime tolerances.  Which
>>> is more important, less data transfer or predictability?  Thanks,
>>>    
>>
>> Even if QEMU copies and transfers content of all sys mem pages during
>> pre-copy (worst case with IOMMU backed mdev device when its vendor
>> driver is not smart to pin pages explicitly and all sys mem pages are
>> marked dirty), then also its prediction about downtime tolerance will
>> not be correct, because during stop-and-copy again all pages need to be
>> copied as device can write to any of those pinned pages.
> 
> I think you're only reiterating my point.  If QEMU copies all of guest
> memory during the iterative phase and each time it sees that all memory
> is dirty, such as if CPUs or devices (including assigned devices) are
> dirtying pages as fast as it copies them (or continuously marks them
> dirty), then QEMU can predict that downtime will require copying all
> pages. 

But as of now there is no way to know if the device has dirtied pages during 
the iterative phase.

> If instead devices don't mark dirty pages until the VM is
> stopped, then QEMU might iterate through memory copy and predict a short
> downtime because not much memory is dirty, only to be surprised that
> all of memory is suddenly dirty.  At that point it's too late, the VM
> is already stopped, the predicted short downtime takes far longer than
> expected.  This is exactly why we made the kernel interface mark pinned
> pages persistently dirty when it was proposed that we only report
> pinned pages once.  Thanks,
> 

Since there is no way to know if the device dirtied pages during the 
iterative phase, QEMU should query pinned pages in the stop-and-copy phase.

Whenever there is hardware support or some software mechanism to report 
pages dirtied by the device, we will add a capability bit to the migration 
capabilities, and based on that capability bit QEMU or the user space app 
should decide whether to query dirty pages in the iterative phase.

Thanks,
Kirti
Alex Williamson Nov. 4, 2020, 12:45 p.m. UTC | #8
On Wed, 4 Nov 2020 13:25:40 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 11/4/2020 1:57 AM, Alex Williamson wrote:
> > On Wed, 4 Nov 2020 01:18:12 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> >> On 10/30/2020 12:35 AM, Alex Williamson wrote:  
> >>> On Thu, 29 Oct 2020 23:11:16 +0530
> >>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>>      
> >>
> >> <snip>
> >>  
> >>>>>> +System memory dirty pages tracking
> >>>>>> +----------------------------------
> >>>>>> +
> >>>>>> +A ``log_sync`` memory listener callback is added to mark system memory pages  
> >>>>>
> >>>>> s/is added to mark/marks those/
> >>>>>         
> >>>>>> +as dirty which are used for DMA by VFIO device. Dirty pages bitmap is queried  
> >>>>>
> >>>>> s/by/by the/
> >>>>> s/Dirty/The dirty/
> >>>>>         
> >>>>>> +per container. All pages pinned by vendor driver through vfio_pin_pages()  
> >>>>>
> >>>>> s/by/by the/
> >>>>>         
> >>>>>> +external API have to be marked as dirty during migration. When there are CPU
> >>>>>> +writes, CPU dirty page tracking can identify dirtied pages, but any page pinned
> >>>>>> +by vendor driver can also be written by device. There is currently no device  
> >>>>>
> >>>>> s/by/by the/ (x2)
> >>>>>         
> >>>>>> +which has hardware support for dirty page tracking. So all pages which are
> >>>>>> +pinned by vendor driver are considered as dirty.
> >>>>>> +Dirty pages are tracked when device is in stop-and-copy phase because if pages
> >>>>>> +are marked dirty during pre-copy phase and content is transfered from source to
> >>>>>> +destination, there is no way to know newly dirtied pages from the point they
> >>>>>> +were copied earlier until device stops. To avoid repeated copy of same content,
> >>>>>> +pinned pages are marked dirty only during stop-and-copy phase.  
> >>>>
> >>>>     
> >>>>> Let me take a quick stab at rewriting this paragraph (not sure if I
> >>>>> understood it correctly):
> >>>>>
> >>>>> "Dirty pages are tracked when the device is in the stop-and-copy phase.
> >>>>> During the pre-copy phase, it is not possible to distinguish a dirty
> >>>>> page that has been transferred from the source to the destination from
> >>>>> newly dirtied pages, which would lead to repeated copying of the same
> >>>>> content. Therefore, pinned pages are only marked dirty during the
> >>>>> stop-and-copy phase." ?
> >>>>>         
> >>>>
> >>>> I think above rephrase only talks about repeated copying in pre-copy
> >>>> phase. Used "copied earlier until device stops" to indicate both
> >>>> pre-copy and stop-and-copy till device stops.  
> >>>
> >>>
> >>> Now I'm confused, I thought we had abandoned the idea that we can only
> >>> report pinned pages during stop-and-copy.  Doesn't the device needs to
> >>> expose its dirty memory footprint during the iterative phase regardless
> >>> of whether that causes repeat copies?  If QEMU iterates and sees that
> >>> all memory is still dirty, it may have transferred more data, but it
> >>> can actually predict if it can achieve its downtime tolerances.  Which
> >>> is more important, less data transfer or predictability?  Thanks,
> >>>      
> >>
> >> Even if QEMU copies and transfers content of all sys mem pages during
> >> pre-copy (worst case with IOMMU backed mdev device when its vendor
> >> driver is not smart to pin pages explicitly and all sys mem pages are
> >> marked dirty), then also its prediction about downtime tolerance will
> >> not be correct, because during stop-and-copy again all pages need to be
> >> copied as device can write to any of those pinned pages.  
> > 
> > I think you're only reiterating my point.  If QEMU copies all of guest
> > memory during the iterative phase and each time it sees that all memory
> > is dirty, such as if CPUs or devices (including assigned devices) are
> > dirtying pages as fast as it copies them (or continuously marks them
> > dirty), then QEMU can predict that downtime will require copying all
> > pages.   
> 
> But as of now there is no way to know if device has dirtied pages during 
> iterative phase.


This claim doesn't make any sense; pinned pages are considered
persistently dirtied, both during the iterative phase and while stopped.

 
> > If instead devices don't mark dirty pages until the VM is
> > stopped, then QEMU might iterate through memory copy and predict a short
> > downtime because not much memory is dirty, only to be surprised that
> > all of memory is suddenly dirty.  At that point it's too late, the VM
> > is already stopped, the predicted short downtime takes far longer than
> > expected.  This is exactly why we made the kernel interface mark pinned
> > pages persistently dirty when it was proposed that we only report
> > pinned pages once.  Thanks,
> >   
> 
> Since there is no way to know if device dirtied pages during iterative 
> phase, QEMU should query pinned pages in stop-and-copy phase.


As above, I don't believe this is true.


> Whenever there will be hardware support or some software mechanism to 
> report pages dirtied by device then we will add a capability bit in 
> migration capability and based on that capability bit qemu/user space 
> app should decide to query dirty pages in iterative phase.


Yes, we could advertise support for fine granularity dirty page
tracking, but I completely disagree that we should consider pinned
pages clean until suddenly exposing them as dirty once the VM is
stopped.  Thanks,

Alex
Kirti Wankhede Nov. 5, 2020, 6:59 p.m. UTC | #9
On 11/4/2020 6:15 PM, Alex Williamson wrote:
> On Wed, 4 Nov 2020 13:25:40 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> On 11/4/2020 1:57 AM, Alex Williamson wrote:
>>> On Wed, 4 Nov 2020 01:18:12 +0530
>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>    
>>>> On 10/30/2020 12:35 AM, Alex Williamson wrote:
>>>>> On Thu, 29 Oct 2020 23:11:16 +0530
>>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>>>       
>>>>
>>>> <snip>
>>>>   
>>>>>>>> +System memory dirty pages tracking
>>>>>>>> +----------------------------------
>>>>>>>> +
>>>>>>>> +A ``log_sync`` memory listener callback is added to mark system memory pages
>>>>>>>
>>>>>>> s/is added to mark/marks those/
>>>>>>>          
>>>>>>>> +as dirty which are used for DMA by VFIO device. Dirty pages bitmap is queried
>>>>>>>
>>>>>>> s/by/by the/
>>>>>>> s/Dirty/The dirty/
>>>>>>>          
>>>>>>>> +per container. All pages pinned by vendor driver through vfio_pin_pages()
>>>>>>>
>>>>>>> s/by/by the/
>>>>>>>          
>>>>>>>> +external API have to be marked as dirty during migration. When there are CPU
>>>>>>>> +writes, CPU dirty page tracking can identify dirtied pages, but any page pinned
>>>>>>>> +by vendor driver can also be written by device. There is currently no device
>>>>>>>
>>>>>>> s/by/by the/ (x2)
>>>>>>>          
>>>>>>>> +which has hardware support for dirty page tracking. So all pages which are
>>>>>>>> +pinned by vendor driver are considered as dirty.
>>>>>>>> +Dirty pages are tracked when device is in stop-and-copy phase because if pages
>>>>>>>> +are marked dirty during pre-copy phase and content is transfered from source to
>>>>>>>> +destination, there is no way to know newly dirtied pages from the point they
>>>>>>>> +were copied earlier until device stops. To avoid repeated copy of same content,
>>>>>>>> +pinned pages are marked dirty only during stop-and-copy phase.
>>>>>>
>>>>>>      
>>>>>>> Let me take a quick stab at rewriting this paragraph (not sure if I
>>>>>>> understood it correctly):
>>>>>>>
>>>>>>> "Dirty pages are tracked when the device is in the stop-and-copy phase.
>>>>>>> During the pre-copy phase, it is not possible to distinguish a dirty
>>>>>>> page that has been transferred from the source to the destination from
>>>>>>> newly dirtied pages, which would lead to repeated copying of the same
>>>>>>> content. Therefore, pinned pages are only marked dirty during the
>>>>>>> stop-and-copy phase." ?
>>>>>>>          
>>>>>>
>>>>>> I think above rephrase only talks about repeated copying in pre-copy
>>>>>> phase. Used "copied earlier until device stops" to indicate both
>>>>>> pre-copy and stop-and-copy till device stops.
>>>>>
>>>>>
>>>>> Now I'm confused, I thought we had abandoned the idea that we can only
>>>>> report pinned pages during stop-and-copy.  Doesn't the device needs to
>>>>> expose its dirty memory footprint during the iterative phase regardless
>>>>> of whether that causes repeat copies?  If QEMU iterates and sees that
>>>>> all memory is still dirty, it may have transferred more data, but it
>>>>> can actually predict if it can achieve its downtime tolerances.  Which
>>>>> is more important, less data transfer or predictability?  Thanks,
>>>>>       
>>>>
>>>> Even if QEMU copies and transfers content of all sys mem pages during
>>>> pre-copy (worst case with IOMMU backed mdev device when its vendor
>>>> driver is not smart to pin pages explicitly and all sys mem pages are
>>>> marked dirty), then also its prediction about downtime tolerance will
>>>> not be correct, because during stop-and-copy again all pages need to be
>>>> copied as device can write to any of those pinned pages.
>>>
>>> I think you're only reiterating my point.  If QEMU copies all of guest
>>> memory during the iterative phase and each time it sees that all memory
>>> is dirty, such as if CPUs or devices (including assigned devices) are
>>> dirtying pages as fast as it copies them (or continuously marks them
>>> dirty), then QEMU can predict that downtime will require copying all
>>> pages.
>>
>> But as of now there is no way to know if device has dirtied pages during
>> iterative phase.
> 
> 
> This claim doesn't make any sense, pinned pages are considered
> persistently dirtied, during the iterative phase and while stopped.
> 
>   
>>> If instead devices don't mark dirty pages until the VM is
>>> stopped, then QEMU might iterate through memory copy and predict a short
>>> downtime because not much memory is dirty, only to be surprised that
>>> all of memory is suddenly dirty.  At that point it's too late, the VM
>>> is already stopped, the predicted short downtime takes far longer than
>>> expected.  This is exactly why we made the kernel interface mark pinned
>>> pages persistently dirty when it was proposed that we only report
>>> pinned pages once.  Thanks,
>>>    
>>
>> Since there is no way to know if device dirtied pages during iterative
>> phase, QEMU should query pinned pages in stop-and-copy phase.
> 
> 
> As above, I don't believe this is true.
> 
> 
>> Whenever there will be hardware support or some software mechanism to
>> report pages dirtied by device then we will add a capability bit in
>> migration capability and based on that capability bit qemu/user space
>> app should decide to query dirty pages in iterative phase.
> 
> 
> Yes, we could advertise support for fine granularity dirty page
> tracking, but I completely disagree that we should consider pinned
> pages clean until suddenly exposing them as dirty once the VM is
> stopped.  Thanks,
> 

Should QEMU copy dirtied pages twice, during the iterative phase and then
again when the VM is stopped?

Thanks,
Kirti
Alex Williamson Nov. 5, 2020, 7:11 p.m. UTC | #10
On Fri, 6 Nov 2020 00:29:36 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 11/4/2020 6:15 PM, Alex Williamson wrote:
> > On Wed, 4 Nov 2020 13:25:40 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> >> On 11/4/2020 1:57 AM, Alex Williamson wrote:  
> >>> On Wed, 4 Nov 2020 01:18:12 +0530
> >>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>>      
> >>>> On 10/30/2020 12:35 AM, Alex Williamson wrote:  
> >>>>> On Thu, 29 Oct 2020 23:11:16 +0530
> >>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>>>>         
> >>>>
> >>>> <snip>
> >>>>     
> >>>>>
> >>>>>
> >>>>> Now I'm confused, I thought we had abandoned the idea that we can only
> >>>>> report pinned pages during stop-and-copy.  Doesn't the device needs to
> >>>>> expose its dirty memory footprint during the iterative phase regardless
> >>>>> of whether that causes repeat copies?  If QEMU iterates and sees that
> >>>>> all memory is still dirty, it may have transferred more data, but it
> >>>>> can actually predict if it can achieve its downtime tolerances.  Which
> >>>>> is more important, less data transfer or predictability?  Thanks,
> >>>>>         
> >>>>
> >>>> Even if QEMU copies and transfers content of all sys mem pages during
> >>>> pre-copy (worst case with IOMMU backed mdev device when its vendor
> >>>> driver is not smart to pin pages explicitly and all sys mem pages are
> >>>> marked dirty), then also its prediction about downtime tolerance will
> >>>> not be correct, because during stop-and-copy again all pages need to be
> >>>> copied as device can write to any of those pinned pages.  
> >>>
> >>> I think you're only reiterating my point.  If QEMU copies all of guest
> >>> memory during the iterative phase and each time it sees that all memory
> >>> is dirty, such as if CPUs or devices (including assigned devices) are
> >>> dirtying pages as fast as it copies them (or continuously marks them
> >>> dirty), then QEMU can predict that downtime will require copying all
> >>> pages.  
> >>
> >> But as of now there is no way to know if device has dirtied pages during
> >> iterative phase.  
> > 
> > 
> > This claim doesn't make any sense, pinned pages are considered
> > persistently dirtied, during the iterative phase and while stopped.
> > 
> >     
> >>> If instead devices don't mark dirty pages until the VM is
> >>> stopped, then QEMU might iterate through memory copy and predict a short
> >>> downtime because not much memory is dirty, only to be surprised that
> >>> all of memory is suddenly dirty.  At that point it's too late, the VM
> >>> is already stopped, the predicted short downtime takes far longer than
> >>> expected.  This is exactly why we made the kernel interface mark pinned
> >>> pages persistently dirty when it was proposed that we only report
> >>> pinned pages once.  Thanks,
> >>>      
> >>
> >> Since there is no way to know if device dirtied pages during iterative
> >> phase, QEMU should query pinned pages in stop-and-copy phase.  
> > 
> > 
> > As above, I don't believe this is true.
> > 
> >   
> >> Whenever there will be hardware support or some software mechanism to
> >> report pages dirtied by device then we will add a capability bit in
> >> migration capability and based on that capability bit qemu/user space
> >> app should decide to query dirty pages in iterative phase.  
> > 
> > 
> > Yes, we could advertise support for fine granularity dirty page
> > tracking, but I completely disagree that we should consider pinned
> > pages clean until suddenly exposing them as dirty once the VM is
> > stopped.  Thanks,
> >   
> 
> Should QEMU copy dirtied pages twice, during iterative phase and then 
> when VM is stopped?

I don't understand why this is controversial.  We cannot decide within
the vfio device to only expose device dirtied pages in the final stage
of migration.  It's not our job to minimize the number of pages copied
beyond the hardware granularity.  If core QEMU migration code asks for
dirty pages, we provide them, regardless of how many times we report a
page as dirty.  So yes, if that migration code asks for dirty pages in
the iterative stage and the stopped stage, we provide them both times.
If someone wants to skip the iterative phase altogether, I imagine
there are migration parameters that allow it, but we should not be
determining that policy at the device level.  Thanks,

Alex
Kirti Wankhede Nov. 5, 2020, 8:52 p.m. UTC | #11
On 11/6/2020 12:41 AM, Alex Williamson wrote:
> On Fri, 6 Nov 2020 00:29:36 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> On 11/4/2020 6:15 PM, Alex Williamson wrote:
>>> On Wed, 4 Nov 2020 13:25:40 +0530
>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>    
>>>> On 11/4/2020 1:57 AM, Alex Williamson wrote:
>>>>> On Wed, 4 Nov 2020 01:18:12 +0530
>>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>>>       
>>>>>> On 10/30/2020 12:35 AM, Alex Williamson wrote:
>>>>>>> On Thu, 29 Oct 2020 23:11:16 +0530
>>>>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>>>>>          
>>>>>>
>>>>>> <snip>
>>>>>>      
>>>>>>>
>>>>>>>
>>>>>>> Now I'm confused, I thought we had abandoned the idea that we can only
>>>>>>> report pinned pages during stop-and-copy.  Doesn't the device needs to
>>>>>>> expose its dirty memory footprint during the iterative phase regardless
>>>>>>> of whether that causes repeat copies?  If QEMU iterates and sees that
>>>>>>> all memory is still dirty, it may have transferred more data, but it
>>>>>>> can actually predict if it can achieve its downtime tolerances.  Which
>>>>>>> is more important, less data transfer or predictability?  Thanks,
>>>>>>>          
>>>>>>
>>>>>> Even if QEMU copies and transfers content of all sys mem pages during
>>>>>> pre-copy (worst case with IOMMU backed mdev device when its vendor
>>>>>> driver is not smart to pin pages explicitly and all sys mem pages are
>>>>>> marked dirty), then also its prediction about downtime tolerance will
>>>>>> not be correct, because during stop-and-copy again all pages need to be
>>>>>> copied as device can write to any of those pinned pages.
>>>>>
>>>>> I think you're only reiterating my point.  If QEMU copies all of guest
>>>>> memory during the iterative phase and each time it sees that all memory
>>>>> is dirty, such as if CPUs or devices (including assigned devices) are
>>>>> dirtying pages as fast as it copies them (or continuously marks them
>>>>> dirty), then QEMU can predict that downtime will require copying all
>>>>> pages.
>>>>
>>>> But as of now there is no way to know if device has dirtied pages during
>>>> iterative phase.
>>>
>>>
>>> This claim doesn't make any sense, pinned pages are considered
>>> persistently dirtied, during the iterative phase and while stopped.
>>>
>>>      
>>>>> If instead devices don't mark dirty pages until the VM is
>>>>> stopped, then QEMU might iterate through memory copy and predict a short
>>>>> downtime because not much memory is dirty, only to be surprised that
>>>>> all of memory is suddenly dirty.  At that point it's too late, the VM
>>>>> is already stopped, the predicted short downtime takes far longer than
>>>>> expected.  This is exactly why we made the kernel interface mark pinned
>>>>> pages persistently dirty when it was proposed that we only report
>>>>> pinned pages once.  Thanks,
>>>>>       
>>>>
>>>> Since there is no way to know if device dirtied pages during iterative
>>>> phase, QEMU should query pinned pages in stop-and-copy phase.
>>>
>>>
>>> As above, I don't believe this is true.
>>>
>>>    
>>>> Whenever there will be hardware support or some software mechanism to
>>>> report pages dirtied by device then we will add a capability bit in
>>>> migration capability and based on that capability bit qemu/user space
>>>> app should decide to query dirty pages in iterative phase.
>>>
>>>
>>> Yes, we could advertise support for fine granularity dirty page
>>> tracking, but I completely disagree that we should consider pinned
>>> pages clean until suddenly exposing them as dirty once the VM is
>>> stopped.  Thanks,
>>>    
>>
>> Should QEMU copy dirtied pages twice, during iterative phase and then
>> when VM is stopped?
> 
> I don't understand why this is controversial.  We cannot decide within
> the vfio device to only expose device dirtied pages in the final stage
> of migration.  It's not our job to minimize the number of pages copied
> beyond the hardware granularity.  If core QEMU migration code asks for
> dirty pages, we provide them, regardless of how many times we report a
> page as dirty.  So yes, if that migration code asks for dirty pages in
> the iterative stage and the stopped stage, we provide them both times.

Wouldn't that increase the total migration time?

> If someone wants to skip the iterative phase altogether, I imagine
> there are migration parameters that allow it, but we should not be
> determining that policy at the device level.  Thanks,
> 

What is that parameter? Should it be documented here?

Thanks,
Kirti
Alex Williamson Nov. 5, 2020, 9:26 p.m. UTC | #12
On Fri, 6 Nov 2020 02:22:11 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 11/6/2020 12:41 AM, Alex Williamson wrote:
> > On Fri, 6 Nov 2020 00:29:36 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> >> On 11/4/2020 6:15 PM, Alex Williamson wrote:  
> >>> On Wed, 4 Nov 2020 13:25:40 +0530
> >>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>>      
> >>>> On 11/4/2020 1:57 AM, Alex Williamson wrote:  
> >>>>> On Wed, 4 Nov 2020 01:18:12 +0530
> >>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>>>>         
> >>>>>> On 10/30/2020 12:35 AM, Alex Williamson wrote:  
> >>>>>>> On Thu, 29 Oct 2020 23:11:16 +0530
> >>>>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>>>>>>            
> >>>>>>
> >>>>>> <snip>
> >>>>>>        
> >>>>>>>
> >>>>>>>
> >>>>>>> Now I'm confused, I thought we had abandoned the idea that we can only
> >>>>>>> report pinned pages during stop-and-copy.  Doesn't the device needs to
> >>>>>>> expose its dirty memory footprint during the iterative phase regardless
> >>>>>>> of whether that causes repeat copies?  If QEMU iterates and sees that
> >>>>>>> all memory is still dirty, it may have transferred more data, but it
> >>>>>>> can actually predict if it can achieve its downtime tolerances.  Which
> >>>>>>> is more important, less data transfer or predictability?  Thanks,
> >>>>>>>            
> >>>>>>
> >>>>>> Even if QEMU copies and transfers content of all sys mem pages during
> >>>>>> pre-copy (worst case with IOMMU backed mdev device when its vendor
> >>>>>> driver is not smart to pin pages explicitly and all sys mem pages are
> >>>>>> marked dirty), then also its prediction about downtime tolerance will
> >>>>>> not be correct, because during stop-and-copy again all pages need to be
> >>>>>> copied as device can write to any of those pinned pages.  
> >>>>>
> >>>>> I think you're only reiterating my point.  If QEMU copies all of guest
> >>>>> memory during the iterative phase and each time it sees that all memory
> >>>>> is dirty, such as if CPUs or devices (including assigned devices) are
> >>>>> dirtying pages as fast as it copies them (or continuously marks them
> >>>>> dirty), then QEMU can predict that downtime will require copying all
> >>>>> pages.  
> >>>>
> >>>> But as of now there is no way to know if device has dirtied pages during
> >>>> iterative phase.  
> >>>
> >>>
> >>> This claim doesn't make any sense, pinned pages are considered
> >>> persistently dirtied, during the iterative phase and while stopped.
> >>>
> >>>        
> >>>>> If instead devices don't mark dirty pages until the VM is
> >>>>> stopped, then QEMU might iterate through memory copy and predict a short
> >>>>> downtime because not much memory is dirty, only to be surprised that
> >>>>> all of memory is suddenly dirty.  At that point it's too late, the VM
> >>>>> is already stopped, the predicted short downtime takes far longer than
> >>>>> expected.  This is exactly why we made the kernel interface mark pinned
> >>>>> pages persistently dirty when it was proposed that we only report
> >>>>> pinned pages once.  Thanks,
> >>>>>         
> >>>>
> >>>> Since there is no way to know if device dirtied pages during iterative
> >>>> phase, QEMU should query pinned pages in stop-and-copy phase.  
> >>>
> >>>
> >>> As above, I don't believe this is true.
> >>>
> >>>      
> >>>> Whenever there will be hardware support or some software mechanism to
> >>>> report pages dirtied by device then we will add a capability bit in
> >>>> migration capability and based on that capability bit qemu/user space
> >>>> app should decide to query dirty pages in iterative phase.  
> >>>
> >>>
> >>> Yes, we could advertise support for fine granularity dirty page
> >>> tracking, but I completely disagree that we should consider pinned
> >>> pages clean until suddenly exposing them as dirty once the VM is
> >>> stopped.  Thanks,
> >>>      
> >>
> >> Should QEMU copy dirtied pages twice, during iterative phase and then
> >> when VM is stopped?  
> > 
> > I don't understand why this is controversial.  We cannot decide within
> > the vfio device to only expose device dirtied pages in the final stage
> > of migration.  It's not our job to minimize the number of pages copied
> > beyond the hardware granularity.  If core QEMU migration code asks for
> > dirty pages, we provide them, regardless of how many times we report a
> > page as dirty.  So yes, if that migration code asks for dirty pages in
> > the iterative stage and the stopped stage, we provide them both times.  
> 
> Isn't that would increase total migration time?

As I explained, that's not a policy decision that we as a device within
the VM should be making.  We do not have the visibility to determine
how the footprint of our device will affect the migration and by
preventing QEMU migration code from understanding the device footprint,
we're creating a scenario where QEMU absolutely cannot predict the
downtime.
 
> > If someone wants to skip the iterative phase altogether, I imagine
> > there are migration parameters that allow it, but we should not be
> > determining that policy at the device level.  Thanks,
> >   
> 
> What is that parameter? should that be documented here?

Dunno, but clearly we could pause the VM, migrate, and resume on the
target.  I imagine there are migration tuning parameters that might do
essentially that automatically.  Thanks,

Alex
Kirti Wankhede Nov. 6, 2020, 6:57 p.m. UTC | #13
On 11/6/2020 2:56 AM, Alex Williamson wrote:
> On Fri, 6 Nov 2020 02:22:11 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> On 11/6/2020 12:41 AM, Alex Williamson wrote:
>>> On Fri, 6 Nov 2020 00:29:36 +0530
>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>    
>>>> On 11/4/2020 6:15 PM, Alex Williamson wrote:
>>>>> On Wed, 4 Nov 2020 13:25:40 +0530
>>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>>>       
>>>>>> On 11/4/2020 1:57 AM, Alex Williamson wrote:
>>>>>>> On Wed, 4 Nov 2020 01:18:12 +0530
>>>>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>>>>>          
>>>>>>>> On 10/30/2020 12:35 AM, Alex Williamson wrote:
>>>>>>>>> On Thu, 29 Oct 2020 23:11:16 +0530
>>>>>>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>>>>>>>             
>>>>>>>>
>>>>>>>> <snip>
>>>>>>>>         
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Now I'm confused, I thought we had abandoned the idea that we can only
>>>>>>>>> report pinned pages during stop-and-copy.  Doesn't the device needs to
>>>>>>>>> expose its dirty memory footprint during the iterative phase regardless
>>>>>>>>> of whether that causes repeat copies?  If QEMU iterates and sees that
>>>>>>>>> all memory is still dirty, it may have transferred more data, but it
>>>>>>>>> can actually predict if it can achieve its downtime tolerances.  Which
>>>>>>>>> is more important, less data transfer or predictability?  Thanks,
>>>>>>>>>             
>>>>>>>>
>>>>>>>> Even if QEMU copies and transfers content of all sys mem pages during
>>>>>>>> pre-copy (worst case with IOMMU backed mdev device when its vendor
>>>>>>>> driver is not smart to pin pages explicitly and all sys mem pages are
>>>>>>>> marked dirty), then also its prediction about downtime tolerance will
>>>>>>>> not be correct, because during stop-and-copy again all pages need to be
>>>>>>>> copied as device can write to any of those pinned pages.
>>>>>>>
>>>>>>> I think you're only reiterating my point.  If QEMU copies all of guest
>>>>>>> memory during the iterative phase and each time it sees that all memory
>>>>>>> is dirty, such as if CPUs or devices (including assigned devices) are
>>>>>>> dirtying pages as fast as it copies them (or continuously marks them
>>>>>>> dirty), then QEMU can predict that downtime will require copying all
>>>>>>> pages.
>>>>>>
>>>>>> But as of now there is no way to know if device has dirtied pages during
>>>>>> iterative phase.
>>>>>
>>>>>
>>>>> This claim doesn't make any sense, pinned pages are considered
>>>>> persistently dirtied, during the iterative phase and while stopped.
>>>>>
>>>>>         
>>>>>>> If instead devices don't mark dirty pages until the VM is
>>>>>>> stopped, then QEMU might iterate through memory copy and predict a short
>>>>>>> downtime because not much memory is dirty, only to be surprised that
>>>>>>> all of memory is suddenly dirty.  At that point it's too late, the VM
>>>>>>> is already stopped, the predicted short downtime takes far longer than
>>>>>>> expected.  This is exactly why we made the kernel interface mark pinned
>>>>>>> pages persistently dirty when it was proposed that we only report
>>>>>>> pinned pages once.  Thanks,
>>>>>>>          
>>>>>>
>>>>>> Since there is no way to know if device dirtied pages during iterative
>>>>>> phase, QEMU should query pinned pages in stop-and-copy phase.
>>>>>
>>>>>
>>>>> As above, I don't believe this is true.
>>>>>
>>>>>       
>>>>>> Whenever there will be hardware support or some software mechanism to
>>>>>> report pages dirtied by device then we will add a capability bit in
>>>>>> migration capability and based on that capability bit qemu/user space
>>>>>> app should decide to query dirty pages in iterative phase.
>>>>>
>>>>>
>>>>> Yes, we could advertise support for fine granularity dirty page
>>>>> tracking, but I completely disagree that we should consider pinned
>>>>> pages clean until suddenly exposing them as dirty once the VM is
>>>>> stopped.  Thanks,
>>>>>       
>>>>
>>>> Should QEMU copy dirtied pages twice, during iterative phase and then
>>>> when VM is stopped?
>>>
>>> I don't understand why this is controversial.  We cannot decide within
>>> the vfio device to only expose device dirtied pages in the final stage
>>> of migration.  It's not our job to minimize the number of pages copied
>>> beyond the hardware granularity.  If core QEMU migration code asks for
>>> dirty pages, we provide them, regardless of how many times we report a
>>> page as dirty.  So yes, if that migration code asks for dirty pages in
>>> the iterative stage and the stopped stage, we provide them both times.
>>
>> Isn't that would increase total migration time?
> 
> As I explained, that's not a policy decision that we as a device within
> the VM should be making.  We do not have the visibility to determine
> how the footprint of our device will affect the migration and by
> preventing QEMU migration code from understanding the device footprint,
> we're creating a scenario where QEMU absolutely cannot predict the
> downtime.
>

If the VFIO device dirtied a lot of memory and each time the VFIO device
reports all of those pages as dirty during the iterative phase, then migration
would never converge.

If there is a way to track dirty pages during the iterative phase and there
is a way to know that pages weren't dirtied again, then it makes sense
for them to be reported during the iterative phase.

Since there is no hardware support to track pages dirtied by the device, for
now the best way would be to report those during stop-and-copy.

Thanks,
Kirti

>>> If someone wants to skip the iterative phase altogether, I imagine
>>> there are migration parameters that allow it, but we should not be
>>> determining that policy at the device level.  Thanks,
>>>    
>>
>> What is that parameter? should that be documented here?
> 
> Dunno, but clearly we could pause the VM, migration, and resume on the
> target.  I imagine there are migration tuning parameters that might do
> essentially that automatically.  Thanks,
> 
> Alex
>
Alex Williamson Nov. 6, 2020, 7:17 p.m. UTC | #14
On Sat, 7 Nov 2020 00:27:46 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 11/6/2020 2:56 AM, Alex Williamson wrote:
> > On Fri, 6 Nov 2020 02:22:11 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> >> On 11/6/2020 12:41 AM, Alex Williamson wrote:  
> >>> On Fri, 6 Nov 2020 00:29:36 +0530
> >>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>>      
> >>>> On 11/4/2020 6:15 PM, Alex Williamson wrote:  
> >>>>> On Wed, 4 Nov 2020 13:25:40 +0530
> >>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>>>>         
> >>>>>> On 11/4/2020 1:57 AM, Alex Williamson wrote:  
> >>>>>>> On Wed, 4 Nov 2020 01:18:12 +0530
> >>>>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>>>>>>            
> >>>>>>>> On 10/30/2020 12:35 AM, Alex Williamson wrote:  
> >>>>>>>>> On Thu, 29 Oct 2020 23:11:16 +0530
> >>>>>>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>>>>>>>>               
> >>>>>>>>
> >>>>>>>> <snip>
> >>>>>>>>           
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Now I'm confused, I thought we had abandoned the idea that we can only
> >>>>>>>>> report pinned pages during stop-and-copy.  Doesn't the device needs to
> >>>>>>>>> expose its dirty memory footprint during the iterative phase regardless
> >>>>>>>>> of whether that causes repeat copies?  If QEMU iterates and sees that
> >>>>>>>>> all memory is still dirty, it may have transferred more data, but it
> >>>>>>>>> can actually predict if it can achieve its downtime tolerances.  Which
> >>>>>>>>> is more important, less data transfer or predictability?  Thanks,
> >>>>>>>>>               
> >>>>>>>>
> >>>>>>>> Even if QEMU copies and transfers content of all sys mem pages during
> >>>>>>>> pre-copy (worst case with IOMMU backed mdev device when its vendor
> >>>>>>>> driver is not smart to pin pages explicitly and all sys mem pages are
> >>>>>>>> marked dirty), then also its prediction about downtime tolerance will
> >>>>>>>> not be correct, because during stop-and-copy again all pages need to be
> >>>>>>>> copied as device can write to any of those pinned pages.  
> >>>>>>>
> >>>>>>> I think you're only reiterating my point.  If QEMU copies all of guest
> >>>>>>> memory during the iterative phase and each time it sees that all memory
> >>>>>>> is dirty, such as if CPUs or devices (including assigned devices) are
> >>>>>>> dirtying pages as fast as it copies them (or continuously marks them
> >>>>>>> dirty), then QEMU can predict that downtime will require copying all
> >>>>>>> pages.  
> >>>>>>
> >>>>>> But as of now there is no way to know if device has dirtied pages during
> >>>>>> iterative phase.  
> >>>>>
> >>>>>
> >>>>> This claim doesn't make any sense, pinned pages are considered
> >>>>> persistently dirtied, during the iterative phase and while stopped.
> >>>>>
> >>>>>           
> >>>>>>> If instead devices don't mark dirty pages until the VM is
> >>>>>>> stopped, then QEMU might iterate through memory copy and predict a short
> >>>>>>> downtime because not much memory is dirty, only to be surprised that
> >>>>>>> all of memory is suddenly dirty.  At that point it's too late, the VM
> >>>>>>> is already stopped, the predicted short downtime takes far longer than
> >>>>>>> expected.  This is exactly why we made the kernel interface mark pinned
> >>>>>>> pages persistently dirty when it was proposed that we only report
> >>>>>>> pinned pages once.  Thanks,
> >>>>>>>            
> >>>>>>
> >>>>>> Since there is no way to know if device dirtied pages during iterative
> >>>>>> phase, QEMU should query pinned pages in stop-and-copy phase.  
> >>>>>
> >>>>>
> >>>>> As above, I don't believe this is true.
> >>>>>
> >>>>>         
> >>>>>> Whenever there will be hardware support or some software mechanism to
> >>>>>> report pages dirtied by device then we will add a capability bit in
> >>>>>> migration capability and based on that capability bit qemu/user space
> >>>>>> app should decide to query dirty pages in iterative phase.  
> >>>>>
> >>>>>
> >>>>> Yes, we could advertise support for fine granularity dirty page
> >>>>> tracking, but I completely disagree that we should consider pinned
> >>>>> pages clean until suddenly exposing them as dirty once the VM is
> >>>>> stopped.  Thanks,
> >>>>>         
> >>>>
> >>>> Should QEMU copy dirtied pages twice, during iterative phase and then
> >>>> when VM is stopped?  
> >>>
> >>> I don't understand why this is controversial.  We cannot decide within
> >>> the vfio device to only expose device dirtied pages in the final stage
> >>> of migration.  It's not our job to minimize the number of pages copied
> >>> beyond the hardware granularity.  If core QEMU migration code asks for
> >>> dirty pages, we provide them, regardless of how many times we report a
> >>> page as dirty.  So yes, if that migration code asks for dirty pages in
> >>> the iterative stage and the stopped stage, we provide them both times.  
> >>
> >> Isn't that would increase total migration time?  
> > 
> > As I explained, that's not a policy decision that we as a device within
> > the VM should be making.  We do not have the visibility to determine
> > how the footprint of our device will affect the migration and by
> > preventing QEMU migration code from understanding the device footprint,
> > we're creating a scenario where QEMU absolutely cannot predict the
> > downtime.
> >  
> 
> If VFIO device dirtied lot of memory and each time VFIO device reports 
> all those as dirty during iterative phase, then migration would never 
> converge.
> 
> If there is a way to track dirty pages during iterative phase and there 
> is a way to know that pages weren't dirtied again, then it makes sense 
> for them to reported during the iterative phase.
> 
> Since there is no hardware support to track pages dirtied by device, for 
> now best way would be to report those during stop-and-copy.

You're only re-iterating what we already know and what I've already
disagreed with you about.  If QEMU assumes that migration is converging
because we're not reporting dirty pages during the iterative phase,
that's a FALSE conclusion and exactly the crux of my argument that we
must provide dirty page data as accurately as possible to QEMU.  QEMU
needs to know if the migration is not converging and the acceptable
downtime needs to be adjusted at the management layers to account for
that.  Thanks,

Alex

Patch

diff --git a/MAINTAINERS b/MAINTAINERS
index 6a197bd358d6..6f3fcffc6b3d 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1728,6 +1728,7 @@  M: Alex Williamson <alex.williamson@redhat.com>
 S: Supported
 F: hw/vfio/*
 F: include/hw/vfio/
+F: docs/devel/vfio-migration.rst
 
 vfio-ccw
 M: Cornelia Huck <cohuck@redhat.com>
diff --git a/docs/devel/vfio-migration.rst b/docs/devel/vfio-migration.rst
new file mode 100644
index 000000000000..dab9127825e4
--- /dev/null
+++ b/docs/devel/vfio-migration.rst
@@ -0,0 +1,119 @@ 
+=====================
+VFIO device Migration
+=====================
+
+VFIO devices use an iterative approach for migration because certain VFIO
+devices (e.g. GPU) have a large amount of data to be transferred. The iterative
+pre-copy phase of migration allows the guest to continue running whilst the
+VFIO device state is transferred to the destination; this helps to reduce the
+total downtime of the VM. VFIO devices can choose to skip the pre-copy phase of
+migration by returning pending_bytes as zero during the pre-copy phase.
+
+A detailed description of the UAPI for VFIO device migration can be found in
+the comment above the ``vfio_device_migration_info`` structure definition in
+the header file linux-headers/linux/vfio.h.
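+
+For quick reference, the sketch below summarises the main fields and state
+flags of that structure; the header file remains the authoritative definition
+and should be consulted for the full protocol description::
+
+    struct vfio_device_migration_info {
+        __u32 device_state;      /* VFIO_DEVICE_STATE_* bit mask */
+    #define VFIO_DEVICE_STATE_STOP      (0)
+    #define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
+    #define VFIO_DEVICE_STATE_SAVING    (1 << 1)
+    #define VFIO_DEVICE_STATE_RESUMING  (1 << 2)
+        __u32 reserved;
+        __u64 pending_bytes;     /* data yet to be saved by the vendor driver */
+        __u64 data_offset;       /* offset of the data section in the region */
+        __u64 data_size;         /* size of the current data chunk */
+    };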
+
+VFIO device hooks for the iterative approach (a registration sketch follows
+this list):
+
+- A ``save_setup`` function that sets up the migration region, sets the _SAVING
+  flag in the VFIO device state and informs the VFIO IOMMU module to start
+  dirty page tracking.
+
+- A ``load_setup`` function that sets up the migration region on the
+  destination and sets the _RESUMING flag in the VFIO device state.
+
+- A ``save_live_pending`` function that reads pending_bytes from the vendor
+  driver, which indicates how much more data the vendor driver has yet to save
+  for the VFIO device.
+
+- A ``save_live_iterate`` function that reads the VFIO device's data from the
+  vendor driver through the migration region during the iterative phase.
+
+- A ``save_live_complete_precopy`` function that resets the _RUNNING flag in
+  the VFIO device state, saves the device config space, if any, and iteratively
+  copies the remaining data for the VFIO device until the pending_bytes
+  returned by the vendor driver is zero.
+
+- A ``load_state`` function that loads the config section and the data sections
+  generated by the above save functions.
+
+- ``cleanup`` functions for both save and load that unmap the migration region.
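+
+The sketch below shows how such hooks could be collected into QEMU's
+``SaveVMHandlers`` table and registered with the savevm machinery (e.g. via
+register_savevm_live()). The ``vfio_save_*``/``vfio_load_*`` names are
+illustrative placeholders, not the definitive implementation::
+
+    /* Illustrative only: the callback names are placeholders. */
+    static SaveVMHandlers savevm_vfio_handlers = {
+        .save_setup = vfio_save_setup,
+        .save_cleanup = vfio_save_cleanup,
+        .save_live_pending = vfio_save_pending,
+        .save_live_iterate = vfio_save_iterate,
+        .save_live_complete_precopy = vfio_save_complete_precopy,
+        .load_setup = vfio_load_setup,
+        .load_cleanup = vfio_load_cleanup,
+        .load_state = vfio_load_state,
+    };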
+
+A VM state change handler is registered to change the VFIO device state based
+on the VM state change (a sketch of this translation follows below).
+
+Similarly, a migration state change notifier is registered to get a
+notification on a migration state change. These states are translated to the
+VFIO device state and conveyed to the vendor driver.
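+
+As an illustration, the translation described above amounts to composing the
+_RUNNING/_SAVING/_RESUMING bits that are written to the ``device_state`` field
+of the migration region. A minimal sketch (the helper name is hypothetical)::
+
+    /* Sketch: compose the device_state value conveyed to the vendor driver. */
+    static uint32_t vfio_compose_device_state(bool vm_running, bool saving,
+                                              bool resuming)
+    {
+        uint32_t state = VFIO_DEVICE_STATE_STOP;
+
+        if (vm_running) {
+            state |= VFIO_DEVICE_STATE_RUNNING;
+        }
+        if (saving) {
+            state |= VFIO_DEVICE_STATE_SAVING;
+        }
+        if (resuming) {
+            state |= VFIO_DEVICE_STATE_RESUMING;
+        }
+        return state;
+    }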
+
+System memory dirty pages tracking
+----------------------------------
+
+A ``log_sync`` memory listener callback marks those system memory pages as
+dirty which are used for DMA by the VFIO device. The dirty pages bitmap is
+queried per container. All pages pinned by the vendor driver through the
+vfio_pin_pages() external API have to be marked as dirty during migration.
+When there are CPU writes, CPU dirty page tracking can identify dirtied pages,
+but any page pinned by the vendor driver can also be written by the device.
+There is currently no device which has hardware support for dirty page
+tracking, so all pages which are pinned by the vendor driver are considered as
+dirty.
+
+Dirty pages are tracked when the device is in the stop-and-copy phase, because
+if pages are marked dirty during the pre-copy phase and their content is
+transferred from source to destination, there is no way to distinguish newly
+dirtied pages from the point they were copied earlier until the device stops.
+To avoid repeated copying of the same content, pinned pages are marked dirty
+only during the stop-and-copy phase.
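+
+A minimal sketch of such a listener callback is shown below;
+vfio_query_dirty_bitmap() is a placeholder for the per-container dirty bitmap
+query and is not an existing function::
+
+    /* Sketch only: vfio_query_dirty_bitmap() stands in for the per-container
+     * dirty bitmap query issued to the VFIO IOMMU module. */
+    static void vfio_listener_log_sync(MemoryListener *listener,
+                                       MemoryRegionSection *section)
+    {
+        ram_addr_t ram_addr = memory_region_get_ram_addr(section->mr) +
+                              section->offset_within_region;
+        uint64_t size = int128_get64(section->size);
+        unsigned long *bitmap = NULL;   /* one bit per guest page */
+
+        if (vfio_query_dirty_bitmap(section, &bitmap)) {
+            return;                     /* query failed or device not _SAVING */
+        }
+        /* Mark the corresponding guest pages dirty for the migration code. */
+        cpu_physical_memory_set_dirty_lebitmap(bitmap, ram_addr,
+                                               size >> TARGET_PAGE_BITS);
+        g_free(bitmap);
+    }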
+
+System memory dirty pages tracking when vIOMMU is enabled
+---------------------------------------------------------
+
+With vIOMMU, an IO virtual address range can get unmapped while in the pre-copy
+phase of migration. In that case, the unmap ioctl returns the pages pinned in
+that range and QEMU reports the corresponding guest physical pages as dirty.
+During the stop-and-copy phase, an IOMMU notifier is used to get a callback for
+mapped pages and then the dirty pages bitmap is fetched from the VFIO IOMMU
+module for those mapped ranges.
+
+Flow of state changes during Live migration
+===========================================
+
+Below is the flow of state changes during live migration, where the states in
+brackets represent the VM state, the migration state and the VFIO device state,
+as:
+                (VM state, MIGRATION_STATUS, VFIO_DEVICE_STATE)
+
+Live migration save path
+------------------------
+                        QEMU normal running state
+                        (RUNNING, _NONE, _RUNNING)
+                                    |
+                       migrate_init spawns migration_thread
+                Migration thread then calls each device's .save_setup()
+                        (RUNNING, _SETUP, _RUNNING|_SAVING)
+                                    |
+                        (RUNNING, _ACTIVE, _RUNNING|_SAVING)
+            If the device is active, get pending_bytes by .save_live_pending()
+         If total pending_bytes >= threshold_size, call .save_live_iterate()
+                  Data of VFIO device for pre-copy phase is copied
+       Iterate until total pending bytes converge and are less than threshold
+                                    |
+   On migration completion, the vCPUs are stopped and .save_live_complete_precopy
+   is called for each active device. The VFIO device is then transitioned into
+   the _SAVING state
+                    (FINISH_MIGRATE, _DEVICE, _SAVING)
+                                    |
+For the VFIO device, iterate in .save_live_complete_precopy until pending data is 0
+                    (FINISH_MIGRATE, _DEVICE, _STOPPED)
+                                    |
+                    (FINISH_MIGRATE, _COMPLETED, _STOPPED)
+                Migration thread schedules the cleanup bottom half and exits
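+
+In terms of the migration region, a single save iteration follows the sequence
+documented in the UAPI: read pending_bytes, read data_offset and data_size, and
+then copy data_size bytes of device data into the migration stream. A hedged
+sketch, assuming hypothetical mig_read()/mig_copy_data() helpers that access
+the migration region (via pread() or an mmap of its data section)::
+
+    /* Sketch: mig_read()/mig_copy_data() are placeholders for accesses to the
+     * migration region; they are not existing functions. */
+    static int vfio_save_one_chunk(VFIODevice *vbasedev, QEMUFile *f)
+    {
+        uint64_t pending, data_offset, data_size;
+
+        mig_read(vbasedev, offsetof(struct vfio_device_migration_info,
+                                    pending_bytes), &pending, sizeof(pending));
+        if (!pending) {
+            return 0;       /* nothing left to save for this device */
+        }
+        mig_read(vbasedev, offsetof(struct vfio_device_migration_info,
+                                    data_offset), &data_offset, sizeof(data_offset));
+        mig_read(vbasedev, offsetof(struct vfio_device_migration_info,
+                                    data_size), &data_size, sizeof(data_size));
+
+        /* Copy data_size bytes from the data section at data_offset into the
+         * migration stream; re-reading pending_bytes signals consumption. */
+        mig_copy_data(vbasedev, f, data_offset, data_size);
+        return 1;
+    }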
+
+Live migration resume path
+--------------------------
+
+             Incoming migration calls .load_setup for each device
+                        (RESTORE_VM, _ACTIVE, _STOPPED)
+                                    |
+    For each device, .load_state is called for that device's section data
+                        (RESTORE_VM, _ACTIVE, _RESUMING)
+                                    |
+    At the end, .load_cleanup is called for each device and vCPUs are started
+                                    |
+                        (RUNNING, _NONE, _RUNNING)
+
+
+Postcopy
+========
+Postcopy migration is not supported for VFIO devices.