diff mbox series

[1/5] VFIO KABI for migration interface

Message ID 1542746383-18288-2-git-send-email-kwankhede@nvidia.com (mailing list archive)
State New, archived
Headers show
Series Add migration support for VFIO device | expand

Commit Message

Kirti Wankhede Nov. 20, 2018, 8:39 p.m. UTC
- Defined MIGRATION region type and sub-type.
- Defined VFIO device states during migration process.
- Defined vfio_device_migration_info structure which will be placed at 0th
  offset of migration region to get/set VFIO device related information.
  Defined actions and members of structure usage for each action:
    * To convey VFIO device state to be transitioned to.
    * To get pending bytes yet to be migrated for VFIO device
    * To ask driver to write data to migration region and return number of bytes
      written in the region
    * In migration resume path, user space app writes to migration region and
      communicates it to vendor driver.
    * Get bitmap of dirty pages from vendor driver from given start address

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 linux-headers/linux/vfio.h | 130 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 130 insertions(+)

Comments

Tian, Kevin Nov. 21, 2018, 12:26 a.m. UTC | #1
> From: Kirti Wankhede [mailto:kwankhede@nvidia.com]
> Sent: Wednesday, November 21, 2018 4:40 AM
> 
> - Defined MIGRATION region type and sub-type.
> - Defined VFIO device states during migration process.
> - Defined vfio_device_migration_info structure which will be placed at 0th
>   offset of migration region to get/set VFIO device related information.
>   Defined actions and members of structure usage for each action:
>     * To convey VFIO device state to be transitioned to.
>     * To get pending bytes yet to be migrated for VFIO device
>     * To ask driver to write data to migration region and return number of
> bytes
>       written in the region
>     * In migration resume path, user space app writes to migration region
> and
>       communicates it to vendor driver.
>     * Get bitmap of dirty pages from vendor driver from given start address
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  linux-headers/linux/vfio.h | 130
> +++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 130 insertions(+)
> 
> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
> index 3615a269d378..a6e45cb2cae2 100644
> --- a/linux-headers/linux/vfio.h
> +++ b/linux-headers/linux/vfio.h
> @@ -301,6 +301,10 @@ struct vfio_region_info_cap_type {
>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG	(2)
>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG	(3)
> 
> +/* Migration region type and sub-type */
> +#define VFIO_REGION_TYPE_MIGRATION	        (1 << 30)
> +#define VFIO_REGION_SUBTYPE_MIGRATION	        (1)
> +
>  /*
>   * The MSIX mappable capability informs that MSIX data of a BAR can be
> mmapped
>   * which allows direct access to non-MSIX registers which happened to be
> within
> @@ -602,6 +606,132 @@ struct vfio_device_ioeventfd {
> 
>  #define VFIO_DEVICE_IOEVENTFD		_IO(VFIO_TYPE, VFIO_BASE
> + 16)
> 
> +/**
> + * VFIO device states :
> + * VFIO User space application should set the device state to indicate
> vendor
> + * driver in which state the VFIO device should transitioned.
> + * - VFIO_DEVICE_STATE_NONE:
> + *   State when VFIO device is initialized but not yet running.
> + * - VFIO_DEVICE_STATE_RUNNING:
> + *   Transition VFIO device in running state, that is, user space application
> or
> + *   VM is active.
> + * - VFIO_DEVICE_STATE_MIGRATION_SETUP:
> + *   Transition VFIO device in migration setup state. This is used to prepare
> + *   VFIO device for migration while application or VM and vCPUs are still in
> + *   running state.
> + * - VFIO_DEVICE_STATE_MIGRATION_PRECOPY:
> + *   When VFIO user space application or VM is active and vCPUs are
> running,
> + *   transition VFIO device in pre-copy state.
> + * - VFIO_DEVICE_STATE_MIGRATION_STOPNCOPY:
> + *   When VFIO user space application or VM is stopped and vCPUs are
> halted,
> + *   transition VFIO device in stop-and-copy state.
> + * - VFIO_DEVICE_STATE_MIGRATION_SAVE_COMPLETED:
> + *   When VFIO user space application has copied data provided by vendor
> driver.
> + *   This state is used by vendor driver to clean up all software state that
> was
> + *   setup during MIGRATION_SETUP state.
> + * - VFIO_DEVICE_STATE_MIGRATION_RESUME:
> + *   Transition VFIO device to resume state, that is, start resuming VFIO
> device
> + *   when user space application or VM is not running and vCPUs are
> halted.
> + * - VFIO_DEVICE_STATE_MIGRATION_RESUME_COMPLETED:
> + *   When user space application completes iterations of providing device
> state
> + *   data, transition device in resume completed state.
> + * - VFIO_DEVICE_STATE_MIGRATION_FAILED:
> + *   Migration process failed due to some reason, transition device to
> failed
> + *   state. If migration process fails while saving at source, resume device
> at
> + *   source. If migration process fails while resuming application or VM at
> + *   destination, stop restoration at destination and resume at source.
> + * - VFIO_DEVICE_STATE_MIGRATION_CANCELLED:
> + *   User space application has cancelled migration process either for some
> + *   known reason or due to user's intervention. Transition device to
> Cancelled
> + *   state, that is, resume device state as it was during running state at
> + *   source.
> + */
> +
> +enum {
> +    VFIO_DEVICE_STATE_NONE,
> +    VFIO_DEVICE_STATE_RUNNING,
> +    VFIO_DEVICE_STATE_MIGRATION_SETUP,
> +    VFIO_DEVICE_STATE_MIGRATION_PRECOPY,
> +    VFIO_DEVICE_STATE_MIGRATION_STOPNCOPY,
> +    VFIO_DEVICE_STATE_MIGRATION_SAVE_COMPLETED,
> +    VFIO_DEVICE_STATE_MIGRATION_RESUME,
> +    VFIO_DEVICE_STATE_MIGRATION_RESUME_COMPLETED,
> +    VFIO_DEVICE_STATE_MIGRATION_FAILED,
> +    VFIO_DEVICE_STATE_MIGRATION_CANCELLED,
> +};

We discussed in KVM forum to define the interfaces around the state
itself, instead of around the live migration flow. It looks like this version
doesn't move that way?

quote the summary from Alex, which though high level but simple
enough to demonstrate the idea:

--
Here we would define "registers" for putting the device in various 
states through the migration process, for example enabling dirty logging, 
suspending the device, resuming the device, direction of data flow 
through the device state area, etc.
--

based on that we just need much fewer states, e.g. {RUNNING, 
RUNNING_DIRTYLOG, STOPPED}. Data flow direction doesn't need
to be a state; it could just be a flag in the region. Those are sufficient to 
enable vGPU live migration on Intel platform. nvidia or other vendors
may have more requirements, which could lead to addition of new
states - but again, they should be defined in a way not tied to migration
flow.

Thanks
Kevin

> +
> +/**
> + * Structure vfio_device_migration_info is placed at 0th offset of
> + * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device
> related migration
> + * information.
> + *
> + * Action Set state:
> + *      To tell vendor driver the state VFIO device should be transitioned to.
> + *      device_state [input] : User space app sends device state to vendor
> + *           driver on state change, the state to which VFIO device should be
> + *           transitioned to.
> + *
> + * Action Get pending bytes:
> + *      To get pending bytes yet to be migrated from vendor driver
> + *      pending.threshold_size [Input] : threshold of buffer in User space
> app.
> + *      pending.precopy_only [output] : pending data which must be
> migrated in
> + *          precopy phase or in stopped state, in other words - before target
> + *          user space application or VM start. In case of migration, this
> + *          indicates pending bytes to be transfered while application or VM
> or
> + *          vCPUs are active and running.
> + *      pending.compatible [output] : pending data which may be migrated
> any
> + *          time , either when application or VM is active and vCPUs are active
> + *          or when application or VM is halted and vCPUs are halted.
> + *      pending.postcopy_only [output] : pending data which must be
> migrated in
> + *           postcopy phase or in stopped state, in other words - after source
> + *           application or VM stopped and vCPUs are halted.
> + *      Sum of pending.precopy_only, pending.compatible and
> + *      pending.postcopy_only is the whole amount of pending data.
> + *
> + * Action Get buffer:
> + *      On this action, vendor driver should write data to migration region
> and
> + *      return number of bytes written in the region.
> + *      data.offset [output] : offset in the region from where data is written.
> + *      data.size [output] : number of bytes written in migration buffer by
> + *          vendor driver.
> + *
> + * Action Set buffer:
> + *      In migration resume path, user space app writes to migration region
> and
> + *      communicates it to vendor driver with this action.
> + *      data.offset [Input] : offset in the region from where data is written.
> + *      data.size [Input] : number of bytes written in migration buffer by
> + *          user space app.
> + *
> + * Action Get dirty pages bitmap:
> + *      Get bitmap of dirty pages from vendor driver from given start
> address.
> + *      dirty_pfns.start_addr [Input] : start address
> + *      dirty_pfns.total [Input] : Total pfn count from start_addr for which
> + *          dirty bitmap is requested
> + *      dirty_pfns.copied [Output] : pfn count for which dirty bitmap is
> copied
> + *          to migration region.
> + *      Vendor driver should copy the bitmap with bits set only for pages to
> be
> + *      marked dirty in migration region.
> + */
> +
> +struct vfio_device_migration_info {
> +        __u32 device_state;         /* VFIO device state */
> +        struct {
> +            __u64 precopy_only;
> +            __u64 compatible;
> +            __u64 postcopy_only;
> +            __u64 threshold_size;
> +        } pending;
> +        struct {
> +            __u64 offset;           /* offset */
> +            __u64 size;             /* size */
> +        } data;
> +        struct {
> +            __u64 start_addr;
> +            __u64 total;
> +            __u64 copied;
> +        } dirty_pfns;
> +} __attribute__((packed));
> +
>  /* -------- API for Type1 VFIO IOMMU -------- */
> 
>  /**
> --
> 2.7.0
Kirti Wankhede Nov. 21, 2018, 4:24 a.m. UTC | #2
On 11/21/2018 5:56 AM, Tian, Kevin wrote:
>> From: Kirti Wankhede [mailto:kwankhede@nvidia.com]
>> Sent: Wednesday, November 21, 2018 4:40 AM
>>
>> - Defined MIGRATION region type and sub-type.
>> - Defined VFIO device states during migration process.
>> - Defined vfio_device_migration_info structure which will be placed at 0th
>>   offset of migration region to get/set VFIO device related information.
>>   Defined actions and members of structure usage for each action:
>>     * To convey VFIO device state to be transitioned to.
>>     * To get pending bytes yet to be migrated for VFIO device
>>     * To ask driver to write data to migration region and return number of
>> bytes
>>       written in the region
>>     * In migration resume path, user space app writes to migration region
>> and
>>       communicates it to vendor driver.
>>     * Get bitmap of dirty pages from vendor driver from given start address
>>
>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>> ---
>>  linux-headers/linux/vfio.h | 130
>> +++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 130 insertions(+)
>>
>> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
>> index 3615a269d378..a6e45cb2cae2 100644
>> --- a/linux-headers/linux/vfio.h
>> +++ b/linux-headers/linux/vfio.h
>> @@ -301,6 +301,10 @@ struct vfio_region_info_cap_type {
>>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG	(2)
>>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG	(3)
>>
>> +/* Migration region type and sub-type */
>> +#define VFIO_REGION_TYPE_MIGRATION	        (1 << 30)
>> +#define VFIO_REGION_SUBTYPE_MIGRATION	        (1)
>> +
>>  /*
>>   * The MSIX mappable capability informs that MSIX data of a BAR can be
>> mmapped
>>   * which allows direct access to non-MSIX registers which happened to be
>> within
>> @@ -602,6 +606,132 @@ struct vfio_device_ioeventfd {
>>
>>  #define VFIO_DEVICE_IOEVENTFD		_IO(VFIO_TYPE, VFIO_BASE
>> + 16)
>>
>> +/**
>> + * VFIO device states :
>> + * VFIO User space application should set the device state to indicate
>> vendor
>> + * driver in which state the VFIO device should transitioned.
>> + * - VFIO_DEVICE_STATE_NONE:
>> + *   State when VFIO device is initialized but not yet running.
>> + * - VFIO_DEVICE_STATE_RUNNING:
>> + *   Transition VFIO device in running state, that is, user space application
>> or
>> + *   VM is active.
>> + * - VFIO_DEVICE_STATE_MIGRATION_SETUP:
>> + *   Transition VFIO device in migration setup state. This is used to prepare
>> + *   VFIO device for migration while application or VM and vCPUs are still in
>> + *   running state.
>> + * - VFIO_DEVICE_STATE_MIGRATION_PRECOPY:
>> + *   When VFIO user space application or VM is active and vCPUs are
>> running,
>> + *   transition VFIO device in pre-copy state.
>> + * - VFIO_DEVICE_STATE_MIGRATION_STOPNCOPY:
>> + *   When VFIO user space application or VM is stopped and vCPUs are
>> halted,
>> + *   transition VFIO device in stop-and-copy state.
>> + * - VFIO_DEVICE_STATE_MIGRATION_SAVE_COMPLETED:
>> + *   When VFIO user space application has copied data provided by vendor
>> driver.
>> + *   This state is used by vendor driver to clean up all software state that
>> was
>> + *   setup during MIGRATION_SETUP state.
>> + * - VFIO_DEVICE_STATE_MIGRATION_RESUME:
>> + *   Transition VFIO device to resume state, that is, start resuming VFIO
>> device
>> + *   when user space application or VM is not running and vCPUs are
>> halted.
>> + * - VFIO_DEVICE_STATE_MIGRATION_RESUME_COMPLETED:
>> + *   When user space application completes iterations of providing device
>> state
>> + *   data, transition device in resume completed state.
>> + * - VFIO_DEVICE_STATE_MIGRATION_FAILED:
>> + *   Migration process failed due to some reason, transition device to
>> failed
>> + *   state. If migration process fails while saving at source, resume device
>> at
>> + *   source. If migration process fails while resuming application or VM at
>> + *   destination, stop restoration at destination and resume at source.
>> + * - VFIO_DEVICE_STATE_MIGRATION_CANCELLED:
>> + *   User space application has cancelled migration process either for some
>> + *   known reason or due to user's intervention. Transition device to
>> Cancelled
>> + *   state, that is, resume device state as it was during running state at
>> + *   source.
>> + */
>> +
>> +enum {
>> +    VFIO_DEVICE_STATE_NONE,
>> +    VFIO_DEVICE_STATE_RUNNING,
>> +    VFIO_DEVICE_STATE_MIGRATION_SETUP,
>> +    VFIO_DEVICE_STATE_MIGRATION_PRECOPY,
>> +    VFIO_DEVICE_STATE_MIGRATION_STOPNCOPY,
>> +    VFIO_DEVICE_STATE_MIGRATION_SAVE_COMPLETED,
>> +    VFIO_DEVICE_STATE_MIGRATION_RESUME,
>> +    VFIO_DEVICE_STATE_MIGRATION_RESUME_COMPLETED,
>> +    VFIO_DEVICE_STATE_MIGRATION_FAILED,
>> +    VFIO_DEVICE_STATE_MIGRATION_CANCELLED,
>> +};
> 
> We discussed in KVM forum to define the interfaces around the state
> itself, instead of around live migration flow. Looks this version doesn't 
> move that way?
> 

This patch series is along the lines of the discussion we had at the KVM forum.

> quote the summary from Alex, which though high level but simple
> enough to demonstrate the idea:
> 
> --
> Here we would define "registers" for putting the device in various 
> states through the migration process, for example enabling dirty logging, 
> suspending the device, resuming the device, direction of data flow 
> through the device state area, etc.
> --
> 

Defined a packed structure here to map it at 0th offset of migration
region so that offset can be calculated by offset_of(), you may call
same as register definitions.


> based on that we just need much fewer states, e.g. {RUNNING, 
> RUNNING_DIRTYLOG, STOPPED}. data flow direction doesn't need
> to be a state. could just a flag in the region.

A flag is not preferred here, since multiple flags can be set at a time.
What is needed here is a finite set of states, each with a proper definition
of what that device state means to the driver and the user space application.
For Intel or others who don't need other states can ignore the state in
driver by taking no action on pwrite on .device_state offset. For
example for Intel driver could only take action on state change to
VFIO_DEVICE_STATE_RUNNING and VFIO_DEVICE_STATE_MIGRATION_STOPNCOPY

I think dirty page logging is not a VFIO device's state.
.log_sync of MemoryListener is called during both :
- PRECOPY phase i.e. while vCPUs are still running and
- during STOPNCOPY phase i.e. when vCPUs are stopped.


> Those are sufficient to 
> enable vGPU live migration on Intel platform. nvidia or other vendors
> may have more requirements, which could lead to addition of new
> states - but again, they should be defined in a way not tied to migration
> flow.
> 

I have tried to explain the intent of each state. Please go through the
comments above.
Also please take a look at other patches, mainly
0003-Add-migration-functions-for-VFIO-devices.patch to understand why
these states are required.

Thanks,
Kirti

> Thanks
> Kevin
> 
>> +
>> +/**
>> + * Structure vfio_device_migration_info is placed at 0th offset of
>> + * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device
>> related migration
>> + * information.
>> + *
>> + * Action Set state:
>> + *      To tell vendor driver the state VFIO device should be transitioned to.
>> + *      device_state [input] : User space app sends device state to vendor
>> + *           driver on state change, the state to which VFIO device should be
>> + *           transitioned to.
>> + *
>> + * Action Get pending bytes:
>> + *      To get pending bytes yet to be migrated from vendor driver
>> + *      pending.threshold_size [Input] : threshold of buffer in User space
>> app.
>> + *      pending.precopy_only [output] : pending data which must be
>> migrated in
>> + *          precopy phase or in stopped state, in other words - before target
>> + *          user space application or VM start. In case of migration, this
>> + *          indicates pending bytes to be transfered while application or VM
>> or
>> + *          vCPUs are active and running.
>> + *      pending.compatible [output] : pending data which may be migrated
>> any
>> + *          time , either when application or VM is active and vCPUs are active
>> + *          or when application or VM is halted and vCPUs are halted.
>> + *      pending.postcopy_only [output] : pending data which must be
>> migrated in
>> + *           postcopy phase or in stopped state, in other words - after source
>> + *           application or VM stopped and vCPUs are halted.
>> + *      Sum of pending.precopy_only, pending.compatible and
>> + *      pending.postcopy_only is the whole amount of pending data.
>> + *
>> + * Action Get buffer:
>> + *      On this action, vendor driver should write data to migration region
>> and
>> + *      return number of bytes written in the region.
>> + *      data.offset [output] : offset in the region from where data is written.
>> + *      data.size [output] : number of bytes written in migration buffer by
>> + *          vendor driver.
>> + *
>> + * Action Set buffer:
>> + *      In migration resume path, user space app writes to migration region
>> and
>> + *      communicates it to vendor driver with this action.
>> + *      data.offset [Input] : offset in the region from where data is written.
>> + *      data.size [Input] : number of bytes written in migration buffer by
>> + *          user space app.
>> + *
>> + * Action Get dirty pages bitmap:
>> + *      Get bitmap of dirty pages from vendor driver from given start
>> address.
>> + *      dirty_pfns.start_addr [Input] : start address
>> + *      dirty_pfns.total [Input] : Total pfn count from start_addr for which
>> + *          dirty bitmap is requested
>> + *      dirty_pfns.copied [Output] : pfn count for which dirty bitmap is
>> copied
>> + *          to migration region.
>> + *      Vendor driver should copy the bitmap with bits set only for pages to
>> be
>> + *      marked dirty in migration region.
>> + */
>> +
>> +struct vfio_device_migration_info {
>> +        __u32 device_state;         /* VFIO device state */
>> +        struct {
>> +            __u64 precopy_only;
>> +            __u64 compatible;
>> +            __u64 postcopy_only;
>> +            __u64 threshold_size;
>> +        } pending;
>> +        struct {
>> +            __u64 offset;           /* offset */
>> +            __u64 size;             /* size */
>> +        } data;
>> +        struct {
>> +            __u64 start_addr;
>> +            __u64 total;
>> +            __u64 copied;
>> +        } dirty_pfns;
>> +} __attribute__((packed));
>> +
>>  /* -------- API for Type1 VFIO IOMMU -------- */
>>
>>  /**
>> --
>> 2.7.0
>
Tian, Kevin Nov. 21, 2018, 6:13 a.m. UTC | #3
> From: Kirti Wankhede [mailto:kwankhede@nvidia.com]
> Sent: Wednesday, November 21, 2018 12:24 PM
> 
> 
> On 11/21/2018 5:56 AM, Tian, Kevin wrote:
> >> From: Kirti Wankhede [mailto:kwankhede@nvidia.com]
> >> Sent: Wednesday, November 21, 2018 4:40 AM
> >>
> >> - Defined MIGRATION region type and sub-type.
> >> - Defined VFIO device states during migration process.
> >> - Defined vfio_device_migration_info structure which will be placed at
> 0th
> >>   offset of migration region to get/set VFIO device related information.
> >>   Defined actions and members of structure usage for each action:
> >>     * To convey VFIO device state to be transitioned to.
> >>     * To get pending bytes yet to be migrated for VFIO device
> >>     * To ask driver to write data to migration region and return number of
> >> bytes
> >>       written in the region
> >>     * In migration resume path, user space app writes to migration region
> >> and
> >>       communicates it to vendor driver.
> >>     * Get bitmap of dirty pages from vendor driver from given start
> address
> >>
> >> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> >> Reviewed-by: Neo Jia <cjia@nvidia.com>
> >> ---
> >>  linux-headers/linux/vfio.h | 130
> >> +++++++++++++++++++++++++++++++++++++++++++++
> >>  1 file changed, 130 insertions(+)
> >>
> >> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
> >> index 3615a269d378..a6e45cb2cae2 100644
> >> --- a/linux-headers/linux/vfio.h
> >> +++ b/linux-headers/linux/vfio.h
> >> @@ -301,6 +301,10 @@ struct vfio_region_info_cap_type {
> >>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG	(2)
> >>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG	(3)
> >>
> >> +/* Migration region type and sub-type */
> >> +#define VFIO_REGION_TYPE_MIGRATION	        (1 << 30)
> >> +#define VFIO_REGION_SUBTYPE_MIGRATION	        (1)
> >> +
> >>  /*
> >>   * The MSIX mappable capability informs that MSIX data of a BAR can be
> >> mmapped
> >>   * which allows direct access to non-MSIX registers which happened to
> be
> >> within
> >> @@ -602,6 +606,132 @@ struct vfio_device_ioeventfd {
> >>
> >>  #define VFIO_DEVICE_IOEVENTFD		_IO(VFIO_TYPE, VFIO_BASE
> >> + 16)
> >>
> >> +/**
> >> + * VFIO device states :
> >> + * VFIO User space application should set the device state to indicate
> >> vendor
> >> + * driver in which state the VFIO device should transitioned.
> >> + * - VFIO_DEVICE_STATE_NONE:
> >> + *   State when VFIO device is initialized but not yet running.
> >> + * - VFIO_DEVICE_STATE_RUNNING:
> >> + *   Transition VFIO device in running state, that is, user space
> application
> >> or
> >> + *   VM is active.
> >> + * - VFIO_DEVICE_STATE_MIGRATION_SETUP:
> >> + *   Transition VFIO device in migration setup state. This is used to
> prepare
> >> + *   VFIO device for migration while application or VM and vCPUs are
> still in
> >> + *   running state.
> >> + * - VFIO_DEVICE_STATE_MIGRATION_PRECOPY:
> >> + *   When VFIO user space application or VM is active and vCPUs are
> >> running,
> >> + *   transition VFIO device in pre-copy state.
> >> + * - VFIO_DEVICE_STATE_MIGRATION_STOPNCOPY:
> >> + *   When VFIO user space application or VM is stopped and vCPUs are
> >> halted,
> >> + *   transition VFIO device in stop-and-copy state.
> >> + * - VFIO_DEVICE_STATE_MIGRATION_SAVE_COMPLETED:
> >> + *   When VFIO user space application has copied data provided by
> vendor
> >> driver.
> >> + *   This state is used by vendor driver to clean up all software state that
> >> was
> >> + *   setup during MIGRATION_SETUP state.
> >> + * - VFIO_DEVICE_STATE_MIGRATION_RESUME:
> >> + *   Transition VFIO device to resume state, that is, start resuming VFIO
> >> device
> >> + *   when user space application or VM is not running and vCPUs are
> >> halted.
> >> + * - VFIO_DEVICE_STATE_MIGRATION_RESUME_COMPLETED:
> >> + *   When user space application completes iterations of providing
> device
> >> state
> >> + *   data, transition device in resume completed state.
> >> + * - VFIO_DEVICE_STATE_MIGRATION_FAILED:
> >> + *   Migration process failed due to some reason, transition device to
> >> failed
> >> + *   state. If migration process fails while saving at source, resume
> device
> >> at
> >> + *   source. If migration process fails while resuming application or VM
> at
> >> + *   destination, stop restoration at destination and resume at source.
> >> + * - VFIO_DEVICE_STATE_MIGRATION_CANCELLED:
> >> + *   User space application has cancelled migration process either for
> some
> >> + *   known reason or due to user's intervention. Transition device to
> >> Cancelled
> >> + *   state, that is, resume device state as it was during running state at
> >> + *   source.
> >> + */
> >> +
> >> +enum {
> >> +    VFIO_DEVICE_STATE_NONE,
> >> +    VFIO_DEVICE_STATE_RUNNING,
> >> +    VFIO_DEVICE_STATE_MIGRATION_SETUP,
> >> +    VFIO_DEVICE_STATE_MIGRATION_PRECOPY,
> >> +    VFIO_DEVICE_STATE_MIGRATION_STOPNCOPY,
> >> +    VFIO_DEVICE_STATE_MIGRATION_SAVE_COMPLETED,
> >> +    VFIO_DEVICE_STATE_MIGRATION_RESUME,
> >> +    VFIO_DEVICE_STATE_MIGRATION_RESUME_COMPLETED,
> >> +    VFIO_DEVICE_STATE_MIGRATION_FAILED,
> >> +    VFIO_DEVICE_STATE_MIGRATION_CANCELLED,
> >> +};
> >
> > We discussed in KVM forum to define the interfaces around the state
> > itself, instead of around live migration flow. Looks this version doesn't
> > move that way?
> >
> 
> This is patch series is along the discussion we had at KVM forum.
> 
> > quote the summary from Alex, which though high level but simple
> > enough to demonstrate the idea:
> >
> > --
> > Here we would define "registers" for putting the device in various
> > states through the migration process, for example enabling dirty logging,
> > suspending the device, resuming the device, direction of data flow
> > through the device state area, etc.
> > --
> >
> 
> Defined a packed structure here to map it at 0th offset of migration
> region so that offset can be calculated by offset_of(), you may call
> same as register definitions.

yes, this part is a good change. My comment was around state definition
itself.

> 
> 
> > based on that we just need much fewer states, e.g. {RUNNING,
> > RUNNING_DIRTYLOG, STOPPED}. data flow direction doesn't need
> > to be a state. could just a flag in the region.
> 
> Flag is not preferred here, multiple flags can be set at a time.
> Here need finite states with its proper definition what that device
> state means to driver and user space application.
> For Intel or others who don't need other states can ignore the state in
> driver by taking no action on pwrite on .device_state offset. For
> example for Intel driver could only take action on state change to
> VFIO_DEVICE_STATE_RUNNING and
> VFIO_DEVICE_STATE_MIGRATION_STOPNCOPY
> 
> I think dirty page logging is not a VFIO device's state.
> .log_sync of MemoryListener is called during both :
> - PRECOPY phase i.e. while vCPUs are still running and
> - during STOPNCOPY phase i.e. when vCPUs are stopped.
> 
> 
> > Those are sufficient to
> > enable vGPU live migration on Intel platform. nvidia or other vendors
> > may have more requirements, which could lead to addition of new
> > states - but again, they should be defined in a way not tied to migration
> > flow.
> >
> 
> I had tried to explain the intend of each state. Please go through the
> comments above.
> Also please take a look at other patches, mainly
> 0003-Add-migration-functions-for-VFIO-devices.patch to understand why
> these states are required.
> 

I looked at the explanations in this patch, but still didn't get the intention, e.g.:

+ * - VFIO_DEVICE_STATE_MIGRATION_SETUP:
+ *   Transition VFIO device in migration setup state. This is used to prepare
+ *   VFIO device for migration while application or VM and vCPUs are still in
+ *   running state.

what preparation is actually required? any example?

+ * - VFIO_DEVICE_STATE_MIGRATION_PRECOPY:
+ *   When VFIO user space application or VM is active and vCPUs are running,
+ *   transition VFIO device in pre-copy state.

why does device driver need know this stage? in precopy phase, the VM
is still running. Just dirty page tracking is in progress. the dirty bitmap could
be retrieved through its own action interface.

you have code to demonstrate how those states are transitioned in Qemu,
but you didn't show evidence why those states are necessary in device side,
which leads to the puzzle whether the definition is over-killed and limiting.

the flow in my mind is like below:

1. an interface to turn on/off dirty page tracking on VFIO device:
	* vendor driver can do whatever required to enable device specific
dirty page tracking mechanism here
	* device state is not changed here. still in running state

2. an interface to get dirty page bitmap

3. an interface to start/stop device activity
	* the effect of stop is to stop and drain in-the-fly device activities and
make device state ready for dump-out. vendor driver can do specific preparation 
here
	* the effect of start is to check validity of device state and then resume
device activities. again, vendor driver can do specific cleanup/preparation here

4. an interface to save/restore device state
	* should happen when device is stopped
	* of course there is still an open how to check state compatibility as
Alex pointed earlier

this way above interfaces are not tied to migration. other usages which are
interested in device state could also work (e.g. snapshot). If it doesn't work
with your device, it's better that you can elaborate your requirement with more
concrete examples. Then people will judge the necessity of a more complex
interface as proposed in this series...

Thanks
Kevin
Pierre Morel Nov. 21, 2018, 5:26 p.m. UTC | #4
On 20/11/2018 21:39, Kirti Wankhede wrote:
> - Defined MIGRATION region type and sub-type.
> - Defined VFIO device states during migration process.
> - Defined vfio_device_migration_info structure which will be placed at 0th
>    offset of migration region to get/set VFIO device related information.
>    Defined actions and members of structure usage for each action:
>      * To convey VFIO device state to be transitioned to.
>      * To get pending bytes yet to be migrated for VFIO device
>      * To ask driver to write data to migration region and return number of bytes
>        written in the region
>      * In migration resume path, user space app writes to migration region and
>        communicates it to vendor driver.
>      * Get bitmap of dirty pages from vendor driver from given start address
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>   linux-headers/linux/vfio.h | 130 +++++++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 130 insertions(+)
> 
> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
> index 3615a269d378..a6e45cb2cae2 100644
> --- a/linux-headers/linux/vfio.h
> +++ b/linux-headers/linux/vfio.h
> @@ -301,6 +301,10 @@ struct vfio_region_info_cap_type {
>   #define VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG	(2)
>   #define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG	(3)
> 
> +/* Migration region type and sub-type */
> +#define VFIO_REGION_TYPE_MIGRATION	        (1 << 30)
> +#define VFIO_REGION_SUBTYPE_MIGRATION	        (1)
> +
>   /*
>    * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
>    * which allows direct access to non-MSIX registers which happened to be within
> @@ -602,6 +606,132 @@ struct vfio_device_ioeventfd {
> 
>   #define VFIO_DEVICE_IOEVENTFD		_IO(VFIO_TYPE, VFIO_BASE + 16)
> 
> +/**
> + * VFIO device states :
> + * VFIO User space application should set the device state to indicate vendor
> + * driver in which state the VFIO device should transitioned.
> + * - VFIO_DEVICE_STATE_NONE:
> + *   State when VFIO device is initialized but not yet running.
> + * - VFIO_DEVICE_STATE_RUNNING:
> + *   Transition VFIO device in running state, that is, user space application or
> + *   VM is active.
> + * - VFIO_DEVICE_STATE_MIGRATION_SETUP:
> + *   Transition VFIO device in migration setup state. This is used to prepare
> + *   VFIO device for migration while application or VM and vCPUs are still in
> + *   running state.
> + * - VFIO_DEVICE_STATE_MIGRATION_PRECOPY:
> + *   When VFIO user space application or VM is active and vCPUs are running,
> + *   transition VFIO device in pre-copy state.
> + * - VFIO_DEVICE_STATE_MIGRATION_STOPNCOPY:
> + *   When VFIO user space application or VM is stopped and vCPUs are halted,
> + *   transition VFIO device in stop-and-copy state.
> + * - VFIO_DEVICE_STATE_MIGRATION_SAVE_COMPLETED:
> + *   When VFIO user space application has copied data provided by vendor driver.
> + *   This state is used by vendor driver to clean up all software state that was
> + *   setup during MIGRATION_SETUP state.
> + * - VFIO_DEVICE_STATE_MIGRATION_RESUME:
> + *   Transition VFIO device to resume state, that is, start resuming VFIO device
> + *   when user space application or VM is not running and vCPUs are halted.
> + * - VFIO_DEVICE_STATE_MIGRATION_RESUME_COMPLETED:
> + *   When user space application completes iterations of providing device state
> + *   data, transition device in resume completed state.
> + * - VFIO_DEVICE_STATE_MIGRATION_FAILED:
> + *   Migration process failed due to some reason, transition device to failed
> + *   state. If migration process fails while saving at source, resume device at
> + *   source. If migration process fails while resuming application or VM at
> + *   destination, stop restoration at destination and resume at source.
> + * - VFIO_DEVICE_STATE_MIGRATION_CANCELLED:
> + *   User space application has cancelled migration process either for some
> + *   known reason or due to user's intervention. Transition device to Cancelled
> + *   state, that is, resume device state as it was during running state at
> + *   source.
> + */
> +
> +enum {
> +    VFIO_DEVICE_STATE_NONE,
> +    VFIO_DEVICE_STATE_RUNNING,
> +    VFIO_DEVICE_STATE_MIGRATION_SETUP,
> +    VFIO_DEVICE_STATE_MIGRATION_PRECOPY,
> +    VFIO_DEVICE_STATE_MIGRATION_STOPNCOPY,
> +    VFIO_DEVICE_STATE_MIGRATION_SAVE_COMPLETED,
> +    VFIO_DEVICE_STATE_MIGRATION_RESUME,
> +    VFIO_DEVICE_STATE_MIGRATION_RESUME_COMPLETED,
> +    VFIO_DEVICE_STATE_MIGRATION_FAILED,
> +    VFIO_DEVICE_STATE_MIGRATION_CANCELLED,
> +};
> +
> +/**
> + * Structure vfio_device_migration_info is placed at 0th offset of
> + * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device related migration
> + * information.
> + *
> + * Action Set state:
> + *      To tell vendor driver the state VFIO device should be transitioned to.
> + *      device_state [input] : User space app sends device state to vendor
> + *           driver on state change, the state to which VFIO device should be
> + *           transitioned to.
> + *
> + * Action Get pending bytes:
> + *      To get pending bytes yet to be migrated from vendor driver
> + *      pending.threshold_size [Input] : threshold of buffer in User space app.
> + *      pending.precopy_only [output] : pending data which must be migrated in
> + *          precopy phase or in stopped state, in other words - before target
> + *          user space application or VM start. In case of migration, this
> + *          indicates pending bytes to be transfered while application or VM or
> + *          vCPUs are active and running.
> + *      pending.compatible [output] : pending data which may be migrated any
> + *          time , either when application or VM is active and vCPUs are active
> + *          or when application or VM is halted and vCPUs are halted.
> + *      pending.postcopy_only [output] : pending data which must be migrated in
> + *           postcopy phase or in stopped state, in other words - after source
> + *           application or VM stopped and vCPUs are halted.
> + *      Sum of pending.precopy_only, pending.compatible and
> + *      pending.postcopy_only is the whole amount of pending data.
> + *
> + * Action Get buffer:
> + *      On this action, vendor driver should write data to migration region and
> + *      return number of bytes written in the region.
> + *      data.offset [output] : offset in the region from where data is written.
> + *      data.size [output] : number of bytes written in migration buffer by
> + *          vendor driver.
> + *
> + * Action Set buffer:
> + *      In migration resume path, user space app writes to migration region and
> + *      communicates it to vendor driver with this action.
> + *      data.offset [Input] : offset in the region from where data is written.
> + *      data.size [Input] : number of bytes written in migration buffer by
> + *          user space app.
> + *
> + * Action Get dirty pages bitmap:
> + *      Get bitmap of dirty pages from vendor driver from given start address.
> + *      dirty_pfns.start_addr [Input] : start address
> + *      dirty_pfns.total [Input] : Total pfn count from start_addr for which
> + *          dirty bitmap is requested
> + *      dirty_pfns.copied [Output] : pfn count for which dirty bitmap is copied
> + *          to migration region.
> + *      Vendor driver should copy the bitmap with bits set only for pages to be
> + *      marked dirty in migration region.
> + */
> +

Hi Kirti,

I am very interested in your work, thanks for it.
I just begin to look at it.

> +struct vfio_device_migration_info {
> +        __u32 device_state;         /* VFIO device state */

May be it is a little soon to care about this but wouldn't the __u32 
here cause a problem, even with packed (or due to packed), for different 
architectures?
Wouldn't it be better to use a __u64 for the state and keep all 
naturally aligned?

Regards,
Pierre


> +        struct {
> +            __u64 precopy_only;
> +            __u64 compatible;
> +            __u64 postcopy_only;
> +            __u64 threshold_size;
> +        } pending;
> +        struct {
> +            __u64 offset;           /* offset */
> +            __u64 size;             /* size */
> +        } data;
> +        struct {
> +            __u64 start_addr;
> +            __u64 total;
> +            __u64 copied;
> +        } dirty_pfns;
> +} __attribute__((packed));



> +
>   /* -------- API for Type1 VFIO IOMMU -------- */
> 
>   /**
>
Dr. David Alan Gilbert Nov. 22, 2018, 6:54 p.m. UTC | #5
* Kirti Wankhede (kwankhede@nvidia.com) wrote:
> - Defined MIGRATION region type and sub-type.
> - Defined VFIO device states during migration process.
> - Defined vfio_device_migration_info structure which will be placed at 0th
>   offset of migration region to get/set VFIO device related information.
>   Defined actions and members of structure usage for each action:
>     * To convey VFIO device state to be transitioned to.
>     * To get pending bytes yet to be migrated for VFIO device
>     * To ask driver to write data to migration region and return number of bytes
>       written in the region
>     * In migration resume path, user space app writes to migration region and
>       communicates it to vendor driver.
>     * Get bitmap of dirty pages from vendor driver from given start address
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>

<snip>

> + * Action Get buffer:
> + *      On this action, vendor driver should write data to migration region and
> + *      return number of bytes written in the region.
> + *      data.offset [output] : offset in the region from where data is written.
> + *      data.size [output] : number of bytes written in migration buffer by
> + *          vendor driver.

<snip>

> + */
> +
> +struct vfio_device_migration_info {
> +        __u32 device_state;         /* VFIO device state */
> +        struct {
> +            __u64 precopy_only;
> +            __u64 compatible;
> +            __u64 postcopy_only;
> +            __u64 threshold_size;
> +        } pending;
> +        struct {
> +            __u64 offset;           /* offset */
> +            __u64 size;             /* size */
> +        } data;

I'm curious how the offsets/size work; how does the 
kernel driver know the maximum size of state it's allowed to write?
Why would it pick a non-zero offset into the output region?

Without having dug further these feel like i/o rather than just output;
i.e. the calling process says 'put it at that offset and you've got size
bytes' and the kernel replies with 'I did put it at offset and I wrote
only this size bytes'

Dave

> +        struct {
> +            __u64 start_addr;
> +            __u64 total;
> +            __u64 copied;
> +        } dirty_pfns;
> +} __attribute__((packed));
> +
>  /* -------- API for Type1 VFIO IOMMU -------- */
>  
>  /**
> -- 
> 2.7.0
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Kirti Wankhede Nov. 22, 2018, 8:01 p.m. UTC | #6
On 11/21/2018 11:43 AM, Tian, Kevin wrote:
>> From: Kirti Wankhede [mailto:kwankhede@nvidia.com]
>> Sent: Wednesday, November 21, 2018 12:24 PM
>>
>>
>> On 11/21/2018 5:56 AM, Tian, Kevin wrote:
>>>> From: Kirti Wankhede [mailto:kwankhede@nvidia.com]
>>>> Sent: Wednesday, November 21, 2018 4:40 AM
>>>>
>>>> - Defined MIGRATION region type and sub-type.
>>>> - Defined VFIO device states during migration process.
>>>> - Defined vfio_device_migration_info structure which will be placed at
>> 0th
>>>>   offset of migration region to get/set VFIO device related information.
>>>>   Defined actions and members of structure usage for each action:
>>>>     * To convey VFIO device state to be transitioned to.
>>>>     * To get pending bytes yet to be migrated for VFIO device
>>>>     * To ask driver to write data to migration region and return number of
>>>> bytes
>>>>       written in the region
>>>>     * In migration resume path, user space app writes to migration region
>>>> and
>>>>       communicates it to vendor driver.
>>>>     * Get bitmap of dirty pages from vendor driver from given start
>> address
>>>>
>>>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>>>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>>>> ---
>>>>  linux-headers/linux/vfio.h | 130
>>>> +++++++++++++++++++++++++++++++++++++++++++++
>>>>  1 file changed, 130 insertions(+)
>>>>
>>>> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
>>>> index 3615a269d378..a6e45cb2cae2 100644
>>>> --- a/linux-headers/linux/vfio.h
>>>> +++ b/linux-headers/linux/vfio.h
>>>> @@ -301,6 +301,10 @@ struct vfio_region_info_cap_type {
>>>>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG	(2)
>>>>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG	(3)
>>>>
>>>> +/* Migration region type and sub-type */
>>>> +#define VFIO_REGION_TYPE_MIGRATION	        (1 << 30)
>>>> +#define VFIO_REGION_SUBTYPE_MIGRATION	        (1)
>>>> +
>>>>  /*
>>>>   * The MSIX mappable capability informs that MSIX data of a BAR can be
>>>> mmapped
>>>>   * which allows direct access to non-MSIX registers which happened to
>> be
>>>> within
>>>> @@ -602,6 +606,132 @@ struct vfio_device_ioeventfd {
>>>>
>>>>  #define VFIO_DEVICE_IOEVENTFD		_IO(VFIO_TYPE, VFIO_BASE
>>>> + 16)
>>>>
>>>> +/**
>>>> + * VFIO device states :
>>>> + * VFIO User space application should set the device state to indicate
>>>> vendor
>>>> + * driver in which state the VFIO device should transitioned.
>>>> + * - VFIO_DEVICE_STATE_NONE:
>>>> + *   State when VFIO device is initialized but not yet running.
>>>> + * - VFIO_DEVICE_STATE_RUNNING:
>>>> + *   Transition VFIO device in running state, that is, user space
>> application
>>>> or
>>>> + *   VM is active.
>>>> + * - VFIO_DEVICE_STATE_MIGRATION_SETUP:
>>>> + *   Transition VFIO device in migration setup state. This is used to
>> prepare
>>>> + *   VFIO device for migration while application or VM and vCPUs are
>> still in
>>>> + *   running state.
>>>> + * - VFIO_DEVICE_STATE_MIGRATION_PRECOPY:
>>>> + *   When VFIO user space application or VM is active and vCPUs are
>>>> running,
>>>> + *   transition VFIO device in pre-copy state.
>>>> + * - VFIO_DEVICE_STATE_MIGRATION_STOPNCOPY:
>>>> + *   When VFIO user space application or VM is stopped and vCPUs are
>>>> halted,
>>>> + *   transition VFIO device in stop-and-copy state.
>>>> + * - VFIO_DEVICE_STATE_MIGRATION_SAVE_COMPLETED:
>>>> + *   When VFIO user space application has copied data provided by
>> vendor
>>>> driver.
>>>> + *   This state is used by vendor driver to clean up all software state that
>>>> was
>>>> + *   setup during MIGRATION_SETUP state.
>>>> + * - VFIO_DEVICE_STATE_MIGRATION_RESUME:
>>>> + *   Transition VFIO device to resume state, that is, start resuming VFIO
>>>> device
>>>> + *   when user space application or VM is not running and vCPUs are
>>>> halted.
>>>> + * - VFIO_DEVICE_STATE_MIGRATION_RESUME_COMPLETED:
>>>> + *   When user space application completes iterations of providing
>> device
>>>> state
>>>> + *   data, transition device in resume completed state.
>>>> + * - VFIO_DEVICE_STATE_MIGRATION_FAILED:
>>>> + *   Migration process failed due to some reason, transition device to
>>>> failed
>>>> + *   state. If migration process fails while saving at source, resume
>> device
>>>> at
>>>> + *   source. If migration process fails while resuming application or VM
>> at
>>>> + *   destination, stop restoration at destination and resume at source.
>>>> + * - VFIO_DEVICE_STATE_MIGRATION_CANCELLED:
>>>> + *   User space application has cancelled migration process either for
>> some
>>>> + *   known reason or due to user's intervention. Transition device to
>>>> Cancelled
>>>> + *   state, that is, resume device state as it was during running state at
>>>> + *   source.
>>>> + */
>>>> +
>>>> +enum {
>>>> +    VFIO_DEVICE_STATE_NONE,
>>>> +    VFIO_DEVICE_STATE_RUNNING,
>>>> +    VFIO_DEVICE_STATE_MIGRATION_SETUP,
>>>> +    VFIO_DEVICE_STATE_MIGRATION_PRECOPY,
>>>> +    VFIO_DEVICE_STATE_MIGRATION_STOPNCOPY,
>>>> +    VFIO_DEVICE_STATE_MIGRATION_SAVE_COMPLETED,
>>>> +    VFIO_DEVICE_STATE_MIGRATION_RESUME,
>>>> +    VFIO_DEVICE_STATE_MIGRATION_RESUME_COMPLETED,
>>>> +    VFIO_DEVICE_STATE_MIGRATION_FAILED,
>>>> +    VFIO_DEVICE_STATE_MIGRATION_CANCELLED,
>>>> +};
>>>
>>> We discussed in KVM forum to define the interfaces around the state
>>> itself, instead of around live migration flow. Looks this version doesn't
>>> move that way?
>>>
>>
>> This is patch series is along the discussion we had at KVM forum.
>>
>>> quote the summary from Alex, which though high level but simple
>>> enough to demonstrate the idea:
>>>
>>> --
>>> Here we would define "registers" for putting the device in various
>>> states through the migration process, for example enabling dirty logging,
>>> suspending the device, resuming the device, direction of data flow
>>> through the device state area, etc.
>>> --
>>>
>>
>> Defined a packed structure here to map it at 0th offset of migration
>> region so that offset can be calculated by offset_of(), you may call
>> same as register definitions.
> 
> yes, this part is a good change. My comment was around state definition
> itself.
> 
>>
>>
>>> based on that we just need much fewer states, e.g. {RUNNING,
>>> RUNNING_DIRTYLOG, STOPPED}. data flow direction doesn't need
>>> to be a state. could just a flag in the region.
>>
>> Flag is not preferred here, multiple flags can be set at a time.
>> Here need finite states with its proper definition what that device
>> state means to driver and user space application.
>> For Intel or others who don't need other states can ignore the state in
>> driver by taking no action on pwrite on .device_state offset. For
>> example for Intel driver could only take action on state change to
>> VFIO_DEVICE_STATE_RUNNING and
>> VFIO_DEVICE_STATE_MIGRATION_STOPNCOPY
>>
>> I think dirty page logging is not a VFIO device's state.
>> .log_sync of MemoryListener is called during both :
>> - PRECOPY phase i.e. while vCPUs are still running and
>> - during STOPNCOPY phase i.e. when vCPUs are stopped.
>>
>>
>>> Those are sufficient to
>>> enable vGPU live migration on Intel platform. nvidia or other vendors
>>> may have more requirements, which could lead to addition of new
>>> states - but again, they should be defined in a way not tied to migration
>>> flow.
>>>
>>
>> I had tried to explain the intend of each state. Please go through the
>> comments above.
>> Also please take a look at other patches, mainly
>> 0003-Add-migration-functions-for-VFIO-devices.patch to understand why
>> these states are required.
>>
> 
> I looked at the explanations in this patch, but still didn't get the intention, e.g.:
> 
> + * - VFIO_DEVICE_STATE_MIGRATION_SETUP:
> + *   Transition VFIO device in migration setup state. This is used to prepare
> + *   VFIO device for migration while application or VM and vCPUs are still in
> + *   running state.
> 
> what preparation is actually required? any example?

Each vendor driver can have different requirements as to how to prepare
for migration. For example, this phase can be used to allocate buffer
which can be mapped to MIGRATION region's data part, and allocating
staging buffer. The driver might need to spawn a thread which would start
collecting data that needs to be sent during the pre-copy phase.

> 
> + * - VFIO_DEVICE_STATE_MIGRATION_PRECOPY:
> + *   When VFIO user space application or VM is active and vCPUs are running,
> + *   transition VFIO device in pre-copy state.
> 
> why does device driver need know this stage? in precopy phase, the VM
> is still running. Just dirty page tracking is in progress. the dirty bitmap could
> be retrieved through its own action interface.
> 

All mdev devices are not similar. Pre-copy phase is not just about dirty
page tracking. Devices which have on-device memory could transfer
data from that memory during the pre-copy phase. For example, NVIDIA GPU has
its own FB, so need to start sending FB data during pre-copy phase and
then during stop and copy phase send data from FB which is marked dirty
after that was copied in pre-copy phase. That helps to reduce total down
time.

> you have code to demonstrate how those states are transitioned in Qemu,
> but you didn't show evidence why those states are necessary in device side,
> which leads to the puzzle whether the definition is over-killed and limiting.
> 

I'm trying to keep these interfaces generic for VFIO and mdev devices.
It's difficult to define what a vendor driver should do for each state,
each vendor driver have their own requirements. Vendor drivers should
decide whether to take any action on state transition or not.

> the flow in my mind is like below:
> 
> 1. an interface to turn on/off dirty page tracking on VFIO device:
> 	* vendor driver can do whatever required to enable device specific
> dirty page tracking mechanism here
> 	* device state is not changed here. still in running state
> 
> 2. an interface to get dirty page bitmap
>

I don't think there should be on/off interface for dirty page tracking.
If there is a write access on dirty_pfns.start_addr and dirty_pfns.total
and device_state >=VFIO_DEVICE_STATE_MIGRATION_SETUP && device_state <=
VFIO_DEVICE_STATE_MIGRATION_STOPNCOPY then dirty page tracking has
started, so return dirty page bitmap in data part of migration region.


> 3. an interface to start/stop device activity
> 	* the effect of stop is to stop and drain in-the-fly device activities and
> make device state ready for dump-out. vendor driver can do specific preparation 
> here

VFIO_DEVICE_STATE_MIGRATION_STOPNCOPY is to stop the device, but as I
mentioned above some vendor driver might have to do preparation before
pre-copy phase starts.

> 	* the effect of start is to check validity of device state and then resume
> device activities. again, vendor driver can do specific cleanup/preparation here
>

That is VFIO_DEVICE_STATE_MIGRATION_RESUME.

Defined VFIO_DEVICE_STATE_MIGRATION_SAVE_COMPLETED and
VFIO_DEVICE_STATE_MIGRATION_RESUME_COMPLETED states to cleanup all that
which was allocated/mmapped/started thread during setup phase. This can
be moved to transition to _RUNNING state. So if all agrees these states
can be removed.


> 4. an interface to save/restore device state
> 	* should happen when device is stopped
> 	* of course there is still an open how to check state compatibility as
> Alex pointed earlier
>

I hope above explains why other states are required.

Thanks,
Kirti


> this way above interfaces are not tied to migration. other usages which are
> interested in device state could also work (e.g. snapshot). If it doesn't work
> with your device, it's better that you can elaborate your requirement with more
> concrete examples. Then people will judge the necessity of a more complex
> interface as proposed in this series...
> 
> Thanks
> Kevin
> 
>
Kirti Wankhede Nov. 22, 2018, 8:43 p.m. UTC | #7
On 11/23/2018 12:24 AM, Dr. David Alan Gilbert wrote:
> * Kirti Wankhede (kwankhede@nvidia.com) wrote:
>> - Defined MIGRATION region type and sub-type.
>> - Defined VFIO device states during migration process.
>> - Defined vfio_device_migration_info structure which will be placed at 0th
>>   offset of migration region to get/set VFIO device related information.
>>   Defined actions and members of structure usage for each action:
>>     * To convey VFIO device state to be transitioned to.
>>     * To get pending bytes yet to be migrated for VFIO device
>>     * To ask driver to write data to migration region and return number of bytes
>>       written in the region
>>     * In migration resume path, user space app writes to migration region and
>>       communicates it to vendor driver.
>>     * Get bitmap of dirty pages from vendor driver from given start address
>>
>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>> Reviewed-by: Neo Jia <cjia@nvidia.com>
> 
> <snip>
> 
>> + * Action Get buffer:
>> + *      On this action, vendor driver should write data to migration region and
>> + *      return number of bytes written in the region.
>> + *      data.offset [output] : offset in the region from where data is written.
>> + *      data.size [output] : number of bytes written in migration buffer by
>> + *          vendor driver.
> 
> <snip>
> 
>> + */
>> +
>> +struct vfio_device_migration_info {
>> +        __u32 device_state;         /* VFIO device state */
>> +        struct {
>> +            __u64 precopy_only;
>> +            __u64 compatible;
>> +            __u64 postcopy_only;
>> +            __u64 threshold_size;
>> +        } pending;
>> +        struct {
>> +            __u64 offset;           /* offset */
>> +            __u64 size;             /* size */
>> +        } data;
> 
> I'm curious how the offsets/size work; how does the 
> kernel driver know the maximum size of state it's allowed to write?


Migration region looks like:
 ----------------------------------------------------------------------
|vfio_device_migration_info|    data section			      |	
|                          |     ///////////////////////////////////  |
 ----------------------------------------------------------------------
 ^			         ^                                 ^
 offset 0-trapped part         data.offset                     data.size


Kernel driver defines the size of migration region and tells VFIO user
space application (QEMU here) through VFIO_DEVICE_GET_REGION_INFO ioctl.
So kernel driver can calculate the size of data section. Then kernel
driver can have (data.size >= data section size) or (data.size < data
section size), hence VFIO user space application need to know data.size
to copy only relevant data.

> Why would it pick a none-0 offset into the output region?

The data section always follows the vfio_device_migration_info structure
in the region, so data.offset will always be non-zero.
Offset from where data is copied is decided by kernel driver, data
section can be trapped or mapped depending on how kernel driver defines
data section. If mmapped, then data.offset should be page aligned, whereas
the initial section which contains the vfio_device_migration_info structure
might not end at an offset which is page aligned.

Thanks,
Kirti

> Without having dug further these feel like i/o rather than just output;
> i.e. the calling process says 'put it at that offset and you've got size
> bytes' and the kernel replies with 'I did put it at offset and I wrote
> only this size bytes'
> 
> Dave
> 
>> +        struct {
>> +            __u64 start_addr;
>> +            __u64 total;
>> +            __u64 copied;
>> +        } dirty_pfns;
>> +} __attribute__((packed));
>> +
>>  /* -------- API for Type1 VFIO IOMMU -------- */
>>  
>>  /**
>> -- 
>> 2.7.0
>>
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>
Yan Zhao Nov. 23, 2018, 5:47 a.m. UTC | #8
On Wed, Nov 21, 2018 at 04:39:39AM +0800, Kirti Wankhede wrote:
> - Defined MIGRATION region type and sub-type.
> - Defined VFIO device states during migration process.
> - Defined vfio_device_migration_info structure which will be placed at 0th
>   offset of migration region to get/set VFIO device related information.
>   Defined actions and members of structure usage for each action:
>     * To convey VFIO device state to be transitioned to.
>     * To get pending bytes yet to be migrated for VFIO device
>     * To ask driver to write data to migration region and return number of bytes
>       written in the region
>     * In migration resume path, user space app writes to migration region and
>       communicates it to vendor driver.
>     * Get bitmap of dirty pages from vendor driver from given start address
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  linux-headers/linux/vfio.h | 130 +++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 130 insertions(+)
> 
> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
> index 3615a269d378..a6e45cb2cae2 100644
> --- a/linux-headers/linux/vfio.h
> +++ b/linux-headers/linux/vfio.h
> @@ -301,6 +301,10 @@ struct vfio_region_info_cap_type {
>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG (2)
>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG  (3)
> 
> +/* Migration region type and sub-type */
> +#define VFIO_REGION_TYPE_MIGRATION             (1 << 30)
> +#define VFIO_REGION_SUBTYPE_MIGRATION          (1)
> +
>  /*
>   * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
>   * which allows direct access to non-MSIX registers which happened to be within
> @@ -602,6 +606,132 @@ struct vfio_device_ioeventfd {
> 
>  #define VFIO_DEVICE_IOEVENTFD          _IO(VFIO_TYPE, VFIO_BASE + 16)
> 
> +/**
> + * VFIO device states :
> + * VFIO User space application should set the device state to indicate vendor
> + * driver in which state the VFIO device should transitioned.
> + * - VFIO_DEVICE_STATE_NONE:
> + *   State when VFIO device is initialized but not yet running.
> + * - VFIO_DEVICE_STATE_RUNNING:
> + *   Transition VFIO device in running state, that is, user space application or
> + *   VM is active.
> + * - VFIO_DEVICE_STATE_MIGRATION_SETUP:
> + *   Transition VFIO device in migration setup state. This is used to prepare
> + *   VFIO device for migration while application or VM and vCPUs are still in
> + *   running state.
> + * - VFIO_DEVICE_STATE_MIGRATION_PRECOPY:
> + *   When VFIO user space application or VM is active and vCPUs are running,
> + *   transition VFIO device in pre-copy state.
> + * - VFIO_DEVICE_STATE_MIGRATION_STOPNCOPY:
> + *   When VFIO user space application or VM is stopped and vCPUs are halted,
> + *   transition VFIO device in stop-and-copy state.
> + * - VFIO_DEVICE_STATE_MIGRATION_SAVE_COMPLETED:
> + *   When VFIO user space application has copied data provided by vendor driver.
> + *   This state is used by vendor driver to clean up all software state that was
> + *   setup during MIGRATION_SETUP state.
> + * - VFIO_DEVICE_STATE_MIGRATION_RESUME:
> + *   Transition VFIO device to resume state, that is, start resuming VFIO device
> + *   when user space application or VM is not running and vCPUs are halted.
> + * - VFIO_DEVICE_STATE_MIGRATION_RESUME_COMPLETED:
> + *   When user space application completes iterations of providing device state
> + *   data, transition device in resume completed state.
> + * - VFIO_DEVICE_STATE_MIGRATION_FAILED:
> + *   Migration process failed due to some reason, transition device to failed
> + *   state. If migration process fails while saving at source, resume device at
> + *   source. If migration process fails while resuming application or VM at
> + *   destination, stop restoration at destination and resume at source.
> + * - VFIO_DEVICE_STATE_MIGRATION_CANCELLED:
> + *   User space application has cancelled migration process either for some
> + *   known reason or due to user's intervention. Transition device to Cancelled
> + *   state, that is, resume device state as it was during running state at
> + *   source.
> + */
> +
> +enum {
> +    VFIO_DEVICE_STATE_NONE,
> +    VFIO_DEVICE_STATE_RUNNING,
> +    VFIO_DEVICE_STATE_MIGRATION_SETUP,
> +    VFIO_DEVICE_STATE_MIGRATION_PRECOPY,
> +    VFIO_DEVICE_STATE_MIGRATION_STOPNCOPY,
> +    VFIO_DEVICE_STATE_MIGRATION_SAVE_COMPLETED,
> +    VFIO_DEVICE_STATE_MIGRATION_RESUME,
> +    VFIO_DEVICE_STATE_MIGRATION_RESUME_COMPLETED,
> +    VFIO_DEVICE_STATE_MIGRATION_FAILED,
> +    VFIO_DEVICE_STATE_MIGRATION_CANCELLED,
> +};
> +
> +/**
> + * Structure vfio_device_migration_info is placed at 0th offset of
> + * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device related migration
> + * information.
> + *
> + * Action Set state:
> + *      To tell vendor driver the state VFIO device should be transitioned to.
> + *      device_state [input] : User space app sends device state to vendor
> + *           driver on state change, the state to which VFIO device should be
> + *           transitioned to.
> + *
> + * Action Get pending bytes:
> + *      To get pending bytes yet to be migrated from vendor driver
> + *      pending.threshold_size [Input] : threshold of buffer in User space app.
> + *      pending.precopy_only [output] : pending data which must be migrated in
> + *          precopy phase or in stopped state, in other words - before target
> + *          user space application or VM start. In case of migration, this
> + *          indicates pending bytes to be transferred while application or VM or
> + *          vCPUs are active and running.
> + *      pending.compatible [output] : pending data which may be migrated any
> + *          time, either when application or VM is active and vCPUs are active
> + *          or when application or VM is halted and vCPUs are halted.
> + *      pending.postcopy_only [output] : pending data which must be migrated in
> + *           postcopy phase or in stopped state, in other words - after source
> + *           application or VM stopped and vCPUs are halted.
> + *      Sum of pending.precopy_only, pending.compatible and
> + *      pending.postcopy_only is the whole amount of pending data.
> + *
> + * Action Get buffer:
> + *      On this action, vendor driver should write data to migration region and
> + *      return number of bytes written in the region.
> + *      data.offset [output] : offset in the region from where data is written.
> + *      data.size [output] : number of bytes written in migration buffer by
> + *          vendor driver.
Suggest adding a flag like restore-iteration/restore-complete to the GET_BUFFER action,
to avoid having the vendor driver keep track of various QEMU migration states.


> + * Action Set buffer:
> + *      In migration resume path, user space app writes to migration region and
> + *      communicates it to vendor driver with this action.
> + *      data.offset [Input] : offset in the region from where data is written.
> + *      data.size [Input] : number of bytes written in migration buffer by
> + *          user space app.
Suggest adding a flag like precopy/stop-and-copy to the SET_BUFFER action,
to avoid having the vendor driver keep track of various QEMU migration states.

> + *
> + * Action Get dirty pages bitmap:
> + *      Get bitmap of dirty pages from vendor driver from given start address.
> + *      dirty_pfns.start_addr [Input] : start address
> + *      dirty_pfns.total [Input] : Total pfn count from start_addr for which
> + *          dirty bitmap is requested
> + *      dirty_pfns.copied [Output] : pfn count for which dirty bitmap is copied
> + *          to migration region.
> + *      Vendor driver should copy the bitmap with bits set only for pages to be
> + *      marked dirty in migration region.
> + */
> +
> +struct vfio_device_migration_info {
> +        __u32 device_state;         /* VFIO device state */
> +        struct {
> +            __u64 precopy_only;
> +            __u64 compatible;
> +            __u64 postcopy_only;
> +            __u64 threshold_size;
> +        } pending;
> +        struct {
> +            __u64 offset;           /* offset */
> +            __u64 size;             /* size */
> +        } data;
> +        struct {
> +            __u64 start_addr;
> +            __u64 total;
> +            __u64 copied;
> +        } dirty_pfns;
> +} __attribute__((packed));
> +
>  /* -------- API for Type1 VFIO IOMMU -------- */
> 
>  /**
> --
> 2.7.0
> 
>
Dr. David Alan Gilbert Nov. 23, 2018, 11:44 a.m. UTC | #9
* Kirti Wankhede (kwankhede@nvidia.com) wrote:
> 
> 
> On 11/23/2018 12:24 AM, Dr. David Alan Gilbert wrote:
> > * Kirti Wankhede (kwankhede@nvidia.com) wrote:
> >> - Defined MIGRATION region type and sub-type.
> >> - Defined VFIO device states during migration process.
> >> - Defined vfio_device_migration_info structure which will be placed at 0th
> >>   offset of migration region to get/set VFIO device related information.
> >>   Defined actions and members of structure usage for each action:
> >>     * To convey VFIO device state to be transitioned to.
> >>     * To get pending bytes yet to be migrated for VFIO device
> >>     * To ask driver to write data to migration region and return number of bytes
> >>       written in the region
> >>     * In migration resume path, user space app writes to migration region and
> >>       communicates it to vendor driver.
> >>     * Get bitmap of dirty pages from vendor driver from given start address
> >>
> >> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> >> Reviewed-by: Neo Jia <cjia@nvidia.com>
> > 
> > <snip>
> > 
> >> + * Action Get buffer:
> >> + *      On this action, vendor driver should write data to migration region and
> >> + *      return number of bytes written in the region.
> >> + *      data.offset [output] : offset in the region from where data is written.
> >> + *      data.size [output] : number of bytes written in migration buffer by
> >> + *          vendor driver.
> > 
> > <snip>
> > 
> >> + */
> >> +
> >> +struct vfio_device_migration_info {
> >> +        __u32 device_state;         /* VFIO device state */
> >> +        struct {
> >> +            __u64 precopy_only;
> >> +            __u64 compatible;
> >> +            __u64 postcopy_only;
> >> +            __u64 threshold_size;
> >> +        } pending;
> >> +        struct {
> >> +            __u64 offset;           /* offset */
> >> +            __u64 size;             /* size */
> >> +        } data;
> > 
> > I'm curious how the offsets/size work; how does the 
> > kernel driver know the maximum size of state it's allowed to write?
> 
> 
> Migration region looks like:
>  ----------------------------------------------------------------------
> |vfio_device_migration_info|    data section			      |	
> |                          |     ///////////////////////////////////  |
>  ----------------------------------------------------------------------
>  ^			         ^                                 ^
>  offset 0-trapped part         data.offset                     data.size
> 
> 
> Kernel driver defines the size of migration region and tells VFIO user
> space application (QEMU here) through VFIO_DEVICE_GET_REGION_INFO ioctl.
> So kernel driver can calculate the size of data section. Then kernel
> driver can have (data.size >= data section size) or (data.size < data
> section size), hence VFIO user space application need to know data.size
> to copy only relevant data.
> 
> > Why would it pick a none-0 offset into the output region?
> 
> Data section is always followed by vfio_device_migration_info structure
> in the region, so data.offset will always be none-0.
> Offset from where data is copied is decided by kernel driver, data
> section can be trapped or mapped depending on how kernel driver defines
> data section. If mmapped, then data.offset should be page aligned, where
> as initial section which contain vfio_device_migration_info structure
> might not end at offset which is page aligned.

Ah OK; I see - it wasn't clear to me which buffer we were talking about
here; so yes it makes sense if it's one the kernel had the control of.

Dave

> Thanks,
> Kirti
> 
> > Without having dug further these feel like i/o rather than just output;
> > i.e. the calling process says 'put it at that offset and you've got size
> > bytes' and the kernel replies with 'I did put it at offset and I wrote
> > only this size bytes'
> > 
> > Dave
> > 
> >> +        struct {
> >> +            __u64 start_addr;
> >> +            __u64 total;
> >> +            __u64 copied;
> >> +        } dirty_pfns;
> >> +} __attribute__((packed));
> >> +
> >>  /* -------- API for Type1 VFIO IOMMU -------- */
> >>  
> >>  /**
> >> -- 
> >> 2.7.0
> >>
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> > 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Tian, Kevin Nov. 26, 2018, 7:14 a.m. UTC | #10
> From: Kirti Wankhede [mailto:kwankhede@nvidia.com]
> Sent: Friday, November 23, 2018 4:02 AM
> 
[...]
> >
> > I looked at the explanations in this patch, but still didn't get the intention,
> e.g.:
> >
> > + * - VFIO_DEVICE_STATE_MIGRATION_SETUP:
> > + *   Transition VFIO device in migration setup state. This is used to
> prepare
> > + *   VFIO device for migration while application or VM and vCPUs are still
> in
> > + *   running state.
> >
> > what preparation is actually required? any example?
> 
> Each vendor driver can have different requirements as to how to prepare
> for migration. For example, this phase can be used to allocate buffer
> which can be mapped to MIGRATION region's data part, and allocating
> staging buffer. Driver might need to spawn thread which would start
> collecting data that need to be send during pre-copy phase.
> 
> >
> > + * - VFIO_DEVICE_STATE_MIGRATION_PRECOPY:
> > + *   When VFIO user space application or VM is active and vCPUs are
> running,
> > + *   transition VFIO device in pre-copy state.
> >
> > why does device driver need know this stage? in precopy phase, the VM
> > is still running. Just dirty page tracking is in progress. the dirty bitmap
> could
> > be retrieved through its own action interface.
> >
> 
> All mdev devices are not similar. Pre-copy phase is not just about dirty
> page tracking. For devices which have memory on device could transfer
> data from that memory during pre-copy phase. For example, NVIDIA GPU
> has
> its own FB, so need to start sending FB data during pre-copy phase and
> then during stop and copy phase send data from FB which is marked dirty
> after that was copied in pre-copy phase. That helps to reduce total down
> time.

yes it makes sense, otherwise copying whole big FB at stop time is time
consuming. Curious, does Qemu already support pre-copy of device state
today, or is this series the 1st example to do that?

> 
> > you have code to demonstrate how those states are transitioned in Qemu,
> > but you didn't show evidence why those states are necessary in device
> side,
> > which leads to the puzzle whether the definition is over-killed and limiting.
> >
> 
> I'm trying to keep these interfaces generic for VFIO and mdev devices.
> Its difficult to define what vendor driver should do for each state,
> each vendor driver have their own requirements. Vendor drivers should
> decide whether to take any action on state transition or not.
> 
> > the flow in my mind is like below:
> >
> > 1. an interface to turn on/off dirty page tracking on VFIO device:
> > 	* vendor driver can do whatever required to enable device specific
> > dirty page tracking mechanism here
> > 	* device state is not changed here. still in running state
> >
> > 2. an interface to get dirty page bitmap
> >
> 
> I don't think there should be on/off interface for dirty page tracking.
> If there is a write access on dirty_pfns.start_addr and dirty_pfns.total
> and device_state >=VFIO_DEVICE_STATE_MIGRATION_SETUP &&
> device_state <=
> VFIO_DEVICE_STATE_MIGRATION_STOPNCOPY then dirty page tracking has
> started, so return dirty page bitmap in data part of migration region.

dirty page tracking might be useful for other purposes, e.g. if people want
to just draw memory access pattern of a given VM. binding dirty tracking
to migration flow is limiting...

> 
> 
> > 3. an interface to start/stop device activity
> > 	* the effect of stop is to stop and drain in-the-fly device activities
> and
> > make device state ready for dump-out. vendor driver can do specific
> preparation
> > here
> 
> VFIO_DEVICE_STATE_MIGRATION_STOPNCOPY is to stop the device, but as
> I
> mentioned above some vendor driver might have to do preparation before
> pre-copy phase starts.
> 
> > 	* the effect of start is to check validity of device state and then
> resume
> > device activities. again, vendor driver can do specific cleanup/preparation
> here
> >
> 
> That is VFIO_DEVICE_STATE_MIGRATION_RESUME.
> 
> Defined VFIO_DEVICE_STATE_MIGRATION_SAVE_COMPLETED and
> VFIO_DEVICE_STATE_MIGRATION_RESUME_COMPLETED states to cleanup
> all that
> which was allocated/mmapped/started thread during setup phase. This
> can
> be moved to transition to _RUNNING state. So if all agrees these states
> can be removed.
> 
> 
> > 4. an interface to save/restore device state
> > 	* should happen when device is stopped
> > 	* of course there is still an open how to check state compatibility as
> > Alex pointed earlier
> >
> 
> I hope above explains why other states are required.
> 

yes, above makes the whole picture much clearer. Thanks a lot!

Accordingly I'm thinking about whether below state definition could be
more general and extensible:

_STATE_NONE, indicates initial state
_STATE_RUNNING, indicates normal state
_STATE_STOPPED, indicates that device activities are fully stopped
_STATE_IN_TRACKING, indicates that device state can be r/w by user space.
this state can be ORed to RUNNING or STOPPED.

live migration could be implemented in below flow:

(at src side)
1. RUNNING -> {RUNNING | IN_TRACKING}
	* this switch does vendor specific preparation to make device
state accessible to user space (as covered by MIGRATION_SETUP)
	* vendor driver may let iterative read get incremental changes 
since last read (as covered by MIGRATION_PRECOPY). 	*open*, do we 
need an explicit flag to indicate such capability?
	* dirty page bitmap is also made available upon this change

2. (RUNNING | IN_TRACKING) -> (STOPPED | IN_TRACKING)
	* device is stopped thus device state is finalized
	* user space can read full device state, as defined for
MIGRATION_STOPNCOPY

3. (STOPPED | IN_TRACKING) -> (STOPPED)
	* device state tracking and dirty page tracking are cancelled. 
cleanup is done for resources setup in step 1. similar to MIGRATION_
SAVE_COMPLETED

4. STOPPED -> NONE, when device is reset later

(at dest side)

1. NONE -> (STOPPED | IN_TRACKING)
	* prepare device state region so user space can write
	* map to MIGRATION_RESUME
	* open: do we need both NONE and STOPPED, or just STOPPED?
2. (STOPPED | IN_TRACKING) -> STOPPED
	* clean up resources allocated in step 1
	* map to MIGRATION_RESUME_COMPLETED
3. STOPPED -> RUNNING
	* resume the device activities

compare to original definition, I think all important steps are covered:
+enum {
+    VFIO_DEVICE_STATE_NONE,
+    VFIO_DEVICE_STATE_RUNNING,
+    VFIO_DEVICE_STATE_MIGRATION_SETUP,
+    VFIO_DEVICE_STATE_MIGRATION_PRECOPY,
+    VFIO_DEVICE_STATE_MIGRATION_STOPNCOPY,
+    VFIO_DEVICE_STATE_MIGRATION_SAVE_COMPLETED,
+    VFIO_DEVICE_STATE_MIGRATION_RESUME,
+    VFIO_DEVICE_STATE_MIGRATION_RESUME_COMPLETED,
+    VFIO_DEVICE_STATE_MIGRATION_FAILED,
+    VFIO_DEVICE_STATE_MIGRATION_CANCELLED,
+};

FAILED is not a device state. It should be indicated in return value of set
state action.

CANCELLED can be achieved any time by clearing IN_TRACKING state.

with this new definition, above states can be also selectively used for
other purposes, e.g.:
 
1. user space can do RUNNING->STOPPED->RUNNING for any control reason,
w/o touching device state at all.

2. if someone wants to draw memory access pattern of a VM, it could
be done by RUNNING->(RUNNING | IN_TRACKING)->RUNNING, by reading
dirty bitmap when IN_TRACKING is active. Device state is ready but not 
accessed here, hope it is not a big burden.

Thoughts?

Thanks
Kevin
Alex Williamson Nov. 27, 2018, 7:52 p.m. UTC | #11
On Wed, 21 Nov 2018 02:09:39 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> - Defined MIGRATION region type and sub-type.
> - Defined VFIO device states during migration process.
> - Defined vfio_device_migration_info structure which will be placed at 0th
>   offset of migration region to get/set VFIO device related information.
>   Defined actions and members of structure usage for each action:
>     * To convey VFIO device state to be transitioned to.
>     * To get pending bytes yet to be migrated for VFIO device
>     * To ask driver to write data to migration region and return number of bytes
>       written in the region
>     * In migration resume path, user space app writes to migration region and
>       communicates it to vendor driver.
>     * Get bitmap of dirty pages from vendor driver from given start address
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  linux-headers/linux/vfio.h | 130 +++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 130 insertions(+)
> 
> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
> index 3615a269d378..a6e45cb2cae2 100644
> --- a/linux-headers/linux/vfio.h
> +++ b/linux-headers/linux/vfio.h
> @@ -301,6 +301,10 @@ struct vfio_region_info_cap_type {
>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG	(2)
>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG	(3)
>  
> +/* Migration region type and sub-type */
> +#define VFIO_REGION_TYPE_MIGRATION	        (1 << 30)
> +#define VFIO_REGION_SUBTYPE_MIGRATION	        (1)
> +

I think this is copied from the vendor type, but I don't think it makes
sense here.  We reserve the top bit of the type to indicate a PCI
vendor type where the lower 16 bits are then the PCI vendor ID.  This
gives each vendor their own sub-type address space.  With the graphics
type we began our first type with (1).  I would expect migration to
then be type (2), not (1 << 30).

>  /*
>   * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
>   * which allows direct access to non-MSIX registers which happened to be within
> @@ -602,6 +606,132 @@ struct vfio_device_ioeventfd {
>  
>  #define VFIO_DEVICE_IOEVENTFD		_IO(VFIO_TYPE, VFIO_BASE + 16)

Reading ahead in the thread, I'm going to mention a lot of what Kevin
has already said here...
 
> +/**
> + * VFIO device states :
> + * VFIO User space application should set the device state to indicate vendor
> + * driver in which state the VFIO device should transitioned.
> + * - VFIO_DEVICE_STATE_NONE:
> + *   State when VFIO device is initialized but not yet running.
> + * - VFIO_DEVICE_STATE_RUNNING:
> + *   Transition VFIO device in running state, that is, user space application or
> + *   VM is active.

Is this backwards compatible?  A new device that supports migration
must be backwards compatible with old userspace that doesn't interact
with the migration region.  What happens if userspace never moves the
state to running?

> + * - VFIO_DEVICE_STATE_MIGRATION_SETUP:
> + *   Transition VFIO device in migration setup state. This is used to prepare
> + *   VFIO device for migration while application or VM and vCPUs are still in
> + *   running state.

What does this imply to the device?  I thought we were going to
redefine these states in terms of what we expect the device to do.
These still seem like just a copy of the QEMU states which we discussed
are an internal reference that can change at any time.

> + * - VFIO_DEVICE_STATE_MIGRATION_PRECOPY:
> + *   When VFIO user space application or VM is active and vCPUs are running,
> + *   transition VFIO device in pre-copy state.

Which means what?

> + * - VFIO_DEVICE_STATE_MIGRATION_STOPNCOPY:
> + *   When VFIO user space application or VM is stopped and vCPUs are halted,
> + *   transition VFIO device in stop-and-copy state.

Is the device still running?  What happens to in-flight DMA?  Does it
wait?  We need a clear definition in terms of the device, not the VM.

> + * - VFIO_DEVICE_STATE_MIGRATION_SAVE_COMPLETED:
> + *   When VFIO user space application has copied data provided by vendor driver.
> + *   This state is used by vendor driver to clean up all software state that was
> + *   setup during MIGRATION_SETUP state.

When was the MIGRATION_SAVE_STARTED?

> + * - VFIO_DEVICE_STATE_MIGRATION_RESUME:
> + *   Transition VFIO device to resume state, that is, start resuming VFIO device
> + *   when user space application or VM is not running and vCPUs are halted.

Are we simply restoring the state we copied from the device after it was
stopped?

> + * - VFIO_DEVICE_STATE_MIGRATION_RESUME_COMPLETED:
> + *   When user space application completes iterations of providing device state
> + *   data, transition device in resume completed state.

So the device should start running here?

> + * - VFIO_DEVICE_STATE_MIGRATION_FAILED:
> + *   Migration process failed due to some reason, transition device to failed
> + *   state. If migration process fails while saving at source, resume device at
> + *   source. If migration process fails while resuming application or VM at
> + *   destination, stop restoration at destination and resume at source.

What does a failed device do?  Can't we simply not move to the
completed state?

> + * - VFIO_DEVICE_STATE_MIGRATION_CANCELLED:
> + *   User space application has cancelled migration process either for some
> + *   known reason or due to user's intervention. Transition device to Cancelled
> + *   state, that is, resume device state as it was during running state at
> + *   source.

Why do we need to tell the device this, can't we simply go back to
normal running state?

It seems to me that the default state needs to be RUNNING in order to
be backwards compatible.

I also agree with Kevin that it seems that dirty tracking is just an
augmented running state that we can turn on and off.  Shouldn't we also
define that dirty tracking can be optional?  For instance if a device
doesn't support dirty tracking, couldn't we stop the device first, then
save all memory, then retrieve the device state?  This is the other
problem with mirroring QEMU migration states, we don't account for
things that QEMU currently doesn't do.

Actually I'm wondering if we can distill everything down to two bits,
STOPPED and LOGGING.

We start at RUNNING, the user can optionally enable LOGGING when
supported by the device to cover the SETUP and PRECOPY states
proposed.  The device stays running, but activates any sort of
dirty page tracking that's necessary to activate those interfaces.
LOGGING can also be cleared to handle the CANCELLED state.  The user
would set STOPPED which should quiesce the device and make the full
device state available through the device data section.  Clearing
STOPPED and LOGGING would handle the FAILED state below.  Likewise on
the migration target, QEMU would set the device to STOPPED in order to
write the incoming data via the data section and clear STOPPED to
indicate the device returns to RUNNING (aka RESUME/RESUME_COMPLETED).

> + */
> +
> +enum {
> +    VFIO_DEVICE_STATE_NONE,
> +    VFIO_DEVICE_STATE_RUNNING,
> +    VFIO_DEVICE_STATE_MIGRATION_SETUP,
> +    VFIO_DEVICE_STATE_MIGRATION_PRECOPY,
> +    VFIO_DEVICE_STATE_MIGRATION_STOPNCOPY,
> +    VFIO_DEVICE_STATE_MIGRATION_SAVE_COMPLETED,
> +    VFIO_DEVICE_STATE_MIGRATION_RESUME,
> +    VFIO_DEVICE_STATE_MIGRATION_RESUME_COMPLETED,
> +    VFIO_DEVICE_STATE_MIGRATION_FAILED,
> +    VFIO_DEVICE_STATE_MIGRATION_CANCELLED,
> +};
> +
> +/**
> + * Structure vfio_device_migration_info is placed at 0th offset of
> + * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device related migration
> + * information.

Should include a note that field accesses are only supported at their
native width and alignment, anything else is unsupported and should
generate an error.  I don't see a good reason to bloat the supporting
code to handle anything else.

> + *
> + * Action Set state:
> + *      To tell vendor driver the state VFIO device should be transitioned to.
> + *      device_state [input] : User space app sends device state to vendor
> + *           driver on state change, the state to which VFIO device should be
> + *           transitioned to.

It should be noted that the write will return error on a transition
failure.  The write is also synchronous, ie. when the device is asked
to stop, any in-flight DMA will be completed before the write returns.

> + *
> + * Action Get pending bytes:
> + *      To get pending bytes yet to be migrated from vendor driver
> + *      pending.threshold_size [Input] : threshold of buffer in User space app.

I still don't see why the kernel needs to be concerned with the size of
the user buffer.  The vendor driver will already know the size of the
user read, won't the user try to fill their buffer in a single read?
Infer the size.  Maybe you really want a minimum read size if you want
to package your vendor data in some way?  ie. what minimum size must the
user use to avoid getting -ENOSPC return.

> + *      pending.precopy_only [output] : pending data which must be migrated in
> + *          precopy phase or in stopped state, in other words - before target
> + *          user space application or VM start. In case of migration, this
> + *          indicates pending bytes to be transferred while application or VM or
> + *          vCPUs are active and running.

What sort of data is included here?  Is this mostly a compatibility
check?  I think that needs to be an explicit part of the interface, not
something we simply assume the vendor driver handles (though it's
welcome to do additional checking).  What other device state is valid
to be saved prior to stopping the device?

> + *      pending.compatible [output] : pending data which may be migrated any
> + *          time , either when application or VM is active and vCPUs are active
> + *          or when application or VM is halted and vCPUs are halted.

Again, what sort of data is included here?  If it's live device data,
like a framebuffer, shouldn't it be handled by the dirty page tracking
interface?  Is this meant to do dirty tracking within device memory?
Should we formalize that?

> + *      pending.postcopy_only [output] : pending data which must be migrated in
> + *           postcopy phase or in stopped state, in other words - after source
> + *           application or VM stopped and vCPUs are halted.
> + *      Sum of pending.precopy_only, pending.compatible and
> + *      pending.postcopy_only is the whole amount of pending data.

It seems the user is able to stop the device at any point in time, what
do these values indicate then?  Shouldn't there be just one value then?

Can't we do all of this with just a save_bytes_available value?  When
the device is RUNNING this value could be dynamic (if the vendor driver
supports data in that phase... and we understand how to consume it),
when the device is stopped, it could update with each read.

> + *
> + * Action Get buffer:
> + *      On this action, vendor driver should write data to migration region and
> + *      return number of bytes written in the region.
> + *      data.offset [output] : offset in the region from where data is written.
> + *      data.size [output] : number of bytes written in migration buffer by
> + *          vendor driver.

If we know the pending bytes, why do we need this?  Isn't the read
itself the indication to prepare the data to be read?  Does the user
really ever need to start a read from anywhere other than the starting
offset of the data section?

> + *
> + * Action Set buffer:
> + *      In migration resume path, user space app writes to migration region and
> + *      communicates it to vendor driver with this action.
> + *      data.offset [Input] : offset in the region from where data is written.
> + *      data.size [Input] : number of bytes written in migration buffer by
> + *          user space app.

Again, isn't all the information contained in the write itself?
Shouldn't the packet of data the user writes include information that
makes the offset unnecessary?  Are all of these trying to support an
mmap capable data area and do we really need that?

> + *
> + * Action Get dirty pages bitmap:
> + *      Get bitmap of dirty pages from vendor driver from given start address.
> + *      dirty_pfns.start_addr [Input] : start address
> + *      dirty_pfns.total [Input] : Total pfn count from start_addr for which
> + *          dirty bitmap is requested
> + *      dirty_pfns.copied [Output] : pfn count for which dirty bitmap is copied
> + *          to migration region.
> + *      Vendor driver should copy the bitmap with bits set only for pages to be
> + *      marked dirty in migration region.
> + */

The protocol is not very clear here, the vendor driver never copies the
bitmap, from the user perspective the vendor driver handles the read(2)
from the region.  But is the data data.offset being used for this?
It's not clear where the user reads the bitmap.  Is the start_addr here
meant to address the segmentation that we discussed previously?  As
above, I don't see why the user needs all these input fields, they can
almost all be determined by the read itself.

The format I proposed was that much like the data section, the device
could expose a dirty pfn section with offset and size.  For example:

struct {
	__u64 offset;		// read-only (in bytes)
	__u64 size;		// read-only (in bytes)
	__u64 page_size;	// read-only (ex. 4k)
	__u64 segment;		// read-write
} dirty_pages;

So for example, the vendor driver would expose an offset within the
region much like for the data area.  The size might be something like
32MB and the page_size could be 4096.  The user can calculate from this
that the area exposes 1TB worth of pfns.  When segment is 0x0 the user
can directly read pfns for address 0x0 to 1TB - 1.  Setting segment to
0x1 allows access to 1TB to 2TB - 1, etc.  Therefore the user sets the
segment register and simply performs a read.

The thing we discussed that this interface lacks is an efficient
interface for handling reading from a range where no dirty bits are
set, which I think you're trying to handle with the additional 'copied'
field, but couldn't read(2) simply return zero to optimize that case?
We only need to ensure that the user won't continue to retry for that
return value.

> +
> +struct vfio_device_migration_info {
> +        __u32 device_state;         /* VFIO device state */
> +        struct {
> +            __u64 precopy_only;
> +            __u64 compatible;
> +            __u64 postcopy_only;
> +            __u64 threshold_size;
> +        } pending;
> +        struct {
> +            __u64 offset;           /* offset */
> +            __u64 size;             /* size */
> +        } data;
> +        struct {
> +            __u64 start_addr;
> +            __u64 total;
> +            __u64 copied;
> +        } dirty_pfns;
> +} __attribute__((packed));
> +

We're still missing explicit versioning and compatibility information.
Thanks,

Alex
Cornelia Huck Dec. 4, 2018, 10:53 a.m. UTC | #12
On Tue, 27 Nov 2018 12:52:48 -0700
Alex Williamson <alex.williamson@redhat.com> wrote:

> Actually I'm wondering if we can distill everything down to two bits,
> STOPPED and LOGGING.
> 
> We start at RUNNING, the user can optionally enable LOGGING when
> supported by the device to cover the SETUP and PRECOPY states
> proposed.  The device stays running, but activates any sort of
> dirty page tracking that's necessary to activate those interfaces.
> LOGGING can also be cleared to handle the CANCELLED state.  The user
> would set STOPPED which should quiesce the device and make the full
> device state available through the device data section.  Clearing
> STOPPED and LOGGING would handle the FAILED state below.  Likewise on
> the migration target, QEMU would set the device to STOPPED in order to
> write the incoming data via the data section and clear STOPPED to
> indicate the device returns to RUNNING (aka RESUME/RESUME_COMPLETED).

This idea sounds like something that can be more easily adapted to
other device types as well. The LOGGING bit is probably more flexible
if you reframe it as a PREPARATION bit: That would also cover devices
or device types that don't do dirty logging, but need some other kind
of preparation prior to moving over.

This model would also be simple enough to allow e.g. a vendor driver
for mdev to implement its own, specialized backend while still fitting
into the general framework. Non-pci mdevs are probably different enough
from pci devices so that many of the states proposed don't really match
their needs.
Alex Williamson Dec. 4, 2018, 5:14 p.m. UTC | #13
On Tue, 4 Dec 2018 11:53:33 +0100
Cornelia Huck <cohuck@redhat.com> wrote:

> On Tue, 27 Nov 2018 12:52:48 -0700
> Alex Williamson <alex.williamson@redhat.com> wrote:
> 
> > Actually I'm wondering if we can distill everything down to two bits,
> > STOPPED and LOGGING.
> > 
> > We start at RUNNING, the user can optionally enable LOGGING when
> > supported by the device to cover the SETUP and PRECOPY states
> > proposed.  The device stays running, but activates any sort of
> > dirtly page tracking that's necessary to activate those interfaces.
> > LOGGING can also be cleared to handle the CANCELLED state.  The user
> > would set STOPPED which should quiesce the device and make the full
> > device state available through the device data section.  Clearing
> > STOPPED and LOGGING would handle the FAILED state below.  Likewise on
> > the migration target, QEMU would set the device top STOPPED in order to
> > write the incoming data via the data section and clear STOPPED to
> > indicate the device returns to RUNNING (aka RESUME/RESUME_COMPLETED).  
> 
> This idea sounds like something that can be more easily adapted to
> other device types as well. The LOGGING bit is probably more flexible
> if you reframe it as a PREPARATION bit: That would also cover devices
> or device types that don't do dirty logging, but need some other kind
> of preparation prior to moving over.

Can you elaborate on what PREPARATION might do w/o dirty logging?
LOGGING is just a state, it's on or off, whereas PREPARATION implies
some sequential step in a process and then I'm afraid we slide back into
states that a really steps in a QEMU specific migration process.  So
I'm curious why PREPARATION wouldn't just be a vendor implementation
specific first step when RUNNING is turned off.  A device that doesn't
implement dirty logging could always just claim everything is dirty if
it wants that advanced warning that RUNNING might be turned off soon,
but there are probably more efficient ways to support that, ex. a flag
indicating the dirty logging granularity.  Thanks,

Alex
diff mbox series

Patch

diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
index 3615a269d378..a6e45cb2cae2 100644
--- a/linux-headers/linux/vfio.h
+++ b/linux-headers/linux/vfio.h
@@ -301,6 +301,10 @@  struct vfio_region_info_cap_type {
 #define VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG	(2)
 #define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG	(3)
 
+/* Migration region type and sub-type */
+#define VFIO_REGION_TYPE_MIGRATION	        (1 << 30)
+#define VFIO_REGION_SUBTYPE_MIGRATION	        (1)
+
 /*
  * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
  * which allows direct access to non-MSIX registers which happened to be within
@@ -602,6 +606,132 @@  struct vfio_device_ioeventfd {
 
 #define VFIO_DEVICE_IOEVENTFD		_IO(VFIO_TYPE, VFIO_BASE + 16)
 
+/**
+ * VFIO device states :
+ * VFIO user space application should set the device state to indicate to the
+ * vendor driver the state to which the VFIO device should be transitioned.
+ * - VFIO_DEVICE_STATE_NONE:
+ *   State when VFIO device is initialized but not yet running.
+ * - VFIO_DEVICE_STATE_RUNNING:
+ *   Transition VFIO device to running state, that is, user space application or
+ *   VM is active.
+ * - VFIO_DEVICE_STATE_MIGRATION_SETUP:
+ *   Transition VFIO device to migration setup state. This is used to prepare
+ *   VFIO device for migration while application or VM and vCPUs are still in
+ *   running state.
+ * - VFIO_DEVICE_STATE_MIGRATION_PRECOPY:
+ *   When VFIO user space application or VM is active and vCPUs are running,
+ *   transition VFIO device to pre-copy state.
+ * - VFIO_DEVICE_STATE_MIGRATION_STOPNCOPY:
+ *   When VFIO user space application or VM is stopped and vCPUs are halted,
+ *   transition VFIO device to stop-and-copy state.
+ * - VFIO_DEVICE_STATE_MIGRATION_SAVE_COMPLETED:
+ *   When VFIO user space application has copied data provided by vendor driver.
+ *   This state is used by vendor driver to clean up all software state that was
+ *   setup during MIGRATION_SETUP state.
+ * - VFIO_DEVICE_STATE_MIGRATION_RESUME:
+ *   Transition VFIO device to resume state, that is, start resuming VFIO device
+ *   when user space application or VM is not running and vCPUs are halted.
+ * - VFIO_DEVICE_STATE_MIGRATION_RESUME_COMPLETED:
+ *   When user space application completes iterations of providing device state
+ *   data, transition device to resume completed state.
+ * - VFIO_DEVICE_STATE_MIGRATION_FAILED:
+ *   Migration process failed due to some reason, transition device to failed
+ *   state. If migration process fails while saving at source, resume device at
+ *   source. If migration process fails while resuming application or VM at
+ *   destination, stop restoration at destination and resume at source.
+ * - VFIO_DEVICE_STATE_MIGRATION_CANCELLED:
+ *   User space application has cancelled migration process either for some
+ *   known reason or due to user's intervention. Transition device to Cancelled
+ *   state, that is, resume device state as it was during running state at
+ *   source.
+ */
+
+enum {
+    VFIO_DEVICE_STATE_NONE,
+    VFIO_DEVICE_STATE_RUNNING,
+    VFIO_DEVICE_STATE_MIGRATION_SETUP,
+    VFIO_DEVICE_STATE_MIGRATION_PRECOPY,
+    VFIO_DEVICE_STATE_MIGRATION_STOPNCOPY,
+    VFIO_DEVICE_STATE_MIGRATION_SAVE_COMPLETED,
+    VFIO_DEVICE_STATE_MIGRATION_RESUME,
+    VFIO_DEVICE_STATE_MIGRATION_RESUME_COMPLETED,
+    VFIO_DEVICE_STATE_MIGRATION_FAILED,
+    VFIO_DEVICE_STATE_MIGRATION_CANCELLED,
+};
+
+/**
+ * Structure vfio_device_migration_info is placed at 0th offset of
+ * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device related migration
+ * information.
+ *
+ * Action Set state:
+ *      To tell vendor driver the state VFIO device should be transitioned to.
+ *      device_state [input] : User space app sends device state to vendor
+ *           driver on state change, the state to which VFIO device should be
+ *           transitioned.
+ *
+ * Action Get pending bytes:
+ *      To get pending bytes yet to be migrated from vendor driver
+ *      pending.threshold_size [Input] : threshold of buffer in User space app.
+ *      pending.precopy_only [output] : pending data which must be migrated in
+ *          precopy phase or in stopped state, in other words - before target
+ *          user space application or VM start. In case of migration, this
+ *          indicates pending bytes to be transferred while application or VM or
+ *          vCPUs are active and running.
+ *      pending.compatible [output] : pending data which may be migrated any
+ *          time, either when application or VM is active and vCPUs are active
+ *          or when application or VM is halted and vCPUs are halted.
+ *      pending.postcopy_only [output] : pending data which must be migrated in
+ *           postcopy phase or in stopped state, in other words - after source
+ *           application or VM stopped and vCPUs are halted.
+ *      Sum of pending.precopy_only, pending.compatible and
+ *      pending.postcopy_only is the whole amount of pending data.
+ *
+ * Action Get buffer:
+ *      On this action, vendor driver should write data to migration region and
+ *      return number of bytes written in the region.
+ *      data.offset [output] : offset in the region from where data is written.
+ *      data.size [output] : number of bytes written in migration buffer by
+ *          vendor driver.
+ *
+ * Action Set buffer:
+ *      In migration resume path, user space app writes to migration region and
+ *      communicates it to vendor driver with this action.
+ *      data.offset [Input] : offset in the region from where data is written.
+ *      data.size [Input] : number of bytes written in migration buffer by
+ *          user space app.
+ *
+ * Action Get dirty pages bitmap:
+ *      Get bitmap of dirty pages from vendor driver from given start address.
+ *      dirty_pfns.start_addr [Input] : start address
+ *      dirty_pfns.total [Input] : Total pfn count from start_addr for which
+ *          dirty bitmap is requested
+ *      dirty_pfns.copied [Output] : pfn count for which dirty bitmap is copied
+ *          to migration region.
+ *      Vendor driver should copy the bitmap with bits set only for pages to be
+ *      marked dirty in migration region.
+ */
+
+struct vfio_device_migration_info {
+        __u32 device_state;         /* VFIO device state */
+        struct {
+            __u64 precopy_only;
+            __u64 compatible;
+            __u64 postcopy_only;
+            __u64 threshold_size;
+        } pending;
+        struct {
+            __u64 offset;           /* offset */
+            __u64 size;             /* size */
+        } data;
+        struct {
+            __u64 start_addr;
+            __u64 total;
+            __u64 copied;
+        } dirty_pfns;
+} __attribute__((packed));
+
 /* -------- API for Type1 VFIO IOMMU -------- */
 
 /**