
[0/5] QEMU VFIO live migration

Message ID 1550566254-3545-1-git-send-email-yan.y.zhao@intel.com (mailing list archive)

Message

Yan Zhao Feb. 19, 2019, 8:50 a.m. UTC
This patchset enables VFIO devices to have live migration capability.
It does not currently support the post-copy phase.

It follows Alex's comments on the last version of the VFIO live migration
patches, covering device states, the VFIO device state region layout, and
dirty bitmap queries.

Device Data
-----------
Device data is divided into three types: device memory, device config,
and system memory dirty pages produced by the device.

Device config: data like MMIOs, page tables...
        Every device is supposed to possess device config data.
        Usually device config is small (no bigger than 10 MB), and it
        needs to be loaded in a certain strict order.
        Therefore, device config only needs to be saved/loaded in the
        stop-and-copy phase.
        The data of device config is held in the device config region.
        The size of device config data is smaller than or equal to that
        of the device config region.

Device memory: the device's internal memory, standalone and outside
        system memory. It is usually very big.
        This kind of data needs to be saved/loaded in both the pre-copy
        and stop-and-copy phases.
        The data of device memory is held in the device memory region.
        The size of device memory is usually larger than that of the
        device memory region, so QEMU needs to save/load it in chunks of
        the size of the device memory region.
        Not all devices have device memory; IGD, for example, only uses
        system memory.

System memory dirty pages: if a device produces dirty pages in system
        memory, it is able to provide a dirty bitmap for a certain range
        of system memory. This dirty bitmap is queried in the pre-copy
        and stop-and-copy phases from the .log_sync callback. By setting
        the dirty bitmap in the .log_sync callback, dirty pages in system
        memory are saved/loaded by RAM's live migration code.
        The dirty bitmap of system memory is held in the dirty bitmap
        region. If the system memory range is larger than the dirty
        bitmap region can hold, QEMU cuts it into several chunks and
        gets the dirty bitmap in succession.


Device State Regions
--------------------
The vendor driver is required to expose two mandatory regions, and
another two optional regions if it plans to support device state
management.

So, there are up to four regions in total.
One control region: mandatory.
        Accessed via read/write system calls.
        Its layout is defined in struct vfio_device_state_ctl.
Three data regions: mmapped into QEMU.
        device config region: mandatory, holding device config data
        device memory region: optional, holding device memory data
        dirty bitmap region: optional, holding the bitmap of system
                            memory dirty pages

(The reason four separate regions are defined is that the unit of the
mmap system call is PAGE_SIZE, i.e. 4 KB. One read/write region for
control plus three mmapped regions for data seems better than one big
region that is padded and sparsely mmapped.)


Kernel device state interface [1]
---------------------------------
#define VFIO_DEVICE_STATE_INTERFACE_VERSION 1
#define VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY 1
#define VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY 2

#define VFIO_DEVICE_STATE_RUNNING 0
#define VFIO_DEVICE_STATE_STOP 1
#define VFIO_DEVICE_STATE_LOGGING 2

#define VFIO_DEVICE_DATA_ACTION_GET_BUFFER 1
#define VFIO_DEVICE_DATA_ACTION_SET_BUFFER 2
#define VFIO_DEVICE_DATA_ACTION_GET_BITMAP 3

struct vfio_device_state_ctl {
	__u32 version;            /* ro */
	__u32 device_state;       /* VFIO device state, wo */
	__u32 caps;               /* ro */
	struct {
		__u32 action;     /* wo, GET_BUFFER or SET_BUFFER */
		__u64 size;       /* rw */
	} device_config;
	struct {
		__u32 action;     /* wo, GET_BUFFER or SET_BUFFER */
		__u64 size;       /* rw */
		__u64 pos;        /* rw, offset into the total device memory buffer */
	} device_memory;
	struct {
		__u32 action;     /* wo, GET_BITMAP */
		__u64 start_addr; /* wo */
		__u64 page_nr;    /* wo */
	} system_memory;
};

Device States
-------------
After migration is initialized, QEMU sets the device state by writing to
the device_state field of the control region.

Four states are defined for a VFIO device:
        RUNNING, RUNNING & LOGGING, STOP & LOGGING, STOP 

RUNNING: in this state, a VFIO device is active and ready to receive
        commands from the device driver.
        It is the default state that a VFIO device enters initially.

STOP:  in this state, a VFIO device is deactivated and does not
       interact with the device driver.

LOGGING: a special state that CANNOT exist independently. It must be
       set alongside RUNNING or STOP (i.e. RUNNING & LOGGING,
       STOP & LOGGING).
       QEMU sets the LOGGING state in the .save_setup callback, so that
       the vendor driver can start dirty data logging for device memory
       and system memory.
       LOGGING only affects device/system memory. They return a whole
       snapshot outside LOGGING and dirty data since the last get
       operation inside LOGGING.
       Device config should always be accessible and return a whole
       config snapshot regardless of the LOGGING state.
       
Note:
The reason RUNNING is the default state is that a device's active state
must not depend on the device state interface.
It is possible that the vfio_device_state_ctl region fails to get
registered. In that case, a device needs to be in the active state by
default.

Get Version & Get Caps
----------------------
In the migration init phase, QEMU probes for the existence of the vendor
driver's device state regions, then gets the version of the device state
interface from the r/w control region.

Then it probes the VFIO device's data capabilities by reading the caps
field of the control region.
        #define VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY 1
        #define VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY 2
If VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY is on, QEMU will save/load device
        memory data in the pre-copy and stop-and-copy phases. The data
        of device memory is held in the device memory region.
If VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY is on, QEMU will query dirty pages
        produced by the VFIO device during the pre-copy and
        stop-and-copy phases. The dirty bitmap of system memory is held
        in the dirty bitmap region.

If QEMU fails to find the two mandatory regions or the optional data
regions corresponding to the data caps, or if the version mismatches, it
sets up a migration blocker and disables live migration for the VFIO
device.


Flows to call device state interface for VFIO live migration
------------------------------------------------------------

Live migration save path:

(QEMU LIVE MIGRATION STATE --> DEVICE STATE INTERFACE --> DEVICE STATE)

MIGRATION_STATUS_NONE --> VFIO_DEVICE_STATE_RUNNING
 |
MIGRATION_STATUS_SAVE_SETUP
 |
 .save_setup callback -->
 get device memory size (whole snapshot size)
 get device memory buffer (whole snapshot data)
 set device state --> VFIO_DEVICE_STATE_RUNNING & VFIO_DEVICE_STATE_LOGGING
 |
MIGRATION_STATUS_ACTIVE
 |
 .save_live_pending callback --> get device memory size (dirty data)
 .save_live_iteration callback --> get device memory buffer (dirty data)
 .log_sync callback --> get system memory dirty bitmap
 |
(vcpu stops) --> set device state -->
 VFIO_DEVICE_STATE_STOP & VFIO_DEVICE_STATE_LOGGING
 |
.save_live_complete_precopy callback -->
 get device memory size (dirty data)
 get device memory buffer (dirty data)
 get device config size (whole snapshot size)
 get device config buffer (whole snapshot data)
 |
.save_cleanup callback -->  set device state --> VFIO_DEVICE_STATE_STOP
MIGRATION_STATUS_COMPLETED

MIGRATION_STATUS_CANCELLED or
MIGRATION_STATUS_FAILED
 |
(vcpu starts) --> set device state --> VFIO_DEVICE_STATE_RUNNING


Live migration load path:

(QEMU LIVE MIGRATION STATE --> DEVICE STATE INTERFACE --> DEVICE STATE)

MIGRATION_STATUS_NONE --> VFIO_DEVICE_STATE_RUNNING
 |
(vcpu stops) --> set device state --> VFIO_DEVICE_STATE_STOP
 |
MIGRATION_STATUS_ACTIVE
 |
.load state callback -->
 set device memory size, set device memory buffer, set device config size,
 set device config buffer
 |
(vcpu starts) --> set device state --> VFIO_DEVICE_STATE_RUNNING
 |
MIGRATION_STATUS_COMPLETED



On the source VM side, in the pre-copy phase, if a device has
VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY on, QEMU first gets a whole snapshot
of device memory in the .save_setup callback, and then gets the total
size of dirty data in device memory in the .save_live_pending callback
by reading the device_memory.size field of the control region.
Then, in the .save_live_iteration callback, it gets the buffer of device
memory's dirty data chunk by chunk from the device memory region, by
writing pos and action (GET_BUFFER) to the device_memory.pos and
device_memory.action fields of the control region. (The size of each
chunk is the size of the device memory region.)
.save_live_pending and .save_live_iteration may be called several times
in the pre-copy phase to get dirty data in device memory.

If VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY is off, callbacks in the pre-copy
phase like .save_setup, .save_live_pending and .save_live_iteration will
not call the vendor driver's device state interface to get data from
device memory.

In the pre-copy phase, if a device has VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY
on, the .log_sync callback gets the system memory dirty bitmap from the
dirty bitmap region by writing the system memory's start address, page
count and action (GET_BITMAP) to the "system_memory.start_addr",
"system_memory.page_nr" and "system_memory.action" fields of the control
region.
If the page count passed to .log_sync is larger than the bitmap size the
dirty bitmap region supports, QEMU cuts it into chunks and calls the
vendor driver's get-system-memory-dirty-bitmap interface in succession.
If VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY is off, the .log_sync callback
just returns without calling the vendor driver.

In the stop-and-copy phase, the device state is first set to
STOP & LOGGING.
In the .save_live_complete_precopy callback, if
VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY is on, get device memory size and get
device memory buffer are called again.
After that, device config data is read from the device config region by
reading device_config.size of the control region and writing action
(GET_BUFFER) to device_config.action of the control region.
Then, after migration completes, the LOGGING state is cleared in the
cleanup handler (i.e. the device state is set to STOP).
Clearing the LOGGING state in the cleanup handler covers the
"migration failed" and "migration cancelled" cases as well: they can
also leverage the cleanup handler to unset the LOGGING state.


References
----------
1. kernel side implementation of Device state interfaces:
https://patchwork.freedesktop.org/series/56876/


Yan Zhao (5):
  vfio/migration: define kernel interfaces
  vfio/migration: support device of device config capability
  vfio/migration: tracking of dirty page in system memory
  vfio/migration: turn on migration
  vfio/migration: support device memory capability

 hw/vfio/Makefile.objs         |   2 +-
 hw/vfio/common.c              |  26 ++
 hw/vfio/migration.c           | 858 ++++++++++++++++++++++++++++++++++++++++++
 hw/vfio/pci.c                 |  10 +-
 hw/vfio/pci.h                 |  26 +-
 include/hw/vfio/vfio-common.h |   1 +
 linux-headers/linux/vfio.h    | 260 +++++++++++++
 7 files changed, 1174 insertions(+), 9 deletions(-)
 create mode 100644 hw/vfio/migration.c

Comments

Dr. David Alan Gilbert Feb. 19, 2019, 11:25 a.m. UTC | #1
* Yan Zhao (yan.y.zhao@intel.com) wrote:
> If a device has device memory capability, save/load data from device memory
> in pre-copy and stop-and-copy phases.
> 
> LOGGING state is set for device memory for dirty page logging:
> in LOGGING state, get device memory returns whole device memory snapshot;
> outside LOGGING state, get device memory returns dirty data since last get
> operation.
> 
> Usually, device memory is very big, qemu needs to chunk it into several
> pieces each with size of device memory region.
> 
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> ---
>  hw/vfio/migration.c | 235 ++++++++++++++++++++++++++++++++++++++++++++++++++--
>  hw/vfio/pci.h       |   1 +
>  2 files changed, 231 insertions(+), 5 deletions(-)
> 
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 16d6395..f1e9309 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -203,6 +203,201 @@ static int vfio_load_data_device_config(VFIOPCIDevice *vdev,
>      return 0;
>  }
>  
> +static int vfio_get_device_memory_size(VFIOPCIDevice *vdev)
> +{
> +    VFIODevice *vbasedev = &vdev->vbasedev;
> +    VFIORegion *region_ctl =
> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> +    uint64_t len;
> +    int sz;
> +
> +    sz = sizeof(len);
> +    if (pread(vbasedev->fd, &len, sz,
> +                region_ctl->fd_offset +
> +                offsetof(struct vfio_device_state_ctl, device_memory.size))
> +            != sz) {
> +        error_report("vfio: Failed to get length of device memory");
> +        return -1;
> +    }
> +    vdev->migration->devmem_size = len;
> +    return 0;
> +}
> +
> +static int vfio_set_device_memory_size(VFIOPCIDevice *vdev, uint64_t size)
> +{
> +    VFIODevice *vbasedev = &vdev->vbasedev;
> +    VFIORegion *region_ctl =
> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> +    int sz;
> +
> +    sz = sizeof(size);
> +    if (pwrite(vbasedev->fd, &size, sz,
> +                region_ctl->fd_offset +
> +                offsetof(struct vfio_device_state_ctl, device_memory.size))
> +            != sz) {
> +        error_report("vfio: Failed to set length of device comemory");
> +        return -1;
> +    }
> +    vdev->migration->devmem_size = size;
> +    return 0;
> +}
> +
> +static
> +int vfio_save_data_device_memory_chunk(VFIOPCIDevice *vdev, QEMUFile *f,
> +                                    uint64_t pos, uint64_t len)
> +{
> +    VFIODevice *vbasedev = &vdev->vbasedev;
> +    VFIORegion *region_ctl =
> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> +    VFIORegion *region_devmem =
> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_DEVICE_MEMORY];
> +    void *dest;
> +    uint32_t sz;
> +    uint8_t *buf = NULL;
> +    uint32_t action = VFIO_DEVICE_DATA_ACTION_GET_BUFFER;
> +
> +    if (len > region_devmem->size) {
> +        return -1;
> +    }
> +
> +    sz = sizeof(pos);
> +    if (pwrite(vbasedev->fd, &pos, sz,
> +                region_ctl->fd_offset +
> +                offsetof(struct vfio_device_state_ctl, device_memory.pos))
> +            != sz) {
> +        error_report("vfio: Failed to set save buffer pos");
> +        return -1;
> +    }
> +    sz = sizeof(action);
> +    if (pwrite(vbasedev->fd, &action, sz,
> +                region_ctl->fd_offset +
> +                offsetof(struct vfio_device_state_ctl, device_memory.action))
> +            != sz) {
> +        error_report("vfio: Failed to set save buffer action");
> +        return -1;
> +    }
> +
> +    if (!vfio_device_state_region_mmaped(region_devmem)) {
> +        buf = g_malloc(len);
> +        if (buf == NULL) {
> +            error_report("vfio: Failed to allocate memory for migrate");
> +            return -1;
> +        }
> +        if (pread(vbasedev->fd, buf, len, region_devmem->fd_offset) != len) {
> +            error_report("vfio: error load device memory buffer");

This path forgets to g_free(buf)

> +            return -1;
> +        }
> +        qemu_put_be64(f, len);
> +        qemu_put_be64(f, pos);
> +        qemu_put_buffer(f, buf, len);
> +        g_free(buf);
> +    } else {
> +        dest = region_devmem->mmaps[0].mmap;
> +        qemu_put_be64(f, len);
> +        qemu_put_be64(f, pos);
> +        qemu_put_buffer(f, dest, len);
> +    }
> +    return 0;
> +}
> +
> +static int vfio_save_data_device_memory(VFIOPCIDevice *vdev, QEMUFile *f)
> +{
> +    VFIORegion *region_devmem =
> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_DEVICE_MEMORY];
> +    uint64_t total_len = vdev->migration->devmem_size;
> +    uint64_t pos = 0;
> +
> +    qemu_put_be64(f, total_len);
> +    while (pos < total_len) {
> +        uint64_t len = region_devmem->size;
> +
> +        if (pos + len >= total_len) {
> +            len = total_len - pos;
> +        }
> +        if (vfio_save_data_device_memory_chunk(vdev, f, pos, len)) {
> +            return -1;
> +        }
> +    }
> +
> +    return 0;
> +}
> +
> +static
> +int vfio_load_data_device_memory_chunk(VFIOPCIDevice *vdev, QEMUFile *f,
> +                                uint64_t pos, uint64_t len)
> +{
> +    VFIODevice *vbasedev = &vdev->vbasedev;
> +    VFIORegion *region_ctl =
> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> +    VFIORegion *region_devmem =
> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_DEVICE_MEMORY];
> +
> +    void *dest;
> +    uint32_t sz;
> +    uint8_t *buf = NULL;
> +    uint32_t action = VFIO_DEVICE_DATA_ACTION_SET_BUFFER;
> +
> +    if (len > region_devmem->size) {
> +        return -1;
> +    }
> +
> +    sz = sizeof(pos);
> +    if (pwrite(vbasedev->fd, &pos, sz,
> +                region_ctl->fd_offset +
> +                offsetof(struct vfio_device_state_ctl, device_memory.pos))
> +            != sz) {
> +        error_report("vfio: Failed to set device memory buffer pos");
> +        return -1;
> +    }
> +    if (!vfio_device_state_region_mmaped(region_devmem)) {
> +        buf = g_malloc(len);
> +        if (buf == NULL) {
> +            error_report("vfio: Failed to allocate memory for migrate");
> +            return -1;
> +        }
> +        qemu_get_buffer(f, buf, len);
> +        if (pwrite(vbasedev->fd, buf, len,
> +                    region_devmem->fd_offset) != len) {
> +            error_report("vfio: Failed to load devie memory buffer");

Again, failed to free buf

> +            return -1;
> +        }
> +        g_free(buf);
> +    } else {
> +        dest = region_devmem->mmaps[0].mmap;
> +        qemu_get_buffer(f, dest, len);
> +    }

You might want to use qemu_file_get_error(f) before writing the data
to the device, to check for the case of a read error on the migration
stream that happened somewhere in the previous qemu_get's

> +    sz = sizeof(action);
> +    if (pwrite(vbasedev->fd, &action, sz,
> +                region_ctl->fd_offset +
> +                offsetof(struct vfio_device_state_ctl, device_memory.action))
> +            != sz) {
> +        error_report("vfio: Failed to set load device memory buffer action");
> +        return -1;
> +    }
> +
> +    return 0;
> +
> +}
> +
> +static int vfio_load_data_device_memory(VFIOPCIDevice *vdev,
> +                        QEMUFile *f, uint64_t total_len)
> +{
> +    uint64_t pos = 0, len = 0;
> +
> +    vfio_set_device_memory_size(vdev, total_len);
> +
> +    while (pos + len < total_len) {
> +        len = qemu_get_be64(f);
> +        pos = qemu_get_be64(f);

Please check len/pos - always assume that the migration stream could
be (maliciously or accidentally) corrupt.

> +        vfio_load_data_device_memory_chunk(vdev, f, pos, len);
> +    }
> +
> +    return 0;
> +}
> +
> +
>  static int vfio_set_dirty_page_bitmap_chunk(VFIOPCIDevice *vdev,
>          uint64_t start_addr, uint64_t page_nr)
>  {
> @@ -377,6 +572,10 @@ static void vfio_save_live_pending(QEMUFile *f, void *opaque,
>          return;
>      }
>  
> +    /* get dirty data size of device memory */
> +    vfio_get_device_memory_size(vdev);
> +
> +    *res_precopy_only += vdev->migration->devmem_size;
>      return;
>  }
>  
> @@ -388,7 +587,9 @@ static int vfio_save_iterate(QEMUFile *f, void *opaque)
>          return 0;
>      }
>  
> -    return 0;
> +    qemu_put_byte(f, VFIO_SAVE_FLAG_DEVMEMORY);
> +    /* get dirty data of device memory */
> +    return vfio_save_data_device_memory(vdev, f);
>  }
>  
>  static void vfio_pci_load_config(VFIOPCIDevice *vdev, QEMUFile *f)
> @@ -458,6 +659,10 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
>              len = qemu_get_be64(f);
>              vfio_load_data_device_config(vdev, f, len);
>              break;
> +        case VFIO_SAVE_FLAG_DEVMEMORY:
> +            len = qemu_get_be64(f);
> +            vfio_load_data_device_memory(vdev, f, len);
> +            break;
>          default:
>              ret = -EINVAL;
>          }
> @@ -503,6 +708,13 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
>      VFIOPCIDevice *vdev = opaque;
>      int rc = 0;
>  
> +    if (vfio_device_data_cap_device_memory(vdev)) {
> +        qemu_put_byte(f, VFIO_SAVE_FLAG_DEVMEMORY | VFIO_SAVE_FLAG_CONTINUE);
> +        /* get dirty data of device memory */
> +        vfio_get_device_memory_size(vdev);
> +        rc = vfio_save_data_device_memory(vdev, f);
> +    }
> +
>      qemu_put_byte(f, VFIO_SAVE_FLAG_PCI | VFIO_SAVE_FLAG_CONTINUE);
>      vfio_pci_save_config(vdev, f);
>  
> @@ -515,12 +727,22 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
>  
>  static int vfio_save_setup(QEMUFile *f, void *opaque)
>  {
> +    int rc = 0;
>      VFIOPCIDevice *vdev = opaque;
> -    qemu_put_byte(f, VFIO_SAVE_FLAG_SETUP);
> +
> +    if (vfio_device_data_cap_device_memory(vdev)) {
> +        qemu_put_byte(f, VFIO_SAVE_FLAG_SETUP | VFIO_SAVE_FLAG_CONTINUE);
> +        qemu_put_byte(f, VFIO_SAVE_FLAG_DEVMEMORY);
> +        /* get whole snapshot of device memory */
> +        vfio_get_device_memory_size(vdev);
> +        rc = vfio_save_data_device_memory(vdev, f);
> +    } else {
> +        qemu_put_byte(f, VFIO_SAVE_FLAG_SETUP);
> +    }
>  
>      vfio_set_device_state(vdev, VFIO_DEVICE_STATE_RUNNING |
>              VFIO_DEVICE_STATE_LOGGING);
> -    return 0;
> +    return rc;
>  }
>  
>  static int vfio_load_setup(QEMUFile *f, void *opaque)
> @@ -576,8 +798,11 @@ int vfio_migration_init(VFIOPCIDevice *vdev, Error **errp)
>          goto error;
>      }
>  
> -    if (vfio_device_data_cap_device_memory(vdev)) {
> -        error_report("No suppport of data cap device memory Yet");
> +    if (vfio_device_data_cap_device_memory(vdev) &&
> +            vfio_device_state_region_setup(vdev,
> +              &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_DEVICE_MEMORY],
> +              VFIO_REGION_SUBTYPE_DEVICE_STATE_DATA_MEMORY,
> +              "device-state-data-device-memory")) {
>          goto error;
>      }
>  
> diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
> index 4b7b1bb..a2cc64b 100644
> --- a/hw/vfio/pci.h
> +++ b/hw/vfio/pci.h
> @@ -69,6 +69,7 @@ typedef struct VFIOMigration {
>      uint32_t data_caps;
>      uint32_t device_state;
>      uint64_t devconfig_size;
> +    uint64_t devmem_size;
>      VMChangeStateEntry *vm_state;
>  } VFIOMigration;
>  
> -- 
> 2.7.4
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Dr. David Alan Gilbert Feb. 19, 2019, 11:32 a.m. UTC | #2
* Yan Zhao (yan.y.zhao@intel.com) wrote:
> This patchset enables VFIO devices to have live migration capability.
> Currently it does not support post-copy phase.
> 
> It follows Alex's comments on last version of VFIO live migration patches,
> including device states, VFIO device state region layout, dirty bitmap's
> query.

Hi,
  I've sent minor comments to later patches; but some minor general
comments:

  a) Never trust the incoming migrations stream - it might be corrupt,
    so check when you can.
  b) How do we detect if we're migrating from/to the wrong device or
version of device?  Or say to a device with older firmware or perhaps
a device that has less device memory ?
  c) Consider using the trace_ mechanism - it's really useful to
add to loops writing/reading data so that you can see when it fails.

Dave

(P.S. You have a few typos; grep your code for 'devcie', 'devie' and
'migrtion'.)

> Device Data
> -----------
> Device data is divided into three types: device memory, device config,
> and system memory dirty pages produced by device.
> 
> Device config: data like MMIOs, page tables...
>         Every device is supposed to possess device config data.
>     	Usually device config's size is small (no big than 10M), and it
>         needs to be loaded in certain strict order.
>         Therefore, device config only needs to be saved/loaded in
>         stop-and-copy phase.
>         The data of device config is held in device config region.
>         Size of device config data is smaller than or equal to that of
>         device config region.
> 
> Device Memory: device's internal memory, standalone and outside system
>         memory. It is usually very big.
>         This kind of data needs to be saved / loaded in pre-copy and
>         stop-and-copy phase.
>         The data of device memory is held in device memory region.
>         Size of devie memory is usually larger than that of device
>         memory region. qemu needs to save/load it in chunks of size of
>         device memory region.
>         Not all device has device memory. Like IGD only uses system memory.
> 
> System memory dirty pages: If a device produces dirty pages in system
>         memory, it is able to get dirty bitmap for certain range of system
>         memory. This dirty bitmap is queried in pre-copy and stop-and-copy
>         phase in .log_sync callback. By setting dirty bitmap in .log_sync
>         callback, dirty pages in system memory will be save/loaded by ram's
>         live migration code.
>         The dirty bitmap of system memory is held in dirty bitmap region.
>         If system memory range is larger than that dirty bitmap region can
>         hold, qemu will cut it into several chunks and get dirty bitmap in
>         succession.
> 
> 
> Device State Regions
> --------------------
> Vendor driver is required to expose two mandatory regions and another two
> optional regions if it plans to support device state management.
> 
> So, there are up to four regions in total.
> One control region: mandatory.
>         Get access via read/write system call.
>         Its layout is defined in struct vfio_device_state_ctl
> Three data regions: mmaped into qemu.
>         device config region: mandatory, holding data of device config
>         device memory region: optional, holding data of device memory
>         dirty bitmap region: optional, holding bitmap of system memory
>                             dirty pages
> 
> (The reason why four seperate regions are defined is that the unit of mmap
> system call is PAGE_SIZE, i.e. 4k bytes. So one read/write region for
> control and three mmaped regions for data seems better than one big region
> padded and sparse mmaped).
> 
> 
> kernel device state interface [1]
> --------------------------------------
> #define VFIO_DEVICE_STATE_INTERFACE_VERSION 1
> #define VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY 1
> #define VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY 2
> 
> #define VFIO_DEVICE_STATE_RUNNING 0 
> #define VFIO_DEVICE_STATE_STOP 1
> #define VFIO_DEVICE_STATE_LOGGING 2
> 
> #define VFIO_DEVICE_DATA_ACTION_GET_BUFFER 1
> #define VFIO_DEVICE_DATA_ACTION_SET_BUFFER 2
> #define VFIO_DEVICE_DATA_ACTION_GET_BITMAP 3
> 
> struct vfio_device_state_ctl {
> 	__u32 version;		  /* ro */
> 	__u32 device_state;       /* VFIO device state, wo */
> 	__u32 caps;		 /* ro */
>         struct {
> 		__u32 action;  /* wo, GET_BUFFER or SET_BUFFER */ 
> 		__u64 size;    /*rw*/
> 	} device_config;
> 	struct {
> 		__u32 action;    /* wo, GET_BUFFER or SET_BUFFER */ 
> 		__u64 size;     /* rw */  
>                 __u64 pos; /*the offset in total buffer of device memory*/
> 	} device_memory;
> 	struct {
> 		__u64 start_addr; /* wo */
> 		__u64 page_nr;   /* wo */
> 	} system_memory;
> };
> 
> Devcie States
> ------------- 
> After migration is initialzed, it will set device state via writing to
> device_state field of control region.
> 
> Four states are defined for a VFIO device:
>         RUNNING, RUNNING & LOGGING, STOP & LOGGING, STOP 
> 
> RUNNING: In this state, a VFIO device is in active state ready to receive
>         commands from device driver.
>         It is the default state that a VFIO device enters initially.
> 
> STOP:  In this state, a VFIO device is deactivated to interact with
>        device driver.
> 
> LOGGING: a special state that it CANNOT exist independently. It must be
>        set alongside with state RUNNING or STOP (i.e. RUNNING & LOGGING,
>        STOP & LOGGING).
>        Qemu will set LOGGING state on in .save_setup callbacks, then vendor
>        driver can start dirty data logging for device memory and system
>        memory.
>        LOGGING only impacts device/system memory. They return whole
>        snapshot outside LOGGING and dirty data since last get operation
>        inside LOGGING.
>        Device config should be always accessible and return whole config
>        snapshot regardless of LOGGING state.
>        
> Note:
> The reason why RUNNING is the default state is that device's active state
> must not depend on device state interface.
> It is possible that region vfio_device_state_ctl fails to get registered.
> In that condition, a device needs be in active state by default. 
> 
> Get Version & Get Caps
> ----------------------
> On migration init phase, qemu will probe the existence of device state
> regions of vendor driver, then get version of the device state interface
> from the r/w control region.
> 
> Then it will probe VFIO device's data capability by reading caps field of
> control region.
>         #define VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY 1
>         #define VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY 2
> If VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY is on, it will save/load data of
>         device memory in pre-copy and stop-and-copy phase. The data of
>         device memory is held in device memory region.
> If VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY is on, it will query of dirty pages
>         produced by VFIO device during pre-copy and stop-and-copy phase.
>         The dirty bitmap of system memory is held in dirty bitmap region.
> 
> If failing to find two mandatory regions and optional data regions
> corresponding to data caps or version mismatching, it will setup a
> migration blocker and disable live migration for VFIO device.
> 
> 
> Flows to call device state interface for VFIO live migration
> ------------------------------------------------------------
> 
> Live migration save path:
> 
> (QEMU LIVE MIGRATION STATE --> DEVICE STATE INTERFACE --> DEVICE STATE)
> 
> MIGRATION_STATUS_NONE --> VFIO_DEVICE_STATE_RUNNING
>  |
> MIGRATION_STATUS_SAVE_SETUP
>  |
>  .save_setup callback -->
>  get device memory size (whole snapshot size)
>  get device memory buffer (whole snapshot data)
>  set device state --> VFIO_DEVICE_STATE_RUNNING & VFIO_DEVICE_STATE_LOGGING
>  |
> MIGRATION_STATUS_ACTIVE
>  |
>  .save_live_pending callback --> get device memory size (dirty data)
>  .save_live_iteration callback --> get device memory buffer (dirty data)
>  .log_sync callback --> get system memory dirty bitmap
>  |
> (vcpu stops) --> set device state -->
>  VFIO_DEVICE_STATE_STOP & VFIO_DEVICE_STATE_LOGGING
>  |
> .save_live_complete_precopy callback -->
>  get device memory size (dirty data)
>  get device memory buffer (dirty data)
>  get device config size (whole snapshot size)
>  get device config buffer (whole snapshot data)
>  |
> .save_cleanup callback -->  set device state --> VFIO_DEVICE_STATE_STOP
> MIGRATION_STATUS_COMPLETED
> 
> MIGRATION_STATUS_CANCELLED or
> MIGRATION_STATUS_FAILED
>  |
> (vcpu starts) --> set device state --> VFIO_DEVICE_STATE_RUNNING
> 
> 
> Live migration load path:
> 
> (QEMU LIVE MIGRATION STATE --> DEVICE STATE INTERFACE --> DEVICE STATE)
> 
> MIGRATION_STATUS_NONE --> VFIO_DEVICE_STATE_RUNNING
>  |
> (vcpu stops) --> set device state --> VFIO_DEVICE_STATE_STOP
>  |
> MIGRATION_STATUS_ACTIVE
>  |
> .load_state callback -->
>  set device memory size, set device memory buffer, set device config size,
>  set device config buffer
>  |
> (vcpu starts) --> set device state --> VFIO_DEVICE_STATE_RUNNING
>  |
> MIGRATION_STATUS_COMPLETED
> 
> 
> 
> On the source VM side, in the pre-copy phase,
> if a device has VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY on,
> QEMU will first get a whole snapshot of device memory in the .save_setup
> callback, and then get the total size of dirty data in device memory in the
> .save_live_pending callback by reading the device_memory.size field of the
> control region.
> Then in the .save_live_iteration callback, it will get the buffer of device
> memory's dirty data chunk by chunk from the device memory region by writing
> pos & action (GET_BUFFER) to the device_memory.pos & device_memory.action
> fields of the control region. (The size of each chunk is the size of the
> device memory data region.)
> .save_live_pending and .save_live_iteration may be called several times in
> the pre-copy phase to get dirty data in device memory.
> 
> If VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY is off, callbacks in the pre-copy
> phase like .save_setup, .save_live_pending, and .save_live_iteration will
> not call the vendor driver's device state interface to get data from
> device memory.
> 
> In the pre-copy phase, if a device has VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY on,
> the .log_sync callback will get the system memory dirty bitmap from the dirty
> bitmap region by writing system memory's start address, page count and action
> (GET_BITMAP) to the "system_memory.start_addr", "system_memory.page_nr", and
> "system_memory.action" fields of the control region.
> If the page count passed in the .log_sync callback is larger than the bitmap
> size the dirty bitmap region supports, QEMU will cut it into chunks and call
> the vendor driver's get system memory dirty bitmap interface per chunk.
> If VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY is off, the .log_sync callback just
> returns without calling the vendor driver.
> 
> In the stop-and-copy phase, the device state will be set to STOP & LOGGING
> first. In the .save_live_complete_precopy callback,
> if VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY is on,
> get device memory size and get device memory buffer will be called again.
> After that,
> device config data is read from the device config region by reading
> device_config.size of the control region and writing action (GET_BUFFER) to
> device_config.action of the control region.
> Then after migration completes, in the cleanup handler, the LOGGING state
> will be cleared (i.e. the device state is set to STOP).
> Clearing the LOGGING state in the cleanup handler also covers the
> "migration failed" and "migration cancelled" cases, since they leverage the
> same cleanup handler to unset the LOGGING state.
> 
> 
> References
> ----------
> 1. kernel side implementation of Device state interfaces:
> https://patchwork.freedesktop.org/series/56876/
> 
> 
> Yan Zhao (5):
>   vfio/migration: define kernel interfaces
>   vfio/migration: support device of device config capability
>   vfio/migration: tracking of dirty page in system memory
>   vfio/migration: turn on migration
>   vfio/migration: support device memory capability
> 
>  hw/vfio/Makefile.objs         |   2 +-
>  hw/vfio/common.c              |  26 ++
>  hw/vfio/migration.c           | 858 ++++++++++++++++++++++++++++++++++++++++++
>  hw/vfio/pci.c                 |  10 +-
>  hw/vfio/pci.h                 |  26 +-
>  include/hw/vfio/vfio-common.h |   1 +
>  linux-headers/linux/vfio.h    | 260 +++++++++++++
>  7 files changed, 1174 insertions(+), 9 deletions(-)
>  create mode 100644 hw/vfio/migration.c
> 
> -- 
> 2.7.4
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Cornelia Huck Feb. 19, 2019, 1:09 p.m. UTC | #3
On Tue, 19 Feb 2019 16:52:14 +0800
Yan Zhao <yan.y.zhao@intel.com> wrote:

> - defined 4 device states regions: one control region and 3 data regions
> - defined layout of control region in struct vfio_device_state_ctl
> - defined 4 device states: running, stop, running&logging, stop&logging
> - define 3 device data categories: device config, device memory, system
>   memory
> - defined 2 device data capabilities: device memory and system memory
> - defined device state interfaces' version and 12 device state interfaces
> 
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> Signed-off-by: Kevin Tian <kevin.tian@intel.com>
> Signed-off-by: Yulei Zhang <yulei.zhang@intel.com>
> ---
>  linux-headers/linux/vfio.h | 260 +++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 260 insertions(+)

[commenting here for convenience; changes obviously need to be done in
the Linux patch]

> 
> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
> index ceb6453..a124fc1 100644
> --- a/linux-headers/linux/vfio.h
> +++ b/linux-headers/linux/vfio.h
> @@ -303,6 +303,56 @@ struct vfio_region_info_cap_type {
>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG	(2)
>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG	(3)
>  
> +/* Device State region type and sub-type
> + *
> + * A VFIO device driver needs to register up to four device state regions in
> + * total: two mandatory and another two optional, if it plans to support device
> + * state management.

Suggest to rephrase:

"A VFIO device driver that plans to support device state management
needs to register..."

> + *
> + * 1. region CTL :
> + *          Mandatory.
> + *          This is a control region.
> + *          Its layout is defined in struct vfio_device_state_ctl.
> + *          Reading from this region can get version, capabilities and data
> + *          size of device state interfaces.
> + *          Writing to this region can set device state, data size and
> + *          choose which interface to use.
> + * 2. region DEVICE_CONFIG
> + *          Mandatory.
> + *          This is a data region that holds device config data.
> + *          Device config is such kind of data like MMIOs, page tables...

"Device config is data such as..."

> + *          Every device is supposed to possess device config data.
> + *          Usually the size of device config data is small (no big

s/no big/no bigger/

> + *          than 10M), and it needs to be loaded in certain strict
> + *          order.
> + *          Therefore no dirty data logging is enabled for device
> + *          config and it must be got/set as a whole.
> + *          Size of device config data is smaller than or equal to that of
> + *          device config region.

Not sure if I understand that sentence correctly... but what if a
device has more config state than fits into this region? Is that
supposed to be covered by the device memory region? Or is this assumed
to be something so exotic that we don't need to plan for it?

> + *          It is able to be mmaped into user space.
> + * 3. region DEVICE_MEMORY
> + *          Optional.
> + *          This is a data region that holds device memory data.
> + *          Device memory is device's internal memory, standalone and outside

s/outside/distinct from/ ?

> + *          system memory.  It is usually very big.
> + *          Not all device has device memory. Like IGD only uses system

s/all devices has/all devices have/

s/Like/E.g./

> + *          memory and has no device memory.
> + *          Size of devie memory is usually larger than that of device

s/devie/device/

> + *          memory region. qemu needs to save/load it in chunks of size of
> + *          device memory region.

I'd rather not explicitly mention QEMU in this header. Maybe
"Userspace"?

> + *          It is able to be mmaped into user space.
> + * 4. region DIRTY_BITMAP
> + *          Optional.
> + *          This is a data region that holds bitmap of dirty pages in system
> + *          memory that a VFIO devices produces.
> + *          It is able to be mmaped into user space.
> + */
> +#define VFIO_REGION_TYPE_DEVICE_STATE           (1 << 1)

Can you make this an explicit number instead?

(FWIW, I plan to add a CCW region as type 2, whatever comes first.)

> +#define VFIO_REGION_SUBTYPE_DEVICE_STATE_CTL       (1)
> +#define VFIO_REGION_SUBTYPE_DEVICE_STATE_DATA_CONFIG      (2)
> +#define VFIO_REGION_SUBTYPE_DEVICE_STATE_DATA_MEMORY      (3)
> +#define VFIO_REGION_SUBTYPE_DEVICE_STATE_DATA_DIRTYBITMAP (4)
> +
>  /*
>   * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
>   * which allows direct access to non-MSIX registers which happened to be within
> @@ -816,6 +866,216 @@ struct vfio_iommu_spapr_tce_remove {
>  };
>  #define VFIO_IOMMU_SPAPR_TCE_REMOVE	_IO(VFIO_TYPE, VFIO_BASE + 20)
>  
> +/* version number of the device state interface */
> +#define VFIO_DEVICE_STATE_INTERFACE_VERSION 1

Hm. Is this supposed to be backwards-compatible, should we need to bump
this?

> +
> +/*
> + * For devices that have devcie memory, it is required to expose

s/devcie/device/

> + * DEVICE_MEMORY capability.
> + *
> + * For devices producing dirty pages in system memory, it is required to
> + * expose cap SYSTEM_MEMORY in order to get dirty bitmap in certain range
> + * of system memory.
> + */
> +#define VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY 1
> +#define VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY 2
> +
> +/*
> + * DEVICE STATES
> + *
> + * Four states are defined for a VFIO device:
> + * RUNNING, RUNNING & LOGGING, STOP & LOGGING, STOP.
> + * They can be set by writing to device_state field of
> + * vfio_device_state_ctl region.

Who controls this? Userspace?

> + *
> + * RUNNING: In this state, a VFIO device is in active state ready to
> + * receive commands from device driver.
> + * It is the default state that a VFIO device enters initially.
> + *
> + * STOP: In this state, a VFIO device is deactivated to interact with
> + * device driver.

I think 'STOPPED' would read nicer.

> + *
> + * LOGGING state is a special state that it CANNOT exist
> + * independently.

So it's not a state, but rather a modifier?

> + * It must be set alongside with state RUNNING or STOP, i.e,
> + * RUNNING & LOGGING, STOP & LOGGING.
> + * It is used for dirty data logging both for device memory
> + * and system memory.
> + *
> + * LOGGING only impacts device/system memory. In LOGGING state, get buffer
> + * of device memory returns dirty pages since last call; outside LOGGING
> + * state, get buffer of device memory returns whole snapshot of device
> + * memory. system memory's dirty page is only available in LOGGING state.
> + *
> + * Device config should be always accessible and return whole config snapshot
> + * regardless of LOGGING state.
> + * */
> +#define VFIO_DEVICE_STATE_RUNNING 0
> +#define VFIO_DEVICE_STATE_STOP 1
> +#define VFIO_DEVICE_STATE_LOGGING 2
> +
> +/* action to get data from device memory or device config
> + * the action is write to device state's control region, and data is read
> + * from device memory region or device config region.
> + * Each time before read device memory region or device config region,
> + * action VFIO_DEVICE_DATA_ACTION_GET_BUFFER is required to write to action
> + * field in control region. That is because device memory and devie config
> + * region is mmaped into user space. vendor driver has to be notified of
> + * the the GET_BUFFER action in advance.
> + */
> +#define VFIO_DEVICE_DATA_ACTION_GET_BUFFER 1
> +
> +/* action to set data to device memory or device config
> + * the action is write to device state's control region, and data is
> + * written to device memory region or device config region.
> + * Each time after write to device memory region or device config region,
> + * action VFIO_DEVICE_DATA_ACTION_GET_BUFFER is required to write to action
> + * field in control region. That is because device memory and devie config
> + * region is mmaped into user space. vendor driver has to be notified of
> + * the the SET_BUFFER action after data written.
> + */
> +#define VFIO_DEVICE_DATA_ACTION_SET_BUFFER 2

Let me describe this in my own words to make sure that I understand
this correctly.

- The actions are set by userspace to notify the kernel that it is
  going to get data or that it has just written data.
- This is needed as a notification that the mmapped data should not be
  changed resp. just has changed.

So, how does the kernel know whether the read action has finished resp.
whether the write action has started? Even if userspace reads/writes it
as a whole.

> +
> +/* layout of device state interfaces' control region
> + * By reading to control region and reading/writing data from device config
> + * region, device memory region, system memory regions, below interface can
> + * be implemented:
> + *
> + * 1. get version
> + *   (1) user space calls read system call on "version" field of control
> + *   region.
> + *   (2) vendor driver writes version number of device state interfaces
> + *   to the "version" field of control region.
> + *
> + * 2. get caps
> + *   (1) user space calls read system call on "caps" field of control region.
> + *   (2) if a VFIO device has huge device memory, vendor driver reports
> + *      VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY in "caps" field of control region.
> + *      if a VFIO device produces dirty pages in system memory, vendor driver
> + *      reports VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY in "caps" field of
> + *      control region.
> + *
> + * 3. set device state
> + *    (1) user space calls write system call on "device_state" field of
> + *    control region.
> + *    (2) device state transitions as:
> + *
> + *    RUNNING -- start dirty data logging --> RUNNING & LOGGING
> + *    RUNNING -- deactivate --> STOP
> + *    RUNNING -- deactivate & start dirty data longging --> STOP & LOGGING
> + *    RUNNING & LOGGING -- stop dirty data logging --> RUNNING
> + *    RUNNING & LOGGING -- deactivate --> STOP & LOGGING
> + *    RUNNING & LOGGING -- deactivate & stop dirty data logging --> STOP
> + *    STOP -- activate --> RUNNING
> + *    STOP -- start dirty data logging --> STOP & LOGGING
> + *    STOP -- activate & start dirty data logging --> RUNNING & LOGGING
> + *    STOP & LOGGING -- stop dirty data logging --> STOP
> + *    STOP & LOGGING -- activate --> RUNNING & LOGGING
> + *    STOP & LOGGING -- activate & stop dirty data logging --> RUNNING
> + *
> + * 4. get device config size
> + *   (1) user space calls read system call on "device_config.size" field of
> + *       control region for the total size of device config snapshot.
> + *   (2) vendor driver writes device config data's total size in
> + *       "device_config.size" field of control region.
> + *
> + * 5. set device config size
> + *   (1) user space calls write system call.
> + *       total size of device config snapshot --> "device_config.size" field
> + *       of control region.
> + *   (2) vendor driver reads device config data's total size from
> + *       "device_config.size" field of control region.
> + *
> + * 6 get device config buffer
> + *   (1) user space calls write system call.
> + *       "GET_BUFFER" --> "device_config.action" field of control region.
> + *   (2) vendor driver
> + *       a. gets whole snapshot for device config
> + *       b. writes whole device config snapshot to region
> + *       DEVICE_CONFIG.
> + *   (3) user space reads the whole of device config snapshot from region
> + *       DEVICE_CONFIG.
> + *
> + * 7. set device config buffer
> + *   (1) user space writes whole of device config data to region
> + *       DEVICE_CONFIG.
> + *   (2) user space calls write system call.
> + *       "SET_BUFFER" --> "device_config.action" field of control region.
> + *   (3) vendor driver loads whole of device config from region DEVICE_CONFIG.
> + *
> + * 8. get device memory size
> + *   (1) user space calls read system call on "device_memory.size" field of
> + *       control region for device memory size.
> + *   (2) vendor driver
> + *       a. gets device memory snapshot (in state RUNNING or STOP), or
> + *          gets device memory dirty data (in state RUNNING & LOGGING or
> + *          state STOP & LOGGING)
> + *       b. writes size in "device_memory.size" field of control region
> + *
> + * 9. set device memory size
> + *   (1) user space calls write system call on "device_memory.size" field of
> + *       control region to set total size of device memory snapshot.
> + *   (2) vendor driver reads device memory's size from "device_memory.size"
> + *       field of control region.
> + *
> + *
> + * 10. get device memory buffer
> + *   (1) user space calls write system.
> + *       pos --> "device_memory.pos" field of control region,
> + *       "GET_BUFFER" --> "device_memory.action" field of control region.
> + *       (pos must be 0 or multiples of length of region DEVICE_MEMORY).
> + *   (2) vendor driver writes N'th chunk of device memory snapshot/dirty data
> + *       to region DEVICE_MEMORY.
> + *       (N equals to pos/(region length of DEVICE_MEMORY))
> + *   (3) user space reads the N'th chunk of device memory snapshot/dirty data
> + *       from region DEVICE_MEMORY.
> + *
> + * 11. set device memory buffer
> + *   (1) user space writes N'th chunk of device memory snapshot/dirty data to
> + *       region DEVICE_MEMORY.
> + *       (N equals to pos/(region length of DEVICE_MEMORY))
> + *   (2) user space writes pos to "device_memory.pos" field and writes
> + *       "SET_BUFFER" to "device_memory.action" field of control region.
> + *   (3) vendor driver loads N'th chunk of device memory snapshot/dirty data
> + *       from region DEVICE_MEMORY.
> + *
> + * 12. get system memory dirty bitmap
> + *   (1) user space calls write system call to specify a range of system
> + *       memory that querying dirty pages.
> + *       system memory's start address --> "system_memory.start_addr" field
> + *       of control region,
> + *       system memory's page count --> "system_memory.page_nr" field of
> + *       control region.
> + *   (2) if device state is not in RUNNING or STOP & LOGGING,
> + *       vendor driver returns empty bitmap; otherwise,
> + *       vendor driver checks the page_nr,
> + *       if it's larger than the size that region DIRTY_BITMAP can support,
> + *       error returns; if not,
> + *       vendor driver returns as bitmap to specify dirty pages that
> + *       device produces since last query in this range of system memory .
> + *   (3) usespace reads back the dirty bitmap from region DIRTY_BITMAP.
> + *
> + */

It might make sense to extract the explanations above into a separate
design document in the kernel Documentation/ directory. You could also
add ASCII art there :)

> +
> +struct vfio_device_state_ctl {
> +	__u32 version;		  /* ro versio of devcie state interfaces*/

s/versio/version/
s/devcie/device/

> +	__u32 device_state;       /* VFIO device state, wo */
> +	__u32 caps;		 /* ro */
> +        struct {

Indentation looks a bit off.

> +		__u32 action;  /* wo, GET_BUFFER or SET_BUFFER */
> +		__u64 size;    /*rw, total size of device config*/
> +	} device_config;
> +	struct {
> +		__u32 action;    /* wo, GET_BUFFER or SET_BUFFER */
> +		__u64 size;     /* rw, total size of device memory*/
> +        __u64 pos;/*chunk offset in total buffer of device memory*/

Here as well.

> +	} device_memory;
> +	struct {
> +		__u64 start_addr; /* wo */
> +		__u64 page_nr;   /* wo */
> +	} system_memory;
> +}__attribute__((packed));

For an interface definition, it's probably better to avoid packed and
instead add padding if needed.

> +
>  /* ***************************************************************** */
>  
>  #endif /* VFIO_H */

On the whole, I think this is moving into the right direction.
Christophe de Dinechin Feb. 19, 2019, 2:42 p.m. UTC | #4
> On 19 Feb 2019, at 09:53, Yan Zhao <yan.y.zhao@intel.com> wrote:
> 
> If a device has device memory capability, save/load data from device memory
> in pre-copy and stop-and-copy phases.
> 
> LOGGING state is set for device memory for dirty page logging:
> in LOGGING state, get device memory returns whole device memory snapshot;
> outside LOGGING state, get device memory returns dirty data since last get
> operation.
> 
> Usually, device memory is very big, qemu needs to chunk it into several
> pieces each with size of device memory region.
> 
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> ---
> hw/vfio/migration.c | 235 ++++++++++++++++++++++++++++++++++++++++++++++++++--
> hw/vfio/pci.h       |   1 +
> 2 files changed, 231 insertions(+), 5 deletions(-)
> 
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 16d6395..f1e9309 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -203,6 +203,201 @@ static int vfio_load_data_device_config(VFIOPCIDevice *vdev,
>     return 0;
> }
> 
> +static int vfio_get_device_memory_size(VFIOPCIDevice *vdev)
> +{
> +    VFIODevice *vbasedev = &vdev->vbasedev;
> +    VFIORegion *region_ctl =
> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> +    uint64_t len;
> +    int sz;
> +
> +    sz = sizeof(len);
> +    if (pread(vbasedev->fd, &len, sz,
> +                region_ctl->fd_offset +
> +                offsetof(struct vfio_device_state_ctl, device_memory.size))
> +            != sz) {
> +        error_report("vfio: Failed to get length of device memory”);

s/length/size/ ? (to be consistent with function name)
> +        return -1;
> +    }
> +    vdev->migration->devmem_size = len;
> +    return 0;
> +}
> +
> +static int vfio_set_device_memory_size(VFIOPCIDevice *vdev, uint64_t size)
> +{
> +    VFIODevice *vbasedev = &vdev->vbasedev;
> +    VFIORegion *region_ctl =
> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> +    int sz;
> +
> +    sz = sizeof(size);
> +    if (pwrite(vbasedev->fd, &size, sz,
> +                region_ctl->fd_offset +
> +                offsetof(struct vfio_device_state_ctl, device_memory.size))
> +            != sz) {
> +        error_report("vfio: Failed to set length of device comemory”);

What is comemory? Typo?

Same comment about length vs size

> +        return -1;
> +    }
> +    vdev->migration->devmem_size = size;
> +    return 0;
> +}
> +
> +static
> +int vfio_save_data_device_memory_chunk(VFIOPCIDevice *vdev, QEMUFile *f,
> +                                    uint64_t pos, uint64_t len)
> +{
> +    VFIODevice *vbasedev = &vdev->vbasedev;
> +    VFIORegion *region_ctl =
> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> +    VFIORegion *region_devmem =
> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_DEVICE_MEMORY];
> +    void *dest;
> +    uint32_t sz;
> +    uint8_t *buf = NULL;
> +    uint32_t action = VFIO_DEVICE_DATA_ACTION_GET_BUFFER;
> +
> +    if (len > region_devmem->size) {

Is it intentional that there is no error_report here?

> +        return -1;
> +    }
> +
> +    sz = sizeof(pos);
> +    if (pwrite(vbasedev->fd, &pos, sz,
> +                region_ctl->fd_offset +
> +                offsetof(struct vfio_device_state_ctl, device_memory.pos))
> +            != sz) {
> +        error_report("vfio: Failed to set save buffer pos");
> +        return -1;
> +    }
> +    sz = sizeof(action);
> +    if (pwrite(vbasedev->fd, &action, sz,
> +                region_ctl->fd_offset +
> +                offsetof(struct vfio_device_state_ctl, device_memory.action))
> +            != sz) {
> +        error_report("vfio: Failed to set save buffer action");
> +        return -1;
> +    }
> +
> +    if (!vfio_device_state_region_mmaped(region_devmem)) {
> +        buf = g_malloc(len);
> +        if (buf == NULL) {
> +            error_report("vfio: Failed to allocate memory for migrate”);
s/migrate/migration/ ?
> +            return -1;
> +        }
> +        if (pread(vbasedev->fd, buf, len, region_devmem->fd_offset) != len) {
> +            error_report("vfio: error load device memory buffer”);
s/load/loading/ ?
> +            return -1;
> +        }
> +        qemu_put_be64(f, len);
> +        qemu_put_be64(f, pos);
> +        qemu_put_buffer(f, buf, len);
> +        g_free(buf);
> +    } else {
> +        dest = region_devmem->mmaps[0].mmap;
> +        qemu_put_be64(f, len);
> +        qemu_put_be64(f, pos);
> +        qemu_put_buffer(f, dest, len);
> +    }
> +    return 0;
> +}
> +
> +static int vfio_save_data_device_memory(VFIOPCIDevice *vdev, QEMUFile *f)
> +{
> +    VFIORegion *region_devmem =
> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_DEVICE_MEMORY];
> +    uint64_t total_len = vdev->migration->devmem_size;
> +    uint64_t pos = 0;
> +
> +    qemu_put_be64(f, total_len);
> +    while (pos < total_len) {
> +        uint64_t len = region_devmem->size;
> +
> +        if (pos + len >= total_len) {
> +            len = total_len - pos;
> +        }
> +        if (vfio_save_data_device_memory_chunk(vdev, f, pos, len)) {
> +            return -1;
> +        }

I don’t see where pos is incremented in this loop

> +    }
> +
> +    return 0;
> +}
> +
> +static
> +int vfio_load_data_device_memory_chunk(VFIOPCIDevice *vdev, QEMUFile *f,
> +                                uint64_t pos, uint64_t len)
> +{
> +    VFIODevice *vbasedev = &vdev->vbasedev;
> +    VFIORegion *region_ctl =
> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> +    VFIORegion *region_devmem =
> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_DEVICE_MEMORY];
> +
> +    void *dest;
> +    uint32_t sz;
> +    uint8_t *buf = NULL;
> +    uint32_t action = VFIO_DEVICE_DATA_ACTION_SET_BUFFER;
> +
> +    if (len > region_devmem->size) {

error_report?
> +        return -1;
> +    }
> +
> +    sz = sizeof(pos);
> +    if (pwrite(vbasedev->fd, &pos, sz,
> +                region_ctl->fd_offset +
> +                offsetof(struct vfio_device_state_ctl, device_memory.pos))
> +            != sz) {
> +        error_report("vfio: Failed to set device memory buffer pos");
> +        return -1;
> +    }
> +    if (!vfio_device_state_region_mmaped(region_devmem)) {
> +        buf = g_malloc(len);
> +        if (buf == NULL) {
> +            error_report("vfio: Failed to allocate memory for migrate");
> +            return -1;
> +        }
> +        qemu_get_buffer(f, buf, len);
> +        if (pwrite(vbasedev->fd, buf, len,
> +                    region_devmem->fd_offset) != len) {
> +            error_report("vfio: Failed to load devie memory buffer");
> +            return -1;
> +        }
> +        g_free(buf);
> +    } else {
> +        dest = region_devmem->mmaps[0].mmap;
> +        qemu_get_buffer(f, dest, len);
> +    }
> +
> +    sz = sizeof(action);
> +    if (pwrite(vbasedev->fd, &action, sz,
> +                region_ctl->fd_offset +
> +                offsetof(struct vfio_device_state_ctl, device_memory.action))
> +            != sz) {
> +        error_report("vfio: Failed to set load device memory buffer action");
> +        return -1;
> +    }
> +
> +    return 0;
> +
> +}
> +
> +static int vfio_load_data_device_memory(VFIOPCIDevice *vdev,
> +                        QEMUFile *f, uint64_t total_len)
> +{
> +    uint64_t pos = 0, len = 0;
> +
> +    vfio_set_device_memory_size(vdev, total_len);
> +
> +    while (pos + len < total_len) {
> +        len = qemu_get_be64(f);
> +        pos = qemu_get_be64(f);

Nit: load reads len/pos in the loop, whereas save does it in the
inner function (vfio_save_data_device_memory_chunk)

> +
> +        vfio_load_data_device_memory_chunk(vdev, f, pos, len);
> +    }
> +
> +    return 0;
> +}
> +
> +
> static int vfio_set_dirty_page_bitmap_chunk(VFIOPCIDevice *vdev,
>         uint64_t start_addr, uint64_t page_nr)
> {
> @@ -377,6 +572,10 @@ static void vfio_save_live_pending(QEMUFile *f, void *opaque,
>         return;
>     }
> 
> +    /* get dirty data size of device memory */
> +    vfio_get_device_memory_size(vdev);
> +
> +    *res_precopy_only += vdev->migration->devmem_size;
>     return;
> }
> 
> @@ -388,7 +587,9 @@ static int vfio_save_iterate(QEMUFile *f, void *opaque)
>         return 0;
>     }
> 
> -    return 0;
> +    qemu_put_byte(f, VFIO_SAVE_FLAG_DEVMEMORY);
> +    /* get dirty data of device memory */
> +    return vfio_save_data_device_memory(vdev, f);
> }
> 
> static void vfio_pci_load_config(VFIOPCIDevice *vdev, QEMUFile *f)
> @@ -458,6 +659,10 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
>             len = qemu_get_be64(f);
>             vfio_load_data_device_config(vdev, f, len);
>             break;
> +        case VFIO_SAVE_FLAG_DEVMEMORY:
> +            len = qemu_get_be64(f);
> +            vfio_load_data_device_memory(vdev, f, len);
> +            break;
>         default:
>             ret = -EINVAL;
>         }
> @@ -503,6 +708,13 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
>     VFIOPCIDevice *vdev = opaque;
>     int rc = 0;
> 
> +    if (vfio_device_data_cap_device_memory(vdev)) {
> +        qemu_put_byte(f, VFIO_SAVE_FLAG_DEVMEMORY | VFIO_SAVE_FLAG_CONTINUE);
> +        /* get dirty data of device memory */
> +        vfio_get_device_memory_size(vdev);
> +        rc = vfio_save_data_device_memory(vdev, f);
> +    }
> +
>     qemu_put_byte(f, VFIO_SAVE_FLAG_PCI | VFIO_SAVE_FLAG_CONTINUE);
>     vfio_pci_save_config(vdev, f);
> 
> @@ -515,12 +727,22 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
> 
> static int vfio_save_setup(QEMUFile *f, void *opaque)
> {
> +    int rc = 0;
>     VFIOPCIDevice *vdev = opaque;
> -    qemu_put_byte(f, VFIO_SAVE_FLAG_SETUP);
> +
> +    if (vfio_device_data_cap_device_memory(vdev)) {
> +        qemu_put_byte(f, VFIO_SAVE_FLAG_SETUP | VFIO_SAVE_FLAG_CONTINUE);
> +        qemu_put_byte(f, VFIO_SAVE_FLAG_DEVMEMORY);
> +        /* get whole snapshot of device memory */
> +        vfio_get_device_memory_size(vdev);
> +        rc = vfio_save_data_device_memory(vdev, f);
> +    } else {
> +        qemu_put_byte(f, VFIO_SAVE_FLAG_SETUP);
> +    }
> 
>     vfio_set_device_state(vdev, VFIO_DEVICE_STATE_RUNNING |
>             VFIO_DEVICE_STATE_LOGGING);
> -    return 0;
> +    return rc;
> }
> 
> static int vfio_load_setup(QEMUFile *f, void *opaque)
> @@ -576,8 +798,11 @@ int vfio_migration_init(VFIOPCIDevice *vdev, Error **errp)
>         goto error;
>     }
> 
> -    if (vfio_device_data_cap_device_memory(vdev)) {
> -        error_report("No suppport of data cap device memory Yet");
> +    if (vfio_device_data_cap_device_memory(vdev) &&
> +            vfio_device_state_region_setup(vdev,
> +              &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_DEVICE_MEMORY],
> +              VFIO_REGION_SUBTYPE_DEVICE_STATE_DATA_MEMORY,
> +              "device-state-data-device-memory")) {
>         goto error;
>     }
> 
> diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
> index 4b7b1bb..a2cc64b 100644
> --- a/hw/vfio/pci.h
> +++ b/hw/vfio/pci.h
> @@ -69,6 +69,7 @@ typedef struct VFIOMigration {
>     uint32_t data_caps;
>     uint32_t device_state;
>     uint64_t devconfig_size;
> +    uint64_t devmem_size;
>     VMChangeStateEntry *vm_state;
> } VFIOMigration;
> 
> -- 
> 2.7.4
> 
> _______________________________________________
> intel-gvt-dev mailing list
> intel-gvt-dev@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gvt-dev
Yan Zhao Feb. 20, 2019, 5:17 a.m. UTC | #5
On Tue, Feb 19, 2019 at 11:25:43AM +0000, Dr. David Alan Gilbert wrote:
> * Yan Zhao (yan.y.zhao@intel.com) wrote:
> > If a device has device memory capability, save/load data from device memory
> > in pre-copy and stop-and-copy phases.
> > 
> > LOGGING state is set for device memory for dirty page logging:
> > in LOGGING state, get device memory returns whole device memory snapshot;
> > outside LOGGING state, get device memory returns dirty data since last get
> > operation.
> > 
> > Usually, device memory is very big, qemu needs to chunk it into several
> > pieces each with size of device memory region.
> > 
> > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> > Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> > ---
> >  hw/vfio/migration.c | 235 ++++++++++++++++++++++++++++++++++++++++++++++++++--
> >  hw/vfio/pci.h       |   1 +
> >  2 files changed, 231 insertions(+), 5 deletions(-)
> > 
> > diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> > index 16d6395..f1e9309 100644
> > --- a/hw/vfio/migration.c
> > +++ b/hw/vfio/migration.c
> > @@ -203,6 +203,201 @@ static int vfio_load_data_device_config(VFIOPCIDevice *vdev,
> >      return 0;
> >  }
> >  
> > +static int vfio_get_device_memory_size(VFIOPCIDevice *vdev)
> > +{
> > +    VFIODevice *vbasedev = &vdev->vbasedev;
> > +    VFIORegion *region_ctl =
> > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> > +    uint64_t len;
> > +    int sz;
> > +
> > +    sz = sizeof(len);
> > +    if (pread(vbasedev->fd, &len, sz,
> > +                region_ctl->fd_offset +
> > +                offsetof(struct vfio_device_state_ctl, device_memory.size))
> > +            != sz) {
> > +        error_report("vfio: Failed to get length of device memory");
> > +        return -1;
> > +    }
> > +    vdev->migration->devmem_size = len;
> > +    return 0;
> > +}
> > +
> > +static int vfio_set_device_memory_size(VFIOPCIDevice *vdev, uint64_t size)
> > +{
> > +    VFIODevice *vbasedev = &vdev->vbasedev;
> > +    VFIORegion *region_ctl =
> > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> > +    int sz;
> > +
> > +    sz = sizeof(size);
> > +    if (pwrite(vbasedev->fd, &size, sz,
> > +                region_ctl->fd_offset +
> > +                offsetof(struct vfio_device_state_ctl, device_memory.size))
> > +            != sz) {
> > +        error_report("vfio: Failed to set length of device comemory");
> > +        return -1;
> > +    }
> > +    vdev->migration->devmem_size = size;
> > +    return 0;
> > +}
> > +
> > +static
> > +int vfio_save_data_device_memory_chunk(VFIOPCIDevice *vdev, QEMUFile *f,
> > +                                    uint64_t pos, uint64_t len)
> > +{
> > +    VFIODevice *vbasedev = &vdev->vbasedev;
> > +    VFIORegion *region_ctl =
> > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> > +    VFIORegion *region_devmem =
> > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_DEVICE_MEMORY];
> > +    void *dest;
> > +    uint32_t sz;
> > +    uint8_t *buf = NULL;
> > +    uint32_t action = VFIO_DEVICE_DATA_ACTION_GET_BUFFER;
> > +
> > +    if (len > region_devmem->size) {
> > +        return -1;
> > +    }
> > +
> > +    sz = sizeof(pos);
> > +    if (pwrite(vbasedev->fd, &pos, sz,
> > +                region_ctl->fd_offset +
> > +                offsetof(struct vfio_device_state_ctl, device_memory.pos))
> > +            != sz) {
> > +        error_report("vfio: Failed to set save buffer pos");
> > +        return -1;
> > +    }
> > +    sz = sizeof(action);
> > +    if (pwrite(vbasedev->fd, &action, sz,
> > +                region_ctl->fd_offset +
> > +                offsetof(struct vfio_device_state_ctl, device_memory.action))
> > +            != sz) {
> > +        error_report("vfio: Failed to set save buffer action");
> > +        return -1;
> > +    }
> > +
> > +    if (!vfio_device_state_region_mmaped(region_devmem)) {
> > +        buf = g_malloc(len);
> > +        if (buf == NULL) {
> > +            error_report("vfio: Failed to allocate memory for migrate");
> > +            return -1;
> > +        }
> > +        if (pread(vbasedev->fd, buf, len, region_devmem->fd_offset) != len) {
> > +            error_report("vfio: error load device memory buffer");
> 
> That's forgotten to g_free(buf)
>
Right, I'll correct that.

> > +            return -1;
> > +        }
> > +        qemu_put_be64(f, len);
> > +        qemu_put_be64(f, pos);
> > +        qemu_put_buffer(f, buf, len);
> > +        g_free(buf);
> > +    } else {
> > +        dest = region_devmem->mmaps[0].mmap;
> > +        qemu_put_be64(f, len);
> > +        qemu_put_be64(f, pos);
> > +        qemu_put_buffer(f, dest, len);
> > +    }
> > +    return 0;
> > +}
> > +
> > +static int vfio_save_data_device_memory(VFIOPCIDevice *vdev, QEMUFile *f)
> > +{
> > +    VFIORegion *region_devmem =
> > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_DEVICE_MEMORY];
> > +    uint64_t total_len = vdev->migration->devmem_size;
> > +    uint64_t pos = 0;
> > +
> > +    qemu_put_be64(f, total_len);
> > +    while (pos < total_len) {
> > +        uint64_t len = region_devmem->size;
> > +
> > +        if (pos + len >= total_len) {
> > +            len = total_len - pos;
> > +        }
> > +        if (vfio_save_data_device_memory_chunk(vdev, f, pos, len)) {
> > +            return -1;
> > +        }
> > +    }
> > +
> > +    return 0;
> > +}
> > +
> > +static
> > +int vfio_load_data_device_memory_chunk(VFIOPCIDevice *vdev, QEMUFile *f,
> > +                                uint64_t pos, uint64_t len)
> > +{
> > +    VFIODevice *vbasedev = &vdev->vbasedev;
> > +    VFIORegion *region_ctl =
> > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> > +    VFIORegion *region_devmem =
> > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_DEVICE_MEMORY];
> > +
> > +    void *dest;
> > +    uint32_t sz;
> > +    uint8_t *buf = NULL;
> > +    uint32_t action = VFIO_DEVICE_DATA_ACTION_SET_BUFFER;
> > +
> > +    if (len > region_devmem->size) {
> > +        return -1;
> > +    }
> > +
> > +    sz = sizeof(pos);
> > +    if (pwrite(vbasedev->fd, &pos, sz,
> > +                region_ctl->fd_offset +
> > +                offsetof(struct vfio_device_state_ctl, device_memory.pos))
> > +            != sz) {
> > +        error_report("vfio: Failed to set device memory buffer pos");
> > +        return -1;
> > +    }
> > +    if (!vfio_device_state_region_mmaped(region_devmem)) {
> > +        buf = g_malloc(len);
> > +        if (buf == NULL) {
> > +            error_report("vfio: Failed to allocate memory for migrate");
> > +            return -1;
> > +        }
> > +        qemu_get_buffer(f, buf, len);
> > +        if (pwrite(vbasedev->fd, buf, len,
> > +                    region_devmem->fd_offset) != len) {
> > +            error_report("vfio: Failed to load devie memory buffer");
> 
> Again, failed to free buf
> 
Right, I'll correct that.
> > +            return -1;
> > +        }
> > +        g_free(buf);
> > +    } else {
> > +        dest = region_devmem->mmaps[0].mmap;
> > +        qemu_get_buffer(f, dest, len);
> > +    }
> 
> You might want to use qemu_file_get_error(f)  before writing the data
> to the device, to check for the case of a read error on the migration
> stream that happened somewhere in the previous qemu_get's
>

ok.

> > +    sz = sizeof(action);
> > +    if (pwrite(vbasedev->fd, &action, sz,
> > +                region_ctl->fd_offset +
> > +                offsetof(struct vfio_device_state_ctl, device_memory.action))
> > +            != sz) {
> > +        error_report("vfio: Failed to set load device memory buffer action");
> > +        return -1;
> > +    }
> > +
> > +    return 0;
> > +
> > +}
> > +
> > +static int vfio_load_data_device_memory(VFIOPCIDevice *vdev,
> > +                        QEMUFile *f, uint64_t total_len)
> > +{
> > +    uint64_t pos = 0, len = 0;
> > +
> > +    vfio_set_device_memory_size(vdev, total_len);
> > +
> > +    while (pos + len < total_len) {
> > +        len = qemu_get_be64(f);
> > +        pos = qemu_get_be64(f);
> 
> Please check len/pos - always assume that the migration stream could
> be (maliciously or accidentally) corrupt.
>
ok.

> > +        vfio_load_data_device_memory_chunk(vdev, f, pos, len);
> > +    }
> > +
> > +    return 0;
> > +}
> > +
> > +
> >  static int vfio_set_dirty_page_bitmap_chunk(VFIOPCIDevice *vdev,
> >          uint64_t start_addr, uint64_t page_nr)
> >  {
> > @@ -377,6 +572,10 @@ static void vfio_save_live_pending(QEMUFile *f, void *opaque,
> >          return;
> >      }
> >  
> > +    /* get dirty data size of device memory */
> > +    vfio_get_device_memory_size(vdev);
> > +
> > +    *res_precopy_only += vdev->migration->devmem_size;
> >      return;
> >  }
> >  
> > @@ -388,7 +587,9 @@ static int vfio_save_iterate(QEMUFile *f, void *opaque)
> >          return 0;
> >      }
> >  
> > -    return 0;
> > +    qemu_put_byte(f, VFIO_SAVE_FLAG_DEVMEMORY);
> > +    /* get dirty data of device memory */
> > +    return vfio_save_data_device_memory(vdev, f);
> >  }
> >  
> >  static void vfio_pci_load_config(VFIOPCIDevice *vdev, QEMUFile *f)
> > @@ -458,6 +659,10 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
> >              len = qemu_get_be64(f);
> >              vfio_load_data_device_config(vdev, f, len);
> >              break;
> > +        case VFIO_SAVE_FLAG_DEVMEMORY:
> > +            len = qemu_get_be64(f);
> > +            vfio_load_data_device_memory(vdev, f, len);
> > +            break;
> >          default:
> >              ret = -EINVAL;
> >          }
> > @@ -503,6 +708,13 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
> >      VFIOPCIDevice *vdev = opaque;
> >      int rc = 0;
> >  
> > +    if (vfio_device_data_cap_device_memory(vdev)) {
> > +        qemu_put_byte(f, VFIO_SAVE_FLAG_DEVMEMORY | VFIO_SAVE_FLAG_CONTINUE);
> > +        /* get dirty data of device memory */
> > +        vfio_get_device_memory_size(vdev);
> > +        rc = vfio_save_data_device_memory(vdev, f);
> > +    }
> > +
> >      qemu_put_byte(f, VFIO_SAVE_FLAG_PCI | VFIO_SAVE_FLAG_CONTINUE);
> >      vfio_pci_save_config(vdev, f);
> >  
> > @@ -515,12 +727,22 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
> >  
> >  static int vfio_save_setup(QEMUFile *f, void *opaque)
> >  {
> > +    int rc = 0;
> >      VFIOPCIDevice *vdev = opaque;
> > -    qemu_put_byte(f, VFIO_SAVE_FLAG_SETUP);
> > +
> > +    if (vfio_device_data_cap_device_memory(vdev)) {
> > +        qemu_put_byte(f, VFIO_SAVE_FLAG_SETUP | VFIO_SAVE_FLAG_CONTINUE);
> > +        qemu_put_byte(f, VFIO_SAVE_FLAG_DEVMEMORY);
> > +        /* get whole snapshot of device memory */
> > +        vfio_get_device_memory_size(vdev);
> > +        rc = vfio_save_data_device_memory(vdev, f);
> > +    } else {
> > +        qemu_put_byte(f, VFIO_SAVE_FLAG_SETUP);
> > +    }
> >  
> >      vfio_set_device_state(vdev, VFIO_DEVICE_STATE_RUNNING |
> >              VFIO_DEVICE_STATE_LOGGING);
> > -    return 0;
> > +    return rc;
> >  }
> >  
> >  static int vfio_load_setup(QEMUFile *f, void *opaque)
> > @@ -576,8 +798,11 @@ int vfio_migration_init(VFIOPCIDevice *vdev, Error **errp)
> >          goto error;
> >      }
> >  
> > -    if (vfio_device_data_cap_device_memory(vdev)) {
> > -        error_report("No suppport of data cap device memory Yet");
> > +    if (vfio_device_data_cap_device_memory(vdev) &&
> > +            vfio_device_state_region_setup(vdev,
> > +              &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_DEVICE_MEMORY],
> > +              VFIO_REGION_SUBTYPE_DEVICE_STATE_DATA_MEMORY,
> > +              "device-state-data-device-memory")) {
> >          goto error;
> >      }
> >  
> > diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
> > index 4b7b1bb..a2cc64b 100644
> > --- a/hw/vfio/pci.h
> > +++ b/hw/vfio/pci.h
> > @@ -69,6 +69,7 @@ typedef struct VFIOMigration {
> >      uint32_t data_caps;
> >      uint32_t device_state;
> >      uint64_t devconfig_size;
> > +    uint64_t devmem_size;
> >      VMChangeStateEntry *vm_state;
> >  } VFIOMigration;
> >  
> > -- 
> > 2.7.4
> > 
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> _______________________________________________
> intel-gvt-dev mailing list
> intel-gvt-dev@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gvt-dev
Yan Zhao Feb. 20, 2019, 7:36 a.m. UTC | #6
On Tue, Feb 19, 2019 at 02:09:18PM +0100, Cornelia Huck wrote:
> On Tue, 19 Feb 2019 16:52:14 +0800
> Yan Zhao <yan.y.zhao@intel.com> wrote:
> 
> > - defined 4 device states regions: one control region and 3 data regions
> > - defined layout of control region in struct vfio_device_state_ctl
> > - defined 4 device states: running, stop, running&logging, stop&logging
> > - define 3 device data categories: device config, device memory, system
> >   memory
> > - defined 2 device data capabilities: device memory and system memory
> > - defined device state interfaces' version and 12 device state interfaces
> > 
> > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> > Signed-off-by: Kevin Tian <kevin.tian@intel.com>
> > Signed-off-by: Yulei Zhang <yulei.zhang@intel.com>
> > ---
> >  linux-headers/linux/vfio.h | 260 +++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 260 insertions(+)
> 
> [commenting here for convenience; changes obviously need to be done in
> the Linux patch]
> 
yes, you can find the corresponding kernel part code at
https://patchwork.freedesktop.org/series/56876/


> > 
> > diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
> > index ceb6453..a124fc1 100644
> > --- a/linux-headers/linux/vfio.h
> > +++ b/linux-headers/linux/vfio.h
> > @@ -303,6 +303,56 @@ struct vfio_region_info_cap_type {
> >  #define VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG	(2)
> >  #define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG	(3)
> >  
> > +/* Device State region type and sub-type
> > + *
> > + * A VFIO device driver needs to register up to four device state regions in
> > + * total: two mandatory and another two optional, if it plans to support device
> > + * state management.
> 
> Suggest to rephrase:
> 
> "A VFIO device driver that plans to support device state management
> needs to register..."
>
ok :)

> > + *
> > + * 1. region CTL :
> > + *          Mandatory.
> > + *          This is a control region.
> > + *          Its layout is defined in struct vfio_device_state_ctl.
> > + *          Reading from this region can get version, capabilities and data
> > + *          size of device state interfaces.
> > + *          Writing to this region can set device state, data size and
> > + *          choose which interface to use.
> > + * 2. region DEVICE_CONFIG
> > + *          Mandatory.
> > + *          This is a data region that holds device config data.
> > + *          Device config is such kind of data like MMIOs, page tables...
> 
> "Device config is data such as..."

ok :)
> 
> > + *          Every device is supposed to possess device config data.
> > + *          Usually the size of device config data is small (no big
> 
> s/no big/no bigger/

right :)
> 
> > + *          than 10M), and it needs to be loaded in certain strict
> > + *          order.
> > + *          Therefore no dirty data logging is enabled for device
> > + *          config and it must be got/set as a whole.
> > + *          Size of device config data is smaller than or equal to that of
> > + *          device config region.
> 
> Not sure if I understand that sentence correctly... but what if a
> device has more config state than fits into this region? Is that
> supposed to be covered by the device memory region? Or is this assumed
> to be something so exotic that we don't need to plan for it?
> 
Both the device config data and the device config region are provided by the
vendor driver, so the vendor driver can always create a device config region
large enough to hold the device config data.
So, if a device has data that is better saved after the device stops and that
must be saved/loaded in strict order, that data needs to go in the device
config region; this kind of data is supposed to be small.
If the device data can be saved/loaded in several passes, it can instead be
put into the device memory region.


> > + *          It is able to be mmaped into user space.
> > + * 3. region DEVICE_MEMORY
> > + *          Optional.
> > + *          This is a data region that holds device memory data.
> > + *          Device memory is device's internal memory, standalone and outside
> 
> s/outside/distinct from/ ?
ok.
> 
> > + *          system memory.  It is usually very big.
> > + *          Not all device has device memory. Like IGD only uses system
> 
> s/all devices has/all devices have/
> 
> s/Like/E.g./
>
ok :)

> > + *          memory and has no device memory.
> > + *          Size of devie memory is usually larger than that of device
> 
> s/devie/device/
> 
thanks:)

> > + *          memory region. qemu needs to save/load it in chunks of size of
> > + *          device memory region.
> 
> I'd rather not explicitly mention QEMU in this header. Maybe
> "Userspace"?
>
ok.

> > + *          It is able to be mmaped into user space.
> > + * 4. region DIRTY_BITMAP
> > + *          Optional.
> > + *          This is a data region that holds bitmap of dirty pages in system
> > + *          memory that a VFIO devices produces.
> > + *          It is able to be mmaped into user space.
> > + */
> > +#define VFIO_REGION_TYPE_DEVICE_STATE           (1 << 1)
> 
> Can you make this an explicit number instead?
> 
> (FWIW, I plan to add a CCW region as type 2, whatever comes first.)
ok :)

> 
> > +#define VFIO_REGION_SUBTYPE_DEVICE_STATE_CTL       (1)
> > +#define VFIO_REGION_SUBTYPE_DEVICE_STATE_DATA_CONFIG      (2)
> > +#define VFIO_REGION_SUBTYPE_DEVICE_STATE_DATA_MEMORY      (3)
> > +#define VFIO_REGION_SUBTYPE_DEVICE_STATE_DATA_DIRTYBITMAP (4)
> > +
> >  /*
> >   * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
> >   * which allows direct access to non-MSIX registers which happened to be within
> > @@ -816,6 +866,216 @@ struct vfio_iommu_spapr_tce_remove {
> >  };
> >  #define VFIO_IOMMU_SPAPR_TCE_REMOVE	_IO(VFIO_TYPE, VFIO_BASE + 20)
> >  
> > +/* version number of the device state interface */
> > +#define VFIO_DEVICE_STATE_INTERFACE_VERSION 1
> 
> Hm. Is this supposed to be backwards-compatible, should we need to bump
> this?
>
Currently it is not backwards-compatible; we can discuss that.

> > +
> > +/*
> > + * For devices that have devcie memory, it is required to expose
> 
> s/devcie/device/
> 
> > + * DEVICE_MEMORY capability.
> > + *
> > + * For devices producing dirty pages in system memory, it is required to
> > + * expose cap SYSTEM_MEMORY in order to get dirty bitmap in certain range
> > + * of system memory.
> > + */
> > +#define VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY 1
> > +#define VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY 2
> > +
> > +/*
> > + * DEVICE STATES
> > + *
> > + * Four states are defined for a VFIO device:
> > + * RUNNING, RUNNING & LOGGING, STOP & LOGGING, STOP.
> > + * They can be set by writing to device_state field of
> > + * vfio_device_state_ctl region.
> 
> Who controls this? Userspace?

Yes. Userspace notifies vendor driver to do the state switching.

> > + *
> > + * RUNNING: In this state, a VFIO device is in active state ready to
> > + * receive commands from device driver.
> > + * It is the default state that a VFIO device enters initially.
> > + *
> > + * STOP: In this state, a VFIO device is deactivated to interact with
> > + * device driver.
> 
> I think 'STOPPED' would read nicer.
> 
sounds better :)

> > + *
> > + * LOGGING state is a special state that it CANNOT exist
> > + * independently.
> 
> So it's not a state, but rather a modifier?
> 
Yes. Or think of LOGGING / not LOGGING as bit 1 of the device state, with
RUNNING / STOPPED as bit 0.
They have to be read as a whole.


> > + * It must be set alongside with state RUNNING or STOP, i.e,
> > + * RUNNING & LOGGING, STOP & LOGGING.
> > + * It is used for dirty data logging both for device memory
> > + * and system memory.
> > + *
> > + * LOGGING only impacts device/system memory. In LOGGING state, get buffer
> > + * of device memory returns dirty pages since last call; outside LOGGING
> > + * state, get buffer of device memory returns whole snapshot of device
> > + * memory. system memory's dirty page is only available in LOGGING state.
> > + *
> > + * Device config should be always accessible and return whole config snapshot
> > + * regardless of LOGGING state.
> > + * */
> > +#define VFIO_DEVICE_STATE_RUNNING 0
> > +#define VFIO_DEVICE_STATE_STOP 1
> > +#define VFIO_DEVICE_STATE_LOGGING 2
> > +
> > +/* action to get data from device memory or device config
> > + * the action is write to device state's control region, and data is read
> > + * from device memory region or device config region.
> > + * Each time before read device memory region or device config region,
> > + * action VFIO_DEVICE_DATA_ACTION_GET_BUFFER is required to write to action
> > + * field in control region. That is because device memory and devie config
> > + * region is mmaped into user space. vendor driver has to be notified of
> > + * the the GET_BUFFER action in advance.
> > + */
> > +#define VFIO_DEVICE_DATA_ACTION_GET_BUFFER 1
> > +
> > +/* action to set data to device memory or device config
> > + * the action is write to device state's control region, and data is
> > + * written to device memory region or device config region.
> > + * Each time after write to device memory region or device config region,
> > + * action VFIO_DEVICE_DATA_ACTION_GET_BUFFER is required to write to action
> > + * field in control region. That is because device memory and devie config
> > + * region is mmaped into user space. vendor driver has to be notified of
> > + * the the SET_BUFFER action after data written.
> > + */
> > +#define VFIO_DEVICE_DATA_ACTION_SET_BUFFER 2
> 
> Let me describe this in my own words to make sure that I understand
> this correctly.
> 
> - The actions are set by userspace to notify the kernel that it is
>   going to get data or that it has just written data.
> - This is needed as a notification that the mmapped data should not be
>   changed resp. just has changed.
We need this notification because userspace accesses the mmapped data through
the pointer returned from mmap(). Reads and writes through that pointer cause
no page faults or read/write system calls, so the vendor driver cannot tell
whether an access has happened.
Therefore, before userspace reads through the mmapped pointer, it first writes
the action field in the control region (through a write system call), and the
vendor driver does not return from that write system call until the data is
prepared.
When userspace writes through the mmapped pointer, it first writes the data
into the data region, then writes the action field in the control region
(through a write system call) to notify the vendor driver. The vendor driver
returns from that system call only after it has copied the buffer completely.
> 
> So, how does the kernel know whether the read action has finished resp.
> whether the write action has started? Even if userspace reads/writes it
> as a whole.
> 
The kernel does not touch the data region except in response to an "action"
write system call.
> > +
> > +/* layout of device state interfaces' control region
> > + * By reading to control region and reading/writing data from device config
> > + * region, device memory region, system memory regions, below interface can
> > + * be implemented:
> > + *
> > + * 1. get version
> > + *   (1) user space calls read system call on "version" field of control
> > + *   region.
> > + *   (2) vendor driver writes version number of device state interfaces
> > + *   to the "version" field of control region.
> > + *
> > + * 2. get caps
> > + *   (1) user space calls read system call on "caps" field of control region.
> > + *   (2) if a VFIO device has huge device memory, vendor driver reports
> > + *      VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY in "caps" field of control region.
> > + *      if a VFIO device produces dirty pages in system memory, vendor driver
> > + *      reports VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY in "caps" field of
> > + *      control region.
> > + *
> > + * 3. set device state
> > + *    (1) user space calls write system call on "device_state" field of
> > + *    control region.
> > + *    (2) device state transitions as:
> > + *
> > + *    RUNNING -- start dirty data logging --> RUNNING & LOGGING
> > + *    RUNNING -- deactivate --> STOP
> > + *    RUNNING -- deactivate & start dirty data longging --> STOP & LOGGING
> > + *    RUNNING & LOGGING -- stop dirty data logging --> RUNNING
> > + *    RUNNING & LOGGING -- deactivate --> STOP & LOGGING
> > + *    RUNNING & LOGGING -- deactivate & stop dirty data logging --> STOP
> > + *    STOP -- activate --> RUNNING
> > + *    STOP -- start dirty data logging --> STOP & LOGGING
> > + *    STOP -- activate & start dirty data logging --> RUNNING & LOGGING
> > + *    STOP & LOGGING -- stop dirty data logging --> STOP
> > + *    STOP & LOGGING -- activate --> RUNNING & LOGGING
> > + *    STOP & LOGGING -- activate & stop dirty data logging --> RUNNING
> > + *
> > + * 4. get device config size
> > + *   (1) user space calls read system call on "device_config.size" field of
> > + *       control region for the total size of device config snapshot.
> > + *   (2) vendor driver writes device config data's total size in
> > + *       "device_config.size" field of control region.
> > + *
> > + * 5. set device config size
> > + *   (1) user space calls write system call.
> > + *       total size of device config snapshot --> "device_config.size" field
> > + *       of control region.
> > + *   (2) vendor driver reads device config data's total size from
> > + *       "device_config.size" field of control region.
> > + *
> > + * 6 get device config buffer
> > + *   (1) user space calls write system call.
> > + *       "GET_BUFFER" --> "device_config.action" field of control region.
> > + *   (2) vendor driver
> > + *       a. gets whole snapshot for device config
> > + *       b. writes whole device config snapshot to region
> > + *       DEVICE_CONFIG.
> > + *   (3) user space reads the whole of device config snapshot from region
> > + *       DEVICE_CONFIG.
> > + *
> > + * 7. set device config buffer
> > + *   (1) user space writes whole of device config data to region
> > + *       DEVICE_CONFIG.
> > + *   (2) user space calls write system call.
> > + *       "SET_BUFFER" --> "device_config.action" field of control region.
> > + *   (3) vendor driver loads whole of device config from region DEVICE_CONFIG.
> > + *
> > + * 8. get device memory size
> > + *   (1) user space calls read system call on "device_memory.size" field of
> > + *       control region for device memory size.
> > + *   (2) vendor driver
> > + *       a. gets device memory snapshot (in state RUNNING or STOP), or
> > + *          gets device memory dirty data (in state RUNNING & LOGGING or
> > + *          state STOP & LOGGING)
> > + *       b. writes size in "device_memory.size" field of control region
> > + *
> > + * 9. set device memory size
> > + *   (1) user space calls write system call on "device_memory.size" field of
> > + *       control region to set total size of device memory snapshot.
> > + *   (2) vendor driver reads device memory's size from "device_memory.size"
> > + *       field of control region.
> > + *
> > + *
> > + * 10. get device memory buffer
> > + *   (1) user space calls write system.
> > + *       pos --> "device_memory.pos" field of control region,
> > + *       "GET_BUFFER" --> "device_memory.action" field of control region.
> > + *       (pos must be 0 or multiples of length of region DEVICE_MEMORY).
> > + *   (2) vendor driver writes N'th chunk of device memory snapshot/dirty data
> > + *       to region DEVICE_MEMORY.
> > + *       (N equals to pos/(region length of DEVICE_MEMORY))
> > + *   (3) user space reads the N'th chunk of device memory snapshot/dirty data
> > + *       from region DEVICE_MEMORY.
> > + *
> > + * 11. set device memory buffer
> > + *   (1) user space writes N'th chunk of device memory snapshot/dirty data to
> > + *       region DEVICE_MEMORY.
> > + *       (N equals to pos/(region length of DEVICE_MEMORY))
> > + *   (2) user space writes pos to "device_memory.pos" field and writes
> > + *       "SET_BUFFER" to "device_memory.action" field of control region.
> > + *   (3) vendor driver loads N'th chunk of device memory snapshot/dirty data
> > + *       from region DEVICE_MEMORY.
> > + *
> > + * 12. get system memory dirty bitmap
> > + *   (1) user space calls write system call to specify a range of system
> > + *       memory that querying dirty pages.
> > + *       system memory's start address --> "system_memory.start_addr" field
> > + *       of control region,
> > + *       system memory's page count --> "system_memory.page_nr" field of
> > + *       control region.
> > + *   (2) if device state is not in RUNNING or STOP & LOGGING,
> > + *       vendor driver returns empty bitmap; otherwise,
> > + *       vendor driver checks the page_nr,
> > + *       if it's larger than the size that region DIRTY_BITMAP can support,
> > + *       error returns; if not,
> > + *       vendor driver returns as bitmap to specify dirty pages that
> > + *       device produces since last query in this range of system memory .
> > + *   (3) usespace reads back the dirty bitmap from region DIRTY_BITMAP.
> > + *
> > + */
> 
> It might make sense to extract the explanations above into a separate
> design document in the kernel Documentation/ directory. You could also
> add ASCII art there :)
>
yes, a diagram is better:)

> > +
> > +struct vfio_device_state_ctl {
> > +	__u32 version;		  /* ro versio of devcie state interfaces*/
> 
> s/versio/version/
> s/devcie/device/
> 
thanks~
> > +	__u32 device_state;       /* VFIO device state, wo */
> > +	__u32 caps;		 /* ro */
> > +        struct {
> 
> Indentation looks a bit off.
> 
> > +		__u32 action;  /* wo, GET_BUFFER or SET_BUFFER */
> > +		__u64 size;    /*rw, total size of device config*/
> > +	} device_config;
> > +	struct {
> > +		__u32 action;    /* wo, GET_BUFFER or SET_BUFFER */
> > +		__u64 size;     /* rw, total size of device memory*/
> > +        __u64 pos;/*chunk offset in total buffer of device memory*/
> 
> Here as well.
> 
thanks~
> > +	} device_memory;
> > +	struct {
> > +		__u64 start_addr; /* wo */
> > +		__u64 page_nr;   /* wo */
> > +	} system_memory;
> > +}__attribute__((packed));
> 
> For an interface definition, it's probably better to avoid packed and
> instead add padding if needed.
> 
OK, so just removing the __attribute__((packed)) is enough for this
interface.

> > +
> >  /* ***************************************************************** */
> >  
> >  #endif /* VFIO_H */
> 
> On the whole, I think this is moving into the right direction.
Yan Zhao Feb. 20, 2019, 7:58 a.m. UTC | #7
On Tue, Feb 19, 2019 at 03:42:36PM +0100, Christophe de Dinechin wrote:
> 
> 
> > On 19 Feb 2019, at 09:53, Yan Zhao <yan.y.zhao@intel.com> wrote:
> > 
> > If a device has device memory capability, save/load data from device memory
> > in pre-copy and stop-and-copy phases.
> > 
> > LOGGING state is set for device memory for dirty page logging:
> > in LOGGING state, get device memory returns whole device memory snapshot;
> > outside LOGGING state, get device memory returns dirty data since last get
> > operation.
> > 
> > Usually, device memory is very big, qemu needs to chunk it into several
> > pieces each with size of device memory region.
> > 
> > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> > Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> > ---
> > hw/vfio/migration.c | 235 ++++++++++++++++++++++++++++++++++++++++++++++++++--
> > hw/vfio/pci.h       |   1 +
> > 2 files changed, 231 insertions(+), 5 deletions(-)
> > 
> > diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> > index 16d6395..f1e9309 100644
> > --- a/hw/vfio/migration.c
> > +++ b/hw/vfio/migration.c
> > @@ -203,6 +203,201 @@ static int vfio_load_data_device_config(VFIOPCIDevice *vdev,
> >     return 0;
> > }
> > 
> > +static int vfio_get_device_memory_size(VFIOPCIDevice *vdev)
> > +{
> > +    VFIODevice *vbasedev = &vdev->vbasedev;
> > +    VFIORegion *region_ctl =
> > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> > +    uint64_t len;
> > +    int sz;
> > +
> > +    sz = sizeof(len);
> > +    if (pread(vbasedev->fd, &len, sz,
> > +                region_ctl->fd_offset +
> > +                offsetof(struct vfio_device_state_ctl, device_memory.size))
> > +            != sz) {
> > +        error_report("vfio: Failed to get length of device memory");
> 
> s/length/size/ ? (to be consistent with function name)

ok. thanks
> > +        return -1;
> > +    }
> > +    vdev->migration->devmem_size = len;
> > +    return 0;
> > +}
> > +
> > +static int vfio_set_device_memory_size(VFIOPCIDevice *vdev, uint64_t size)
> > +{
> > +    VFIODevice *vbasedev = &vdev->vbasedev;
> > +    VFIORegion *region_ctl =
> > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> > +    int sz;
> > +
> > +    sz = sizeof(size);
> > +    if (pwrite(vbasedev->fd, &size, sz,
> > +                region_ctl->fd_offset +
> > +                offsetof(struct vfio_device_state_ctl, device_memory.size))
> > +            != sz) {
> > +        error_report("vfio: Failed to set length of device comemory");
> 
> What is comemory? Typo?

Right, typo. should be "memory" :)
> 
> Same comment about length vs size
>
got it. thanks

> > +        return -1;
> > +    }
> > +    vdev->migration->devmem_size = size;
> > +    return 0;
> > +}
> > +
> > +static
> > +int vfio_save_data_device_memory_chunk(VFIOPCIDevice *vdev, QEMUFile *f,
> > +                                    uint64_t pos, uint64_t len)
> > +{
> > +    VFIODevice *vbasedev = &vdev->vbasedev;
> > +    VFIORegion *region_ctl =
> > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> > +    VFIORegion *region_devmem =
> > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_DEVICE_MEMORY];
> > +    void *dest;
> > +    uint32_t sz;
> > +    uint8_t *buf = NULL;
> > +    uint32_t action = VFIO_DEVICE_DATA_ACTION_GET_BUFFER;
> > +
> > +    if (len > region_devmem->size) {
> 
> Is it intentional that there is no error_report here?
>
an error_report here may be better.
> > +        return -1;
> > +    }
> > +
> > +    sz = sizeof(pos);
> > +    if (pwrite(vbasedev->fd, &pos, sz,
> > +                region_ctl->fd_offset +
> > +                offsetof(struct vfio_device_state_ctl, device_memory.pos))
> > +            != sz) {
> > +        error_report("vfio: Failed to set save buffer pos");
> > +        return -1;
> > +    }
> > +    sz = sizeof(action);
> > +    if (pwrite(vbasedev->fd, &action, sz,
> > +                region_ctl->fd_offset +
> > +                offsetof(struct vfio_device_state_ctl, device_memory.action))
> > +            != sz) {
> > +        error_report("vfio: Failed to set save buffer action");
> > +        return -1;
> > +    }
> > +
> > +    if (!vfio_device_state_region_mmaped(region_devmem)) {
> > +        buf = g_malloc(len);
> > +        if (buf == NULL) {
> > +            error_report("vfio: Failed to allocate memory for migrate");
> s/migrate/migration/ ?

yes, thanks
> > +            return -1;
> > +        }
> > +        if (pread(vbasedev->fd, buf, len, region_devmem->fd_offset) != len) {
> > +            error_report("vfio: error load device memory buffer");
> s/load/loading/ ?
error to load? :)

> > +            return -1;
> > +        }
> > +        qemu_put_be64(f, len);
> > +        qemu_put_be64(f, pos);
> > +        qemu_put_buffer(f, buf, len);
> > +        g_free(buf);
> > +    } else {
> > +        dest = region_devmem->mmaps[0].mmap;
> > +        qemu_put_be64(f, len);
> > +        qemu_put_be64(f, pos);
> > +        qemu_put_buffer(f, dest, len);
> > +    }
> > +    return 0;
> > +}
> > +
> > +static int vfio_save_data_device_memory(VFIOPCIDevice *vdev, QEMUFile *f)
> > +{
> > +    VFIORegion *region_devmem =
> > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_DEVICE_MEMORY];
> > +    uint64_t total_len = vdev->migration->devmem_size;
> > +    uint64_t pos = 0;
> > +
> > +    qemu_put_be64(f, total_len);
> > +    while (pos < total_len) {
> > +        uint64_t len = region_devmem->size;
> > +
> > +        if (pos + len >= total_len) {
> > +            len = total_len - pos;
> > +        }
> > +        if (vfio_save_data_device_memory_chunk(vdev, f, pos, len)) {
> > +            return -1;
> > +        }
> 
> I don’t see where pos is incremented in this loop
> 
Yes, it's missing one line: "pos += len;".
Currently the code is not verified on hardware with the device memory cap on.
Thanks :)
> > +    }
> > +
> > +    return 0;
> > +}
> > +
> > +static
> > +int vfio_load_data_device_memory_chunk(VFIOPCIDevice *vdev, QEMUFile *f,
> > +                                uint64_t pos, uint64_t len)
> > +{
> > +    VFIODevice *vbasedev = &vdev->vbasedev;
> > +    VFIORegion *region_ctl =
> > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> > +    VFIORegion *region_devmem =
> > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_DEVICE_MEMORY];
> > +
> > +    void *dest;
> > +    uint32_t sz;
> > +    uint8_t *buf = NULL;
> > +    uint32_t action = VFIO_DEVICE_DATA_ACTION_SET_BUFFER;
> > +
> > +    if (len > region_devmem->size) {
> 
> error_report?

seems better to add error_report.
> > +        return -1;
> > +    }
> > +
> > +    sz = sizeof(pos);
> > +    if (pwrite(vbasedev->fd, &pos, sz,
> > +                region_ctl->fd_offset +
> > +                offsetof(struct vfio_device_state_ctl, device_memory.pos))
> > +            != sz) {
> > +        error_report("vfio: Failed to set device memory buffer pos");
> > +        return -1;
> > +    }
> > +    if (!vfio_device_state_region_mmaped(region_devmem)) {
> > +        buf = g_malloc(len);
> > +        if (buf == NULL) {
> > +            error_report("vfio: Failed to allocate memory for migrate");
> > +            return -1;
> > +        }
> > +        qemu_get_buffer(f, buf, len);
> > +        if (pwrite(vbasedev->fd, buf, len,
> > +                    region_devmem->fd_offset) != len) {
> > +            error_report("vfio: Failed to load devie memory buffer");
> > +            return -1;
> > +        }
> > +        g_free(buf);
> > +    } else {
> > +        dest = region_devmem->mmaps[0].mmap;
> > +        qemu_get_buffer(f, dest, len);
> > +    }
> > +
> > +    sz = sizeof(action);
> > +    if (pwrite(vbasedev->fd, &action, sz,
> > +                region_ctl->fd_offset +
> > +                offsetof(struct vfio_device_state_ctl, device_memory.action))
> > +            != sz) {
> > +        error_report("vfio: Failed to set load device memory buffer action");
> > +        return -1;
> > +    }
> > +
> > +    return 0;
> > +
> > +}
> > +
> > +static int vfio_load_data_device_memory(VFIOPCIDevice *vdev,
> > +                        QEMUFile *f, uint64_t total_len)
> > +{
> > +    uint64_t pos = 0, len = 0;
> > +
> > +    vfio_set_device_memory_size(vdev, total_len);
> > +
> > +    while (pos + len < total_len) {
> > +        len = qemu_get_be64(f);
> > +        pos = qemu_get_be64(f);
> 
> Nit: load reads len/pos in the loop, whereas save does it in the
> inner function (vfio_save_data_device_memory_chunk)
right, load has to read len/pos in the loop.
> 
> > +
> > +        vfio_load_data_device_memory_chunk(vdev, f, pos, len);
> > +    }
> > +
> > +    return 0;
> > +}
> > +
> > +
> > static int vfio_set_dirty_page_bitmap_chunk(VFIOPCIDevice *vdev,
> >         uint64_t start_addr, uint64_t page_nr)
> > {
> > @@ -377,6 +572,10 @@ static void vfio_save_live_pending(QEMUFile *f, void *opaque,
> >         return;
> >     }
> > 
> > +    /* get dirty data size of device memory */
> > +    vfio_get_device_memory_size(vdev);
> > +
> > +    *res_precopy_only += vdev->migration->devmem_size;
> >     return;
> > }
> > 
> > @@ -388,7 +587,9 @@ static int vfio_save_iterate(QEMUFile *f, void *opaque)
> >         return 0;
> >     }
> > 
> > -    return 0;
> > +    qemu_put_byte(f, VFIO_SAVE_FLAG_DEVMEMORY);
> > +    /* get dirty data of device memory */
> > +    return vfio_save_data_device_memory(vdev, f);
> > }
> > 
> > static void vfio_pci_load_config(VFIOPCIDevice *vdev, QEMUFile *f)
> > @@ -458,6 +659,10 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
> >             len = qemu_get_be64(f);
> >             vfio_load_data_device_config(vdev, f, len);
> >             break;
> > +        case VFIO_SAVE_FLAG_DEVMEMORY:
> > +            len = qemu_get_be64(f);
> > +            vfio_load_data_device_memory(vdev, f, len);
> > +            break;
> >         default:
> >             ret = -EINVAL;
> >         }
> > @@ -503,6 +708,13 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
> >     VFIOPCIDevice *vdev = opaque;
> >     int rc = 0;
> > 
> > +    if (vfio_device_data_cap_device_memory(vdev)) {
> > +        qemu_put_byte(f, VFIO_SAVE_FLAG_DEVMEMORY | VFIO_SAVE_FLAG_CONTINUE);
> > +        /* get dirty data of device memory */
> > +        vfio_get_device_memory_size(vdev);
> > +        rc = vfio_save_data_device_memory(vdev, f);
> > +    }
> > +
> >     qemu_put_byte(f, VFIO_SAVE_FLAG_PCI | VFIO_SAVE_FLAG_CONTINUE);
> >     vfio_pci_save_config(vdev, f);
> > 
> > @@ -515,12 +727,22 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
> > 
> > static int vfio_save_setup(QEMUFile *f, void *opaque)
> > {
> > +    int rc = 0;
> >     VFIOPCIDevice *vdev = opaque;
> > -    qemu_put_byte(f, VFIO_SAVE_FLAG_SETUP);
> > +
> > +    if (vfio_device_data_cap_device_memory(vdev)) {
> > +        qemu_put_byte(f, VFIO_SAVE_FLAG_SETUP | VFIO_SAVE_FLAG_CONTINUE);
> > +        qemu_put_byte(f, VFIO_SAVE_FLAG_DEVMEMORY);
> > +        /* get whole snapshot of device memory */
> > +        vfio_get_device_memory_size(vdev);
> > +        rc = vfio_save_data_device_memory(vdev, f);
> > +    } else {
> > +        qemu_put_byte(f, VFIO_SAVE_FLAG_SETUP);
> > +    }
> > 
> >     vfio_set_device_state(vdev, VFIO_DEVICE_STATE_RUNNING |
> >             VFIO_DEVICE_STATE_LOGGING);
> > -    return 0;
> > +    return rc;
> > }
> > 
> > static int vfio_load_setup(QEMUFile *f, void *opaque)
> > @@ -576,8 +798,11 @@ int vfio_migration_init(VFIOPCIDevice *vdev, Error **errp)
> >         goto error;
> >     }
> > 
> > -    if (vfio_device_data_cap_device_memory(vdev)) {
> > -        error_report("No suppport of data cap device memory Yet");
> > +    if (vfio_device_data_cap_device_memory(vdev) &&
> > +            vfio_device_state_region_setup(vdev,
> > +              &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_DEVICE_MEMORY],
> > +              VFIO_REGION_SUBTYPE_DEVICE_STATE_DATA_MEMORY,
> > +              "device-state-data-device-memory")) {
> >         goto error;
> >     }
> > 
> > diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
> > index 4b7b1bb..a2cc64b 100644
> > --- a/hw/vfio/pci.h
> > +++ b/hw/vfio/pci.h
> > @@ -69,6 +69,7 @@ typedef struct VFIOMigration {
> >     uint32_t data_caps;
> >     uint32_t device_state;
> >     uint64_t devconfig_size;
> > +    uint64_t devmem_size;
> >     VMChangeStateEntry *vm_state;
> > } VFIOMigration;
> > 
> > -- 
> > 2.7.4
> > 
> > _______________________________________________
> > intel-gvt-dev mailing list
> > intel-gvt-dev@lists.freedesktop.org
> > https://lists.freedesktop.org/mailman/listinfo/intel-gvt-dev
> 
Christophe de Dinechin Feb. 20, 2019, 10:14 a.m. UTC | #8
> On 20 Feb 2019, at 08:58, Zhao Yan <yan.y.zhao@intel.com> wrote:
> 
> On Tue, Feb 19, 2019 at 03:42:36PM +0100, Christophe de Dinechin wrote:
>> 
>> 
>>> On 19 Feb 2019, at 09:53, Yan Zhao <yan.y.zhao@intel.com> wrote:
>>> [...]
>>> +        if (pread(vbasedev->fd, buf, len, region_devmem->fd_offset) != len) {
>>> +            error_report("vfio: error load device memory buffer");
>> s/load/loading/ ?
> error to load? :)

I’d check with a native speaker, but I believe it’s “error loading”.

To me (to be checked), the two sentences don’t have the same meaning:

“It is an error to load device memory buffer” -> “You are not allowed to do that”
“I had an error loading device memory buffer” -> “I tried, but it failed"

>>> [...]
Dr. David Alan Gilbert Feb. 20, 2019, 11:01 a.m. UTC | #9
* Zhao Yan (yan.y.zhao@intel.com) wrote:
> On Tue, Feb 19, 2019 at 11:32:13AM +0000, Dr. David Alan Gilbert wrote:
> > * Yan Zhao (yan.y.zhao@intel.com) wrote:
> > > This patchset enables VFIO devices to have live migration capability.
> > > Currently it does not support post-copy phase.
> > > 
> > > It follows Alex's comments on last version of VFIO live migration patches,
> > > including device states, VFIO device state region layout, dirty bitmap's
> > > query.
> > 
> > Hi,
> >   I've sent minor comments to later patches; but some minor general
> > comments:
> > 
> >   a) Never trust the incoming migrations stream - it might be corrupt,
> >     so check when you can.
> hi Dave
> Thanks for this suggestion. I'll add more checks for migration streams.
> 
> 
> >   b) How do we detect if we're migrating from/to the wrong device or
> > version of device?  Or say to a device with older firmware or perhaps
> > a device that has less device memory ?
> Actually it's still an open question for VFIO migration. We need to think
> about whether it's better to check that in libvirt or qemu (like a device
> magic along with version?).
> This patchset is intended to settle down the main device state interfaces
> for VFIO migration. So that we can work on that and improve it.
> 
> 
> >   c) Consider using the trace_ mechanism - it's really useful to
> > add to loops writing/reading data so that you can see when it fails.
> > 
> > Dave
> >
> Got it. many thanks~~
> 
> 
> > (P.S. You have a few typos; grep your code for 'devcie', 'devie' and
> > 'migrtion'.)
> 
> sorry :)

No problem.

Given the mails, I'm guessing you've mostly tested this on graphics
devices?  Have you also checked with VFIO network cards?

Also see the mail I sent in reply to Kirti's series; we need to boil
these down to one solution.
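For reference, the trace_ mechanism suggested above might be adopted along these lines; the event names and format strings below are hypothetical additions following QEMU's trace-events conventions, not existing events:

```
# hw/vfio/trace-events (hypothetical additions)
vfio_save_data_device_memory_chunk(uint64_t pos, uint64_t len) "pos 0x%"PRIx64" len 0x%"PRIx64
vfio_load_data_device_memory_chunk(uint64_t pos, uint64_t len) "pos 0x%"PRIx64" len 0x%"PRIx64
```

The generated trace_vfio_save_data_device_memory_chunk(pos, len) call could then go at the top of each chunk save/load function, so a failing iteration is visible in the trace log.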

Dave

> > 
> > > Device Data
> > > -----------
> > > Device data is divided into three types: device memory, device config,
> > > and system memory dirty pages produced by device.
> > > 
> > > Device config: data like MMIOs, page tables...
> > >         Every device is supposed to possess device config data.
> > >         Usually device config's size is small (no bigger than 10M), and it
> > >         needs to be loaded in certain strict order.
> > >         Therefore, device config only needs to be saved/loaded in
> > >         stop-and-copy phase.
> > >         The data of device config is held in device config region.
> > >         Size of device config data is smaller than or equal to that of
> > >         device config region.
> > > 
> > > Device Memory: device's internal memory, standalone and outside system
> > >         memory. It is usually very big.
> > >         This kind of data needs to be saved / loaded in pre-copy and
> > >         stop-and-copy phase.
> > >         The data of device memory is held in device memory region.
> > >         Size of device memory is usually larger than that of device
> > >         memory region. qemu needs to save/load it in chunks of size of
> > >         device memory region.
> > >         Not all devices have device memory, e.g. IGD only uses system memory.
> > > 
> > > System memory dirty pages: If a device produces dirty pages in system
> > >         memory, it is able to get dirty bitmap for certain range of system
> > >         memory. This dirty bitmap is queried in pre-copy and stop-and-copy
> > >         phase in .log_sync callback. By setting dirty bitmap in .log_sync
> > >         callback, dirty pages in system memory will be saved/loaded by ram's
> > >         live migration code.
> > >         The dirty bitmap of system memory is held in dirty bitmap region.
> > >         If system memory range is larger than the dirty bitmap region can
> > >         hold, qemu will cut it into several chunks and get dirty bitmap in
> > >         succession.
> > > 
> > > 
> > > Device State Regions
> > > --------------------
> > > Vendor driver is required to expose two mandatory regions and another two
> > > optional regions if it plans to support device state management.
> > > 
> > > So, there are up to four regions in total.
> > > One control region: mandatory.
> > >         Get access via read/write system call.
> > >         Its layout is defined in struct vfio_device_state_ctl
> > > Three data regions: mmaped into qemu.
> > >         device config region: mandatory, holding data of device config
> > >         device memory region: optional, holding data of device memory
> > >         dirty bitmap region: optional, holding bitmap of system memory
> > >                             dirty pages
> > > 
> > > (The reason why four separate regions are defined is that the unit of mmap
> > > system call is PAGE_SIZE, i.e. 4k bytes. So one read/write region for
> > > control and three mmaped regions for data seems better than one big region
> > > padded and sparse mmaped).
> > > 
> > > 
> > > kernel device state interface [1]
> > > --------------------------------------
> > > #define VFIO_DEVICE_STATE_INTERFACE_VERSION 1
> > > #define VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY 1
> > > #define VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY 2
> > > 
> > > #define VFIO_DEVICE_STATE_RUNNING 0 
> > > #define VFIO_DEVICE_STATE_STOP 1
> > > #define VFIO_DEVICE_STATE_LOGGING 2
> > > 
> > > #define VFIO_DEVICE_DATA_ACTION_GET_BUFFER 1
> > > #define VFIO_DEVICE_DATA_ACTION_SET_BUFFER 2
> > > #define VFIO_DEVICE_DATA_ACTION_GET_BITMAP 3
> > > 
> > > struct vfio_device_state_ctl {
> > > 	__u32 version;		  /* ro */
> > > 	__u32 device_state;       /* VFIO device state, wo */
> > > 	__u32 caps;		 /* ro */
> > >         struct {
> > > 		__u32 action;  /* wo, GET_BUFFER or SET_BUFFER */ 
> > > 		__u64 size;    /*rw*/
> > > 	} device_config;
> > > 	struct {
> > > 		__u32 action;    /* wo, GET_BUFFER or SET_BUFFER */ 
> > > 		__u64 size;     /* rw */  
> > >                 __u64 pos; /*the offset in total buffer of device memory*/
> > > 	} device_memory;
> > > 	struct {
> > > 		__u64 start_addr; /* wo */
> > > 		__u64 page_nr;   /* wo */
> > > 	} system_memory;
> > > };
> > > 
> > > Device States
> > > ------------- 
> > > After migration is initialized, it will set device state via writing to
> > > device_state field of control region.
> > > 
> > > Four states are defined for a VFIO device:
> > >         RUNNING, RUNNING & LOGGING, STOP & LOGGING, STOP 
> > > 
> > > RUNNING: In this state, a VFIO device is in active state ready to receive
> > >         commands from device driver.
> > >         It is the default state that a VFIO device enters initially.
> > > 
> > > STOP:  In this state, a VFIO device is deactivated and does not interact
> > >        with the device driver.
> > > 
> > > LOGGING: a special state that it CANNOT exist independently. It must be
> > >        set alongside with state RUNNING or STOP (i.e. RUNNING & LOGGING,
> > >        STOP & LOGGING).
> > >        Qemu will set LOGGING state on in .save_setup callbacks, then vendor
> > >        driver can start dirty data logging for device memory and system
> > >        memory.
> > >        LOGGING only impacts device/system memory. They return whole
> > >        snapshot outside LOGGING and dirty data since last get operation
> > >        inside LOGGING.
> > >        Device config should be always accessible and return whole config
> > >        snapshot regardless of LOGGING state.
> > >        
> > > Note:
> > > The reason why RUNNING is the default state is that a device's active
> > > state must not depend on the device state interface.
> > > It is possible that the vfio_device_state_ctl region fails to get
> > > registered. In that case, a device needs to be in the active state by
> > > default.
> > > 
> > > Get Version & Get Caps
> > > ----------------------
> > > In the migration init phase, QEMU probes the existence of the vendor
> > > driver's device state regions, then gets the version of the device state
> > > interface from the r/w control region.
> > > 
> > > Then it probes the VFIO device's data capabilities by reading the caps
> > > field of the control region.
> > >         #define VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY 1
> > >         #define VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY 2
> > > If VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY is on, QEMU will save/load data
> > >         of device memory in the pre-copy and stop-and-copy phases. The
> > >         data of device memory is held in the device memory region.
> > > If VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY is on, QEMU will query dirty pages
> > >         produced by the VFIO device during the pre-copy and stop-and-copy
> > >         phases. The dirty bitmap of system memory is held in the dirty
> > >         bitmap region.
> > > 
> > > If QEMU fails to find the two mandatory regions or the optional data
> > > regions corresponding to the data caps, or if the versions mismatch, it
> > > will set up a migration blocker and disable live migration for the VFIO
> > > device.
> > > 
> > > 
> > > Flows to call device state interface for VFIO live migration
> > > ------------------------------------------------------------
> > > 
> > > Live migration save path:
> > > 
> > > (QEMU LIVE MIGRATION STATE --> DEVICE STATE INTERFACE --> DEVICE STATE)
> > > 
> > > MIGRATION_STATUS_NONE --> VFIO_DEVICE_STATE_RUNNING
> > >  |
> > > MIGRATION_STATUS_SAVE_SETUP
> > >  |
> > >  .save_setup callback -->
> > >  get device memory size (whole snapshot size)
> > >  get device memory buffer (whole snapshot data)
> > >  set device state --> VFIO_DEVICE_STATE_RUNNING & VFIO_DEVICE_STATE_LOGGING
> > >  |
> > > MIGRATION_STATUS_ACTIVE
> > >  |
> > >  .save_live_pending callback --> get device memory size (dirty data)
> > >  .save_live_iteration callback --> get device memory buffer (dirty data)
> > >  .log_sync callback --> get system memory dirty bitmap
> > >  |
> > > (vcpu stops) --> set device state -->
> > >  VFIO_DEVICE_STATE_STOP & VFIO_DEVICE_STATE_LOGGING
> > >  |
> > > .save_live_complete_precopy callback -->
> > >  get device memory size (dirty data)
> > >  get device memory buffer (dirty data)
> > >  get device config size (whole snapshot size)
> > >  get device config buffer (whole snapshot data)
> > >  |
> > > .save_cleanup callback -->  set device state --> VFIO_DEVICE_STATE_STOP
> > > MIGRATION_STATUS_COMPLETED
> > > 
> > > MIGRATION_STATUS_CANCELLED or
> > > MIGRATION_STATUS_FAILED
> > >  |
> > > (vcpu starts) --> set device state --> VFIO_DEVICE_STATE_RUNNING
> > > 
> > > 
> > > Live migration load path:
> > > 
> > > (QEMU LIVE MIGRATION STATE --> DEVICE STATE INTERFACE --> DEVICE STATE)
> > > 
> > > MIGRATION_STATUS_NONE --> VFIO_DEVICE_STATE_RUNNING
> > >  |
> > > (vcpu stops) --> set device state --> VFIO_DEVICE_STATE_STOP
> > >  |
> > > MIGRATION_STATUS_ACTIVE
> > >  |
> > > .load_state callback -->
> > >  set device memory size, set device memory buffer, set device config size,
> > >  set device config buffer
> > >  |
> > > (vcpu starts) --> set device state --> VFIO_DEVICE_STATE_RUNNING
> > >  |
> > > MIGRATION_STATUS_COMPLETED
> > > 
> > > 
> > > 
> > > On the source VM side, in the pre-copy phase,
> > > if a device has VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY on,
> > > QEMU will first get a whole snapshot of device memory in the .save_setup
> > > callback, and then it will get the total size of dirty data in device
> > > memory in the .save_live_pending callback by reading the
> > > device_memory.size field of the control region.
> > > Then, in the .save_live_iteration callback, it will get the buffer of
> > > device memory's dirty data chunk by chunk from the device memory region,
> > > by writing pos & action (GET_BUFFER) to the device_memory.pos &
> > > device_memory.action fields of the control region. (The size of each
> > > chunk is the size of the device memory data region.)
> > > .save_live_pending and .save_live_iteration may be called several times
> > > in the pre-copy phase to get dirty data in device memory.
> > > 
> > > If VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY is off, callbacks in the pre-copy
> > > phase like .save_setup, .save_live_pending and .save_live_iteration will
> > > not call the vendor driver's device state interface to get data from
> > > device memory.
> > > 
> > > In the pre-copy phase, if a device has VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY
> > > on, the .log_sync callback will get the system memory dirty bitmap from
> > > the dirty bitmap region by writing system memory's start address, page
> > > count and action (GET_BITMAP) to the "system_memory.start_addr",
> > > "system_memory.page_nr", and "system_memory.action" fields of the control
> > > region.
> > > If the page count passed in the .log_sync callback is larger than the
> > > bitmap size the dirty bitmap region supports, QEMU will cut it into
> > > chunks and call the vendor driver's get system memory dirty bitmap
> > > interface for each chunk.
> > > If VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY is off, the .log_sync callback just
> > > returns without calling the vendor driver.
> > > 
> > > In the stop-and-copy phase, the device state is first set to
> > > STOP & LOGGING. In the .save_live_complete_precopy callback,
> > > if VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY is on,
> > > get device memory size and get device memory buffer are called again.
> > > After that,
> > > device config data is read from the device config region by reading
> > > device_config.size of the control region and writing action (GET_BUFFER)
> > > to device_config.action of the control region.
> > > Then, after migration completes, in the cleanup handler, the LOGGING
> > > state is cleared (i.e. the device state is set to STOP).
> > > Clearing the LOGGING state in the cleanup handler also covers the
> > > "migration failed" and "migration cancelled" cases, since they too can
> > > leverage the cleanup handler to unset the LOGGING state.
> > > 
> > > 
> > > References
> > > ----------
> > > 1. kernel side implementation of Device state interfaces:
> > > https://patchwork.freedesktop.org/series/56876/
> > > 
> > > 
> > > Yan Zhao (5):
> > >   vfio/migration: define kernel interfaces
> > >   vfio/migration: support device of device config capability
> > >   vfio/migration: tracking of dirty page in system memory
> > >   vfio/migration: turn on migration
> > >   vfio/migration: support device memory capability
> > > 
> > >  hw/vfio/Makefile.objs         |   2 +-
> > >  hw/vfio/common.c              |  26 ++
> > >  hw/vfio/migration.c           | 858 ++++++++++++++++++++++++++++++++++++++++++
> > >  hw/vfio/pci.c                 |  10 +-
> > >  hw/vfio/pci.h                 |  26 +-
> > >  include/hw/vfio/vfio-common.h |   1 +
> > >  linux-headers/linux/vfio.h    | 260 +++++++++++++
> > >  7 files changed, 1174 insertions(+), 9 deletions(-)
> > >  create mode 100644 hw/vfio/migration.c
> > > 
> > > -- 
> > > 2.7.4
> > > 
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> > _______________________________________________
> > intel-gvt-dev mailing list
> > intel-gvt-dev@lists.freedesktop.org
> > https://lists.freedesktop.org/mailman/listinfo/intel-gvt-dev
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Gonglei (Arei) Feb. 20, 2019, 11:28 a.m. UTC | #10
> -----Original Message-----
> From: Dr. David Alan Gilbert [mailto:dgilbert@redhat.com]
> Sent: Wednesday, February 20, 2019 7:02 PM
> To: Zhao Yan <yan.y.zhao@intel.com>
> Cc: cjia@nvidia.com; kvm@vger.kernel.org; aik@ozlabs.ru;
> Zhengxiao.zx@alibaba-inc.com; shuangtai.tst@alibaba-inc.com;
> qemu-devel@nongnu.org; kwankhede@nvidia.com; eauger@redhat.com;
> yi.l.liu@intel.com; eskultet@redhat.com; ziye.yang@intel.com;
> mlevitsk@redhat.com; pasic@linux.ibm.com; Gonglei (Arei)
> <arei.gonglei@huawei.com>; felipe@nutanix.com; Ken.Xue@amd.com;
> kevin.tian@intel.com; alex.williamson@redhat.com;
> intel-gvt-dev@lists.freedesktop.org; changpeng.liu@intel.com;
> cohuck@redhat.com; zhi.a.wang@intel.com; jonathan.davies@nutanix.com
> Subject: Re: [PATCH 0/5] QEMU VFIO live migration
> 
> * Zhao Yan (yan.y.zhao@intel.com) wrote:
> > On Tue, Feb 19, 2019 at 11:32:13AM +0000, Dr. David Alan Gilbert wrote:
> > > * Yan Zhao (yan.y.zhao@intel.com) wrote:
> > > > This patchset enables VFIO devices to have live migration capability.
> > > > Currently it does not support post-copy phase.
> > > >
> > > > It follows Alex's comments on last version of VFIO live migration patches,
> > > > including device states, VFIO device state region layout, dirty bitmap's
> > > > query.
> > >
> > > Hi,
> > >   I've sent minor comments to later patches; but some minor general
> > > comments:
> > >
> > >   a) Never trust the incoming migrations stream - it might be corrupt,
> > >     so check when you can.
> > hi Dave
> > Thanks for this suggestion. I'll add more checks for migration streams.
> >
> >
> > >   b) How do we detect if we're migrating from/to the wrong device or
> > > version of device?  Or say to a device with older firmware or perhaps
> > > a device that has less device memory ?
> > Actually it's still an open question for VFIO migration. We need to think
> > about whether it's better to check that in libvirt or QEMU (like a device
> > magic along with a version?).

We must keep the hardware generation the same within one POD of public cloud
providers. But we are still thinking about live migration from a lower
generation of hardware to a higher generation.

> > This patchset is intended to settle down the main device state interfaces
> > for VFIO migration. So that we can work on that and improve it.
> >
> >
> > >   c) Consider using the trace_ mechanism - it's really useful to
> > > add to loops writing/reading data so that you can see when it fails.
> > >
> > > Dave
> > >
> > Got it. many thanks~~
> >
> >
> > > (P.S. You have a few typos; grep your code for 'devcie', 'devie' and
> > > 'migrtion')
> >
> > sorry :)
> 
> No problem.
> 
> Given the mails, I'm guessing you've mostly tested this on graphics
> devices?  Have you also checked with VFIO network cards?
> 
> Also see the mail I sent in reply to Kirti's series; we need to boil
> these down to one solution.
> 
> Dave
Cornelia Huck Feb. 20, 2019, 11:42 a.m. UTC | #11
On Wed, 20 Feb 2019 11:28:46 +0000
"Gonglei (Arei)" <arei.gonglei@huawei.com> wrote:

> > * Zhao Yan (yan.y.zhao@intel.com) wrote:  
> > > On Tue, Feb 19, 2019 at 11:32:13AM +0000, Dr. David Alan Gilbert wrote:  
> > > > * Yan Zhao (yan.y.zhao@intel.com) wrote:  
> > > > > This patchset enables VFIO devices to have live migration capability.
> > > > > Currently it does not support post-copy phase.
> > > > >
> > > > > It follows Alex's comments on last version of VFIO live migration patches,
> > > > > including device states, VFIO device state region layout, dirty bitmap's
> > > > > query.  

> > > >   b) How do we detect if we're migrating from/to the wrong device or
> > > > version of device?  Or say to a device with older firmware or perhaps
> > > > a device that has less device memory ?  
> > > Actually it's still an open question for VFIO migration. We need to think
> > > about whether it's better to check that in libvirt or QEMU (like a device
> > > magic along with a version?).
> 
> We must keep the hardware generation is the same with one POD of public cloud
> providers. But we still think about the live migration between from the the lower
> generation of hardware migrated to the higher generation.

Agreed, lower->higher is the one direction that might make sense to
support.

But regardless of that, I think we need to make sure that incompatible
devices/versions fail directly instead of failing in a subtle, hard to
debug way. Might be useful to do some initial sanity checks in libvirt
as well.

How easy is it to obtain that information in a form that can be
consumed by higher layers? Can we find out the device type at least?
What about some kind of revision?
Gonglei (Arei) Feb. 20, 2019, 11:56 a.m. UTC | #12
Hi Yan,

Thanks for your work.

I have some suggestions and questions:

1) Would you add MSI-X mode support? If not, please add a check in vfio_pci_save_config(), like Nvidia's solution.
2) We should start VFIO devices before the vCPUs resume, so we can't rely on the VM state change handler completely.
3) We'd better support live migration rollback since there are many failure scenarios;
 registering a migration notifier is a good choice.
4) Four memory regions for live migration is too complicated IMHO.
5) About log sync, why not register log_global_start/stop in vfio_memory_listener?


Regards,
-Gonglei


> -----Original Message-----
> From: Yan Zhao [mailto:yan.y.zhao@intel.com]
> Sent: Tuesday, February 19, 2019 4:51 PM
> To: alex.williamson@redhat.com; qemu-devel@nongnu.org
> Cc: intel-gvt-dev@lists.freedesktop.org; Zhengxiao.zx@Alibaba-inc.com;
> yi.l.liu@intel.com; eskultet@redhat.com; ziye.yang@intel.com;
> cohuck@redhat.com; shuangtai.tst@alibaba-inc.com; dgilbert@redhat.com;
> zhi.a.wang@intel.com; mlevitsk@redhat.com; pasic@linux.ibm.com;
> aik@ozlabs.ru; eauger@redhat.com; felipe@nutanix.com;
> jonathan.davies@nutanix.com; changpeng.liu@intel.com; Ken.Xue@amd.com;
> kwankhede@nvidia.com; kevin.tian@intel.com; cjia@nvidia.com; Gonglei (Arei)
> <arei.gonglei@huawei.com>; kvm@vger.kernel.org; Yan Zhao
> <yan.y.zhao@intel.com>
> Subject: [PATCH 0/5] QEMU VFIO live migration
> 
> This patchset enables VFIO devices to have live migration capability.
> Currently it does not support post-copy phase.
> 
> It follows Alex's comments on last version of VFIO live migration patches,
> including device states, VFIO device state region layout, dirty bitmap's
> query.
> 
> Device Data
> -----------
> Device data is divided into three types: device memory, device config,
> and system memory dirty pages produced by device.
> 
> Device config: data like MMIOs, page tables...
>         Every device is supposed to possess device config data.
>     	Usually device config's size is small (no big than 10M), and it
>         needs to be loaded in certain strict order.
>         Therefore, device config only needs to be saved/loaded in
>         stop-and-copy phase.
>         The data of device config is held in device config region.
>         Size of device config data is smaller than or equal to that of
>         device config region.
> 
> Device Memory: device's internal memory, standalone and outside system
>         memory. It is usually very big.
>         This kind of data needs to be saved / loaded in pre-copy and
>         stop-and-copy phase.
>         The data of device memory is held in device memory region.
>         Size of devie memory is usually larger than that of device
>         memory region. qemu needs to save/load it in chunks of size of
>         device memory region.
>         Not all device has device memory. Like IGD only uses system memory.
> 
> System memory dirty pages: If a device produces dirty pages in system
>         memory, it is able to get dirty bitmap for certain range of system
>         memory. This dirty bitmap is queried in pre-copy and stop-and-copy
>         phase in .log_sync callback. By setting dirty bitmap in .log_sync
>         callback, dirty pages in system memory will be save/loaded by ram's
>         live migration code.
>         The dirty bitmap of system memory is held in dirty bitmap region.
>         If system memory range is larger than that dirty bitmap region can
>         hold, qemu will cut it into several chunks and get dirty bitmap in
>         succession.
> 
> 
> Device State Regions
> --------------------
> Vendor driver is required to expose two mandatory regions and another two
> optional regions if it plans to support device state management.
> 
> So, there are up to four regions in total.
> One control region: mandatory.
>         Get access via read/write system call.
>         Its layout is defined in struct vfio_device_state_ctl
> Three data regions: mmaped into qemu.
>         device config region: mandatory, holding data of device config
>         device memory region: optional, holding data of device memory
>         dirty bitmap region: optional, holding bitmap of system memory
>                             dirty pages
> 
> (The reason why four seperate regions are defined is that the unit of mmap
> system call is PAGE_SIZE, i.e. 4k bytes. So one read/write region for
> control and three mmaped regions for data seems better than one big region
> padded and sparse mmaped).
> 
> 
> kernel device state interface [1]
> --------------------------------------
> #define VFIO_DEVICE_STATE_INTERFACE_VERSION 1
> #define VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY 1
> #define VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY 2
> 
> #define VFIO_DEVICE_STATE_RUNNING 0
> #define VFIO_DEVICE_STATE_STOP 1
> #define VFIO_DEVICE_STATE_LOGGING 2
> 
> #define VFIO_DEVICE_DATA_ACTION_GET_BUFFER 1
> #define VFIO_DEVICE_DATA_ACTION_SET_BUFFER 2
> #define VFIO_DEVICE_DATA_ACTION_GET_BITMAP 3
> 
> struct vfio_device_state_ctl {
> 	__u32 version;		  /* ro */
> 	__u32 device_state;       /* VFIO device state, wo */
> 	__u32 caps;		 /* ro */
>         struct {
> 		__u32 action;  /* wo, GET_BUFFER or SET_BUFFER */
> 		__u64 size;    /*rw*/
> 	} device_config;
> 	struct {
> 		__u32 action;    /* wo, GET_BUFFER or SET_BUFFER */
> 		__u64 size;     /* rw */
>                 __u64 pos; /*the offset in total buffer of device memory*/
> 	} device_memory;
> 	struct {
> 		__u64 start_addr; /* wo */
> 		__u64 page_nr;   /* wo */
> 	} system_memory;
> };
> 
> Device States
> -------------
> After migration is initialized, qemu will set the device state by writing
> to the device_state field of the control region.
> 
> Four states are defined for a VFIO device:
>         RUNNING, RUNNING & LOGGING, STOP & LOGGING, STOP
> 
> RUNNING: In this state, a VFIO device is active and ready to receive
>         commands from the device driver.
>         It is the default state that a VFIO device enters initially.
> 
> STOP:  In this state, a VFIO device is deactivated and no longer
>        interacts with the device driver.
> 
> LOGGING: a special state that CANNOT exist independently. It must be
>        set alongside state RUNNING or STOP (i.e. RUNNING & LOGGING,
>        STOP & LOGGING).
>        Qemu will set the LOGGING state in the .save_setup callback, so
>        that the vendor driver can start dirty data logging for device
>        memory and system memory.
>        LOGGING only impacts device/system memory: they return a whole
>        snapshot outside LOGGING, and dirty data since the last get
>        operation inside LOGGING.
>        Device config should always be accessible and return a whole
>        config snapshot regardless of LOGGING state.
> 
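One way to read the state defines is as a two-bit encoding: bit 0 = stopped, bit 1 = logging, so LOGGING acts as a modifier rather than a standalone state. A minimal sketch of helpers (hypothetical, not part of the series):

```c
#include <stdint.h>

#define VFIO_DEVICE_STATE_RUNNING 0
#define VFIO_DEVICE_STATE_STOP    1
#define VFIO_DEVICE_STATE_LOGGING 2

/* LOGGING is a modifier bit and never appears on its own. */
static inline int state_is_logging(uint32_t s)
{
    return (s & VFIO_DEVICE_STATE_LOGGING) != 0;
}

static inline int state_is_stopped(uint32_t s)
{
    return (s & VFIO_DEVICE_STATE_STOP) != 0;
}
```

Under this reading, RUNNING & LOGGING is the value 2 and STOP & LOGGING is 3.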
> Note:
> The reason why RUNNING is the default state is that a device's active
> state must not depend on the device state interface.
> It is possible that the vfio_device_state_ctl region fails to get
> registered. In that case, a device needs to be in the active state by
> default.
> 
> Get Version & Get Caps
> ----------------------
> In the migration init phase, qemu will probe the existence of the vendor
> driver's device state regions, then get the version of the device state
> interface from the r/w control region.
> 
> Then it will probe VFIO device's data capability by reading caps field of
> control region.
>         #define VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY 1
>         #define VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY 2
> If VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY is on, it will save/load data of
>         device memory in the pre-copy and stop-and-copy phases. The data
>         of device memory is held in the device memory region.
> If VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY is on, it will query dirty pages
>         produced by the VFIO device during the pre-copy and stop-and-copy
>         phases.
>         The dirty bitmap of system memory is held in the dirty bitmap
>         region.
> 
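Since the caps field is a bit mask, the probing described above boils down to simple bit tests on the value read from the control region. A hypothetical sketch:

```c
#include <stdint.h>

#define VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY 1
#define VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY 2

/* Illustrative helpers for interpreting the caps field. */
static inline int cap_device_memory(uint32_t caps)
{
    return !!(caps & VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY);
}

static inline int cap_system_memory(uint32_t caps)
{
    return !!(caps & VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY);
}
```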
> If qemu fails to find the two mandatory regions or the optional data
> regions corresponding to the data caps, or if the version mismatches, it
> will set up a migration blocker and disable live migration for the VFIO
> device.
> 
> 
> Flows to call device state interface for VFIO live migration
> ------------------------------------------------------------
> 
> Live migration save path:
> 
> (QEMU LIVE MIGRATION STATE --> DEVICE STATE INTERFACE --> DEVICE STATE)
> 
> MIGRATION_STATUS_NONE --> VFIO_DEVICE_STATE_RUNNING
>  |
> MIGRATION_STATUS_SAVE_SETUP
>  |
>  .save_setup callback -->
>  get device memory size (whole snapshot size)
>  get device memory buffer (whole snapshot data)
>  set device state --> VFIO_DEVICE_STATE_RUNNING &
> VFIO_DEVICE_STATE_LOGGING
>  |
> MIGRATION_STATUS_ACTIVE
>  |
>  .save_live_pending callback --> get device memory size (dirty data)
>  .save_live_iteration callback --> get device memory buffer (dirty data)
>  .log_sync callback --> get system memory dirty bitmap
>  |
> (vcpu stops) --> set device state -->
>  VFIO_DEVICE_STATE_STOP & VFIO_DEVICE_STATE_LOGGING
>  |
> .save_live_complete_precopy callback -->
>  get device memory size (dirty data)
>  get device memory buffer (dirty data)
>  get device config size (whole snapshot size)
>  get device config buffer (whole snapshot data)
>  |
> .save_cleanup callback -->  set device state --> VFIO_DEVICE_STATE_STOP
> MIGRATION_STATUS_COMPLETED
> 
> MIGRATION_STATUS_CANCELLED or
> MIGRATION_STATUS_FAILED
>  |
> (vcpu starts) --> set device state --> VFIO_DEVICE_STATE_RUNNING
> 
> 
> Live migration load path:
> 
> (QEMU LIVE MIGRATION STATE --> DEVICE STATE INTERFACE --> DEVICE STATE)
> 
> MIGRATION_STATUS_NONE --> VFIO_DEVICE_STATE_RUNNING
>  |
> (vcpu stops) --> set device state --> VFIO_DEVICE_STATE_STOP
>  |
> MIGRATION_STATUS_ACTIVE
>  |
> .load state callback -->
>  set device memory size, set device memory buffer, set device config size,
>  set device config buffer
>  |
> (vcpu starts) --> set device state --> VFIO_DEVICE_STATE_RUNNING
>  |
> MIGRATION_STATUS_COMPLETED
> 
> 
> 
> On the source VM side, in the precopy phase,
> if a device has VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY on,
> qemu will first get a whole snapshot of device memory in the .save_setup
> callback, and then it will get the total size of dirty data in device
> memory in the .save_live_pending callback by reading the
> device_memory.size field of the control region.
> Then in the .save_live_iteration callback, it will get the buffer of
> device memory's dirty data chunk by chunk from the device memory region,
> by writing pos & action (GET_BUFFER) to the device_memory.pos &
> device_memory.action fields of the control region. (The size of each
> chunk is the size of the device memory data region.)
> .save_live_pending and .save_live_iteration may be called several times
> in the precopy phase to get dirty data in device memory.
> 
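The chunked save loop described above can be sketched as follows. save_device_memory() and the get_chunk callback are hypothetical stand-ins for the real control-region writes (pos + GET_BUFFER) and data-region reads:

```c
#include <stdint.h>

/* Dirty data of total_len bytes is fetched in chunks no larger than the
 * device memory region. get_chunk() stands in for writing pos and
 * GET_BUFFER to the control region and then reading the data region. */
static int save_device_memory(uint64_t total_len, uint64_t region_size,
                              int (*get_chunk)(uint64_t pos, uint64_t len))
{
    uint64_t pos = 0;

    while (pos < total_len) {
        uint64_t len = region_size;

        if (pos + len > total_len)
            len = total_len - pos;      /* final, shorter chunk */
        if (get_chunk(pos, len))
            return -1;
        pos += len;                      /* advance into the total buffer */
    }
    return 0;
}

/* Expected number of get_chunk() calls for a given total size. */
static uint64_t num_chunks(uint64_t total_len, uint64_t region_size)
{
    return (total_len + region_size - 1) / region_size;
}

/* Counting callback used only for self-checking the loop. */
static uint64_t chunk_calls;
static int count_cb(uint64_t pos, uint64_t len)
{
    (void)pos; (void)len;
    chunk_calls++;
    return 0;
}
```

Note that the `pos += len;` step is exactly the line reviewers found missing from the posted vfio_save_data_device_memory().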
> If VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY is off, callbacks in the precopy
> phase like .save_setup, .save_live_pending, .save_live_iteration will not
> call the vendor driver's device state interface to get data from device
> memory.
> 
> In the precopy phase, if a device has VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY
> on, the .log_sync callback will get the system memory dirty bitmap from
> the dirty bitmap region by writing system memory's start address, page
> count and action (GET_BITMAP) to the "system_memory.start_addr",
> "system_memory.page_nr", and "system_memory.action" fields of the
> control region.
> If the page count passed to the .log_sync callback is larger than the
> bitmap size the dirty bitmap region supports, qemu will cut it into
> chunks and call the vendor driver's get system memory dirty bitmap
> interface for each chunk.
> If VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY is off, the .log_sync callback just
> returns without calling the vendor driver.
> 
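The chunking of a large .log_sync request can be sketched with a little arithmetic, assuming the bitmap region stores 1 bit per page; the names here are illustrative, not from the series:

```c
#include <stdint.h>

/* Pages covered by one GET_BITMAP call, for a bitmap region of
 * bitmap_region_size bytes at 1 bit per page. */
static uint64_t bitmap_chunk_pages(uint64_t bitmap_region_size)
{
    return bitmap_region_size * 8;
}

/* How many GET_BITMAP calls a .log_sync request of page_nr pages needs. */
static uint64_t get_bitmap_calls(uint64_t page_nr,
                                 uint64_t bitmap_region_size)
{
    uint64_t per_call = bitmap_chunk_pages(bitmap_region_size);

    return (page_nr + per_call - 1) / per_call;
}
```

For instance, a 4k-byte bitmap region covers 32768 pages (128MB at 4k pages) per call, so larger ranges are split across several calls.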
> In the stop-and-copy phase, the device state will be set to STOP &
> LOGGING first.
> In the .save_live_complete_precopy callback,
> if VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY is on,
> get device memory size and get device memory buffer will be called again.
> After that,
> device config data is read from the device config region by reading
> device_config.size of the control region and writing action (GET_BUFFER)
> to device_config.action of the control region.
> Then, after migration completes, in the cleanup handler, the LOGGING
> state will be cleared (i.e. the device state is set to STOP).
> Clearing the LOGGING state in the cleanup handler also covers the
> "migration failed" and "migration cancelled" cases: they can likewise
> leverage the cleanup handler to unset the LOGGING state.
> 
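Clearing only the LOGGING bit in cleanup can be sketched as a mask operation (the value 2 taken from the defines above), which keeps the RUNNING/STOP half of the state as the migration outcome left it; illustrative only:

```c
#include <stdint.h>

#define VFIO_DEVICE_STATE_LOGGING 2

/* Drop the LOGGING modifier while preserving RUNNING (0) or STOP (1). */
static uint32_t clear_logging(uint32_t state)
{
    return state & ~(uint32_t)VFIO_DEVICE_STATE_LOGGING;
}
```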
> 
> References
> ----------
> 1. kernel side implementation of Device state interfaces:
> https://patchwork.freedesktop.org/series/56876/
> 
> 
> Yan Zhao (5):
>   vfio/migration: define kernel interfaces
>   vfio/migration: support device of device config capability
>   vfio/migration: tracking of dirty page in system memory
>   vfio/migration: turn on migration
>   vfio/migration: support device memory capability
> 
>  hw/vfio/Makefile.objs         |   2 +-
>  hw/vfio/common.c              |  26 ++
>  hw/vfio/migration.c           | 858
> ++++++++++++++++++++++++++++++++++++++++++
>  hw/vfio/pci.c                 |  10 +-
>  hw/vfio/pci.h                 |  26 +-
>  include/hw/vfio/vfio-common.h |   1 +
>  linux-headers/linux/vfio.h    | 260 +++++++++++++
>  7 files changed, 1174 insertions(+), 9 deletions(-)
>  create mode 100644 hw/vfio/migration.c
> 
> --
> 2.7.4
Gonglei (Arei) Feb. 20, 2019, 12:07 p.m. UTC | #13
> -----Original Message-----
> From: Cornelia Huck [mailto:cohuck@redhat.com]
> Sent: Wednesday, February 20, 2019 7:43 PM
> To: Gonglei (Arei) <arei.gonglei@huawei.com>
> Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>; Zhao Yan
> <yan.y.zhao@intel.com>; cjia@nvidia.com; kvm@vger.kernel.org;
> aik@ozlabs.ru; Zhengxiao.zx@alibaba-inc.com; shuangtai.tst@alibaba-inc.com;
> qemu-devel@nongnu.org; kwankhede@nvidia.com; eauger@redhat.com;
> yi.l.liu@intel.com; eskultet@redhat.com; ziye.yang@intel.com;
> mlevitsk@redhat.com; pasic@linux.ibm.com; felipe@nutanix.com;
> Ken.Xue@amd.com; kevin.tian@intel.com; alex.williamson@redhat.com;
> intel-gvt-dev@lists.freedesktop.org; changpeng.liu@intel.com;
> zhi.a.wang@intel.com; jonathan.davies@nutanix.com
> Subject: Re: [PATCH 0/5] QEMU VFIO live migration
> 
> On Wed, 20 Feb 2019 11:28:46 +0000
> "Gonglei (Arei)" <arei.gonglei@huawei.com> wrote:
> 
> > > -----Original Message-----
> > > From: Dr. David Alan Gilbert [mailto:dgilbert@redhat.com]
> > > Sent: Wednesday, February 20, 2019 7:02 PM
> > > To: Zhao Yan <yan.y.zhao@intel.com>
> > > Cc: cjia@nvidia.com; kvm@vger.kernel.org; aik@ozlabs.ru;
> > > Zhengxiao.zx@alibaba-inc.com; shuangtai.tst@alibaba-inc.com;
> > > qemu-devel@nongnu.org; kwankhede@nvidia.com; eauger@redhat.com;
> > > yi.l.liu@intel.com; eskultet@redhat.com; ziye.yang@intel.com;
> > > mlevitsk@redhat.com; pasic@linux.ibm.com; Gonglei (Arei)
> > > <arei.gonglei@huawei.com>; felipe@nutanix.com; Ken.Xue@amd.com;
> > > kevin.tian@intel.com; alex.williamson@redhat.com;
> > > intel-gvt-dev@lists.freedesktop.org; changpeng.liu@intel.com;
> > > cohuck@redhat.com; zhi.a.wang@intel.com;
> jonathan.davies@nutanix.com
> > > Subject: Re: [PATCH 0/5] QEMU VFIO live migration
> > >
> > > * Zhao Yan (yan.y.zhao@intel.com) wrote:
> > > > On Tue, Feb 19, 2019 at 11:32:13AM +0000, Dr. David Alan Gilbert wrote:
> > > > > * Yan Zhao (yan.y.zhao@intel.com) wrote:
> > > > > > This patchset enables VFIO devices to have live migration capability.
> > > > > > Currently it does not support post-copy phase.
> > > > > >
> > > > > > It follows Alex's comments on last version of VFIO live migration
> patches,
> > > > > > including device states, VFIO device state region layout, dirty bitmap's
> > > > > > query.
> 
> > > > >   b) How do we detect if we're migrating from/to the wrong device or
> > > > > version of device?  Or say to a device with older firmware or perhaps
> > > > > a device that has less device memory ?
> > > > Actually it's still an open for VFIO migration. Need to think about
> > > > whether it's better to check that in libvirt or qemu (like a device magic
> > > > along with verion ?).
> >
> > We must keep the hardware generation is the same with one POD of public
> cloud
> > providers. But we still think about the live migration between from the the
> lower
> > generation of hardware migrated to the higher generation.
> 
> Agreed, lower->higher is the one direction that might make sense to
> support.
> 
> But regardless of that, I think we need to make sure that incompatible
> devices/versions fail directly instead of failing in a subtle, hard to
> debug way. Might be useful to do some initial sanity checks in libvirt
> as well.
> 
> How easy is it to obtain that information in a form that can be
> consumed by higher layers? Can we find out the device type at least?
> What about some kind of revision?

We can provide an interface to query whether the VM supports live migration
or not in the prepare phase of Libvirt.

Can we get the revision_id from the vendor driver before invoking

register_savevm_live(NULL, TYPE_VFIO_PCI, -1,
            revision_id,
            &savevm_vfio_handlers,
            vdev);

and then limit live migration from higher generations to lower ones?

Regards,
-Gonglei
Cornelia Huck Feb. 20, 2019, 5:08 p.m. UTC | #14
On Wed, 20 Feb 2019 02:36:36 -0500
Zhao Yan <yan.y.zhao@intel.com> wrote:

> On Tue, Feb 19, 2019 at 02:09:18PM +0100, Cornelia Huck wrote:
> > On Tue, 19 Feb 2019 16:52:14 +0800
> > Yan Zhao <yan.y.zhao@intel.com> wrote:
(...)
> > > + *          Size of device config data is smaller than or equal to that of
> > > + *          device config region.  
> > 
> > Not sure if I understand that sentence correctly... but what if a
> > device has more config state than fits into this region? Is that
> > supposed to be covered by the device memory region? Or is this assumed
> > to be something so exotic that we don't need to plan for it?
> >   
> Device config data and device config region are all provided by vendor
> driver, so vendor driver is always able to create a large enough device
> config region to hold device config data.
> So, if a device has data that are better to be saved after device stop and
> saved/loaded in strict order, the data needs to be in device config region.
> This kind of data is supposed to be small.
> If the device data can be saved/loaded several times, it can also be put
> into device memory region.

So, it is the vendor driver's decision which device information should
go via which region? With the device config data supposed to be
saved/loaded in one go?

(...)
> > > +/* version number of the device state interface */
> > > +#define VFIO_DEVICE_STATE_INTERFACE_VERSION 1  
> > 
> > Hm. Is this supposed to be backwards-compatible, should we need to bump
> > this?
> >  
> currently not backwards-compatible. we can discuss that.

It might be useful if we discover that we need some extensions. But I'm
not sure how much work it would be.

(...)
> > > +/*
> > > + * DEVICE STATES
> > > + *
> > > + * Four states are defined for a VFIO device:
> > > + * RUNNING, RUNNING & LOGGING, STOP & LOGGING, STOP.
> > > + * They can be set by writing to device_state field of
> > > + * vfio_device_state_ctl region.  
> > 
> > Who controls this? Userspace?  
> 
> Yes. Userspace notifies vendor driver to do the state switching.

Might be good to mention this (just to make it obvious).

> > > + * LOGGING state is a special state that it CANNOT exist
> > > + * independently.  
> > 
> > So it's not a state, but rather a modifier?
> >   
> yes. or thinking LOGGING/not LOGGING as bit 1 of a device state,
> whereas RUNNING/STOPPED is bit 0 of a device state.
> They have to be got as a whole.

So it is (on a bit level):
RUNNING -> 00
STOPPED -> 01
LOGGING/RUNNING -> 10
LOGGING/STOPPED -> 11
 
> > > + * It must be set alongside with state RUNNING or STOP, i.e,
> > > + * RUNNING & LOGGING, STOP & LOGGING.
> > > + * It is used for dirty data logging both for device memory
> > > + * and system memory.
> > > + *
> > > + * LOGGING only impacts device/system memory. In LOGGING state, get buffer
> > > + * of device memory returns dirty pages since last call; outside LOGGING
> > > + * state, get buffer of device memory returns whole snapshot of device
> > > + * memory. system memory's dirty page is only available in LOGGING state.
> > > + *
> > > + * Device config should be always accessible and return whole config snapshot
> > > + * regardless of LOGGING state.
> > > + * */
> > > +#define VFIO_DEVICE_STATE_RUNNING 0
> > > +#define VFIO_DEVICE_STATE_STOP 1
> > > +#define VFIO_DEVICE_STATE_LOGGING 2

This makes it look a bit like LOGGING were an individual state, while 2
is in reality LOGGING/RUNNING... not sure how to make that more
obvious. Maybe (as we are dealing with a u32):

#define VFIO_DEVICE_STATE_RUNNING 0x00000000
#define VFIO_DEVICE_STATE_STOPPED 0x00000001
#define VFIO_DEVICE_STATE_LOGGING_RUNNING 0x00000002
#define VFIO_DEVICE_STATE_LOGGING_STOPPED 0x00000003
#define VFIO_DEVICE_STATE_LOGGING_MASK 0x00000002

> > > +
> > > +/* action to get data from device memory or device config
> > > + * the action is write to device state's control region, and data is read
> > > + * from device memory region or device config region.
> > > + * Each time before read device memory region or device config region,
> > > + * action VFIO_DEVICE_DATA_ACTION_GET_BUFFER is required to write to action
> > > + * field in control region. That is because device memory and devie config
> > > + * region is mmaped into user space. vendor driver has to be notified of
> > > + * the the GET_BUFFER action in advance.
> > > + */
> > > +#define VFIO_DEVICE_DATA_ACTION_GET_BUFFER 1
> > > +
> > > +/* action to set data to device memory or device config
> > > + * the action is write to device state's control region, and data is
> > > + * written to device memory region or device config region.
> > > + * Each time after write to device memory region or device config region,
> > > + * action VFIO_DEVICE_DATA_ACTION_GET_BUFFER is required to write to action
> > > + * field in control region. That is because device memory and devie config
> > > + * region is mmaped into user space. vendor driver has to be notified of
> > > + * the the SET_BUFFER action after data written.
> > > + */
> > > +#define VFIO_DEVICE_DATA_ACTION_SET_BUFFER 2  
> > 
> > Let me describe this in my own words to make sure that I understand
> > this correctly.
> > 
> > - The actions are set by userspace to notify the kernel that it is
> >   going to get data or that it has just written data.
> > - This is needed as a notification that the mmapped data should not be
> >   changed resp. just has changed.  
> we need this notification is because when userspace read the mmapped data,
> it's from the ptr returned from mmap(). So, when userspace reads that ptr,
> there will be no page fault or read/write system calls, so vendor driver
> does not know whether read/write opertation happens or not. 
> Therefore, before userspace reads the ptr from mmap, it first writes action
> field in control region (through write system call), and vendor driver
> will not return the write system call until data prepared.
> 
> When userspace writes to that ptr from mmap, it writes data to the data
> region first, then writes the action field in control region (through write
> system call) to notify vendor driver. vendor driver will return the system
> call after it copies the buffer completely.
> > 
> > So, how does the kernel know whether the read action has finished resp.
> > whether the write action has started? Even if userspace reads/writes it
> > as a whole.
> >   
> kernel does not touch the data region except when in response to the
> "action" write system call.

Thanks for the explanation, that makes sense.
(...)
Yan Zhao Feb. 21, 2019, 12:07 a.m. UTC | #15
On Wed, Feb 20, 2019 at 11:14:24AM +0100, Christophe de Dinechin wrote:
> 
> 
> > On 20 Feb 2019, at 08:58, Zhao Yan <yan.y.zhao@intel.com> wrote:
> > 
> > On Tue, Feb 19, 2019 at 03:42:36PM +0100, Christophe de Dinechin wrote:
> >> 
> >> 
> >>> On 19 Feb 2019, at 09:53, Yan Zhao <yan.y.zhao@intel.com> wrote:
> >>> 
> >>> If a device has device memory capability, save/load data from device memory
> >>> in pre-copy and stop-and-copy phases.
> >>> 
> >>> LOGGING state is set for device memory for dirty page logging:
> >>> in LOGGING state, get device memory returns whole device memory snapshot;
> >>> outside LOGGING state, get device memory returns dirty data since last get
> >>> operation.
> >>> 
> >>> Usually, device memory is very big, qemu needs to chunk it into several
> >>> pieces each with size of device memory region.
> >>> 
> >>> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> >>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> >>> ---
> >>> hw/vfio/migration.c | 235 ++++++++++++++++++++++++++++++++++++++++++++++++++--
> >>> hw/vfio/pci.h       |   1 +
> >>> 2 files changed, 231 insertions(+), 5 deletions(-)
> >>> 
> >>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> >>> index 16d6395..f1e9309 100644
> >>> --- a/hw/vfio/migration.c
> >>> +++ b/hw/vfio/migration.c
> >>> @@ -203,6 +203,201 @@ static int vfio_load_data_device_config(VFIOPCIDevice *vdev,
> >>>    return 0;
> >>> }
> >>> 
> >>> +static int vfio_get_device_memory_size(VFIOPCIDevice *vdev)
> >>> +{
> >>> +    VFIODevice *vbasedev = &vdev->vbasedev;
> >>> +    VFIORegion *region_ctl =
> >>> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> >>> +    uint64_t len;
> >>> +    int sz;
> >>> +
> >>> +    sz = sizeof(len);
> >>> +    if (pread(vbasedev->fd, &len, sz,
> >>> +                region_ctl->fd_offset +
> >>> +                offsetof(struct vfio_device_state_ctl, device_memory.size))
> >>> +            != sz) {
> >>> +        error_report("vfio: Failed to get length of device memory”);
> >> 
> >> s/length/size/ ? (to be consistent with function name)
> > 
> > ok. thanks
> >>> +        return -1;
> >>> +    }
> >>> +    vdev->migration->devmem_size = len;
> >>> +    return 0;
> >>> +}
> >>> +
> >>> +static int vfio_set_device_memory_size(VFIOPCIDevice *vdev, uint64_t size)
> >>> +{
> >>> +    VFIODevice *vbasedev = &vdev->vbasedev;
> >>> +    VFIORegion *region_ctl =
> >>> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> >>> +    int sz;
> >>> +
> >>> +    sz = sizeof(size);
> >>> +    if (pwrite(vbasedev->fd, &size, sz,
> >>> +                region_ctl->fd_offset +
> >>> +                offsetof(struct vfio_device_state_ctl, device_memory.size))
> >>> +            != sz) {
> >>> +        error_report("vfio: Failed to set length of device comemory”);
> >> 
> >> What is comemory? Typo?
> > 
> > Right, typo. should be "memory" :)
> >> 
> >> Same comment about length vs size
> >> 
> > got it. thanks
> > 
> >>> +        return -1;
> >>> +    }
> >>> +    vdev->migration->devmem_size = size;
> >>> +    return 0;
> >>> +}
> >>> +
> >>> +static
> >>> +int vfio_save_data_device_memory_chunk(VFIOPCIDevice *vdev, QEMUFile *f,
> >>> +                                    uint64_t pos, uint64_t len)
> >>> +{
> >>> +    VFIODevice *vbasedev = &vdev->vbasedev;
> >>> +    VFIORegion *region_ctl =
> >>> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> >>> +    VFIORegion *region_devmem =
> >>> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_DEVICE_MEMORY];
> >>> +    void *dest;
> >>> +    uint32_t sz;
> >>> +    uint8_t *buf = NULL;
> >>> +    uint32_t action = VFIO_DEVICE_DATA_ACTION_GET_BUFFER;
> >>> +
> >>> +    if (len > region_devmem->size) {
> >> 
> >> Is it intentional that there is no error_report here?
> >> 
> > an error_report here may be better.
> >>> +        return -1;
> >>> +    }
> >>> +
> >>> +    sz = sizeof(pos);
> >>> +    if (pwrite(vbasedev->fd, &pos, sz,
> >>> +                region_ctl->fd_offset +
> >>> +                offsetof(struct vfio_device_state_ctl, device_memory.pos))
> >>> +            != sz) {
> >>> +        error_report("vfio: Failed to set save buffer pos");
> >>> +        return -1;
> >>> +    }
> >>> +    sz = sizeof(action);
> >>> +    if (pwrite(vbasedev->fd, &action, sz,
> >>> +                region_ctl->fd_offset +
> >>> +                offsetof(struct vfio_device_state_ctl, device_memory.action))
> >>> +            != sz) {
> >>> +        error_report("vfio: Failed to set save buffer action");
> >>> +        return -1;
> >>> +    }
> >>> +
> >>> +    if (!vfio_device_state_region_mmaped(region_devmem)) {
> >>> +        buf = g_malloc(len);
> >>> +        if (buf == NULL) {
> >>> +            error_report("vfio: Failed to allocate memory for migrate”);
> >> s/migrate/migration/ ?
> > 
> > yes, thanks
> >>> +            return -1;
> >>> +        }
> >>> +        if (pread(vbasedev->fd, buf, len, region_devmem->fd_offset) != len) {
> >>> +            error_report("vfio: error load device memory buffer”);
> >> s/load/loading/ ?
> > error to load? :)
> 
> I’d check with a native speaker, but I believe it’s “error loading”.
> 
> To me (to be checked), the two sentences don’t have the same meaning:
> 
> “It is an error to load device memory buffer” -> “You are not allowed to do that”
> “I had an error loading device memory buffer” -> “I tried, but it failed"
>
haha, ok, I'll change it to loading, thanks :)
> > 
> >>> +            return -1;
> >>> +        }
> >>> +        qemu_put_be64(f, len);
> >>> +        qemu_put_be64(f, pos);
> >>> +        qemu_put_buffer(f, buf, len);
> >>> +        g_free(buf);
> >>> +    } else {
> >>> +        dest = region_devmem->mmaps[0].mmap;
> >>> +        qemu_put_be64(f, len);
> >>> +        qemu_put_be64(f, pos);
> >>> +        qemu_put_buffer(f, dest, len);
> >>> +    }
> >>> +    return 0;
> >>> +}
> >>> +
> >>> +static int vfio_save_data_device_memory(VFIOPCIDevice *vdev, QEMUFile *f)
> >>> +{
> >>> +    VFIORegion *region_devmem =
> >>> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_DEVICE_MEMORY];
> >>> +    uint64_t total_len = vdev->migration->devmem_size;
> >>> +    uint64_t pos = 0;
> >>> +
> >>> +    qemu_put_be64(f, total_len);
> >>> +    while (pos < total_len) {
> >>> +        uint64_t len = region_devmem->size;
> >>> +
> >>> +        if (pos + len >= total_len) {
> >>> +            len = total_len - pos;
> >>> +        }
> >>> +        if (vfio_save_data_device_memory_chunk(vdev, f, pos, len)) {
> >>> +            return -1;
> >>> +        }
> >> 
> >> I don’t see where pos is incremented in this loop
> >> 
> > yes, missing one line "pos += len;"
> > Currently, code is not verified in hardware with device memory cap on.
> > Thanks:)
> >>> +    }
> >>> +
> >>> +    return 0;
> >>> +}
> >>> +
> >>> +static
> >>> +int vfio_load_data_device_memory_chunk(VFIOPCIDevice *vdev, QEMUFile *f,
> >>> +                                uint64_t pos, uint64_t len)
> >>> +{
> >>> +    VFIODevice *vbasedev = &vdev->vbasedev;
> >>> +    VFIORegion *region_ctl =
> >>> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> >>> +    VFIORegion *region_devmem =
> >>> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_DEVICE_MEMORY];
> >>> +
> >>> +    void *dest;
> >>> +    uint32_t sz;
> >>> +    uint8_t *buf = NULL;
> >>> +    uint32_t action = VFIO_DEVICE_DATA_ACTION_SET_BUFFER;
> >>> +
> >>> +    if (len > region_devmem->size) {
> >> 
> >> error_report?
> > 
> > seems better to add error_report.
> >>> +        return -1;
> >>> +    }
> >>> +
> >>> +    sz = sizeof(pos);
> >>> +    if (pwrite(vbasedev->fd, &pos, sz,
> >>> +                region_ctl->fd_offset +
> >>> +                offsetof(struct vfio_device_state_ctl, device_memory.pos))
> >>> +            != sz) {
> >>> +        error_report("vfio: Failed to set device memory buffer pos");
> >>> +        return -1;
> >>> +    }
> >>> +    if (!vfio_device_state_region_mmaped(region_devmem)) {
> >>> +        buf = g_malloc(len);
> >>> +        if (buf == NULL) {
> >>> +            error_report("vfio: Failed to allocate memory for migrate");
> >>> +            return -1;
> >>> +        }
> >>> +        qemu_get_buffer(f, buf, len);
> >>> +        if (pwrite(vbasedev->fd, buf, len,
> >>> +                    region_devmem->fd_offset) != len) {
> >>> +            error_report("vfio: Failed to load devie memory buffer");
> >>> +            return -1;
> >>> +        }
> >>> +        g_free(buf);
> >>> +    } else {
> >>> +        dest = region_devmem->mmaps[0].mmap;
> >>> +        qemu_get_buffer(f, dest, len);
> >>> +    }
> >>> +
> >>> +    sz = sizeof(action);
> >>> +    if (pwrite(vbasedev->fd, &action, sz,
> >>> +                region_ctl->fd_offset +
> >>> +                offsetof(struct vfio_device_state_ctl, device_memory.action))
> >>> +            != sz) {
> >>> +        error_report("vfio: Failed to set load device memory buffer action");
> >>> +        return -1;
> >>> +    }
> >>> +
> >>> +    return 0;
> >>> +
> >>> +}
> >>> +
> >>> +static int vfio_load_data_device_memory(VFIOPCIDevice *vdev,
> >>> +                        QEMUFile *f, uint64_t total_len)
> >>> +{
> >>> +    uint64_t pos = 0, len = 0;
> >>> +
> >>> +    vfio_set_device_memory_size(vdev, total_len);
> >>> +
> >>> +    while (pos + len < total_len) {
> >>> +        len = qemu_get_be64(f);
> >>> +        pos = qemu_get_be64(f);
> >> 
> >> Nit: load reads len/pos in the loop, whereas save does it in the
> >> inner function (vfio_save_data_device_memory_chunk)
> > right, load has to read len/pos in the loop.
> >> 
> >>> +
> >>> +        vfio_load_data_device_memory_chunk(vdev, f, pos, len);
> >>> +    }
> >>> +
> >>> +    return 0;
> >>> +}
> >>> +
> >>> +
> >>> static int vfio_set_dirty_page_bitmap_chunk(VFIOPCIDevice *vdev,
> >>>        uint64_t start_addr, uint64_t page_nr)
> >>> {
> >>> @@ -377,6 +572,10 @@ static void vfio_save_live_pending(QEMUFile *f, void *opaque,
> >>>        return;
> >>>    }
> >>> 
> >>> +    /* get dirty data size of device memory */
> >>> +    vfio_get_device_memory_size(vdev);
> >>> +
> >>> +    *res_precopy_only += vdev->migration->devmem_size;
> >>>    return;
> >>> }
> >>> 
> >>> @@ -388,7 +587,9 @@ static int vfio_save_iterate(QEMUFile *f, void *opaque)
> >>>        return 0;
> >>>    }
> >>> 
> >>> -    return 0;
> >>> +    qemu_put_byte(f, VFIO_SAVE_FLAG_DEVMEMORY);
> >>> +    /* get dirty data of device memory */
> >>> +    return vfio_save_data_device_memory(vdev, f);
> >>> }
> >>> 
> >>> static void vfio_pci_load_config(VFIOPCIDevice *vdev, QEMUFile *f)
> >>> @@ -458,6 +659,10 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
> >>>            len = qemu_get_be64(f);
> >>>            vfio_load_data_device_config(vdev, f, len);
> >>>            break;
> >>> +        case VFIO_SAVE_FLAG_DEVMEMORY:
> >>> +            len = qemu_get_be64(f);
> >>> +            vfio_load_data_device_memory(vdev, f, len);
> >>> +            break;
> >>>        default:
> >>>            ret = -EINVAL;
> >>>        }
> >>> @@ -503,6 +708,13 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
> >>>    VFIOPCIDevice *vdev = opaque;
> >>>    int rc = 0;
> >>> 
> >>> +    if (vfio_device_data_cap_device_memory(vdev)) {
> >>> +        qemu_put_byte(f, VFIO_SAVE_FLAG_DEVMEMORY | VFIO_SAVE_FLAG_CONTINUE);
> >>> +        /* get dirty data of device memory */
> >>> +        vfio_get_device_memory_size(vdev);
> >>> +        rc = vfio_save_data_device_memory(vdev, f);
> >>> +    }
> >>> +
> >>>    qemu_put_byte(f, VFIO_SAVE_FLAG_PCI | VFIO_SAVE_FLAG_CONTINUE);
> >>>    vfio_pci_save_config(vdev, f);
> >>> 
> >>> @@ -515,12 +727,22 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
> >>> 
> >>> static int vfio_save_setup(QEMUFile *f, void *opaque)
> >>> {
> >>> +    int rc = 0;
> >>>    VFIOPCIDevice *vdev = opaque;
> >>> -    qemu_put_byte(f, VFIO_SAVE_FLAG_SETUP);
> >>> +
> >>> +    if (vfio_device_data_cap_device_memory(vdev)) {
> >>> +        qemu_put_byte(f, VFIO_SAVE_FLAG_SETUP | VFIO_SAVE_FLAG_CONTINUE);
> >>> +        qemu_put_byte(f, VFIO_SAVE_FLAG_DEVMEMORY);
> >>> +        /* get whole snapshot of device memory */
> >>> +        vfio_get_device_memory_size(vdev);
> >>> +        rc = vfio_save_data_device_memory(vdev, f);
> >>> +    } else {
> >>> +        qemu_put_byte(f, VFIO_SAVE_FLAG_SETUP);
> >>> +    }
> >>> 
> >>>    vfio_set_device_state(vdev, VFIO_DEVICE_STATE_RUNNING |
> >>>            VFIO_DEVICE_STATE_LOGGING);
> >>> -    return 0;
> >>> +    return rc;
> >>> }
> >>> 
> >>> static int vfio_load_setup(QEMUFile *f, void *opaque)
> >>> @@ -576,8 +798,11 @@ int vfio_migration_init(VFIOPCIDevice *vdev, Error **errp)
> >>>        goto error;
> >>>    }
> >>> 
> >>> -    if (vfio_device_data_cap_device_memory(vdev)) {
> >>> -        error_report("No suppport of data cap device memory Yet");
> >>> +    if (vfio_device_data_cap_device_memory(vdev) &&
> >>> +            vfio_device_state_region_setup(vdev,
> >>> +              &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_DEVICE_MEMORY],
> >>> +              VFIO_REGION_SUBTYPE_DEVICE_STATE_DATA_MEMORY,
> >>> +              "device-state-data-device-memory")) {
> >>>        goto error;
> >>>    }
> >>> 
> >>> diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
> >>> index 4b7b1bb..a2cc64b 100644
> >>> --- a/hw/vfio/pci.h
> >>> +++ b/hw/vfio/pci.h
> >>> @@ -69,6 +69,7 @@ typedef struct VFIOMigration {
> >>>    uint32_t data_caps;
> >>>    uint32_t device_state;
> >>>    uint64_t devconfig_size;
> >>> +    uint64_t devmem_size;
> >>>    VMChangeStateEntry *vm_state;
> >>> } VFIOMigration;
> >>> 
> >>> -- 
> >>> 2.7.4
> >>> 
> >>> _______________________________________________
> >>> intel-gvt-dev mailing list
> >>> intel-gvt-dev@lists.freedesktop.org
> >>> https://lists.freedesktop.org/mailman/listinfo/intel-gvt-dev
> >> 
>
Yan Zhao Feb. 21, 2019, 12:24 a.m. UTC | #16
On Wed, Feb 20, 2019 at 11:56:01AM +0000, Gonglei (Arei) wrote:
> Hi yan,
> 
> Thanks for your work.
> 
> I have some suggestions or questions:
> 
> 1) Would you add MSI-X mode support? If not, please add a check in vfio_pci_save_config(), like Nvidia's solution.
ok.

> 2) We should start vfio devices before vcpu resumes, so we can't rely on vm start change handler completely.
VFIO devices are set to the running state by default.
On the target machine, the state transition flow is running -> stop -> running.
So maybe you can ignore the stop notification in the kernel?
> 3) We'd better support live migration rollback since there are many failure scenarios;
>    registering a migration notifier is a good choice.
I think this patchset handles the failure cases well.
If migration fails or is cancelled, the LOGGING state is cleared in the
cleanup handler, while the device state (running or stopped) is kept as
it is.
Then,
if the VM switches back to running, the device state will be set to running;
if the VM stays stopped, the device state also stays stopped (there is no
point in leaving it in the running state).
Do you think so?

> 4) Four memory regions for live migration is too complicated IMHO.
One big region requires the sub-regions to be well padded.
For example, the leading control fields have to be padded out to 4K, and
the same goes for the other data fields.
Otherwise mmap simply fails, because both the start offset and the size
passed to mmap need to be page aligned.

Also, four regions is clearer in my view :)

> 5) About log sync, why not register log_global_start/stop in vfio_memory_listener?
> 
> 
It seems log_global_start/stop cannot be called iteratively in the pre-copy
phase? For dirty pages in system memory, it's better to transfer the dirty
data iteratively to reduce downtime, right?


> Regards,
> -Gonglei
> 
> 
Yan Zhao Feb. 21, 2019, 12:31 a.m. UTC | #17
On Wed, Feb 20, 2019 at 11:01:43AM +0000, Dr. David Alan Gilbert wrote:
> * Zhao Yan (yan.y.zhao@intel.com) wrote:
> > On Tue, Feb 19, 2019 at 11:32:13AM +0000, Dr. David Alan Gilbert wrote:
> > > * Yan Zhao (yan.y.zhao@intel.com) wrote:
> > > > This patchset enables VFIO devices to have live migration capability.
> > > > Currently it does not support post-copy phase.
> > > > 
> > > > It follows Alex's comments on last version of VFIO live migration patches,
> > > > including device states, VFIO device state region layout, dirty bitmap's
> > > > query.
> > > 
> > > Hi,
> > >   I've sent minor comments to later patches; but some minor general
> > > comments:
> > > 
> > >   a) Never trust the incoming migrations stream - it might be corrupt,
> > >     so check when you can.
> > hi Dave
> > Thanks for this suggestion. I'll add more checks for migration streams.
> > 
> > 
> > >   b) How do we detect if we're migrating from/to the wrong device or
> > > version of device?  Or say to a device with older firmware or perhaps
> > > a device that has less device memory ?
> > Actually it's still an open question for VFIO migration. We need to think
> > about whether it's better to check that in libvirt or in qemu (e.g. a
> > device magic along with a version?).
> > This patchset is intended to settle the main device state interfaces for
> > VFIO migration, so that we can build on them and improve.
> > 
> > 
> > >   c) Consider using the trace_ mechanism - it's really useful to
> > > add to loops writing/reading data so that you can see when it fails.
> > > 
> > > Dave
> > >
> > Got it. many thanks~~
> > 
> > 
> > > (P.S. You have a few typo's grep your code for 'devcie', 'devie' and
> > > 'migrtion'
> > 
> > sorry :)
> 
> No problem.
> 
> Given the mails, I'm guessing you've mostly tested this on graphics
> devices?  Have you also checked with VFIO network cards?
> 
Yes, I tested it on Intel's graphics devices, which do not have device
memory, so the device-memory cap is off.
I believe this patchset can work well on VFIO network cards too, because
Gonglei once said their NIC worked well with our previous code (i.e. with
the device-memory cap off).


> Also see the mail I sent in reply to Kirti's series; we need to boil
> these down to one solution.
>
Maybe Kirti can merge their implementation into the code for the
device-memory cap (like in my patch 5 for device memory).

> Dave
> 
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Gonglei (Arei) Feb. 21, 2019, 1:35 a.m. UTC | #18
> -----Original Message-----
> From: Zhao Yan [mailto:yan.y.zhao@intel.com]
> Sent: Thursday, February 21, 2019 8:25 AM
> To: Gonglei (Arei) <arei.gonglei@huawei.com>
> Cc: alex.williamson@redhat.com; qemu-devel@nongnu.org;
> intel-gvt-dev@lists.freedesktop.org; Zhengxiao.zx@Alibaba-inc.com;
> yi.l.liu@intel.com; eskultet@redhat.com; ziye.yang@intel.com;
> cohuck@redhat.com; shuangtai.tst@alibaba-inc.com; dgilbert@redhat.com;
> zhi.a.wang@intel.com; mlevitsk@redhat.com; pasic@linux.ibm.com;
> aik@ozlabs.ru; eauger@redhat.com; felipe@nutanix.com;
> jonathan.davies@nutanix.com; changpeng.liu@intel.com; Ken.Xue@amd.com;
> kwankhede@nvidia.com; kevin.tian@intel.com; cjia@nvidia.com;
> kvm@vger.kernel.org
> Subject: Re: [PATCH 0/5] QEMU VFIO live migration
> 
> On Wed, Feb 20, 2019 at 11:56:01AM +0000, Gonglei (Arei) wrote:
> > Hi yan,
> >
> > Thanks for your work.
> >
> > I have some suggestions or questions:
> >
> > 1) Would you add msix mode support? If not, please add a check in
> vfio_pci_save_config(), like Nvidia's solution.
> ok.
> 
> > 2) We should start vfio devices before the vcpu resumes, so we can't rely on the
> VM state change handler completely.
> vfio devices are by default set to the running state.
> On the target machine, the state transition flow is running->stop->running.

That's confusing. We should start vfio devices after vfio_load_state; otherwise,
how can you keep the devices' state the same between the source side
and the destination side?

> so, maybe you can ignore the stop notification in kernel?
> > > 3) We'd better support live migration rollback since there are many failure
> scenarios;
> >  registering a migration notifier is a good choice.
> I think this patchset can also handle the failure case well.
> If migration failure or cancelling happens, the LOGGING state is cleared
> in the cleanup handler, and the device state (running or stopped) is kept
> as it is.

IIRC there are many failure paths that don't call the cleanup handler.

> Then,
> if the VM switches back to running, the device state will be set to running;
> if the VM stays in the stopped state, the device state is also stopped (it is
> meaningless to leave it in the running state).
> Do you think so?
> 
If the underlying state machine is complicated,
we should proactively tell the vendor driver about the cancelling state.

> > 4) Four memory regions for live migration is too complicated IMHO.
> One big region requires the sub-regions to be well padded:
> e.g. the leading control fields have to be padded to 4K,
> and the same for the other data fields.
> Otherwise, mmap simply fails, because the start offset and size for mmap
> both need to be PAGE aligned.
> 
But if we don't need mmap for the control field and device state, they are basically
small. The performance of pread/pwrite is sufficient.

> Also, 4 regions is clearer in my view :)
> 
> > 5) About log sync, why not register log_global_start/stop in
> vfio_memory_listener?
> >
> >
> It seems log_global_start/stop cannot be called iteratively in the pre-copy phase?
> For dirty pages in system memory, it's better to transfer dirty data
> iteratively to reduce downtime, right?
> 

We just need to invoke start and stop logging once. Why would we need to call
them iteratively? See the memory_listener of vhost.

Regards,
-Gonglei
Yan Zhao Feb. 21, 2019, 1:47 a.m. UTC | #19
On Wed, Feb 20, 2019 at 06:08:13PM +0100, Cornelia Huck wrote:
> On Wed, 20 Feb 2019 02:36:36 -0500
> Zhao Yan <yan.y.zhao@intel.com> wrote:
> 
> > On Tue, Feb 19, 2019 at 02:09:18PM +0100, Cornelia Huck wrote:
> > > On Tue, 19 Feb 2019 16:52:14 +0800
> > > Yan Zhao <yan.y.zhao@intel.com> wrote:
> (...)
> > > > + *          Size of device config data is smaller than or equal to that of
> > > > + *          device config region.  
> > > 
> > > Not sure if I understand that sentence correctly... but what if a
> > > device has more config state than fits into this region? Is that
> > > supposed to be covered by the device memory region? Or is this assumed
> > > to be something so exotic that we don't need to plan for it?
> > >   
> > Device config data and device config region are all provided by vendor
> > driver, so vendor driver is always able to create a large enough device
> > config region to hold device config data.
> > So, if a device has data that are better to be saved after device stop and
> > saved/loaded in strict order, the data needs to be in device config region.
> > This kind of data is supposed to be small.
> > If the device data can be saved/loaded several times, it can also be put
> > into device memory region.
> 
> So, it is the vendor driver's decision which device information should
> go via which region? With the device config data supposed to be
> saved/loaded in one go?
Right, exactly.


> (...)
> > > > +/* version number of the device state interface */
> > > > +#define VFIO_DEVICE_STATE_INTERFACE_VERSION 1  
> > > 
> > > Hm. Is this supposed to be backwards-compatible, should we need to bump
> > > this?
> > >  
> > Currently it is not backwards-compatible. We can discuss that.
> 
> It might be useful if we discover that we need some extensions. But I'm
> not sure how much work it would be.
> 
> (...)
> > > > +/*
> > > > + * DEVICE STATES
> > > > + *
> > > > + * Four states are defined for a VFIO device:
> > > > + * RUNNING, RUNNING & LOGGING, STOP & LOGGING, STOP.
> > > > + * They can be set by writing to device_state field of
> > > > + * vfio_device_state_ctl region.  
> > > 
> > > Who controls this? Userspace?  
> > 
> > Yes. Userspace notifies vendor driver to do the state switching.
> 
> Might be good to mention this (just to make it obvious).
>
Got it, thanks.

> > > > + * LOGGING state is a special state that it CANNOT exist
> > > > + * independently.  
> > > 
> > > So it's not a state, but rather a modifier?
> > >   
> > Yes. Or think of LOGGING/not LOGGING as bit 1 of a device state,
> > whereas RUNNING/STOPPED is bit 0 of a device state.
> > They have to be read as a whole.
> 
> So it is (on a bit level):
> RUNNING -> 00
> STOPPED -> 01
> LOGGING/RUNNING -> 10
> LOGGING/STOPPED -> 11
> 

Yes.

> > > > + * It must be set alongside with state RUNNING or STOP, i.e,
> > > > + * RUNNING & LOGGING, STOP & LOGGING.
> > > > + * It is used for dirty data logging both for device memory
> > > > + * and system memory.
> > > > + *
> > > > + * LOGGING only impacts device/system memory. In LOGGING state, get buffer
> > > > + * of device memory returns dirty pages since last call; outside LOGGING
> > > > + * state, get buffer of device memory returns whole snapshot of device
> > > > + * memory. system memory's dirty page is only available in LOGGING state.
> > > > + *
> > > > + * Device config should be always accessible and return whole config snapshot
> > > > + * regardless of LOGGING state.
> > > > + * */
> > > > +#define VFIO_DEVICE_STATE_RUNNING 0
> > > > +#define VFIO_DEVICE_STATE_STOP 1
> > > > +#define VFIO_DEVICE_STATE_LOGGING 2
> 
> This makes it look a bit like LOGGING were an individual state, while 2
> is in reality LOGGING/RUNNING... not sure how to make that more
> obvious. Maybe (as we are dealing with a u32):
> 
> #define VFIO_DEVICE_STATE_RUNNING 0x00000000
> #define VFIO_DEVICE_STATE_STOPPED 0x00000001
> #define VFIO_DEVICE_STATE_LOGGING_RUNNING 0x00000002
> #define VFIO_DEVICE_STATE_LOGGING_STOPPED 0x00000003
> #define VFIO_DEVICE_STATE_LOGGING_MASK 0x00000002
>
Yes, yours are better, thanks:)

> > > > +
> > > > +/* action to get data from device memory or device config
> > > > + * the action is write to device state's control region, and data is read
> > > > + * from device memory region or device config region.
> > > > + * Each time before reading the device memory region or device config region,
> > > > + * the action VFIO_DEVICE_DATA_ACTION_GET_BUFFER is required to be written to the
> > > > + * action field in the control region. That is because the device memory and device
> > > > + * config regions are mmapped into user space. The vendor driver has to be notified
> > > > + * of the GET_BUFFER action in advance.
> > > > + */
> > > > +#define VFIO_DEVICE_DATA_ACTION_GET_BUFFER 1
> > > > +
> > > > +/* action to set data to device memory or device config
> > > > + * the action is write to device state's control region, and data is
> > > > + * written to device memory region or device config region.
> > > > + * Each time after writing to the device memory region or device config region,
> > > > + * the action VFIO_DEVICE_DATA_ACTION_SET_BUFFER is required to be written to the
> > > > + * action field in the control region. That is because the device memory and device
> > > > + * config regions are mmapped into user space. The vendor driver has to be notified
> > > > + * of the SET_BUFFER action after the data is written.
> > > > + */
> > > > +#define VFIO_DEVICE_DATA_ACTION_SET_BUFFER 2  
> > > 
> > > Let me describe this in my own words to make sure that I understand
> > > this correctly.
> > > 
> > > - The actions are set by userspace to notify the kernel that it is
> > >   going to get data or that it has just written data.
> > > - This is needed as a notification that the mmapped data should not be
> > >   changed resp. just has changed.  
> > We need this notification because when userspace reads the mmapped data,
> > it reads through the pointer returned from mmap(). When userspace reads
> > that pointer, there is no page fault or read/write system call, so the
> > vendor driver does not know whether a read/write operation happens or not.
> > Therefore, before userspace reads the pointer from mmap, it first writes the
> > action field in the control region (through a write system call), and the
> > vendor driver will not return from the write system call until the data is
> > prepared.
> > 
> > When userspace writes through that mmap pointer, it writes data to the data
> > region first, then writes the action field in the control region (through a
> > write system call) to notify the vendor driver. The vendor driver returns
> > from the system call after it copies the buffer completely.
> > > 
> > > So, how does the kernel know whether the read action has finished resp.
> > > whether the write action has started? Even if userspace reads/writes it
> > > as a whole.
> > >   
> > The kernel does not touch the data region except in response to the
> > "action" write system call.
> 
> Thanks for the explanation, that makes sense.
> (...)
You're welcome :)
Yan Zhao Feb. 21, 2019, 1:58 a.m. UTC | #20
On Thu, Feb 21, 2019 at 01:35:43AM +0000, Gonglei (Arei) wrote:
> 
> > On Wed, Feb 20, 2019 at 11:56:01AM +0000, Gonglei (Arei) wrote:
> > > Hi yan,
> > >
> > > Thanks for your work.
> > >
> > > I have some suggestions or questions:
> > >
> > > 1) Would you add msix mode support,? if not, pls add a check in
> > vfio_pci_save_config(), likes Nvidia's solution.
> > ok.
> > 
> > > 2) We should start vfio devices before vcpu resumes, so we can't rely on vm
> > start change handler completely.
> > vfio devices is by default set to running state.
> > In the target machine, its state transition flow is running->stop->running.
> 
> That's confusing. We should start vfio devices after vfio_load_state, otherwise
> how can you keep the devices' information are the same between source side
> and destination side?
>
So, do you mean to set the device state to running in the first call to
vfio_load_state?

> > so, maybe you can ignore the stop notification in kernel?
> > > 3) We'd better support live migration rollback since have many failure
> > scenarios,
> > >  register a migration notifier is a good choice.
> > I think this patchset can also handle the failure case well.
> > if migration failure or cancelling happens,
> > in cleanup handler, LOGGING state is cleared. device state(running or
> > stopped) keeps as it is).
> 
> IIRC there're many failure paths don't calling cleanup handler.
>
Could you give an example?
> > then,
> > if vm switches back to running, device state will be set to running;
> > if vm stayes at stopped state, device state is also stopped (it has no
> > meaning to let it in running state).
> > Do you think so ?
> > 
> IF the underlying state machine is complicated,
> We should tell the canceling state to vendor driver proactively.
> 
That makes sense.

> > > 4) Four memory region for live migration is too complicated IMHO.
> > one big region requires the sub-regions well padded.
> > like for the first control fields, they have to be padded to 4K.
> > the same for other data fields.
> > Otherwise, mmap simply fails, because the start-offset and size for mmap
> > both need to be PAGE aligned.
> > 
> But if we don't need use mmap for control filed and device state, they are small basically.
> The performance is enough using pread/pwrite. 
> 
We don't mmap the control fields. But if the data fields go immediately after
the control fields (e.g. starting at offset 64), we can't mmap the data fields
successfully because their start offset is 64. Therefore the control fields
have to be padded to 4K so that the data fields start at 4K.
That's the drawback of one big region holding both control and data fields.

Yan Zhao Feb. 21, 2019, 2:04 a.m. UTC | #21
> > 
> > > 5) About log sync, why not register log_global_start/stop in
> > vfio_memory_listener?
> > >
> > >
> > seems log_global_start/stop cannot be iterately called in pre-copy phase?
> > for dirty pages in system memory, it's better to transfer dirty data
> > iteratively to reduce down time, right?
> > 
> 
> We just need invoking only once for start and stop logging. Why we need to call
> them literately? See memory_listener of vhost.
>
The dirty pages the device produces in system memory are incremental.
If they can be fetched iteratively, the dirty pages left for the stop-and-copy
phase can be minimal.
:)

> Regards,
> -Gonglei
Gonglei (Arei) Feb. 21, 2019, 3:16 a.m. UTC | #22
> > >
> > > > 5) About log sync, why not register log_global_start/stop in
> > > vfio_memory_listener?
> > > >
> > > >
> > > seems log_global_start/stop cannot be iterately called in pre-copy phase?
> > > for dirty pages in system memory, it's better to transfer dirty data
> > > iteratively to reduce down time, right?
> > >
> >
> > We just need invoking only once for start and stop logging. Why we need to
> call
> > them literately? See memory_listener of vhost.
> >
> the dirty pages in system memory produces by device is incremental.
> if it can be got iteratively, the dirty pages in stop-and-copy phase can be
> minimal.
> :)
> 
I mean starting or stopping the logging capability, not log sync.

We register the below callbacks:

.log_sync = vfio_log_sync,
.log_global_start = vfio_log_global_start,
.log_global_stop = vfio_log_global_stop,

Regards,
-Gonglei
Gonglei (Arei) Feb. 21, 2019, 3:33 a.m. UTC | #23
> On Thu, Feb 21, 2019 at 01:35:43AM +0000, Gonglei (Arei) wrote:
> >
> > > On Wed, Feb 20, 2019 at 11:56:01AM +0000, Gonglei (Arei) wrote:
> > > > Hi yan,
> > > >
> > > > Thanks for your work.
> > > >
> > > > I have some suggestions or questions:
> > > >
> > > > 1) Would you add msix mode support,? if not, pls add a check in
> > > vfio_pci_save_config(), likes Nvidia's solution.
> > > ok.
> > >
> > > > 2) We should start vfio devices before vcpu resumes, so we can't rely on
> vm
> > > start change handler completely.
> > > vfio devices is by default set to running state.
> > > In the target machine, its state transition flow is running->stop->running.
> >
> > That's confusing. We should start vfio devices after vfio_load_state,
> otherwise
> > how can you keep the devices' information are the same between source side
> > and destination side?
> >
> so, your meaning is to set device state to running in the first call to
> vfio_load_state?
> 
No, devices should be started after vfio_load_state and before the vCPUs resume.

> > > so, maybe you can ignore the stop notification in kernel?
> > > > 3) We'd better support live migration rollback since have many failure
> > > scenarios,
> > > >  register a migration notifier is a good choice.
> > > I think this patchset can also handle the failure case well.
> > > if migration failure or cancelling happens,
> > > in cleanup handler, LOGGING state is cleared. device state(running or
> > > stopped) keeps as it is).
> >
> > IIRC there're many failure paths don't calling cleanup handler.
> >
> could you take an example?

Never mind, that's another bug I think. 

Yan Zhao Feb. 21, 2019, 4:08 a.m. UTC | #24
On Thu, Feb 21, 2019 at 03:33:24AM +0000, Gonglei (Arei) wrote:
> 
> > On Thu, Feb 21, 2019 at 01:35:43AM +0000, Gonglei (Arei) wrote:
> > >
> > > > On Wed, Feb 20, 2019 at 11:56:01AM +0000, Gonglei (Arei) wrote:
> > > > > Hi yan,
> > > > >
> > > > > Thanks for your work.
> > > > >
> > > > > I have some suggestions or questions:
> > > > >
> > > > > 1) Would you add msix mode support,? if not, pls add a check in
> > > > vfio_pci_save_config(), likes Nvidia's solution.
> > > > ok.
> > > >
> > > > > 2) We should start vfio devices before vcpu resumes, so we can't rely on
> > vm
> > > > start change handler completely.
> > > > vfio devices is by default set to running state.
> > > > In the target machine, its state transition flow is running->stop->running.
> > >
> > > That's confusing. We should start vfio devices after vfio_load_state,
> > otherwise
> > > how can you keep the devices' information are the same between source side
> > > and destination side?
> > >
> > so, your meaning is to set device state to running in the first call to
> > vfio_load_state?
> > 
> No, it should start devices after vfio_load_state and before vcpu resuming.
>

What about setting the device state to running in the load_cleanup handler?

Yan Zhao Feb. 21, 2019, 4:21 a.m. UTC | #25
On Thu, Feb 21, 2019 at 03:16:45AM +0000, Gonglei (Arei) wrote:
> 
> > > >
> > > > > 5) About log sync, why not register log_global_start/stop in
> > > > vfio_memory_listener?
> > > > >
> > > > >
> > > > seems log_global_start/stop cannot be iterately called in pre-copy phase?
> > > > for dirty pages in system memory, it's better to transfer dirty data
> > > > iteratively to reduce down time, right?
> > > >
> > >
> > > We just need invoking only once for start and stop logging. Why we need to
> > call
> > > them literately? See memory_listener of vhost.
> > >
> > the dirty pages in system memory produces by device is incremental.
> > if it can be got iteratively, the dirty pages in stop-and-copy phase can be
> > minimal.
> > :)
> > 
> I mean starting or stopping the capability of logging, not log sync. 
> 
> We register the below callbacks:
> 
> .log_sync = vfio_log_sync,
> .log_global_start = vfio_log_global_start,
> .log_global_stop = vfio_log_global_stop,
>
.log_global_start is also a good point to notify the logging state.
But by notifying in the .save_setup handler, we can do fine-grained
control of when to notify of logging starting, together with the get_buffer
operation.
Is there any special benefit to registering .log_global_start/stop?


> Regards,
> -Gonglei
Gonglei (Arei) Feb. 21, 2019, 5:46 a.m. UTC | #26
> On Thu, Feb 21, 2019 at 03:33:24AM +0000, Gonglei (Arei) wrote:
> >
> > > On Thu, Feb 21, 2019 at 01:35:43AM +0000, Gonglei (Arei) wrote:
> > > >
> > > >
> > > > > On Wed, Feb 20, 2019 at 11:56:01AM +0000, Gonglei (Arei) wrote:
> > > > > > Hi yan,
> > > > > >
> > > > > > Thanks for your work.
> > > > > >
> > > > > > I have some suggestions or questions:
> > > > > >
> > > > > > 1) Would you add msix mode support? If not, please add a check in
> > > > > > vfio_pci_save_config(), like Nvidia's solution.
> > > > > ok.
> > > > >
> > > > > > 2) We should start vfio devices before vcpu resumes, so we can't
> > > > > > rely on the vm start change handler completely.
> > > > > vfio devices are set to the running state by default.
> > > > > In the target machine, the state transition flow is
> > > > > running->stop->running.
> > > >
> > > > That's confusing. We should start vfio devices after vfio_load_state,
> > > > otherwise how can you ensure the devices' information is the same
> > > > between the source side and the destination side?
> > > >
> > > So, you mean we should set the device state to running in the first
> > > call to vfio_load_state?
> > >
> > No, it should start devices after vfio_load_state and before vcpu resuming.
> >
> 
> What about setting the device state to running in the load_cleanup handler?
> 

The timing is fine, but you should also consider whether the device state
should be set to running in the failure branches that reach the load_cleanup
handler.
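A minimal, self-contained sketch of the point above (all helper names are hypothetical; vfio_set_device_state() stands in for the write to the device_state field of the control region): because .load_cleanup runs on both the success path and the failure paths of an incoming migration, restoring RUNNING there covers the failure branches too.

```c
#include <assert.h>

#define VFIO_DEVICE_STATE_RUNNING 0
#define VFIO_DEVICE_STATE_STOP    1

static int device_state = VFIO_DEVICE_STATE_RUNNING;

/* Stands in for a write to the device_state field of the control region. */
static void vfio_set_device_state(int state)
{
    device_state = state;
}

/* Hypothetical .load_cleanup handler: it is invoked whether the load
 * succeeded or failed, so putting the device back into RUNNING here
 * also covers the failure branches discussed above. */
static int vfio_load_cleanup(void *opaque)
{
    vfio_set_device_state(VFIO_DEVICE_STATE_RUNNING);
    return 0;
}
```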

Regards,
-Gonglei
Gonglei (Arei) Feb. 21, 2019, 5:56 a.m. UTC | #27
> >
> > > -----Original Message-----
> > > From: Zhao Yan [mailto:yan.y.zhao@intel.com]
> > > Sent: Thursday, February 21, 2019 10:05 AM
> > > To: Gonglei (Arei) <arei.gonglei@huawei.com>
> > > Cc: alex.williamson@redhat.com; qemu-devel@nongnu.org;
> > > intel-gvt-dev@lists.freedesktop.org; Zhengxiao.zx@Alibaba-inc.com;
> > > yi.l.liu@intel.com; eskultet@redhat.com; ziye.yang@intel.com;
> > > cohuck@redhat.com; shuangtai.tst@alibaba-inc.com;
> dgilbert@redhat.com;
> > > zhi.a.wang@intel.com; mlevitsk@redhat.com; pasic@linux.ibm.com;
> > > aik@ozlabs.ru; eauger@redhat.com; felipe@nutanix.com;
> > > jonathan.davies@nutanix.com; changpeng.liu@intel.com;
> Ken.Xue@amd.com;
> > > kwankhede@nvidia.com; kevin.tian@intel.com; cjia@nvidia.com;
> > > kvm@vger.kernel.org
> > > Subject: Re: [PATCH 0/5] QEMU VFIO live migration
> > >
> > > > >
> > > > > > 5) About log sync, why not register log_global_start/stop in
> > > > > > vfio_memory_listener?
> > > > > >
> > > > > >
> > > > > It seems log_global_start/stop cannot be called iteratively in the
> > > > > pre-copy phase? For dirty pages in system memory, it's better to
> > > > > transfer dirty data iteratively to reduce downtime, right?
> > > > >
> > > >
> > > > We just need to invoke them once for start and stop logging. Why do we
> > > > need to call them iteratively? See the memory_listener of vhost.
> > > >
> > > The dirty pages the device produces in system memory are incremental.
> > > If they can be fetched iteratively, the dirty pages left for the
> > > stop-and-copy phase can be minimal.
> > > :)
> > >
> > I mean starting or stopping the capability of logging, not log sync.
> >
> > We register the below callbacks:
> >
> > .log_sync = vfio_log_sync,
> > .log_global_start = vfio_log_global_start,
> > .log_global_stop = vfio_log_global_stop,
> >
> .log_global_start is also a good point to notify the logging state.
> But by notifying in the .save_setup handler, we can do fine-grained
> control of when to notify of logging starting together with the
> get_buffer operation.
> Is there any special benefit in registering .log_global_start/stop?
> 

There is a performance benefit when one VM has multiple identical vfio devices.
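A self-contained mock of that trade-off (this mirrors the shape of QEMU's MemoryListener callbacks named above, but nothing here is the real QEMU API): devices sharing one container share one listener, so .log_global_start fires once when migration starts logging, whereas a per-device notification in .save_setup fires once per device.

```c
#include <assert.h>

typedef struct MemoryListener {
    void (*log_global_start)(struct MemoryListener *l);
} MemoryListener;

static int logging_start_calls;

static void vfio_log_global_start(MemoryListener *l)
{
    logging_start_calls++;
}

/* One listener for the shared container. */
static MemoryListener container_listener = {
    .log_global_start = vfio_log_global_start,
};

/* In QEMU, memory_listener_register() would arrange this callback when
 * global dirty logging starts; here we invoke it directly. */
static void migration_start_logging(void)
{
    container_listener.log_global_start(&container_listener);
}

/* The alternative under discussion: each device notifies logging start
 * from its own .save_setup handler. */
static void save_setup_per_device(int nr_devices)
{
    for (int i = 0; i < nr_devices; i++) {
        logging_start_calls++;
    }
}
```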


Regards,
-Gonglei
Dr. David Alan Gilbert Feb. 21, 2019, 9:15 a.m. UTC | #28
* Zhao Yan (yan.y.zhao@intel.com) wrote:
> On Wed, Feb 20, 2019 at 11:01:43AM +0000, Dr. David Alan Gilbert wrote:
> > * Zhao Yan (yan.y.zhao@intel.com) wrote:
> > > On Tue, Feb 19, 2019 at 11:32:13AM +0000, Dr. David Alan Gilbert wrote:
> > > > * Yan Zhao (yan.y.zhao@intel.com) wrote:
> > > > > This patchset enables VFIO devices to have live migration capability.
> > > > > Currently it does not support post-copy phase.
> > > > > 
> > > > > It follows Alex's comments on last version of VFIO live migration patches,
> > > > > including device states, VFIO device state region layout, dirty bitmap's
> > > > > query.
> > > > 
> > > > Hi,
> > > >   I've sent minor comments to later patches; but some minor general
> > > > comments:
> > > > 
> > > >   a) Never trust the incoming migration stream - it might be corrupt,
> > > >     so check when you can.
> > > hi Dave
> > > Thanks for this suggestion. I'll add more checks for migration streams.
> > > 
> > > 
> > > >   b) How do we detect if we're migrating from/to the wrong device or
> > > > version of the device?  Or say to a device with older firmware or
> > > > perhaps a device that has less device memory?
> > > Actually it's still an open question for VFIO migration. We need to
> > > think about whether it's better to check that in libvirt or qemu (like
> > > a device magic along with a version?).
> > > This patchset is intended to settle down the main device state interfaces
> > > for VFIO migration, so that we can build on them and improve them.
> > > 
> > > 
> > > >   c) Consider using the trace_ mechanism - it's really useful to
> > > > add to loops writing/reading data so that you can see when it fails.
> > > > 
> > > > Dave
> > > >
> > > Got it. many thanks~~
> > > 
> > > 
> > > > (P.S. You have a few typos; grep your code for 'devcie', 'devie' and
> > > > 'migrtion')
> > > 
> > > sorry :)
> > 
> > No problem.
> > 
> > Given the mails, I'm guessing you've mostly tested this on graphics
> > devices?  Have you also checked with VFIO network cards?
> > 
> Yes, I tested it on Intel's graphics devices, which do not have device
> memory, so the device-memory cap is off.
> I believe this patchset can work well on VFIO network cards as well,
> because Gonglei once said their NIC worked well on our previous code
> (i.e. with the device-memory cap off).

It would be great if you could find some Intel NIC people to help test
it out.

> 
> > Also see the mail I sent in reply to Kirti's series; we need to boil
> > these down to one solution.
> >
> Maybe Kirti can merge their implementation into the code for the
> device-memory cap (like in my patch 5 for device memory).

It would be great to come up with one patchset between yourself and
Kirti that was tested for Intel and Nvidia GPUs and Intel NICs
(and anyone else who wants to jump on board!).

Dave

> > Dave
> > 
> > > > 
> > > > > Device Data
> > > > > -----------
> > > > > Device data is divided into three types: device memory, device config,
> > > > > and system memory dirty pages produced by device.
> > > > > 
> > > > > Device config: data like MMIOs, page tables...
> > > > >         Every device is supposed to possess device config data.
> > > > >         Usually device config's size is small (no bigger than 10MB), and it
> > > > >         needs to be loaded in certain strict order.
> > > > >         Therefore, device config only needs to be saved/loaded in
> > > > >         stop-and-copy phase.
> > > > >         The data of device config is held in device config region.
> > > > >         Size of device config data is smaller than or equal to that of
> > > > >         device config region.
> > > > > 
> > > > > Device Memory: device's internal memory, standalone and outside system
> > > > >         memory. It is usually very big.
> > > > >         This kind of data needs to be saved / loaded in pre-copy and
> > > > >         stop-and-copy phase.
> > > > >         The data of device memory is held in device memory region.
> > > > >         Size of device memory is usually larger than that of device
> > > > >         memory region. qemu needs to save/load it in chunks of size of
> > > > >         device memory region.
> > > > >         Not all devices have device memory; IGD, for example, only uses system memory.
> > > > > 
> > > > > System memory dirty pages: If a device produces dirty pages in system
> > > > >         memory, it is able to get dirty bitmap for certain range of system
> > > > >         memory. This dirty bitmap is queried in pre-copy and stop-and-copy
> > > > >         phase in .log_sync callback. By setting dirty bitmap in .log_sync
> > > > >         callback, dirty pages in system memory will be saved/loaded by ram's
> > > > >         live migration code.
> > > > >         The dirty bitmap of system memory is held in dirty bitmap region.
> > > > >         If system memory range is larger than that dirty bitmap region can
> > > > >         hold, qemu will cut it into several chunks and get dirty bitmap in
> > > > >         succession.
> > > > > 
> > > > > 
> > > > > Device State Regions
> > > > > --------------------
> > > > > Vendor driver is required to expose two mandatory regions and another two
> > > > > optional regions if it plans to support device state management.
> > > > > 
> > > > > So, there are up to four regions in total.
> > > > > One control region: mandatory.
> > > > >         Get access via read/write system call.
> > > > >         Its layout is defined in struct vfio_device_state_ctl
> > > > > Three data regions: mmaped into qemu.
> > > > >         device config region: mandatory, holding data of device config
> > > > >         device memory region: optional, holding data of device memory
> > > > >         dirty bitmap region: optional, holding bitmap of system memory
> > > > >                             dirty pages
> > > > > 
> > > > > (The reason why four separate regions are defined is that the unit of mmap
> > > > > system call is PAGE_SIZE, i.e. 4k bytes. So one read/write region for
> > > > > control and three mmaped regions for data seems better than one big region
> > > > > padded and sparse mmaped).
> > > > > 
> > > > > 
> > > > > kernel device state interface [1]
> > > > > --------------------------------------
> > > > > #define VFIO_DEVICE_STATE_INTERFACE_VERSION 1
> > > > > #define VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY 1
> > > > > #define VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY 2
> > > > > 
> > > > > #define VFIO_DEVICE_STATE_RUNNING 0 
> > > > > #define VFIO_DEVICE_STATE_STOP 1
> > > > > #define VFIO_DEVICE_STATE_LOGGING 2
> > > > > 
> > > > > #define VFIO_DEVICE_DATA_ACTION_GET_BUFFER 1
> > > > > #define VFIO_DEVICE_DATA_ACTION_SET_BUFFER 2
> > > > > #define VFIO_DEVICE_DATA_ACTION_GET_BITMAP 3
> > > > > 
> > > > > struct vfio_device_state_ctl {
> > > > > 	__u32 version;		  /* ro */
> > > > > 	__u32 device_state;       /* VFIO device state, wo */
> > > > > 	__u32 caps;		 /* ro */
> > > > >         struct {
> > > > > 		__u32 action;  /* wo, GET_BUFFER or SET_BUFFER */ 
> > > > > 		__u64 size;    /*rw*/
> > > > > 	} device_config;
> > > > > 	struct {
> > > > > 		__u32 action;    /* wo, GET_BUFFER or SET_BUFFER */ 
> > > > > 		__u64 size;     /* rw */  
> > > > >                 __u64 pos; /*the offset in total buffer of device memory*/
> > > > > 	} device_memory;
> > > > > 	struct {
> > > > > 		__u64 start_addr; /* wo */
> > > > > 		__u64 page_nr;   /* wo */
> > > > > 	} system_memory;
> > > > > };
> > > > > 
> > > > > Device States
> > > > > -------------
> > > > > After migration is initialized, qemu will set the device state by writing to
> > > > > device_state field of control region.
> > > > > 
> > > > > Four states are defined for a VFIO device:
> > > > >         RUNNING, RUNNING & LOGGING, STOP & LOGGING, STOP 
> > > > > 
> > > > > RUNNING: In this state, a VFIO device is in active state ready to receive
> > > > >         commands from device driver.
> > > > >         It is the default state that a VFIO device enters initially.
> > > > > 
> > > > > STOP:  In this state, a VFIO device is deactivated to interact with
> > > > >        device driver.
> > > > > 
> > > > > LOGGING: a special state that it CANNOT exist independently. It must be
> > > > >        set alongside with state RUNNING or STOP (i.e. RUNNING & LOGGING,
> > > > >        STOP & LOGGING).
> > > > >        Qemu will set LOGGING state on in .save_setup callbacks, then vendor
> > > > >        driver can start dirty data logging for device memory and system
> > > > >        memory.
> > > > >        LOGGING only impacts device/system memory. They return whole
> > > > >        snapshot outside LOGGING and dirty data since last get operation
> > > > >        inside LOGGING.
> > > > >        Device config should be always accessible and return whole config
> > > > >        snapshot regardless of LOGGING state.
> > > > >        
> > > > > Note:
> > > > > The reason why RUNNING is the default state is that device's active state
> > > > > must not depend on device state interface.
> > > > > It is possible that region vfio_device_state_ctl fails to get registered.
> > > > > In that condition, a device needs to be in the active state by default.
> > > > > 
> > > > > Get Version & Get Caps
> > > > > ----------------------
> > > > > On migration init phase, qemu will probe the existence of device state
> > > > > regions of vendor driver, then get version of the device state interface
> > > > > from the r/w control region.
> > > > > 
> > > > > Then it will probe VFIO device's data capability by reading caps field of
> > > > > control region.
> > > > >         #define VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY 1
> > > > >         #define VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY 2
> > > > > If VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY is on, it will save/load data of
> > > > >         device memory in pre-copy and stop-and-copy phase. The data of
> > > > >         device memory is held in device memory region.
> > > > > If VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY is on, it will query dirty pages
> > > > >         produced by VFIO device during pre-copy and stop-and-copy phase.
> > > > >         The dirty bitmap of system memory is held in dirty bitmap region.
> > > > > 
> > > > > If failing to find two mandatory regions and optional data regions
> > > > > corresponding to data caps or version mismatching, it will setup a
> > > > > migration blocker and disable live migration for VFIO device.
> > > > > 
> > > > > 
> > > > > Flows to call device state interface for VFIO live migration
> > > > > ------------------------------------------------------------
> > > > > 
> > > > > Live migration save path:
> > > > > 
> > > > > (QEMU LIVE MIGRATION STATE --> DEVICE STATE INTERFACE --> DEVICE STATE)
> > > > > 
> > > > > MIGRATION_STATUS_NONE --> VFIO_DEVICE_STATE_RUNNING
> > > > >  |
> > > > > MIGRATION_STATUS_SAVE_SETUP
> > > > >  |
> > > > >  .save_setup callback -->
> > > > >  get device memory size (whole snapshot size)
> > > > >  get device memory buffer (whole snapshot data)
> > > > >  set device state --> VFIO_DEVICE_STATE_RUNNING & VFIO_DEVICE_STATE_LOGGING
> > > > >  |
> > > > > MIGRATION_STATUS_ACTIVE
> > > > >  |
> > > > >  .save_live_pending callback --> get device memory size (dirty data)
> > > > >  .save_live_iteration callback --> get device memory buffer (dirty data)
> > > > >  .log_sync callback --> get system memory dirty bitmap
> > > > >  |
> > > > > (vcpu stops) --> set device state -->
> > > > >  VFIO_DEVICE_STATE_STOP & VFIO_DEVICE_STATE_LOGGING
> > > > >  |
> > > > > .save_live_complete_precopy callback -->
> > > > >  get device memory size (dirty data)
> > > > >  get device memory buffer (dirty data)
> > > > >  get device config size (whole snapshot size)
> > > > >  get device config buffer (whole snapshot data)
> > > > >  |
> > > > > .save_cleanup callback -->  set device state --> VFIO_DEVICE_STATE_STOP
> > > > > MIGRATION_STATUS_COMPLETED
> > > > > 
> > > > > MIGRATION_STATUS_CANCELLED or
> > > > > MIGRATION_STATUS_FAILED
> > > > >  |
> > > > > (vcpu starts) --> set device state --> VFIO_DEVICE_STATE_RUNNING
> > > > > 
> > > > > 
> > > > > Live migration load path:
> > > > > 
> > > > > (QEMU LIVE MIGRATION STATE --> DEVICE STATE INTERFACE --> DEVICE STATE)
> > > > > 
> > > > > MIGRATION_STATUS_NONE --> VFIO_DEVICE_STATE_RUNNING
> > > > >  |
> > > > > (vcpu stops) --> set device state --> VFIO_DEVICE_STATE_STOP
> > > > >  |
> > > > > MIGRATION_STATUS_ACTIVE
> > > > >  |
> > > > > .load state callback -->
> > > > >  set device memory size, set device memory buffer, set device config size,
> > > > >  set device config buffer
> > > > >  |
> > > > > (vcpu starts) --> set device state --> VFIO_DEVICE_STATE_RUNNING
> > > > >  |
> > > > > MIGRATION_STATUS_COMPLETED
> > > > > 
> > > > > 
> > > > > 
> > > > > In source VM side,
> > > > > In precopy phase,
> > > > > if a device has VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY on,
> > > > > qemu will first get whole snapshot of device memory in .save_setup
> > > > > callback, and then it will get total size of dirty data in device memory in
> > > > > .save_live_pending callback by reading device_memory.size field of control
> > > > > region.
> > > > > Then in .save_live_iteration callback, it will get buffer of device memory's
> > > > > dirty data chunk by chunk from device memory region by writing pos &
> > > > > action (GET_BUFFER) to device_memory.pos & device_memory.action fields of
> > > > > control region. (size of each chunk is the size of device memory data
> > > > > region).
> > > > > .save_live_pending and .save_live_iteration may be called several times in
> > > > > precopy phase to get dirty data in device memory.
> > > > > 
> > > > > If VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY is off, callbacks in precopy phase
> > > > > like .save_setup, .save_live_pending, .save_live_iteration will not call
> > > > > vendor driver's device state interface to get data from device memory.
> > > > > 
> > > > > In precopy phase, if a device has VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY on,
> > > > > .log_sync callback will get system memory dirty bitmap from dirty bitmap
> > > > > region by writing system memory's start address, page count and action 
> > > > > (GET_BITMAP) to "system_memory.start_addr", "system_memory.page_nr", and
> > > > > "system_memory.action" fields of control region.
> > > > > If page count passed in .log_sync callback is larger than the bitmap size
> > > > > the dirty bitmap region supports, Qemu will cut it into chunks and call
> > > > > vendor driver's get system memory dirty bitmap interface.
> > > > > If VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY is off, .log_sync callback just
> > > > > returns without call to vendor driver.
> > > > > 
> > > > > In stop-and-copy phase, device state will be set to STOP & LOGGING first.
> > > > > In the save_live_complete_precopy callback,
> > > > > if VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY is on,
> > > > > get device memory size and get device memory buffer will be called again.
> > > > > After that,
> > > > > device config data is read from the device config region by reading
> > > > > device_config.size of the control region and writing action (GET_BUFFER) to
> > > > > device_config.action of the control region.
> > > > > Then after migration completes, in the cleanup handler, the LOGGING state
> > > > > will be cleared (i.e. device state is set to STOP).
> > > > > Clearing LOGGING state in cleanup handler is in consideration of the case
> > > > > of "migration failed" and "migration cancelled". They can also leverage
> > > > > the cleanup handler to unset LOGGING state.
> > > > > 
> > > > > 
> > > > > References
> > > > > ----------
> > > > > 1. kernel side implementation of Device state interfaces:
> > > > > https://patchwork.freedesktop.org/series/56876/
> > > > > 
> > > > > 
> > > > > Yan Zhao (5):
> > > > >   vfio/migration: define kernel interfaces
> > > > >   vfio/migration: support device of device config capability
> > > > >   vfio/migration: tracking of dirty page in system memory
> > > > >   vfio/migration: turn on migration
> > > > >   vfio/migration: support device memory capability
> > > > > 
> > > > >  hw/vfio/Makefile.objs         |   2 +-
> > > > >  hw/vfio/common.c              |  26 ++
> > > > >  hw/vfio/migration.c           | 858 ++++++++++++++++++++++++++++++++++++++++++
> > > > >  hw/vfio/pci.c                 |  10 +-
> > > > >  hw/vfio/pci.h                 |  26 +-
> > > > >  include/hw/vfio/vfio-common.h |   1 +
> > > > >  linux-headers/linux/vfio.h    | 260 +++++++++++++
> > > > >  7 files changed, 1174 insertions(+), 9 deletions(-)
> > > > >  create mode 100644 hw/vfio/migration.c
> > > > > 
> > > > > -- 
> > > > > 2.7.4
> > > > > 
> > > > --
> > > > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> > > > _______________________________________________
> > > > intel-gvt-dev mailing list
> > > > intel-gvt-dev@lists.freedesktop.org
> > > > https://lists.freedesktop.org/mailman/listinfo/intel-gvt-dev
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Alex Williamson Feb. 21, 2019, 8:40 p.m. UTC | #29
Hi Yan,

Thanks for working on this!

On Tue, 19 Feb 2019 16:50:54 +0800
Yan Zhao <yan.y.zhao@intel.com> wrote:

> This patchset enables VFIO devices to have live migration capability.
> Currently it does not support post-copy phase.
> 
> It follows Alex's comments on last version of VFIO live migration patches,
> including device states, VFIO device state region layout, dirty bitmap's
> query.
> 
> Device Data
> -----------
> Device data is divided into three types: device memory, device config,
> and system memory dirty pages produced by device.
> 
> Device config: data like MMIOs, page tables...
>         Every device is supposed to possess device config data.
>         Usually device config's size is small (no bigger than 10MB), and it

I'm not sure how we can really impose a limit here, it is what it is
for a device.  A smaller state is obviously desirable to reduce
downtime, but some devices could have very large states.

>         needs to be loaded in certain strict order.
>         Therefore, device config only needs to be saved/loaded in
>         stop-and-copy phase.
>         The data of device config is held in device config region.
>         Size of device config data is smaller than or equal to that of
>         device config region.

So the intention here is that this is the last data read from the
device and it's done in one pass, so the region needs to be large
enough to expose all config data at once.  On restore it's the last
data written before switching the device to the run state.
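A toy model of that ordering (field and action names are from the proposed interface; the plumbing is mocked and hypothetical): on save, GET_BUFFER makes the vendor driver fill the whole config region in one pass; on restore, the region is written first and SET_BUFFER is the last action before the device is set to RUNNING.

```c
#include <assert.h>
#include <string.h>

#define VFIO_DEVICE_DATA_ACTION_GET_BUFFER 1
#define VFIO_DEVICE_DATA_ACTION_SET_BUFFER 2

#define CONFIG_REGION_SIZE 64

/* Mock of the vendor driver side: the config region must be large
 * enough to hold the whole snapshot at once. */
static char device_config[CONFIG_REGION_SIZE];   /* device-internal state */
static char config_region[CONFIG_REGION_SIZE];   /* mmap'd data region    */

/* Stands in for a write to device_config.action in the control region. */
static void write_action(int action)
{
    if (action == VFIO_DEVICE_DATA_ACTION_GET_BUFFER) {
        /* save: snapshot the whole config into the region in one pass */
        memcpy(config_region, device_config, CONFIG_REGION_SIZE);
    } else if (action == VFIO_DEVICE_DATA_ACTION_SET_BUFFER) {
        /* restore: apply the previously written region contents */
        memcpy(device_config, config_region, CONFIG_REGION_SIZE);
    }
}
```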

> 
> Device Memory: device's internal memory, standalone and outside system

s/system/VM/

>         memory. It is usually very big.

Or it doesn't exist.  Not sure we should be setting expectations since
it will vary per device.

>         This kind of data needs to be saved / loaded in pre-copy and
>         stop-and-copy phase.
>         The data of device memory is held in device memory region.
>         Size of device memory is usually larger than that of device
>         memory region. qemu needs to save/load it in chunks of size of
>         device memory region.
>         Not all devices have device memory; IGD, for example, only uses system memory.

It seems a little gratuitous to me that this is a separate region or
that this data is handled separately.  All of this data is opaque to
QEMU, so why do we need to separate it?

> System memory dirty pages: If a device produces dirty pages in system
>         memory, it is able to get dirty bitmap for certain range of system
>         memory. This dirty bitmap is queried in pre-copy and stop-and-copy
>         phase in .log_sync callback. By setting dirty bitmap in .log_sync
>         callback, dirty pages in system memory will be saved/loaded by ram's
>         live migration code.
>         The dirty bitmap of system memory is held in dirty bitmap region.
>         If system memory range is larger than that dirty bitmap region can
>         hold, qemu will cut it into several chunks and get dirty bitmap in
>         succession.
> 
> 
> Device State Regions
> --------------------
> Vendor driver is required to expose two mandatory regions and another two
> optional regions if it plans to support device state management.
> 
> So, there are up to four regions in total.
> One control region: mandatory.
>         Get access via read/write system call.
>         Its layout is defined in struct vfio_device_state_ctl
> Three data regions: mmaped into qemu.

Is mmap mandatory?  I would think the mdev device would define what
access it wants to support per region.  We don't want to impose a more
complicated interface if the device doesn't require it.

>         device config region: mandatory, holding data of device config
>         device memory region: optional, holding data of device memory
>         dirty bitmap region: optional, holding bitmap of system memory
>                             dirty pages
> 
> (The reason why four separate regions are defined is that the unit of mmap
> system call is PAGE_SIZE, i.e. 4k bytes. So one read/write region for
> control and three mmaped regions for data seems better than one big region
> padded and sparse mmaped).

It's not obvious to me how this is better, a big region isn't padded,
there's simply a gap in the file descriptor.  Is having a sub-PAGE_SIZE
gap in a file really of any consequence?  Each region beyond the header
is more than likely larger than PAGE_SIZE, therefore they can be nicely
aligned together.  We still need fields to tell us how much data is
available in each area, so another to tell us the start of each area is
a minor detail.  And I think we still want to allow drivers to specify
which parts of which areas support mmap, so I don't think we're getting
away from sparse mmap support.
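For reference, the sparse mmap capability in <linux/vfio.h> already describes which parts of a single region are mmap-able (struct layout reproduced from memory, with stdint types substituted for the kernel's __u16/__u32/__u64 so it compiles standalone; the lookup helper is our own illustration, not kernel code):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct vfio_info_cap_header {
    uint16_t id;
    uint16_t version;
    uint32_t next;
};

struct vfio_region_sparse_mmap_area {
    uint64_t offset;  /* offset of mmap'able area within the region */
    uint64_t size;    /* size of the mmap'able area */
};

struct vfio_region_info_cap_sparse_mmap {
    struct vfio_info_cap_header header;
    uint32_t nr_areas;
    uint32_t reserved;
    struct vfio_region_sparse_mmap_area areas[];
};

/* Illustrative helper: does [offset, offset + len) fall entirely inside
 * one of the advertised mmap'able areas? */
static int in_sparse_area(const struct vfio_region_info_cap_sparse_mmap *cap,
                          uint64_t offset, uint64_t len)
{
    for (uint32_t i = 0; i < cap->nr_areas; i++) {
        if (offset >= cap->areas[i].offset &&
            offset + len <= cap->areas[i].offset + cap->areas[i].size) {
            return 1;
        }
    }
    return 0;
}
```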

> kernel device state interface [1]
> --------------------------------------
> #define VFIO_DEVICE_STATE_INTERFACE_VERSION 1
> #define VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY 1
> #define VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY 2

If we were to go with this multi-region solution, isn't it evident from
the regions exposed that device memory and a dirty bitmap are
provided?  Alternatively, I believe Kirti's proposal doesn't require
this distinction between device memory and device config, a device not
requiring runtime migration data would simply report no data until the
device moved to the stopped state, making it a consistent path for
userspace.  Likewise, the dirty bitmap could report a zero page count
in the bitmap rather than branching based on device support.

Note that it seems that cap VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY implies
VFIO_REGION_SUBTYPE_DEVICE_STATE_DATA_DIRTYBITMAP, but there's no
consistency in the naming.

> #define VFIO_DEVICE_STATE_RUNNING 0 
> #define VFIO_DEVICE_STATE_STOP 1
> #define VFIO_DEVICE_STATE_LOGGING 2

It looks like these are being defined as bits, since patch 1 talks
about RUNNING & LOGGING and STOP & LOGGING.  I think Connie already
posted some comments about this.  I'm not sure anything prevents us
from defining RUNNING as 1 and STOPPED as 0 so we don't have the
polarity flip vs LOGGING though.

The state "STOP & LOGGING" also feels like a strange "device state", if
the device is stopped, it's not logging any new state, so I think this
is more that the device state is STOP, but the LOGGING feature is
active.  Maybe we should consider these independent bits.  LOGGING is
active as we stop a device so that we can fetch the last dirtied pages,
but disabled as we load the state of the device into the target.
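Illustratively (these values sketch the suggestion above and are not from any header): encoding RUNNING and LOGGING as independent bits avoids the polarity flip and makes "stopped but still logging" a natural combination rather than a special state.

```c
#include <assert.h>

/* Hypothetical alternative encoding: RUNNING is an explicit bit so
 * that 0 means STOP, and LOGGING is an independent feature bit. */
#define VFIO_DEVICE_STATE_STOP    0u
#define VFIO_DEVICE_STATE_RUNNING (1u << 0)
#define VFIO_DEVICE_STATE_LOGGING (1u << 1)

/* pre-copy: device running, logging active */
#define PRECOPY_STATE  (VFIO_DEVICE_STATE_RUNNING | VFIO_DEVICE_STATE_LOGGING)
/* stop-and-copy: device stopped, logging kept on to fetch the last
 * dirtied pages */
#define STOPCOPY_STATE (VFIO_DEVICE_STATE_STOP | VFIO_DEVICE_STATE_LOGGING)
```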

> #define VFIO_DEVICE_DATA_ACTION_GET_BUFFER 1
> #define VFIO_DEVICE_DATA_ACTION_SET_BUFFER 2
> #define VFIO_DEVICE_DATA_ACTION_GET_BITMAP 3
> 
> struct vfio_device_state_ctl {
> 	__u32 version;		  /* ro */
> 	__u32 device_state;       /* VFIO device state, wo */
> 	__u32 caps;		 /* ro */
>         struct {
> 		__u32 action;  /* wo, GET_BUFFER or SET_BUFFER */ 
> 		__u64 size;    /*rw*/
> 	} device_config;

Patch 1 indicates that to get the config buffer we write GET_BUFFER to
action and read from the config region.  The size is previously read
and apparently constant.  To set the config buffer, the config region
is written followed by writing SET_BUFFER to action.  Why is size
listed as read-write?

Doesn't this protocol also require that the mdev driver consume each
full region's worth of host kernel memory for backing pages in
anticipation of a rare event like migration?  This might be a strike
against separate regions if the driver needs to provide backing pages
for 3 separate regions vs 1.  To avoid this runtime overhead, would it
be expected that the user only mmap the regions during migration and
the mdev driver allocate backing pages on mmap?  Should the mmap be
restricted to the LOGGING feature being enabled?  Maybe I'm mistaken in
how the mdev driver would back these mmap'd pages.

> 	struct {
> 		__u32 action;    /* wo, GET_BUFFER or SET_BUFFER */ 
> 		__u64 size;     /* rw */  
>                 __u64 pos; /*the offset in total buffer of device memory*/

Patch 1 outlines the protocol here that getting device memory begins
with writing the position field, followed by reading from the device
memory region.  Setting device memory begins with writing the data to
the device memory region, followed by writing the position field.  Why
does the user need to have visibility of data position?  This is opaque
data to the user, the device should manage how the chunks fit together.

How does the user know when they reach the end?

Bullets 8 and 9 in patch 1 also discuss setting and getting the device
memory size, but these aren't well integrated into the protocol for
getting and setting the memory buffer.  Is getting the device memory
really started by reading the size, which triggers the vendor driver to
snapshot the state in an internal buffer which the user then iterates
through using GET_BUFFER?  Therefore re-reading the size field could
corrupt the data stream?  Wouldn't it be easier to report bytes
available and countdown as the user marks them read?  What does
position mean when we switch from snapshot to dirty data?
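A toy model of the countdown alternative suggested above (all names hypothetical, the driver side is mocked): the driver reports bytes still pending, the user reads up to one region's worth per iteration, and the stream ends when pending reaches zero, with no user-visible position field.

```c
#include <assert.h>

#define DEVMEM_REGION_SIZE 4096u

/* Vendor-driver side state for one snapshot (mock). */
static unsigned long pending_bytes;

static void snapshot_begin(unsigned long total)
{
    pending_bytes = total;
}

/* Reading the size field: bytes still available for this snapshot. */
static unsigned long read_pending(void)
{
    return pending_bytes;
}

/* GET_BUFFER: the driver fills the region with the next chunk and
 * counts down internally; the user never manages a position. */
static unsigned long get_buffer(void)
{
    unsigned long chunk = pending_bytes < DEVMEM_REGION_SIZE ?
                          pending_bytes : DEVMEM_REGION_SIZE;
    pending_bytes -= chunk;
    return chunk;
}
```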

> 	} device_memory;
> 	struct {
> 		__u64 start_addr; /* wo */
> 		__u64 page_nr;   /* wo */
> 	} system_memory;
> };

Why is one specified as an address and the other as pages?  Note that
Kirti's implementation has an optimization to know how many pages are
set within a range to avoid unnecessary buffer reads.
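The chunked dirty bitmap query the cover letter describes can be modelled like this (hypothetical helpers; BITMAP_REGION_PAGES stands in for the capacity of the dirty bitmap region): qemu splits a range larger than the region into successive region-sized queries.

```c
#include <assert.h>

#define BITMAP_REGION_PAGES 1024u  /* pages one query can cover (mock) */

static unsigned int queries;

/* Stands in for writing start_addr/page_nr and action GET_BITMAP to
 * the control region, then reading the dirty bitmap region. */
static void get_bitmap_chunk(unsigned long start_pfn, unsigned long pages)
{
    (void)start_pfn;
    (void)pages;
    queries++;
}

/* Cut a large range into chunks the bitmap region can hold and query
 * them in succession, as described in the cover letter. */
static void log_sync_range(unsigned long start_pfn, unsigned long pages)
{
    while (pages) {
        unsigned long n = pages < BITMAP_REGION_PAGES ?
                          pages : BITMAP_REGION_PAGES;
        get_bitmap_chunk(start_pfn, n);
        start_pfn += n;
        pages -= n;
    }
}
```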

> 
> Device States
> -------------
> After migration is initialized, qemu will set the device state by writing to
> device_state field of control region.
> 
> Four states are defined for a VFIO device:
>         RUNNING, RUNNING & LOGGING, STOP & LOGGING, STOP 
> 
> RUNNING: In this state, a VFIO device is in active state ready to receive
>         commands from device driver.
>         It is the default state that a VFIO device enters initially.
> 
> STOP:  In this state, a VFIO device is deactivated to interact with
>        device driver.
> 
> LOGGING: a special state that it CANNOT exist independently. It must be
>        set alongside with state RUNNING or STOP (i.e. RUNNING & LOGGING,
>        STOP & LOGGING).
>        Qemu will set LOGGING state on in .save_setup callbacks, then vendor
>        driver can start dirty data logging for device memory and system
>        memory.
>        LOGGING only impacts device/system memory. They return whole
>        snapshot outside LOGGING and dirty data since last get operation
>        inside LOGGING.
>        Device config should be always accessible and return whole config
>        snapshot regardless of LOGGING state.
>        
> Note:
> The reason why RUNNING is the default state is that device's active state
> must not depend on device state interface.
> It is possible that region vfio_device_state_ctl fails to get registered.
> In that condition, a device needs to be in the active state by default.
> 
> Get Version & Get Caps
> ----------------------
> On migration init phase, qemu will probe the existence of device state
> regions of vendor driver, then get version of the device state interface
> from the r/w control region.
> 
> Then it will probe VFIO device's data capability by reading caps field of
> control region.
>         #define VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY 1
>         #define VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY 2
> If VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY is on, it will save/load data of
>         device memory in pre-copy and stop-and-copy phase. The data of
>         device memory is held in device memory region.
> If VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY is on, it will query dirty pages
>         produced by VFIO device during pre-copy and stop-and-copy phase.
>         The dirty bitmap of system memory is held in dirty bitmap region.

As above, these capabilities seem redundant to the existence of the
device specific regions in this implementation.

> If the two mandatory regions or the optional data regions corresponding
> to the data caps are missing, or the versions mismatch, it will set up a
> migration blocker and disable live migration for the VFIO device.
> 
> 
> Flows to call device state interface for VFIO live migration
> ------------------------------------------------------------
> 
> Live migration save path:
> 
> (QEMU LIVE MIGRATION STATE --> DEVICE STATE INTERFACE --> DEVICE STATE)
> 
> MIGRATION_STATUS_NONE --> VFIO_DEVICE_STATE_RUNNING
>  |
> MIGRATION_STATUS_SAVE_SETUP
>  |
>  .save_setup callback -->
>  get device memory size (whole snapshot size)
>  get device memory buffer (whole snapshot data)
>  set device state --> VFIO_DEVICE_STATE_RUNNING & VFIO_DEVICE_STATE_LOGGING
>  |
> MIGRATION_STATUS_ACTIVE
>  |
>  .save_live_pending callback --> get device memory size (dirty data)
>  .save_live_iteration callback --> get device memory buffer (dirty data)
>  .log_sync callback --> get system memory dirty bitmap
>  |
> (vcpu stops) --> set device state -->
>  VFIO_DEVICE_STATE_STOP & VFIO_DEVICE_STATE_LOGGING
>  |
> .save_live_complete_precopy callback -->
>  get device memory size (dirty data)
>  get device memory buffer (dirty data)
>  get device config size (whole snapshot size)
>  get device config buffer (whole snapshot data)
>  |
> .save_cleanup callback -->  set device state --> VFIO_DEVICE_STATE_STOP
> MIGRATION_STATUS_COMPLETED
> 
> MIGRATION_STATUS_CANCELLED or
> MIGRATION_STATUS_FAILED
>  |
> (vcpu starts) --> set device state --> VFIO_DEVICE_STATE_RUNNING
> 
> 
> Live migration load path:
> 
> (QEMU LIVE MIGRATION STATE --> DEVICE STATE INTERFACE --> DEVICE STATE)
> 
> MIGRATION_STATUS_NONE --> VFIO_DEVICE_STATE_RUNNING
>  |
> (vcpu stops) --> set device state --> VFIO_DEVICE_STATE_STOP
>  |
> MIGRATION_STATUS_ACTIVE
>  |
> .load state callback -->
>  set device memory size, set device memory buffer, set device config size,
>  set device config buffer
>  |
> (vcpu starts) --> set device state --> VFIO_DEVICE_STATE_RUNNING
>  |
> MIGRATION_STATUS_COMPLETED
> 
> 
> 
> In source VM side,
> In precopy phase,
> if a device has VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY on,
> qemu will first get whole snapshot of device memory in .save_setup
> callback, and then it will get total size of dirty data in device memory in
> .save_live_pending callback by reading device_memory.size field of control
> region.

This requires iterative reads of device memory buffer but the protocol
is unclear (to me) how the user knows how to do this or interact with
the position field. 

> Then in .save_live_iteration callback, it will get buffer of device memory's
> dirty data chunk by chunk from device memory region by writing pos &
> action (GET_BUFFER) to device_memory.pos & device_memory.action fields of
> control region. (size of each chunk is the size of device memory data
> region).

What if there's not enough dirty data to fill the region?  The data is
always padded to fill the region?

> .save_live_pending and .save_live_iteration may be called several times in
> precopy phase to get dirty data in device memory.
> 
> If VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY is off, callbacks in precopy phase
> like .save_setup, .save_live_pending, .save_live_iteration will not call
> vendor driver's device state interface to get data from device memory.

Therefore through the entire precopy phase we have no data from source
to target to begin a compatibility check :-\  I think both proposals
currently still lack any sort of device compatibility or data
versioning check between source and target.  Thanks,

Alex

> In precopy phase, if a device has VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY on,
> .log_sync callback will get system memory dirty bitmap from dirty bitmap
> region by writing system memory's start address, page count and action 
> (GET_BITMAP) to "system_memory.start_addr", "system_memory.page_nr", and
> "system_memory.action" fields of control region.
> If page count passed in .log_sync callback is larger than the bitmap size
> the dirty bitmap region supports, Qemu will cut it into chunks and call
> vendor driver's get system memory dirty bitmap interface.
> If VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY is off, .log_sync callback just
> returns without call to vendor driver.
> 
> In stop-and-copy phase, device state will be set to STOP & LOGGING first.
> In the .save_live_complete_precopy callback,
> if VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY is on,
> get device memory size and get device memory buffer will be called again.
> After that,
> device config data is read from the device config region by reading
> device_config.size of the control region and writing action (GET_BUFFER)
> to device_config.action of the control region.
> Then after migration completes, in the cleanup handler, the LOGGING state
> will be cleared (i.e. device state is set to STOP).
> Clearing the LOGGING state in the cleanup handler also covers the
> "migration failed" and "migration cancelled" cases, which leverage the
> same cleanup handler to unset LOGGING.
> 
> 
> References
> ----------
> 1. kernel side implementation of Device state interfaces:
> https://patchwork.freedesktop.org/series/56876/
> 
> 
> Yan Zhao (5):
>   vfio/migration: define kernel interfaces
>   vfio/migration: support device of device config capability
>   vfio/migration: tracking of dirty page in system memory
>   vfio/migration: turn on migration
>   vfio/migration: support device memory capability
> 
>  hw/vfio/Makefile.objs         |   2 +-
>  hw/vfio/common.c              |  26 ++
>  hw/vfio/migration.c           | 858 ++++++++++++++++++++++++++++++++++++++++++
>  hw/vfio/pci.c                 |  10 +-
>  hw/vfio/pci.h                 |  26 +-
>  include/hw/vfio/vfio-common.h |   1 +
>  linux-headers/linux/vfio.h    | 260 +++++++++++++
>  7 files changed, 1174 insertions(+), 9 deletions(-)
>  create mode 100644 hw/vfio/migration.c
>
Yan Zhao Feb. 25, 2019, 2:22 a.m. UTC | #30
On Thu, Feb 21, 2019 at 01:40:51PM -0700, Alex Williamson wrote:
> Hi Yan,
> 
> Thanks for working on this!
> 
> On Tue, 19 Feb 2019 16:50:54 +0800
> Yan Zhao <yan.y.zhao@intel.com> wrote:
> 
> > This patchset enables VFIO devices to have live migration capability.
> > Currently it does not support post-copy phase.
> > 
> > It follows Alex's comments on last version of VFIO live migration patches,
> > including device states, VFIO device state region layout, dirty bitmap's
> > query.
> > 
> > Device Data
> > -----------
> > Device data is divided into three types: device memory, device config,
> > and system memory dirty pages produced by device.
> > 
> > Device config: data like MMIOs, page tables...
> >         Every device is supposed to possess device config data.
> >         Usually device config's size is small (no bigger than 10MB), and it
> 
> I'm not sure how we can really impose a limit here, it is what it is
> for a device.  A smaller state is obviously desirable to reduce
> downtime, but some devices could have very large states.
> 
> >         needs to be loaded in certain strict order.
> >         Therefore, device config only needs to be saved/loaded in
> >         stop-and-copy phase.
> >         The data of device config is held in device config region.
> >         Size of device config data is smaller than or equal to that of
> >         device config region.
> 
> So the intention here is that this is the last data read from the
> device and it's done in one pass, so the region needs to be large
> enough to expose all config data at once.  On restore it's the last
> data written before switching the device to the run state.
> 
> > 
> > Device Memory: device's internal memory, standalone and outside system
> 
> s/system/VM/
> 
> >         memory. It is usually very big.
> 
> Or it doesn't exist.  Not sure we should be setting expectations since
> it will vary per device.
> 
> >         This kind of data needs to be saved / loaded in pre-copy and
> >         stop-and-copy phase.
> >         The data of device memory is held in device memory region.
> >         Size of device memory is usually larger than that of device
> >         memory region. qemu needs to save/load it in chunks of size of
> >         device memory region.
> >         Not all devices have device memory. IGD, for example, only uses system memory.
> 
> It seems a little gratuitous to me that this is a separate region or
> that this data is handled separately.  All of this data is opaque to
> QEMU, so why do we need to separate it?
hi Alex,
as the device state interfaces are provided by the kernel, they are
expected to meet needs as general as possible. So, do you think there are
use cases where user space knows the device well and wants the kernel to
return specific data back to it?
E.g. it just wants to get the whole device config data, including all
MMIOs, page tables, PCI config data...
or it just wants to get the current device memory snapshot, not including
any dirty data,
or it just needs the dirty pages in device memory or system memory.
With such accurate queries, quite a lot of useful features can be
developed in user space.

If all of this data is opaque to the user app, it seems the only use case
is live migration.

From another aspect, if the final solution is to keep the data opaque to
user space, like what NVidia did, the kernel side's implementation will be
more complicated, and actually a bit of a challenge for the vendor driver.
In that case, in pre-copy phase,
1. when not in the LOGGING state, the vendor driver first returns full
data including a full device memory snapshot
2. user space reads some data (you can't expect it to finish reading all
data)
3. then userspace sets the device state to LOGGING to start dirty data
logging
4. the vendor driver starts dirty data logging, and appends the dirty data
to the tail of the remaining unread full data and increases the pending
data size?
5. user space keeps reading data.
6. the vendor driver keeps appending new dirty data to the tail of the
remaining unread full data/dirty data and increases the pending data size?

in stop-and-copy phase,
1. user space sets the device state to exit the LOGGING state,
2. the vendor driver stops data logging. It has to append the device
   config data at the tail of the remaining dirty data unread by userspace.

during this flow, when should the vendor driver get dirty data? Does it
just keep logging and appending to the tail? How does it ensure dirty data
is fresh before the LOGGING state exits? How does the vendor driver know
whether certain dirty data has been copied or not?

I've no idea how NVidia handles this problem, and they don't open their
kernel side code.
I just feel it's a bit hard for other vendor drivers to follow:)

> > System memory dirty pages: If a device produces dirty pages in system
> >         memory, it is able to get dirty bitmap for certain range of system
> >         memory. This dirty bitmap is queried in pre-copy and stop-and-copy
> >         phase in .log_sync callback. By setting dirty bitmap in .log_sync
> >         callback, dirty pages in system memory will be save/loaded by ram's
> >         live migration code.
> >         The dirty bitmap of system memory is held in dirty bitmap region.
> >         If system memory range is larger than that dirty bitmap region can
> >         hold, qemu will cut it into several chunks and get dirty bitmap in
> >         succession.
> > 
> > 
> > Device State Regions
> > --------------------
> > Vendor driver is required to expose two mandatory regions and another two
> > optional regions if it plans to support device state management.
> > 
> > So, there are up to four regions in total.
> > One control region: mandatory.
> >         Get access via read/write system call.
> >         Its layout is defined in struct vfio_device_state_ctl
> > Three data regions: mmaped into qemu.
> 
> Is mmap mandatory?  I would think this would be defined by the mdev
> device what access they want to support per region.  We don't want to
> impose a more complicated interface if the device doesn't require it.
I think it's "mmap is preferred, but allowed to fail".
Just like a normal region with the MMAP flag on (like BAR regions), we also
allow its mmap to fail, right?

> >         device config region: mandatory, holding data of device config
> >         device memory region: optional, holding data of device memory
> >         dirty bitmap region: optional, holding bitmap of system memory
> >                             dirty pages
> > 
> > (The reason why four separate regions are defined is that the unit of mmap
> > system call is PAGE_SIZE, i.e. 4k bytes. So one read/write region for
> > control and three mmaped regions for data seems better than one big region
> > padded and sparse mmaped).
> 
> It's not obvious to me how this is better, a big region isn't padded,
> there's simply a gap in the file descriptor.  Is having a sub-PAGE_SIZE
> gap in a file really of any consequence?  Each region beyond the header
> is more than likely larger than PAGE_SIZE, therefore they can be nicely
> aligned together.  We still need fields to tell us how much data is
> available in each area, so another to tell us the start of each area is
> a minor detail.  And I think we still want to allow drivers to specify
> which parts of which areas support mmap, so I don't think we're getting
> away from sparse mmap support.

with separate regions and sub-region types defined, user space can
explicitly tell which region is which after vfio_get_dev_region_info(),
and along with it, user space knows each region's offset and size. mmap is
allowed to fail and falls back to normal read/write to the region.

But with one big region and sparse mmapped subregions (1 data subregion or
3 data subregions, whatever), userspace can't tell which subregion is
which.
So, if using one big region, I think we need to explicitly define the
subregions' sequence (like index 0 is dedicated to the control subregion,
index 1 is for the device config data subregion ...). The vendor driver
cannot freely change the sequence.
Then keep the data offset the same as region->mmaps[i].offset, and the
data size the same as region->mmaps[i].size (i.e. let actual data start
immediately from the first byte of its data subregion).
Also, mmaps for sparse mmapped subregions are not allowed to fail.


With one big region, we also need to consider the case where the vendor
driver does not want the data subregion to be mmaped.
So, what is the data layout for that case?
Does the data subregion immediately follow the control subregion, or not?
Of course, for this condition, we can specify the data field's start
offset and size through the control region. And we must not expect the
data start offsets in source and target to be equal.
(Because the big region's fd_offset may vary between source and target:
consider the case where both source and target have one opregion and one
device state region, but the source has the opregion first while the
target has the device state region first.
If we think this case is illegal, we must be able to detect it in the
first place.)
Also, we must keep the start offset and size consistent with the above
mmap case.


> > kernel device state interface [1]
> > --------------------------------------
> > #define VFIO_DEVICE_STATE_INTERFACE_VERSION 1
> > #define VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY 1
> > #define VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY 2
> 
> If we were to go with this multi-region solution, isn't it evident from
> the regions exposed that device memory and a dirty bitmap are
> provided?  Alternatively, I believe Kirti's proposal doesn't require

> this distinction between device memory and device config, a device not
> requiring runtime migration data would simply report no data until the
> device moved to the stopped state, making it consistent path for
> userspace.  Likewise, the dirty bitmap could report a zero page count
> in the bitmap rather than branching based on device support.
If the path in userspace is consistent for device config and device
memory, there will be many unnecessary calls into the vendor driver just
to get the data size.
 

> Note that it seems that cap VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY implies
> VFIO_REGION_SUBTYPE_DEVICE_STATE_DATA_DIRTYBITMAP, but there's no
> consistency in the naming.
> 
> > #define VFIO_DEVICE_STATE_RUNNING 0 
> > #define VFIO_DEVICE_STATE_STOP 1
> > #define VFIO_DEVICE_STATE_LOGGING 2
> 
> It looks like these are being defined as bits, since patch 1 talks
> about RUNNING & LOGGING and STOP & LOGGING.  I think Connie already
> posted some comments about this.  I'm not sure anything prevents us
> from defining RUNNING a 1 and STOPPED as 0 so we don't have the
> polarity flip vs LOGGING though.
> 
> The state "STOP & LOGGING" also feels like a strange "device state", if
> the device is stopped, it's not logging any new state, so I think this
> is more that the device state is STOP, but the LOGGING feature is
> active.  Maybe we should consider these independent bits.  LOGGING is
> active as we stop a device so that we can fetch the last dirtied pages,
> but disabled as we load the state of the device into the target.
> 
> > #define VFIO_DEVICE_DATA_ACTION_GET_BUFFER 1
> > #define VFIO_DEVICE_DATA_ACTION_SET_BUFFER 2
> > #define VFIO_DEVICE_DATA_ACTION_GET_BITMAP 3
> > 
> > struct vfio_device_state_ctl {
> > 	__u32 version;		  /* ro */
> > 	__u32 device_state;       /* VFIO device state, wo */
> > 	__u32 caps;		 /* ro */
> >         struct {
> > 		__u32 action;  /* wo, GET_BUFFER or SET_BUFFER */ 
> > 		__u64 size;    /*rw*/
> > 	} device_config;
> 
> Patch 1 indicates that to get the config buffer we write GET_BUFFER to
> action and read from the config region.  The size is previously read
> and apparently constant.  To set the config buffer, the config region
> is written followed by writing SET_BUFFER to action.  Why is size
> listed as read-write?
this is the size of config data.
size of config data <= size of config data region.


> Doesn't this protocol also require that the mdev driver consume each
> full region's worth of host kernel memory for backing pages in
> anticipation of a rare event like migration?  This might be a strike
> against separate regions if the driver needs to provide backing pages
> for 3 separate regions vs 1.  To avoid this runtime overhead, would it
> be expected that the user only mmap the regions during migration and
> the mdev driver allocate backing pages on mmap?  Should the mmap be
> restricted to the LOGGING feature being enabled?  Maybe I'm mistaken in
> how the mdev driver would back these mmap'd pages.
>
yes, 3 separate regions consume a little more memory than 1 region,
but it's just a little overhead.
As in intel's kernel implementation,
the device config region's size is 9M and the dirty bitmap region's size
is 16k.
if there is a device memory region, its size can be defined as 100M?
so it's 109M vs 100M ?

> > 	struct {
> > 		__u32 action;    /* wo, GET_BUFFER or SET_BUFFER */ 
> > 		__u64 size;     /* rw */  
> >                 __u64 pos; /*the offset in total buffer of device memory*/
> 
> Patch 1 outlines the protocol here that getting device memory begins
> with writing the position field, followed by reading from the device
> memory region.  Setting device memory begins with writing the data to
> the device memory region, followed by writing the position field.  Why
> does the user need to have visibility of data position?  This is opaque
> data to the user, the device should manage how the chunks fit together.
> 
> How does the user know when they reach the end?
sorry, maybe I didn't explain clearly here.

device  ________________________________________
memory  |    |    |////|    |    |    |    |    |
data:   |____|____|////|____|____|____|____|____|
                  :pos :
                  :    :
device            :____:
memory            |    |
region:           |____|

the whole sequence is like this:

1. user space reads device_memory.size
2. the driver gets the device memory's data (full snapshot or dirty data,
depending on whether it's in LOGGING state or not), and returns the total
size of this data.
3. user space finishes reading device_memory.size (which may be >= the
device memory region's size)

4. user space starts a loop like
  
   while (pos < total_len) {
        uint64_t len = region_devmem->size;

        if (pos + len >= total_len) {
            len = total_len - pos;
        }
        if (vfio_save_data_device_memory_chunk(vdev, f, pos, len)) {
            return -1;
        }
        pos += len;
    }

 vfio_save_data_device_memory_chunk reads each chunk from the device
 memory region by writing GET_BUFFER to device_memory.action, and pos to
 device_memory.pos.


So each time, userspace finishes getting the device memory data in one
pass.

specifying "pos" is just like the "lseek" before "write".

> Bullets 8 and 9 in patch 1 also discuss setting and getting the device
> memory size, but these aren't well integrated into the protocol for
> getting and setting the memory buffer.  Is getting the device memory
> really started by reading the size, which triggers the vendor driver to
> snapshot the state in an internal buffer which the user then iterates
> through using GET_BUFFER?  Therefore re-reading the size field could
> corrupt the data stream?  Wouldn't it be easier to report bytes
> available and countdown as the user marks them read?  What does
> position mean when we switch from snapshot to dirty data?
when switching to the device memory's dirty data, pos means the position
within the whole dirty data.

.save_live_pending ==> driver gets dirty data in device memory and returns
total size.

.save_live_iterate ==> userspace reads all dirty data from device memory
region chunk by chunk

So, in one iteration, all dirty data is saved.
Then in the next iteration, the dirty data is recalculated.


> 
> > 	} device_memory;
> > 	struct {
> > 		__u64 start_addr; /* wo */
> > 		__u64 page_nr;   /* wo */
> > 	} system_memory;
> > };
> 
> Why is one specified as an address and the other as pages?  Note that
Yes, start_addr ==> start pfn is better

> Kirti's implementation has an optimization to know how many pages are
> set within a range to avoid unnecessary buffer reads.
> 

Let's use start_pfn_all, page_nr_all to represent the start pfn and
page_nr passed in from qemu .log_sync interface.

and use start_pfn_i, page_nr_i for the values passed to the driver.


start_pfn_all
  |         start_pfn_i
  |         |
 \ /_______\_/_____________________________
  |    |    |////|    |    |    |    |    |
  |____|____|////|____|____|____|____|____|
            :    :
            :    :
            :____:
bitmap      |    |
region:     |____|
           

1. Each time QEMU queries the dirty bitmap from the driver, it passes in
start_pfn_i and page_nr_i (page_nr_i is the largest page count the bitmap
region can hold).
2. the driver queries the memory range starting at start_pfn_i with size
page_nr_i.
3. the driver returns a bitmap (if there's no dirty data, the bitmap is
all 0).
4. QEMU saves the pages according to the bitmap

If there's no dirty data found in step 2, step 4 can be skipped
(I'll add this check before step 4 in the future, thanks),
but if there's even 1 bit set in the bitmap, none of steps 1-4 can be
skipped.

Honestly, after reviewing Kirti's implementation, I don't think it's an
optimization. As in the pseudo code below for Kirti's code, I would think
copied_pfns corresponds to the page_nr_i in my case. So, is the case of
copied_pfns equaling 0 meant for the tail chunk? I don't think it's
working..

write start_pfn to driver
write page_size  to driver
write pfn_count to driver

do {
    read copied_pfns from driver.
    if (copied_pfns == 0) {
       break;
    }
   bitmap_size = (BITS_TO_LONGS(copied_pfns) + 1) * sizeof(unsigned long);
   buf = get bitmap from driver
   cpu_physical_memory_set_dirty_lebitmap((unsigned long *)buf,
                                           (start_pfn + count) * page_size,
                                                copied_pfns);

     count +=  copied_pfns;

} while (count < pfn_count);



> > 
> > Device States
> > -------------
> > After migration is initialized, QEMU sets the device state by writing to
> > the device_state field of the control region.
> > 
> > Four states are defined for a VFIO device:
> >         RUNNING, RUNNING & LOGGING, STOP & LOGGING, STOP 
> > 
> > RUNNING: In this state, a VFIO device is active and ready to receive
> >         commands from the device driver.
> >         It is the default state that a VFIO device enters initially.
> > 
> > STOP:  In this state, a VFIO device is deactivated and no longer
> >        interacts with the device driver.
> > 
> > LOGGING: a special state that CANNOT exist independently. It must be
> >        set together with RUNNING or STOP (i.e. RUNNING & LOGGING,
> >        STOP & LOGGING).
> >        Qemu sets the LOGGING state in the .save_setup callback, after
> >        which the vendor driver can start dirty-data logging for device
> >        memory and system memory.
> >        LOGGING only affects device memory and system memory: outside
> >        LOGGING they return a whole snapshot, inside LOGGING they return
> >        the dirty data since the last get operation.
> >        Device config should always be accessible and return a whole
> >        config snapshot regardless of the LOGGING state.
> >        
> > Note:
> > RUNNING is the default state because a device's active state must not
> > depend on the device state interface: the vfio_device_state_ctl region
> > may fail to get registered, and in that case the device still needs to
> > be active by default.
> > 
> > Get Version & Get Caps
> > ----------------------
> > On migration init phase, qemu will probe the existence of device state
> > regions of vendor driver, then get version of the device state interface
> > from the r/w control region.
> > 
> > Then it will probe VFIO device's data capability by reading caps field of
> > control region.
> >         #define VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY 1
> >         #define VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY 2
> > If VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY is on, it will save/load data of
> >         device memory in pre-copy and stop-and-copy phase. The data of
> >         device memory is held in device memory region.
> > If VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY is on, it will query dirty pages
> >         produced by VFIO device during pre-copy and stop-and-copy phase.
> >         The dirty bitmap of system memory is held in dirty bitmap region.
> 
> As above, these capabilities seem redundant to the existence of the
> device specific regions in this implementation.
>
seems so :)

> > If the two mandatory regions or the optional data regions corresponding
> > to the data caps are missing, or the versions mismatch, it will set up a
> > migration blocker and disable live migration for the VFIO device.
> > 
> > 
> > Flows to call device state interface for VFIO live migration
> > ------------------------------------------------------------
> > 
> > Live migration save path:
> > 
> > (QEMU LIVE MIGRATION STATE --> DEVICE STATE INTERFACE --> DEVICE STATE)
> > 
> > MIGRATION_STATUS_NONE --> VFIO_DEVICE_STATE_RUNNING
> >  |
> > MIGRATION_STATUS_SAVE_SETUP
> >  |
> >  .save_setup callback -->
> >  get device memory size (whole snapshot size)
> >  get device memory buffer (whole snapshot data)
> >  set device state --> VFIO_DEVICE_STATE_RUNNING & VFIO_DEVICE_STATE_LOGGING
> >  |
> > MIGRATION_STATUS_ACTIVE
> >  |
> >  .save_live_pending callback --> get device memory size (dirty data)
> >  .save_live_iteration callback --> get device memory buffer (dirty data)
> >  .log_sync callback --> get system memory dirty bitmap
> >  |
> > (vcpu stops) --> set device state -->
> >  VFIO_DEVICE_STATE_STOP & VFIO_DEVICE_STATE_LOGGING
> >  |
> > .save_live_complete_precopy callback -->
> >  get device memory size (dirty data)
> >  get device memory buffer (dirty data)
> >  get device config size (whole snapshot size)
> >  get device config buffer (whole snapshot data)
> >  |
> > .save_cleanup callback -->  set device state --> VFIO_DEVICE_STATE_STOP
> > MIGRATION_STATUS_COMPLETED
> > 
> > MIGRATION_STATUS_CANCELLED or
> > MIGRATION_STATUS_FAILED
> >  |
> > (vcpu starts) --> set device state --> VFIO_DEVICE_STATE_RUNNING
> > 
> > 
> > Live migration load path:
> > 
> > (QEMU LIVE MIGRATION STATE --> DEVICE STATE INTERFACE --> DEVICE STATE)
> > 
> > MIGRATION_STATUS_NONE --> VFIO_DEVICE_STATE_RUNNING
> >  |
> > (vcpu stops) --> set device state --> VFIO_DEVICE_STATE_STOP
> >  |
> > MIGRATION_STATUS_ACTIVE
> >  |
> > .load state callback -->
> >  set device memory size, set device memory buffer, set device config size,
> >  set device config buffer
> >  |
> > (vcpu starts) --> set device state --> VFIO_DEVICE_STATE_RUNNING
> >  |
> > MIGRATION_STATUS_COMPLETED
> > 
> > 
> > 
> > In source VM side,
> > In precopy phase,
> > if a device has VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY on,
> > qemu will first get whole snapshot of device memory in .save_setup
> > callback, and then it will get total size of dirty data in device memory in
> > .save_live_pending callback by reading device_memory.size field of control
> > region.
> 
> This requires iterative reads of device memory buffer but the protocol
> is unclear (to me) how the user knows how to do this or interact with
> the position field. 
> 
> > Then in .save_live_iteration callback, it will get buffer of device memory's
> > dirty data chunk by chunk from device memory region by writing pos &
> > action (GET_BUFFER) to device_memory.pos & device_memory.action fields of
> > control region. (size of each chunk is the size of device memory data
> > region).
> 
> What if there's not enough dirty data to fill the region?  The data is
> always padded to fill the region?
>
I think the dirty data in the vendor driver is organized in a format like:
(addr0, len0, data0, addr1, len1, data1, addr2, len2, data2, ... addrN,
lenN, dataN).
For a full snapshot, it's like (addr0, len0, data0).
So, to userspace and the data region, it doesn't matter whether it's a
full snapshot or dirty data.


> > .save_live_pending and .save_live_iteration may be called several times in
> > precopy phase to get dirty data in device memory.
> > 
> > If VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY is off, callbacks in precopy phase
> > like .save_setup, .save_live_pending, .save_live_iteration will not call
> > vendor driver's device state interface to get data from device memory.
> 
> Therefore through the entire precopy phase we have no data from source
> to target to begin a compatibility check :-\  I think both proposals
> currently still lack any sort of device compatibility or data
> versioning check between source and target.  Thanks,
I checked the compatibility, though not good enough:)

in migration_init, vfio_check_devstate_version() checked version from
kernel with VFIO_DEVICE_STATE_INTERFACE_VERSION in both source and target,
and in target side, vfio_load_state() checked source side version.


int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
{
    ...
    if (version_id != VFIO_DEVICE_STATE_INTERFACE_VERSION) {
        return -EINVAL;
    }
    ...
}

Thanks
Yan

> Alex
> 
> > In precopy phase, if a device has VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY on,
> > .log_sync callback will get system memory dirty bitmap from dirty bitmap
> > region by writing system memory's start address, page count and action 
> > (GET_BITMAP) to "system_memory.start_addr", "system_memory.page_nr", and
> > "system_memory.action" fields of control region.
> > If page count passed in .log_sync callback is larger than the bitmap size
> > the dirty bitmap region supports, Qemu will cut it into chunks and call
> > vendor driver's get system memory dirty bitmap interface.
> > If VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY is off, .log_sync callback just
> > returns without call to vendor driver.
> > 
> > In stop-and-copy phase, device state will be set to STOP & LOGGING first.
> > in save_live_complete_precopy callback,
> > If VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY is on,
> > get device memory size and get device memory buffer will be called again.
> > After that,
> > device config data is read from the device config region by reading
> > device_config.size of control region and writing action (GET_BUFFER) to
> > device_config.action of control region.
> > Then after migration completes, in cleanup handler, LOGGING state will be
> > cleared (i.e. device state is set to STOP).
> > Clearing LOGGING state in cleanup handler is in consideration of the case
> > of "migration failed" and "migration cancelled". They can also leverage
> > the cleanup handler to unset LOGGING state.
> > 
> > 
> > References
> > ----------
> > 1. kernel side implementation of Device state interfaces:
> > https://patchwork.freedesktop.org/series/56876/
> > 
> > 
> > Yan Zhao (5):
> >   vfio/migration: define kernel interfaces
> >   vfio/migration: support device of device config capability
> >   vfio/migration: tracking of dirty page in system memory
> >   vfio/migration: turn on migration
> >   vfio/migration: support device memory capability
> > 
> >  hw/vfio/Makefile.objs         |   2 +-
> >  hw/vfio/common.c              |  26 ++
> >  hw/vfio/migration.c           | 858 ++++++++++++++++++++++++++++++++++++++++++
> >  hw/vfio/pci.c                 |  10 +-
> >  hw/vfio/pci.h                 |  26 +-
> >  include/hw/vfio/vfio-common.h |   1 +
> >  linux-headers/linux/vfio.h    | 260 +++++++++++++
> >  7 files changed, 1174 insertions(+), 9 deletions(-)
> >  create mode 100644 hw/vfio/migration.c
> > 
>
Yan Zhao March 6, 2019, 12:22 a.m. UTC | #31
hi Alex
we still have some open questions below. Could you kindly help review them? :)

Thanks
Yan

On Mon, Feb 25, 2019 at 10:22:56AM +0800, Zhao Yan wrote:
> On Thu, Feb 21, 2019 at 01:40:51PM -0700, Alex Williamson wrote:
> > Hi Yan,
> > 
> > Thanks for working on this!
> > 
> > On Tue, 19 Feb 2019 16:50:54 +0800
> > Yan Zhao <yan.y.zhao@intel.com> wrote:
> > 
> > > This patchset enables VFIO devices to have live migration capability.
> > > Currently it does not support post-copy phase.
> > > 
> > > It follows Alex's comments on last version of VFIO live migration patches,
> > > including device states, VFIO device state region layout, dirty bitmap's
> > > query.
> > > 
> > > Device Data
> > > -----------
> > > Device data is divided into three types: device memory, device config,
> > > and system memory dirty pages produced by device.
> > > 
> > > Device config: data like MMIOs, page tables...
> > >         Every device is supposed to possess device config data.
> > >     	Usually device config's size is small (no bigger than 10M), and it
> > 
> > I'm not sure how we can really impose a limit here, it is what it is
> > for a device.  A smaller state is obviously desirable to reduce
> > downtime, but some devices could have very large states.
> > 
> > >         needs to be loaded in certain strict order.
> > >         Therefore, device config only needs to be saved/loaded in
> > >         stop-and-copy phase.
> > >         The data of device config is held in device config region.
> > >         Size of device config data is smaller than or equal to that of
> > >         device config region.
> > 
> > So the intention here is that this is the last data read from the
> > device and it's done in one pass, so the region needs to be large
> > enough to expose all config data at once.  On restore it's the last
> > data written before switching the device to the run state.
> > 
> > > 
> > > Device Memory: device's internal memory, standalone and outside system
> > 
> > s/system/VM/
> > 
> > >         memory. It is usually very big.
> > 
> > Or it doesn't exist.  Not sure we should be setting expectations since
> > it will vary per device.
> > 
> > >         This kind of data needs to be saved / loaded in pre-copy and
> > >         stop-and-copy phase.
> > >         The data of device memory is held in device memory region.
> > >         Size of device memory is usually larger than that of device
> > >         memory region. qemu needs to save/load it in chunks of size of
> > >         device memory region.
> > >         Not all devices have device memory. IGD, for example, only uses system memory.
> > 
> > It seems a little gratuitous to me that this is a separate region or
> > that this data is handled separately.  All of this data is opaque to
> > QEMU, so why do we need to separate it?
> hi Alex,
> as the device state interfaces are provided by the kernel, they are expected
> to meet needs as general as possible. So, do you think there are use cases
> where user space knows the device well and wants the kernel to return
> specific data back to it?
> E.g. it just wants to get the whole device config data, including all MMIOs,
> page tables, PCI config data...
> Or, it just wants to get a current device memory snapshot, not including any
> dirty data.
> Or, it just needs the dirty pages in device memory or system memory.
> With all these accurate queries, quite a lot of useful features can be
> developed in user space.
> 
> If all of this data is opaque to the user app, it seems the only use case
> is live migration.
> 
> From another aspect, if the final solution is to keep the data opaque to
> user space, like what NVIDIA did, the kernel side's implementation will be
> more complicated, and actually a bit challenging for the vendor driver.
> In that case, in the pre-copy phase:
> 1. in a non-LOGGING state, the vendor driver first returns full data,
> including a full device memory snapshot
> 2. user space reads some data (you can't expect it to finish reading all
> data)
> 3. then userspace sets the device state to LOGGING to start dirty data
> logging
> 4. the vendor driver starts dirty data logging, appends the dirty data to
> the tail of the remaining unread full data, and increases the pending data
> size?
> 5. user space keeps reading data.
> 6. the vendor driver keeps appending new dirty data to the tail of the
> remaining unread full/dirty data and increasing the pending data size?
> 
> In the stop-and-copy phase:
> 1. user space sets the device state to exit the LOGGING state,
> 2. the vendor driver stops data logging. It has to append the device config
>    data at the tail of the remaining dirty data unread by userspace.
> 
> During this flow, when should the vendor driver collect dirty data? Just
> keep logging and appending to the tail? How to ensure the dirty data is
> fresh before the LOGGING state exits? How does the vendor driver know
> whether certain dirty data has been copied or not?
> 
> I've no idea how NVIDIA handles this problem, and they don't open their
> kernel side code.
> just feel it's a bit hard for other vendor drivers to follow :)
> 
> > > System memory dirty pages: If a device produces dirty pages in system
> > >         memory, it is able to get dirty bitmap for certain range of system
> > >         memory. This dirty bitmap is queried in pre-copy and stop-and-copy
> > >         phase in .log_sync callback. By setting dirty bitmap in .log_sync
> > >         callback, dirty pages in system memory will be save/loaded by ram's
> > >         live migration code.
> > >         The dirty bitmap of system memory is held in dirty bitmap region.
> > >         If system memory range is larger than that dirty bitmap region can
> > >         hold, qemu will cut it into several chunks and get dirty bitmap in
> > >         succession.
> > > 
> > > 
> > > Device State Regions
> > > --------------------
> > > Vendor driver is required to expose two mandatory regions and another two
> > > optional regions if it plans to support device state management.
> > > 
> > > So, there are up to four regions in total.
> > > One control region: mandatory.
> > >         Get access via read/write system call.
> > >         Its layout is defined in struct vfio_device_state_ctl
> > > Three data regions: mmaped into qemu.
> > 
> > Is mmap mandatory?  I would think this would be defined by the mdev
> > device what access they want to support per region.  We don't want to
> > impose a more complicated interface if the device doesn't require it.
> I think it's "mmap is preferred, but allowed to fail".
> Just like a normal region with the MMAP flag on (like BAR regions), we also
> allow its mmap to fail, right?
> 
> > >         device config region: mandatory, holding data of device config
> > >         device memory region: optional, holding data of device memory
> > >         dirty bitmap region: optional, holding bitmap of system memory
> > >                             dirty pages
> > > 
> > > (The reason why four seperate regions are defined is that the unit of mmap
> > > system call is PAGE_SIZE, i.e. 4k bytes. So one read/write region for
> > > control and three mmaped regions for data seems better than one big region
> > > padded and sparse mmaped).
> > 
> > It's not obvious to me how this is better, a big region isn't padded,
> > there's simply a gap in the file descriptor.  Is having a sub-PAGE_SIZE
> > gap in a file really of any consequence?  Each region beyond the header
> > is more than likely larger than PAGE_SIZE, therefore they can be nicely
> > aligned together.  We still need fields to tell us how much data is
> > available in each area, so another to tell us the start of each area is
> > a minor detail.  And I think we still want to allow drivers to specify
> > which parts of which areas support mmap, so I don't think we're getting
> > away from sparse mmap support.
> 
> With separate regions and sub-region types defined, user space can
> explicitly tell which region is which after vfio_get_dev_region_info().
> Along with that, user space knows each region's offset and size. mmap is
> allowed to fail and falls back to normal read/write to the region.
> 
> But with one big region and sparse mmapped subregions (1 data subregion or
> 3 data subregions, whatever), userspace can't tell which subregion is
> which.
> So, if using one big region, I think we need to explicitly define the
> subregions' sequence (like index 0 is dedicated to the control subregion,
> index 1 is for the device config data subregion ...). The vendor driver
> cannot freely change the sequence.
> Then keep the data offset the same as region->mmaps[i].offset, and the data
> size the same as region->mmaps[i].size (i.e. let the actual data start
> immediately from the first byte of its data subregion).
> Also, mmaps for sparse mmapped subregions are not allowed to fail.
> 
> 
> With one big region, we also need to consider the case when the vendor
> driver does not want the data subregion to be mmaped.
> So, what is the data layout for that case?
> Does the data subregion immediately follow the control subregion, or not?
> Of course, for this condition, we can specify the data field's start offset
> and size through the control region. And we must not expect the data start
> offsets in source and target to be equal.
> (Because the big region's fd_offset may vary between source and target:
> consider the case when both source and target have one opregion and one
> device state region, but the source has the opregion first while the target
> has the device state region first.
> If we think this case is illegal, we must be able to detect it in the first
> place.)
> Also, we must keep the start offset and size consistent with the above mmap
> case.
> 
> 
> > > kernel device state interface [1]
> > > --------------------------------------
> > > #define VFIO_DEVICE_STATE_INTERFACE_VERSION 1
> > > #define VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY 1
> > > #define VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY 2
> > 
> > If we were to go with this multi-region solution, isn't it evident from
> > the regions exposed that device memory and a dirty bitmap are
> > provided?  Alternatively, I believe Kirti's proposal doesn't require
> 
> > this distinction between device memory and device config, a device not
> > requiring runtime migration data would simply report no data until the
> > device moved to the stopped state, making it consistent path for
> > userspace.  Likewise, the dirty bitmap could report a zero page count
> > in the bitmap rather than branching based on device support.
> If the path in userspace is consistent for device config and device
> memory, there will be many unnecessary calls into the vendor driver just
> to get the data size.
>  
> 
> > Note that it seems that cap VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY implies
> > VFIO_REGION_SUBTYPE_DEVICE_STATE_DATA_DIRTYBITMAP, but there's no
> > consistency in the naming.
> > 
> > > #define VFIO_DEVICE_STATE_RUNNING 0 
> > > #define VFIO_DEVICE_STATE_STOP 1
> > > #define VFIO_DEVICE_STATE_LOGGING 2
> > 
> > It looks like these are being defined as bits, since patch 1 talks
> > about RUNNING & LOGGING and STOP & LOGGING.  I think Connie already
> > posted some comments about this.  I'm not sure anything prevents us
> > from defining RUNNING a 1 and STOPPED as 0 so we don't have the
> > polarity flip vs LOGGING though.
> > 
> > The state "STOP & LOGGING" also feels like a strange "device state", if
> > the device is stopped, it's not logging any new state, so I think this
> > is more that the device state is STOP, but the LOGGING feature is
> > active.  Maybe we should consider these independent bits.  LOGGING is
> > active as we stop a device so that we can fetch the last dirtied pages,
> > but disabled as we load the state of the device into the target.
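A tiny sketch (purely illustrative) of the polarity problem raised above: with the values quoted earlier in this thread, RUNNING is 0 and therefore contributes no bit, so "RUNNING & LOGGING" is indistinguishable from a bare LOGGING value:

```python
# Values as posted in the cover letter quoted in this thread.
VFIO_DEVICE_STATE_RUNNING = 0
VFIO_DEVICE_STATE_STOP = 1
VFIO_DEVICE_STATE_LOGGING = 2

running_and_logging = VFIO_DEVICE_STATE_RUNNING | VFIO_DEVICE_STATE_LOGGING
stop_and_logging = VFIO_DEVICE_STATE_STOP | VFIO_DEVICE_STATE_LOGGING

# RUNNING contributes no bit, so it cannot be told apart from "unset":
assert running_and_logging == VFIO_DEVICE_STATE_LOGGING
assert stop_and_logging == 3

# One alternative (an assumption, not part of the posted series): give
# RUNNING its own bit so each state contributes information independently.
ALT_RUNNING = 1 << 0
ALT_LOGGING = 1 << 1
assert (ALT_RUNNING | ALT_LOGGING) != ALT_LOGGING
```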
> > 
> > > #define VFIO_DEVICE_DATA_ACTION_GET_BUFFER 1
> > > #define VFIO_DEVICE_DATA_ACTION_SET_BUFFER 2
> > > #define VFIO_DEVICE_DATA_ACTION_GET_BITMAP 3
> > > 
> > > struct vfio_device_state_ctl {
> > > 	__u32 version;		  /* ro */
> > > 	__u32 device_state;       /* VFIO device state, wo */
> > > 	__u32 caps;		 /* ro */
> > >         struct {
> > > 		__u32 action;  /* wo, GET_BUFFER or SET_BUFFER */ 
> > > 		__u64 size;    /*rw*/
> > > 	} device_config;
> > 
> > Patch 1 indicates that to get the config buffer we write GET_BUFFER to
> > action and read from the config region.  The size is previously read
> > and apparently constant.  To set the config buffer, the config region
> > is written followed by writing SET_BUFFER to action.  Why is size
> > listed as read-write?
> this is the size of config data.
> size of config data <= size of config data region.
> 
> 
> > Doesn't this protocol also require that the mdev driver consume each
> > full region's worth of host kernel memory for backing pages in
> > anticipation of a rare event like migration?  This might be a strike
> > against separate regions if the driver needs to provide backing pages
> > for 3 separate regions vs 1.  To avoid this runtime overhead, would it
> > be expected that the user only mmap the regions during migration and
> > the mdev driver allocate backing pages on mmap?  Should the mmap be
> > restricted to the LOGGING feature being enabled?  Maybe I'm mistaken in
> > how the mdev driver would back these mmap'd pages.
> >
> yes, 3 separate regions consume a little more memory than 1 region,
> but it's just a little overhead.
> In Intel's kernel implementation,
> the device config region's size is 9M, and the dirty bitmap region's size
> is 16k.
> if there is a device memory region, its size can be defined as 100M?
> so it's 109M vs 100M ?
> 
> > > 	struct {
> > > 		__u32 action;    /* wo, GET_BUFFER or SET_BUFFER */ 
> > > 		__u64 size;     /* rw */  
> > >                 __u64 pos; /*the offset in total buffer of device memory*/
> > 
> > Patch 1 outlines the protocol here that getting device memory begins
> > with writing the position field, followed by reading from the device
> > memory region.  Setting device memory begins with writing the data to
> > the device memory region, followed by writing the position field.  Why
> > does the user need to have visibility of data position?  This is opaque
> > data to the user, the device should manage how the chunks fit together.
> > 
> > How does the user know when they reach the end?
> sorry, maybe I didn't explain clearly here.
> 
> device  ________________________________________
> memory  |    |    |////|    |    |    |    |    |
> data:   |____|____|////|____|____|____|____|____|
>                   :pos :
>                   :    :
> device            :____:
> memory            |    |
> region:           |____|
> 
> the whole sequence is like this:
> 
> 1. user space reads device_memory.size
> 2. the driver gets the device memory's data (full snapshot or dirty data,
> depending on whether it's in the LOGGING state or not), and returns the
> total size of this data.
> 3. user space finishes reading device_memory.size (>= device memory
> region's size)
> 
> 4. user space starts a loop like
>   
>    while (pos < total_len) {
>         uint64_t len = region_devmem->size;
> 
>         if (pos + len >= total_len) {
>             len = total_len - pos;
>         }
>         if (vfio_save_data_device_memory_chunk(vdev, f, pos, len)) {
>             return -1;
>         }
> 	pos += len;
>     }
> 
>  vfio_save_data_device_memory_chunk reads each chunk from device memory
>  region by writing GET_BUFFER  to device_memory.action, and pos to
>  device_memory.pos.
> 
> 
> So each time, userspace reads the complete device memory data in one pass.
> 
> Specifying "pos" is just like an "lseek" before a "write".
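The chunking sequence above can be sketched end to end (a simulation, not real QEMU code: the `Device` class stands in for the vendor driver behind the control and data regions, and all names are assumptions):

```python
class Device:
    """Stand-in for a vendor driver exposing device memory through a
    fixed-size data region, addressed by a 'pos' field as described above."""
    def __init__(self, data: bytes, region_size: int):
        self.data = data              # snapshot/dirty stream to expose
        self.region_size = region_size
        self.size = len(data)         # read by userspace as device_memory.size

    def get_buffer(self, pos: int) -> bytes:
        # GET_BUFFER: fill the data region with the chunk starting at pos.
        return self.data[pos:pos + self.region_size]

def save_device_memory(dev: Device) -> bytes:
    """Mirror of the while-loop quoted above: read size, then chunk by chunk."""
    out = bytearray()
    total_len = dev.size
    pos = 0
    while pos < total_len:
        length = min(dev.region_size, total_len - pos)
        chunk = dev.get_buffer(pos)   # write pos ("lseek"), then read chunk
        assert len(chunk) == length
        out += chunk
        pos += length
    return bytes(out)

dev = Device(b"x" * 1000, region_size=256)   # 1000 bytes via a 256-byte region
assert save_device_memory(dev) == dev.data   # chunks of 256+256+256+232
```

The last chunk is simply shorter than the region, which matches the point above that the data need not be padded to the region size.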
> 
> > Bullets 8 and 9 in patch 1 also discuss setting and getting the device
> > memory size, but these aren't well integrated into the protocol for
> > getting and setting the memory buffer.  Is getting the device memory
> > really started by reading the size, which triggers the vendor driver to
> > snapshot the state in an internal buffer which the user then iterates
> > through using GET_BUFFER?  Therefore re-reading the size field could
> > corrupt the data stream?  Wouldn't it be easier to report bytes
> > available and countdown as the user marks them read?  What does
> > position mean when we switch from snapshot to dirty data?
> When switching to device memory's dirty data, pos means the position within
> the whole dirty data.
> 
> .save_live_pending ==> the driver gets the dirty data in device memory and
> returns its total size.
> 
> .save_live_iterate ==> userspace reads all dirty data from the device
> memory region chunk by chunk.
> 
> So, in one iteration, all dirty data is saved;
> then in the next iteration, the dirty data is recalculated.
> 
> 
> > 
> > > 	} device_memory;
> > > 	struct {
> > > 		__u64 start_addr; /* wo */
> > > 		__u64 page_nr;   /* wo */
> > > 	} system_memory;
> > > };
> > 
> > Why is one specified as an address and the other as pages?  Note that
> Yes, start_addr ==> start pfn is better
> 
> > Kirti's implementation has an optimization to know how many pages are
> > set within a range to avoid unnecessary buffer reads.
> > 
> 
> Let's use start_pfn_all, page_nr_all to represent the start pfn and
> page_nr passed in from qemu .log_sync interface.
> 
> and use start_pfn_i, page_nr_i to the value passed to driver.
> 
> 
> start_pfn_all
>   |         start_pfn_i
>   |         |
>  \ /_______\_/_____________________________
>   |    |    |////|    |    |    |    |    |
>   |____|____|////|____|____|____|____|____|
>             :    :
>             :    :
>             :____:
> bitmap      |    |
> region:     |____|
>            
> 
> 1. Each time QEMU queries the dirty bitmap from the driver, it passes in
> start_pfn_i and page_nr_i (page_nr_i is the largest page count the
> bitmap region can hold).
> 2. the driver queries the memory range from start_pfn_i with size page_nr_i.
> 3. the driver returns a bitmap (if no dirty data, the bitmap is all 0).
> 4. QEMU saves the pages according to the bitmap
> 
> If there's no dirty data found in step 2, step 4 can be skipped.
> (I'll add this check before step 4 in the future, thanks)
> but if there's even 1 bit set in the bitmap, none of steps 1-4 can be
> skipped.
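The chunked query in steps 1-4 can be sketched as follows (a simulation; `query` stands in for the driver's GET_BITMAP path and all names are assumptions): QEMU splits [start_pfn_all, start_pfn_all + page_nr_all) into windows no larger than the bitmap region's capacity and queries each window:

```python
def sync_dirty_bitmap(start_pfn_all, page_nr_all, region_capacity, query):
    """Query dirty bits in windows of at most region_capacity pages.

    query(start_pfn, page_nr) stands in for the driver's GET_BITMAP and
    returns one 0/1 entry per page. Returns the set of dirty pfns.
    """
    dirty = set()
    start_pfn_i = start_pfn_all
    end = start_pfn_all + page_nr_all
    while start_pfn_i < end:
        page_nr_i = min(region_capacity, end - start_pfn_i)
        bitmap = query(start_pfn_i, page_nr_i)
        dirty.update(start_pfn_i + i for i, bit in enumerate(bitmap) if bit)
        start_pfn_i += page_nr_i
    return dirty

# Fake driver: pages 5 and 130 are dirty.
dirty_pages = {5, 130}
def fake_query(start, nr):
    return [1 if start + i in dirty_pages else 0 for i in range(nr)]

assert sync_dirty_bitmap(0, 200, region_capacity=64, query=fake_query) == {5, 130}
```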
> 
> Honestly, after reviewing Kirti's implementation, I don't think it's an
> optimization. As in the pseudo code below for Kirti's code, I would think
> copied_pfns corresponds to the page_nr_i in my case. So is the case of
> copied_pfns equaling 0 meant for the tail chunk? I don't think it's
> working...
> 
> write start_pfn to driver
> write page_size to driver
> write pfn_count to driver
> 
> do {
>     read copied_pfns from driver.
>     if (copied_pfns == 0) {
>         break;
>     }
>     bitmap_size = (BITS_TO_LONGS(copied_pfns) + 1) * sizeof(unsigned long);
>     buf = get bitmap from driver
>     cpu_physical_memory_set_dirty_lebitmap((unsigned long *)buf,
>                                            (start_pfn + count) * page_size,
>                                            copied_pfns);
>     count += copied_pfns;
> } while (count < pfn_count);
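One reading of the loop above, simulated (assumptions throughout, including the driver's behavior): if the driver reports how many pfns it actually copied per pass, the loop terminates via `count < pfn_count`, and the `copied_pfns == 0` branch only fires when the driver has nothing at all to report, not specifically on a tail chunk:

```python
def drain_bitmap(pfn_count, read_copied_pfns, get_bitmap):
    """Simulation of the quoted do/while: read copied_pfns, fetch that many
    bits, and accumulate until pfn_count pages have been covered."""
    count = 0
    dirty = []
    while True:
        copied_pfns = read_copied_pfns(count)
        if copied_pfns == 0:
            break                      # driver reported nothing at all
        bitmap = get_bitmap(count, copied_pfns)
        dirty.extend(count + i for i, bit in enumerate(bitmap) if bit)
        count += copied_pfns
        if count >= pfn_count:
            break                      # normal termination path
    return count, dirty

# Driver stand-in: reports min(remaining, chunk) pfns per pass.
PFN_COUNT, CHUNK = 10, 4
def read_copied(count):
    return min(CHUNK, PFN_COUNT - count)
def get_bm(count, nr):
    return [1 if (count + i) == 9 else 0 for i in range(nr)]

count, dirty = drain_bitmap(PFN_COUNT, read_copied, get_bm)
assert count == 10 and dirty == [9]    # passes of 4 + 4 + 2 pfns
```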
> 
> 
> 
> > > 
> > > Device States
> > > -------------
> > > After migration is initialized, it will set device state via writing to
> > > device_state field of control region.
> > > 
> > > Four states are defined for a VFIO device:
> > >         RUNNING, RUNNING & LOGGING, STOP & LOGGING, STOP 
> > > 
> > > RUNNING: In this state, a VFIO device is in active state ready to receive
> > >         commands from device driver.
> > >         It is the default state that a VFIO device enters initially.
> > > 
> > > STOP:  In this state, a VFIO device is deactivated to interact with
> > >        device driver.
> > > 
> > > LOGGING: a special state that it CANNOT exist independently. It must be
> > >        set alongside with state RUNNING or STOP (i.e. RUNNING & LOGGING,
> > >        STOP & LOGGING).
> > >        Qemu will set LOGGING state on in .save_setup callbacks, then vendor
> > >        driver can start dirty data logging for device memory and system
> > >        memory.
> > >        LOGGING only impacts device/system memory. They return whole
> > >        snapshot outside LOGGING and dirty data since last get operation
> > >        inside LOGGING.
> > >        Device config should be always accessible and return whole config
> > >        snapshot regardless of LOGGING state.
> > >        
> > > Note:
> > > The reason why RUNNING is the default state is that device's active state
> > > must not depend on device state interface.
> > > It is possible that region vfio_device_state_ctl fails to get registered.
> > > In that condition, a device needs be in active state by default. 
> > > 
> > > Get Version & Get Caps
> > > ----------------------
> > > On migration init phase, qemu will probe the existence of device state
> > > regions of vendor driver, then get version of the device state interface
> > > from the r/w control region.
> > > 
> > > Then it will probe VFIO device's data capability by reading caps field of
> > > control region.
> > >         #define VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY 1
> > >         #define VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY 2
> > > If VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY is on, it will save/load data of
> > >         device memory in pre-copy and stop-and-copy phase. The data of
> > >         device memory is held in device memory region.
> > > If VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY is on, it will query dirty pages
> > >         produced by VFIO device during pre-copy and stop-and-copy phase.
> > >         The dirty bitmap of system memory is held in dirty bitmap region.
> > 
> > As above, these capabilities seem redundant to the existence of the
> > device specific regions in this implementation.
> >
> seems so :)
> 
> > > If failing to find two mandatory regions and optional data regions
> > > corresponding to data caps or version mismatching, it will setup a
> > > migration blocker and disable live migration for VFIO device.
> > > 
> > > 
> > > Flows to call device state interface for VFIO live migration
> > > ------------------------------------------------------------
> > > 
> > > Live migration save path:
> > > 
> > > (QEMU LIVE MIGRATION STATE --> DEVICE STATE INTERFACE --> DEVICE STATE)
> > > 
> > > MIGRATION_STATUS_NONE --> VFIO_DEVICE_STATE_RUNNING
> > >  |
> > > MIGRATION_STATUS_SAVE_SETUP
> > >  |
> > >  .save_setup callback -->
> > >  get device memory size (whole snapshot size)
> > >  get device memory buffer (whole snapshot data)
> > >  set device state --> VFIO_DEVICE_STATE_RUNNING & VFIO_DEVICE_STATE_LOGGING
> > >  |
> > > MIGRATION_STATUS_ACTIVE
> > >  |
> > >  .save_live_pending callback --> get device memory size (dirty data)
> > >  .save_live_iteration callback --> get device memory buffer (dirty data)
> > >  .log_sync callback --> get system memory dirty bitmap
> > >  |
> > > (vcpu stops) --> set device state -->
> > >  VFIO_DEVICE_STATE_STOP & VFIO_DEVICE_STATE_LOGGING
> > >  |
> > > .save_live_complete_precopy callback -->
> > >  get device memory size (dirty data)
> > >  get device memory buffer (dirty data)
> > >  get device config size (whole snapshot size)
> > >  get device config buffer (whole snapshot data)
> > >  |
> > > .save_cleanup callback -->  set device state --> VFIO_DEVICE_STATE_STOP
> > > MIGRATION_STATUS_COMPLETED
> > > 
> > > MIGRATION_STATUS_CANCELLED or
> > > MIGRATION_STATUS_FAILED
> > >  |
> > > (vcpu starts) --> set device state --> VFIO_DEVICE_STATE_RUNNING
> > > 
> > > 
> > > Live migration load path:
> > > 
> > > (QEMU LIVE MIGRATION STATE --> DEVICE STATE INTERFACE --> DEVICE STATE)
> > > 
> > > MIGRATION_STATUS_NONE --> VFIO_DEVICE_STATE_RUNNING
> > >  |
> > > (vcpu stops) --> set device state --> VFIO_DEVICE_STATE_STOP
> > >  |
> > > MIGRATION_STATUS_ACTIVE
> > >  |
> > > .load state callback -->
> > >  set device memory size, set device memory buffer, set device config size,
> > >  set device config buffer
> > >  |
> > > (vcpu starts) --> set device state --> VFIO_DEVICE_STATE_RUNNING
> > >  |
> > > MIGRATION_STATUS_COMPLETED
> > > 
> > > 
> > > 
> > > In source VM side,
> > > In precopy phase,
> > > if a device has VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY on,
> > > qemu will first get whole snapshot of device memory in .save_setup
> > > callback, and then it will get total size of dirty data in device memory in
> > > .save_live_pending callback by reading device_memory.size field of control
> > > region.
> > 
> > This requires iterative reads of device memory buffer but the protocol
> > is unclear (to me) how the user knows how to do this or interact with
> > the position field. 
> > 
> > > Then in .save_live_iteration callback, it will get buffer of device memory's
> > > dirty data chunk by chunk from device memory region by writing pos &
> > > action (GET_BUFFER) to device_memory.pos & device_memory.action fields of
> > > control region. (size of each chunk is the size of device memory data
> > > region).
> > 
> > What if there's not enough dirty data to fill the region?  The data is
> > always padded to fill the region?
> >
> I think dirty data in the vendor driver is organized in a format like:
> (addr0, len0, data0, addr1, len1, data1, addr2, len2, data2, ... addrN,
> lenN, dataN).
> For a full snapshot, it's just (addr0, len0, data0).
> So, to userspace and the data region, it doesn't matter whether it's a full
> snapshot or dirty data.
> 
> 
> > > .save_live_pending and .save_live_iteration may be called several times in
> > > precopy phase to get dirty data in device memory.
> > > 
> > > If VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY is off, callbacks in precopy phase
> > > like .save_setup, .save_live_pending, .save_live_iteration will not call
> > > vendor driver's device state interface to get data from device memory.
> > 
> > Therefore through the entire precopy phase we have no data from source
> > to target to begin a compatibility check :-\  I think both proposals
> > currently still lack any sort of device compatibility or data
> > versioning check between source and target.  Thanks,
> I did check compatibility, though not thoroughly enough :)
> 
> in migration_init, vfio_check_devstate_version() checks the version from the
> kernel against VFIO_DEVICE_STATE_INTERFACE_VERSION on both source and target,
> and on the target side, vfio_load_state() checks the source side's version.
> 
> 
> int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
> {
>     ...
>     if (version_id != VFIO_DEVICE_STATE_INTERFACE_VERSION) {
>         return -EINVAL;
>     }
>     ...
> }
> 
> Thanks
> Yan
> 
> > Alex
> > 
> > > In precopy phase, if a device has VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY on,
> > > .log_sync callback will get system memory dirty bitmap from dirty bitmap
> > > region by writing system memory's start address, page count and action 
> > > (GET_BITMAP) to "system_memory.start_addr", "system_memory.page_nr", and
> > > "system_memory.action" fields of control region.
> > > If page count passed in .log_sync callback is larger than the bitmap size
> > > the dirty bitmap region supports, Qemu will cut it into chunks and call
> > > vendor driver's get system memory dirty bitmap interface.
> > > If VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY is off, .log_sync callback just
> > > returns without call to vendor driver.
> > > 
> > > In stop-and-copy phase, device state will be set to STOP & LOGGING first.
> > > in save_live_complete_precopy callback,
> > > If VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY is on,
> > > get device memory size and get device memory buffer will be called again.
> > > After that,
> > > device config data is read from the device config region by reading
> > > device_config.size of control region and writing action (GET_BUFFER) to
> > > device_config.action of control region.
> > > Then after migration completes, in cleanup handler, LOGGING state will be
> > > cleared (i.e. device state is set to STOP).
> > > Clearing LOGGING state in cleanup handler is in consideration of the case
> > > of "migration failed" and "migration cancelled". They can also leverage
> > > the cleanup handler to unset LOGGING state.
> > > 
> > > 
> > > References
> > > ----------
> > > 1. kernel side implementation of Device state interfaces:
> > > https://patchwork.freedesktop.org/series/56876/
> > > 
> > > 
> > > Yan Zhao (5):
> > >   vfio/migration: define kernel interfaces
> > >   vfio/migration: support device of device config capability
> > >   vfio/migration: tracking of dirty page in system memory
> > >   vfio/migration: turn on migration
> > >   vfio/migration: support device memory capability
> > > 
> > >  hw/vfio/Makefile.objs         |   2 +-
> > >  hw/vfio/common.c              |  26 ++
> > >  hw/vfio/migration.c           | 858 ++++++++++++++++++++++++++++++++++++++++++
> > >  hw/vfio/pci.c                 |  10 +-
> > >  hw/vfio/pci.h                 |  26 +-
> > >  include/hw/vfio/vfio-common.h |   1 +
> > >  linux-headers/linux/vfio.h    | 260 +++++++++++++
> > >  7 files changed, 1174 insertions(+), 9 deletions(-)
> > >  create mode 100644 hw/vfio/migration.c
> > > 
> > 
> _______________________________________________
> intel-gvt-dev mailing list
> intel-gvt-dev@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gvt-dev
Alex Williamson March 7, 2019, 5:44 p.m. UTC | #32
Hi Yan,

Sorry for the delay, I've been on PTO...

On Sun, 24 Feb 2019 21:22:56 -0500
Zhao Yan <yan.y.zhao@intel.com> wrote:

> On Thu, Feb 21, 2019 at 01:40:51PM -0700, Alex Williamson wrote:
> > Hi Yan,
> > 
> > Thanks for working on this!
> > 
> > On Tue, 19 Feb 2019 16:50:54 +0800
> > Yan Zhao <yan.y.zhao@intel.com> wrote:
> >   
> > > This patchset enables VFIO devices to have live migration capability.
> > > Currently it does not support post-copy phase.
> > > 
> > > It follows Alex's comments on last version of VFIO live migration patches,
> > > including device states, VFIO device state region layout, dirty bitmap's
> > > query.
> > > 
> > > Device Data
> > > -----------
> > > Device data is divided into three types: device memory, device config,
> > > and system memory dirty pages produced by device.
> > > 
> > > Device config: data like MMIOs, page tables...
> > >         Every device is supposed to possess device config data.
> > >         Usually device config's size is small (no bigger than 10M), and it  
> > 
> > I'm not sure how we can really impose a limit here, it is what it is
> > for a device.  A smaller state is obviously desirable to reduce
> > downtime, but some devices could have very large states.
> >   
> > >         needs to be loaded in certain strict order.
> > >         Therefore, device config only needs to be saved/loaded in
> > >         stop-and-copy phase.
> > >         The data of device config is held in device config region.
> > >         Size of device config data is smaller than or equal to that of
> > >         device config region.  
> > 
> > So the intention here is that this is the last data read from the
> > device and it's done in one pass, so the region needs to be large
> > enough to expose all config data at once.  On restore it's the last
> > data written before switching the device to the run state.
> >   
> > > 
> > > Device Memory: device's internal memory, standalone and outside system  
> > 
> > s/system/VM/
> >   
> > >         memory. It is usually very big.  
> > 
> > Or it doesn't exist.  Not sure we should be setting expectations since
> > it will vary per device.
> >   
> > >         This kind of data needs to be saved / loaded in pre-copy and
> > >         stop-and-copy phase.
> > >         The data of device memory is held in device memory region.
> > >         Size of device memory is usually larger than that of the device
> > >         memory region. qemu needs to save/load it in chunks of size of
> > >         device memory region.
> > >         Not all devices have device memory. E.g. IGD only uses system memory.  
> > 
> > It seems a little gratuitous to me that this is a separate region or
> > that this data is handled separately.  All of this data is opaque to
> > QEMU, so why do we need to separate it?  
> hi Alex,
> as the device state interfaces are provided by the kernel, they are
> expected to meet needs as general as possible. So, do you think there are
> use cases where user space knows the device well and wants the kernel to
> return specific data back to it?
> E.g. it just wants to get the whole device config data including all mmios,
> page tables, pci config data...
> Or, it just wants to get a current device memory snapshot, not including any
> dirty data.
> Or, it just needs the dirty pages in device memory or system memory.
> With all these accurate queries, quite a lot of useful features can be
> developed in user space.
> 
> If all of this data is opaque to the user app, it seems the only use case is
> live migration.

I can certainly appreciate a more versatile interface, but I think
we're also trying to create the most simple interface we can, with the
primary target being live migration.  As soon as we start defining this
type of device memory and that type of device memory, we're going to
have another device come along that needs yet another because they have
a slightly different requirement.  Even without that, we're going to
have vendor drivers implement it differently, so what works for one
device for a more targeted approach may not work for all devices.  Can
you enumerate some specific examples of the use cases you imagine your
design to enable?

> From another aspect, if the final solution is to keep the data opaque to
> user space, like what NV did, the kernel side's implementation will be more
> complicated, and actually a little challenging for the vendor driver.
> In that case, in the pre-copy phase,
> 1. in the not-LOGGING state, the vendor driver first returns full data
> including a full device memory snapshot

When we're not LOGGING, does the vendor driver need to return
anything?  It seems that LOGGING could be considered an enable switch
for the interface.

> 2. user space reads some data (you can't expect it to finish reading all
> data)
> 3. then userspace sets the device state to LOGGING to start dirty data
> logging
> 4. vendor driver starts dirty data logging, and appends the dirty data to
> the tail of the remaining unread full data and increases the pending data
> size?

It seems a lot of overhead to expect the vendor driver to consider
state read by the user prior to LOGGING being enabled.  Does it log
those changes forever?  It seems like we should consider LOGGING
enabled to be a "session".

> 5. user space keeps reading data.
> 6. vendor driver keeps appending new dirty data to the tail of the remaining
> unread full data/dirty data and increases the pending data size?

Until the device is stopped it can always generate new pending data and
the size of that pending data needs to be considered volatile by the
user, right?  What's different here?  This all seems to factor into
when the user decides whether the migration is converging and whether
to transition to the stopped phase to force that convergence.

> in the stop-and-copy phase
> 1. user space sets the device state to exit the LOGGING state,
> 2. vendor driver stops data logging. it has to append the device config
>    data at the tail of the remaining dirty data unread by userspace.
> 
> during this flow, when should the vendor driver get dirty data? just keep
> logging and appending to the tail? how to ensure dirty data is fresh before
> exiting the LOGGING state? how does the vendor driver know whether certain
> dirty data has been copied or not?

At stop-and-copy, I'd assume LOGGING remains enabled, only adding STOP,
such that the device does not generate new data, but perhaps I've
forgotten the details on vacation.  As above, I'd think we'd want to
bound any sort of dirty state tracking to a session bounded by the
LOGGING state.  The protocol defined with userspace needs to account
for determining what the user has and has not read, for instance to
support mmap'd data, a trapped interface needs to be used to setup the
data and acknowledge a read of that data.
 
> I've no idea how NVidia handles this problem, and they don't open their
> kernel side code.
> I just feel it's a bit hard for other vendor drivers to follow:)

Their interface proposal is available on the list, I don't have access
to their proprietary driver either, but I expect the best ideas from
each proposal to be combined into a unified solution.

> > > System memory dirty pages: If a device produces dirty pages in system
> > >         memory, it is able to get dirty bitmap for certain range of system
> > >         memory. This dirty bitmap is queried in pre-copy and stop-and-copy
> > >         phase in .log_sync callback. By setting dirty bitmap in .log_sync
> > >         callback, dirty pages in system memory will be save/loaded by ram's
> > >         live migration code.
> > >         The dirty bitmap of system memory is held in dirty bitmap region.
> > >         If system memory range is larger than that dirty bitmap region can
> > >         hold, qemu will cut it into several chunks and get dirty bitmap in
> > >         succession.
> > > 
> > > 
> > > Device State Regions
> > > --------------------
> > > Vendor driver is required to expose two mandatory regions and another two
> > > optional regions if it plans to support device state management.
> > > 
> > > So, there are up to four regions in total.
> > > One control region: mandatory.
> > >         Get access via read/write system call.
> > >         Its layout is defined in struct vfio_device_state_ctl
> > > Three data regions: mmaped into qemu.  
> > 
> > Is mmap mandatory?  I would think this would be defined by the mdev
> > device what access they want to support per region.  We don't want to
> > impose a more complicated interface if the device doesn't require it.  
> I think it's "mmap is preferred, but allowed to fail".
> Just like a normal region with the MMAP flag on (like BAR regions), we also
> allow its mmap to fail, right?

Currently mmap support for any region is optional both from the vendor
driver and the user.  The vendor driver may or may not support mmap of
a region (or subset of region with sparse mmap) and the user may or may
not make use of mmap if it is available.  The question here was whether
this interface requires the vendor driver to support mmap of these
device specific regions.

> > >         device config region: mandatory, holding data of device config
> > >         device memory region: optional, holding data of device memory
> > >         dirty bitmap region: optional, holding bitmap of system memory
> > >                             dirty pages
> > > 
> > > (The reason why four separate regions are defined is that the unit of mmap
> > > system call is PAGE_SIZE, i.e. 4k bytes. So one read/write region for
> > > control and three mmaped regions for data seems better than one big region
> > > padded and sparse mmaped).  
> > 
> > It's not obvious to me how this is better, a big region isn't padded,
> > there's simply a gap in the file descriptor.  Is having a sub-PAGE_SIZE
> > gap in a file really of any consequence?  Each region beyond the header
> > is more than likely larger than PAGE_SIZE, therefore they can be nicely
> > aligned together.  We still need fields to tell us how much data is
> > available in each area, so another to tell us the start of each area is
> > a minor detail.  And I think we still want to allow drivers to specify
> > which parts of which areas support mmap, so I don't think we're getting
> > away from sparse mmap support.  
> 
> with separate regions and sub-region types defined, user space can
> explicitly know which region is which after vfio_get_dev_region_info().
> Along with it, user space knows the region offset and size. mmap is allowed
> to fail and falls back to normal read/write to the region.
> 
> But with one big region and sparse mmapped subregions (1 data subregion or
> 3 data subregions, whatever), userspace can't tell which subregion is
> which.

Of course they can, this is part of defining the header structure.  One
region could define a header including config_offset, config_size,
memory_offset, memory_size, dirty_offset, dirty_size.  Notice how Kirti
even uses the same area to support all of these (which leaves some
issues with vendor driver flexibility, but at least shows this is
possible).

> So, if using one big region, I think we need to explicitly define the
> subregions' sequence (like index 0 is dedicated to the control subregion,
> index 1 is for the device config data subregion ...). The vendor driver
> cannot freely change the sequence.
> Then keep the data offset the same as region->mmaps[i].offset, and the data
> size the same as region->mmaps[i].size. (i.e. let the actual data start
> immediately from the first byte of its data subregion)
> Also, mmaps for sparse mmapped subregions are not allowed to fail.

This doesn't make any sense to me, the vendor driver can define the
start and size of each area within the region with simple header
fields.  We don't need fixed sequence fields.  Likewise the sparse mmap
capability for the region can define which of those areas within the
region support mmap.  The mmap can be optional for both vendor driver
and user, just as it is elsewhere.  The header fields defining the
sub-areas can be read-only to the user, the sparse mmap only needs to
match what the vendor driver defines and supports.
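As a rough illustration of the single-region layout under discussion, a header at offset zero could carry the sub-area offsets and sizes. This is only a sketch; every field name here is hypothetical and taken from neither proposal's actual uapi:

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical header at offset 0 of a single migration region.  The
 * vendor driver fills in the offset/size pairs (read-only to the user),
 * and the sparse mmap capability independently marks which of these
 * areas, if any, support mmap. */
struct migration_region_header {
    uint32_t version;
    uint32_t device_state;    /* RUNNING / STOP / LOGGING bits */
    uint64_t config_offset;   /* device config area, within this region */
    uint64_t config_size;
    uint64_t memory_offset;   /* device memory area */
    uint64_t memory_size;
    uint64_t dirty_offset;    /* dirty bitmap area */
    uint64_t dirty_size;
};
```

A user that mmaps or reads the region first parses this header to locate each area, so no fixed sub-region sequence needs to be mandated.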
 
> With one big region, we also need to consider the case when the vendor
> driver does not want the data subregion to be mmaped.
> So, what is the data layout for that case?

Vendor driver defines data_offset and data_size, sparse mmap capability
does not list that area as mmap capable.

> does the data subregion immediately follow the control subregion, or not?

The header needs to begin at offset zero, the layout of the rest is
defined by the vendor driver within this header.  I believe this is
(mostly) implemented in Kirti's version.

> Of course, for this condition, we can specify the data field's start offset
> and size through the control region. And we must not expect the data start
> offsets in source and target to be equal.
> (because the big region's fd_offset
> may vary between source and target. consider the case when both source and
> target have one opregion and one device state region, but the source has
> the opregion first and the target has the device state region first.
> If we think this case is illegal, we must be able to detect it in the first
> place).
> Also, we must keep the start offset and size consistent with the above mmap
> case.

AFAICT, these are all non-issues.  Please look at Kirti's proposal.
The (one) migration region can define a header at offset zero that
allows the vendor driver to define where within that region the data,
config, and dirty bitmap areas are and the sparse mmap capability
defines which of those are mmap capable.  Clearly this migration region
offset within the vfio device file descriptor is independent between
source and target, as are the offsets of the sub-areas within the
migration region.  These all need to be defined as part of the
migration protocol where the source and target implement the same
protocol but have no requirement to be absolutely identical.

> > > kernel device state interface [1]
> > > --------------------------------------
> > > #define VFIO_DEVICE_STATE_INTERFACE_VERSION 1
> > > #define VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY 1
> > > #define VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY 2  
> > 
> > If we were to go with this multi-region solution, isn't it evident from
> > the regions exposed that device memory and a dirty bitmap are
> > provided?  Alternatively, I believe Kirti's proposal doesn't require  
> 
> > this distinction between device memory and device config, a device not
> > requiring runtime migration data would simply report no data until the
> > device moved to the stopped state, making it consistent path for
> > userspace.  Likewise, the dirty bitmap could report a zero page count
> > in the bitmap rather than branching based on device support.  
> If the path in userspace is consistent for device config and device
> memory, there will be many unnecessary calls into the vendor driver to
> get the data size.

Consistency seems like a good thing, it makes code more simple, we
don't behave differently in one case versus another.  If the vendor
reports no data, skip.  It also provides versatility.  Are the "many
unnecessary call[s]" quantifiable?

> > Note that it seems that cap VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY implies
> > VFIO_REGION_SUBTYPE_DEVICE_STATE_DATA_DIRTYBITMAP, but there's no
> > consistency in the naming.
> >   
> > > #define VFIO_DEVICE_STATE_RUNNING 0 
> > > #define VFIO_DEVICE_STATE_STOP 1
> > > #define VFIO_DEVICE_STATE_LOGGING 2  
> > 
> > It looks like these are being defined as bits, since patch 1 talks
> > about RUNNING & LOGGING and STOP & LOGGING.  I think Connie already
> > posted some comments about this.  I'm not sure anything prevents us
> > from defining RUNNING a 1 and STOPPED as 0 so we don't have the
> > polarity flip vs LOGGING though.
> > 
> > The state "STOP & LOGGING" also feels like a strange "device state", if
> > the device is stopped, it's not logging any new state, so I think this
> > is more that the device state is STOP, but the LOGGING feature is
> > active.  Maybe we should consider these independent bits.  LOGGING is
> > active as we stop a device so that we can fetch the last dirtied pages,
> > but disabled as we load the state of the device into the target.
> >   
> > > #define VFIO_DEVICE_DATA_ACTION_GET_BUFFER 1
> > > #define VFIO_DEVICE_DATA_ACTION_SET_BUFFER 2
> > > #define VFIO_DEVICE_DATA_ACTION_GET_BITMAP 3
> > > 
> > > struct vfio_device_state_ctl {
> > > 	__u32 version;		  /* ro */
> > > 	__u32 device_state;       /* VFIO device state, wo */
> > > 	__u32 caps;		 /* ro */
> > >         struct {
> > > 		__u32 action;  /* wo, GET_BUFFER or SET_BUFFER */ 
> > > 		__u64 size;    /*rw*/
> > > 	} device_config;  
> > 
> > Patch 1 indicates that to get the config buffer we write GET_BUFFER to
> > action and read from the config region.  The size is previously read
> > and apparently constant.  To set the config buffer, the config region
> > is written followed by writing SET_BUFFER to action.  Why is size
> > listed as read-write?  
> this is the size of config data.
> size of config data <= size of config data region.

Where in the usage protocol does the user WRITE the config data size?

> > Doesn't this protocol also require that the mdev driver consume each
> > full region's worth of host kernel memory for backing pages in
> > anticipation of a rare event like migration?  This might be a strike
> > against separate regions if the driver needs to provide backing pages
> > for 3 separate regions vs 1.  To avoid this runtime overhead, would it
> > be expected that the user only mmap the regions during migration and
> > the mdev driver allocate backing pages on mmap?  Should the mmap be
> > restricted to the LOGGING feature being enabled?  Maybe I'm mistaken in
> > how the mdev driver would back these mmap'd pages.
> >  
> yes, 3 separate regions consume a little more memory than 1 region.
> but it's just a little overhead.
> As in intel's kernel implementation,
> device config region's size is 9M, dirty bitmap region's size is 16k.
> if there is device memory region, its size can be defined as 100M?
> so it's 109M vs 100M ?

But what if it's 100M config with no device memory?  This proposal
requires 100M in-kernel backing due to the definition of the config
region when it could be implemented with significantly less by allowing
a small data area to be read multiple times until a bytes remaining
counter becomes zero.
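The "bytes remaining" alternative can be simulated in plain C: a small fixed data window is read repeatedly until a counter the vendor driver decrements reaches zero, so the in-kernel backing never exceeds the window size. All names and sizes below are illustrative stand-ins, not any proposed interface:

```c
#include <stdint.h>
#include <string.h>

#define WINDOW_SIZE 8           /* small fixed data area, reused per read */

static uint8_t config_data[20]; /* simulated device config, 20 bytes */
static uint64_t remaining;      /* "bytes remaining" counter, ro to user */

/* User signals the start of a config read; the driver latches the size. */
static void start_config_read(void)
{
    remaining = sizeof(config_data);
}

/* Each read hands out the next chunk and counts remaining down, so the
 * backing area only ever needs WINDOW_SIZE bytes, not the full state. */
static uint64_t read_chunk(uint8_t *dst)
{
    uint64_t len = remaining < WINDOW_SIZE ? remaining : WINDOW_SIZE;

    memcpy(dst, config_data + (sizeof(config_data) - remaining), len);
    remaining -= len;
    return len;
}
```

The user loops on `read_chunk()` until `remaining` is zero; the driver, not the user, tracks the position.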

> > > 	struct {
> > > 		__u32 action;    /* wo, GET_BUFFER or SET_BUFFER */ 
> > > 		__u64 size;     /* rw */  
> > >                 __u64 pos; /*the offset in total buffer of device memory*/  
> > 
> > Patch 1 outlines the protocol here that getting device memory begins
> > with writing the position field, followed by reading from the device
> > memory region.  Setting device memory begins with writing the data to
> > the device memory region, followed by writing the position field.  Why
> > does the user need to have visibility of data position?  This is opaque
> > data to the user, the device should manage how the chunks fit together.
> > 
> > How does the user know when they reach the end?  
> sorry, maybe I didn't explain clearly here.
> 
> device  ________________________________________
> memory  |    |    |////|    |    |    |    |    |
> data:   |____|____|////|____|____|____|____|____|
>                   :pos :
>                   :    :
> device            :____:
> memory            |    |
> region:           |____|
> 
> the whole sequence is like this:
> 
> 1. user space reads device_memory.size
> 2. driver gets device memory's data (full snapshot or dirty data, depending
> on whether it's in the LOGGING state or not), and returns the total size of
> this data.
> 3. user space finishes reading device_memory.size (>= device memory
> region's size)
> 
> 4. user space starts a loop like
>   
>    while (pos < total_len) {
>         uint64_t len = region_devmem->size;
> 
>         if (pos + len >= total_len) {
>             len = total_len - pos;
>         }
>         if (vfio_save_data_device_memory_chunk(vdev, f, pos, len)) {
>             return -1;
>         }
> 	pos += len;
>     }
> 
>  vfio_save_data_device_memory_chunk reads each chunk from the device memory
>  region by writing GET_BUFFER to device_memory.action, and pos to
>  device_memory.pos.
> 
> So, each time, userspace will finish reading the device memory data in one
> pass.
> 
> specifying "pos" is just like the "lseek" before "write".
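The chunked-read loop above can be simulated end to end in plain C. REGION_SIZE and the helper names are illustrative stand-ins for the real QEMU/driver pieces, not actual code from the series:

```c
#include <stdint.h>
#include <string.h>

#define REGION_SIZE 16           /* stand-in for device memory region size */

static uint8_t dev_mem[64];      /* simulated device memory contents */

/* Stand-in for: write pos plus GET_BUFFER to device_memory.pos /
 * device_memory.action, then read one chunk from the region. */
static void get_buffer(uint64_t pos, uint64_t len, uint8_t *dst)
{
    memcpy(dst, dev_mem + pos, len);
}

/* The loop from the mail: fetch total_len bytes in REGION_SIZE chunks,
 * clamping the tail chunk.  Returns the number of chunks issued. */
static int save_device_memory(uint64_t total_len, uint8_t *out)
{
    uint64_t pos = 0;
    int chunks = 0;

    while (pos < total_len) {
        uint64_t len = REGION_SIZE;

        if (pos + len > total_len)
            len = total_len - pos;
        get_buffer(pos, len, out + pos);
        pos += len;
        chunks++;
    }
    return chunks;
}
```

This is the user-managed-position variant; Alex's reply below argues for letting the driver track position via a remaining-bytes counter instead.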

This could also be implemented as a remaining bytes counter in the
interface where the vendor driver wouldn't rely on the user to manage
the position.  What internal consistency checking is going to protect
the host kernel when the user writes data to the wrong position?  If we
consider the data to be opaque, the vendor driver can embed that sort
of meta data into the data blob the user reads and reassemble it
correctly or generate a consistency failure itself.

> > Bullets 8 and 9 in patch 1 also discuss setting and getting the device
> > memory size, but these aren't well integrated into the protocol for
> > getting and setting the memory buffer.  Is getting the device memory
> > really started by reading the size, which triggers the vendor driver to
> > snapshot the state in an internal buffer which the user then iterates
> > through using GET_BUFFER?  Therefore re-reading the size field could
> > corrupt the data stream?  Wouldn't it be easier to report bytes
> > available and countdown as the user marks them read?  What does
> > position mean when we switch from snapshot to dirty data?  
> when switching to device memory's dirty data, pos means the position in the
> whole dirty data.
> 
> .save_live_pending ==> driver gets dirty data in device memory and returns
> total size.
> 
> .save_live_iterate ==> userspace reads all dirty data from device memory
> region chunk by chunk
> 
> So, in one iteration, all dirty data is saved.
> Then in the next iteration, dirty data is recalculated.
> 
> > > 	} device_memory;
> > > 	struct {
> > > 		__u64 start_addr; /* wo */
> > > 		__u64 page_nr;   /* wo */
> > > 	} system_memory;
> > > };  
> > 
> > Why is one specified as an address and the other as pages?  Note that  
> Yes, start_addr ==> start pfn is better
> 
> > Kirti's implementation has an optimization to know how many pages are
> > set within a range to avoid unnecessary buffer reads.
> >   
> 
> Let's use start_pfn_all, page_nr_all to represent the start pfn and
> page_nr passed in from qemu's .log_sync interface,
> 
> and start_pfn_i, page_nr_i for the values passed to the driver.
> 
> 
> start_pfn_all
>   |         start_pfn_i
>   |         |
>  \ /_______\_/_____________________________
>   |    |    |////|    |    |    |    |    |
>   |____|____|////|____|____|____|____|____|
>             :    :
>             :    :
>             :____:
> bitmap      |    |
> region:     |____|
>            
> 
> 1. Each time QEMU queries the dirty bitmap from the driver, it passes in
> start_pfn_i and page_nr_i. (page_nr_i is the largest page count the
> bitmap region can hold).
> 2. driver queries the memory range from start_pfn_i with size page_nr_i.
> 3. driver returns a bitmap (if there's no dirty data, the bitmap is all 0).
> 4. QEMU saves the pages according to the bitmap
> 
> If there's no dirty data found in step 2, step 4 can be skipped.
> (I'll add this check before step 4 in the future, thanks)
> but if there's even 1 bit set in the bitmap, none of steps 1-4 can be
> skipped.
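The four-step bitmap query can be simulated in plain C. BITMAP_REGION_PFNS, DIRTY_PFN, and the helpers below are illustrative stand-ins, not the actual interface; the simulation marks exactly one page dirty so the chunked query has something to find:

```c
#include <stdint.h>
#include <string.h>

#define BITMAP_REGION_PFNS 64  /* pages one dirty-bitmap region read covers */
#define DIRTY_PFN 70           /* the single dirty page in this simulation */

/* Stand-in for the driver: fill a bitmap for [start_pfn, start_pfn + nr). */
static void driver_get_bitmap(uint64_t start_pfn, uint64_t nr,
                              uint8_t *bitmap)
{
    memset(bitmap, 0, (size_t)((nr + 7) / 8));
    if (DIRTY_PFN >= start_pfn && DIRTY_PFN < start_pfn + nr) {
        uint64_t bit = DIRTY_PFN - start_pfn;
        bitmap[bit / 8] |= (uint8_t)(1u << (bit % 8));
    }
}

/* Steps 1-4 from the mail: cut the range into bitmap-region-sized
 * chunks, query each, and count the dirty pages found. */
static int log_sync(uint64_t start_pfn_all, uint64_t page_nr_all)
{
    uint8_t bitmap[BITMAP_REGION_PFNS / 8];
    int dirty = 0;
    uint64_t off, i;

    for (off = 0; off < page_nr_all; off += BITMAP_REGION_PFNS) {
        uint64_t nr = page_nr_all - off;

        if (nr > BITMAP_REGION_PFNS)
            nr = BITMAP_REGION_PFNS;
        driver_get_bitmap(start_pfn_all + off, nr, bitmap);
        for (i = 0; i < nr; i++)
            if (bitmap[i / 8] & (1u << (i % 8)))
                dirty++;   /* QEMU would save this page here */
    }
    return dirty;
}
```

An all-zero bitmap lets QEMU skip step 4 for that chunk, which is the check promised above.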
> 
> Honestly, after reviewing Kirti's implementation, I don't think it's an
> optimization. As in the pseudo code below for Kirti's code, I would think
> copied_pfns corresponds to page_nr_i in my case. So, is the case of
> copied_pfns equaling 0 meant for the tail chunk? I don't think it's
> working..
> 
> write start_pfn to driver
> write page_size  to driver
> write pfn_count to driver
> 
> do {
>     read copied_pfns from driver.
>     if (copied_pfns == 0) {
>        break;
>     }
>    bitmap_size = (BITS_TO_LONGS(copied_pfns) + 1) * sizeof(unsigned long);
>    buf = get bitmap from driver
>    cpu_physical_memory_set_dirty_lebitmap((unsigned long *)buf,
>                                            (start_pfn + count) * page_size,
>                                                 copied_pfns);
> 
>      count +=  copied_pfns;
> 
> } while (count < pfn_count);

The intent of Kirti's copied_pfns is clearly to avoid unnecessarily
reading pages from the kernel when nothing has changed.  Perhaps the
implementation still requires work, but I don't see from above how
that's not considered an optimization.

> > > 
> > > Device States
> > > -------------
> > > After migration is initialized, it will set the device state via writing
> > > to the device_state field of the control region.
> > > 
> > > Four states are defined for a VFIO device:
> > >         RUNNING, RUNNING & LOGGING, STOP & LOGGING, STOP 
> > > 
> > > RUNNING: In this state, a VFIO device is in active state ready to receive
> > >         commands from device driver.
> > >         It is the default state that a VFIO device enters initially.
> > > 
> > > STOP:  In this state, a VFIO device is deactivated to interact with
> > >        device driver.
> > > 
> > > LOGGING: a special state that it CANNOT exist independently. It must be
> > >        set alongside with state RUNNING or STOP (i.e. RUNNING & LOGGING,
> > >        STOP & LOGGING).
> > >        Qemu will set LOGGING state on in .save_setup callbacks, then vendor
> > >        driver can start dirty data logging for device memory and system
> > >        memory.
> > >        LOGGING only impacts device/system memory. They return whole
> > >        snapshot outside LOGGING and dirty data since last get operation
> > >        inside LOGGING.
> > >        Device config should be always accessible and return whole config
> > >        snapshot regardless of LOGGING state.
> > >        
> > > Note:
> > > The reason why RUNNING is the default state is that device's active state
> > > must not depend on device state interface.
> > > It is possible that region vfio_device_state_ctl fails to get registered.
> > > In that condition, a device needs to be in the active state by default.
> > > 
> > > Get Version & Get Caps
> > > ----------------------
> > > On migration init phase, qemu will probe the existence of device state
> > > regions of vendor driver, then get version of the device state interface
> > > from the r/w control region.
> > > 
> > > Then it will probe VFIO device's data capability by reading caps field of
> > > control region.
> > >         #define VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY 1
> > >         #define VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY 2
> > > If VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY is on, it will save/load data of
> > >         device memory in pre-copy and stop-and-copy phase. The data of
> > >         device memory is held in device memory region.
> > > If VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY is on, it will query of dirty pages
> > >         produced by VFIO device during pre-copy and stop-and-copy phase.
> > >         The dirty bitmap of system memory is held in dirty bitmap region.  
> > 
> > As above, these capabilities seem redundant to the existence of the
> > device specific regions in this implementation.
> >  
> seems so :)
> 
> > > If failing to find two mandatory regions and optional data regions
> > > corresponding to data caps or version mismatching, it will setup a
> > > migration blocker and disable live migration for VFIO device.
> > > 
> > > 
> > > Flows to call device state interface for VFIO live migration
> > > ------------------------------------------------------------
> > > 
> > > Live migration save path:
> > > 
> > > (QEMU LIVE MIGRATION STATE --> DEVICE STATE INTERFACE --> DEVICE STATE)
> > > 
> > > MIGRATION_STATUS_NONE --> VFIO_DEVICE_STATE_RUNNING
> > >  |
> > > MIGRATION_STATUS_SAVE_SETUP
> > >  |
> > >  .save_setup callback -->
> > >  get device memory size (whole snapshot size)
> > >  get device memory buffer (whole snapshot data)
> > >  set device state --> VFIO_DEVICE_STATE_RUNNING & VFIO_DEVICE_STATE_LOGGING
> > >  |
> > > MIGRATION_STATUS_ACTIVE
> > >  |
> > >  .save_live_pending callback --> get device memory size (dirty data)
> > >  .save_live_iteration callback --> get device memory buffer (dirty data)
> > >  .log_sync callback --> get system memory dirty bitmap
> > >  |
> > > (vcpu stops) --> set device state -->
> > >  VFIO_DEVICE_STATE_STOP & VFIO_DEVICE_STATE_LOGGING
> > >  |
> > > .save_live_complete_precopy callback -->
> > >  get device memory size (dirty data)
> > >  get device memory buffer (dirty data)
> > >  get device config size (whole snapshot size)
> > >  get device config buffer (whole snapshot data)
> > >  |
> > > .save_cleanup callback -->  set device state --> VFIO_DEVICE_STATE_STOP
> > > MIGRATION_STATUS_COMPLETED
> > > 
> > > MIGRATION_STATUS_CANCELLED or
> > > MIGRATION_STATUS_FAILED
> > >  |
> > > (vcpu starts) --> set device state --> VFIO_DEVICE_STATE_RUNNING
> > > 
> > > 
> > > Live migration load path:
> > > 
> > > (QEMU LIVE MIGRATION STATE --> DEVICE STATE INTERFACE --> DEVICE STATE)
> > > 
> > > MIGRATION_STATUS_NONE --> VFIO_DEVICE_STATE_RUNNING
> > >  |
> > > (vcpu stops) --> set device state --> VFIO_DEVICE_STATE_STOP
> > >  |
> > > MIGRATION_STATUS_ACTIVE
> > >  |
> > > .load state callback -->
> > >  set device memory size, set device memory buffer, set device config size,
> > >  set device config buffer
> > >  |
> > > (vcpu starts) --> set device state --> VFIO_DEVICE_STATE_RUNNING
> > >  |
> > > MIGRATION_STATUS_COMPLETED
> > > 
> > > 
> > > 
> > > In source VM side,
> > > In precopy phase,
> > > if a device has VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY on,
> > > qemu will first get whole snapshot of device memory in .save_setup
> > > callback, and then it will get total size of dirty data in device memory in
> > > .save_live_pending callback by reading device_memory.size field of control
> > > region.  
> > 
> > This requires iterative reads of device memory buffer but the protocol
> > is unclear (to me) how the user knows how to do this or interact with
> > the position field. 
> >   
> > > Then in .save_live_iteration callback, it will get buffer of device memory's
> > > dirty data chunk by chunk from device memory region by writing pos &
> > > action (GET_BUFFER) to device_memory.pos & device_memory.action fields of
> > > control region. (size of each chunk is the size of device memory data
> > > region).  
> > 
> > What if there's not enough dirty data to fill the region?  The data is
> > always padded to fill the region?
> >  
> I think dirty data in the vendor driver is organized in a format like:
> (addr0, len0, data0, addr1, len1, data1, addr2, len2, data2,....addrN,
> lenN, dataN).
> for full snapshot, it's like (addr0, len0, data0).
> so, to userspace and data region, it doesn't matter whether it's full
> snapshot or dirty data.
> 
> 
> > > .save_live_pending and .save_live_iteration may be called several times in
> > > precopy phase to get dirty data in device memory.
> > > 
> > > If VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY is off, callbacks in precopy phase
> > > like .save_setup, .save_live_pending, .save_live_iteration will not call
> > > vendor driver's device state interface to get data from device memory.  
> > 
> > Therefore through the entire precopy phase we have no data from source
> > to target to begin a compatibility check :-\  I think both proposals
> > currently still lack any sort of device compatibility or data
> > versioning check between source and target.  Thanks,  
> I checked the compatibility, though it's not good enough:)
> 
> in migration_init, vfio_check_devstate_version() checks the version from
> the kernel against VFIO_DEVICE_STATE_INTERFACE_VERSION on both source and
> target, and on the target side, vfio_load_state() checks the source side
> version.
> 
> 
> int vfio_load_state(QEMUFile *f, void *opaque, int version_id)                          
> {       
>     ...
>     if (version_id != VFIO_DEVICE_STATE_INTERFACE_VERSION) {                                   
>         return -EINVAL;
>     } 
>     ...
> }

But this only checks that both source and target are using the same
migration interface, how do we know that they're compatible devices and
that the vendor data stream is compatible between source and target?
Whether both ends use the same migration interface is potentially not
relevant if the data stream is compatible.  Thanks,

Alex
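The gap Alex points out could be closed by having the vendor driver put a small compatibility record at the head of its data stream, so the target rejects an incompatible source even when both ends speak the same migration interface. A sketch only; the struct and its fields are hypothetical, not part of any posted proposal:

```c
#include <stdint.h>

/* Hypothetical record a vendor driver might emit first in its stream. */
struct vendor_compat {
    uint32_t vendor_id;        /* PCI vendor of the device */
    uint32_t device_id;        /* PCI device (or mdev type) */
    uint32_t data_version_min; /* oldest data-stream format understood */
    uint32_t data_version_max; /* newest data-stream format produced */
};

/* Source and target are compatible when they describe the same device
 * and their data-stream version ranges overlap. */
static int compat_ok(const struct vendor_compat *src,
                     const struct vendor_compat *dst)
{
    return src->vendor_id == dst->vendor_id &&
           src->device_id == dst->device_id &&
           src->data_version_max >= dst->data_version_min &&
           dst->data_version_max >= src->data_version_min;
}
```

The target would run this check in its load path before consuming any opaque device data, independently of the interface-version check quoted above.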
Tian, Kevin March 7, 2019, 11:20 p.m. UTC | #33
> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Friday, March 8, 2019 1:44 AM
> > >
> > > >         This kind of data needs to be saved / loaded in pre-copy and
> > > >         stop-and-copy phase.
> > > >         The data of device memory is held in device memory region.
> > > >         Size of devie memory is usually larger than that of device
> > > >         memory region. qemu needs to save/load it in chunks of size of
> > > >         device memory region.
> > > >         Not all device has device memory. Like IGD only uses system
> memory.
> > >
> > > It seems a little gratuitous to me that this is a separate region or
> > > that this data is handled separately.  All of this data is opaque to
> > > QEMU, so why do we need to separate it?
> > hi Alex,
> > as the device state interfaces are provided by kernel, it is expected to
> > meet as general needs as possible. So, do you think there are such use
> > cases from user space that user space knows well of the device, and
> > it wants kernel to return desired data back to it.
> > E.g. It just wants to get whole device config data including all mmios,
> > page tables, pci config data...
> > or, It just wants to get current device memory snapshot, not including any
> > dirty data.
> > Or, It just needs the dirty pages in device memory or system memory.
> > With all this accurate query, quite a lot of useful features can be
> > developped in user space.
> >
> > If all of this data is opaque to user app, seems the only use case is
> > for live migration.
> 
> I can certainly appreciate a more versatile interface, but I think
> we're also trying to create the most simple interface we can, with the
> primary target being live migration.  As soon as we start defining this
> type of device memory and that type of device memory, we're going to
> have another device come along that needs yet another because they have
> a slightly different requirement.  Even without that, we're going to
> have vendor drivers implement it differently, so what works for one
> device for a more targeted approach may not work for all devices.  Can
> you enumerate some specific examples of the use cases you imagine your
> design to enable?
> 

Do we want to consider a use case where user space would like to
selectively introspect a portion of the device state (including implicit 
state that is not available through PCI regions), and may ask for
the capability of directly mapping a selected portion for scanning (e.g.
device memory) instead of always turning on dirty logging on all
device state?

Thanks
Kevin
Alex Williamson March 8, 2019, 4:11 p.m. UTC | #34
On Thu, 7 Mar 2019 23:20:36 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > Sent: Friday, March 8, 2019 1:44 AM  
> > > >  
> > > > >         This kind of data needs to be saved / loaded in pre-copy and
> > > > >         stop-and-copy phase.
> > > > >         The data of device memory is held in device memory region.
> > > > >         Size of devie memory is usually larger than that of device
> > > > >         memory region. qemu needs to save/load it in chunks of size of
> > > > >         device memory region.
> > > > >         Not all device has device memory. Like IGD only uses system  
> > memory.  
> > > >
> > > > It seems a little gratuitous to me that this is a separate region or
> > > > that this data is handled separately.  All of this data is opaque to
> > > > QEMU, so why do we need to separate it?  
> > > hi Alex,
> > > as the device state interfaces are provided by kernel, it is expected to
> > > meet as general needs as possible. So, do you think there are such use
> > > cases from user space that user space knows well of the device, and
> > > it wants kernel to return desired data back to it.
> > > E.g. It just wants to get whole device config data including all mmios,
> > > page tables, pci config data...
> > > or, It just wants to get current device memory snapshot, not including any
> > > dirty data.
> > > Or, It just needs the dirty pages in device memory or system memory.
> > > With all this accurate query, quite a lot of useful features can be
> > > developped in user space.
> > >
> > > If all of this data is opaque to user app, seems the only use case is
> > > for live migration.  
> > 
> > I can certainly appreciate a more versatile interface, but I think
> > we're also trying to create the most simple interface we can, with the
> > primary target being live migration.  As soon as we start defining this
> > type of device memory and that type of device memory, we're going to
> > have another device come along that needs yet another because they have
> > a slightly different requirement.  Even without that, we're going to
> > have vendor drivers implement it differently, so what works for one
> > device for a more targeted approach may not work for all devices.  Can
> > you enumerate some specific examples of the use cases you imagine your
> > design to enable?
> >   
> 
> Do we want to consider an use case where user space would like to
> selectively introspect a portion of the device state (including implicit 
> state which are not available through PCI regions), and may ask for
> capability of direct mapping of selected portion for scanning (e.g.
> device memory) instead of always turning on dirty logging on all
> device state?

I don't see that a migration interface necessarily lends itself to this
use case.  A migration data stream has no requirement to be user
consumable as anything other than opaque data, there's also no
requirement that it expose state in a form that directly represents the
internal state of the device.  In fact I'm not sure we want to encourage
introspection via this data stream.  If a user knows how to interpret
the data, what prevents them from modifying the data in-flight?  I've
raised the question previously regarding how the vendor driver can
validate the integrity of the migration data stream.  Using the
migration interface to introspect the device certainly suggests an
interface ripe for exploiting any potential weakness in the vendor
driver reassembling that migration stream.  If the user has an mmap to
the actual live working state of the vendor driver, protection in the
hardware seems like the only way you could protect against a malicious
user.  Please be defensive in what is directly exposed to the user and
what safeguards are in place within the vendor driver for validating
incoming data.  Thanks,
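In practice, being "defensive" means the vendor driver bounds-checks every incoming range before touching device state. A minimal sketch under illustrative limits (the sizes and names below are made up for this sketch):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative limits a vendor driver might enforce on each
 * (addr, len) range of an incoming stream. */
#define DEV_MEM_SIZE   (1ULL << 30)   /* 1 GiB of device memory */
#define DEV_MEM_ALIGN  4096ULL

static int range_is_sane(uint64_t addr, uint64_t len)
{
    if (len == 0 || len > DEV_MEM_SIZE)
        return 0;                          /* empty or oversized */
    if (addr % DEV_MEM_ALIGN || len % DEV_MEM_ALIGN)
        return 0;                          /* misaligned */
    if (addr > DEV_MEM_SIZE - len)
        return 0;                          /* overflow-safe end check */
    return 1;
}
```

Note the end-of-range test is written as `addr > DEV_MEM_SIZE - len` rather than `addr + len > DEV_MEM_SIZE` so a malicious stream cannot wrap the addition.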

Alex
Dr. David Alan Gilbert March 8, 2019, 4:21 p.m. UTC | #35
* Alex Williamson (alex.williamson@redhat.com) wrote:
> On Thu, 7 Mar 2019 23:20:36 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> 
> > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > Sent: Friday, March 8, 2019 1:44 AM  
> > > > >  
> > > > > >         This kind of data needs to be saved / loaded in pre-copy and
> > > > > >         stop-and-copy phase.
> > > > > >         The data of device memory is held in device memory region.
> > > > > >         Size of devie memory is usually larger than that of device
> > > > > >         memory region. qemu needs to save/load it in chunks of size of
> > > > > >         device memory region.
> > > > > >         Not all device has device memory. Like IGD only uses system  
> > > memory.  
> > > > >
> > > > > It seems a little gratuitous to me that this is a separate region or
> > > > > that this data is handled separately.  All of this data is opaque to
> > > > > QEMU, so why do we need to separate it?  
> > > > hi Alex,
> > > > as the device state interfaces are provided by kernel, it is expected to
> > > > meet as general needs as possible. So, do you think there are such use
> > > > cases from user space that user space knows well of the device, and
> > > > it wants kernel to return desired data back to it.
> > > > E.g. It just wants to get whole device config data including all mmios,
> > > > page tables, pci config data...
> > > > or, It just wants to get current device memory snapshot, not including any
> > > > dirty data.
> > > > Or, It just needs the dirty pages in device memory or system memory.
> > > > With all this accurate query, quite a lot of useful features can be
> > > > developped in user space.
> > > >
> > > > If all of this data is opaque to user app, seems the only use case is
> > > > for live migration.  
> > > 
> > > I can certainly appreciate a more versatile interface, but I think
> > > we're also trying to create the most simple interface we can, with the
> > > primary target being live migration.  As soon as we start defining this
> > > type of device memory and that type of device memory, we're going to
> > > have another device come along that needs yet another because they have
> > > a slightly different requirement.  Even without that, we're going to
> > > have vendor drivers implement it differently, so what works for one
> > > device for a more targeted approach may not work for all devices.  Can
> > > you enumerate some specific examples of the use cases you imagine your
> > > design to enable?
> > >   
> > 
> > Do we want to consider an use case where user space would like to
> > selectively introspect a portion of the device state (including implicit 
> > state which are not available through PCI regions), and may ask for
> > capability of direct mapping of selected portion for scanning (e.g.
> > device memory) instead of always turning on dirty logging on all
> > device state?
> 
> I don't see that a migration interface necessarily lends itself to this
> use case.  A migration data stream has no requirement to be user
> consumable as anything other than opaque data, there's also no
> requirement that it expose state in a form that directly represents the
> internal state of the device.  In fact I'm not sure we want to encourage
> introspection via this data stream.  If a user knows how to interpret
> the data, what prevents them from modifying the data in-flight?  I've
> raised the question previously regarding how the vendor driver can
> validate the integrity of the migration data stream.  Using the
> migration interface to introspect the device certainly suggests an
> interface ripe for exploiting any potential weakness in the vendor
> driver reassembling that migration stream.  If the user has an mmap to
> the actual live working state of the vendor driver, protection in the
> hardware seems like the only way you could protect against a malicious
> user.  Please be defensive in what is directly exposed to the user and
> what safeguards are in place within the vendor driver for validating
> incoming data.  Thanks,

Hmm; that sounds like a security-by-obscurity answer!

The scripts/analyze-migration.py script will actually dump the
migration stream data in an almost readable format.
So if you properly define the VMState definitions it should be almost
readable; it's occasionally been useful.

I agree that you should be very very careful to validate the incoming
migration stream against:
  a) Corruption
  b) Wrong driver versions
  c) Malicious intent
    c.1) Especially by the guest
    c.2) Or by someone trying to feed you a duff stream
  d) Someone trying to load the VFIO stream into completely the wrong
device.
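Items (a), (b) and (d) in Dave's list are commonly handled by a fixed header that the target validates before parsing any payload. A sketch with entirely hypothetical field names; note a CRC only catches accidental corruption, not malicious intent, so (c) still needs real validation of the payload itself:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical fixed header prepended to the vendor data stream. */
#define MIG_MAGIC 0x5646494fU          /* made up for this sketch */

struct mig_header {
    uint32_t magic;          /* rejects garbage streams outright */
    uint32_t version;        /* vendor data-format version (b) */
    uint64_t device_uuid[2]; /* identifies the device type, not instance (d) */
    uint64_t payload_len;
    uint32_t payload_crc;    /* cheap corruption check (a); not security */
};

/* 1 if the target should proceed to parse the payload, 0 to reject. */
static int hdr_ok(const struct mig_header *h, uint32_t my_version,
                  const uint64_t my_uuid[2])
{
    return h->magic == MIG_MAGIC &&
           h->version == my_version &&
           memcmp(h->device_uuid, my_uuid, sizeof(h->device_uuid)) == 0;
}
```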

Whether the migration interface is the right thing to use for that
inspection, hmm; well it might be - if you're trying to debug
your device and need a dump of its state, then why not?
(I guess you end up with something not dissimilar to what things
like intel_reg_snapshot in intel-gpu-tools does).

Dave

> Alex
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Alex Williamson March 8, 2019, 10:02 p.m. UTC | #36
On Fri, 8 Mar 2019 16:21:46 +0000
"Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:

> * Alex Williamson (alex.williamson@redhat.com) wrote:
> > On Thu, 7 Mar 2019 23:20:36 +0000
> > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> >   
> > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > Sent: Friday, March 8, 2019 1:44 AM    
> > > > > >    
> > > > > > >         This kind of data needs to be saved / loaded in pre-copy and
> > > > > > >         stop-and-copy phase.
> > > > > > >         The data of device memory is held in device memory region.
> > > > > > >         Size of devie memory is usually larger than that of device
> > > > > > >         memory region. qemu needs to save/load it in chunks of size of
> > > > > > >         device memory region.
> > > > > > >         Not all device has device memory. Like IGD only uses system    
> > > > memory.    
> > > > > >
> > > > > > It seems a little gratuitous to me that this is a separate region or
> > > > > > that this data is handled separately.  All of this data is opaque to
> > > > > > QEMU, so why do we need to separate it?    
> > > > > hi Alex,
> > > > > as the device state interfaces are provided by kernel, it is expected to
> > > > > meet as general needs as possible. So, do you think there are such use
> > > > > cases from user space that user space knows well of the device, and
> > > > > it wants kernel to return desired data back to it.
> > > > > E.g. It just wants to get whole device config data including all mmios,
> > > > > page tables, pci config data...
> > > > > or, It just wants to get current device memory snapshot, not including any
> > > > > dirty data.
> > > > > Or, It just needs the dirty pages in device memory or system memory.
> > > > > With all this accurate query, quite a lot of useful features can be
> > > > > developped in user space.
> > > > >
> > > > > If all of this data is opaque to user app, seems the only use case is
> > > > > for live migration.    
> > > > 
> > > > I can certainly appreciate a more versatile interface, but I think
> > > > we're also trying to create the most simple interface we can, with the
> > > > primary target being live migration.  As soon as we start defining this
> > > > type of device memory and that type of device memory, we're going to
> > > > have another device come along that needs yet another because they have
> > > > a slightly different requirement.  Even without that, we're going to
> > > > have vendor drivers implement it differently, so what works for one
> > > > device for a more targeted approach may not work for all devices.  Can
> > > > you enumerate some specific examples of the use cases you imagine your
> > > > design to enable?
> > > >     
> > > 
> > > Do we want to consider an use case where user space would like to
> > > selectively introspect a portion of the device state (including implicit 
> > > state which are not available through PCI regions), and may ask for
> > > capability of direct mapping of selected portion for scanning (e.g.
> > > device memory) instead of always turning on dirty logging on all
> > > device state?  
> > 
> > I don't see that a migration interface necessarily lends itself to this
> > use case.  A migration data stream has no requirement to be user
> > consumable as anything other than opaque data, there's also no
> > requirement that it expose state in a form that directly represents the
> > internal state of the device.  In fact I'm not sure we want to encourage
> > introspection via this data stream.  If a user knows how to interpret
> > the data, what prevents them from modifying the data in-flight?  I've
> > raised the question previously regarding how the vendor driver can
> > validate the integrity of the migration data stream.  Using the
> > migration interface to introspect the device certainly suggests an
> > interface ripe for exploiting any potential weakness in the vendor
> > driver reassembling that migration stream.  If the user has an mmap to
> > the actual live working state of the vendor driver, protection in the
> > hardware seems like the only way you could protect against a malicious
> > user.  Please be defensive in what is directly exposed to the user and
> > what safeguards are in place within the vendor driver for validating
> > incoming data.  Thanks,  
> 
> Hmm; that sounds like a security-by-obscurity answer!

Yup, that's fair.  I won't deny that in-kernel vendor driver state
passing through userspace from source to target systems scares me quite
a bit, but defining device introspection as a use case for the
migration interface imposes requirements on the vendor drivers that
don't otherwise exist.  Mdev vendor specific utilities could always be
written to interpret the migration stream to deduce the internal state,
but I think that imposing segregated device memory vs device config
regions with the expectation that internal state can be directly
tracked is beyond the scope of a migration interface.
 
> The scripts/analyze-migration.py scripts will actually dump the
> migration stream data in an almost readable format.
> So if you properly define the VMState definitions it should be almost
> readable; it's occasionally been useful.

That's true for emulated devices, but I expect an mdev device migration
stream is simply one blob of opaque data followed by another.  We can
impose the protocol that userspace uses to read and write this data
stream from the device, but not the data it contains.
 
> I agree that you should be very very careful to validate the incoming
> migration stream against:
>   a) Corruption
>   b) Wrong driver versions
>   c) Malicious intent
>     c.1) Especially by the guest
>     c.2) Or by someone trying to feed you a duff stream
>   d) Someone trying load the VFIO stream into completely the wrong
> device.

Yes, and with open source mdev vendor drivers we can at least
theoretically audit the reload, but of course we also have proprietary
drivers.  I wonder if we should install the kill switch in advance to
allow users to opt out of enabling migration at the mdev layer.

> Whether the migration interface is the right thing to use for that
> inspection hmm; well it might be - if you're trying to debug
> your device and need a dump of it's state, then why not?
> (I guess you end up with something not dissimilar to what things
> like intek_reg_snapshot in intel-gpu-tools does).

Sure, as above there's nothing preventing mdev specific utilities from
decoding the migration stream, but I begin to have an issue if this
introspective use case imposes requirements on how device state is
represented through the migration interface that don't otherwise
exist.  If we want to define a standard for the actual data from the
device, we'll be at this for years :-\  Thanks,

Alex
Tian, Kevin March 11, 2019, 2:33 a.m. UTC | #37
> From: Alex Williamson
> Sent: Saturday, March 9, 2019 6:03 AM
> 
> On Fri, 8 Mar 2019 16:21:46 +0000
> "Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:
> 
> > * Alex Williamson (alex.williamson@redhat.com) wrote:
> > > On Thu, 7 Mar 2019 23:20:36 +0000
> > > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> > >
> > > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > > Sent: Friday, March 8, 2019 1:44 AM
> > > > > > >
> > > > > > > >         This kind of data needs to be saved / loaded in pre-copy and
> > > > > > > >         stop-and-copy phase.
> > > > > > > >         The data of device memory is held in device memory region.
> > > > > > > >         Size of devie memory is usually larger than that of device
> > > > > > > >         memory region. qemu needs to save/load it in chunks of size
> of
> > > > > > > >         device memory region.
> > > > > > > >         Not all device has device memory. Like IGD only uses system
> > > > > memory.
> > > > > > >
> > > > > > > It seems a little gratuitous to me that this is a separate region or
> > > > > > > that this data is handled separately.  All of this data is opaque to
> > > > > > > QEMU, so why do we need to separate it?
> > > > > > hi Alex,
> > > > > > as the device state interfaces are provided by kernel, it is expected to
> > > > > > meet as general needs as possible. So, do you think there are such
> use
> > > > > > cases from user space that user space knows well of the device, and
> > > > > > it wants kernel to return desired data back to it.
> > > > > > E.g. It just wants to get whole device config data including all mmios,
> > > > > > page tables, pci config data...
> > > > > > or, It just wants to get current device memory snapshot, not
> including any
> > > > > > dirty data.
> > > > > > Or, It just needs the dirty pages in device memory or system memory.
> > > > > > With all this accurate query, quite a lot of useful features can be
> > > > > > developped in user space.
> > > > > >
> > > > > > If all of this data is opaque to user app, seems the only use case is
> > > > > > for live migration.
> > > > >
> > > > > I can certainly appreciate a more versatile interface, but I think
> > > > > we're also trying to create the most simple interface we can, with the
> > > > > primary target being live migration.  As soon as we start defining this
> > > > > type of device memory and that type of device memory, we're going to
> > > > > have another device come along that needs yet another because they
> have
> > > > > a slightly different requirement.  Even without that, we're going to
> > > > > have vendor drivers implement it differently, so what works for one
> > > > > device for a more targeted approach may not work for all devices.  Can
> > > > > you enumerate some specific examples of the use cases you imagine
> your
> > > > > design to enable?
> > > > >
> > > >
> > > > Do we want to consider an use case where user space would like to
> > > > selectively introspect a portion of the device state (including implicit
> > > > state which are not available through PCI regions), and may ask for
> > > > capability of direct mapping of selected portion for scanning (e.g.
> > > > device memory) instead of always turning on dirty logging on all
> > > > device state?
> > >
> > > I don't see that a migration interface necessarily lends itself to this
> > > use case.  A migration data stream has no requirement to be user
> > > consumable as anything other than opaque data, there's also no
> > > requirement that it expose state in a form that directly represents the
> > > internal state of the device.  In fact I'm not sure we want to encourage
> > > introspection via this data stream.  If a user knows how to interpret
> > > the data, what prevents them from modifying the data in-flight?  I've
> > > raised the question previously regarding how the vendor driver can
> > > validate the integrity of the migration data stream.  Using the
> > > migration interface to introspect the device certainly suggests an
> > > interface ripe for exploiting any potential weakness in the vendor
> > > driver reassembling that migration stream.  If the user has an mmap to
> > > the actual live working state of the vendor driver, protection in the
> > > hardware seems like the only way you could protect against a malicious
> > > user.  Please be defensive in what is directly exposed to the user and
> > > what safeguards are in place within the vendor driver for validating
> > > incoming data.  Thanks,
> >
> > Hmm; that sounds like a security-by-obscurity answer!
> 
> Yup, that's fair.  I won't deny that in-kernel vendor driver state
> passing through userspace from source to target systems scares me quite
> a bit, but defining device introspection as a use case for the
> migration interface imposes requirements on the vendor drivers that
> don't otherwise exist.  Mdev vendor specific utilities could always be
> written to interpret the migration stream to deduce the internal state,
> but I think that imposing segregated device memory vs device config
> regions with the expectation that internal state can be directly
> tracked is beyond the scope of a migration interface.

I'm fine with defining such an interface aimed only at migration-like
usages (e.g. also including fast check-pointing), but I don't buy
the point that such an opaque way is more secure than the segregated
style, since the layout can anyway be dumped out by looking at the
source code of the mdev driver.

Also, it would be better not to include the word 'migration' in the
related interface structure definitions. It's just an opaque/dirty-logged
way to get/set device state; e.g. instead of calling it the "migration
interface" can we call it the "dirty-logged state interface"?

> 
> > The scripts/analyze-migration.py scripts will actually dump the
> > migration stream data in an almost readable format.
> > So if you properly define the VMState definitions it should be almost
> > readable; it's occasionally been useful.
> 
> That's true for emulated devices, but I expect an mdev device migration
> stream is simply one blob of opaque data followed by another.  We can
> impose the protocol that userspace uses to read and write this data
> stream from the device, but not the data it contains.
> 
> > I agree that you should be very very careful to validate the incoming
> > migration stream against:
> >   a) Corruption
> >   b) Wrong driver versions
> >   c) Malicious intent
> >     c.1) Especially by the guest
> >     c.2) Or by someone trying to feed you a duff stream
> >   d) Someone trying load the VFIO stream into completely the wrong
> > device.
> 
> Yes, and with open source mdev vendor drivers we can at least
> theoretically audit the reload, but of course we also have proprietary
> drivers.  I wonder if we should install the kill switch in advance to
> allow users to opt-out of enabling migration at the mdev layer.
> 
> > Whether the migration interface is the right thing to use for that
> > inspection hmm; well it might be - if you're trying to debug
> > your device and need a dump of it's state, then why not?
> > (I guess you end up with something not dissimilar to what things
> > like intek_reg_snapshot in intel-gpu-tools does).
> 
> Sure, as above there's nothing preventing mdev specific utilities from
> decoding the migration stream, but I begin to have an issue if this
> introspective use case imposes requirements on how device state is
> represented through the migration interface that don't otherwise
> exist.  If we want to define a standard for the actual data from the
> device, we'll be at this for years :-\  Thanks,
> 

Introspection is one potential usage when thinking about the mmapped
style in Yan's proposal, but on its own it's not a strong enough
argument, since introspection can also be done in an opaque way (just
not optimally, as it always needs to track all the states). We may
introduce a new interface in the future when this becomes a real problem.

But I still didn't get your exact concern about the security part. For
versioning, yes, we still haven't worked out a sane way to represent
vendor-specific compatibility requirements. But allowing user
space to modify data through this interface is really no different
from allowing the guest to modify data through the trapped MMIO
interface. The mdev driver should guarantee that operations through
both interfaces can modify only the state associated with the said
mdev instance, without breaking the isolation boundary. Then the
former becomes just a batch of operations to be verified in the same
way as if they were done individually through the latter interface.
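Kevin's argument can be pictured as reusing the same per-register validator on both paths: a trapped MMIO write and each entry of a restored state batch go through identical checks. All register names and offsets below are invented purely for illustration:

```c
#include <assert.h>
#include <stdint.h>

#define REG_SPACE  0x1000u   /* hypothetical register window size */
#define REG_RO_CTL 0x10u     /* made-up read-only register */

/* The validator used on the trapped-MMIO path: reject out-of-range,
 * misaligned, or read-only targets. */
static int mmio_write_ok(uint32_t offset, uint32_t val)
{
    (void)val;
    if (offset >= REG_SPACE || offset % 4)
        return 0;
    if (offset == REG_RO_CTL)
        return 0;                 /* guest may not write this either */
    return 1;
}

struct reg_entry { uint32_t offset; uint32_t val; };

/* Accept a restored state batch only if every entry would have been
 * legal as a single trapped write. */
static int batch_ok(const struct reg_entry *e, int n)
{
    for (int i = 0; i < n; i++)
        if (!mmio_write_ok(e[i].offset, e[i].val))
            return 0;
    return 1;
}
```

Under this model, the migration write path grants userspace no authority the guest did not already have through trapped MMIO, which is the crux of Kevin's point.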

Thanks
Kevin
Alex Williamson March 11, 2019, 8:19 p.m. UTC | #38
On Mon, 11 Mar 2019 02:33:11 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Alex Williamson
> > Sent: Saturday, March 9, 2019 6:03 AM
> > 
> > On Fri, 8 Mar 2019 16:21:46 +0000
> > "Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:
> >   
> > > * Alex Williamson (alex.williamson@redhat.com) wrote:  
> > > > On Thu, 7 Mar 2019 23:20:36 +0000
> > > > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> > > >  
> > > > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > > > Sent: Friday, March 8, 2019 1:44 AM  
> > > > > > > >  
> > > > > > > > >         This kind of data needs to be saved / loaded in pre-copy and
> > > > > > > > >         stop-and-copy phase.
> > > > > > > > >         The data of device memory is held in device memory region.
> > > > > > > > >         Size of devie memory is usually larger than that of device
> > > > > > > > >         memory region. qemu needs to save/load it in chunks of size  
> > of  
> > > > > > > > >         device memory region.
> > > > > > > > >         Not all device has device memory. Like IGD only uses system  
> > > > > > memory.  
> > > > > > > >
> > > > > > > > It seems a little gratuitous to me that this is a separate region or
> > > > > > > > that this data is handled separately.  All of this data is opaque to
> > > > > > > > QEMU, so why do we need to separate it?  
> > > > > > > hi Alex,
> > > > > > > as the device state interfaces are provided by kernel, it is expected to
> > > > > > > meet as general needs as possible. So, do you think there are such  
> > use  
> > > > > > > cases from user space that user space knows well of the device, and
> > > > > > > it wants kernel to return desired data back to it.
> > > > > > > E.g. It just wants to get whole device config data including all mmios,
> > > > > > > page tables, pci config data...
> > > > > > > or, It just wants to get current device memory snapshot, not  
> > including any  
> > > > > > > dirty data.
> > > > > > > Or, It just needs the dirty pages in device memory or system memory.
> > > > > > > With all this accurate query, quite a lot of useful features can be
> > > > > > > developped in user space.
> > > > > > >
> > > > > > > If all of this data is opaque to user app, seems the only use case is
> > > > > > > for live migration.  
> > > > > >
> > > > > > I can certainly appreciate a more versatile interface, but I think
> > > > > > we're also trying to create the most simple interface we can, with the
> > > > > > primary target being live migration.  As soon as we start defining this
> > > > > > type of device memory and that type of device memory, we're going to
> > > > > > have another device come along that needs yet another because they  
> > have  
> > > > > > a slightly different requirement.  Even without that, we're going to
> > > > > > have vendor drivers implement it differently, so what works for one
> > > > > > device for a more targeted approach may not work for all devices.  Can
> > > > > > you enumerate some specific examples of the use cases you imagine  
> > your  
> > > > > > design to enable?
> > > > > >  
> > > > >
> > > > > Do we want to consider an use case where user space would like to
> > > > > selectively introspect a portion of the device state (including implicit
> > > > > state which are not available through PCI regions), and may ask for
> > > > > capability of direct mapping of selected portion for scanning (e.g.
> > > > > device memory) instead of always turning on dirty logging on all
> > > > > device state?  
> > > >
> > > > I don't see that a migration interface necessarily lends itself to this
> > > > use case.  A migration data stream has no requirement to be user
> > > > consumable as anything other than opaque data, there's also no
> > > > requirement that it expose state in a form that directly represents the
> > > > internal state of the device.  In fact I'm not sure we want to encourage
> > > > introspection via this data stream.  If a user knows how to interpret
> > > > the data, what prevents them from modifying the data in-flight?  I've
> > > > raised the question previously regarding how the vendor driver can
> > > > validate the integrity of the migration data stream.  Using the
> > > > migration interface to introspect the device certainly suggests an
> > > > interface ripe for exploiting any potential weakness in the vendor
> > > > driver reassembling that migration stream.  If the user has an mmap to
> > > > the actual live working state of the vendor driver, protection in the
> > > > hardware seems like the only way you could protect against a malicious
> > > > user.  Please be defensive in what is directly exposed to the user and
> > > > what safeguards are in place within the vendor driver for validating
> > > > incoming data.  Thanks,  
> > >
> > > Hmm; that sounds like a security-by-obscurity answer!  
> > 
> > Yup, that's fair.  I won't deny that in-kernel vendor driver state
> > passing through userspace from source to target systems scares me quite
> > a bit, but defining device introspection as a use case for the
> > migration interface imposes requirements on the vendor drivers that
> > don't otherwise exist.  Mdev vendor specific utilities could always be
> > written to interpret the migration stream to deduce the internal state,
> > but I think that imposing segregated device memory vs device config
> > regions with the expectation that internal state can be directly
> > tracked is beyond the scope of a migration interface.  
> 
> I'm fine with defining such an interface aimed only at migration-like
> usages (e.g. also including fast checkpointing), but I don't buy
> the point that the opaque way is more secure than the segregated
> style, since the layout can be dumped anyway by looking at the
> source code of the mdev driver.

I think I've fully conceded any notion of security by obscurity towards
opaque data already, but segregating types of device data still seems
to unnecessarily impose a usage model on the vendor driver that I think
we should try to avoid.  The migration interface should define the
protocol through which userspace can save and restore the device state,
not impose how the vendor driver exposes or manages that state.  Also, I
got the impression (perhaps incorrectly) that you were trying to mmap
live data to userspace, which would allow not only saving the state,
but also unchecked state modification by userspace. I think we want
more of a producer/consumer model of the state where consuming state
also involves at least some degree of sanity or consistency checking.
Let's not forget too that we're obviously dealing with non-open-source
drivers in the mdev universe as well.
 
> Also, it's better not to include the word 'migration' in the related
> interface structure definitions. It's just an opaque/dirty-logged way to
> get/set device state; e.g. instead of calling it the "migration interface"
> can we call it the "dirty-logged state interface"?

I think we're talking about the color of the interface now ;)

> > > The scripts/analyze-migration.py scripts will actually dump the
> > > migration stream data in an almost readable format.
> > > So if you properly define the VMState definitions it should be almost
> > > readable; it's occasionally been useful.  
> > 
> > That's true for emulated devices, but I expect an mdev device migration
> > stream is simply one blob of opaque data followed by another.  We can
> > impose the protocol that userspace uses to read and write this data
> > stream from the device, but not the data it contains.
> >   
> > > I agree that you should be very very careful to validate the incoming
> > > migration stream against:
> > >   a) Corruption
> > >   b) Wrong driver versions
> > >   c) Malicious intent
> > >     c.1) Especially by the guest
> > >     c.2) Or by someone trying to feed you a duff stream
> > >   d) Someone trying to load the VFIO stream into completely the wrong
> > > device.  
> > 
> > Yes, and with open source mdev vendor drivers we can at least
> > theoretically audit the reload, but of course we also have proprietary
> > drivers.  I wonder if we should install the kill switch in advance to
> > allow users to opt-out of enabling migration at the mdev layer.
> >   
> > > Whether the migration interface is the right thing to use for that
> > > inspection hmm; well it might be - if you're trying to debug
> > > your device and need a dump of its state, then why not?
> > > (I guess you end up with something not dissimilar to what things
> > > like intel_reg_snapshot in intel-gpu-tools does).  
> > 
> > Sure, as above there's nothing preventing mdev specific utilities from
> > decoding the migration stream, but I begin to have an issue if this
> > introspective use case imposes requirements on how device state is
> > represented through the migration interface that don't otherwise
> > exist.  If we want to define a standard for the actual data from the
> > device, we'll be at this for years :-\  Thanks,
> >   
> 
> Introspection is one potential usage when thinking about the mmapped
> style in Yan's proposal, but it's not a strong enough one, since
> introspection can also be done the opaque way (just not optimally,
> meaning all the state always needs to be tracked). We may introduce a
> new interface in the future when it becomes a real problem.
> 
> But I still don't get your exact concern about the security part. For
> versioning, yes, we still haven't worked out a sane way to represent
> vendor-specific compatibility requirements. But allowing user
> space to modify data through this interface is really no different
> from allowing the guest to modify data through the trapped MMIO
> interface. The mdev driver should guarantee that operations through
> both interfaces can modify only the state associated with the said mdev
> instance, without breaking the isolation boundary. Then the former
> becomes just a batch of operations to be verified in the same way as if
> they were done individually through the latter interface. 

It seems like you're assuming a working model for the vendor driver and
the data entering and exiting through this interface.  The vendor
drivers can expose data any way that they want.  All we need to do is
imagine that the migration data stream includes an array index count
somewhere which the user could modify to trigger the in-kernel vendor
driver to allocate an absurd array size and DoS the target.  This is
probably the most simplistic attack, possibly knowing the state machine
of the vendor driver a malicious user could trick it into providing
host kernel data.  We're not necessarily only conveying state that the
user already has access to via this interface, the vendor driver may
include non-visible internal state as well.  Even the state that is
user accessible is being pushed into the vendor driver via an alternate
path from the user mediation we have on the existing paths.

On the other hand, if your assertion is that an incoming migration is
nothing more than a batch of operations through existing interfaces to
the device, then maybe this migration interface should be read-only to
generate an interpreted series of operations to the device.  I expect
we wouldn't get terribly far with such an approach though.  Thanks,

Alex
Tian, Kevin March 12, 2019, 2:48 a.m. UTC | #39
> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Tuesday, March 12, 2019 4:19 AM
> On Mon, 11 Mar 2019 02:33:11 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> 
[...]
> 
> I think I've fully conceded any notion of security by obscurity towards
> opaque data already, but segregating types of device data still seems
> to unnecessarily impose a usage model on the vendor driver that I think
> we should try to avoid.  The migration interface should define the
> protocol through which userspace can save and restore the device state,
> not impose how the vendor driver exposes or manages that state.  Also, I
> got the impression (perhaps incorrectly) that you were trying to mmap
> live data to userspace, which would allow not only saving the state,
> but also unchecked state modification by userspace. I think we want
> more of a producer/consumer model of the state where consuming state
> also involves at least some degree of sanity or consistency checking.
> Let's not forget too that we're obviously dealing with non-open source
> driver in the mdev universe as well.

OK. I think for this part we are in agreement - as long as the goal of
this interface is clearly defined that way. :-)

[...]
> > But I still didn't get your exact concern about security part. For
> > version yes we still haven't worked out a sane way to represent
> > vendor-specific compatibility requirement. But allowing user
> > space to modify data through this interface has really no difference
> > from allowing guest to modify data through trapped MMIO interface.
> > mdev driver should guarantee that operations through both interfaces
> > can modify only the state associated with the said mdev instance,
> > w/o breaking the isolation boundary. Then the former becomes just
> > a batch of operations to be verified in the same way as if they are
> > done individually through the latter interface.
> 
> It seems like you're assuming a working model for the vendor driver and
> the data entering and exiting through this interface.  The vendor
> drivers can expose data any way that they want.  All we need to do is
> imagine that the migration data stream includes an array index count
> somewhere which the user could modify to trigger the in-kernel vendor
> driver to allocate an absurd array size and DoS the target.  This is
> probably the most simplistic attack, possibly knowing the state machine
> of the vendor driver a malicious user could trick it into providing
> host kernel data.  We're not necessarily only conveying state that the
> user already has access to via this interface, the vendor driver may
> include non-visible internal state as well.  Even the state that is
> user accessible is being pushed into the vendor driver via an alternate
> path from the user mediation we have on the existing paths.

Then I don't know how this concern can be effectively addressed,
since you assume vendor drivers are not trusted here. And why do
you trust vendor drivers to mediate the existing path but not this
alternative one? Non-visible internal state just means more stuff
to be carefully scrutinized; it doesn't fundamentally change the
trust level.

Or can this concern be partially mitigated if we create some 
test cases which poke random data through the new interface,
and mark vendor drivers which pass such tests as trusted? Then
there is also an open question of who should be in charge of such a
certification process...

Thanks
Kevin
Yan Zhao March 12, 2019, 2:57 a.m. UTC | #40
hi Alex
thanks for your reply.

So, if we choose migration data to be userspace opaque, do you think the
sequence below is the right behavior for the vendor driver to follow:

1. initially the LOGGING state is not set. If userspace calls GET_BUFFER,
the vendor driver should reject it and return 0.

2. then the LOGGING state is set; if userspace calls GET_BUFFER,
   a. the vendor driver should first query a whole snapshot of device memory
   (let's use this term to represent the device's standalone memory for now),
   b. the vendor driver returns a chunk of the data just queried to
   userspace, while recording the current position in the data.
   c. once the vendor driver finds all the data just queried has finished
   transmitting to userspace, it queries only dirty data in device memory.
   d. the vendor driver returns a chunk of the data just queried (this time
   dirty data) to userspace while recording the current position.
   e. if all data has been transmitted to userspace and GET_BUFFERs still
   come from userspace, the vendor driver starts another round of dirty data
   query.

3. if the LOGGING state is then unset, and userspace calls GET_BUFFER,
   a. if the vendor driver finds there is previously untransmitted data, it
   returns it until all of it has been transmitted.
   b. the vendor driver then queries dirty data again and transmits it.
   c. finally, the vendor driver queries device config data (which has to be
   queried last and sent once) and transmits it.


For the first bullet, if the LOGGING state is set and migration then aborts,
the vendor driver has to be able to detect that condition. So seemingly the
vendor driver has to know more about QEMU's migration state, like whether
migration was started and failed. Do you think that's acceptable?


Thanks
Yan
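[Editor's note] The sequence above can be sketched as a small state machine. This is only an illustration of the proposed GET_BUFFER behavior, not the real VFIO interface: the names (`mig_ctx`, `get_buffer`, the state bits) are invented, and a fixed dirty-round size stands in for a real dirty-data query.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Illustrative state bits and phases; not the real VFIO ABI. */
enum { DEV_RUNNING = 1, DEV_LOGGING = 2 };
enum phase { PHASE_IDLE, PHASE_SNAPSHOT, PHASE_DIRTY };

struct mig_ctx {
    int state;              /* DEV_RUNNING | DEV_LOGGING */
    enum phase phase;
    size_t pos;             /* position within the current round of data */
    size_t snapshot_len;    /* size of the initial device-memory snapshot */
    size_t dirty_len;       /* size of one dirty round (fixed here; a real
                             * driver would re-query dirty data each round) */
};

/* GET_BUFFER: produce up to 'len' bytes of migration data into 'buf'.
 * Returns the number of bytes produced; 0 means nothing is available. */
static size_t get_buffer(struct mig_ctx *c, char *buf, size_t len)
{
    /* Bullet 1: no session in progress, reject the read. */
    if (!(c->state & DEV_LOGGING) && c->phase == PHASE_IDLE)
        return 0;

    /* Bullet 2a: logging just set, start with a whole snapshot. */
    if (c->phase == PHASE_IDLE) {
        c->phase = PHASE_SNAPSHOT;
        c->pos = 0;
    }

    size_t total = (c->phase == PHASE_SNAPSHOT) ? c->snapshot_len
                                                : c->dirty_len;
    size_t n = (total - c->pos < len) ? total - c->pos : len;
    memset(buf, 0, n);      /* stand-in for copying real device data */
    c->pos += n;            /* bullets 2b/2d: record the current position */

    /* Bullets 2c/2e: round exhausted, begin another dirty-data round. */
    if (c->pos == total) {
        c->phase = PHASE_DIRTY;
        c->pos = 0;
    }
    return n;
}
```

As a usage sketch: with a 10-byte snapshot and 4-byte dirty rounds, reads of 6 bytes yield 0 (no logging), then 6 and 4 (snapshot), then 4 per dirty round.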
Yan Zhao March 13, 2019, 1:13 a.m. UTC | #41
hi Alex
Any comments to the sequence below?

Actually we have some concerns and suggestions about userspace-opaque
migration data.

1. if data is opaque to userspace, the kernel interface must be tightly
bound to migration.
   e.g. the vendor driver has to know that state (running + not logging)
   should not return any data, and that state (running + logging) should
   return a whole snapshot first and dirty data later. It also has to know
   that QEMU migration will not call GET_BUFFER in state (running + not
   logging); otherwise, it has to adjust its behavior.

2. the vendor driver cannot ensure userspace gets all the data it intends
to save in the pre-copy phase.
   e.g. in the stop-and-copy phase, the vendor driver has to first check
   and send data left over from the previous phase.


3. if the whole sequence is tightly bound to live migration, can we remove
the logging state? What about adding two states, migrate-in and migrate-out,
so there are four states: running, stopped, migrate-in, migrate-out?
   migrate-out is for the source side when migration starts. Together with
   the running and stopped states, it can substitute for the logging state.
   migrate-in is for the target side.


Thanks
Yan

On Tue, Mar 12, 2019 at 10:57:47AM +0800, Zhao Yan wrote:
> hi Alex
> thanks for your reply.
> 
> So, if we choose migration data to be userspace opaque, do you think the
> sequence below is the right behavior for the vendor driver to follow:
> 
> 1. initially the LOGGING state is not set. If userspace calls GET_BUFFER,
> the vendor driver should reject it and return 0.
> 
> 2. then the LOGGING state is set; if userspace calls GET_BUFFER,
>    a. the vendor driver should first query a whole snapshot of device
>    memory (let's use this term to represent the device's standalone memory
>    for now),
>    b. the vendor driver returns a chunk of the data just queried to
>    userspace, while recording the current position in the data.
>    c. once the vendor driver finds all the data just queried has finished
>    transmitting to userspace, it queries only dirty data in device memory.
>    d. the vendor driver returns a chunk of the data just queried (this
>    time dirty data) to userspace while recording the current position.
>    e. if all data has been transmitted to userspace and GET_BUFFERs still
>    come from userspace, the vendor driver starts another round of dirty
>    data query.
> 
> 3. if the LOGGING state is then unset, and userspace calls GET_BUFFER,
>    a. if the vendor driver finds there is previously untransmitted data,
>    it returns it until all of it has been transmitted.
>    b. the vendor driver then queries dirty data again and transmits it.
>    c. finally, the vendor driver queries device config data (which has to
>    be queried last and sent once) and transmits it.
> 
> 
> for the first bullet, if the LOGGING state is set and migration then
> aborts, the vendor driver has to be able to detect that condition. So
> seemingly the vendor driver has to know more about QEMU's migration
> state, like whether migration was started and failed. Do you think
> that's acceptable?
> 
> 
> Thanks
> Yan
> 
>
Alex Williamson March 13, 2019, 7:14 p.m. UTC | #42
On Tue, 12 Mar 2019 21:13:01 -0400
Zhao Yan <yan.y.zhao@intel.com> wrote:

> hi Alex
> Any comments to the sequence below?
> 
> Actually we have some concerns and suggestions to userspace-opaque migration
> data.
> 
> 1. if data is opaque to userspace, kernel interface must be tightly bound to
> migration. 
>    e.g. vendor driver has to know state (running + not logging) should not
>    return any data, and state (running + logging) should return whole
>    snapshot first and dirty later. it also has to know qemu migration will
>    not call GET_BUFFER in state (running + not logging), otherwise, it has
>    to adjust its behavior.

This all just sounds like defining the protocol we expect with the
interface.  For instance if we define a session as beginning when
logging is enabled and ending when the device is stopped and the
interface reports no more data is available, then we can state that any
partial accumulation of data is incomplete relative to migration.  If
userspace wants to initiate a new migration stream, they can simply
toggle logging.  How the vendor driver provides the data during the
session is not defined, but beginning the session with a snapshot
followed by repeated iterations of dirtied data is certainly a valid
approach.

> 2. vendor driver cannot ensure userspace get all the data it intends to
> save in pre-copy phase.
>   e.g. in stop-and-copy phase, vendor driver has to first check and send
>   data in previous phase.

First, I don't think the device has control of when QEMU switches from
pre-copy to stop-and-copy, the protocol needs to support that
transition at any point.  However, it seems a simple data-available
counter provides an indication of when it might be optimal to make such
a transition.  If a vendor driver follows a scheme as above, the
available data counter would indicate a large value, the entire initial
snapshot of the device.  As the migration continues and pages are
dirtied, the device would reach a steady state amount of data
available, depending on the guest activity.  This could indicate to the
user to stop the device.  The migration stream would not be considered
completed until the available data counter reaches zero while the
device is in the stopped|logging state.
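[Editor's note] A minimal sketch of that heuristic from the userspace side; the function names and the bandwidth/downtime inputs are invented for illustration, not part of any proposed ABI:

```c
#include <assert.h>
#include <stddef.h>

/* Pre-copy to stop-and-copy heuristic: stop the device once the
 * vendor-reported pending-bytes counter could be drained within the
 * downtime budget at the estimated migration bandwidth. */
static int should_stop_device(size_t pending_bytes,
                              size_t bandwidth_bytes_per_s,
                              double max_downtime_s)
{
    return (double)pending_bytes <=
           (double)bandwidth_bytes_per_s * max_downtime_s;
}

/* The stream is complete only when the counter reaches zero while the
 * device is in the stopped|logging state. */
static int migration_complete(size_t pending_bytes, int stopped, int logging)
{
    return stopped && logging && pending_bytes == 0;
}
```

A steady-state pending value well above the budget would tell the user the guest is dirtying device data faster than it can be streamed.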

> 3. if all the sequence is tightly bound to live migration, can we remove the
> logging state? what about adding two states migrate-in and migrate-out?
> so there are four states: running, stopped, migrate-in, migrate-out.
>    migrate-out is for source side when migration starts. together with
>    state running and stopped, it can substitute state logging.
>    migrate-in is for target side.

In fact, Kirti's implementation specifies a data direction, but I think
we still need logging to indicate sessions.  I'd also assume that
logging implies some overhead for the vendor driver.

> On Tue, Mar 12, 2019 at 10:57:47AM +0800, Zhao Yan wrote:
> > hi Alex
> > thanks for your reply.
> > 
> > So, if we choose migration data to be userspace opaque, do you think below
> > sequence is the right behavior for vendor driver to follow:
> > 
> > 1. initially LOGGING state is not set. If userspace calls GET_BUFFER to
> > vendor driver,  vendor driver should reject and return 0.

What would this state mean otherwise?  If we're not logging then it
should not be expected that we can construct dirtied data from a
previous read of the state before logging was enabled (it would be
outside of the "session").  So at best this is an incomplete segment of
the initial snapshot of the device, but that presumes how the vendor
driver constructs the data.  I wouldn't necessarily mandate the vendor
driver reject it, but I think we should consider it undefined and
vendor specific relative to the migration interface.

> > 2. then the LOGGING state is set; if userspace calls GET_BUFFER,
> >    a. the vendor driver should first query a whole snapshot of device
> >    memory (let's use this term to represent the device's standalone
> >    memory for now),
> >    b. the vendor driver returns a chunk of the data just queried to
> >    userspace, while recording the current position in the data.
> >    c. once the vendor driver finds all the data just queried has finished
> >    transmitting to userspace, it queries only dirty data in device memory.
> >    d. the vendor driver returns a chunk of the data just queried (this
> >    time dirty data) to userspace while recording the current position.
> >    e. if all data has been transmitted to userspace and GET_BUFFERs still
> >    come, the vendor driver starts another round of dirty data query.

This is a valid vendor driver approach, but it's outside the scope of
the interface definition.  A vendor driver could also decide to not
provide any data until both stopped and logging are set and then
provide a fixed, final snapshot.  The interface supports either
approach by defining the protocol to interact with it.

> > 3. if the LOGGING state is then unset, and userspace calls GET_BUFFER,
> >    a. if the vendor driver finds there is previously untransmitted data,
> >    it returns it until all of it has been transmitted.
> >    b. the vendor driver then queries dirty data again and transmits it.
> >    c. finally, the vendor driver queries device config data (which has
> >    to be queried last and sent once) and transmits it.

This seems broken; the vendor driver is presuming the user's intentions.
If logging is unset, we return to bullet 1, reading data is undefined
and vendor specific.  It's outside of the session.

> > for the first bullet, if the LOGGING state is set and migration then
> > aborts, the vendor driver has to be able to detect that condition. So
> > seemingly the vendor driver has to know more about QEMU's migration
> > state, like whether migration was started and failed. Do you think
> > that's acceptable?

If migration aborts, logging is cleared and the device continues
operation.  If a new migration is started, the session is initiated by
enabling logging.  Sound reasonable?  Thanks,

Alex
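[Editor's note] A rough sketch of that session rule; the names are invented and a real implementation would live in the vendor driver's state handling:

```c
#include <assert.h>

/* Toggling logging delimits a migration session: setting it opens a new
 * session (any earlier partial data is forgotten), clearing it aborts or
 * ends the session and invalidates any partial accumulation of data. */
struct session {
    int logging;
    int open;       /* a migration stream is currently in progress */
};

static void set_logging(struct session *s, int on)
{
    if (on && !s->logging)
        s->open = 1;    /* new session: restart from a fresh snapshot */
    else if (!on && s->logging)
        s->open = 0;    /* abort/end: partial data is now meaningless */
    s->logging = on;
}
```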
Alex Williamson March 13, 2019, 7:57 p.m. UTC | #43
On Tue, 12 Mar 2019 02:48:39 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > Sent: Tuesday, March 12, 2019 4:19 AM
> > On Mon, 11 Mar 2019 02:33:11 +0000
> > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> >   
> [...]
> > 
> > I think I've fully conceded any notion of security by obscurity towards
> > opaque data already, but segregating types of device data still seems
> > to unnecessarily impose a usage model on the vendor driver that I think
> > we should try to avoid.  The migration interface should define the
> > protocol through which userspace can save and restore the device state,
> > not impose how the vendor driver exposes or manages that state.  Also, I
> > got the impression (perhaps incorrectly) that you were trying to mmap
> > live data to userspace, which would allow not only saving the state,
> > but also unchecked state modification by userspace. I think we want
> > more of a producer/consumer model of the state where consuming state
> > also involves at least some degree of sanity or consistency checking.
> > Let's not forget too that we're obviously dealing with non-open source
> > driver in the mdev universe as well.  
> 
> OK. I think for this part we are in agreement - as long as the goal of
> this interface is clearly defined as such way. :-)
> 
> [...]
> > > But I still didn't get your exact concern about security part. For
> > > version yes we still haven't worked out a sane way to represent
> > > vendor-specific compatibility requirement. But allowing user
> > > space to modify data through this interface has really no difference
> > > from allowing guest to modify data through trapped MMIO interface.
> > > mdev driver should guarantee that operations through both interfaces
> > > can modify only the state associated with the said mdev instance,
> > > w/o breaking the isolation boundary. Then the former becomes just
> > > a batch of operations to be verified in the same way as if they are
> > > done individually through the latter interface.  
> > 
> > It seems like you're assuming a working model for the vendor driver and
> > the data entering and exiting through this interface.  The vendor
> > drivers can expose data any way that they want.  All we need to do is
> > imagine that the migration data stream includes an array index count
> > somewhere which the user could modify to trigger the in-kernel vendor
> > driver to allocate an absurd array size and DoS the target.  This is
> > probably the most simplistic attack, possibly knowing the state machine
> > of the vendor driver a malicious user could trick it into providing
> > host kernel data.  We're not necessarily only conveying state that the
> > user already has access to via this interface, the vendor driver may
> > include non-visible internal state as well.  Even the state that is
> > user accessible is being pushed into the vendor driver via an alternate
> > path from the user mediation we have on the existing paths.  
> 
> Then I don't know how this concern can be effectively addressed,
> since you assume vendor drivers are not trusted here. And why do
> you trust vendor drivers to mediate the existing path but not this
> alternative one? Non-visible internal state just means more stuff
> to be carefully scrutinized; it doesn't fundamentally change the
> trust level.
> 
> Or can this concern be partially mitigated if we create some 
> test cases which poke random data through the new interface,
> and mark vendor drivers which pass such tests as trusted? Then
> there is also an open question of who should be in charge of such a
> certification process...

The vendor driver is necessarily trusted, it lives in the kernel, it
works in the kernel address space.  Unfortunately that's also the risk
with passing data from userspace into the vendor driver, the vendor
driver needs to take every precaution in sanitizing and validating that
data.  I wish we had a common way to perform that checking, but it
seems that each vendor driver is going to need to define their own
protocol and battle their own bugs and exploits in the code
implementing that protocol.  For open source drivers we can continue to
rely on review and openness; for closed drivers... the user has already
accepted the risk in trusting the driver themselves.  Perhaps all I can do
is raise the visibility that there are potential security issues here
and vendor drivers need to own that risk.

A fuzzing test would be great; we could at least validate whether a
vendor driver implements some sort of CRC test, but I don't think we
can create a certification process around that.  Thanks,

Alex
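[Editor's note] One shape such defensive validation could take. The framing, magic value, size cap, and checksum below are all invented for illustration; nothing here is part of the proposed ABI, but it shows the kind of bounds checking that defeats the "absurd array size" DoS described above:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical framing for one opaque migration chunk: the vendor driver
 * validates the magic, bounds the length, and checks a checksum before
 * touching the payload. */
struct chunk_hdr {
    uint32_t magic;
    uint32_t len;       /* payload bytes following the header */
    uint32_t csum;      /* checksum of the payload */
};

#define CHUNK_MAGIC 0x564d4947u   /* "VMIG", illustrative */
#define CHUNK_MAX   (16u << 20)   /* refuse absurd allocations (DoS guard) */

static uint32_t csum32(const uint8_t *p, size_t n)
{
    uint32_t s = 0;
    while (n--)
        s = s * 31 + *p++;
    return s;
}

/* Returns 0 if the chunk may be consumed, -1 otherwise. */
static int chunk_validate(const struct chunk_hdr *h,
                          const uint8_t *payload, size_t avail)
{
    if (h->magic != CHUNK_MAGIC)
        return -1;
    if (h->len > CHUNK_MAX || h->len > avail)
        return -1;              /* never trust a user-supplied size */
    if (csum32(payload, h->len) != h->csum)
        return -1;
    return 0;
}
```

A checksum only catches corruption, not a malicious but self-consistent stream, which is why the structural bounds checks come first.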
Yan Zhao March 14, 2019, 1:12 a.m. UTC | #44
On Thu, Mar 14, 2019 at 03:14:54AM +0800, Alex Williamson wrote:
> On Tue, 12 Mar 2019 21:13:01 -0400
> Zhao Yan <yan.y.zhao@intel.com> wrote:
> 
> > hi Alex
> > Any comments to the sequence below?
> > 
> > Actually we have some concerns and suggestions to userspace-opaque migration
> > data.
> > 
> > 1. if data is opaque to userspace, kernel interface must be tightly bound to
> > migration. 
> >    e.g. vendor driver has to know state (running + not logging) should not
> >    return any data, and state (running + logging) should return whole
> >    snapshot first and dirty later. it also has to know qemu migration will
> >    not call GET_BUFFER in state (running + not logging), otherwise, it has
> >    to adjust its behavior.
> 
> This all just sounds like defining the protocol we expect with the
> interface.  For instance if we define a session as beginning when
> logging is enabled and ending when the device is stopped and the
> interface reports no more data is available, then we can state that any
> partial accumulation of data is incomplete relative to migration.  If
> userspace wants to initiate a new migration stream, they can simply
> toggle logging.  How the vendor driver provides the data during the
> session is not defined, but beginning the session with a snapshot
> followed by repeated iterations of dirtied data is certainly a valid
> approach.
> 
> > 2. vendor driver cannot ensure userspace get all the data it intends to
> > save in pre-copy phase.
> >   e.g. in stop-and-copy phase, vendor driver has to first check and send
> >   data in previous phase.
> 
> First, I don't think the device has control of when QEMU switches from
> pre-copy to stop-and-copy, the protocol needs to support that
> > transition at any point.  However, it seems a simple data-available
> counter provides an indication of when it might be optimal to make such
> a transition.  If a vendor driver follows a scheme as above, the
> available data counter would indicate a large value, the entire initial
> snapshot of the device.  As the migration continues and pages are
> dirtied, the device would reach a steady state amount of data
> available, depending on the guest activity.  This could indicate to the
> user to stop the device.  The migration stream would not be considered
> completed until the available data counter reaches zero while the
> device is in the stopped|logging state.
> 
> > 3. if all the sequence is tightly bound to live migration, can we remove the
> > logging state? what about adding two states migrate-in and migrate-out?
> > so there are four states: running, stopped, migrate-in, migrate-out.
> >    migrate-out is for source side when migration starts. together with
> >    state running and stopped, it can substitute state logging.
> >    migrate-in is for target side.
> 
> In fact, Kirti's implementation specifies a data direction, but I think
> we still need logging to indicate sessions.  I'd also assume that
> logging implies some overhead for the vendor driver.
>
OK. If you prefer logging, I'm OK with it. I just find migrate-in and
migrate-out more universal against hardware requirement changes.

> > On Tue, Mar 12, 2019 at 10:57:47AM +0800, Zhao Yan wrote:
> > > hi Alex
> > > thanks for your reply.
> > > 
> > > So, if we choose migration data to be userspace opaque, do you think below
> > > sequence is the right behavior for vendor driver to follow:
> > > 
> > > 1. initially LOGGING state is not set. If userspace calls GET_BUFFER to
> > > vendor driver,  vendor driver should reject and return 0.
> 
> What would this state mean otherwise?  If we're not logging then it
> should not be expected that we can construct dirtied data from a
> previous read of the state before logging was enabled (it would be
> outside of the "session").  So at best this is an incomplete segment of
> the initial snapshot of the device, but that presumes how the vendor
> driver constructs the data.  I wouldn't necessarily mandate the vendor
> driver reject it, but I think we should consider it undefined and
> vendor specific relative to the migration interface.
> 
> > > 2. then the LOGGING state is set; if userspace calls GET_BUFFER,
> > >    a. the vendor driver should first query a whole snapshot of device
> > >    memory (let's use this term to represent the device's standalone
> > >    memory for now),
> > >    b. the vendor driver returns a chunk of the data just queried to
> > >    userspace, while recording the current position in the data.
> > >    c. once the vendor driver finds all the data just queried has
> > >    finished transmitting to userspace, it queries only dirty data.
> > >    d. the vendor driver returns a chunk of the data just queried (this
> > >    time dirty data) to userspace while recording the current position.
> > >    e. if all data has been transmitted to userspace and GET_BUFFERs
> > >    still come, the vendor driver starts another round of dirty query.
> 
> This is a valid vendor driver approach, but it's outside the scope of
> the interface definition.  A vendor driver could also decide to not
> provide any data until both stopped and logging are set and then
> provide a fixed, final snapshot.  The interface supports either
> approach by defining the protocol to interact with it.
> 
> > > 3. if the LOGGING state is then unset, and userspace calls GET_BUFFER,
> > >    a. if the vendor driver finds there is previously untransmitted
> > >    data, it returns it until all of it has been transmitted.
> > >    b. the vendor driver then queries dirty data again and transmits it.
> > >    c. finally, the vendor driver queries device config data (which has
> > >    to be queried last and sent once) and transmits it.
> 
> This seems broken, the vendor driver is presuming the user intentions.
> If logging is unset, we return to bullet 1, reading data is undefined
> and vendor specific.  It's outside of the session.
> 
> > > for the first bullet, if the LOGGING state is set and migration then
> > > aborts, the vendor driver has to be able to detect that condition. So
> > > seemingly the vendor driver has to know more about QEMU's migration
> > > state, like whether migration was started and failed. Do you think
> > > that's acceptable?
> 
> If migration aborts, logging is cleared and the device continues
> operation.  If a new migration is started, the session is initiated by
> enabling logging.  Sound reasonable?  Thanks,
>

For the flow, I still have a question.
There are 2 approaches below, which one do you prefer?

Approach A, in precopy stage, the sequence is

(1)
.save_live_pending --> return whole snapshot size
.save_live_iterate --> save whole snapshot

(2)
.save_live_pending --> get dirty data, return dirty data size
.save_live_iterate --> save all dirty data

(3)
.save_live_pending --> get dirty data again, return dirty data size
.save_live_iterate --> save all dirty data


Approach B, in precopy stage, the sequence is
(1)
.save_live_pending --> return whole snapshot size
.save_live_iterate --> save part of snapshot

(2)
.save_live_pending --> return rest part of whole snapshot size +
                              current dirty data size
.save_live_iterate --> save part of snapshot 

(3) repeat (2) until whole snapshot saved.

(4) 
.save_live_pending --> get dirty data and return current dirty data size
.save_live_iterate --> save part of dirty data

(5)
.save_live_pending --> return rest part of dirty data size +
			     delta size of dirty data
.save_live_iterate --> save part of dirty data

(6)
repeat (5) until precopy stops
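To make the contrast concrete, the two approaches differ mainly in what .save_live_pending reports each cycle. A rough sketch (hypothetical names and fields, not the actual QEMU/VFIO interface):

```c
/* Hypothetical sketch only -- not the real QEMU VFIO migration code. */
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint64_t snapshot_total; /* size of the initial device-memory snapshot */
    uint64_t snapshot_sent;  /* snapshot bytes userspace has already read  */
    uint64_t dirty_pending;  /* dirty device-memory bytes accumulated      */
} MigState;

/* Approach A: report one phase at a time; assumes each phase is fully
 * drained by .save_live_iterate before the next pending query. */
static uint64_t pending_approach_a(const MigState *s)
{
    if (s->snapshot_sent < s->snapshot_total) {
        return s->snapshot_total - s->snapshot_sent;
    }
    return s->dirty_pending;
}

/* Approach B: always report snapshot remainder plus accumulated dirty
 * data, so a partially drained previous cycle is tolerated. */
static uint64_t pending_approach_b(const MigState *s)
{
    return (s->snapshot_total - s->snapshot_sent) + s->dirty_pending;
}
```

Under approach A the driver assumes each pending report was fully drained before the next query, while approach B keeps reporting the unread remainder plus new dirty data.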


> Alex
> _______________________________________________
> intel-gvt-dev mailing list
> intel-gvt-dev@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gvt-dev
Alex Williamson March 14, 2019, 10:44 p.m. UTC | #45
On Wed, 13 Mar 2019 21:12:22 -0400
Zhao Yan <yan.y.zhao@intel.com> wrote:

> On Thu, Mar 14, 2019 at 03:14:54AM +0800, Alex Williamson wrote:
> > On Tue, 12 Mar 2019 21:13:01 -0400
> > Zhao Yan <yan.y.zhao@intel.com> wrote:
> >   
> > > hi Alex
> > > Any comments to the sequence below?
> > > 
> > > Actually we have some concerns and suggestions about userspace-opaque migration
> > > data.
> > > 
> > > 1. if data is opaque to userspace, kernel interface must be tightly bound to
> > > migration. 
> > >    e.g. vendor driver has to know state (running + not logging) should not
> > >    return any data, and state (running + logging) should return whole
> > >    snapshot first and dirty later. it also has to know qemu migration will
> > >    not call GET_BUFFER in state (running + not logging), otherwise, it has
> > >    to adjust its behavior.  
> > 
> > This all just sounds like defining the protocol we expect with the
> > interface.  For instance if we define a session as beginning when
> > logging is enabled and ending when the device is stopped and the
> > interface reports no more data is available, then we can state that any
> > partial accumulation of data is incomplete relative to migration.  If
> > userspace wants to initiate a new migration stream, they can simply
> > toggle logging.  How the vendor driver provides the data during the
> > session is not defined, but beginning the session with a snapshot
> > followed by repeated iterations of dirtied data is certainly a valid
> > approach.
> >   
> > > 2. vendor driver cannot ensure userspace get all the data it intends to
> > > save in pre-copy phase.
> > >   e.g. in stop-and-copy phase, vendor driver has to first check and send
> > >   data in previous phase.  
> > 
> > First, I don't think the device has control of when QEMU switches from
> > pre-copy to stop-and-copy, the protocol needs to support that
> > transition at any point.  However, it seems a simple data available
> > counter provides an indication of when it might be optimal to make such
> > a transition.  If a vendor driver follows a scheme as above, the
> > available data counter would indicate a large value, the entire initial
> > snapshot of the device.  As the migration continues and pages are
> > dirtied, the device would reach a steady state amount of data
> > available, depending on the guest activity.  This could indicate to the
> > user to stop the device.  The migration stream would not be considered
> > completed until the available data counter reaches zero while the
> > device is in the stopped|logging state.
> >   
> > > 3. if all the sequence is tightly bound to live migration, can we remove the
> > > logging state? what about adding two states migrate-in and migrate-out?
> > > so there are four states: running, stopped, migrate-in, migrate-out.
> > >    migrate-out is for source side when migration starts. together with
> > >    state running and stopped, it can substitute state logging.
> > >    migrate-in is for target side.  
> > 
> > In fact, Kirti's implementation specifies a data direction, but I think
> > we still need logging to indicate sessions.  I'd also assume that
> > logging implies some overhead for the vendor driver.
> >  
> ok. If you prefer logging, I'm ok with it. just found migrate-in and
> migrate-out are more universal against hardware requirement changes.
> 
> > > On Tue, Mar 12, 2019 at 10:57:47AM +0800, Zhao Yan wrote:  
> > > > hi Alex
> > > > thanks for your reply.
> > > > 
> > > > So, if we choose migration data to be userspace opaque, do you think below
> > > > sequence is the right behavior for vendor driver to follow:
> > > > 
> > > > 1. initially LOGGING state is not set. If userspace calls GET_BUFFER to
> > > > vendor driver,  vendor driver should reject and return 0.  
> > 
> > What would this state mean otherwise?  If we're not logging then it
> > should not be expected that we can construct dirtied data from a
> > previous read of the state before logging was enabled (it would be
> > outside of the "session").  So at best this is an incomplete segment of
> > the initial snapshot of the device, but that presumes how the vendor
> > driver constructs the data.  I wouldn't necessarily mandate the vendor
> > driver reject it, but I think we should consider it undefined and
> > vendor specific relative to the migration interface.
> >   
> > > > 2. then LOGGING state is set, if userspace calls GET_BUFFER to vendor
> > > > driver,
> > > >    a. vendor driver should first query a whole snapshot of device memory
> > > >    (let's use this term to represent device's standalone memory for now),
> > > >    b. vendor driver returns a chunk of data just queried to userspace,
> > > >    while recording current pos in data.
> > > >    c. vendor driver finds all data just queried is finished transmitting to
> > > >    userspace, and queries only dirty data in device memory now.
> > > >    d. vendor driver returns a chunk of data just queried (this time is dirty
> > > >    data) to userspace while recording current pos in data.
> > > >    e. if all data is transmitted to userspace and still GET_BUFFERs come from
> > > >    userspace, vendor driver starts another round of dirty data query.  
> > 
> > This is a valid vendor driver approach, but it's outside the scope of
> > the interface definition.  A vendor driver could also decide to not
> > provide any data until both stopped and logging are set and then
> > provide a fixed, final snapshot.  The interface supports either
> > approach by defining the protocol to interact with it.
> >   
> > > > 3. if LOGGING state is unset then, and userspace calls GET_BUFFER to vendor
> > > > driver,
> > > >    a. if vendor driver finds there's previously untransmitted data, returns
> > > >    them until all transmitted.
> > > >    b. vendor driver then queries dirty data again and transmits them.
> > > >    c. at last, vendor driver queries device config data (which has to be
> > > >    queried at last and sent once) and transmits them.  
> > 
> > This seems broken, the vendor driver is presuming the user intentions.
> > If logging is unset, we return to bullet 1, reading data is undefined
> > and vendor specific.  It's outside of the session.
> >   
> > > > for the first bullet, if LOGGING state is first set and migration aborts
> > > > then, vendor driver has to be able to detect that condition. So seemingly,
> > > > vendor driver has to know more of qemu's migration state, like migration
> > > > called and failed. Do you think that's acceptable?
> > 
> > If migration aborts, logging is cleared and the device continues
> > operation.  If a new migration is started, the session is initiated by
> > enabling logging.  Sound reasonable?  Thanks,
> >  
> 
> For the flow, I still have a question.
> There are 2 approaches below, which one do you prefer?
> 
> Approach A, in precopy stage, the sequence is
> 
> (1)
> .save_live_pending --> return whole snapshot size
> .save_live_iterate --> save whole snapshot
> 
> (2)
> .save_live_pending --> get dirty data, return dirty data size
> .save_live_iterate --> save all dirty data
> 
> (3)
> .save_live_pending --> get dirty data again, return dirty data size
> .save_live_iterate --> save all dirty data
> 
> 
> Approach B, in precopy stage, the sequence is
> (1)
> .save_live_pending --> return whole snapshot size
> .save_live_iterate --> save part of snapshot
> 
> (2)
> .save_live_pending --> return rest part of whole snapshot size +
>                               current dirty data size
> .save_live_iterate --> save part of snapshot 
> 
> (3) repeat (2) until whole snapshot saved.
> 
> (4) 
> .save_live_pending --> get dirty data and return current dirty data size
> .save_live_iterate --> save part of dirty data
> 
> (5)
> .save_live_pending --> return rest part of dirty data size +
> 			     delta size of dirty data
> .save_live_iterate --> save part of dirty data
> 
> (6)
> repeat (5) until precopy stops

I don't really understand the question here.  If the vendor driver's
approach is to send a full snapshot followed by iterations of dirty
data, then when the user enables logging and reads the counter for
available data it should report the (size of the snapshot).  The next
time the user reads the counter, it should report
(size of the snapshot) - (what the user has already read) + (size of
the dirty data since the snapshot).  As the user continues to read past
the snapshot data, the available data counter transitions to reporting
only the size of the remaining dirty data, which is monotonically
increasing.  I guess this would be more similar to your approach B,
which seems to suggest that the interface needs to continue providing
data regardless of whether the user fully exhausted the available data
from the previous cycle.  Thanks,

Alex
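The counter arithmetic described in this message can be modeled roughly as follows; names are illustrative only, not the vendor-driver interface:

```c
/* Illustrative model of the available-data counter; not driver code. */
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint64_t snap_left; /* unread bytes of the initial snapshot        */
    uint64_t dirty;     /* dirty bytes logged since the snapshot began */
} Counter;

/* What a read of the pending-data counter would report. */
static uint64_t available(const Counter *c)
{
    return c->snap_left + c->dirty;
}

/* Userspace reads n bytes: snapshot data drains first, then dirty data. */
static void consume(Counter *c, uint64_t n)
{
    uint64_t from_snap = n < c->snap_left ? n : c->snap_left;

    c->snap_left -= from_snap;
    n -= from_snap;
    c->dirty -= n < c->dirty ? n : c->dirty;
}
```

Reading past the snapshot drains snap_left to zero, after which available() reports only the outstanding dirty data.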
Yan Zhao March 14, 2019, 11:05 p.m. UTC | #46
On Fri, Mar 15, 2019 at 06:44:58AM +0800, Alex Williamson wrote:
> On Wed, 13 Mar 2019 21:12:22 -0400
> Zhao Yan <yan.y.zhao@intel.com> wrote:
> 
> > On Thu, Mar 14, 2019 at 03:14:54AM +0800, Alex Williamson wrote:
> > > On Tue, 12 Mar 2019 21:13:01 -0400
> > > Zhao Yan <yan.y.zhao@intel.com> wrote:
> > >   
> > > > hi Alex
> > > > Any comments to the sequence below?
> > > > 
> > > > Actually we have some concerns and suggestions about userspace-opaque migration
> > > > data.
> > > > 
> > > > 1. if data is opaque to userspace, kernel interface must be tightly bound to
> > > > migration. 
> > > >    e.g. vendor driver has to know state (running + not logging) should not
> > > >    return any data, and state (running + logging) should return whole
> > > >    snapshot first and dirty later. it also has to know qemu migration will
> > > >    not call GET_BUFFER in state (running + not logging), otherwise, it has
> > > >    to adjust its behavior.  
> > > 
> > > This all just sounds like defining the protocol we expect with the
> > > interface.  For instance if we define a session as beginning when
> > > logging is enabled and ending when the device is stopped and the
> > > interface reports no more data is available, then we can state that any
> > > partial accumulation of data is incomplete relative to migration.  If
> > > userspace wants to initiate a new migration stream, they can simply
> > > toggle logging.  How the vendor driver provides the data during the
> > > session is not defined, but beginning the session with a snapshot
> > > followed by repeated iterations of dirtied data is certainly a valid
> > > approach.
> > >   
> > > > 2. vendor driver cannot ensure userspace get all the data it intends to
> > > > save in pre-copy phase.
> > > >   e.g. in stop-and-copy phase, vendor driver has to first check and send
> > > >   data in previous phase.  
> > > 
> > > First, I don't think the device has control of when QEMU switches from
> > > pre-copy to stop-and-copy, the protocol needs to support that
> > > transition at any point.  However, it seems a simple data available
> > > counter provides an indication of when it might be optimal to make such
> > > a transition.  If a vendor driver follows a scheme as above, the
> > > available data counter would indicate a large value, the entire initial
> > > snapshot of the device.  As the migration continues and pages are
> > > dirtied, the device would reach a steady state amount of data
> > > available, depending on the guest activity.  This could indicate to the
> > > user to stop the device.  The migration stream would not be considered
> > > completed until the available data counter reaches zero while the
> > > device is in the stopped|logging state.
> > >   
> > > > 3. if all the sequence is tightly bound to live migration, can we remove the
> > > > logging state? what about adding two states migrate-in and migrate-out?
> > > > so there are four states: running, stopped, migrate-in, migrate-out.
> > > >    migrate-out is for source side when migration starts. together with
> > > >    state running and stopped, it can substitute state logging.
> > > >    migrate-in is for target side.  
> > > 
> > > In fact, Kirti's implementation specifies a data direction, but I think
> > > we still need logging to indicate sessions.  I'd also assume that
> > > logging implies some overhead for the vendor driver.
> > >  
> > ok. If you prefer logging, I'm ok with it. just found migrate-in and
> > migrate-out are more universal against hardware requirement changes.
> > 
> > > > On Tue, Mar 12, 2019 at 10:57:47AM +0800, Zhao Yan wrote:  
> > > > > hi Alex
> > > > > thanks for your reply.
> > > > > 
> > > > > So, if we choose migration data to be userspace opaque, do you think below
> > > > > sequence is the right behavior for vendor driver to follow:
> > > > > 
> > > > > 1. initially LOGGING state is not set. If userspace calls GET_BUFFER to
> > > > > vendor driver,  vendor driver should reject and return 0.  
> > > 
> > > What would this state mean otherwise?  If we're not logging then it
> > > should not be expected that we can construct dirtied data from a
> > > previous read of the state before logging was enabled (it would be
> > > outside of the "session").  So at best this is an incomplete segment of
> > > the initial snapshot of the device, but that presumes how the vendor
> > > driver constructs the data.  I wouldn't necessarily mandate the vendor
> > > driver reject it, but I think we should consider it undefined and
> > > vendor specific relative to the migration interface.
> > >   
> > > > > 2. then LOGGING state is set, if userspace calls GET_BUFFER to vendor
> > > > > driver,
> > > > >    a. vendor driver should first query a whole snapshot of device memory
> > > > >    (let's use this term to represent device's standalone memory for now),
> > > > >    b. vendor driver returns a chunk of data just queried to userspace,
> > > > >    while recording current pos in data.
> > > > >    c. vendor driver finds all data just queried is finished transmitting to
> > > > >    userspace, and queries only dirty data in device memory now.
> > > > >    d. vendor driver returns a chunk of data just queried (this time is dirty
> > > > >    data) to userspace while recording current pos in data.
> > > > >    e. if all data is transmitted to userspace and still GET_BUFFERs come from
> > > > >    userspace, vendor driver starts another round of dirty data query.  
> > > 
> > > This is a valid vendor driver approach, but it's outside the scope of
> > > the interface definition.  A vendor driver could also decide to not
> > > provide any data until both stopped and logging are set and then
> > > provide a fixed, final snapshot.  The interface supports either
> > > approach by defining the protocol to interact with it.
> > >   
> > > > > 3. if LOGGING state is unset then, and userspace calls GET_BUFFER to vendor
> > > > > driver,
> > > > >    a. if vendor driver finds there's previously untransmitted data, returns
> > > > >    them until all transmitted.
> > > > >    b. vendor driver then queries dirty data again and transmits them.
> > > > >    c. at last, vendor driver queries device config data (which has to be
> > > > >    queried at last and sent once) and transmits them.  
> > > 
> > > This seems broken, the vendor driver is presuming the user intentions.
> > > If logging is unset, we return to bullet 1, reading data is undefined
> > > and vendor specific.  It's outside of the session.
> > >   
> > > > > for the first bullet, if LOGGING state is first set and migration aborts
> > > > > then, vendor driver has to be able to detect that condition. So seemingly,
> > > > > vendor driver has to know more of qemu's migration state, like migration
> > > > > called and failed. Do you think that's acceptable?
> > > 
> > > If migration aborts, logging is cleared and the device continues
> > > operation.  If a new migration is started, the session is initiated by
> > > enabling logging.  Sound reasonable?  Thanks,
> > >  
> > 
> > For the flow, I still have a question.
> > There are 2 approaches below, which one do you prefer?
> > 
> > Approach A, in precopy stage, the sequence is
> > 
> > (1)
> > .save_live_pending --> return whole snapshot size
> > .save_live_iterate --> save whole snapshot
> > 
> > (2)
> > .save_live_pending --> get dirty data, return dirty data size
> > .save_live_iterate --> save all dirty data
> > 
> > (3)
> > .save_live_pending --> get dirty data again, return dirty data size
> > .save_live_iterate --> save all dirty data
> > 
> > 
> > Approach B, in precopy stage, the sequence is
> > (1)
> > .save_live_pending --> return whole snapshot size
> > .save_live_iterate --> save part of snapshot
> > 
> > (2)
> > .save_live_pending --> return rest part of whole snapshot size +
> >                               current dirty data size
> > .save_live_iterate --> save part of snapshot 
> > 
> > (3) repeat (2) until whole snapshot saved.
> > 
> > (4) 
> > .save_live_pending --> get dirty data and return current dirty data size
> > .save_live_iterate --> save part of dirty data
> > 
> > (5)
> > .save_live_pending --> return rest part of dirty data size +
> > 			     delta size of dirty data
> > .save_live_iterate --> save part of dirty data
> > 
> > (6)
> > repeat (5) until precopy stops
> 
> I don't really understand the question here.  If the vendor driver's
> approach is to send a full snapshot followed by iterations of dirty
> data, then when the user enables logging and reads the counter for
> available data it should report the (size of the snapshot).  The next
> time the user reads the counter, it should report
> (size of the snapshot) - (what the user has already read) + (size of
> the dirty data since the snapshot).  As the user continues to read past
> the snapshot data, the available data counter transitions to reporting
> only the size of the remaining dirty data, which is monotonically
> increasing.  I guess this would be more similar to your approach B,
> which seems to suggest that the interface needs to continue providing
> data regardless of whether the user fully exhausted the available data
> from the previous cycle.  Thanks,
>

Right. But regarding the VFIO migration code in QEMU, rather than saving
one chunk each time, do you think it is better to exhaust all data reported
by .save_live_pending in each .save_live_iterate callback? (Even though the
vendor driver will handle the case where userspace cannot exhaust all data,
the VFIO code in QEMU can still try to save as much available data as it
can each time.)
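As a rough sketch, "exhaust all reported data" in .save_live_iterate could look like this (illustrative only; real code would write each chunk out through QEMU's migration stream):

```c
/* Illustrative sketch; not the actual QEMU VFIO .save_live_iterate. */
#include <assert.h>
#include <stdint.h>

#define CHUNK_SIZE 4096ULL /* stand-in for the device data region size */

/* Try to drain everything .save_live_pending reported, one region-sized
 * chunk at a time.  'avail' models the driver supplying less data than
 * it reported; the function then saves what it can and returns early. */
static uint64_t save_live_iterate(uint64_t avail, uint64_t pending)
{
    uint64_t saved = 0;

    while (saved < pending) {
        uint64_t n = pending - saved < CHUNK_SIZE ? pending - saved
                                                  : CHUNK_SIZE;
        if (saved + n > avail) {
            n = avail > saved ? avail - saved : 0;
        }
        if (n == 0) {
            break;             /* driver ran short; retry next cycle */
        }
        /* qemu_put_buffer(f, chunk, n) would go here in real code */
        saved += n;
    }
    return saved;              /* may be less than pending */
}
```

The return value being less than pending is the case described above: the driver reported more than it could supply, and the remainder waits for the next cycle.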

> Alex
Alex Williamson March 15, 2019, 2:24 a.m. UTC | #47
On Thu, 14 Mar 2019 19:05:06 -0400
Zhao Yan <yan.y.zhao@intel.com> wrote:

> On Fri, Mar 15, 2019 at 06:44:58AM +0800, Alex Williamson wrote:
> > On Wed, 13 Mar 2019 21:12:22 -0400
> > Zhao Yan <yan.y.zhao@intel.com> wrote:
> >   
> > > On Thu, Mar 14, 2019 at 03:14:54AM +0800, Alex Williamson wrote:  
> > > > On Tue, 12 Mar 2019 21:13:01 -0400
> > > > Zhao Yan <yan.y.zhao@intel.com> wrote:
> > > >     
> > > > > hi Alex
> > > > > Any comments to the sequence below?
> > > > > 
> > > > > Actually we have some concerns and suggestions about userspace-opaque migration
> > > > > data.
> > > > > 
> > > > > 1. if data is opaque to userspace, kernel interface must be tightly bound to
> > > > > migration. 
> > > > >    e.g. vendor driver has to know state (running + not logging) should not
> > > > >    return any data, and state (running + logging) should return whole
> > > > >    snapshot first and dirty later. it also has to know qemu migration will
> > > > >    not call GET_BUFFER in state (running + not logging), otherwise, it has
> > > > >    to adjust its behavior.    
> > > > 
> > > > This all just sounds like defining the protocol we expect with the
> > > > interface.  For instance if we define a session as beginning when
> > > > logging is enabled and ending when the device is stopped and the
> > > > interface reports no more data is available, then we can state that any
> > > > partial accumulation of data is incomplete relative to migration.  If
> > > > userspace wants to initiate a new migration stream, they can simply
> > > > toggle logging.  How the vendor driver provides the data during the
> > > > session is not defined, but beginning the session with a snapshot
> > > > followed by repeated iterations of dirtied data is certainly a valid
> > > > approach.
> > > >     
> > > > > 2. vendor driver cannot ensure userspace get all the data it intends to
> > > > > save in pre-copy phase.
> > > > >   e.g. in stop-and-copy phase, vendor driver has to first check and send
> > > > >   data in previous phase.    
> > > > 
> > > > First, I don't think the device has control of when QEMU switches from
> > > > pre-copy to stop-and-copy, the protocol needs to support that
> > > > transition at any point.  However, it seems a simple data available
> > > > counter provides an indication of when it might be optimal to make such
> > > > a transition.  If a vendor driver follows a scheme as above, the
> > > > available data counter would indicate a large value, the entire initial
> > > > snapshot of the device.  As the migration continues and pages are
> > > > dirtied, the device would reach a steady state amount of data
> > > > available, depending on the guest activity.  This could indicate to the
> > > > user to stop the device.  The migration stream would not be considered
> > > > completed until the available data counter reaches zero while the
> > > > device is in the stopped|logging state.
> > > >     
> > > > > 3. if all the sequence is tightly bound to live migration, can we remove the
> > > > > logging state? what about adding two states migrate-in and migrate-out?
> > > > > so there are four states: running, stopped, migrate-in, migrate-out.
> > > > >    migrate-out is for source side when migration starts. together with
> > > > >    state running and stopped, it can substitute state logging.
> > > > >    migrate-in is for target side.    
> > > > 
> > > > In fact, Kirti's implementation specifies a data direction, but I think
> > > > we still need logging to indicate sessions.  I'd also assume that
> > > > logging implies some overhead for the vendor driver.
> > > >    
> > > ok. If you prefer logging, I'm ok with it. just found migrate-in and
> > > migrate-out are more universal against hardware requirement changes.
> > >   
> > > > > On Tue, Mar 12, 2019 at 10:57:47AM +0800, Zhao Yan wrote:    
> > > > > > hi Alex
> > > > > > thanks for your reply.
> > > > > > 
> > > > > > So, if we choose migration data to be userspace opaque, do you think below
> > > > > > sequence is the right behavior for vendor driver to follow:
> > > > > > 
> > > > > > 1. initially LOGGING state is not set. If userspace calls GET_BUFFER to
> > > > > > vendor driver,  vendor driver should reject and return 0.    
> > > > 
> > > > What would this state mean otherwise?  If we're not logging then it
> > > > should not be expected that we can construct dirtied data from a
> > > > previous read of the state before logging was enabled (it would be
> > > > outside of the "session").  So at best this is an incomplete segment of
> > > > the initial snapshot of the device, but that presumes how the vendor
> > > > driver constructs the data.  I wouldn't necessarily mandate the vendor
> > > > driver reject it, but I think we should consider it undefined and
> > > > vendor specific relative to the migration interface.
> > > >     
> > > > > > 2. then LOGGING state is set, if userspace calls GET_BUFFER to vendor
> > > > > > driver,
> > > > > >    a. vendor driver should first query a whole snapshot of device memory
> > > > > >    (let's use this term to represent device's standalone memory for now),
> > > > > >    b. vendor driver returns a chunk of data just queried to userspace,
> > > > > >    while recording current pos in data.
> > > > > >    c. vendor driver finds all data just queried is finished transmitting to
> > > > > >    userspace, and queries only dirty data in device memory now.
> > > > > >    d. vendor driver returns a chunk of data just queried (this time is dirty
> > > > > >    data) to userspace while recording current pos in data.
> > > > > >    e. if all data is transmitted to userspace and still GET_BUFFERs come from
> > > > > >    userspace, vendor driver starts another round of dirty data query.    
> > > > 
> > > > This is a valid vendor driver approach, but it's outside the scope of
> > > > the interface definition.  A vendor driver could also decide to not
> > > > provide any data until both stopped and logging are set and then
> > > > provide a fixed, final snapshot.  The interface supports either
> > > > approach by defining the protocol to interact with it.
> > > >     
> > > > > > 3. if LOGGING state is unset then, and userspace calls GET_BUFFER to vendor
> > > > > > driver,
> > > > > >    a. if vendor driver finds there's previously untransmitted data, returns
> > > > > >    them until all transmitted.
> > > > > >    b. vendor driver then queries dirty data again and transmits them.
> > > > > >    c. at last, vendor driver queries device config data (which has to be
> > > > > >    queried at last and sent once) and transmits them.    
> > > > 
> > > > This seems broken, the vendor driver is presuming the user intentions.
> > > > If logging is unset, we return to bullet 1, reading data is undefined
> > > > and vendor specific.  It's outside of the session.
> > > >     
> > > > > > for the first bullet, if LOGGING state is first set and migration aborts
> > > > > > then, vendor driver has to be able to detect that condition. So seemingly,
> > > > > > vendor driver has to know more of qemu's migration state, like migration
> > > > > > called and failed. Do you think that's acceptable?
> > > > 
> > > > If migration aborts, logging is cleared and the device continues
> > > > operation.  If a new migration is started, the session is initiated by
> > > > enabling logging.  Sound reasonable?  Thanks,
> > > >    
> > > 
> > > For the flow, I still have a question.
> > > There are 2 approaches below, which one do you prefer?
> > > 
> > > Approach A, in precopy stage, the sequence is
> > > 
> > > (1)
> > > .save_live_pending --> return whole snapshot size
> > > .save_live_iterate --> save whole snapshot
> > > 
> > > (2)
> > > .save_live_pending --> get dirty data, return dirty data size
> > > .save_live_iterate --> save all dirty data
> > > 
> > > (3)
> > > .save_live_pending --> get dirty data again, return dirty data size
> > > .save_live_iterate --> save all dirty data
> > > 
> > > 
> > > Approach B, in precopy stage, the sequence is
> > > (1)
> > > .save_live_pending --> return whole snapshot size
> > > .save_live_iterate --> save part of snapshot
> > > 
> > > (2)
> > > .save_live_pending --> return rest part of whole snapshot size +
> > >                               current dirty data size
> > > .save_live_iterate --> save part of snapshot 
> > > 
> > > (3) repeat (2) until whole snapshot saved.
> > > 
> > > (4) 
> > > .save_live_pending --> get dirty data and return current dirty data size
> > > .save_live_iterate --> save part of dirty data
> > > 
> > > (5)
> > > .save_live_pending --> return rest part of dirty data size +
> > > 			     delta size of dirty data
> > > .save_live_iterate --> save part of dirty data
> > > 
> > > (6)
> > > repeat (5) until precopy stops  
> > 
> > I don't really understand the question here.  If the vendor driver's
> > approach is to send a full snapshot followed by iterations of dirty
> > data, then when the user enables logging and reads the counter for
> > available data it should report the (size of the snapshot).  The next
> > time the user reads the counter, it should report
> > (size of the snapshot) - (what the user has already read) + (size of
> > the dirty data since the snapshot).  As the user continues to read past
> > the snapshot data, the available data counter transitions to reporting
> > only the size of the remaining dirty data, which is monotonically
> > increasing.  I guess this would be more similar to your approach B,
> > which seems to suggest that the interface needs to continue providing
> > data regardless of whether the user fully exhausted the available data
> > from the previous cycle.  Thanks,
> >  
> 
> Right. But regarding the VFIO migration code in QEMU, rather than saving
> one chunk each time, do you think it is better to exhaust all data reported
> by .save_live_pending in each .save_live_iterate callback? (Even though the
> vendor driver will handle the case where userspace cannot exhaust all data,
> the VFIO code in QEMU can still try to save as much available data as it
> can each time.)

Don't you suspect that some devices might have state that's too large
to process in each iteration?  I expect we'll need to use heuristics on
data size or time spent on each iteration round such that some devices
might be able to fully process their pending data while others will
require multiple passes or make up the balance once we've entered stop
and copy.  Thanks,

Alex
Alex Williamson March 18, 2019, 3:09 a.m. UTC | #48
On Sun, 17 Mar 2019 22:51:27 -0400
Zhao Yan <yan.y.zhao@intel.com> wrote:

> On Fri, Mar 15, 2019 at 10:24:02AM +0800, Alex Williamson wrote:
> > On Thu, 14 Mar 2019 19:05:06 -0400
> > Zhao Yan <yan.y.zhao@intel.com> wrote:
> >   
> > > On Fri, Mar 15, 2019 at 06:44:58AM +0800, Alex Williamson wrote:  
> > > > On Wed, 13 Mar 2019 21:12:22 -0400
> > > > Zhao Yan <yan.y.zhao@intel.com> wrote:
> > > >     
> > > > > On Thu, Mar 14, 2019 at 03:14:54AM +0800, Alex Williamson wrote:    
> > > > > > On Tue, 12 Mar 2019 21:13:01 -0400
> > > > > > Zhao Yan <yan.y.zhao@intel.com> wrote:
> > > > > >       
> > > > > > > hi Alex
> > > > > > > Any comments to the sequence below?
> > > > > > > 
> > > > > > > Actually we have some concerns and suggestions about userspace-opaque migration
> > > > > > > data.
> > > > > > > 
> > > > > > > 1. if data is opaque to userspace, kernel interface must be tightly bound to
> > > > > > > migration. 
> > > > > > >    e.g. vendor driver has to know state (running + not logging) should not
> > > > > > >    return any data, and state (running + logging) should return whole
> > > > > > >    snapshot first and dirty later. it also has to know qemu migration will
> > > > > > >    not call GET_BUFFER in state (running + not logging), otherwise, it has
> > > > > > >    to adjust its behavior.      
> > > > > > 
> > > > > > This all just sounds like defining the protocol we expect with the
> > > > > > interface.  For instance if we define a session as beginning when
> > > > > > logging is enabled and ending when the device is stopped and the
> > > > > > interface reports no more data is available, then we can state that any
> > > > > > partial accumulation of data is incomplete relative to migration.  If
> > > > > > userspace wants to initiate a new migration stream, they can simply
> > > > > > toggle logging.  How the vendor driver provides the data during the
> > > > > > session is not defined, but beginning the session with a snapshot
> > > > > > followed by repeated iterations of dirtied data is certainly a valid
> > > > > > approach.
> > > > > >       
> > > > > > > 2. vendor driver cannot ensure userspace get all the data it intends to
> > > > > > > save in pre-copy phase.
> > > > > > >   e.g. in stop-and-copy phase, vendor driver has to first check and send
> > > > > > >   data in previous phase.      
> > > > > > 
> > > > > > First, I don't think the device has control of when QEMU switches from
> > > > > > pre-copy to stop-and-copy, the protocol needs to support that
> > > > > > transition at any point.  However, it seems a simply data available
> > > > > > counter provides an indication of when it might be optimal to make such
> > > > > > a transition.  If a vendor driver follows a scheme as above, the
> > > > > > available data counter would indicate a large value, the entire initial
> > > > > > snapshot of the device.  As the migration continues and pages are
> > > > > > dirtied, the device would reach a steady state amount of data
> > > > > > available, depending on the guest activity.  This could indicate to the
> > > > > > user to stop the device.  The migration stream would not be considered
> > > > > > completed until the available data counter reaches zero while the
> > > > > > device is in the stopped|logging state.
> > > > > >       
> > > > > > > 3. if all the sequence is tightly bound to live migration, can we remove the
> > > > > > > logging state? what about adding two states migrate-in and migrate-out?
> > > > > > > so there are four states: running, stopped, migrate-in, migrate-out.
> > > > > > >    migrate-out is for source side when migration starts. together with
> > > > > > >    state running and stopped, it can substitute state logging.
> > > > > > >    migrate-in is for target side.      
> > > > > > 
> > > > > > In fact, Kirti's implementation specifies a data direction, but I think
> > > > > > we still need logging to indicate sessions.  I'd also assume that
> > > > > > logging implies some overhead for the vendor driver.
> > > > > >      
> > > > > ok. If you prefer logging, I'm ok with it. just found migrate-in and
> > > > > migrate-out are more universal againt hardware requirement changes.
> > > > >     
> > > > > > > On Tue, Mar 12, 2019 at 10:57:47AM +0800, Zhao Yan wrote:      
> > > > > > > > hi Alex
> > > > > > > > thanks for your reply.
> > > > > > > > 
> > > > > > > > So, if we choose migration data to be userspace opaque, do you think below
> > > > > > > > sequence is the right behavior for vendor driver to follow:
> > > > > > > > 
> > > > > > > > 1. initially LOGGING state is not set. If userspace calls GET_BUFFER to
> > > > > > > > vendor driver,  vendor driver should reject and return 0.      
> > > > > > 
> > > > > > What would this state mean otherwise?  If we're not logging then it
> > > > > > should not be expected that we can construct dirtied data from a
> > > > > > previous read of the state before logging was enabled (it would be
> > > > > > outside of the "session").  So at best this is an incomplete segment of
> > > > > > the initial snapshot of the device, but that presumes how the vendor
> > > > > > driver constructs the data.  I wouldn't necessarily mandate the vendor
> > > > > > driver reject it, but I think we should consider it undefined and
> > > > > > vendor specific relative to the migration interface.
> > > > > >       
> > > > > > > > 2. then LOGGING state is set, if userspace calls GET_BUFFER to vendor
> > > > > > > > driver,
> > > > > > > >    a. vendor driver shoud first query a whole snapshot of device memory
> > > > > > > >    (let's use this term to represent device's standalone memory for now),
> > > > > > > >    b. vendor driver returns a chunk of data just queried to userspace,
> > > > > > > >    while recording current pos in data.
> > > > > > > >    c. vendor driver finds all data just queried is finished transmitting to
> > > > > > > >    userspace, and queries only dirty data in device memory now.
> > > > > > > >    d. vendor driver returns a chunk of data just quered (this time is dirty
> > > > > > > >    data )to userspace while recording current pos in data
> > > > > > > >    e. if all data is transmited to usespace and still GET_BUFFERs come from
> > > > > > > >    userspace, vendor driver starts another round of dirty data query.      
> > > > > > 
> > > > > > This is a valid vendor driver approach, but it's outside the scope of
> > > > > > the interface definition.  A vendor driver could also decide to not
> > > > > > provide any data until both stopped and logging are set and then
> > > > > > provide a fixed, final snapshot.  The interface supports either
> > > > > > approach by defining the protocol to interact with it.
> > > > > >       
> > > > > > > > 3. if LOGGING state is unset then, and userpace calls GET_BUFFER to vendor
> > > > > > > > driver,
> > > > > > > >    a. if vendor driver finds there's previously untransmitted data, returns
> > > > > > > >    them until all transmitted.
> > > > > > > >    b. vendor driver then queries dirty data again and transmits them.
> > > > > > > >    c. at last, vendor driver queris device config data (which has to be
> > > > > > > >    queried at last and sent once) and transmits them.      
> > > > > > 
> > > > > > This seems broken, the vendor driver is presuming the user intentions.
> > > > > > If logging is unset, we return to bullet 1, reading data is undefined
> > > > > > and vendor specific.  It's outside of the session.
> > > > > >       
> > > > > > > > for the 1 bullet, if LOGGING state is firstly set and migration aborts
> > > > > > > > then,  vendor driver has to be able to detect that condition. so seemingly,
> > > > > > > > vendor driver has to know more qemu's migration state, like migration
> > > > > > > > called and failed. Do you think that's acceptable?      
> > > > > > 
> > > > > > If migration aborts, logging is cleared and the device continues
> > > > > > operation.  If a new migration is started, the session is initiated by
> > > > > > enabling logging.  Sound reasonable?  Thanks,
> > > > > >      
> > > > > 
> > > > > For the flow, I still have a question.
> > > > > There are 2 approaches below, which one do you prefer?
> > > > > 
> > > > > Approach A, in precopy stage, the sequence is
> > > > > 
> > > > > (1)
> > > > > .save_live_pending --> return whole snapshot size
> > > > > .save_live_iterate --> save whole snapshot
> > > > > 
> > > > > (2)
> > > > > .save_live_pending --> get dirty data, return dirty data size
> > > > > .save_live_iterate --> save all dirty data
> > > > > 
> > > > > (3)
> > > > > .save_live_pending --> get dirty data again, return dirty data size
> > > > > .save_live_iterate --> save all dirty data
> > > > > 
> > > > > 
> > > > > Approach B, in precopy stage, the sequence is
> > > > > (1)
> > > > > .save_live_pending --> return whole snapshot size
> > > > > .save_live_iterate --> save part of snapshot
> > > > > 
> > > > > (2)
> > > > > .save_live_pending --> return rest part of whole snapshot size +
> > > > >                               current dirty data size
> > > > > .save_live_iterate --> save part of snapshot 
> > > > > 
> > > > > (3) repeat (2) until whole snapshot saved.
> > > > > 
> > > > > (4) 
> > > > > .save_live_pending --> get diryt data and return current dirty data size
> > > > > .save_live_iterate --> save part of dirty data
> > > > > 
> > > > > (5)
> > > > > .save_live_pending --> return reset part of dirty data size +
> > > > > 			     delta size of dirty data
> > > > > .save_live_iterate --> save part of dirty data
> > > > > 
> > > > > (6)
> > > > > repeat (5) until precopy stops    
> > > > 
> > > > I don't really understand the question here.  If the vendor driver's
> > > > approach is to send a full snapshot followed by iterations of dirty
> > > > data, then when the user enables logging and reads the counter for
> > > > available data it should report the (size of the snapshot).  The next
> > > > time the user reads the counter, it should report the size of the
> > > > (size of the snapshot) - (what the user has already read) + (size of
> > > > the dirty data since the snapshot).  As the user continues to read past
> > > > the snapshot data, the available data counter transitions to reporting
> > > > only the size of the remaining dirty data, which is monotonically
> > > > increasing.  I guess this would be more similar to your approach B,
> > > > which seems to suggest that the interface needs to continue providing
> > > > data regardless of whether the user fully exhausted the available data
> > > > from the previous cycle.  Thanks,
> > > >    
> > > 
> > > Right. But regarding to the VFIO migration code in QEMU, rather than save
> > > one chunk each time, do you think it is better to exhaust all reported data
> > > from .save_live_pending in each .save_live_iterate callback? (eventhough 
> > > vendor driver will handle the case that if userspace cannot exhaust
> > > all data, VFIO QEMU can still try to save as many available data as it can
> > > each time).  
> > 
> > Don't you suspect that some devices might have state that's too large
> > to process in each iteration?  I expect we'll need to use heuristics on
> > data size or time spent on each iteration round such that some devices
> > might be able to fully process their pending data while others will
> > require multiple passes or make up the balance once we've entered stop
> > and copy.  Thanks,
> >  
> hi Alex
> What about looping and draining the pending data in each iteration? :)

How is this question different than your previous question?  Thanks,

Alex
Yan Zhao March 18, 2019, 3:27 a.m. UTC | #49
On Mon, Mar 18, 2019 at 11:09:04AM +0800, Alex Williamson wrote:
> On Sun, 17 Mar 2019 22:51:27 -0400
> Zhao Yan <yan.y.zhao@intel.com> wrote:
> 
> > On Fri, Mar 15, 2019 at 10:24:02AM +0800, Alex Williamson wrote:
> > > On Thu, 14 Mar 2019 19:05:06 -0400
> > > Zhao Yan <yan.y.zhao@intel.com> wrote:
> > >   
> > > > On Fri, Mar 15, 2019 at 06:44:58AM +0800, Alex Williamson wrote:  
> > > > > On Wed, 13 Mar 2019 21:12:22 -0400
> > > > > Zhao Yan <yan.y.zhao@intel.com> wrote:
> > > > >     
> > > > > > On Thu, Mar 14, 2019 at 03:14:54AM +0800, Alex Williamson wrote:    
> > > > > > > On Tue, 12 Mar 2019 21:13:01 -0400
> > > > > > > Zhao Yan <yan.y.zhao@intel.com> wrote:
> > > > > > >       
> > > > > > > > hi Alex
> > > > > > > > Any comments to the sequence below?
> > > > > > > > 
> > > > > > > > Actaully we have some concerns and suggestions to userspace-opaque migration
> > > > > > > > data.
> > > > > > > > 
> > > > > > > > 1. if data is opaque to userspace, kernel interface must be tightly bound to
> > > > > > > > migration. 
> > > > > > > >    e.g. vendor driver has to know state (running + not logging) should not
> > > > > > > >    return any data, and state (running + logging) should return whole
> > > > > > > >    snapshot first and dirty later. it also has to know qemu migration will
> > > > > > > >    not call GET_BUFFER in state (running + not logging), otherwise, it has
> > > > > > > >    to adjust its behavior.      
> > > > > > > 
> > > > > > > This all just sounds like defining the protocol we expect with the
> > > > > > > interface.  For instance if we define a session as beginning when
> > > > > > > logging is enabled and ending when the device is stopped and the
> > > > > > > interface reports no more data is available, then we can state that any
> > > > > > > partial accumulation of data is incomplete relative to migration.  If
> > > > > > > userspace wants to initiate a new migration stream, they can simply
> > > > > > > toggle logging.  How the vendor driver provides the data during the
> > > > > > > session is not defined, but beginning the session with a snapshot
> > > > > > > followed by repeated iterations of dirtied data is certainly a valid
> > > > > > > approach.
> > > > > > >       
> > > > > > > > 2. vendor driver cannot ensure userspace get all the data it intends to
> > > > > > > > save in pre-copy phase.
> > > > > > > >   e.g. in stop-and-copy phase, vendor driver has to first check and send
> > > > > > > >   data in previous phase.      
> > > > > > > 
> > > > > > > First, I don't think the device has control of when QEMU switches from
> > > > > > > pre-copy to stop-and-copy, the protocol needs to support that
> > > > > > > transition at any point.  However, it seems a simply data available
> > > > > > > counter provides an indication of when it might be optimal to make such
> > > > > > > a transition.  If a vendor driver follows a scheme as above, the
> > > > > > > available data counter would indicate a large value, the entire initial
> > > > > > > snapshot of the device.  As the migration continues and pages are
> > > > > > > dirtied, the device would reach a steady state amount of data
> > > > > > > available, depending on the guest activity.  This could indicate to the
> > > > > > > user to stop the device.  The migration stream would not be considered
> > > > > > > completed until the available data counter reaches zero while the
> > > > > > > device is in the stopped|logging state.
> > > > > > >       
> > > > > > > > 3. if all the sequence is tightly bound to live migration, can we remove the
> > > > > > > > logging state? what about adding two states migrate-in and migrate-out?
> > > > > > > > so there are four states: running, stopped, migrate-in, migrate-out.
> > > > > > > >    migrate-out is for source side when migration starts. together with
> > > > > > > >    state running and stopped, it can substitute state logging.
> > > > > > > >    migrate-in is for target side.      
> > > > > > > 
> > > > > > > In fact, Kirti's implementation specifies a data direction, but I think
> > > > > > > we still need logging to indicate sessions.  I'd also assume that
> > > > > > > logging implies some overhead for the vendor driver.
> > > > > > >      
> > > > > > ok. If you prefer logging, I'm ok with it. just found migrate-in and
> > > > > > migrate-out are more universal againt hardware requirement changes.
> > > > > >     
> > > > > > > > On Tue, Mar 12, 2019 at 10:57:47AM +0800, Zhao Yan wrote:      
> > > > > > > > > hi Alex
> > > > > > > > > thanks for your reply.
> > > > > > > > > 
> > > > > > > > > So, if we choose migration data to be userspace opaque, do you think below
> > > > > > > > > sequence is the right behavior for vendor driver to follow:
> > > > > > > > > 
> > > > > > > > > 1. initially LOGGING state is not set. If userspace calls GET_BUFFER to
> > > > > > > > > vendor driver,  vendor driver should reject and return 0.      
> > > > > > > 
> > > > > > > What would this state mean otherwise?  If we're not logging then it
> > > > > > > should not be expected that we can construct dirtied data from a
> > > > > > > previous read of the state before logging was enabled (it would be
> > > > > > > outside of the "session").  So at best this is an incomplete segment of
> > > > > > > the initial snapshot of the device, but that presumes how the vendor
> > > > > > > driver constructs the data.  I wouldn't necessarily mandate the vendor
> > > > > > > driver reject it, but I think we should consider it undefined and
> > > > > > > vendor specific relative to the migration interface.
> > > > > > >       
> > > > > > > > > 2. then LOGGING state is set, if userspace calls GET_BUFFER to vendor
> > > > > > > > > driver,
> > > > > > > > >    a. vendor driver shoud first query a whole snapshot of device memory
> > > > > > > > >    (let's use this term to represent device's standalone memory for now),
> > > > > > > > >    b. vendor driver returns a chunk of data just queried to userspace,
> > > > > > > > >    while recording current pos in data.
> > > > > > > > >    c. vendor driver finds all data just queried is finished transmitting to
> > > > > > > > >    userspace, and queries only dirty data in device memory now.
> > > > > > > > >    d. vendor driver returns a chunk of data just quered (this time is dirty
> > > > > > > > >    data )to userspace while recording current pos in data
> > > > > > > > >    e. if all data is transmited to usespace and still GET_BUFFERs come from
> > > > > > > > >    userspace, vendor driver starts another round of dirty data query.      
> > > > > > > 
> > > > > > > This is a valid vendor driver approach, but it's outside the scope of
> > > > > > > the interface definition.  A vendor driver could also decide to not
> > > > > > > provide any data until both stopped and logging are set and then
> > > > > > > provide a fixed, final snapshot.  The interface supports either
> > > > > > > approach by defining the protocol to interact with it.
> > > > > > >       
> > > > > > > > > 3. if LOGGING state is unset then, and userpace calls GET_BUFFER to vendor
> > > > > > > > > driver,
> > > > > > > > >    a. if vendor driver finds there's previously untransmitted data, returns
> > > > > > > > >    them until all transmitted.
> > > > > > > > >    b. vendor driver then queries dirty data again and transmits them.
> > > > > > > > >    c. at last, vendor driver queris device config data (which has to be
> > > > > > > > >    queried at last and sent once) and transmits them.      
> > > > > > > 
> > > > > > > This seems broken, the vendor driver is presuming the user intentions.
> > > > > > > If logging is unset, we return to bullet 1, reading data is undefined
> > > > > > > and vendor specific.  It's outside of the session.
> > > > > > >       
> > > > > > > > > for the 1 bullet, if LOGGING state is firstly set and migration aborts
> > > > > > > > > then,  vendor driver has to be able to detect that condition. so seemingly,
> > > > > > > > > vendor driver has to know more qemu's migration state, like migration
> > > > > > > > > called and failed. Do you think that's acceptable?      
> > > > > > > 
> > > > > > > If migration aborts, logging is cleared and the device continues
> > > > > > > operation.  If a new migration is started, the session is initiated by
> > > > > > > enabling logging.  Sound reasonable?  Thanks,
> > > > > > >      
> > > > > > 
> > > > > > For the flow, I still have a question.
> > > > > > There are 2 approaches below, which one do you prefer?
> > > > > > 
> > > > > > Approach A, in precopy stage, the sequence is
> > > > > > 
> > > > > > (1)
> > > > > > .save_live_pending --> return whole snapshot size
> > > > > > .save_live_iterate --> save whole snapshot
> > > > > > 
> > > > > > (2)
> > > > > > .save_live_pending --> get dirty data, return dirty data size
> > > > > > .save_live_iterate --> save all dirty data
> > > > > > 
> > > > > > (3)
> > > > > > .save_live_pending --> get dirty data again, return dirty data size
> > > > > > .save_live_iterate --> save all dirty data
> > > > > > 
> > > > > > 
> > > > > > Approach B, in precopy stage, the sequence is
> > > > > > (1)
> > > > > > .save_live_pending --> return whole snapshot size
> > > > > > .save_live_iterate --> save part of snapshot
> > > > > > 
> > > > > > (2)
> > > > > > .save_live_pending --> return rest part of whole snapshot size +
> > > > > >                               current dirty data size
> > > > > > .save_live_iterate --> save part of snapshot 
> > > > > > 
> > > > > > (3) repeat (2) until whole snapshot saved.
> > > > > > 
> > > > > > (4) 
> > > > > > .save_live_pending --> get diryt data and return current dirty data size
> > > > > > .save_live_iterate --> save part of dirty data
> > > > > > 
> > > > > > (5)
> > > > > > .save_live_pending --> return reset part of dirty data size +
> > > > > > 			     delta size of dirty data
> > > > > > .save_live_iterate --> save part of dirty data
> > > > > > 
> > > > > > (6)
> > > > > > repeat (5) until precopy stops    
> > > > > 
> > > > > I don't really understand the question here.  If the vendor driver's
> > > > > approach is to send a full snapshot followed by iterations of dirty
> > > > > data, then when the user enables logging and reads the counter for
> > > > > available data it should report the (size of the snapshot).  The next
> > > > > time the user reads the counter, it should report the size of the
> > > > > (size of the snapshot) - (what the user has already read) + (size of
> > > > > the dirty data since the snapshot).  As the user continues to read past
> > > > > the snapshot data, the available data counter transitions to reporting
> > > > > only the size of the remaining dirty data, which is monotonically
> > > > > increasing.  I guess this would be more similar to your approach B,
> > > > > which seems to suggest that the interface needs to continue providing
> > > > > data regardless of whether the user fully exhausted the available data
> > > > > from the previous cycle.  Thanks,
> > > > >    
> > > > 
> > > > Right. But regarding to the VFIO migration code in QEMU, rather than save
> > > > one chunk each time, do you think it is better to exhaust all reported data
> > > > from .save_live_pending in each .save_live_iterate callback? (eventhough 
> > > > vendor driver will handle the case that if userspace cannot exhaust
> > > > all data, VFIO QEMU can still try to save as many available data as it can
> > > > each time).  
> > > 
> > > Don't you suspect that some devices might have state that's too large
> > > to process in each iteration?  I expect we'll need to use heuristics on
> > > data size or time spent on each iteration round such that some devices
> > > might be able to fully process their pending data while others will
> > > require multiple passes or make up the balance once we've entered stop
> > > and copy.  Thanks,
> > >  
> > hi Alex
> > What about looping and draining the pending data in each iteration? :)
> 
> How is this question different than your previous question?  Thanks,
> 
sorry, I misunderstood your meaning in the last mail.
you are right, a device may sometimes have too much pending data to
save within a single iteration.
Although draining all the pending data in each iteration is feasible, since
the pre-copy phase is allowed to be slow, using a heuristic max size per
iteration is also reasonable.
Yan Zhao March 27, 2019, 6:35 a.m. UTC | #50
On Wed, Feb 20, 2019 at 07:42:42PM +0800, Cornelia Huck wrote:
> > > > >   b) How do we detect if we're migrating from/to the wrong device or
> > > > > version of device?  Or say to a device with older firmware or perhaps
> > > > > a device that has less device memory ?  
> > > > Actually it's still an open for VFIO migration. Need to think about
> > > > whether it's better to check that in libvirt or qemu (like a device magic
> > > > along with verion ?).  
> > 
> > We must keep the hardware generation is the same with one POD of public cloud
> > providers. But we still think about the live migration between from the the lower
> > generation of hardware migrated to the higher generation.
> 
> Agreed, lower->higher is the one direction that might make sense to
> support.
> 
> But regardless of that, I think we need to make sure that incompatible
> devices/versions fail directly instead of failing in a subtle, hard to
> debug way. Might be useful to do some initial sanity checks in libvirt
> as well.
> 
> How easy is it to obtain that information in a form that can be
> consumed by higher layers? Can we find out the device type at least?
> What about some kind of revision?
hi Alex and Cornelia
for device compatibility, do you think it's a good idea to use "version"
and "device_version" fields?

version field: identifies the live migration interface's version. it can have a
sort of backward compatibility, like target machine's version >= source
machine's version, something like that.

device_version field consists of two parts:
1. vendor id: it takes 32 bits, e.g. 0x8086.
2. vendor proprietary string: it can be any string that a vendor driver
thinks can identify a source device, e.g. pciid + mdev type.
"vendor id" is there to avoid overlap between different vendors' proprietary strings.


struct vfio_device_state_ctl {
	__u32 version;						/* ro */
	__u8 device_version[MAX_DEVICE_VERSION_LEN];		/* ro */
	struct {
		__u32 action;	/* GET_BUFFER, SET_BUFFER, IS_COMPATIBLE */
		...
	} data;
	...
};

Then, an action IS_COMPATIBLE is added to check device compatibility.

The flow to figure out whether a source device is migratable to a target device
is as follows:
1. in the source side's .save_setup, save the source device's device_version string
2. in the target side's .load_state, load the source device's device_version string,
write it to the data region, and call the IS_COMPATIBLE action to ask the vendor
driver to check whether the source device is compatible with it.

The advantage of adding an IS_COMPATIBLE action is that the vendor driver can
maintain a compatibility table and decide whether the source device is compatible
with the target device according to its proprietary table.
In the device_version string, the vendor driver only has to describe the source
device as elaborately as possible and resort to the vendor driver on the target side
to figure out whether they are compatible.

Thanks
Yan
Dr. David Alan Gilbert March 27, 2019, 8:18 p.m. UTC | #51
* Zhao Yan (yan.y.zhao@intel.com) wrote:
> On Wed, Feb 20, 2019 at 07:42:42PM +0800, Cornelia Huck wrote:
> > > > > >   b) How do we detect if we're migrating from/to the wrong device or
> > > > > > version of device?  Or say to a device with older firmware or perhaps
> > > > > > a device that has less device memory ?  
> > > > > Actually it's still an open for VFIO migration. Need to think about
> > > > > whether it's better to check that in libvirt or qemu (like a device magic
> > > > > along with verion ?).  
> > > 
> > > We must keep the hardware generation is the same with one POD of public cloud
> > > providers. But we still think about the live migration between from the the lower
> > > generation of hardware migrated to the higher generation.
> > 
> > Agreed, lower->higher is the one direction that might make sense to
> > support.
> > 
> > But regardless of that, I think we need to make sure that incompatible
> > devices/versions fail directly instead of failing in a subtle, hard to
> > debug way. Might be useful to do some initial sanity checks in libvirt
> > as well.
> > 
> > How easy is it to obtain that information in a form that can be
> > consumed by higher layers? Can we find out the device type at least?
> > What about some kind of revision?
> hi Alex and Cornelia
> for device compatibility, do you think it's a good idea to use "version"
> and "device version" fields?
> 
> version field: identify live migration interface's version. it can have a
> sort of backward compatibility, like target machine's version >= source
> machine's version. something like that.
> 
> device_version field consists two parts:
> 1. vendor id : it takes 32 bits. e.g. 0x8086.
> 2. vendor proprietary string: it can be any string that a vendor driver
> thinks can identify a source device. e.g. pciid + mdev type.
> "vendor id" is to avoid overlap of "vendor proprietary string".
> 
> 
> struct vfio_device_state_ctl {
>      __u32 version;            /* ro */
>      __u8 device_version[MAX_DEVICE_VERSION_LEN];            /* ro */
>      struct {
>      	__u32 action; /* GET_BUFFER, SET_BUFFER, IS_COMPATIBLE*/
> 	...
>      }data;
>      ...
>  };
> 
> Then, an action IS_COMPATIBLE is added to check device compatibility.
> 
> The flow to figure out whether a source device is migratable to target device
> is like that:
> 1. in source side's .save_setup, save source device's device_version string
> 2. in target side's .load_state, load source device's device version string
> and write it to data region, and call IS_COMPATIBLE action to ask vendor driver
> to check whether the source device is compatible to it.
> 
> The advantage of adding an IS_COMPATIBLE action is that, vendor driver can
> maintain a compatibility table and decide whether source device is compatible
> to target device according to its proprietary table.
> In device_version string, vendor driver only has to describe the source
> device as elaborately as possible and resorts to vendor driver in target side
> to figure out whether they are compatible.

It would also be good if 'IS_COMPATIBLE' were somehow callable
externally - so we would be able to answer a question like 'can we
migrate this VM to this host' from the management layer before it
actually starts the migration.

Dave

> Thanks
> Yan
> 
> 
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Alex Williamson March 27, 2019, 10:10 p.m. UTC | #52
On Wed, 27 Mar 2019 20:18:54 +0000
"Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:

> * Zhao Yan (yan.y.zhao@intel.com) wrote:
> > On Wed, Feb 20, 2019 at 07:42:42PM +0800, Cornelia Huck wrote:  
> > > > > > >   b) How do we detect if we're migrating from/to the wrong device or
> > > > > > > version of device?  Or say to a device with older firmware or perhaps
> > > > > > > a device that has less device memory ?    
> > > > > > Actually it's still an open for VFIO migration. Need to think about
> > > > > > whether it's better to check that in libvirt or qemu (like a device magic
> > > > > > along with verion ?).    
> > > > 
> > > > We must keep the hardware generation is the same with one POD of public cloud
> > > > providers. But we still think about the live migration between from the the lower
> > > > generation of hardware migrated to the higher generation.  
> > > 
> > > Agreed, lower->higher is the one direction that might make sense to
> > > support.
> > > 
> > > But regardless of that, I think we need to make sure that incompatible
> > > devices/versions fail directly instead of failing in a subtle, hard to
> > > debug way. Might be useful to do some initial sanity checks in libvirt
> > > as well.
> > > 
> > > How easy is it to obtain that information in a form that can be
> > > consumed by higher layers? Can we find out the device type at least?
> > > What about some kind of revision?  
> > hi Alex and Cornelia
> > for device compatibility, do you think it's a good idea to use "version"
> > and "device version" fields?
> > 
> > version field: identify live migration interface's version. it can have a
> > sort of backward compatibility, like target machine's version >= source
> > machine's version. something like that.

Don't we essentially already have this via the device specific region?
The struct vfio_info_cap_header includes id and version fields, so we
can declare a migration id and increment the version for any
incompatible changes to the protocol.

> > 
> > device_version field consists two parts:
> > 1. vendor id : it takes 32 bits. e.g. 0x8086.

Who allocates IDs?  If we're going to use PCI vendor IDs, then I'd
suggest we use a bit to flag it as such so we can reserve that portion
of the 32bit address space.  See for example:

#define VFIO_REGION_TYPE_PCI_VENDOR_TYPE        (1 << 31)
#define VFIO_REGION_TYPE_PCI_VENDOR_MASK        (0xffff)

For vendor specific regions.

> > 2. vendor proprietary string: it can be any string that a vendor driver
> > thinks can identify a source device. e.g. pciid + mdev type.
> > "vendor id" is to avoid overlap of "vendor proprietary string".
> > 
> > 
> > struct vfio_device_state_ctl {
> >      __u32 version;            /* ro */
> >      __u8 device_version[MAX_DEVICE_VERSION_LEN];            /* ro */
> >      struct {
> >      	__u32 action; /* GET_BUFFER, SET_BUFFER, IS_COMPATIBLE*/
> > 	...
> >      }data;
> >      ...
> >  };

We have a buffer area where we can read and write data from the vendor
driver, why would we have this IS_COMPATIBLE protocol to write the
device version string but use a static fixed length version string in
the control header to read it?  IOW, let's use GET_VERSION,
CHECK_VERSION actions that make use of the buffer area and allow vendor
specific version information length.
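As a sketch of that flow, with the region modeled as plain memory and a toy vendor backend standing in for the driver; the action values, struct layout, and exact-match compatibility rule here are all placeholders, not proposed UAPI:

```c
#include <stdint.h>
#include <string.h>

/* Placeholder action values; real numbering would come from the UAPI. */
enum { GET_VERSION = 10, CHECK_VERSION = 11 };

#define DATA_REGION_SIZE 256

/* Toy model of the migration region: a control word plus the buffer
 * area that GET_VERSION / CHECK_VERSION would use. */
struct mock_region {
	uint32_t action;
	uint32_t size;             /* bytes valid in data[] */
	char data[DATA_REGION_SIZE];
};

/* Toy "vendor driver": reports its version string and accepts only an
 * exact match.  A real driver would parse the string and consult its
 * own compatibility table instead. */
static const char *local_version = "8086:i915-GVTg_V5_4:fw2";

static int region_do_action(struct mock_region *r)
{
	switch (r->action) {
	case GET_VERSION:       /* driver fills buffer, reports length */
		r->size = (uint32_t)strlen(local_version) + 1;
		memcpy(r->data, local_version, r->size);
		return 0;
	case CHECK_VERSION:     /* user wrote source's string to buffer */
		return strcmp(r->data, local_version) == 0 ? 0 : -1;
	default:
		return -1;
	}
}
```

The source side issues GET_VERSION and saves the buffer contents into the migration stream; the target side copies that string into its own buffer and issues CHECK_VERSION, failing the migration early on a non-zero result.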

> > 
> > Then, an action IS_COMPATIBLE is added to check device compatibility.
> > 
> > The flow to figure out whether a source device is migratable to target device
> > is like that:
> > 1. in source side's .save_setup, save source device's device_version string
> > 2. in target side's .load_state, load source device's device version string
> > and write it to data region, and call IS_COMPATIBLE action to ask vendor driver
> > to check whether the source device is compatible to it.
> > 
> > The advantage of adding an IS_COMPATIBLE action is that, vendor driver can
> > maintain a compatibility table and decide whether source device is compatible
> > to target device according to its proprietary table.
> > In device_version string, vendor driver only has to describe the source
> > device as elaborately as possible and resorts to vendor driver in target side
> > to figure out whether they are compatible.  

I agree, it's too complicated and restrictive to try to create an
interface for the user to determine compatibility, let the driver
declare it compatible or not.

> It would also be good if the 'IS_COMPATIBLE' was somehow callable
> externally - so we could be able to answer a question like 'can we
> migrate this VM to this host' - from the management layer before it
> actually starts the migration.

I think we'd need to mirror this capability in sysfs to support that,
or create a qmp interface through QEMU that the device owner could make
the request on behalf of the management layer.  Getting access to the
vfio device requires an iommu context that's already in use by the
device owner, we have no intention of supporting a model that allows
independent tasks access to a device.  Thanks,

Alex
Yan Zhao March 28, 2019, 8:36 a.m. UTC | #53
hi Alex and Dave,
Thanks for your replies.
Please see my comments inline.

On Thu, Mar 28, 2019 at 06:10:20AM +0800, Alex Williamson wrote:
> On Wed, 27 Mar 2019 20:18:54 +0000
> "Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:
> 
> > * Zhao Yan (yan.y.zhao@intel.com) wrote:
> > > On Wed, Feb 20, 2019 at 07:42:42PM +0800, Cornelia Huck wrote:  
> > > > > > > >   b) How do we detect if we're migrating from/to the wrong device or
> > > > > > > > version of device?  Or say to a device with older firmware or perhaps
> > > > > > > > a device that has less device memory ?    
> > > > > > > Actually it's still an open for VFIO migration. Need to think about
> > > > > > > whether it's better to check that in libvirt or qemu (like a device magic
> > > > > > > along with verion ?).    
> > > > > 
> > > > > We must keep the hardware generation is the same with one POD of public cloud
> > > > > providers. But we still think about the live migration between from the the lower
> > > > > generation of hardware migrated to the higher generation.  
> > > > 
> > > > Agreed, lower->higher is the one direction that might make sense to
> > > > support.
> > > > 
> > > > But regardless of that, I think we need to make sure that incompatible
> > > > devices/versions fail directly instead of failing in a subtle, hard to
> > > > debug way. Might be useful to do some initial sanity checks in libvirt
> > > > as well.
> > > > 
> > > > How easy is it to obtain that information in a form that can be
> > > > consumed by higher layers? Can we find out the device type at least?
> > > > What about some kind of revision?  
> > > hi Alex and Cornelia
> > > for device compatibility, do you think it's a good idea to use "version"
> > > and "device version" fields?
> > > 
> > > version field: identify live migration interface's version. it can have a
> > > sort of backward compatibility, like target machine's version >= source
> > > machine's version. something like that.
> 
> Don't we essentially already have this via the device specific region?
> The struct vfio_info_cap_header includes id and version fields, so we
> can declare a migration id and increment the version for any
> incompatible changes to the protocol.
yes, good idea!
so, what about declaring the new cap below?
    #define VFIO_REGION_INFO_CAP_MIGRATION 4
    struct vfio_region_info_cap_migration {
        struct vfio_info_cap_header header;
        __u32 device_version_len;
        __u8  device_version[];
    };


> > > 
> > > device_version field consists two parts:
> > > 1. vendor id : it takes 32 bits. e.g. 0x8086.
> 
> Who allocates IDs?  If we're going to use PCI vendor IDs, then I'd
> suggest we use a bit to flag it as such so we can reserve that portion
> of the 32bit address space.  See for example:
> 
> #define VFIO_REGION_TYPE_PCI_VENDOR_TYPE        (1 << 31)
> #define VFIO_REGION_TYPE_PCI_VENDOR_MASK        (0xffff)
> 
> For vendor specific regions.
Yes, use the PCI vendor ID.
you are right, we need to use the highest bit (VFIO_REGION_TYPE_PCI_VENDOR_TYPE)
to indicate that it's a PCI vendor ID.
Thanks for pointing it out.
But I have a question: what is VFIO_REGION_TYPE_PCI_VENDOR_MASK used for?
Why is it 0xffff? I searched the QEMU and kernel code and did not find
anywhere that uses it.


> > > 2. vendor proprietary string: it can be any string that a vendor driver
> > > thinks can identify a source device. e.g. pciid + mdev type.
> > > "vendor id" is to avoid overlap of "vendor proprietary string".
> > > 
> > > 
> > > struct vfio_device_state_ctl {
> > >      __u32 version;            /* ro */
> > >      __u8 device_version[MAX_DEVICE_VERSION_LEN];            /* ro */
> > >      struct {
> > >      	__u32 action; /* GET_BUFFER, SET_BUFFER, IS_COMPATIBLE*/
> > > 	...
> > >      }data;
> > >      ...
> > >  };
> 
> We have a buffer area where we can read and write data from the vendor
> driver, why would we have this IS_COMPATIBLE protocol to write the
> device version string but use a static fixed length version string in
> the control header to read it?  IOW, let's use GET_VERSION,
> CHECK_VERSION actions that make use of the buffer area and allow vendor
> specific version information length.
you are right, such a static fixed-length version string is bad :)
To get the device version, which approach below do you think is better?
1. use a GET_VERSION action, and read from the region buffer
2. get it when querying the cap VFIO_REGION_INFO_CAP_MIGRATION

It seems approach 1 is better, and the cap VFIO_REGION_INFO_CAP_MIGRATION is
only for checking the migration interface's version?

> > > 
> > > Then, an action IS_COMPATIBLE is added to check device compatibility.
> > > 
> > > The flow to figure out whether a source device is migratable to target device
> > > is like that:
> > > 1. in source side's .save_setup, save source device's device_version string
> > > 2. in target side's .load_state, load source device's device version string
> > > and write it to data region, and call IS_COMPATIBLE action to ask vendor driver
> > > to check whether the source device is compatible to it.
> > > 
> > > The advantage of adding an IS_COMPATIBLE action is that, vendor driver can
> > > maintain a compatibility table and decide whether source device is compatible
> > > to target device according to its proprietary table.
> > > In device_version string, vendor driver only has to describe the source
> > > device as elaborately as possible and resorts to vendor driver in target side
> > > to figure out whether they are compatible.  
> 
> I agree, it's too complicated and restrictive to try to create an
> interface for the user to determine compatibility, let the driver
> declare it compatible or not.
:)

> > It would also be good if the 'IS_COMPATIBLE' was somehow callable
> > externally - so we could be able to answer a question like 'can we
> > migrate this VM to this host' - from the management layer before it
> > actually starts the migration.

so qemu needs to expose two qmp/sysfs interfaces: GET_VERSION and CHECK_VERSION.
GET_VERSION returns a VM device's version string.
CHECK_VERSION takes a device version string as input and returns
compatible/non-compatible.
Do you think that's good?

> I think we'd need to mirror this capability in sysfs to support that,
> or create a qmp interface through QEMU that the device owner could make
> the request on behalf of the management layer.  Getting access to the
> vfio device requires an iommu context that's already in use by the
> device owner, we have no intention of supporting a model that allows
> independent tasks access to a device.  Thanks,
> Alex
>
do you think two sysfs nodes under a device node are ok?
e.g.
/sys/devices/pci0000\:00/0000\:00\:02.0/882cc4da-dede-11e7-9180-078a62063ab1/get_version
/sys/devices/pci0000\:00/0000\:00\:02.0/882cc4da-dede-11e7-9180-078a62063ab1/check_version

Thanks
Yan
Erik Skultety March 28, 2019, 9:21 a.m. UTC | #54
On Thu, Mar 28, 2019 at 04:36:03AM -0400, Zhao Yan wrote:
> hi Alex and Dave,
> Thanks for your replies.
> Please see my comments inline.
>
> On Thu, Mar 28, 2019 at 06:10:20AM +0800, Alex Williamson wrote:
> > On Wed, 27 Mar 2019 20:18:54 +0000
> > "Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:
> >
> > > * Zhao Yan (yan.y.zhao@intel.com) wrote:
> > > > On Wed, Feb 20, 2019 at 07:42:42PM +0800, Cornelia Huck wrote:
> > > > > > > > >   b) How do we detect if we're migrating from/to the wrong device or
> > > > > > > > > version of device?  Or say to a device with older firmware or perhaps
> > > > > > > > > a device that has less device memory ?
> > > > > > > > Actually it's still an open for VFIO migration. Need to think about
> > > > > > > > whether it's better to check that in libvirt or qemu (like a device magic
> > > > > > > > along with verion ?).
> > > > > >
> > > > > > We must keep the hardware generation is the same with one POD of public cloud
> > > > > > providers. But we still think about the live migration between from the the lower
> > > > > > generation of hardware migrated to the higher generation.
> > > > >
> > > > > Agreed, lower->higher is the one direction that might make sense to
> > > > > support.
> > > > >
> > > > > But regardless of that, I think we need to make sure that incompatible
> > > > > devices/versions fail directly instead of failing in a subtle, hard to
> > > > > debug way. Might be useful to do some initial sanity checks in libvirt
> > > > > as well.
> > > > >
> > > > > How easy is it to obtain that information in a form that can be
> > > > > consumed by higher layers? Can we find out the device type at least?
> > > > > What about some kind of revision?
> > > > hi Alex and Cornelia
> > > > for device compatibility, do you think it's a good idea to use "version"
> > > > and "device version" fields?
> > > >
> > > > version field: identify live migration interface's version. it can have a
> > > > sort of backward compatibility, like target machine's version >= source
> > > > machine's version. something like that.
> >
> > Don't we essentially already have this via the device specific region?
> > The struct vfio_info_cap_header includes id and version fields, so we
> > can declare a migration id and increment the version for any
> > incompatible changes to the protocol.
> yes, good idea!
> so, what about declaring below new cap?
>     #define VFIO_REGION_INFO_CAP_MIGRATION 4
>     struct vfio_region_info_cap_migration {
>         struct vfio_info_cap_header header;
>         __u32 device_version_len;
>         __u8  device_version[];
>     };
>
>
> > > >
> > > > device_version field consists two parts:
> > > > 1. vendor id : it takes 32 bits. e.g. 0x8086.
> >
> > Who allocates IDs?  If we're going to use PCI vendor IDs, then I'd
> > suggest we use a bit to flag it as such so we can reserve that portion
> > of the 32bit address space.  See for example:
> >
> > #define VFIO_REGION_TYPE_PCI_VENDOR_TYPE        (1 << 31)
> > #define VFIO_REGION_TYPE_PCI_VENDOR_MASK        (0xffff)
> >
> > For vendor specific regions.
> Yes, use PCI vendor ID.
> you are right, we need to use highest bit (VFIO_REGION_TYPE_PCI_VENDOR_TYPE)
> to identify it's a PCI ID.
> Thanks for pointing it out.
> But, I have a question. what is VFIO_REGION_TYPE_PCI_VENDOR_MASK used for?
> why it's 0xffff? I searched QEMU and kernel code and did not find anywhere
> uses it.
>
>
> > > > 2. vendor proprietary string: it can be any string that a vendor driver
> > > > thinks can identify a source device. e.g. pciid + mdev type.
> > > > "vendor id" is to avoid overlap of "vendor proprietary string".
> > > >
> > > >
> > > > struct vfio_device_state_ctl {
> > > >      __u32 version;            /* ro */
> > > >      __u8 device_version[MAX_DEVICE_VERSION_LEN];            /* ro */
> > > >      struct {
> > > >      	__u32 action; /* GET_BUFFER, SET_BUFFER, IS_COMPATIBLE*/
> > > > 	...
> > > >      }data;
> > > >      ...
> > > >  };
> >
> > We have a buffer area where we can read and write data from the vendor
> > driver, why would we have this IS_COMPATIBLE protocol to write the
> > device version string but use a static fixed length version string in
> > the control header to read it?  IOW, let's use GET_VERSION,
> > CHECK_VERSION actions that make use of the buffer area and allow vendor
> > specific version information length.
> you are right, such static fixed length version string is bad :)
> To get device version, do you think which approach below is better?
> 1. use GET_VERSION action, and read from region buffer
> 2. get it when querying cap VFIO_REGION_INFO_CAP_MIGRATION
>
> seems approach 1 is better, and cap VFIO_REGION_INFO_CAP_MIGRATION is only
> for checking migration interface's version?
>
> > > >
> > > > Then, an action IS_COMPATIBLE is added to check device compatibility.
> > > >
> > > > The flow to figure out whether a source device is migratable to target device
> > > > is like that:
> > > > 1. in source side's .save_setup, save source device's device_version string
> > > > 2. in target side's .load_state, load source device's device version string
> > > > and write it to data region, and call IS_COMPATIBLE action to ask vendor driver
> > > > to check whether the source device is compatible to it.
> > > >
> > > > The advantage of adding an IS_COMPATIBLE action is that, vendor driver can
> > > > maintain a compatibility table and decide whether source device is compatible
> > > > to target device according to its proprietary table.
> > > > In device_version string, vendor driver only has to describe the source
> > > > device as elaborately as possible and resorts to vendor driver in target side
> > > > to figure out whether they are compatible.
> >
> > I agree, it's too complicated and restrictive to try to create an
> > interface for the user to determine compatibility, let the driver
> > declare it compatible or not.
> :)
>
> > > It would also be good if the 'IS_COMPATIBLE' was somehow callable
> > > externally - so we could be able to answer a question like 'can we
> > > migrate this VM to this host' - from the management layer before it
> > > actually starts the migration.
>
> so qemu needs to expose two qmp/sysfs interfaces: GET_VERSION and CHECK_VERSION.
> GET_VERSION returns a vm's device's version string.
> CHECK_VERSION's input is device version string and return
> compatible/non-compatible.
> Do you think it's good?
>
> > I think we'd need to mirror this capability in sysfs to support that,
> > or create a qmp interface through QEMU that the device owner could make
> > the request on behalf of the management layer.  Getting access to the
> > vfio device requires an iommu context that's already in use by the
> > device owner, we have no intention of supporting a model that allows
> > independent tasks access to a device.  Thanks,
> > Alex
> >
> do you think two sysfs nodes under a device node is ok?
> e.g.
> /sys/devices/pci0000\:00/0000\:00\:02.0/882cc4da-dede-11e7-9180-078a62063ab1/get_version
> /sys/devices/pci0000\:00/0000\:00\:02.0/882cc4da-dede-11e7-9180-078a62063ab1/check_version

Why do you need both sysfs and QMP at the same time? I can see it gives us some
flexibility, but is there something more to that?

Normally, I'd prefer a QMP interface from libvirt's perspective (with an
appropriate capability that libvirt can check for QEMU support), because I
imagine large nodes having a bunch of GPUs with different revisions which
might not be backwards compatible.
Libvirt might query the version string on the source and check it on the dest
via QMP in a way that QEMU, talking to the driver, would return either a list
or a single physical device to which we can migrate. Neither QEMU nor libvirt
knows that, only the driver does, so that's important information, as opposed
to looping through all the devices and trying to find one that is compatible.
However, you might have a hard time making all the necessary changes in QMP
introspectable; a new command would be fine, but if you also wanted to extend,
say, vfio-pci options, IIRC those would not appear in the QAPI schema and
libvirt would not be able to detect support for them.

On the other hand, the presence of a QMP interface IMO doesn't help mgmt apps
much, as it still carries the burden of being able to check this only at the
time of migration, which e.g. OpenStack would like to know long before that.

So, having sysfs attributes would work for both libvirt (even though libvirt
would benefit from a QMP much more) and OpenStack. OpenStack would IMO then
have to figure out how to create the mappings between compatible devices across
several nodes which are non-uniform.

Regards,
Erik
Alex Williamson March 28, 2019, 4:04 p.m. UTC | #55
On Thu, 28 Mar 2019 10:21:38 +0100
Erik Skultety <eskultet@redhat.com> wrote:

> On Thu, Mar 28, 2019 at 04:36:03AM -0400, Zhao Yan wrote:
> > hi Alex and Dave,
> > Thanks for your replies.
> > Please see my comments inline.
> >
> > On Thu, Mar 28, 2019 at 06:10:20AM +0800, Alex Williamson wrote:  
> > > On Wed, 27 Mar 2019 20:18:54 +0000
> > > "Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:
> > >  
> > > > * Zhao Yan (yan.y.zhao@intel.com) wrote:  
> > > > > On Wed, Feb 20, 2019 at 07:42:42PM +0800, Cornelia Huck wrote:  
> > > > > > > > > >   b) How do we detect if we're migrating from/to the wrong device or
> > > > > > > > > > version of device?  Or say to a device with older firmware or perhaps
> > > > > > > > > > a device that has less device memory ?  
> > > > > > > > > Actually it's still an open for VFIO migration. Need to think about
> > > > > > > > > whether it's better to check that in libvirt or qemu (like a device magic
> > > > > > > > > along with verion ?).  
> > > > > > >
> > > > > > > We must keep the hardware generation is the same with one POD of public cloud
> > > > > > > providers. But we still think about the live migration between from the the lower
> > > > > > > generation of hardware migrated to the higher generation.  
> > > > > >
> > > > > > Agreed, lower->higher is the one direction that might make sense to
> > > > > > support.
> > > > > >
> > > > > > But regardless of that, I think we need to make sure that incompatible
> > > > > > devices/versions fail directly instead of failing in a subtle, hard to
> > > > > > debug way. Might be useful to do some initial sanity checks in libvirt
> > > > > > as well.
> > > > > >
> > > > > > How easy is it to obtain that information in a form that can be
> > > > > > consumed by higher layers? Can we find out the device type at least?
> > > > > > What about some kind of revision?  
> > > > > hi Alex and Cornelia
> > > > > for device compatibility, do you think it's a good idea to use "version"
> > > > > and "device version" fields?
> > > > >
> > > > > version field: identify live migration interface's version. it can have a
> > > > > sort of backward compatibility, like target machine's version >= source
> > > > > machine's version. something like that.  
> > >
> > > Don't we essentially already have this via the device specific region?
> > > The struct vfio_info_cap_header includes id and version fields, so we
> > > can declare a migration id and increment the version for any
> > > incompatible changes to the protocol.  
> > yes, good idea!
> > so, what about declaring below new cap?
> >     #define VFIO_REGION_INFO_CAP_MIGRATION 4
> >     struct vfio_region_info_cap_migration {
> >         struct vfio_info_cap_header header;
> >         __u32 device_version_len;
> >         __u8  device_version[];
> >     };

I'm not sure why we need a new region for everything, it seems this
could fit within the protocol of a single region.  This could simply be
a new action to retrieve the version where the protocol would report
the number of bytes available, just like the migration stream itself.

> > > > > device_version field consists two parts:
> > > > > 1. vendor id : it takes 32 bits. e.g. 0x8086.  
> > >
> > > Who allocates IDs?  If we're going to use PCI vendor IDs, then I'd
> > > suggest we use a bit to flag it as such so we can reserve that portion
> > > of the 32bit address space.  See for example:
> > >
> > > #define VFIO_REGION_TYPE_PCI_VENDOR_TYPE        (1 << 31)
> > > #define VFIO_REGION_TYPE_PCI_VENDOR_MASK        (0xffff)
> > >
> > > For vendor specific regions.  
> > Yes, use PCI vendor ID.
> > you are right, we need to use highest bit (VFIO_REGION_TYPE_PCI_VENDOR_TYPE)
> > to identify it's a PCI ID.
> > Thanks for pointing it out.
> > But, I have a question. what is VFIO_REGION_TYPE_PCI_VENDOR_MASK used for?
> > why it's 0xffff? I searched QEMU and kernel code and did not find anywhere
> > uses it.

PCI vendor IDs are 16bits, it's just indicating that when the
PCI_VENDOR_TYPE bit is set the valid data is the lower 16bits.

> > > > > 2. vendor proprietary string: it can be any string that a vendor driver
> > > > > thinks can identify a source device. e.g. pciid + mdev type.
> > > > > "vendor id" is to avoid overlap of "vendor proprietary string".
> > > > >
> > > > >
> > > > > struct vfio_device_state_ctl {
> > > > >      __u32 version;            /* ro */
> > > > >      __u8 device_version[MAX_DEVICE_VERSION_LEN];            /* ro */
> > > > >      struct {
> > > > >      	__u32 action; /* GET_BUFFER, SET_BUFFER, IS_COMPATIBLE*/
> > > > > 	...
> > > > >      }data;
> > > > >      ...
> > > > >  };  
> > >
> > > We have a buffer area where we can read and write data from the vendor
> > > driver, why would we have this IS_COMPATIBLE protocol to write the
> > > device version string but use a static fixed length version string in
> > > the control header to read it?  IOW, let's use GET_VERSION,
> > > CHECK_VERSION actions that make use of the buffer area and allow vendor
> > > specific version information length.  
> > you are right, such static fixed length version string is bad :)
> > To get device version, do you think which approach below is better?
> > 1. use GET_VERSION action, and read from region buffer
> > 2. get it when querying cap VFIO_REGION_INFO_CAP_MIGRATION
> >
> > seems approach 1 is better, and cap VFIO_REGION_INFO_CAP_MIGRATION is only
> > for checking migration interface's version?

I think 1 provides the most flexibility to the vendor driver.

> > > > > Then, an action IS_COMPATIBLE is added to check device compatibility.
> > > > >
> > > > > The flow to figure out whether a source device is migratable to target device
> > > > > is like that:
> > > > > 1. in source side's .save_setup, save source device's device_version string
> > > > > 2. in target side's .load_state, load source device's device version string
> > > > > and write it to data region, and call IS_COMPATIBLE action to ask vendor driver
> > > > > to check whether the source device is compatible to it.
> > > > >
> > > > > The advantage of adding an IS_COMPATIBLE action is that, vendor driver can
> > > > > maintain a compatibility table and decide whether source device is compatible
> > > > > to target device according to its proprietary table.
> > > > > In device_version string, vendor driver only has to describe the source
> > > > > device as elaborately as possible and resorts to vendor driver in target side
> > > > > to figure out whether they are compatible.  
> > >
> > > I agree, it's too complicated and restrictive to try to create an
> > > interface for the user to determine compatibility, let the driver
> > > declare it compatible or not.  
> > :)
> >  
> > > > It would also be good if the 'IS_COMPATIBLE' was somehow callable
> > > > externally - so we could be able to answer a question like 'can we
> > > > migrate this VM to this host' - from the management layer before it
> > > > actually starts the migration.  
> >
> > so qemu needs to expose two qmp/sysfs interfaces: GET_VERSION and CHECK_VERSION.
> > GET_VERSION returns a vm's device's version string.
> > CHECK_VERSION's input is device version string and return
> > compatible/non-compatible.
> > Do you think it's good?

That's the idea, but note that QEMU can only provide the QMP interface,
the sysfs interface would of course be provided as more of a direct
path from the vendor driver or mdev kernel layer.

> > > I think we'd need to mirror this capability in sysfs to support that,
> > > or create a qmp interface through QEMU that the device owner could make
> > > the request on behalf of the management layer.  Getting access to the
> > > vfio device requires an iommu context that's already in use by the
> > > device owner, we have no intention of supporting a model that allows
> > > independent tasks access to a device.  Thanks,
> > > Alex
> > >  
> > do you think two sysfs nodes under a device node is ok?
> > e.g.
> > /sys/devices/pci0000\:00/0000\:00\:02.0/882cc4da-dede-11e7-9180-078a62063ab1/get_version
> > /sys/devices/pci0000\:00/0000\:00\:02.0/882cc4da-dede-11e7-9180-078a62063ab1/check_version  

I'd think it might live more in the mdev_supported_types area; wouldn't
we ideally like to know if a device is compatible even before it's
created?  For example maybe:

/sys/class/mdev_bus/0000:00:02.0/mdev_supported_types/i915-GVTg_V5_4/version

Where reading the sysfs attribute returns the version string and
writing a string into the attribute returns an errno for incompatibility.
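From the management side, that proposal could be consumed as sketched below; the attribute path and semantics are hypothetical (no such attribute exists yet), and a failed or short write stands in for the errno-based "incompatible" signal:

```c
#include <stdio.h>
#include <string.h>

/*
 * Sketch of a management-layer check against the proposed (hypothetical)
 * per-mdev-type "version" sysfs attribute: read the string from the
 * source host's attribute, write it into the target host's attribute,
 * and treat a failed write (e.g. EINVAL from the driver) as
 * "incompatible".  Returns 0 if compatible, -1 otherwise.
 */
static int check_compat(const char *src_attr, const char *dst_attr)
{
	char version[256];
	size_t len;
	int ok;

	FILE *f = fopen(src_attr, "r");
	if (!f)
		return -1;
	if (!fgets(version, sizeof(version), f)) {
		fclose(f);
		return -1;
	}
	fclose(f);

	f = fopen(dst_attr, "w");
	if (!f)
		return -1;
	len = strlen(version);
	ok = fwrite(version, 1, len, f) == len;
	if (fclose(f) != 0)    /* sysfs write errors can surface at close */
		ok = 0;
	return ok ? 0 : -1;
}
```

Against real sysfs, src_attr and dst_attr would be paths like the .../mdev_supported_types/i915-GVTg_V5_4/version example above, on the source and target hosts respectively.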

> Why do you need both sysfs and QMP at the same time? I can see it gives us some
> flexibility, but is there something more to that?
>
> Normally, I'd prefer a QMP interface from libvirt's perspective (with an
> appropriate capability that libvirt can check for QEMU support) because I imagine large nodes having a
> bunch of GPUs with different revisions which might not be backwards compatible.
> Libvirt might query the version string on source and check it on dest via the
> QMP in a way that QEMU, talking to the driver, would return either a list or a
> single physical device to which we can migrate, because neither QEMU nor
> libvirt know that, only the driver does, so that's an important information
> rather than looping through all the devices and trying to find one that is
> compatible. However, you might have a hard time making all the necessary
> changes in QMP introspectable, a new command would be fine, but if you also
> wanted to extend say vfio-pci options, IIRC that would not appear in the QAPI
> schema and libvirt would not be able to detect support for it.
> 
> On the other hand, the presence of a QMP interface IMO doesn't help mgmt apps
> much, as it still carries the burden of being able to check this only at the
> time of migration, which e.g. OpenStack would like to know long before that.
> 
> So, having sysfs attributes would work for both libvirt (even though libvirt
> would benefit from a QMP much more) and OpenStack. OpenStack would IMO then
> have to figure out how to create the mappings between compatible devices across
> several nodes which are non-uniform.

Yep, vfio encompasses more than just QEMU, so a sysfs interface has more
utility than a QMP interface.  For instance we couldn't predetermine if
an mdev type on a host is compatible if we need to first create the
device and launch a QEMU instance on it to get access to QMP.  So maybe
the question is whether we should bother with any sort of VFIO API to
do this comparison, perhaps only a sysfs interface is sufficient for a
complete solution.  The downside of not having a version API in the
user interface might be that QEMU on its own can only try a migration
and see if it fails, it wouldn't have the ability to test expected
compatibility without access to sysfs.  And maybe that's fine.  Thanks,

Alex
Alex Williamson March 29, 2019, 2:26 p.m. UTC | #56
On Thu, 28 Mar 2019 22:47:04 -0400
Zhao Yan <yan.y.zhao@intel.com> wrote:

> On Fri, Mar 29, 2019 at 12:04:31AM +0800, Alex Williamson wrote:
> > On Thu, 28 Mar 2019 10:21:38 +0100
> > Erik Skultety <eskultet@redhat.com> wrote:
> >   
> > > On Thu, Mar 28, 2019 at 04:36:03AM -0400, Zhao Yan wrote:  
> > > > hi Alex and Dave,
> > > > Thanks for your replies.
> > > > Please see my comments inline.
> > > >
> > > > On Thu, Mar 28, 2019 at 06:10:20AM +0800, Alex Williamson wrote:    
> > > > > On Wed, 27 Mar 2019 20:18:54 +0000
> > > > > "Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:
> > > > >    
> > > > > > * Zhao Yan (yan.y.zhao@intel.com) wrote:    
> > > > > > > On Wed, Feb 20, 2019 at 07:42:42PM +0800, Cornelia Huck wrote:    
> > > > > > > > > > > >   b) How do we detect if we're migrating from/to the wrong device or
> > > > > > > > > > > > version of device?  Or say to a device with older firmware or perhaps
> > > > > > > > > > > > a device that has less device memory ?    
> > > > > > > > > > > Actually it's still an open for VFIO migration. Need to think about
> > > > > > > > > > > whether it's better to check that in libvirt or qemu (like a device magic
> > > > > > > > > > > along with verion ?).    
> > > > > > > > >
> > > > > > > > > We must keep the hardware generation is the same with one POD of public cloud
> > > > > > > > > providers. But we still think about the live migration between from the the lower
> > > > > > > > > generation of hardware migrated to the higher generation.    
> > > > > > > >
> > > > > > > > Agreed, lower->higher is the one direction that might make sense to
> > > > > > > > support.
> > > > > > > >
> > > > > > > > But regardless of that, I think we need to make sure that incompatible
> > > > > > > > devices/versions fail directly instead of failing in a subtle, hard to
> > > > > > > > debug way. Might be useful to do some initial sanity checks in libvirt
> > > > > > > > as well.
> > > > > > > >
> > > > > > > > How easy is it to obtain that information in a form that can be
> > > > > > > > consumed by higher layers? Can we find out the device type at least?
> > > > > > > > What about some kind of revision?    
> > > > > > > hi Alex and Cornelia
> > > > > > > for device compatibility, do you think it's a good idea to use "version"
> > > > > > > and "device version" fields?
> > > > > > >
> > > > > > > version field: identify live migration interface's version. it can have a
> > > > > > > sort of backward compatibility, like target machine's version >= source
> > > > > > > machine's version. something like that.    
> > > > >
> > > > > Don't we essentially already have this via the device specific region?
> > > > > The struct vfio_info_cap_header includes id and version fields, so we
> > > > > can declare a migration id and increment the version for any
> > > > > incompatible changes to the protocol.    
> > > > yes, good idea!
> > > > so, what about declaring below new cap?
> > > >     #define VFIO_REGION_INFO_CAP_MIGRATION 4
> > > >     struct vfio_region_info_cap_migration {
> > > >         struct vfio_info_cap_header header;
> > > >         __u32 device_version_len;
> > > >         __u8  device_version[];
> > > >     };  
> > 
> > I'm not sure why we need a new region for everything, it seems this
> > could fit within the protocol of a single region.  This could simply be
> > a new action to retrieve the version where the protocol would report
> > the number of bytes available, just like the migration stream itself.  
> so, to get version of VFIO live migration device state interface (simply
> call it migration interface?),
> a new cap looks like this:
> #define VFIO_REGION_INFO_CAP_MIGRATION 4
> it contains struct vfio_info_cap_header only.
> when get region info of the migration region, we query this cap and get
> migration interface's version. right?
> 
> or just directly use VFIO_REGION_INFO_CAP_TYPE is fine?

Again, why a new region.  I'm imagining we have one region and this is
just asking for a slightly different thing from it.  But TBH, I'm not
sure we need it at all vs the sysfs interface.

> > > > > > > device_version field consists of two parts:
> > > > > > > 1. vendor id : it takes 32 bits. e.g. 0x8086.    
> > > > >
> > > > > Who allocates IDs?  If we're going to use PCI vendor IDs, then I'd
> > > > > suggest we use a bit to flag it as such so we can reserve that portion
> > > > > of the 32bit address space.  See for example:
> > > > >
> > > > > #define VFIO_REGION_TYPE_PCI_VENDOR_TYPE        (1 << 31)
> > > > > #define VFIO_REGION_TYPE_PCI_VENDOR_MASK        (0xffff)
> > > > >
> > > > > For vendor specific regions.    
> > > > Yes, use PCI vendor ID.
> > > > you are right, we need to use highest bit (VFIO_REGION_TYPE_PCI_VENDOR_TYPE)
> > > > to identify it's a PCI ID.
> > > > Thanks for pointing it out.
> > > > But, I have a question. what is VFIO_REGION_TYPE_PCI_VENDOR_MASK used for?
> > > > why it's 0xffff? I searched QEMU and kernel code and did not find anywhere
> > > > uses it.  
> > 
> > PCI vendor IDs are 16bits, it's just indicating that when the
> > PCI_VENDOR_TYPE bit is set the valid data is the lower 16bits.  
> 
> thanks:)
> 
> > > > > > > 2. vendor proprietary string: it can be any string that a vendor driver
> > > > > > > thinks can identify a source device. e.g. pciid + mdev type.
> > > > > > > "vendor id" is to avoid overlap of "vendor proprietary string".
> > > > > > >
> > > > > > >
> > > > > > > struct vfio_device_state_ctl {
> > > > > > >      __u32 version;            /* ro */
> > > > > > >      __u8 device_version[MAX_DEVICE_VERSION_LEN];            /* ro */
> > > > > > >      struct {
> > > > > > >      	__u32 action; /* GET_BUFFER, SET_BUFFER, IS_COMPATIBLE*/
> > > > > > > 	...
> > > > > > >      }data;
> > > > > > >      ...
> > > > > > >  };    
> > > > >
> > > > > We have a buffer area where we can read and write data from the vendor
> > > > > driver, why would we have this IS_COMPATIBLE protocol to write the
> > > > > device version string but use a static fixed length version string in
> > > > > the control header to read it?  IOW, let's use GET_VERSION,
> > > > > CHECK_VERSION actions that make use of the buffer area and allow vendor
> > > > > specific version information length.    
> > > > you are right, such static fixed length version string is bad :)
> > > > To get device version, do you think which approach below is better?
> > > > 1. use GET_VERSION action, and read from region buffer
> > > > 2. get it when querying cap VFIO_REGION_INFO_CAP_MIGRATION
> > > >
> > > > seems approach 1 is better, and cap VFIO_REGION_INFO_CAP_MIGRATION is only
> > > > for checking migration interface's version?  
> > 
> > I think 1 provides the most flexibility to the vendor driver.  
> 
> Got it.
> For VFIO live migration, compared to reusing the device state region (which takes
> GET_BUFFER/SET_BUFFER actions),
> could we create a new region for GET_VERSION & CHECK_VERSION ?

Why?

> > > > > > > Then, an action IS_COMPATIBLE is added to check device compatibility.
> > > > > > >
> > > > > > > > > The flow to figure out whether a source device is migratable to a target device
> > > > > > > > > is as follows:
> > > > > > > 1. in source side's .save_setup, save source device's device_version string
> > > > > > > 2. in target side's .load_state, load source device's device version string
> > > > > > > and write it to data region, and call IS_COMPATIBLE action to ask vendor driver
> > > > > > > to check whether the source device is compatible to it.
> > > > > > >
> > > > > > > The advantage of adding an IS_COMPATIBLE action is that, vendor driver can
> > > > > > > maintain a compatibility table and decide whether source device is compatible
> > > > > > > to target device according to its proprietary table.
> > > > > > > In device_version string, vendor driver only has to describe the source
> > > > > > > device as elaborately as possible and resorts to vendor driver in target side
> > > > > > > to figure out whether they are compatible.    
> > > > >
> > > > > I agree, it's too complicated and restrictive to try to create an
> > > > > interface for the user to determine compatibility, let the driver
> > > > > declare it compatible or not.    
> > > > :)
> > > >    
> > > > > > It would also be good if the 'IS_COMPATIBLE' was somehow callable
> > > > > > externally - so we could be able to answer a question like 'can we
> > > > > > migrate this VM to this host' - from the management layer before it
> > > > > > actually starts the migration.    
> > > >
> > > > so qemu needs to expose two qmp/sysfs interfaces: GET_VERSION and CHECK_VERSION.
> > > > GET_VERSION returns a vm's device's version string.
> > > > CHECK_VERSION's input is device version string and return
> > > > compatible/non-compatible.
> > > > Do you think it's good?  
> > 
> > That's the idea, but note that QEMU can only provide the QMP interface,
> > the sysfs interface would of course be provided as more of a direct
> > path from the vendor driver or mdev kernel layer.
> >   
> > > > > I think we'd need to mirror this capability in sysfs to support that,
> > > > > or create a qmp interface through QEMU that the device owner could make
> > > > > the request on behalf of the management layer.  Getting access to the
> > > > > vfio device requires an iommu context that's already in use by the
> > > > > device owner, we have no intention of supporting a model that allows
> > > > > independent tasks access to a device.  Thanks,
> > > > > Alex
> > > > >    
> > > > do you think two sysfs nodes under a device node is ok?
> > > > e.g.
> > > > /sys/devices/pci0000\:00/0000\:00\:02.0/882cc4da-dede-11e7-9180-078a62063ab1/get_version
> > > > /sys/devices/pci0000\:00/0000\:00\:02.0/882cc4da-dede-11e7-9180-078a62063ab1/check_version    
> > 
> > I'd think it might live more in the mdev_support_types area, wouldn't
> > we ideally like to know if a device is compatible even before it's
> > created?  For example maybe:
> > 
> > /sys/class/mdev_bus/0000:00:02.0/mdev_supported_types/i915-GVTg_V5_4/version
> > 
> > Where reading the sysfs attribute returns the version string and
> > writing a string into the attribute return an errno for incompatibility.  
> yes, knowing if a device is compatible before it's created is good.
> but do you think checking whether a device is compatible after it's created is
> also required? For live migration, user usually only queries this information
> when it's really required, i.e. when a device has been created.
> maybe we can add this version get/check at both places?

Why does an instantiated device suddenly not follow the version and
compatibility rules of an uninstantiated device?  IOW, if the version
and compatibility check are on the mdev type, why can't we trace back
from the device to the mdev type and make use of that same interface?
Seems the only question is whether we require an interface through the
vfio API directly or if sysfs is sufficient.

> > > Why do you need both sysfs and QMP at the same time? I can see it gives us some
> > > flexibility, but is there something more to that?
> > >
> > > Normally, I'd prefer a QMP interface from libvirt's perspective (with an
> > > appropriate capability that libvirt can check for QEMU support) because I imagine large nodes having a
> > > bunch of GPUs with different revisions which might not be backwards compatible.
> > > Libvirt might query the version string on source and check it on dest via the
> > > QMP in a way that QEMU, talking to the driver, would return either a list or a
> > > single physical device to which we can migrate, because neither QEMU nor
> > > libvirt know that, only the driver does, so that's an important information
> > > rather than looping through all the devices and trying to find one that is
> > > compatible. However, you might have a hard time making all the necessary
> > > changes in QMP introspectable, a new command would be fine, but if you also
> > > wanted to extend say vfio-pci options, IIRC that would not appear in the QAPI
> > > schema and libvirt would not be able to detect support for it.
> > > 
> > > On the other hand, the presence of a QMP interface IMO doesn't help mgmt apps
> > > much, as it still carries the burden of being able to check this only at the
> > > time of migration, which e.g. OpenStack would like to know long before that.
> > > 
> > > So, having sysfs attributes would work for both libvirt (even though libvirt
> > > would benefit from a QMP much more) and OpenStack. OpenStack would IMO then
> > > have to figure out how to create the mappings between compatible devices across
> > > several nodes which are non-uniform.  
> > 
> > Yep, vfio encompasses more than just QEMU, so a sysfs interface has more
> > utility than a QMP interface.  For instance we couldn't predetermine if
> > an mdev type on a host is compatible if we need to first create the
> > device and launch a QEMU instance on it to get access to QMP.  So maybe
> > the question is whether we should bother with any sort of VFIO API to
> > do this comparison, perhaps only a sysfs interface is sufficient for a
> > complete solution.  The downside of not having a version API in the
> > user interface might be that QEMU on its own can only try a migration
> > and see if it fails, it wouldn't have the ability to test expected
> > compatibility without access to sysfs.  And maybe that's fine.  Thanks,
> >   
> So QEMU vfio uses sysfs to check device compatibility in migration's save_setup
> phase?

The migration stream between the source and target devices is the ultimate
test of compatibility; the vendor driver should never rely on userspace
validating compatibility of the migration.  At the point it could do so, the
migration has already begun, so we're only testing how quickly we can
fail the migration.  The management layer setting up the migration can
test via sysfs for compatibility and the migration stream itself needs
to be self validating, so what value is added for QEMU to perform a
version compatibility test?  Thanks,

Alex
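The flag-bit scheme Alex describes above (bit 31 marks the 32-bit field as holding a PCI vendor ID, and the mask selects the 16 valid bits) can be sketched as follows. The two macros are quoted from the thread; the helper functions are illustrative assumptions, not part of any UAPI:

```c
#include <stdint.h>

/* Macros as quoted in the discussion: the high bit flags the field as
 * a PCI vendor ID, the mask covers the 16 valid bits. */
#define VFIO_REGION_TYPE_PCI_VENDOR_TYPE  (1u << 31)
#define VFIO_REGION_TYPE_PCI_VENDOR_MASK  (0xffffu)

/* Hypothetical helpers showing how such a field would be built and
 * taken apart; not real kernel or QEMU code. */
static uint32_t encode_pci_vendor(uint16_t vendor_id)
{
    return VFIO_REGION_TYPE_PCI_VENDOR_TYPE | vendor_id;
}

static int is_pci_vendor(uint32_t field)
{
    return (field & VFIO_REGION_TYPE_PCI_VENDOR_TYPE) != 0;
}

static uint16_t decode_pci_vendor(uint32_t field)
{
    return (uint16_t)(field & VFIO_REGION_TYPE_PCI_VENDOR_MASK);
}
```

For example, Intel's 0x8086 would be carried as 0x80008086: the type bit answers "is this a PCI ID?", and the mask recovers the ID itself, which is why the mask is 0xffff even though the field is 32 bits wide.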
Alex Williamson March 30, 2019, 2:14 p.m. UTC | #57
On Fri, 29 Mar 2019 19:10:50 -0400
Zhao Yan <yan.y.zhao@intel.com> wrote:

> On Fri, Mar 29, 2019 at 10:26:39PM +0800, Alex Williamson wrote:
> > On Thu, 28 Mar 2019 22:47:04 -0400
> > Zhao Yan <yan.y.zhao@intel.com> wrote:
> >   
> > > On Fri, Mar 29, 2019 at 12:04:31AM +0800, Alex Williamson wrote:  
> > > > On Thu, 28 Mar 2019 10:21:38 +0100
> > > > Erik Skultety <eskultet@redhat.com> wrote:
> > > >     
> > > > > On Thu, Mar 28, 2019 at 04:36:03AM -0400, Zhao Yan wrote:    
> > > > > > hi Alex and Dave,
> > > > > > Thanks for your replies.
> > > > > > Please see my comments inline.
> > > > > >
> > > > > > On Thu, Mar 28, 2019 at 06:10:20AM +0800, Alex Williamson wrote:      
> > > > > > > On Wed, 27 Mar 2019 20:18:54 +0000
> > > > > > > "Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:
> > > > > > >      
> > > > > > > > * Zhao Yan (yan.y.zhao@intel.com) wrote:      
> > > > > > > > > On Wed, Feb 20, 2019 at 07:42:42PM +0800, Cornelia Huck wrote:      
> > > > > > > > > > > > > >   b) How do we detect if we're migrating from/to the wrong device or
> > > > > > > > > > > > > > version of device?  Or say to a device with older firmware or perhaps
> > > > > > > > > > > > > > a device that has less device memory ?      
> > > > > > > > > > > > > Actually it's still an open for VFIO migration. Need to think about
> > > > > > > > > > > > > whether it's better to check that in libvirt or qemu (like a device magic
> > > > > > > > > > > > > along with version ?).      
> > > > > > > > > > >
> > > > > > > > > > > We must keep the hardware generation the same within one POD of public cloud
> > > > > > > > > > > providers. But we are still thinking about live migration from a lower
> > > > > > > > > > > generation of hardware to a higher generation.    
> > > > > > > > > >
> > > > > > > > > > Agreed, lower->higher is the one direction that might make sense to
> > > > > > > > > > support.
> > > > > > > > > >
> > > > > > > > > > But regardless of that, I think we need to make sure that incompatible
> > > > > > > > > > devices/versions fail directly instead of failing in a subtle, hard to
> > > > > > > > > > debug way. Might be useful to do some initial sanity checks in libvirt
> > > > > > > > > > as well.
> > > > > > > > > >
> > > > > > > > > > How easy is it to obtain that information in a form that can be
> > > > > > > > > > consumed by higher layers? Can we find out the device type at least?
> > > > > > > > > > What about some kind of revision?      
> > > > > > > > > hi Alex and Cornelia
> > > > > > > > > for device compatibility, do you think it's a good idea to use "version"
> > > > > > > > > and "device version" fields?
> > > > > > > > >
> > > > > > > > > version field: identify live migration interface's version. it can have a
> > > > > > > > > sort of backward compatibility, like target machine's version >= source
> > > > > > > > > machine's version. something like that.      
> > > > > > >
> > > > > > > Don't we essentially already have this via the device specific region?
> > > > > > > The struct vfio_info_cap_header includes id and version fields, so we
> > > > > > > can declare a migration id and increment the version for any
> > > > > > > incompatible changes to the protocol.      
> > > > > > yes, good idea!
> > > > > > so, what about declaring below new cap?
> > > > > >     #define VFIO_REGION_INFO_CAP_MIGRATION 4
> > > > > >     struct vfio_region_info_cap_migration {
> > > > > >         struct vfio_info_cap_header header;
> > > > > >         __u32 device_version_len;
> > > > > >         __u8  device_version[];
> > > > > >     };    
> > > > 
> > > > I'm not sure why we need a new region for everything, it seems this
> > > > could fit within the protocol of a single region.  This could simply be
> > > > a new action to retrieve the version where the protocol would report
> > > > the number of bytes available, just like the migration stream itself.    
> > > so, to get version of VFIO live migration device state interface (simply
> > > call it migration interface?),
> > > a new cap looks like this:
> > > #define VFIO_REGION_INFO_CAP_MIGRATION 4
> > > it contains struct vfio_info_cap_header only.
> > > when get region info of the migration region, we query this cap and get
> > > migration interface's version. right?
> > > 
> > > or just directly use VFIO_REGION_INFO_CAP_TYPE is fine?  
> > 
> > Again, why a new region.  I'm imagining we have one region and this is
> > just asking for a slightly different thing from it.  But TBH, I'm not
> > sure we need it at all vs the sysfs interface.
> >   
> > > > > > > > > device_version field consists of two parts:
> > > > > > > > > 1. vendor id : it takes 32 bits. e.g. 0x8086.      
> > > > > > >
> > > > > > > Who allocates IDs?  If we're going to use PCI vendor IDs, then I'd
> > > > > > > suggest we use a bit to flag it as such so we can reserve that portion
> > > > > > > of the 32bit address space.  See for example:
> > > > > > >
> > > > > > > #define VFIO_REGION_TYPE_PCI_VENDOR_TYPE        (1 << 31)
> > > > > > > #define VFIO_REGION_TYPE_PCI_VENDOR_MASK        (0xffff)
> > > > > > >
> > > > > > > For vendor specific regions.      
> > > > > > Yes, use PCI vendor ID.
> > > > > > you are right, we need to use highest bit (VFIO_REGION_TYPE_PCI_VENDOR_TYPE)
> > > > > > to identify it's a PCI ID.
> > > > > > Thanks for pointing it out.
> > > > > > But, I have a question. what is VFIO_REGION_TYPE_PCI_VENDOR_MASK used for?
> > > > > > why it's 0xffff? I searched QEMU and kernel code and did not find anywhere
> > > > > > uses it.    
> > > > 
> > > > PCI vendor IDs are 16bits, it's just indicating that when the
> > > > PCI_VENDOR_TYPE bit is set the valid data is the lower 16bits.    
> > > 
> > > thanks:)
> > >   
> > > > > > > > > 2. vendor proprietary string: it can be any string that a vendor driver
> > > > > > > > > thinks can identify a source device. e.g. pciid + mdev type.
> > > > > > > > > "vendor id" is to avoid overlap of "vendor proprietary string".
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > struct vfio_device_state_ctl {
> > > > > > > > >      __u32 version;            /* ro */
> > > > > > > > >      __u8 device_version[MAX_DEVICE_VERSION_LEN];            /* ro */
> > > > > > > > >      struct {
> > > > > > > > >      	__u32 action; /* GET_BUFFER, SET_BUFFER, IS_COMPATIBLE*/
> > > > > > > > > 	...
> > > > > > > > >      }data;
> > > > > > > > >      ...
> > > > > > > > >  };      
> > > > > > >
> > > > > > > We have a buffer area where we can read and write data from the vendor
> > > > > > > driver, why would we have this IS_COMPATIBLE protocol to write the
> > > > > > > device version string but use a static fixed length version string in
> > > > > > > the control header to read it?  IOW, let's use GET_VERSION,
> > > > > > > CHECK_VERSION actions that make use of the buffer area and allow vendor
> > > > > > > specific version information length.      
> > > > > > you are right, such static fixed length version string is bad :)
> > > > > > To get device version, do you think which approach below is better?
> > > > > > 1. use GET_VERSION action, and read from region buffer
> > > > > > 2. get it when querying cap VFIO_REGION_INFO_CAP_MIGRATION
> > > > > >
> > > > > > seems approach 1 is better, and cap VFIO_REGION_INFO_CAP_MIGRATION is only
> > > > > > for checking migration interface's version?    
> > > > 
> > > > I think 1 provides the most flexibility to the vendor driver.    
> > > 
> > > Got it.
> > > For VFIO live migration, compared to reusing the device state region (which takes
> > > GET_BUFFER/SET_BUFFER actions),
> > > could we create a new region for GET_VERSION & CHECK_VERSION ?  
> > 
> > Why?
> >   
> > > > > > > > > Then, an action IS_COMPATIBLE is added to check device compatibility.
> > > > > > > > >
> > > > > > > > > > > The flow to figure out whether a source device is migratable to a target device
> > > > > > > > > > > is as follows:
> > > > > > > > > 1. in source side's .save_setup, save source device's device_version string
> > > > > > > > > 2. in target side's .load_state, load source device's device version string
> > > > > > > > > and write it to data region, and call IS_COMPATIBLE action to ask vendor driver
> > > > > > > > > to check whether the source device is compatible to it.
> > > > > > > > >
> > > > > > > > > The advantage of adding an IS_COMPATIBLE action is that, vendor driver can
> > > > > > > > > maintain a compatibility table and decide whether source device is compatible
> > > > > > > > > to target device according to its proprietary table.
> > > > > > > > > In device_version string, vendor driver only has to describe the source
> > > > > > > > > device as elaborately as possible and resorts to vendor driver in target side
> > > > > > > > > to figure out whether they are compatible.      
> > > > > > >
> > > > > > > I agree, it's too complicated and restrictive to try to create an
> > > > > > > interface for the user to determine compatibility, let the driver
> > > > > > > declare it compatible or not.      
> > > > > > :)
> > > > > >      
> > > > > > > > It would also be good if the 'IS_COMPATIBLE' was somehow callable
> > > > > > > > externally - so we could be able to answer a question like 'can we
> > > > > > > > migrate this VM to this host' - from the management layer before it
> > > > > > > > actually starts the migration.      
> > > > > >
> > > > > > so qemu needs to expose two qmp/sysfs interfaces: GET_VERSION and CHECK_VERSION.
> > > > > > GET_VERSION returns a vm's device's version string.
> > > > > > CHECK_VERSION's input is device version string and return
> > > > > > compatible/non-compatible.
> > > > > > Do you think it's good?    
> > > > 
> > > > That's the idea, but note that QEMU can only provide the QMP interface,
> > > > the sysfs interface would of course be provided as more of a direct
> > > > path from the vendor driver or mdev kernel layer.
> > > >     
> > > > > > > I think we'd need to mirror this capability in sysfs to support that,
> > > > > > > or create a qmp interface through QEMU that the device owner could make
> > > > > > > the request on behalf of the management layer.  Getting access to the
> > > > > > > vfio device requires an iommu context that's already in use by the
> > > > > > > device owner, we have no intention of supporting a model that allows
> > > > > > > independent tasks access to a device.  Thanks,
> > > > > > > Alex
> > > > > > >      
> > > > > > do you think two sysfs nodes under a device node is ok?
> > > > > > e.g.
> > > > > > /sys/devices/pci0000\:00/0000\:00\:02.0/882cc4da-dede-11e7-9180-078a62063ab1/get_version
> > > > > > /sys/devices/pci0000\:00/0000\:00\:02.0/882cc4da-dede-11e7-9180-078a62063ab1/check_version      
> > > > 
> > > > I'd think it might live more in the mdev_support_types area, wouldn't
> > > > we ideally like to know if a device is compatible even before it's
> > > > created?  For example maybe:
> > > > 
> > > > /sys/class/mdev_bus/0000:00:02.0/mdev_supported_types/i915-GVTg_V5_4/version
> > > > 
> > > > Where reading the sysfs attribute returns the version string and
> > > > writing a string into the attribute return an errno for incompatibility.    
> > > yes, knowing if a device is compatible before it's created is good.
> > > but do you think checking whether a device is compatible after it's created is
> > > also required? For live migration, user usually only queries this information
> > > when it's really required, i.e. when a device has been created.
> > > maybe we can add this version get/check at both places?  
> > 
> > Why does an instantiated device suddenly not follow the version and
> > compatibility rules of an uninstantiated device?  IOW, if the version
> > and compatibility check are on the mdev type, why can't we trace back
> > from the device to the mdev type and make use of that same interface?
> > Seems the only question is whether we require an interface through the
> > vfio API directly or if sysfs is sufficient.  
> ok. got it.
> 
> > > > > Why do you need both sysfs and QMP at the same time? I can see it gives us some
> > > > > flexibility, but is there something more to that?
> > > > >
> > > > > Normally, I'd prefer a QMP interface from libvirt's perspective (with an
> > > > > appropriate capability that libvirt can check for QEMU support) because I imagine large nodes having a
> > > > > bunch of GPUs with different revisions which might not be backwards compatible.
> > > > > Libvirt might query the version string on source and check it on dest via the
> > > > > QMP in a way that QEMU, talking to the driver, would return either a list or a
> > > > > single physical device to which we can migrate, because neither QEMU nor
> > > > > libvirt know that, only the driver does, so that's an important information
> > > > > rather than looping through all the devices and trying to find one that is
> > > > > compatible. However, you might have a hard time making all the necessary
> > > > > changes in QMP introspectable, a new command would be fine, but if you also
> > > > > wanted to extend say vfio-pci options, IIRC that would not appear in the QAPI
> > > > > schema and libvirt would not be able to detect support for it.
> > > > > 
> > > > > On the other hand, the presence of a QMP interface IMO doesn't help mgmt apps
> > > > > much, as it still carries the burden of being able to check this only at the
> > > > > time of migration, which e.g. OpenStack would like to know long before that.
> > > > > 
> > > > > So, having sysfs attributes would work for both libvirt (even though libvirt
> > > > > would benefit from a QMP much more) and OpenStack. OpenStack would IMO then
> > > > > have to figure out how to create the mappings between compatible devices across
> > > > > several nodes which are non-uniform.    
> > > > 
> > > > Yep, vfio encompasses more than just QEMU, so a sysfs interface has more
> > > > utility than a QMP interface.  For instance we couldn't predetermine if
> > > > an mdev type on a host is compatible if we need to first create the
> > > > device and launch a QEMU instance on it to get access to QMP.  So maybe
> > > > the question is whether we should bother with any sort of VFIO API to
> > > > do this comparison, perhaps only a sysfs interface is sufficient for a
> > > > complete solution.  The downside of not having a version API in the
> > > > user interface might be that QEMU on its own can only try a migration
> > > > and see if it fails, it wouldn't have the ability to test expected
> > > > compatibility without access to sysfs.  And maybe that's fine.  Thanks,
> > > >     
> > > So QEMU vfio uses sysfs to check device compatibility in migration's save_setup
> > > phase?  
> > 
> > The migration stream between the source and target devices is the ultimate
> > test of compatibility; the vendor driver should never rely on userspace
> > validating compatibility of the migration.  At the point it could do so, the
> > migration has already begun, so we're only testing how quickly we can
> > fail the migration.  The management layer setting up the migration can
> > test via sysfs for compatibility and the migration stream itself needs
> > to be self validating, so what value is added for QEMU to perform a
> > version compatibility test?  Thanks,  
> oh, do you mean vendor driver should embed source device's version in migration
> stream, which is opaque to qemu?
> otherwise, I can't think of a quick way for vendor driver to determine whether
> source device is an incompatible device.  

Yes, the vendor driver cannot rely on the user to make sure the
incoming migration stream is compatible, the vendor driver must take
responsibility for this.  Therefore, regardless of what other
interfaces we have for the user to test the compatibility between
devices, the vendor driver must make no assumptions about the validity
or integrity of the data stream.  Plan for and protect against a
malicious or incompetent user.  Thanks,

Alex
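The sysfs probe discussed in this exchange (read the source type's version attribute, write it into the target's, and treat a failed write as incompatibility) could look roughly like this from the management layer's side. The attribute name and return convention follow Alex's example and are assumptions, since no such interface has been merged:

```c
#include <stdio.h>

/* Hypothetical userspace probe for the proposed interface: read
 * .../mdev_supported_types/<type>/version on the source, then write
 * that string into the same attribute on the target.  On real sysfs
 * the vendor driver would fail the write with an errno for an
 * incompatible version; against ordinary files (as in a mock) the
 * write always succeeds, so this only exercises the plumbing.
 * Returns 0 for compatible, -1 otherwise. */
static int check_version(const char *src_attr, const char *dst_attr)
{
    char ver[256];
    FILE *f = fopen(src_attr, "r");

    if (!f)
        return -1;
    if (!fgets(ver, sizeof(ver), f)) {
        fclose(f);
        return -1;
    }
    fclose(f);

    f = fopen(dst_attr, "w");
    if (!f)
        return -1;
    /* The write itself is the compatibility query. */
    int ok = fputs(ver, f) >= 0;
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

This matches the point made above: a management layer can run such a check long before migration starts, without creating a device or launching QEMU.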
Cornelia Huck April 1, 2019, 8:14 a.m. UTC | #58
On Wed, 27 Mar 2019 16:10:20 -0600
Alex Williamson <alex.williamson@redhat.com> wrote:

> On Wed, 27 Mar 2019 20:18:54 +0000
> "Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:
> 
> > * Zhao Yan (yan.y.zhao@intel.com) wrote:  
> > > On Wed, Feb 20, 2019 at 07:42:42PM +0800, Cornelia Huck wrote:    
> > > > > > > >   b) How do we detect if we're migrating from/to the wrong device or
> > > > > > > > version of device?  Or say to a device with older firmware or perhaps
> > > > > > > > a device that has less device memory ?      
> > > > > > > Actually it's still an open for VFIO migration. Need to think about
> > > > > > > whether it's better to check that in libvirt or qemu (like a device magic
> > > > > > > along with version ?).      
> > > > > 
> > > > > We must keep the hardware generation the same within one POD of public cloud
> > > > > providers. But we are still thinking about live migration from a lower
> > > > > generation of hardware to a higher generation.    
> > > > 
> > > > Agreed, lower->higher is the one direction that might make sense to
> > > > support.
> > > > 
> > > > But regardless of that, I think we need to make sure that incompatible
> > > > devices/versions fail directly instead of failing in a subtle, hard to
> > > > debug way. Might be useful to do some initial sanity checks in libvirt
> > > > as well.
> > > > 
> > > > How easy is it to obtain that information in a form that can be
> > > > consumed by higher layers? Can we find out the device type at least?
> > > > What about some kind of revision?    
> > > hi Alex and Cornelia
> > > for device compatibility, do you think it's a good idea to use "version"
> > > and "device version" fields?
> > > 
> > > version field: identify live migration interface's version. it can have a
> > > sort of backward compatibility, like target machine's version >= source
> > > machine's version. something like that.  
> 
> Don't we essentially already have this via the device specific region?
> The struct vfio_info_cap_header includes id and version fields, so we
> can declare a migration id and increment the version for any
> incompatible changes to the protocol.
> 
> > > 
> > > device_version field consists of two parts:
> > > 1. vendor id : it takes 32 bits. e.g. 0x8086.  
> 
> Who allocates IDs?  If we're going to use PCI vendor IDs, then I'd
> suggest we use a bit to flag it as such so we can reserve that portion
> of the 32bit address space.  See for example:
> 
> #define VFIO_REGION_TYPE_PCI_VENDOR_TYPE        (1 << 31)
> #define VFIO_REGION_TYPE_PCI_VENDOR_MASK        (0xffff)
> 
> For vendor specific regions.

Just browsing through the thread... if I don't misunderstand, we could
use a vfio-ccw region type id here for ccw, couldn't we? Just to make
sure that this is not pci-specific.
Yan Zhao April 1, 2019, 8:40 a.m. UTC | #59
On Mon, Apr 01, 2019 at 04:14:30PM +0800, Cornelia Huck wrote:
> On Wed, 27 Mar 2019 16:10:20 -0600
> Alex Williamson <alex.williamson@redhat.com> wrote:
> 
> > On Wed, 27 Mar 2019 20:18:54 +0000
> > "Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:
> > 
> > > * Zhao Yan (yan.y.zhao@intel.com) wrote:  
> > > > On Wed, Feb 20, 2019 at 07:42:42PM +0800, Cornelia Huck wrote:    
> > > > > > > > >   b) How do we detect if we're migrating from/to the wrong device or
> > > > > > > > > version of device?  Or say to a device with older firmware or perhaps
> > > > > > > > > a device that has less device memory ?      
> > > > > > > > Actually it's still an open for VFIO migration. Need to think about
> > > > > > > > whether it's better to check that in libvirt or qemu (like a device magic
> > > > > > > > along with version ?).      
> > > > > > 
> > > > > > We must keep the hardware generation the same within one POD of public cloud
> > > > > > providers. But we are still thinking about live migration from a lower
> > > > > > generation of hardware to a higher generation.    
> > > > > 
> > > > > Agreed, lower->higher is the one direction that might make sense to
> > > > > support.
> > > > > 
> > > > > But regardless of that, I think we need to make sure that incompatible
> > > > > devices/versions fail directly instead of failing in a subtle, hard to
> > > > > debug way. Might be useful to do some initial sanity checks in libvirt
> > > > > as well.
> > > > > 
> > > > > How easy is it to obtain that information in a form that can be
> > > > > consumed by higher layers? Can we find out the device type at least?
> > > > > What about some kind of revision?    
> > > > hi Alex and Cornelia
> > > > for device compatibility, do you think it's a good idea to use "version"
> > > > and "device version" fields?
> > > > 
> > > > version field: identify live migration interface's version. it can have a
> > > > sort of backward compatibility, like target machine's version >= source
> > > > machine's version. something like that.  
> > 
> > Don't we essentially already have this via the device specific region?
> > The struct vfio_info_cap_header includes id and version fields, so we
> > can declare a migration id and increment the version for any
> > incompatible changes to the protocol.
> > 
> > > > 
> > > > device_version field consists two parts:
> > > > 1. vendor id : it takes 32 bits. e.g. 0x8086.  
> > 
> > Who allocates IDs?  If we're going to use PCI vendor IDs, then I'd
> > suggest we use a bit to flag it as such so we can reserve that portion
> > of the 32bit address space.  See for example:
> > 
> > #define VFIO_REGION_TYPE_PCI_VENDOR_TYPE        (1 << 31)
> > #define VFIO_REGION_TYPE_PCI_VENDOR_MASK        (0xffff)
> > 
> > For vendor specific regions.
> 
> Just browsing through the thread... if I don't misunderstand, we could
> use a vfio-ccw region type id here for ccw, couldn't we? Just to make
> sure that this is not pci-specific.
Could CCW use a bit other than bit 31?
e.g.
#define VFIO_REGION_TYPE_CCW_VENDOR_TYPE        (1 << 30)
then a ccw device would use (VFIO_REGION_TYPE_CCW_VENDOR_TYPE | vendor id) as
the first 32 bits of its device version string.

But as Alex said, we'll not provide an extra region to get the device version,
and the device version is only exported in sysfs, so probably we should define
them as below:
#define VFIO_DEVICE_VERSION_TYPE_PCI (1<<31)
#define VFIO_DEVICE_VERSION_TYPE_CCW (1<<30)

Do you think it's ok?

Thanks
Yan
Alex Williamson April 1, 2019, 2:15 p.m. UTC | #60
On Mon, 1 Apr 2019 04:40:03 -0400
Yan Zhao <yan.y.zhao@intel.com> wrote:

> On Mon, Apr 01, 2019 at 04:14:30PM +0800, Cornelia Huck wrote:
> > On Wed, 27 Mar 2019 16:10:20 -0600
> > Alex Williamson <alex.williamson@redhat.com> wrote:
> >   
> > > On Wed, 27 Mar 2019 20:18:54 +0000
> > > "Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:
> > >   
> > > > * Zhao Yan (yan.y.zhao@intel.com) wrote:    
> > > > > On Wed, Feb 20, 2019 at 07:42:42PM +0800, Cornelia Huck wrote:      
> > > > > > > > > >   b) How do we detect if we're migrating from/to the wrong device or
> > > > > > > > > > version of device?  Or say to a device with older firmware or perhaps
> > > > > > > > > > a device that has less device memory ?        
> > > > > > > > > Actually it's still an open for VFIO migration. Need to think about
> > > > > > > > > whether it's better to check that in libvirt or qemu (like a device magic
> > > > > > > > > along with verion ?).        
> > > > > > > 
> > > > > > > We must keep the hardware generation is the same with one POD of public cloud
> > > > > > > providers. But we still think about the live migration between from the the lower
> > > > > > > generation of hardware migrated to the higher generation.      
> > > > > > 
> > > > > > Agreed, lower->higher is the one direction that might make sense to
> > > > > > support.
> > > > > > 
> > > > > > But regardless of that, I think we need to make sure that incompatible
> > > > > > devices/versions fail directly instead of failing in a subtle, hard to
> > > > > > debug way. Might be useful to do some initial sanity checks in libvirt
> > > > > > as well.
> > > > > > 
> > > > > > How easy is it to obtain that information in a form that can be
> > > > > > consumed by higher layers? Can we find out the device type at least?
> > > > > > What about some kind of revision?      
> > > > > hi Alex and Cornelia
> > > > > for device compatibility, do you think it's a good idea to use "version"
> > > > > and "device version" fields?
> > > > > 
> > > > > version field: identify live migration interface's version. it can have a
> > > > > sort of backward compatibility, like target machine's version >= source
> > > > > machine's version. something like that.    
> > > 
> > > Don't we essentially already have this via the device specific region?
> > > The struct vfio_info_cap_header includes id and version fields, so we
> > > can declare a migration id and increment the version for any
> > > incompatible changes to the protocol.
> > >   
> > > > > 
> > > > > device_version field consists two parts:
> > > > > 1. vendor id : it takes 32 bits. e.g. 0x8086.    
> > > 
> > > Who allocates IDs?  If we're going to use PCI vendor IDs, then I'd
> > > suggest we use a bit to flag it as such so we can reserve that portion
> > > of the 32bit address space.  See for example:
> > > 
> > > #define VFIO_REGION_TYPE_PCI_VENDOR_TYPE        (1 << 31)
> > > #define VFIO_REGION_TYPE_PCI_VENDOR_MASK        (0xffff)
> > > 
> > > For vendor specific regions.  
> > 
> > Just browsing through the thread... if I don't misunderstand, we could
> > use a vfio-ccw region type id here for ccw, couldn't we? Just to make
> > sure that this is not pci-specific.  
> CCW could use another bit other than bit 31?
> e.g.
> #define VFIO_REGION_TYPE_CCW_VENDOR_TYPE        (1 << 30)
> then ccw device use (VFIO_REGION_TYPE_CCW_VENDOR_TYPE | vendor id) as its
> first 32 bit for device version string.
> 
> But as Alex said we'll not provide an extra region to get device version,
> and device version is only exported in sysfs, probably we should define them as
> below:
> #define VFIO_DEVICE_VERSION_TYPE_PCI (1<<31)
> #define VFIO_DEVICE_VERSION_TYPE_CCW (1<<30)
> 
> Do you think it's ok?

We already had this discussion for device specific regions and decided
that CCW doesn't have enough vendors to justify a full subset of the
available address space.  Also, this doesn't need to imply the device
interface; we're simply specifying a vendor registrar such that we can
give each vendor their own namespace, so I don't think it would be a
problem for CCW to specify a namespace using a PCI vendor ID.
Finally, since I'm not sure where we stand with this in the current
proposals, if we do have such a need in the future, maybe we should
make use of an IEEE OUI rather than the PCI database to avoid this
sort of confusion and mis-association.  Thanks,

Alex