[RFC,0/9] Introduce mediate ops in vfio-pci

Message ID 20191205032419.29606-1-yan.y.zhao@intel.com (mailing list archive)

Yan Zhao Dec. 5, 2019, 3:24 a.m. UTC
For SRIOV devices, VFs are passed through to the guest directly, without
host driver mediation. However, when a VM with passed-through VFs is
migrated, dynamic host mediation is required to (1) get device states and
(2) get dirty pages. Since device states, as well as other critical
information required for dirty page tracking for VFs, are usually retrieved
from PFs, it is handy to provide an extension in the PF driver to centrally
control VFs' migration.

Therefore, in order to (1) pass VFs through in normal operation, (2)
dynamically trap VFs' BARs for dirty page tracking, and (3) centralize
retrieval of critical VF state and VF control in one driver, we propose
to introduce mediate ops on top of the current vfio-pci device driver.


                                   _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
                                  
 __________   register mediate ops|  ___________     ___________    |
|          |<-----------------------|     VF    |   |           |   
| vfio-pci |                      | |  mediate  |   | PF driver |   |
|__________|----------------------->|   driver  |   |___________|   
     |            open(pdev)      |  -----------          |         |
     |                                                    |         
     |                            |_ _ _ _ _ _ _ _ _ _ _ _|_ _ _ _ _|
    \|/                                                  \|/
-----------                                         ------------
|    VF   |                                         |    PF    |
-----------                                         ------------


The VF mediate driver could be a standalone driver that does not bind to
any devices (as in the demo code in patches 5-6), or it could be a built-in
extension of the PF driver (as in patches 7-9).

Rather than binding directly to a VF, the VF mediate driver registers its
mediate ops with vfio-pci at driver init. vfio-pci maintains a list of such
mediate ops.
(Note: a VF mediate driver can register mediate ops with vfio-pci
before vfio-pci binds to any device, and a VF mediate driver can
support mediating multiple devices.)

When opening a device (e.g. a VF), vfio-pci walks the mediate ops
list and calls each vfio_pci_mediate_ops->open() with the pdev of the
device being opened as a parameter.
The VF mediate driver should return success or failure depending on
whether it supports the pdev.
E.g. the VF mediate driver would compare its supported VF devfn with the
devfn of the passed-in pdev.
Once vfio-pci finds a vfio_pci_mediate_ops->open() that succeeds, it
stops querying other mediate ops and binds the device being opened to
these mediate ops using the returned mediate handle.
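
The lookup described above can be sketched in standalone user-space C
(not kernel code). Only vfio_pci_mediate_ops and its open() callback are
named in this cover letter; the callback signature, struct pci_dev_sketch,
register_mediate_ops(), bind_mediate_ops() and the demo driver are
hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct pci_dev; only devfn is modeled. */
struct pci_dev_sketch {
	unsigned int devfn;
};

/* Callback table named in the cover letter; this signature is assumed. */
struct vfio_pci_mediate_ops {
	const char *name;
	/* Return 0 and fill *handle if pdev is supported, negative otherwise. */
	int (*open)(const struct pci_dev_sketch *pdev, int *handle);
};

#define MAX_MEDIATE_OPS 8

const struct vfio_pci_mediate_ops *ops_list[MAX_MEDIATE_OPS];
size_t ops_count;

/* Hypothetical registration helper: append ops to the list. */
int register_mediate_ops(const struct vfio_pci_mediate_ops *ops)
{
	if (!ops || !ops->open || ops_count >= MAX_MEDIATE_OPS)
		return -1;
	ops_list[ops_count++] = ops;
	return 0;
}

/* Walk the list and stop at the first ops whose open() succeeds. */
const struct vfio_pci_mediate_ops *
bind_mediate_ops(const struct pci_dev_sketch *pdev, int *handle)
{
	assert(pdev && handle);
	for (size_t i = 0; i < ops_count; i++) {
		if (ops_list[i]->open(pdev, handle) == 0)
			return ops_list[i];
	}
	return NULL; /* nobody claimed the device: plain passthrough */
}

/* Example mediate driver that claims only the VF with devfn 0x10. */
int demo_open(const struct pci_dev_sketch *pdev, int *handle)
{
	if (pdev->devfn != 0x10)
		return -1;
	*handle = 1;
	return 0;
}

const struct vfio_pci_mediate_ops demo_ops = {
	.name = "demo",
	.open = demo_open,
};
```

In this sketch a device that no mediate driver claims simply gets NULL
back, which corresponds to unmediated passthrough.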

Further vfio-pci ops (VFIO_DEVICE_GET_REGION_INFO ioctl, rw, mmap) on the
VF are intercepted and handed to the VF mediate driver as
vfio_pci_mediate_ops->get_region_info(),
vfio_pci_mediate_ops->rw,
vfio_pci_mediate_ops->mmap, where they can be customized.
vfio_pci_mediate_ops->rw and vfio_pci_mediate_ops->mmap additionally
return 'pt' to indicate whether vfio-pci should go on to pass the
access through to hw.
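
A minimal user-space sketch of the 'pt' decision for rw, assuming a
mediate driver that emulates one window of a BAR and lets everything
else reach the hardware. The arrays, the window bounds and the function
names mediate_rw() and vfio_rw_sketch() are hypothetical; the real
callback signature is defined in the patches, not here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BAR_SIZE   64
#define TRAP_START 16
#define TRAP_END   32

uint8_t hw_bar[BAR_SIZE]; /* stands in for the real device BAR */
uint8_t shadow[BAR_SIZE]; /* the mediate driver's emulated copy */

/*
 * Hypothetical mediate rw: handles any access that overlaps the trapped
 * window itself and clears *pt; otherwise sets *pt so the caller passes
 * the access through.
 */
long mediate_rw(uint8_t *buf, size_t count, size_t pos, bool is_write,
		bool *pt)
{
	if (pos >= TRAP_END || pos + count <= TRAP_START) {
		*pt = true;
		return 0;
	}
	*pt = false;
	if (is_write)
		memcpy(shadow + pos, buf, count);
	else
		memcpy(buf, shadow + pos, count);
	return (long)count;
}

/* Caller side: ask the mediate driver first, then honor 'pt'. */
long vfio_rw_sketch(uint8_t *buf, size_t count, size_t pos, bool is_write)
{
	bool pt = true;
	long ret;

	assert(buf);
	if (pos > BAR_SIZE || count > BAR_SIZE - pos)
		return -1; /* out of range */

	ret = mediate_rw(buf, count, pos, is_write, &pt);
	if (!pt)
		return ret; /* fully handled by the mediate driver */

	if (is_write)
		memcpy(hw_bar + pos, buf, count);
	else
		memcpy(buf, hw_bar + pos, count);
	return (long)count;
}
```

Returning 'pt' from the callback keeps the default passthrough path in
one place, so a mediate driver only needs code for the accesses it
actually wants to emulate.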

When vfio-pci closes the VF, it calls the bound
vfio_pci_mediate_ops->release() with the mediate handle as a parameter.

The mediate handle returned from vfio_pci_mediate_ops->open() lets the VF
mediate driver differentiate two open VFs that share the same device
id and vendor id.

When the VF mediate driver exits, it unregisters its mediate ops from
vfio-pci.
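
One way a mediate driver might use the handle, sketched in user-space C:
open() claims a per-device slot and returns its index, and release()
frees it. The slot table and the names mediate_open() and
mediate_release() are hypothetical; the patches may represent the handle
differently.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_OPEN 4

/* Hypothetical per-open state kept by the mediate driver. */
struct vf_instance {
	bool in_use;
	unsigned int devfn;
};

struct vf_instance instances[MAX_OPEN];

/* open(): claim a free slot; its index serves as the mediate handle. */
int mediate_open(unsigned int devfn)
{
	for (int i = 0; i < MAX_OPEN; i++) {
		if (!instances[i].in_use) {
			instances[i].in_use = true;
			instances[i].devfn = devfn;
			return i;
		}
	}
	return -1; /* no free slot */
}

/* release(): the handle alone identifies which open VF is closing. */
void mediate_release(int handle)
{
	assert(handle >= 0 && handle < MAX_OPEN);
	instances[handle].in_use = false;
}
```

Two VFs with identical vendor and device ids get distinct handles here,
so later callbacks can find the right per-device state without looking
at the ids at all.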


In this patchset, we enable vfio-pci to provide 3 things:
(1) calling mediate ops to allow a vendor driver to customize the default
region info/rw/mmap of a region.
(2) providing a migration region to support migration.
(3) providing a dynamic trap bar info region to allow a vendor driver to
control trap/untrap of device PCI BARs.

This vfio-pci + mediate ops way differs from the mdev way in that
(1) the mdev way needs to create a 1:1 mdev device on top of each VF, and a
device-specific mdev parent driver is bound to the VF directly.
(2) the vfio-pci + mediate ops way does not create mdev devices, and the VF
mediate driver does not bind to VFs. Instead, vfio-pci binds to VFs.

The reasons why we did not choose to write an mdev parent driver are
that
(1) VFs are passed through directly almost all the time. Binding directly
to vfio-pci lets most of the code be shared/reused. If we wrote a
vendor-specific mdev parent driver, most of the code (like the passthrough
style of rw/mmap) would still need to be copied from the vfio-pci driver,
which is duplicated and tedious work.
(2) Features like dynamic trap/untrap of PCI BARs, if they are in
vfio-pci, are available to most people without repeated code
copying and re-testing.
(3) With a 1:1 mdev driver that passes VFs through most of the time, people
have to decide whether to bind VFs to vfio-pci or to the mdev parent driver
before a real migration need arises. If vfio-pci is bound
initially, they have no chance to do live migration when the need arises
later.

In this patchset,
- patches 1-4 enable vfio-pci to call mediate ops registered by a vendor
  driver to mediate/customize region info/rw/mmap.

- patches 5-6 provide a standalone sample driver that registers mediate ops
  for Intel Graphics Devices. It does not bind to IGDs directly but decides
  which devices it supports via its pciidlist. It also demonstrates how to
  dynamically trap a device's PCI BARs. (By adding more PCI IDs to its
  pciidlist, this sample driver is not necessarily limited to
  supporting IGDs.)

- patches 7-9 provide a sample in the i40e driver, which supports the
  Intel(R) Ethernet Controller XL710 Family of devices. It supports VF
  precopy live migration on Intel's 710 SRIOV. (We commented out the real
  implementation of the dirty page tracking and device state retrieval
  parts to focus on demonstrating the framework part. We will send them
  out in future versions.)

  patch 7 registers/unregisters VF mediate ops when the PF driver
  probes/removes. It specifies the VFs it supports via
  vfio_pci_mediate_ops->open(pdev).

  patch 8 reports the device cap VFIO_PCI_DEVICE_CAP_MIGRATION and
  provides a sample implementation of the migration region.
  The QEMU part of vfio migration is based on v8
  https://lists.gnu.org/archive/html/qemu-devel/2019-08/msg05542.html.
  We did not base it on the recent v9 because we think there are still
  open issues in the dirty page tracking part of that series.

  patch 9 reports the device cap VFIO_PCI_DEVICE_CAP_DYNAMIC_TRAP_BAR and
  provides an example of how to trap part of bar0 when migration starts
  and pass this part of bar0 through again when migration fails.
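
The trap/untrap behaviour described for patch 9 can be pictured as a
small state machine. This is a user-space sketch under assumed names:
struct bar0_trap, enum migration_event and both functions are
hypothetical, and the cover letter does not say which sub-range of bar0
the i40e sample traps.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical events; the cover letter only says "starts" and "fails". */
enum migration_event {
	MIGRATION_START,
	MIGRATION_FAIL,
};

/* Hypothetical description of the sub-range of bar0 that can be trapped. */
struct bar0_trap {
	size_t offset;
	size_t size;
	bool trapped;
};

/* Trap the window when migration starts, pass it through again on failure. */
void bar0_trap_update(struct bar0_trap *t, enum migration_event ev)
{
	assert(t);
	switch (ev) {
	case MIGRATION_START:
		t->trapped = true;
		break;
	case MIGRATION_FAIL:
		t->trapped = false;
		break;
	}
}

/* True if [pos, pos + count) overlaps the window while it is trapped. */
bool bar0_access_trapped(const struct bar0_trap *t, size_t pos, size_t count)
{
	assert(t);
	if (!t->trapped || count == 0)
		return false;
	return pos < t->offset + t->size && pos + count > t->offset;
}
```

Outside migration every access takes the passthrough path, which is the
performance argument for trapping dynamically rather than permanently.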

Yan Zhao (9):
  vfio/pci: introduce mediate ops to intercept vfio-pci ops
  vfio/pci: test existence before calling region->ops
  vfio/pci: register a default migration region
  vfio-pci: register default dynamic-trap-bar-info region
  samples/vfio-pci/igd_dt: sample driver to mediate a passthrough IGD
  sample/vfio-pci/igd_dt: dynamically trap/untrap subregion of IGD bar0
  i40e/vf_migration: register mediate_ops to vfio-pci
  i40e/vf_migration: mediate migration region
  i40e/vf_migration: support dynamic trap of bar0

 drivers/net/ethernet/intel/Kconfig            |   2 +-
 drivers/net/ethernet/intel/i40e/Makefile      |   3 +-
 drivers/net/ethernet/intel/i40e/i40e.h        |   2 +
 drivers/net/ethernet/intel/i40e/i40e_main.c   |   3 +
 .../ethernet/intel/i40e/i40e_vf_migration.c   | 626 ++++++++++++++++++
 .../ethernet/intel/i40e/i40e_vf_migration.h   |  78 +++
 drivers/vfio/pci/vfio_pci.c                   | 189 +++++-
 drivers/vfio/pci/vfio_pci_private.h           |   2 +
 include/linux/vfio.h                          |  18 +
 include/uapi/linux/vfio.h                     | 160 +++++
 samples/Kconfig                               |   6 +
 samples/Makefile                              |   1 +
 samples/vfio-pci/Makefile                     |   2 +
 samples/vfio-pci/igd_dt.c                     | 367 ++++++++++
 14 files changed, 1455 insertions(+), 4 deletions(-)
 create mode 100644 drivers/net/ethernet/intel/i40e/i40e_vf_migration.c
 create mode 100644 drivers/net/ethernet/intel/i40e/i40e_vf_migration.h
 create mode 100644 samples/vfio-pci/Makefile
 create mode 100644 samples/vfio-pci/igd_dt.c

Comments

Jason Wang Dec. 5, 2019, 6:33 a.m. UTC | #1
Hi:

On 2019/12/5 11:24 AM, Yan Zhao wrote:
> For SRIOV devices, VFs are passthroughed into guest directly without host
> driver mediation. However, when VMs migrating with passthroughed VFs,
> dynamic host mediation is required to  (1) get device states, (2) get
> dirty pages. Since device states as well as other critical information
> required for dirty page tracking for VFs are usually retrieved from PFs,
> it is handy to provide an extension in PF driver to centralizingly control
> VFs' migration.
>
> Therefore, in order to realize (1) passthrough VFs at normal time, (2)
> dynamically trap VFs' bars for dirty page tracking and


A silly question: what's the reason for doing this? Is this a must for
dirty page tracking?


>   (3) centralizing
> VF critical states retrieving and VF controls into one driver, we propose
> to introduce mediate ops on top of current vfio-pci device driver.
>
>
>                                     _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
>                                    
>   __________   register mediate ops|  ___________     ___________    |
> |          |<-----------------------|     VF    |   |           |
> | vfio-pci |                      | |  mediate  |   | PF driver |   |
> |__________|----------------------->|   driver  |   |___________|
>       |            open(pdev)      |  -----------          |         |
>       |                                                    |
>       |                            |_ _ _ _ _ _ _ _ _ _ _ _|_ _ _ _ _|
>      \|/                                                  \|/
> -----------                                         ------------
> |    VF   |                                         |    PF    |
> -----------                                         ------------
>
>
> VF mediate driver could be a standalone driver that does not bind to
> any devices (as in demo code in patches 5-6) or it could be a built-in
> extension of PF driver (as in patches 7-9) .
>
> Rather than directly bind to VF, VF mediate driver register a mediate
> ops into vfio-pci in driver init. vfio-pci maintains a list of such
> mediate ops.
> (Note that: VF mediate driver can register mediate ops into vfio-pci
> before vfio-pci binding to any devices. And VF mediate driver can
> support mediating multiple devices.)
>
> When opening a device (e.g. a VF), vfio-pci goes through the mediate ops
> list and calls each vfio_pci_mediate_ops->open() with pdev of the opening
> device as a parameter.
> VF mediate driver should return success or failure depending on it
> supports the pdev or not.
> E.g. VF mediate driver would compare its supported VF devfn with the
> devfn of the passed-in pdev.
> Once vfio-pci finds a successful vfio_pci_mediate_ops->open(), it will
> stop querying other mediate ops and bind the opening device with this
> mediate ops using the returned mediate handle.
>
> Further vfio-pci ops (VFIO_DEVICE_GET_REGION_INFO ioctl, rw, mmap) on the
> VF will be intercepted into VF mediate driver as
> vfio_pci_mediate_ops->get_region_info(),
> vfio_pci_mediate_ops->rw,
> vfio_pci_mediate_ops->mmap, and get customized.
> For vfio_pci_mediate_ops->rw and vfio_pci_mediate_ops->mmap, they will
> further return 'pt' to indicate whether vfio-pci should further
> passthrough data to hw.
>
> when vfio-pci closes the VF, it calls its vfio_pci_mediate_ops->release()
> with a mediate handle as parameter.
>
> The mediate handle returned from vfio_pci_mediate_ops->open() lets VF
> mediate driver be able to differentiate two opening VFs of the same device
> id and vendor id.
>
> When VF mediate driver exits, it unregisters its mediate ops from
> vfio-pci.
>
>
> In this patchset, we enable vfio-pci to provide 3 things:
> (1) calling mediate ops to allow vendor driver customizing default
> region info/rw/mmap of a region.
> (2) provide a migration region to support migration


What's the benefit of introducing a region? It looks to me like we don't
expect the region to be accessed directly from the guest. Could we simply
extend the device fd ioctls for such things?


> (3) provide a dynamic trap bar info region to allow vendor driver
> control trap/untrap of device pci bars
>
> This vfio-pci + mediate ops way differs from mdev way in that
> (1) medv way needs to create a 1:1 mdev device on top of one VF, device
> specific mdev parent driver is bound to VF directly.
> (2) vfio-pci + mediate ops way does not create mdev devices and VF
> mediate driver does not bind to VFs. Instead, vfio-pci binds to VFs.
>
> The reason why we don't choose the way of writing mdev parent driver is
> that
> (1) VFs are almost all the time directly passthroughed. Directly binding
> to vfio-pci can make most of the code shared/reused.


Can we split out the common parts from vfio-pci?


>   If we write a
> vendor specific mdev parent driver, most of the code (like passthrough
> style of rw/mmap) still needs to be copied from vfio-pci driver, which is
> actually a duplicated and tedious work.


The mediate ops look quite similar to what vfio-mdev does. And it looks
to me like we need to consider live migration for mdev as well. In that
case, do we still expect mediate ops through VFIO directly?


> (2) For features like dynamically trap/untrap pci bars, if they are in
> vfio-pci, they can be available to most people without repeated code
> copying and re-testing.
> (3) with a 1:1 mdev driver which passthrough VFs most of the time, people
> have to decide whether to bind VFs to vfio-pci or mdev parent driver before
> it runs into a real migration need. However, if vfio-pci is bound
> initially, they have no chance to do live migration when there's a need
> later.


We can teach the management layer to do this.

Thanks


Yan Zhao Dec. 5, 2019, 8:51 a.m. UTC | #2
On Thu, Dec 05, 2019 at 02:33:19PM +0800, Jason Wang wrote:
> Hi:
> 
> On 2019/12/5 11:24 AM, Yan Zhao wrote:
> > For SRIOV devices, VFs are passthroughed into guest directly without host
> > driver mediation. However, when VMs migrating with passthroughed VFs,
> > dynamic host mediation is required to  (1) get device states, (2) get
> > dirty pages. Since device states as well as other critical information
> > required for dirty page tracking for VFs are usually retrieved from PFs,
> > it is handy to provide an extension in PF driver to centralizingly control
> > VFs' migration.
> > 
> > Therefore, in order to realize (1) passthrough VFs at normal time, (2)
> > dynamically trap VFs' bars for dirty page tracking and
> 
> 
> A silly question, what's the reason for doing this, is this a must for dirty
> page tracking?
>
For performance reasons. VFs' BARs should be passed through in normal
operation and only enter the trap state when needed.

> 
> >   (3) centralizing
> > VF critical states retrieving and VF controls into one driver, we propose
> > to introduce mediate ops on top of current vfio-pci device driver.
> > 
> > 
> >                                     _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> >   __________   register mediate ops|  ___________     ___________    |
> > |          |<-----------------------|     VF    |   |           |
> > | vfio-pci |                      | |  mediate  |   | PF driver |   |
> > |__________|----------------------->|   driver  |   |___________|
> >       |            open(pdev)      |  -----------          |         |
> >       |                                                    |
> >       |                            |_ _ _ _ _ _ _ _ _ _ _ _|_ _ _ _ _|
> >      \|/                                                  \|/
> > -----------                                         ------------
> > |    VF   |                                         |    PF    |
> > -----------                                         ------------
> > 
> > 
> > VF mediate driver could be a standalone driver that does not bind to
> > any devices (as in demo code in patches 5-6) or it could be a built-in
> > extension of PF driver (as in patches 7-9) .
> > 
> > Rather than directly bind to VF, VF mediate driver register a mediate
> > ops into vfio-pci in driver init. vfio-pci maintains a list of such
> > mediate ops.
> > (Note that: VF mediate driver can register mediate ops into vfio-pci
> > before vfio-pci binding to any devices. And VF mediate driver can
> > support mediating multiple devices.)
> > 
> > When opening a device (e.g. a VF), vfio-pci goes through the mediate ops
> > list and calls each vfio_pci_mediate_ops->open() with pdev of the opening
> > device as a parameter.
> > VF mediate driver should return success or failure depending on it
> > supports the pdev or not.
> > E.g. VF mediate driver would compare its supported VF devfn with the
> > devfn of the passed-in pdev.
> > Once vfio-pci finds a successful vfio_pci_mediate_ops->open(), it will
> > stop querying other mediate ops and bind the opening device with this
> > mediate ops using the returned mediate handle.
> > 
> > Further vfio-pci ops (VFIO_DEVICE_GET_REGION_INFO ioctl, rw, mmap) on the
> > VF will be intercepted into VF mediate driver as
> > vfio_pci_mediate_ops->get_region_info(),
> > vfio_pci_mediate_ops->rw,
> > vfio_pci_mediate_ops->mmap, and get customized.
> > For vfio_pci_mediate_ops->rw and vfio_pci_mediate_ops->mmap, they will
> > further return 'pt' to indicate whether vfio-pci should further
> > passthrough data to hw.
> > 
> > when vfio-pci closes the VF, it calls its vfio_pci_mediate_ops->release()
> > with a mediate handle as parameter.
> > 
> > The mediate handle returned from vfio_pci_mediate_ops->open() lets VF
> > mediate driver be able to differentiate two opening VFs of the same device
> > id and vendor id.
> > 
> > When VF mediate driver exits, it unregisters its mediate ops from
> > vfio-pci.
> > 
> > 
> > In this patchset, we enable vfio-pci to provide 3 things:
> > (1) calling mediate ops to allow vendor driver customizing default
> > region info/rw/mmap of a region.
> > (2) provide a migration region to support migration
> 
> 
> What's the benefit of introducing a region? It looks to me we don't expect
> the region to be accessed directly from guest. Could we simply extend device
> fd ioctl for doing such things?
>
You may take a look at the mdev live migration discussions in
https://lists.gnu.org/archive/html/qemu-devel/2019-11/msg01763.html

or the previous discussion at
https://lists.gnu.org/archive/html/qemu-devel/2019-02/msg04908.html,
which has a kernel side implementation: https://patchwork.freedesktop.org/series/56876/

Generally speaking, the QEMU part of live migration is the same for the
vfio-pci + mediate ops way and the mdev way. The region is only a channel for
QEMU and the kernel to communicate information without introducing IOCTLs.


> 
> > (3) provide a dynamic trap bar info region to allow vendor driver
> > control trap/untrap of device pci bars
> > 
> > This vfio-pci + mediate ops way differs from mdev way in that
> > (1) medv way needs to create a 1:1 mdev device on top of one VF, device
> > specific mdev parent driver is bound to VF directly.
> > (2) vfio-pci + mediate ops way does not create mdev devices and VF
> > mediate driver does not bind to VFs. Instead, vfio-pci binds to VFs.
> > 
> > The reason why we don't choose the way of writing mdev parent driver is
> > that
> > (1) VFs are almost all the time directly passthroughed. Directly binding
> > to vfio-pci can make most of the code shared/reused.
> 
> 
> Can we split out the common parts from vfio-pci?
> 
That's very attractive, but one cannot implement a vfio-pci without
exporting everything in it as the common part :)
> 
> >   If we write a
> > vendor specific mdev parent driver, most of the code (like passthrough
> > style of rw/mmap) still needs to be copied from vfio-pci driver, which is
> > actually a duplicated and tedious work.
> 
> 
> The mediate ops looks quite similar to what vfio-mdev did. And it looks to
> me we need to consider live migration for mdev as well. In that case, do we
> still expect mediate ops through VFIO directly?
> 
> 
> > (2) For features like dynamically trap/untrap pci bars, if they are in
> > vfio-pci, they can be available to most people without repeated code
> > copying and re-testing.
> > (3) with a 1:1 mdev driver which passthrough VFs most of the time, people
> > have to decide whether to bind VFs to vfio-pci or mdev parent driver before
> > it runs into a real migration need. However, if vfio-pci is bound
> > initially, they have no chance to do live migration when there's a need
> > later.
> 
> 
> We can teach management layer to do this.
> 
No, that's not possible, as vfio-pci by default has no migration region,
and dirty page tracking needs the vendor's mediation, at least for most
passthrough devices now.

Thanks
Yan

Jason Wang Dec. 5, 2019, 1:05 p.m. UTC | #3
On 2019/12/5 4:51 PM, Yan Zhao wrote:
> On Thu, Dec 05, 2019 at 02:33:19PM +0800, Jason Wang wrote:
>> Hi:
>>
>> On 2019/12/5 11:24 AM, Yan Zhao wrote:
>>> For SRIOV devices, VFs are passthroughed into guest directly without host
>>> driver mediation. However, when VMs migrating with passthroughed VFs,
>>> dynamic host mediation is required to  (1) get device states, (2) get
>>> dirty pages. Since device states as well as other critical information
>>> required for dirty page tracking for VFs are usually retrieved from PFs,
>>> it is handy to provide an extension in PF driver to centralizingly control
>>> VFs' migration.
>>>
>>> Therefore, in order to realize (1) passthrough VFs at normal time, (2)
>>> dynamically trap VFs' bars for dirty page tracking and
>>
>> A silly question, what's the reason for doing this, is this a must for dirty
>> page tracking?
>>
> For performance consideration. VFs' bars should be passthoughed at
> normal time and only enter into trap state on need.


Right, but how does this matter for the case of dirty page tracking?


>
>>>    (3) centralizing
>>> VF critical states retrieving and VF controls into one driver, we propose
>>> to introduce mediate ops on top of current vfio-pci device driver.
>>>
>>>
>>>                                      _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
>>>    __________   register mediate ops|  ___________     ___________    |
>>> |          |<-----------------------|     VF    |   |           |
>>> | vfio-pci |                      | |  mediate  |   | PF driver |   |
>>> |__________|----------------------->|   driver  |   |___________|
>>>        |            open(pdev)      |  -----------          |         |
>>>        |                                                    |
>>>        |                            |_ _ _ _ _ _ _ _ _ _ _ _|_ _ _ _ _|
>>>       \|/                                                  \|/
>>> -----------                                         ------------
>>> |    VF   |                                         |    PF    |
>>> -----------                                         ------------
>>>
>>>
>>> VF mediate driver could be a standalone driver that does not bind to
>>> any devices (as in demo code in patches 5-6) or it could be a built-in
>>> extension of PF driver (as in patches 7-9) .
>>>
>>> Rather than directly bind to VF, VF mediate driver register a mediate
>>> ops into vfio-pci in driver init. vfio-pci maintains a list of such
>>> mediate ops.
>>> (Note that: VF mediate driver can register mediate ops into vfio-pci
>>> before vfio-pci binding to any devices. And VF mediate driver can
>>> support mediating multiple devices.)
>>>
>>> When opening a device (e.g. a VF), vfio-pci goes through the mediate ops
>>> list and calls each vfio_pci_mediate_ops->open() with pdev of the opening
>>> device as a parameter.
>>> VF mediate driver should return success or failure depending on it
>>> supports the pdev or not.
>>> E.g. VF mediate driver would compare its supported VF devfn with the
>>> devfn of the passed-in pdev.
>>> Once vfio-pci finds a successful vfio_pci_mediate_ops->open(), it will
>>> stop querying other mediate ops and bind the opening device with this
>>> mediate ops using the returned mediate handle.
>>>
>>> Further vfio-pci ops (VFIO_DEVICE_GET_REGION_INFO ioctl, rw, mmap) on the
>>> VF will be intercepted into VF mediate driver as
>>> vfio_pci_mediate_ops->get_region_info(),
>>> vfio_pci_mediate_ops->rw,
>>> vfio_pci_mediate_ops->mmap, and get customized.
>>> For vfio_pci_mediate_ops->rw and vfio_pci_mediate_ops->mmap, they will
>>> further return 'pt' to indicate whether vfio-pci should further
>>> passthrough data to hw.
>>>
>>> when vfio-pci closes the VF, it calls its vfio_pci_mediate_ops->release()
>>> with a mediate handle as parameter.
>>>
>>> The mediate handle returned from vfio_pci_mediate_ops->open() lets VF
>>> mediate driver be able to differentiate two opening VFs of the same device
>>> id and vendor id.
>>>
>>> When VF mediate driver exits, it unregisters its mediate ops from
>>> vfio-pci.
>>>
>>>
>>> In this patchset, we enable vfio-pci to provide 3 things:
>>> (1) calling mediate ops to allow vendor driver customizing default
>>> region info/rw/mmap of a region.
>>> (2) provide a migration region to support migration
>>
>> What's the benefit of introducing a region? It looks to me we don't expect
>> the region to be accessed directly from guest. Could we simply extend device
>> fd ioctl for doing such things?
>>
> You may take a look on mdev live migration discussions in
> https://lists.gnu.org/archive/html/qemu-devel/2019-11/msg01763.html
>
> or previous discussion at
> https://lists.gnu.org/archive/html/qemu-devel/2019-02/msg04908.html,
> which has kernel side implementation https://patchwork.freedesktop.org/series/56876/
>
> generally speaking, qemu part of live migration is consistent for
> vfio-pci + mediate ops way or mdev way.


So in mdev, do you still have a mediate driver? Or do you expect the parent
to implement the region?


> The region is only a channel for
> QEMU and kernel to communicate information without introducing IOCTLs.


Well, at least you introduce a new type of region in the uapi. So this does
not answer why a region is better than an ioctl. If the region will only be
used by qemu, using an ioctl is much easier and more straightforward.


>
>
>>> (3) provide a dynamic trap bar info region to allow vendor driver
>>> control trap/untrap of device pci bars
>>>
>>> This vfio-pci + mediate ops way differs from mdev way in that
>>> (1) mdev way needs to create a 1:1 mdev device on top of one VF, device
>>> specific mdev parent driver is bound to VF directly.
>>> (2) vfio-pci + mediate ops way does not create mdev devices and VF
>>> mediate driver does not bind to VFs. Instead, vfio-pci binds to VFs.
>>>
>>> The reason why we don't choose the way of writing mdev parent driver is
>>> that
>>> (1) VFs are almost all the time directly passthroughed. Directly binding
>>> to vfio-pci can make most of the code shared/reused.
>>
>> Can we split out the common parts from vfio-pci?
>>
> That's very attractive. but one cannot implement a vfio-pci except
> export everything in it as common part :)


Well, I think it should not be hard to do that. E.g. you can route it
back like:

vfio -> vfio_mdev -> parent -> vfio_pci


>>>    If we write a
>>> vendor specific mdev parent driver, most of the code (like passthrough
>>> style of rw/mmap) still needs to be copied from vfio-pci driver, which is
>>> actually a duplicated and tedious work.
>>
>> The mediate ops looks quite similar to what vfio-mdev did. And it looks to
>> me we need to consider live migration for mdev as well. In that case, do we
>> still expect mediate ops through VFIO directly?
>>
>>
>>> (2) For features like dynamically trap/untrap pci bars, if they are in
>>> vfio-pci, they can be available to most people without repeated code
>>> copying and re-testing.
>>> (3) with a 1:1 mdev driver which passthrough VFs most of the time, people
>>> have to decide whether to bind VFs to vfio-pci or mdev parent driver before
>>> it runs into a real migration need. However, if vfio-pci is bound
>>> initially, they have no chance to do live migration when there's a need
>>> later.
>>
>> We can teach management layer to do this.
>>
> No. not possible as vfio-pci by default has no migration region and
> dirty page tracking needs vendor's mediation at least for most
> passthrough devices now.


I'm not quite sure I get this, but in this case, just teach them to use
the driver that has migration support?

Thanks


>
> Thanks
> Yan
>
>> Thanks
>>
>>
>>> In this patchset,
>>> - patches 1-4 enable vfio-pci to call mediate ops registered by vendor
>>>     driver to mediate/customize region info/rw/mmap.
>>>
>>> - patches 5-6 provide a standalone sample driver to register a mediate ops
>>>     for Intel Graphics Devices. It does not bind to IGDs directly but decides
>>>     what devices it supports via its pciidlist. It also demonstrates how to
>>>     dynamic trap a device's PCI bars. (by adding more pciids in its
>>>     pciidlist, this sample driver actually is not necessarily limited to
>>>     support IGDs)
>>>
>>> - patch 7-9 provide a sample on i40e driver that supports Intel(R)
>>>     Ethernet Controller XL710 Family of devices. It supports VF precopy live
>>>     migration on Intel's 710 SRIOV. (but we commented out the real
>>>     implementation of dirty page tracking and device state retrieving part
>>>     to focus on demonstrating framework part. Will send out them in future
>>>     versions)
>>>     patch 7 registers/unregisters VF mediate ops when PF driver
>>>     probes/removes. It specifies its supporting VFs via
>>>     vfio_pci_mediate_ops->open(pdev)
>>>
>>>     patch 8 reports device cap of VFIO_PCI_DEVICE_CAP_MIGRATION and
>>>     provides a sample implementation of migration region.
>>>     The QEMU part of vfio migration is based on v8
>>>     https://lists.gnu.org/archive/html/qemu-devel/2019-08/msg05542.html.
>>>     We do not based on recent v9 because we think there are still opens in
>>>     dirty page track part in that series.
>>>
>>>     patch 9 reports device cap of VFIO_PCI_DEVICE_CAP_DYNAMIC_TRAP_BAR and
>>>     provides an example on how to trap part of bar0 when migration starts
>>>     and passthrough this part of bar0 again when migration fails.
>>>
>>> Yan Zhao (9):
>>>     vfio/pci: introduce mediate ops to intercept vfio-pci ops
>>>     vfio/pci: test existence before calling region->ops
>>>     vfio/pci: register a default migration region
>>>     vfio-pci: register default dynamic-trap-bar-info region
>>>     samples/vfio-pci/igd_dt: sample driver to mediate a passthrough IGD
>>>     sample/vfio-pci/igd_dt: dynamically trap/untrap subregion of IGD bar0
>>>     i40e/vf_migration: register mediate_ops to vfio-pci
>>>     i40e/vf_migration: mediate migration region
>>>     i40e/vf_migration: support dynamic trap of bar0
>>>
>>>    drivers/net/ethernet/intel/Kconfig            |   2 +-
>>>    drivers/net/ethernet/intel/i40e/Makefile      |   3 +-
>>>    drivers/net/ethernet/intel/i40e/i40e.h        |   2 +
>>>    drivers/net/ethernet/intel/i40e/i40e_main.c   |   3 +
>>>    .../ethernet/intel/i40e/i40e_vf_migration.c   | 626 ++++++++++++++++++
>>>    .../ethernet/intel/i40e/i40e_vf_migration.h   |  78 +++
>>>    drivers/vfio/pci/vfio_pci.c                   | 189 +++++-
>>>    drivers/vfio/pci/vfio_pci_private.h           |   2 +
>>>    include/linux/vfio.h                          |  18 +
>>>    include/uapi/linux/vfio.h                     | 160 +++++
>>>    samples/Kconfig                               |   6 +
>>>    samples/Makefile                              |   1 +
>>>    samples/vfio-pci/Makefile                     |   2 +
>>>    samples/vfio-pci/igd_dt.c                     | 367 ++++++++++
>>>    14 files changed, 1455 insertions(+), 4 deletions(-)
>>>    create mode 100644 drivers/net/ethernet/intel/i40e/i40e_vf_migration.c
>>>    create mode 100644 drivers/net/ethernet/intel/i40e/i40e_vf_migration.h
>>>    create mode 100644 samples/vfio-pci/Makefile
>>>    create mode 100644 samples/vfio-pci/igd_dt.c
>>>
Yan Zhao Dec. 6, 2019, 8:22 a.m. UTC | #4
On Thu, Dec 05, 2019 at 09:05:54PM +0800, Jason Wang wrote:
> 
> On 2019/12/5 4:51 PM, Yan Zhao wrote:
> > On Thu, Dec 05, 2019 at 02:33:19PM +0800, Jason Wang wrote:
> >> Hi:
> >>
> >> On 2019/12/5 11:24 AM, Yan Zhao wrote:
> >>> For SRIOV devices, VFs are passthroughed into guest directly without host
> >>> driver mediation. However, when VMs migrating with passthroughed VFs,
> >>> dynamic host mediation is required to  (1) get device states, (2) get
> >>> dirty pages. Since device states as well as other critical information
> >>> required for dirty page tracking for VFs are usually retrieved from PFs,
> >>> it is handy to provide an extension in PF driver to centralizingly control
> >>> VFs' migration.
> >>>
> >>> Therefore, in order to realize (1) passthrough VFs at normal time, (2)
> >>> dynamically trap VFs' bars for dirty page tracking and
> >>
> >> A silly question, what's the reason for doing this, is this a must for dirty
> >> page tracking?
> >>
> > For performance consideration. VFs' bars should be passthroughed at
> > normal time and only enter into trap state on need.
> 
> 
> Right, but how does this matter for the case of dirty page tracking?
>
Take a NIC as an example: to track the dirty pages of its VF, a software
approach has to trap every write to the ring tail, which resides in BAR0.
There is still no IOMMU dirty bit available.
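
To make the tail-trapping idea concrete, here is a minimal userspace sketch,
not code from this series. It assumes an RX ring where each descriptor points
at one guest buffer; the names (struct rx_desc, struct vf_ring,
on_tail_write) and the sizes are hypothetical. When a trapped tail write
hands descriptors to the device, the pages behind those buffers are
conservatively marked dirty.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SHIFT    12
#define RING_SIZE     8u    /* hypothetical number of descriptors */
#define TRACKED_PAGES 64u   /* hypothetical guest memory size, in pages */

/* Hypothetical RX descriptor: guest-physical buffer address and length. */
struct rx_desc {
	uint64_t buf_gpa;
	uint32_t len;
};

struct vf_ring {
	struct rx_desc desc[RING_SIZE];
	uint32_t tail;                     /* last tail value seen by the host */
	uint8_t dirty[TRACKED_PAGES / 8];  /* one bit per guest page */
};

/* Set the dirty bit of every tracked page touched by [gpa, gpa + len). */
static void mark_dirty(struct vf_ring *r, uint64_t gpa, uint32_t len)
{
	if (len == 0)
		return;
	uint64_t first = gpa >> PAGE_SHIFT;
	uint64_t last = (gpa + len - 1) >> PAGE_SHIFT;

	for (uint64_t pfn = first; pfn <= last && pfn < TRACKED_PAGES; pfn++)
		r->dirty[pfn / 8] |= (uint8_t)(1u << (pfn % 8));
}

/*
 * Called when a guest write to the (trapped) tail register is intercepted.
 * Descriptors in [old tail, new tail) are now owned by the device, so their
 * buffers may be written by DMA and are treated as dirty.
 */
static void on_tail_write(struct vf_ring *r, uint32_t new_tail)
{
	assert(new_tail < RING_SIZE);
	for (uint32_t i = r->tail; i != new_tail; i = (i + 1) % RING_SIZE)
		mark_dirty(r, r->desc[i].buf_gpa, r->desc[i].len);
	r->tail = new_tail;
	/* A real mediate driver would forward the write to hardware here. */
}

static int page_is_dirty(const struct vf_ring *r, uint64_t pfn)
{
	return (r->dirty[pfn / 8] >> (pfn % 8)) & 1;
}
```

The cost of this approach is that every tail update takes a trap, which is
why the bar is only trapped while migration is in progress.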
> 
> >
> >>>    (3) centralizing
> >>> VF critical states retrieving and VF controls into one driver, we propose
> >>> to introduce mediate ops on top of current vfio-pci device driver.
> >>>
> >>>
> >>>                                      _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> >>>    __________   register mediate ops|  ___________     ___________    |
> >>> |          |<-----------------------|     VF    |   |           |
> >>> | vfio-pci |                      | |  mediate  |   | PF driver |   |
> >>> |__________|----------------------->|   driver  |   |___________|
> >>>        |            open(pdev)      |  -----------          |         |
> >>>        |                                                    |
> >>>        |                            |_ _ _ _ _ _ _ _ _ _ _ _|_ _ _ _ _|
> >>>       \|/                                                  \|/
> >>> -----------                                         ------------
> >>> |    VF   |                                         |    PF    |
> >>> -----------                                         ------------
> >>>
> >>>
> >>> VF mediate driver could be a standalone driver that does not bind to
> >>> any devices (as in demo code in patches 5-6) or it could be a built-in
> >>> extension of PF driver (as in patches 7-9) .
> >>>
> >>> Rather than directly bind to VF, VF mediate driver register a mediate
> >>> ops into vfio-pci in driver init. vfio-pci maintains a list of such
> >>> mediate ops.
> >>> (Note that: VF mediate driver can register mediate ops into vfio-pci
> >>> before vfio-pci binding to any devices. And VF mediate driver can
> >>> support mediating multiple devices.)
> >>>
> >>> When opening a device (e.g. a VF), vfio-pci goes through the mediate ops
> >>> list and calls each vfio_pci_mediate_ops->open() with pdev of the opening
> >>> device as a parameter.
> >>> VF mediate driver should return success or failure depending on it
> >>> supports the pdev or not.
> >>> E.g. VF mediate driver would compare its supported VF devfn with the
> >>> devfn of the passed-in pdev.
> >>> Once vfio-pci finds a successful vfio_pci_mediate_ops->open(), it will
> >>> stop querying other mediate ops and bind the opening device with this
> >>> mediate ops using the returned mediate handle.
> >>>
> >>> Further vfio-pci ops (VFIO_DEVICE_GET_REGION_INFO ioctl, rw, mmap) on the
> >>> VF will be intercepted into VF mediate driver as
> >>> vfio_pci_mediate_ops->get_region_info(),
> >>> vfio_pci_mediate_ops->rw,
> >>> vfio_pci_mediate_ops->mmap, and get customized.
> >>> For vfio_pci_mediate_ops->rw and vfio_pci_mediate_ops->mmap, they will
> >>> further return 'pt' to indicate whether vfio-pci should further
> >>> passthrough data to hw.
> >>>
> >>> when vfio-pci closes the VF, it calls its vfio_pci_mediate_ops->release()
> >>> with a mediate handle as parameter.
> >>>
> >>> The mediate handle returned from vfio_pci_mediate_ops->open() lets VF
> >>> mediate driver be able to differentiate two opening VFs of the same device
> >>> id and vendor id.
> >>>
> >>> When VF mediate driver exits, it unregisters its mediate ops from
> >>> vfio-pci.
> >>>
> >>>
> >>> In this patchset, we enable vfio-pci to provide 3 things:
> >>> (1) calling mediate ops to allow vendor driver customizing default
> >>> region info/rw/mmap of a region.
> >>> (2) provide a migration region to support migration
> >>
> >> What's the benefit of introducing a region? It looks to me we don't expect
> >> the region to be accessed directly from guest. Could we simply extend device
> >> fd ioctl for doing such things?
> >>
> > You may take a look on mdev live migration discussions in
> > https://lists.gnu.org/archive/html/qemu-devel/2019-11/msg01763.html
> >
> > or previous discussion at
> > https://lists.gnu.org/archive/html/qemu-devel/2019-02/msg04908.html,
> > which has kernel side implementation https://patchwork.freedesktop.org/series/56876/
> >
> > generally speaking, qemu part of live migration is consistent for
> > vfio-pci + mediate ops way or mdev way.
> 
> 
> So in mdev, do you still have a mediate driver? Or you expect the parent 
> to implement the region?
> 
No, currently it's only for vfio-pci.
An mdev parent driver is free to customize its regions and hence does not
require these mediate ops hooks.

> 
> > The region is only a channel for
> > QEMU and kernel to communicate information without introducing IOCTLs.
> 
> 
> Well, at least you introduce a new type of region in the uapi. So this does
> not answer why a region is better than an ioctl. If the region will only be
> used by qemu, using an ioctl is much easier and more straightforward.
>
It's not introduced by me :)
mdev live migration already works this way; I'm just staying
compatible with the uapi.

From my own perspective, my answer is that a region is more flexible
than an ioctl. The vendor driver can freely define the size and mmap
capability of its data subregion. Also, there are already too many ioctls
in vfio.
> 
> >
> >
> >>> (3) provide a dynamic trap bar info region to allow vendor driver
> >>> control trap/untrap of device pci bars
> >>>
> >>> This vfio-pci + mediate ops way differs from mdev way in that
> >>> (1) mdev way needs to create a 1:1 mdev device on top of one VF, device
> >>> specific mdev parent driver is bound to VF directly.
> >>> (2) vfio-pci + mediate ops way does not create mdev devices and VF
> >>> mediate driver does not bind to VFs. Instead, vfio-pci binds to VFs.
> >>>
> >>> The reason why we don't choose the way of writing mdev parent driver is
> >>> that
> >>> (1) VFs are almost all the time directly passthroughed. Directly binding
> >>> to vfio-pci can make most of the code shared/reused.
> >>
> >> Can we split out the common parts from vfio-pci?
> >>
> > That's very attractive. but one cannot implement a vfio-pci except
> > export everything in it as common part :)
> 
> 
> Well, I think it should not be hard to do that. E.g. you can route it
> back like:
> 
> vfio -> vfio_mdev -> parent -> vfio_pci
>
We want the mediate driver to bind to the PF device,
so that once a VF device is created, only the PF driver and vfio-pci are
required, just as for a normal VF passthrough.
Otherwise, a separate parent driver binding to the VF is required.
Also, such a parent driver has many drawbacks, as I mentioned in the
cover letter.
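
For illustration only, the open() matching flow described in the cover letter
can be modeled in userspace as below. This is a simplified sketch, not the
code in patch 1: struct fake_pdev, struct mediate_ops, register_mediate_ops()
and device_open() are hypothetical stand-ins, and the real
vfio_pci_mediate_ops signatures may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for struct pci_dev; only the field used for matching. */
struct fake_pdev {
	unsigned int devfn;
};

/* Hypothetical shape of the ops table registered by a mediate driver. */
struct mediate_ops {
	const char *name;
	/* Returns 0 and fills *handle if this driver supports pdev. */
	int (*open)(const struct fake_pdev *pdev, uint64_t *handle);
	void (*release)(uint64_t handle);
};

#define MAX_OPS 4
static const struct mediate_ops *ops_list[MAX_OPS];
static size_t ops_count;

/* Registration can happen before any device is bound or opened. */
static int register_mediate_ops(const struct mediate_ops *ops)
{
	if (ops_count == MAX_OPS)
		return -1;
	ops_list[ops_count++] = ops;
	return 0;
}

/*
 * On device open, walk the list and bind to the first ops whose open()
 * succeeds. NULL means no mediation, i.e. plain passthrough.
 */
static const struct mediate_ops *device_open(const struct fake_pdev *pdev,
					     uint64_t *handle)
{
	for (size_t i = 0; i < ops_count; i++)
		if (ops_list[i]->open(pdev, handle) == 0)
			return ops_list[i];
	return NULL;
}

/* Example mediate driver that supports a single VF devfn. */
#define SUPPORTED_DEVFN 0x10u

static int vf_open(const struct fake_pdev *pdev, uint64_t *handle)
{
	if (pdev->devfn != SUPPORTED_DEVFN)
		return -1;
	*handle = pdev->devfn;  /* handle identifies this opened VF */
	return 0;
}

static void vf_release(uint64_t handle)
{
	(void)handle;
}

static const struct mediate_ops vf_ops = { "vf-mediate", vf_open, vf_release };
```

In this model a VF that no mediate driver claims simply stays on the default
vfio-pci path, which is the point: nothing other than the PF driver and
vfio-pci has to be bound.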
> 
> >>>    If we write a
> >>> vendor specific mdev parent driver, most of the code (like passthrough
> >>> style of rw/mmap) still needs to be copied from vfio-pci driver, which is
> >>> actually a duplicated and tedious work.
> >>
> >> The mediate ops looks quite similar to what vfio-mdev did. And it looks to
> >> me we need to consider live migration for mdev as well. In that case, do we
> >> still expect mediate ops through VFIO directly?
> >>
> >>
> >>> (2) For features like dynamically trap/untrap pci bars, if they are in
> >>> vfio-pci, they can be available to most people without repeated code
> >>> copying and re-testing.
> >>> (3) with a 1:1 mdev driver which passthrough VFs most of the time, people
> >>> have to decide whether to bind VFs to vfio-pci or mdev parent driver before
> >>> it runs into a real migration need. However, if vfio-pci is bound
> >>> initially, they have no chance to do live migration when there's a need
> >>> later.
> >>
> >> We can teach management layer to do this.
> >>
> > No. not possible as vfio-pci by default has no migration region and
> > dirty page tracking needs vendor's mediation at least for most
> > passthrough devices now.
> 
> 
> I'm not quite sure I get this, but in this case, just teach them to use
> the driver that has migration support?
> 
That's one way, but as more and more passthrough devices need and
are able to do migration, will vfio-pci still be used in the future?

Thanks
Yan

> Thanks
> 
> 
> >
> > Thanks
> > Yan
> >
> >> Thanks
> >>
> >>
> >>> In this patchset,
> >>> - patches 1-4 enable vfio-pci to call mediate ops registered by vendor
> >>>     driver to mediate/customize region info/rw/mmap.
> >>>
> >>> - patches 5-6 provide a standalone sample driver to register a mediate ops
> >>>     for Intel Graphics Devices. It does not bind to IGDs directly but decides
> >>>     what devices it supports via its pciidlist. It also demonstrates how to
> >>>     dynamic trap a device's PCI bars. (by adding more pciids in its
> >>>     pciidlist, this sample driver actually is not necessarily limited to
> >>>     support IGDs)
> >>>
> >>> - patch 7-9 provide a sample on i40e driver that supports Intel(R)
> >>>     Ethernet Controller XL710 Family of devices. It supports VF precopy live
> >>>     migration on Intel's 710 SRIOV. (but we commented out the real
> >>>     implementation of dirty page tracking and device state retrieving part
> >>>     to focus on demonstrating framework part. Will send out them in future
> >>>     versions)
> >>>     patch 7 registers/unregisters VF mediate ops when PF driver
> >>>     probes/removes. It specifies its supporting VFs via
> >>>     vfio_pci_mediate_ops->open(pdev)
> >>>
> >>>     patch 8 reports device cap of VFIO_PCI_DEVICE_CAP_MIGRATION and
> >>>     provides a sample implementation of migration region.
> >>>     The QEMU part of vfio migration is based on v8
> >>>     https://lists.gnu.org/archive/html/qemu-devel/2019-08/msg05542.html.
> >>>     We do not based on recent v9 because we think there are still opens in
> >>>     dirty page track part in that series.
> >>>
> >>>     patch 9 reports device cap of VFIO_PCI_DEVICE_CAP_DYNAMIC_TRAP_BAR and
> >>>     provides an example on how to trap part of bar0 when migration starts
> >>>     and passthrough this part of bar0 again when migration fails.
> >>>
> >>> Yan Zhao (9):
> >>>     vfio/pci: introduce mediate ops to intercept vfio-pci ops
> >>>     vfio/pci: test existence before calling region->ops
> >>>     vfio/pci: register a default migration region
> >>>     vfio-pci: register default dynamic-trap-bar-info region
> >>>     samples/vfio-pci/igd_dt: sample driver to mediate a passthrough IGD
> >>>     sample/vfio-pci/igd_dt: dynamically trap/untrap subregion of IGD bar0
> >>>     i40e/vf_migration: register mediate_ops to vfio-pci
> >>>     i40e/vf_migration: mediate migration region
> >>>     i40e/vf_migration: support dynamic trap of bar0
> >>>
> >>>    drivers/net/ethernet/intel/Kconfig            |   2 +-
> >>>    drivers/net/ethernet/intel/i40e/Makefile      |   3 +-
> >>>    drivers/net/ethernet/intel/i40e/i40e.h        |   2 +
> >>>    drivers/net/ethernet/intel/i40e/i40e_main.c   |   3 +
> >>>    .../ethernet/intel/i40e/i40e_vf_migration.c   | 626 ++++++++++++++++++
> >>>    .../ethernet/intel/i40e/i40e_vf_migration.h   |  78 +++
> >>>    drivers/vfio/pci/vfio_pci.c                   | 189 +++++-
> >>>    drivers/vfio/pci/vfio_pci_private.h           |   2 +
> >>>    include/linux/vfio.h                          |  18 +
> >>>    include/uapi/linux/vfio.h                     | 160 +++++
> >>>    samples/Kconfig                               |   6 +
> >>>    samples/Makefile                              |   1 +
> >>>    samples/vfio-pci/Makefile                     |   2 +
> >>>    samples/vfio-pci/igd_dt.c                     | 367 ++++++++++
> >>>    14 files changed, 1455 insertions(+), 4 deletions(-)
> >>>    create mode 100644 drivers/net/ethernet/intel/i40e/i40e_vf_migration.c
> >>>    create mode 100644 drivers/net/ethernet/intel/i40e/i40e_vf_migration.h
> >>>    create mode 100644 samples/vfio-pci/Makefile
> >>>    create mode 100644 samples/vfio-pci/igd_dt.c
> >>>
>
Jason Wang Dec. 6, 2019, 9:40 a.m. UTC | #5
On 2019/12/6 4:22 PM, Yan Zhao wrote:
> On Thu, Dec 05, 2019 at 09:05:54PM +0800, Jason Wang wrote:
>> On 2019/12/5 4:51 PM, Yan Zhao wrote:
>>> On Thu, Dec 05, 2019 at 02:33:19PM +0800, Jason Wang wrote:
>>>> Hi:
>>>>
>>>> On 2019/12/5 11:24 AM, Yan Zhao wrote:
>>>>> For SRIOV devices, VFs are passthroughed into guest directly without host
>>>>> driver mediation. However, when VMs migrating with passthroughed VFs,
>>>>> dynamic host mediation is required to  (1) get device states, (2) get
>>>>> dirty pages. Since device states as well as other critical information
>>>>> required for dirty page tracking for VFs are usually retrieved from PFs,
>>>>> it is handy to provide an extension in PF driver to centralizingly control
>>>>> VFs' migration.
>>>>>
>>>>> Therefore, in order to realize (1) passthrough VFs at normal time, (2)
>>>>> dynamically trap VFs' bars for dirty page tracking and
>>>> A silly question, what's the reason for doing this, is this a must for dirty
>>>> page tracking?
>>>>
>>> For performance consideration. VFs' bars should be passthroughed at
>>> normal time and only enter into trap state on need.
>>
>> Right, but how does this matter for the case of dirty page tracking?
>>
> Take NIC as an example, to trap its VF dirty pages, software way is
> required to trap every write of ring tail that resides in BAR0.


Interesting, but it looks like we need to:
- decode the instruction
- mediate all access to BAR0
All of which seems a great burden for the VF driver. I wonder whether
doing interrupt relay and tracking the head would be better in this case.


>   There's
> still no IOMMU Dirty bit available.
>>>>>     (3) centralizing
>>>>> VF critical states retrieving and VF controls into one driver, we propose
>>>>> to introduce mediate ops on top of current vfio-pci device driver.
>>>>>
>>>>>
>>>>>                                       _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
>>>>>     __________   register mediate ops|  ___________     ___________    |
>>>>> |          |<-----------------------|     VF    |   |           |
>>>>> | vfio-pci |                      | |  mediate  |   | PF driver |   |
>>>>> |__________|----------------------->|   driver  |   |___________|
>>>>>         |            open(pdev)      |  -----------          |         |
>>>>>         |                                                    |
>>>>>         |                            |_ _ _ _ _ _ _ _ _ _ _ _|_ _ _ _ _|
>>>>>        \|/                                                  \|/
>>>>> -----------                                         ------------
>>>>> |    VF   |                                         |    PF    |
>>>>> -----------                                         ------------
>>>>>
>>>>>
>>>>> VF mediate driver could be a standalone driver that does not bind to
>>>>> any devices (as in demo code in patches 5-6) or it could be a built-in
>>>>> extension of PF driver (as in patches 7-9) .
>>>>>
>>>>> Rather than directly bind to VF, VF mediate driver register a mediate
>>>>> ops into vfio-pci in driver init. vfio-pci maintains a list of such
>>>>> mediate ops.
>>>>> (Note that: VF mediate driver can register mediate ops into vfio-pci
>>>>> before vfio-pci binding to any devices. And VF mediate driver can
>>>>> support mediating multiple devices.)
>>>>>
>>>>> When opening a device (e.g. a VF), vfio-pci goes through the mediate ops
>>>>> list and calls each vfio_pci_mediate_ops->open() with pdev of the opening
>>>>> device as a parameter.
>>>>> VF mediate driver should return success or failure depending on it
>>>>> supports the pdev or not.
>>>>> E.g. VF mediate driver would compare its supported VF devfn with the
>>>>> devfn of the passed-in pdev.
>>>>> Once vfio-pci finds a successful vfio_pci_mediate_ops->open(), it will
>>>>> stop querying other mediate ops and bind the opening device with this
>>>>> mediate ops using the returned mediate handle.
>>>>>
>>>>> Further vfio-pci ops (VFIO_DEVICE_GET_REGION_INFO ioctl, rw, mmap) on the
>>>>> VF will be intercepted into VF mediate driver as
>>>>> vfio_pci_mediate_ops->get_region_info(),
>>>>> vfio_pci_mediate_ops->rw,
>>>>> vfio_pci_mediate_ops->mmap, and get customized.
>>>>> For vfio_pci_mediate_ops->rw and vfio_pci_mediate_ops->mmap, they will
>>>>> further return 'pt' to indicate whether vfio-pci should further
>>>>> passthrough data to hw.
>>>>>
>>>>> when vfio-pci closes the VF, it calls its vfio_pci_mediate_ops->release()
>>>>> with a mediate handle as parameter.
>>>>>
>>>>> The mediate handle returned from vfio_pci_mediate_ops->open() lets VF
>>>>> mediate driver be able to differentiate two opening VFs of the same device
>>>>> id and vendor id.
>>>>>
>>>>> When VF mediate driver exits, it unregisters its mediate ops from
>>>>> vfio-pci.
>>>>>
>>>>>
>>>>> In this patchset, we enable vfio-pci to provide 3 things:
>>>>> (1) calling mediate ops to allow vendor driver customizing default
>>>>> region info/rw/mmap of a region.
>>>>> (2) provide a migration region to support migration
>>>> What's the benefit of introducing a region? It looks to me we don't expect
>>>> the region to be accessed directly from guest. Could we simply extend device
>>>> fd ioctl for doing such things?
>>>>
>>> You may take a look on mdev live migration discussions in
>>> https://lists.gnu.org/archive/html/qemu-devel/2019-11/msg01763.html
>>>
>>> or previous discussion at
>>> https://lists.gnu.org/archive/html/qemu-devel/2019-02/msg04908.html,
>>> which has kernel side implementation https://patchwork.freedesktop.org/series/56876/
>>>
>>> generally speaking, qemu part of live migration is consistent for
>>> vfio-pci + mediate ops way or mdev way.
>>
>> So in mdev, do you still have a mediate driver? Or you expect the parent
>> to implement the region?
>>
> No, currently it's only for vfio-pci.

And specific to PCI.

> An mdev parent driver is free to customize its regions and hence does not
> require these mediate ops hooks.
>
>>> The region is only a channel for
>>> QEMU and kernel to communicate information without introducing IOCTLs.
>>
>> Well, at least you introduce a new type of region in the uapi. So this does
>> not answer why a region is better than an ioctl. If the region will only be
>> used by qemu, using an ioctl is much easier and more straightforward.
>>
> It's not introduced by me :)
> mdev live migration is actually using this way, I'm just keeping
> compatible to the uapi.


I meant e.g. VFIO_REGION_TYPE_MIGRATION.


>
>  From my own perspective, my answer is that a region is more flexible
> compared to ioctl. vendor driver can freely define the size,
>

Probably not since it's an ABI I think.

>   mmap cap of
> its data subregion.
>

It doesn't help much unless it can be mapped into the guest (which I don't
think is the case here).

>   Also, there're already too many ioctls in vfio.

Probably not :) We have a bunch of subsystems that have many more
ioctls than VFIO (e.g. DRM).

>>>
>>>>> (3) provide a dynamic trap bar info region to allow vendor driver
>>>>> control trap/untrap of device pci bars
>>>>>
>>>>> This vfio-pci + mediate ops way differs from mdev way in that
>>>>> (1) mdev way needs to create a 1:1 mdev device on top of one VF, device
>>>>> specific mdev parent driver is bound to VF directly.
>>>>> (2) vfio-pci + mediate ops way does not create mdev devices and VF
>>>>> mediate driver does not bind to VFs. Instead, vfio-pci binds to VFs.
>>>>>
>>>>> The reason why we don't choose the way of writing mdev parent driver is
>>>>> that
>>>>> (1) VFs are almost all the time directly passthroughed. Directly binding
>>>>> to vfio-pci can make most of the code shared/reused.
>>>> Can we split out the common parts from vfio-pci?
>>>>
>>> That's very attractive. but one cannot implement a vfio-pci except
>>> export everything in it as common part :)
>>
>> Well, I think it should not be hard to do that. E.g. you can route it
>> back like:
>>
>> vfio -> vfio_mdev -> parent -> vfio_pci
>>
> We want the mediate driver to bind to the PF device,
> so that once a VF device is created, only the PF driver and vfio-pci are
> required, just as for a normal VF passthrough.
> Otherwise, a separate parent driver binding to the VF is required.
> Also, such a parent driver has many drawbacks, as I mentioned in the
> cover letter.

Well, as discussed, there is no need to duplicate the code; the bar trick
should still work. The main issues I see with this proposal are:

1) It is PCI specific; other buses may need something similar
2) Its function duplicates mdev, and mdev can do even more


>>>>>     If we write a
>>>>> vendor specific mdev parent driver, most of the code (like passthrough
>>>>> style of rw/mmap) still needs to be copied from vfio-pci driver, which is
>>>>> actually a duplicated and tedious work.
>>>> The mediate ops looks quite similar to what vfio-mdev did. And it looks to
>>>> me we need to consider live migration for mdev as well. In that case, do we
>>>> still expect mediate ops through VFIO directly?
>>>>
>>>>
>>>>> (2) For features like dynamically trap/untrap pci bars, if they are in
>>>>> vfio-pci, they can be available to most people without repeated code
>>>>> copying and re-testing.
>>>>> (3) with a 1:1 mdev driver which passthrough VFs most of the time, people
>>>>> have to decide whether to bind VFs to vfio-pci or mdev parent driver before
>>>>> it runs into a real migration need. However, if vfio-pci is bound
>>>>> initially, they have no chance to do live migration when there's a need
>>>>> later.
>>>> We can teach management layer to do this.
>>>>
>>> No. not possible as vfio-pci by default has no migration region and
>>> dirty page tracking needs vendor's mediation at least for most
>>> passthrough devices now.
>>
>> I'm not quite sure I get this, but in this case, just teach them to use
>> the driver that has migration support?
>>
> That's one way, but as more and more passthrough devices need and
> are able to do migration, will vfio-pci still be used in the future?


This should not be a problem:
- If we introduce a common mdev for vfio-pci, we can just always bind that
driver
- The most straightforward way to support dirty page tracking is via the
IOMMU instead of device specific operations.

Thanks

>
> Thanks
> Yan
>
>> Thanks
>>
>>
>>> Thanks
>>> Yan
>>>
>>>> Thanks
>>>>
>>>>
>>>>> In this patchset,
>>>>> - patches 1-4 enable vfio-pci to call mediate ops registered by vendor
>>>>>      driver to mediate/customize region info/rw/mmap.
>>>>>
>>>>> - patches 5-6 provide a standalone sample driver to register a mediate ops
>>>>>      for Intel Graphics Devices. It does not bind to IGDs directly but decides
>>>>>      what devices it supports via its pciidlist. It also demonstrates how to
>>>>>      dynamic trap a device's PCI bars. (by adding more pciids in its
>>>>>      pciidlist, this sample driver actually is not necessarily limited to
>>>>>      support IGDs)
>>>>>
>>>>> - patch 7-9 provide a sample on i40e driver that supports Intel(R)
>>>>>      Ethernet Controller XL710 Family of devices. It supports VF precopy live
>>>>>      migration on Intel's 710 SRIOV. (but we commented out the real
>>>>>      implementation of dirty page tracking and device state retrieving part
>>>>>      to focus on demonstrating framework part. Will send out them in future
>>>>>      versions)
>>>>>      patch 7 registers/unregisters VF mediate ops when PF driver
>>>>>      probes/removes. It specifies its supporting VFs via
>>>>>      vfio_pci_mediate_ops->open(pdev)
>>>>>
>>>>>      patch 8 reports device cap of VFIO_PCI_DEVICE_CAP_MIGRATION and
>>>>>      provides a sample implementation of migration region.
>>>>>      The QEMU part of vfio migration is based on v8
>>>>>      https://lists.gnu.org/archive/html/qemu-devel/2019-08/msg05542.html.
>>>>>      We do not based on recent v9 because we think there are still opens in
>>>>>      dirty page track part in that series.
>>>>>
>>>>>      patch 9 reports device cap of VFIO_PCI_DEVICE_CAP_DYNAMIC_TRAP_BAR and
>>>>>      provides an example on how to trap part of bar0 when migration starts
>>>>>      and passthrough this part of bar0 again when migration fails.
>>>>>
>>>>> Yan Zhao (9):
>>>>>      vfio/pci: introduce mediate ops to intercept vfio-pci ops
>>>>>      vfio/pci: test existence before calling region->ops
>>>>>      vfio/pci: register a default migration region
>>>>>      vfio-pci: register default dynamic-trap-bar-info region
>>>>>      samples/vfio-pci/igd_dt: sample driver to mediate a passthrough IGD
>>>>>      sample/vfio-pci/igd_dt: dynamically trap/untrap subregion of IGD bar0
>>>>>      i40e/vf_migration: register mediate_ops to vfio-pci
>>>>>      i40e/vf_migration: mediate migration region
>>>>>      i40e/vf_migration: support dynamic trap of bar0
>>>>>
>>>>>     drivers/net/ethernet/intel/Kconfig            |   2 +-
>>>>>     drivers/net/ethernet/intel/i40e/Makefile      |   3 +-
>>>>>     drivers/net/ethernet/intel/i40e/i40e.h        |   2 +
>>>>>     drivers/net/ethernet/intel/i40e/i40e_main.c   |   3 +
>>>>>     .../ethernet/intel/i40e/i40e_vf_migration.c   | 626 ++++++++++++++++++
>>>>>     .../ethernet/intel/i40e/i40e_vf_migration.h   |  78 +++
>>>>>     drivers/vfio/pci/vfio_pci.c                   | 189 +++++-
>>>>>     drivers/vfio/pci/vfio_pci_private.h           |   2 +
>>>>>     include/linux/vfio.h                          |  18 +
>>>>>     include/uapi/linux/vfio.h                     | 160 +++++
>>>>>     samples/Kconfig                               |   6 +
>>>>>     samples/Makefile                              |   1 +
>>>>>     samples/vfio-pci/Makefile                     |   2 +
>>>>>     samples/vfio-pci/igd_dt.c                     | 367 ++++++++++
>>>>>     14 files changed, 1455 insertions(+), 4 deletions(-)
>>>>>     create mode 100644 drivers/net/ethernet/intel/i40e/i40e_vf_migration.c
>>>>>     create mode 100644 drivers/net/ethernet/intel/i40e/i40e_vf_migration.h
>>>>>     create mode 100644 samples/vfio-pci/Makefile
>>>>>     create mode 100644 samples/vfio-pci/igd_dt.c
>>>>>
Yan Zhao Dec. 6, 2019, 12:49 p.m. UTC | #6
On Fri, Dec 06, 2019 at 05:40:02PM +0800, Jason Wang wrote:
> 
> On 2019/12/6 下午4:22, Yan Zhao wrote:
> > On Thu, Dec 05, 2019 at 09:05:54PM +0800, Jason Wang wrote:
> >> On 2019/12/5 下午4:51, Yan Zhao wrote:
> >>> On Thu, Dec 05, 2019 at 02:33:19PM +0800, Jason Wang wrote:
> >>>> Hi:
> >>>>
> >>>> On 2019/12/5 上午11:24, Yan Zhao wrote:
> >>>>> For SRIOV devices, VFs are passthroughed into guest directly without host
> >>>>> driver mediation. However, when VMs migrating with passthroughed VFs,
> >>>>> dynamic host mediation is required to  (1) get device states, (2) get
> >>>>> dirty pages. Since device states as well as other critical information
> >>>>> required for dirty page tracking for VFs are usually retrieved from PFs,
> >>>>> it is handy to provide an extension in PF driver to centralizingly control
> >>>>> VFs' migration.
> >>>>>
> >>>>> Therefore, in order to realize (1) passthrough VFs at normal time, (2)
> >>>>> dynamically trap VFs' bars for dirty page tracking and
> >>>> A silly question, what's the reason for doing this, is this a must for dirty
> >>>> page tracking?
> >>>>
> >>> For performance consideration. VFs' bars should be passthroughed at
> >>> normal time and only enter into trap state on need.
> >>
> >> Right, but how does this matter for the case of dirty page tracking?
> >>
> > Take NIC as an example, to trap its VF dirty pages, software way is
> > required to trap every write of ring tail that resides in BAR0.
> 
> 
> Interesting, but it looks like we need:
> - decode the instruction
> - mediate all access to BAR0
> All of which seems a great burden for the VF driver. I wonder whether or 
> not doing interrupt relay and tracking head is better in this case.
>
Hi Jason,

I'm not familiar with the approach you mentioned. Could you elaborate?
> 
> >   There's
> > still no IOMMU Dirty bit available.
> >>>>>     (3) centralizing
> >>>>> VF critical states retrieving and VF controls into one driver, we propose
> >>>>> to introduce mediate ops on top of current vfio-pci device driver.
> >>>>>
> >>>>>
> >>>>>                                       _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> >>>>>     __________   register mediate ops|  ___________     ___________    |
> >>>>> |          |<-----------------------|     VF    |   |           |
> >>>>> | vfio-pci |                      | |  mediate  |   | PF driver |   |
> >>>>> |__________|----------------------->|   driver  |   |___________|
> >>>>>         |            open(pdev)      |  -----------          |         |
> >>>>>         |                                                    |
> >>>>>         |                            |_ _ _ _ _ _ _ _ _ _ _ _|_ _ _ _ _|
> >>>>>        \|/                                                  \|/
> >>>>> -----------                                         ------------
> >>>>> |    VF   |                                         |    PF    |
> >>>>> -----------                                         ------------
> >>>>>
> >>>>>
> >>>>> VF mediate driver could be a standalone driver that does not bind to
> >>>>> any devices (as in demo code in patches 5-6) or it could be a built-in
> >>>>> extension of PF driver (as in patches 7-9) .
> >>>>>
> >>>>> Rather than directly bind to VF, VF mediate driver register a mediate
> >>>>> ops into vfio-pci in driver init. vfio-pci maintains a list of such
> >>>>> mediate ops.
> >>>>> (Note that: VF mediate driver can register mediate ops into vfio-pci
> >>>>> before vfio-pci binding to any devices. And VF mediate driver can
> >>>>> support mediating multiple devices.)
> >>>>>
> >>>>> When opening a device (e.g. a VF), vfio-pci goes through the mediate ops
> >>>>> list and calls each vfio_pci_mediate_ops->open() with pdev of the opening
> >>>>> device as a parameter.
> >>>>> VF mediate driver should return success or failure depending on it
> >>>>> supports the pdev or not.
> >>>>> E.g. VF mediate driver would compare its supported VF devfn with the
> >>>>> devfn of the passed-in pdev.
> >>>>> Once vfio-pci finds a successful vfio_pci_mediate_ops->open(), it will
> >>>>> stop querying other mediate ops and bind the opening device with this
> >>>>> mediate ops using the returned mediate handle.
> >>>>>
> >>>>> Further vfio-pci ops (VFIO_DEVICE_GET_REGION_INFO ioctl, rw, mmap) on the
> >>>>> VF will be intercepted into VF mediate driver as
> >>>>> vfio_pci_mediate_ops->get_region_info(),
> >>>>> vfio_pci_mediate_ops->rw,
> >>>>> vfio_pci_mediate_ops->mmap, and get customized.
> >>>>> For vfio_pci_mediate_ops->rw and vfio_pci_mediate_ops->mmap, they will
> >>>>> further return 'pt' to indicate whether vfio-pci should further
> >>>>> passthrough data to hw.
> >>>>>
> >>>>> when vfio-pci closes the VF, it calls its vfio_pci_mediate_ops->release()
> >>>>> with a mediate handle as parameter.
> >>>>>
> >>>>> The mediate handle returned from vfio_pci_mediate_ops->open() lets VF
> >>>>> mediate driver be able to differentiate two opening VFs of the same device
> >>>>> id and vendor id.
> >>>>>
> >>>>> When VF mediate driver exits, it unregisters its mediate ops from
> >>>>> vfio-pci.
> >>>>>
> >>>>>
> >>>>> In this patchset, we enable vfio-pci to provide 3 things:
> >>>>> (1) calling mediate ops to allow vendor driver customizing default
> >>>>> region info/rw/mmap of a region.
> >>>>> (2) provide a migration region to support migration
> >>>> What's the benefit of introducing a region? It looks to me we don't expect
> >>>> the region to be accessed directly from guest. Could we simply extend device
> >>>> fd ioctl for doing such things?
> >>>>
> >>> You may take a look on mdev live migration discussions in
> >>> https://lists.gnu.org/archive/html/qemu-devel/2019-11/msg01763.html
> >>>
> >>> or previous discussion at
> >>> https://lists.gnu.org/archive/html/qemu-devel/2019-02/msg04908.html,
> >>> which has kernel side implementation https://patchwork.freedesktop.org/series/56876/
> >>>
> >>> generally speaking, qemu part of live migration is consistent for
> >>> vfio-pci + mediate ops way or mdev way.
> >>
> >> So in mdev, do you still have a mediate driver? Or you expect the parent
> >> to implement the region?
> >>
> > No, currently it's only for vfio-pci.
> 
> And specific to PCI.
> 
> > mdev parent driver is free to customize its regions and hence does not
> > requires this mediate ops hooks.
> >
> >>> The region is only a channel for
> >>> QEMU and kernel to communicate information without introducing IOCTLs.
> >>
> >> Well, at least you introduce new type of region in uapi. So this does
> >> not answer why region is better than ioctl. If the region will only be
> >> used by qemu, using ioctl is much more easier and straightforward.
> >>
> > It's not introduced by me :)
> > mdev live migration is actually using this way, I'm just keeping
> > compatible to the uapi.
> 
> 
> I meant e.g VFIO_REGION_TYPE_MIGRATION.
>
Here's the history of vfio live migration:
https://lists.gnu.org/archive/html/qemu-devel/2017-06/msg05564.html
https://lists.gnu.org/archive/html/qemu-devel/2019-02/msg04908.html
https://lists.gnu.org/archive/html/qemu-devel/2019-11/msg01763.html

If you have any concerns about the region approach, feel free to comment on
the latest v9 patchset:
https://lists.gnu.org/archive/html/qemu-devel/2019-11/msg01763.html

This patchset will always stay compatible with that one.
> 
> >
> >  From my own perspective, my answer is that a region is more flexible
> > compared to ioctl. vendor driver can freely define the size,
> >
> 
> Probably not since it's an ABI I think.
> 
That's why I need to define VFIO_REGION_TYPE_MIGRATION in this
patchset: it is not upstream yet.
Maybe I should make it a prerequisite patch, to indicate that it is not
introduced by this patchset.

> >   mmap cap of
> > its data subregion.
> >
> 
> It doesn't help much unless it can be mapped into guest (which I don't 
> think it was the case here).
> 
It's accessed by host QEMU, the same way a Linux application accesses
mmapped memory. The mmap here is to reduce memory copies from kernel to
user space. There is no need to map it into the guest.
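
To illustrate the point about mmap saving a copy, here is a minimal userspace
sketch. It is only an illustration under assumptions: a temporary file stands
in for the device fd, offset 0 stands in for the data subregion, and the helper
names (region_read_copy, region_map, region_demo) are hypothetical and not part
of VFIO or of this patchset.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Copy path: every call copies len bytes from the kernel into buf. */
static ssize_t region_read_copy(int fd, off_t off, void *buf, size_t len)
{
	return pread(fd, buf, len, off);
}

/*
 * Map path: map once, then read the data in place.
 * off must be page aligned. Returns NULL on failure.
 */
static const void *region_map(int fd, off_t off, size_t len)
{
	void *p;

	assert(len > 0);
	p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, off);
	return p == MAP_FAILED ? NULL : p;
}

/*
 * Exercise both paths on a temporary file that stands in for the device fd.
 * Returns 0 when both paths see the same bytes, -1 otherwise.
 */
static int region_demo(void)
{
	static const char payload[] = "device-state";
	char copy[sizeof(payload)];
	const void *map;
	int fd, ret = -1;
	FILE *f = tmpfile();

	if (!f)
		return -1;
	fd = fileno(f);
	if (write(fd, payload, sizeof(payload)) != (ssize_t)sizeof(payload))
		goto out;
	if (region_read_copy(fd, 0, copy, sizeof(copy)) != (ssize_t)sizeof(copy))
		goto out;
	map = region_map(fd, 0, sizeof(payload));
	if (!map)
		goto out;
	if (!memcmp(copy, payload, sizeof(payload)) &&
	    !memcmp(map, payload, sizeof(payload)))
		ret = 0;
	munmap((void *)map, sizeof(payload));
out:
	fclose(f);
	return ret;
}
```

With the copy path, each chunk of data costs a syscall plus a copy into the
caller's buffer; with the map path the consumer reads the pages directly after
a single mmap() call.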

> >   Also, there're already too many ioctls in vfio.
> 
> Probably not :) We have a bunch of subsystems that have many more
> ioctls than VFIO. (e.g. DRM)
>

> >>>
> >>>>> (3) provide a dynamic trap bar info region to allow vendor driver
> >>>>> control trap/untrap of device pci bars
> >>>>>
> >>>>> This vfio-pci + mediate ops way differs from mdev way in that
> >>>>> (1) mdev way needs to create a 1:1 mdev device on top of one VF, device
> >>>>> specific mdev parent driver is bound to VF directly.
> >>>>> (2) vfio-pci + mediate ops way does not create mdev devices and VF
> >>>>> mediate driver does not bind to VFs. Instead, vfio-pci binds to VFs.
> >>>>>
> >>>>> The reason why we don't choose the way of writing mdev parent driver is
> >>>>> that
> >>>>> (1) VFs are almost all the time directly passthroughed. Directly binding
> >>>>> to vfio-pci can make most of the code shared/reused.
> >>>> Can we split out the common parts from vfio-pci?
> >>>>
> >>> That's very attractive. but one cannot implement a vfio-pci except
> >>> export everything in it as common part :)
> >>
> >> Well, I think it should not be hard to do that. E.g. you can route it
> >> back like:
> >>
> >> vfio -> vfio_mdev -> parent -> vfio_pci
> >>
> > it's desired for us to have mediate driver binding to PF device.
> > so once a VF device is created, only PF driver and vfio-pci are
> > required. Just the same as what needs to be done for a normal VF passthrough.
> > otherwise, a separate parent driver binding to VF is required.
> > Also, this parent driver has many drawbacks as I mentions in this
> > cover-letter.
> 
> Well, as discussed, no need to duplicate the code, bar trick should 
> still work. The main issues I saw with this proposal is:
> 
> 1) PCI specific, other bus may need something similar
vfio-pci is only for PCI of course.

> 2) Function duplicated with mdev and mdev can do even more
> 
Could you elaborate on how mdev can solve the problem described above?
> 
> >>>>>     If we write a
> >>>>> vendor specific mdev parent driver, most of the code (like passthrough
> >>>>> style of rw/mmap) still needs to be copied from vfio-pci driver, which is
> >>>>> actually a duplicated and tedious work.
> >>>> The mediate ops looks quite similar to what vfio-mdev did. And it looks to
> >>>> me we need to consider live migration for mdev as well. In that case, do we
> >>>> still expect mediate ops through VFIO directly?
> >>>>
> >>>>
> >>>>> (2) For features like dynamically trap/untrap pci bars, if they are in
> >>>>> vfio-pci, they can be available to most people without repeated code
> >>>>> copying and re-testing.
> >>>>> (3) with a 1:1 mdev driver which passthrough VFs most of the time, people
> >>>>> have to decide whether to bind VFs to vfio-pci or mdev parent driver before
> >>>>> it runs into a real migration need. However, if vfio-pci is bound
> >>>>> initially, they have no chance to do live migration when there's a need
> >>>>> later.
> >>>> We can teach management layer to do this.
> >>>>
> >>> No. not possible as vfio-pci by default has no migration region and
> >>> dirty page tracking needs vendor's mediation at least for most
> >>> passthrough devices now.
> >>
> >> I'm not quite sure I get this, but in this case, just teach them to use
> >> the driver that has migration support?
> >>
> > That's a way, but as more and more passthrough devices have demands and
> > caps to do migration, will vfio-pci be used in future any more ?
> 
> 
> This should not be a problem:
> - If we introduce a common mdev for vfio-pci, we can just bind that 
> driver always
What is a common mdev for vfio-pci? A common mdev parent driver that has
the same implementation as vfio-pci?

There's actually already a solution that creates only one mdev on top
of each passthrough device and makes the mdev share the same IOMMU group
with it. We've already made an implementation of it. Here's a
sample one made by Yi at https://patchwork.kernel.org/cover/11134695/.

But, as I said, it's desirable to re-use vfio-pci directly for SRIOV,
which is straightforward :)

> - The most straightforward way to support dirty page tracking is done by 
> IOMMU instead of device specific operations.
>
No such IOMMU exists yet. And all kinds of platforms need to be taken care
of, right?

Thanks
Yan

Alex Williamson Dec. 6, 2019, 5:42 p.m. UTC | #7
On Fri, 6 Dec 2019 17:40:02 +0800
Jason Wang <jasowang@redhat.com> wrote:

> On 2019/12/6 下午4:22, Yan Zhao wrote:
> > On Thu, Dec 05, 2019 at 09:05:54PM +0800, Jason Wang wrote:  
> >> On 2019/12/5 下午4:51, Yan Zhao wrote:  
> >>> On Thu, Dec 05, 2019 at 02:33:19PM +0800, Jason Wang wrote:  
> >>>> Hi:
> >>>>
> >>>> On 2019/12/5 上午11:24, Yan Zhao wrote:  
> >>>>> For SRIOV devices, VFs are passthroughed into guest directly without host
> >>>>> driver mediation. However, when VMs migrating with passthroughed VFs,
> >>>>> dynamic host mediation is required to  (1) get device states, (2) get
> >>>>> dirty pages. Since device states as well as other critical information
> >>>>> required for dirty page tracking for VFs are usually retrieved from PFs,
> >>>>> it is handy to provide an extension in PF driver to centralizingly control
> >>>>> VFs' migration.
> >>>>>
> >>>>> Therefore, in order to realize (1) passthrough VFs at normal time, (2)
> >>>>> dynamically trap VFs' bars for dirty page tracking and  
> >>>> A silly question, what's the reason for doing this, is this a must for dirty
> >>>> page tracking?
> >>>>  
> >>> For performance consideration. VFs' bars should be passthroughed at
> >>> normal time and only enter into trap state on need.  
> >>
> >> Right, but how does this matter for the case of dirty page tracking?
> >>  
> > Take NIC as an example, to trap its VF dirty pages, software way is
> > required to trap every write of ring tail that resides in BAR0.  
> 
> 
> Interesting, but it looks like we need:
> - decode the instruction
> - mediate all access to BAR0
> All of which seems a great burden for the VF driver. I wonder whether or 
> not doing interrupt relay and tracking head is better in this case.

This sounds like a NIC-specific solution.  I believe the goal here is to
allow any device type to implement a partial mediation solution, in
this case to sufficiently track the device while in the migration
saving state.
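
As a rough illustration of the software approach discussed above (trapping
ring tail writes in BAR0 while in the saving state), here is a small userspace
model. Everything in it is an assumption made for illustration: the ring
layout, the descriptor fields, the bitmap size, and all names ending in _model.
None of it is taken from the i40e patches, whose tracking code is commented out
in this RFC.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT_MODEL	12
#define RING_SIZE_MODEL		8
#define BITMAP_PAGES_MODEL	1024

/* Hypothetical RX descriptor: a guest buffer the device may DMA into. */
struct rx_desc_model {
	uint64_t buf_addr;
	uint32_t buf_len;
};

struct vf_track_model {
	struct rx_desc_model ring[RING_SIZE_MODEL];
	uint32_t tail;		/* last tail value seen */
	int saving;		/* nonzero while migration is in the saving state */
	uint8_t dirty[BITMAP_PAGES_MODEL / 8];
};

/* Set the dirty bit of every page touched by [addr, addr + len). */
static void mark_dirty_model(struct vf_track_model *t, uint64_t addr, uint32_t len)
{
	uint64_t pfn, last;

	if (len == 0)
		return;
	last = (addr + len - 1) >> PAGE_SHIFT_MODEL;
	for (pfn = addr >> PAGE_SHIFT_MODEL;
	     pfn <= last && pfn < BITMAP_PAGES_MODEL; pfn++)
		t->dirty[pfn / 8] |= (uint8_t)(1u << (pfn % 8));
}

/*
 * Handler for a trapped write to the tail register. While saving, the
 * buffers between the old and the new tail are conservatively marked dirty,
 * since the device may write to them. Returns 1 to ask the caller to still
 * pass the write through to hardware (the 'pt' idea from the cover letter).
 */
static int trapped_tail_write_model(struct vf_track_model *t, uint32_t new_tail)
{
	uint32_t i;

	assert(t != NULL);
	new_tail %= RING_SIZE_MODEL;
	if (t->saving) {
		for (i = t->tail; i != new_tail; i = (i + 1) % RING_SIZE_MODEL)
			mark_dirty_model(t, t->ring[i].buf_addr, t->ring[i].buf_len);
	}
	t->tail = new_tail;
	return 1;
}

static int page_is_dirty_model(const struct vf_track_model *t, uint64_t pfn)
{
	return pfn < BITMAP_PAGES_MODEL && ((t->dirty[pfn / 8] >> (pfn % 8)) & 1);
}
```

The model only pays the tracking cost when saving is set, which mirrors the
argument for trapping the BAR dynamically instead of all the time.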

> >   There's
> > still no IOMMU Dirty bit available.  
> >>>>>     (3) centralizing
> >>>>> VF critical states retrieving and VF controls into one driver, we propose
> >>>>> to introduce mediate ops on top of current vfio-pci device driver.
> >>>>>
> >>>>>
> >>>>>                                       _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> >>>>>     __________   register mediate ops|  ___________     ___________    |
> >>>>> |          |<-----------------------|     VF    |   |           |
> >>>>> | vfio-pci |                      | |  mediate  |   | PF driver |   |
> >>>>> |__________|----------------------->|   driver  |   |___________|
> >>>>>         |            open(pdev)      |  -----------          |         |
> >>>>>         |                                                    |
> >>>>>         |                            |_ _ _ _ _ _ _ _ _ _ _ _|_ _ _ _ _|
> >>>>>        \|/                                                  \|/
> >>>>> -----------                                         ------------
> >>>>> |    VF   |                                         |    PF    |
> >>>>> -----------                                         ------------
> >>>>>
> >>>>>
> >>>>> VF mediate driver could be a standalone driver that does not bind to
> >>>>> any devices (as in demo code in patches 5-6) or it could be a built-in
> >>>>> extension of PF driver (as in patches 7-9) .
> >>>>>
> >>>>> Rather than directly bind to VF, VF mediate driver register a mediate
> >>>>> ops into vfio-pci in driver init. vfio-pci maintains a list of such
> >>>>> mediate ops.
> >>>>> (Note that: VF mediate driver can register mediate ops into vfio-pci
> >>>>> before vfio-pci binding to any devices. And VF mediate driver can
> >>>>> support mediating multiple devices.)
> >>>>>
> >>>>> When opening a device (e.g. a VF), vfio-pci goes through the mediate ops
> >>>>> list and calls each vfio_pci_mediate_ops->open() with pdev of the opening
> >>>>> device as a parameter.
> >>>>> VF mediate driver should return success or failure depending on it
> >>>>> supports the pdev or not.
> >>>>> E.g. VF mediate driver would compare its supported VF devfn with the
> >>>>> devfn of the passed-in pdev.
> >>>>> Once vfio-pci finds a successful vfio_pci_mediate_ops->open(), it will
> >>>>> stop querying other mediate ops and bind the opening device with this
> >>>>> mediate ops using the returned mediate handle.
> >>>>>
> >>>>> Further vfio-pci ops (VFIO_DEVICE_GET_REGION_INFO ioctl, rw, mmap) on the
> >>>>> VF will be intercepted into VF mediate driver as
> >>>>> vfio_pci_mediate_ops->get_region_info(),
> >>>>> vfio_pci_mediate_ops->rw,
> >>>>> vfio_pci_mediate_ops->mmap, and get customized.
> >>>>> For vfio_pci_mediate_ops->rw and vfio_pci_mediate_ops->mmap, they will
> >>>>> further return 'pt' to indicate whether vfio-pci should further
> >>>>> passthrough data to hw.
> >>>>>
> >>>>> when vfio-pci closes the VF, it calls its vfio_pci_mediate_ops->release()
> >>>>> with a mediate handle as parameter.
> >>>>>
> >>>>> The mediate handle returned from vfio_pci_mediate_ops->open() lets VF
> >>>>> mediate driver be able to differentiate two opening VFs of the same device
> >>>>> id and vendor id.
> >>>>>
> >>>>> When VF mediate driver exits, it unregisters its mediate ops from
> >>>>> vfio-pci.
> >>>>>
> >>>>>
> >>>>> In this patchset, we enable vfio-pci to provide 3 things:
> >>>>> (1) calling mediate ops to allow vendor driver customizing default
> >>>>> region info/rw/mmap of a region.
> >>>>> (2) provide a migration region to support migration  
> >>>> What's the benefit of introducing a region? It looks to me we don't expect
> >>>> the region to be accessed directly from guest. Could we simply extend device
> >>>> fd ioctl for doing such things?
> >>>>  
> >>> You may take a look on mdev live migration discussions in
> >>> https://lists.gnu.org/archive/html/qemu-devel/2019-11/msg01763.html
> >>>
> >>> or previous discussion at
> >>> https://lists.gnu.org/archive/html/qemu-devel/2019-02/msg04908.html,
> >>> which has kernel side implementation https://patchwork.freedesktop.org/series/56876/
> >>>
> >>> generally speaking, qemu part of live migration is consistent for
> >>> vfio-pci + mediate ops way or mdev way.  
> >>
> >> So in mdev, do you still have a mediate driver? Or you expect the parent
> >> to implement the region?
> >>  
> > No, currently it's only for vfio-pci.  
> 
> And specific to PCI.

What's PCI specific?  The implementation, yes, it's done in the
vfio bus driver here, but all device access is performed by the bus
driver.  I'm not sure how we could introduce the intercept at the
vfio-core level, but I'm open to suggestions.
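
For readers following the thread, the intercept described in the cover letter
(a list of registered mediate ops, open() probing that stops at the first
success, and an rw hook that reports 'pt') can be modelled in userspace roughly
as follows. This is a sketch under assumptions, not the patch's code: every
type and function name ending in _model, the demo driver, and the simplified
signatures are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_OPS_MODEL 4

struct pdev_model {
	uint16_t vendor, device;
	uint8_t devfn;
};

struct mediate_ops_model {
	/* Returns a handle >= 0 if the device is supported, negative if not. */
	int (*open)(const struct pdev_model *pdev);
	/* Sets *pt to nonzero when the bus driver should still access hardware. */
	long (*rw)(int handle, uint64_t off, size_t count, int is_write, int *pt);
};

static const struct mediate_ops_model *ops_list_model[MAX_OPS_MODEL];
static size_t ops_count_model;

static int register_mediate_ops_model(const struct mediate_ops_model *ops)
{
	assert(ops && ops->open);
	if (ops_count_model == MAX_OPS_MODEL)
		return -1;
	ops_list_model[ops_count_model++] = ops;
	return 0;
}

struct opened_dev_model {
	const struct mediate_ops_model *ops;	/* NULL means plain passthrough */
	int handle;
};

/* Walk the list; bind to the first ops whose open() accepts the device. */
static void open_device_model(struct opened_dev_model *d,
			      const struct pdev_model *pdev)
{
	size_t i;

	d->ops = NULL;
	d->handle = -1;
	for (i = 0; i < ops_count_model; i++) {
		int h = ops_list_model[i]->open(pdev);

		if (h >= 0) {
			d->ops = ops_list_model[i];
			d->handle = h;
			return;
		}
	}
}

/* Returns nonzero when the access should still go to hardware. */
static int rw_device_model(struct opened_dev_model *d, uint64_t off,
			   size_t count, int is_write)
{
	int pt = 1;

	if (d->ops && d->ops->rw)
		d->ops->rw(d->handle, off, count, is_write, &pt);
	return pt;
}

/* Demo mediate driver: claims devfn 0x10 only, traps offsets below 0x100. */
static int demo_open_model(const struct pdev_model *pdev)
{
	return pdev->devfn == 0x10 ? 7 : -1;
}

static long demo_rw_model(int handle, uint64_t off, size_t count,
			  int is_write, int *pt)
{
	(void)handle;
	(void)is_write;
	*pt = off >= 0x100;
	return (long)count;
}

static const struct mediate_ops_model demo_ops_model = {
	.open = demo_open_model,
	.rw = demo_rw_model,
};
```

A device that no registered ops claims keeps ops == NULL, so its accesses stay
on the ordinary passthrough path; only the claimed device pays for mediation.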

> > mdev parent driver is free to customize its regions and hence does not
> > requires this mediate ops hooks.
> >  
> >>> The region is only a channel for
> >>> QEMU and kernel to communicate information without introducing IOCTLs.  
> >>
> >> Well, at least you introduce new type of region in uapi. So this does
> >> not answer why region is better than ioctl. If the region will only be
> >> used by qemu, using ioctl is much more easier and straightforward.
> >>  
> > It's not introduced by me :)
> > mdev live migration is actually using this way, I'm just keeping
> > compatible to the uapi.  
> 
> 
> I meant e.g VFIO_REGION_TYPE_MIGRATION.
> 
> 
> >
> >  From my own perspective, my answer is that a region is more flexible
> > compared to ioctl. vendor driver can freely define the size,
> >  
> 
> Probably not since it's an ABI I think.

I think Kirti's thread proposing the migration interface is a better
place for this discussion, I believe Yan has already linked to it.  In
general we prefer to be frugal in our introduction of new ioctls,
especially when we have existing mechanisms via regions to support the
interactions.  The interface is designed to be flexible to the vendor
driver needs, partially thanks to it being a region.

> >   mmap cap of
> > its data subregion.
> >  
> 
> It doesn't help much unless it can be mapped into guest (which I don't 
> think it was the case here).
> 
> >   Also, there're already too many ioctls in vfio.  
> 
> Probably not :) We have a bunch of subsystems that have many more
> ioctls than VFIO. (e.g. DRM)

And this is a good thing?  We can more easily deprecate and revise
region support than we can take back ioctls that have been previously
used.  I generally don't like the "let's create a new ioctl for that"
approach versus trying to fit something within the existing
architecture and convention.

> >>>>> (3) provide a dynamic trap bar info region to allow vendor driver
> >>>>> control trap/untrap of device pci bars
> >>>>>
> >>>>> This vfio-pci + mediate ops way differs from mdev way in that
> >>>>> (1) mdev way needs to create a 1:1 mdev device on top of one VF, device
> >>>>> specific mdev parent driver is bound to VF directly.
> >>>>> (2) vfio-pci + mediate ops way does not create mdev devices and VF
> >>>>> mediate driver does not bind to VFs. Instead, vfio-pci binds to VFs.
> >>>>>
> >>>>> The reason why we don't choose the way of writing mdev parent driver is
> >>>>> that
> >>>>> (1) VFs are almost all the time directly passthroughed. Directly binding
> >>>>> to vfio-pci can make most of the code shared/reused.  
> >>>> Can we split out the common parts from vfio-pci?
> >>>>  
> >>> That's very attractive. but one cannot implement a vfio-pci except
> >>> export everything in it as common part :)  
> >>
> >> Well, I think it should not be hard to do that. E.g. you can route it
> >> back like:
> >>
> >> vfio -> vfio_mdev -> parent -> vfio_pci
> >>  
> > it's desired for us to have mediate driver binding to PF device.
> > so once a VF device is created, only PF driver and vfio-pci are
> > required. Just the same as what needs to be done for a normal VF passthrough.
> > otherwise, a separate parent driver binding to VF is required.
> > Also, this parent driver has many drawbacks as I mentions in this
> > cover-letter.  
> 
> Well, as discussed, no need to duplicate the code, bar trick should 
> still work. The main issues I saw with this proposal is:
> 
> 1) PCI specific, other bus may need something similar

Propose how it could be implemented higher in the vfio stack to make it
device agnostic.

> 2) Function duplicated with mdev and mdev can do even more

mdev also comes with a device lifecycle interface that doesn't really
make sense when a driver is only trying to partially mediate a single
physical device rather than multiplex a physical device into virtual
devices.  mdev would also require vendor drivers to re-implement
much of vfio-pci for the direct access mechanisms.  Also, do we really
want users or management tools to decide between binding a device to
vfio-pci or a separate mdev driver to get this functionality?  We've
already been burnt trying to use mdev beyond its scope.

> >>>>>     If we write a
> >>>>> vendor specific mdev parent driver, most of the code (like passthrough
> >>>>> style of rw/mmap) still needs to be copied from vfio-pci driver, which is
> >>>>> actually a duplicated and tedious work.  
> >>>> The mediate ops looks quite similar to what vfio-mdev did. And it looks to
> >>>> me we need to consider live migration for mdev as well. In that case, do we
> >>>> still expect mediate ops through VFIO directly?
> >>>>
> >>>>  
> >>>>> (2) For features like dynamically trap/untrap pci bars, if they are in
> >>>>> vfio-pci, they can be available to most people without repeated code
> >>>>> copying and re-testing.
> >>>>> (3) with a 1:1 mdev driver which passthrough VFs most of the time, people
> >>>>> have to decide whether to bind VFs to vfio-pci or mdev parent driver before
> >>>>> it runs into a real migration need. However, if vfio-pci is bound
> >>>>> initially, they have no chance to do live migration when there's a need
> >>>>> later.  
> >>>> We can teach management layer to do this.
> >>>>  
> >>> No. not possible as vfio-pci by default has no migration region and
> >>> dirty page tracking needs vendor's mediation at least for most
> >>> passthrough devices now.  
> >>
> >> I'm not quite sure I get this, but in this case, just teach them to use
> >> the driver that has migration support?
> >>  
> > That's a way, but as more and more passthrough devices have demands and
> > caps to do migration, will vfio-pci be used in future any more ?  
> 
> 
> This should not be a problem:
> - If we introduce a common mdev for vfio-pci, we can just bind that 
> driver always

There's too much of mdev that doesn't make sense for this usage model,
this is why Yi's proposed generic mdev PCI wrapper is only a sample
driver.  I think we do not want to introduce user confusion regarding
which driver to use and there are outstanding non-singleton group
issues with mdev that don't seem worthwhile to resolve.

> - The most straightforward way to support dirty page tracking is done by 
> IOMMU instead of device specific operations.

Of course, but it doesn't exist yet.  We're attempting to design the
dirty page tracking in a way that's mostly transparent for current mdev
drivers, would provide generic support for IOMMU-based dirty tracking,
and would be extensible to the inevitability of vendor driver tracking.  Thanks,

Alex
Jason Wang Dec. 12, 2019, 3:48 a.m. UTC | #8
On 2019/12/6 8:49 PM, Yan Zhao wrote:
> On Fri, Dec 06, 2019 at 05:40:02PM +0800, Jason Wang wrote:
>> On 2019/12/6 4:22 PM, Yan Zhao wrote:
>>> On Thu, Dec 05, 2019 at 09:05:54PM +0800, Jason Wang wrote:
>>>> On 2019/12/5 4:51 PM, Yan Zhao wrote:
>>>>> On Thu, Dec 05, 2019 at 02:33:19PM +0800, Jason Wang wrote:
>>>>>> Hi:
>>>>>>
>>>>>> On 2019/12/5 11:24 AM, Yan Zhao wrote:
>>>>>>> For SRIOV devices, VFs are passthroughed into guest directly without host
>>>>>>> driver mediation. However, when VMs migrating with passthroughed VFs,
>>>>>>> dynamic host mediation is required to  (1) get device states, (2) get
>>>>>>> dirty pages. Since device states as well as other critical information
>>>>>>> required for dirty page tracking for VFs are usually retrieved from PFs,
>>>>>>> it is handy to provide an extension in PF driver to centralizingly control
>>>>>>> VFs' migration.
>>>>>>>
>>>>>>> Therefore, in order to realize (1) passthrough VFs at normal time, (2)
>>>>>>> dynamically trap VFs' bars for dirty page tracking and
>>>>>> A silly question, what's the reason for doing this, is this a must for dirty
>>>>>> page tracking?
>>>>>>
>>>>> For performance consideration. VFs' bars should be passthoughed at
>>>>> normal time and only enter into trap state on need.
>>>> Right, but how does this matter for the case of dirty page tracking?
>>>>
>>> Take NIC as an example, to trap its VF dirty pages, software way is
>>> required to trap every write of ring tail that resides in BAR0.
>>
>> Interesting, but it looks like we need:
>> - decode the instruction
>> - mediate all access to BAR0
>> All of which seems a great burden for the VF driver. I wonder whether or
>> not doing interrupt relay and tracking head is better in this case.
>>
> hi Jason
>
> not familiar with the way you mentioned. could you elaborate more?


It looks to me that you want to intercept the BAR that contains the 
head. Then you can figure out the buffers submitted by the driver, but 
you still need to decide on a proper time to mark them as dirty.

What I meant is: intercept the interrupt instead. Then you can still 
figure out the buffers which have been modified by the device and mark 
them as dirty.

Then there's no need to trap BAR and do decoding/emulation etc.

But it will still be tricky to be correct...
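
A minimal sketch of the interrupt-relay idea, as a userspace model
rather than driver code. It assumes the device reports a completion
index (hw_head) that can be read when the relayed interrupt fires; the
descriptor layout, ring size, and every name below are hypothetical and
not taken from i40e or from this series.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SHIFT 12
#define RING_SIZE  8   /* hypothetical ring depth */
#define NPAGES     64  /* hypothetical guest memory size, in pages */

/* Hypothetical RX descriptor: where the device wrote, and how much. */
struct rx_desc {
    uint64_t addr;  /* guest-physical buffer address */
    uint32_t len;   /* bytes written by the device */
};

struct dirty_tracker {
    struct rx_desc ring[RING_SIZE];
    uint32_t last_head;          /* first descriptor not yet processed */
    uint8_t bitmap[NPAGES / 8];  /* one bit per guest page */
};

/* Set the dirty bit of every page touched by [addr, addr + len). */
static void mark_dirty(struct dirty_tracker *t, uint64_t addr, uint32_t len)
{
    if (len == 0)
        return;
    uint64_t first = addr >> PAGE_SHIFT;
    uint64_t last = (addr + len - 1) >> PAGE_SHIFT;
    for (uint64_t pfn = first; pfn <= last && pfn < NPAGES; pfn++)
        t->bitmap[pfn / 8] |= (uint8_t)(1u << (pfn % 8));
}

/*
 * Called from the relayed interrupt path. Walks the descriptors the
 * device completed since the previous interrupt and marks their buffers
 * dirty. Returns the number of descriptors processed.
 */
static unsigned on_interrupt(struct dirty_tracker *t, uint32_t hw_head)
{
    unsigned n = 0;

    assert(hw_head < RING_SIZE);
    while (t->last_head != hw_head) {
        const struct rx_desc *d = &t->ring[t->last_head];

        mark_dirty(t, d->addr, d->len);
        t->last_head = (t->last_head + 1) % RING_SIZE;
        n++;
    }
    return n;
}

static int page_is_dirty(const struct dirty_tracker *t, uint64_t pfn)
{
    return (t->bitmap[pfn / 8] >> (pfn % 8)) & 1;
}
```

No BAR access is trapped or decoded in this model. What it leaves open
is exactly the tricky part: a page written by the device after the last
interrupt, or while interrupts are disabled, is not seen until the next
call.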


>>>    There's
>>> still no IOMMU Dirty bit available.
>>>>>>>      (3) centralizing
>>>>>>> VF critical states retrieving and VF controls into one driver, we propose
>>>>>>> to introduce mediate ops on top of current vfio-pci device driver.
>>>>>>>
>>>>>>>
>>>>>>>                                        _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
>>>>>>>      __________   register mediate ops|  ___________     ___________    |
>>>>>>> |          |<-----------------------|     VF    |   |           |
>>>>>>> | vfio-pci |                      | |  mediate  |   | PF driver |   |
>>>>>>> |__________|----------------------->|   driver  |   |___________|
>>>>>>>          |            open(pdev)      |  -----------          |         |
>>>>>>>          |                                                    |
>>>>>>>          |                            |_ _ _ _ _ _ _ _ _ _ _ _|_ _ _ _ _|
>>>>>>>         \|/                                                  \|/
>>>>>>> -----------                                         ------------
>>>>>>> |    VF   |                                         |    PF    |
>>>>>>> -----------                                         ------------
>>>>>>>
>>>>>>>
>>>>>>> VF mediate driver could be a standalone driver that does not bind to
>>>>>>> any devices (as in demo code in patches 5-6) or it could be a built-in
>>>>>>> extension of PF driver (as in patches 7-9) .
>>>>>>>
>>>>>>> Rather than directly bind to VF, VF mediate driver register a mediate
>>>>>>> ops into vfio-pci in driver init. vfio-pci maintains a list of such
>>>>>>> mediate ops.
>>>>>>> (Note that: VF mediate driver can register mediate ops into vfio-pci
>>>>>>> before vfio-pci binding to any devices. And VF mediate driver can
>>>>>>> support mediating multiple devices.)
>>>>>>>
>>>>>>> When opening a device (e.g. a VF), vfio-pci goes through the mediate ops
>>>>>>> list and calls each vfio_pci_mediate_ops->open() with pdev of the opening
>>>>>>> device as a parameter.
>>>>>>> VF mediate driver should return success or failure depending on it
>>>>>>> supports the pdev or not.
>>>>>>> E.g. VF mediate driver would compare its supported VF devfn with the
>>>>>>> devfn of the passed-in pdev.
>>>>>>> Once vfio-pci finds a successful vfio_pci_mediate_ops->open(), it will
>>>>>>> stop querying other mediate ops and bind the opening device with this
>>>>>>> mediate ops using the returned mediate handle.
>>>>>>>
>>>>>>> Further vfio-pci ops (VFIO_DEVICE_GET_REGION_INFO ioctl, rw, mmap) on the
>>>>>>> VF will be intercepted into VF mediate driver as
>>>>>>> vfio_pci_mediate_ops->get_region_info(),
>>>>>>> vfio_pci_mediate_ops->rw,
>>>>>>> vfio_pci_mediate_ops->mmap, and get customized.
>>>>>>> For vfio_pci_mediate_ops->rw and vfio_pci_mediate_ops->mmap, they will
>>>>>>> further return 'pt' to indicate whether vfio-pci should further
>>>>>>> passthrough data to hw.
>>>>>>>
>>>>>>> when vfio-pci closes the VF, it calls its vfio_pci_mediate_ops->release()
>>>>>>> with a mediate handle as parameter.
>>>>>>>
>>>>>>> The mediate handle returned from vfio_pci_mediate_ops->open() lets VF
>>>>>>> mediate driver be able to differentiate two opening VFs of the same device
>>>>>>> id and vendor id.
>>>>>>>
>>>>>>> When VF mediate driver exits, it unregisters its mediate ops from
>>>>>>> vfio-pci.
>>>>>>>
>>>>>>>
>>>>>>> In this patchset, we enable vfio-pci to provide 3 things:
>>>>>>> (1) calling mediate ops to allow vendor driver customizing default
>>>>>>> region info/rw/mmap of a region.
>>>>>>> (2) provide a migration region to support migration
>>>>>> What's the benefit of introducing a region? It looks to me we don't expect
>>>>>> the region to be accessed directly from guest. Could we simply extend device
>>>>>> fd ioctl for doing such things?
>>>>>>
>>>>> You may take a look on mdev live migration discussions in
>>>>> https://lists.gnu.org/archive/html/qemu-devel/2019-11/msg01763.html
>>>>>
>>>>> or previous discussion at
>>>>> https://lists.gnu.org/archive/html/qemu-devel/2019-02/msg04908.html,
>>>>> which has kernel side implemetation https://patchwork.freedesktop.org/series/56876/
>>>>>
>>>>> generaly speaking, qemu part of live migration is consistent for
>>>>> vfio-pci + mediate ops way or mdev way.
>>>> So in mdev, do you still have a mediate driver? Or you expect the parent
>>>> to implement the region?
>>>>
>>> No, currently it's only for vfio-pci.
>> And specific to PCI.
>>
>>> mdev parent driver is free to customize its regions and hence does not
>>> requires this mediate ops hooks.
>>>
>>>>> The region is only a channel for
>>>>> QEMU and kernel to communicate information without introducing IOCTLs.
>>>> Well, at least you introduce new type of region in uapi. So this does
>>>> not answer why region is better than ioctl. If the region will only be
>>>> used by qemu, using ioctl is much more easier and straightforward.
>>>>
>>> It's not introduced by me :)
>>> mdev live migration is actually using this way, I'm just keeping
>>> compatible to the uapi.
>>
>> I meant e.g VFIO_REGION_TYPE_MIGRATION.
>>
> here's the history of vfio live migration:
> https://lists.gnu.org/archive/html/qemu-devel/2017-06/msg05564.html
> https://lists.gnu.org/archive/html/qemu-devel/2019-02/msg04908.html
> https://lists.gnu.org/archive/html/qemu-devel/2019-11/msg01763.html
>
> If you have any concern of this region way, feel free to comment to the
> latest v9 patchset:
> https://lists.gnu.org/archive/html/qemu-devel/2019-11/msg01763.html
>
> The patchset here will always keep compatible to there.


Sure.


>>>   From my own perspective, my answer is that a region is more flexible
>>> compared to ioctl. vendor driver can freely define the size,
>>>
>> Probably not since it's an ABI I think.
>>
> that's why I need to define VFIO_REGION_TYPE_MIGRATION here in this
> patchset, as it's not upstreamed yet.
> maybe I should make it into a prerequisite patch, indicating it is not
> introduced by this patchset


Yes.


>
>>>    mmap cap of
>>> its data subregion.
>>>
>> It doesn't help much unless it can be mapped into guest (which I don't
>> think it was the case here).
>>
> it's access by host qemu, the same as how linux app access an mmaped
> memory. the mmap here is to reduce memory copy from kernel to user.
> No need to get mapped into guest.


But copy_to_user() is not a bad choice. If I read the code correctly 
only the dirty bitmap was mmaped. This means you probably need to deal 
with dcache carefully on some archs. [1]

Note KVM doesn't use shared dirty bitmap, it uses copy_to_user().

[1] https://lkml.org/lkml/2019/4/9/5


>
>>>    Also, there're already too many ioctls in vfio.
>> Probably not :) We had a brunch of  subsystems that have much more
>> ioctls than VFIO. (e.g DRM)
>>
>>>>>>> (3) provide a dynamic trap bar info region to allow vendor driver
>>>>>>> control trap/untrap of device pci bars
>>>>>>>
>>>>>>> This vfio-pci + mediate ops way differs from mdev way in that
>>>>>>> (1) medv way needs to create a 1:1 mdev device on top of one VF, device
>>>>>>> specific mdev parent driver is bound to VF directly.
>>>>>>> (2) vfio-pci + mediate ops way does not create mdev devices and VF
>>>>>>> mediate driver does not bind to VFs. Instead, vfio-pci binds to VFs.
>>>>>>>
>>>>>>> The reason why we don't choose the way of writing mdev parent driver is
>>>>>>> that
>>>>>>> (1) VFs are almost all the time directly passthroughed. Directly binding
>>>>>>> to vfio-pci can make most of the code shared/reused.
>>>>>> Can we split out the common parts from vfio-pci?
>>>>>>
>>>>> That's very attractive. but one cannot implement a vfio-pci except
>>>>> export everything in it as common part :)
>>>> Well, I think there should be not hard to do that. E..g you can route it
>>>> back to like:
>>>>
>>>> vfio -> vfio_mdev -> parent -> vfio_pci
>>>>
>>> it's desired for us to have mediate driver binding to PF device.
>>> so once a VF device is created, only PF driver and vfio-pci are
>>> required. Just the same as what needs to be done for a normal VF passthrough.
>>> otherwise, a separate parent driver binding to VF is required.
>>> Also, this parent driver has many drawbacks as I mentions in this
>>> cover-letter.
>> Well, as discussed, no need to duplicate the code, bar trick should
>> still work. The main issues I saw with this proposal is:
>>
>> 1) PCI specific, other bus may need something similar
> vfio-pci is only for PCI of course.


I meant that if what is proposed here makes sense, other bus drivers 
like vfio-platform may want something similar.


>
>> 2) Function duplicated with mdev and mdev can do even more
>>
> could you elaborate how mdev can do solve the above saying problem ?


Well, I think both of us agree that mdev can do what mediate ops does; 
the mdev device implementation just needs to add the direct PCI access 
part.


>>>>>>>      If we write a
>>>>>>> vendor specific mdev parent driver, most of the code (like passthrough
>>>>>>> style of rw/mmap) still needs to be copied from vfio-pci driver, which is
>>>>>>> actually a duplicated and tedious work.
>>>>>> The mediate ops looks quite similar to what vfio-mdev did. And it looks to
>>>>>> me we need to consider live migration for mdev as well. In that case, do we
>>>>>> still expect mediate ops through VFIO directly?
>>>>>>
>>>>>>
>>>>>>> (2) For features like dynamically trap/untrap pci bars, if they are in
>>>>>>> vfio-pci, they can be available to most people without repeated code
>>>>>>> copying and re-testing.
>>>>>>> (3) with a 1:1 mdev driver which passthrough VFs most of the time, people
>>>>>>> have to decide whether to bind VFs to vfio-pci or mdev parent driver before
>>>>>>> it runs into a real migration need. However, if vfio-pci is bound
>>>>>>> initially, they have no chance to do live migration when there's a need
>>>>>>> later.
>>>>>> We can teach management layer to do this.
>>>>>>
>>>>> No. not possible as vfio-pci by default has no migration region and
>>>>> dirty page tracking needs vendor's mediation at least for most
>>>>> passthrough devices now.
>>>> I'm not quite sure I get here but in this case, just tech them to use
>>>> the driver that has migration support?
>>>>
>>> That's a way, but as more and more passthrough devices have demands and
>>> caps to do migration, will vfio-pci be used in future any more ?
>>
>> This should not be a problem:
>> - If we introduce a common mdev for vfio-pci, we can just bind that
>> driver always
> what is common mdev for vfio-pci? a common mdev parent driver that have
> the same implementation as vfio-pci?


The common part is not PCI, of course. The common part is that both mdev 
and mediate ops want to do some kind of mediation. Mdev is bus agnostic, 
while what you propose here is PCI specific, but it should be bus 
agnostic as well. Assuming we implement a bus-agnostic mediate ops, mdev 
could even be built on top.
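
To make the idea concrete, here is a rough userspace model of what a
bus-agnostic mediate ops could look like. It follows the flow described
in the cover letter (walk the registered ops, bind to the first one
whose open() succeeds, let rw report 'pt'), but struct generic_dev, the
function names, and the demo driver are all hypothetical. None of this
is the vfio_pci_mediate_ops interface from the patches.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical bus-agnostic device identity; not a real kernel type. */
struct generic_dev {
    const char *bus;
    uint32_t id;
};

/* Hypothetical bus-agnostic mediate ops, modeled on the cover letter. */
struct mediate_ops {
    /* Return 0 and fill *handle if this driver supports dev. */
    int (*open)(const struct generic_dev *dev, uint64_t *handle);
    /* Set *pt to 1 if the access should still be passed through to hw. */
    void (*rw)(uint64_t handle, uint64_t offset, int is_write, int *pt);
};

#define MAX_OPS 4
static const struct mediate_ops *ops_list[MAX_OPS];
static size_t ops_count;

static int register_mediate_ops(const struct mediate_ops *ops)
{
    assert(ops && ops->open);
    if (ops_count == MAX_OPS)
        return -1;
    ops_list[ops_count++] = ops;
    return 0;
}

struct opened_dev {
    const struct mediate_ops *ops;  /* NULL means plain passthrough */
    uint64_t handle;
};

/* Walk the list and bind to the first ops whose open() succeeds. */
static void mediated_open(struct opened_dev *od, const struct generic_dev *dev)
{
    od->ops = NULL;
    od->handle = 0;
    for (size_t i = 0; i < ops_count; i++) {
        uint64_t handle = 0;

        if (ops_list[i]->open(dev, &handle) == 0) {
            od->ops = ops_list[i];
            od->handle = handle;
            return;
        }
    }
}

/* Returns 1 if the access reaches hardware, 0 if it was fully mediated. */
static int access_reaches_hw(const struct opened_dev *od, uint64_t offset,
                             int is_write)
{
    int pt = 1;

    if (od->ops && od->ops->rw)
        od->ops->rw(od->handle, offset, is_write, &pt);
    return pt;
}

/* Demo mediate driver: supports PCI id 0x10, traps writes below 0x1000. */
static int demo_open(const struct generic_dev *dev, uint64_t *handle)
{
    if (strcmp(dev->bus, "pci") != 0 || dev->id != 0x10)
        return -1;
    *handle = dev->id;
    return 0;
}

static void demo_rw(uint64_t handle, uint64_t offset, int is_write, int *pt)
{
    (void)handle;
    *pt = !(is_write && offset < 0x1000);
}

static const struct mediate_ops demo_ops = { .open = demo_open, .rw = demo_rw };
```

Nothing in the dispatch path depends on PCI; only demo_open() looks at
the bus. A device that no registered ops claims stays in plain
passthrough.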


>
> There's actually already a solution of creating only one mdev on top
> of each passthrough device, and make mdev share the same iommu group
> with it. We've also made an implementation on it already. here's a
> sample one made by Yi at https://patchwork.kernel.org/cover/11134695/.
>
> But, as I said, it's desired to re-use vfio-pci directly for SRIOV,
> which is straghtforward :)


Can we have a device that is capable of both SRIOV and function slicing? 
If yes, does it mean you need to provide two drivers? One for mdev, 
another for mediate ops?


>
>> - The most straightforward way to support dirty page tracking is done by
>> IOMMU instead of device specific operations.
>>
> No such IOMMU yet. And all kinds of platforms should be cared, right?


Or the device can track dirty pages by itself. Otherwise it would be 
very hard to implement dirty page tracking correctly without switching 
to a software datapath (or maybe you can post the part of BAR0 
mediation and dirty page tracking which is missing from this series?)

Thanks


>
> Thanks
> Yan
>
>> Thanks
>>
>>> Thanks
>>> Yan
>>>
>>>> Thanks
>>>>
>>>>
>>>>> Thanks
>>>>> Yn
>>>>>
>>>>>> Thanks
>>>>>>
>>>>>>
>>>>>>> In this patchset,
>>>>>>> - patches 1-4 enable vfio-pci to call mediate ops registered by vendor
>>>>>>>       driver to mediate/customize region info/rw/mmap.
>>>>>>>
>>>>>>> - patches 5-6 provide a standalone sample driver to register a mediate ops
>>>>>>>       for Intel Graphics Devices. It does not bind to IGDs directly but decides
>>>>>>>       what devices it supports via its pciidlist. It also demonstrates how to
>>>>>>>       dynamic trap a device's PCI bars. (by adding more pciids in its
>>>>>>>       pciidlist, this sample driver actually is not necessarily limited to
>>>>>>>       support IGDs)
>>>>>>>
>>>>>>> - patch 7-9 provide a sample on i40e driver that supports Intel(R)
>>>>>>>       Ethernet Controller XL710 Family of devices. It supports VF precopy live
>>>>>>>       migration on Intel's 710 SRIOV. (but we commented out the real
>>>>>>>       implementation of dirty page tracking and device state retrieving part
>>>>>>>       to focus on demonstrating framework part. Will send out them in future
>>>>>>>       versions)
>>>>>>>       patch 7 registers/unregisters VF mediate ops when PF driver
>>>>>>>       probes/removes. It specifies its supporting VFs via
>>>>>>>       vfio_pci_mediate_ops->open(pdev)
>>>>>>>
>>>>>>>       patch 8 reports device cap of VFIO_PCI_DEVICE_CAP_MIGRATION and
>>>>>>>       provides a sample implementation of migration region.
>>>>>>>       The QEMU part of vfio migration is based on v8
>>>>>>>       https://lists.gnu.org/archive/html/qemu-devel/2019-08/msg05542.html.
>>>>>>>       We do not based on recent v9 because we think there are still opens in
>>>>>>>       dirty page track part in that series.
>>>>>>>
>>>>>>>       patch 9 reports device cap of VFIO_PCI_DEVICE_CAP_DYNAMIC_TRAP_BAR and
>>>>>>>       provides an example on how to trap part of bar0 when migration starts
>>>>>>>       and passthrough this part of bar0 again when migration fails.
>>>>>>>
>>>>>>> Yan Zhao (9):
>>>>>>>       vfio/pci: introduce mediate ops to intercept vfio-pci ops
>>>>>>>       vfio/pci: test existence before calling region->ops
>>>>>>>       vfio/pci: register a default migration region
>>>>>>>       vfio-pci: register default dynamic-trap-bar-info region
>>>>>>>       samples/vfio-pci/igd_dt: sample driver to mediate a passthrough IGD
>>>>>>>       sample/vfio-pci/igd_dt: dynamically trap/untrap subregion of IGD bar0
>>>>>>>       i40e/vf_migration: register mediate_ops to vfio-pci
>>>>>>>       i40e/vf_migration: mediate migration region
>>>>>>>       i40e/vf_migration: support dynamic trap of bar0
>>>>>>>
>>>>>>>      drivers/net/ethernet/intel/Kconfig            |   2 +-
>>>>>>>      drivers/net/ethernet/intel/i40e/Makefile      |   3 +-
>>>>>>>      drivers/net/ethernet/intel/i40e/i40e.h        |   2 +
>>>>>>>      drivers/net/ethernet/intel/i40e/i40e_main.c   |   3 +
>>>>>>>      .../ethernet/intel/i40e/i40e_vf_migration.c   | 626 ++++++++++++++++++
>>>>>>>      .../ethernet/intel/i40e/i40e_vf_migration.h   |  78 +++
>>>>>>>      drivers/vfio/pci/vfio_pci.c                   | 189 +++++-
>>>>>>>      drivers/vfio/pci/vfio_pci_private.h           |   2 +
>>>>>>>      include/linux/vfio.h                          |  18 +
>>>>>>>      include/uapi/linux/vfio.h                     | 160 +++++
>>>>>>>      samples/Kconfig                               |   6 +
>>>>>>>      samples/Makefile                              |   1 +
>>>>>>>      samples/vfio-pci/Makefile                     |   2 +
>>>>>>>      samples/vfio-pci/igd_dt.c                     | 367 ++++++++++
>>>>>>>      14 files changed, 1455 insertions(+), 4 deletions(-)
>>>>>>>      create mode 100644 drivers/net/ethernet/intel/i40e/i40e_vf_migration.c
>>>>>>>      create mode 100644 drivers/net/ethernet/intel/i40e/i40e_vf_migration.h
>>>>>>>      create mode 100644 samples/vfio-pci/Makefile
>>>>>>>      create mode 100644 samples/vfio-pci/igd_dt.c
>>>>>>>
Jason Wang Dec. 12, 2019, 4:09 a.m. UTC | #9
On 2019/12/7 1:42 AM, Alex Williamson wrote:
> On Fri, 6 Dec 2019 17:40:02 +0800
> Jason Wang <jasowang@redhat.com> wrote:
>
>> On 2019/12/6 4:22 PM, Yan Zhao wrote:
>>> On Thu, Dec 05, 2019 at 09:05:54PM +0800, Jason Wang wrote:
>>>> On 2019/12/5 4:51 PM, Yan Zhao wrote:
>>>>> On Thu, Dec 05, 2019 at 02:33:19PM +0800, Jason Wang wrote:
>>>>>> Hi:
>>>>>>
>>>>>> On 2019/12/5 11:24 AM, Yan Zhao wrote:
>>>>>>> For SRIOV devices, VFs are passthroughed into guest directly without host
>>>>>>> driver mediation. However, when VMs migrating with passthroughed VFs,
>>>>>>> dynamic host mediation is required to  (1) get device states, (2) get
>>>>>>> dirty pages. Since device states as well as other critical information
>>>>>>> required for dirty page tracking for VFs are usually retrieved from PFs,
>>>>>>> it is handy to provide an extension in PF driver to centralizingly control
>>>>>>> VFs' migration.
>>>>>>>
>>>>>>> Therefore, in order to realize (1) passthrough VFs at normal time, (2)
>>>>>>> dynamically trap VFs' bars for dirty page tracking and
>>>>>> A silly question, what's the reason for doing this, is this a must for dirty
>>>>>> page tracking?
>>>>>>   
>>>>> For performance consideration. VFs' bars should be passthoughed at
>>>>> normal time and only enter into trap state on need.
>>>> Right, but how does this matter for the case of dirty page tracking?
>>>>   
>>> Take NIC as an example, to trap its VF dirty pages, software way is
>>> required to trap every write of ring tail that resides in BAR0.
>>
>> Interesting, but it looks like we need:
>> - decode the instruction
>> - mediate all access to BAR0
>> All of which seems a great burden for the VF driver. I wonder whether or
>> not doing interrupt relay and tracking head is better in this case.
> This sounds like a NIC specific solution, I believe the goal here is to
> allow any device type to implement a partial mediation solution, in
> this case to sufficiently track the device while in the migration
> saving state.


I suspect there's a solution that can work for any device type. E.g. for 
virtio, the avail index (head) doesn't belong to any BAR and the device 
may decide to disable the doorbell from the guest. The same goes for 
interrupt relay, since the driver may choose to disable interrupts from 
the device. In this case, the only way to track dirty pages correctly is 
to switch to a software datapath.


>
>>>    There's
>>> still no IOMMU Dirty bit available.
>>>>>>>      (3) centralizing
>>>>>>> VF critical states retrieving and VF controls into one driver, we propose
>>>>>>> to introduce mediate ops on top of current vfio-pci device driver.
>>>>>>>
>>>>>>>
>>>>>>>                                        _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
>>>>>>>      __________   register mediate ops|  ___________     ___________    |
>>>>>>> |          |<-----------------------|     VF    |   |           |
>>>>>>> | vfio-pci |                      | |  mediate  |   | PF driver |   |
>>>>>>> |__________|----------------------->|   driver  |   |___________|
>>>>>>>          |            open(pdev)      |  -----------          |         |
>>>>>>>          |                                                    |
>>>>>>>          |                            |_ _ _ _ _ _ _ _ _ _ _ _|_ _ _ _ _|
>>>>>>>         \|/                                                  \|/
>>>>>>> -----------                                         ------------
>>>>>>> |    VF   |                                         |    PF    |
>>>>>>> -----------                                         ------------
>>>>>>>
>>>>>>>
>>>>>>> VF mediate driver could be a standalone driver that does not bind to
>>>>>>> any devices (as in demo code in patches 5-6) or it could be a built-in
>>>>>>> extension of PF driver (as in patches 7-9) .
>>>>>>>
>>>>>>> Rather than directly bind to VF, VF mediate driver register a mediate
>>>>>>> ops into vfio-pci in driver init. vfio-pci maintains a list of such
>>>>>>> mediate ops.
>>>>>>> (Note that: VF mediate driver can register mediate ops into vfio-pci
>>>>>>> before vfio-pci binding to any devices. And VF mediate driver can
>>>>>>> support mediating multiple devices.)
>>>>>>>
>>>>>>> When opening a device (e.g. a VF), vfio-pci goes through the mediate ops
>>>>>>> list and calls each vfio_pci_mediate_ops->open() with pdev of the opening
>>>>>>> device as a parameter.
>>>>>>> VF mediate driver should return success or failure depending on it
>>>>>>> supports the pdev or not.
>>>>>>> E.g. VF mediate driver would compare its supported VF devfn with the
>>>>>>> devfn of the passed-in pdev.
>>>>>>> Once vfio-pci finds a successful vfio_pci_mediate_ops->open(), it will
>>>>>>> stop querying other mediate ops and bind the opening device with this
>>>>>>> mediate ops using the returned mediate handle.
>>>>>>>
>>>>>>> Further vfio-pci ops (VFIO_DEVICE_GET_REGION_INFO ioctl, rw, mmap) on the
>>>>>>> VF will be intercepted into VF mediate driver as
>>>>>>> vfio_pci_mediate_ops->get_region_info(),
>>>>>>> vfio_pci_mediate_ops->rw,
>>>>>>> vfio_pci_mediate_ops->mmap, and get customized.
>>>>>>> For vfio_pci_mediate_ops->rw and vfio_pci_mediate_ops->mmap, they will
>>>>>>> further return 'pt' to indicate whether vfio-pci should further
>>>>>>> passthrough data to hw.
>>>>>>>
>>>>>>> when vfio-pci closes the VF, it calls its vfio_pci_mediate_ops->release()
>>>>>>> with a mediate handle as parameter.
>>>>>>>
>>>>>>> The mediate handle returned from vfio_pci_mediate_ops->open() lets VF
>>>>>>> mediate driver be able to differentiate two opening VFs of the same device
>>>>>>> id and vendor id.
>>>>>>>
>>>>>>> When VF mediate driver exits, it unregisters its mediate ops from
>>>>>>> vfio-pci.
>>>>>>>
>>>>>>>
>>>>>>> In this patchset, we enable vfio-pci to provide 3 things:
>>>>>>> (1) calling mediate ops to allow vendor driver customizing default
>>>>>>> region info/rw/mmap of a region.
>>>>>>> (2) provide a migration region to support migration
>>>>>> What's the benefit of introducing a region? It looks to me we don't expect
>>>>>> the region to be accessed directly from guest. Could we simply extend device
>>>>>> fd ioctl for doing such things?
>>>>>>   
>>>>> You may take a look on mdev live migration discussions in
>>>>> https://lists.gnu.org/archive/html/qemu-devel/2019-11/msg01763.html
>>>>>
>>>>> or previous discussion at
>>>>> https://lists.gnu.org/archive/html/qemu-devel/2019-02/msg04908.html,
>>>>> which has kernel side implemetation https://patchwork.freedesktop.org/series/56876/
>>>>>
>>>>> generaly speaking, qemu part of live migration is consistent for
>>>>> vfio-pci + mediate ops way or mdev way.
>>>> So in mdev, do you still have a mediate driver? Or you expect the parent
>>>> to implement the region?
>>>>   
>>> No, currently it's only for vfio-pci.
>> And specific to PCI.
> What's PCI specific?  The implementation, yes, it's done in the bus
> vfio bus driver here but all device access is performed by the bus
> driver.  I'm not sure how we could introduce the intercept at the
> vfio-core level, but I'm open to suggestions.


I haven't thought about this too much, but if we can intercept at the 
core level, it can basically do what mdev can do right now.


>
>>> mdev parent driver is free to customize its regions and hence does not
>>> requires this mediate ops hooks.
>>>   
>>>>> The region is only a channel for
>>>>> QEMU and kernel to communicate information without introducing IOCTLs.
>>>> Well, at least you introduce new type of region in uapi. So this does
>>>> not answer why region is better than ioctl. If the region will only be
>>>> used by qemu, using ioctl is much more easier and straightforward.
>>>>   
>>> It's not introduced by me :)
>>> mdev live migration is actually using this way, I'm just keeping
>>> compatible to the uapi.
>>
>> I meant e.g VFIO_REGION_TYPE_MIGRATION.
>>
>>
>>>   From my own perspective, my answer is that a region is more flexible
>>> compared to ioctl. vendor driver can freely define the size,
>>>   
>> Probably not since it's an ABI I think.
> I think Kirti's thread proposing the migration interface is a better
> place for this discussion, I believe Yan has already linked to it.  In
> general we prefer to be frugal in our introduction of new ioctls,
> especially when we have existing mechanisms via regions to support the
> interactions.  The interface is designed to be flexible to the vendor
> driver needs, partially thanks to it being a region.
>
>>>    mmap cap of
>>> its data subregion.
>>>   
>> It doesn't help much unless it can be mapped into guest (which I don't
>> think it was the case here).
>>
>>>    Also, there're already too many ioctls in vfio.
>> Probably not :) We had a brunch of  subsystems that have much more
>> ioctls than VFIO. (e.g DRM)
> And this is a good thing?


Well, I just meant that "having too many ioctls already" is not a good 
reason for not introducing new ones.


> We can more easily deprecate and revise
> region support than we can take back ioctls that have been previously
> used.


It belongs to uapi, how easily can we deprecate that?


> I generally don't like the "let's create a new ioctl for that"
> approach versus trying to fit something within the existing
> architecture and convention.
>
>>>>>>> (3) provide a dynamic trap bar info region to allow vendor driver
>>>>>>> control trap/untrap of device pci bars
>>>>>>>
>>>>>>> This vfio-pci + mediate ops way differs from mdev way in that
>>>>>>> (1) medv way needs to create a 1:1 mdev device on top of one VF, device
>>>>>>> specific mdev parent driver is bound to VF directly.
>>>>>>> (2) vfio-pci + mediate ops way does not create mdev devices and VF
>>>>>>> mediate driver does not bind to VFs. Instead, vfio-pci binds to VFs.
>>>>>>>
>>>>>>> The reason why we don't choose the way of writing mdev parent driver is
>>>>>>> that
>>>>>>> (1) VFs are almost all the time directly passthroughed. Directly binding
>>>>>>> to vfio-pci can make most of the code shared/reused.
>>>>>> Can we split out the common parts from vfio-pci?
>>>>>>   
>>>>> That's very attractive. but one cannot implement a vfio-pci except
>>>>> export everything in it as common part :)
>>>> Well, I think there should be not hard to do that. E..g you can route it
>>>> back to like:
>>>>
>>>> vfio -> vfio_mdev -> parent -> vfio_pci
>>>>   
>>> it's desired for us to have mediate driver binding to PF device.
>>> so once a VF device is created, only PF driver and vfio-pci are
>>> required. Just the same as what needs to be done for a normal VF passthrough.
>>> otherwise, a separate parent driver binding to VF is required.
>>> Also, this parent driver has many drawbacks as I mentions in this
>>> cover-letter.
>> Well, as discussed, no need to duplicate the code, bar trick should
>> still work. The main issues I saw with this proposal is:
>>
>> 1) PCI specific, other bus may need something similar
> Propose how it could be implemented higher in the vfio stack to make it
> device agnostic.


E.g doing it in vfio_device_fops instead of vfio_pci_ops?


>
>> 2) Function duplicated with mdev and mdev can do even more
> mdev also comes with a device lifecycle interface that doesn't really
> make sense when a driver is only trying to partially mediate a single
> physical device rather than multiplex a physical device into virtual
> devices.


Yes, but that part could be decoupled out of mdev.


>   mdev would also require vendor drivers to re-implement
> much of vfio-pci for the direct access mechanisms.  Also, do we really
> want users or management tools to decide between binding a device to
> vfio-pci or a separate mdev driver to get this functionality.  We've
> already been burnt trying to use mdev beyond its scope.


The problem is: if we had a device that supports both SRIOV and mdev, 
does this mean we need to prepare two sets of drivers?


>
>>>>>>>      If we write a
>>>>>>> vendor specific mdev parent driver, most of the code (like passthrough
>>>>>>> style of rw/mmap) still needs to be copied from vfio-pci driver, which is
>>>>>>> actually a duplicated and tedious work.
>>>>>> The mediate ops looks quite similar to what vfio-mdev did. And it looks to
>>>>>> me we need to consider live migration for mdev as well. In that case, do we
>>>>>> still expect mediate ops through VFIO directly?
>>>>>>
>>>>>>   
>>>>>>> (2) For features like dynamically trap/untrap pci bars, if they are in
>>>>>>> vfio-pci, they can be available to most people without repeated code
>>>>>>> copying and re-testing.
>>>>>>> (3) with a 1:1 mdev driver which passthrough VFs most of the time, people
>>>>>>> have to decide whether to bind VFs to vfio-pci or mdev parent driver before
>>>>>>> it runs into a real migration need. However, if vfio-pci is bound
>>>>>>> initially, they have no chance to do live migration when there's a need
>>>>>>> later.
>>>>>> We can teach management layer to do this.
>>>>>>   
>>>>> No. not possible as vfio-pci by default has no migration region and
>>>>> dirty page tracking needs vendor's mediation at least for most
>>>>> passthrough devices now.
>>>> I'm not quite sure I get here but in this case, just tech them to use
>>>> the driver that has migration support?
>>>>   
>>> That's a way, but as more and more passthrough devices have demands and
>>> caps to do migration, will vfio-pci be used in future any more ?
>>
>> This should not be a problem:
>> - If we introduce a common mdev for vfio-pci, we can just bind that
>> driver always
> There's too much of mdev that doesn't make sense for this usage model,
> this is why Yi's proposed generic mdev PCI wrapper is only a sample
> driver.  I think we do not want to introduce user confusion regarding
> which driver to use and there are outstanding non-singleton group
> issues with mdev that don't seem worthwhile to resolve.


I agree, but I think what users want is a unified driver that works for
both SRIOV and mdev. That's why trying to have a common way of doing
mediation may make sense.

Thanks


>
>> - The most straightforward way to support dirty page tracking is done by
>> IOMMU instead of device specific operations.
> Of course, but it doesn't exist yet.  We're attempting to design the
> dirty page tracking in a way that's mostly transparent for current mdev
> drivers, would provide generic support for IOMMU-based dirty tracking,
> and extensible to the inevitability of vendor driver tracking.  Thanks,
>
> Alex
Yan Zhao Dec. 12, 2019, 5:47 a.m. UTC | #10
On Thu, Dec 12, 2019 at 11:48:25AM +0800, Jason Wang wrote:
> 
> On 2019/12/6 8:49 PM, Yan Zhao wrote:
> > On Fri, Dec 06, 2019 at 05:40:02PM +0800, Jason Wang wrote:
> >> On 2019/12/6 4:22 PM, Yan Zhao wrote:
> >>> On Thu, Dec 05, 2019 at 09:05:54PM +0800, Jason Wang wrote:
> >>>> On 2019/12/5 4:51 PM, Yan Zhao wrote:
> >>>>> On Thu, Dec 05, 2019 at 02:33:19PM +0800, Jason Wang wrote:
> >>>>>> Hi:
> >>>>>>
> >>>>>> On 2019/12/5 11:24 AM, Yan Zhao wrote:
> >>>>>>> For SRIOV devices, VFs are passthroughed into guest directly without host
> >>>>>>> driver mediation. However, when VMs migrating with passthroughed VFs,
> >>>>>>> dynamic host mediation is required to  (1) get device states, (2) get
> >>>>>>> dirty pages. Since device states as well as other critical information
> >>>>>>> required for dirty page tracking for VFs are usually retrieved from PFs,
> >>>>>>> it is handy to provide an extension in PF driver to centralizingly control
> >>>>>>> VFs' migration.
> >>>>>>>
> >>>>>>> Therefore, in order to realize (1) passthrough VFs at normal time, (2)
> >>>>>>> dynamically trap VFs' bars for dirty page tracking and
> >>>>>> A silly question, what's the reason for doing this, is this a must for dirty
> >>>>>> page tracking?
> >>>>>>
> >>>>> For performance consideration. VFs' bars should be passthroughed at
> >>>>> normal time and only enter into trap state on need.
> >>>> Right, but how does this matter for the case of dirty page tracking?
> >>>>
> >>> Take NIC as an example, to trap its VF dirty pages, software way is
> >>> required to trap every write of ring tail that resides in BAR0.
> >>
> >> Interesting, but it looks like we need:
> >> - decode the instruction
> >> - mediate all access to BAR0
> >> All of which seems a great burden for the VF driver. I wonder whether or
> >> not doing interrupt relay and tracking head is better in this case.
> >>
> > hi Jason
> >
> > not familiar with the way you mentioned. could you elaborate more?
> 
> 
> It looks to me that you want to intercept the bar that contains the 
> head. Then you can figure out the buffers submitted from driver and you 
> still need to decide a proper time to mark them as dirty.
> 
It does not need to be accurate, right? A superset of the real dirty
bitmap is enough.
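
To illustrate the "superset" idea, here is a minimal sketch of what a
handler for a trapped ring-tail write might do: every buffer posted between
the old and new tail is marked dirty up front, whether or not the device
has written it yet. All names (sketch_dirty_log, sketch_desc,
sketch_on_tail_write), the 4 KiB page size and the fixed bitmap size are
assumptions made for illustration; none of this is code from the patchset.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_PAGE_SHIFT 12     /* assumes 4 KiB pages */
#define SKETCH_MAX_PAGES  1024   /* hypothetical guest size: 4 MiB */

/* Hypothetical per-VF dirty log; one bit per guest page. */
struct sketch_dirty_log {
	uint8_t bitmap[SKETCH_MAX_PAGES / 8];
};

/* Hypothetical ring descriptor: guest-physical buffer address and length. */
struct sketch_desc {
	uint64_t addr;
	uint32_t len;
};

/* Mark every page touched by [gpa, gpa + len) as dirty. */
void sketch_mark_dirty(struct sketch_dirty_log *log, uint64_t gpa, uint64_t len)
{
	uint64_t first, last, pfn;

	assert(log != NULL);
	if (len == 0)
		return;

	first = gpa >> SKETCH_PAGE_SHIFT;
	last = (gpa + len - 1) >> SKETCH_PAGE_SHIFT;
	/* Pages beyond the tracked range are silently ignored in this sketch. */
	for (pfn = first; pfn <= last && pfn < SKETCH_MAX_PAGES; pfn++)
		log->bitmap[pfn / 8] |= (uint8_t)(1u << (pfn % 8));
}

/* Return 1 if page `pfn` is marked dirty, 0 otherwise. */
int sketch_test_dirty(const struct sketch_dirty_log *log, uint64_t pfn)
{
	assert(log != NULL && pfn < SKETCH_MAX_PAGES);
	return (log->bitmap[pfn / 8] >> (pfn % 8)) & 1;
}

/*
 * Called when a write to the ring tail register is trapped. Marks the
 * buffers of all descriptors in [old_tail, new_tail), with wrap-around,
 * as dirty. The result is a superset of what the device really writes.
 */
void sketch_on_tail_write(struct sketch_dirty_log *log,
			  const struct sketch_desc *ring, uint32_t ring_size,
			  uint32_t old_tail, uint32_t new_tail)
{
	uint32_t i;

	assert(ring != NULL && ring_size > 0);
	assert(old_tail < ring_size && new_tail < ring_size);

	for (i = old_tail; i != new_tail; i = (i + 1) % ring_size)
		sketch_mark_dirty(log, ring[i].addr, ring[i].len);
}
```

The trade-off is the usual one: marking at post time never misses a page,
at the cost of possibly re-sending pages the device never touched.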

> What I meant is, intercept the interrupt, then you can still
> figure out the buffers which have been modified by the device and mark
> them as dirty.
> 
> Then there's no need to trap BAR and do decoding/emulation etc.
> 
> But it will still be tricky to be correct...
>
Intercepting the interrupt is a little hard if posted interrupts are enabled.
I think what you are worried about here is the timing of marking dirty pages,
right? Upon receiving an interrupt, you regard the DMAs as finished, so it is
safe to mark the pages dirty.
But with the BAR trap approach, we can at least keep those dirtied pages
marked dirty until the device stops. Of course, we have other methods to
optimize it.

> 
> >>>    There's
> >>> still no IOMMU Dirty bit available.
> >>>>>>>      (3) centralizing
> >>>>>>> VF critical states retrieving and VF controls into one driver, we propose
> >>>>>>> to introduce mediate ops on top of current vfio-pci device driver.
> >>>>>>>
> >>>>>>>
> >>>>>>>                                        _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> >>>>>>>      __________   register mediate ops|  ___________     ___________    |
> >>>>>>> |          |<-----------------------|     VF    |   |           |
> >>>>>>> | vfio-pci |                      | |  mediate  |   | PF driver |   |
> >>>>>>> |__________|----------------------->|   driver  |   |___________|
> >>>>>>>          |            open(pdev)      |  -----------          |         |
> >>>>>>>          |                                                    |
> >>>>>>>          |                            |_ _ _ _ _ _ _ _ _ _ _ _|_ _ _ _ _|
> >>>>>>>         \|/                                                  \|/
> >>>>>>> -----------                                         ------------
> >>>>>>> |    VF   |                                         |    PF    |
> >>>>>>> -----------                                         ------------
> >>>>>>>
> >>>>>>>
> >>>>>>> VF mediate driver could be a standalone driver that does not bind to
> >>>>>>> any devices (as in demo code in patches 5-6) or it could be a built-in
> >>>>>>> extension of PF driver (as in patches 7-9) .
> >>>>>>>
> >>>>>>> Rather than directly bind to VF, VF mediate driver register a mediate
> >>>>>>> ops into vfio-pci in driver init. vfio-pci maintains a list of such
> >>>>>>> mediate ops.
> >>>>>>> (Note that: VF mediate driver can register mediate ops into vfio-pci
> >>>>>>> before vfio-pci binding to any devices. And VF mediate driver can
> >>>>>>> support mediating multiple devices.)
> >>>>>>>
> >>>>>>> When opening a device (e.g. a VF), vfio-pci goes through the mediate ops
> >>>>>>> list and calls each vfio_pci_mediate_ops->open() with pdev of the opening
> >>>>>>> device as a parameter.
> >>>>>>> VF mediate driver should return success or failure depending on it
> >>>>>>> supports the pdev or not.
> >>>>>>> E.g. VF mediate driver would compare its supported VF devfn with the
> >>>>>>> devfn of the passed-in pdev.
> >>>>>>> Once vfio-pci finds a successful vfio_pci_mediate_ops->open(), it will
> >>>>>>> stop querying other mediate ops and bind the opening device with this
> >>>>>>> mediate ops using the returned mediate handle.
> >>>>>>>
> >>>>>>> Further vfio-pci ops (VFIO_DEVICE_GET_REGION_INFO ioctl, rw, mmap) on the
> >>>>>>> VF will be intercepted into VF mediate driver as
> >>>>>>> vfio_pci_mediate_ops->get_region_info(),
> >>>>>>> vfio_pci_mediate_ops->rw,
> >>>>>>> vfio_pci_mediate_ops->mmap, and get customized.
> >>>>>>> For vfio_pci_mediate_ops->rw and vfio_pci_mediate_ops->mmap, they will
> >>>>>>> further return 'pt' to indicate whether vfio-pci should further
> >>>>>>> passthrough data to hw.
> >>>>>>>
> >>>>>>> when vfio-pci closes the VF, it calls its vfio_pci_mediate_ops->release()
> >>>>>>> with a mediate handle as parameter.
> >>>>>>>
> >>>>>>> The mediate handle returned from vfio_pci_mediate_ops->open() lets VF
> >>>>>>> mediate driver be able to differentiate two opening VFs of the same device
> >>>>>>> id and vendor id.
> >>>>>>>
> >>>>>>> When VF mediate driver exits, it unregisters its mediate ops from
> >>>>>>> vfio-pci.
> >>>>>>>
> >>>>>>>
> >>>>>>> In this patchset, we enable vfio-pci to provide 3 things:
> >>>>>>> (1) calling mediate ops to allow vendor driver customizing default
> >>>>>>> region info/rw/mmap of a region.
> >>>>>>> (2) provide a migration region to support migration
> >>>>>> What's the benefit of introducing a region? It looks to me we don't expect
> >>>>>> the region to be accessed directly from guest. Could we simply extend device
> >>>>>> fd ioctl for doing such things?
> >>>>>>
> >>>>> You may take a look on mdev live migration discussions in
> >>>>> https://lists.gnu.org/archive/html/qemu-devel/2019-11/msg01763.html
> >>>>>
> >>>>> or previous discussion at
> >>>>> https://lists.gnu.org/archive/html/qemu-devel/2019-02/msg04908.html,
> >>>>> which has kernel side implementation https://patchwork.freedesktop.org/series/56876/
> >>>>>
> >>>>> generally speaking, qemu part of live migration is consistent for
> >>>>> vfio-pci + mediate ops way or mdev way.
> >>>> So in mdev, do you still have a mediate driver? Or you expect the parent
> >>>> to implement the region?
> >>>>
> >>> No, currently it's only for vfio-pci.
> >> And specific to PCI.
> >>
> >>> mdev parent driver is free to customize its regions and hence does not
> >>> requires this mediate ops hooks.
> >>>
> >>>>> The region is only a channel for
> >>>>> QEMU and kernel to communicate information without introducing IOCTLs.
> >>>> Well, at least you introduce new type of region in uapi. So this does
> >>>> not answer why region is better than ioctl. If the region will only be
> >>>> used by qemu, using ioctl is much easier and more straightforward.
> >>>>
> >>> It's not introduced by me :)
> >>> mdev live migration is actually using this way, I'm just keeping
> >>> compatible to the uapi.
> >>
> >> I meant e.g VFIO_REGION_TYPE_MIGRATION.
> >>
> > here's the history of vfio live migration:
> > https://lists.gnu.org/archive/html/qemu-devel/2017-06/msg05564.html
> > https://lists.gnu.org/archive/html/qemu-devel/2019-02/msg04908.html
> > https://lists.gnu.org/archive/html/qemu-devel/2019-11/msg01763.html
> >
> > If you have any concern of this region way, feel free to comment to the
> > latest v9 patchset:
> > https://lists.gnu.org/archive/html/qemu-devel/2019-11/msg01763.html
> >
> > The patchset here will always keep compatible to there.
> 
> 
> Sure.
> 
> 
> >>>   From my own perspective, my answer is that a region is more flexible
> >>> compared to ioctl. vendor driver can freely define the size,
> >>>
> >> Probably not since it's an ABI I think.
> >>
> > that's why I need to define VFIO_REGION_TYPE_MIGRATION here in this
> > patchset, as it's not upstreamed yet.
> > maybe I should make it into a prerequisite patch, indicating it is not
> > introduced by this patchset
> 
> 
> Yes.
> 
> 
> >
> >>>    mmap cap of
> >>> its data subregion.
> >>>
> >> It doesn't help much unless it can be mapped into guest (which I don't
> >> think it was the case here).
> >>
> > it's accessed by host qemu, the same way a linux app accesses mmaped
> > memory. the mmap here is to reduce memory copy from kernel to user.
> > No need to get mapped into guest.
> 
> 
> But copy_to_user() is not a bad choice. If I read the code correctly 
> only the dirty bitmap was mmaped. This means you probably need to deal 
> with dcache carefully on some archs. [1]
> 
> Note KVM doesn't use shared dirty bitmap, it uses copy_to_user().
> 
> [1] https://lkml.org/lkml/2019/4/9/5
>
On those platforms, mmap can be safely disabled by the vendor driver at will.
Also, when mmap is disabled, copy_to_user() is used in the region approach
as well. Anyway, please raise your concern in Kirti's thread for this common
part.
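
Seen from userspace, the two paths could look roughly like the sketch
below: map the data area when the region allows it, otherwise fall back to
a plain read, which is where the kernel side ends up in copy_to_user().
The function names and the can_mmap parameter are hypothetical; a real
VFIO user would derive that flag from VFIO_REGION_INFO_FLAG_MMAP as
reported by VFIO_DEVICE_GET_REGION_INFO. The demo uses a temporary file
in place of a device fd so that it can run anywhere.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700   /* for pread() and fileno() */
#endif
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Copy `len` bytes at file offset `off` of `fd` into `dst`.
 * `can_mmap` stands in for the region's mmap capability; `off` must be
 * page aligned when it is set. Returns 0 on success, -1 on error.
 */
int sketch_read_region(int fd, off_t off, size_t len, void *dst, int can_mmap)
{
	assert(dst != NULL);

	if (can_mmap) {
		void *map = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, off);

		if (map != MAP_FAILED) {
			memcpy(dst, map, len);
			munmap(map, len);
			return 0;
		}
		/* Mapping failed: fall through to the read path. */
	}

	return pread(fd, dst, len, off) == (ssize_t)len ? 0 : -1;
}

/* Read the same bytes through both paths and check that they agree. */
int sketch_demo(void)
{
	char pattern[64], a[64], b[64];
	FILE *f = tmpfile();
	int fd, ok;

	if (f == NULL)
		return -1;

	memset(pattern, 0x5a, sizeof(pattern));
	if (fwrite(pattern, 1, sizeof(pattern), f) != sizeof(pattern) ||
	    fflush(f) != 0) {
		fclose(f);
		return -1;
	}

	fd = fileno(f);
	ok = sketch_read_region(fd, 0, sizeof(a), a, 1) == 0 &&
	     sketch_read_region(fd, 0, sizeof(b), b, 0) == 0 &&
	     memcmp(a, pattern, sizeof(a)) == 0 &&
	     memcmp(b, pattern, sizeof(b)) == 0;

	fclose(f);
	return ok ? 0 : -1;
}
```

Either path yields the same data; mmap only saves the extra copy.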

> 
> >
> >>>    Also, there're already too many ioctls in vfio.
> >> Probably not :) We have a bunch of subsystems that have many more
> >> ioctls than VFIO. (e.g. DRM)
> >>
> >>>>>>> (3) provide a dynamic trap bar info region to allow vendor driver
> >>>>>>> control trap/untrap of device pci bars
> >>>>>>>
> >>>>>>> This vfio-pci + mediate ops way differs from mdev way in that
> >>>>>>> (1) mdev way needs to create a 1:1 mdev device on top of one VF, device
> >>>>>>> specific mdev parent driver is bound to VF directly.
> >>>>>>> (2) vfio-pci + mediate ops way does not create mdev devices and VF
> >>>>>>> mediate driver does not bind to VFs. Instead, vfio-pci binds to VFs.
> >>>>>>>
> >>>>>>> The reason why we don't choose the way of writing mdev parent driver is
> >>>>>>> that
> >>>>>>> (1) VFs are almost all the time directly passthroughed. Directly binding
> >>>>>>> to vfio-pci can make most of the code shared/reused.
> >>>>>> Can we split out the common parts from vfio-pci?
> >>>>>>
> >>>>> That's very attractive. but one cannot implement a vfio-pci except
> >>>>> export everything in it as common part :)
> >>>> Well, I think it should not be hard to do that. E.g. you can route it
> >>>> back to like:
> >>>>
> >>>> vfio -> vfio_mdev -> parent -> vfio_pci
> >>>>
> >>> it's desired for us to have mediate driver binding to PF device.
> >>> so once a VF device is created, only PF driver and vfio-pci are
> >>> required. Just the same as what needs to be done for a normal VF passthrough.
> >>> otherwise, a separate parent driver binding to VF is required.
> >>> Also, this parent driver has many drawbacks as I mentioned in this
> >>> cover-letter.
> >> Well, as discussed, no need to duplicate the code, bar trick should
> >> still work. The main issues I saw with this proposal is:
> >>
> >> 1) PCI specific, other bus may need something similar
> > vfio-pci is only for PCI of course.
> 
> 
> I meant if what is proposed here makes sense, other bus drivers like
> vfio-platform may want something similar.
>
Sure, they can follow.
> 
> >
> >> 2) Function duplicated with mdev and mdev can do even more
> >>
> > could you elaborate on how mdev can solve the problem mentioned above?
> 
> 
> Well, I think both of us agree the mdev can do what mediate ops did, 
> mdev device implementation just need to add the direct PCI access part.
>
> 
> >>>>>>>      If we write a
> >>>>>>> vendor specific mdev parent driver, most of the code (like passthrough
> >>>>>>> style of rw/mmap) still needs to be copied from vfio-pci driver, which is
> >>>>>>> actually a duplicated and tedious work.
> >>>>>> The mediate ops looks quite similar to what vfio-mdev did. And it looks to
> >>>>>> me we need to consider live migration for mdev as well. In that case, do we
> >>>>>> still expect mediate ops through VFIO directly?
> >>>>>>
> >>>>>>
> >>>>>>> (2) For features like dynamically trap/untrap pci bars, if they are in
> >>>>>>> vfio-pci, they can be available to most people without repeated code
> >>>>>>> copying and re-testing.
> >>>>>>> (3) with a 1:1 mdev driver which passthrough VFs most of the time, people
> >>>>>>> have to decide whether to bind VFs to vfio-pci or mdev parent driver before
> >>>>>>> it runs into a real migration need. However, if vfio-pci is bound
> >>>>>>> initially, they have no chance to do live migration when there's a need
> >>>>>>> later.
> >>>>>> We can teach management layer to do this.
> >>>>>>
> >>>>> No. not possible as vfio-pci by default has no migration region and
> >>>>> dirty page tracking needs vendor's mediation at least for most
> >>>>> passthrough devices now.
> >>>> I'm not quite sure I get here but in this case, just teach them to use
> >>>> the driver that has migration support?
> >>>>
> >>> That's a way, but as more and more passthrough devices have demands and
> >>> caps to do migration, will vfio-pci be used in future any more ?
> >>
> >> This should not be a problem:
> >> - If we introduce a common mdev for vfio-pci, we can just bind that
> >> driver always
> > what is common mdev for vfio-pci? a common mdev parent driver that have
> > the same implementation as vfio-pci?
> 
> 
> The common part is not PCI of course. The common part is that both mdev
> and mediate ops want to do some kind of mediation. Mdev is bus agnostic,
> but what you propose here is PCI specific and should be bus agnostic as
> well. Assume we implement a bus agnostic mediate ops, mdev could even be
> built on top.
>
I believe Alex has already replied to the above better than I could.
> 
> >
> > There's actually already a solution of creating only one mdev on top
> > of each passthrough device, and make mdev share the same iommu group
> > with it. We've also made an implementation on it already. here's a
> > sample one made by Yi at https://patchwork.kernel.org/cover/11134695/.
> >
> > But, as I said, it's desired to re-use vfio-pci directly for SRIOV,
> > which is straightforward :)
> 
> 
> Can we have a device that is capable of both SRIOV and function slicing?
> If yes, does it mean you need to provide two drivers? One for mdev,
> another for mediate ops?
> 
What do you mean by "function slicing"? SIOV?
For the vendor driver, in SRIOV:
- with the mdev approach, two drivers are required: one mdev parent driver
  on the VF and one PF driver.
- with mediate ops + vfio-pci: one driver on the PF.

In SIOV, only one driver on the PF is needed in both cases.


> >
> >> - The most straightforward way to support dirty page tracking is done by
> >> IOMMU instead of device specific operations.
> >>
> > No such IOMMU yet. And all kinds of platforms should be cared, right?
> 
> 
> Or the device can track dirty pages by itself, otherwise it would be 
> very hard to implement dirty page tracking correctly without the help of 
> switching to software datapath (or maybe you can post the part of BAR0 

I think you are mixing up "correct" and "accurate".
DMA pre-inspection is a long-established approach, and we have implemented
and verified it in a NIC for both the precopy and postcopy cases. Though I
can't promise it is 100% bug free, the method is right.

Also, whether to trap BARs for dirty page tracking is vendor specific and
is not something this interface should care about.

> mediation and dirty page tracking which is missed in this series?)
> 

Currently, that part of the code is owned by Shaopeng's team. The code I
posted is only for demonstrating how to use the interface. Shaopeng's
team is responsible for upstreaming their part on their own schedule.

Thanks
Yan

> >>>>>>
> >>>>>>> In this patchset,
> >>>>>>> - patches 1-4 enable vfio-pci to call mediate ops registered by vendor
> >>>>>>>       driver to mediate/customize region info/rw/mmap.
> >>>>>>>
> >>>>>>> - patches 5-6 provide a standalone sample driver to register a mediate ops
> >>>>>>>       for Intel Graphics Devices. It does not bind to IGDs directly but decides
> >>>>>>>       what devices it supports via its pciidlist. It also demonstrates how to
> >>>>>>>       dynamic trap a device's PCI bars. (by adding more pciids in its
> >>>>>>>       pciidlist, this sample driver actually is not necessarily limited to
> >>>>>>>       support IGDs)
> >>>>>>>
> >>>>>>> - patch 7-9 provide a sample on i40e driver that supports Intel(R)
> >>>>>>>       Ethernet Controller XL710 Family of devices. It supports VF precopy live
> >>>>>>>       migration on Intel's 710 SRIOV. (but we commented out the real
> >>>>>>>       implementation of dirty page tracking and device state retrieving part
> >>>>>>>       to focus on demonstrating framework part. Will send out them in future
> >>>>>>>       versions)
> >>>>>>>       patch 7 registers/unregisters VF mediate ops when PF driver
> >>>>>>>       probes/removes. It specifies its supporting VFs via
> >>>>>>>       vfio_pci_mediate_ops->open(pdev)
> >>>>>>>
> >>>>>>>       patch 8 reports device cap of VFIO_PCI_DEVICE_CAP_MIGRATION and
> >>>>>>>       provides a sample implementation of migration region.
> >>>>>>>       The QEMU part of vfio migration is based on v8
> >>>>>>>       https://lists.gnu.org/archive/html/qemu-devel/2019-08/msg05542.html.
> >>>>>>>       We do not based on recent v9 because we think there are still opens in
> >>>>>>>       dirty page track part in that series.
> >>>>>>>
> >>>>>>>       patch 9 reports device cap of VFIO_PCI_DEVICE_CAP_DYNAMIC_TRAP_BAR and
> >>>>>>>       provides an example on how to trap part of bar0 when migration starts
> >>>>>>>       and passthrough this part of bar0 again when migration fails.
> >>>>>>>
> >>>>>>> Yan Zhao (9):
> >>>>>>>       vfio/pci: introduce mediate ops to intercept vfio-pci ops
> >>>>>>>       vfio/pci: test existence before calling region->ops
> >>>>>>>       vfio/pci: register a default migration region
> >>>>>>>       vfio-pci: register default dynamic-trap-bar-info region
> >>>>>>>       samples/vfio-pci/igd_dt: sample driver to mediate a passthrough IGD
> >>>>>>>       sample/vfio-pci/igd_dt: dynamically trap/untrap subregion of IGD bar0
> >>>>>>>       i40e/vf_migration: register mediate_ops to vfio-pci
> >>>>>>>       i40e/vf_migration: mediate migration region
> >>>>>>>       i40e/vf_migration: support dynamic trap of bar0
> >>>>>>>
> >>>>>>>      drivers/net/ethernet/intel/Kconfig            |   2 +-
> >>>>>>>      drivers/net/ethernet/intel/i40e/Makefile      |   3 +-
> >>>>>>>      drivers/net/ethernet/intel/i40e/i40e.h        |   2 +
> >>>>>>>      drivers/net/ethernet/intel/i40e/i40e_main.c   |   3 +
> >>>>>>>      .../ethernet/intel/i40e/i40e_vf_migration.c   | 626 ++++++++++++++++++
> >>>>>>>      .../ethernet/intel/i40e/i40e_vf_migration.h   |  78 +++
> >>>>>>>      drivers/vfio/pci/vfio_pci.c                   | 189 +++++-
> >>>>>>>      drivers/vfio/pci/vfio_pci_private.h           |   2 +
> >>>>>>>      include/linux/vfio.h                          |  18 +
> >>>>>>>      include/uapi/linux/vfio.h                     | 160 +++++
> >>>>>>>      samples/Kconfig                               |   6 +
> >>>>>>>      samples/Makefile                              |   1 +
> >>>>>>>      samples/vfio-pci/Makefile                     |   2 +
> >>>>>>>      samples/vfio-pci/igd_dt.c                     | 367 ++++++++++
> >>>>>>>      14 files changed, 1455 insertions(+), 4 deletions(-)
> >>>>>>>      create mode 100644 drivers/net/ethernet/intel/i40e/i40e_vf_migration.c
> >>>>>>>      create mode 100644 drivers/net/ethernet/intel/i40e/i40e_vf_migration.h
> >>>>>>>      create mode 100644 samples/vfio-pci/Makefile
> >>>>>>>      create mode 100644 samples/vfio-pci/igd_dt.c
> >>>>>>>
>
Alex Williamson Dec. 12, 2019, 6:39 p.m. UTC | #11
On Thu, 12 Dec 2019 12:09:48 +0800
Jason Wang <jasowang@redhat.com> wrote:

> On 2019/12/7 1:42 AM, Alex Williamson wrote:
> > On Fri, 6 Dec 2019 17:40:02 +0800
> > Jason Wang <jasowang@redhat.com> wrote:
> >  
> >> On 2019/12/6 4:22 PM, Yan Zhao wrote:
> >>> On Thu, Dec 05, 2019 at 09:05:54PM +0800, Jason Wang wrote:
> >>>> On 2019/12/5 4:51 PM, Yan Zhao wrote:
> >>>>> On Thu, Dec 05, 2019 at 02:33:19PM +0800, Jason Wang wrote:
> >>>>>> Hi:
> >>>>>>
> >>>>>> On 2019/12/5 11:24 AM, Yan Zhao wrote:
> >>>>>>> For SRIOV devices, VFs are passthroughed into guest directly without host
> >>>>>>> driver mediation. However, when VMs migrating with passthroughed VFs,
> >>>>>>> dynamic host mediation is required to  (1) get device states, (2) get
> >>>>>>> dirty pages. Since device states as well as other critical information
> >>>>>>> required for dirty page tracking for VFs are usually retrieved from PFs,
> >>>>>>> it is handy to provide an extension in PF driver to centralizingly control
> >>>>>>> VFs' migration.
> >>>>>>>
> >>>>>>> Therefore, in order to realize (1) passthrough VFs at normal time, (2)
> >>>>>>> dynamically trap VFs' bars for dirty page tracking and  
> >>>>>> A silly question, what's the reason for doing this, is this a must for dirty
> >>>>>> page tracking?
> >>>>>>     
> >>>>> For performance consideration. VFs' bars should be passthroughed at
> >>>>> normal time and only enter into trap state on need.  
> >>>> Right, but how does this matter for the case of dirty page tracking?
> >>>>     
> >>> Take NIC as an example, to trap its VF dirty pages, software way is
> >>> required to trap every write of ring tail that resides in BAR0.  
> >>
> >> Interesting, but it looks like we need:
> >> - decode the instruction
> >> - mediate all access to BAR0
> >> All of which seems a great burden for the VF driver. I wonder whether or
> >> not doing interrupt relay and tracking head is better in this case.  
> > This sounds like a NIC specific solution, I believe the goal here is to
> > allow any device type to implement a partial mediation solution, in
> > this case to sufficiently track the device while in the migration
> > saving state.  
> 
> 
> I suspect there's a solution that can work for any device type. E.g for 
> virtio, avail index (head) doesn't belong to any BAR and device may
> decide to disable doorbell from guest. The same goes for interrupt relay,
> since driver may choose to disable interrupt from device. In this case, the
> only way to track dirty pages correctly is to switch to software datapath.
> 
> 
> >  
> >>>    There's
> >>> still no IOMMU Dirty bit available.  
> >>>>>>>      (3) centralizing
> >>>>>>> VF critical states retrieving and VF controls into one driver, we propose
> >>>>>>> to introduce mediate ops on top of current vfio-pci device driver.
> >>>>>>>
> >>>>>>>
> >>>>>>>                                        _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> >>>>>>>      __________   register mediate ops|  ___________     ___________    |
> >>>>>>> |          |<-----------------------|     VF    |   |           |
> >>>>>>> | vfio-pci |                      | |  mediate  |   | PF driver |   |
> >>>>>>> |__________|----------------------->|   driver  |   |___________|
> >>>>>>>          |            open(pdev)      |  -----------          |         |
> >>>>>>>          |                                                    |
> >>>>>>>          |                            |_ _ _ _ _ _ _ _ _ _ _ _|_ _ _ _ _|
> >>>>>>>         \|/                                                  \|/
> >>>>>>> -----------                                         ------------
> >>>>>>> |    VF   |                                         |    PF    |
> >>>>>>> -----------                                         ------------
> >>>>>>>
> >>>>>>>
> >>>>>>> VF mediate driver could be a standalone driver that does not bind to
> >>>>>>> any devices (as in demo code in patches 5-6) or it could be a built-in
> >>>>>>> extension of PF driver (as in patches 7-9) .
> >>>>>>>
> >>>>>>> Rather than directly bind to VF, VF mediate driver register a mediate
> >>>>>>> ops into vfio-pci in driver init. vfio-pci maintains a list of such
> >>>>>>> mediate ops.
> >>>>>>> (Note that: VF mediate driver can register mediate ops into vfio-pci
> >>>>>>> before vfio-pci binding to any devices. And VF mediate driver can
> >>>>>>> support mediating multiple devices.)
> >>>>>>>
> >>>>>>> When opening a device (e.g. a VF), vfio-pci goes through the mediate ops
> >>>>>>> list and calls each vfio_pci_mediate_ops->open() with pdev of the opening
> >>>>>>> device as a parameter.
> >>>>>>> VF mediate driver should return success or failure depending on it
> >>>>>>> supports the pdev or not.
> >>>>>>> E.g. VF mediate driver would compare its supported VF devfn with the
> >>>>>>> devfn of the passed-in pdev.
> >>>>>>> Once vfio-pci finds a successful vfio_pci_mediate_ops->open(), it will
> >>>>>>> stop querying other mediate ops and bind the opening device with this
> >>>>>>> mediate ops using the returned mediate handle.
> >>>>>>>
> >>>>>>> Further vfio-pci ops (VFIO_DEVICE_GET_REGION_INFO ioctl, rw, mmap) on the
> >>>>>>> VF will be intercepted into VF mediate driver as
> >>>>>>> vfio_pci_mediate_ops->get_region_info(),
> >>>>>>> vfio_pci_mediate_ops->rw,
> >>>>>>> vfio_pci_mediate_ops->mmap, and get customized.
> >>>>>>> For vfio_pci_mediate_ops->rw and vfio_pci_mediate_ops->mmap, they will
> >>>>>>> further return 'pt' to indicate whether vfio-pci should further
> >>>>>>> passthrough data to hw.
> >>>>>>>
> >>>>>>> when vfio-pci closes the VF, it calls its vfio_pci_mediate_ops->release()
> >>>>>>> with a mediate handle as parameter.
> >>>>>>>
> >>>>>>> The mediate handle returned from vfio_pci_mediate_ops->open() lets VF
> >>>>>>> mediate driver be able to differentiate two opening VFs of the same device
> >>>>>>> id and vendor id.
> >>>>>>>
> >>>>>>> When VF mediate driver exits, it unregisters its mediate ops from
> >>>>>>> vfio-pci.
> >>>>>>>
> >>>>>>>
> >>>>>>> In this patchset, we enable vfio-pci to provide 3 things:
> >>>>>>> (1) calling mediate ops to allow vendor driver customizing default
> >>>>>>> region info/rw/mmap of a region.
> >>>>>>> (2) provide a migration region to support migration  
> >>>>>> What's the benefit of introducing a region? It looks to me we don't expect
> >>>>>> the region to be accessed directly from guest. Could we simply extend device
> >>>>>> fd ioctl for doing such things?
> >>>>>>     
> >>>>> You may take a look on mdev live migration discussions in
> >>>>> https://lists.gnu.org/archive/html/qemu-devel/2019-11/msg01763.html
> >>>>>
> >>>>> or previous discussion at
> >>>>> https://lists.gnu.org/archive/html/qemu-devel/2019-02/msg04908.html,
> >>>>> which has kernel side implementation https://patchwork.freedesktop.org/series/56876/
> >>>>>
> >>>>> generally speaking, qemu part of live migration is consistent for
> >>>>> vfio-pci + mediate ops way or mdev way.  
> >>>> So in mdev, do you still have a mediate driver? Or you expect the parent
> >>>> to implement the region?
> >>>>     
> >>> No, currently it's only for vfio-pci.  
> >> And specific to PCI.  
> > What's PCI specific?  The implementation, yes, it's done in the
> > vfio bus driver here but all device access is performed by the bus
> > driver.  I'm not sure how we could introduce the intercept at the
> > vfio-core level, but I'm open to suggestions.  
> 
> 
> I haven't thought this too much, but if we can intercept at core level, 
> it basically can do what mdev can do right now.

An intercept at the core level is essentially a new vfio bus driver.

> >>> mdev parent driver is free to customize its regions and hence does not
> >>> requires this mediate ops hooks.
> >>>     
> >>>>> The region is only a channel for
> >>>>> QEMU and kernel to communicate information without introducing IOCTLs.  
> >>>> Well, at least you introduce new type of region in uapi. So this does
> >>>> not answer why region is better than ioctl. If the region will only be
> >>>> used by qemu, using ioctl is much easier and more straightforward.
> >>>>     
> >>> It's not introduced by me :)
> >>> mdev live migration is actually using this way, I'm just keeping
> >>> compatible to the uapi.  
> >>
> >> I meant e.g VFIO_REGION_TYPE_MIGRATION.
> >>
> >>  
> >>>   From my own perspective, my answer is that a region is more flexible
> >>> compared to ioctl. vendor driver can freely define the size,
> >>>     
> >> Probably not since it's an ABI I think.  
> > I think Kirti's thread proposing the migration interface is a better
> > place for this discussion, I believe Yan has already linked to it.  In
> > general we prefer to be frugal in our introduction of new ioctls,
> > especially when we have existing mechanisms via regions to support the
> > interactions.  The interface is designed to be flexible to the vendor
> > driver needs, partially thanks to it being a region.
> >  
> >>>    mmap cap of
> >>> its data subregion.
> >>>     
> >> It doesn't help much unless it can be mapped into guest (which I don't
> >> think it was the case here).
> >>
> >>>    Also, there're already too many ioctls in vfio.  
> >> Probably not :) We have a bunch of subsystems that have many more
> >> ioctls than VFIO. (e.g. DRM)
> > And this is a good thing?  
> 
> 
> Well, I just meant that "having too many ioctls already" is not a good
> reason for not introducing new ones.

Avoiding ioctl proliferation is a reason to require a high bar for any
new ioctl though.  Push back on every ioctl and maybe we won't get to
that point.

> > We can more easily deprecate and revise
> > region support than we can take back ioctls that have been previously
> > used.  
> 
> 
> It belongs to uapi, how easily can we deprecate that?

I'm not saying there shouldn't be a deprecation process, but the core
uapi for vfio remains (relatively) unchanged.  The user has a protocol
for discovering the features of a device and if we decide we've screwed
up the implementation of the migration_region-v1 we can simply add a
migration_region-v2 and both can be exposed via the same core ioctls
until we decide to no longer expose v1.  Our address space of region
types is defined within vfio, not shared with every driver in the
kernel.  The "add an ioctl for that" approach is not the model I
advocate.
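
As a rough illustration of that discovery protocol, the sketch below walks
a capability chain of the kind returned alongside region info and reports
which subtype a region advertises for a given type. The struct layouts are
local stand-ins modelled on the vfio capability headers (id, version, and
a next offset relative to the start of the buffer, with 0 ending the
chain). The migration type and the v1/v2 subtype values are purely
hypothetical and are not uapi constants.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local stand-in for a capability header in the chain. */
struct sketch_cap_header {
	uint16_t id;
	uint16_t version;
	uint32_t next;   /* offset of the next capability, 0 = end of chain */
};

/* Local stand-in for the capability that carries a region's type/subtype. */
struct sketch_cap_type {
	struct sketch_cap_header header;
	uint32_t type;
	uint32_t subtype;
};

#define SKETCH_CAP_ID_TYPE     2   /* assumed id of the "type" capability */
#define SKETCH_TYPE_MIGRATION  3   /* hypothetical region type */
#define SKETCH_SUBTYPE_MIG_V1  1   /* hypothetical migration_region-v1 */
#define SKETCH_SUBTYPE_MIG_V2  2   /* hypothetical migration_region-v2 */

/*
 * Walk the chain in `buf` starting at `cap_offset`. Return the subtype of
 * the first type capability whose type equals `want_type`, or 0 when there
 * is none (0 is assumed not to be a valid subtype in this sketch).
 */
uint32_t sketch_find_subtype(const uint8_t *buf, size_t buf_len,
			     uint32_t cap_offset, uint32_t want_type)
{
	uint32_t off = cap_offset;
	size_t hops = 0;

	assert(buf != NULL);

	/* The hop limit guards against a malformed, looping chain. */
	while (off != 0 && hops++ < 64) {
		struct sketch_cap_header hdr;

		if (off > buf_len || buf_len - off < sizeof(hdr))
			return 0;
		memcpy(&hdr, buf + off, sizeof(hdr));

		if (hdr.id == SKETCH_CAP_ID_TYPE) {
			struct sketch_cap_type cap;

			if (buf_len - off < sizeof(cap))
				return 0;
			memcpy(&cap, buf + off, sizeof(cap));
			if (cap.type == want_type)
				return cap.subtype;
		}
		off = hdr.next;
	}
	return 0;
}
```

A user that understands both versions would query each region this way and
pick the v2 region when present, falling back to v1 otherwise, without any
new ioctl being involved.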

> > I generally don't like the "let's create a new ioctl for that"
> > approach versus trying to fit something within the existing
> > architecture and convention.
> >  
> >>>>>>> (3) provide a dynamic trap bar info region to allow vendor driver
> >>>>>>> control trap/untrap of device pci bars
> >>>>>>>
> >>>>>>> This vfio-pci + mediate ops way differs from mdev way in that
> >>>>>>> (1) mdev way needs to create a 1:1 mdev device on top of one VF, device
> >>>>>>> specific mdev parent driver is bound to VF directly.
> >>>>>>> (2) vfio-pci + mediate ops way does not create mdev devices and VF
> >>>>>>> mediate driver does not bind to VFs. Instead, vfio-pci binds to VFs.
> >>>>>>>
> >>>>>>> The reason why we don't choose the way of writing mdev parent driver is
> >>>>>>> that
> >>>>>>> (1) VFs are almost all the time directly passthroughed. Directly binding
> >>>>>>> to vfio-pci can make most of the code shared/reused.  
> >>>>>> Can we split out the common parts from vfio-pci?
> >>>>>>     
> >>>>> That's very attractive. but one cannot implement a vfio-pci except
> >>>>> export everything in it as common part :)  
> >>>> Well, I think it should not be hard to do that. E.g. you can route it
> >>>> back like:
> >>>>
> >>>> vfio -> vfio_mdev -> parent -> vfio_pci
> >>>>     
> >>> it's desired for us to have mediate driver binding to PF device.
> >>> so once a VF device is created, only PF driver and vfio-pci are
> >>> required. Just the same as what needs to be done for a normal VF passthrough.
> >>> otherwise, a separate parent driver binding to VF is required.
> >>> Also, this parent driver has many drawbacks as I mentioned in this
> >>> cover-letter.  
> >> Well, as discussed, no need to duplicate the code, bar trick should
> >> still work. The main issues I see with this proposal are:
> >>
> >> 1) PCI specific, other bus may need something similar  
> > Propose how it could be implemented higher in the vfio stack to make it
> > device agnostic.  
> 
> 
> E.g. doing it in vfio_device_fops instead of vfio_pci_ops?

Which is essentially a new vfio bus driver.  This is something vfio has
supported since day one.  Issues with doing that here are that it puts
the burden on the mediation/vendor driver to re-implement or re-use a
lot of existing code in vfio-pci, and I think it creates user confusion
around which driver to use for what feature set when using a device
through vfio.  You're complaining this series is PCI specific, when
re-using the vfio-pci code is exactly what we're trying to achieve.
Other bus types can do something similar and injecting vendor
specific drivers a layer above the bus driver is already a fundamental
part of the infrastructure.

> >> 2) Function duplicated with mdev and mdev can do even more  
> > mdev also comes with a device lifecycle interface that doesn't really
> > make sense when a driver is only trying to partially mediate a single
> > physical device rather than multiplex a physical device into virtual
> > devices.  
> 
> 
> Yes, but that part could be decoupled out of mdev.

There would be nothing left.  vfio-mdev is essentially nothing more than
a vfio bus driver that forwards through to mdev to provide that
lifecycle interface to the vendor driver.  Without that, it's just
another vfio bus driver.

> >   mdev would also require vendor drivers to re-implement
> > much of vfio-pci for the direct access mechanisms.  Also, do we really
> > want users or management tools to decide between binding a device to
> > vfio-pci or a separate mdev driver to get this functionality.  We've
> > already been burnt trying to use mdev beyond its scope.  
> 
> 
> The problem is, what if we have a device that supports both SRIOV and mdev?
> Does this mean we need to prepare two sets of drivers?

We have this situation today, modulo SR-IOV, but that's a red herring
anyway, VF vs PF is irrelevant.  For example we can either directly
assign IGD graphics to a VM with vfio-pci or we can enable GVT-g
support in the i915 driver, which registers vGPU support via mdev.
These are different use cases, expose different features, and have
different support models.  NVIDIA is the same way, assigning a GPU via
vfio-pci or a vGPU via vfio-mdev are entirely separate usage models.
Once we use mdev, it's at the vendor driver's discretion how the device
resources are backed, they might make use of the resource isolation of
SR-IOV or they might divide a single function.

If your question is whether there's a concern around proliferation of
vfio bus drivers and user confusion over which to use for what
features, yes, absolutely.  I think this is why we're starting with
seeing what it looks like to add mediation to vfio-pci rather than
modularize vfio-pci and ask Intel to develop a new vfio-pci-intel-dsa
driver.  I'm not yet convinced we won't eventually come back to that
latter approach though if this initial draft is what we can expect of a
mediated vfio-pci.

> >>>>>>>      If we write a
> >>>>>>> vendor specific mdev parent driver, most of the code (like passthrough
> >>>>>>> style of rw/mmap) still needs to be copied from vfio-pci driver, which is
> >>>>>>> actually a duplicated and tedious work.  
> >>>>>> The mediate ops looks quite similar to what vfio-mdev did. And it looks to
> >>>>>> me we need to consider live migration for mdev as well. In that case, do we
> >>>>>> still expect mediate ops through VFIO directly?
> >>>>>>
> >>>>>>     
> >>>>>>> (2) For features like dynamically trap/untrap pci bars, if they are in
> >>>>>>> vfio-pci, they can be available to most people without repeated code
> >>>>>>> copying and re-testing.
> >>>>>>> (3) with a 1:1 mdev driver which passthrough VFs most of the time, people
> >>>>>>> have to decide whether to bind VFs to vfio-pci or mdev parent driver before
> >>>>>>> it runs into a real migration need. However, if vfio-pci is bound
> >>>>>>> initially, they have no chance to do live migration when there's a need
> >>>>>>> later.  
> >>>>>> We can teach management layer to do this.
> >>>>>>     
> >>>>> No. not possible as vfio-pci by default has no migration region and
> >>>>> dirty page tracking needs vendor's mediation at least for most
> >>>>> passthrough devices now.  
> >>>> I'm not quite sure I get it here, but in this case, just teach them to use
> >>>> the driver that has migration support?
> >>>>     
> >>> That's a way, but as more and more passthrough devices have demands and
> >>> caps to do migration, will vfio-pci be used in future any more ?  
> >>
> >> This should not be a problem:
> >> - If we introduce a common mdev for vfio-pci, we can just bind that
> >> driver always  
> > There's too much of mdev that doesn't make sense for this usage model,
> > this is why Yi's proposed generic mdev PCI wrapper is only a sample
> > driver.  I think we do not want to introduce user confusion regarding
> > which driver to use and there are outstanding non-singleton group
> > issues with mdev that don't seem worthwhile to resolve.  
> 
> 
> I agree, but I think what users want is a unified driver that works for
> both SRIOV and mdev. That's why trying to have a common way for doing 
> mediation may make sense.

I don't think we can get to one driver, nor is it clear to me that we
should.  Direct assignment and mdev currently provide different usage
models.  Both are valid, both are useful.  That said, I don't
necessarily want a user to need to choose whether to bind a device to
vfio-pci for base functionality or vfio-pci-vendor-foo for extended
functionality either.  I think that's why we're exploring this
mediation approach and there's already precedent in vfio-pci for some
extent of device specific support.  It's still a question though
whether it can be done cleanly enough to make it worthwhile.  Thanks,

Alex
Jason Wang Dec. 18, 2019, 2:36 a.m. UTC | #12
On 2019/12/12 1:47 PM, Yan Zhao wrote:
> On Thu, Dec 12, 2019 at 11:48:25AM +0800, Jason Wang wrote:
>> On 2019/12/6 8:49 PM, Yan Zhao wrote:
>>> On Fri, Dec 06, 2019 at 05:40:02PM +0800, Jason Wang wrote:
>>>> On 2019/12/6 4:22 PM, Yan Zhao wrote:
>>>>> On Thu, Dec 05, 2019 at 09:05:54PM +0800, Jason Wang wrote:
>>>>>> On 2019/12/5 4:51 PM, Yan Zhao wrote:
>>>>>>> On Thu, Dec 05, 2019 at 02:33:19PM +0800, Jason Wang wrote:
>>>>>>>> Hi:
>>>>>>>>
>>>>>>>> On 2019/12/5 11:24 AM, Yan Zhao wrote:
>>>>>>>>> For SRIOV devices, VFs are passthroughed into guest directly without host
>>>>>>>>> driver mediation. However, when VMs migrating with passthroughed VFs,
>>>>>>>>> dynamic host mediation is required to  (1) get device states, (2) get
>>>>>>>>> dirty pages. Since device states as well as other critical information
>>>>>>>>> required for dirty page tracking for VFs are usually retrieved from PFs,
>>>>>>>>> it is handy to provide an extension in PF driver to centralizingly control
>>>>>>>>> VFs' migration.
>>>>>>>>>
>>>>>>>>> Therefore, in order to realize (1) passthrough VFs at normal time, (2)
>>>>>>>>> dynamically trap VFs' bars for dirty page tracking and
>>>>>>>> A silly question, what's the reason for doing this, is this a must for dirty
>>>>>>>> page tracking?
>>>>>>>>
>>>>>>> For performance reasons. VFs' BARs should be passed through in
>>>>>>> normal operation and only enter the trap state when needed.
>>>>>> Right, but how does this matter for the case of dirty page tracking?
>>>>>>
>>>>> Take a NIC as an example: to track its VF's dirty pages in software,
>>>>> every write to the ring tail that resides in BAR0 must be trapped.
>>>> Interesting, but it looks like we need:
>>>> - decode the instruction
>>>> - mediate all access to BAR0
>>>> All of which seems a great burden for the VF driver. I wonder whether or
>>>> not doing interrupt relay and tracking head is better in this case.
>>>>
>>> hi Jason
>>>
>>> not familiar with the way you mentioned. could you elaborate more?
>>
>> It looks to me that you want to intercept the bar that contains the
>> head. Then you can figure out the buffers submitted from driver and you
>> still need to decide a proper time to mark them as dirty.
>>
> It does not need to be accurate, right? Just a superset of the real dirty
> bitmap is enough.


If the superset is too large compared with the dirty pages, it will lead 
to a lot of side effects.
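
To make the superset concern concrete, here is a minimal sketch of the BAR-trap approach under discussion. Everything in it is an assumption for illustration: the ring layout, the `demo_*` names and the 4 KiB page size are hypothetical and are not taken from the i40e patches. On every trapped tail write it marks all pages covered by the newly posted buffers as dirty, whether or not the device ends up writing to them:

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_PAGE_SHIFT 12
#define DEMO_RING_SIZE  8u
#define DEMO_NR_PAGES   256u

/* Hypothetical receive descriptor: guest-physical address and length. */
struct demo_desc {
	uint64_t addr;
	uint32_t len;
};

struct demo_ring {
	struct demo_desc desc[DEMO_RING_SIZE];
	uint32_t tail;                      /* last tail value observed */
	uint64_t dirty[DEMO_NR_PAGES / 64]; /* one bit per guest page */
};

/* Mark every page touched by [addr, addr + len) as dirty. */
static void demo_mark_range(struct demo_ring *ring, uint64_t addr, uint32_t len)
{
	if (len == 0)
		return;
	uint64_t first = addr >> DEMO_PAGE_SHIFT;
	uint64_t last = (addr + len - 1) >> DEMO_PAGE_SHIFT;

	for (uint64_t pfn = first; pfn <= last && pfn < DEMO_NR_PAGES; pfn++)
		ring->dirty[pfn / 64] |= 1ULL << (pfn % 64);
}

/* Called when a trapped guest write to the ring tail register is seen. */
void demo_on_tail_write(struct demo_ring *ring, uint32_t new_tail)
{
	new_tail %= DEMO_RING_SIZE;
	/* Every buffer handed to the device since the last tail is a candidate. */
	for (uint32_t i = ring->tail; i != new_tail; i = (i + 1) % DEMO_RING_SIZE)
		demo_mark_range(ring, ring->desc[i].addr, ring->desc[i].len);
	ring->tail = new_tail;
}

/* Count the pages currently marked dirty. */
unsigned int demo_count_dirty(const struct demo_ring *ring)
{
	unsigned int n = 0;

	for (size_t w = 0; w < DEMO_NR_PAGES / 64; w++)
		for (unsigned int b = 0; b < 64; b++)
			n += (unsigned int)((ring->dirty[w] >> b) & 1);
	return n;
}
```

In this model a 4 KiB buffer that straddles a page boundary costs two dirty pages even if the device only writes a 64-byte packet into it, which is where the gap between the superset and the real dirty set comes from.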


>
>> What I meant is, intercept the interrupt, then you can still
>> figure out the buffers which have been modified by the device and mark
>> them as dirty.
>>
>> Then there's no need to trap BAR and do decoding/emulation etc.
>>
>> But it will still be tricky to be correct...
>>
> Intercepting the interrupt is a little hard if posted interrupts are enabled.


We don't care about the performance too much in this case. Can we simply 
disable it?


> I think what you are worried about here is the timing of marking dirty pages,
> right? Upon receiving the interrupt, you regard the DMAs as finished and it is
> safe to mark the pages dirty.
> But with the BAR trap way, we at least can keep those dirtied pages as dirty
> until the device stops. Of course we have other methods to optimize it.


I'm not sure this will not confuse the estimation of migration convergence time.


>
>>>>>     There's
>>>>> still no IOMMU Dirty bit available.
>>>>>>>>>       (3) centralizing
>>>>>>>>> VF critical states retrieving and VF controls into one driver, we propose
>>>>>>>>> to introduce mediate ops on top of current vfio-pci device driver.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>                                    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
>>>>>>>>>
>>>>>>>>>  __________   register mediate ops|  ___________     ___________    |
>>>>>>>>> |          |<-----------------------|     VF    |   |           |
>>>>>>>>> | vfio-pci |                      | |  mediate  |   | PF driver |   |
>>>>>>>>> |__________|----------------------->|   driver  |   |___________|
>>>>>>>>>      |            open(pdev)      |  -----------          |         |
>>>>>>>>>      |                                                    |
>>>>>>>>>      |                            |_ _ _ _ _ _ _ _ _ _ _ _|_ _ _ _ _|
>>>>>>>>>     \|/                                                  \|/
>>>>>>>>> -----------                                         ------------
>>>>>>>>> |    VF   |                                         |    PF    |
>>>>>>>>> -----------                                         ------------
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> VF mediate driver could be a standalone driver that does not bind to
>>>>>>>>> any devices (as in demo code in patches 5-6) or it could be a built-in
>>>>>>>>> extension of PF driver (as in patches 7-9) .
>>>>>>>>>
>>>>>>>>> Rather than directly bind to VF, VF mediate driver register a mediate
>>>>>>>>> ops into vfio-pci in driver init. vfio-pci maintains a list of such
>>>>>>>>> mediate ops.
>>>>>>>>> (Note that: VF mediate driver can register mediate ops into vfio-pci
>>>>>>>>> before vfio-pci binding to any devices. And VF mediate driver can
>>>>>>>>> support mediating multiple devices.)
>>>>>>>>>
>>>>>>>>> When opening a device (e.g. a VF), vfio-pci goes through the mediate ops
>>>>>>>>> list and calls each vfio_pci_mediate_ops->open() with pdev of the opening
>>>>>>>>> device as a parameter.
>>>>>>>>> VF mediate driver should return success or failure depending on whether it
>>>>>>>>> supports the pdev.
>>>>>>>>> E.g. VF mediate driver would compare its supported VF devfn with the
>>>>>>>>> devfn of the passed-in pdev.
>>>>>>>>> Once vfio-pci finds a successful vfio_pci_mediate_ops->open(), it will
>>>>>>>>> stop querying other mediate ops and bind the opening device with this
>>>>>>>>> mediate ops using the returned mediate handle.
>>>>>>>>>
>>>>>>>>> Further vfio-pci ops (VFIO_DEVICE_GET_REGION_INFO ioctl, rw, mmap) on the
>>>>>>>>> VF will be intercepted into VF mediate driver as
>>>>>>>>> vfio_pci_mediate_ops->get_region_info(),
>>>>>>>>> vfio_pci_mediate_ops->rw,
>>>>>>>>> vfio_pci_mediate_ops->mmap, and get customized.
>>>>>>>>> For vfio_pci_mediate_ops->rw and vfio_pci_mediate_ops->mmap, they will
>>>>>>>>> further return 'pt' to indicate whether vfio-pci should further
>>>>>>>>> passthrough data to hw.
>>>>>>>>>
>>>>>>>>> when vfio-pci closes the VF, it calls its vfio_pci_mediate_ops->release()
>>>>>>>>> with a mediate handle as parameter.
>>>>>>>>>
>>>>>>>>> The mediate handle returned from vfio_pci_mediate_ops->open() lets VF
>>>>>>>>> mediate driver be able to differentiate two opening VFs of the same device
>>>>>>>>> id and vendor id.
>>>>>>>>>
>>>>>>>>> When VF mediate driver exits, it unregisters its mediate ops from
>>>>>>>>> vfio-pci.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> In this patchset, we enable vfio-pci to provide 3 things:
>>>>>>>>> (1) calling mediate ops to allow vendor driver customizing default
>>>>>>>>> region info/rw/mmap of a region.
>>>>>>>>> (2) provide a migration region to support migration
>>>>>>>> What's the benefit of introducing a region? It looks to me we don't expect
>>>>>>>> the region to be accessed directly from guest. Could we simply extend device
>>>>>>>> fd ioctl for doing such things?
>>>>>>>>
>>>>>>> You may take a look at the mdev live migration discussions in
>>>>>>> https://lists.gnu.org/archive/html/qemu-devel/2019-11/msg01763.html
>>>>>>>
>>>>>>> or previous discussion at
>>>>>>> https://lists.gnu.org/archive/html/qemu-devel/2019-02/msg04908.html,
>>>>>>> which has kernel side implementation https://patchwork.freedesktop.org/series/56876/
>>>>>>>
>>>>>>> generally speaking, the qemu part of live migration is consistent for
>>>>>>> vfio-pci + mediate ops way or mdev way.
>>>>>> So in mdev, do you still have a mediate driver? Or you expect the parent
>>>>>> to implement the region?
>>>>>>
>>>>> No, currently it's only for vfio-pci.
>>>> And specific to PCI.
>>>>
>>>>> mdev parent driver is free to customize its regions and hence does not
>>>>> require these mediate ops hooks.
>>>>>
>>>>>>> The region is only a channel for
>>>>>>> QEMU and kernel to communicate information without introducing IOCTLs.
>>>>>> Well, at least you introduce new type of region in uapi. So this does
>>>>>> not answer why region is better than ioctl. If the region will only be
>>>>>> used by qemu, using ioctl is much easier and more straightforward.
>>>>>>
>>>>> It's not introduced by me :)
>>>>> mdev live migration is actually using this way, I'm just keeping
>>>>> compatible to the uapi.
>>>> I meant e.g VFIO_REGION_TYPE_MIGRATION.
>>>>
>>> here's the history of vfio live migration:
>>> https://lists.gnu.org/archive/html/qemu-devel/2017-06/msg05564.html
>>> https://lists.gnu.org/archive/html/qemu-devel/2019-02/msg04908.html
>>> https://lists.gnu.org/archive/html/qemu-devel/2019-11/msg01763.html
>>>
>>> If you have any concern of this region way, feel free to comment to the
>>> latest v9 patchset:
>>> https://lists.gnu.org/archive/html/qemu-devel/2019-11/msg01763.html
>>>
>>> The patchset here will always keep compatible to there.
>>
>> Sure.
>>
>>
>>>>>    From my own perspective, my answer is that a region is more flexible
>>>>> compared to ioctl. vendor driver can freely define the size,
>>>>>
>>>> Probably not since it's an ABI I think.
>>>>
>>> that's why I need to define VFIO_REGION_TYPE_MIGRATION here in this
>>> patchset, as it's not upstreamed yet.
>>> maybe I should make it into a prerequisite patch, indicating it is not
>>> introduced by this patchset
>>
>> Yes.
>>
>>
>>>>>     mmap cap of
>>>>> its data subregion.
>>>>>
>>>> It doesn't help much unless it can be mapped into guest (which I don't
>>>> think it was the case here).
>>>>
>>> It's accessed by host QEMU, the same way a Linux app accesses mmapped
>>> memory. The mmap here is to reduce memory copies from kernel to user.
>>> No need to get mapped into guest.
>>
>> But copy_to_user() is not a bad choice. If I read the code correctly
>> only the dirty bitmap was mmaped. This means you probably need to deal
>> with dcache carefully on some archs. [1]
>>
>> Note KVM doesn't use shared dirty bitmap, it uses copy_to_user().
>>
>> [1] https://lkml.org/lkml/2019/4/9/5
>>
> on those platforms, mmap can be safely disabled by vendor driver at will.


Then your driver needs to detect and behave differently according to the 
arch.


> Also, when mmap is disabled, copy_to_user() is also used in region way.
> Anyway, please raise your concern in Kirti's thread for this common part.


Well, if I read the code correctly, they are all invented in this 
series. Kirti's thread is just a user.


>
>>>>>     Also, there're already too many ioctls in vfio.
>>>> Probably not :) We have a bunch of subsystems that have many more
>>>> ioctls than VFIO (e.g. DRM).
>>>>
>>>>>>>>> (3) provide a dynamic trap bar info region to allow vendor driver
>>>>>>>>> control trap/untrap of device pci bars
>>>>>>>>>
>>>>>>>>> This vfio-pci + mediate ops way differs from mdev way in that
>>>>>>>>> (1) mdev way needs to create a 1:1 mdev device on top of one VF, device
>>>>>>>>> specific mdev parent driver is bound to VF directly.
>>>>>>>>> (2) vfio-pci + mediate ops way does not create mdev devices and VF
>>>>>>>>> mediate driver does not bind to VFs. Instead, vfio-pci binds to VFs.
>>>>>>>>>
>>>>>>>>> The reason why we don't choose the way of writing mdev parent driver is
>>>>>>>>> that
>>>>>>>>> (1) VFs are almost all the time directly passthroughed. Directly binding
>>>>>>>>> to vfio-pci can make most of the code shared/reused.
>>>>>>>> Can we split out the common parts from vfio-pci?
>>>>>>>>
>>>>>>> That's very attractive. but one cannot implement a vfio-pci except
>>>>>>> export everything in it as common part :)
>>>>>> Well, I think it should not be hard to do that. E.g. you can route it
>>>>>> back like:
>>>>>>
>>>>>> vfio -> vfio_mdev -> parent -> vfio_pci
>>>>>>
>>>>> it's desired for us to have mediate driver binding to PF device.
>>>>> so once a VF device is created, only PF driver and vfio-pci are
>>>>> required. Just the same as what needs to be done for a normal VF passthrough.
>>>>> otherwise, a separate parent driver binding to VF is required.
>>>>> Also, this parent driver has many drawbacks as I mentioned in this
>>>>> cover-letter.
>>>> Well, as discussed, no need to duplicate the code, bar trick should
>>>> still work. The main issues I see with this proposal are:
>>>>
>>>> 1) PCI specific, other bus may need something similar
>>> vfio-pci is only for PCI of course.
>>
>> I meant that if what is proposed here makes sense, other bus drivers like
>> vfio-platform may want something similar.
>>
> sure they can follow.
>>>> 2) Function duplicated with mdev and mdev can do even more
>>>>
>>> could you elaborate on how mdev can solve the problem described above?
>>
>> Well, I think both of us agree the mdev can do what mediate ops did,
>> the mdev device implementation just needs to add the direct PCI access part.
>>
>>
>>>>>>>>>       If we write a
>>>>>>>>> vendor specific mdev parent driver, most of the code (like passthrough
>>>>>>>>> style of rw/mmap) still needs to be copied from vfio-pci driver, which is
>>>>>>>>> actually a duplicated and tedious work.
>>>>>>>> The mediate ops looks quite similar to what vfio-mdev did. And it looks to
>>>>>>>> me we need to consider live migration for mdev as well. In that case, do we
>>>>>>>> still expect mediate ops through VFIO directly?
>>>>>>>>
>>>>>>>>
>>>>>>>>> (2) For features like dynamically trap/untrap pci bars, if they are in
>>>>>>>>> vfio-pci, they can be available to most people without repeated code
>>>>>>>>> copying and re-testing.
>>>>>>>>> (3) with a 1:1 mdev driver which passthrough VFs most of the time, people
>>>>>>>>> have to decide whether to bind VFs to vfio-pci or mdev parent driver before
>>>>>>>>> it runs into a real migration need. However, if vfio-pci is bound
>>>>>>>>> initially, they have no chance to do live migration when there's a need
>>>>>>>>> later.
>>>>>>>> We can teach management layer to do this.
>>>>>>>>
>>>>>>> No. not possible as vfio-pci by default has no migration region and
>>>>>>> dirty page tracking needs vendor's mediation at least for most
>>>>>>> passthrough devices now.
>>>>>> I'm not quite sure I get it here, but in this case, just teach them to use
>>>>>> the driver that has migration support?
>>>>>>
>>>>> That's a way, but as more and more passthrough devices have demands and
>>>>> caps to do migration, will vfio-pci be used in future any more ?
>>>> This should not be a problem:
>>>> - If we introduce a common mdev for vfio-pci, we can just bind that
>>>> driver always
>>> what is common mdev for vfio-pci? a common mdev parent driver that has
>>> the same implementation as vfio-pci?
>>
>> The common part is not PCI of course. The common part is that both mdev
>> and mediate ops want to do some kind of mediation. Mdev is bus agnostic,
>> but what you propose here is PCI specific, though it should be bus agnostic
>> as well. Assuming we implement a bus agnostic mediate ops, mdev could even
>> be built on top.
>>
> I believe Alex has already replied to the above better than me.
>>> There's actually already a solution of creating only one mdev on top
>>> of each passthrough device, and make mdev share the same iommu group
>>> with it. We've also made an implementation on it already. here's a
>>> sample one made by Yi at https://patchwork.kernel.org/cover/11134695/.
>>>
>>> But, as I said, it's desired to re-use vfio-pci directly for SRIOV,
>>> which is straightforward :)
>>
>> Can we have a device that is capable of both SRIOV and function slicing?
>> If yes, does it mean you need to provide two drivers? One for mdev,
>> another for mediate ops?
>>
> what do you mean by "function slicing"? SIOV?


SIOV could be one of the solutions.


> For vendor driver, in SRIOV
> - with mdev approach, two drivers required: one for mdev parent driver on
> VF, one for PF driver.
> - with mediate ops + vfio-pci: one driver on PF.
>
> in SIOV, only one driver on PF in both case.


The point is, e.g. if you have a card that supports both SRIOV and SIOV 
(I don't think you will ship a card with SIOV only), then when SRIOV is 
used, you need to go for mediate ops, and when SIOV is used, you need to go 
for e.g. mdev. It means you need to prepare two sets of drivers for migration.


>
>
>>>> - The most straightforward way to support dirty page tracking is done by
>>>> IOMMU instead of device specific operations.
>>>>
>>> No such IOMMU yet. And all kinds of platforms should be taken care of, right?
>>
>> Or the device can track dirty pages by itself, otherwise it would be
>> very hard to implement dirty page tracking correctly without the help of
>> switching to software datapath (or maybe you can post the part of BAR0
> I think you mixed "correct" and "accurate".


Nope, it depends on how to define correct. We need to try our best to be 
"accurate" instead of just a superset of dirty pages.


> DMA pre-inspection is a long existing term and we have implemented and
> verified it in NIC for both precopy and postcopy case.


Interesting, for postcopy, you probably need to mediate the whole datapath 
I guess, but again there's no code to demonstrate how it works in this 
series.


> Though I can't promise
> there's 100% no bug, the method is right.


I fully believe mediation is correct.


>
> Also, whether to trap BARs for dirty page is vendor specific and is not
> what should be cared about from this interface part.


The problem is that you are introducing a generic interface, so it needs to be 
proven useful for other devices. (E.g. for virtio migration, it 
doesn't help.)


>
>> mediation and dirty page tracking which is missed in this series?)
>>
> Currently, that part of code is owned by shaopeng's team. The code I
> posted is only for demonstrating how to use the interface. Shaopeng's
> team is responsible for upstreaming their part on their own schedule.


It would be very hard to evaluate a framework without any real users. If 
possible, please post the driver code in the next version.

Thanks


>
> Thanks
> Yan
>
>>>>>>>>> In this patchset,
>>>>>>>>> - patches 1-4 enable vfio-pci to call mediate ops registered by vendor
>>>>>>>>>        driver to mediate/customize region info/rw/mmap.
>>>>>>>>>
>>>>>>>>> - patches 5-6 provide a standalone sample driver to register a mediate ops
>>>>>>>>>        for Intel Graphics Devices. It does not bind to IGDs directly but decides
>>>>>>>>>        what devices it supports via its pciidlist. It also demonstrates how to
>>>>>>>>>        dynamically trap a device's PCI bars. (by adding more pciids in its
>>>>>>>>>        pciidlist, this sample driver actually is not necessarily limited to
>>>>>>>>>        support IGDs)
>>>>>>>>>
>>>>>>>>> - patches 7-9 provide a sample on i40e driver that supports Intel(R)
>>>>>>>>>        Ethernet Controller XL710 Family of devices. It supports VF precopy live
>>>>>>>>>        migration on Intel's 710 SRIOV. (but we commented out the real
>>>>>>>>>        implementation of dirty page tracking and device state retrieving part
>>>>>>>>>        to focus on demonstrating framework part. Will send out them in future
>>>>>>>>>        versions)
>>>>>>>>>        patch 7 registers/unregisters VF mediate ops when PF driver
>>>>>>>>>        probes/removes. It specifies its supporting VFs via
>>>>>>>>>        vfio_pci_mediate_ops->open(pdev)
>>>>>>>>>
>>>>>>>>>        patch 8 reports device cap of VFIO_PCI_DEVICE_CAP_MIGRATION and
>>>>>>>>>        provides a sample implementation of migration region.
>>>>>>>>>        The QEMU part of vfio migration is based on v8
>>>>>>>>>        https://lists.gnu.org/archive/html/qemu-devel/2019-08/msg05542.html.
>>>>>>>>>        We are not based on the recent v9 because we think there are still open
>>>>>>>>>        issues in the dirty page tracking part of that series.
>>>>>>>>>
>>>>>>>>>        patch 9 reports device cap of VFIO_PCI_DEVICE_CAP_DYNAMIC_TRAP_BAR and
>>>>>>>>>        provides an example on how to trap part of bar0 when migration starts
>>>>>>>>>        and passthrough this part of bar0 again when migration fails.
>>>>>>>>>
>>>>>>>>> Yan Zhao (9):
>>>>>>>>>        vfio/pci: introduce mediate ops to intercept vfio-pci ops
>>>>>>>>>        vfio/pci: test existence before calling region->ops
>>>>>>>>>        vfio/pci: register a default migration region
>>>>>>>>>        vfio-pci: register default dynamic-trap-bar-info region
>>>>>>>>>        samples/vfio-pci/igd_dt: sample driver to mediate a passthrough IGD
>>>>>>>>>        sample/vfio-pci/igd_dt: dynamically trap/untrap subregion of IGD bar0
>>>>>>>>>        i40e/vf_migration: register mediate_ops to vfio-pci
>>>>>>>>>        i40e/vf_migration: mediate migration region
>>>>>>>>>        i40e/vf_migration: support dynamic trap of bar0
>>>>>>>>>
>>>>>>>>>       drivers/net/ethernet/intel/Kconfig            |   2 +-
>>>>>>>>>       drivers/net/ethernet/intel/i40e/Makefile      |   3 +-
>>>>>>>>>       drivers/net/ethernet/intel/i40e/i40e.h        |   2 +
>>>>>>>>>       drivers/net/ethernet/intel/i40e/i40e_main.c   |   3 +
>>>>>>>>>       .../ethernet/intel/i40e/i40e_vf_migration.c   | 626 ++++++++++++++++++
>>>>>>>>>       .../ethernet/intel/i40e/i40e_vf_migration.h   |  78 +++
>>>>>>>>>       drivers/vfio/pci/vfio_pci.c                   | 189 +++++-
>>>>>>>>>       drivers/vfio/pci/vfio_pci_private.h           |   2 +
>>>>>>>>>       include/linux/vfio.h                          |  18 +
>>>>>>>>>       include/uapi/linux/vfio.h                     | 160 +++++
>>>>>>>>>       samples/Kconfig                               |   6 +
>>>>>>>>>       samples/Makefile                              |   1 +
>>>>>>>>>       samples/vfio-pci/Makefile                     |   2 +
>>>>>>>>>       samples/vfio-pci/igd_dt.c                     | 367 ++++++++++
>>>>>>>>>       14 files changed, 1455 insertions(+), 4 deletions(-)
>>>>>>>>>       create mode 100644 drivers/net/ethernet/intel/i40e/i40e_vf_migration.c
>>>>>>>>>       create mode 100644 drivers/net/ethernet/intel/i40e/i40e_vf_migration.h
>>>>>>>>>       create mode 100644 samples/vfio-pci/Makefile
>>>>>>>>>       create mode 100644 samples/vfio-pci/igd_dt.c
>>>>>>>>>
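
The registration and open() dispatch described in the quoted cover letter can be sketched in plain user-space C as follows. This is only a model under stated assumptions: every `demo_*` name, the callback signatures, the fixed-size list and the two sample drivers are hypothetical stand-ins, not the interface from the patches. It shows the "first successful open() wins" rule and the fallback to plain passthrough when no mediate driver claims the device:

```c
#include <stddef.h>

/* Hypothetical stand-in for a PCI device; only devfn matters here. */
struct demo_pdev {
	unsigned int devfn;
};

/* Hypothetical model of a mediate ops table. */
struct demo_mediate_ops {
	const char *name;
	/* Return 0 and fill *handle if pdev is supported, negative otherwise. */
	int (*open)(const struct demo_pdev *pdev, unsigned long *handle);
	void (*release)(unsigned long handle);
};

#define DEMO_MAX_OPS 4
static const struct demo_mediate_ops *demo_ops_list[DEMO_MAX_OPS];
static size_t demo_ops_count;

/* Register an ops table; may happen before any device is bound. */
int demo_register_mediate_ops(const struct demo_mediate_ops *ops)
{
	if (demo_ops_count == DEMO_MAX_OPS)
		return -1;
	demo_ops_list[demo_ops_count++] = ops;
	return 0;
}

/*
 * On device open, query each registered ops in turn and stop at the
 * first one whose open() succeeds. NULL means no mediation, i.e. the
 * device stays a plain passthrough device.
 */
const struct demo_mediate_ops *demo_open_device(const struct demo_pdev *pdev,
						unsigned long *handle)
{
	for (size_t i = 0; i < demo_ops_count; i++)
		if (demo_ops_list[i]->open(pdev, handle) == 0)
			return demo_ops_list[i];
	return NULL;
}

/* Two sample mediate drivers, each claiming a single devfn. */
static int demo_open_a(const struct demo_pdev *pdev, unsigned long *handle)
{
	if (pdev->devfn != 0x10)
		return -1;
	*handle = 0xA0;
	return 0;
}

static int demo_open_b(const struct demo_pdev *pdev, unsigned long *handle)
{
	if (pdev->devfn != 0x18)
		return -1;
	*handle = 0xB0;
	return 0;
}

static void demo_release(unsigned long handle)
{
	(void)handle; /* nothing to free in this model */
}

const struct demo_mediate_ops demo_driver_a = { "demo-a", demo_open_a, demo_release };
const struct demo_mediate_ops demo_driver_b = { "demo-b", demo_open_b, demo_release };
```

After registering both sample drivers, opening a device with devfn 0x18 skips `demo_driver_a`, binds to `demo_driver_b` and returns its handle, while a device with an unclaimed devfn gets NULL and would be handled without mediation.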