[vfio,00/11] Introduce a vfio driver over virtio devices

Message ID 20230921124040.145386-1-yishaih@nvidia.com
Series Introduce a vfio driver over virtio devices

Message

Yishai Hadas Sept. 21, 2023, 12:40 p.m. UTC
This series introduces a vfio driver over virtio devices to support the
legacy interface functionality for VFs.

Background, from the virtio spec [1].
--------------------------------------------------------------------
In some systems, there is a need to support a virtio legacy driver with
a device that does not directly support the legacy interface. In such
scenarios, a group owner device can provide the legacy interface
functionality for the group member devices. The driver of the owner
device can then access the legacy interface of a member device on behalf
of the legacy member device driver.

For example, with the SR-IOV group type, group members (VFs) can not
present the legacy interface in an I/O BAR in BAR0 as expected by the
legacy pci driver. If the legacy driver is running inside a virtual
machine, the hypervisor executing the virtual machine can present a
virtual device with an I/O BAR in BAR0. The hypervisor intercepts the
legacy driver accesses to this I/O BAR and forwards them to the group
owner device (PF) using group administration commands.
--------------------------------------------------------------------

The first 7 patches are in the virtio area and cover the following:
- Introduce the admin virtqueue infrastructure.
- Expose APIs to enable upper layers such as vfio, net, etc.
  to execute admin commands.
- Expose the layout of the commands that should be used for
  supporting the legacy access.

The above follows the virtio spec changes that were recently accepted in
this area [1].
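
To give a feel for the API shape, below is a rough sketch of how a caller
such as the vfio driver could issue a single legacy common-config write on
behalf of a VF. Only virtio_pci_vf_get_pf_dev(), pci_iov_vf_id() and the
opcode are taken from the patches; the virtio_admin_cmd field names, the
VIRTIO_ADMIN_GROUP_TYPE_SRIOV constant and the virtio_admin_cmd_exec()
helper name are assumptions based on the spec and on the cmd.c hunk quoted
later in the thread, not verbatim kernel code.

/* Hedged sketch only -- names marked "assumed" are not from the patches. */
static int legacy_common_cfg_write_sketch(struct pci_dev *vf_pdev,
					  u8 offset, u8 size, const u8 *val)
{
	struct virtio_device *pf_dev = virtio_pci_vf_get_pf_dev(vf_pdev);
	struct virtio_admin_cmd_legacy_wr_data *data; /* layout per the spec */
	struct virtio_admin_cmd cmd = {};
	struct scatterlist data_sg;
	int ret;

	if (!pf_dev)
		return -ENODEV;

	data = kzalloc(sizeof(*data) + size, GFP_KERNEL);
	if (!data)
		return -ENOMEM;

	data->offset = offset;
	memcpy(data->registers, val, size);
	sg_init_one(&data_sg, data, sizeof(*data) + size);

	cmd.opcode = cpu_to_le16(VIRTIO_ADMIN_CMD_LEGACY_COMMON_CFG_WRITE);
	cmd.group_type = cpu_to_le16(VIRTIO_ADMIN_GROUP_TYPE_SRIOV); /* assumed name */
	/* SR-IOV group member ids start at 1 for VF1, per the spec */
	cmd.group_member_id = cpu_to_le64(pci_iov_vf_id(vf_pdev) + 1);
	cmd.data_sg = &data_sg;

	ret = virtio_admin_cmd_exec(pf_dev, &cmd); /* assumed helper name */

	kfree(data);
	return ret;
}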

The last 4 patches are in the vfio area and cover the following:
- Expose some APIs from vfio/pci to be used by the vfio/virtio driver.
- Expose admin commands over the virtio device.
- Introduce a vfio driver over virtio devices to support the legacy
  interface functionality for VFs.

The series was tested successfully with virtio-net VFs in the host,
running both modern and legacy drivers in the guest.
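
For readers less familiar with the legacy interface, the register layout
that the emulated BAR 0 has to cover is summarized below using the existing
constants from include/uapi/linux/virtio_pci.h (values per the legacy
section of the virtio spec). Per the main.c hunk quoted later in the
thread, only the queue notify register is forwarded to the device's real
notify area; all other accesses are translated into legacy admin commands.

VIRTIO_PCI_HOST_FEATURES    0  /* 32-bit RO: device features */
VIRTIO_PCI_GUEST_FEATURES   4  /* 32-bit RW: features acked by the driver */
VIRTIO_PCI_QUEUE_PFN        8  /* 32-bit RW: queue page frame number */
VIRTIO_PCI_QUEUE_NUM       12  /* 16-bit RO: queue size */
VIRTIO_PCI_QUEUE_SEL       14  /* 16-bit RW: queue selector */
VIRTIO_PCI_QUEUE_NOTIFY    16  /* 16-bit RW: forwarded to the notify BAR */
VIRTIO_PCI_STATUS          18  /*  8-bit RW: device status */
VIRTIO_PCI_ISR             19  /*  8-bit RO: ISR status */
/* Present only when MSI-X is enabled: */
VIRTIO_MSI_CONFIG_VECTOR   20  /* 16-bit RW */
VIRTIO_MSI_QUEUE_VECTOR    22  /* 16-bit RW */
/* Followed by the device-specific config, virtio_net_config here */

This matches the VIRTIO_LEGACY_IO_BAR_HEADER_LEN and
VIRTIO_LEGACY_IO_BAR_MSIX_HEADER_LEN sizing used by the driver.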

[1]
https://github.com/oasis-tcs/virtio-spec/commit/03c2d32e5093ca9f2a17797242fbef88efe94b8c

Yishai

Feng Liu (7):
  virtio-pci: Use virtio pci device layer vq info instead of generic one
  virtio: Define feature bit for administration virtqueue
  virtio-pci: Introduce admin virtqueue
  virtio: Expose the synchronous command helper function
  virtio-pci: Introduce admin command sending function
  virtio-pci: Introduce API to get PF virtio device from VF PCI device
  virtio-pci: Introduce admin commands

Yishai Hadas (4):
  vfio/pci: Expose vfio_pci_core_setup_barmap()
  vfio/pci: Expose vfio_pci_iowrite/read##size()
  vfio/virtio: Expose admin commands over virtio device
  vfio/virtio: Introduce a vfio driver over virtio devices

 MAINTAINERS                            |   6 +
 drivers/net/virtio_net.c               |  21 +-
 drivers/vfio/pci/Kconfig               |   2 +
 drivers/vfio/pci/Makefile              |   2 +
 drivers/vfio/pci/vfio_pci_core.c       |  25 ++
 drivers/vfio/pci/vfio_pci_rdwr.c       |  38 +-
 drivers/vfio/pci/virtio/Kconfig        |  15 +
 drivers/vfio/pci/virtio/Makefile       |   4 +
 drivers/vfio/pci/virtio/cmd.c          | 146 +++++++
 drivers/vfio/pci/virtio/cmd.h          |  35 ++
 drivers/vfio/pci/virtio/main.c         | 546 +++++++++++++++++++++++++
 drivers/virtio/Makefile                |   2 +-
 drivers/virtio/virtio.c                |  44 +-
 drivers/virtio/virtio_pci_common.c     |  24 +-
 drivers/virtio/virtio_pci_common.h     |  17 +-
 drivers/virtio/virtio_pci_modern.c     |  12 +-
 drivers/virtio/virtio_pci_modern_avq.c | 138 +++++++
 drivers/virtio/virtio_ring.c           |  27 ++
 include/linux/vfio_pci_core.h          |  20 +
 include/linux/virtio.h                 |  19 +
 include/linux/virtio_config.h          |   7 +
 include/linux/virtio_pci_modern.h      |   3 +
 include/uapi/linux/virtio_config.h     |   8 +-
 include/uapi/linux/virtio_pci.h        |  66 +++
 24 files changed, 1171 insertions(+), 56 deletions(-)
 create mode 100644 drivers/vfio/pci/virtio/Kconfig
 create mode 100644 drivers/vfio/pci/virtio/Makefile
 create mode 100644 drivers/vfio/pci/virtio/cmd.c
 create mode 100644 drivers/vfio/pci/virtio/cmd.h
 create mode 100644 drivers/vfio/pci/virtio/main.c
 create mode 100644 drivers/virtio/virtio_pci_modern_avq.c

Comments

Michael S. Tsirkin Sept. 21, 2023, 1:57 p.m. UTC | #1
On Thu, Sep 21, 2023 at 03:40:32PM +0300, Yishai Hadas wrote:
> From: Feng Liu <feliu@nvidia.com>
> 
> Introduce support for the admin virtqueue. By negotiating the
> VIRTIO_F_ADMIN_VQ feature, the driver detects the capability and creates
> one administration virtqueue. Implementing the administration virtqueue
> in the virtio pci generic layer enables multiple types of upper layer
> drivers, such as vfio, net and blk, to utilize it.
> 
> Signed-off-by: Feng Liu <feliu@nvidia.com>
> Reviewed-by: Parav Pandit <parav@nvidia.com>
> Reviewed-by: Jiri Pirko <jiri@nvidia.com>
> Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
> ---
>  drivers/virtio/Makefile                |  2 +-
>  drivers/virtio/virtio.c                | 37 +++++++++++++--
>  drivers/virtio/virtio_pci_common.h     | 15 +++++-
>  drivers/virtio/virtio_pci_modern.c     | 10 +++-
>  drivers/virtio/virtio_pci_modern_avq.c | 65 ++++++++++++++++++++++++++

if you have a .c file without a .h file you know there's something
fishy. Just add this inside drivers/virtio/virtio_pci_modern.c ?

>  include/linux/virtio_config.h          |  4 ++
>  include/linux/virtio_pci_modern.h      |  3 ++
>  7 files changed, 129 insertions(+), 7 deletions(-)
>  create mode 100644 drivers/virtio/virtio_pci_modern_avq.c
> 
> diff --git a/drivers/virtio/Makefile b/drivers/virtio/Makefile
> index 8e98d24917cc..dcc535b5b4d9 100644
> --- a/drivers/virtio/Makefile
> +++ b/drivers/virtio/Makefile
> @@ -5,7 +5,7 @@ obj-$(CONFIG_VIRTIO_PCI_LIB) += virtio_pci_modern_dev.o
>  obj-$(CONFIG_VIRTIO_PCI_LIB_LEGACY) += virtio_pci_legacy_dev.o
>  obj-$(CONFIG_VIRTIO_MMIO) += virtio_mmio.o
>  obj-$(CONFIG_VIRTIO_PCI) += virtio_pci.o
> -virtio_pci-y := virtio_pci_modern.o virtio_pci_common.o
> +virtio_pci-y := virtio_pci_modern.o virtio_pci_common.o virtio_pci_modern_avq.o
>  virtio_pci-$(CONFIG_VIRTIO_PCI_LEGACY) += virtio_pci_legacy.o
>  obj-$(CONFIG_VIRTIO_BALLOON) += virtio_balloon.o
>  obj-$(CONFIG_VIRTIO_INPUT) += virtio_input.o
> diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c
> index 3893dc29eb26..f4080692b351 100644
> --- a/drivers/virtio/virtio.c
> +++ b/drivers/virtio/virtio.c
> @@ -302,9 +302,15 @@ static int virtio_dev_probe(struct device *_d)
>  	if (err)
>  		goto err;
>  
> +	if (dev->config->create_avq) {
> +		err = dev->config->create_avq(dev);
> +		if (err)
> +			goto err;
> +	}
> +
>  	err = drv->probe(dev);
>  	if (err)
> -		goto err;
> +		goto err_probe;
>  
>  	/* If probe didn't do it, mark device DRIVER_OK ourselves. */
>  	if (!(dev->config->get_status(dev) & VIRTIO_CONFIG_S_DRIVER_OK))
> @@ -316,6 +322,10 @@ static int virtio_dev_probe(struct device *_d)
>  	virtio_config_enable(dev);
>  
>  	return 0;
> +
> +err_probe:
> +	if (dev->config->destroy_avq)
> +		dev->config->destroy_avq(dev);
>  err:
>  	virtio_add_status(dev, VIRTIO_CONFIG_S_FAILED);
>  	return err;
> @@ -331,6 +341,9 @@ static void virtio_dev_remove(struct device *_d)
>  
>  	drv->remove(dev);
>  
> +	if (dev->config->destroy_avq)
> +		dev->config->destroy_avq(dev);
> +
>  	/* Driver should have reset device. */
>  	WARN_ON_ONCE(dev->config->get_status(dev));
>  
> @@ -489,13 +502,20 @@ EXPORT_SYMBOL_GPL(unregister_virtio_device);
>  int virtio_device_freeze(struct virtio_device *dev)
>  {
>  	struct virtio_driver *drv = drv_to_virtio(dev->dev.driver);
> +	int ret;
>  
>  	virtio_config_disable(dev);
>  
>  	dev->failed = dev->config->get_status(dev) & VIRTIO_CONFIG_S_FAILED;
>  
> -	if (drv && drv->freeze)
> -		return drv->freeze(dev);
> +	if (drv && drv->freeze) {
> +		ret = drv->freeze(dev);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	if (dev->config->destroy_avq)
> +		dev->config->destroy_avq(dev);
>  
>  	return 0;
>  }
> @@ -532,10 +552,16 @@ int virtio_device_restore(struct virtio_device *dev)
>  	if (ret)
>  		goto err;
>  
> +	if (dev->config->create_avq) {
> +		ret = dev->config->create_avq(dev);
> +		if (ret)
> +			goto err;
> +	}
> +
>  	if (drv->restore) {
>  		ret = drv->restore(dev);
>  		if (ret)
> -			goto err;
> +			goto err_restore;
>  	}
>  
>  	/* If restore didn't do it, mark device DRIVER_OK ourselves. */
> @@ -546,6 +572,9 @@ int virtio_device_restore(struct virtio_device *dev)
>  
>  	return 0;
>  
> +err_restore:
> +	if (dev->config->destroy_avq)
> +		dev->config->destroy_avq(dev);
>  err:
>  	virtio_add_status(dev, VIRTIO_CONFIG_S_FAILED);
>  	return ret;
> diff --git a/drivers/virtio/virtio_pci_common.h b/drivers/virtio/virtio_pci_common.h
> index 602021967aaa..9bffa95274b6 100644
> --- a/drivers/virtio/virtio_pci_common.h
> +++ b/drivers/virtio/virtio_pci_common.h
> @@ -41,6 +41,14 @@ struct virtio_pci_vq_info {
>  	unsigned int msix_vector;
>  };
>  
> +struct virtio_avq {

admin_vq would be better. and this is pci specific yes? so virtio_pci_

> +	/* Virtqueue info associated with this admin queue. */
> +	struct virtio_pci_vq_info info;
> +	/* Name of the admin queue: avq.$index. */
> +	char name[10];
> +	u16 vq_index;
> +};
> +
>  /* Our device structure */
>  struct virtio_pci_device {
>  	struct virtio_device vdev;
> @@ -58,10 +66,13 @@ struct virtio_pci_device {
>  	spinlock_t lock;
>  	struct list_head virtqueues;
>  
> -	/* array of all queues for house-keeping */
> +	/* Array of all virtqueues reported in the
> +	 * PCI common config num_queues field
> +	 */
>  	struct virtio_pci_vq_info **vqs;
>  	u32 nvqs;
>  
> +	struct virtio_avq *admin;

and this could conceivably be admin_vq.

>  	/* MSI-X support */
>  	int msix_enabled;
>  	int intx_enabled;
> @@ -115,6 +126,8 @@ int vp_find_vqs(struct virtio_device *vdev, unsigned int nvqs,
>  		const char * const names[], const bool *ctx,
>  		struct irq_affinity *desc);
>  const char *vp_bus_name(struct virtio_device *vdev);
> +void vp_destroy_avq(struct virtio_device *vdev);
> +int vp_create_avq(struct virtio_device *vdev);
>  
>  /* Setup the affinity for a virtqueue:
>   * - force the affinity for per vq vector
> diff --git a/drivers/virtio/virtio_pci_modern.c b/drivers/virtio/virtio_pci_modern.c
> index d6bb68ba84e5..a72c87687196 100644
> --- a/drivers/virtio/virtio_pci_modern.c
> +++ b/drivers/virtio/virtio_pci_modern.c
> @@ -37,6 +37,9 @@ static void vp_transport_features(struct virtio_device *vdev, u64 features)
>  
>  	if (features & BIT_ULL(VIRTIO_F_RING_RESET))
>  		__virtio_set_bit(vdev, VIRTIO_F_RING_RESET);
> +
> +	if (features & BIT_ULL(VIRTIO_F_ADMIN_VQ))
> +		__virtio_set_bit(vdev, VIRTIO_F_ADMIN_VQ);
>  }
>  
>  /* virtio config->finalize_features() implementation */
> @@ -317,7 +320,8 @@ static struct virtqueue *setup_vq(struct virtio_pci_device *vp_dev,
>  	else
>  		notify = vp_notify;
>  
> -	if (index >= vp_modern_get_num_queues(mdev))
> +	if (!((index < vp_modern_get_num_queues(mdev) ||
> +	      (vp_dev->admin && vp_dev->admin->vq_index == index))))
>  		return ERR_PTR(-EINVAL);
>  
>  	/* Check if queue is either not available or already active. */
> @@ -509,6 +513,8 @@ static const struct virtio_config_ops virtio_pci_config_nodev_ops = {
>  	.get_shm_region  = vp_get_shm_region,
>  	.disable_vq_and_reset = vp_modern_disable_vq_and_reset,
>  	.enable_vq_after_reset = vp_modern_enable_vq_after_reset,
> +	.create_avq = vp_create_avq,
> +	.destroy_avq = vp_destroy_avq,
>  };
>  
>  static const struct virtio_config_ops virtio_pci_config_ops = {
> @@ -529,6 +535,8 @@ static const struct virtio_config_ops virtio_pci_config_ops = {
>  	.get_shm_region  = vp_get_shm_region,
>  	.disable_vq_and_reset = vp_modern_disable_vq_and_reset,
>  	.enable_vq_after_reset = vp_modern_enable_vq_after_reset,
> +	.create_avq = vp_create_avq,
> +	.destroy_avq = vp_destroy_avq,
>  };
>  
>  /* the PCI probing function */
> diff --git a/drivers/virtio/virtio_pci_modern_avq.c b/drivers/virtio/virtio_pci_modern_avq.c
> new file mode 100644
> index 000000000000..114579ad788f
> --- /dev/null
> +++ b/drivers/virtio/virtio_pci_modern_avq.c
> @@ -0,0 +1,65 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +
> +#include <linux/virtio.h>
> +#include "virtio_pci_common.h"
> +
> +static u16 vp_modern_avq_num(struct virtio_pci_modern_device *mdev)
> +{
> +	struct virtio_pci_modern_common_cfg __iomem *cfg;
> +
> +	cfg = (struct virtio_pci_modern_common_cfg __iomem *)mdev->common;
> +	return vp_ioread16(&cfg->admin_queue_num);
> +}
> +
> +static u16 vp_modern_avq_index(struct virtio_pci_modern_device *mdev)
> +{
> +	struct virtio_pci_modern_common_cfg __iomem *cfg;
> +
> +	cfg = (struct virtio_pci_modern_common_cfg __iomem *)mdev->common;
> +	return vp_ioread16(&cfg->admin_queue_index);
> +}
> +
> +int vp_create_avq(struct virtio_device *vdev)
> +{
> +	struct virtio_pci_device *vp_dev = to_vp_device(vdev);
> +	struct virtio_avq *avq;
> +	struct virtqueue *vq;
> +	u16 admin_q_num;
> +
> +	if (!virtio_has_feature(vdev, VIRTIO_F_ADMIN_VQ))
> +		return 0;
> +
> +	admin_q_num = vp_modern_avq_num(&vp_dev->mdev);
> +	if (!admin_q_num)
> +		return -EINVAL;
> +
> +	vp_dev->admin = kzalloc(sizeof(*vp_dev->admin), GFP_KERNEL);
> +	if (!vp_dev->admin)
> +		return -ENOMEM;
> +
> +	avq = vp_dev->admin;
> +	avq->vq_index = vp_modern_avq_index(&vp_dev->mdev);
> +	sprintf(avq->name, "avq.%u", avq->vq_index);
> +	vq = vp_dev->setup_vq(vp_dev, &vp_dev->admin->info, avq->vq_index, NULL,
> +			      avq->name, NULL, VIRTIO_MSI_NO_VECTOR);
> +	if (IS_ERR(vq)) {
> +		dev_err(&vdev->dev, "failed to setup admin virtqueue");
> +		kfree(vp_dev->admin);
> +		return PTR_ERR(vq);
> +	}
> +
> +	vp_dev->admin->info.vq = vq;
> +	vp_modern_set_queue_enable(&vp_dev->mdev, avq->info.vq->index, true);
> +	return 0;
> +}
> +
> +void vp_destroy_avq(struct virtio_device *vdev)
> +{
> +	struct virtio_pci_device *vp_dev = to_vp_device(vdev);
> +
> +	if (!vp_dev->admin)
> +		return;
> +
> +	vp_dev->del_vq(&vp_dev->admin->info);
> +	kfree(vp_dev->admin);
> +}
> diff --git a/include/linux/virtio_config.h b/include/linux/virtio_config.h
> index 2b3438de2c4d..028c51ea90ee 100644
> --- a/include/linux/virtio_config.h
> +++ b/include/linux/virtio_config.h
> @@ -93,6 +93,8 @@ typedef void vq_callback_t(struct virtqueue *);
>   *	Returns 0 on success or error status
>   *	If disable_vq_and_reset is set, then enable_vq_after_reset must also be
>   *	set.
> + * @create_avq: initialize admin virtqueue resource.
> + * @destroy_avq: destroy admin virtqueue resource.
>   */
>  struct virtio_config_ops {
>  	void (*get)(struct virtio_device *vdev, unsigned offset,
> @@ -120,6 +122,8 @@ struct virtio_config_ops {
>  			       struct virtio_shm_region *region, u8 id);
>  	int (*disable_vq_and_reset)(struct virtqueue *vq);
>  	int (*enable_vq_after_reset)(struct virtqueue *vq);
> +	int (*create_avq)(struct virtio_device *vdev);
> +	void (*destroy_avq)(struct virtio_device *vdev);
>  };
>  
>  /* If driver didn't advertise the feature, it will never appear. */
> diff --git a/include/linux/virtio_pci_modern.h b/include/linux/virtio_pci_modern.h
> index 067ac1d789bc..f6cb13d858fd 100644
> --- a/include/linux/virtio_pci_modern.h
> +++ b/include/linux/virtio_pci_modern.h
> @@ -10,6 +10,9 @@ struct virtio_pci_modern_common_cfg {
>  
>  	__le16 queue_notify_data;	/* read-write */
>  	__le16 queue_reset;		/* read-write */
> +
> +	__le16 admin_queue_index;	/* read-only */
> +	__le16 admin_queue_num;		/* read-only */
>  };


ouch.
actually there's a problem

        mdev->common = vp_modern_map_capability(mdev, common,
                                      sizeof(struct virtio_pci_common_cfg), 4,
                                      0, sizeof(struct virtio_pci_common_cfg),
                                      NULL, NULL);

extending this structure means some calls will start failing on
existing devices.

even more of an ouch, when we added queue_notify_data and queue_reset we
also possibly broke some devices. well hopefully not since no one
reported failures but we really need to fix that.
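
One conceivable mitigation, sketched below: keep requiring only the
original virtio_pci_common_cfg size when mapping the capability, record how
much of the capability was actually mapped, and have the avq helpers bail
out when the new fields are not covered. The common_len field and the use
of the length-out argument of vp_modern_map_capability() are illustrative
assumptions, not something this series does.

/* In vp_modern_probe(), hypothetically: */
mdev->common = vp_modern_map_capability(mdev, common,
			      sizeof(struct virtio_pci_common_cfg), 4,
			      0, sizeof(struct virtio_pci_modern_common_cfg),
			      &mdev->common_len /* assumed field */, NULL);

static u16 vp_modern_avq_num(struct virtio_pci_modern_device *mdev)
{
	struct virtio_pci_modern_common_cfg __iomem *cfg =
		(struct virtio_pci_modern_common_cfg __iomem *)mdev->common;

	/* Older devices expose only the smaller virtio_pci_common_cfg */
	if (mdev->common_len < offsetofend(struct virtio_pci_modern_common_cfg,
					   admin_queue_num))
		return 0;

	return vp_ioread16(&cfg->admin_queue_num);
}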


>  
>  struct virtio_pci_modern_device {
> -- 
> 2.27.0
Jason Gunthorpe Sept. 21, 2023, 2:11 p.m. UTC | #2
On Thu, Sep 21, 2023 at 09:16:21AM -0400, Michael S. Tsirkin wrote:

> > diff --git a/MAINTAINERS b/MAINTAINERS
> > index bf0f54c24f81..5098418c8389 100644
> > --- a/MAINTAINERS
> > +++ b/MAINTAINERS
> > @@ -22624,6 +22624,12 @@ L:	kvm@vger.kernel.org
> >  S:	Maintained
> >  F:	drivers/vfio/pci/mlx5/
> >  
> > +VFIO VIRTIO PCI DRIVER
> > +M:	Yishai Hadas <yishaih@nvidia.com>
> > +L:	kvm@vger.kernel.org
> > +S:	Maintained
> > +F:	drivers/vfio/pci/virtio
> > +
> >  VFIO PCI DEVICE SPECIFIC DRIVERS
> >  R:	Jason Gunthorpe <jgg@nvidia.com>
> >  R:	Yishai Hadas <yishaih@nvidia.com>
> 
> Tying two subsystems together like this is going to cause pain when
> merging. God forbid there's something e.g. virtio net specific
> (and there's going to be for sure) - now we are talking 3
> subsystems.

Cross subsystem stuff is normal in the kernel. Drivers should be
placed in their most logical spot - this driver exposes a VFIO
interface so it belongs here.

Your exact argument works the same from the VFIO perspective, someone
has to have code that belongs to them outside their little sphere
here.

> Case in point all other virtio drivers are nicely grouped, have a common
> mailing list etc etc.  This one is completely separate to the point
> where people won't even remember to copy the virtio mailing list.

The virtio mailing list should probably be added to the maintainers
entry

Jason
Michael S. Tsirkin Sept. 21, 2023, 2:16 p.m. UTC | #3
On Thu, Sep 21, 2023 at 11:11:25AM -0300, Jason Gunthorpe wrote:
> On Thu, Sep 21, 2023 at 09:16:21AM -0400, Michael S. Tsirkin wrote:
> 
> > > diff --git a/MAINTAINERS b/MAINTAINERS
> > > index bf0f54c24f81..5098418c8389 100644
> > > --- a/MAINTAINERS
> > > +++ b/MAINTAINERS
> > > @@ -22624,6 +22624,12 @@ L:	kvm@vger.kernel.org
> > >  S:	Maintained
> > >  F:	drivers/vfio/pci/mlx5/
> > >  
> > > +VFIO VIRTIO PCI DRIVER
> > > +M:	Yishai Hadas <yishaih@nvidia.com>
> > > +L:	kvm@vger.kernel.org
> > > +S:	Maintained
> > > +F:	drivers/vfio/pci/virtio
> > > +
> > >  VFIO PCI DEVICE SPECIFIC DRIVERS
> > >  R:	Jason Gunthorpe <jgg@nvidia.com>
> > >  R:	Yishai Hadas <yishaih@nvidia.com>
> > 
> > Tying two subsystems together like this is going to cause pain when
> > merging. God forbid there's something e.g. virtio net specific
> > (and there's going to be for sure) - now we are talking 3
> > subsystems.
> 
> Cross subsystem stuff is normal in the kernel.

Yea. But it's completely spurious here - virtio has its own way
to work with userspace which is vdpa and let's just use that.
Keeps things nice and contained.

> Drivers should be
> placed in their most logical spot - this driver exposes a VFIO
> interface so it belongs here.
> 
> Your exact argument works the same from the VFIO perspective, someone
> has to have code that belongs to them outside their little sphere
> here.
> 
> > Case in point all other virtio drivers are nicely grouped, have a common
> > mailing list etc etc.  This one is completely separate to the point
> > where people won't even remember to copy the virtio mailing list.
> 
> The virtio mailing list should probably be added to the maintainers
> entry
> 
> Jason
Alex Williamson Sept. 21, 2023, 4:43 p.m. UTC | #4
On Thu, 21 Sep 2023 15:40:40 +0300
Yishai Hadas <yishaih@nvidia.com> wrote:

> Introduce a vfio driver over virtio devices to support the legacy
> interface functionality for VFs.
> 
> Background, from the virtio spec [1].
> --------------------------------------------------------------------
> In some systems, there is a need to support a virtio legacy driver with
> a device that does not directly support the legacy interface. In such
> scenarios, a group owner device can provide the legacy interface
> functionality for the group member devices. The driver of the owner
> device can then access the legacy interface of a member device on behalf
> of the legacy member device driver.
> 
> For example, with the SR-IOV group type, group members (VFs) can not
> present the legacy interface in an I/O BAR in BAR0 as expected by the
> legacy pci driver. If the legacy driver is running inside a virtual
> machine, the hypervisor executing the virtual machine can present a
> virtual device with an I/O BAR in BAR0. The hypervisor intercepts the
> legacy driver accesses to this I/O BAR and forwards them to the group
> owner device (PF) using group administration commands.
> --------------------------------------------------------------------
> 
> Specifically, this driver adds support for a virtio-net VF to be exposed
> as a transitional device to a guest driver and allows the legacy IO BAR
> functionality on top.
> 
> This allows a VM which uses a legacy virtio-net driver in the guest to
> work transparently over a VF whose driver in the host is this new
> driver.
> 
> The driver can be extended easily to support some other types of virtio
> devices (e.g virtio-blk), by adding in a few places the specific type
> properties as was done for virtio-net.
> 
> For now, only the virtio-net use case was tested and as such we introduce
> the support only for such a device.
> 
> Practically,
> Upon probing a VF for a virtio-net device, in case its PF supports
> legacy access over the virtio admin commands and the VF doesn't have BAR
> 0, we set some specific 'vfio_device_ops' to be able to simulate in SW a
> transitional device with I/O BAR in BAR 0.
> 
> The existence of the simulated I/O bar is reported later on by
> overwriting the VFIO_DEVICE_GET_REGION_INFO command and the device
> exposes itself as a transitional device by overwriting some properties
> upon reading its config space.
> 
> Once we report the existence of I/O BAR as BAR 0 a legacy driver in the
> guest may use it via read/write calls according to the virtio
> specification.
> 
> Any read/write towards the control parts of the BAR will be captured by
> the new driver and will be translated into admin commands towards the
> device.
> 
> Any data path read/write access (i.e. virtio driver notifications) will
> be forwarded to the physical BAR whose properties were supplied by the
> VIRTIO_PCI_QUEUE_NOTIFY command upon the probing/init flow.
> 
> With that code in place a legacy driver in the guest has the look and
> feel as if having a transitional device with legacy support for both its
> control and data path flows.

Why do we need to enable a "legacy" driver in the guest?  The very name
suggests there's an alternative driver that perhaps doesn't require
this I/O BAR.  Why don't we just require the non-legacy driver in the
guest rather than increase our maintenance burden?  Thanks,

Alex

> 
> [1]
> https://github.com/oasis-tcs/virtio-spec/commit/03c2d32e5093ca9f2a17797242fbef88efe94b8c
> 
> Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
> ---
>  MAINTAINERS                      |   6 +
>  drivers/vfio/pci/Kconfig         |   2 +
>  drivers/vfio/pci/Makefile        |   2 +
>  drivers/vfio/pci/virtio/Kconfig  |  15 +
>  drivers/vfio/pci/virtio/Makefile |   4 +
>  drivers/vfio/pci/virtio/cmd.c    |   4 +-
>  drivers/vfio/pci/virtio/cmd.h    |   8 +
>  drivers/vfio/pci/virtio/main.c   | 546 +++++++++++++++++++++++++++++++
>  8 files changed, 585 insertions(+), 2 deletions(-)
>  create mode 100644 drivers/vfio/pci/virtio/Kconfig
>  create mode 100644 drivers/vfio/pci/virtio/Makefile
>  create mode 100644 drivers/vfio/pci/virtio/main.c
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index bf0f54c24f81..5098418c8389 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -22624,6 +22624,12 @@ L:	kvm@vger.kernel.org
>  S:	Maintained
>  F:	drivers/vfio/pci/mlx5/
>  
> +VFIO VIRTIO PCI DRIVER
> +M:	Yishai Hadas <yishaih@nvidia.com>
> +L:	kvm@vger.kernel.org
> +S:	Maintained
> +F:	drivers/vfio/pci/virtio
> +
>  VFIO PCI DEVICE SPECIFIC DRIVERS
>  R:	Jason Gunthorpe <jgg@nvidia.com>
>  R:	Yishai Hadas <yishaih@nvidia.com>
> diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
> index 8125e5f37832..18c397df566d 100644
> --- a/drivers/vfio/pci/Kconfig
> +++ b/drivers/vfio/pci/Kconfig
> @@ -65,4 +65,6 @@ source "drivers/vfio/pci/hisilicon/Kconfig"
>  
>  source "drivers/vfio/pci/pds/Kconfig"
>  
> +source "drivers/vfio/pci/virtio/Kconfig"
> +
>  endmenu
> diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
> index 45167be462d8..046139a4eca5 100644
> --- a/drivers/vfio/pci/Makefile
> +++ b/drivers/vfio/pci/Makefile
> @@ -13,3 +13,5 @@ obj-$(CONFIG_MLX5_VFIO_PCI)           += mlx5/
>  obj-$(CONFIG_HISI_ACC_VFIO_PCI) += hisilicon/
>  
>  obj-$(CONFIG_PDS_VFIO_PCI) += pds/
> +
> +obj-$(CONFIG_VIRTIO_VFIO_PCI) += virtio/
> diff --git a/drivers/vfio/pci/virtio/Kconfig b/drivers/vfio/pci/virtio/Kconfig
> new file mode 100644
> index 000000000000..89eddce8b1bd
> --- /dev/null
> +++ b/drivers/vfio/pci/virtio/Kconfig
> @@ -0,0 +1,15 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +config VIRTIO_VFIO_PCI
> +        tristate "VFIO support for VIRTIO PCI devices"
> +        depends on VIRTIO_PCI
> +        select VFIO_PCI_CORE
> +        help
> +          This provides support for exposing VIRTIO VF devices using the VFIO
> +          framework that can work with a legacy virtio driver in the guest.
> +          Based on PCIe spec, VFs do not support I/O Space; thus, VF BARs shall
> +          not indicate I/O Space.
> +          Therefore, this driver emulates an I/O BAR in software to let a VF be
> +          seen as a transitional device in the guest and let it work with
> +          a legacy driver.
> +
> +          If you don't know what to do here, say N.
> diff --git a/drivers/vfio/pci/virtio/Makefile b/drivers/vfio/pci/virtio/Makefile
> new file mode 100644
> index 000000000000..584372648a03
> --- /dev/null
> +++ b/drivers/vfio/pci/virtio/Makefile
> @@ -0,0 +1,4 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +obj-$(CONFIG_VIRTIO_VFIO_PCI) += virtio-vfio-pci.o
> +virtio-vfio-pci-y := main.o cmd.o
> +
> diff --git a/drivers/vfio/pci/virtio/cmd.c b/drivers/vfio/pci/virtio/cmd.c
> index f068239cdbb0..aea9d25fbf1d 100644
> --- a/drivers/vfio/pci/virtio/cmd.c
> +++ b/drivers/vfio/pci/virtio/cmd.c
> @@ -44,7 +44,7 @@ int virtiovf_cmd_lr_write(struct virtiovf_pci_core_device *virtvdev, u16 opcode,
>  {
>  	struct virtio_device *virtio_dev =
>  		virtio_pci_vf_get_pf_dev(virtvdev->core_device.pdev);
> -	struct virtio_admin_cmd_data_lr_write *in;
> +	struct virtio_admin_cmd_legacy_wr_data *in;
>  	struct scatterlist in_sg;
>  	struct virtio_admin_cmd cmd = {};
>  	int ret;
> @@ -74,7 +74,7 @@ int virtiovf_cmd_lr_read(struct virtiovf_pci_core_device *virtvdev, u16 opcode,
>  {
>  	struct virtio_device *virtio_dev =
>  		virtio_pci_vf_get_pf_dev(virtvdev->core_device.pdev);
> -	struct virtio_admin_cmd_data_lr_read *in;
> +	struct virtio_admin_cmd_legacy_rd_data *in;
>  	struct scatterlist in_sg, out_sg;
>  	struct virtio_admin_cmd cmd = {};
>  	int ret;
> diff --git a/drivers/vfio/pci/virtio/cmd.h b/drivers/vfio/pci/virtio/cmd.h
> index c2a3645f4b90..347b1dc85570 100644
> --- a/drivers/vfio/pci/virtio/cmd.h
> +++ b/drivers/vfio/pci/virtio/cmd.h
> @@ -13,7 +13,15 @@
>  
>  struct virtiovf_pci_core_device {
>  	struct vfio_pci_core_device core_device;
> +	u8 bar0_virtual_buf_size;
> +	u8 *bar0_virtual_buf;
> +	/* synchronize access to the virtual buf */
> +	struct mutex bar_mutex;
>  	int vf_id;
> +	void __iomem *notify_addr;
> +	u32 notify_offset;
> +	u8 notify_bar;
> +	u8 pci_cmd_io :1;
>  };
>  
>  int virtiovf_cmd_list_query(struct pci_dev *pdev, u8 *buf, int buf_size);
> diff --git a/drivers/vfio/pci/virtio/main.c b/drivers/vfio/pci/virtio/main.c
> new file mode 100644
> index 000000000000..2486991c49f3
> --- /dev/null
> +++ b/drivers/vfio/pci/virtio/main.c
> @@ -0,0 +1,546 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved
> + */
> +
> +#include <linux/device.h>
> +#include <linux/module.h>
> +#include <linux/mutex.h>
> +#include <linux/pci.h>
> +#include <linux/pm_runtime.h>
> +#include <linux/types.h>
> +#include <linux/uaccess.h>
> +#include <linux/vfio.h>
> +#include <linux/vfio_pci_core.h>
> +#include <linux/virtio_pci.h>
> +#include <linux/virtio_net.h>
> +#include <linux/virtio_pci_modern.h>
> +
> +#include "cmd.h"
> +
> +#define VIRTIO_LEGACY_IO_BAR_HEADER_LEN 20
> +#define VIRTIO_LEGACY_IO_BAR_MSIX_HEADER_LEN 4
> +
> +static int virtiovf_issue_lr_cmd(struct virtiovf_pci_core_device *virtvdev,
> +				 loff_t pos, char __user *buf,
> +				 size_t count, bool read)
> +{
> +	u8 *bar0_buf = virtvdev->bar0_virtual_buf;
> +	u16 opcode;
> +	int ret;
> +
> +	mutex_lock(&virtvdev->bar_mutex);
> +	if (read) {
> +		opcode = (pos < VIRTIO_PCI_CONFIG_OFF(true)) ?
> +			VIRTIO_ADMIN_CMD_LEGACY_COMMON_CFG_READ :
> +			VIRTIO_ADMIN_CMD_LEGACY_DEV_CFG_READ;
> +		ret = virtiovf_cmd_lr_read(virtvdev, opcode, pos,
> +					   count, bar0_buf + pos);
> +		if (ret)
> +			goto out;
> +		if (copy_to_user(buf, bar0_buf + pos, count))
> +			ret = -EFAULT;
> +		goto out;
> +	}
> +
> +	if (copy_from_user(bar0_buf + pos, buf, count)) {
> +		ret = -EFAULT;
> +		goto out;
> +	}
> +
> +	opcode = (pos < VIRTIO_PCI_CONFIG_OFF(true)) ?
> +			VIRTIO_ADMIN_CMD_LEGACY_COMMON_CFG_WRITE :
> +			VIRTIO_ADMIN_CMD_LEGACY_DEV_CFG_WRITE;
> +	ret = virtiovf_cmd_lr_write(virtvdev, opcode, pos, count,
> +				    bar0_buf + pos);
> +out:
> +	mutex_unlock(&virtvdev->bar_mutex);
> +	return ret;
> +}
> +
> +static int
> +translate_io_bar_to_mem_bar(struct virtiovf_pci_core_device *virtvdev,
> +			    loff_t pos, char __user *buf,
> +			    size_t count, bool read)
> +{
> +	struct vfio_pci_core_device *core_device = &virtvdev->core_device;
> +	u16 queue_notify;
> +	int ret;
> +
> +	if (pos + count > virtvdev->bar0_virtual_buf_size)
> +		return -EINVAL;
> +
> +	switch (pos) {
> +	case VIRTIO_PCI_QUEUE_NOTIFY:
> +		if (count != sizeof(queue_notify))
> +			return -EINVAL;
> +		if (read) {
> +			ret = vfio_pci_ioread16(core_device, true, &queue_notify,
> +						virtvdev->notify_addr);
> +			if (ret)
> +				return ret;
> +			if (copy_to_user(buf, &queue_notify,
> +					 sizeof(queue_notify)))
> +				return -EFAULT;
> +			break;
> +		}
> +
> +		if (copy_from_user(&queue_notify, buf, count))
> +			return -EFAULT;
> +
> +		ret = vfio_pci_iowrite16(core_device, true, queue_notify,
> +					 virtvdev->notify_addr);
> +		break;
> +	default:
> +		ret = virtiovf_issue_lr_cmd(virtvdev, pos, buf, count, read);
> +	}
> +
> +	return ret ? ret : count;
> +}
> +
> +static bool range_contains_range(loff_t range1_start, size_t count1,
> +				 loff_t range2_start, size_t count2,
> +				 loff_t *start_offset)
> +{
> +	if (range1_start <= range2_start &&
> +	    range1_start + count1 >= range2_start + count2) {
> +		*start_offset = range2_start - range1_start;
> +		return true;
> +	}
> +	return false;
> +}
> +
> +static ssize_t virtiovf_pci_read_config(struct vfio_device *core_vdev,
> +					char __user *buf, size_t count,
> +					loff_t *ppos)
> +{
> +	struct virtiovf_pci_core_device *virtvdev = container_of(
> +		core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
> +	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
> +	loff_t copy_offset;
> +	__le32 val32;
> +	__le16 val16;
> +	u8 val8;
> +	int ret;
> +
> +	ret = vfio_pci_core_read(core_vdev, buf, count, ppos);
> +	if (ret < 0)
> +		return ret;
> +
> +	if (range_contains_range(pos, count, PCI_DEVICE_ID, sizeof(val16),
> +				 &copy_offset)) {
> +		val16 = cpu_to_le16(0x1000);
> +		if (copy_to_user(buf + copy_offset, &val16, sizeof(val16)))
> +			return -EFAULT;
> +	}
> +
> +	if (virtvdev->pci_cmd_io &&
> +	    range_contains_range(pos, count, PCI_COMMAND, sizeof(val16),
> +				 &copy_offset)) {
> +		if (copy_from_user(&val16, buf, sizeof(val16)))
> +			return -EFAULT;
> +		val16 |= cpu_to_le16(PCI_COMMAND_IO);
> +		if (copy_to_user(buf + copy_offset, &val16, sizeof(val16)))
> +			return -EFAULT;
> +	}
> +
> +	if (range_contains_range(pos, count, PCI_REVISION_ID, sizeof(val8),
> +				 &copy_offset)) {
> +		/* Transitional needs to have revision 0 */
> +		val8 = 0;
> +		if (copy_to_user(buf + copy_offset, &val8, sizeof(val8)))
> +			return -EFAULT;
> +	}
> +
> +	if (range_contains_range(pos, count, PCI_BASE_ADDRESS_0, sizeof(val32),
> +				 &copy_offset)) {
> +		val32 = cpu_to_le32(PCI_BASE_ADDRESS_SPACE_IO);
> +		if (copy_to_user(buf + copy_offset, &val32, sizeof(val32)))
> +			return -EFAULT;
> +	}
> +
> +	if (range_contains_range(pos, count, PCI_SUBSYSTEM_ID, sizeof(val16),
> +				 &copy_offset)) {
> +		/* Transitional devices use the PCI subsystem device id as
> +		 * virtio device id, same as legacy driver always did.
> +		 */
> +		val16 = cpu_to_le16(VIRTIO_ID_NET);
> +		if (copy_to_user(buf + copy_offset, &val16, sizeof(val16)))
> +			return -EFAULT;
> +	}
> +
> +	return count;
> +}
> +
> +static ssize_t
> +virtiovf_pci_core_read(struct vfio_device *core_vdev, char __user *buf,
> +		       size_t count, loff_t *ppos)
> +{
> +	struct virtiovf_pci_core_device *virtvdev = container_of(
> +		core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
> +	struct pci_dev *pdev = virtvdev->core_device.pdev;
> +	unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
> +	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
> +	int ret;
> +
> +	if (!count)
> +		return 0;
> +
> +	if (index == VFIO_PCI_CONFIG_REGION_INDEX)
> +		return virtiovf_pci_read_config(core_vdev, buf, count, ppos);
> +
> +	if (index != VFIO_PCI_BAR0_REGION_INDEX)
> +		return vfio_pci_core_read(core_vdev, buf, count, ppos);
> +
> +	ret = pm_runtime_resume_and_get(&pdev->dev);
> +	if (ret) {
> +		pci_info_ratelimited(pdev, "runtime resume failed %d\n",
> +				     ret);
> +		return -EIO;
> +	}
> +
> +	ret = translate_io_bar_to_mem_bar(virtvdev, pos, buf, count, true);
> +	pm_runtime_put(&pdev->dev);
> +	return ret;
> +}
> +
> +static ssize_t
> +virtiovf_pci_core_write(struct vfio_device *core_vdev, const char __user *buf,
> +			size_t count, loff_t *ppos)
> +{
> +	struct virtiovf_pci_core_device *virtvdev = container_of(
> +		core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
> +	struct pci_dev *pdev = virtvdev->core_device.pdev;
> +	unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
> +	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
> +	int ret;
> +
> +	if (!count)
> +		return 0;
> +
> +	if (index == VFIO_PCI_CONFIG_REGION_INDEX) {
> +		loff_t copy_offset;
> +		u16 cmd;
> +
> +		if (range_contains_range(pos, count, PCI_COMMAND, sizeof(cmd),
> +					 &copy_offset)) {
> +			if (copy_from_user(&cmd, buf + copy_offset, sizeof(cmd)))
> +				return -EFAULT;
> +			virtvdev->pci_cmd_io = (cmd & PCI_COMMAND_IO);
> +		}
> +	}
> +
> +	if (index != VFIO_PCI_BAR0_REGION_INDEX)
> +		return vfio_pci_core_write(core_vdev, buf, count, ppos);
> +
> +	ret = pm_runtime_resume_and_get(&pdev->dev);
> +	if (ret) {
> +		pci_info_ratelimited(pdev, "runtime resume failed %d\n", ret);
> +		return -EIO;
> +	}
> +
> +	ret = translate_io_bar_to_mem_bar(virtvdev, pos, (char __user *)buf, count, false);
> +	pm_runtime_put(&pdev->dev);
> +	return ret;
> +}
> +
> +static int
> +virtiovf_pci_ioctl_get_region_info(struct vfio_device *core_vdev,
> +				   unsigned int cmd, unsigned long arg)
> +{
> +	struct virtiovf_pci_core_device *virtvdev = container_of(
> +		core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
> +	unsigned long minsz = offsetofend(struct vfio_region_info, offset);
> +	void __user *uarg = (void __user *)arg;
> +	struct vfio_region_info info = {};
> +
> +	if (copy_from_user(&info, uarg, minsz))
> +		return -EFAULT;
> +
> +	if (info.argsz < minsz)
> +		return -EINVAL;
> +
> +	switch (info.index) {
> +	case VFIO_PCI_BAR0_REGION_INDEX:
> +		info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
> +		info.size = virtvdev->bar0_virtual_buf_size;
> +		info.flags = VFIO_REGION_INFO_FLAG_READ |
> +			     VFIO_REGION_INFO_FLAG_WRITE;
> +		return copy_to_user(uarg, &info, minsz) ? -EFAULT : 0;
> +	default:
> +		return vfio_pci_core_ioctl(core_vdev, cmd, arg);
> +	}
> +}
> +
> +static long
> +virtiovf_vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
> +			     unsigned long arg)
> +{
> +	switch (cmd) {
> +	case VFIO_DEVICE_GET_REGION_INFO:
> +		return virtiovf_pci_ioctl_get_region_info(core_vdev, cmd, arg);
> +	default:
> +		return vfio_pci_core_ioctl(core_vdev, cmd, arg);
> +	}
> +}
> +
> +static int
> +virtiovf_set_notify_addr(struct virtiovf_pci_core_device *virtvdev)
> +{
> +	struct vfio_pci_core_device *core_device = &virtvdev->core_device;
> +	int ret;
> +
> +	/* Setup the BAR where the 'notify' exists to be used by vfio as well
> +	 * This will let us mmap it only once and use it when needed.
> +	 */
> +	ret = vfio_pci_core_setup_barmap(core_device,
> +					 virtvdev->notify_bar);
> +	if (ret)
> +		return ret;
> +
> +	virtvdev->notify_addr = core_device->barmap[virtvdev->notify_bar] +
> +			virtvdev->notify_offset;
> +	return 0;
> +}
> +
> +static int virtiovf_pci_open_device(struct vfio_device *core_vdev)
> +{
> +	struct virtiovf_pci_core_device *virtvdev = container_of(
> +		core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
> +	struct vfio_pci_core_device *vdev = &virtvdev->core_device;
> +	int ret;
> +
> +	ret = vfio_pci_core_enable(vdev);
> +	if (ret)
> +		return ret;
> +
> +	if (virtvdev->bar0_virtual_buf) {
> +		/* upon close_device() the vfio_pci_core_disable() is called
> +		 * and will close all the previous mmaps, so it seems that the
> +		 * valid life cycle for the 'notify' addr is per open/close.
> +		 */
> +		ret = virtiovf_set_notify_addr(virtvdev);
> +		if (ret) {
> +			vfio_pci_core_disable(vdev);
> +			return ret;
> +		}
> +	}
> +
> +	vfio_pci_core_finish_enable(vdev);
> +	return 0;
> +}
> +
> +static void virtiovf_pci_close_device(struct vfio_device *core_vdev)
> +{
> +	vfio_pci_core_close_device(core_vdev);
> +}
> +
> +static int virtiovf_get_device_config_size(unsigned short device)
> +{
> +	switch (device) {
> +	case 0x1041:
> +		/* network card */
> +		return offsetofend(struct virtio_net_config, status);
> +	default:
> +		return 0;
> +	}
> +}
> +
> +static int virtiovf_read_notify_info(struct virtiovf_pci_core_device *virtvdev)
> +{
> +	u64 offset;
> +	int ret;
> +	u8 bar;
> +
> +	ret = virtiovf_cmd_lq_read_notify(virtvdev,
> +				VIRTIO_ADMIN_CMD_NOTIFY_INFO_FLAGS_OWNER_MEM,
> +				&bar, &offset);
> +	if (ret)
> +		return ret;
> +
> +	virtvdev->notify_bar = bar;
> +	virtvdev->notify_offset = offset;
> +	return 0;
> +}
> +
> +static int virtiovf_pci_init_device(struct vfio_device *core_vdev)
> +{
> +	struct virtiovf_pci_core_device *virtvdev = container_of(
> +		core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
> +	struct pci_dev *pdev;
> +	int ret;
> +
> +	ret = vfio_pci_core_init_dev(core_vdev);
> +	if (ret)
> +		return ret;
> +
> +	pdev = virtvdev->core_device.pdev;
> +	virtvdev->vf_id = pci_iov_vf_id(pdev);
> +	if (virtvdev->vf_id < 0)
> +		return -EINVAL;
> +
> +	ret = virtiovf_read_notify_info(virtvdev);
> +	if (ret)
> +		return ret;
> +
> +	virtvdev->bar0_virtual_buf_size = VIRTIO_LEGACY_IO_BAR_HEADER_LEN +
> +		VIRTIO_LEGACY_IO_BAR_MSIX_HEADER_LEN +
> +		virtiovf_get_device_config_size(pdev->device);
> +	virtvdev->bar0_virtual_buf = kzalloc(virtvdev->bar0_virtual_buf_size,
> +					     GFP_KERNEL);
> +	if (!virtvdev->bar0_virtual_buf)
> +		return -ENOMEM;
> +	mutex_init(&virtvdev->bar_mutex);
> +	return 0;
> +}
> +
> +static void virtiovf_pci_core_release_dev(struct vfio_device *core_vdev)
> +{
> +	struct virtiovf_pci_core_device *virtvdev = container_of(
> +		core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
> +
> +	kfree(virtvdev->bar0_virtual_buf);
> +	vfio_pci_core_release_dev(core_vdev);
> +}
> +
> +static const struct vfio_device_ops virtiovf_acc_vfio_pci_tran_ops = {
> +	.name = "virtio-transitional-vfio-pci",
> +	.init = virtiovf_pci_init_device,
> +	.release = virtiovf_pci_core_release_dev,
> +	.open_device = virtiovf_pci_open_device,
> +	.close_device = virtiovf_pci_close_device,
> +	.ioctl = virtiovf_vfio_pci_core_ioctl,
> +	.read = virtiovf_pci_core_read,
> +	.write = virtiovf_pci_core_write,
> +	.mmap = vfio_pci_core_mmap,
> +	.request = vfio_pci_core_request,
> +	.match = vfio_pci_core_match,
> +	.bind_iommufd = vfio_iommufd_physical_bind,
> +	.unbind_iommufd = vfio_iommufd_physical_unbind,
> +	.attach_ioas = vfio_iommufd_physical_attach_ioas,
> +};
> +
> +static const struct vfio_device_ops virtiovf_acc_vfio_pci_ops = {
> +	.name = "virtio-acc-vfio-pci",
> +	.init = vfio_pci_core_init_dev,
> +	.release = vfio_pci_core_release_dev,
> +	.open_device = virtiovf_pci_open_device,
> +	.close_device = virtiovf_pci_close_device,
> +	.ioctl = vfio_pci_core_ioctl,
> +	.device_feature = vfio_pci_core_ioctl_feature,
> +	.read = vfio_pci_core_read,
> +	.write = vfio_pci_core_write,
> +	.mmap = vfio_pci_core_mmap,
> +	.request = vfio_pci_core_request,
> +	.match = vfio_pci_core_match,
> +	.bind_iommufd = vfio_iommufd_physical_bind,
> +	.unbind_iommufd = vfio_iommufd_physical_unbind,
> +	.attach_ioas = vfio_iommufd_physical_attach_ioas,
> +};
> +
> +static bool virtiovf_bar0_exists(struct pci_dev *pdev)
> +{
> +	struct resource *res = pdev->resource;
> +
> +	return res->flags ? true : false;
> +}
> +
> +#define VIRTIOVF_USE_ADMIN_CMD_BITMAP \
> +	(BIT_ULL(VIRTIO_ADMIN_CMD_LIST_QUERY) | \
> +	 BIT_ULL(VIRTIO_ADMIN_CMD_LIST_USE) | \
> +	 BIT_ULL(VIRTIO_ADMIN_CMD_LEGACY_COMMON_CFG_WRITE) | \
> +	 BIT_ULL(VIRTIO_ADMIN_CMD_LEGACY_COMMON_CFG_READ) | \
> +	 BIT_ULL(VIRTIO_ADMIN_CMD_LEGACY_DEV_CFG_WRITE) | \
> +	 BIT_ULL(VIRTIO_ADMIN_CMD_LEGACY_DEV_CFG_READ) | \
> +	 BIT_ULL(VIRTIO_ADMIN_CMD_LEGACY_NOTIFY_INFO))
> +
> +static bool virtiovf_support_legacy_access(struct pci_dev *pdev)
> +{
> +	int buf_size = DIV_ROUND_UP(VIRTIO_ADMIN_MAX_CMD_OPCODE, 64) * 8;
> +	u8 *buf;
> +	int ret;
> +
> +	/* Only virtio-net is supported/tested so far */
> +	if (pdev->device != 0x1041)
> +		return false;
> +
> +	buf = kzalloc(buf_size, GFP_KERNEL);
> +	if (!buf)
> +		return false;
> +
> +	ret = virtiovf_cmd_list_query(pdev, buf, buf_size);
> +	if (ret)
> +		goto end;
> +
> +	if ((le64_to_cpup((__le64 *)buf) & VIRTIOVF_USE_ADMIN_CMD_BITMAP) !=
> +		VIRTIOVF_USE_ADMIN_CMD_BITMAP) {
> +		ret = -EOPNOTSUPP;
> +		goto end;
> +	}
> +
> +	/* confirm the used commands */
> +	memset(buf, 0, buf_size);
> +	*(__le64 *)buf = cpu_to_le64(VIRTIOVF_USE_ADMIN_CMD_BITMAP);
> +	ret = virtiovf_cmd_list_use(pdev, buf, buf_size);
> +
> +end:
> +	kfree(buf);
> +	return ret ? false : true;
> +}
> +
> +static int virtiovf_pci_probe(struct pci_dev *pdev,
> +			      const struct pci_device_id *id)
> +{
> +	const struct vfio_device_ops *ops = &virtiovf_acc_vfio_pci_ops;
> +	struct virtiovf_pci_core_device *virtvdev;
> +	int ret;
> +
> +	if (pdev->is_virtfn && virtiovf_support_legacy_access(pdev) &&
> +	    !virtiovf_bar0_exists(pdev) && pdev->msix_cap)
> +		ops = &virtiovf_acc_vfio_pci_tran_ops;
> +
> +	virtvdev = vfio_alloc_device(virtiovf_pci_core_device, core_device.vdev,
> +				     &pdev->dev, ops);
> +	if (IS_ERR(virtvdev))
> +		return PTR_ERR(virtvdev);
> +
> +	dev_set_drvdata(&pdev->dev, &virtvdev->core_device);
> +	ret = vfio_pci_core_register_device(&virtvdev->core_device);
> +	if (ret)
> +		goto out;
> +	return 0;
> +out:
> +	vfio_put_device(&virtvdev->core_device.vdev);
> +	return ret;
> +}
> +
> +static void virtiovf_pci_remove(struct pci_dev *pdev)
> +{
> +	struct virtiovf_pci_core_device *virtvdev = dev_get_drvdata(&pdev->dev);
> +
> +	vfio_pci_core_unregister_device(&virtvdev->core_device);
> +	vfio_put_device(&virtvdev->core_device.vdev);
> +}
> +
> +static const struct pci_device_id virtiovf_pci_table[] = {
> +	{ PCI_DRIVER_OVERRIDE_DEVICE_VFIO(PCI_VENDOR_ID_REDHAT_QUMRANET, PCI_ANY_ID) },
> +	{}
> +};
> +
> +MODULE_DEVICE_TABLE(pci, virtiovf_pci_table);
> +
> +static struct pci_driver virtiovf_pci_driver = {
> +	.name = KBUILD_MODNAME,
> +	.id_table = virtiovf_pci_table,
> +	.probe = virtiovf_pci_probe,
> +	.remove = virtiovf_pci_remove,
> +	.err_handler = &vfio_pci_core_err_handlers,
> +	.driver_managed_dma = true,
> +};
> +
> +module_pci_driver(virtiovf_pci_driver);
> +
> +MODULE_LICENSE("GPL");
> +MODULE_AUTHOR("Yishai Hadas <yishaih@nvidia.com>");
> +MODULE_DESCRIPTION(
> +	"VIRTIO VFIO PCI - User Level meta-driver for VIRTIO device family");
Michael S. Tsirkin Sept. 21, 2023, 5:01 p.m. UTC | #5
On Thu, Sep 21, 2023 at 01:52:24PM -0300, Jason Gunthorpe wrote:
> On Thu, Sep 21, 2023 at 10:43:50AM -0600, Alex Williamson wrote:
> 
> > > With that code in place a legacy driver in the guest has the look and
> > > feel as if having a transitional device with legacy support for both its
> > > control and data path flows.
> > 
> > Why do we need to enable a "legacy" driver in the guest?  The very name
> > suggests there's an alternative driver that perhaps doesn't require
> > this I/O BAR.  Why don't we just require the non-legacy driver in the
> > guest rather than increase our maintenance burden?  Thanks,
> 
> It was my reaction also.
> 
> Apparently there is a big deployed base of people using old guest VMs
> with old drivers and they do not want to update their VMs. It is the
> same basic reason why qemu supports all those weird old machine types
> and HW emulations. The desire is to support these old devices so that
> old VMs can work unchanged.
> 
> Jason

And you are saying all these very old VMs use such a large number of
legacy devices that over-counting of locked memory due to vdpa not
correctly using iommufd is a problem that urgently needs to be solved
otherwise the solution has no value?

Another question I'm interested in is whether there's actually a
performance benefit to using this as compared to just software
vhost. I note there's a VM exit on each IO access, so ... perhaps?
Would be nice to see some numbers.
Jason Gunthorpe Sept. 21, 2023, 5:07 p.m. UTC | #6
On Thu, Sep 21, 2023 at 01:01:12PM -0400, Michael S. Tsirkin wrote:
> On Thu, Sep 21, 2023 at 01:52:24PM -0300, Jason Gunthorpe wrote:
> > On Thu, Sep 21, 2023 at 10:43:50AM -0600, Alex Williamson wrote:
> > 
> > > > With that code in place a legacy driver in the guest has the look and
> > > > feel as if having a transitional device with legacy support for both its
> > > > control and data path flows.
> > > 
> > > Why do we need to enable a "legacy" driver in the guest?  The very name
> > > suggests there's an alternative driver that perhaps doesn't require
> > > this I/O BAR.  Why don't we just require the non-legacy driver in the
> > > guest rather than increase our maintenance burden?  Thanks,
> > 
> > It was my reaction also.
> > 
> > Apparently there is a big deployed base of people using old guest VMs
> > with old drivers and they do not want to update their VMs. It is the
> > same basic reason why qemu supports all those weird old machine types
> > and HW emulations. The desire is to support these old devices so that
> > old VMs can work unchanged.
> > 
> > Jason
> 
> And you are saying all these very old VMs use such a large number of
> legacy devices that over-counting of locked memory due to vdpa not
> correctly using iommufd is a problem that urgently needs to be solved
> otherwise the solution has no value?

No one has said that.

iommufd is gaining alot more functions than just pinned memory
accounting.

> Another question I'm interested in is whether there's actually a
> performance benefit to using this as compared to just software
> vhost. I note there's a VM exit on each IO access, so ... perhaps?
> Would be nice to see some numbers.

At least a single trap compared with an entire per-packet SW flow
undoubtedly uses a lot less CPU power in the hypervisor.

Jason
Parav Pandit Sept. 21, 2023, 5:09 p.m. UTC | #7
> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Thursday, September 21, 2023 10:31 PM

> Another question I'm interested in is whether there's actually a performance
> benefit to using this as compared to just software vhost. I note there's a VM exit
> on each IO access, so ... perhaps?
> Would be nice to see some numbers.

Packet rate and bandwidth are close, only about 10% lower than a modern
device, due to the batching of driver notifications.
Bandwidth was tested with iperf with one and multiple queues.
Packet rate was tested with testpmd.
Michael S. Tsirkin Sept. 21, 2023, 5:21 p.m. UTC | #8
On Thu, Sep 21, 2023 at 02:07:09PM -0300, Jason Gunthorpe wrote:
> On Thu, Sep 21, 2023 at 01:01:12PM -0400, Michael S. Tsirkin wrote:
> > On Thu, Sep 21, 2023 at 01:52:24PM -0300, Jason Gunthorpe wrote:
> > > On Thu, Sep 21, 2023 at 10:43:50AM -0600, Alex Williamson wrote:
> > > 
> > > > > With that code in place a legacy driver in the guest has the look and
> > > > > feel as if having a transitional device with legacy support for both its
> > > > > control and data path flows.
> > > > 
> > > > Why do we need to enable a "legacy" driver in the guest?  The very name
> > > > suggests there's an alternative driver that perhaps doesn't require
> > > > this I/O BAR.  Why don't we just require the non-legacy driver in the
> > > > guest rather than increase our maintenance burden?  Thanks,
> > > 
> > > It was my reaction also.
> > > 
> > > Apparently there is a big deployed base of people using old guest VMs
> > > with old drivers and they do not want to update their VMs. It is the
> > > same basic reason why qemu supports all those weird old machine types
> > > and HW emulations. The desire is to support these old devices so that
> > > old VMs can work unchanged.
> > > 
> > > Jason
> > 
> > And you are saying all these very old VMs use such a large number of
> > legacy devices that over-counting of locked memory due to vdpa not
> > correctly using iommufd is a problem that urgently needs to be solved
> > otherwise the solution has no value?
> 
> No one has said that.
> 
> iommufd is gaining alot more functions than just pinned memory
> accounting.

Yea it's very useful - it's also useful for vdpa whether this patchset
goes in or not.  At some level, if vdpa can't keep up then maybe going
the vfio route is justified. I'm not sure why didn't anyone fix iommufd
yet - looks like a small amount of work. I'll see if I can address it
quickly because we already have virtio accelerators under vdpa and it
seems confusing to people to use vdpa for some and vfio for others, with
overlapping but slightly incompatible functionality.  I'll get back next
week, in either case. I am however genuinely curious whether all the new
functionality is actually useful for these legacy guests.

> > Another question I'm interested in is whether there's actually a
> > performance benefit to using this as compared to just software
> > vhost. I note there's a VM exit on each IO access, so ... perhaps?
> > Would be nice to see some numbers.
> 
> At least a single trap compared with an entire per-packet SW flow
> undoubtedly uses a lot less CPU power in the hypervisor.
> 
> Jason

Something like the shadow vq thing will be more or less equivalent then?
That's upstream in qemu and needs no hardware support. Worth comparing
against.  Anyway, there's presumably actual hardware this was tested
with, so why guess? Just test and post numbers.
Michael S. Tsirkin Sept. 21, 2023, 5:24 p.m. UTC | #9
On Thu, Sep 21, 2023 at 05:09:04PM +0000, Parav Pandit wrote:
> 
> 
> > From: Michael S. Tsirkin <mst@redhat.com>
> > Sent: Thursday, September 21, 2023 10:31 PM
> 
> > Another question I'm interested in is whether there's actually a performance
> > benefit to using this as compared to just software vhost. I note there's a VM exit
> > on each IO access, so ... perhaps?
> > Would be nice to see some numbers.
> 
> Packet rate and bandwidth are close, only about 10% lower than a modern
> device, due to the batching of driver notifications.
> Bandwidth was tested with iperf with one and multiple queues.
> Packet rate was tested with testpmd.

Nice, good to know.  Could you compare this with vdpa with shadow vq
enabled?  That's probably the closest equivalent that needs
no kernel or hardware work.
Jason Gunthorpe Sept. 21, 2023, 5:44 p.m. UTC | #10
On Thu, Sep 21, 2023 at 01:21:26PM -0400, Michael S. Tsirkin wrote:
> Yea it's very useful - it's also useful for vdpa whether this patchset
> goes in or not.  At some level, if vdpa can't keep up then maybe going
> the vfio route is justified. I'm not sure why didn't anyone fix iommufd
> yet - looks like a small amount of work. I'll see if I can address it
> quickly because we already have virtio accelerators under vdpa and it
> seems confusing to people to use vdpa for some and vfio for others, with
> overlapping but slightly incompatible functionality.  I'll get back next
> week, in either case. I am however genuinely curious whether all the new
> functionality is actually useful for these legacy guests.

It doesn't have much to do with the guests - this is new hypervisor
functionality to make the hypervisor do more things. This stuff can
still work with old VMs.

> > > Another question I'm interested in is whether there's actually a
> > > performance benefit to using this as compared to just software
> > > vhost. I note there's a VM exit on each IO access, so ... perhaps?
> > > Would be nice to see some numbers.
> > 
> > At least a single trap compared with an entire per-packet SW flow
> > undoubtably uses alot less CPU power in the hypervisor.
>
> Something like the shadow vq thing will be more or less equivalent
> then?

Huh? It still has the entire netdev stack to go through on every
packet before it reaches the real virtio device.

> That's upstream in qemu and needs no hardware support. Worth comparing
> against.  Anyway, there's presumably actual hardware this was tested
> with, so why guess? Just test and post numbers.

Our prior benchmarking put our VDPA/VFIO solutions at something like
2x-3x improvement over the qemu SW path it replaces.

Parav said 10% is lost, so 10% of 3x is still 3x better :)

I thought we all agreed on this when vdpa was created in the first
place, the all SW path was hopeless to get high performance out of?

Jason
Michael S. Tsirkin Sept. 21, 2023, 5:55 p.m. UTC | #11
On Thu, Sep 21, 2023 at 02:44:50PM -0300, Jason Gunthorpe wrote:
> On Thu, Sep 21, 2023 at 01:21:26PM -0400, Michael S. Tsirkin wrote:
> > Yea it's very useful - it's also useful for vdpa whether this patchset
> > goes in or not.  At some level, if vdpa can't keep up then maybe going
> > the vfio route is justified. I'm not sure why didn't anyone fix iommufd
> > yet - looks like a small amount of work. I'll see if I can address it
> > quickly because we already have virtio accelerators under vdpa and it
> > seems confusing to people to use vdpa for some and vfio for others, with
> > overlapping but slightly incompatible functionality.  I'll get back next
> > week, in either case. I am however genuinely curious whether all the new
> > functionality is actually useful for these legacy guests.
> 
> It doesn't have much to do with the guests - this is new hypervisor
> functionality to make the hypervisor do more things. This stuff can
> still work with old VMs.
> 
> > > > Another question I'm interested in is whether there's actually a
> > > > performance benefit to using this as compared to just software
> > > > vhost. I note there's a VM exit on each IO access, so ... perhaps?
> > > > Would be nice to see some numbers.
> > > 
> > > At least a single trap compared with an entire per-packet SW flow
> > > undoubtedly uses a lot less CPU power in the hypervisor.
> >
> > Something like the shadow vq thing will be more or less equivalent
> > then?
> 
> Huh? It still has the entire netdev stack to go through on every
> packet before it reaches the real virtio device.

No - shadow vq just tweaks the descriptor and forwards it to
the modern vdpa hardware. No net stack involved.

> > That's upstream in qemu and needs no hardware support. Worth comparing
> > against.  Anyway, there's presumably actual hardware this was tested
> > with, so why guess? Just test and post numbers.
> 
> Our prior benchmarking put our VDPA/VFIO solutions at something like
> 2x-3x improvement over the qemu SW path it replaces.
> Parav said 10% is lost, so 10% of 3x is still 3x better :)
> 
> I thought we all agreed on this when vdpa was created in the first
> place, the all SW path was hopeless to get high performance out of?
> 
> Jason

That's not what I'm asking about though - not what shadow vq does,
shadow vq is a vdpa feature.
Jason Gunthorpe Sept. 21, 2023, 6:16 p.m. UTC | #12
On Thu, Sep 21, 2023 at 01:55:42PM -0400, Michael S. Tsirkin wrote:

> That's not what I'm asking about though - not what shadow vq does,
> shadow vq is a vdpa feature.

That's just VDPA then. We already talked about why VDPA is not a
replacement for VFIO.

I agree you can probably get decent performance out of choosing VDPA
over VFIO. That doesn't justify VDPA as a replacement for VFIO.

Jason
Jason Gunthorpe Sept. 21, 2023, 6:39 p.m. UTC | #13
On Thu, Sep 21, 2023 at 12:53:04PM -0400, Michael S. Tsirkin wrote:
> > vdpa is not vfio, I don't know how you can suggest vdpa is a
> > replacement for a vfio driver. They are completely different
> > things.
> > Each side has its own strengths, and vfio especially is accelerating
> > in its capability in way that vpda is not. eg if an iommufd conversion
> > had been done by now for vdpa I might be more sympathetic.
> 
> Yea, I agree iommufd is a big problem with vdpa right now. Cindy was
> sick and I didn't know and kept assuming she's working on this. I don't
> think it's a huge amount of work though.  I'll take a look.
> Is there anything else though? Do tell.

Confidential compute will never work with VDPA's approach.

> There are a bunch of things that I think are important for virtio
> that are completely out of scope for vfio, such as migrating
> cross-vendor. 

VFIO supports migration, if you want to have cross-vendor migration
then make a standard that describes the VFIO migration data format for
virtio devices.

> What is the huge amount of work am I asking to do?

You are asking us to invest in the complexity of VDPA throughout
(keep it working, keep it secure, invest time in deploying and
debugging in the field)

When it doesn't provide *ANY* value to the solution.

The starting point is a completely working vfio PCI function and the
end goal is to put that function into a VM. That is VFIO, not VDPA.

VDPA is fine for what it does, but it is not a reasonable replacement
for VFIO.

Jason
Michael S. Tsirkin Sept. 21, 2023, 7:13 p.m. UTC | #14
On Thu, Sep 21, 2023 at 03:39:26PM -0300, Jason Gunthorpe wrote:
> On Thu, Sep 21, 2023 at 12:53:04PM -0400, Michael S. Tsirkin wrote:
> > > vdpa is not vfio, I don't know how you can suggest vdpa is a
> > > replacement for a vfio driver. They are completely different
> > > things.
> > > Each side has its own strengths, and vfio especially is accelerating
> > > in its capability in way that vpda is not. eg if an iommufd conversion
> > > had been done by now for vdpa I might be more sympathetic.
> > 
> > Yea, I agree iommufd is a big problem with vdpa right now. Cindy was
> > sick and I didn't know and kept assuming she's working on this. I don't
> > think it's a huge amount of work though.  I'll take a look.
> > Is there anything else though? Do tell.
> 
> Confidential compute will never work with VDPA's approach.

I don't see how what this patchset is doing is different
wrt confidential compute - you trap IO accesses and emulate.
Care to elaborate?


> > There are a bunch of things that I think are important for virtio
> > that are completely out of scope for vfio, such as migrating
> > cross-vendor. 
> 
> VFIO supports migration, if you want to have cross-vendor migration
> then make a standard that describes the VFIO migration data format for
> virtio devices.

This has nothing to do with data formats - you need two devices to
behave identically. Which is what VDPA is about really.

> > What is the huge amount of work am I asking to do?
> 
> You are asking us to invest in the complexity of VDPA through out
> (keep it working, keep it secure, invest time in deploying and
> debugging in the field)
> 
> When it doesn't provide *ANY* value to the solution.

There's no "the solution" - this sounds like a vendor only caring about
solutions that involve that vendor's hardware exclusively, a little.

> The starting point is a completely working vfio PCI function and the
> end goal is to put that function into a VM. That is VFIO, not VDPA.
> 
> VPDA is fine for what it does, but it is not a reasonable replacement
> for VFIO.
> 
> Jason

VDPA basically should be a kind of "VFIO for virtio".
Michael S. Tsirkin Sept. 21, 2023, 7:17 p.m. UTC | #15
On Thu, Sep 21, 2023 at 03:39:26PM -0300, Jason Gunthorpe wrote:
> > What is the huge amount of work am I asking to do?
> 
> You are asking us to invest in the complexity of VDPA through out
> (keep it working, keep it secure, invest time in deploying and
> debugging in the field)

I'm asking you to do nothing of the kind - I am saying that this code
will have to be duplicated in vdpa, and so I am asking what exactly is
missing to just keep it all there. So far you said iommufd - and
note I didn't ask you to add iommufd to vdpa, though that would be nice ;)
I just said I'll look into it in the next several days.
Michael S. Tsirkin Sept. 21, 2023, 7:34 p.m. UTC | #16
On Thu, Sep 21, 2023 at 03:16:37PM -0300, Jason Gunthorpe wrote:
> On Thu, Sep 21, 2023 at 01:55:42PM -0400, Michael S. Tsirkin wrote:
> 
> > That's not what I'm asking about though - not what shadow vq does,
> > shadow vq is a vdpa feature.
> 
> That's just VDPA then. We already talked about why VDPA is not a
> replacement for VFIO.

It does however work universally, by software, without any special
hardware support. Which is kind of why I am curious - if VDPA needs this
proxy code because shadow vq is slower then that's an argument for not
having it in two places, and trying to improve vdpa to use iommufd if
that's easy/practical.  If instead VDPA gives the same speed with just
shadow vq then keeping this hack in vfio seems like less of a problem.
Finally if VDPA is faster then maybe you will reconsider using it ;)
Jason Gunthorpe Sept. 21, 2023, 7:49 p.m. UTC | #17
On Thu, Sep 21, 2023 at 03:13:10PM -0400, Michael S. Tsirkin wrote:
> On Thu, Sep 21, 2023 at 03:39:26PM -0300, Jason Gunthorpe wrote:
> > On Thu, Sep 21, 2023 at 12:53:04PM -0400, Michael S. Tsirkin wrote:
> > > > vdpa is not vfio, I don't know how you can suggest vdpa is a
> > > > replacement for a vfio driver. They are completely different
> > > > things.
> > > > Each side has its own strengths, and vfio especially is accelerating
> > > > in its capability in way that vpda is not. eg if an iommufd conversion
> > > > had been done by now for vdpa I might be more sympathetic.
> > > 
> > > Yea, I agree iommufd is a big problem with vdpa right now. Cindy was
> > > sick and I didn't know and kept assuming she's working on this. I don't
> > > think it's a huge amount of work though.  I'll take a look.
> > > Is there anything else though? Do tell.
> > 
> > Confidential compute will never work with VDPA's approach.
> 
> I don't see how what this patchset is doing is different
> wrt to Confidential compute - you trap IO accesses and emulate.
> Care to elaborate?

This patch series isn't about confidential compute, you asked about
the future. VFIO will support confidential compute in the future, VDPA
will not.

> > > There are a bunch of things that I think are important for virtio
> > > that are completely out of scope for vfio, such as migrating
> > > cross-vendor. 
> > 
> > VFIO supports migration, if you want to have cross-vendor migration
> > then make a standard that describes the VFIO migration data format for
> > virtio devices.
> 
> This has nothing to do with data formats - you need two devices to
> behave identically. Which is what VDPA is about really.

We've been looking at VFIO live migration extensively. Device
mediation, like VDPA does, is one legitimate approach for live
migration. It suits a certain type of heterogeneous environment well.

But, it is equally legitimate to make the devices behave the same and
have them process common migration data.

This can happen in public with standards, or it can happen in private
within a cloud operator's "private-standard" environment.

To date, in most of my discussions, I have not seen a strong appetite
for such public standards. In part due to the complexity.

Regardless, it is not the kernel community's job to insist on one
approach or the other.

> > You are asking us to invest in the complexity of VDPA through out
> > (keep it working, keep it secure, invest time in deploying and
> > debugging in the field)
> > 
> > When it doesn't provide *ANY* value to the solution.
> 
> There's no "the solution"

Nonsense.

> this sounds like a vendor only caring about solutions that involve
> that vendor's hardware exclusively, a little.

Not really.

Understand the DPU provider is not the vendor here. The DPU provider
gives a cloud operator an SDK to build these things. The operator is
the vendor from your perspective.

In many cases live migration never leaves the operator's confines in
the first place.

Even when it does, there is no real use case to live migrate a
virtio-net function from, say, AWS to GCP.

You are pushing for a lot of complexity and software that solves a
problem people in this space don't actually have.

As I said, VDPA is fine for the scenarios it addresses. It is an
alternative, not a replacement, for VFIO.

Jason
Jason Gunthorpe Sept. 21, 2023, 7:51 p.m. UTC | #18
On Thu, Sep 21, 2023 at 03:17:25PM -0400, Michael S. Tsirkin wrote:
> On Thu, Sep 21, 2023 at 03:39:26PM -0300, Jason Gunthorpe wrote:
> > > What is the huge amount of work am I asking to do?
> > 
> > You are asking us to invest in the complexity of VDPA through out
> > (keep it working, keep it secure, invest time in deploying and
> > debugging in the field)
> 
> I'm asking you to do nothing of the kind - I am saying that this code
> will have to be duplicated in vdpa,

Why would that be needed?

> and so I am asking what exactly is missing to just keep it all
> there.

VFIO. Seriously, we don't want unnecessary mediation in this path at
all.

> note I didn't ask you to add iommufd to vdpa though that would be
> nice ;)

I did once send someone to look.. It didn't succeed :(

Jason
Jason Gunthorpe Sept. 21, 2023, 7:53 p.m. UTC | #19
On Thu, Sep 21, 2023 at 03:34:03PM -0400, Michael S. Tsirkin wrote:

> that's easy/practical.  If instead VDPA gives the same speed with just
> shadow vq then keeping this hack in vfio seems like less of a problem.
> Finally if VDPA is faster then maybe you will reconsider using it ;)

It is not all about the speed.

VDPA presents another large and complex software stack in the
hypervisor that can be eliminated by simply using VFIO. VFIO is
already required for other scenarios.

This is about reducing complexity, reducing attack surface and
increasing maintainability of the hypervisor environment.

Jason
Michael S. Tsirkin Sept. 21, 2023, 8:16 p.m. UTC | #20
On Thu, Sep 21, 2023 at 04:53:45PM -0300, Jason Gunthorpe wrote:
> On Thu, Sep 21, 2023 at 03:34:03PM -0400, Michael S. Tsirkin wrote:
> 
> > that's easy/practical.  If instead VDPA gives the same speed with just
> > shadow vq then keeping this hack in vfio seems like less of a problem.
> > Finally if VDPA is faster then maybe you will reconsider using it ;)
> 
> It is not all about the speed.
> 
> VDPA presents another large and complex software stack in the
> hypervisor that can be eliminated by simply using VFIO.

If all you want is passing through your card to guest
then yes this can be addressed "by simply using VFIO".

And let me give you a simple example just from this patchset:
it assumes the guest uses MSIX and just breaks if it doesn't.
As VDPA emulates, it can emulate INT#x for the guest while doing MSI
on the host side. Yea, modern guests use MSIX, but this is about legacy,
yes?


> VFIO is
> already required for other scenarios.

Required ... by some people? Most VMs I run don't use anything
outside of virtio.

> This is about reducing complexity, reducing attack surface and
> increasing maintainability of the hypervisor environment.
> 
> Jason

Generally you get better security if you don't let guests poke at
hardware when they don't have to. But sure, matter of preference -
use VFIO, it's great. I am worried about the specific patchset though.
It seems to deal with emulating virtio which seems more like a vdpa
thing. If you start adding virtio emulation to vfio then won't
you just end up with another vdpa? And if not, why not?
And I don't buy the "we already invested in this vfio based solution",
sorry - that's not a reason upstream has to maintain it.
Michael S. Tsirkin Sept. 21, 2023, 8:45 p.m. UTC | #21
On Thu, Sep 21, 2023 at 04:49:46PM -0300, Jason Gunthorpe wrote:
> On Thu, Sep 21, 2023 at 03:13:10PM -0400, Michael S. Tsirkin wrote:
> > On Thu, Sep 21, 2023 at 03:39:26PM -0300, Jason Gunthorpe wrote:
> > > On Thu, Sep 21, 2023 at 12:53:04PM -0400, Michael S. Tsirkin wrote:
> > > > > vdpa is not vfio, I don't know how you can suggest vdpa is a
> > > > > replacement for a vfio driver. They are completely different
> > > > > things.
> > > > > Each side has its own strengths, and vfio especially is accelerating
> > > > > in its capability in way that vpda is not. eg if an iommufd conversion
> > > > > had been done by now for vdpa I might be more sympathetic.
> > > > 
> > > > Yea, I agree iommufd is a big problem with vdpa right now. Cindy was
> > > > sick and I didn't know and kept assuming she's working on this. I don't
> > > > think it's a huge amount of work though.  I'll take a look.
> > > > Is there anything else though? Do tell.
> > > 
> > > Confidential compute will never work with VDPA's approach.
> > 
> > I don't see how what this patchset is doing is different
> > wrt to Confidential compute - you trap IO accesses and emulate.
> > Care to elaborate?
> 
> This patch series isn't about confidential compute, you asked about
> the future. VFIO will support confidential compute in the future, VDPA
> will not.

Nonsense it already works.

But I did not ask about the future since I do not believe it
can be confidently predicted. I asked what is missing in VDPA
now for you to add this feature there and not in VFIO.


> > > > There are a bunch of things that I think are important for virtio
> > > > that are completely out of scope for vfio, such as migrating
> > > > cross-vendor. 
> > > 
> > > VFIO supports migration, if you want to have cross-vendor migration
> > > then make a standard that describes the VFIO migration data format for
> > > virtio devices.
> > 
> > This has nothing to do with data formats - you need two devices to
> > behave identically. Which is what VDPA is about really.
> 
> We've been looking at VFIO live migration extensively. Device
> mediation, like VDPA does, is one legitimate approach for live
> migration. It suites a certain type of heterogeneous environment well.
> 
> But, it is equally legitimate to make the devices behave the same and
> have them process a common migration data.
> 
> This can happen in public with standards, or it can happen in private
> within a cloud operator's "private-standard" environment.
> 
> To date, in most of my discussions, I have not seen a strong appetite
> for such public standards. In part due to the complexity.
> 
> Regardles, it is not the kernel communities job to insist on one
> approach or the other.
>
> > > You are asking us to invest in the complexity of VDPA through out
> > > (keep it working, keep it secure, invest time in deploying and
> > > debugging in the field)
> > > 
> > > When it doesn't provide *ANY* value to the solution.
> > 
> > There's no "the solution"
> 
> Nonsense.

What - is there only one solution, that you use the definite article?

> > this sounds like a vendor only caring about solutions that involve
> > that vendor's hardware exclusively, a little.
> 
> Not really.
> 
> Understand the DPU provider is not the vendor here. The DPU provider
> gives a cloud operator a SDK to build these things. The operator is
> the vendor from your perspective.
> 
> In many cases live migration never leaves the operator's confines in
> the first place.
> 
> Even when it does, there is no real use case to live migrate a
> virtio-net function from, say, AWS to GCP.
> 
> You are pushing for a lot of complexity and software that solves a
> problem people in this space don't actually have.
> 
> As I said, VDPA is fine for the scenarios it addresses. It is an
> alternative, not a replacement, for VFIO.
> 
> Jason

Yea, VDPA does trap and emulate for config accesses - which is exactly
what this patch does?  So why it belongs in vfio, muddying up its
passthrough model, is beyond me, except that apparently there's some
specific deployment that happens to use vfio, so now whatever
that deployment needs has to go into vfio whether it belongs there or not.
Michael S. Tsirkin Sept. 21, 2023, 8:55 p.m. UTC | #22
On Thu, Sep 21, 2023 at 04:51:15PM -0300, Jason Gunthorpe wrote:
> On Thu, Sep 21, 2023 at 03:17:25PM -0400, Michael S. Tsirkin wrote:
> > On Thu, Sep 21, 2023 at 03:39:26PM -0300, Jason Gunthorpe wrote:
> > > > What is the huge amount of work am I asking to do?
> > > 
> > > You are asking us to invest in the complexity of VDPA through out
> > > (keep it working, keep it secure, invest time in deploying and
> > > debugging in the field)
> > 
> > I'm asking you to do nothing of the kind - I am saying that this code
> > will have to be duplicated in vdpa,
> 
> Why would that be needed?

For the same reason it was developed in the 1st place - presumably
because it adds efficient legacy guest support with the right card?
I get it, you specifically don't need VDPA functionality, but I don't
see why this is universal, or common.


> > and so I am asking what exactly is missing to just keep it all
> > there.
> 
> VFIO. Seriously, we don't want unnecessary mediation in this path at
> all.

But which mediation is necessary is exactly up to the specific use-case.
I have no idea why you would want all of VFIO to e.g. pass access to
random config registers to the guest when it's a virtio device and the
config registers are all nicely listed in the spec. I know nvidia
hardware is so great, it has super robust cards with fewer security holes
than the vdpa driver, but I very much doubt this is universal for all
virtio offload cards.

> > note I didn't ask you to add iommufd to vdpa though that would be
> > nice ;)
> 
> I did once send someone to look.. It didn't succeed :(
> 
> Jason

Pity. Maybe there's some big difficulty blocking this? I'd like to know.
Jason Gunthorpe Sept. 21, 2023, 10:48 p.m. UTC | #23
On Thu, Sep 21, 2023 at 04:16:25PM -0400, Michael S. Tsirkin wrote:
> On Thu, Sep 21, 2023 at 04:53:45PM -0300, Jason Gunthorpe wrote:
> > On Thu, Sep 21, 2023 at 03:34:03PM -0400, Michael S. Tsirkin wrote:
> > 
> > > that's easy/practical.  If instead VDPA gives the same speed with just
> > > shadow vq then keeping this hack in vfio seems like less of a problem.
> > > Finally if VDPA is faster then maybe you will reconsider using it ;)
> > 
> > It is not all about the speed.
> > 
> > VDPA presents another large and complex software stack in the
> > hypervisor that can be eliminated by simply using VFIO.
> 
> If all you want is passing through your card to guest
> then yes this can be addressed "by simply using VFIO".

That is pretty much the goal, yes.

> And let me give you a simple example just from this patchset:
> it assumes guest uses MSIX and just breaks if it doesn't.

It does? Really? Where did you see that?

> > VFIO is
> > already required for other scenarios.
> 
> Required ... by some people? Most VMs I run don't use anything
> outside of virtio.

Yes, some people. The sorts of people who run large data centers.

> It seems to deal with emulating virtio which seems more like a vdpa
> thing.

Alex described it right: it creates an SW-trapped IO BAR that relays
the doorbell to an admin queue command.
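
Roughly, the relay amounts to something like the sketch below. This is
illustrative only - virtio_admin_legacy_io_write() and the vf_id plumbing
are invented placeholder names, not the helpers from this series:

/* Sketch: a trapped guest write to the legacy doorbell register,
 * VIRTIO_PCI_QUEUE_NOTIFY, is forwarded to the owner PF as a legacy
 * common-config write admin command on behalf of the VF. */
static int relay_legacy_doorbell(struct virtio_device *pf_vdev, int vf_id,
                                 u16 queue_index)
{
        __le16 data = cpu_to_le16(queue_index);

        /* Placeholder helper: builds the admin command, queues it on the
         * PF admin virtqueue and waits for its completion. */
        return virtio_admin_legacy_io_write(pf_vdev, vf_id,
                                            VIRTIO_ADMIN_CMD_LEGACY_COMMON_CFG_WRITE,
                                            VIRTIO_PCI_QUEUE_NOTIFY,
                                            sizeof(data), (u8 *)&data);
}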

> If you start adding virtio emulation to vfio then won't
> you just end up with another vdpa? And if no why not?
> And I don't buy the "we already invested in this vfio based solution",
> sorry - that's not a reason upstream has to maintain it.

I think you would be well justified to object to actual mediation,
like processing queues in VFIO or otherwise complex things.

Fortunately there is no need to do that with DPU HW. The legacy IO BAR
is a weird quirk that just cannot be done without a software trap, and
the OASIS standardization effort was for exactly this kind of
simplistic transformation.

I also don't buy the "upstream has to maintain it" line. The team that
submitted it will maintain it just fine, thank you.

Jason
Jason Gunthorpe Sept. 21, 2023, 10:55 p.m. UTC | #24
On Thu, Sep 21, 2023 at 04:45:45PM -0400, Michael S. Tsirkin wrote:
> On Thu, Sep 21, 2023 at 04:49:46PM -0300, Jason Gunthorpe wrote:
> > On Thu, Sep 21, 2023 at 03:13:10PM -0400, Michael S. Tsirkin wrote:
> > > On Thu, Sep 21, 2023 at 03:39:26PM -0300, Jason Gunthorpe wrote:
> > > > On Thu, Sep 21, 2023 at 12:53:04PM -0400, Michael S. Tsirkin wrote:
> > > > > > vdpa is not vfio, I don't know how you can suggest vdpa is a
> > > > > > replacement for a vfio driver. They are completely different
> > > > > > things.
> > > > > > Each side has its own strengths, and vfio especially is accelerating
> > > > > > in its capability in way that vpda is not. eg if an iommufd conversion
> > > > > > had been done by now for vdpa I might be more sympathetic.
> > > > > 
> > > > > Yea, I agree iommufd is a big problem with vdpa right now. Cindy was
> > > > > sick and I didn't know and kept assuming she's working on this. I don't
> > > > > think it's a huge amount of work though.  I'll take a look.
> > > > > Is there anything else though? Do tell.
> > > > 
> > > > Confidential compute will never work with VDPA's approach.
> > > 
> > > I don't see how what this patchset is doing is different
> > > wrt to Confidential compute - you trap IO accesses and emulate.
> > > Care to elaborate?
> > 
> > This patch series isn't about confidential compute, you asked about
> > the future. VFIO will support confidential compute in the future, VDPA
> > will not.
> 
> Nonsense it already works.

That isn't what I'm talking about. With a real PCI function and TDISP
we can actually DMA directly from the guest's memory without needing
the ugly bounce buffer hack. Then you can get decent performance.

> But I did not ask about the future since I do not believe it
> can be confidently predicted. I asked what is missing in VDPA
> now for you to add this feature there and not in VFIO.

I don't see that VDPA needs this, VDPA should process the IO BAR on
its own with its own logic, just like everything else it does.

This is specifically about avoiding mediation by relaying the IO BAR
operations directly to the device itself.

That is the entire irony, this whole scheme was designed and
standardized *specifically* to avoid complex mediation and here you
are saying we should just use mediation.

Jason
Jason Gunthorpe Sept. 21, 2023, 11:08 p.m. UTC | #25
On Thu, Sep 21, 2023 at 04:55:05PM -0400, Michael S. Tsirkin wrote:
> But which mediation is necessary is exactly up to the specific use-case.
> I have no idea why would you want all of VFIO to e.g. pass access to
> random config registers to the guest when it's a virtio device and the
> config registers are all nicely listed in the spec. I know nvidia
> hardware is so great, it has super robust cards with less security holes
> than the vdpa driver, but I very much doubt this is universal for all
> virtio offload cards.

The great thing about choice is that people can choose the
configuration that best meets their situation and needs.

Jason
Jason Wang Sept. 22, 2023, 3:01 a.m. UTC | #26
On Fri, Sep 22, 2023 at 3:49 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Thu, Sep 21, 2023 at 03:13:10PM -0400, Michael S. Tsirkin wrote:
> > On Thu, Sep 21, 2023 at 03:39:26PM -0300, Jason Gunthorpe wrote:
> > > On Thu, Sep 21, 2023 at 12:53:04PM -0400, Michael S. Tsirkin wrote:
> > > > > vdpa is not vfio, I don't know how you can suggest vdpa is a
> > > > > replacement for a vfio driver. They are completely different
> > > > > things.
> > > > > Each side has its own strengths, and vfio especially is accelerating
> > > > > in its capability in way that vpda is not. eg if an iommufd conversion
> > > > > had been done by now for vdpa I might be more sympathetic.
> > > >
> > > > Yea, I agree iommufd is a big problem with vdpa right now. Cindy was
> > > > sick and I didn't know and kept assuming she's working on this. I don't
> > > > think it's a huge amount of work though.  I'll take a look.
> > > > Is there anything else though? Do tell.
> > >
> > > Confidential compute will never work with VDPA's approach.
> >
> > I don't see how what this patchset is doing is different
> > wrt to Confidential compute - you trap IO accesses and emulate.
> > Care to elaborate?
>
> This patch series isn't about confidential compute, you asked about
> the future. VFIO will support confidential compute in the future, VDPA
> will not.
>
> > > > There are a bunch of things that I think are important for virtio
> > > > that are completely out of scope for vfio, such as migrating
> > > > cross-vendor.
> > >
> > > VFIO supports migration, if you want to have cross-vendor migration
> > > then make a standard that describes the VFIO migration data format for
> > > virtio devices.
> >
> > This has nothing to do with data formats - you need two devices to
> > behave identically. Which is what VDPA is about really.
>
> We've been looking at VFIO live migration extensively. Device
> mediation, like VDPA does, is one legitimate approach for live
> migration. It suites a certain type of heterogeneous environment well.
>
> But, it is equally legitimate to make the devices behave the same and
> have them process a common migration data.
>
> This can happen in public with standards, or it can happen in private
> within a cloud operator's "private-standard" environment.
>
> To date, in most of my discussions, I have not seen a strong appetite
> for such public standards. In part due to the complexity.
>
> Regardles, it is not the kernel communities job to insist on one
> approach or the other.
>
> > > You are asking us to invest in the complexity of VDPA through out
> > > (keep it working, keep it secure, invest time in deploying and
> > > debugging in the field)
> > >
> > > When it doesn't provide *ANY* value to the solution.
> >
> > There's no "the solution"
>
> Nonsense.
>
> > this sounds like a vendor only caring about solutions that involve
> > that vendor's hardware exclusively, a little.
>
> Not really.
>
> Understand the DPU provider is not the vendor here. The DPU provider
> gives a cloud operator a SDK to build these things. The operator is
> the vendor from your perspective.
>
> In many cases live migration never leaves the operator's confines in
> the first place.
>
> Even when it does, there is no real use case to live migrate a
> virtio-net function from, say, AWS to GCP.

It can happen inside a single cloud vendor. For various reasons, DPUs must
be purchased from different vendors. And vDPA has been used in that
case.

I've asked them to present this, probably somewhere like KVM Forum.

>
> You are pushing for a lot of complexity and software that solves a
> problem people in this space don't actually have.
>
> As I said, VDPA is fine for the scenarios it addresses. It is an
> alternative, not a replacement, for VFIO.

We never tried to replace VFIO. I don't see any problem with just using
the current VFIO to assign a virtio-pci device to the guest.

The problem is the mediation (or what you called relaying) layer
you've invented.

Thanks

>
> Jason
>
Jason Wang Sept. 22, 2023, 3:02 a.m. UTC | #27
On Fri, Sep 22, 2023 at 4:16 AM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Thu, Sep 21, 2023 at 04:53:45PM -0300, Jason Gunthorpe wrote:
> > On Thu, Sep 21, 2023 at 03:34:03PM -0400, Michael S. Tsirkin wrote:
> >
> > > that's easy/practical.  If instead VDPA gives the same speed with just
> > > shadow vq then keeping this hack in vfio seems like less of a problem.
> > > Finally if VDPA is faster then maybe you will reconsider using it ;)
> >
> > It is not all about the speed.
> >
> > VDPA presents another large and complex software stack in the
> > hypervisor that can be eliminated by simply using VFIO.
>
> If all you want is passing through your card to guest
> then yes this can be addressed "by simply using VFIO".

+1.

And what's more, using MMIO BAR0 then it can work for legacy.

I have handy virtio hardware from one vendor that works like this,
and I see it is done by a lot of other vendors.

Thanks
Jason Wang Sept. 22, 2023, 3:02 a.m. UTC | #28
On Fri, Sep 22, 2023 at 6:55 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Thu, Sep 21, 2023 at 04:45:45PM -0400, Michael S. Tsirkin wrote:
> > On Thu, Sep 21, 2023 at 04:49:46PM -0300, Jason Gunthorpe wrote:
> > > On Thu, Sep 21, 2023 at 03:13:10PM -0400, Michael S. Tsirkin wrote:
> > > > On Thu, Sep 21, 2023 at 03:39:26PM -0300, Jason Gunthorpe wrote:
> > > > > On Thu, Sep 21, 2023 at 12:53:04PM -0400, Michael S. Tsirkin wrote:
> > > > > > > vdpa is not vfio, I don't know how you can suggest vdpa is a
> > > > > > > replacement for a vfio driver. They are completely different
> > > > > > > things.
> > > > > > > Each side has its own strengths, and vfio especially is accelerating
> > > > > > > in its capability in way that vpda is not. eg if an iommufd conversion
> > > > > > > had been done by now for vdpa I might be more sympathetic.
> > > > > >
> > > > > > Yea, I agree iommufd is a big problem with vdpa right now. Cindy was
> > > > > > sick and I didn't know and kept assuming she's working on this. I don't
> > > > > > think it's a huge amount of work though.  I'll take a look.
> > > > > > Is there anything else though? Do tell.
> > > > >
> > > > > Confidential compute will never work with VDPA's approach.
> > > >
> > > > I don't see how what this patchset is doing is different
> > > > wrt to Confidential compute - you trap IO accesses and emulate.
> > > > Care to elaborate?
> > >
> > > This patch series isn't about confidential compute, you asked about
> > > the future. VFIO will support confidential compute in the future, VDPA
> > > will not.

What blocks vDPA from supporting that?

> >
> > Nonsense it already works.
>
> That isn't what I'm talking about. With a real PCI function and TDISP
> we can actually DMA directly from the guest's memory without needing
> the ugly bounce buffer hack. Then you can get decent performance.

This series requires trapping the legacy I/O BAR in VFIO. Why
can TDISP work when trapping in VFIO but not in vDPA? If neither, how is
TDISP relevant here?

>
> > But I did not ask about the future since I do not believe it
> > can be confidently predicted. I asked what is missing in VDPA
> > now for you to add this feature there and not in VFIO.
>
> I don't see that VDPA needs this, VDPA should process the IO BAR on
> its own with its own logic, just like everything else it does.
>
> This is specifically about avoiding mediation by relaying directly the
> IO BAR operations to the device itself.

So we had:

1) a new virtio specific driver for VFIO
2) the existing vp_vdpa driver

How much difference is there between them in the context of mediation or
relaying? Or is it hard to introduce admin commands on the vDPA bus?

> That is the entire irony, this whole scheme was designed and
> standardized *specifically* to avoid complex mediation and here you
> are saying we should just use mediation.

No, using "simple VFIO passthrough" is just fine.

Thanks

>
> Jason
>
Jason Wang Sept. 22, 2023, 3:02 a.m. UTC | #29
On Fri, Sep 22, 2023 at 3:53 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Thu, Sep 21, 2023 at 03:34:03PM -0400, Michael S. Tsirkin wrote:
>
> > that's easy/practical.  If instead VDPA gives the same speed with just
> > shadow vq then keeping this hack in vfio seems like less of a problem.
> > Finally if VDPA is faster then maybe you will reconsider using it ;)
>
> It is not all about the speed.
>
> VDPA presents another large and complex software stack in the
> hypervisor that can be eliminated by simply using VFIO.

vDPA supports standard virtio devices, so how do you define complexity?

From the application's point of view, what it wants is a simple virtio
device, not a virtio-pci device. That is what vDPA tries to present.

Simply counting LOCs, vdpa + vhost + vp_vdpa is much less code than
what VFIO has. It's not hard to expect that it will still be much less
even once iommufd support is done.

Thanks



> VFIO is
> already required for other scenarios.
>
> This is about reducing complexity, reducing attack surface and
> increasing maintainability of the hypervisor environment.
>
> Jason
>
>
Zhu, Lingshan Sept. 22, 2023, 3:45 a.m. UTC | #30
On 9/22/2023 2:39 AM, Jason Gunthorpe wrote:
> On Thu, Sep 21, 2023 at 12:53:04PM -0400, Michael S. Tsirkin wrote:
>>> vdpa is not vfio, I don't know how you can suggest vdpa is a
>>> replacement for a vfio driver. They are completely different
>>> things.
>>> Each side has its own strengths, and vfio especially is accelerating
>>> in its capability in way that vpda is not. eg if an iommufd conversion
>>> had been done by now for vdpa I might be more sympathetic.
>> Yea, I agree iommufd is a big problem with vdpa right now. Cindy was
>> sick and I didn't know and kept assuming she's working on this. I don't
>> think it's a huge amount of work though.  I'll take a look.
>> Is there anything else though? Do tell.
> Confidential compute will never work with VDPA's approach.
I don't understand why vDPA cannot, and never will, support confidential
computing.

Do you see any blockers?
>
>> There are a bunch of things that I think are important for virtio
>> that are completely out of scope for vfio, such as migrating
>> cross-vendor.
> VFIO supports migration, if you want to have cross-vendor migration
> then make a standard that describes the VFIO migration data format for
> virtio devices.
>
>> What is the huge amount of work am I asking to do?
> You are asking us to invest in the complexity of VDPA through out
> (keep it working, keep it secure, invest time in deploying and
> debugging in the field)
>
> When it doesn't provide *ANY* value to the solution.
>
> The starting point is a completely working vfio PCI function and the
> end goal is to put that function into a VM. That is VFIO, not VDPA.
>
> VPDA is fine for what it does, but it is not a reasonable replacement
> for VFIO.
>
> Jason
Michael S. Tsirkin Sept. 22, 2023, 9:47 a.m. UTC | #31
On Thu, Sep 21, 2023 at 07:48:36PM -0300, Jason Gunthorpe wrote:
> On Thu, Sep 21, 2023 at 04:16:25PM -0400, Michael S. Tsirkin wrote:
> > On Thu, Sep 21, 2023 at 04:53:45PM -0300, Jason Gunthorpe wrote:
> > > On Thu, Sep 21, 2023 at 03:34:03PM -0400, Michael S. Tsirkin wrote:
> > > 
> > > > that's easy/practical.  If instead VDPA gives the same speed with just
> > > > shadow vq then keeping this hack in vfio seems like less of a problem.
> > > > Finally if VDPA is faster then maybe you will reconsider using it ;)
> > > 
> > > It is not all about the speed.
> > > 
> > > VDPA presents another large and complex software stack in the
> > > hypervisor that can be eliminated by simply using VFIO.
> > 
> > If all you want is passing through your card to guest
> > then yes this can be addressed "by simply using VFIO".
> 
> That is pretty much the goal, yes.
> 
> > And let me give you a simple example just from this patchset:
> > it assumes guest uses MSIX and just breaks if it doesn't.
> 
> It does? Really? Where did you see that?

This thing apparently:

+               opcode = (pos < VIRTIO_PCI_CONFIG_OFF(true)) ?
+                       VIRTIO_ADMIN_CMD_LEGACY_COMMON_CFG_READ :
+                       VIRTIO_ADMIN_CMD_LEGACY_DEV_CFG_READ;

That "true" is supposed to be whether guest enabled MSI or not.


> > > VFIO is
> > > already required for other scenarios.
> > 
> > Required ... by some people? Most VMs I run don't use anything
> > outside of virtio.
> 
> Yes, some people. The sorts of people who run large data centers.
>
> > It seems to deal with emulating virtio which seems more like a vdpa
> > thing.
> 
> Alex described it right, it creates an SW trapped IO bar that relays
> the doorbell to an admin queue command.
> 
> > If you start adding virtio emulation to vfio then won't
> > you just end up with another vdpa? And if no why not?
> > And I don't buy the "we already invested in this vfio based solution",
> > sorry - that's not a reason upstream has to maintain it.
> 
> I think you would be well justified to object to actual mediation,
> like processing queues in VFIO or otherwise complex things.

This mediation is kind of smallish, I agree. Not completely devoid of
logic though.

> Fortunately there is no need to do that with DPU HW. The legacy IO BAR
> is a weird quirk that just cannot be done without a software trap, and
> the OASIS standardization effort was for exactly this kind of
> simplistic transformation.
> 
> I also don't buy the "upstream has to maintain it" line. The team that
> submitted it will maintain it just fine, thank you.

It will require maintenance effort when virtio changes are made.  For
example, it pokes at the device state - I don't see specific races right
now, but in the past we did e.g. reset the device to recover from errors,
and we might start doing it again.

If more of the logic were under the virtio directory, where we'll remember
to keep it in the loop and will be able to reuse it from vdpa
down the road, I would be more sympathetic.
Michael S. Tsirkin Sept. 22, 2023, 11:23 a.m. UTC | #32
On Thu, Sep 21, 2023 at 07:55:26PM -0300, Jason Gunthorpe wrote:
> On Thu, Sep 21, 2023 at 04:45:45PM -0400, Michael S. Tsirkin wrote:
> > On Thu, Sep 21, 2023 at 04:49:46PM -0300, Jason Gunthorpe wrote:
> > > On Thu, Sep 21, 2023 at 03:13:10PM -0400, Michael S. Tsirkin wrote:
> > > > On Thu, Sep 21, 2023 at 03:39:26PM -0300, Jason Gunthorpe wrote:
> > > > > On Thu, Sep 21, 2023 at 12:53:04PM -0400, Michael S. Tsirkin wrote:
> > > > > > > vdpa is not vfio, I don't know how you can suggest vdpa is a
> > > > > > > replacement for a vfio driver. They are completely different
> > > > > > > things.
> > > > > > > Each side has its own strengths, and vfio especially is accelerating
> > > > > > > in its capability in way that vpda is not. eg if an iommufd conversion
> > > > > > > had been done by now for vdpa I might be more sympathetic.
> > > > > > 
> > > > > > Yea, I agree iommufd is a big problem with vdpa right now. Cindy was
> > > > > > sick and I didn't know and kept assuming she's working on this. I don't
> > > > > > think it's a huge amount of work though.  I'll take a look.
> > > > > > Is there anything else though? Do tell.
> > > > > 
> > > > > Confidential compute will never work with VDPA's approach.
> > > > 
> > > > I don't see how what this patchset is doing is different
> > > > wrt to Confidential compute - you trap IO accesses and emulate.
> > > > Care to elaborate?
> > > 
> > > This patch series isn't about confidential compute, you asked about
> > > the future. VFIO will support confidential compute in the future, VDPA
> > > will not.
> > 
> > Nonsense it already works.
> 
> That isn't what I'm talking about. With a real PCI function and TDISP
> we can actually DMA directly from the guest's memory without needing
> the ugly bounce buffer hack. Then you can get decent performance.

Aha, TDISP.  But that one clearly does not need, and cannot use,
this kind of hack?

> > But I did not ask about the future since I do not believe it
> > can be confidently predicted. I asked what is missing in VDPA
> > now for you to add this feature there and not in VFIO.
> 
> I don't see that VDPA needs this, VDPA should process the IO BAR on
> its own with its own logic, just like everything else it does.

First, there's some logic here, such as translating legacy IO
offsets to modern ones, that could be reused.

But also, this is not just the IO BAR - that indeed can be easily done in
software.  When a device operates in legacy mode there are subtle
differences from modern mode, such as a different header size for the net
device.
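
As a concrete example of that difference, this is roughly the header-size
check virtio_net does at probe time (shown here as a stand-alone sketch):

/* A legacy device without VIRTIO_NET_F_MRG_RXBUF uses the short 10-byte
 * virtio_net_hdr; MRG_RXBUF or VIRTIO_F_VERSION_1 (i.e. a modern device)
 * implies the 12-byte virtio_net_hdr_mrg_rxbuf layout. */
#include <linux/virtio_config.h>
#include <linux/virtio_net.h>

static size_t net_device_hdr_len(struct virtio_device *vdev)
{
        if (virtio_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF) ||
            virtio_has_feature(vdev, VIRTIO_F_VERSION_1))
                return sizeof(struct virtio_net_hdr_mrg_rxbuf); /* 12 bytes */
        return sizeof(struct virtio_net_hdr);                   /* 10 bytes */
}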

> This is specifically about avoiding mediation by relaying directly the
> IO BAR operations to the device itself.
> 
> That is the entire irony, this whole scheme was designed and
> standardized *specifically* to avoid complex mediation and here you
> are saying we should just use mediation.
> 
> Jason

Not exactly. What I had in mind is just having the logic in
the vdpa module so users don't need to know what the device
supports and what it doesn't. If we can, we bypass mediation
(to simplify the software stack); if we cannot, we do not.

Looking at it from the user's POV, it is just super confusing that
card ABC would need to be used with VDPA to drive legacy while
card DEF needs to be used with VFIO. And both VFIO and VDPA
will happily bind, too. Oh man ...
Jason Gunthorpe Sept. 22, 2023, 12:11 p.m. UTC | #33
On Fri, Sep 22, 2023 at 11:01:23AM +0800, Jason Wang wrote:

> > Even when it does, there is no real use case to live migrate a
> > virtio-net function from, say, AWS to GCP.
> 
> It can happen inside a single cloud vendor. For some reasons, DPU must
> be purchased from different vendors. And vDPA has been used in that
> case.

Nope, you misunderstand the DPU scenario.

Look at something like vmware DPU enablement. vmware runs the software
side of the DPU and all their supported DPU HW, from every vendor,
generates the same PCI functions on the x86. They are the same because
the same software on the DPU side is creating them.

There is no reason to put a mediation layer in the x86 if you also
control the DPU.

Cloud vendors will similarly use DPUs to create PCI functions that
meet the cloud vendor's internal specification. Regardless of DPU
vendor.

Fundamentally, if you control the DPU SW and the hypervisor software
you do not need hypervisor mediation, because everything you could do
in hypervisor mediation can just be done in the DPU. Putting it in the
DPU is better in every regard.

So, as I keep saying, in this scenario the goal is no mediation in the
hypervisor. It is pointless, everything you think you need to do there
is actually already being done in the DPU.

Once you commit to this configuration you are committed to VFIO in the
hypervisor. eg your DPU is likely also making NVMe and other PCI
functions too.

> The problem is the mediation (or what you called relaying) layer
> you've invented.

It is not mediation; it is implementing the OASIS spec for VFIO
support of the IO BAR.

Jason
Jason Gunthorpe Sept. 22, 2023, 12:15 p.m. UTC | #34
On Fri, Sep 22, 2023 at 07:23:28AM -0400, Michael S. Tsirkin wrote:
> On Thu, Sep 21, 2023 at 07:55:26PM -0300, Jason Gunthorpe wrote:

> Looking at it from user's POV, it is just super confusing that
> card ABC would need to be used with VDPA to drive legacy while
> card DEF needs to be used with VFIO. And both VFIO and VDPA
> will happily bind, too. Oh man ...

It is standard VFIO stuff. If you don't attach vfio then you get a
normal kernel virtio-net driver. If you turn that into VDPA then you get
that.

If you attach VFIO to the PCI function then you get VFIO.

There is nothing special here, we have good infrastructure to support
doing this already.

The user gets to pick. I don't understand why you think the kernel side
should deny this choice.

Jason
Jason Gunthorpe Sept. 22, 2023, 12:22 p.m. UTC | #35
On Fri, Sep 22, 2023 at 11:02:21AM +0800, Jason Wang wrote:

> And what's more, using MMIO BAR0 then it can work for legacy.

Oh? How? Our team didn't think so.

Jason
Jason Gunthorpe Sept. 22, 2023, 12:23 p.m. UTC | #36
On Fri, Sep 22, 2023 at 05:47:23AM -0400, Michael S. Tsirkin wrote:

> it will require maintainance effort when virtio changes are made.  For
> example it pokes at the device state - I don't see specific races right
> now but in the past we did e.g. reset the device to recover from errors
> and we might start doing it again.
> 
> If more of the logic is under virtio directory where we'll remember
> to keep it in loop, and will be able to reuse it from vdpa
> down the road, I would be more sympathetic.

This is inevitable, the VFIO live migration driver will need all this
infrastructure too.

Jason
Jason Gunthorpe Sept. 22, 2023, 12:25 p.m. UTC | #37
On Fri, Sep 22, 2023 at 11:02:50AM +0800, Jason Wang wrote:
> On Fri, Sep 22, 2023 at 3:53 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
> >
> > On Thu, Sep 21, 2023 at 03:34:03PM -0400, Michael S. Tsirkin wrote:
> >
> > > that's easy/practical.  If instead VDPA gives the same speed with just
> > > shadow vq then keeping this hack in vfio seems like less of a problem.
> > > Finally if VDPA is faster then maybe you will reconsider using it ;)
> >
> > It is not all about the speed.
> >
> > VDPA presents another large and complex software stack in the
> > hypervisor that can be eliminated by simply using VFIO.
> 
> vDPA supports standard virtio devices so how did you define
> complexity?

As I said, VFIO is already required for other devices in these VMs. So
anything incremental over baseline vfio-pci is complexity to
minimize.

Everything vdpa does is either redundant or unnecessary compared to
VFIO in these environments.

Jason
Parav Pandit Sept. 22, 2023, 12:25 p.m. UTC | #38
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Friday, September 22, 2023 5:53 PM


> > And what's more, using MMIO BAR0 then it can work for legacy.
> 
> Oh? How? Our team didn't think so.

It does not. It was already discussed.
The device reset in legacy is not synchronous.
The drivers do not wait for reset to complete; it was written for the sw backend.
Hence MMIO BAR0 is not the best option in real implementations.
Michael S. Tsirkin Sept. 22, 2023, 3:13 p.m. UTC | #39
On Fri, Sep 22, 2023 at 12:25:06PM +0000, Parav Pandit wrote:
> 
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Friday, September 22, 2023 5:53 PM
> 
> 
> > > And what's more, using MMIO BAR0 then it can work for legacy.
> > 
> > Oh? How? Our team didn't think so.
> 
> It does not. It was already discussed.
> The device reset in legacy is not synchronous.
> The drivers do not wait for reset to complete; it was written for the sw backend.
> Hence MMIO BAR0 is not the best option in real implementations.

Or maybe they made it synchronous in hardware, that's all.
After all, the same is true for the IO BAR0, e.g. for the PF: IO writes are posted anyway.

Whether that's possible would depend on the hardware architecture.
Jason Gunthorpe Sept. 22, 2023, 3:15 p.m. UTC | #40
On Fri, Sep 22, 2023 at 11:13:18AM -0400, Michael S. Tsirkin wrote:
> On Fri, Sep 22, 2023 at 12:25:06PM +0000, Parav Pandit wrote:
> > 
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Friday, September 22, 2023 5:53 PM
> > 
> > 
> > > > And what's more, using MMIO BAR0 then it can work for legacy.
> > > 
> > > Oh? How? Our team didn't think so.
> > 
> > It does not. It was already discussed.
> > The device reset in legacy is not synchronous.
> > The drivers do not wait for reset to complete; it was written for the sw backend.
> > Hence MMIO BAR0 is not the best option in real implementations.
> 
> Or maybe they made it synchronous in hardware, that's all.
> After all same is true for the IO BAR0 e.g. for the PF: IO writes
> are posted anyway.

IO writes are not posted in PCI.

Jason
Michael S. Tsirkin Sept. 22, 2023, 3:39 p.m. UTC | #41
On Fri, Sep 22, 2023 at 09:25:01AM -0300, Jason Gunthorpe wrote:
> On Fri, Sep 22, 2023 at 11:02:50AM +0800, Jason Wang wrote:
> > On Fri, Sep 22, 2023 at 3:53 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
> > >
> > > On Thu, Sep 21, 2023 at 03:34:03PM -0400, Michael S. Tsirkin wrote:
> > >
> > > > that's easy/practical.  If instead VDPA gives the same speed with just
> > > > shadow vq then keeping this hack in vfio seems like less of a problem.
> > > > Finally if VDPA is faster then maybe you will reconsider using it ;)
> > >
> > > It is not all about the speed.
> > >
> > > VDPA presents another large and complex software stack in the
> > > hypervisor that can be eliminated by simply using VFIO.
> > 
> > vDPA supports standard virtio devices so how did you define
> > complexity?
> 
> As I said, VFIO is already required for other devices in these VMs. So
> anything incremental over base-line vfio-pci is complexity to
> minimize.
> 
> Everything vdpa does is either redundant or unnecessary compared to
> VFIO in these environments.
> 
> Jason

Yes but you know. There are all kinds of environments.  I guess you
consider yours the most mainstream and important, and are sure it will
always stay like this.  But if there's a driver that does what you need
then you use that. You really should be explaining what vdpa
*does not* do that you need.

But anyway, if Alex wants to maintain this it's not too bad,
but I would like to see more code move into a library
living under the virtio directory. As it is structured now
it will make virtio core development harder.
Michael S. Tsirkin Sept. 22, 2023, 3:40 p.m. UTC | #42
On Fri, Sep 22, 2023 at 12:15:34PM -0300, Jason Gunthorpe wrote:
> On Fri, Sep 22, 2023 at 11:13:18AM -0400, Michael S. Tsirkin wrote:
> > On Fri, Sep 22, 2023 at 12:25:06PM +0000, Parav Pandit wrote:
> > > 
> > > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > > Sent: Friday, September 22, 2023 5:53 PM
> > > 
> > > 
> > > > > And what's more, using MMIO BAR0 then it can work for legacy.
> > > > 
> > > > Oh? How? Our team didn't think so.
> > > 
> > > It does not. It was already discussed.
> > > The device reset in legacy is not synchronous.
> > > The drivers do not wait for reset to complete; it was written for the sw backend.
> > > Hence MMIO BAR0 is not the best option in real implementations.
> > 
> > Or maybe they made it synchronous in hardware, that's all.
> > After all same is true for the IO BAR0 e.g. for the PF: IO writes
> > are posted anyway.
> 
> IO writes are not posted in PCI.

Aha, I was confused. Thanks for the correction. I guess you just buffer
subsequent transactions while reset is going on and reset quickly enough
for it to be seamless then?
Michael S. Tsirkin Sept. 22, 2023, 3:45 p.m. UTC | #43
On Fri, Sep 22, 2023 at 09:23:28AM -0300, Jason Gunthorpe wrote:
> On Fri, Sep 22, 2023 at 05:47:23AM -0400, Michael S. Tsirkin wrote:
> 
> > it will require maintainance effort when virtio changes are made.  For
> > example it pokes at the device state - I don't see specific races right
> > now but in the past we did e.g. reset the device to recover from errors
> > and we might start doing it again.
> > 
> > If more of the logic is under virtio directory where we'll remember
> > to keep it in loop, and will be able to reuse it from vdpa
> > down the road, I would be more sympathetic.
> 
> This is inevitable, the VFIO live migration driver will need all this
> infrastructure too.
> 
> Jason
>  

I am not sure what you are saying and what is inevitable.
VDPA for sure will want live migration support.  I am not at all
sympathetic to efforts that want to duplicate that support for virtio
under VFIO. Put it in a library under the virtio directory,
with a sane, well-documented interface.
I don't maintain VFIO and Alex can merge what he wants,
but I won't merge patches that export virtio internals in a way
that will make virtio maintenance harder.
Jason Gunthorpe Sept. 22, 2023, 4:19 p.m. UTC | #44
On Fri, Sep 22, 2023 at 11:39:19AM -0400, Michael S. Tsirkin wrote:
> On Fri, Sep 22, 2023 at 09:25:01AM -0300, Jason Gunthorpe wrote:
> > On Fri, Sep 22, 2023 at 11:02:50AM +0800, Jason Wang wrote:
> > > On Fri, Sep 22, 2023 at 3:53 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
> > > >
> > > > On Thu, Sep 21, 2023 at 03:34:03PM -0400, Michael S. Tsirkin wrote:
> > > >
> > > > > that's easy/practical.  If instead VDPA gives the same speed with just
> > > > > shadow vq then keeping this hack in vfio seems like less of a problem.
> > > > > Finally if VDPA is faster then maybe you will reconsider using it ;)
> > > >
> > > > It is not all about the speed.
> > > >
> > > > VDPA presents another large and complex software stack in the
> > > > hypervisor that can be eliminated by simply using VFIO.
> > > 
> > > vDPA supports standard virtio devices so how did you define
> > > complexity?
> > 
> > As I said, VFIO is already required for other devices in these VMs. So
> > anything incremental over base-line vfio-pci is complexity to
> > minimize.
> > 
> > Everything vdpa does is either redundant or unnecessary compared to
> > VFIO in these environments.
> > 
> > Jason
> 
> Yes but you know. There are all kind of environments.  I guess you
> consider yours the most mainstream and important, and are sure it will
> always stay like this.  But if there's a driver that does what you need
> then you use that.

Come on, you are the one saying we cannot do things in the best way
possible because you want your way of doing things to be the only way
allowed. Which of us thinks "yours the most mainstream and important" ??

I'm not telling you to throw away VDPA, I'm saying there are
legitimate real-world use cases where VFIO is the appropriate
interface, not VDPA.

I want choice, not dogmatic exclusion that there is Only One True Way.

> You really should be explaining what vdpa *does not* do that you
> need.

I think I've done that enough, but if you have been following my
explanation you should see that the entire point of this design is to
allow a virtio device to be created inside a DPU to a specific,
detailed specification (an AWS virtio-net device, for instance).

The implementation is in the DPU, and only the DPU.

At the end of the day VDPA uses mediation and creates some
RedHat/VDPA/Qemu virtio-net device in the guest. It is emphatically
NOT a perfect recreation of the "AWS virtio-net" we started out with.

It entirely fails to achieve the most important thing it needs to do!

Yishai will rework the series with your remarks; we can look again at
v2. Thanks for all the input!

Jason
Jason Gunthorpe Sept. 22, 2023, 4:22 p.m. UTC | #45
On Fri, Sep 22, 2023 at 11:40:58AM -0400, Michael S. Tsirkin wrote:
> On Fri, Sep 22, 2023 at 12:15:34PM -0300, Jason Gunthorpe wrote:
> > On Fri, Sep 22, 2023 at 11:13:18AM -0400, Michael S. Tsirkin wrote:
> > > On Fri, Sep 22, 2023 at 12:25:06PM +0000, Parav Pandit wrote:
> > > > 
> > > > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > > > Sent: Friday, September 22, 2023 5:53 PM
> > > > 
> > > > 
> > > > > > And what's more, using MMIO BAR0 then it can work for legacy.
> > > > > 
> > > > > Oh? How? Our team didn't think so.
> > > > 
> > > > It does not. It was already discussed.
> > > > The device reset in legacy is not synchronous.
> > > > The drivers do not wait for reset to complete; it was written for the sw backend.
> > > > Hence MMIO BAR0 is not the best option in real implementations.
> > > 
> > > Or maybe they made it synchronous in hardware, that's all.
> > > After all same is true for the IO BAR0 e.g. for the PF: IO writes
> > > are posted anyway.
> > 
> > IO writes are not posted in PCI.
> 
> Aha, I was confused. Thanks for the correction. I guess you just buffer
> subsequent transactions while reset is going on and reset quickly enough
> for it to be seemless then?

From a hardware perspective the CPU issues a non-posted IO write and
then it stops processing until the far side returns an IO completion.

Using that you can emulate what the SW virtio model did and delay the
CPU from restarting until the reset is completed.

Since MMIO is always posted, this is not possible to emulate directly
using MMIO.

Converting IO into non-posted admin commands is a fairly close
recreation of what actual HW would do.
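
As a sketch of what that means for the reset path (the helper name below is
a placeholder, not this series' actual function): the trap handler simply
does not return until the admin command completes, which is exactly the
stall a non-posted IO write produces on real hardware.

/* Sketch: emulate the non-posted semantics of a legacy IO reset.  The
 * guest's vCPU is held in the trap until the owner device reports that
 * the VF reset has finished, just as a bare-metal CPU would be held
 * waiting for the IO completion. */
static int emulate_legacy_reset(struct virtio_device *pf_vdev, int vf_id)
{
        u8 status = 0;  /* writing 0 to the legacy status register resets */

        /* Synchronous admin command: it does not complete until the device
         * has actually finished resetting the VF. */
        return virtio_admin_legacy_io_write(pf_vdev, vf_id,
                                            VIRTIO_ADMIN_CMD_LEGACY_COMMON_CFG_WRITE,
                                            VIRTIO_PCI_STATUS,
                                            sizeof(status), &status);
}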

Jason
Jason Wang Sept. 25, 2023, 2:30 a.m. UTC | #46
On Fri, Sep 22, 2023 at 8:25 PM Parav Pandit <parav@nvidia.com> wrote:
>
>
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Friday, September 22, 2023 5:53 PM
>
>
> > > And what's more, using MMIO BAR0 then it can work for legacy.
> >
> > Oh? How? Our team didn't think so.
>
> It does not. It was already discussed.
> The device reset in legacy is not synchronous.

How do you know this?

> The drivers do not wait for reset to complete; it was written for the sw backend.

Do you see there's a flush after reset in the legacy driver?

static void vp_reset(struct virtio_device *vdev)
{
        struct virtio_pci_device *vp_dev = to_vp_device(vdev);
        /* 0 status means a reset. */
        vp_legacy_set_status(&vp_dev->ldev, 0);
        /* Flush out the status write, and flush in device writes,
         * including MSi-X interrupts, if any. */
        vp_legacy_get_status(&vp_dev->ldev);
        /* Flush pending VQ/configuration callbacks. */
        vp_synchronize_vectors(vdev);
}

Thanks



> Hence MMIO BAR0 is not the best option in real implementations.
>
Jason Wang Sept. 25, 2023, 2:34 a.m. UTC | #47
On Fri, Sep 22, 2023 at 8:11 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Fri, Sep 22, 2023 at 11:01:23AM +0800, Jason Wang wrote:
>
> > > Even when it does, there is no real use case to live migrate a
> > > virtio-net function from, say, AWS to GCP.
> >
> > It can happen inside a single cloud vendor. For some reasons, DPU must
> > be purchased from different vendors. And vDPA has been used in that
> > case.
>
> Nope, you misunderstand the DPU scenario.
>
> Look at something like vmware DPU enablement. vmware runs the software
> side of the DPU and all their supported DPU HW, from every vendor,
> generates the same PCI functions on the x86. They are the same because
> the same software on the DPU side is creating them.
>
> There is no reason to put a mediation layer in the x86 if you also
> control the DPU.
>
> Cloud vendors will similarly use DPUs to create a PCI functions that
> meet the cloud vendor's internal specification.

This can only work if:

1) the internal specification has finer grain than the virtio spec
2) it can define what is not implemented in the virtio spec (like
migration and compatibility)

All of the above doesn't seem to be possible or realistic now, and it
actually has a risk of not being compatible with the virtio spec. In the
future, when virtio has live migration support, they will want to be able
to migrate between virtio and vDPA.

As I said, vDPA has been used for cross vendor live migration for a while.

> Regardless of DPU
> vendor.
>
> Fundamentally if you control the DPU SW and the hypervisor software
> you do not need hypervisor meditation because everything you could do
> in hypervisor mediation can just be done in the DPU. Putting it in the
> DPU is better in every regard.
>
> So, as I keep saying, in this scenario the goal is no mediation in the
> hypervisor.

That's pretty fine, but I don't see how trapping + relaying is not
mediation. Does it really matter what happens after trapping?

> It is pointless, everything you think you need to do there
> is actually already being done in the DPU.

Well, migration or even Qemu could be offloaded to the DPU as well. If
that's the direction, that's pretty fine.

Thanks

>
> Jason
>
Zhu, Lingshan Sept. 25, 2023, 4:44 a.m. UTC | #48
On 9/22/2023 4:55 AM, Michael S. Tsirkin wrote:
> On Thu, Sep 21, 2023 at 04:51:15PM -0300, Jason Gunthorpe wrote:
>> On Thu, Sep 21, 2023 at 03:17:25PM -0400, Michael S. Tsirkin wrote:
>>> On Thu, Sep 21, 2023 at 03:39:26PM -0300, Jason Gunthorpe wrote:
>>>>> What is the huge amount of work am I asking to do?
>>>> You are asking us to invest in the complexity of VDPA through out
>>>> (keep it working, keep it secure, invest time in deploying and
>>>> debugging in the field)
>>> I'm asking you to do nothing of the kind - I am saying that this code
>>> will have to be duplicated in vdpa,
>> Why would that be needed?
> For the same reason it was developed in the 1st place - presumably
> because it adds efficient legacy guest support with the right card?
> I get it, you specifically don't need VDPA functionality, but I don't
> see why is this universal, or common.
>
>
>>> and so I am asking what exactly is missing to just keep it all
>>> there.
>> VFIO. Seriously, we don't want unnecessary mediation in this path at
>> all.
> But which mediation is necessary is exactly up to the specific use-case.
> I have no idea why would you want all of VFIO to e.g. pass access to
> random config registers to the guest when it's a virtio device and the
> config registers are all nicely listed in the spec. I know nvidia
> hardware is so great, it has super robust cards with less security holes
> than the vdpa driver, but I very much doubt this is universal for all
> virtio offload cards.
I agree with MST.
>>> note I didn't ask you to add iommufd to vdpa though that would be
>>> nice ;)
>> I did once send someone to look.. It didn't succeed :(
>>
>> Jason
> Pity. Maybe there's some big difficulty blocking this? I'd like to know.
>
Parav Pandit Sept. 25, 2023, 8:26 a.m. UTC | #49
> From: Jason Wang <jasowang@redhat.com>
> Sent: Monday, September 25, 2023 8:00 AM
> 
> On Fri, Sep 22, 2023 at 8:25 PM Parav Pandit <parav@nvidia.com> wrote:
> >
> >
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Friday, September 22, 2023 5:53 PM
> >
> >
> > > > And what's more, using MMIO BAR0 then it can work for legacy.
> > >
> > > Oh? How? Our team didn't think so.
> >
> > It does not. It was already discussed.
> > The device reset in legacy is not synchronous.
> 
> How do you know this?
>
Not sure what the motivation is for repeating the same discussion already done in OASIS with you and others in the past.

Anyways, please find the answer below.

About reset:
The legacy device specification does not enforce the below-cited driver requirement of 1.0.

"The driver SHOULD consider a driver-initiated reset complete when it reads device status as 0."
 
[1] https://ozlabs.org/~rusty/virtio-spec/virtio-0.9.5.pdf

> > The drivers do not wait for reset to complete; it was written for the sw
> backend.
> 
> Do you see there's a flush after reset in the legacy driver?
> 
Yes, it only flushes the write by reading it. The driver does not _wait_ for the reset to complete within the device like above.

Please see the reset flow of the 1.x device below.
In fact the comment in the 1.x path also needs to be updated to indicate that the driver needs to wait for the device to finish the reset.
I will send a separate patch improving this comment in vp_reset() to match the spec.

static void vp_reset(struct virtio_device *vdev)
{
        struct virtio_pci_device *vp_dev = to_vp_device(vdev);
        struct virtio_pci_modern_device *mdev = &vp_dev->mdev;

        /* 0 status means a reset. */
        vp_modern_set_status(mdev, 0);
        /* After writing 0 to device_status, the driver MUST wait for a read of
         * device_status to return 0 before reinitializing the device.
         * This will flush out the status write, and flush in device writes,
         * including MSI-X interrupts, if any.
         */
        while (vp_modern_get_status(mdev))
                msleep(1);
        /* Flush pending VQ/configuration callbacks. */
        vp_synchronize_vectors(vdev);
}


> static void vp_reset(struct virtio_device *vdev) {
>         struct virtio_pci_device *vp_dev = to_vp_device(vdev);
>         /* 0 status means a reset. */
>         vp_legacy_set_status(&vp_dev->ldev, 0);
>         /* Flush out the status write, and flush in device writes,
>          * including MSi-X interrupts, if any. */
>         vp_legacy_get_status(&vp_dev->ldev);
>         /* Flush pending VQ/configuration callbacks. */
>         vp_synchronize_vectors(vdev);
> }
> 
> Thanks
> 
> 
> 
> > Hence MMIO BAR0 is not the best option in real implementations.
> >
Jason Gunthorpe Sept. 25, 2023, 12:26 p.m. UTC | #50
On Mon, Sep 25, 2023 at 10:34:54AM +0800, Jason Wang wrote:

> > Cloud vendors will similarly use DPUs to create a PCI functions that
> > meet the cloud vendor's internal specification.
> 
> This can only work if:
> 
> 1) the internal specification has finer garin than virtio spec
> 2) so it can define what is not implemented in the virtio spec (like
> migration and compatibility)

Yes, and that is what is happening. Realistically the "spec" is just a
piece of software that the cloud vendor owns, which is simply ported to
multiple DPU vendors.

It is the same as VDPA. If VDPA can make multiple NIC vendors
consistent then why do you have a hard time believing we can do the
same thing just on the ARM side of a DPU?

> All of the above doesn't seem to be possible or realistic now, and it
> actually has a risk to be not compatible with virtio spec. In the
> future when virtio has live migration supported, they want to be able
> to migrate between virtio and vDPA.

Well, that is for the spec to design. 

> > So, as I keep saying, in this scenario the goal is no mediation in the
> > hypervisor.
> 
> That's pretty fine, but I don't think trapping + relying is not
> mediation. Does it really matter what happens after trapping?

It is not mediation in the sense that the kernel driver does not in
any way make decisions on the behavior of the device. It simply
transforms an IO operation into a device command and relays it to the
device. The device still fully controls its own behavior.

VDPA is very different from this. You might call them both mediation,
sure, but then you need another word to describe the additional
changes VDPA is doing.

> > It is pointless, everything you think you need to do there
> > is actually already being done in the DPU.
> 
> Well, migration or even Qemu could be offloaded to DPU as well. If
> that's the direction that's pretty fine.

That's silly, of course qemu/kvm can't run in the DPU.

However, we can empty qemu and the hypervisor out so all it does is
run kvm and run vfio. In this model the DPU does all the OVS, storage,
"VDPA", etc. qemu is just a passive relay of the DPU PCI functions
into VM's vPCI functions.

So, everything VDPA was doing in the environment is migrated into the
DPU.

In this model the DPU is an extension of the hypervisor/qemu
environment and we shift code from x86 side to arm side to increase
security, save power and increase total system performance.

Jason
Michael S. Tsirkin Sept. 25, 2023, 5:36 p.m. UTC | #51
On Fri, Sep 22, 2023 at 01:22:33PM -0300, Jason Gunthorpe wrote:
> On Fri, Sep 22, 2023 at 11:40:58AM -0400, Michael S. Tsirkin wrote:
> > On Fri, Sep 22, 2023 at 12:15:34PM -0300, Jason Gunthorpe wrote:
> > > On Fri, Sep 22, 2023 at 11:13:18AM -0400, Michael S. Tsirkin wrote:
> > > > On Fri, Sep 22, 2023 at 12:25:06PM +0000, Parav Pandit wrote:
> > > > > 
> > > > > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > > > > Sent: Friday, September 22, 2023 5:53 PM
> > > > > 
> > > > > 
> > > > > > > And what's more, using MMIO BAR0 then it can work for legacy.
> > > > > > 
> > > > > > Oh? How? Our team didn't think so.
> > > > > 
> > > > > It does not. It was already discussed.
> > > > > The device reset in legacy is not synchronous.
> > > > > The drivers do not wait for reset to complete; it was written for the sw backend.
> > > > > Hence MMIO BAR0 is not the best option in real implementations.
> > > > 
> > > > Or maybe they made it synchronous in hardware, that's all.
> > > > After all same is true for the IO BAR0 e.g. for the PF: IO writes
> > > > are posted anyway.
> > > 
> > > IO writes are not posted in PCI.
> > 
> > Aha, I was confused. Thanks for the correction. I guess you just buffer
> > subsequent transactions while reset is going on and reset quickly enough
> > for it to be seemless then?
> 
> From a hardware perspective the CPU issues a non-posted IO write and
> then it stops processing until the far side returns an IO completion.
> 
> Using that you can emulate what the SW virtio model did and delay the
> CPU from restarting until the reset is completed.
> 
> Since MMIO is always posted, this is not possible to emulate directly
> using MMIO.
> 
> Converting IO into non-posted admin commands is a fairly close
> recreation to what actual HW would do.
> 
> Jason

I thought you asked how it is possible for hardware to support reset if
all it does is replace the IO BAR with a memory BAR. The answer is that since
2011 the reset is followed by a read of the status field (which isn't much
older than MSI-X support from 2009 - which this code assumes).  If one
uses a Linux driver from 2011 on, then all you need to do is defer the
response to this read until after the reset is complete.

If you are using older drivers or other OSes then a reset using a posted
write after the device has operated for a while might not be safe, so e.g.
you might trigger races if you remove drivers from the system or
trigger hot unplug.  For example: 

	static void virtio_pci_remove(struct pci_dev *pci_dev)
	{

	....

		unregister_virtio_device(&vp_dev->vdev);

	^^^^ triggers reset, then releases memory

	....

		pci_disable_device(pci_dev);

	^^^ blocks DMA by clearing bus master

	}

here you could see some DMA into memory that has just been released.


As Jason mentions hardware exists that is used under one of these two
restrictions on the guest (Linux since 2011 or no resets while DMA is
going on), and it works fine with these existing guests.

Given the restrictions, virtio TC didn't elect to standardize this
approach and instead opted for the heavier approach of
converting IO into non-posted admin commands in software.
Michael S. Tsirkin Sept. 25, 2023, 6:16 p.m. UTC | #52
On Fri, Sep 22, 2023 at 01:19:28PM -0300, Jason Gunthorpe wrote:
> On Fri, Sep 22, 2023 at 11:39:19AM -0400, Michael S. Tsirkin wrote:
> > On Fri, Sep 22, 2023 at 09:25:01AM -0300, Jason Gunthorpe wrote:
> > > On Fri, Sep 22, 2023 at 11:02:50AM +0800, Jason Wang wrote:
> > > > On Fri, Sep 22, 2023 at 3:53 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
> > > > >
> > > > > On Thu, Sep 21, 2023 at 03:34:03PM -0400, Michael S. Tsirkin wrote:
> > > > >
> > > > > > that's easy/practical.  If instead VDPA gives the same speed with just
> > > > > > shadow vq then keeping this hack in vfio seems like less of a problem.
> > > > > > Finally if VDPA is faster then maybe you will reconsider using it ;)
> > > > >
> > > > > It is not all about the speed.
> > > > >
> > > > > VDPA presents another large and complex software stack in the
> > > > > hypervisor that can be eliminated by simply using VFIO.
> > > > 
> > > > vDPA supports standard virtio devices so how did you define
> > > > complexity?
> > > 
> > > As I said, VFIO is already required for other devices in these VMs. So
> > > anything incremental over base-line vfio-pci is complexity to
> > > minimize.
> > > 
> > > Everything vdpa does is either redundant or unnecessary compared to
> > > VFIO in these environments.
> > > 
> > > Jason
> > 
> > Yes but you know. There are all kind of environments.  I guess you
> > consider yours the most mainstream and important, and are sure it will
> > always stay like this.  But if there's a driver that does what you need
> > then you use that.
> 
> Come on, you are the one saying we cannot do things in the best way
> possible because you want your way of doing things to be the only way
> allowed. Which of us thinks "yours the most mainstream and important" ??
> 
> I'm not telling you to throw away VPDA, I'm saying there are
> legimitate real world use cases where VFIO is the appropriate
> interface, not VDPA.
> 
> I want choice, not dogmatic exclusion that there is Only One True Way.

I don't particularly think there's only one way; vfio is already there.
I am specifically thinking about this patch, for example it
muddies the waters a bit: normally vfio exposes the device
with the same ID, but suddenly it changes the ID as visible to the guest.
But again, whether doing this kind of thing is OK is more up to Alex than me.

I do want to understand if there's a use-case that vdpa does not address
simply because it might be worthwhile to extend it to do so, and a
bunch of people working on it are at Red Hat and I might have some input
into how that labor is allocated. But if the use-case is simply "has to
be vfio and not vdpa" then I guess not.




> > You really should be explaining what vdpa *does not* do that you
> > need.
> 
> I think I've done that enough, but if you have been following my
> explanation you should see that the entire point of this design is to
> allow a virtio device to be created inside a DPU to a specific
> detailed specification (eg an AWS virtio-net device, for instance)
> 
> The implementation is in the DPU, and only the DPU.
> 
> At the end of the day VDPA uses mediation and creates some
> RedHat/VDPA/Qemu virtio-net device in the guest. It is emphatically
> NOT a perfect recreation of the "AWS virtio-net" we started out with.
> 
> It entirely fails to achieve the most important thing it needs to do!

It could be that we are using mediation differently - in my world it's
when there's some host software on the path between guest and hardware,
and this qualifies.  The difference between what this patch does and
what vdpa does seems quantitative, not qualitative. Which might be
enough to motivate this work, I don't mind. But you seem to feel
it is qualitative and I am genuinely curious about it, because
if yes then it might lead e.g. the virtio standard in new directions.

I can *imagine* all kinds of reasons to want to use vfio as compared to vdpa;
here are some examples I came up with, quickly:
- maybe you have drivers that poke at registers not in virtio spec:
  vfio allows that, vdpa by design does not
- maybe you are using vfio with a lot of devices already and don't want
  to special-case handling for virtio devices on the host
Do any of the above motivations ring a bell? Some of the things you
said seem to hint at that. If yes, maybe include this in the cover
letter.

There is also a question of capability. Specifically iommufd support
is lacking in vdpa (though there are finally some RFC patches to
address that). All this is fine, could be enough to motivate
a work like this one. But I am very curious to know if there
is any other capability lacking in vdpa. I asked already and you
didn't answer so I guess not?




> Yishai will rework the series with your remarks, we can look again on
> v2, thanks for all the input!
> 
> Jason
Michael S. Tsirkin Sept. 25, 2023, 6:36 p.m. UTC | #53
On Mon, Sep 25, 2023 at 08:26:33AM +0000, Parav Pandit wrote:
> 
> 
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Monday, September 25, 2023 8:00 AM
> > 
> > On Fri, Sep 22, 2023 at 8:25 PM Parav Pandit <parav@nvidia.com> wrote:
> > >
> > >
> > > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > > Sent: Friday, September 22, 2023 5:53 PM
> > >
> > >
> > > > > And what's more, using MMIO BAR0 then it can work for legacy.
> > > >
> > > > Oh? How? Our team didn't think so.
> > >
> > > It does not. It was already discussed.
> > > The device reset in legacy is not synchronous.
> > 
> > How do you know this?
> >
> Not sure the motivation of same discussion done in the OASIS with you and others in past.
> 
> Anyways, please find the answer below.
> 
> About reset,
> The legacy device specification has not enforced below cited 1.0 driver requirement of 1.0.
> 
> "The driver SHOULD consider a driver-initiated reset complete when it reads device status as 0."
>  
> [1] https://ozlabs.org/~rusty/virtio-spec/virtio-0.9.5.pdf

Basically, I think any driver that did not read status (Linux pre-2011)
before freeing memory under DMA has a reset path that is racy wrt DMA, since 
memory writes are posted and IO writes, while not posted, have completions
that do not order posted transactions, e.g. from the PCI Express spec:
        D2b
        An I/O or Configuration Write Completion is permitted to pass a Posted Request.
Having said that, there were a ton of driver races discovered on this
path in the years since; I suspect if one cares about this then
just avoiding stress on reset is wise.



> > > The drivers do not wait for reset to complete; it was written for the sw
> > backend.
> > 
> > Do you see there's a flush after reset in the legacy driver?
> > 
> Yes. it only flushes the write by reading it. The driver does not get _wait_ for the reset to complete within the device like above.

One can conceivably do that wait in hardware, though. Just defer the completion
until the read is done.

> Please see the reset flow of 1.x device as below.
> In fact the comment of the 1.x device also needs to be updated to indicate that driver need to wait for the device to finish the reset.
> I will send separate patch for improving this comment of vp_reset() to match the spec.
> 
> static void vp_reset(struct virtio_device *vdev)
> {
>         struct virtio_pci_device *vp_dev = to_vp_device(vdev);
>         struct virtio_pci_modern_device *mdev = &vp_dev->mdev;
> 
>         /* 0 status means a reset. */
>         vp_modern_set_status(mdev, 0);
>         /* After writing 0 to device_status, the driver MUST wait for a read of
>          * device_status to return 0 before reinitializing the device.
>          * This will flush out the status write, and flush in device writes,
>          * including MSI-X interrupts, if any.
>          */
>         while (vp_modern_get_status(mdev))
>                 msleep(1);
>         /* Flush pending VQ/configuration callbacks. */
>         vp_synchronize_vectors(vdev);
> }
> 
> 
> > static void vp_reset(struct virtio_device *vdev) {
> >         struct virtio_pci_device *vp_dev = to_vp_device(vdev);
> >         /* 0 status means a reset. */
> >         vp_legacy_set_status(&vp_dev->ldev, 0);
> >         /* Flush out the status write, and flush in device writes,
> >          * including MSi-X interrupts, if any. */
> >         vp_legacy_get_status(&vp_dev->ldev);
> >         /* Flush pending VQ/configuration callbacks. */
> >         vp_synchronize_vectors(vdev);
> > }
> > 
> > Thanks
> > 
> > 
> > 
> > > Hence MMIO BAR0 is not the best option in real implementations.
> > >
>
Jason Gunthorpe Sept. 25, 2023, 6:53 p.m. UTC | #54
On Mon, Sep 25, 2023 at 02:16:30PM -0400, Michael S. Tsirkin wrote:

> I do want to understand if there's a use-case that vdpa does not address
> simply because it might be worth while to extend it to do so, and a
> bunch of people working on it are at Red Hat and I might have some input
> into how that labor is allocated. But if the use-case is simply "has to
> be vfio and not vdpa" then I guess not.

If you strip away all the philosophical arguing, VDPA has no way to
isolate the control and data virtqs to different IOMMU configurations
with this single PCI function.

The existing HW VDPA drivers provided device specific ways to handle
this.

Without DMA isolation you can't assign the high speed data virtq's to
the VM without mediating them as well.

> It could be that we are using mediation differently - in my world it's
> when there's some host software on the path between guest and hardware,
> and this qualifies.  

That is pretty general. As I said to Jason, if you want to use it that
way then you need to make up a new word to describe what VDPA does as
there is a clear difference in scope between this VFIO patch (relay IO
commands to the device) and VDPA (intercept all the control plane,
control virtq and bring it to a RedHat/qemu standard common behavior)
 
> There is also a question of capability. Specifically iommufd support
> is lacking in vdpa (though there are finally some RFC patches to
> address that). All this is fine, could be enough to motivate
> a work like this one.

I've answered many times, you just don't seem to like the answers or
you dismiss them as not relevant to you.

Jason
Michael S. Tsirkin Sept. 25, 2023, 7:44 p.m. UTC | #55
On Mon, Sep 25, 2023 at 09:26:07AM -0300, Jason Gunthorpe wrote:
> > > So, as I keep saying, in this scenario the goal is no mediation in the
> > > hypervisor.
> > 
> > That's pretty fine, but I don't think trapping + relying is not
> > mediation. Does it really matter what happens after trapping?
> 
> It is not mediation in the sense that the kernel driver does not in
> any way make decisions on the behavior of the device. It simply
> transforms an IO operation into a device command and relays it to the
> device. The device still fully controls its own behavior.
> 
> VDPA is very different from this. You might call them both mediation,
> sure, but then you need another word to describe the additional
> changes VPDA is doing.

Sorry about hijacking the thread a little bit, but could you
call out some of the changes that are the most problematic
for you?

> > > It is pointless, everything you think you need to do there
> > > is actually already being done in the DPU.
> > 
> > Well, migration or even Qemu could be offloaded to DPU as well. If
> > that's the direction that's pretty fine.
> 
> That's silly, of course qemu/kvm can't run in the DPU.
> 
> However, we can empty qemu and the hypervisor out so all it does is
> run kvm and run vfio. In this model the DPU does all the OVS, storage,
> "VPDA", etc. qemu is just a passive relay of the DPU PCI functions
> into VM's vPCI functions.
> 
> So, everything VDPA was doing in the environment is migrated into the
> DPU.
> 
> In this model the DPU is an extension of the hypervisor/qemu
> environment and we shift code from x86 side to arm side to increase
> security, save power and increase total system performance.
> 
> Jason

I think I begin to understand. On the DPU you have some virtio
devices but also some non-virtio devices.  So you have to
use VFIO to talk to the DPU. Reusing VFIO to talk to virtio
devices too, simplifies things for you. If guests will see
vendor-specific devices from the DPU anyway, it will be impossible
to migrate such guests away from the DPU so the cross-vendor
migration capability is less important in this use-case.
Is this a good summary?
Michael S. Tsirkin Sept. 25, 2023, 7:52 p.m. UTC | #56
On Mon, Sep 25, 2023 at 03:53:18PM -0300, Jason Gunthorpe wrote:
> On Mon, Sep 25, 2023 at 02:16:30PM -0400, Michael S. Tsirkin wrote:
> 
> > I do want to understand if there's a use-case that vdpa does not address
> > simply because it might be worth while to extend it to do so, and a
> > bunch of people working on it are at Red Hat and I might have some input
> > into how that labor is allocated. But if the use-case is simply "has to
> > be vfio and not vdpa" then I guess not.
> 
> If you strip away all the philisophical arguing VDPA has no way to
> isolate the control and data virtqs to different IOMMU configurations
> with this single PCI function.

Aha, so address space/PASID support then?

> The existing HW VDPA drivers provided device specific ways to handle
> this.
> 
> Without DMA isolation you can't assign the high speed data virtq's to
> the VM without mediating them as well.
> 
> > It could be that we are using mediation differently - in my world it's
> > when there's some host software on the path between guest and hardware,
> > and this qualifies.  
> 
> That is pretty general. As I said to Jason, if you want to use it that
> way then you need to make up a new word to describe what VDPA does as
> there is a clear difference in scope between this VFIO patch (relay IO
> commands to the device) and VDPA (intercept all the control plane,
> control virtq and bring it to a RedHat/qemu standard common behavior)

IIUC VDPA itself does not really bring it to either a RedHat or a qemu
standard; it just allows userspace to control behaviour - if userspace
is qemu then it's qemu deciding how it behaves. Which I guess this
doesn't. Right?  RedHat's not in the picture at all, I think.

> > There is also a question of capability. Specifically iommufd support
> > is lacking in vdpa (though there are finally some RFC patches to
> > address that). All this is fine, could be enough to motivate
> > a work like this one.
> 
> I've answered many times, you just don't semm to like the answers or
> dismiss them as not relevant to you.
> 
> Jason


Not really; I think I lack some of the picture so I don't fully
understand. Or maybe I missed something else.
Jason Gunthorpe Sept. 26, 2023, 12:40 a.m. UTC | #57
On Mon, Sep 25, 2023 at 03:44:11PM -0400, Michael S. Tsirkin wrote:
> > VDPA is very different from this. You might call them both mediation,
> > sure, but then you need another word to describe the additional
> > changes VPDA is doing.
> 
> Sorry about hijacking the thread a little bit, but could you
> call out some of the changes that are the most problematic
> for you?

I don't really know these details. The operators have an existing
virtio world that is ABI toward the VM for them, and they do not want
*anything* to change. The VM should be unaware whether the virtio device is
created by old hypervisor software or new DPU software. It presents
exactly the same ABI.

So the challenge really is to convince that VDPA delivers that, and
frankly, I don't think it does. ABI toward the VM is very important
here.

> > In this model the DPU is an extension of the hypervisor/qemu
> > environment and we shift code from x86 side to arm side to increase
> > security, save power and increase total system performance.
> 
> I think I begin to understand. On the DPU you have some virtio
> devices but also some non-virtio devices.  So you have to
> use VFIO to talk to the DPU. Reusing VFIO to talk to virtio
> devices too, simplifies things for you. 

Yes

> If guests will see vendor-specific devices from the DPU anyway, it
> will be impossible to migrate such guests away from the DPU so the
> cross-vendor migration capability is less important in this
> use-case.  Is this a good summary?

Well, sort of. As I said before, the vendor here is the cloud
operator, not the DPU supplier. The guest will see an AWS virtio-net
function, for example.

The operator will ensure that all their different implementations of
this function will interwork for migration.

So within the closed world of a single operator live migration will
work just fine.

Since the hypervisors controlled by the operator only migrate within
the operator's own environment anyhow, it is an already solved problem.

Jason
Jason Wang Sept. 26, 2023, 2:32 a.m. UTC | #58
On Mon, Sep 25, 2023 at 4:26 PM Parav Pandit <parav@nvidia.com> wrote:
>
>
>
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Monday, September 25, 2023 8:00 AM
> >
> > On Fri, Sep 22, 2023 at 8:25 PM Parav Pandit <parav@nvidia.com> wrote:
> > >
> > >
> > > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > > Sent: Friday, September 22, 2023 5:53 PM
> > >
> > >
> > > > > And what's more, using MMIO BAR0 then it can work for legacy.
> > > >
> > > > Oh? How? Our team didn't think so.
> > >
> > > It does not. It was already discussed.
> > > The device reset in legacy is not synchronous.
> >
> > How do you know this?
> >
> Not sure the motivation of same discussion done in the OASIS with you and others in past.

That is exactly the same point.

It's too late to define the legacy behaviour accurately in the spec so
people will be lost in the legacy maze easily.

>
> Anyways, please find the answer below.
>
> About reset,
> The legacy device specification has not enforced below cited 1.0 driver requirement of 1.0.
>
> "The driver SHOULD consider a driver-initiated reset complete when it reads device status as 0."

We are talking about how to make devices work for legacy drivers. So
it has nothing to do with 1.0.

>
> [1] https://ozlabs.org/~rusty/virtio-spec/virtio-0.9.5.pdf
>
> > > The drivers do not wait for reset to complete; it was written for the sw
> > backend.
> >
> > Do you see there's a flush after reset in the legacy driver?
> >
> Yes. it only flushes the write by reading it. The driver does not get _wait_ for the reset to complete within the device like above.

Those are the implementation details of legacy. The device needs to make
sure the reset is done before get_status returns, so the driver can work.

That's all.
Zhu, Lingshan Sept. 26, 2023, 2:34 a.m. UTC | #59
On 9/26/2023 2:36 AM, Michael S. Tsirkin wrote:
> On Mon, Sep 25, 2023 at 08:26:33AM +0000, Parav Pandit wrote:
>>
>>> From: Jason Wang <jasowang@redhat.com>
>>> Sent: Monday, September 25, 2023 8:00 AM
>>>
>>> On Fri, Sep 22, 2023 at 8:25 PM Parav Pandit <parav@nvidia.com> wrote:
>>>>
>>>>> From: Jason Gunthorpe <jgg@nvidia.com>
>>>>> Sent: Friday, September 22, 2023 5:53 PM
>>>>
>>>>>> And what's more, using MMIO BAR0 then it can work for legacy.
>>>>> Oh? How? Our team didn't think so.
>>>> It does not. It was already discussed.
>>>> The device reset in legacy is not synchronous.
>>> How do you know this?
>>>
>> Not sure the motivation of same discussion done in the OASIS with you and others in past.
>>
>> Anyways, please find the answer below.
>>
>> About reset,
>> The legacy device specification has not enforced below cited 1.0 driver requirement of 1.0.
>>
>> "The driver SHOULD consider a driver-initiated reset complete when it reads device status as 0."
>>   
>> [1] https://ozlabs.org/~rusty/virtio-spec/virtio-0.9.5.pdf
> Basically, I think any drivers that did not read status (linux pre 2011)
> before freeing memory under DMA have a reset path that is racy wrt DMA, since
> memory writes are posted and IO writes while not posted have completion
> that does not order posted transactions, e.g. from pci express spec:
>          D2b
>          An I/O or Configuration Write Completion 37 is permitted to pass a Posted Request.
> having said that there were a ton of driver races discovered on this
> path in the years since, I suspect if one cares about this then
> just avoiding stress on reset is wise.
>
>
>
>>>> The drivers do not wait for reset to complete; it was written for the sw
>>> backend.
>>>
>>> Do you see there's a flush after reset in the legacy driver?
>>>
>> Yes. it only flushes the write by reading it. The driver does not get _wait_ for the reset to complete within the device like above.
> One can thinkably do that wait in hardware, though. Just defer completion until
> read is done.
I agree with MST. At least Intel devices work fine with vfio-pci and 
legacy driver without any changes.
So far so good.

Thanks
Zhu Lingshan
>
>> Please see the reset flow of 1.x device as below.
>> In fact the comment of the 1.x device also needs to be updated to indicate that driver need to wait for the device to finish the reset.
>> I will send separate patch for improving this comment of vp_reset() to match the spec.
>>
>> static void vp_reset(struct virtio_device *vdev)
>> {
>>          struct virtio_pci_device *vp_dev = to_vp_device(vdev);
>>          struct virtio_pci_modern_device *mdev = &vp_dev->mdev;
>>
>>          /* 0 status means a reset. */
>>          vp_modern_set_status(mdev, 0);
>>          /* After writing 0 to device_status, the driver MUST wait for a read of
>>           * device_status to return 0 before reinitializing the device.
>>           * This will flush out the status write, and flush in device writes,
>>           * including MSI-X interrupts, if any.
>>           */
>>          while (vp_modern_get_status(mdev))
>>                  msleep(1);
>>          /* Flush pending VQ/configuration callbacks. */
>>          vp_synchronize_vectors(vdev);
>> }
>>
>>
>>> static void vp_reset(struct virtio_device *vdev) {
>>>          struct virtio_pci_device *vp_dev = to_vp_device(vdev);
>>>          /* 0 status means a reset. */
>>>          vp_legacy_set_status(&vp_dev->ldev, 0);
>>>          /* Flush out the status write, and flush in device writes,
>>>           * including MSi-X interrupts, if any. */
>>>          vp_legacy_get_status(&vp_dev->ldev);
>>>          /* Flush pending VQ/configuration callbacks. */
>>>          vp_synchronize_vectors(vdev);
>>> }
>>>
>>> Thanks
>>>
>>>
>>>
>>>> Hence MMIO BAR0 is not the best option in real implementations.
>>>>
Parav Pandit Sept. 26, 2023, 3:45 a.m. UTC | #60
> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Tuesday, September 26, 2023 12:06 AM

> One can thinkably do that wait in hardware, though. Just defer completion until
> read is done.
>
Once OASIS defines such a new interface, and if some hw vendor _actually_ wants to build such complex hw, maybe the vfio driver can adapt to it.
When we worked with you, we discussed that such hw does not have enough returns, and hence the technical committee chose to proceed with admin commands.
I will skip re-discussing it all over again here.

The current virtio spec delivers the best trade-off of functionality, performance and lightweight implementation, with a forward path towards more features, such as migration, as Jason explained.
All with near zero driver, qemu and sw involvement for a rapidly growing feature set...
Parav Pandit Sept. 26, 2023, 4:01 a.m. UTC | #61
> From: Jason Wang <jasowang@redhat.com>
> Sent: Tuesday, September 26, 2023 8:03 AM
> 
> It's the implementation details in legacy. The device needs to make sure (reset)
> the driver can work (is done before get_status return).
It is part of the 0.9.5 and 1.x specifications, as I quoted that text above.
Jason Wang Sept. 26, 2023, 4:37 a.m. UTC | #62
On Tue, Sep 26, 2023 at 12:01 PM Parav Pandit <parav@nvidia.com> wrote:
>
>
>
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Tuesday, September 26, 2023 8:03 AM
> >
> > It's the implementation details in legacy. The device needs to make sure (reset)
> > the driver can work (is done before get_status return).
> It is part of the 0.9.5 and 1.x specification as I quoted those text above.

What I meant is: legacy devices need to find their way to make legacy
drivers work. That's how legacy works.

It's too late to add any normative requirements to the 0.9.5 spec. So the
device behaviour is actually defined by the legacy drivers. That is why it
is tricky.

If you can't find a way to make legacy drivers work, use modern.

That's it.

Thanks
Jason Wang Sept. 26, 2023, 4:37 a.m. UTC | #63
On Tue, Sep 26, 2023 at 11:45 AM Parav Pandit <parav@nvidia.com> wrote:
>
>
>
> > From: Michael S. Tsirkin <mst@redhat.com>
> > Sent: Tuesday, September 26, 2023 12:06 AM
>
> > One can thinkably do that wait in hardware, though. Just defer completion until
> > read is done.
> >
> Once OASIS does such new interface and if some hw vendor _actually_ wants to do such complex hw, may be vfio driver can adopt to it.

It is you who is trying to revive legacy in the spec. We all know legacy
is tricky, but it works.

> When we worked with you, we discussed that there such hw does not have enough returns and hence technical committee choose to proceed with admin commands.

I don't think my questions regarding the legacy transport got good
answers at that time. What's more, we all know the spec allows us to fix,
work around or even deprecate a feature.

Thanks
Jason Wang Sept. 26, 2023, 4:37 a.m. UTC | #64
On Mon, Sep 25, 2023 at 8:26 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Mon, Sep 25, 2023 at 10:34:54AM +0800, Jason Wang wrote:
>
> > > Cloud vendors will similarly use DPUs to create a PCI functions that
> > > meet the cloud vendor's internal specification.
> >
> > This can only work if:
> >
> > 1) the internal specification has finer garin than virtio spec
> > 2) so it can define what is not implemented in the virtio spec (like
> > migration and compatibility)
>
> Yes, and that is what is happening. Realistically the "spec" isjust a
> piece of software that the Cloud vendor owns which is simply ported to
> multiple DPU vendors.
>
> It is the same as VDPA. If VDPA can make multiple NIC vendors
> consistent then why do you have a hard time believing we can do the
> same thing just on the ARM side of a DPU?

I don't. We all know vDPA can do more than virtio.

>
> > All of the above doesn't seem to be possible or realistic now, and it
> > actually has a risk to be not compatible with virtio spec. In the
> > future when virtio has live migration supported, they want to be able
> > to migrate between virtio and vDPA.
>
> Well, that is for the spec to design.

Right, so if we'd consider migration from virtio to vDPA, it needs to
be designed in a way that allows more involvement from the hypervisor,
rather than coupling it with a specific interface (like admin
virtqueues).

>
> > > So, as I keep saying, in this scenario the goal is no mediation in the
> > > hypervisor.
> >
> > That's pretty fine, but I don't think trapping + relying is not
> > mediation. Does it really matter what happens after trapping?
>
> It is not mediation in the sense that the kernel driver does not in
> any way make decisions on the behavior of the device. It simply
> transforms an IO operation into a device command and relays it to the
> device. The device still fully controls its own behavior.
>
> VDPA is very different from this. You might call them both mediation,
> sure, but then you need another word to describe the additional
> changes VPDA is doing.
>
> > > It is pointless, everything you think you need to do there
> > > is actually already being done in the DPU.
> >
> > Well, migration or even Qemu could be offloaded to DPU as well. If
> > that's the direction that's pretty fine.
>
> That's silly, of course qemu/kvm can't run in the DPU.

KVM can't for sure but part of Qemu could. This model has been used.

>
> However, we can empty qemu and the hypervisor out so all it does is
> run kvm and run vfio. In this model the DPU does all the OVS, storage,
> "VPDA", etc. qemu is just a passive relay of the DPU PCI functions
> into VM's vPCI functions.
>
> So, everything VDPA was doing in the environment is migrated into the
> DPU.

It really depends on the use cases. For example, in the case of a DPU,
what if we want to provide multiple virtio devices through a single
VF?

>
> In this model the DPU is an extension of the hypervisor/qemu
> environment and we shift code from x86 side to arm side to increase
> security, save power and increase total system performance.

That's pretty fine.

Thanks

>
> Jason
>
Parav Pandit Sept. 26, 2023, 5:27 a.m. UTC | #65
> From: Jason Wang <jasowang@redhat.com>
> Sent: Tuesday, September 26, 2023 10:07 AM


> 
> If you can't find a way to make legacy drivers work, use modern.
>
Understood.
This vfio series makes the legacy drivers work.
Thanks.
 
> That's it.
> 
> Thanks
Parav Pandit Sept. 26, 2023, 5:33 a.m. UTC | #66
> From: Jason Wang <jasowang@redhat.com>
> Sent: Tuesday, September 26, 2023 10:08 AM

> Right, so if we'd consider migration from virtio to vDPA, it needs to be designed
> in a way that allows more involvement from hypervisor other than coupling it
> with a specific interface (like admin virtqueues).
It is not attached to the admin virtqueues.
One way to use it is via admin commands, as in [1].
One can define it without admin commands by explaining the technical difficulties where admin commands may not or cannot work.

[1] https://lists.oasis-open.org/archives/virtio-comment/202309/msg00061.html
Michael S. Tsirkin Sept. 26, 2023, 5:34 a.m. UTC | #67
On Mon, Sep 25, 2023 at 09:40:59PM -0300, Jason Gunthorpe wrote:
> On Mon, Sep 25, 2023 at 03:44:11PM -0400, Michael S. Tsirkin wrote:
> > > VDPA is very different from this. You might call them both mediation,
> > > sure, but then you need another word to describe the additional
> > > changes VPDA is doing.
> > 
> > Sorry about hijacking the thread a little bit, but could you
> > call out some of the changes that are the most problematic
> > for you?
> 
> I don't really know these details. The operators have an existing
> virtio world that is ABI toward the VM for them, and they do not want
> *anything* to change. The VM should be unware if the virtio device is
> created by old hypervisor software or new DPU software. It presents
> exactly the same ABI.
> 
> So the challenge really is to convince that VDPA delivers that, and
> frankly, I don't think it does. ABI toward the VM is very important
> here.

And to complete the picture, it is the DPU software/firmware that
is responsible for maintaining this ABI in your ideal world?


> > > In this model the DPU is an extension of the hypervisor/qemu
> > > environment and we shift code from x86 side to arm side to increase
> > > security, save power and increase total system performance.
> > 
> > I think I begin to understand. On the DPU you have some virtio
> > devices but also some non-virtio devices.  So you have to
> > use VFIO to talk to the DPU. Reusing VFIO to talk to virtio
> > devices too, simplifies things for you. 
> 
> Yes
> 
> > If guests will see vendor-specific devices from the DPU anyway, it
> > will be impossible to migrate such guests away from the DPU so the
> > cross-vendor migration capability is less important in this
> > use-case.  Is this a good summary?
> 
> Well, sort of. As I said before, the vendor here is the cloud
> operator, not the DPU supplier. The guest will see an AWS virtio-net
> function, for example.
> 
> The operator will ensure that all their different implementations of
> this function will interwork for migration.
> 
> So within the closed world of a single operator live migration will
> work just fine.
> 
> Since the hypervisor's controlled by the operator only migrate within
> the operators own environment anyhow, it is an already solved problem.
> 
> Jason


Okay the picture emerges I think. Thanks! I'll try to summarize later
for everyone's benefit.
Michael S. Tsirkin Sept. 26, 2023, 5:42 a.m. UTC | #68
On Mon, Sep 25, 2023 at 09:40:59PM -0300, Jason Gunthorpe wrote:
> On Mon, Sep 25, 2023 at 03:44:11PM -0400, Michael S. Tsirkin wrote:
> > > VDPA is very different from this. You might call them both mediation,
> > > sure, but then you need another word to describe the additional
> > > changes VPDA is doing.
> > 
> > Sorry about hijacking the thread a little bit, but could you
> > call out some of the changes that are the most problematic
> > for you?
> 
> I don't really know these details.

Maybe you should then desist from saying things like "It entirely fails
to achieve the most important thing it needs to do!" You are not making
any new friends by saying this about a piece of software without
knowing the details.
Michael S. Tsirkin Sept. 26, 2023, 11:49 a.m. UTC | #69
On Tue, Sep 26, 2023 at 10:32:39AM +0800, Jason Wang wrote:
> It's the implementation details in legacy. The device needs to make
> sure (reset) the driver can work (is done before get_status return).

I think that there's no way to make it reliably work for all legacy drivers.

They just assumed a software backend and did not bother with DMA
ordering. You can try to avoid resets; they are not that common, so
things will tend to mostly work if you don't stress them too much with
things like hot plug/unplug in a loop.  Or you can try to use a driver
from after 2011, which is more aware of hardware ordering and flushes the
reset write with a read.  One of these two tricks, I think, is the magic
behind the device exposing memory BAR 0 that you mention.
Jason Gunthorpe Sept. 26, 2023, 1:50 p.m. UTC | #70
On Tue, Sep 26, 2023 at 01:42:52AM -0400, Michael S. Tsirkin wrote:
> On Mon, Sep 25, 2023 at 09:40:59PM -0300, Jason Gunthorpe wrote:
> > On Mon, Sep 25, 2023 at 03:44:11PM -0400, Michael S. Tsirkin wrote:
> > > > VDPA is very different from this. You might call them both mediation,
> > > > sure, but then you need another word to describe the additional
> > > > changes VPDA is doing.
> > > 
> > > Sorry about hijacking the thread a little bit, but could you
> > > call out some of the changes that are the most problematic
> > > for you?
> > 
> > I don't really know these details.
> 
> Maybe, you then should desist from saying things like "It entirely fails
> to achieve the most important thing it needs to do!" You are not making
> any new friends with saying this about a piece of software without
> knowing the details.

I can't tell you what cloud operators are doing, but I can say with
confidence that it is not the same as VDPA. As I said, if you want to
know more details you need to ask a cloud operator.

Jason
Feng Liu Sept. 26, 2023, 7:23 p.m. UTC | #71
On 2023-09-21 a.m.9:57, Michael S. Tsirkin wrote:
> External email: Use caution opening links or attachments
> 
> 
> On Thu, Sep 21, 2023 at 03:40:32PM +0300, Yishai Hadas wrote:
>> From: Feng Liu <feliu@nvidia.com>
>>
>> Introduce support for the admin virtqueue. By negotiating
>> VIRTIO_F_ADMIN_VQ feature, driver detects capability and creates one
>> administration virtqueue. Administration virtqueue implementation in
>> virtio pci generic layer, enables multiple types of upper layer
>> drivers such as vfio, net, blk to utilize it.
>>
>> Signed-off-by: Feng Liu <feliu@nvidia.com>
>> Reviewed-by: Parav Pandit <parav@nvidia.com>
>> Reviewed-by: Jiri Pirko <jiri@nvidia.com>
>> Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
>> ---
>>   drivers/virtio/Makefile                |  2 +-
>>   drivers/virtio/virtio.c                | 37 +++++++++++++--
>>   drivers/virtio/virtio_pci_common.h     | 15 +++++-
>>   drivers/virtio/virtio_pci_modern.c     | 10 +++-
>>   drivers/virtio/virtio_pci_modern_avq.c | 65 ++++++++++++++++++++++++++
> 
> if you have a .c file without a .h file you know there's something
> fishy. Just add this inside drivers/virtio/virtio_pci_modern.c ?
> 
Will do.

>>   include/linux/virtio_config.h          |  4 ++
>>   include/linux/virtio_pci_modern.h      |  3 ++
>>   7 files changed, 129 insertions(+), 7 deletions(-)
>>   create mode 100644 drivers/virtio/virtio_pci_modern_avq.c
>>
>> diff --git a/drivers/virtio/Makefile b/drivers/virtio/Makefile
>> index 8e98d24917cc..dcc535b5b4d9 100644
>> --- a/drivers/virtio/Makefile
>> +++ b/drivers/virtio/Makefile
>> @@ -5,7 +5,7 @@ obj-$(CONFIG_VIRTIO_PCI_LIB) += virtio_pci_modern_dev.o
>>   obj-$(CONFIG_VIRTIO_PCI_LIB_LEGACY) += virtio_pci_legacy_dev.o
>>   obj-$(CONFIG_VIRTIO_MMIO) += virtio_mmio.o
>>   obj-$(CONFIG_VIRTIO_PCI) += virtio_pci.o
>> -virtio_pci-y := virtio_pci_modern.o virtio_pci_common.o
>> +virtio_pci-y := virtio_pci_modern.o virtio_pci_common.o virtio_pci_modern_avq.o
>>   virtio_pci-$(CONFIG_VIRTIO_PCI_LEGACY) += virtio_pci_legacy.o
>>   obj-$(CONFIG_VIRTIO_BALLOON) += virtio_balloon.o
>>   obj-$(CONFIG_VIRTIO_INPUT) += virtio_input.o
>> diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c
>> index 3893dc29eb26..f4080692b351 100644
>> --- a/drivers/virtio/virtio.c
>> +++ b/drivers/virtio/virtio.c
>> @@ -302,9 +302,15 @@ static int virtio_dev_probe(struct device *_d)
>>        if (err)
>>                goto err;
>>
>> +     if (dev->config->create_avq) {
>> +             err = dev->config->create_avq(dev);
>> +             if (err)
>> +                     goto err;
>> +     }
>> +
>>        err = drv->probe(dev);
>>        if (err)
>> -             goto err;
>> +             goto err_probe;
>>
>>        /* If probe didn't do it, mark device DRIVER_OK ourselves. */
>>        if (!(dev->config->get_status(dev) & VIRTIO_CONFIG_S_DRIVER_OK))
>> @@ -316,6 +322,10 @@ static int virtio_dev_probe(struct device *_d)
>>        virtio_config_enable(dev);
>>
>>        return 0;
>> +
>> +err_probe:
>> +     if (dev->config->destroy_avq)
>> +             dev->config->destroy_avq(dev);
>>   err:
>>        virtio_add_status(dev, VIRTIO_CONFIG_S_FAILED);
>>        return err;
>> @@ -331,6 +341,9 @@ static void virtio_dev_remove(struct device *_d)
>>
>>        drv->remove(dev);
>>
>> +     if (dev->config->destroy_avq)
>> +             dev->config->destroy_avq(dev);
>> +
>>        /* Driver should have reset device. */
>>        WARN_ON_ONCE(dev->config->get_status(dev));
>>
>> @@ -489,13 +502,20 @@ EXPORT_SYMBOL_GPL(unregister_virtio_device);
>>   int virtio_device_freeze(struct virtio_device *dev)
>>   {
>>        struct virtio_driver *drv = drv_to_virtio(dev->dev.driver);
>> +     int ret;
>>
>>        virtio_config_disable(dev);
>>
>>        dev->failed = dev->config->get_status(dev) & VIRTIO_CONFIG_S_FAILED;
>>
>> -     if (drv && drv->freeze)
>> -             return drv->freeze(dev);
>> +     if (drv && drv->freeze) {
>> +             ret = drv->freeze(dev);
>> +             if (ret)
>> +                     return ret;
>> +     }
>> +
>> +     if (dev->config->destroy_avq)
>> +             dev->config->destroy_avq(dev);
>>
>>        return 0;
>>   }
>> @@ -532,10 +552,16 @@ int virtio_device_restore(struct virtio_device *dev)
>>        if (ret)
>>                goto err;
>>
>> +     if (dev->config->create_avq) {
>> +             ret = dev->config->create_avq(dev);
>> +             if (ret)
>> +                     goto err;
>> +     }
>> +
>>        if (drv->restore) {
>>                ret = drv->restore(dev);
>>                if (ret)
>> -                     goto err;
>> +                     goto err_restore;
>>        }
>>
>>        /* If restore didn't do it, mark device DRIVER_OK ourselves. */
>> @@ -546,6 +572,9 @@ int virtio_device_restore(struct virtio_device *dev)
>>
>>        return 0;
>>
>> +err_restore:
>> +     if (dev->config->destroy_avq)
>> +             dev->config->destroy_avq(dev);
>>   err:
>>        virtio_add_status(dev, VIRTIO_CONFIG_S_FAILED);
>>        return ret;
>> diff --git a/drivers/virtio/virtio_pci_common.h b/drivers/virtio/virtio_pci_common.h
>> index 602021967aaa..9bffa95274b6 100644
>> --- a/drivers/virtio/virtio_pci_common.h
>> +++ b/drivers/virtio/virtio_pci_common.h
>> @@ -41,6 +41,14 @@ struct virtio_pci_vq_info {
>>        unsigned int msix_vector;
>>   };
>>
>> +struct virtio_avq {
> 
> admin_vq would be better. and this is pci specific yes? so virtio_pci_
> 

Will do.

>> +     /* Virtqueue info associated with this admin queue. */
>> +     struct virtio_pci_vq_info info;
>> +     /* Name of the admin queue: avq.$index. */
>> +     char name[10];
>> +     u16 vq_index;
>> +};
>> +
>>   /* Our device structure */
>>   struct virtio_pci_device {
>>        struct virtio_device vdev;
>> @@ -58,10 +66,13 @@ struct virtio_pci_device {
>>        spinlock_t lock;
>>        struct list_head virtqueues;
>>
>> -     /* array of all queues for house-keeping */
>> +     /* Array of all virtqueues reported in the
>> +      * PCI common config num_queues field
>> +      */
>>        struct virtio_pci_vq_info **vqs;
>>        u32 nvqs;
>>
>> +     struct virtio_avq *admin;
> 
> and this could be thinkably admin_vq.
> 
Will do.

>>        /* MSI-X support */
>>        int msix_enabled;
>>        int intx_enabled;
>> @@ -115,6 +126,8 @@ int vp_find_vqs(struct virtio_device *vdev, unsigned int nvqs,
>>                const char * const names[], const bool *ctx,
>>                struct irq_affinity *desc);
>>   const char *vp_bus_name(struct virtio_device *vdev);
>> +void vp_destroy_avq(struct virtio_device *vdev);
>> +int vp_create_avq(struct virtio_device *vdev);
>>
>>   /* Setup the affinity for a virtqueue:
>>    * - force the affinity for per vq vector
>> diff --git a/drivers/virtio/virtio_pci_modern.c b/drivers/virtio/virtio_pci_modern.c
>> index d6bb68ba84e5..a72c87687196 100644
>> --- a/drivers/virtio/virtio_pci_modern.c
>> +++ b/drivers/virtio/virtio_pci_modern.c
>> @@ -37,6 +37,9 @@ static void vp_transport_features(struct virtio_device *vdev, u64 features)
>>
>>        if (features & BIT_ULL(VIRTIO_F_RING_RESET))
>>                __virtio_set_bit(vdev, VIRTIO_F_RING_RESET);
>> +
>> +     if (features & BIT_ULL(VIRTIO_F_ADMIN_VQ))
>> +             __virtio_set_bit(vdev, VIRTIO_F_ADMIN_VQ);
>>   }
>>
>>   /* virtio config->finalize_features() implementation */
>> @@ -317,7 +320,8 @@ static struct virtqueue *setup_vq(struct virtio_pci_device *vp_dev,
>>        else
>>                notify = vp_notify;
>>
>> -     if (index >= vp_modern_get_num_queues(mdev))
>> +     if (!((index < vp_modern_get_num_queues(mdev) ||
>> +           (vp_dev->admin && vp_dev->admin->vq_index == index))))
>>                return ERR_PTR(-EINVAL);
>>
>>        /* Check if queue is either not available or already active. */
>> @@ -509,6 +513,8 @@ static const struct virtio_config_ops virtio_pci_config_nodev_ops = {
>>        .get_shm_region  = vp_get_shm_region,
>>        .disable_vq_and_reset = vp_modern_disable_vq_and_reset,
>>        .enable_vq_after_reset = vp_modern_enable_vq_after_reset,
>> +     .create_avq = vp_create_avq,
>> +     .destroy_avq = vp_destroy_avq,
>>   };
>>
>>   static const struct virtio_config_ops virtio_pci_config_ops = {
>> @@ -529,6 +535,8 @@ static const struct virtio_config_ops virtio_pci_config_ops = {
>>        .get_shm_region  = vp_get_shm_region,
>>        .disable_vq_and_reset = vp_modern_disable_vq_and_reset,
>>        .enable_vq_after_reset = vp_modern_enable_vq_after_reset,
>> +     .create_avq = vp_create_avq,
>> +     .destroy_avq = vp_destroy_avq,
>>   };
>>
>>   /* the PCI probing function */
>> diff --git a/drivers/virtio/virtio_pci_modern_avq.c b/drivers/virtio/virtio_pci_modern_avq.c
>> new file mode 100644
>> index 000000000000..114579ad788f
>> --- /dev/null
>> +++ b/drivers/virtio/virtio_pci_modern_avq.c
>> @@ -0,0 +1,65 @@
>> +// SPDX-License-Identifier: GPL-2.0-or-later
>> +
>> +#include <linux/virtio.h>
>> +#include "virtio_pci_common.h"
>> +
>> +static u16 vp_modern_avq_num(struct virtio_pci_modern_device *mdev)
>> +{
>> +     struct virtio_pci_modern_common_cfg __iomem *cfg;
>> +
>> +     cfg = (struct virtio_pci_modern_common_cfg __iomem *)mdev->common;
>> +     return vp_ioread16(&cfg->admin_queue_num);
>> +}
>> +
>> +static u16 vp_modern_avq_index(struct virtio_pci_modern_device *mdev)
>> +{
>> +     struct virtio_pci_modern_common_cfg __iomem *cfg;
>> +
>> +     cfg = (struct virtio_pci_modern_common_cfg __iomem *)mdev->common;
>> +     return vp_ioread16(&cfg->admin_queue_index);
>> +}
>> +
>> +int vp_create_avq(struct virtio_device *vdev)
>> +{
>> +     struct virtio_pci_device *vp_dev = to_vp_device(vdev);
>> +     struct virtio_avq *avq;
>> +     struct virtqueue *vq;
>> +     u16 admin_q_num;
>> +
>> +     if (!virtio_has_feature(vdev, VIRTIO_F_ADMIN_VQ))
>> +             return 0;
>> +
>> +     admin_q_num = vp_modern_avq_num(&vp_dev->mdev);
>> +     if (!admin_q_num)
>> +             return -EINVAL;
>> +
>> +     vp_dev->admin = kzalloc(sizeof(*vp_dev->admin), GFP_KERNEL);
>> +     if (!vp_dev->admin)
>> +             return -ENOMEM;
>> +
>> +     avq = vp_dev->admin;
>> +     avq->vq_index = vp_modern_avq_index(&vp_dev->mdev);
>> +     sprintf(avq->name, "avq.%u", avq->vq_index);
>> +     vq = vp_dev->setup_vq(vp_dev, &vp_dev->admin->info, avq->vq_index, NULL,
>> +                           avq->name, NULL, VIRTIO_MSI_NO_VECTOR);
>> +     if (IS_ERR(vq)) {
>> +             dev_err(&vdev->dev, "failed to setup admin virtqueue");
>> +             kfree(vp_dev->admin);
>> +             return PTR_ERR(vq);
>> +     }
>> +
>> +     vp_dev->admin->info.vq = vq;
>> +     vp_modern_set_queue_enable(&vp_dev->mdev, avq->info.vq->index, true);
>> +     return 0;
>> +}
>> +
>> +void vp_destroy_avq(struct virtio_device *vdev)
>> +{
>> +     struct virtio_pci_device *vp_dev = to_vp_device(vdev);
>> +
>> +     if (!vp_dev->admin)
>> +             return;
>> +
>> +     vp_dev->del_vq(&vp_dev->admin->info);
>> +     kfree(vp_dev->admin);
>> +}
>> diff --git a/include/linux/virtio_config.h b/include/linux/virtio_config.h
>> index 2b3438de2c4d..028c51ea90ee 100644
>> --- a/include/linux/virtio_config.h
>> +++ b/include/linux/virtio_config.h
>> @@ -93,6 +93,8 @@ typedef void vq_callback_t(struct virtqueue *);
>>    *   Returns 0 on success or error status
>>    *   If disable_vq_and_reset is set, then enable_vq_after_reset must also be
>>    *   set.
>> + * @create_avq: initialize admin virtqueue resource.
>> + * @destroy_avq: destroy admin virtqueue resource.
>>    */
>>   struct virtio_config_ops {
>>        void (*get)(struct virtio_device *vdev, unsigned offset,
>> @@ -120,6 +122,8 @@ struct virtio_config_ops {
>>                               struct virtio_shm_region *region, u8 id);
>>        int (*disable_vq_and_reset)(struct virtqueue *vq);
>>        int (*enable_vq_after_reset)(struct virtqueue *vq);
>> +     int (*create_avq)(struct virtio_device *vdev);
>> +     void (*destroy_avq)(struct virtio_device *vdev);
>>   };
>>
>>   /* If driver didn't advertise the feature, it will never appear. */
>> diff --git a/include/linux/virtio_pci_modern.h b/include/linux/virtio_pci_modern.h
>> index 067ac1d789bc..f6cb13d858fd 100644
>> --- a/include/linux/virtio_pci_modern.h
>> +++ b/include/linux/virtio_pci_modern.h
>> @@ -10,6 +10,9 @@ struct virtio_pci_modern_common_cfg {
>>
>>        __le16 queue_notify_data;       /* read-write */
>>        __le16 queue_reset;             /* read-write */
>> +
>> +     __le16 admin_queue_index;       /* read-only */
>> +     __le16 admin_queue_num;         /* read-only */
>>   };
> 
> 
> ouch.
> actually there's a problem
> 
>          mdev->common = vp_modern_map_capability(mdev, common,
>                                        sizeof(struct virtio_pci_common_cfg), 4,
>                                        0, sizeof(struct virtio_pci_common_cfg),
>                                        NULL, NULL);
> 
> extending this structure means some calls will start failing on
> existing devices.
> 
> even more of an ouch, when we added queue_notify_data and queue_reset we
> also possibly broke some devices. well hopefully not since no one
> reported failures but we really need to fix that.
> 
> 
Hi Michael

I didn't see the failure in vp_modern_map_capability();
vp_modern_map_capability() only reads and maps PCI memory. The length
of the memory mapping will increase as struct virtio_pci_common_cfg
increases. No errors are seen.

And according to the existing code, new PCI configuration space
members can only be added in struct virtio_pci_modern_common_cfg.

Every single entry added here is protected by a feature bit, so there
is no bug AFAIK.

Could you explain in more detail where and why it will fail if we add
a new member to struct virtio_pci_modern_common_cfg?


>>
>>   struct virtio_pci_modern_device {
>> --
>> 2.27.0
>
Feng Liu Sept. 27, 2023, 6:12 p.m. UTC | #72
On 2023-09-26 p.m.3:23, Feng Liu via Virtualization wrote:
> 
> On 2023-09-21 a.m.9:57, Michael S. Tsirkin wrote:
>>
>> On Thu, Sep 21, 2023 at 03:40:32PM +0300, Yishai Hadas wrote:
>>> From: Feng Liu <feliu@nvidia.com>


>>>   drivers/virtio/virtio_pci_modern_avq.c | 65 ++++++++++++++++++++++++++
>>
>> if you have a .c file without a .h file you know there's something
>> fishy. Just add this inside drivers/virtio/virtio_pci_modern.c ?
>>
> Will do.
> 

>>> +struct virtio_avq {
>>
>> admin_vq would be better. and this is pci specific yes? so virtio_pci_
>>
> 
> Will do.
> 

>>>
>>> +     struct virtio_avq *admin;
>>
>> and this could be thinkably admin_vq.
>>
> Will do.
> 

>>>
>>>   /* If driver didn't advertise the feature, it will never appear. */
>>> diff --git a/include/linux/virtio_pci_modern.h 
>>> b/include/linux/virtio_pci_modern.h
>>> index 067ac1d789bc..f6cb13d858fd 100644
>>> --- a/include/linux/virtio_pci_modern.h
>>> +++ b/include/linux/virtio_pci_modern.h
>>> @@ -10,6 +10,9 @@ struct virtio_pci_modern_common_cfg {
>>>
>>>        __le16 queue_notify_data;       /* read-write */
>>>        __le16 queue_reset;             /* read-write */
>>> +
>>> +     __le16 admin_queue_index;       /* read-only */
>>> +     __le16 admin_queue_num;         /* read-only */
>>>   };
>>
>>
>> ouch.
>> actually there's a problem
>>
>>          mdev->common = vp_modern_map_capability(mdev, common,
>>                                        sizeof(struct 
>> virtio_pci_common_cfg), 4,
>>                                        0, sizeof(struct 
>> virtio_pci_common_cfg),
>>                                        NULL, NULL);
>>
>> extending this structure means some calls will start failing on
>> existing devices.
>>
>> even more of an ouch, when we added queue_notify_data and queue_reset we
>> also possibly broke some devices. well hopefully not since no one
>> reported failures but we really need to fix that.
>>
>>
> Hi Michael
> 
> I didn't see the failure in vp_modern_map_capability();
> vp_modern_map_capability() only reads and maps PCI memory. The length
> of the memory mapping will increase as struct virtio_pci_common_cfg
> increases. No errors are seen.
> 
> And according to the existing code, new PCI configuration space
> members can only be added in struct virtio_pci_modern_common_cfg.
> 
> Every single entry added here is protected by a feature bit, so there
> is no bug AFAIK.
> 
> Could you explain in more detail where and why it will fail if we add
> a new member to struct virtio_pci_modern_common_cfg?
> 
> 
Hi, Michael
	Any comments about this?
Thanks
Feng

>>>
>>>   struct virtio_pci_modern_device {
>>> -- 
>>> 2.27.0
>>
Michael S. Tsirkin Sept. 27, 2023, 9:27 p.m. UTC | #73
On Wed, Sep 27, 2023 at 02:12:24PM -0400, Feng Liu wrote:
> 
> 
> On 2023-09-26 p.m.3:23, Feng Liu via Virtualization wrote:
> > 
> > On 2023-09-21 a.m.9:57, Michael S. Tsirkin wrote:
> > > 
> > > On Thu, Sep 21, 2023 at 03:40:32PM +0300, Yishai Hadas wrote:
> > > > From: Feng Liu <feliu@nvidia.com>
> 
> 
> > > >   drivers/virtio/virtio_pci_modern_avq.c | 65 ++++++++++++++++++++++++++
> > > 
> > > if you have a .c file without a .h file you know there's something
> > > fishy. Just add this inside drivers/virtio/virtio_pci_modern.c ?
> > > 
> > Will do.
> > 
> 
> > > > +struct virtio_avq {
> > > 
> > > admin_vq would be better. and this is pci specific yes? so virtio_pci_
> > > 
> > 
> > Will do.
> > 
> 
> > > > 
> > > > +     struct virtio_avq *admin;
> > > 
> > > and this could be thinkably admin_vq.
> > > 
> > Will do.
> > 
> 
> > > > 
> > > >   /* If driver didn't advertise the feature, it will never appear. */
> > > > diff --git a/include/linux/virtio_pci_modern.h
> > > > b/include/linux/virtio_pci_modern.h
> > > > index 067ac1d789bc..f6cb13d858fd 100644
> > > > --- a/include/linux/virtio_pci_modern.h
> > > > +++ b/include/linux/virtio_pci_modern.h
> > > > @@ -10,6 +10,9 @@ struct virtio_pci_modern_common_cfg {
> > > > 
> > > >        __le16 queue_notify_data;       /* read-write */
> > > >        __le16 queue_reset;             /* read-write */
> > > > +
> > > > +     __le16 admin_queue_index;       /* read-only */
> > > > +     __le16 admin_queue_num;         /* read-only */
> > > >   };
> > > 
> > > 
> > > ouch.
> > > actually there's a problem
> > > 
> > >          mdev->common = vp_modern_map_capability(mdev, common,
> > >                                        sizeof(struct
> > > virtio_pci_common_cfg), 4,
> > >                                        0, sizeof(struct
> > > virtio_pci_common_cfg),
> > >                                        NULL, NULL);
> > > 
> > > extending this structure means some calls will start failing on
> > > existing devices.
> > > 
> > > even more of an ouch, when we added queue_notify_data and queue_reset we
> > > also possibly broke some devices. well hopefully not since no one
> > > reported failures but we really need to fix that.
> > > 
> > > 
> > Hi Michael
> > 
> > I didn't see the failure in vp_modern_map_capability();
> > vp_modern_map_capability() only reads and maps PCI memory. The length
> > of the memory mapping will increase as struct virtio_pci_common_cfg
> > increases. No errors are seen.
> > 
> > And according to the existing code, new PCI configuration space
> > members can only be added in struct virtio_pci_modern_common_cfg.
> > 
> > Every single entry added here is protected by a feature bit, so there
> > is no bug AFAIK.
> > 
> > Could you explain in more detail where and why it will fail if we add
> > a new member to struct virtio_pci_modern_common_cfg?
> > 
> > 
> Hi, Michael
> 	Any comments about this?
> Thanks
> Feng

If an existing device exposes a small capability matching the old
size and you then change the size, the check will fail on that
existing device and the driver won't load.

All this happens way before the feature bit checks.
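
As a rough sketch of one way to cope with this (illustration only, not
code from this series): keep the required minimum at the original
struct virtio_pci_common_cfg size when mapping the capability, request
sizeof(struct virtio_pci_modern_common_cfg) as the desired size, and
record the length that vp_modern_map_capability() returns through its
len parameter (called common_len below; the name is assumed here). The
new admin queue registers are then read only when the mapped length
actually covers them:

#include <linux/stddef.h>		/* offsetofend() */
#include <linux/virtio_pci_modern.h>	/* vp_ioread16(), struct layouts */

/* Sketch only: read admin_queue_num only if it was actually mapped. */
static u16 vp_modern_avq_num_checked(struct virtio_pci_modern_device *mdev,
				     size_t common_len)
{
	struct virtio_pci_modern_common_cfg __iomem *cfg =
		(struct virtio_pci_modern_common_cfg __iomem *)mdev->common;

	if (common_len < offsetofend(struct virtio_pci_modern_common_cfg,
				     admin_queue_num))
		return 0;	/* device exposes only the smaller capability */

	return vp_ioread16(&cfg->admin_queue_num);
}

That way a device exposing only the smaller common cfg capability still
probes, and the feature bit remains the gate for actually using the
admin virtqueue.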


> > > > 
> > > >   struct virtio_pci_modern_device {
> > > > -- 
> > > > 2.27.0
> > > 
Michael S. Tsirkin Sept. 27, 2023, 9:38 p.m. UTC | #74
On Tue, Sep 26, 2023 at 10:50:57AM -0300, Jason Gunthorpe wrote:
> On Tue, Sep 26, 2023 at 01:42:52AM -0400, Michael S. Tsirkin wrote:
> > On Mon, Sep 25, 2023 at 09:40:59PM -0300, Jason Gunthorpe wrote:
> > > On Mon, Sep 25, 2023 at 03:44:11PM -0400, Michael S. Tsirkin wrote:
> > > > > VDPA is very different from this. You might call them both mediation,
> > > > > sure, but then you need another word to describe the additional
> > > > > changes VPDA is doing.
> > > > 
> > > > Sorry about hijacking the thread a little bit, but could you
> > > > call out some of the changes that are the most problematic
> > > > for you?
> > > 
> > > I don't really know these details.
> > 
> > Maybe, you then should desist from saying things like "It entirely fails
> > to achieve the most important thing it needs to do!" You are not making
> > any new friends with saying this about a piece of software without
> > knowing the details.
> 
> I can't tell you what cloud operators are doing, but I can say with
> confidence that it is not the same as VDPA. As I said, if you want to
> know more details you need to ask a cloud operator.
> 
> Jason

So it's not the changes that are problematic, it's that you have
customers who are not using vdpa. The "most important thing" that vdpa
fails at is simply converting your customers from vfio to vdpa.
Jason Gunthorpe Sept. 27, 2023, 11:20 p.m. UTC | #75
On Wed, Sep 27, 2023 at 05:38:55PM -0400, Michael S. Tsirkin wrote:
> On Tue, Sep 26, 2023 at 10:50:57AM -0300, Jason Gunthorpe wrote:
> > On Tue, Sep 26, 2023 at 01:42:52AM -0400, Michael S. Tsirkin wrote:
> > > On Mon, Sep 25, 2023 at 09:40:59PM -0300, Jason Gunthorpe wrote:
> > > > On Mon, Sep 25, 2023 at 03:44:11PM -0400, Michael S. Tsirkin wrote:
> > > > > > VDPA is very different from this. You might call them both mediation,
> > > > > > sure, but then you need another word to describe the additional
> > > > > > changes VPDA is doing.
> > > > > 
> > > > > Sorry about hijacking the thread a little bit, but could you
> > > > > call out some of the changes that are the most problematic
> > > > > for you?
> > > > 
> > > > I don't really know these details.
> > > 
> > > Maybe, you then should desist from saying things like "It entirely fails
> > > to achieve the most important thing it needs to do!" You are not making
> > > any new friends with saying this about a piece of software without
> > > knowing the details.
> > 
> > I can't tell you what cloud operators are doing, but I can say with
> > confidence that it is not the same as VDPA. As I said, if you want to
> > know more details you need to ask a cloud operator.
>
> So it's not the changes that are problematic, it's that you have
> customers who are not using vdpa. The "most important thing" that vdpa
> fails at is simply converting your customers from vfio to vdpa.

I said the most important thing was that VFIO presents exactly the
same virtio device to the VM as the baremetal. Do you dispute that,
technically, VDPA does not actually achieve that?

Then why is it so surprising that people don't want a solution that
changes the vPCI ABI they worked hard to create in the first place?

I'm still baffled why you think everyone should use vdpa..

Jason
Michael S. Tsirkin Sept. 28, 2023, 5:31 a.m. UTC | #76
On Wed, Sep 27, 2023 at 08:20:05PM -0300, Jason Gunthorpe wrote:
> On Wed, Sep 27, 2023 at 05:38:55PM -0400, Michael S. Tsirkin wrote:
> > On Tue, Sep 26, 2023 at 10:50:57AM -0300, Jason Gunthorpe wrote:
> > > On Tue, Sep 26, 2023 at 01:42:52AM -0400, Michael S. Tsirkin wrote:
> > > > On Mon, Sep 25, 2023 at 09:40:59PM -0300, Jason Gunthorpe wrote:
> > > > > On Mon, Sep 25, 2023 at 03:44:11PM -0400, Michael S. Tsirkin wrote:
> > > > > > > VDPA is very different from this. You might call them both mediation,
> > > > > > > sure, but then you need another word to describe the additional
> > > > > > > changes VPDA is doing.
> > > > > > 
> > > > > > Sorry about hijacking the thread a little bit, but could you
> > > > > > call out some of the changes that are the most problematic
> > > > > > for you?
> > > > > 
> > > > > I don't really know these details.
> > > > 
> > > > Maybe, you then should desist from saying things like "It entirely fails
> > > > to achieve the most important thing it needs to do!" You are not making
> > > > any new friends with saying this about a piece of software without
> > > > knowing the details.
> > > 
> > > I can't tell you what cloud operators are doing, but I can say with
> > > confidence that it is not the same as VDPA. As I said, if you want to
> > > know more details you need to ask a cloud operator.
> >
> > So it's not the changes that are problematic, it's that you have
> > customers who are not using vdpa. The "most important thing" that vdpa
> > fails at is simply converting your customers from vfio to vdpa.
> 
> I said the most important thing was that VFIO presents exactly the
> same virtio device to the VM as the baremetal. Do you dispute that,
> technically, VDPA does not actually achieve that?

I dispute that it is the most important. The important thing is to have
guests work.

> Then why is it so surprising that people don't want a solution that
> changes the vPCI ABI they worked hard to create in the first place?
> 
> I'm still baffled why you think everyone should use vdpa..
> 
> Jason

They shouldn't. If you want proprietary extensions then vfio is the way
to go; I don't think vdpa will support that.
Feng Liu Oct. 2, 2023, 6:07 p.m. UTC | #77
On 2023-09-27 p.m.5:27, Michael S. Tsirkin wrote:

> 
> If an existing device exposes a small capability matching the old
> size and you then change the size, the check will fail on that
> existing device and the driver won't load.
> 
> All this happens way before the feature bit checks.
> 
> 
Will do

Thanks
Feng

>>>>>
>>>>>    struct virtio_pci_modern_device {
>>>>> --
>>>>> 2.27.0
>>>>
>
Jason Wang Oct. 8, 2023, 4:28 a.m. UTC | #78
On Tue, Sep 26, 2023 at 7:49 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Tue, Sep 26, 2023 at 10:32:39AM +0800, Jason Wang wrote:
> > It's the implementation details in legacy. The device needs to make
> > sure (reset) the driver can work (is done before get_status return).
>
> I think that there's no way to make it reliably work for all legacy drivers.

Yes, we may have ancient drivers.

>
> They just assumed a software backend and did not bother with DMA
> ordering. You can try to avoid resets, they are not that common so
> things will tend to mostly work if you don't stress them to much with
> things like hot plug/unplug in a loop.  Or you can try to use a driver
> after 2011 which is more aware of hardware ordering and flushes the
> reset write with a read.  One of these two tricks, I think, is the magic
> behind the device exposing memory bar 0 that you mention.

Right, this is what I see for hardware legacy devices shipped by some
cloud vendors.

Thanks

>
> --
> MST
>
Michael S. Tsirkin Oct. 12, 2023, 10:52 a.m. UTC | #79
On Tue, Sep 26, 2023 at 03:45:36AM +0000, Parav Pandit wrote:
> 
> 
> > From: Michael S. Tsirkin <mst@redhat.com>
> > Sent: Tuesday, September 26, 2023 12:06 AM
> 
> > One can thinkably do that wait in hardware, though. Just defer completion until
> > read is done.
> >
> Once OASIS does such new interface and if some hw vendor _actually_ wants to do such complex hw, may be vfio driver can adopt to it.

The reset behaviour I describe is already in the spec. What else do you
want OASIS to standardize? Virtio is currently just a register map; it
does not yet include suggestions on how exactly PCI Express
transactions should look. Do you feel we should add that?
Parav Pandit Oct. 12, 2023, 11:11 a.m. UTC | #80
> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Thursday, October 12, 2023 4:23 PM
> 
> On Tue, Sep 26, 2023 at 03:45:36AM +0000, Parav Pandit wrote:
> >
> >
> > > From: Michael S. Tsirkin <mst@redhat.com>
> > > Sent: Tuesday, September 26, 2023 12:06 AM
> >
> > > One can thinkably do that wait in hardware, though. Just defer
> > > completion until read is done.
> > >
> > Once OASIS does such new interface and if some hw vendor _actually_ wants
> to do such complex hw, may be vfio driver can adopt to it.
> 
> The reset behaviour I describe is already in the spec. What else do you want
> OASIS to standardize? Virtio is currently just a register map; it does not yet
> include suggestions on how exactly PCI Express transactions should look. Do
> you feel we should add that?

The reset behavior in the spec for modern devices, as listed in [1] and [2], is just fine.

What I meant is in the context of having MMIO-based legacy registers "defer completion until the read is done".
I think you meant, "Just defer the read completion until the reset is done".
This means the hw needs to finish the device reset for thousands of devices within the PCI read completion timeout.
So if OASIS does such standardization, someone can implement it.

What I recollect is that OASIS did not standardize such an anti-scale approach and instead took the admin command approach, which achieves better scale.
Hope I clarified.

I am not expecting OASIS to do anything extra for legacy registers.

[1] The device MUST reset when 0 is written to device_status, and present a 0 in device_status once that is done.
[2] After writing 0 to device_status, the driver MUST wait for a read of device_status to return 0 before reinitializing
the device.
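
To illustrate the difference being discussed, a minimal sketch (not
part of this series): a modern driver follows [1] and [2] by writing 0
to device_status and then polling until the device reports 0 back,
while an old legacy guest driver may simply write 0 to the legacy
status register and continue without any read-back.

#include <linux/delay.h>		/* msleep() */
#include <linux/virtio_pci_modern.h>	/* vp_modern_set_status()/_get_status() */

/* Sketch of the modern reset flow per [1]/[2]. */
static void modern_reset_and_wait(struct virtio_pci_modern_device *mdev)
{
	vp_modern_set_status(mdev, 0);		/* [1]: device starts the reset */
	while (vp_modern_get_status(mdev))	/* [2]: wait until 0 reads back */
		msleep(1);
}

/*
 * An old legacy guest driver, by contrast, may just do the equivalent of
 * iowrite8(0, ioaddr + VIRTIO_PCI_STATUS) and continue immediately,
 * without ever reading device_status back.
 */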
Michael S. Tsirkin Oct. 12, 2023, 11:30 a.m. UTC | #81
On Thu, Oct 12, 2023 at 11:11:20AM +0000, Parav Pandit wrote:
> 
> > From: Michael S. Tsirkin <mst@redhat.com>
> > Sent: Thursday, October 12, 2023 4:23 PM
> > 
> > On Tue, Sep 26, 2023 at 03:45:36AM +0000, Parav Pandit wrote:
> > >
> > >
> > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > Sent: Tuesday, September 26, 2023 12:06 AM
> > >
> > > > One can thinkably do that wait in hardware, though. Just defer
> > > > completion until read is done.
> > > >
> > > Once OASIS does such new interface and if some hw vendor _actually_ wants
> > to do such complex hw, may be vfio driver can adopt to it.
> > 
> > The reset behaviour I describe is already in the spec. What else do you want
> > OASIS to standardize? Virtio is currently just a register map; it does not yet
> > include suggestions on how exactly PCI Express transactions should look. Do
> > you feel we should add that?
> 
> The reset behavior in the spec for modern devices, as listed in [1] and [2], is just fine.
> 
> What I meant is in the context of having MMIO-based legacy registers "defer completion until the read is done".
> I think you meant, "Just defer the read completion until the reset is done".

yes

> This means the hw needs to finish the device reset for thousands of devices within the PCI read completion timeout.

No, each device does its own reset.

> So if OASIS does such standardization, someone can implement it.
> 
> What I recollect is that OASIS did not standardize such an anti-scale approach and instead took the admin command approach, which achieves better scale.
> Hope I clarified.

You are talking about the extension for trap and emulate.
I am instead talking about devices that work with
existing legacy linux drivers with no traps.

> I am not expecting OASIS to do anything extra for legacy registers.
> 
> [1] The device MUST reset when 0 is written to device_status, and present a 0 in device_status once that is done.
> [2] After writing 0 to device_status, the driver MUST wait for a read of device_status to return 0 before reinitializing
> the device.

We can add a note explaining that legacy drivers do not wait
after doing reset, that is not a problem.
If someone wants to make a device that works with existing
legacy linux drivers, they can do that.
Won't work with all drivers though, which is why oasis did not
want to standardize this.
Parav Pandit Oct. 12, 2023, 11:40 a.m. UTC | #82
> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Thursday, October 12, 2023 5:00 PM

> I am instead talking about devices that work with existing legacy linux drivers
> with no traps.
> 
Yep, I understood.

> > I am not expecting OASIS to do anything extra for legacy registers.
> >
> > [1] The device MUST reset when 0 is written to device_status, and present a 0
> in device_status once that is done.
> > [2] After writing 0 to device_status, the driver MUST wait for a read
> > of device_status to return 0 before reinitializing the device.
> 
> We can add a note explaining that legacy drivers do not wait after doing reset,
> that is not a problem.
> If someone wants to make a device that works with existing legacy linux drivers,
> they can do that.
> Won't work with all drivers though, which is why oasis did not want to
> standardize this.

Ok. thanks.