Message ID: 20230921124040.145386-1-yishaih@nvidia.com
Series: Introduce a vfio driver over virtio devices
On Thu, Sep 21, 2023 at 03:40:32PM +0300, Yishai Hadas wrote: > From: Feng Liu <feliu@nvidia.com> > > Introduce support for the admin virtqueue. By negotiating > VIRTIO_F_ADMIN_VQ feature, driver detects capability and creates one > administration virtqueue. Administration virtqueue implementation in > virtio pci generic layer, enables multiple types of upper layer > drivers such as vfio, net, blk to utilize it. > > Signed-off-by: Feng Liu <feliu@nvidia.com> > Reviewed-by: Parav Pandit <parav@nvidia.com> > Reviewed-by: Jiri Pirko <jiri@nvidia.com> > Signed-off-by: Yishai Hadas <yishaih@nvidia.com> > --- > drivers/virtio/Makefile | 2 +- > drivers/virtio/virtio.c | 37 +++++++++++++-- > drivers/virtio/virtio_pci_common.h | 15 +++++- > drivers/virtio/virtio_pci_modern.c | 10 +++- > drivers/virtio/virtio_pci_modern_avq.c | 65 ++++++++++++++++++++++++++ if you have a .c file without a .h file you know there's something fishy. Just add this inside drivers/virtio/virtio_pci_modern.c ? > include/linux/virtio_config.h | 4 ++ > include/linux/virtio_pci_modern.h | 3 ++ > 7 files changed, 129 insertions(+), 7 deletions(-) > create mode 100644 drivers/virtio/virtio_pci_modern_avq.c > > diff --git a/drivers/virtio/Makefile b/drivers/virtio/Makefile > index 8e98d24917cc..dcc535b5b4d9 100644 > --- a/drivers/virtio/Makefile > +++ b/drivers/virtio/Makefile > @@ -5,7 +5,7 @@ obj-$(CONFIG_VIRTIO_PCI_LIB) += virtio_pci_modern_dev.o > obj-$(CONFIG_VIRTIO_PCI_LIB_LEGACY) += virtio_pci_legacy_dev.o > obj-$(CONFIG_VIRTIO_MMIO) += virtio_mmio.o > obj-$(CONFIG_VIRTIO_PCI) += virtio_pci.o > -virtio_pci-y := virtio_pci_modern.o virtio_pci_common.o > +virtio_pci-y := virtio_pci_modern.o virtio_pci_common.o virtio_pci_modern_avq.o > virtio_pci-$(CONFIG_VIRTIO_PCI_LEGACY) += virtio_pci_legacy.o > obj-$(CONFIG_VIRTIO_BALLOON) += virtio_balloon.o > obj-$(CONFIG_VIRTIO_INPUT) += virtio_input.o > diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c > index 3893dc29eb26..f4080692b351 100644 > --- a/drivers/virtio/virtio.c > +++ b/drivers/virtio/virtio.c > @@ -302,9 +302,15 @@ static int virtio_dev_probe(struct device *_d) > if (err) > goto err; > > + if (dev->config->create_avq) { > + err = dev->config->create_avq(dev); > + if (err) > + goto err; > + } > + > err = drv->probe(dev); > if (err) > - goto err; > + goto err_probe; > > /* If probe didn't do it, mark device DRIVER_OK ourselves. */ > if (!(dev->config->get_status(dev) & VIRTIO_CONFIG_S_DRIVER_OK)) > @@ -316,6 +322,10 @@ static int virtio_dev_probe(struct device *_d) > virtio_config_enable(dev); > > return 0; > + > +err_probe: > + if (dev->config->destroy_avq) > + dev->config->destroy_avq(dev); > err: > virtio_add_status(dev, VIRTIO_CONFIG_S_FAILED); > return err; > @@ -331,6 +341,9 @@ static void virtio_dev_remove(struct device *_d) > > drv->remove(dev); > > + if (dev->config->destroy_avq) > + dev->config->destroy_avq(dev); > + > /* Driver should have reset device. 
*/ > WARN_ON_ONCE(dev->config->get_status(dev)); > > @@ -489,13 +502,20 @@ EXPORT_SYMBOL_GPL(unregister_virtio_device); > int virtio_device_freeze(struct virtio_device *dev) > { > struct virtio_driver *drv = drv_to_virtio(dev->dev.driver); > + int ret; > > virtio_config_disable(dev); > > dev->failed = dev->config->get_status(dev) & VIRTIO_CONFIG_S_FAILED; > > - if (drv && drv->freeze) > - return drv->freeze(dev); > + if (drv && drv->freeze) { > + ret = drv->freeze(dev); > + if (ret) > + return ret; > + } > + > + if (dev->config->destroy_avq) > + dev->config->destroy_avq(dev); > > return 0; > } > @@ -532,10 +552,16 @@ int virtio_device_restore(struct virtio_device *dev) > if (ret) > goto err; > > + if (dev->config->create_avq) { > + ret = dev->config->create_avq(dev); > + if (ret) > + goto err; > + } > + > if (drv->restore) { > ret = drv->restore(dev); > if (ret) > - goto err; > + goto err_restore; > } > > /* If restore didn't do it, mark device DRIVER_OK ourselves. */ > @@ -546,6 +572,9 @@ int virtio_device_restore(struct virtio_device *dev) > > return 0; > > +err_restore: > + if (dev->config->destroy_avq) > + dev->config->destroy_avq(dev); > err: > virtio_add_status(dev, VIRTIO_CONFIG_S_FAILED); > return ret; > diff --git a/drivers/virtio/virtio_pci_common.h b/drivers/virtio/virtio_pci_common.h > index 602021967aaa..9bffa95274b6 100644 > --- a/drivers/virtio/virtio_pci_common.h > +++ b/drivers/virtio/virtio_pci_common.h > @@ -41,6 +41,14 @@ struct virtio_pci_vq_info { > unsigned int msix_vector; > }; > > +struct virtio_avq { admin_vq would be better. and this is pci specific yes? so virtio_pci_ > + /* Virtqueue info associated with this admin queue. */ > + struct virtio_pci_vq_info info; > + /* Name of the admin queue: avq.$index. */ > + char name[10]; > + u16 vq_index; > +}; > + > /* Our device structure */ > struct virtio_pci_device { > struct virtio_device vdev; > @@ -58,10 +66,13 @@ struct virtio_pci_device { > spinlock_t lock; > struct list_head virtqueues; > > - /* array of all queues for house-keeping */ > + /* Array of all virtqueues reported in the > + * PCI common config num_queues field > + */ > struct virtio_pci_vq_info **vqs; > u32 nvqs; > > + struct virtio_avq *admin; and this could be thinkably admin_vq. 
> /* MSI-X support */ > int msix_enabled; > int intx_enabled; > @@ -115,6 +126,8 @@ int vp_find_vqs(struct virtio_device *vdev, unsigned int nvqs, > const char * const names[], const bool *ctx, > struct irq_affinity *desc); > const char *vp_bus_name(struct virtio_device *vdev); > +void vp_destroy_avq(struct virtio_device *vdev); > +int vp_create_avq(struct virtio_device *vdev); > > /* Setup the affinity for a virtqueue: > * - force the affinity for per vq vector > diff --git a/drivers/virtio/virtio_pci_modern.c b/drivers/virtio/virtio_pci_modern.c > index d6bb68ba84e5..a72c87687196 100644 > --- a/drivers/virtio/virtio_pci_modern.c > +++ b/drivers/virtio/virtio_pci_modern.c > @@ -37,6 +37,9 @@ static void vp_transport_features(struct virtio_device *vdev, u64 features) > > if (features & BIT_ULL(VIRTIO_F_RING_RESET)) > __virtio_set_bit(vdev, VIRTIO_F_RING_RESET); > + > + if (features & BIT_ULL(VIRTIO_F_ADMIN_VQ)) > + __virtio_set_bit(vdev, VIRTIO_F_ADMIN_VQ); > } > > /* virtio config->finalize_features() implementation */ > @@ -317,7 +320,8 @@ static struct virtqueue *setup_vq(struct virtio_pci_device *vp_dev, > else > notify = vp_notify; > > - if (index >= vp_modern_get_num_queues(mdev)) > + if (!((index < vp_modern_get_num_queues(mdev) || > + (vp_dev->admin && vp_dev->admin->vq_index == index)))) > return ERR_PTR(-EINVAL); > > /* Check if queue is either not available or already active. */ > @@ -509,6 +513,8 @@ static const struct virtio_config_ops virtio_pci_config_nodev_ops = { > .get_shm_region = vp_get_shm_region, > .disable_vq_and_reset = vp_modern_disable_vq_and_reset, > .enable_vq_after_reset = vp_modern_enable_vq_after_reset, > + .create_avq = vp_create_avq, > + .destroy_avq = vp_destroy_avq, > }; > > static const struct virtio_config_ops virtio_pci_config_ops = { > @@ -529,6 +535,8 @@ static const struct virtio_config_ops virtio_pci_config_ops = { > .get_shm_region = vp_get_shm_region, > .disable_vq_and_reset = vp_modern_disable_vq_and_reset, > .enable_vq_after_reset = vp_modern_enable_vq_after_reset, > + .create_avq = vp_create_avq, > + .destroy_avq = vp_destroy_avq, > }; > > /* the PCI probing function */ > diff --git a/drivers/virtio/virtio_pci_modern_avq.c b/drivers/virtio/virtio_pci_modern_avq.c > new file mode 100644 > index 000000000000..114579ad788f > --- /dev/null > +++ b/drivers/virtio/virtio_pci_modern_avq.c > @@ -0,0 +1,65 @@ > +// SPDX-License-Identifier: GPL-2.0-or-later > + > +#include <linux/virtio.h> > +#include "virtio_pci_common.h" > + > +static u16 vp_modern_avq_num(struct virtio_pci_modern_device *mdev) > +{ > + struct virtio_pci_modern_common_cfg __iomem *cfg; > + > + cfg = (struct virtio_pci_modern_common_cfg __iomem *)mdev->common; > + return vp_ioread16(&cfg->admin_queue_num); > +} > + > +static u16 vp_modern_avq_index(struct virtio_pci_modern_device *mdev) > +{ > + struct virtio_pci_modern_common_cfg __iomem *cfg; > + > + cfg = (struct virtio_pci_modern_common_cfg __iomem *)mdev->common; > + return vp_ioread16(&cfg->admin_queue_index); > +} > + > +int vp_create_avq(struct virtio_device *vdev) > +{ > + struct virtio_pci_device *vp_dev = to_vp_device(vdev); > + struct virtio_avq *avq; > + struct virtqueue *vq; > + u16 admin_q_num; > + > + if (!virtio_has_feature(vdev, VIRTIO_F_ADMIN_VQ)) > + return 0; > + > + admin_q_num = vp_modern_avq_num(&vp_dev->mdev); > + if (!admin_q_num) > + return -EINVAL; > + > + vp_dev->admin = kzalloc(sizeof(*vp_dev->admin), GFP_KERNEL); > + if (!vp_dev->admin) > + return -ENOMEM; > + > + avq = vp_dev->admin; > + avq->vq_index = 
vp_modern_avq_index(&vp_dev->mdev); > + sprintf(avq->name, "avq.%u", avq->vq_index); > + vq = vp_dev->setup_vq(vp_dev, &vp_dev->admin->info, avq->vq_index, NULL, > + avq->name, NULL, VIRTIO_MSI_NO_VECTOR); > + if (IS_ERR(vq)) { > + dev_err(&vdev->dev, "failed to setup admin virtqueue"); > + kfree(vp_dev->admin); > + return PTR_ERR(vq); > + } > + > + vp_dev->admin->info.vq = vq; > + vp_modern_set_queue_enable(&vp_dev->mdev, avq->info.vq->index, true); > + return 0; > +} > + > +void vp_destroy_avq(struct virtio_device *vdev) > +{ > + struct virtio_pci_device *vp_dev = to_vp_device(vdev); > + > + if (!vp_dev->admin) > + return; > + > + vp_dev->del_vq(&vp_dev->admin->info); > + kfree(vp_dev->admin); > +} > diff --git a/include/linux/virtio_config.h b/include/linux/virtio_config.h > index 2b3438de2c4d..028c51ea90ee 100644 > --- a/include/linux/virtio_config.h > +++ b/include/linux/virtio_config.h > @@ -93,6 +93,8 @@ typedef void vq_callback_t(struct virtqueue *); > * Returns 0 on success or error status > * If disable_vq_and_reset is set, then enable_vq_after_reset must also be > * set. > + * @create_avq: initialize admin virtqueue resource. > + * @destroy_avq: destroy admin virtqueue resource. > */ > struct virtio_config_ops { > void (*get)(struct virtio_device *vdev, unsigned offset, > @@ -120,6 +122,8 @@ struct virtio_config_ops { > struct virtio_shm_region *region, u8 id); > int (*disable_vq_and_reset)(struct virtqueue *vq); > int (*enable_vq_after_reset)(struct virtqueue *vq); > + int (*create_avq)(struct virtio_device *vdev); > + void (*destroy_avq)(struct virtio_device *vdev); > }; > > /* If driver didn't advertise the feature, it will never appear. */ > diff --git a/include/linux/virtio_pci_modern.h b/include/linux/virtio_pci_modern.h > index 067ac1d789bc..f6cb13d858fd 100644 > --- a/include/linux/virtio_pci_modern.h > +++ b/include/linux/virtio_pci_modern.h > @@ -10,6 +10,9 @@ struct virtio_pci_modern_common_cfg { > > __le16 queue_notify_data; /* read-write */ > __le16 queue_reset; /* read-write */ > + > + __le16 admin_queue_index; /* read-only */ > + __le16 admin_queue_num; /* read-only */ > }; ouch. actually there's a problem mdev->common = vp_modern_map_capability(mdev, common, sizeof(struct virtio_pci_common_cfg), 4, 0, sizeof(struct virtio_pci_common_cfg), NULL, NULL); extending this structure means some calls will start failing on existing devices. even more of an ouch, when we added queue_notify_data and queue_reset we also possibly broke some devices. well hopefully not since no one reported failures but we really need to fix that. > > struct virtio_pci_modern_device { > -- > 2.27.0
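To make the capability-size concern above concrete, here is a minimal sketch of one defensive way the new admin-queue register could be read (this is not the posted patch, and the common_len field used below is an assumption standing in for "how many bytes of the common config capability were actually mapped"):

static u16 vp_modern_avq_num(struct virtio_pci_modern_device *mdev)
{
	struct virtio_pci_modern_common_cfg __iomem *cfg =
		(struct virtio_pci_modern_common_cfg __iomem *)mdev->common;

	/*
	 * Hedged sketch: mdev->common_len is assumed to record the size of
	 * the mapped common config capability.  Devices that predate the
	 * admin VQ expose only the base virtio_pci_common_cfg layout, so
	 * report "no admin queue" rather than reading past the mapping.
	 */
	if (mdev->common_len < offsetofend(struct virtio_pci_modern_common_cfg,
					   admin_queue_num))
		return 0;

	return vp_ioread16(&cfg->admin_queue_num);
}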
On Thu, Sep 21, 2023 at 09:16:21AM -0400, Michael S. Tsirkin wrote: > > diff --git a/MAINTAINERS b/MAINTAINERS > > index bf0f54c24f81..5098418c8389 100644 > > --- a/MAINTAINERS > > +++ b/MAINTAINERS > > @@ -22624,6 +22624,12 @@ L: kvm@vger.kernel.org > > S: Maintained > > F: drivers/vfio/pci/mlx5/ > > > > +VFIO VIRTIO PCI DRIVER > > +M: Yishai Hadas <yishaih@nvidia.com> > > +L: kvm@vger.kernel.org > > +S: Maintained > > +F: drivers/vfio/pci/virtio > > + > > VFIO PCI DEVICE SPECIFIC DRIVERS > > R: Jason Gunthorpe <jgg@nvidia.com> > > R: Yishai Hadas <yishaih@nvidia.com> > > Tying two subsystems together like this is going to cause pain when > merging. God forbid there's something e.g. virtio net specific > (and there's going to be for sure) - now we are talking 3 > subsystems. Cross subsystem stuff is normal in the kernel. Drivers should be placed in their most logical spot - this driver exposes a VFIO interface so it belongs here. Your exact argument works the same from the VFIO perspective, someone has to have code that belongs to them outside their little sphere here. > Case in point all other virtio drivers are nicely grouped, have a common > mailing list etc etc. This one is completely separate to the point > where people won't even remember to copy the virtio mailing list. The virtio mailing list should probably be added to the maintainers enry Jason
On Thu, Sep 21, 2023 at 11:11:25AM -0300, Jason Gunthorpe wrote: > On Thu, Sep 21, 2023 at 09:16:21AM -0400, Michael S. Tsirkin wrote: > > > > diff --git a/MAINTAINERS b/MAINTAINERS > > > index bf0f54c24f81..5098418c8389 100644 > > > --- a/MAINTAINERS > > > +++ b/MAINTAINERS > > > @@ -22624,6 +22624,12 @@ L: kvm@vger.kernel.org > > > S: Maintained > > > F: drivers/vfio/pci/mlx5/ > > > > > > +VFIO VIRTIO PCI DRIVER > > > +M: Yishai Hadas <yishaih@nvidia.com> > > > +L: kvm@vger.kernel.org > > > +S: Maintained > > > +F: drivers/vfio/pci/virtio > > > + > > > VFIO PCI DEVICE SPECIFIC DRIVERS > > > R: Jason Gunthorpe <jgg@nvidia.com> > > > R: Yishai Hadas <yishaih@nvidia.com> > > > > Tying two subsystems together like this is going to cause pain when > > merging. God forbid there's something e.g. virtio net specific > > (and there's going to be for sure) - now we are talking 3 > > subsystems. > > Cross subsystem stuff is normal in the kernel. Yea. But it's completely spurious here - virtio has its own way to work with userspace which is vdpa and let's just use that. Keeps things nice and contained. > Drivers should be > placed in their most logical spot - this driver exposes a VFIO > interface so it belongs here. > > Your exact argument works the same from the VFIO perspective, someone > has to have code that belongs to them outside their little sphere > here. > > > Case in point all other virtio drivers are nicely grouped, have a common > > mailing list etc etc. This one is completely separate to the point > > where people won't even remember to copy the virtio mailing list. > > The virtio mailing list should probably be added to the maintainers > enry > > Jason
On Thu, 21 Sep 2023 15:40:40 +0300 Yishai Hadas <yishaih@nvidia.com> wrote: > Introduce a vfio driver over virtio devices to support the legacy > interface functionality for VFs. > > Background, from the virtio spec [1]. > -------------------------------------------------------------------- > In some systems, there is a need to support a virtio legacy driver with > a device that does not directly support the legacy interface. In such > scenarios, a group owner device can provide the legacy interface > functionality for the group member devices. The driver of the owner > device can then access the legacy interface of a member device on behalf > of the legacy member device driver. > > For example, with the SR-IOV group type, group members (VFs) can not > present the legacy interface in an I/O BAR in BAR0 as expected by the > legacy pci driver. If the legacy driver is running inside a virtual > machine, the hypervisor executing the virtual machine can present a > virtual device with an I/O BAR in BAR0. The hypervisor intercepts the > legacy driver accesses to this I/O BAR and forwards them to the group > owner device (PF) using group administration commands. > -------------------------------------------------------------------- > > Specifically, this driver adds support for a virtio-net VF to be exposed > as a transitional device to a guest driver and allows the legacy IO BAR > functionality on top. > > This allows a VM which uses a legacy virtio-net driver in the guest to > work transparently over a VF which its driver in the host is that new > driver. > > The driver can be extended easily to support some other types of virtio > devices (e.g virtio-blk), by adding in a few places the specific type > properties as was done for virtio-net. > > For now, only the virtio-net use case was tested and as such we introduce > the support only for such a device. > > Practically, > Upon probing a VF for a virtio-net device, in case its PF supports > legacy access over the virtio admin commands and the VF doesn't have BAR > 0, we set some specific 'vfio_device_ops' to be able to simulate in SW a > transitional device with I/O BAR in BAR 0. > > The existence of the simulated I/O bar is reported later on by > overwriting the VFIO_DEVICE_GET_REGION_INFO command and the device > exposes itself as a transitional device by overwriting some properties > upon reading its config space. > > Once we report the existence of I/O BAR as BAR 0 a legacy driver in the > guest may use it via read/write calls according to the virtio > specification. > > Any read/write towards the control parts of the BAR will be captured by > the new driver and will be translated into admin commands towards the > device. > > Any data path read/write access (i.e. virtio driver notifications) will > be forwarded to the physical BAR which its properties were supplied by > the command VIRTIO_PCI_QUEUE_NOTIFY upon the probing/init flow. > > With that code in place a legacy driver in the guest has the look and > feel as if having a transitional device with legacy support for both its > control and data path flows. Why do we need to enable a "legacy" driver in the guest? The very name suggests there's an alternative driver that perhaps doesn't require this I/O BAR. Why don't we just require the non-legacy driver in the guest rather than increase our maintenance burden? 
Thanks, Alex > > [1] > https://github.com/oasis-tcs/virtio-spec/commit/03c2d32e5093ca9f2a17797242fbef88efe94b8c > > Signed-off-by: Yishai Hadas <yishaih@nvidia.com> > --- > MAINTAINERS | 6 + > drivers/vfio/pci/Kconfig | 2 + > drivers/vfio/pci/Makefile | 2 + > drivers/vfio/pci/virtio/Kconfig | 15 + > drivers/vfio/pci/virtio/Makefile | 4 + > drivers/vfio/pci/virtio/cmd.c | 4 +- > drivers/vfio/pci/virtio/cmd.h | 8 + > drivers/vfio/pci/virtio/main.c | 546 +++++++++++++++++++++++++++++++ > 8 files changed, 585 insertions(+), 2 deletions(-) > create mode 100644 drivers/vfio/pci/virtio/Kconfig > create mode 100644 drivers/vfio/pci/virtio/Makefile > create mode 100644 drivers/vfio/pci/virtio/main.c > > diff --git a/MAINTAINERS b/MAINTAINERS > index bf0f54c24f81..5098418c8389 100644 > --- a/MAINTAINERS > +++ b/MAINTAINERS > @@ -22624,6 +22624,12 @@ L: kvm@vger.kernel.org > S: Maintained > F: drivers/vfio/pci/mlx5/ > > +VFIO VIRTIO PCI DRIVER > +M: Yishai Hadas <yishaih@nvidia.com> > +L: kvm@vger.kernel.org > +S: Maintained > +F: drivers/vfio/pci/virtio > + > VFIO PCI DEVICE SPECIFIC DRIVERS > R: Jason Gunthorpe <jgg@nvidia.com> > R: Yishai Hadas <yishaih@nvidia.com> > diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig > index 8125e5f37832..18c397df566d 100644 > --- a/drivers/vfio/pci/Kconfig > +++ b/drivers/vfio/pci/Kconfig > @@ -65,4 +65,6 @@ source "drivers/vfio/pci/hisilicon/Kconfig" > > source "drivers/vfio/pci/pds/Kconfig" > > +source "drivers/vfio/pci/virtio/Kconfig" > + > endmenu > diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile > index 45167be462d8..046139a4eca5 100644 > --- a/drivers/vfio/pci/Makefile > +++ b/drivers/vfio/pci/Makefile > @@ -13,3 +13,5 @@ obj-$(CONFIG_MLX5_VFIO_PCI) += mlx5/ > obj-$(CONFIG_HISI_ACC_VFIO_PCI) += hisilicon/ > > obj-$(CONFIG_PDS_VFIO_PCI) += pds/ > + > +obj-$(CONFIG_VIRTIO_VFIO_PCI) += virtio/ > diff --git a/drivers/vfio/pci/virtio/Kconfig b/drivers/vfio/pci/virtio/Kconfig > new file mode 100644 > index 000000000000..89eddce8b1bd > --- /dev/null > +++ b/drivers/vfio/pci/virtio/Kconfig > @@ -0,0 +1,15 @@ > +# SPDX-License-Identifier: GPL-2.0-only > +config VIRTIO_VFIO_PCI > + tristate "VFIO support for VIRTIO PCI devices" > + depends on VIRTIO_PCI > + select VFIO_PCI_CORE > + help > + This provides support for exposing VIRTIO VF devices using the VFIO > + framework that can work with a legacy virtio driver in the guest. > + Based on PCIe spec, VFs do not support I/O Space; thus, VF BARs shall > + not indicate I/O Space. > + As of that this driver emulated I/O BAR in software to let a VF be > + seen as a transitional device in the guest and let it work with > + a legacy driver. > + > + If you don't know what to do here, say N. 
> diff --git a/drivers/vfio/pci/virtio/Makefile b/drivers/vfio/pci/virtio/Makefile > new file mode 100644 > index 000000000000..584372648a03 > --- /dev/null > +++ b/drivers/vfio/pci/virtio/Makefile > @@ -0,0 +1,4 @@ > +# SPDX-License-Identifier: GPL-2.0-only > +obj-$(CONFIG_VIRTIO_VFIO_PCI) += virtio-vfio-pci.o > +virtio-vfio-pci-y := main.o cmd.o > + > diff --git a/drivers/vfio/pci/virtio/cmd.c b/drivers/vfio/pci/virtio/cmd.c > index f068239cdbb0..aea9d25fbf1d 100644 > --- a/drivers/vfio/pci/virtio/cmd.c > +++ b/drivers/vfio/pci/virtio/cmd.c > @@ -44,7 +44,7 @@ int virtiovf_cmd_lr_write(struct virtiovf_pci_core_device *virtvdev, u16 opcode, > { > struct virtio_device *virtio_dev = > virtio_pci_vf_get_pf_dev(virtvdev->core_device.pdev); > - struct virtio_admin_cmd_data_lr_write *in; > + struct virtio_admin_cmd_legacy_wr_data *in; > struct scatterlist in_sg; > struct virtio_admin_cmd cmd = {}; > int ret; > @@ -74,7 +74,7 @@ int virtiovf_cmd_lr_read(struct virtiovf_pci_core_device *virtvdev, u16 opcode, > { > struct virtio_device *virtio_dev = > virtio_pci_vf_get_pf_dev(virtvdev->core_device.pdev); > - struct virtio_admin_cmd_data_lr_read *in; > + struct virtio_admin_cmd_legacy_rd_data *in; > struct scatterlist in_sg, out_sg; > struct virtio_admin_cmd cmd = {}; > int ret; > diff --git a/drivers/vfio/pci/virtio/cmd.h b/drivers/vfio/pci/virtio/cmd.h > index c2a3645f4b90..347b1dc85570 100644 > --- a/drivers/vfio/pci/virtio/cmd.h > +++ b/drivers/vfio/pci/virtio/cmd.h > @@ -13,7 +13,15 @@ > > struct virtiovf_pci_core_device { > struct vfio_pci_core_device core_device; > + u8 bar0_virtual_buf_size; > + u8 *bar0_virtual_buf; > + /* synchronize access to the virtual buf */ > + struct mutex bar_mutex; > int vf_id; > + void __iomem *notify_addr; > + u32 notify_offset; > + u8 notify_bar; > + u8 pci_cmd_io :1; > }; > > int virtiovf_cmd_list_query(struct pci_dev *pdev, u8 *buf, int buf_size); > diff --git a/drivers/vfio/pci/virtio/main.c b/drivers/vfio/pci/virtio/main.c > new file mode 100644 > index 000000000000..2486991c49f3 > --- /dev/null > +++ b/drivers/vfio/pci/virtio/main.c > @@ -0,0 +1,546 @@ > +// SPDX-License-Identifier: GPL-2.0-only > +/* > + * Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved > + */ > + > +#include <linux/device.h> > +#include <linux/module.h> > +#include <linux/mutex.h> > +#include <linux/pci.h> > +#include <linux/pm_runtime.h> > +#include <linux/types.h> > +#include <linux/uaccess.h> > +#include <linux/vfio.h> > +#include <linux/vfio_pci_core.h> > +#include <linux/virtio_pci.h> > +#include <linux/virtio_net.h> > +#include <linux/virtio_pci_modern.h> > + > +#include "cmd.h" > + > +#define VIRTIO_LEGACY_IO_BAR_HEADER_LEN 20 > +#define VIRTIO_LEGACY_IO_BAR_MSIX_HEADER_LEN 4 > + > +static int virtiovf_issue_lr_cmd(struct virtiovf_pci_core_device *virtvdev, > + loff_t pos, char __user *buf, > + size_t count, bool read) > +{ > + u8 *bar0_buf = virtvdev->bar0_virtual_buf; > + u16 opcode; > + int ret; > + > + mutex_lock(&virtvdev->bar_mutex); > + if (read) { > + opcode = (pos < VIRTIO_PCI_CONFIG_OFF(true)) ? > + VIRTIO_ADMIN_CMD_LEGACY_COMMON_CFG_READ : > + VIRTIO_ADMIN_CMD_LEGACY_DEV_CFG_READ; > + ret = virtiovf_cmd_lr_read(virtvdev, opcode, pos, > + count, bar0_buf + pos); > + if (ret) > + goto out; > + if (copy_to_user(buf, bar0_buf + pos, count)) > + ret = -EFAULT; > + goto out; > + } > + > + if (copy_from_user(bar0_buf + pos, buf, count)) { > + ret = -EFAULT; > + goto out; > + } > + > + opcode = (pos < VIRTIO_PCI_CONFIG_OFF(true)) ? 
> + VIRTIO_ADMIN_CMD_LEGACY_COMMON_CFG_WRITE : > + VIRTIO_ADMIN_CMD_LEGACY_DEV_CFG_WRITE; > + ret = virtiovf_cmd_lr_write(virtvdev, opcode, pos, count, > + bar0_buf + pos); > +out: > + mutex_unlock(&virtvdev->bar_mutex); > + return ret; > +} > + > +static int > +translate_io_bar_to_mem_bar(struct virtiovf_pci_core_device *virtvdev, > + loff_t pos, char __user *buf, > + size_t count, bool read) > +{ > + struct vfio_pci_core_device *core_device = &virtvdev->core_device; > + u16 queue_notify; > + int ret; > + > + if (pos + count > virtvdev->bar0_virtual_buf_size) > + return -EINVAL; > + > + switch (pos) { > + case VIRTIO_PCI_QUEUE_NOTIFY: > + if (count != sizeof(queue_notify)) > + return -EINVAL; > + if (read) { > + ret = vfio_pci_ioread16(core_device, true, &queue_notify, > + virtvdev->notify_addr); > + if (ret) > + return ret; > + if (copy_to_user(buf, &queue_notify, > + sizeof(queue_notify))) > + return -EFAULT; > + break; > + } > + > + if (copy_from_user(&queue_notify, buf, count)) > + return -EFAULT; > + > + ret = vfio_pci_iowrite16(core_device, true, queue_notify, > + virtvdev->notify_addr); > + break; > + default: > + ret = virtiovf_issue_lr_cmd(virtvdev, pos, buf, count, read); > + } > + > + return ret ? ret : count; > +} > + > +static bool range_contains_range(loff_t range1_start, size_t count1, > + loff_t range2_start, size_t count2, > + loff_t *start_offset) > +{ > + if (range1_start <= range2_start && > + range1_start + count1 >= range2_start + count2) { > + *start_offset = range2_start - range1_start; > + return true; > + } > + return false; > +} > + > +static ssize_t virtiovf_pci_read_config(struct vfio_device *core_vdev, > + char __user *buf, size_t count, > + loff_t *ppos) > +{ > + struct virtiovf_pci_core_device *virtvdev = container_of( > + core_vdev, struct virtiovf_pci_core_device, core_device.vdev); > + loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK; > + loff_t copy_offset; > + __le32 val32; > + __le16 val16; > + u8 val8; > + int ret; > + > + ret = vfio_pci_core_read(core_vdev, buf, count, ppos); > + if (ret < 0) > + return ret; > + > + if (range_contains_range(pos, count, PCI_DEVICE_ID, sizeof(val16), > + ©_offset)) { > + val16 = cpu_to_le16(0x1000); > + if (copy_to_user(buf + copy_offset, &val16, sizeof(val16))) > + return -EFAULT; > + } > + > + if (virtvdev->pci_cmd_io && > + range_contains_range(pos, count, PCI_COMMAND, sizeof(val16), > + ©_offset)) { > + if (copy_from_user(&val16, buf, sizeof(val16))) > + return -EFAULT; > + val16 |= cpu_to_le16(PCI_COMMAND_IO); > + if (copy_to_user(buf + copy_offset, &val16, sizeof(val16))) > + return -EFAULT; > + } > + > + if (range_contains_range(pos, count, PCI_REVISION_ID, sizeof(val8), > + ©_offset)) { > + /* Transional needs to have revision 0 */ > + val8 = 0; > + if (copy_to_user(buf + copy_offset, &val8, sizeof(val8))) > + return -EFAULT; > + } > + > + if (range_contains_range(pos, count, PCI_BASE_ADDRESS_0, sizeof(val32), > + ©_offset)) { > + val32 = cpu_to_le32(PCI_BASE_ADDRESS_SPACE_IO); > + if (copy_to_user(buf + copy_offset, &val32, sizeof(val32))) > + return -EFAULT; > + } > + > + if (range_contains_range(pos, count, PCI_SUBSYSTEM_ID, sizeof(val16), > + ©_offset)) { > + /* Transitional devices use the PCI subsystem device id as > + * virtio device id, same as legacy driver always did. 
> + */ > + val16 = cpu_to_le16(VIRTIO_ID_NET); > + if (copy_to_user(buf + copy_offset, &val16, sizeof(val16))) > + return -EFAULT; > + } > + > + return count; > +} > + > +static ssize_t > +virtiovf_pci_core_read(struct vfio_device *core_vdev, char __user *buf, > + size_t count, loff_t *ppos) > +{ > + struct virtiovf_pci_core_device *virtvdev = container_of( > + core_vdev, struct virtiovf_pci_core_device, core_device.vdev); > + struct pci_dev *pdev = virtvdev->core_device.pdev; > + unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos); > + loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK; > + int ret; > + > + if (!count) > + return 0; > + > + if (index == VFIO_PCI_CONFIG_REGION_INDEX) > + return virtiovf_pci_read_config(core_vdev, buf, count, ppos); > + > + if (index != VFIO_PCI_BAR0_REGION_INDEX) > + return vfio_pci_core_read(core_vdev, buf, count, ppos); > + > + ret = pm_runtime_resume_and_get(&pdev->dev); > + if (ret) { > + pci_info_ratelimited(pdev, "runtime resume failed %d\n", > + ret); > + return -EIO; > + } > + > + ret = translate_io_bar_to_mem_bar(virtvdev, pos, buf, count, true); > + pm_runtime_put(&pdev->dev); > + return ret; > +} > + > +static ssize_t > +virtiovf_pci_core_write(struct vfio_device *core_vdev, const char __user *buf, > + size_t count, loff_t *ppos) > +{ > + struct virtiovf_pci_core_device *virtvdev = container_of( > + core_vdev, struct virtiovf_pci_core_device, core_device.vdev); > + struct pci_dev *pdev = virtvdev->core_device.pdev; > + unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos); > + loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK; > + int ret; > + > + if (!count) > + return 0; > + > + if (index == VFIO_PCI_CONFIG_REGION_INDEX) { > + loff_t copy_offset; > + u16 cmd; > + > + if (range_contains_range(pos, count, PCI_COMMAND, sizeof(cmd), > + ©_offset)) { > + if (copy_from_user(&cmd, buf + copy_offset, sizeof(cmd))) > + return -EFAULT; > + virtvdev->pci_cmd_io = (cmd & PCI_COMMAND_IO); > + } > + } > + > + if (index != VFIO_PCI_BAR0_REGION_INDEX) > + return vfio_pci_core_write(core_vdev, buf, count, ppos); > + > + ret = pm_runtime_resume_and_get(&pdev->dev); > + if (ret) { > + pci_info_ratelimited(pdev, "runtime resume failed %d\n", ret); > + return -EIO; > + } > + > + ret = translate_io_bar_to_mem_bar(virtvdev, pos, (char __user *)buf, count, false); > + pm_runtime_put(&pdev->dev); > + return ret; > +} > + > +static int > +virtiovf_pci_ioctl_get_region_info(struct vfio_device *core_vdev, > + unsigned int cmd, unsigned long arg) > +{ > + struct virtiovf_pci_core_device *virtvdev = container_of( > + core_vdev, struct virtiovf_pci_core_device, core_device.vdev); > + unsigned long minsz = offsetofend(struct vfio_region_info, offset); > + void __user *uarg = (void __user *)arg; > + struct vfio_region_info info = {}; > + > + if (copy_from_user(&info, uarg, minsz)) > + return -EFAULT; > + > + if (info.argsz < minsz) > + return -EINVAL; > + > + switch (info.index) { > + case VFIO_PCI_BAR0_REGION_INDEX: > + info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index); > + info.size = virtvdev->bar0_virtual_buf_size; > + info.flags = VFIO_REGION_INFO_FLAG_READ | > + VFIO_REGION_INFO_FLAG_WRITE; > + return copy_to_user(uarg, &info, minsz) ? 
-EFAULT : 0; > + default: > + return vfio_pci_core_ioctl(core_vdev, cmd, arg); > + } > +} > + > +static long > +virtiovf_vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd, > + unsigned long arg) > +{ > + switch (cmd) { > + case VFIO_DEVICE_GET_REGION_INFO: > + return virtiovf_pci_ioctl_get_region_info(core_vdev, cmd, arg); > + default: > + return vfio_pci_core_ioctl(core_vdev, cmd, arg); > + } > +} > + > +static int > +virtiovf_set_notify_addr(struct virtiovf_pci_core_device *virtvdev) > +{ > + struct vfio_pci_core_device *core_device = &virtvdev->core_device; > + int ret; > + > + /* Setup the BAR where the 'notify' exists to be used by vfio as well > + * This will let us mmap it only once and use it when needed. > + */ > + ret = vfio_pci_core_setup_barmap(core_device, > + virtvdev->notify_bar); > + if (ret) > + return ret; > + > + virtvdev->notify_addr = core_device->barmap[virtvdev->notify_bar] + > + virtvdev->notify_offset; > + return 0; > +} > + > +static int virtiovf_pci_open_device(struct vfio_device *core_vdev) > +{ > + struct virtiovf_pci_core_device *virtvdev = container_of( > + core_vdev, struct virtiovf_pci_core_device, core_device.vdev); > + struct vfio_pci_core_device *vdev = &virtvdev->core_device; > + int ret; > + > + ret = vfio_pci_core_enable(vdev); > + if (ret) > + return ret; > + > + if (virtvdev->bar0_virtual_buf) { > + /* upon close_device() the vfio_pci_core_disable() is called > + * and will close all the previous mmaps, so it seems that the > + * valid life cycle for the 'notify' addr is per open/close. > + */ > + ret = virtiovf_set_notify_addr(virtvdev); > + if (ret) { > + vfio_pci_core_disable(vdev); > + return ret; > + } > + } > + > + vfio_pci_core_finish_enable(vdev); > + return 0; > +} > + > +static void virtiovf_pci_close_device(struct vfio_device *core_vdev) > +{ > + vfio_pci_core_close_device(core_vdev); > +} > + > +static int virtiovf_get_device_config_size(unsigned short device) > +{ > + switch (device) { > + case 0x1041: > + /* network card */ > + return offsetofend(struct virtio_net_config, status); > + default: > + return 0; > + } > +} > + > +static int virtiovf_read_notify_info(struct virtiovf_pci_core_device *virtvdev) > +{ > + u64 offset; > + int ret; > + u8 bar; > + > + ret = virtiovf_cmd_lq_read_notify(virtvdev, > + VIRTIO_ADMIN_CMD_NOTIFY_INFO_FLAGS_OWNER_MEM, > + &bar, &offset); > + if (ret) > + return ret; > + > + virtvdev->notify_bar = bar; > + virtvdev->notify_offset = offset; > + return 0; > +} > + > +static int virtiovf_pci_init_device(struct vfio_device *core_vdev) > +{ > + struct virtiovf_pci_core_device *virtvdev = container_of( > + core_vdev, struct virtiovf_pci_core_device, core_device.vdev); > + struct pci_dev *pdev; > + int ret; > + > + ret = vfio_pci_core_init_dev(core_vdev); > + if (ret) > + return ret; > + > + pdev = virtvdev->core_device.pdev; > + virtvdev->vf_id = pci_iov_vf_id(pdev); > + if (virtvdev->vf_id < 0) > + return -EINVAL; > + > + ret = virtiovf_read_notify_info(virtvdev); > + if (ret) > + return ret; > + > + virtvdev->bar0_virtual_buf_size = VIRTIO_LEGACY_IO_BAR_HEADER_LEN + > + VIRTIO_LEGACY_IO_BAR_MSIX_HEADER_LEN + > + virtiovf_get_device_config_size(pdev->device); > + virtvdev->bar0_virtual_buf = kzalloc(virtvdev->bar0_virtual_buf_size, > + GFP_KERNEL); > + if (!virtvdev->bar0_virtual_buf) > + return -ENOMEM; > + mutex_init(&virtvdev->bar_mutex); > + return 0; > +} > + > +static void virtiovf_pci_core_release_dev(struct vfio_device *core_vdev) > +{ > + struct virtiovf_pci_core_device *virtvdev = 
container_of( > + core_vdev, struct virtiovf_pci_core_device, core_device.vdev); > + > + kfree(virtvdev->bar0_virtual_buf); > + vfio_pci_core_release_dev(core_vdev); > +} > + > +static const struct vfio_device_ops virtiovf_acc_vfio_pci_tran_ops = { > + .name = "virtio-transitional-vfio-pci", > + .init = virtiovf_pci_init_device, > + .release = virtiovf_pci_core_release_dev, > + .open_device = virtiovf_pci_open_device, > + .close_device = virtiovf_pci_close_device, > + .ioctl = virtiovf_vfio_pci_core_ioctl, > + .read = virtiovf_pci_core_read, > + .write = virtiovf_pci_core_write, > + .mmap = vfio_pci_core_mmap, > + .request = vfio_pci_core_request, > + .match = vfio_pci_core_match, > + .bind_iommufd = vfio_iommufd_physical_bind, > + .unbind_iommufd = vfio_iommufd_physical_unbind, > + .attach_ioas = vfio_iommufd_physical_attach_ioas, > +}; > + > +static const struct vfio_device_ops virtiovf_acc_vfio_pci_ops = { > + .name = "virtio-acc-vfio-pci", > + .init = vfio_pci_core_init_dev, > + .release = vfio_pci_core_release_dev, > + .open_device = virtiovf_pci_open_device, > + .close_device = virtiovf_pci_close_device, > + .ioctl = vfio_pci_core_ioctl, > + .device_feature = vfio_pci_core_ioctl_feature, > + .read = vfio_pci_core_read, > + .write = vfio_pci_core_write, > + .mmap = vfio_pci_core_mmap, > + .request = vfio_pci_core_request, > + .match = vfio_pci_core_match, > + .bind_iommufd = vfio_iommufd_physical_bind, > + .unbind_iommufd = vfio_iommufd_physical_unbind, > + .attach_ioas = vfio_iommufd_physical_attach_ioas, > +}; > + > +static bool virtiovf_bar0_exists(struct pci_dev *pdev) > +{ > + struct resource *res = pdev->resource; > + > + return res->flags ? true : false; > +} > + > +#define VIRTIOVF_USE_ADMIN_CMD_BITMAP \ > + (BIT_ULL(VIRTIO_ADMIN_CMD_LIST_QUERY) | \ > + BIT_ULL(VIRTIO_ADMIN_CMD_LIST_USE) | \ > + BIT_ULL(VIRTIO_ADMIN_CMD_LEGACY_COMMON_CFG_WRITE) | \ > + BIT_ULL(VIRTIO_ADMIN_CMD_LEGACY_COMMON_CFG_READ) | \ > + BIT_ULL(VIRTIO_ADMIN_CMD_LEGACY_DEV_CFG_WRITE) | \ > + BIT_ULL(VIRTIO_ADMIN_CMD_LEGACY_DEV_CFG_READ) | \ > + BIT_ULL(VIRTIO_ADMIN_CMD_LEGACY_NOTIFY_INFO)) > + > +static bool virtiovf_support_legacy_access(struct pci_dev *pdev) > +{ > + int buf_size = DIV_ROUND_UP(VIRTIO_ADMIN_MAX_CMD_OPCODE, 64) * 8; > + u8 *buf; > + int ret; > + > + /* Only virtio-net is supported/tested so far */ > + if (pdev->device != 0x1041) > + return false; > + > + buf = kzalloc(buf_size, GFP_KERNEL); > + if (!buf) > + return false; > + > + ret = virtiovf_cmd_list_query(pdev, buf, buf_size); > + if (ret) > + goto end; > + > + if ((le64_to_cpup((__le64 *)buf) & VIRTIOVF_USE_ADMIN_CMD_BITMAP) != > + VIRTIOVF_USE_ADMIN_CMD_BITMAP) { > + ret = -EOPNOTSUPP; > + goto end; > + } > + > + /* confirm the used commands */ > + memset(buf, 0, buf_size); > + *(__le64 *)buf = cpu_to_le64(VIRTIOVF_USE_ADMIN_CMD_BITMAP); > + ret = virtiovf_cmd_list_use(pdev, buf, buf_size); > + > +end: > + kfree(buf); > + return ret ? 
false : true; > +} > + > +static int virtiovf_pci_probe(struct pci_dev *pdev, > + const struct pci_device_id *id) > +{ > + const struct vfio_device_ops *ops = &virtiovf_acc_vfio_pci_ops; > + struct virtiovf_pci_core_device *virtvdev; > + int ret; > + > + if (pdev->is_virtfn && virtiovf_support_legacy_access(pdev) && > + !virtiovf_bar0_exists(pdev) && pdev->msix_cap) > + ops = &virtiovf_acc_vfio_pci_tran_ops; > + > + virtvdev = vfio_alloc_device(virtiovf_pci_core_device, core_device.vdev, > + &pdev->dev, ops); > + if (IS_ERR(virtvdev)) > + return PTR_ERR(virtvdev); > + > + dev_set_drvdata(&pdev->dev, &virtvdev->core_device); > + ret = vfio_pci_core_register_device(&virtvdev->core_device); > + if (ret) > + goto out; > + return 0; > +out: > + vfio_put_device(&virtvdev->core_device.vdev); > + return ret; > +} > + > +static void virtiovf_pci_remove(struct pci_dev *pdev) > +{ > + struct virtiovf_pci_core_device *virtvdev = dev_get_drvdata(&pdev->dev); > + > + vfio_pci_core_unregister_device(&virtvdev->core_device); > + vfio_put_device(&virtvdev->core_device.vdev); > +} > + > +static const struct pci_device_id virtiovf_pci_table[] = { > + { PCI_DRIVER_OVERRIDE_DEVICE_VFIO(PCI_VENDOR_ID_REDHAT_QUMRANET, PCI_ANY_ID) }, > + {} > +}; > + > +MODULE_DEVICE_TABLE(pci, virtiovf_pci_table); > + > +static struct pci_driver virtiovf_pci_driver = { > + .name = KBUILD_MODNAME, > + .id_table = virtiovf_pci_table, > + .probe = virtiovf_pci_probe, > + .remove = virtiovf_pci_remove, > + .err_handler = &vfio_pci_core_err_handlers, > + .driver_managed_dma = true, > +}; > + > +module_pci_driver(virtiovf_pci_driver); > + > +MODULE_LICENSE("GPL"); > +MODULE_AUTHOR("Yishai Hadas <yishaih@nvidia.com>"); > +MODULE_DESCRIPTION( > + "VIRTIO VFIO PCI - User Level meta-driver for VIRTIO device family");
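As a usage illustration of the region the driver emulates: once bound, a VMM sees the virtual I/O BAR through the ordinary VFIO region API. A hedged userspace sketch (generic VFIO uAPI only, assuming an already-open VFIO device fd; none of this code is from the series) of discovering BAR0 and ringing the queue-0 doorbell could look like:

#include <linux/vfio.h>
#include <linux/virtio_pci.h>	/* VIRTIO_PCI_QUEUE_NOTIFY */
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* device_fd is assumed to be a VFIO device fd opened via the usual flow. */
static int kick_queue0(int device_fd)
{
	struct vfio_region_info info = {
		.argsz = sizeof(info),
		.index = VFIO_PCI_BAR0_REGION_INDEX,
	};
	uint16_t q = 0;	/* queue index to notify */

	if (ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, &info))
		return -1;

	/* The virtio variant driver reports the virtual I/O BAR here. */
	printf("BAR0 size %llu bytes\n", (unsigned long long)info.size);

	/* A write at this offset is trapped and forwarded by the driver. */
	if (pwrite(device_fd, &q, sizeof(q),
		   info.offset + VIRTIO_PCI_QUEUE_NOTIFY) != sizeof(q))
		return -1;
	return 0;
}

Accesses to other offsets of that region would be serviced by translate_io_bar_to_mem_bar()/virtiovf_issue_lr_cmd() in the driver above, i.e. turned into legacy admin commands toward the PF.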
On Thu, Sep 21, 2023 at 01:52:24PM -0300, Jason Gunthorpe wrote: > On Thu, Sep 21, 2023 at 10:43:50AM -0600, Alex Williamson wrote: > > > > With that code in place a legacy driver in the guest has the look and > > > feel as if having a transitional device with legacy support for both its > > > control and data path flows. > > > > Why do we need to enable a "legacy" driver in the guest? The very name > > suggests there's an alternative driver that perhaps doesn't require > > this I/O BAR. Why don't we just require the non-legacy driver in the > > guest rather than increase our maintenance burden? Thanks, > > It was my reaction also. > > Apparently there is a big deployed base of people using old guest VMs > with old drivers and they do not want to update their VMs. It is the > same basic reason why qemu supports all those weird old machine types > and HW emulations. The desire is to support these old devices so that > old VMs can work unchanged. > > Jason And you are saying all these very old VMs use such a large number of legacy devices that over-counting of locked memory due to vdpa not correctly using iommufd is a problem that urgently needs to be solved otherwise the solution has no value? Another question I'm interested in is whether there's actually a performance benefit to using this as compared to just software vhost. I note there's a VM exit on each IO access, so ... perhaps? Would be nice to see some numbers.
On Thu, Sep 21, 2023 at 01:01:12PM -0400, Michael S. Tsirkin wrote: > On Thu, Sep 21, 2023 at 01:52:24PM -0300, Jason Gunthorpe wrote: > > On Thu, Sep 21, 2023 at 10:43:50AM -0600, Alex Williamson wrote: > > > > > > With that code in place a legacy driver in the guest has the look and > > > > feel as if having a transitional device with legacy support for both its > > > > control and data path flows. > > > > > > Why do we need to enable a "legacy" driver in the guest? The very name > > > suggests there's an alternative driver that perhaps doesn't require > > > this I/O BAR. Why don't we just require the non-legacy driver in the > > > guest rather than increase our maintenance burden? Thanks, > > > > It was my reaction also. > > > > Apparently there is a big deployed base of people using old guest VMs > > with old drivers and they do not want to update their VMs. It is the > > same basic reason why qemu supports all those weird old machine types > > and HW emulations. The desire is to support these old devices so that > > old VMs can work unchanged. > > > > Jason > > And you are saying all these very old VMs use such a large number of > legacy devices that over-counting of locked memory due to vdpa not > correctly using iommufd is a problem that urgently needs to be solved > otherwise the solution has no value? No one has said that. iommufd is gaining alot more functions than just pinned memory accounting. > Another question I'm interested in is whether there's actually a > performance benefit to using this as compared to just software > vhost. I note there's a VM exit on each IO access, so ... perhaps? > Would be nice to see some numbers. At least a single trap compared with an entire per-packet SW flow undoubtably uses alot less CPU power in the hypervisor. Jason
> From: Michael S. Tsirkin <mst@redhat.com> > Sent: Thursday, September 21, 2023 10:31 PM > Another question I'm interested in is whether there's actually a performance > benefit to using this as compared to just software vhost. I note there's a VM exit > on each IO access, so ... perhaps? > Would be nice to see some numbers. Packet rate and bandwidth are close: only about 10% lower than a modern device, due to the batching of driver notifications. Bandwidth was tested with iperf using one and multiple queues; packet rate was tested with testpmd.
On Thu, Sep 21, 2023 at 02:07:09PM -0300, Jason Gunthorpe wrote: > On Thu, Sep 21, 2023 at 01:01:12PM -0400, Michael S. Tsirkin wrote: > > On Thu, Sep 21, 2023 at 01:52:24PM -0300, Jason Gunthorpe wrote: > > > On Thu, Sep 21, 2023 at 10:43:50AM -0600, Alex Williamson wrote: > > > > > > > > With that code in place a legacy driver in the guest has the look and > > > > > feel as if having a transitional device with legacy support for both its > > > > > control and data path flows. > > > > > > > > Why do we need to enable a "legacy" driver in the guest? The very name > > > > suggests there's an alternative driver that perhaps doesn't require > > > > this I/O BAR. Why don't we just require the non-legacy driver in the > > > > guest rather than increase our maintenance burden? Thanks, > > > > > > It was my reaction also. > > > > > > Apparently there is a big deployed base of people using old guest VMs > > > with old drivers and they do not want to update their VMs. It is the > > > same basic reason why qemu supports all those weird old machine types > > > and HW emulations. The desire is to support these old devices so that > > > old VMs can work unchanged. > > > > > > Jason > > > > And you are saying all these very old VMs use such a large number of > > legacy devices that over-counting of locked memory due to vdpa not > > correctly using iommufd is a problem that urgently needs to be solved > > otherwise the solution has no value? > > No one has said that. > > iommufd is gaining alot more functions than just pinned memory > accounting. Yea it's very useful - it's also useful for vdpa whether this patchset goes in or not. At some level, if vdpa can't keep up then maybe going the vfio route is justified. I'm not sure why didn't anyone fix iommufd yet - looks like a small amount of work. I'll see if I can address it quickly because we already have virtio accelerators under vdpa and it seems confusing to people to use vdpa for some and vfio for others, with overlapping but slightly incompatible functionality. I'll get back next week, in either case. I am however genuinely curious whether all the new functionality is actually useful for these legacy guests. > > Another question I'm interested in is whether there's actually a > > performance benefit to using this as compared to just software > > vhost. I note there's a VM exit on each IO access, so ... perhaps? > > Would be nice to see some numbers. > > At least a single trap compared with an entire per-packet SW flow > undoubtably uses alot less CPU power in the hypervisor. > > Jason Something like the shadow vq thing will be more or less equivalent then? That's upstream in qemu and needs no hardware support. Worth comparing against. Anyway, there's presumably actual hardware this was tested with, so why guess? Just test and post numbers.
On Thu, Sep 21, 2023 at 05:09:04PM +0000, Parav Pandit wrote: > > > > From: Michael S. Tsirkin <mst@redhat.com> > > Sent: Thursday, September 21, 2023 10:31 PM > > > Another question I'm interested in is whether there's actually a performance > > benefit to using this as compared to just software vhost. I note there's a VM exit > > on each IO access, so ... perhaps? > > Would be nice to see some numbers. > > Packet rate and bandwidth are close are only 10% lower than modern device due to the batching of driver notification. > Bw tested with iperf with one and multiple queues. > Packet rate tested with testpmd. Nice, good to know. Could you compare this with vdpa with shadow vq enabled? That's probably the closest equivalent that needs no kernel or hardware work.
On Thu, Sep 21, 2023 at 01:21:26PM -0400, Michael S. Tsirkin wrote: > Yea it's very useful - it's also useful for vdpa whether this patchset > goes in or not. At some level, if vdpa can't keep up then maybe going > the vfio route is justified. I'm not sure why didn't anyone fix iommufd > yet - looks like a small amount of work. I'll see if I can address it > quickly because we already have virtio accelerators under vdpa and it > seems confusing to people to use vdpa for some and vfio for others, with > overlapping but slightly incompatible functionality. I'll get back next > week, in either case. I am however genuinely curious whether all the new > functionality is actually useful for these legacy guests. It doesn't have much to do with the guests - this is new hypervisor functionality to make the hypervisor do more things. This stuff can still work with old VMs. > > > Another question I'm interested in is whether there's actually a > > > performance benefit to using this as compared to just software > > > vhost. I note there's a VM exit on each IO access, so ... perhaps? > > > Would be nice to see some numbers. > > > > At least a single trap compared with an entire per-packet SW flow > > undoubtably uses alot less CPU power in the hypervisor. > > Something like the shadow vq thing will be more or less equivalent > then? Huh? It still has the entire netdev stack to go through on every packet before it reaches the real virtio device. > That's upstream in qemu and needs no hardware support. Worth comparing > against. Anyway, there's presumably actual hardware this was tested > with, so why guess? Just test and post numbers. Our prior benchmarking put our VPDA/VFIO solutions at something like 2x-3x improvement over the qemu SW path it replaces. Parav said 10% is lost, so 10% of 3x is still 3x better :) I thought we all agreed on this when vdpa was created in the first place, the all SW path was hopeless to get high performance out of? Jason
On Thu, Sep 21, 2023 at 02:44:50PM -0300, Jason Gunthorpe wrote: > On Thu, Sep 21, 2023 at 01:21:26PM -0400, Michael S. Tsirkin wrote: > > Yea it's very useful - it's also useful for vdpa whether this patchset > > goes in or not. At some level, if vdpa can't keep up then maybe going > > the vfio route is justified. I'm not sure why didn't anyone fix iommufd > > yet - looks like a small amount of work. I'll see if I can address it > > quickly because we already have virtio accelerators under vdpa and it > > seems confusing to people to use vdpa for some and vfio for others, with > > overlapping but slightly incompatible functionality. I'll get back next > > week, in either case. I am however genuinely curious whether all the new > > functionality is actually useful for these legacy guests. > > It doesn't have much to do with the guests - this is new hypervisor > functionality to make the hypervisor do more things. This stuff can > still work with old VMs. > > > > > Another question I'm interested in is whether there's actually a > > > > performance benefit to using this as compared to just software > > > > vhost. I note there's a VM exit on each IO access, so ... perhaps? > > > > Would be nice to see some numbers. > > > > > > At least a single trap compared with an entire per-packet SW flow > > > undoubtably uses alot less CPU power in the hypervisor. > > > > Something like the shadow vq thing will be more or less equivalent > > then? > > Huh? It still has the entire netdev stack to go through on every > packet before it reaches the real virtio device. No - shadow vq just tweaks the descriptor and forwards it to the modern vdpa hardware. No net stack involved. > > That's upstream in qemu and needs no hardware support. Worth comparing > > against. Anyway, there's presumably actual hardware this was tested > > with, so why guess? Just test and post numbers. > > Our prior benchmarking put our VPDA/VFIO solutions at something like > 2x-3x improvement over the qemu SW path it replaces. > Parav said 10% is lost, so 10% of 3x is still 3x better :) > > I thought we all agreed on this when vdpa was created in the first > place, the all SW path was hopeless to get high performance out of? > > Jason That's not what I'm asking about though - not what shadow vq does, shadow vq is a vdpa feature.
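For context on what "tweaks the descriptor and forwards it" means, here is a rough conceptual sketch of the per-descriptor work a shadow virtqueue does (this is not QEMU's actual SVQ code; the descriptor struct and the translate()/kick() callbacks are assumptions for illustration):

#include <stdint.h>

/* Simplified split-ring descriptor; mirrors the layout in the virtio spec. */
struct svq_desc {
	uint64_t addr;
	uint32_t len;
	uint16_t flags;
	uint16_t next;
};

/*
 * Conceptual only: forward one guest descriptor into the shadow ring the
 * device actually consumes.  translate() and kick() stand in for the
 * hypervisor's guest-address-to-IOVA mapping and doorbell write.
 */
static void svq_forward_one(const struct svq_desc *guest,
			    struct svq_desc *shadow, uint16_t idx,
			    uint64_t (*translate)(uint64_t gpa),
			    void (*kick)(void))
{
	shadow[idx] = guest[idx];			/* copy the descriptor */
	shadow[idx].addr = translate(guest[idx].addr);	/* remap the buffer */
	kick();						/* notify the real device */
}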
On Thu, Sep 21, 2023 at 01:55:42PM -0400, Michael S. Tsirkin wrote: > That's not what I'm asking about though - not what shadow vq does, > shadow vq is a vdpa feature. That's just VDPA then. We already talked about why VDPA is not a replacement for VFIO. I agree you can probably get decent performance out of choosing VDPA over VFIO. That doesn't justify VDPA as a replacement for VFIO. Jason
On Thu, Sep 21, 2023 at 12:53:04PM -0400, Michael S. Tsirkin wrote: > > vdpa is not vfio, I don't know how you can suggest vdpa is a > > replacement for a vfio driver. They are completely different > > things. > > Each side has its own strengths, and vfio especially is accelerating > > in its capability in way that vpda is not. eg if an iommufd conversion > > had been done by now for vdpa I might be more sympathetic. > > Yea, I agree iommufd is a big problem with vdpa right now. Cindy was > sick and I didn't know and kept assuming she's working on this. I don't > think it's a huge amount of work though. I'll take a look. > Is there anything else though? Do tell. Confidential compute will never work with VDPA's approach. > There are a bunch of things that I think are important for virtio > that are completely out of scope for vfio, such as migrating > cross-vendor. VFIO supports migration, if you want to have cross-vendor migration then make a standard that describes the VFIO migration data format for virtio devices. > What is the huge amount of work am I asking to do? You are asking us to invest in the complexity of VDPA through out (keep it working, keep it secure, invest time in deploying and debugging in the field) When it doesn't provide *ANY* value to the solution. The starting point is a completely working vfio PCI function and the end goal is to put that function into a VM. That is VFIO, not VDPA. VPDA is fine for what it does, but it is not a reasonable replacement for VFIO. Jason
On Thu, Sep 21, 2023 at 03:39:26PM -0300, Jason Gunthorpe wrote: > On Thu, Sep 21, 2023 at 12:53:04PM -0400, Michael S. Tsirkin wrote: > > > vdpa is not vfio, I don't know how you can suggest vdpa is a > > > replacement for a vfio driver. They are completely different > > > things. > > > Each side has its own strengths, and vfio especially is accelerating > > > in its capability in way that vpda is not. eg if an iommufd conversion > > > had been done by now for vdpa I might be more sympathetic. > > > > Yea, I agree iommufd is a big problem with vdpa right now. Cindy was > > sick and I didn't know and kept assuming she's working on this. I don't > > think it's a huge amount of work though. I'll take a look. > > Is there anything else though? Do tell. > > Confidential compute will never work with VDPA's approach. I don't see how what this patchset is doing is different wrt to Confidential compute - you trap IO accesses and emulate. Care to elaborate? > > There are a bunch of things that I think are important for virtio > > that are completely out of scope for vfio, such as migrating > > cross-vendor. > > VFIO supports migration, if you want to have cross-vendor migration > then make a standard that describes the VFIO migration data format for > virtio devices. This has nothing to do with data formats - you need two devices to behave identically. Which is what VDPA is about really. > > What is the huge amount of work am I asking to do? > > You are asking us to invest in the complexity of VDPA through out > (keep it working, keep it secure, invest time in deploying and > debugging in the field) > > When it doesn't provide *ANY* value to the solution. There's no "the solution" - this sounds like a vendor only caring about solutions that involve that vendor's hardware exclusively, a little. > The starting point is a completely working vfio PCI function and the > end goal is to put that function into a VM. That is VFIO, not VDPA. > > VPDA is fine for what it does, but it is not a reasonable replacement > for VFIO. > > Jason VDPA basically should be a kind of "VFIO for virtio".
On Thu, Sep 21, 2023 at 03:39:26PM -0300, Jason Gunthorpe wrote: > > What is the huge amount of work am I asking to do? > > You are asking us to invest in the complexity of VDPA through out > (keep it working, keep it secure, invest time in deploying and > debugging in the field) I'm asking you to do nothing of the kind - I am saying that this code will have to be duplicated in vdpa, and so I am asking what exactly is missing to just keep it all there. So far you said iommufd and note I didn't ask you to add iommufd to vdpa though that would be nice ;) I just said I'll look into it in the next several days.
On Thu, Sep 21, 2023 at 03:16:37PM -0300, Jason Gunthorpe wrote: > On Thu, Sep 21, 2023 at 01:55:42PM -0400, Michael S. Tsirkin wrote: > > > That's not what I'm asking about though - not what shadow vq does, > > shadow vq is a vdpa feature. > > That's just VDPA then. We already talked about why VDPA is not a > replacement for VFIO. It does however work universally, by software, without any special hardware support. Which is kind of why I am curious - if VDPA needs this proxy code because shadow vq is slower then that's an argument for not having it in two places, and trying to improve vdpa to use iommufd if that's easy/practical. If instead VDPA gives the same speed with just shadow vq then keeping this hack in vfio seems like less of a problem. Finally if VDPA is faster then maybe you will reconsider using it ;)
On Thu, Sep 21, 2023 at 03:13:10PM -0400, Michael S. Tsirkin wrote: > On Thu, Sep 21, 2023 at 03:39:26PM -0300, Jason Gunthorpe wrote: > > On Thu, Sep 21, 2023 at 12:53:04PM -0400, Michael S. Tsirkin wrote: > > > > vdpa is not vfio, I don't know how you can suggest vdpa is a > > > > replacement for a vfio driver. They are completely different > > > > things. > > > > Each side has its own strengths, and vfio especially is accelerating > > > > in its capability in way that vpda is not. eg if an iommufd conversion > > > > had been done by now for vdpa I might be more sympathetic. > > > > > > Yea, I agree iommufd is a big problem with vdpa right now. Cindy was > > > sick and I didn't know and kept assuming she's working on this. I don't > > > think it's a huge amount of work though. I'll take a look. > > > Is there anything else though? Do tell. > > > > Confidential compute will never work with VDPA's approach. > > I don't see how what this patchset is doing is different > wrt to Confidential compute - you trap IO accesses and emulate. > Care to elaborate? This patch series isn't about confidential compute, you asked about the future. VFIO will support confidential compute in the future, VDPA will not. > > > There are a bunch of things that I think are important for virtio > > > that are completely out of scope for vfio, such as migrating > > > cross-vendor. > > > > VFIO supports migration, if you want to have cross-vendor migration > > then make a standard that describes the VFIO migration data format for > > virtio devices. > > This has nothing to do with data formats - you need two devices to > behave identically. Which is what VDPA is about really. We've been looking at VFIO live migration extensively. Device mediation, like VDPA does, is one legitimate approach for live migration. It suites a certain type of heterogeneous environment well. But, it is equally legitimate to make the devices behave the same and have them process a common migration data. This can happen in public with standards, or it can happen in private within a cloud operator's "private-standard" environment. To date, in most of my discussions, I have not seen a strong appetite for such public standards. In part due to the complexity. Regardles, it is not the kernel communities job to insist on one approach or the other. > > You are asking us to invest in the complexity of VDPA through out > > (keep it working, keep it secure, invest time in deploying and > > debugging in the field) > > > > When it doesn't provide *ANY* value to the solution. > > There's no "the solution" Nonsense. > this sounds like a vendor only caring about solutions that involve > that vendor's hardware exclusively, a little. Not really. Understand the DPU provider is not the vendor here. The DPU provider gives a cloud operator a SDK to build these things. The operator is the vendor from your perspective. In many cases live migration never leaves the operator's confines in the first place. Even when it does, there is no real use case to live migrate a virtio-net function from, say, AWS to GCP. You are pushing for a lot of complexity and software that solves a problem people in this space don't actually have. As I said, VDPA is fine for the scenarios it addresses. It is an alternative, not a replacement, for VFIO. Jason
On Thu, Sep 21, 2023 at 03:17:25PM -0400, Michael S. Tsirkin wrote: > On Thu, Sep 21, 2023 at 03:39:26PM -0300, Jason Gunthorpe wrote: > > > What is the huge amount of work am I asking to do? > > > > You are asking us to invest in the complexity of VDPA through out > > (keep it working, keep it secure, invest time in deploying and > > debugging in the field) > > I'm asking you to do nothing of the kind - I am saying that this code > will have to be duplicated in vdpa, Why would that be needed? > and so I am asking what exactly is missing to just keep it all > there. VFIO. Seriously, we don't want unnecessary mediation in this path at all. > note I didn't ask you to add iommufd to vdpa though that would be > nice ;) I did once send someone to look.. It didn't succeed :( Jason
On Thu, Sep 21, 2023 at 03:34:03PM -0400, Michael S. Tsirkin wrote: > that's easy/practical. If instead VDPA gives the same speed with just > shadow vq then keeping this hack in vfio seems like less of a problem. > Finally if VDPA is faster then maybe you will reconsider using it ;) It is not all about the speed. VDPA presents another large and complex software stack in the hypervisor that can be eliminated by simply using VFIO. VFIO is already required for other scenarios. This is about reducing complexity, reducing attack surface and increasing maintainability of the hypervisor environment. Jason
On Thu, Sep 21, 2023 at 04:53:45PM -0300, Jason Gunthorpe wrote: > On Thu, Sep 21, 2023 at 03:34:03PM -0400, Michael S. Tsirkin wrote: > > > that's easy/practical. If instead VDPA gives the same speed with just > > shadow vq then keeping this hack in vfio seems like less of a problem. > > Finally if VDPA is faster then maybe you will reconsider using it ;) > > It is not all about the speed. > > VDPA presents another large and complex software stack in the > hypervisor that can be eliminated by simply using VFIO. If all you want is passing through your card to guest then yes this can be addressed "by simply using VFIO". And let me give you a simple example just from this patchset: it assumes guest uses MSIX and just breaks if it doesn't. As VDPA emulates it can emulate INT#x for guest while doing MSI on the host side. Yea modern guests use MSIX but this is about legacy yes? > VFIO is > already required for other scenarios. Required ... by some people? Most VMs I run don't use anything outside of virtio. > This is about reducing complexity, reducing attack surface and > increasing maintainability of the hypervisor environment. > > Jason Generally you get better security if you don't let guests poke at hardware when they don't have to. But sure, matter of preference - use VFIO, it's great. I am worried about the specific patchset though. It seems to deal with emulating virtio which seems more like a vdpa thing. If you start adding virtio emulation to vfio then won't you just end up with another vdpa? And if no why not? And I don't buy the "we already invested in this vfio based solution", sorry - that's not a reason upstream has to maintain it.
On Thu, Sep 21, 2023 at 04:49:46PM -0300, Jason Gunthorpe wrote: > On Thu, Sep 21, 2023 at 03:13:10PM -0400, Michael S. Tsirkin wrote: > > On Thu, Sep 21, 2023 at 03:39:26PM -0300, Jason Gunthorpe wrote: > > > On Thu, Sep 21, 2023 at 12:53:04PM -0400, Michael S. Tsirkin wrote: > > > > > vdpa is not vfio, I don't know how you can suggest vdpa is a > > > > > replacement for a vfio driver. They are completely different > > > > > things. > > > > > Each side has its own strengths, and vfio especially is accelerating > > > > > in its capability in way that vpda is not. eg if an iommufd conversion > > > > > had been done by now for vdpa I might be more sympathetic. > > > > > > > > Yea, I agree iommufd is a big problem with vdpa right now. Cindy was > > > > sick and I didn't know and kept assuming she's working on this. I don't > > > > think it's a huge amount of work though. I'll take a look. > > > > Is there anything else though? Do tell. > > > > > > Confidential compute will never work with VDPA's approach. > > > > I don't see how what this patchset is doing is different > > wrt to Confidential compute - you trap IO accesses and emulate. > > Care to elaborate? > > This patch series isn't about confidential compute, you asked about > the future. VFIO will support confidential compute in the future, VDPA > will not. Nonsense it already works. But I did not ask about the future since I do not believe it can be confidently predicted. I asked what is missing in VDPA now for you to add this feature there and not in VFIO. > > > > There are a bunch of things that I think are important for virtio > > > > that are completely out of scope for vfio, such as migrating > > > > cross-vendor. > > > > > > VFIO supports migration, if you want to have cross-vendor migration > > > then make a standard that describes the VFIO migration data format for > > > virtio devices. > > > > This has nothing to do with data formats - you need two devices to > > behave identically. Which is what VDPA is about really. > > We've been looking at VFIO live migration extensively. Device > mediation, like VDPA does, is one legitimate approach for live > migration. It suites a certain type of heterogeneous environment well. > > But, it is equally legitimate to make the devices behave the same and > have them process a common migration data. > > This can happen in public with standards, or it can happen in private > within a cloud operator's "private-standard" environment. > > To date, in most of my discussions, I have not seen a strong appetite > for such public standards. In part due to the complexity. > > Regardles, it is not the kernel communities job to insist on one > approach or the other. > > > > You are asking us to invest in the complexity of VDPA through out > > > (keep it working, keep it secure, invest time in deploying and > > > debugging in the field) > > > > > > When it doesn't provide *ANY* value to the solution. > > > > There's no "the solution" > > Nonsense. what there's only one solution that you use the definite article? > > this sounds like a vendor only caring about solutions that involve > > that vendor's hardware exclusively, a little. > > Not really. > > Understand the DPU provider is not the vendor here. The DPU provider > gives a cloud operator a SDK to build these things. The operator is > the vendor from your perspective. > > In many cases live migration never leaves the operator's confines in > the first place. 
> > Even when it does, there is no real use case to live migrate a > virtio-net function from, say, AWS to GCP. > > You are pushing for a lot of complexity and software that solves a > problem people in this space don't actually have. > > As I said, VDPA is fine for the scenarios it addresses. It is an > alternative, not a replacement, for VFIO. > > Jason Yea, VDPA does trap and emulate for config accesses. Which is exactly what this patch does? So why it belongs in vfio, muddying up its passthrough model, is beyond me, except that apparently there's some specific deployment that happens to use vfio, so now whatever that deployment needs has to go into vfio whether it belongs there or not.
On Thu, Sep 21, 2023 at 04:51:15PM -0300, Jason Gunthorpe wrote: > On Thu, Sep 21, 2023 at 03:17:25PM -0400, Michael S. Tsirkin wrote: > > On Thu, Sep 21, 2023 at 03:39:26PM -0300, Jason Gunthorpe wrote: > > > > What is the huge amount of work am I asking to do? > > > > > > You are asking us to invest in the complexity of VDPA through out > > > (keep it working, keep it secure, invest time in deploying and > > > debugging in the field) > > > > I'm asking you to do nothing of the kind - I am saying that this code > > will have to be duplicated in vdpa, > > Why would that be needed? For the same reason it was developed in the 1st place - presumably because it adds efficient legacy guest support with the right card? I get it, you specifically don't need VDPA functionality, but I don't see why this is universal, or common. > > and so I am asking what exactly is missing to just keep it all > > there. > > VFIO. Seriously, we don't want unnecessary mediation in this path at > all. But which mediation is necessary is exactly up to the specific use-case. I have no idea why you would want all of VFIO to e.g. pass access to random config registers to the guest when it's a virtio device and the config registers are all nicely listed in the spec. I know nvidia hardware is so great, it has super robust cards with fewer security holes than the vdpa driver, but I very much doubt this is universal for all virtio offload cards. > > note I didn't ask you to add iommufd to vdpa though that would be > nice ;) > > I did once send someone to look.. It didn't succeed :( > > Jason Pity. Maybe there's some big difficulty blocking this? I'd like to know.
On Thu, Sep 21, 2023 at 04:16:25PM -0400, Michael S. Tsirkin wrote: > On Thu, Sep 21, 2023 at 04:53:45PM -0300, Jason Gunthorpe wrote: > > On Thu, Sep 21, 2023 at 03:34:03PM -0400, Michael S. Tsirkin wrote: > > > > > that's easy/practical. If instead VDPA gives the same speed with just > > > shadow vq then keeping this hack in vfio seems like less of a problem. > > > Finally if VDPA is faster then maybe you will reconsider using it ;) > > > > It is not all about the speed. > > > > VDPA presents another large and complex software stack in the > > hypervisor that can be eliminated by simply using VFIO. > > If all you want is passing through your card to guest > then yes this can be addressed "by simply using VFIO". That is pretty much the goal, yes. > And let me give you a simple example just from this patchset: > it assumes guest uses MSIX and just breaks if it doesn't. It does? Really? Where did you see that? > > VFIO is > > already required for other scenarios. > > Required ... by some people? Most VMs I run don't use anything > outside of virtio. Yes, some people. The sorts of people who run large data centers. > It seems to deal with emulating virtio which seems more like a vdpa > thing. Alex described it right, it creates an SW trapped IO bar that relays the doorbell to an admin queue command. > If you start adding virtio emulation to vfio then won't > you just end up with another vdpa? And if no why not? > And I don't buy the "we already invested in this vfio based solution", > sorry - that's not a reason upstream has to maintain it. I think you would be well justified to object to actual mediation, like processing queues in VFIO or otherwise complex things. Fortunately there is no need to do that with DPU HW. The legacy IO BAR is a weird quirk that just cannot be done without a software trap, and the OASIS standardization effort was for exactly this kind of simplistic transformation. I also don't buy the "upstream has to maintain it" line. The team that submitted it will maintain it just fine, thank you. Jason
On Thu, Sep 21, 2023 at 04:45:45PM -0400, Michael S. Tsirkin wrote: > On Thu, Sep 21, 2023 at 04:49:46PM -0300, Jason Gunthorpe wrote: > > On Thu, Sep 21, 2023 at 03:13:10PM -0400, Michael S. Tsirkin wrote: > > > On Thu, Sep 21, 2023 at 03:39:26PM -0300, Jason Gunthorpe wrote: > > > > On Thu, Sep 21, 2023 at 12:53:04PM -0400, Michael S. Tsirkin wrote: > > > > > > vdpa is not vfio, I don't know how you can suggest vdpa is a > > > > > > replacement for a vfio driver. They are completely different > > > > > > things. > > > > > > Each side has its own strengths, and vfio especially is accelerating > > > > > > in its capability in way that vpda is not. eg if an iommufd conversion > > > > > > had been done by now for vdpa I might be more sympathetic. > > > > > > > > > > Yea, I agree iommufd is a big problem with vdpa right now. Cindy was > > > > > sick and I didn't know and kept assuming she's working on this. I don't > > > > > think it's a huge amount of work though. I'll take a look. > > > > > Is there anything else though? Do tell. > > > > > > > > Confidential compute will never work with VDPA's approach. > > > > > > I don't see how what this patchset is doing is different > > > wrt to Confidential compute - you trap IO accesses and emulate. > > > Care to elaborate? > > > > This patch series isn't about confidential compute, you asked about > > the future. VFIO will support confidential compute in the future, VDPA > > will not. > > Nonsense it already works. That isn't what I'm talking about. With a real PCI function and TDISP we can actually DMA directly from the guest's memory without needing the ugly bounce buffer hack. Then you can get decent performance. > But I did not ask about the future since I do not believe it > can be confidently predicted. I asked what is missing in VDPA > now for you to add this feature there and not in VFIO. I don't see that VDPA needs this, VDPA should process the IO BAR on its own with its own logic, just like everything else it does. This is specifically about avoiding mediation by relaying directly the IO BAR operations to the device itself. That is the entire irony, this whole scheme was designed and standardized *specifically* to avoid complex mediation and here you are saying we should just use mediation. Jason
On Thu, Sep 21, 2023 at 04:55:05PM -0400, Michael S. Tsirkin wrote: > But which mediation is necessary is exactly up to the specific use-case. > I have no idea why would you want all of VFIO to e.g. pass access to > random config registers to the guest when it's a virtio device and the > config registers are all nicely listed in the spec. I know nvidia > hardware is so great, it has super robust cards with less security holes > than the vdpa driver, but I very much doubt this is universal for all > virtio offload cards. The great thing about choice is that people can choose the configuration that best meets their situation and needs. Jason
On Fri, Sep 22, 2023 at 3:49 AM Jason Gunthorpe <jgg@nvidia.com> wrote: > > On Thu, Sep 21, 2023 at 03:13:10PM -0400, Michael S. Tsirkin wrote: > > On Thu, Sep 21, 2023 at 03:39:26PM -0300, Jason Gunthorpe wrote: > > > On Thu, Sep 21, 2023 at 12:53:04PM -0400, Michael S. Tsirkin wrote: > > > > > vdpa is not vfio, I don't know how you can suggest vdpa is a > > > > > replacement for a vfio driver. They are completely different > > > > > things. > > > > > Each side has its own strengths, and vfio especially is accelerating > > > > > in its capability in way that vpda is not. eg if an iommufd conversion > > > > > had been done by now for vdpa I might be more sympathetic. > > > > > > > > Yea, I agree iommufd is a big problem with vdpa right now. Cindy was > > > > sick and I didn't know and kept assuming she's working on this. I don't > > > > think it's a huge amount of work though. I'll take a look. > > > > Is there anything else though? Do tell. > > > > > > Confidential compute will never work with VDPA's approach. > > > > I don't see how what this patchset is doing is different > > wrt to Confidential compute - you trap IO accesses and emulate. > > Care to elaborate? > > This patch series isn't about confidential compute, you asked about > the future. VFIO will support confidential compute in the future, VDPA > will not. > > > > > There are a bunch of things that I think are important for virtio > > > > that are completely out of scope for vfio, such as migrating > > > > cross-vendor. > > > > > > VFIO supports migration, if you want to have cross-vendor migration > > > then make a standard that describes the VFIO migration data format for > > > virtio devices. > > > > This has nothing to do with data formats - you need two devices to > > behave identically. Which is what VDPA is about really. > > We've been looking at VFIO live migration extensively. Device > mediation, like VDPA does, is one legitimate approach for live > migration. It suites a certain type of heterogeneous environment well. > > But, it is equally legitimate to make the devices behave the same and > have them process a common migration data. > > This can happen in public with standards, or it can happen in private > within a cloud operator's "private-standard" environment. > > To date, in most of my discussions, I have not seen a strong appetite > for such public standards. In part due to the complexity. > > Regardles, it is not the kernel communities job to insist on one > approach or the other. > > > > You are asking us to invest in the complexity of VDPA through out > > > (keep it working, keep it secure, invest time in deploying and > > > debugging in the field) > > > > > > When it doesn't provide *ANY* value to the solution. > > > > There's no "the solution" > > Nonsense. > > > this sounds like a vendor only caring about solutions that involve > > that vendor's hardware exclusively, a little. > > Not really. > > Understand the DPU provider is not the vendor here. The DPU provider > gives a cloud operator a SDK to build these things. The operator is > the vendor from your perspective. > > In many cases live migration never leaves the operator's confines in > the first place. > > Even when it does, there is no real use case to live migrate a > virtio-net function from, say, AWS to GCP. It can happen inside a single cloud vendor. For some reasons, DPU must be purchased from different vendors. And vDPA has been used in that case. I've asked them to present this probably somewhere like KVM Forum. 
> > You are pushing for a lot of complexity and software that solves a > problem people in this space don't actually have. > > As I said, VDPA is fine for the scenarios it addresses. It is an > alternative, not a replacement, for VFIO. We have never tried to replace VFIO. I don't see any problem with just using the current VFIO to assign a virtio-pci device to the guest. The problem is the mediation (or what you called relaying) layer you've invented. Thanks > > Jason >
On Fri, Sep 22, 2023 at 4:16 AM Michael S. Tsirkin <mst@redhat.com> wrote: > > On Thu, Sep 21, 2023 at 04:53:45PM -0300, Jason Gunthorpe wrote: > > On Thu, Sep 21, 2023 at 03:34:03PM -0400, Michael S. Tsirkin wrote: > > > > > that's easy/practical. If instead VDPA gives the same speed with just > > > shadow vq then keeping this hack in vfio seems like less of a problem. > > > Finally if VDPA is faster then maybe you will reconsider using it ;) > > > > It is not all about the speed. > > > > VDPA presents another large and complex software stack in the > > hypervisor that can be eliminated by simply using VFIO. > > If all you want is passing through your card to guest > then yes this can be addressed "by simply using VFIO". +1. And what's more, using MMIO BAR0 then it can work for legacy. I have a handy virtio hardware from one vendor that works like this, and I see it is done by a lot of other vendors. Thanks
On Fri, Sep 22, 2023 at 6:55 AM Jason Gunthorpe <jgg@nvidia.com> wrote: > > On Thu, Sep 21, 2023 at 04:45:45PM -0400, Michael S. Tsirkin wrote: > > On Thu, Sep 21, 2023 at 04:49:46PM -0300, Jason Gunthorpe wrote: > > > On Thu, Sep 21, 2023 at 03:13:10PM -0400, Michael S. Tsirkin wrote: > > > > On Thu, Sep 21, 2023 at 03:39:26PM -0300, Jason Gunthorpe wrote: > > > > > On Thu, Sep 21, 2023 at 12:53:04PM -0400, Michael S. Tsirkin wrote: > > > > > > > vdpa is not vfio, I don't know how you can suggest vdpa is a > > > > > > > replacement for a vfio driver. They are completely different > > > > > > > things. > > > > > > > Each side has its own strengths, and vfio especially is accelerating > > > > > > > in its capability in way that vpda is not. eg if an iommufd conversion > > > > > > > had been done by now for vdpa I might be more sympathetic. > > > > > > > > > > > > Yea, I agree iommufd is a big problem with vdpa right now. Cindy was > > > > > > sick and I didn't know and kept assuming she's working on this. I don't > > > > > > think it's a huge amount of work though. I'll take a look. > > > > > > Is there anything else though? Do tell. > > > > > > > > > > Confidential compute will never work with VDPA's approach. > > > > > > > > I don't see how what this patchset is doing is different > > > > wrt to Confidential compute - you trap IO accesses and emulate. > > > > Care to elaborate? > > > > > > This patch series isn't about confidential compute, you asked about > > > the future. VFIO will support confidential compute in the future, VDPA > > > will not. What blocks vDPA from supporting that? > > > > Nonsense it already works. > > That isn't what I'm talking about. With a real PCI function and TDISP > we can actually DMA directly from the guest's memory without needing > the ugly bounce buffer hack. Then you can get decent performance. This series requires the trapping in the legacy I/O BAR in VFIO. Why can TDISP work when trapping in VFIO but not vDPA? If neither, how can TDISP affect here? > > > But I did not ask about the future since I do not believe it > > can be confidently predicted. I asked what is missing in VDPA > > now for you to add this feature there and not in VFIO. > > I don't see that VDPA needs this, VDPA should process the IO BAR on > its own with its own logic, just like everything else it does. > > This is specifically about avoiding mediation by relaying directly the > IO BAR operations to the device itself. So we had: 1) a new virtio specific driver for VFIO 2) the existing vp_vdpa driver How much differences between them in the context of the mediation or relaying? Or is it hard to introduce admin commands in the vDPA bus? > That is the entire irony, this whole scheme was designed and > standardized *specifically* to avoid complex mediation and here you > are saying we should just use mediation. No, using "simple VFIO passthrough" is just fine. Thanks > > Jason >
On Fri, Sep 22, 2023 at 3:53 AM Jason Gunthorpe <jgg@nvidia.com> wrote: > > On Thu, Sep 21, 2023 at 03:34:03PM -0400, Michael S. Tsirkin wrote: > > > that's easy/practical. If instead VDPA gives the same speed with just > > shadow vq then keeping this hack in vfio seems like less of a problem. > > Finally if VDPA is faster then maybe you will reconsider using it ;) > > It is not all about the speed. > > VDPA presents another large and complex software stack in the > hypervisor that can be eliminated by simply using VFIO. vDPA supports standard virtio devices so how did you define complexity? From the view of the application, what it wants is a simple virtio device but not virtio-pci devices. That is what vDPA tries to present. By simply counting LOCs: vdpa + vhost + vp_vdpa is much less code than what VFIO had. It's not hard to expect, it will still be much less even if iommufd is done. Thanks > VFIO is > already required for other scenarios. > > This is about reducing complexity, reducing attack surface and > increasing maintainability of the hypervisor environment. > > Jason > >
On 9/22/2023 2:39 AM, Jason Gunthorpe wrote: > On Thu, Sep 21, 2023 at 12:53:04PM -0400, Michael S. Tsirkin wrote: >>> vdpa is not vfio, I don't know how you can suggest vdpa is a >>> replacement for a vfio driver. They are completely different >>> things. >>> Each side has its own strengths, and vfio especially is accelerating >>> in its capability in way that vpda is not. eg if an iommufd conversion >>> had been done by now for vdpa I might be more sympathetic. >> Yea, I agree iommufd is a big problem with vdpa right now. Cindy was >> sick and I didn't know and kept assuming she's working on this. I don't >> think it's a huge amount of work though. I'll take a look. >> Is there anything else though? Do tell. > Confidential compute will never work with VDPA's approach. I don't understand why vDPA can not and will never support Confidential computing? Do you see any blockers? > >> There are a bunch of things that I think are important for virtio >> that are completely out of scope for vfio, such as migrating >> cross-vendor. > VFIO supports migration, if you want to have cross-vendor migration > then make a standard that describes the VFIO migration data format for > virtio devices. > >> What is the huge amount of work am I asking to do? > You are asking us to invest in the complexity of VDPA through out > (keep it working, keep it secure, invest time in deploying and > debugging in the field) > > When it doesn't provide *ANY* value to the solution. > > The starting point is a completely working vfio PCI function and the > end goal is to put that function into a VM. That is VFIO, not VDPA. > > VPDA is fine for what it does, but it is not a reasonable replacement > for VFIO. > > Jason
On Thu, Sep 21, 2023 at 07:48:36PM -0300, Jason Gunthorpe wrote: > On Thu, Sep 21, 2023 at 04:16:25PM -0400, Michael S. Tsirkin wrote: > > On Thu, Sep 21, 2023 at 04:53:45PM -0300, Jason Gunthorpe wrote: > > > On Thu, Sep 21, 2023 at 03:34:03PM -0400, Michael S. Tsirkin wrote: > > > > > > > that's easy/practical. If instead VDPA gives the same speed with just > > > > shadow vq then keeping this hack in vfio seems like less of a problem. > > > > Finally if VDPA is faster then maybe you will reconsider using it ;) > > > > > > It is not all about the speed. > > > > > > VDPA presents another large and complex software stack in the > > > hypervisor that can be eliminated by simply using VFIO. > > > > If all you want is passing through your card to guest > > then yes this can be addressed "by simply using VFIO". > > That is pretty much the goal, yes. > > > And let me give you a simple example just from this patchset: > > it assumes guest uses MSIX and just breaks if it doesn't. > > It does? Really? Where did you see that? This thing apparently:

+	opcode = (pos < VIRTIO_PCI_CONFIG_OFF(true)) ?
+			VIRTIO_ADMIN_CMD_LEGACY_COMMON_CFG_READ :
+			VIRTIO_ADMIN_CMD_LEGACY_DEV_CFG_READ;

That "true" is supposed to be whether guest enabled MSI or not. > > > VFIO is > > > already required for other scenarios. > > > > Required ... by some people? Most VMs I run don't use anything > > outside of virtio. > > Yes, some people. The sorts of people who run large data centers. > > > It seems to deal with emulating virtio which seems more like a vdpa > > thing. > > Alex described it right, it creates an SW trapped IO bar that relays > the doorbell to an admin queue command. > > > If you start adding virtio emulation to vfio then won't > > you just end up with another vdpa? And if no why not? > > And I don't buy the "we already invested in this vfio based solution", > > sorry - that's not a reason upstream has to maintain it. > > I think you would be well justified to object to actual mediation, > like processing queues in VFIO or otherwise complex things. This mediation is kind of smallish, I agree. Not completely devoid of logic though. > Fortunately there is no need to do that with DPU HW. The legacy IO BAR > is a weird quirk that just cannot be done without a software trap, and > the OASIS standardization effort was for exactly this kind of > simplistic transformation. > > I also don't buy the "upstream has to maintain it" line. The team that > submitted it will maintain it just fine, thank you. it will require maintenance effort when virtio changes are made. For example it pokes at the device state - I don't see specific races right now but in the past we did e.g. reset the device to recover from errors and we might start doing it again. If more of the logic is under the virtio directory where we'll remember to keep it in the loop, and will be able to reuse it from vdpa down the road, I would be more sympathetic.
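[Editorial note on the MSI-X point above: a minimal sketch of the kind of check being asked for, deriving the legacy config-space split from the guest-visible MSI-X state instead of hard-coding VIRTIO_PCI_CONFIG_OFF(true). Only the macro and the two opcode constants come from the quoted patch; the function name, the struct and the msix helper are assumptions made purely for illustration.]

/*
 * Illustrative sketch only, not the patch's code: pick the legacy admin
 * command opcode based on whether the guest has MSI-X enabled, since the
 * legacy common config is 20 bytes without MSI-X and 24 bytes with it.
 */
static u16 virtiovf_legacy_read_opcode(struct virtiovf_pci_core_device *virtvdev,
				       loff_t pos)
{
	/* Assumed helper reporting the guest-visible MSI-X enable state. */
	bool msix = virtiovf_msix_enabled(virtvdev);

	return pos < VIRTIO_PCI_CONFIG_OFF(msix) ?
	       VIRTIO_ADMIN_CMD_LEGACY_COMMON_CFG_READ :
	       VIRTIO_ADMIN_CMD_LEGACY_DEV_CFG_READ;
}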
On Thu, Sep 21, 2023 at 07:55:26PM -0300, Jason Gunthorpe wrote: > On Thu, Sep 21, 2023 at 04:45:45PM -0400, Michael S. Tsirkin wrote: > > On Thu, Sep 21, 2023 at 04:49:46PM -0300, Jason Gunthorpe wrote: > > > On Thu, Sep 21, 2023 at 03:13:10PM -0400, Michael S. Tsirkin wrote: > > > > On Thu, Sep 21, 2023 at 03:39:26PM -0300, Jason Gunthorpe wrote: > > > > > On Thu, Sep 21, 2023 at 12:53:04PM -0400, Michael S. Tsirkin wrote: > > > > > > > vdpa is not vfio, I don't know how you can suggest vdpa is a > > > > > > > replacement for a vfio driver. They are completely different > > > > > > > things. > > > > > > > Each side has its own strengths, and vfio especially is accelerating > > > > > > > in its capability in way that vpda is not. eg if an iommufd conversion > > > > > > > had been done by now for vdpa I might be more sympathetic. > > > > > > > > > > > > Yea, I agree iommufd is a big problem with vdpa right now. Cindy was > > > > > > sick and I didn't know and kept assuming she's working on this. I don't > > > > > > think it's a huge amount of work though. I'll take a look. > > > > > > Is there anything else though? Do tell. > > > > > > > > > > Confidential compute will never work with VDPA's approach. > > > > > > > > I don't see how what this patchset is doing is different > > > > wrt to Confidential compute - you trap IO accesses and emulate. > > > > Care to elaborate? > > > > > > This patch series isn't about confidential compute, you asked about > > > the future. VFIO will support confidential compute in the future, VDPA > > > will not. > > > > Nonsense it already works. > > That isn't what I'm talking about. With a real PCI function and TDISP > we can actually DMA directly from the guest's memory without needing > the ugly bounce buffer hack. Then you can get decent performance. Aha, TDISP. But that one clearly does not need and can not use this kind of hack? > > But I did not ask about the future since I do not believe it > > can be confidently predicted. I asked what is missing in VDPA > > now for you to add this feature there and not in VFIO. > > I don't see that VDPA needs this, VDPA should process the IO BAR on > its own with its own logic, just like everything else it does. First there's some logic here such as translating legacy IO offsets to modern ones that could be reused. But also, this is not just IO BAR, that indeed can be easily done in software. When a device operates in legacy mode there are subtle differences with modern mode such as a different header size for the net device. > This is specifically about avoiding mediation by relaying directly the > IO BAR operations to the device itself. > > That is the entire irony, this whole scheme was designed and > standardized *specifically* to avoid complex mediation and here you > are saying we should just use mediation. > > Jason Not exactly. What I had in mind is just having the logic in the vdpa module so users don't need to know what the device supports and what it doesn't. If we can, we bypass mediation (to simplify the software stack); if we cannot, we do not. Looking at it from the user's POV, it is just super confusing that card ABC would need to be used with VDPA to drive legacy while card DEF needs to be used with VFIO. And both VFIO and VDPA will happily bind, too. Oh man ...
On Fri, Sep 22, 2023 at 11:01:23AM +0800, Jason Wang wrote: > > Even when it does, there is no real use case to live migrate a > > virtio-net function from, say, AWS to GCP. > > It can happen inside a single cloud vendor. For some reasons, DPU must > be purchased from different vendors. And vDPA has been used in that > case. Nope, you misunderstand the DPU scenario. Look at something like vmware DPU enablement. vmware runs the software side of the DPU and all their supported DPU HW, from every vendor, generates the same PCI functions on the x86. They are the same because the same software on the DPU side is creating them. There is no reason to put a mediation layer in the x86 if you also control the DPU. Cloud vendors will similarly use DPUs to create PCI functions that meet the cloud vendor's internal specification. Regardless of DPU vendor. Fundamentally if you control the DPU SW and the hypervisor software you do not need hypervisor mediation because everything you could do in hypervisor mediation can just be done in the DPU. Putting it in the DPU is better in every regard. So, as I keep saying, in this scenario the goal is no mediation in the hypervisor. It is pointless, everything you think you need to do there is actually already being done in the DPU. Once you commit to this configuration you are committed to VFIO in the hypervisor. eg your DPU is likely also making NVMe and other PCI functions too. > The problem is the mediation (or what you called relaying) layer > you've invented. It is not mediation, it is implementing the OASIS spec for VFIO support of IO BAR. Jason
On Fri, Sep 22, 2023 at 07:23:28AM -0400, Michael S. Tsirkin wrote: > On Thu, Sep 21, 2023 at 07:55:26PM -0300, Jason Gunthorpe wrote: > Looking at it from user's POV, it is just super confusing that > card ABC would need to be used with VDPA to drive legacy while > card DEF needs to be used with VFIO. And both VFIO and VDPA > will happily bind, too. Oh man ... It is standard VFIO stuff. If you don't attach vfio then you get a normal kernel virtio-net driver. If you turn that into VDPA then you get that. If you attach VFIO to the PCI function then you get VFIO. There is nothing special here, we have good infrastructure to support doing this already. The user gets to pick. I don't understand why you think the kernel side should deny this choice. Jason
On Fri, Sep 22, 2023 at 11:02:21AM +0800, Jason Wang wrote:
> And what's more, using MMIO BAR0 then it can work for legacy.
Oh? How? Our team didn't think so.
Jason
On Fri, Sep 22, 2023 at 05:47:23AM -0400, Michael S. Tsirkin wrote: > it will require maintainance effort when virtio changes are made. For > example it pokes at the device state - I don't see specific races right > now but in the past we did e.g. reset the device to recover from errors > and we might start doing it again. > > If more of the logic is under virtio directory where we'll remember > to keep it in loop, and will be able to reuse it from vdpa > down the road, I would be more sympathetic. This is inevitable, the VFIO live migration driver will need all this infrastructure too. Jason
On Fri, Sep 22, 2023 at 11:02:50AM +0800, Jason Wang wrote: > On Fri, Sep 22, 2023 at 3:53 AM Jason Gunthorpe <jgg@nvidia.com> wrote: > > > > On Thu, Sep 21, 2023 at 03:34:03PM -0400, Michael S. Tsirkin wrote: > > > > > that's easy/practical. If instead VDPA gives the same speed with just > > > shadow vq then keeping this hack in vfio seems like less of a problem. > > > Finally if VDPA is faster then maybe you will reconsider using it ;) > > > > It is not all about the speed. > > > > VDPA presents another large and complex software stack in the > > hypervisor that can be eliminated by simply using VFIO. > > vDPA supports standard virtio devices so how did you define > complexity? As I said, VFIO is already required for other devices in these VMs. So anything incremental over base-line vfio-pci is complexity to minimize. Everything vdpa does is either redundant or unnecessary compared to VFIO in these environments. Jason
> From: Jason Gunthorpe <jgg@nvidia.com> > Sent: Friday, September 22, 2023 5:53 PM > > And what's more, using MMIO BAR0 then it can work for legacy. > > Oh? How? Our team didn't think so. It does not. It was already discussed. The device reset in legacy is not synchronous. The drivers do not wait for reset to complete; it was written for the sw backend. Hence MMIO BAR0 is not the best option in real implementations.
On Fri, Sep 22, 2023 at 12:25:06PM +0000, Parav Pandit wrote: > > > From: Jason Gunthorpe <jgg@nvidia.com> > > Sent: Friday, September 22, 2023 5:53 PM > > > > > And what's more, using MMIO BAR0 then it can work for legacy. > > > > Oh? How? Our team didn't think so. > > It does not. It was already discussed. > The device reset in legacy is not synchronous. > The drivers do not wait for reset to complete; it was written for the sw backend. > Hence MMIO BAR0 is not the best option in real implementations. Or maybe they made it synchronous in hardware, that's all. After all same is true for the IO BAR0 e.g. for the PF: IO writes are posted anyway. Whether that's possible would depend on the hardware architecture.
On Fri, Sep 22, 2023 at 11:13:18AM -0400, Michael S. Tsirkin wrote: > On Fri, Sep 22, 2023 at 12:25:06PM +0000, Parav Pandit wrote: > > > > > From: Jason Gunthorpe <jgg@nvidia.com> > > > Sent: Friday, September 22, 2023 5:53 PM > > > > > > > > And what's more, using MMIO BAR0 then it can work for legacy. > > > > > > Oh? How? Our team didn't think so. > > > > It does not. It was already discussed. > > The device reset in legacy is not synchronous. > > The drivers do not wait for reset to complete; it was written for the sw backend. > > Hence MMIO BAR0 is not the best option in real implementations. > > Or maybe they made it synchronous in hardware, that's all. > After all same is true for the IO BAR0 e.g. for the PF: IO writes > are posted anyway. IO writes are not posted in PCI. Jason
On Fri, Sep 22, 2023 at 09:25:01AM -0300, Jason Gunthorpe wrote: > On Fri, Sep 22, 2023 at 11:02:50AM +0800, Jason Wang wrote: > > On Fri, Sep 22, 2023 at 3:53 AM Jason Gunthorpe <jgg@nvidia.com> wrote: > > > > > > On Thu, Sep 21, 2023 at 03:34:03PM -0400, Michael S. Tsirkin wrote: > > > > > > > that's easy/practical. If instead VDPA gives the same speed with just > > > > shadow vq then keeping this hack in vfio seems like less of a problem. > > > > Finally if VDPA is faster then maybe you will reconsider using it ;) > > > > > > It is not all about the speed. > > > > > > VDPA presents another large and complex software stack in the > > > hypervisor that can be eliminated by simply using VFIO. > > > > vDPA supports standard virtio devices so how did you define > > complexity? > > As I said, VFIO is already required for other devices in these VMs. So > anything incremental over base-line vfio-pci is complexity to > minimize. > > Everything vdpa does is either redundant or unnecessary compared to > VFIO in these environments. > > Jason Yes but you know. There are all kind of environments. I guess you consider yours the most mainstream and important, and are sure it will always stay like this. But if there's a driver that does what you need then you use that. You really should be explaining what vdpa *does not* do that you need. But anyway, if Alex wants to maintain this it's not too bad, but I would like to see more code move into a library living under the virtio directory. As it is structured now it will make virtio core development harder.
On Fri, Sep 22, 2023 at 12:15:34PM -0300, Jason Gunthorpe wrote: > On Fri, Sep 22, 2023 at 11:13:18AM -0400, Michael S. Tsirkin wrote: > > On Fri, Sep 22, 2023 at 12:25:06PM +0000, Parav Pandit wrote: > > > > > > > From: Jason Gunthorpe <jgg@nvidia.com> > > > > Sent: Friday, September 22, 2023 5:53 PM > > > > > > > > > > > And what's more, using MMIO BAR0 then it can work for legacy. > > > > > > > > Oh? How? Our team didn't think so. > > > > > > It does not. It was already discussed. > > > The device reset in legacy is not synchronous. > > > The drivers do not wait for reset to complete; it was written for the sw backend. > > > Hence MMIO BAR0 is not the best option in real implementations. > > > > Or maybe they made it synchronous in hardware, that's all. > > After all same is true for the IO BAR0 e.g. for the PF: IO writes > > are posted anyway. > > IO writes are not posted in PCI. Aha, I was confused. Thanks for the correction. I guess you just buffer subsequent transactions while reset is going on and reset quickly enough for it to be seamless then?
On Fri, Sep 22, 2023 at 09:23:28AM -0300, Jason Gunthorpe wrote: > On Fri, Sep 22, 2023 at 05:47:23AM -0400, Michael S. Tsirkin wrote: > > > it will require maintenance effort when virtio changes are made. For > > example it pokes at the device state - I don't see specific races right > > now but in the past we did e.g. reset the device to recover from errors > > and we might start doing it again. > > > > If more of the logic is under virtio directory where we'll remember > > to keep it in loop, and will be able to reuse it from vdpa > > down the road, I would be more sympathetic. > > This is inevitable, the VFIO live migration driver will need all this > infrastructure too. > > Jason > I am not sure what you are saying and what is inevitable. VDPA for sure will want live migration support. I am not at all sympathetic to efforts that want to duplicate that support for virtio under VFIO. Put it in a library under the virtio directory, with a sane, well-documented interface. I don't maintain VFIO and Alex can merge what he wants, but I won't merge patches that export virtio internals in a way that will make virtio maintenance harder.
On Fri, Sep 22, 2023 at 11:39:19AM -0400, Michael S. Tsirkin wrote: > On Fri, Sep 22, 2023 at 09:25:01AM -0300, Jason Gunthorpe wrote: > > On Fri, Sep 22, 2023 at 11:02:50AM +0800, Jason Wang wrote: > > > On Fri, Sep 22, 2023 at 3:53 AM Jason Gunthorpe <jgg@nvidia.com> wrote: > > > > > > > > On Thu, Sep 21, 2023 at 03:34:03PM -0400, Michael S. Tsirkin wrote: > > > > > > > > > that's easy/practical. If instead VDPA gives the same speed with just > > > > > shadow vq then keeping this hack in vfio seems like less of a problem. > > > > > Finally if VDPA is faster then maybe you will reconsider using it ;) > > > > > > > > It is not all about the speed. > > > > > > > > VDPA presents another large and complex software stack in the > > > > hypervisor that can be eliminated by simply using VFIO. > > > > > > vDPA supports standard virtio devices so how did you define > > > complexity? > > > > As I said, VFIO is already required for other devices in these VMs. So > > anything incremental over base-line vfio-pci is complexity to > > minimize. > > > > Everything vdpa does is either redundant or unnecessary compared to > > VFIO in these environments. > > > > Jason > > Yes but you know. There are all kind of environments. I guess you > consider yours the most mainstream and important, and are sure it will > always stay like this. But if there's a driver that does what you need > then you use that. Come on, you are the one saying we cannot do things in the best way possible because you want your way of doing things to be the only way allowed. Which of us thinks "yours the most mainstream and important" ?? I'm not telling you to throw away VDPA, I'm saying there are legitimate real world use cases where VFIO is the appropriate interface, not VDPA. I want choice, not dogmatic exclusion that there is Only One True Way. > You really should be explaining what vdpa *does not* do that you > need. I think I've done that enough, but if you have been following my explanation you should see that the entire point of this design is to allow a virtio device to be created inside a DPU to a specific detailed specification (eg an AWS virtio-net device, for instance) The implementation is in the DPU, and only the DPU. At the end of the day VDPA uses mediation and creates some RedHat/VDPA/Qemu virtio-net device in the guest. It is emphatically NOT a perfect recreation of the "AWS virtio-net" we started out with. It entirely fails to achieve the most important thing it needs to do! Yishai will rework the series with your remarks, we can look again on v2, thanks for all the input! Jason
On Fri, Sep 22, 2023 at 11:40:58AM -0400, Michael S. Tsirkin wrote: > On Fri, Sep 22, 2023 at 12:15:34PM -0300, Jason Gunthorpe wrote: > > On Fri, Sep 22, 2023 at 11:13:18AM -0400, Michael S. Tsirkin wrote: > > > On Fri, Sep 22, 2023 at 12:25:06PM +0000, Parav Pandit wrote: > > > > > > > > > From: Jason Gunthorpe <jgg@nvidia.com> > > > > > Sent: Friday, September 22, 2023 5:53 PM > > > > > > > > > > > > > > And what's more, using MMIO BAR0 then it can work for legacy. > > > > > > > > > > Oh? How? Our team didn't think so. > > > > > > > > It does not. It was already discussed. > > > > The device reset in legacy is not synchronous. > > > > The drivers do not wait for reset to complete; it was written for the sw backend. > > > > Hence MMIO BAR0 is not the best option in real implementations. > > > > > > Or maybe they made it synchronous in hardware, that's all. > > > After all same is true for the IO BAR0 e.g. for the PF: IO writes > > > are posted anyway. > > > > IO writes are not posted in PCI. > > Aha, I was confused. Thanks for the correction. I guess you just buffer > subsequent transactions while reset is going on and reset quickly enough > for it to be seamless then? From a hardware perspective the CPU issues a non-posted IO write and then it stops processing until the far side returns an IO completion. Using that you can emulate what the SW virtio model did and delay the CPU from restarting until the reset is completed. Since MMIO is always posted, this is not possible to emulate directly using MMIO. Converting IO into non-posted admin commands is a fairly close recreation of what actual HW would do. Jason
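[Editorial note: a rough sketch of the relay being described, under the assumption of a synchronous admin command API; every identifier below is hypothetical and only stands in for whatever the real series uses.]

/*
 * Hypothetical sketch: relay a guest write to the legacy IO BAR into an
 * admin queue command and only return once the device has completed it,
 * mirroring the non-posted semantics of a real legacy IO write.
 */
static int relay_legacy_io_write(struct my_admin_vq *avq, u8 offset,
				 const void *buf, u8 size)
{
	struct my_admin_cmd cmd = {
		.opcode	= MY_CMD_LEGACY_CFG_WRITE,	/* assumed opcode */
		.offset	= offset,
		.size	= size,
		.data	= buf,
	};

	/* Blocks until the admin virtqueue returns the command status. */
	return my_admin_cmd_exec(avq, &cmd);
}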
On Fri, Sep 22, 2023 at 8:25 PM Parav Pandit <parav@nvidia.com> wrote: > > > > From: Jason Gunthorpe <jgg@nvidia.com> > > Sent: Friday, September 22, 2023 5:53 PM > > > > > And what's more, using MMIO BAR0 then it can work for legacy. > > > > Oh? How? Our team didn't think so. > > It does not. It was already discussed. > The device reset in legacy is not synchronous. How do you know this? > The drivers do not wait for reset to complete; it was written for the sw backend. Do you see there's a flush after reset in the legacy driver?

static void vp_reset(struct virtio_device *vdev)
{
	struct virtio_pci_device *vp_dev = to_vp_device(vdev);
	/* 0 status means a reset. */
	vp_legacy_set_status(&vp_dev->ldev, 0);
	/* Flush out the status write, and flush in device writes,
	 * including MSi-X interrupts, if any. */
	vp_legacy_get_status(&vp_dev->ldev);
	/* Flush pending VQ/configuration callbacks. */
	vp_synchronize_vectors(vdev);
}

Thanks > Hence MMIO BAR0 is not the best option in real implementations. >
On Fri, Sep 22, 2023 at 8:11 PM Jason Gunthorpe <jgg@nvidia.com> wrote: > > On Fri, Sep 22, 2023 at 11:01:23AM +0800, Jason Wang wrote: > > > > Even when it does, there is no real use case to live migrate a > > > virtio-net function from, say, AWS to GCP. > > > > It can happen inside a single cloud vendor. For some reasons, DPU must > > be purchased from different vendors. And vDPA has been used in that > > case. > > Nope, you misunderstand the DPU scenario. > > Look at something like vmware DPU enablement. vmware runs the software > side of the DPU and all their supported DPU HW, from every vendor, > generates the same PCI functions on the x86. They are the same because > the same software on the DPU side is creating them. > > There is no reason to put a mediation layer in the x86 if you also > control the DPU. > > Cloud vendors will similarly use DPUs to create a PCI functions that > meet the cloud vendor's internal specification. This can only work if: 1) the internal specification has finer grain than the virtio spec; 2) it can define what is not implemented in the virtio spec (like migration and compatibility). None of the above seems possible or realistic now, and it actually risks being incompatible with the virtio spec. In the future, when virtio has live migration support, they will want to be able to migrate between virtio and vDPA. As I said, vDPA has been used for cross vendor live migration for a while. > Regardless of DPU > vendor. > > Fundamentally if you control the DPU SW and the hypervisor software > you do not need hypervisor mediation because everything you could do > in hypervisor mediation can just be done in the DPU. Putting it in the > DPU is better in every regard. > > So, as I keep saying, in this scenario the goal is no mediation in the > hypervisor. That's pretty fine, but I don't see how trapping + relaying is not mediation. Does it really matter what happens after trapping? > It is pointless, everything you think you need to do there > is actually already being done in the DPU. Well, migration or even Qemu could be offloaded to the DPU as well. If that's the direction, that's pretty fine. Thanks > > Jason >
On 9/22/2023 4:55 AM, Michael S. Tsirkin wrote: > On Thu, Sep 21, 2023 at 04:51:15PM -0300, Jason Gunthorpe wrote: >> On Thu, Sep 21, 2023 at 03:17:25PM -0400, Michael S. Tsirkin wrote: >>> On Thu, Sep 21, 2023 at 03:39:26PM -0300, Jason Gunthorpe wrote: >>>>> What is the huge amount of work am I asking to do? >>>> You are asking us to invest in the complexity of VDPA through out >>>> (keep it working, keep it secure, invest time in deploying and >>>> debugging in the field) >>> I'm asking you to do nothing of the kind - I am saying that this code >>> will have to be duplicated in vdpa, >> Why would that be needed? > For the same reason it was developed in the 1st place - presumably > because it adds efficient legacy guest support with the right card? > I get it, you specifically don't need VDPA functionality, but I don't > see why is this universal, or common. > > >>> and so I am asking what exactly is missing to just keep it all >>> there. >> VFIO. Seriously, we don't want unnecessary mediation in this path at >> all. > But which mediation is necessary is exactly up to the specific use-case. > I have no idea why would you want all of VFIO to e.g. pass access to > random config registers to the guest when it's a virtio device and the > config registers are all nicely listed in the spec. I know nvidia > hardware is so great, it has super robust cards with less security holes > than the vdpa driver, but I very much doubt this is universal for all > virtio offload cards. I agree with MST. >>> note I didn't ask you to add iommufd to vdpa though that would be >>> nice ;) >> I did once send someone to look.. It didn't succeed :( >> >> Jason > Pity. Maybe there's some big difficulty blocking this? I'd like to know. >
> From: Jason Wang <jasowang@redhat.com> > Sent: Monday, September 25, 2023 8:00 AM > > On Fri, Sep 22, 2023 at 8:25 PM Parav Pandit <parav@nvidia.com> wrote: > > > > > > > From: Jason Gunthorpe <jgg@nvidia.com> > > > Sent: Friday, September 22, 2023 5:53 PM > > > > > > > > And what's more, using MMIO BAR0 then it can work for legacy. > > > > > > Oh? How? Our team didn't think so. > > > > It does not. It was already discussed. > > The device reset in legacy is not synchronous. > > How do you know this? > Not sure of the motivation for having the same discussion that was already done in OASIS with you and others in the past. Anyways, please find the answer below. About reset: the legacy device specification did not enforce the below-cited driver requirement of 1.0. "The driver SHOULD consider a driver-initiated reset complete when it reads device status as 0." [1] https://ozlabs.org/~rusty/virtio-spec/virtio-0.9.5.pdf > > The drivers do not wait for reset to complete; it was written for the sw > backend. > > Do you see there's a flush after reset in the legacy driver? > Yes, it only flushes the write by reading it. The driver does not _wait_ for the reset to complete within the device as required above. Please see the reset flow of the 1.x device below. In fact, the comment in the 1.x driver also needs to be updated to indicate that the driver needs to wait for the device to finish the reset. I will send a separate patch improving this comment in vp_reset() to match the spec.

static void vp_reset(struct virtio_device *vdev)
{
	struct virtio_pci_device *vp_dev = to_vp_device(vdev);
	struct virtio_pci_modern_device *mdev = &vp_dev->mdev;

	/* 0 status means a reset. */
	vp_modern_set_status(mdev, 0);
	/* After writing 0 to device_status, the driver MUST wait for a read of
	 * device_status to return 0 before reinitializing the device.
	 * This will flush out the status write, and flush in device writes,
	 * including MSI-X interrupts, if any.
	 */
	while (vp_modern_get_status(mdev))
		msleep(1);
	/* Flush pending VQ/configuration callbacks. */
	vp_synchronize_vectors(vdev);
}

> static void vp_reset(struct virtio_device *vdev) {
>         struct virtio_pci_device *vp_dev = to_vp_device(vdev);
>         /* 0 status means a reset. */
>         vp_legacy_set_status(&vp_dev->ldev, 0);
>         /* Flush out the status write, and flush in device writes,
>          * including MSi-X interrupts, if any. */
>         vp_legacy_get_status(&vp_dev->ldev);
>         /* Flush pending VQ/configuration callbacks. */
>         vp_synchronize_vectors(vdev);
> }
>
> Thanks > > > > Hence MMIO BAR0 is not the best option in real implementations. > >
On Mon, Sep 25, 2023 at 10:34:54AM +0800, Jason Wang wrote: > > Cloud vendors will similarly use DPUs to create a PCI functions that > > meet the cloud vendor's internal specification. > > This can only work if: > > 1) the internal specification has finer grain than virtio spec > 2) so it can define what is not implemented in the virtio spec (like > migration and compatibility) Yes, and that is what is happening. Realistically the "spec" is just a piece of software that the Cloud vendor owns which is simply ported to multiple DPU vendors. It is the same as VDPA. If VDPA can make multiple NIC vendors consistent then why do you have a hard time believing we can do the same thing just on the ARM side of a DPU? > All of the above doesn't seem to be possible or realistic now, and it > actually has a risk to be not compatible with virtio spec. In the > future when virtio has live migration supported, they want to be able > to migrate between virtio and vDPA. Well, that is for the spec to design. > > So, as I keep saying, in this scenario the goal is no mediation in the > > hypervisor. > > That's pretty fine, but I don't think trapping + relaying is not > mediation. Does it really matter what happens after trapping? It is not mediation in the sense that the kernel driver does not in any way make decisions on the behavior of the device. It simply transforms an IO operation into a device command and relays it to the device. The device still fully controls its own behavior. VDPA is very different from this. You might call them both mediation, sure, but then you need another word to describe the additional changes VDPA is doing. > > It is pointless, everything you think you need to do there > > is actually already being done in the DPU. > > Well, migration or even Qemu could be offloaded to DPU as well. If > that's the direction that's pretty fine. That's silly, of course qemu/kvm can't run in the DPU. However, we can empty qemu and the hypervisor out so all it does is run kvm and run vfio. In this model the DPU does all the OVS, storage, "VDPA", etc. qemu is just a passive relay of the DPU PCI functions into the VM's vPCI functions. So, everything VDPA was doing in the environment is migrated into the DPU. In this model the DPU is an extension of the hypervisor/qemu environment and we shift code from the x86 side to the arm side to increase security, save power and increase total system performance.
On Fri, Sep 22, 2023 at 01:22:33PM -0300, Jason Gunthorpe wrote: > On Fri, Sep 22, 2023 at 11:40:58AM -0400, Michael S. Tsirkin wrote: > > On Fri, Sep 22, 2023 at 12:15:34PM -0300, Jason Gunthorpe wrote: > > > On Fri, Sep 22, 2023 at 11:13:18AM -0400, Michael S. Tsirkin wrote: > > > > On Fri, Sep 22, 2023 at 12:25:06PM +0000, Parav Pandit wrote: > > > > > > > > > > > From: Jason Gunthorpe <jgg@nvidia.com> > > > > > > Sent: Friday, September 22, 2023 5:53 PM > > > > > > > > > > > > > > > > > And what's more, using MMIO BAR0 then it can work for legacy. > > > > > > > > > > > > Oh? How? Our team didn't think so. > > > > > > > > > > It does not. It was already discussed. > > > > > The device reset in legacy is not synchronous. > > > > > The drivers do not wait for reset to complete; it was written for the sw backend. > > > > > Hence MMIO BAR0 is not the best option in real implementations. > > > > > > > > Or maybe they made it synchronous in hardware, that's all. > > > > After all same is true for the IO BAR0 e.g. for the PF: IO writes > > > > are posted anyway. > > > > > > IO writes are not posted in PCI. > > > > Aha, I was confused. Thanks for the correction. I guess you just buffer > > subsequent transactions while reset is going on and reset quickly enough > > for it to be seamless then? > > From a hardware perspective the CPU issues a non-posted IO write and > then it stops processing until the far side returns an IO completion. > > Using that you can emulate what the SW virtio model did and delay the > CPU from restarting until the reset is completed. > > Since MMIO is always posted, this is not possible to emulate directly > using MMIO. > > Converting IO into non-posted admin commands is a fairly close > recreation of what actual HW would do. > > Jason I thought you asked how it is possible for hardware to support reset if all it does is replace IO BAR with memory BAR. The answer is that since 2011 the reset is followed by a read of the status field (which isn't much older than MSIX support from 2009 - which this code assumes). If one uses a Linux driver from 2011 on, then all you need to do is defer the response to this read until after the reset is complete. If you are using older drivers or other OSes then reset using a posted write after the device has operated for a while might not be safe, so e.g. you might trigger races if you remove drivers from the system or trigger hot unplug. For example:

static void virtio_pci_remove(struct pci_dev *pci_dev)
{
	....
	unregister_virtio_device(&vp_dev->vdev);
	^^^^ triggers reset, then releases memory
	....
	pci_disable_device(pci_dev);
	^^^ blocks DMA by clearing bus master
}

Here you could see some DMA into memory that has just been released. As Jason mentions, hardware exists that is used under one of these two restrictions on the guest (Linux since 2011 or no resets while DMA is going on), and it works fine with these existing guests. Given the restrictions, the virtio TC didn't elect to standardize this approach and instead opted for the heavier approach of converting IO into non-posted admin commands in software.
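[Editorial note: a pseudo-C sketch of the device-side trick described above, purely to illustrate the idea; nothing here corresponds to a real driver or firmware interface. After a reset write arrives over the memory BAR, the device withholds the completion of the next status read until its internal reset has finished, which is sufficient for Linux guests from 2011 on that flush the reset with a status read.]

/* Hypothetical device/DPU firmware handler for a legacy status read. */
static u8 emu_legacy_status_read(struct emu_virtio_dev *dev)
{
	if (dev->reset_in_progress) {
		/* Hold the read completion until the internal reset is done. */
		emu_wait_internal_reset(dev);
		dev->reset_in_progress = false;
		dev->status = 0;
	}
	return dev->status;
}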
On Fri, Sep 22, 2023 at 01:19:28PM -0300, Jason Gunthorpe wrote: > On Fri, Sep 22, 2023 at 11:39:19AM -0400, Michael S. Tsirkin wrote: > > On Fri, Sep 22, 2023 at 09:25:01AM -0300, Jason Gunthorpe wrote: > > > On Fri, Sep 22, 2023 at 11:02:50AM +0800, Jason Wang wrote: > > > > On Fri, Sep 22, 2023 at 3:53 AM Jason Gunthorpe <jgg@nvidia.com> wrote: > > > > > > > > > > On Thu, Sep 21, 2023 at 03:34:03PM -0400, Michael S. Tsirkin wrote: > > > > > > > > > > > that's easy/practical. If instead VDPA gives the same speed with just > > > > > > shadow vq then keeping this hack in vfio seems like less of a problem. > > > > > > Finally if VDPA is faster then maybe you will reconsider using it ;) > > > > > > > > > > It is not all about the speed. > > > > > > > > > > VDPA presents another large and complex software stack in the > > > > > hypervisor that can be eliminated by simply using VFIO. > > > > > > > > vDPA supports standard virtio devices so how did you define > > > > complexity? > > > > > > As I said, VFIO is already required for other devices in these VMs. So > > > anything incremental over base-line vfio-pci is complexity to > > > minimize. > > > > > > Everything vdpa does is either redundant or unnecessary compared to > > > VFIO in these environments. > > > > > > Jason > > > > Yes but you know. There are all kind of environments. I guess you > > consider yours the most mainstream and important, and are sure it will > > always stay like this. But if there's a driver that does what you need > > then you use that. > > Come on, you are the one saying we cannot do things in the best way > possible because you want your way of doing things to be the only way > allowed. Which of us thinks "yours the most mainstream and important" ?? > > I'm not telling you to throw away VPDA, I'm saying there are > legimitate real world use cases where VFIO is the appropriate > interface, not VDPA. > > I want choice, not dogmatic exclusion that there is Only One True Way. I don't particularly think there's only one way, vfio is already there. I am specifically thinking about this patch, for example it muddies the waters a bit: normally I think vfio exposed device with the same ID, suddenly it changes the ID as visible to the guest. But again, whether doing this kind of thing is OK is more up to Alex than me. I do want to understand if there's a use-case that vdpa does not address simply because it might be worth while to extend it to do so, and a bunch of people working on it are at Red Hat and I might have some input into how that labor is allocated. But if the use-case is simply "has to be vfio and not vdpa" then I guess not. > > You really should be explaining what vdpa *does not* do that you > > need. > > I think I've done that enough, but if you have been following my > explanation you should see that the entire point of this design is to > allow a virtio device to be created inside a DPU to a specific > detailed specification (eg an AWS virtio-net device, for instance) > > The implementation is in the DPU, and only the DPU. > > At the end of the day VDPA uses mediation and creates some > RedHat/VDPA/Qemu virtio-net device in the guest. It is emphatically > NOT a perfect recreation of the "AWS virtio-net" we started out with. > > It entirely fails to achieve the most important thing it needs to do! It could be that we are using mediation differently - in my world it's when there's some host software on the path between guest and hardware, and this qualifies. 
The difference between what this patch does and what vdpa does seems quantitative, not qualitative. Which might be enough to motivate this work, I don't mind. But you seem to feel it is qualitative and I am genuinely curious about it, because if so then it might lead e.g. the virtio standard in new directions. I can *imagine* all kinds of reasons to want to use vfio as compared to vdpa; here are some examples I came up with, quickly: - maybe you have drivers that poke at registers not in the virtio spec: vfio allows that, vdpa by design does not - maybe you are using vfio with a lot of devices already and don't want to special-case handling for virtio devices on the host do any of the above motivations ring a bell? Some of the things you said seem to hint at that. If so, maybe include this in the cover letter. There is also a question of capability. Specifically iommufd support is lacking in vdpa (though there are finally some RFC patches to address that). All this is fine, and could be enough to motivate work like this one. But I am very curious to know if there is any other capability lacking in vdpa. I asked already and you didn't answer so I guess not? > Yishai will rework the series with your remarks, we can look again on > v2, thanks for all the input! > > Jason
On Mon, Sep 25, 2023 at 08:26:33AM +0000, Parav Pandit wrote: > > > > From: Jason Wang <jasowang@redhat.com> > > Sent: Monday, September 25, 2023 8:00 AM > > > > On Fri, Sep 22, 2023 at 8:25 PM Parav Pandit <parav@nvidia.com> wrote: > > > > > > > > > > From: Jason Gunthorpe <jgg@nvidia.com> > > > > Sent: Friday, September 22, 2023 5:53 PM > > > > > > > > > > > And what's more, using MMIO BAR0 then it can work for legacy. > > > > > > > > Oh? How? Our team didn't think so. > > > > > > It does not. It was already discussed. > > > The device reset in legacy is not synchronous. > > > > How do you know this? > > > Not sure the motivation of same discussion done in the OASIS with you and others in past. > > Anyways, please find the answer below. > > About reset, > The legacy device specification has not enforced below cited 1.0 driver requirement of 1.0. > > "The driver SHOULD consider a driver-initiated reset complete when it reads device status as 0." > > [1] https://ozlabs.org/~rusty/virtio-spec/virtio-0.9.5.pdf Basically, I think any drivers that did not read status (linux pre 2011) before freeing memory under DMA have a reset path that is racy wrt DMA, since memory writes are posted and IO writes while not posted have completion that does not order posted transactions, e.g. from pci express spec: D2b An I/O or Configuration Write Completion 37 is permitted to pass a Posted Request. having said that there were a ton of driver races discovered on this path in the years since, I suspect if one cares about this then just avoiding stress on reset is wise. > > > The drivers do not wait for reset to complete; it was written for the sw > > backend. > > > > Do you see there's a flush after reset in the legacy driver? > > > Yes. it only flushes the write by reading it. The driver does not get _wait_ for the reset to complete within the device like above. One can thinkably do that wait in hardware, though. Just defer completion until read is done. > Please see the reset flow of 1.x device as below. > In fact the comment of the 1.x device also needs to be updated to indicate that driver need to wait for the device to finish the reset. > I will send separate patch for improving this comment of vp_reset() to match the spec. > > static void vp_reset(struct virtio_device *vdev) > { > struct virtio_pci_device *vp_dev = to_vp_device(vdev); > struct virtio_pci_modern_device *mdev = &vp_dev->mdev; > > /* 0 status means a reset. */ > vp_modern_set_status(mdev, 0); > /* After writing 0 to device_status, the driver MUST wait for a read of > * device_status to return 0 before reinitializing the device. > * This will flush out the status write, and flush in device writes, > * including MSI-X interrupts, if any. > */ > while (vp_modern_get_status(mdev)) > msleep(1); > /* Flush pending VQ/configuration callbacks. */ > vp_synchronize_vectors(vdev); > } > > > > static void vp_reset(struct virtio_device *vdev) { > > struct virtio_pci_device *vp_dev = to_vp_device(vdev); > > /* 0 status means a reset. */ > > vp_legacy_set_status(&vp_dev->ldev, 0); > > /* Flush out the status write, and flush in device writes, > > * including MSi-X interrupts, if any. */ > > vp_legacy_get_status(&vp_dev->ldev); > > /* Flush pending VQ/configuration callbacks. */ > > vp_synchronize_vectors(vdev); > > } > > > > Thanks > > > > > > > > > Hence MMIO BAR0 is not the best option in real implementations. > > > >
On Mon, Sep 25, 2023 at 02:16:30PM -0400, Michael S. Tsirkin wrote: > I do want to understand if there's a use-case that vdpa does not address > simply because it might be worth while to extend it to do so, and a > bunch of people working on it are at Red Hat and I might have some input > into how that labor is allocated. But if the use-case is simply "has to > be vfio and not vdpa" then I guess not. If you strip away all the philosophical arguing, VDPA has no way to isolate the control and data virtqs to different IOMMU configurations with this single PCI function. The existing HW VDPA drivers provided device-specific ways to handle this. Without DMA isolation you can't assign the high-speed data virtqs to the VM without mediating them as well. > It could be that we are using mediation differently - in my world it's > when there's some host software on the path between guest and hardware, > and this qualifies. That is pretty general. As I said to Jason, if you want to use it that way then you need to make up a new word to describe what VDPA does, as there is a clear difference in scope between this VFIO patch (relay IO commands to the device) and VDPA (intercept all the control plane, control virtq and bring it to a RedHat/qemu standard common behavior). > There is also a question of capability. Specifically iommufd support > is lacking in vdpa (though there are finally some RFC patches to > address that). All this is fine, could be enough to motivate > a work like this one. I've answered many times, you just don't seem to like the answers or dismiss them as not relevant to you. Jason
On Mon, Sep 25, 2023 at 09:26:07AM -0300, Jason Gunthorpe wrote: > > > So, as I keep saying, in this scenario the goal is no mediation in the > > > hypervisor. > > > > That's pretty fine, but I don't think trapping + relying is not > > mediation. Does it really matter what happens after trapping? > > It is not mediation in the sense that the kernel driver does not in > any way make decisions on the behavior of the device. It simply > transforms an IO operation into a device command and relays it to the > device. The device still fully controls its own behavior. > > VDPA is very different from this. You might call them both mediation, > sure, but then you need another word to describe the additional > changes VPDA is doing. Sorry about hijacking the thread a little bit, but could you call out some of the changes that are the most problematic for you? > > > It is pointless, everything you think you need to do there > > > is actually already being done in the DPU. > > > > Well, migration or even Qemu could be offloaded to DPU as well. If > > that's the direction that's pretty fine. > > That's silly, of course qemu/kvm can't run in the DPU. > > However, we can empty qemu and the hypervisor out so all it does is > run kvm and run vfio. In this model the DPU does all the OVS, storage, > "VPDA", etc. qemu is just a passive relay of the DPU PCI functions > into VM's vPCI functions. > > So, everything VDPA was doing in the environment is migrated into the > DPU. > > In this model the DPU is an extension of the hypervisor/qemu > environment and we shift code from x86 side to arm side to increase > security, save power and increase total system performance. > > Jason I think I begin to understand. On the DPU you have some virtio devices but also some non-virtio devices. So you have to use VFIO to talk to the DPU. Reusing VFIO to talk to virtio devices too, simplifies things for you. If guests will see vendor-specific devices from the DPU anyway, it will be impossible to migrate such guests away from the DPU so the cross-vendor migration capability is less important in this use-case. Is this a good summary?
On Mon, Sep 25, 2023 at 03:53:18PM -0300, Jason Gunthorpe wrote: > On Mon, Sep 25, 2023 at 02:16:30PM -0400, Michael S. Tsirkin wrote: > > > I do want to understand if there's a use-case that vdpa does not address > > simply because it might be worth while to extend it to do so, and a > > bunch of people working on it are at Red Hat and I might have some input > > into how that labor is allocated. But if the use-case is simply "has to > > be vfio and not vdpa" then I guess not. > > If you strip away all the philisophical arguing VDPA has no way to > isolate the control and data virtqs to different IOMMU configurations > with this single PCI function. Aha, so address space/PASID support then? > The existing HW VDPA drivers provided device specific ways to handle > this. > > Without DMA isolation you can't assign the high speed data virtq's to > the VM without mediating them as well. > > > It could be that we are using mediation differently - in my world it's > > when there's some host software on the path between guest and hardware, > > and this qualifies. > > That is pretty general. As I said to Jason, if you want to use it that > way then you need to make up a new word to describe what VDPA does as > there is a clear difference in scope between this VFIO patch (relay IO > commands to the device) and VDPA (intercept all the control plane, > control virtq and bring it to a RedHat/qemu standard common behavior) IIUC VDPA itself does not really bring it to either RedHat or qemu standard, it just allows userspace to control behaviour - if userspace is qemu then it's qemu deciding how it behaves. Which I guess this doesn't. Right? RedHat's not in the picture at all I think. > > There is also a question of capability. Specifically iommufd support > > is lacking in vdpa (though there are finally some RFC patches to > > address that). All this is fine, could be enough to motivate > > a work like this one. > > I've answered many times, you just don't semm to like the answers or > dismiss them as not relevant to you. > > Jason Not really I think I lack some of the picture so I don't fully understand. Or maybe I missed something else.
On Mon, Sep 25, 2023 at 03:44:11PM -0400, Michael S. Tsirkin wrote: > > VDPA is very different from this. You might call them both mediation, > > sure, but then you need another word to describe the additional > > changes VDPA is doing. > > Sorry about hijacking the thread a little bit, but could you > call out some of the changes that are the most problematic > for you? I don't really know these details. The operators have an existing virtio world that is ABI toward the VM for them, and they do not want *anything* to change. The VM should be unaware whether the virtio device is created by old hypervisor software or new DPU software. It presents exactly the same ABI. So the challenge really is to convince people that VDPA delivers that, and frankly, I don't think it does. ABI toward the VM is very important here. > > In this model the DPU is an extension of the hypervisor/qemu > > environment and we shift code from x86 side to arm side to increase > > security, save power and increase total system performance. > > I think I begin to understand. On the DPU you have some virtio > devices but also some non-virtio devices. So you have to > use VFIO to talk to the DPU. Reusing VFIO to talk to virtio > devices too, simplifies things for you. Yes > If guests will see vendor-specific devices from the DPU anyway, it > will be impossible to migrate such guests away from the DPU so the > cross-vendor migration capability is less important in this > use-case. Is this a good summary? Well, sort of. As I said before, the vendor here is the cloud operator, not the DPU supplier. The guest will see an AWS virtio-net function, for example. The operator will ensure that all their different implementations of this function will interwork for migration. So within the closed world of a single operator, live migration will work just fine. Since the hypervisors controlled by the operator only migrate within the operator's own environment anyhow, it is an already solved problem. Jason
On Mon, Sep 25, 2023 at 4:26 PM Parav Pandit <parav@nvidia.com> wrote: > > > > > From: Jason Wang <jasowang@redhat.com> > > Sent: Monday, September 25, 2023 8:00 AM > > > > On Fri, Sep 22, 2023 at 8:25 PM Parav Pandit <parav@nvidia.com> wrote: > > > > > > > > > > From: Jason Gunthorpe <jgg@nvidia.com> > > > > Sent: Friday, September 22, 2023 5:53 PM > > > > > > > > > > > And what's more, using MMIO BAR0 then it can work for legacy. > > > > > > > > Oh? How? Our team didn't think so. > > > > > > It does not. It was already discussed. > > > The device reset in legacy is not synchronous. > > > > How do you know this? > > > Not sure the motivation of same discussion done in the OASIS with you and others in past. That is exactly the same point. It's too late to define the legacy behaviour accurately in the spec so people will be lost in the legacy maze easily. > > Anyways, please find the answer below. > > About reset, > The legacy device specification has not enforced below cited 1.0 driver requirement of 1.0. > > "The driver SHOULD consider a driver-initiated reset complete when it reads device status as 0." We are talking about how to make devices work for legacy drivers. So it has nothing related to 1.0. > > [1] https://ozlabs.org/~rusty/virtio-spec/virtio-0.9.5.pdf > > > > The drivers do not wait for reset to complete; it was written for the sw > > backend. > > > > Do you see there's a flush after reset in the legacy driver? > > > Yes. it only flushes the write by reading it. The driver does not get _wait_ for the reset to complete within the device like above. It's the implementation details in legacy. The device needs to make sure (reset) the driver can work (is done before get_status return). That's all.
On 9/26/2023 2:36 AM, Michael S. Tsirkin wrote: > On Mon, Sep 25, 2023 at 08:26:33AM +0000, Parav Pandit wrote: >> >>> From: Jason Wang <jasowang@redhat.com> >>> Sent: Monday, September 25, 2023 8:00 AM >>> >>> On Fri, Sep 22, 2023 at 8:25 PM Parav Pandit <parav@nvidia.com> wrote: >>>> >>>>> From: Jason Gunthorpe <jgg@nvidia.com> >>>>> Sent: Friday, September 22, 2023 5:53 PM >>>> >>>>>> And what's more, using MMIO BAR0 then it can work for legacy. >>>>> Oh? How? Our team didn't think so. >>>> It does not. It was already discussed. >>>> The device reset in legacy is not synchronous. >>> How do you know this? >>> >> Not sure the motivation of same discussion done in the OASIS with you and others in past. >> >> Anyways, please find the answer below. >> >> About reset, >> The legacy device specification has not enforced below cited 1.0 driver requirement of 1.0. >> >> "The driver SHOULD consider a driver-initiated reset complete when it reads device status as 0." >> >> [1] https://ozlabs.org/~rusty/virtio-spec/virtio-0.9.5.pdf > Basically, I think any drivers that did not read status (linux pre 2011) > before freeing memory under DMA have a reset path that is racy wrt DMA, since > memory writes are posted and IO writes while not posted have completion > that does not order posted transactions, e.g. from pci express spec: > D2b > An I/O or Configuration Write Completion 37 is permitted to pass a Posted Request. > having said that there were a ton of driver races discovered on this > path in the years since, I suspect if one cares about this then > just avoiding stress on reset is wise. > > > >>>> The drivers do not wait for reset to complete; it was written for the sw >>> backend. >>> >>> Do you see there's a flush after reset in the legacy driver? >>> >> Yes. it only flushes the write by reading it. The driver does not get _wait_ for the reset to complete within the device like above. > One can thinkably do that wait in hardware, though. Just defer completion until > read is done. I agree with MST. At least Intel devices work fine with vfio-pci and legacy driver without any changes. So far so good. Thanks Zhu Lingshan > >> Please see the reset flow of 1.x device as below. >> In fact the comment of the 1.x device also needs to be updated to indicate that driver need to wait for the device to finish the reset. >> I will send separate patch for improving this comment of vp_reset() to match the spec. >> >> static void vp_reset(struct virtio_device *vdev) >> { >> struct virtio_pci_device *vp_dev = to_vp_device(vdev); >> struct virtio_pci_modern_device *mdev = &vp_dev->mdev; >> >> /* 0 status means a reset. */ >> vp_modern_set_status(mdev, 0); >> /* After writing 0 to device_status, the driver MUST wait for a read of >> * device_status to return 0 before reinitializing the device. >> * This will flush out the status write, and flush in device writes, >> * including MSI-X interrupts, if any. >> */ >> while (vp_modern_get_status(mdev)) >> msleep(1); >> /* Flush pending VQ/configuration callbacks. */ >> vp_synchronize_vectors(vdev); >> } >> >> >>> static void vp_reset(struct virtio_device *vdev) { >>> struct virtio_pci_device *vp_dev = to_vp_device(vdev); >>> /* 0 status means a reset. */ >>> vp_legacy_set_status(&vp_dev->ldev, 0); >>> /* Flush out the status write, and flush in device writes, >>> * including MSi-X interrupts, if any. */ >>> vp_legacy_get_status(&vp_dev->ldev); >>> /* Flush pending VQ/configuration callbacks. 
*/ >>> vp_synchronize_vectors(vdev); >>> } >>> >>> Thanks >>> >>> >>> >>>> Hence MMIO BAR0 is not the best option in real implementations. >>>>
> From: Michael S. Tsirkin <mst@redhat.com> > Sent: Tuesday, September 26, 2023 12:06 AM > One can thinkably do that wait in hardware, though. Just defer completion until > read is done. > Once OASIS does such a new interface and if some hw vendor _actually_ wants to do such complex hw, maybe the vfio driver can adapt to it. When we worked with you, we discussed that such hw does not have enough returns and hence the technical committee chose to proceed with admin commands. I will skip re-discussing it all over again here. The current virtio spec delivers the best trade-off of functionality, performance and lightweight implementation, with a forward path towards more features, such as migration, as Jason explained. All with near-zero driver, qemu and sw involvement for a rapidly growing feature set...
> From: Jason Wang <jasowang@redhat.com> > Sent: Tuesday, September 26, 2023 8:03 AM > > It's the implementation details in legacy. The device needs to make sure (reset) > the driver can work (is done before get_status return). It is part of the 0.9.5 and 1.x specifications, as I quoted that text above.
On Tue, Sep 26, 2023 at 12:01 PM Parav Pandit <parav@nvidia.com> wrote: > > > > > From: Jason Wang <jasowang@redhat.com> > > Sent: Tuesday, September 26, 2023 8:03 AM > > > > It's the implementation details in legacy. The device needs to make sure (reset) > > the driver can work (is done before get_status return). > It is part of the 0.9.5 and 1.x specifications, as I quoted that text above. What I meant is: legacy devices need to find their way to make legacy drivers work. That's how legacy works. It's too late to add any normative requirements to the 0.9.5 spec. So the device behaviour is actually defined by the legacy drivers. That is why it is tricky. If you can't find a way to make legacy drivers work, use modern. That's it. Thanks
On Tue, Sep 26, 2023 at 11:45 AM Parav Pandit <parav@nvidia.com> wrote: > > > > > From: Michael S. Tsirkin <mst@redhat.com> > > Sent: Tuesday, September 26, 2023 12:06 AM > > > One can thinkably do that wait in hardware, though. Just defer completion until > > read is done. > > > Once OASIS does such a new interface and if some hw vendor _actually_ wants to do such complex hw, maybe the vfio driver can adapt to it. It is you who is trying to revive legacy in the spec. We all know legacy is tricky, but it works. > When we worked with you, we discussed that such hw does not have enough returns and hence the technical committee chose to proceed with admin commands. I don't think my questions regarding the legacy transport got good answers at that time. What's more, we all know the spec allows us to fix, work around or even deprecate a feature. Thanks
On Mon, Sep 25, 2023 at 8:26 PM Jason Gunthorpe <jgg@nvidia.com> wrote: > > On Mon, Sep 25, 2023 at 10:34:54AM +0800, Jason Wang wrote: > > > > Cloud vendors will similarly use DPUs to create a PCI functions that > > > meet the cloud vendor's internal specification. > > > > This can only work if: > > > > 1) the internal specification has finer garin than virtio spec > > 2) so it can define what is not implemented in the virtio spec (like > > migration and compatibility) > > Yes, and that is what is happening. Realistically the "spec" isjust a > piece of software that the Cloud vendor owns which is simply ported to > multiple DPU vendors. > > It is the same as VDPA. If VDPA can make multiple NIC vendors > consistent then why do you have a hard time believing we can do the > same thing just on the ARM side of a DPU? I don't. We all know vDPA can do more than virtio. > > > All of the above doesn't seem to be possible or realistic now, and it > > actually has a risk to be not compatible with virtio spec. In the > > future when virtio has live migration supported, they want to be able > > to migrate between virtio and vDPA. > > Well, that is for the spec to design. Right, so if we'd consider migration from virtio to vDPA, it needs to be designed in a way that allows more involvement from hypervisor other than coupling it with a specific interface (like admin virtqueues). > > > > So, as I keep saying, in this scenario the goal is no mediation in the > > > hypervisor. > > > > That's pretty fine, but I don't think trapping + relying is not > > mediation. Does it really matter what happens after trapping? > > It is not mediation in the sense that the kernel driver does not in > any way make decisions on the behavior of the device. It simply > transforms an IO operation into a device command and relays it to the > device. The device still fully controls its own behavior. > > VDPA is very different from this. You might call them both mediation, > sure, but then you need another word to describe the additional > changes VPDA is doing. > > > > It is pointless, everything you think you need to do there > > > is actually already being done in the DPU. > > > > Well, migration or even Qemu could be offloaded to DPU as well. If > > that's the direction that's pretty fine. > > That's silly, of course qemu/kvm can't run in the DPU. KVM can't for sure but part of Qemu could. This model has been used. > > However, we can empty qemu and the hypervisor out so all it does is > run kvm and run vfio. In this model the DPU does all the OVS, storage, > "VPDA", etc. qemu is just a passive relay of the DPU PCI functions > into VM's vPCI functions. > > So, everything VDPA was doing in the environment is migrated into the > DPU. It really depends on the use cases. For example, in the case of DPU what if we want to provide multiple virtio devices through a single VF? > > In this model the DPU is an extension of the hypervisor/qemu > environment and we shift code from x86 side to arm side to increase > security, save power and increase total system performance. That's pretty fine. Thanks > > Jason >
> From: Jason Wang <jasowang@redhat.com> > Sent: Tuesday, September 26, 2023 10:07 AM > > If you can't find a way to make legacy drivers work, use modern. > Understood. This vfio series makes the legacy drivers work. Thanks. > That's it. > > Thanks
> From: Jason Wang <jasowang@redhat.com> > Sent: Tuesday, September 26, 2023 10:08 AM > Right, so if we'd consider migration from virtio to vDPA, it needs to be designed > in a way that allows more involvement from hypervisor other than coupling it > with a specific interface (like admin virtqueues). It is not attached to the admin virtqueues. One way to use it is via admin commands, as in [1]. One can define it without admin commands by explaining the technical difficulties where admin commands may not or cannot work. [1] https://lists.oasis-open.org/archives/virtio-comment/202309/msg00061.html
On Mon, Sep 25, 2023 at 09:40:59PM -0300, Jason Gunthorpe wrote: > On Mon, Sep 25, 2023 at 03:44:11PM -0400, Michael S. Tsirkin wrote: > > > VDPA is very different from this. You might call them both mediation, > > > sure, but then you need another word to describe the additional > > > changes VPDA is doing. > > > > Sorry about hijacking the thread a little bit, but could you > > call out some of the changes that are the most problematic > > for you? > > I don't really know these details. The operators have an existing > virtio world that is ABI toward the VM for them, and they do not want > *anything* to change. The VM should be unware if the virtio device is > created by old hypervisor software or new DPU software. It presents > exactly the same ABI. > > So the challenge really is to convince that VDPA delivers that, and > frankly, I don't think it does. ABI toward the VM is very important > here. And to complete the picture, it is the DPU software/firmware that is resposible for maintaining this ABI in your ideal world? > > > In this model the DPU is an extension of the hypervisor/qemu > > > environment and we shift code from x86 side to arm side to increase > > > security, save power and increase total system performance. > > > > I think I begin to understand. On the DPU you have some virtio > > devices but also some non-virtio devices. So you have to > > use VFIO to talk to the DPU. Reusing VFIO to talk to virtio > > devices too, simplifies things for you. > > Yes > > > If guests will see vendor-specific devices from the DPU anyway, it > > will be impossible to migrate such guests away from the DPU so the > > cross-vendor migration capability is less important in this > > use-case. Is this a good summary? > > Well, sort of. As I said before, the vendor here is the cloud > operator, not the DPU supplier. The guest will see an AWS virtio-net > function, for example. > > The operator will ensure that all their different implementations of > this function will interwork for migration. > > So within the closed world of a single operator live migration will > work just fine. > > Since the hypervisor's controlled by the operator only migrate within > the operators own environment anyhow, it is an already solved problem. > > Jason Okay the picture emerges I think. Thanks! I'll try to summarize later for everyone's benefit.
On Mon, Sep 25, 2023 at 09:40:59PM -0300, Jason Gunthorpe wrote: > On Mon, Sep 25, 2023 at 03:44:11PM -0400, Michael S. Tsirkin wrote: > > > VDPA is very different from this. You might call them both mediation, > > > sure, but then you need another word to describe the additional > > > changes VPDA is doing. > > > > Sorry about hijacking the thread a little bit, but could you > > call out some of the changes that are the most problematic > > for you? > > I don't really know these details. Maybe, you then should desist from saying things like "It entirely fails to achieve the most important thing it needs to do!" You are not making any new friends with saying this about a piece of software without knowing the details.
On Tue, Sep 26, 2023 at 10:32:39AM +0800, Jason Wang wrote: > It's the implementation details in legacy. The device needs to make > sure (reset) the driver can work (is done before get_status return). I think that there's no way to make it reliably work for all legacy drivers. They just assumed a software backend and did not bother with DMA ordering. You can try to avoid resets; they are not that common, so things will tend to mostly work if you don't stress them too much with things like hot plug/unplug in a loop. Or you can try to use a driver from after 2011, which is more aware of hardware ordering and flushes the reset write with a read. One of these two tricks, I think, is the magic behind the device exposing memory BAR 0 that you mention.
On Tue, Sep 26, 2023 at 01:42:52AM -0400, Michael S. Tsirkin wrote: > On Mon, Sep 25, 2023 at 09:40:59PM -0300, Jason Gunthorpe wrote: > > On Mon, Sep 25, 2023 at 03:44:11PM -0400, Michael S. Tsirkin wrote: > > > > VDPA is very different from this. You might call them both mediation, > > > > sure, but then you need another word to describe the additional > > > > changes VPDA is doing. > > > > > > Sorry about hijacking the thread a little bit, but could you > > > call out some of the changes that are the most problematic > > > for you? > > > > I don't really know these details. > > Maybe, you then should desist from saying things like "It entirely fails > to achieve the most important thing it needs to do!" You are not making > any new friends with saying this about a piece of software without > knowing the details. I can't tell you what cloud operators are doing, but I can say with confidence that it is not the same as VDPA. As I said, if you want to know more details you need to ask a cloud operator. Jason
On 2023-09-21 a.m.9:57, Michael S. Tsirkin wrote: > External email: Use caution opening links or attachments > > > On Thu, Sep 21, 2023 at 03:40:32PM +0300, Yishai Hadas wrote: >> From: Feng Liu <feliu@nvidia.com> >> >> Introduce support for the admin virtqueue. By negotiating >> VIRTIO_F_ADMIN_VQ feature, driver detects capability and creates one >> administration virtqueue. Administration virtqueue implementation in >> virtio pci generic layer, enables multiple types of upper layer >> drivers such as vfio, net, blk to utilize it. >> >> Signed-off-by: Feng Liu <feliu@nvidia.com> >> Reviewed-by: Parav Pandit <parav@nvidia.com> >> Reviewed-by: Jiri Pirko <jiri@nvidia.com> >> Signed-off-by: Yishai Hadas <yishaih@nvidia.com> >> --- >> drivers/virtio/Makefile | 2 +- >> drivers/virtio/virtio.c | 37 +++++++++++++-- >> drivers/virtio/virtio_pci_common.h | 15 +++++- >> drivers/virtio/virtio_pci_modern.c | 10 +++- >> drivers/virtio/virtio_pci_modern_avq.c | 65 ++++++++++++++++++++++++++ > > if you have a .c file without a .h file you know there's something > fishy. Just add this inside drivers/virtio/virtio_pci_modern.c ? > Will do. >> include/linux/virtio_config.h | 4 ++ >> include/linux/virtio_pci_modern.h | 3 ++ >> 7 files changed, 129 insertions(+), 7 deletions(-) >> create mode 100644 drivers/virtio/virtio_pci_modern_avq.c >> >> diff --git a/drivers/virtio/Makefile b/drivers/virtio/Makefile >> index 8e98d24917cc..dcc535b5b4d9 100644 >> --- a/drivers/virtio/Makefile >> +++ b/drivers/virtio/Makefile >> @@ -5,7 +5,7 @@ obj-$(CONFIG_VIRTIO_PCI_LIB) += virtio_pci_modern_dev.o >> obj-$(CONFIG_VIRTIO_PCI_LIB_LEGACY) += virtio_pci_legacy_dev.o >> obj-$(CONFIG_VIRTIO_MMIO) += virtio_mmio.o >> obj-$(CONFIG_VIRTIO_PCI) += virtio_pci.o >> -virtio_pci-y := virtio_pci_modern.o virtio_pci_common.o >> +virtio_pci-y := virtio_pci_modern.o virtio_pci_common.o virtio_pci_modern_avq.o >> virtio_pci-$(CONFIG_VIRTIO_PCI_LEGACY) += virtio_pci_legacy.o >> obj-$(CONFIG_VIRTIO_BALLOON) += virtio_balloon.o >> obj-$(CONFIG_VIRTIO_INPUT) += virtio_input.o >> diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c >> index 3893dc29eb26..f4080692b351 100644 >> --- a/drivers/virtio/virtio.c >> +++ b/drivers/virtio/virtio.c >> @@ -302,9 +302,15 @@ static int virtio_dev_probe(struct device *_d) >> if (err) >> goto err; >> >> + if (dev->config->create_avq) { >> + err = dev->config->create_avq(dev); >> + if (err) >> + goto err; >> + } >> + >> err = drv->probe(dev); >> if (err) >> - goto err; >> + goto err_probe; >> >> /* If probe didn't do it, mark device DRIVER_OK ourselves. */ >> if (!(dev->config->get_status(dev) & VIRTIO_CONFIG_S_DRIVER_OK)) >> @@ -316,6 +322,10 @@ static int virtio_dev_probe(struct device *_d) >> virtio_config_enable(dev); >> >> return 0; >> + >> +err_probe: >> + if (dev->config->destroy_avq) >> + dev->config->destroy_avq(dev); >> err: >> virtio_add_status(dev, VIRTIO_CONFIG_S_FAILED); >> return err; >> @@ -331,6 +341,9 @@ static void virtio_dev_remove(struct device *_d) >> >> drv->remove(dev); >> >> + if (dev->config->destroy_avq) >> + dev->config->destroy_avq(dev); >> + >> /* Driver should have reset device. 
*/ >> WARN_ON_ONCE(dev->config->get_status(dev)); >> >> @@ -489,13 +502,20 @@ EXPORT_SYMBOL_GPL(unregister_virtio_device); >> int virtio_device_freeze(struct virtio_device *dev) >> { >> struct virtio_driver *drv = drv_to_virtio(dev->dev.driver); >> + int ret; >> >> virtio_config_disable(dev); >> >> dev->failed = dev->config->get_status(dev) & VIRTIO_CONFIG_S_FAILED; >> >> - if (drv && drv->freeze) >> - return drv->freeze(dev); >> + if (drv && drv->freeze) { >> + ret = drv->freeze(dev); >> + if (ret) >> + return ret; >> + } >> + >> + if (dev->config->destroy_avq) >> + dev->config->destroy_avq(dev); >> >> return 0; >> } >> @@ -532,10 +552,16 @@ int virtio_device_restore(struct virtio_device *dev) >> if (ret) >> goto err; >> >> + if (dev->config->create_avq) { >> + ret = dev->config->create_avq(dev); >> + if (ret) >> + goto err; >> + } >> + >> if (drv->restore) { >> ret = drv->restore(dev); >> if (ret) >> - goto err; >> + goto err_restore; >> } >> >> /* If restore didn't do it, mark device DRIVER_OK ourselves. */ >> @@ -546,6 +572,9 @@ int virtio_device_restore(struct virtio_device *dev) >> >> return 0; >> >> +err_restore: >> + if (dev->config->destroy_avq) >> + dev->config->destroy_avq(dev); >> err: >> virtio_add_status(dev, VIRTIO_CONFIG_S_FAILED); >> return ret; >> diff --git a/drivers/virtio/virtio_pci_common.h b/drivers/virtio/virtio_pci_common.h >> index 602021967aaa..9bffa95274b6 100644 >> --- a/drivers/virtio/virtio_pci_common.h >> +++ b/drivers/virtio/virtio_pci_common.h >> @@ -41,6 +41,14 @@ struct virtio_pci_vq_info { >> unsigned int msix_vector; >> }; >> >> +struct virtio_avq { > > admin_vq would be better. and this is pci specific yes? so virtio_pci_ > Will do. >> + /* Virtqueue info associated with this admin queue. */ >> + struct virtio_pci_vq_info info; >> + /* Name of the admin queue: avq.$index. */ >> + char name[10]; >> + u16 vq_index; >> +}; >> + >> /* Our device structure */ >> struct virtio_pci_device { >> struct virtio_device vdev; >> @@ -58,10 +66,13 @@ struct virtio_pci_device { >> spinlock_t lock; >> struct list_head virtqueues; >> >> - /* array of all queues for house-keeping */ >> + /* Array of all virtqueues reported in the >> + * PCI common config num_queues field >> + */ >> struct virtio_pci_vq_info **vqs; >> u32 nvqs; >> >> + struct virtio_avq *admin; > > and this could be thinkably admin_vq. > Will do. 
>> /* MSI-X support */ >> int msix_enabled; >> int intx_enabled; >> @@ -115,6 +126,8 @@ int vp_find_vqs(struct virtio_device *vdev, unsigned int nvqs, >> const char * const names[], const bool *ctx, >> struct irq_affinity *desc); >> const char *vp_bus_name(struct virtio_device *vdev); >> +void vp_destroy_avq(struct virtio_device *vdev); >> +int vp_create_avq(struct virtio_device *vdev); >> >> /* Setup the affinity for a virtqueue: >> * - force the affinity for per vq vector >> diff --git a/drivers/virtio/virtio_pci_modern.c b/drivers/virtio/virtio_pci_modern.c >> index d6bb68ba84e5..a72c87687196 100644 >> --- a/drivers/virtio/virtio_pci_modern.c >> +++ b/drivers/virtio/virtio_pci_modern.c >> @@ -37,6 +37,9 @@ static void vp_transport_features(struct virtio_device *vdev, u64 features) >> >> if (features & BIT_ULL(VIRTIO_F_RING_RESET)) >> __virtio_set_bit(vdev, VIRTIO_F_RING_RESET); >> + >> + if (features & BIT_ULL(VIRTIO_F_ADMIN_VQ)) >> + __virtio_set_bit(vdev, VIRTIO_F_ADMIN_VQ); >> } >> >> /* virtio config->finalize_features() implementation */ >> @@ -317,7 +320,8 @@ static struct virtqueue *setup_vq(struct virtio_pci_device *vp_dev, >> else >> notify = vp_notify; >> >> - if (index >= vp_modern_get_num_queues(mdev)) >> + if (!((index < vp_modern_get_num_queues(mdev) || >> + (vp_dev->admin && vp_dev->admin->vq_index == index)))) >> return ERR_PTR(-EINVAL); >> >> /* Check if queue is either not available or already active. */ >> @@ -509,6 +513,8 @@ static const struct virtio_config_ops virtio_pci_config_nodev_ops = { >> .get_shm_region = vp_get_shm_region, >> .disable_vq_and_reset = vp_modern_disable_vq_and_reset, >> .enable_vq_after_reset = vp_modern_enable_vq_after_reset, >> + .create_avq = vp_create_avq, >> + .destroy_avq = vp_destroy_avq, >> }; >> >> static const struct virtio_config_ops virtio_pci_config_ops = { >> @@ -529,6 +535,8 @@ static const struct virtio_config_ops virtio_pci_config_ops = { >> .get_shm_region = vp_get_shm_region, >> .disable_vq_and_reset = vp_modern_disable_vq_and_reset, >> .enable_vq_after_reset = vp_modern_enable_vq_after_reset, >> + .create_avq = vp_create_avq, >> + .destroy_avq = vp_destroy_avq, >> }; >> >> /* the PCI probing function */ >> diff --git a/drivers/virtio/virtio_pci_modern_avq.c b/drivers/virtio/virtio_pci_modern_avq.c >> new file mode 100644 >> index 000000000000..114579ad788f >> --- /dev/null >> +++ b/drivers/virtio/virtio_pci_modern_avq.c >> @@ -0,0 +1,65 @@ >> +// SPDX-License-Identifier: GPL-2.0-or-later >> + >> +#include <linux/virtio.h> >> +#include "virtio_pci_common.h" >> + >> +static u16 vp_modern_avq_num(struct virtio_pci_modern_device *mdev) >> +{ >> + struct virtio_pci_modern_common_cfg __iomem *cfg; >> + >> + cfg = (struct virtio_pci_modern_common_cfg __iomem *)mdev->common; >> + return vp_ioread16(&cfg->admin_queue_num); >> +} >> + >> +static u16 vp_modern_avq_index(struct virtio_pci_modern_device *mdev) >> +{ >> + struct virtio_pci_modern_common_cfg __iomem *cfg; >> + >> + cfg = (struct virtio_pci_modern_common_cfg __iomem *)mdev->common; >> + return vp_ioread16(&cfg->admin_queue_index); >> +} >> + >> +int vp_create_avq(struct virtio_device *vdev) >> +{ >> + struct virtio_pci_device *vp_dev = to_vp_device(vdev); >> + struct virtio_avq *avq; >> + struct virtqueue *vq; >> + u16 admin_q_num; >> + >> + if (!virtio_has_feature(vdev, VIRTIO_F_ADMIN_VQ)) >> + return 0; >> + >> + admin_q_num = vp_modern_avq_num(&vp_dev->mdev); >> + if (!admin_q_num) >> + return -EINVAL; >> + >> + vp_dev->admin = kzalloc(sizeof(*vp_dev->admin), 
GFP_KERNEL); >> + if (!vp_dev->admin) >> + return -ENOMEM; >> + >> + avq = vp_dev->admin; >> + avq->vq_index = vp_modern_avq_index(&vp_dev->mdev); >> + sprintf(avq->name, "avq.%u", avq->vq_index); >> + vq = vp_dev->setup_vq(vp_dev, &vp_dev->admin->info, avq->vq_index, NULL, >> + avq->name, NULL, VIRTIO_MSI_NO_VECTOR); >> + if (IS_ERR(vq)) { >> + dev_err(&vdev->dev, "failed to setup admin virtqueue"); >> + kfree(vp_dev->admin); >> + return PTR_ERR(vq); >> + } >> + >> + vp_dev->admin->info.vq = vq; >> + vp_modern_set_queue_enable(&vp_dev->mdev, avq->info.vq->index, true); >> + return 0; >> +} >> + >> +void vp_destroy_avq(struct virtio_device *vdev) >> +{ >> + struct virtio_pci_device *vp_dev = to_vp_device(vdev); >> + >> + if (!vp_dev->admin) >> + return; >> + >> + vp_dev->del_vq(&vp_dev->admin->info); >> + kfree(vp_dev->admin); >> +} >> diff --git a/include/linux/virtio_config.h b/include/linux/virtio_config.h >> index 2b3438de2c4d..028c51ea90ee 100644 >> --- a/include/linux/virtio_config.h >> +++ b/include/linux/virtio_config.h >> @@ -93,6 +93,8 @@ typedef void vq_callback_t(struct virtqueue *); >> * Returns 0 on success or error status >> * If disable_vq_and_reset is set, then enable_vq_after_reset must also be >> * set. >> + * @create_avq: initialize admin virtqueue resource. >> + * @destroy_avq: destroy admin virtqueue resource. >> */ >> struct virtio_config_ops { >> void (*get)(struct virtio_device *vdev, unsigned offset, >> @@ -120,6 +122,8 @@ struct virtio_config_ops { >> struct virtio_shm_region *region, u8 id); >> int (*disable_vq_and_reset)(struct virtqueue *vq); >> int (*enable_vq_after_reset)(struct virtqueue *vq); >> + int (*create_avq)(struct virtio_device *vdev); >> + void (*destroy_avq)(struct virtio_device *vdev); >> }; >> >> /* If driver didn't advertise the feature, it will never appear. */ >> diff --git a/include/linux/virtio_pci_modern.h b/include/linux/virtio_pci_modern.h >> index 067ac1d789bc..f6cb13d858fd 100644 >> --- a/include/linux/virtio_pci_modern.h >> +++ b/include/linux/virtio_pci_modern.h >> @@ -10,6 +10,9 @@ struct virtio_pci_modern_common_cfg { >> >> __le16 queue_notify_data; /* read-write */ >> __le16 queue_reset; /* read-write */ >> + >> + __le16 admin_queue_index; /* read-only */ >> + __le16 admin_queue_num; /* read-only */ >> }; > > > ouch. > actually there's a problem > > mdev->common = vp_modern_map_capability(mdev, common, > sizeof(struct virtio_pci_common_cfg), 4, > 0, sizeof(struct virtio_pci_common_cfg), > NULL, NULL); > > extending this structure means some calls will start failing on > existing devices. > > even more of an ouch, when we added queue_notify_data and queue_reset we > also possibly broke some devices. well hopefully not since no one > reported failures but we really need to fix that. > > Hi Michael I didn’t see the fail in vp_modern_map_capability(), and vp_modern_map_capability() only read and map pci memory. The length of the memory mapping will increase as the struct virtio_pci_common_cfg increases. No errors are seen. And according to the existing code, new pci configuration space members can only be added in struct virtio_pci_modern_common_cfg. Every single entry added here is protected by feature bit, there is no bug AFAIK. Could you help to explain it more detail? Where and why it will fall if we add new member in struct virtio_pci_modern_common_cfg. >> >> struct virtio_pci_modern_device { >> -- >> 2.27.0 >
On 2023-09-26 p.m.3:23, Feng Liu via Virtualization wrote: > External email: Use caution opening links or attachments > > > On 2023-09-21 a.m.9:57, Michael S. Tsirkin wrote: >> External email: Use caution opening links or attachments >> >> >> On Thu, Sep 21, 2023 at 03:40:32PM +0300, Yishai Hadas wrote: >>> From: Feng Liu <feliu@nvidia.com> >>> drivers/virtio/virtio_pci_modern_avq.c | 65 ++++++++++++++++++++++++++ >> >> if you have a .c file without a .h file you know there's something >> fishy. Just add this inside drivers/virtio/virtio_pci_modern.c ? >> > Will do. > >>> +struct virtio_avq { >> >> admin_vq would be better. and this is pci specific yes? so virtio_pci_ >> > > Will do. > >>> >>> + struct virtio_avq *admin; >> >> and this could be thinkably admin_vq. >> > Will do. > >>> >>> /* If driver didn't advertise the feature, it will never appear. */ >>> diff --git a/include/linux/virtio_pci_modern.h >>> b/include/linux/virtio_pci_modern.h >>> index 067ac1d789bc..f6cb13d858fd 100644 >>> --- a/include/linux/virtio_pci_modern.h >>> +++ b/include/linux/virtio_pci_modern.h >>> @@ -10,6 +10,9 @@ struct virtio_pci_modern_common_cfg { >>> >>> __le16 queue_notify_data; /* read-write */ >>> __le16 queue_reset; /* read-write */ >>> + >>> + __le16 admin_queue_index; /* read-only */ >>> + __le16 admin_queue_num; /* read-only */ >>> }; >> >> >> ouch. >> actually there's a problem >> >> mdev->common = vp_modern_map_capability(mdev, common, >> sizeof(struct >> virtio_pci_common_cfg), 4, >> 0, sizeof(struct >> virtio_pci_common_cfg), >> NULL, NULL); >> >> extending this structure means some calls will start failing on >> existing devices. >> >> even more of an ouch, when we added queue_notify_data and queue_reset we >> also possibly broke some devices. well hopefully not since no one >> reported failures but we really need to fix that. >> >> > Hi Michael > > I didn’t see the fail in vp_modern_map_capability(), and > vp_modern_map_capability() only read and map pci memory. The length of > the memory mapping will increase as the struct virtio_pci_common_cfg > increases. No errors are seen. > > And according to the existing code, new pci configuration space members > can only be added in struct virtio_pci_modern_common_cfg. > > Every single entry added here is protected by feature bit, there is no > bug AFAIK. > > Could you help to explain it more detail? Where and why it will fall if > we add new member in struct virtio_pci_modern_common_cfg. > > Hi, Michael Any comments about this? Thanks Feng >>> >>> struct virtio_pci_modern_device { >>> -- >>> 2.27.0 >> > _______________________________________________ > Virtualization mailing list > Virtualization@lists.linux-foundation.org > https://lists.linuxfoundation.org/mailman/listinfo/virtualization
On Wed, Sep 27, 2023 at 02:12:24PM -0400, Feng Liu wrote: > > > On 2023-09-26 p.m.3:23, Feng Liu via Virtualization wrote: > > External email: Use caution opening links or attachments > > > > > > On 2023-09-21 a.m.9:57, Michael S. Tsirkin wrote: > > > External email: Use caution opening links or attachments > > > > > > > > > On Thu, Sep 21, 2023 at 03:40:32PM +0300, Yishai Hadas wrote: > > > > From: Feng Liu <feliu@nvidia.com> > > > > > > drivers/virtio/virtio_pci_modern_avq.c | 65 ++++++++++++++++++++++++++ > > > > > > if you have a .c file without a .h file you know there's something > > > fishy. Just add this inside drivers/virtio/virtio_pci_modern.c ? > > > > > Will do. > > > > > > > +struct virtio_avq { > > > > > > admin_vq would be better. and this is pci specific yes? so virtio_pci_ > > > > > > > Will do. > > > > > > > > > > > + struct virtio_avq *admin; > > > > > > and this could be thinkably admin_vq. > > > > > Will do. > > > > > > > > > > > /* If driver didn't advertise the feature, it will never appear. */ > > > > diff --git a/include/linux/virtio_pci_modern.h > > > > b/include/linux/virtio_pci_modern.h > > > > index 067ac1d789bc..f6cb13d858fd 100644 > > > > --- a/include/linux/virtio_pci_modern.h > > > > +++ b/include/linux/virtio_pci_modern.h > > > > @@ -10,6 +10,9 @@ struct virtio_pci_modern_common_cfg { > > > > > > > > __le16 queue_notify_data; /* read-write */ > > > > __le16 queue_reset; /* read-write */ > > > > + > > > > + __le16 admin_queue_index; /* read-only */ > > > > + __le16 admin_queue_num; /* read-only */ > > > > }; > > > > > > > > > ouch. > > > actually there's a problem > > > > > > mdev->common = vp_modern_map_capability(mdev, common, > > > sizeof(struct > > > virtio_pci_common_cfg), 4, > > > 0, sizeof(struct > > > virtio_pci_common_cfg), > > > NULL, NULL); > > > > > > extending this structure means some calls will start failing on > > > existing devices. > > > > > > even more of an ouch, when we added queue_notify_data and queue_reset we > > > also possibly broke some devices. well hopefully not since no one > > > reported failures but we really need to fix that. > > > > > > > > Hi Michael > > > > I didn’t see the fail in vp_modern_map_capability(), and > > vp_modern_map_capability() only read and map pci memory. The length of > > the memory mapping will increase as the struct virtio_pci_common_cfg > > increases. No errors are seen. > > > > And according to the existing code, new pci configuration space members > > can only be added in struct virtio_pci_modern_common_cfg. > > > > Every single entry added here is protected by feature bit, there is no > > bug AFAIK. > > > > Could you help to explain it more detail? Where and why it will fall if > > we add new member in struct virtio_pci_modern_common_cfg. > > > > > Hi, Michael > Any comments about this? > Thanks > Feng If an existing device exposes a small capability matching old size, then you change size then the check will fail on the existing device and driver won't load. All this happens way before feature bit checks. > > > > > > > > struct virtio_pci_modern_device { > > > > -- > > > > 2.27.0 > > > > > _______________________________________________ > > Virtualization mailing list > > Virtualization@lists.linux-foundation.org > > https://lists.linuxfoundation.org/mailman/listinfo/virtualization
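For readers following the capability discussion, here is a condensed, lightly paraphrased sketch of the check Michael refers to in vp_modern_map_capability() (drivers/virtio/virtio_pci_modern_dev.c); it is not the full function, and start is taken as 0 as in the common-cfg mapping call quoted above.

	u32 length;

	/* The device advertises how many bytes sit behind this capability;
	 * the caller passes the minimum it requires, which today is
	 * sizeof(struct virtio_pci_common_cfg) for the common config. */
	pci_read_config_dword(dev, off + offsetof(struct virtio_pci_cap, length),
			      &length);

	if (length < minlen) {
		/* If the driver started demanding the grown
		 * struct virtio_pci_modern_common_cfg size here, an existing
		 * device exposing only the original-sized capability would
		 * fail this check and the driver would refuse to bind. */
		dev_err(&dev->dev,
			"virtio_pci: bad capability len %u (>=%zu expected)\n",
			length, minlen);
		return NULL;
	}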
On Tue, Sep 26, 2023 at 10:50:57AM -0300, Jason Gunthorpe wrote: > On Tue, Sep 26, 2023 at 01:42:52AM -0400, Michael S. Tsirkin wrote: > > On Mon, Sep 25, 2023 at 09:40:59PM -0300, Jason Gunthorpe wrote: > > > On Mon, Sep 25, 2023 at 03:44:11PM -0400, Michael S. Tsirkin wrote: > > > > > VDPA is very different from this. You might call them both mediation, > > > > > sure, but then you need another word to describe the additional > > > > > changes VPDA is doing. > > > > > > > > Sorry about hijacking the thread a little bit, but could you > > > > call out some of the changes that are the most problematic > > > > for you? > > > > > > I don't really know these details. > > > > Maybe, you then should desist from saying things like "It entirely fails > > to achieve the most important thing it needs to do!" You are not making > > any new friends with saying this about a piece of software without > > knowing the details. > > I can't tell you what cloud operators are doing, but I can say with > confidence that it is not the same as VDPA. As I said, if you want to > know more details you need to ask a cloud operator. > > Jason So it's not the changes that are problematic, it's that you have customers who are not using vdpa. The "most important thing" that vdpa fails at is simply converting your customers from vfio to vdpa.
On Wed, Sep 27, 2023 at 05:38:55PM -0400, Michael S. Tsirkin wrote: > On Tue, Sep 26, 2023 at 10:50:57AM -0300, Jason Gunthorpe wrote: > > On Tue, Sep 26, 2023 at 01:42:52AM -0400, Michael S. Tsirkin wrote: > > > On Mon, Sep 25, 2023 at 09:40:59PM -0300, Jason Gunthorpe wrote: > > > > On Mon, Sep 25, 2023 at 03:44:11PM -0400, Michael S. Tsirkin wrote: > > > > > > VDPA is very different from this. You might call them both mediation, > > > > > > sure, but then you need another word to describe the additional > > > > > > changes VPDA is doing. > > > > > > > > > > Sorry about hijacking the thread a little bit, but could you > > > > > call out some of the changes that are the most problematic > > > > > for you? > > > > > > > > I don't really know these details. > > > > > > Maybe, you then should desist from saying things like "It entirely fails > > > to achieve the most important thing it needs to do!" You are not making > > > any new friends with saying this about a piece of software without > > > knowing the details. > > > > I can't tell you what cloud operators are doing, but I can say with > > confidence that it is not the same as VDPA. As I said, if you want to > > know more details you need to ask a cloud operator. > > So it's not the changes that are problematic, it's that you have > customers who are not using vdpa. The "most important thing" that vdpa > fails at is simply converting your customers from vfio to vdpa. I said the most important thing was that VFIO presents exactly the same virtio device to the VM as the baremetal. Do you dispute that, technically, VDPA does not actually achieve that? Then why is it so surprising that people don't want a solution that changes the vPCI ABI they worked hard to create in the first place? I'm still baffled why you think everyone should use vdpa.. Jason
On Wed, Sep 27, 2023 at 08:20:05PM -0300, Jason Gunthorpe wrote: > On Wed, Sep 27, 2023 at 05:38:55PM -0400, Michael S. Tsirkin wrote: > > On Tue, Sep 26, 2023 at 10:50:57AM -0300, Jason Gunthorpe wrote: > > > On Tue, Sep 26, 2023 at 01:42:52AM -0400, Michael S. Tsirkin wrote: > > > > On Mon, Sep 25, 2023 at 09:40:59PM -0300, Jason Gunthorpe wrote: > > > > > On Mon, Sep 25, 2023 at 03:44:11PM -0400, Michael S. Tsirkin wrote: > > > > > > > VDPA is very different from this. You might call them both mediation, > > > > > > > sure, but then you need another word to describe the additional > > > > > > > changes VPDA is doing. > > > > > > > > > > > > Sorry about hijacking the thread a little bit, but could you > > > > > > call out some of the changes that are the most problematic > > > > > > for you? > > > > > > > > > > I don't really know these details. > > > > > > > > Maybe, you then should desist from saying things like "It entirely fails > > > > to achieve the most important thing it needs to do!" You are not making > > > > any new friends with saying this about a piece of software without > > > > knowing the details. > > > > > > I can't tell you what cloud operators are doing, but I can say with > > > confidence that it is not the same as VDPA. As I said, if you want to > > > know more details you need to ask a cloud operator. > > > > So it's not the changes that are problematic, it's that you have > > customers who are not using vdpa. The "most important thing" that vdpa > > fails at is simply converting your customers from vfio to vdpa. > > I said the most important thing was that VFIO presents exactly the > same virtio device to the VM as the baremetal. Do you dispute that, > technically, VDPA does not actually achieve that? I dispute that it is the most important. The important thing is to have guests work. > Then why is it so surprising that people don't want a solution that > changes the vPCI ABI they worked hard to create in the first place? > > I'm still baffled why you think everyone should use vdpa.. > > Jason They shouldn't. If you want proprietary extensions then vfio is the way to go, I don't think vdpa will support that.
On 2023-09-27 p.m. 5:27, Michael S. Tsirkin wrote:
>
> If an existing device exposes a small capability matching the old size, and you then change the size, the check will fail on the existing device and the driver won't load.
>
> All this happens way before the feature bit checks.
>
Will do.

Thanks
Feng
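[Editorial note: a minimal sketch of the capability-size point above, using hypothetical struct and helper names rather than the in-tree virtio-pci code. The idea is that validating the capability length against the newly enlarged structure would refuse existing devices that only expose the old, smaller layout.]

#include <stddef.h>
#include <stdbool.h>
#include <stdint.h>

/* Old layout, as existing devices expose it (fields abbreviated). */
struct cfg_old {
	uint32_t device_feature_select;
	uint32_t device_feature;
};

/* Hypothetical extended layout with new admin-queue fields appended. */
struct cfg_new {
	struct cfg_old old;
	uint16_t admin_queue_index;	/* hypothetical new field */
	uint16_t admin_queue_num;	/* hypothetical new field */
};

/* cap_len is the length the device reports in its PCI capability. */
static bool cfg_capability_usable(size_t cap_len)
{
	/*
	 * Validate against the old, minimal size and treat the appended
	 * fields as optional. Checking cap_len >= sizeof(struct cfg_new)
	 * here would make the driver fail to load on existing devices,
	 * which is the failure mode described in the reply above.
	 */
	return cap_len >= sizeof(struct cfg_old);
}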
On Tue, Sep 26, 2023 at 7:49 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Tue, Sep 26, 2023 at 10:32:39AM +0800, Jason Wang wrote:
> > It's an implementation detail in legacy. The device needs to make sure (reset) the driver can work (is done before get_status returns).
>
> I think that there's no way to make it reliably work for all legacy drivers.

Yes, we may have ancient drivers.

> They just assumed a software backend and did not bother with DMA ordering. You can try to avoid resets; they are not that common, so things will tend to mostly work if you don't stress them too much with things like hot plug/unplug in a loop. Or you can try to use a driver from after 2011, which is more aware of hardware ordering and flushes the reset write with a read. One of these two tricks, I think, is the magic behind the device exposing memory BAR 0 that you mention.

Right, this is what I see for hardware legacy devices shipped by some cloud vendors.

Thanks

> --
> MST
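[Editorial note: a sketch of the "flushes the reset write with a read" pattern referred to above, assuming an illustrative register offset and helper name; this is not the exact in-tree legacy virtio-pci code.]

#include <linux/io.h>

#define LEGACY_DEVICE_STATUS	0x12	/* illustrative: legacy device_status offset */

static void legacy_reset_flushed(void __iomem *ioaddr)
{
	/* Writing 0 to device_status requests a device reset. */
	iowrite8(0, ioaddr + LEGACY_DEVICE_STATUS);

	/*
	 * A posted MMIO write may still be in flight when the driver starts
	 * reinitializing. Reading the register back forces the write to reach
	 * the device first; drivers written after 2011 rely on this read,
	 * while older ones assumed a software backend and skipped it.
	 */
	ioread8(ioaddr + LEGACY_DEVICE_STATUS);
}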
On Tue, Sep 26, 2023 at 03:45:36AM +0000, Parav Pandit wrote:
>
> > From: Michael S. Tsirkin <mst@redhat.com>
> > Sent: Tuesday, September 26, 2023 12:06 AM
>
> > One can thinkably do that wait in hardware, though. Just defer completion until read is done.
>
> Once OASIS does such a new interface, and if some hw vendor _actually_ wants to do such complex hw, maybe the vfio driver can adapt to it.

The reset behaviour I describe is already in the spec. What else do you want OASIS to standardize? Virtio currently is just a register map; it does not yet include suggestions on how exactly PCI Express transactions should look. Do you feel we should add that?
> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Thursday, October 12, 2023 4:23 PM
>
> On Tue, Sep 26, 2023 at 03:45:36AM +0000, Parav Pandit wrote:
> >
> > > From: Michael S. Tsirkin <mst@redhat.com>
> > > Sent: Tuesday, September 26, 2023 12:06 AM
> >
> > > One can thinkably do that wait in hardware, though. Just defer completion until read is done.
> >
> > Once OASIS does such a new interface, and if some hw vendor _actually_ wants to do such complex hw, maybe the vfio driver can adapt to it.
>
> The reset behaviour I describe is already in the spec. What else do you want OASIS to standardize? Virtio currently is just a register map; it does not yet include suggestions on how exactly PCI Express transactions should look. Do you feel we should add that?

The reset behavior in the spec for modern, as listed in [1] and [2], is just fine.

What I meant was in the context of having MMIO-based legacy registers "defer completion until read is done".
I think you meant, "Just defer read completion until reset is done".
This means the hw needs to finish the device reset for thousands of devices within the PCI read completion timeout.
So if and when OASIS does such standardization, someone can implement it.

What I recollect is that OASIS did not standardize such an anti-scale approach and instead took the admin command approach, which achieves better scale.
Hope I clarified. I am not expecting OASIS to do anything extra for legacy registers.

[1] The device MUST reset when 0 is written to device_status, and present a 0 in device_status once that is done.
[2] After writing 0 to device_status, the driver MUST wait for a read of device_status to return 0 before reinitializing the device.
On Thu, Oct 12, 2023 at 11:11:20AM +0000, Parav Pandit wrote:
>
> > From: Michael S. Tsirkin <mst@redhat.com>
> > Sent: Thursday, October 12, 2023 4:23 PM
> >
> > On Tue, Sep 26, 2023 at 03:45:36AM +0000, Parav Pandit wrote:
> > >
> > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > Sent: Tuesday, September 26, 2023 12:06 AM
> > >
> > > > One can thinkably do that wait in hardware, though. Just defer completion until read is done.
> > >
> > > Once OASIS does such a new interface, and if some hw vendor _actually_ wants to do such complex hw, maybe the vfio driver can adapt to it.
> >
> > The reset behaviour I describe is already in the spec. What else do you want OASIS to standardize? Virtio currently is just a register map; it does not yet include suggestions on how exactly PCI Express transactions should look. Do you feel we should add that?
>
> The reset behavior in the spec for modern, as listed in [1] and [2], is just fine.
>
> What I meant was in the context of having MMIO-based legacy registers "defer completion until read is done".
> I think you meant, "Just defer read completion until reset is done".

yes

> This means the hw needs to finish the device reset for thousands of devices within the PCI read completion timeout.

no, each device does its own reset.

> So if and when OASIS does such standardization, someone can implement it.
>
> What I recollect is that OASIS did not standardize such an anti-scale approach and instead took the admin command approach, which achieves better scale.
> Hope I clarified.

You are talking about the extension for trap and emulate. I am instead talking about devices that work with existing legacy linux drivers with no traps.

> I am not expecting OASIS to do anything extra for legacy registers.
>
> [1] The device MUST reset when 0 is written to device_status, and present a 0 in device_status once that is done.
> [2] After writing 0 to device_status, the driver MUST wait for a read of device_status to return 0 before reinitializing the device.

We can add a note explaining that legacy drivers do not wait after doing reset; that is not a problem. If someone wants to make a device that works with existing legacy linux drivers, they can do that. It won't work with all drivers, though, which is why OASIS did not want to standardize this.
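[Editorial note: a sketch of the contrast being discussed, assuming illustrative MMIO helpers rather than the exact in-tree virtio code. Modern drivers implement the spec quote in [2] by polling device_status until it reads back 0; legacy drivers write 0 and move on without waiting.]

#include <linux/io.h>
#include <linux/delay.h>

static void modern_style_reset(void __iomem *device_status)
{
	iowrite8(0, device_status);		/* request reset */

	/* [2]: wait for device_status to read 0 before reinitializing. */
	while (ioread8(device_status) != 0)
		msleep(1);
}

static void legacy_style_reset(void __iomem *device_status)
{
	/*
	 * Legacy drivers assumed a software backend: the write was treated
	 * as instantaneous, so there is no read-back and no wait. A hardware
	 * device therefore has to make reset appear to complete immediately
	 * from the driver's point of view, or rely on the later flush-by-read
	 * that post-2011 drivers perform.
	 */
	iowrite8(0, device_status);
}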
> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Thursday, October 12, 2023 5:00 PM

> I am instead talking about devices that work with existing legacy linux drivers with no traps.
>
Yep, I understood.

> > I am not expecting OASIS to do anything extra for legacy registers.
> >
> > [1] The device MUST reset when 0 is written to device_status, and present a 0 in device_status once that is done.
> > [2] After writing 0 to device_status, the driver MUST wait for a read of device_status to return 0 before reinitializing the device.
>
> We can add a note explaining that legacy drivers do not wait after doing reset; that is not a problem.
> If someone wants to make a device that works with existing legacy linux drivers, they can do that.
> It won't work with all drivers, though, which is why OASIS did not want to standardize this.

Ok, thanks.