Message ID: 158751095889.36773.6009825070990637468.stgit@djiang5-desk3.ch.intel.com
Series: Add VFIO mediated device support and IMS support for the idxd driver.
On Tue, Apr 21, 2020 at 04:33:46PM -0700, Dave Jiang wrote: > The actual code is independent of the stage 2 driver code submission that adds > support for SVM, ENQCMD(S), PASID, and shared workqueues. This code series will > support dedicated workqueue on a guest with no vIOMMU. > > A new device type "mdev" is introduced for the idxd driver. This allows the wq > to be dedicated to the usage of a VFIO mediated device (mdev). Once the work > queue (wq) is enabled, an uuid generated by the user can be added to the wq > through the uuid sysfs attribute for the wq. After the association, a mdev can > be created using this UUID. The mdev driver code will associate the uuid and > setup the mdev on the driver side. When the create operation is successful, the > uuid can be passed to qemu. When the guest boots up, it should discover a DSA > device when doing PCI discovery. I'm feeling really skeptical that adding all this PCI config space and MMIO BAR emulation to the kernel just to cram this into a VFIO interface is a good idea, that kind of stuff is much safer in userspace. Particularly since vfio is not really needed once a driver is using the PASID stuff. We already have general code for drivers to use to attach a PASID to a mm_struct - and using vfio while disabling all the DMA/iommu config really seems like an abuse. A /dev/idxd char dev that mmaps a bar page and links it to a PASID seems a lot simpler and saner kernel wise. > The mdev utilizes Interrupt Message Store or IMS[3] instead of MSIX for > interrupts for the guest. This preserves MSIX for host usages and also allows a > significantly larger number of interrupt vectors for guest usage. I never did get a reply to my earlier remarks on the IMS patches. The concept of a device specific addr/data table format for MSI is not Intel specific. This should be general code. We have a device that can use this kind of kernel capability today. Jason
> From: Jason Gunthorpe > Sent: Wednesday, April 22, 2020 7:55 AM > > On Tue, Apr 21, 2020 at 04:33:46PM -0700, Dave Jiang wrote: > > The actual code is independent of the stage 2 driver code submission that > adds > > support for SVM, ENQCMD(S), PASID, and shared workqueues. This code > series will > > support dedicated workqueue on a guest with no vIOMMU. > > > > A new device type "mdev" is introduced for the idxd driver. This allows the > wq > > to be dedicated to the usage of a VFIO mediated device (mdev). Once the > work > > queue (wq) is enabled, an uuid generated by the user can be added to the > wq > > through the uuid sysfs attribute for the wq. After the association, a mdev > can > > be created using this UUID. The mdev driver code will associate the uuid > and > > setup the mdev on the driver side. When the create operation is successful, > the > > uuid can be passed to qemu. When the guest boots up, it should discover a > DSA > > device when doing PCI discovery. > > I'm feeling really skeptical that adding all this PCI config space and > MMIO BAR emulation to the kernel just to cram this into a VFIO > interface is a good idea, that kind of stuff is much safer in > userspace. > > Particularly since vfio is not really needed once a driver is using > the PASID stuff. We already have general code for drivers to use to > attach a PASID to a mm_struct - and using vfio while disabling all the > DMA/iommu config really seems like an abuse. Well, this series is for virtualizing idxd device to VMs, instead of supporting SVA for bare metal processes. idxd implements a hardware-assisted mediated device technique called Intel Scalable I/O Virtualization, which allows each Assignable Device Interface (ADI, e.g. a work queue) tagged with an unique PASID to ensure fine-grained DMA isolation when those ADIs are assigned to different VMs. For this purpose idxd utilizes the VFIO mdev framework and IOMMU aux-domain extension. Bare metal SVA will be enabled for idxd later by using the general SVA code that you mentioned. Both paths will co-exist in the end so there is no such case of disabling DMA/iommu config. Thanks Kevin > > A /dev/idxd char dev that mmaps a bar page and links it to a PASID > seems a lot simpler and saner kernel wise. > > > The mdev utilizes Interrupt Message Store or IMS[3] instead of MSIX for > > interrupts for the guest. This preserves MSIX for host usages and also > allows a > > significantly larger number of interrupt vectors for guest usage. > > I never did get a reply to my earlier remarks on the IMS patches. > > The concept of a device specific addr/data table format for MSI is not > Intel specific. This should be general code. We have a device that can > use this kind of kernel capability today. > > Jason
On Wed, Apr 22, 2020 at 12:53:25AM +0000, Tian, Kevin wrote: > > From: Jason Gunthorpe > > Sent: Wednesday, April 22, 2020 7:55 AM > > > > On Tue, Apr 21, 2020 at 04:33:46PM -0700, Dave Jiang wrote: > > > The actual code is independent of the stage 2 driver code submission that > > adds > > > support for SVM, ENQCMD(S), PASID, and shared workqueues. This code > > series will > > > support dedicated workqueue on a guest with no vIOMMU. > > > > > > A new device type "mdev" is introduced for the idxd driver. This allows the > > wq > > > to be dedicated to the usage of a VFIO mediated device (mdev). Once the > > work > > > queue (wq) is enabled, an uuid generated by the user can be added to the > > wq > > > through the uuid sysfs attribute for the wq. After the association, a mdev > > can > > > be created using this UUID. The mdev driver code will associate the uuid > > and > > > setup the mdev on the driver side. When the create operation is successful, > > the > > > uuid can be passed to qemu. When the guest boots up, it should discover a > > DSA > > > device when doing PCI discovery. > > > > I'm feeling really skeptical that adding all this PCI config space and > > MMIO BAR emulation to the kernel just to cram this into a VFIO > > interface is a good idea, that kind of stuff is much safer in > > userspace. > > > > Particularly since vfio is not really needed once a driver is using > > the PASID stuff. We already have general code for drivers to use to > > attach a PASID to a mm_struct - and using vfio while disabling all the > > DMA/iommu config really seems like an abuse. > > Well, this series is for virtualizing idxd device to VMs, instead of > supporting SVA for bare metal processes. idxd implements a > hardware-assisted mediated device technique called Intel Scalable > I/O Virtualization, I'm familiar with the intel naming scheme. > which allows each Assignable Device Interface (ADI, e.g. a work > queue) tagged with an unique PASID to ensure fine-grained DMA > isolation when those ADIs are assigned to different VMs. For this > purpose idxd utilizes the VFIO mdev framework and IOMMU aux-domain > extension. Bare metal SVA will be enabled for idxd later by using > the general SVA code that you mentioned. Both paths will co-exist > in the end so there is no such case of disabling DMA/iommu config. Again, if you will have a normal SVA interface, there is no need for a VFIO version, just use normal SVA for both. PCI emulation should try to be in userspace, not the kernel, for security. Jason
Hi Jason > > > > > > I'm feeling really skeptical that adding all this PCI config space and > > > MMIO BAR emulation to the kernel just to cram this into a VFIO > > > interface is a good idea, that kind of stuff is much safer in > > > userspace. > > > > > > Particularly since vfio is not really needed once a driver is using > > > the PASID stuff. We already have general code for drivers to use to > > > attach a PASID to a mm_struct - and using vfio while disabling all the > > > DMA/iommu config really seems like an abuse. > > > > Well, this series is for virtualizing idxd device to VMs, instead of > > supporting SVA for bare metal processes. idxd implements a > > hardware-assisted mediated device technique called Intel Scalable > > I/O Virtualization, > > I'm familiar with the intel naming scheme. > > > which allows each Assignable Device Interface (ADI, e.g. a work > > queue) tagged with an unique PASID to ensure fine-grained DMA > > isolation when those ADIs are assigned to different VMs. For this > > purpose idxd utilizes the VFIO mdev framework and IOMMU aux-domain > > extension. Bare metal SVA will be enabled for idxd later by using > > the general SVA code that you mentioned. Both paths will co-exist > > in the end so there is no such case of disabling DMA/iommu config. > > Again, if you will have a normal SVA interface, there is no need for a > VFIO version, just use normal SVA for both. > > PCI emulation should try to be in userspace, not the kernel, for > security. Not sure we completely understand your proposal. Mediated devices are software constructed and they have protected resources like interrupts and stuff and VFIO already provides abstractions to export to user space. Native SVA is simply passing the process CR3 handle to IOMMU so IOMMU knows how to walk process page tables, kernel handles things like page-faults, doing device tlb invalidations and such. That by itself doesn't translate to what a guest typically does with a VDEV. There are other control paths that need to be serviced from the kernel code via VFIO. For speed path operations like ringing doorbells and such they are directly managed from guest. How do you propose to use the existing SVA api's to also provide full device emulation as opposed to using an existing infrastructure that's already in place? Perhaps Alex can ease Jason's concerns? Cheers, Ashok
On Tue, Apr 21, 2020 at 4:55 PM Jason Gunthorpe <jgg@mellanox.com> wrote: > > On Tue, Apr 21, 2020 at 04:33:46PM -0700, Dave Jiang wrote: > > The actual code is independent of the stage 2 driver code submission that adds > > support for SVM, ENQCMD(S), PASID, and shared workqueues. This code series will > > support dedicated workqueue on a guest with no vIOMMU. > > > > A new device type "mdev" is introduced for the idxd driver. This allows the wq > > to be dedicated to the usage of a VFIO mediated device (mdev). Once the work > > queue (wq) is enabled, an uuid generated by the user can be added to the wq > > through the uuid sysfs attribute for the wq. After the association, a mdev can > > be created using this UUID. The mdev driver code will associate the uuid and > > setup the mdev on the driver side. When the create operation is successful, the > > uuid can be passed to qemu. When the guest boots up, it should discover a DSA > > device when doing PCI discovery. > > I'm feeling really skeptical that adding all this PCI config space and > MMIO BAR emulation to the kernel just to cram this into a VFIO > interface is a good idea, that kind of stuff is much safer in > userspace. > > Particularly since vfio is not really needed once a driver is using > the PASID stuff. We already have general code for drivers to use to > attach a PASID to a mm_struct - and using vfio while disabling all the > DMA/iommu config really seems like an abuse. > > A /dev/idxd char dev that mmaps a bar page and links it to a PASID > seems a lot simpler and saner kernel wise. > > > The mdev utilizes Interrupt Message Store or IMS[3] instead of MSIX for > > interrupts for the guest. This preserves MSIX for host usages and also allows a > > significantly larger number of interrupt vectors for guest usage. > > I never did get a reply to my earlier remarks on the IMS patches. > > The concept of a device specific addr/data table format for MSI is not > Intel specific. This should be general code. We have a device that can > use this kind of kernel capability today. This has been my concern reviewing the implementation. IMS needs more than one in-tree user to validate degrees of freedom in the api. I had been missing a second "in-tree user" to validate the scope of the flexibility that was needed.
On 4/21/2020 4:54 PM, Jason Gunthorpe wrote: > On Tue, Apr 21, 2020 at 04:33:46PM -0700, Dave Jiang wrote: >> The actual code is independent of the stage 2 driver code submission that adds >> support for SVM, ENQCMD(S), PASID, and shared workqueues. This code series will >> support dedicated workqueue on a guest with no vIOMMU. >> >> A new device type "mdev" is introduced for the idxd driver. This allows the wq >> to be dedicated to the usage of a VFIO mediated device (mdev). Once the work >> queue (wq) is enabled, an uuid generated by the user can be added to the wq >> through the uuid sysfs attribute for the wq. After the association, a mdev can >> be created using this UUID. The mdev driver code will associate the uuid and >> setup the mdev on the driver side. When the create operation is successful, the >> uuid can be passed to qemu. When the guest boots up, it should discover a DSA >> device when doing PCI discovery. > > I'm feeling really skeptical that adding all this PCI config space and > MMIO BAR emulation to the kernel just to cram this into a VFIO > interface is a good idea, that kind of stuff is much safer in > userspace. > > Particularly since vfio is not really needed once a driver is using > the PASID stuff. We already have general code for drivers to use to > attach a PASID to a mm_struct - and using vfio while disabling all the > DMA/iommu config really seems like an abuse. > > A /dev/idxd char dev that mmaps a bar page and links it to a PASID > seems a lot simpler and saner kernel wise. > >> The mdev utilizes Interrupt Message Store or IMS[3] instead of MSIX for >> interrupts for the guest. This preserves MSIX for host usages and also allows a >> significantly larger number of interrupt vectors for guest usage. > > I never did get a reply to my earlier remarks on the IMS patches. > > The concept of a device specific addr/data table format for MSI is not > Intel specific. This should be general code. We have a device that can > use this kind of kernel capability today. > <resending to the mailing list, I had incorrect email options set> Hi Jason, I am sorry if I did not address your comments earlier. The present IMS code is quite generic, most of the code is in the drivers/ folder. We basically introduce 2 APIS: allocate and free IMS interrupts and a IMS IRQ domain to allocate these interrupts from. These APIs are architecture agnostic. We also introduce a new IMS IRQ domain which is architecture specific. This is because IMS generates interrupts only in the remappable format, hence interrupt remapping should be enabled for IMS. Currently, the interrupt remapping code is only available for Intel and AMD and I don’t see anything for ARM. If a new architecture would want to use IMS, they must simply introduce a new IMS IRQ domain. I am not sure if there is any other way around this. If you have any ideas, please let me know. Also, could you give more details on the device that could use IMS? Do you have some driver code already? We could then see if and how the current IMS code could be made more generic. > Jason >
On Wed, Apr 22, 2020 at 02:14:36PM -0700, Raj, Ashok wrote: > Hi Jason > > > > > > > > > I'm feeling really skeptical that adding all this PCI config space and > > > > MMIO BAR emulation to the kernel just to cram this into a VFIO > > > > interface is a good idea, that kind of stuff is much safer in > > > > userspace. > > > > > > > > Particularly since vfio is not really needed once a driver is using > > > > the PASID stuff. We already have general code for drivers to use to > > > > attach a PASID to a mm_struct - and using vfio while disabling all the > > > > DMA/iommu config really seems like an abuse. > > > > > > Well, this series is for virtualizing idxd device to VMs, instead of > > > supporting SVA for bare metal processes. idxd implements a > > > hardware-assisted mediated device technique called Intel Scalable > > > I/O Virtualization, > > > > I'm familiar with the intel naming scheme. > > > > > which allows each Assignable Device Interface (ADI, e.g. a work > > > queue) tagged with an unique PASID to ensure fine-grained DMA > > > isolation when those ADIs are assigned to different VMs. For this > > > purpose idxd utilizes the VFIO mdev framework and IOMMU aux-domain > > > extension. Bare metal SVA will be enabled for idxd later by using > > > the general SVA code that you mentioned. Both paths will co-exist > > > in the end so there is no such case of disabling DMA/iommu config. > > > > Again, if you will have a normal SVA interface, there is no need for a > > VFIO version, just use normal SVA for both. > > > > PCI emulation should try to be in userspace, not the kernel, for > > security. > > Not sure we completely understand your proposal. Mediated devices > are software constructed and they have protected resources like > interrupts and stuff and VFIO already provids abstractions to export > to user space. > > Native SVA is simply passing the process CR3 handle to IOMMU so > IOMMU knows how to walk process page tables, kernel handles things > like page-faults, doing device tlb invalidations and such. > That by itself doesn't translate to what a guest typically does > with a VDEV. There are other control paths that need to be serviced > from the kernel code via VFIO. For speed path operations like > ringing doorbells and such they are directly managed from guest. You don't need vfio to mmap BAR pages to userspace. The unique thing that vfio gives is it provides a way to program the classic non-PASID iommu, which you are not using here. > How do you propose to use the existing SVA api's to also provide > full device emulation as opposed to using an existing infrastructure > that's already in place? You'd provide the 'full device emulation' in userspace (eg qemu), along side all the other device emulation. Device emulation does not belong in the kernel without a very good reason. You get the doorbell BAR page from your own char dev You setup a PASID IOMMU configuration over your own char dev Interrupt delivery is triggering a generic event fd What is VFIO needed for? > Perhaps Alex can ease Jason's concerns? Last we talked Alex also had doubts on what mdev should be used for. It is a feature that seems to lack boundaries, and I'll note that when the discussion came up for VDPA, they eventually choose not to use VFIO. Jason
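To make the shape of that alternative concrete, here is an entirely hypothetical userspace sketch of the char-dev flow Jason describes (doorbell page via mmap, PASID bound to the process address space, completions over an eventfd). The device node and both ioctls are invented purely for illustration; no such interface exists in any posted patch.

/*
 * Hypothetical sketch only: a driver-specific char dev instead of a
 * VFIO mdev.  The node path, IDXD_IOC_BIND_PASID and
 * IDXD_IOC_SET_EVENTFD are made-up names illustrating the proposal.
 */
#include <fcntl.h>
#include <stdio.h>
#include <linux/ioctl.h>
#include <sys/eventfd.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>

#define IDXD_IOC_BIND_PASID   _IO('W', 0)        /* hypothetical: bind current mm, return PASID */
#define IDXD_IOC_SET_EVENTFD  _IOW('W', 1, int)  /* hypothetical: route completions to eventfd */

int main(void)
{
	int fd = open("/dev/idxd/wq0.0", O_RDWR);   /* hypothetical node */
	int efd = eventfd(0, 0);
	int pasid;
	void *portal;

	if (fd < 0 || efd < 0)
		return 1;

	pasid = ioctl(fd, IDXD_IOC_BIND_PASID);     /* SVA bind: PASID -> this mm */
	ioctl(fd, IDXD_IOC_SET_EVENTFD, &efd);

	/* map the work queue portal (doorbell) page directly, no VFIO involved */
	portal = mmap(NULL, 4096, PROT_WRITE, MAP_SHARED, fd, 0);
	if (portal == MAP_FAILED)
		return 1;

	/* ... build descriptors tagged with 'pasid', write them through
	 * 'portal', then block on read(efd, ...) for completions ... */
	printf("bound PASID %d\n", pasid);
	return 0;
}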
On Wed, Apr 22, 2020 at 2:24 PM Dan Williams <dan.j.williams@intel.com> wrote: > > On Tue, Apr 21, 2020 at 4:55 PM Jason Gunthorpe <jgg@mellanox.com> wrote: > > > > On Tue, Apr 21, 2020 at 04:33:46PM -0700, Dave Jiang wrote: > > > The actual code is independent of the stage 2 driver code submission that adds > > > support for SVM, ENQCMD(S), PASID, and shared workqueues. This code series will > > > support dedicated workqueue on a guest with no vIOMMU. > > > > > > A new device type "mdev" is introduced for the idxd driver. This allows the wq > > > to be dedicated to the usage of a VFIO mediated device (mdev). Once the work > > > queue (wq) is enabled, an uuid generated by the user can be added to the wq > > > through the uuid sysfs attribute for the wq. After the association, a mdev can > > > be created using this UUID. The mdev driver code will associate the uuid and > > > setup the mdev on the driver side. When the create operation is successful, the > > > uuid can be passed to qemu. When the guest boots up, it should discover a DSA > > > device when doing PCI discovery. > > > > I'm feeling really skeptical that adding all this PCI config space and > > MMIO BAR emulation to the kernel just to cram this into a VFIO > > interface is a good idea, that kind of stuff is much safer in > > userspace. > > > > Particularly since vfio is not really needed once a driver is using > > the PASID stuff. We already have general code for drivers to use to > > attach a PASID to a mm_struct - and using vfio while disabling all the > > DMA/iommu config really seems like an abuse. > > > > A /dev/idxd char dev that mmaps a bar page and links it to a PASID > > seems a lot simpler and saner kernel wise. > > > > > The mdev utilizes Interrupt Message Store or IMS[3] instead of MSIX for > > > interrupts for the guest. This preserves MSIX for host usages and also allows a > > > significantly larger number of interrupt vectors for guest usage. > > > > I never did get a reply to my earlier remarks on the IMS patches. > > > > The concept of a device specific addr/data table format for MSI is not > > Intel specific. This should be general code. We have a device that can > > use this kind of kernel capability today. > > This has been my concern reviewing the implementation. IMS needs more > than one in-tree user to validate degrees of freedom in the api. I had > been missing a second "in-tree user" to validate the scope of the > flexibility that was needed. Hey Jason, Per Megha's follow-up can you send the details about that other device and help clear a path for a device-specific MSI addr/data table format. Ever since HMM I've been sensitive, perhaps overly-sensitive, to claims about future upstream users. The fact that you have an additional use case is golden for pushing this into a common area and validating the scope of the proposed API.
On Wed, Apr 22, 2020 at 02:24:11PM -0700, Dan Williams wrote: > On Tue, Apr 21, 2020 at 4:55 PM Jason Gunthorpe <jgg@mellanox.com> wrote: > > > > On Tue, Apr 21, 2020 at 04:33:46PM -0700, Dave Jiang wrote: > > > The actual code is independent of the stage 2 driver code submission that adds > > > support for SVM, ENQCMD(S), PASID, and shared workqueues. This code series will > > > support dedicated workqueue on a guest with no vIOMMU. > > > > > > A new device type "mdev" is introduced for the idxd driver. This allows the wq > > > to be dedicated to the usage of a VFIO mediated device (mdev). Once the work > > > queue (wq) is enabled, an uuid generated by the user can be added to the wq > > > through the uuid sysfs attribute for the wq. After the association, a mdev can > > > be created using this UUID. The mdev driver code will associate the uuid and > > > setup the mdev on the driver side. When the create operation is successful, the > > > uuid can be passed to qemu. When the guest boots up, it should discover a DSA > > > device when doing PCI discovery. > > > > I'm feeling really skeptical that adding all this PCI config space and > > MMIO BAR emulation to the kernel just to cram this into a VFIO > > interface is a good idea, that kind of stuff is much safer in > > userspace. > > > > Particularly since vfio is not really needed once a driver is using > > the PASID stuff. We already have general code for drivers to use to > > attach a PASID to a mm_struct - and using vfio while disabling all the > > DMA/iommu config really seems like an abuse. > > > > A /dev/idxd char dev that mmaps a bar page and links it to a PASID > > seems a lot simpler and saner kernel wise. > > > > > The mdev utilizes Interrupt Message Store or IMS[3] instead of MSIX for > > > interrupts for the guest. This preserves MSIX for host usages and also allows a > > > significantly larger number of interrupt vectors for guest usage. > > > > I never did get a reply to my earlier remarks on the IMS patches. > > > > The concept of a device specific addr/data table format for MSI is not > > Intel specific. This should be general code. We have a device that can > > use this kind of kernel capability today. > > This has been my concern reviewing the implementation. IMS needs more > than one in-tree user to validate degrees of freedom in the api. I had > been missing a second "in-tree user" to validate the scope of the > flexibility that was needed. IMS is too narrowly specified. All platforms that support MSI today can support IMS. It is simply a way for the platform to give the driver an addr/data pair that triggers an interrupt when a posted write is performed to that pair. This is different from the other interrupt setup flows which are tightly tied to the PCI layer. Here the driver should simply ask for interrupts. Ie the entire IMS API to the driver should be something very simple like: struct message_irq { uint64_t addr; uint32_t data; }; struct message_irq *request_message_irq( struct device *, irq_handler_t handler, unsigned long flags, const char *name, void *dev); And the plumbing underneath should setup the irq chips and so forth as required. Jason
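For reference, a minimal sketch of how a driver might consume the interface proposed above. request_message_irq() is only the proposal from the mail, not an in-tree function, and an ERR_PTR-style return is assumed; struct mydev and the MYDEV_IMS_* register offsets are invented names for whatever device-specific interrupt storage the hardware provides.

/*
 * Sketch only.  request_message_irq() is the proposed (non-existent)
 * API quoted above; struct mydev and MYDEV_IMS_* are hypothetical.
 */
#include <linux/device.h>
#include <linux/err.h>
#include <linux/interrupt.h>
#include <linux/io.h>
#include <linux/types.h>

struct message_irq {
	uint64_t addr;
	uint32_t data;
};

/* proposed interface, assumed to return an ERR_PTR on failure */
struct message_irq *request_message_irq(struct device *dev,
					 irq_handler_t handler,
					 unsigned long flags,
					 const char *name, void *cookie);

struct mydev {
	struct device *dev;
	void __iomem *mmio;
};

/* hypothetical per-queue slots in the device's own interrupt table */
#define MYDEV_IMS_ADDR(i)	(0x1000 + (i) * 0x10)
#define MYDEV_IMS_DATA(i)	(0x1008 + (i) * 0x10)

static irqreturn_t mydev_wq_irq(int irq, void *data)
{
	/* handle completions for this work queue */
	return IRQ_HANDLED;
}

static int mydev_setup_wq_irq(struct mydev *md, int wq_idx)
{
	struct message_irq *mirq;

	mirq = request_message_irq(md->dev, mydev_wq_irq, 0, "mydev-wq", md);
	if (IS_ERR(mirq))
		return PTR_ERR(mirq);

	/* program the addr/data pair into the device-specific storage; a
	 * posted write of 'data' to 'addr' now invokes mydev_wq_irq() */
	writeq(mirq->addr, md->mmio + MYDEV_IMS_ADDR(wq_idx));
	writel(mirq->data, md->mmio + MYDEV_IMS_DATA(wq_idx));
	return 0;
}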
> > > The mdev utilizes Interrupt Message Store or IMS[3] instead of MSIX for > > > interrupts for the guest. This preserves MSIX for host usages and also allows a > > > significantly larger number of interrupt vectors for guest usage. > > > > I never did get a reply to my earlier remarks on the IMS patches. > > > > The concept of a device specific addr/data table format for MSI is not > > Intel specific. This should be general code. We have a device that can > > use this kind of kernel capability today. > > I am sorry if I did not address your comments earlier. It appears nobody from Intel bothered to answer anyone else on that RFC thread: https://lore.kernel.org/lkml/1568338328-22458-1-git-send-email-megha.dey@linux.intel.com/ However, it seems kind of moot as I see now that this version of IMS bears almost no resemblance to the original RFC. That said, the similarity to platform-msi was striking, does this new version harmonize with that? > The present IMS code is quite generic, most of the code is in the drivers/ > folder. We basically introduce 2 APIS: allocate and free IMS interrupts and > a IMS IRQ domain to allocate these interrupts from. These APIs are > architecture agnostic. > > We also introduce a new IMS IRQ domain which is architecture specific. This > is because IMS generates interrupts only in the remappable format, hence > interrupt remapping should be enabled for IMS. Currently, the interrupt > remapping code is only available for Intel and AMD and I don’t see anything > for ARM. I don't understand these remarks though - IMS is simply the mapping of a MemWr addr/data pair to a Linux IRQ number? Why does this intersect with remapping? AFAIK, any platform that supports MSI today should have the inherent HW capability to support IMS. > Also, could you give more details on the device that could use IMS? Do you > have some driver code already? We could then see if and how the current IMS > code could be made more generic. We have several devices of interest, our NICs have very flexible PCI, so it is no problem to take the MemWr addr/data from someplace other than the MSI tables. For this we want to have some way to allocate Linux IRQs dynamically and get an addr/data pair to trigger them. Our NIC devices are also linked to our ARM SOC family, so I'd expect our ARM's to also be able to provide these APIs as the platform. Jason
On Thu, Apr 23, 2020 at 12:17:50PM -0700, Dan Williams wrote: > Per Megha's follow-up can you send the details about that other device > and help clear a path for a device-specific MSI addr/data table > format. Ever since HMM I've been sensitive, perhaps overly-sensitive, > to claims about future upstream users. The fact that you have an > additional use case is golden for pushing this into a common area and > validating the scope of the proposed API. I think I said it at plumbers, but yes, we are interested in this, and would like dynamic MSI-like interrupts available to the driver (what Intel calls IMS) It is something easy enough to illustrate with any RDMA device really, just open a MR against the addr and use RDMA_WRITE to trigger the data. It should trigger a Linux IRQ. Nothing else should be needed. Jason
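The RDMA illustration above can be made concrete with libibverbs: RDMA-write the 32-bit "data" word to the "addr" the kernel handed out for the IRQ. This is a rough, untested sketch; it assumes a connected QP/PD already exist and that remote_addr/rkey describe the interrupt page on the target, and it leaves the MR registered for the life of the process.

/* Sketch: trigger a message-based interrupt with an RDMA WRITE. */
#include <stdint.h>
#include <infiniband/verbs.h>

static int ring_message_irq(struct ibv_qp *qp, struct ibv_pd *pd,
			    uint64_t remote_addr, uint32_t rkey,
			    uint32_t irq_data)
{
	/* static so the buffer outlives the posted WR in this sketch */
	static uint32_t payload;
	struct ibv_mr *mr;
	struct ibv_send_wr *bad = NULL;

	payload = irq_data;
	mr = ibv_reg_mr(pd, &payload, sizeof(payload), IBV_ACCESS_LOCAL_WRITE);
	if (!mr)
		return -1;

	struct ibv_sge sge = {
		.addr   = (uintptr_t)&payload,
		.length = sizeof(payload),
		.lkey   = mr->lkey,
	};
	struct ibv_send_wr wr = {
		.opcode              = IBV_WR_RDMA_WRITE,
		.sg_list             = &sge,
		.num_sge             = 1,
		.send_flags          = IBV_SEND_SIGNALED,
		.wr.rdma.remote_addr = remote_addr,  /* the IRQ's addr */
		.wr.rdma.rkey        = rkey,
	};

	/* the posted write of 'data' to 'addr' should raise the Linux IRQ */
	return ibv_post_send(qp, &wr, &bad);
}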
> From: Jason Gunthorpe <jgg@mellanox.com> > Sent: Friday, April 24, 2020 3:12 AM > > On Wed, Apr 22, 2020 at 02:14:36PM -0700, Raj, Ashok wrote: > > Hi Jason > > > > > > > > > > > > I'm feeling really skeptical that adding all this PCI config space and > > > > > MMIO BAR emulation to the kernel just to cram this into a VFIO > > > > > interface is a good idea, that kind of stuff is much safer in > > > > > userspace. > > > > > > > > > > Particularly since vfio is not really needed once a driver is using > > > > > the PASID stuff. We already have general code for drivers to use to > > > > > attach a PASID to a mm_struct - and using vfio while disabling all the > > > > > DMA/iommu config really seems like an abuse. > > > > > > > > Well, this series is for virtualizing idxd device to VMs, instead of > > > > supporting SVA for bare metal processes. idxd implements a > > > > hardware-assisted mediated device technique called Intel Scalable > > > > I/O Virtualization, > > > > > > I'm familiar with the intel naming scheme. > > > > > > > which allows each Assignable Device Interface (ADI, e.g. a work > > > > queue) tagged with an unique PASID to ensure fine-grained DMA > > > > isolation when those ADIs are assigned to different VMs. For this > > > > purpose idxd utilizes the VFIO mdev framework and IOMMU aux- > domain > > > > extension. Bare metal SVA will be enabled for idxd later by using > > > > the general SVA code that you mentioned. Both paths will co-exist > > > > in the end so there is no such case of disabling DMA/iommu config. > > > > > > Again, if you will have a normal SVA interface, there is no need for a > > > VFIO version, just use normal SVA for both. > > > > > > PCI emulation should try to be in userspace, not the kernel, for > > > security. > > > > Not sure we completely understand your proposal. Mediated devices > > are software constructed and they have protected resources like > > interrupts and stuff and VFIO already provids abstractions to export > > to user space. > > > > Native SVA is simply passing the process CR3 handle to IOMMU so > > IOMMU knows how to walk process page tables, kernel handles things > > like page-faults, doing device tlb invalidations and such. > > > That by itself doesn't translate to what a guest typically does > > with a VDEV. There are other control paths that need to be serviced > > from the kernel code via VFIO. For speed path operations like > > ringing doorbells and such they are directly managed from guest. > > You don't need vfio to mmap BAR pages to userspace. The unique thing > that vfio gives is it provides a way to program the classic non-PASID > iommu, which you are not using here. That unique thing is indeed used here. Please note sharing CPU virtual address space with device (what SVA API is invented for) is not the purpose of this series. We still rely on classic non-PASID iommu programming, i.e. mapping/unmapping IOVA->HPA per iommu_domain. Although we do use PASID to tag ADI, the PASID is contained within iommu_domain and invisible to VFIO. From userspace p.o.v, this is a device passthrough usage instead of PASID-based address space binding. > > > How do you propose to use the existing SVA api's to also provide > > full device emulation as opposed to using an existing infrastructure > > that's already in place? > > You'd provide the 'full device emulation' in userspace (eg qemu), > along side all the other device emulation. Device emulation does not > belong in the kernel without a very good reason. 
The problem is that we are not doing full device emulation. It's based on mediated passthrough. Some emulation logic requires close engagement with the kernel device driver, e.g. resource allocation, WQ configuration, fault report, etc., while the detailed interface is very vendor/device specific (just like between PF and VF). idxd is just the first device that supports Scalable IOV. We have a lot more coming later, in different types. Then putting such emulation in user space means that Qemu needs to support all those vendor specific interfaces for every new device which supports Scalable IOV. This is in contrast to our goal of using Scalable IOV as an alternative to SR-IOV. For SR-IOV, Qemu only needs to support one VFIO API then any VF type simply works. We want to sustain the same user experience through VFIO mdev. Specifically for PCI config space emulation, now it's already done in multiple kernel places, e.g. vfio-pci, kvmgt, etc. We do plan to consolidate them later. > > You get the doorbell BAR page from your own char dev > > You setup a PASID IOMMU configuration over your own char dev > > Interrupt delivery is triggering a generic event fd > > What is VFIO needed for? Based on the above explanation, VFIO mdev already meets all of our requirements then why bother inventing a new one... > > > Perhaps Alex can ease Jason's concerns? > > Last we talked Alex also had doubts on what mdev should be used > for. It is a feature that seems to lack boundaries, and I'll note that > when the discussion came up for VDPA, they eventually choose not to > use VFIO. > Is there a link to Alex's doubt? I'm not sure why vDPA didn't go for VFIO, but imho it is a different story. vDPA is specifically for devices which implement standard vhost/virtio interface, thus it's reasonable that inventing a new mechanism might be more efficient for all vDPA type devices. However Scalable IOV is similar to SR-IOV, only for resource partitioning. It doesn't change the device programming interface, which could be in any vendor specific form. Here VFIO mdev is good for providing a unified interface for managing resource multiplexing of all such devices. Thanks Kevin
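The "IOMMU aux-domain extension" and PASID tagging described in this exchange map roughly onto the kernel interfaces of that era as in the simplified sketch below. It is an illustration only: struct idxd_wq and idxd_wq_set_pasid() are placeholders for the parent driver's own structures, and the aux-domain signatures (iommu_dev_enable_feature(), iommu_aux_attach_device(), iommu_aux_get_pasid()) should be checked against the tree in use.

/*
 * Simplified sketch of aux-domain/PASID tagging for an ADI (a dedicated
 * WQ).  idxd_wq_set_pasid() is a placeholder, not a real helper.
 */
#include <linux/device.h>
#include <linux/iommu.h>

struct idxd_wq;						/* driver-private */
void idxd_wq_set_pasid(struct idxd_wq *wq, int pasid);	/* placeholder */

static int idxd_mdev_attach_pasid(struct device *parent_dev,
				  struct iommu_domain *domain,
				  struct idxd_wq *wq)
{
	int rc, pasid;

	/* opt the parent PCI device into aux-domain (Scalable IOV) mode */
	rc = iommu_dev_enable_feature(parent_dev, IOMMU_DEV_FEAT_AUX);
	if (rc)
		return rc;

	/* attach the guest's IOVA->HPA domain as an aux domain; the IOMMU
	 * driver allocates a PASID for it */
	rc = iommu_aux_attach_device(domain, parent_dev);
	if (rc)
		return rc;

	pasid = iommu_aux_get_pasid(domain, parent_dev);
	if (pasid < 0)
		return pasid;

	/* tag the dedicated WQ (the ADI) so its DMA carries this PASID and
	 * is translated through the aux domain's mappings */
	idxd_wq_set_pasid(wq, pasid);
	return 0;
}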
On 2020/4/22 7:54 AM, Jason Gunthorpe wrote: >> The mdev utilizes Interrupt Message Store or IMS[3] instead of MSIX for >> interrupts for the guest. This preserves MSIX for host usages and also allows a >> significantly larger number of interrupt vectors for guest usage. > I never did get a reply to my earlier remarks on the IMS patches. > > The concept of a device specific addr/data table format for MSI is not > Intel specific. This should be general code. We have a device that can > use this kind of kernel capability today. > > Jason > +1. Another example is to extend virtio MMIO to support MSI[1]. Thanks [1] https://lkml.org/lkml/2020/2/10/127
On Fri, Apr 24, 2020 at 03:27:41AM +0000, Tian, Kevin wrote: > > > That by itself doesn't translate to what a guest typically does > > > with a VDEV. There are other control paths that need to be serviced > > > from the kernel code via VFIO. For speed path operations like > > > ringing doorbells and such they are directly managed from guest. > > > > You don't need vfio to mmap BAR pages to userspace. The unique thing > > that vfio gives is it provides a way to program the classic non-PASID > > iommu, which you are not using here. > > That unique thing is indeed used here. Please note sharing CPU virtual > address space with device (what SVA API is invented for) is not the > purpose of this series. We still rely on classic non-PASID iommu programming, > i.e. mapping/unmapping IOVA->HPA per iommu_domain. Although > we do use PASID to tag ADI, the PASID is contained within iommu_domain > and invisible to VFIO. From userspace p.o.v, this is a device passthrough > usage instead of PASID-based address space binding. So you have PASID support but don't use it? Why? PASID is much better than classic VFIO iommu, it doesn't require page pinning... > > > How do you propose to use the existing SVA api's to also provide > > > full device emulation as opposed to using an existing infrastructure > > > that's already in place? > > > > You'd provide the 'full device emulation' in userspace (eg qemu), > > along side all the other device emulation. Device emulation does not > > belong in the kernel without a very good reason. > > The problem is that we are not doing full device emulation. It's based > on mediated passthrough. Some emulation logic requires close > engagement with kernel device driver, e.g. resource allocation, WQ > configuration, fault report, etc., while the detail interface is very vendor/ > device specific (just like between PF and VF). Which sounds like the fairly classic case of device emulation to me. > idxd is just the first device that supports Scalable IOV. We have a > lot more coming later, in different types. Then putting such > emulation in user space means that Qemu needs to support all those > vendor specific interfaces for every new device which supports It would be very sad to see an endless amount of device emulation code crammed into the kernel. Userspace is where device emulation is supposed to live. For security qemu is the right place to put this stuff. > > > Perhaps Alex can ease Jason's concerns? > > > > Last we talked Alex also had doubts on what mdev should be used > > for. It is a feature that seems to lack boundaries, and I'll note that > > when the discussion came up for VDPA, they eventually choose not to > > use VFIO. > > > > Is there a link to Alex's doubt? I'm not sure why vDPA didn't go > for VFIO, but imho it is a different story. No, not at all. VDPA HW today is using what Intel has been calling ADI. But qemu already had the device emulation part in userspace, (all of the virtio emulation parts are in userspace) so they didn't try to put it in the kernel. This is the pattern. User space is supposed to do the emulation parts, the kernel provides the raw elements to manage queues/etc - and it is not done through mdev. > efficient for all vDPA type devices. However Scalable IOV is > similar to SR-IOV, only for resource partitioning. It doesn't change > the device programming interface, which could be in any vendor > specific form. Here VFIO mdev is good for providing an unified > interface for managing resource multiplexing of all such devices. 
SIOV doesn't have a HW config space, and for some reason in these patches there is BAR emulation too. So, no, it is not like SR-IOV at all. This is more like classic device emulation, presumably with some fast path for the data plane. ie just like VDPA :) Jason
> From: Jason Gunthorpe > Sent: Friday, April 24, 2020 8:45 PM > > On Fri, Apr 24, 2020 at 03:27:41AM +0000, Tian, Kevin wrote: > > > > > That by itself doesn't translate to what a guest typically does > > > > with a VDEV. There are other control paths that need to be serviced > > > > from the kernel code via VFIO. For speed path operations like > > > > ringing doorbells and such they are directly managed from guest. > > > > > > You don't need vfio to mmap BAR pages to userspace. The unique thing > > > that vfio gives is it provides a way to program the classic non-PASID > > > iommu, which you are not using here. > > > > That unique thing is indeed used here. Please note sharing CPU virtual > > address space with device (what SVA API is invented for) is not the > > purpose of this series. We still rely on classic non-PASID iommu > programming, > > i.e. mapping/unmapping IOVA->HPA per iommu_domain. Although > > we do use PASID to tag ADI, the PASID is contained within iommu_domain > > and invisible to VFIO. From userspace p.o.v, this is a device passthrough > > usage instead of PASID-based address space binding. > > So you have PASID support but don't use it? Why? PASID is much better > than classic VFIO iommu, it doesn't require page pinning... PASID and I/O page fault (through ATS/PRI) are orthogonal things. Don't draw the equation between them. The host driver can tag PASID to ADI so every DMA request out of that ADI has a PASID prefix, allowing VT-d to do PASID-granular DMA isolation. However I/O page fault cannot be taken for granted. A scalable IOV device may support PASID while without ATS/PRI. Even when ATS/PRI is supported, the tolerance of I/O page fault is decided by the work queue mode that is configured by the guest. For example, if the guest put the work queue in non-faultable transaction mode, the device doesn't do PRI and simply report error if no valid IOMMU mapping. So in this series we support only the basic form for non-faultable transactions, using the classic VFIO iommu interface plus PASID-granular translation. We are working on virtual SVA support in parallel. Once that feature is ready, then I/O page fault could be CONDITIONALLY enabled according to guest vIOMMU setting, e.g. when virtual context entry has page request enabled then we enable nested translation in the physical PASID entry, with 1st level linking to guest page table (GVA->GPA) and 2nd-level carrying (GPA->HPA). > > > > > How do you propose to use the existing SVA api's to also provide > > > > full device emulation as opposed to using an existing infrastructure > > > > that's already in place? > > > > > > You'd provide the 'full device emulation' in userspace (eg qemu), > > > along side all the other device emulation. Device emulation does not > > > belong in the kernel without a very good reason. > > > > The problem is that we are not doing full device emulation. It's based > > on mediated passthrough. Some emulation logic requires close > > engagement with kernel device driver, e.g. resource allocation, WQ > > configuration, fault report, etc., while the detail interface is very vendor/ > > device specific (just like between PF and VF). > > Which sounds like the fairly classic case of device emulation to me. > > > idxd is just the first device that supports Scalable IOV. We have a > > lot more coming later, in different types. 
Then putting such > > emulation in user space means that Qemu needs to support all those > > vendor specific interfaces for every new device which supports > > It would be very sad to see an endless amount of device emulation code > crammed into the kernel. Userspace is where device emulation is > supposed to live. For security I think providing an unified abstraction to userspace is also important, which is what VFIO provides today. The merit of using one set of VFIO API to manage all kinds of mediated devices and VF devices is a major gain. Instead, inventing a new vDPA-like interface for every Scalable-IOV or equivalent device is just overkill and doesn't scale. Also the actual emulation code in idxd driver is actually small, if putting aside the PCI config space part for which I already explained most logic could be shared between mdev device drivers. > > qemu is the right place to put this stuff. > > > > > Perhaps Alex can ease Jason's concerns? > > > > > > Last we talked Alex also had doubts on what mdev should be used > > > for. It is a feature that seems to lack boundaries, and I'll note that > > > when the discussion came up for VDPA, they eventually choose not to > > > use VFIO. > > > > > > > Is there a link to Alex's doubt? I'm not sure why vDPA didn't go > > for VFIO, but imho it is a different story. > > No, not at all. VDPA HW today is using what Intel has been calling > ADI. But qemu already had the device emulation part in userspace, (all > of the virtio emulation parts are in userspace) so they didn't try to > put it in the kernel. > > This is the pattern. User space is supposed to do the emulation parts, > the kernel provides the raw elements to manage queues/etc - and it is > not done through mdev. > > > efficient for all vDPA type devices. However Scalable IOV is > > similar to SR-IOV, only for resource partitioning. It doesn't change > > the device programming interface, which could be in any vendor > > specific form. Here VFIO mdev is good for providing an unified > > interface for managing resource multiplexing of all such devices. > > SIOV doesn't have a HW config space, and for some reason in these > patches there is BAR emulation too. So, no, it is not like SR-IOV at > all. > > This is more like classic device emulation, presumably with some fast > path for the data plane. ie just like VDPA :) > > Jason Thanks Kevin
On Fri, Apr 24, 2020 at 04:25:56PM +0000, Tian, Kevin wrote: > > From: Jason Gunthorpe > > Sent: Friday, April 24, 2020 8:45 PM > > > > On Fri, Apr 24, 2020 at 03:27:41AM +0000, Tian, Kevin wrote: > > > > > > > That by itself doesn't translate to what a guest typically does > > > > > with a VDEV. There are other control paths that need to be serviced > > > > > from the kernel code via VFIO. For speed path operations like > > > > > ringing doorbells and such they are directly managed from guest. > > > > > > > > You don't need vfio to mmap BAR pages to userspace. The unique thing > > > > that vfio gives is it provides a way to program the classic non-PASID > > > > iommu, which you are not using here. > > > > > > That unique thing is indeed used here. Please note sharing CPU virtual > > > address space with device (what SVA API is invented for) is not the > > > purpose of this series. We still rely on classic non-PASID iommu > > programming, > > > i.e. mapping/unmapping IOVA->HPA per iommu_domain. Although > > > we do use PASID to tag ADI, the PASID is contained within iommu_domain > > > and invisible to VFIO. From userspace p.o.v, this is a device passthrough > > > usage instead of PASID-based address space binding. > > > > So you have PASID support but don't use it? Why? PASID is much better > > than classic VFIO iommu, it doesn't require page pinning... > > PASID and I/O page fault (through ATS/PRI) are orthogonal things. Don't > draw the equation between them. The host driver can tag PASID to > ADI so every DMA request out of that ADI has a PASID prefix, allowing VT-d > to do PASID-granular DMA isolation. However I/O page fault cannot be > taken for granted. A scalable IOV device may support PASID while without > ATS/PRI. Even when ATS/PRI is supported, the tolerance of I/O page fault > is decided by the work queue mode that is configured by the guest. For > example, if the guest put the work queue in non-faultable transaction > mode, the device doesn't do PRI and simply report error if no valid IOMMU > mapping. Okay, that makes sense, I wasn't aware people were doing PASID without ATS at this point.. > > > idxd is just the first device that supports Scalable IOV. We have a > > > lot more coming later, in different types. Then putting such > > > emulation in user space means that Qemu needs to support all those > > > vendor specific interfaces for every new device which supports > > > > It would be very sad to see an endless amount of device emulation code > > crammed into the kernel. Userspace is where device emulation is > > supposed to live. For security > > I think providing an unified abstraction to userspace is also important, > which is what VFIO provides today. The merit of using one set of VFIO > API to manage all kinds of mediated devices and VF devices is a major > gain. Instead, inventing a new vDPA-like interface for every Scalable-IOV > or equivalent device is just overkill and doesn't scale. Also the actual > emulation code in idxd driver is actually small, if putting aside the PCI > config space part for which I already explained most logic could be shared > between mdev device drivers. If it was just config space you might have an argument, VFIO already does some config space mangling, but emulating BAR space is out of scope of VFIO, IMHO. I also think it is disingenuous to pretend this is similar to SR-IOV. SR-IOV is self contained and the BAR does not require emulation. 
What you have here sounds like it is just an ordinary multi-queue device with the ability to PASID tag queues for IOMMU handling. This is absolutely not SRIOV - it is much closer to VDPA, which isn't using mdev. Further, I disagree with your assessment that this doesn't scale. You already said you plan a normal user interface for idxd, so instead of having a single sane user interface (ala VDPA) idxd now needs *two*. If this is the general pattern of things to come, it is a bad path. The only thing we get out of this is someone doesn't have to write a idxd emulation driver in qemu, instead they have to write it in the kernel. I don't see how that is a win for the ecosystem. Jason
> From: Jason Gunthorpe <jgg@mellanox.com> > Sent: Saturday, April 25, 2020 2:12 AM > > > > > idxd is just the first device that supports Scalable IOV. We have a > > > > lot more coming later, in different types. Then putting such > > > > emulation in user space means that Qemu needs to support all those > > > > vendor specific interfaces for every new device which supports > > > > > > It would be very sad to see an endless amount of device emulation code > > > crammed into the kernel. Userspace is where device emulation is > > > supposed to live. For security > > > > I think providing an unified abstraction to userspace is also important, > > which is what VFIO provides today. The merit of using one set of VFIO > > API to manage all kinds of mediated devices and VF devices is a major > > gain. Instead, inventing a new vDPA-like interface for every Scalable-IOV > > or equivalent device is just overkill and doesn't scale. Also the actual > > emulation code in idxd driver is actually small, if putting aside the PCI > > config space part for which I already explained most logic could be shared > > between mdev device drivers. > > If it was just config space you might have an argument, VFIO already > does some config space mangling, but emulating BAR space is out of > scope of VFIO, IMHO. out of scope of vfio-pci, but in scope of vfio-mdev. btw I feel that most of your objections are actually related to the general idea of vfio-mdev. Scalable IOV just uses PASID to harden DMA isolation in mediated pass-through usage which vfio-mdev enables. Then are you just opposing the whole vfio-mdev? If not, I'm curious about the criteria in your mind about when using vfio-mdev is good... > > I also think it is disingenuous to pretend this is similar to > SR-IOV. SR-IOV is self contained and the BAR does not require > emulation. What you have here sounds like it is just an ordinary technically Scalable IOV is definitely different from SR-IOV. It's simpler in hardware. And we're not emulating SR-IOV. The point is just in usage-wise we want to present a consistent user experience just like passing through a PCI endpoint (PF or VF) device through vfio eco-system, including various userspace VMMs (Qemu, firecracker, rust-vmm, etc.), middleware (Libvirt), and higher level management stacks. > multi-queue device with the ability to PASID tag queues for IOMMU > handling. This is absolutely not SRIOV - it is much closer to VDPA, > which isn't using mdev. > > Further, I disagree with your assessment that this doesn't scale. You > already said you plan a normal user interface for idxd, so instead of > having a single sane user interface (ala VDPA) idxd now needs *two*. If > this is the general pattern of things to come, it is a bad path. > > The only thing we get out of this is someone doesn't have to write a > idxd emulation driver in qemu, instead they have to write it in the > kernel. I don't see how that is a win for the ecosystem. > No. The clear win is on leveraging classic VFIO iommu and its eco-system as explained above. Thanks Kevin
On Sun, Apr 26, 2020 at 05:18:59AM +0000, Tian, Kevin wrote: > > > I think providing an unified abstraction to userspace is also important, > > > which is what VFIO provides today. The merit of using one set of VFIO > > > API to manage all kinds of mediated devices and VF devices is a major > > > gain. Instead, inventing a new vDPA-like interface for every Scalable-IOV > > > or equivalent device is just overkill and doesn't scale. Also the actual > > > emulation code in idxd driver is actually small, if putting aside the PCI > > > config space part for which I already explained most logic could be shared > > > between mdev device drivers. > > > > If it was just config space you might have an argument, VFIO already > > does some config space mangling, but emulating BAR space is out of > > scope of VFIO, IMHO. > > out of scope of vfio-pci, but in scope of vfio-mdev. btw I feel that most > of your objections are actually related to the general idea of > vfio-mdev. There have been several abusive proposals of vfio-mdev, everything from a way to create device drivers to this kind of generic emulation framework. > Scalable IOV just uses PASID to harden DMA isolation in mediated > pass-through usage which vfio-mdev enables. Then are you just opposing > the whole vfio-mdev? If not, I'm curious about the criteria in your mind > about when using vfio-mdev is good... It is appropriate when non-PCI standard techniques are needed to do raw device assignment, just like VFIO. Basically if vfio-pci is already doing it then it seems reasonable that vfio-mdev should do the same. This mission creep where vfio-mdev gains functionality far beyond VFIO is the problem. > technically Scalable IOV is definitely different from SR-IOV. It's > simpler in hardware. And we're not emulating SR-IOV. The point > is just in usage-wise we want to present a consistent user > experience just like passing through a PCI endpoint (PF or VF) device > through vfio eco-system, including various userspace VMMs (Qemu, > firecracker, rust-vmm, etc.), middleware (Libvirt), and higher level > management stacks. Yes, I understand your desire, but at the same time we have not been doing device emulation in the kernel. You should at least be forthright about that major change in the cover letters/etc. > > The only thing we get out of this is someone doesn't have to write a > > idxd emulation driver in qemu, instead they have to write it in the > > kernel. I don't see how that is a win for the ecosystem. > > No. The clear win is on leveraging classic VFIO iommu and its eco-system > as explained above. vdpa had no problem implementing iommu support without VFIO. This was their original argument too, it turned out to be erroneous. Jason
On Sun, 26 Apr 2020 16:13:57 -0300 Jason Gunthorpe <jgg@mellanox.com> wrote: > On Sun, Apr 26, 2020 at 05:18:59AM +0000, Tian, Kevin wrote: > > > > > I think providing an unified abstraction to userspace is also important, > > > > which is what VFIO provides today. The merit of using one set of VFIO > > > > API to manage all kinds of mediated devices and VF devices is a major > > > > gain. Instead, inventing a new vDPA-like interface for every Scalable-IOV > > > > or equivalent device is just overkill and doesn't scale. Also the actual > > > > emulation code in idxd driver is actually small, if putting aside the PCI > > > > config space part for which I already explained most logic could be shared > > > > between mdev device drivers. > > > > > > If it was just config space you might have an argument, VFIO already > > > does some config space mangling, but emulating BAR space is out of > > > scope of VFIO, IMHO. > > > > out of scope of vfio-pci, but in scope of vfio-mdev. btw I feel that most > > of your objections are actually related to the general idea of > > vfio-mdev. > > There have been several abusive proposals of vfio-mdev, everything > from a way to create device drivers to this kind of generic emulation > framework. > > > Scalable IOV just uses PASID to harden DMA isolation in mediated > > pass-through usage which vfio-mdev enables. Then are you just opposing > > the whole vfio-mdev? If not, I'm curious about the criteria in your mind > > about when using vfio-mdev is good... > > It is appropriate when non-PCI standard techniques are needed to do > raw device assignment, just like VFIO. > > Basically if vfio-pci is already doing it then it seems reasonable > that vfio-mdev should do the same. This mission creep where vfio-mdev > gains functionality far beyond VFIO is the problem. Ehm, vfio-pci emulates BARs too. We also emulate FLR, power management, DisINTx, and VPD. FLR, PM, and VPD all have device specific quirks in the host kernel, and I've generally taken the stance that we should take advantage of those quirks, not duplicate them in userspace and not invent new access mechanisms/ioctls for each of them. Emulating DisINTx is convenient since we must have a mechanism to mask INTx, whether it's at the device or the APIC, so we can pretend the hardware supports it. BAR emulation is really too trivial to argue about, the BARs mean nothing to the physical device mapping, they're simply scratch registers that we mask out the alignment bits on read. vfio-pci is a mix of things that we decide are too complicated or irrelevant to emulate in the kernel and things that take advantage of shared quirks or are just too darn easy to worry about. BARs fall into that latter category, any sort of mapping into VM address spaces is necessarily done in userspace, but scratch registers that are masked on read, *shrug*, vfio-pci does that. Thanks, Alex > > technically Scalable IOV is definitely different from SR-IOV. It's > > simpler in hardware. And we're not emulating SR-IOV. The point > > is just in usage-wise we want to present a consistent user > > experience just like passing through a PCI endpoint (PF or VF) device > > through vfio eco-system, including various userspace VMMs (Qemu, > > firecracker, rust-vmm, etc.), middleware (Libvirt), and higher level > > management stacks. > > Yes, I understand your desire, but at the same time we have not been > doing device emulation in the kernel. You should at least be > forthwright about that major change in the cover letters/etc. 
> > > > The only thing we get out of this is someone doesn't have to write a > > > idxd emulation driver in qemu, instead they have to write it in the > > > kernel. I don't see how that is a win for the ecosystem. > > > > No. The clear win is on leveraging classic VFIO iommu and its eco-system > > as explained above. > > vdpa had no problem implementing iommu support without VFIO. This was > their original argument too, it turned out to be erroneous. > > Jason >
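[Editor's note] As a concrete illustration of the BAR handling Alex describes above (a scratch register whose low alignment bits are masked on read so standard BAR sizing works), here is a minimal sketch. The structure and function names are invented for illustration and are not taken from the idxd patches; only the masking behaviour follows the PCI BAR convention.

#include <linux/types.h>

/* Assumed size of the emulated region; must be a power of two. */
#define VDEV_BAR0_SIZE	0x2000

struct vdev_cfg {
	u32 bar0;	/* emulated BAR0 "scratch" value */
};

/*
 * Guest writes are stored with the low alignment bits masked off, so a
 * sizing write of 0xffffffff reads back as ~(size - 1) plus the low
 * type bits, exactly as a real memory BAR would behave.
 */
static void vdev_cfg_bar0_write(struct vdev_cfg *cfg, u32 val)
{
	cfg->bar0 = (val & ~((u32)VDEV_BAR0_SIZE - 1)) | (cfg->bar0 & 0xf);
}

static u32 vdev_cfg_bar0_read(struct vdev_cfg *cfg)
{
	return cfg->bar0;	/* nothing is programmed into hardware */
}

The value never reaches the physical device; as Alex notes, mapping the region into the VM address space is still done entirely in userspace.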
On Sun, Apr 26, 2020 at 09:43:55PM -0600, Alex Williamson wrote: > On Sun, 26 Apr 2020 16:13:57 -0300 > Jason Gunthorpe <jgg@mellanox.com> wrote: > > > On Sun, Apr 26, 2020 at 05:18:59AM +0000, Tian, Kevin wrote: > > > > > > > I think providing an unified abstraction to userspace is also important, > > > > > which is what VFIO provides today. The merit of using one set of VFIO > > > > > API to manage all kinds of mediated devices and VF devices is a major > > > > > gain. Instead, inventing a new vDPA-like interface for every Scalable-IOV > > > > > or equivalent device is just overkill and doesn't scale. Also the actual > > > > > emulation code in idxd driver is actually small, if putting aside the PCI > > > > > config space part for which I already explained most logic could be shared > > > > > between mdev device drivers. > > > > > > > > If it was just config space you might have an argument, VFIO already > > > > does some config space mangling, but emulating BAR space is out of > > > > scope of VFIO, IMHO. > > > > > > out of scope of vfio-pci, but in scope of vfio-mdev. btw I feel that most > > > of your objections are actually related to the general idea of > > > vfio-mdev. > > > > There have been several abusive proposals of vfio-mdev, everything > > from a way to create device drivers to this kind of generic emulation > > framework. > > > > > Scalable IOV just uses PASID to harden DMA isolation in mediated > > > pass-through usage which vfio-mdev enables. Then are you just opposing > > > the whole vfio-mdev? If not, I'm curious about the criteria in your mind > > > about when using vfio-mdev is good... > > > > It is appropriate when non-PCI standard techniques are needed to do > > raw device assignment, just like VFIO. > > > > Basically if vfio-pci is already doing it then it seems reasonable > > that vfio-mdev should do the same. This mission creep where vfio-mdev > > gains functionality far beyond VFIO is the problem. > > Ehm, vfio-pci emulates BARs too. We also emulate FLR, power > management, DisINTx, and VPD. FLR, PM, and VPD all have device > specific quirks in the host kernel, and I've generally taken the stance > that would should take advantage of those quirks, not duplicate them in > userspace and not invent new access mechanisms/ioctls for each of them. > Emulating DisINTx is convenient since we must have a mechanism to mask > INTx, whether it's at the device or the APIC, so we can pretend the > hardware supports it. BAR emulation is really too trivial to argue > about, the BARs mean nothing to the physical device mapping, they're > simply scratch registers that we mask out the alignment bits on read. > vfio-pci is a mix of things that we decide are too complicated or > irrelevant to emulate in the kernel and things that take advantage of > shared quirks or are just too darn easy to worry about. BARs fall into > that latter category, any sort of mapping into VM address spaces is > necessarily done in userspace, but scratch registers that are masked on > read, *shrug*, vfio-pci does that. Thanks, It is not trivial masking. It is a 2000 line patch doing comprehensive emulation. Jason
> From: Jason Gunthorpe <jgg@mellanox.com> > Sent: Monday, April 27, 2020 3:14 AM [...] > > technically Scalable IOV is definitely different from SR-IOV. It's > > simpler in hardware. And we're not emulating SR-IOV. The point > > is just in usage-wise we want to present a consistent user > > experience just like passing through a PCI endpoint (PF or VF) device > > through vfio eco-system, including various userspace VMMs (Qemu, > > firecracker, rust-vmm, etc.), middleware (Libvirt), and higher level > > management stacks. > > Yes, I understand your desire, but at the same time we have not been > doing device emulation in the kernel. You should at least be > forthwright about that major change in the cover letters/etc. I searched 'emulate' in kernel/Documentation: Documentation/sound/alsa-configuration.rst (emulate oss on alsa) Documentation/security/tpm/tpm_vtpm_proxy.rst (emulate virtual TPM) Documentation/networking/generic-hdlc.txt (emulate eth on HDLC) Documentation/gpu/todo.rst (generic fbdev emulation) ... I believe the main reason such emulations were put in the kernel is that those emulated device interfaces have established eco-systems and values which the kernel shouldn't break. As you emphasized earlier, they have good reasons for getting into the kernel. Then back to this context. Almost every newly-born Linux VMM (firecracker, crosvm, cloud hypervisor, and some proprietary implementations) support only two types of devices: virtio and vfio, because they want to be simple and slim. Virtio provides a basic set of I/O capabilities required by most VMs, while vfio brings a unified interface for gaining added value or higher performance from assigned devices. Even Qemu supports a minimal configuration ('microvm') now, for a similar reason. So the vfio eco-system is significant and represents a major trend in the virtualization space. Supporting the vfio eco-system is therefore the actual GOAL of this patch series, rather than just one optional technique among others. vfio-pci is there for passing through standalone PCI endpoints (PF or VF), and vfio-mdev is there for passing through a smaller portion of device resources while sharing the same VFIO interface to gain uniform support in this eco-system. I believe the above is a good reason for putting emulation in the idxd driver by using vfio-mdev. Yes, it does imply that there will be more emulation in the kernel when more Scalable-IOV (or similar) devices are introduced. But as explained earlier, the PCI config space emulation can be largely consolidated and reused, and the remaining device-specific MMIO emulation is relatively simple because we define the virtual device interface to be the same as, or even simpler than, a VF interface. Only a small set of registers is emulated after the fast-path resource is passed through, and that small set of course needs to meet the normal quality requirements for getting into the kernel. We'll definitely highlight this part in future cover letters.
On Mon, Apr 27, 2020 at 12:13:33PM +0000, Tian, Kevin wrote: > Then back to this context. Almost every newly-born Linux VMM > (firecracker, crosvm, cloud hypervisor, and some proprietary > implementations) support only two types of devices: virtio and > vfio, because they want to be simple and slim. For security. Moving all the sketchy emulation code into the kernel seems like a worse security posture overall :( Jason
On Mon, 27 Apr 2020 08:58:18 -0300 Jason Gunthorpe <jgg@mellanox.com> wrote: > On Sun, Apr 26, 2020 at 09:43:55PM -0600, Alex Williamson wrote: > > On Sun, 26 Apr 2020 16:13:57 -0300 > > Jason Gunthorpe <jgg@mellanox.com> wrote: > > > > > On Sun, Apr 26, 2020 at 05:18:59AM +0000, Tian, Kevin wrote: > > > > > > > > > I think providing an unified abstraction to userspace is also important, > > > > > > which is what VFIO provides today. The merit of using one set of VFIO > > > > > > API to manage all kinds of mediated devices and VF devices is a major > > > > > > gain. Instead, inventing a new vDPA-like interface for every Scalable-IOV > > > > > > or equivalent device is just overkill and doesn't scale. Also the actual > > > > > > emulation code in idxd driver is actually small, if putting aside the PCI > > > > > > config space part for which I already explained most logic could be shared > > > > > > between mdev device drivers. > > > > > > > > > > If it was just config space you might have an argument, VFIO already > > > > > does some config space mangling, but emulating BAR space is out of > > > > > scope of VFIO, IMHO. > > > > > > > > out of scope of vfio-pci, but in scope of vfio-mdev. btw I feel that most > > > > of your objections are actually related to the general idea of > > > > vfio-mdev. > > > > > > There have been several abusive proposals of vfio-mdev, everything > > > from a way to create device drivers to this kind of generic emulation > > > framework. > > > > > > > Scalable IOV just uses PASID to harden DMA isolation in mediated > > > > pass-through usage which vfio-mdev enables. Then are you just opposing > > > > the whole vfio-mdev? If not, I'm curious about the criteria in your mind > > > > about when using vfio-mdev is good... > > > > > > It is appropriate when non-PCI standard techniques are needed to do > > > raw device assignment, just like VFIO. > > > > > > Basically if vfio-pci is already doing it then it seems reasonable > > > that vfio-mdev should do the same. This mission creep where vfio-mdev > > > gains functionality far beyond VFIO is the problem. > > > > Ehm, vfio-pci emulates BARs too. We also emulate FLR, power > > management, DisINTx, and VPD. FLR, PM, and VPD all have device > > specific quirks in the host kernel, and I've generally taken the stance > > that would should take advantage of those quirks, not duplicate them in > > userspace and not invent new access mechanisms/ioctls for each of them. > > Emulating DisINTx is convenient since we must have a mechanism to mask > > INTx, whether it's at the device or the APIC, so we can pretend the > > hardware supports it. BAR emulation is really too trivial to argue > > about, the BARs mean nothing to the physical device mapping, they're > > simply scratch registers that we mask out the alignment bits on read. > > vfio-pci is a mix of things that we decide are too complicated or > > irrelevant to emulate in the kernel and things that take advantage of > > shared quirks or are just too darn easy to worry about. BARs fall into > > that latter category, any sort of mapping into VM address spaces is > > necessarily done in userspace, but scratch registers that are masked on > > read, *shrug*, vfio-pci does that. Thanks, > > It is not trivial masking. It is a 2000 line patch doing comprehensive > emulation. 
Not sure what you're referring to, I see about 30 lines of code in vdcm_vidxd_cfg_write() that specifically handle writes to the 4 BARs in config space and maybe a couple hundred lines of code in total handling config space emulation. Thanks, Alex
On Mon, Apr 27, 2020 at 07:19:39AM -0600, Alex Williamson wrote: > > It is not trivial masking. It is a 2000 line patch doing comprehensive > > emulation. > > Not sure what you're referring to, I see about 30 lines of code in > vdcm_vidxd_cfg_write() that specifically handle writes to the 4 BARs in > config space and maybe a couple hundred lines of code in total handling > config space emulation. Thanks, Look around vidxd_do_command() If I understand this flow properly.. Jason
On Mon, 27 Apr 2020 10:22:18 -0300 Jason Gunthorpe <jgg@mellanox.com> wrote: > On Mon, Apr 27, 2020 at 07:19:39AM -0600, Alex Williamson wrote: > > > > It is not trivial masking. It is a 2000 line patch doing comprehensive > > > emulation. > > > > Not sure what you're referring to, I see about 30 lines of code in > > vdcm_vidxd_cfg_write() that specifically handle writes to the 4 BARs in > > config space and maybe a couple hundred lines of code in total handling > > config space emulation. Thanks, > > Look around vidxd_do_command() > > If I understand this flow properly.. I've only glanced at it, but that's called in response to a write to MMIO space on the device, so it's implementing a device specific register. Are you asking that PCI config space be done in userspace or any sort of device emulation? The assumption with mdev is that we need emulation in the host kernel because we need a trusted entity to mediate device access and interact with privileged portion of the device control. Thanks, Alex
On Mon, Apr 27, 2020 at 08:18:41AM -0600, Alex Williamson wrote: > On Mon, 27 Apr 2020 10:22:18 -0300 > Jason Gunthorpe <jgg@mellanox.com> wrote: > > > On Mon, Apr 27, 2020 at 07:19:39AM -0600, Alex Williamson wrote: > > > > > > It is not trivial masking. It is a 2000 line patch doing comprehensive > > > > emulation. > > > > > > Not sure what you're referring to, I see about 30 lines of code in > > > vdcm_vidxd_cfg_write() that specifically handle writes to the 4 BARs in > > > config space and maybe a couple hundred lines of code in total handling > > > config space emulation. Thanks, > > > > Look around vidxd_do_command() > > > > If I understand this flow properly.. > > I've only glanced at it, but that's called in response to a write to > MMIO space on the device, so it's implementing a device specific > register. It is doing emulation of the secure BAR. The entire 1000 lines of vidxd_* functions appear to be focused on this task. > Are you asking that PCI config space be done in userspace > or any sort of device emulation? I'm concerned about doing full emulation of registers on a MMIO BAR that trigger complex actions in response to MMIO read/write. Simple masking and simple config space stuff doesn't seem so problematic. > The assumption with mdev is that we need emulation in the host > kernel because we need a trusted entity to mediate device access and > interact with privileged portion of the device control. Thanks, Sure, but there are all kinds of different levels to this - mdev should not be some open ended device emulation framework, IMHO. ie other devices need only a small amount of kernel side help and don't need complex MMIO BAR emulation. Would you be happy if someone proposed an e1000 NIC emulator using mdev? Why not move every part of qemu's PCI device emulation into the kernel? Jason
On Mon, 27 Apr 2020 11:25:53 -0300 Jason Gunthorpe <jgg@mellanox.com> wrote: > On Mon, Apr 27, 2020 at 08:18:41AM -0600, Alex Williamson wrote: > > On Mon, 27 Apr 2020 10:22:18 -0300 > > Jason Gunthorpe <jgg@mellanox.com> wrote: > > > > > On Mon, Apr 27, 2020 at 07:19:39AM -0600, Alex Williamson wrote: > > > > > > > > It is not trivial masking. It is a 2000 line patch doing comprehensive > > > > > emulation. > > > > > > > > Not sure what you're referring to, I see about 30 lines of code in > > > > vdcm_vidxd_cfg_write() that specifically handle writes to the 4 BARs in > > > > config space and maybe a couple hundred lines of code in total handling > > > > config space emulation. Thanks, > > > > > > Look around vidxd_do_command() > > > > > > If I understand this flow properly.. > > > > I've only glanced at it, but that's called in response to a write to > > MMIO space on the device, so it's implementing a device specific > > register. > > It is doing emulation of the secure BAR. The entire 1000 lines of > vidxd_* functions appear to be focused on this task. Ok, we/I need a terminology clarification, a BAR is a register in config space for determining the size, type, and setting the location of a I/O or memory region of a device. I've been asserting that the emulation of the BAR itself is trivial, but are you actually focused on emulation of the region described by the BAR? This is what mdev is for, mediating access to a device and filling in gaps such that we can use re-use the vfio device APIs. > > Are you asking that PCI config space be done in userspace > > or any sort of device emulation? > > I'm concerned about doing full emulation of registers on a MMIO BAR > that trigger complex actions in response to MMIO read/write. Maybe what you're recalling me say about mdev is that its Achilles heel is that we rely on mediation provider (ie. vendor driver) for security, we don't necessarily have an piece of known, common hardware like an IOMMU to protect us when things go wrong. That's true, but don't we also trust drivers in the host kernel to correctly manage and validate their own interactions with hardware, including the APIs provided through other user interfaces. Is the assertion then that device specific, register level API is too difficult to emulate? > Simple masking and simple config space stuff doesn't seem so > problematic. > > > The assumption with mdev is that we need emulation in the host > > kernel because we need a trusted entity to mediate device access and > > interact with privileged portion of the device control. Thanks, > > Sure, but there are all kinds of different levels to this - mdev > should not be some open ended device emulation framework, IMHO. > > ie other devices need only a small amount of kernel side help and > don't need complex MMIO BAR emulation. > > Would you be happy if someone proposed an e1000 NIC emulator using > mdev? Why not move every part of qemu's PCI device emulation into the > kernel? Well, in order to mediate a device, we certainly expect there to be a physical device. I also expect that there's some performance or at least compatibility advantage to using the device API directly rather than masquerading everything behind something like virtio. So no, I wouldn't expect someone to create a fully emulated device in mdev, but also I do expect some degree of device emulation in an mdev driver to fill the gaps in non-performance path that hardware chose to defer to software. Thanks, Alex
On Mon, Apr 27, 2020 at 09:41:37AM -0600, Alex Williamson wrote: > On Mon, 27 Apr 2020 11:25:53 -0300 > Jason Gunthorpe <jgg@mellanox.com> wrote: > > > On Mon, Apr 27, 2020 at 08:18:41AM -0600, Alex Williamson wrote: > > > On Mon, 27 Apr 2020 10:22:18 -0300 > > > Jason Gunthorpe <jgg@mellanox.com> wrote: > > > > > > > On Mon, Apr 27, 2020 at 07:19:39AM -0600, Alex Williamson wrote: > > > > > > > > > > It is not trivial masking. It is a 2000 line patch doing comprehensive > > > > > > emulation. > > > > > > > > > > Not sure what you're referring to, I see about 30 lines of code in > > > > > vdcm_vidxd_cfg_write() that specifically handle writes to the 4 BARs in > > > > > config space and maybe a couple hundred lines of code in total handling > > > > > config space emulation. Thanks, > > > > > > > > Look around vidxd_do_command() > > > > > > > > If I understand this flow properly.. > > > > > > I've only glanced at it, but that's called in response to a write to > > > MMIO space on the device, so it's implementing a device specific > > > register. > > > > It is doing emulation of the secure BAR. The entire 1000 lines of > > vidxd_* functions appear to be focused on this task. > > Ok, we/I need a terminology clarification, a BAR is a register in > config space for determining the size, type, and setting the location > of a I/O or memory region of a device. I've been asserting that the > emulation of the BAR itself is trivial, but are you actually focused on > emulation of the region described by the BAR? Yes, BAR here means the actually MMIO memory window - not the config space part. Config space emulation is largely trivial. > > > Are you asking that PCI config space be done in userspace > > > or any sort of device emulation? > > > > I'm concerned about doing full emulation of registers on a MMIO BAR > > that trigger complex actions in response to MMIO read/write. > > Maybe what you're recalling me say about mdev is that its Achilles > heel is that we rely on mediation provider (ie. vendor driver) for > security, we don't necessarily have an piece of known, common hardware > like an IOMMU to protect us when things go wrong. That's true, but > don't we also trust drivers in the host kernel to correctly manage and > validate their own interactions with hardware, including the APIs > provided through other user interfaces. Is the assertion then that > device specific, register level API is too difficult to emulate? No, it is a reflection on the standard Linux philosophy that if it can be done in user space it should be done in userspace. Ie keep minimal work in the monolithic kernel. Also to avoid duplication, ie idxd proposes to have a char dev with a normal kernel driver interface and then an in-kernel emulated MMIO BAR version of that same capability for VFIO consumption. Jason
On 4/27/2020 9:16 AM, Jason Gunthorpe wrote: > On Mon, Apr 27, 2020 at 09:41:37AM -0600, Alex Williamson wrote: >> On Mon, 27 Apr 2020 11:25:53 -0300 >> Jason Gunthorpe <jgg@mellanox.com> wrote: >> >>> On Mon, Apr 27, 2020 at 08:18:41AM -0600, Alex Williamson wrote: >>>> On Mon, 27 Apr 2020 10:22:18 -0300 >>>> Jason Gunthorpe <jgg@mellanox.com> wrote: >>>> >>>>> On Mon, Apr 27, 2020 at 07:19:39AM -0600, Alex Williamson wrote: >>>>> >>>>>>> It is not trivial masking. It is a 2000 line patch doing comprehensive >>>>>>> emulation. >>>>>> >>>>>> Not sure what you're referring to, I see about 30 lines of code in >>>>>> vdcm_vidxd_cfg_write() that specifically handle writes to the 4 BARs in >>>>>> config space and maybe a couple hundred lines of code in total handling >>>>>> config space emulation. Thanks, >>>>> >>>>> Look around vidxd_do_command() >>>>> >>>>> If I understand this flow properly.. >>>> >>>> I've only glanced at it, but that's called in response to a write to >>>> MMIO space on the device, so it's implementing a device specific >>>> register. >>> >>> It is doing emulation of the secure BAR. The entire 1000 lines of >>> vidxd_* functions appear to be focused on this task. >> >> Ok, we/I need a terminology clarification, a BAR is a register in >> config space for determining the size, type, and setting the location >> of a I/O or memory region of a device. I've been asserting that the >> emulation of the BAR itself is trivial, but are you actually focused on >> emulation of the region described by the BAR? > > Yes, BAR here means the actually MMIO memory window - not the config > space part. Config space emulation is largely trivial. > >>>> Are you asking that PCI config space be done in userspace >>>> or any sort of device emulation? >>> >>> I'm concerned about doing full emulation of registers on a MMIO BAR >>> that trigger complex actions in response to MMIO read/write. >> >> Maybe what you're recalling me say about mdev is that its Achilles >> heel is that we rely on mediation provider (ie. vendor driver) for >> security, we don't necessarily have an piece of known, common hardware >> like an IOMMU to protect us when things go wrong. That's true, but >> don't we also trust drivers in the host kernel to correctly manage and >> validate their own interactions with hardware, including the APIs >> provided through other user interfaces. Is the assertion then that >> device specific, register level API is too difficult to emulate? > > No, it is a reflection on the standard Linux philosophy that if it can > be done in user space it should be done in userspace. Ie keep minimal > work in the monolithic kernel. > > Also to avoid duplication, ie idxd proposes to have a char dev with a > normal kernel driver interface and then an in-kernel emulated MMIO BAR > version of that same capability for VFIO consumption. The char dev interface serves user apps on host (which we will deprecate and move to the UACCE framework in near future). The mdev interface will be servicing guests only. I'm not sure where the duplication of functionality comes into play. > > Jason >
On Mon, Apr 27, 2020 at 09:25:58AM -0700, Dave Jiang wrote: > > Also to avoid duplication, ie idxd proposes to have a char dev with a > > normal kernel driver interface and then an in-kernel emulated MMIO BAR > > version of that same capability for VFIO consumption. > > The char dev interface serves user apps on host (which we will deprecate and > move to the UACCE framework in near future). The point is the char dev or UACCE framework should provide enough capability to implement the emulation in user space. Jason
> From: Jason Gunthorpe <jgg@mellanox.com> > Sent: Monday, April 27, 2020 9:22 PM > > On Mon, Apr 27, 2020 at 07:19:39AM -0600, Alex Williamson wrote: > > > > It is not trivial masking. It is a 2000 line patch doing comprehensive > > > emulation. > > > > Not sure what you're referring to, I see about 30 lines of code in > > vdcm_vidxd_cfg_write() that specifically handle writes to the 4 BARs in > > config space and maybe a couple hundred lines of code in total handling > > config space emulation. Thanks, > > Look around vidxd_do_command() > > If I understand this flow properly.. > Hi, Jason, I guess the 2000 lines mostly refer to the changes in mdev.c and vdev.c. We did a break-down among them: 1) ~150 LOC for vdev initialization 2) ~150 LOC for cfg space emulation 3) ~230 LOC for mmio r/w emulation 4) ~500 LOC for controlling the work queue (vidxd_do_command), triggered by write emulation of IDXD_CMD_OFFSET register 5) the remaining lines are all about vfio-mdev registration/callbacks, for reporting mmio/irq resource, eventfd, mmap, etc. 1/2/3) are pure device emulation, which counts for ~500 LOC. 4) needs be in the kernel regardless of which uAPI is used, because it talks to the physical work queue (enable, disable, drain, abort, reset, etc.) Then if just talking about ~500 LOC emulation code left in the kernel, is it still a big concern to you?
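[Editor's note] For readers unfamiliar with item 5) in Kevin's breakdown, the vfio-mdev registration glue has roughly the following shape against the mdev_parent_ops interface as it existed around v5.6/v5.7. All idxd_vdcm_* names and the stub bodies are invented for illustration, and the supported_type_groups/sysfs attribute setup is omitted; the read/write callbacks are where the config-space and MMIO emulation from items 2) and 3) gets dispatched.

#include <linux/module.h>
#include <linux/mdev.h>

static int idxd_vdcm_create(struct kobject *kobj, struct mdev_device *mdev)
{
	/* allocate per-mdev state and claim a free dedicated WQ */
	return 0;
}

static int idxd_vdcm_remove(struct mdev_device *mdev)
{
	/* quiesce the WQ and free the per-mdev state */
	return 0;
}

static ssize_t idxd_vdcm_read(struct mdev_device *mdev, char __user *buf,
			      size_t count, loff_t *ppos)
{
	/* dispatch to config-space or MMIO read emulation based on *ppos */
	return count;
}

static ssize_t idxd_vdcm_write(struct mdev_device *mdev,
			       const char __user *buf, size_t count,
			       loff_t *ppos)
{
	/*
	 * Config-space/MMIO write emulation; a guest write to the command
	 * register is what ends up in the vidxd_do_command()-style WQ
	 * control path discussed earlier.
	 */
	return count;
}

static const struct mdev_parent_ops idxd_vdcm_ops = {
	.owner	= THIS_MODULE,
	.create	= idxd_vdcm_create,
	.remove	= idxd_vdcm_remove,
	.read	= idxd_vdcm_read,
	.write	= idxd_vdcm_write,
	/* .ioctl and .mmap report region/irq info, wire up eventfds and
	 * map the fast-path portal pages */
};

/* called from the parent (PF) driver's probe path */
static int idxd_vdcm_init(struct device *dev)
{
	return mdev_register_device(dev, &idxd_vdcm_ops);
}

The debate in this thread is about how much logic sits behind those read/write callbacks, not about the registration itself.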
Hi Jason, On 4/23/2020 12:49 PM, Jason Gunthorpe wrote: > On Thu, Apr 23, 2020 at 12:17:50PM -0700, Dan Williams wrote: > >> Per Megha's follow-up can you send the details about that other device >> and help clear a path for a device-specific MSI addr/data table >> format. Ever since HMM I've been sensitive, perhaps overly-sensitive, >> to claims about future upstream users. The fact that you have an >> additional use case is golden for pushing this into a common area and >> validating the scope of the proposed API. > > I think I said it at plumbers, but yes, we are interested in this, and > would like dynamic MSI-like interrupts available to the driver (what > Intel calls IMS) > So basically you are looking for a way to dynamically allocate the platform-msi interrupts, correct? Since I don't have access to any of the platform-msi devices, it is hard for me to test this code for other drivers except idxd for now. Once I submit the next round of patches, after addressing all the comments, would it be possible for you to test this code for any of your devices? > It is something easy enough to illustrate with any RDMA device really, > just open a MR against the addr and use RDMA_WRITE to trigger the > data. It should trigger a Linux IRQ. Nothing else should be needed. > > Jason >
Hi Jason, On 4/23/2020 12:18 PM, Jason Gunthorpe wrote: > On Wed, Apr 22, 2020 at 02:24:11PM -0700, Dan Williams wrote: >> On Tue, Apr 21, 2020 at 4:55 PM Jason Gunthorpe <jgg@mellanox.com> wrote: >>> >>> On Tue, Apr 21, 2020 at 04:33:46PM -0700, Dave Jiang wrote: >>>> The actual code is independent of the stage 2 driver code submission that adds >>>> support for SVM, ENQCMD(S), PASID, and shared workqueues. This code series will >>>> support dedicated workqueue on a guest with no vIOMMU. >>>> >>>> A new device type "mdev" is introduced for the idxd driver. This allows the wq >>>> to be dedicated to the usage of a VFIO mediated device (mdev). Once the work >>>> queue (wq) is enabled, an uuid generated by the user can be added to the wq >>>> through the uuid sysfs attribute for the wq. After the association, a mdev can >>>> be created using this UUID. The mdev driver code will associate the uuid and >>>> setup the mdev on the driver side. When the create operation is successful, the >>>> uuid can be passed to qemu. When the guest boots up, it should discover a DSA >>>> device when doing PCI discovery. >>> >>> I'm feeling really skeptical that adding all this PCI config space and >>> MMIO BAR emulation to the kernel just to cram this into a VFIO >>> interface is a good idea, that kind of stuff is much safer in >>> userspace. >>> >>> Particularly since vfio is not really needed once a driver is using >>> the PASID stuff. We already have general code for drivers to use to >>> attach a PASID to a mm_struct - and using vfio while disabling all the >>> DMA/iommu config really seems like an abuse. >>> >>> A /dev/idxd char dev that mmaps a bar page and links it to a PASID >>> seems a lot simpler and saner kernel wise. >>> >>>> The mdev utilizes Interrupt Message Store or IMS[3] instead of MSIX for >>>> interrupts for the guest. This preserves MSIX for host usages and also allows a >>>> significantly larger number of interrupt vectors for guest usage. >>> >>> I never did get a reply to my earlier remarks on the IMS patches. >>> >>> The concept of a device specific addr/data table format for MSI is not >>> Intel specific. This should be general code. We have a device that can >>> use this kind of kernel capability today. >> >> This has been my concern reviewing the implementation. IMS needs more >> than one in-tree user to validate degrees of freedom in the api. I had >> been missing a second "in-tree user" to validate the scope of the >> flexibility that was needed. > > IMS is too narrowly specified. > > All platforms that support MSI today can support IMS. It is simply a > way for the platform to give the driver an addr/data pair that triggers > an interrupt when a posted write is performed to that pair. > Well, yes and no. IMS requires interrupt remapping in addition to the dynamic nature of IRQ allocation. > This is different from the other interrupt setup flows which are > tightly tied to the PCI layer. Here the driver should simply ask for > interrupts. > > Ie the entire IMS API to the driver should be something very simple > like: > > struct message_irq > { > uint64_t addr; > uint32_t data; > }; > > struct message_irq *request_message_irq( > struct device *, irq_handler_t handler, unsigned long flags, > const char *name, void *dev); > > And the plumbing underneath should setup the irq chips and so forth as > required. > yes, this seems correct. > Jason >
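[Editor's note] To make the flow Jason proposes above concrete, a hypothetical caller might look like the following. Neither request_message_irq() nor any of the mydev_* names exist in the kernel, and the ERR_PTR return convention is assumed; the point is only that the core hands back an addr/data pair and the driver programs it into its own device-specific storage (an IMS table, a queue context, or anything else).

#include <linux/interrupt.h>
#include <linux/err.h>

static irqreturn_t mydev_queue_evt(int irq, void *dev_id)
{
	/* an ordinary Linux IRQ handler, nothing IMS-specific here */
	return IRQ_HANDLED;
}

static int mydev_setup_queue_irq(struct mydev *md, int qid)
{
	struct message_irq *mirq;

	mirq = request_message_irq(&md->pdev->dev, mydev_queue_evt,
				   0, "mydev-queue", md);
	if (IS_ERR(mirq))
		return PTR_ERR(mirq);

	/*
	 * Device-specific step: tell the hardware where to post the MemWr
	 * that fires this interrupt (mydev_set_queue_msg() is an assumed
	 * helper; the storage layout is entirely up to the device).
	 */
	mydev_set_queue_msg(md, qid, mirq->addr, mirq->data);
	return 0;
}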
On 4/23/2020 12:44 PM, Jason Gunthorpe wrote: >>>> The mdev utilizes Interrupt Message Store or IMS[3] instead of MSIX for >>>> interrupts for the guest. This preserves MSIX for host usages and also allows a >>>> significantly larger number of interrupt vectors for guest usage. >>> >>> I never did get a reply to my earlier remarks on the IMS patches. >>> >>> The concept of a device specific addr/data table format for MSI is not >>> Intel specific. This should be general code. We have a device that can >>> use this kind of kernel capability today. >> >> I am sorry if I did not address your comments earlier. > > It appears noboy from Intel bothered to answer anyone else on that RFC > thread: > > https://lore.kernel.org/lkml/1568338328-22458-1-git-send-email-megha.dey@linux.intel.com/ > > However, it seems kind of moot as I see now that this verion of IMS > bears almost no resemblance to the original RFC. hmm yeah, we changed most of the code after getting a lot of feedback from you and folks at plumbers. But yes, I should have replied to all the feedback, lesson learnt :) > > That said, the similiarity to platform-msi was striking, does this new > version harmonize with that? yes! > >> The present IMS code is quite generic, most of the code is in the drivers/ >> folder. We basically introduce 2 APIS: allocate and free IMS interrupts and >> a IMS IRQ domain to allocate these interrupts from. These APIs are >> architecture agnostic. >> >> We also introduce a new IMS IRQ domain which is architecture specific. This >> is because IMS generates interrupts only in the remappable format, hence >> interrupt remapping should be enabled for IMS. Currently, the interrupt >> remapping code is only available for Intel and AMD and I don’t see anything >> for ARM. > > I don't understand these remarks though - IMS is simply the mapping of > a MemWr addr/data pair to a Linux IRQ number? Why does this intersect > with remapping? > From your comments so far, I think your requirement is a subset of what IMS is trying to do. What you want: have a dynamic means of allocating platform-msi interrupts On top of this IMS has a requirement that all of the interrupts should be remapped. So we can have tiered code: generic dynamic platform-msi infrastructure and add the IMS specific bits (Intel specific) on top of this. The generic code will have no reference to IMS. > AFAIK, any platform that supports MSI today should have the inherent > HW capability to support IMS. > >> Also, could you give more details on the device that could use IMS? Do you >> have some driver code already? We could then see if and how the current IMS >> code could be made more generic. > > We have several devices of interest, our NICs have very flexible PCI, > so it is no problem to take the MemWR addr/data from someplace other > than the MSI tables. > > For this we want to have some way to allocate Linux IRQs dynamically > and get a addr/data pair to trigger them. > > Our NIC devices are also linked to our ARM SOC family, so I'd expect > our ARM's to also be able to provide these APIs as the platform. cool, so I will hope that you can test out the generic APIs from the ARM side! > > Jason >
On Fri, May 01, 2020 at 03:31:14PM -0700, Dey, Megha wrote: > Hi Jason, > > On 4/23/2020 12:49 PM, Jason Gunthorpe wrote: > > On Thu, Apr 23, 2020 at 12:17:50PM -0700, Dan Williams wrote: > > > > > Per Megha's follow-up can you send the details about that other device > > > and help clear a path for a device-specific MSI addr/data table > > > format. Ever since HMM I've been sensitive, perhaps overly-sensitive, > > > to claims about future upstream users. The fact that you have an > > > additional use case is golden for pushing this into a common area and > > > validating the scope of the proposed API. > > > > I think I said it at plumbers, but yes, we are interested in this, and > > would like dynamic MSI-like interrupts available to the driver (what > > Intel calls IMS) > > > > So basically you are looking for a way to dynamically allocate the > platform-msi interrupts, correct? The basic high level interface here seems fine, which is bascially a way for a driver to grab a bunch of platform-msi interrupts for its own use > Since I don't have access to any of the platform-msi devices, it is hard for > me to test this code for other drivers expect idxd for now. > Once I submit the next round of patches, after addressing all the comments, > would it be possible for you to test this code for any of your devices? Possibly, need to find time Jason
On Fri, May 01, 2020 at 03:31:51PM -0700, Dey, Megha wrote: > > > This has been my concern reviewing the implementation. IMS needs more > > > than one in-tree user to validate degrees of freedom in the api. I had > > > been missing a second "in-tree user" to validate the scope of the > > > flexibility that was needed. > > > > IMS is too narrowly specified. > > > > All platforms that support MSI today can support IMS. It is simply a > > way for the platform to give the driver an addr/data pair that triggers > > an interrupt when a posted write is performed to that pair. > > > > Well, yes and no. IMS requires interrupt remapping in addition to the > dynamic nature of IRQ allocation. You've mentioned remapping a few times, but I really can't understand why it has anything to do with platform_msi or IMS.. Jason
Hi Jason, On 5/3/2020 3:22 PM, Jason Gunthorpe wrote: > On Fri, May 01, 2020 at 03:31:51PM -0700, Dey, Megha wrote: >>>> This has been my concern reviewing the implementation. IMS needs more >>>> than one in-tree user to validate degrees of freedom in the api. I had >>>> been missing a second "in-tree user" to validate the scope of the >>>> flexibility that was needed. >>> >>> IMS is too narrowly specified. >>> >>> All platforms that support MSI today can support IMS. It is simply a >>> way for the platform to give the driver an addr/data pair that triggers >>> an interrupt when a posted write is performed to that pair. >>> >> >> Well, yes and no. IMS requires interrupt remapping in addition to the >> dynamic nature of IRQ allocation. > > You've mentioned remapping a few times, but I really can't understand > why it has anything to do with platform_msi or IMS.. So after some internal discussions, we have concluded that IMS has no linkage with interrupt remapping; IR is just a platform concept. IMS is just a name Intel came up with; all it really means is device-managed addr/data writes to generate interrupts. Technically we can call something IMS even if the device has its own location to store interrupts in a non-PCI-standard mechanism, much like platform-msi indeed. We simply need to extend platform-msi to address some of its shortcomings: increase the number of interrupts to > 2048, enable dynamic allocation of interrupts, add mask/unmask callbacks in addition to write_msg, etc. FWIW, even MSI can be IMS with rules on how to manage the addr/data writes following the PCI SIG spec.. it's just that. I will be sending out an email shortly outlining the new design for IMS (a.k.a. platform-msi part 2) and what improvements we want to add to the already existing platform-msi infrastructure. Thank you so much for your comments, it helped us iron out some of these details :) > > Jason >
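[Editor's note] For reference, the existing platform-msi interface being extended here (as of roughly v5.7) already gives a driver a batch of vectors plus a write_msg callback; the gaps Megha lists (dynamic allocation, more than 2048 vectors, mask/unmask callbacks) would be layered on top of it. A rough sketch of today's usage, with the idxd_ims_* names, the idxd_ims_entry() helper, and the register offsets invented purely for illustration:

#include <linux/device.h>
#include <linux/msi.h>
#include <linux/io.h>

/* invented offsets within one device-managed interrupt storage entry */
#define IMS_ADDR_LO	0x0
#define IMS_ADDR_HI	0x4
#define IMS_DATA	0x8

static void idxd_ims_write_msg(struct msi_desc *desc, struct msi_msg *msg)
{
	/* assumed helper returning this vector's slot in device memory */
	void __iomem *entry = idxd_ims_entry(msi_desc_to_dev(desc), desc);

	writel(msg->address_lo, entry + IMS_ADDR_LO);
	writel(msg->address_hi, entry + IMS_ADDR_HI);
	writel(msg->data, entry + IMS_DATA);
}

static int idxd_ims_setup(struct device *dev, unsigned int nvec)
{
	struct msi_desc *desc;
	int ret;

	/* one static allocation today; dynamic allocation is one of the gaps */
	ret = platform_msi_domain_alloc_irqs(dev, nvec, idxd_ims_write_msg);
	if (ret)
		return ret;

	for_each_msi_entry(desc, dev) {
		/* desc->irq is a normal Linux IRQ; request_irq() as usual */
	}
	return 0;
}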
Hi Jason, On 5/3/2020 3:21 PM, Jason Gunthorpe wrote: > On Fri, May 01, 2020 at 03:31:14PM -0700, Dey, Megha wrote: >> Hi Jason, >> >> On 4/23/2020 12:49 PM, Jason Gunthorpe wrote: >>> On Thu, Apr 23, 2020 at 12:17:50PM -0700, Dan Williams wrote: >>> >>>> Per Megha's follow-up can you send the details about that other device >>>> and help clear a path for a device-specific MSI addr/data table >>>> format. Ever since HMM I've been sensitive, perhaps overly-sensitive, >>>> to claims about future upstream users. The fact that you have an >>>> additional use case is golden for pushing this into a common area and >>>> validating the scope of the proposed API. >>> >>> I think I said it at plumbers, but yes, we are interested in this, and >>> would like dynamic MSI-like interrupts available to the driver (what >>> Intel calls IMS) >>> >> >> So basically you are looking for a way to dynamically allocate the >> platform-msi interrupts, correct? > > The basic high level interface here seems fine, which is bascially a > way for a driver to grab a bunch of platform-msi interrupts for its > own use ok! > >> Since I don't have access to any of the platform-msi devices, it is hard for >> me to test this code for other drivers expect idxd for now. >> Once I submit the next round of patches, after addressing all the comments, >> would it be possible for you to test this code for any of your devices? > > Possibly, need to find time > Sure, thanks! > Jason >
On Sun, May 03, 2020 at 03:31:39PM -0700, Dey, Megha wrote: > > Hi Jason, > > On 5/3/2020 3:22 PM, Jason Gunthorpe wrote: > > On Fri, May 01, 2020 at 03:31:51PM -0700, Dey, Megha wrote: > > > > > This has been my concern reviewing the implementation. IMS needs more > > > > > than one in-tree user to validate degrees of freedom in the api. I had > > > > > been missing a second "in-tree user" to validate the scope of the > > > > > flexibility that was needed. > > > > > > > > IMS is too narrowly specified. > > > > > > > > All platforms that support MSI today can support IMS. It is simply a > > > > way for the platform to give the driver an addr/data pair that triggers > > > > an interrupt when a posted write is performed to that pair. > > > > > > > > > > Well, yes and no. IMS requires interrupt remapping in addition to the > > > dynamic nature of IRQ allocation. > > > > You've mentioned remapping a few times, but I really can't understand > > why it has anything to do with platform_msi or IMS.. > > So after some internal discussions, we have concluded that IMS has no > linkage with Interrupt remapping, IR is just a platform concept. IMS is just > a name Intel came up with, all it really means is device managed addr/data > writes to generate interrupts. Technically we can call something IMS even if > device has its own location to store interrupts in non-pci standard > mechanism, much like platform-msi indeed. We simply need to extend > platform-msi to its address some of its shortcomings: increase number of > interrupts to > 2048, enable dynamic allocation of interrupts, add > mask/unmask callbacks in addition to write_msg etc. Sounds right to me Persumably you still need a way for the driver, eg vfio, to ensure a MSI is remappable, but shouldn't that be exactly the same way as done in normal PCI MSI today? > FWIW, even MSI can be IMS with rules on how to manage the addr/data writes > following pci sig .. its just that. Yep, IMHO, our whole handling of MSI is very un-general sometimes.. I thought the msi_domain stuff that some platforms are using is a way to improve on that? You might find that updating x86 to use msi_domain might be helpful in this project??? Jason
On 5/3/2020 3:36 PM, Jason Gunthorpe wrote: > On Sun, May 03, 2020 at 03:31:39PM -0700, Dey, Megha wrote: >> >> Hi Jason, >> >> On 5/3/2020 3:22 PM, Jason Gunthorpe wrote: >>> On Fri, May 01, 2020 at 03:31:51PM -0700, Dey, Megha wrote: >>>>>> This has been my concern reviewing the implementation. IMS needs more >>>>>> than one in-tree user to validate degrees of freedom in the api. I had >>>>>> been missing a second "in-tree user" to validate the scope of the >>>>>> flexibility that was needed. >>>>> >>>>> IMS is too narrowly specified. >>>>> >>>>> All platforms that support MSI today can support IMS. It is simply a >>>>> way for the platform to give the driver an addr/data pair that triggers >>>>> an interrupt when a posted write is performed to that pair. >>>>> >>>> >>>> Well, yes and no. IMS requires interrupt remapping in addition to the >>>> dynamic nature of IRQ allocation. >>> >>> You've mentioned remapping a few times, but I really can't understand >>> why it has anything to do with platform_msi or IMS.. >> >> So after some internal discussions, we have concluded that IMS has no >> linkage with Interrupt remapping, IR is just a platform concept. IMS is just >> a name Intel came up with, all it really means is device managed addr/data >> writes to generate interrupts. Technically we can call something IMS even if >> device has its own location to store interrupts in non-pci standard >> mechanism, much like platform-msi indeed. We simply need to extend >> platform-msi to its address some of its shortcomings: increase number of >> interrupts to > 2048, enable dynamic allocation of interrupts, add >> mask/unmask callbacks in addition to write_msg etc. > > Sounds right to me > > Persumably you still need a way for the driver, eg vfio, to ensure a > MSI is remappable, but shouldn't that be exactly the same way as done > in normal PCI MSI today? yes exactly, it should be done in the same way as PCI-MSI, if IR is enabled we will have IR_PCI_MSI for platform msi as well. > >> FWIW, even MSI can be IMS with rules on how to manage the addr/data writes >> following pci sig .. its just that. > > Yep, IMHO, our whole handling of MSI is very un-general sometimes.. > > I thought the msi_domain stuff that some platforms are using is a way > to improve on that? You might find that updating x86 to use msi_domain > might be helpful in this project??? yes, we need to take a closer look at this. > > Jason >
Hi Jason In general your idea of moving pure emulation code to user space is a good strategy. On Wed, Apr 29, 2020 at 02:42:20AM -0700, Tian, Kevin wrote: > > From: Jason Gunthorpe <jgg@mellanox.com> > > Sent: Monday, April 27, 2020 9:22 PM > > > > On Mon, Apr 27, 2020 at 07:19:39AM -0600, Alex Williamson wrote: > > > > > > It is not trivial masking. It is a 2000 line patch doing comprehensive > > > > emulation. > > > > > > Not sure what you're referring to, I see about 30 lines of code in > > > vdcm_vidxd_cfg_write() that specifically handle writes to the 4 BARs in > > > config space and maybe a couple hundred lines of code in total handling > > > config space emulation. Thanks, > > > > Look around vidxd_do_command() > > > > If I understand this flow properly.. > > > > Hi, Jason, > > I guess the 2000 lines mostly refer to the changes in mdev.c and vdev.c. > We did a break-down among them: > > 1) ~150 LOC for vdev initialization > 2) ~150 LOC for cfg space emulation > 3) ~230 LOC for mmio r/w emulation > 4) ~500 LOC for controlling the work queue (vidxd_do_command), > triggered by write emulation of IDXD_CMD_OFFSET register > 5) the remaining lines are all about vfio-mdev registration/callbacks, > for reporting mmio/irq resource, eventfd, mmap, etc. > > 1/2/3) are pure device emulation, which counts for ~500 LOC. > > 4) needs be in the kernel regardless of which uAPI is used, because it > talks to the physical work queue (enable, disable, drain, abort, reset, etc.) > > Then if just talking about ~500 LOC emulation code left in the kernel, > is it still a big concern to you?
On Fri, May 08, 2020 at 01:47:10PM -0700, Raj, Ashok wrote: > Even when uaccel was under development, one of the options > was to use VFIO as the transport, goal was the same i.e to keep > the user space have one interface. I feel a bit out of the loop here, uaccel isn't in today's kernel is it? I've heard about it for a while, it sounds very similar to RDMA, so I hope they took some of my advice... > But the needs of generic user space application is significantly > different from exporting a more functional device model to guest, > which isn't full emulated device. which is why VFIO didn't make > sense for native use. I'm not sure this is true. We've done these kinds of emulated SIOV like things already and there is a huge overlap between what a generic user application needs and what the VMM neds. Actually almost a perfect subset except for interrupt remapping (which is quite trivial). The things vfio focuses on, like groups and managing a real config space just don't apply here. > And when we move things from VFIO which is already established > as a general device model and accepted by multiple VMM's it gives > instant footing without a whole redesign. Yes, I understand, but I think you need to get more people to support this idea. From my standpoint this is taking secure lean VMMs and putting emulation code back into them, except in a more dangerous kernel location. This does not seem like a net win to me. You'd be much better to have some userspace library scheme instead of being completely tied to a kernel interface for modularity. > When we move things from VFIO to uaccel to bolt on the functionality > like VFIO, I suspect we would be moving code/functionality from VFIO > to Uaccel. I don't know what the net gain would be. Most of VFIO functionality is already decomposed inside the kernel, and you need most of it to do secure user access anyhow. > For mdev, would you agree we can keep the current architecture, > and investigate moving some emulation code to user space (say even for > standard vfio_pci) and then expand scope later. I won't hard NAK this, but I think you need more people to support this general idea of more emulation code in the kernel to go ahead - particularly since this is one of many future drivers along this design. It would be good to hear from the VMM teams that this is what they want (and why), for instance. Jason
On 5/8/2020 4:16 PM, Jason Gunthorpe wrote: > On Fri, May 08, 2020 at 01:47:10PM -0700, Raj, Ashok wrote: > >> Even when uaccel was under development, one of the options >> was to use VFIO as the transport, goal was the same i.e to keep >> the user space have one interface. > > I feel a bit out of the loop here, uaccel isn't in today's kernel is > it? I've heard about it for a while, it sounds very similar to RDMA, > so I hope they took some of my advice... It went into 5.7 kernel. drivers/misc/uacce. It looks char device exported with SVM support. > >> But the needs of generic user space application is significantly >> different from exporting a more functional device model to guest, >> which isn't full emulated device. which is why VFIO didn't make >> sense for native use. > > I'm not sure this is true. We've done these kinds of emulated SIOV > like things already and there is a huge overlap between what a generic > user application needs and what the VMM neds. Actually almost a > perfect subset except for interrupt remapping (which is quite > trivial). > > The things vfio focuses on, like groups and managing a real config > space just don't apply here. > >> And when we move things from VFIO which is already established >> as a general device model and accepted by multiple VMM's it gives >> instant footing without a whole redesign. > > Yes, I understand, but I think you need to get more people to support > this idea. From my standpoint this is taking secure lean VMMs and > putting emulation code back into them, except in a more dangerous > kernel location. This does not seem like a net win to me. > > You'd be much better to have some userspace library scheme instead of > being completely tied to a kernel interface for modularity. > >> When we move things from VFIO to uaccel to bolt on the functionality >> like VFIO, I suspect we would be moving code/functionality from VFIO >> to Uaccel. I don't know what the net gain would be. > > Most of VFIO functionality is already decomposed inside the kernel, > and you need most of it to do secure user access anyhow. > >> For mdev, would you agree we can keep the current architecture, >> and investigate moving some emulation code to user space (say even for >> standard vfio_pci) and then expand scope later. > > I won't hard NAK this, but I think you need more people to support > this general idea of more emulation code in the kernel to go ahead - > particularly since this is one of many future drivers along this > design. > > It would be good to hear from the VMM teams that this is what they > want (and why), for instance. > > Jason >
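[Editor's note] For readers who have not looked at it, the uacce interface Dave mentions is a thin registration layer: an accelerator driver allocates a uacce device, flags it as SVA-capable, and supplies queue callbacks, and uacce exposes the char device and performs the SVA bind on open. A rough sketch against the v5.7 interface as I read it; the idxd names are placeholders and the queue callbacks are omitted:

#include <linux/uacce.h>
#include <linux/string.h>
#include <linux/err.h>

static const struct uacce_ops idxd_uacce_ops = {
	/* get_queue/put_queue/start_queue/stop_queue/mmap callbacks
	 * omitted for brevity */
};

static int idxd_uacce_setup(struct device *parent)
{
	struct uacce_interface uif = {
		.flags = UACCE_DEV_SVA,	/* request PASID/SVA binding */
		.ops = &idxd_uacce_ops,
	};
	struct uacce_device *uacce;

	strscpy(uif.name, "idxd", sizeof(uif.name));

	uacce = uacce_alloc(parent, &uif);
	if (IS_ERR(uacce))
		return PTR_ERR(uacce);

	return uacce_register(uacce);
}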
Hi Jason On Fri, May 08, 2020 at 08:16:10PM -0300, Jason Gunthorpe wrote: > On Fri, May 08, 2020 at 01:47:10PM -0700, Raj, Ashok wrote: > > > Even when uaccel was under development, one of the options > > was to use VFIO as the transport, goal was the same i.e to keep > > the user space have one interface. > > I feel a bit out of the loop here, uaccel isn't in today's kernel is > it? I've heard about it for a while, it sounds very similar to RDMA, > so I hope they took some of my advice... I think since 5.7 maybe? drivers/misc/uacce. I don't think this is like RDMA, its just a plain accelerator. There is no connection management, memory registration or other things.. IB was my first job at Intel, but saying that i would be giving my age away :) > > > But the needs of generic user space application is significantly > > different from exporting a more functional device model to guest, > > which isn't full emulated device. which is why VFIO didn't make > > sense for native use. > > I'm not sure this is true. We've done these kinds of emulated SIOV > like things already and there is a huge overlap between what a generic > user application needs and what the VMM neds. Actually almost a > perfect subset except for interrupt remapping (which is quite > trivial). From a simple user application POV, if we need to do simple compression or such with a shared WQ, all the application needs to do is bind_mm(), which somehow associates the process address space with the IOMMU to create that association and communication channel. For supporting this with a guest user, we need to support the same actions from a guest OS, i.e. a guest OS bind should be serviced and end up with the IOMMU plumbing it with the guest cr3, and making sure the guest 2nd level is plumbed right for the nested walk. Now we can certainly go bolt all these things on again, when VFIO has already done the plumbing in a generic way. > > The things vfio focuses on, like groups and managing a real config > space just don't apply here. > > > And when we move things from VFIO which is already established > > as a general device model and accepted by multiple VMM's it gives > > instant footing without a whole redesign. > > Yes, I understand, but I think you need to get more people to support > this idea. From my standpoint this is taking secure lean VMMs and When we decided on VFIO, it was after using the best practices then, after discussion with Kirti Wankhede and Alex. Kevin had used it for graphics virtualization. It was even presented at KVM forum and such dating back to 2017. No one has raised alarms until now :-) > putting emulation code back into them, except in a more dangerous > kernel location. This does not seem like a net win to me. Its not a whole lot of emulation right? mdev are soft partitioned. There is just a single PF, but we can create a separate partition for the guest using PASID along with the normal BDF (RID). And exposing a consistent PCI like interface to user space you get everything else for free. Yes, its not SRIOV, but giving that interface to user space via VFIO, we get all of that functionality without having to reinvent a different way to do it. vDPA went the other way, IRC, they went and put a HW implementation of what virtio is in hardware. So they sort of fit the model. Here the instance looks and feels like real hardware for the setup and control aspect. > > You'd be much better to have some userspace library scheme instead of > being completely tied to a kernel interface for modularity. Couldn't agree more :-)..
all I'm asking is if we can do a phased approach to get to that goodness! If we need to move things to user space for emulation that's a great goal, but it can be evolutionary. > > > When we move things from VFIO to uaccel to bolt on the functionality > > like VFIO, I suspect we would be moving code/functionality from VFIO > > to Uaccel. I don't know what the net gain would be. > > Most of VFIO functionality is already decomposed inside the kernel, > and you need most of it to do secure user access anyhow. > > > For mdev, would you agree we can keep the current architecture, > > and investigate moving some emulation code to user space (say even for > > standard vfio_pci) and then expand scope later. > > I won't hard NAK this, but I think you need more people to support > this general idea of more emulation code in the kernel to go ahead - > particularly since this is one of many future drivers along this > design. > > It would be good to hear from the VMM teams that this is what they > want (and why), for instance. IRC Paolo was present I think and we can find other VMM folks to chime in if that helps.
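[Editor's note] Since the bind_mm() step Ashok describes comes up repeatedly, here is what the host-native version of it looks like against the iommu_sva_* API that was already in the tree around this time (v5.6/v5.7-era signatures, including the drvdata argument); the idxd_wq type and the idxd_wq_set_pasid() helper are invented for illustration. The guest case is the nested version of the same operation, where the bind carries the guest page-table pointer rather than a host mm.

#include <linux/iommu.h>
#include <linux/sched.h>
#include <linux/err.h>

static int idxd_wq_bind_current_mm(struct device *dev, struct idxd_wq *wq)
{
	struct iommu_sva *sva;
	int pasid;

	/* associate the calling process's address space with the device */
	sva = iommu_sva_bind_device(dev, current->mm, NULL);
	if (IS_ERR(sva))
		return PTR_ERR(sva);

	pasid = iommu_sva_get_pasid(sva);

	/*
	 * Program the PASID into the WQ (assumed helper) so that vaddr-based
	 * descriptors submitted to it are translated through this mm.
	 */
	idxd_wq_set_pasid(wq, pasid);
	return 0;
}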
On Fri, May 08, 2020 at 05:09:09PM -0700, Raj, Ashok wrote: > Hi Jason > > On Fri, May 08, 2020 at 08:16:10PM -0300, Jason Gunthorpe wrote: > > On Fri, May 08, 2020 at 01:47:10PM -0700, Raj, Ashok wrote: > > > > > Even when uaccel was under development, one of the options > > > was to use VFIO as the transport, goal was the same i.e to keep > > > the user space have one interface. > > > > I feel a bit out of the loop here, uaccel isn't in today's kernel is > > it? I've heard about it for a while, it sounds very similar to RDMA, > > so I hope they took some of my advice... > > I think since 5.7 maybe? drivers/misc/uacce. I don't think this is like > RDMA, its just a plain accelerator. There is no connection management, > memory registration or other things.. IB was my first job at Intel, > but saying that i would be giving my age away :) rdma was the first thing to do kernel bypass, all this stuff is like rdma at some level.. I see this looks like the 'warp driver' stuff redone Wow, lots wrong here. Oh well. > > putting emulation code back into them, except in a more dangerous > > kernel location. This does not seem like a net win to me. > > Its not a whole lot of emulation right? mdev are soft partitioned. There is > just a single PF, but we can create a separate partition for the guest using > PASID along with the normal BDF (RID). And exposing a consistent PCI like > interface to user space you get everything else for free. > > Yes, its not SRIOV, but giving that interface to user space via VFIO, we get > all of that functionality without having to reinvent a different way to do it. > > vDPA went the other way, IRC, they went and put a HW implementation of what > virtio is in hardware. So they sort of fit the model. Here the instance > looks and feels like real hardware for the setup and control aspect. VDPA and this are very similar, of course it depends on the exact HW implementation. Jason
On 2020/5/9 下午8:21, Jason Gunthorpe wrote: > On Fri, May 08, 2020 at 05:09:09PM -0700, Raj, Ashok wrote: >> Hi Jason >> >> On Fri, May 08, 2020 at 08:16:10PM -0300, Jason Gunthorpe wrote: >>> On Fri, May 08, 2020 at 01:47:10PM -0700, Raj, Ashok wrote: >>> >>>> Even when uaccel was under development, one of the options >>>> was to use VFIO as the transport, goal was the same i.e to keep >>>> the user space have one interface. >>> I feel a bit out of the loop here, uaccel isn't in today's kernel is >>> it? I've heard about it for a while, it sounds very similar to RDMA, >>> so I hope they took some of my advice... >> I think since 5.7 maybe? drivers/misc/uacce. I don't think this is like >> RDMA, its just a plain accelerator. There is no connection management, >> memory registration or other things.. IB was my first job at Intel, >> but saying that i would be giving my age away:) > rdma was the first thing to do kernel bypass, all this stuff is like > rdma at some level.. I see this looks like the 'warp driver' stuff > redone > > Wow, lots wrong here. Oh well. > >>> putting emulation code back into them, except in a more dangerous >>> kernel location. This does not seem like a net win to me. >> Its not a whole lot of emulation right? mdev are soft partitioned. There is >> just a single PF, but we can create a separate partition for the guest using >> PASID along with the normal BDF (RID). And exposing a consistent PCI like >> interface to user space you get everything else for free. >> >> Yes, its not SRIOV, but giving that interface to user space via VFIO, we get >> all of that functionality without having to reinvent a different way to do it. >> >> vDPA went the other way, IRC, they went and put a HW implementation of what >> virtio is in hardware. So they sort of fit the model. Here the instance >> looks and feels like real hardware for the setup and control aspect. > VDPA and this are very similar, of course it depends on the exact HW > implementation. > > Jason Actually this is not a must. Technically we can do ring/descriptor translation in the vDPA driver as what zerocopy AF_XDP did. Thanks >
> From: Jason Gunthorpe
> Sent: Saturday, May 9, 2020 8:21 PM
>
> > > putting emulation code back into them, except in a more dangerous
> > > kernel location. This does not seem like a net win to me.
> >
> > Its not a whole lot of emulation right? mdev are soft partitioned. There is
> > just a single PF, but we can create a separate partition for the guest using
> > PASID along with the normal BDF (RID). And exposing a consistent PCI like
> > interface to user space you get everything else for free.
> >
> > Yes, its not SRIOV, but giving that interface to user space via VFIO, we get
> > all of that functionality without having to reinvent a different way to do it.
> >
> > vDPA went the other way, IRC, they went and put a HW implementation of what
> > virtio is in hardware. So they sort of fit the model. Here the instance
> > looks and feels like real hardware for the setup and control aspect.
>
> VDPA and this are very similar, of course it depends on the exact HW
> implementation.

Hi, Jason,

I have more thoughts below; let's see whether they make sense to you.

When talking about virtualization, the target is an unmodified guest
kernel driver which expects to see the raw controllability of queues as
defined by the device spec. In idxd, such controllability includes
enable/disable SVA, dedicated or shared WQ, size, threshold, privilege,
fault mode, max batch size, and many other attributes. Each guest OS has
its own policy for using all or part of the available controllability.

When talking about applications, we care about providing an efficient
programming interface to userspace. For example with uacce, we allow an
application to submit vaddr-based workloads to a reserved WQ with the
kernel bypassed. But it's not necessary to export the raw controllability
of the reserved WQ to userspace, and we still rely on the kernel driver
to configure it, including bind_mm. I'm not sure whether uacce would want
to evolve into a generic queue management system covering non-SVA and all
vendor-specific raw capabilities as expected by all kinds of guest kernel
drivers. That does not sound worthwhile at this point, given that we
already have a highly efficient SVA interface for user applications.

That is why we start with mdev as an evolutionary approach. Mdev is
introduced to expose the raw controllability of a subdevice (WQ or ADI)
to the guest. It builds a channel between the guest kernel driver and the
host kernel driver and uses the device spec as the uAPI by sticking to
the MMIO interface, and all virtualization-related setup is consolidated
in vfio. The drawback, as you pointed out, is putting some degree of
emulation code in the kernel. But as explained earlier, it is only a
small portion of code. Moreover, most registers are emulated as simple
memory reads/writes, while the remaining logic mostly concerns raw
controllability (e.g. the cmd register) that the host driver grants to
the guest and thus must propagate to the device. For the latter part I
would call it 'mediation' rather than 'emulation', and it would be
required with whatever uAPI is used.

If in the future there really is a requirement to delegate raw WQ
controllability to pure userspace applications for DMA engines, and
there is a well-defined uAPI covering a large common set of
controllability across multiple vendors, we will look at that option
for sure.

From the above p.o.v. I feel vDPA is a different story. virtio/vhost has
a well-established eco-system between guest and host.
The user-space VMM already emulates all available controllability as
defined in the virtio spec, and the host kernel already supports the
vhost uAPI for vring setup, IOTLB management, etc. Extending that path
for data-path offloading sounds like a reasonable choice for vDPA...

Thanks
Kevin
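To illustrate the emulation-versus-mediation split described above, here is a
minimal sketch of what an mdev MMIO write handler could look like: most
registers are simply backed by memory, while the command register is the
mediated path that the host driver validates and propagates to the real
device. The offsets, structures, and helpers are placeholders for
illustration, not the actual idxd register layout or driver code.

/*
 * Minimal sketch (placeholder offsets/helpers, not the real idxd layout)
 * of the emulation/mediation split for an mdev's MMIO BAR: most registers
 * are memory-backed, only the command register reaches the device.
 */
#include <linux/errno.h>
#include <linux/string.h>
#include <linux/types.h>

#define EX_REG_WQCFG	0x100	/* emulated: just stored                */
#define EX_REG_CMD	0x200	/* mediated: validated, sent to real HW */
#define EX_BAR0_SIZE	0x1000

struct ex_mdev_state {
	u8 bar0[EX_BAR0_SIZE];	/* shadow of the guest-visible MMIO space */
	void *parent;		/* host-driver handle for the real device */
};

/* Hypothetical host-driver hook that issues a screened command to HW. */
int ex_host_submit_cmd(void *parent, u32 cmd);

static int ex_mmio_write(struct ex_mdev_state *s, u64 off, u32 val)
{
	if (off + sizeof(val) > EX_BAR0_SIZE)
		return -EINVAL;

	switch (off) {
	case EX_REG_CMD:
		/*
		 * Mediation: only commands scoped to this mdev's own WQ
		 * partition are allowed through to the hardware.
		 */
		return ex_host_submit_cmd(s->parent, val);
	default:
		/* Emulation: everything else is a plain memory write. */
		memcpy(&s->bar0[off], &val, sizeof(val));
		return 0;
	}
}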
On Wed, May 13, 2020 at 08:30:15AM +0000, Tian, Kevin wrote:
> When talking about virtualization, the target is an unmodified guest
> kernel driver which expects to see the raw controllability of queues as
> defined by the device spec. In idxd, such controllability includes
> enable/disable SVA, dedicated or shared WQ, size, threshold, privilege,
> fault mode, max batch size, and many other attributes. Each guest OS has
> its own policy for using all or part of the available controllability.
>
> When talking about applications, we care about providing an efficient
> programming interface to userspace. For example with uacce, we allow an
> application to submit vaddr-based workloads to a reserved WQ with the
> kernel bypassed. But it's not necessary to export the raw controllability
> of the reserved WQ to userspace, and we still rely on the kernel driver
> to configure it, including bind_mm. I'm not sure whether uacce would want
> to evolve into a generic queue management system covering non-SVA and all
> vendor-specific raw capabilities as expected by all kinds of guest kernel
> drivers. That does not sound worthwhile at this point, given that we
> already have a highly efficient SVA interface for user applications.

Like I already said, you should get the people who care about this stuff
to support emulation in the kernel.

I think it has not been explained well in the past. Most Intel info on
SIOV draws a close parallel to SRIOV, and I think people generally assume
that, like SRIOV, SIOV does not include kernel-side MMIO emulation.

> If in the future there really is a requirement to delegate raw WQ
> controllability to pure userspace applications for DMA engines, and
> there is a well-defined uAPI covering a large common set of
> controllability across multiple vendors, we will look at that option
> for sure.

All this kernel bypass stuff is 'HW specific' by nature, you should not
expect to have general interfaces.

Jason