
[RFC,00/15] Add VFIO mediated device support and IMS support for the idxd driver.

Message ID 158751095889.36773.6009825070990637468.stgit@djiang5-desk3.ch.intel.com (mailing list archive)

Message

Dave Jiang April 21, 2020, 11:33 p.m. UTC
The actual code is independent of the stage 2 driver code submission that adds
support for SVM, ENQCMD(S), PASID, and shared workqueues. This code series will
support a dedicated workqueue on a guest with no vIOMMU.
  
A new device type "mdev" is introduced for the idxd driver. This allows a wq
to be dedicated to the use of a VFIO mediated device (mdev). Once the work
queue (wq) is enabled, a user-generated UUID can be added to the wq through
its uuid sysfs attribute. After the association, an mdev can be created using
this UUID. The mdev driver code will associate the UUID and set up the mdev on
the driver side. When the create operation is successful, the UUID can be
passed to qemu. When the guest boots up, it should discover a DSA device
during PCI discovery.

For example:
1. Enable the wq with the "mdev" wq type
2. A user-generated UUID is associated with the wq:
echo $UUID > /sys/bus/dsa/devices/wq0.0/uuid
3. The UUID is written to the "create" node under the mdev class sysfs path:
echo $UUID > /sys/class/mdev_bus/0000\:00\:0a.0/mdev_supported_types/idxd-wq/create
4. Pass the following parameter to qemu:
"-device vfio-pci,sysfsdev=/sys/bus/pci/devices/0000:00:0a.0/$UUID"
 
Since the mdev is an emulated device with a single wq, the guest will see a DSA
device with a single wq. With no vIOMMU support, the behavior will be the same
as the stage 1 driver running with the IOMMU turned off on the bare metal host.
The difference is that the wq exported through mdev will have the read-only
config bit set, which means the device does not require the typical
configuration. After enabling the device, the user only needs to set the WQ
type and name; that is all that is necessary to enable the WQ and start using
it. The single-wq configuration is not the only way to create an mdev; multi-wq
support for mdev is planned as future work.
 
The mdev utilizes Interrupt Message Store, or IMS [3], instead of MSI-X for
interrupts for the guest. This preserves MSI-X for host usage and also allows a
significantly larger number of interrupt vectors for guest usage.

The idxd driver implements IMS as on-device memory mapped unified storage. Each
interrupt message is stored as a DWORD size data payload and a 64-bit address
(same as MSI-X). Access to the IMS is through the host idxd driver. All the IMS
interrupt messages are stored in the remappable format. Hence, if the driver
enables IMS, interrupt remapping is also enabled by default. 
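
For illustration, one plausible layout of a single on-device IMS entry,
mirroring the MSI-X vector table plus a control word (a sketch only; the
authoritative register layout is defined in the DSA specification [4]):

  #include <linux/types.h>

  /* sketch only: field names and the control word contents are assumptions */
  struct ims_entry_sketch {
          u32 addr_lo;        /* message address, low 32 bits */
          u32 addr_hi;        /* message address, high 32 bits */
          u32 data;           /* DWORD message data payload */
          u32 ctrl;           /* assumed: PASID tag and mask bit */
  };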
 
This patchset extends the existing platform-msi.c, which already provides a
generic mechanism to support non-PCI-compliant MSI interrupts for platform
devices, to provide the IMS infrastructure.
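
Roughly, the extension hangs device-specific callbacks off the existing
platform-msi machinery so that an IMS-capable driver can manage its own
message storage (a sketch only; the exact structure and field names are
defined by the platform_msi_ops patches listed below):

  #include <linux/msi.h>

  /* sketch of the kind of per-device ops the series introduces */
  struct platform_msi_ops_sketch {
          void (*irq_mask)(struct msi_desc *desc);    /* mask one IMS entry */
          void (*irq_unmask)(struct msi_desc *desc);  /* unmask one IMS entry */
          void (*write_msg)(struct msi_desc *desc,    /* write addr/data into */
                            struct msi_msg *msg);     /* the device's storage */
  };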

More details about IMS, its implementation in the kernel, common
misconceptions about IMS, and the basic driver changes required to support IMS
can be found in Documentation/ims-howto.rst, added by this series.

[1]: https://lore.kernel.org/lkml/157965011794.73301.15960052071729101309.stgit@djiang5-desk3.ch.intel.com/
[2]: https://software.intel.com/en-us/articles/intel-sdm
[3]: https://software.intel.com/en-us/download/intel-scalable-io-virtualization-technical-specification
[4]: https://software.intel.com/en-us/download/intel-data-streaming-accelerator-preliminary-architecture-specification
[5]: https://01.org/blogs/2019/introducing-intel-data-streaming-accelerator
[6]: https://intel.github.io/idxd/
[7]: https://github.com/intel/idxd-driver idxd-stage3

---

Dave Jiang (5):
      dmaengine: idxd: add config support for readonly devices
      dmaengine: idxd: add IMS support in base driver
      dmaengine: idxd: add device support functions in prep for mdev
      dmaengine: idxd: add support for VFIO mediated device
      dmaengine: idxd: add error notification from host driver to mediated device

Jing Lin (1):
      dmaengine: idxd: add ABI documentation for mediated device support

Lu Baolu (2):
      vfio/mdev: Add a member for iommu domain in mdev_device
      vfio/type1: Save domain when attach domain to mdev

Megha Dey (7):
      drivers/base: Introduce platform_msi_ops
      drivers/base: Introduce a new platform-msi list
      drivers/base: Allocate/free platform-msi interrupts by group
      drivers/base: Add support for a new IMS irq domain
      ims-msi: Add mask/unmask routines
      ims-msi: Enable IMS interrupts
      Documentation: Interrupt Message store


 Documentation/ABI/stable/sysfs-driver-dma-idxd |   18 
 Documentation/ims-howto.rst                    |  210 +++
 arch/x86/include/asm/hw_irq.h                  |    7 
 arch/x86/include/asm/irq_remapping.h           |    6 
 drivers/base/Kconfig                           |    9 
 drivers/base/Makefile                          |    1 
 drivers/base/core.c                            |    1 
 drivers/base/ims-msi.c                         |  162 ++
 drivers/base/platform-msi.c                    |  202 ++-
 drivers/dma/Kconfig                            |    4 
 drivers/dma/idxd/Makefile                      |    2 
 drivers/dma/idxd/cdev.c                        |    3 
 drivers/dma/idxd/device.c                      |  325 ++++-
 drivers/dma/idxd/dma.c                         |    9 
 drivers/dma/idxd/idxd.h                        |   55 +
 drivers/dma/idxd/init.c                        |   81 +
 drivers/dma/idxd/irq.c                         |    6 
 drivers/dma/idxd/mdev.c                        | 1727 ++++++++++++++++++++++++
 drivers/dma/idxd/mdev.h                        |  105 +
 drivers/dma/idxd/registers.h                   |   10 
 drivers/dma/idxd/submit.c                      |   31 
 drivers/dma/idxd/sysfs.c                       |  199 ++-
 drivers/dma/idxd/vdev.c                        |  603 ++++++++
 drivers/dma/idxd/vdev.h                        |   43 +
 drivers/dma/mv_xor_v2.c                        |    6 
 drivers/dma/qcom/hidma.c                       |    6 
 drivers/iommu/arm-smmu-v3.c                    |    6 
 drivers/iommu/intel-iommu.c                    |    2 
 drivers/iommu/intel_irq_remapping.c            |   31 
 drivers/irqchip/irq-mbigen.c                   |    8 
 drivers/irqchip/irq-mvebu-icu.c                |    6 
 drivers/mailbox/bcm-flexrm-mailbox.c           |    6 
 drivers/perf/arm_smmuv3_pmu.c                  |    6 
 drivers/vfio/mdev/mdev_core.c                  |   22 
 drivers/vfio/mdev/mdev_private.h               |    2 
 drivers/vfio/vfio_iommu_type1.c                |   52 +
 include/linux/device.h                         |    3 
 include/linux/intel-iommu.h                    |    3 
 include/linux/list.h                           |   36 +
 include/linux/mdev.h                           |   13 
 include/linux/msi.h                            |   93 +
 kernel/irq/msi.c                               |   43 -
 42 files changed, 4009 insertions(+), 154 deletions(-)
 create mode 100644 Documentation/ims-howto.rst
 create mode 100644 drivers/base/ims-msi.c
 create mode 100644 drivers/dma/idxd/mdev.c
 create mode 100644 drivers/dma/idxd/mdev.h
 create mode 100644 drivers/dma/idxd/vdev.c
 create mode 100644 drivers/dma/idxd/vdev.h

--

Comments

Jason Gunthorpe April 21, 2020, 11:54 p.m. UTC | #1
On Tue, Apr 21, 2020 at 04:33:46PM -0700, Dave Jiang wrote:
> The actual code is independent of the stage 2 driver code submission that adds
> support for SVM, ENQCMD(S), PASID, and shared workqueues. This code series will
> support dedicated workqueue on a guest with no vIOMMU.
>   
> A new device type "mdev" is introduced for the idxd driver. This allows the wq
> to be dedicated to the usage of a VFIO mediated device (mdev). Once the work
> queue (wq) is enabled, an uuid generated by the user can be added to the wq
> through the uuid sysfs attribute for the wq.  After the association, a mdev can
> be created using this UUID. The mdev driver code will associate the uuid and
> setup the mdev on the driver side. When the create operation is successful, the
> uuid can be passed to qemu. When the guest boots up, it should discover a DSA
> device when doing PCI discovery.

I'm feeling really skeptical that adding all this PCI config space and
MMIO BAR emulation to the kernel just to cram this into a VFIO
interface is a good idea, that kind of stuff is much safer in
userspace.

Particularly since vfio is not really needed once a driver is using
the PASID stuff. We already have general code for drivers to use to
attach a PASID to a mm_struct - and using vfio while disabling all the
DMA/iommu config really seems like an abuse.

A /dev/idxd char dev that mmaps a bar page and links it to a PASID
seems a lot simpler and saner kernel wise.

> The mdev utilizes Interrupt Message Store or IMS[3] instead of MSIX for
> interrupts for the guest. This preserves MSIX for host usages and also allows a
> significantly larger number of interrupt vectors for guest usage.

I never did get a reply to my earlier remarks on the IMS patches.

The concept of a device specific addr/data table format for MSI is not
Intel specific. This should be general code. We have a device that can
use this kind of kernel capability today.

Jason
Tian, Kevin April 22, 2020, 12:53 a.m. UTC | #2
> From: Jason Gunthorpe
> Sent: Wednesday, April 22, 2020 7:55 AM
> 
> On Tue, Apr 21, 2020 at 04:33:46PM -0700, Dave Jiang wrote:
> > The actual code is independent of the stage 2 driver code submission that
> adds
> > support for SVM, ENQCMD(S), PASID, and shared workqueues. This code
> series will
> > support dedicated workqueue on a guest with no vIOMMU.
> >
> > A new device type "mdev" is introduced for the idxd driver. This allows the
> wq
> > to be dedicated to the usage of a VFIO mediated device (mdev). Once the
> work
> > queue (wq) is enabled, an uuid generated by the user can be added to the
> wq
> > through the uuid sysfs attribute for the wq.  After the association, a mdev
> can
> > be created using this UUID. The mdev driver code will associate the uuid
> and
> > setup the mdev on the driver side. When the create operation is successful,
> the
> > uuid can be passed to qemu. When the guest boots up, it should discover a
> DSA
> > device when doing PCI discovery.
> 
> I'm feeling really skeptical that adding all this PCI config space and
> MMIO BAR emulation to the kernel just to cram this into a VFIO
> interface is a good idea, that kind of stuff is much safer in
> userspace.
> 
> Particularly since vfio is not really needed once a driver is using
> the PASID stuff. We already have general code for drivers to use to
> attach a PASID to a mm_struct - and using vfio while disabling all the
> DMA/iommu config really seems like an abuse.

Well, this series is for virtualizing the idxd device to VMs, instead of
supporting SVA for bare metal processes. idxd implements a hardware-assisted
mediated device technique called Intel Scalable I/O Virtualization, which
allows each Assignable Device Interface (ADI, e.g. a work queue) to be tagged
with a unique PASID to ensure fine-grained DMA isolation when those ADIs are
assigned to different VMs. For this purpose idxd utilizes the VFIO
mdev framework and IOMMU aux-domain extension. Bare metal SVA will
be enabled for idxd later by using the general SVA code that you mentioned.
Both paths will co-exist in the end so there is no such case of disabling
DMA/iommu config.

Thanks
Kevin

> 
> A /dev/idxd char dev that mmaps a bar page and links it to a PASID
> seems a lot simpler and saner kernel wise.
> 
> > The mdev utilizes Interrupt Message Store or IMS[3] instead of MSIX for
> > interrupts for the guest. This preserves MSIX for host usages and also
> allows a
> > significantly larger number of interrupt vectors for guest usage.
> 
> I never did get a reply to my earlier remarks on the IMS patches.
> 
> The concept of a device specific addr/data table format for MSI is not
> Intel specific. This should be general code. We have a device that can
> use this kind of kernel capability today.
> 
> Jason
Jason Gunthorpe April 22, 2020, 11:50 a.m. UTC | #3
On Wed, Apr 22, 2020 at 12:53:25AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe
> > Sent: Wednesday, April 22, 2020 7:55 AM
> > 
> > On Tue, Apr 21, 2020 at 04:33:46PM -0700, Dave Jiang wrote:
> > > The actual code is independent of the stage 2 driver code submission that
> > adds
> > > support for SVM, ENQCMD(S), PASID, and shared workqueues. This code
> > series will
> > > support dedicated workqueue on a guest with no vIOMMU.
> > >
> > > A new device type "mdev" is introduced for the idxd driver. This allows the
> > wq
> > > to be dedicated to the usage of a VFIO mediated device (mdev). Once the
> > work
> > > queue (wq) is enabled, an uuid generated by the user can be added to the
> > wq
> > > through the uuid sysfs attribute for the wq.  After the association, a mdev
> > can
> > > be created using this UUID. The mdev driver code will associate the uuid
> > and
> > > setup the mdev on the driver side. When the create operation is successful,
> > the
> > > uuid can be passed to qemu. When the guest boots up, it should discover a
> > DSA
> > > device when doing PCI discovery.
> > 
> > I'm feeling really skeptical that adding all this PCI config space and
> > MMIO BAR emulation to the kernel just to cram this into a VFIO
> > interface is a good idea, that kind of stuff is much safer in
> > userspace.
> > 
> > Particularly since vfio is not really needed once a driver is using
> > the PASID stuff. We already have general code for drivers to use to
> > attach a PASID to a mm_struct - and using vfio while disabling all the
> > DMA/iommu config really seems like an abuse.
> 
> Well, this series is for virtualizing idxd device to VMs, instead of
> supporting SVA for bare metal processes. idxd implements a
> hardware-assisted mediated device technique called Intel Scalable
> I/O Virtualization,

I'm familiar with the intel naming scheme.

> which allows each Assignable Device Interface (ADI, e.g. a work
> queue) tagged with an unique PASID to ensure fine-grained DMA
> isolation when those ADIs are assigned to different VMs. For this
> purpose idxd utilizes the VFIO mdev framework and IOMMU aux-domain
> extension. Bare metal SVA will be enabled for idxd later by using
> the general SVA code that you mentioned.  Both paths will co-exist
> in the end so there is no such case of disabling DMA/iommu config.
 
Again, if you will have a normal SVA interface, there is no need for a
VFIO version, just use normal SVA for both.

PCI emulation should try to be in userspace, not the kernel, for
security.

Jason
Ashok Raj April 22, 2020, 9:14 p.m. UTC | #4
Hi Jason

> > > 
> > > I'm feeling really skeptical that adding all this PCI config space and
> > > MMIO BAR emulation to the kernel just to cram this into a VFIO
> > > interface is a good idea, that kind of stuff is much safer in
> > > userspace.
> > > 
> > > Particularly since vfio is not really needed once a driver is using
> > > the PASID stuff. We already have general code for drivers to use to
> > > attach a PASID to a mm_struct - and using vfio while disabling all the
> > > DMA/iommu config really seems like an abuse.
> > 
> > Well, this series is for virtualizing idxd device to VMs, instead of
> > supporting SVA for bare metal processes. idxd implements a
> > hardware-assisted mediated device technique called Intel Scalable
> > I/O Virtualization,
> 
> I'm familiar with the intel naming scheme.
> 
> > which allows each Assignable Device Interface (ADI, e.g. a work
> > queue) tagged with an unique PASID to ensure fine-grained DMA
> > isolation when those ADIs are assigned to different VMs. For this
> > purpose idxd utilizes the VFIO mdev framework and IOMMU aux-domain
> > extension. Bare metal SVA will be enabled for idxd later by using
> > the general SVA code that you mentioned.  Both paths will co-exist
> > in the end so there is no such case of disabling DMA/iommu config.
>  
> Again, if you will have a normal SVA interface, there is no need for a
> VFIO version, just use normal SVA for both.
> 
> PCI emulation should try to be in userspace, not the kernel, for
> security.

Not sure we completely understand your proposal. Mediated devices
are software constructed and they have protected resources like
interrupts and stuff and VFIO already provides abstractions to export
to user space.

Native SVA is simply passing the process CR3 handle to IOMMU so
IOMMU knows how to walk process page tables, kernel handles things
like page-faults, doing device tlb invalidations and such.

That by itself doesn't translate to what a guest typically does
with a VDEV. There are other control paths that need to be serviced
from the kernel code via VFIO. For speed path operations like
ringing doorbells and such they are directly managed from guest.

How do you propose to use the existing SVA APIs to also provide 
full device emulation as opposed to using an existing infrastructure 
that's already in place?

Perhaps Alex can ease Jason's concerns?

Cheers,
Ashok
Dan Williams April 22, 2020, 9:24 p.m. UTC | #5
On Tue, Apr 21, 2020 at 4:55 PM Jason Gunthorpe <jgg@mellanox.com> wrote:
>
> On Tue, Apr 21, 2020 at 04:33:46PM -0700, Dave Jiang wrote:
> > The actual code is independent of the stage 2 driver code submission that adds
> > support for SVM, ENQCMD(S), PASID, and shared workqueues. This code series will
> > support dedicated workqueue on a guest with no vIOMMU.
> >
> > A new device type "mdev" is introduced for the idxd driver. This allows the wq
> > to be dedicated to the usage of a VFIO mediated device (mdev). Once the work
> > queue (wq) is enabled, an uuid generated by the user can be added to the wq
> > through the uuid sysfs attribute for the wq.  After the association, a mdev can
> > be created using this UUID. The mdev driver code will associate the uuid and
> > setup the mdev on the driver side. When the create operation is successful, the
> > uuid can be passed to qemu. When the guest boots up, it should discover a DSA
> > device when doing PCI discovery.
>
> I'm feeling really skeptical that adding all this PCI config space and
> MMIO BAR emulation to the kernel just to cram this into a VFIO
> interface is a good idea, that kind of stuff is much safer in
> userspace.
>
> Particularly since vfio is not really needed once a driver is using
> the PASID stuff. We already have general code for drivers to use to
> attach a PASID to a mm_struct - and using vfio while disabling all the
> DMA/iommu config really seems like an abuse.
>
> A /dev/idxd char dev that mmaps a bar page and links it to a PASID
> seems a lot simpler and saner kernel wise.
>
> > The mdev utilizes Interrupt Message Store or IMS[3] instead of MSIX for
> > interrupts for the guest. This preserves MSIX for host usages and also allows a
> > significantly larger number of interrupt vectors for guest usage.
>
> I never did get a reply to my earlier remarks on the IMS patches.
>
> The concept of a device specific addr/data table format for MSI is not
> Intel specific. This should be general code. We have a device that can
> use this kind of kernel capability today.

This has been my concern reviewing the implementation. IMS needs more
than one in-tree user to validate degrees of freedom in the api. I had
been missing a second "in-tree user" to validate the scope of the
flexibility that was needed.
Dey, Megha April 22, 2020, 11:04 p.m. UTC | #6
On 4/21/2020 4:54 PM, Jason Gunthorpe wrote:
> On Tue, Apr 21, 2020 at 04:33:46PM -0700, Dave Jiang wrote:
>> The actual code is independent of the stage 2 driver code submission that adds
>> support for SVM, ENQCMD(S), PASID, and shared workqueues. This code series will
>> support dedicated workqueue on a guest with no vIOMMU.
>>    
>> A new device type "mdev" is introduced for the idxd driver. This allows the wq
>> to be dedicated to the usage of a VFIO mediated device (mdev). Once the work
>> queue (wq) is enabled, an uuid generated by the user can be added to the wq
>> through the uuid sysfs attribute for the wq.  After the association, a mdev can
>> be created using this UUID. The mdev driver code will associate the uuid and
>> setup the mdev on the driver side. When the create operation is successful, the
>> uuid can be passed to qemu. When the guest boots up, it should discover a DSA
>> device when doing PCI discovery.
> 
> I'm feeling really skeptical that adding all this PCI config space and
> MMIO BAR emulation to the kernel just to cram this into a VFIO
> interface is a good idea, that kind of stuff is much safer in
> userspace.
> 
> Particularly since vfio is not really needed once a driver is using
> the PASID stuff. We already have general code for drivers to use to
> attach a PASID to a mm_struct - and using vfio while disabling all the
> DMA/iommu config really seems like an abuse.
> 
> A /dev/idxd char dev that mmaps a bar page and links it to a PASID
> seems a lot simpler and saner kernel wise.
> 
>> The mdev utilizes Interrupt Message Store or IMS[3] instead of MSIX for
>> interrupts for the guest. This preserves MSIX for host usages and also allows a
>> significantly larger number of interrupt vectors for guest usage.
> 
> I never did get a reply to my earlier remarks on the IMS patches.
> 
> The concept of a device specific addr/data table format for MSI is not
> Intel specific. This should be general code. We have a device that can
> use this kind of kernel capability today.
> 

<resending to the mailing list, I had incorrect email options set>

Hi Jason,

I am sorry if I did not address your comments earlier.

The present IMS code is quite generic, most of the code is in the 
drivers/ folder. We basically introduce 2 APIs: allocate and free IMS 
interrupts and an IMS IRQ domain to allocate these interrupts from. These 
APIs are architecture agnostic.

We also introduce a new IMS IRQ domain which is architecture specific. 
This is because IMS generates interrupts only in the remappable format, 
hence interrupt remapping should be enabled for IMS. Currently, the 
interrupt remapping code is only available for Intel and AMD and I don’t 
see anything for ARM.

If a new architecture would want to use IMS, they must simply introduce 
a new IMS IRQ domain. I am not sure if there is any other way around 
this. If you have any ideas, please let me know.

Also, could you give more details on the device that could use IMS? Do 
you have some driver code already? We could then see if and how the 
current IMS code could be made more generic.

> Jason
>
Jason Gunthorpe April 23, 2020, 7:12 p.m. UTC | #7
On Wed, Apr 22, 2020 at 02:14:36PM -0700, Raj, Ashok wrote:
> Hi Jason
> 
> > > > 
> > > > I'm feeling really skeptical that adding all this PCI config space and
> > > > MMIO BAR emulation to the kernel just to cram this into a VFIO
> > > > interface is a good idea, that kind of stuff is much safer in
> > > > userspace.
> > > > 
> > > > Particularly since vfio is not really needed once a driver is using
> > > > the PASID stuff. We already have general code for drivers to use to
> > > > attach a PASID to a mm_struct - and using vfio while disabling all the
> > > > DMA/iommu config really seems like an abuse.
> > > 
> > > Well, this series is for virtualizing idxd device to VMs, instead of
> > > supporting SVA for bare metal processes. idxd implements a
> > > hardware-assisted mediated device technique called Intel Scalable
> > > I/O Virtualization,
> > 
> > I'm familiar with the intel naming scheme.
> > 
> > > which allows each Assignable Device Interface (ADI, e.g. a work
> > > queue) tagged with an unique PASID to ensure fine-grained DMA
> > > isolation when those ADIs are assigned to different VMs. For this
> > > purpose idxd utilizes the VFIO mdev framework and IOMMU aux-domain
> > > extension. Bare metal SVA will be enabled for idxd later by using
> > > the general SVA code that you mentioned.  Both paths will co-exist
> > > in the end so there is no such case of disabling DMA/iommu config.
> >  
> > Again, if you will have a normal SVA interface, there is no need for a
> > VFIO version, just use normal SVA for both.
> > 
> > PCI emulation should try to be in userspace, not the kernel, for
> > security.
> 
> Not sure we completely understand your proposal. Mediated devices
> are software constructed and they have protected resources like
> interrupts and stuff and VFIO already provids abstractions to export
> to user space.
> 
> Native SVA is simply passing the process CR3 handle to IOMMU so
> IOMMU knows how to walk process page tables, kernel handles things
> like page-faults, doing device tlb invalidations and such.

> That by itself doesn't translate to what a guest typically does
> with a VDEV. There are other control paths that need to be serviced
> from the kernel code via VFIO. For speed path operations like
> ringing doorbells and such they are directly managed from guest.

You don't need vfio to mmap BAR pages to userspace. The unique thing
that vfio gives is it provides a way to program the classic non-PASID
iommu, which you are not using here.

> How do you propose to use the existing SVA api's  to also provide
> full device emulation as opposed to using an existing infrastructure 
> that's already in place?

You'd provide the 'full device emulation' in userspace (eg qemu),
along side all the other device emulation. Device emulation does not
belong in the kernel without a very good reason.

You get the doorbell BAR page from your own char dev

You setup a PASID IOMMU configuration over your own char dev

Interrupt delivery is triggering a generic event fd

What is VFIO needed for?
 
> Perhaps Alex can ease Jason's concerns?

Last we talked Alex also had doubts on what mdev should be used
for. It is a feature that seems to lack boundaries, and I'll note that
when the discussion came up for VDPA, they eventually choose not to
use VFIO.

Jason
Dan Williams April 23, 2020, 7:17 p.m. UTC | #8
On Wed, Apr 22, 2020 at 2:24 PM Dan Williams <dan.j.williams@intel.com> wrote:
>
> On Tue, Apr 21, 2020 at 4:55 PM Jason Gunthorpe <jgg@mellanox.com> wrote:
> >
> > On Tue, Apr 21, 2020 at 04:33:46PM -0700, Dave Jiang wrote:
> > > The actual code is independent of the stage 2 driver code submission that adds
> > > support for SVM, ENQCMD(S), PASID, and shared workqueues. This code series will
> > > support dedicated workqueue on a guest with no vIOMMU.
> > >
> > > A new device type "mdev" is introduced for the idxd driver. This allows the wq
> > > to be dedicated to the usage of a VFIO mediated device (mdev). Once the work
> > > queue (wq) is enabled, an uuid generated by the user can be added to the wq
> > > through the uuid sysfs attribute for the wq.  After the association, a mdev can
> > > be created using this UUID. The mdev driver code will associate the uuid and
> > > setup the mdev on the driver side. When the create operation is successful, the
> > > uuid can be passed to qemu. When the guest boots up, it should discover a DSA
> > > device when doing PCI discovery.
> >
> > I'm feeling really skeptical that adding all this PCI config space and
> > MMIO BAR emulation to the kernel just to cram this into a VFIO
> > interface is a good idea, that kind of stuff is much safer in
> > userspace.
> >
> > Particularly since vfio is not really needed once a driver is using
> > the PASID stuff. We already have general code for drivers to use to
> > attach a PASID to a mm_struct - and using vfio while disabling all the
> > DMA/iommu config really seems like an abuse.
> >
> > A /dev/idxd char dev that mmaps a bar page and links it to a PASID
> > seems a lot simpler and saner kernel wise.
> >
> > > The mdev utilizes Interrupt Message Store or IMS[3] instead of MSIX for
> > > interrupts for the guest. This preserves MSIX for host usages and also allows a
> > > significantly larger number of interrupt vectors for guest usage.
> >
> > I never did get a reply to my earlier remarks on the IMS patches.
> >
> > The concept of a device specific addr/data table format for MSI is not
> > Intel specific. This should be general code. We have a device that can
> > use this kind of kernel capability today.
>
> This has been my concern reviewing the implementation. IMS needs more
> than one in-tree user to validate degrees of freedom in the api. I had
> been missing a second "in-tree user" to validate the scope of the
> flexibility that was needed.

Hey Jason,

Per Megha's follow-up can you send the details about that other device
and help clear a path for a device-specific MSI addr/data table
format. Ever since HMM I've been sensitive, perhaps overly-sensitive,
to claims about future upstream users. The fact that you have an
additional use case is golden for pushing this into a common area and
validating the scope of the proposed API.
Jason Gunthorpe April 23, 2020, 7:18 p.m. UTC | #9
On Wed, Apr 22, 2020 at 02:24:11PM -0700, Dan Williams wrote:
> On Tue, Apr 21, 2020 at 4:55 PM Jason Gunthorpe <jgg@mellanox.com> wrote:
> >
> > On Tue, Apr 21, 2020 at 04:33:46PM -0700, Dave Jiang wrote:
> > > The actual code is independent of the stage 2 driver code submission that adds
> > > support for SVM, ENQCMD(S), PASID, and shared workqueues. This code series will
> > > support dedicated workqueue on a guest with no vIOMMU.
> > >
> > > A new device type "mdev" is introduced for the idxd driver. This allows the wq
> > > to be dedicated to the usage of a VFIO mediated device (mdev). Once the work
> > > queue (wq) is enabled, an uuid generated by the user can be added to the wq
> > > through the uuid sysfs attribute for the wq.  After the association, a mdev can
> > > be created using this UUID. The mdev driver code will associate the uuid and
> > > setup the mdev on the driver side. When the create operation is successful, the
> > > uuid can be passed to qemu. When the guest boots up, it should discover a DSA
> > > device when doing PCI discovery.
> >
> > I'm feeling really skeptical that adding all this PCI config space and
> > MMIO BAR emulation to the kernel just to cram this into a VFIO
> > interface is a good idea, that kind of stuff is much safer in
> > userspace.
> >
> > Particularly since vfio is not really needed once a driver is using
> > the PASID stuff. We already have general code for drivers to use to
> > attach a PASID to a mm_struct - and using vfio while disabling all the
> > DMA/iommu config really seems like an abuse.
> >
> > A /dev/idxd char dev that mmaps a bar page and links it to a PASID
> > seems a lot simpler and saner kernel wise.
> >
> > > The mdev utilizes Interrupt Message Store or IMS[3] instead of MSIX for
> > > interrupts for the guest. This preserves MSIX for host usages and also allows a
> > > significantly larger number of interrupt vectors for guest usage.
> >
> > I never did get a reply to my earlier remarks on the IMS patches.
> >
> > The concept of a device specific addr/data table format for MSI is not
> > Intel specific. This should be general code. We have a device that can
> > use this kind of kernel capability today.
> 
> This has been my concern reviewing the implementation. IMS needs more
> than one in-tree user to validate degrees of freedom in the api. I had
> been missing a second "in-tree user" to validate the scope of the
> flexibility that was needed.

IMS is too narrowly specified.

All platforms that support MSI today can support IMS. It is simply a
way for the platform to give the driver an addr/data pair that triggers
an interrupt when a posted write is performed to that pair.

This is different from the other interrupt setup flows which are
tightly tied to the PCI layer. Here the driver should simply ask for
interrupts.

Ie the entire IMS API to the driver should be something very simple
like:

 struct message_irq
 {
   uint64_t addr;
   uint32_t data;
 };

 struct message_irq *request_message_irq(
    struct device *, irq_handler_t handler, unsigned long flags,
    const char *name, void *dev);

And the plumbing underneath should setup the irq chips and so forth as
required.
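
For illustration, a driver would then consume it along these lines (purely
hypothetical device, registers and names, just to show the shape):

 #include <linux/interrupt.h>
 #include <linux/io.h>

 static irqreturn_t mydev_evt_handler(int irq, void *dev_id)
 {
        /* handle the event the device signalled via its MemWr */
        return IRQ_HANDLED;
 }

 /* in probe: ask the core for an addr/data pair backed by a Linux IRQ */
 struct message_irq *mirq = request_message_irq(&pdev->dev, mydev_evt_handler,
                                                0, "mydev-evt", mydev);

 /* program the pair into the device's own table (hypothetical registers) */
 writeq(mirq->addr, mydev->regs + MYDEV_INT_ADDR);
 writel(mirq->data, mydev->regs + MYDEV_INT_DATA);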

Jason
Jason Gunthorpe April 23, 2020, 7:44 p.m. UTC | #10
> > > The mdev utilizes Interrupt Message Store or IMS[3] instead of MSIX for
> > > interrupts for the guest. This preserves MSIX for host usages and also allows a
> > > significantly larger number of interrupt vectors for guest usage.
> > 
> > I never did get a reply to my earlier remarks on the IMS patches.
> > 
> > The concept of a device specific addr/data table format for MSI is not
> > Intel specific. This should be general code. We have a device that can
> > use this kind of kernel capability today.
> 
> I am sorry if I did not address your comments earlier.

It appears nobody from Intel bothered to answer anyone else on that RFC
thread:

https://lore.kernel.org/lkml/1568338328-22458-1-git-send-email-megha.dey@linux.intel.com/

However, it seems kind of moot as I see now that this version of IMS
bears almost no resemblance to the original RFC.

That said, the similarity to platform-msi was striking; does this new
version harmonize with that?

> The present IMS code is quite generic, most of the code is in the drivers/
> folder. We basically introduce 2 APIS: allocate and free IMS interrupts and
> a IMS IRQ domain to allocate these interrupts from. These APIs are
> architecture agnostic.
>
> We also introduce a new IMS IRQ domain which is architecture specific. This
> is because IMS generates interrupts only in the remappable format, hence
> interrupt remapping should be enabled for IMS. Currently, the interrupt
> remapping code is only available for Intel and AMD and I don’t see anything
> for ARM.

I don't understand these remarks though - IMS is simply the mapping of
a MemWr addr/data pair to a Linux IRQ number? Why does this intersect
with remapping?

AFAIK, any platform that supports MSI today should have the inherent
HW capability to support IMS.

> Also, could you give more details on the device that could use IMS? Do you
> have some driver code already? We could then see if and how the current IMS
> code could be made more generic.

We have several devices of interest, our NICs have very flexible PCI,
so it is no problem to take the MemWR addr/data from someplace other
than the MSI tables.

For this we want to have some way to allocate Linux IRQs dynamically
and get a addr/data pair to trigger them.

Our NIC devices are also linked to our ARM SOC family, so I'd expect
our ARM's to also be able to provide these APIs as the platform.

Jason
Jason Gunthorpe April 23, 2020, 7:49 p.m. UTC | #11
On Thu, Apr 23, 2020 at 12:17:50PM -0700, Dan Williams wrote:

> Per Megha's follow-up can you send the details about that other device
> and help clear a path for a device-specific MSI addr/data table
> format. Ever since HMM I've been sensitive, perhaps overly-sensitive,
> to claims about future upstream users. The fact that you have an
> additional use case is golden for pushing this into a common area and
> validating the scope of the proposed API.

I think I said it at plumbers, but yes, we are interested in this, and
would like dynamic MSI-like interrupts available to the driver (what
Intel calls IMS)

It is something easy enough to illustrate with any RDMA device really,
just open a MR against the addr and use RDMA_WRITE to trigger the
data. It should trigger a Linux IRQ. Nothing else should be needed.

Jason
Tian, Kevin April 24, 2020, 3:27 a.m. UTC | #12
> From: Jason Gunthorpe <jgg@mellanox.com>
> Sent: Friday, April 24, 2020 3:12 AM
> 
> On Wed, Apr 22, 2020 at 02:14:36PM -0700, Raj, Ashok wrote:
> > Hi Jason
> >
> > > > >
> > > > > I'm feeling really skeptical that adding all this PCI config space and
> > > > > MMIO BAR emulation to the kernel just to cram this into a VFIO
> > > > > interface is a good idea, that kind of stuff is much safer in
> > > > > userspace.
> > > > >
> > > > > Particularly since vfio is not really needed once a driver is using
> > > > > the PASID stuff. We already have general code for drivers to use to
> > > > > attach a PASID to a mm_struct - and using vfio while disabling all the
> > > > > DMA/iommu config really seems like an abuse.
> > > >
> > > > Well, this series is for virtualizing idxd device to VMs, instead of
> > > > supporting SVA for bare metal processes. idxd implements a
> > > > hardware-assisted mediated device technique called Intel Scalable
> > > > I/O Virtualization,
> > >
> > > I'm familiar with the intel naming scheme.
> > >
> > > > which allows each Assignable Device Interface (ADI, e.g. a work
> > > > queue) tagged with an unique PASID to ensure fine-grained DMA
> > > > isolation when those ADIs are assigned to different VMs. For this
> > > > purpose idxd utilizes the VFIO mdev framework and IOMMU aux-
> domain
> > > > extension. Bare metal SVA will be enabled for idxd later by using
> > > > the general SVA code that you mentioned.  Both paths will co-exist
> > > > in the end so there is no such case of disabling DMA/iommu config.
> > >
> > > Again, if you will have a normal SVA interface, there is no need for a
> > > VFIO version, just use normal SVA for both.
> > >
> > > PCI emulation should try to be in userspace, not the kernel, for
> > > security.
> >
> > Not sure we completely understand your proposal. Mediated devices
> > are software constructed and they have protected resources like
> > interrupts and stuff and VFIO already provids abstractions to export
> > to user space.
> >
> > Native SVA is simply passing the process CR3 handle to IOMMU so
> > IOMMU knows how to walk process page tables, kernel handles things
> > like page-faults, doing device tlb invalidations and such.
> 
> > That by itself doesn't translate to what a guest typically does
> > with a VDEV. There are other control paths that need to be serviced
> > from the kernel code via VFIO. For speed path operations like
> > ringing doorbells and such they are directly managed from guest.
> 
> You don't need vfio to mmap BAR pages to userspace. The unique thing
> that vfio gives is it provides a way to program the classic non-PASID
> iommu, which you are not using here.

That unique thing is indeed used here. Please note sharing CPU virtual 
address space with device (what SVA API is invented for) is not the
purpose of this series. We still rely on classic non-PASID iommu programming, 
i.e. mapping/unmapping IOVA->HPA per iommu_domain. Although 
we do use PASID to tag ADI, the PASID is contained within iommu_domain 
and invisible to VFIO. From userspace p.o.v, this is a device passthrough
usage instead of PASID-based address space binding.

> 
> > How do you propose to use the existing SVA api's  to also provide
> > full device emulation as opposed to using an existing infrastructure
> > that's already in place?
> 
> You'd provide the 'full device emulation' in userspace (eg qemu),
> along side all the other device emulation. Device emulation does not
> belong in the kernel without a very good reason.

The problem is that we are not doing full device emulation. It's based
on mediated passthrough. Some emulation logic requires close 
engagement with kernel device driver, e.g. resource allocation, WQ 
configuration, fault report, etc., while the detail interface is very vendor/
device specific (just like between PF and VF). idxd is just the first 
device that supports Scalable IOV. We have a lot more coming later, 
in different types. Then putting such emulation in user space means 
that Qemu needs to support all those vendor specific interfaces for 
every new device which supports Scalable IOV. This is in contrast to our 
goal of using Scalable IOV as an alternative to SR-IOV. For SR-IOV, 
Qemu only needs to support one VFIO API then any VF type simply 
works. We want to sustain the same user experience through VFIO 
mdev. 

Specifically for PCI config space emulation, now it's already done 
in multiple kernel places, e.g. vfio-pci, kvmgt, etc. We do plan to 
consolidate them later.

> 
> You get the doorbell BAR page from your own char dev
> 
> You setup a PASID IOMMU configuration over your own char dev
> 
> Interrupt delivery is triggering a generic event fd
> 
> What is VFIO needed for?

Based on above explanation VFIO mdev already meets all of our
requirements then why bother inventing a new one...

> 
> > Perhaps Alex can ease Jason's concerns?
> 
> Last we talked Alex also had doubts on what mdev should be used
> for. It is a feature that seems to lack boundaries, and I'll note that
> when the discussion came up for VDPA, they eventually choose not to
> use VFIO.
> 

Is there a link to Alex's doubt? I'm not sure why vDPA didn't go 
for VFIO, but imho it is a different story. vDPA is specifically for
devices which implement standard vhost/virtio interface, thus
it's reasonable that inventing a new mechanism might be more
efficient for all vDPA type devices. However Scalable IOV is
similar to SR-IOV, only for resource partitioning. It doesn't change
the device programming interface, which could be in any vendor
specific form. Here VFIO mdev is good for providing an unified 
interface for managing resource multiplexing of all such devices.

Thanks
Kevin
Jason Wang April 24, 2020, 6:31 a.m. UTC | #13
On 2020/4/22 7:54 AM, Jason Gunthorpe wrote:
>> The mdev utilizes Interrupt Message Store or IMS[3] instead of MSIX for
>> interrupts for the guest. This preserves MSIX for host usages and also allows a
>> significantly larger number of interrupt vectors for guest usage.
> I never did get a reply to my earlier remarks on the IMS patches.
>
> The concept of a device specific addr/data table format for MSI is not
> Intel specific. This should be general code. We have a device that can
> use this kind of kernel capability today.
>
> Jason
>

+1.

Another example is to extend virtio MMIO to support MSI[1].

Thanks

[1] https://lkml.org/lkml/2020/2/10/127
Jason Gunthorpe April 24, 2020, 12:44 p.m. UTC | #14
On Fri, Apr 24, 2020 at 03:27:41AM +0000, Tian, Kevin wrote:

> > > That by itself doesn't translate to what a guest typically does
> > > with a VDEV. There are other control paths that need to be serviced
> > > from the kernel code via VFIO. For speed path operations like
> > > ringing doorbells and such they are directly managed from guest.
> > 
> > You don't need vfio to mmap BAR pages to userspace. The unique thing
> > that vfio gives is it provides a way to program the classic non-PASID
> > iommu, which you are not using here.
> 
> That unique thing is indeed used here. Please note sharing CPU virtual 
> address space with device (what SVA API is invented for) is not the
> purpose of this series. We still rely on classic non-PASID iommu programming, 
> i.e. mapping/unmapping IOVA->HPA per iommu_domain. Although 
> we do use PASID to tag ADI, the PASID is contained within iommu_domain 
> and invisible to VFIO. From userspace p.o.v, this is a device passthrough
> usage instead of PASID-based address space binding.

So you have PASID support but don't use it? Why? PASID is much better
than classic VFIO iommu, it doesn't require page pinning...

> > > How do you propose to use the existing SVA api's  to also provide
> > > full device emulation as opposed to using an existing infrastructure
> > > that's already in place?
> > 
> > You'd provide the 'full device emulation' in userspace (eg qemu),
> > along side all the other device emulation. Device emulation does not
> > belong in the kernel without a very good reason.
> 
> The problem is that we are not doing full device emulation. It's based
> on mediated passthrough. Some emulation logic requires close
> engagement with kernel device driver, e.g. resource allocation, WQ
> configuration, fault report, etc., while the detail interface is very vendor/
> device specific (just like between PF and VF).

Which sounds like the fairly classic case of device emulation to me.

> idxd is just the first device that supports Scalable IOV. We have a
> lot more coming later, in different types. Then putting such
> emulation in user space means that Qemu needs to support all those
> vendor specific interfaces for every new device which supports

It would be very sad to see an endless amount of device emulation code
crammed into the kernel. Userspace is where device emulation is
supposed to live, for security.

qemu is the right place to put this stuff.

> > > Perhaps Alex can ease Jason's concerns?
> > 
> > Last we talked Alex also had doubts on what mdev should be used
> > for. It is a feature that seems to lack boundaries, and I'll note that
> > when the discussion came up for VDPA, they eventually choose not to
> > use VFIO.
> > 
> 
> Is there a link to Alex's doubt? I'm not sure why vDPA didn't go 
> for VFIO, but imho it is a different story.

No, not at all. VDPA HW today is using what Intel has been calling
ADI. But qemu already had the device emulation part in userspace, (all
of the virtio emulation parts are in userspace) so they didn't try to
put it in the kernel.

This is the pattern. User space is supposed to do the emulation parts,
the kernel provides the raw elements to manage queues/etc - and it is
not done through mdev.

> efficient for all vDPA type devices. However Scalable IOV is
> similar to SR-IOV, only for resource partitioning. It doesn't change
> the device programming interface, which could be in any vendor
> specific form. Here VFIO mdev is good for providing an unified 
> interface for managing resource multiplexing of all such devices.

SIOV doesn't have a HW config space, and for some reason in these
patches there is BAR emulation too. So, no, it is not like SR-IOV at
all.

This is more like classic device emulation, presumably with some fast
path for the data plane. ie just like VDPA :)

Jason
Tian, Kevin April 24, 2020, 4:25 p.m. UTC | #15
> From: Jason Gunthorpe
> Sent: Friday, April 24, 2020 8:45 PM
> 
> On Fri, Apr 24, 2020 at 03:27:41AM +0000, Tian, Kevin wrote:
> 
> > > > That by itself doesn't translate to what a guest typically does
> > > > with a VDEV. There are other control paths that need to be serviced
> > > > from the kernel code via VFIO. For speed path operations like
> > > > ringing doorbells and such they are directly managed from guest.
> > >
> > > You don't need vfio to mmap BAR pages to userspace. The unique thing
> > > that vfio gives is it provides a way to program the classic non-PASID
> > > iommu, which you are not using here.
> >
> > That unique thing is indeed used here. Please note sharing CPU virtual
> > address space with device (what SVA API is invented for) is not the
> > purpose of this series. We still rely on classic non-PASID iommu
> programming,
> > i.e. mapping/unmapping IOVA->HPA per iommu_domain. Although
> > we do use PASID to tag ADI, the PASID is contained within iommu_domain
> > and invisible to VFIO. From userspace p.o.v, this is a device passthrough
> > usage instead of PASID-based address space binding.
> 
> So you have PASID support but don't use it? Why? PASID is much better
> than classic VFIO iommu, it doesn't require page pinning...

PASID and I/O page fault (through ATS/PRI) are orthogonal things. Don't
draw the equation between them. The host driver can tag PASID to 
ADI so every DMA request out of that ADI has a PASID prefix, allowing VT-d
to do PASID-granular DMA isolation. However I/O page fault cannot be
taken for granted. A scalable IOV device may support PASID while without
ATS/PRI. Even when ATS/PRI is supported, the tolerance of I/O page fault
is decided by the work queue mode that is configured by the guest. For 
example, if the guest puts the work queue in non-faultable transaction 
mode, the device doesn't do PRI and simply reports an error if there is no
valid IOMMU mapping.

So in this series we support only the basic form for non-faultable transactions,
using the classic VFIO iommu interface plus PASID-granular translation. 
We are working on virtual SVA support in parallel. Once that feature is ready, 
then I/O page fault could be CONDITIONALLY enabled according to guest 
vIOMMU setting, e.g. when virtual context entry has page request enabled 
then we enable nested translation in the physical PASID entry, with 1st 
level linking to guest page table (GVA->GPA) and 2nd-level carrying 
(GPA->HPA).

> 
> > > > How do you propose to use the existing SVA api's  to also provide
> > > > full device emulation as opposed to using an existing infrastructure
> > > > that's already in place?
> > >
> > > You'd provide the 'full device emulation' in userspace (eg qemu),
> > > along side all the other device emulation. Device emulation does not
> > > belong in the kernel without a very good reason.
> >
> > The problem is that we are not doing full device emulation. It's based
> > on mediated passthrough. Some emulation logic requires close
> > engagement with kernel device driver, e.g. resource allocation, WQ
> > configuration, fault report, etc., while the detail interface is very vendor/
> > device specific (just like between PF and VF).
> 
> Which sounds like the fairly classic case of device emulation to me.
> 
> > idxd is just the first device that supports Scalable IOV. We have a
> > lot more coming later, in different types. Then putting such
> > emulation in user space means that Qemu needs to support all those
> > vendor specific interfaces for every new device which supports
> 
> It would be very sad to see an endless amount of device emulation code
> crammed into the kernel. Userspace is where device emulation is
> supposed to live. For security

I think providing an unified abstraction to userspace is also important,
which is what VFIO provides today. The merit of using one set of VFIO 
API to manage all kinds of mediated devices and VF devices is a major
gain. Instead, inventing a new vDPA-like interface for every Scalable-IOV
or equivalent device is just overkill and doesn't scale. Also the
emulation code in the idxd driver is actually small, if we put aside the PCI
config space part for which I already explained most logic could be shared
between mdev device drivers.

> 
> qemu is the right place to put this stuff.
> 
> > > > Perhaps Alex can ease Jason's concerns?
> > >
> > > Last we talked Alex also had doubts on what mdev should be used
> > > for. It is a feature that seems to lack boundaries, and I'll note that
> > > when the discussion came up for VDPA, they eventually choose not to
> > > use VFIO.
> > >
> >
> > Is there a link to Alex's doubt? I'm not sure why vDPA didn't go
> > for VFIO, but imho it is a different story.
> 
> No, not at all. VDPA HW today is using what Intel has been calling
> ADI. But qemu already had the device emulation part in userspace, (all
> of the virtio emulation parts are in userspace) so they didn't try to
> put it in the kernel.
> 
> This is the pattern. User space is supposed to do the emulation parts,
> the kernel provides the raw elements to manage queues/etc - and it is
> not done through mdev.
> 
> > efficient for all vDPA type devices. However Scalable IOV is
> > similar to SR-IOV, only for resource partitioning. It doesn't change
> > the device programming interface, which could be in any vendor
> > specific form. Here VFIO mdev is good for providing an unified
> > interface for managing resource multiplexing of all such devices.
> 
> SIOV doesn't have a HW config space, and for some reason in these
> patches there is BAR emulation too. So, no, it is not like SR-IOV at
> all.
> 
> This is more like classic device emulation, presumably with some fast
> path for the data plane. ie just like VDPA :)
> 
> Jason

Thanks
Kevin
Jason Gunthorpe April 24, 2020, 6:12 p.m. UTC | #16
On Fri, Apr 24, 2020 at 04:25:56PM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe
> > Sent: Friday, April 24, 2020 8:45 PM
> > 
> > On Fri, Apr 24, 2020 at 03:27:41AM +0000, Tian, Kevin wrote:
> > 
> > > > > That by itself doesn't translate to what a guest typically does
> > > > > with a VDEV. There are other control paths that need to be serviced
> > > > > from the kernel code via VFIO. For speed path operations like
> > > > > ringing doorbells and such they are directly managed from guest.
> > > >
> > > > You don't need vfio to mmap BAR pages to userspace. The unique thing
> > > > that vfio gives is it provides a way to program the classic non-PASID
> > > > iommu, which you are not using here.
> > >
> > > That unique thing is indeed used here. Please note sharing CPU virtual
> > > address space with device (what SVA API is invented for) is not the
> > > purpose of this series. We still rely on classic non-PASID iommu
> > programming,
> > > i.e. mapping/unmapping IOVA->HPA per iommu_domain. Although
> > > we do use PASID to tag ADI, the PASID is contained within iommu_domain
> > > and invisible to VFIO. From userspace p.o.v, this is a device passthrough
> > > usage instead of PASID-based address space binding.
> > 
> > So you have PASID support but don't use it? Why? PASID is much better
> > than classic VFIO iommu, it doesn't require page pinning...
> 
> PASID and I/O page fault (through ATS/PRI) are orthogonal things. Don't
> draw the equation between them. The host driver can tag PASID to 
> ADI so every DMA request out of that ADI has a PASID prefix, allowing VT-d
> to do PASID-granular DMA isolation. However I/O page fault cannot be
> taken for granted. A scalable IOV device may support PASID while without
> ATS/PRI. Even when ATS/PRI is supported, the tolerance of I/O page fault
> is decided by the work queue mode that is configured by the guest. For 
> example, if the guest put the work queue in non-faultable transaction 
> mode, the device doesn't do PRI and simply report error if no valid IOMMU 
> mapping.

Okay, that makes sense, I wasn't aware people were doing PASID without
ATS at this point..

> > > idxd is just the first device that supports Scalable IOV. We have a
> > > lot more coming later, in different types. Then putting such
> > > emulation in user space means that Qemu needs to support all those
> > > vendor specific interfaces for every new device which supports
> > 
> > It would be very sad to see an endless amount of device emulation code
> > crammed into the kernel. Userspace is where device emulation is
> > supposed to live. For security
> 
> I think providing an unified abstraction to userspace is also important,
> which is what VFIO provides today. The merit of using one set of VFIO 
> API to manage all kinds of mediated devices and VF devices is a major
> gain. Instead, inventing a new vDPA-like interface for every Scalable-IOV
> or equivalent device is just overkill and doesn't scale. Also the actual
> emulation code in idxd driver is actually small, if putting aside the PCI
> config space part for which I already explained most logic could be shared
> between mdev device drivers.

If it was just config space you might have an argument, VFIO already
does some config space mangling, but emulating BAR space is out of
scope of VFIO, IMHO.

I also think it is disingenuous to pretend this is similar to
SR-IOV. SR-IOV is self contained and the BAR does not require
emulation. What you have here sounds like it is just an ordinary
multi-queue device with the ability to PASID tag queues for IOMMU
handling. This is absolutely not SRIOV - it is much closer to VDPA,
which isn't using mdev.

Further, I disagree with your assessment that this doesn't scale. You
already said you plan a normal user interface for idxd, so instead of
having a single sane user interface (ala VDPA) idxd now needs *two*. If
this is the general pattern of things to come, it is a bad path.

The only thing we get out of this is someone doesn't have to write a
idxd emulation driver in qemu, instead they have to write it in the
kernel. I don't see how that is a win for the ecosystem.

Jason
Tian, Kevin April 26, 2020, 5:18 a.m. UTC | #17
> From: Jason Gunthorpe <jgg@mellanox.com>
> Sent: Saturday, April 25, 2020 2:12 AM
> 
> > > > idxd is just the first device that supports Scalable IOV. We have a
> > > > lot more coming later, in different types. Then putting such
> > > > emulation in user space means that Qemu needs to support all those
> > > > vendor specific interfaces for every new device which supports
> > >
> > > It would be very sad to see an endless amount of device emulation code
> > > crammed into the kernel. Userspace is where device emulation is
> > > supposed to live. For security
> >
> > I think providing an unified abstraction to userspace is also important,
> > which is what VFIO provides today. The merit of using one set of VFIO
> > API to manage all kinds of mediated devices and VF devices is a major
> > gain. Instead, inventing a new vDPA-like interface for every Scalable-IOV
> > or equivalent device is just overkill and doesn't scale. Also the actual
> > emulation code in idxd driver is actually small, if putting aside the PCI
> > config space part for which I already explained most logic could be shared
> > between mdev device drivers.
> 
> If it was just config space you might have an argument, VFIO already
> does some config space mangling, but emulating BAR space is out of
> scope of VFIO, IMHO.

It is out of scope for vfio-pci, but in scope for vfio-mdev. By the way,
I feel that most of your objections are actually about the general idea of
vfio-mdev. Scalable IOV just uses PASID to harden DMA isolation in the
mediated pass-through usage which vfio-mdev enables. So are you opposing
vfio-mdev as a whole? If not, I'm curious about your criteria for when
using vfio-mdev is acceptable...

> 
> I also think it is disingenuous to pretend this is similar to
> SR-IOV. SR-IOV is self contained and the BAR does not require
> emulation. What you have here sounds like it is just an ordinary

Technically, Scalable IOV is definitely different from SR-IOV: it is
simpler in hardware, and we are not emulating SR-IOV. The point is just
that, usage-wise, we want to present a consistent user experience, just
like passing a PCI endpoint (PF or VF) device through the vfio
eco-system, including various userspace VMMs (Qemu, firecracker,
rust-vmm, etc.), middleware (Libvirt), and higher level management
stacks.

> multi-queue device with the ability to PASID tag queues for IOMMU
> handling. This is absolutely not SRIOV - it is much closer to VDPA,
> which isn't using mdev.
> 
> Further, I disagree with your assessment that this doesn't scale. You
> already said you plan a normal user interface for idxd, so instead of
> having a single sane user interface (ala VDPA) idxd now needs *two*. If
> this is the general pattern of things to come, it is a bad path.
> 
> The only thing we get out of this is someone doesn't have to write a
> idxd emulation driver in qemu, instead they have to write it in the
> kernel. I don't see how that is a win for the ecosystem.
> 

No. The clear win is leveraging the classic VFIO iommu and its eco-system,
as explained above.

Thanks
Kevin
Jason Gunthorpe April 26, 2020, 7:13 p.m. UTC | #18
On Sun, Apr 26, 2020 at 05:18:59AM +0000, Tian, Kevin wrote:

> > > I think providing an unified abstraction to userspace is also important,
> > > which is what VFIO provides today. The merit of using one set of VFIO
> > > API to manage all kinds of mediated devices and VF devices is a major
> > > gain. Instead, inventing a new vDPA-like interface for every Scalable-IOV
> > > or equivalent device is just overkill and doesn't scale. Also the actual
> > > emulation code in idxd driver is actually small, if putting aside the PCI
> > > config space part for which I already explained most logic could be shared
> > > between mdev device drivers.
> > 
> > If it was just config space you might have an argument, VFIO already
> > does some config space mangling, but emulating BAR space is out of
> > scope of VFIO, IMHO.
> 
> out of scope of vfio-pci, but in scope of vfio-mdev. btw I feel that most
> of your objections are actually related to the general idea of
> vfio-mdev.

There have been several abusive proposals of vfio-mdev, everything
from a way to create device drivers to this kind of generic emulation
framework.

> Scalable IOV just uses PASID to harden DMA isolation in mediated
> pass-through usage which vfio-mdev enables. Then are you just opposing
> the whole vfio-mdev? If not, I'm curious about the criteria in your mind 
> about when using vfio-mdev is good...

It is appropriate when non-PCI standard techniques are needed to do
raw device assignment, just like VFIO.

Basically if vfio-pci is already doing it then it seems reasonable
that vfio-mdev should do the same. This mission creep where vfio-mdev
gains functionality far beyond VFIO is the problem.

> technically Scalable IOV is definitely different from SR-IOV. It's 
> simpler in hardware. And we're not emulating SR-IOV. The point
> is just in usage-wise we want to present a consistent user 
> experience just like passing through a PCI endpoint (PF or VF) device
> through vfio eco-system, including various userspace VMMs (Qemu,
> firecracker, rust-vmm, etc.), middleware (Libvirt), and higher level 
> management stacks. 

Yes, I understand your desire, but at the same time we have not been
doing device emulation in the kernel. You should at least be
forthright about that major change in the cover letters/etc.
 
> > The only thing we get out of this is someone doesn't have to write a
> > idxd emulation driver in qemu, instead they have to write it in the
> > kernel. I don't see how that is a win for the ecosystem.
> 
> No. The clear win is on leveraging classic VFIO iommu and its eco-system
> as explained above.

vdpa had no problem implementing iommu support without VFIO. This was
their original argument too; it turned out to be erroneous.

Jason
Alex Williamson April 27, 2020, 3:43 a.m. UTC | #19
On Sun, 26 Apr 2020 16:13:57 -0300
Jason Gunthorpe <jgg@mellanox.com> wrote:

> On Sun, Apr 26, 2020 at 05:18:59AM +0000, Tian, Kevin wrote:
> 
> > > > I think providing an unified abstraction to userspace is also important,
> > > > which is what VFIO provides today. The merit of using one set of VFIO
> > > > API to manage all kinds of mediated devices and VF devices is a major
> > > > gain. Instead, inventing a new vDPA-like interface for every Scalable-IOV
> > > > or equivalent device is just overkill and doesn't scale. Also the actual
> > > > emulation code in idxd driver is actually small, if putting aside the PCI
> > > > config space part for which I already explained most logic could be shared
> > > > between mdev device drivers.  
> > > 
> > > If it was just config space you might have an argument, VFIO already
> > > does some config space mangling, but emulating BAR space is out of
> > > scope of VFIO, IMHO.  
> > 
> > out of scope of vfio-pci, but in scope of vfio-mdev. btw I feel that most
> > of your objections are actually related to the general idea of
> > vfio-mdev.  
> 
> There have been several abusive proposals of vfio-mdev, everything
> from a way to create device drivers to this kind of generic emulation
> framework.
> 
> > Scalable IOV just uses PASID to harden DMA isolation in mediated
> > pass-through usage which vfio-mdev enables. Then are you just opposing
> > the whole vfio-mdev? If not, I'm curious about the criteria in your mind 
> > about when using vfio-mdev is good...  
> 
> It is appropriate when non-PCI standard techniques are needed to do
> raw device assignment, just like VFIO.
> 
> Basically if vfio-pci is already doing it then it seems reasonable
> that vfio-mdev should do the same. This mission creep where vfio-mdev
> gains functionality far beyond VFIO is the problem.

Ehm, vfio-pci emulates BARs too.  We also emulate FLR, power
management, DisINTx, and VPD.  FLR, PM, and VPD all have device
specific quirks in the host kernel, and I've generally taken the stance
that we should take advantage of those quirks, not duplicate them in
userspace and not invent new access mechanisms/ioctls for each of them.
Emulating DisINTx is convenient since we must have a mechanism to mask
INTx, whether it's at the device or the APIC, so we can pretend the
hardware supports it.  BAR emulation is really too trivial to argue
about, the BARs mean nothing to the physical device mapping, they're
simply scratch registers that we mask out the alignment bits on read.
vfio-pci is a mix of things that we decide are too complicated or
irrelevant to emulate in the kernel and things that take advantage of
shared quirks or are just too darn easy to worry about.  BARs fall into
that latter category, any sort of mapping into VM address spaces is
necessarily done in userspace, but scratch registers that are masked on
read, *shrug*, vfio-pci does that.  Thanks,

Alex
 
> > technically Scalable IOV is definitely different from SR-IOV. It's 
> > simpler in hardware. And we're not emulating SR-IOV. The point
> > is just in usage-wise we want to present a consistent user 
> > experience just like passing through a PCI endpoint (PF or VF) device
> > through vfio eco-system, including various userspace VMMs (Qemu,
> > firecracker, rust-vmm, etc.), middleware (Libvirt), and higher level 
> > management stacks.   
> 
> Yes, I understand your desire, but at the same time we have not been
> doing device emulation in the kernel. You should at least be
> forthwright about that major change in the cover letters/etc.
>  
> > > The only thing we get out of this is someone doesn't have to write a
> > > idxd emulation driver in qemu, instead they have to write it in the
> > > kernel. I don't see how that is a win for the ecosystem.  
> > 
> > No. The clear win is on leveraging classic VFIO iommu and its eco-system
> > as explained above.  
> 
> vdpa had no problem implementing iommu support without VFIO. This was
> their original argument too, it turned out to be erroneous.
> 
> Jason
>
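For reference, a minimal sketch of the kind of "scratch register masked on
read" BAR emulation described above. The BAR size, type bits, and function
names below are illustrative assumptions only, not the actual idxd code:

  #define VBAR0_SIZE      0x10000     /* illustrative: a 64KB MMIO region */
  #define VBAR0_TYPE_BITS 0x4         /* 64-bit, non-prefetchable memory BAR */

  /* Config-space write to the BAR: just keep whatever the guest wrote. */
  static void vbar0_cfg_write(u32 *vbar0_scratch, u32 val)
  {
          *vbar0_scratch = val;
  }

  /*
   * Config-space read of the BAR: mask off the bits below the region size
   * so the guest's standard size probe (write 0xffffffff, read it back)
   * sees a 64KB BAR, and keep the read-only type bits fixed.
   */
  static u32 vbar0_cfg_read(u32 vbar0_scratch)
  {
          return (vbar0_scratch & ~(VBAR0_SIZE - 1)) | VBAR0_TYPE_BITS;
  }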
Jason Gunthorpe April 27, 2020, 11:58 a.m. UTC | #20
On Sun, Apr 26, 2020 at 09:43:55PM -0600, Alex Williamson wrote:
> On Sun, 26 Apr 2020 16:13:57 -0300
> Jason Gunthorpe <jgg@mellanox.com> wrote:
> 
> > On Sun, Apr 26, 2020 at 05:18:59AM +0000, Tian, Kevin wrote:
> > 
> > > > > I think providing an unified abstraction to userspace is also important,
> > > > > which is what VFIO provides today. The merit of using one set of VFIO
> > > > > API to manage all kinds of mediated devices and VF devices is a major
> > > > > gain. Instead, inventing a new vDPA-like interface for every Scalable-IOV
> > > > > or equivalent device is just overkill and doesn't scale. Also the actual
> > > > > emulation code in idxd driver is actually small, if putting aside the PCI
> > > > > config space part for which I already explained most logic could be shared
> > > > > between mdev device drivers.  
> > > > 
> > > > If it was just config space you might have an argument, VFIO already
> > > > does some config space mangling, but emulating BAR space is out of
> > > > scope of VFIO, IMHO.  
> > > 
> > > out of scope of vfio-pci, but in scope of vfio-mdev. btw I feel that most
> > > of your objections are actually related to the general idea of
> > > vfio-mdev.  
> > 
> > There have been several abusive proposals of vfio-mdev, everything
> > from a way to create device drivers to this kind of generic emulation
> > framework.
> > 
> > > Scalable IOV just uses PASID to harden DMA isolation in mediated
> > > pass-through usage which vfio-mdev enables. Then are you just opposing
> > > the whole vfio-mdev? If not, I'm curious about the criteria in your mind 
> > > about when using vfio-mdev is good...  
> > 
> > It is appropriate when non-PCI standard techniques are needed to do
> > raw device assignment, just like VFIO.
> > 
> > Basically if vfio-pci is already doing it then it seems reasonable
> > that vfio-mdev should do the same. This mission creep where vfio-mdev
> > gains functionality far beyond VFIO is the problem.
> 
> Ehm, vfio-pci emulates BARs too.  We also emulate FLR, power
> management, DisINTx, and VPD.  FLR, PM, and VPD all have device
> specific quirks in the host kernel, and I've generally taken the stance
> that would should take advantage of those quirks, not duplicate them in
> userspace and not invent new access mechanisms/ioctls for each of them.
> Emulating DisINTx is convenient since we must have a mechanism to mask
> INTx, whether it's at the device or the APIC, so we can pretend the
> hardware supports it.  BAR emulation is really too trivial to argue
> about, the BARs mean nothing to the physical device mapping, they're
> simply scratch registers that we mask out the alignment bits on read.
> vfio-pci is a mix of things that we decide are too complicated or
> irrelevant to emulate in the kernel and things that take advantage of
> shared quirks or are just too darn easy to worry about.  BARs fall into
> that latter category, any sort of mapping into VM address spaces is
> necessarily done in userspace, but scratch registers that are masked on
> read, *shrug*, vfio-pci does that.  Thanks,

It is not trivial masking. It is a 2000 line patch doing comprehensive
emulation.

Jason
Tian, Kevin April 27, 2020, 12:13 p.m. UTC | #21
> From: Jason Gunthorpe <jgg@mellanox.com>
> Sent: Monday, April 27, 2020 3:14 AM
[...]
> > technically Scalable IOV is definitely different from SR-IOV. It's
> > simpler in hardware. And we're not emulating SR-IOV. The point
> > is just in usage-wise we want to present a consistent user
> > experience just like passing through a PCI endpoint (PF or VF) device
> > through vfio eco-system, including various userspace VMMs (Qemu,
> > firecracker, rust-vmm, etc.), middleware (Libvirt), and higher level
> > management stacks.
> 
> Yes, I understand your desire, but at the same time we have not been
> doing device emulation in the kernel. You should at least be
> forthwright about that major change in the cover letters/etc.

I searched 'emulate' in kernel/Documentation:

Documentation/sound/alsa-configuration.rst (emulate oss on alsa)
Documentation/security/tpm/tpm_vtpm_proxy.rst (emulate virtual TPM)
Documentation/networking/generic-hdlc.txt (emulate eth on HDLC)
Documentation/gpu/todo.rst (generic fbdev emulation)
...

I believe the main reason for putting such emulations in the kernel is
that those emulated device interfaces have established eco-systems and
value which the kernel shouldn't break. As you emphasized earlier, they
have good reasons for getting into the kernel.

Then back to this context. Almost every newly-born Linux VMM
(firecracker, crosvm, cloud hypervisor, and some proprietary
implementations) supports only two types of devices: virtio and
vfio, because they want to be simple and slim. Virtio provides a
basic set of I/O capabilities required by most VMs, while vfio brings
a unified interface for gaining added value or higher performance
from assigned devices. Even Qemu supports a minimal configuration
('microvm') now, for a similar reason. So the vfio eco-system is
significant and represents a major trend in the virtualization space.

Supporting the vfio eco-system is thus the actual usage GOAL of this
patch series, rather than an optional technique. vfio-pci is there for
passing through standalone PCI endpoints (PF or VF), and vfio-mdev is
there for passing through a smaller portion of device resources while
sharing the same VFIO interface, to gain uniform support in this
eco-system.

I believe the above is a good reason for putting emulation in the idxd
driver by using vfio-mdev. Yes, it does imply that there will be
more emulation in the kernel when more Scalable-IOV (or alike)
devices are introduced. But as explained earlier, the pci config
space emulation can be largely consolidated and reused, and
the remaining device-specific MMIO emulation is relatively
simple because we define the virtual device interface to be the same
as, or even simpler than, a VF interface. Only a small set of registers
is emulated after the fast-path resources are passed through, and
that small set of course needs to meet the normal quality
requirements for getting into the kernel.

We'll definitely highlight this part in future cover letters.
Jason Gunthorpe April 27, 2020, 12:55 p.m. UTC | #22
On Mon, Apr 27, 2020 at 12:13:33PM +0000, Tian, Kevin wrote:

> Then back to this context. Almost every newly-born Linux VMM
> (firecracker, crosvm, cloud hypervisor, and some proprietary 
> implementations) support only two types of devices: virtio and 
> vfio, because they want to be simple and slim.

For security. Moving all the sketchy emulation code into the kernel
seems like a worse security posture overall :(

Jason
Alex Williamson April 27, 2020, 1:19 p.m. UTC | #23
On Mon, 27 Apr 2020 08:58:18 -0300
Jason Gunthorpe <jgg@mellanox.com> wrote:

> On Sun, Apr 26, 2020 at 09:43:55PM -0600, Alex Williamson wrote:
> > On Sun, 26 Apr 2020 16:13:57 -0300
> > Jason Gunthorpe <jgg@mellanox.com> wrote:
> >   
> > > On Sun, Apr 26, 2020 at 05:18:59AM +0000, Tian, Kevin wrote:
> > >   
> > > > > > I think providing an unified abstraction to userspace is also important,
> > > > > > which is what VFIO provides today. The merit of using one set of VFIO
> > > > > > API to manage all kinds of mediated devices and VF devices is a major
> > > > > > gain. Instead, inventing a new vDPA-like interface for every Scalable-IOV
> > > > > > or equivalent device is just overkill and doesn't scale. Also the actual
> > > > > > emulation code in idxd driver is actually small, if putting aside the PCI
> > > > > > config space part for which I already explained most logic could be shared
> > > > > > between mdev device drivers.    
> > > > > 
> > > > > If it was just config space you might have an argument, VFIO already
> > > > > does some config space mangling, but emulating BAR space is out of
> > > > > scope of VFIO, IMHO.    
> > > > 
> > > > out of scope of vfio-pci, but in scope of vfio-mdev. btw I feel that most
> > > > of your objections are actually related to the general idea of
> > > > vfio-mdev.    
> > > 
> > > There have been several abusive proposals of vfio-mdev, everything
> > > from a way to create device drivers to this kind of generic emulation
> > > framework.
> > >   
> > > > Scalable IOV just uses PASID to harden DMA isolation in mediated
> > > > pass-through usage which vfio-mdev enables. Then are you just opposing
> > > > the whole vfio-mdev? If not, I'm curious about the criteria in your mind 
> > > > about when using vfio-mdev is good...    
> > > 
> > > It is appropriate when non-PCI standard techniques are needed to do
> > > raw device assignment, just like VFIO.
> > > 
> > > Basically if vfio-pci is already doing it then it seems reasonable
> > > that vfio-mdev should do the same. This mission creep where vfio-mdev
> > > gains functionality far beyond VFIO is the problem.  
> > 
> > Ehm, vfio-pci emulates BARs too.  We also emulate FLR, power
> > management, DisINTx, and VPD.  FLR, PM, and VPD all have device
> > specific quirks in the host kernel, and I've generally taken the stance
> > that would should take advantage of those quirks, not duplicate them in
> > userspace and not invent new access mechanisms/ioctls for each of them.
> > Emulating DisINTx is convenient since we must have a mechanism to mask
> > INTx, whether it's at the device or the APIC, so we can pretend the
> > hardware supports it.  BAR emulation is really too trivial to argue
> > about, the BARs mean nothing to the physical device mapping, they're
> > simply scratch registers that we mask out the alignment bits on read.
> > vfio-pci is a mix of things that we decide are too complicated or
> > irrelevant to emulate in the kernel and things that take advantage of
> > shared quirks or are just too darn easy to worry about.  BARs fall into
> > that latter category, any sort of mapping into VM address spaces is
> > necessarily done in userspace, but scratch registers that are masked on
> > read, *shrug*, vfio-pci does that.  Thanks,  
> 
> It is not trivial masking. It is a 2000 line patch doing comprehensive
> emulation.

Not sure what you're referring to, I see about 30 lines of code in
vdcm_vidxd_cfg_write() that specifically handle writes to the 4 BARs in
config space and maybe a couple hundred lines of code in total handling
config space emulation.  Thanks,

Alex
Jason Gunthorpe April 27, 2020, 1:22 p.m. UTC | #24
On Mon, Apr 27, 2020 at 07:19:39AM -0600, Alex Williamson wrote:

> > It is not trivial masking. It is a 2000 line patch doing comprehensive
> > emulation.
> 
> Not sure what you're referring to, I see about 30 lines of code in
> vdcm_vidxd_cfg_write() that specifically handle writes to the 4 BARs in
> config space and maybe a couple hundred lines of code in total handling
> config space emulation.  Thanks,

Look around vidxd_do_command()

If I understand this flow properly..

Jason
Alex Williamson April 27, 2020, 2:18 p.m. UTC | #25
On Mon, 27 Apr 2020 10:22:18 -0300
Jason Gunthorpe <jgg@mellanox.com> wrote:

> On Mon, Apr 27, 2020 at 07:19:39AM -0600, Alex Williamson wrote:
> 
> > > It is not trivial masking. It is a 2000 line patch doing comprehensive
> > > emulation.  
> > 
> > Not sure what you're referring to, I see about 30 lines of code in
> > vdcm_vidxd_cfg_write() that specifically handle writes to the 4 BARs in
> > config space and maybe a couple hundred lines of code in total handling
> > config space emulation.  Thanks,  
> 
> Look around vidxd_do_command()
> 
> If I understand this flow properly..

I've only glanced at it, but that's called in response to a write to
MMIO space on the device, so it's implementing a device-specific
register.  Are you asking that PCI config space be done in userspace,
or any sort of device emulation?  The assumption with mdev is that we
need emulation in the host kernel because we need a trusted entity to
mediate device access and interact with the privileged portion of the
device control.  Thanks,

Alex
Jason Gunthorpe April 27, 2020, 2:25 p.m. UTC | #26
On Mon, Apr 27, 2020 at 08:18:41AM -0600, Alex Williamson wrote:
> On Mon, 27 Apr 2020 10:22:18 -0300
> Jason Gunthorpe <jgg@mellanox.com> wrote:
> 
> > On Mon, Apr 27, 2020 at 07:19:39AM -0600, Alex Williamson wrote:
> > 
> > > > It is not trivial masking. It is a 2000 line patch doing comprehensive
> > > > emulation.  
> > > 
> > > Not sure what you're referring to, I see about 30 lines of code in
> > > vdcm_vidxd_cfg_write() that specifically handle writes to the 4 BARs in
> > > config space and maybe a couple hundred lines of code in total handling
> > > config space emulation.  Thanks,  
> > 
> > Look around vidxd_do_command()
> > 
> > If I understand this flow properly..
> 
> I've only glanced at it, but that's called in response to a write to
> MMIO space on the device, so it's implementing a device specific
> register.

It is doing emulation of the secure BAR. The entire 1000 lines of
vidxd_* functions appear to be focused on this task.

> Are you asking that PCI config space be done in userspace
> or any sort of device emulation?  

I'm concerned about doing full emulation of registers on a MMIO BAR
that trigger complex actions in response to MMIO read/write.

Simple masking and simple config space stuff doesn't seem so
problematic.

> The assumption with mdev is that we need emulation in the host
> kernel because we need a trusted entity to mediate device access and
> interact with privileged portion of the device control.  Thanks,

Sure, but there are all kinds of different levels to this - mdev
should not be some open-ended device emulation framework, IMHO.

I.e. other devices need only a small amount of kernel-side help and
don't need complex MMIO BAR emulation.

Would you be happy if someone proposed an e1000 NIC emulator using
mdev? Why not move every part of qemu's PCI device emulation into the
kernel?

Jason
Alex Williamson April 27, 2020, 3:41 p.m. UTC | #27
On Mon, 27 Apr 2020 11:25:53 -0300
Jason Gunthorpe <jgg@mellanox.com> wrote:

> On Mon, Apr 27, 2020 at 08:18:41AM -0600, Alex Williamson wrote:
> > On Mon, 27 Apr 2020 10:22:18 -0300
> > Jason Gunthorpe <jgg@mellanox.com> wrote:
> >   
> > > On Mon, Apr 27, 2020 at 07:19:39AM -0600, Alex Williamson wrote:
> > >   
> > > > > It is not trivial masking. It is a 2000 line patch doing comprehensive
> > > > > emulation.    
> > > > 
> > > > Not sure what you're referring to, I see about 30 lines of code in
> > > > vdcm_vidxd_cfg_write() that specifically handle writes to the 4 BARs in
> > > > config space and maybe a couple hundred lines of code in total handling
> > > > config space emulation.  Thanks,    
> > > 
> > > Look around vidxd_do_command()
> > > 
> > > If I understand this flow properly..  
> > 
> > I've only glanced at it, but that's called in response to a write to
> > MMIO space on the device, so it's implementing a device specific
> > register.  
> 
> It is doing emulation of the secure BAR. The entire 1000 lines of
> vidxd_* functions appear to be focused on this task.

Ok, we/I need a terminology clarification: a BAR is a register in
config space for determining the size and type, and setting the location,
of an I/O or memory region of a device.  I've been asserting that the
emulation of the BAR itself is trivial, but are you actually focused on
emulation of the region described by the BAR?  This is what mdev is
for, mediating access to a device and filling in gaps such that we can
re-use the vfio device APIs.

> > Are you asking that PCI config space be done in userspace
> > or any sort of device emulation?    
> 
> I'm concerned about doing full emulation of registers on a MMIO BAR
> that trigger complex actions in response to MMIO read/write.

Maybe what you're recalling me say about mdev is that its Achilles
heel is that we rely on the mediation provider (ie. the vendor driver) for
security; we don't necessarily have a piece of known, common hardware
like an IOMMU to protect us when things go wrong.  That's true, but
don't we also trust drivers in the host kernel to correctly manage and
validate their own interactions with hardware, including the APIs
provided through other user interfaces?  Is the assertion then that a
device-specific, register-level API is too difficult to emulate?

> Simple masking and simple config space stuff doesn't seem so
> problematic.
> 
> > The assumption with mdev is that we need emulation in the host
> > kernel because we need a trusted entity to mediate device access and
> > interact with privileged portion of the device control.  Thanks,  
> 
> Sure, but there are all kinds of different levels to this - mdev
> should not be some open ended device emulation framework, IMHO.
> 
> ie other devices need only a small amount of kernel side help and
> don't need complex MMIO BAR emulation.
> 
> Would you be happy if someone proposed an e1000 NIC emulator using
> mdev? Why not move every part of qemu's PCI device emulation into the
> kernel?

Well, in order to mediate a device, we certainly expect there to be a
physical device.  I also expect that there's some performance, or at
least compatibility, advantage to using the device API directly rather
than masquerading everything behind something like virtio.  So no, I
wouldn't expect someone to create a fully emulated device in mdev, but
I do expect some degree of device emulation in an mdev driver to
fill the gaps in the non-performance path that hardware chose to defer
to software.  Thanks,

Alex
Jason Gunthorpe April 27, 2020, 4:16 p.m. UTC | #28
On Mon, Apr 27, 2020 at 09:41:37AM -0600, Alex Williamson wrote:
> On Mon, 27 Apr 2020 11:25:53 -0300
> Jason Gunthorpe <jgg@mellanox.com> wrote:
> 
> > On Mon, Apr 27, 2020 at 08:18:41AM -0600, Alex Williamson wrote:
> > > On Mon, 27 Apr 2020 10:22:18 -0300
> > > Jason Gunthorpe <jgg@mellanox.com> wrote:
> > >   
> > > > On Mon, Apr 27, 2020 at 07:19:39AM -0600, Alex Williamson wrote:
> > > >   
> > > > > > It is not trivial masking. It is a 2000 line patch doing comprehensive
> > > > > > emulation.    
> > > > > 
> > > > > Not sure what you're referring to, I see about 30 lines of code in
> > > > > vdcm_vidxd_cfg_write() that specifically handle writes to the 4 BARs in
> > > > > config space and maybe a couple hundred lines of code in total handling
> > > > > config space emulation.  Thanks,    
> > > > 
> > > > Look around vidxd_do_command()
> > > > 
> > > > If I understand this flow properly..  
> > > 
> > > I've only glanced at it, but that's called in response to a write to
> > > MMIO space on the device, so it's implementing a device specific
> > > register.  
> > 
> > It is doing emulation of the secure BAR. The entire 1000 lines of
> > vidxd_* functions appear to be focused on this task.
> 
> Ok, we/I need a terminology clarification, a BAR is a register in
> config space for determining the size, type, and setting the location
> of a I/O or memory region of a device.  I've been asserting that the
> emulation of the BAR itself is trivial, but are you actually focused on
> emulation of the region described by the BAR?

Yes, BAR here means the actual MMIO memory window - not the config
space part. Config space emulation is largely trivial.

> > > Are you asking that PCI config space be done in userspace
> > > or any sort of device emulation?    
> > 
> > I'm concerned about doing full emulation of registers on a MMIO BAR
> > that trigger complex actions in response to MMIO read/write.
> 
> Maybe what you're recalling me say about mdev is that its Achilles
> heel is that we rely on mediation provider (ie. vendor driver) for
> security, we don't necessarily have an piece of known, common hardware
> like an IOMMU to protect us when things go wrong.  That's true, but
> don't we also trust drivers in the host kernel to correctly manage and
> validate their own interactions with hardware, including the APIs
> provided through other user interfaces.  Is the assertion then that
> device specific, register level API is too difficult to emulate?

No, it is a reflection of the standard Linux philosophy that if it can
be done in userspace it should be done in userspace, i.e. keep minimal
work in the monolithic kernel.

It is also to avoid duplication: idxd proposes to have a char dev with a
normal kernel driver interface and then an in-kernel emulated MMIO BAR
version of that same capability for VFIO consumption.

Jason
Dave Jiang April 27, 2020, 4:25 p.m. UTC | #29
On 4/27/2020 9:16 AM, Jason Gunthorpe wrote:
> On Mon, Apr 27, 2020 at 09:41:37AM -0600, Alex Williamson wrote:
>> On Mon, 27 Apr 2020 11:25:53 -0300
>> Jason Gunthorpe <jgg@mellanox.com> wrote:
>>
>>> On Mon, Apr 27, 2020 at 08:18:41AM -0600, Alex Williamson wrote:
>>>> On Mon, 27 Apr 2020 10:22:18 -0300
>>>> Jason Gunthorpe <jgg@mellanox.com> wrote:
>>>>    
>>>>> On Mon, Apr 27, 2020 at 07:19:39AM -0600, Alex Williamson wrote:
>>>>>    
>>>>>>> It is not trivial masking. It is a 2000 line patch doing comprehensive
>>>>>>> emulation.
>>>>>>
>>>>>> Not sure what you're referring to, I see about 30 lines of code in
>>>>>> vdcm_vidxd_cfg_write() that specifically handle writes to the 4 BARs in
>>>>>> config space and maybe a couple hundred lines of code in total handling
>>>>>> config space emulation.  Thanks,
>>>>>
>>>>> Look around vidxd_do_command()
>>>>>
>>>>> If I understand this flow properly..
>>>>
>>>> I've only glanced at it, but that's called in response to a write to
>>>> MMIO space on the device, so it's implementing a device specific
>>>> register.
>>>
>>> It is doing emulation of the secure BAR. The entire 1000 lines of
>>> vidxd_* functions appear to be focused on this task.
>>
>> Ok, we/I need a terminology clarification, a BAR is a register in
>> config space for determining the size, type, and setting the location
>> of a I/O or memory region of a device.  I've been asserting that the
>> emulation of the BAR itself is trivial, but are you actually focused on
>> emulation of the region described by the BAR?
> 
> Yes, BAR here means the actually MMIO memory window - not the config
> space part. Config space emulation is largely trivial.
> 
>>>> Are you asking that PCI config space be done in userspace
>>>> or any sort of device emulation?
>>>
>>> I'm concerned about doing full emulation of registers on a MMIO BAR
>>> that trigger complex actions in response to MMIO read/write.
>>
>> Maybe what you're recalling me say about mdev is that its Achilles
>> heel is that we rely on mediation provider (ie. vendor driver) for
>> security, we don't necessarily have an piece of known, common hardware
>> like an IOMMU to protect us when things go wrong.  That's true, but
>> don't we also trust drivers in the host kernel to correctly manage and
>> validate their own interactions with hardware, including the APIs
>> provided through other user interfaces.  Is the assertion then that
>> device specific, register level API is too difficult to emulate?
> 
> No, it is a reflection on the standard Linux philosophy that if it can
> be done in user space it should be done in userspace. Ie keep minimal
> work in the monolithic kernel.
> 
> Also to avoid duplication, ie idxd proposes to have a char dev with a
> normal kernel driver interface and then an in-kernel emulated MMIO BAR
> version of that same capability for VFIO consumption.

The char dev interface serves user apps on the host (and we will deprecate
it and move to the UACCE framework in the near future). The mdev interface
will be servicing guests only. I'm not sure where the duplication of
functionality comes into play.

> 
> Jason
>
Jason Gunthorpe April 27, 2020, 9:56 p.m. UTC | #30
On Mon, Apr 27, 2020 at 09:25:58AM -0700, Dave Jiang wrote:
> > Also to avoid duplication, ie idxd proposes to have a char dev with a
> > normal kernel driver interface and then an in-kernel emulated MMIO BAR
> > version of that same capability for VFIO consumption.
> 
> The char dev interface serves user apps on host (which we will deprecate and
> move to the UACCE framework in near future). 

The point is the char dev or UACCE framework should provide enough
capability to implement the emulation in user space.

Jason
Tian, Kevin April 29, 2020, 9:42 a.m. UTC | #31
> From: Jason Gunthorpe <jgg@mellanox.com>
> Sent: Monday, April 27, 2020 9:22 PM
> 
> On Mon, Apr 27, 2020 at 07:19:39AM -0600, Alex Williamson wrote:
> 
> > > It is not trivial masking. It is a 2000 line patch doing comprehensive
> > > emulation.
> >
> > Not sure what you're referring to, I see about 30 lines of code in
> > vdcm_vidxd_cfg_write() that specifically handle writes to the 4 BARs in
> > config space and maybe a couple hundred lines of code in total handling
> > config space emulation.  Thanks,
> 
> Look around vidxd_do_command()
> 
> If I understand this flow properly..
> 

Hi, Jason,

I guess the 2000 lines mostly refer to the changes in mdev.c and vdev.c. 
We did a break-down among them:

1) ~150 LOC for vdev initialization
2) ~150 LOC for cfg space emulation
3) ~230 LOC for mmio r/w emulation
4) ~500 LOC for controlling the work queue (vidxd_do_command), 
triggered by write emulation of IDXD_CMD_OFFSET register
5) the remaining lines are all about vfio-mdev registration/callbacks,
for reporting mmio/irq resource, eventfd, mmap, etc.

1/2/3) are pure device emulation, which accounts for ~500 LOC.

4) needs to be in the kernel regardless of which uAPI is used, because it
talks to the physical work queue (enable, disable, drain, abort, reset, etc.)

Then, if we are just talking about the ~500 LOC of emulation code left in
the kernel, is it still a big concern to you?
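For concreteness, a minimal sketch of the kind of mmio write emulation
counted in 3) and 4) above: a trapped guest write to the virtual command
register is dispatched to the code that drives the physical work queue.
The structure, names, and offsets here are illustrative assumptions only,
not the actual idxd code:

  struct vdev {
          u32 shadow_regs[0x400];         /* shadow of the emulated MMIO BAR */
          /* ... handle to the physical work queue, eventfds, etc. ... */
  };

  /* Talks to the real wq: enable, disable, drain, abort, reset, ... */
  static int vdev_do_command(struct vdev *vdev, u32 cmd);

  #define VCMD_OFFSET     0xa0            /* illustrative command register */

  /* Hypothetical dispatch for trapped, 4-byte guest MMIO writes. */
  static int vdev_mmio_write(struct vdev *vdev, u64 offset, u32 val)
  {
          if (offset >= sizeof(vdev->shadow_regs))
                  return -EINVAL;

          if (offset == VCMD_OFFSET)
                  /* the "complex action in response to MMIO write" */
                  return vdev_do_command(vdev, val);

          /* everything else is a simple shadow register */
          vdev->shadow_regs[offset >> 2] = val;
          return 0;
  }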
Dey, Megha May 1, 2020, 10:31 p.m. UTC | #32
Hi Jason,

On 4/23/2020 12:49 PM, Jason Gunthorpe wrote:
> On Thu, Apr 23, 2020 at 12:17:50PM -0700, Dan Williams wrote:
> 
>> Per Megha's follow-up can you send the details about that other device
>> and help clear a path for a device-specific MSI addr/data table
>> format. Ever since HMM I've been sensitive, perhaps overly-sensitive,
>> to claims about future upstream users. The fact that you have an
>> additional use case is golden for pushing this into a common area and
>> validating the scope of the proposed API.
> 
> I think I said it at plumbers, but yes, we are interested in this, and
> would like dynamic MSI-like interrupts available to the driver (what
> Intel calls IMS)
> 

So basically you are looking for a way to dynamically allocate the 
platform-msi interrupts, correct?

Since I don't have access to any of the platform-msi devices, it is hard
for me to test this code for drivers other than idxd for now.
Once I submit the next round of patches, after addressing all the 
comments, would it be possible for you to test this code for any of your 
devices?

> It is something easy enough to illustrate with any RDMA device really,
> just open a MR against the addr and use RDMA_WRITE to trigger the
> data. It should trigger a Linux IRQ. Nothing else should be needed.
> 
> Jason
>
Dey, Megha May 1, 2020, 10:31 p.m. UTC | #33
Hi Jason,

On 4/23/2020 12:18 PM, Jason Gunthorpe wrote:
> On Wed, Apr 22, 2020 at 02:24:11PM -0700, Dan Williams wrote:
>> On Tue, Apr 21, 2020 at 4:55 PM Jason Gunthorpe <jgg@mellanox.com> wrote:
>>>
>>> On Tue, Apr 21, 2020 at 04:33:46PM -0700, Dave Jiang wrote:
>>>> The actual code is independent of the stage 2 driver code submission that adds
>>>> support for SVM, ENQCMD(S), PASID, and shared workqueues. This code series will
>>>> support dedicated workqueue on a guest with no vIOMMU.
>>>>
>>>> A new device type "mdev" is introduced for the idxd driver. This allows the wq
>>>> to be dedicated to the usage of a VFIO mediated device (mdev). Once the work
>>>> queue (wq) is enabled, an uuid generated by the user can be added to the wq
>>>> through the uuid sysfs attribute for the wq.  After the association, a mdev can
>>>> be created using this UUID. The mdev driver code will associate the uuid and
>>>> setup the mdev on the driver side. When the create operation is successful, the
>>>> uuid can be passed to qemu. When the guest boots up, it should discover a DSA
>>>> device when doing PCI discovery.
>>>
>>> I'm feeling really skeptical that adding all this PCI config space and
>>> MMIO BAR emulation to the kernel just to cram this into a VFIO
>>> interface is a good idea, that kind of stuff is much safer in
>>> userspace.
>>>
>>> Particularly since vfio is not really needed once a driver is using
>>> the PASID stuff. We already have general code for drivers to use to
>>> attach a PASID to a mm_struct - and using vfio while disabling all the
>>> DMA/iommu config really seems like an abuse.
>>>
>>> A /dev/idxd char dev that mmaps a bar page and links it to a PASID
>>> seems a lot simpler and saner kernel wise.
>>>
>>>> The mdev utilizes Interrupt Message Store or IMS[3] instead of MSIX for
>>>> interrupts for the guest. This preserves MSIX for host usages and also allows a
>>>> significantly larger number of interrupt vectors for guest usage.
>>>
>>> I never did get a reply to my earlier remarks on the IMS patches.
>>>
>>> The concept of a device specific addr/data table format for MSI is not
>>> Intel specific. This should be general code. We have a device that can
>>> use this kind of kernel capability today.
>>
>> This has been my concern reviewing the implementation. IMS needs more
>> than one in-tree user to validate degrees of freedom in the api. I had
>> been missing a second "in-tree user" to validate the scope of the
>> flexibility that was needed.
> 
> IMS is too narrowly specified.
> 
> All platforms that support MSI today can support IMS. It is simply a
> way for the platform to give the driver an addr/data pair that triggers
> an interrupt when a posted write is performed to that pair.
> 

Well, yes and no. IMS requires interrupt remapping in addition to the 
dynamic nature of IRQ allocation.

> This is different from the other interrupt setup flows which are
> tightly tied to the PCI layer. Here the driver should simply ask for
> interrupts.
> 
> Ie the entire IMS API to the driver should be something very simple
> like:
> 
>   struct message_irq
>   {
>     uint64_t addr;
>     uint32_t data;
>   };
> 
>   struct message_irq *request_message_irq(
>      struct device *, irq_handler_t handler, unsigned long flags,
>      const char *name, void *dev);
> 
> And the plumbing underneath should setup the irq chips and so forth as
> required.
> 

yes, this seems correct.
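
For what it's worth, here is a minimal sketch of how a driver might consume
an API shaped like the one proposed above. request_message_irq() and
struct message_irq are the hypothetical interface from the proposal, not an
existing kernel API, and the device structure and register layout are
illustrative assumptions:

  /* kernel context assumed: linux/interrupt.h, linux/io.h, linux/pci.h */

  struct my_dev {                         /* illustrative driver state */
          struct pci_dev *pdev;
          void __iomem *ims_slot;         /* device-managed interrupt storage */
  };

  static irqreturn_t my_dev_irq_handler(int irq, void *data)
  {
          /* handle whatever the posted write signalled */
          return IRQ_HANDLED;
  }

  static int my_dev_setup_irq(struct my_dev *mydev)
  {
          struct message_irq *mirq;

          /* hypothetical API from the proposal above */
          mirq = request_message_irq(&mydev->pdev->dev, my_dev_irq_handler,
                                     0, "my_dev", mydev);
          if (IS_ERR(mirq))
                  return PTR_ERR(mirq);

          /*
           * Program the addr/data pair into device-specific storage;
           * this is the "IMS" part - the device, not a PCI MSI-X table,
           * owns where the message lives (the layout is illustrative).
           */
          writeq(mirq->addr, mydev->ims_slot);
          writel(mirq->data, mydev->ims_slot + 8);
          return 0;
  }

The key point is that the driver only ever sees a Linux IRQ plus an
addr/data pair; where that pair is stored is entirely device-specific.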
> Jason
>
Dey, Megha May 1, 2020, 10:32 p.m. UTC | #34
On 4/23/2020 12:44 PM, Jason Gunthorpe wrote:
>>>> The mdev utilizes Interrupt Message Store or IMS[3] instead of MSIX for
>>>> interrupts for the guest. This preserves MSIX for host usages and also allows a
>>>> significantly larger number of interrupt vectors for guest usage.
>>>
>>> I never did get a reply to my earlier remarks on the IMS patches.
>>>
>>> The concept of a device specific addr/data table format for MSI is not
>>> Intel specific. This should be general code. We have a device that can
>>> use this kind of kernel capability today.
>>
>> I am sorry if I did not address your comments earlier.
> 
> It appears noboy from Intel bothered to answer anyone else on that RFC
> thread:
> 
> https://lore.kernel.org/lkml/1568338328-22458-1-git-send-email-megha.dey@linux.intel.com/
> 
> However, it seems kind of moot as I see now that this verion of IMS
> bears almost no resemblance to the original RFC.

hmm yeah, we changed most of the code after getting a lot of feedback 
from you and folks at plumbers. But yes, I should have replied to all 
the feedback, lesson learnt :)

> 
> That said, the similiarity to platform-msi was striking, does this new
> version harmonize with that?

yes!
> 
>> The present IMS code is quite generic, most of the code is in the drivers/
>> folder. We basically introduce 2 APIS: allocate and free IMS interrupts and
>> a IMS IRQ domain to allocate these interrupts from. These APIs are
>> architecture agnostic.
>>
>> We also introduce a new IMS IRQ domain which is architecture specific. This
>> is because IMS generates interrupts only in the remappable format, hence
>> interrupt remapping should be enabled for IMS. Currently, the interrupt
>> remapping code is only available for Intel and AMD and I don’t see anything
>> for ARM.
> 
> I don't understand these remarks though - IMS is simply the mapping of
> a MemWr addr/data pair to a Linux IRQ number? Why does this intersect
> with remapping?
> 

From your comments so far, I think your requirement is a subset of what
IMS is trying to do.

What you want:
have a dynamic means of allocating platform-msi interrupts

On top of this, IMS has a requirement that all of the interrupts should
be remapped.

So we can have tiered code: a generic dynamic platform-msi infrastructure,
with the IMS-specific (Intel-specific) bits added on top of it.

The generic code will have no reference to IMS.

> AFAIK, any platform that supports MSI today should have the inherent
> HW capability to support IMS.
> 
>> Also, could you give more details on the device that could use IMS? Do you
>> have some driver code already? We could then see if and how the current IMS
>> code could be made more generic.
> 
> We have several devices of interest, our NICs have very flexible PCI,
> so it is no problem to take the MemWR addr/data from someplace other
> than the MSI tables.
> 
> For this we want to have some way to allocate Linux IRQs dynamically
> and get a addr/data pair to trigger them.
> 
> Our NIC devices are also linked to our ARM SOC family, so I'd expect
> our ARM's to also be able to provide these APIs as the platform.

Cool, I hope that you can test out the generic APIs from the ARM
side!
> 
> Jason
>
Jason Gunthorpe May 3, 2020, 10:21 p.m. UTC | #35
On Fri, May 01, 2020 at 03:31:14PM -0700, Dey, Megha wrote:
> Hi Jason,
> 
> On 4/23/2020 12:49 PM, Jason Gunthorpe wrote:
> > On Thu, Apr 23, 2020 at 12:17:50PM -0700, Dan Williams wrote:
> > 
> > > Per Megha's follow-up can you send the details about that other device
> > > and help clear a path for a device-specific MSI addr/data table
> > > format. Ever since HMM I've been sensitive, perhaps overly-sensitive,
> > > to claims about future upstream users. The fact that you have an
> > > additional use case is golden for pushing this into a common area and
> > > validating the scope of the proposed API.
> > 
> > I think I said it at plumbers, but yes, we are interested in this, and
> > would like dynamic MSI-like interrupts available to the driver (what
> > Intel calls IMS)
> > 
> 
> So basically you are looking for a way to dynamically allocate the
> platform-msi interrupts, correct?

The basic high level interface here seems fine; it is basically a
way for a driver to grab a bunch of platform-msi interrupts for its
own use.
 
> Since I don't have access to any of the platform-msi devices, it is hard for
> me to test this code for other drivers expect idxd for now.
> Once I submit the next round of patches, after addressing all the comments,
> would it be possible for you to test this code for any of your devices?

Possibly, need to find time

Jason
Jason Gunthorpe May 3, 2020, 10:22 p.m. UTC | #36
On Fri, May 01, 2020 at 03:31:51PM -0700, Dey, Megha wrote:
> > > This has been my concern reviewing the implementation. IMS needs more
> > > than one in-tree user to validate degrees of freedom in the api. I had
> > > been missing a second "in-tree user" to validate the scope of the
> > > flexibility that was needed.
> > 
> > IMS is too narrowly specified.
> > 
> > All platforms that support MSI today can support IMS. It is simply a
> > way for the platform to give the driver an addr/data pair that triggers
> > an interrupt when a posted write is performed to that pair.
> > 
> 
> Well, yes and no. IMS requires interrupt remapping in addition to the
> dynamic nature of IRQ allocation.

You've mentioned remapping a few times, but I really can't understand
why it has anything to do with platform_msi or IMS..

Jason
Dey, Megha May 3, 2020, 10:31 p.m. UTC | #37
Hi Jason,

On 5/3/2020 3:22 PM, Jason Gunthorpe wrote:
> On Fri, May 01, 2020 at 03:31:51PM -0700, Dey, Megha wrote:
>>>> This has been my concern reviewing the implementation. IMS needs more
>>>> than one in-tree user to validate degrees of freedom in the api. I had
>>>> been missing a second "in-tree user" to validate the scope of the
>>>> flexibility that was needed.
>>>
>>> IMS is too narrowly specified.
>>>
>>> All platforms that support MSI today can support IMS. It is simply a
>>> way for the platform to give the driver an addr/data pair that triggers
>>> an interrupt when a posted write is performed to that pair.
>>>
>>
>> Well, yes and no. IMS requires interrupt remapping in addition to the
>> dynamic nature of IRQ allocation.
> 
> You've mentioned remapping a few times, but I really can't understand
> why it has anything to do with platform_msi or IMS..

So after some internal discussions, we have concluded that IMS has no
linkage with interrupt remapping; IR is just a platform concept. IMS is
just a name Intel came up with; all it really means is device-managed
addr/data writes to generate interrupts. Technically we can call
something IMS even if the device has its own location to store interrupts
in a non-PCI-standard mechanism, much like platform-msi indeed. We simply
need to extend platform-msi to address some of its shortcomings:
increase the number of interrupts to > 2048, enable dynamic allocation of
interrupts, add mask/unmask callbacks in addition to write_msg, etc.
FWIW, even MSI can be seen as IMS with rules on how to manage the
addr/data writes following the PCI SIG spec; it really is just that.
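
As a point of reference, here is a minimal sketch of how a driver
allocates platform-msi interrupts with the existing infrastructure today;
the write_msg callback is exactly where a device-managed ("IMS"-style)
addr/data write would go. The platform_msi calls are the existing kernel
API; the device structure, handler, and register layout are illustrative
assumptions:

  /* kernel context assumed: linux/msi.h, linux/interrupt.h, linux/io.h */

  struct my_dev {                         /* illustrative driver state */
          struct device *dev;             /* drvdata assumed to point here */
          void __iomem *ims_base;         /* device-managed interrupt storage */
  };

  static irqreturn_t my_dev_irq_handler(int irq, void *data)
  {
          return IRQ_HANDLED;
  }

  /* Called by the MSI core with the addr/data pair for each vector. */
  static void my_dev_write_msi_msg(struct msi_desc *desc, struct msi_msg *msg)
  {
          struct my_dev *mydev = dev_get_drvdata(msi_desc_to_dev(desc));
          void __iomem *slot = mydev->ims_base +
                               desc->platform.msi_index * 16;

          writel(msg->address_lo, slot);
          writel(msg->address_hi, slot + 4);
          writel(msg->data, slot + 8);
  }

  static int my_dev_alloc_irqs(struct my_dev *mydev, unsigned int nvec)
  {
          struct msi_desc *desc;
          int ret;

          ret = platform_msi_domain_alloc_irqs(mydev->dev, nvec,
                                               my_dev_write_msi_msg);
          if (ret)
                  return ret;

          for_each_msi_entry(desc, mydev->dev)
                  request_irq(desc->irq, my_dev_irq_handler, 0,
                              "my_dev", mydev);   /* error handling omitted */
          return 0;
  }

The pieces discussed above would sit on top of this: dynamic (post-probe)
allocation, more than 2048 vectors, and per-vector mask/unmask callbacks.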

I will be sending out an email shortly outlining the new design for IMS
(a.k.a. platform-msi part 2) and the improvements we want to add to the
existing platform-msi infrastructure.

Thank you so much for your comments; they helped us iron out some of these
details :)

> 
> Jason
>
Dey, Megha May 3, 2020, 10:32 p.m. UTC | #38
Hi Jason,

On 5/3/2020 3:21 PM, Jason Gunthorpe wrote:
> On Fri, May 01, 2020 at 03:31:14PM -0700, Dey, Megha wrote:
>> Hi Jason,
>>
>> On 4/23/2020 12:49 PM, Jason Gunthorpe wrote:
>>> On Thu, Apr 23, 2020 at 12:17:50PM -0700, Dan Williams wrote:
>>>
>>>> Per Megha's follow-up can you send the details about that other device
>>>> and help clear a path for a device-specific MSI addr/data table
>>>> format. Ever since HMM I've been sensitive, perhaps overly-sensitive,
>>>> to claims about future upstream users. The fact that you have an
>>>> additional use case is golden for pushing this into a common area and
>>>> validating the scope of the proposed API.
>>>
>>> I think I said it at plumbers, but yes, we are interested in this, and
>>> would like dynamic MSI-like interrupts available to the driver (what
>>> Intel calls IMS)
>>>
>>
>> So basically you are looking for a way to dynamically allocate the
>> platform-msi interrupts, correct?
> 
> The basic high level interface here seems fine, which is bascially a
> way for a driver to grab a bunch of platform-msi interrupts for its
> own use

ok!
>   
>> Since I don't have access to any of the platform-msi devices, it is hard for
>> me to test this code for other drivers expect idxd for now.
>> Once I submit the next round of patches, after addressing all the comments,
>> would it be possible for you to test this code for any of your devices?
> 
> Possibly, need to find time
> 
Sure, thanks!

> Jason
>
Jason Gunthorpe May 3, 2020, 10:36 p.m. UTC | #39
On Sun, May 03, 2020 at 03:31:39PM -0700, Dey, Megha wrote:
> 
> Hi Jason,
> 
> On 5/3/2020 3:22 PM, Jason Gunthorpe wrote:
> > On Fri, May 01, 2020 at 03:31:51PM -0700, Dey, Megha wrote:
> > > > > This has been my concern reviewing the implementation. IMS needs more
> > > > > than one in-tree user to validate degrees of freedom in the api. I had
> > > > > been missing a second "in-tree user" to validate the scope of the
> > > > > flexibility that was needed.
> > > > 
> > > > IMS is too narrowly specified.
> > > > 
> > > > All platforms that support MSI today can support IMS. It is simply a
> > > > way for the platform to give the driver an addr/data pair that triggers
> > > > an interrupt when a posted write is performed to that pair.
> > > > 
> > > 
> > > Well, yes and no. IMS requires interrupt remapping in addition to the
> > > dynamic nature of IRQ allocation.
> > 
> > You've mentioned remapping a few times, but I really can't understand
> > why it has anything to do with platform_msi or IMS..
> 
> So after some internal discussions, we have concluded that IMS has no
> linkage with Interrupt remapping, IR is just a platform concept. IMS is just
> a name Intel came up with, all it really means is device managed addr/data
> writes to generate interrupts. Technically we can call something IMS even if
> device has its own location to store interrupts in non-pci standard
> mechanism, much like platform-msi indeed. We simply need to extend
> platform-msi to its address some of its shortcomings: increase number of
> interrupts to > 2048, enable dynamic allocation of interrupts, add
> mask/unmask callbacks in addition to write_msg etc.

Sounds right to me.

Presumably you still need a way for the driver, e.g. vfio, to ensure an
MSI is remappable, but shouldn't that be exactly the same way as done
in normal PCI MSI today?

> FWIW, even MSI can be IMS with rules on how to manage the addr/data writes
> following pci sig .. its just that.

Yep, IMHO, our whole handling of MSI is very un-general sometimes..

I thought the msi_domain stuff that some platforms are using is a way
to improve on that? You might find that updating x86 to use msi_domain
might be helpful in this project???

Jason
Dey, Megha May 4, 2020, 12:20 a.m. UTC | #40
On 5/3/2020 3:36 PM, Jason Gunthorpe wrote:
> On Sun, May 03, 2020 at 03:31:39PM -0700, Dey, Megha wrote:
>>
>> Hi Jason,
>>
>> On 5/3/2020 3:22 PM, Jason Gunthorpe wrote:
>>> On Fri, May 01, 2020 at 03:31:51PM -0700, Dey, Megha wrote:
>>>>>> This has been my concern reviewing the implementation. IMS needs more
>>>>>> than one in-tree user to validate degrees of freedom in the api. I had
>>>>>> been missing a second "in-tree user" to validate the scope of the
>>>>>> flexibility that was needed.
>>>>>
>>>>> IMS is too narrowly specified.
>>>>>
>>>>> All platforms that support MSI today can support IMS. It is simply a
>>>>> way for the platform to give the driver an addr/data pair that triggers
>>>>> an interrupt when a posted write is performed to that pair.
>>>>>
>>>>
>>>> Well, yes and no. IMS requires interrupt remapping in addition to the
>>>> dynamic nature of IRQ allocation.
>>>
>>> You've mentioned remapping a few times, but I really can't understand
>>> why it has anything to do with platform_msi or IMS..
>>
>> So after some internal discussions, we have concluded that IMS has no
>> linkage with Interrupt remapping, IR is just a platform concept. IMS is just
>> a name Intel came up with, all it really means is device managed addr/data
>> writes to generate interrupts. Technically we can call something IMS even if
>> device has its own location to store interrupts in non-pci standard
>> mechanism, much like platform-msi indeed. We simply need to extend
>> platform-msi to its address some of its shortcomings: increase number of
>> interrupts to > 2048, enable dynamic allocation of interrupts, add
>> mask/unmask callbacks in addition to write_msg etc.
> 
> Sounds right to me
> 
> Persumably you still need a way for the driver, eg vfio, to ensure a
> MSI is remappable, but shouldn't that be exactly the same way as done
> in normal PCI MSI today?

Yes, exactly; it should be done in the same way as PCI-MSI. If IR is
enabled we will have IR_PCI_MSI for platform-msi as well.
> 
>> FWIW, even MSI can be IMS with rules on how to manage the addr/data writes
>> following pci sig .. its just that.
> 
> Yep, IMHO, our whole handling of MSI is very un-general sometimes..
> 
> I thought the msi_domain stuff that some platforms are using is a way
> to improve on that? You might find that updating x86 to use msi_domain
> might be helpful in this project???

yes, we need to take a closer look at this.
> 
> Jason
>
Ashok Raj May 8, 2020, 8:47 p.m. UTC | #41
Hi Jason

In general your idea of moving pure emulation code to user space 
is a good strategy.


On Wed, Apr 29, 2020 at 02:42:20AM -0700, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@mellanox.com>
> > Sent: Monday, April 27, 2020 9:22 PM
> > 
> > On Mon, Apr 27, 2020 at 07:19:39AM -0600, Alex Williamson wrote:
> > 
> > > > It is not trivial masking. It is a 2000 line patch doing comprehensive
> > > > emulation.
> > >
> > > Not sure what you're referring to, I see about 30 lines of code in
> > > vdcm_vidxd_cfg_write() that specifically handle writes to the 4 BARs in
> > > config space and maybe a couple hundred lines of code in total handling
> > > config space emulation.  Thanks,
> > 
> > Look around vidxd_do_command()
> > 
> > If I understand this flow properly..
> > 
> 
> Hi, Jason,
> 
> I guess the 2000 lines mostly refer to the changes in mdev.c and vdev.c. 
> We did a break-down among them:
> 
> 1) ~150 LOC for vdev initialization
> 2) ~150 LOC for cfg space emulation
> 3) ~230 LOC for mmio r/w emulation
> 4) ~500 LOC for controlling the work queue (vidxd_do_command), 
> triggered by write emulation of IDXD_CMD_OFFSET register
> 5) the remaining lines are all about vfio-mdev registration/callbacks,
> for reporting mmio/irq resource, eventfd, mmap, etc.
> 
> 1/2/3) are pure device emulation, which counts for ~500 LOC. 
> 
> 4) needs be in the kernel regardless of which uAPI is used, because it
> talks to the physical work queue (enable, disable, drain, abort, reset, etc.)
> 
> Then if just talking about ~500 LOC emulation code left in the kernel, 
> is it still a big concern to you? 
Jason Gunthorpe May 8, 2020, 11:16 p.m. UTC | #42
On Fri, May 08, 2020 at 01:47:10PM -0700, Raj, Ashok wrote:

> Even when uaccel was under development, one of the options
> was to use VFIO as the transport, goal was the same i.e to keep
> the user space have one interface. 

I feel a bit out of the loop here - uaccel isn't in today's kernel, is
it? I've heard about it for a while; it sounds very similar to RDMA,
so I hope they took some of my advice...

> But the needs of generic user space application is significantly
> different from exporting a more functional device model to guest,
> which isn't full emulated device. which is why VFIO didn't make
> sense for native use.

I'm not sure this is true. We've done these kinds of emulated SIOV-like
things already, and there is a huge overlap between what a generic
user application needs and what the VMM needs - actually almost a
perfect subset, except for interrupt remapping (which is quite
trivial).

The things vfio focuses on, like groups and managing a real config
space just don't apply here.

> And when we move things from VFIO which is already established
> as a general device model and accepted by multiple VMM's it gives
> instant footing without a whole redesign. 

Yes, I understand, but I think you need to get more people to support
this idea. From my standpoint this is taking secure, lean VMMs and
putting emulation code back into them, except in a more dangerous
kernel location. This does not seem like a net win to me.

You'd be much better off with some userspace library scheme instead of
being completely tied to a kernel interface for modularity.

> When we move things from VFIO to uaccel to bolt on the functionality
> like VFIO, I suspect we would be moving code/functionality from VFIO
> to Uaccel. I don't know what the net gain would be.

Most of VFIO functionality is already decomposed inside the kernel,
and you need most of it to do secure user access anyhow.

> For mdev, would you agree we can keep the current architecture,
> and investigate moving some emulation code to user space (say even for
> standard vfio_pci) and then expand scope later.

I won't hard NAK this, but I think you need more people to support
this general idea of more emulation code in the kernel to go ahead -
particularly since this is one of many future drivers along this
design.

It would be good to hear from the VMM teams that this is what they
want (and why), for instance.

Jason
Dave Jiang May 8, 2020, 11:52 p.m. UTC | #43
On 5/8/2020 4:16 PM, Jason Gunthorpe wrote:
> On Fri, May 08, 2020 at 01:47:10PM -0700, Raj, Ashok wrote:
> 
>> Even when uaccel was under development, one of the options
>> was to use VFIO as the transport, goal was the same i.e to keep
>> the user space have one interface.
> 
> I feel a bit out of the loop here, uaccel isn't in today's kernel is
> it? I've heard about it for a while, it sounds very similar to RDMA,
> so I hope they took some of my advice...

It went into the 5.7 kernel as drivers/misc/uacce. It looks like a char device
exported with SVM support.
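
For anyone who wants to see what that interface amounts to from userspace,
here is a minimal sketch against the uacce UAPI as merged (UACCE_CMD_START_Q
plus mmap regions selected by page offset); the device node name is made up:

/* Minimal uacce usage sketch; the device path is hypothetical. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <misc/uacce/uacce.h>	/* UACCE_CMD_START_Q, UACCE_QFRT_* */

int main(void)
{
	long pg = sysconf(_SC_PAGESIZE);
	int fd = open("/dev/some_accel-0", O_RDWR);	/* hypothetical node name */
	void *mmio;

	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* Map the device MMIO region (doorbells etc.); region picked by offset. */
	mmio = mmap(NULL, pg, PROT_READ | PROT_WRITE, MAP_SHARED,
		    fd, UACCE_QFRT_MMIO * pg);
	if (mmio == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* Start the queue; on an SVA-capable device, process vaddrs can now be
	 * handed to the hardware directly in the descriptors. */
	if (ioctl(fd, UACCE_CMD_START_Q) < 0)
		perror("UACCE_CMD_START_Q");

	/* ... submit vaddr-based work through the mapped queue here ... */

	munmap(mmio, pg);
	close(fd);
	return 0;
}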

> 
>> But the needs of generic user space application is significantly
>> different from exporting a more functional device model to guest,
>> which isn't full emulated device. which is why VFIO didn't make
>> sense for native use.
> 
> I'm not sure this is true. We've done these kinds of emulated SIOV
> like things already and there is a huge overlap between what a generic
> user application needs and what the VMM neds. Actually almost a
> perfect subset except for interrupt remapping (which is quite
> trivial).
> 
> The things vfio focuses on, like groups and managing a real config
> space just don't apply here.
> 
>> And when we move things from VFIO which is already established
>> as a general device model and accepted by multiple VMM's it gives
>> instant footing without a whole redesign.
> 
> Yes, I understand, but I think you need to get more people to support
> this idea. From my standpoint this is taking secure lean VMMs and
> putting emulation code back into them, except in a more dangerous
> kernel location. This does not seem like a net win to me.
> 
> You'd be much better to have some userspace library scheme instead of
> being completely tied to a kernel interface for modularity.
> 
>> When we move things from VFIO to uaccel to bolt on the functionality
>> like VFIO, I suspect we would be moving code/functionality from VFIO
>> to Uaccel. I don't know what the net gain would be.
> 
> Most of VFIO functionality is already decomposed inside the kernel,
> and you need most of it to do secure user access anyhow.
> 
>> For mdev, would you agree we can keep the current architecture,
>> and investigate moving some emulation code to user space (say even for
>> standard vfio_pci) and then expand scope later.
> 
> I won't hard NAK this, but I think you need more people to support
> this general idea of more emulation code in the kernel to go ahead -
> particularly since this is one of many future drivers along this
> design.
> 
> It would be good to hear from the VMM teams that this is what they
> want (and why), for instance.
> 
> Jason
>
Ashok Raj May 9, 2020, 12:09 a.m. UTC | #44
Hi Jason

On Fri, May 08, 2020 at 08:16:10PM -0300, Jason Gunthorpe wrote:
> On Fri, May 08, 2020 at 01:47:10PM -0700, Raj, Ashok wrote:
> 
> > Even when uaccel was under development, one of the options
> > was to use VFIO as the transport, goal was the same i.e to keep
> > the user space have one interface. 
> 
> I feel a bit out of the loop here, uaccel isn't in today's kernel is
> it? I've heard about it for a while, it sounds very similar to RDMA,
> so I hope they took some of my advice...

I think since 5.7 maybe? drivers/misc/uacce. I don't think this is like
RDMA, it's just a plain accelerator. There is no connection management,
memory registration or other such things. IB was my first job at Intel,
but saying that I would be giving my age away :)

> 
> > But the needs of generic user space application is significantly
> > different from exporting a more functional device model to guest,
> > which isn't full emulated device. which is why VFIO didn't make
> > sense for native use.
> 
> I'm not sure this is true. We've done these kinds of emulated SIOV
> like things already and there is a huge overlap between what a generic
> user application needs and what the VMM neds. Actually almost a
> perfect subset except for interrupt remapping (which is quite
> trivial).

From a simple user application POV, if we need to do simple compression
or such with a shared WQ, all the application needs to do is a bind_mm()
that associates the process address space with the IOMMU, creating that
association and communication channel.

To support this for a guest user, we need to support the same actions
from a guest OS, i.e. a guest OS bind should be serviced and end up with
the IOMMU plumbing in the guest CR3 and making sure the guest second
level is plumbed right for the nested walk.

Now we could certainly go bolt all of this on again, when VFIO has
already done the plumbing in a generic way.
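
For reference, the native (non-guest) side of that bind is already a thin
layer over the iommu_sva_* API declared in <linux/iommu.h>. A minimal sketch,
assuming that API; the function name example_bind_current_mm and the
surrounding driver glue are made up:

/* Sketch of the native SVA bind a host driver would do; error paths trimmed. */
#include <linux/err.h>
#include <linux/iommu.h>
#include <linux/sched.h>
#include <linux/types.h>

static int example_bind_current_mm(struct device *dev, u32 *pasid_out)
{
	struct iommu_sva *handle;

	/*
	 * Associate the current process address space with the device's
	 * IOMMU context; the IOMMU driver sets up the PASID entry pointing
	 * at the process page tables (the CR3 analogue mentioned above).
	 */
	handle = iommu_sva_bind_device(dev, current->mm, NULL /* no drvdata */);
	if (IS_ERR(handle))
		return PTR_ERR(handle);

	*pasid_out = iommu_sva_get_pasid(handle);

	/*
	 * The PASID is then programmed into the WQ so that vaddr-based
	 * descriptors from this process resolve through its page tables.
	 * The guest case needs the same end result via a nested (second
	 * level) walk, which is the plumbing being discussed here.
	 */
	return 0;
}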

> 
> The things vfio focuses on, like groups and managing a real config
> space just don't apply here.
> 
> > And when we move things from VFIO which is already established
> > as a general device model and accepted by multiple VMM's it gives
> > instant footing without a whole redesign. 
> 
> Yes, I understand, but I think you need to get more people to support
> this idea. From my standpoint this is taking secure lean VMMs and

When we decided on VFIO, it was after following the best practices of
the time and after discussion with Kirti Wankhede and Alex. Kevin had
used it for graphics virtualization, and it was even presented at KVM
Forum and elsewhere dating back to 2017. No one had raised alarms until now :-)


> putting emulation code back into them, except in a more dangerous
> kernel location. This does not seem like a net win to me.

It's not a whole lot of emulation, right? mdevs are soft partitioned: there is
just a single PF, but we can create a separate partition for the guest using a
PASID along with the normal BDF (RID). And by exposing a consistent PCI-like
interface to user space you get everything else for free.

Yes, it's not SR-IOV, but by giving that interface to user space via VFIO we get
all of that functionality without having to reinvent a different way to do it.
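
To make the "for free" part concrete, the vfio-mdev side is mostly boilerplate
registration against the existing parent-ops interface. A skeleton, assuming
the mdev API as it stands today (mdev_register_device() and struct
mdev_parent_ops); all the vidxd_* names are placeholders, not the actual
patch code:

/* Skeleton mdev parent registration; callbacks and names are placeholders. */
#include <linux/device.h>
#include <linux/mdev.h>
#include <linux/module.h>

static struct attribute_group *vidxd_mdev_type_groups[] = {
	/* the real driver exposes its supported type (e.g. "idxd-wq") here */
	NULL,
};

static int vidxd_create(struct kobject *kobj, struct mdev_device *mdev)
{
	/* allocate per-mdev state, claim a free WQ, set up shadow cfg/MMIO space */
	return 0;
}

static int vidxd_remove(struct mdev_device *mdev)
{
	/* quiesce the WQ and free the per-mdev state */
	return 0;
}

static ssize_t vidxd_write(struct mdev_device *mdev, const char __user *buf,
			   size_t count, loff_t *ppos)
{
	/* decode cfg/MMIO writes; forward command-register writes to the WQ */
	return count;
}

static const struct mdev_parent_ops vidxd_mdev_ops = {
	.owner			= THIS_MODULE,
	.supported_type_groups	= vidxd_mdev_type_groups,
	.create			= vidxd_create,
	.remove			= vidxd_remove,
	.write			= vidxd_write,
	/* .read, .ioctl, .mmap, .open and .release omitted for brevity */
};

static int vidxd_register_mdev(struct device *pf_dev)
{
	/* called from the PF driver's probe path */
	return mdev_register_device(pf_dev, &vidxd_mdev_ops);
}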

vDPA went the other way, IIRC: they put a hardware implementation of virtio
into the device, so they sort of fit the model. Here the instance looks and
feels like real hardware for the setup and control aspect.


> 
> You'd be much better to have some userspace library scheme instead of
> being completely tied to a kernel interface for modularity.

Couldn't agree more :-) ... all I'm asking is whether we can take a phased
approach to get to that goodness! If we need to move things to user space for
emulation, that's a great goal, but it can be evolutionary.

> 
> > When we move things from VFIO to uaccel to bolt on the functionality
> > like VFIO, I suspect we would be moving code/functionality from VFIO
> > to Uaccel. I don't know what the net gain would be.
> 
> Most of VFIO functionality is already decomposed inside the kernel,
> and you need most of it to do secure user access anyhow.
> 
> > For mdev, would you agree we can keep the current architecture,
> > and investigate moving some emulation code to user space (say even for
> > standard vfio_pci) and then expand scope later.
> 
> I won't hard NAK this, but I think you need more people to support
> this general idea of more emulation code in the kernel to go ahead -
> particularly since this is one of many future drivers along this
> design.
> 
> It would be good to hear from the VMM teams that this is what they
> want (and why), for instance.

IIRC Paolo was present, and we can find other VMM folks to chime in if
that helps.
Jason Gunthorpe May 9, 2020, 12:21 p.m. UTC | #45
On Fri, May 08, 2020 at 05:09:09PM -0700, Raj, Ashok wrote:
> Hi Jason
> 
> On Fri, May 08, 2020 at 08:16:10PM -0300, Jason Gunthorpe wrote:
> > On Fri, May 08, 2020 at 01:47:10PM -0700, Raj, Ashok wrote:
> > 
> > > Even when uaccel was under development, one of the options
> > > was to use VFIO as the transport, goal was the same i.e to keep
> > > the user space have one interface. 
> > 
> > I feel a bit out of the loop here, uaccel isn't in today's kernel is
> > it? I've heard about it for a while, it sounds very similar to RDMA,
> > so I hope they took some of my advice...
> 
> I think since 5.7 maybe? drivers/misc/uacce. I don't think this is like
> RDMA, its just a plain accelerator. There is no connection management,
> memory registration or other things.. IB was my first job at Intel,
> but saying that i would be giving my age away :)

rdma was the first thing to do kernel bypass; all this stuff is like
rdma at some level. I see, this looks like the 'warp driver' stuff
redone.

Wow, lots wrong here. Oh well.

> > putting emulation code back into them, except in a more dangerous
> > kernel location. This does not seem like a net win to me.
> 
> Its not a whole lot of emulation right? mdev are soft partitioned. There is
> just a single PF, but we can create a separate partition for the guest using
> PASID along with the normal BDF (RID). And exposing a consistent PCI like
> interface to user space you get everything else for free.
> 
> Yes, its not SRIOV, but giving that interface to user space via VFIO, we get 
> all of that functionality without having to reinvent a different way to do it.
> 
> vDPA went the other way, IRC, they went and put a HW implementation of what
> virtio is in hardware. So they sort of fit the model. Here the instance
> looks and feels like real hardware for the setup and control aspect.

VDPA and this are very similar, of course it depends on the exact HW
implementation.

Jason
Jason Wang May 13, 2020, 2:29 a.m. UTC | #46
On 2020/5/9 下午8:21, Jason Gunthorpe wrote:
> On Fri, May 08, 2020 at 05:09:09PM -0700, Raj, Ashok wrote:
>> Hi Jason
>>
>> On Fri, May 08, 2020 at 08:16:10PM -0300, Jason Gunthorpe wrote:
>>> On Fri, May 08, 2020 at 01:47:10PM -0700, Raj, Ashok wrote:
>>>
>>>> Even when uaccel was under development, one of the options
>>>> was to use VFIO as the transport, goal was the same i.e to keep
>>>> the user space have one interface.
>>> I feel a bit out of the loop here, uaccel isn't in today's kernel is
>>> it? I've heard about it for a while, it sounds very similar to RDMA,
>>> so I hope they took some of my advice...
>> I think since 5.7 maybe? drivers/misc/uacce. I don't think this is like
>> RDMA, its just a plain accelerator. There is no connection management,
>> memory registration or other things.. IB was my first job at Intel,
>> but saying that i would be giving my age away:)
> rdma was the first thing to do kernel bypass, all this stuff is like
> rdma at some level.. I see this looks like the 'warp driver' stuff
> redone
>
> Wow, lots wrong here. Oh well.
>
>>> putting emulation code back into them, except in a more dangerous
>>> kernel location. This does not seem like a net win to me.
>> Its not a whole lot of emulation right? mdev are soft partitioned. There is
>> just a single PF, but we can create a separate partition for the guest using
>> PASID along with the normal BDF (RID). And exposing a consistent PCI like
>> interface to user space you get everything else for free.
>>
>> Yes, its not SRIOV, but giving that interface to user space via VFIO, we get
>> all of that functionality without having to reinvent a different way to do it.
>>
>> vDPA went the other way, IRC, they went and put a HW implementation of what
>> virtio is in hardware. So they sort of fit the model. Here the instance
>> looks and feels like real hardware for the setup and control aspect.
> VDPA and this are very similar, of course it depends on the exact HW
> implementation.
>
> Jason


Actually this is not a must. Technically we can do ring/descriptor
translation in the vDPA driver, as zerocopy AF_XDP did.
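
To give a rough idea of what that could look like, a purely illustrative
sketch (all types and helpers are made up, this is not any existing driver's
code): the driver walks the guest-visible ring, translates guest addresses
through the mappings the VMM already installed, and writes hardware-native
descriptors into a shadow ring.

/* Illustrative descriptor translation; all types and helpers are made up. */
#include <stddef.h>
#include <stdint.h>

struct guest_desc {	/* descriptor format exposed to the guest */
	uint64_t gpa;	/* guest-physical buffer address */
	uint32_t len;
	uint16_t flags;
};

struct hw_desc {	/* format the real hardware consumes */
	uint64_t iova;	/* address in the device's DMA address space */
	uint32_t len;
	uint16_t flags;
};

/* Assumed helper: GPA -> IOVA lookup against the installed IOTLB mappings. */
uint64_t iotlb_translate(uint64_t gpa);

/* Copy one batch of guest descriptors into the hardware shadow ring. */
static int translate_ring(const struct guest_desc *gring, struct hw_desc *hring,
			  size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		uint64_t iova = iotlb_translate(gring[i].gpa);

		if (!iova)
			return -1;	/* unmapped buffer: fail the batch */

		hring[i].iova  = iova;
		hring[i].len   = gring[i].len;
		hring[i].flags = gring[i].flags;
	}
	return 0;
}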

Thanks


>
Tian, Kevin May 13, 2020, 8:30 a.m. UTC | #47
> From: Jason Gunthorpe
> Sent: Saturday, May 9, 2020 8:21 PM
> > > putting emulation code back into them, except in a more dangerous
> > > kernel location. This does not seem like a net win to me.
> >
> > Its not a whole lot of emulation right? mdev are soft partitioned. There is
> > just a single PF, but we can create a separate partition for the guest using
> > PASID along with the normal BDF (RID). And exposing a consistent PCI like
> > interface to user space you get everything else for free.
> >
> > Yes, its not SRIOV, but giving that interface to user space via VFIO, we get
> > all of that functionality without having to reinvent a different way to do it.
> >
> > vDPA went the other way, IRC, they went and put a HW implementation of
> what
> > virtio is in hardware. So they sort of fit the model. Here the instance
> > looks and feels like real hardware for the setup and control aspect.
> 
> VDPA and this are very similar, of course it depends on the exact HW
> implementation.
> 

Hi, Jason,

I have more thoughts below; let's see whether they make sense to you.

When talking about virtualization, the target here is an unmodified guest
kernel driver which expects to see the raw controllability of queues
as defined by the device spec. In idxd, such controllability includes
enabling/disabling SVA, dedicated or shared WQ, size, threshold, privilege,
fault mode, max batch size, and many other attributes. Each guest OS
has its own policy for using all or part of the available controllability.

When talking about applications, we care about providing an efficient
programming interface to userspace. For example, with uacce we
allow an application to submit vaddr-based workloads to a reserved
WQ with the kernel bypassed. But it's not necessary to export the raw
controllability of the reserved WQ to userspace, and we still rely on
the kernel driver to configure it, including bind_mm. I'm not sure whether
uacce would want to evolve into a generic queue management system
covering non-SVA and all vendor-specific raw capabilities as
expected by all kinds of guest kernel drivers. That does not sound
worthwhile at this point, given that we already have a highly efficient
SVA interface for user applications.

That is why we started with mdev as an evolutionary approach. Mdev is
introduced to expose the raw controllability of a subdevice (WQ or ADI) to
the guest. It builds a channel between the guest kernel driver and the host
kernel driver and uses the device spec as the uAPI by sticking to the MMIO
interface, and all virtualization-related setup is consolidated in VFIO.
The drawback, as you pointed out, is putting some degree of emulation
code in the kernel. But as explained earlier, that is only a small portion
of the code. Moreover, most registers are emulated as simple memory reads/
writes, while the remaining logic mostly belongs to raw controllability
(e.g. the cmd register) that the host driver grants to the guest and thus
must be propagated to the device. For the latter part, I would call it
'mediation' rather than 'emulation', and it is required with whatever uAPI
would be used.
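
As a sketch of that emulation/mediation split (offsets and helpers here are
illustrative, not the real code): reads and most writes only touch per-mdev
shadow state, and only the command register is propagated to the device, as
in the command-dispatch sketch earlier in this thread.

/* Hypothetical shadow-MMIO path; offsets and helpers are illustrative. */
#include <linux/string.h>
#include <linux/types.h>

#define VIDXD_MMIO_SIZE		0x1000
#define VIDXD_CMD_REG		0x48	/* assumed: the one register that is mediated */

struct vidxd {
	u8 shadow_mmio[VIDXD_MMIO_SIZE];	/* backing store for emulated registers */
	/* ... handle to the physical WQ, etc. ... */
};

void vidxd_do_command(struct vidxd *v, u32 val);	/* the mediation path */

static u32 vidxd_mmio_read(struct vidxd *v, unsigned int offset)
{
	u32 val;

	/* reads are pure emulation: return the shadow state */
	memcpy(&val, &v->shadow_mmio[offset], sizeof(val));
	return val;
}

static void vidxd_mmio_write(struct vidxd *v, unsigned int offset, u32 val)
{
	/* most writes are pure emulation too: just update the shadow state */
	memcpy(&v->shadow_mmio[offset], &val, sizeof(val));

	/* the command register is raw controllability granted to the guest,
	 * so it must be propagated (mediated) to the device */
	if (offset == VIDXD_CMD_REG)
		vidxd_do_command(v, val);
}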

If in the future there is a requirement to delegate raw
WQ controllability to pure userspace applications for DMA engines,
and there is a well-defined uAPI covering a large common set of
controllability across multiple vendors, we will look at that option for
sure.

From the above p.o.v., I feel vDPA is a different story. virtio/vhost has a
well-established ecosystem between guest and host. The user
space VMM already emulates all available controllability as defined
in the virtio spec, and the host kernel already supports the vhost uAPI
for vring setup, IOTLB management, etc. Extending that path for data path
offloading sounds like a reasonable choice for vDPA...

Thanks
Kevin
Jason Gunthorpe May 13, 2020, 12:40 p.m. UTC | #48
On Wed, May 13, 2020 at 08:30:15AM +0000, Tian, Kevin wrote:

> When talking about virtualization, here the target is unmodified guest 
> kernel driver which expects seeing the raw controllability of queues 
> as defined by device spec. In idxd, such controllability includes enable/
> disable SVA, dedicated or shared WQ, size, threshold, privilege, fault 
> mode, max batch size, and many other attributes. Different guest OS 
> has its own policy of using all or partial available controllability. 
> 
> When talking about application, we care about providing an efficient
> programming interface to userspace. For example with uacce, we
> allow an application to submit vaddr-based workloads to a reserved
> WQ with kernel bypassed. But it's not necessary to export the raw
> controllability of the reserved WQ to userspace, and we still rely on
> kernel driver to configure it including bind_mm. I'm not sure whether 
> uacce would like to evolve as a generic queue management system
> including non-SVA and all vendor specific raw capabilities as 
> expected by all kinds of guest kernel drivers. It sounds like not 
> worthwhile at this point, given that we already have an highly efficient 
> SVA interface for user applications.

Like I already said, you should get the people who care about this
stuff to support emulation in the kernel. I think it has not been
explained well in the past.

Most Intel info on SIOV draws a close parallel to SR-IOV, and I think
people generally assume that, like SR-IOV, SIOV does not include
kernel-side MMIO emulation.

> If in the future, there do have such requirement of delegating raw
> WQ controllability to pure userspace applications for DMA engines, 
> and there is be a well-defined uAPI to cover a large common set of 
> controllability across multiple vendors, we will look at that option for
> sure.

All this kernel bypass stuff is 'HW specific' by nature; you should
not expect to have general interfaces.

Jason