
[v6,00/20] Add VFIO mediated device support and DEV-MSI support for the idxd driver

Message ID 162164243591.261970.3439987543338120797.stgit@djiang5-desk3.ch.intel.com (mailing list archive)

Message

Dave Jiang May 22, 2021, 12:19 a.m. UTC
The series is rebased on top of Jason's VFIO refactoring collection:
https://github.com/jgunthorpe/linux/pull/3

We would like to receive review comments on the mdev driver itself and the
common VFIO IMS support code that was suggested by Jason. The previous
version of the DEV-MSI/IMS code is still under review, and the IOASID code
is still being designed.

v6:
- Rebased on top of Jason's recent VFIO refactoring.
- Move VFIO IMS setup code to common (Jason)
- Changed patch ordering to minimize code stubs (Jason)

v5:
- Split out non-driver IMS code to its own series.
- Removed common DVSEC code; Bjorn asked to deal with it post 5.11 and keep
custom code for now.
- Reworked irq_entries for IMS so emulated vector is also included.
- Reworked vidxd_send_interrupt() to take irq_entry directly (data ready for
consumption) (Thomas)
- Removed pointer to msi_entry in irq_entries (Thomas)
- Removed irq_domain check on free entries (Thomas)
- Split out irqbypass management code (Thomas)
- Fix EXPORT_SYMBOL to EXPORT_SYMBOL_GPL (Thomas)

v4:
dev-msi:
- Make interrupt remapping code more readable (Thomas)
- Add flush writes to unmask/write and reset ims slots (Thomas)
- Interrupt Message Storm -> Interrupt Message Store (Thomas)
- Merge in pasid programming code. (Thomas)

mdev:
- Fixed up domain assignment (Thomas)
- Define magic numbers (Thomas)
- Move siov detection code to PCI common (Thomas)
- Remove duplicated MSI entry info (Thomas)
- Convert code to use ims_slot (Thomas)
- Add explanation of pasid programming for IMS entry (Thomas)
- Add int handle release support due to spec 1.1 update.

v3:
Dev-msi:
- No need to add support for 2 different dev-msi irq domains; a common
one can be used for both cases (with IR enabled/disabled)
- Add arch-specific function to specify additions to the msi_prepare callback
instead of making the callback a weak function
- Call platform ops directly instead of a wrapper function
- Make mask/unmask callbacks void functions
- dev->msi_domain should be updated at the device driver level before
calling dev_msi_alloc_irqs()
- dev_msi_alloc/free_irqs() cannot be used for PCI devices
- Followed the generic layering scheme: infrastructure bits -> arch
bits -> enabling bits

Mdev:
- Remove set kvm group notifier (Yan Zhao)
- Fix VFIO irq trigger removal (Yan Zhao)
- Add mmio read flush to ims mask (Jason)

v2:
IMS (now dev-msi):
- With recommendations from Jason/Thomas/Dan on making IMS more generic:
- Pass a non-PCI generic device (struct device) for IMS management instead of
mdev
- Remove all references to mdev and symbol_get/put
- Remove all references to IMS in common code and replace with dev-msi
- Remove dynamic allocation of platform-msi interrupts: no groups, no
new msi list or list helpers
- Create a generic dev-msi domain with and without interrupt remapping
enabled.
- Introduce dev_msi_domain_alloc_irqs and dev_msi_domain_free_irqs APIs

mdev:
- Removed unrelated bits from SVA enabling that are not necessary for
the submission. (Kevin)
- Restructured entire mdev driver series to make reviewing easier (Kevin)
- Made rw emulation more robust (Kevin)
- Removed uuid wq type and added single dedicated wq type (Kevin)
- Locking fixes for vdev (Yan Zhao)
- VFIO MSIX trigger fixes (Yan Zhao)

This code series matches the support of the 5.6 kernel (stage 1) driver,
but for the guest.

The code has dependency on DEV_MSI/IMS enabling code:
https://lore.kernel.org/lkml/1614370277-23235-1-git-send-email-megha.dey@intel.com/

The code has dependency on idxd driver sub-driver cleanup series:
https://lore.kernel.org/dmaengine/162163546245.260470.18336189072934823712.stgit@djiang5-desk3.ch.intel.com/T/#t

The code has dependency on Jason's VFIO refactoring:
https://lore.kernel.org/kvm/0-v2-7667f42c9bad+935-vfio3_jgg@nvidia.com/

Part 1 of the driver has been accepted into the v5.6 kernel. It supports
dedicated workqueues (wq) without Shared Virtual Memory (SVM) support.

Part 2 of the driver supports shared wq and SVM and has been accepted in
kernel v5.11.

The VFIO mediated device framework allows vendor drivers to wrap a portion of
device resources into virtual devices (mdevs). Each mdev can be assigned
to a different guest using the same set of VFIO uAPIs as assigning a
physical device. Access to the mdev resources is served with mixed
policies. For example, vendor drivers typically mark the data-path interface
as pass-through for fast guest operations, and then trap and mediate the
control-path interface to avoid undesired interference between mdevs. Some
level of emulation is necessary behind the vfio mdev to compose the virtual
device interface.

This series brings mdev to the idxd driver to enable Intel Scalable IOV
(SIOV), a hardware-assisted mediated pass-through technology. SIOV makes
each DSA wq independently assignable through PASID-granular resource/DMA
isolation. It helps improve scalability and reduces mediation complexity
compared to purely software-based mdev implementations. Each assigned wq is
configured by the host and exposed to the guest in a read-only configuration
mode, which allows the guest to use the wq without additional setup. This
design greatly reduces the amount of emulation needed, focusing it on
handling commands from guests.

There are two possible avenues to support virtual device composition:
1. VFIO mediated device (mdev), or 2. user space DMA through a char device
(or UACCE). Given the small amount of emulation needed to satisfy our needs,
and that VFIO mdev already has the infrastructure to support device
pass-through, we feel that VFIO mdev is the better route. For a more in-depth
explanation, see the documentation in Documentation/driver-api/vfio/mdev-idxd.rst.

This series introduces the “1dwq-v1” mdev type. This mdev type allows
allocation of a single dedicated wq from the available dedicated wqs. After
a workqueue (wq) is enabled, the user generates a uuid. On mdev
creation, the mdev driver code will find a dwq depending on the mdev
type. When the create operation is successful, the user-generated uuid
can be passed to qemu. When the guest boots up, it should discover a
DSA device when doing PCI discovery.

Example flow for the “1dwq-v1” type (a consolidated sketch follows the steps):
1. Enable a wq with the “mdev” wq type.
2. Generate a uuid.
3. Write the uuid to the mdev class sysfs path:
echo $UUID > /sys/class/mdev_bus/0000\:00\:0a.0/mdev_supported_types/idxd-1dwq-v1/create
4. Pass the following parameter to qemu:
"-device vfio-pci,sysfsdev=/sys/bus/pci/devices/0000:00:0a.0/$UUID"

The wq exported through mdev will have the read-only config bit set,
which means the device does not require the typical configuration. After
enabling the device, the user only needs to set the wq type and name;
that is all that is necessary to enable the wq and start using it. The
single-wq configuration is not the only way to create the mdev; multi-wq
support for mdev is planned as future work.

The mdev utilizes Interrupt Message Store, or IMS [3], a device-specific
MSI implementation, instead of MSI-X for guest interrupts. This
preserves MSI-X for host usage and also allows a significantly larger
number of interrupt vectors for guest usage.

The idxd driver implements IMS as on-device memory-mapped unified
storage. Each interrupt message is stored as a DWORD-sized data payload
and a 64-bit address (same as MSI-X). Access to the IMS is through the
host idxd driver.

The idxd driver makes use of the generic IMS irq chip and domain which
stores the interrupt messages as an array in device memory. Allocation and
freeing of interrupts happens via the generic msi_domain_alloc/free_irqs()
interface. One only needs to ensure the interrupt domain is stored in
the underlying device struct.

The kernel tree can be found at [7].

[1]: https://lore.kernel.org/lkml/157965011794.73301.15960052071729101309.stgit@djiang5-desk3.ch.intel.com/
[2]: https://software.intel.com/en-us/articles/intel-sdm
[3]: https://software.intel.com/en-us/download/intel-scalable-io-virtualization-technical-specification
[4]: https://software.intel.com/en-us/download/intel-data-streaming-accelerator-preliminary-architecture-specification
[5]: https://01.org/blogs/2019/introducing-intel-data-streaming-accelerator
[6]: https://intel.github.io/idxd/
[7]: https://github.com/intel/idxd-driver idxd-stage2.5

---

Dave Jiang (20):
      vfio/mdev: idxd: add theory of operation documentation for idxd mdev
      dmaengine: idxd: add external module driver support for dsa_bus_type
      dmaengine: idxd: add IMS offset and size retrieval code
      dmaengine: idxd: add portal offset for IMS portals
      vfio: mdev: common lib code for setting up Interrupt Message Store
      vfio/mdev: idxd: add PCI config for read/write for mdev
      vfio/mdev: idxd: Add administrative commands emulation for mdev
      vfio/mdev: idxd: Add mdev device context initialization
      vfio/mdev: Add mmio read/write support for mdev
      vfio/mdev: idxd: add mdev type as a new wq type
      vfio/mdev: idxd: Add basic driver setup for idxd mdev
      vfio: move VFIO PCI macros to common header
      vfio/mdev: idxd: add mdev driver registration and helper functions
      vfio/mdev: idxd: add 1dwq-v1 mdev type
      vfio/mdev: idxd: ims domain setup for the vdcm
      vfio/mdev: idxd: add new wq state for mdev
      vfio/mdev: idxd: add error notification from host driver to mediated device
      vfio: move vfio_pci_set_ctx_trigger_single to common code
      vfio: mdev: Add device request interface
      vfio/mdev: idxd: setup request interrupt


 .../ABI/stable/sysfs-driver-dma-idxd          |    6 +
 drivers/dma/Kconfig                           |    1 +
 drivers/dma/idxd/Makefile                     |    2 +
 drivers/dma/idxd/cdev.c                       |    4 +-
 drivers/dma/idxd/device.c                     |  102 +-
 drivers/dma/idxd/idxd.h                       |   52 +-
 drivers/dma/idxd/init.c                       |   59 +
 drivers/dma/idxd/irq.c                        |   21 +-
 drivers/dma/idxd/registers.h                  |   25 +-
 drivers/dma/idxd/sysfs.c                      |   22 +
 drivers/vfio/Makefile                         |    2 +-
 drivers/vfio/mdev/Kconfig                     |   21 +
 drivers/vfio/mdev/Makefile                    |    4 +
 drivers/vfio/mdev/idxd/Makefile               |    4 +
 drivers/vfio/mdev/idxd/mdev.c                 |  958 ++++++++++++++++
 drivers/vfio/mdev/idxd/mdev.h                 |  112 ++
 drivers/vfio/mdev/idxd/vdev.c                 | 1016 +++++++++++++++++
 drivers/vfio/mdev/mdev_irqs.c                 |  341 ++++++
 drivers/vfio/pci/vfio_pci_intrs.c             |   63 +-
 drivers/vfio/pci/vfio_pci_private.h           |    6 -
 drivers/vfio/vfio_common.c                    |   74 ++
 include/linux/mdev.h                          |   66 ++
 include/linux/vfio.h                          |   10 +
 include/uapi/linux/idxd.h                     |    2 +
 24 files changed, 2890 insertions(+), 83 deletions(-)
 create mode 100644 drivers/vfio/mdev/idxd/Makefile
 create mode 100644 drivers/vfio/mdev/idxd/mdev.c
 create mode 100644 drivers/vfio/mdev/idxd/mdev.h
 create mode 100644 drivers/vfio/mdev/idxd/vdev.c
 create mode 100644 drivers/vfio/mdev/mdev_irqs.c
 create mode 100644 drivers/vfio/vfio_common.c

--

Comments

Jason Gunthorpe May 23, 2021, 11:22 p.m. UTC | #1
On Fri, May 21, 2021 at 05:19:05PM -0700, Dave Jiang wrote:

> The code has dependency on DEV_MSI/IMS enabling code:
> https://lore.kernel.org/lkml/1614370277-23235-1-git-send-email-megha.dey@intel.com/
> 
> The code has dependency on idxd driver sub-driver cleanup series:
> https://lore.kernel.org/dmaengine/162163546245.260470.18336189072934823712.stgit@djiang5-desk3.ch.intel.com/T/#t
> 
> The code has dependency on Jason's VFIO refactoring:
> https://lore.kernel.org/kvm/0-v2-7667f42c9bad+935-vfio3_jgg@nvidia.com/

That is quite an entangled list, do you have a submission plan??

> Introducing mdev types “1dwq-v1” type. This mdev type allows
> allocation of a single dedicated wq from available dedicated wqs. After
> a workqueue (wq) is enabled, the user will generate an uuid. On mdev
> creation, the mdev driver code will find a dwq depending on the mdev
> type. When the create operation is successful, the user generated uuid
> can be passed to qemu. When the guest boots up, it should discover a
> DSA device when doing PCI discovery.
> 
> For example of “1dwq-v1” type:
> 1. Enable wq with “mdev” wq type
> 2. A user generated uuid.
> 3. The uuid is written to the mdev class sysfs path:
> echo $UUID > /sys/class/mdev_bus/0000\:00\:0a.0/mdev_supported_types/idxd-1dwq-v1/create
> 4. Pass the following parameter to qemu:
> "-device vfio-pci,sysfsdev=/sys/bus/pci/devices/0000:00:0a.0/$UUID"

So the idxd core driver knows to create a "vfio" wq with its own
machinery, but you still want to involve the horrible mdev guid stuff?

Why??

Jason
Dave Jiang June 2, 2021, 3:40 p.m. UTC | #2
On 5/23/2021 4:22 PM, Jason Gunthorpe wrote:
> On Fri, May 21, 2021 at 05:19:05PM -0700, Dave Jiang wrote:
>> Introducing mdev types “1dwq-v1” type. This mdev type allows
>> allocation of a single dedicated wq from available dedicated wqs. After
>> a workqueue (wq) is enabled, the user will generate an uuid. On mdev
>> creation, the mdev driver code will find a dwq depending on the mdev
>> type. When the create operation is successful, the user generated uuid
>> can be passed to qemu. When the guest boots up, it should discover a
>> DSA device when doing PCI discovery.
>>
>> For example of “1dwq-v1” type:
>> 1. Enable wq with “mdev” wq type
>> 2. A user generated uuid.
>> 3. The uuid is written to the mdev class sysfs path:
>> echo $UUID > /sys/class/mdev_bus/0000\:00\:0a.0/mdev_supported_types/idxd-1dwq-v1/create
>> 4. Pass the following parameter to qemu:
>> "-device vfio-pci,sysfsdev=/sys/bus/pci/devices/0000:00:0a.0/$UUID"
> So the idxd core driver knows to create a "vfio" wq with its own much
> machinery but you still want to involve the horrible mdev guid stuff?
>
> Why??

Are you referring to calling mdev_device_create() directly in the mdev 
idxd_driver probe? I think this would work with our dedicated wq, where a 
single mdev can be assigned to a wq. However, later on when we need to 
support the shared wq, where we can create multiple mdevs per wq, we'll need 
an entry point to do so. In the name of making things consistent from the 
user's perspective, going through sysfs seems like the way to do it.
Jason Gunthorpe June 2, 2021, 11:17 p.m. UTC | #3
On Wed, Jun 02, 2021 at 08:40:51AM -0700, Dave Jiang wrote:
> 
> On 5/23/2021 4:22 PM, Jason Gunthorpe wrote:
> > On Fri, May 21, 2021 at 05:19:05PM -0700, Dave Jiang wrote:
> > > Introducing mdev types “1dwq-v1” type. This mdev type allows
> > > allocation of a single dedicated wq from available dedicated wqs. After
> > > a workqueue (wq) is enabled, the user will generate an uuid. On mdev
> > > creation, the mdev driver code will find a dwq depending on the mdev
> > > type. When the create operation is successful, the user generated uuid
> > > can be passed to qemu. When the guest boots up, it should discover a
> > > DSA device when doing PCI discovery.
> > > 
> > > For example of “1dwq-v1” type:
> > > 1. Enable wq with “mdev” wq type
> > > 2. A user generated uuid.
> > > 3. The uuid is written to the mdev class sysfs path:
> > > echo $UUID > /sys/class/mdev_bus/0000\:00\:0a.0/mdev_supported_types/idxd-1dwq-v1/create
> > > 4. Pass the following parameter to qemu:
> > > "-device vfio-pci,sysfsdev=/sys/bus/pci/devices/0000:00:0a.0/$UUID"
> > So the idxd core driver knows to create a "vfio" wq with its own much
> > machinery but you still want to involve the horrible mdev guid stuff?
> > 
> > Why??
> 
> Are you referring to calling mdev_device_create() directly in the mdev
> idxd_driver probe? 

No, just call vfio_register_group_dev and forget about mdev.

> I think this would work with our dedicated wq where a single mdev
> can be assigned to a wq.

Ok, sounds great

> However, later on when we need to support shared wq where we can
> create multiple mdev per wq, we'll need an entry point to do so. In
> the name of making things consistent from user perspective, going
> through sysfs seems the way to do it.

Why not use your already very complicated idxd sysfs to do this?

Jason
Tian, Kevin June 3, 2021, 1:11 a.m. UTC | #4
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Thursday, June 3, 2021 7:18 AM
> 
> On Wed, Jun 02, 2021 at 08:40:51AM -0700, Dave Jiang wrote:
> >
> > On 5/23/2021 4:22 PM, Jason Gunthorpe wrote:
> > > On Fri, May 21, 2021 at 05:19:05PM -0700, Dave Jiang wrote:
> > > > Introducing mdev types “1dwq-v1” type. This mdev type allows
> > > > allocation of a single dedicated wq from available dedicated wqs. After
> > > > a workqueue (wq) is enabled, the user will generate an uuid. On mdev
> > > > creation, the mdev driver code will find a dwq depending on the mdev
> > > > type. When the create operation is successful, the user generated uuid
> > > > can be passed to qemu. When the guest boots up, it should discover a
> > > > DSA device when doing PCI discovery.
> > > >
> > > > For example of “1dwq-v1” type:
> > > > 1. Enable wq with “mdev” wq type
> > > > 2. A user generated uuid.
> > > > 3. The uuid is written to the mdev class sysfs path:
> > > > echo $UUID > /sys/class/mdev_bus/0000\:00\:0a.0/mdev_supported_types/idxd-1dwq-v1/create
> > > > 4. Pass the following parameter to qemu:
> > > > "-device vfio-pci,sysfsdev=/sys/bus/pci/devices/0000:00:0a.0/$UUID"
> > > So the idxd core driver knows to create a "vfio" wq with its own much
> > > machinery but you still want to involve the horrible mdev guid stuff?
> > >
> > > Why??
> >
> > Are you referring to calling mdev_device_create() directly in the mdev
> > idxd_driver probe?
> 
> No, just call vfio_register_group_dev and forget about mdev.
> 
> > I think this would work with our dedicated wq where a single mdev
> > can be assigned to a wq.
> 
> Ok, sounds great
> 
> > However, later on when we need to support shared wq where we can
> > create multiple mdev per wq, we'll need an entry point to do so. In
> > the name of making things consistent from user perspective, going
> > through sysfs seems the way to do it.
> 
> Why not use your already very complicated idxd sysfs to do this?
> 

Jason, can you clarify your attitude on the mdev guid stuff? Are you 
completely against it, or is it case-by-case? If the former, this is a big
decision and it's better to have consensus with Alex/Kirti. If the
latter, I would like to hear your criteria for when it can be used
and when not...

Thanks
Kevin
Jason Gunthorpe June 3, 2021, 1:49 a.m. UTC | #5
On Thu, Jun 03, 2021 at 01:11:37AM +0000, Tian, Kevin wrote:

> Jason, can you clarify your attitude on mdev guid stuff? Are you 
> completely against it or case-by-case? If the former, this is a big
> decision thus it's better to have consensus with Alex/Kirti. If the
> latter, would like to hear your criteria for when it can be used
> and when not...

I dislike it generally, but it exists so <shrug>. I know others feel
more strongly about it being un-kernely and the wrong way to use sysfs.

Here I was remarking how the example in the cover letter made the mdev
part seem totally pointless. If it is pointless then don't do it.

Remember we have stripped away the actual special need to use
mdev. You don't *have* to use mdev anymore to use vfio. That is a
significant ideology change even from a few months ago.

Jason
Tian, Kevin June 3, 2021, 5:52 a.m. UTC | #6
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Thursday, June 3, 2021 9:50 AM
> 
> On Thu, Jun 03, 2021 at 01:11:37AM +0000, Tian, Kevin wrote:
> 
> > Jason, can you clarify your attitude on mdev guid stuff? Are you
> > completely against it or case-by-case? If the former, this is a big
> > decision thus it's better to have consensus with Alex/Kirti. If the
> > latter, would like to hear your criteria for when it can be used
> > and when not...
> 
> I dislike it generally, but it exists so <shrug>. I know others feel
> more strongly about it being un-kernely and the wrong way to use sysfs.
> 
> Here I was remarking how the example in the cover letter made the mdev
> part seem totally pointless. If it is pointless then don't do it.

Is your point that as long as an mdev requires pre-configuration
through driver-specific sysfs, it doesn't make sense to use the
mdev guid interface anymore?

The value of the mdev guid interface is that it provides a vendor-agnostic
interface for mdev life-cycle management, which allows one enablement
flow to fit all in the upper management stack. Requiring vendor-specific
pre-configuration does blur the boundary here.

Alex/Kirti/Cornelia, what is your opinion here? It would be better if
we could reach a consensus on when and where the existing mdev sysfs
should be used, as this will affect every new mdev implementation from
now on.

> 
> Remember we have stripped away the actual special need to use
> mdev. You don't *have* to use mdev anymore to use vfio. That is a
> significant ideology change even from a few months ago.
> 

Yes, "don't have to" but if there is value of doing so  it's
not necessary to blocking it? One point in my mind is that if
we should minimize vendor-specific contracts for user to
manage mdev or subdevice...

Thanks
Kevin
Jason Gunthorpe June 3, 2021, 11:23 a.m. UTC | #7
On Thu, Jun 03, 2021 at 05:52:58AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Thursday, June 3, 2021 9:50 AM
> > 
> > On Thu, Jun 03, 2021 at 01:11:37AM +0000, Tian, Kevin wrote:
> > 
> > > Jason, can you clarify your attitude on mdev guid stuff? Are you
> > > completely against it or case-by-case? If the former, this is a big
> > > decision thus it's better to have consensus with Alex/Kirti. If the
> > > latter, would like to hear your criteria for when it can be used
> > > and when not...
> > 
> > I dislike it generally, but it exists so <shrug>. I know others feel
> > more strongly about it being un-kernely and the wrong way to use sysfs.
> > 
> > Here I was remarking how the example in the cover letter made the mdev
> > part seem totally pointless. If it is pointless then don't do it.
> 
> Is your point about that as long as a mdev requires pre-config
> through driver specific sysfs then it doesn't make sense to use
> mdev guid interface anymore?

Yes

> The value of mdev guid interface is providing a vendor-agnostic
> interface for mdev life-cycle management which allows one-
> enable-fit-all in upper management stack. Requiring vendor
> specific pre-config does blur the boundary here.

It isn't even vendor-agnostic - understanding the mdev_type
configuration stuff is still vendor specific.

Jason
Alex Williamson June 4, 2021, 3:40 a.m. UTC | #8
On Thu, 3 Jun 2021 05:52:58 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Thursday, June 3, 2021 9:50 AM
> > 
> > On Thu, Jun 03, 2021 at 01:11:37AM +0000, Tian, Kevin wrote:
> >   
> > > Jason, can you clarify your attitude on mdev guid stuff? Are you
> > > completely against it or case-by-case? If the former, this is a big
> > > decision thus it's better to have consensus with Alex/Kirti. If the
> > > latter, would like to hear your criteria for when it can be used
> > > and when not...  
> > 
> > I dislike it generally, but it exists so <shrug>. I know others feel
> > more strongly about it being un-kernely and the wrong way to use sysfs.
> > 
> > Here I was remarking how the example in the cover letter made the mdev
> > part seem totally pointless. If it is pointless then don't do it.  
> 
> Is your point about that as long as a mdev requires pre-config
> through driver specific sysfs then it doesn't make sense to use
> mdev guid interface anymore?

Can you describe exactly what step 1. is doing in this case from the
original cover letter ("Enable wq with "mdev" wq type")?  That does
sound a bit like configuring something to use mdev then separately
going to the trouble of creating the mdev.  As Jason suggests, if a wq
is tagged for mdev/vfio, it could just register itself as a vfio bus
driver.

But if we want to use mdev, why doesn't available_instances for your
mdev type simply report all unassigned wqs, and the `echo $UUID > create`
grab a wq for mdev?  That would remove this pre-config contention,
right?
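
Roughly, in sketch form (reusing the idxd-1dwq-v1 type name from the cover
letter; available_instances and create are the standard mdev type
attributes):

  TYPE=/sys/class/mdev_bus/0000:00:0a.0/mdev_supported_types/idxd-1dwq-v1
  cat $TYPE/available_instances      # would report the number of unassigned wqs
  echo $(uuidgen) > $TYPE/create     # grabs one of those wqs for the new mdev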

> The value of mdev guid interface is providing a vendor-agnostic
> interface for mdev life-cycle management which allows one-
> enable-fit-all in upper management stack. Requiring vendor
> specific pre-config does blur the boundary here.

We need to be careful about using work-avoidance in the upper
management stack as a primary use case for an interface though.

> Alex/Kirt/Cornelia, what about your opinion here? It's better 
> we can have an consensus on when and where the existing
> mdev sysfs could be used, as this will affect every new mdev
> implementation from now on.

I have a hard time defining some fixed criteria for using mdev.  It's
essentially always been true that vendors could write their own vfio
"bus driver", like vfio-pci or vfio-platform, specific to their device.
Mdevs were meant to be a way for the (non-vfio) driver of a device to
expose portions of the device through mediation for use with vfio.  It
seems like that's largely being done here.

What I think has changed recently is this desire to make it easier to
create those vendor drivers and some promise of making module binding
work to avoid the messiness around picking a driver for the device.  In
the auxiliary bus case that I think Jason is working on, it sounds like
the main device driver exposes portions of itself on an auxiliary bus
where drivers on that bus can integrate into the vfio subsystem.  It
starts to get pretty fuzzy with what mdev already does, but it's also a
more versatile interface.  Is it right for everyone?  Probably not.

Is the pre-configuration issue here really a push vs pull problem?  I
can see the requirement in step 1. is dedicating some resources to an
mdev use case, so at that point it seems like the argument is whether we
should just create aux bus devices that get automatically bound to a
vendor vfio-pci variant and we avoid the mdev lifecycle, which is both
convenient and ugly.  On the other hand, mdev has more of a pull
interface, ie. here are a bunch of device types and how many of each we
can support, use create to pull what you need.

> > Remember we have stripped away the actual special need to use
> > mdev. You don't *have* to use mdev anymore to use vfio. That is a
> > significant ideology change even from a few months ago.
> >   
> 
> Yes, "don't have to" but if there is value of doing so  it's
> not necessary to blocking it? One point in my mind is that if
> we should minimize vendor-specific contracts for user to
> manage mdev or subdevice...

Again, this in itself is not a great justification for using mdev,
we're creating vendor specific device types with vendor specific
additional features, that could all be done via some sort of netlink
interface too.  The thing that pushes this more towards mdev for me is
that I don't think each of these wqs appear as devices to the host,
they're internal resources of the parent device and we want to compose
them in ways that are slightly more amenable to traditional mdevs... I
think.  Thanks,

Alex
Tian, Kevin June 7, 2021, 6:22 a.m. UTC | #9
Hi, Alex,

Thanks for sharing your thoughts.

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Friday, June 4, 2021 11:40 AM
> 
> On Thu, 3 Jun 2021 05:52:58 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> 
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Thursday, June 3, 2021 9:50 AM
> > >
> > > On Thu, Jun 03, 2021 at 01:11:37AM +0000, Tian, Kevin wrote:
> > >
> > > > Jason, can you clarify your attitude on mdev guid stuff? Are you
> > > > completely against it or case-by-case? If the former, this is a big
> > > > decision thus it's better to have consensus with Alex/Kirti. If the
> > > > latter, would like to hear your criteria for when it can be used
> > > > and when not...
> > >
> > > I dislike it generally, but it exists so <shrug>. I know others feel
> > > more strongly about it being un-kernely and the wrong way to use sysfs.
> > >
> > > Here I was remarking how the example in the cover letter made the mdev
> > > part seem totally pointless. If it is pointless then don't do it.
> >
> > Is your point about that as long as a mdev requires pre-config
> > through driver specific sysfs then it doesn't make sense to use
> > mdev guid interface anymore?
> 
> Can you describe exactly what step 1. is doing in this case from the
> original cover letter ("Enable wq with "mdev" wq type")?  That does
> sound a bit like configuring something to use mdev then separately
> going to the trouble of creating the mdev.  As Jason suggests, if a wq
> is tagged for mdev/vfio, it could just register itself as a vfio bus
> driver.

I'll leave it to Dave to explain the exact details of step 1.

> 
> But if we want to use mdev, why doesn't available_instances for your
> mdev type simply report all unassigned wq and the `echo $UUID > create`
> grabs a wq for mdev?  That would remove this pre-config contention,
> right?

This way could also work. It sort of changes pre-config to post-config,
i.e. after an unassigned wq is grabbed for mdev, the admin then
configures additional vendor-specific parameters (not initialized by
the parent driver) before this mdev is assigned to a VM. It looks like
this is also what NVIDIA is doing for their vGPU, with a cmdline tool
(nvidia-smi) and nvidia sysfs nodes for setting plugin parameters:

	https://docs.nvidia.com/grid/latest/pdf/grid-vgpu-user-guide.pdf

But I'll leave it to Dave again, as there must be a reason why they chose
pre-config in the first place.

> 
> > The value of mdev guid interface is providing a vendor-agnostic
> > interface for mdev life-cycle management which allows one-
> > enable-fit-all in upper management stack. Requiring vendor
> > specific pre-config does blur the boundary here.
> 
> We need to be careful about using work-avoidance in the upper
> management stack as a primary use case for an interface though.

ok

> 
> > Alex/Kirt/Cornelia, what about your opinion here? It's better
> > we can have an consensus on when and where the existing
> > mdev sysfs could be used, as this will affect every new mdev
> > implementation from now on.
> 
> I have a hard time defining some fixed criteria for using mdev.  It's
> essentially always been true that vendors could write their own vfio
> "bus driver", like vfio-pci or vfio-platform, specific to their device.
> Mdevs were meant to be a way for the (non-vfio) driver of a device to
> expose portions of the device through mediation for use with vfio.  It
> seems like that's largely being done here.
> 
> What I think has changed recently is this desire to make it easier to
> create those vendor drivers and some promise of making module binding
> work to avoid the messiness around picking a driver for the device.  In
> the auxiliary bus case that I think Jason is working on, it sounds like
> the main device driver exposes portions of itself on an auxiliary bus
> where drivers on that bus can integrate into the vfio subsystem.  It
> starts to get pretty fuzzy with what mdev already does, but it's also a
> more versatile interface.  Is it right for everyone?  Probably not.

idxd is also moving toward this model per Jason's suggestion. Although
the auxiliary bus is not directly used, the idxd driver has its own bus
for exposing a portion of its resources. From this angle, all the
motivation around the mdev bus does get fuzzy...

> 
> Is the pre-configuration issue here really a push vs pull problem?  I
> can see the requirement in step 1. is dedicating some resources to an
> mdev use case, so at that point it seems like the argument is whether we
> should just create aux bus devices that get automatically bound to a
> vendor vfio-pci variant and we avoid the mdev lifecycle, which is both
> convenient and ugly.  On the other hand, mdev has more of a pull
> interface, ie. here are a bunch of device types and how many of each we
> can support, use create to pull what you need.

I see your point. It looks like what idxd is moving toward now is a mixed
model. The parent driver uses a push interface to initialize a pool of
instances which are then managed through mdev in a pull mode.

> 
> > > Remember we have stripped away the actual special need to use
> > > mdev. You don't *have* to use mdev anymore to use vfio. That is a
> > > significant ideology change even from a few months ago.
> > >
> >
> > Yes, "don't have to" but if there is value of doing so  it's
> > not necessary to blocking it? One point in my mind is that if
> > we should minimize vendor-specific contracts for user to
> > manage mdev or subdevice...
> 
> Again, this in itself is not a great justification for using mdev,
> we're creating vendor specific device types with vendor specific
> additional features, that could all be done via some sort of netlink
> interface too.  The thing that pushes this more towards mdev for me is
> that I don't think each of these wqs appear as devices to the host,
> they're internal resources of the parent device and we want to compose
> them in ways that are slightly more amenable to traditional mdevs... I
> think.  Thanks,
> 

Yes, this is one reason going toward mdev.

btw I'm not clear on what the netlink interface will finally be, especially
whether any generic commands should be defined across devices, given
that subdevice management still has large generality. Jason, do you have
an example somewhere that we can take a look at regarding the mlx
netlink design?

Thanks
Kevin
Dave Jiang June 7, 2021, 6:13 p.m. UTC | #10
On 6/6/2021 11:22 PM, Tian, Kevin wrote:
> Hi, Alex,
>
> Thanks for sharing your thoughts.
>
>> From: Alex Williamson <alex.williamson@redhat.com>
>> Sent: Friday, June 4, 2021 11:40 AM
>>
>> On Thu, 3 Jun 2021 05:52:58 +0000
>> "Tian, Kevin" <kevin.tian@intel.com> wrote:
>>
>>>> From: Jason Gunthorpe <jgg@nvidia.com>
>>>> Sent: Thursday, June 3, 2021 9:50 AM
>>>>
>>>> On Thu, Jun 03, 2021 at 01:11:37AM +0000, Tian, Kevin wrote:
>>>>
>>>>> Jason, can you clarify your attitude on mdev guid stuff? Are you
>>>>> completely against it or case-by-case? If the former, this is a big
>>>>> decision thus it's better to have consensus with Alex/Kirti. If the
>>>>> latter, would like to hear your criteria for when it can be used
>>>>> and when not...
>>>> I dislike it generally, but it exists so <shrug>. I know others feel
>>>> more strongly about it being un-kernely and the wrong way to use sysfs.
>>>>
>>>> Here I was remarking how the example in the cover letter made the mdev
>>>> part seem totally pointless. If it is pointless then don't do it.
>>> Is your point about that as long as a mdev requires pre-config
>>> through driver specific sysfs then it doesn't make sense to use
>>> mdev guid interface anymore?
>> Can you describe exactly what step 1. is doing in this case from the
>> original cover letter ("Enable wq with "mdev" wq type")?  That does
>> sound a bit like configuring something to use mdev then separately
>> going to the trouble of creating the mdev.  As Jason suggests, if a wq
>> is tagged for mdev/vfio, it could just register itself as a vfio bus
>> driver.
> I'll leave to Dave to explain the exact detail in step 1.

So in step 1, we 'tag' the wq to be dedicated to guest usage and put the 
hardware wq into the enabled state. For a dedicated mode wq, we can 
definitely just register directly and skip the mdev step. For a shared 
wq mode, we can have multiple mdevs running on top of a single wq. So we 
need some way to create more mdevs. We can either go with the existing 
established creation path provided by mdev, or invent something custom for 
the driver as Jason suggested to accommodate additional virtual devices for 
guests. We implemented the mdev path originally with the consideration that 
mdev is established and has a known interface already.

>
>> But if we want to use mdev, why doesn't available_instances for your
>> mdev type simply report all unassigned wq and the `echo $UUID > create`
>> grabs a wq for mdev?  That would remove this pre-config contention,
>> right?
> This way could also work. It sort of changes pre-config to post-config,
> i.e. after an unassigned wq is grabbed for mdev, the admin then
> configures additional vendor specific parameters (not initialized by
> parent driver) before this mdev is assigned to a VM. Looks this is also
> what NVIDIA is doing for their vGPU, with a cmdline tool (nvidia-smi)
> and nvidia sysfs node for setting plugin parameters:
>
>          https://docs.nvidia.com/grid/latest/pdf/grid-vgpu-user-guide.pdf
>
> But I'll leave to Dave again as there must be a reason why they choose
> pre-config in the first place.

I think things become more complicated when we go from a dedicated wq to a 
shared wq, where the wq : mdev relationship goes from 1 : 1 to 1 : N. 
Also, keeping a consistent user config experience is desired, especially 
since we have already had such behavior since kernel 5.6 for host usages. 
So we really need to try to avoid doing wq configuration differently just 
for "mdev" wqs. In the case suggested above, we basically just flipped the 
configuration steps. The mdev is first created through the mdev sysfs 
interface, and then the device parameters are configured. Whereas for us, 
we configure the device parameters first, and then create the mdev. But in 
the end, it's still the hybrid mdev setup, right?


>>> The value of mdev guid interface is providing a vendor-agnostic
>>> interface for mdev life-cycle management which allows one-
>>> enable-fit-all in upper management stack. Requiring vendor
>>> specific pre-config does blur the boundary here.
>> We need to be careful about using work-avoidance in the upper
>> management stack as a primary use case for an interface though.
> ok
>
>>> Alex/Kirt/Cornelia, what about your opinion here? It's better
>>> we can have an consensus on when and where the existing
>>> mdev sysfs could be used, as this will affect every new mdev
>>> implementation from now on.
>> I have a hard time defining some fixed criteria for using mdev.  It's
>> essentially always been true that vendors could write their own vfio
>> "bus driver", like vfio-pci or vfio-platform, specific to their device.
>> Mdevs were meant to be a way for the (non-vfio) driver of a device to
>> expose portions of the device through mediation for use with vfio.  It
>> seems like that's largely being done here.
>>
>> What I think has changed recently is this desire to make it easier to
>> create those vendor drivers and some promise of making module binding
>> work to avoid the messiness around picking a driver for the device.  In
>> the auxiliary bus case that I think Jason is working on, it sounds like
>> the main device driver exposes portions of itself on an auxiliary bus
>> where drivers on that bus can integrate into the vfio subsystem.  It
>> starts to get pretty fuzzy with what mdev already does, but it's also a
>> more versatile interface.  Is it right for everyone?  Probably not.
> idxd is also moving toward this model per Jason's suggestion. Although
> auxiliar bus is not directly used, idxd driver has its own bus for exposing
> portion of its resources. From this angle, all the motivation around mdev
> bus does get fuzzy...
>
>> Is the pre-configuration issue here really a push vs pull problem?  I
>> can see the requirement in step 1. is dedicating some resources to an
>> mdev use case, so at that point it seems like the argument is whether we
>> should just create aux bus devices that get automatically bound to a
>> vendor vfio-pci variant and we avoid the mdev lifecycle, which is both
>> convenient and ugly.  On the other hand, mdev has more of a pull
>> interface, ie. here are a bunch of device types and how many of each we
>> can support, use create to pull what you need.
> I see your point. Looks what idxd is toward now is a mixed model. The
> parent driver uses a push interface to initialize a pool of instances
> which are then managed through mdev in a pull mode.
>
>>>> Remember we have stripped away the actual special need to use
>>>> mdev. You don't *have* to use mdev anymore to use vfio. That is a
>>>> significant ideology change even from a few months ago.
>>>>
>>> Yes, "don't have to" but if there is value of doing so  it's
>>> not necessary to blocking it? One point in my mind is that if
>>> we should minimize vendor-specific contracts for user to
>>> manage mdev or subdevice...
>> Again, this in itself is not a great justification for using mdev,
>> we're creating vendor specific device types with vendor specific
>> additional features, that could all be done via some sort of netlink
>> interface too.  The thing that pushes this more towards mdev for me is
>> that I don't think each of these wqs appear as devices to the host,
>> they're internal resources of the parent device and we want to compose
>> them in ways that are slightly more amenable to traditional mdevs... I
>> think.  Thanks,
>>
> Yes, this is one reason going toward mdev.
>
> btw I'm not clear what the netlink interface will finally be, especially
> about whether any generic cmd should be defined cross devices given
> that subdevice management still has large generality. Jason, do you have
> an example somewhere which we can take a look regarding to mlx
> netlink design?
>
> Thanks
> Kevin
Jason Gunthorpe June 7, 2021, 6:28 p.m. UTC | #11
On Mon, Jun 07, 2021 at 06:22:08AM +0000, Tian, Kevin wrote:
> 
> btw I'm not clear what the netlink interface will finally be, especially 
> about whether any generic cmd should be defined cross devices given 
> that subdevice management still has large generality. Jason, do you have
> an example somewhere which we can take a look regarding to mlx 
> netlink design? 

Start here:

https://lore.kernel.org/netdev/20210120090531.49553-1-saeed@kernel.org/

devlink is some more generic way to control PCI/etc devices as some side
band to their usual uAPI interfaces like netlink/rdma/etc

Jason
Jason Gunthorpe June 7, 2021, 7:11 p.m. UTC | #12
On Mon, Jun 07, 2021 at 11:13:04AM -0700, Dave Jiang wrote:

> So in step 1, we 'tag' the wq to be dedicated to guest usage and put the
> hardware wq into enable state. For a dedicated mode wq, we can definitely
> just register directly and skip the mdev step. For a shared wq mode, we can
> have multiple mdev running on top of a single wq. So we need some way to
> create more mdevs. We can either go with the existing established creation
> path by mdev, or invent something custom for the driver as Jason suggested
> to accomodate additional virtual devices for guests. We implemented the mdev
> path originally with consideration of mdev is established and has a known
> interface already.

It sounds like you could just as easily have a 'create new vfio'
file under the idxd sysfs. Especially since you already have a bus
and dynamic vfio-specific things being created on this bus.

Have you gone over this with Dan?

> I think things become more complicated when we go from a dedicated wq to
> shared wq where the relationship of wq : mdev is 1 : 1 goes to 1 : N. Also
> needing to keep a consistent user config experience is desired, especially
> we already have such behavior since kernel 5.6 for host usages. So we really
> need try to avoid doing wq configuration differently just for "mdev" wqs. In
> the case suggested above, we basically just flipped the configuration steps.
> Mdev is first created through mdev sysfs interface. And then the device
> paramters are configured. Where for us, we configure the device parameter
> first, and then create the mdev. But in the end, it's still the hybrid mdev
> setup right?

So you don't even use mdev to configure anything? Yuk.

Jason
Dave Jiang June 8, 2021, 4:02 p.m. UTC | #13
On 6/7/2021 12:11 PM, Jason Gunthorpe wrote:
> On Mon, Jun 07, 2021 at 11:13:04AM -0700, Dave Jiang wrote:
>
>> So in step 1, we 'tag' the wq to be dedicated to guest usage and put the
>> hardware wq into enable state. For a dedicated mode wq, we can definitely
>> just register directly and skip the mdev step. For a shared wq mode, we can
>> have multiple mdev running on top of a single wq. So we need some way to
>> create more mdevs. We can either go with the existing established creation
>> path by mdev, or invent something custom for the driver as Jason suggested
>> to accomodate additional virtual devices for guests. We implemented the mdev
>> path originally with consideration of mdev is established and has a known
>> interface already.
> It sounds like you could just as easially have a 'create new vfio'
> file under the idxd sysfs.. Especially since you already have a bus
> and dynamic vfio specific things being created on this bus.

Will explore this and the use of 'struct vfio_device' without mdev.



>
> Have you gone over this with Dan?
>
>> I think things become more complicated when we go from a dedicated wq to
>> shared wq where the relationship of wq : mdev is 1 : 1 goes to 1 : N. Also
>> needing to keep a consistent user config experience is desired, especially
>> we already have such behavior since kernel 5.6 for host usages. So we really
>> need try to avoid doing wq configuration differently just for "mdev" wqs. In
>> the case suggested above, we basically just flipped the configuration steps.
>> Mdev is first created through mdev sysfs interface. And then the device
>> paramters are configured. Where for us, we configure the device parameter
>> first, and then create the mdev. But in the end, it's still the hybrid mdev
>> setup right?
> So you don't even use mdev to configure anything? Yuk.
>
> Jason
Dave Jiang June 11, 2021, 6:21 p.m. UTC | #14
On 6/8/2021 9:02 AM, Dave Jiang wrote:
>
> On 6/7/2021 12:11 PM, Jason Gunthorpe wrote:
>> On Mon, Jun 07, 2021 at 11:13:04AM -0700, Dave Jiang wrote:
>>
>>> So in step 1, we 'tag' the wq to be dedicated to guest usage and put 
>>> the
>>> hardware wq into enable state. For a dedicated mode wq, we can 
>>> definitely
>>> just register directly and skip the mdev step. For a shared wq mode, 
>>> we can
>>> have multiple mdev running on top of a single wq. So we need some 
>>> way to
>>> create more mdevs. We can either go with the existing established 
>>> creation
>>> path by mdev, or invent something custom for the driver as Jason 
>>> suggested
>>> to accomodate additional virtual devices for guests. We implemented 
>>> the mdev
>>> path originally with consideration of mdev is established and has a 
>>> known
>>> interface already.
>> It sounds like you could just as easially have a 'create new vfio'
>> file under the idxd sysfs.. Especially since you already have a bus
>> and dynamic vfio specific things being created on this bus.
>
> Will explore this and using of 'struct vfio_device' without mdev.
>
Hi Jason. I hacked the idxd driver to remove the mdev association and use 
vfio_device directly. Ran into some issues. Specifically, mdev does some 
special handling when it comes to the iommu domain. When we hit 
vfio_iommu_type1_attach_group(), there's a branch in there for 
mdev_bus_type. It sets the group's mdev_group flag, which later has the 
effect of special handling for iommu_attach_group(). In addition, it ends 
up switching the bus to pci_bus_type before iommu_domain_alloc() is 
called.  Do we need to provide a similar type of handling for vfio_devices 
that are not backed by an entire PCI device like vfio_pci? Not sure it's 
the right thing to do to attach these devices to pci_bus_type directly.
Jason Gunthorpe June 11, 2021, 6:29 p.m. UTC | #15
On Fri, Jun 11, 2021 at 11:21:42AM -0700, Dave Jiang wrote:
> 
> On 6/8/2021 9:02 AM, Dave Jiang wrote:
> > 
> > On 6/7/2021 12:11 PM, Jason Gunthorpe wrote:
> > > On Mon, Jun 07, 2021 at 11:13:04AM -0700, Dave Jiang wrote:
> > > 
> > > > So in step 1, we 'tag' the wq to be dedicated to guest usage and
> > > > put the
> > > > hardware wq into enable state. For a dedicated mode wq, we can
> > > > definitely
> > > > just register directly and skip the mdev step. For a shared wq
> > > > mode, we can
> > > > have multiple mdev running on top of a single wq. So we need
> > > > some way to
> > > > create more mdevs. We can either go with the existing
> > > > established creation
> > > > path by mdev, or invent something custom for the driver as Jason
> > > > suggested
> > > > to accomodate additional virtual devices for guests. We
> > > > implemented the mdev
> > > > path originally with consideration of mdev is established and
> > > > has a known
> > > > interface already.
> > > It sounds like you could just as easially have a 'create new vfio'
> > > file under the idxd sysfs.. Especially since you already have a bus
> > > and dynamic vfio specific things being created on this bus.
> > 
> > Will explore this and using of 'struct vfio_device' without mdev.
> > 
> Hi Jason. I hacked the idxd driver to remove mdev association and use
> vfio_device directly. Ran into some issues. Specifically mdev does some
> special handling when it comes to iommu domain.

Yes, I know of this, it this needs fixing.

> effect of special handling for iommu_attach_group. And in addition, it ends
> up switching the bus to pci_bus_type before iommu_domain_alloc() is called. 
> Do we need to provide similar type of handling for vfio_device that are not
> backed by an entire PCI device like vfio_pci? Not sure it's the right thing
> to do to attach these devices to pci_bus_type directly.

Yes, type1 needs to be changed so it somehow knows that the struct
device is 'sw only', for instance because it has no direct HW IOMMU
connection; then all the mdev hackery can be deleted from type1.

I haven't investigated closely exactly how to do this.

Jason