
[v6,15/20] vfio/mdev: idxd: ims domain setup for the vdcm

Message ID 162164283796.261970.11020270418798826121.stgit@djiang5-desk3.ch.intel.com (mailing list archive)
State New, archived
Series Add VFIO mediated device support and DEV-MSI support for the idxd driver

Commit Message

Dave Jiang May 22, 2021, 12:20 a.m. UTC
Add setup code for the IMS domain. This feeds the MSI subsystem the
relevant information for device IMS. The allocation of the IMS vectors is
done in common VFIO code when the correct domain is set for the
mdev device.

Signed-off-by: Dave Jiang <dave.jiang@intel.com>
---
 drivers/dma/idxd/idxd.h       |    1 +
 drivers/dma/idxd/init.c       |   14 ++++++++++++++
 drivers/vfio/mdev/idxd/mdev.c |    2 ++
 3 files changed, 17 insertions(+)

Comments

Jason Gunthorpe May 23, 2021, 11:50 p.m. UTC | #1
On Fri, May 21, 2021 at 05:20:37PM -0700, Dave Jiang wrote:
> @@ -77,8 +80,18 @@ int idxd_mdev_host_init(struct idxd_device *idxd, struct mdev_driver *drv)
>  		return rc;
>  	}
>  
> +	ims_info.max_slots = idxd->ims_size;
> +	ims_info.slots = idxd->reg_base + idxd->ims_offset;
> +	idxd->ims_domain = pci_ims_array_create_msi_irq_domain(idxd->pdev, &ims_info);
> +	if (!idxd->ims_domain) {
> +		dev_warn(dev, "Fail to acquire IMS domain\n");
> +		iommu_dev_disable_feature(dev, IOMMU_DEV_FEAT_AUX);
> +		return -ENODEV;
> +	}

I'm quite surprised that every mdev doesn't create its own ims_domain
in its probe function.

This places a global total limit on the # of vectors, which makes me
ask what was the point of using IMS in the first place?

The entire idea for IMS was to make the whole allocation system fully
dynamic based on demand.

>  	rc = mdev_register_device(dev, drv);
>  	if (rc < 0) {
> +		irq_domain_remove(idxd->ims_domain);
>  		iommu_dev_disable_feature(dev, IOMMU_DEV_FEAT_AUX);
>  		return rc;
>  	}

This really wants a goto error unwind

Jason
Dave Jiang May 27, 2021, 12:22 a.m. UTC | #2
On 5/23/2021 4:50 PM, Jason Gunthorpe wrote:
> On Fri, May 21, 2021 at 05:20:37PM -0700, Dave Jiang wrote:
>> @@ -77,8 +80,18 @@ int idxd_mdev_host_init(struct idxd_device *idxd, struct mdev_driver *drv)
>>   		return rc;
>>   	}
>>   
>> +	ims_info.max_slots = idxd->ims_size;
>> +	ims_info.slots = idxd->reg_base + idxd->ims_offset;
>> +	idxd->ims_domain = pci_ims_array_create_msi_irq_domain(idxd->pdev, &ims_info);
>> +	if (!idxd->ims_domain) {
>> +		dev_warn(dev, "Fail to acquire IMS domain\n");
>> +		iommu_dev_disable_feature(dev, IOMMU_DEV_FEAT_AUX);
>> +		return -ENODEV;
>> +	}
> I'm quite surprised that every mdev doesn't create its own ims_domain
> in its probe function.
>
> This places a global total limit on the # of vectors which makes me
> ask what was the point of using IMS in the first place ?
>
> The entire idea for IMS was to make the whole allocation system fully
> dynamic based on demand.

Hi Jason, thank you for the review of the series.

My understanding is that the driver creates a single IMS domain for the 
device and provides the address base and IMS numbers for the domain 
based on device IMS resources. So the IMS region needs to be contiguous. 
Each mdev can call msi_domain_alloc_irqs() and acquire the number of IMS 
vectors it desires and the DEV MSI core code will keep track of which 
vectors are being used. This allows the mdev devices to dynamically 
allocate based on demand. If the driver allocates a domain per mdev, 
it'll need to do internal accounting of the base and vector numbers for 
each of those domains, which the MSI core already provides. Isn't that 
what we are trying to avoid? As mdevs come and go, that partitioning 
will become fragmented.

For example, mdev 0 allocates 1 vector, mdev 1 allocates 2 vectors, and 
mdev 2 allocates 3 vectors, leaving 1 vector unallocated. When mdev 1 
goes away and a new mdev shows up wanting 3 vectors, you won't be able 
to allocate a contiguous domain for it because of fragmentation, even 
though you have enough free vectors.

If all mdevs allocate the same number of IMS vectors, the fragmentation 
issue does not exist. But the driver still has to keep track of which 
vectors are free and which are in use in order to provide the appropriate 
base, and the dev-msi core already does this for us if we have a single 
domain. Feels like we would just be duplicating code doing the same thing?


>
>>   	rc = mdev_register_device(dev, drv);
>>   	if (rc < 0) {
>> +		irq_domain_remove(idxd->ims_domain);
>>   		iommu_dev_disable_feature(dev, IOMMU_DEV_FEAT_AUX);
>>   		return rc;
>>   	}
> This really wants a goto error unwind
>
> Jason
Jason Gunthorpe May 27, 2021, 12:54 a.m. UTC | #3
On Wed, May 26, 2021 at 05:22:22PM -0700, Dave Jiang wrote:
> 
> On 5/23/2021 4:50 PM, Jason Gunthorpe wrote:
> > On Fri, May 21, 2021 at 05:20:37PM -0700, Dave Jiang wrote:
> > > @@ -77,8 +80,18 @@ int idxd_mdev_host_init(struct idxd_device *idxd, struct mdev_driver *drv)
> > >   		return rc;
> > >   	}
> > > +	ims_info.max_slots = idxd->ims_size;
> > > +	ims_info.slots = idxd->reg_base + idxd->ims_offset;
> > > +	idxd->ims_domain = pci_ims_array_create_msi_irq_domain(idxd->pdev, &ims_info);
> > > +	if (!idxd->ims_domain) {
> > > +		dev_warn(dev, "Fail to acquire IMS domain\n");
> > > +		iommu_dev_disable_feature(dev, IOMMU_DEV_FEAT_AUX);
> > > +		return -ENODEV;
> > > +	}
> > I'm quite surprised that every mdev doesn't create its own ims_domain
> > in its probe function.
> > 
> > This places a global total limit on the # of vectors which makes me
> > ask what was the point of using IMS in the first place ?
> > 
> > The entire idea for IMS was to make the whole allocation system fully
> > dynamic based on demand.
> 
> Hi Jason, thank you for the review of the series.
> 
> My understanding is that the driver creates a single IMS domain for the
> device and provides the address base and IMS numbers for the domain based on
> device IMS resources. So the IMS region needs to be contiguous. Each mdev
> can call msi_domain_alloc_irqs() and acquire the number of IMS vectors it
> desires and the DEV MSI core code will keep track of which vectors are being
> used. This allows the mdev devices to dynamically allocate based on demand.
> If the driver allocates a domain per mdev, it'll needs to do internal
> accounting of the base and vector numbers for each of those domains that the
> MSI core already provides. Isn't that what we are trying to avoid? As mdevs
> come and go, that partitioning will become fragmented.

I suppose it depends entirely on how the HW works.

If the HW has a fixed number of interrupt vectors organized in a
single table then by all means allocate a single domain that spans the
entire fixed HW vector space. But then why do we have an ims_size
variable here??

However, that really begs the question of why the HW is using IMS at
all? I'd expect needing 2x-10x the max MSI-X vector size before
reaching for IMS.

So does IDXD really have like a 4k - 40k entry linear IMS vector table
to wrap a shared domain around?

Basically, that isn't really "scalable" it is just "bigger".

Fully scalable would be for every mdev to point to its own 2k entry
IMS table that is allocated on the fly. Every mdev gets a domain and
every domain is fully utilized by the mdev in emulating
MSI-X. Basically for a device like idxd every PASID would have to map
to an IMS vector table array.

I suppose that was not what was done?

Jason
Dave Jiang May 27, 2021, 1:15 a.m. UTC | #4
On 5/26/2021 5:54 PM, Jason Gunthorpe wrote:
> On Wed, May 26, 2021 at 05:22:22PM -0700, Dave Jiang wrote:
>> On 5/23/2021 4:50 PM, Jason Gunthorpe wrote:
>>> On Fri, May 21, 2021 at 05:20:37PM -0700, Dave Jiang wrote:
>>>> @@ -77,8 +80,18 @@ int idxd_mdev_host_init(struct idxd_device *idxd, struct mdev_driver *drv)
>>>>    		return rc;
>>>>    	}
>>>> +	ims_info.max_slots = idxd->ims_size;
>>>> +	ims_info.slots = idxd->reg_base + idxd->ims_offset;
>>>> +	idxd->ims_domain = pci_ims_array_create_msi_irq_domain(idxd->pdev, &ims_info);
>>>> +	if (!idxd->ims_domain) {
>>>> +		dev_warn(dev, "Fail to acquire IMS domain\n");
>>>> +		iommu_dev_disable_feature(dev, IOMMU_DEV_FEAT_AUX);
>>>> +		return -ENODEV;
>>>> +	}
>>> I'm quite surprised that every mdev doesn't create its own ims_domain
>>> in its probe function.
>>>
>>> This places a global total limit on the # of vectors which makes me
>>> ask what was the point of using IMS in the first place ?
>>>
>>> The entire idea for IMS was to make the whole allocation system fully
>>> dynamic based on demand.
>> Hi Jason, thank you for the review of the series.
>>
>> My understanding is that the driver creates a single IMS domain for the
>> device and provides the address base and IMS numbers for the domain based on
>> device IMS resources. So the IMS region needs to be contiguous. Each mdev
>> can call msi_domain_alloc_irqs() and acquire the number of IMS vectors it
>> desires and the DEV MSI core code will keep track of which vectors are being
>> used. This allows the mdev devices to dynamically allocate based on demand.
>> If the driver allocates a domain per mdev, it'll needs to do internal
>> accounting of the base and vector numbers for each of those domains that the
>> MSI core already provides. Isn't that what we are trying to avoid? As mdevs
>> come and go, that partitioning will become fragmented.
> I suppose it depends entirely on how the HW works.
>
> If the HW has a fixed number of interrupt vectors organized in a
> single table then by all means allocate a single domain that spans the
> entire fixed HW vector space. But then why do we have a ims_size
> variable here??
>
> However, that really begs the question of why the HW is using IMS at
> all? I'd expect needing 2x-10x the max MSI-X vector size before
> reaching for IMS.
>
> So does IDXD really have like a 4k - 40k entry linear IMS vector table
> to wrap a shared domain around?
>
> Basically, that isn't really "scalable" it is just "bigger".
>
> Fully scalable would be for every mdev to point to its own 2k entry
> IMS table that is allocated on the fly. Every mdev gets a domain and
> every domain is fully utilized by the mdev in emulating
> MSI-X. Basically for a device like idxd every PASID would have to map
> to a IMS vector table array.
>
> I suppose that was not what was done?

At least not for the first gen of hardware. DSA 1.0 has 2k IMS entries 
total; ims_size is what is read from the device cap register. For MSI-X, 
the device only has 1 misc vector and 8 I/O vectors. That's why IMS is 
being used for mdevs. We will discuss your suggestion with our hardware 
people.

>
> Jason
Raj, Ashok May 27, 2021, 1:41 a.m. UTC | #5
On Wed, May 26, 2021 at 09:54:44PM -0300, Jason Gunthorpe wrote:
> On Wed, May 26, 2021 at 05:22:22PM -0700, Dave Jiang wrote:
> > 
> > On 5/23/2021 4:50 PM, Jason Gunthorpe wrote:
> > > On Fri, May 21, 2021 at 05:20:37PM -0700, Dave Jiang wrote:
> > > > @@ -77,8 +80,18 @@ int idxd_mdev_host_init(struct idxd_device *idxd, struct mdev_driver *drv)
> > > >   		return rc;
> > > >   	}
> > > > +	ims_info.max_slots = idxd->ims_size;
> > > > +	ims_info.slots = idxd->reg_base + idxd->ims_offset;
> > > > +	idxd->ims_domain = pci_ims_array_create_msi_irq_domain(idxd->pdev, &ims_info);
> > > > +	if (!idxd->ims_domain) {
> > > > +		dev_warn(dev, "Fail to acquire IMS domain\n");
> > > > +		iommu_dev_disable_feature(dev, IOMMU_DEV_FEAT_AUX);
> > > > +		return -ENODEV;
> > > > +	}
> > > I'm quite surprised that every mdev doesn't create its own ims_domain
> > > in its probe function.
> > > 
> > > This places a global total limit on the # of vectors which makes me
> > > ask what was the point of using IMS in the first place ?
> > > 
> > > The entire idea for IMS was to make the whole allocation system fully
> > > dynamic based on demand.
> > 
> > Hi Jason, thank you for the review of the series.
> > 
> > My understanding is that the driver creates a single IMS domain for the
> > device and provides the address base and IMS numbers for the domain based on
> > device IMS resources. So the IMS region needs to be contiguous. Each mdev
> > can call msi_domain_alloc_irqs() and acquire the number of IMS vectors it
> > desires and the DEV MSI core code will keep track of which vectors are being
> > used. This allows the mdev devices to dynamically allocate based on demand.
> > If the driver allocates a domain per mdev, it'll needs to do internal
> > accounting of the base and vector numbers for each of those domains that the
> > MSI core already provides. Isn't that what we are trying to avoid? As mdevs
> > come and go, that partitioning will become fragmented.
> 
> I suppose it depends entirely on how the HW works.
> 
> If the HW has a fixed number of interrupt vectors organized in a
> single table then by all means allocate a single domain that spans the
> entire fixed HW vector space. But then why do we have a ims_size
> variable here??
> 
> However, that really begs the question of why the HW is using IMS at
> all? I'd expect needing 2x-10x the max MSI-X vector size before
> reaching for IMS.

It's more than the number of vectors; yes, that's one of the attributes.
IMS also has additional flexibility. I think we covered this a while
back, but it may have been lost since it's been a while.

- The format isn't just standard MSI-X; e.g. DSA has the pending bits
  all merged and co-located together with the interrupt store.
- You might want the vector space to be on the device. (I think you
  alluded that one of your devices can actually do that?)
- Remember we do handle validation when interrupts are requested from user
  space. Interrupts are validated with the PASID of the requester. (I think
  we also talked about whether the interrupt message should also take a
  PASID, as opposed to a request without PASID as specified in PCIe.)
- For certain devices the interrupt might simply be in the user context
  maintained by the kernel, e.g. graphics.
> 
> So does IDXD really have like a 4k - 40k entry linear IMS vector table
> to wrap a shared domain around?
> 
> Basically, that isn't really "scalable" it is just "bigger".
> 
> Fully scalable would be for every mdev to point to its own 2k entry
> IMS table that is allocated on the fly. Every mdev gets a domain and
> every domain is fully utilized by the mdev in emulating
> MSI-X. Basically for a device like idxd every PASID would have to map
> to a IMS vector table array.
> 
> I suppose that was not what was done?
> 
> Jason
Jason Gunthorpe May 27, 2021, 1:36 p.m. UTC | #6
On Wed, May 26, 2021 at 06:41:07PM -0700, Raj, Ashok wrote:
> On Wed, May 26, 2021 at 09:54:44PM -0300, Jason Gunthorpe wrote:
> > On Wed, May 26, 2021 at 05:22:22PM -0700, Dave Jiang wrote:
> > > 
> > > On 5/23/2021 4:50 PM, Jason Gunthorpe wrote:
> > > > On Fri, May 21, 2021 at 05:20:37PM -0700, Dave Jiang wrote:
> > > > > @@ -77,8 +80,18 @@ int idxd_mdev_host_init(struct idxd_device *idxd, struct mdev_driver *drv)
> > > > >   		return rc;
> > > > >   	}
> > > > > +	ims_info.max_slots = idxd->ims_size;
> > > > > +	ims_info.slots = idxd->reg_base + idxd->ims_offset;
> > > > > +	idxd->ims_domain = pci_ims_array_create_msi_irq_domain(idxd->pdev, &ims_info);
> > > > > +	if (!idxd->ims_domain) {
> > > > > +		dev_warn(dev, "Fail to acquire IMS domain\n");
> > > > > +		iommu_dev_disable_feature(dev, IOMMU_DEV_FEAT_AUX);
> > > > > +		return -ENODEV;
> > > > > +	}
> > > > I'm quite surprised that every mdev doesn't create its own ims_domain
> > > > in its probe function.
> > > > 
> > > > This places a global total limit on the # of vectors which makes me
> > > > ask what was the point of using IMS in the first place ?
> > > > 
> > > > The entire idea for IMS was to make the whole allocation system fully
> > > > dynamic based on demand.
> > > 
> > > Hi Jason, thank you for the review of the series.
> > > 
> > > My understanding is that the driver creates a single IMS domain for the
> > > device and provides the address base and IMS numbers for the domain based on
> > > device IMS resources. So the IMS region needs to be contiguous. Each mdev
> > > can call msi_domain_alloc_irqs() and acquire the number of IMS vectors it
> > > desires and the DEV MSI core code will keep track of which vectors are being
> > > used. This allows the mdev devices to dynamically allocate based on demand.
> > > If the driver allocates a domain per mdev, it'll needs to do internal
> > > accounting of the base and vector numbers for each of those domains that the
> > > MSI core already provides. Isn't that what we are trying to avoid? As mdevs
> > > come and go, that partitioning will become fragmented.
> > 
> > I suppose it depends entirely on how the HW works.
> > 
> > If the HW has a fixed number of interrupt vectors organized in a
> > single table then by all means allocate a single domain that spans the
> > entire fixed HW vector space. But then why do we have a ims_size
> > variable here??
> > 
> > However, that really begs the question of why the HW is using IMS at
> > all? I'd expect needing 2x-10x the max MSI-X vector size before
> > reaching for IMS.
> 
> Its more than the number of vectors. Yes, thats one of the attributes.
> IMS also has have additional flexibility. I think we covered this a while
> back but maybe lost since its been a while.
> 
> - Format isn't just the standard MSIx, for e.g. DSA has the pending bits
>   all merged and co-located together with the interrupt store.

But this is just random hardware churn; there is nothing wrong with
keeping the pending bits in the standard format.

> - You might want the vector space to be mabe on device. (I think you
>   alluded one of your devices can actually do that?)

Sure, but IDXD is not doing this

> - Remember we do handle validation when interrupts are requested from user
>   space. Interrupts are validated with PASID of requester. (I think we also
>   talked about if we should turn the interrupt message to also take a PASID
>   as opposed to request without PASID as its specified in PCIe)

Yes, but overall this doesn't really make sense to me, and doesn't in
and of itself require IMS. The PASID table could be an addendum to the
normal MSI-X table.

Besides, interrupts and PASID are not related concepts. Does every
user SVA process with a unique PASID get a unique interrupt? The
device and CPU don't have enough vectors to do this.

Frankly I expect interrupts to be multiplexed by queues not by PASID,
so that interrupts can be shared.

> - For certain devices the interupt might be simply in the user context
>   maintained by the kernel. Graphics for e.g.

IDXD is also not doing this.

Jason
Thomas Gleixner May 31, 2021, 2:02 p.m. UTC | #7
On Sun, May 23 2021 at 20:50, Jason Gunthorpe wrote:
> On Fri, May 21, 2021 at 05:20:37PM -0700, Dave Jiang wrote:
>> @@ -77,8 +80,18 @@ int idxd_mdev_host_init(struct idxd_device *idxd, struct mdev_driver *drv)
>>  		return rc;
>>  	}
>>  
>> +	ims_info.max_slots = idxd->ims_size;
>> +	ims_info.slots = idxd->reg_base + idxd->ims_offset;
>> +	idxd->ims_domain = pci_ims_array_create_msi_irq_domain(idxd->pdev, &ims_info);
>> +	if (!idxd->ims_domain) {
>> +		dev_warn(dev, "Fail to acquire IMS domain\n");
>> +		iommu_dev_disable_feature(dev, IOMMU_DEV_FEAT_AUX);
>> +		return -ENODEV;
>> +	}
>
> I'm quite surprised that every mdev doesn't create its own ims_domain
> in its probe function.

What for?

> This places a global total limit on the # of vectors which makes me
> ask what was the point of using IMS in the first place ?

That depends on how IMS is implemented. The IDXD variant has a fixed-size
message store which is shared between all subdevices, so yet
another domain would not provide any value.

For the case where the IMS store is separate, you still have one central
irqdomain per physical device. The domain allocation function can then
create storage on demand or reuse existing storage and just fill in the
pointers.

Thanks,

        tglx
Jason Gunthorpe May 31, 2021, 4:57 p.m. UTC | #8
On Mon, May 31, 2021 at 04:02:02PM +0200, Thomas Gleixner wrote:
> On Sun, May 23 2021 at 20:50, Jason Gunthorpe wrote:
> > On Fri, May 21, 2021 at 05:20:37PM -0700, Dave Jiang wrote:
> >> @@ -77,8 +80,18 @@ int idxd_mdev_host_init(struct idxd_device *idxd, struct mdev_driver *drv)
> >>  		return rc;
> >>  	}
> >>  
> >> +	ims_info.max_slots = idxd->ims_size;
> >> +	ims_info.slots = idxd->reg_base + idxd->ims_offset;
> >> +	idxd->ims_domain = pci_ims_array_create_msi_irq_domain(idxd->pdev, &ims_info);
> >> +	if (!idxd->ims_domain) {
> >> +		dev_warn(dev, "Fail to acquire IMS domain\n");
> >> +		iommu_dev_disable_feature(dev, IOMMU_DEV_FEAT_AUX);
> >> +		return -ENODEV;
> >> +	}
> >
> > I'm quite surprised that every mdev doesn't create its own ims_domain
> > in its probe function.
> 
> What for?

IDXD wouldn't need it, but proper IMS HW with no bound on the number of
vectors can't provide an ims_info.max_slots value here.
 
Instead, each use site, like VFIO, would want to specify the number
of vectors to allocate for its own usage, then parcel them out one by
one in the normal way. Basically VFIO is emulating a normal MSI-X
table.

> > This places a global total limit on the # of vectors which makes me
> > ask what was the point of using IMS in the first place ?
>
> That depends on how IMS is implemented. The IDXD variant has a fixed
> sized message store which is shared between all subdevices, so yet
> another domain would not provide any value.

Right, IDXD would have been perfectly happy to use the normal MSI-X
table from what I can see.

> For the case where the IMS store is seperate, you still have one central
> irqdomain per physical device. The domain allocation function can then
> create storage on demand or reuse existing storage and just fill in the
> pointers.

I think it is philosophically backwards, and it is in part what is
motivating the pretense that this weird auxdomain and PASID stuff is generic.

The VFIO model is the IRQ table is associated with a VM. When the
vfio_device is created it decides how big the MSI-X table will be and
it needs to allocate a block of interrupts to emulate it. For security
those interrupts need to be linked in the HW to the vfio_device and
the VM. ie VM A cannot trigger an interrupt that would deliver to VM
B.

IDXD chose to use the PASID, but other HW might use a generic VM_ID.

Further, IDXD chose to use a VM_ID per IMS entry, but other HW is
likely to use a VM_ID per block of IMS entries. I.e. the HW starts from
a VM object, then locates the IMS table for that object, then triggers
the interrupt.

If we think about the latter sort of HW, I don't think the whole aux
data and domain per PCI function makes a lot of sense. You'd want a
domain per VM_ID, with all the IMS entries in that domain sharing the
same VM_ID. In this regard the irq domain will correspond to the
security boundary.

While IDXD is probably fine to organize its domains like this, I am
surprised to learn there is basically no reason for it to be using
IMS.

Jason
Thomas Gleixner May 31, 2021, 11:55 p.m. UTC | #9
On Mon, May 31 2021 at 13:57, Jason Gunthorpe wrote:
> On Mon, May 31, 2021 at 04:02:02PM +0200, Thomas Gleixner wrote:
>> > I'm quite surprised that every mdev doesn't create its own ims_domain
>> > in its probe function.
>> 
>> What for?
>
> IDXD wouldn't need it, but proper IMS HW with no bound of number of
> vectors can't provide a ims_info.max_slots value here.

There is no need to do so:

     https://lore.kernel.org/r/20200826112335.202234502@linutronix.de

which has the IMS_MSI_QUEUE variant, which you looked at and said:

 "I haven't looked through everything in detail, but this does look like
  it is good for the mlx5 devices."

ims_info.max_slots is a property of the IMS_MSI_ARRAY and does not
place any restrictions on other storage.
  
> Instead each use use site, like VFIO, would want to specify the number
> of vectors to allocate for its own usage, then parcel them out one by
> one in the normal way. Basically VFIO is emulating a normal MSI-X
> table.

Just with a size which exceeds a normal MSI-X table, but that's an
implementation detail of the underlying physical device. It does not put
any restrictions on mdev at all.

>> > This places a global total limit on the # of vectors which makes me
>> > ask what was the point of using IMS in the first place ?
>>
>> That depends on how IMS is implemented. The IDXD variant has a fixed
>> sized message store which is shared between all subdevices, so yet
>> another domain would not provide any value.
>
> Right, IDXD would have been perfectly happy to use the normal MSI-X
> table from what I can see.

Again. No, it's a storage size problem and regular MSI-X does not
support auxiliary data.

>> For the case where the IMS store is seperate, you still have one central
>> irqdomain per physical device. The domain allocation function can then
>> create storage on demand or reuse existing storage and just fill in the
>> pointers.
>
> I think it is philosophically backwards, and it is in part what is
> motivating pretending this weird auxdomain and PASID stuff is generic.

That's a different story and, as I explained to Dave already, hacking all
this into mdev is backwards. But that does not make your idea of an
irqdomain per mdev any more reasonable.

The mdev does not do anything irq chip/domain related. It uses what the
underlying physical device provides. If you think otherwise then please
provide me the hierarchical model which I explained here:

 https://lore.kernel.org/r/87h7tcgbs2.fsf@nanos.tec.linutronix.de
 https://lore.kernel.org/r/87bljg7u4f.fsf@nanos.tec.linutronix.de

> The VFIO model is the IRQ table is associated with a VM. When the
> vfio_device is created it decides how big the MSI-X table will be and
> it needs to allocate a block of interrupts to emulate it. For security
> those interrupts need to be linked in the HW to the vfio_device and
> the VM. ie VM A cannot trigger an interrupt that would deliver to VM
> B.

Fine.

> IDXD choose to use the PASID, but other HW might use a generic VM_ID.

So what?

> Further, IDXD choose to use a VM_ID per IMS entry, but other HW is
> likely to use a VM_ID per block of IMS entries. Ie the HW tree starts
> a VM object, then locates the IMS table for that object, then triggers
> the interrupt.

If you read my other reply to Dave carefully then you might have noticed
that this is crap and irrelevant because the ID (whatever it is) is per
device and that ID has to be stored in the device. Whether the actual
irq chip/domain driver implementation uses it per associated irq or not
does not matter at all.

> If we think about the later sort of HW I don't think the whole aux
> data and domain per pci function makes alot of sense.

First of all, that is already debunked and will go nowhere, and second,
there is no requirement to implement this for some other incarnation of
IMS when done correctly. That whole irq_set_auxdata() stuff is not going
anywhere simply because it's not needed at all.

All that's needed is a function to store some sort of ID per device
(mdev); the underlying IMS driver takes care of what to do with it.

That has to happen before the interrupts are allocated and if that info
is invalid then the allocation function can reject it.

> You'd want a domain per VM_ID and all the IMS entires in that domain
> share the same VM_ID. In this regard the irq domain will correspond to
> the security boundary.

The real problems are:

  - Intel misled me with the requirement to set the PASID after the fact,
    which is simply wrong and is what caused me to come up with that
    irq_set_auxdata() workaround.

  - Their absolute ignorance for proper layering led to adding all that
    irq_set_auxdata() muck to this mdev library.

Ergo, the proper thing to do is to fix this ID storage problem (PASID,
VM_ID or whatever) at the proper place, i.e. store it in the struct device
(which is associated with that mdev) and let the individual drivers handle
it as they require.

It's that simple, and this needs to be fixed, not some made-up
philosophical question about irqdomains per mdev. Those would be even
worse than what Intel did here.

Thanks,

        tglx
Jason Gunthorpe June 1, 2021, 12:16 p.m. UTC | #10
On Tue, Jun 01, 2021 at 01:55:22AM +0200, Thomas Gleixner wrote:
> On Mon, May 31 2021 at 13:57, Jason Gunthorpe wrote:
> > On Mon, May 31, 2021 at 04:02:02PM +0200, Thomas Gleixner wrote:
> >> > I'm quite surprised that every mdev doesn't create its own ims_domain
> >> > in its probe function.
> >> 
> >> What for?
> >
> > IDXD wouldn't need it, but proper IMS HW with no bound of number of
> > vectors can't provide a ims_info.max_slots value here.
> 
> There is no need to do so:
> 
>      https://lore.kernel.org/r/20200826112335.202234502@linutronix.de
> 
> which has the IMS_MSI_QUEUE variant at which you looked at and said:
> 
>  "I haven't looked through everything in detail, but this does look like
>   it is good for the mlx5 devices."
> 
> ims_info.max_slots is a property of the IMS_MSI_ARRAY and does not make
> any restrictions on other storage.

Ok, it has been a while since then

> >> That depends on how IMS is implemented. The IDXD variant has a fixed
> >> sized message store which is shared between all subdevices, so yet
> >> another domain would not provide any value.
> >
> > Right, IDXD would have been perfectly happy to use the normal MSI-X
> > table from what I can see.
> 
> Again. No, it's a storage size problem and regular MSI-X does not
> support auxiliary data.

I mean the IDXD HW could have been designed with a normal format MSI-X
table and a side table with the PASID.

> Ergo, the proper thing to do is to fix this ID storage problem (PASID,
> VM_ID or whatever) at the proper place, i.e. store it in struct device
> (which is associated to that mdev) and let the individual drivers handle
> it as they require.

If the struct device defines all the details of how to place the IRQ
into the HW, including what HW table to use, then it seems like it
could work.

I don't clearly remember all the details anymore so lets look at how
non-IDXD devices might work when HW actually comes.

Jason

Patch

diff --git a/drivers/dma/idxd/idxd.h b/drivers/dma/idxd/idxd.h
index 0d9e2710fc76..81c78add74dd 100644
--- a/drivers/dma/idxd/idxd.h
+++ b/drivers/dma/idxd/idxd.h
@@ -297,6 +297,7 @@  struct idxd_device {
 	struct workqueue_struct *wq;
 	struct work_struct work;
 
+	struct irq_domain *ims_domain;
 	int *int_handles;
 
 	struct idxd_pmu *idxd_pmu;
diff --git a/drivers/dma/idxd/init.c b/drivers/dma/idxd/init.c
index 809ca1827772..ead46761b23e 100644
--- a/drivers/dma/idxd/init.c
+++ b/drivers/dma/idxd/init.c
@@ -16,6 +16,8 @@ 
 #include <linux/idr.h>
 #include <linux/intel-svm.h>
 #include <linux/iommu.h>
+#include <linux/irqdomain.h>
+#include <linux/irqchip/irq-ims-msi.h>
 #include <uapi/linux/idxd.h>
 #include <linux/dmaengine.h>
 #include "../dmaengine.h"
@@ -66,6 +68,7 @@  MODULE_DEVICE_TABLE(pci, idxd_pci_tbl);
 int idxd_mdev_host_init(struct idxd_device *idxd, struct mdev_driver *drv)
 {
 	struct device *dev = &idxd->pdev->dev;
+	struct ims_array_info ims_info;
 	int rc;
 
 	if (!idxd->ims_size)
@@ -77,8 +80,18 @@  int idxd_mdev_host_init(struct idxd_device *idxd, struct mdev_driver *drv)
 		return rc;
 	}
 
+	ims_info.max_slots = idxd->ims_size;
+	ims_info.slots = idxd->reg_base + idxd->ims_offset;
+	idxd->ims_domain = pci_ims_array_create_msi_irq_domain(idxd->pdev, &ims_info);
+	if (!idxd->ims_domain) {
+		dev_warn(dev, "Fail to acquire IMS domain\n");
+		iommu_dev_disable_feature(dev, IOMMU_DEV_FEAT_AUX);
+		return -ENODEV;
+	}
+
 	rc = mdev_register_device(dev, drv);
 	if (rc < 0) {
+		irq_domain_remove(idxd->ims_domain);
 		iommu_dev_disable_feature(dev, IOMMU_DEV_FEAT_AUX);
 		return rc;
 	}
@@ -93,6 +106,7 @@  void idxd_mdev_host_release(struct kref *kref)
 	struct idxd_device *idxd = container_of(kref, struct idxd_device, mdev_kref);
 	struct device *dev = &idxd->pdev->dev;
 
+	irq_domain_remove(idxd->ims_domain);
 	mdev_unregister_device(dev);
 	iommu_dev_disable_feature(dev, IOMMU_DEV_FEAT_AUX);
 }
diff --git a/drivers/vfio/mdev/idxd/mdev.c b/drivers/vfio/mdev/idxd/mdev.c
index 9f6c4997ec24..7dac024e2852 100644
--- a/drivers/vfio/mdev/idxd/mdev.c
+++ b/drivers/vfio/mdev/idxd/mdev.c
@@ -111,6 +111,7 @@  static struct vdcm_idxd *vdcm_vidxd_create(struct idxd_device *idxd, struct mdev
 					   struct vdcm_idxd_type *type)
 {
 	struct vdcm_idxd *vidxd;
+	struct device *dev = &mdev->dev;
 	struct idxd_wq *wq = NULL;
 	int rc;
 
@@ -129,6 +130,7 @@  static struct vdcm_idxd *vdcm_vidxd_create(struct idxd_device *idxd, struct mdev
 	vidxd->mdev = mdev;
 	vidxd->type = type;
 	vidxd->num_wqs = VIDXD_MAX_WQS;
+	dev_set_msi_domain(dev, idxd->ims_domain);
 
 	mutex_lock(&wq->wq_lock);
 	idxd_wq_get(wq);