
[v2,2/3] iommu/dma: Support MSIs through nested domains

Message ID b1b8ff9c716f22f524be0313ad12e5c6d10f5bd4.1722644866.git.nicolinc@nvidia.com (mailing list archive)
State New
Series iommufd: Add selftest coverage for reserved IOVAs

Commit Message

Nicolin Chen Aug. 3, 2024, 12:32 a.m. UTC
From: Robin Murphy <robin.murphy@arm.com>

Currently, iommu-dma is the only place outside of IOMMUFD and drivers
which might need to be aware of the stage 2 domain encapsulated within
a nested domain. This would be in the legacy-VFIO-style case where we're
using host-managed MSIs with an identity mapping at stage 1, where it is
the underlying stage 2 domain which owns an MSI cookie and holds the
corresponding dynamic mappings. Hook up the new op to resolve what we
need from a nested domain.

Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 drivers/iommu/dma-iommu.c | 18 ++++++++++++++++--
 include/linux/iommu.h     |  4 ++++
 2 files changed, 20 insertions(+), 2 deletions(-)

Comments

Tian, Kevin Aug. 6, 2024, 8:25 a.m. UTC | #1
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Saturday, August 3, 2024 8:32 AM
> 
> From: Robin Murphy <robin.murphy@arm.com>
> 
> Currently, iommu-dma is the only place outside of IOMMUFD and drivers
> which might need to be aware of the stage 2 domain encapsulated within
> a nested domain. This would be in the legacy-VFIO-style case where we're

why is it a legacy-VFIO-style? We only support nested in IOMMUFD.

> using host-managed MSIs with an identity mapping at stage 1, where it is
> the underlying stage 2 domain which owns an MSI cookie and holds the
> corresponding dynamic mappings. Hook up the new op to resolve what we
> need from a nested domain.
> 
> Signed-off-by: Robin Murphy <robin.murphy@arm.com>
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> ---
>  drivers/iommu/dma-iommu.c | 18 ++++++++++++++++--
>  include/linux/iommu.h     |  4 ++++
>  2 files changed, 20 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
> index 7b1dfa0665df6..05e04934a5f81 100644
> --- a/drivers/iommu/dma-iommu.c
> +++ b/drivers/iommu/dma-iommu.c
> @@ -1799,6 +1799,20 @@ static struct iommu_dma_msi_page
> *iommu_dma_get_msi_page(struct device *dev,
>  	return NULL;
>  }
> 
> +/*
> + * Nested domains may not have an MSI cookie or accept mappings, but
> they may
> + * be related to a domain which does, so we let them tell us what they need.
> + */
> +static struct iommu_domain
> *iommu_dma_get_msi_mapping_domain(struct device *dev)
> +{
> +	struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
> +
> +	if (domain && domain->type == IOMMU_DOMAIN_NESTED &&
> +	    domain->ops->get_msi_mapping_domain)

I'm not sure the core should restrict it to the NESTED type. Given
there is a new domain ops any type restriction can be handled
inside the callback. Anyway the driver should register the op
for a domain only when there is a need. 

> +		domain = domain->ops->get_msi_mapping_domain(domain);
> +	return domain;
> +}
> +
>  /**
>   * iommu_dma_prepare_msi() - Map the MSI page in the IOMMU domain
>   * @desc: MSI descriptor, will store the MSI page
> @@ -1809,7 +1823,7 @@ static struct iommu_dma_msi_page
> *iommu_dma_get_msi_page(struct device *dev,
>  int iommu_dma_prepare_msi(struct msi_desc *desc, phys_addr_t msi_addr)
>  {
>  	struct device *dev = msi_desc_to_dev(desc);
> -	struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
> +	struct iommu_domain *domain =
> iommu_dma_get_msi_mapping_domain(dev);
>  	struct iommu_dma_msi_page *msi_page;
>  	static DEFINE_MUTEX(msi_prepare_lock); /* see below */
> 
> @@ -1842,7 +1856,7 @@ int iommu_dma_prepare_msi(struct msi_desc
> *desc, phys_addr_t msi_addr)
>  void iommu_dma_compose_msi_msg(struct msi_desc *desc, struct msi_msg
> *msg)
>  {
>  	struct device *dev = msi_desc_to_dev(desc);
> -	const struct iommu_domain *domain =
> iommu_get_domain_for_dev(dev);
> +	const struct iommu_domain *domain =
> iommu_dma_get_msi_mapping_domain(dev);
>  	const struct iommu_dma_msi_page *msi_page;
> 
>  	msi_page = msi_desc_get_iommu_cookie(desc);
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index 4d47f2c333118..69ed76f9c3ec4 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -638,6 +638,8 @@ struct iommu_ops {
>   * @enable_nesting: Enable nesting
>   * @set_pgtable_quirks: Set io page table quirks (IO_PGTABLE_QUIRK_*)
>   * @free: Release the domain after use.
> + * @get_msi_mapping_domain: Return the related iommu_domain that
> should hold the
> + *                          MSI cookie and accept mapping(s).
>   */
>  struct iommu_domain_ops {
>  	int (*attach_dev)(struct iommu_domain *domain, struct device *dev);
> @@ -668,6 +670,8 @@ struct iommu_domain_ops {
>  				  unsigned long quirks);
> 
>  	void (*free)(struct iommu_domain *domain);
> +	struct iommu_domain *
> +		(*get_msi_mapping_domain)(struct iommu_domain
> *domain);
>  };
> 
>  /**
> --
> 2.43.0
>
Nicolin Chen Aug. 6, 2024, 5:24 p.m. UTC | #2
On Tue, Aug 06, 2024 at 08:25:33AM +0000, Tian, Kevin wrote:
> > From: Nicolin Chen <nicolinc@nvidia.com>
> > Sent: Saturday, August 3, 2024 8:32 AM
> >
> > From: Robin Murphy <robin.murphy@arm.com>
> >
> > Currently, iommu-dma is the only place outside of IOMMUFD and drivers
> > which might need to be aware of the stage 2 domain encapsulated within
> > a nested domain. This would be in the legacy-VFIO-style case where we're
> 
> why is it a legacy-VFIO-style? We only support nested in IOMMUFD.

I think it's describing the RMR solution that was decided on in
Eric's VFIO series, before we had IOMMUFD at all.

So long as Robin doesn't mind (hopefully), I can rephrase it:

Currently, iommu-dma is the only place outside of IOMMUFD and drivers
which might need to be aware of the stage 2 domain encapsulated within
a nested domain. This would still be the RMR solution where we're using
host-managed MSIs with an identity mapping at stage 1, where it is
the underlying stage 2 domain which owns an MSI cookie and holds the
corresponding dynamic mappings. Hook up the new op to resolve what we
need from a nested domain.

> > Signed-off-by: Robin Murphy <robin.murphy@arm.com>
> > Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> > ---
> >  drivers/iommu/dma-iommu.c | 18 ++++++++++++++++--
> >  include/linux/iommu.h     |  4 ++++
> >  2 files changed, 20 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
> > index 7b1dfa0665df6..05e04934a5f81 100644
> > --- a/drivers/iommu/dma-iommu.c
> > +++ b/drivers/iommu/dma-iommu.c
> > @@ -1799,6 +1799,20 @@ static struct iommu_dma_msi_page
> > *iommu_dma_get_msi_page(struct device *dev,
> >       return NULL;
> >  }
> >
> > +/*
> > + * Nested domains may not have an MSI cookie or accept mappings, but
> > they may
> > + * be related to a domain which does, so we let them tell us what they need.
> > + */
> > +static struct iommu_domain
> > *iommu_dma_get_msi_mapping_domain(struct device *dev)
> > +{
> > +     struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
> > +
> > +     if (domain && domain->type == IOMMU_DOMAIN_NESTED &&
> > +         domain->ops->get_msi_mapping_domain)
> 
> I'm not sure the core should restrict it to the NESTED type. Given
> there is a new domain ops any type restriction can be handled
> inside the callback. Anyway the driver should register the op
> for a domain only when there is a need.

I think we can do it either way, given that the use case is very
particular to IOMMU_DOMAIN_NESTED. Otherwise, the driver doesn't
need to be aware of the MSI mapping domain at all; that should
just be taken care of by dma-iommu. If the domain pointer had
a generic parent iommu pointer, the get_msi_mapping_domain op
could have been omitted too.

That being said, yea, likely we should check !!domain->ops at
least.
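
Just to make that concrete, a minimal sketch of the helper with the
extra guard added (not a formal respin, only illustrating the check):

static struct iommu_domain *iommu_dma_get_msi_mapping_domain(struct device *dev)
{
	struct iommu_domain *domain = iommu_get_domain_for_dev(dev);

	if (domain && domain->type == IOMMU_DOMAIN_NESTED &&
	    domain->ops && domain->ops->get_msi_mapping_domain)
		domain = domain->ops->get_msi_mapping_domain(domain);
	return domain;
}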

Thanks
Nicolin
Robin Murphy Aug. 8, 2024, 12:38 p.m. UTC | #3
On 06/08/2024 9:25 am, Tian, Kevin wrote:
>> From: Nicolin Chen <nicolinc@nvidia.com>
>> Sent: Saturday, August 3, 2024 8:32 AM
>>
>> From: Robin Murphy <robin.murphy@arm.com>
>>
>> Currently, iommu-dma is the only place outside of IOMMUFD and drivers
>> which might need to be aware of the stage 2 domain encapsulated within
>> a nested domain. This would be in the legacy-VFIO-style case where we're
> 
> why is it a legacy-VFIO-style? We only support nested in IOMMUFD.

Because with proper nesting we ideally shouldn't need the host-managed 
MSI mess at all, which all stems from the old VFIO paradigm of 
completely abstracting interrupts from userspace. I'm still hoping 
IOMMUFD can grow its own interface for efficient MSI passthrough, where 
the VMM can simply map the physical MSI doorbell into whatever IPA (GPA) 
it wants it to appear at in the S2 domain, then whatever the guest does 
with S1 it can program the MSI address into the endpoint accordingly 
without us having to fiddle with it.

FWIW, apart from IOMMUFD, we may also end up wanting something along 
those lines for Arm CCA (if non-Secure proxying of T=1 MSIs becomes an 
unavoidable thing).

>> using host-managed MSIs with an identity mapping at stage 1, where it is
>> the underlying stage 2 domain which owns an MSI cookie and holds the
>> corresponding dynamic mappings. Hook up the new op to resolve what we
>> need from a nested domain.
>>
>> Signed-off-by: Robin Murphy <robin.murphy@arm.com>
>> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
>> ---
>>   drivers/iommu/dma-iommu.c | 18 ++++++++++++++++--
>>   include/linux/iommu.h     |  4 ++++
>>   2 files changed, 20 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
>> index 7b1dfa0665df6..05e04934a5f81 100644
>> --- a/drivers/iommu/dma-iommu.c
>> +++ b/drivers/iommu/dma-iommu.c
>> @@ -1799,6 +1799,20 @@ static struct iommu_dma_msi_page
>> *iommu_dma_get_msi_page(struct device *dev,
>>   	return NULL;
>>   }
>>
>> +/*
>> + * Nested domains may not have an MSI cookie or accept mappings, but
>> they may
>> + * be related to a domain which does, so we let them tell us what they need.
>> + */
>> +static struct iommu_domain
>> *iommu_dma_get_msi_mapping_domain(struct device *dev)
>> +{
>> +	struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
>> +
>> +	if (domain && domain->type == IOMMU_DOMAIN_NESTED &&
>> +	    domain->ops->get_msi_mapping_domain)
> 
> I'm not sure the core should restrict it to the NESTED type. Given
> there is a new domain ops any type restriction can be handled
> inside the callback. Anyway the driver should register the op
> for a domain only when there is a need.

Nested should remain the only case where domains are chained together 
such that the VFIO MSI cookie may be hiding somewhere else. This is 
effectively documenting that implementing the callback for any other 
domain type would be a bad smell. Much like how the mapping-related ops 
are explicitly short-circuited for non-paging domain types.
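
For reference, that's the kind of guard I mean - paraphrased from
memory rather than quoting the exact upstream code, so treat it as a
sketch:

	/* roughly what the map path checks before calling into the driver */
	if (unlikely(!(domain->type & __IOMMU_DOMAIN_PAGING)))
		return -EINVAL;

i.e. non-paging domain types never even reach the driver's map_pages().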

Thanks,
Robin.

>> +		domain = domain->ops->get_msi_mapping_domain(domain);
>> +	return domain;
>> +}
>> +
>>   /**
>>    * iommu_dma_prepare_msi() - Map the MSI page in the IOMMU domain
>>    * @desc: MSI descriptor, will store the MSI page
>> @@ -1809,7 +1823,7 @@ static struct iommu_dma_msi_page
>> *iommu_dma_get_msi_page(struct device *dev,
>>   int iommu_dma_prepare_msi(struct msi_desc *desc, phys_addr_t msi_addr)
>>   {
>>   	struct device *dev = msi_desc_to_dev(desc);
>> -	struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
>> +	struct iommu_domain *domain =
>> iommu_dma_get_msi_mapping_domain(dev);
>>   	struct iommu_dma_msi_page *msi_page;
>>   	static DEFINE_MUTEX(msi_prepare_lock); /* see below */
>>
>> @@ -1842,7 +1856,7 @@ int iommu_dma_prepare_msi(struct msi_desc
>> *desc, phys_addr_t msi_addr)
>>   void iommu_dma_compose_msi_msg(struct msi_desc *desc, struct msi_msg
>> *msg)
>>   {
>>   	struct device *dev = msi_desc_to_dev(desc);
>> -	const struct iommu_domain *domain =
>> iommu_get_domain_for_dev(dev);
>> +	const struct iommu_domain *domain =
>> iommu_dma_get_msi_mapping_domain(dev);
>>   	const struct iommu_dma_msi_page *msi_page;
>>
>>   	msi_page = msi_desc_get_iommu_cookie(desc);
>> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
>> index 4d47f2c333118..69ed76f9c3ec4 100644
>> --- a/include/linux/iommu.h
>> +++ b/include/linux/iommu.h
>> @@ -638,6 +638,8 @@ struct iommu_ops {
>>    * @enable_nesting: Enable nesting
>>    * @set_pgtable_quirks: Set io page table quirks (IO_PGTABLE_QUIRK_*)
>>    * @free: Release the domain after use.
>> + * @get_msi_mapping_domain: Return the related iommu_domain that
>> should hold the
>> + *                          MSI cookie and accept mapping(s).
>>    */
>>   struct iommu_domain_ops {
>>   	int (*attach_dev)(struct iommu_domain *domain, struct device *dev);
>> @@ -668,6 +670,8 @@ struct iommu_domain_ops {
>>   				  unsigned long quirks);
>>
>>   	void (*free)(struct iommu_domain *domain);
>> +	struct iommu_domain *
>> +		(*get_msi_mapping_domain)(struct iommu_domain
>> *domain);
>>   };
>>
>>   /**
>> --
>> 2.43.0
>>
>
Nicolin Chen Aug. 8, 2024, 10:59 p.m. UTC | #4
On Thu, Aug 08, 2024 at 01:38:44PM +0100, Robin Murphy wrote:
> On 06/08/2024 9:25 am, Tian, Kevin wrote:
> > > From: Nicolin Chen <nicolinc@nvidia.com>
> > > Sent: Saturday, August 3, 2024 8:32 AM
> > > 
> > > From: Robin Murphy <robin.murphy@arm.com>
> > > 
> > > Currently, iommu-dma is the only place outside of IOMMUFD and drivers
> > > which might need to be aware of the stage 2 domain encapsulated within
> > > a nested domain. This would be in the legacy-VFIO-style case where we're
> > 
> > why is it a legacy-VFIO-style? We only support nested in IOMMUFD.
> 
> Because with proper nesting we ideally shouldn't need the host-managed
> MSI mess at all, which all stems from the old VFIO paradigm of
> completely abstracting interrupts from userspace. I'm still hoping
> IOMMUFD can grow its own interface for efficient MSI passthrough, where
> the VMM can simply map the physical MSI doorbell into whatever IPA (GPA)
> it wants it to appear at in the S2 domain, then whatever the guest does
> with S1 it can program the MSI address into the endpoint accordingly
> without us having to fiddle with it.

Hmm, until now I wasn't so convinced myself that it could work, as
I was worried about the data. But on second thought, since the
host configures the MSI, it can still set the correct data. All we
need is to change the MSI address from an RMR'd IPA/gIOVA to a
real gIOVA of the vITS page.

I did a quick hack to test that loop. MSI in the guest still works
fine without having the RMR node in its IORT. Sweet!
To go further on this path, we will need the following changes:
- MSI configuration in the host (via a VFIO_IRQ_SET_ACTION_TRIGGER
  hypercall) should set the gIOVA instead of fetching from the
  msi_cookie. That hypercall doesn't forward an address currently,
  since the host kernel pre-sets the msi_cookie. So, we need a way
  to forward the gIOVA to the kernel and pack it into the msi_msg
  structure. I haven't read the VFIO PCI code thoroughly, yet I
  wonder if we could just let the guest program the gIOVA into the
  PCI register and let it fall through to the hardware, so the host
  kernel handling that hypercall can just read it back from the
  register?
- IOMMUFD should provide the VMM a way to tell it the gPA (or
  directly + GITS_TRANSLATER?). Then the kernel should do the
  stage-2 mapping. I talked to Jason about this a while ago, and
  we have a few thoughts on how to implement it. But eventually, I
  think we still can't avoid a middleman like the msi_cookie to
  associate the gPA in IOMMUFD with the PA in the irqchip?

One more concern is the MSI window size. The VMM sets up an MSI
region that must fit the hardware window size. Most ITS versions
have only one page, but one of them can have multiple pages? What
if the vITS is one page while the underlying pITS has multiple?

My understanding is that the current kernel-defined 1MB size is
also a hard-coded window intended to potentially fit all cases,
since the IOMMU code in the kernel can just eyeball what's going
on in the irqchip subsystem and adjust accordingly if someday it
needs to. But the VMM can't?

Thanks
Nicolin
Tian, Kevin Aug. 9, 2024, 7:34 a.m. UTC | #5
> From: Robin Murphy <robin.murphy@arm.com>
> Sent: Thursday, August 8, 2024 8:39 PM
> 
> On 06/08/2024 9:25 am, Tian, Kevin wrote:
> >> From: Nicolin Chen <nicolinc@nvidia.com>
> >> Sent: Saturday, August 3, 2024 8:32 AM
> >>
> >> From: Robin Murphy <robin.murphy@arm.com>
> >>
> >> Currently, iommu-dma is the only place outside of IOMMUFD and drivers
> >> which might need to be aware of the stage 2 domain encapsulated within
> >> a nested domain. This would be in the legacy-VFIO-style case where we're
> >
> > why is it a legacy-VFIO-style? We only support nested in IOMMUFD.
> 
> Because with proper nesting we ideally shouldn't need the host-managed
> MSI mess at all, which all stems from the old VFIO paradigm of
> completely abstracting interrupts from userspace. I'm still hoping

yes, this was the consensus from the previous discussion iirc. It's just
a bit distracting to call it the old VFIO paradigm.
Tian, Kevin Aug. 9, 2024, 8 a.m. UTC | #6
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Friday, August 9, 2024 7:00 AM
> 
> On Thu, Aug 08, 2024 at 01:38:44PM +0100, Robin Murphy wrote:
> > On 06/08/2024 9:25 am, Tian, Kevin wrote:
> > > > From: Nicolin Chen <nicolinc@nvidia.com>
> > > > Sent: Saturday, August 3, 2024 8:32 AM
> > > >
> > > > From: Robin Murphy <robin.murphy@arm.com>
> > > >
> > > > Currently, iommu-dma is the only place outside of IOMMUFD and
> drivers
> > > > which might need to be aware of the stage 2 domain encapsulated
> within
> > > > a nested domain. This would be in the legacy-VFIO-style case where
> we're
> > >
> > > why is it a legacy-VFIO-style? We only support nested in IOMMUFD.
> >
> > Because with proper nesting we ideally shouldn't need the host-managed
> > MSI mess at all, which all stems from the old VFIO paradigm of
> > completely abstracting interrupts from userspace. I'm still hoping
> > IOMMUFD can grow its own interface for efficient MSI passthrough, where
> > the VMM can simply map the physical MSI doorbell into whatever IPA (GPA)
> > it wants it to appear at in the S2 domain, then whatever the guest does
> > with S1 it can program the MSI address into the endpoint accordingly
> > without us having to fiddle with it.
> 
> Hmm, until now I wasn't so convinced myself that it could work as I
> was worried about the data. But having a second thought, since the
> host configures the MSI, it can still set the correct data. What we
> only need is to change the MSI address from a RMRed IPA/gIOVA to a
> real gIOVA of the vITS page.
> 
> I did a quick hack to test that loop. MSI in the guest still works
> fine without having the RMR node in its IORT. Sweet!
> 
> To go further on this path, we will need the following changes:
> - MSI configuration in the host (via a VFIO_IRQ_SET_ACTION_TRIGGER
>   hypercall) should set gIOVA instead of fetching from msi_cookie.
>   That hypercall doesn't forward an address currently, since host
>   kernel pre-sets the msi_cookie. So, we need a way to forward the
>   gIOVA to kernel and pack it into the msi_msg structure. I haven't
>   read the VFIO PCI code thoroughly, yet wonder if we could just
>   let the guest program the gIOVA to the PCI register and fall it
>   through to the hardware, so host kernel handling that hypercall
>   can just read it back from the register?
> - IOMMUFD should provide VMM a way to tell the gPA (or directly +
>   GITS_TRANSLATER?). Then kernel should do the stage-2 mapping. I
>   have talked to Jason about this a while ago, and we have a few
>   thoughts how to implement it. But eventually, I think we still
>   can't avoid a middle man like msi_cookie to associate the gPA in
>   IOMMUFD to PA in irqchip?

Probably a new IOMMU_DMA_MSI_COOKIE_USER type which uses the
GPA (passed in via ALLOC_HWPT for a nested_parent type) as the IOVA
in iommu_dma_get_msi_page()?

> 
> One more concern is the MSI window size. VMM sets up a MSI region
> that must fit the hardware window size. Most of ITS versions have
> only one page size but one of them can have multiple pages? What
> if vITS is one-page size while the underlying pITS has multiple?
> 
> My understanding of the current kernel-defined 1MB size is also a
> hard-coding window to potential fit all cases, since IOMMU code in
> the code can just eyeball what's going on in the irqchip subsystem
> and adjust accordingly if someday it needs to. But VMM can't?
> 
> Thanks
> Nicolin
Robin Murphy Aug. 9, 2024, 5:43 p.m. UTC | #7
On 2024-08-09 9:00 am, Tian, Kevin wrote:
>> From: Nicolin Chen <nicolinc@nvidia.com>
>> Sent: Friday, August 9, 2024 7:00 AM
>>
>> On Thu, Aug 08, 2024 at 01:38:44PM +0100, Robin Murphy wrote:
>>> On 06/08/2024 9:25 am, Tian, Kevin wrote:
>>>>> From: Nicolin Chen <nicolinc@nvidia.com>
>>>>> Sent: Saturday, August 3, 2024 8:32 AM
>>>>>
>>>>> From: Robin Murphy <robin.murphy@arm.com>
>>>>>
>>>>> Currently, iommu-dma is the only place outside of IOMMUFD and
>> drivers
>>>>> which might need to be aware of the stage 2 domain encapsulated
>> within
>>>>> a nested domain. This would be in the legacy-VFIO-style case where
>> we're
>>>>
>>>> why is it a legacy-VFIO-style? We only support nested in IOMMUFD.
>>>
>>> Because with proper nesting we ideally shouldn't need the host-managed
>>> MSI mess at all, which all stems from the old VFIO paradigm of
>>> completely abstracting interrupts from userspace. I'm still hoping
>>> IOMMUFD can grow its own interface for efficient MSI passthrough, where
>>> the VMM can simply map the physical MSI doorbell into whatever IPA (GPA)
>>> it wants it to appear at in the S2 domain, then whatever the guest does
>>> with S1 it can program the MSI address into the endpoint accordingly
>>> without us having to fiddle with it.
>>
>> Hmm, until now I wasn't so convinced myself that it could work as I
>> was worried about the data. But having a second thought, since the
>> host configures the MSI, it can still set the correct data. What we
>> only need is to change the MSI address from a RMRed IPA/gIOVA to a
>> real gIOVA of the vITS page.
>>
>> I did a quick hack to test that loop. MSI in the guest still works
>> fine without having the RMR node in its IORT. Sweet!
>>
>> To go further on this path, we will need the following changes:
>> - MSI configuration in the host (via a VFIO_IRQ_SET_ACTION_TRIGGER
>>    hypercall) should set gIOVA instead of fetching from msi_cookie.
>>    That hypercall doesn't forward an address currently, since host
>>    kernel pre-sets the msi_cookie. So, we need a way to forward the
>>    gIOVA to kernel and pack it into the msi_msg structure. I haven't
>>    read the VFIO PCI code thoroughly, yet wonder if we could just
>>    let the guest program the gIOVA to the PCI register and fall it
>>    through to the hardware, so host kernel handling that hypercall
>>    can just read it back from the register?
>> - IOMMUFD should provide VMM a way to tell the gPA (or directly +
>>    GITS_TRANSLATER?). Then kernel should do the stage-2 mapping. I
>>    have talked to Jason about this a while ago, and we have a few
>>    thoughts how to implement it. But eventually, I think we still
>>    can't avoid a middle man like msi_cookie to associate the gPA in
>>    IOMMUFD to PA in irqchip?
> 
> Probably a new IOMMU_DMA_MSI_COOKIE_USER type which uses
> GPA (passed in in ALLOC_HWPT for a nested_parent type) as IOVA
> in iommu_dma_get_msi_page()?

No, the whole point is to get away from cookies and having to keep track 
of things in the kernel that can and should just be simple regular 
user-owned S2 mappings.

>> One more concern is the MSI window size. VMM sets up a MSI region
>> that must fit the hardware window size. Most of ITS versions have
>> only one page size but one of them can have multiple pages? What
>> if vITS is one-page size while the underlying pITS has multiple?
>>
>> My understanding of the current kernel-defined 1MB size is also a
>> hard-coding window to potential fit all cases, since IOMMU code in
>> the code can just eyeball what's going on in the irqchip subsystem
>> and adjust accordingly if someday it needs to. But VMM can't?

The existing design is based around the kernel potentially having to 
stuff multiple different mappings for different devices into the MSI 
hole in a single domain, since VFIO userspace is allowed to do wacky 
things like emulate INTx using an underlying physical MSI, so there may 
not be any actual vITS region in the VM IPA space at all. I think that 
was also why it ended up being a fake reserved region exposed by the 
SMMU drivers rather than relying on userspace to say where to put it - 
making things look superficially a bit more x86-like meant fewer changes 
to userspace, which I think by now we can consider a tasty slice of 
technical debt.

For a dedicated "MSI passthrough" model where, in parallel to IOMMU 
nesting, the abstraction is thinner and userspace is in on the game of 
knowingly emulating a GIC ITS backed by a GIC ITS, I'd imagine it could 
be pretty straightforward, at least conceptually. Userspace has 
something like an IOAS_MAP_MSI(device, IOVA) to indicate where it's 
placing a vITS to which it wants that device's MSIs to be able to go, 
the kernel resolves the PA from the IRQ layer and maps it, job done. If 
userspace wants to associate two devices with the same vITS when they 
have different physical ITSes, either we split the IOAS into two HWPTs 
to hold the different mappings, or we punt it back to userspace to 
resolve at the IOAS level.

Or I guess the really cheeky version is the IRQ layer exposes its own 
thing for userspace to mmap the ITS, then it can call a literal IOAS_MAP 
on that mapping... :D

Thanks,
Robin.
Jason Gunthorpe Aug. 9, 2024, 6:41 p.m. UTC | #8
On Thu, Aug 08, 2024 at 01:38:44PM +0100, Robin Murphy wrote:
> On 06/08/2024 9:25 am, Tian, Kevin wrote:
> > > From: Nicolin Chen <nicolinc@nvidia.com>
> > > Sent: Saturday, August 3, 2024 8:32 AM
> > > 
> > > From: Robin Murphy <robin.murphy@arm.com>
> > > 
> > > Currently, iommu-dma is the only place outside of IOMMUFD and drivers
> > > which might need to be aware of the stage 2 domain encapsulated within
> > > a nested domain. This would be in the legacy-VFIO-style case where we're
> > 
> > why is it a legacy-VFIO-style? We only support nested in IOMMUFD.
> 
> Because with proper nesting we ideally shouldn't need the host-managed MSI
> mess at all, which all stems from the old VFIO paradigm of completely
> abstracting interrupts from userspace. I'm still hoping IOMMUFD can grow its
> own interface for efficient MSI passthrough, where the VMM can simply map
> the physical MSI doorbell into whatever IPA (GPA) it wants it to appear at
> in the S2 domain, then whatever the guest does with S1 it can program the
> MSI address into the endpoint accordingly without us having to fiddle with
> it.

+1

I don't have a staged plan to do this though. Getting the ITS page
into the S2 at a user specified address should be simple enough to
manage.

The bigger issue is that we still have the hypervisor GIC driver
controlling things and it will need to know to use the guest provided
MSI address captured during the MSI trap, not its own address. I don't
have an idea how to connect those two parts yet.

Jason
Nicolin Chen Aug. 9, 2024, 7:18 p.m. UTC | #9
On Fri, Aug 09, 2024 at 03:41:36PM -0300, Jason Gunthorpe wrote:
> On Thu, Aug 08, 2024 at 01:38:44PM +0100, Robin Murphy wrote:
> > On 06/08/2024 9:25 am, Tian, Kevin wrote:
> > > > From: Nicolin Chen <nicolinc@nvidia.com>
> > > > Sent: Saturday, August 3, 2024 8:32 AM
> > > > 
> > > > From: Robin Murphy <robin.murphy@arm.com>
> > > > 
> > > > Currently, iommu-dma is the only place outside of IOMMUFD and drivers
> > > > which might need to be aware of the stage 2 domain encapsulated within
> > > > a nested domain. This would be in the legacy-VFIO-style case where we're
> > > 
> > > why is it a legacy-VFIO-style? We only support nested in IOMMUFD.
> > 
> > Because with proper nesting we ideally shouldn't need the host-managed MSI
> > mess at all, which all stems from the old VFIO paradigm of completely
> > abstracting interrupts from userspace. I'm still hoping IOMMUFD can grow its
> > own interface for efficient MSI passthrough, where the VMM can simply map
> > the physical MSI doorbell into whatever IPA (GPA) it wants it to appear at
> > in the S2 domain, then whatever the guest does with S1 it can program the
> > MSI address into the endpoint accordingly without us having to fiddle with
> > it.
> 
> +1
> 
> I don't have a staged plan to do this though. Getting the ITS page
> into the S2 at a user specified address should be simple enough to
> manage.
> 
> The bigger issue is that we still have the hypervisor GIC driver
> controlling things and it will need to know to use the guest provided
> MSI address captured during the MSI trap, not its own address. I don't
> have an idea how to connect those two parts yet.

You mean the gPA of the vITS vs. the PA of the ITS, right? I think
that's because only the VMM knows the virtual IRQ number to insert?
We don't seem to have a choice for that unless we want to poke
a hole in the vGIC design..

Even with that, it already feels like quite a big improvement,
getting rid of the entire shadow MSI mapping, including the
msi_cookie and RMR..

Thanks
Nicolin
Nicolin Chen Aug. 9, 2024, 8:09 p.m. UTC | #10
On Fri, Aug 09, 2024 at 06:43:47PM +0100, Robin Murphy wrote:
> On 2024-08-09 9:00 am, Tian, Kevin wrote:
> > > From: Nicolin Chen <nicolinc@nvidia.com>
> > > Sent: Friday, August 9, 2024 7:00 AM
> > > 
> > > On Thu, Aug 08, 2024 at 01:38:44PM +0100, Robin Murphy wrote:
> > > > On 06/08/2024 9:25 am, Tian, Kevin wrote:
> > > > > > From: Nicolin Chen <nicolinc@nvidia.com>
> > > > > > Sent: Saturday, August 3, 2024 8:32 AM
> > > > > > 
> > > > > > From: Robin Murphy <robin.murphy@arm.com>
> > > > > > 
> > > > > > Currently, iommu-dma is the only place outside of IOMMUFD and
> > > drivers
> > > > > > which might need to be aware of the stage 2 domain encapsulated
> > > within
> > > > > > a nested domain. This would be in the legacy-VFIO-style case where
> > > we're
> > > > > 
> > > > > why is it a legacy-VFIO-style? We only support nested in IOMMUFD.
> > > > 
> > > > Because with proper nesting we ideally shouldn't need the host-managed
> > > > MSI mess at all, which all stems from the old VFIO paradigm of
> > > > completely abstracting interrupts from userspace. I'm still hoping
> > > > IOMMUFD can grow its own interface for efficient MSI passthrough, where
> > > > the VMM can simply map the physical MSI doorbell into whatever IPA (GPA)
> > > > it wants it to appear at in the S2 domain, then whatever the guest does
> > > > with S1 it can program the MSI address into the endpoint accordingly
> > > > without us having to fiddle with it.
> > > 
> > > Hmm, until now I wasn't so convinced myself that it could work as I
> > > was worried about the data. But having a second thought, since the
> > > host configures the MSI, it can still set the correct data. What we
> > > only need is to change the MSI address from a RMRed IPA/gIOVA to a
> > > real gIOVA of the vITS page.
> > > 
> > > I did a quick hack to test that loop. MSI in the guest still works
> > > fine without having the RMR node in its IORT. Sweet!
> > > 
> > > To go further on this path, we will need the following changes:
> > > - MSI configuration in the host (via a VFIO_IRQ_SET_ACTION_TRIGGER
> > >    hypercall) should set gIOVA instead of fetching from msi_cookie.
> > >    That hypercall doesn't forward an address currently, since host
> > >    kernel pre-sets the msi_cookie. So, we need a way to forward the
> > >    gIOVA to kernel and pack it into the msi_msg structure. I haven't
> > >    read the VFIO PCI code thoroughly, yet wonder if we could just
> > >    let the guest program the gIOVA to the PCI register and fall it
> > >    through to the hardware, so host kernel handling that hypercall
> > >    can just read it back from the register?
> > > - IOMMUFD should provide VMM a way to tell the gPA (or directly +
> > >    GITS_TRANSLATER?). Then kernel should do the stage-2 mapping. I
> > >    have talked to Jason about this a while ago, and we have a few
> > >    thoughts how to implement it. But eventually, I think we still
> > >    can't avoid a middle man like msi_cookie to associate the gPA in
> > >    IOMMUFD to PA in irqchip?
> > 
> > Probably a new IOMMU_DMA_MSI_COOKIE_USER type which uses
> > GPA (passed in in ALLOC_HWPT for a nested_parent type) as IOVA
> > in iommu_dma_get_msi_page()?
> 
> No, the whole point is to get away from cookies and having to keep track
> of things in the kernel that can and should just be simple regular
> user-owned S2 mappings.

I see. That'd be ideal.

> > > One more concern is the MSI window size. VMM sets up a MSI region
> > > that must fit the hardware window size. Most of ITS versions have
> > > only one page size but one of them can have multiple pages? What
> > > if vITS is one-page size while the underlying pITS has multiple?
> > > 
> > > My understanding of the current kernel-defined 1MB size is also a
> > > hard-coding window to potential fit all cases, since IOMMU code in
> > > the code can just eyeball what's going on in the irqchip subsystem
> > > and adjust accordingly if someday it needs to. But VMM can't?
> 
> The existing design is based around the kernel potentially having to
> stuff multiple different mappings for different devices into the MSI
> hole in a single domain, since VFIO userspace is allowed to do wacky
> things like emulate INTx using an underlying physical MSI, so there may
> not be any actual vITS region in the VM IPA space at all. I think that
> was also why it ended up being a fake reserved region exposed by the
> SMMU drivers rather than relying on userspace to say where to put it -
> making things look superficially a bit more x86-like meant fewer changes
> to userspace, which I think by now we can consider a tasty slice of
> technical debt.

Just found that intel-iommu's MSI window has the same 1MB size.
Everything makes sense now!

> For a dedicated "MSI passthrough" model where, in parallel to IOMMU
> nesting, the abstraction is thinner and userspace is in on the game of
> knowingly emulating a GIC ITS backed by a GIC ITS, I'd imagine it could
> be pretty straightforward, at least conceptually. Userspace has
> something like an IOAS_MAP_MSI(device, IOVA) to indicate where it's
> placing a vITS to which it wants that device's MSIs to be able to go,
> the kernel resolves the PA from the IRQ layer and maps it, job done. If

Will try to draft something.

> userspace wants to associate two devices with the same vITS when they
> have different physical ITSes, either we split the IOAS into two HWPTs
> to hold the different mappings, or we punt it back to userspace to
> resolve at the IOAS level.

I see. That sounds like a good solution. Return EEXIST for that case,
so the VMM will need to allocate a different IOAS/HWPT for that device
to attach to.

> Or I guess the really cheeky version is the IRQ layer exposes its own
> thing for userspace to mmap the ITS, then it can call a literal IOAS_MAP
> on that mapping... :D

I recall one benefit of IOMMU_RESV_SW_MSI is that it exposes the
MSI region via sysfs. Would it be safe to expose it in the same
way? The reserved region itself is per-device, which feels like
quite a good fit...

Thanks
Nicolin
Jason Gunthorpe Aug. 9, 2024, 10:49 p.m. UTC | #11
On Fri, Aug 09, 2024 at 12:18:42PM -0700, Nicolin Chen wrote:

> > The bigger issue is that we still have the hypervisor GIC driver
> > controlling things and it will need to know to use the guest provided
> > MSI address captured during the MSI trap, not its own address. I don't
> > have an idea how to connect those two parts yet.
> 
> You mean the gPA of the vITS v.s. PA of the ITS, right? I think
> that's because only VMM knows the virtual IRQ number to insert?
> We don't seem to have a choice for that unless we want to poke
> a hole to the vGIC design..

I mean what you explained in your other email:

> - MSI configuration in the host (via a VFIO_IRQ_SET_ACTION_TRIGGER
>   hypercall) should set gIOVA instead of fetching from msi_cookie.
>   That hypercall doesn't forward an address currently, since host
>   kernel pre-sets the msi_cookie. So, we need a way to forward the
>   gIOVA to kernel and pack it into the msi_msg structure. I haven't
>   read the VFIO PCI code thoroughly, yet wonder if we could just
>   let the guest program the gIOVA to the PCI register and fall it
>   through to the hardware, so host kernel handling that hypercall
>   can just read it back from the register?

We still need to convey the MSI-X address from the register trap into
the kernel and use the VM-supplied address instead of calling
iommu_dma_compose_msi_msg().

When you did your test you may have lucked out that the guest was
putting the ITS at the same place the host kernel expected because
they are both running the same software and making the same 
decision :)

Maybe take a look at what pushing the address down through the
VFIO_IRQ_SET_ACTION_TRIGGER would look like?

Jason
Jason Gunthorpe Aug. 9, 2024, 11:01 p.m. UTC | #12
On Fri, Aug 09, 2024 at 08:00:34AM +0000, Tian, Kevin wrote:

> > - IOMMUFD should provide VMM a way to tell the gPA (or directly +
> >   GITS_TRANSLATER?). Then kernel should do the stage-2 mapping. I
> >   have talked to Jason about this a while ago, and we have a few
> >   thoughts how to implement it. But eventually, I think we still
> >   can't avoid a middle man like msi_cookie to associate the gPA in
> >   IOMMUFD to PA in irqchip?
> 
> Probably a new IOMMU_DMA_MSI_COOKIE_USER type which uses
> GPA (passed in in ALLOC_HWPT for a nested_parent type) as IOVA
> in iommu_dma_get_msi_page()?

To get the ITS page into the iommufd I suspect we'd add a new iommufd ioctl:

struct iommu_ioas_map_msi_window {
	__u32 size;
	__u32 flags;
	__u32 ioas_id;
	__u32 __reserved;
	__s32 kvm_device_fd;
	__u32 kvm_device_type; // == KVM_DEV_TYPE_ARM_VGIC_ITS for safety
	__aligned_u64 device_args; // ??
	__aligned_u64 length;
	__aligned_u64 iova;
};

Where kvm_device_fd would be the KVM_DEV_TYPE_ARM_VGIC_ITS (?) device
(or the RISCV version).

This would let us get the ITS physical address from the GIC driver and
put it into the S2 at IOVA while relying on KVM to authorize and
locate the correct PA for whatever is going on here.
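
From the VMM side that might be driven roughly like this - a purely
hypothetical sketch, where the ioctl name and the helper values are
made up and only the proposed struct layout above is assumed:

	/* hypothetical: map the ITS doorbell into the S2 at a guest-chosen IOVA */
	struct iommu_ioas_map_msi_window cmd = {
		.size = sizeof(cmd),
		.ioas_id = ioas_id,
		.kvm_device_fd = vgic_its_fd,		/* from KVM_CREATE_DEVICE */
		.kvm_device_type = KVM_DEV_TYPE_ARM_VGIC_ITS,
		.length = SZ_64K,			/* assumed doorbell window size */
		.iova = vits_doorbell_gpa,		/* where the guest expects GITS_TRANSLATER */
	};

	if (ioctl(iommufd, IOMMU_IOAS_MAP_MSI_WINDOW, &cmd))
		err(1, "map msi window");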

Jason
Nicolin Chen Aug. 9, 2024, 11:38 p.m. UTC | #13
On Fri, Aug 09, 2024 at 07:49:10PM -0300, Jason Gunthorpe wrote:
> On Fri, Aug 09, 2024 at 12:18:42PM -0700, Nicolin Chen wrote:
> 
> > > The bigger issue is that we still have the hypervisor GIC driver
> > > controlling things and it will need to know to use the guest provided
> > > MSI address captured during the MSI trap, not its own address. I don't
> > > have an idea how to connect those two parts yet.
> > 
> > You mean the gPA of the vITS v.s. PA of the ITS, right? I think
> > that's because only VMM knows the virtual IRQ number to insert?
> > We don't seem to have a choice for that unless we want to poke
> > a hole to the vGIC design..
> 
> I mean what you explained in your other email:
> 
> > - MSI configuration in the host (via a VFIO_IRQ_SET_ACTION_TRIGGER
> >   hypercall) should set gIOVA instead of fetching from msi_cookie.
> >   That hypercall doesn't forward an address currently, since host
> >   kernel pre-sets the msi_cookie. So, we need a way to forward the
> >   gIOVA to kernel and pack it into the msi_msg structure. I haven't
> >   read the VFIO PCI code thoroughly, yet wonder if we could just
> >   let the guest program the gIOVA to the PCI register and fall it
> >   through to the hardware, so host kernel handling that hypercall
> >   can just read it back from the register?
> 
> We still need to convay the MSI-X address from the register trap into
> the kernel and use the VM supplied address instead of calling
> iommu_dma_compose_msi_msg().

Yes.

> When you did your test you may have lucked out that the guest was
> putting the ITS at the same place the host kernel expected because
> they are both running the same software and making the same 
> decision :)

Oh, the devil's in the details: I hard-coded every address in the
vITS's 2-stage mapping lol

> Maybe take a look at what pushing the address down through the
> VFIO_IRQ_SET_ACTION_TRIGGER would look like?

Yea, there's a data field and we can add a new flag to define the
format/type. Will take a closer look.
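
Off the top of my head, something along these lines - entirely
hypothetical, just to make the idea concrete (the new flag and payload
layout are made up; only the existing struct vfio_irq_set is real):

	/* existing uAPI, for reference */
	struct vfio_irq_set {
		__u32	argsz;
		__u32	flags;	/* VFIO_IRQ_SET_DATA_* | VFIO_IRQ_SET_ACTION_* */
		__u32	index;
		__u32	start;
		__u32	count;
		__u8	data[];
	};

	/* hypothetical new data type carrying the guest MSI address */
	#define VFIO_IRQ_SET_DATA_MSI_IOVA	(1 << 6)	/* made-up bit */

	struct vfio_irq_set_msi_iova {	/* made-up per-vector payload in data[] */
		__s32		eventfd;
		__u32		__reserved;
		__aligned_u64	msi_iova;	/* gIOVA of the vITS doorbell page */
	};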

Thanks
Nicolin

Patch

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 7b1dfa0665df6..05e04934a5f81 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -1799,6 +1799,20 @@  static struct iommu_dma_msi_page *iommu_dma_get_msi_page(struct device *dev,
 	return NULL;
 }
 
+/*
+ * Nested domains may not have an MSI cookie or accept mappings, but they may
+ * be related to a domain which does, so we let them tell us what they need.
+ */
+static struct iommu_domain *iommu_dma_get_msi_mapping_domain(struct device *dev)
+{
+	struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
+
+	if (domain && domain->type == IOMMU_DOMAIN_NESTED &&
+	    domain->ops->get_msi_mapping_domain)
+		domain = domain->ops->get_msi_mapping_domain(domain);
+	return domain;
+}
+
 /**
  * iommu_dma_prepare_msi() - Map the MSI page in the IOMMU domain
  * @desc: MSI descriptor, will store the MSI page
@@ -1809,7 +1823,7 @@  static struct iommu_dma_msi_page *iommu_dma_get_msi_page(struct device *dev,
 int iommu_dma_prepare_msi(struct msi_desc *desc, phys_addr_t msi_addr)
 {
 	struct device *dev = msi_desc_to_dev(desc);
-	struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
+	struct iommu_domain *domain = iommu_dma_get_msi_mapping_domain(dev);
 	struct iommu_dma_msi_page *msi_page;
 	static DEFINE_MUTEX(msi_prepare_lock); /* see below */
 
@@ -1842,7 +1856,7 @@  int iommu_dma_prepare_msi(struct msi_desc *desc, phys_addr_t msi_addr)
 void iommu_dma_compose_msi_msg(struct msi_desc *desc, struct msi_msg *msg)
 {
 	struct device *dev = msi_desc_to_dev(desc);
-	const struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
+	const struct iommu_domain *domain = iommu_dma_get_msi_mapping_domain(dev);
 	const struct iommu_dma_msi_page *msi_page;
 
 	msi_page = msi_desc_get_iommu_cookie(desc);
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 4d47f2c333118..69ed76f9c3ec4 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -638,6 +638,8 @@  struct iommu_ops {
  * @enable_nesting: Enable nesting
  * @set_pgtable_quirks: Set io page table quirks (IO_PGTABLE_QUIRK_*)
  * @free: Release the domain after use.
+ * @get_msi_mapping_domain: Return the related iommu_domain that should hold the
+ *                          MSI cookie and accept mapping(s).
  */
 struct iommu_domain_ops {
 	int (*attach_dev)(struct iommu_domain *domain, struct device *dev);
@@ -668,6 +670,8 @@  struct iommu_domain_ops {
 				  unsigned long quirks);
 
 	void (*free)(struct iommu_domain *domain);
+	struct iommu_domain *
+		(*get_msi_mapping_domain)(struct iommu_domain *domain);
 };
 
 /**