
[v9,03/10] iommu: Separate IOMMU_DEV_FEAT_IOPF from IOMMU_DEV_FEAT_SVA

Message ID 20210108145217.2254447-4-jean-philippe@linaro.org (mailing list archive)
State New, archived
Series: iommu: I/O page faults for SMMUv3

Commit Message

Jean-Philippe Brucker Jan. 8, 2021, 2:52 p.m. UTC
Some devices manage I/O Page Faults (IOPF) themselves instead of relying
on PCIe PRI or Arm SMMU stall. Allow their drivers to enable SVA without
mandating IOMMU-managed IOPF. Other device drivers must now enable
IOMMU_DEV_FEAT_IOPF before enabling IOMMU_DEV_FEAT_SVA.
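The required ordering can be sketched as follows. This is a stand-alone model
with stubbed IOMMU calls: the stub bodies, the -1 error value, and the way the
ordering is enforced are illustrative only, not the kernel's actual behavior.

```c
#include <assert.h>

/*
 * Stand-ins for the kernel's enum and API so the new call order can be
 * shown outside the kernel. Real drivers call iommu_dev_enable_feature()
 * from <linux/iommu.h>; the ordering check below is only a model.
 */
enum iommu_dev_features {
	IOMMU_DEV_FEAT_AUX,
	IOMMU_DEV_FEAT_SVA,
	IOMMU_DEV_FEAT_IOPF,
};

static int iopf_enabled;	/* models per-device IOMMU state */

static int iommu_dev_enable_feature(void *dev, enum iommu_dev_features feat)
{
	(void)dev;
	switch (feat) {
	case IOMMU_DEV_FEAT_IOPF:
		iopf_enabled = 1;
		return 0;
	case IOMMU_DEV_FEAT_SVA:
		/* IOMMU-managed IOPF must already be enabled */
		return iopf_enabled ? 0 : -1;
	default:
		return -1;
	}
}

/* A driver relying on IOMMU-managed IOPF enables the features in order. */
static int driver_enable_sva(void *dev)
{
	int ret = iommu_dev_enable_feature(dev, IOMMU_DEV_FEAT_IOPF);

	if (ret)
		return ret;
	return iommu_dev_enable_feature(dev, IOMMU_DEV_FEAT_SVA);
}
```

A device that manages IOPF itself would skip the FEAT_IOPF step and enable
FEAT_SVA directly.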

Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
---
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Lu Baolu <baolu.lu@linux.intel.com>
Cc: Will Deacon <will@kernel.org>
Cc: Zhangfei Gao <zhangfei.gao@linaro.org>
Cc: Zhou Wang <wangzhou1@hisilicon.com>
---
 include/linux/iommu.h | 20 +++++++++++++++++---
 1 file changed, 17 insertions(+), 3 deletions(-)

Comments

Baolu Lu Jan. 12, 2021, 4:31 a.m. UTC | #1
Hi Jean,

On 1/8/21 10:52 PM, Jean-Philippe Brucker wrote:
> Some devices manage I/O Page Faults (IOPF) themselves instead of relying
> on PCIe PRI or Arm SMMU stall. Allow their drivers to enable SVA without
> mandating IOMMU-managed IOPF. The other device drivers now need to first
> enable IOMMU_DEV_FEAT_IOPF before enabling IOMMU_DEV_FEAT_SVA.
> 
> Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
> ---
> Cc: Arnd Bergmann <arnd@arndb.de>
> Cc: David Woodhouse <dwmw2@infradead.org>
> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Cc: Joerg Roedel <joro@8bytes.org>
> Cc: Lu Baolu <baolu.lu@linux.intel.com>
> Cc: Will Deacon <will@kernel.org>
> Cc: Zhangfei Gao <zhangfei.gao@linaro.org>
> Cc: Zhou Wang <wangzhou1@hisilicon.com>
> ---
>   include/linux/iommu.h | 20 +++++++++++++++++---
>   1 file changed, 17 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index 583c734b2e87..701b2eeb0dc5 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -156,10 +156,24 @@ struct iommu_resv_region {
>   	enum iommu_resv_type	type;
>   };
>   
> -/* Per device IOMMU features */
> +/**
> + * enum iommu_dev_features - Per device IOMMU features
> + * @IOMMU_DEV_FEAT_AUX: Auxiliary domain feature
> + * @IOMMU_DEV_FEAT_SVA: Shared Virtual Addresses
> + * @IOMMU_DEV_FEAT_IOPF: I/O Page Faults such as PRI or Stall. Generally using
> + *			 %IOMMU_DEV_FEAT_SVA requires %IOMMU_DEV_FEAT_IOPF, but
> + *			 some devices manage I/O Page Faults themselves instead
> + *			 of relying on the IOMMU. When supported, this feature
> + *			 must be enabled before and disabled after
> + *			 %IOMMU_DEV_FEAT_SVA.

Is this only for SVA? We may see more scenarios of using IOPF. For
example, when passing through devices to user level, the user's pages
could be managed dynamically instead of being allocated and pinned
statically.

If @IOMMU_DEV_FEAT_IOPF is defined as generic iopf support, the current
vendor IOMMU driver support may not be enough.

Best regards,
baolu

> + *
> + * Device drivers query whether a feature is supported using
> + * iommu_dev_has_feature(), and enable it using iommu_dev_enable_feature().
> + */
>   enum iommu_dev_features {
> -	IOMMU_DEV_FEAT_AUX,	/* Aux-domain feature */
> -	IOMMU_DEV_FEAT_SVA,	/* Shared Virtual Addresses */
> +	IOMMU_DEV_FEAT_AUX,
> +	IOMMU_DEV_FEAT_SVA,
> +	IOMMU_DEV_FEAT_IOPF,
>   };
>   
>   #define IOMMU_PASID_INVALID	(-1U)
>
Jean-Philippe Brucker Jan. 12, 2021, 9:16 a.m. UTC | #2
Hi Baolu,

On Tue, Jan 12, 2021 at 12:31:23PM +0800, Lu Baolu wrote:
> Hi Jean,
> 
> On 1/8/21 10:52 PM, Jean-Philippe Brucker wrote:
> > Some devices manage I/O Page Faults (IOPF) themselves instead of relying
> > on PCIe PRI or Arm SMMU stall. Allow their drivers to enable SVA without
> > mandating IOMMU-managed IOPF. The other device drivers now need to first
> > enable IOMMU_DEV_FEAT_IOPF before enabling IOMMU_DEV_FEAT_SVA.
> > 
> > Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
> > ---
> > Cc: Arnd Bergmann <arnd@arndb.de>
> > Cc: David Woodhouse <dwmw2@infradead.org>
> > Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> > Cc: Joerg Roedel <joro@8bytes.org>
> > Cc: Lu Baolu <baolu.lu@linux.intel.com>
> > Cc: Will Deacon <will@kernel.org>
> > Cc: Zhangfei Gao <zhangfei.gao@linaro.org>
> > Cc: Zhou Wang <wangzhou1@hisilicon.com>
> > ---
> >   include/linux/iommu.h | 20 +++++++++++++++++---
> >   1 file changed, 17 insertions(+), 3 deletions(-)
> > 
> > diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> > index 583c734b2e87..701b2eeb0dc5 100644
> > --- a/include/linux/iommu.h
> > +++ b/include/linux/iommu.h
> > @@ -156,10 +156,24 @@ struct iommu_resv_region {
> >   	enum iommu_resv_type	type;
> >   };
> > -/* Per device IOMMU features */
> > +/**
> > + * enum iommu_dev_features - Per device IOMMU features
> > + * @IOMMU_DEV_FEAT_AUX: Auxiliary domain feature
> > + * @IOMMU_DEV_FEAT_SVA: Shared Virtual Addresses
> > + * @IOMMU_DEV_FEAT_IOPF: I/O Page Faults such as PRI or Stall. Generally using
> > + *			 %IOMMU_DEV_FEAT_SVA requires %IOMMU_DEV_FEAT_IOPF, but
> > + *			 some devices manage I/O Page Faults themselves instead
> > + *			 of relying on the IOMMU. When supported, this feature
> > + *			 must be enabled before and disabled after
> > + *			 %IOMMU_DEV_FEAT_SVA.
> 
> Is this only for SVA? We may see more scenarios of using IOPF. For
> example, when passing through devices to user level, the user's pages
> could be managed dynamically instead of being allocated and pinned
> statically.

Hm, isn't that precisely what SVA does?  I don't understand the
difference. That said, FEAT_IOPF doesn't have to be only for SVA. It could
later be used as a prerequisite for some other feature. For special cases,
device drivers can always use the iommu_register_device_fault_handler()
API and handle faults themselves.
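For illustration, that self-managed path could look roughly like this. The
fault structure, registration call, and handler below are simplified
stand-ins shaped after iommu_register_device_fault_handler(), not the real
kernel interface:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Simplified stand-ins: struct iommu_fault and the registration call
 * mirror the shape of iommu_register_device_fault_handler(), but this
 * is a toy model, not the kernel interface.
 */
struct iommu_fault {
	unsigned long addr;	/* faulting address reported by the device */
};

typedef int (*iommu_dev_fault_handler_t)(struct iommu_fault *fault, void *data);

static iommu_dev_fault_handler_t registered_handler;
static void *registered_data;

static int iommu_register_device_fault_handler(void *dev,
					       iommu_dev_fault_handler_t handler,
					       void *data)
{
	(void)dev;
	registered_handler = handler;
	registered_data = data;
	return 0;
}

/* A driver managing IOPF itself resolves the fault and replies on its own. */
static int my_driver_iopf_handler(struct iommu_fault *fault, void *data)
{
	int *handled = data;

	(void)fault;		/* e.g. fix up the mapping for fault->addr */
	(*handled)++;
	return 0;
}

static int demo(void)
{
	int handled = 0;
	struct iommu_fault fault = { .addr = 0x1000 };

	iommu_register_device_fault_handler(NULL, my_driver_iopf_handler, &handled);
	registered_handler(&fault, registered_data);	/* fault delivery */
	return handled;
}
```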

> If @IOMMU_DEV_FEAT_IOPF is defined as generic iopf support, the current
> vendor IOMMU driver support may not enough.

IOMMU_DEV_FEAT_IOPF on its own doesn't do anything useful; it's mainly a
way for device drivers to probe the IOMMU capability. Granted, in patch 10
the SMMU driver registers the IOPF queue on enable(), but that could be
done by the FEAT_SVA enable() instead, if we ever repurpose FEAT_IOPF.

Thanks,
Jean
Baolu Lu Jan. 13, 2021, 2:49 a.m. UTC | #3
Hi Jean,

On 1/12/21 5:16 PM, Jean-Philippe Brucker wrote:
> Hi Baolu,
> 
> On Tue, Jan 12, 2021 at 12:31:23PM +0800, Lu Baolu wrote:
>> Hi Jean,
>>
>> On 1/8/21 10:52 PM, Jean-Philippe Brucker wrote:
>>> Some devices manage I/O Page Faults (IOPF) themselves instead of relying
>>> on PCIe PRI or Arm SMMU stall. Allow their drivers to enable SVA without
>>> mandating IOMMU-managed IOPF. The other device drivers now need to first
>>> enable IOMMU_DEV_FEAT_IOPF before enabling IOMMU_DEV_FEAT_SVA.
>>>
>>> Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
>>> ---
>>> Cc: Arnd Bergmann <arnd@arndb.de>
>>> Cc: David Woodhouse <dwmw2@infradead.org>
>>> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
>>> Cc: Joerg Roedel <joro@8bytes.org>
>>> Cc: Lu Baolu <baolu.lu@linux.intel.com>
>>> Cc: Will Deacon <will@kernel.org>
>>> Cc: Zhangfei Gao <zhangfei.gao@linaro.org>
>>> Cc: Zhou Wang <wangzhou1@hisilicon.com>
>>> ---
>>>    include/linux/iommu.h | 20 +++++++++++++++++---
>>>    1 file changed, 17 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
>>> index 583c734b2e87..701b2eeb0dc5 100644
>>> --- a/include/linux/iommu.h
>>> +++ b/include/linux/iommu.h
>>> @@ -156,10 +156,24 @@ struct iommu_resv_region {
>>>    	enum iommu_resv_type	type;
>>>    };
>>> -/* Per device IOMMU features */
>>> +/**
>>> + * enum iommu_dev_features - Per device IOMMU features
>>> + * @IOMMU_DEV_FEAT_AUX: Auxiliary domain feature
>>> + * @IOMMU_DEV_FEAT_SVA: Shared Virtual Addresses
>>> + * @IOMMU_DEV_FEAT_IOPF: I/O Page Faults such as PRI or Stall. Generally using
>>> + *			 %IOMMU_DEV_FEAT_SVA requires %IOMMU_DEV_FEAT_IOPF, but
>>> + *			 some devices manage I/O Page Faults themselves instead
>>> + *			 of relying on the IOMMU. When supported, this feature
>>> + *			 must be enabled before and disabled after
>>> + *			 %IOMMU_DEV_FEAT_SVA.
>>
>> Is this only for SVA? We may see more scenarios of using IOPF. For
>> example, when passing through devices to user level, the user's pages
>> could be managed dynamically instead of being allocated and pinned
>> statically.
> 
> Hm, isn't that precisely what SVA does?  I don't understand the
> difference. That said FEAT_IOPF doesn't have to be only for SVA. It could
> later be used as a prerequisite some another feature. For special cases
> device drivers can always use the iommu_register_device_fault_handler()
> API and handle faults themselves.

From the IOMMU's perspective, there is a small difference between these
two. For SVA, the page table comes from the CPU side, so the IOMMU only
needs to call handle_mm_fault(). For the above pass-through case, the page
table comes from the IOMMU side, so the device driver (probably VFIO) needs
to register a fault handler and call iommu_map/unmap() to serve the page
faults.
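That pass-through flow could be sketched like this. Everything below (the
handler signature, the pinning stand-in, the prot value) is a simplified
model, not VFIO's actual code:

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SIZE 4096UL

/*
 * Toy model of the pass-through case: the fault client (VFIO-like) owns
 * the page table, so a fault is served with iommu_map() rather than
 * handle_mm_fault(). iommu_map()'s body here only records the mapping.
 */
static unsigned long mapped_iova[16];
static int nr_mapped;

static int iommu_map(void *domain, unsigned long iova, unsigned long pa,
		     unsigned long size, int prot)
{
	(void)domain; (void)pa; (void)size; (void)prot;
	mapped_iova[nr_mapped++] = iova;
	return 0;
}

/* Hypothetical VFIO-style handler: resolve/pin the page, then map it. */
static int vfio_iopf_handler(unsigned long faulting_iova, void *domain)
{
	unsigned long iova = faulting_iova & ~(TOY_PAGE_SIZE - 1);
	unsigned long pa = iova;	/* stand-in for pinning + translation */

	return iommu_map(domain, iova, pa, TOY_PAGE_SIZE, 0x3 /* read|write */);
}
```

The SVA case needs none of this: the IOMMU core resolves the fault against
the shared CPU page table directly.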

If we think about the nested mode (or dual-stage translation), it's more
complicated, since the kernel (probably VFIO) handles the second-level page
faults, while the first-level page faults need to be delivered to the
user-level guest. Obviously, this hasn't been fully implemented in any
IOMMU driver.

> 
>> If @IOMMU_DEV_FEAT_IOPF is defined as generic iopf support, the current
>> vendor IOMMU driver support may not enough.
> 
> IOMMU_DEV_FEAT_IOPF on its own doesn't do anything useful, it's mainly a
> way for device drivers to probe the IOMMU capability. Granted in patch
> 10 the SMMU driver registers the IOPF queue on enable() but that could be
> done by FEAT_SVA enable() instead, if we ever repurpose FEAT_IOPF.

I have no objection to splitting IOPF from SVA. Actually, we must have this
eventually. My concern is that, at this stage, the IOMMU drivers only
support the SVA type of IOPF; a generic IOMMU_DEV_FEAT_IOPF feature might
confuse device drivers which want to add other types of IOPF usage.

> 
> Thanks,
> Jean
> 

Best regards,
baolu
Tian, Kevin Jan. 13, 2021, 8:10 a.m. UTC | #4
> From: Lu Baolu <baolu.lu@linux.intel.com>
> Sent: Wednesday, January 13, 2021 10:50 AM
> 
> Hi Jean,
> 
> On 1/12/21 5:16 PM, Jean-Philippe Brucker wrote:
> > Hi Baolu,
> >
> > On Tue, Jan 12, 2021 at 12:31:23PM +0800, Lu Baolu wrote:
> >> Hi Jean,
> >>
> >> On 1/8/21 10:52 PM, Jean-Philippe Brucker wrote:
> >>> Some devices manage I/O Page Faults (IOPF) themselves instead of relying
> >>> on PCIe PRI or Arm SMMU stall. Allow their drivers to enable SVA without
> >>> mandating IOMMU-managed IOPF. The other device drivers now need to first
> >>> enable IOMMU_DEV_FEAT_IOPF before enabling IOMMU_DEV_FEAT_SVA.
> >>>
> >>> Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
> >>> ---
> >>> Cc: Arnd Bergmann <arnd@arndb.de>
> >>> Cc: David Woodhouse <dwmw2@infradead.org>
> >>> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> >>> Cc: Joerg Roedel <joro@8bytes.org>
> >>> Cc: Lu Baolu <baolu.lu@linux.intel.com>
> >>> Cc: Will Deacon <will@kernel.org>
> >>> Cc: Zhangfei Gao <zhangfei.gao@linaro.org>
> >>> Cc: Zhou Wang <wangzhou1@hisilicon.com>
> >>> ---
> >>>    include/linux/iommu.h | 20 +++++++++++++++++---
> >>>    1 file changed, 17 insertions(+), 3 deletions(-)
> >>>
> >>> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> >>> index 583c734b2e87..701b2eeb0dc5 100644
> >>> --- a/include/linux/iommu.h
> >>> +++ b/include/linux/iommu.h
> >>> @@ -156,10 +156,24 @@ struct iommu_resv_region {
> >>>    	enum iommu_resv_type	type;
> >>>    };
> >>> -/* Per device IOMMU features */
> >>> +/**
> >>> + * enum iommu_dev_features - Per device IOMMU features
> >>> + * @IOMMU_DEV_FEAT_AUX: Auxiliary domain feature
> >>> + * @IOMMU_DEV_FEAT_SVA: Shared Virtual Addresses
> >>> + * @IOMMU_DEV_FEAT_IOPF: I/O Page Faults such as PRI or Stall. Generally using
> >>> + *			 %IOMMU_DEV_FEAT_SVA requires %IOMMU_DEV_FEAT_IOPF, but
> >>> + *			 some devices manage I/O Page Faults themselves instead
> >>> + *			 of relying on the IOMMU. When supported, this feature
> >>> + *			 must be enabled before and disabled after
> >>> + *			 %IOMMU_DEV_FEAT_SVA.
> >>
> >> Is this only for SVA? We may see more scenarios of using IOPF. For
> >> example, when passing through devices to user level, the user's pages
> >> could be managed dynamically instead of being allocated and pinned
> >> statically.
> >
> > Hm, isn't that precisely what SVA does?  I don't understand the
> > difference. That said FEAT_IOPF doesn't have to be only for SVA. It could
> > later be used as a prerequisite some another feature. For special cases
> > device drivers can always use the iommu_register_device_fault_handler()
> > API and handle faults themselves.
> 
>  From the perspective of IOMMU, there is a little difference between
> these two. For SVA, the page table is from CPU side, so IOMMU only needs
> to call handle_mm_fault(); For above pass-through case, the page table
> is from IOMMU side, so the device driver (probably VFIO) needs to
> register a fault handler and call iommu_map/unmap() to serve the page
> faults.
> 
> If we think about the nested mode (or dual-stage translation), it's more
> complicated since the kernel (probably VFIO) handles the second level
> page faults, while the first level page faults need to be delivered to
> user-level guest. Obviously, this hasn't been fully implemented in any
> IOMMU driver.
> 

Thinking about it more, the confusion might come from the fact that we mixed
hardware capability with software capability. IOMMU_FEAT describes the
hardware capability. When FEAT_IOPF is set, it purely means that whatever
page faults are enabled by the software are routed through the IOMMU.
Nothing more. The software (IOMMU drivers) may then choose to support only
limited faulting scenarios and evolve to support more complex usages
gradually. For example, the intel-iommu driver only supports 1st-level
faults (thus SVA) for now, while FEAT_IOPF as a separate feature may give
the impression that 2nd-level faults are also allowed. From this angle, once
we start to separate page faults from SVA, we may also need a way to report
the software capability (e.g. a set of faulting categories) and also to
extend iommu_register_device_fault_handler to allow specifying which
category is enabled. The example categories could be:

- IOPF_BIND, for page tables which are bound/linked to the IOMMU.
Applies to the bare-metal SVA and guest SVA cases;
- IOPF_MAP, for page tables which are managed through explicit IOMMU
map interfaces. Applies to removing the VFIO pinning restriction;

Both categories can be enabled together in nested translation, with
additional information provided to differentiate them in the fault
information. Using the paging/staging level doesn't make much sense, as it's
the IOMMU driver's internal knowledge; e.g. the VT-d driver plans to use the
1st level for GPA if there is no nesting and then turn to the 2nd level when
nesting is enabled.

Thanks
Kevin
Jean-Philippe Brucker Jan. 14, 2021, 4:41 p.m. UTC | #5
On Wed, Jan 13, 2021 at 08:10:28AM +0000, Tian, Kevin wrote:
> > >> Is this only for SVA? We may see more scenarios of using IOPF. For
> > >> example, when passing through devices to user level, the user's pages
> > >> could be managed dynamically instead of being allocated and pinned
> > >> statically.
> > >
> > > Hm, isn't that precisely what SVA does?  I don't understand the
> > > difference. That said FEAT_IOPF doesn't have to be only for SVA. It could
> > > later be used as a prerequisite some another feature. For special cases
> > > device drivers can always use the iommu_register_device_fault_handler()
> > > API and handle faults themselves.
> > 
> >  From the perspective of IOMMU, there is a little difference between
> > these two. For SVA, the page table is from CPU side, so IOMMU only needs
> > to call handle_mm_fault(); For above pass-through case, the page table
> > is from IOMMU side, so the device driver (probably VFIO) needs to
> > register a fault handler and call iommu_map/unmap() to serve the page
> > faults.
> > 
> > If we think about the nested mode (or dual-stage translation), it's more
> > complicated since the kernel (probably VFIO) handles the second level
> > page faults, while the first level page faults need to be delivered to
> > user-level guest. Obviously, this hasn't been fully implemented in any
> > IOMMU driver.
> > 
> 
> Thinking more the confusion might come from the fact that we mixed
> hardware capability with software capability. IOMMU_FEAT describes
> the hardware capability. When FEAT_IOPF is set, it purely means whatever
> page faults that are enabled by the software are routed through the IOMMU.
> Nothing more. Then the software (IOMMU drivers) may choose to support
> only limited faulting scenarios and then evolve to support more complex 
> usages gradually. For example, the intel-iommu driver only supports 1st-level
> fault (thus SVA) for now, while FEAT_IOPF as a separate feature may give the
> impression that 2nd-level faults are also allowed. From this angle once we 
> start to separate page fault from SVA, we may also need a way to report 
> the software capability (e.g. a set of faulting categories) and also extend
> iommu_register_device_fault_handler to allow specifying which 
> category is enabled respectively. The example categories could be:
> 
> - IOPF_BIND, for page tables which are bound/linked to the IOMMU. 
> Apply to bare metal SVA and guest SVA case;

These don't seem to fit in the same software capability, since the action
to perform on incoming page faults is very different. In the first case
the fault handling is entirely contained within the IOMMU driver; in the
second case the IOMMU driver only tracks page requests, and offloads
handling to VFIO.

> - IOPF_MAP, for page tables which are managed through explicit IOMMU
> map interfaces. Apply to removing VFIO pinning restriction;

From the IOMMU perspective this is the same as guest SVA, no? VFIO
registering a fault handler and doing the bulk of the work itself.

> Both categories can be enabled together in nested translation, with 
> additional information provided to differentiate them in fault information.
> Using paging/staging level doesn't make much sense as it's IOMMU driver's 
> internal knowledge, e.g. VT-d driver plans to use 1st level for GPA if no 
> nesting and then turn to 2nd level when nesting is enabled.

I guess detailing what's needed for nested IOPF can help the discussion,
although I haven't seen any concrete plan about implementing it, and it
still seems a couple of years away. There are two important steps with
nested IOPF:

(1) Figuring out whether a fault comes from L1 or L2. A SMMU stall event
    comes with this information, but a PRI page request doesn't. The IOMMU
    driver has to first translate the IOVA to a GPA, injecting the fault
    into the guest if this translation fails by using the usual
    iommu_report_device_fault().

(2) Translating the faulting GPA to a HVA that can be fed to
    handle_mm_fault(). That requires help from KVM, so another interface -
    either KVM registering GPA->HVA translation tables or IOMMU driver
    querying each translation. Either way it should be reusable by device
    drivers that implement IOPF themselves.

(1) could be enabled with iommu_dev_enable_feature(). (2) requires a more
complex interface. (2) alone might also be desirable - demand-paging for
level 2 only, no SVA for level 1.
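A sketch of how such a dispatch routine might look. The L1 walk, the
GPA->HVA lookup, and the address ranges/offsets are all toy stand-ins, since
neither interface exists yet:

```c
#include <assert.h>

/*
 * Toy dispatch for nested IOPF. Neither the L1 walk nor the KVM-assisted
 * GPA->HVA lookup exists as a kernel interface; the address ranges and
 * offsets below are arbitrary stand-ins.
 */
enum fault_action { INJECT_TO_GUEST, HANDLE_IN_HOST };

/* Step (1): walk the guest (stage-1) tables; a miss means a guest fault. */
static int translate_l1(unsigned long long iova, unsigned long long *gpa)
{
	if (iova >= 0x10000)		/* pretend high IOVAs miss in L1 */
		return -1;
	*gpa = iova + 0x100000;		/* toy IOVA->GPA offset */
	return 0;
}

/* Step (2): GPA->HVA would need help from KVM; modeled as a fixed offset. */
static unsigned long long gpa_to_hva(unsigned long long gpa)
{
	return gpa + 0x7f0000000000ULL;
}

static enum fault_action dispatch_nested_fault(unsigned long long iova,
					       unsigned long long *hva)
{
	unsigned long long gpa;

	if (translate_l1(iova, &gpa))
		return INJECT_TO_GUEST;	/* i.e. iommu_report_device_fault() */
	*hva = gpa_to_hva(gpa);		/* then feed hva to handle_mm_fault() */
	return HANDLE_IN_HOST;
}

static unsigned long long last_hva;	/* scratch for demonstration */
```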

Anyway, back to this patch. What I'm trying to convey is "can the IOMMU
receive incoming I/O page faults for this device and, when SVA is enabled,
feed them to the mm subsystem?  Enable that or return an error." I'm stuck
on the name. IOPF alone is too vague. Not IOPF_L1 as Kevin noted, since L1
is also used in virtualization. IOPF_BIND and IOPF_SVA could also mean (2)
above. IOMMU_DEV_FEAT_IOPF_FLAT?

That leaves space for the nested extensions. (1) above could be
IOMMU_FEAT_IOPF_NESTED, and (2) requires some new interfacing with KVM (or
just an external fault handler) and could be used with either IOPF_FLAT or
IOPF_NESTED. We can figure out the details later. What do you think?

Thanks,
Jean
Baolu Lu Jan. 16, 2021, 3:54 a.m. UTC | #6
Hi Jean,

On 2021/1/15 0:41, Jean-Philippe Brucker wrote:
> I guess detailing what's needed for nested IOPF can help the discussion,
> although I haven't seen any concrete plan about implementing it, and it
> still seems a couple of years away. There are two important steps with
> nested IOPF:
> 
> (1) Figuring out whether a fault comes from L1 or L2. A SMMU stall event
>      comes with this information, but a PRI page request doesn't. The IOMMU
>      driver has to first translate the IOVA to a GPA, injecting the fault
>      into the guest if this translation fails by using the usual
>      iommu_report_device_fault().
> 
> (2) Translating the faulting GPA to a HVA that can be fed to
>      handle_mm_fault(). That requires help from KVM, so another interface -
>      either KVM registering GPA->HVA translation tables or IOMMU driver
>      querying each translation. Either way it should be reusable by device
>      drivers that implement IOPF themselves.
> 
> (1) could be enabled with iommu_dev_enable_feature(). (2) requires a more
> complex interface. (2) alone might also be desirable - demand-paging for
> level 2 only, no SVA for level 1.
> 
> Anyway, back to this patch. What I'm trying to convey is "can the IOMMU
> receive incoming I/O page faults for this device and, when SVA is enabled,
> feed them to the mm subsystem?  Enable that or return an error." I'm stuck
> on the name. IOPF alone is too vague. Not IOPF_L1 as Kevin noted, since L1
> is also used in virtualization. IOPF_BIND and IOPF_SVA could also mean (2)
> above. IOMMU_DEV_FEAT_IOPF_FLAT?
> 
> That leaves space for the nested extensions. (1) above could be
> IOMMU_FEAT_IOPF_NESTED, and (2) requires some new interfacing with KVM (or
> just an external fault handler) and could be used with either IOPF_FLAT or
> IOPF_NESTED. We can figure out the details later. What do you think?

I agree that we can define IOPF_ for current usage and leave space for
future extensions.

IOPF_FLAT represents IOPF on first-level translation. Currently, first-level
translation can be used in the following cases:

1) FL w/ internal Page Table: Kernel IOVA;
2) FL w/ external Page Table: VFIO passthrough;
3) FL w/ shared CPU page table: SVA

We don't need to support IOPF for case 1). Let's put it aside.

IOPF handling for 2) and 3) is different. Do we need to define different
names to distinguish these two cases?

Best regards,
baolu
Tian, Kevin Jan. 18, 2021, 6:54 a.m. UTC | #7
> From: Lu Baolu <baolu.lu@linux.intel.com>
> Sent: Saturday, January 16, 2021 11:54 AM
> 
> Hi Jean,
> 
> On 2021/1/15 0:41, Jean-Philippe Brucker wrote:
> > I guess detailing what's needed for nested IOPF can help the discussion,
> > although I haven't seen any concrete plan about implementing it, and it
> > still seems a couple of years away. There are two important steps with
> > nested IOPF:
> >
> > (1) Figuring out whether a fault comes from L1 or L2. A SMMU stall event
> >      comes with this information, but a PRI page request doesn't. The IOMMU
> >      driver has to first translate the IOVA to a GPA, injecting the fault
> >      into the guest if this translation fails by using the usual
> >      iommu_report_device_fault().

The IOMMU driver can walk the page tables to find out the level information.
If the walk terminates at the 1st level, inject to the guest. Otherwise, fix
the mm fault at the 2nd level. It's not efficient compared to
hardware-provided info, but it's doable, and the actual overhead needs to be
measured (optimizations exist, e.g. having the fault client hint that no
2nd-level faults are expected when registering the fault handler in the
pinned case).

> >
> > (2) Translating the faulting GPA to a HVA that can be fed to
> >      handle_mm_fault(). That requires help from KVM, so another interface -
> >      either KVM registering GPA->HVA translation tables or IOMMU driver
> >      querying each translation. Either way it should be reusable by device
> >      drivers that implement IOPF themselves.

Or just leave it to the fault client (say VFIO here) to figure it out. VFIO
has the information about GPA->HPA and can then call handle_mm_fault to fix
the received fault. The IOMMU driver just exports an interface for the
device drivers which implement IOPF themselves to report a fault, which is
then handled by the IOMMU core by reusing the same faulting path.

> >
> > (1) could be enabled with iommu_dev_enable_feature(). (2) requires a more
> > complex interface. (2) alone might also be desirable - demand-paging for
> > level 2 only, no SVA for level 1.

Yes, this is what we want to point out. A general FEAT_IOPF implies more than
what this patch intended to address.

> >
> > Anyway, back to this patch. What I'm trying to convey is "can the IOMMU
> > receive incoming I/O page faults for this device and, when SVA is enabled,
> > feed them to the mm subsystem?  Enable that or return an error." I'm stuck
> > on the name. IOPF alone is too vague. Not IOPF_L1 as Kevin noted, since L1
> > is also used in virtualization. IOPF_BIND and IOPF_SVA could also mean (2)
> > above. IOMMU_DEV_FEAT_IOPF_FLAT?
> >
> > That leaves space for the nested extensions. (1) above could be
> > IOMMU_FEAT_IOPF_NESTED, and (2) requires some new interfacing with KVM (or
> > just an external fault handler) and could be used with either IOPF_FLAT or
> > IOPF_NESTED. We can figure out the details later. What do you think?
> 
> I agree that we can define IOPF_ for current usage and leave space for
> future extensions.
> 
> IOPF_FLAT represents IOPF on first-level translation, currently first
> level translation could be used in below cases.
> 
> 1) FL w/ internal Page Table: Kernel IOVA;
> 2) FL w/ external Page Table: VFIO passthrough;
> 3) FL w/ shared CPU page table: SVA
> 
> We don't need to support IOPF for case 1). Let's put it aside.
> 
> IOPF handling of 2) and 3) are different. Do we need to define different
> names to distinguish these two cases?
> 

Defining feature names according to various use cases does not sound like a
clean way. Ideally we should have just a general FEAT_IOPF, since the
hardware (at least VT-d) does support faults in 1st-level, 2nd-level or
nested configurations. We are running into this trouble only because it is
difficult for the software to evolve to enable the full hardware capability
in one batch. My last proposal was to keep FEAT_IOPF as a general capability
describing whether faults are delivered through the IOMMU or the ad-hoc
device, and then to have a separate interface for whether IOPF reporting is
available under a specific configuration. The former is about the path
between the IOMMU and the device, while the latter is about the interface
between the IOMMU driver and its faulting client.

The reporting capability can be checked when the fault client registers its
fault handler; at this point the IOMMU driver knows how the related mapping
is configured (1st, 2nd, or nested) and whether fault reporting is supported
in that configuration. We may introduce IOPF_REPORT_FLAT and
IOPF_REPORT_NESTED respectively. While IOPF_REPORT_FLAT detection is
straightforward (2 and 3 can be differentiated internally based on the
configured level), IOPF_REPORT_NESTED needs additional info to indicate
which level is concerned, since the vendor driver may not support fault
reporting in both levels, or the fault client may be interested in only one
level (e.g. with the 2nd level pinned).
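A sketch of what such an extended registration might look like. The
IOPF_REPORT_* flags and the function name are hypothetical, following the
proposal above; nothing here exists in the kernel:

```c
#include <assert.h>

/*
 * Hypothetical IOPF_REPORT_* flags following the proposal above; none of
 * these names exist in the kernel. Registration fails if the IOMMU driver
 * cannot report faults for a requested category in the current config.
 */
#define IOPF_REPORT_FLAT	(1u << 0)
#define IOPF_REPORT_NESTED_L1	(1u << 1)	/* guest-handled level */
#define IOPF_REPORT_NESTED_L2	(1u << 2)	/* host-handled level */

/* e.g. a driver that only supports the SVA-style flat case today */
static unsigned int driver_supported = IOPF_REPORT_FLAT;

static int iommu_register_fault_handler_ext(void *dev, unsigned int flags)
{
	(void)dev;
	if (flags & ~driver_supported)
		return -1;	/* category not reportable in this config */
	return 0;
}
```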

Thanks
Kevin
Jean-Philippe Brucker Jan. 19, 2021, 10:16 a.m. UTC | #8
On Mon, Jan 18, 2021 at 06:54:28AM +0000, Tian, Kevin wrote:
> > From: Lu Baolu <baolu.lu@linux.intel.com>
> > Sent: Saturday, January 16, 2021 11:54 AM
> > 
> > Hi Jean,
> > 
> > On 2021/1/15 0:41, Jean-Philippe Brucker wrote:
> > > I guess detailing what's needed for nested IOPF can help the discussion,
> > > although I haven't seen any concrete plan about implementing it, and it
> > > still seems a couple of years away. There are two important steps with
> > > nested IOPF:
> > >
> > > (1) Figuring out whether a fault comes from L1 or L2. A SMMU stall event
> > >      comes with this information, but a PRI page request doesn't. The IOMMU
> > >      driver has to first translate the IOVA to a GPA, injecting the fault
> > >      into the guest if this translation fails by using the usual
> > >      iommu_report_device_fault().
> 
> The IOMMU driver can walk the page tables to find out the level information.
> If the walk terminates at the 1st level, inject to the guest. Otherwise fix the 
> mm fault at 2nd level. It's not efficient compared to hardware-provided info,
> but it's doable and actual overhead needs to be measured (optimization exists
> e.g. having fault client to hint no 2nd level fault expected when registering fault
> handler in pinned case).
> 
> > >
> > > (2) Translating the faulting GPA to a HVA that can be fed to
> > >      handle_mm_fault(). That requires help from KVM, so another interface -
> > >      either KVM registering GPA->HVA translation tables or IOMMU driver
> > >      querying each translation. Either way it should be reusable by device
> > >      drivers that implement IOPF themselves.
> 
> Or just leave to the fault client (say VFIO here) to figure it out. VFIO has the
> information about GPA->HPA and can then call handle_mm_fault to fix the
> received fault. The IOMMU driver just exports an interface for the device drivers 
> which implement IOPF themselves to report a fault which is then handled by
> the IOMMU core by reusing the same faulting path.
> 
> > >
> > > (1) could be enabled with iommu_dev_enable_feature(). (2) requires a more
> > > complex interface. (2) alone might also be desirable - demand-paging for
> > > level 2 only, no SVA for level 1.
> 
> Yes, this is what we want to point out. A general FEAT_IOPF implies more than
> what this patch intended to address.
> 
> > >
> > > Anyway, back to this patch. What I'm trying to convey is "can the IOMMU
> > > receive incoming I/O page faults for this device and, when SVA is enabled,
> > > feed them to the mm subsystem?  Enable that or return an error." I'm stuck
> > > on the name. IOPF alone is too vague. Not IOPF_L1 as Kevin noted, since L1
> > > is also used in virtualization. IOPF_BIND and IOPF_SVA could also mean (2)
> > > above. IOMMU_DEV_FEAT_IOPF_FLAT?
> > >
> > > That leaves space for the nested extensions. (1) above could be
> > > IOMMU_FEAT_IOPF_NESTED, and (2) requires some new interfacing with KVM (or
> > > just an external fault handler) and could be used with either IOPF_FLAT or
> > > IOPF_NESTED. We can figure out the details later. What do you think?
> > 
> > I agree that we can define IOPF_ for current usage and leave space for
> > future extensions.
> > 
> > IOPF_FLAT represents IOPF on first-level translation, currently first
> > level translation could be used in below cases.
> > 
> > 1) FL w/ internal Page Table: Kernel IOVA;
> > 2) FL w/ external Page Table: VFIO passthrough;
> > 3) FL w/ shared CPU page table: SVA
> > 
> > We don't need to support IOPF for case 1). Let's put it aside.
> > 
> > IOPF handling of 2) and 3) are different. Do we need to define different
> > names to distinguish these two cases?
> > 
> 
> Defining feature names according to various use cases does not sound like a
> clean way to go. Ideally we would have just a general FEAT_IOPF, since the
> hardware (at least VT-d) does support faults in 1st-level, 2nd-level or
> nested configurations. We are only running into this trouble because it is
> difficult for the software to evolve to enable the full hardware capability
> in one batch. My last proposal was essentially to keep FEAT_IOPF as a general
> capability for whether faults are delivered through the IOMMU or the ad-hoc
> device, and to have a separate interface for whether IOPF reporting is
> available under a specific configuration. The former is about the path
> between the IOMMU and the device, while the latter is about the interface
> between the IOMMU driver and its faulting client.
> 
> The reporting capability can be checked when the fault client is registering 
> its fault handler, and at this time the IOMMU driver knows how the related 
> mapping is configured (1st, 2nd, or nested) and whether fault reporting is 
> supported in such a configuration. We may introduce IOPF_REPORT_FLAT and
> IOPF_REPORT_NESTED respectively. While IOPF_REPORT_FLAT detection is
> straightforward (2 and 3 can be differentiated internally based on the
> configured level), IOPF_REPORT_NESTED needs additional info to indicate which
> level is concerned, since the vendor driver may not support fault reporting
> at both levels or the fault client may be interested in only one level (e.g.
> with the 2nd level pinned).

I agree with this plan (provided I understood it correctly this time):
have IOMMU_DEV_FEAT_IOPF describing the IOPF interface between device and
IOMMU. Enabling it on its own doesn't do anything visible to the driver,
it just probes for capabilities and enables PRI if necessary. For host
SVA, since there is no additional communication between IOMMU and device
driver, enabling IOMMU_DEV_FEAT_SVA in addition to IOPF is sufficient.
Then when implementing nested we'll extend iommu_register_fault_handler()
with flags and parameters. That will also enable advanced dispatching (1).

Will it be necessary to enable FEAT_IOPF when doing VFIO passthrough
(injecting to the guest or handling it with external page tables)?
I think that would be better. Currently a device driver registering a
fault handler doesn't know if it will get recoverable page faults or only
unrecoverable ones.

So I don't think this patch needs any change. Baolu, are you ok with
keeping this and patch 4?

Thanks,
Jean
Baolu Lu Jan. 20, 2021, 1:57 a.m. UTC | #9
Hi Jean,

On 1/19/21 6:16 PM, Jean-Philippe Brucker wrote:
> On Mon, Jan 18, 2021 at 06:54:28AM +0000, Tian, Kevin wrote:
>>> From: Lu Baolu <baolu.lu@linux.intel.com>
>>> Sent: Saturday, January 16, 2021 11:54 AM
>>>
>>> Hi Jean,
>>>
>>> On 2021/1/15 0:41, Jean-Philippe Brucker wrote:
>>>> I guess detailing what's needed for nested IOPF can help the discussion,
>>>> although I haven't seen any concrete plan about implementing it, and it
>>>> still seems a couple of years away. There are two important steps with
>>>> nested IOPF:
>>>>
>>>> (1) Figuring out whether a fault comes from L1 or L2. A SMMU stall event
>>>>       comes with this information, but a PRI page request doesn't. The IOMMU
>>>>       driver has to first translate the IOVA to a GPA, injecting the fault
>>>>       into the guest if this translation fails by using the usual
>>>>       iommu_report_device_fault().
>>
>> The IOMMU driver can walk the page tables to find out the level information.
>> If the walk terminates at the 1st level, inject to the guest. Otherwise fix the
>> mm fault at the 2nd level. It's not efficient compared to hardware-provided
>> info, but it's doable, and the actual overhead needs to be measured
>> (optimizations exist, e.g. having the fault client hint that no 2nd-level
>> fault is expected when registering the fault handler in the pinned case).
>>
>>>>
>>>> (2) Translating the faulting GPA to a HVA that can be fed to
>>>>       handle_mm_fault(). That requires help from KVM, so another interface -
>>>>       either KVM registering GPA->HVA translation tables or IOMMU driver
>>>>       querying each translation. Either way it should be reusable by device
>>>>       drivers that implement IOPF themselves.
>>
>> Or just leave it to the fault client (say VFIO here) to figure it out. VFIO
>> has the GPA->HPA information and can then call handle_mm_fault() to fix the
>> received fault. The IOMMU driver just exports an interface for the device drivers
>> which implement IOPF themselves to report a fault, which the IOMMU core then
>> handles by reusing the same faulting path.
>>
>>>>
>>>> (1) could be enabled with iommu_dev_enable_feature(). (2) requires a more
>>>> complex interface. (2) alone might also be desirable - demand-paging for
>>>> level 2 only, no SVA for level 1.
>>
>> Yes, this is what we want to point out. A general FEAT_IOPF implies more than
>> what this patch intended to address.
>>
>>>>
>>>> Anyway, back to this patch. What I'm trying to convey is "can the IOMMU
>>>> receive incoming I/O page faults for this device and, when SVA is enabled,
>>>> feed them to the mm subsystem?  Enable that or return an error." I'm stuck
>>>> on the name. IOPF alone is too vague. Not IOPF_L1 as Kevin noted, since L1
>>>> is also used in virtualization. IOPF_BIND and IOPF_SVA could also mean (2)
>>>> above. IOMMU_DEV_FEAT_IOPF_FLAT?
>>>>
>>>> That leaves space for the nested extensions. (1) above could be
>>>> IOMMU_FEAT_IOPF_NESTED, and (2) requires some new interfacing with KVM (or
>>>> just an external fault handler) and could be used with either IOPF_FLAT or
>>>> IOPF_NESTED. We can figure out the details later. What do you think?
>>>
>>> I agree that we can define IOPF_ for current usage and leave space for
>>> future extensions.
>>>
>>> IOPF_FLAT represents IOPF on first-level translation. Currently,
>>> first-level translation can be used in the cases below.
>>>
>>> 1) FL w/ internal Page Table: Kernel IOVA;
>>> 2) FL w/ external Page Table: VFIO passthrough;
>>> 3) FL w/ shared CPU page table: SVA
>>>
>>> We don't need to support IOPF for case 1). Let's put it aside.
>>>
>>> The IOPF handling for 2) and 3) differs. Do we need to define different
>>> names to distinguish these two cases?
>>>
>>
>> Defining feature names according to various use cases does not sound like a
>> clean way to go. Ideally we would have just a general FEAT_IOPF, since the
>> hardware (at least VT-d) does support faults in 1st-level, 2nd-level or
>> nested configurations. We are only running into this trouble because it is
>> difficult for the software to evolve to enable the full hardware capability
>> in one batch. My last proposal was essentially to keep FEAT_IOPF as a general
>> capability for whether faults are delivered through the IOMMU or the ad-hoc
>> device, and to have a separate interface for whether IOPF reporting is
>> available under a specific configuration. The former is about the path
>> between the IOMMU and the device, while the latter is about the interface
>> between the IOMMU driver and its faulting client.
>>
>> The reporting capability can be checked when the fault client is registering
>> its fault handler, and at this time the IOMMU driver knows how the related
>> mapping is configured (1st, 2nd, or nested) and whether fault reporting is
>> supported in such a configuration. We may introduce IOPF_REPORT_FLAT and
>> IOPF_REPORT_NESTED respectively. While IOPF_REPORT_FLAT detection is
>> straightforward (2 and 3 can be differentiated internally based on the
>> configured level), IOPF_REPORT_NESTED needs additional info to indicate which
>> level is concerned, since the vendor driver may not support fault reporting
>> at both levels or the fault client may be interested in only one level (e.g.
>> with the 2nd level pinned).
> 
> I agree with this plan (provided I understood it correctly this time):
> have IOMMU_DEV_FEAT_IOPF describing the IOPF interface between device and
> IOMMU. Enabling it on its own doesn't do anything visible to the driver,
> it just probes for capabilities and enables PRI if necessary. For host
> SVA, since there is no additional communication between IOMMU and device
> driver, enabling IOMMU_DEV_FEAT_SVA in addition to IOPF is sufficient.
> Then when implementing nested we'll extend iommu_register_fault_handler()
> with flags and parameters. That will also enable advanced dispatching (1).
> 
> Will it be necessary to enable FEAT_IOPF when doing VFIO passthrough
> (injecting to the guest or handling it with external page tables)?
> I think that would be better. Currently a device driver registering a
> fault handler doesn't know if it will get recoverable page faults or only
> unrecoverable ones.
> 
> So I don't think this patch needs any change. Baolu, are you ok with
> keeping this and patch 4?

It sounds good to me. Keep FEAT_IOPF as the IOMMU capability of
generating I/O page faults, and distinguish the different kinds of I/O
page faults by extending the fault handler registration interface.
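As a rough illustration of that extension, here is a self-contained toy model.
The IOPF_REPORT_* flag names follow the proposal in this thread; the toy_*
names and the exact signature are invented for illustration and are not an
existing kernel API:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical reporting-mode flags, following the names discussed above */
#define IOPF_REPORT_FLAT	(1u << 0)	/* faults on 1st-level (SVA, passthrough) */
#define IOPF_REPORT_NESTED_L1	(1u << 1)	/* nested: guest-owned level */
#define IOPF_REPORT_NESTED_L2	(1u << 2)	/* nested: host-owned level */

/* Toy per-device IOMMU state, standing in for real driver structures */
struct toy_iommu_dev {
	unsigned int supported_reports;	/* what the current config can deliver */
	void (*handler)(void *data);
	unsigned int flags;
};

/*
 * Sketch of an extended registration call: reject the request if the
 * caller asks for a reporting mode the configuration does not support,
 * so the fault client learns at registration time what it will get.
 */
static int toy_register_fault_handler(struct toy_iommu_dev *dev,
				      void (*handler)(void *data),
				      unsigned int flags)
{
	if (flags & ~dev->supported_reports)
		return -EOPNOTSUPP;
	dev->handler = handler;
	dev->flags = flags;
	return 0;
}
```

The point of the sketch is only the shape of the check: the vendor driver
advertises which reporting modes the configured translation supports, and
registration fails for anything else.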

> 
> Thanks,
> Jean
> 

Best regards,
baolu
Patch

diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 583c734b2e87..701b2eeb0dc5 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -156,10 +156,24 @@  struct iommu_resv_region {
 	enum iommu_resv_type	type;
 };
 
-/* Per device IOMMU features */
+/**
+ * enum iommu_dev_features - Per device IOMMU features
+ * @IOMMU_DEV_FEAT_AUX: Auxiliary domain feature
+ * @IOMMU_DEV_FEAT_SVA: Shared Virtual Addresses
+ * @IOMMU_DEV_FEAT_IOPF: I/O Page Faults such as PRI or Stall. Generally using
+ *			 %IOMMU_DEV_FEAT_SVA requires %IOMMU_DEV_FEAT_IOPF, but
+ *			 some devices manage I/O Page Faults themselves instead
+ *			 of relying on the IOMMU. When supported, this feature
+ *			 must be enabled before and disabled after
+ *			 %IOMMU_DEV_FEAT_SVA.
+ *
+ * Device drivers query whether a feature is supported using
+ * iommu_dev_has_feature(), and enable it using iommu_dev_enable_feature().
+ */
 enum iommu_dev_features {
-	IOMMU_DEV_FEAT_AUX,	/* Aux-domain feature */
-	IOMMU_DEV_FEAT_SVA,	/* Shared Virtual Addresses */
+	IOMMU_DEV_FEAT_AUX,
+	IOMMU_DEV_FEAT_SVA,
+	IOMMU_DEV_FEAT_IOPF,
 };
 
 #define IOMMU_PASID_INVALID	(-1U)
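For illustration, a driver relying on IOMMU-managed IOPF would follow the
ordering documented in the kerneldoc above: enable IOPF before SVA, disable
in the reverse order. The sketch below is self-contained rather than real
kernel code: the two iommu_dev_*_feature() calls are local stubs standing in
for the <linux/iommu.h> API, and struct toy_device and the drv_* helpers are
invented names.

```c
#include <assert.h>

/* Mirror of the enum added by this patch */
enum iommu_dev_features {
	IOMMU_DEV_FEAT_AUX,
	IOMMU_DEV_FEAT_SVA,
	IOMMU_DEV_FEAT_IOPF,
};

struct toy_device { unsigned int features; };

/* Stubs standing in for the real iommu_dev_{enable,disable}_feature() */
static int iommu_dev_enable_feature(struct toy_device *dev,
				    enum iommu_dev_features f)
{
	dev->features |= 1u << f;
	return 0;
}

static int iommu_dev_disable_feature(struct toy_device *dev,
				     enum iommu_dev_features f)
{
	dev->features &= ~(1u << f);
	return 0;
}

/* Enable IOPF first; only then enable SVA, rolling back IOPF on failure */
static int drv_enable_sva(struct toy_device *dev)
{
	int ret = iommu_dev_enable_feature(dev, IOMMU_DEV_FEAT_IOPF);

	if (ret)
		return ret;

	ret = iommu_dev_enable_feature(dev, IOMMU_DEV_FEAT_SVA);
	if (ret)
		iommu_dev_disable_feature(dev, IOMMU_DEV_FEAT_IOPF);
	return ret;
}

/* Disable in the reverse order: SVA first, then IOPF */
static void drv_disable_sva(struct toy_device *dev)
{
	iommu_dev_disable_feature(dev, IOMMU_DEV_FEAT_SVA);
	iommu_dev_disable_feature(dev, IOMMU_DEV_FEAT_IOPF);
}
```

A device that handles I/O page faults itself would skip the
IOMMU_DEV_FEAT_IOPF steps and enable only IOMMU_DEV_FEAT_SVA, which is the
case this patch carves out.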