
[RFCv2,04/24] iommu: Add iommu_domain ops for dirty tracking

Message ID 20230518204650.14541-5-joao.m.martins@oracle.com (mailing list archive)
State New, archived
Series IOMMUFD Dirty Tracking

Commit Message

Joao Martins May 18, 2023, 8:46 p.m. UTC
Add to the iommu domain operations a set of callbacks to perform dirty
tracking, particularly to start and stop tracking and finally to read and
clear the dirty data.

Drivers are generally expected to dynamically change their translation
structures to toggle the tracking, and to flush some form of control state
structure that stands in the IOVA translation path. Though this is not
mandatory: drivers may instead enable dirty tracking at boot and just flush
the IO pagetables when dirty tracking is set.  For each of the newly added
IOMMU core APIs:

.supported_flags[IOMMU_DOMAIN_F_ENFORCE_DIRTY]: Introduce a set of flags
that enforce certain restrictions in the iommu_domain object. For dirty
tracking this means that when IOMMU_DOMAIN_F_ENFORCE_DIRTY is set via its
helper iommu_domain_set_flags(...), attaching via attach_dev will fail for
devices that do *not* support dirty tracking. IOMMU drivers that support
dirty tracking should advertise this flag, while enforcing that dirty
tracking is supported by the device in their .attach_dev iommu op.

iommu_cap::IOMMU_CAP_DIRTY: new device iommu_capable value when probing for
capabilities of the device.

.set_dirty_tracking(): an iommu driver is expected to change its
translation structures and enable dirty tracking for the devices in the
iommu_domain. Drivers for which dirty tracking is always enabled should
just return 0.

.read_and_clear_dirty(): an iommu driver is expected to walk the iova range
passed in and use iommu_dirty_bitmap_record() to record dirty info per
IOVA. When detecting a given IOVA is dirty it should also clear its dirty
state from the PTE, *unless* the flag IOMMU_DIRTY_NO_CLEAR is passed in --
flushing is steered from the caller of the domain_op via iotlb_gather. The
iommu core APIs use the same data structure in use for dirty tracking for
VFIO device dirty (struct iova_bitmap) abstracted by
iommu_dirty_bitmap_record() helper function.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 drivers/iommu/iommu.c      | 11 +++++++
 include/linux/io-pgtable.h |  4 +++
 include/linux/iommu.h      | 67 ++++++++++++++++++++++++++++++++++++++
 3 files changed, 82 insertions(+)

Comments

Baolu Lu May 19, 2023, 8:42 a.m. UTC | #1
On 2023/5/19 4:46, Joao Martins wrote:
> Add to iommu domain operations a set of callbacks to perform dirty
> tracking, particulary to start and stop tracking and finally to read and
> clear the dirty data.
> 
> Drivers are generally expected to dynamically change its translation
> structures to toggle the tracking and flush some form of control state
> structure that stands in the IOVA translation path. Though it's not
> mandatory, as drivers will be enable dirty tracking at boot, and just flush
> the IO pagetables when setting dirty tracking.  For each of the newly added
> IOMMU core APIs:
> 
> .supported_flags[IOMMU_DOMAIN_F_ENFORCE_DIRTY]: Introduce a set of flags
> that enforce certain restrictions in the iommu_domain object. For dirty
> tracking this means that when IOMMU_DOMAIN_F_ENFORCE_DIRTY is set via its
> helper iommu_domain_set_flags(...) devices attached via attach_dev will
> fail on devices that do *not* have dirty tracking supported. IOMMU drivers
> that support dirty tracking should advertise this flag, while enforcing
> that dirty tracking is supported by the device in its .attach_dev iommu op.
> 
> iommu_cap::IOMMU_CAP_DIRTY: new device iommu_capable value when probing for
> capabilities of the device.
> 
> .set_dirty_tracking(): an iommu driver is expected to change its
> translation structures and enable dirty tracking for the devices in the
> iommu_domain. For drivers making dirty tracking always-enabled, it should
> just return 0.
> 
> .read_and_clear_dirty(): an iommu driver is expected to walk the iova range
> passed in and use iommu_dirty_bitmap_record() to record dirty info per
> IOVA. When detecting a given IOVA is dirty it should also clear its dirty
> state from the PTE, *unless* the flag IOMMU_DIRTY_NO_CLEAR is passed in --
> flushing is steered from the caller of the domain_op via iotlb_gather. The
> iommu core APIs use the same data structure in use for dirty tracking for
> VFIO device dirty (struct iova_bitmap) abstracted by
> iommu_dirty_bitmap_record() helper function.
> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>   drivers/iommu/iommu.c      | 11 +++++++
>   include/linux/io-pgtable.h |  4 +++
>   include/linux/iommu.h      | 67 ++++++++++++++++++++++++++++++++++++++
>   3 files changed, 82 insertions(+)
> 
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index 2088caae5074..95acc543e8fb 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -2013,6 +2013,17 @@ struct iommu_domain *iommu_domain_alloc(const struct bus_type *bus)
>   }
>   EXPORT_SYMBOL_GPL(iommu_domain_alloc);
>   
> +int iommu_domain_set_flags(struct iommu_domain *domain,
> +			   const struct bus_type *bus, unsigned long val)
> +{
> +	if (!(val & bus->iommu_ops->supported_flags))
> +		return -EINVAL;
> +
> +	domain->flags |= val;
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(iommu_domain_set_flags);

This seems to be a return to an old question. IOMMU domains are
allocated through buses, but can a domain be attached to devices on
different buses that happen to have different IOMMU ops? In reality, we
may not have such heterogeneous configurations yet, but it is better to
avoid such confusion as much as possible.

How about adding a domain op like .enforce_dirty_page_tracking? The
individual iommu driver which implements this callback would iterate over
all devices to which the domain has been attached and return success only
if all attached devices support dirty page tracking.

Then, in the domain's attach_dev or set_dev_pasid callbacks, if dirty page
tracking has been enforced on the domain while the device to be attached
doesn't support it, -EINVAL will be returned, which the caller can
interpret as the domain being incompatible.

Best regards,
baolu
Joao Martins May 19, 2023, 9:28 a.m. UTC | #2
On 19/05/2023 09:42, Baolu Lu wrote:
> On 2023/5/19 4:46, Joao Martins wrote:
>> Add to iommu domain operations a set of callbacks to perform dirty
>> tracking, particulary to start and stop tracking and finally to read and
>> clear the dirty data.
>>
>> Drivers are generally expected to dynamically change its translation
>> structures to toggle the tracking and flush some form of control state
>> structure that stands in the IOVA translation path. Though it's not
>> mandatory, as drivers will be enable dirty tracking at boot, and just flush
>> the IO pagetables when setting dirty tracking.  For each of the newly added
>> IOMMU core APIs:
>>
>> .supported_flags[IOMMU_DOMAIN_F_ENFORCE_DIRTY]: Introduce a set of flags
>> that enforce certain restrictions in the iommu_domain object. For dirty
>> tracking this means that when IOMMU_DOMAIN_F_ENFORCE_DIRTY is set via its
>> helper iommu_domain_set_flags(...) devices attached via attach_dev will
>> fail on devices that do *not* have dirty tracking supported. IOMMU drivers
>> that support dirty tracking should advertise this flag, while enforcing
>> that dirty tracking is supported by the device in its .attach_dev iommu op.
>>
>> iommu_cap::IOMMU_CAP_DIRTY: new device iommu_capable value when probing for
>> capabilities of the device.
>>
>> .set_dirty_tracking(): an iommu driver is expected to change its
>> translation structures and enable dirty tracking for the devices in the
>> iommu_domain. For drivers making dirty tracking always-enabled, it should
>> just return 0.
>>
>> .read_and_clear_dirty(): an iommu driver is expected to walk the iova range
>> passed in and use iommu_dirty_bitmap_record() to record dirty info per
>> IOVA. When detecting a given IOVA is dirty it should also clear its dirty
>> state from the PTE, *unless* the flag IOMMU_DIRTY_NO_CLEAR is passed in --
>> flushing is steered from the caller of the domain_op via iotlb_gather. The
>> iommu core APIs use the same data structure in use for dirty tracking for
>> VFIO device dirty (struct iova_bitmap) abstracted by
>> iommu_dirty_bitmap_record() helper function.
>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>   drivers/iommu/iommu.c      | 11 +++++++
>>   include/linux/io-pgtable.h |  4 +++
>>   include/linux/iommu.h      | 67 ++++++++++++++++++++++++++++++++++++++
>>   3 files changed, 82 insertions(+)
>>
>> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
>> index 2088caae5074..95acc543e8fb 100644
>> --- a/drivers/iommu/iommu.c
>> +++ b/drivers/iommu/iommu.c
>> @@ -2013,6 +2013,17 @@ struct iommu_domain *iommu_domain_alloc(const struct
>> bus_type *bus)
>>   }
>>   EXPORT_SYMBOL_GPL(iommu_domain_alloc);
>>   +int iommu_domain_set_flags(struct iommu_domain *domain,
>> +               const struct bus_type *bus, unsigned long val)
>> +{
>> +    if (!(val & bus->iommu_ops->supported_flags))
>> +        return -EINVAL;
>> +
>> +    domain->flags |= val;
>> +    return 0;
>> +}
>> +EXPORT_SYMBOL_GPL(iommu_domain_set_flags);
> 
> This seems to be a return to an old question. IOMMU domains are
> allocated through buses, but can a domain be attached to devices on
> different buses that happen to have different IOMMU ops? In reality, we
> may not have such heterogeneous configurations yet, but it is better to
> avoid such confusion as much as possible.
> 
> How about adding a domain op like .enforce_dirty_page_tracking. The
> individual iommu driver which implements this callback will iterate all
> devices that the domain has been attached and return success only if all
> attached devices support dirty page tracking.
> 

Hmm, but isn't the point to actually prevent this from happening? Meaning,
to ensure that only devices that support dirty tracking are attached to the
domain? The flag is meant to advertise that the individual domain knows
about dirty tracking enforcement, and it will validate it at .attach_dev
when asked.

> Then, in the domain's attach_dev or set_dev_pasid callbacks, if the

I certainly didn't handle the ::set_dev_pasid callback; I might have to fix
that in the next iteration.

> domain has been enforced dirty page tracking while the device to be
> attached doesn't support it, -EINVAL will be returned which could be
> intercepted by the caller as domain is incompatible.

This part is already done; I am just a little stuck on a
::enforce_dirty_tracking domain op being called /after/ devices are already
present in the domain. Note that today this is done right after we create
the hwpt (i.e. the iommu_domain), without devices being attached to it yet.
I had a separate version where I created the domain object with (bus, flags)
as an alternate incantation of this. Among the alternatives I implemented, I
ultimately picked this one as it could let other similar things sit on it
(e.g. enforce_cache_coherency?).

I can switch to enforce_dirty_tracking instead of a flag, but I think it
looks more correct to ensure the property remains immutable at domain
creation rather than set after device attachment, unless I am missing
something here.
Jason Gunthorpe May 19, 2023, 11:40 a.m. UTC | #3
On Thu, May 18, 2023 at 09:46:30PM +0100, Joao Martins wrote:
> Add to iommu domain operations a set of callbacks to perform dirty
> tracking, particulary to start and stop tracking and finally to read and
> clear the dirty data.
> 
> Drivers are generally expected to dynamically change its translation
> structures to toggle the tracking and flush some form of control state
> structure that stands in the IOVA translation path. Though it's not
> mandatory, as drivers will be enable dirty tracking at boot, and just flush
> the IO pagetables when setting dirty tracking.  For each of the newly added
> IOMMU core APIs:
> 
> .supported_flags[IOMMU_DOMAIN_F_ENFORCE_DIRTY]: Introduce a set of flags
> that enforce certain restrictions in the iommu_domain object. For dirty
> tracking this means that when IOMMU_DOMAIN_F_ENFORCE_DIRTY is set via its
> helper iommu_domain_set_flags(...) devices attached via attach_dev will
> fail on devices that do *not* have dirty tracking supported. IOMMU drivers
> that support dirty tracking should advertise this flag, while enforcing
> that dirty tracking is supported by the device in its .attach_dev iommu op.
> 
> iommu_cap::IOMMU_CAP_DIRTY: new device iommu_capable value when probing for
> capabilities of the device.
> 
> .set_dirty_tracking(): an iommu driver is expected to change its
> translation structures and enable dirty tracking for the devices in the
> iommu_domain. For drivers making dirty tracking always-enabled, it should
> just return 0.
> 
> .read_and_clear_dirty(): an iommu driver is expected to walk the iova range
> passed in and use iommu_dirty_bitmap_record() to record dirty info per
> IOVA. When detecting a given IOVA is dirty it should also clear its dirty
> state from the PTE, *unless* the flag IOMMU_DIRTY_NO_CLEAR is passed in --
> flushing is steered from the caller of the domain_op via iotlb_gather. The
> iommu core APIs use the same data structure in use for dirty tracking for
> VFIO device dirty (struct iova_bitmap) abstracted by
> iommu_dirty_bitmap_record() helper function.
> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  drivers/iommu/iommu.c      | 11 +++++++
>  include/linux/io-pgtable.h |  4 +++
>  include/linux/iommu.h      | 67 ++++++++++++++++++++++++++++++++++++++
>  3 files changed, 82 insertions(+)
> 
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index 2088caae5074..95acc543e8fb 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -2013,6 +2013,17 @@ struct iommu_domain *iommu_domain_alloc(const struct bus_type *bus)
>  }
>  EXPORT_SYMBOL_GPL(iommu_domain_alloc);
>  
> +int iommu_domain_set_flags(struct iommu_domain *domain,
> +			   const struct bus_type *bus, unsigned long val)
> +{

Definitely no bus argument.

The supported_flags should be in the domain op not the bus op.

But I think this is sort of the wrong direction, the dirty tracking
mode should be requested when the domain is created, not changed after
the fact.

Jason
Joao Martins May 19, 2023, 11:47 a.m. UTC | #4
On 19/05/2023 12:40, Jason Gunthorpe wrote:
> On Thu, May 18, 2023 at 09:46:30PM +0100, Joao Martins wrote:
>> Add to iommu domain operations a set of callbacks to perform dirty
>> tracking, particulary to start and stop tracking and finally to read and
>> clear the dirty data.
>>
>> Drivers are generally expected to dynamically change its translation
>> structures to toggle the tracking and flush some form of control state
>> structure that stands in the IOVA translation path. Though it's not
>> mandatory, as drivers will be enable dirty tracking at boot, and just flush
>> the IO pagetables when setting dirty tracking.  For each of the newly added
>> IOMMU core APIs:
>>
>> .supported_flags[IOMMU_DOMAIN_F_ENFORCE_DIRTY]: Introduce a set of flags
>> that enforce certain restrictions in the iommu_domain object. For dirty
>> tracking this means that when IOMMU_DOMAIN_F_ENFORCE_DIRTY is set via its
>> helper iommu_domain_set_flags(...) devices attached via attach_dev will
>> fail on devices that do *not* have dirty tracking supported. IOMMU drivers
>> that support dirty tracking should advertise this flag, while enforcing
>> that dirty tracking is supported by the device in its .attach_dev iommu op.
>>
>> iommu_cap::IOMMU_CAP_DIRTY: new device iommu_capable value when probing for
>> capabilities of the device.
>>
>> .set_dirty_tracking(): an iommu driver is expected to change its
>> translation structures and enable dirty tracking for the devices in the
>> iommu_domain. For drivers making dirty tracking always-enabled, it should
>> just return 0.
>>
>> .read_and_clear_dirty(): an iommu driver is expected to walk the iova range
>> passed in and use iommu_dirty_bitmap_record() to record dirty info per
>> IOVA. When detecting a given IOVA is dirty it should also clear its dirty
>> state from the PTE, *unless* the flag IOMMU_DIRTY_NO_CLEAR is passed in --
>> flushing is steered from the caller of the domain_op via iotlb_gather. The
>> iommu core APIs use the same data structure in use for dirty tracking for
>> VFIO device dirty (struct iova_bitmap) abstracted by
>> iommu_dirty_bitmap_record() helper function.
>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>  drivers/iommu/iommu.c      | 11 +++++++
>>  include/linux/io-pgtable.h |  4 +++
>>  include/linux/iommu.h      | 67 ++++++++++++++++++++++++++++++++++++++
>>  3 files changed, 82 insertions(+)
>>
>> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
>> index 2088caae5074..95acc543e8fb 100644
>> --- a/drivers/iommu/iommu.c
>> +++ b/drivers/iommu/iommu.c
>> @@ -2013,6 +2013,17 @@ struct iommu_domain *iommu_domain_alloc(const struct bus_type *bus)
>>  }
>>  EXPORT_SYMBOL_GPL(iommu_domain_alloc);
>>  
>> +int iommu_domain_set_flags(struct iommu_domain *domain,
>> +			   const struct bus_type *bus, unsigned long val)
>> +{
> 
> Definately no bus argument.
> 
> The supported_flags should be in the domain op not the bus op.
> 
> But I think this is sort of the wrong direction, the dirty tracking
> mode should be requested when the domain is created, not changed after
> the fact.

In practice it is done soon after the domain is created, but I understand
what you mean that both should happen together; I had this implemented like
that in my first take, as flags passed to domain_alloc, but I was a little
undecided because we are adding another domain_alloc() op for the
user-managed pagetable, and after adding another one we would end up with 3
ways of creating an iommu domain -- but maybe that's not an issue.
Jason Gunthorpe May 19, 2023, 11:51 a.m. UTC | #5
On Fri, May 19, 2023 at 12:47:24PM +0100, Joao Martins wrote:

> In practice it is done as soon after the domain is created but I understand what
> you mean that both should be together; I have this implemented like that as my
> first take as a domain_alloc passed flags, but I was a little undecided because
> we are adding another domain_alloc() op for the user-managed pagetable and after
> having another one we would end up with 3 ways of creating iommu domain -- but
> maybe that's not an issue

It should ride on the same user domain alloc op as some generic flags;
there is no immediate use case to enable dirty tracking for non-iommufd
page tables.

Jason
Joao Martins May 19, 2023, 11:56 a.m. UTC | #6
On 19/05/2023 12:51, Jason Gunthorpe wrote:
> On Fri, May 19, 2023 at 12:47:24PM +0100, Joao Martins wrote:
> 
>> In practice it is done as soon after the domain is created but I understand what
>> you mean that both should be together; I have this implemented like that as my
>> first take as a domain_alloc passed flags, but I was a little undecided because
>> we are adding another domain_alloc() op for the user-managed pagetable and after
>> having another one we would end up with 3 ways of creating iommu domain -- but
>> maybe that's not an issue
> 
> It should ride on the same user domain alloc op as some generic flags,

OK, I suppose that makes sense, especially with this being tied into
HWPT_ALLOC, which is where all this new user domain alloc is done.

> there is no immediate use case to enable dirty tracking for
> non-iommufd page tables

True
Baolu Lu May 19, 2023, 12:13 p.m. UTC | #7
On 2023/5/19 19:51, Jason Gunthorpe wrote:
> On Fri, May 19, 2023 at 12:47:24PM +0100, Joao Martins wrote:
> 
>> In practice it is done as soon after the domain is created but I understand what
>> you mean that both should be together; I have this implemented like that as my
>> first take as a domain_alloc passed flags, but I was a little undecided because
>> we are adding another domain_alloc() op for the user-managed pagetable and after
>> having another one we would end up with 3 ways of creating iommu domain -- but
>> maybe that's not an issue
> It should ride on the same user domain alloc op as some generic flags,
> there is no immediate use case to enable dirty tracking for
> non-iommufd page tables

This is better than the current solution.

Best regards,
baolu
Robin Murphy May 19, 2023, 1:22 p.m. UTC | #8
On 2023-05-18 21:46, Joao Martins wrote:
> Add to iommu domain operations a set of callbacks to perform dirty
> tracking, particulary to start and stop tracking and finally to read and
> clear the dirty data.
> 
> Drivers are generally expected to dynamically change its translation
> structures to toggle the tracking and flush some form of control state
> structure that stands in the IOVA translation path. Though it's not
> mandatory, as drivers will be enable dirty tracking at boot, and just flush
> the IO pagetables when setting dirty tracking.  For each of the newly added
> IOMMU core APIs:
> 
> .supported_flags[IOMMU_DOMAIN_F_ENFORCE_DIRTY]: Introduce a set of flags
> that enforce certain restrictions in the iommu_domain object. For dirty
> tracking this means that when IOMMU_DOMAIN_F_ENFORCE_DIRTY is set via its
> helper iommu_domain_set_flags(...) devices attached via attach_dev will
> fail on devices that do *not* have dirty tracking supported. IOMMU drivers
> that support dirty tracking should advertise this flag, while enforcing
> that dirty tracking is supported by the device in its .attach_dev iommu op.

Eww, no. For an internal thing, just call ->capable() - I mean, you're 
literally adding this feature as one of its caps...

However I'm not sure if we even need that - domains which don't support 
dirty tracking should just not expose the ops, and thus it ought to be 
inherently obvious.

I'm guessing most of the weirdness here is implicitly working around the 
enabled-from-the-start scenario on SMMUv3:

	domain = iommu_domain_alloc(bus);
	iommu_set_dirty_tracking(domain);
	// arm-smmu-v3 says OK since it doesn't know that it
	// definitely *isn't* possible, and saying no wouldn't
	// be helpful
	iommu_attach_group(group, domain);
	// oops, now we see that the relevant SMMU instance isn't one
	// which actually supports HTTU, what do we do? :(

I don't have any major objection to the general principle of flagging 
the domain to fail attach if it can't do what we promised, as a bodge 
for now, but please implement it privately in arm-smmu-v3 so it's easier 
to clean up again in future once iommu_domain_alloc() gets sorted 
out properly to get rid of this awkward blind spot.

Thanks,
Robin.

> iommu_cap::IOMMU_CAP_DIRTY: new device iommu_capable value when probing for
> capabilities of the device.
> 
> .set_dirty_tracking(): an iommu driver is expected to change its
> translation structures and enable dirty tracking for the devices in the
> iommu_domain. For drivers making dirty tracking always-enabled, it should
> just return 0.
> 
> .read_and_clear_dirty(): an iommu driver is expected to walk the iova range
> passed in and use iommu_dirty_bitmap_record() to record dirty info per
> IOVA. When detecting a given IOVA is dirty it should also clear its dirty
> state from the PTE, *unless* the flag IOMMU_DIRTY_NO_CLEAR is passed in --
> flushing is steered from the caller of the domain_op via iotlb_gather. The
> iommu core APIs use the same data structure in use for dirty tracking for
> VFIO device dirty (struct iova_bitmap) abstracted by
> iommu_dirty_bitmap_record() helper function.
> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>   drivers/iommu/iommu.c      | 11 +++++++
>   include/linux/io-pgtable.h |  4 +++
>   include/linux/iommu.h      | 67 ++++++++++++++++++++++++++++++++++++++
>   3 files changed, 82 insertions(+)
> 
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index 2088caae5074..95acc543e8fb 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -2013,6 +2013,17 @@ struct iommu_domain *iommu_domain_alloc(const struct bus_type *bus)
>   }
>   EXPORT_SYMBOL_GPL(iommu_domain_alloc);
>   
> +int iommu_domain_set_flags(struct iommu_domain *domain,
> +			   const struct bus_type *bus, unsigned long val)
> +{
> +	if (!(val & bus->iommu_ops->supported_flags))
> +		return -EINVAL;
> +
> +	domain->flags |= val;
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(iommu_domain_set_flags);
> +
>   void iommu_domain_free(struct iommu_domain *domain)
>   {
>   	if (domain->type == IOMMU_DOMAIN_SVA)
> diff --git a/include/linux/io-pgtable.h b/include/linux/io-pgtable.h
> index 1b7a44b35616..25142a0e2fc2 100644
> --- a/include/linux/io-pgtable.h
> +++ b/include/linux/io-pgtable.h
> @@ -166,6 +166,10 @@ struct io_pgtable_ops {
>   			      struct iommu_iotlb_gather *gather);
>   	phys_addr_t (*iova_to_phys)(struct io_pgtable_ops *ops,
>   				    unsigned long iova);
> +	int (*read_and_clear_dirty)(struct io_pgtable_ops *ops,
> +				    unsigned long iova, size_t size,
> +				    unsigned long flags,
> +				    struct iommu_dirty_bitmap *dirty);
>   };
>   
>   /**
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index 39d25645a5ab..992ea87f2f8e 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -13,6 +13,7 @@
>   #include <linux/errno.h>
>   #include <linux/err.h>
>   #include <linux/of.h>
> +#include <linux/iova_bitmap.h>
>   #include <uapi/linux/iommu.h>
>   
>   #define IOMMU_READ	(1 << 0)
> @@ -65,6 +66,11 @@ struct iommu_domain_geometry {
>   
>   #define __IOMMU_DOMAIN_SVA	(1U << 4)  /* Shared process address space */
>   
> +/* Domain feature flags that do not define domain types */
> +#define IOMMU_DOMAIN_F_ENFORCE_DIRTY	(1U << 6)  /* Enforce attachment of
> +						      dirty tracking supported
> +						      devices		  */
> +
>   /*
>    * This are the possible domain-types
>    *
> @@ -93,6 +99,7 @@ struct iommu_domain_geometry {
>   
>   struct iommu_domain {
>   	unsigned type;
> +	unsigned flags;
>   	const struct iommu_domain_ops *ops;
>   	unsigned long pgsize_bitmap;	/* Bitmap of page sizes in use */
>   	struct iommu_domain_geometry geometry;
> @@ -128,6 +135,7 @@ enum iommu_cap {
>   	 * this device.
>   	 */
>   	IOMMU_CAP_ENFORCE_CACHE_COHERENCY,
> +	IOMMU_CAP_DIRTY,		/* IOMMU supports dirty tracking */
>   };
>   
>   /* These are the possible reserved region types */
> @@ -220,6 +228,17 @@ struct iommu_iotlb_gather {
>   	bool			queued;
>   };
>   
> +/**
> + * struct iommu_dirty_bitmap - Dirty IOVA bitmap state
> + *
> + * @bitmap: IOVA bitmap
> + * @gather: Range information for a pending IOTLB flush
> + */
> +struct iommu_dirty_bitmap {
> +	struct iova_bitmap *bitmap;
> +	struct iommu_iotlb_gather *gather;
> +};
> +
>   /**
>    * struct iommu_ops - iommu ops and capabilities
>    * @capable: check capability
> @@ -248,6 +267,7 @@ struct iommu_iotlb_gather {
>    *                    pasid, so that any DMA transactions with this pasid
>    *                    will be blocked by the hardware.
>    * @pgsize_bitmap: bitmap of all possible supported page sizes
> + * @flags: All non domain type supported features
>    * @owner: Driver module providing these ops
>    */
>   struct iommu_ops {
> @@ -281,6 +301,7 @@ struct iommu_ops {
>   
>   	const struct iommu_domain_ops *default_domain_ops;
>   	unsigned long pgsize_bitmap;
> +	unsigned long supported_flags;
>   	struct module *owner;
>   };
>   
> @@ -316,6 +337,11 @@ struct iommu_ops {
>    * @enable_nesting: Enable nesting
>    * @set_pgtable_quirks: Set io page table quirks (IO_PGTABLE_QUIRK_*)
>    * @free: Release the domain after use.
> + * @set_dirty_tracking: Enable or Disable dirty tracking on the iommu domain
> + * @read_and_clear_dirty: Walk IOMMU page tables for dirtied PTEs marshalled
> + *                        into a bitmap, with a bit represented as a page.
> + *                        Reads the dirty PTE bits and clears it from IO
> + *                        pagetables.
>    */
>   struct iommu_domain_ops {
>   	int (*attach_dev)(struct iommu_domain *domain, struct device *dev);
> @@ -348,6 +374,12 @@ struct iommu_domain_ops {
>   				  unsigned long quirks);
>   
>   	void (*free)(struct iommu_domain *domain);
> +
> +	int (*set_dirty_tracking)(struct iommu_domain *domain, bool enabled);
> +	int (*read_and_clear_dirty)(struct iommu_domain *domain,
> +				    unsigned long iova, size_t size,
> +				    unsigned long flags,
> +				    struct iommu_dirty_bitmap *dirty);
>   };
>   
>   /**
> @@ -461,6 +493,9 @@ extern bool iommu_present(const struct bus_type *bus);
>   extern bool device_iommu_capable(struct device *dev, enum iommu_cap cap);
>   extern bool iommu_group_has_isolated_msi(struct iommu_group *group);
>   extern struct iommu_domain *iommu_domain_alloc(const struct bus_type *bus);
> +extern int iommu_domain_set_flags(struct iommu_domain *domain,
> +				  const struct bus_type *bus,
> +				  unsigned long flags);
>   extern void iommu_domain_free(struct iommu_domain *domain);
>   extern int iommu_attach_device(struct iommu_domain *domain,
>   			       struct device *dev);
> @@ -627,6 +662,28 @@ static inline bool iommu_iotlb_gather_queued(struct iommu_iotlb_gather *gather)
>   	return gather && gather->queued;
>   }
>   
> +static inline void iommu_dirty_bitmap_init(struct iommu_dirty_bitmap *dirty,
> +					   struct iova_bitmap *bitmap,
> +					   struct iommu_iotlb_gather *gather)
> +{
> +	if (gather)
> +		iommu_iotlb_gather_init(gather);
> +
> +	dirty->bitmap = bitmap;
> +	dirty->gather = gather;
> +}
> +
> +static inline void
> +iommu_dirty_bitmap_record(struct iommu_dirty_bitmap *dirty, unsigned long iova,
> +			  unsigned long length)
> +{
> +	if (dirty->bitmap)
> +		iova_bitmap_set(dirty->bitmap, iova, length);
> +
> +	if (dirty->gather)
> +		iommu_iotlb_gather_add_range(dirty->gather, iova, length);
> +}
> +
>   /* PCI device grouping function */
>   extern struct iommu_group *pci_device_group(struct device *dev);
>   /* Generic device grouping function */
> @@ -657,6 +714,9 @@ struct iommu_fwspec {
>   /* ATS is supported */
>   #define IOMMU_FWSPEC_PCI_RC_ATS			(1 << 0)
>   
> +/* Read but do not clear any dirty bits */
> +#define IOMMU_DIRTY_NO_CLEAR			(1 << 0)
> +
>   /**
>    * struct iommu_sva - handle to a device-mm bond
>    */
> @@ -755,6 +815,13 @@ static inline struct iommu_domain *iommu_domain_alloc(const struct bus_type *bus
>   	return NULL;
>   }
>   
> +static inline int iommu_domain_set_flags(struct iommu_domain *domain,
> +					 const struct bus_type *bus,
> +					 unsigned long flags)
> +{
> +	return -ENODEV;
> +}
> +
>   static inline void iommu_domain_free(struct iommu_domain *domain)
>   {
>   }
Jason Gunthorpe May 19, 2023, 1:29 p.m. UTC | #9
On Fri, May 19, 2023 at 12:56:19PM +0100, Joao Martins wrote:
> 
> 
> On 19/05/2023 12:51, Jason Gunthorpe wrote:
> > On Fri, May 19, 2023 at 12:47:24PM +0100, Joao Martins wrote:
> > 
> >> In practice it is done as soon after the domain is created but I understand what
> >> you mean that both should be together; I have this implemented like that as my
> >> first take as a domain_alloc passed flags, but I was a little undecided because
> >> we are adding another domain_alloc() op for the user-managed pagetable and after
> >> having another one we would end up with 3 ways of creating iommu domain -- but
> >> maybe that's not an issue
> > 
> > It should ride on the same user domain alloc op as some generic flags,
> 
> OK, I suppose that makes sense specially with this being tied in HWPT_ALLOC
> where all this new user domain alloc does.

Yes, it should be easy.

Then do what Robin said and make the domain ops NULL if the user
didn't ask for dirty tracking, and then attach can fail if there are
domain incompatibilities.

Since alloc_user (or whatever it settles into) will have the struct
device * argument, this should be easy enough without getting mixed up
with the struct bus cleanup.

Jason
Joao Martins May 19, 2023, 1:43 p.m. UTC | #10
On 19/05/2023 14:22, Robin Murphy wrote:
> On 2023-05-18 21:46, Joao Martins wrote:
>> Add to iommu domain operations a set of callbacks to perform dirty
>> tracking, particulary to start and stop tracking and finally to read and
>> clear the dirty data.
>>
>> Drivers are generally expected to dynamically change its translation
>> structures to toggle the tracking and flush some form of control state
>> structure that stands in the IOVA translation path. Though it's not
>> mandatory, as drivers will be enable dirty tracking at boot, and just flush
>> the IO pagetables when setting dirty tracking.  For each of the newly added
>> IOMMU core APIs:
>>
>> .supported_flags[IOMMU_DOMAIN_F_ENFORCE_DIRTY]: Introduce a set of flags
>> that enforce certain restrictions in the iommu_domain object. For dirty
>> tracking this means that when IOMMU_DOMAIN_F_ENFORCE_DIRTY is set via its
>> helper iommu_domain_set_flags(...) devices attached via attach_dev will
>> fail on devices that do *not* have dirty tracking supported. IOMMU drivers
>> that support dirty tracking should advertise this flag, while enforcing
>> that dirty tracking is supported by the device in its .attach_dev iommu op.
> 
> Eww, no. For an internal thing, just call ->capable() - I mean, you're literally
> adding this feature as one of its caps...
> 
> However I'm not sure if we even need that - domains which don't support dirty
> tracking should just not expose the ops, and thus it ought to be inherently
> obvious.
> 
OK.

> I'm guessing most of the weirdness here is implicitly working around the
> enabled-from-the-start scenario on SMMUv3:
> 
It has nothing to do with SMMUv3. This is to future-proof the case where the
IOMMU capabilities are not homogeneous, even though that isn't the case today.

The only thing that kinda comes for free for SMMUv3 in this series is clearing
dirty bits before setting dirty tracking, because [it is always enabled]. But
that is needed regardless of SMMUv3.

>     domain = iommu_domain_alloc(bus);
>     iommu_set_dirty_tracking(domain);
>     // arm-smmu-v3 says OK since it doesn't know that it
>     // definitely *isn't* possible, and saying no wouldn't
>     // be helpful
>     iommu_attach_group(group, domain);
>     // oops, now we see that the relevant SMMU instance isn't one
>     // which actually supports HTTU, what do we do? :(
> 
> I don't have any major objection to the general principle of flagging the domain
> to fail attach if it can't do what we promised 

This is the reason why I had the flag (or now a domain_alloc flag)...

> , as a bodge for now, but please
> implement it privately in arm-smmu-v3 so it's easier to clean up again in future
> once until iommu_domain_alloc() gets sorted out properly to get rid of this
> awkward blind spot.
> 

But it wasn't related to smmu-v3 logic.

All it is meant to guarantee is that we only ever have dirty-tracking-capable
devices in a single domain, and we don't want that to change throughout the
lifetime of the domain.

> Thanks,
> Robin.
> 
>> iommu_cap::IOMMU_CAP_DIRTY: new device iommu_capable value when probing for
>> capabilities of the device.
>>
>> .set_dirty_tracking(): an iommu driver is expected to change its
>> translation structures and enable dirty tracking for the devices in the
>> iommu_domain. For drivers making dirty tracking always-enabled, it should
>> just return 0.
>>
>> .read_and_clear_dirty(): an iommu driver is expected to walk the iova range
>> passed in and use iommu_dirty_bitmap_record() to record dirty info per
>> IOVA. When detecting a given IOVA is dirty it should also clear its dirty
>> state from the PTE, *unless* the flag IOMMU_DIRTY_NO_CLEAR is passed in --
>> flushing is steered from the caller of the domain_op via iotlb_gather. The
>> iommu core APIs use the same data structure in use for dirty tracking for
>> VFIO device dirty (struct iova_bitmap) abstracted by
>> iommu_dirty_bitmap_record() helper function.
>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>   drivers/iommu/iommu.c      | 11 +++++++
>>   include/linux/io-pgtable.h |  4 +++
>>   include/linux/iommu.h      | 67 ++++++++++++++++++++++++++++++++++++++
>>   3 files changed, 82 insertions(+)
>>
>> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
>> index 2088caae5074..95acc543e8fb 100644
>> --- a/drivers/iommu/iommu.c
>> +++ b/drivers/iommu/iommu.c
>> @@ -2013,6 +2013,17 @@ struct iommu_domain *iommu_domain_alloc(const struct
>> bus_type *bus)
>>   }
>>   EXPORT_SYMBOL_GPL(iommu_domain_alloc);
>>   +int iommu_domain_set_flags(struct iommu_domain *domain,
>> +               const struct bus_type *bus, unsigned long val)
>> +{
>> +    if (!(val & bus->iommu_ops->supported_flags))
>> +        return -EINVAL;
>> +
>> +    domain->flags |= val;
>> +    return 0;
>> +}
>> +EXPORT_SYMBOL_GPL(iommu_domain_set_flags);
>> +
>>   void iommu_domain_free(struct iommu_domain *domain)
>>   {
>>       if (domain->type == IOMMU_DOMAIN_SVA)
>> diff --git a/include/linux/io-pgtable.h b/include/linux/io-pgtable.h
>> index 1b7a44b35616..25142a0e2fc2 100644
>> --- a/include/linux/io-pgtable.h
>> +++ b/include/linux/io-pgtable.h
>> @@ -166,6 +166,10 @@ struct io_pgtable_ops {
>>                     struct iommu_iotlb_gather *gather);
>>       phys_addr_t (*iova_to_phys)(struct io_pgtable_ops *ops,
>>                       unsigned long iova);
>> +    int (*read_and_clear_dirty)(struct io_pgtable_ops *ops,
>> +                    unsigned long iova, size_t size,
>> +                    unsigned long flags,
>> +                    struct iommu_dirty_bitmap *dirty);
>>   };
>>     /**
>> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
>> index 39d25645a5ab..992ea87f2f8e 100644
>> --- a/include/linux/iommu.h
>> +++ b/include/linux/iommu.h
>> @@ -13,6 +13,7 @@
>>   #include <linux/errno.h>
>>   #include <linux/err.h>
>>   #include <linux/of.h>
>> +#include <linux/iova_bitmap.h>
>>   #include <uapi/linux/iommu.h>
>>     #define IOMMU_READ    (1 << 0)
>> @@ -65,6 +66,11 @@ struct iommu_domain_geometry {
>>     #define __IOMMU_DOMAIN_SVA    (1U << 4)  /* Shared process address space */
>>   +/* Domain feature flags that do not define domain types */
>> +#define IOMMU_DOMAIN_F_ENFORCE_DIRTY    (1U << 6)  /* Enforce attachment of
>> +                              dirty tracking supported
>> +                              devices          */
>> +
>>   /*
>>    * This are the possible domain-types
>>    *
>> @@ -93,6 +99,7 @@ struct iommu_domain_geometry {
>>     struct iommu_domain {
>>       unsigned type;
>> +    unsigned flags;
>>       const struct iommu_domain_ops *ops;
>>       unsigned long pgsize_bitmap;    /* Bitmap of page sizes in use */
>>       struct iommu_domain_geometry geometry;
>> @@ -128,6 +135,7 @@ enum iommu_cap {
>>        * this device.
>>        */
>>       IOMMU_CAP_ENFORCE_CACHE_COHERENCY,
>> +    IOMMU_CAP_DIRTY,        /* IOMMU supports dirty tracking */
>>   };
>>     /* These are the possible reserved region types */
>> @@ -220,6 +228,17 @@ struct iommu_iotlb_gather {
>>       bool            queued;
>>   };
>>   +/**
>> + * struct iommu_dirty_bitmap - Dirty IOVA bitmap state
>> + *
>> + * @bitmap: IOVA bitmap
>> + * @gather: Range information for a pending IOTLB flush
>> + */
>> +struct iommu_dirty_bitmap {
>> +    struct iova_bitmap *bitmap;
>> +    struct iommu_iotlb_gather *gather;
>> +};
>> +
>>   /**
>>    * struct iommu_ops - iommu ops and capabilities
>>    * @capable: check capability
>> @@ -248,6 +267,7 @@ struct iommu_iotlb_gather {
>>    *                    pasid, so that any DMA transactions with this pasid
>>    *                    will be blocked by the hardware.
>>    * @pgsize_bitmap: bitmap of all possible supported page sizes
>> + * @flags: All non domain type supported features
>>    * @owner: Driver module providing these ops
>>    */
>>   struct iommu_ops {
>> @@ -281,6 +301,7 @@ struct iommu_ops {
>>         const struct iommu_domain_ops *default_domain_ops;
>>       unsigned long pgsize_bitmap;
>> +    unsigned long supported_flags;
>>       struct module *owner;
>>   };
>>   @@ -316,6 +337,11 @@ struct iommu_ops {
>>    * @enable_nesting: Enable nesting
>>    * @set_pgtable_quirks: Set io page table quirks (IO_PGTABLE_QUIRK_*)
>>    * @free: Release the domain after use.
>> + * @set_dirty_tracking: Enable or Disable dirty tracking on the iommu domain
>> + * @read_and_clear_dirty: Walk IOMMU page tables for dirtied PTEs marshalled
>> + *                        into a bitmap, with a bit represented as a page.
>> + *                        Reads the dirty PTE bits and clears it from IO
>> + *                        pagetables.
>>    */
>>   struct iommu_domain_ops {
>>       int (*attach_dev)(struct iommu_domain *domain, struct device *dev);
>> @@ -348,6 +374,12 @@ struct iommu_domain_ops {
>>                     unsigned long quirks);
>>         void (*free)(struct iommu_domain *domain);
>> +
>> +    int (*set_dirty_tracking)(struct iommu_domain *domain, bool enabled);
>> +    int (*read_and_clear_dirty)(struct iommu_domain *domain,
>> +                    unsigned long iova, size_t size,
>> +                    unsigned long flags,
>> +                    struct iommu_dirty_bitmap *dirty);
>>   };
>>     /**
>> @@ -461,6 +493,9 @@ extern bool iommu_present(const struct bus_type *bus);
>>   extern bool device_iommu_capable(struct device *dev, enum iommu_cap cap);
>>   extern bool iommu_group_has_isolated_msi(struct iommu_group *group);
>>   extern struct iommu_domain *iommu_domain_alloc(const struct bus_type *bus);
>> +extern int iommu_domain_set_flags(struct iommu_domain *domain,
>> +                  const struct bus_type *bus,
>> +                  unsigned long flags);
>>   extern void iommu_domain_free(struct iommu_domain *domain);
>>   extern int iommu_attach_device(struct iommu_domain *domain,
>>                      struct device *dev);
>> @@ -627,6 +662,28 @@ static inline bool iommu_iotlb_gather_queued(struct
>> iommu_iotlb_gather *gather)
>>       return gather && gather->queued;
>>   }
>>   +static inline void iommu_dirty_bitmap_init(struct iommu_dirty_bitmap *dirty,
>> +                       struct iova_bitmap *bitmap,
>> +                       struct iommu_iotlb_gather *gather)
>> +{
>> +    if (gather)
>> +        iommu_iotlb_gather_init(gather);
>> +
>> +    dirty->bitmap = bitmap;
>> +    dirty->gather = gather;
>> +}
>> +
>> +static inline void
>> +iommu_dirty_bitmap_record(struct iommu_dirty_bitmap *dirty, unsigned long iova,
>> +              unsigned long length)
>> +{
>> +    if (dirty->bitmap)
>> +        iova_bitmap_set(dirty->bitmap, iova, length);
>> +
>> +    if (dirty->gather)
>> +        iommu_iotlb_gather_add_range(dirty->gather, iova, length);
>> +}
>> +
>>   /* PCI device grouping function */
>>   extern struct iommu_group *pci_device_group(struct device *dev);
>>   /* Generic device grouping function */
>> @@ -657,6 +714,9 @@ struct iommu_fwspec {
>>   /* ATS is supported */
>>   #define IOMMU_FWSPEC_PCI_RC_ATS            (1 << 0)
>>   +/* Read but do not clear any dirty bits */
>> +#define IOMMU_DIRTY_NO_CLEAR            (1 << 0)
>> +
>>   /**
>>    * struct iommu_sva - handle to a device-mm bond
>>    */
>> @@ -755,6 +815,13 @@ static inline struct iommu_domain
>> *iommu_domain_alloc(const struct bus_type *bus
>>       return NULL;
>>   }
>>   +static inline int iommu_domain_set_flags(struct iommu_domain *domain,
>> +                     const struct bus_type *bus,
>> +                     unsigned long flags)
>> +{
>> +    return -ENODEV;
>> +}
>> +
>>   static inline void iommu_domain_free(struct iommu_domain *domain)
>>   {
>>   }
Joao Martins May 19, 2023, 1:46 p.m. UTC | #11
On 19/05/2023 14:29, Jason Gunthorpe wrote:
> On Fri, May 19, 2023 at 12:56:19PM +0100, Joao Martins wrote:
>>
>>
>> On 19/05/2023 12:51, Jason Gunthorpe wrote:
>>> On Fri, May 19, 2023 at 12:47:24PM +0100, Joao Martins wrote:
>>>
>>>> In practice it is done as soon after the domain is created but I understand what
>>>> you mean that both should be together; I have this implemented like that as my
>>>> first take as a domain_alloc passed flags, but I was a little undecided because
>>>> we are adding another domain_alloc() op for the user-managed pagetable and after
>>>> having another one we would end up with 3 ways of creating iommu domain -- but
>>>> maybe that's not an issue
>>>
>>> It should ride on the same user domain alloc op as some generic flags,
>>
>> OK, I suppose that makes sense specially with this being tied in HWPT_ALLOC
>> where all this new user domain alloc does.
> 
> Yes, it should be easy.
> 
> Then do what Robin said and make the domain ops NULL if the user
> didn't ask for dirty tracking and then attach can fail if there are
> domain incompatibility's.
> 
Yes
Robin Murphy May 19, 2023, 6:12 p.m. UTC | #12
On 19/05/2023 2:43 pm, Joao Martins wrote:
> On 19/05/2023 14:22, Robin Murphy wrote:
>> On 2023-05-18 21:46, Joao Martins wrote:
>>> Add to iommu domain operations a set of callbacks to perform dirty
>>> tracking, particulary to start and stop tracking and finally to read and
>>> clear the dirty data.
>>>
>>> Drivers are generally expected to dynamically change its translation
>>> structures to toggle the tracking and flush some form of control state
>>> structure that stands in the IOVA translation path. Though it's not
>>> mandatory, as drivers will be enable dirty tracking at boot, and just flush
>>> the IO pagetables when setting dirty tracking.  For each of the newly added
>>> IOMMU core APIs:
>>>
>>> .supported_flags[IOMMU_DOMAIN_F_ENFORCE_DIRTY]: Introduce a set of flags
>>> that enforce certain restrictions in the iommu_domain object. For dirty
>>> tracking this means that when IOMMU_DOMAIN_F_ENFORCE_DIRTY is set via its
>>> helper iommu_domain_set_flags(...) devices attached via attach_dev will
>>> fail on devices that do *not* have dirty tracking supported. IOMMU drivers
>>> that support dirty tracking should advertise this flag, while enforcing
>>> that dirty tracking is supported by the device in its .attach_dev iommu op.
>>
>> Eww, no. For an internal thing, just call ->capable() - I mean, you're literally
>> adding this feature as one of its caps...
>>
>> However I'm not sure if we even need that - domains which don't support dirty
>> tracking should just not expose the ops, and thus it ought to be inherently
>> obvious.
>>
> OK.
> 
>> I'm guessing most of the weirdness here is implicitly working around the
>> enabled-from-the-start scenario on SMMUv3:
>>
> It has nothing to do with SMMUv3. This is to futureproof the case where the
> IOMMU capabilities are not homogeneous, even though, it isn't the case today.

Indeed, but in practice SMMUv3 is the only case where you're 
realistically likely to see heterogenous dirty-tracking support across 
instances, and the one where it's most problematic to do things with a 
domain immediately after allocation.

iommu_domain_alloc() is the last piece of the puzzle in terms of 
updating the public IOMMU API to the modern multi-instance model, and 
we're working towards the point where it will be able to return a fully 
functional domain right off the bat. Thus I'm not keen on building more 
APIs around the current behaviour which would become obsolete all too 
soon and need to be unpicked again. The domain_alloc_user stuff is a 
glimpse of the future, so if you're happy to focus on the IOMMUFD case 
for now and build on that model for the initial concept, that's great - 
by the time anyone comes up with a reason for generalising it further, 
the rest of the API should have caught up.

Thanks,
Robin.

Joao Martins Aug. 10, 2023, 6:23 p.m. UTC | #13
On 19/05/2023 14:29, Jason Gunthorpe wrote:
> On Fri, May 19, 2023 at 12:56:19PM +0100, Joao Martins wrote:
>> On 19/05/2023 12:51, Jason Gunthorpe wrote:
>>> On Fri, May 19, 2023 at 12:47:24PM +0100, Joao Martins wrote:
>>>> In practice it is done as soon after the domain is created but I understand what
>>>> you mean that both should be together; I have this implemented like that as my
>>>> first take as a domain_alloc passed flags, but I was a little undecided because
>>>> we are adding another domain_alloc() op for the user-managed pagetable and after
>>>> having another one we would end up with 3 ways of creating iommu domain -- but
>>>> maybe that's not an issue
>>>
>>> It should ride on the same user domain alloc op as some generic flags,
>>
>> OK, I suppose that makes sense specially with this being tied in HWPT_ALLOC
>> where all this new user domain alloc does.
> 
> Yes, it should be easy.
> 
> Then do what Robin said and make the domain ops NULL if the user
> didn't ask for dirty tracking and then attach can fail if there are
> domain incompatibility's.
> 
> Since alloc_user (or whatever it settles into) will have the struct
> device * argument this should be easy enough with out getting mixed
> with the struct bus cleanup.

Taking a step back, the iommu domain ops are a shared global pointer in all
iommu domains, AFAIU at least in all three iommu implementations I was targeting
with this -- init-ed from iommu_ops::domain_default_ops. It's not something we
can "just" clear part of, as that's the same global pointer shared with every
other domain. We would have to duplicate two domain ops for every vendor: one
with dirty tracking and another without; though the general sentiment
behind clearing makes sense

But this is driven only via the IOMMUFD API, so perhaps we can just enforce it
at HWPT allocation time, as we are given a device ID there, or via
device_attach too, inside the iommufd core, when we attach a device to an
already existent hwpt.

This is a bit simpler, and as a bonus it avoids depending on the
domain_alloc_user() nesting infra and needs no core iommu domain changes;

Unless we also need to worry about non-IOMMUFD device attach, which I don't
think is the case here

e.g.

diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
index 4c37eeea2bcd..4966775f5b00 100644
--- a/drivers/iommu/iommufd/device.c
+++ b/drivers/iommu/iommufd/device.c
@@ -339,6 +339,12 @@ int iommufd_hw_pagetable_attach(struct iommufd_hw_pagetable
*hwpt,
                goto err_unlock;
        }

+       if (hwpt->enforce_dirty &&
+           !device_iommu_capable(idev->dev, IOMMU_CAP_DIRTY)) {
+               rc = -EINVAL;
+               goto out_abort;
+       }
+
        /* Try to upgrade the domain we have */
        if (idev->enforce_cache_coherency) {
                rc = iommufd_hw_pagetable_enforce_cc(hwpt);
@@ -542,7 +548,7 @@ iommufd_device_auto_get_domain(struct iommufd_device *idev,
        }

        hwpt = iommufd_hw_pagetable_alloc(idev->ictx, ioas, idev,
-                                         immediate_attach);
+                                         immediate_attach, false);
        if (IS_ERR(hwpt)) {
                destroy_hwpt = ERR_CAST(hwpt);
                goto out_unlock;

diff --git a/drivers/iommu/iommufd/hw_pagetable.c
b/drivers/iommu/iommufd/hw_pagetable.c
index 838530460d9b..da831b4404fd 100644
--- a/drivers/iommu/iommufd/hw_pagetable.c
+++ b/drivers/iommu/iommufd/hw_pagetable.c
@@ -62,6 +62,8 @@ int iommufd_hw_pagetable_enforce_cc(struct
iommufd_hw_pagetable *hwpt)
  * @ioas: IOAS to associate the domain with
  * @idev: Device to get an iommu_domain for
  * @immediate_attach: True if idev should be attached to the hwpt
+ * @enforce_dirty: True if dirty tracking support should be enforced
+ *                 on device attach
  *
  * Allocate a new iommu_domain and return it as a hw_pagetable. The HWPT
  * will be linked to the given ioas and upon return the underlying iommu_domain
@@ -73,7 +75,8 @@ int iommufd_hw_pagetable_enforce_cc(struct
iommufd_hw_pagetable *hwpt)
  */
 struct iommufd_hw_pagetable *
 iommufd_hw_pagetable_alloc(struct iommufd_ctx *ictx, struct iommufd_ioas *ioas,
-                          struct iommufd_device *idev, bool immediate_attach)
+                          struct iommufd_device *idev, bool immediate_attach,
+                          bool enforce_dirty)
 {
        const struct iommu_ops *ops = dev_iommu_ops(idev->dev);
        struct iommufd_hw_pagetable *hwpt;
@@ -90,8 +93,17 @@ iommufd_hw_pagetable_alloc(struct iommufd_ctx *ictx, struct
iommufd_ioas *ioas,
        refcount_inc(&ioas->obj.users);
        hwpt->ioas = ioas;

+       if (enforce_dirty &&
+           !device_iommu_capable(idev->dev, IOMMU_CAP_DIRTY)) {
+               rc = -EINVAL;
+               goto out_abort;
+       }
+
@@ -99,6 +111,8 @@ iommufd_hw_pagetable_alloc(struct iommufd_ctx *ictx, struct
iommufd_ioas *ioas,
                        hwpt->domain = NULL;
                        goto out_abort;
                }
+
+               hwpt->enforce_dirty = enforce_dirty;
        } else {
                hwpt->domain = iommu_domain_alloc(idev->dev->bus);
                if (!hwpt->domain) {
@@ -154,7 +168,8 @@ int iommufd_hwpt_alloc(struct iommufd_ucmd *ucmd)
        struct iommufd_ioas *ioas;
        int rc;

-       if (cmd->flags || cmd->__reserved)
+       if ((cmd->flags & ~(IOMMU_HWPT_ALLOC_ENFORCE_DIRTY)) ||
+           cmd->__reserved)
                return -EOPNOTSUPP;

        idev = iommufd_get_device(ucmd, cmd->dev_id);
@@ -168,7 +183,8 @@ int iommufd_hwpt_alloc(struct iommufd_ucmd *ucmd)
        }

        mutex_lock(&ioas->mutex);
-       hwpt = iommufd_hw_pagetable_alloc(ucmd->ictx, ioas, idev, false);
+       hwpt = iommufd_hw_pagetable_alloc(ucmd->ictx, ioas, idev, false,
+                                 cmd->flags & IOMMU_HWPT_ALLOC_ENFORCE_DIRTY);
        if (IS_ERR(hwpt)) {
                rc = PTR_ERR(hwpt);
                goto out_unlock;
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index 8ba786bc95ff..7f0173e54c9c 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -247,6 +247,7 @@ struct iommufd_hw_pagetable {
        struct iommu_domain *domain;
        bool auto_domain : 1;
        bool enforce_cache_coherency : 1;
+       bool enforce_dirty : 1;
        bool msi_cookie : 1;
        /* Head at iommufd_ioas::hwpt_list */
        struct list_head hwpt_item;
Jason Gunthorpe Aug. 10, 2023, 6:55 p.m. UTC | #14
On Thu, Aug 10, 2023 at 07:23:11PM +0100, Joao Martins wrote:
> On 19/05/2023 14:29, Jason Gunthorpe wrote:
> > On Fri, May 19, 2023 at 12:56:19PM +0100, Joao Martins wrote:
> >> On 19/05/2023 12:51, Jason Gunthorpe wrote:
> >>> On Fri, May 19, 2023 at 12:47:24PM +0100, Joao Martins wrote:
> >>>> In practice it is done soon after the domain is created, but I understand what
> >>>> you mean that both should be together; I had this implemented like that in my
> >>>> first take, as domain_alloc-passed flags, but I was a little undecided because
> >>>> we are adding another domain_alloc() op for the user-managed pagetable, and after
> >>>> having another one we would end up with three ways of creating an iommu domain -- but
> >>>> maybe that's not an issue
> >>>
> >>> It should ride on the same user domain alloc op as some generic flags,
> >>
>> OK, I suppose that makes sense, especially with this being tied to HWPT_ALLOC,
>> which is all this new user domain alloc does.
> > 
> > Yes, it should be easy.
> > 
> > Then do what Robin said and make the domain ops NULL if the user
> > didn't ask for dirty tracking and then attach can fail if there are
> > domain incompatibilities.
> > 
> > Since alloc_user (or whatever it settles into) will have the struct
> > device * argument this should be easy enough with out getting mixed
> > with the struct bus cleanup.
> 
> Taking a step back, the iommu domain ops are a shared global pointer in all
> iommu domains AFAIU, at least in all three iommu implementations I was targeting
> with this -- init-ed from iommu_ops::domain_default_ops. It's not something we can
> "just" clear part of, as that's the same global pointer shared with every
> other domain. We would have to duplicate two domain ops for every vendor: one
> with dirty and another without dirty tracking; though the general sentiment
> behind clearing makes sense.

Yes, the "domain_default_ops" is basically a transitional hack to
help migrate to narrowly defined per-usage domain ops.

eg things like blocking and identity should not have mapping ops.

Things that don't support dirty tracking should not have dirty
tracking ops in the first place.

So the simplest version of this is that by default all domain
allocations do not support dirty tracking. This ensures maximum
cross-instance/device domain re-use.

If userspace would like to use dirty tracking it signals it to
iommufd, probably using the user domain alloc path.

The driver, if it supports it, returns a dirty capable domain with
matching dirty enabled ops.

A dirty capable domain can only be attached to a device/instance that
is compatible and continues to provide dirty tracking.

This allows HW that has special restrictions to be properly supported.
eg maybe HW can only support dirty on a specific page table
format. It can select that format during alloc.
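A minimal userspace sketch of that flow (all `ex_` names here are hypothetical stand-ins, not the kernel API): the driver picks between two static ops tables at alloc time, and attach rejects devices that cannot keep providing dirty tracking:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the generic "request dirty tracking" alloc flag */
#define EX_ALLOC_DIRTY_TRACKING (1UL << 0)

struct ex_domain_ops { bool has_dirty; };

/* Two static ops tables, per the 'XXX_domain_ops_dirty_paging' idea */
static const struct ex_domain_ops ex_default_ops = { .has_dirty = false };
static const struct ex_domain_ops ex_dirty_ops   = { .has_dirty = true };

struct ex_device { bool supports_dirty; };
struct ex_domain { const struct ex_domain_ops *ops; };

/* Driver alloc: select the ops table (and, on real HW, the pgtable format) */
static struct ex_domain ex_domain_alloc_user(const struct ex_device *dev,
					     unsigned long flags)
{
	struct ex_domain d = { .ops = &ex_default_ops };

	if (flags & EX_ALLOC_DIRTY_TRACKING)
		/* NULL ops models ERR_PTR(): HW without support fails alloc */
		d.ops = dev->supports_dirty ? &ex_dirty_ops : NULL;
	return d;
}

/* Attach: a dirty-capable domain only accepts dirty-capable devices */
static int ex_attach_dev(const struct ex_domain *domain,
			 const struct ex_device *dev)
{
	if (domain->ops->has_dirty && !dev->supports_dirty)
		return -EINVAL;
	return 0;
}
```

By default nothing requests the flag, so plain allocations keep the shared non-dirty ops and maximum domain re-use.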

> This is a bit simpler, and as a bonus it avoids depending on the
> domain_alloc_user() nesting infra, with no core iommu domain changes;

We have to start tackling some of this and not just bodging on top of
bodges :\

I think the domain_alloc_user patches are in good enough shape you can
rely on them.

Return the IOMMU_CAP_DIRTY as generic data in the new GET_INFO
Accept some generic flag in the alloc_hwpt requesting dirty
Pass generic flags down to the driver.
Reject set flags and drivers that don't implement alloc_domain_user.

Driver returns a domain with the right ops 'XXX_domain_ops_dirty_paging'.
Driver refuses to attach the dirty-enabled domain to places that don't do
dirty tracking.

Jason
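The flag-validation steps in that list can be sketched in isolation (hypothetical `ex_` names; a rough model, not the iommufd code itself): the generic layer rejects unknown flags outright, and rejects any flags for drivers that never implemented the user alloc op:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define EX_HWPT_ALLOC_DIRTY (1UL << 0)
#define EX_SUPPORTED_FLAGS  EX_HWPT_ALLOC_DIRTY

struct ex_driver {
	void *(*domain_alloc_user)(unsigned long flags); /* may be NULL */
};

/* Toy driver implementation so a "modern" driver can be instantiated */
static void *ex_dummy_alloc(unsigned long flags)
{
	static char domain;
	(void)flags;
	return &domain;
}

/* Generic layer: reject unknown flags, and reject any flags at all for
 * drivers that do not implement domain_alloc_user */
static int ex_hwpt_alloc_check(const struct ex_driver *drv,
			       unsigned long flags)
{
	if (flags & ~EX_SUPPORTED_FLAGS)
		return -EOPNOTSUPP;
	if (flags && !drv->domain_alloc_user)
		return -EOPNOTSUPP;
	return 0;
}
```

This mirrors the `cmd->flags & ~(IOMMU_HWPT_ALLOC_ENFORCE_DIRTY)` check in the patch, generalized to any future alloc flag.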
Joao Martins Aug. 10, 2023, 8:36 p.m. UTC | #15
On 10/08/2023 19:55, Jason Gunthorpe wrote:
> On Thu, Aug 10, 2023 at 07:23:11PM +0100, Joao Martins wrote:
>> On 19/05/2023 14:29, Jason Gunthorpe wrote:
>>> On Fri, May 19, 2023 at 12:56:19PM +0100, Joao Martins wrote:
>>>> On 19/05/2023 12:51, Jason Gunthorpe wrote:
>>>>> On Fri, May 19, 2023 at 12:47:24PM +0100, Joao Martins wrote:
>>>>>> In practice it is done soon after the domain is created, but I understand what
>>>>>> you mean that both should be together; I had this implemented like that in my
>>>>>> first take, as domain_alloc-passed flags, but I was a little undecided because
>>>>>> we are adding another domain_alloc() op for the user-managed pagetable, and after
>>>>>> having another one we would end up with three ways of creating an iommu domain -- but
>>>>>> maybe that's not an issue
>>>>>
>>>>> It should ride on the same user domain alloc op as some generic flags,
>>>>
>>>> OK, I suppose that makes sense, especially with this being tied to HWPT_ALLOC,
>>>> which is all this new user domain alloc does.
>>>
>>> Yes, it should be easy.
>>>
>>> Then do what Robin said and make the domain ops NULL if the user
>>> didn't ask for dirty tracking and then attach can fail if there are
>>> domain incompatibilities.
>>>
>>> Since alloc_user (or whatever it settles into) will have the struct
>>> device * argument this should be easy enough without getting mixed
>>> with the struct bus cleanup.
>>
>> Taking a step back, the iommu domain ops are a shared global pointer in all
>> iommu domains AFAIU, at least in all three iommu implementations I was targeting
>> with this -- init-ed from iommu_ops::domain_default_ops. It's not something we can
>> "just" clear part of, as that's the same global pointer shared with every
>> other domain. We would have to duplicate two domain ops for every vendor: one
>> with dirty and another without dirty tracking; though the general sentiment
>> behind clearing makes sense.
> 
> Yes, the "domain_default_ops" is basically a transitional hack to
> help migrate to narrowly defined per-usage domain ops.
> 
> eg things like blocking and identity should not have mapping ops.
> 
My earlier point wasn't so much about what 'domain_default_ops' represents,
but that it's a pointer shared by everyone (devices and domains alike). But you
sort of made it clear that it's OK to duplicate it to get a version without
dirty tracking. The duplication is what felt a little odd to me.

> Things that don't support dirty tracking should not have dirty
> tracking ops in the first place.
> 
> So the simplest version of this is that by default all domain
> allocations do not support dirty tracking. This ensures maximum
> cross-instance/device domain re-use.
> 
> If userspace would like to use dirty tracking it signals it to
> iommufd, probably using the user domain alloc path.
> 
> The driver, if it supports it, returns a dirty capable domain with
> matching dirty enabled ops.
> 
> A dirty capable domain can only be attached to a device/instance that
> is compatible and continues to provide dirty tracking.
> 
> This allows HW that has special restrictions to be properly supported.
> eg maybe HW can only support dirty on a specific page table
> format. It can select that format during alloc.
> 

(...)

>> This is a bit simpler, and as a bonus it avoids depending on the
>> domain_alloc_user() nesting infra, with no core iommu domain changes;
> 
> We have to start tackling some of this and not just bodging on top of
> bodges :\
> 

(...) I wasn't quite bodging, just trying to work in parallel with the bus
cleanup by tackling domain/device-independent ops without them being global.
Maybe I read too much into it, hence my previous question.

> I think the domain_alloc_user patches are in good enough shape you can
> rely on them.
> 
I have been using them. It just needs a flags arg, and I have an alternative to
the snippet I pasted earlier with domain_alloc_user already there.

> Return the IOMMU_CAP_DIRTY as generic data in the new GET_INFO

I have this one here:

https://lore.kernel.org/linux-iommu/20230518204650.14541-14-joao.m.martins@oracle.com/

I can rework it to GET_HW_INFO, but it really needs to be generic bits of data,
not iommu hw-specific ones -- e.g. something that translates into
device_iommu_capable() cap checking. The moment it stays hw-specific and
userspace has to decode different hw-specific bits for simple capability
checking (e.g. do we support nesting, do we support dirty tracking, etc.), it
defeats the point of a generic API.

With the exception, I guess, of going into the weeds of nesting-specific
formats, as there are hardware formats intimate to the userspace process
emulation of iommu-model-specific things, i.e. things that can't be made
generic.

Hopefully my expectation here matches yours for cap checking?

> Accept some generic flag in the alloc_hwpt requesting dirty
> Pass generic flags down to the driver.
> Reject set flags and drivers that don't implement alloc_domain_user.
> Driver refuses to attach the dirty-enabled domain to places that don't do
> dirty tracking.
>

This is already done; so far I have an unsigned long flags to
domain_alloc_user(), and I will probably keep defining it as an
iommu-domain-feature bit (unless it's better to follow a similar direction as
hwpt_type in domain_alloc_user). It gets stored as iommu_domain::flags, like
this series had. Though if the majority of driver rejection flows via
alloc_domain_user only (which has a struct device), perhaps it's not even
needed to store it as a new iommu_domain::flags.

> Driver returns a domain with the right ops 'XXX_domain_ops_dirty_paging'.

Alright, you look OK with this -- I'll go with it, then.
Jason Gunthorpe Aug. 11, 2023, 1:09 a.m. UTC | #16
On Thu, Aug 10, 2023 at 09:36:37PM +0100, Joao Martins wrote:

> > Yes, the "domain_default_ops" is basically a transitional hack to
> > help migrate to narrowly defined per-usage domain ops.
> > 
> > eg things like blocking and identity should not have mapping ops.
> > 
> My earlier point wasn't so much about what 'domain_default_ops' represents,
> but that it's a pointer shared by everyone (devices and domains alike). But you
> sort of made it clear that it's OK to duplicate it to get a version without
> dirty tracking. The duplication is what felt a little odd to me.

Well, it is one path; we could also add a dirty_ops to the
domain. Hard to say which is better.
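A sketch of that alternative (hypothetical `ex_` names): the shared default ops stay untouched, and an optional second ops pointer on the domain carries only the dirty callbacks, with NULL meaning "not dirty capable":

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct ex_dirty_ops {
	int (*set_dirty_tracking)(void *priv, bool enable);
};

struct ex_domain {
	const void *default_ops;	      /* still shared by all domains */
	const struct ex_dirty_ops *dirty_ops; /* NULL => no dirty tracking */
};

/* Core wrapper: capability check collapses to a NULL test */
static int ex_set_dirty_tracking(struct ex_domain *d, bool enable)
{
	if (!d->dirty_ops)
		return -EOPNOTSUPP;
	return d->dirty_ops->set_dirty_tracking(d, enable);
}

/* Trivial driver implementation for HW with always-enabled dirty tracking */
static int ex_always_on(void *priv, bool enable)
{
	(void)priv; (void)enable;
	return 0;
}
static const struct ex_dirty_ops ex_driver_dirty_ops = {
	.set_dirty_tracking = ex_always_on,
};
```

The attraction over duplicated full ops tables is that only the dirty callbacks are per-capability; everything else keeps the one shared pointer.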

> (...) I wasn't quite bodging, just trying to work in parallel with the bus
> cleanup by tackling domain/device-independent ops without them being global.
> Maybe I read too much into it, hence my previous question.

domain_alloc_user bypasses the bus cleanup

> > Return the IOMMU_CAP_DIRTY as generic data in the new GET_INFO
> 
> I have this one here:
> 
> https://lore.kernel.org/linux-iommu/20230518204650.14541-14-joao.m.martins@oracle.com/
> 
> I can rework it to GET_HW_INFO, but it really needs to be generic bits of data,
> not iommu hw-specific ones -- e.g. something that translates into
> device_iommu_capable() cap checking.

Yes, HW_INFO seems the right way. Just add a

   __aligned_u64 out_capabilities;

To that struct iommu_hw_info and fill it with generic code.
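That generic fill could look roughly like this (a sketch with hypothetical `ex_` names standing in for iommu_hw_info and device_iommu_capable()): the core translates per-device capabilities into generic bits so userspace never decodes HW-specific data for simple checks:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical generic capability bit reported in out_capabilities */
#define EX_HW_CAP_DIRTY_TRACKING (UINT64_C(1) << 0)

struct ex_hw_info {
	uint64_t out_capabilities; /* models the new __aligned_u64 field */
};

/* Generic core code: the bool models device_iommu_capable(dev, IOMMU_CAP_DIRTY) */
static void ex_fill_hw_info(struct ex_hw_info *info, bool dev_dirty_capable)
{
	info->out_capabilities = 0;
	if (dev_dirty_capable)
		info->out_capabilities |= EX_HW_CAP_DIRTY_TRACKING;
}
```

HW-specific data would still travel alongside, but the generic capability word is what a portable VMM checks before enabling dirty tracking.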

> > Accept some generic flag in the alloc_hwpt requesting dirty
> > Pass generic flags down to the driver.
> > Reject set flags and drivers that don't implement alloc_domain_user.
> > Driver refuses to attach the dirty-enabled domain to places that don't do
> > dirty tracking.
> 
> This is already done; so far I have an unsigned long flags to
> domain_alloc_user(), and I will probably keep defining it as an
> iommu-domain-feature bit

Yes a flag in some way is the best choice

> (unless it's better to follow a similar direction as hwpt_type in
> domain_alloc_user). It gets stored as iommu_domain::flags, like this series
> had. Though if the majority of driver rejection flows via alloc_domain_user
> only (which has a struct device), perhaps it's not even needed to store it as
> a new iommu_domain::flags

Right, we don't need to reflect it back if the dirty ops are NULL.

Jason
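Before the patch itself, a toy model of the read_and_clear_dirty() contract it introduces (hypothetical `ex_` names; a flat PTE array stands in for the IO pagetable): dirty PTEs are recorded into a bitmap and cleared, unless the NO_CLEAR flag asks for a non-destructive read:

```c
#include <assert.h>
#include <stddef.h>

#define EX_PAGE_SHIFT	  12
#define EX_PAGE_SIZE	  (1UL << EX_PAGE_SHIFT)
#define EX_PTE_DIRTY	  (1UL << 0)	/* toy dirty bit in the PTE */
#define EX_DIRTY_NO_CLEAR (1UL << 0)	/* models IOMMU_DIRTY_NO_CLEAR */

struct ex_dirty_bitmap {
	unsigned long base;	/* IOVA that bit 0 represents */
	unsigned long bits;	/* one bit per page, toy-sized */
};

/* Models iommu_dirty_bitmap_record() for a single page */
static void ex_dirty_record(struct ex_dirty_bitmap *b, unsigned long iova)
{
	b->bits |= 1UL << ((iova - b->base) >> EX_PAGE_SHIFT);
}

/* Walk [base, base + npte pages): record each dirty PTE, and clear its
 * dirty bit unless the caller asked for a read-only pass */
static void ex_read_and_clear_dirty(unsigned long *ptes, size_t npte,
				    unsigned long base, unsigned long flags,
				    struct ex_dirty_bitmap *dirty)
{
	for (size_t i = 0; i < npte; i++) {
		if (!(ptes[i] & EX_PTE_DIRTY))
			continue;
		ex_dirty_record(dirty, base + i * EX_PAGE_SIZE);
		if (!(flags & EX_DIRTY_NO_CLEAR))
			ptes[i] &= ~EX_PTE_DIRTY;
	}
}
```

In the real API the clear also queues an IOTLB flush range via iotlb_gather, which this sketch omits.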
diff mbox series

Patch

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 2088caae5074..95acc543e8fb 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -2013,6 +2013,17 @@  struct iommu_domain *iommu_domain_alloc(const struct bus_type *bus)
 }
 EXPORT_SYMBOL_GPL(iommu_domain_alloc);
 
+int iommu_domain_set_flags(struct iommu_domain *domain,
+			   const struct bus_type *bus, unsigned long val)
+{
+	if (!(val & bus->iommu_ops->supported_flags))
+		return -EINVAL;
+
+	domain->flags |= val;
+	return 0;
+}
+EXPORT_SYMBOL_GPL(iommu_domain_set_flags);
+
 void iommu_domain_free(struct iommu_domain *domain)
 {
 	if (domain->type == IOMMU_DOMAIN_SVA)
diff --git a/include/linux/io-pgtable.h b/include/linux/io-pgtable.h
index 1b7a44b35616..25142a0e2fc2 100644
--- a/include/linux/io-pgtable.h
+++ b/include/linux/io-pgtable.h
@@ -166,6 +166,10 @@  struct io_pgtable_ops {
 			      struct iommu_iotlb_gather *gather);
 	phys_addr_t (*iova_to_phys)(struct io_pgtable_ops *ops,
 				    unsigned long iova);
+	int (*read_and_clear_dirty)(struct io_pgtable_ops *ops,
+				    unsigned long iova, size_t size,
+				    unsigned long flags,
+				    struct iommu_dirty_bitmap *dirty);
 };
 
 /**
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 39d25645a5ab..992ea87f2f8e 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -13,6 +13,7 @@ 
 #include <linux/errno.h>
 #include <linux/err.h>
 #include <linux/of.h>
+#include <linux/iova_bitmap.h>
 #include <uapi/linux/iommu.h>
 
 #define IOMMU_READ	(1 << 0)
@@ -65,6 +66,11 @@  struct iommu_domain_geometry {
 
 #define __IOMMU_DOMAIN_SVA	(1U << 4)  /* Shared process address space */
 
+/* Domain feature flags that do not define domain types */
+#define IOMMU_DOMAIN_F_ENFORCE_DIRTY	(1U << 6)  /* Enforce attachment of
+						      dirty tracking supported
+						      devices		  */
+
 /*
  * This are the possible domain-types
  *
@@ -93,6 +99,7 @@  struct iommu_domain_geometry {
 
 struct iommu_domain {
 	unsigned type;
+	unsigned flags;
 	const struct iommu_domain_ops *ops;
 	unsigned long pgsize_bitmap;	/* Bitmap of page sizes in use */
 	struct iommu_domain_geometry geometry;
@@ -128,6 +135,7 @@  enum iommu_cap {
 	 * this device.
 	 */
 	IOMMU_CAP_ENFORCE_CACHE_COHERENCY,
+	IOMMU_CAP_DIRTY,		/* IOMMU supports dirty tracking */
 };
 
 /* These are the possible reserved region types */
@@ -220,6 +228,17 @@  struct iommu_iotlb_gather {
 	bool			queued;
 };
 
+/**
+ * struct iommu_dirty_bitmap - Dirty IOVA bitmap state
+ *
+ * @bitmap: IOVA bitmap
+ * @gather: Range information for a pending IOTLB flush
+ */
+struct iommu_dirty_bitmap {
+	struct iova_bitmap *bitmap;
+	struct iommu_iotlb_gather *gather;
+};
+
 /**
  * struct iommu_ops - iommu ops and capabilities
  * @capable: check capability
@@ -248,6 +267,7 @@  struct iommu_iotlb_gather {
  *                    pasid, so that any DMA transactions with this pasid
  *                    will be blocked by the hardware.
  * @pgsize_bitmap: bitmap of all possible supported page sizes
+ * @flags: All non domain type supported features
  * @owner: Driver module providing these ops
  */
 struct iommu_ops {
@@ -281,6 +301,7 @@  struct iommu_ops {
 
 	const struct iommu_domain_ops *default_domain_ops;
 	unsigned long pgsize_bitmap;
+	unsigned long supported_flags;
 	struct module *owner;
 };
 
@@ -316,6 +337,11 @@  struct iommu_ops {
  * @enable_nesting: Enable nesting
  * @set_pgtable_quirks: Set io page table quirks (IO_PGTABLE_QUIRK_*)
  * @free: Release the domain after use.
+ * @set_dirty_tracking: Enable or Disable dirty tracking on the iommu domain
+ * @read_and_clear_dirty: Walk IOMMU page tables for dirtied PTEs marshalled
+ *                        into a bitmap, with a bit represented as a page.
+ *                        Reads the dirty PTE bits and clears it from IO
+ *                        pagetables.
  */
 struct iommu_domain_ops {
 	int (*attach_dev)(struct iommu_domain *domain, struct device *dev);
@@ -348,6 +374,12 @@  struct iommu_domain_ops {
 				  unsigned long quirks);
 
 	void (*free)(struct iommu_domain *domain);
+
+	int (*set_dirty_tracking)(struct iommu_domain *domain, bool enabled);
+	int (*read_and_clear_dirty)(struct iommu_domain *domain,
+				    unsigned long iova, size_t size,
+				    unsigned long flags,
+				    struct iommu_dirty_bitmap *dirty);
 };
 
 /**
@@ -461,6 +493,9 @@  extern bool iommu_present(const struct bus_type *bus);
 extern bool device_iommu_capable(struct device *dev, enum iommu_cap cap);
 extern bool iommu_group_has_isolated_msi(struct iommu_group *group);
 extern struct iommu_domain *iommu_domain_alloc(const struct bus_type *bus);
+extern int iommu_domain_set_flags(struct iommu_domain *domain,
+				  const struct bus_type *bus,
+				  unsigned long flags);
 extern void iommu_domain_free(struct iommu_domain *domain);
 extern int iommu_attach_device(struct iommu_domain *domain,
 			       struct device *dev);
@@ -627,6 +662,28 @@  static inline bool iommu_iotlb_gather_queued(struct iommu_iotlb_gather *gather)
 	return gather && gather->queued;
 }
 
+static inline void iommu_dirty_bitmap_init(struct iommu_dirty_bitmap *dirty,
+					   struct iova_bitmap *bitmap,
+					   struct iommu_iotlb_gather *gather)
+{
+	if (gather)
+		iommu_iotlb_gather_init(gather);
+
+	dirty->bitmap = bitmap;
+	dirty->gather = gather;
+}
+
+static inline void
+iommu_dirty_bitmap_record(struct iommu_dirty_bitmap *dirty, unsigned long iova,
+			  unsigned long length)
+{
+	if (dirty->bitmap)
+		iova_bitmap_set(dirty->bitmap, iova, length);
+
+	if (dirty->gather)
+		iommu_iotlb_gather_add_range(dirty->gather, iova, length);
+}
+
 /* PCI device grouping function */
 extern struct iommu_group *pci_device_group(struct device *dev);
 /* Generic device grouping function */
@@ -657,6 +714,9 @@  struct iommu_fwspec {
 /* ATS is supported */
 #define IOMMU_FWSPEC_PCI_RC_ATS			(1 << 0)
 
+/* Read but do not clear any dirty bits */
+#define IOMMU_DIRTY_NO_CLEAR			(1 << 0)
+
 /**
  * struct iommu_sva - handle to a device-mm bond
  */
@@ -755,6 +815,13 @@  static inline struct iommu_domain *iommu_domain_alloc(const struct bus_type *bus
 	return NULL;
 }
 
+static inline int iommu_domain_set_flags(struct iommu_domain *domain,
+					 const struct bus_type *bus,
+					 unsigned long flags)
+{
+	return -ENODEV;
+}
+
 static inline void iommu_domain_free(struct iommu_domain *domain)
 {
 }