[v3,19/19] iommu/intel: Access/Dirty bit support for SL domains

Message ID 20230923012511.10379-20-joao.m.martins@oracle.com (mailing list archive)
State New, archived
Series IOMMUFD Dirty Tracking

Commit Message

Joao Martins Sept. 23, 2023, 1:25 a.m. UTC
The IOMMU advertises Access/Dirty bits for the second-stage page table if
the extended capability DMAR register reports it (ECAP, mnemonic
ECAP.SSADS). The first-stage table is compatible with the CPU page table,
thus A/D bits are implicitly supported. Relevant Intel IOMMU SDM refs:
first-stage table "3.6.2 Accessed, Extended Accessed, and Dirty Flags" and
second-stage table "3.7.2 Accessed and Dirty Flags".

The first-stage page table is enabled by default and its A/D bits are always
on, so enabling dirty tracking there needs no control bits and simply
returns 0. To use SSADS, set bit 9 (SSADE) in the scalable-mode PASID table
entry and flush the IOTLB via pasid_flush_caches() following the manual.
Relevant SDM refs:

"3.7.2 Accessed and Dirty Flags"
"6.5.3.3 Guidance to Software for Invalidations,
 Table 23. Guidance to Software for Invalidations"

The PTE dirty bit is bit 9, and it is cached in the IOTLB, so flush the
IOTLB to make sure the IOMMU attempts to set the dirty bit again. Note that
iommu_dirty_bitmap_record() will add the IOVA to iotlb_gather, and thus the
caller of the iommu op will flush the IOTLB. The relevant manual coverage of
the hardware translation is chapter 6, with special mention of:

"6.2.3.1 Scalable-Mode PASID-Table Entry Programming Considerations"
"6.2.4 IOTLB"

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
The IOPTE walker is still a bit inefficient. Making sure the UAPI/IOMMUFD is
solid and agreed upon.
---
 drivers/iommu/intel/iommu.c | 94 +++++++++++++++++++++++++++++++++++++
 drivers/iommu/intel/iommu.h | 15 ++++++
 drivers/iommu/intel/pasid.c | 94 +++++++++++++++++++++++++++++++++++++
 drivers/iommu/intel/pasid.h |  4 ++
 4 files changed, 207 insertions(+)
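
As an aside for orientation: the SSADE toggle amounts to flipping bit 9 in
the first quadword of the scalable-mode PASID entry. A minimal sketch in the
style of the existing pasid.c helpers (helper bodies assumed, not copied
from the v3 hunk):

/* SSADE (Second Level Access/Dirty Enable) is bit 9 of PASID entry qw0 */
static inline void pasid_set_ssade(struct pasid_entry *pe)
{
	pasid_set_bits(&pe->val[0], 1 << 9, 1 << 9);
}

static inline void pasid_clear_ssade(struct pasid_entry *pe)
{
	pasid_set_bits(&pe->val[0], 1 << 9, 0);
}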

Comments

Baolu Lu Sept. 25, 2023, 7:01 a.m. UTC | #1
On 9/23/23 9:25 AM, Joao Martins wrote:
> The IOMMU advertises Access/Dirty bits for the second-stage page table if
> the extended capability DMAR register reports it (ECAP, mnemonic
> ECAP.SSADS). The first-stage table is compatible with the CPU page table,
> thus A/D bits are implicitly supported. Relevant Intel IOMMU SDM refs:
> first-stage table "3.6.2 Accessed, Extended Accessed, and Dirty Flags" and
> second-stage table "3.7.2 Accessed and Dirty Flags".
> 
> The first-stage page table is enabled by default and its A/D bits are always
> on, so enabling dirty tracking there needs no control bits and simply
> returns 0. To use SSADS, set bit 9 (SSADE) in the scalable-mode PASID table
> entry and flush the IOTLB via pasid_flush_caches() following the manual.
> Relevant SDM refs:
> 
> "3.7.2 Accessed and Dirty Flags"
> "6.5.3.3 Guidance to Software for Invalidations,
>   Table 23. Guidance to Software for Invalidations"
> 
> The PTE dirty bit is bit 9, and it is cached in the IOTLB, so flush the
> IOTLB to make sure the IOMMU attempts to set the dirty bit again. Note that
> iommu_dirty_bitmap_record() will add the IOVA to iotlb_gather, and thus the
> caller of the iommu op will flush the IOTLB. The relevant manual coverage of
> the hardware translation is chapter 6, with special mention of:
> 
> "6.2.3.1 Scalable-Mode PASID-Table Entry Programming Considerations"
> "6.2.4 IOTLB"
> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
> The IOPTE walker is still a bit inefficient. Making sure the UAPI/IOMMUFD is
> solid and agreed upon.
> ---
>   drivers/iommu/intel/iommu.c | 94 +++++++++++++++++++++++++++++++++++++
>   drivers/iommu/intel/iommu.h | 15 ++++++
>   drivers/iommu/intel/pasid.c | 94 +++++++++++++++++++++++++++++++++++++
>   drivers/iommu/intel/pasid.h |  4 ++
>   4 files changed, 207 insertions(+)

The code is probably incomplete. When attaching a domain to a device,
check the domain's dirty tracking capability against the device's
capabilities. If the domain's dirty tracking capability is set but the
device does not support it, the attach callback should return -EINVAL.

Best regards,
baolu
Joao Martins Sept. 25, 2023, 9:08 a.m. UTC | #2
On 25/09/2023 08:01, Baolu Lu wrote:
> On 9/23/23 9:25 AM, Joao Martins wrote:
>> The IOMMU advertises Access/Dirty bits for the second-stage page table if
>> the extended capability DMAR register reports it (ECAP, mnemonic
>> ECAP.SSADS). The first-stage table is compatible with the CPU page table,
>> thus A/D bits are implicitly supported. Relevant Intel IOMMU SDM refs:
>> first-stage table "3.6.2 Accessed, Extended Accessed, and Dirty Flags" and
>> second-stage table "3.7.2 Accessed and Dirty Flags".
>>
>> The first-stage page table is enabled by default and its A/D bits are always
>> on, so enabling dirty tracking there needs no control bits and simply
>> returns 0. To use SSADS, set bit 9 (SSADE) in the scalable-mode PASID table
>> entry and flush the IOTLB via pasid_flush_caches() following the manual.
>> Relevant SDM refs:
>>
>> "3.7.2 Accessed and Dirty Flags"
>> "6.5.3.3 Guidance to Software for Invalidations,
>>   Table 23. Guidance to Software for Invalidations"
>>
>> The PTE dirty bit is bit 9, and it is cached in the IOTLB, so flush the
>> IOTLB to make sure the IOMMU attempts to set the dirty bit again. Note that
>> iommu_dirty_bitmap_record() will add the IOVA to iotlb_gather, and thus the
>> caller of the iommu op will flush the IOTLB. The relevant manual coverage of
>> the hardware translation is chapter 6, with special mention of:
>>
>> "6.2.3.1 Scalable-Mode PASID-Table Entry Programming Considerations"
>> "6.2.4 IOTLB"
>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>> The IOPTE walker is still a bit inefficient. Making sure the UAPI/IOMMUFD is
>> solid and agreed upon.
>> ---
>>   drivers/iommu/intel/iommu.c | 94 +++++++++++++++++++++++++++++++++++++
>>   drivers/iommu/intel/iommu.h | 15 ++++++
>>   drivers/iommu/intel/pasid.c | 94 +++++++++++++++++++++++++++++++++++++
>>   drivers/iommu/intel/pasid.h |  4 ++
>>   4 files changed, 207 insertions(+)
> 
> The code is probably incomplete. When attaching a domain to a device,
> check the domain's dirty tracking capability against the device's
> capabilities. If the domain's dirty tracking capability is set but the
> device does not support it, the attach callback should return -EINVAL.
> 
Yeap, I did that for AMD, but it seems in the mix of changes I may have
deleted it and then forgotten to include it here.

Here's what I added (together with consolidated cap checking):

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 7d5a8f5283a7..fabfe363f1f9 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -4075,6 +4075,11 @@ static struct iommu_domain *intel_iommu_domain_alloc(unsigned type)
        return NULL;
 }

+static bool intel_iommu_slads_supported(struct intel_iommu *iommu)
+{
+       return sm_supported(iommu) && ecap_slads(iommu->ecap);
+}
+
 static struct iommu_domain *
 intel_iommu_domain_alloc_user(struct device *dev, u32 flags)
 {
@@ -4090,7 +4095,7 @@ intel_iommu_domain_alloc_user(struct device *dev, u32 flags)
                return ERR_PTR(-EOPNOTSUPP);

        if (enforce_dirty &&
-           !device_iommu_capable(dev, IOMMU_CAP_DIRTY))
+           !intel_iommu_slads_supported(iommu))
                return ERR_PTR(-EOPNOTSUPP);

        domain = iommu_domain_alloc(dev->bus);
@@ -4121,6 +4126,9 @@ static int prepare_domain_attach_device(struct iommu_domain *domain,
        if (dmar_domain->force_snooping && !ecap_sc_support(iommu->ecap))
                return -EINVAL;

+       if (domain->dirty_ops && !intel_iommu_slads_supported(iommu))
+               return -EINVAL;
+
        /* check if this iommu agaw is sufficient for max mapped address */
        addr_width = agaw_to_width(iommu->agaw);
        if (addr_width > cap_mgaw(iommu->cap))
@@ -4376,8 +4384,7 @@ static bool intel_iommu_capable(struct device *dev, enum iommu_cap cap)
        case IOMMU_CAP_ENFORCE_CACHE_COHERENCY:
                return ecap_sc_support(info->iommu->ecap);
        case IOMMU_CAP_DIRTY:
-               return sm_supported(info->iommu) &&
-                       ecap_slads(info->iommu->ecap);
+               return intel_iommu_slads_supported(info->iommu);
        default:
                return false;
        }
Baolu Lu Oct. 16, 2023, 12:51 a.m. UTC | #3
On 9/23/23 9:25 AM, Joao Martins wrote:
> --- a/drivers/iommu/intel/iommu.c
> +++ b/drivers/iommu/intel/iommu.c
> @@ -300,6 +300,7 @@ static int iommu_skip_te_disable;
>   #define IDENTMAP_AZALIA		4
>   
>   const struct iommu_ops intel_iommu_ops;
> +const struct iommu_dirty_ops intel_dirty_ops;
>   
>   static bool translation_pre_enabled(struct intel_iommu *iommu)
>   {
> @@ -4077,6 +4078,7 @@ static struct iommu_domain *intel_iommu_domain_alloc(unsigned type)
>   static struct iommu_domain *
>   intel_iommu_domain_alloc_user(struct device *dev, u32 flags)
>   {
> +	bool enforce_dirty = (flags & IOMMU_HWPT_ALLOC_ENFORCE_DIRTY);
>   	struct iommu_domain *domain;
>   	struct intel_iommu *iommu;
>   
> @@ -4087,9 +4089,15 @@ intel_iommu_domain_alloc_user(struct device *dev, u32 flags)
>   	if ((flags & IOMMU_HWPT_ALLOC_NEST_PARENT) && !ecap_nest(iommu->ecap))
>   		return ERR_PTR(-EOPNOTSUPP);
>   
> +	if (enforce_dirty &&
> +	    !device_iommu_capable(dev, IOMMU_CAP_DIRTY))
> +		return ERR_PTR(-EOPNOTSUPP);
> +
>   	domain = iommu_domain_alloc(dev->bus);
>   	if (!domain)
>   		domain = ERR_PTR(-ENOMEM);
> +	if (domain && enforce_dirty)

@domain cannot be NULL here.

> +		domain->dirty_ops = &intel_dirty_ops;
>   	return domain;
>   }

The VT-d driver always uses the second level for user domain translation.
In order to avoid checks of "domain->use_first_level" in the callbacks,
how about checking it here and returning failure if the first level is
used for a user domain?


[...]
         domain = iommu_domain_alloc(dev->bus);
         if (!domain)
                 return ERR_PTR(-ENOMEM);

         if (enforce_dirty) {
                 if (to_dmar_domain(domain)->use_first_level) {
                         iommu_domain_free(domain);
                         return ERR_PTR(-EOPNOTSUPP);
                 }
                 domain->dirty_ops = &intel_dirty_ops;
         }

         return domain;

>   
> @@ -4367,6 +4375,9 @@ static bool intel_iommu_capable(struct device *dev, enum iommu_cap cap)
>   		return dmar_platform_optin();
>   	case IOMMU_CAP_ENFORCE_CACHE_COHERENCY:
>   		return ecap_sc_support(info->iommu->ecap);
> +	case IOMMU_CAP_DIRTY:
> +		return sm_supported(info->iommu) &&
> +			ecap_slads(info->iommu->ecap);

The above appears several times in this patch. Is it possible to define it
as a macro?

diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h
index bccd44db3316..379e141bbb28 100644
--- a/drivers/iommu/intel/iommu.h
+++ b/drivers/iommu/intel/iommu.h
@@ -542,6 +542,8 @@ enum {
 #define sm_supported(iommu)    (intel_iommu_sm && ecap_smts((iommu)->ecap))
  #define pasid_supported(iommu) (sm_supported(iommu) &&                 \
                                  ecap_pasid((iommu)->ecap))
+#define slads_supported(iommu) (sm_supported(iommu) &&                 \
+                                ecap_slads((iommu)->ecap))

>   	default:
>   		return false;
>   	}

Best regards,
baolu
Baolu Lu Oct. 16, 2023, 1:37 a.m. UTC | #4
On 9/23/23 9:25 AM, Joao Martins wrote:
[...]
>   
> +static int intel_iommu_set_dirty_tracking(struct iommu_domain *domain,
> +					  bool enable)
> +{
> +	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
> +	struct device_domain_info *info;
> +	int ret = -EINVAL;
> +
> +	spin_lock(&dmar_domain->lock);
> +	if (!(dmar_domain->dirty_tracking ^ enable) ||

Just out of curiosity, can we simply write

dmar_domain->dirty_tracking == enable

instead? I am not sure whether the compiler will be happy with this.
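
For what it's worth: as long as both operands are normalized to 0/1 (a bool
or a one-bit bitfield), the two forms are equivalent. A throwaway userspace
check, purely illustrative and not part of the patch:

	#include <assert.h>

	int main(void)
	{
		for (int tracking = 0; tracking <= 1; tracking++)
			for (int enable = 0; enable <= 1; enable++)
				assert(!(tracking ^ enable) ==
				       (tracking == enable));
		return 0;
	}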

> +	    list_empty(&dmar_domain->devices)) {

list_for_each_entry() is a no-op if the list is empty, so no need to check it.
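
The loop condition fails on the very first iteration for an empty list;
simplified from include/linux/list.h (current mainline form, details may
vary by kernel version):

	#define list_for_each_entry(pos, head, member)				\
		for (pos = list_first_entry(head, typeof(*pos), member);	\
		     !list_entry_is_head(pos, head, member);			\
		     pos = list_next_entry(pos, member))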

> +		spin_unlock(&dmar_domain->lock);
> +		return 0;
> +	}
> +
> +	list_for_each_entry(info, &dmar_domain->devices, link) {
> +		/* First-level page table always enables dirty bit*/
> +		if (dmar_domain->use_first_level) {

Since we leave out domain->use_first_level in the user_domain_alloc
function, we no longer need to check it here.

> +			ret = 0;
> +			break;
> +		}
> +
> +		ret = intel_pasid_setup_dirty_tracking(info->iommu, info->domain,
> +						     info->dev, IOMMU_NO_PASID,
> +						     enable);
> +		if (ret)
> +			break;

We need to unwind to the previous status here. We cannot leave some
devices with status @enable while others do not.

> +
> +	}

The VT-d driver also supports attaching a domain to a PASID of a device. We
also need to enable dirty tracking on those devices.

> +
> +	if (!ret)
> +		dmar_domain->dirty_tracking = enable;
> +	spin_unlock(&dmar_domain->lock);
> +
> +	return ret;
> +}

I have made some changes to the code based on my above comments. Please
let me know what you think.

static int intel_iommu_set_dirty_tracking(struct iommu_domain *domain,
                                          bool enable)
{
        struct dmar_domain *dmar_domain = to_dmar_domain(domain);
        struct dev_pasid_info *dev_pasid;
        struct device_domain_info *info;
        int ret;

        spin_lock(&dmar_domain->lock);
        if (!(dmar_domain->dirty_tracking ^ enable))
                goto out_unlock;

        list_for_each_entry(info, &dmar_domain->devices, link) {
                ret = intel_pasid_setup_dirty_tracking(info->iommu, dmar_domain,
                                                       info->dev, IOMMU_NO_PASID,
                                                       enable);
                if (ret)
                        goto err_unwind;
        }

        list_for_each_entry(dev_pasid, &dmar_domain->dev_pasids, link_domain) {
                info = dev_iommu_priv_get(dev_pasid->dev);
                ret = intel_pasid_setup_dirty_tracking(info->iommu, dmar_domain,
                                                       info->dev, dev_pasid->pasid,
                                                       enable);
                if (ret)
                        goto err_unwind;
        }

        dmar_domain->dirty_tracking = enable;
out_unlock:
        spin_unlock(&dmar_domain->lock);

        return 0;

err_unwind:
        list_for_each_entry(info, &dmar_domain->devices, link)
                intel_pasid_setup_dirty_tracking(info->iommu, dmar_domain,
                                                 info->dev, IOMMU_NO_PASID,
                                                 dmar_domain->dirty_tracking);
        list_for_each_entry(dev_pasid, &dmar_domain->dev_pasids, link_domain) {
                info = dev_iommu_priv_get(dev_pasid->dev);
                intel_pasid_setup_dirty_tracking(info->iommu, dmar_domain,
                                                 info->dev, dev_pasid->pasid,
                                                 dmar_domain->dirty_tracking);
        }
        spin_unlock(&dmar_domain->lock);

        return ret;
}

Best regards,
baolu
Baolu Lu Oct. 16, 2023, 2:07 a.m. UTC | #5
On 9/23/23 9:25 AM, Joao Martins wrote:
[...]
> +static int intel_iommu_read_and_clear_dirty(struct iommu_domain *domain,
> +					    unsigned long iova, size_t size,
> +					    unsigned long flags,
> +					    struct iommu_dirty_bitmap *dirty)
> +{
> +	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
> +	unsigned long end = iova + size - 1;
> +	unsigned long pgsize;
> +	bool ad_enabled;
> +
> +	spin_lock(&dmar_domain->lock);
> +	ad_enabled = dmar_domain->dirty_tracking;
> +	spin_unlock(&dmar_domain->lock);

The spin lock is to protect the RID and PASID device tracking list. No
need to use it here.

> +
> +	if (!ad_enabled && dirty->bitmap)
> +		return -EINVAL;

I don't understand the above check of "dirty->bitmap". Isn't it always
invalid to call this if dirty tracking is not enabled on the domain?

The iommu_dirty_bitmap is defined in iommu core. The iommu driver has no
need to understand it and check its member anyway.

Or did I overlook anything?

> +
> +	rcu_read_lock();

Do we really need an rcu lock here? This operation is protected by
iopt->iova_rwsem. Is it reasonable to remove it? If not, how about putting
some comments around it?

> +	do {
> +		struct dma_pte *pte;
> +		int lvl = 0;
> +
> +		pte = pfn_to_dma_pte(dmar_domain, iova >> VTD_PAGE_SHIFT, &lvl,
> +				     GFP_ATOMIC);
> +		pgsize = level_size(lvl) << VTD_PAGE_SHIFT;
> +		if (!pte || !dma_pte_present(pte)) {
> +			iova += pgsize;
> +			continue;
> +		}
> +
> +		/* It is writable, set the bitmap */
> +		if (((flags & IOMMU_DIRTY_NO_CLEAR) &&
> +				dma_sl_pte_dirty(pte)) ||
> +		    dma_sl_pte_test_and_clear_dirty(pte))
> +			iommu_dirty_bitmap_record(dirty, iova, pgsize);
> +		iova += pgsize;
> +	} while (iova < end);
> +	rcu_read_unlock();
> +
> +	return 0;
> +}

Best regards,
baolu
Baolu Lu Oct. 16, 2023, 2:21 a.m. UTC | #6
On 9/23/23 9:25 AM, Joao Martins wrote:
[...]
> +/*
> + * Set up dirty tracking on a second only translation type.

Set up dirty tracking on a second only or nested translation type.

> + */
> +int intel_pasid_setup_dirty_tracking(struct intel_iommu *iommu,
> +				     struct dmar_domain *domain,
> +				     struct device *dev, u32 pasid,
> +				     bool enabled)
> +{
> +	struct pasid_entry *pte;
> +	u16 did, pgtt;
> +
> +	spin_lock(&iommu->lock);
> +
> +	did = domain_id_iommu(domain, iommu);
> +	pte = intel_pasid_get_entry(dev, pasid);
> +	if (!pte) {
> +		spin_unlock(&iommu->lock);
> +		dev_err(dev, "Failed to get pasid entry of PASID %d\n", pasid);

Use dev_err_ratelimited() to avoid user DOS attack.

> +		return -ENODEV;
> +	}

Can we add a check to limit this interface to second-only and nested
translation types? These are the only valid use cases currently and for
the foreseeable future.

And, return directly if the pasid bit matches the target state.

[...]
        spin_lock(&iommu->lock);
        pte = intel_pasid_get_entry(dev, pasid);
        if (!pte) {
                spin_unlock(&iommu->lock);
                dev_err_ratelimited(dev, "Failed to get pasid entry of PASID %d\n", pasid);
                return -ENODEV;
        }

        did = domain_id_iommu(domain, iommu);
        pgtt = pasid_pte_get_pgtt(pte);

        if (pgtt != PASID_ENTRY_PGTT_SL_ONLY && pgtt != PASID_ENTRY_PGTT_NESTED) {
                spin_unlock(&iommu->lock);
                dev_err_ratelimited(dev,
                                    "Dirty tracking not supported on translation type %d\n",
                                    pgtt);
                return -EOPNOTSUPP;
        }

        if (pasid_get_ssade(pte) == enabled) {
                spin_unlock(&iommu->lock);
                return 0;
        }

        if (enabled)
                pasid_set_ssade(pte);
        else
                pasid_clear_ssade(pte);
        spin_unlock(&iommu->lock);
[...]

> +
> +	pgtt = pasid_pte_get_pgtt(pte);
> +
> +	if (enabled)
> +		pasid_set_ssade(pte);
> +	else
> +		pasid_clear_ssade(pte);
> +	spin_unlock(&iommu->lock);


Add below here:

         if (!ecap_coherent(iommu->ecap))
                 clflush_cache_range(pte, sizeof(*pte));

> +
> +	/*
> +	 * From VT-d spec table 25 "Guidance to Software for Invalidations":
> +	 *
> +	 * - PASID-selective-within-Domain PASID-cache invalidation
> +	 *   If (PGTT=SS or Nested)
> +	 *    - Domain-selective IOTLB invalidation
> +	 *   Else
> +	 *    - PASID-selective PASID-based IOTLB invalidation
> +	 * - If (pasid is RID_PASID)
> +	 *    - Global Device-TLB invalidation to affected functions
> +	 *   Else
> +	 *    - PASID-based Device-TLB invalidation (with S=1 and
> +	 *      Addr[63:12]=0x7FFFFFFF_FFFFF) to affected functions
> +	 */
> +	pasid_cache_invalidation_with_pasid(iommu, did, pasid);
> +
> +	if (pgtt == PASID_ENTRY_PGTT_SL_ONLY || pgtt == PASID_ENTRY_PGTT_NESTED)
> +		iommu->flush.flush_iotlb(iommu, did, 0, 0, DMA_TLB_DSI_FLUSH);
> +	else
> +		qi_flush_piotlb(iommu, did, pasid, 0, -1, 0);

Only "Domain-selective IOTLB invalidation" is needed here.
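
That is, given the pgtt check above already restricts this path to
second-stage or nested entries, the flush sequence reduces to (a sketch
using the same calls as the patch, minus the qi_flush_piotlb() branch):

	pasid_cache_invalidation_with_pasid(iommu, did, pasid);
	iommu->flush.flush_iotlb(iommu, did, 0, 0, DMA_TLB_DSI_FLUSH);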

> +
> +	/* Device IOTLB doesn't need to be flushed in caching mode. */
> +	if (!cap_caching_mode(iommu->cap))
> +		devtlb_invalidation_with_pasid(iommu, dev, pasid);

For the device IOTLB invalidation, we need to follow what the spec requires.

   If (pasid is RID_PASID)
    - Global Device-TLB invalidation to affected functions
   Else
    - PASID-based Device-TLB invalidation (with S=1 and
      Addr[63:12]=0x7FFFFFFF_FFFFF) to affected functions

> +
> +	return 0;
> +}
> +

Best regards,
baolu
Baolu Lu Oct. 16, 2023, 2:26 a.m. UTC | #7
On 9/25/23 5:08 PM, Joao Martins wrote:
> 
> 
> On 25/09/2023 08:01, Baolu Lu wrote:
>> On 9/23/23 9:25 AM, Joao Martins wrote:
>>> The IOMMU advertises Access/Dirty bits for the second-stage page table if
>>> the extended capability DMAR register reports it (ECAP, mnemonic
>>> ECAP.SSADS). The first-stage table is compatible with the CPU page table,
>>> thus A/D bits are implicitly supported. Relevant Intel IOMMU SDM refs:
>>> first-stage table "3.6.2 Accessed, Extended Accessed, and Dirty Flags" and
>>> second-stage table "3.7.2 Accessed and Dirty Flags".
>>>
>>> The first-stage page table is enabled by default and its A/D bits are always
>>> on, so enabling dirty tracking there needs no control bits and simply
>>> returns 0. To use SSADS, set bit 9 (SSADE) in the scalable-mode PASID table
>>> entry and flush the IOTLB via pasid_flush_caches() following the manual.
>>> Relevant SDM refs:
>>>
>>> "3.7.2 Accessed and Dirty Flags"
>>> "6.5.3.3 Guidance to Software for Invalidations,
>>>    Table 23. Guidance to Software for Invalidations"
>>>
>>> The PTE dirty bit is bit 9, and it is cached in the IOTLB, so flush the
>>> IOTLB to make sure the IOMMU attempts to set the dirty bit again. Note that
>>> iommu_dirty_bitmap_record() will add the IOVA to iotlb_gather, and thus the
>>> caller of the iommu op will flush the IOTLB. The relevant manual coverage of
>>> the hardware translation is chapter 6, with special mention of:
>>>
>>> "6.2.3.1 Scalable-Mode PASID-Table Entry Programming Considerations"
>>> "6.2.4 IOTLB"
>>>
>>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>>> ---
>>> The IOPTE walker is still a bit inefficient. Making sure the UAPI/IOMMUFD is
>>> solid and agreed upon.
>>> ---
>>>    drivers/iommu/intel/iommu.c | 94 +++++++++++++++++++++++++++++++++++++
>>>    drivers/iommu/intel/iommu.h | 15 ++++++
>>>    drivers/iommu/intel/pasid.c | 94 +++++++++++++++++++++++++++++++++++++
>>>    drivers/iommu/intel/pasid.h |  4 ++
>>>    4 files changed, 207 insertions(+)
>>
>> The code is probably incomplete. When attaching a domain to a device,
>> check the domain's dirty tracking capability against the device's
>> capabilities. If the domain's dirty tracking capability is set but the
>> device does not support it, the attach callback should return -EINVAL.
>>
> Yeap, I did that for AMD, but it seems in the mix of changes I may have
> deleted it and then forgotten to include it here.
> 
> Here's what I added (together with consolidated cap checking):
> 
> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> index 7d5a8f5283a7..fabfe363f1f9 100644
> --- a/drivers/iommu/intel/iommu.c
> +++ b/drivers/iommu/intel/iommu.c
> @@ -4075,6 +4075,11 @@ static struct iommu_domain *intel_iommu_domain_alloc(unsigned type)
>          return NULL;
>   }
> 
> +static bool intel_iommu_slads_supported(struct intel_iommu *iommu)
> +{
> +       return sm_supported(iommu) && ecap_slads(iommu->ecap);
> +}
> +
>   static struct iommu_domain *
>   intel_iommu_domain_alloc_user(struct device *dev, u32 flags)
>   {
> @@ -4090,7 +4095,7 @@ intel_iommu_domain_alloc_user(struct device *dev, u32 flags)
>                  return ERR_PTR(-EOPNOTSUPP);
> 
>          if (enforce_dirty &&
> -           !device_iommu_capable(dev, IOMMU_CAP_DIRTY))
> +           !intel_iommu_slads_supported(iommu))
>                  return ERR_PTR(-EOPNOTSUPP);
> 
>          domain = iommu_domain_alloc(dev->bus);
> @@ -4121,6 +4126,9 @@ static int prepare_domain_attach_device(struct iommu_domain *domain,
>          if (dmar_domain->force_snooping && !ecap_sc_support(iommu->ecap))
>                  return -EINVAL;
> 
> +       if (domain->dirty_ops && !intel_iommu_slads_supported(iommu))
> +               return -EINVAL;
> +
>          /* check if this iommu agaw is sufficient for max mapped address */
>          addr_width = agaw_to_width(iommu->agaw);
>          if (addr_width > cap_mgaw(iommu->cap))
> @@ -4376,8 +4384,7 @@ static bool intel_iommu_capable(struct device *dev, enum iommu_cap cap)
>          case IOMMU_CAP_ENFORCE_CACHE_COHERENCY:
>                  return ecap_sc_support(info->iommu->ecap);
>          case IOMMU_CAP_DIRTY:
> -               return sm_supported(info->iommu) &&
> -                       ecap_slads(info->iommu->ecap);
> +               return intel_iommu_slads_supported(info->iommu);
>          default:
>                  return false;
>          }

Yes. Above additional change looks good to me.

Best regards,
baolu
Joao Martins Oct. 16, 2023, 10:42 a.m. UTC | #8
On 16/10/2023 01:51, Baolu Lu wrote:
> On 9/23/23 9:25 AM, Joao Martins wrote:
>> --- a/drivers/iommu/intel/iommu.c
>> +++ b/drivers/iommu/intel/iommu.c
>> @@ -300,6 +300,7 @@ static int iommu_skip_te_disable;
>>   #define IDENTMAP_AZALIA        4
>>     const struct iommu_ops intel_iommu_ops;
>> +const struct iommu_dirty_ops intel_dirty_ops;
>>     static bool translation_pre_enabled(struct intel_iommu *iommu)
>>   {
>> @@ -4077,6 +4078,7 @@ static struct iommu_domain *intel_iommu_domain_alloc(unsigned type)
>>   static struct iommu_domain *
>>   intel_iommu_domain_alloc_user(struct device *dev, u32 flags)
>>   {
>> +    bool enforce_dirty = (flags & IOMMU_HWPT_ALLOC_ENFORCE_DIRTY);
>>       struct iommu_domain *domain;
>>       struct intel_iommu *iommu;
>> @@ -4087,9 +4089,15 @@ intel_iommu_domain_alloc_user(struct device *dev, u32 flags)
>>       if ((flags & IOMMU_HWPT_ALLOC_NEST_PARENT) && !ecap_nest(iommu->ecap))
>>           return ERR_PTR(-EOPNOTSUPP);
>>   +    if (enforce_dirty &&
>> +        !device_iommu_capable(dev, IOMMU_CAP_DIRTY))
>> +        return ERR_PTR(-EOPNOTSUPP);
>> +
>>       domain = iommu_domain_alloc(dev->bus);
>>       if (!domain)
>>           domain = ERR_PTR(-ENOMEM);
>> +    if (domain && enforce_dirty)
> 
> @domain cannot be NULL here.
> 
True, it should be:

if (!IS_ERR(domain) && enforce_dirty)

>> +        domain->dirty_ops = &intel_dirty_ops;
>>       return domain;
>>   }
> 
> The VT-d driver always uses the second level for user domain translation.
> In order to avoid checks of "domain->use_first_level" in the callbacks,
> how about checking it here and returning failure if the first level is
> used for a user domain?
> 

I was told by Yi Y Sun offlist to have the first_level checked, because the
dirty bit in the first-stage page table is always enabled (and cannot be
toggled on/off). I can remove it again; the initial RFC didn't have it, as it
was failing in a similar way to what you suggest here. Not sure how to proceed?

> 
> [...]
>         domain = iommu_domain_alloc(dev->bus);
>         if (!domain)
>                 return ERR_PTR(-ENOMEM);
> 
>         if (enforce_dirty) {
>                 if (to_dmar_domain(domain)->use_first_level) {
>                         iommu_domain_free(domain);
>                         return ERR_PTR(-EOPNOTSUPP);
>                 }
>                 domain->dirty_ops = &intel_dirty_ops;
>         }
> 
>         return domain;
> 

Should I fail on first-level page table dirty enforcement, I will certainly
adopt the above (and remove the successful return on first_level in
set_dirty_tracking).

>> @@ -4367,6 +4375,9 @@ static bool intel_iommu_capable(struct device *dev, enum iommu_cap cap)
>>           return dmar_platform_optin();
>>       case IOMMU_CAP_ENFORCE_CACHE_COHERENCY:
>>           return ecap_sc_support(info->iommu->ecap);
>> +    case IOMMU_CAP_DIRTY:
>> +        return sm_supported(info->iommu) &&
>> +            ecap_slads(info->iommu->ecap);
> 
> The above appears several times in this patch. Is it possible to define it
> as a macro?
> 
Yeap, for sure much cleaner indeed.

> diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h
> index bccd44db3316..379e141bbb28 100644
> --- a/drivers/iommu/intel/iommu.h
> +++ b/drivers/iommu/intel/iommu.h
> @@ -542,6 +542,8 @@ enum {
>  #define sm_supported(iommu)    (intel_iommu_sm && ecap_smts((iommu)->ecap))
>  #define pasid_supported(iommu) (sm_supported(iommu) &&                 \
>                                  ecap_pasid((iommu)->ecap))
> +#define slads_supported(iommu) (sm_supported(iommu) &&                 \
> +                                ecap_slads((iommu)->ecap))
> 
Yeap.

>>       default:
>>           return false;
>>       }
> 
> Best regards,
> baolu
Joao Martins Oct. 16, 2023, 10:57 a.m. UTC | #9
On 16/10/2023 02:37, Baolu Lu wrote:
> On 9/23/23 9:25 AM, Joao Martins wrote:
> [...]
>>   +static int intel_iommu_set_dirty_tracking(struct iommu_domain *domain,
>> +                      bool enable)
>> +{
>> +    struct dmar_domain *dmar_domain = to_dmar_domain(domain);
>> +    struct device_domain_info *info;
>> +    int ret = -EINVAL;
>> +
>> +    spin_lock(&dmar_domain->lock);
>> +    if (!(dmar_domain->dirty_tracking ^ enable) ||
> 
> Just out of curiosity, can we simply write
> 
> dmar_domain->dirty_tracking == enable
> 
> instead? I am not sure whether the compiler will be happy with this.
> 

Part of the check above was just trying to avoid same-state transitions; the
above you wrote should work (...)

>> +        list_empty(&dmar_domain->devices)) {
> 
>  list_for_each_entry() is a no-op if the list is empty, so no need to check it.
> 

Though this is unnecessary, yes.

>> +        spin_unlock(&dmar_domain->lock);
>> +        return 0;
>> +    }
>> +
>> +    list_for_each_entry(info, &dmar_domain->devices, link) {
>> +        /* First-level page table always enables dirty bit*/
>> +        if (dmar_domain->use_first_level) {
> 
> Since we leave out domain->use_first_level in the user_domain_alloc
> function, we no longer need to check it here.
> 
>> +            ret = 0;
>> +            break;
>> +        }
>> +
>> +        ret = intel_pasid_setup_dirty_tracking(info->iommu, info->domain,
>> +                             info->dev, IOMMU_NO_PASID,
>> +                             enable);
>> +        if (ret)
>> +            break;
> 
> We need to unwind to the previous status here. We cannot leave some
> devices with status @enable while others do not.
>
Ugh, yes

>> +
>> +    }
> 
> The VT-d driver also supports attaching a domain to a PASID of a device. We
> also need to enable dirty tracking on those devices.
> 
True. But to be honest, I thought we weren't quite there yet in PASID support
from the IOMMUFD perspective; hence why I didn't aim at it. Or do I have the
wrong impression? From the code below, it clearly looks like the driver does.

>> +
>> +    if (!ret)
>> +        dmar_domain->dirty_tracking = enable;
>> +    spin_unlock(&dmar_domain->lock);
>> +
>> +    return ret;
>> +}
> 
> I have made some changes to the code based on my above comments. Please
> let me know what you think.
> 

Looks great; thanks a lot for these changes!

> static int intel_iommu_set_dirty_tracking(struct iommu_domain *domain,
>                                           bool enable)
> {
>         struct dmar_domain *dmar_domain = to_dmar_domain(domain);
>         struct dev_pasid_info *dev_pasid;
>         struct device_domain_info *info;
>         int ret;
> 
>         spin_lock(&dmar_domain->lock);
>         if (!(dmar_domain->dirty_tracking ^ enable))
>                 goto out_unlock;
> 
>         list_for_each_entry(info, &dmar_domain->devices, link) {
>                 ret = intel_pasid_setup_dirty_tracking(info->iommu, dmar_domain,
>                                                        info->dev, IOMMU_NO_PASID,
>                                                        enable);
>                 if (ret)
>                         goto err_unwind;
>         }
> 
>         list_for_each_entry(dev_pasid, &dmar_domain->dev_pasids, link_domain) {
>                 info = dev_iommu_priv_get(dev_pasid->dev);
>                 ret = intel_pasid_setup_dirty_tracking(info->iommu, dmar_domain,
>                                                        info->dev, dev_pasid->pasid,
>                                                        enable);
>                 if (ret)
>                         goto err_unwind;
>         }
> 
>         dmar_domain->dirty_tracking = enable;
> out_unlock:
>         spin_unlock(&dmar_domain->lock);
> 
>         return 0;
> 
> err_unwind:
>         list_for_each_entry(info, &dmar_domain->devices, link)
>                 intel_pasid_setup_dirty_tracking(info->iommu, dmar_domain,
>                                                  info->dev, IOMMU_NO_PASID,
>                                                  dmar_domain->dirty_tracking);
>         list_for_each_entry(dev_pasid, &dmar_domain->dev_pasids, link_domain) {
>                 info = dev_iommu_priv_get(dev_pasid->dev);
>                 intel_pasid_setup_dirty_tracking(info->iommu, dmar_domain,
>                                                  info->dev, dev_pasid->pasid,
>                                                  dmar_domain->dirty_tracking);
>         }
>         }
>         spin_unlock(&dmar_domain->lock);
> 
>         return ret;
> }
> 
> Best regards,
> baolu
Joao Martins Oct. 16, 2023, 11:26 a.m. UTC | #10
On 16/10/2023 03:07, Baolu Lu wrote:
> On 9/23/23 9:25 AM, Joao Martins wrote:
> [...]
>> +static int intel_iommu_read_and_clear_dirty(struct iommu_domain *domain,
>> +                        unsigned long iova, size_t size,
>> +                        unsigned long flags,
>> +                        struct iommu_dirty_bitmap *dirty)
>> +{
>> +    struct dmar_domain *dmar_domain = to_dmar_domain(domain);
>> +    unsigned long end = iova + size - 1;
>> +    unsigned long pgsize;
>> +    bool ad_enabled;
>> +
>> +    spin_lock(&dmar_domain->lock);
>> +    ad_enabled = dmar_domain->dirty_tracking;
>> +    spin_unlock(&dmar_domain->lock);
> 
> The spin lock is to protect the RID and PASID device tracking list. No
> need to use it here.
>
OK

>> +
>> +    if (!ad_enabled && dirty->bitmap)
>> +        return -EINVAL;
> 
> I don't understand the above check of "dirty->bitmap". Isn't it always
> invalid to call this if dirty tracking is not enabled on the domain?
> 
It is spurious (...)

> The iommu_dirty_bitmap is defined in iommu core. The iommu driver has no
> need to understand it and check its member anyway.
> 

(...) The iommu driver has no need to understand it. iommu_dirty_bitmap_record()
already makes those checks in case there's no iova_bitmap to set bits to.
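
For reference, a sketch of what that helper does, assuming the bitmap and
gather fields that this series gives struct iommu_dirty_bitmap (the exact
v3 body may differ):

	static inline void
	iommu_dirty_bitmap_record(struct iommu_dirty_bitmap *dirty,
				  unsigned long iova, unsigned long length)
	{
		if (dirty->bitmap)
			iova_bitmap_set(dirty->bitmap, iova, length);

		if (dirty->gather)
			iommu_iotlb_gather_add_range(dirty->gather, iova, length);
	}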

> Or did I overlook anything?
> 
>> +
>> +    rcu_read_lock();
> 
> Do we really need an rcu lock here? This operation is protected by
> iopt->iova_rwsem. Is it reasonable to remove it? If not, how about putting
> some comments around it?
> 
As I had mentioned in an earlier comment, this was not meant to be here.

>> +    do {
>> +        struct dma_pte *pte;
>> +        int lvl = 0;
>> +
>> +        pte = pfn_to_dma_pte(dmar_domain, iova >> VTD_PAGE_SHIFT, &lvl,
>> +                     GFP_ATOMIC);
>> +        pgsize = level_size(lvl) << VTD_PAGE_SHIFT;
>> +        if (!pte || !dma_pte_present(pte)) {
>> +            iova += pgsize;
>> +            continue;
>> +        }
>> +
>> +        /* It is writable, set the bitmap */
>> +        if (((flags & IOMMU_DIRTY_NO_CLEAR) &&
>> +                dma_sl_pte_dirty(pte)) ||
>> +            dma_sl_pte_test_and_clear_dirty(pte))
>> +            iommu_dirty_bitmap_record(dirty, iova, pgsize);
>> +        iova += pgsize;
>> +    } while (iova < end);
>> +    rcu_read_unlock();
>> +
>> +    return 0;
>> +}
> 
> Best regards,
> baolu
Joao Martins Oct. 16, 2023, 11:39 a.m. UTC | #11
On 16/10/2023 03:21, Baolu Lu wrote:
> On 9/23/23 9:25 AM, Joao Martins wrote:
> [...]
>> +/*
>> + * Set up dirty tracking on a second only translation type.
> 
> Set up dirty tracking on a second only or nested translation type.
> 
>> + */
>> +int intel_pasid_setup_dirty_tracking(struct intel_iommu *iommu,
>> +                     struct dmar_domain *domain,
>> +                     struct device *dev, u32 pasid,
>> +                     bool enabled)
>> +{
>> +    struct pasid_entry *pte;
>> +    u16 did, pgtt;
>> +
>> +    spin_lock(&iommu->lock);
>> +
>> +    did = domain_id_iommu(domain, iommu);
>> +    pte = intel_pasid_get_entry(dev, pasid);
>> +    if (!pte) {
>> +        spin_unlock(&iommu->lock);
>> +        dev_err(dev, "Failed to get pasid entry of PASID %d\n", pasid);
> 
> Use dev_err_ratelimited() to avoid user DOS attack.
> 
OK

>> +        return -ENODEV;
>> +    }
> 
> Can we add a check to limit this interface to second-only and nested
> translation types? These are the only valid use cases currently and for
> the foreseeable future.
> 
OK.

> And, return directly if the pasid bit matches the target state.
> 

OK

> [...]
>         spin_lock(&iommu->lock);
>         pte = intel_pasid_get_entry(dev, pasid);
>         if (!pte) {
>                 spin_unlock(&iommu->lock);
>                 dev_err_ratelimited(dev, "Failed to get pasid entry of PASID %d\n", pasid);
>                 return -ENODEV;
>         }
> 
>         did = domain_id_iommu(domain, iommu);
>         pgtt = pasid_pte_get_pgtt(pte);
> 
>         if (pgtt != PASID_ENTRY_PGTT_SL_ONLY && pgtt != PASID_ENTRY_PGTT_NESTED) {
>                 spin_unlock(&iommu->lock);
>                 dev_err_ratelimited(dev,
>                                     "Dirty tracking not supported on translation type %d\n",
>                                     pgtt);
>                 return -EOPNOTSUPP;
>         }
> 
>         if (pasid_get_ssade(pte) == enabled) {
>                 spin_unlock(&iommu->lock);
>                 return 0;
>         }
> 
>         if (enabled)
>                 pasid_set_ssade(pte);
>         else
>                 pasid_clear_ssade(pte);
>         spin_unlock(&iommu->lock);
> [...]
> 

OK

>> +
>> +    pgtt = pasid_pte_get_pgtt(pte);
>> +
>> +    if (enabled)
>> +        pasid_set_ssade(pte);
>> +    else
>> +        pasid_clear_ssade(pte);
>> +    spin_unlock(&iommu->lock);
> 
> 
> Add below here:
> 
>         if (!ecap_coherent(iommu->ecap))
>                 clflush_cache_range(pte, sizeof(*pte));
> 
Got it

>> +
>> +    /*
>> +     * From VT-d spec table 25 "Guidance to Software for Invalidations":
>> +     *
>> +     * - PASID-selective-within-Domain PASID-cache invalidation
>> +     *   If (PGTT=SS or Nested)
>> +     *    - Domain-selective IOTLB invalidation
>> +     *   Else
>> +     *    - PASID-selective PASID-based IOTLB invalidation
>> +     * - If (pasid is RID_PASID)
>> +     *    - Global Device-TLB invalidation to affected functions
>> +     *   Else
>> +     *    - PASID-based Device-TLB invalidation (with S=1 and
>> +     *      Addr[63:12]=0x7FFFFFFF_FFFFF) to affected functions
>> +     */
>> +    pasid_cache_invalidation_with_pasid(iommu, did, pasid);
>> +
>> +    if (pgtt == PASID_ENTRY_PGTT_SL_ONLY || pgtt == PASID_ENTRY_PGTT_NESTED)
>> +        iommu->flush.flush_iotlb(iommu, did, 0, 0, DMA_TLB_DSI_FLUSH);
>> +    else
>> +        qi_flush_piotlb(iommu, did, pasid, 0, -1, 0);
> 
> Only "Domain-selective IOTLB invalidation" is needed here.
> 
Will delete the qi_flush_piotlb() then.

>> +
>> +    /* Device IOTLB doesn't need to be flushed in caching mode. */
>> +    if (!cap_caching_mode(iommu->cap))
>> +        devtlb_invalidation_with_pasid(iommu, dev, pasid);
> 
> For the device IOTLB invalidation, we need to follow what the spec requires.
> 
>   If (pasid is RID_PASID)
>    - Global Device-TLB invalidation to affected functions
>   Else
>    - PASID-based Device-TLB invalidation (with S=1 and
>      Addr[63:12]=0x7FFFFFFF_FFFFF) to affected functions
> 
devtlb_invalidation_with_pasid() underneath does:

	if (pasid == PASID_RID2PASID)
		qi_flush_dev_iotlb(iommu, sid, pfsid, qdep, 0, 64 - VTD_PAGE_SHIFT);
	else
		qi_flush_dev_iotlb_pasid(iommu, sid, pfsid, pasid, qdep, 0, 64 - VTD_PAGE_SHIFT);

... Which is what the spec suggests (IIUC).
Should I read your comment above as saying to drop the cap_caching_mode(iommu->cap) check?
Jason Gunthorpe Oct. 16, 2023, 11:42 a.m. UTC | #12
On Mon, Oct 16, 2023 at 11:57:34AM +0100, Joao Martins wrote:

> True. But to be honest, I thought we weren't quite there yet in PASID support
> from the IOMMUFD perspective; hence why I didn't aim at it. Or do I have the
> wrong impression? From the code below, it clearly looks like the driver does.

I think we should plan that this series will go before the PASID
series.

Jason
Baolu Lu Oct. 16, 2023, 12:41 p.m. UTC | #13
On 2023/10/16 18:42, Joao Martins wrote:
>>> +        domain->dirty_ops = &intel_dirty_ops;
>>>        return domain;
>>>    }
>> The VT-d driver always uses the second level for user domain translation.
>> In order to avoid checks of "domain->use_first_level" in the callbacks,
>> how about checking it here and returning failure if the first level is
>> used for a user domain?
>>
> I was told by Yi Y Sun offlist to have the first_level checked, because the
> dirty bit in the first-stage page table is always enabled (and cannot be
> toggled on/off). I can remove it again; the initial RFC didn't have it, as it
> was failing in a similar way to what you suggest here. Not sure how to proceed?

Yi was right. But we currently have no use case for dirty tracking in
the first-level page table. So let's start by only supporting it in
the second-level page table.

If we later identify a use case for dirty tracking in the first level
page table, we can then add the code with appropriate testing efforts.

Best regards,
baolu
Baolu Lu Oct. 16, 2023, 12:58 p.m. UTC | #14
On 2023/10/16 19:42, Jason Gunthorpe wrote:
> On Mon, Oct 16, 2023 at 11:57:34AM +0100, Joao Martins wrote:
> 
>> True. But to be honest, I thought we weren't quite there yet in PASID support
>> from the IOMMUFD perspective; hence why I didn't aim at it. Or do I have the
>> wrong impression? From the code below, it clearly looks like the driver does.
> I think we should plan that this series will go before the PASID
> series

I know that PASID support in IOMMUFD is not yet available, but the VT-d
driver already supports attaching a domain to a PASID, as required by
the idxd driver for kernel DMA with PASID. Therefore, from the driver's
perspective, dirty tracking should also be enabled for PASIDs.

However, I am also fine with deferring this code until PASID support is
added to IOMMUFD. In that case, it's better to add a comment to remind
people to revisit this issue later.

Best regards,
baolu
Jason Gunthorpe Oct. 16, 2023, 12:59 p.m. UTC | #15
On Mon, Oct 16, 2023 at 08:58:42PM +0800, Baolu Lu wrote:
> On 2023/10/16 19:42, Jason Gunthorpe wrote:
> > On Mon, Oct 16, 2023 at 11:57:34AM +0100, Joao Martins wrote:
> > 
> > > True. But to be honest, I thought we weren't quite there yet in PASID support
> > > from the IOMMUFD perspective; hence why I didn't aim at it. Or do I have the
> > > wrong impression? From the code below, it clearly looks like the driver does.
> > I think we should plan that this series will go before the PASID
> > series
> 
> I know that PASID support in IOMMUFD is not yet available, but the VT-d
> driver already supports attaching a domain to a PASID, as required by
> the idxd driver for kernel DMA with PASID. Therefore, from the driver's
> perspective, dirty tracking should also be enabled for PASIDs.

As long as the driver refuses to attach a dirty-tracking-enabled domain
to a PASID it would be fine for now.

Jason
Baolu Lu Oct. 16, 2023, 1:01 p.m. UTC | #16
On 2023/10/16 20:59, Jason Gunthorpe wrote:
> On Mon, Oct 16, 2023 at 08:58:42PM +0800, Baolu Lu wrote:
>> On 2023/10/16 19:42, Jason Gunthorpe wrote:
>>> On Mon, Oct 16, 2023 at 11:57:34AM +0100, Joao Martins wrote:
>>>
>>>> True. But to be honest, I thought we weren't quite there yet in PASID support
>>>> from the IOMMUFD perspective; hence why I didn't aim at it. Or do I have the
>>>> wrong impression? From the code below, it clearly looks like the driver does.
>>> I think we should plan that this series will go before the PASID
>>> series
>> I know that PASID support in IOMMUFD is not yet available, but the VT-d
>> driver already supports attaching a domain to a PASID, as required by
>> the idxd driver for kernel DMA with PASID. Therefore, from the driver's
>> perspective, dirty tracking should also be enabled for PASIDs.
> As long as the driver refuses to attach a dirty-tracking-enabled domain
> to a PASID it would be fine for now.

Yes. This works.

Best regards,
baolu
Baolu Lu Oct. 16, 2023, 1:06 p.m. UTC | #17
On 2023/10/16 19:39, Joao Martins wrote:
>>> +    /* Device IOTLB doesn't need to be flushed in caching mode. */
>>> +    if (!cap_caching_mode(iommu->cap))
>>> +        devtlb_invalidation_with_pasid(iommu, dev, pasid);
>> For the device IOTLB invalidation, we need to follow what the spec requires.
>>
>>    If (pasid is RID_PASID)
>>     - Global Device-TLB invalidation to affected functions
>>    Else
>>     - PASID-based Device-TLB invalidation (with S=1 and
>>       Addr[63:12]=0x7FFFFFFF_FFFFF) to affected functions
>>
> devtlb_invalidation_with_pasid() underneath does:
> 
> 	if (pasid == PASID_RID2PASID)
> 		qi_flush_dev_iotlb(iommu, sid, pfsid, qdep, 0, 64 - VTD_PAGE_SHIFT);
> 	else
> 		qi_flush_dev_iotlb_pasid(iommu, sid, pfsid, pasid, qdep, 0, 64 - VTD_PAGE_SHIFT);
> 
> ... Which is what the spec suggests (IIUC).

Ah! I overlooked this. Sorry about it.

> Should I read your comment above as saying to drop the cap_caching_mode(iommu->cap) check?

No. Your code is fine.

Best regards,
baolu
Joao Martins Oct. 16, 2023, 4 p.m. UTC | #18
On 16/10/2023 12:26, Joao Martins wrote:
> On 16/10/2023 03:07, Baolu Lu wrote:
>> On 9/23/23 9:25 AM, Joao Martins wrote:
>>> +
>>> +    if (!ad_enabled && dirty->bitmap)
>>> +        return -EINVAL;
>>
>> I don't understand the above check of "dirty->bitmap". Isn't it always
>> invalid to call this if dirty tracking is not enabled on the domain?
>>
> It is spurious (...)
> 

I take this back;

>> The iommu_dirty_bitmap is defined in iommu core. The iommu driver has no
>> need to understand it and check its member anyway.
>>
> 
> (...) The iommu driver has no need to understand it. iommu_dirty_bitmap_record()
> already makes those checks in case there's no iova_bitmap to set bits to.
> 

This is all true, but the reason I am checking iommu_dirty_bitmap::bitmap is to
essentially not record anything in the iova bitmap and just clear the dirty bits
from the IOPTEs, all when dirty tracking is technically disabled. This is done
internally only when starting dirty tracking, and thus ensures that we clean up
all dirty bits before we enable dirty tracking, to have a consistent snapshot as
opposed to inheriting dirty bits from the past.
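
As a sketch of that caller side, assuming the IOMMUFD core passes a NULL
iova_bitmap to iommu_dirty_bitmap_init() when it wants a pure clear, with
domain, iova and length as in the surrounding code (hypothetical usage, not
the exact IOMMUFD core code):

	struct iommu_iotlb_gather gather;
	struct iommu_dirty_bitmap dirty;

	iommu_iotlb_gather_init(&gather);
	/* NULL bitmap: record nothing, only clear IOPTE dirty bits */
	iommu_dirty_bitmap_init(&dirty, NULL, &gather);
	domain->dirty_ops->read_and_clear_dirty(domain, iova, length, 0, &dirty);
	iommu_iotlb_sync(domain, &gather);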

Some alternative ways to do this: 1) via the iommu_dirty_bitmap structure, where
I add one field which, if true, lets the iommufd core call into the iommu driver
in a "clear IOPTE" manner, or 2) via the ::flags ... the thing is that the
::flags values are UAPI, so it feels weird to use these flags for internal purposes.

	Joao
Baolu Lu Oct. 17, 2023, 2:08 a.m. UTC | #19
On 10/17/23 12:00 AM, Joao Martins wrote:
>>> The iommu_dirty_bitmap is defined in iommu core. The iommu driver has no
>>> need to understand it and check its member anyway.
>>>
>> (...) The iommu driver has no need to understand it. iommu_dirty_bitmap_record()
>> already makes those checks in case there's no iova_bitmap to set bits to.
>>
> This is all true, but the reason I am checking iommu_dirty_bitmap::bitmap is to
> essentially not record anything in the iova bitmap and just clear the dirty bits
> from the IOPTEs, all when dirty tracking is technically disabled. This is done
> internally only when starting dirty tracking, and thus ensures that we clean up
> all dirty bits before we enable dirty tracking, to have a consistent snapshot as
> opposed to inheriting dirty bits from the past.

It's okay since it serves a functional purpose. Can you please add some
comments around the code to explain the rationale.

Best regards,
baolu
Joao Martins Oct. 17, 2023, 10:51 a.m. UTC | #20
On 16/10/2023 14:01, Baolu Lu wrote:
> On 2023/10/16 20:59, Jason Gunthorpe wrote:
>> On Mon, Oct 16, 2023 at 08:58:42PM +0800, Baolu Lu wrote:
>>> On 2023/10/16 19:42, Jason Gunthorpe wrote:
>>>> On Mon, Oct 16, 2023 at 11:57:34AM +0100, Joao Martins wrote:
>>>>
>>>>> True. But to be honest, I thought we weren't quite there yet in PASID support
>>>>> from IOMMUFD perspective; hence why I didn't aim at it. Or do I have the wrong
>>>>> impression? From the code below, it clearly looks the driver does.
>>>> I think we should plan that this series will go before the PASID
>>>> series
>>> I know that PASID support in IOMMUFD is not yet available, but the VT-d
>>> driver already supports attaching a domain to a PASID, as required by
>>> the idxd driver for kernel DMA with PASID. Therefore, from the driver's
>>> perspective, dirty tracking should also be enabled for PASIDs.
>> As long as the driver refuses to attach a dirty track enabled domain
>> to PASID it would be fine for now.
> 
> Yes. This works.

Baolu Lu, I am blocking PASID attachment this way; let me know if this matches
how you would have the driver refuse dirty tracking on PASID.

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 66b0e1d5a98c..d33df02f0f2d 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -4123,7 +4123,7 @@ static void intel_iommu_domain_free(struct iommu_domain *domain)
 }

 static int prepare_domain_attach_device(struct iommu_domain *domain,
-                                       struct device *dev)
+                                       struct device *dev, ioasid_t pasid)
 {
        struct dmar_domain *dmar_domain = to_dmar_domain(domain);
        struct intel_iommu *iommu;
@@ -4136,7 +4136,8 @@ static int prepare_domain_attach_device(struct iommu_domain *domain,
        if (dmar_domain->force_snooping && !ecap_sc_support(iommu->ecap))
                return -EINVAL;

-       if (domain->dirty_ops && !slads_supported(iommu))
+       if (domain->dirty_ops &&
+           (!slads_supported(iommu) || pasid != IOMMU_NO_PASID))
                return -EINVAL;

        /* check if this iommu agaw is sufficient for max mapped address */
@@ -4174,7 +4175,7 @@ static int intel_iommu_attach_device(struct iommu_domain *domain,
        if (info->domain)
                device_block_translation(dev);

-       ret = prepare_domain_attach_device(domain, dev);
+       ret = prepare_domain_attach_device(domain, dev, IOMMU_NO_PASID);
        if (ret)
                return ret;

@@ -4795,7 +4796,7 @@ static int intel_iommu_set_dev_pasid(struct iommu_domain *domain,
        if (context_copied(iommu, info->bus, info->devfn))
                return -EBUSY;

-       ret = prepare_domain_attach_device(domain, dev);
+       ret = prepare_domain_attach_device(domain, dev, pasid);
        if (ret)
                return ret;
Joao Martins Oct. 17, 2023, 11:22 a.m. UTC | #21
On 17/10/2023 03:08, Baolu Lu wrote:
> On 10/17/23 12:00 AM, Joao Martins wrote:
>>>> The iommu_dirty_bitmap is defined in iommu core. The iommu driver has no
>>>> need to understand it and check its member anyway.
>>>>
>>> (...) The iommu driver has no need to understand it. iommu_dirty_bitmap_record()
>>> already makes those checks in case there's no iova_bitmap to set bits to.
>>>
>> This is all true, but the reason I am checking iommu_dirty_bitmap::bitmap is to
>> essentially not record anything in the iova bitmap and just clear the dirty bits
>> from the IOPTEs, all when dirty tracking is technically disabled. This is done
>> internally only when starting dirty tracking, and thus ensures that we clean up
>> all dirty bits before we enable dirty tracking, to have a consistent snapshot as
>> opposed to inheriting dirty bits from the past.
> 
> It's okay since it serves a functional purpose. Can you please add some
> comments around the code to explain the rationale.
> 

I added this comment below:

+       /*
+        * IOMMUFD core calls into a dirty tracking disabled domain without an
+        * IOVA bitmap set in order to clean dirty bits in all PTEs that might
+        * have occurred when we stopped dirty tracking. This ensures that we
+        * never inherit dirtied bits from a previous cycle.
+        */

Also fixed an issue where I could theoretically clear the bit with
IOMMU_DIRTY_NO_CLEAR set. Essentially I pass the read_and_clear_dirty flags
down and let dma_sl_pte_test_and_clear_dirty() either test or test-and-clear,
similar to AMD:

@@ -781,6 +788,16 @@ static inline bool dma_pte_present(struct dma_pte *pte)
        return (pte->val & 3) != 0;
 }

+static inline bool dma_sl_pte_test_and_clear_dirty(struct dma_pte *pte,
+                                                  unsigned long flags)
+{
+       if (flags & IOMMU_DIRTY_NO_CLEAR)
+               return (pte->val & DMA_SL_PTE_DIRTY) != 0;
+
+       return test_and_clear_bit(DMA_SL_PTE_DIRTY_BIT,
+                                 (unsigned long *)&pte->val);
+}
+

Anyhow, see below for the full diff compared to this patch. Some things in my
tree differ from what was submitted in this patch.

diff --git a/drivers/iommu/intel/Kconfig b/drivers/iommu/intel/Kconfig
index 2e56bd79f589..dedb8ae3cba8 100644
--- a/drivers/iommu/intel/Kconfig
+++ b/drivers/iommu/intel/Kconfig
@@ -15,6 +15,7 @@ config INTEL_IOMMU
 	select DMA_OPS
 	select IOMMU_API
 	select IOMMU_IOVA
+	select IOMMUFD_DRIVER
 	select NEED_DMA_MAP_STATE
 	select DMAR_TABLE
 	select SWIOTLB
diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index fabfe363f1f9..0e3f532f3bca 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -4075,11 +4075,6 @@ static struct iommu_domain *intel_iommu_domain_alloc(unsigned type)
 	return NULL;
 }

-static bool intel_iommu_slads_supported(struct intel_iommu *iommu)
-{
-	return sm_supported(iommu) && ecap_slads(iommu->ecap);
-}
-
 static struct iommu_domain *
 intel_iommu_domain_alloc_user(struct device *dev, u32 flags)
 {
@@ -4087,6 +4082,10 @@ intel_iommu_domain_alloc_user(struct device *dev, u32 flags)
 	struct iommu_domain *domain;
 	struct intel_iommu *iommu;

+	if (flags & (~(IOMMU_HWPT_ALLOC_NEST_PARENT|
+		       IOMMU_HWPT_ALLOC_ENFORCE_DIRTY)))
+		return ERR_PTR(-EOPNOTSUPP);
+
 	iommu = device_to_iommu(dev, NULL, NULL);
 	if (!iommu)
 		return ERR_PTR(-ENODEV);
@@ -4094,15 +4093,26 @@ intel_iommu_domain_alloc_user(struct device *dev, u32 flags)
 	if ((flags & IOMMU_HWPT_ALLOC_NEST_PARENT) && !ecap_nest(iommu->ecap))
 		return ERR_PTR(-EOPNOTSUPP);

-	if (enforce_dirty &&
-	    !intel_iommu_slads_supported(iommu))
+	if (enforce_dirty && !slads_supported(iommu))
 		return ERR_PTR(-EOPNOTSUPP);

+	/*
+	 * domain_alloc_user op needs to fully initialize a domain
+	 * before returning, so it uses iommu_domain_alloc() here for
+	 * simplicity.
+	 */
 	domain = iommu_domain_alloc(dev->bus);
 	if (!domain)
 		domain = ERR_PTR(-ENOMEM);
-	if (domain && enforce_dirty)
+
+	if (!IS_ERR(domain) && enforce_dirty) {
+		if (to_dmar_domain(domain)->use_first_level) {
+			iommu_domain_free(domain);
+			return ERR_PTR(-EOPNOTSUPP);
+		}
 		domain->dirty_ops = &intel_dirty_ops;
+	}
+
 	return domain;
 }

@@ -4113,7 +4123,7 @@ static void intel_iommu_domain_free(struct iommu_domain *domain)
 }

 static int prepare_domain_attach_device(struct iommu_domain *domain,
-					struct device *dev)
+					struct device *dev, ioasid_t pasid)
 {
 	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
 	struct intel_iommu *iommu;
@@ -4126,7 +4136,8 @@ static int prepare_domain_attach_device(struct iommu_domain *domain,
 	if (dmar_domain->force_snooping && !ecap_sc_support(iommu->ecap))
 		return -EINVAL;

-	if (domain->dirty_ops && !intel_iommu_slads_supported(iommu))
+	if (domain->dirty_ops &&
+	    (!slads_supported(iommu) || pasid != IOMMU_NO_PASID))
 		return -EINVAL;

 	/* check if this iommu agaw is sufficient for max mapped address */
@@ -4164,7 +4175,7 @@ static int intel_iommu_attach_device(struct iommu_domain *domain,
 	if (info->domain)
 		device_block_translation(dev);

-	ret = prepare_domain_attach_device(domain, dev);
+	ret = prepare_domain_attach_device(domain, dev, IOMMU_NO_PASID);
 	if (ret)
 		return ret;

@@ -4384,7 +4395,7 @@ static bool intel_iommu_capable(struct device *dev, enum iommu_cap cap)
 	case IOMMU_CAP_ENFORCE_CACHE_COHERENCY:
 		return ecap_sc_support(info->iommu->ecap);
 	case IOMMU_CAP_DIRTY:
-		return intel_iommu_slads_supported(info->iommu);
+		return slads_supported(info->iommu);
 	default:
 		return false;
 	}
@@ -4785,7 +4796,7 @@ static int intel_iommu_set_dev_pasid(struct iommu_domain *domain,
 	if (context_copied(iommu, info->bus, info->devfn))
 		return -EBUSY;

-	ret = prepare_domain_attach_device(domain, dev);
+	ret = prepare_domain_attach_device(domain, dev, pasid);
 	if (ret)
 		return ret;

@@ -4848,31 +4859,31 @@ static int intel_iommu_set_dirty_tracking(struct iommu_domain *domain,
 	int ret = -EINVAL;

 	spin_lock(&dmar_domain->lock);
-	if (!(dmar_domain->dirty_tracking ^ enable) ||
-	    list_empty(&dmar_domain->devices)) {
-		spin_unlock(&dmar_domain->lock);
-		return 0;
-	}
+	if (dmar_domain->dirty_tracking == enable)
+		goto out_unlock;

 	list_for_each_entry(info, &dmar_domain->devices, link) {
-		/* First-level page table always enables dirty bit */
-		if (dmar_domain->use_first_level) {
-			ret = 0;
-			break;
-		}
-
 		ret = intel_pasid_setup_dirty_tracking(info->iommu, info->domain,
 						     info->dev, IOMMU_NO_PASID,
 						     enable);
 		if (ret)
-			break;
+			goto err_unwind;

 	}

 	if (!ret)
 		dmar_domain->dirty_tracking = enable;
+out_unlock:
 	spin_unlock(&dmar_domain->lock);

+	return 0;
+
+err_unwind:
+	list_for_each_entry(info, &dmar_domain->devices, link)
+		intel_pasid_setup_dirty_tracking(info->iommu, dmar_domain,
+						 info->dev, IOMMU_NO_PASID,
+						 dmar_domain->dirty_tracking);
+	spin_unlock(&dmar_domain->lock);
 	return ret;
 }

@@ -4886,14 +4897,16 @@ static int intel_iommu_read_and_clear_dirty(struct iommu_domain *domain,
 	unsigned long pgsize;
 	bool ad_enabled;

-	spin_lock(&dmar_domain->lock);
+	/*
+	 * IOMMUFD core calls into a dirty tracking disabled domain without an
+	 * IOVA bitmap set in order to clear dirty bits in all PTEs that might
+	 * have occurred when we stopped dirty tracking. This ensures that we
+	 * never inherit dirtied bits from a previous cycle.
+	 */
 	ad_enabled = dmar_domain->dirty_tracking;
-	spin_unlock(&dmar_domain->lock);
-
 	if (!ad_enabled && dirty->bitmap)
 		return -EINVAL;

-	rcu_read_lock();
 	do {
 		struct dma_pte *pte;
 		int lvl = 0;
@@ -4906,14 +4919,10 @@ static int intel_iommu_read_and_clear_dirty(struct iommu_domain *domain,
 			continue;
 		}

-		/* It is writable, set the bitmap */
-		if (((flags & IOMMU_DIRTY_NO_CLEAR) &&
-				dma_sl_pte_dirty(pte)) ||
-		    dma_sl_pte_test_and_clear_dirty(pte))
+		if (dma_sl_pte_test_and_clear_dirty(pte, flags))
 			iommu_dirty_bitmap_record(dirty, iova, pgsize);
 		iova += pgsize;
 	} while (iova < end);
-	rcu_read_unlock();

 	return 0;
 }
diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h
index bccd44db3316..0b390d9e669b 100644
--- a/drivers/iommu/intel/iommu.h
+++ b/drivers/iommu/intel/iommu.h
@@ -542,6 +542,9 @@ enum {
 #define sm_supported(iommu)	(intel_iommu_sm && ecap_smts((iommu)->ecap))
 #define pasid_supported(iommu)	(sm_supported(iommu) &&			\
 				 ecap_pasid((iommu)->ecap))
+#define slads_supported(iommu)	(sm_supported(iommu) &&			\
+				 ecap_slads((iommu)->ecap))
+

 struct pasid_entry;
 struct pasid_state_entry;
@@ -785,13 +788,12 @@ static inline bool dma_pte_present(struct dma_pte *pte)
 	return (pte->val & 3) != 0;
 }

-static inline bool dma_sl_pte_dirty(struct dma_pte *pte)
+static inline bool dma_sl_pte_test_and_clear_dirty(struct dma_pte *pte,
+						   unsigned long flags)
 {
-	return (pte->val & DMA_SL_PTE_DIRTY) != 0;
-}
+	if (flags & IOMMU_DIRTY_NO_CLEAR)
+		return (pte->val & DMA_SL_PTE_DIRTY) != 0;

-static inline bool dma_sl_pte_test_and_clear_dirty(struct dma_pte *pte)
-{
 	return test_and_clear_bit(DMA_SL_PTE_DIRTY_BIT,
 				  (unsigned long *)&pte->val);
 }
diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
index 03814942d59c..785384a59d55 100644
--- a/drivers/iommu/intel/pasid.c
+++ b/drivers/iommu/intel/pasid.c
@@ -686,15 +686,29 @@ int intel_pasid_setup_dirty_tracking(struct intel_iommu *iommu,

 	spin_lock(&iommu->lock);

-	did = domain_id_iommu(domain, iommu);
 	pte = intel_pasid_get_entry(dev, pasid);
 	if (!pte) {
 		spin_unlock(&iommu->lock);
-		dev_err(dev, "Failed to get pasid entry of PASID %d\n", pasid);
+		dev_err_ratelimited(dev,
+				    "Failed to get pasid entry of PASID %d\n",
+				    pasid);
 		return -ENODEV;
 	}

+	did = domain_id_iommu(domain, iommu);
 	pgtt = pasid_pte_get_pgtt(pte);
+	if (pgtt != PASID_ENTRY_PGTT_SL_ONLY && pgtt != PASID_ENTRY_PGTT_NESTED) {
+		spin_unlock(&iommu->lock);
+		dev_err_ratelimited(dev,
+				    "Dirty tracking not supported on translation type %d\n",
+				    pgtt);
+		return -EOPNOTSUPP;
+	}
+
+	if (pasid_get_ssade(pte) == enabled) {
+		spin_unlock(&iommu->lock);
+		return 0;
+	}

 	if (enabled)
 		pasid_set_ssade(pte);
@@ -702,6 +716,9 @@ int intel_pasid_setup_dirty_tracking(struct intel_iommu *iommu,
 		pasid_clear_ssade(pte);
 	spin_unlock(&iommu->lock);

+	if (!ecap_coherent(iommu->ecap))
+		clflush_cache_range(pte, sizeof(*pte));
+
 	/*
 	 * From VT-d spec table 25 "Guidance to Software for Invalidations":
 	 *
@@ -720,8 +737,6 @@ int intel_pasid_setup_dirty_tracking(struct intel_iommu *iommu,

 	if (pgtt == PASID_ENTRY_PGTT_SL_ONLY || pgtt == PASID_ENTRY_PGTT_NESTED)
 		iommu->flush.flush_iotlb(iommu, did, 0, 0, DMA_TLB_DSI_FLUSH);
-	else
-		qi_flush_piotlb(iommu, did, pasid, 0, -1, 0);

 	/* Device IOTLB doesn't need to be flushed in caching mode. */
 	if (!cap_caching_mode(iommu->cap))
Baolu Lu Oct. 17, 2023, 12:41 p.m. UTC | #22
On 2023/10/17 18:51, Joao Martins wrote:
> On 16/10/2023 14:01, Baolu Lu wrote:
>> On 2023/10/16 20:59, Jason Gunthorpe wrote:
>>> On Mon, Oct 16, 2023 at 08:58:42PM +0800, Baolu Lu wrote:
>>>> On 2023/10/16 19:42, Jason Gunthorpe wrote:
>>>>> On Mon, Oct 16, 2023 at 11:57:34AM +0100, Joao Martins wrote:
>>>>>
>>>>>> True. But to be honest, I thought we weren't quite there yet in PASID support
>>>>>> from IOMMUFD perspective; hence why I didn't aim at it. Or do I have the wrong
>>>>>> impression? From the code below, it clearly looks the driver does.
>>>>> I think we should plan that this series will go before the PASID
>>>>> series
>>>> I know that PASID support in IOMMUFD is not yet available, but the VT-d
>>>> driver already supports attaching a domain to a PASID, as required by
>>>> the idxd driver for kernel DMA with PASID. Therefore, from the driver's
>>>> perspective, dirty tracking should also be enabled for PASIDs.
>>> As long as the driver refuses to attach a dirty track enabled domain
>>> to PASID it would be fine for now.
>> Yes. This works.
> Baolu Lu, I am blocking PASID attachment this way; let me know if this matches
how you would have the driver refuse dirty tracking on PASID.

Joao, how about blocking pasid attachment in intel_iommu_set_dev_pasid()
directly?

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index c86ba5a3e75c..392b6ca9ce90 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -4783,6 +4783,9 @@ static int intel_iommu_set_dev_pasid(struct iommu_domain *domain,
         if (context_copied(iommu, info->bus, info->devfn))
                 return -EBUSY;

+       if (domain->dirty_ops)
+               return -EOPNOTSUPP;
+
         ret = prepare_domain_attach_device(domain, dev);
         if (ret)
                 return ret;

Best regards,
baolu
Baolu Lu Oct. 17, 2023, 12:49 p.m. UTC | #23
On 2023/10/17 19:22, Joao Martins wrote:
> On 17/10/2023 03:08, Baolu Lu wrote:
>> On 10/17/23 12:00 AM, Joao Martins wrote:
>>>>> The iommu_dirty_bitmap is defined in iommu core. The iommu driver has no
>>>>> need to understand it and check its member anyway.
>>>>>
>>>> (...) The iommu driver has no need to understand it. iommu_dirty_bitmap_record()
>>>> already makes those checks in case there's no iova_bitmap to set bits to.
>>>>
>>> This is all true but the reason I am checking iommu_dirty_bitmap::bitmap is to
>>> essentially not record anything in the iova bitmap and just clear the dirty bits
>>> from the IOPTEs, all when dirty tracking is technically disabled. This is done
>>> internally only when starting dirty tracking, and thus to ensure that we cleanup
>>> all dirty bits before we enable dirty tracking to have a consistent snapshot as
>>> opposed to inheriting dirties from the past.
>>
>> It's okay since it serves a functional purpose. Can you please add some
>> comments around the code to explain the rationale.
>>
> 
> I added this comment below:
> 
> +       /*
> +        * IOMMUFD core calls into a dirty tracking disabled domain without an
> +        * IOVA bitmap set in order to clear dirty bits in all PTEs that might
> +        * have occurred when we stopped dirty tracking. This ensures that we
> +        * never inherit dirtied bits from a previous cycle.
> +        */
> 

Yes. It's clear. Thank you!

> Also fixed an issue where I could theoretically clear the bit with
> IOMMU_NO_CLEAR. Essentially passed the read_and_clear_dirty flags and let
> dma_sl_pte_test_and_clear_dirty() to test and test-and-clear, similar to AMD:
> 
> @@ -781,6 +788,16 @@ static inline bool dma_pte_present(struct dma_pte *pte)
>          return (pte->val & 3) != 0;
>   }
> 
> +static inline bool dma_sl_pte_test_and_clear_dirty(struct dma_pte *pte,
> +                                                  unsigned long flags)
> +{
> +       if (flags & IOMMU_DIRTY_NO_CLEAR)
> +               return (pte->val & DMA_SL_PTE_DIRTY) != 0;
> +
> +       return test_and_clear_bit(DMA_SL_PTE_DIRTY_BIT,
> +                                 (unsigned long *)&pte->val);
> +}
> +

Yes. Sure.

> Anyhow, see below the full diff compared to this patch. Some things in my tree
> now differ from what was submitted in this patch.

[...]

> @@ -4113,7 +4123,7 @@ static void intel_iommu_domain_free(struct iommu_domain *domain)
>   }
> 
>   static int prepare_domain_attach_device(struct iommu_domain *domain,
> -					struct device *dev)
> +					struct device *dev, ioasid_t pasid)

How about blocking pasid attachment in intel_iommu_set_dev_pasid()?

>   {
>   	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
>   	struct intel_iommu *iommu;
> @@ -4126,7 +4136,8 @@ static int prepare_domain_attach_device(struct iommu_domain *domain,
>   	if (dmar_domain->force_snooping && !ecap_sc_support(iommu->ecap))
>   		return -EINVAL;
> 
> -	if (domain->dirty_ops && !intel_iommu_slads_supported(iommu))
> +	if (domain->dirty_ops &&
> +	    (!slads_supported(iommu) || pasid != IOMMU_NO_PASID))
>   		return -EINVAL;
> 
>   	/* check if this iommu agaw is sufficient for max mapped address */

[...]

> 
> @@ -4886,14 +4897,16 @@ static int intel_iommu_read_and_clear_dirty(struct iommu_domain *domain,
>   	unsigned long pgsize;
>   	bool ad_enabled;
> 
> -	spin_lock(&dmar_domain->lock);
> +	/*
> +	 * IOMMUFD core calls into a dirty tracking disabled domain without an
> +	 * IOVA bitmap set in order to clean dirty bits in all PTEs that might
> +	 * have occured when we stopped dirty tracking. This ensures that we
> +	 * never inherit dirtied bits from a previous cycle.
> +	 */
>   	ad_enabled = dmar_domain->dirty_tracking;
> -	spin_unlock(&dmar_domain->lock);
> -
>   	if (!ad_enabled && dirty->bitmap)

How about
	if (!dmar_domain->dirty_tracking && dirty->bitmap)
		return -EINVAL;
?

>   		return -EINVAL;
> 
> -	rcu_read_lock();
>   	do {
>   		struct dma_pte *pte;
>   		int lvl = 0;
> @@ -4906,14 +4919,10 @@ static int intel_iommu_read_and_clear_dirty(struct iommu_domain *domain,
>   			continue;
>   		}
> 
> -		/* It is writable, set the bitmap */
> -		if (((flags & IOMMU_DIRTY_NO_CLEAR) &&
> -				dma_sl_pte_dirty(pte)) ||
> -		    dma_sl_pte_test_and_clear_dirty(pte))
> +		if (dma_sl_pte_test_and_clear_dirty(pte, flags))
>   			iommu_dirty_bitmap_record(dirty, iova, pgsize);
>   		iova += pgsize;
>   	} while (iova < end);
> -	rcu_read_unlock();
> 
>   	return 0;
>   }
> diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h
> index bccd44db3316..0b390d9e669b 100644
> --- a/drivers/iommu/intel/iommu.h
> +++ b/drivers/iommu/intel/iommu.h
> @@ -542,6 +542,9 @@ enum {
>   #define sm_supported(iommu)	(intel_iommu_sm && ecap_smts((iommu)->ecap))
>   #define pasid_supported(iommu)	(sm_supported(iommu) &&			\
>   				 ecap_pasid((iommu)->ecap))
> +#define slads_supported(iommu) (sm_supported(iommu) &&                 \
> +                                ecap_slads((iommu)->ecap))
> +
> 
>   struct pasid_entry;
>   struct pasid_state_entry;
> @@ -785,13 +788,12 @@ static inline bool dma_pte_present(struct dma_pte *pte)
>   	return (pte->val & 3) != 0;
>   }
> 
> -static inline bool dma_sl_pte_dirty(struct dma_pte *pte)
> +static inline bool dma_sl_pte_test_and_clear_dirty(struct dma_pte *pte,
> +						   unsigned long flags)
>   {
> -	return (pte->val & DMA_SL_PTE_DIRTY) != 0;
> -}
> +	if (flags & IOMMU_DIRTY_NO_CLEAR)
> +		return (pte->val & DMA_SL_PTE_DIRTY) != 0;
> 
> -static inline bool dma_sl_pte_test_and_clear_dirty(struct dma_pte *pte)
> -{
>   	return test_and_clear_bit(DMA_SL_PTE_DIRTY_BIT,
>   				  (unsigned long *)&pte->val);
>   }
> diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
> index 03814942d59c..785384a59d55 100644
> --- a/drivers/iommu/intel/pasid.c
> +++ b/drivers/iommu/intel/pasid.c
> @@ -686,15 +686,29 @@ int intel_pasid_setup_dirty_tracking(struct intel_iommu *iommu,
> 
>   	spin_lock(&iommu->lock);
> 
> -	did = domain_id_iommu(domain, iommu);
>   	pte = intel_pasid_get_entry(dev, pasid);
>   	if (!pte) {
>   		spin_unlock(&iommu->lock);
> -		dev_err(dev, "Failed to get pasid entry of PASID %d\n", pasid);
> +		dev_err_ratelimited(dev,
> +				    "Failed to get pasid entry of PASID %d\n",
> +				    pasid);
>   		return -ENODEV;
>   	}
> 
> +	did = domain_id_iommu(domain, iommu);
>   	pgtt = pasid_pte_get_pgtt(pte);
> +	if (pgtt != PASID_ENTRY_PGTT_SL_ONLY && pgtt != PASID_ENTRY_PGTT_NESTED) {
> +		spin_unlock(&iommu->lock);
> +		dev_err_ratelimited(dev,
> +				    "Dirty tracking not supported on translation type %d\n",
> +				    pgtt);
> +		return -EOPNOTSUPP;
> +	}
> +
> +	if (pasid_get_ssade(pte) == enabled) {
> +		spin_unlock(&iommu->lock);
> +		return 0;
> +	}
> 
>   	if (enabled)
>   		pasid_set_ssade(pte);
> @@ -702,6 +716,9 @@ int intel_pasid_setup_dirty_tracking(struct intel_iommu *iommu,
>   		pasid_clear_ssade(pte);
>   	spin_unlock(&iommu->lock);
> 
> +	if (!ecap_coherent(iommu->ecap))
> +		clflush_cache_range(pte, sizeof(*pte));
> +
>   	/*
>   	 * From VT-d spec table 25 "Guidance to Software for Invalidations":
>   	 *
> @@ -720,8 +737,6 @@ int intel_pasid_setup_dirty_tracking(struct intel_iommu *iommu,
> 
>   	if (pgtt == PASID_ENTRY_PGTT_SL_ONLY || pgtt == PASID_ENTRY_PGTT_NESTED)
>   		iommu->flush.flush_iotlb(iommu, did, 0, 0, DMA_TLB_DSI_FLUSH);
> -	else
> -		qi_flush_piotlb(iommu, did, pasid, 0, -1, 0);
> 
>   	/* Device IOTLB doesn't need to be flushed in caching mode. */
>   	if (!cap_caching_mode(iommu->cap))

Others look good to me. Thanks a lot.

Best regards,
baolu
Jason Gunthorpe Oct. 17, 2023, 1:10 p.m. UTC | #24
On Tue, Oct 17, 2023 at 12:22:34PM +0100, Joao Martins wrote:
> On 17/10/2023 03:08, Baolu Lu wrote:
> > On 10/17/23 12:00 AM, Joao Martins wrote:
> >>>> The iommu_dirty_bitmap is defined in iommu core. The iommu driver has no
> >>>> need to understand it and check its member anyway.
> >>>>
> >>> (...) The iommu driver has no need to understand it. iommu_dirty_bitmap_record()
> >>> already makes those checks in case there's no iova_bitmap to set bits to.
> >>>
> >> This is all true but the reason I am checking iommu_dirty_bitmap::bitmap is to
> >> essentially not record anything in the iova bitmap and just clear the dirty bits
> >> from the IOPTEs, all when dirty tracking is technically disabled. This is done
> >> internally only when starting dirty tracking, and thus to ensure that we cleanup
> >> all dirty bits before we enable dirty tracking to have a consistent snapshot as
> >> opposed to inheriting dirties from the past.
> > 
> > It's okay since it serves a functional purpose. Can you please add some
> > comments around the code to explain the rationale.
> > 
> 
> I added this comment below:
> 
> +       /*
> +        * IOMMUFD core calls into a dirty tracking disabled domain without an
> +        * IOVA bitmap set in order to clean dirty bits in all PTEs that might
> +        * have occured when we stopped dirty tracking. This ensures that we
> +        * never inherit dirtied bits from a previous cycle.
> +        */
> 
> Also fixed an issue where I could theoretically clear the bit with
> IOMMU_NO_CLEAR. Essentially passed the read_and_clear_dirty flags and let
> dma_sl_pte_test_and_clear_dirty() to test and test-and-clear, similar to AMD:

How does all this work, does this leak into the uapi? Why would we
want to not clear the dirty bits upon enable/disable of dirty
tracking? I can understand that the driver needs help from the caller
due to the externalized locking, but do we leak this into userspace?

Jason
Joao Martins Oct. 17, 2023, 2:11 p.m. UTC | #25
On 17/10/2023 14:10, Jason Gunthorpe wrote:
> On Tue, Oct 17, 2023 at 12:22:34PM +0100, Joao Martins wrote:
>> On 17/10/2023 03:08, Baolu Lu wrote:
>>> On 10/17/23 12:00 AM, Joao Martins wrote:
>>>>>> The iommu_dirty_bitmap is defined in iommu core. The iommu driver has no
>>>>>> need to understand it and check its member anyway.
>>>>>>
>>>>> (...) The iommu driver has no need to understand it. iommu_dirty_bitmap_record()
>>>>> already makes those checks in case there's no iova_bitmap to set bits to.
>>>>>
>>>> This is all true but the reason I am checking iommu_dirty_bitmap::bitmap is to
>>>> essentially not record anything in the iova bitmap and just clear the dirty bits
>>>> from the IOPTEs, all when dirty tracking is technically disabled. This is done
>>>> internally only when starting dirty tracking, and thus to ensure that we cleanup
>>>> all dirty bits before we enable dirty tracking to have a consistent snapshot as
>>>> opposed to inheriting dirties from the past.
>>>
>>> It's okay since it serves a functional purpose. Can you please add some
>>> comments around the code to explain the rationale.
>>>
>>
>> I added this comment below:
>>
>> +       /*
>> +        * IOMMUFD core calls into a dirty tracking disabled domain without an
>> +        * IOVA bitmap set in order to clear dirty bits in all PTEs that might
>> +        * have occurred when we stopped dirty tracking. This ensures that we
>> +        * never inherit dirtied bits from a previous cycle.
>> +        */
>>
>> Also fixed an issue where I could theoretically clear the bit with
>> IOMMU_NO_CLEAR. Essentially passed the read_and_clear_dirty flags and let
>> dma_sl_pte_test_and_clear_dirty() to test and test-and-clear, similar to AMD:
> 
> How does all this work, does this leak into the uapi? 

UAPI is only ever expected to collect/clear dirty bits while dirty tracking is
enabled. And it requires valid bitmaps before it gets to the IOMMU driver.

The above, where I pass no dirty::bitmap (but with an iotlb_gather), is internal
usage only. Open to alternatives if this is prone to audit errors, e.g. 1) via
the iommu_dirty_bitmap structure, where I add one field which, if true, lets the
iommufd core call into the iommu driver in a "clear IOPTE" manner, or 2) via
the ::flags ... the thing is that the ::flags values are UAPI, so it feels weird
to use these flags for internal purposes.
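
For context, the core helper is what makes the NULL-bitmap case safe. A rough
sketch of iommu_dirty_bitmap_record() as it stands in the series (paraphrased,
not a verbatim copy):

static inline void iommu_dirty_bitmap_record(struct iommu_dirty_bitmap *dirty,
					     unsigned long iova,
					     unsigned long length)
{
	/* Record into the IOVA bitmap only when the caller provided one */
	if (dirty->bitmap)
		iova_bitmap_set(dirty->bitmap, iova, length);

	/* Always gather the range so the caller can flush the IOTLB */
	if (dirty->gather)
		iommu_iotlb_gather_add_range(dirty->gather, iova, length);
}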

With respect to IOMMU_NO_CLEAR, that is UAPI (a flag in read-and-clear) where
the user fetches the dirty bits but does not want to clear the hw IOPTE dirty
bit (to avoid the TLB flush).

> Why would we
> want to not clear the dirty bits upon enable/disable of dirty
> tracking? 

For the unmap-and-read-dirty case, where you unmap and want to get the dirty
bits but don't care to clear them, as you will be unmapping anyway. But my
comment above is not about that, btw. It is just my broken check where I either
test for dirty or test-and-clear. That's it.

> I can understand that the driver needs help from the caller
> due to the externalized locking, but do we leak this into userspace?

AFAICT no for the first. The IOMMU_NO_CLEAR is UAPI. I take it you were talking
about the first.
Joao Martins Oct. 17, 2023, 2:16 p.m. UTC | #26
On 17/10/2023 13:41, Baolu Lu wrote:
> On 2023/10/17 18:51, Joao Martins wrote:
>> On 16/10/2023 14:01, Baolu Lu wrote:
>>> On 2023/10/16 20:59, Jason Gunthorpe wrote:
>>>> On Mon, Oct 16, 2023 at 08:58:42PM +0800, Baolu Lu wrote:
>>>>> On 2023/10/16 19:42, Jason Gunthorpe wrote:
>>>>>> On Mon, Oct 16, 2023 at 11:57:34AM +0100, Joao Martins wrote:
>>>>>>
>>>>>>> True. But to be honest, I thought we weren't quite there yet in PASID
>>>>>>> support
>>>>>>> from IOMMUFD perspective; hence why I didn't aim at it. Or do I have the
>>>>>>> wrong
>>>>>>> impression? From the code below, it clearly looks the driver does.
>>>>>> I think we should plan that this series will go before the PASID
>>>>>> series
>>>>> I know that PASID support in IOMMUFD is not yet available, but the VT-d
>>>>> driver already supports attaching a domain to a PASID, as required by
>>>>> the idxd driver for kernel DMA with PASID. Therefore, from the driver's
>>>>> perspective, dirty tracking should also be enabled for PASIDs.
>>>> As long as the driver refuses to attach a dirty track enabled domain
>>>> to PASID it would be fine for now.
>>> Yes. This works.
>> Baolu Lu, I am blocking PASID attachment this way; let me know if this matches
>> how you would have the driver refuse dirty tracking on PASID.
> 
> Joao, how about blocking pasid attachment in intel_iommu_set_dev_pasid()
> directly?
> 
> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> index c86ba5a3e75c..392b6ca9ce90 100644
> --- a/drivers/iommu/intel/iommu.c
> +++ b/drivers/iommu/intel/iommu.c
> @@ -4783,6 +4783,9 @@ static int intel_iommu_set_dev_pasid(struct iommu_domain *domain,
>         if (context_copied(iommu, info->bus, info->devfn))
>                 return -EBUSY;
> 
> +       if (domain->dirty_ops)
> +               return -EOPNOTSUPP;
> +
>         ret = prepare_domain_attach_device(domain, dev);
>         if (ret)
>                 return ret;

I was trying to centralize all the checks, but I can place it here if you prefer
this way.
Joao Martins Oct. 17, 2023, 2:19 p.m. UTC | #27
On 17/10/2023 13:49, Baolu Lu wrote:
> On 2023/10/17 19:22, Joao Martins wrote:
>> On 17/10/2023 03:08, Baolu Lu wrote:
>>> On 10/17/23 12:00 AM, Joao Martins wrote:
>>>>>> The iommu_dirty_bitmap is defined in iommu core. The iommu driver has no
>>>>>> need to understand it and check its member anyway.
>>>>>>
>>>>> (...) The iommu driver has no need to understand it.
>>>>> iommu_dirty_bitmap_record()
>>>>> already makes those checks in case there's no iova_bitmap to set bits to.
>>>>>
>>>> This is all true but the reason I am checking iommu_dirty_bitmap::bitmap is to
>>>> essentially not record anything in the iova bitmap and just clear the dirty
>>>> bits
>>>> from the IOPTEs, all when dirty tracking is technically disabled. This is done
>>>> internally only when starting dirty tracking, and thus to ensure that we
>>>> cleanup
>>>> all dirty bits before we enable dirty tracking to have a consistent snapshot as
>>>> opposed to inheriting dirties from the past.
>>>
>>> It's okay since it serves a functional purpose. Can you please add some
>>> comments around the code to explain the rationale.
>>>
>>
>> I added this comment below:
>>
>> +       /*
>> +        * IOMMUFD core calls into a dirty tracking disabled domain without an
>> +        * IOVA bitmap set in order to clear dirty bits in all PTEs that might
>> +        * have occurred when we stopped dirty tracking. This ensures that we
>> +        * never inherit dirtied bits from a previous cycle.
>> +        */
>>
> 
> Yes. It's clear. Thank you!
> 
>> Also fixed an issue where I could theoretically clear the bit with
>> IOMMU_NO_CLEAR. Essentially passed the read_and_clear_dirty flags and let
>> dma_sl_pte_test_and_clear_dirty() to test and test-and-clear, similar to AMD:
>>
>> @@ -781,6 +788,16 @@ static inline bool dma_pte_present(struct dma_pte *pte)
>>          return (pte->val & 3) != 0;
>>   }
>>
>> +static inline bool dma_sl_pte_test_and_clear_dirty(struct dma_pte *pte,
>> +                                                  unsigned long flags)
>> +{
>> +       if (flags & IOMMU_DIRTY_NO_CLEAR)
>> +               return (pte->val & DMA_SL_PTE_DIRTY) != 0;
>> +
>> +       return test_and_clear_bit(DMA_SL_PTE_DIRTY_BIT,
>> +                                 (unsigned long *)&pte->val);
>> +}
>> +
> 
> Yes. Sure.
> 
>> Anyhow, see below the full diff compared to this patch. Some things in my tree
>> now differ from what was submitted in this patch.
> 
> [...]
> 
>> @@ -4113,7 +4123,7 @@ static void intel_iommu_domain_free(struct iommu_domain *domain)
>>   }
>>
>>   static int prepare_domain_attach_device(struct iommu_domain *domain,
>> -                    struct device *dev)
>> +                    struct device *dev, ioasid_t pasid)
> 
> How about blocking pasid attachment in intel_iommu_set_dev_pasid()?
> 
OK

>>   {
>>       struct dmar_domain *dmar_domain = to_dmar_domain(domain);
>>       struct intel_iommu *iommu;
>> @@ -4126,7 +4136,8 @@ static int prepare_domain_attach_device(struct iommu_domain *domain,
>>       if (dmar_domain->force_snooping && !ecap_sc_support(iommu->ecap))
>>           return -EINVAL;
>>
>> -    if (domain->dirty_ops && !intel_iommu_slads_supported(iommu))
>> +    if (domain->dirty_ops &&
>> +        (!slads_supported(iommu) || pasid != IOMMU_NO_PASID))
>>           return -EINVAL;
>>
>>       /* check if this iommu agaw is sufficient for max mapped address */
> 
> [...]
> 
>>
>> @@ -4886,14 +4897,16 @@ static int intel_iommu_read_and_clear_dirty(struct iommu_domain *domain,
>>       unsigned long pgsize;
>>       bool ad_enabled;
>>
>> -    spin_lock(&dmar_domain->lock);
>> +    /*
>> +     * IOMMUFD core calls into a dirty tracking disabled domain without an
>> +     * IOVA bitmap set in order to clear dirty bits in all PTEs that might
>> +     * have occurred when we stopped dirty tracking. This ensures that we
>> +     * never inherit dirtied bits from a previous cycle.
>> +     */
>>       ad_enabled = dmar_domain->dirty_tracking;
>> -    spin_unlock(&dmar_domain->lock);
>> -
>>       if (!ad_enabled && dirty->bitmap)
> 
> How about
>     if (!dmar_domain->dirty_tracking && dirty->bitmap)
>         return -EINVAL;
> ?
> 
OK

>>           return -EINVAL;
>>
>> -    rcu_read_lock();
>>       do {
>>           struct dma_pte *pte;
>>           int lvl = 0;
>> @@ -4906,14 +4919,10 @@ static int intel_iommu_read_and_clear_dirty(struct iommu_domain *domain,
>>               continue;
>>           }
>>
>> -        /* It is writable, set the bitmap */
>> -        if (((flags & IOMMU_DIRTY_NO_CLEAR) &&
>> -                dma_sl_pte_dirty(pte)) ||
>> -            dma_sl_pte_test_and_clear_dirty(pte))
>> +        if (dma_sl_pte_test_and_clear_dirty(pte, flags))
>>               iommu_dirty_bitmap_record(dirty, iova, pgsize);
>>           iova += pgsize;
>>       } while (iova < end);
>> -    rcu_read_unlock();
>>
>>       return 0;
>>   }
>> diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h
>> index bccd44db3316..0b390d9e669b 100644
>> --- a/drivers/iommu/intel/iommu.h
>> +++ b/drivers/iommu/intel/iommu.h
>> @@ -542,6 +542,9 @@ enum {
>>   #define sm_supported(iommu)    (intel_iommu_sm && ecap_smts((iommu)->ecap))
>>   #define pasid_supported(iommu)    (sm_supported(iommu) &&            \
>>                    ecap_pasid((iommu)->ecap))
>> +#define slads_supported(iommu) (sm_supported(iommu) &&                 \
>> +                                ecap_slads((iommu)->ecap))
>> +
>>
>>   struct pasid_entry;
>>   struct pasid_state_entry;
>> @@ -785,13 +788,12 @@ static inline bool dma_pte_present(struct dma_pte *pte)
>>       return (pte->val & 3) != 0;
>>   }
>>
>> -static inline bool dma_sl_pte_dirty(struct dma_pte *pte)
>> +static inline bool dma_sl_pte_test_and_clear_dirty(struct dma_pte *pte,
>> +                           unsigned long flags)
>>   {
>> -    return (pte->val & DMA_SL_PTE_DIRTY) != 0;
>> -}
>> +    if (flags & IOMMU_DIRTY_NO_CLEAR)
>> +        return (pte->val & DMA_SL_PTE_DIRTY) != 0;
>>
>> -static inline bool dma_sl_pte_test_and_clear_dirty(struct dma_pte *pte)
>> -{
>>       return test_and_clear_bit(DMA_SL_PTE_DIRTY_BIT,
>>                     (unsigned long *)&pte->val);
>>   }
>> diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
>> index 03814942d59c..785384a59d55 100644
>> --- a/drivers/iommu/intel/pasid.c
>> +++ b/drivers/iommu/intel/pasid.c
>> @@ -686,15 +686,29 @@ int intel_pasid_setup_dirty_tracking(struct intel_iommu *iommu,
>>
>>       spin_lock(&iommu->lock);
>>
>> -    did = domain_id_iommu(domain, iommu);
>>       pte = intel_pasid_get_entry(dev, pasid);
>>       if (!pte) {
>>           spin_unlock(&iommu->lock);
>> -        dev_err(dev, "Failed to get pasid entry of PASID %d\n", pasid);
>> +        dev_err_ratelimited(dev,
>> +                    "Failed to get pasid entry of PASID %d\n",
>> +                    pasid);
>>           return -ENODEV;
>>       }
>>
>> +    did = domain_id_iommu(domain, iommu);
>>       pgtt = pasid_pte_get_pgtt(pte);
>> +    if (pgtt != PASID_ENTRY_PGTT_SL_ONLY && pgtt != PASID_ENTRY_PGTT_NESTED) {
>> +        spin_unlock(&iommu->lock);
>> +        dev_err_ratelimited(dev,
>> +                    "Dirty tracking not supported on translation type %d\n",
>> +                    pgtt);
>> +        return -EOPNOTSUPP;
>> +    }
>> +
>> +    if (pasid_get_ssade(pte) == enabled) {
>> +        spin_unlock(&iommu->lock);
>> +        return 0;
>> +    }
>>
>>       if (enabled)
>>           pasid_set_ssade(pte);
>> @@ -702,6 +716,9 @@ int intel_pasid_setup_dirty_tracking(struct intel_iommu *iommu,
>>           pasid_clear_ssade(pte);
>>       spin_unlock(&iommu->lock);
>>
>> +    if (!ecap_coherent(iommu->ecap))
>> +        clflush_cache_range(pte, sizeof(*pte));
>> +
>>       /*
>>        * From VT-d spec table 25 "Guidance to Software for Invalidations":
>>        *
>> @@ -720,8 +737,6 @@ int intel_pasid_setup_dirty_tracking(struct intel_iommu *iommu,
>>
>>       if (pgtt == PASID_ENTRY_PGTT_SL_ONLY || pgtt == PASID_ENTRY_PGTT_NESTED)
>>           iommu->flush.flush_iotlb(iommu, did, 0, 0, DMA_TLB_DSI_FLUSH);
>> -    else
>> -        qi_flush_piotlb(iommu, did, pasid, 0, -1, 0);
>>
>>       /* Device IOTLB doesn't need to be flushed in caching mode. */
>>       if (!cap_caching_mode(iommu->cap))
> 
> Others look good to me. Thanks a lot.
> 

	Joao
Joao Martins Oct. 17, 2023, 2:25 p.m. UTC | #28
On 17/10/2023 15:16, Joao Martins wrote:
> On 17/10/2023 13:41, Baolu Lu wrote:
>> On 2023/10/17 18:51, Joao Martins wrote:
>>> On 16/10/2023 14:01, Baolu Lu wrote:
>>>> On 2023/10/16 20:59, Jason Gunthorpe wrote:
>>>>> On Mon, Oct 16, 2023 at 08:58:42PM +0800, Baolu Lu wrote:
>>>>>> On 2023/10/16 19:42, Jason Gunthorpe wrote:
>>>>>>> On Mon, Oct 16, 2023 at 11:57:34AM +0100, Joao Martins wrote:
>>>>>>>
>>>>>>>> True. But to be honest, I thought we weren't quite there yet in PASID
>>>>>>>> support
>>>>>>>> from IOMMUFD perspective; hence why I didn't aim at it. Or do I have the
>>>>>>>> wrong
>>>>>>>> impression? From the code below, it clearly looks the driver does.
>>>>>>> I think we should plan that this series will go before the PASID
>>>>>>> series
>>>>>> I know that PASID support in IOMMUFD is not yet available, but the VT-d
>>>>>> driver already supports attaching a domain to a PASID, as required by
>>>>>> the idxd driver for kernel DMA with PASID. Therefore, from the driver's
>>>>>> perspective, dirty tracking should also be enabled for PASIDs.
>>>>> As long as the driver refuses to attach a dirty track enabled domain
>>>>> to PASID it would be fine for now.
>>>> Yes. This works.
>>> Baolu Lu, I am blocking PASID attachment this way; let me know if this matches
>>> how you would have the driver refuse dirty tracking on PASID.
>>
>> Joao, how about blocking pasid attachment in intel_iommu_set_dev_pasid()
>> directly?
>>
>> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
>> index c86ba5a3e75c..392b6ca9ce90 100644
>> --- a/drivers/iommu/intel/iommu.c
>> +++ b/drivers/iommu/intel/iommu.c
>> @@ -4783,6 +4783,9 @@ static int intel_iommu_set_dev_pasid(struct iommu_domain *domain,
>>         if (context_copied(iommu, info->bus, info->devfn))
>>                 return -EBUSY;
>>
>> +       if (domain->dirty_ops)
>> +               return -EOPNOTSUPP;
>> +
>>         ret = prepare_domain_attach_device(domain, dev);
>>         if (ret)
>>                 return ret;
> 
> I was trying to centralize all the checks, but I can place it here if you prefer
> this way.
> 
Minor change: I'm changing the error code to -EINVAL to align with the non-PASID case.
Jason Gunthorpe Oct. 17, 2023, 3:31 p.m. UTC | #29
On Tue, Oct 17, 2023 at 03:11:46PM +0100, Joao Martins wrote:
> On 17/10/2023 14:10, Jason Gunthorpe wrote:
> > On Tue, Oct 17, 2023 at 12:22:34PM +0100, Joao Martins wrote:
> >> On 17/10/2023 03:08, Baolu Lu wrote:
> >>> On 10/17/23 12:00 AM, Joao Martins wrote:
> >>>>>> The iommu_dirty_bitmap is defined in iommu core. The iommu driver has no
> >>>>>> need to understand it and check its member anyway.
> >>>>>>
> >>>>> (...) The iommu driver has no need to understand it. iommu_dirty_bitmap_record()
> >>>>> already makes those checks in case there's no iova_bitmap to set bits to.
> >>>>>
> >>>> This is all true but the reason I am checking iommu_dirty_bitmap::bitmap is to
> >>>> essentially not record anything in the iova bitmap and just clear the dirty bits
> >>>> from the IOPTEs, all when dirty tracking is technically disabled. This is done
> >>>> internally only when starting dirty tracking, and thus to ensure that we cleanup
> >>>> all dirty bits before we enable dirty tracking to have a consistent snapshot as
> >>>> opposed to inheriting dirties from the past.
> >>>
> >>> It's okay since it serves a functional purpose. Can you please add some
> >>> comments around the code to explain the rationale.
> >>>
> >>
> >> I added this comment below:
> >>
> >> +       /*
> >> +        * IOMMUFD core calls into a dirty tracking disabled domain without an
> >> +        * IOVA bitmap set in order to clear dirty bits in all PTEs that might
> >> +        * have occurred when we stopped dirty tracking. This ensures that we
> >> +        * never inherit dirtied bits from a previous cycle.
> >> +        */
> >>
> >> Also fixed an issue where I could theoretically clear the bit with
> >> IOMMU_NO_CLEAR. Essentially passed the read_and_clear_dirty flags and let
> >> dma_sl_pte_test_and_clear_dirty() to test and test-and-clear, similar to AMD:
> > 
> > How does all this work, does this leak into the uapi? 
> 
> UAPI is only ever expected to collect/clear dirty bits while dirty tracking is
> enabled. And it requires valid bitmaps before it gets to the IOMMU driver.
> 
> The above, where I pass no dirty::bitmap (but with an iotlb_gather), is internal
> usage only. Open to alternatives if this is prone to audit errors, e.g. 1) via
> the iommu_dirty_bitmap structure, where I add one field which, if true, lets the
> iommufd core call into the iommu driver in a "clear IOPTE" manner, or 2) via
> the ::flags ... the thing is that the ::flags values are UAPI, so it feels weird
> to use these flags for internal purposes.

I think NULL to mean clear-but-not-record is OK; it doesn't matter too
much, but ideally this would be sort of hidden in the iova APIs.

Jason
Joao Martins Oct. 17, 2023, 3:54 p.m. UTC | #30
On 17/10/2023 16:31, Jason Gunthorpe wrote:
> On Tue, Oct 17, 2023 at 03:11:46PM +0100, Joao Martins wrote:
>> On 17/10/2023 14:10, Jason Gunthorpe wrote:
>>> On Tue, Oct 17, 2023 at 12:22:34PM +0100, Joao Martins wrote:
>>>> On 17/10/2023 03:08, Baolu Lu wrote:
>>>>> On 10/17/23 12:00 AM, Joao Martins wrote:
>>>>>>>> The iommu_dirty_bitmap is defined in iommu core. The iommu driver has no
>>>>>>>> need to understand it and check its member anyway.
>>>>>>>>
>>>>>>> (...) The iommu driver has no need to understand it. iommu_dirty_bitmap_record()
>>>>>>> already makes those checks in case there's no iova_bitmap to set bits to.
>>>>>>>
>>>>>> This is all true but the reason I am checking iommu_dirty_bitmap::bitmap is to
>>>>>> essentially not record anything in the iova bitmap and just clear the dirty bits
>>>>>> from the IOPTEs, all when dirty tracking is technically disabled. This is done
>>>>>> internally only when starting dirty tracking, and thus to ensure that we cleanup
>>>>>> all dirty bits before we enable dirty tracking to have a consistent snapshot as
>>>>>> opposed to inheriting dirties from the past.
>>>>>
>>>>> It's okay since it serves a functional purpose. Can you please add some
>>>>> comments around the code to explain the rationale.
>>>>>
>>>>
>>>> I added this comment below:
>>>>
>>>> +       /*
>>>> +        * IOMMUFD core calls into a dirty tracking disabled domain without an
>>>> +        * IOVA bitmap set in order to clear dirty bits in all PTEs that might
>>>> +        * have occurred when we stopped dirty tracking. This ensures that we
>>>> +        * never inherit dirtied bits from a previous cycle.
>>>> +        */
>>>>
>>>> Also fixed an issue where I could theoretically clear the bit with
>>>> IOMMU_NO_CLEAR. Essentially passed the read_and_clear_dirty flags and let
>>>> dma_sl_pte_test_and_clear_dirty() to test and test-and-clear, similar to AMD:
>>>
>>> How does all this work, does this leak into the uapi? 
>>
>> UAPI is only ever expected to collect/clear dirty bits while dirty tracking is
>> enabled. And it requires valid bitmaps before it gets to the IOMMU driver.
>>
>> The above, where I pass no dirty::bitmap (but with an iotlb_gather), is internal
>> usage only. Open to alternatives if this is prone to audit errors, e.g. 1) via
>> the iommu_dirty_bitmap structure, where I add one field which, if true, lets the
>> iommufd core call into the iommu driver in a "clear IOPTE" manner, or 2) via
>> the ::flags ... the thing is that the ::flags values are UAPI, so it feels weird
>> to use these flags for internal purposes.
> 
> I think NULL to mean clear-but-not-record is OK; it doesn't matter too
> much, but ideally this would be sort of hidden in the iova APIs.

And it is hidden, no? Unless by hidden you mean that there are explicit IOVA
APIs that do this?

Currently, iopt_clear_dirty_data() does this 'internal-only' usage of
iommu_read_and_clear() and handles all of this.
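
For reference, a rough sketch of that iopt_clear_dirty_data() flow (simplified,
and the area iteration details may differ from what I end up posting):

static int iopt_clear_dirty_data(struct io_pagetable *iopt,
				 struct iommu_domain *domain)
{
	const struct iommu_dirty_ops *ops = domain->dirty_ops;
	struct iommu_iotlb_gather gather;
	struct iommu_dirty_bitmap dirty;
	struct iopt_area *area;
	int ret = 0;

	lockdep_assert_held_read(&iopt->iova_rwsem);

	/* NULL iova_bitmap: clear IOPTE dirty bits, record nothing */
	iommu_dirty_bitmap_init(&dirty, NULL, &gather);

	for (area = iopt_area_iter_first(iopt, 0, ULONG_MAX); area;
	     area = iopt_area_iter_next(area, 0, ULONG_MAX)) {
		if (!area->pages)
			continue;

		ret = ops->read_and_clear_dirty(domain, iopt_area_iova(area),
						iopt_area_length(area), 0,
						&dirty);
		if (ret)
			break;
	}

	/* Flush the gathered ranges so the IOMMU re-sets dirty bits cleanly */
	iommu_iotlb_sync(domain, &gather);
	return ret;
}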
Baolu Lu Oct. 18, 2023, 2:06 a.m. UTC | #31
On 10/17/23 10:25 PM, Joao Martins wrote:
> On 17/10/2023 15:16, Joao Martins wrote:
>> On 17/10/2023 13:41, Baolu Lu wrote:
>>> On 2023/10/17 18:51, Joao Martins wrote:
>>>> On 16/10/2023 14:01, Baolu Lu wrote:
>>>>> On 2023/10/16 20:59, Jason Gunthorpe wrote:
>>>>>> On Mon, Oct 16, 2023 at 08:58:42PM +0800, Baolu Lu wrote:
>>>>>>> On 2023/10/16 19:42, Jason Gunthorpe wrote:
>>>>>>>> On Mon, Oct 16, 2023 at 11:57:34AM +0100, Joao Martins wrote:
>>>>>>>>
>>>>>>>>> True. But to be honest, I thought we weren't quite there yet in PASID
>>>>>>>>> support
>>>>>>>>> from IOMMUFD perspective; hence why I didn't aim at it. Or do I have the
>>>>>>>>> wrong
>>>>>>>>> impression? From the code below, it clearly looks the driver does.
>>>>>>>> I think we should plan that this series will go before the PASID
>>>>>>>> series
>>>>>>> I know that PASID support in IOMMUFD is not yet available, but the VT-d
>>>>>>> driver already supports attaching a domain to a PASID, as required by
>>>>>>> the idxd driver for kernel DMA with PASID. Therefore, from the driver's
>>>>>>> perspective, dirty tracking should also be enabled for PASIDs.
>>>>>> As long as the driver refuses to attach a dirty track enabled domain
>>>>>> to PASID it would be fine for now.
>>>>> Yes. This works.
>>>> Baolu Lu, I am blocking PASID attachment this way; let me know if this matches
>>>> how you would have the driver refuse dirty tracking on PASID.
>>>
>>> Joao, how about blocking pasid attachment in intel_iommu_set_dev_pasid()
>>> directly?
>>>
>>> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
>>> index c86ba5a3e75c..392b6ca9ce90 100644
>>> --- a/drivers/iommu/intel/iommu.c
>>> +++ b/drivers/iommu/intel/iommu.c
>>> @@ -4783,6 +4783,9 @@ static int intel_iommu_set_dev_pasid(struct iommu_domain *domain,
>>>          if (context_copied(iommu, info->bus, info->devfn))
>>>                  return -EBUSY;
>>>
>>> +       if (domain->dirty_ops)
>>> +               return -EOPNOTSUPP;
>>> +
>>>          ret = prepare_domain_attach_device(domain, dev);
>>>          if (ret)
>>>                  return ret;
>>
>> I was trying to centralize all the checks, but I can place it here if you prefer
>> this way.

We will soon remove this check when pasid is supported in iommufd, so
less code change is better for future work.

>>
> Minor change: I'm changing the error code to -EINVAL to align with the non-PASID case.

Yes. Makes sense. -EINVAL means "not compatible". The caller can retry.

Best regards,
baolu
diff mbox series

Patch

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 491bcde1ff96..7d5a8f5283a7 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -300,6 +300,7 @@  static int iommu_skip_te_disable;
 #define IDENTMAP_AZALIA		4
 
 const struct iommu_ops intel_iommu_ops;
+const struct iommu_dirty_ops intel_dirty_ops;
 
 static bool translation_pre_enabled(struct intel_iommu *iommu)
 {
@@ -4077,6 +4078,7 @@  static struct iommu_domain *intel_iommu_domain_alloc(unsigned type)
 static struct iommu_domain *
 intel_iommu_domain_alloc_user(struct device *dev, u32 flags)
 {
+	bool enforce_dirty = (flags & IOMMU_HWPT_ALLOC_ENFORCE_DIRTY);
 	struct iommu_domain *domain;
 	struct intel_iommu *iommu;
 
@@ -4087,9 +4089,15 @@  intel_iommu_domain_alloc_user(struct device *dev, u32 flags)
 	if ((flags & IOMMU_HWPT_ALLOC_NEST_PARENT) && !ecap_nest(iommu->ecap))
 		return ERR_PTR(-EOPNOTSUPP);
 
+	if (enforce_dirty &&
+	    !device_iommu_capable(dev, IOMMU_CAP_DIRTY))
+		return ERR_PTR(-EOPNOTSUPP);
+
 	domain = iommu_domain_alloc(dev->bus);
 	if (!domain)
 		domain = ERR_PTR(-ENOMEM);
+	if (domain && enforce_dirty)
+		domain->dirty_ops = &intel_dirty_ops;
 	return domain;
 }
 
@@ -4367,6 +4375,9 @@  static bool intel_iommu_capable(struct device *dev, enum iommu_cap cap)
 		return dmar_platform_optin();
 	case IOMMU_CAP_ENFORCE_CACHE_COHERENCY:
 		return ecap_sc_support(info->iommu->ecap);
+	case IOMMU_CAP_DIRTY:
+		return sm_supported(info->iommu) &&
+			ecap_slads(info->iommu->ecap);
 	default:
 		return false;
 	}
@@ -4822,6 +4833,89 @@  static void *intel_iommu_hw_info(struct device *dev, u32 *length, u32 *type)
 	return vtd;
 }
 
+static int intel_iommu_set_dirty_tracking(struct iommu_domain *domain,
+					  bool enable)
+{
+	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
+	struct device_domain_info *info;
+	int ret = -EINVAL;
+
+	spin_lock(&dmar_domain->lock);
+	if (!(dmar_domain->dirty_tracking ^ enable) ||
+	    list_empty(&dmar_domain->devices)) {
+		spin_unlock(&dmar_domain->lock);
+		return 0;
+	}
+
+	list_for_each_entry(info, &dmar_domain->devices, link) {
+		/* First-level page table always enables dirty bit */
+		if (dmar_domain->use_first_level) {
+			ret = 0;
+			break;
+		}
+
+		ret = intel_pasid_setup_dirty_tracking(info->iommu, info->domain,
+						     info->dev, IOMMU_NO_PASID,
+						     enable);
+		if (ret)
+			break;
+
+	}
+
+	if (!ret)
+		dmar_domain->dirty_tracking = enable;
+	spin_unlock(&dmar_domain->lock);
+
+	return ret;
+}
+
+static int intel_iommu_read_and_clear_dirty(struct iommu_domain *domain,
+					    unsigned long iova, size_t size,
+					    unsigned long flags,
+					    struct iommu_dirty_bitmap *dirty)
+{
+	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
+	unsigned long end = iova + size - 1;
+	unsigned long pgsize;
+	bool ad_enabled;
+
+	spin_lock(&dmar_domain->lock);
+	ad_enabled = dmar_domain->dirty_tracking;
+	spin_unlock(&dmar_domain->lock);
+
+	if (!ad_enabled && dirty->bitmap)
+		return -EINVAL;
+
+	rcu_read_lock();
+	do {
+		struct dma_pte *pte;
+		int lvl = 0;
+
+		pte = pfn_to_dma_pte(dmar_domain, iova >> VTD_PAGE_SHIFT, &lvl,
+				     GFP_ATOMIC);
+		pgsize = level_size(lvl) << VTD_PAGE_SHIFT;
+		if (!pte || !dma_pte_present(pte)) {
+			iova += pgsize;
+			continue;
+		}
+
+		/* It is writable, set the bitmap */
+		if (((flags & IOMMU_DIRTY_NO_CLEAR) &&
+				dma_sl_pte_dirty(pte)) ||
+		    dma_sl_pte_test_and_clear_dirty(pte))
+			iommu_dirty_bitmap_record(dirty, iova, pgsize);
+		iova += pgsize;
+	} while (iova < end);
+	rcu_read_unlock();
+
+	return 0;
+}
+
+const struct iommu_dirty_ops intel_dirty_ops = {
+	.set_dirty_tracking	= intel_iommu_set_dirty_tracking,
+	.read_and_clear_dirty   = intel_iommu_read_and_clear_dirty,
+};
+
 const struct iommu_ops intel_iommu_ops = {
 	.capable		= intel_iommu_capable,
 	.hw_info		= intel_iommu_hw_info,
diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h
index c18fb699c87a..bccd44db3316 100644
--- a/drivers/iommu/intel/iommu.h
+++ b/drivers/iommu/intel/iommu.h
@@ -48,6 +48,9 @@ 
 #define DMA_FL_PTE_DIRTY	BIT_ULL(6)
 #define DMA_FL_PTE_XD		BIT_ULL(63)
 
+#define DMA_SL_PTE_DIRTY_BIT	9
+#define DMA_SL_PTE_DIRTY	BIT_ULL(DMA_SL_PTE_DIRTY_BIT)
+
 #define ADDR_WIDTH_5LEVEL	(57)
 #define ADDR_WIDTH_4LEVEL	(48)
 
@@ -592,6 +595,7 @@  struct dmar_domain {
 					 * otherwise, goes through the second
 					 * level.
 					 */
+	u8 dirty_tracking:1;		/* Dirty tracking is enabled */
 
 	spinlock_t lock;		/* Protect device tracking lists */
 	struct list_head devices;	/* all devices' list */
@@ -781,6 +785,17 @@  static inline bool dma_pte_present(struct dma_pte *pte)
 	return (pte->val & 3) != 0;
 }
 
+static inline bool dma_sl_pte_dirty(struct dma_pte *pte)
+{
+	return (pte->val & DMA_SL_PTE_DIRTY) != 0;
+}
+
+static inline bool dma_sl_pte_test_and_clear_dirty(struct dma_pte *pte)
+{
+	return test_and_clear_bit(DMA_SL_PTE_DIRTY_BIT,
+				  (unsigned long *)&pte->val);
+}
+
 static inline bool dma_pte_superpage(struct dma_pte *pte)
 {
 	return (pte->val & DMA_PTE_LARGE_PAGE);
diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
index 8f92b92f3d2a..03814942d59c 100644
--- a/drivers/iommu/intel/pasid.c
+++ b/drivers/iommu/intel/pasid.c
@@ -277,6 +277,11 @@  static inline void pasid_set_bits(u64 *ptr, u64 mask, u64 bits)
 	WRITE_ONCE(*ptr, (old & ~mask) | bits);
 }
 
+static inline u64 pasid_get_bits(u64 *ptr)
+{
+	return READ_ONCE(*ptr);
+}
+
 /*
  * Setup the DID(Domain Identifier) field (Bit 64~79) of scalable mode
  * PASID entry.
@@ -335,6 +340,36 @@  static inline void pasid_set_fault_enable(struct pasid_entry *pe)
 	pasid_set_bits(&pe->val[0], 1 << 1, 0);
 }
 
+/*
+ * Enable second level A/D bits by setting the SLADE (Second Level
+ * Access Dirty Enable) field (Bit 9) of a scalable mode PASID
+ * entry.
+ */
+static inline void pasid_set_ssade(struct pasid_entry *pe)
+{
+	pasid_set_bits(&pe->val[0], 1 << 9, 1 << 9);
+}
+
+/*
+ * Disable second level A/D bits by clearing the SLADE (Second Level
+ * Access Dirty Enable) field (Bit 9) of a scalable mode PASID
+ * entry.
+ */
+static inline void pasid_clear_ssade(struct pasid_entry *pe)
+{
+	pasid_set_bits(&pe->val[0], 1 << 9, 0);
+}
+
+/*
+ * Check whether the SLADE (Second Level Access Dirty Enable) field
+ * (Bit 9) of a scalable mode PASID entry is enabled.
+ */
+static inline bool pasid_get_ssade(struct pasid_entry *pe)
+{
+	return pasid_get_bits(&pe->val[0]) & (1 << 9);
+}
+
 /*
  * Setup the WPE(Write Protect Enable) field (Bit 132) of a
  * scalable mode PASID entry.
@@ -627,6 +662,8 @@  int intel_pasid_setup_second_level(struct intel_iommu *iommu,
 	pasid_set_translation_type(pte, PASID_ENTRY_PGTT_SL_ONLY);
 	pasid_set_fault_enable(pte);
 	pasid_set_page_snoop(pte, !!ecap_smpwc(iommu->ecap));
+	if (domain->dirty_tracking)
+		pasid_set_ssade(pte);
 
 	pasid_set_present(pte);
 	spin_unlock(&iommu->lock);
@@ -636,6 +673,63 @@  int intel_pasid_setup_second_level(struct intel_iommu *iommu,
 	return 0;
 }
 
+/*
+ * Set up dirty tracking on a second-level only translation type.
+ */
+int intel_pasid_setup_dirty_tracking(struct intel_iommu *iommu,
+				     struct dmar_domain *domain,
+				     struct device *dev, u32 pasid,
+				     bool enabled)
+{
+	struct pasid_entry *pte;
+	u16 did, pgtt;
+
+	spin_lock(&iommu->lock);
+
+	did = domain_id_iommu(domain, iommu);
+	pte = intel_pasid_get_entry(dev, pasid);
+	if (!pte) {
+		spin_unlock(&iommu->lock);
+		dev_err(dev, "Failed to get pasid entry of PASID %d\n", pasid);
+		return -ENODEV;
+	}
+
+	pgtt = pasid_pte_get_pgtt(pte);
+
+	if (enabled)
+		pasid_set_ssade(pte);
+	else
+		pasid_clear_ssade(pte);
+	spin_unlock(&iommu->lock);
+
+	/*
+	 * From VT-d spec table 25 "Guidance to Software for Invalidations":
+	 *
+	 * - PASID-selective-within-Domain PASID-cache invalidation
+	 *   If (PGTT=SS or Nested)
+	 *    - Domain-selective IOTLB invalidation
+	 *   Else
+	 *    - PASID-selective PASID-based IOTLB invalidation
+	 * - If (pasid is RID_PASID)
+	 *    - Global Device-TLB invalidation to affected functions
+	 *   Else
+	 *    - PASID-based Device-TLB invalidation (with S=1 and
+	 *      Addr[63:12]=0x7FFFFFFF_FFFFF) to affected functions
+	 */
+	pasid_cache_invalidation_with_pasid(iommu, did, pasid);
+
+	if (pgtt == PASID_ENTRY_PGTT_SL_ONLY || pgtt == PASID_ENTRY_PGTT_NESTED)
+		iommu->flush.flush_iotlb(iommu, did, 0, 0, DMA_TLB_DSI_FLUSH);
+	else
+		qi_flush_piotlb(iommu, did, pasid, 0, -1, 0);
+
+	/* Device IOTLB doesn't need to be flushed in caching mode. */
+	if (!cap_caching_mode(iommu->cap))
+		devtlb_invalidation_with_pasid(iommu, dev, pasid);
+
+	return 0;
+}
+
 /*
  * Set up the scalable mode pasid entry for passthrough translation type.
  */
diff --git a/drivers/iommu/intel/pasid.h b/drivers/iommu/intel/pasid.h
index 4e9e68c3c388..958050b093aa 100644
--- a/drivers/iommu/intel/pasid.h
+++ b/drivers/iommu/intel/pasid.h
@@ -106,6 +106,10 @@  int intel_pasid_setup_first_level(struct intel_iommu *iommu,
 int intel_pasid_setup_second_level(struct intel_iommu *iommu,
 				   struct dmar_domain *domain,
 				   struct device *dev, u32 pasid);
+int intel_pasid_setup_dirty_tracking(struct intel_iommu *iommu,
+				     struct dmar_domain *domain,
+				     struct device *dev, u32 pasid,
+				     bool enabled);
 int intel_pasid_setup_pass_through(struct intel_iommu *iommu,
 				   struct dmar_domain *domain,
 				   struct device *dev, u32 pasid);