Message ID: 20230923012511.10379-20-joao.m.martins@oracle.com (mailing list archive)
State: New, archived
Series: IOMMUFD Dirty Tracking
On 9/23/23 9:25 AM, Joao Martins wrote:
> IOMMU advertises Access/Dirty bits for second-stage page table if the
> extended capability DMAR register reports it (ECAP, mnemonic ECAP.SSADS).
> The first stage table is compatible with CPU page table thus A/D bits are
> implicitly supported. Relevant Intel IOMMU SDM ref for first stage table
> "3.6.2 Accessed, Extended Accessed, and Dirty Flags" and second stage table
> "3.7.2 Accessed and Dirty Flags".
>
> First stage page table is enabled by default so it's allowed to set dirty
> tracking and no control bits needed, it just returns 0. To use SSADS, set
> bit 9 (SSADE) in the scalable-mode PASID table entry and flush the IOTLB
> via pasid_flush_caches() following the manual. Relevant SDM refs:
>
> "3.7.2 Accessed and Dirty Flags"
> "6.5.3.3 Guidance to Software for Invalidations,
>  Table 23. Guidance to Software for Invalidations"
>
> PTE dirty bit is located in bit 9 and it's cached in the IOTLB so flush
> IOTLB to make sure IOMMU attempts to set the dirty bit again. Note that
> iommu_dirty_bitmap_record() will add the IOVA to iotlb_gather and thus
> the caller of the iommu op will flush the IOTLB. Relevant manuals over
> the hardware translation is chapter 6 with some special mention to:
>
> "6.2.3.1 Scalable-Mode PASID-Table Entry Programming Considerations"
> "6.2.4 IOTLB"
>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
> The IOPTE walker is still a bit inefficient. Making sure the UAPI/IOMMUFD is
> solid and agreed upon.
> ---
>  drivers/iommu/intel/iommu.c | 94 +++++++++++++++++++++++++++++++++++++
>  drivers/iommu/intel/iommu.h | 15 ++++++
>  drivers/iommu/intel/pasid.c | 94 +++++++++++++++++++++++++++++++++++++
>  drivers/iommu/intel/pasid.h |  4 ++
>  4 files changed, 207 insertions(+)

The code is probably incomplete. When attaching a domain to a device,
check the domain's dirty tracking capability against the device's
capabilities. If the domain's dirty tracking capability is set but the
device does not support it, the attach callback should return -EINVAL.

Best regards,
baolu
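For reference, the SSADS enable sequence the commit message describes
reduces to flipping SSADE in the scalable-mode PASID-table entry and then
invalidating the PASID cache and IOTLB. A minimal sketch, assuming the
helpers introduced by this patch (pasid_set_ssade(), pasid_clear_ssade(),
pasid_cache_invalidation_with_pasid()); the real function in the patch
also handles locking and device-TLB flushing:

	/* Sketch only: toggle SSADE (bit 9) and invalidate per Table 23. */
	static int ssade_toggle_sketch(struct intel_iommu *iommu, struct device *dev,
				       u16 did, u32 pasid, bool enable)
	{
		struct pasid_entry *pte = intel_pasid_get_entry(dev, pasid);

		if (!pte)
			return -ENODEV;

		if (enable)
			pasid_set_ssade(pte);
		else
			pasid_clear_ssade(pte);

		/* Dirty bits are cached in the IOTLB; flush so they get set anew. */
		pasid_cache_invalidation_with_pasid(iommu, did, pasid);
		iommu->flush.flush_iotlb(iommu, did, 0, 0, DMA_TLB_DSI_FLUSH);
		return 0;
	}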
On 25/09/2023 08:01, Baolu Lu wrote:
> On 9/23/23 9:25 AM, Joao Martins wrote:
>> IOMMU advertises Access/Dirty bits for second-stage page table if the
>> extended capability DMAR register reports it (ECAP, mnemonic ECAP.SSADS).
>> The first stage table is compatible with CPU page table thus A/D bits are
>> implicitly supported. Relevant Intel IOMMU SDM ref for first stage table
>> "3.6.2 Accessed, Extended Accessed, and Dirty Flags" and second stage table
>> "3.7.2 Accessed and Dirty Flags".
>>
>> First stage page table is enabled by default so it's allowed to set dirty
>> tracking and no control bits needed, it just returns 0. To use SSADS, set
>> bit 9 (SSADE) in the scalable-mode PASID table entry and flush the IOTLB
>> via pasid_flush_caches() following the manual. Relevant SDM refs:
>>
>> "3.7.2 Accessed and Dirty Flags"
>> "6.5.3.3 Guidance to Software for Invalidations,
>>  Table 23. Guidance to Software for Invalidations"
>>
>> PTE dirty bit is located in bit 9 and it's cached in the IOTLB so flush
>> IOTLB to make sure IOMMU attempts to set the dirty bit again. Note that
>> iommu_dirty_bitmap_record() will add the IOVA to iotlb_gather and thus
>> the caller of the iommu op will flush the IOTLB. Relevant manuals over
>> the hardware translation is chapter 6 with some special mention to:
>>
>> "6.2.3.1 Scalable-Mode PASID-Table Entry Programming Considerations"
>> "6.2.4 IOTLB"
>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>> The IOPTE walker is still a bit inefficient. Making sure the UAPI/IOMMUFD is
>> solid and agreed upon.
>> ---
>>  drivers/iommu/intel/iommu.c | 94 +++++++++++++++++++++++++++++++++++++
>>  drivers/iommu/intel/iommu.h | 15 ++++++
>>  drivers/iommu/intel/pasid.c | 94 +++++++++++++++++++++++++++++++++++++
>>  drivers/iommu/intel/pasid.h |  4 ++
>>  4 files changed, 207 insertions(+)
>
> The code is probably incomplete. When attaching a domain to a device,
> check the domain's dirty tracking capability against the device's
> capabilities. If the domain's dirty tracking capability is set but the
> device does not support it, the attach callback should return -EINVAL.

Yeap, I did that for AMD, but it seems in the mix of changes I may have
deleted and then forgot to include it here.
Here's what I added (together with consolidated cap checking):

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 7d5a8f5283a7..fabfe363f1f9 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -4075,6 +4075,11 @@ static struct iommu_domain *intel_iommu_domain_alloc(unsigned type)
 	return NULL;
 }

+static bool intel_iommu_slads_supported(struct intel_iommu *iommu)
+{
+	return sm_supported(iommu) && ecap_slads(iommu->ecap);
+}
+
 static struct iommu_domain *
 intel_iommu_domain_alloc_user(struct device *dev, u32 flags)
 {
@@ -4090,7 +4095,7 @@ intel_iommu_domain_alloc_user(struct device *dev, u32 flags)
 		return ERR_PTR(-EOPNOTSUPP);

 	if (enforce_dirty &&
-	    !device_iommu_capable(dev, IOMMU_CAP_DIRTY))
+	    !intel_iommu_slads_supported(iommu))
 		return ERR_PTR(-EOPNOTSUPP);

 	domain = iommu_domain_alloc(dev->bus);
@@ -4121,6 +4126,9 @@ static int prepare_domain_attach_device(struct iommu_domain *domain,
 	if (dmar_domain->force_snooping && !ecap_sc_support(iommu->ecap))
 		return -EINVAL;

+	if (domain->dirty_ops && !intel_iommu_slads_supported(iommu))
+		return -EINVAL;
+
 	/* check if this iommu agaw is sufficient for max mapped address */
 	addr_width = agaw_to_width(iommu->agaw);
 	if (addr_width > cap_mgaw(iommu->cap))
@@ -4376,8 +4384,7 @@ static bool intel_iommu_capable(struct device *dev, enum iommu_cap cap)
 	case IOMMU_CAP_ENFORCE_CACHE_COHERENCY:
 		return ecap_sc_support(info->iommu->ecap);
 	case IOMMU_CAP_DIRTY:
-		return sm_supported(info->iommu) &&
-		       ecap_slads(info->iommu->ecap);
+		return intel_iommu_slads_supported(info->iommu);
 	default:
 		return false;
 	}
On 9/23/23 9:25 AM, Joao Martins wrote:
> --- a/drivers/iommu/intel/iommu.c
> +++ b/drivers/iommu/intel/iommu.c
> @@ -300,6 +300,7 @@ static int iommu_skip_te_disable;
>  #define IDENTMAP_AZALIA		4
>
>  const struct iommu_ops intel_iommu_ops;
> +const struct iommu_dirty_ops intel_dirty_ops;
>
>  static bool translation_pre_enabled(struct intel_iommu *iommu)
>  {
> @@ -4077,6 +4078,7 @@ static struct iommu_domain *intel_iommu_domain_alloc(unsigned type)
>  static struct iommu_domain *
>  intel_iommu_domain_alloc_user(struct device *dev, u32 flags)
>  {
> +	bool enforce_dirty = (flags & IOMMU_HWPT_ALLOC_ENFORCE_DIRTY);
>  	struct iommu_domain *domain;
>  	struct intel_iommu *iommu;
>
> @@ -4087,9 +4089,15 @@ intel_iommu_domain_alloc_user(struct device *dev, u32 flags)
>  	if ((flags & IOMMU_HWPT_ALLOC_NEST_PARENT) && !ecap_nest(iommu->ecap))
>  		return ERR_PTR(-EOPNOTSUPP);
>
> +	if (enforce_dirty &&
> +	    !device_iommu_capable(dev, IOMMU_CAP_DIRTY))
> +		return ERR_PTR(-EOPNOTSUPP);
> +
>  	domain = iommu_domain_alloc(dev->bus);
>  	if (!domain)
>  		domain = ERR_PTR(-ENOMEM);
> +	if (domain && enforce_dirty)

@domain cannot be NULL here.

> +		domain->dirty_ops = &intel_dirty_ops;
>  	return domain;
>  }

The VT-d driver always uses second level for a user domain translation.
In order to avoid checks of "domain->use_first_level" in the callbacks,
how about checking it here and returning failure if first level is used
for a user domain?

[...]
	domain = iommu_domain_alloc(dev->bus);
	if (!domain)
		return ERR_PTR(-ENOMEM);

	if (enforce_dirty) {
		if (to_dmar_domain(domain)->use_first_level) {
			iommu_domain_free(domain);
			return ERR_PTR(-EOPNOTSUPP);
		}
		domain->dirty_ops = &intel_dirty_ops;
	}

	return domain;

> @@ -4367,6 +4375,9 @@ static bool intel_iommu_capable(struct device *dev, enum iommu_cap cap)
>  		return dmar_platform_optin();
>  	case IOMMU_CAP_ENFORCE_CACHE_COHERENCY:
>  		return ecap_sc_support(info->iommu->ecap);
> +	case IOMMU_CAP_DIRTY:
> +		return sm_supported(info->iommu) &&
> +		       ecap_slads(info->iommu->ecap);

Above appears several times in this patch. Is it possible to define it
as a macro?

diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h
index bccd44db3316..379e141bbb28 100644
--- a/drivers/iommu/intel/iommu.h
+++ b/drivers/iommu/intel/iommu.h
@@ -542,6 +542,8 @@ enum {
 #define sm_supported(iommu)	(intel_iommu_sm && ecap_smts((iommu)->ecap))
 #define pasid_supported(iommu)	(sm_supported(iommu) && \
 				 ecap_pasid((iommu)->ecap))
+#define slads_supported(iommu)	(sm_supported(iommu) && \
+				 ecap_slads((iommu)->ecap))

>  	default:
>  		return false;
>  	}

Best regards,
baolu
On 9/23/23 9:25 AM, Joao Martins wrote:
[...]
> +static int intel_iommu_set_dirty_tracking(struct iommu_domain *domain,
> +					  bool enable)
> +{
> +	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
> +	struct device_domain_info *info;
> +	int ret = -EINVAL;
> +
> +	spin_lock(&dmar_domain->lock);
> +	if (!(dmar_domain->dirty_tracking ^ enable) ||

Just out of curiosity, can we simply write

	dmar_domain->dirty_tracking == enable

instead? I am not sure whether the compiler will be happy with this.

> +	    list_empty(&dmar_domain->devices)) {

list_for_each_entry is a no-op if the list is empty, so no need to
check it.

> +		spin_unlock(&dmar_domain->lock);
> +		return 0;
> +	}
> +
> +	list_for_each_entry(info, &dmar_domain->devices, link) {
> +		/* First-level page table always enables dirty bit*/
> +		if (dmar_domain->use_first_level) {

Since we leave out domain->use_first_level in the user_domain_alloc
function, we no longer need to check it here.

> +			ret = 0;
> +			break;
> +		}
> +
> +		ret = intel_pasid_setup_dirty_tracking(info->iommu, info->domain,
> +						       info->dev, IOMMU_NO_PASID,
> +						       enable);
> +		if (ret)
> +			break;

We need to unwind to the previous status here. We cannot leave some
devices with status @enable while others do not.

> +
> +	}

The VT-d driver also supports attaching a domain to a pasid of a
device. We also need to enable dirty tracking on those devices.

> +
> +	if (!ret)
> +		dmar_domain->dirty_tracking = enable;
> +	spin_unlock(&dmar_domain->lock);
> +
> +	return ret;
> +}

I have made some changes to the code based on my above comments. Please
let me know what you think.

static int intel_iommu_set_dirty_tracking(struct iommu_domain *domain,
					  bool enable)
{
	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
	struct dev_pasid_info *dev_pasid;
	struct device_domain_info *info;
	int ret;

	spin_lock(&dmar_domain->lock);
	if (!(dmar_domain->dirty_tracking ^ enable))
		goto out_unlock;

	list_for_each_entry(info, &dmar_domain->devices, link) {
		ret = intel_pasid_setup_dirty_tracking(info->iommu, dmar_domain,
						       info->dev, IOMMU_NO_PASID,
						       enable);
		if (ret)
			goto err_unwind;
	}

	list_for_each_entry(dev_pasid, &dmar_domain->dev_pasids, link_domain) {
		info = dev_iommu_priv_get(dev_pasid->dev);
		ret = intel_pasid_setup_dirty_tracking(info->iommu, dmar_domain,
						       info->dev, dev_pasid->pasid,
						       enable);
		if (ret)
			goto err_unwind;
	}

	dmar_domain->dirty_tracking = enable;
out_unlock:
	spin_unlock(&dmar_domain->lock);

	return 0;

err_unwind:
	list_for_each_entry(info, &dmar_domain->devices, link)
		intel_pasid_setup_dirty_tracking(info->iommu, dmar_domain,
						 info->dev, IOMMU_NO_PASID,
						 dmar_domain->dirty_tracking);
	list_for_each_entry(dev_pasid, &dmar_domain->dev_pasids, link_domain) {
		info = dev_iommu_priv_get(dev_pasid->dev);
		intel_pasid_setup_dirty_tracking(info->iommu, dmar_domain,
						 info->dev, dev_pasid->pasid,
						 dmar_domain->dirty_tracking);
	}
	spin_unlock(&dmar_domain->lock);

	return ret;
}

Best regards,
baolu
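As an aside on the XOR question above: for 0/1-valued operands the two
forms are interchangeable, so the equality form is purely a readability
win. A quick standalone check (plain C, outside the kernel; the test
harness is illustrative only):

	#include <assert.h>
	#include <stdbool.h>

	int main(void)
	{
		/* !(a ^ b) and (a == b) agree for every bool combination. */
		for (int a = 0; a <= 1; a++)
			for (int b = 0; b <= 1; b++)
				assert(!((bool)a ^ (bool)b) == ((bool)a == (bool)b));
		return 0;
	}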
On 9/23/23 9:25 AM, Joao Martins wrote:
[...]
> +static int intel_iommu_read_and_clear_dirty(struct iommu_domain *domain,
> +					    unsigned long iova, size_t size,
> +					    unsigned long flags,
> +					    struct iommu_dirty_bitmap *dirty)
> +{
> +	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
> +	unsigned long end = iova + size - 1;
> +	unsigned long pgsize;
> +	bool ad_enabled;
> +
> +	spin_lock(&dmar_domain->lock);
> +	ad_enabled = dmar_domain->dirty_tracking;
> +	spin_unlock(&dmar_domain->lock);

The spin lock is to protect the RID and PASID device tracking lists. No
need to use it here.

> +
> +	if (!ad_enabled && dirty->bitmap)
> +		return -EINVAL;

I don't understand the above check of "dirty->bitmap". Isn't it always
invalid to call this if dirty tracking is not enabled on the domain?

The iommu_dirty_bitmap is defined in the iommu core. The iommu driver
has no need to understand it and check its members anyway. Or did I
overlook anything?

> +
> +	rcu_read_lock();

Do we really need an rcu lock here? This operation is protected by
iopt->iova_rwsem. Is it reasonable to remove it? If not, how about
putting some comments around it?

> +	do {
> +		struct dma_pte *pte;
> +		int lvl = 0;
> +
> +		pte = pfn_to_dma_pte(dmar_domain, iova >> VTD_PAGE_SHIFT, &lvl,
> +				     GFP_ATOMIC);
> +		pgsize = level_size(lvl) << VTD_PAGE_SHIFT;
> +		if (!pte || !dma_pte_present(pte)) {
> +			iova += pgsize;
> +			continue;
> +		}
> +
> +		/* It is writable, set the bitmap */
> +		if (((flags & IOMMU_DIRTY_NO_CLEAR) &&
> +		     dma_sl_pte_dirty(pte)) ||
> +		    dma_sl_pte_test_and_clear_dirty(pte))
> +			iommu_dirty_bitmap_record(dirty, iova, pgsize);
> +		iova += pgsize;
> +	} while (iova < end);
> +	rcu_read_unlock();
> +
> +	return 0;
> +}

Best regards,
baolu
On 9/23/23 9:25 AM, Joao Martins wrote:
[...]
> +/*
> + * Set up dirty tracking on a second only translation type.

Set up dirty tracking on a second-only or nested translation type.

> + */
> +int intel_pasid_setup_dirty_tracking(struct intel_iommu *iommu,
> +				     struct dmar_domain *domain,
> +				     struct device *dev, u32 pasid,
> +				     bool enabled)
> +{
> +	struct pasid_entry *pte;
> +	u16 did, pgtt;
> +
> +	spin_lock(&iommu->lock);
> +
> +	did = domain_id_iommu(domain, iommu);
> +	pte = intel_pasid_get_entry(dev, pasid);
> +	if (!pte) {
> +		spin_unlock(&iommu->lock);
> +		dev_err(dev, "Failed to get pasid entry of PASID %d\n", pasid);

Use dev_err_ratelimited() to avoid a user DOS attack.

> +		return -ENODEV;
> +	}

Can we add a check to limit this interface to second-only and nested
translation types? These are the only valid use cases currently and for
the foreseeable future. And, return directly if the pasid bit matches
the target state.

[...]
	spin_lock(&iommu->lock);
	pte = intel_pasid_get_entry(dev, pasid);
	if (!pte) {
		spin_unlock(&iommu->lock);
		dev_err_ratelimited(dev,
				    "Failed to get pasid entry of PASID %d\n",
				    pasid);
		return -ENODEV;
	}

	did = domain_id_iommu(domain, iommu);
	pgtt = pasid_pte_get_pgtt(pte);

	if (pgtt != PASID_ENTRY_PGTT_SL_ONLY && pgtt != PASID_ENTRY_PGTT_NESTED) {
		spin_unlock(&iommu->lock);
		dev_err_ratelimited(dev,
				    "Dirty tracking not supported on translation type %d\n",
				    pgtt);
		return -EOPNOTSUPP;
	}

	if (pasid_get_ssade(pte) == enabled) {
		spin_unlock(&iommu->lock);
		return 0;
	}

	if (enabled)
		pasid_set_ssade(pte);
	else
		pasid_clear_ssade(pte);
	spin_unlock(&iommu->lock);
[...]

> +
> +	pgtt = pasid_pte_get_pgtt(pte);
> +
> +	if (enabled)
> +		pasid_set_ssade(pte);
> +	else
> +		pasid_clear_ssade(pte);
> +	spin_unlock(&iommu->lock);

Add below here:

	if (!ecap_coherent(iommu->ecap))
		clflush_cache_range(pte, sizeof(*pte));

> +
> +	/*
> +	 * From VT-d spec table 25 "Guidance to Software for Invalidations":
> +	 *
> +	 * - PASID-selective-within-Domain PASID-cache invalidation
> +	 *   If (PGTT=SS or Nested)
> +	 *    - Domain-selective IOTLB invalidation
> +	 *   Else
> +	 *    - PASID-selective PASID-based IOTLB invalidation
> +	 * - If (pasid is RID_PASID)
> +	 *    - Global Device-TLB invalidation to affected functions
> +	 *   Else
> +	 *    - PASID-based Device-TLB invalidation (with S=1 and
> +	 *      Addr[63:12]=0x7FFFFFFF_FFFFF) to affected functions
> +	 */
> +	pasid_cache_invalidation_with_pasid(iommu, did, pasid);
> +
> +	if (pgtt == PASID_ENTRY_PGTT_SL_ONLY || pgtt == PASID_ENTRY_PGTT_NESTED)
> +		iommu->flush.flush_iotlb(iommu, did, 0, 0, DMA_TLB_DSI_FLUSH);
> +	else
> +		qi_flush_piotlb(iommu, did, pasid, 0, -1, 0);

Only "Domain-selective IOTLB invalidation" is needed here.

> +
> +	/* Device IOTLB doesn't need to be flushed in caching mode. */
> +	if (!cap_caching_mode(iommu->cap))
> +		devtlb_invalidation_with_pasid(iommu, dev, pasid);

For the device IOTLB invalidation, we need to follow what the spec
requires.

	If (pasid is RID_PASID)
	 - Global Device-TLB invalidation to affected functions
	Else
	 - PASID-based Device-TLB invalidation (with S=1 and
	   Addr[63:12]=0x7FFFFFFF_FFFFF) to affected functions

> +
> +	return 0;
> +}
> +

Best regards,
baolu
On 9/25/23 5:08 PM, Joao Martins wrote:
> On 25/09/2023 08:01, Baolu Lu wrote:
>> On 9/23/23 9:25 AM, Joao Martins wrote:
>>> IOMMU advertises Access/Dirty bits for second-stage page table if the
>>> extended capability DMAR register reports it (ECAP, mnemonic ECAP.SSADS).
>>> The first stage table is compatible with CPU page table thus A/D bits are
>>> implicitly supported. Relevant Intel IOMMU SDM ref for first stage table
>>> "3.6.2 Accessed, Extended Accessed, and Dirty Flags" and second stage table
>>> "3.7.2 Accessed and Dirty Flags".
>>>
>>> First stage page table is enabled by default so it's allowed to set dirty
>>> tracking and no control bits needed, it just returns 0. To use SSADS, set
>>> bit 9 (SSADE) in the scalable-mode PASID table entry and flush the IOTLB
>>> via pasid_flush_caches() following the manual. Relevant SDM refs:
>>>
>>> "3.7.2 Accessed and Dirty Flags"
>>> "6.5.3.3 Guidance to Software for Invalidations,
>>>  Table 23. Guidance to Software for Invalidations"
>>>
>>> PTE dirty bit is located in bit 9 and it's cached in the IOTLB so flush
>>> IOTLB to make sure IOMMU attempts to set the dirty bit again. Note that
>>> iommu_dirty_bitmap_record() will add the IOVA to iotlb_gather and thus
>>> the caller of the iommu op will flush the IOTLB. Relevant manuals over
>>> the hardware translation is chapter 6 with some special mention to:
>>>
>>> "6.2.3.1 Scalable-Mode PASID-Table Entry Programming Considerations"
>>> "6.2.4 IOTLB"
>>>
>>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>>> ---
>>> The IOPTE walker is still a bit inefficient. Making sure the UAPI/IOMMUFD is
>>> solid and agreed upon.
>>> ---
>>>  drivers/iommu/intel/iommu.c | 94 +++++++++++++++++++++++++++++++++++++
>>>  drivers/iommu/intel/iommu.h | 15 ++++++
>>>  drivers/iommu/intel/pasid.c | 94 +++++++++++++++++++++++++++++++++++++
>>>  drivers/iommu/intel/pasid.h |  4 ++
>>>  4 files changed, 207 insertions(+)
>>
>> The code is probably incomplete. When attaching a domain to a device,
>> check the domain's dirty tracking capability against the device's
>> capabilities. If the domain's dirty tracking capability is set but the
>> device does not support it, the attach callback should return -EINVAL.
>>
> Yeap, I did that for AMD, but it seems in the mix of changes I may have
> deleted and then forgot to include it here.
> Here's what I added (together with consolidated cap checking):
>
> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> index 7d5a8f5283a7..fabfe363f1f9 100644
> --- a/drivers/iommu/intel/iommu.c
> +++ b/drivers/iommu/intel/iommu.c
> @@ -4075,6 +4075,11 @@ static struct iommu_domain *intel_iommu_domain_alloc(unsigned type)
>  	return NULL;
>  }
>
> +static bool intel_iommu_slads_supported(struct intel_iommu *iommu)
> +{
> +	return sm_supported(iommu) && ecap_slads(iommu->ecap);
> +}
> +
>  static struct iommu_domain *
>  intel_iommu_domain_alloc_user(struct device *dev, u32 flags)
>  {
> @@ -4090,7 +4095,7 @@ intel_iommu_domain_alloc_user(struct device *dev, u32 flags)
>  		return ERR_PTR(-EOPNOTSUPP);
>
>  	if (enforce_dirty &&
> -	    !device_iommu_capable(dev, IOMMU_CAP_DIRTY))
> +	    !intel_iommu_slads_supported(iommu))
>  		return ERR_PTR(-EOPNOTSUPP);
>
>  	domain = iommu_domain_alloc(dev->bus);
> @@ -4121,6 +4126,9 @@ static int prepare_domain_attach_device(struct iommu_domain *domain,
>  	if (dmar_domain->force_snooping && !ecap_sc_support(iommu->ecap))
>  		return -EINVAL;
>
> +	if (domain->dirty_ops && !intel_iommu_slads_supported(iommu))
> +		return -EINVAL;
> +
>  	/* check if this iommu agaw is sufficient for max mapped address */
>  	addr_width = agaw_to_width(iommu->agaw);
>  	if (addr_width > cap_mgaw(iommu->cap))
> @@ -4376,8 +4384,7 @@ static bool intel_iommu_capable(struct device *dev, enum iommu_cap cap)
>  	case IOMMU_CAP_ENFORCE_CACHE_COHERENCY:
>  		return ecap_sc_support(info->iommu->ecap);
>  	case IOMMU_CAP_DIRTY:
> -		return sm_supported(info->iommu) &&
> -		       ecap_slads(info->iommu->ecap);
> +		return intel_iommu_slads_supported(info->iommu);
>  	default:
>  		return false;
>  	}

Yes. The above additional change looks good to me.

Best regards,
baolu
On 16/10/2023 01:51, Baolu Lu wrote:
> On 9/23/23 9:25 AM, Joao Martins wrote:
>> --- a/drivers/iommu/intel/iommu.c
>> +++ b/drivers/iommu/intel/iommu.c
>> @@ -300,6 +300,7 @@ static int iommu_skip_te_disable;
>>  #define IDENTMAP_AZALIA		4
>>
>>  const struct iommu_ops intel_iommu_ops;
>> +const struct iommu_dirty_ops intel_dirty_ops;
>>
>>  static bool translation_pre_enabled(struct intel_iommu *iommu)
>>  {
>> @@ -4077,6 +4078,7 @@ static struct iommu_domain *intel_iommu_domain_alloc(unsigned type)
>>  static struct iommu_domain *
>>  intel_iommu_domain_alloc_user(struct device *dev, u32 flags)
>>  {
>> +	bool enforce_dirty = (flags & IOMMU_HWPT_ALLOC_ENFORCE_DIRTY);
>>  	struct iommu_domain *domain;
>>  	struct intel_iommu *iommu;
>>
>> @@ -4087,9 +4089,15 @@ intel_iommu_domain_alloc_user(struct device *dev, u32 flags)
>>  	if ((flags & IOMMU_HWPT_ALLOC_NEST_PARENT) && !ecap_nest(iommu->ecap))
>>  		return ERR_PTR(-EOPNOTSUPP);
>>
>> +	if (enforce_dirty &&
>> +	    !device_iommu_capable(dev, IOMMU_CAP_DIRTY))
>> +		return ERR_PTR(-EOPNOTSUPP);
>> +
>>  	domain = iommu_domain_alloc(dev->bus);
>>  	if (!domain)
>>  		domain = ERR_PTR(-ENOMEM);
>> +	if (domain && enforce_dirty)
>
> @domain cannot be NULL here.

True, it should be:

	if (!IS_ERR(domain) && enforce_dirty)

>> +		domain->dirty_ops = &intel_dirty_ops;
>>  	return domain;
>>  }
>
> The VT-d driver always uses second level for a user domain translation.
> In order to avoid checks of "domain->use_first_level" in the callbacks,
> how about checking it here and returning failure if first level is used
> for a user domain?

I was told by Yi Y Sun offlist to have the first_level checked, because
the dirty bit in the first stage page table is always enabled (and cannot
be toggled on/off). I can remove it again; the initial RFC didn't have it
as it was failing in a similar way to how you suggest here. Not sure how
to proceed?

> [...]
> 	domain = iommu_domain_alloc(dev->bus);
> 	if (!domain)
> 		return ERR_PTR(-ENOMEM);
>
> 	if (enforce_dirty) {
> 		if (to_dmar_domain(domain)->use_first_level) {
> 			iommu_domain_free(domain);
> 			return ERR_PTR(-EOPNOTSUPP);
> 		}
> 		domain->dirty_ops = &intel_dirty_ops;
> 	}
>
> 	return domain;

Should I fail on first level pagetable dirty enforcement, I will
certainly adopt the above (and remove the successful return on
first_level in set_dirty_tracking).

>> @@ -4367,6 +4375,9 @@ static bool intel_iommu_capable(struct device *dev, enum iommu_cap cap)
>>  		return dmar_platform_optin();
>>  	case IOMMU_CAP_ENFORCE_CACHE_COHERENCY:
>>  		return ecap_sc_support(info->iommu->ecap);
>> +	case IOMMU_CAP_DIRTY:
>> +		return sm_supported(info->iommu) &&
>> +		       ecap_slads(info->iommu->ecap);
>
> Above appears several times in this patch. Is it possible to define it
> as a macro?

Yeap, for sure much cleaner indeed.

> diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h
> index bccd44db3316..379e141bbb28 100644
> --- a/drivers/iommu/intel/iommu.h
> +++ b/drivers/iommu/intel/iommu.h
> @@ -542,6 +542,8 @@ enum {
>  #define sm_supported(iommu)	(intel_iommu_sm && ecap_smts((iommu)->ecap))
>  #define pasid_supported(iommu)	(sm_supported(iommu) && \
>  				 ecap_pasid((iommu)->ecap))
> +#define slads_supported(iommu)	(sm_supported(iommu) && \
> +				 ecap_slads((iommu)->ecap))

Yeap.

>>  	default:
>>  		return false;
>>  	}
>
> Best regards,
> baolu
On 16/10/2023 02:37, Baolu Lu wrote:
> On 9/23/23 9:25 AM, Joao Martins wrote:
> [...]
>> +static int intel_iommu_set_dirty_tracking(struct iommu_domain *domain,
>> +					  bool enable)
>> +{
>> +	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
>> +	struct device_domain_info *info;
>> +	int ret = -EINVAL;
>> +
>> +	spin_lock(&dmar_domain->lock);
>> +	if (!(dmar_domain->dirty_tracking ^ enable) ||
>
> Just out of curiosity, can we simply write
>
> 	dmar_domain->dirty_tracking == enable
>
> instead? I am not sure whether the compiler will be happy with this.

Part of the check above was just trying to avoid same-state transitions.
What you wrote should work (...)

>> +	    list_empty(&dmar_domain->devices)) {
>
> list_for_each_entry is a no-op if the list is empty, so no need to
> check it.

Though this is unnecessary, yes.

>> +		spin_unlock(&dmar_domain->lock);
>> +		return 0;
>> +	}
>> +
>> +	list_for_each_entry(info, &dmar_domain->devices, link) {
>> +		/* First-level page table always enables dirty bit*/
>> +		if (dmar_domain->use_first_level) {
>
> Since we leave out domain->use_first_level in the user_domain_alloc
> function, we no longer need to check it here.
>
>> +			ret = 0;
>> +			break;
>> +		}
>> +
>> +		ret = intel_pasid_setup_dirty_tracking(info->iommu, info->domain,
>> +						       info->dev, IOMMU_NO_PASID,
>> +						       enable);
>> +		if (ret)
>> +			break;
>
> We need to unwind to the previous status here. We cannot leave some
> devices with status @enable while others do not.

Ugh, yes.

>> +
>> +	}
>
> The VT-d driver also supports attaching a domain to a pasid of a
> device. We also need to enable dirty tracking on those devices.

True. But to be honest, I thought we weren't quite there yet in PASID
support from the IOMMUFD perspective; hence why I didn't aim at it. Or do
I have the wrong impression? From the code below, it clearly looks like
the driver does.

>> +
>> +	if (!ret)
>> +		dmar_domain->dirty_tracking = enable;
>> +	spin_unlock(&dmar_domain->lock);
>> +
>> +	return ret;
>> +}
>
> I have made some changes to the code based on my above comments. Please
> let me know what you think.

Looks great; thanks a lot for these changes!
> static int intel_iommu_set_dirty_tracking(struct iommu_domain *domain,
> 					  bool enable)
> {
> 	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
> 	struct dev_pasid_info *dev_pasid;
> 	struct device_domain_info *info;
> 	int ret;
>
> 	spin_lock(&dmar_domain->lock);
> 	if (!(dmar_domain->dirty_tracking ^ enable))
> 		goto out_unlock;
>
> 	list_for_each_entry(info, &dmar_domain->devices, link) {
> 		ret = intel_pasid_setup_dirty_tracking(info->iommu, dmar_domain,
> 						       info->dev, IOMMU_NO_PASID,
> 						       enable);
> 		if (ret)
> 			goto err_unwind;
> 	}
>
> 	list_for_each_entry(dev_pasid, &dmar_domain->dev_pasids, link_domain) {
> 		info = dev_iommu_priv_get(dev_pasid->dev);
> 		ret = intel_pasid_setup_dirty_tracking(info->iommu, dmar_domain,
> 						       info->dev, dev_pasid->pasid,
> 						       enable);
> 		if (ret)
> 			goto err_unwind;
> 	}
>
> 	dmar_domain->dirty_tracking = enable;
> out_unlock:
> 	spin_unlock(&dmar_domain->lock);
>
> 	return 0;
>
> err_unwind:
> 	list_for_each_entry(info, &dmar_domain->devices, link)
> 		intel_pasid_setup_dirty_tracking(info->iommu, dmar_domain,
> 						 info->dev, IOMMU_NO_PASID,
> 						 dmar_domain->dirty_tracking);
> 	list_for_each_entry(dev_pasid, &dmar_domain->dev_pasids, link_domain) {
> 		info = dev_iommu_priv_get(dev_pasid->dev);
> 		intel_pasid_setup_dirty_tracking(info->iommu, dmar_domain,
> 						 info->dev, dev_pasid->pasid,
> 						 dmar_domain->dirty_tracking);
> 	}
> 	spin_unlock(&dmar_domain->lock);
>
> 	return ret;
> }
>
> Best regards,
> baolu
On 16/10/2023 03:07, Baolu Lu wrote:
> On 9/23/23 9:25 AM, Joao Martins wrote:
> [...]
>> +static int intel_iommu_read_and_clear_dirty(struct iommu_domain *domain,
>> +					    unsigned long iova, size_t size,
>> +					    unsigned long flags,
>> +					    struct iommu_dirty_bitmap *dirty)
>> +{
>> +	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
>> +	unsigned long end = iova + size - 1;
>> +	unsigned long pgsize;
>> +	bool ad_enabled;
>> +
>> +	spin_lock(&dmar_domain->lock);
>> +	ad_enabled = dmar_domain->dirty_tracking;
>> +	spin_unlock(&dmar_domain->lock);
>
> The spin lock is to protect the RID and PASID device tracking lists. No
> need to use it here.

OK

>> +
>> +	if (!ad_enabled && dirty->bitmap)
>> +		return -EINVAL;
>
> I don't understand the above check of "dirty->bitmap". Isn't it always
> invalid to call this if dirty tracking is not enabled on the domain?

It is spurious (...)

> The iommu_dirty_bitmap is defined in the iommu core. The iommu driver
> has no need to understand it and check its members anyway.

(...) The iommu driver has no need to understand it.
iommu_dirty_bitmap_record() already makes those checks in case there's
no iova_bitmap to set bits to.

> Or did I overlook anything?
>
>> +
>> +	rcu_read_lock();
>
> Do we really need an rcu lock here? This operation is protected by
> iopt->iova_rwsem. Is it reasonable to remove it? If not, how about
> putting some comments around it?

As I had mentioned in an earlier comment, this was not meant to be here.

>> +	do {
>> +		struct dma_pte *pte;
>> +		int lvl = 0;
>> +
>> +		pte = pfn_to_dma_pte(dmar_domain, iova >> VTD_PAGE_SHIFT, &lvl,
>> +				     GFP_ATOMIC);
>> +		pgsize = level_size(lvl) << VTD_PAGE_SHIFT;
>> +		if (!pte || !dma_pte_present(pte)) {
>> +			iova += pgsize;
>> +			continue;
>> +		}
>> +
>> +		/* It is writable, set the bitmap */
>> +		if (((flags & IOMMU_DIRTY_NO_CLEAR) &&
>> +		     dma_sl_pte_dirty(pte)) ||
>> +		    dma_sl_pte_test_and_clear_dirty(pte))
>> +			iommu_dirty_bitmap_record(dirty, iova, pgsize);
>> +		iova += pgsize;
>> +	} while (iova < end);
>> +	rcu_read_unlock();
>> +
>> +	return 0;
>> +}
>
> Best regards,
> baolu
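For context on iommu_dirty_bitmap_record() making its own checks, the
core helper in this series has roughly the following shape (a sketch;
the exact definition lives in the IOMMUFD dirty tracking patches): it
only touches the IOVA bitmap when one is attached, and queues the range
into iotlb_gather so the caller of the iommu op flushes the IOTLB
afterwards.

	static inline void iommu_dirty_bitmap_record(struct iommu_dirty_bitmap *dirty,
						     unsigned long iova,
						     unsigned long length)
	{
		if (dirty->bitmap)
			iova_bitmap_set(dirty->bitmap, iova, length);

		if (dirty->gather)
			iommu_iotlb_gather_add_range(dirty->gather, iova, length);
	}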
On 16/10/2023 03:21, Baolu Lu wrote:
> On 9/23/23 9:25 AM, Joao Martins wrote:
> [...]
>> +/*
>> + * Set up dirty tracking on a second only translation type.
>
> Set up dirty tracking on a second-only or nested translation type.
>
>> + */
>> +int intel_pasid_setup_dirty_tracking(struct intel_iommu *iommu,
>> +				     struct dmar_domain *domain,
>> +				     struct device *dev, u32 pasid,
>> +				     bool enabled)
>> +{
>> +	struct pasid_entry *pte;
>> +	u16 did, pgtt;
>> +
>> +	spin_lock(&iommu->lock);
>> +
>> +	did = domain_id_iommu(domain, iommu);
>> +	pte = intel_pasid_get_entry(dev, pasid);
>> +	if (!pte) {
>> +		spin_unlock(&iommu->lock);
>> +		dev_err(dev, "Failed to get pasid entry of PASID %d\n", pasid);
>
> Use dev_err_ratelimited() to avoid a user DOS attack.

OK

>> +		return -ENODEV;
>> +	}
>
> Can we add a check to limit this interface to second-only and nested
> translation types? These are the only valid use cases currently and for
> the foreseeable future.

OK.

> And, return directly if the pasid bit matches the target state.

OK

> [...]
> 	spin_lock(&iommu->lock);
> 	pte = intel_pasid_get_entry(dev, pasid);
> 	if (!pte) {
> 		spin_unlock(&iommu->lock);
> 		dev_err_ratelimited(dev,
> 				    "Failed to get pasid entry of PASID %d\n",
> 				    pasid);
> 		return -ENODEV;
> 	}
>
> 	did = domain_id_iommu(domain, iommu);
> 	pgtt = pasid_pte_get_pgtt(pte);
>
> 	if (pgtt != PASID_ENTRY_PGTT_SL_ONLY && pgtt != PASID_ENTRY_PGTT_NESTED) {
> 		spin_unlock(&iommu->lock);
> 		dev_err_ratelimited(dev,
> 				    "Dirty tracking not supported on translation type %d\n",
> 				    pgtt);
> 		return -EOPNOTSUPP;
> 	}
>
> 	if (pasid_get_ssade(pte) == enabled) {
> 		spin_unlock(&iommu->lock);
> 		return 0;
> 	}
>
> 	if (enabled)
> 		pasid_set_ssade(pte);
> 	else
> 		pasid_clear_ssade(pte);
> 	spin_unlock(&iommu->lock);
> [...]

OK

>> +
>> +	pgtt = pasid_pte_get_pgtt(pte);
>> +
>> +	if (enabled)
>> +		pasid_set_ssade(pte);
>> +	else
>> +		pasid_clear_ssade(pte);
>> +	spin_unlock(&iommu->lock);
>
> Add below here:
>
> 	if (!ecap_coherent(iommu->ecap))
> 		clflush_cache_range(pte, sizeof(*pte));

Got it

>> +
>> +	/*
>> +	 * From VT-d spec table 25 "Guidance to Software for Invalidations":
>> +	 *
>> +	 * - PASID-selective-within-Domain PASID-cache invalidation
>> +	 *   If (PGTT=SS or Nested)
>> +	 *    - Domain-selective IOTLB invalidation
>> +	 *   Else
>> +	 *    - PASID-selective PASID-based IOTLB invalidation
>> +	 * - If (pasid is RID_PASID)
>> +	 *    - Global Device-TLB invalidation to affected functions
>> +	 *   Else
>> +	 *    - PASID-based Device-TLB invalidation (with S=1 and
>> +	 *      Addr[63:12]=0x7FFFFFFF_FFFFF) to affected functions
>> +	 */
>> +	pasid_cache_invalidation_with_pasid(iommu, did, pasid);
>> +
>> +	if (pgtt == PASID_ENTRY_PGTT_SL_ONLY || pgtt == PASID_ENTRY_PGTT_NESTED)
>> +		iommu->flush.flush_iotlb(iommu, did, 0, 0, DMA_TLB_DSI_FLUSH);
>> +	else
>> +		qi_flush_piotlb(iommu, did, pasid, 0, -1, 0);
>
> Only "Domain-selective IOTLB invalidation" is needed here.

Will delete the qi_flush_piotlb() then.

>> +
>> +	/* Device IOTLB doesn't need to be flushed in caching mode. */
>> +	if (!cap_caching_mode(iommu->cap))
>> +		devtlb_invalidation_with_pasid(iommu, dev, pasid);
>
> For the device IOTLB invalidation, we need to follow what the spec
> requires.
> 	If (pasid is RID_PASID)
> 	 - Global Device-TLB invalidation to affected functions
> 	Else
> 	 - PASID-based Device-TLB invalidation (with S=1 and
> 	   Addr[63:12]=0x7FFFFFFF_FFFFF) to affected functions

devtlb_invalidation_with_pasid() underneath does:

	if (pasid == PASID_RID2PASID)
		qi_flush_dev_iotlb(iommu, sid, pfsid, qdep, 0, 64 - VTD_PAGE_SHIFT);
	else
		qi_flush_dev_iotlb_pasid(iommu, sid, pfsid, pasid, qdep, 0,
					 64 - VTD_PAGE_SHIFT);

... which is what the spec suggests (IIUC). Should I read your comment
above as to drop the cap_caching_mode(iommu->cap) check?
On Mon, Oct 16, 2023 at 11:57:34AM +0100, Joao Martins wrote:
> True. But to be honest, I thought we weren't quite there yet in PASID
> support from the IOMMUFD perspective; hence why I didn't aim at it. Or do
> I have the wrong impression? From the code below, it clearly looks like
> the driver does.

I think we should plan that this series will go before the PASID
series.

Jason
On 2023/10/16 18:42, Joao Martins wrote:
>>> +		domain->dirty_ops = &intel_dirty_ops;
>>>  	return domain;
>>>  }
>> The VT-d driver always uses second level for a user domain translation.
>> In order to avoid checks of "domain->use_first_level" in the callbacks,
>> how about checking it here and returning failure if first level is used
>> for a user domain?
>>
> I was told by Yi Y Sun offlist to have the first_level checked, because
> the dirty bit in the first stage page table is always enabled (and cannot
> be toggled on/off). I can remove it again; the initial RFC didn't have it
> as it was failing in a similar way to how you suggest here. Not sure how
> to proceed?

Yi was right. But we currently have no use case for dirty tracking in
the first-level page table. So let's start from only supporting it in
the second-level page table.

If we later identify a use case for dirty tracking in the first-level
page table, we can then add the code with appropriate testing efforts.

Best regards,
baolu
On 2023/10/16 19:42, Jason Gunthorpe wrote:
> On Mon, Oct 16, 2023 at 11:57:34AM +0100, Joao Martins wrote:
>
>> True. But to be honest, I thought we weren't quite there yet in PASID
>> support from the IOMMUFD perspective; hence why I didn't aim at it. Or do
>> I have the wrong impression? From the code below, it clearly looks like
>> the driver does.
> I think we should plan that this series will go before the PASID
> series

I know that PASID support in IOMMUFD is not yet available, but the VT-d
driver already supports attaching a domain to a PASID, as required by
the idxd driver for kernel DMA with PASID. Therefore, from the driver's
perspective, dirty tracking should also be enabled for PASIDs.

However, I am also fine with deferring this code until PASID support is
added to IOMMUFD. In that case, it's better to add a comment to remind
people to revisit this issue later.

Best regards,
baolu
On Mon, Oct 16, 2023 at 08:58:42PM +0800, Baolu Lu wrote:
> On 2023/10/16 19:42, Jason Gunthorpe wrote:
>> On Mon, Oct 16, 2023 at 11:57:34AM +0100, Joao Martins wrote:
>>
>>> True. But to be honest, I thought we weren't quite there yet in PASID
>>> support from the IOMMUFD perspective; hence why I didn't aim at it. Or do
>>> I have the wrong impression? From the code below, it clearly looks like
>>> the driver does.
>> I think we should plan that this series will go before the PASID
>> series
>
> I know that PASID support in IOMMUFD is not yet available, but the VT-d
> driver already supports attaching a domain to a PASID, as required by
> the idxd driver for kernel DMA with PASID. Therefore, from the driver's
> perspective, dirty tracking should also be enabled for PASIDs.

As long as the driver refuses to attach a dirty tracking enabled domain
to a PASID it would be fine for now.

Jason
On 2023/10/16 20:59, Jason Gunthorpe wrote:
> On Mon, Oct 16, 2023 at 08:58:42PM +0800, Baolu Lu wrote:
>> On 2023/10/16 19:42, Jason Gunthorpe wrote:
>>> On Mon, Oct 16, 2023 at 11:57:34AM +0100, Joao Martins wrote:
>>>
>>>> True. But to be honest, I thought we weren't quite there yet in PASID
>>>> support from the IOMMUFD perspective; hence why I didn't aim at it. Or do
>>>> I have the wrong impression? From the code below, it clearly looks like
>>>> the driver does.
>>> I think we should plan that this series will go before the PASID
>>> series
>> I know that PASID support in IOMMUFD is not yet available, but the VT-d
>> driver already supports attaching a domain to a PASID, as required by
>> the idxd driver for kernel DMA with PASID. Therefore, from the driver's
>> perspective, dirty tracking should also be enabled for PASIDs.
> As long as the driver refuses to attach a dirty tracking enabled domain
> to a PASID it would be fine for now.

Yes. This works.

Best regards,
baolu
On 2023/10/16 19:39, Joao Martins wrote:
>>> +	/* Device IOTLB doesn't need to be flushed in caching mode. */
>>> +	if (!cap_caching_mode(iommu->cap))
>>> +		devtlb_invalidation_with_pasid(iommu, dev, pasid);
>> For the device IOTLB invalidation, we need to follow what the spec
>> requires.
>>
>> 	If (pasid is RID_PASID)
>> 	 - Global Device-TLB invalidation to affected functions
>> 	Else
>> 	 - PASID-based Device-TLB invalidation (with S=1 and
>> 	   Addr[63:12]=0x7FFFFFFF_FFFFF) to affected functions
>>
> devtlb_invalidation_with_pasid() underneath does:
>
> 	if (pasid == PASID_RID2PASID)
> 		qi_flush_dev_iotlb(iommu, sid, pfsid, qdep, 0, 64 - VTD_PAGE_SHIFT);
> 	else
> 		qi_flush_dev_iotlb_pasid(iommu, sid, pfsid, pasid, qdep, 0,
> 					 64 - VTD_PAGE_SHIFT);
>
> ... which is what the spec suggests (IIUC).

Ah! I overlooked this. Sorry about it.

> Should I read your comment above as to drop the
> cap_caching_mode(iommu->cap) check?

No. Your code is fine.

Best regards,
baolu
On 16/10/2023 12:26, Joao Martins wrote:
> On 16/10/2023 03:07, Baolu Lu wrote:
>> On 9/23/23 9:25 AM, Joao Martins wrote:
>>> +
>>> +	if (!ad_enabled && dirty->bitmap)
>>> +		return -EINVAL;
>>
>> I don't understand the above check of "dirty->bitmap". Isn't it always
>> invalid to call this if dirty tracking is not enabled on the domain?
>>
> It is spurious (...)

I take this back;

>> The iommu_dirty_bitmap is defined in the iommu core. The iommu driver
>> has no need to understand it and check its members anyway.
>
> (...) The iommu driver has no need to understand it.
> iommu_dirty_bitmap_record() already makes those checks in case there's
> no iova_bitmap to set bits to.

This is all true, but the reason I am checking iommu_dirty_bitmap::bitmap
is to essentially not record anything in the iova bitmap and just clear
the dirty bits from the IOPTEs, all when dirty tracking is technically
disabled. This is done internally only when starting dirty tracking, to
ensure that we clean up all dirty bits before we enable dirty tracking
and thus have a consistent snapshot, as opposed to inheriting dirties
from the past.

Some alternative ways to do this:

1) via the iommu_dirty_bitmap structure, where I add one field which, if
true, lets the iommufd core call into the iommu driver in a "clear IOPTE"
manner; or

2) via the ::flags ... the thing is that the ::flags values are UAPI, so
it feels weird to use these flags for internal purposes.

	Joao
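A minimal sketch of the start-of-tracking sequence being described,
assuming the iommu_dirty_bitmap_init() helper and the dirty_ops layout
from this series (the actual iommufd core code may differ):

	static int start_dirty_tracking_sketch(struct iommu_domain *domain,
					       unsigned long iova, size_t size)
	{
		struct iommu_dirty_bitmap dirty;
		int ret;

		/* No iova_bitmap and no gather: a clear-only walk of the IOPTEs */
		iommu_dirty_bitmap_init(&dirty, NULL, NULL);

		ret = domain->dirty_ops->read_and_clear_dirty(domain, iova, size,
							      0, &dirty);
		if (ret)
			return ret;

		/* Enable with no stale dirty bits left behind */
		return domain->dirty_ops->set_dirty_tracking(domain, true);
	}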
On 10/17/23 12:00 AM, Joao Martins wrote:
>>> The iommu_dirty_bitmap is defined in the iommu core. The iommu driver
>>> has no need to understand it and check its members anyway.
>>>
>> (...) The iommu driver has no need to understand it.
>> iommu_dirty_bitmap_record() already makes those checks in case there's
>> no iova_bitmap to set bits to.
>>
> This is all true, but the reason I am checking iommu_dirty_bitmap::bitmap
> is to essentially not record anything in the iova bitmap and just clear
> the dirty bits from the IOPTEs, all when dirty tracking is technically
> disabled. This is done internally only when starting dirty tracking, to
> ensure that we clean up all dirty bits before we enable dirty tracking
> and thus have a consistent snapshot, as opposed to inheriting dirties
> from the past.

It's okay since it serves a functional purpose. Can you please add some
comments around the code to explain the rationale?

Best regards,
baolu
On 16/10/2023 14:01, Baolu Lu wrote:
> On 2023/10/16 20:59, Jason Gunthorpe wrote:
>> On Mon, Oct 16, 2023 at 08:58:42PM +0800, Baolu Lu wrote:
>>> On 2023/10/16 19:42, Jason Gunthorpe wrote:
>>>> On Mon, Oct 16, 2023 at 11:57:34AM +0100, Joao Martins wrote:
>>>>
>>>>> True. But to be honest, I thought we weren't quite there yet in PASID
>>>>> support from the IOMMUFD perspective; hence why I didn't aim at it. Or do
>>>>> I have the wrong impression? From the code below, it clearly looks like
>>>>> the driver does.
>>>> I think we should plan that this series will go before the PASID
>>>> series
>>> I know that PASID support in IOMMUFD is not yet available, but the VT-d
>>> driver already supports attaching a domain to a PASID, as required by
>>> the idxd driver for kernel DMA with PASID. Therefore, from the driver's
>>> perspective, dirty tracking should also be enabled for PASIDs.
>> As long as the driver refuses to attach a dirty tracking enabled domain
>> to a PASID it would be fine for now.
>
> Yes. This works.

Baolu Lu, I am blocking PASID attachment this way; let me know if this
matches how you would have the driver refuse dirty tracking on a PASID.

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 66b0e1d5a98c..d33df02f0f2d 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -4123,7 +4123,7 @@ static void intel_iommu_domain_free(struct iommu_domain *domain)
 }

 static int prepare_domain_attach_device(struct iommu_domain *domain,
-					struct device *dev)
+					struct device *dev, ioasid_t pasid)
 {
 	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
 	struct intel_iommu *iommu;
@@ -4136,7 +4136,8 @@ static int prepare_domain_attach_device(struct iommu_domain *domain,
 	if (dmar_domain->force_snooping && !ecap_sc_support(iommu->ecap))
 		return -EINVAL;

-	if (domain->dirty_ops && !slads_supported(iommu))
+	if (domain->dirty_ops &&
+	    (!slads_supported(iommu) || pasid != IOMMU_NO_PASID))
 		return -EINVAL;

 	/* check if this iommu agaw is sufficient for max mapped address */
@@ -4174,7 +4175,7 @@ static int intel_iommu_attach_device(struct iommu_domain *domain,
 	if (info->domain)
 		device_block_translation(dev);

-	ret = prepare_domain_attach_device(domain, dev);
+	ret = prepare_domain_attach_device(domain, dev, IOMMU_NO_PASID);
 	if (ret)
 		return ret;

@@ -4795,7 +4796,7 @@ static int intel_iommu_set_dev_pasid(struct iommu_domain *domain,
 	if (context_copied(iommu, info->bus, info->devfn))
 		return -EBUSY;

-	ret = prepare_domain_attach_device(domain, dev);
+	ret = prepare_domain_attach_device(domain, dev, pasid);
 	if (ret)
 		return ret;
On 17/10/2023 03:08, Baolu Lu wrote:
> On 10/17/23 12:00 AM, Joao Martins wrote:
>>>> The iommu_dirty_bitmap is defined in the iommu core. The iommu driver
>>>> has no need to understand it and check its members anyway.
>>>>
>>> (...) The iommu driver has no need to understand it.
>>> iommu_dirty_bitmap_record() already makes those checks in case there's
>>> no iova_bitmap to set bits to.
>>>
>> This is all true, but the reason I am checking iommu_dirty_bitmap::bitmap
>> is to essentially not record anything in the iova bitmap and just clear
>> the dirty bits from the IOPTEs, all when dirty tracking is technically
>> disabled. This is done internally only when starting dirty tracking, to
>> ensure that we clean up all dirty bits before we enable dirty tracking
>> and thus have a consistent snapshot, as opposed to inheriting dirties
>> from the past.
>
> It's okay since it serves a functional purpose. Can you please add some
> comments around the code to explain the rationale?

I added this comment below:

+	/*
+	 * IOMMUFD core calls into a dirty tracking disabled domain without an
+	 * IOVA bitmap set in order to clean dirty bits in all PTEs that might
+	 * have occurred when we stopped dirty tracking. This ensures that we
+	 * never inherit dirtied bits from a previous cycle.
+	 */

Also fixed an issue where I could theoretically clear the bit with
IOMMU_DIRTY_NO_CLEAR set. Essentially I pass the read_and_clear_dirty
flags and let dma_sl_pte_test_and_clear_dirty() decide between test and
test-and-clear, similar to AMD:

@@ -781,6 +788,16 @@ static inline bool dma_pte_present(struct dma_pte *pte)
 	return (pte->val & 3) != 0;
 }

+static inline bool dma_sl_pte_test_and_clear_dirty(struct dma_pte *pte,
+						   unsigned long flags)
+{
+	if (flags & IOMMU_DIRTY_NO_CLEAR)
+		return (pte->val & DMA_SL_PTE_DIRTY) != 0;
+
+	return test_and_clear_bit(DMA_SL_PTE_DIRTY_BIT,
+				  (unsigned long *)&pte->val);
+}
+

Anyhow, see below the full diff compared to this patch. Some things in
the tree differ from what was submitted in this patch.
diff --git a/drivers/iommu/intel/Kconfig b/drivers/iommu/intel/Kconfig
index 2e56bd79f589..dedb8ae3cba8 100644
--- a/drivers/iommu/intel/Kconfig
+++ b/drivers/iommu/intel/Kconfig
@@ -15,6 +15,7 @@ config INTEL_IOMMU
 	select DMA_OPS
 	select IOMMU_API
 	select IOMMU_IOVA
+	select IOMMUFD_DRIVER
 	select NEED_DMA_MAP_STATE
 	select DMAR_TABLE
 	select SWIOTLB
diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index fabfe363f1f9..0e3f532f3bca 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -4075,11 +4075,6 @@ static struct iommu_domain *intel_iommu_domain_alloc(unsigned type)
 	return NULL;
 }

-static bool intel_iommu_slads_supported(struct intel_iommu *iommu)
-{
-	return sm_supported(iommu) && ecap_slads(iommu->ecap);
-}
-
 static struct iommu_domain *
 intel_iommu_domain_alloc_user(struct device *dev, u32 flags)
 {
@@ -4087,6 +4082,10 @@ intel_iommu_domain_alloc_user(struct device *dev, u32 flags)
 	struct iommu_domain *domain;
 	struct intel_iommu *iommu;

+	if (flags & (~(IOMMU_HWPT_ALLOC_NEST_PARENT|
+		       IOMMU_HWPT_ALLOC_ENFORCE_DIRTY)))
+		return ERR_PTR(-EOPNOTSUPP);
+
 	iommu = device_to_iommu(dev, NULL, NULL);
 	if (!iommu)
 		return ERR_PTR(-ENODEV);
@@ -4094,15 +4093,26 @@ intel_iommu_domain_alloc_user(struct device *dev, u32 flags)
 	if ((flags & IOMMU_HWPT_ALLOC_NEST_PARENT) && !ecap_nest(iommu->ecap))
 		return ERR_PTR(-EOPNOTSUPP);

-	if (enforce_dirty &&
-	    !intel_iommu_slads_supported(iommu))
+	if (enforce_dirty && !slads_supported(iommu))
 		return ERR_PTR(-EOPNOTSUPP);

+	/*
+	 * domain_alloc_user op needs to fully initialize a domain
+	 * before return, so uses iommu_domain_alloc() here for
+	 * simple.
+	 */
 	domain = iommu_domain_alloc(dev->bus);
 	if (!domain)
 		domain = ERR_PTR(-ENOMEM);
-	if (domain && enforce_dirty)
+
+	if (!IS_ERR(domain) && enforce_dirty) {
+		if (to_dmar_domain(domain)->use_first_level) {
+			iommu_domain_free(domain);
+			return ERR_PTR(-EOPNOTSUPP);
+		}
 		domain->dirty_ops = &intel_dirty_ops;
+	}
+
 	return domain;
 }

@@ -4113,7 +4123,7 @@ static void intel_iommu_domain_free(struct iommu_domain *domain)
 }

 static int prepare_domain_attach_device(struct iommu_domain *domain,
-					struct device *dev)
+					struct device *dev, ioasid_t pasid)
 {
 	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
 	struct intel_iommu *iommu;
@@ -4126,7 +4136,8 @@ static int prepare_domain_attach_device(struct iommu_domain *domain,
 	if (dmar_domain->force_snooping && !ecap_sc_support(iommu->ecap))
 		return -EINVAL;

-	if (domain->dirty_ops && !intel_iommu_slads_supported(iommu))
+	if (domain->dirty_ops &&
+	    (!slads_supported(iommu) || pasid != IOMMU_NO_PASID))
 		return -EINVAL;

 	/* check if this iommu agaw is sufficient for max mapped address */
@@ -4164,7 +4175,7 @@ static int intel_iommu_attach_device(struct iommu_domain *domain,
 	if (info->domain)
 		device_block_translation(dev);

-	ret = prepare_domain_attach_device(domain, dev);
+	ret = prepare_domain_attach_device(domain, dev, IOMMU_NO_PASID);
 	if (ret)
 		return ret;

@@ -4384,7 +4395,7 @@ static bool intel_iommu_capable(struct device *dev, enum iommu_cap cap)
 	case IOMMU_CAP_ENFORCE_CACHE_COHERENCY:
 		return ecap_sc_support(info->iommu->ecap);
 	case IOMMU_CAP_DIRTY:
-		return intel_iommu_slads_supported(info->iommu);
+		return slads_supported(info->iommu);
 	default:
 		return false;
 	}

@@ -4785,7 +4796,7 @@ static int intel_iommu_set_dev_pasid(struct iommu_domain *domain,
 	if (context_copied(iommu, info->bus, info->devfn))
 		return -EBUSY;

-	ret = prepare_domain_attach_device(domain, dev);
+	ret = prepare_domain_attach_device(domain, dev, pasid);
 	if (ret)
 		return ret;

@@ -4848,31 +4859,31 @@ static int intel_iommu_set_dirty_tracking(struct iommu_domain *domain,
 	int ret = -EINVAL;

 	spin_lock(&dmar_domain->lock);
-	if (!(dmar_domain->dirty_tracking ^ enable) ||
-	    list_empty(&dmar_domain->devices)) {
-		spin_unlock(&dmar_domain->lock);
-		return 0;
-	}
+	if (dmar_domain->dirty_tracking == enable)
+		goto out_unlock;

 	list_for_each_entry(info, &dmar_domain->devices, link) {
-		/* First-level page table always enables dirty bit*/
-		if (dmar_domain->use_first_level) {
-			ret = 0;
-			break;
-		}
-
 		ret = intel_pasid_setup_dirty_tracking(info->iommu, info->domain,
 						       info->dev, IOMMU_NO_PASID,
 						       enable);
 		if (ret)
-			break;
+			goto err_unwind;
 	}

 	if (!ret)
 		dmar_domain->dirty_tracking = enable;
+out_unlock:
 	spin_unlock(&dmar_domain->lock);

+	return 0;
+
+err_unwind:
+	list_for_each_entry(info, &dmar_domain->devices, link)
+		intel_pasid_setup_dirty_tracking(info->iommu, dmar_domain,
+						 info->dev, IOMMU_NO_PASID,
+						 dmar_domain->dirty_tracking);
+	spin_unlock(&dmar_domain->lock);
 	return ret;
 }

@@ -4886,14 +4897,16 @@ static int intel_iommu_read_and_clear_dirty(struct iommu_domain *domain,
 	unsigned long pgsize;
 	bool ad_enabled;

-	spin_lock(&dmar_domain->lock);
+	/*
+	 * IOMMUFD core calls into a dirty tracking disabled domain without an
+	 * IOVA bitmap set in order to clean dirty bits in all PTEs that might
+	 * have occurred when we stopped dirty tracking. This ensures that we
+	 * never inherit dirtied bits from a previous cycle.
+	 */
 	ad_enabled = dmar_domain->dirty_tracking;
-	spin_unlock(&dmar_domain->lock);
-
 	if (!ad_enabled && dirty->bitmap)
 		return -EINVAL;

-	rcu_read_lock();
 	do {
 		struct dma_pte *pte;
 		int lvl = 0;
@@ -4906,14 +4919,10 @@ static int intel_iommu_read_and_clear_dirty(struct iommu_domain *domain,
 			continue;
 		}

-		/* It is writable, set the bitmap */
-		if (((flags & IOMMU_DIRTY_NO_CLEAR) &&
-		     dma_sl_pte_dirty(pte)) ||
-		    dma_sl_pte_test_and_clear_dirty(pte))
+		if (dma_sl_pte_test_and_clear_dirty(pte, flags))
 			iommu_dirty_bitmap_record(dirty, iova, pgsize);
 		iova += pgsize;
 	} while (iova < end);
-	rcu_read_unlock();

 	return 0;
 }
diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h
index bccd44db3316..0b390d9e669b 100644
--- a/drivers/iommu/intel/iommu.h
+++ b/drivers/iommu/intel/iommu.h
@@ -542,6 +542,9 @@ enum {
 #define sm_supported(iommu)	(intel_iommu_sm && ecap_smts((iommu)->ecap))
 #define pasid_supported(iommu)	(sm_supported(iommu) && \
 				 ecap_pasid((iommu)->ecap))
+#define slads_supported(iommu)	(sm_supported(iommu) && \
+				 ecap_slads((iommu)->ecap))
+

 struct pasid_entry;
 struct pasid_state_entry;
@@ -785,13 +788,12 @@ static inline bool dma_pte_present(struct dma_pte *pte)
 	return (pte->val & 3) != 0;
 }

-static inline bool dma_sl_pte_dirty(struct dma_pte *pte)
+static inline bool dma_sl_pte_test_and_clear_dirty(struct dma_pte *pte,
+						   unsigned long flags)
 {
-	return (pte->val & DMA_SL_PTE_DIRTY) != 0;
-}
+	if (flags & IOMMU_DIRTY_NO_CLEAR)
+		return (pte->val & DMA_SL_PTE_DIRTY) != 0;

-static inline bool dma_sl_pte_test_and_clear_dirty(struct dma_pte *pte)
-{
 	return test_and_clear_bit(DMA_SL_PTE_DIRTY_BIT,
 				  (unsigned long *)&pte->val);
 }
diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
index 03814942d59c..785384a59d55 100644
--- a/drivers/iommu/intel/pasid.c
+++ b/drivers/iommu/intel/pasid.c
@@ -686,15 +686,29 @@ int intel_pasid_setup_dirty_tracking(struct intel_iommu *iommu,

 	spin_lock(&iommu->lock);

-	did = domain_id_iommu(domain, iommu);
 	pte = intel_pasid_get_entry(dev, pasid);
 	if (!pte) {
 		spin_unlock(&iommu->lock);
-		dev_err(dev, "Failed to get pasid entry of PASID %d\n", pasid);
+		dev_err_ratelimited(dev,
+				    "Failed to get pasid entry of PASID %d\n",
+				    pasid);
 		return -ENODEV;
 	}

+	did = domain_id_iommu(domain, iommu);
 	pgtt = pasid_pte_get_pgtt(pte);
+	if (pgtt != PASID_ENTRY_PGTT_SL_ONLY && pgtt != PASID_ENTRY_PGTT_NESTED) {
+		spin_unlock(&iommu->lock);
+		dev_err_ratelimited(dev,
+				    "Dirty tracking not supported on translation type %d\n",
+				    pgtt);
+		return -EOPNOTSUPP;
+	}
+
+	if (pasid_get_ssade(pte) == enabled) {
+		spin_unlock(&iommu->lock);
+		return 0;
+	}

 	if (enabled)
 		pasid_set_ssade(pte);
@@ -702,6 +716,9 @@ int intel_pasid_setup_dirty_tracking(struct intel_iommu *iommu,
 		pasid_clear_ssade(pte);
 	spin_unlock(&iommu->lock);

+	if (!ecap_coherent(iommu->ecap))
+		clflush_cache_range(pte, sizeof(*pte));
+
 	/*
 	 * From VT-d spec table 25 "Guidance to Software for Invalidations":
 	 *
@@ -720,8 +737,6 @@ int intel_pasid_setup_dirty_tracking(struct intel_iommu *iommu,

 	if (pgtt == PASID_ENTRY_PGTT_SL_ONLY || pgtt == PASID_ENTRY_PGTT_NESTED)
 		iommu->flush.flush_iotlb(iommu, did, 0, 0, DMA_TLB_DSI_FLUSH);
-	else
-		qi_flush_piotlb(iommu, did, pasid, 0, -1, 0);

 	/* Device IOTLB doesn't need to be flushed in caching mode. */
 	if (!cap_caching_mode(iommu->cap))
On 2023/10/17 18:51, Joao Martins wrote:
> On 16/10/2023 14:01, Baolu Lu wrote:
>> On 2023/10/16 20:59, Jason Gunthorpe wrote:
>>> On Mon, Oct 16, 2023 at 08:58:42PM +0800, Baolu Lu wrote:
>>>> On 2023/10/16 19:42, Jason Gunthorpe wrote:
>>>>> On Mon, Oct 16, 2023 at 11:57:34AM +0100, Joao Martins wrote:
>>>>>
>>>>>> True. But to be honest, I thought we weren't quite there yet in PASID
>>>>>> support from the IOMMUFD perspective; hence why I didn't aim at it. Or do
>>>>>> I have the wrong impression? From the code below, it clearly looks like
>>>>>> the driver does.
>>>>> I think we should plan that this series will go before the PASID
>>>>> series
>>>> I know that PASID support in IOMMUFD is not yet available, but the VT-d
>>>> driver already supports attaching a domain to a PASID, as required by
>>>> the idxd driver for kernel DMA with PASID. Therefore, from the driver's
>>>> perspective, dirty tracking should also be enabled for PASIDs.
>>> As long as the driver refuses to attach a dirty tracking enabled domain
>>> to a PASID it would be fine for now.
>> Yes. This works.
> Baolu Lu, I am blocking PASID attachment this way; let me know if this
> matches how you would have the driver refuse dirty tracking on a PASID.

Joao, how about blocking pasid attachment in intel_iommu_set_dev_pasid()
directly?

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index c86ba5a3e75c..392b6ca9ce90 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -4783,6 +4783,9 @@ static int intel_iommu_set_dev_pasid(struct iommu_domain *domain,
 	if (context_copied(iommu, info->bus, info->devfn))
 		return -EBUSY;

+	if (domain->dirty_ops)
+		return -EOPNOTSUPP;
+
 	ret = prepare_domain_attach_device(domain, dev);
 	if (ret)
 		return ret;

Best regards,
baolu
On 2023/10/17 19:22, Joao Martins wrote: > On 17/10/2023 03:08, Baolu Lu wrote: >> On 10/17/23 12:00 AM, Joao Martins wrote: >>>>> The iommu_dirty_bitmap is defined in iommu core. The iommu driver has no >>>>> need to understand it and check its member anyway. >>>>> >>>> (...) The iommu driver has no need to understand it. iommu_dirty_bitmap_record() >>>> already makes those checks in case there's no iova_bitmap to set bits to. >>>> >>> This is all true but the reason I am checking iommu_dirty_bitmap::bitmap is to >>> essentially not record anything in the iova bitmap and just clear the dirty bits >>> from the IOPTEs, all when dirty tracking is technically disabled. This is done >>> internally only when starting dirty tracking, and thus to ensure that we cleanup >>> all dirty bits before we enable dirty tracking to have a consistent snapshot as >>> opposed to inheriting dirties from the past. >> >> It's okay since it serves a functional purpose. Can you please add some >> comments around the code to explain the rationale. >> > > I added this comment below: > > + /* > + * IOMMUFD core calls into a dirty tracking disabled domain without an > + * IOVA bitmap set in order to clean dirty bits in all PTEs that might > + * have occured when we stopped dirty tracking. This ensures that we > + * never inherit dirtied bits from a previous cycle. > + */ > Yes. It's clear. Thank you! > Also fixed an issue where I could theoretically clear the bit with > IOMMU_NO_CLEAR. Essentially passed the read_and_clear_dirty flags and let > dma_sl_pte_test_and_clear_dirty() to test and test-and-clear, similar to AMD: > > @@ -781,6 +788,16 @@ static inline bool dma_pte_present(struct dma_pte *pte) > return (pte->val & 3) != 0; > } > > +static inline bool dma_sl_pte_test_and_clear_dirty(struct dma_pte *pte, > + unsigned long flags) > +{ > + if (flags & IOMMU_DIRTY_NO_CLEAR) > + return (pte->val & DMA_SL_PTE_DIRTY) != 0; > + > + return test_and_clear_bit(DMA_SL_PTE_DIRTY_BIT, > + (unsigned long *)&pte->val); > +} > + Yes. Sure. > Anyhow, see below the full diff compared to this patch. Some things are in tree > that is different to submitted from this patch. [...] > @@ -4113,7 +4123,7 @@ static void intel_iommu_domain_free(struct iommu_domain > *domain) > } > > static int prepare_domain_attach_device(struct iommu_domain *domain, > - struct device *dev) > + struct device *dev, ioasid_t pasid) How about blocking pasid attaching in intel_iommu_set_dev_pasid(). > { > struct dmar_domain *dmar_domain = to_dmar_domain(domain); > struct intel_iommu *iommu; > @@ -4126,7 +4136,8 @@ static int prepare_domain_attach_device(struct > iommu_domain *domain, > if (dmar_domain->force_snooping && !ecap_sc_support(iommu->ecap)) > return -EINVAL; > > - if (domain->dirty_ops && !intel_iommu_slads_supported(iommu)) > + if (domain->dirty_ops && > + (!slads_supported(iommu) || pasid != IOMMU_NO_PASID)) > return -EINVAL; > > /* check if this iommu agaw is sufficient for max mapped address */ [...] > > @@ -4886,14 +4897,16 @@ static int intel_iommu_read_and_clear_dirty(struct > iommu_domain *domain, > unsigned long pgsize; > bool ad_enabled; > > - spin_lock(&dmar_domain->lock); > + /* > + * IOMMUFD core calls into a dirty tracking disabled domain without an > + * IOVA bitmap set in order to clean dirty bits in all PTEs that might > + * have occured when we stopped dirty tracking. This ensures that we > + * never inherit dirtied bits from a previous cycle. 
> + */ > ad_enabled = dmar_domain->dirty_tracking; > - spin_unlock(&dmar_domain->lock); > - > if (!ad_enabled && dirty->bitmap) How about if (!dmar_domain->dirty_tracking && dirty->bitmap) return -EINVAL; ? > return -EINVAL; > > - rcu_read_lock(); > do { > struct dma_pte *pte; > int lvl = 0; > @@ -4906,14 +4919,10 @@ static int intel_iommu_read_and_clear_dirty(struct > iommu_domain *domain, > continue; > } > > - /* It is writable, set the bitmap */ > - if (((flags & IOMMU_DIRTY_NO_CLEAR) && > - dma_sl_pte_dirty(pte)) || > - dma_sl_pte_test_and_clear_dirty(pte)) > + if (dma_sl_pte_test_and_clear_dirty(pte, flags)) > iommu_dirty_bitmap_record(dirty, iova, pgsize); > iova += pgsize; > } while (iova < end); > - rcu_read_unlock(); > > return 0; > } > diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h > index bccd44db3316..0b390d9e669b 100644 > --- a/drivers/iommu/intel/iommu.h > +++ b/drivers/iommu/intel/iommu.h > @@ -542,6 +542,9 @@ enum { > #define sm_supported(iommu) (intel_iommu_sm && ecap_smts((iommu)->ecap)) > #define pasid_supported(iommu) (sm_supported(iommu) && \ > ecap_pasid((iommu)->ecap)) > +#define slads_supported(iommu) (sm_supported(iommu) && \ > + ecap_slads((iommu)->ecap)) > + > > struct pasid_entry; > struct pasid_state_entry; > @@ -785,13 +788,12 @@ static inline bool dma_pte_present(struct dma_pte *pte) > return (pte->val & 3) != 0; > } > > -static inline bool dma_sl_pte_dirty(struct dma_pte *pte) > +static inline bool dma_sl_pte_test_and_clear_dirty(struct dma_pte *pte, > + unsigned long flags) > { > - return (pte->val & DMA_SL_PTE_DIRTY) != 0; > -} > + if (flags & IOMMU_DIRTY_NO_CLEAR) > + return (pte->val & DMA_SL_PTE_DIRTY) != 0; > > -static inline bool dma_sl_pte_test_and_clear_dirty(struct dma_pte *pte) > -{ > return test_and_clear_bit(DMA_SL_PTE_DIRTY_BIT, > (unsigned long *)&pte->val); > } > diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c > index 03814942d59c..785384a59d55 100644 > --- a/drivers/iommu/intel/pasid.c > +++ b/drivers/iommu/intel/pasid.c > @@ -686,15 +686,29 @@ int intel_pasid_setup_dirty_tracking(struct intel_iommu > *iommu, > > spin_lock(&iommu->lock); > > - did = domain_id_iommu(domain, iommu); > pte = intel_pasid_get_entry(dev, pasid); > if (!pte) { > spin_unlock(&iommu->lock); > - dev_err(dev, "Failed to get pasid entry of PASID %d\n", pasid); > + dev_err_ratelimited(dev, > + "Failed to get pasid entry of PASID %d\n", > + pasid); > return -ENODEV; > } > > + did = domain_id_iommu(domain, iommu); > pgtt = pasid_pte_get_pgtt(pte); > + if (pgtt != PASID_ENTRY_PGTT_SL_ONLY && pgtt != PASID_ENTRY_PGTT_NESTED) { > + spin_unlock(&iommu->lock); > + dev_err_ratelimited(dev, > + "Dirty tracking not supported on translation type %d\n", > + pgtt); > + return -EOPNOTSUPP; > + } > + > + if (pasid_get_ssade(pte) == enabled) { > + spin_unlock(&iommu->lock); > + return 0; > + } > > if (enabled) > pasid_set_ssade(pte); > @@ -702,6 +716,9 @@ int intel_pasid_setup_dirty_tracking(struct intel_iommu *iommu, > pasid_clear_ssade(pte); > spin_unlock(&iommu->lock); > > + if (!ecap_coherent(iommu->ecap)) > + clflush_cache_range(pte, sizeof(*pte)); > + > /* > * From VT-d spec table 25 "Guidance to Software for Invalidations": > * > @@ -720,8 +737,6 @@ int intel_pasid_setup_dirty_tracking(struct intel_iommu *iommu, > > if (pgtt == PASID_ENTRY_PGTT_SL_ONLY || pgtt == PASID_ENTRY_PGTT_NESTED) > iommu->flush.flush_iotlb(iommu, did, 0, 0, DMA_TLB_DSI_FLUSH); > - else > - qi_flush_piotlb(iommu, did, pasid, 0, -1, 0); > > /* Device IOTLB 
doesn't need to be flushed in caching mode. */ > if (!cap_caching_mode(iommu->cap)) Others look good to me. Thanks a lot. Best regards, baolu
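One point the thread takes for granted but is worth making explicit: the clear path must be an atomic read-modify-write, because the IOMMU hardware can set the dirty bit in the same PTE concurrently with the CPU walking the table. A hedged illustration of the hazard, using the helpers from the diff:

	/* Broken: a plain RMW can lose a dirty bit HW sets in between */
	u64 val = READ_ONCE(pte->val);
	WRITE_ONCE(pte->val, val & ~DMA_SL_PTE_DIRTY);

	/* What the series does instead: atomic test-and-clear of bit 9 */
	if (test_and_clear_bit(DMA_SL_PTE_DIRTY_BIT,
			       (unsigned long *)&pte->val))
		iommu_dirty_bitmap_record(dirty, iova, pgsize);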
On Tue, Oct 17, 2023 at 12:22:34PM +0100, Joao Martins wrote: > On 17/10/2023 03:08, Baolu Lu wrote: > > On 10/17/23 12:00 AM, Joao Martins wrote: > >>>> The iommu_dirty_bitmap is defined in iommu core. The iommu driver has no > >>>> need to understand it and check its member anyway. > >>>> > >>> (...) The iommu driver has no need to understand it. iommu_dirty_bitmap_record() > >>> already makes those checks in case there's no iova_bitmap to set bits to. > >>> > >> This is all true but the reason I am checking iommu_dirty_bitmap::bitmap is to > >> essentially not record anything in the iova bitmap and just clear the dirty bits > >> from the IOPTEs, all when dirty tracking is technically disabled. This is done > >> internally only when starting dirty tracking, and thus to ensure that we cleanup > >> all dirty bits before we enable dirty tracking to have a consistent snapshot as > >> opposed to inheriting dirties from the past. > > > > It's okay since it serves a functional purpose. Can you please add some > > comments around the code to explain the rationale. > > > > I added this comment below: > > + /* > + * IOMMUFD core calls into a dirty tracking disabled domain without an > + * IOVA bitmap set in order to clean dirty bits in all PTEs that might > + * have occured when we stopped dirty tracking. This ensures that we > + * never inherit dirtied bits from a previous cycle. > + */ > > Also fixed an issue where I could theoretically clear the bit with > IOMMU_NO_CLEAR. Essentially passed the read_and_clear_dirty flags and let > dma_sl_pte_test_and_clear_dirty() to test and test-and-clear, similar to AMD: How does all this work, does this leak into the uapi? Why would we want to not clear the dirty bits upon enable/disable if dirty tracking? I can understand that the driver needs help from the caller due to the externalized locking, but do we leak this into the userspace? Jason
On 17/10/2023 14:10, Jason Gunthorpe wrote: > On Tue, Oct 17, 2023 at 12:22:34PM +0100, Joao Martins wrote: >> On 17/10/2023 03:08, Baolu Lu wrote: >>> On 10/17/23 12:00 AM, Joao Martins wrote: >>>>>> The iommu_dirty_bitmap is defined in iommu core. The iommu driver has no >>>>>> need to understand it and check its member anyway. >>>>>> >>>>> (...) The iommu driver has no need to understand it. iommu_dirty_bitmap_record() >>>>> already makes those checks in case there's no iova_bitmap to set bits to. >>>>> >>>> This is all true but the reason I am checking iommu_dirty_bitmap::bitmap is to >>>> essentially not record anything in the iova bitmap and just clear the dirty bits >>>> from the IOPTEs, all when dirty tracking is technically disabled. This is done >>>> internally only when starting dirty tracking, and thus to ensure that we cleanup >>>> all dirty bits before we enable dirty tracking to have a consistent snapshot as >>>> opposed to inheriting dirties from the past. >>> >>> It's okay since it serves a functional purpose. Can you please add some >>> comments around the code to explain the rationale. >>> >> >> I added this comment below: >> >> + /* >> + * IOMMUFD core calls into a dirty tracking disabled domain without an >> + * IOVA bitmap set in order to clean dirty bits in all PTEs that might >> + * have occured when we stopped dirty tracking. This ensures that we >> + * never inherit dirtied bits from a previous cycle. >> + */ >> >> Also fixed an issue where I could theoretically clear the bit with >> IOMMU_NO_CLEAR. Essentially passed the read_and_clear_dirty flags and let >> dma_sl_pte_test_and_clear_dirty() to test and test-and-clear, similar to AMD: > > How does all this work, does this leak into the uapi? UAPI is only ever expected to collect/clear dirty bits while dirty tracking is enabled. And it requires valid bitmaps before it gets to the IOMMU driver. The above where I pass no dirty::bitmap (but with an iotlb_gather) is internal usage only. Open to alternatives if this is prone to audit errors e.g. 1) via the iommu_dirty_bitmap structure, where I add one field which if true then iommufd core is able to call into iommu driver on a "clear IOPTE" manner or 2) via the ::flags ... the thing is that ::flags values is UAPI, so it feels weird to use these flags for internal purposes. With respect to IOMMU_NO_CLEAR that is UAPI (a flag in read-and-clear) where the user fetches bits, but does want to clear the hw IOPTE dirty bit (to avoid the TLB flush). > Why would we > want to not clear the dirty bits upon enable/disable if dirty > tracking? For the unmap-and-read-dirty case. Where you unmap, and want to get the dirty bits, but you don't care to clear them as you will be unmapping. But my comment above is not about that btw. It is just my broken check where either I test for dirty or test-and-clear. That's it. > I can understand that the driver needs help from the caller > due to the externalized locking, but do we leak this into the userspace? AFAICT no for the first. The IOMMU_NO_CLEAR is UAPI. I take it you were talking about the first.
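To make the "internal usage only" point concrete, here is a minimal sketch of what that IOMMUFD-side caller looks like, assuming it mirrors iopt_clear_dirty_data() and the core helpers from earlier patches in this series: the iommu_dirty_bitmap is initialized with a NULL iova_bitmap but a valid gather, so the driver clears IOPTE dirty bits without recording anything, and the accumulated range is flushed afterwards.

	struct iommu_iotlb_gather gather;
	struct iommu_dirty_bitmap dirty;
	int ret;

	iommu_iotlb_gather_init(&gather);
	/* NULL bitmap: clear the HW dirty bits, record nothing */
	iommu_dirty_bitmap_init(&dirty, NULL, &gather);

	ret = domain->dirty_ops->read_and_clear_dirty(domain, iova, size,
						      0, &dirty);
	iommu_iotlb_sync(domain, &gather);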
On 17/10/2023 13:41, Baolu Lu wrote: > On 2023/10/17 18:51, Joao Martins wrote: >> On 16/10/2023 14:01, Baolu Lu wrote: >>> On 2023/10/16 20:59, Jason Gunthorpe wrote: >>>> On Mon, Oct 16, 2023 at 08:58:42PM +0800, Baolu Lu wrote: >>>>> On 2023/10/16 19:42, Jason Gunthorpe wrote: >>>>>> On Mon, Oct 16, 2023 at 11:57:34AM +0100, Joao Martins wrote: >>>>>> >>>>>>> True. But to be honest, I thought we weren't quite there yet in PASID >>>>>>> support >>>>>>> from IOMMUFD perspective; hence why I didn't aim at it. Or do I have the >>>>>>> wrong >>>>>>> impression? From the code below, it clearly looks the driver does. >>>>>> I think we should plan that this series will go before the PASID >>>>>> series >>>>> I know that PASID support in IOMMUFD is not yet available, but the VT-d >>>>> driver already supports attaching a domain to a PASID, as required by >>>>> the idxd driver for kernel DMA with PASID. Therefore, from the driver's >>>>> perspective, dirty tracking should also be enabled for PASIDs. >>>> As long as the driver refuses to attach a dirty track enabled domain >>>> to PASID it would be fine for now. >>> Yes. This works. >> Baolu Lu, I am blocking PASID attachment this way; let me know if this matches >> how would you have the driver refuse dirty tracking on PASID. > > Joao, how about blocking pasid attachment in intel_iommu_set_dev_pasid() > directly? > > diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c > index c86ba5a3e75c..392b6ca9ce90 100644 > --- a/drivers/iommu/intel/iommu.c > +++ b/drivers/iommu/intel/iommu.c > @@ -4783,6 +4783,9 @@ static int intel_iommu_set_dev_pasid(struct iommu_domain > *domain, > if (context_copied(iommu, info->bus, info->devfn)) > return -EBUSY; > > + if (domain->dirty_ops) > + return -EOPNOTSUPP; > + > ret = prepare_domain_attach_device(domain, dev); > if (ret) > return ret; I was trying to centralize all the checks, but I can place it here if you prefer this way.
On 17/10/2023 13:49, Baolu Lu wrote: > On 2023/10/17 19:22, Joao Martins wrote: >> On 17/10/2023 03:08, Baolu Lu wrote: >>> On 10/17/23 12:00 AM, Joao Martins wrote: >>>>>> The iommu_dirty_bitmap is defined in iommu core. The iommu driver has no >>>>>> need to understand it and check its member anyway. >>>>>> >>>>> (...) The iommu driver has no need to understand it. >>>>> iommu_dirty_bitmap_record() >>>>> already makes those checks in case there's no iova_bitmap to set bits to. >>>>> >>>> This is all true but the reason I am checking iommu_dirty_bitmap::bitmap is to >>>> essentially not record anything in the iova bitmap and just clear the dirty >>>> bits >>>> from the IOPTEs, all when dirty tracking is technically disabled. This is done >>>> internally only when starting dirty tracking, and thus to ensure that we >>>> cleanup >>>> all dirty bits before we enable dirty tracking to have a consistent snapshot as >>>> opposed to inheriting dirties from the past. >>> >>> It's okay since it serves a functional purpose. Can you please add some >>> comments around the code to explain the rationale. >>> >> >> I added this comment below: >> >> + /* >> + * IOMMUFD core calls into a dirty tracking disabled domain without an >> + * IOVA bitmap set in order to clean dirty bits in all PTEs that might >> + * have occured when we stopped dirty tracking. This ensures that we >> + * never inherit dirtied bits from a previous cycle. >> + */ >> > > Yes. It's clear. Thank you! > >> Also fixed an issue where I could theoretically clear the bit with >> IOMMU_NO_CLEAR. Essentially passed the read_and_clear_dirty flags and let >> dma_sl_pte_test_and_clear_dirty() to test and test-and-clear, similar to AMD: >> >> @@ -781,6 +788,16 @@ static inline bool dma_pte_present(struct dma_pte *pte) >> return (pte->val & 3) != 0; >> } >> >> +static inline bool dma_sl_pte_test_and_clear_dirty(struct dma_pte *pte, >> + unsigned long flags) >> +{ >> + if (flags & IOMMU_DIRTY_NO_CLEAR) >> + return (pte->val & DMA_SL_PTE_DIRTY) != 0; >> + >> + return test_and_clear_bit(DMA_SL_PTE_DIRTY_BIT, >> + (unsigned long *)&pte->val); >> +} >> + > > Yes. Sure. > >> Anyhow, see below the full diff compared to this patch. Some things are in tree >> that is different to submitted from this patch. > > [...] > >> @@ -4113,7 +4123,7 @@ static void intel_iommu_domain_free(struct iommu_domain >> *domain) >> } >> >> static int prepare_domain_attach_device(struct iommu_domain *domain, >> - struct device *dev) >> + struct device *dev, ioasid_t pasid) > > How about blocking pasid attaching in intel_iommu_set_dev_pasid(). > OK >> { >> struct dmar_domain *dmar_domain = to_dmar_domain(domain); >> struct intel_iommu *iommu; >> @@ -4126,7 +4136,8 @@ static int prepare_domain_attach_device(struct >> iommu_domain *domain, >> if (dmar_domain->force_snooping && !ecap_sc_support(iommu->ecap)) >> return -EINVAL; >> >> - if (domain->dirty_ops && !intel_iommu_slads_supported(iommu)) >> + if (domain->dirty_ops && >> + (!slads_supported(iommu) || pasid != IOMMU_NO_PASID)) >> return -EINVAL; >> >> /* check if this iommu agaw is sufficient for max mapped address */ > > [...] > >> >> @@ -4886,14 +4897,16 @@ static int intel_iommu_read_and_clear_dirty(struct >> iommu_domain *domain, >> unsigned long pgsize; >> bool ad_enabled; >> >> - spin_lock(&dmar_domain->lock); >> + /* >> + * IOMMUFD core calls into a dirty tracking disabled domain without an >> + * IOVA bitmap set in order to clean dirty bits in all PTEs that might >> + * have occured when we stopped dirty tracking. 
This ensures that we >> + * never inherit dirtied bits from a previous cycle. >> + */ >> ad_enabled = dmar_domain->dirty_tracking; >> - spin_unlock(&dmar_domain->lock); >> - >> if (!ad_enabled && dirty->bitmap) > > How about > if (!dmar_domain->dirty_tracking && dirty->bitmap) > return -EINVAL; > ? > OK >> return -EINVAL; >> >> - rcu_read_lock(); >> do { >> struct dma_pte *pte; >> int lvl = 0; >> @@ -4906,14 +4919,10 @@ static int intel_iommu_read_and_clear_dirty(struct >> iommu_domain *domain, >> continue; >> } >> >> - /* It is writable, set the bitmap */ >> - if (((flags & IOMMU_DIRTY_NO_CLEAR) && >> - dma_sl_pte_dirty(pte)) || >> - dma_sl_pte_test_and_clear_dirty(pte)) >> + if (dma_sl_pte_test_and_clear_dirty(pte, flags)) >> iommu_dirty_bitmap_record(dirty, iova, pgsize); >> iova += pgsize; >> } while (iova < end); >> - rcu_read_unlock(); >> >> return 0; >> } >> diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h >> index bccd44db3316..0b390d9e669b 100644 >> --- a/drivers/iommu/intel/iommu.h >> +++ b/drivers/iommu/intel/iommu.h >> @@ -542,6 +542,9 @@ enum { >> #define sm_supported(iommu) (intel_iommu_sm && ecap_smts((iommu)->ecap)) >> #define pasid_supported(iommu) (sm_supported(iommu) && \ >> ecap_pasid((iommu)->ecap)) >> +#define slads_supported(iommu) (sm_supported(iommu) && \ >> + ecap_slads((iommu)->ecap)) >> + >> >> struct pasid_entry; >> struct pasid_state_entry; >> @@ -785,13 +788,12 @@ static inline bool dma_pte_present(struct dma_pte *pte) >> return (pte->val & 3) != 0; >> } >> >> -static inline bool dma_sl_pte_dirty(struct dma_pte *pte) >> +static inline bool dma_sl_pte_test_and_clear_dirty(struct dma_pte *pte, >> + unsigned long flags) >> { >> - return (pte->val & DMA_SL_PTE_DIRTY) != 0; >> -} >> + if (flags & IOMMU_DIRTY_NO_CLEAR) >> + return (pte->val & DMA_SL_PTE_DIRTY) != 0; >> >> -static inline bool dma_sl_pte_test_and_clear_dirty(struct dma_pte *pte) >> -{ >> return test_and_clear_bit(DMA_SL_PTE_DIRTY_BIT, >> (unsigned long *)&pte->val); >> } >> diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c >> index 03814942d59c..785384a59d55 100644 >> --- a/drivers/iommu/intel/pasid.c >> +++ b/drivers/iommu/intel/pasid.c >> @@ -686,15 +686,29 @@ int intel_pasid_setup_dirty_tracking(struct intel_iommu >> *iommu, >> >> spin_lock(&iommu->lock); >> >> - did = domain_id_iommu(domain, iommu); >> pte = intel_pasid_get_entry(dev, pasid); >> if (!pte) { >> spin_unlock(&iommu->lock); >> - dev_err(dev, "Failed to get pasid entry of PASID %d\n", pasid); >> + dev_err_ratelimited(dev, >> + "Failed to get pasid entry of PASID %d\n", >> + pasid); >> return -ENODEV; >> } >> >> + did = domain_id_iommu(domain, iommu); >> pgtt = pasid_pte_get_pgtt(pte); >> + if (pgtt != PASID_ENTRY_PGTT_SL_ONLY && pgtt != PASID_ENTRY_PGTT_NESTED) { >> + spin_unlock(&iommu->lock); >> + dev_err_ratelimited(dev, >> + "Dirty tracking not supported on translation type %d\n", >> + pgtt); >> + return -EOPNOTSUPP; >> + } >> + >> + if (pasid_get_ssade(pte) == enabled) { >> + spin_unlock(&iommu->lock); >> + return 0; >> + } >> >> if (enabled) >> pasid_set_ssade(pte); >> @@ -702,6 +716,9 @@ int intel_pasid_setup_dirty_tracking(struct intel_iommu >> *iommu, >> pasid_clear_ssade(pte); >> spin_unlock(&iommu->lock); >> >> + if (!ecap_coherent(iommu->ecap)) >> + clflush_cache_range(pte, sizeof(*pte)); >> + >> /* >> * From VT-d spec table 25 "Guidance to Software for Invalidations": >> * >> @@ -720,8 +737,6 @@ int intel_pasid_setup_dirty_tracking(struct intel_iommu >> *iommu, >> >> if (pgtt == 
PASID_ENTRY_PGTT_SL_ONLY || pgtt == PASID_ENTRY_PGTT_NESTED) >> iommu->flush.flush_iotlb(iommu, did, 0, 0, DMA_TLB_DSI_FLUSH); >> - else >> - qi_flush_piotlb(iommu, did, pasid, 0, -1, 0); >> >> /* Device IOTLB doesn't need to be flushed in caching mode. */ >> if (!cap_caching_mode(iommu->cap)) > > Others look good to me. Thanks a lot. > Joao
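Tying the two hunks above together (my reading of the diff, not stated outright in the thread): once the new pgtt check rejects anything that is not SL-only or nested with -EOPNOTSUPP, the qi_flush_piotlb() branch becomes unreachable, which is why it can be deleted and only the domain-selective IOTLB flush from the spec's Table 25 remains:

	if (pgtt != PASID_ENTRY_PGTT_SL_ONLY && pgtt != PASID_ENTRY_PGTT_NESTED)
		return -EOPNOTSUPP;	/* first-level/pass-through filtered out */

	/* ... set or clear SSADE, drop the lock ... */

	pasid_cache_invalidation_with_pasid(iommu, did, pasid);
	/* pgtt is guaranteed SL-only or nested here, so per Table 25 the
	 * domain-selective IOTLB invalidation is the required form: */
	iommu->flush.flush_iotlb(iommu, did, 0, 0, DMA_TLB_DSI_FLUSH);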
On 17/10/2023 15:16, Joao Martins wrote: > On 17/10/2023 13:41, Baolu Lu wrote: >> On 2023/10/17 18:51, Joao Martins wrote: >>> On 16/10/2023 14:01, Baolu Lu wrote: >>>> On 2023/10/16 20:59, Jason Gunthorpe wrote: >>>>> On Mon, Oct 16, 2023 at 08:58:42PM +0800, Baolu Lu wrote: >>>>>> On 2023/10/16 19:42, Jason Gunthorpe wrote: >>>>>>> On Mon, Oct 16, 2023 at 11:57:34AM +0100, Joao Martins wrote: >>>>>>> >>>>>>>> True. But to be honest, I thought we weren't quite there yet in PASID >>>>>>>> support >>>>>>>> from IOMMUFD perspective; hence why I didn't aim at it. Or do I have the >>>>>>>> wrong >>>>>>>> impression? From the code below, it clearly looks the driver does. >>>>>>> I think we should plan that this series will go before the PASID >>>>>>> series >>>>>> I know that PASID support in IOMMUFD is not yet available, but the VT-d >>>>>> driver already supports attaching a domain to a PASID, as required by >>>>>> the idxd driver for kernel DMA with PASID. Therefore, from the driver's >>>>>> perspective, dirty tracking should also be enabled for PASIDs. >>>>> As long as the driver refuses to attach a dirty track enabled domain >>>>> to PASID it would be fine for now. >>>> Yes. This works. >>> Baolu Lu, I am blocking PASID attachment this way; let me know if this matches >>> how would you have the driver refuse dirty tracking on PASID. >> >> Joao, how about blocking pasid attachment in intel_iommu_set_dev_pasid() >> directly? >> >> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c >> index c86ba5a3e75c..392b6ca9ce90 100644 >> --- a/drivers/iommu/intel/iommu.c >> +++ b/drivers/iommu/intel/iommu.c >> @@ -4783,6 +4783,9 @@ static int intel_iommu_set_dev_pasid(struct iommu_domain >> *domain, >> if (context_copied(iommu, info->bus, info->devfn)) >> return -EBUSY; >> >> + if (domain->dirty_ops) >> + return -EOPNOTSUPP; >> + >> ret = prepare_domain_attach_device(domain, dev); >> if (ret) >> return ret; > > I was trying to centralize all the checks, but I can place it here if you prefer > this way. > Minor change, I'm changing error code to -EINVAL to align with non-PASID case.
On Tue, Oct 17, 2023 at 03:11:46PM +0100, Joao Martins wrote: > On 17/10/2023 14:10, Jason Gunthorpe wrote: > > On Tue, Oct 17, 2023 at 12:22:34PM +0100, Joao Martins wrote: > >> On 17/10/2023 03:08, Baolu Lu wrote: > >>> On 10/17/23 12:00 AM, Joao Martins wrote: > >>>>>> The iommu_dirty_bitmap is defined in iommu core. The iommu driver has no > >>>>>> need to understand it and check its member anyway. > >>>>>> > >>>>> (...) The iommu driver has no need to understand it. iommu_dirty_bitmap_record() > >>>>> already makes those checks in case there's no iova_bitmap to set bits to. > >>>>> > >>>> This is all true but the reason I am checking iommu_dirty_bitmap::bitmap is to > >>>> essentially not record anything in the iova bitmap and just clear the dirty bits > >>>> from the IOPTEs, all when dirty tracking is technically disabled. This is done > >>>> internally only when starting dirty tracking, and thus to ensure that we cleanup > >>>> all dirty bits before we enable dirty tracking to have a consistent snapshot as > >>>> opposed to inheriting dirties from the past. > >>> > >>> It's okay since it serves a functional purpose. Can you please add some > >>> comments around the code to explain the rationale. > >>> > >> > >> I added this comment below: > >> > >> + /* > >> + * IOMMUFD core calls into a dirty tracking disabled domain without an > >> + * IOVA bitmap set in order to clean dirty bits in all PTEs that might > >> + * have occured when we stopped dirty tracking. This ensures that we > >> + * never inherit dirtied bits from a previous cycle. > >> + */ > >> > >> Also fixed an issue where I could theoretically clear the bit with > >> IOMMU_NO_CLEAR. Essentially passed the read_and_clear_dirty flags and let > >> dma_sl_pte_test_and_clear_dirty() to test and test-and-clear, similar to AMD: > > > > How does all this work, does this leak into the uapi? > > UAPI is only ever expected to collect/clear dirty bits while dirty tracking is > enabled. And it requires valid bitmaps before it gets to the IOMMU driver. > > The above where I pass no dirty::bitmap (but with an iotlb_gather) is internal > usage only. Open to alternatives if this is prone to audit errors e.g. 1) via > the iommu_dirty_bitmap structure, where I add one field which if true then > iommufd core is able to call into iommu driver on a "clear IOPTE" manner or 2) > via the ::flags ... the thing is that ::flags values is UAPI, so it feels weird > to use these flags for internal purposes. I think NULL to mean clear but not record is OK, it doesn't matter too much but ideally this would be sort of hidden in the iova APIs.. Jason
On 17/10/2023 16:31, Jason Gunthorpe wrote: > On Tue, Oct 17, 2023 at 03:11:46PM +0100, Joao Martins wrote: >> On 17/10/2023 14:10, Jason Gunthorpe wrote: >>> On Tue, Oct 17, 2023 at 12:22:34PM +0100, Joao Martins wrote: >>>> On 17/10/2023 03:08, Baolu Lu wrote: >>>>> On 10/17/23 12:00 AM, Joao Martins wrote: >>>>>>>> The iommu_dirty_bitmap is defined in iommu core. The iommu driver has no >>>>>>>> need to understand it and check its member anyway. >>>>>>>> >>>>>>> (...) The iommu driver has no need to understand it. iommu_dirty_bitmap_record() >>>>>>> already makes those checks in case there's no iova_bitmap to set bits to. >>>>>>> >>>>>> This is all true but the reason I am checking iommu_dirty_bitmap::bitmap is to >>>>>> essentially not record anything in the iova bitmap and just clear the dirty bits >>>>>> from the IOPTEs, all when dirty tracking is technically disabled. This is done >>>>>> internally only when starting dirty tracking, and thus to ensure that we cleanup >>>>>> all dirty bits before we enable dirty tracking to have a consistent snapshot as >>>>>> opposed to inheriting dirties from the past. >>>>> >>>>> It's okay since it serves a functional purpose. Can you please add some >>>>> comments around the code to explain the rationale. >>>>> >>>> >>>> I added this comment below: >>>> >>>> + /* >>>> + * IOMMUFD core calls into a dirty tracking disabled domain without an >>>> + * IOVA bitmap set in order to clean dirty bits in all PTEs that might >>>> + * have occured when we stopped dirty tracking. This ensures that we >>>> + * never inherit dirtied bits from a previous cycle. >>>> + */ >>>> >>>> Also fixed an issue where I could theoretically clear the bit with >>>> IOMMU_NO_CLEAR. Essentially passed the read_and_clear_dirty flags and let >>>> dma_sl_pte_test_and_clear_dirty() to test and test-and-clear, similar to AMD: >>> >>> How does all this work, does this leak into the uapi? >> >> UAPI is only ever expected to collect/clear dirty bits while dirty tracking is >> enabled. And it requires valid bitmaps before it gets to the IOMMU driver. >> >> The above where I pass no dirty::bitmap (but with an iotlb_gather) is internal >> usage only. Open to alternatives if this is prone to audit errors e.g. 1) via >> the iommu_dirty_bitmap structure, where I add one field which if true then >> iommufd core is able to call into iommu driver on a "clear IOPTE" manner or 2) >> via the ::flags ... the thing is that ::flags values is UAPI, so it feels weird >> to use these flags for internal purposes. > > I think NULL to mean clear but not record is OK, it doesn't matter too > much but ideally this would be sort of hidden in the iova APIs.. And it is hidden? unless you mean by hidden that there's explicit IOVA APIs that do this? Currently, iopt_clear_dirty_data() does this 'internal-only' usage of iommu_read_and_clear() and does this stuff.
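One way to read "hidden in the iova APIs": keep the NULL-bitmap convention out of driver-visible code by wrapping it in a dedicated initializer. A purely hypothetical sketch, this helper does not exist in the series:

	/* Hypothetical helper, named here for illustration only */
	static inline void
	iommu_dirty_bitmap_init_clear_only(struct iommu_dirty_bitmap *dirty,
					   struct iommu_iotlb_gather *gather)
	{
		/* NULL iova_bitmap == clear IOPTE dirty bits, record nothing */
		iommu_dirty_bitmap_init(dirty, NULL, gather);
	}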
On 10/17/23 10:25 PM, Joao Martins wrote: > On 17/10/2023 15:16, Joao Martins wrote: >> On 17/10/2023 13:41, Baolu Lu wrote: >>> On 2023/10/17 18:51, Joao Martins wrote: >>>> On 16/10/2023 14:01, Baolu Lu wrote: >>>>> On 2023/10/16 20:59, Jason Gunthorpe wrote: >>>>>> On Mon, Oct 16, 2023 at 08:58:42PM +0800, Baolu Lu wrote: >>>>>>> On 2023/10/16 19:42, Jason Gunthorpe wrote: >>>>>>>> On Mon, Oct 16, 2023 at 11:57:34AM +0100, Joao Martins wrote: >>>>>>>> >>>>>>>>> True. But to be honest, I thought we weren't quite there yet in PASID >>>>>>>>> support >>>>>>>>> from IOMMUFD perspective; hence why I didn't aim at it. Or do I have the >>>>>>>>> wrong >>>>>>>>> impression? From the code below, it clearly looks the driver does. >>>>>>>> I think we should plan that this series will go before the PASID >>>>>>>> series >>>>>>> I know that PASID support in IOMMUFD is not yet available, but the VT-d >>>>>>> driver already supports attaching a domain to a PASID, as required by >>>>>>> the idxd driver for kernel DMA with PASID. Therefore, from the driver's >>>>>>> perspective, dirty tracking should also be enabled for PASIDs. >>>>>> As long as the driver refuses to attach a dirty track enabled domain >>>>>> to PASID it would be fine for now. >>>>> Yes. This works. >>>> Baolu Lu, I am blocking PASID attachment this way; let me know if this matches >>>> how would you have the driver refuse dirty tracking on PASID. >>> >>> Joao, how about blocking pasid attachment in intel_iommu_set_dev_pasid() >>> directly? >>> >>> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c >>> index c86ba5a3e75c..392b6ca9ce90 100644 >>> --- a/drivers/iommu/intel/iommu.c >>> +++ b/drivers/iommu/intel/iommu.c >>> @@ -4783,6 +4783,9 @@ static int intel_iommu_set_dev_pasid(struct iommu_domain >>> *domain, >>> if (context_copied(iommu, info->bus, info->devfn)) >>> return -EBUSY; >>> >>> + if (domain->dirty_ops) >>> + return -EOPNOTSUPP; >>> + >>> ret = prepare_domain_attach_device(domain, dev); >>> if (ret) >>> return ret; >> >> I was trying to centralize all the checks, but I can place it here if you prefer >> this way. We will soon remove this check when pasid is supported in iommufd. So less code change is better for future work. >> > Minor change, I'm changing error code to -EINVAL to align with non-PASID case. Yes. Make sense. -EINVAL means "not compatible". The caller can retry. Best regards, baolu
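The retry semantics Baolu mentions come from how IOMMUFD probes for a compatible hw_pagetable on attach. A hedged sketch modeled on iommufd's auto-domain loop (simplified, not copied from the source):

	struct iommufd_hw_pagetable *hwpt;
	int rc;

	list_for_each_entry(hwpt, &ioas->hwpt_list, hwpt_item) {
		rc = iommufd_hw_pagetable_attach(hwpt, idev);
		if (!rc)
			return hwpt;		/* found a compatible domain */
		if (rc != -EINVAL)
			return ERR_PTR(rc);	/* hard failure, abort */
		/* -EINVAL: incompatible; try the next one or allocate anew */
	}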
diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c index 491bcde1ff96..7d5a8f5283a7 100644 --- a/drivers/iommu/intel/iommu.c +++ b/drivers/iommu/intel/iommu.c @@ -300,6 +300,7 @@ static int iommu_skip_te_disable; #define IDENTMAP_AZALIA 4 const struct iommu_ops intel_iommu_ops; +const struct iommu_dirty_ops intel_dirty_ops; static bool translation_pre_enabled(struct intel_iommu *iommu) { @@ -4077,6 +4078,7 @@ static struct iommu_domain *intel_iommu_domain_alloc(unsigned type) static struct iommu_domain * intel_iommu_domain_alloc_user(struct device *dev, u32 flags) { + bool enforce_dirty = (flags & IOMMU_HWPT_ALLOC_ENFORCE_DIRTY); struct iommu_domain *domain; struct intel_iommu *iommu; @@ -4087,9 +4089,15 @@ intel_iommu_domain_alloc_user(struct device *dev, u32 flags) if ((flags & IOMMU_HWPT_ALLOC_NEST_PARENT) && !ecap_nest(iommu->ecap)) return ERR_PTR(-EOPNOTSUPP); + if (enforce_dirty && + !device_iommu_capable(dev, IOMMU_CAP_DIRTY)) + return ERR_PTR(-EOPNOTSUPP); + domain = iommu_domain_alloc(dev->bus); if (!domain) domain = ERR_PTR(-ENOMEM); + if (domain && enforce_dirty) + domain->dirty_ops = &intel_dirty_ops; return domain; } @@ -4367,6 +4375,9 @@ static bool intel_iommu_capable(struct device *dev, enum iommu_cap cap) return dmar_platform_optin(); case IOMMU_CAP_ENFORCE_CACHE_COHERENCY: return ecap_sc_support(info->iommu->ecap); + case IOMMU_CAP_DIRTY: + return sm_supported(info->iommu) && + ecap_slads(info->iommu->ecap); default: return false; } @@ -4822,6 +4833,89 @@ static void *intel_iommu_hw_info(struct device *dev, u32 *length, u32 *type) return vtd; } +static int intel_iommu_set_dirty_tracking(struct iommu_domain *domain, + bool enable) +{ + struct dmar_domain *dmar_domain = to_dmar_domain(domain); + struct device_domain_info *info; + int ret = -EINVAL; + + spin_lock(&dmar_domain->lock); + if (!(dmar_domain->dirty_tracking ^ enable) || + list_empty(&dmar_domain->devices)) { + spin_unlock(&dmar_domain->lock); + return 0; + } + + list_for_each_entry(info, &dmar_domain->devices, link) { + /* First-level page table always enables dirty bit*/ + if (dmar_domain->use_first_level) { + ret = 0; + break; + } + + ret = intel_pasid_setup_dirty_tracking(info->iommu, info->domain, + info->dev, IOMMU_NO_PASID, + enable); + if (ret) + break; + + } + + if (!ret) + dmar_domain->dirty_tracking = enable; + spin_unlock(&dmar_domain->lock); + + return ret; +} + +static int intel_iommu_read_and_clear_dirty(struct iommu_domain *domain, + unsigned long iova, size_t size, + unsigned long flags, + struct iommu_dirty_bitmap *dirty) +{ + struct dmar_domain *dmar_domain = to_dmar_domain(domain); + unsigned long end = iova + size - 1; + unsigned long pgsize; + bool ad_enabled; + + spin_lock(&dmar_domain->lock); + ad_enabled = dmar_domain->dirty_tracking; + spin_unlock(&dmar_domain->lock); + + if (!ad_enabled && dirty->bitmap) + return -EINVAL; + + rcu_read_lock(); + do { + struct dma_pte *pte; + int lvl = 0; + + pte = pfn_to_dma_pte(dmar_domain, iova >> VTD_PAGE_SHIFT, &lvl, + GFP_ATOMIC); + pgsize = level_size(lvl) << VTD_PAGE_SHIFT; + if (!pte || !dma_pte_present(pte)) { + iova += pgsize; + continue; + } + + /* It is writable, set the bitmap */ + if (((flags & IOMMU_DIRTY_NO_CLEAR) && + dma_sl_pte_dirty(pte)) || + dma_sl_pte_test_and_clear_dirty(pte)) + iommu_dirty_bitmap_record(dirty, iova, pgsize); + iova += pgsize; + } while (iova < end); + rcu_read_unlock(); + + return 0; +} + +const struct iommu_dirty_ops intel_dirty_ops = { + .set_dirty_tracking = intel_iommu_set_dirty_tracking, + 
.read_and_clear_dirty = intel_iommu_read_and_clear_dirty, +}; + const struct iommu_ops intel_iommu_ops = { .capable = intel_iommu_capable, .hw_info = intel_iommu_hw_info, diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h index c18fb699c87a..bccd44db3316 100644 --- a/drivers/iommu/intel/iommu.h +++ b/drivers/iommu/intel/iommu.h @@ -48,6 +48,9 @@ #define DMA_FL_PTE_DIRTY BIT_ULL(6) #define DMA_FL_PTE_XD BIT_ULL(63) +#define DMA_SL_PTE_DIRTY_BIT 9 +#define DMA_SL_PTE_DIRTY BIT_ULL(DMA_SL_PTE_DIRTY_BIT) + #define ADDR_WIDTH_5LEVEL (57) #define ADDR_WIDTH_4LEVEL (48) @@ -592,6 +595,7 @@ struct dmar_domain { * otherwise, goes through the second * level. */ + u8 dirty_tracking:1; /* Dirty tracking is enabled */ spinlock_t lock; /* Protect device tracking lists */ struct list_head devices; /* all devices' list */ @@ -781,6 +785,17 @@ static inline bool dma_pte_present(struct dma_pte *pte) return (pte->val & 3) != 0; } +static inline bool dma_sl_pte_dirty(struct dma_pte *pte) +{ + return (pte->val & DMA_SL_PTE_DIRTY) != 0; +} + +static inline bool dma_sl_pte_test_and_clear_dirty(struct dma_pte *pte) +{ + return test_and_clear_bit(DMA_SL_PTE_DIRTY_BIT, + (unsigned long *)&pte->val); +} + static inline bool dma_pte_superpage(struct dma_pte *pte) { return (pte->val & DMA_PTE_LARGE_PAGE); diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c index 8f92b92f3d2a..03814942d59c 100644 --- a/drivers/iommu/intel/pasid.c +++ b/drivers/iommu/intel/pasid.c @@ -277,6 +277,11 @@ static inline void pasid_set_bits(u64 *ptr, u64 mask, u64 bits) WRITE_ONCE(*ptr, (old & ~mask) | bits); } +static inline u64 pasid_get_bits(u64 *ptr) +{ + return READ_ONCE(*ptr); +} + /* * Setup the DID(Domain Identifier) field (Bit 64~79) of scalable mode * PASID entry. @@ -335,6 +340,36 @@ static inline void pasid_set_fault_enable(struct pasid_entry *pe) pasid_set_bits(&pe->val[0], 1 << 1, 0); } +/* + * Enable second level A/D bits by setting the SLADE (Second Level + * Access Dirty Enable) field (Bit 9) of a scalable mode PASID + * entry. + */ +static inline void pasid_set_ssade(struct pasid_entry *pe) +{ + pasid_set_bits(&pe->val[0], 1 << 9, 1 << 9); +} + +/* + * Disable second level A/D bits by clearing the SLADE (Second Level + * Access Dirty Enable) field (Bit 9) of a scalable mode PASID + * entry. + */ +static inline void pasid_clear_ssade(struct pasid_entry *pe) +{ + pasid_set_bits(&pe->val[0], 1 << 9, 0); +} + +/* + * Checks whether the SLADE (Second Level Access Dirty Enable) field + * (Bit 9) of a scalable mode PASID entry is set, i.e. whether second + * level A/D bits are enabled. + */ +static inline bool pasid_get_ssade(struct pasid_entry *pe) +{ + return pasid_get_bits(&pe->val[0]) & (1 << 9); +} + /* * Setup the WPE(Write Protect Enable) field (Bit 132) of a * scalable mode PASID entry. @@ -627,6 +662,8 @@ int intel_pasid_setup_second_level(struct intel_iommu *iommu, pasid_set_translation_type(pte, PASID_ENTRY_PGTT_SL_ONLY); pasid_set_fault_enable(pte); pasid_set_page_snoop(pte, !!ecap_smpwc(iommu->ecap)); + if (domain->dirty_tracking) + pasid_set_ssade(pte); pasid_set_present(pte); spin_unlock(&iommu->lock); @@ -636,6 +673,63 @@ int intel_pasid_setup_second_level(struct intel_iommu *iommu, return 0; } +/* + * Set up dirty tracking on a second only translation type.
+ */ +int intel_pasid_setup_dirty_tracking(struct intel_iommu *iommu, + struct dmar_domain *domain, + struct device *dev, u32 pasid, + bool enabled) +{ + struct pasid_entry *pte; + u16 did, pgtt; + + spin_lock(&iommu->lock); + + did = domain_id_iommu(domain, iommu); + pte = intel_pasid_get_entry(dev, pasid); + if (!pte) { + spin_unlock(&iommu->lock); + dev_err(dev, "Failed to get pasid entry of PASID %d\n", pasid); + return -ENODEV; + } + + pgtt = pasid_pte_get_pgtt(pte); + + if (enabled) + pasid_set_ssade(pte); + else + pasid_clear_ssade(pte); + spin_unlock(&iommu->lock); + + /* + * From VT-d spec table 25 "Guidance to Software for Invalidations": + * + * - PASID-selective-within-Domain PASID-cache invalidation + * If (PGTT=SS or Nested) + * - Domain-selective IOTLB invalidation + * Else + * - PASID-selective PASID-based IOTLB invalidation + * - If (pasid is RID_PASID) + * - Global Device-TLB invalidation to affected functions + * Else + * - PASID-based Device-TLB invalidation (with S=1 and + * Addr[63:12]=0x7FFFFFFF_FFFFF) to affected functions + */ + pasid_cache_invalidation_with_pasid(iommu, did, pasid); + + if (pgtt == PASID_ENTRY_PGTT_SL_ONLY || pgtt == PASID_ENTRY_PGTT_NESTED) + iommu->flush.flush_iotlb(iommu, did, 0, 0, DMA_TLB_DSI_FLUSH); + else + qi_flush_piotlb(iommu, did, pasid, 0, -1, 0); + + /* Device IOTLB doesn't need to be flushed in caching mode. */ + if (!cap_caching_mode(iommu->cap)) + devtlb_invalidation_with_pasid(iommu, dev, pasid); + + return 0; +} + /* * Set up the scalable mode pasid entry for passthrough translation type. */ diff --git a/drivers/iommu/intel/pasid.h b/drivers/iommu/intel/pasid.h index 4e9e68c3c388..958050b093aa 100644 --- a/drivers/iommu/intel/pasid.h +++ b/drivers/iommu/intel/pasid.h @@ -106,6 +106,10 @@ int intel_pasid_setup_first_level(struct intel_iommu *iommu, int intel_pasid_setup_second_level(struct intel_iommu *iommu, struct dmar_domain *domain, struct device *dev, u32 pasid); +int intel_pasid_setup_dirty_tracking(struct intel_iommu *iommu, + struct dmar_domain *domain, + struct device *dev, u32 pasid, + bool enabled); int intel_pasid_setup_pass_through(struct intel_iommu *iommu, struct dmar_domain *domain, struct device *dev, u32 pasid);
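A small readability note on intel_iommu_set_dirty_tracking() above: the `!(dmar_domain->dirty_tracking ^ enable)` test is just "requested state equals current state", which makes the op idempotent, and the list_empty() test makes it a no-op on a domain with no devices attached. An equivalent, more explicit form (behavior-preserving sketch):

	if (dmar_domain->dirty_tracking == enable ||
	    list_empty(&dmar_domain->devices)) {
		spin_unlock(&dmar_domain->lock);
		return 0;	/* nothing to change */
	}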
The IOMMU advertises Access/Dirty bits for the second-stage page table if the extended capability DMAR register reports it (ECAP, mnemonic ECAP.SSADS). The first-stage table is compatible with the CPU page table, thus A/D bits are implicitly supported. Relevant Intel IOMMU SDM refs: for the first-stage table, "3.6.2 Accessed, Extended Accessed, and Dirty Flags"; for the second-stage table, "3.7.2 Accessed and Dirty Flags". The first-stage page table has dirty tracking enabled by default, so enabling it requires no control bits and simply returns 0. To use SSADS, set bit 9 (SSADE) in the scalable-mode PASID table entry and flush the IOTLB via pasid_flush_caches(), following the manual. Relevant SDM refs: "3.7.2 Accessed and Dirty Flags" "6.5.3.3 Guidance to Software for Invalidations, Table 23. Guidance to Software for Invalidations" The PTE dirty bit is located in bit 9 and is cached in the IOTLB, so the IOTLB must be flushed to make sure the IOMMU attempts to set the dirty bit again. Note that iommu_dirty_bitmap_record() will add the IOVA to iotlb_gather, and thus the caller of the iommu op will flush the IOTLB. The relevant manual for hardware translation is chapter 6, with special mention of: "6.2.3.1 Scalable-Mode PASID-Table Entry Programming Considerations" "6.2.4 IOTLB" Signed-off-by: Joao Martins <joao.m.martins@oracle.com> --- The IOPTE walker is still a bit inefficient. Making sure the UAPI/IOMMUFD is solid and agreed upon. --- drivers/iommu/intel/iommu.c | 94 +++++++++++++++++++++++++++++++++++++ drivers/iommu/intel/iommu.h | 15 ++++++ drivers/iommu/intel/pasid.c | 94 +++++++++++++++++++++++++++++++++++++ drivers/iommu/intel/pasid.h | 4 ++ 4 files changed, 207 insertions(+)
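Condensing the mechanism the commit message describes into one sequence (a sketch assuming the helpers from the diff above, with locking and the PASID-entry lookup elided):

	/* 1) Flip SSADE (bit 9) in the scalable-mode PASID table entry */
	pasid_set_ssade(pte);

	/* 2) Invalidate the PASID cache for this entry */
	pasid_cache_invalidation_with_pasid(iommu, did, pasid);

	/* 3) Flush the IOTLB: the dirty bit is cached there, so stale
	 *    entries must be refetched before HW will set it again */
	iommu->flush.flush_iotlb(iommu, did, 0, 0, DMA_TLB_DSI_FLUSH);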