[v2,06/11] iommu/arm-smmu-v3: Introduce arm_smmu_s2_parent_tlb_ invalidation helpers

Message ID 61fef9052b2034e5b4ffa1fa6ce481667d8ea6b1.1744692494.git.nicolinc@nvidia.com (mailing list archive)
State New
Series iommu/arm-smmu-v3: Allocate vmid per vsmmu instead of s2_parent

Commit Message

Nicolin Chen April 15, 2025, 4:57 a.m. UTC
An S2 nest_parent domain can be shared across vSMMUs in the same VM, since
the S2 domain is basically the IPA mappings for the entire RAM of the VM.

Meanwhile, each vSMMU can have its own VMID, so the VMID allocation should
be done per vSMMU instance rather than per S2 nest_parent domain.

However, an S2 domain can also be allocated when a physical SMMU instance
doesn't support S1. So, the structure has to retain the s2_cfg and vmid.

Add a per-domain "vsmmus" list paired with a spinlock, maintaining a list
of vSMMUs in the S2 parent domain.

Provide two arm_smmu_s2_parent_tlb_ helpers that will be used for nesting
cases to invalidate the S2 cache using vsmmu->vmid by iterating this
"vsmmus" list.
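
E.g. the S2 invalidation paths can then do, for a nest_parent domain
(sketch only of how a following patch is expected to use the helpers):

	if (smmu_domain->nest_parent)
		arm_smmu_s2_parent_tlb_inv_range(smmu_domain, iova, size,
						 granule, leaf);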

Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h   | 22 ++++++++
 .../arm/arm-smmu-v3/arm-smmu-v3-iommufd.c     | 53 +++++++++++++++++++
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c   |  2 +
 3 files changed, 77 insertions(+)

Comments

Jason Gunthorpe April 15, 2025, 12:50 p.m. UTC | #1
On Mon, Apr 14, 2025 at 09:57:41PM -0700, Nicolin Chen wrote:
> An S2 nest_parent domain can be shared across vSMMUs in the same VM, since
> the S2 domain is basically the IPA mappings for the entire RAM of the VM.
> 
> Meanwhile, each vSMMU can have its own VMID, so the VMID allocation should
> be done per vSMMU instance rather than per S2 nest_parent domain.
> 
> However, an S2 domain can also be allocated when a physical SMMU instance
> doesn't support S1. So, the structure has to retain the s2_cfg and vmid.
> 
> Add a per-domain "vsmmus" list paired with a spinlock, maintaining a list
> of vSMMUs in the S2 parent domain.
> 
> Provide two arm_smmu_s2_parent_tlb_ helpers that will be used for nesting
> cases to invalidate the S2 cache using vsmmu->vmid by iterating this
> "vsmmus" list.

I was rather hoping to fix the normal S2 case as well; the nested case
is really not so different.

The challenge with that is to rework the list of invalidation
instructions stored in the smmu_domain to be more general and carry
more information; how to invalidate for a vsmmu is just another special
case.

> @@ -859,6 +859,10 @@ struct arm_smmu_domain {
>  		struct arm_smmu_ctx_desc	cd;
>  		struct arm_smmu_s2_cfg		s2_cfg;
>  	};
> +	struct {
> +		struct list_head list;
> +		spinlock_t lock;
> +	} vsmmus;

So this approach of just adding more lists is functional, but it isn't
very general :\

This is why it is a tough project, because carefully generalizing the
invalidation data without degrading the performance is certainly
somewhat tricky.

But what I was broadly thinking is to have an allocated array attached
to each domain with something like:

struct invalidation_op {
    struct arm_smmu_device *smmu;
    enum {ATS,S2_VMID_IPA,S2_VMID,S1_ASID} invalidation_op;
    union {
        u16 vmid;
        u32 asid;
    	u32 ats_id;
    };
    refcount_t users;
};

Then invalidation would just iterate over this list following each
instruction. 
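
Something like this maybe (very rough sketch; only the cmdq helpers and
opcodes are real driver names, the rest is invented and the range/batching
details are elided):

rcu_read_lock();
ops = rcu_dereference(smmu_domain->invalidation_ops);
for (i = 0; i != ops->num_ops; i++) {
    struct invalidation_op *op = &ops->entry[i];
    struct arm_smmu_cmdq_ent cmd = {};

    switch (op->invalidation_op) {
    case S1_ASID:
        cmd.opcode = CMDQ_OP_TLBI_NH_ASID;
        cmd.tlbi.asid = op->asid;
        break;
    case S2_VMID:
        cmd.opcode = CMDQ_OP_TLBI_S12_VMALL;
        cmd.tlbi.vmid = op->vmid;
        break;
    /* S2_VMID_IPA would be CMDQ_OP_TLBI_S2_IPA with the range */
    case ATS:
        /* build CMDQ_OP_ATC_INV from op->ats_id instead */
        continue;
    }
    arm_smmu_cmdq_issue_cmd_with_sync(op->smmu, &cmd);
}
rcu_read_unlock();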

When things are attached the list is mutated:
 - Normal S1/S2 attach would reuse an ASID for the same instance or
   allocate a new list entry, users keeps track of ID sharing
 - VMID attach would use the VMID of the vSMMU
 - ATS enabled would add entries for each PCI device instead of the
   separate ATS list

To do this without locking on the invalidation side would require
using RCU to manage the list, which suggests it is probably an array
that is re-allocated each time it is changed.

That means some fancy algorithms to copy and mutate the array, deal
with error cases and sort it (ATS must follow ID, want things grouped
by instance).

There are some tricky memory barriers needed and RCU would require that
SMMU unplug do a synchronize_rcu(). IIRC riscv did this in their
driver.

But the end result is we fully disconnect the domain from the smmu
instance and all domain types can be shared across all instances if
they support the pagetable layout. The invalidation also becomes
somewhat simpler as it just sweeps the list and does what it is
told. The special ATS list, counter and locking is removed too.

Jason
Nicolin Chen April 15, 2025, 8:10 p.m. UTC | #2
On Tue, Apr 15, 2025 at 09:50:42AM -0300, Jason Gunthorpe wrote:
> struct invalidation_op {
>     struct arm_smmu_device *smmu;
>     enum {ATS,S2_VMID_IPA,S2_VMID,S1_ASID} invalidation_op;
>     union {
>         u16 vmid;
>         u32 asid;
>     	u32 ats_id;
>     };
>     refcount_t users;
> };
> 
> Then invalidation would just iterate over this list following each
> instruction. 
> 
> When things are attached the list is mutated:
>  - Normal S1/S2 attach would reuse an ASID for the same instance or
>    allocate a new list entry, users keeps track of ID sharing
>  - VMID attach would use the VMID of the vSMMU
>  - ATS enabled would add entries for each PCI device instead of the
>    separate ATS list

Interesting. I can see it generalizing all the use cases.

Yet are you expecting a big list combining TLBI and ATC_INV cmds?

I think the ATC_INV entries don't need a refcount? And finding
an SID (to remove a device, for example) would take long when
there are a lot of entries in the list?

Should the ATS list still be separate, or even an xarray?

> To do this without locking on the invalidation side would require
> using RCU to manage the list, which suggests it is probably an array
> that is re-allocated each time it is changed.
> 
> That means some fancy algorithms to copy and mutate the array, deal
> with error cases and sort it (ATS must follow ID, want things grouped
> by instance).
> 
> There are some tricky memory barriers needed and RCU would require that
> SMMU unplug do a synchronize_rcu(). IIRC riscv did this in their
> driver.

I will refer to their driver. Yet, I wonder what we will gain from
RCU here? Race condition? Would you elaborate with some use case?

> But the end result is we fully disconnect the domain from the smmu
> instance and all domain types can be shared across all instances if
> they support the pagetable layout. The invalidation also becomes
> somewhat simpler as it just sweeps the list and does what it is
> told. The special ATS list, counter and locking is removed too.

OK. I'd like to give it another try. Or would you prefer to write it
yourself?

Thanks
Nicolin
Jason Gunthorpe April 15, 2025, 11:46 p.m. UTC | #3
On Tue, Apr 15, 2025 at 01:10:37PM -0700, Nicolin Chen wrote:
> On Tue, Apr 15, 2025 at 09:50:42AM -0300, Jason Gunthorpe wrote:
> > struct invalidation_op {
> >     struct arm_smmu_device *smmu;
> >     enum {ATS,S2_VMID_IPA,S2_VMID,S1_ASID} invalidation_op;
> >     union {
> >         u16 vmid;
> >         u32 asid;
> >     	u32 ats_id;
> >     };
> >     refcount_t users;
> > };
> > 
> > Then invalidation would just iterate over this list following each
> > instruction. 
> > 
> > When things are attached the list is mutated:
> >  - Normal S1/S2 attach would reuse an ASID for the same instance or
> >    allocate a new list entry, users keeps track of ID sharing
> >  - VMID attach would use the VMID of the vSMMU
> >  - ATS enabled would add entries for each PCI device instead of the
> >    separate ATS list
> 
> Interesting. I can see it generalizing all the use cases.
> 
> Yet are you expecting a big list combining TLBI and ATC_INV cmds?

It is the idea I had in my head. There isn't really a great reason to
have two lists if one list can handle the required updating and
locking needs.. I imagine the IOTLB entries would be sorted first and
the ATC entries last.
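
e.g. a comparator along these lines (just a sketch, inv_op_cmp and the
array layout are made up):

static int inv_op_cmp(const void *_l, const void *_r)
{
	const struct invalidation_op *l = _l, *r = _r;

	/* All IOTLB entries (ASID/VMID) before any ATS entry */
	if ((l->invalidation_op == ATS) != (r->invalidation_op == ATS))
		return l->invalidation_op == ATS ? 1 : -1;
	/* Then group by instance so commands can be batched per SMMU */
	if (l->smmu != r->smmu)
		return l->smmu < r->smmu ? -1 : 1;
	return 0;
}

and a sort(ops->entry, ops->num_ops, sizeof(*ops->entry), inv_op_cmp,
NULL) whenever the array is rebuilt.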

> I think the ATC_INV entries don't need a refcount?

Probably in almost all cases.

But see below about needing two domains in the list at once and recall
that today we temporarily put the same domain in the list twice
sometimes. So it may make a lot of sense to use the refcount in every
entry to track how many masters are using that entry just to keep the
design simple.

> And finding an SID (to remove a device, for example) would take
> long when there are a lot of entries in the list?

It depends how smart you get; a bisection search on a sorted linear list
would scale fine. But I don't think we care much about attach/detach
performance, or have such high numbers of attachments that this is
worth optimizing for.

> Should the ATS list still be separate, or even an xarray?

I haven't gone through it in any detail to know.. If the invalidation
can use the structure above for ATS and nothing else needs the ATS
list, then perhaps it doesn't need to exist.

> I will refer to their driver. Yet, I wonder what we will gain from
> RCU here? Race condition? Would you elaborate with some use case?

The invalidation path was optimized to avoid locking; look at the
stuff in arm_smmu_atc_inv_domain() that tries to avoid the spinlock
protecting the ATS invalidation's read of the devices list.

So, I imagine a similar lock-free scheme would be:

invalidation:
  rcu_read_lock()
  list = READ_ONCE(domain->invalidation_ops);
  [execute invalidation on list]
  rcu_read_unlock()

mutate:
   mutex_lock(domain->lock for attachment)
   new_list = kcalloc()
   copy_and_mutate(domain->invalidation_ops, new_list);
   rcu_assign_pointer(domain->invalidation_ops, new_list);
   mutex_unlock(domain->lock for attachment)

Then because of RCU you have to deal with some races.

1) HW flushing must be synchronous with the domain attach:
      CPU 1                   CPU 2
    change an IOPTE
    release IOPTEs
                              attach a domain
			      release invalidation_ops
    invalidation
       acquire READ_ONCE()
			      acquire IOPTEs
                              update the STE/CD

Such that the HW is guaranteed to either:
 a) see the new value of the IOPTE before seeing the STE/CD that could
    cause it to be fetched
 b) is guaranteed to see the invalidation_op for the new STE prior to
    the STE being installed.

IIRC the riscv folks determined that this was a simple smp_mb()..

On the detaching side spurious IOTLB invalidation is OK; that will
just cause some performance anomaly. And I think spurious ATC
invalidation is OK too, though we may need a synchronize_rcu() in
device removal due to friendly hot unplug.. IDK

2) Safe domain replacement

The existing code double-adds devices to the invalidation lists for
safety. So it would need an algorithm like this:
   
prepare:
    middle_list = copy_and_mutate_add_master(domain->list, new_master);
    final_list = copy_and_mutate_remove_master(middle_list, old_master);
commit:
   // Invalidate both new/old master while we mess with the STE/CD
   rcu_assign_pointer(domain->list, middle_list);
   install_ste()
   // Only invalidate new master
   rcu_assign_pointer(domain->list, final_list);
   kfree_rcu(middle_list);
   kfree_rcu(old_list);

This is because there is an intrinsic time window after the STE is
written to memory, but before the STE invalidation sync has completed
in HW, where we have no idea which of the two domains the HW is
fetching from.

3) IOMMU Device removal

Since the RCU is also protecting the smmu instance memory and queues:

       CPU 1                      CPU 2
    invalidation
      rcu_read_lock()
                                   domain detach
				   arm_smmu_release_device()
				   iommu_device_unregister()
      list = READ_ONCE()
      .. list[i]->smmu ..
      rcu_read_unlock()
				     synchronize_rcu()
				     kfree(smmu);

But that's easy and we never hot-unplug SMMUs anyhow.

> > But the end result is we fully disconnect the domain from the smmu
> > instance and all domain types can be shared across all instances if
> > they support the pagetable layout. The invalidation also becomes
> > somewhat simpler as it just sweeps the list and does what it is
> > told. The special ATS list, counter and locking is removed too.
> 
> OK. I'd like to give it another try. Or would you prefer to write it
> yourself?

I'd be happy if you can knock it out, or at least determine it is too
hard / a bad idea. I'm trying to push out the io page table stuff this
cycle.

The only thing that gives me pause is the complexity of the list copy
and mutate, but I didn't try to enumerate all the mutations that are
required. Maybe if this is done in a very simple unoptimized way it is
good enough: 'mutate add master' and 'mutate remove master', allocating a
new list copy for each operation.

Scan the list and calculate the new size. Copy the list discarding
things to delete. Add the new things to the end. Sort.
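
For the 'mutate remove' side that could be as dumb as (sketch only; all
the names here are invented and the refcount handling is skipped):

static struct invalidation_ops *
inv_ops_copy_remove(struct invalidation_ops *old,
		    struct arm_smmu_master *master)
{
	struct invalidation_ops *new;
	size_t i, n = 0;

	/* Scan the list and calculate the new size */
	for (i = 0; i != old->num_ops; i++)
		if (!inv_op_owned_only_by(&old->entry[i], master))
			n++;

	new = kzalloc(struct_size(new, entry, n), GFP_KERNEL);
	if (!new)
		return ERR_PTR(-ENOMEM);

	/* Copy the list discarding things to delete */
	for (i = 0; i != old->num_ops; i++)
		if (!inv_op_owned_only_by(&old->entry[i], master))
			new->entry[new->num_ops++] = old->entry[i];

	/* Sort, e.g. with the inv_op_cmp() above */
	sort(new->entry, new->num_ops, sizeof(new->entry[0]),
	     inv_op_cmp, NULL);
	return new;
}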

I'd probably start here, try to write the two mutate functions, check
if those are enough mutate functions, then try to migrate the
invalidation logic over to use the new lists part by part. Building
the new lists can be done first in a series.

From here a future project would be to optimize the invalidation for
multi-SMMU and multi-device... The current code runs everything
serially, but we could push all the invalidation commands to all the
instances, then wait for the syncs to come back from each instance,
allowing the HW invalidation to be in parallel. Then similarly do the
ATC in parallel. It is easy to do if the list is sorted already in
order of required operations. This might make most sense for ATC
invalidation since it is always range-based and only needs two command
entries?

Jason

Patch

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index 7b47f4408a7a..7d76d8ac9acc 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -859,6 +859,10 @@  struct arm_smmu_domain {
 		struct arm_smmu_ctx_desc	cd;
 		struct arm_smmu_s2_cfg		s2_cfg;
 	};
+	struct {
+		struct list_head list;
+		spinlock_t lock;
+	} vsmmus;
 
 	struct iommu_domain		domain;
 
@@ -1081,6 +1085,7 @@  struct arm_vsmmu {
 	struct arm_smmu_device *smmu;
 	struct arm_smmu_domain *s2_parent;
 	u16 vmid;
+	struct list_head vsmmus_elm; /* arm_smmu_domain::vsmmus::list */
 };
 
 #if IS_ENABLED(CONFIG_ARM_SMMU_V3_IOMMUFD)
@@ -1094,6 +1099,11 @@  int arm_vsmmu_attach_prepare(struct arm_smmu_attach_state *state,
 void arm_smmu_attach_commit_vmaster(struct arm_smmu_attach_state *state);
 void arm_smmu_master_clear_vmaster(struct arm_smmu_master *master);
 int arm_vmaster_report_event(struct arm_smmu_vmaster *vmaster, u64 *evt);
+
+void arm_smmu_s2_parent_tlb_inv_domain(struct arm_smmu_domain *s2_parent);
+void arm_smmu_s2_parent_tlb_inv_range(struct arm_smmu_domain *s2_parent,
+				      unsigned long iova, size_t size,
+				      size_t granule, bool leaf);
 #else
 #define arm_smmu_hw_info NULL
 #define arm_vsmmu_alloc NULL
@@ -1119,6 +1129,18 @@  static inline int arm_vmaster_report_event(struct arm_smmu_vmaster *vmaster,
 {
 	return -EOPNOTSUPP;
 }
+
+static inline void
+arm_smmu_s2_parent_tlb_inv_domain(struct arm_smmu_domain *s2_parent)
+{
+}
+
+static inline void
+arm_smmu_s2_parent_tlb_inv_range(struct arm_smmu_domain *s2_parent,
+				 unsigned long iova, size_t size,
+				 size_t granule, bool leaf)
+{
+}
 #endif /* CONFIG_ARM_SMMU_V3_IOMMUFD */
 
 #endif /* _ARM_SMMU_V3_H */
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c
index 6cd01536c966..45ba68a1b59a 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c
@@ -30,6 +30,54 @@  void *arm_smmu_hw_info(struct device *dev, u32 *length, u32 *type)
 	return info;
 }
 
+void arm_smmu_s2_parent_tlb_inv_domain(struct arm_smmu_domain *s2_parent)
+{
+	struct arm_vsmmu *vsmmu, *next;
+	unsigned long flags;
+
+	spin_lock_irqsave(&s2_parent->vsmmus.lock, flags);
+	list_for_each_entry_safe(vsmmu, next, &s2_parent->vsmmus.list,
+				 vsmmus_elm) {
+		arm_smmu_tlb_inv_vmid(vsmmu->smmu, vsmmu->vmid);
+	}
+	spin_unlock_irqrestore(&s2_parent->vsmmus.lock, flags);
+}
+
+void arm_smmu_s2_parent_tlb_inv_range(struct arm_smmu_domain *s2_parent,
+				      unsigned long iova, size_t size,
+				      size_t granule, bool leaf)
+{
+	struct arm_smmu_cmdq_ent cmd = { .tlbi = { .leaf = leaf } };
+	struct arm_vsmmu *vsmmu, *next;
+	unsigned long flags;
+
+	spin_lock_irqsave(&s2_parent->vsmmus.lock, flags);
+	list_for_each_entry_safe(vsmmu, next, &s2_parent->vsmmus.list,
+				 vsmmus_elm) {
+		cmd.tlbi.vmid = vsmmu->vmid;
+
+		/* Must flush all the nested S1 ASIDs when S2 domain changes */
+		cmd.opcode = CMDQ_OP_TLBI_NH_ALL;
+		arm_smmu_cmdq_issue_cmd_with_sync(vsmmu->smmu, &cmd);
+		cmd.opcode = CMDQ_OP_TLBI_S2_IPA;
+		__arm_smmu_tlb_inv_range(vsmmu->smmu, &cmd, iova, size, granule,
+					 &s2_parent->domain);
+	}
+	spin_unlock_irqrestore(&s2_parent->vsmmus.lock, flags);
+}
+
+static void arm_vsmmu_destroy(struct iommufd_viommu *viommu)
+{
+	struct arm_vsmmu *vsmmu = container_of(viommu, struct arm_vsmmu, core);
+	unsigned long flags;
+
+	spin_lock_irqsave(&vsmmu->s2_parent->vsmmus.lock, flags);
+	list_del(&vsmmu->vsmmus_elm);
+	spin_unlock_irqrestore(&vsmmu->s2_parent->vsmmus.lock, flags);
+	/* Must flush S2 vmid after delinking vSMMU */
+	arm_smmu_tlb_inv_vmid(vsmmu->smmu, vsmmu->vmid);
+}
+
 static void arm_smmu_make_nested_cd_table_ste(
 	struct arm_smmu_ste *target, struct arm_smmu_master *master,
 	struct arm_smmu_nested_domain *nested_domain, bool ats_enabled)
@@ -380,6 +428,7 @@  static int arm_vsmmu_cache_invalidate(struct iommufd_viommu *viommu,
 }
 
 static const struct iommufd_viommu_ops arm_vsmmu_ops = {
+	.destroy = arm_vsmmu_destroy,
 	.alloc_domain_nested = arm_vsmmu_alloc_domain_nested,
 	.cache_invalidate = arm_vsmmu_cache_invalidate,
 };
@@ -394,6 +443,7 @@  struct iommufd_viommu *arm_vsmmu_alloc(struct device *dev,
 	struct arm_smmu_master *master = dev_iommu_priv_get(dev);
 	struct arm_smmu_domain *s2_parent = to_smmu_domain(parent);
 	struct arm_vsmmu *vsmmu;
+	unsigned long flags;
 
 	if (viommu_type != IOMMU_VIOMMU_TYPE_ARM_SMMUV3)
 		return ERR_PTR(-EOPNOTSUPP);
@@ -433,6 +483,9 @@  struct iommufd_viommu *arm_vsmmu_alloc(struct device *dev,
 	vsmmu->s2_parent = s2_parent;
 	/* FIXME Move VMID allocation from the S2 domain allocation to here */
 	vsmmu->vmid = s2_parent->s2_cfg.vmid;
+	spin_lock_irqsave(&s2_parent->vsmmus.lock, flags);
+	list_add_tail(&vsmmu->vsmmus_elm, &s2_parent->vsmmus.list);
+	spin_unlock_irqrestore(&s2_parent->vsmmus.lock, flags);
 
 	return &vsmmu->core;
 }
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 07d435562da2..df87880e2a29 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -3256,6 +3256,8 @@  arm_smmu_domain_alloc_paging_flags(struct device *dev, u32 flags,
 		}
 		smmu_domain->stage = ARM_SMMU_DOMAIN_S2;
 		smmu_domain->nest_parent = true;
+		INIT_LIST_HEAD(&smmu_domain->vsmmus.list);
+		spin_lock_init(&smmu_domain->vsmmus.lock);
 		break;
 	case IOMMU_HWPT_ALLOC_DIRTY_TRACKING:
 	case IOMMU_HWPT_ALLOC_DIRTY_TRACKING | IOMMU_HWPT_ALLOC_PASID: