[v2,2/2] iommu/arm-smmu-v3: add nr_ats_masters for quick check

Message ID 20190815054439.30652-3-thunder.leizhen@huawei.com (mailing list archive)
State New, archived
Series: add nr_ats_masters for quick check

Commit Message

Leizhen (ThunderTown) Aug. 15, 2019, 5:44 a.m. UTC
When (smmu_domain->smmu->features & ARM_SMMU_FEAT_ATS) is true,
arm_smmu_atc_inv_domain() always builds the invalidation command with
arm_smmu_atc_inv_to_cmd() and takes the devices_lock, even if the smmu
domain does not contain any ATS master. This hurts performance,
especially in multi-core and stress scenarios: my FIO test scenario
loses about 8% of its throughput.

Instead, add a struct member that records how many ATS masters the
domain contains, and test it up front, without taking the lock and
traversing the list to check every master one by one.

Fixes: 9ce27afc0830 ("iommu/arm-smmu-v3: Add support for PCI ATS")
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
---
 drivers/iommu/arm-smmu-v3.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)
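
For context, the invalidation path being optimised looks roughly like
this before the patch (a reconstruction from the diff context and the
pre-patch driver, not a verbatim copy):

	static int arm_smmu_atc_inv_domain(struct arm_smmu_domain *smmu_domain,
					   int ssid, unsigned long iova, size_t size)
	{
		int ret = 0;
		unsigned long flags;
		struct arm_smmu_cmdq_ent cmd;
		struct arm_smmu_master *master;

		/* Feature check only: passes even with zero ATS masters */
		if (!(smmu_domain->smmu->features & ARM_SMMU_FEAT_ATS))
			return 0;

		arm_smmu_atc_inv_to_cmd(ssid, iova, size, &cmd);

		/* Taken on every unmap, even when the walk does nothing */
		spin_lock_irqsave(&smmu_domain->devices_lock, flags);
		list_for_each_entry(master, &smmu_domain->devices, domain_head)
			ret |= arm_smmu_atc_inv_master(master, &cmd);
		spin_unlock_irqrestore(&smmu_domain->devices_lock, flags);

		return ret ? -ETIMEDOUT : 0;
	}

The patch swaps the feature check for a check of the new
nr_ats_masters counter, so a domain with no ATS masters bails out
before building the command or taking the lock.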

Comments

Will Deacon Aug. 15, 2019, 3:23 p.m. UTC | #1
On Thu, Aug 15, 2019 at 01:44:39PM +0800, Zhen Lei wrote:
> When (smmu_domain->smmu->features & ARM_SMMU_FEAT_ATS) is true,
> arm_smmu_atc_inv_domain() always builds the invalidation command with
> arm_smmu_atc_inv_to_cmd() and takes the devices_lock, even if the smmu
> domain does not contain any ATS master. This hurts performance,
> especially in multi-core and stress scenarios: my FIO test scenario
> loses about 8% of its throughput.
> 
> Instead, add a struct member that records how many ATS masters the
> domain contains, and test it up front, without taking the lock and
> traversing the list to check every master one by one.
> 
> Fixes: 9ce27afc0830 ("iommu/arm-smmu-v3: Add support for PCI ATS")
> Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
> ---
>  drivers/iommu/arm-smmu-v3.c | 14 +++++++++++++-
>  1 file changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
> index 29056d9bb12aa01..154334d3310c9b8 100644
> --- a/drivers/iommu/arm-smmu-v3.c
> +++ b/drivers/iommu/arm-smmu-v3.c
> @@ -631,6 +631,7 @@ struct arm_smmu_domain {
>  
>  	struct io_pgtable_ops		*pgtbl_ops;
>  	bool				non_strict;
> +	int				nr_ats_masters;
>  
>  	enum arm_smmu_domain_stage	stage;
>  	union {
> @@ -1531,7 +1532,16 @@ static int arm_smmu_atc_inv_domain(struct arm_smmu_domain *smmu_domain,
>  	struct arm_smmu_cmdq_ent cmd;
>  	struct arm_smmu_master *master;
>  
> -	if (!(smmu_domain->smmu->features & ARM_SMMU_FEAT_ATS))
> +	/*
> +	 * The protection of spinlock(&smmu_domain->devices_lock) is
> +	 * omitted: for a given master, map/unmap operations can only
> +	 * happen after it has been attached and before it has been
> +	 * detached, so if at least one master needs its ATC invalidated,
> +	 * smmu_domain->nr_ats_masters cannot be zero here.
> +	 *
> +	 * This avoids taking the lock in multi-core scenarios.
> +	 */

I find this reasoning pretty dubious, since I think you're assuming that
an endpoint cannot issue speculative ATS translation requests once its
ATS capability is enabled. That said, I think it also means we should enable
ATS in the STE *before* enabling it in the endpoint -- the current logic
looks like it's the wrong way round to me (including in detach()).

Anyway, these speculative translations could race with a concurrent unmap()
call and end up with the ATC containing translations for unmapped pages,
which I think we should try to avoid.

Did the RCU approach not work out? You could use an rwlock instead as a
temporary bodge if the performance doesn't hurt too much.
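
As a rough sketch (not a posted patch), the rwlock bodge could look
something like this, assuming devices_lock becomes an rwlock_t:

	/* struct arm_smmu_domain: devices_lock becomes an rwlock_t */
	rwlock_t			devices_lock;

	/* Invalidation path: concurrent unmappers can walk the list */
	read_lock_irqsave(&smmu_domain->devices_lock, flags);
	list_for_each_entry(master, &smmu_domain->devices, domain_head)
		ret |= arm_smmu_atc_inv_master(master, &cmd);
	read_unlock_irqrestore(&smmu_domain->devices_lock, flags);

	/* Attach/detach path: exclusive access to modify the list */
	write_lock_irqsave(&smmu_domain->devices_lock, flags);
	list_add(&master->domain_head, &smmu_domain->devices);
	write_unlock_irqrestore(&smmu_domain->devices_lock, flags);

That keeps the existing logic intact; unmappers only stop contending
with each other, and the write side still serialises attach/detach.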

Alternatively... maybe we could change the attach flow to do something
like:

	enable_ats_in_ste(master);
	enable_ats_at_pcie_endpoint(master);
	spin_lock(devices_lock);
	add_to_device_list(master);
	nr_ats_masters++;
	spin_unlock(devices_lock);
	invalidate_atc(master);

in which case, the concurrent unmapper will be doing something like:

	issue_tlbi();
	smp_mb();
	if (READ_ONCE(nr_ats_masters)) {
		...
	}

and I *think* that means that either the unmapper will see the
nr_ats_masters update and perform the invalidation, or they'll miss
the update but the attach will invalidate the ATC /after/ the TLBI
in the command queue.
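
Rendered with the driver's names (again only a sketch; issue_tlbi()
stays a placeholder for the TLBI + SYNC that unmap already issues, and
the increment is assumed to happen under devices_lock as in this
patch), the unmap side would be:

	issue_tlbi();		/* CMD_TLBI_* + CMD_SYNC for the range */
	smp_mb();		/* order the TLBI before the read below */
	if (READ_ONCE(smmu_domain->nr_ats_masters)) {
		/* Saw the increment: invalidate the ATC ourselves */
		arm_smmu_atc_inv_domain(smmu_domain, 0, iova, size);
	}
	/*
	 * If the increment was missed, the attacher has not yet issued
	 * invalidate_atc(master); when it does, that invalidation lands
	 * in the command queue after our TLBI, so ATC entries fetched
	 * speculatively via the stale mapping are still removed.
	 */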

Also, John's idea of converting this stuff over to my command batching
mechanism should help a lot if we can defer this to sync time using the
gather structure. Maybe an rwlock would be alright for that. Dunno.

Will
Leizhen (ThunderTown) Aug. 16, 2019, 10:12 a.m. UTC | #2
On 2019/8/15 23:23, Will Deacon wrote:
> On Thu, Aug 15, 2019 at 01:44:39PM +0800, Zhen Lei wrote:
>> When (smmu_domain->smmu->features & ARM_SMMU_FEAT_ATS) is true,
>> arm_smmu_atc_inv_domain() always builds the invalidation command with
>> arm_smmu_atc_inv_to_cmd() and takes the devices_lock, even if the smmu
>> domain does not contain any ATS master. This hurts performance,
>> especially in multi-core and stress scenarios: my FIO test scenario
>> loses about 8% of its throughput.
>>
>> Instead, add a struct member that records how many ATS masters the
>> domain contains, and test it up front, without taking the lock and
>> traversing the list to check every master one by one.
>>
>> Fixes: 9ce27afc0830 ("iommu/arm-smmu-v3: Add support for PCI ATS")
>> Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
>> ---
>>  drivers/iommu/arm-smmu-v3.c | 14 +++++++++++++-
>>  1 file changed, 13 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
>> index 29056d9bb12aa01..154334d3310c9b8 100644
>> --- a/drivers/iommu/arm-smmu-v3.c
>> +++ b/drivers/iommu/arm-smmu-v3.c
>> @@ -631,6 +631,7 @@ struct arm_smmu_domain {
>>  
>>  	struct io_pgtable_ops		*pgtbl_ops;
>>  	bool				non_strict;
>> +	int				nr_ats_masters;
>>  
>>  	enum arm_smmu_domain_stage	stage;
>>  	union {
>> @@ -1531,7 +1532,16 @@ static int arm_smmu_atc_inv_domain(struct arm_smmu_domain *smmu_domain,
>>  	struct arm_smmu_cmdq_ent cmd;
>>  	struct arm_smmu_master *master;
>>  
>> -	if (!(smmu_domain->smmu->features & ARM_SMMU_FEAT_ATS))
>> +	/*
>> +	 * The protection of spinlock(&smmu_domain->devices_lock) is
>> +	 * omitted: for a given master, map/unmap operations can only
>> +	 * happen after it has been attached and before it has been
>> +	 * detached, so if at least one master needs its ATC invalidated,
>> +	 * smmu_domain->nr_ats_masters cannot be zero here.
>> +	 *
>> +	 * This avoids taking the lock in multi-core scenarios.
>> +	 */
> 
> I find this reasoning pretty dubious, since I think you're assuming that
> an endpoint cannot issue speculative ATS translation requests once its
> ATS capability is enabled. That said, I think it also means we should enable
> ATS in the STE *before* enabling it in the endpoint -- the current logic
> looks like it's the wrong way round to me (including in detach()).
> 
> Anyway, these speculative translations could race with a concurrent unmap()
> call and end up with the ATC containing translations for unmapped pages,
> which I think we should try to avoid.
> 
> Did the RCU approach not work out? You could use an rwlock instead as a
> temporary bodge if the performance doesn't hurt too much.
OK, I will try an rwlock first; that does not change the original code logic.

> 
> Alternatively... maybe we could change the attach flow to do something
> like:
> 
> 	enable_ats_in_ste(master);
> 	enable_ats_at_pcie_endpoint(master);
> 	spin_lock(devices_lock);
> 	add_to_device_list(master);
> 	nr_ats_masters++;
> 	spin_unlock(devices_lock);
> 	invalidate_atc(master);
> 
> in which case, the concurrent unmapper will be doing something like:
> 
> 	issue_tlbi();
> 	smp_mb();
> 	if (READ_ONCE(nr_ats_masters)) {
> 		...
> 	}
> 
> and I *think* that means that either the unmapper will see the
> nr_ats_masters update and perform the invalidation, or they'll miss
> the update but the attach will invalidate the ATC /after/ the TLBI
> in the command queue.
> 
> Also, John's idea of converting this stuff over to my command batching
> mechanism should help a lot if we can defer this to sync time using the
> gather structure. Maybe an rwlock would be alright for that. Dunno.
> 
> Will
> 

Patch

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 29056d9bb12aa01..154334d3310c9b8 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -631,6 +631,7 @@  struct arm_smmu_domain {
 
 	struct io_pgtable_ops		*pgtbl_ops;
 	bool				non_strict;
+	int				nr_ats_masters;
 
 	enum arm_smmu_domain_stage	stage;
 	union {
@@ -1531,7 +1532,16 @@  static int arm_smmu_atc_inv_domain(struct arm_smmu_domain *smmu_domain,
 	struct arm_smmu_cmdq_ent cmd;
 	struct arm_smmu_master *master;
 
-	if (!(smmu_domain->smmu->features & ARM_SMMU_FEAT_ATS))
+	/*
+	 * The protection of spinlock(&smmu_domain->devices_lock) is
+	 * omitted: for a given master, map/unmap operations can only
+	 * happen after it has been attached and before it has been
+	 * detached, so if at least one master needs its ATC invalidated,
+	 * smmu_domain->nr_ats_masters cannot be zero here.
+	 *
+	 * This avoids taking the lock in multi-core scenarios.
+	 */
+	if (!smmu_domain->nr_ats_masters)
 		return 0;
 
 	arm_smmu_atc_inv_to_cmd(ssid, iova, size, &cmd);
@@ -1913,6 +1923,7 @@  static void arm_smmu_detach_dev(struct arm_smmu_master *master)
 
 	spin_lock_irqsave(&smmu_domain->devices_lock, flags);
 	list_del(&master->domain_head);
+	smmu_domain->nr_ats_masters--;
 	spin_unlock_irqrestore(&smmu_domain->devices_lock, flags);
 
 	master->domain = NULL;
@@ -1968,6 +1979,7 @@  static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
 
 	spin_lock_irqsave(&smmu_domain->devices_lock, flags);
 	list_add(&master->domain_head, &smmu_domain->devices);
+	smmu_domain->nr_ats_masters++;
 	spin_unlock_irqrestore(&smmu_domain->devices_lock, flags);
 out_unlock:
 	mutex_unlock(&smmu_domain->init_mutex);