[3/3] iommu/arm-smmu-v3: Add a max_tlbi_ops for __arm_smmu_tlb_inv_range()

Message ID 3ba332e141102d31b756326cdc4078cac1f5ab1c.1692693557.git.nicolinc@nvidia.com (mailing list archive)
State New, archived
Series iommu/arm-smmu-v3: Reduce latency in __arm_smmu_tlb_inv_range()

Commit Message

Nicolin Chen Aug. 22, 2023, 8:45 a.m. UTC
When __arm_smmu_tlb_inv_range() is called with a large size, the call can
incur a long latency: one part comes from the software overhead of building
a large number of commands, and the other from the CMDQ hardware consuming
them. The latency can be significant on an SMMU that does not support range
invalidation commands, i.e. one without ARM_SMMU_FEAT_RANGE_INV.

One way to optimize this is to replace a large number of VA invalidation
commands with a single per-asid or per-vmid invalidation command when the
invalidation size reaches a preset threshold derived from the number of
entries per io-pgtable, similar to MAX_TLBI_OPS in arm64's tlbflush.h.

Add a max_tlbi_ops field to arm_smmu_domain, and convert a large number of
per-granule TLBI commands into a single per-asid or per-vmid TLBI command on
SMMUs without ARM_SMMU_FEAT_RANGE_INV.

Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 10 +++++++++-
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  1 +
 2 files changed, 10 insertions(+), 1 deletion(-)
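
For reference, below is a minimal sketch of the threshold heuristic the
patch applies, mirroring arm64's MAX_TLBI_OPS check in __flush_tlb_range();
the helper name is illustrative and not part of the patch:

-------------------------------------------------------------------
/*
 * One non-range TLBI command is needed per granule-sized block, so once
 * the block count exceeds the number of entries in one io-pgtable, a
 * single per-asid/per-vmid invalidation is the cheaper operation.
 */
static bool arm_smmu_tlbi_over_threshold(size_t size, size_t granule,
					 unsigned long max_tlbi_ops)
{
	return size >= granule * max_tlbi_ops;
}
-------------------------------------------------------------------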

Comments

Robin Murphy Aug. 22, 2023, 9:30 a.m. UTC | #1
On 2023-08-22 09:45, Nicolin Chen wrote:
> When __arm_smmu_tlb_inv_range() is called with a large size, the call can
> incur a long latency: one part comes from the software overhead of building
> a large number of commands, and the other from the CMDQ hardware consuming
> them. The latency can be significant on an SMMU that does not support range
> invalidation commands, i.e. one without ARM_SMMU_FEAT_RANGE_INV.
> 
> One way to optimize this is to replace a large number of VA invalidation
> commands with a single per-asid or per-vmid invalidation command when the
> invalidation size reaches a preset threshold derived from the number of
> entries per io-pgtable, similar to MAX_TLBI_OPS in arm64's tlbflush.h.
> 
> Add a max_tlbi_ops field to arm_smmu_domain, and convert a large number of
> per-granule TLBI commands into a single per-asid or per-vmid TLBI command on
> SMMUs without ARM_SMMU_FEAT_RANGE_INV.
> 
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> ---
>   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 10 +++++++++-
>   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  1 +
>   2 files changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index d6c647e1eb01..3f0db30932bd 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -1897,7 +1897,14 @@ static void __arm_smmu_tlb_inv_range(struct arm_smmu_cmdq_ent *cmd,
>   	if (!size)
>   		return;
>   
> -	if (smmu->features & ARM_SMMU_FEAT_RANGE_INV) {
> +	if (!(smmu->features & ARM_SMMU_FEAT_RANGE_INV)) {
> +		/*
> +		 * When the size reaches a threshold, replace per-granule TLBI
> +		 * commands with one single per-asid or per-vmid TLBI command.
> +		 */
> +		if (size >= granule * smmu_domain->max_tlbi_ops)
> +			return arm_smmu_tlb_inv_domain(smmu_domain);

This looks like it's at the wrong level - we should have figured this 
out before we got as far as low-level command-building. I'd have thought 
it would be a case of short-circuiting directly from 
arm_smmu_tlb_inv_range_domain() to arm_smmu_tlb_inv_context().

> +	} else {
>   		/* Get the leaf page size */
>   		tg = __ffs(smmu_domain->domain.pgsize_bitmap);
>   
> @@ -2258,6 +2265,7 @@ static int arm_smmu_domain_finalise(struct iommu_domain *domain,
>   	}
>   
>   	smmu_domain->pgtbl_ops = pgtbl_ops;
> +	smmu_domain->max_tlbi_ops = pgtbl_cfg.nents_per_pgtable;

And now we're carrying *three* copies of the same information around 
everywhere? Honestly, just pull cfg->bits_per_level out of the 
io_pgtable_ops at the point where you need it, like the pagetable code 
itself manages to do perfectly happily. Wrap it in an io-pgtable helper 
if you think that's cleaner.

Thanks,
Robin.

>   	return 0;
>   }
>   
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> index dcab85698a4e..f68c95a2e24f 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> @@ -721,6 +721,7 @@ struct arm_smmu_domain {
>   	struct io_pgtable_ops		*pgtbl_ops;
>   	bool				stall_enabled;
>   	atomic_t			nr_ats_masters;
> +	unsigned long			max_tlbi_ops;
>   
>   	enum arm_smmu_domain_stage	stage;
>   	union {
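
(An io-pgtable helper along the lines Robin suggests might look like the
sketch below; the helper name is hypothetical, and reading bits_per_level
out of the io_pgtable_cfg follows the snippets later in this thread rather
than any existing io-pgtable API:)

-------------------------------------------------------------------
/* Hypothetical helper: number of entries per table, derived on demand */
static inline unsigned long io_pgtable_nents_per_table(struct io_pgtable_ops *ops)
{
	struct io_pgtable_cfg *cfg = &io_pgtable_ops_to_pgtable(ops)->cfg;

	return 1UL << cfg->bits_per_level;
}
-------------------------------------------------------------------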
Nicolin Chen Aug. 22, 2023, 4:32 p.m. UTC | #2
On Tue, Aug 22, 2023 at 10:30:35AM +0100, Robin Murphy wrote:

> > diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > index d6c647e1eb01..3f0db30932bd 100644
> > --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > @@ -1897,7 +1897,14 @@ static void __arm_smmu_tlb_inv_range(struct arm_smmu_cmdq_ent *cmd,
> >       if (!size)
> >               return;
> > 
> > -     if (smmu->features & ARM_SMMU_FEAT_RANGE_INV) {
> > +     if (!(smmu->features & ARM_SMMU_FEAT_RANGE_INV)) {
> > +             /*
> > +              * When the size reaches a threshold, replace per-granule TLBI
> > +              * commands with one single per-asid or per-vmid TLBI command.
> > +              */
> > +             if (size >= granule * smmu_domain->max_tlbi_ops)
> > +                     return arm_smmu_tlb_inv_domain(smmu_domain);
> 
> This looks like it's at the wrong level - we should have figured this
> out before we got as far as low-level command-building. I'd have thought
> it would be a case of short-circuiting directly from
> arm_smmu_tlb_inv_range_domain() to arm_smmu_tlb_inv_context().

OK, I could do that. We would have copies of this same routine
though. Also, the shortcut applies to !ARM_SMMU_FEAT_RANGE_INV
cases only, so this function feels convenient to me.

> > +     } else {
> >               /* Get the leaf page size */
> >               tg = __ffs(smmu_domain->domain.pgsize_bitmap);
> > 
> > @@ -2258,6 +2265,7 @@ static int arm_smmu_domain_finalise(struct iommu_domain *domain,
> >       }
> > 
> >       smmu_domain->pgtbl_ops = pgtbl_ops;
> > +     smmu_domain->max_tlbi_ops = pgtbl_cfg.nents_per_pgtable;
> 
> And now we're carrying *three* copies of the same information around
> everywhere? Honestly, just pull cfg->bits_per_level out of the
> io_pgtable_ops at the point where you need it, like the pagetable code
> itself manages to do perfectly happily. Wrap it in an io-pgtable helper
> if you think that's cleaner.

OK. I overlooked io_pgtable_ops_to_pgtable. Will do that.

Thanks
Nic
Nicolin Chen Aug. 22, 2023, 11:04 p.m. UTC | #3
Hi Robin,

On Tue, Aug 22, 2023 at 09:32:26AM -0700, Nicolin Chen wrote:
> On Tue, Aug 22, 2023 at 10:30:35AM +0100, Robin Murphy wrote:
> 
> > > diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > > index d6c647e1eb01..3f0db30932bd 100644
> > > --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > > +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > > @@ -1897,7 +1897,14 @@ static void __arm_smmu_tlb_inv_range(struct arm_smmu_cmdq_ent *cmd,
> > >       if (!size)
> > >               return;
> > > 
> > > -     if (smmu->features & ARM_SMMU_FEAT_RANGE_INV) {
> > > +     if (!(smmu->features & ARM_SMMU_FEAT_RANGE_INV)) {
> > > +             /*
> > > +              * When the size reaches a threshold, replace per-granule TLBI
> > > +              * commands with one single per-asid or per-vmid TLBI command.
> > > +              */
> > > +             if (size >= granule * smmu_domain->max_tlbi_ops)
> > > +                     return arm_smmu_tlb_inv_domain(smmu_domain);
> > 
> > This looks like it's at the wrong level - we should have figured this
> > out before we got as far as low-level command-building. I'd have thought
> > it would be a case of short-circuiting directly from
> > arm_smmu_tlb_inv_range_domain() to arm_smmu_tlb_inv_context().
> 
> OK, I could do that. We would have copies of this same routine
> though. Also, the shortcut applies to !ARM_SMMU_FEAT_RANGE_INV
> cases only, so this function feels convenient to me.

I was trying to say that we would need the same piece in both
arm_smmu_tlb_inv_range_domain() and arm_smmu_tlb_inv_range_asid(),
though the latter one only needs to call arm_smmu_tlb_inv_asid().

Also, arm_smmu_tlb_inv_context() does a full range ATC invalidation
instead of a given range like what arm_smmu_tlb_inv_range_domain()
currently does. So, it might be a bit overkill.
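
(Schematically, the ATC side of the two existing paths, simplified from the
driver:)

	/* arm_smmu_tlb_inv_context(): full-range ATC invalidation */
	arm_smmu_atc_inv_domain(smmu_domain, 0, 0, 0);

	/* arm_smmu_tlb_inv_range_domain(): ATC invalidation of the range */
	arm_smmu_atc_inv_domain(smmu_domain, 0, iova, size);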

Combining all your comments, we'd have something like this:

-------------------------------------------------------------------
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 7614739ea2c1..2967a6634c7c 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -1937,12 +1937,22 @@ static void arm_smmu_tlb_inv_range_domain(unsigned long iova, size_t size,
 					  size_t granule, bool leaf,
 					  struct arm_smmu_domain *smmu_domain)
 {
+	struct io_pgtable_cfg *cfg =
+		&io_pgtable_ops_to_pgtable(smmu_domain->pgtbl_ops)->cfg;
 	struct arm_smmu_cmdq_ent cmd = {
 		.tlbi = {
 			.leaf	= leaf,
 		},
 	};
 
+	/*
+	 * If the given size is so large that it would end up with too many TLBI
+	 * commands in the CMDQ, short-circuit directly to a full invalidation
+	 */
+	if (!(smmu_domain->smmu->features & ARM_SMMU_FEAT_RANGE_INV) &&
+	    size >= granule * (1UL << cfg->bits_per_level))
+		return arm_smmu_tlb_inv_context(smmu_domain);
+
 	if (smmu_domain->stage == ARM_SMMU_DOMAIN_S1) {
 		cmd.opcode	= smmu_domain->smmu->features & ARM_SMMU_FEAT_E2H ?
 				  CMDQ_OP_TLBI_EL2_VA : CMDQ_OP_TLBI_NH_VA;
@@ -1964,6 +1974,8 @@ void arm_smmu_tlb_inv_range_asid(unsigned long iova, size_t size, int asid,
 				 size_t granule, bool leaf,
 				 struct arm_smmu_domain *smmu_domain)
 {
+	struct io_pgtable_cfg *cfg =
+		&io_pgtable_ops_to_pgtable(smmu_domain->pgtbl_ops)->cfg;
 	struct arm_smmu_cmdq_ent cmd = {
 		.opcode	= smmu_domain->smmu->features & ARM_SMMU_FEAT_E2H ?
 			  CMDQ_OP_TLBI_EL2_VA : CMDQ_OP_TLBI_NH_VA,
@@ -1973,6 +1985,14 @@ void arm_smmu_tlb_inv_range_asid(unsigned long iova, size_t size, int asid,
 		},
 	};
 
+	/*
+	 * If the given size is so large that it would end up with too many TLBI
+	 * commands in the CMDQ, short-circuit directly to a full invalidation
+	 */
+	if (!(smmu_domain->smmu->features & ARM_SMMU_FEAT_RANGE_INV) &&
+	    size >= granule * (1UL << cfg->bits_per_level))
+		return arm_smmu_tlb_inv_asid(smmu_domain->smmu, asid);
+
 	__arm_smmu_tlb_inv_range(&cmd, iova, size, granule, smmu_domain);
 }
 
-------------------------------------------------------------------

You're sure that you prefer this, right?

Thanks
Nicolin
Robin Murphy Aug. 29, 2023, 10:40 p.m. UTC | #4
On 2023-08-23 00:04, Nicolin Chen wrote:
> Hi Robin,
> 
> On Tue, Aug 22, 2023 at 09:32:26AM -0700, Nicolin Chen wrote:
>> On Tue, Aug 22, 2023 at 10:30:35AM +0100, Robin Murphy wrote:
>>
>>>> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>>>> index d6c647e1eb01..3f0db30932bd 100644
>>>> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>>>> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>>>> @@ -1897,7 +1897,14 @@ static void __arm_smmu_tlb_inv_range(struct arm_smmu_cmdq_ent *cmd,
>>>>        if (!size)
>>>>                return;
>>>>
>>>> -     if (smmu->features & ARM_SMMU_FEAT_RANGE_INV) {
>>>> +     if (!(smmu->features & ARM_SMMU_FEAT_RANGE_INV)) {
>>>> +             /*
>>>> +              * When the size reaches a threshold, replace per-granule TLBI
>>>> +              * commands with one single per-asid or per-vmid TLBI command.
>>>> +              */
>>>> +             if (size >= granule * smmu_domain->max_tlbi_ops)
>>>> +                     return arm_smmu_tlb_inv_domain(smmu_domain);
>>>
>>> This looks like it's at the wrong level - we should have figured this
>>> out before we got as far as low-level command-building. I'd have thought
>>> it would be a case of short-circuiting directly from
>>> arm_smmu_tlb_inv_range_domain() to arm_smmu_tlb_inv_context().
>>
>> OK, I could do that. We would have copies of this same routine
>> though. Also, the shortcut applies to !ARM_SMMU_FEAT_RANGE_INV
>> cases only, so this function feels convenient to me.
> 
> I was trying to say that we would need the same piece in both
> arm_smmu_tlb_inv_range_domain() and arm_smmu_tlb_inv_range_asid(),
> though the latter one only needs to call arm_smmu_tlb_inv_asid().

It's not like arm_smmu_tlb_inv_range_asid() doesn't already massively 
overlap with arm_smmu_tlb_inv_range_domain() anyway, so a little further 
duplication hardly seems like it would hurt. Checking 
ARM_SMMU_FEAT_RANGE_INV should be cheap (otherwise we'd really want to 
split __arm_smmu_tlb_inv_range() into separate RIL vs. non-RIL versions 
to avoid having it in the loop), and it makes the intent clear. What I 
just really don't like is a flow where we construct a specific command, 
then call the low-level function to issue it, only that function then 
actually jumps back out into another high-level function which 
constructs a *different* command. This code is already a maze of twisty 
little passages...
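
(For illustration, the RIL vs. non-RIL split mentioned above might look like
the sketch below, simplified from __arm_smmu_tlb_inv_range(); the function
name is made up for this example:)

	static void __arm_smmu_tlb_inv_range_nonril(struct arm_smmu_cmdq_ent *cmd,
						    unsigned long iova, size_t size,
						    size_t granule,
						    struct arm_smmu_domain *smmu_domain)
	{
		struct arm_smmu_device *smmu = smmu_domain->smmu;
		unsigned long end = iova + size;
		struct arm_smmu_cmdq_batch cmds = {};

		/* One TLBI per granule-sized block; no range encoding to compute */
		while (iova < end) {
			cmd->tlbi.addr = iova;
			arm_smmu_cmdq_batch_add(smmu, &cmds, cmd);
			iova += granule;
		}
		arm_smmu_cmdq_batch_submit(smmu, &cmds);
	}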

> Also, arm_smmu_tlb_inv_context() does a full range ATC invalidation
> instead of a given range like what arm_smmu_tlb_inv_range_domain()
> currently does. So, it might be a bit overkill.
> 
> Combining all your comments, we'd have something like this:

TBH I'd be inclined to refactor a bit harder, maybe break out some 
VMID-based helpers for orthogonality, and aim for a flow like:

	if (over threshold)
		tlb_inv_domain()
	else if (stage 1)
		tlb_inv_range_asid()
	else
		tlb_inv_range_vmid()
	atc_inv_range()

or possibly if you prefer:

	if (stage 1) {
		if (over threshold)
			tlb_inv_asid()
		else
			tlb_inv_range_asid()
	} else {
		if (over threshold)
			tlb_inv_vmid()
		else
			tlb_inv_range_vmid()
	}
	atc_inv_range()

where the latter maybe trades more verbosity for less duplication 
overall - I'd probably have to try both to see which looks nicer in the 
end. And obviously if there's any chance of inventing a clear and 
consistent naming scheme in the process, that would be lovely.

Thanks,
Robin.

> -------------------------------------------------------------------
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index 7614739ea2c1..2967a6634c7c 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -1937,12 +1937,22 @@ static void arm_smmu_tlb_inv_range_domain(unsigned long iova, size_t size,
>   					  size_t granule, bool leaf,
>   					  struct arm_smmu_domain *smmu_domain)
>   {
> +	struct io_pgtable_cfg *cfg =
> +		&io_pgtable_ops_to_pgtable(smmu_domain->pgtbl_ops)->cfg;
>   	struct arm_smmu_cmdq_ent cmd = {
>   		.tlbi = {
>   			.leaf	= leaf,
>   		},
>   	};
>   
> +	/*
> +	 * If the given size is so large that it would end up with too many TLBI
> +	 * commands in the CMDQ, short-circuit directly to a full invalidation
> +	 */
> +	if (!(smmu_domain->smmu->features & ARM_SMMU_FEAT_RANGE_INV) &&
> +	    size >= granule * (1UL << cfg->bits_per_level))
> +		return arm_smmu_tlb_inv_context(smmu_domain);
> +
>   	if (smmu_domain->stage == ARM_SMMU_DOMAIN_S1) {
>   		cmd.opcode	= smmu_domain->smmu->features & ARM_SMMU_FEAT_E2H ?
>   				  CMDQ_OP_TLBI_EL2_VA : CMDQ_OP_TLBI_NH_VA;
> @@ -1964,6 +1974,8 @@ void arm_smmu_tlb_inv_range_asid(unsigned long iova, size_t size, int asid,
>   				 size_t granule, bool leaf,
>   				 struct arm_smmu_domain *smmu_domain)
>   {
> +	struct io_pgtable_cfg *cfg =
> +		&io_pgtable_ops_to_pgtable(smmu_domain->pgtbl_ops)->cfg;
>   	struct arm_smmu_cmdq_ent cmd = {
>   		.opcode	= smmu_domain->smmu->features & ARM_SMMU_FEAT_E2H ?
>   			  CMDQ_OP_TLBI_EL2_VA : CMDQ_OP_TLBI_NH_VA,
> @@ -1973,6 +1985,14 @@ void arm_smmu_tlb_inv_range_asid(unsigned long iova, size_t size, int asid,
>   		},
>   	};
>   
> +	/*
> +	 * If the given size is so large that it would end up with too many TLBI
> +	 * commands in the CMDQ, short-circuit directly to a full invalidation
> +	 */
> +	if (!(smmu_domain->smmu->features & ARM_SMMU_FEAT_RANGE_INV) &&
> +	    size >= granule * (1UL << cfg->bits_per_level))
> +		return arm_smmu_tlb_inv_asid(smmu_domain->smmu, asid);
> +
>   	__arm_smmu_tlb_inv_range(&cmd, iova, size, granule, smmu_domain);
>   }
>   
> -------------------------------------------------------------------
> 
> You're sure that you prefer this, right?
> 
> Thanks
> Nicolin
Nicolin Chen Aug. 29, 2023, 11:14 p.m. UTC | #5
On Tue, Aug 29, 2023 at 11:40:29PM +0100, Robin Murphy wrote:

> > > > > -     if (smmu->features & ARM_SMMU_FEAT_RANGE_INV) {
> > > > > +     if (!(smmu->features & ARM_SMMU_FEAT_RANGE_INV)) {
> > > > > +             /*
> > > > > +              * When the size reaches a threshold, replace per-granule TLBI
> > > > > +              * commands with one single per-asid or per-vmid TLBI command.
> > > > > +              */
> > > > > +             if (size >= granule * smmu_domain->max_tlbi_ops)
> > > > > +                     return arm_smmu_tlb_inv_domain(smmu_domain);
> > > > 
> > > > This looks like it's at the wrong level - we should have figured this
> > > > out before we got as far as low-level command-building. I'd have thought
> > > > it would be a case of short-circuiting directly from
> > > > arm_smmu_tlb_inv_range_domain() to arm_smmu_tlb_inv_context().
> > > 
> > > OK, I could do that. We would have copies of this same routine
> > > though. Also, the shortcut applies to !ARM_SMMU_FEAT_RANGE_INV
> > > cases only, so this function feels convenient to me.
> > 
> > I was trying to say that we would need the same piece in both
> > arm_smmu_tlb_inv_range_domain() and arm_smmu_tlb_inv_range_asid(),
> > though the latter one only needs to call arm_smmu_tlb_inv_asid().
> 
> It's not like arm_smmu_tlb_inv_range_asid() doesn't already massively
> overlap with arm_smmu_tlb_inv_range_domain() anyway, so a little further
> duplication hardly seems like it would hurt. Checking
> ARM_SMMU_FEAT_RANGE_INV should be cheap (otherwise we'd really want to
> split __arm_smmu_tlb_inv_range() into separate RIL vs. non-RIL versions
> to avoid having it in the loop), and it makes the intent clear. What I
> just really don't like is a flow where we construct a specific command,
> then call the low-level function to issue it, only that function then
> actually jumps back out into another high-level function which
> constructs a *different* command. This code is already a maze of twisty
> little passages...

OK. That sounds convincing to me. We can do it at the higher level.

> > Also, arm_smmu_tlb_inv_context() does a full range ATC invalidation
> > instead of a given range like what arm_smmu_tlb_inv_range_domain()
> > currently does. So, it might be a bit overkill.
> > 
> > Combining all your comments, we'd have something like this:
> 
> TBH I'd be inclined to refactor a bit harder, maybe break out some
> VMID-based helpers for orthogonality, and aim for a flow like:
> 
>        if (over threshold)
>                tlb_inv_domain()
>        else if (stage 1)
>                tlb_inv_range_asid()
>        else
>                tlb_inv_range_vmid()
>        atc_inv_range()
> 
> or possibly if you prefer:
> 
>        if (stage 1) {
>                if (over threshold)
>                        tlb_inv_asid()
>                else
>                        tlb_inv_range_asid()
>        } else {
>                if (over threshold)
>                        tlb_inv_vmid()
>                else
>                        tlb_inv_range_vmid()
>        }
>        atc_inv_range()
> 
> where the latter maybe trades more verbosity for less duplication
> overall - I'd probably have to try both to see which looks nicer in the
> end. And obviously if there's any chance of inventing a clear and
> consistent naming scheme in the process, that would be lovely.

Oh, I just replied to you in another email, asking for a refactor,
though that didn't include the over-threshold part yet.

Anyway, I think I got your point now and will bear in mind that
we want something clean overall with less duplication among the
"inv" functions.

Let me try some rewriting, and we can see how it looks afterwards.

Thanks
Nicolin
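
For illustration, a rough C rendering of the second flow Robin sketches
above; the *_vmid helpers and arm_smmu_max_tlbi_ops() do not exist in the
driver, so those names are hypothetical:

-------------------------------------------------------------------
static void arm_smmu_tlb_inv_range_domain(unsigned long iova, size_t size,
					  size_t granule, bool leaf,
					  struct arm_smmu_domain *smmu_domain)
{
	struct arm_smmu_device *smmu = smmu_domain->smmu;
	/* Fall back to a full invalidation only when RIL is absent */
	bool over = !(smmu->features & ARM_SMMU_FEAT_RANGE_INV) &&
		    size >= granule * arm_smmu_max_tlbi_ops(smmu_domain);

	if (smmu_domain->stage == ARM_SMMU_DOMAIN_S1) {
		if (over)
			arm_smmu_tlb_inv_asid(smmu, smmu_domain->s1_cfg.cd.asid);
		else
			arm_smmu_tlb_inv_range_asid(iova, size,
						    smmu_domain->s1_cfg.cd.asid,
						    granule, leaf, smmu_domain);
	} else {
		if (over)
			arm_smmu_tlb_inv_vmid(smmu, smmu_domain->s2_cfg.vmid);
		else
			arm_smmu_tlb_inv_range_vmid(iova, size,
						    smmu_domain->s2_cfg.vmid,
						    granule, leaf, smmu_domain);
	}
	arm_smmu_atc_inv_domain(smmu_domain, 0, iova, size);
}
-------------------------------------------------------------------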

Patch

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index d6c647e1eb01..3f0db30932bd 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -1897,7 +1897,14 @@  static void __arm_smmu_tlb_inv_range(struct arm_smmu_cmdq_ent *cmd,
 	if (!size)
 		return;
 
-	if (smmu->features & ARM_SMMU_FEAT_RANGE_INV) {
+	if (!(smmu->features & ARM_SMMU_FEAT_RANGE_INV)) {
+		/*
+		 * When the size reaches a threshold, replace per-granule TLBI
+		 * commands with one single per-asid or per-vmid TLBI command.
+		 */
+		if (size >= granule * smmu_domain->max_tlbi_ops)
+			return arm_smmu_tlb_inv_domain(smmu_domain);
+	} else {
 		/* Get the leaf page size */
 		tg = __ffs(smmu_domain->domain.pgsize_bitmap);
 
@@ -2258,6 +2265,7 @@  static int arm_smmu_domain_finalise(struct iommu_domain *domain,
 	}
 
 	smmu_domain->pgtbl_ops = pgtbl_ops;
+	smmu_domain->max_tlbi_ops = pgtbl_cfg.nents_per_pgtable;
 	return 0;
 }
 
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index dcab85698a4e..f68c95a2e24f 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -721,6 +721,7 @@  struct arm_smmu_domain {
 	struct io_pgtable_ops		*pgtbl_ops;
 	bool				stall_enabled;
 	atomic_t			nr_ats_masters;
+	unsigned long			max_tlbi_ops;
 
 	enum arm_smmu_domain_stage	stage;
 	union {