Message ID | 3ba332e141102d31b756326cdc4078cac1f5ab1c.1692693557.git.nicolinc@nvidia.com |
---|---|
State | New, archived |
Series | iommu/arm-smmu-v3: Reduce latency in __arm_smmu_tlb_inv_range() |
On 2023-08-22 09:45, Nicolin Chen wrote:
> When receiving an __arm_smmu_tlb_inv_range() call with a large size, there
> could be a long latency at this function call: one part is coming from a
> large software overhead in the routine of building commands, and the other
> part is coming from CMDQ hardware consuming the large number of commands.
> This latency could be significantly large on an SMMU that does not support
> range invalidation commands, i.e. no ARM_SMMU_FEAT_RANGE_INV.
>
> One way to optimize this is to replace a large number of VA invalidation
> commands with one single per-asid or per-vmid invalidation command, if an
> invalidation size reaches a preset threshold using the number of entries
> per io-pgtable, similar to the MAX_TLBI_OPS in arm64's tlbflush.h.
>
> Add a max_tlbi_ops in arm_smmu_domain, and convert a large number of per-
> granule TLBI commands to one single per-asid or per-vmid TLBI command, on
> SMMUs without ARM_SMMU_FEAT_RANGE_INV.
>
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> ---
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 10 +++++++++-
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  1 +
>  2 files changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index d6c647e1eb01..3f0db30932bd 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -1897,7 +1897,14 @@ static void __arm_smmu_tlb_inv_range(struct arm_smmu_cmdq_ent *cmd,
>  	if (!size)
>  		return;
>
> -	if (smmu->features & ARM_SMMU_FEAT_RANGE_INV) {
> +	if (!(smmu->features & ARM_SMMU_FEAT_RANGE_INV)) {
> +		/*
> +		 * When the size reaches a threshold, replace per-granule TLBI
> +		 * commands with one single per-asid or per-vmid TLBI command.
> +		 */
> +		if (size >= granule * smmu_domain->max_tlbi_ops)
> +			return arm_smmu_tlb_inv_domain(smmu_domain);

This looks like it's at the wrong level - we should have figured this out
before we got as far as low-level command-building. I'd have thought it
would be a case of short-circuiting directly from
arm_smmu_tlb_inv_range_domain() to arm_smmu_tlb_inv_context().

> +	} else {
>  		/* Get the leaf page size */
>  		tg = __ffs(smmu_domain->domain.pgsize_bitmap);
>
> @@ -2258,6 +2265,7 @@ static int arm_smmu_domain_finalise(struct iommu_domain *domain,
>  	}
>
>  	smmu_domain->pgtbl_ops = pgtbl_ops;
> +	smmu_domain->max_tlbi_ops = pgtbl_cfg.nents_per_pgtable;

And now we're carrying *three* copies of the same information around
everywhere? Honestly, just pull cfg->bits_per_level out of the
io_pgtable_ops at the point where you need it, like the pagetable code
itself manages to do perfectly happily. Wrap it in an io-pgtable helper
if you think that's cleaner.

Thanks,
Robin.

>  	return 0;
>  }
>
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> index dcab85698a4e..f68c95a2e24f 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> @@ -721,6 +721,7 @@ struct arm_smmu_domain {
>  	struct io_pgtable_ops *pgtbl_ops;
>  	bool stall_enabled;
>  	atomic_t nr_ats_masters;
> +	unsigned long max_tlbi_ops;
>
>  	enum arm_smmu_domain_stage stage;
>  	union {
On Tue, Aug 22, 2023 at 10:30:35AM +0100, Robin Murphy wrote:
> > diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > index d6c647e1eb01..3f0db30932bd 100644
> > --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > @@ -1897,7 +1897,14 @@ static void __arm_smmu_tlb_inv_range(struct arm_smmu_cmdq_ent *cmd,
> >  	if (!size)
> >  		return;
> >
> > -	if (smmu->features & ARM_SMMU_FEAT_RANGE_INV) {
> > +	if (!(smmu->features & ARM_SMMU_FEAT_RANGE_INV)) {
> > +		/*
> > +		 * When the size reaches a threshold, replace per-granule TLBI
> > +		 * commands with one single per-asid or per-vmid TLBI command.
> > +		 */
> > +		if (size >= granule * smmu_domain->max_tlbi_ops)
> > +			return arm_smmu_tlb_inv_domain(smmu_domain);
>
> This looks like it's at the wrong level - we should have figured this
> out before we got as far as low-level command-building. I'd have thought
> it would be a case of short-circuiting directly from
> arm_smmu_tlb_inv_range_domain() to arm_smmu_tlb_inv_context().

OK, I could do that. We would have copies of this same routine
though. Also, the shortcut applies to !ARM_SMMU_FEAT_RANGE_INV
cases only, so this function feels convenient to me.

> > +	} else {
> >  		/* Get the leaf page size */
> >  		tg = __ffs(smmu_domain->domain.pgsize_bitmap);
> >
> > @@ -2258,6 +2265,7 @@ static int arm_smmu_domain_finalise(struct iommu_domain *domain,
> >  	}
> >
> >  	smmu_domain->pgtbl_ops = pgtbl_ops;
> > +	smmu_domain->max_tlbi_ops = pgtbl_cfg.nents_per_pgtable;
>
> And now we're carrying *three* copies of the same information around
> everywhere? Honestly, just pull cfg->bits_per_level out of the
> io_pgtable_ops at the point where you need it, like the pagetable code
> itself manages to do perfectly happily. Wrap it in an io-pgtable helper
> if you think that's cleaner.

OK. I overlooked io_pgtable_ops_to_pgtable. Will do that.

Thanks
Nic
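(Aside for readers following the thread: io_pgtable_ops_to_pgtable() is the existing container_of() wrapper in include/linux/io-pgtable.h, so the per-level fan-out can be derived on demand instead of being cached in the domain. A minimal sketch of that pattern follows — the arm_smmu_max_tlbi_ops() helper name is hypothetical, not something in the tree:)

```c
#include <linux/io-pgtable.h>	/* io_pgtable_ops_to_pgtable(), struct io_pgtable_cfg */

/*
 * Hypothetical helper: derive the TLBI threshold from the live io-pgtable
 * config on demand, rather than carrying another copy in arm_smmu_domain.
 */
static unsigned long arm_smmu_max_tlbi_ops(struct arm_smmu_domain *smmu_domain)
{
	struct io_pgtable_cfg *cfg =
		&io_pgtable_ops_to_pgtable(smmu_domain->pgtbl_ops)->cfg;

	/* Entries per table level, i.e. one full leaf table's worth of TLBIs */
	return 1UL << cfg->bits_per_level;
}
```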
Hi Robin,

On Tue, Aug 22, 2023 at 09:32:26AM -0700, Nicolin Chen wrote:
> On Tue, Aug 22, 2023 at 10:30:35AM +0100, Robin Murphy wrote:
>
> > > diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > > index d6c647e1eb01..3f0db30932bd 100644
> > > --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > > +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > > @@ -1897,7 +1897,14 @@ static void __arm_smmu_tlb_inv_range(struct arm_smmu_cmdq_ent *cmd,
> > >  	if (!size)
> > >  		return;
> > >
> > > -	if (smmu->features & ARM_SMMU_FEAT_RANGE_INV) {
> > > +	if (!(smmu->features & ARM_SMMU_FEAT_RANGE_INV)) {
> > > +		/*
> > > +		 * When the size reaches a threshold, replace per-granule TLBI
> > > +		 * commands with one single per-asid or per-vmid TLBI command.
> > > +		 */
> > > +		if (size >= granule * smmu_domain->max_tlbi_ops)
> > > +			return arm_smmu_tlb_inv_domain(smmu_domain);
> >
> > This looks like it's at the wrong level - we should have figured this
> > out before we got as far as low-level command-building. I'd have thought
> > it would be a case of short-circuiting directly from
> > arm_smmu_tlb_inv_range_domain() to arm_smmu_tlb_inv_context().
>
> OK, I could do that. We would have copies of this same routine
> though. Also, the shortcut applies to !ARM_SMMU_FEAT_RANGE_INV
> cases only, so this function feels convenient to me.

I was trying to say that we would need the same piece in both
arm_smmu_tlb_inv_range_domain() and arm_smmu_tlb_inv_range_asid(),
though the latter one only needs to call arm_smmu_tlb_inv_asid().

Also, arm_smmu_tlb_inv_context() does a full range ATC invalidation
instead of a given range like what arm_smmu_tlb_inv_range_domain()
currently does. So, it might be a bit overkill.

Combining all your comments, we'd have something like this:

-------------------------------------------------------------------
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 7614739ea2c1..2967a6634c7c 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -1937,12 +1937,22 @@ static void arm_smmu_tlb_inv_range_domain(unsigned long iova, size_t size,
 					  size_t granule, bool leaf,
 					  struct arm_smmu_domain *smmu_domain)
 {
+	struct io_pgtable_cfg *cfg =
+		&io_pgtable_ops_to_pgtable(smmu_domain->pgtbl_ops)->cfg;
 	struct arm_smmu_cmdq_ent cmd = {
 		.tlbi = {
 			.leaf = leaf,
 		},
 	};
 
+	/*
+	 * If the given size is too large that would end up with too many TLBI
+	 * commands in CMDQ, short circuit directly to a full invalidation
+	 */
+	if (!(smmu_domain->smmu->features & ARM_SMMU_FEAT_RANGE_INV) &&
+	    size >= granule * (1UL << cfg->bits_per_level))
+		return arm_smmu_tlb_inv_context(smmu_domain);
+
 	if (smmu_domain->stage == ARM_SMMU_DOMAIN_S1) {
 		cmd.opcode	= smmu_domain->smmu->features & ARM_SMMU_FEAT_E2H ?
 				  CMDQ_OP_TLBI_EL2_VA : CMDQ_OP_TLBI_NH_VA;
@@ -1964,6 +1974,8 @@ void arm_smmu_tlb_inv_range_asid(unsigned long iova, size_t size, int asid,
 				 size_t granule, bool leaf,
 				 struct arm_smmu_domain *smmu_domain)
 {
+	struct io_pgtable_cfg *cfg =
+		&io_pgtable_ops_to_pgtable(smmu_domain->pgtbl_ops)->cfg;
 	struct arm_smmu_cmdq_ent cmd = {
 		.opcode	= smmu_domain->smmu->features & ARM_SMMU_FEAT_E2H ?
 			  CMDQ_OP_TLBI_EL2_VA : CMDQ_OP_TLBI_NH_VA,
@@ -1973,6 +1985,14 @@ void arm_smmu_tlb_inv_range_asid(unsigned long iova, size_t size, int asid,
 		},
 	};
 
+	/*
+	 * If the given size is too large that would end up with too many TLBI
+	 * commands in CMDQ, short circuit directly to a full invalidation
+	 */
+	if (!(smmu_domain->smmu->features & ARM_SMMU_FEAT_RANGE_INV) &&
+	    size >= granule * (1UL << cfg->bits_per_level))
+		return arm_smmu_tlb_inv_asid(smmu_domain->smmu, asid);
+
 	__arm_smmu_tlb_inv_range(&cmd, iova, size, granule, smmu_domain);
 }
-------------------------------------------------------------------

You're sure that you prefer this, right?

Thanks
Nicolin
On 2023-08-23 00:04, Nicolin Chen wrote:
> Hi Robin,
>
> On Tue, Aug 22, 2023 at 09:32:26AM -0700, Nicolin Chen wrote:
>> On Tue, Aug 22, 2023 at 10:30:35AM +0100, Robin Murphy wrote:
>>
>>>> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>>>> index d6c647e1eb01..3f0db30932bd 100644
>>>> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>>>> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>>>> @@ -1897,7 +1897,14 @@ static void __arm_smmu_tlb_inv_range(struct arm_smmu_cmdq_ent *cmd,
>>>>  	if (!size)
>>>>  		return;
>>>>
>>>> -	if (smmu->features & ARM_SMMU_FEAT_RANGE_INV) {
>>>> +	if (!(smmu->features & ARM_SMMU_FEAT_RANGE_INV)) {
>>>> +		/*
>>>> +		 * When the size reaches a threshold, replace per-granule TLBI
>>>> +		 * commands with one single per-asid or per-vmid TLBI command.
>>>> +		 */
>>>> +		if (size >= granule * smmu_domain->max_tlbi_ops)
>>>> +			return arm_smmu_tlb_inv_domain(smmu_domain);
>>>
>>> This looks like it's at the wrong level - we should have figured this
>>> out before we got as far as low-level command-building. I'd have thought
>>> it would be a case of short-circuiting directly from
>>> arm_smmu_tlb_inv_range_domain() to arm_smmu_tlb_inv_context().
>>
>> OK, I could do that. We would have copies of this same routine
>> though. Also, the shortcut applies to !ARM_SMMU_FEAT_RANGE_INV
>> cases only, so this function feels convenient to me.
>
> I was trying to say that we would need the same piece in both
> arm_smmu_tlb_inv_range_domain() and arm_smmu_tlb_inv_range_asid(),
> though the latter one only needs to call arm_smmu_tlb_inv_asid().

It's not like arm_smmu_tlb_inv_range_asid() doesn't already massively
overlap with arm_smmu_tlb_inv_range_domain() anyway, so a little further
duplication hardly seems like it would hurt. Checking
ARM_SMMU_FEAT_RANGE_INV should be cheap (otherwise we'd really want to
split __arm_smmu_tlb_inv_range() into separate RIL vs. non-RIL versions
to avoid having it in the loop), and it makes the intent clear. What I
just really don't like is a flow where we construct a specific command,
then call the low-level function to issue it, only that function then
actually jumps back out into another high-level function which
constructs a *different* command. This code is already a maze of twisty
little passages...

> Also, arm_smmu_tlb_inv_context() does a full range ATC invalidation
> instead of a given range like what arm_smmu_tlb_inv_range_domain()
> currently does. So, it might be a bit overkill.
>
> Combining all your comments, we'd have something like this:

TBH I'd be inclined to refactor a bit harder, maybe break out some
VMID-based helpers for orthogonality, and aim for a flow like:

	if (over threshold)
		tlb_inv_domain()
	else if (stage 1)
		tlb_inv_range_asid()
	else
		tlb_inv_range_vmid()
	atc_inv_range()

or possibly if you prefer:

	if (stage 1) {
		if (over threshold)
			tlb_inv_asid()
		else
			tlb_inv_range_asid()
	} else {
		if (over threshold)
			tlb_inv_vmid()
		else
			tlb_inv_range_vmid()
	}
	atc_inv_range()

where the latter maybe trades more verbosity for less duplication
overall - I'd probably have to try both to see which looks nicer in the
end. And obviously if there's any chance of inventing a clear and
consistent naming scheme in the process, that would be lovely.

Thanks,
Robin.

> -------------------------------------------------------------------
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index 7614739ea2c1..2967a6634c7c 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -1937,12 +1937,22 @@ static void arm_smmu_tlb_inv_range_domain(unsigned long iova, size_t size,
>  					  size_t granule, bool leaf,
>  					  struct arm_smmu_domain *smmu_domain)
>  {
> +	struct io_pgtable_cfg *cfg =
> +		&io_pgtable_ops_to_pgtable(smmu_domain->pgtbl_ops)->cfg;
>  	struct arm_smmu_cmdq_ent cmd = {
>  		.tlbi = {
>  			.leaf = leaf,
>  		},
>  	};
>
> +	/*
> +	 * If the given size is too large that would end up with too many TLBI
> +	 * commands in CMDQ, short circuit directly to a full invalidation
> +	 */
> +	if (!(smmu_domain->smmu->features & ARM_SMMU_FEAT_RANGE_INV) &&
> +	    size >= granule * (1UL << cfg->bits_per_level))
> +		return arm_smmu_tlb_inv_context(smmu_domain);
> +
>  	if (smmu_domain->stage == ARM_SMMU_DOMAIN_S1) {
>  		cmd.opcode	= smmu_domain->smmu->features & ARM_SMMU_FEAT_E2H ?
>  				  CMDQ_OP_TLBI_EL2_VA : CMDQ_OP_TLBI_NH_VA;
> @@ -1964,6 +1974,8 @@ void arm_smmu_tlb_inv_range_asid(unsigned long iova, size_t size, int asid,
>  				 size_t granule, bool leaf,
>  				 struct arm_smmu_domain *smmu_domain)
>  {
> +	struct io_pgtable_cfg *cfg =
> +		&io_pgtable_ops_to_pgtable(smmu_domain->pgtbl_ops)->cfg;
>  	struct arm_smmu_cmdq_ent cmd = {
>  		.opcode	= smmu_domain->smmu->features & ARM_SMMU_FEAT_E2H ?
>  			  CMDQ_OP_TLBI_EL2_VA : CMDQ_OP_TLBI_NH_VA,
> @@ -1973,6 +1985,14 @@ void arm_smmu_tlb_inv_range_asid(unsigned long iova, size_t size, int asid,
>  		},
>  	};
>
> +	/*
> +	 * If the given size is too large that would end up with too many TLBI
> +	 * commands in CMDQ, short circuit directly to a full invalidation
> +	 */
> +	if (!(smmu_domain->smmu->features & ARM_SMMU_FEAT_RANGE_INV) &&
> +	    size >= granule * (1UL << cfg->bits_per_level))
> +		return arm_smmu_tlb_inv_asid(smmu_domain->smmu, asid);
> +
>  	__arm_smmu_tlb_inv_range(&cmd, iova, size, granule, smmu_domain);
>  }
>
> -------------------------------------------------------------------
>
> You're sure that you prefer this, right?
>
> Thanks
> Nicolin
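(For illustration, the second flow Robin sketches might be spelled out roughly as below. This is a sketch only, not a tested patch: arm_smmu_tlb_inv_vmid() and arm_smmu_tlb_inv_range_vmid() are the hypothetical VMID-based helpers he proposes breaking out, while the threshold term follows Nicolin's diff above and the remaining calls exist in the driver at this point:)

```c
/*
 * Sketch of the "over threshold" flow with hypothetical VMID helpers;
 * arm_smmu_tlb_inv_vmid() and arm_smmu_tlb_inv_range_vmid() do not
 * exist in the tree at the time of this thread.
 */
static void arm_smmu_tlb_inv_range_domain(unsigned long iova, size_t size,
					  size_t granule, bool leaf,
					  struct arm_smmu_domain *smmu_domain)
{
	struct io_pgtable_cfg *cfg =
		&io_pgtable_ops_to_pgtable(smmu_domain->pgtbl_ops)->cfg;
	bool over_threshold =
		!(smmu_domain->smmu->features & ARM_SMMU_FEAT_RANGE_INV) &&
		size >= granule * (1UL << cfg->bits_per_level);

	if (smmu_domain->stage == ARM_SMMU_DOMAIN_S1) {
		int asid = smmu_domain->s1_cfg.cd.asid;

		if (over_threshold)
			arm_smmu_tlb_inv_asid(smmu_domain->smmu, asid);
		else
			arm_smmu_tlb_inv_range_asid(iova, size, asid,
						    granule, leaf, smmu_domain);
	} else {
		if (over_threshold)
			arm_smmu_tlb_inv_vmid(smmu_domain);	/* hypothetical */
		else
			arm_smmu_tlb_inv_range_vmid(iova, size,	/* hypothetical */
						    granule, leaf, smmu_domain);
	}

	/* ATC invalidation still covers the requested range either way */
	arm_smmu_atc_inv_domain(smmu_domain, 0, iova, size);
}
```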
On Tue, Aug 29, 2023 at 11:40:29PM +0100, Robin Murphy wrote:
> > > > > -	if (smmu->features & ARM_SMMU_FEAT_RANGE_INV) {
> > > > > +	if (!(smmu->features & ARM_SMMU_FEAT_RANGE_INV)) {
> > > > > +		/*
> > > > > +		 * When the size reaches a threshold, replace per-granule TLBI
> > > > > +		 * commands with one single per-asid or per-vmid TLBI command.
> > > > > +		 */
> > > > > +		if (size >= granule * smmu_domain->max_tlbi_ops)
> > > > > +			return arm_smmu_tlb_inv_domain(smmu_domain);
> > > >
> > > > This looks like it's at the wrong level - we should have figured this
> > > > out before we got as far as low-level command-building. I'd have thought
> > > > it would be a case of short-circuiting directly from
> > > > arm_smmu_tlb_inv_range_domain() to arm_smmu_tlb_inv_context().
> > >
> > > OK, I could do that. We would have copies of this same routine
> > > though. Also, the shortcut applies to !ARM_SMMU_FEAT_RANGE_INV
> > > cases only, so this function feels convenient to me.
> >
> > I was trying to say that we would need the same piece in both
> > arm_smmu_tlb_inv_range_domain() and arm_smmu_tlb_inv_range_asid(),
> > though the latter one only needs to call arm_smmu_tlb_inv_asid().
>
> It's not like arm_smmu_tlb_inv_range_asid() doesn't already massively
> overlap with arm_smmu_tlb_inv_range_domain() anyway, so a little further
> duplication hardly seems like it would hurt. Checking
> ARM_SMMU_FEAT_RANGE_INV should be cheap (otherwise we'd really want to
> split __arm_smmu_tlb_inv_range() into separate RIL vs. non-RIL versions
> to avoid having it in the loop), and it makes the intent clear. What I
> just really don't like is a flow where we construct a specific command,
> then call the low-level function to issue it, only that function then
> actually jumps back out into another high-level function which
> constructs a *different* command. This code is already a maze of twisty
> little passages...

OK. That sounds convincing to me. We can do it at the higher level.

> > Also, arm_smmu_tlb_inv_context() does a full range ATC invalidation
> > instead of a given range like what arm_smmu_tlb_inv_range_domain()
> > currently does. So, it might be a bit overkill.
> >
> > Combining all your comments, we'd have something like this:
>
> TBH I'd be inclined to refactor a bit harder, maybe break out some
> VMID-based helpers for orthogonality, and aim for a flow like:
>
>	if (over threshold)
>		tlb_inv_domain()
>	else if (stage 1)
>		tlb_inv_range_asid()
>	else
>		tlb_inv_range_vmid()
>	atc_inv_range()
>
> or possibly if you prefer:
>
>	if (stage 1) {
>		if (over threshold)
>			tlb_inv_asid()
>		else
>			tlb_inv_range_asid()
>	} else {
>		if (over threshold)
>			tlb_inv_vmid()
>		else
>			tlb_inv_range_vmid()
>	}
>	atc_inv_range()
>
> where the latter maybe trades more verbosity for less duplication
> overall - I'd probably have to try both to see which looks nicer in the
> end. And obviously if there's any chance of inventing a clear and
> consistent naming scheme in the process, that would be lovely.

Oh, I just replied to you in another email, asking for a refactor,
though that didn't include the over-threshold part yet. Anyway, I
think I got your point now and will bear in mind that we want
something clean overall with less duplication among the "inv"
functions. Let me try some rewriting. And we can see how it looks
after.

Thanks
Nicolin
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index d6c647e1eb01..3f0db30932bd 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -1897,7 +1897,14 @@ static void __arm_smmu_tlb_inv_range(struct arm_smmu_cmdq_ent *cmd,
 	if (!size)
 		return;
 
-	if (smmu->features & ARM_SMMU_FEAT_RANGE_INV) {
+	if (!(smmu->features & ARM_SMMU_FEAT_RANGE_INV)) {
+		/*
+		 * When the size reaches a threshold, replace per-granule TLBI
+		 * commands with one single per-asid or per-vmid TLBI command.
+		 */
+		if (size >= granule * smmu_domain->max_tlbi_ops)
+			return arm_smmu_tlb_inv_domain(smmu_domain);
+	} else {
 		/* Get the leaf page size */
 		tg = __ffs(smmu_domain->domain.pgsize_bitmap);
 
@@ -2258,6 +2265,7 @@ static int arm_smmu_domain_finalise(struct iommu_domain *domain,
 	}
 
 	smmu_domain->pgtbl_ops = pgtbl_ops;
+	smmu_domain->max_tlbi_ops = pgtbl_cfg.nents_per_pgtable;
 	return 0;
 }
 
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index dcab85698a4e..f68c95a2e24f 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -721,6 +721,7 @@ struct arm_smmu_domain {
 	struct io_pgtable_ops *pgtbl_ops;
 	bool stall_enabled;
 	atomic_t nr_ats_masters;
+	unsigned long max_tlbi_ops;
 
 	enum arm_smmu_domain_stage stage;
 	union {
When receiving an __arm_smmu_tlb_inv_range() call with a large size, there
could be a long latency at this function call: one part is coming from a
large software overhead in the routine of building commands, and the other
part is coming from CMDQ hardware consuming the large number of commands.
This latency could be significantly large on an SMMU that does not support
range invalidation commands, i.e. no ARM_SMMU_FEAT_RANGE_INV.

One way to optimize this is to replace a large number of VA invalidation
commands with one single per-asid or per-vmid invalidation command, if an
invalidation size reaches a preset threshold using the number of entries
per io-pgtable, similar to the MAX_TLBI_OPS in arm64's tlbflush.h.

Add a max_tlbi_ops in arm_smmu_domain, and convert a large number of per-
granule TLBI commands to one single per-asid or per-vmid TLBI command, on
SMMUs without ARM_SMMU_FEAT_RANGE_INV.

Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 10 +++++++++-
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  1 +
 2 files changed, 10 insertions(+), 1 deletion(-)
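(For context, the MAX_TLBI_OPS heuristic in arm64's tlbflush.h that the commit message cites works roughly as below — a paraphrased sketch from memory, not the verbatim kernel code:)

```c
/*
 * Paraphrase of the arm64 pattern in arch/arm64/include/asm/tlbflush.h:
 * if a range would take more per-granule TLBIs than one page table holds
 * entries, a single ASID-wide flush is considered cheaper.
 */
#define MAX_TLBI_OPS	PTRS_PER_PTE

static inline void flush_tlb_range_sketch(struct vm_area_struct *vma,
					  unsigned long start,
					  unsigned long end,
					  unsigned long stride)
{
	if ((end - start) >= (MAX_TLBI_OPS * stride)) {
		flush_tlb_mm(vma->vm_mm);	/* one per-ASID invalidation */
		return;
	}

	/* ...otherwise loop, issuing one TLBI VA per stride... */
}
```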