iommu/mediatek: Move the tlb_sync into tlb_flush

Message ID 1569822142-14303-1-git-send-email-yong.wu@mediatek.com
State New, archived
Series iommu/mediatek: Move the tlb_sync into tlb_flush

Commit Message

Yong Wu (吴勇) Sept. 30, 2019, 5:42 a.m. UTC
Commit 4d689b619445 ("iommu/io-pgtable-arm-v7s: Convert to IOMMU API
TLB sync") moved the tlb_sync of unmap out of v7s into the IOMMU
framework, adding a new function "mtk_iommu_iotlb_sync". But that
function does not take dom->pgtlock, so the variable "tlb_flush_active"
may be changed unexpectedly, and we can see this warning log randomly:

mtk-iommu 10205000.iommu: Partial TLB flush timed out, falling back to
full flush

To fix this issue, we could add dom->pgtlock in "mtk_iommu_iotlb_sync".
But while checking this issue, we found that __arm_v7s_unmap calls
io_pgtable_tlb_add_flush consecutively for a supersection/largepage,
which is also potentially unsafe for us: there is no tlb flush queue in
the MediaTek M4U HW, and the HW always expects tlb_flush and tlb_sync
one by one, a sequence v7s doesn't always guarantee. Thus, in this
patch I move the tlb_sync into tlb_flush (also renaming the function to
drop "_nosync"). We don't care whether it is a leaf, so the callback
functions are rearranged accordingly. Also, since the tlb flush/sync is
already finished in v7s, iotlb_sync and iotlb_sync_all become
unnecessary.

Besides, there are two minor changes:
a) Use writel for the register F_MMU_INV_RANGE, which triggers the HW
work. We expect all the settings (iova_start/iova_end...) to have been
completed before F_MMU_INV_RANGE is written.
b) Reduce the tlb timeout value from 100000us to 1000us. The original
value is so long that it affects multimedia performance.

Fixes: 4d689b619445 ("iommu/io-pgtable-arm-v7s: Convert to IOMMU API TLB sync")
Signed-off-by: Chao Hao <chao.hao@mediatek.com>
Signed-off-by: Yong Wu <yong.wu@mediatek.com>
---
This patch looks like it breaks the split between tlb_flush and
tlb_sync. I'm not sure whether that is reasonable. If someone has
concerns, I could instead:
a) Add dom->pgtlock in mtk_iommu_iotlb_sync.
b) Add an io_pgtable_tlb_sync in [1].

[1]
https://elixir.bootlin.com/linux/v5.3-rc1/source/drivers/iommu/io-pgtable-arm-v7s.c#L655

This patch is rebased on Joerg's mediatek-smmu-merge branch, which has
mt8183 and Will's "Rework IOMMU API to allow for batching of invalidation".
---
 drivers/iommu/mtk_iommu.c | 74 ++++++++++++-----------------------------------
 drivers/iommu/mtk_iommu.h |  1 -
 2 files changed, 19 insertions(+), 56 deletions(-)

Comments

Will Deacon Sept. 30, 2019, 12:09 p.m. UTC | #1
On Mon, Sep 30, 2019 at 01:42:22PM +0800, Yong Wu wrote:
> Commit 4d689b619445 ("iommu/io-pgtable-arm-v7s: Convert to IOMMU API
> TLB sync") moved the tlb_sync of unmap out of v7s into the IOMMU
> framework, adding a new function "mtk_iommu_iotlb_sync". But that
> function does not take dom->pgtlock, so the variable "tlb_flush_active"
> may be changed unexpectedly, and we can see this warning log randomly:
> 
> mtk-iommu 10205000.iommu: Partial TLB flush timed out, falling back to
> full flush
> 
> To fix this issue, we could add dom->pgtlock in "mtk_iommu_iotlb_sync".
> But while checking this issue, we found that __arm_v7s_unmap calls
> io_pgtable_tlb_add_flush consecutively for a supersection/largepage,
> which is also potentially unsafe for us: there is no tlb flush queue in
> the MediaTek M4U HW, and the HW always expects tlb_flush and tlb_sync
> one by one, a sequence v7s doesn't always guarantee. Thus, in this
> patch I move the tlb_sync into tlb_flush (also renaming the function to
> drop "_nosync"). We don't care whether it is a leaf, so the callback
> functions are rearranged accordingly. Also, since the tlb flush/sync is
> already finished in v7s, iotlb_sync and iotlb_sync_all become
> unnecessary.
> 
> Besides, there are two minor changes:
> a) Use writel for the register F_MMU_INV_RANGE, which triggers the HW
> work. We expect all the settings (iova_start/iova_end...) to have been
> completed before F_MMU_INV_RANGE is written.
> b) Reduce the tlb timeout value from 100000us to 1000us. The original
> value is so long that it affects multimedia performance.
> 
> Fixes: 4d689b619445 ("iommu/io-pgtable-arm-v7s: Convert to IOMMU API TLB sync")
> Signed-off-by: Chao Hao <chao.hao@mediatek.com>
> Signed-off-by: Yong Wu <yong.wu@mediatek.com>
> ---
> This patch looks like it breaks the split between tlb_flush and
> tlb_sync. I'm not sure whether that is reasonable. If someone has
> concerns, I could instead:
> a) Add dom->pgtlock in mtk_iommu_iotlb_sync.
> b) Add an io_pgtable_tlb_sync in [1].

The patch looks ok to me, but please could you split it up so that the
timeout and writel are done separately?

Thanks,

Will
Tomasz Figa Oct. 2, 2019, 5:18 a.m. UTC | #2
Hi Yong,

On Mon, Sep 30, 2019 at 2:42 PM Yong Wu <yong.wu@mediatek.com> wrote:
>
> Commit 4d689b619445 ("iommu/io-pgtable-arm-v7s: Convert to IOMMU API
> TLB sync") moved the tlb_sync of unmap out of v7s into the IOMMU
> framework, adding a new function "mtk_iommu_iotlb_sync". But that
> function does not take dom->pgtlock, so the variable "tlb_flush_active"
> may be changed unexpectedly, and we can see this warning log randomly:
>

Thanks for the patch! Please see my comments inline.

> mtk-iommu 10205000.iommu: Partial TLB flush timed out, falling back to
> full flush
>
> To fix this issue, we could add dom->pgtlock in "mtk_iommu_iotlb_sync".
> But while checking this issue, we found that __arm_v7s_unmap calls
> io_pgtable_tlb_add_flush consecutively for a supersection/largepage,
> which is also potentially unsafe for us: there is no tlb flush queue in
> the MediaTek M4U HW, and the HW always expects tlb_flush and tlb_sync
> one by one, a sequence v7s doesn't always guarantee. Thus, in this
> patch I move the tlb_sync into tlb_flush (also renaming the function to
> drop "_nosync"). We don't care whether it is a leaf, so the callback
> functions are rearranged accordingly. Also, since the tlb flush/sync is
> already finished in v7s, iotlb_sync and iotlb_sync_all become
> unnecessary.

Performance-wise, we could do much better. Instead of synchronously
syncing at the end of mtk_iommu_tlb_add_flush(), we could sync at the
beginning, if there was any previous flush still pending. We would
also have to keep the .iotlb_sync() callback, to take care of waiting
for the last flush. That would allow better pipelining with CPU in
cases like this:

for (all pages in range) {
   change page table();
   flush();
}

"change page table()" could execute while the IOMMU is flushing the
previous change.
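A rough sketch of that deferred-sync flow (untested; it keeps today's
tlb_flush_active flag, which would still need proper locking, and
mtk_iommu_tlb_flush_range_nosync is a made-up name for the register
writes that mtk_iommu_tlb_add_flush_nosync does today):

static void mtk_iommu_tlb_add_flush_nosync(unsigned long iova, size_t size,
					   size_t granule, bool leaf,
					   void *cookie)
{
	struct mtk_iommu_data *data = cookie;

	/* Wait for any still-pending flush before issuing a new one. */
	if (data->tlb_flush_active)
		mtk_iommu_tlb_sync(cookie);

	mtk_iommu_tlb_flush_range_nosync(iova, size, cookie);
	data->tlb_flush_active = true;
}

The .iotlb_sync() callback would stay as-is and only wait for the last
outstanding flush.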

>
> Besides, there are two minor changes:
> a) Use writel for the register F_MMU_INV_RANGE, which triggers the HW
> work. We expect all the settings (iova_start/iova_end...) to have been
> completed before F_MMU_INV_RANGE is written.
> b) Reduce the tlb timeout value from 100000us to 1000us. The original
> value is so long that it affects multimedia performance.

By definition, timeout is something that should not normally happen.
Too long timeout affecting multimedia performance would suggest that
the timeout was actually happening, which is the core problem, not the
length of the timeout. Could you provide more details on this?

>
> Fixes: 4d689b619445 ("iommu/io-pgtable-arm-v7s: Convert to IOMMU API TLB sync")
> Signed-off-by: Chao Hao <chao.hao@mediatek.com>
> Signed-off-by: Yong Wu <yong.wu@mediatek.com>
> ---
> This patch looks like it breaks the split between tlb_flush and
> tlb_sync. I'm not sure whether that is reasonable. If someone has
> concerns, I could instead:
> a) Add dom->pgtlock in mtk_iommu_iotlb_sync.
> b) Add an io_pgtable_tlb_sync in [1].
>
> [1]
> https://elixir.bootlin.com/linux/v5.3-rc1/source/drivers/iommu/io-pgtable-arm-v7s.c#L655
>
> This patch is rebased on Joerg's mediatek-smmu-merge branch, which has
> mt8183 and Will's "Rework IOMMU API to allow for batching of invalidation".
> ---
>  drivers/iommu/mtk_iommu.c | 74 ++++++++++++-----------------------------------
>  drivers/iommu/mtk_iommu.h |  1 -
>  2 files changed, 19 insertions(+), 56 deletions(-)
>
> diff --git a/drivers/iommu/mtk_iommu.c b/drivers/iommu/mtk_iommu.c
> index 6066272..e13cc56 100644
> --- a/drivers/iommu/mtk_iommu.c
> +++ b/drivers/iommu/mtk_iommu.c
> @@ -173,11 +173,12 @@ static void mtk_iommu_tlb_flush_all(void *cookie)
>         }
>  }
>
> -static void mtk_iommu_tlb_add_flush_nosync(unsigned long iova, size_t size,
> -                                          size_t granule, bool leaf,
> -                                          void *cookie)
> +static void mtk_iommu_tlb_add_flush(unsigned long iova, size_t size,
> +                                   size_t granule, void *cookie)
>  {
>         struct mtk_iommu_data *data = cookie;
> +       int ret;
> +       u32 tmp;
>
>         for_each_m4u(data) {
>                 writel_relaxed(F_INVLD_EN1 | F_INVLD_EN0,
> @@ -186,25 +187,15 @@ static void mtk_iommu_tlb_add_flush_nosync(unsigned long iova, size_t size,
>                 writel_relaxed(iova, data->base + REG_MMU_INVLD_START_A);
>                 writel_relaxed(iova + size - 1,
>                                data->base + REG_MMU_INVLD_END_A);
> -               writel_relaxed(F_MMU_INV_RANGE,
> -                              data->base + REG_MMU_INVALIDATE);
> -               data->tlb_flush_active = true;
> -       }
> -}
> -
> -static void mtk_iommu_tlb_sync(void *cookie)
> -{
> -       struct mtk_iommu_data *data = cookie;
> -       int ret;
> -       u32 tmp;
> -
> -       for_each_m4u(data) {
> -               /* Avoid timing out if there's nothing to wait for */
> -               if (!data->tlb_flush_active)
> -                       return;
> +               writel(F_MMU_INV_RANGE, data->base + REG_MMU_INVALIDATE);
>
> +               /*
> +                * There is no tlb flush queue in the HW, the HW always expect
> +                * tlb_flush and tlb_sync one by one. Here tlb_sync always
> +                * follows tlb_flush to avoid break the sequence.
> +                */
>                 ret = readl_poll_timeout_atomic(data->base + REG_MMU_CPE_DONE,
> -                                               tmp, tmp != 0, 10, 100000);
> +                                               tmp, tmp != 0, 10, 1000);
>                 if (ret) {
>                         dev_warn(data->dev,
>                                  "Partial TLB flush timed out, falling back to full flush\n");
> @@ -212,36 +203,21 @@ static void mtk_iommu_tlb_sync(void *cookie)
>                 }
>                 /* Clear the CPE status */
>                 writel_relaxed(0, data->base + REG_MMU_CPE_DONE);
> -               data->tlb_flush_active = false;
>         }
>  }
>
> -static void mtk_iommu_tlb_flush_walk(unsigned long iova, size_t size,
> -                                    size_t granule, void *cookie)
> +static void mtk_iommu_tlb_flush_page(struct iommu_iotlb_gather *gather,
> +                                    unsigned long iova, size_t granule,
> +                                    void *cookie)
>  {
> -       mtk_iommu_tlb_add_flush_nosync(iova, size, granule, false, cookie);
> -       mtk_iommu_tlb_sync(cookie);
> -}
> -
> -static void mtk_iommu_tlb_flush_leaf(unsigned long iova, size_t size,
> -                                    size_t granule, void *cookie)
> -{
> -       mtk_iommu_tlb_add_flush_nosync(iova, size, granule, true, cookie);
> -       mtk_iommu_tlb_sync(cookie);
> -}
> -
> -static void mtk_iommu_tlb_flush_page_nosync(struct iommu_iotlb_gather *gather,
> -                                           unsigned long iova, size_t granule,
> -                                           void *cookie)
> -{
> -       mtk_iommu_tlb_add_flush_nosync(iova, granule, granule, true, cookie);
> +       mtk_iommu_tlb_add_flush(iova, granule, granule, cookie);
>  }
>
>  static const struct iommu_flush_ops mtk_iommu_flush_ops = {
>         .tlb_flush_all = mtk_iommu_tlb_flush_all,
> -       .tlb_flush_walk = mtk_iommu_tlb_flush_walk,
> -       .tlb_flush_leaf = mtk_iommu_tlb_flush_leaf,
> -       .tlb_add_page = mtk_iommu_tlb_flush_page_nosync,
> +       .tlb_flush_walk = mtk_iommu_tlb_add_flush,
> +       .tlb_flush_leaf = mtk_iommu_tlb_add_flush,
> +       .tlb_add_page = mtk_iommu_tlb_flush_page,
>  };
>
>  static irqreturn_t mtk_iommu_isr(int irq, void *dev_id)
> @@ -445,17 +421,6 @@ static size_t mtk_iommu_unmap(struct iommu_domain *domain,
>         return unmapsz;
>  }
>
> -static void mtk_iommu_flush_iotlb_all(struct iommu_domain *domain)
> -{
> -       mtk_iommu_tlb_sync(mtk_iommu_get_m4u_data());
> -}
> -
> -static void mtk_iommu_iotlb_sync(struct iommu_domain *domain,
> -                                struct iommu_iotlb_gather *gather)
> -{
> -       mtk_iommu_tlb_sync(mtk_iommu_get_m4u_data());
> -}
> -
>  static phys_addr_t mtk_iommu_iova_to_phys(struct iommu_domain *domain,
>                                           dma_addr_t iova)
>  {
> @@ -574,8 +539,7 @@ static int mtk_iommu_of_xlate(struct device *dev, struct of_phandle_args *args)
>         .detach_dev     = mtk_iommu_detach_device,
>         .map            = mtk_iommu_map,
>         .unmap          = mtk_iommu_unmap,
> -       .flush_iotlb_all = mtk_iommu_flush_iotlb_all,

Don't we still want .flush_iotlb_all()? I think it should be more
efficient in some cases than doing a big number of single flushes.
(That said, the previous implementation didn't do any flush at all. It
just waited for previously queued flushes to happen. Was that
expected?)

Best regards,
Tomasz
Robin Murphy Oct. 2, 2019, 10:35 a.m. UTC | #3
On 02/10/2019 06:18, Tomasz Figa wrote:
> Hi Yong,
> 
> On Mon, Sep 30, 2019 at 2:42 PM Yong Wu <yong.wu@mediatek.com> wrote:
>>
>> Commit 4d689b619445 ("iommu/io-pgtable-arm-v7s: Convert to IOMMU API
>> TLB sync") moved the tlb_sync of unmap out of v7s into the IOMMU
>> framework, adding a new function "mtk_iommu_iotlb_sync". But that
>> function does not take dom->pgtlock, so the variable "tlb_flush_active"
>> may be changed unexpectedly, and we can see this warning log randomly:
>>
> 
> Thanks for the patch! Please see my comments inline.
> 
>> mtk-iommu 10205000.iommu: Partial TLB flush timed out, falling back to
>> full flush
>>
>> To fix this issue, we could add dom->pgtlock in "mtk_iommu_iotlb_sync".
>> But while checking this issue, we found that __arm_v7s_unmap calls
>> io_pgtable_tlb_add_flush consecutively for a supersection/largepage,
>> which is also potentially unsafe for us: there is no tlb flush queue in
>> the MediaTek M4U HW, and the HW always expects tlb_flush and tlb_sync
>> one by one, a sequence v7s doesn't always guarantee. Thus, in this
>> patch I move the tlb_sync into tlb_flush (also renaming the function to
>> drop "_nosync"). We don't care whether it is a leaf, so the callback
>> functions are rearranged accordingly. Also, since the tlb flush/sync is
>> already finished in v7s, iotlb_sync and iotlb_sync_all become
>> unnecessary.
> 
> Performance-wise, we could do much better. Instead of synchronously
> syncing at the end of mtk_iommu_tlb_add_flush(), we could sync at the
> beginning, if there was any previous flush still pending. We would
> also have to keep the .iotlb_sync() callback, to take care of waiting
> for the last flush. That would allow better pipelining with CPU in
> cases like this:
> 
> for (all pages in range) {
>     change page table();
>     flush();
> }
> 
> "change page table()" could execute while the IOMMU is flushing the
> previous change.

FWIW, given that the underlying invalidation mechanism is range-based, 
this driver would be an ideal candidate for making use of the new 
iommu_gather mechanism. As a fix for stable, though, simply ensuring 
that add_flush syncs any pending invalidation before issuing a new one 
sounds like a good idea (and probably a simpler patch too).
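For illustration, a gather-based sync could collapse the whole batch
into one range invalidation, along these lines (sketch only;
mtk_iommu_tlb_flush_range_sync stands for the combined flush+sync
helper this patch effectively creates):

static void mtk_iommu_iotlb_sync(struct iommu_domain *domain,
				 struct iommu_iotlb_gather *gather)
{
	struct mtk_iommu_data *data = mtk_iommu_get_m4u_data();
	size_t length = gather->end - gather->start;

	/* iommu_iotlb_gather_init() leaves start at ULONG_MAX. */
	if (gather->start == ULONG_MAX)
		return;

	/* One range invalidation covers everything gathered so far. */
	mtk_iommu_tlb_flush_range_sync(gather->start, length,
				       gather->pgsize, data);
}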

[...]
>> @@ -574,8 +539,7 @@ static int mtk_iommu_of_xlate(struct device *dev, struct of_phandle_args *args)
>>          .detach_dev     = mtk_iommu_detach_device,
>>          .map            = mtk_iommu_map,
>>          .unmap          = mtk_iommu_unmap,
>> -       .flush_iotlb_all = mtk_iommu_flush_iotlb_all,
> 
> Don't we still want .flush_iotlb_all()? I think it should be more
> efficient in some cases than doing a big number of single flushes.
> (That said, the previous implementation didn't do any flush at all. It
> just waited for previously queued flushes to happen. Was that
> expected?)

Commit 07fdef34d2be ("iommu/arm-smmu-v3: Implement flush_iotlb_all 
hook") has an explanation of what the deal was there - similarly, it's 
probably worth this driver implementing it properly as well now (but 
that's really a separate patch).

Robin.
Yong Wu (吴勇) Oct. 8, 2019, 8:09 a.m. UTC | #4
Hi Tomasz,

Sorry for the late reply.

On Wed, 2019-10-02 at 14:18 +0900, Tomasz Figa wrote:
> Hi Yong,
> 
> On Mon, Sep 30, 2019 at 2:42 PM Yong Wu <yong.wu@mediatek.com> wrote:
> >
> > > Commit 4d689b619445 ("iommu/io-pgtable-arm-v7s: Convert to IOMMU API
> > > TLB sync") moved the tlb_sync of unmap out of v7s into the IOMMU
> > > framework, adding a new function "mtk_iommu_iotlb_sync". But that
> > > function does not take dom->pgtlock, so the variable "tlb_flush_active"
> > > may be changed unexpectedly, and we can see this warning log randomly:
> > >
> >
> > Thanks for the patch! Please see my comments inline.
> >
> > > mtk-iommu 10205000.iommu: Partial TLB flush timed out, falling back to
> > > full flush
> > >
> > > To fix this issue, we could add dom->pgtlock in "mtk_iommu_iotlb_sync".
> > > But while checking this issue, we found that __arm_v7s_unmap calls
> > > io_pgtable_tlb_add_flush consecutively for a supersection/largepage,
> > > which is also potentially unsafe for us: there is no tlb flush queue in
> > > the MediaTek M4U HW, and the HW always expects tlb_flush and tlb_sync
> > > one by one, a sequence v7s doesn't always guarantee. Thus, in this
> > > patch I move the tlb_sync into tlb_flush (also renaming the function to
> > > drop "_nosync"). We don't care whether it is a leaf, so the callback
> > > functions are rearranged accordingly. Also, since the tlb flush/sync is
> > > already finished in v7s, iotlb_sync and iotlb_sync_all become
> > > unnecessary.
> 
> Performance-wise, we could do much better. Instead of synchronously
> syncing at the end of mtk_iommu_tlb_add_flush(), we could sync at the
> beginning, if there was any previous flush still pending. We would
> also have to keep the .iotlb_sync() callback, to take care of waiting
> for the last flush. That would allow better pipelining with CPU in
> cases like this:
> 
> for (all pages in range) {
>    change page table();
>    flush();
> }
> 
> "change page table()" could execute while the IOMMU is flushing the
> previous change.

Do you mean adding a new tlb_sync before tlb_flush_no_sync, like below:

mtk_iommu_tlb_add_flush_nosync {   
   + mtk_iommu_tlb_sync();
   tlb_flush_no_sync();
   data->tlb_flush_active = true;
}

mtk_iommu_tlb_sync {
	if (!data->tlb_flush_active)
		return;
	tlb_sync();
	data->tlb_flush_active = false;
}

This does look like it improves the flow. But adjusting the flow is not
the root cause of this issue: the problem is that "data->tlb_flush_active"
may be changed from mtk_iommu_iotlb_sync, which doesn't hold dom->pgtlock.

Currently the synchronisation of the tlb_flush/tlb_sync flow is
controlled by the variable "data->tlb_flush_active".

Putting tlb_flush/tlb_sync together, as this patch does, makes the flow
simpler:
a) We don't need the sensitive variable "tlb_flush_active".
b) mtk_iommu_iotlb_sync is removed, so no lock needs to be added in it.
c) tlb_flush_walk/tlb_flush_leaf are simplified.
Is that OK?

> 
> >
> > > Besides, there are two minor changes:
> > > a) Use writel for the register F_MMU_INV_RANGE, which triggers the HW
> > > work. We expect all the settings (iova_start/iova_end...) to have been
> > > completed before F_MMU_INV_RANGE is written.
> > > b) Reduce the tlb timeout value from 100000us to 1000us. The original
> > > value is so long that it affects multimedia performance.
> 
> By definition, timeout is something that should not normally happen.
> Too long timeout affecting multimedia performance would suggest that
> the timeout was actually happening, which is the core problem, not the
> length of the timeout. Could you provide more details on this?

As described above, this issue occurs because there is no dom->pgtlock
in mtk_iommu_iotlb_sync. I have verified that the issue disappears
after adding the lock there.

Although the issue is fixed by this patch, I would still like to reduce
the timeout value in case some error happens in the future. 100ms is
unnecessarily long for us. It is a minor improvement rather than a fix
for this issue, so I will put it in a separate patch.

> 
> >
> > Fixes: 4d689b619445 ("iommu/io-pgtable-arm-v7s: Convert to IOMMU API TLB sync")
> > Signed-off-by: Chao Hao <chao.hao@mediatek.com>
> > Signed-off-by: Yong Wu <yong.wu@mediatek.com>
> > ---
> > > This patch looks like it breaks the split between tlb_flush and
> > > tlb_sync. I'm not sure whether that is reasonable. If someone has
> > > concerns, I could instead:
> > > a) Add dom->pgtlock in mtk_iommu_iotlb_sync.
> > > b) Add an io_pgtable_tlb_sync in [1].
> >
> > [1]
> > https://elixir.bootlin.com/linux/v5.3-rc1/source/drivers/iommu/io-pgtable-arm-v7s.c#L655
> >
> > > This patch is rebased on Joerg's mediatek-smmu-merge branch, which has
> > > mt8183 and Will's "Rework IOMMU API to allow for batching of invalidation".
> > ---
> >  drivers/iommu/mtk_iommu.c | 74 ++++++++++++-----------------------------------
> >  drivers/iommu/mtk_iommu.h |  1 -
> >  2 files changed, 19 insertions(+), 56 deletions(-)
> >
> > diff --git a/drivers/iommu/mtk_iommu.c b/drivers/iommu/mtk_iommu.c
> > index 6066272..e13cc56 100644
> > --- a/drivers/iommu/mtk_iommu.c
> > +++ b/drivers/iommu/mtk_iommu.c
> > @@ -173,11 +173,12 @@ static void mtk_iommu_tlb_flush_all(void *cookie)
> >         }
> >  }
> >
> > -static void mtk_iommu_tlb_add_flush_nosync(unsigned long iova, size_t size,
> > -                                          size_t granule, bool leaf,
> > -                                          void *cookie)
> > +static void mtk_iommu_tlb_add_flush(unsigned long iova, size_t size,
> > +                                   size_t granule, void *cookie)
> >  {
> >         struct mtk_iommu_data *data = cookie;
> > +       int ret;
> > +       u32 tmp;
> >
> >         for_each_m4u(data) {
> >                 writel_relaxed(F_INVLD_EN1 | F_INVLD_EN0,
> > @@ -186,25 +187,15 @@ static void mtk_iommu_tlb_add_flush_nosync(unsigned long iova, size_t size,
> >                 writel_relaxed(iova, data->base + REG_MMU_INVLD_START_A);
> >                 writel_relaxed(iova + size - 1,
> >                                data->base + REG_MMU_INVLD_END_A);
> > -               writel_relaxed(F_MMU_INV_RANGE,
> > -                              data->base + REG_MMU_INVALIDATE);
> > -               data->tlb_flush_active = true;
> > -       }
> > -}
> > -
> > -static void mtk_iommu_tlb_sync(void *cookie)
> > -{
> > -       struct mtk_iommu_data *data = cookie;
> > -       int ret;
> > -       u32 tmp;
> > -
> > -       for_each_m4u(data) {
> > -               /* Avoid timing out if there's nothing to wait for */
> > -               if (!data->tlb_flush_active)
> > -                       return;
> > +               writel(F_MMU_INV_RANGE, data->base + REG_MMU_INVALIDATE);
> >
> > +               /*
> > +                * There is no tlb flush queue in the HW, the HW always expect
> > +                * tlb_flush and tlb_sync one by one. Here tlb_sync always
> > +                * follows tlb_flush to avoid break the sequence.
> > +                */
> >                 ret = readl_poll_timeout_atomic(data->base + REG_MMU_CPE_DONE,
> > -                                               tmp, tmp != 0, 10, 100000);
> > +                                               tmp, tmp != 0, 10, 1000);
> >                 if (ret) {
> >                         dev_warn(data->dev,
> >                                  "Partial TLB flush timed out, falling back to full flush\n");
> > @@ -212,36 +203,21 @@ static void mtk_iommu_tlb_sync(void *cookie)
> >                 }
> >                 /* Clear the CPE status */
> >                 writel_relaxed(0, data->base + REG_MMU_CPE_DONE);
> > -               data->tlb_flush_active = false;
> >         }
> >  }
> >
> > -static void mtk_iommu_tlb_flush_walk(unsigned long iova, size_t size,
> > -                                    size_t granule, void *cookie)
> > +static void mtk_iommu_tlb_flush_page(struct iommu_iotlb_gather *gather,
> > +                                    unsigned long iova, size_t granule,
> > +                                    void *cookie)
> >  {
> > -       mtk_iommu_tlb_add_flush_nosync(iova, size, granule, false, cookie);
> > -       mtk_iommu_tlb_sync(cookie);
> > -}
> > -
> > -static void mtk_iommu_tlb_flush_leaf(unsigned long iova, size_t size,
> > -                                    size_t granule, void *cookie)
> > -{
> > -       mtk_iommu_tlb_add_flush_nosync(iova, size, granule, true, cookie);
> > -       mtk_iommu_tlb_sync(cookie);
> > -}
> > -
> > -static void mtk_iommu_tlb_flush_page_nosync(struct iommu_iotlb_gather *gather,
> > -                                           unsigned long iova, size_t granule,
> > -                                           void *cookie)
> > -{
> > -       mtk_iommu_tlb_add_flush_nosync(iova, granule, granule, true, cookie);
> > +       mtk_iommu_tlb_add_flush(iova, granule, granule, cookie);
> >  }
> >
> >  static const struct iommu_flush_ops mtk_iommu_flush_ops = {
> >         .tlb_flush_all = mtk_iommu_tlb_flush_all,
> > -       .tlb_flush_walk = mtk_iommu_tlb_flush_walk,
> > -       .tlb_flush_leaf = mtk_iommu_tlb_flush_leaf,
> > -       .tlb_add_page = mtk_iommu_tlb_flush_page_nosync,
> > +       .tlb_flush_walk = mtk_iommu_tlb_add_flush,
> > +       .tlb_flush_leaf = mtk_iommu_tlb_add_flush,
> > +       .tlb_add_page = mtk_iommu_tlb_flush_page,
> >  };
> >
> >  static irqreturn_t mtk_iommu_isr(int irq, void *dev_id)
> > @@ -445,17 +421,6 @@ static size_t mtk_iommu_unmap(struct iommu_domain *domain,
> >         return unmapsz;
> >  }
> >
> > -static void mtk_iommu_flush_iotlb_all(struct iommu_domain *domain)
> > -{
> > -       mtk_iommu_tlb_sync(mtk_iommu_get_m4u_data());
> > -}
> > -
> > -static void mtk_iommu_iotlb_sync(struct iommu_domain *domain,
> > -                                struct iommu_iotlb_gather *gather)
> > -{
> > -       mtk_iommu_tlb_sync(mtk_iommu_get_m4u_data());
> > -}
> > -
> >  static phys_addr_t mtk_iommu_iova_to_phys(struct iommu_domain *domain,
> >                                           dma_addr_t iova)
> >  {
> > @@ -574,8 +539,7 @@ static int mtk_iommu_of_xlate(struct device *dev, struct of_phandle_args *args)
> >         .detach_dev     = mtk_iommu_detach_device,
> >         .map            = mtk_iommu_map,
> >         .unmap          = mtk_iommu_unmap,
> > -       .flush_iotlb_all = mtk_iommu_flush_iotlb_all,
> 
> Don't we still want .flush_iotlb_all()? I think it should be more
> efficient in some cases than doing a big number of single flushes.
> (That said, the previous implementation didn't do any flush at all. It
> just waited for previously queued flushes to happen. Was that
> expected?)

I will keep the flush_iotlb_all.

Thanks.

> 
> Best regards,
> Tomasz
Yong Wu (吴勇) Oct. 8, 2019, 8:09 a.m. UTC | #5
On Wed, 2019-10-02 at 11:35 +0100, Robin Murphy wrote:
> On 02/10/2019 06:18, Tomasz Figa wrote:
> > Hi Yong,
> > 
> > On Mon, Sep 30, 2019 at 2:42 PM Yong Wu <yong.wu@mediatek.com> wrote:
> >>
> >> Commit 4d689b619445 ("iommu/io-pgtable-arm-v7s: Convert to IOMMU API
> >> TLB sync") moved the tlb_sync of unmap out of v7s into the IOMMU
> >> framework, adding a new function "mtk_iommu_iotlb_sync". But that
> >> function does not take dom->pgtlock, so the variable "tlb_flush_active"
> >> may be changed unexpectedly, and we can see this warning log randomly:
> >>
> > 
> > Thanks for the patch! Please see my comments inline.
> > 
> >> mtk-iommu 10205000.iommu: Partial TLB flush timed out, falling back to
> >> full flush
> >>
> >> To fix this issue, we could add dom->pgtlock in "mtk_iommu_iotlb_sync".
> >> But while checking this issue, we found that __arm_v7s_unmap calls
> >> io_pgtable_tlb_add_flush consecutively for a supersection/largepage,
> >> which is also potentially unsafe for us: there is no tlb flush queue in
> >> the MediaTek M4U HW, and the HW always expects tlb_flush and tlb_sync
> >> one by one, a sequence v7s doesn't always guarantee. Thus, in this
> >> patch I move the tlb_sync into tlb_flush (also renaming the function to
> >> drop "_nosync"). We don't care whether it is a leaf, so the callback
> >> functions are rearranged accordingly. Also, since the tlb flush/sync is
> >> already finished in v7s, iotlb_sync and iotlb_sync_all become
> >> unnecessary.
> > 
> > Performance-wise, we could do much better. Instead of synchronously
> > syncing at the end of mtk_iommu_tlb_add_flush(), we could sync at the
> > beginning, if there was any previous flush still pending. We would
> > also have to keep the .iotlb_sync() callback, to take care of waiting
> > for the last flush. That would allow better pipelining with CPU in
> > cases like this:
> > 
> > for (all pages in range) {
> >     change page table();
> >     flush();
> > }
> > 
> > "change page table()" could execute while the IOMMU is flushing the
> > previous change.
> 
> FWIW, given that the underlying invalidation mechanism is range-based, 
> this driver would be an ideal candidate for making use of the new 
> iommu_gather mechanism. As a fix for stable, though, simply ensuring 
> that add_flush syncs any pending invalidation before issuing a new one 
> sounds like a good idea (and probably a simpler patch too).

Thanks very much for the confirmation.

> 
> [...]
> >> @@ -574,8 +539,7 @@ static int mtk_iommu_of_xlate(struct device *dev, struct of_phandle_args *args)
> >>          .detach_dev     = mtk_iommu_detach_device,
> >>          .map            = mtk_iommu_map,
> >>          .unmap          = mtk_iommu_unmap,
> >> -       .flush_iotlb_all = mtk_iommu_flush_iotlb_all,
> > 
> > Don't we still want .flush_iotlb_all()? I think it should be more
> > efficient in some cases than doing a big number of single flushes.
> > (That said, the previous implementation didn't do any flush at all. It
> > just waited for previously queued flushes to happen. Was that
> > expected?)
> 
> Commit 07fdef34d2be ("iommu/arm-smmu-v3: Implement flush_iotlb_all 
> hook") has an explanation of what the deal was there - similarly, it's 
> probably worth this driver implementing it properly as well now (but 
> that's really a separate patch).

Thanks for the hint. At first I noticed that we don't have
DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE, so I thought flush_iotlb_all was
unnecessary.

After grepping, though, iommu_flush_tlb_all is also called in the
x_direct_mapping path, so we still need it.

If I put this in a new patch (switching flush_iotlb_all to
mtk_iommu_tlb_flush_all), is the Fixes tag (commit 4d689b619445)
needed?
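i.e. roughly this in the follow-up patch (sketch only):

static void mtk_iommu_flush_iotlb_all(struct iommu_domain *domain)
{
	/* Really invalidate the whole TLB instead of only syncing. */
	mtk_iommu_tlb_flush_all(mtk_iommu_get_m4u_data());
}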

> 
> Robin.
> 
Yong Wu (吴勇) Oct. 8, 2019, 8:18 a.m. UTC | #6
On Mon, 2019-09-30 at 13:09 +0100, Will Deacon wrote:
> On Mon, Sep 30, 2019 at 01:42:22PM +0800, Yong Wu wrote:
> > Commit 4d689b619445 ("iommu/io-pgtable-arm-v7s: Convert to IOMMU API
> > TLB sync") moved the tlb_sync of unmap out of v7s into the IOMMU
> > framework, adding a new function "mtk_iommu_iotlb_sync". But that
> > function does not take dom->pgtlock, so the variable "tlb_flush_active"
> > may be changed unexpectedly, and we can see this warning log randomly:
> > 
> > mtk-iommu 10205000.iommu: Partial TLB flush timed out, falling back to
> > full flush
> > 
> > To fix this issue, we could add dom->pgtlock in "mtk_iommu_iotlb_sync".
> > But while checking this issue, we found that __arm_v7s_unmap calls
> > io_pgtable_tlb_add_flush consecutively for a supersection/largepage,
> > which is also potentially unsafe for us: there is no tlb flush queue in
> > the MediaTek M4U HW, and the HW always expects tlb_flush and tlb_sync
> > one by one, a sequence v7s doesn't always guarantee. Thus, in this
> > patch I move the tlb_sync into tlb_flush (also renaming the function to
> > drop "_nosync"). We don't care whether it is a leaf, so the callback
> > functions are rearranged accordingly. Also, since the tlb flush/sync is
> > already finished in v7s, iotlb_sync and iotlb_sync_all become
> > unnecessary.
> > 
> > Besides, there are two minor changes:
> > a) Use writel for the register F_MMU_INV_RANGE, which triggers the HW
> > work. We expect all the settings (iova_start/iova_end...) to have been
> > completed before F_MMU_INV_RANGE is written.
> > b) Reduce the tlb timeout value from 100000us to 1000us. The original
> > value is so long that it affects multimedia performance.
> > 
> > Fixes: 4d689b619445 ("iommu/io-pgtable-arm-v7s: Convert to IOMMU API TLB sync")
> > Signed-off-by: Chao Hao <chao.hao@mediatek.com>
> > Signed-off-by: Yong Wu <yong.wu@mediatek.com>
> > ---
> > This patch looks like it breaks the split between tlb_flush and
> > tlb_sync. I'm not sure whether that is reasonable. If someone has
> > concerns, I could instead:
> > a) Add dom->pgtlock in mtk_iommu_iotlb_sync.
> > b) Add an io_pgtable_tlb_sync in [1].
> 
> The patch looks ok to me, but please could you split it up so that the
> timeout and writel are done separately?

Thanks for the quick review, I will separate them.

> 
> Thanks,
> 
> Will
Tomasz Figa Oct. 9, 2019, 7:56 a.m. UTC | #7
On Tue, Oct 8, 2019 at 5:09 PM Yong Wu <yong.wu@mediatek.com> wrote:
>
> Hi Tomasz,
>
> Sorry for the late reply.
>
> On Wed, 2019-10-02 at 14:18 +0900, Tomasz Figa wrote:
> > Hi Yong,
> >
> > On Mon, Sep 30, 2019 at 2:42 PM Yong Wu <yong.wu@mediatek.com> wrote:
> > >
> > > Commit 4d689b619445 ("iommu/io-pgtable-arm-v7s: Convert to IOMMU API
> > > TLB sync") moved the tlb_sync of unmap out of v7s into the IOMMU
> > > framework, adding a new function "mtk_iommu_iotlb_sync". But that
> > > function does not take dom->pgtlock, so the variable "tlb_flush_active"
> > > may be changed unexpectedly, and we can see this warning log randomly:
> > >
> >
> > Thanks for the patch! Please see my comments inline.
> >
> > > mtk-iommu 10205000.iommu: Partial TLB flush timed out, falling back to
> > > full flush
> > >
> > > To fix this issue, we could add dom->pgtlock in "mtk_iommu_iotlb_sync".
> > > But while checking this issue, we found that __arm_v7s_unmap calls
> > > io_pgtable_tlb_add_flush consecutively for a supersection/largepage,
> > > which is also potentially unsafe for us: there is no tlb flush queue in
> > > the MediaTek M4U HW, and the HW always expects tlb_flush and tlb_sync
> > > one by one, a sequence v7s doesn't always guarantee. Thus, in this
> > > patch I move the tlb_sync into tlb_flush (also renaming the function to
> > > drop "_nosync"). We don't care whether it is a leaf, so the callback
> > > functions are rearranged accordingly. Also, since the tlb flush/sync is
> > > already finished in v7s, iotlb_sync and iotlb_sync_all become
> > > unnecessary.
> >
> > Performance-wise, we could do much better. Instead of synchronously
> > syncing at the end of mtk_iommu_tlb_add_flush(), we could sync at the
> > beginning, if there was any previous flush still pending. We would
> > also have to keep the .iotlb_sync() callback, to take care of waiting
> > for the last flush. That would allow better pipelining with CPU in
> > cases like this:
> >
> > for (all pages in range) {
> >    change page table();
> >    flush();
> > }
> >
> > "change page table()" could execute while the IOMMU is flushing the
> > previous change.
>
> Do you mean adding a new tlb_sync before tlb_flush_no_sync, like below:
>
> mtk_iommu_tlb_add_flush_nosync {
>    + mtk_iommu_tlb_sync();
>    tlb_flush_no_sync();
>    data->tlb_flush_active = true;
> }
>
> mtk_iommu_tlb_sync {
>         if (!data->tlb_flush_active)
>                 return;
>         tlb_sync();
>         data->tlb_flush_active = false;
> }
>
> This does look like it improves the flow. But adjusting the flow is not
> the root cause of this issue: the problem is that "data->tlb_flush_active"
> may be changed from mtk_iommu_iotlb_sync, which doesn't hold dom->pgtlock.

That was not the only problem with existing code. Existing code also
assumed that add_flush and sync always go in pairs, but that's not
true.

My suggestion is to fix the locking in the driver and keep the sync
deferred as much as possible, so that performance is not degraded. I
changed my mind, though. I think we would need to make more changes to
the driver to make it implement the flushing efficiently, so let's go
with the current simple approach for now and improve incrementally.

>
> Currently the synchronisation of the tlb_flush/tlb_sync flow is
> controlled by the variable "data->tlb_flush_active".
>
> Putting tlb_flush/tlb_sync together, as this patch does, makes the flow
> simpler:
> a) We don't need the sensitive variable "tlb_flush_active".
> b) mtk_iommu_iotlb_sync is removed, so no lock needs to be added in it.
> c) tlb_flush_walk/tlb_flush_leaf are simplified.
> Is that OK?
>

Okay, let's do so as a first step to fix the issue. Then we can
optimize in follow up patches.

> >
> > >
> > > Besides, there are two minor changes:
> > > a) Use writel for the register F_MMU_INV_RANGE, which triggers the HW
> > > work. We expect all the settings (iova_start/iova_end...) to have been
> > > completed before F_MMU_INV_RANGE is written.
> > > b) Reduce the tlb timeout value from 100000us to 1000us. The original
> > > value is so long that it affects multimedia performance.
> >
> > By definition, timeout is something that should not normally happen.
> > Too long timeout affecting multimedia performance would suggest that
> > the timeout was actually happening, which is the core problem, not the
> > length of the timeout. Could you provide more details on this?
>
> As described above, this issue occurs because there is no dom->pgtlock
> in mtk_iommu_iotlb_sync. I have verified that the issue disappears
> after adding the lock there.
>
> Although the issue is fixed by this patch, I would still like to reduce
> the timeout value in case some error happens in the future. 100ms is
> unnecessarily long for us. It is a minor improvement rather than a fix
> for this issue, so I will put it in a separate patch.
>

Okay, makes sense.

Best regards,
Tomasz
Yong Wu (吴勇) Oct. 9, 2019, 1:38 p.m. UTC | #8
On Wed, 2019-10-09 at 16:56 +0900, Tomasz Figa wrote:
> On Tue, Oct 8, 2019 at 5:09 PM Yong Wu <yong.wu@mediatek.com> wrote:
> >
> > Hi Tomasz,
> >
> > Sorry for the late reply.
> >
> > On Wed, 2019-10-02 at 14:18 +0900, Tomasz Figa wrote:
> > > Hi Yong,
> > >
> > > On Mon, Sep 30, 2019 at 2:42 PM Yong Wu <yong.wu@mediatek.com> wrote:
> > > >
> > > > Commit 4d689b619445 ("iommu/io-pgtable-arm-v7s: Convert to IOMMU API
> > > > TLB sync") moved the tlb_sync of unmap out of v7s into the IOMMU
> > > > framework, adding a new function "mtk_iommu_iotlb_sync". But that
> > > > function does not take dom->pgtlock, so the variable "tlb_flush_active"
> > > > may be changed unexpectedly, and we can see this warning log randomly:
> > > >
> > >
> > > Thanks for the patch! Please see my comments inline.
> > >
> > > > mtk-iommu 10205000.iommu: Partial TLB flush timed out, falling back to
> > > > full flush
> > > >
> > > > To fix this issue, we could add dom->pgtlock in "mtk_iommu_iotlb_sync".
> > > > But while checking this issue, we found that __arm_v7s_unmap calls
> > > > io_pgtable_tlb_add_flush consecutively for a supersection/largepage,
> > > > which is also potentially unsafe for us: there is no tlb flush queue in
> > > > the MediaTek M4U HW, and the HW always expects tlb_flush and tlb_sync
> > > > one by one, a sequence v7s doesn't always guarantee. Thus, in this
> > > > patch I move the tlb_sync into tlb_flush (also renaming the function to
> > > > drop "_nosync"). We don't care whether it is a leaf, so the callback
> > > > functions are rearranged accordingly. Also, since the tlb flush/sync is
> > > > already finished in v7s, iotlb_sync and iotlb_sync_all become
> > > > unnecessary.
> > >
> > > Performance-wise, we could do much better. Instead of synchronously
> > > syncing at the end of mtk_iommu_tlb_add_flush(), we could sync at the
> > > beginning, if there was any previous flush still pending. We would
> > > also have to keep the .iotlb_sync() callback, to take care of waiting
> > > for the last flush. That would allow better pipelining with CPU in
> > > cases like this:
> > >
> > > for (all pages in range) {
> > >    change page table();
> > >    flush();
> > > }
> > >
> > > "change page table()" could execute while the IOMMU is flushing the
> > > previous change.
> >
> > Do you mean adding a new tlb_sync before tlb_flush_no_sync, like below:
> >
> > mtk_iommu_tlb_add_flush_nosync {
> >    + mtk_iommu_tlb_sync();
> >    tlb_flush_no_sync();
> >    data->tlb_flush_active = true;
> > }
> >
> > mtk_iommu_tlb_sync {
> >         if (!data->tlb_flush_active)
> >                 return;
> >         tlb_sync();
> >         data->tlb_flush_active = false;
> > }
> >
> > This does look like it improves the flow. But adjusting the flow is not
> > the root cause of this issue: the problem is that "data->tlb_flush_active"
> > may be changed from mtk_iommu_iotlb_sync, which doesn't hold dom->pgtlock.
> 
> That was not the only problem with existing code. Existing code also
> assumed that add_flush and sync always go in pairs, but that's not
> true.

Yes. That's why I made tlb_flush always be followed by tlb_sync, to
make sure they always go in pairs.

> 
> My suggestion is to fix the locking in the driver and keep the sync
> deferred as much as possible, so that performance is not degraded. I

I never actually hit this timeout warning log in the previous kernel
(many tlb_flushes followed by one tlb_sync). But deferring the sync is
not recommended by our DE, so I would still like to fix the sequence in
this patch by putting them together.

> changed my mind, though. I think we would need to make more changes to
> the driver to make it implement the flushing efficiently, so let's go
> with the current simple approach for now and improve incrementally.
> 
> >
> > Currently the synchronisation of the tlb_flush/tlb_sync flow is
> > controlled by the variable "data->tlb_flush_active".
> >
> > Putting tlb_flush/tlb_sync together, as this patch does, makes the flow
> > simpler:
> > a) We don't need the sensitive variable "tlb_flush_active".
> > b) mtk_iommu_iotlb_sync is removed, so no lock needs to be added in it.
> > c) tlb_flush_walk/tlb_flush_leaf are simplified.
> > Is that OK?
> >
> 
> Okay, let's do so as a first step to fix the issue. Then we can
> optimize in follow up patches.

Thanks for confirming; I have sent a quick v2.

> 
> > >
> > > >
> > > > Besides, there are two minor changes:
> > > > a) Use writel for the register F_MMU_INV_RANGE, which triggers the HW
> > > > work. We expect all the settings (iova_start/iova_end...) to have been
> > > > completed before F_MMU_INV_RANGE is written.
> > > > b) Reduce the tlb timeout value from 100000us to 1000us. The original
> > > > value is so long that it affects multimedia performance.
> > >
> > > By definition, timeout is something that should not normally happen.
> > > Too long timeout affecting multimedia performance would suggest that
> > > the timeout was actually happening, which is the core problem, not the
> > > length of the timeout. Could you provide more details on this?
> >
> > As described above, this issue occurs because there is no dom->pgtlock
> > in mtk_iommu_iotlb_sync. I have verified that the issue disappears
> > after adding the lock there.
> >
> > Although the issue is fixed by this patch, I would still like to reduce
> > the timeout value in case some error happens in the future. 100ms is
> > unnecessarily long for us. It is a minor improvement rather than a fix
> > for this issue, so I will put it in a separate patch.
> >
> 
> Okay, makes sense.
> 
> Best regards,
> Tomasz
Tomasz Figa Oct. 9, 2019, 3:08 p.m. UTC | #9
On Wed, Oct 9, 2019 at 10:38 PM Yong Wu <yong.wu@mediatek.com> wrote:
>
> On Wed, 2019-10-09 at 16:56 +0900, Tomasz Figa wrote:
> > On Tue, Oct 8, 2019 at 5:09 PM Yong Wu <yong.wu@mediatek.com> wrote:
> > >
> > > Hi Tomasz,
> > >
> > > Sorry for the late reply.
> > >
> > > On Wed, 2019-10-02 at 14:18 +0900, Tomasz Figa wrote:
> > > > Hi Yong,
> > > >
> > > > On Mon, Sep 30, 2019 at 2:42 PM Yong Wu <yong.wu@mediatek.com> wrote:
> > > > >
> > > > > Commit 4d689b619445 ("iommu/io-pgtable-arm-v7s: Convert to IOMMU API
> > > > > TLB sync") moved the tlb_sync of unmap out of v7s into the IOMMU
> > > > > framework, adding a new function "mtk_iommu_iotlb_sync". But that
> > > > > function does not take dom->pgtlock, so the variable "tlb_flush_active"
> > > > > may be changed unexpectedly, and we can see this warning log randomly:
> > > > >
> > > >
> > > > Thanks for the patch! Please see my comments inline.
> > > >
> > > > > mtk-iommu 10205000.iommu: Partial TLB flush timed out, falling back to
> > > > > full flush
> > > > >
> > > > > To fix this issue, we could add dom->pgtlock in "mtk_iommu_iotlb_sync".
> > > > > But while checking this issue, we found that __arm_v7s_unmap calls
> > > > > io_pgtable_tlb_add_flush consecutively for a supersection/largepage,
> > > > > which is also potentially unsafe for us: there is no tlb flush queue in
> > > > > the MediaTek M4U HW, and the HW always expects tlb_flush and tlb_sync
> > > > > one by one, a sequence v7s doesn't always guarantee. Thus, in this
> > > > > patch I move the tlb_sync into tlb_flush (also renaming the function to
> > > > > drop "_nosync"). We don't care whether it is a leaf, so the callback
> > > > > functions are rearranged accordingly. Also, since the tlb flush/sync is
> > > > > already finished in v7s, iotlb_sync and iotlb_sync_all become
> > > > > unnecessary.
> > > >
> > > > Performance-wise, we could do much better. Instead of synchronously
> > > > syncing at the end of mtk_iommu_tlb_add_flush(), we could sync at the
> > > > beginning, if there was any previous flush still pending. We would
> > > > also have to keep the .iotlb_sync() callback, to take care of waiting
> > > > for the last flush. That would allow better pipelining with CPU in
> > > > cases like this:
> > > >
> > > > for (all pages in range) {
> > > >    change page table();
> > > >    flush();
> > > > }
> > > >
> > > > "change page table()" could execute while the IOMMU is flushing the
> > > > previous change.
> > >
> > > Do you mean adding a new tlb_sync before tlb_flush_no_sync, like below:
> > >
> > > mtk_iommu_tlb_add_flush_nosync {
> > >    + mtk_iommu_tlb_sync();
> > >    tlb_flush_no_sync();
> > >    data->tlb_flush_active = true;
> > > }
> > >
> > > mtk_iommu_tlb_sync {
> > >         if (!data->tlb_flush_active)
> > >                 return;
> > >         tlb_sync();
> > >         data->tlb_flush_active = false;
> > > }
> > >
> > > This does look like it improves the flow. But adjusting the flow is not
> > > the root cause of this issue: the problem is that "data->tlb_flush_active"
> > > may be changed from mtk_iommu_iotlb_sync, which doesn't hold dom->pgtlock.
> >
> > That was not the only problem with existing code. Existing code also
> > assumed that add_flush and sync always go in pairs, but that's not
> > true.
>
> Yes. That's why I made tlb_flush always be followed by tlb_sync, to
> make sure they always go in pairs.
>
> >
> > My suggestion is to fix the locking in the driver and keep the sync
> > deferred as much as possible, so that performance is not degraded. I
>
> I never actually hit this timeout warning log in the previous kernel
> (many tlb_flushes followed by one tlb_sync).

Locking issues typically lead to timing problems (race conditions), so
it might just be that the sequence or timing of calls changed between
kernel versions, enough to trigger the issue.

> But deferring the sync is not recommended by our DE, so I would still
> like to fix the sequence in this patch by putting them together.
>

What's the reason it's not suggested? From my understanding, there
shouldn't be any dependency on hardware design here, it's just a
simple software optimization - we can pipeline other CPU operations
while the IOMMU is flushing the TLB.

Basically, right now:

CPU writes page tables 1
IOMMU flushes 1 | CPU busy waits
CPU writes page tables 2
IOMMU flushes 2 | CPU busy waits
CPU writes page tables 3
IOMMU flushes 3 | CPU busy waits
...

With my suggestion that could be:

CPU writes page tables 1
IOMMU flushes 1 | CPU writes page tables 2
IOMMU flushes 1 | CPU busy waits less time
IOMMU flushes 2 | CPU writes page tables 3
IOMMU flushes 2 | CPU busy waits less time
IOMMU flushes 3 | CPU busy waits

It reduces the time the CPU spends on busy waiting rather than doing
something useful. It also reduces the total time of maps and unmaps.

Actually, we can optimize even more. Please consider the case below.

CPU writes PTE for IOVA 0x1000.
IOMMU flushes 0x1000 | CPU busy waits
CPU writes PTE for IOVA 0x2000.
IOMMU flushes 0x2000 | CPU busy waits
CPU writes PTE for IOVA 0x3000.
IOMMU flushes 0x3000 | CPU busy waits
CPU writes PTE for IOVA 0x8000.
IOMMU flushes 0x8000 | CPU busy waits

However, depending on the way the hardware implements TLB flush, the
optimal sequence of operations could be:

CPU writes PTE for IOVA 0x1000.
CPU writes PTE for IOVA 0x2000.
CPU writes PTE for IOVA 0x3000.
IOMMU flushes 0x1000-0x3000 | CPU writes PTE for IOVA 0x8000.
IOMMU flushes 0x1000-0x3000 | CPU busy waits remaining flush time.
IOMMU flushes 0x8000 | CPU busy waits.
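Merging neighbouring pages like that is mostly bookkeeping on top of
the gather structure; conceptually something like this (sketch,
simplified from the core iommu_iotlb_gather_add_page() helper;
mtk_iommu_tlb_flush_range is an illustrative name for a range
flush+sync helper):

static void mtk_iommu_tlb_add_page(struct iommu_iotlb_gather *gather,
				   unsigned long iova, size_t granule,
				   void *cookie)
{
	unsigned long end = iova + granule;

	/*
	 * Flush early only if the new page is disjoint from the current
	 * range or is mapped at a different granularity.
	 */
	if (gather->pgsize != granule ||
	    end < gather->start || iova > gather->end) {
		if (gather->pgsize)
			mtk_iommu_tlb_flush_range(gather->start,
						  gather->end - gather->start,
						  gather->pgsize, cookie);
		gather->pgsize = granule;
		gather->start = iova;
		gather->end = end;
		return;
	}

	/* Otherwise just grow the pending range. */
	gather->start = min(gather->start, iova);
	gather->end = max(gather->end, end);
}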

What's the algorithmic complexity of the TLB flush operation on
MediaTek IOMMU? If N is the number of pages to flush, is it O(N), O(1)
or something else?

Best regards,
Tomasz

Patch

diff --git a/drivers/iommu/mtk_iommu.c b/drivers/iommu/mtk_iommu.c
index 6066272..e13cc56 100644
--- a/drivers/iommu/mtk_iommu.c
+++ b/drivers/iommu/mtk_iommu.c
@@ -173,11 +173,12 @@  static void mtk_iommu_tlb_flush_all(void *cookie)
 	}
 }
 
-static void mtk_iommu_tlb_add_flush_nosync(unsigned long iova, size_t size,
-					   size_t granule, bool leaf,
-					   void *cookie)
+static void mtk_iommu_tlb_add_flush(unsigned long iova, size_t size,
+				    size_t granule, void *cookie)
 {
 	struct mtk_iommu_data *data = cookie;
+	int ret;
+	u32 tmp;
 
 	for_each_m4u(data) {
 		writel_relaxed(F_INVLD_EN1 | F_INVLD_EN0,
@@ -186,25 +187,15 @@  static void mtk_iommu_tlb_add_flush_nosync(unsigned long iova, size_t size,
 		writel_relaxed(iova, data->base + REG_MMU_INVLD_START_A);
 		writel_relaxed(iova + size - 1,
 			       data->base + REG_MMU_INVLD_END_A);
-		writel_relaxed(F_MMU_INV_RANGE,
-			       data->base + REG_MMU_INVALIDATE);
-		data->tlb_flush_active = true;
-	}
-}
-
-static void mtk_iommu_tlb_sync(void *cookie)
-{
-	struct mtk_iommu_data *data = cookie;
-	int ret;
-	u32 tmp;
-
-	for_each_m4u(data) {
-		/* Avoid timing out if there's nothing to wait for */
-		if (!data->tlb_flush_active)
-			return;
+		writel(F_MMU_INV_RANGE, data->base + REG_MMU_INVALIDATE);
 
+		/*
+		 * There is no tlb flush queue in the HW, the HW always expect
+		 * tlb_flush and tlb_sync one by one. Here tlb_sync always
+		 * follows tlb_flush to avoid break the sequence.
+		 */
 		ret = readl_poll_timeout_atomic(data->base + REG_MMU_CPE_DONE,
-						tmp, tmp != 0, 10, 100000);
+						tmp, tmp != 0, 10, 1000);
 		if (ret) {
 			dev_warn(data->dev,
 				 "Partial TLB flush timed out, falling back to full flush\n");
@@ -212,36 +203,21 @@  static void mtk_iommu_tlb_sync(void *cookie)
 		}
 		/* Clear the CPE status */
 		writel_relaxed(0, data->base + REG_MMU_CPE_DONE);
-		data->tlb_flush_active = false;
 	}
 }
 
-static void mtk_iommu_tlb_flush_walk(unsigned long iova, size_t size,
-				     size_t granule, void *cookie)
+static void mtk_iommu_tlb_flush_page(struct iommu_iotlb_gather *gather,
+				     unsigned long iova, size_t granule,
+				     void *cookie)
 {
-	mtk_iommu_tlb_add_flush_nosync(iova, size, granule, false, cookie);
-	mtk_iommu_tlb_sync(cookie);
-}
-
-static void mtk_iommu_tlb_flush_leaf(unsigned long iova, size_t size,
-				     size_t granule, void *cookie)
-{
-	mtk_iommu_tlb_add_flush_nosync(iova, size, granule, true, cookie);
-	mtk_iommu_tlb_sync(cookie);
-}
-
-static void mtk_iommu_tlb_flush_page_nosync(struct iommu_iotlb_gather *gather,
-					    unsigned long iova, size_t granule,
-					    void *cookie)
-{
-	mtk_iommu_tlb_add_flush_nosync(iova, granule, granule, true, cookie);
+	mtk_iommu_tlb_add_flush(iova, granule, granule, cookie);
 }
 
 static const struct iommu_flush_ops mtk_iommu_flush_ops = {
 	.tlb_flush_all = mtk_iommu_tlb_flush_all,
-	.tlb_flush_walk = mtk_iommu_tlb_flush_walk,
-	.tlb_flush_leaf = mtk_iommu_tlb_flush_leaf,
-	.tlb_add_page = mtk_iommu_tlb_flush_page_nosync,
+	.tlb_flush_walk = mtk_iommu_tlb_add_flush,
+	.tlb_flush_leaf = mtk_iommu_tlb_add_flush,
+	.tlb_add_page = mtk_iommu_tlb_flush_page,
 };
 
 static irqreturn_t mtk_iommu_isr(int irq, void *dev_id)
@@ -445,17 +421,6 @@  static size_t mtk_iommu_unmap(struct iommu_domain *domain,
 	return unmapsz;
 }
 
-static void mtk_iommu_flush_iotlb_all(struct iommu_domain *domain)
-{
-	mtk_iommu_tlb_sync(mtk_iommu_get_m4u_data());
-}
-
-static void mtk_iommu_iotlb_sync(struct iommu_domain *domain,
-				 struct iommu_iotlb_gather *gather)
-{
-	mtk_iommu_tlb_sync(mtk_iommu_get_m4u_data());
-}
-
 static phys_addr_t mtk_iommu_iova_to_phys(struct iommu_domain *domain,
 					  dma_addr_t iova)
 {
@@ -574,8 +539,7 @@  static int mtk_iommu_of_xlate(struct device *dev, struct of_phandle_args *args)
 	.detach_dev	= mtk_iommu_detach_device,
 	.map		= mtk_iommu_map,
 	.unmap		= mtk_iommu_unmap,
-	.flush_iotlb_all = mtk_iommu_flush_iotlb_all,
-	.iotlb_sync	= mtk_iommu_iotlb_sync,
+	/* No iotlb_sync here since the tlb_sync always follows the tlb_flush */
 	.iova_to_phys	= mtk_iommu_iova_to_phys,
 	.add_device	= mtk_iommu_add_device,
 	.remove_device	= mtk_iommu_remove_device,
diff --git a/drivers/iommu/mtk_iommu.h b/drivers/iommu/mtk_iommu.h
index fc0f16e..24712f5 100644
--- a/drivers/iommu/mtk_iommu.h
+++ b/drivers/iommu/mtk_iommu.h
@@ -57,7 +57,6 @@  struct mtk_iommu_data {
 	struct mtk_iommu_domain		*m4u_dom;
 	struct iommu_group		*m4u_group;
 	bool                            enable_4GB;
-	bool				tlb_flush_active;
 
 	struct iommu_device		iommu;
 	const struct mtk_iommu_plat_data *plat_data;