Message ID: 1463741694-1735-7-git-send-email-sricharan@codeaurora.org (mailing list archive)
State: New, archived
On Friday 20 May 2016 16:24:53 Sricharan R wrote:
> While using the generic pagetable ops the tlb maintenance
> operation gets completed in the sync callback. So use writel_relaxed
> for all register access and add a mb() at appropriate places.
>
> Signed-off-by: Sricharan R <sricharan@codeaurora.org>
> ---
>  drivers/iommu/msm_iommu.c         |  24 +++++++--
>  drivers/iommu/msm_iommu_hw-8xxx.h | 109 ++++++++++++++++++++------------------
>  2 files changed, 79 insertions(+), 54 deletions(-)
>
> diff --git a/drivers/iommu/msm_iommu.c b/drivers/iommu/msm_iommu.c
> index 0299a37..dfcaeef 100644
> --- a/drivers/iommu/msm_iommu.c
> +++ b/drivers/iommu/msm_iommu.c
> @@ -124,6 +124,9 @@ static void msm_iommu_reset(void __iomem *base, int ncb)
>  		SET_TLBLKCR(base, ctx, 0);
>  		SET_CONTEXTIDR(base, ctx, 0);
>  	}
> +
> +	/* Ensure completion of relaxed writes from the above SET macros */
> +	mb();
>  }

Why do the above use the relaxed writes? Do you have any performance
numbers showing that skipping the sync for the reset makes a measurable
difference?

How did you prove that skipping the sync before the write is safe?

How about resetting the iommu less often instead?

>  static void __flush_iotlb(void *cookie)
> @@ -141,6 +144,9 @@ static void __flush_iotlb(void *cookie)
>  		list_for_each_entry(master, &iommu->ctx_list, list)
>  			SET_CTX_TLBIALL(iommu->base, master->num, 0);
>
> +		/* To ensure completion of TLBIALL above */
> +		mb();
> +
>  		__disable_clocks(iommu);
>  	}
>  fail:

__disable_clocks() takes spinlocks and accesses the hardware, which
in turn will also involve multiple syncs, so this seems like
a premature optimization.

Have you tried just leaving the clocks enabled? I'd suspect you could
gain more performance that way.

> @@ -181,7 +187,8 @@ fail:
>
>  static void __flush_iotlb_sync(void *cookie)
>  {
> -	/* To avoid a null function pointer */
> +	/* To ensure completion of the TLBIVA in __flush_iotlb_range */
> +	mb();
>  }

I don't understand the specific race from the comment.

What operation comes after this that relies on __flush_iotlb_range
having completed, and how does an mb() guarantee this?

This seems to be a bug fix that is unrelated to the change to
use writel_relaxed(), so better split it out into a separate
patch, with a longer changeset description. Did you spot this
race by running into incorrect data, or by inspection?

>  static const struct iommu_gather_ops msm_iommu_gather_ops = {
> @@ -235,6 +242,9 @@ static void config_mids(struct msm_iommu_dev *iommu,
>  		/* Set security bit override to be Non-secure */
>  		SET_NSCFG(iommu->base, mid, 3);
>  	}
> +
> +	/* Ensure completion of relaxed writes from the above SET macros */
> +	mb();
>  }

This looks similar to the msm_iommu_reset() case. Maybe put those
two into one patch, or find a way to run config_mids() less often so
it doesn't show up in the profiles any more.

If you hit this often enough that it causes performance problems,
it's more likely that there is a bug in your code than that changing
to relaxed accessors is a good idea.

>  static void __reset_context(void __iomem *base, int ctx)
> @@ -257,6 +267,9 @@ static void __reset_context(void __iomem *base, int ctx)
>  	SET_TLBFLPTER(base, ctx, 0);
>  	SET_TLBSLPTER(base, ctx, 0);
>  	SET_TLBLKCR(base, ctx, 0);
> +
> +	/* Ensure completion of relaxed writes from the above SET macros */
> +	mb();
>  }
>
>  static void __program_context(void __iomem *base, int ctx,
> @@ -305,6 +318,9 @@ static void __program_context(void __iomem *base, int ctx,
>
>  	/* Enable the MMU */
>  	SET_M(base, ctx, 1);
> +
> +	/* Ensure completion of relaxed writes from the above SET macros */
> +	mb();
>  }
>
>  static struct iommu_domain *msm_iommu_domain_alloc(unsigned type)

Maybe one more patch for these? I can see that
__reset_context/__program_context would be run often enough to make
a difference, though maybe not as much as the flushes.

Also, split this in two patches: one that adds a comment describing
how you proved that you don't need barriers before the writes, and
another one adding the barrier afterwards, with a description of
the failure scenario.

> @@ -500,7 +516,8 @@ static phys_addr_t msm_iommu_iova_to_phys(struct iommu_domain *domain,
>  	/* Invalidate context TLB */
>  	SET_CTX_TLBIALL(iommu->base, master->num, 0);
>  	SET_V2PPR(iommu->base, master->num, va & V2Pxx_VA);
> -
> +	/* Ensure completion of relaxed writes from the above SET macros */
> +	mb();
>  	par = GET_PAR(iommu->base, master->num);
>
>  	/* We are dealing with a supersection */

In this case, I'd say it's better to rewrite the function to avoid the
read: iova_to_phys() should be fast, and not require hardware access.
Getting rid of the hardware access by using an in-memory cache for
this should gain more than just removing the barriers, as an MMIO read
is always slow.

> @@ -714,7 +731,8 @@ static int msm_iommu_probe(struct platform_device *pdev)
>  	par = GET_PAR(iommu->base, 0);
>  	SET_V2PCFG(iommu->base, 0, 0);
>  	SET_M(iommu->base, 0, 0);
> -
> +	/* Ensure completion of relaxed writes from the above SET macros */
> +	mb();
>  	if (!par) {

Probe() clearly isn't performance sensitive, so please back this out
and use the non-relaxed accessors.

> diff --git a/drivers/iommu/msm_iommu_hw-8xxx.h b/drivers/iommu/msm_iommu_hw-8xxx.h
> index fc16010..fe2c5ca 100644
> --- a/drivers/iommu/msm_iommu_hw-8xxx.h
> +++ b/drivers/iommu/msm_iommu_hw-8xxx.h
> @@ -24,13 +24,19 @@
>  #define GET_CTX_REG(reg, base, ctx) \
>  	(readl((base) + (reg) + ((ctx) << CTX_SHIFT)))
>
> -#define SET_GLOBAL_REG(reg, base, val)	writel((val), ((base) + (reg)))
> +/*
> + * The writes to the global/context registers needs to be synced only after
> + * all the configuration writes are done. So use relaxed accessors and
> + * a barrier at the end.
> + */
> +#define SET_GLOBAL_REG_RELAXED(reg, base, val) \
> +	writel_relaxed((val), ((base) + (reg)))
>
> -#define SET_CTX_REG(reg, base, ctx, val) \
> -	writel((val), ((base) + (reg) + ((ctx) << CTX_SHIFT)))
> +#define SET_CTX_REG_RELAXED(reg, base, ctx, val) \
> +	writel_relaxed((val), ((base) + (reg) + ((ctx) << CTX_SHIFT)))

Please add the relaxed accessors first in a separate patch, and then
change over one user at a time before you remove the non-relaxed ones.

It's hard to believe that all users are actually performance critical,
so start with the ones that gain the most performance. As commented above,
some of the conversions seem particularly fishy and it's likely that
a slow reset() function should not be fixed by making reset() faster
but by calling it less often.

>  /* Context register setters/getters */
> -#define SET_SCTLR(b, c, v)	 SET_CTX_REG(SCTLR, (b), (c), (v))
> -#define SET_ACTLR(b, c, v)	 SET_CTX_REG(ACTLR, (b), (c), (v))
> -#define SET_CONTEXTIDR(b, c, v)	 SET_CTX_REG(CONTEXTIDR, (b), (c), (v))
> -#define SET_TTBR0(b, c, v)	 SET_CTX_REG(TTBR0, (b), (c), (v))
> -#define SET_TTBR1(b, c, v)	 SET_CTX_REG(TTBR1, (b), (c), (v))
> +#define SET_SCTLR(b, c, v)	 SET_CTX_REG_RELAXED(SCTLR, (b), (c), (v))
> +#define SET_ACTLR(b, c, v)	 SET_CTX_REG_RELAXED(ACTLR, (b), (c), (v))
> +#define SET_CONTEXTIDR(b, c, v)	 SET_CTX_REG_RELAXED(CONTEXTIDR, (b), (c), (v))
> +#define SET_TTBR0(b, c, v)	 SET_CTX_REG_RELAXED(TTBR0, (b), (c), (v))
> +#define SET_TTBR1(b, c, v)	 SET_CTX_REG_RELAXED(TTBR1, (b), (c), (v))
> +#define SET_TTBCR(b, c, v)	 SET_CTX_REG_RELAXED(TTBCR, (b), (c), (v))
> +#define SET_PAR(b, c, v)	 SET_CTX_REG_RELAXED(PAR, (b), (c), (v))

Can you just remove those macros, and open-code the function calls?

This is an unnecessary obfuscation that now also hides the fact that you
are using relaxed accessors.

	Arnd
On Friday 20 May 2016 13:44:10 Arnd Bergmann wrote:
> >  #define GET_CTX_REG(reg, base, ctx) \
> >  	(readl((base) + (reg) + ((ctx) << CTX_SHIFT)))
> >
> > -#define SET_GLOBAL_REG(reg, base, val)	writel((val), ((base) + (reg)))
> > +/*
> > + * The writes to the global/context registers needs to be synced only after
> > + * all the configuration writes are done. So use relaxed accessors and
> > + * a barrier at the end.
> > + */
> > +#define SET_GLOBAL_REG_RELAXED(reg, base, val) \
> > +	writel_relaxed((val), ((base) + (reg)))
> >
> > -#define SET_CTX_REG(reg, base, ctx, val) \
> > -	writel((val), ((base) + (reg) + ((ctx) << CTX_SHIFT)))
> > +#define SET_CTX_REG_RELAXED(reg, base, ctx, val) \
> > +	writel_relaxed((val), ((base) + (reg) + ((ctx) << CTX_SHIFT)))
>
> Please add the relaxed accessors first in a separate patch, and then
> change over one user at a time before you remove the non-relaxed ones.
>
> It's hard to believe that all users are actually performance critical,
> so start with the ones that gain the most performance. As commented above,
> some of the conversions seem particularly fishy and it's likely that
> a slow reset() function should not be fixed by making reset() faster
> but by calling it less often.

One more thing, please also convert them into inline functions when you
modify them, and use normal names such as

static inline void msm_iommu_write(struct msm_iommu_dev *iommu,
				   unsigned int reg, u32 val)
{
	writel(val, iommu->base + reg);
}

static inline void msm_iommu_context_write(struct msm_iommu_ctx_dev *ctx,
					   unsigned int reg, u32 val)
{
	writel(val, ctx->iommu->base + (ctx->num << CTX_SHIFT) + reg);
}

	Arnd
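A hypothetical call-site conversion under this suggestion might look as
follows. This is a sketch only: it assumes the msm_iommu_context_write()
helper above exists, that the ctx device carries the iommu back-pointer
and context number that Arnd's sketch presumes, and it reuses the
CTX_TLBIALL register offset the driver already defines:

	/* before: the macro hides both the register math and the accessor */
	SET_CTX_TLBIALL(iommu->base, master->num, 0);

	/* after: the accessor in use is visible at the call site */
	msm_iommu_context_write(master, CTX_TLBIALL, 0);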
Hi Arnd,

>> @@ -124,6 +124,9 @@ static void msm_iommu_reset(void __iomem *base, int ncb)
>>  		SET_TLBLKCR(base, ctx, 0);
>>  		SET_CONTEXTIDR(base, ctx, 0);
>>  	}
>> +
>> +	/* Ensure completion of relaxed writes from the above SET macros */
>> +	mb();
>>  }
>
> Why do the above use the relaxed writes? Do you have any performance
> numbers showing that skipping the sync for the reset makes a measurable
> difference?
>
> How did you prove that skipping the sync before the write is safe?
>
> How about resetting the iommu less often instead?

I had measured the numbers only for the full use-case path, not for the
reset path alone. I saw an improvement of about 5% on the full numbers.
As you said, the reset path is called less often and might not bring a
measurable change. I did not see a difference in behavior when changing
the sync to happen after the writes. But my understanding was that the
sync after the writes was required to ensure write completion.

I should have made smaller patches for this change. The only patch
relevant for this series is the one that changes the write in the
_iotlb_range function. The rest of the changes should be added one by
one in a separate series.

>>  static void __flush_iotlb(void *cookie)
>> @@ -141,6 +144,9 @@ static void __flush_iotlb(void *cookie)
>>  		list_for_each_entry(master, &iommu->ctx_list, list)
>>  			SET_CTX_TLBIALL(iommu->base, master->num, 0);
>>
>> +		/* To ensure completion of TLBIALL above */
>> +		mb();
>> +
>>  		__disable_clocks(iommu);
>>  	}
>>  fail:
>
> __disable_clocks() takes spinlocks and accesses the hardware, which
> in turn will also involve multiple syncs, so this seems like
> a premature optimization.
>
> Have you tried just leaving the clocks enabled? I'd suspect you could
> gain more performance that way.

I have not tried keeping the clocks always enabled. The intention here
was to have more granular clock control. But I agree that the barrier
might not be needed in this case, as it is indirectly taken care of. I
will remove it here and explain that.

>> @@ -181,7 +187,8 @@ fail:
>>
>>  static void __flush_iotlb_sync(void *cookie)
>>  {
>> -	/* To avoid a null function pointer */
>> +	/* To ensure completion of the TLBIVA in __flush_iotlb_range */
>> +	mb();
>>  }
>
> I don't understand the specific race from the comment.
>
> What operation comes after this that relies on __flush_iotlb_range
> having completed, and how does an mb() guarantee this?

The flush_iotlb_range operation invalidates the tlb for writes to the
pagetable and then finally calls the sync operation to ensure completion
of the flush, which is required before returning to the client of the
iommu. In the case of this iommu, only a barrier is required to ensure
completion of the invalidate operation.

> This seems to be a bug fix that is unrelated to the change to
> use writel_relaxed(), so better split it out into a separate
> patch, with a longer changeset description. Did you spot this
> race by running into incorrect data, or by inspection?

No, I did not see a data corruption issue without the mb(), but it
could have been hidden in some other way as well. Another difference is
that the sync was previously done before the write and is now moved
after it. As I understand it, a sync after the write is correct. So I
will change this patch with more description and move the rest of the
changes out.

>>  static const struct iommu_gather_ops msm_iommu_gather_ops = {
>> @@ -235,6 +242,9 @@ static void config_mids(struct msm_iommu_dev *iommu,
>>  		/* Set security bit override to be Non-secure */
>>  		SET_NSCFG(iommu->base, mid, 3);
>>  	}
>> +
>> +	/* Ensure completion of relaxed writes from the above SET macros */
>> +	mb();
>>  }
>
> This looks similar to the msm_iommu_reset() case. Maybe put those
> two into one patch, or find a way to run config_mids() less often so
> it doesn't show up in the profiles any more.
>
> If you hit this often enough that it causes performance problems,
> it's more likely that there is a bug in your code than that changing
> to relaxed accessors is a good idea.

OK; similar to reset, this is called less often compared to the flushes
and may not show a difference in performance. I will move this out.

>>  static void __reset_context(void __iomem *base, int ctx)
>> @@ -257,6 +267,9 @@ static void __reset_context(void __iomem *base, int ctx)
>>  	SET_TLBFLPTER(base, ctx, 0);
>>  	SET_TLBSLPTER(base, ctx, 0);
>>  	SET_TLBLKCR(base, ctx, 0);
>> +
>> +	/* Ensure completion of relaxed writes from the above SET macros */
>> +	mb();
>>  }
>>
>>  static void __program_context(void __iomem *base, int ctx,
>> @@ -305,6 +318,9 @@ static void __program_context(void __iomem *base, int ctx,
>>
>>  	/* Enable the MMU */
>>  	SET_M(base, ctx, 1);
>> +
>> +	/* Ensure completion of relaxed writes from the above SET macros */
>> +	mb();
>>  }
>>
>>  static struct iommu_domain *msm_iommu_domain_alloc(unsigned type)
>
> Maybe one more patch for these? I can see that
> __reset_context/__program_context would be run often enough to make
> a difference, though maybe not as much as the flushes.
>
> Also, split this in two patches: one that adds a comment describing
> how you proved that you don't need barriers before the writes, and
> another one adding the barrier afterwards, with a description of
> the failure scenario.

OK, I will split it. I did not explicitly see a failure, though. It is
based on the understanding that a sync after the write is more
appropriate than before it.

>> @@ -500,7 +516,8 @@ static phys_addr_t msm_iommu_iova_to_phys(struct iommu_domain *domain,
>>  	/* Invalidate context TLB */
>>  	SET_CTX_TLBIALL(iommu->base, master->num, 0);
>>  	SET_V2PPR(iommu->base, master->num, va & V2Pxx_VA);
>> -
>> +	/* Ensure completion of relaxed writes from the above SET macros */
>> +	mb();
>>  	par = GET_PAR(iommu->base, master->num);
>>
>>  	/* We are dealing with a supersection */
>
> In this case, I'd say it's better to rewrite the function to avoid the
> read: iova_to_phys() should be fast, and not require hardware access.
> Getting rid of the hardware access by using an in-memory cache for
> this should gain more than just removing the barriers, as an MMIO read
> is always slow.

OK, you mean using the software walk-through? I will check on this to
measure the latency difference. If that's true, then the iopgtable ops
itself provides a function for iova_to_phys conversion, so that can be
used.

>> @@ -714,7 +731,8 @@ static int msm_iommu_probe(struct platform_device *pdev)
>>  	par = GET_PAR(iommu->base, 0);
>>  	SET_V2PCFG(iommu->base, 0, 0);
>>  	SET_M(iommu->base, 0, 0);
>> -
>> +	/* Ensure completion of relaxed writes from the above SET macros */
>> +	mb();
>>  	if (!par) {
>
> Probe() clearly isn't performance sensitive, so please back this out
> and use the non-relaxed accessors.

OK, will change this back.

>> diff --git a/drivers/iommu/msm_iommu_hw-8xxx.h b/drivers/iommu/msm_iommu_hw-8xxx.h
>> index fc16010..fe2c5ca 100644
>> --- a/drivers/iommu/msm_iommu_hw-8xxx.h
>> +++ b/drivers/iommu/msm_iommu_hw-8xxx.h
>> @@ -24,13 +24,19 @@
>>  #define GET_CTX_REG(reg, base, ctx) \
>>  	(readl((base) + (reg) + ((ctx) << CTX_SHIFT)))
>>
>> -#define SET_GLOBAL_REG(reg, base, val)	writel((val), ((base) + (reg)))
>> +/*
>> + * The writes to the global/context registers needs to be synced only after
>> + * all the configuration writes are done. So use relaxed accessors and
>> + * a barrier at the end.
>> + */
>> +#define SET_GLOBAL_REG_RELAXED(reg, base, val) \
>> +	writel_relaxed((val), ((base) + (reg)))
>>
>> -#define SET_CTX_REG(reg, base, ctx, val) \
>> -	writel((val), ((base) + (reg) + ((ctx) << CTX_SHIFT)))
>> +#define SET_CTX_REG_RELAXED(reg, base, ctx, val) \
>> +	writel_relaxed((val), ((base) + (reg) + ((ctx) << CTX_SHIFT)))
>
> Please add the relaxed accessors first in a separate patch, and then
> change over one user at a time before you remove the non-relaxed ones.

OK.

> It's hard to believe that all users are actually performance critical,
> so start with the ones that gain the most performance. As commented above,
> some of the conversions seem particularly fishy and it's likely that
> a slow reset() function should not be fixed by making reset() faster
> but by calling it less often.

OK, will change this. So I will split this into separate series: one
patch to modify the relevant flush_iotlb_range function in this series,
and the rest of the changes as driver improvements in another.

> Can you just remove those macros, and open-code the function calls?
>
> This is an unnecessary obfuscation that now also hides the fact that you
> are using relaxed accessors.

OK, will make these into inline functions to make it obvious.

Regards,
 Sricharan
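For reference, the "sync before vs. sync after" distinction being
debated here comes down to where the barrier sits in the accessor. A
simplified sketch of the 32-bit ARM definitions follows; barrier
placement is the point, the real macros in arch/arm/include/asm/io.h
carry additional casts:

#define writel_relaxed(v, c)	__raw_writel((u32)cpu_to_le32(v), (c))
#define writel(v, c)		({ __iowmb(); writel_relaxed((v), (c)); })

So a plain writel() orders prior memory accesses before the MMIO store
but guarantees nothing about the store's completion. Converting to
writel_relaxed() therefore drops a barrier that came before each write,
while the mb() this patch adds after the writes is a different ordering
point entirely, which is why the two changes need separate justification.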
On Monday, May 23, 2016 11:35:04 AM CEST Sricharan wrote:
> Hi Arnd,
>
> >> @@ -124,6 +124,9 @@ static void msm_iommu_reset(void __iomem *base, int ncb)
> >>  		SET_TLBLKCR(base, ctx, 0);
> >>  		SET_CONTEXTIDR(base, ctx, 0);
> >>  	}
> >> +
> >> +	/* Ensure completion of relaxed writes from the above SET macros */
> >> +	mb();
> >>  }
> >
> > Why do the above use the relaxed writes? Do you have any performance
> > numbers showing that skipping the sync for the reset makes a measurable
> > difference?
> >
> > How did you prove that skipping the sync before the write is safe?
> >
> > How about resetting the iommu less often instead?
>
> I had measured the numbers only for the full use-case path, not for the
> reset path alone. I saw an improvement of about 5% on the full numbers.
> As you said, the reset path is called less often and might not bring a
> measurable change. I did not see a difference in behavior when changing
> the sync to happen after the writes.

Ok, then better not change it.

> But my understanding was that the sync after the writes was required to
> ensure write completion.

Can you cite the relevant documentation on this? Is this specific
to the Qualcomm CPU implementation or the IOMMU? I don't think
the ARM architecture requires anything like this in general.

> I should have made smaller patches for this change. The only patch
> relevant for this series is the one that changes the write in the
> _iotlb_range function. The rest of the changes should be added one by
> one in a separate series.

If you see the same 5% performance improvement with a simpler change,
then better do only that. The IOMMU infrastructure is rather sensitive
to having correct barriers everywhere, so this minimizes the risk of
getting it wrong somewhere.

> >> @@ -181,7 +187,8 @@ fail:
> >>
> >>  static void __flush_iotlb_sync(void *cookie)
> >>  {
> >> -	/* To avoid a null function pointer */
> >> +	/* To ensure completion of the TLBIVA in __flush_iotlb_range */
> >> +	mb();
> >>  }
> >
> > I don't understand the specific race from the comment.
> >
> > What operation comes after this that relies on __flush_iotlb_range
> > having completed, and how does an mb() guarantee this?
>
> The flush_iotlb_range operation invalidates the tlb for writes to the
> pagetable and then finally calls the sync operation to ensure completion
> of the flush, which is required before returning to the client of the
> iommu. In the case of this iommu, only a barrier is required to ensure
> completion of the invalidate operation.

This doesn't answer my question: What operation would a client do
that requires the flush to be completed here? A barrier is always
defined in terms of things that come before it in combination with
things that come after it.

Any operation that could trigger a DMA from a device is required
to have a barrier preceding it (usually the wmb() implied by writel()),
so this is clearly not about a driver that installs a DMA mapping
before starting a DMA, but I don't see what else it would be.

> > This seems to be a bug fix that is unrelated to the change to
> > use writel_relaxed(), so better split it out into a separate
> > patch, with a longer changeset description. Did you spot this
> > race by running into incorrect data, or by inspection?
>
> No, I did not see a data corruption issue without the mb(), but it
> could have been hidden in some other way as well. Another difference is
> that the sync was previously done before the write and is now moved
> after it. As I understand it, a sync after the write is correct. So I
> will change this patch with more description and move the rest of the
> changes out.

Ok.

> >> @@ -500,7 +516,8 @@ static phys_addr_t msm_iommu_iova_to_phys(struct iommu_domain *domain,
> >>  	/* Invalidate context TLB */
> >>  	SET_CTX_TLBIALL(iommu->base, master->num, 0);
> >>  	SET_V2PPR(iommu->base, master->num, va & V2Pxx_VA);
> >> -
> >> +	/* Ensure completion of relaxed writes from the above SET macros */
> >> +	mb();
> >>  	par = GET_PAR(iommu->base, master->num);
> >>
> >>  	/* We are dealing with a supersection */
> >
> > In this case, I'd say it's better to rewrite the function to avoid the
> > read: iova_to_phys() should be fast, and not require hardware access.
> > Getting rid of the hardware access by using an in-memory cache for
> > this should gain more than just removing the barriers, as an MMIO read
> > is always slow.
>
> OK, you mean using the software walk-through? I will check on this to
> measure the latency difference. If that's true, then the iopgtable ops
> itself provides a function for iova_to_phys conversion, so that can be
> used.

I hadn't realized that this is a lookup in the hardware, rather than
reading a static register. It's probably a good idea to check this
anyway.

	Arnd
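To make the client-side sequence concrete, the case Arnd describes is
roughly the following; dma_map_single() is the standard DMA API entry
point that ends up in the IOMMU's map-and-flush path, while dev_regs and
the two register names are hypothetical:

	dma_addr_t dma = dma_map_single(dev, buf, len, DMA_TO_DEVICE);

	writel(lower_32_bits(dma), dev_regs + DMA_ADDR);  /* hypothetical */
	writel(DMA_START, dev_regs + DMA_CTRL);           /* implied wmb() */

Because the doorbell writel() carries an implied wmb(), everything
issued before it, including the TLB maintenance done during the map, is
ordered ahead of the store the device observes; this is why Arnd sees no
client operation that would need an extra mb() in the sync callback.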
Hi Arnd,

<snip..>

>> I had measured the numbers only for the full use-case path, not for the
>> reset path alone. I saw an improvement of about 5% on the full numbers.
>> As you said, the reset path is called less often and might not bring a
>> measurable change. I did not see a difference in behavior when changing
>> the sync to happen after the writes.
>
> Ok, then better not change it.

OK.

>> But my understanding was that the sync after the writes was required to
>> ensure write completion.
>
> Can you cite the relevant documentation on this? Is this specific
> to the Qualcomm CPU implementation or the IOMMU? I don't think
> the ARM architecture requires anything like this in general.

My explanation was not correct. The whole reason for doing the sync here
was to make sure that the client observes the effect of the tlb
invalidation before starting any dma operation. But as you have
explained below, if it is implicit that any operation from a device that
could trigger a dma has a barrier preceding it, then a sync here may not
be needed.

>> >> @@ -181,7 +187,8 @@ fail:
>> >>
>> >>  static void __flush_iotlb_sync(void *cookie)
>> >>  {
>> >> -	/* To avoid a null function pointer */
>> >> +	/* To ensure completion of the TLBIVA in __flush_iotlb_range */
>> >> +	mb();
>> >>  }
>> >
>> > I don't understand the specific race from the comment.
>> >
>> > What operation comes after this that relies on __flush_iotlb_range
>> > having completed, and how does an mb() guarantee this?
>>
>> The flush_iotlb_range operation invalidates the tlb for writes to the
>> pagetable and then finally calls the sync operation to ensure completion
>> of the flush, which is required before returning to the client of the
>> iommu. In the case of this iommu, only a barrier is required to ensure
>> completion of the invalidate operation.
>
> This doesn't answer my question: What operation would a client do
> that requires the flush to be completed here? A barrier is always
> defined in terms of things that come before it in combination with
> things that come after it.
>
> Any operation that could trigger a DMA from a device is required
> to have a barrier preceding it (usually the wmb() implied by writel()),
> so this is clearly not about a driver that installs a DMA mapping
> before starting a DMA, but I don't see what else it would be.

OK, so I was doing this from the idea that other iommu drivers were
polling on a status bit in their sync call to ensure completion of
pending TLB invalidations. But in this case, there is no status bit, so
I added a barrier to avoid ordering issues before the client triggers
the dma operation. But as you say above, if it is implicit that the
device has a barrier before starting the trigger, then the barrier here
becomes redundant.

<snip..>

>> > In this case, I'd say it's better to rewrite the function to avoid the
>> > read: iova_to_phys() should be fast, and not require hardware access.
>> > Getting rid of the hardware access by using an in-memory cache for
>> > this should gain more than just removing the barriers, as an MMIO read
>> > is always slow.
>>
>> OK, you mean using the software walk-through? I will check on this to
>> measure the latency difference. If that's true, then the iopgtable ops
>> itself provides a function for iova_to_phys conversion, so that can be
>> used.
>
> I hadn't realized that this is a lookup in the hardware, rather than
> reading a static register. It's probably a good idea to check this
> anyway.

OK, will measure the difference and keep the better one.

Regards,
 Sricharan
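The software-walk alternative mentioned here would look roughly like
the sketch below, assuming the domain's private data keeps the
io_pgtable_ops pointer returned by alloc_io_pgtable_ops(); the
priv->iop field name and the to_msm_priv() helper are assumptions about
the driver's structures:

static phys_addr_t msm_iommu_iova_to_phys(struct iommu_domain *domain,
					  dma_addr_t va)
{
	struct msm_priv *priv = to_msm_priv(domain);

	/* walk the in-memory page tables instead of programming
	 * TLBIALL/V2PPR and reading PAR back over MMIO */
	return priv->iop->iova_to_phys(priv->iop, va);
}

This removes both the MMIO read and the barrier question from the
lookup path entirely.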
On Wednesday, May 25, 2016 4:15:31 PM CEST Sricharan wrote:
> >
> > Any operation that could trigger a DMA from a device is required
> > to have a barrier preceding it (usually the wmb() implied by writel()),
> > so this is clearly not about a driver that installs a DMA mapping
> > before starting a DMA, but I don't see what else it would be.
>
> OK, so I was doing this from the idea that other iommu drivers were
> polling on a status bit in their sync call to ensure completion of
> pending TLB invalidations. But in this case, there is no status bit, so
> I added a barrier to avoid ordering issues before the client triggers
> the dma operation. But as you say above, if it is implicit that the
> device has a barrier before starting the trigger, then the barrier here
> becomes redundant.

Ok. There are two more things to note here:

* On other platforms, polling the register is likely required because
  an MMIO write is "posted", meaning that a sync after writel() will
  only ensure that it has left the CPU write queue, but it may still be
  on the bus fabric and whatever side-effects are triggered by the
  write are normally not guaranteed to be completed even after the
  'sync'. You need to check the datasheet for your IOMMU to find out
  whether the 'dsb' instruction actually has any effect on the IOMMU.
  If not, then neither the barrier that you add here nor the barrier
  in the following writel() is sufficient.

* The one barrier that is normally required in an IOMMU is between
  updating the in-memory page tables and the following MMIO store
  that triggers the TLB flush for that entry. This barrier is
  implied by writel() but not writel_relaxed(). If you don't have
  a hardware page table walker in your IOMMU, you don't need to worry
  about this.

	Arnd
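A sketch of the read-back idiom from the first point, written against
this driver's existing macros; whether CTX_TLBIALL reads back
meaningfully, and whether a read-back is sufficient for this IOMMU, is
exactly what the datasheet would have to confirm:

	/* post the invalidate ... */
	SET_CTX_TLBIALL(iommu->base, master->num, 0);

	/* ... then read any register on the same device: the read cannot
	 * return until the posted write has reached the IOMMU */
	(void)GET_CTX_REG(CTX_TLBIALL, iommu->base, master->num);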
Hi Arnd,

>> OK, so I was doing this from the idea that other iommu drivers were
>> polling on a status bit in their sync call to ensure completion of
>> pending TLB invalidations. But in this case, there is no status bit, so
>> I added a barrier to avoid ordering issues before the client triggers
>> the dma operation. But as you say above, if it is implicit that the
>> device has a barrier before starting the trigger, then the barrier here
>> becomes redundant.
>
> Ok. There are two more things to note here:
>
> * On other platforms, polling the register is likely required because
>   an MMIO write is "posted", meaning that a sync after writel() will
>   only ensure that it has left the CPU write queue, but it may still be
>   on the bus fabric and whatever side-effects are triggered by the
>   write are normally not guaranteed to be completed even after the
>   'sync'. You need to check the datasheet for your IOMMU to find out
>   whether the 'dsb' instruction actually has any effect on the IOMMU.
>   If not, then neither the barrier that you add here nor the barrier
>   in the following writel() is sufficient.

Thanks for the detailed explanation. I will check this. So with this, I
think that if the iommu does not support polling for its status, then it
must honor the 'dsb'; otherwise there will be no way of making it
coherent?

> * The one barrier that is normally required in an IOMMU is between
>   updating the in-memory page tables and the following MMIO store
>   that triggers the TLB flush for that entry. This barrier is
>   implied by writel() but not writel_relaxed(). If you don't have
>   a hardware page table walker in your IOMMU, you don't need to worry
>   about this.

To get my understanding correct: is the barrier required here because of
a speculative fetch?

Regards,
 Sricharan
On Wednesday, May 25, 2016 6:49:18 PM CEST Sricharan wrote:
> Hi Arnd,
>
> >> OK, so I was doing this from the idea that other iommu drivers were
> >> polling on a status bit in their sync call to ensure completion of
> >> pending TLB invalidations. But in this case, there is no status bit, so
> >> I added a barrier to avoid ordering issues before the client triggers
> >> the dma operation. But as you say above, if it is implicit that the
> >> device has a barrier before starting the trigger, then the barrier here
> >> becomes redundant.
> >
> > Ok. There are two more things to note here:
> >
> > * On other platforms, polling the register is likely required because
> >   an MMIO write is "posted", meaning that a sync after writel() will
> >   only ensure that it has left the CPU write queue, but it may still be
> >   on the bus fabric and whatever side-effects are triggered by the
> >   write are normally not guaranteed to be completed even after the
> >   'sync'. You need to check the datasheet for your IOMMU to find out
> >   whether the 'dsb' instruction actually has any effect on the IOMMU.
> >   If not, then neither the barrier that you add here nor the barrier
> >   in the following writel() is sufficient.
>
> Thanks for the detailed explanation. I will check this. So with this, I
> think that if the iommu does not support polling for its status, then it
> must honor the 'dsb'; otherwise there will be no way of making it
> coherent?

It's more likely that you have to do a readl() after the writel() to
any address inside of the device in order to ensure the writel()
has completed, but that is something that the hardware manual for
the iommu should describe.

> > * The one barrier that is normally required in an IOMMU is between
> >   updating the in-memory page tables and the following MMIO store
> >   that triggers the TLB flush for that entry. This barrier is
> >   implied by writel() but not writel_relaxed(). If you don't have
> >   a hardware page table walker in your IOMMU, you don't need to worry
> >   about this.
>
> To get my understanding correct: is the barrier required here because of
> a speculative fetch?

It doesn't have to be a speculative fetch, it could also be a normal
fetch triggered by a DMA. The point is that the CPU is free to reorder
the store to (io page table) memory relative to the store to the
IOMMU flush register, and when you directly follow up with a store to
the DMA master that triggers a transfer, this transfer can hit the
IOMMU and cause the stale page table entry to be read.

I guess in practice the barrier that comes before the MMIO write to the
DMA master would ensure ordering of the DMA after both the IOPTE
update and the MMIO flush (unless there was the speculative fetch you
mentioned), but this is where I'm no longer confident enough in my
own understanding of the ordering rules to rely on that. Having the
barrier between the IOPTE update and the flush certainly leaves
no room for ambiguity.

	Arnd
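Spelled out as code, the ordering requirement from that last paragraph
looks like this; iopte, idx and pte are illustrative names for the
in-memory io page table update, while SET_TLBIVA() is the driver's
existing invalidate-by-VA macro:

	/* publish the new entry in the in-memory io page table */
	iopte[idx] = pte;

	/*
	 * Order the page-table store before the MMIO store that triggers
	 * the flush. A plain writel() implies this wmb(); with the relaxed
	 * SET macros it must be explicit, or a hardware walker can fetch
	 * the stale entry.
	 */
	wmb();
	SET_TLBIVA(base, ctx, iova);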
Hi Arnd,

>> > * On other platforms, polling the register is likely required because
>> >   an MMIO write is "posted", meaning that a sync after writel() will
>> >   only ensure that it has left the CPU write queue, but it may still be
>> >   on the bus fabric and whatever side-effects are triggered by the
>> >   write are normally not guaranteed to be completed even after the
>> >   'sync'. You need to check the datasheet for your IOMMU to find out
>> >   whether the 'dsb' instruction actually has any effect on the IOMMU.
>> >   If not, then neither the barrier that you add here nor the barrier
>> >   in the following writel() is sufficient.
>>
>> Thanks for the detailed explanation. I will check this. So with this, I
>> think that if the iommu does not support polling for its status, then it
>> must honor the 'dsb'; otherwise there will be no way of making it
>> coherent?
>
> It's more likely that you have to do a readl() after the writel() to
> any address inside of the device in order to ensure the writel()
> has completed, but that is something that the hardware manual for
> the iommu should describe.

OK, got it. I will do it here as required.

>> > * The one barrier that is normally required in an IOMMU is between
>> >   updating the in-memory page tables and the following MMIO store
>> >   that triggers the TLB flush for that entry. This barrier is
>> >   implied by writel() but not writel_relaxed(). If you don't have
>> >   a hardware page table walker in your IOMMU, you don't need to worry
>> >   about this.
>>
>> To get my understanding correct: is the barrier required here because of
>> a speculative fetch?
>
> It doesn't have to be a speculative fetch, it could also be a normal
> fetch triggered by a DMA. The point is that the CPU is free to reorder
> the store to (io page table) memory relative to the store to the
> IOMMU flush register, and when you directly follow up with a store to
> the DMA master that triggers a transfer, this transfer can hit the
> IOMMU and cause the stale page table entry to be read.
>
> I guess in practice the barrier that comes before the MMIO write to the
> DMA master would ensure ordering of the DMA after both the IOPTE
> update and the MMIO flush (unless there was the speculative fetch you
> mentioned), but this is where I'm no longer confident enough in my
> own understanding of the ordering rules to rely on that. Having the
> barrier between the IOPTE update and the flush certainly leaves
> no room for ambiguity.

OK, understood. Also, I think any speculative access should go away if
an unmap is done first and the clients then remap.

Regards,
 Sricharan
diff --git a/drivers/iommu/msm_iommu.c b/drivers/iommu/msm_iommu.c
index 0299a37..dfcaeef 100644
--- a/drivers/iommu/msm_iommu.c
+++ b/drivers/iommu/msm_iommu.c
@@ -124,6 +124,9 @@ static void msm_iommu_reset(void __iomem *base, int ncb)
 		SET_TLBLKCR(base, ctx, 0);
 		SET_CONTEXTIDR(base, ctx, 0);
 	}
+
+	/* Ensure completion of relaxed writes from the above SET macros */
+	mb();
 }
 
 static void __flush_iotlb(void *cookie)
@@ -141,6 +144,9 @@ static void __flush_iotlb(void *cookie)
 		list_for_each_entry(master, &iommu->ctx_list, list)
 			SET_CTX_TLBIALL(iommu->base, master->num, 0);
 
+		/* To ensure completion of TLBIALL above */
+		mb();
+
 		__disable_clocks(iommu);
 	}
 fail:
@@ -181,7 +187,8 @@ fail:
 
 static void __flush_iotlb_sync(void *cookie)
 {
-	/* To avoid a null function pointer */
+	/* To ensure completion of the TLBIVA in __flush_iotlb_range */
+	mb();
 }
 
 static const struct iommu_gather_ops msm_iommu_gather_ops = {
@@ -235,6 +242,9 @@ static void config_mids(struct msm_iommu_dev *iommu,
 		/* Set security bit override to be Non-secure */
 		SET_NSCFG(iommu->base, mid, 3);
 	}
+
+	/* Ensure completion of relaxed writes from the above SET macros */
+	mb();
 }
 
 static void __reset_context(void __iomem *base, int ctx)
@@ -257,6 +267,9 @@ static void __reset_context(void __iomem *base, int ctx)
 	SET_TLBFLPTER(base, ctx, 0);
 	SET_TLBSLPTER(base, ctx, 0);
 	SET_TLBLKCR(base, ctx, 0);
+
+	/* Ensure completion of relaxed writes from the above SET macros */
+	mb();
 }
 
 static void __program_context(void __iomem *base, int ctx,
@@ -305,6 +318,9 @@ static void __program_context(void __iomem *base, int ctx,
 
 	/* Enable the MMU */
 	SET_M(base, ctx, 1);
+
+	/* Ensure completion of relaxed writes from the above SET macros */
+	mb();
 }
 
 static struct iommu_domain *msm_iommu_domain_alloc(unsigned type)
@@ -500,7 +516,8 @@ static phys_addr_t msm_iommu_iova_to_phys(struct iommu_domain *domain,
 	/* Invalidate context TLB */
 	SET_CTX_TLBIALL(iommu->base, master->num, 0);
 	SET_V2PPR(iommu->base, master->num, va & V2Pxx_VA);
-
+	/* Ensure completion of relaxed writes from the above SET macros */
+	mb();
 	par = GET_PAR(iommu->base, master->num);
 
 	/* We are dealing with a supersection */
@@ -714,7 +731,8 @@ static int msm_iommu_probe(struct platform_device *pdev)
 	par = GET_PAR(iommu->base, 0);
 	SET_V2PCFG(iommu->base, 0, 0);
 	SET_M(iommu->base, 0, 0);
-
+	/* Ensure completion of relaxed writes from the above SET macros */
+	mb();
 	if (!par) {
 		pr_err("Invalid PAR value detected\n");
 		ret = -ENODEV;
diff --git a/drivers/iommu/msm_iommu_hw-8xxx.h b/drivers/iommu/msm_iommu_hw-8xxx.h
index fc16010..fe2c5ca 100644
--- a/drivers/iommu/msm_iommu_hw-8xxx.h
+++ b/drivers/iommu/msm_iommu_hw-8xxx.h
@@ -24,13 +24,19 @@
 #define GET_CTX_REG(reg, base, ctx) \
 	(readl((base) + (reg) + ((ctx) << CTX_SHIFT)))
 
-#define SET_GLOBAL_REG(reg, base, val)	writel((val), ((base) + (reg)))
+/*
+ * The writes to the global/context registers needs to be synced only after
+ * all the configuration writes are done. So use relaxed accessors and
+ * a barrier at the end.
+ */
+#define SET_GLOBAL_REG_RELAXED(reg, base, val) \
+	writel_relaxed((val), ((base) + (reg)))
 
-#define SET_CTX_REG(reg, base, ctx, val) \
-	writel((val), ((base) + (reg) + ((ctx) << CTX_SHIFT)))
+#define SET_CTX_REG_RELAXED(reg, base, ctx, val) \
+	writel_relaxed((val), ((base) + (reg) + ((ctx) << CTX_SHIFT)))
 
-/* Wrappers for numbered registers */
-#define SET_GLOBAL_REG_N(b, n, r, v) SET_GLOBAL_REG(b, ((r) + (n << 2)), (v))
+ /* Wrappers for numbered registers */
+#define SET_GLOBAL_REG_N(b, n, r, v) SET_GLOBAL_REG_RELAXED(b, ((r) + (n << 2)), (v))
 #define GET_GLOBAL_REG_N(b, n, r)    GET_GLOBAL_REG(b, ((r) + (n << 2)))
 
 /* Field wrappers */
@@ -39,16 +45,17 @@
 	GET_FIELD(((b) + (r) + ((c) << CTX_SHIFT)), F##_MASK, F##_SHIFT)
 
 #define SET_GLOBAL_FIELD(b, r, F, v) \
-	SET_FIELD(((b) + (r)), F##_MASK, F##_SHIFT, (v))
+	SET_FIELD_RELAXED(((b) + (r)), F##_MASK, F##_SHIFT, (v))
 #define SET_CONTEXT_FIELD(b, c, r, F, v)	\
-	SET_FIELD(((b) + (r) + ((c) << CTX_SHIFT)), F##_MASK, F##_SHIFT, (v))
+	SET_FIELD_RELAXED(((b) + (r) + ((c) << CTX_SHIFT)), F##_MASK, F##_SHIFT, (v))
 
 #define GET_FIELD(addr, mask, shift)  ((readl(addr) >> (shift)) & (mask))
 
-#define SET_FIELD(addr, mask, shift, v) \
+#define SET_FIELD_RELAXED(addr, mask, shift, v) \
 do { \
 	int t = readl(addr); \
-	writel((t & ~((mask) << (shift))) + (((v) & (mask)) << (shift)), addr);\
+	writel_relaxed((t & ~((mask) << (shift))) + \
+		       (((v) & (mask)) << (shift)), addr);\
 } while (0)
 
@@ -96,20 +103,20 @@ do { \
 
 /* Global register setters / getters */
 #define SET_M2VCBR_N(b, N, v)	 SET_GLOBAL_REG_N(M2VCBR_N, N, (b), (v))
 #define SET_CBACR_N(b, N, v)	 SET_GLOBAL_REG_N(CBACR_N, N, (b), (v))
-#define SET_TLBRSW(b, v)	 SET_GLOBAL_REG(TLBRSW, (b), (v))
-#define SET_TLBTR0(b, v)	 SET_GLOBAL_REG(TLBTR0, (b), (v))
-#define SET_TLBTR1(b, v)	 SET_GLOBAL_REG(TLBTR1, (b), (v))
-#define SET_TLBTR2(b, v)	 SET_GLOBAL_REG(TLBTR2, (b), (v))
-#define SET_TESTBUSCR(b, v)	 SET_GLOBAL_REG(TESTBUSCR, (b), (v))
-#define SET_GLOBAL_TLBIALL(b, v) SET_GLOBAL_REG(GLOBAL_TLBIALL, (b), (v))
-#define SET_TLBIVMID(b, v)	 SET_GLOBAL_REG(TLBIVMID, (b), (v))
-#define SET_CR(b, v)		 SET_GLOBAL_REG(CR, (b), (v))
-#define SET_EAR(b, v)		 SET_GLOBAL_REG(EAR, (b), (v))
-#define SET_ESR(b, v)		 SET_GLOBAL_REG(ESR, (b), (v))
-#define SET_ESRRESTORE(b, v)	 SET_GLOBAL_REG(ESRRESTORE, (b), (v))
-#define SET_ESYNR0(b, v)	 SET_GLOBAL_REG(ESYNR0, (b), (v))
-#define SET_ESYNR1(b, v)	 SET_GLOBAL_REG(ESYNR1, (b), (v))
-#define SET_RPU_ACR(b, v)	 SET_GLOBAL_REG(RPU_ACR, (b), (v))
+#define SET_TLBRSW(b, v)	 SET_GLOBAL_REG_RELAXED(TLBRSW, (b), (v))
+#define SET_TLBTR0(b, v)	 SET_GLOBAL_REG_RELAXED(TLBTR0, (b), (v))
+#define SET_TLBTR1(b, v)	 SET_GLOBAL_REG_RELAXED(TLBTR1, (b), (v))
+#define SET_TLBTR2(b, v)	 SET_GLOBAL_REG_RELAXED(TLBTR2, (b), (v))
+#define SET_TESTBUSCR(b, v)	 SET_GLOBAL_REG_RELAXED(TESTBUSCR, (b), (v))
+#define SET_GLOBAL_TLBIALL(b, v) SET_GLOBAL_REG_RELAXED(GLOBAL_TLBIALL, (b), (v))
+#define SET_TLBIVMID(b, v)	 SET_GLOBAL_REG_RELAXED(TLBIVMID, (b), (v))
+#define SET_CR(b, v)		 SET_GLOBAL_REG_RELAXED(CR, (b), (v))
+#define SET_EAR(b, v)		 SET_GLOBAL_REG_RELAXED(EAR, (b), (v))
+#define SET_ESR(b, v)		 SET_GLOBAL_REG_RELAXED(ESR, (b), (v))
+#define SET_ESRRESTORE(b, v)	 SET_GLOBAL_REG_RELAXED(ESRRESTORE, (b), (v))
+#define SET_ESYNR0(b, v)	 SET_GLOBAL_REG_RELAXED(ESYNR0, (b), (v))
+#define SET_ESYNR1(b, v)	 SET_GLOBAL_REG_RELAXED(ESYNR1, (b), (v))
+#define SET_RPU_ACR(b, v)	 SET_GLOBAL_REG_RELAXED(RPU_ACR, (b), (v))
 
 #define GET_M2VCBR_N(b, N)	 GET_GLOBAL_REG_N(M2VCBR_N, N, (b))
 #define GET_CBACR_N(b, N)	 GET_GLOBAL_REG_N(CBACR_N, N, (b))
@@ -131,34 +138,34 @@ do { \
 
 /* Context register setters/getters */
-#define SET_SCTLR(b, c, v)	 SET_CTX_REG(SCTLR, (b), (c), (v))
-#define SET_ACTLR(b, c, v)	 SET_CTX_REG(ACTLR, (b), (c), (v))
-#define SET_CONTEXTIDR(b, c, v)	 SET_CTX_REG(CONTEXTIDR, (b), (c), (v))
-#define SET_TTBR0(b, c, v)	 SET_CTX_REG(TTBR0, (b), (c), (v))
-#define SET_TTBR1(b, c, v)	 SET_CTX_REG(TTBR1, (b), (c), (v))
-#define SET_TTBCR(b, c, v)	 SET_CTX_REG(TTBCR, (b), (c), (v))
-#define SET_PAR(b, c, v)	 SET_CTX_REG(PAR, (b), (c), (v))
-#define SET_FSR(b, c, v)	 SET_CTX_REG(FSR, (b), (c), (v))
-#define SET_FSRRESTORE(b, c, v)	 SET_CTX_REG(FSRRESTORE, (b), (c), (v))
-#define SET_FAR(b, c, v)	 SET_CTX_REG(FAR, (b), (c), (v))
-#define SET_FSYNR0(b, c, v)	 SET_CTX_REG(FSYNR0, (b), (c), (v))
-#define SET_FSYNR1(b, c, v)	 SET_CTX_REG(FSYNR1, (b), (c), (v))
-#define SET_PRRR(b, c, v)	 SET_CTX_REG(PRRR, (b), (c), (v))
-#define SET_NMRR(b, c, v)	 SET_CTX_REG(NMRR, (b), (c), (v))
-#define SET_TLBLKCR(b, c, v)	 SET_CTX_REG(TLBLCKR, (b), (c), (v))
-#define SET_V2PSR(b, c, v)	 SET_CTX_REG(V2PSR, (b), (c), (v))
-#define SET_TLBFLPTER(b, c, v)	 SET_CTX_REG(TLBFLPTER, (b), (c), (v))
-#define SET_TLBSLPTER(b, c, v)	 SET_CTX_REG(TLBSLPTER, (b), (c), (v))
-#define SET_BFBCR(b, c, v)	 SET_CTX_REG(BFBCR, (b), (c), (v))
-#define SET_CTX_TLBIALL(b, c, v) SET_CTX_REG(CTX_TLBIALL, (b), (c), (v))
-#define SET_TLBIASID(b, c, v)	 SET_CTX_REG(TLBIASID, (b), (c), (v))
-#define SET_TLBIVA(b, c, v)	 SET_CTX_REG(TLBIVA, (b), (c), (v))
-#define SET_TLBIVAA(b, c, v)	 SET_CTX_REG(TLBIVAA, (b), (c), (v))
-#define SET_V2PPR(b, c, v)	 SET_CTX_REG(V2PPR, (b), (c), (v))
-#define SET_V2PPW(b, c, v)	 SET_CTX_REG(V2PPW, (b), (c), (v))
-#define SET_V2PUR(b, c, v)	 SET_CTX_REG(V2PUR, (b), (c), (v))
-#define SET_V2PUW(b, c, v)	 SET_CTX_REG(V2PUW, (b), (c), (v))
-#define SET_RESUME(b, c, v)	 SET_CTX_REG(RESUME, (b), (c), (v))
+#define SET_SCTLR(b, c, v)	 SET_CTX_REG_RELAXED(SCTLR, (b), (c), (v))
+#define SET_ACTLR(b, c, v)	 SET_CTX_REG_RELAXED(ACTLR, (b), (c), (v))
+#define SET_CONTEXTIDR(b, c, v)	 SET_CTX_REG_RELAXED(CONTEXTIDR, (b), (c), (v))
+#define SET_TTBR0(b, c, v)	 SET_CTX_REG_RELAXED(TTBR0, (b), (c), (v))
+#define SET_TTBR1(b, c, v)	 SET_CTX_REG_RELAXED(TTBR1, (b), (c), (v))
+#define SET_TTBCR(b, c, v)	 SET_CTX_REG_RELAXED(TTBCR, (b), (c), (v))
+#define SET_PAR(b, c, v)	 SET_CTX_REG_RELAXED(PAR, (b), (c), (v))
+#define SET_FSR(b, c, v)	 SET_CTX_REG_RELAXED(FSR, (b), (c), (v))
+#define SET_FSRRESTORE(b, c, v)	 SET_CTX_REG_RELAXED(FSRRESTORE, (b), (c), (v))
+#define SET_FAR(b, c, v)	 SET_CTX_REG_RELAXED(FAR, (b), (c), (v))
+#define SET_FSYNR0(b, c, v)	 SET_CTX_REG_RELAXED(FSYNR0, (b), (c), (v))
+#define SET_FSYNR1(b, c, v)	 SET_CTX_REG_RELAXED(FSYNR1, (b), (c), (v))
+#define SET_PRRR(b, c, v)	 SET_CTX_REG_RELAXED(PRRR, (b), (c), (v))
+#define SET_NMRR(b, c, v)	 SET_CTX_REG_RELAXED(NMRR, (b), (c), (v))
+#define SET_TLBLKCR(b, c, v)	 SET_CTX_REG_RELAXED(TLBLCKR, (b), (c), (v))
+#define SET_V2PSR(b, c, v)	 SET_CTX_REG_RELAXED(V2PSR, (b), (c), (v))
+#define SET_TLBFLPTER(b, c, v)	 SET_CTX_REG_RELAXED(TLBFLPTER, (b), (c), (v))
+#define SET_TLBSLPTER(b, c, v)	 SET_CTX_REG_RELAXED(TLBSLPTER, (b), (c), (v))
+#define SET_BFBCR(b, c, v)	 SET_CTX_REG_RELAXED(BFBCR, (b), (c), (v))
+#define SET_CTX_TLBIALL(b, c, v) SET_CTX_REG_RELAXED(CTX_TLBIALL, (b), (c), (v))
+#define SET_TLBIASID(b, c, v)	 SET_CTX_REG_RELAXED(TLBIASID, (b), (c), (v))
+#define SET_TLBIVA(b, c, v)	 SET_CTX_REG_RELAXED(TLBIVA, (b), (c), (v))
+#define SET_TLBIVAA(b, c, v)	 SET_CTX_REG_RELAXED(TLBIVAA, (b), (c), (v))
+#define SET_V2PPR(b, c, v)	 SET_CTX_REG_RELAXED(V2PPR, (b), (c), (v))
+#define SET_V2PPW(b, c, v)	 SET_CTX_REG_RELAXED(V2PPW, (b), (c), (v))
+#define SET_V2PUR(b, c, v)	 SET_CTX_REG_RELAXED(V2PUR, (b), (c), (v))
+#define SET_V2PUW(b, c, v)	 SET_CTX_REG_RELAXED(V2PUW, (b), (c), (v))
+#define SET_RESUME(b, c, v)	 SET_CTX_REG_RELAXED(RESUME, (b), (c), (v))
 
 #define GET_SCTLR(b, c)		 GET_CTX_REG(SCTLR, (b), (c))
 #define GET_ACTLR(b, c)		 GET_CTX_REG(ACTLR, (b), (c))
While using the generic pagetable ops the tlb maintenance
operation gets completed in the sync callback. So use writel_relaxed
for all register access and add a mb() at appropriate places.

Signed-off-by: Sricharan R <sricharan@codeaurora.org>
---
 drivers/iommu/msm_iommu.c         |  24 +++++++--
 drivers/iommu/msm_iommu_hw-8xxx.h | 109 ++++++++++++++++++++------------------
 2 files changed, 79 insertions(+), 54 deletions(-)