Message ID: 1463741694-1735-7-git-send-email-sricharan@codeaurora.org (mailing list archive)
State: New, archived
On Friday 20 May 2016 16:24:53 Sricharan R wrote:
> While using the generic pagetable ops the tlb maintenance
> operation gets completed in the sync callback. So use writel_relaxed
> for all register access and add a mb() at appropriate places.
>
> Signed-off-by: Sricharan R <sricharan@codeaurora.org>
> ---
>  drivers/iommu/msm_iommu.c         |  24 +++++++--
>  drivers/iommu/msm_iommu_hw-8xxx.h | 109 ++++++++++++++++++++------------------
>  2 files changed, 79 insertions(+), 54 deletions(-)
>
> diff --git a/drivers/iommu/msm_iommu.c b/drivers/iommu/msm_iommu.c
> index 0299a37..dfcaeef 100644
> --- a/drivers/iommu/msm_iommu.c
> +++ b/drivers/iommu/msm_iommu.c
> @@ -124,6 +124,9 @@ static void msm_iommu_reset(void __iomem *base, int ncb)
>  		SET_TLBLKCR(base, ctx, 0);
>  		SET_CONTEXTIDR(base, ctx, 0);
>  	}
> +
> +	/* Ensure completion of relaxed writes from the above SET macros */
> +	mb();
>  }

Why do the above use the relaxed writes? Do you have any performance
numbers showing that skipping the sync for the reset makes a measurable
difference?

How did you prove that skipping the sync before the write is safe?

How about resetting the iommu less often instead?

>  static void __flush_iotlb(void *cookie)
> @@ -141,6 +144,9 @@ static void __flush_iotlb(void *cookie)
>  		list_for_each_entry(master, &iommu->ctx_list, list)
>  			SET_CTX_TLBIALL(iommu->base, master->num, 0);
>
> +		/* To ensure completion of TLBIALL above */
> +		mb();
> +
>  		__disable_clocks(iommu);
>  	}
>  fail:

__disable_clocks() takes spinlocks and accesses the hardware, which
in turn will also involve multiple syncs, so this seems like
a premature optimization.

Have you tried just leaving the clocks enabled? I'd suspect you could
gain more performance that way.

> @@ -181,7 +187,8 @@ fail:
>
>  static void __flush_iotlb_sync(void *cookie)
>  {
> -	/* To avoid a null function pointer */
> +	/* To ensure completion of the TLBIVA in __flush_iotlb_range */
> +	mb();
>  }

I don't understand the specific race from the comment.

What operation comes after this that relies on __flush_iotlb_range
having completed, and how does an mb() guarantee this?

This seems to be a bug fix that is unrelated to the change to
use writel_relaxed(), so better split it out into a separate
patch, with a longer changeset description. Did you spot this
race by running into incorrect data, or by inspection?

>  static const struct iommu_gather_ops msm_iommu_gather_ops = {
> @@ -235,6 +242,9 @@ static void config_mids(struct msm_iommu_dev *iommu,
>  		/* Set security bit override to be Non-secure */
>  		SET_NSCFG(iommu->base, mid, 3);
>  	}
> +
> +	/* Ensure completion of relaxed writes from the above SET macros */
> +	mb();
>  }

This looks similar to the msm_iommu_reset() case. Maybe put those
two into one patch, or find a way to run config_mids() less often so
it doesn't show up in the profiles any more.

If you hit this often enough that it causes performance problems,
it's more likely that there is a bug in your code than that changing
to relaxed accessors is a good idea.

>  static void __reset_context(void __iomem *base, int ctx)
> @@ -257,6 +267,9 @@ static void __reset_context(void __iomem *base, int ctx)
>  	SET_TLBFLPTER(base, ctx, 0);
>  	SET_TLBSLPTER(base, ctx, 0);
>  	SET_TLBLKCR(base, ctx, 0);
> +
> +	/* Ensure completion of relaxed writes from the above SET macros */
> +	mb();
>  }
>
>  static void __program_context(void __iomem *base, int ctx,
> @@ -305,6 +318,9 @@ static void __program_context(void __iomem *base, int ctx,
>
>  	/* Enable the MMU */
>  	SET_M(base, ctx, 1);
> +
> +	/* Ensure completion of relaxed writes from the above SET macros */
> +	mb();
>  }
>
>  static struct iommu_domain *msm_iommu_domain_alloc(unsigned type)

Maybe one more patch for these? I can see that
__reset_context/__program_context would be run often enough to make
a difference, though maybe not as much as the flushes.

Also, split this in two patches: one that adds a comment describing
how you proved that you don't need barriers before the writes, and
another one adding the barrier afterwards, with a description of
the failure scenario.

> @@ -500,7 +516,8 @@ static phys_addr_t msm_iommu_iova_to_phys(struct iommu_domain *domain,
>  	/* Invalidate context TLB */
>  	SET_CTX_TLBIALL(iommu->base, master->num, 0);
>  	SET_V2PPR(iommu->base, master->num, va & V2Pxx_VA);
> -
> +	/* Ensure completion of relaxed writes from the above SET macros */
> +	mb();
>  	par = GET_PAR(iommu->base, master->num);
>
>  	/* We are dealing with a supersection */

In this case, I'd say it's better to rewrite the function to avoid the
read: iova_to_phys() should be fast, and not require hardware access.
Getting rid of the hardware access by using an in-memory cache for
this should gain more than just removing the barriers, as an MMIO read
is always slow.

> @@ -714,7 +731,8 @@ static int msm_iommu_probe(struct platform_device *pdev)
>  	par = GET_PAR(iommu->base, 0);
>  	SET_V2PCFG(iommu->base, 0, 0);
>  	SET_M(iommu->base, 0, 0);
> -
> +	/* Ensure completion of relaxed writes from the above SET macros */
> +	mb();
>  	if (!par) {

Probe() clearly isn't performance sensitive, so please back this out
and use the non-relaxed accessors.

> diff --git a/drivers/iommu/msm_iommu_hw-8xxx.h b/drivers/iommu/msm_iommu_hw-8xxx.h
> index fc16010..fe2c5ca 100644
> --- a/drivers/iommu/msm_iommu_hw-8xxx.h
> +++ b/drivers/iommu/msm_iommu_hw-8xxx.h
> @@ -24,13 +24,19 @@
>  #define GET_CTX_REG(reg, base, ctx) \
>  	(readl((base) + (reg) + ((ctx) << CTX_SHIFT)))
>
> -#define SET_GLOBAL_REG(reg, base, val)	writel((val), ((base) + (reg)))
> +/*
> + * The writes to the global/context registers needs to be synced only after
> + * all the configuration writes are done. So use relaxed accessors and
> + * a barrier at the end.
> + */
> +#define SET_GLOBAL_REG_RELAXED(reg, base, val) \
> +	writel_relaxed((val), ((base) + (reg)))
>
> -#define SET_CTX_REG(reg, base, ctx, val) \
> -	writel((val), ((base) + (reg) + ((ctx) << CTX_SHIFT)))
> +#define SET_CTX_REG_RELAXED(reg, base, ctx, val) \
> +	writel_relaxed((val), ((base) + (reg) + ((ctx) << CTX_SHIFT)))

Please add the relaxed accessors first in a separate patch, and then
change over one user at a time before you remove the non-relaxed ones.

It's hard to believe that all users are actually performance critical,
so start with the ones that gain the most performance. As commented above,
some of the conversions seem particularly fishy and it's likely that
a slow reset() function should not be fixed by making reset() faster
but by calling it less often.

>  /* Context register setters/getters */
> -#define SET_SCTLR(b, c, v)	 SET_CTX_REG(SCTLR, (b), (c), (v))
> -#define SET_ACTLR(b, c, v)	 SET_CTX_REG(ACTLR, (b), (c), (v))
> -#define SET_CONTEXTIDR(b, c, v)	 SET_CTX_REG(CONTEXTIDR, (b), (c), (v))
> -#define SET_TTBR0(b, c, v)	 SET_CTX_REG(TTBR0, (b), (c), (v))
> -#define SET_TTBR1(b, c, v)	 SET_CTX_REG(TTBR1, (b), (c), (v))
> +#define SET_SCTLR(b, c, v)	 SET_CTX_REG_RELAXED(SCTLR, (b), (c), (v))
> +#define SET_ACTLR(b, c, v)	 SET_CTX_REG_RELAXED(ACTLR, (b), (c), (v))
> +#define SET_CONTEXTIDR(b, c, v)	 SET_CTX_REG_RELAXED(CONTEXTIDR, (b), (c), (v))
> +#define SET_TTBR0(b, c, v)	 SET_CTX_REG_RELAXED(TTBR0, (b), (c), (v))
> +#define SET_TTBR1(b, c, v)	 SET_CTX_REG_RELAXED(TTBR1, (b), (c), (v))
> +#define SET_TTBCR(b, c, v)	 SET_CTX_REG_RELAXED(TTBCR, (b), (c), (v))
> +#define SET_PAR(b, c, v)	 SET_CTX_REG_RELAXED(PAR, (b), (c), (v))

Can you just remove those macros, and open-code the function calls?

This is an unnecessary obfuscation that now also hides the fact that you
are using relaxed accessors.

	Arnd
On Friday 20 May 2016 13:44:10 Arnd Bergmann wrote:
> >  #define GET_CTX_REG(reg, base, ctx) \
> >  	(readl((base) + (reg) + ((ctx) << CTX_SHIFT)))
> >
> > -#define SET_GLOBAL_REG(reg, base, val)	writel((val), ((base) + (reg)))
> > +/*
> > + * The writes to the global/context registers needs to be synced only after
> > + * all the configuration writes are done. So use relaxed accessors and
> > + * a barrier at the end.
> > + */
> > +#define SET_GLOBAL_REG_RELAXED(reg, base, val) \
> > +	writel_relaxed((val), ((base) + (reg)))
> >
> > -#define SET_CTX_REG(reg, base, ctx, val) \
> > -	writel((val), ((base) + (reg) + ((ctx) << CTX_SHIFT)))
> > +#define SET_CTX_REG_RELAXED(reg, base, ctx, val) \
> > +	writel_relaxed((val), ((base) + (reg) + ((ctx) << CTX_SHIFT)))
>
> Please add the relaxed accessors first in a separate patch, and then
> change over one user at a time before you remove the non-relaxed ones.
>
> It's hard to believe that all users are actually performance critical,
> so start with the ones that gain the most performance. As commented above,
> some of the conversions seem particularly fishy and it's likely that
> a slow reset() function should not be fixed by making reset() faster
> but by calling it less often.

One more thing, please also convert them into inline functions when you
modify them, and use normal names such as

static inline void msm_iommu_write(struct msm_iommu_dev *iommu,
				   unsigned int reg, u32 val)
{
	writel(val, iommu->base + reg);
}

static inline void msm_iommu_context_write(struct msm_iommu_ctx_dev *ctx,
					   unsigned int reg, u32 val)
{
	writel(val, ctx->iommu->base + (ctx->num << CTX_SHIFT) + reg);
}

	Arnd
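A hypothetical call-site conversion under this suggestion might look as
follows. This is a sketch only: it assumes the msm_iommu_context_write()
helper above exists, that the ctx device carries the iommu back-pointer
and context number that Arnd's sketch presumes, and it reuses the
CTX_TLBIALL register offset the driver already defines:

	/* before: the macro hides both the register math and the accessor */
	SET_CTX_TLBIALL(iommu->base, master->num, 0);

	/* after: the accessor in use is visible at the call site */
	msm_iommu_context_write(master, CTX_TLBIALL, 0);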
Hi Arnd,

>> @@ -124,6 +124,9 @@ static void msm_iommu_reset(void __iomem *base, int ncb)
>>  		SET_TLBLKCR(base, ctx, 0);
>>  		SET_CONTEXTIDR(base, ctx, 0);
>>  	}
>> +
>> +	/* Ensure completion of relaxed writes from the above SET macros */
>> +	mb();
>>  }
>
> Why do the above use the relaxed writes? Do you have any performance
> numbers showing that skipping the sync for the reset makes a measurable
> difference?
>
> How did you prove that skipping the sync before the write is safe?
>
> How about resetting the iommu less often instead?

I had measured the numbers only for the full use-case path, not for the
reset path alone. I saw an improvement of about 5% on the full numbers.
As you said, the reset path is called less often and might not bring a
measurable change. I did not see a difference in behavior when changing
the sync to happen after the writes. But my understanding was that the
sync after the writes was required to ensure write completion.

I should have made smaller patches for this change. The only patch
relevant for this series is the one that changes the write in the
_iotlb_range function. The rest of the changes should be added one by
one in a separate series.

>>  static void __flush_iotlb(void *cookie)
>> @@ -141,6 +144,9 @@ static void __flush_iotlb(void *cookie)
>>  		list_for_each_entry(master, &iommu->ctx_list, list)
>>  			SET_CTX_TLBIALL(iommu->base, master->num, 0);
>>
>> +		/* To ensure completion of TLBIALL above */
>> +		mb();
>> +
>>  		__disable_clocks(iommu);
>>  	}
>>  fail:
>
> __disable_clocks() takes spinlocks and accesses the hardware, which
> in turn will also involve multiple syncs, so this seems like
> a premature optimization.
>
> Have you tried just leaving the clocks enabled? I'd suspect you could
> gain more performance that way.

I have not tried keeping the clocks always enabled. The intention here
was to have more granular clock control. But I agree that the barrier
might not be needed in this case, as it is indirectly taken care of. I
will remove it here and explain that.

>> @@ -181,7 +187,8 @@ fail:
>>
>>  static void __flush_iotlb_sync(void *cookie)
>>  {
>> -	/* To avoid a null function pointer */
>> +	/* To ensure completion of the TLBIVA in __flush_iotlb_range */
>> +	mb();
>>  }
>
> I don't understand the specific race from the comment.
>
> What operation comes after this that relies on __flush_iotlb_range
> having completed, and how does an mb() guarantee this?

The flush_iotlb_range operation invalidates the tlb for writes to the
pagetable and then finally calls the sync operation to ensure completion
of the flush, which is required before returning to the client of the
iommu. In the case of this iommu, only a barrier is required to ensure
completion of the invalidate operation.

> This seems to be a bug fix that is unrelated to the change to
> use writel_relaxed(), so better split it out into a separate
> patch, with a longer changeset description. Did you spot this
> race by running into incorrect data, or by inspection?

No, I did not see a data corruption issue without the mb(), but it
could have been hidden in some other way as well. Another difference is
that the sync was previously done before the write and is now moved
after it. As I understand it, a sync after the write is correct. So I
will change this patch with more description and move the rest of the
changes out.

>>  static const struct iommu_gather_ops msm_iommu_gather_ops = {
>> @@ -235,6 +242,9 @@ static void config_mids(struct msm_iommu_dev *iommu,
>>  		/* Set security bit override to be Non-secure */
>>  		SET_NSCFG(iommu->base, mid, 3);
>>  	}
>> +
>> +	/* Ensure completion of relaxed writes from the above SET macros */
>> +	mb();
>>  }
>
> This looks similar to the msm_iommu_reset() case. Maybe put those
> two into one patch, or find a way to run config_mids() less often so
> it doesn't show up in the profiles any more.
>
> If you hit this often enough that it causes performance problems,
> it's more likely that there is a bug in your code than that changing
> to relaxed accessors is a good idea.

OK; similar to reset, this is called less often compared to the flushes
and may not show a difference in performance. I will move this out.

>>  static void __reset_context(void __iomem *base, int ctx)
>> @@ -257,6 +267,9 @@ static void __reset_context(void __iomem *base, int ctx)
>>  	SET_TLBFLPTER(base, ctx, 0);
>>  	SET_TLBSLPTER(base, ctx, 0);
>>  	SET_TLBLKCR(base, ctx, 0);
>> +
>> +	/* Ensure completion of relaxed writes from the above SET macros */
>> +	mb();
>>  }
>>
>>  static void __program_context(void __iomem *base, int ctx,
>> @@ -305,6 +318,9 @@ static void __program_context(void __iomem *base, int ctx,
>>
>>  	/* Enable the MMU */
>>  	SET_M(base, ctx, 1);
>> +
>> +	/* Ensure completion of relaxed writes from the above SET macros */
>> +	mb();
>>  }
>>
>>  static struct iommu_domain *msm_iommu_domain_alloc(unsigned type)
>
> Maybe one more patch for these? I can see that
> __reset_context/__program_context would be run often enough to make
> a difference, though maybe not as much as the flushes.
>
> Also, split this in two patches: one that adds a comment describing
> how you proved that you don't need barriers before the writes, and
> another one adding the barrier afterwards, with a description of
> the failure scenario.

OK, I will split it. I did not explicitly see a failure, though. It is
based on the understanding that a sync after the write is more
appropriate than before it.

>> @@ -500,7 +516,8 @@ static phys_addr_t msm_iommu_iova_to_phys(struct iommu_domain *domain,
>>  	/* Invalidate context TLB */
>>  	SET_CTX_TLBIALL(iommu->base, master->num, 0);
>>  	SET_V2PPR(iommu->base, master->num, va & V2Pxx_VA);
>> -
>> +	/* Ensure completion of relaxed writes from the above SET macros */
>> +	mb();
>>  	par = GET_PAR(iommu->base, master->num);
>>
>>  	/* We are dealing with a supersection */
>
> In this case, I'd say it's better to rewrite the function to avoid the
> read: iova_to_phys() should be fast, and not require hardware access.
> Getting rid of the hardware access by using an in-memory cache for
> this should gain more than just removing the barriers, as an MMIO read
> is always slow.

OK, you mean using the software walk-through? I will check on this to
measure the latency difference. If that's true, then the iopgtable ops
itself provides a function for iova_to_phys conversion, so that can be
used.

>> @@ -714,7 +731,8 @@ static int msm_iommu_probe(struct platform_device *pdev)
>>  	par = GET_PAR(iommu->base, 0);
>>  	SET_V2PCFG(iommu->base, 0, 0);
>>  	SET_M(iommu->base, 0, 0);
>> -
>> +	/* Ensure completion of relaxed writes from the above SET macros */
>> +	mb();
>>  	if (!par) {
>
> Probe() clearly isn't performance sensitive, so please back this out
> and use the non-relaxed accessors.

OK, will change this back.

>> diff --git a/drivers/iommu/msm_iommu_hw-8xxx.h b/drivers/iommu/msm_iommu_hw-8xxx.h
>> index fc16010..fe2c5ca 100644
>> --- a/drivers/iommu/msm_iommu_hw-8xxx.h
>> +++ b/drivers/iommu/msm_iommu_hw-8xxx.h
>> @@ -24,13 +24,19 @@
>>  #define GET_CTX_REG(reg, base, ctx) \
>>  	(readl((base) + (reg) + ((ctx) << CTX_SHIFT)))
>>
>> -#define SET_GLOBAL_REG(reg, base, val)	writel((val), ((base) + (reg)))
>> +/*
>> + * The writes to the global/context registers needs to be synced only after
>> + * all the configuration writes are done. So use relaxed accessors and
>> + * a barrier at the end.
>> + */
>> +#define SET_GLOBAL_REG_RELAXED(reg, base, val) \
>> +	writel_relaxed((val), ((base) + (reg)))
>>
>> -#define SET_CTX_REG(reg, base, ctx, val) \
>> -	writel((val), ((base) + (reg) + ((ctx) << CTX_SHIFT)))
>> +#define SET_CTX_REG_RELAXED(reg, base, ctx, val) \
>> +	writel_relaxed((val), ((base) + (reg) + ((ctx) << CTX_SHIFT)))
>
> Please add the relaxed accessors first in a separate patch, and then
> change over one user at a time before you remove the non-relaxed ones.

OK.

> It's hard to believe that all users are actually performance critical,
> so start with the ones that gain the most performance. As commented above,
> some of the conversions seem particularly fishy and it's likely that
> a slow reset() function should not be fixed by making reset() faster
> but by calling it less often.

OK, will change this. So I will split this into separate series: one
patch to modify the relevant flush_iotlb_range function in this series,
and the rest of the changes as driver improvements in another.

> Can you just remove those macros, and open-code the function calls?
>
> This is an unnecessary obfuscation that now also hides the fact that you
> are using relaxed accessors.

OK, will make these into inline functions to make it obvious.

Regards,
 Sricharan
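For reference, the "sync before vs. sync after" distinction being
debated here comes down to where the barrier sits in the accessor. A
simplified sketch of the 32-bit ARM definitions follows; barrier
placement is the point, the real macros in arch/arm/include/asm/io.h
carry additional casts:

#define writel_relaxed(v, c)	__raw_writel((u32)cpu_to_le32(v), (c))
#define writel(v, c)		({ __iowmb(); writel_relaxed((v), (c)); })

So a plain writel() orders prior memory accesses before the MMIO store
but guarantees nothing about the store's completion. Converting to
writel_relaxed() therefore drops a barrier that came before each write,
while the mb() this patch adds after the writes is a different ordering
point entirely, which is why the two changes need separate justification.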
On Monday, May 23, 2016 11:35:04 AM CEST Sricharan wrote:
> Hi Arnd,
>
> >> @@ -124,6 +124,9 @@ static void msm_iommu_reset(void __iomem *base, int ncb)
> >>  		SET_TLBLKCR(base, ctx, 0);
> >>  		SET_CONTEXTIDR(base, ctx, 0);
> >>  	}
> >> +
> >> +	/* Ensure completion of relaxed writes from the above SET macros */
> >> +	mb();
> >>  }
> >
> > Why do the above use the relaxed writes? Do you have any performance
> > numbers showing that skipping the sync for the reset makes a measurable
> > difference?
> >
> > How did you prove that skipping the sync before the write is safe?
> >
> > How about resetting the iommu less often instead?
>
> I had measured the numbers only for the full use-case path, not for the
> reset path alone. I saw an improvement of about 5% on the full numbers.
> As you said, the reset path is called less often and might not bring a
> measurable change. I did not see a difference in behavior when changing
> the sync to happen after the writes.

Ok, then better not change it.

> But my understanding was that the sync after the writes was required to
> ensure write completion.

Can you cite the relevant documentation on this? Is this specific
to the Qualcomm CPU implementation or the IOMMU? I don't think
the ARM architecture requires anything like this in general.

> I should have made smaller patches for this change. The only patch
> relevant for this series is the one that changes the write in the
> _iotlb_range function. The rest of the changes should be added one by
> one in a separate series.

If you see the same 5% performance improvement with a simpler change,
then better do only that. The IOMMU infrastructure is rather sensitive
to having correct barriers everywhere, so this minimizes the risk of
getting it wrong somewhere.

> >> @@ -181,7 +187,8 @@ fail:
> >>
> >>  static void __flush_iotlb_sync(void *cookie)
> >>  {
> >> -	/* To avoid a null function pointer */
> >> +	/* To ensure completion of the TLBIVA in __flush_iotlb_range */
> >> +	mb();
> >>  }
> >
> > I don't understand the specific race from the comment.
> >
> > What operation comes after this that relies on __flush_iotlb_range
> > having completed, and how does an mb() guarantee this?
>
> The flush_iotlb_range operation invalidates the tlb for writes to the
> pagetable and then finally calls the sync operation to ensure completion
> of the flush, which is required before returning to the client of the
> iommu. In the case of this iommu, only a barrier is required to ensure
> completion of the invalidate operation.

This doesn't answer my question: What operation would a client do
that requires the flush to be completed here? A barrier is always
defined in terms of things that come before it in combination with
things that come after it.

Any operation that could trigger a DMA from a device is required
to have a barrier preceding it (usually the wmb() implied by writel()),
so this is clearly not about a driver that installs a DMA mapping
before starting a DMA, but I don't see what else it would be.

> > This seems to be a bug fix that is unrelated to the change to
> > use writel_relaxed(), so better split it out into a separate
> > patch, with a longer changeset description. Did you spot this
> > race by running into incorrect data, or by inspection?
>
> No, I did not see a data corruption issue without the mb(), but it
> could have been hidden in some other way as well. Another difference is
> that the sync was previously done before the write and is now moved
> after it. As I understand it, a sync after the write is correct. So I
> will change this patch with more description and move the rest of the
> changes out.

Ok.

> >> @@ -500,7 +516,8 @@ static phys_addr_t msm_iommu_iova_to_phys(struct iommu_domain *domain,
> >>  	/* Invalidate context TLB */
> >>  	SET_CTX_TLBIALL(iommu->base, master->num, 0);
> >>  	SET_V2PPR(iommu->base, master->num, va & V2Pxx_VA);
> >> -
> >> +	/* Ensure completion of relaxed writes from the above SET macros */
> >> +	mb();
> >>  	par = GET_PAR(iommu->base, master->num);
> >>
> >>  	/* We are dealing with a supersection */
> >
> > In this case, I'd say it's better to rewrite the function to avoid the
> > read: iova_to_phys() should be fast, and not require hardware access.
> > Getting rid of the hardware access by using an in-memory cache for
> > this should gain more than just removing the barriers, as an MMIO read
> > is always slow.
>
> OK, you mean using the software walk-through? I will check on this to
> measure the latency difference. If that's true, then the iopgtable ops
> itself provides a function for iova_to_phys conversion, so that can be
> used.

I hadn't realized that this is a lookup in the hardware, rather than
reading a static register. It's probably a good idea to check this
anyway.

	Arnd
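To make the client-side sequence concrete, the case Arnd describes is
roughly the following; dma_map_single() is the standard DMA API entry
point that ends up in the IOMMU's map-and-flush path, while dev_regs and
the two register names are hypothetical:

	dma_addr_t dma = dma_map_single(dev, buf, len, DMA_TO_DEVICE);

	writel(lower_32_bits(dma), dev_regs + DMA_ADDR);  /* hypothetical */
	writel(DMA_START, dev_regs + DMA_CTRL);           /* implied wmb() */

Because the doorbell writel() carries an implied wmb(), everything
issued before it, including the TLB maintenance done during the map, is
ordered ahead of the store the device observes; this is why Arnd sees no
client operation that would need an extra mb() in the sync callback.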
Hi Arnd,

<snip..>

>> I had measured the numbers only for the full use-case path, not for the
>> reset path alone. I saw an improvement of about 5% on the full numbers.
>> As you said, the reset path is called less often and might not bring a
>> measurable change. I did not see a difference in behavior when changing
>> the sync to happen after the writes.
>
> Ok, then better not change it.

OK.

>> But my understanding was that the sync after the writes was required to
>> ensure write completion.
>
> Can you cite the relevant documentation on this? Is this specific
> to the Qualcomm CPU implementation or the IOMMU? I don't think
> the ARM architecture requires anything like this in general.

My explanation was not correct. The whole reason for doing the sync here
was to make sure that the client observes the effect of the tlb
invalidation before starting any dma operation. But as you have
explained below, if it is implicit that any operation from a device that
could trigger a dma has a barrier preceding it, then a sync here may not
be needed.

>> >> @@ -181,7 +187,8 @@ fail:
>> >>
>> >>  static void __flush_iotlb_sync(void *cookie)
>> >>  {
>> >> -	/* To avoid a null function pointer */
>> >> +	/* To ensure completion of the TLBIVA in __flush_iotlb_range */
>> >> +	mb();
>> >>  }
>> >
>> > I don't understand the specific race from the comment.
>> >
>> > What operation comes after this that relies on __flush_iotlb_range
>> > having completed, and how does an mb() guarantee this?
>>
>> The flush_iotlb_range operation invalidates the tlb for writes to the
>> pagetable and then finally calls the sync operation to ensure completion
>> of the flush, which is required before returning to the client of the
>> iommu. In the case of this iommu, only a barrier is required to ensure
>> completion of the invalidate operation.
>
> This doesn't answer my question: What operation would a client do
> that requires the flush to be completed here? A barrier is always
> defined in terms of things that come before it in combination with
> things that come after it.
>
> Any operation that could trigger a DMA from a device is required
> to have a barrier preceding it (usually the wmb() implied by writel()),
> so this is clearly not about a driver that installs a DMA mapping
> before starting a DMA, but I don't see what else it would be.

OK, so I was doing this from the idea that other iommu drivers were
polling on a status bit in their sync call to ensure completion of
pending TLB invalidations. But in this case, there is no status bit, so
I added a barrier to avoid ordering issues before the client triggers
the dma operation. But as you say above, if it is implicit that the
device has a barrier before starting the trigger, then the barrier here
becomes redundant.

<snip..>

>> > In this case, I'd say it's better to rewrite the function to avoid the
>> > read: iova_to_phys() should be fast, and not require hardware access.
>> > Getting rid of the hardware access by using an in-memory cache for
>> > this should gain more than just removing the barriers, as an MMIO read
>> > is always slow.
>>
>> OK, you mean using the software walk-through? I will check on this to
>> measure the latency difference. If that's true, then the iopgtable ops
>> itself provides a function for iova_to_phys conversion, so that can be
>> used.
>
> I hadn't realized that this is a lookup in the hardware, rather than
> reading a static register. It's probably a good idea to check this
> anyway.

OK, will measure the difference and keep the better one.

Regards,
 Sricharan
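The software-walk alternative mentioned here would look roughly like
the sketch below, assuming the domain's private data keeps the
io_pgtable_ops pointer returned by alloc_io_pgtable_ops(); the
priv->iop field name and the to_msm_priv() helper are assumptions about
the driver's structures:

static phys_addr_t msm_iommu_iova_to_phys(struct iommu_domain *domain,
					  dma_addr_t va)
{
	struct msm_priv *priv = to_msm_priv(domain);

	/* walk the in-memory page tables instead of programming
	 * TLBIALL/V2PPR and reading PAR back over MMIO */
	return priv->iop->iova_to_phys(priv->iop, va);
}

This removes both the MMIO read and the barrier question from the
lookup path entirely.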
On Wednesday, May 25, 2016 4:15:31 PM CEST Sricharan wrote:
> >
> > Any operation that could trigger a DMA from a device is required
> > to have a barrier preceding it (usually the wmb() implied by writel()),
> > so this is clearly not about a driver that installs a DMA mapping
> > before starting a DMA, but I don't see what else it would be.
>
> OK, so I was doing this from the idea that other iommu drivers were
> polling on a status bit in their sync call to ensure completion of
> pending TLB invalidations. But in this case, there is no status bit, so
> I added a barrier to avoid ordering issues before the client triggers
> the dma operation. But as you say above, if it is implicit that the
> device has a barrier before starting the trigger, then the barrier here
> becomes redundant.

Ok. There are two more things to note here:

* On other platforms, polling the register is likely required because
  an MMIO write is "posted", meaning that a sync after writel() will
  only ensure that it has left the CPU write queue, but it may still be
  on the bus fabric and whatever side-effects are triggered by the
  write are normally not guaranteed to be completed even after the
  'sync'. You need to check the datasheet for your IOMMU to find out
  whether the 'dsb' instruction actually has any effect on the IOMMU.
  If not, then neither the barrier that you add here nor the barrier
  in the following writel() is sufficient.

* The one barrier that is normally required in an IOMMU is between
  updating the in-memory page tables and the following MMIO store
  that triggers the TLB flush for that entry. This barrier is
  implied by writel() but not writel_relaxed(). If you don't have
  a hardware page table walker in your IOMMU, you don't need to worry
  about this.

	Arnd
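A sketch of the read-back idiom from the first point, written against
this driver's existing macros; whether CTX_TLBIALL reads back
meaningfully, and whether a read-back is sufficient for this IOMMU, is
exactly what the datasheet would have to confirm:

	/* post the invalidate ... */
	SET_CTX_TLBIALL(iommu->base, master->num, 0);

	/* ... then read any register on the same device: the read cannot
	 * return until the posted write has reached the IOMMU */
	(void)GET_CTX_REG(CTX_TLBIALL, iommu->base, master->num);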
Hi Arnd,

>> OK, so I was doing this from the idea that other iommu drivers were
>> polling on a status bit in their sync call to ensure completion of
>> pending TLB invalidations. But in this case, there is no status bit, so
>> I added a barrier to avoid ordering issues before the client triggers
>> the dma operation. But as you say above, if it is implicit that the
>> device has a barrier before starting the trigger, then the barrier here
>> becomes redundant.
>
> Ok. There are two more things to note here:
>
> * On other platforms, polling the register is likely required because
>   an MMIO write is "posted", meaning that a sync after writel() will
>   only ensure that it has left the CPU write queue, but it may still be
>   on the bus fabric and whatever side-effects are triggered by the
>   write are normally not guaranteed to be completed even after the
>   'sync'. You need to check the datasheet for your IOMMU to find out
>   whether the 'dsb' instruction actually has any effect on the IOMMU.
>   If not, then neither the barrier that you add here nor the barrier
>   in the following writel() is sufficient.

Thanks for the detailed explanation. I will check this. So with this, I
think that if the iommu does not support polling for its status, then it
must honor the 'dsb'; otherwise there will be no way of making it
coherent?

> * The one barrier that is normally required in an IOMMU is between
>   updating the in-memory page tables and the following MMIO store
>   that triggers the TLB flush for that entry. This barrier is
>   implied by writel() but not writel_relaxed(). If you don't have
>   a hardware page table walker in your IOMMU, you don't need to worry
>   about this.

To get my understanding correct: is the barrier required here because of
a speculative fetch?

Regards,
 Sricharan
On Wednesday, May 25, 2016 6:49:18 PM CEST Sricharan wrote:
> Hi Arnd,
>
> >> OK, so I was doing this from the idea that other iommu drivers were
> >> polling on a status bit in their sync call to ensure completion of
> >> pending TLB invalidations. But in this case, there is no status bit, so
> >> I added a barrier to avoid ordering issues before the client triggers
> >> the dma operation. But as you say above, if it is implicit that the
> >> device has a barrier before starting the trigger, then the barrier here
> >> becomes redundant.
> >
> > Ok. There are two more things to note here:
> >
> > * On other platforms, polling the register is likely required because
> >   an MMIO write is "posted", meaning that a sync after writel() will
> >   only ensure that it has left the CPU write queue, but it may still be
> >   on the bus fabric and whatever side-effects are triggered by the
> >   write are normally not guaranteed to be completed even after the
> >   'sync'. You need to check the datasheet for your IOMMU to find out
> >   whether the 'dsb' instruction actually has any effect on the IOMMU.
> >   If not, then neither the barrier that you add here nor the barrier
> >   in the following writel() is sufficient.
>
> Thanks for the detailed explanation. I will check this. So with this, I
> think that if the iommu does not support polling for its status, then it
> must honor the 'dsb'; otherwise there will be no way of making it
> coherent?

It's more likely that you have to do a readl() after the writel() to
any address inside of the device in order to ensure the writel()
has completed, but that is something that the hardware manual for
the iommu should describe.

> > * The one barrier that is normally required in an IOMMU is between
> >   updating the in-memory page tables and the following MMIO store
> >   that triggers the TLB flush for that entry. This barrier is
> >   implied by writel() but not writel_relaxed(). If you don't have
> >   a hardware page table walker in your IOMMU, you don't need to worry
> >   about this.
>
> To get my understanding correct: is the barrier required here because of
> a speculative fetch?

It doesn't have to be a speculative fetch, it could also be a normal
fetch triggered by a DMA. The point is that the CPU is free to reorder
the store to (io page table) memory relative to the store to the
IOMMU flush register, and when you directly follow up with a store to
the DMA master that triggers a transfer, this transfer can hit the
IOMMU and cause the stale page table entry to be read.

I guess in practice the barrier that comes before the MMIO write to the
DMA master would ensure ordering of the DMA after both the IOPTE
update and the MMIO flush (unless there was the speculative fetch you
mentioned), but this is where I'm no longer confident enough in my
own understanding of the ordering rules to rely on that. Having the
barrier between the IOPTE update and the flush certainly leaves
no room for ambiguity.

	Arnd
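Spelled out as code, the ordering requirement from that last paragraph
looks like this; iopte, idx and pte are illustrative names for the
in-memory io page table update, while SET_TLBIVA() is the driver's
existing invalidate-by-VA macro:

	/* publish the new entry in the in-memory io page table */
	iopte[idx] = pte;

	/*
	 * Order the page-table store before the MMIO store that triggers
	 * the flush. A plain writel() implies this wmb(); with the relaxed
	 * SET macros it must be explicit, or a hardware walker can fetch
	 * the stale entry.
	 */
	wmb();
	SET_TLBIVA(base, ctx, iova);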
Hi Arnd,

>> > * On other platforms, polling the register is likely required because
>> >   an MMIO write is "posted", meaning that a sync after writel() will
>> >   only ensure that it has left the CPU write queue, but it may still be
>> >   on the bus fabric and whatever side-effects are triggered by the
>> >   write are normally not guaranteed to be completed even after the
>> >   'sync'. You need to check the datasheet for your IOMMU to find out
>> >   whether the 'dsb' instruction actually has any effect on the IOMMU.
>> >   If not, then neither the barrier that you add here nor the barrier
>> >   in the following writel() is sufficient.
>>
>> Thanks for the detailed explanation. I will check this. So with this, I
>> think that if the iommu does not support polling for its status, then it
>> must honor the 'dsb'; otherwise there will be no way of making it
>> coherent?
>
> It's more likely that you have to do a readl() after the writel() to
> any address inside of the device in order to ensure the writel()
> has completed, but that is something that the hardware manual for
> the iommu should describe.

OK, got it. I will do it here as required.

>> > * The one barrier that is normally required in an IOMMU is between
>> >   updating the in-memory page tables and the following MMIO store
>> >   that triggers the TLB flush for that entry. This barrier is
>> >   implied by writel() but not writel_relaxed(). If you don't have
>> >   a hardware page table walker in your IOMMU, you don't need to worry
>> >   about this.
>>
>> To get my understanding correct: is the barrier required here because of
>> a speculative fetch?
>
> It doesn't have to be a speculative fetch, it could also be a normal
> fetch triggered by a DMA. The point is that the CPU is free to reorder
> the store to (io page table) memory relative to the store to the
> IOMMU flush register, and when you directly follow up with a store to
> the DMA master that triggers a transfer, this transfer can hit the
> IOMMU and cause the stale page table entry to be read.
>
> I guess in practice the barrier that comes before the MMIO write to the
> DMA master would ensure ordering of the DMA after both the IOPTE
> update and the MMIO flush (unless there was the speculative fetch you
> mentioned), but this is where I'm no longer confident enough in my
> own understanding of the ordering rules to rely on that. Having the
> barrier between the IOPTE update and the flush certainly leaves
> no room for ambiguity.

OK, understood. Also, I think any speculative access should go away if
an unmap is done first and the clients then remap.

Regards,
 Sricharan
diff --git a/drivers/iommu/msm_iommu.c b/drivers/iommu/msm_iommu.c
index 0299a37..dfcaeef 100644
--- a/drivers/iommu/msm_iommu.c
+++ b/drivers/iommu/msm_iommu.c
@@ -124,6 +124,9 @@ static void msm_iommu_reset(void __iomem *base, int ncb)
 		SET_TLBLKCR(base, ctx, 0);
 		SET_CONTEXTIDR(base, ctx, 0);
 	}
+
+	/* Ensure completion of relaxed writes from the above SET macros */
+	mb();
 }
 
 static void __flush_iotlb(void *cookie)
@@ -141,6 +144,9 @@ static void __flush_iotlb(void *cookie)
 		list_for_each_entry(master, &iommu->ctx_list, list)
 			SET_CTX_TLBIALL(iommu->base, master->num, 0);
 
+		/* To ensure completion of TLBIALL above */
+		mb();
+
 		__disable_clocks(iommu);
 	}
 fail:
@@ -181,7 +187,8 @@ fail:
 
 static void __flush_iotlb_sync(void *cookie)
 {
-	/* To avoid a null function pointer */
+	/* To ensure completion of the TLBIVA in __flush_iotlb_range */
+	mb();
 }
 
 static const struct iommu_gather_ops msm_iommu_gather_ops = {
@@ -235,6 +242,9 @@ static void config_mids(struct msm_iommu_dev *iommu,
 		/* Set security bit override to be Non-secure */
 		SET_NSCFG(iommu->base, mid, 3);
 	}
+
+	/* Ensure completion of relaxed writes from the above SET macros */
+	mb();
 }
 
 static void __reset_context(void __iomem *base, int ctx)
@@ -257,6 +267,9 @@ static void __reset_context(void __iomem *base, int ctx)
 	SET_TLBFLPTER(base, ctx, 0);
 	SET_TLBSLPTER(base, ctx, 0);
 	SET_TLBLKCR(base, ctx, 0);
+
+	/* Ensure completion of relaxed writes from the above SET macros */
+	mb();
 }
 
 static void __program_context(void __iomem *base, int ctx,
@@ -305,6 +318,9 @@ static void __program_context(void __iomem *base, int ctx,
 
 	/* Enable the MMU */
 	SET_M(base, ctx, 1);
+
+	/* Ensure completion of relaxed writes from the above SET macros */
+	mb();
 }
 
 static struct iommu_domain *msm_iommu_domain_alloc(unsigned type)
@@ -500,7 +516,8 @@ static phys_addr_t msm_iommu_iova_to_phys(struct iommu_domain *domain,
 	/* Invalidate context TLB */
 	SET_CTX_TLBIALL(iommu->base, master->num, 0);
 	SET_V2PPR(iommu->base, master->num, va & V2Pxx_VA);
-
+	/* Ensure completion of relaxed writes from the above SET macros */
+	mb();
 	par = GET_PAR(iommu->base, master->num);
 
 	/* We are dealing with a supersection */
@@ -714,7 +731,8 @@ static int msm_iommu_probe(struct platform_device *pdev)
 	par = GET_PAR(iommu->base, 0);
 	SET_V2PCFG(iommu->base, 0, 0);
 	SET_M(iommu->base, 0, 0);
-
+	/* Ensure completion of relaxed writes from the above SET macros */
+	mb();
 	if (!par) {
 		pr_err("Invalid PAR value detected\n");
 		ret = -ENODEV;
diff --git a/drivers/iommu/msm_iommu_hw-8xxx.h b/drivers/iommu/msm_iommu_hw-8xxx.h
index fc16010..fe2c5ca 100644
--- a/drivers/iommu/msm_iommu_hw-8xxx.h
+++ b/drivers/iommu/msm_iommu_hw-8xxx.h
@@ -24,13 +24,19 @@
 #define GET_CTX_REG(reg, base, ctx) \
 	(readl((base) + (reg) + ((ctx) << CTX_SHIFT)))
 
-#define SET_GLOBAL_REG(reg, base, val)	writel((val), ((base) + (reg)))
+/*
+ * The writes to the global/context registers needs to be synced only after
+ * all the configuration writes are done. So use relaxed accessors and
+ * a barrier at the end.
+ */
+#define SET_GLOBAL_REG_RELAXED(reg, base, val) \
+	writel_relaxed((val), ((base) + (reg)))
 
-#define SET_CTX_REG(reg, base, ctx, val) \
-	writel((val), ((base) + (reg) + ((ctx) << CTX_SHIFT)))
+#define SET_CTX_REG_RELAXED(reg, base, ctx, val) \
+	writel_relaxed((val), ((base) + (reg) + ((ctx) << CTX_SHIFT)))
 
-/* Wrappers for numbered registers */
-#define SET_GLOBAL_REG_N(b, n, r, v) SET_GLOBAL_REG(b, ((r) + (n << 2)), (v))
+ /* Wrappers for numbered registers */
+#define SET_GLOBAL_REG_N(b, n, r, v) SET_GLOBAL_REG_RELAXED(b, ((r) + (n << 2)), (v))
 #define GET_GLOBAL_REG_N(b, n, r)    GET_GLOBAL_REG(b, ((r) + (n << 2)))
 
 /* Field wrappers */
@@ -39,16 +45,17 @@
 	GET_FIELD(((b) + (r) + ((c) << CTX_SHIFT)), F##_MASK, F##_SHIFT)
 
 #define SET_GLOBAL_FIELD(b, r, F, v) \
-	SET_FIELD(((b) + (r)), F##_MASK, F##_SHIFT, (v))
+	SET_FIELD_RELAXED(((b) + (r)), F##_MASK, F##_SHIFT, (v))
 #define SET_CONTEXT_FIELD(b, c, r, F, v)	\
-	SET_FIELD(((b) + (r) + ((c) << CTX_SHIFT)), F##_MASK, F##_SHIFT, (v))
+	SET_FIELD_RELAXED(((b) + (r) + ((c) << CTX_SHIFT)), F##_MASK, F##_SHIFT, (v))
 
 #define GET_FIELD(addr, mask, shift)  ((readl(addr) >> (shift)) & (mask))
 
-#define SET_FIELD(addr, mask, shift, v) \
+#define SET_FIELD_RELAXED(addr, mask, shift, v) \
 do { \
 	int t = readl(addr); \
-	writel((t & ~((mask) << (shift))) + (((v) & (mask)) << (shift)), addr);\
+	writel_relaxed((t & ~((mask) << (shift))) + \
+		       (((v) & (mask)) << (shift)), addr);\
 } while (0)
 
@@ -96,20 +103,20 @@ do { \
 
 /* Global register setters / getters */
 #define SET_M2VCBR_N(b, N, v)	 SET_GLOBAL_REG_N(M2VCBR_N, N, (b), (v))
 #define SET_CBACR_N(b, N, v)	 SET_GLOBAL_REG_N(CBACR_N, N, (b), (v))
-#define SET_TLBRSW(b, v)	 SET_GLOBAL_REG(TLBRSW, (b), (v))
-#define SET_TLBTR0(b, v)	 SET_GLOBAL_REG(TLBTR0, (b), (v))
-#define SET_TLBTR1(b, v)	 SET_GLOBAL_REG(TLBTR1, (b), (v))
-#define SET_TLBTR2(b, v)	 SET_GLOBAL_REG(TLBTR2, (b), (v))
-#define SET_TESTBUSCR(b, v)	 SET_GLOBAL_REG(TESTBUSCR, (b), (v))
-#define SET_GLOBAL_TLBIALL(b, v) SET_GLOBAL_REG(GLOBAL_TLBIALL, (b), (v))
-#define SET_TLBIVMID(b, v)	 SET_GLOBAL_REG(TLBIVMID, (b), (v))
-#define SET_CR(b, v)		 SET_GLOBAL_REG(CR, (b), (v))
-#define SET_EAR(b, v)		 SET_GLOBAL_REG(EAR, (b), (v))
-#define SET_ESR(b, v)		 SET_GLOBAL_REG(ESR, (b), (v))
-#define SET_ESRRESTORE(b, v)	 SET_GLOBAL_REG(ESRRESTORE, (b), (v))
-#define SET_ESYNR0(b, v)	 SET_GLOBAL_REG(ESYNR0, (b), (v))
-#define SET_ESYNR1(b, v)	 SET_GLOBAL_REG(ESYNR1, (b), (v))
-#define SET_RPU_ACR(b, v)	 SET_GLOBAL_REG(RPU_ACR, (b), (v))
+#define SET_TLBRSW(b, v)	 SET_GLOBAL_REG_RELAXED(TLBRSW, (b), (v))
+#define SET_TLBTR0(b, v)	 SET_GLOBAL_REG_RELAXED(TLBTR0, (b), (v))
+#define SET_TLBTR1(b, v)	 SET_GLOBAL_REG_RELAXED(TLBTR1, (b), (v))
+#define SET_TLBTR2(b, v)	 SET_GLOBAL_REG_RELAXED(TLBTR2, (b), (v))
+#define SET_TESTBUSCR(b, v)	 SET_GLOBAL_REG_RELAXED(TESTBUSCR, (b), (v))
+#define SET_GLOBAL_TLBIALL(b, v) SET_GLOBAL_REG_RELAXED(GLOBAL_TLBIALL, (b), (v))
+#define SET_TLBIVMID(b, v)	 SET_GLOBAL_REG_RELAXED(TLBIVMID, (b), (v))
+#define SET_CR(b, v)		 SET_GLOBAL_REG_RELAXED(CR, (b), (v))
+#define SET_EAR(b, v)		 SET_GLOBAL_REG_RELAXED(EAR, (b), (v))
+#define SET_ESR(b, v)		 SET_GLOBAL_REG_RELAXED(ESR, (b), (v))
+#define SET_ESRRESTORE(b, v)	 SET_GLOBAL_REG_RELAXED(ESRRESTORE, (b), (v))
+#define SET_ESYNR0(b, v)	 SET_GLOBAL_REG_RELAXED(ESYNR0, (b), (v))
+#define SET_ESYNR1(b, v)	 SET_GLOBAL_REG_RELAXED(ESYNR1, (b), (v))
+#define SET_RPU_ACR(b, v)	 SET_GLOBAL_REG_RELAXED(RPU_ACR, (b), (v))
 
 #define GET_M2VCBR_N(b, N)	 GET_GLOBAL_REG_N(M2VCBR_N, N, (b))
 #define GET_CBACR_N(b, N)	 GET_GLOBAL_REG_N(CBACR_N, N, (b))
@@ -131,34 +138,34 @@ do { \
 
 /* Context register setters/getters */
-#define SET_SCTLR(b, c, v)	 SET_CTX_REG(SCTLR, (b), (c), (v))
-#define SET_ACTLR(b, c, v)	 SET_CTX_REG(ACTLR, (b), (c), (v))
-#define SET_CONTEXTIDR(b, c, v)	 SET_CTX_REG(CONTEXTIDR, (b), (c), (v))
-#define SET_TTBR0(b, c, v)	 SET_CTX_REG(TTBR0, (b), (c), (v))
-#define SET_TTBR1(b, c, v)	 SET_CTX_REG(TTBR1, (b), (c), (v))
-#define SET_TTBCR(b, c, v)	 SET_CTX_REG(TTBCR, (b), (c), (v))
-#define SET_PAR(b, c, v)	 SET_CTX_REG(PAR, (b), (c), (v))
-#define SET_FSR(b, c, v)	 SET_CTX_REG(FSR, (b), (c), (v))
-#define SET_FSRRESTORE(b, c, v)	 SET_CTX_REG(FSRRESTORE, (b), (c), (v))
-#define SET_FAR(b, c, v)	 SET_CTX_REG(FAR, (b), (c), (v))
-#define SET_FSYNR0(b, c, v)	 SET_CTX_REG(FSYNR0, (b), (c), (v))
-#define SET_FSYNR1(b, c, v)	 SET_CTX_REG(FSYNR1, (b), (c), (v))
-#define SET_PRRR(b, c, v)	 SET_CTX_REG(PRRR, (b), (c), (v))
-#define SET_NMRR(b, c, v)	 SET_CTX_REG(NMRR, (b), (c), (v))
-#define SET_TLBLKCR(b, c, v)	 SET_CTX_REG(TLBLCKR, (b), (c), (v))
-#define SET_V2PSR(b, c, v)	 SET_CTX_REG(V2PSR, (b), (c), (v))
-#define SET_TLBFLPTER(b, c, v)	 SET_CTX_REG(TLBFLPTER, (b), (c), (v))
-#define SET_TLBSLPTER(b, c, v)	 SET_CTX_REG(TLBSLPTER, (b), (c), (v))
-#define SET_BFBCR(b, c, v)	 SET_CTX_REG(BFBCR, (b), (c), (v))
-#define SET_CTX_TLBIALL(b, c, v) SET_CTX_REG(CTX_TLBIALL, (b), (c), (v))
-#define SET_TLBIASID(b, c, v)	 SET_CTX_REG(TLBIASID, (b), (c), (v))
-#define SET_TLBIVA(b, c, v)	 SET_CTX_REG(TLBIVA, (b), (c), (v))
-#define SET_TLBIVAA(b, c, v)	 SET_CTX_REG(TLBIVAA, (b), (c), (v))
-#define SET_V2PPR(b, c, v)	 SET_CTX_REG(V2PPR, (b), (c), (v))
-#define SET_V2PPW(b, c, v)	 SET_CTX_REG(V2PPW, (b), (c), (v))
-#define SET_V2PUR(b, c, v)	 SET_CTX_REG(V2PUR, (b), (c), (v))
-#define SET_V2PUW(b, c, v)	 SET_CTX_REG(V2PUW, (b), (c), (v))
-#define SET_RESUME(b, c, v)	 SET_CTX_REG(RESUME, (b), (c), (v))
+#define SET_SCTLR(b, c, v)	 SET_CTX_REG_RELAXED(SCTLR, (b), (c), (v))
+#define SET_ACTLR(b, c, v)	 SET_CTX_REG_RELAXED(ACTLR, (b), (c), (v))
+#define SET_CONTEXTIDR(b, c, v)	 SET_CTX_REG_RELAXED(CONTEXTIDR, (b), (c), (v))
+#define SET_TTBR0(b, c, v)	 SET_CTX_REG_RELAXED(TTBR0, (b), (c), (v))
+#define SET_TTBR1(b, c, v)	 SET_CTX_REG_RELAXED(TTBR1, (b), (c), (v))
+#define SET_TTBCR(b, c, v)	 SET_CTX_REG_RELAXED(TTBCR, (b), (c), (v))
+#define SET_PAR(b, c, v)	 SET_CTX_REG_RELAXED(PAR, (b), (c), (v))
+#define SET_FSR(b, c, v)	 SET_CTX_REG_RELAXED(FSR, (b), (c), (v))
+#define SET_FSRRESTORE(b, c, v)	 SET_CTX_REG_RELAXED(FSRRESTORE, (b), (c), (v))
+#define SET_FAR(b, c, v)	 SET_CTX_REG_RELAXED(FAR, (b), (c), (v))
+#define SET_FSYNR0(b, c, v)	 SET_CTX_REG_RELAXED(FSYNR0, (b), (c), (v))
+#define SET_FSYNR1(b, c, v)	 SET_CTX_REG_RELAXED(FSYNR1, (b), (c), (v))
+#define SET_PRRR(b, c, v)	 SET_CTX_REG_RELAXED(PRRR, (b), (c), (v))
+#define SET_NMRR(b, c, v)	 SET_CTX_REG_RELAXED(NMRR, (b), (c), (v))
+#define SET_TLBLKCR(b, c, v)	 SET_CTX_REG_RELAXED(TLBLCKR, (b), (c), (v))
+#define SET_V2PSR(b, c, v)	 SET_CTX_REG_RELAXED(V2PSR, (b), (c), (v))
+#define SET_TLBFLPTER(b, c, v)	 SET_CTX_REG_RELAXED(TLBFLPTER, (b), (c), (v))
+#define SET_TLBSLPTER(b, c, v)	 SET_CTX_REG_RELAXED(TLBSLPTER, (b), (c), (v))
+#define SET_BFBCR(b, c, v)	 SET_CTX_REG_RELAXED(BFBCR, (b), (c), (v))
+#define SET_CTX_TLBIALL(b, c, v) SET_CTX_REG_RELAXED(CTX_TLBIALL, (b), (c), (v))
+#define SET_TLBIASID(b, c, v)	 SET_CTX_REG_RELAXED(TLBIASID, (b), (c), (v))
+#define SET_TLBIVA(b, c, v)	 SET_CTX_REG_RELAXED(TLBIVA, (b), (c), (v))
+#define SET_TLBIVAA(b, c, v)	 SET_CTX_REG_RELAXED(TLBIVAA, (b), (c), (v))
+#define SET_V2PPR(b, c, v)	 SET_CTX_REG_RELAXED(V2PPR, (b), (c), (v))
+#define SET_V2PPW(b, c, v)	 SET_CTX_REG_RELAXED(V2PPW, (b), (c), (v))
+#define SET_V2PUR(b, c, v)	 SET_CTX_REG_RELAXED(V2PUR, (b), (c), (v))
+#define SET_V2PUW(b, c, v)	 SET_CTX_REG_RELAXED(V2PUW, (b), (c), (v))
+#define SET_RESUME(b, c, v)	 SET_CTX_REG_RELAXED(RESUME, (b), (c), (v))
 
 #define GET_SCTLR(b, c)		 GET_CTX_REG(SCTLR, (b), (c))
 #define GET_ACTLR(b, c)		 GET_CTX_REG(ACTLR, (b), (c))
While using the generic pagetable ops the tlb maintenance
operation gets completed in the sync callback. So use writel_relaxed
for all register access and add a mb() at appropriate places.

Signed-off-by: Sricharan R <sricharan@codeaurora.org>
---
 drivers/iommu/msm_iommu.c         |  24 +++++++--
 drivers/iommu/msm_iommu_hw-8xxx.h | 109 ++++++++++++++++++++------------------
 2 files changed, 79 insertions(+), 54 deletions(-)