drm/msm/adreno: De-spaghettify the use of memory barriers

Message ID	20240508-topic-adreno-v1-1-1babd05c119d@linaro.org (mailing list archive)
State	New, archived
Headers	show Return-Path: <dri-devel-bounces@lists.freedesktop.org> From: Konrad Dybcio <konrad.dybcio@linaro.org> Date: Wed, 08 May 2024 19:46:31 +0200 Subject: [PATCH] drm/msm/adreno: De-spaghettify the use of memory barriers MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit Message-Id: <20240508-topic-adreno-v1-1-1babd05c119d@linaro.org> To: Rob Clark <robdclark@gmail.com>, Sean Paul <sean@poorly.run>, Abhinav Kumar <quic_abhinavk@quicinc.com>, Dmitry Baryshkov <dmitry.baryshkov@linaro.org>, Marijn Suijten <marijn.suijten@somainline.org>, David Airlie <airlied@gmail.com>, Daniel Vetter <daniel@ffwll.ch> Cc: Rob Clark <robdclark@chromium.org>, linux-arm-msm@vger.kernel.org, dri-devel@lists.freedesktop.org, freedreno@lists.freedesktop.org, linux-kernel@vger.kernel.org, Konrad Dybcio <konrad.dybcio@linaro.org> Precedence: list Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" <dri-devel-bounces@lists.freedesktop.org>
Series	drm/msm/adreno: De-spaghettify the use of memory barriers \| expand drm/msm/adreno: De-spaghettify the use of memory barriers

Konrad Dybcio May 8, 2024, 5:46 p.m. UTC

Memory barriers help ensure instruction ordering, NOT time and order
of actual write arrival at other observers (e.g. memory-mapped IP).
On architectures employing weak memory ordering, the latter can be a
giant pain point, and it has been as part of this driver.

Moreover, the gpu_/gmu_ accessors already use non-relaxed versions of
readl/writel, which include r/w (respectively) barriers.

Replace the barriers with a readback that ensures the previous writes
have exited the write buffer (as the CPU must flush the write to the
register it's trying to read back) and subsequently remove the hack
introduced in commit b77532803d11 ("drm/msm/a6xx: Poll for GBIF unhalt
status in hw_init").

Fixes: b77532803d11 ("drm/msm/a6xx: Poll for GBIF unhalt status in hw_init")
Signed-off-by: Konrad Dybcio <konrad.dybcio@linaro.org>
---
 drivers/gpu/drm/msm/adreno/a6xx_gmu.c |  5 ++---
 drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 14 ++++----------
 2 files changed, 6 insertions(+), 13 deletions(-)


---
base-commit: 93a39e4766083050ca0ecd6a3548093a3b9eb60c
change-id: 20240508-topic-adreno-a2d199cd4152

Best regards,

Akhil P Oommen May 14, 2024, 6:38 p.m. UTC | #1

On Wed, May 08, 2024 at 07:46:31PM +0200, Konrad Dybcio wrote:
> Memory barriers help ensure instruction ordering, NOT time and order
> of actual write arrival at other observers (e.g. memory-mapped IP).
> On architectures employing weak memory ordering, the latter can be a
> giant pain point, and it has been as part of this driver.
> 
> Moreover, the gpu_/gmu_ accessors already use non-relaxed versions of
> readl/writel, which include r/w (respectively) barriers.
> 
> Replace the barriers with a readback that ensures the previous writes
> have exited the write buffer (as the CPU must flush the write to the
> register it's trying to read back) and subsequently remove the hack
> introduced in commit b77532803d11 ("drm/msm/a6xx: Poll for GBIF unhalt
> status in hw_init").
> 
> Fixes: b77532803d11 ("drm/msm/a6xx: Poll for GBIF unhalt status in hw_init")
> Signed-off-by: Konrad Dybcio <konrad.dybcio@linaro.org>
> ---
>  drivers/gpu/drm/msm/adreno/a6xx_gmu.c |  5 ++---
>  drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 14 ++++----------
>  2 files changed, 6 insertions(+), 13 deletions(-)

I prefer this version compared to the v2. A helper routine is
unnecessary here because:
1. there are very few scenarios where we have to read back the same
register.
2. we may accidently readback a write only register.

> 
> diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gmu.c b/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
> index 0e3dfd4c2bc8..4135a53b55a7 100644
> --- a/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
> +++ b/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
> @@ -466,9 +466,8 @@ static int a6xx_rpmh_start(struct a6xx_gmu *gmu)
>  	int ret;
>  	u32 val;
>  
> -	gmu_write(gmu, REG_A6XX_GMU_RSCC_CONTROL_REQ, 1 << 1);
> -	/* Wait for the register to finish posting */
> -	wmb();
> +	gmu_write(gmu, REG_A6XX_GMU_RSCC_CONTROL_REQ, BIT(1));
> +	gmu_read(gmu, REG_A6XX_GMU_RSCC_CONTROL_REQ);

This is unnecessary because we are polling on a register on the same port below. But I think we
can replace "wmb()" above with "mb()" to avoid reordering between read
and write IO instructions.

>  
>  	ret = gmu_poll_timeout(gmu, REG_A6XX_GMU_RSCC_CONTROL_ACK, val,
>  		val & (1 << 1), 100, 10000);
> diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> index 973872ad0474..0acbc38b8e70 100644
> --- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> +++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> @@ -1713,22 +1713,16 @@ static int hw_init(struct msm_gpu *gpu)
>  	}
>  
>  	/* Clear GBIF halt in case GX domain was not collapsed */
> +	gpu_write(gpu, REG_A6XX_GBIF_HALT, 0);

We need a full barrier here to avoid reordering. Also, lets add a
comment about why we are doing this odd looking sequence.

> +	gpu_read(gpu, REG_A6XX_GBIF_HALT);
>  	if (adreno_is_a619_holi(adreno_gpu)) {
> -		gpu_write(gpu, REG_A6XX_GBIF_HALT, 0);
>  		gpu_write(gpu, REG_A6XX_RBBM_GPR0_CNTL, 0);
> -		/* Let's make extra sure that the GPU can access the memory.. */
> -		mb();

We need a full barrier here.

> +		gpu_read(gpu, REG_A6XX_RBBM_GPR0_CNTL);
>  	} else if (a6xx_has_gbif(adreno_gpu)) {
> -		gpu_write(gpu, REG_A6XX_GBIF_HALT, 0);
>  		gpu_write(gpu, REG_A6XX_RBBM_GBIF_HALT, 0);
> -		/* Let's make extra sure that the GPU can access the memory.. */
> -		mb();

We need a full barrier here.

> +		gpu_read(gpu, REG_A6XX_RBBM_GBIF_HALT);
>  	}
>  
> -	/* Some GPUs are stubborn and take their sweet time to unhalt GBIF! */
> -	if (adreno_is_a7xx(adreno_gpu) && a6xx_has_gbif(adreno_gpu))
> -		spin_until(!gpu_read(gpu, REG_A6XX_GBIF_HALT_ACK));
> -

Why is this removed?

-Akhil

>  	gpu_write(gpu, REG_A6XX_RBBM_SECVID_TSB_CNTL, 0);
>  
>  	if (adreno_is_a619_holi(adreno_gpu))
> 
> ---
> base-commit: 93a39e4766083050ca0ecd6a3548093a3b9eb60c
> change-id: 20240508-topic-adreno-a2d199cd4152
> 
> Best regards,
> -- 
> Konrad Dybcio <konrad.dybcio@linaro.org>
>

Andrew Halaney May 16, 2024, 1:15 p.m. UTC | #2

On Wed, May 15, 2024 at 12:08:49AM GMT, Akhil P Oommen wrote:
> On Wed, May 08, 2024 at 07:46:31PM +0200, Konrad Dybcio wrote:
> > Memory barriers help ensure instruction ordering, NOT time and order
> > of actual write arrival at other observers (e.g. memory-mapped IP).
> > On architectures employing weak memory ordering, the latter can be a
> > giant pain point, and it has been as part of this driver.
> > 
> > Moreover, the gpu_/gmu_ accessors already use non-relaxed versions of
> > readl/writel, which include r/w (respectively) barriers.
> > 
> > Replace the barriers with a readback that ensures the previous writes
> > have exited the write buffer (as the CPU must flush the write to the
> > register it's trying to read back) and subsequently remove the hack
> > introduced in commit b77532803d11 ("drm/msm/a6xx: Poll for GBIF unhalt
> > status in hw_init").

For what its worth, I've been eyeing (but haven't tested) sending some
patches to clean up dsi_phy_write_udelay/ndelay(). There's no ordering
guarantee between a writel() and a delay(), so the expected "write then
delay" sequence might not be happening.. you need to write, read, delay.

memory-barriers.txt:

	5. A readX() by a CPU thread from the peripheral will complete before
	   any subsequent delay() loop can begin execution on the same thread.
	   This ensures that two MMIO register writes by the CPU to a peripheral
	   will arrive at least 1us apart if the first write is immediately read
	   back with readX() and udelay(1) is called prior to the second
	   writeX():

		writel(42, DEVICE_REGISTER_0); // Arrives at the device...
		readl(DEVICE_REGISTER_0);
		udelay(1);
		writel(42, DEVICE_REGISTER_1); // ...at least 1us before this.

> > 
> > Fixes: b77532803d11 ("drm/msm/a6xx: Poll for GBIF unhalt status in hw_init")
> > Signed-off-by: Konrad Dybcio <konrad.dybcio@linaro.org>
> > ---
> >  drivers/gpu/drm/msm/adreno/a6xx_gmu.c |  5 ++---
> >  drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 14 ++++----------
> >  2 files changed, 6 insertions(+), 13 deletions(-)
> 
> I prefer this version compared to the v2. A helper routine is
> unnecessary here because:
> 1. there are very few scenarios where we have to read back the same
> register.
> 2. we may accidently readback a write only register.
> 
> > 
> > diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gmu.c b/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
> > index 0e3dfd4c2bc8..4135a53b55a7 100644
> > --- a/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
> > +++ b/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
> > @@ -466,9 +466,8 @@ static int a6xx_rpmh_start(struct a6xx_gmu *gmu)
> >  	int ret;
> >  	u32 val;
> >  
> > -	gmu_write(gmu, REG_A6XX_GMU_RSCC_CONTROL_REQ, 1 << 1);
> > -	/* Wait for the register to finish posting */
> > -	wmb();
> > +	gmu_write(gmu, REG_A6XX_GMU_RSCC_CONTROL_REQ, BIT(1));
> > +	gmu_read(gmu, REG_A6XX_GMU_RSCC_CONTROL_REQ);
> 
> This is unnecessary because we are polling on a register on the same port below. But I think we
> can replace "wmb()" above with "mb()" to avoid reordering between read
> and write IO instructions.

If I understand correctly, you don't need any memory barrier.
writel()/readl()'s are ordered to the same endpoint. That goes for all
the reordering/barrier comments mentioned below too.

device-io.rst:

    The read and write functions are defined to be ordered. That is the
    compiler is not permitted to reorder the I/O sequence. When the ordering
    can be compiler optimised, you can use __readb() and friends to
    indicate the relaxed ordering. Use this with care.

memory-barriers.txt:

     (*) readX(), writeX():

	    The readX() and writeX() MMIO accessors take a pointer to the
	    peripheral being accessed as an __iomem * parameter. For pointers
	    mapped with the default I/O attributes (e.g. those returned by
	    ioremap()), the ordering guarantees are as follows:

	    1. All readX() and writeX() accesses to the same peripheral are ordered
	       with respect to each other. This ensures that MMIO register accesses
	       by the same CPU thread to a particular device will arrive in program
	       order.


> 
> >  
> >  	ret = gmu_poll_timeout(gmu, REG_A6XX_GMU_RSCC_CONTROL_ACK, val,
> >  		val & (1 << 1), 100, 10000);
> > diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> > index 973872ad0474..0acbc38b8e70 100644
> > --- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> > +++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> > @@ -1713,22 +1713,16 @@ static int hw_init(struct msm_gpu *gpu)
> >  	}
> >  
> >  	/* Clear GBIF halt in case GX domain was not collapsed */
> > +	gpu_write(gpu, REG_A6XX_GBIF_HALT, 0);
> 
> We need a full barrier here to avoid reordering. Also, lets add a
> comment about why we are doing this odd looking sequence.
> 
> > +	gpu_read(gpu, REG_A6XX_GBIF_HALT);
> >  	if (adreno_is_a619_holi(adreno_gpu)) {
> > -		gpu_write(gpu, REG_A6XX_GBIF_HALT, 0);
> >  		gpu_write(gpu, REG_A6XX_RBBM_GPR0_CNTL, 0);
> > -		/* Let's make extra sure that the GPU can access the memory.. */
> > -		mb();
> 
> We need a full barrier here.
> 
> > +		gpu_read(gpu, REG_A6XX_RBBM_GPR0_CNTL);
> >  	} else if (a6xx_has_gbif(adreno_gpu)) {
> > -		gpu_write(gpu, REG_A6XX_GBIF_HALT, 0);
> >  		gpu_write(gpu, REG_A6XX_RBBM_GBIF_HALT, 0);
> > -		/* Let's make extra sure that the GPU can access the memory.. */
> > -		mb();
> 
> We need a full barrier here.
> 
> > +		gpu_read(gpu, REG_A6XX_RBBM_GBIF_HALT);
> >  	}
> >  
> > -	/* Some GPUs are stubborn and take their sweet time to unhalt GBIF! */
> > -	if (adreno_is_a7xx(adreno_gpu) && a6xx_has_gbif(adreno_gpu))
> > -		spin_until(!gpu_read(gpu, REG_A6XX_GBIF_HALT_ACK));
> > -
> 
> Why is this removed?
> 
> -Akhil
> 
> >  	gpu_write(gpu, REG_A6XX_RBBM_SECVID_TSB_CNTL, 0);
> >  
> >  	if (adreno_is_a619_holi(adreno_gpu))
> > 
> > ---
> > base-commit: 93a39e4766083050ca0ecd6a3548093a3b9eb60c
> > change-id: 20240508-topic-adreno-a2d199cd4152
> > 
> > Best regards,
> > -- 
> > Konrad Dybcio <konrad.dybcio@linaro.org>
> > 
>

Akhil P Oommen May 16, 2024, 2:50 p.m. UTC | #3

On Thu, May 16, 2024 at 08:15:34AM -0500, Andrew Halaney wrote:
> On Wed, May 15, 2024 at 12:08:49AM GMT, Akhil P Oommen wrote:
> > On Wed, May 08, 2024 at 07:46:31PM +0200, Konrad Dybcio wrote:
> > > Memory barriers help ensure instruction ordering, NOT time and order
> > > of actual write arrival at other observers (e.g. memory-mapped IP).
> > > On architectures employing weak memory ordering, the latter can be a
> > > giant pain point, and it has been as part of this driver.
> > > 
> > > Moreover, the gpu_/gmu_ accessors already use non-relaxed versions of
> > > readl/writel, which include r/w (respectively) barriers.
> > > 
> > > Replace the barriers with a readback that ensures the previous writes
> > > have exited the write buffer (as the CPU must flush the write to the
> > > register it's trying to read back) and subsequently remove the hack
> > > introduced in commit b77532803d11 ("drm/msm/a6xx: Poll for GBIF unhalt
> > > status in hw_init").
> 
> For what its worth, I've been eyeing (but haven't tested) sending some
> patches to clean up dsi_phy_write_udelay/ndelay(). There's no ordering
> guarantee between a writel() and a delay(), so the expected "write then
> delay" sequence might not be happening.. you need to write, read, delay.
> 
> memory-barriers.txt:
> 
> 	5. A readX() by a CPU thread from the peripheral will complete before
> 	   any subsequent delay() loop can begin execution on the same thread.
> 	   This ensures that two MMIO register writes by the CPU to a peripheral
> 	   will arrive at least 1us apart if the first write is immediately read
> 	   back with readX() and udelay(1) is called prior to the second
> 	   writeX():
> 
> 		writel(42, DEVICE_REGISTER_0); // Arrives at the device...
> 		readl(DEVICE_REGISTER_0);
> 		udelay(1);
> 		writel(42, DEVICE_REGISTER_1); // ...at least 1us before this.

Yes, udelay orders only with readl(). I saw a patch from Will Deacon
which fixes this for arm64 few years back:
https://lore.kernel.org/all/1543251228-30001-1-git-send-email-will.deacon@arm.com/T/

But this is needed only when you write io and do cpuside wait , not when
you poll io to check status.

> 
> > > 
> > > Fixes: b77532803d11 ("drm/msm/a6xx: Poll for GBIF unhalt status in hw_init")
> > > Signed-off-by: Konrad Dybcio <konrad.dybcio@linaro.org>
> > > ---
> > >  drivers/gpu/drm/msm/adreno/a6xx_gmu.c |  5 ++---
> > >  drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 14 ++++----------
> > >  2 files changed, 6 insertions(+), 13 deletions(-)
> > 
> > I prefer this version compared to the v2. A helper routine is
> > unnecessary here because:
> > 1. there are very few scenarios where we have to read back the same
> > register.
> > 2. we may accidently readback a write only register.
> > 
> > > 
> > > diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gmu.c b/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
> > > index 0e3dfd4c2bc8..4135a53b55a7 100644
> > > --- a/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
> > > +++ b/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
> > > @@ -466,9 +466,8 @@ static int a6xx_rpmh_start(struct a6xx_gmu *gmu)
> > >  	int ret;
> > >  	u32 val;
> > >  
> > > -	gmu_write(gmu, REG_A6XX_GMU_RSCC_CONTROL_REQ, 1 << 1);
> > > -	/* Wait for the register to finish posting */
> > > -	wmb();
> > > +	gmu_write(gmu, REG_A6XX_GMU_RSCC_CONTROL_REQ, BIT(1));
> > > +	gmu_read(gmu, REG_A6XX_GMU_RSCC_CONTROL_REQ);
> > 
> > This is unnecessary because we are polling on a register on the same port below. But I think we
> > can replace "wmb()" above with "mb()" to avoid reordering between read
> > and write IO instructions.
> 
> If I understand correctly, you don't need any memory barrier.
> writel()/readl()'s are ordered to the same endpoint. That goes for all
> the reordering/barrier comments mentioned below too.
> 
> device-io.rst:
> 
>     The read and write functions are defined to be ordered. That is the
>     compiler is not permitted to reorder the I/O sequence. When the ordering
>     can be compiler optimised, you can use __readb() and friends to
>     indicate the relaxed ordering. Use this with care.
> 
> memory-barriers.txt:
> 
>      (*) readX(), writeX():
> 
> 	    The readX() and writeX() MMIO accessors take a pointer to the
> 	    peripheral being accessed as an __iomem * parameter. For pointers
> 	    mapped with the default I/O attributes (e.g. those returned by
> 	    ioremap()), the ordering guarantees are as follows:
> 
> 	    1. All readX() and writeX() accesses to the same peripheral are ordered
> 	       with respect to each other. This ensures that MMIO register accesses
> 	       by the same CPU thread to a particular device will arrive in program
> 	       order.
> 

In arm64, a writel followed by readl translates to roughly the following
sequence: dmb_wmb(), __raw_writel(), __raw_readl(), dmb_rmb(). I am not
sure what is stopping compiler from reordering  __raw_writel() and __raw_readl()
above? I am assuming iomem cookie is ignored during compilation.

Added Will to this thread if he can throw some light on this.

-Akhil

> 
> > 
> > >  
> > >  	ret = gmu_poll_timeout(gmu, REG_A6XX_GMU_RSCC_CONTROL_ACK, val,
> > >  		val & (1 << 1), 100, 10000);
> > > diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> > > index 973872ad0474..0acbc38b8e70 100644
> > > --- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> > > +++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> > > @@ -1713,22 +1713,16 @@ static int hw_init(struct msm_gpu *gpu)
> > >  	}
> > >  
> > >  	/* Clear GBIF halt in case GX domain was not collapsed */
> > > +	gpu_write(gpu, REG_A6XX_GBIF_HALT, 0);
> > 
> > We need a full barrier here to avoid reordering. Also, lets add a
> > comment about why we are doing this odd looking sequence.
> > 
> > > +	gpu_read(gpu, REG_A6XX_GBIF_HALT);
> > >  	if (adreno_is_a619_holi(adreno_gpu)) {
> > > -		gpu_write(gpu, REG_A6XX_GBIF_HALT, 0);
> > >  		gpu_write(gpu, REG_A6XX_RBBM_GPR0_CNTL, 0);
> > > -		/* Let's make extra sure that the GPU can access the memory.. */
> > > -		mb();
> > 
> > We need a full barrier here.
> > 
> > > +		gpu_read(gpu, REG_A6XX_RBBM_GPR0_CNTL);
> > >  	} else if (a6xx_has_gbif(adreno_gpu)) {
> > > -		gpu_write(gpu, REG_A6XX_GBIF_HALT, 0);
> > >  		gpu_write(gpu, REG_A6XX_RBBM_GBIF_HALT, 0);
> > > -		/* Let's make extra sure that the GPU can access the memory.. */
> > > -		mb();
> > 
> > We need a full barrier here.
> > 
> > > +		gpu_read(gpu, REG_A6XX_RBBM_GBIF_HALT);
> > >  	}
> > >  
> > > -	/* Some GPUs are stubborn and take their sweet time to unhalt GBIF! */
> > > -	if (adreno_is_a7xx(adreno_gpu) && a6xx_has_gbif(adreno_gpu))
> > > -		spin_until(!gpu_read(gpu, REG_A6XX_GBIF_HALT_ACK));
> > > -
> > 
> > Why is this removed?
> > 
> > -Akhil
> > 
> > >  	gpu_write(gpu, REG_A6XX_RBBM_SECVID_TSB_CNTL, 0);
> > >  
> > >  	if (adreno_is_a619_holi(adreno_gpu))
> > > 
> > > ---
> > > base-commit: 93a39e4766083050ca0ecd6a3548093a3b9eb60c
> > > change-id: 20240508-topic-adreno-a2d199cd4152
> > > 
> > > Best regards,
> > > -- 
> > > Konrad Dybcio <konrad.dybcio@linaro.org>
> > > 
> > 
>

Andrew Halaney May 16, 2024, 6:55 p.m. UTC | #4

On Thu, May 16, 2024 at 08:20:05PM GMT, Akhil P Oommen wrote:
> On Thu, May 16, 2024 at 08:15:34AM -0500, Andrew Halaney wrote:
> > On Wed, May 15, 2024 at 12:08:49AM GMT, Akhil P Oommen wrote:
> > > On Wed, May 08, 2024 at 07:46:31PM +0200, Konrad Dybcio wrote:
> > > > Memory barriers help ensure instruction ordering, NOT time and order
> > > > of actual write arrival at other observers (e.g. memory-mapped IP).
> > > > On architectures employing weak memory ordering, the latter can be a
> > > > giant pain point, and it has been as part of this driver.
> > > > 
> > > > Moreover, the gpu_/gmu_ accessors already use non-relaxed versions of
> > > > readl/writel, which include r/w (respectively) barriers.
> > > > 
> > > > Replace the barriers with a readback that ensures the previous writes
> > > > have exited the write buffer (as the CPU must flush the write to the
> > > > register it's trying to read back) and subsequently remove the hack
> > > > introduced in commit b77532803d11 ("drm/msm/a6xx: Poll for GBIF unhalt
> > > > status in hw_init").
> > 
> > For what its worth, I've been eyeing (but haven't tested) sending some
> > patches to clean up dsi_phy_write_udelay/ndelay(). There's no ordering
> > guarantee between a writel() and a delay(), so the expected "write then
> > delay" sequence might not be happening.. you need to write, read, delay.
> > 
> > memory-barriers.txt:
> > 
> > 	5. A readX() by a CPU thread from the peripheral will complete before
> > 	   any subsequent delay() loop can begin execution on the same thread.
> > 	   This ensures that two MMIO register writes by the CPU to a peripheral
> > 	   will arrive at least 1us apart if the first write is immediately read
> > 	   back with readX() and udelay(1) is called prior to the second
> > 	   writeX():
> > 
> > 		writel(42, DEVICE_REGISTER_0); // Arrives at the device...
> > 		readl(DEVICE_REGISTER_0);
> > 		udelay(1);
> > 		writel(42, DEVICE_REGISTER_1); // ...at least 1us before this.
> 
> Yes, udelay orders only with readl(). I saw a patch from Will Deacon
> which fixes this for arm64 few years back:
> https://lore.kernel.org/all/1543251228-30001-1-git-send-email-will.deacon@arm.com/T/
> 
> But this is needed only when you write io and do cpuside wait , not when
> you poll io to check status.

Sure, I'm just highlighting the bug in dsi_phy_write_*delay() functions
here, which match the udelay case you mention.

> 
> > 
> > > > 
> > > > Fixes: b77532803d11 ("drm/msm/a6xx: Poll for GBIF unhalt status in hw_init")
> > > > Signed-off-by: Konrad Dybcio <konrad.dybcio@linaro.org>
> > > > ---
> > > >  drivers/gpu/drm/msm/adreno/a6xx_gmu.c |  5 ++---
> > > >  drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 14 ++++----------
> > > >  2 files changed, 6 insertions(+), 13 deletions(-)
> > > 
> > > I prefer this version compared to the v2. A helper routine is
> > > unnecessary here because:
> > > 1. there are very few scenarios where we have to read back the same
> > > register.
> > > 2. we may accidently readback a write only register.
> > > 
> > > > 
> > > > diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gmu.c b/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
> > > > index 0e3dfd4c2bc8..4135a53b55a7 100644
> > > > --- a/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
> > > > +++ b/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
> > > > @@ -466,9 +466,8 @@ static int a6xx_rpmh_start(struct a6xx_gmu *gmu)
> > > >  	int ret;
> > > >  	u32 val;
> > > >  
> > > > -	gmu_write(gmu, REG_A6XX_GMU_RSCC_CONTROL_REQ, 1 << 1);
> > > > -	/* Wait for the register to finish posting */
> > > > -	wmb();
> > > > +	gmu_write(gmu, REG_A6XX_GMU_RSCC_CONTROL_REQ, BIT(1));
> > > > +	gmu_read(gmu, REG_A6XX_GMU_RSCC_CONTROL_REQ);
> > > 
> > > This is unnecessary because we are polling on a register on the same port below. But I think we
> > > can replace "wmb()" above with "mb()" to avoid reordering between read
> > > and write IO instructions.
> > 
> > If I understand correctly, you don't need any memory barrier.
> > writel()/readl()'s are ordered to the same endpoint. That goes for all
> > the reordering/barrier comments mentioned below too.
> > 
> > device-io.rst:
> > 
> >     The read and write functions are defined to be ordered. That is the
> >     compiler is not permitted to reorder the I/O sequence. When the ordering
> >     can be compiler optimised, you can use __readb() and friends to
> >     indicate the relaxed ordering. Use this with care.
> > 
> > memory-barriers.txt:
> > 
> >      (*) readX(), writeX():
> > 
> > 	    The readX() and writeX() MMIO accessors take a pointer to the
> > 	    peripheral being accessed as an __iomem * parameter. For pointers
> > 	    mapped with the default I/O attributes (e.g. those returned by
> > 	    ioremap()), the ordering guarantees are as follows:
> > 
> > 	    1. All readX() and writeX() accesses to the same peripheral are ordered
> > 	       with respect to each other. This ensures that MMIO register accesses
> > 	       by the same CPU thread to a particular device will arrive in program
> > 	       order.
> > 
> 
> In arm64, a writel followed by readl translates to roughly the following
> sequence: dmb_wmb(), __raw_writel(), __raw_readl(), dmb_rmb(). I am not
> sure what is stopping compiler from reordering  __raw_writel() and __raw_readl()
> above? I am assuming iomem cookie is ignored during compilation.

It seems to me that is due to some usage of volatile there in
__raw_writel() etc, but to be honest after reading about volatile and
some threads from gcc mailing lists, I don't have a confident answer :)

> 
> Added Will to this thread if he can throw some light on this.

Hopefully Will can school us.

> 
> -Akhil
> 
> > 
> > > 
> > > >  
> > > >  	ret = gmu_poll_timeout(gmu, REG_A6XX_GMU_RSCC_CONTROL_ACK, val,
> > > >  		val & (1 << 1), 100, 10000);
> > > > diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> > > > index 973872ad0474..0acbc38b8e70 100644
> > > > --- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> > > > +++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> > > > @@ -1713,22 +1713,16 @@ static int hw_init(struct msm_gpu *gpu)
> > > >  	}
> > > >  
> > > >  	/* Clear GBIF halt in case GX domain was not collapsed */
> > > > +	gpu_write(gpu, REG_A6XX_GBIF_HALT, 0);
> > > 
> > > We need a full barrier here to avoid reordering. Also, lets add a
> > > comment about why we are doing this odd looking sequence.
> > > 
> > > > +	gpu_read(gpu, REG_A6XX_GBIF_HALT);
> > > >  	if (adreno_is_a619_holi(adreno_gpu)) {
> > > > -		gpu_write(gpu, REG_A6XX_GBIF_HALT, 0);
> > > >  		gpu_write(gpu, REG_A6XX_RBBM_GPR0_CNTL, 0);
> > > > -		/* Let's make extra sure that the GPU can access the memory.. */
> > > > -		mb();
> > > 
> > > We need a full barrier here.
> > > 
> > > > +		gpu_read(gpu, REG_A6XX_RBBM_GPR0_CNTL);
> > > >  	} else if (a6xx_has_gbif(adreno_gpu)) {
> > > > -		gpu_write(gpu, REG_A6XX_GBIF_HALT, 0);
> > > >  		gpu_write(gpu, REG_A6XX_RBBM_GBIF_HALT, 0);
> > > > -		/* Let's make extra sure that the GPU can access the memory.. */
> > > > -		mb();
> > > 
> > > We need a full barrier here.
> > > 
> > > > +		gpu_read(gpu, REG_A6XX_RBBM_GBIF_HALT);
> > > >  	}
> > > >  
> > > > -	/* Some GPUs are stubborn and take their sweet time to unhalt GBIF! */
> > > > -	if (adreno_is_a7xx(adreno_gpu) && a6xx_has_gbif(adreno_gpu))
> > > > -		spin_until(!gpu_read(gpu, REG_A6XX_GBIF_HALT_ACK));
> > > > -
> > > 
> > > Why is this removed?
> > > 
> > > -Akhil
> > > 
> > > >  	gpu_write(gpu, REG_A6XX_RBBM_SECVID_TSB_CNTL, 0);
> > > >  
> > > >  	if (adreno_is_a619_holi(adreno_gpu))
> > > > 
> > > > ---
> > > > base-commit: 93a39e4766083050ca0ecd6a3548093a3b9eb60c
> > > > change-id: 20240508-topic-adreno-a2d199cd4152
> > > > 
> > > > Best regards,
> > > > -- 
> > > > Konrad Dybcio <konrad.dybcio@linaro.org>
> > > > 
> > > 
> > 
>

Will Deacon June 4, 2024, 2:40 p.m. UTC | #5

On Thu, May 16, 2024 at 01:55:26PM -0500, Andrew Halaney wrote:
> On Thu, May 16, 2024 at 08:20:05PM GMT, Akhil P Oommen wrote:
> > On Thu, May 16, 2024 at 08:15:34AM -0500, Andrew Halaney wrote:
> > > If I understand correctly, you don't need any memory barrier.
> > > writel()/readl()'s are ordered to the same endpoint. That goes for all
> > > the reordering/barrier comments mentioned below too.
> > > 
> > > device-io.rst:
> > > 
> > >     The read and write functions are defined to be ordered. That is the
> > >     compiler is not permitted to reorder the I/O sequence. When the ordering
> > >     can be compiler optimised, you can use __readb() and friends to
> > >     indicate the relaxed ordering. Use this with care.
> > > 
> > > memory-barriers.txt:
> > > 
> > >      (*) readX(), writeX():
> > > 
> > > 	    The readX() and writeX() MMIO accessors take a pointer to the
> > > 	    peripheral being accessed as an __iomem * parameter. For pointers
> > > 	    mapped with the default I/O attributes (e.g. those returned by
> > > 	    ioremap()), the ordering guarantees are as follows:
> > > 
> > > 	    1. All readX() and writeX() accesses to the same peripheral are ordered
> > > 	       with respect to each other. This ensures that MMIO register accesses
> > > 	       by the same CPU thread to a particular device will arrive in program
> > > 	       order.
> > > 
> > 
> > In arm64, a writel followed by readl translates to roughly the following
> > sequence: dmb_wmb(), __raw_writel(), __raw_readl(), dmb_rmb(). I am not
> > sure what is stopping compiler from reordering  __raw_writel() and __raw_readl()
> > above? I am assuming iomem cookie is ignored during compilation.
> 
> It seems to me that is due to some usage of volatile there in
> __raw_writel() etc, but to be honest after reading about volatile and
> some threads from gcc mailing lists, I don't have a confident answer :)
> 
> > 
> > Added Will to this thread if he can throw some light on this.
> 
> Hopefully Will can school us.

The ordering in this case is ensured by the memory attributes used for
ioremap(). When an MMIO region is mapped using Device-nGnRE attributes
(as it the case for ioremap()), the "nR" part means "no reordering", so
readX() and writeX() to that region are ordered wrt each other.

Note that guarantee _doesn't_ apply to other flavours of ioremap(), so
e.g. ioremap_wc() won't give you the ordering.

Hope that helps,

Will

Konrad Dybcio June 4, 2024, 5:35 p.m. UTC | #6

On 5/14/24 20:38, Akhil P Oommen wrote:
> On Wed, May 08, 2024 at 07:46:31PM +0200, Konrad Dybcio wrote:
>> Memory barriers help ensure instruction ordering, NOT time and order
>> of actual write arrival at other observers (e.g. memory-mapped IP).
>> On architectures employing weak memory ordering, the latter can be a
>> giant pain point, and it has been as part of this driver.
>>
>> Moreover, the gpu_/gmu_ accessors already use non-relaxed versions of
>> readl/writel, which include r/w (respectively) barriers.
>>
>> Replace the barriers with a readback that ensures the previous writes
>> have exited the write buffer (as the CPU must flush the write to the
>> register it's trying to read back) and subsequently remove the hack
>> introduced in commit b77532803d11 ("drm/msm/a6xx: Poll for GBIF unhalt
>> status in hw_init").
>>
>> Fixes: b77532803d11 ("drm/msm/a6xx: Poll for GBIF unhalt status in hw_init")
>> Signed-off-by: Konrad Dybcio <konrad.dybcio@linaro.org>
>> ---
>>   drivers/gpu/drm/msm/adreno/a6xx_gmu.c |  5 ++---
>>   drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 14 ++++----------
>>   2 files changed, 6 insertions(+), 13 deletions(-)
> 
> I prefer this version compared to the v2. A helper routine is
> unnecessary here because:
> 1. there are very few scenarios where we have to read back the same
> register.
> 2. we may accidently readback a write only register.

Which would still trigger an address dependency on the CPU, no?

> 
>>
>> diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gmu.c b/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
>> index 0e3dfd4c2bc8..4135a53b55a7 100644
>> --- a/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
>> +++ b/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
>> @@ -466,9 +466,8 @@ static int a6xx_rpmh_start(struct a6xx_gmu *gmu)
>>   	int ret;
>>   	u32 val;
>>   
>> -	gmu_write(gmu, REG_A6XX_GMU_RSCC_CONTROL_REQ, 1 << 1);
>> -	/* Wait for the register to finish posting */
>> -	wmb();
>> +	gmu_write(gmu, REG_A6XX_GMU_RSCC_CONTROL_REQ, BIT(1));
>> +	gmu_read(gmu, REG_A6XX_GMU_RSCC_CONTROL_REQ);
> 
> This is unnecessary because we are polling on a register on the same port below. But I think we
> can replace "wmb()" above with "mb()" to avoid reordering between read
> and write IO instructions.

Ok on the dropping readback part

+ AFAIU from Will's response, we can drop the barrier as well

> 
>>   
>>   	ret = gmu_poll_timeout(gmu, REG_A6XX_GMU_RSCC_CONTROL_ACK, val,
>>   		val & (1 << 1), 100, 10000);
>> diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
>> index 973872ad0474..0acbc38b8e70 100644
>> --- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
>> +++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
>> @@ -1713,22 +1713,16 @@ static int hw_init(struct msm_gpu *gpu)
>>   	}
>>   
>>   	/* Clear GBIF halt in case GX domain was not collapsed */
>> +	gpu_write(gpu, REG_A6XX_GBIF_HALT, 0);
> 
> We need a full barrier here to avoid reordering. Also, lets add a
> comment about why we are doing this odd looking sequence.
> 
>> +	gpu_read(gpu, REG_A6XX_GBIF_HALT);
>>   	if (adreno_is_a619_holi(adreno_gpu)) {
>> -		gpu_write(gpu, REG_A6XX_GBIF_HALT, 0);
>>   		gpu_write(gpu, REG_A6XX_RBBM_GPR0_CNTL, 0);
>> -		/* Let's make extra sure that the GPU can access the memory.. */
>> -		mb();
> 
> We need a full barrier here.
> 
>> +		gpu_read(gpu, REG_A6XX_RBBM_GPR0_CNTL);
>>   	} else if (a6xx_has_gbif(adreno_gpu)) {
>> -		gpu_write(gpu, REG_A6XX_GBIF_HALT, 0);
>>   		gpu_write(gpu, REG_A6XX_RBBM_GBIF_HALT, 0);
>> -		/* Let's make extra sure that the GPU can access the memory.. */
>> -		mb();
> 
> We need a full barrier here.

Not sure we do between REG_A6XX_GBIF_HALT & REG_A6XX_RBBM_(GBIF_HALT/GPR0_CNTL),
but I suppose keeping the one after REG_A6XX_RBBM_(GBIF_HALT/GPR0_CNTL) makes
sense to avoid the possibility of configuring the GPU before it can access DRAM..

> 
>> +		gpu_read(gpu, REG_A6XX_RBBM_GBIF_HALT);
>>   	}
>>   
>> -	/* Some GPUs are stubborn and take their sweet time to unhalt GBIF! */
>> -	if (adreno_is_a7xx(adreno_gpu) && a6xx_has_gbif(adreno_gpu))
>> -		spin_until(!gpu_read(gpu, REG_A6XX_GBIF_HALT_ACK));
>> -
> 
> Why is this removed?

Because it was a hack in the first place and the enforcement of GBIF
unhalt requests coming through before proceeding further removes the
necessity to check this (unless there's some hw-mandated delay we should
keep in mind, but kgsl doesn't have that and there doesn't seem to be
any from testing on 8[456]50).

Konrad

Konrad Dybcio June 6, 2024, 12:03 p.m. UTC | #7

On 4.06.2024 4:40 PM, Will Deacon wrote:
> On Thu, May 16, 2024 at 01:55:26PM -0500, Andrew Halaney wrote:
>> On Thu, May 16, 2024 at 08:20:05PM GMT, Akhil P Oommen wrote:
>>> On Thu, May 16, 2024 at 08:15:34AM -0500, Andrew Halaney wrote:
>>>> If I understand correctly, you don't need any memory barrier.
>>>> writel()/readl()'s are ordered to the same endpoint. That goes for all
>>>> the reordering/barrier comments mentioned below too.
>>>>
>>>> device-io.rst:
>>>>
>>>>     The read and write functions are defined to be ordered. That is the
>>>>     compiler is not permitted to reorder the I/O sequence. When the ordering
>>>>     can be compiler optimised, you can use __readb() and friends to
>>>>     indicate the relaxed ordering. Use this with care.
>>>>
>>>> memory-barriers.txt:
>>>>
>>>>      (*) readX(), writeX():
>>>>
>>>> 	    The readX() and writeX() MMIO accessors take a pointer to the
>>>> 	    peripheral being accessed as an __iomem * parameter. For pointers
>>>> 	    mapped with the default I/O attributes (e.g. those returned by
>>>> 	    ioremap()), the ordering guarantees are as follows:
>>>>
>>>> 	    1. All readX() and writeX() accesses to the same peripheral are ordered
>>>> 	       with respect to each other. This ensures that MMIO register accesses
>>>> 	       by the same CPU thread to a particular device will arrive in program
>>>> 	       order.
>>>>
>>>
>>> In arm64, a writel followed by readl translates to roughly the following
>>> sequence: dmb_wmb(), __raw_writel(), __raw_readl(), dmb_rmb(). I am not
>>> sure what is stopping compiler from reordering  __raw_writel() and __raw_readl()
>>> above? I am assuming iomem cookie is ignored during compilation.
>>
>> It seems to me that is due to some usage of volatile there in
>> __raw_writel() etc, but to be honest after reading about volatile and
>> some threads from gcc mailing lists, I don't have a confident answer :)
>>
>>>
>>> Added Will to this thread if he can throw some light on this.
>>
>> Hopefully Will can school us.
> 
> The ordering in this case is ensured by the memory attributes used for
> ioremap(). When an MMIO region is mapped using Device-nGnRE attributes
> (as it the case for ioremap()), the "nR" part means "no reordering", so
> readX() and writeX() to that region are ordered wrt each other.
> 
> Note that guarantee _doesn't_ apply to other flavours of ioremap(), so
> e.g. ioremap_wc() won't give you the ordering.
> 
> Hope that helps,

Just to make sure I'm following, would mapping things as nGnRnE effectively
get rid of write buffering, perhaps being a way of debugging whether that
in particular is causing issues (at the cost of speed)?

Konrad

Will Deacon June 18, 2024, 3:34 p.m. UTC | #8

On Thu, Jun 06, 2024 at 02:03:24PM +0200, Konrad Dybcio wrote:
> On 4.06.2024 4:40 PM, Will Deacon wrote:
> > On Thu, May 16, 2024 at 01:55:26PM -0500, Andrew Halaney wrote:
> >> On Thu, May 16, 2024 at 08:20:05PM GMT, Akhil P Oommen wrote:
> >>> On Thu, May 16, 2024 at 08:15:34AM -0500, Andrew Halaney wrote:
> >>>> If I understand correctly, you don't need any memory barrier.
> >>>> writel()/readl()'s are ordered to the same endpoint. That goes for all
> >>>> the reordering/barrier comments mentioned below too.
> >>>>
> >>>> device-io.rst:
> >>>>
> >>>>     The read and write functions are defined to be ordered. That is the
> >>>>     compiler is not permitted to reorder the I/O sequence. When the ordering
> >>>>     can be compiler optimised, you can use __readb() and friends to
> >>>>     indicate the relaxed ordering. Use this with care.
> >>>>
> >>>> memory-barriers.txt:
> >>>>
> >>>>      (*) readX(), writeX():
> >>>>
> >>>> 	    The readX() and writeX() MMIO accessors take a pointer to the
> >>>> 	    peripheral being accessed as an __iomem * parameter. For pointers
> >>>> 	    mapped with the default I/O attributes (e.g. those returned by
> >>>> 	    ioremap()), the ordering guarantees are as follows:
> >>>>
> >>>> 	    1. All readX() and writeX() accesses to the same peripheral are ordered
> >>>> 	       with respect to each other. This ensures that MMIO register accesses
> >>>> 	       by the same CPU thread to a particular device will arrive in program
> >>>> 	       order.
> >>>>
> >>>
> >>> In arm64, a writel followed by readl translates to roughly the following
> >>> sequence: dmb_wmb(), __raw_writel(), __raw_readl(), dmb_rmb(). I am not
> >>> sure what is stopping compiler from reordering  __raw_writel() and __raw_readl()
> >>> above? I am assuming iomem cookie is ignored during compilation.
> >>
> >> It seems to me that is due to some usage of volatile there in
> >> __raw_writel() etc, but to be honest after reading about volatile and
> >> some threads from gcc mailing lists, I don't have a confident answer :)
> >>
> >>>
> >>> Added Will to this thread if he can throw some light on this.
> >>
> >> Hopefully Will can school us.
> > 
> > The ordering in this case is ensured by the memory attributes used for
> > ioremap(). When an MMIO region is mapped using Device-nGnRE attributes
> > (as it the case for ioremap()), the "nR" part means "no reordering", so
> > readX() and writeX() to that region are ordered wrt each other.
> > 
> > Note that guarantee _doesn't_ apply to other flavours of ioremap(), so
> > e.g. ioremap_wc() won't give you the ordering.
> > 
> > Hope that helps,
> 
> Just to make sure I'm following, would mapping things as nGnRnE effectively
> get rid of write buffering, perhaps being a way of debugging whether that
> in particular is causing issues (at the cost of speed)?

I think the "nE" part is just a hint, so it will depend on how the
hardware has been built. On top of that, you'll still need something
like a DSB to force the CPU to wait for the write response.

Will

Akhil P Oommen June 18, 2024, 4:11 p.m. UTC | #9

On Tue, Jun 04, 2024 at 03:40:56PM +0100, Will Deacon wrote:
> On Thu, May 16, 2024 at 01:55:26PM -0500, Andrew Halaney wrote:
> > On Thu, May 16, 2024 at 08:20:05PM GMT, Akhil P Oommen wrote:
> > > On Thu, May 16, 2024 at 08:15:34AM -0500, Andrew Halaney wrote:
> > > > If I understand correctly, you don't need any memory barrier.
> > > > writel()/readl()'s are ordered to the same endpoint. That goes for all
> > > > the reordering/barrier comments mentioned below too.
> > > > 
> > > > device-io.rst:
> > > > 
> > > >     The read and write functions are defined to be ordered. That is the
> > > >     compiler is not permitted to reorder the I/O sequence. When the ordering
> > > >     can be compiler optimised, you can use __readb() and friends to
> > > >     indicate the relaxed ordering. Use this with care.
> > > > 
> > > > memory-barriers.txt:
> > > > 
> > > >      (*) readX(), writeX():
> > > > 
> > > > 	    The readX() and writeX() MMIO accessors take a pointer to the
> > > > 	    peripheral being accessed as an __iomem * parameter. For pointers
> > > > 	    mapped with the default I/O attributes (e.g. those returned by
> > > > 	    ioremap()), the ordering guarantees are as follows:
> > > > 
> > > > 	    1. All readX() and writeX() accesses to the same peripheral are ordered
> > > > 	       with respect to each other. This ensures that MMIO register accesses
> > > > 	       by the same CPU thread to a particular device will arrive in program
> > > > 	       order.
> > > > 
> > > 
> > > In arm64, a writel followed by readl translates to roughly the following
> > > sequence: dmb_wmb(), __raw_writel(), __raw_readl(), dmb_rmb(). I am not
> > > sure what is stopping compiler from reordering  __raw_writel() and __raw_readl()
> > > above? I am assuming iomem cookie is ignored during compilation.
> > 
> > It seems to me that is due to some usage of volatile there in
> > __raw_writel() etc, but to be honest after reading about volatile and
> > some threads from gcc mailing lists, I don't have a confident answer :)
> > 
> > > 
> > > Added Will to this thread if he can throw some light on this.
> > 
> > Hopefully Will can school us.
> 
> The ordering in this case is ensured by the memory attributes used for
> ioremap(). When an MMIO region is mapped using Device-nGnRE attributes
> (as it the case for ioremap()), the "nR" part means "no reordering", so
> readX() and writeX() to that region are ordered wrt each other.

But that avoids only HW reordering, doesn't it? What about *compiler reordering* in the
case of a writel following by a readl which translates to:
	1: dmb_wmb()
	2: __raw_writel() -> roughly "asm volatile('str')
	3: __raw_readl() -> roughly "asm volatile('ldr')
	4: dmb_rmb()

Is the 'volatile' keyword sufficient to avoid reordering between (2) and (3)? Or
do we need a "memory" clobber to inhibit reordering?

This is still not clear to me even after going through some compiler documentions.

-Akhil.

> 
> Note that guarantee _doesn't_ apply to other flavours of ioremap(), so
> e.g. ioremap_wc() won't give you the ordering.
> 
> Hope that helps,
> 
> Will

Akhil P Oommen June 18, 2024, 4:38 p.m. UTC | #10

On Tue, Jun 04, 2024 at 07:35:04PM +0200, Konrad Dybcio wrote:
> 
> 
> On 5/14/24 20:38, Akhil P Oommen wrote:
> > On Wed, May 08, 2024 at 07:46:31PM +0200, Konrad Dybcio wrote:
> > > Memory barriers help ensure instruction ordering, NOT time and order
> > > of actual write arrival at other observers (e.g. memory-mapped IP).
> > > On architectures employing weak memory ordering, the latter can be a
> > > giant pain point, and it has been as part of this driver.
> > > 
> > > Moreover, the gpu_/gmu_ accessors already use non-relaxed versions of
> > > readl/writel, which include r/w (respectively) barriers.
> > > 
> > > Replace the barriers with a readback that ensures the previous writes
> > > have exited the write buffer (as the CPU must flush the write to the
> > > register it's trying to read back) and subsequently remove the hack
> > > introduced in commit b77532803d11 ("drm/msm/a6xx: Poll for GBIF unhalt
> > > status in hw_init").
> > > 
> > > Fixes: b77532803d11 ("drm/msm/a6xx: Poll for GBIF unhalt status in hw_init")
> > > Signed-off-by: Konrad Dybcio <konrad.dybcio@linaro.org>
> > > ---
> > >   drivers/gpu/drm/msm/adreno/a6xx_gmu.c |  5 ++---
> > >   drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 14 ++++----------
> > >   2 files changed, 6 insertions(+), 13 deletions(-)
> > 
> > I prefer this version compared to the v2. A helper routine is
> > unnecessary here because:
> > 1. there are very few scenarios where we have to read back the same
> > register.
> > 2. we may accidently readback a write only register.
> 
> Which would still trigger an address dependency on the CPU, no?

Yes, but it is not a good idea to read a write-only register. We can't be
sure about its effect on the endpoint.

> 
> > 
> > > 
> > > diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gmu.c b/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
> > > index 0e3dfd4c2bc8..4135a53b55a7 100644
> > > --- a/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
> > > +++ b/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
> > > @@ -466,9 +466,8 @@ static int a6xx_rpmh_start(struct a6xx_gmu *gmu)
> > >   	int ret;
> > >   	u32 val;
> > > -	gmu_write(gmu, REG_A6XX_GMU_RSCC_CONTROL_REQ, 1 << 1);
> > > -	/* Wait for the register to finish posting */
> > > -	wmb();
> > > +	gmu_write(gmu, REG_A6XX_GMU_RSCC_CONTROL_REQ, BIT(1));
> > > +	gmu_read(gmu, REG_A6XX_GMU_RSCC_CONTROL_REQ);
> > 
> > This is unnecessary because we are polling on a register on the same port below. But I think we
> > can replace "wmb()" above with "mb()" to avoid reordering between read
> > and write IO instructions.
> 
> Ok on the dropping readback part
> 
> + AFAIU from Will's response, we can drop the barrier as well

Lets wait a bit on Will's response on compiler reordering.

> 
> > 
> > >   	ret = gmu_poll_timeout(gmu, REG_A6XX_GMU_RSCC_CONTROL_ACK, val,
> > >   		val & (1 << 1), 100, 10000);
> > > diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> > > index 973872ad0474..0acbc38b8e70 100644
> > > --- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> > > +++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> > > @@ -1713,22 +1713,16 @@ static int hw_init(struct msm_gpu *gpu)
> > >   	}
> > >   	/* Clear GBIF halt in case GX domain was not collapsed */
> > > +	gpu_write(gpu, REG_A6XX_GBIF_HALT, 0);
> > 
> > We need a full barrier here to avoid reordering. Also, lets add a
> > comment about why we are doing this odd looking sequence.
> > 
> > > +	gpu_read(gpu, REG_A6XX_GBIF_HALT);
> > >   	if (adreno_is_a619_holi(adreno_gpu)) {
> > > -		gpu_write(gpu, REG_A6XX_GBIF_HALT, 0);
> > >   		gpu_write(gpu, REG_A6XX_RBBM_GPR0_CNTL, 0);
> > > -		/* Let's make extra sure that the GPU can access the memory.. */
> > > -		mb();
> > 
> > We need a full barrier here.
> > 
> > > +		gpu_read(gpu, REG_A6XX_RBBM_GPR0_CNTL);
> > >   	} else if (a6xx_has_gbif(adreno_gpu)) {
> > > -		gpu_write(gpu, REG_A6XX_GBIF_HALT, 0);
> > >   		gpu_write(gpu, REG_A6XX_RBBM_GBIF_HALT, 0);
> > > -		/* Let's make extra sure that the GPU can access the memory.. */
> > > -		mb();
> > 
> > We need a full barrier here.
> 
> Not sure we do between REG_A6XX_GBIF_HALT & REG_A6XX_RBBM_(GBIF_HALT/GPR0_CNTL),
> but I suppose keeping the one after REG_A6XX_RBBM_(GBIF_HALT/GPR0_CNTL) makes
> sense to avoid the possibility of configuring the GPU before it can access DRAM..

Techinically, I think we don't need a barrier or the below read back.
Because the above write is ordered with the write (on CP_CNTL reg) which
finally triggers CP INIT later. GPU won't access memory before CP INIT.

> 
> > 
> > > +		gpu_read(gpu, REG_A6XX_RBBM_GBIF_HALT);
> > >   	}
> > > -	/* Some GPUs are stubborn and take their sweet time to unhalt GBIF! */
> > > -	if (adreno_is_a7xx(adreno_gpu) && a6xx_has_gbif(adreno_gpu))
> > > -		spin_until(!gpu_read(gpu, REG_A6XX_GBIF_HALT_ACK));
> > > -
> > 
> > Why is this removed?
> 
> Because it was a hack in the first place and the enforcement of GBIF
> unhalt requests coming through before proceeding further removes the
> necessity to check this (unless there's some hw-mandated delay we should
> keep in mind, but kgsl doesn't have that and there doesn't seem to be
> any from testing on 8[456]50).

Oh! I just saw the history. There is no ack for 'unhalt' in hw.
Anyway this chunk is an unrelated change. Should be a separate change,
no?

-Akhil.

> 
> Konrad

Will Deacon June 20, 2024, 1:04 p.m. UTC | #11

On Tue, Jun 18, 2024 at 09:41:58PM +0530, Akhil P Oommen wrote:
> On Tue, Jun 04, 2024 at 03:40:56PM +0100, Will Deacon wrote:
> > On Thu, May 16, 2024 at 01:55:26PM -0500, Andrew Halaney wrote:
> > > On Thu, May 16, 2024 at 08:20:05PM GMT, Akhil P Oommen wrote:
> > > > On Thu, May 16, 2024 at 08:15:34AM -0500, Andrew Halaney wrote:
> > > > > If I understand correctly, you don't need any memory barrier.
> > > > > writel()/readl()'s are ordered to the same endpoint. That goes for all
> > > > > the reordering/barrier comments mentioned below too.
> > > > > 
> > > > > device-io.rst:
> > > > > 
> > > > >     The read and write functions are defined to be ordered. That is the
> > > > >     compiler is not permitted to reorder the I/O sequence. When the ordering
> > > > >     can be compiler optimised, you can use __readb() and friends to
> > > > >     indicate the relaxed ordering. Use this with care.
> > > > > 
> > > > > memory-barriers.txt:
> > > > > 
> > > > >      (*) readX(), writeX():
> > > > > 
> > > > > 	    The readX() and writeX() MMIO accessors take a pointer to the
> > > > > 	    peripheral being accessed as an __iomem * parameter. For pointers
> > > > > 	    mapped with the default I/O attributes (e.g. those returned by
> > > > > 	    ioremap()), the ordering guarantees are as follows:
> > > > > 
> > > > > 	    1. All readX() and writeX() accesses to the same peripheral are ordered
> > > > > 	       with respect to each other. This ensures that MMIO register accesses
> > > > > 	       by the same CPU thread to a particular device will arrive in program
> > > > > 	       order.
> > > > > 
> > > > 
> > > > In arm64, a writel followed by readl translates to roughly the following
> > > > sequence: dmb_wmb(), __raw_writel(), __raw_readl(), dmb_rmb(). I am not
> > > > sure what is stopping compiler from reordering  __raw_writel() and __raw_readl()
> > > > above? I am assuming iomem cookie is ignored during compilation.
> > > 
> > > It seems to me that is due to some usage of volatile there in
> > > __raw_writel() etc, but to be honest after reading about volatile and
> > > some threads from gcc mailing lists, I don't have a confident answer :)
> > > 
> > > > 
> > > > Added Will to this thread if he can throw some light on this.
> > > 
> > > Hopefully Will can school us.
> > 
> > The ordering in this case is ensured by the memory attributes used for
> > ioremap(). When an MMIO region is mapped using Device-nGnRE attributes
> > (as it the case for ioremap()), the "nR" part means "no reordering", so
> > readX() and writeX() to that region are ordered wrt each other.
> 
> But that avoids only HW reordering, doesn't it? What about *compiler reordering* in the
> case of a writel following by a readl which translates to:
> 	1: dmb_wmb()
> 	2: __raw_writel() -> roughly "asm volatile('str')
> 	3: __raw_readl() -> roughly "asm volatile('ldr')
> 	4: dmb_rmb()
> 
> Is the 'volatile' keyword sufficient to avoid reordering between (2) and (3)? Or
> do we need a "memory" clobber to inhibit reordering?
> 
> This is still not clear to me even after going through some compiler documentions.

I don't think the compiler should reorder volatile asm blocks wrt each
other.

Will

Akhil P Oommen June 25, 2024, 6:02 p.m. UTC | #12

On Thu, Jun 20, 2024 at 02:04:01PM +0100, Will Deacon wrote:
> On Tue, Jun 18, 2024 at 09:41:58PM +0530, Akhil P Oommen wrote:
> > On Tue, Jun 04, 2024 at 03:40:56PM +0100, Will Deacon wrote:
> > > On Thu, May 16, 2024 at 01:55:26PM -0500, Andrew Halaney wrote:
> > > > On Thu, May 16, 2024 at 08:20:05PM GMT, Akhil P Oommen wrote:
> > > > > On Thu, May 16, 2024 at 08:15:34AM -0500, Andrew Halaney wrote:
> > > > > > If I understand correctly, you don't need any memory barrier.
> > > > > > writel()/readl()'s are ordered to the same endpoint. That goes for all
> > > > > > the reordering/barrier comments mentioned below too.
> > > > > > 
> > > > > > device-io.rst:
> > > > > > 
> > > > > >     The read and write functions are defined to be ordered. That is the
> > > > > >     compiler is not permitted to reorder the I/O sequence. When the ordering
> > > > > >     can be compiler optimised, you can use __readb() and friends to
> > > > > >     indicate the relaxed ordering. Use this with care.
> > > > > > 
> > > > > > memory-barriers.txt:
> > > > > > 
> > > > > >      (*) readX(), writeX():
> > > > > > 
> > > > > > 	    The readX() and writeX() MMIO accessors take a pointer to the
> > > > > > 	    peripheral being accessed as an __iomem * parameter. For pointers
> > > > > > 	    mapped with the default I/O attributes (e.g. those returned by
> > > > > > 	    ioremap()), the ordering guarantees are as follows:
> > > > > > 
> > > > > > 	    1. All readX() and writeX() accesses to the same peripheral are ordered
> > > > > > 	       with respect to each other. This ensures that MMIO register accesses
> > > > > > 	       by the same CPU thread to a particular device will arrive in program
> > > > > > 	       order.
> > > > > > 
> > > > > 
> > > > > In arm64, a writel followed by readl translates to roughly the following
> > > > > sequence: dmb_wmb(), __raw_writel(), __raw_readl(), dmb_rmb(). I am not
> > > > > sure what is stopping compiler from reordering  __raw_writel() and __raw_readl()
> > > > > above? I am assuming iomem cookie is ignored during compilation.
> > > > 
> > > > It seems to me that is due to some usage of volatile there in
> > > > __raw_writel() etc, but to be honest after reading about volatile and
> > > > some threads from gcc mailing lists, I don't have a confident answer :)
> > > > 
> > > > > 
> > > > > Added Will to this thread if he can throw some light on this.
> > > > 
> > > > Hopefully Will can school us.
> > > 
> > > The ordering in this case is ensured by the memory attributes used for
> > > ioremap(). When an MMIO region is mapped using Device-nGnRE attributes
> > > (as it the case for ioremap()), the "nR" part means "no reordering", so
> > > readX() and writeX() to that region are ordered wrt each other.
> > 
> > But that avoids only HW reordering, doesn't it? What about *compiler reordering* in the
> > case of a writel following by a readl which translates to:
> > 	1: dmb_wmb()
> > 	2: __raw_writel() -> roughly "asm volatile('str')
> > 	3: __raw_readl() -> roughly "asm volatile('ldr')
> > 	4: dmb_rmb()
> > 
> > Is the 'volatile' keyword sufficient to avoid reordering between (2) and (3)? Or
> > do we need a "memory" clobber to inhibit reordering?
> > 
> > This is still not clear to me even after going through some compiler documentions.
> 
> I don't think the compiler should reorder volatile asm blocks wrt each
> other.
> 

Thanks Will for confirmation.

-Akhil.

> Will

Akhil P Oommen June 25, 2024, 6:08 p.m. UTC | #13

On Tue, Jun 18, 2024 at 10:08:23PM +0530, Akhil P Oommen wrote:
> On Tue, Jun 04, 2024 at 07:35:04PM +0200, Konrad Dybcio wrote:
> > 
> > 
> > On 5/14/24 20:38, Akhil P Oommen wrote:
> > > On Wed, May 08, 2024 at 07:46:31PM +0200, Konrad Dybcio wrote:
> > > > Memory barriers help ensure instruction ordering, NOT time and order
> > > > of actual write arrival at other observers (e.g. memory-mapped IP).
> > > > On architectures employing weak memory ordering, the latter can be a
> > > > giant pain point, and it has been as part of this driver.
> > > > 
> > > > Moreover, the gpu_/gmu_ accessors already use non-relaxed versions of
> > > > readl/writel, which include r/w (respectively) barriers.
> > > > 
> > > > Replace the barriers with a readback that ensures the previous writes
> > > > have exited the write buffer (as the CPU must flush the write to the
> > > > register it's trying to read back) and subsequently remove the hack
> > > > introduced in commit b77532803d11 ("drm/msm/a6xx: Poll for GBIF unhalt
> > > > status in hw_init").
> > > > 
> > > > Fixes: b77532803d11 ("drm/msm/a6xx: Poll for GBIF unhalt status in hw_init")
> > > > Signed-off-by: Konrad Dybcio <konrad.dybcio@linaro.org>
> > > > ---
> > > >   drivers/gpu/drm/msm/adreno/a6xx_gmu.c |  5 ++---
> > > >   drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 14 ++++----------
> > > >   2 files changed, 6 insertions(+), 13 deletions(-)
> > > 
> > > I prefer this version compared to the v2. A helper routine is
> > > unnecessary here because:
> > > 1. there are very few scenarios where we have to read back the same
> > > register.
> > > 2. we may accidently readback a write only register.
> > 
> > Which would still trigger an address dependency on the CPU, no?
> 
> Yes, but it is not a good idea to read a write-only register. We can't be
> sure about its effect on the endpoint.
> 
> > 
> > > 
> > > > 
> > > > diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gmu.c b/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
> > > > index 0e3dfd4c2bc8..4135a53b55a7 100644
> > > > --- a/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
> > > > +++ b/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
> > > > @@ -466,9 +466,8 @@ static int a6xx_rpmh_start(struct a6xx_gmu *gmu)
> > > >   	int ret;
> > > >   	u32 val;
> > > > -	gmu_write(gmu, REG_A6XX_GMU_RSCC_CONTROL_REQ, 1 << 1);
> > > > -	/* Wait for the register to finish posting */
> > > > -	wmb();
> > > > +	gmu_write(gmu, REG_A6XX_GMU_RSCC_CONTROL_REQ, BIT(1));
> > > > +	gmu_read(gmu, REG_A6XX_GMU_RSCC_CONTROL_REQ);
> > > 
> > > This is unnecessary because we are polling on a register on the same port below. But I think we
> > > can replace "wmb()" above with "mb()" to avoid reordering between read
> > > and write IO instructions.
> > 
> > Ok on the dropping readback part
> > 
> > + AFAIU from Will's response, we can drop the barrier as well

Yes, let drop the the barrier.

> 
> Lets wait a bit on Will's response on compiler reordering.
> 
> > 
> > > 
> > > >   	ret = gmu_poll_timeout(gmu, REG_A6XX_GMU_RSCC_CONTROL_ACK, val,
> > > >   		val & (1 << 1), 100, 10000);
> > > > diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> > > > index 973872ad0474..0acbc38b8e70 100644
> > > > --- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> > > > +++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> > > > @@ -1713,22 +1713,16 @@ static int hw_init(struct msm_gpu *gpu)
> > > >   	}
> > > >   	/* Clear GBIF halt in case GX domain was not collapsed */
> > > > +	gpu_write(gpu, REG_A6XX_GBIF_HALT, 0);
> > > 
> > > We need a full barrier here to avoid reordering. Also, lets add a
> > > comment about why we are doing this odd looking sequence.

Please ignore this.

> > > 
> > > > +	gpu_read(gpu, REG_A6XX_GBIF_HALT);
> > > >   	if (adreno_is_a619_holi(adreno_gpu)) {
> > > > -		gpu_write(gpu, REG_A6XX_GBIF_HALT, 0);
> > > >   		gpu_write(gpu, REG_A6XX_RBBM_GPR0_CNTL, 0);
> > > > -		/* Let's make extra sure that the GPU can access the memory.. */
> > > > -		mb();
> > > 
> > > We need a full barrier here.

Please ignore this.

> > > 
> > > > +		gpu_read(gpu, REG_A6XX_RBBM_GPR0_CNTL);
> > > >   	} else if (a6xx_has_gbif(adreno_gpu)) {
> > > > -		gpu_write(gpu, REG_A6XX_GBIF_HALT, 0);
> > > >   		gpu_write(gpu, REG_A6XX_RBBM_GBIF_HALT, 0);
> > > > -		/* Let's make extra sure that the GPU can access the memory.. */
> > > > -		mb();
> > > 
> > > We need a full barrier here.
> > 
> > Not sure we do between REG_A6XX_GBIF_HALT & REG_A6XX_RBBM_(GBIF_HALT/GPR0_CNTL),
> > but I suppose keeping the one after REG_A6XX_RBBM_(GBIF_HALT/GPR0_CNTL) makes
> > sense to avoid the possibility of configuring the GPU before it can access DRAM..
> 
> Techinically, I think we don't need a barrier or the below read back.
> Because the above write is ordered with the write (on CP_CNTL reg) which
> finally triggers CP INIT later. GPU won't access memory before CP INIT.
> 
> > 
> > > 
> > > > +		gpu_read(gpu, REG_A6XX_RBBM_GBIF_HALT);
> > > >   	}
> > > > -	/* Some GPUs are stubborn and take their sweet time to unhalt GBIF! */
> > > > -	if (adreno_is_a7xx(adreno_gpu) && a6xx_has_gbif(adreno_gpu))
> > > > -		spin_until(!gpu_read(gpu, REG_A6XX_GBIF_HALT_ACK));
> > > > -
> > > 
> > > Why is this removed?
> > 
> > Because it was a hack in the first place and the enforcement of GBIF
> > unhalt requests coming through before proceeding further removes the
> > necessity to check this (unless there's some hw-mandated delay we should
> > keep in mind, but kgsl doesn't have that and there doesn't seem to be
> > any from testing on 8[456]50).
> 
> Oh! I just saw the history. There is no ack for 'unhalt' in hw.
> Anyway this chunk is an unrelated change. Should be a separate change,
> no?

I need this fix this as soon as possible for the x185 support. It is
almost unusable without it. Could you spin off a separate patch and get
it picked up via fixes branch?

-Akhil.

> 
> -Akhil.
> 
> > 
> > Konrad
>

drm/msm/adreno: De-spaghettify the use of memory barriers

Commit Message

Comments

Patch