[4/4] drm/nouveau: introduce CPU cache flushing macro

Message ID	1400483458-9648-5-git-send-email-acourbot@nvidia.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <dri-devel-bounces@lists.freedesktop.org> From: Alexandre Courbot <acourbot@nvidia.com> To: David Airlie <airlied@linux.ie>, Ben Skeggs <bskeggs@redhat.com>, Lucas Stach <dev@lynxeye.de>, Thierry Reding <thierry.reding@gmail.com> Subject: [PATCH 4/4] drm/nouveau: introduce CPU cache flushing macro Date: Mon, 19 May 2014 16:10:58 +0900 Message-ID: <1400483458-9648-5-git-send-email-acourbot@nvidia.com> In-Reply-To: <1400483458-9648-1-git-send-email-acourbot@nvidia.com> References: <1400483458-9648-1-git-send-email-acourbot@nvidia.com> MIME-Version: 1.0 Cc: gnurou@gmail.com, nouveau@lists.freedesktop.org, linux-kernel@vger.kernel.org, dri-devel@lists.freedesktop.org, linux-tegra@vger.kernel.org Precedence: list Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" <dri-devel-bounces@lists.freedesktop.org>

Alexandre Courbot May 19, 2014, 7:10 a.m. UTC

Some architectures (e.g. ARM) need the CPU buffers to be explicitely
flushed for a memory write to take effect. Not doing so results in
synchronization issues, especially after writing to BOs.

This patch introduces a macro that flushes the caches on ARM and
translates to a no-op on other architectures, and uses it when
writing to in-memory BOs. It will also be useful for implementations of
instmem that access shared memory directly instead of going through
PRAMIN.

Signed-off-by: Alexandre Courbot <acourbot@nvidia.com>
---
 drivers/gpu/drm/nouveau/core/os.h    | 17 +++++++++++++++++
 drivers/gpu/drm/nouveau/nouveau_bo.c |  8 ++++++--
 2 files changed, 23 insertions(+), 2 deletions(-)

Thierry Reding May 19, 2014, 9:02 a.m. UTC | #1

On Mon, May 19, 2014 at 04:10:58PM +0900, Alexandre Courbot wrote:
> Some architectures (e.g. ARM) need the CPU buffers to be explicitely
> flushed for a memory write to take effect. Not doing so results in
> synchronization issues, especially after writing to BOs.

It seems to me that the above is generally true for all architectures,
not just ARM.

Also: s/explicitely/explicitly/

> This patch introduces a macro that flushes the caches on ARM and
> translates to a no-op on other architectures, and uses it when
> writing to in-memory BOs. It will also be useful for implementations of
> instmem that access shared memory directly instead of going through
> PRAMIN.

Presumably instmem can access shared memory on all architectures, so
this doesn't seem like a property of the architecture but rather of the
memory pool backing the instmem.

In that case I wonder if this shouldn't be moved into an operation that
is implemented by the backing memory pool and be a noop where the cache
doesn't need explicit flushing.

> diff --git a/drivers/gpu/drm/nouveau/core/os.h b/drivers/gpu/drm/nouveau/core/os.h
> index d0ced94ca54c..274b4460bb03 100644
> --- a/drivers/gpu/drm/nouveau/core/os.h
> +++ b/drivers/gpu/drm/nouveau/core/os.h
> @@ -38,4 +38,21 @@
>  #endif /* def __BIG_ENDIAN else */
>  #endif /* !ioread32_native */
>  
> +#if defined(__arm__)
> +
> +#define nv_cpu_cache_flush_area(va, size)	\
> +do {						\
> +	phys_addr_t pa = virt_to_phys(va);	\
> +	__cpuc_flush_dcache_area(va, size);	\
> +	outer_flush_range(pa, pa + size);	\
> +} while (0)

Couldn't this be a static inline function?

> diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c b/drivers/gpu/drm/nouveau/nouveau_bo.c
[...]
> index 0886f47e5244..b9c9729c5733 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_bo.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_bo.c
> @@ -437,8 +437,10 @@ nouveau_bo_wr16(struct nouveau_bo *nvbo, unsigned index, u16 val)
>  	mem = &mem[index];
>  	if (is_iomem)
>  		iowrite16_native(val, (void __force __iomem *)mem);
> -	else
> +	else {
>  		*mem = val;
> +		nv_cpu_cache_flush_area(mem, 2);
> +	}
>  }
>  
>  u32
> @@ -461,8 +463,10 @@ nouveau_bo_wr32(struct nouveau_bo *nvbo, unsigned index, u32 val)
>  	mem = &mem[index];
>  	if (is_iomem)
>  		iowrite32_native(val, (void __force __iomem *)mem);
> -	else
> +	else {
>  		*mem = val;
> +		nv_cpu_cache_flush_area(mem, 4);
> +	}

This looks rather like a sledgehammer to me. Effectively this turns nvbo
into an uncached buffer. With additional overhead of constantly flushing
caches. Wouldn't it make more sense to locate the places where these are
called and flush the cache after all the writes have completed?

Thierry

Lucas Stach May 19, 2014, 9:22 a.m. UTC | #2

Am Montag, den 19.05.2014, 11:02 +0200 schrieb Thierry Reding:
> On Mon, May 19, 2014 at 04:10:58PM +0900, Alexandre Courbot wrote:
> > Some architectures (e.g. ARM) need the CPU buffers to be explicitely
> > flushed for a memory write to take effect. Not doing so results in
> > synchronization issues, especially after writing to BOs.
> 
> It seems to me that the above is generally true for all architectures,
> not just ARM.
> 
No, on PCI coherent arches, like x86 and some PowerPCs, the GPU will
snoop the CPU caches and therefore an explicit cache flush is not
required.

> Also: s/explicitely/explicitly/
> 
> > This patch introduces a macro that flushes the caches on ARM and
> > translates to a no-op on other architectures, and uses it when
> > writing to in-memory BOs. It will also be useful for implementations of
> > instmem that access shared memory directly instead of going through
> > PRAMIN.
> 
> Presumably instmem can access shared memory on all architectures, so
> this doesn't seem like a property of the architecture but rather of the
> memory pool backing the instmem.
> 
> In that case I wonder if this shouldn't be moved into an operation that
> is implemented by the backing memory pool and be a noop where the cache
> doesn't need explicit flushing.
> 
> > diff --git a/drivers/gpu/drm/nouveau/core/os.h b/drivers/gpu/drm/nouveau/core/os.h
> > index d0ced94ca54c..274b4460bb03 100644
> > --- a/drivers/gpu/drm/nouveau/core/os.h
> > +++ b/drivers/gpu/drm/nouveau/core/os.h
> > @@ -38,4 +38,21 @@
> >  #endif /* def __BIG_ENDIAN else */
> >  #endif /* !ioread32_native */
> >  
> > +#if defined(__arm__)
> > +
> > +#define nv_cpu_cache_flush_area(va, size)	\
> > +do {						\
> > +	phys_addr_t pa = virt_to_phys(va);	\
> > +	__cpuc_flush_dcache_area(va, size);	\
> > +	outer_flush_range(pa, pa + size);	\
> > +} while (0)
> 
> Couldn't this be a static inline function?
> 
> > diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c b/drivers/gpu/drm/nouveau/nouveau_bo.c
> [...]
> > index 0886f47e5244..b9c9729c5733 100644
> > --- a/drivers/gpu/drm/nouveau/nouveau_bo.c
> > +++ b/drivers/gpu/drm/nouveau/nouveau_bo.c
> > @@ -437,8 +437,10 @@ nouveau_bo_wr16(struct nouveau_bo *nvbo, unsigned index, u16 val)
> >  	mem = &mem[index];
> >  	if (is_iomem)
> >  		iowrite16_native(val, (void __force __iomem *)mem);
> > -	else
> > +	else {
> >  		*mem = val;
> > +		nv_cpu_cache_flush_area(mem, 2);
> > +	}
> >  }
> >  
> >  u32
> > @@ -461,8 +463,10 @@ nouveau_bo_wr32(struct nouveau_bo *nvbo, unsigned index, u32 val)
> >  	mem = &mem[index];
> >  	if (is_iomem)
> >  		iowrite32_native(val, (void __force __iomem *)mem);
> > -	else
> > +	else {
> >  		*mem = val;
> > +		nv_cpu_cache_flush_area(mem, 4);
> > +	}
> 
> This looks rather like a sledgehammer to me. Effectively this turns nvbo
> into an uncached buffer. With additional overhead of constantly flushing
> caches. Wouldn't it make more sense to locate the places where these are
> called and flush the cache after all the writes have completed?
> 
I don't think the explicit flushing for those things makes sense. I
think it is a lot more effective to just map the BOs write-combined on
PCI non-coherent arches. This way any writes will be buffered. Reads
will be slow, but I don't think nouveau is reading back a lot from those
buffers.
Using the write-combining buffer doesn't need any additional
synchronization as it will get flushed on pushbuf kickoff anyways.

Regards,
Lucas

Thierry Reding May 19, 2014, 10:03 a.m. UTC | #3

On Mon, May 19, 2014 at 11:22:11AM +0200, Lucas Stach wrote:
> Am Montag, den 19.05.2014, 11:02 +0200 schrieb Thierry Reding:
> > On Mon, May 19, 2014 at 04:10:58PM +0900, Alexandre Courbot wrote:
> > > Some architectures (e.g. ARM) need the CPU buffers to be explicitely
> > > flushed for a memory write to take effect. Not doing so results in
> > > synchronization issues, especially after writing to BOs.
> > 
> > It seems to me that the above is generally true for all architectures,
> > not just ARM.
> > 
> No, on PCI coherent arches, like x86 and some PowerPCs, the GPU will
> snoop the CPU caches and therefore an explicit cache flush is not
> required.

I was criticizing the wording in the commit message. Perhaps it could be
enhanced with what you just said.

> > > diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c b/drivers/gpu/drm/nouveau/nouveau_bo.c
> > [...]
> > > index 0886f47e5244..b9c9729c5733 100644
> > > --- a/drivers/gpu/drm/nouveau/nouveau_bo.c
> > > +++ b/drivers/gpu/drm/nouveau/nouveau_bo.c
> > > @@ -437,8 +437,10 @@ nouveau_bo_wr16(struct nouveau_bo *nvbo, unsigned index, u16 val)
> > >  	mem = &mem[index];
> > >  	if (is_iomem)
> > >  		iowrite16_native(val, (void __force __iomem *)mem);
> > > -	else
> > > +	else {
> > >  		*mem = val;
> > > +		nv_cpu_cache_flush_area(mem, 2);
> > > +	}
> > >  }
> > >  
> > >  u32
> > > @@ -461,8 +463,10 @@ nouveau_bo_wr32(struct nouveau_bo *nvbo, unsigned index, u32 val)
> > >  	mem = &mem[index];
> > >  	if (is_iomem)
> > >  		iowrite32_native(val, (void __force __iomem *)mem);
> > > -	else
> > > +	else {
> > >  		*mem = val;
> > > +		nv_cpu_cache_flush_area(mem, 4);
> > > +	}
> > 
> > This looks rather like a sledgehammer to me. Effectively this turns nvbo
> > into an uncached buffer. With additional overhead of constantly flushing
> > caches. Wouldn't it make more sense to locate the places where these are
> > called and flush the cache after all the writes have completed?
> > 
> I don't think the explicit flushing for those things makes sense. I
> think it is a lot more effective to just map the BOs write-combined on
> PCI non-coherent arches. This way any writes will be buffered. Reads
> will be slow, but I don't think nouveau is reading back a lot from those
> buffers.
> Using the write-combining buffer doesn't need any additional
> synchronization as it will get flushed on pushbuf kickoff anyways.

Sounds good to me.

Thierry

Daniel Vetter May 19, 2014, 10:27 a.m. UTC | #4

On Mon, May 19, 2014 at 12:03:17PM +0200, Thierry Reding wrote:
> On Mon, May 19, 2014 at 11:22:11AM +0200, Lucas Stach wrote:
> > Am Montag, den 19.05.2014, 11:02 +0200 schrieb Thierry Reding:
> > > On Mon, May 19, 2014 at 04:10:58PM +0900, Alexandre Courbot wrote:
> > > > Some architectures (e.g. ARM) need the CPU buffers to be explicitely
> > > > flushed for a memory write to take effect. Not doing so results in
> > > > synchronization issues, especially after writing to BOs.
> > > 
> > > It seems to me that the above is generally true for all architectures,
> > > not just ARM.
> > > 
> > No, on PCI coherent arches, like x86 and some PowerPCs, the GPU will
> > snoop the CPU caches and therefore an explicit cache flush is not
> > required.
> 
> I was criticizing the wording in the commit message. Perhaps it could be
> enhanced with what you just said.

Shouldn't this be done in the dma mapping layer? I know that i915 does all
the cpu cache flushing itself, but that's because the x86 dma layer
refuses to believe that there are non-coherent platforms on x86. But on
arm it can cope.

This is somewhat important for dma-buf buffer sharing since if the cpu
cache control is done in drivers you must do double-flushing on shared
buffers. Atm you have to do that anyway, but at least this would make it
easier. The other problem is that ttm reinvents half of the dma mapping
functions.

Just my 2 cents.
-Daniel

Alexandre Courbot May 23, 2014, 6:58 a.m. UTC | #5

On Mon, May 19, 2014 at 7:03 PM, Thierry Reding
<thierry.reding@gmail.com> wrote:
> On Mon, May 19, 2014 at 11:22:11AM +0200, Lucas Stach wrote:
>> Am Montag, den 19.05.2014, 11:02 +0200 schrieb Thierry Reding:
>> > On Mon, May 19, 2014 at 04:10:58PM +0900, Alexandre Courbot wrote:
>> > > Some architectures (e.g. ARM) need the CPU buffers to be explicitely
>> > > flushed for a memory write to take effect. Not doing so results in
>> > > synchronization issues, especially after writing to BOs.
>> >
>> > It seems to me that the above is generally true for all architectures,
>> > not just ARM.
>> >
>> No, on PCI coherent arches, like x86 and some PowerPCs, the GPU will
>> snoop the CPU caches and therefore an explicit cache flush is not
>> required.
>
> I was criticizing the wording in the commit message. Perhaps it could be
> enhanced with what you just said.
>
>> > > diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c b/drivers/gpu/drm/nouveau/nouveau_bo.c
>> > [...]
>> > > index 0886f47e5244..b9c9729c5733 100644
>> > > --- a/drivers/gpu/drm/nouveau/nouveau_bo.c
>> > > +++ b/drivers/gpu/drm/nouveau/nouveau_bo.c
>> > > @@ -437,8 +437,10 @@ nouveau_bo_wr16(struct nouveau_bo *nvbo, unsigned index, u16 val)
>> > >   mem = &mem[index];
>> > >   if (is_iomem)
>> > >           iowrite16_native(val, (void __force __iomem *)mem);
>> > > - else
>> > > + else {
>> > >           *mem = val;
>> > > +         nv_cpu_cache_flush_area(mem, 2);
>> > > + }
>> > >  }
>> > >
>> > >  u32
>> > > @@ -461,8 +463,10 @@ nouveau_bo_wr32(struct nouveau_bo *nvbo, unsigned index, u32 val)
>> > >   mem = &mem[index];
>> > >   if (is_iomem)
>> > >           iowrite32_native(val, (void __force __iomem *)mem);
>> > > - else
>> > > + else {
>> > >           *mem = val;
>> > > +         nv_cpu_cache_flush_area(mem, 4);
>> > > + }
>> >
>> > This looks rather like a sledgehammer to me. Effectively this turns nvbo
>> > into an uncached buffer. With additional overhead of constantly flushing
>> > caches. Wouldn't it make more sense to locate the places where these are
>> > called and flush the cache after all the writes have completed?
>> >
>> I don't think the explicit flushing for those things makes sense. I
>> think it is a lot more effective to just map the BOs write-combined on
>> PCI non-coherent arches. This way any writes will be buffered. Reads
>> will be slow, but I don't think nouveau is reading back a lot from those
>> buffers.
>> Using the write-combining buffer doesn't need any additional
>> synchronization as it will get flushed on pushbuf kickoff anyways.
>
> Sounds good to me.

I will need to wrap my head around TTM some more to understand how to
do this the right way, but it is true that brute-forcing in-memory BO
mappings to be WC make the addition of nv_cpu_cache_flush_area()
unneeded. Is that the direction we want to take with this?

Alexandre Courbot June 9, 2014, 10:41 a.m. UTC | #6

On Mon, May 19, 2014 at 6:22 PM, Lucas Stach <l.stach@pengutronix.de> wrote:
> Am Montag, den 19.05.2014, 11:02 +0200 schrieb Thierry Reding:
>> On Mon, May 19, 2014 at 04:10:58PM +0900, Alexandre Courbot wrote:
>> > Some architectures (e.g. ARM) need the CPU buffers to be explicitely
>> > flushed for a memory write to take effect. Not doing so results in
>> > synchronization issues, especially after writing to BOs.
>>
>> It seems to me that the above is generally true for all architectures,
>> not just ARM.
>>
> No, on PCI coherent arches, like x86 and some PowerPCs, the GPU will
> snoop the CPU caches and therefore an explicit cache flush is not
> required.
>
>> Also: s/explicitely/explicitly/
>>
>> > This patch introduces a macro that flushes the caches on ARM and
>> > translates to a no-op on other architectures, and uses it when
>> > writing to in-memory BOs. It will also be useful for implementations of
>> > instmem that access shared memory directly instead of going through
>> > PRAMIN.
>>
>> Presumably instmem can access shared memory on all architectures, so
>> this doesn't seem like a property of the architecture but rather of the
>> memory pool backing the instmem.
>>
>> In that case I wonder if this shouldn't be moved into an operation that
>> is implemented by the backing memory pool and be a noop where the cache
>> doesn't need explicit flushing.
>>
>> > diff --git a/drivers/gpu/drm/nouveau/core/os.h b/drivers/gpu/drm/nouveau/core/os.h
>> > index d0ced94ca54c..274b4460bb03 100644
>> > --- a/drivers/gpu/drm/nouveau/core/os.h
>> > +++ b/drivers/gpu/drm/nouveau/core/os.h
>> > @@ -38,4 +38,21 @@
>> >  #endif /* def __BIG_ENDIAN else */
>> >  #endif /* !ioread32_native */
>> >
>> > +#if defined(__arm__)
>> > +
>> > +#define nv_cpu_cache_flush_area(va, size)  \
>> > +do {                                               \
>> > +   phys_addr_t pa = virt_to_phys(va);      \
>> > +   __cpuc_flush_dcache_area(va, size);     \
>> > +   outer_flush_range(pa, pa + size);       \
>> > +} while (0)
>>
>> Couldn't this be a static inline function?
>>
>> > diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c b/drivers/gpu/drm/nouveau/nouveau_bo.c
>> [...]
>> > index 0886f47e5244..b9c9729c5733 100644
>> > --- a/drivers/gpu/drm/nouveau/nouveau_bo.c
>> > +++ b/drivers/gpu/drm/nouveau/nouveau_bo.c
>> > @@ -437,8 +437,10 @@ nouveau_bo_wr16(struct nouveau_bo *nvbo, unsigned index, u16 val)
>> >     mem = &mem[index];
>> >     if (is_iomem)
>> >             iowrite16_native(val, (void __force __iomem *)mem);
>> > -   else
>> > +   else {
>> >             *mem = val;
>> > +           nv_cpu_cache_flush_area(mem, 2);
>> > +   }
>> >  }
>> >
>> >  u32
>> > @@ -461,8 +463,10 @@ nouveau_bo_wr32(struct nouveau_bo *nvbo, unsigned index, u32 val)
>> >     mem = &mem[index];
>> >     if (is_iomem)
>> >             iowrite32_native(val, (void __force __iomem *)mem);
>> > -   else
>> > +   else {
>> >             *mem = val;
>> > +           nv_cpu_cache_flush_area(mem, 4);
>> > +   }
>>
>> This looks rather like a sledgehammer to me. Effectively this turns nvbo
>> into an uncached buffer. With additional overhead of constantly flushing
>> caches. Wouldn't it make more sense to locate the places where these are
>> called and flush the cache after all the writes have completed?
>>
> I don't think the explicit flushing for those things makes sense. I
> think it is a lot more effective to just map the BOs write-combined on
> PCI non-coherent arches. This way any writes will be buffered. Reads
> will be slow, but I don't think nouveau is reading back a lot from those
> buffers.
> Using the write-combining buffer doesn't need any additional
> synchronization as it will get flushed on pushbuf kickoff anyways.

I tried to go that way, and something interesting happened.

What I did: remove this patch and instead set the following caching
parameters for the TTM_PL_TT case in nouveau_bo_init_mem_type():

    man->available_caching = TTM_PL_FLAG_UNCACHED | TTM_PL_FLAG_WC;
    man->default_caching = TTM_PL_FLAG_WC;

What happened: no runtime errors as what happened when caching is
enabled. However, many of the vertex and texture buffers seem to be
partially corrupted. In glmark2 the 3d models had many vertices (but
not all) at the wrong position. Note that not all the scenes ended up
being corrupted - in particular, when two consecutive scenes used the
same model, the second instance would be uncorrupted.

Forcing the caching to TTM_PL_FLAG_UNCACHED led to the same result.
What is interesting is that while data like vertices and textures got
corrupted, pushbuffers and shader programs seem to be just fine, as I
could not see any runtime error.

I don't really understand what kind of caching behavior could lead to
that. If anyone has any idea, I'd love to hear.

Thanks,
Alex.

Alexandre Courbot June 12, 2014, 1:50 p.m. UTC | #7

On Mon, Jun 9, 2014 at 7:41 PM, Alexandre Courbot <gnurou@gmail.com> wrote:
> On Mon, May 19, 2014 at 6:22 PM, Lucas Stach <l.stach@pengutronix.de> wrote:
>> Am Montag, den 19.05.2014, 11:02 +0200 schrieb Thierry Reding:
>>> On Mon, May 19, 2014 at 04:10:58PM +0900, Alexandre Courbot wrote:
>>> > Some architectures (e.g. ARM) need the CPU buffers to be explicitely
>>> > flushed for a memory write to take effect. Not doing so results in
>>> > synchronization issues, especially after writing to BOs.
>>>
>>> It seems to me that the above is generally true for all architectures,
>>> not just ARM.
>>>
>> No, on PCI coherent arches, like x86 and some PowerPCs, the GPU will
>> snoop the CPU caches and therefore an explicit cache flush is not
>> required.
>>
>>> Also: s/explicitely/explicitly/
>>>
>>> > This patch introduces a macro that flushes the caches on ARM and
>>> > translates to a no-op on other architectures, and uses it when
>>> > writing to in-memory BOs. It will also be useful for implementations of
>>> > instmem that access shared memory directly instead of going through
>>> > PRAMIN.
>>>
>>> Presumably instmem can access shared memory on all architectures, so
>>> this doesn't seem like a property of the architecture but rather of the
>>> memory pool backing the instmem.
>>>
>>> In that case I wonder if this shouldn't be moved into an operation that
>>> is implemented by the backing memory pool and be a noop where the cache
>>> doesn't need explicit flushing.
>>>
>>> > diff --git a/drivers/gpu/drm/nouveau/core/os.h b/drivers/gpu/drm/nouveau/core/os.h
>>> > index d0ced94ca54c..274b4460bb03 100644
>>> > --- a/drivers/gpu/drm/nouveau/core/os.h
>>> > +++ b/drivers/gpu/drm/nouveau/core/os.h
>>> > @@ -38,4 +38,21 @@
>>> >  #endif /* def __BIG_ENDIAN else */
>>> >  #endif /* !ioread32_native */
>>> >
>>> > +#if defined(__arm__)
>>> > +
>>> > +#define nv_cpu_cache_flush_area(va, size)  \
>>> > +do {                                               \
>>> > +   phys_addr_t pa = virt_to_phys(va);      \
>>> > +   __cpuc_flush_dcache_area(va, size);     \
>>> > +   outer_flush_range(pa, pa + size);       \
>>> > +} while (0)
>>>
>>> Couldn't this be a static inline function?
>>>
>>> > diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c b/drivers/gpu/drm/nouveau/nouveau_bo.c
>>> [...]
>>> > index 0886f47e5244..b9c9729c5733 100644
>>> > --- a/drivers/gpu/drm/nouveau/nouveau_bo.c
>>> > +++ b/drivers/gpu/drm/nouveau/nouveau_bo.c
>>> > @@ -437,8 +437,10 @@ nouveau_bo_wr16(struct nouveau_bo *nvbo, unsigned index, u16 val)
>>> >     mem = &mem[index];
>>> >     if (is_iomem)
>>> >             iowrite16_native(val, (void __force __iomem *)mem);
>>> > -   else
>>> > +   else {
>>> >             *mem = val;
>>> > +           nv_cpu_cache_flush_area(mem, 2);
>>> > +   }
>>> >  }
>>> >
>>> >  u32
>>> > @@ -461,8 +463,10 @@ nouveau_bo_wr32(struct nouveau_bo *nvbo, unsigned index, u32 val)
>>> >     mem = &mem[index];
>>> >     if (is_iomem)
>>> >             iowrite32_native(val, (void __force __iomem *)mem);
>>> > -   else
>>> > +   else {
>>> >             *mem = val;
>>> > +           nv_cpu_cache_flush_area(mem, 4);
>>> > +   }
>>>
>>> This looks rather like a sledgehammer to me. Effectively this turns nvbo
>>> into an uncached buffer. With additional overhead of constantly flushing
>>> caches. Wouldn't it make more sense to locate the places where these are
>>> called and flush the cache after all the writes have completed?
>>>
>> I don't think the explicit flushing for those things makes sense. I
>> think it is a lot more effective to just map the BOs write-combined on
>> PCI non-coherent arches. This way any writes will be buffered. Reads
>> will be slow, but I don't think nouveau is reading back a lot from those
>> buffers.
>> Using the write-combining buffer doesn't need any additional
>> synchronization as it will get flushed on pushbuf kickoff anyways.
>
> I tried to go that way, and something interesting happened.
>
> What I did: remove this patch and instead set the following caching
> parameters for the TTM_PL_TT case in nouveau_bo_init_mem_type():
>
>     man->available_caching = TTM_PL_FLAG_UNCACHED | TTM_PL_FLAG_WC;
>     man->default_caching = TTM_PL_FLAG_WC;
>
> What happened: no runtime errors as what happened when caching is
> enabled. However, many of the vertex and texture buffers seem to be
> partially corrupted. In glmark2 the 3d models had many vertices (but
> not all) at the wrong position. Note that not all the scenes ended up
> being corrupted - in particular, when two consecutive scenes used the
> same model, the second instance would be uncorrupted.
>
> Forcing the caching to TTM_PL_FLAG_UNCACHED led to the same result.
> What is interesting is that while data like vertices and textures got
> corrupted, pushbuffers and shader programs seem to be just fine, as I
> could not see any runtime error.

An interesting fact: if I change ttm_bo_kmap_ttm() such as kernel
mappings of BOs are always performed write-combined, and leave the
TTM_PL_TT default caching to TTM_PL_FLAG_CACHED so user-space mappings
remain cached, the corruptions just vanish. It seems to be the fact of
setting user-space mappings to anything non-cached that leads to this
puzzling behavior. Certainly some subtlety of ARM mappings are getting
over my head here.

If we need to implement different policies for kernel and user-space
mappings, this might complicate things a bit, especially since support
needs to be in TTM and not only Nouveau. I will submit a RFC tomorrow
if I don't hear better ideas by then.

Alex.

[4/4] drm/nouveau: introduce CPU cache flushing macro

Commit Message

Comments

Patch