[v4,3/4] drm/shmem-helpers: Allocate wc pages on x86

Message ID	20210713205153.1896059-4-daniel.vetter@ffwll.ch (mailing list archive)
State	New, archived
Series	shmem helpers for vgem
  [v4,0/4] shmem helpers for vgem
  [v4,1/4] dma-buf: Require VM_PFNMAP vma for mmap
  [v4,2/4] drm/shmem-helper: Switch to vmf_insert_pfn
  [v4,3/4] drm/shmem-helpers: Allocate wc pages on x86
  [v4,4/4] drm/vgem: use shmem helpers

Daniel Vetter July 13, 2021, 8:51 p.m. UTC

intel-gfx-ci realized that something is not quite coherent anymore on
some platforms for our i915+vgem tests, when I tried to switch vgem
over to shmem helpers.

After lots of head-scratching I realized that I've removed calls to
drm_clflush. And we need those. To make this a bit cleaner, use the
same page allocation tooling as ttm, which internally does the clflush
(and more, as needed on any platform, instead of just the intel x86
cpus i915 can be combined with).

Unfortunately this doesn't exist on arm, or as a generic feature. For
that I think only the dma-api can get at wc memory reliably, so maybe
we'd need some kind of GFP_WC flag to do this properly.

Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: "Thomas Hellström" <thomas.hellstrom@linux.intel.com>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Maxime Ripard <mripard@kernel.org>
Cc: Thomas Zimmermann <tzimmermann@suse.de>
Cc: David Airlie <airlied@linux.ie>
Cc: Daniel Vetter <daniel@ffwll.ch>
---
 drivers/gpu/drm/drm_gem_shmem_helper.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

Christian König July 14, 2021, 11:54 a.m. UTC | #1

Am 13.07.21 um 22:51 schrieb Daniel Vetter:
> intel-gfx-ci realized that something is not quite coherent anymore on
> some platforms for our i915+vgem tests, when I tried to switch vgem
> over to shmem helpers.
>
> After lots of head-scratching I realized that I've removed calls to
> drm_clflush. And we need those. To make this a bit cleaner use the
> same page allocation tooling as ttm, which does internally clflush
> (and more, as needed on any platform instead of just the intel x86
> cpus i915 can be combined with).
>
> Unfortunately this doesn't exist on arm, or as a generic feature. For
> that I think only the dma-api can get at wc memory reliably, so maybe
> we'd need some kind of GFP_WC flag to do this properly.

The problem is that this stuff is extremely architecture specific. So 
GFP_WC and GFP_UNCACHED are really what we should aim for in the long term.

And as far as I know there are at least the following ways it can be
implemented:

* A fixed number of registers which tell the CPU the caching behavior
for a memory region, e.g. MTRR.
* Some bits of the memory pointers used, e.g. you see the same memory at
different locations with different caching attributes.
* Some bits in the CPU's page table.
* Some bits in a separate page table.

On top of that there is the PCIe specification which defines non-cache 
snooping access as an extension.

Mixing that with the CPU caching behavior gets you some really nice ways 
to break a driver. In general x86 seems to be rather graceful, but arm 
and PowerPC are easily pissed if you mess that up.

> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> Cc: Christian König <christian.koenig@amd.com>
> Cc: "Thomas Hellström" <thomas.hellstrom@linux.intel.com>
> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> Cc: Maxime Ripard <mripard@kernel.org>
> Cc: Thomas Zimmermann <tzimmermann@suse.de>
> Cc: David Airlie <airlied@linux.ie>
> Cc: Daniel Vetter <daniel@ffwll.ch>

Acked-by: Christian König <christian.koenig@amd.com>

Regards,
Christian.

> ---
>   drivers/gpu/drm/drm_gem_shmem_helper.c | 14 ++++++++++++++
>   1 file changed, 14 insertions(+)
>
> diff --git a/drivers/gpu/drm/drm_gem_shmem_helper.c b/drivers/gpu/drm/drm_gem_shmem_helper.c
> index 296ab1b7c07f..657d2490aaa5 100644
> --- a/drivers/gpu/drm/drm_gem_shmem_helper.c
> +++ b/drivers/gpu/drm/drm_gem_shmem_helper.c
> @@ -10,6 +10,10 @@
>   #include <linux/slab.h>
>   #include <linux/vmalloc.h>
>   
> +#ifdef CONFIG_X86
> +#include <asm/set_memory.h>
> +#endif
> +
>   #include <drm/drm.h>
>   #include <drm/drm_device.h>
>   #include <drm/drm_drv.h>
> @@ -162,6 +166,11 @@ static int drm_gem_shmem_get_pages_locked(struct drm_gem_shmem_object *shmem)
>   		return PTR_ERR(pages);
>   	}
>   
> +#ifdef CONFIG_X86
> +	if (shmem->map_wc)
> +		set_pages_array_wc(pages, obj->size >> PAGE_SHIFT);
> +#endif
> +
>   	shmem->pages = pages;
>   
>   	return 0;
> @@ -203,6 +212,11 @@ static void drm_gem_shmem_put_pages_locked(struct drm_gem_shmem_object *shmem)
>   	if (--shmem->pages_use_count > 0)
>   		return;
>   
> +#ifdef CONFIG_X86
> +	if (shmem->map_wc)
> +		set_pages_array_wb(shmem->pages, obj->size >> PAGE_SHIFT);
> +#endif
> +
>   	drm_gem_put_pages(obj, shmem->pages,
>   			  shmem->pages_mark_dirty_on_put,
>   			  shmem->pages_mark_accessed_on_put);
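
The get/put pairing in the diff above can be modeled in user space. This is only a sketch under stated assumptions: `fake_page`, `fake_shmem`, `set_mode`, `fake_get_pages` and `fake_put_pages` are hypothetical stand-ins, not kernel definitions, and `set_mode` merely records the mode where the real `set_pages_array_wc()`/`set_pages_array_wb()` change page attributes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for kernel types; not the real definitions. */
enum cache_mode { MODE_WB, MODE_WC };

struct fake_page {
	enum cache_mode mode;
};

struct fake_shmem {
	struct fake_page *pages;
	size_t npages;
	unsigned int pages_use_count;
	int map_wc;
};

/* Models set_pages_array_wc()/set_pages_array_wb(): flip every page. */
static void set_mode(struct fake_page *pages, size_t n, enum cache_mode m)
{
	for (size_t i = 0; i < n; i++)
		pages[i].mode = m;
}

/* Only the first user switches the backing pages to WC. */
static void fake_get_pages(struct fake_shmem *s)
{
	if (s->pages_use_count++ > 0)
		return;
	if (s->map_wc)
		set_mode(s->pages, s->npages, MODE_WC);
}

/* Only the last user restores WB, before the pages would be released. */
static void fake_put_pages(struct fake_shmem *s)
{
	if (--s->pages_use_count > 0)
		return;
	if (s->map_wc)
		set_mode(s->pages, s->npages, MODE_WB);
}
```

The point of the model is the symmetry: pages are handed back in the same write-back state they were allocated in, and intermediate get/put calls do not touch the attributes.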

Daniel Vetter July 14, 2021, 12:48 p.m. UTC | #2

On Wed, Jul 14, 2021 at 01:54:50PM +0200, Christian König wrote:
> Am 13.07.21 um 22:51 schrieb Daniel Vetter:
> > intel-gfx-ci realized that something is not quite coherent anymore on
> > some platforms for our i915+vgem tests, when I tried to switch vgem
> > over to shmem helpers.
> > 
> > After lots of head-scratching I realized that I've removed calls to
> > drm_clflush. And we need those. To make this a bit cleaner use the
> > same page allocation tooling as ttm, which does internally clflush
> > (and more, as needed on any platform instead of just the intel x86
> > cpus i915 can be combined with).
> > 
> > Unfortunately this doesn't exist on arm, or as a generic feature. For
> > that I think only the dma-api can get at wc memory reliably, so maybe
> > we'd need some kind of GFP_WC flag to do this properly.
> 
> The problem is that this stuff is extremely architecture specific. So GFP_WC
> and GFP_UNCACHED are really what we should aim for in the long term.
> 
> And as far as I know we have at least the following possibilities how it is
> implemented:
> 
> * A fixed amount of registers which tells the CPU the caching behavior for a
> memory region, e.g. MTRR.
> * Some bits of the memory pointers used, e.g. you see the same memory at
> different locations with different caching attributes.
> * Some bits in the CPUs page table.
> * Some bits in a separate page table.
> 
> On top of that there is the PCIe specification which defines non-cache
> snooping access as an extension.

Yeah dma-buf is extremely ill-defined even on x86 if you combine these
all. We just play a game of whack-a-mole with the cacheline dirt until
it's gone.

That's the other piece here: how do you even make sure that the page is
properly flushed and ready for wc access?
- easy case is x86 with clflush available pretty much everywhere (for
  10+ years at least)
- next are cpus which have some cache flush instructions, but it's highly
  cpu model specific
- next up is the same, but you absolutely have to make sure there's no
  other mapping around anymore or the coherency fabric just dies
- and I'm pretty sure there's worse stuff where you de facto can only
  allocate wc memory that's set aside at boot-up and that's all you ever
  get.

Cheers, Daniel
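
For the "easy case" in the list above, a user-space sketch of flushing a range line by line might look as follows. It is an illustration only, not the kernel's drm_clflush code: `flush_range` and the fixed `CACHELINE` of 64 bytes are assumptions made for the example, and on builds without SSE2 the flush degrades to a no-op.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>	/* _mm_clflush(), _mm_mfence() */
#endif

/* Assumed line size; real code would query the CPU instead. */
#define CACHELINE 64

/*
 * Flush every cache line covering [addr, addr + len) and return the
 * number of lines visited. Without SSE2 the lines are only counted.
 */
static size_t flush_range(const void *addr, size_t len)
{
	uintptr_t p = (uintptr_t)addr & ~(uintptr_t)(CACHELINE - 1);
	uintptr_t end = (uintptr_t)addr + len;
	size_t lines = 0;

	if (len == 0)
		return 0;

#if defined(__SSE2__)
	_mm_mfence();	/* order earlier stores before the flushes */
#endif
	for (; p < end; p += CACHELINE) {
#if defined(__SSE2__)
		_mm_clflush((const void *)p);
#endif
		lines++;
	}
#if defined(__SSE2__)
	_mm_mfence();	/* make sure the flushes have completed */
#endif
	return lines;
}
```

Note that a range which is not line-aligned touches one more line than its length suggests, which is why the start address is rounded down before the loop.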


Christian König July 14, 2021, 12:58 p.m. UTC | #3

Am 14.07.21 um 14:48 schrieb Daniel Vetter:
> On Wed, Jul 14, 2021 at 01:54:50PM +0200, Christian König wrote:
>> Am 13.07.21 um 22:51 schrieb Daniel Vetter:
>>> intel-gfx-ci realized that something is not quite coherent anymore on
>>> some platforms for our i915+vgem tests, when I tried to switch vgem
>>> over to shmem helpers.
>>>
>>> After lots of head-scratching I realized that I've removed calls to
>>> drm_clflush. And we need those. To make this a bit cleaner use the
>>> same page allocation tooling as ttm, which does internally clflush
>>> (and more, as needed on any platform instead of just the intel x86
>>> cpus i915 can be combined with).
>>>
>>> Unfortunately this doesn't exist on arm, or as a generic feature. For
>>> that I think only the dma-api can get at wc memory reliably, so maybe
>>> we'd need some kind of GFP_WC flag to do this properly.
>> The problem is that this stuff is extremely architecture specific. So GFP_WC
>> and GFP_UNCACHED are really what we should aim for in the long term.
>>
>> And as far as I know we have at least the following possibilities how it is
>> implemented:
>>
>> * A fixed amount of registers which tells the CPU the caching behavior for a
>> memory region, e.g. MTRR.
>> * Some bits of the memory pointers used, e.g. you see the same memory at
>> different locations with different caching attributes.
>> * Some bits in the CPUs page table.
>> * Some bits in a separate page table.
>>
>> On top of that there is the PCIe specification which defines non-cache
>> snooping access as an extension.
> Yeah dma-buf is extremely ill-defined even on x86 if you combine these
> all. We just play a game of whack-a-mole with the cacheline dirt until
> it's gone.
>
> That's the other piece here, how do you even make sure that the page is
> properly flushed and ready for wc access:
> - easy case is x86 with clflush available pretty much everywhere (since
>    10+ years at least)
> - next are cpus which have some cache flush instructions, but it's highly
>    cpu model specific
> - next up is the same, but you absolutely have to make sure there's no
>    other mapping around anymore or the coherency fabric just dies
> - and I'm pretty sure there's worse stuff where you de facto can only
>    allocate wc memory that's set aside at boot-up and that's all you ever
>    get.

Well, long story short: you don't make sure that the page is flushed at all.

What you do is allocate the page as WC in the first place; if you
fail to do this, you can't use it.

The whole idea TTM tried to sell until a while ago, that you can actually
change that on the fly, only works on x86, and even there only in a very
limited way.

Cheers,
Christian.


Daniel Vetter July 14, 2021, 4:16 p.m. UTC | #4

On Wed, Jul 14, 2021 at 02:58:26PM +0200, Christian König wrote:
> Am 14.07.21 um 14:48 schrieb Daniel Vetter:
> > On Wed, Jul 14, 2021 at 01:54:50PM +0200, Christian König wrote:
> > > Am 13.07.21 um 22:51 schrieb Daniel Vetter:
> > > > intel-gfx-ci realized that something is not quite coherent anymore on
> > > > some platforms for our i915+vgem tests, when I tried to switch vgem
> > > > over to shmem helpers.
> > > > 
> > > > After lots of head-scratching I realized that I've removed calls to
> > > > drm_clflush. And we need those. To make this a bit cleaner use the
> > > > same page allocation tooling as ttm, which does internally clflush
> > > > (and more, as needed on any platform instead of just the intel x86
> > > > cpus i915 can be combined with).
> > > > 
> > > > Unfortunately this doesn't exist on arm, or as a generic feature. For
> > > > that I think only the dma-api can get at wc memory reliably, so maybe
> > > > we'd need some kind of GFP_WC flag to do this properly.
> > > The problem is that this stuff is extremely architecture specific. So GFP_WC
> > > and GFP_UNCACHED are really what we should aim for in the long term.
> > > 
> > > And as far as I know we have at least the following possibilities how it is
> > > implemented:
> > > 
> > > * A fixed amount of registers which tells the CPU the caching behavior for a
> > > memory region, e.g. MTRR.
> > > * Some bits of the memory pointers used, e.g. you see the same memory at
> > > different locations with different caching attributes.
> > > * Some bits in the CPUs page table.
> > > * Some bits in a separate page table.
> > > 
> > > On top of that there is the PCIe specification which defines non-cache
> > > snooping access as an extension.
> > Yeah dma-buf is extremely ill-defined even on x86 if you combine these
> > all. We just play a game of whack-a-mole with the cacheline dirt until
> > it's gone.
> > 
> > That's the other piece here, how do you even make sure that the page is
> > properly flushed and ready for wc access:
> > - easy case is x86 with clflush available pretty much everywhere (since
> >    10+ years at least)
> > - next are cpus which have some cache flush instructions, but it's highly
> >    cpu model specific
> > - next up is the same, but you absolutely have to make sure there's no
> >    other mapping around anymore or the coherency fabric just dies
> > - and I'm pretty sure there's worse stuff where you de facto can only
> >    allocate wc memory that's set aside at boot-up and that's all you ever
> >    get.
> 
> Well long story short you don't make sure that the page is flushed at all.
> 
> What you do is to allocate the page as WC in the first place, if you fail to
> do this you can't use it.

Oh sure, but even when you allocate as wc you need to make sure the page
you have is actually wc coherent from the start. I'm chasing some fun
trying to convert vgem over to shmem helpers right now (i.e. this patch
series), and if you don't start out with flushed pages some of the vgem +
i915 igts just fail on the less coherent igpu platforms we have.

And if you look into what set_pages_wc actually does, then you spot the
clflush somewhere deep down (aside from all the other things it does).

On some ARM platforms that's just not possible, and you have to do a
carveout that you never even map as wb (so needs to be excluded from the
kernel map too and treated as highmem). There's some really bonkers stuff
here.

> The whole idea TTM try to sell until a while ago that you can actually
> change that on the fly only works on x86 and even there only very very
> limited.

Yeah that's clear, this is why we're locking down the i915 gem uapi a lot
for dgpu. All the tricks are out the window.
-Daniel



Thomas Zimmermann July 22, 2021, 6:40 p.m. UTC | #5

Hi

Am 13.07.21 um 22:51 schrieb Daniel Vetter:
> intel-gfx-ci realized that something is not quite coherent anymore on
> some platforms for our i915+vgem tests, when I tried to switch vgem
> over to shmem helpers.
> 
> After lots of head-scratching I realized that I've removed calls to
> drm_clflush. And we need those. To make this a bit cleaner use the
> same page allocation tooling as ttm, which does internally clflush
> (and more, as needed on any platform instead of just the intel x86
> cpus i915 can be combined with).

Vgem would therefore not work correctly on non-X86 platforms?

> 
> Unfortunately this doesn't exist on arm, or as a generic feature. For
> that I think only the dma-api can get at wc memory reliably, so maybe
> we'd need some kind of GFP_WC flag to do this properly.
> 
> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> Cc: Christian König <christian.koenig@amd.com>
> Cc: "Thomas Hellström" <thomas.hellstrom@linux.intel.com>
> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> Cc: Maxime Ripard <mripard@kernel.org>
> Cc: Thomas Zimmermann <tzimmermann@suse.de>
> Cc: David Airlie <airlied@linux.ie>
> Cc: Daniel Vetter <daniel@ffwll.ch>
> ---
>   drivers/gpu/drm/drm_gem_shmem_helper.c | 14 ++++++++++++++
>   1 file changed, 14 insertions(+)
> 
> diff --git a/drivers/gpu/drm/drm_gem_shmem_helper.c b/drivers/gpu/drm/drm_gem_shmem_helper.c
> index 296ab1b7c07f..657d2490aaa5 100644
> --- a/drivers/gpu/drm/drm_gem_shmem_helper.c
> +++ b/drivers/gpu/drm/drm_gem_shmem_helper.c
> @@ -10,6 +10,10 @@
>   #include <linux/slab.h>
>   #include <linux/vmalloc.h>
>   
> +#ifdef CONFIG_X86
> +#include <asm/set_memory.h>
> +#endif
> +
>   #include <drm/drm.h>
>   #include <drm/drm_device.h>
>   #include <drm/drm_drv.h>
> @@ -162,6 +166,11 @@ static int drm_gem_shmem_get_pages_locked(struct drm_gem_shmem_object *shmem)
>   		return PTR_ERR(pages);
>   	}
>   
> +#ifdef CONFIG_X86
> +	if (shmem->map_wc)
> +		set_pages_array_wc(pages, obj->size >> PAGE_SHIFT);
> +#endif

I cannot comment much on the technical details of the caching of various 
architectures. If this patch goes in, there should be a longer comment 
that reflects the discussion in this thread. It's apparently a workaround.

I think the call itself should be hidden behind a DRM API, which depends 
on CONFIG_X86. Something simple like

#ifdef CONFIG_X86
drm_set_pages_array_wc()
{
	set_pages_array_wc();
}
#else
drm_set_pages_array_wc()
{
}
#endif

Maybe in drm_cache.h?

Best regards
Thomas


Daniel Vetter July 23, 2021, 7:36 a.m. UTC | #6

On Thu, Jul 22, 2021 at 08:40:56PM +0200, Thomas Zimmermann wrote:
> Hi
> 
> Am 13.07.21 um 22:51 schrieb Daniel Vetter:
> > intel-gfx-ci realized that something is not quite coherent anymore on
> > some platforms for our i915+vgem tests, when I tried to switch vgem
> > over to shmem helpers.
> > 
> > After lots of head-scratching I realized that I've removed calls to
> > drm_clflush. And we need those. To make this a bit cleaner use the
> > same page allocation tooling as ttm, which does internally clflush
> > (and more, as needed on any platform instead of just the intel x86
> > cpus i915 can be combined with).
> 
> Vgem would therefore not work correctly on non-X86 platforms?

Anything using shmem helpers doesn't work correctly on non-x86 platforms.
At least if they use wc.

vgem with intel-gfx-ci is simply running some very nasty tests that catch
the bugs.

I'm kinda hoping that someone from the armsoc world would care enough to
fix this there. But it's a tricky issue.

> > 
> > Unfortunately this doesn't exist on arm, or as a generic feature. For
> > that I think only the dma-api can get at wc memory reliably, so maybe
> > we'd need some kind of GFP_WC flag to do this properly.
> > 
> > Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> > Cc: Christian König <christian.koenig@amd.com>
> > Cc: "Thomas Hellström" <thomas.hellstrom@linux.intel.com>
> > Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> > Cc: Maxime Ripard <mripard@kernel.org>
> > Cc: Thomas Zimmermann <tzimmermann@suse.de>
> > Cc: David Airlie <airlied@linux.ie>
> > Cc: Daniel Vetter <daniel@ffwll.ch>
> > ---
> >   drivers/gpu/drm/drm_gem_shmem_helper.c | 14 ++++++++++++++
> >   1 file changed, 14 insertions(+)
> > 
> > diff --git a/drivers/gpu/drm/drm_gem_shmem_helper.c b/drivers/gpu/drm/drm_gem_shmem_helper.c
> > index 296ab1b7c07f..657d2490aaa5 100644
> > --- a/drivers/gpu/drm/drm_gem_shmem_helper.c
> > +++ b/drivers/gpu/drm/drm_gem_shmem_helper.c
> > @@ -10,6 +10,10 @@
> >   #include <linux/slab.h>
> >   #include <linux/vmalloc.h>
> > +#ifdef CONFIG_X86
> > +#include <asm/set_memory.h>
> > +#endif
> > +
> >   #include <drm/drm.h>
> >   #include <drm/drm_device.h>
> >   #include <drm/drm_drv.h>
> > @@ -162,6 +166,11 @@ static int drm_gem_shmem_get_pages_locked(struct drm_gem_shmem_object *shmem)
> >   		return PTR_ERR(pages);
> >   	}
> > +#ifdef CONFIG_X86
> > +	if (shmem->map_wc)
> > +		set_pages_array_wc(pages, obj->size >> PAGE_SHIFT);
> > +#endif
> 
> I cannot comment much on the technical details of the caching of various
> architectures. If this patch goes in, there should be a longer comment that
> reflects the discussion in this thread. It's apparently a workaround.
> 
> I think the call itself should be hidden behind a DRM API, which depends on
> CONFIG_X86. Something simple like
> 
> ifdef CONFIG_X86
> drm_set_pages_array_wc()
> {
> 	set_pages_array_wc();
> }
> else
> drm_set_pages_array_wc()
>  {
>  }
> #endif
> 
> Maybe in drm_cache.h?

We do have a bunch of this in drm_cache.h already, and architecture
maintainers hate us for it.

The real fix is to get at the architecture-specific wc allocator, which is
currently not something that's exposed, but hidden within the dma api. I
think having this stick out like this is better than hiding it behind fake
generic code (like we do with drm_clflush, which de facto also only really
works on x86).

Also note that ttm has the exact same ifdef in its page allocator, but it
does fall back to using dma_alloc_coherent on other platforms.
-Daniel
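
To make the "fake generic code" concern concrete, here is a compilable user-space model of the wrapper proposed earlier in the thread. Everything in it is an assumption for illustration: `MODEL_X86` stands in for CONFIG_X86, `set_pages_array_wc()` is a stub that only counts calls, and `drm_set_pages_array_wc()` is the hypothetical wrapper, which does not exist in the tree.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for CONFIG_X86; remove to model other arches. */
#define MODEL_X86 1

struct page;	/* opaque here, only pointers to it are passed around */

/* Counts calls so the effect of the wrapper is observable. */
static int wc_calls;

#ifdef MODEL_X86
/* User-space stub standing in for the x86 helper of the same name. */
static int set_pages_array_wc(struct page **pages, int numpages)
{
	(void)pages;
	(void)numpages;
	wc_calls++;
	return 0;
}

static inline int drm_set_pages_array_wc(struct page **pages, int numpages)
{
	return set_pages_array_wc(pages, numpages);
}
#else
static inline int drm_set_pages_array_wc(struct page **pages, int numpages)
{
	(void)pages;
	(void)numpages;
	return 0;	/* reports success although nothing was changed */
}
#endif
```

With `MODEL_X86` removed, callers still compile and still see a return value of 0, but the pages are never switched to wc. That silent success is what an open-coded #ifdef in the caller keeps visible.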


Christian König July 23, 2021, 8:02 a.m. UTC | #7

Am 23.07.21 um 09:36 schrieb Daniel Vetter:
> On Thu, Jul 22, 2021 at 08:40:56PM +0200, Thomas Zimmermann wrote:
>> Hi
>>
>> Am 13.07.21 um 22:51 schrieb Daniel Vetter:
>> [SNIP]
>>> +#ifdef CONFIG_X86
>>> +	if (shmem->map_wc)
>>> +		set_pages_array_wc(pages, obj->size >> PAGE_SHIFT);
>>> +#endif
>> I cannot comment much on the technical details of the caching of various
>> architectures. If this patch goes in, there should be a longer comment that
>> reflects the discussion in this thread. It's apparently a workaround.
>>
>> I think the call itself should be hidden behind a DRM API, which depends on
>> CONFIG_X86. Something simple like
>>
>> ifdef CONFIG_X86
>> drm_set_pages_array_wc()
>> {
>> 	set_pages_array_wc();
>> }
>> else
>> drm_set_pages_array_wc()
>>   {
>>   }
>> #endif
>>
>> Maybe in drm_cache.h?
> We do have a bunch of this in drm_cache.h already, and architecture
> maintainers hate us for it.

Yeah, for good reasons :)

> The real fix is to get at the architecture-specific wc allocator, which is
> currently not something that's exposed, but hidden within the dma api. I
> think having this stick out like this is better than hiding it behind fake
> generic code (like we do with drm_clflush, which de facto also only really
> works on x86).

The DMA API also doesn't really touch that stuff as far as I know.

What we rather do on other architectures is to set the appropriate 
caching flags on the CPU mappings, see function ttm_prot_from_caching().

> Also note that ttm has the exact same ifdef in its page allocator, but it
> does fall back to using dma_alloc_coherent on other platforms.

This works surprisingly well on non-x86 architectures as well. We just 
don't necessarily update the kernel mappings everywhere, which limits the 
kmap usage.

In other words radeon and nouveau still work on PowerPC AGP systems as 
far as I know for example.

Christian.


Daniel Vetter July 23, 2021, 8:34 a.m. UTC | #8

On Fri, Jul 23, 2021 at 10:02:39AM +0200, Christian König wrote:
> Am 23.07.21 um 09:36 schrieb Daniel Vetter:
> > On Thu, Jul 22, 2021 at 08:40:56PM +0200, Thomas Zimmermann wrote:
> > > Hi
> > > 
> > > Am 13.07.21 um 22:51 schrieb Daniel Vetter:
> > > [SNIP]
> > > > +#ifdef CONFIG_X86
> > > > +	if (shmem->map_wc)
> > > > +		set_pages_array_wc(pages, obj->size >> PAGE_SHIFT);
> > > > +#endif
> > > I cannot comment much on the technical details of the caching of various
> > > architectures. If this patch goes in, there should be a longer comment that
> > > reflects the discussion in this thread. It's apparently a workaround.
> > > 
> > > I think the call itself should be hidden behind a DRM API, which depends on
> > > CONFIG_X86. Something simple like
> > > 
> > > #ifdef CONFIG_X86
> > > drm_set_pages_array_wc()
> > > {
> > > 	set_pages_array_wc();
> > > }
> > > #else
> > > drm_set_pages_array_wc()
> > > {
> > > }
> > > #endif
> > > 
> > > Maybe in drm_cache.h?
> > We do have a bunch of this in drm_cache.h already, and architecture
> > maintainers hate us for it.
> 
> Yeah, for good reasons :)
> 
> > The real fix is to get at the architecture-specific wc allocator, which is
> > currently not something that's exposed, but hidden within the dma api. I
> > think having this stick out like this is better than hiding it behind fake
> > generic code (like we do with drm_clflush, which defacto also only really
> > works on x86).
> 
> The DMA API also doesn't really touch that stuff as far as I know.
> 
> What we rather do on other architectures is to set the appropriate caching
> flags on the CPU mappings, see function ttm_prot_from_caching().

This alone doesn't do cache flushes. And at least on some arm cpus having
inconsistent mappings can lead to interconnect hangs, so you have to at
least punch out the kernel linear map. Which on some arms isn't possible
(because the kernel map is a special linear map and not done with
pagetables). Which means you need to carve this out at boot and treat them
as GFP_HIGHMEM.

As far as I know, the dma-api has that allocator somewhere, and it does the
right thing for dma_alloc_coherent.

Also shmem helpers already set the caching pgprot.
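
To make the pairing in the patch concrete, here is a userspace model: pages are switched to write-combined on the first get and back to write-back on the last put, and only when built for x86. All mock_* names and MOCK_CONFIG_X86 are hypothetical stand-ins; in the kernel the calls are set_pages_array_wc() and set_pages_array_wb(). The get-side refcount check is an assumption that mirrors the put side shown in the diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for kernel types; not the real DRM structures. */
enum mock_page_mode { MOCK_PAGE_WB, MOCK_PAGE_WC };

struct mock_page {
	enum mock_page_mode mode;
};

struct mock_shmem_object {
	struct mock_page *pages;
	size_t num_pages;
	bool map_wc;
	unsigned int pages_use_count;
};

/* Pretend we build for x86; remove this define to model other arches. */
#define MOCK_CONFIG_X86 1

static void mock_set_pages_array(struct mock_page *pages, size_t n,
				 enum mock_page_mode mode)
{
	for (size_t i = 0; i < n; i++)
		pages[i].mode = mode;
}

/* First user switches the pages to wc; later users only take a reference. */
static void mock_get_pages(struct mock_shmem_object *shmem)
{
	if (shmem->pages_use_count++ > 0)
		return;
#ifdef MOCK_CONFIG_X86
	if (shmem->map_wc)
		mock_set_pages_array(shmem->pages, shmem->num_pages,
				     MOCK_PAGE_WC);
#endif
}

/* Last user restores wb before the pages would go back to the system. */
static void mock_put_pages(struct mock_shmem_object *shmem)
{
	assert(shmem->pages_use_count > 0);
	if (--shmem->pages_use_count > 0)
		return;
#ifdef MOCK_CONFIG_X86
	if (shmem->map_wc)
		mock_set_pages_array(shmem->pages, shmem->num_pages,
				     MOCK_PAGE_WB);
#endif
}
```

The point of the symmetry is that pages never leave the helper while still marked wc, independent of what pgprot the userspace mapping uses.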

> > Also note that ttm has the exact same ifdef in its page allocator, but it
> > does fall back to using dma_alloc_coherent on other platforms.
> 
> This works surprisingly well on non-x86 architectures as well. We just don't
> necessarily update the kernel mappings everywhere, which limits the kmap usage.
> 
> In other words radeon and nouveau still work on PowerPC AGP systems as far
> as I know for example.

The thing is, on most cpus you get away with just pgprot set to wc, and on
many others it's only an issue while there's still some cpu dirt hanging
around because they don't prefetch badly enough. There are very few where
it's a persistent problem.

Really the only reason I've even caught this was because some of the
i915+vgem buffer sharing tests we have are very nasty and intentionally
try to provoke the worst case :-)

Anyway, since you're looking, can you pls review this and the previous
patch for shmem helpers?

The first one, to make VM_PFNMAP standard for all dma-buf, isn't ready yet,
because I still need to audit all the drivers. And at least i915 dma-buf
mmap is still using gup-able memory too. So more work to do here.
-Daniel

> 
> Christian.
> 
> > -Daniel
> > 
> > > Best regard
> > > Thomas
> > > 
> > > > +
> > > >    	shmem->pages = pages;
> > > >    	return 0;
> > > > @@ -203,6 +212,11 @@ static void drm_gem_shmem_put_pages_locked(struct drm_gem_shmem_object *shmem)
> > > >    	if (--shmem->pages_use_count > 0)
> > > >    		return;
> > > > +#ifdef CONFIG_X86
> > > > +	if (shmem->map_wc)
> > > > +		set_pages_array_wb(shmem->pages, obj->size >> PAGE_SHIFT);
> > > > +#endif
> > > > +
> > > >    	drm_gem_put_pages(obj, shmem->pages,
> > > >    			  shmem->pages_mark_dirty_on_put,
> > > >    			  shmem->pages_mark_accessed_on_put);
> > > > 
> > > -- 
> > > Thomas Zimmermann
> > > Graphics Driver Developer
> > > SUSE Software Solutions Germany GmbH
> > > Maxfeldstr. 5, 90409 Nürnberg, Germany
> > > (HRB 36809, AG Nürnberg)
> > > Geschäftsführer: Felix Imendörffer
> > > 
> > 
> > 
> > 
>

Thomas Zimmermann Aug. 5, 2021, 6:40 p.m. UTC | #9

Hi

Am 23.07.21 um 09:36 schrieb Daniel Vetter:
> 
> The real fix is to get at the architecture-specific wc allocator, which is
> currently not something that's exposed, but hidden within the dma api. I
> think having this stick out like this is better than hiding it behind fake
> generic code (like we do with drm_clflush, which defacto also only really
> works on x86).
> 
> Also note that ttm has the exact same ifdef in its page allocator, but it
> does fall back to using dma_alloc_coherent on other platforms.

If this fixes a real problem and there's no full solution yet, let's
take what we have. So if you can extract the essence of this comment
into a TODO comment that tells how to fix the issue, feel free to add my

Acked-by: Thomas Zimmermann <tzimmermann@suse.de>

Best regards
Thomas

> -Daniel
> 
>> Best regard
>> Thomas
>>
>>> +
>>>    	shmem->pages = pages;
>>>    	return 0;
>>> @@ -203,6 +212,11 @@ static void drm_gem_shmem_put_pages_locked(struct drm_gem_shmem_object *shmem)
>>>    	if (--shmem->pages_use_count > 0)
>>>    		return;
>>> +#ifdef CONFIG_X86
>>> +	if (shmem->map_wc)
>>> +		set_pages_array_wb(shmem->pages, obj->size >> PAGE_SHIFT);
>>> +#endif
>>> +
>>>    	drm_gem_put_pages(obj, shmem->pages,
>>>    			  shmem->pages_mark_dirty_on_put,
>>>    			  shmem->pages_mark_accessed_on_put);
>>>
>>
>> -- 
>> Thomas Zimmermann
>> Graphics Driver Developer
>> SUSE Software Solutions Germany GmbH
>> Maxfeldstr. 5, 90409 Nürnberg, Germany
>> (HRB 36809, AG Nürnberg)
>> Geschäftsführer: Felix Imendörffer
>>
> 
> 
> 
>
