Message ID | 20190718145407.21352-3-chris@chris-wilson.co.uk (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [1/4] drm/i915: Drop wmb() inside pread_gtt | expand |
Quoting Chris Wilson (2019-07-18 17:54:06) > Since userspace has the ability to bypass the CPU cache from within its > unprivileged command stream, we have to flush the CPU cache to memory > in order to overwrite the previous contents on creation. > > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> > Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> > Cc: stablevger.kernel.org Makes sense. If there is some platform where this absolutely can't happen and they need optimization, we can special case later. Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Regards, Joonas
On 18/07/2019 17:54, Chris Wilson wrote: > Since userspace has the ability to bypass the CPU cache from within its > unprivileged command stream, we have to flush the CPU cache to memory > in order to overwrite the previous contents on creation. > > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> > Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> > Cc: stablevger.kernel.org > --- > drivers/gpu/drm/i915/gem/i915_gem_shmem.c | 26 ++++++----------------- > 1 file changed, 7 insertions(+), 19 deletions(-) > > diff --git a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c > index d2a1158868e7..f752b326d399 100644 > --- a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c > +++ b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c > @@ -459,7 +459,6 @@ i915_gem_object_create_shmem(struct drm_i915_private *i915, u64 size) > { > struct drm_i915_gem_object *obj; > struct address_space *mapping; > - unsigned int cache_level; > gfp_t mask; > int ret; > > @@ -498,24 +497,13 @@ i915_gem_object_create_shmem(struct drm_i915_private *i915, u64 size) > obj->write_domain = I915_GEM_DOMAIN_CPU; > obj->read_domains = I915_GEM_DOMAIN_CPU; > > - if (HAS_LLC(i915)) > - /* On some devices, we can have the GPU use the LLC (the CPU > - * cache) for about a 10% performance improvement > - * compared to uncached. Graphics requests other than > - * display scanout are coherent with the CPU in > - * accessing this cache. This means in this mode we > - * don't need to clflush on the CPU side, and on the > - * GPU side we only need to flush internal caches to > - * get data visible to the CPU. > - * > - * However, we maintain the display planes as UC, and so > - * need to rebind when first used as such. > - */ > - cache_level = I915_CACHE_LLC; > - else > - cache_level = I915_CACHE_NONE; > - > - i915_gem_object_set_cache_coherency(obj, cache_level); > + /* > + * Note that userspace has control over cache-bypass > + * via its command stream, so even on LLC architectures > + * we have to flush out the CPU cache to memory to > + * clear previous contents. > + */ > + i915_gem_object_set_cache_coherency(obj, I915_CACHE_NONE); > > trace_i915_gem_object_create(obj); > Does i915_drm.h needs updating? : /** * I915_CACHING_CACHED * * GPU access is coherent with cpu caches and furthermore the data is cached in * last-level caches shared between cpu cores and the gpu GT. Default on * machines with HAS_LLC. */ #define I915_CACHING_CACHED 1
Quoting Lionel Landwerlin (2019-07-19 11:18:42) > On 18/07/2019 17:54, Chris Wilson wrote: > > Since userspace has the ability to bypass the CPU cache from within its > > unprivileged command stream, we have to flush the CPU cache to memory > > in order to overwrite the previous contents on creation. > > > > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> > > Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> > > Cc: stablevger.kernel.org > > --- > > drivers/gpu/drm/i915/gem/i915_gem_shmem.c | 26 ++++++----------------- > > 1 file changed, 7 insertions(+), 19 deletions(-) > > > > diff --git a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c > > index d2a1158868e7..f752b326d399 100644 > > --- a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c > > +++ b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c > > @@ -459,7 +459,6 @@ i915_gem_object_create_shmem(struct drm_i915_private *i915, u64 size) > > { > > struct drm_i915_gem_object *obj; > > struct address_space *mapping; > > - unsigned int cache_level; > > gfp_t mask; > > int ret; > > > > @@ -498,24 +497,13 @@ i915_gem_object_create_shmem(struct drm_i915_private *i915, u64 size) > > obj->write_domain = I915_GEM_DOMAIN_CPU; > > obj->read_domains = I915_GEM_DOMAIN_CPU; > > > > - if (HAS_LLC(i915)) > > - /* On some devices, we can have the GPU use the LLC (the CPU > > - * cache) for about a 10% performance improvement > > - * compared to uncached. Graphics requests other than > > - * display scanout are coherent with the CPU in > > - * accessing this cache. This means in this mode we > > - * don't need to clflush on the CPU side, and on the > > - * GPU side we only need to flush internal caches to > > - * get data visible to the CPU. > > - * > > - * However, we maintain the display planes as UC, and so > > - * need to rebind when first used as such. > > - */ > > - cache_level = I915_CACHE_LLC; > > - else > > - cache_level = I915_CACHE_NONE; > > - > > - i915_gem_object_set_cache_coherency(obj, cache_level); > > + /* > > + * Note that userspace has control over cache-bypass > > + * via its command stream, so even on LLC architectures > > + * we have to flush out the CPU cache to memory to > > + * clear previous contents. > > + */ > > + i915_gem_object_set_cache_coherency(obj, I915_CACHE_NONE); > > > > trace_i915_gem_object_create(obj); > > > > Does i915_drm.h needs updating? : > > > /** > * I915_CACHING_CACHED > * > * GPU access is coherent with cpu caches and furthermore the data is > cached in > * last-level caches shared between cpu cores and the gpu GT. Default on > * machines with HAS_LLC. > */ > #define I915_CACHING_CACHED 1 Sneaky. Thanks, -Chris
Just to be clear, is this just adding a CLFLUSH or is it actually changing the default caching state of buffers from CACHED to NONE? If it's actually changing the default state, that's going to break userspace badly. --Jason On Fri, Jul 19, 2019 at 5:21 AM Chris Wilson <chris@chris-wilson.co.uk> wrote: > Quoting Lionel Landwerlin (2019-07-19 11:18:42) > > On 18/07/2019 17:54, Chris Wilson wrote: > > > Since userspace has the ability to bypass the CPU cache from within its > > > unprivileged command stream, we have to flush the CPU cache to memory > > > in order to overwrite the previous contents on creation. > > > > > > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> > > > Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> > > > Cc: stablevger.kernel.org > > > --- > > > drivers/gpu/drm/i915/gem/i915_gem_shmem.c | 26 > ++++++----------------- > > > 1 file changed, 7 insertions(+), 19 deletions(-) > > > > > > diff --git a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c > b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c > > > index d2a1158868e7..f752b326d399 100644 > > > --- a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c > > > +++ b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c > > > @@ -459,7 +459,6 @@ i915_gem_object_create_shmem(struct > drm_i915_private *i915, u64 size) > > > { > > > struct drm_i915_gem_object *obj; > > > struct address_space *mapping; > > > - unsigned int cache_level; > > > gfp_t mask; > > > int ret; > > > > > > @@ -498,24 +497,13 @@ i915_gem_object_create_shmem(struct > drm_i915_private *i915, u64 size) > > > obj->write_domain = I915_GEM_DOMAIN_CPU; > > > obj->read_domains = I915_GEM_DOMAIN_CPU; > > > > > > - if (HAS_LLC(i915)) > > > - /* On some devices, we can have the GPU use the LLC (the > CPU > > > - * cache) for about a 10% performance improvement > > > - * compared to uncached. Graphics requests other than > > > - * display scanout are coherent with the CPU in > > > - * accessing this cache. This means in this mode we > > > - * don't need to clflush on the CPU side, and on the > > > - * GPU side we only need to flush internal caches to > > > - * get data visible to the CPU. > > > - * > > > - * However, we maintain the display planes as UC, and so > > > - * need to rebind when first used as such. > > > - */ > > > - cache_level = I915_CACHE_LLC; > > > - else > > > - cache_level = I915_CACHE_NONE; > > > - > > > - i915_gem_object_set_cache_coherency(obj, cache_level); > > > + /* > > > + * Note that userspace has control over cache-bypass > > > + * via its command stream, so even on LLC architectures > > > + * we have to flush out the CPU cache to memory to > > > + * clear previous contents. > > > + */ > > > + i915_gem_object_set_cache_coherency(obj, I915_CACHE_NONE); > > > > > > trace_i915_gem_object_create(obj); > > > > > > > Does i915_drm.h needs updating? : > > > > > > /** > > * I915_CACHING_CACHED > > * > > * GPU access is coherent with cpu caches and furthermore the data is > > cached in > > * last-level caches shared between cpu cores and the gpu GT. Default on > > * machines with HAS_LLC. > > */ > > #define I915_CACHING_CACHED 1 > > Sneaky. Thanks, > -Chris > _______________________________________________ > Intel-gfx mailing list > Intel-gfx@lists.freedesktop.org > https://lists.freedesktop.org/mailman/listinfo/intel-gfx
Quoting Jason Ekstrand (2019-07-19 23:55:03) > Just to be clear, is this just adding a CLFLUSH or is it actually changing the > default caching state of buffers from CACHED to NONE? If it's actually > changing the default state, that's going to break userspace badly. What userspace is actually broken? You don't use set-domain, preferring to do your own cache management and MOCS; where you do default to PTE is where we expect the buffers to be uncached anyway. In effect, it is only swap in of the objects that the kernel will insert an extra clflush for you. I'm looking for the negative impact of this, and not finding much repercussion at all. At the end of the day, we have no choice but to close such a hole; the only question is how much leeway to give and how to mitigate such issues in future interfaces. -Chris
Chris Wilson <chris@chris-wilson.co.uk> writes: > Since userspace has the ability to bypass the CPU cache from within its > unprivileged command stream, we have to flush the CPU cache to memory > in order to overwrite the previous contents on creation. > > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> > Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> > Cc: stablevger.kernel.org > --- > drivers/gpu/drm/i915/gem/i915_gem_shmem.c | 26 ++++++----------------- > 1 file changed, 7 insertions(+), 19 deletions(-) > > diff --git a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c > index d2a1158868e7..f752b326d399 100644 > --- a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c > +++ b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c > @@ -459,7 +459,6 @@ i915_gem_object_create_shmem(struct drm_i915_private *i915, u64 size) > { > struct drm_i915_gem_object *obj; > struct address_space *mapping; > - unsigned int cache_level; > gfp_t mask; > int ret; > > @@ -498,24 +497,13 @@ i915_gem_object_create_shmem(struct drm_i915_private *i915, u64 size) > obj->write_domain = I915_GEM_DOMAIN_CPU; > obj->read_domains = I915_GEM_DOMAIN_CPU; > > - if (HAS_LLC(i915)) > - /* On some devices, we can have the GPU use the LLC (the CPU > - * cache) for about a 10% performance improvement > - * compared to uncached. Graphics requests other than > - * display scanout are coherent with the CPU in > - * accessing this cache. This means in this mode we > - * don't need to clflush on the CPU side, and on the > - * GPU side we only need to flush internal caches to > - * get data visible to the CPU. > - * > - * However, we maintain the display planes as UC, and so > - * need to rebind when first used as such. > - */ > - cache_level = I915_CACHE_LLC; > - else > - cache_level = I915_CACHE_NONE; > - > - i915_gem_object_set_cache_coherency(obj, cache_level); > + /* > + * Note that userspace has control over cache-bypass > + * via its command stream, so even on LLC architectures > + * we have to flush out the CPU cache to memory to > + * clear previous contents. > + */ > + i915_gem_object_set_cache_coherency(obj, I915_CACHE_NONE); > I'm not sure if you've seen my comments about this in an internal thread you were CC'ed to: I don't think this patch will have the intended effect. The various clflushes this could trigger before the first use of the buffer are conditional on "obj->cache_dirty & ~obj->cache_coherent", which will always be false on LLC platforms AFAICT, because on those platforms i915_gem_object_set_cache_coherency() will always set bit 0 of obj->cache_coherent. I attached a patch with the same purpose as this to that internal thread which doesn't suffer from this bug, but my patch was specific to Gen12+ where cache bypass is actually exposed to userspace. Why is this patch enabling the flushes for all platforms? AFAICT the only currently exposed MOCS entries that might allow userspace to bypass the LLC are 16 and 17 on Gen11, which enable the SCF MOCS table bit, which is marked "S/W should NOT set this field in client platforms" in the Gen9 docs, and according to the Gen10-11 docs is "Not supported". Does LLC bypass actually work on ICL? I doubt it but it might have been fixed in some other Gen11 project. Shouldn't this change be conditional on the platform supporting LLC bypass? Do you want me to resend my patch here with the Gen12+ checks changed to Gen11+? > trace_i915_gem_object_create(obj); > > -- > 2.22.0 > > _______________________________________________ > Intel-gfx mailing list > Intel-gfx@lists.freedesktop.org > https://lists.freedesktop.org/mailman/listinfo/intel-gfx
Quoting Francisco Jerez (2019-07-24 21:37:24) > Chris Wilson <chris@chris-wilson.co.uk> writes: > > > Since userspace has the ability to bypass the CPU cache from within its > > unprivileged command stream, we have to flush the CPU cache to memory > > in order to overwrite the previous contents on creation. > > > > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> > > Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> > > Cc: stablevger.kernel.org > > --- > > drivers/gpu/drm/i915/gem/i915_gem_shmem.c | 26 ++++++----------------- > > 1 file changed, 7 insertions(+), 19 deletions(-) > > > > diff --git a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c > > index d2a1158868e7..f752b326d399 100644 > > --- a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c > > +++ b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c > > @@ -459,7 +459,6 @@ i915_gem_object_create_shmem(struct drm_i915_private *i915, u64 size) > > { > > struct drm_i915_gem_object *obj; > > struct address_space *mapping; > > - unsigned int cache_level; > > gfp_t mask; > > int ret; > > > > @@ -498,24 +497,13 @@ i915_gem_object_create_shmem(struct drm_i915_private *i915, u64 size) > > obj->write_domain = I915_GEM_DOMAIN_CPU; > > obj->read_domains = I915_GEM_DOMAIN_CPU; > > > > - if (HAS_LLC(i915)) > > - /* On some devices, we can have the GPU use the LLC (the CPU > > - * cache) for about a 10% performance improvement > > - * compared to uncached. Graphics requests other than > > - * display scanout are coherent with the CPU in > > - * accessing this cache. This means in this mode we > > - * don't need to clflush on the CPU side, and on the > > - * GPU side we only need to flush internal caches to > > - * get data visible to the CPU. > > - * > > - * However, we maintain the display planes as UC, and so > > - * need to rebind when first used as such. > > - */ > > - cache_level = I915_CACHE_LLC; > > - else > > - cache_level = I915_CACHE_NONE; > > - > > - i915_gem_object_set_cache_coherency(obj, cache_level); > > + /* > > + * Note that userspace has control over cache-bypass > > + * via its command stream, so even on LLC architectures > > + * we have to flush out the CPU cache to memory to > > + * clear previous contents. > > + */ > > + i915_gem_object_set_cache_coherency(obj, I915_CACHE_NONE); > > > > I'm not sure if you've seen my comments about this in an internal thread > you were CC'ed to: I don't think this patch will have the intended > effect. The various clflushes this could trigger before the first use > of the buffer are conditional on "obj->cache_dirty & > ~obj->cache_coherent", which will always be false on LLC platforms > AFAICT, because on those platforms i915_gem_object_set_cache_coherency() > will always set bit 0 of obj->cache_coherent. You only need to flush the stale written-to cachelines, so that the page content is correct on reuse of the foreign struct page. After that point, the CPU cache is managed by the client. -Chris
Chris Wilson <chris@chris-wilson.co.uk> writes: > Quoting Francisco Jerez (2019-07-24 21:37:24) >> Chris Wilson <chris@chris-wilson.co.uk> writes: >> >> > Since userspace has the ability to bypass the CPU cache from within its >> > unprivileged command stream, we have to flush the CPU cache to memory >> > in order to overwrite the previous contents on creation. >> > >> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> >> > Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> >> > Cc: stablevger.kernel.org >> > --- >> > drivers/gpu/drm/i915/gem/i915_gem_shmem.c | 26 ++++++----------------- >> > 1 file changed, 7 insertions(+), 19 deletions(-) >> > >> > diff --git a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c >> > index d2a1158868e7..f752b326d399 100644 >> > --- a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c >> > +++ b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c >> > @@ -459,7 +459,6 @@ i915_gem_object_create_shmem(struct drm_i915_private *i915, u64 size) >> > { >> > struct drm_i915_gem_object *obj; >> > struct address_space *mapping; >> > - unsigned int cache_level; >> > gfp_t mask; >> > int ret; >> > >> > @@ -498,24 +497,13 @@ i915_gem_object_create_shmem(struct drm_i915_private *i915, u64 size) >> > obj->write_domain = I915_GEM_DOMAIN_CPU; >> > obj->read_domains = I915_GEM_DOMAIN_CPU; >> > >> > - if (HAS_LLC(i915)) >> > - /* On some devices, we can have the GPU use the LLC (the CPU >> > - * cache) for about a 10% performance improvement >> > - * compared to uncached. Graphics requests other than >> > - * display scanout are coherent with the CPU in >> > - * accessing this cache. This means in this mode we >> > - * don't need to clflush on the CPU side, and on the >> > - * GPU side we only need to flush internal caches to >> > - * get data visible to the CPU. >> > - * >> > - * However, we maintain the display planes as UC, and so >> > - * need to rebind when first used as such. >> > - */ >> > - cache_level = I915_CACHE_LLC; >> > - else >> > - cache_level = I915_CACHE_NONE; >> > - >> > - i915_gem_object_set_cache_coherency(obj, cache_level); >> > + /* >> > + * Note that userspace has control over cache-bypass >> > + * via its command stream, so even on LLC architectures >> > + * we have to flush out the CPU cache to memory to >> > + * clear previous contents. >> > + */ >> > + i915_gem_object_set_cache_coherency(obj, I915_CACHE_NONE); >> > >> >> I'm not sure if you've seen my comments about this in an internal thread >> you were CC'ed to: I don't think this patch will have the intended >> effect. The various clflushes this could trigger before the first use >> of the buffer are conditional on "obj->cache_dirty & >> ~obj->cache_coherent", which will always be false on LLC platforms >> AFAICT, because on those platforms i915_gem_object_set_cache_coherency() >> will always set bit 0 of obj->cache_coherent. > > You only need to flush the stale written-to cachelines, so that the page > content is correct on reuse of the foreign struct page. After that point, > the CPU cache is managed by the client. Point was that the code above wouldn't have necessarily led to *any* flushing on Gen11+ platforms, not even of stale written-to cachelines, because the various clflushes this could have triggered before the first use of the buffer were conditional on "obj->cache_dirty & ~obj->cache_coherent", which would still have been false on LLC platforms. Anyway you may have fixed that in the next revision of this patch you sent today by doing things in a completely different way. But you didn't answer my question asking why you are doing this on all platforms. Cache bypass isn't available on ICL nor earlier platforms, is it? And it's been defeatured on TGL: https://www.spinics.net/lists/intel-gfx/msg219552.html > -Chris
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c index d2a1158868e7..f752b326d399 100644 --- a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c +++ b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c @@ -459,7 +459,6 @@ i915_gem_object_create_shmem(struct drm_i915_private *i915, u64 size) { struct drm_i915_gem_object *obj; struct address_space *mapping; - unsigned int cache_level; gfp_t mask; int ret; @@ -498,24 +497,13 @@ i915_gem_object_create_shmem(struct drm_i915_private *i915, u64 size) obj->write_domain = I915_GEM_DOMAIN_CPU; obj->read_domains = I915_GEM_DOMAIN_CPU; - if (HAS_LLC(i915)) - /* On some devices, we can have the GPU use the LLC (the CPU - * cache) for about a 10% performance improvement - * compared to uncached. Graphics requests other than - * display scanout are coherent with the CPU in - * accessing this cache. This means in this mode we - * don't need to clflush on the CPU side, and on the - * GPU side we only need to flush internal caches to - * get data visible to the CPU. - * - * However, we maintain the display planes as UC, and so - * need to rebind when first used as such. - */ - cache_level = I915_CACHE_LLC; - else - cache_level = I915_CACHE_NONE; - - i915_gem_object_set_cache_coherency(obj, cache_level); + /* + * Note that userspace has control over cache-bypass + * via its command stream, so even on LLC architectures + * we have to flush out the CPU cache to memory to + * clear previous contents. + */ + i915_gem_object_set_cache_coherency(obj, I915_CACHE_NONE); trace_i915_gem_object_create(obj);
Since userspace has the ability to bypass the CPU cache from within its unprivileged command stream, we have to flush the CPU cache to memory in order to overwrite the previous contents on creation. Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: stablevger.kernel.org --- drivers/gpu/drm/i915/gem/i915_gem_shmem.c | 26 ++++++----------------- 1 file changed, 7 insertions(+), 19 deletions(-)