diff mbox

Revert "drm/i915: Kill GTT mappings when moving from GTT domain"

Message ID 1308094989-3153-1-git-send-email-eric@anholt.net
State New, archived
Headers show

Commit Message

Eric Anholt June 14, 2011, 11:43 p.m. UTC
This reverts commit 4a684a4117abd756291969336af454e8a958802f.
Userland has always been required to set the object's domain to GTT
before using it through a GTT mapping, it's not something that the
kernel is supposed to enforce.  (The pagefault support is so that we
can handle multiple mappings without userland having to pin across
them, not so that userland can use GTT after GPU domains without
telling the kernel).

Fixes 19.2% +/- 0.8% (n=6) performance regression in cairo-gl
firefox-talos-gfx on my T420 latop.
---
 drivers/gpu/drm/i915/i915_gem.c            |   10 ++++------
 drivers/gpu/drm/i915/i915_gem_execbuffer.c |    4 ----
 2 files changed, 4 insertions(+), 10 deletions(-)

Comments

Chris Wilson June 15, 2011, 9:39 a.m. UTC | #1
On Tue, 14 Jun 2011 16:43:09 -0700, Eric Anholt <eric@anholt.net> wrote:
> This reverts commit 4a684a4117abd756291969336af454e8a958802f.
> Userland has always been required to set the object's domain to GTT
> before using it through a GTT mapping, it's not something that the
> kernel is supposed to enforce.  (The pagefault support is so that we
> can handle multiple mappings without userland having to pin across
> them, not so that userland can use GTT after GPU domains without
> telling the kernel).

The only place that can do accurate per-page tracking of domains is the
kernel. (And avoid clflushes for GTT eviction. Ok, so we're not quite
there yet.) Relying on userspace to get it right, and for co-operating
processes to coordinate their access to buffers is a misfeature.
 
> Fixes 19.2% +/- 0.8% (n=6) performance regression in cairo-gl
> firefox-talos-gfx on my T420 latop.

Since GTT mappings are fundamentally slow, don't you think this suggests
that mesa is doing something wrong? Or a bug in cairo-gl to make mesa act
this way?

For the record, on my SugarBay, this patch on top of drm-intel-fixes and
mesa.git + your instruction streaming, regresses cairo-xlib 4.5s to 4.7s
(4%, with a claim of .1% stddev) and improves cairo-gl from 21.5s to 20.1s
(6.5%, with a claim of .5% stddev) for the firefox-talos-gfx. [This is a
system that can do under 2s for cairo-xlib using vmap and the perf
patches I posted before.]
-Chris
Eric Anholt June 15, 2011, 3:08 p.m. UTC | #2
On Wed, 15 Jun 2011 10:39:06 +0100, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> On Tue, 14 Jun 2011 16:43:09 -0700, Eric Anholt <eric@anholt.net> wrote:
> > This reverts commit 4a684a4117abd756291969336af454e8a958802f.
> > Userland has always been required to set the object's domain to GTT
> > before using it through a GTT mapping, it's not something that the
> > kernel is supposed to enforce.  (The pagefault support is so that we
> > can handle multiple mappings without userland having to pin across
> > them, not so that userland can use GTT after GPU domains without
> > telling the kernel).
> 
> The only place that can do accurate per-page tracking of domains is the
> kernel. (And avoid clflushes for GTT eviction. Ok, so we're not quite
> there yet.) Relying on userspace to get it right, and for co-operating
> processes to coordinate their access to buffers is a misfeature.

We rely on userspace to get it right all the time, which was a design
decision made early on for performance reasons.  For example, userspace
can submit the wrong domains with an exec call, and non-GTT mappings
also behave like I'm changing GTT back to.  Unless you quietly changed
that, too?

> > Fixes 19.2% +/- 0.8% (n=6) performance regression in cairo-gl
> > firefox-talos-gfx on my T420 latop.
> 
> Since GTT mappings are fundamentally slow, don't you think this suggests
> that mesa is doing something wrong? Or a bug in cairo-gl to make mesa act
> this way?

Page table changes are expensive, yes.  That's why we cache BOs in
userspace and hang onto the mappings of them.  We only moved to GTT
mappings in Mesa in places where it was faster than the alternatives (or
for tiled buffers, where due to the a17 swizzling there is no other
option).

> For the record, on my SugarBay, this patch on top of drm-intel-fixes and
> mesa.git + your instruction streaming, regresses cairo-xlib 4.5s to 4.7s
> (4%, with a claim of .1% stddev) and improves cairo-gl from 21.5s to 20.1s
> (6.5%, with a claim of .5% stddev) for the firefox-talos-gfx. [This is a
> system that can do under 2s for cairo-xlib using vmap and the perf
> patches I posted before.]
> -Chris

I was testing with semaphores enabled on the LLC series.  I'm curious
how this revert could possibly regress cairo-xlib, though -- that seems
very strange.
Chris Wilson June 15, 2011, 4:03 p.m. UTC | #3
On Wed, 15 Jun 2011 08:08:59 -0700, Eric Anholt <eric@anholt.net> wrote:
> On Wed, 15 Jun 2011 10:39:06 +0100, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> > For the record, on my SugarBay, this patch on top of drm-intel-fixes and
> > mesa.git + your instruction streaming, regresses cairo-xlib 4.5s to 4.7s
> > (4%, with a claim of .1% stddev) and improves cairo-gl from 21.5s to 20.1s
> > (6.5%, with a claim of .5% stddev) for the firefox-talos-gfx. [This is a
> > system that can do under 2s for cairo-xlib using vmap and the perf
> > patches I posted before.]
> > -Chris
> 
> I was testing with semaphores enabled on the LLC series.  I'm curious
> how this revert could possibly regress cairo-xlib, though -- that seems
> very strange.

Moving back to LLC+semaphores (2.6.39-rc2+), firefox-talos-gfx:
xlib: 4.473
gl:  20.753

applying the patch:
xlib: 4.472
gl:  20.824

I'm just not reproducing the same issue you are seeing. Are you using a
standard distro Kconfig, or if not, can you send me yours?
-Chris
Eric Anholt June 17, 2011, 7:06 p.m. UTC | #4
On Wed, 15 Jun 2011 17:03:58 +0100, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> On Wed, 15 Jun 2011 08:08:59 -0700, Eric Anholt <eric@anholt.net> wrote:
> > On Wed, 15 Jun 2011 10:39:06 +0100, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> > > For the record, on my SugarBay, this patch on top of drm-intel-fixes and
> > > mesa.git + your instruction streaming, regresses cairo-xlib 4.5s to 4.7s
> > > (4%, with a claim of .1% stddev) and improves cairo-gl from 21.5s to 20.1s
> > > (6.5%, with a claim of .5% stddev) for the firefox-talos-gfx. [This is a
> > > system that can do under 2s for cairo-xlib using vmap and the perf
> > > patches I posted before.]
> > > -Chris
> > 
> > I was testing with semaphores enabled on the LLC series.  I'm curious
> > how this revert could possibly regress cairo-xlib, though -- that seems
> > very strange.
> 
> Moving back to LLC+semaphores (2.6.39-rc2+), firefox-talos-gfx:
> xlib: 4.473
> gl:  20.753
> 
> applying the patch:
> xlib: 4.472
> gl:  20.824
> 
> I'm just not reproducing the same issue you are seeing. Are you using a
> standard distro Kconfig, or if not, can you send me yours?
> -Chris

http://people.freedesktop.org/~anholt/dotconfig

Pushed my kernel tree to "gtt-revert" branch.

Fresh set of numbers on two fresh boots, just about as identical testing
environments as I could produce:

(without revert)
anholt@eliezer:src/cairo/perf% cat /home/anholt/cairogl-before
21791099815
21677090234
22060471069
22049244399
21721362892
21966143347

(with revert)
anholt@eliezer:src/cairo/perf% cat /home/anholt/cairogl-after 
19162606198
19127697366
19462253708
19172535715
19339999910
19221539215

anholt@eliezer:src/cairo/perf% ministat ~/cairogl-before ~/cairogl-after
x /home/anholt/cairogl-before
+ /home/anholt/cairogl-after
+------------------------------------------------------------------------------+
| +                                                                           x|
|++ +  +  +                                                         xx x    x x|
||__A___|                                                            |___A__M_||
+------------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   6  2.167709e+10 2.2060471e+10 2.1966143e+10 2.1877569e+10 1.6902073e+08
+   6 1.9127697e+10 1.9462254e+10 1.9221539e+10 1.9247772e+10 1.2847426e+08
Difference at 95.0% confidence
	-2.6298e+09 +/- 1.93108e+08
	-12.0205% +/- 0.882677%
	(Student's t, pooled s = 1.50123e+08)
Chris Wilson June 18, 2011, 11:43 a.m. UTC | #5
On Fri, 17 Jun 2011 12:06:54 -0700, Eric Anholt <eric@anholt.net> wrote:
> On Wed, 15 Jun 2011 17:03:58 +0100, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> > Moving back to LLC+semaphores (2.6.39-rc2+), firefox-talos-gfx:
> > xlib: 4.473
> > gl:  20.753
> > 
> > applying the patch:
> > xlib: 4.472
> > gl:  20.824
> > 
> > I'm just not reproducing the same issue you are seeing. Are you using a
> > standard distro Kconfig, or if not, can you send me yours?
> > -Chris
> 
> http://people.freedesktop.org/~anholt/dotconfig
> 
> Pushed my kernel tree to "gtt-revert" branch.

Thanks, I'm testing with those on my SNB desktop, still only see around
a 5% hit for firefox-talos-gfx.

I still think this is optimising for bad behaviour of the client, and
papering over two slow linear lookups in the kernel. [One for find_vma
which is exacerbated by the cached vma on the bo and the other checking
whether the GTT aperture is RAM.]

Looking at what cairo-gl is doing, the root cause of why we are
repeatedly rewriting textures is:

commit 3b1c0a4bd66660780095e6016e3db451f34503a3
Author: Benjamin Otte <otte@redhat.com>
Date:   Fri May 14 15:56:17 2010 +0200

    fallback: Remove span renderer paths
    
    Those paths were broken, as they didn't properly translate the polygon
    to the destination size. And rather than adding lots of code that allows
    translation, it's easier to just delete this code.
    
    Note that the only user of the code was the GL backend anyway.

i.e. cairo-gl is no longer using spans for firefox-talos-gfx where
unaligned clipping dominates.
-Chris
Eric Anholt June 18, 2011, 8:20 p.m. UTC | #6
On Sat, 18 Jun 2011 12:43:58 +0100, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> On Fri, 17 Jun 2011 12:06:54 -0700, Eric Anholt <eric@anholt.net> wrote:
> > On Wed, 15 Jun 2011 17:03:58 +0100, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> > > Moving back to LLC+semaphores (2.6.39-rc2+), firefox-talos-gfx:
> > > xlib: 4.473
> > > gl:  20.753
> > > 
> > > applying the patch:
> > > xlib: 4.472
> > > gl:  20.824
> > > 
> > > I'm just not reproducing the same issue you are seeing. Are you using a
> > > standard distro Kconfig, or if not, can you send me yours?
> > > -Chris
> > 
> > http://people.freedesktop.org/~anholt/dotconfig
> > 
> > Pushed my kernel tree to "gtt-revert" branch.
> 
> Thanks, I'm testing with those on my SNB desktop, still only see around
> a 5% hit for firefox-talos-gfx.

GTT mappings are important.  They're how textures are uploaded in GL in
general.  One of the longstanding things I've wanted to do for GL
applications that repeatedly allocate new texture images is to userland
BO cache those objects, which we don't do because of tiling.  With the
patch I'm reverting in place, there's basically no reason to do so
because remapping the BO is much of the cost of recreating the BO from
scratch.  Remapping is the reason we do userland caching instead of
kernel-side caching.
Chris Wilson June 19, 2011, 4:28 p.m. UTC | #7
On Sat, 18 Jun 2011 13:20:05 -0700, Eric Anholt <eric@anholt.net> wrote:
> On Sat, 18 Jun 2011 12:43:58 +0100, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> > On Fri, 17 Jun 2011 12:06:54 -0700, Eric Anholt <eric@anholt.net> wrote:
> > > On Wed, 15 Jun 2011 17:03:58 +0100, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> > > > Moving back to LLC+semaphores (2.6.39-rc2+), firefox-talos-gfx:
> > > > xlib: 4.473
> > > > gl:  20.753
> > > > 
> > > > applying the patch:
> > > > xlib: 4.472
> > > > gl:  20.824
> > > > 
> > > > I'm just not reproducing the same issue you are seeing. Are you using a
> > > > standard distro Kconfig, or if not, can you send me yours?
> > > > -Chris
> > > 
> > > http://people.freedesktop.org/~anholt/dotconfig
> > > 
> > > Pushed my kernel tree to "gtt-revert" branch.
> > 
> > Thanks, I'm testing with those on my SNB desktop, still only see around
> > a 5% hit for firefox-talos-gfx.
> 
> GTT mappings are important.  They're how textures are uploaded in GL in
> general.

For lack of a better mechanism. Even using anholt/gtt-revert, I question
the value of caching the GTT mapping in drm_intel_bo. For the cairo-gl and
pts benchmarks I've run, the efficacy of the cached vma is very small and
there is a very slight improvement by unmapping the vma after use. (The
difference is so small, that it will take a lot more runs to determine if
it is statistically significant.)

> One of the longstanding things I've wanted to do for GL
> applications that repeatedly allocate new texture images is to userland
> BO cache those objects, which we don't do because of tiling.

I'm failing to see the difference between this cache and a slightly
smarter (i.e. one that preferentially returns an object with appropriate
tiling and busy status) libdrm_intel cache.

> With the
> patch I'm reverting in place, there's basically no reason to do so
> because remapping the BO is much of the cost of recreating the BO from
> scratch.  Remapping is the reason we do userland caching instead of
> kernel-side caching.

Conversely, it is the lack of such a kernel bo and page cache that is
determinental to the ddx (at least on pre-SNB). And for any system running
more than one renderer. [Except for the thorny issue of whether having to
clear the pages returned from such a cache will nullify its benefits. I
have yet to measure the impact of doing so.]
-Chris
Eric Anholt June 19, 2011, 5:01 p.m. UTC | #8
On Sun, 19 Jun 2011 17:28:11 +0100, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> On Sat, 18 Jun 2011 13:20:05 -0700, Eric Anholt <eric@anholt.net> wrote:
> > On Sat, 18 Jun 2011 12:43:58 +0100, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> > > On Fri, 17 Jun 2011 12:06:54 -0700, Eric Anholt <eric@anholt.net> wrote:
> > > > On Wed, 15 Jun 2011 17:03:58 +0100, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> > > > > Moving back to LLC+semaphores (2.6.39-rc2+), firefox-talos-gfx:
> > > > > xlib: 4.473
> > > > > gl:  20.753
> > > > > 
> > > > > applying the patch:
> > > > > xlib: 4.472
> > > > > gl:  20.824
> > > > > 
> > > > > I'm just not reproducing the same issue you are seeing. Are you using a
> > > > > standard distro Kconfig, or if not, can you send me yours?
> > > > > -Chris
> > > > 
> > > > http://people.freedesktop.org/~anholt/dotconfig
> > > > 
> > > > Pushed my kernel tree to "gtt-revert" branch.
> > > 
> > > Thanks, I'm testing with those on my SNB desktop, still only see around
> > > a 5% hit for firefox-talos-gfx.
> > 
> > GTT mappings are important.  They're how textures are uploaded in GL in
> > general.
> 
> For lack of a better mechanism. Even using anholt/gtt-revert, I question
> the value of caching the GTT mapping in drm_intel_bo. For the cairo-gl and
> pts benchmarks I've run, the efficacy of the cached vma is very small and
> there is a very slight improvement by unmapping the vma after use. (The
> difference is so small, that it will take a lot more runs to determine if
> it is statistically significant.)

I'm confused.  You've measured a 5% impact from removing this part of
the caching, and I've measured 12-19%, so what are you planning that
doesn't involve caching the mapping that's faster than caching the
mapping?

> > One of the longstanding things I've wanted to do for GL
> > applications that repeatedly allocate new texture images is to userland
> > BO cache those objects, which we don't do because of tiling.
> 
> I'm failing to see the difference between this cache and a slightly
> smarter (i.e. one that preferentially returns an object with appropriate
> tiling and busy status) libdrm_intel cache.

I was talking about it being in libdrm_intel, sorry if that was
unclear.
Chris Wilson June 19, 2011, 5:14 p.m. UTC | #9
On Sun, 19 Jun 2011 10:01:23 -0700, Eric Anholt <eric@anholt.net> wrote:
> On Sun, 19 Jun 2011 17:28:11 +0100, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> > For lack of a better mechanism. Even using anholt/gtt-revert, I question
> > the value of caching the GTT mapping in drm_intel_bo. For the cairo-gl and
> > pts benchmarks I've run, the efficacy of the cached vma is very small and
> > there is a very slight improvement by unmapping the vma after use. (The
> > difference is so small, that it will take a lot more runs to determine if
> > it is statistically significant.)
> 
> I'm confused.  You've measured a 5% impact from removing this part of
> the caching, and I've measured 12-19%, so what are you planning that
> doesn't involve caching the mapping that's faster than caching the
> mapping?

As I pointed out, cairo-gl is using fallback code and not spans. Once that
is corrected, cairo-gl no longer continually recreating textures and so
unaffected by the patch. Removing the cached vma is then arguably
beneficial, though the difference looks to be in the noise.
-Chris
Eric Anholt June 19, 2011, 9:20 p.m. UTC | #10
On Sun, 19 Jun 2011 18:14:11 +0100, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> On Sun, 19 Jun 2011 10:01:23 -0700, Eric Anholt <eric@anholt.net> wrote:
> > On Sun, 19 Jun 2011 17:28:11 +0100, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> > > For lack of a better mechanism. Even using anholt/gtt-revert, I question
> > > the value of caching the GTT mapping in drm_intel_bo. For the cairo-gl and
> > > pts benchmarks I've run, the efficacy of the cached vma is very small and
> > > there is a very slight improvement by unmapping the vma after use. (The
> > > difference is so small, that it will take a lot more runs to determine if
> > > it is statistically significant.)
> > 
> > I'm confused.  You've measured a 5% impact from removing this part of
> > the caching, and I've measured 12-19%, so what are you planning that
> > doesn't involve caching the mapping that's faster than caching the
> > mapping?
> 
> As I pointed out, cairo-gl is using fallback code and not spans. Once that
> is corrected, cairo-gl no longer continually recreating textures and so
> unaffected by the patch. Removing the cached vma is then arguably
> beneficial, though the difference looks to be in the noise.

So you don't actually want us to fix the performance regression in
texture uploads, and instead want to just tell applications not to
upload textures?

What about applications where we're not the authors, and where they
don't have a choice but to upload textures (media players)?
Chris Wilson June 19, 2011, 9:49 p.m. UTC | #11
On Sun, 19 Jun 2011 14:20:14 -0700, Eric Anholt <eric@anholt.net> wrote:
> On Sun, 19 Jun 2011 18:14:11 +0100, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> > On Sun, 19 Jun 2011 10:01:23 -0700, Eric Anholt <eric@anholt.net> wrote:
> > > On Sun, 19 Jun 2011 17:28:11 +0100, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> > > > For lack of a better mechanism. Even using anholt/gtt-revert, I question
> > > > the value of caching the GTT mapping in drm_intel_bo. For the cairo-gl and
> > > > pts benchmarks I've run, the efficacy of the cached vma is very small and
> > > > there is a very slight improvement by unmapping the vma after use. (The
> > > > difference is so small, that it will take a lot more runs to determine if
> > > > it is statistically significant.)
> > > 
> > > I'm confused.  You've measured a 5% impact from removing this part of
> > > the caching, and I've measured 12-19%, so what are you planning that
> > > doesn't involve caching the mapping that's faster than caching the
> > > mapping?
> > 
> > As I pointed out, cairo-gl is using fallback code and not spans. Once that
> > is corrected, cairo-gl no longer continually recreating textures and so
> > unaffected by the patch. Removing the cached vma is then arguably
> > beneficial, though the difference looks to be in the noise.
> 
> So you don't actually want us to fix the performance regression in
> texture uploads, and instead want to just tell applications not to
> upload textures?

Nowhere, did I say that. I just said you were optimising for broken code
and papering over bugs and lack of functionality elsewhere.

The patch closes a race condition. The essence of your complaint is that
the kernel is not as fast as we need it to be, and that the initial upload
to any object is slower than expected. I presume you also have a plan for
fixing the underwhelming performance of a glibc memcpy through an
established GTT mapping?

> What about applications where we're not the authors, and where they
> don't have a choice but to upload textures (media players)?

No, I'm not necessarily saying application code is broken, just the
library.
-Chris
Chris Wilson June 21, 2011, 4:17 p.m. UTC | #12
Poking around in x86,pat I found one way of avoiding the linear walk for
pat_pagerange_is_ram() which appears to nullify the regression. I would
appreciate confirmation and some review before I go poking dragons.
-Chris
diff mbox

Patch

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index d231729..7bbd355 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -1218,11 +1218,11 @@  int i915_gem_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 		ret = i915_gem_object_bind_to_gtt(obj, 0, true);
 		if (ret)
 			goto unlock;
-	}
 
-	ret = i915_gem_object_set_to_gtt_domain(obj, write);
-	if (ret)
-		goto unlock;
+		ret = i915_gem_object_set_to_gtt_domain(obj, write);
+		if (ret)
+			goto unlock;
+	}
 
 	if (obj->tiling_mode == I915_TILING_NONE)
 		ret = i915_gem_object_put_fence(obj);
@@ -2953,8 +2953,6 @@  i915_gem_object_flush_gtt_write_domain(struct drm_i915_gem_object *obj)
 	 */
 	wmb();
 
-	i915_gem_release_mmap(obj);
-
 	old_write_domain = obj->base.write_domain;
 	obj->base.write_domain = 0;
 
diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
index 20a4cc5..4934cf8 100644
--- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
@@ -187,10 +187,6 @@  i915_gem_object_set_to_gpu_domain(struct drm_i915_gem_object *obj,
 	if ((flush_domains | invalidate_domains) & I915_GEM_DOMAIN_CPU)
 		i915_gem_clflush_object(obj);
 
-	/* blow away mappings if mapped through GTT */
-	if ((flush_domains | invalidate_domains) & I915_GEM_DOMAIN_GTT)
-		i915_gem_release_mmap(obj);
-
 	if (obj->base.pending_write_domain)
 		cd->flips |= atomic_read(&obj->pending_flip);