diff mbox

drm/i915: Only insert the mb() before updating the fence parameter

Message ID 1349771938-18887-1-git-send-email-chris@chris-wilson.co.uk (mailing list archive)
State New, archived
Headers show

Commit Message

Chris Wilson Oct. 9, 2012, 8:38 a.m. UTC
With a fence, we only need to insert a memory barrier around the actual
fence alteration for CPU accesses through the GTT. Performing the
barrier in flush-fence was inserting unnecessary and expensive barriers
for never fenced objects.

Note removing the barriers from flush-fence, which was effectively a
barrier before every direct access through the GTT, revealed that we
where missing a barrier before the first access through the GTT. Lack of
that barrier was sufficient to cause GPU hangs.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/i915_gem.c |   33 +++++++++++++++++++++++----------
 1 file changed, 23 insertions(+), 10 deletions(-)

Comments

Daniel Vetter Oct. 9, 2012, 9:54 a.m. UTC | #1
On Tue, Oct 09, 2012 at 09:38:58AM +0100, Chris Wilson wrote:
> With a fence, we only need to insert a memory barrier around the actual
> fence alteration for CPU accesses through the GTT. Performing the
> barrier in flush-fence was inserting unnecessary and expensive barriers
> for never fenced objects.
> 
> Note removing the barriers from flush-fence, which was effectively a
> barrier before every direct access through the GTT, revealed that we
> where missing a barrier before the first access through the GTT. Lack of
> that barrier was sufficient to cause GPU hangs.
> 
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

Looks good and finally puts some clear explanation and consistency behind
our mb()s. Two minor nitpicks, otherwise.

Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>

> ---
>  drivers/gpu/drm/i915/i915_gem.c |   33 +++++++++++++++++++++++----------
>  1 file changed, 23 insertions(+), 10 deletions(-)

[snip]

> @@ -3244,6 +3254,9 @@ i915_gem_object_set_to_gtt_domain(struct drm_i915_gem_object *obj, bool write)
>  
>  	i915_gem_object_flush_cpu_write_domain(obj);
>  
> +	if ((obj->base.read_domains & I915_GEM_DOMAIN_GTT) == 0)
> +		mb();
> +

I think a comment here like we have one for all other gtt related memory
barries would be good. Another thing is the flush_gtt_write_domain uses a
wmb, whereas here we don't bother with micro-optimizing things. So I think
it'd be good to just use a mb() for that, too, if just for consistency.

Also, you know the grumpy maintainer drill: Could we exercise these
barriers with a minimal i-g-t testcase, please? Since you've managed to
kill your machine by removing them, they're no longer just there to keep
us happy, hence I'd like to have them exercised ...

Another thing that just crossed my mind: Could we lack a set of mb()s for
cpu access on llc platforms? For non-coherent platforms the mb() in the
clflush paths will do that, but on llc platforms I couldn't find anything.
And that lp bugs seems to make an excellent case for them being required
...

Cheers, Daniel
Daniel Vetter Oct. 9, 2012, 10:02 a.m. UTC | #2
While looking at barriers, I think we could be a bit more paranoid with
the barrier in intel_read_status_page and up it to a full mb() and move it
into the !lazy_coherency conditional of the various get_seqno functions.
-Daniel

On Tue, Oct 09, 2012 at 11:54:12AM +0200, Daniel Vetter wrote:
> On Tue, Oct 09, 2012 at 09:38:58AM +0100, Chris Wilson wrote:
> > With a fence, we only need to insert a memory barrier around the actual
> > fence alteration for CPU accesses through the GTT. Performing the
> > barrier in flush-fence was inserting unnecessary and expensive barriers
> > for never fenced objects.
> > 
> > Note removing the barriers from flush-fence, which was effectively a
> > barrier before every direct access through the GTT, revealed that we
> > where missing a barrier before the first access through the GTT. Lack of
> > that barrier was sufficient to cause GPU hangs.
> > 
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> 
> Looks good and finally puts some clear explanation and consistency behind
> our mb()s. Two minor nitpicks, otherwise.
> 
> Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
> 
> > ---
> >  drivers/gpu/drm/i915/i915_gem.c |   33 +++++++++++++++++++++++----------
> >  1 file changed, 23 insertions(+), 10 deletions(-)
> 
> [snip]
> 
> > @@ -3244,6 +3254,9 @@ i915_gem_object_set_to_gtt_domain(struct drm_i915_gem_object *obj, bool write)
> >  
> >  	i915_gem_object_flush_cpu_write_domain(obj);
> >  
> > +	if ((obj->base.read_domains & I915_GEM_DOMAIN_GTT) == 0)
> > +		mb();
> > +
> 
> I think a comment here like we have one for all other gtt related memory
> barries would be good. Another thing is the flush_gtt_write_domain uses a
> wmb, whereas here we don't bother with micro-optimizing things. So I think
> it'd be good to just use a mb() for that, too, if just for consistency.
> 
> Also, you know the grumpy maintainer drill: Could we exercise these
> barriers with a minimal i-g-t testcase, please? Since you've managed to
> kill your machine by removing them, they're no longer just there to keep
> us happy, hence I'd like to have them exercised ...
> 
> Another thing that just crossed my mind: Could we lack a set of mb()s for
> cpu access on llc platforms? For non-coherent platforms the mb() in the
> clflush paths will do that, but on llc platforms I couldn't find anything.
> And that lp bugs seems to make an excellent case for them being required
> ...
> 
> Cheers, Daniel
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> +41 (0) 79 365 57 48 - http://blog.ffwll.ch
Chris Wilson Oct. 9, 2012, 11:03 a.m. UTC | #3
On Tue, 9 Oct 2012 11:54:12 +0200, Daniel Vetter <daniel@ffwll.ch> wrote:
> On Tue, Oct 09, 2012 at 09:38:58AM +0100, Chris Wilson wrote:
> > @@ -3244,6 +3254,9 @@ i915_gem_object_set_to_gtt_domain(struct drm_i915_gem_object *obj, bool write)
> >  
> >  	i915_gem_object_flush_cpu_write_domain(obj);
> >  
> > +	if ((obj->base.read_domains & I915_GEM_DOMAIN_GTT) == 0)
> > +		mb();
> > +
> 
> I think a comment here like we have one for all other gtt related memory
> barries would be good. Another thing is the flush_gtt_write_domain uses a
> wmb, whereas here we don't bother with micro-optimizing things. So I think
> it'd be good to just use a mb() for that, too, if just for consistency.

In fact, not only is that the wmb() the easiest to micro-optimise, it
is the only one that can be - I think. Around manipulating the fence, we
need a read/write barrier in case we have any pending accesses through
the fenced region, since the register write may be reordered passed the
memory reads since there is no obvious dependency. That might just be
heightened paranoia and our memory controller isn't that smart. Yet. So
those two need to be mb() so that I can sleep safely at night. For the
mb() inside set-to-gtt-domain, I don't have a robust explanation other
than that empirically we need a barrier, therefore there is some
lingering incoherency when reusing a bo. (The hangs always seem to occur
when crossing a page boundary, we see stale data.) You could attempt to
insert a read/write barrier depending upon actual usage, but it hardly
seems worth the effort.
 
> Also, you know the grumpy maintainer drill: Could we exercise these
> barriers with a minimal i-g-t testcase, please? Since you've managed to
> kill your machine by removing them, they're no longer just there to keep
> us happy, hence I'd like to have them exercised ...

Still hunting.
 
> Another thing that just crossed my mind: Could we lack a set of mb()s for
> cpu access on llc platforms? For non-coherent platforms the mb() in the
> clflush paths will do that, but on llc platforms I couldn't find anything.
> And that lp bugs seems to make an excellent case for them being required
> ...

Yes, with LLC we need to treat the GPU as another core and so put
similar SMP-esque memory barriers in place.
-Chris
Daniel Vetter Oct. 9, 2012, 11:14 a.m. UTC | #4
On Tue, Oct 9, 2012 at 1:03 PM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> In fact, not only is that the wmb() the easiest to micro-optimise, it
> is the only one that can be - I think. Around manipulating the fence, we
> need a read/write barrier in case we have any pending accesses through
> the fenced region, since the register write may be reordered passed the
> memory reads since there is no obvious dependency. That might just be
> heightened paranoia and our memory controller isn't that smart. Yet. So
> those two need to be mb() so that I can sleep safely at night. For the
> mb() inside set-to-gtt-domain, I don't have a robust explanation other
> than that empirically we need a barrier, therefore there is some
> lingering incoherency when reusing a bo. (The hangs always seem to occur
> when crossing a page boundary, we see stale data.) You could attempt to
> insert a read/write barrier depending upon actual usage, but it hardly
> seems worth the effort.

Hm, I think we can make a case that the barrier before the fence
change can only be a wmb(): Racing reads against fence changes is
ill-defined anyway, so we don't need the read barrier for that reason.
But we need to flush out any store buffers (especially the wc store
buffer) before changing the fencing. The mb() afterwards seems to be
required, since we need to sync both subsequent reads and writes
against the fence mmio write.

One thing I wonder is whether we miss any barrier between the wc
writes to the ringbuffer and the tail update. If that's the case I
wonder where all the bug reports are ...

Last one: Which machines blow up when you drop that mb()?

Cheers, Daniel
Chris Wilson Oct. 9, 2012, 11:26 a.m. UTC | #5
On Tue, 9 Oct 2012 13:14:09 +0200, Daniel Vetter <daniel@ffwll.ch> wrote:
> One thing I wonder is whether we miss any barrier between the wc
> writes to the ringbuffer and the tail update. If that's the case I
> wonder where all the bug reports are ...

Ditto. I've often wondered how we get away without a wmb() there...
 
> Last one: Which machines blow up when you drop that mb()?

pnv, though that's the only non-LLC I've been testing with the
incomplete patch so I can't say it is limited to that machine.
-Chris
diff mbox

Patch

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 05ff790..c1ef0a8 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2771,9 +2771,22 @@  static void i830_write_fence_reg(struct drm_device *dev, int reg,
 	POSTING_READ(FENCE_REG_830_0 + reg * 4);
 }
 
+inline static bool i915_gem_object_needs_mb(struct drm_i915_gem_object *obj)
+{
+	return obj && obj->base.read_domains & I915_GEM_DOMAIN_GTT;
+}
+
 static void i915_gem_write_fence(struct drm_device *dev, int reg,
 				 struct drm_i915_gem_object *obj)
 {
+	struct drm_i915_private *dev_priv = dev->dev_private;
+
+	/* Ensure that all CPU reads are completed before installing a fence
+	 * and all writes before removing the fence.
+	 */
+	if (i915_gem_object_needs_mb(dev_priv->fence_regs[reg].obj))
+		mb();
+
 	switch (INTEL_INFO(dev)->gen) {
 	case 7:
 	case 6: sandybridge_write_fence_reg(dev, reg, obj); break;
@@ -2783,6 +2796,9 @@  static void i915_gem_write_fence(struct drm_device *dev, int reg,
 	case 2: i830_write_fence_reg(dev, reg, obj); break;
 	default: break;
 	}
+
+	if (i915_gem_object_needs_mb(obj))
+		mb();
 }
 
 static inline int fence_number(struct drm_i915_private *dev_priv,
@@ -2812,7 +2828,7 @@  static void i915_gem_object_update_fence(struct drm_i915_gem_object *obj,
 }
 
 static int
-i915_gem_object_flush_fence(struct drm_i915_gem_object *obj)
+i915_gem_object_wait_fence(struct drm_i915_gem_object *obj)
 {
 	if (obj->last_fenced_seqno) {
 		int ret = i915_wait_seqno(obj->ring, obj->last_fenced_seqno);
@@ -2822,12 +2838,6 @@  i915_gem_object_flush_fence(struct drm_i915_gem_object *obj)
 		obj->last_fenced_seqno = 0;
 	}
 
-	/* Ensure that all CPU reads are completed before installing a fence
-	 * and all writes before removing the fence.
-	 */
-	if (obj->base.read_domains & I915_GEM_DOMAIN_GTT)
-		mb();
-
 	obj->fenced_gpu_access = false;
 	return 0;
 }
@@ -2838,7 +2848,7 @@  i915_gem_object_put_fence(struct drm_i915_gem_object *obj)
 	struct drm_i915_private *dev_priv = obj->base.dev->dev_private;
 	int ret;
 
-	ret = i915_gem_object_flush_fence(obj);
+	ret = i915_gem_object_wait_fence(obj);
 	if (ret)
 		return ret;
 
@@ -2912,7 +2922,7 @@  i915_gem_object_get_fence(struct drm_i915_gem_object *obj)
 	 * will need to serialise the write to the associated fence register?
 	 */
 	if (obj->fence_dirty) {
-		ret = i915_gem_object_flush_fence(obj);
+		ret = i915_gem_object_wait_fence(obj);
 		if (ret)
 			return ret;
 	}
@@ -2933,7 +2943,7 @@  i915_gem_object_get_fence(struct drm_i915_gem_object *obj)
 		if (reg->obj) {
 			struct drm_i915_gem_object *old = reg->obj;
 
-			ret = i915_gem_object_flush_fence(old);
+			ret = i915_gem_object_wait_fence(old);
 			if (ret)
 				return ret;
 
@@ -3244,6 +3254,9 @@  i915_gem_object_set_to_gtt_domain(struct drm_i915_gem_object *obj, bool write)
 
 	i915_gem_object_flush_cpu_write_domain(obj);
 
+	if ((obj->base.read_domains & I915_GEM_DOMAIN_GTT) == 0)
+		mb();
+
 	old_write_domain = obj->base.write_domain;
 	old_read_domains = obj->base.read_domains;