Message ID | 1349771938-18887-1-git-send-email-chris@chris-wilson.co.uk (mailing list archive)
---|---
State | New, archived
On Tue, Oct 09, 2012 at 09:38:58AM +0100, Chris Wilson wrote:
> With a fence, we only need to insert a memory barrier around the actual
> fence alteration for CPU accesses through the GTT. Performing the
> barrier in flush-fence was inserting unnecessary and expensive barriers
> for never fenced objects.
>
> Note removing the barriers from flush-fence, which was effectively a
> barrier before every direct access through the GTT, revealed that we
> were missing a barrier before the first access through the GTT. Lack of
> that barrier was sufficient to cause GPU hangs.
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

Looks good and finally puts some clear explanation and consistency behind
our mb()s. Two minor nitpicks, otherwise.

Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>

> ---
>  drivers/gpu/drm/i915/i915_gem.c | 33 +++++++++++++++++++++++----------
>  1 file changed, 23 insertions(+), 10 deletions(-)

[snip]

> @@ -3244,6 +3254,9 @@ i915_gem_object_set_to_gtt_domain(struct drm_i915_gem_object *obj, bool write)
>
>  	i915_gem_object_flush_cpu_write_domain(obj);
>
> +	if ((obj->base.read_domains & I915_GEM_DOMAIN_GTT) == 0)
> +		mb();
> +

I think a comment here, like we have for all the other GTT-related memory
barriers, would be good. Another thing is that flush_gtt_write_domain uses
a wmb(), whereas here we don't bother with micro-optimizing things. So I
think it'd be good to just use a mb() for that, too, if just for
consistency.

Also, you know the grumpy maintainer drill: could we exercise these
barriers with a minimal i-g-t testcase, please? Since you've managed to
kill your machine by removing them, they're no longer just there to keep
us happy, hence I'd like to have them exercised ...

Another thing that just crossed my mind: could we lack a set of mb()s for
CPU access on LLC platforms? For non-coherent platforms the mb() in the
clflush paths will do that, but on LLC platforms I couldn't find anything.
And that LP bug seems to make an excellent case for them being required ...

Cheers, Daniel
While looking at barriers, I think we could be a bit more paranoid with
the barrier in intel_read_status_page and up it to a full mb(), and move
it into the !lazy_coherency conditional of the various get_seqno
functions.
-Daniel

On Tue, Oct 09, 2012 at 11:54:12AM +0200, Daniel Vetter wrote:
> [snip - full quote of the previous mail]
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> +41 (0) 79 365 57 48 - http://blog.ffwll.ch
On Tue, 9 Oct 2012 11:54:12 +0200, Daniel Vetter <daniel@ffwll.ch> wrote:
> On Tue, Oct 09, 2012 at 09:38:58AM +0100, Chris Wilson wrote:
> > @@ -3244,6 +3254,9 @@ i915_gem_object_set_to_gtt_domain(struct drm_i915_gem_object *obj, bool write)
> >
> >  	i915_gem_object_flush_cpu_write_domain(obj);
> >
> > +	if ((obj->base.read_domains & I915_GEM_DOMAIN_GTT) == 0)
> > +		mb();
> > +
>
> I think a comment here like we have for all the other GTT-related memory
> barriers would be good. Another thing is that flush_gtt_write_domain uses
> a wmb(), whereas here we don't bother with micro-optimizing things. So I
> think it'd be good to just use a mb() for that, too, if just for
> consistency.

In fact, not only is the wmb() the easiest to micro-optimise, it is the
only one that can be - I think. Around manipulating the fence, we need a
read/write barrier in case we have any pending accesses through the
fenced region, since the register write may be reordered past the memory
reads as there is no obvious dependency. That might just be heightened
paranoia and our memory controller isn't that smart. Yet. So those two
need to be mb() so that I can sleep safely at night.

For the mb() inside set-to-gtt-domain, I don't have a robust explanation
other than that empirically we need a barrier, therefore there is some
lingering incoherency when reusing a bo. (The hangs always seem to occur
when crossing a page boundary, where we see stale data.) You could
attempt to insert a read/write barrier depending upon actual usage, but
it hardly seems worth the effort.

> Also, you know the grumpy maintainer drill: could we exercise these
> barriers with a minimal i-g-t testcase, please? Since you've managed to
> kill your machine by removing them, they're no longer just there to keep
> us happy, hence I'd like to have them exercised ...

Still hunting.

> Another thing that just crossed my mind: could we lack a set of mb()s
> for CPU access on LLC platforms? For non-coherent platforms the mb() in
> the clflush paths will do that, but on LLC platforms I couldn't find
> anything. And that LP bug seems to make an excellent case for them
> being required ...

Yes, with LLC we need to treat the GPU as another core and so put similar
SMP-esque memory barriers in place.
-Chris
On Tue, Oct 9, 2012 at 1:03 PM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> In fact, not only is the wmb() the easiest to micro-optimise, it is the
> only one that can be - I think. Around manipulating the fence, we need
> a read/write barrier in case we have any pending accesses through the
> fenced region, since the register write may be reordered past the
> memory reads as there is no obvious dependency. That might just be
> heightened paranoia and our memory controller isn't that smart. Yet.
> So those two need to be mb() so that I can sleep safely at night. For
> the mb() inside set-to-gtt-domain, I don't have a robust explanation
> other than that empirically we need a barrier, therefore there is some
> lingering incoherency when reusing a bo. (The hangs always seem to
> occur when crossing a page boundary, where we see stale data.) You
> could attempt to insert a read/write barrier depending upon actual
> usage, but it hardly seems worth the effort.

Hm, I think we can make a case that the barrier before the fence change
need only be a wmb(): racing reads against fence changes are ill-defined
anyway, so we don't need the read barrier for that reason. But we do need
to flush out any store buffers (especially the wc store buffer) before
changing the fencing. The mb() afterwards seems to be required, since we
need to sync both subsequent reads and writes against the fence mmio
write.

One thing I wonder is whether we miss any barrier between the wc writes
to the ringbuffer and the tail update. If that's the case, I wonder where
all the bug reports are ...

Last one: which machines blow up when you drop that mb()?

Cheers, Daniel
On Tue, 9 Oct 2012 13:14:09 +0200, Daniel Vetter <daniel@ffwll.ch> wrote:
> One thing I wonder is whether we miss any barrier between the wc
> writes to the ringbuffer and the tail update. If that's the case, I
> wonder where all the bug reports are ...

Ditto. I've often wondered how we get away without a wmb() there ...

> Last one: which machines blow up when you drop that mb()?

pnv, though that's the only non-LLC machine I've been testing the
incomplete patch with, so I can't say it is limited to that machine.
-Chris
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 05ff790..c1ef0a8 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2771,9 +2771,22 @@ static void i830_write_fence_reg(struct drm_device *dev, int reg,
 	POSTING_READ(FENCE_REG_830_0 + reg * 4);
 }
 
+inline static bool i915_gem_object_needs_mb(struct drm_i915_gem_object *obj)
+{
+	return obj && obj->base.read_domains & I915_GEM_DOMAIN_GTT;
+}
+
 static void i915_gem_write_fence(struct drm_device *dev, int reg,
 				 struct drm_i915_gem_object *obj)
 {
+	struct drm_i915_private *dev_priv = dev->dev_private;
+
+	/* Ensure that all CPU reads are completed before installing a fence
+	 * and all writes before removing the fence.
+	 */
+	if (i915_gem_object_needs_mb(dev_priv->fence_regs[reg].obj))
+		mb();
+
 	switch (INTEL_INFO(dev)->gen) {
 	case 7:
 	case 6: sandybridge_write_fence_reg(dev, reg, obj); break;
@@ -2783,6 +2796,9 @@ static void i915_gem_write_fence(struct drm_device *dev, int reg,
 	case 2: i830_write_fence_reg(dev, reg, obj); break;
 	default: break;
 	}
+
+	if (i915_gem_object_needs_mb(obj))
+		mb();
 }
 
 static inline int fence_number(struct drm_i915_private *dev_priv,
@@ -2812,7 +2828,7 @@ static void i915_gem_object_update_fence(struct drm_i915_gem_object *obj,
 }
 
 static int
-i915_gem_object_flush_fence(struct drm_i915_gem_object *obj)
+i915_gem_object_wait_fence(struct drm_i915_gem_object *obj)
 {
 	if (obj->last_fenced_seqno) {
 		int ret = i915_wait_seqno(obj->ring, obj->last_fenced_seqno);
@@ -2822,12 +2838,6 @@ i915_gem_object_flush_fence(struct drm_i915_gem_object *obj)
 		obj->last_fenced_seqno = 0;
 	}
 
-	/* Ensure that all CPU reads are completed before installing a fence
-	 * and all writes before removing the fence.
-	 */
-	if (obj->base.read_domains & I915_GEM_DOMAIN_GTT)
-		mb();
-
 	obj->fenced_gpu_access = false;
 	return 0;
 }
@@ -2838,7 +2848,7 @@ i915_gem_object_put_fence(struct drm_i915_gem_object *obj)
 	struct drm_i915_private *dev_priv = obj->base.dev->dev_private;
 	int ret;
 
-	ret = i915_gem_object_flush_fence(obj);
+	ret = i915_gem_object_wait_fence(obj);
 	if (ret)
 		return ret;
 
@@ -2912,7 +2922,7 @@ i915_gem_object_get_fence(struct drm_i915_gem_object *obj)
 	 * will need to serialise the write to the associated fence register?
 	 */
 	if (obj->fence_dirty) {
-		ret = i915_gem_object_flush_fence(obj);
+		ret = i915_gem_object_wait_fence(obj);
 		if (ret)
 			return ret;
 	}
@@ -2933,7 +2943,7 @@ i915_gem_object_get_fence(struct drm_i915_gem_object *obj)
 	if (reg->obj) {
 		struct drm_i915_gem_object *old = reg->obj;
 
-		ret = i915_gem_object_flush_fence(old);
+		ret = i915_gem_object_wait_fence(old);
 		if (ret)
 			return ret;
 
@@ -3244,6 +3254,9 @@ i915_gem_object_set_to_gtt_domain(struct drm_i915_gem_object *obj, bool write)
 
 	i915_gem_object_flush_cpu_write_domain(obj);
 
+	if ((obj->base.read_domains & I915_GEM_DOMAIN_GTT) == 0)
+		mb();
+
 	old_write_domain = obj->base.write_domain;
 	old_read_domains = obj->base.read_domains;
With a fence, we only need to insert a memory barrier around the actual
fence alteration for CPU accesses through the GTT. Performing the
barrier in flush-fence was inserting unnecessary and expensive barriers
for never fenced objects.

Note removing the barriers from flush-fence, which was effectively a
barrier before every direct access through the GTT, revealed that we
were missing a barrier before the first access through the GTT. Lack of
that barrier was sufficient to cause GPU hangs.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/i915_gem.c | 33 +++++++++++++++++++++++----------
 1 file changed, 23 insertions(+), 10 deletions(-)