drm/i915: Unconditionally flush writes before execbuffer
diff mbox

Message ID 1431330696-15379-1-git-send-email-chris@chris-wilson.co.uk
State New
Headers show

Commit Message

Chris Wilson May 11, 2015, 7:51 a.m. UTC
With the advent of mmap(wc), we have a path to write directly into
active GPU buffers. When combined with async updates (i.e. avoiding the
explicit domain management along with the memory barriers and GPU
stalls) we start to see the GPU read the wrong values from memory - i.e.
we have insufficient memory barriers along the execbuffer path. Writes
through the GTT should have been naturally serialised with execution
through the GTT as well and so the impact only seems to be from the WC
paths.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Akash Goel <akash.goel@intel.com>
Cc: stable@vger.kernel.org
---
 drivers/gpu/drm/i915/i915_gem_execbuffer.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

Comments

Daniel Vetter May 11, 2015, 10:34 a.m. UTC | #1
On Mon, May 11, 2015 at 08:51:36AM +0100, Chris Wilson wrote:
> With the advent of mmap(wc), we have a path to write directly into
> active GPU buffers. When combined with async updates (i.e. avoiding the
> explicit domain management along with the memory barriers and GPU
> stalls) we start to see the GPU read the wrong values from memory - i.e.
> we have insufficient memory barriers along the execbuffer path. Writes
> through the GTT should have been naturally serialised with execution
> through the GTT as well and so the impact only seems to be from the WC
> paths.
> 
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Akash Goel <akash.goel@intel.com>
> Cc: stable@vger.kernel.org

Do we have a nasty igt for this? Bugzilla?
-Daniel

> ---
>  drivers/gpu/drm/i915/i915_gem_execbuffer.c | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
> index 650ae02484b0..4f97275ba799 100644
> --- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
> +++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
> @@ -1127,8 +1127,12 @@ i915_gem_execbuffer_move_to_gpu(struct intel_engine_cs *ring,
>  	if (flush_chipset)
>  		i915_gem_chipset_flush(ring->dev);
>  
> -	if (flush_domains & I915_GEM_DOMAIN_GTT)
> -		wmb();
> +	/* Unconditionally flush out writes to memory as the user may be
> +	 * doing asynchronous streaming writes to active buffers (i.e.
> +	 * lazy domain management to avoid serialisation) directly into
> +	 * the physical pages and so not naturally serialised by the GTT.
> +	 */
> +	wmb();
>  
>  	/* Unconditionally invalidate gpu caches and ensure that we do flush
>  	 * any residual writes from the previous batch.
> -- 
> 2.1.4
> 
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/intel-gfx
Chris Wilson May 11, 2015, 10:37 a.m. UTC | #2
On Mon, May 11, 2015 at 12:34:37PM +0200, Daniel Vetter wrote:
> On Mon, May 11, 2015 at 08:51:36AM +0100, Chris Wilson wrote:
> > With the advent of mmap(wc), we have a path to write directly into
> > active GPU buffers. When combined with async updates (i.e. avoiding the
> > explicit domain management along with the memory barriers and GPU
> > stalls) we start to see the GPU read the wrong values from memory - i.e.
> > we have insufficient memory barriers along the execbuffer path. Writes
> > through the GTT should have been naturally serialised with execution
> > through the GTT as well and so the impact only seems to be from the WC
> > paths.
> > 
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Akash Goel <akash.goel@intel.com>
> > Cc: stable@vger.kernel.org
> 
> Do we have a nasty igt for this? Bugzilla?

Maybe some of the 4.0 mmap(wc) regressions, we don't have an igt (yet).
The corruption, spotted whilst playing with enabling mmap(wc) for mesa
on byt, is very sporadic and so capturing it in a concise igt may be
quite difficult.
-Chris
Chris Wilson May 11, 2015, 3:25 p.m. UTC | #3
On Mon, May 11, 2015 at 12:34:37PM +0200, Daniel Vetter wrote:
> On Mon, May 11, 2015 at 08:51:36AM +0100, Chris Wilson wrote:
> > With the advent of mmap(wc), we have a path to write directly into
> > active GPU buffers. When combined with async updates (i.e. avoiding the
> > explicit domain management along with the memory barriers and GPU
> > stalls) we start to see the GPU read the wrong values from memory - i.e.
> > we have insufficient memory barriers along the execbuffer path. Writes
> > through the GTT should have been naturally serialised with execution
> > through the GTT as well and so the impact only seems to be from the WC
> > paths.
> > 
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Akash Goel <akash.goel@intel.com>
> > Cc: stable@vger.kernel.org
> 
> Do we have a nasty igt for this? Bugzilla?

I've added igt/gem_streaming_writes.

That wmb() is not enough for !llc. Since the wmb() made piglit happy it
is quite possible I haven't hit the same path exactly, but it's going to
take some investigation to see if igt/gem_streaming_writes can possibly
work on !llc.
-Chris
Chris Wilson May 12, 2015, 10:19 a.m. UTC | #4
On Mon, May 11, 2015 at 04:25:52PM +0100, Chris Wilson wrote:
> On Mon, May 11, 2015 at 12:34:37PM +0200, Daniel Vetter wrote:
> > On Mon, May 11, 2015 at 08:51:36AM +0100, Chris Wilson wrote:
> > > With the advent of mmap(wc), we have a path to write directly into
> > > active GPU buffers. When combined with async updates (i.e. avoiding the
> > > explicit domain management along with the memory barriers and GPU
> > > stalls) we start to see the GPU read the wrong values from memory - i.e.
> > > we have insufficient memory barriers along the execbuffer path. Writes
> > > through the GTT should have been naturally serialised with execution
> > > through the GTT as well and so the impact only seems to be from the WC
> > > paths.
> > > 
> > > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > > Cc: Akash Goel <akash.goel@intel.com>
> > > Cc: stable@vger.kernel.org
> > 
> > Do we have a nasty igt for this? Bugzilla?
> 
> I've added igt/gem_streaming_writes.

So far the only thing that makes a difference on !llc is hitting it with
wbinvd().

I wonder if EXEC_OBJECT_WBINVD (or EXEC_OBJECT_WMB to be less specific)
is acceptable?
-Chris
Shuang He May 14, 2015, 11:52 a.m. UTC | #5
Tested-By: Intel Graphics QA PRTS (Patch Regression Test System Contact: shuang.he@intel.com)
Task id: 6378
-------------------------------------Summary-------------------------------------
Platform          Delta          drm-intel-nightly          Series Applied
PNV                                  272/272              272/272
ILK                                  302/302              302/302
SNB                 -1              315/315              314/315
IVB                                  343/343              343/343
BYT                                  287/287              287/287
BDW                                  317/317              317/317
-------------------------------------Detailed-------------------------------------
Platform  Test                                drm-intel-nightly          Series Applied
 SNB  igt@pm_rpm@dpms-mode-unset-non-lpsp      DMESG_WARN(6)PASS(1)      DMESG_WARN(1)
(dmesg patch applied)WARNING:at_drivers/gpu/drm/i915/intel_uncore.c:#assert_device_not_suspended[i915]()@WARNING:.* at .* assert_device_not_suspended+0x
Note: You need to pay more attention to line start with '*'
Chris Wilson May 19, 2015, 2:41 p.m. UTC | #6
On Mon, May 11, 2015 at 04:25:52PM +0100, Chris Wilson wrote:
> On Mon, May 11, 2015 at 12:34:37PM +0200, Daniel Vetter wrote:
> > On Mon, May 11, 2015 at 08:51:36AM +0100, Chris Wilson wrote:
> > > With the advent of mmap(wc), we have a path to write directly into
> > > active GPU buffers. When combined with async updates (i.e. avoiding the
> > > explicit domain management along with the memory barriers and GPU
> > > stalls) we start to see the GPU read the wrong values from memory - i.e.
> > > we have insufficient memory barriers along the execbuffer path. Writes
> > > through the GTT should have been naturally serialised with execution
> > > through the GTT as well and so the impact only seems to be from the WC
> > > paths.
> > > 
> > > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > > Cc: Akash Goel <akash.goel@intel.com>
> > > Cc: stable@vger.kernel.org
> > 
> > Do we have a nasty igt for this? Bugzilla?
> 
> I've added igt/gem_streaming_writes.
> 
> That wmb() is not enough for !llc. Since the wmb() made piglit happy it
> is quite possible I haven't hit the same path exactly, but it's going to
> take some investigation to see if igt/gem_streaming_writes can possibly
> work on !llc.

Humbug.

Found the bug in gem_streaming_writes, even though I still think the
wmb() is strictly required, it runs fine without (presumably I haven't
managed to avoid all barriers in the execbuffer path yet). However, I
think can improve the stress by inserting extra gpu load -- that should
help make the CPU writes / GPU reads of the buffer concurrent?
-Chris
Chris Wilson May 21, 2015, 1 p.m. UTC | #7
On Tue, May 19, 2015 at 03:41:48PM +0100, Chris Wilson wrote:
> On Mon, May 11, 2015 at 04:25:52PM +0100, Chris Wilson wrote:
> > On Mon, May 11, 2015 at 12:34:37PM +0200, Daniel Vetter wrote:
> > > On Mon, May 11, 2015 at 08:51:36AM +0100, Chris Wilson wrote:
> > > > With the advent of mmap(wc), we have a path to write directly into
> > > > active GPU buffers. When combined with async updates (i.e. avoiding the
> > > > explicit domain management along with the memory barriers and GPU
> > > > stalls) we start to see the GPU read the wrong values from memory - i.e.
> > > > we have insufficient memory barriers along the execbuffer path. Writes
> > > > through the GTT should have been naturally serialised with execution
> > > > through the GTT as well and so the impact only seems to be from the WC
> > > > paths.
> > > > 
> > > > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > > > Cc: Akash Goel <akash.goel@intel.com>
> > > > Cc: stable@vger.kernel.org
> > > 
> > > Do we have a nasty igt for this? Bugzilla?
> > 
> > I've added igt/gem_streaming_writes.
> > 
> > That wmb() is not enough for !llc. Since the wmb() made piglit happy it
> > is quite possible I haven't hit the same path exactly, but it's going to
> > take some investigation to see if igt/gem_streaming_writes can possibly
> > work on !llc.
> 
> Humbug.
> 
> Found the bug in gem_streaming_writes, even though I still think the
> wmb() is strictly required, it runs fine without (presumably I haven't
> managed to avoid all barriers in the execbuffer path yet). However, I
> think can improve the stress by inserting extra gpu load -- that should
> help make the CPU writes / GPU reads of the buffer concurrent?

Just a small update. I haven't found a way to reproduce this in igt yet,
but I can still observe the effect using vbo-map-unsync and the fix
there is the above patch to make the wmb() unconditional.

We need to put this into stable@ reasonably quickly (I suspect some of
the 4.0 mmap(wc) regressions are due to this as well).
-Chris
Daniel Vetter May 21, 2015, 1:07 p.m. UTC | #8
On Thu, May 21, 2015 at 02:00:34PM +0100, Chris Wilson wrote:
> On Tue, May 19, 2015 at 03:41:48PM +0100, Chris Wilson wrote:
> > On Mon, May 11, 2015 at 04:25:52PM +0100, Chris Wilson wrote:
> > > On Mon, May 11, 2015 at 12:34:37PM +0200, Daniel Vetter wrote:
> > > > On Mon, May 11, 2015 at 08:51:36AM +0100, Chris Wilson wrote:
> > > > > With the advent of mmap(wc), we have a path to write directly into
> > > > > active GPU buffers. When combined with async updates (i.e. avoiding the
> > > > > explicit domain management along with the memory barriers and GPU
> > > > > stalls) we start to see the GPU read the wrong values from memory - i.e.
> > > > > we have insufficient memory barriers along the execbuffer path. Writes
> > > > > through the GTT should have been naturally serialised with execution
> > > > > through the GTT as well and so the impact only seems to be from the WC
> > > > > paths.
> > > > > 
> > > > > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > > > > Cc: Akash Goel <akash.goel@intel.com>
> > > > > Cc: stable@vger.kernel.org
> > > > 
> > > > Do we have a nasty igt for this? Bugzilla?
> > > 
> > > I've added igt/gem_streaming_writes.
> > > 
> > > That wmb() is not enough for !llc. Since the wmb() made piglit happy it
> > > is quite possible I haven't hit the same path exactly, but it's going to
> > > take some investigation to see if igt/gem_streaming_writes can possibly
> > > work on !llc.
> > 
> > Humbug.
> > 
> > Found the bug in gem_streaming_writes, even though I still think the
> > wmb() is strictly required, it runs fine without (presumably I haven't
> > managed to avoid all barriers in the execbuffer path yet). However, I
> > think can improve the stress by inserting extra gpu load -- that should
> > help make the CPU writes / GPU reads of the buffer concurrent?
> 
> Just a small update. I haven't found a way to reproduce this in igt yet,
> but I can still observe the effect using vbo-map-unsync and the fix
> there is the above patch to make the wmb() unconditional.
> 
> We need to put this into stable@ reasonably quickly (I suspect some of
> the 4.0 mmap(wc) regressions are due to this as well).

What about

	if (flush_domains & (GTT | CPU))
		wmb();

instead? That would imo explain things a lot better, since cpu wc is
treated as if in the CPU domains. Hm, looking at the igt that's not quite
the case, we still put it into the GTT domain for wc mmaps afaict.
-Daniel
Chris Wilson May 21, 2015, 1:13 p.m. UTC | #9
On Thu, May 21, 2015 at 03:07:54PM +0200, Daniel Vetter wrote:
> On Thu, May 21, 2015 at 02:00:34PM +0100, Chris Wilson wrote:
> > On Tue, May 19, 2015 at 03:41:48PM +0100, Chris Wilson wrote:
> > > On Mon, May 11, 2015 at 04:25:52PM +0100, Chris Wilson wrote:
> > > > On Mon, May 11, 2015 at 12:34:37PM +0200, Daniel Vetter wrote:
> > > > > On Mon, May 11, 2015 at 08:51:36AM +0100, Chris Wilson wrote:
> > > > > > With the advent of mmap(wc), we have a path to write directly into
> > > > > > active GPU buffers. When combined with async updates (i.e. avoiding the
> > > > > > explicit domain management along with the memory barriers and GPU
> > > > > > stalls) we start to see the GPU read the wrong values from memory - i.e.
> > > > > > we have insufficient memory barriers along the execbuffer path. Writes
> > > > > > through the GTT should have been naturally serialised with execution
> > > > > > through the GTT as well and so the impact only seems to be from the WC
> > > > > > paths.
> > > > > > 
> > > > > > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > > > > > Cc: Akash Goel <akash.goel@intel.com>
> > > > > > Cc: stable@vger.kernel.org
> > > > > 
> > > > > Do we have a nasty igt for this? Bugzilla?
> > > > 
> > > > I've added igt/gem_streaming_writes.
> > > > 
> > > > That wmb() is not enough for !llc. Since the wmb() made piglit happy it
> > > > is quite possible I haven't hit the same path exactly, but it's going to
> > > > take some investigation to see if igt/gem_streaming_writes can possibly
> > > > work on !llc.
> > > 
> > > Humbug.
> > > 
> > > Found the bug in gem_streaming_writes, even though I still think the
> > > wmb() is strictly required, it runs fine without (presumably I haven't
> > > managed to avoid all barriers in the execbuffer path yet). However, I
> > > think can improve the stress by inserting extra gpu load -- that should
> > > help make the CPU writes / GPU reads of the buffer concurrent?
> > 
> > Just a small update. I haven't found a way to reproduce this in igt yet,
> > but I can still observe the effect using vbo-map-unsync and the fix
> > there is the above patch to make the wmb() unconditional.
> > 
> > We need to put this into stable@ reasonably quickly (I suspect some of
> > the 4.0 mmap(wc) regressions are due to this as well).
> 
> What about
> 
> 	if (flush_domains & (GTT | CPU))
> 		wmb();
> 
> instead? That would imo explain things a lot better, since cpu wc is
> treated as if in the CPU domains. Hm, looking at the igt that's not quite
> the case, we still put it into the GTT domain for wc mmaps afaict.

No. flush_domains is 0. We are talking about async writes which means
that userspace is not telling the kernel about susbsequent writes into
the inactive portions of the bo, and trusting that the buffer is
coherent and the writes are flushed. Putting the wmb() in the kernel is
not the only solution, but the most convenient (and allows us to just
emit one wmb() - but given the large number of other potential barriers
in this path, I am surprised that is required. Empirical evidence to the
contrary!)
-Chris
Daniel Vetter May 21, 2015, 2:21 p.m. UTC | #10
On Thu, May 21, 2015 at 02:13:01PM +0100, Chris Wilson wrote:
> On Thu, May 21, 2015 at 03:07:54PM +0200, Daniel Vetter wrote:
> > On Thu, May 21, 2015 at 02:00:34PM +0100, Chris Wilson wrote:
> > > On Tue, May 19, 2015 at 03:41:48PM +0100, Chris Wilson wrote:
> > > > On Mon, May 11, 2015 at 04:25:52PM +0100, Chris Wilson wrote:
> > > > > On Mon, May 11, 2015 at 12:34:37PM +0200, Daniel Vetter wrote:
> > > > > > On Mon, May 11, 2015 at 08:51:36AM +0100, Chris Wilson wrote:
> > > > > > > With the advent of mmap(wc), we have a path to write directly into
> > > > > > > active GPU buffers. When combined with async updates (i.e. avoiding the
> > > > > > > explicit domain management along with the memory barriers and GPU
> > > > > > > stalls) we start to see the GPU read the wrong values from memory - i.e.
> > > > > > > we have insufficient memory barriers along the execbuffer path. Writes
> > > > > > > through the GTT should have been naturally serialised with execution
> > > > > > > through the GTT as well and so the impact only seems to be from the WC
> > > > > > > paths.
> > > > > > > 
> > > > > > > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > > > > > > Cc: Akash Goel <akash.goel@intel.com>
> > > > > > > Cc: stable@vger.kernel.org
> > > > > > 
> > > > > > Do we have a nasty igt for this? Bugzilla?
> > > > > 
> > > > > I've added igt/gem_streaming_writes.
> > > > > 
> > > > > That wmb() is not enough for !llc. Since the wmb() made piglit happy it
> > > > > is quite possible I haven't hit the same path exactly, but it's going to
> > > > > take some investigation to see if igt/gem_streaming_writes can possibly
> > > > > work on !llc.
> > > > 
> > > > Humbug.
> > > > 
> > > > Found the bug in gem_streaming_writes, even though I still think the
> > > > wmb() is strictly required, it runs fine without (presumably I haven't
> > > > managed to avoid all barriers in the execbuffer path yet). However, I
> > > > think can improve the stress by inserting extra gpu load -- that should
> > > > help make the CPU writes / GPU reads of the buffer concurrent?
> > > 
> > > Just a small update. I haven't found a way to reproduce this in igt yet,
> > > but I can still observe the effect using vbo-map-unsync and the fix
> > > there is the above patch to make the wmb() unconditional.
> > > 
> > > We need to put this into stable@ reasonably quickly (I suspect some of
> > > the 4.0 mmap(wc) regressions are due to this as well).
> > 
> > What about
> > 
> > 	if (flush_domains & (GTT | CPU))
> > 		wmb();
> > 
> > instead? That would imo explain things a lot better, since cpu wc is
> > treated as if in the CPU domains. Hm, looking at the igt that's not quite
> > the case, we still put it into the GTT domain for wc mmaps afaict.
> 
> No. flush_domains is 0. We are talking about async writes which means
> that userspace is not telling the kernel about susbsequent writes into
> the inactive portions of the bo, and trusting that the buffer is
> coherent and the writes are flushed. Putting the wmb() in the kernel is
> not the only solution, but the most convenient (and allows us to just
> emit one wmb() - but given the large number of other potential barriers
> in this path, I am surprised that is required. Empirical evidence to the
> contrary!)

Hm right. What about emphasising this a bit more in the comment:

	/*
	 * Empirical evidence indicates that we need a write barrier to
	 * make sure write-combined writes (both to the gtt, but also to
	 * the cpu mmaps). But userspace also uses wc mmaps as
	 * unsynchronized upload paths where it inform the kernel about
	 * domain changes (to avoid the stalls). Hence we must do this
	 * barrier unconditinally.
	 */

Mostly just rewording, unsing unsynchronized as used by gl/libdrm and
clarification why we need to have the barrier unconditionally. With that

Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>

And I guess also

Cc: stable@vger.kernel.org
Chris Wilson May 21, 2015, 3:22 p.m. UTC | #11
On Thu, May 21, 2015 at 04:21:46PM +0200, Daniel Vetter wrote:
> Hm right. What about emphasising this a bit more in the comment:
> 
> 	/*
> 	 * Empirical evidence indicates that we need a write barrier to
> 	 * make sure write-combined writes (both to the gtt, but also to
> 	 * the cpu mmaps). But userspace also uses wc mmaps as
> 	 * unsynchronized upload paths where it inform the kernel about
> 	 * domain changes (to avoid the stalls). Hence we must do this
> 	 * barrier unconditinally.
> 	 */

For reference the wording in the commit is:

/* Unconditionally flush out writes to memory as the user may be
 * doing asynchronous streaming writes to active buffers (i.e.
 * lazy domain management to avoid serialisation) directly into
 * the physical pages and so not naturally serialised by the GTT.
 */

> Mostly just rewording, unsing unsynchronized as used by gl/libdrm and
> clarification why we need to have the barrier unconditionally. With that

Hmm, glMapBufferRange does use unsynchronized, but async is almost
universally preferred when talking about io and runqueues.

/* Unconditionally flush out writes to memory as the user may be
 * doing asynchronous streaming writes to active buffers in this
 * batch (i.e. lazy domain management to avoid serialisation, for
 * example with glMapBufferRange(GL_MAP_UNSYNCHRONIZED_BIT)) directly
 * into the physical pages and so not naturally serialised by the GTT.
 */

> Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
> 
> And I guess also
> 
> Cc: stable@vger.kernel.org

It already was ;-)
-Chris
Daniel Vetter May 21, 2015, 3:30 p.m. UTC | #12
On Thu, May 21, 2015 at 04:22:55PM +0100, Chris Wilson wrote:
> On Thu, May 21, 2015 at 04:21:46PM +0200, Daniel Vetter wrote:
> > Hm right. What about emphasising this a bit more in the comment:
> > 
> > 	/*
> > 	 * Empirical evidence indicates that we need a write barrier to
> > 	 * make sure write-combined writes (both to the gtt, but also to
> > 	 * the cpu mmaps). But userspace also uses wc mmaps as
> > 	 * unsynchronized upload paths where it inform the kernel about
> > 	 * domain changes (to avoid the stalls). Hence we must do this
> > 	 * barrier unconditinally.
> > 	 */
> 
> For reference the wording in the commit is:
> 
> /* Unconditionally flush out writes to memory as the user may be
>  * doing asynchronous streaming writes to active buffers (i.e.
>  * lazy domain management to avoid serialisation) directly into
>  * the physical pages and so not naturally serialised by the GTT.
>  */
> 
> > Mostly just rewording, unsing unsynchronized as used by gl/libdrm and
> > clarification why we need to have the barrier unconditionally. With that
> 
> Hmm, glMapBufferRange does use unsynchronized, but async is almost
> universally preferred when talking about io and runqueues.
> 
> /* Unconditionally flush out writes to memory as the user may be
>  * doing asynchronous streaming writes to active buffers in this
>  * batch (i.e. lazy domain management to avoid serialisation, for
>  * example with glMapBufferRange(GL_MAP_UNSYNCHRONIZED_BIT)) directly
>  * into the physical pages and so not naturally serialised by the GTT.
>  */
> 
> > Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>

Yeah, r-b: me also with this one. Jani, can you please exchange the
comment and apply to -fixes?
-Daniel

> > 
> > And I guess also
> > 
> > Cc: stable@vger.kernel.org
> 
> It already was ;-)
> -Chris
> 
> -- 
> Chris Wilson, Intel Open Source Technology Centre
Jesse Barnes May 21, 2015, 8:29 p.m. UTC | #13
On 05/21/2015 06:00 AM, Chris Wilson wrote:
> On Tue, May 19, 2015 at 03:41:48PM +0100, Chris Wilson wrote:
>> On Mon, May 11, 2015 at 04:25:52PM +0100, Chris Wilson wrote:
>>> On Mon, May 11, 2015 at 12:34:37PM +0200, Daniel Vetter wrote:
>>>> On Mon, May 11, 2015 at 08:51:36AM +0100, Chris Wilson wrote:
>>>>> With the advent of mmap(wc), we have a path to write directly into
>>>>> active GPU buffers. When combined with async updates (i.e. avoiding the
>>>>> explicit domain management along with the memory barriers and GPU
>>>>> stalls) we start to see the GPU read the wrong values from memory - i.e.
>>>>> we have insufficient memory barriers along the execbuffer path. Writes
>>>>> through the GTT should have been naturally serialised with execution
>>>>> through the GTT as well and so the impact only seems to be from the WC
>>>>> paths.
>>>>>
>>>>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>>>>> Cc: Akash Goel <akash.goel@intel.com>
>>>>> Cc: stable@vger.kernel.org
>>>>
>>>> Do we have a nasty igt for this? Bugzilla?
>>>
>>> I've added igt/gem_streaming_writes.
>>>
>>> That wmb() is not enough for !llc. Since the wmb() made piglit happy it
>>> is quite possible I haven't hit the same path exactly, but it's going to
>>> take some investigation to see if igt/gem_streaming_writes can possibly
>>> work on !llc.
>>
>> Humbug.
>>
>> Found the bug in gem_streaming_writes, even though I still think the
>> wmb() is strictly required, it runs fine without (presumably I haven't
>> managed to avoid all barriers in the execbuffer path yet). However, I
>> think can improve the stress by inserting extra gpu load -- that should
>> help make the CPU writes / GPU reads of the buffer concurrent?
> 
> Just a small update. I haven't found a way to reproduce this in igt yet,
> but I can still observe the effect using vbo-map-unsync and the fix
> there is the above patch to make the wmb() unconditional.
> 
> We need to put this into stable@ reasonably quickly (I suspect some of
> the 4.0 mmap(wc) regressions are due to this as well).

So the symptom is that the GPU picks up older values from memory, and
your theory is that the wmb() kicks the values out of the store buffer
or WC buffer prior to the execbuf?

I think that's reasonable, and I'm hoping "globally visible" is spec'd
to include the GPU and other system agents in the sfence definition.

Jesse
Daniel Vetter May 26, 2015, 8 a.m. UTC | #14
On Thu, May 21, 2015 at 5:30 PM, Daniel Vetter <daniel@ffwll.ch> wrote:
> On Thu, May 21, 2015 at 04:22:55PM +0100, Chris Wilson wrote:
>> On Thu, May 21, 2015 at 04:21:46PM +0200, Daniel Vetter wrote:
>> > Hm right. What about emphasising this a bit more in the comment:
>> >
>> >     /*
>> >      * Empirical evidence indicates that we need a write barrier to
>> >      * make sure write-combined writes (both to the gtt, but also to
>> >      * the cpu mmaps). But userspace also uses wc mmaps as
>> >      * unsynchronized upload paths where it inform the kernel about
>> >      * domain changes (to avoid the stalls). Hence we must do this
>> >      * barrier unconditinally.
>> >      */
>>
>> For reference the wording in the commit is:
>>
>> /* Unconditionally flush out writes to memory as the user may be
>>  * doing asynchronous streaming writes to active buffers (i.e.
>>  * lazy domain management to avoid serialisation) directly into
>>  * the physical pages and so not naturally serialised by the GTT.
>>  */
>>
>> > Mostly just rewording, unsing unsynchronized as used by gl/libdrm and
>> > clarification why we need to have the barrier unconditionally. With that
>>
>> Hmm, glMapBufferRange does use unsynchronized, but async is almost
>> universally preferred when talking about io and runqueues.
>>
>> /* Unconditionally flush out writes to memory as the user may be
>>  * doing asynchronous streaming writes to active buffers in this
>>  * batch (i.e. lazy domain management to avoid serialisation, for
>>  * example with glMapBufferRange(GL_MAP_UNSYNCHRONIZED_BIT)) directly
>>  * into the physical pages and so not naturally serialised by the GTT.
>>  */
>>
>> > Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
>
> Yeah, r-b: me also with this one. Jani, can you please exchange the
> comment and apply to -fixes?

I retract my r-b for now, this seems to be a much bigger can of worms
- it seems like we need to make all the mb()/wmb() we have all over
the place unconditional.
-Daniel

Patch
diff mbox

diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
index 650ae02484b0..4f97275ba799 100644
--- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
@@ -1127,8 +1127,12 @@  i915_gem_execbuffer_move_to_gpu(struct intel_engine_cs *ring,
 	if (flush_chipset)
 		i915_gem_chipset_flush(ring->dev);
 
-	if (flush_domains & I915_GEM_DOMAIN_GTT)
-		wmb();
+	/* Unconditionally flush out writes to memory as the user may be
+	 * doing asynchronous streaming writes to active buffers (i.e.
+	 * lazy domain management to avoid serialisation) directly into
+	 * the physical pages and so not naturally serialised by the GTT.
+	 */
+	wmb();
 
 	/* Unconditionally invalidate gpu caches and ensure that we do flush
 	 * any residual writes from the previous batch.