diff mbox

[55/56] drm/i915: Remove 'faked' request from LRC submission

Message ID 1425563852-21349-2-git-send-email-John.C.Harrison@Intel.com (mailing list archive)
State New, archived
Headers show

Commit Message

John Harrison March 5, 2015, 1:57 p.m. UTC
From: John Harrison <John.C.Harrison@Intel.com>

The LRC submission code requires a request for tracking purposes. It does not
actually require that request to 'complete' it simply uses it for keeping hold
of reference counts on contexts and such like.

In the case where the ring buffer is completely full, the LRC code looks for a
pending request that would free up sufficient space upon completion and waits
for it. If no such request can be found it resorts to simply polling the free
space count until it is big enough. This situation should only occur when the
entire buffer is filled with the request currently being generated. I.e., the
user is trying to submit a single piece of work that is large than the ring
buffer itself (which should be impossible because very large batch buffers don't
consume any more ring buffer space). Before starting to poll, a submit call is
made to make sure that the currently queued up work in the buffer will actually
be submtted and thus the poll will eventually succeed.

The problem here is that the 'official' request cannot be used as that could
lead to multiple LRC submissions being tagged to a single request structure.
Instead, the code fakes up a private request structure and uses that.

This patch moves the faked request allocation higher up in the call stack to the
wait code itself (rather than being at the very lowest submission level). Thus
it is now obvious where the faked request is coming from and why it is
necessary. The patch also replaces it with a call to the official request
allocation code rather than attempting to duplicate that code. This becomes
especially important in the future when the request allocation changes to
accommodate a conversion to struct fence.

For: VIZ-5115
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
---
 drivers/gpu/drm/i915/intel_lrc.c |   45 ++++++++++++++++++++++----------------
 1 file changed, 26 insertions(+), 19 deletions(-)

Comments

Daniel Vetter March 5, 2015, 2:49 p.m. UTC | #1
On Thu, Mar 05, 2015 at 01:57:31PM +0000, John.C.Harrison@Intel.com wrote:
> From: John Harrison <John.C.Harrison@Intel.com>
> 
> The LRC submission code requires a request for tracking purposes. It does not
> actually require that request to 'complete' it simply uses it for keeping hold
> of reference counts on contexts and such like.
> 
> In the case where the ring buffer is completely full, the LRC code looks for a
> pending request that would free up sufficient space upon completion and waits
> for it. If no such request can be found it resorts to simply polling the free
> space count until it is big enough. This situation should only occur when the
> entire buffer is filled with the request currently being generated. I.e., the
> user is trying to submit a single piece of work that is large than the ring
> buffer itself (which should be impossible because very large batch buffers don't
> consume any more ring buffer space). Before starting to poll, a submit call is
> made to make sure that the currently queued up work in the buffer will actually
> be submtted and thus the poll will eventually succeed.
> 
> The problem here is that the 'official' request cannot be used as that could
> lead to multiple LRC submissions being tagged to a single request structure.
> Instead, the code fakes up a private request structure and uses that.
> 
> This patch moves the faked request allocation higher up in the call stack to the
> wait code itself (rather than being at the very lowest submission level). Thus
> it is now obvious where the faked request is coming from and why it is
> necessary. The patch also replaces it with a call to the official request
> allocation code rather than attempting to duplicate that code. This becomes
> especially important in the future when the request allocation changes to
> accommodate a conversion to struct fence.
> 
> For: VIZ-5115
> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>

This is only possible if you pile up tons of olr. Since your patch series
fixes this design issue by removing olr I think we can just put a WARN_ON
in here if this ever happens and bail out with -ELOSTMYMARBLES or
something. And then rip out all this complexity.

Or do I miss something important?
-Daniel

> ---
>  drivers/gpu/drm/i915/intel_lrc.c |   45 ++++++++++++++++++++++----------------
>  1 file changed, 26 insertions(+), 19 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index 65eea51..1fa36de 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -507,23 +507,11 @@ static int execlists_context_queue(struct intel_engine_cs *ring,
>  	if (to != ring->default_context)
>  		intel_lr_context_pin(ring, to);
>  
> -	if (!request) {
> -		/*
> -		 * If there isn't a request associated with this submission,
> -		 * create one as a temporary holder.
> -		 */
> -		request = kzalloc(sizeof(*request), GFP_KERNEL);
> -		if (request == NULL)
> -			return -ENOMEM;
> -		request->ring = ring;
> -		request->ctx = to;
> -		kref_init(&request->ref);
> -		request->uniq = dev_priv->request_uniq++;
> -		i915_gem_context_reference(request->ctx);
> -	} else {
> -		i915_gem_request_reference(request);
> -		WARN_ON(to != request->ctx);
> -	}
> +	WARN_ON(!request);
> +	WARN_ON(to != request->ctx);
> +
> +	i915_gem_request_reference(request);
> +
>  	request->tail = tail;
>  
>  	intel_runtime_pm_get(dev_priv);
> @@ -677,6 +665,7 @@ static int logical_ring_wait_for_space(struct intel_ringbuffer *ringbuf,
>  	struct intel_engine_cs *ring = ringbuf->ring;
>  	struct drm_device *dev = ring->dev;
>  	struct drm_i915_private *dev_priv = dev->dev_private;
> +	struct drm_i915_gem_request *local_req;
>  	unsigned long end;
>  	int ret;
>  
> @@ -684,8 +673,23 @@ static int logical_ring_wait_for_space(struct intel_ringbuffer *ringbuf,
>  	if (ret != -ENOSPC)
>  		return ret;
>  
> -	/* Force the context submission in case we have been skipping it */
> -	intel_logical_ring_advance_and_submit(ringbuf, ctx, NULL);
> +	/*
> +	 * Force the context submission in case we have been skipping it.
> +	 * This requires creating a place holder request so that the LRC
> +	 * submission can be tracked. Note that if this point has been
> +	 * reached then it is the current submission that is blocking the
> +	 * driver and the only course of action is to do a partial send and
> +	 * wait for it to complete.
> +	 * Note also that because there is no space left in the ring, it is
> +	 * not possible to write the request submission prologue (which does
> +	 * things like update seqno values and trigger completion interrupts).
> +	 * Thus the request cannot be submitted via i915_add_request() and
> +	 * can not be waiting on by i915_gem_wait_request().
> +	 */
> +	ret = dev_priv->gt.alloc_request(ring, ctx, &local_req);
> +	if (ret)
> +		return ret;
> +	intel_logical_ring_advance_and_submit(ringbuf, ctx, local_req);
>  
>  	/* With GEM the hangcheck timer should kick us out of the loop,
>  	 * leaving it early runs the risk of corrupting GEM state (due
> @@ -717,6 +721,9 @@ static int logical_ring_wait_for_space(struct intel_ringbuffer *ringbuf,
>  		}
>  	} while (1);
>  
> +	/* This request is now done with and can be disposed of. */
> +	i915_gem_request_unreference(local_req);
> +
>  	return ret;
>  }
>  
> -- 
> 1.7.9.5
> 
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/intel-gfx
John Harrison March 11, 2015, 2:53 p.m. UTC | #2
On 05/03/2015 14:49, Daniel Vetter wrote:
> On Thu, Mar 05, 2015 at 01:57:31PM +0000, John.C.Harrison@Intel.com wrote:
>> From: John Harrison <John.C.Harrison@Intel.com>
>>
>> The LRC submission code requires a request for tracking purposes. It does not
>> actually require that request to 'complete' it simply uses it for keeping hold
>> of reference counts on contexts and such like.
>>
>> In the case where the ring buffer is completely full, the LRC code looks for a
>> pending request that would free up sufficient space upon completion and waits
>> for it. If no such request can be found it resorts to simply polling the free
>> space count until it is big enough. This situation should only occur when the
>> entire buffer is filled with the request currently being generated. I.e., the
>> user is trying to submit a single piece of work that is large than the ring
>> buffer itself (which should be impossible because very large batch buffers don't
>> consume any more ring buffer space). Before starting to poll, a submit call is
>> made to make sure that the currently queued up work in the buffer will actually
>> be submtted and thus the poll will eventually succeed.
>>
>> The problem here is that the 'official' request cannot be used as that could
>> lead to multiple LRC submissions being tagged to a single request structure.
>> Instead, the code fakes up a private request structure and uses that.
>>
>> This patch moves the faked request allocation higher up in the call stack to the
>> wait code itself (rather than being at the very lowest submission level). Thus
>> it is now obvious where the faked request is coming from and why it is
>> necessary. The patch also replaces it with a call to the official request
>> allocation code rather than attempting to duplicate that code. This becomes
>> especially important in the future when the request allocation changes to
>> accommodate a conversion to struct fence.
>>
>> For: VIZ-5115
>> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
> This is only possible if you pile up tons of olr. Since your patch series
> fixes this design issue by removing olr I think we can just put a WARN_ON
> in here if this ever happens and bail out with -ELOSTMYMARBLES or
> something. And then rip out all this complexity.
>
> Or do I miss something important?
> -Daniel

Yeah, you missed the extremely important bug in the free space 
calculation that meant this impossible code path was being hit on a 
regular basis. The LRC wait_request code differed from the legacy 
wait_request code in the the latter was updated with request->postfix 
changes and the former was not. Thus the LRC one would happily find a 
request that frees up enough space, wait on it, retire it and then find 
there was still not enough space!

New patches to fix the space calculation bug and to completely remove 
the polling path will be forth coming...


>> ---
>>   drivers/gpu/drm/i915/intel_lrc.c |   45 ++++++++++++++++++++++----------------
>>   1 file changed, 26 insertions(+), 19 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
>> index 65eea51..1fa36de 100644
>> --- a/drivers/gpu/drm/i915/intel_lrc.c
>> +++ b/drivers/gpu/drm/i915/intel_lrc.c
>> @@ -507,23 +507,11 @@ static int execlists_context_queue(struct intel_engine_cs *ring,
>>   	if (to != ring->default_context)
>>   		intel_lr_context_pin(ring, to);
>>   
>> -	if (!request) {
>> -		/*
>> -		 * If there isn't a request associated with this submission,
>> -		 * create one as a temporary holder.
>> -		 */
>> -		request = kzalloc(sizeof(*request), GFP_KERNEL);
>> -		if (request == NULL)
>> -			return -ENOMEM;
>> -		request->ring = ring;
>> -		request->ctx = to;
>> -		kref_init(&request->ref);
>> -		request->uniq = dev_priv->request_uniq++;
>> -		i915_gem_context_reference(request->ctx);
>> -	} else {
>> -		i915_gem_request_reference(request);
>> -		WARN_ON(to != request->ctx);
>> -	}
>> +	WARN_ON(!request);
>> +	WARN_ON(to != request->ctx);
>> +
>> +	i915_gem_request_reference(request);
>> +
>>   	request->tail = tail;
>>   
>>   	intel_runtime_pm_get(dev_priv);
>> @@ -677,6 +665,7 @@ static int logical_ring_wait_for_space(struct intel_ringbuffer *ringbuf,
>>   	struct intel_engine_cs *ring = ringbuf->ring;
>>   	struct drm_device *dev = ring->dev;
>>   	struct drm_i915_private *dev_priv = dev->dev_private;
>> +	struct drm_i915_gem_request *local_req;
>>   	unsigned long end;
>>   	int ret;
>>   
>> @@ -684,8 +673,23 @@ static int logical_ring_wait_for_space(struct intel_ringbuffer *ringbuf,
>>   	if (ret != -ENOSPC)
>>   		return ret;
>>   
>> -	/* Force the context submission in case we have been skipping it */
>> -	intel_logical_ring_advance_and_submit(ringbuf, ctx, NULL);
>> +	/*
>> +	 * Force the context submission in case we have been skipping it.
>> +	 * This requires creating a place holder request so that the LRC
>> +	 * submission can be tracked. Note that if this point has been
>> +	 * reached then it is the current submission that is blocking the
>> +	 * driver and the only course of action is to do a partial send and
>> +	 * wait for it to complete.
>> +	 * Note also that because there is no space left in the ring, it is
>> +	 * not possible to write the request submission prologue (which does
>> +	 * things like update seqno values and trigger completion interrupts).
>> +	 * Thus the request cannot be submitted via i915_add_request() and
>> +	 * can not be waiting on by i915_gem_wait_request().
>> +	 */
>> +	ret = dev_priv->gt.alloc_request(ring, ctx, &local_req);
>> +	if (ret)
>> +		return ret;
>> +	intel_logical_ring_advance_and_submit(ringbuf, ctx, local_req);
>>   
>>   	/* With GEM the hangcheck timer should kick us out of the loop,
>>   	 * leaving it early runs the risk of corrupting GEM state (due
>> @@ -717,6 +721,9 @@ static int logical_ring_wait_for_space(struct intel_ringbuffer *ringbuf,
>>   		}
>>   	} while (1);
>>   
>> +	/* This request is now done with and can be disposed of. */
>> +	i915_gem_request_unreference(local_req);
>> +
>>   	return ret;
>>   }
>>   
>> -- 
>> 1.7.9.5
>>
>> _______________________________________________
>> Intel-gfx mailing list
>> Intel-gfx@lists.freedesktop.org
>> http://lists.freedesktop.org/mailman/listinfo/intel-gfx
Daniel Vetter March 11, 2015, 4:14 p.m. UTC | #3
On Wed, Mar 11, 2015 at 02:53:39PM +0000, John Harrison wrote:
> On 05/03/2015 14:49, Daniel Vetter wrote:
> >On Thu, Mar 05, 2015 at 01:57:31PM +0000, John.C.Harrison@Intel.com wrote:
> >>From: John Harrison <John.C.Harrison@Intel.com>
> >>
> >>The LRC submission code requires a request for tracking purposes. It does not
> >>actually require that request to 'complete' it simply uses it for keeping hold
> >>of reference counts on contexts and such like.
> >>
> >>In the case where the ring buffer is completely full, the LRC code looks for a
> >>pending request that would free up sufficient space upon completion and waits
> >>for it. If no such request can be found it resorts to simply polling the free
> >>space count until it is big enough. This situation should only occur when the
> >>entire buffer is filled with the request currently being generated. I.e., the
> >>user is trying to submit a single piece of work that is large than the ring
> >>buffer itself (which should be impossible because very large batch buffers don't
> >>consume any more ring buffer space). Before starting to poll, a submit call is
> >>made to make sure that the currently queued up work in the buffer will actually
> >>be submtted and thus the poll will eventually succeed.
> >>
> >>The problem here is that the 'official' request cannot be used as that could
> >>lead to multiple LRC submissions being tagged to a single request structure.
> >>Instead, the code fakes up a private request structure and uses that.
> >>
> >>This patch moves the faked request allocation higher up in the call stack to the
> >>wait code itself (rather than being at the very lowest submission level). Thus
> >>it is now obvious where the faked request is coming from and why it is
> >>necessary. The patch also replaces it with a call to the official request
> >>allocation code rather than attempting to duplicate that code. This becomes
> >>especially important in the future when the request allocation changes to
> >>accommodate a conversion to struct fence.
> >>
> >>For: VIZ-5115
> >>Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
> >This is only possible if you pile up tons of olr. Since your patch series
> >fixes this design issue by removing olr I think we can just put a WARN_ON
> >in here if this ever happens and bail out with -ELOSTMYMARBLES or
> >something. And then rip out all this complexity.
> >
> >Or do I miss something important?
> >-Daniel
> 
> Yeah, you missed the extremely important bug in the free space calculation
> that meant this impossible code path was being hit on a regular basis. The
> LRC wait_request code differed from the legacy wait_request code in the the
> latter was updated with request->postfix changes and the former was not.
> Thus the LRC one would happily find a request that frees up enough space,
> wait on it, retire it and then find there was still not enough space!
> 
> New patches to fix the space calculation bug and to completely remove the
> polling path will be forth coming...

Hm, Jesse did dig out a regression where gem_ringfill blows up on
execlist. That igt is specifically to exercise this cornercases. I'm not
sure whith bugzilla Jesse meant, there's two:

https://bugs.freedesktop.org/show_bug.cgi?id=89001
https://bugs.freedesktop.org/show_bug.cgi?id=88865

Can you please check whether your fixes help for those issues two?

Also adding Jesse since he's chasing these atm.
-Daniel
Jesse Barnes March 11, 2015, 4:44 p.m. UTC | #4
On 03/11/2015 09:14 AM, Daniel Vetter wrote:
> On Wed, Mar 11, 2015 at 02:53:39PM +0000, John Harrison wrote:
>> On 05/03/2015 14:49, Daniel Vetter wrote:
>>> On Thu, Mar 05, 2015 at 01:57:31PM +0000, John.C.Harrison@Intel.com wrote:
>>>> From: John Harrison <John.C.Harrison@Intel.com>
>>>>
>>>> The LRC submission code requires a request for tracking purposes. It does not
>>>> actually require that request to 'complete' it simply uses it for keeping hold
>>>> of reference counts on contexts and such like.
>>>>
>>>> In the case where the ring buffer is completely full, the LRC code looks for a
>>>> pending request that would free up sufficient space upon completion and waits
>>>> for it. If no such request can be found it resorts to simply polling the free
>>>> space count until it is big enough. This situation should only occur when the
>>>> entire buffer is filled with the request currently being generated. I.e., the
>>>> user is trying to submit a single piece of work that is large than the ring
>>>> buffer itself (which should be impossible because very large batch buffers don't
>>>> consume any more ring buffer space). Before starting to poll, a submit call is
>>>> made to make sure that the currently queued up work in the buffer will actually
>>>> be submtted and thus the poll will eventually succeed.
>>>>
>>>> The problem here is that the 'official' request cannot be used as that could
>>>> lead to multiple LRC submissions being tagged to a single request structure.
>>>> Instead, the code fakes up a private request structure and uses that.
>>>>
>>>> This patch moves the faked request allocation higher up in the call stack to the
>>>> wait code itself (rather than being at the very lowest submission level). Thus
>>>> it is now obvious where the faked request is coming from and why it is
>>>> necessary. The patch also replaces it with a call to the official request
>>>> allocation code rather than attempting to duplicate that code. This becomes
>>>> especially important in the future when the request allocation changes to
>>>> accommodate a conversion to struct fence.
>>>>
>>>> For: VIZ-5115
>>>> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
>>> This is only possible if you pile up tons of olr. Since your patch series
>>> fixes this design issue by removing olr I think we can just put a WARN_ON
>>> in here if this ever happens and bail out with -ELOSTMYMARBLES or
>>> something. And then rip out all this complexity.
>>>
>>> Or do I miss something important?
>>> -Daniel
>>
>> Yeah, you missed the extremely important bug in the free space calculation
>> that meant this impossible code path was being hit on a regular basis. The
>> LRC wait_request code differed from the legacy wait_request code in the the
>> latter was updated with request->postfix changes and the former was not.
>> Thus the LRC one would happily find a request that frees up enough space,
>> wait on it, retire it and then find there was still not enough space!
>>
>> New patches to fix the space calculation bug and to completely remove the
>> polling path will be forth coming...
> 
> Hm, Jesse did dig out a regression where gem_ringfill blows up on
> execlist. That igt is specifically to exercise this cornercases. I'm not
> sure whith bugzilla Jesse meant, there's two:
> 
> https://bugs.freedesktop.org/show_bug.cgi?id=89001
> https://bugs.freedesktop.org/show_bug.cgi?id=88865
> 
> Can you please check whether your fixes help for those issues two?
> 
> Also adding Jesse since he's chasing these atm.

Ah cool, sounds related at least.  89001 was the one I looked at
yesterday.  The worrying thing was that this bug caused a hard system
hang.  Not even the reset button worked...  I guess we should have an
HSD for that too so we can root cause the system part of it.

Jesse
John Harrison March 11, 2015, 8:45 p.m. UTC | #5
On 11/03/2015 16:44, Jesse Barnes wrote:
> On 03/11/2015 09:14 AM, Daniel Vetter wrote:
>> On Wed, Mar 11, 2015 at 02:53:39PM +0000, John Harrison wrote:
>>> On 05/03/2015 14:49, Daniel Vetter wrote:
>>>> On Thu, Mar 05, 2015 at 01:57:31PM +0000, John.C.Harrison@Intel.com wrote:
>>>>> From: John Harrison <John.C.Harrison@Intel.com>
>>>>>
>>>>> The LRC submission code requires a request for tracking purposes. It does not
>>>>> actually require that request to 'complete' it simply uses it for keeping hold
>>>>> of reference counts on contexts and such like.
>>>>>
>>>>> In the case where the ring buffer is completely full, the LRC code looks for a
>>>>> pending request that would free up sufficient space upon completion and waits
>>>>> for it. If no such request can be found it resorts to simply polling the free
>>>>> space count until it is big enough. This situation should only occur when the
>>>>> entire buffer is filled with the request currently being generated. I.e., the
>>>>> user is trying to submit a single piece of work that is large than the ring
>>>>> buffer itself (which should be impossible because very large batch buffers don't
>>>>> consume any more ring buffer space). Before starting to poll, a submit call is
>>>>> made to make sure that the currently queued up work in the buffer will actually
>>>>> be submtted and thus the poll will eventually succeed.
>>>>>
>>>>> The problem here is that the 'official' request cannot be used as that could
>>>>> lead to multiple LRC submissions being tagged to a single request structure.
>>>>> Instead, the code fakes up a private request structure and uses that.
>>>>>
>>>>> This patch moves the faked request allocation higher up in the call stack to the
>>>>> wait code itself (rather than being at the very lowest submission level). Thus
>>>>> it is now obvious where the faked request is coming from and why it is
>>>>> necessary. The patch also replaces it with a call to the official request
>>>>> allocation code rather than attempting to duplicate that code. This becomes
>>>>> especially important in the future when the request allocation changes to
>>>>> accommodate a conversion to struct fence.
>>>>>
>>>>> For: VIZ-5115
>>>>> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
>>>> This is only possible if you pile up tons of olr. Since your patch series
>>>> fixes this design issue by removing olr I think we can just put a WARN_ON
>>>> in here if this ever happens and bail out with -ELOSTMYMARBLES or
>>>> something. And then rip out all this complexity.
>>>>
>>>> Or do I miss something important?
>>>> -Daniel
>>> Yeah, you missed the extremely important bug in the free space calculation
>>> that meant this impossible code path was being hit on a regular basis. The
>>> LRC wait_request code differed from the legacy wait_request code in the the
>>> latter was updated with request->postfix changes and the former was not.
>>> Thus the LRC one would happily find a request that frees up enough space,
>>> wait on it, retire it and then find there was still not enough space!
>>>
>>> New patches to fix the space calculation bug and to completely remove the
>>> polling path will be forth coming...
>> Hm, Jesse did dig out a regression where gem_ringfill blows up on
>> execlist. That igt is specifically to exercise this cornercases. I'm not
>> sure whith bugzilla Jesse meant, there's two:
>>
>> https://bugs.freedesktop.org/show_bug.cgi?id=89001
>> https://bugs.freedesktop.org/show_bug.cgi?id=88865

I don't think the space calculation bug could cause the gem_ringfill 
failure. As noted above, even when the calculation gets it horribly 
wrong and flunks the wait for request path, the sit and poll path should 
still do the right thing. Plus the 88865 bug report I've already tried 
to reproduce and verified as fixed by the initial version of this patch 
set (i.e. without the space calc bug fix or the clean up of the fake 
request that follows it). It would go wrong pretty reliably on a clean 
nightly but never failed on my anti-OLR branch after quite a large 
amount of trying.

I've just realised that I somehow got completely the wrong bug number in 
the cover letter. So where patch zero talks about fixing a gem_ringfill 
bug, it was supposed to be BZ:88865 not BZ:4####! The second one, 
BZ:89001 certainly looks like the same issue but I don't have a SKL to 
test with.


>>
>> Can you please check whether your fixes help for those issues two?
>>
>> Also adding Jesse since he's chasing these atm.
> Ah cool, sounds related at least.  89001 was the one I looked at
> yesterday.  The worrying thing was that this bug caused a hard system
> hang.  Not even the reset button worked...  I guess we should have an
> HSD for that too so we can root cause the system part of it.
>
> Jesse
>
diff mbox

Patch

diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 65eea51..1fa36de 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -507,23 +507,11 @@  static int execlists_context_queue(struct intel_engine_cs *ring,
 	if (to != ring->default_context)
 		intel_lr_context_pin(ring, to);
 
-	if (!request) {
-		/*
-		 * If there isn't a request associated with this submission,
-		 * create one as a temporary holder.
-		 */
-		request = kzalloc(sizeof(*request), GFP_KERNEL);
-		if (request == NULL)
-			return -ENOMEM;
-		request->ring = ring;
-		request->ctx = to;
-		kref_init(&request->ref);
-		request->uniq = dev_priv->request_uniq++;
-		i915_gem_context_reference(request->ctx);
-	} else {
-		i915_gem_request_reference(request);
-		WARN_ON(to != request->ctx);
-	}
+	WARN_ON(!request);
+	WARN_ON(to != request->ctx);
+
+	i915_gem_request_reference(request);
+
 	request->tail = tail;
 
 	intel_runtime_pm_get(dev_priv);
@@ -677,6 +665,7 @@  static int logical_ring_wait_for_space(struct intel_ringbuffer *ringbuf,
 	struct intel_engine_cs *ring = ringbuf->ring;
 	struct drm_device *dev = ring->dev;
 	struct drm_i915_private *dev_priv = dev->dev_private;
+	struct drm_i915_gem_request *local_req;
 	unsigned long end;
 	int ret;
 
@@ -684,8 +673,23 @@  static int logical_ring_wait_for_space(struct intel_ringbuffer *ringbuf,
 	if (ret != -ENOSPC)
 		return ret;
 
-	/* Force the context submission in case we have been skipping it */
-	intel_logical_ring_advance_and_submit(ringbuf, ctx, NULL);
+	/*
+	 * Force the context submission in case we have been skipping it.
+	 * This requires creating a place holder request so that the LRC
+	 * submission can be tracked. Note that if this point has been
+	 * reached then it is the current submission that is blocking the
+	 * driver and the only course of action is to do a partial send and
+	 * wait for it to complete.
+	 * Note also that because there is no space left in the ring, it is
+	 * not possible to write the request submission prologue (which does
+	 * things like update seqno values and trigger completion interrupts).
+	 * Thus the request cannot be submitted via i915_add_request() and
+	 * can not be waiting on by i915_gem_wait_request().
+	 */
+	ret = dev_priv->gt.alloc_request(ring, ctx, &local_req);
+	if (ret)
+		return ret;
+	intel_logical_ring_advance_and_submit(ringbuf, ctx, local_req);
 
 	/* With GEM the hangcheck timer should kick us out of the loop,
 	 * leaving it early runs the risk of corrupting GEM state (due
@@ -717,6 +721,9 @@  static int logical_ring_wait_for_space(struct intel_ringbuffer *ringbuf,
 		}
 	} while (1);
 
+	/* This request is now done with and can be disposed of. */
+	i915_gem_request_unreference(local_req);
+
 	return ret;
 }