
[49/50] drm/i915/bdw: Help out the ctx switch interrupt handler

Message ID 1399637360-4277-50-git-send-email-oscar.mateo@intel.com (mailing list archive)
State New, archived

Commit Message

oscar.mateo@intel.com May 9, 2014, 12:09 p.m. UTC
From: Oscar Mateo <oscar.mateo@intel.com>

If we receive a storm of requests for the same context (see gem_storedw_loop_*)
we might end up iterating over too many elements at interrupt time, looking for
contexts to squash together. Instead, share the burden by giving more
intelligence to the queue function. At most, the interrupt handler will iterate
over three elements.

Signed-off-by: Oscar Mateo <oscar.mateo@intel.com>
---
 drivers/gpu/drm/i915/intel_lrc.c | 23 ++++++++++++++++++++---
 1 file changed, 20 insertions(+), 3 deletions(-)

Comments

Daniel Vetter June 11, 2014, 11:50 a.m. UTC | #1
On Fri, May 09, 2014 at 01:09:19PM +0100, oscar.mateo@intel.com wrote:
> From: Oscar Mateo <oscar.mateo@intel.com>
> 
> If we receive a storm of requests for the same context (see gem_storedw_loop_*)
> we might end up iterating over too many elements in interrupt time, looking for
> contexts to squash together. Instead, share the burden by giving more
> intelligence to the queue function. At most, the interrupt will iterate over
> three elements.
> 
> Signed-off-by: Oscar Mateo <oscar.mateo@intel.com>
> ---
>  drivers/gpu/drm/i915/intel_lrc.c | 23 ++++++++++++++++++++---
>  1 file changed, 20 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index d9edd10..0aad721 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -410,9 +410,11 @@ int gen8_switch_context_queue(struct intel_engine *ring,
>  			      struct i915_hw_context *to,
>  			      u32 tail)
>  {
> +	struct drm_i915_private *dev_priv = ring->dev->dev_private;
>  	struct drm_i915_gem_request *req = NULL;
>  	unsigned long flags;
> -	bool was_empty;
> +	struct drm_i915_gem_request *cursor;
> +	int num_elements = 0;
>  
>  	req = kzalloc(sizeof(*req), GFP_KERNEL);
>  	if (req == NULL)
> @@ -425,9 +427,24 @@ int gen8_switch_context_queue(struct intel_engine *ring,
>  
>  	spin_lock_irqsave(&ring->execlist_lock, flags);
>  
> -	was_empty = list_empty(&ring->execlist_queue);
> +	list_for_each_entry(cursor, &ring->execlist_queue, execlist_link)
> +		if (++num_elements > 2)
> +			break;
> +
> +	if (num_elements > 2) {
> +		struct drm_i915_gem_request *tail_req =
> +				list_last_entry(&ring->execlist_queue,
> +					struct drm_i915_gem_request, execlist_link);
> +		if (to == tail_req->ctx) {
> +			WARN(tail_req->elsp_submitted != 0,
> +					"More than 2 already-submitted reqs queued\n");
> +			list_del(&tail_req->execlist_link);
> +			queue_work(dev_priv->wq, &tail_req->work);
> +		}
> +	}

Completely forgotten to mention this: Chris&I discussed this on irc and I
guess this issue will disappear if we track contexts instead of requests
in the scheduler. I guess this is an artifact of the gen7 scheduler you've
based this on, but even for that I think scheduling contexts (with preempt
point after each batch) is the right approach. But I haven't dug out the
scheduler patches again so might be wrong with that.
-Daniel

> +
>  	list_add_tail(&req->execlist_link, &ring->execlist_queue);
> -	if (was_empty)
> +	if (num_elements == 0)
>  		gen8_switch_context_unqueue(ring);
>  
>  	spin_unlock_irqrestore(&ring->execlist_lock, flags);
> -- 
> 1.9.0
> 
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/intel-gfx
oscar.mateo@intel.com June 11, 2014, 12:01 p.m. UTC | #2
> -----Original Message-----
> From: Daniel Vetter [mailto:daniel.vetter@ffwll.ch] On Behalf Of Daniel
> Vetter
> Sent: Wednesday, June 11, 2014 12:50 PM
> To: Mateo Lozano, Oscar
> Cc: intel-gfx@lists.freedesktop.org
> Subject: Re: [Intel-gfx] [PATCH 49/50] drm/i915/bdw: Help out the ctx switch
> interrupt handler
> 
> On Fri, May 09, 2014 at 01:09:19PM +0100, oscar.mateo@intel.com wrote:
> > From: Oscar Mateo <oscar.mateo@intel.com>
> >
> > If we receive a storm of requests for the same context (see
> > gem_storedw_loop_*) we might end up iterating over too many elements
> > in interrupt time, looking for contexts to squash together. Instead,
> > share the burden by giving more intelligence to the queue function. At
> > most, the interrupt will iterate over three elements.
> >
> > Signed-off-by: Oscar Mateo <oscar.mateo@intel.com>
> > ---
> >  drivers/gpu/drm/i915/intel_lrc.c | 23 ++++++++++++++++++++---
> >  1 file changed, 20 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/i915/intel_lrc.c
> > b/drivers/gpu/drm/i915/intel_lrc.c
> > index d9edd10..0aad721 100644
> > --- a/drivers/gpu/drm/i915/intel_lrc.c
> > +++ b/drivers/gpu/drm/i915/intel_lrc.c
> > @@ -410,9 +410,11 @@ int gen8_switch_context_queue(struct
> intel_engine *ring,
> >  			      struct i915_hw_context *to,
> >  			      u32 tail)
> >  {
> > +	struct drm_i915_private *dev_priv = ring->dev->dev_private;
> >  	struct drm_i915_gem_request *req = NULL;
> >  	unsigned long flags;
> > -	bool was_empty;
> > +	struct drm_i915_gem_request *cursor;
> > +	int num_elements = 0;
> >
> >  	req = kzalloc(sizeof(*req), GFP_KERNEL);
> >  	if (req == NULL)
> > @@ -425,9 +427,24 @@ int gen8_switch_context_queue(struct
> intel_engine
> > *ring,
> >
> >  	spin_lock_irqsave(&ring->execlist_lock, flags);
> >
> > -	was_empty = list_empty(&ring->execlist_queue);
> > +	list_for_each_entry(cursor, &ring->execlist_queue, execlist_link)
> > +		if (++num_elements > 2)
> > +			break;
> > +
> > +	if (num_elements > 2) {
> > +		struct drm_i915_gem_request *tail_req =
> > +				list_last_entry(&ring->execlist_queue,
> > +					struct drm_i915_gem_request,
> execlist_link);
> > +		if (to == tail_req->ctx) {
> > +			WARN(tail_req->elsp_submitted != 0,
> > +					"More than 2 already-submitted reqs
> queued\n");
> > +			list_del(&tail_req->execlist_link);
> > +			queue_work(dev_priv->wq, &tail_req->work);
> > +		}
> > +	}
> 
> Completely forgotten to mention this: Chris&I discussed this on irc and I
> guess this issue will disappear if we track contexts instead of requests in the
> scheduler. I guess this is an artifact of the gen7 scheduler you've based this
> on, but even for that I think scheduling contexts (with preempt point after
> each batch) is the right approach. But I haven't dug out the scheduler patches
> again so might be wrong with that.
> -Daniel

Hmmmm... I didn't really base this on the scheduler. Some kind of queue to hold context submissions until the hardware was ready was needed, and queuing drm_i915_gem_requests seemed like a good choice at the time (by the way, in the next version I am using a new struct intel_ctx_submit_request, since I don't need most of the fields in drm_i915_gem_requests, and I have to add a couple of new ones anyway).

What do you mean by "scheduling contexts"? Notice that the requests I am queuing basically just contain the context and the tail at the point it was submitted for execution...
Daniel Vetter June 11, 2014, 1:57 p.m. UTC | #3
On Wed, Jun 11, 2014 at 12:01:42PM +0000, Mateo Lozano, Oscar wrote:
> > -----Original Message-----
> > From: Daniel Vetter [mailto:daniel.vetter@ffwll.ch] On Behalf Of Daniel
> > Vetter
> > Sent: Wednesday, June 11, 2014 12:50 PM
> > To: Mateo Lozano, Oscar
> > Cc: intel-gfx@lists.freedesktop.org
> > Subject: Re: [Intel-gfx] [PATCH 49/50] drm/i915/bdw: Help out the ctx switch
> > interrupt handler
> > 
> > On Fri, May 09, 2014 at 01:09:19PM +0100, oscar.mateo@intel.com wrote:
> > > From: Oscar Mateo <oscar.mateo@intel.com>
> > >
> > > If we receive a storm of requests for the same context (see
> > > gem_storedw_loop_*) we might end up iterating over too many elements
> > > in interrupt time, looking for contexts to squash together. Instead,
> > > share the burden by giving more intelligence to the queue function. At
> > > most, the interrupt will iterate over three elements.
> > >
> > > Signed-off-by: Oscar Mateo <oscar.mateo@intel.com>
> > > ---
> > >  drivers/gpu/drm/i915/intel_lrc.c | 23 ++++++++++++++++++++---
> > >  1 file changed, 20 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/drivers/gpu/drm/i915/intel_lrc.c
> > > b/drivers/gpu/drm/i915/intel_lrc.c
> > > index d9edd10..0aad721 100644
> > > --- a/drivers/gpu/drm/i915/intel_lrc.c
> > > +++ b/drivers/gpu/drm/i915/intel_lrc.c
> > > @@ -410,9 +410,11 @@ int gen8_switch_context_queue(struct
> > intel_engine *ring,
> > >  			      struct i915_hw_context *to,
> > >  			      u32 tail)
> > >  {
> > > +	struct drm_i915_private *dev_priv = ring->dev->dev_private;
> > >  	struct drm_i915_gem_request *req = NULL;
> > >  	unsigned long flags;
> > > -	bool was_empty;
> > > +	struct drm_i915_gem_request *cursor;
> > > +	int num_elements = 0;
> > >
> > >  	req = kzalloc(sizeof(*req), GFP_KERNEL);
> > >  	if (req == NULL)
> > > @@ -425,9 +427,24 @@ int gen8_switch_context_queue(struct
> > intel_engine
> > > *ring,
> > >
> > >  	spin_lock_irqsave(&ring->execlist_lock, flags);
> > >
> > > -	was_empty = list_empty(&ring->execlist_queue);
> > > +	list_for_each_entry(cursor, &ring->execlist_queue, execlist_link)
> > > +		if (++num_elements > 2)
> > > +			break;
> > > +
> > > +	if (num_elements > 2) {
> > > +		struct drm_i915_gem_request *tail_req =
> > > +				list_last_entry(&ring->execlist_queue,
> > > +					struct drm_i915_gem_request,
> > execlist_link);
> > > +		if (to == tail_req->ctx) {
> > > +			WARN(tail_req->elsp_submitted != 0,
> > > +					"More than 2 already-submitted reqs
> > queued\n");
> > > +			list_del(&tail_req->execlist_link);
> > > +			queue_work(dev_priv->wq, &tail_req->work);
> > > +		}
> > > +	}
> > 
> > Completely forgotten to mention this: Chris&I discussed this on irc and I
> > guess this issue will disappear if we track contexts instead of requests in the
> > scheduler. I guess this is an artifact of the gen7 scheduler you've based this
> > on, but even for that I think scheduling contexts (with preempt point after
> > each batch) is the right approach. But I haven't dug out the scheduler patches
> > again so might be wrong with that.
> > -Daniel
> 
> Hmmmm... I didn't really base this on the scheduler. Some kind of queue
> to hold context submissions until the hardware was ready was needed, and
> queuing drm_i915_gem_requests seemed like a good choice at the time (by
> the way, in the next version I am using a new struct
> intel_ctx_submit_request, since I don't need most of the fields in
> drm_i915_gem_requests, and I have to add a couple of new ones anyway).
> 
> What do you mean by "scheduling contexts"? Notice that the requests I am
> queuing basically just contain the context and the tail at the point it
> was submitted for execution...

Well, I'd thought we could just throw contexts at the hardware and throw
new ones at it when the old ones get stuck/are completed. But now I've
realized that since we do the cross-engine/ctx dependency tracking in
software it's not quite that easy and we can't unconditionally update the
tail pointer.

Still, for the degenerate case of one ctx submitting batches exclusively,
I'd hoped that just updating the tail pointer in the context and telling the
hw to reload the current context would be enough. That should take care of
(most of) the insane request-overload case your patch here addresses.
-Daniel
oscar.mateo@intel.com June 11, 2014, 2:26 p.m. UTC | #4

> -----Original Message-----
> From: Daniel Vetter [mailto:daniel.vetter@ffwll.ch] On Behalf Of Daniel
> Vetter
> Sent: Wednesday, June 11, 2014 2:58 PM
> To: Mateo Lozano, Oscar
> Cc: Daniel Vetter; intel-gfx@lists.freedesktop.org
> Subject: Re: [Intel-gfx] [PATCH 49/50] drm/i915/bdw: Help out the ctx switch
> interrupt handler
> 
> On Wed, Jun 11, 2014 at 12:01:42PM +0000, Mateo Lozano, Oscar wrote:
> > > -----Original Message-----
> > > From: Daniel Vetter [mailto:daniel.vetter@ffwll.ch] On Behalf Of
> > > Daniel Vetter
> > > Sent: Wednesday, June 11, 2014 12:50 PM
> > > To: Mateo Lozano, Oscar
> > > Cc: intel-gfx@lists.freedesktop.org
> > > Subject: Re: [Intel-gfx] [PATCH 49/50] drm/i915/bdw: Help out the
> > > ctx switch interrupt handler
> > >
> > > On Fri, May 09, 2014 at 01:09:19PM +0100, oscar.mateo@intel.com
> wrote:
> > > > From: Oscar Mateo <oscar.mateo@intel.com>
> > > >
> > > > If we receive a storm of requests for the same context (see
> > > > gem_storedw_loop_*) we might end up iterating over too many
> > > > elements in interrupt time, looking for contexts to squash
> > > > together. Instead, share the burden by giving more intelligence to
> > > > the queue function. At most, the interrupt will iterate over three
> elements.
> > > >
> > > > Signed-off-by: Oscar Mateo <oscar.mateo@intel.com>
> > > > ---
> > > >  drivers/gpu/drm/i915/intel_lrc.c | 23 ++++++++++++++++++++---
> > > >  1 file changed, 20 insertions(+), 3 deletions(-)
> > > >
> > > > diff --git a/drivers/gpu/drm/i915/intel_lrc.c
> > > > b/drivers/gpu/drm/i915/intel_lrc.c
> > > > index d9edd10..0aad721 100644
> > > > --- a/drivers/gpu/drm/i915/intel_lrc.c
> > > > +++ b/drivers/gpu/drm/i915/intel_lrc.c
> > > > @@ -410,9 +410,11 @@ int gen8_switch_context_queue(struct
> > > intel_engine *ring,
> > > >  			      struct i915_hw_context *to,
> > > >  			      u32 tail)
> > > >  {
> > > > +	struct drm_i915_private *dev_priv = ring->dev->dev_private;
> > > >  	struct drm_i915_gem_request *req = NULL;
> > > >  	unsigned long flags;
> > > > -	bool was_empty;
> > > > +	struct drm_i915_gem_request *cursor;
> > > > +	int num_elements = 0;
> > > >
> > > >  	req = kzalloc(sizeof(*req), GFP_KERNEL);
> > > >  	if (req == NULL)
> > > > @@ -425,9 +427,24 @@ int gen8_switch_context_queue(struct
> > > intel_engine
> > > > *ring,
> > > >
> > > >  	spin_lock_irqsave(&ring->execlist_lock, flags);
> > > >
> > > > -	was_empty = list_empty(&ring->execlist_queue);
> > > > +	list_for_each_entry(cursor, &ring->execlist_queue, execlist_link)
> > > > +		if (++num_elements > 2)
> > > > +			break;
> > > > +
> > > > +	if (num_elements > 2) {
> > > > +		struct drm_i915_gem_request *tail_req =
> > > > +				list_last_entry(&ring->execlist_queue,
> > > > +					struct drm_i915_gem_request,
> > > execlist_link);
> > > > +		if (to == tail_req->ctx) {
> > > > +			WARN(tail_req->elsp_submitted != 0,
> > > > +					"More than 2 already-submitted reqs
> > > queued\n");
> > > > +			list_del(&tail_req->execlist_link);
> > > > +			queue_work(dev_priv->wq, &tail_req->work);
> > > > +		}
> > > > +	}
> > >
> > > Completely forgotten to mention this: Chris&I discussed this on irc
> > > and I guess this issue will disappear if we track contexts instead
> > > of requests in the scheduler. I guess this is an artifact of the
> > > gen7 scheduler you've based this on, but even for that I think
> > > scheduling contexts (with preempt point after each batch) is the
> > > right approach. But I haven't dug out the scheduler patches again so
> might be wrong with that.
> > > -Daniel
> >
> > Hmmmm... I didn't really base this on the scheduler. Some kind of
> > queue to hold context submissions until the hardware was ready was
> > needed, and queuing drm_i915_gem_requests seemed like a good choice
> at
> > the time (by the way, in the next version I am using a new struct
> > intel_ctx_submit_request, since I don't need most of the fields in
> > drm_i915_gem_requests, and I have to add a couple of new ones anyway).
> >
> > What do you mean by "scheduling contexts"? Notice that the requests I
> > am queuing basically just contain the context and the tail at the
> > point it was submitted for execution...
> 
> Well I've thought we could just throw contexts at the hardware and throw
> new ones at it when the old ones get stuck/are completed. But now I've
> realized that since we do the cross-engine/ctx dependency tracking in software
> it's not quite that easy and we can't unconditionally update the tail-pointer.

Exactly: unconditionally updating the tail-pointer also means the seqnos might get executed out of order, which is not nice (at least until there is a scheduler keeping track of the dependencies).

> Still for the degenerate case of one ctx submitting batches exclusively I've
> hoped just updating the tail pointer in the context and telling the hw to
> reload the current context should have been enough. Or at least I've hoped
> so, and that should take (mostly) care of the insane request overload case
> your patch here addresses.

What we had before this patch:

A) The submitter places one request per new batch received in the queue, unconditionally.
B) The interrupt handler traverses the queue backwards, squashing together all requests for the same context that come in a row. Note that the first one in the queue might already be in execution, in which case the squashing ends up doing exactly what you said: updating the tail pointer.
C) When the ELSP is written to, if the same context was already in execution, the GPU performs a Lite Restore (the new tail is sampled and execution continues).

After this patch:

A) The submitter places one request per new batch received in the queue. But, if the last request in the queue belongs to the same context, and it's not suspected of being in execution (which means it occupies the 3rd position or later), then the two requests (the old and the new) are squashed together.
B) The interrupt handler traverses the queue backwards, squashing together all requests for the same context that come in a row (maximum of two: one in execution, and the pending one).
C) When the ELSP is written to, if the same context was already in execution, the GPU performs a Lite Restore (the new tail is sampled and execution continues).

In a graphical form, if A and B are contexts and (*) means they are in execution:

B3 -> B2 B1* A1* (the submitter wants to queue B with tail 3)
B3 B1* A1* (as B2 is redundant, the submitter drops it and queues B3 in its place)
B3 B1* (the interrupt informs us that A is complete, so we remove it from the head of the queue)
-> B3 (since B1 is redundant, we submit B3, causing a Lite Restore)

(* context in execution)
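For reference, the submitter-side squash described above can be modelled in plain userspace C. This is only a sketch with made-up types: req_t, queue_t and integer context ids stand in for drm_i915_gem_request, the engine's execlist queue and i915_hw_context, and the kernel's locking and deferred freeing via queue_work() are omitted entirely:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical userspace model of the squash in gen8_switch_context_queue().
 * Not the kernel's types or names. */
typedef struct req {
	int ctx;             /* context this request belongs to */
	unsigned tail;       /* ring tail at submission time */
	int elsp_submitted;  /* nonzero once written to the ELSP */
	struct req *next;
} req_t;

typedef struct {
	req_t *head, *tail;
	int len;
} queue_t;

/* Returns 1 if the queue was empty before (caller would then submit). */
static int queue_request(queue_t *q, int ctx, unsigned tail)
{
	int was_empty = (q->len == 0);

	/* With more than two requests queued, the last one cannot be in
	 * execution (only the first two may be ELSP-submitted). If it
	 * belongs to the same context, drop it: the new request carries
	 * the newer tail, making the old one redundant. */
	if (q->len > 2 && q->tail->ctx == ctx) {
		req_t *old = q->tail, *p = q->head;

		assert(old->elsp_submitted == 0); /* mirrors the WARN() */
		while (p->next != old)
			p = p->next;
		p->next = NULL;
		q->tail = p;
		q->len--;
		free(old);
	}

	req_t *req = calloc(1, sizeof(*req));
	req->ctx = ctx;
	req->tail = tail;
	if (q->tail)
		q->tail->next = req;
	else
		q->head = req;
	q->tail = req;
	q->len++;
	return was_empty;
}
```

Replaying the B3 example: with A1, B1, B2 queued, enqueuing B3 drops B2 and appends B3, so the queue length stays at three and only the last element is ever squashed here; the interrupt handler still handles the at-most-one remaining pair.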

Patch

diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index d9edd10..0aad721 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -410,9 +410,11 @@  int gen8_switch_context_queue(struct intel_engine *ring,
 			      struct i915_hw_context *to,
 			      u32 tail)
 {
+	struct drm_i915_private *dev_priv = ring->dev->dev_private;
 	struct drm_i915_gem_request *req = NULL;
 	unsigned long flags;
-	bool was_empty;
+	struct drm_i915_gem_request *cursor;
+	int num_elements = 0;
 
 	req = kzalloc(sizeof(*req), GFP_KERNEL);
 	if (req == NULL)
@@ -425,9 +427,24 @@  int gen8_switch_context_queue(struct intel_engine *ring,
 
 	spin_lock_irqsave(&ring->execlist_lock, flags);
 
-	was_empty = list_empty(&ring->execlist_queue);
+	list_for_each_entry(cursor, &ring->execlist_queue, execlist_link)
+		if (++num_elements > 2)
+			break;
+
+	if (num_elements > 2) {
+		struct drm_i915_gem_request *tail_req =
+				list_last_entry(&ring->execlist_queue,
+					struct drm_i915_gem_request, execlist_link);
+		if (to == tail_req->ctx) {
+			WARN(tail_req->elsp_submitted != 0,
+					"More than 2 already-submitted reqs queued\n");
+			list_del(&tail_req->execlist_link);
+			queue_work(dev_priv->wq, &tail_req->work);
+		}
+	}
+
 	list_add_tail(&req->execlist_link, &ring->execlist_queue);
-	if (was_empty)
+	if (num_elements == 0)
 		gen8_switch_context_unqueue(ring);
 
 	spin_unlock_irqrestore(&ring->execlist_lock, flags);