[2/3] drm/i915: Record pid/comm of hanging task

Message ID	1392042651-3278-2-git-send-email-mika.kuoppala@intel.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <intel-gfx-bounces@lists.freedesktop.org> From: Mika Kuoppala <mika.kuoppala@linux.intel.com> To: intel-gfx@lists.freedesktop.org Date: Mon, 10 Feb 2014 16:30:50 +0200 Message-Id: <1392042651-3278-2-git-send-email-mika.kuoppala@intel.com> In-Reply-To: <1392042651-3278-1-git-send-email-mika.kuoppala@intel.com> References: <1392042651-3278-1-git-send-email-mika.kuoppala@intel.com> Cc: Ben Widawsky <ben@bwidawsk.net>, miku@iki.fi Subject: [Intel-gfx] [PATCH 2/3] drm/i915: Record pid/comm of hanging task Precedence: list MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Sender: intel-gfx-bounces@lists.freedesktop.org Errors-To: intel-gfx-bounces@lists.freedesktop.org

Mika Kuoppala Feb. 10, 2014, 2:30 p.m. UTC

From: Chris Wilson <chris@chris-wilson.co.uk>

After finding the guilty batch and request, we can use it to find the
process that submitted the batch and then add the culprit into the error
state.

This is a slightly different approach from Ben's in that instead of
adding the extra information into the struct i915_hw_context, we use the
information already captured in struct drm_file which is then referenced
from the request.

v2: Also capture the workaround buffer for gen2, so that we can compare
    its contents against the intended batch for the active request.

v3: Rebase (Mika)

Link: http://lists.freedesktop.org/archives/intel-gfx/2013-August/032280.html
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> (v2)
Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com> (v3)
Cc: Ben Widawsky <ben@bwidawsk.net>
---
 drivers/gpu/drm/i915/i915_drv.h       |    6 +-
 drivers/gpu/drm/i915/i915_gem.c       |    1 +
 drivers/gpu/drm/i915/i915_gpu_error.c |  121 ++++++++++++++++++---------------
 3 files changed, 74 insertions(+), 54 deletions(-)

Ben Widawsky Feb. 12, 2014, 2:07 a.m. UTC | #1

On Mon, Feb 10, 2014 at 04:30:50PM +0200, Mika Kuoppala wrote:
> From: Chris Wilson <chris@chris-wilson.co.uk>
> 
> After finding the guilty batch and request, we can use it to find the
> process that submitted the batch and then add the culprit into the error
> state.
> 
> This is a slightly different approach from Ben's in that instead of
> adding the extra information into the struct i915_hw_context, we use the
> information already captured in struct drm_file which is then referenced
> from the request.
> 
> v2: Also capture the workaround buffer for gen2, so that we can compare
>     its contents against the intended batch for the active request.
> 
> v3: Rebase (Mika)
> 
> Link: http://lists.freedesktop.org/archives/intel-gfx/2013-August/032280.html
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> (v2)
> Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com> (v3)
> Cc: Ben Widawsky <ben@bwidawsk.net>
> ---
>  drivers/gpu/drm/i915/i915_drv.h       |    6 +-
>  drivers/gpu/drm/i915/i915_gem.c       |    1 +
>  drivers/gpu/drm/i915/i915_gpu_error.c |  121 ++++++++++++++++++---------------
>  3 files changed, 74 insertions(+), 54 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index eb260bc..2a61a29 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -356,7 +356,7 @@ struct drm_i915_error_state {
>  			int page_count;
>  			u32 gtt_offset;
>  			u32 *pages[0];
> -		} *ringbuffer, *batchbuffer, *ctx, *hws_page;
> +		} *ringbuffer, *batchbuffer, *wa_batchbuffer, *ctx, *hws_page;
>  
>  		struct drm_i915_error_request {
>  			long jiffies;
> @@ -371,6 +371,9 @@ struct drm_i915_error_state {
>  				u32 pp_dir_base;
>  			};
>  		} vm_info;
> +
> +		pid_t pid;
> +		char comm[TASK_COMM_LEN];
>  	} ring[I915_NUM_RINGS];
>  	struct drm_i915_error_buffer {
>  		u32 size;
> @@ -1786,6 +1789,7 @@ struct drm_i915_gem_request {
>  
>  struct drm_i915_file_private {
>  	struct drm_i915_private *dev_priv;
> +	struct drm_file *file;
>  
>  	struct {
>  		spinlock_t lock;
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index 20e55ef..efbd1dc 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -4883,6 +4883,7 @@ int i915_gem_open(struct drm_device *dev, struct drm_file *file)
>  
>  	file->driver_priv = file_priv;
>  	file_priv->dev_priv = dev->dev_private;
> +	file_priv->file = file;
>  
>  	spin_lock_init(&file_priv->mm.lock);
>  	INIT_LIST_HEAD(&file_priv->mm.request_list);
> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
> index 5bd075a..a90971a 100644
> --- a/drivers/gpu/drm/i915/i915_gpu_error.c
> +++ b/drivers/gpu/drm/i915/i915_gpu_error.c
> @@ -301,13 +301,28 @@ void i915_error_printf(struct drm_i915_error_state_buf *e, const char *f, ...)
>  	va_end(args);
>  }
>  
> +static void print_error_obj(struct drm_i915_error_state_buf *m,
> +			    struct drm_i915_error_object *obj)
> +{
> +	int page, offset, elt;
> +
> +	for (page = offset = 0; page < obj->page_count; page++) {
> +		for (elt = 0; elt < PAGE_SIZE/4; elt++) {
> +			err_printf(m, "%08x :  %08x\n", offset,
> +				   obj->pages[page][elt]);
> +			offset += 4;
> +		}
> +	}
> +}
> +
>  int i915_error_state_to_str(struct drm_i915_error_state_buf *m,
>  			    const struct i915_error_state_file_priv *error_priv)
>  {
>  	struct drm_device *dev = error_priv->dev;
>  	drm_i915_private_t *dev_priv = dev->dev_private;
>  	struct drm_i915_error_state *error = error_priv->error;
> -	int i, j, page, offset, elt;
> +	int i, j, offset, elt;
> +	int max_hangcheck_score;
>  
>  	if (!error) {
>  		err_printf(m, "no error state collected\n");
> @@ -317,6 +332,20 @@ int i915_error_state_to_str(struct drm_i915_error_state_buf *m,
>  	err_printf(m, "Time: %ld s %ld us\n", error->time.tv_sec,
>  		   error->time.tv_usec);
>  	err_printf(m, "Kernel: " UTS_RELEASE "\n");
> +	max_hangcheck_score = 0;
> +	for (i = 0; i < ARRAY_SIZE(error->ring); i++) {
> +		if (error->ring[i].hangcheck_score > max_hangcheck_score)
> +			max_hangcheck_score = error->ring[i].hangcheck_score;
> +	}
> +	for (i = 0; i < ARRAY_SIZE(error->ring); i++) {
> +		if (error->ring[i].hangcheck_score == max_hangcheck_score &&
> +		    error->ring[i].pid != -1) {
> +			err_printf(m, "Active process (on ring %s): %s [%d]\n",
> +				   ring_str(i),
> +				   error->ring[i].comm,
> +				   error->ring[i].pid);
> +		}
> +	}
>  	err_printf(m, "PCI ID: 0x%04x\n", dev->pdev->device);
>  	err_printf(m, "EIR: 0x%08x\n", error->eir);
>  	err_printf(m, "IER: 0x%08x\n", error->ier);
> @@ -360,17 +389,19 @@ int i915_error_state_to_str(struct drm_i915_error_state_buf *m,
>  		struct drm_i915_error_object *obj;
>  
>  		if ((obj = error->ring[i].batchbuffer)) {
> -			err_printf(m, "%s --- gtt_offset = 0x%08x\n",
> -				   dev_priv->ring[i].name,
> +			err_puts(m, dev_priv->ring[i].name);
> +			if (error->ring[i].pid != -1)
> +				err_printf(m, " (submitted by %s [%d])",
> +					   error->ring[i].comm, error->ring[i].pid);
> +			err_printf(m, " --- gtt_offset = 0x%08x\n",
>  				   obj->gtt_offset);
> -			offset = 0;
> -			for (page = 0; page < obj->page_count; page++) {
> -				for (elt = 0; elt < PAGE_SIZE/4; elt++) {
> -					err_printf(m, "%08x :  %08x\n", offset,
> -						   obj->pages[page][elt]);
> -					offset += 4;
> -				}
> -			}
> +			print_error_obj(m, obj);
> +		}
> +
> +		if ((obj = error->ring[i].wa_batchbuffer)) {
> +			err_printf(m, "%s (w/a) --- gtt_offset = 0x%08x\n",
> +				   dev_priv->ring[i].name, obj->gtt_offset);
> +			print_error_obj(m, obj);
>  		}
>  
>  		if (error->ring[i].num_requests) {
> @@ -389,15 +420,7 @@ int i915_error_state_to_str(struct drm_i915_error_state_buf *m,
>  			err_printf(m, "%s --- ringbuffer = 0x%08x\n",
>  				   dev_priv->ring[i].name,
>  				   obj->gtt_offset);
> -			offset = 0;
> -			for (page = 0; page < obj->page_count; page++) {
> -				for (elt = 0; elt < PAGE_SIZE/4; elt++) {
> -					err_printf(m, "%08x :  %08x\n",
> -						   offset,
> -						   obj->pages[page][elt]);
> -					offset += 4;
> -				}
> -			}
> +			print_error_obj(m, obj);
>  		}
>  
>  		if ((obj = error->ring[i].hws_page)) {
> @@ -713,37 +736,6 @@ static void i915_gem_record_fences(struct drm_device *dev,
>  	}
>  }
>  
> -static struct drm_i915_error_object *
> -i915_error_first_batchbuffer(struct drm_i915_private *dev_priv,
> -			     struct intel_ring_buffer *ring)
> -{
> -	struct drm_i915_gem_request *request;
> -
> -	if (HAS_BROKEN_CS_TLB(dev_priv->dev)) {
> -		struct drm_i915_gem_object *obj;
> -		u32 acthd = I915_READ(ACTHD);
> -
> -		if (WARN_ON(ring->id != RCS))
> -			return NULL;
> -
> -		obj = ring->scratch.obj;
> -		if (obj != NULL &&
> -		    acthd >= i915_gem_obj_ggtt_offset(obj) &&
> -		    acthd < i915_gem_obj_ggtt_offset(obj) + obj->base.size)
> -			return i915_error_ggtt_object_create(dev_priv, obj);
> -	}
> -
> -	request = i915_gem_find_active_request(ring);
> -	if (request) {
> -		/* We need to copy these to an anonymous buffer as the simplest
> -		 * method to avoid being overwritten by userspace.
> -		 */
> -		return i915_error_object_create(dev_priv, request->batch_obj, request->ctx->vm);
> -	}
> -
> -	return NULL;
> -}
> -
>  static void i915_record_ring_state(struct drm_device *dev,
>  				   struct intel_ring_buffer *ring,
>  				   struct drm_i915_error_ring *ering)
> @@ -892,8 +884,31 @@ static void i915_gem_record_rings(struct drm_device *dev,
>  
>  		i915_record_ring_state(dev, ring, &error->ring[i]);
>  
> -		error->ring[i].batchbuffer =
> -			i915_error_first_batchbuffer(dev_priv, ring);
> +		error->ring[i].pid = -1;
> +		request = i915_gem_find_active_request(ring);
> +		if (request) {
> +			/* We need to copy these to an anonymous buffer as the simplest
> +			 * method to avoid being overwritten by userspace.
> +			 */
> +			error->ring[i].batchbuffer =
> +				i915_error_object_create(dev_priv, request->batch_obj, request->ctx->vm);
> +
> +			if (HAS_BROKEN_CS_TLB(dev_priv->dev) && ring->scratch.obj)
> +				error->ring[i].wa_batchbuffer =
> +					i915_error_ggtt_object_create(dev_priv, ring->scratch.obj);
> +
> +			if (request->file_priv) {
> +				struct task_struct *task;
> +
> +				rcu_read_lock();
> +				task = pid_task(request->file_priv->file->pid, PIDTYPE_PID);
> +				if (task) {
> +					strcpy(error->ring[i].comm, task->comm);
> +					error->ring[i].pid = task->pid;
> +				}
> +				rcu_read_unlock();
> +			}

I still like my solution which does the strcpy on context creation. Your
solution is deferred in the usual case since the hangcheck is
asynchronous. The process could be long dead and gone.

Also, I think my solution is simpler, and with full PPGTT should get the
same information.

> +		}
>  
>  		error->ring[i].ringbuffer =
>  			i915_error_ggtt_object_create(dev_priv, ring->obj);

Chris Wilson Feb. 12, 2014, 8:15 a.m. UTC | #2

On Tue, Feb 11, 2014 at 06:07:09PM -0800, Ben Widawsky wrote:
> I still like my solution which does the strcpy on context creation. Your
> solution is deferred in the usual case since the hangcheck is
> asynchronous. The process could be long dead and gone.

I disliked your solution since there was never a need to copy the string
upon process creation, you could just have kept a reference to the
pid...

> Also, I think my solution is simpler, and with full PPGTT should get the
> same information.

I disagree that your solution was simpler and just introduced redundant
storage.
-Chris

Ben Widawsky Feb. 12, 2014, 6:55 p.m. UTC | #3

On Wed, Feb 12, 2014 at 08:15:58AM +0000, Chris Wilson wrote:
> On Tue, Feb 11, 2014 at 06:07:09PM -0800, Ben Widawsky wrote:
> > I still like my solution which does the strcpy on context creation. Your
> > solution is deferred in the usual case since the hangcheck is
> > asynchronous. The process could be long dead and gone.
> 
> I disliked your solution since there was never a need to copy the string
> upon process creation, you could just have kept a reference to the
> pid...
> 
> > Also, I think my solution is simpler, and with full PPGTT should get the
> > same information.
> 
> I disagree that your solution was simpler and just introduced redundant
> storage.
> -Chris
> 
> -- 
> Chris Wilson, Intel Open Source Technology Centre

Are you opposed to doing anything at context creation? pid reference
works for me too, or hold on to it with the request - I don't care.

Chris Wilson Feb. 12, 2014, 7:18 p.m. UTC | #4

On Wed, Feb 12, 2014 at 10:55:07AM -0800, Ben Widawsky wrote:
> On Wed, Feb 12, 2014 at 08:15:58AM +0000, Chris Wilson wrote:
> Are you opposed to doing anything at context creation? pid reference
> works for me too, or hold on to it with the request - I don't care.

I'm just don't feel that the issue is severe enough for yet another pid
reference floating around inside drm. I could well be wrong though, but
I would like to be sure we have tried our best to use the available
information we have already.
-Chris

Ben Widawsky Feb. 12, 2014, 7:32 p.m. UTC | #5

On Wed, Feb 12, 2014 at 07:18:19PM +0000, Chris Wilson wrote:
> On Wed, Feb 12, 2014 at 10:55:07AM -0800, Ben Widawsky wrote:
> > On Wed, Feb 12, 2014 at 08:15:58AM +0000, Chris Wilson wrote:
> > Are you opposed to doing anything at context creation? pid reference
> > works for me too, or hold on to it with the request - I don't care.
> 
> I'm just don't feel that the issue is severe enough for yet another pid
> reference floating around inside drm. I could well be wrong though, but
> I would like to be sure we have tried our best to use the available
> information we have already.
> -Chris
> 
> -- 
> Chris Wilson, Intel Open Source Technology Centre

Well, the ref wouldn't be held if you just copied it at context create,
and it avoids having to stick in *file to actually do the proper look up
(not that that in itself is so horrible).

For the sake of making progress, I disagree with Chris in tbut think it's
better than what is currently there, so:
Acked-by: Ben Widawsky <ben@bwidawsk.net>

Chris Wilson March 5, 2014, 1:08 p.m. UTC | #6

On Wed, Feb 12, 2014 at 11:32:00AM -0800, Ben Widawsky wrote:
> On Wed, Feb 12, 2014 at 07:18:19PM +0000, Chris Wilson wrote:
> > On Wed, Feb 12, 2014 at 10:55:07AM -0800, Ben Widawsky wrote:
> > > On Wed, Feb 12, 2014 at 08:15:58AM +0000, Chris Wilson wrote:
> > > Are you opposed to doing anything at context creation? pid reference
> > > works for me too, or hold on to it with the request - I don't care.
> > 
> > I'm just don't feel that the issue is severe enough for yet another pid
> > reference floating around inside drm. I could well be wrong though, but
> > I would like to be sure we have tried our best to use the available
> > information we have already.
> > -Chris
> > 
> > -- 
> > Chris Wilson, Intel Open Source Technology Centre
> 
> Well, the ref wouldn't be held if you just copied it at context create,
> and it avoids having to stick in *file to actually do the proper look up
> (not that that in itself is so horrible).
> 
> For the sake of making progress, I disagree with Chris in tbut think it's
> better than what is currently there, so:
> Acked-by: Ben Widawsky <ben@bwidawsk.net>

Daniel?
-Chris

Daniel Vetter March 5, 2014, 1:34 p.m. UTC | #7

On Wed, Mar 05, 2014 at 01:08:15PM +0000, Chris Wilson wrote:
> On Wed, Feb 12, 2014 at 11:32:00AM -0800, Ben Widawsky wrote:
> > On Wed, Feb 12, 2014 at 07:18:19PM +0000, Chris Wilson wrote:
> > > On Wed, Feb 12, 2014 at 10:55:07AM -0800, Ben Widawsky wrote:
> > > > On Wed, Feb 12, 2014 at 08:15:58AM +0000, Chris Wilson wrote:
> > > > Are you opposed to doing anything at context creation? pid reference
> > > > works for me too, or hold on to it with the request - I don't care.
> > > 
> > > I'm just don't feel that the issue is severe enough for yet another pid
> > > reference floating around inside drm. I could well be wrong though, but
> > > I would like to be sure we have tried our best to use the available
> > > information we have already.
> > > -Chris
> > > 
> > > -- 
> > > Chris Wilson, Intel Open Source Technology Centre
> > 
> > Well, the ref wouldn't be held if you just copied it at context create,
> > and it avoids having to stick in *file to actually do the proper look up
> > (not that that in itself is so horrible).
> > 
> > For the sake of making progress, I disagree with Chris in tbut think it's
> > better than what is currently there, so:
> > Acked-by: Ben Widawsky <ben@bwidawsk.net>
> 
> Daniel?

I've thought I pick this all up when Mika resends the revised versions for
1&3. Or is everything ready already (it didn't quite look like that ...)?
-Daniel

Chris Wilson March 5, 2014, 1:50 p.m. UTC | #8

On Wed, Mar 05, 2014 at 02:34:48PM +0100, Daniel Vetter wrote:
> On Wed, Mar 05, 2014 at 01:08:15PM +0000, Chris Wilson wrote:
> > On Wed, Feb 12, 2014 at 11:32:00AM -0800, Ben Widawsky wrote:
> > > On Wed, Feb 12, 2014 at 07:18:19PM +0000, Chris Wilson wrote:
> > > > On Wed, Feb 12, 2014 at 10:55:07AM -0800, Ben Widawsky wrote:
> > > > > On Wed, Feb 12, 2014 at 08:15:58AM +0000, Chris Wilson wrote:
> > > > > Are you opposed to doing anything at context creation? pid reference
> > > > > works for me too, or hold on to it with the request - I don't care.
> > > > 
> > > > I'm just don't feel that the issue is severe enough for yet another pid
> > > > reference floating around inside drm. I could well be wrong though, but
> > > > I would like to be sure we have tried our best to use the available
> > > > information we have already.
> > > > -Chris
> > > > 
> > > > -- 
> > > > Chris Wilson, Intel Open Source Technology Centre
> > > 
> > > Well, the ref wouldn't be held if you just copied it at context create,
> > > and it avoids having to stick in *file to actually do the proper look up
> > > (not that that in itself is so horrible).
> > > 
> > > For the sake of making progress, I disagree with Chris in tbut think it's
> > > better than what is currently there, so:
> > > Acked-by: Ben Widawsky <ben@bwidawsk.net>
> > 
> > Daniel?
> 
> I've thought I pick this all up when Mika resends the revised versions for
> 1&3. Or is everything ready already (it didn't quite look like that ...)?

Oh, I just forgot where this was taken up. It just made it to the top of
my wishlist with a couple of bizarre error-states containing gibberish
for which I want to find the culprit.
-Chris

[2/3] drm/i915: Record pid/comm of hanging task

Commit Message

Comments

Patch