
[12/12] drm/i915: Cache the reset_counter for the request

Message ID 1448023432-10726-12-git-send-email-chris@chris-wilson.co.uk (mailing list archive)
State New, archived

Commit Message

Chris Wilson Nov. 20, 2015, 12:43 p.m. UTC
Instead of querying the reset counter before every access to the ring,
query it the first time we touch the ring, and do a final compare when
submitting the request. For correctness, we need to then sanitize how
the reset_counter is incremented to prevent broken submission and
waiting across resets, in the process fixing the persistent -EIO we
still see today on failed waits.

v2: Rebase
v3: Now with added testcase

Testcase: igt/gem_eio
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/i915_debugfs.c     |  2 +-
 drivers/gpu/drm/i915/i915_drv.c         | 32 +++++++-----
 drivers/gpu/drm/i915/i915_drv.h         | 39 ++++++++++----
 drivers/gpu/drm/i915/i915_gem.c         | 90 +++++++++++++++------------------
 drivers/gpu/drm/i915/i915_irq.c         | 21 +-------
 drivers/gpu/drm/i915/intel_display.c    | 10 ++--
 drivers/gpu/drm/i915/intel_ringbuffer.c | 17 ++++---
 7 files changed, 104 insertions(+), 107 deletions(-)

Comments

Daniel Vetter Nov. 24, 2015, 4:43 p.m. UTC | #1
On Fri, Nov 20, 2015 at 12:43:52PM +0000, Chris Wilson wrote:
> Instead of querying the reset counter before every access to the ring,
> query it the first time we touch the ring, and do a final compare when
> submitting the request. For correctness, we need to then sanitize how
> the reset_counter is incremented to prevent broken submission and
> waiting across resets, in the process fixing the persistent -EIO we
> still see today on failed waits.

tbh that explanation went straight over my head. Questions:
- Where do we wait again over a reset? With all the wakeups we should
  catch them properly. I guess this needs the detailed scenario to help me
  out, since I have dim memories that there is something wrong ;-)

- What precisely is the effect for waiters? I only see moving the
  atomic_inc under the dev->struct_mutex, which shouldn't do a whole lot
  for anyone. Plus not returning -EAGAIN when reset_in_progress, which
  looks like it might result in missed wakeups and deadlocks with the
  reset work.

I /thought/ that the -EIO is from check_wedge in intel_ring_begin, but
that part seems essentially unchanged.

Aside from all that, shuffling the atomic_inc (if I can convince myself
that it's safe) to avoid the reload_in_reset hack looks like a good
improvement.

More confused comments below.
-Daniel

> v2: Rebase
> v3: Now with added testcase
> 
> Testcase: igt/gem_eio
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> ---
>  drivers/gpu/drm/i915/i915_debugfs.c     |  2 +-
>  drivers/gpu/drm/i915/i915_drv.c         | 32 +++++++-----
>  drivers/gpu/drm/i915/i915_drv.h         | 39 ++++++++++----
>  drivers/gpu/drm/i915/i915_gem.c         | 90 +++++++++++++++------------------
>  drivers/gpu/drm/i915/i915_irq.c         | 21 +-------
>  drivers/gpu/drm/i915/intel_display.c    | 10 ++--
>  drivers/gpu/drm/i915/intel_ringbuffer.c | 17 ++++---
>  7 files changed, 104 insertions(+), 107 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
> index d83f35c8df34..107711ec956f 100644
> --- a/drivers/gpu/drm/i915/i915_debugfs.c
> +++ b/drivers/gpu/drm/i915/i915_debugfs.c
> @@ -4415,7 +4415,7 @@ i915_wedged_get(void *data, u64 *val)
>  	struct drm_device *dev = data;
>  	struct drm_i915_private *dev_priv = dev->dev_private;
>  
> -	*val = atomic_read(&dev_priv->gpu_error.reset_counter);
> +	*val = i915_terminally_wedged(&dev_priv->gpu_error);

Don't we have a few tests left that look at this still?

>  	return 0;
>  }
> diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
> index ffd99d27fac7..5838882e10a1 100644
> --- a/drivers/gpu/drm/i915/i915_drv.c
> +++ b/drivers/gpu/drm/i915/i915_drv.c
> @@ -841,23 +841,31 @@ int i915_resume_legacy(struct drm_device *dev)
>  int i915_reset(struct drm_device *dev)
>  {
>  	struct drm_i915_private *dev_priv = dev->dev_private;
> +	struct i915_gpu_error *error = &dev_priv->gpu_error;
> +	unsigned reset_counter;
>  	bool simulated;
>  	int ret;
>  
>  	intel_reset_gt_powersave(dev);
>  
>  	mutex_lock(&dev->struct_mutex);
> +	atomic_clear_mask(I915_WEDGED, &error->reset_counter);
> +	reset_counter = atomic_inc_return(&error->reset_counter);

This publishes the reset-complete too early for the modeset code I think.
At least it crucially relies upon dev->struct_mutex serializing
everything in our driver, and I don't like to cement that assumption in
even more.

Why do we need to move that atomic_inc again?

I guess what we could try is set it right after we've reset the hw, but
before we start to re-initialize it. So move the atomic stuff below
intel_gpu_reset, with some neat barrier again. Maybe we should even push
i915_gem_reset down below that too. Not sure.

> +	if (WARN_ON(__i915_reset_in_progress(reset_counter))) {
> +		ret = -EIO;
> +		goto error;
> +	}
>  
>  	i915_gem_reset(dev);
>  
> -	simulated = dev_priv->gpu_error.stop_rings != 0;
> +	simulated = error->stop_rings != 0;
>  
>  	ret = intel_gpu_reset(dev);
>  
>  	/* Also reset the gpu hangman. */
>  	if (simulated) {
>  		DRM_INFO("Simulated gpu hang, resetting stop_rings\n");
> -		dev_priv->gpu_error.stop_rings = 0;
> +		error->stop_rings = 0;
>  		if (ret == -ENODEV) {
>  			DRM_INFO("Reset not implemented, but ignoring "
>  				 "error for simulated gpu hangs\n");
> @@ -870,8 +878,7 @@ int i915_reset(struct drm_device *dev)
>  
>  	if (ret) {
>  		DRM_ERROR("Failed to reset chip: %i\n", ret);
> -		mutex_unlock(&dev->struct_mutex);
> -		return ret;
> +		goto error;
>  	}
>  
>  	intel_overlay_reset(dev_priv);
> @@ -890,20 +897,14 @@ int i915_reset(struct drm_device *dev)
>  	 * was running at the time of the reset (i.e. we weren't VT
>  	 * switched away).
>  	 */
> -
> -	/* Used to prevent gem_check_wedged returning -EAGAIN during gpu reset */
> -	dev_priv->gpu_error.reload_in_reset = true;
> -
>  	ret = i915_gem_init_hw(dev);
> -
> -	dev_priv->gpu_error.reload_in_reset = false;
> -
> -	mutex_unlock(&dev->struct_mutex);
>  	if (ret) {
>  		DRM_ERROR("Failed hw init on reset %d\n", ret);
> -		return ret;
> +		goto error;
>  	}
>  
> +	mutex_unlock(&dev->struct_mutex);
> +
>  	/*
>  	 * rps/rc6 re-init is necessary to restore state lost after the
>  	 * reset and the re-install of gt irqs. Skip for ironlake per
> @@ -914,6 +915,11 @@ int i915_reset(struct drm_device *dev)
>  		intel_enable_gt_powersave(dev);
>  
>  	return 0;
> +
> +error:
> +	atomic_set_mask(I915_WEDGED, &error->reset_counter);
> +	mutex_unlock(&dev->struct_mutex);
> +	return ret;
>  }
>  
>  static int i915_pci_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index 3fe933e4e664..aa83b284cb2f 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -1395,9 +1395,6 @@ struct i915_gpu_error {
>  
>  	/* For missed irq/seqno simulation. */
>  	unsigned int test_irq_rings;
> -
> -	/* Used to prevent gem_check_wedged returning -EAGAIN during gpu reset   */
> -	bool reload_in_reset;
>  };
>  
>  enum modeset_restore {
> @@ -2157,6 +2154,7 @@ struct drm_i915_gem_request {
>  	/** On Which ring this request was generated */
>  	struct drm_i915_private *i915;
>  	struct intel_engine_cs *engine;
> +	unsigned reset_counter;
>  
>  	/** GEM sequence number associated with this request. */
>  	uint32_t seqno;
> @@ -2891,23 +2889,47 @@ struct drm_i915_gem_request *
>  i915_gem_find_active_request(struct intel_engine_cs *ring);
>  
>  bool i915_gem_retire_requests(struct drm_device *dev);
> -int __must_check i915_gem_check_wedge(struct i915_gpu_error *error,
> +int __must_check i915_gem_check_wedge(unsigned reset_counter,
>  				      bool interruptible);
>  
> +static inline u32 i915_reset_counter(struct i915_gpu_error *error)
> +{
> +	return atomic_read(&error->reset_counter);
> +}
> +
> +static inline bool __i915_reset_in_progress(u32 reset)
> +{
> +	return unlikely(reset & I915_RESET_IN_PROGRESS_FLAG);
> +}
> +
> +static inline bool __i915_reset_in_progress_or_wedged(u32 reset)
> +{
> +	return unlikely(reset & (I915_RESET_IN_PROGRESS_FLAG | I915_WEDGED));
> +}
> +
> +static inline bool __i915_terminally_wedged(u32 reset)
> +{
> +	return unlikely(reset & I915_WEDGED);
> +}
> +
>  static inline bool i915_reset_in_progress(struct i915_gpu_error *error)
>  {
> -	return unlikely(atomic_read(&error->reset_counter)
> -			& (I915_RESET_IN_PROGRESS_FLAG | I915_WEDGED));
> +	return __i915_reset_in_progress(i915_reset_counter(error));
> +}
> +
> +static inline bool i915_reset_in_progress_or_wedged(struct i915_gpu_error *error)
> +{
> +	return __i915_reset_in_progress_or_wedged(i915_reset_counter(error));
>  }
>  
>  static inline bool i915_terminally_wedged(struct i915_gpu_error *error)
>  {
> -	return atomic_read(&error->reset_counter) & I915_WEDGED;
> +	return __i915_terminally_wedged(i915_reset_counter(error));
>  }
>  
>  static inline u32 i915_reset_count(struct i915_gpu_error *error)
>  {
> -	return ((atomic_read(&error->reset_counter) & ~I915_WEDGED) + 1) / 2;
> +	return ((i915_reset_counter(error) & ~I915_WEDGED) + 1) / 2;
>  }
>  
>  static inline bool i915_stop_ring_allow_ban(struct drm_i915_private *dev_priv)
> @@ -2940,7 +2962,6 @@ void __i915_add_request(struct drm_i915_gem_request *req,
>  #define i915_add_request_no_flush(req) \
>  	__i915_add_request(req, NULL, false)
>  int __i915_wait_request(struct drm_i915_gem_request *req,
> -			unsigned reset_counter,
>  			bool interruptible,
>  			s64 *timeout,
>  			struct intel_rps_client *rps);
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index d3609e111647..6ac9c80244fa 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -85,9 +85,7 @@ i915_gem_wait_for_error(struct i915_gpu_error *error)
>  {
>  	int ret;
>  
> -#define EXIT_COND (!i915_reset_in_progress(error) || \
> -		   i915_terminally_wedged(error))
> -	if (EXIT_COND)
> +	if (!i915_reset_in_progress(error))
>  		return 0;
>  
>  	/*
> @@ -96,17 +94,16 @@ i915_gem_wait_for_error(struct i915_gpu_error *error)
>  	 * we should simply try to bail out and fail as gracefully as possible.
>  	 */
>  	ret = wait_event_interruptible_timeout(error->reset_queue,
> -					       EXIT_COND,
> +					       !i915_reset_in_progress(error),
>  					       10*HZ);
>  	if (ret == 0) {
>  		DRM_ERROR("Timed out waiting for the gpu reset to complete\n");
>  		return -EIO;
>  	} else if (ret < 0) {
>  		return ret;
> +	} else {
> +		return 0;
>  	}
> -#undef EXIT_COND
> -
> -	return 0;

I like the existing color better ;-)

>  }
>  
>  int i915_mutex_lock_interruptible(struct drm_device *dev)
> @@ -1112,26 +1109,20 @@ put_rpm:
>  }
>  
>  int
> -i915_gem_check_wedge(struct i915_gpu_error *error,
> +i915_gem_check_wedge(unsigned reset_counter,
>  		     bool interruptible)
>  {
> -	if (i915_reset_in_progress(error)) {
> +	if (__i915_reset_in_progress_or_wedged(reset_counter)) {
>  		/* Non-interruptible callers can't handle -EAGAIN, hence return
>  		 * -EIO unconditionally for these. */
>  		if (!interruptible)
>  			return -EIO;
>  
>  		/* Recovery complete, but the reset failed ... */
> -		if (i915_terminally_wedged(error))
> +		if (__i915_terminally_wedged(reset_counter))
>  			return -EIO;
>  
> -		/*
> -		 * Check if GPU Reset is in progress - we need intel_ring_begin
> -		 * to work properly to reinit the hw state while the gpu is
> -		 * still marked as reset-in-progress. Handle this with a flag.
> -		 */
> -		if (!error->reload_in_reset)
> -			return -EAGAIN;
> +		return -EAGAIN;

This only works because you move the check_wedge from ring_begin to
wait_space, so it assumes that we never change that again. Rather fragile
imo.

>  	}
>  
>  	return 0;
> @@ -1200,7 +1191,6 @@ static int __i915_spin_request(struct drm_i915_gem_request *req, int state)
>  /**
>   * __i915_wait_request - wait until execution of request has finished
>   * @req: duh!
> - * @reset_counter: reset sequence associated with the given request
>   * @interruptible: do an interruptible wait (normally yes)
>   * @timeout: in - how long to wait (NULL forever); out - how much time remaining
>   *
> @@ -1215,7 +1205,6 @@ static int __i915_spin_request(struct drm_i915_gem_request *req, int state)
>   * errno with remaining time filled in timeout argument.
>   */
>  int __i915_wait_request(struct drm_i915_gem_request *req,
> -			unsigned reset_counter,
>  			bool interruptible,
>  			s64 *timeout,
>  			struct intel_rps_client *rps)
> @@ -1264,12 +1253,12 @@ int __i915_wait_request(struct drm_i915_gem_request *req,
>  
>  		/* We need to check whether any gpu reset happened in between
>  		 * the caller grabbing the seqno and now ... */
> -		if (reset_counter != atomic_read(&dev_priv->gpu_error.reset_counter)) {
> -			/* ... but upgrade the -EAGAIN to an -EIO if the gpu
> -			 * is truely gone. */
> -			ret = i915_gem_check_wedge(&dev_priv->gpu_error, interruptible);
> -			if (ret == 0)
> -				ret = -EAGAIN;
> +		if (req->reset_counter != i915_reset_counter(&dev_priv->gpu_error)) {
> +			/* As we do not requeue the request over a GPU reset,
> +			 * if one does occur we know that the request is
> +			 * effectively complete.
> +			 */
> +			ret = 0;
>  			break;

This now no longer bails out straight away with -EAGAIN when a reset is
in progress. I fear for the deadlocks.

>  		}
>  
> @@ -1433,13 +1422,7 @@ i915_wait_request(struct drm_i915_gem_request *req)
>  
>  	BUG_ON(!mutex_is_locked(&dev_priv->dev->struct_mutex));
>  
> -	ret = i915_gem_check_wedge(&dev_priv->gpu_error, interruptible);
> -	if (ret)
> -		return ret;
> -
> -	ret = __i915_wait_request(req,
> -				  atomic_read(&dev_priv->gpu_error.reset_counter),
> -				  interruptible, NULL, NULL);
> +	ret = __i915_wait_request(req, interruptible, NULL, NULL);
>  	if (ret)
>  		return ret;
>  
> @@ -1514,7 +1497,6 @@ i915_gem_object_wait_rendering__nonblocking(struct drm_i915_gem_object *obj,
>  	struct drm_device *dev = obj->base.dev;
>  	struct drm_i915_private *dev_priv = dev->dev_private;
>  	struct drm_i915_gem_request *requests[I915_NUM_RINGS];
> -	unsigned reset_counter;
>  	int ret, i, n = 0;
>  
>  	BUG_ON(!mutex_is_locked(&dev->struct_mutex));
> @@ -1523,12 +1505,6 @@ i915_gem_object_wait_rendering__nonblocking(struct drm_i915_gem_object *obj,
>  	if (!obj->active)
>  		return 0;
>  
> -	ret = i915_gem_check_wedge(&dev_priv->gpu_error, true);
> -	if (ret)
> -		return ret;
> -
> -	reset_counter = atomic_read(&dev_priv->gpu_error.reset_counter);
> -
>  	if (readonly) {
>  		struct drm_i915_gem_request *req;
>  
> @@ -1550,9 +1526,9 @@ i915_gem_object_wait_rendering__nonblocking(struct drm_i915_gem_object *obj,
>  	}
>  
>  	mutex_unlock(&dev->struct_mutex);
> +	ret = 0;
>  	for (i = 0; ret == 0 && i < n; i++)
> -		ret = __i915_wait_request(requests[i], reset_counter, true,
> -					  NULL, rps);
> +		ret = __i915_wait_request(requests[i], true, NULL, rps);
>  	mutex_lock(&dev->struct_mutex);
>  
>  	for (i = 0; i < n; i++) {
> @@ -2707,6 +2683,7 @@ int i915_gem_request_alloc(struct intel_engine_cs *engine,
>  			   struct drm_i915_gem_request **req_out)
>  {
>  	struct drm_i915_private *dev_priv = to_i915(engine->dev);
> +	unsigned reset_counter = i915_reset_counter(&dev_priv->gpu_error);
>  	struct drm_i915_gem_request *req;
>  	int ret;
>  
> @@ -2715,6 +2692,10 @@ int i915_gem_request_alloc(struct intel_engine_cs *engine,
>  
>  	*req_out = NULL;
>  
> +	ret = i915_gem_check_wedge(reset_counter, dev_priv->mm.interruptible);
> +	if (ret)
> +		return ret;
> +
>  	req = kmem_cache_zalloc(dev_priv->requests, GFP_KERNEL);
>  	if (req == NULL)
>  		return -ENOMEM;
> @@ -2725,6 +2706,7 @@ int i915_gem_request_alloc(struct intel_engine_cs *engine,
>  
>  	kref_init(&req->ref);
>  	req->i915 = dev_priv;
> +	req->reset_counter = reset_counter;
>  	req->engine = engine;
>  	req->ctx  = ctx;
>  	i915_gem_context_reference(req->ctx);
> @@ -3077,11 +3059,9 @@ void i915_gem_close_object(struct drm_gem_object *gem,
>  int
>  i915_gem_wait_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
>  {
> -	struct drm_i915_private *dev_priv = dev->dev_private;
>  	struct drm_i915_gem_wait *args = data;
>  	struct drm_i915_gem_object *obj;
>  	struct drm_i915_gem_request *req[I915_NUM_RINGS];
> -	unsigned reset_counter;
>  	int i, n = 0;
>  	int ret;
>  
> @@ -3115,7 +3095,6 @@ i915_gem_wait_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
>  	}
>  
>  	drm_gem_object_unreference(&obj->base);
> -	reset_counter = atomic_read(&dev_priv->gpu_error.reset_counter);
>  
>  	for (i = 0; i < I915_NUM_RINGS; i++) {
>  		if (obj->last_read_req[i] == NULL)
> @@ -3128,11 +3107,23 @@ i915_gem_wait_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
>  
>  	for (i = 0; i < n; i++) {
>  		if (ret == 0)
> -			ret = __i915_wait_request(req[i], reset_counter, true,
> +			ret = __i915_wait_request(req[i], true,
>  						  args->timeout_ns > 0 ? &args->timeout_ns : NULL,
>  						  file->driver_priv);
> +
> +		/* If the GPU hung before this, report it. Ideally we only
> +		 * report if this request cannot be completed. Currently
> +		 * we don't mark the guilty party and abort all
> +		 * requests on reset, so just mark all as EIO.
> +		 */
> +		if (ret == 0 &&
> +		    req[i]->reset_counter != i915_reset_counter(&req[i]->i915->gpu_error))
> +			ret = -EIO;
> +
>  		i915_gem_request_unreference__unlocked(req[i]);
>  	}
> +
> +
>  	return ret;
>  
>  out:
> @@ -3160,7 +3151,6 @@ __i915_gem_object_sync(struct drm_i915_gem_object *obj,
>  	if (!i915_semaphore_is_enabled(from_req->i915)) {
>  		struct drm_i915_private *i915 = from_req->i915;
>  		ret = __i915_wait_request(from_req,
> -					  atomic_read(&i915->gpu_error.reset_counter),
>  					  i915->mm.interruptible,
>  					  NULL,
>  					  &i915->rps.semaphores);
> @@ -4082,14 +4072,15 @@ i915_gem_ring_throttle(struct drm_device *dev, struct drm_file *file)
>  	struct drm_i915_file_private *file_priv = file->driver_priv;
>  	unsigned long recent_enough = jiffies - DRM_I915_THROTTLE_JIFFIES;
>  	struct drm_i915_gem_request *request, *target = NULL;
> -	unsigned reset_counter;
>  	int ret;
>  
>  	ret = i915_gem_wait_for_error(&dev_priv->gpu_error);
>  	if (ret)
>  		return ret;
>  
> -	ret = i915_gem_check_wedge(&dev_priv->gpu_error, false);
> +	/* ABI: return -EIO if wedged */
> +	ret = i915_gem_check_wedge(i915_reset_counter(&dev_priv->gpu_error),
> +				   false);
>  	if (ret)
>  		return ret;
>  
> @@ -4107,7 +4098,6 @@ i915_gem_ring_throttle(struct drm_device *dev, struct drm_file *file)
>  
>  		target = request;
>  	}
> -	reset_counter = atomic_read(&dev_priv->gpu_error.reset_counter);
>  	if (target)
>  		i915_gem_request_reference(target);
>  	spin_unlock(&file_priv->mm.lock);
> @@ -4115,7 +4105,7 @@ i915_gem_ring_throttle(struct drm_device *dev, struct drm_file *file)
>  	if (target == NULL)
>  		return 0;
>  
> -	ret = __i915_wait_request(target, reset_counter, true, NULL, NULL);
> +	ret = __i915_wait_request(target, true, NULL, NULL);
>  	if (ret == 0)
>  		queue_delayed_work(dev_priv->wq, &dev_priv->mm.retire_work, 0);
>  
> diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
> index f9f5e9cf7360..186315208462 100644
> --- a/drivers/gpu/drm/i915/i915_irq.c
> +++ b/drivers/gpu/drm/i915/i915_irq.c
> @@ -2200,7 +2200,6 @@ static void i915_error_wake_up(struct drm_i915_private *dev_priv,
>  static void i915_reset_and_wakeup(struct drm_device *dev)
>  {
>  	struct drm_i915_private *dev_priv = to_i915(dev);
> -	struct i915_gpu_error *error = &dev_priv->gpu_error;
>  	char *error_event[] = { I915_ERROR_UEVENT "=1", NULL };
>  	char *reset_event[] = { I915_RESET_UEVENT "=1", NULL };
>  	char *reset_done_event[] = { I915_ERROR_UEVENT "=0", NULL };
> @@ -2218,7 +2217,7 @@ static void i915_reset_and_wakeup(struct drm_device *dev)
>  	 * the reset in-progress bit is only ever set by code outside of this
>  	 * work we don't need to worry about any other races.
>  	 */
> -	if (i915_reset_in_progress(error) && !i915_terminally_wedged(error)) {
> +	if (i915_reset_in_progress(&dev_priv->gpu_error)) {
>  		DRM_DEBUG_DRIVER("resetting chip\n");
>  		kobject_uevent_env(&dev->primary->kdev->kobj, KOBJ_CHANGE,
>  				   reset_event);
> @@ -2246,25 +2245,9 @@ static void i915_reset_and_wakeup(struct drm_device *dev)
>  
>  		intel_runtime_pm_put(dev_priv);
>  
> -		if (ret == 0) {
> -			/*
> -			 * After all the gem state is reset, increment the reset
> -			 * counter and wake up everyone waiting for the reset to
> -			 * complete.
> -			 *
> -			 * Since unlock operations are a one-sided barrier only,
> -			 * we need to insert a barrier here to order any seqno
> -			 * updates before
> -			 * the counter increment.
> -			 */
> -			smp_mb__before_atomic();
> -			atomic_inc(&dev_priv->gpu_error.reset_counter);
> -
> +		if (ret == 0)
>  			kobject_uevent_env(&dev->primary->kdev->kobj,
>  					   KOBJ_CHANGE, reset_done_event);
> -		} else {
> -			atomic_set_mask(I915_WEDGED, &error->reset_counter);
> -		}
>  
>  		/*
>  		 * Note: The wake_up also serves as a memory barrier so that
> diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c
> index cdd6074257af..8cb89320d28d 100644
> --- a/drivers/gpu/drm/i915/intel_display.c
> +++ b/drivers/gpu/drm/i915/intel_display.c
> @@ -3298,8 +3298,7 @@ static bool intel_crtc_has_pending_flip(struct drm_crtc *crtc)
>  	struct intel_crtc *intel_crtc = to_intel_crtc(crtc);
>  	bool pending;
>  
> -	if (i915_reset_in_progress(&dev_priv->gpu_error) ||
> -	    intel_crtc->reset_counter != atomic_read(&dev_priv->gpu_error.reset_counter))
> +	if (intel_crtc->reset_counter != i915_reset_counter(&dev_priv->gpu_error))
>  		return false;
>  
>  	spin_lock_irq(&dev->event_lock);
> @@ -10751,8 +10750,7 @@ static bool page_flip_finished(struct intel_crtc *crtc)
>  	struct drm_device *dev = crtc->base.dev;
>  	struct drm_i915_private *dev_priv = dev->dev_private;
>  
> -	if (i915_reset_in_progress(&dev_priv->gpu_error) ||
> -	    crtc->reset_counter != atomic_read(&dev_priv->gpu_error.reset_counter))
> +	if (crtc->reset_counter != i915_reset_counter(&dev_priv->gpu_error))
>  		return true;
>  
>  	/*
> @@ -11180,9 +11178,7 @@ static void intel_mmio_flip_work_func(struct work_struct *work)
>  		container_of(work, struct intel_mmio_flip, work);
>  
>  	if (mmio_flip->req)
> -		WARN_ON(__i915_wait_request(mmio_flip->req,
> -					    mmio_flip->crtc->reset_counter,
> -					    false, NULL,
> +		WARN_ON(__i915_wait_request(mmio_flip->req, false, NULL,
>  					    &mmio_flip->i915->rps.mmioflips));
>  
>  	intel_do_mmio_flip(mmio_flip->crtc);
> diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
> index f33db3495e35..b3538b66353b 100644
> --- a/drivers/gpu/drm/i915/intel_ringbuffer.c
> +++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
> @@ -2156,9 +2156,7 @@ int intel_ring_idle(struct intel_engine_cs *ring)
>  			list);
>  
>  	/* Make sure we do not trigger any retires */
> -	return __i915_wait_request(req,
> -				   atomic_read(&req->i915->gpu_error.reset_counter),
> -				   req->i915->mm.interruptible,
> +	return __i915_wait_request(req, req->i915->mm.interruptible,
>  				   NULL, NULL);
>  }
>  
> @@ -2252,6 +2250,14 @@ static int wait_for_space(struct drm_i915_gem_request *req, int bytes)
>  	if (ret)
>  		return ret;
>  
> +	/* If the request was completed due to a GPU hang, we want to
> +	 * error out before we continue to emit more commands to the GPU.
> +	 */
> +	ret = i915_gem_check_wedge(i915_reset_counter(&to_i915(engine->dev)->gpu_error),
> +				   to_i915(engine->dev)->mm.interruptible);
> +	if (ret)
> +		return ret;

Don't we still have the problem with -EIO now here, just less likely than
from intel_ring_begin?

> +
>  	ring->space = space;
>  	return 0;
>  }
> @@ -2316,11 +2322,6 @@ int intel_ring_begin(struct drm_i915_gem_request *req, int num_dwords)
>  {
>  	int ret;
>  
> -	ret = i915_gem_check_wedge(&req->i915->gpu_error,
> -				   req->i915->mm.interruptible);
> -	if (ret)
> -		return ret;
> -
>  	ret = ring_prepare(req, num_dwords * sizeof(uint32_t));
>  	if (ret)
>  		return ret;
> -- 
> 2.6.2
> 
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/intel-gfx
Chris Wilson Nov. 24, 2015, 9:43 p.m. UTC | #2
On Tue, Nov 24, 2015 at 05:43:22PM +0100, Daniel Vetter wrote:
> On Fri, Nov 20, 2015 at 12:43:52PM +0000, Chris Wilson wrote:
> > Instead of querying the reset counter before every access to the ring,
> > query it the first time we touch the ring, and do a final compare when
> > submitting the request. For correctness, we need to then sanitize how
> > the reset_counter is incremented to prevent broken submission and
> > waiting across resets, in the process fixing the persistent -EIO we
> > still see today on failed waits.
> 
> tbh that explanation went straight over my head. Questions:
> - Where do we wait again over a reset? With all the wakeups we should
>   catch them properly. I guess this needs the detailed scenario to help me
>   out, since I have dim memories that there is something wrong ;-)

TDR. In the future (and in past speculative patches) we have proposed
keeping requests over a reset and requeuing them. That is a complication
to the simplification of bailing out from the wait. What I have in mind
is that the recovery code has to fix up the request list somehow, though that
will be fun.
 
> - What precisely is the effect for waiters? I only see moving the
>   atomic_inc under the dev->struct_mutex, which shouldn't do a whole lot
>   for anyone. Plus not returning -EAGAIN when reset_in_progress, which
>   looks like it might result in missed wakeups and deadlocks with the
>   reset work.

At the moment, I can trivially wedge the machine. This patch fixes that.
The patch also ensures that we cannot raise unhandled errors from
wait-request (EIO/EAGAIN handling has to be explicitly added to the
caller).

The wait-request interface still has the wait-seqno legacy of having to
acquire the reset_counter under the mutex before calling. That is quite
hairy and causes a major complication later where we want to amalgamate
waiters.
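
For illustration, a minimal sketch of the calling-convention change
(condensed from the diff, not a verbatim quote):

	/* before: every caller samples the counter under struct_mutex */
	reset_counter = atomic_read(&dev_priv->gpu_error.reset_counter);
	ret = __i915_wait_request(req, reset_counter, true, NULL, NULL);

	/* after: the counter was sampled once at request creation and
	 * stored in req->reset_counter, so a waiter no longer needs the
	 * mutex to obtain a snapshot consistent with the request
	 */
	ret = __i915_wait_request(req, true, NULL, NULL);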

Re EAGAIN. No, it cannot result in missed wakeups since that is internal
to the wait_request function, nor can it result in new deadlocks with the
reset worker.

> I /thought/ that the -EIO is from check_wedge in intel_ring_begin, but
> that part seems essentially unchanged.

For two reasons: we need to protect the access to the ring, and you
argued that it (reporting of EIO from previously wedged GPUs) was part of
the ABI. The other ABI that I think is important is the reporting of
EIO if the user explicitly waits on a request and it is incomplete (i.e.
wait_ioctl).

> Aside from all that, shuffling the atomic_inc (if I can convince myself
> that it's safe) to avoid the reload_in_reset hack looks like a good
> improvement.

That's what I said at the time :-p
 
> More confused comments below.

A large proportion of the actual patch is spent trying to make using the
atomic reset_counter safer (wrt it changing values between multiple
tests and subsequent action).
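
The pattern, in sketch form (this is essentially what the new
i915_gem_check_wedge() boils down to for interruptible callers): read
the atomic exactly once into a local snapshot and only ever test the
snapshot, so the in-progress and wedged bits cannot change between the
tests and the subsequent action:

	unsigned reset = i915_reset_counter(&dev_priv->gpu_error);

	if (__i915_reset_in_progress_or_wedged(reset)) {
		if (__i915_terminally_wedged(reset))
			return -EIO;	/* recovery ran and failed */
		return -EAGAIN;		/* reset pending, back off and retry */
	}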

> > diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
> > index d83f35c8df34..107711ec956f 100644
> > --- a/drivers/gpu/drm/i915/i915_debugfs.c
> > +++ b/drivers/gpu/drm/i915/i915_debugfs.c
> > @@ -4415,7 +4415,7 @@ i915_wedged_get(void *data, u64 *val)
> >  	struct drm_device *dev = data;
> >  	struct drm_i915_private *dev_priv = dev->dev_private;
> >  
> > -	*val = atomic_read(&dev_priv->gpu_error.reset_counter);
> > +	*val = i915_terminally_wedged(&dev_priv->gpu_error);
> 
> Don't we have a few tests left that look at this still?

One. I strongly believe that the debug interface should not be reporting
the reset_counter but the wedged status (as that is what it is called).

> > diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
> > index ffd99d27fac7..5838882e10a1 100644
> > --- a/drivers/gpu/drm/i915/i915_drv.c
> > +++ b/drivers/gpu/drm/i915/i915_drv.c
> > @@ -841,23 +841,31 @@ int i915_resume_legacy(struct drm_device *dev)
> >  int i915_reset(struct drm_device *dev)
> >  {
> >  	struct drm_i915_private *dev_priv = dev->dev_private;
> > +	struct i915_gpu_error *error = &dev_priv->gpu_error;
> > +	unsigned reset_counter;
> >  	bool simulated;
> >  	int ret;
> >  
> >  	intel_reset_gt_powersave(dev);
> >  
> >  	mutex_lock(&dev->struct_mutex);
> > +	atomic_clear_mask(I915_WEDGED, &error->reset_counter);
> > +	reset_counter = atomic_inc_return(&error->reset_counter);
> 
> This publishes the reset-complete too early for the modeset code I think.
> At least it crucially relies upon dev->struct_mutex serializing
> everything in our driver, and I don't like to cement that assumption in
> even more.

Hmm? It is already concreted in as the reset / continuation is ordered
with the struct_mutex. We have kicks in place for the modesetting code,
but we have not actually ordered that with anything inside the reset
recovery code. Judging by the few changes to i915_reset(), I am doubtful
that has actually been improved for atomic, so that it basically relies
on display being unaffected by anything but pending work.
 
> Why do we need to move that atomic_inc again?

The inc is to acknowledge the reset and to allow us to begin using the
device again (inside the reset func). It is the equivalent of adding
another bit for reload_in_reset. I moved it to the beginning for my
convenience.
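
In other words (condensed from the i915_reset() hunk above, comments
mine):

	mutex_lock(&dev->struct_mutex);
	atomic_clear_mask(I915_WEDGED, &error->reset_counter);

	/* Second increment of the two-phase counter: flips it from odd
	 * (reset pending, set by the error handler) back to even, so the
	 * reinit code below can touch the rings without tripping over
	 * check_wedge -- the job reload_in_reset used to do.
	 */
	reset_counter = atomic_inc_return(&error->reset_counter);
	if (WARN_ON(__i915_reset_in_progress(reset_counter))) {
		ret = -EIO;	/* counter was not odd: unbalanced use */
		goto error;
	}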
 
> I guess what we could try is set it right after we've reset the hw, but
> before we start to re-initialize it. So move the atomic stuff below
> intel_gpu_reset, with some neat barrier again. Maybe we should even push
> i915_gem_reset down below that too. Not sure.

Sure, I liked the simplicity though :)

> > diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> > index d3609e111647..6ac9c80244fa 100644
> > --- a/drivers/gpu/drm/i915/i915_gem.c
> > +++ b/drivers/gpu/drm/i915/i915_gem.c
> > @@ -85,9 +85,7 @@ i915_gem_wait_for_error(struct i915_gpu_error *error)
> >  {
> >  	int ret;
> >  
> > -#define EXIT_COND (!i915_reset_in_progress(error) || \
> > -		   i915_terminally_wedged(error))
> > -	if (EXIT_COND)
> > +	if (!i915_reset_in_progress(error))
> >  		return 0;
> >  
> >  	/*
> > @@ -96,17 +94,16 @@ i915_gem_wait_for_error(struct i915_gpu_error *error)
> >  	 * we should simply try to bail out and fail as gracefully as possible.
> >  	 */
> >  	ret = wait_event_interruptible_timeout(error->reset_queue,
> > -					       EXIT_COND,
> > +					       !i915_reset_in_progress(error),
> >  					       10*HZ);
> >  	if (ret == 0) {
> >  		DRM_ERROR("Timed out waiting for the gpu reset to complete\n");
> >  		return -EIO;
> >  	} else if (ret < 0) {
> >  		return ret;
> > +	} else {
> > +		return 0;
> >  	}
> > -#undef EXIT_COND
> > -
> > -	return 0;
> 
> I like the existing color better ;-)

You would like
#define EXIT_COND (!i915_reset_in_progress(error)) ?

> >  int
> > -i915_gem_check_wedge(struct i915_gpu_error *error,
> > +i915_gem_check_wedge(unsigned reset_counter,
> >  		     bool interruptible)
> >  {
> > -	if (i915_reset_in_progress(error)) {
> > +	if (__i915_reset_in_progress_or_wedged(reset_counter)) {
> >  		/* Non-interruptible callers can't handle -EAGAIN, hence return
> >  		 * -EIO unconditionally for these. */
> >  		if (!interruptible)
> >  			return -EIO;
> >  
> >  		/* Recovery complete, but the reset failed ... */
> > -		if (i915_terminally_wedged(error))
> > +		if (__i915_terminally_wedged(reset_counter))
> >  			return -EIO;
> >  
> > -		/*
> > -		 * Check if GPU Reset is in progress - we need intel_ring_begin
> > -		 * to work properly to reinit the hw state while the gpu is
> > -		 * still marked as reset-in-progress. Handle this with a flag.
> > -		 */
> > -		if (!error->reload_in_reset)
> > -			return -EAGAIN;
> > +		return -EAGAIN;
> 
> This only works because you move the check_wedge from ring_begin to
> wait_space, so it assumes that we never change that again. Rather fragile
> imo.

Pardon? If you argue that intel_ring_begin() is a better place to check
you have misunderstood the reason behind the patch entirely. Note that
we still call this function to preserve ABI. The second reason for
checking in wait_for_space is that it is the only place where we do care
about the masked pending reset.

> > @@ -1200,7 +1191,6 @@ static int __i915_spin_request(struct drm_i915_gem_request *req, int state)
> >  /**
> >   * __i915_wait_request - wait until execution of request has finished
> >   * @req: duh!
> > - * @reset_counter: reset sequence associated with the given request
> >   * @interruptible: do an interruptible wait (normally yes)
> >   * @timeout: in - how long to wait (NULL forever); out - how much time remaining
> >   *
> > @@ -1215,7 +1205,6 @@ static int __i915_spin_request(struct drm_i915_gem_request *req, int state)
> >   * errno with remaining time filled in timeout argument.
> >   */
> >  int __i915_wait_request(struct drm_i915_gem_request *req,
> > -			unsigned reset_counter,
> >  			bool interruptible,
> >  			s64 *timeout,
> >  			struct intel_rps_client *rps)
> > @@ -1264,12 +1253,12 @@ int __i915_wait_request(struct drm_i915_gem_request *req,
> >  
> >  		/* We need to check whether any gpu reset happened in between
> >  		 * the caller grabbing the seqno and now ... */
> > -		if (reset_counter != atomic_read(&dev_priv->gpu_error.reset_counter)) {
> > -			/* ... but upgrade the -EAGAIN to an -EIO if the gpu
> > -			 * is truely gone. */
> > -			ret = i915_gem_check_wedge(&dev_priv->gpu_error, interruptible);
> > -			if (ret == 0)
> > -				ret = -EAGAIN;
> > +		if (req->reset_counter != i915_reset_counter(&dev_priv->gpu_error)) {
> > +			/* As we do not requeue the request over a GPU reset,
> > +			 * if one does occur we know that the request is
> > +			 * effectively complete.
> > +			 */
> > +			ret = 0;
> >  			break;
> 
> This now no longer bails out straight away with -EAGAIN when a reset is
> in progress. I fear for the deadlocks.

You fear incorrectly here. Either we hold the struct_mutex in which case
the reset waits and we complete our work without needing anything else,
or we may want to take the struct_mutex to complete our work and so may
end up waiting for the reset to complete. Either way the state of this
request is the same as when the GPU reset actually occurs. There is
an inconsistency in that we don't mark it as complete though, so we
may try to wait on it again (only to find that the reset is in
progress and bail out again).

The issue is not a potential deadlock (because that would be an already
existing ABBA in the code) but resources that can be depleted by
requests.

> > @@ -2252,6 +2250,14 @@ static int wait_for_space(struct drm_i915_gem_request *req, int bytes)
> >  	if (ret)
> >  		return ret;
> >  
> > +	/* If the request was completed due to a GPU hang, we want to
> > +	 * error out before we continue to emit more commands to the GPU.
> > +	 */
> > +	ret = i915_gem_check_wedge(i915_reset_counter(&to_i915(engine->dev)->gpu_error),
> > +				   to_i915(engine->dev)->mm.interruptible);
> > +	if (ret)
> > +		return ret;
> 
> Don't we still have the problem with -EIO now here, just less likely than
> from intel_ring_begin?

What's the problem? intel_ring_begin() is supposed to return
-EIO/-EAGAIN.
Daniel Vetter Nov. 25, 2015, 9:12 a.m. UTC | #3
On Tue, Nov 24, 2015 at 09:43:32PM +0000, Chris Wilson wrote:
> On Tue, Nov 24, 2015 at 05:43:22PM +0100, Daniel Vetter wrote:
> > On Fri, Nov 20, 2015 at 12:43:52PM +0000, Chris Wilson wrote:
> > > Instead of querying the reset counter before every access to the ring,
> > > query it the first time we touch the ring, and do a final compare when
> > > submitting the request. For correctness, we need to then sanitize how
> > > the reset_counter is incremented to prevent broken submission and
> > > waiting across resets, in the process fixing the persistent -EIO we
> > > still see today on failed waits.
> > 
> > tbh that explanation went straight over my head. Questions:
> > - Where do we wait again over a reset? With all the wakeups we should
> >   catch them properly. I guess this needs the detailed scenario to help me
> >   out, since I have dim memories that there is something wrong ;-)
> 
> TDR. In the future (and in past speculative patches) we have proposed
> keeping requests over a reset and requeuing them. That is a complication
> to the simplification of bailing out from the wait. What I have in mind
> is that the recovery code has to fix up the request list somehow, though that
> will be fun.

But even with requeueing the waiters need to bail out and restart the
ioctl. So I don't see why requeueing of requests matters.

And my question was about what exactly is broken when waiting over a
reset? As long as we detect the reset and restart the ioctl we should pick
up any kinds of state fixups the reset work has done and recover
perfectly. Irrespective of whether the reset work has requeued some of the
requests or not.

> > - What precisely is the effect for waiters? I only see moving the
> >   atomic_inc under the dev->struct_mutex, which shouldn't do a whole lot
> >   for anyone. Plus not returning -EAGAIN when reset_in_progress, which
> >   looks like it might result in missed wakeups and deadlocks with the
> >   reset work.
> 
> At the moment, I can trivially wedge the machine. This patch fixes that.
> The patch also ensures that we cannot raise unhandled errors from
> wait-request (EIO/EAGAIN handling has to be explicitly added to the
> caller).

Hm, how/where do we wedge the machine, and how does the fix work?

> The wait-request interface still has the wait-seqno legacy of having to
> acquire the reset_counter under the mutex before calling. That is quite
> hairy and causes a major complication later where we want to amalgamate
> waiters.

Yeah, that's a sensible change. But since I don't yet understand where
exactly the current logic results in a wedged gpu I can't convince myself
that the new one is ok.

> Re EAGAIN. No, it cannot result in missed wakeups since that is internal
> to the wait_request function, nor can it result in new deadlocks with the
> reset worker.

Yeah I think today with just struct_mutex it's impossible. But if we have
more locks the waiting thread could continue to grab more locks instead of
bailing out straight away. And then the reset handler would be somewhat screwed
since it isn't guaranteed to make forward progress.

> > I /thought/ that the -EIO is from check_wedge in intel_ring_begin, but
> > that part seems essentially unchanged.
> 
> For two reasons: we need to protect the access to the ring, and you
> argued that it (reporting of EIO from previously wedged GPUs) was part of
> the ABI. The other ABI that I think is important is the reporting of
> EIO if the user explicitly waits on a request and it is incomplete (i.e.
> wait_ioctl).

Well then I don't get where the -EIO is from. That leaves a truly wedged
gpu as the cause, and for that case I don't get how it can happen by
accident with the current code.
 
> > Aside from all that, shuffling the atomic_inc (if I can convince myself
> > that it's safe) to avoid the reload_in_reset hack looks like a good
> > improvement.
> 
> That's what I said at the time :-p
>  
> > More confused comments below.
> 
> A large proportion of the actual patch is spent trying to make using the
> atomic reset_counter safer (wrt it changing values between multiple
> tests and subsequent action).
> 
> > > diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
> > > index d83f35c8df34..107711ec956f 100644
> > > --- a/drivers/gpu/drm/i915/i915_debugfs.c
> > > +++ b/drivers/gpu/drm/i915/i915_debugfs.c
> > > @@ -4415,7 +4415,7 @@ i915_wedged_get(void *data, u64 *val)
> > >  	struct drm_device *dev = data;
> > >  	struct drm_i915_private *dev_priv = dev->dev_private;
> > >  
> > > -	*val = atomic_read(&dev_priv->gpu_error.reset_counter);
> > > +	*val = i915_terminally_wedged(&dev_priv->gpu_error);
> > 
> > Don't we have a few tests left that look at this still?
> 
> One. I strongly believe that the debug interface should not be reporting
> the reset_counter but the wedged status (as that is what it is called).

And that one only writes, so we're perfectly fine I think.

> > > diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
> > > index ffd99d27fac7..5838882e10a1 100644
> > > --- a/drivers/gpu/drm/i915/i915_drv.c
> > > +++ b/drivers/gpu/drm/i915/i915_drv.c
> > > @@ -841,23 +841,31 @@ int i915_resume_legacy(struct drm_device *dev)
> > >  int i915_reset(struct drm_device *dev)
> > >  {
> > >  	struct drm_i915_private *dev_priv = dev->dev_private;
> > > +	struct i915_gpu_error *error = &dev_priv->gpu_error;
> > > +	unsigned reset_counter;
> > >  	bool simulated;
> > >  	int ret;
> > >  
> > >  	intel_reset_gt_powersave(dev);
> > >  
> > >  	mutex_lock(&dev->struct_mutex);
> > > +	atomic_clear_mask(I915_WEDGED, &error->reset_counter);
> > > +	reset_counter = atomic_inc_return(&error->reset_counter);
> > 
> > This publishes the reset-complete too early for the modeset code I think.
> > At least it crucially relies upon dev->struct_mutex serializing
> > everything in our driver, and I don't like to cement that assumption in
> > even more.
> 
> Hmm? It is already concreted in as the reset / continuation is ordered
> with the struct_mutex. We have kicks in place for the modesetting code,
> but we have not actually ordered that with anything inside the reset
> recovery code. Judging by the few changes to i915_reset(), I am doubtful
> that has actually been improved for atomic, so that it basically relies
> on display being unaffected by anything but pending work.

Currently the reset/continuation is ordered with the 2-phase atomic_inc of
the reset counter, with the requirement that anyone who sees
reset_in_progress hauls out of the kernel asap. That most of gem is
serialized with struct_mutex is just a side-effect of the current gem
locking. With the current design we could rework gem locking and it should
all still work; with this approach we pretty much cement the BKL locking
approach I think.
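
For reference, the two-phase counter in a standalone userspace sketch
(the two flag values are the ones from i915_drv.h; everything else is
simplified and mine):

	#include <stdatomic.h>
	#include <stdio.h>

	#define I915_RESET_IN_PROGRESS_FLAG	1
	#define I915_WEDGED			(1u << 31)

	static atomic_uint reset_counter; /* even = idle, odd = reset pending */

	static unsigned reset_count(unsigned v)
	{
		return ((v & ~I915_WEDGED) + 1) / 2;	/* resets seen so far */
	}

	int main(void)
	{
		/* phase 1: hang detected, error handler makes the counter odd */
		unsigned v = atomic_fetch_add(&reset_counter, 1) + 1;
		printf("pending=%u count=%u\n",
		       v & I915_RESET_IN_PROGRESS_FLAG, reset_count(v));

		/* phase 2: recovery complete, counter becomes even again */
		v = atomic_fetch_add(&reset_counter, 1) + 1;
		printf("pending=%u count=%u\n",
		       v & I915_RESET_IN_PROGRESS_FLAG, reset_count(v));

		/* a failed recovery instead sets the sticky wedged bit */
		v = atomic_fetch_or(&reset_counter, I915_WEDGED) | I915_WEDGED;
		printf("wedged=%u\n", v >> 31);
		return 0;
	}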

> > Why do we need to move that atomic_inc again?
> 
> The inc is to acknowledge the reset and to allow us to begin using the
> device again (inside the reset func). It is the equivalent of adding
> another bit for reload_in_reset. I moved it to the beginning for my
> convenience.

Ok, this needs a code comment I think.

But I thought that the crucial change was to move check_wedge from
ring_begin to wait_for_space and so move it out from ever being hit from
the reset code.

> > I guess what we could try is set it right after we've reset the hw, but
> > before we start to re-initialize it. So move the atomic stuff below
> > intel_gpu_reset, with some neat barrier again. Maybe we should even push
> > i915_gem_reset down below that too. Not sure.
> 
> Sure, I liked the simplicity though :)
> 
> > > diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> > > index d3609e111647..6ac9c80244fa 100644
> > > --- a/drivers/gpu/drm/i915/i915_gem.c
> > > +++ b/drivers/gpu/drm/i915/i915_gem.c
> > > @@ -85,9 +85,7 @@ i915_gem_wait_for_error(struct i915_gpu_error *error)
> > >  {
> > >  	int ret;
> > >  
> > > -#define EXIT_COND (!i915_reset_in_progress(error) || \
> > > -		   i915_terminally_wedged(error))
> > > -	if (EXIT_COND)
> > > +	if (!i915_reset_in_progress(error))
> > >  		return 0;
> > >  
> > >  	/*
> > > @@ -96,17 +94,16 @@ i915_gem_wait_for_error(struct i915_gpu_error *error)
> > >  	 * we should simply try to bail out and fail as gracefully as possible.
> > >  	 */
> > >  	ret = wait_event_interruptible_timeout(error->reset_queue,
> > > -					       EXIT_COND,
> > > +					       !i915_reset_in_progress(error),
> > >  					       10*HZ);
> > >  	if (ret == 0) {
> > >  		DRM_ERROR("Timed out waiting for the gpu reset to complete\n");
> > >  		return -EIO;
> > >  	} else if (ret < 0) {
> > >  		return ret;
> > > +	} else {
> > > +		return 0;
> > >  	}
> > > -#undef EXIT_COND
> > > -
> > > -	return 0;
> > 
> > I like the existing color better ;-)
> 
> You would like
> #define EXIT_COND (!i915_reset_in_progress(error)) ?
> 
> > >  int
> > > -i915_gem_check_wedge(struct i915_gpu_error *error,
> > > +i915_gem_check_wedge(unsigned reset_counter,
> > >  		     bool interruptible)
> > >  {
> > > -	if (i915_reset_in_progress(error)) {
> > > +	if (__i915_reset_in_progress_or_wedged(reset_counter)) {
> > >  		/* Non-interruptible callers can't handle -EAGAIN, hence return
> > >  		 * -EIO unconditionally for these. */
> > >  		if (!interruptible)
> > >  			return -EIO;
> > >  
> > >  		/* Recovery complete, but the reset failed ... */
> > > -		if (i915_terminally_wedged(error))
> > > +		if (__i915_terminally_wedged(reset_counter))
> > >  			return -EIO;
> > >  
> > > -		/*
> > > -		 * Check if GPU Reset is in progress - we need intel_ring_begin
> > > -		 * to work properly to reinit the hw state while the gpu is
> > > -		 * still marked as reset-in-progress. Handle this with a flag.
> > > -		 */
> > > -		if (!error->reload_in_reset)
> > > -			return -EAGAIN;
> > > +		return -EAGAIN;
> > 
> > This only works because you move the check_wedge from ring_begin to
> > wait_space, so it assumes that we never change that again. Rather fragile
> > imo.
> 
> Pardon? If you argue that intel_ring_begin() is a better place to check
> you have misunderstood the reason behind the patch entirely. Note that
> we still call this function to preserve ABI. The second reason for
> checking in wait_for_space is that it is the only place where we do care
> about the masked pending reset.

I looked at this under the assumption that the atomic_inc(reset_counter)
stays where it currently is.

> > > @@ -1200,7 +1191,6 @@ static int __i915_spin_request(struct drm_i915_gem_request *req, int state)
> > >  /**
> > >   * __i915_wait_request - wait until execution of request has finished
> > >   * @req: duh!
> > > - * @reset_counter: reset sequence associated with the given request
> > >   * @interruptible: do an interruptible wait (normally yes)
> > >   * @timeout: in - how long to wait (NULL forever); out - how much time remaining
> > >   *
> > > @@ -1215,7 +1205,6 @@ static int __i915_spin_request(struct drm_i915_gem_request *req, int state)
> > >   * errno with remaining time filled in timeout argument.
> > >   */
> > >  int __i915_wait_request(struct drm_i915_gem_request *req,
> > > -			unsigned reset_counter,
> > >  			bool interruptible,
> > >  			s64 *timeout,
> > >  			struct intel_rps_client *rps)
> > > @@ -1264,12 +1253,12 @@ int __i915_wait_request(struct drm_i915_gem_request *req,
> > >  
> > >  		/* We need to check whether any gpu reset happened in between
> > >  		 * the caller grabbing the seqno and now ... */
> > > -		if (reset_counter != atomic_read(&dev_priv->gpu_error.reset_counter)) {
> > > -			/* ... but upgrade the -EAGAIN to an -EIO if the gpu
> > > -			 * is truely gone. */
> > > -			ret = i915_gem_check_wedge(&dev_priv->gpu_error, interruptible);
> > > -			if (ret == 0)
> > > -				ret = -EAGAIN;
> > > +		if (req->reset_counter != i915_reset_counter(&dev_priv->gpu_error)) {
> > > +			/* As we do not requeue the request over a GPU reset,
> > > +			 * if one does occur we know that the request is
> > > +			 * effectively complete.
> > > +			 */
> > > +			ret = 0;
> > >  			break;
> > 
> > This now no longer bails out straight away with -EAGAIN when a reset is
> > in progress. I fear for the deadlocks.
> 
> You fear incorrectly here. Either we hold the struct_mutex in which case
> the reset waits and we complete our work without needing anything else,
> or we may want to take the struct_mutex to complete our work and so may
> end up waiting for the reset to complete. Either way the state of this
> request is the same as when the GPU reset actually occurs. There is
> an inconsistency in that we don't mark it as complete though, so we
> may try to wait on it again (only to find that the reset is in
> progress and bail out again).

We only bail once per reset since we wait for resets to complete before
acquiring dev->struct_mutex. On the modeset side we unfortunately can't do
that since that code is in the drm core.
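
That pattern being, simplified from i915_mutex_lock_interruptible() in
i915_gem.c:

	int i915_mutex_lock_interruptible(struct drm_device *dev)
	{
		struct drm_i915_private *dev_priv = dev->dev_private;
		int ret;

		/* block until any pending reset has fully completed ... */
		ret = i915_gem_wait_for_error(&dev_priv->gpu_error);
		if (ret)
			return ret;

		/* ... and only then contend for the mutex */
		return mutex_lock_interruptible(&dev->struct_mutex);
	}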

> The issue is not a potential deadlock (because that would be an already
> existing ABBA in the code) but resources that can be depleted by
> requests.

Besides deadlocks my other concern is that you encode reset behaviour
outside of the reset code here. If we unconditionally EAGAIN here, then
everyone will pick up whatever kind of fixup the reset code applied. Which
> means requeueing of requests is confined to the reset logic. But if we
stop doing the unconditional EAGAIN then we have to fix up things here
too.

On the flip side I don't see what's easier by returning 0 here instead of
EAGAIN?

> > > @@ -2252,6 +2250,14 @@ static int wait_for_space(struct drm_i915_gem_request *req, int bytes)
> > >  	if (ret)
> > >  		return ret;
> > >  
> > > +	/* If the request was completed due to a GPU hang, we want to
> > > +	 * error out before we continue to emit more commands to the GPU.
> > > +	 */
> > > +	ret = i915_gem_check_wedge(i915_reset_counter(&to_i915(engine->dev)->gpu_error),
> > > +				   to_i915(engine->dev)->mm.interruptible);
> > > +	if (ret)
> > > +		return ret;
> > 
> > Don't we still have the problem with -EIO now here, just less likely than
> > from intel_ring_begin?
> 
> What's the problem? intel_ring_begin() is supposed to return
> -EIO/-EAGAIN.

Hm, I thought the entire problem was that ring_begin can return -EIO,
since that's how mesa usually dies. But with your mail you're saying that we
accidentally wedge the gpu, and that's the part I don't understand yet.

Probably best if I figure this out with an irc chat directly ;-)

But without that figured out I'm not sure how to best untangle the other
bits (specifically reload_in_reset) you're changing in here.

Thanks, Daniel
Chris Wilson Nov. 25, 2015, 12:17 p.m. UTC | #4
On Wed, Nov 25, 2015 at 10:12:53AM +0100, Daniel Vetter wrote:
> On Tue, Nov 24, 2015 at 09:43:32PM +0000, Chris Wilson wrote:
> > On Tue, Nov 24, 2015 at 05:43:22PM +0100, Daniel Vetter wrote:
> > > On Fri, Nov 20, 2015 at 12:43:52PM +0000, Chris Wilson wrote:
> > > > Instead of querying the reset counter before every access to the ring,
> > > > query it the first time we touch the ring, and do a final compare when
> > > > submitting the request. For correctness, we need to then sanitize how
> > > > the reset_counter is incremented to prevent broken submission and
> > > > waiting across resets, in the process fixing the persistent -EIO we
> > > > still see today on failed waits.
> > > 
> > > tbh that explanation went straight over my head. Questions:
> > > - Where do we wait again over a reset? With all the wakeups we should
> > >   catch them properly. I guess this needs the detailed scenario to help me
> > >   out, since I have dim memories that there is something wrong ;-)
> > 
> > TDR. In the future (and in past speculative patches) we have proposed
> > keeping requests over a reset and requeuing them. That is a complication
> > to the simplification of bailing out from the wait. What I have in mind
> > is that the recovery code has to fix up the request list somehow, though that
> > will be fun.
> 
> But even with requeueing the waiters need to bail out and restart the
> ioctl. So I don't see why requeueing of requests matters.

But why should the waiter even wake up? It is not holding any locks;
it is just waiting for the request to complete. It is only those holding
a lock required to reset that we need to kick.
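
(The kick today is roughly this, condensed from i915_error_wake_up();
any thread stuck inside the kernel then notices reset_in_progress and
backs out, releasing the locks the reset work needs:)

	for_each_ring(ring, dev_priv, i)
		wake_up_all(&ring->irq_queue);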
 
> And my question was about what exactly is broken when waiting over a
> reset? As long as we detect the reset and restart the ioctl we should pick
> up any kind of state fixup the reset work has done and recover
> perfectly. Irrespective of whether the reset work has requeued some of the
> requests or not.

But... The reset handler clears the requests, so we cannot wait over a
reset. So waiting over a reset today is clearly a bug in the kernel.

Only in the future do we contemplate situations where a request may
survive a reset.
 
> > > - What precisely is the effect for waiters? I only see moving the
> > >   atomic_inc under the dev->struct_mutex, which shouldn't do a whole lot
> > >   for anyone. Plus not returning -EAGAIN when reset_in_progress, which
> > >   looks like it might result in missed wakeups and deadlocks with the
> > >   reset work.
> > 
> > At the moment, I can trivially wedge the machine. This patch fixes that.
> > The patch also ensures that we cannot raise unhandled errors from
> > wait-request (EIO/EAGAIN handling has to be explicitly added to the
> > caller).
> 
> Hm, how/where do we wedge the machine, and how does the fix work?

Being able to reset a previously wedged machine.

> > The wait-request interface still has the wait-seqno legacy of having to
> > acquire the reset_counter under the mutex before calling. That is quite
> > hairy and causes a major complication later where we want to amalgamate
> > waiters.
> 
> Yeah, that's a sensible change. But since I don't yet understand where
> exactly the current logic results in a wedged gpu I can't convince myself
> that the new one is ok.
> 
> > Re EAGAIN. No, it cannot result in missed wakeups since that is internal
> > to the wait_request function, nor can it result in new deadlocks with the
> > reset worker.
> 
> Yeah I think today with just struct_mutex it's impossible. But if we have
> more locks the waiting thread could continue to grab more locks instead of
> bailing out straight away. And then the reset handler would be somewhat screwed
> since it isn't guaranteed to make forward progress.

So basically if you add a deadlock/livelock, it may deadlock/livelock.
 
> > > I /thought/ that the -EIO is from check_wedge in intel_ring_begin, but
> > > that part seems essentially unchanged.
> > 
> > For two reasons: we need to protect the access to the ring, and you
> > argued that it (reporting of EIO from previously wedged GPUs) was part of
> > the ABI. The other ABI that I think is important is the reporting of
> > EIO if the user explicitly waits on a request and it is incomplete (i.e.
> > wait_ioctl).
> 
> Well then I don't get where the -EIO is from. That leaves a truly wedged
> gpu as the cause, and for that case I don't get how it can happen by
> accident with the current code.

We fail a reset, or to recover. Happens often enough.

> > > > diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
> > > > index ffd99d27fac7..5838882e10a1 100644
> > > > --- a/drivers/gpu/drm/i915/i915_drv.c
> > > > +++ b/drivers/gpu/drm/i915/i915_drv.c
> > > > @@ -841,23 +841,31 @@ int i915_resume_legacy(struct drm_device *dev)
> > > >  int i915_reset(struct drm_device *dev)
> > > >  {
> > > >  	struct drm_i915_private *dev_priv = dev->dev_private;
> > > > +	struct i915_gpu_error *error = &dev_priv->gpu_error;
> > > > +	unsigned reset_counter;
> > > >  	bool simulated;
> > > >  	int ret;
> > > >  
> > > >  	intel_reset_gt_powersave(dev);
> > > >  
> > > >  	mutex_lock(&dev->struct_mutex);
> > > > +	atomic_clear_mask(I915_WEDGED, &error->reset_counter);
> > > > +	reset_counter = atomic_inc_return(&error->reset_counter);
> > > 
> > > This publishes the reset-complete too early for the modeset code I think.
> > > At least it crucially relies upon dev->struct_mutex serializing
> > > everything in our driver, and I don't like to cement that assumption in
> > > even more.
> > 
> > Hmm? It is already concreted in as the reset / continuation is ordered
> > with the struct_mutex. We have kicks in place for the modesetting code,
> > but we have not actually ordered that with anything inside the reset
> > recovery code. Judging by the few changes to i915_reset(), I am doubtful
> > that has actually been improved for atomic, so that it basically relies
> > on display being unaffected by anything but pending work.
> 
> Currently the reset/continuation is ordered with the 2-phase atomic_inc of
> the reset counter, with the requirement that anyone who sees
> reset_in_progress hauls out of the kernel asap. That most of gem is
> serialized with struct_mutex is just a side-effect of the current gem
> locking. With the current design we could rework gem locking and it should
> all still work; with this approach we pretty much cement the BKL locking
> approach I think.

No. It doesn't add anything more to the scheme. All it does is change
the order in which the second increment is applied so that after the
HW/SW reset happens we can start using it from inside the reset handler
to reinitialise the state.
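
For reference, a minimal sketch of the counter encoding in play here
(the helpers are lifted from the patch below; the flag values match
the driver's definitions as far as I recall, so treat them as
illustrative rather than authoritative):

#define I915_RESET_IN_PROGRESS_FLAG     1
#define I915_WEDGED                     (1u << 31)

/* Odd counter: a reset is in progress and everyone must back off. */
static inline bool __i915_reset_in_progress(u32 reset)
{
        return reset & I915_RESET_IN_PROGRESS_FLAG;
}

/* Sticky top bit: the reset failed and the GPU is terminally gone. */
static inline bool __i915_terminally_wedged(u32 reset)
{
        return reset & I915_WEDGED;
}

/* The counter advances by two per reset: the low bit is set when the
 * hang is declared, and the second increment (with this patch, taken
 * inside i915_reset() itself) makes the counter even again, which is
 * what lets the reset handler call the ring accessors without the
 * reload_in_reset flag.
 */
static inline u32 i915_reset_count(u32 reset)
{
        return ((reset & ~I915_WEDGED) + 1) / 2;
}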
 
> > > Why do we need to move that atomic_inc again?
> > 
> > The inc is to acknowledge the reset and to allow us to begin using the
> > device again (inside the reset func). It is the equivalent of adding
> > another bit for reload_in_reset. I moved it to the beginning for my
> > convenience.
> 
> Ok, this needs a code comment I think.
> 
> But I thought that the crucial change was to move check_wedge from
> ring_begin to wait_for_space and so keep it from ever being hit from
> the reset code.

No. That has nothing to do with the reset code, as by the time we ever
call intel_ring_begin() in the recovery code, all previous
waits/requests are completed i.e. (excluding execlists bugs ahem) the
rings should be empty.

> > > > @@ -1200,7 +1191,6 @@ static int __i915_spin_request(struct drm_i915_gem_request *req, int state)
> > > >  /**
> > > >   * __i915_wait_request - wait until execution of request has finished
> > > >   * @req: duh!
> > > > - * @reset_counter: reset sequence associated with the given request
> > > >   * @interruptible: do an interruptible wait (normally yes)
> > > >   * @timeout: in - how long to wait (NULL forever); out - how much time remaining
> > > >   *
> > > > @@ -1215,7 +1205,6 @@ static int __i915_spin_request(struct drm_i915_gem_request *req, int state)
> > > >   * errno with remaining time filled in timeout argument.
> > > >   */
> > > >  int __i915_wait_request(struct drm_i915_gem_request *req,
> > > > -			unsigned reset_counter,
> > > >  			bool interruptible,
> > > >  			s64 *timeout,
> > > >  			struct intel_rps_client *rps)
> > > > @@ -1264,12 +1253,12 @@ int __i915_wait_request(struct drm_i915_gem_request *req,
> > > >  
> > > >  		/* We need to check whether any gpu reset happened in between
> > > >  		 * the caller grabbing the seqno and now ... */
> > > > -		if (reset_counter != atomic_read(&dev_priv->gpu_error.reset_counter)) {
> > > > -			/* ... but upgrade the -EAGAIN to an -EIO if the gpu
> > > > -			 * is truely gone. */
> > > > -			ret = i915_gem_check_wedge(&dev_priv->gpu_error, interruptible);
> > > > -			if (ret == 0)
> > > > -				ret = -EAGAIN;
> > > > +		if (req->reset_counter != i915_reset_counter(&dev_priv->gpu_error)) {
> > > > +			/* As we do not requeue the request over a GPU reset,
> > > > +			 * if one does occur we know that the request is
> > > > +			 * effectively complete.
> > > > +			 */
> > > > +			ret = 0;
> > > >  			break;
> > > 
> > > This now no longer bails out straight when a reset is in progress with
> > > -EAGAIN. I fear for the deadlocks.
> > 
> > You fear incorrectly here. Either we hold the struct_mutex in which case
> > the reset waits and we complete our work without needing anything else,
> > or we may want to take the struct_mutex to complete our work and so may
> > end up waiting for the reset to complete. Either way the state of this
> > request is the same as to when the GPU reset actually occurs. There is
> > an inconsistency that we don't mark it as complete though, so we
> > may try to wait on it again (only to find that the reset is in
> > progress and bail out again).
> 
> We only bail once per reset since we wait for resets to complete before
> acquiring dev->struct_mutex. On the modeset side we unfortunately can't do
> that since that code is in the drm core.
> 
> > The issue is not a potential deadlock (because that would be an already
> > existing ABBA in the code) but resources that can be depleted by
> > requests.
> 
> Besides deadlocks my other concern is that you encode reset behaviour
> outside of the reset code here. If we unconditionally EAGAIN here, then
> everyone will pick up whatever kind of fixup the reset code applied. Which
> means requeueing of requests is contained to the reset logic. But if we
> stop doing the unconditional EAGAIN then we have to fix up things here
> too.
> 
> On the flip side I don't see what's easier by returning 0 here instead of
> EAGAIN?

My fundamental argument for the last 5 years is that resets complete
requests, i.e. i915_wait_seqno/i915_wait_request should report 0 as the
object is no longer active by the GPU. Only in special circumstances do
we actually care that a reset happened (ABI), and rather than encode
that logic inside the core, that is better expressed as exceptions in
the caller.

e.g. set-to-domain works correctly even if the last request on the
object ended in a hang (and from that we can deduce that a very large
proportion of the API is also correct.)

From my perspective, by completing the requests following a hang, we
make the kernel and userspace much more fault tolerant.
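
As a concrete sketch of what "exceptions in the caller" means in
practice (modelled on the wait_ioctl hunk in the patch below;
wait_and_report() is a made-up wrapper for illustration, not a
function in the patch):

static int wait_and_report(struct drm_i915_gem_request *req,
                           s64 *timeout_ns,
                           struct intel_rps_client *rps)
{
        /* The core wait reports 0 when a reset completed the request. */
        int ret = __i915_wait_request(req, true, timeout_ns, rps);

        /* The user explicitly waited on this request, so if a reset
         * swallowed it, the ABI says to upgrade "complete" to -EIO.
         */
        if (ret == 0 &&
            req->reset_counter != i915_reset_counter(&req->i915->gpu_error))
                ret = -EIO;

        return ret;
}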
 
> > > > @@ -2252,6 +2250,14 @@ static int wait_for_space(struct drm_i915_gem_request *req, int bytes)
> > > >  	if (ret)
> > > >  		return ret;
> > > >  
> > > > +	/* If the request was completed due to a GPU hang, we want to
> > > > +	 * error out before we continue to emit more commands to the GPU.
> > > > +	 */
> > > > +	ret = i915_gem_check_wedge(i915_reset_counter(&to_i915(engine->dev)->gpu_error),
> > > > +				   to_i915(engine->dev)->mm.interruptible);
> > > > +	if (ret)
> > > > +		return ret;
> > > 
> > > Don't we still have the problem with -EIO now here, just less likely than
> > > from intel_ring_begin?
> > 
> > What's the problem? intel_ring_begin() is supposed to return
> > -EIO/-EAGAIN.
> 
> Hm, I thought the entire problem was that ring_begin can return -EIO,
> since that's how mesa usually dies. But with your mail you're saying that we
> accidentally wedge the gpu, and that's the part I don't understand yet.

Nah, mesa dies all the time from unchecked errors elsewhere. It just
calls fprintf/exit on intel_do_flush_locked(), so that is the visible one.
By the way, how are you getting on reviewing my mesa patches to fix that?
-Chris
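
For context, the Mesa behaviour referred to above is roughly the
following (paraphrased from memory of intel_batchbuffer.c, not the
exact source):

/* Any errno escaping the flush is fatal to the process. */
if (ret != 0) {
        fprintf(stderr, "intel_do_flush_locked failed: %s\n",
                strerror(-ret));
        exit(1);
}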
Chris Wilson Nov. 25, 2015, 5:11 p.m. UTC | #5
On Wed, Nov 25, 2015 at 10:12:53AM +0100, Daniel Vetter wrote:
> On Tue, Nov 24, 2015 at 09:43:32PM +0000, Chris Wilson wrote:
> > The wait-request interface still has the wait-seqno legacy of having to
> > acquire the reset_counter under the mutex before calling. That is quite
> > hairy and causes a major complication later where we want to amalgamate
> > waiters.
> 
> Yeah, that's a sensible change. But since I don't yet understand where
> exactly the current logic results in a wedged gpu, I can't convince myself
> that the new one is ok.

Just to show where I am going with the interface change and why I think
it so important (other than it removes a hairy dance that we've have had
to copy all around the code):

http://cgit.freedesktop.org/~ickle/linux-2.6/commit/?h=nightly&id=75e73616f72f14ab82037adb1b2b054e2e500c21
-Chris
Daniel Vetter Nov. 26, 2015, 9:21 a.m. UTC | #6
On Wed, Nov 25, 2015 at 12:17:08PM +0000, Chris Wilson wrote:
> On Wed, Nov 25, 2015 at 10:12:53AM +0100, Daniel Vetter wrote:
> > On Tue, Nov 24, 2015 at 09:43:32PM +0000, Chris Wilson wrote:
> > > On Tue, Nov 24, 2015 at 05:43:22PM +0100, Daniel Vetter wrote:
> > > > On Fri, Nov 20, 2015 at 12:43:52PM +0000, Chris Wilson wrote:
> > > > > Instead of querying the reset counter before every access to the ring,
> > > > > query it the first time we touch the ring, and do a final compare when
> > > > > submitting the request. For correctness, we need to then sanitize how
> > > > > the reset_counter is incremented to prevent broken submission and
> > > > > waiting across resets, in the process fixing the persistent -EIO we
> > > > > still see today on failed waits.
> > > > 
> > > > tbh that explanation went straight over my head. Questions:
> > > > - Where do we wait again over a reset? With all the wakeups we should
> > > >   catch them properly. I guess this needs the detailed scenario to help me
> > > >   out, since I have dim memories that there is something wrong ;-)
> > > 
> > > TDR. In the future (and in past speculative patches) we have proposed
> > > keeping requests over a reset and requeuing them. That is a complication
> > > to the simplification of bailing out from the wait. What I have in mind,
> > > is the recovery code has to fix up the request list somehow, though that
> > > will be fun.
> > 
> > But even with requeueing the waiters need to bail out and restart the
> > ioctl. So I don't see why requeueing of requests matter.
> 
> But why should the waiter even wake up? It is not holding any locks;
> it is just waiting for the request to complete. It is only those holding
> a lock required to reset that we need to kick.

Because the request might not actually be completed when we do partial
resets like with TDR. And then we'd have to undo this.

> > And my question was about what exactly is broken when waiting over a
> > reset? As long as we detect the reset and restart the ioctl we should pick
> > up any kinds of state fixups the reset work has done and recover
> > perfectly. Irrespective of whether the reset work has requeued some of the
> > requests or not.
> 
> But... The reset handler clears the requests, we cannot wait over a
> reset. So waiting over a reset today is clearly a bug in the kernel.
> 
> Only in the future do we contemplate situations where a request may
> survive a reset.

I think I phrased my question badly: What's broken with current waiters
racing against resets? Currently they should all get woken and restart, and
that seems to work. It does create some ioctl restart work when a reset
happens, but I think that overhead is justified given that it allows us to
have no knowledge of how exactly we reset hw and sw outside of reset
functions.

> > > > - What precisely is the effect for waiters? I only see moving the
> > > >   atomic_inc under the dev->struct_mutex, which shouldn't do a hole lot
> > > >   for anyone. Plus not returning -EAGAIN when reset_in_progress, which
> > > >   looks like it might result in missed wakeups and deadlocks with the
> > > >   reset work.
> > > 
> > > At the moment, I can trivially wedge the machine. This patch fixes that.
> > > The patch also ensures that we cannot raise unhandled errors from
> > > wait-request (EIO/EAGAIN handling has to be explicitly added to the
> > > caller).
> > 
> > Hm, how/where do we wedge the machine, and how does the fix work?
> 
> Being able to reset a previously wedged machine.

Ok, I think I get that problem now, after looking at gem_eio work.
echo 1 > i915_wedged seems to have been broken for years. I haven't ever
noticed that since I generally just kick the machine and restart when that
happens.

> > > The wait-request interface still has the wait-seqno legacy of having to
> > > acquire the reset_counter under the mutex before calling. That is quite
> > > hairy and causes a major complication later where we want to amalgamate
> > > waiters.
> > 
> > Yeah, that's a sensible change. But since I don't yet understand where
> > exactly the current logic results in a wedged gpu, I can't convince myself
> > that the new one is ok.
> > 
> > > Re EAGAIN. No, it cannot result in missed wakeups since that is internal
> > > to the wait_request function, nor can it result in new deadlocks with the
> > > reset worker.
> > 
> > Yeah I think today with just struct_mutex it's impossible. But if we have
> > more locks the waiting thread could continue to grab more locks instead of
> > bailing out straight. And then the reset handler would be somewhat screwed
> > since it isn't guaranteed to make forward progress.
> 
> So basically if you add a deadlock/livelock, it may deadlock/livelock.

The two-phase reset logic is some kind of hand-rolled special lock which
makes sure nothing bad happens. As long as it's really two-phase.
Essentially reset_in_progress means "everyone get off locks and stay off
locks". With some small short-lived exceptions here and there (and well
not so short-lived ones).

Of course we can just write reset code that doesn't deadlock, but given
our history in this area I much prefer something that's more obviously
correct.
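
That "stay off locks" rule is what the patch's lock helper implements:
GEM waits for any pending reset to finish before taking struct_mutex,
so the reset worker can always make forward progress once it owns the
mutex. Condensed from the patch below (error-path details omitted):

int i915_mutex_lock_interruptible(struct drm_device *dev)
{
        struct drm_i915_private *dev_priv = dev->dev_private;
        int ret;

        /* Phase 1: back off until the reset worker has finished. */
        ret = i915_gem_wait_for_error(&dev_priv->gpu_error);
        if (ret)
                return ret;

        /* Phase 2: only now contend for the BKL-like struct_mutex. */
        return mutex_lock_interruptible(&dev->struct_mutex);
}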

> > > > I /thought/ that the -EIO is from check_wedge in intel_ring_begin, but
> > > > that part seems essentially unchanged.
> > > 
> > > > For two reasons, we need to protect the access to the ring, and you
> > > argued that (reporting of EIO from previous wedged GPUs) it was part of
> > > the ABI. The other ABI that I think is important, is the reporting of
> > > > EIO if the user explicitly waits on a request and it is incomplete (i.e.
> > > wait_ioctl).
> > 
> > > Well then I don't get where the -EIO is from. That leaves a truly wedged
> > gpu as the cause, and for that case I don't get how it can happen by
> > accident with the current code.
> 
> > We fail a reset, or fail to recover. Happens often enough.

Ok, looking at gem_eio the only case where we put out an EIO and shouldn't
seems to be set_domain_ioctl. Is that right?

Well there's all the modeset stuff, and that just became a bit worse with
Maarten's recent changes. So probably we need the same -EIO eating there
too.

> > > > > diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
> > > > > index ffd99d27fac7..5838882e10a1 100644
> > > > > --- a/drivers/gpu/drm/i915/i915_drv.c
> > > > > +++ b/drivers/gpu/drm/i915/i915_drv.c
> > > > > @@ -841,23 +841,31 @@ int i915_resume_legacy(struct drm_device *dev)
> > > > >  int i915_reset(struct drm_device *dev)
> > > > >  {
> > > > >  	struct drm_i915_private *dev_priv = dev->dev_private;
> > > > > +	struct i915_gpu_error *error = &dev_priv->gpu_error;
> > > > > +	unsigned reset_counter;
> > > > >  	bool simulated;
> > > > >  	int ret;
> > > > >  
> > > > >  	intel_reset_gt_powersave(dev);
> > > > >  
> > > > >  	mutex_lock(&dev->struct_mutex);
> > > > > +	atomic_clear_mask(I915_WEDGED, &error->reset_counter);
> > > > > +	reset_counter = atomic_inc_return(&error->reset_counter);
> > > > 
> > > > This publishes the reset-complete too early for the modeset code I think.
> > > > At least it crucially relies upon dev->struct_mutex serializing
> > > > everything in our driver, and I don't like to cement that assumption in
> > > > even more.
> > > 
> > > Hmm? It is already cemented in as the reset / continuation is ordered
> > > with the struct_mutex. We have kicks in place for the modesetting code,
> > > but we have not actually ordered that with anything inside the reset
> > > recovery code. Judging by the few changes to i915_reset(), I am doubtful
> > > that has actually been improved for atomic, so that it basically relies
> > > on display being unaffected by anything but pending work.
> > 
> > Currently the reset/continuation is ordered with the 2-phase atomic_inc of
> > the reset counter, with the requirement that anyone who sees
> > reset_in_progress bails out of the kernel asap. That most of gem is
> > serialized with struct_mutex is just a side-effect of the current gem
> > locking. With the current design we could rework gem locking and it should
> > all still work; with this approach we pretty much cement the BKL locking
> > approach I think.
> 
> No. It doesn't add anything more to the scheme. All it does is change
> the order in which the second increment is applied so that after the
> HW/SW reset happens we can start using it from inside the reset handler
> to reinitialise the state.

Well we can make that work, but it still feels like trading considerable
amounts of complexity to get rid of reload_in_reset. Well it's not really
any complexity right now with the BKL we have, but I don't see things as
that simple when/if we ever split things up.

> > > > Why do we need to move that atomic_inc again?
> > > 
> > > The inc is to acknowledge the reset and to allow us to begin using the
> > > device again (inside the reset func). It is the equivalent of adding
> > > another bit for reload_in_reset. I moved it to the beginning for my
> > > convenience.
> > 
> > Ok, this needs a code comment I think.
> > 
> > But I thought that the crucial change was to move check_wedge from
> > ring_begin to wait_for_space and so keep it from ever being hit from
> > the reset code.
> 
> No. That has nothing to do with the reset code, as by the time we ever
> call intel_ring_begin() in the recovery code, all previous
> waits/requests are completed i.e. (excluding execlists bugs ahem) the
> rings should be empty.
> 
> > > > > @@ -1200,7 +1191,6 @@ static int __i915_spin_request(struct drm_i915_gem_request *req, int state)
> > > > >  /**
> > > > >   * __i915_wait_request - wait until execution of request has finished
> > > > >   * @req: duh!
> > > > > - * @reset_counter: reset sequence associated with the given request
> > > > >   * @interruptible: do an interruptible wait (normally yes)
> > > > >   * @timeout: in - how long to wait (NULL forever); out - how much time remaining
> > > > >   *
> > > > > @@ -1215,7 +1205,6 @@ static int __i915_spin_request(struct drm_i915_gem_request *req, int state)
> > > > >   * errno with remaining time filled in timeout argument.
> > > > >   */
> > > > >  int __i915_wait_request(struct drm_i915_gem_request *req,
> > > > > -			unsigned reset_counter,
> > > > >  			bool interruptible,
> > > > >  			s64 *timeout,
> > > > >  			struct intel_rps_client *rps)
> > > > > @@ -1264,12 +1253,12 @@ int __i915_wait_request(struct drm_i915_gem_request *req,
> > > > >  
> > > > >  		/* We need to check whether any gpu reset happened in between
> > > > >  		 * the caller grabbing the seqno and now ... */
> > > > > -		if (reset_counter != atomic_read(&dev_priv->gpu_error.reset_counter)) {
> > > > > -			/* ... but upgrade the -EAGAIN to an -EIO if the gpu
> > > > > -			 * is truely gone. */
> > > > > -			ret = i915_gem_check_wedge(&dev_priv->gpu_error, interruptible);
> > > > > -			if (ret == 0)
> > > > > -				ret = -EAGAIN;
> > > > > +		if (req->reset_counter != i915_reset_counter(&dev_priv->gpu_error)) {
> > > > > +			/* As we do not requeue the request over a GPU reset,
> > > > > +			 * if one does occur we know that the request is
> > > > > +			 * effectively complete.
> > > > > +			 */
> > > > > +			ret = 0;
> > > > >  			break;
> > > > 
> > > > This now no longer bails out straight when a reset is in progress with
> > > > -EAGAIN. I fear for the deadlocks.
> > > 
> > > You fear incorrectly here. Either we hold the struct_mutex in which case
> > > the reset waits and we complete our work without needing anything else,
> > > or we may want to take the struct_mutex to complete our work and so may
> > > end up waiting for the reset to complete. Either way the state of this
> > > request is the same as to when the GPU reset actually occurs. There is
> > > an inconsistency that we don't mark it as complete though, so we
> > > may try to wait on it again (only to find that the reset is in
> > > progress and bail out again).
> > 
> > We only bail once per reset since we wait for resets to complete before
> > acquiring dev->struct_mutex. On the modeset side we unfortunately can't do
> > that since that code is in the drm core.
> > 
> > > The issue is not a potential deadlock (because that would be an already
> > > existing ABBA in the code) but resources that can be depleted by
> > > requests.
> > 
> > Besides deadlocks my other concern is that you encode reset behaviour
> > outside of the reset code here. If we unconditionally EAGAIN here, then
> > everyone will pick up whatever kind of fixup the reset code applied. Which
> > means requeueing of requests is contained to the reset logic. But if we
> > stop doing the unconditional EAGAIN then we have to fix up things here
> > too.
> > 
> > On the flip side I don't see what's easier by returning 0 here instead of
> > EAGAIN?
> 
> My fundamental argument for the last 5 years is that resets complete
> requests, i.e. i915_wait_seqno/i915_wait_request should report 0 as the
> object is no longer active by the GPU. Only in special circumstances do
> we actually care that a reset happened (ABI), and rather than encode
> that logic inside the core, that is better expressed as exceptions in
> the caller.
> 
> e.g. set-to-domain works correctly even if the last request on the
> object ended in a hang (and from that we can deduce that a very large
> proportion of the API is also correct.)
> 
> From my perspective, by completing the requests following a hang, we
> make the kernel and userspace much more fault tolerant.

Userspace must restart when we return -EAGAIN. That's also ABI. Which
means yes it will only ever see 0 here.
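
(The restart is transparent because libdrm retries the ioctl for the
application; roughly, from xf86drm.c:)

#include <errno.h>
#include <sys/ioctl.h>

int drmIoctl(int fd, unsigned long request, void *arg)
{
        int ret;

        /* Retry on signal interruption and on the kernel requesting a
         * restart (e.g. the -EAGAIN emitted around a GPU reset), so
         * applications never observe either.
         */
        do {
                ret = ioctl(fd, request, arg);
        } while (ret == -1 && (errno == EINTR || errno == EAGAIN));

        return ret;
}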

What we can do is drop the check_wedge out of wait_request, since that
should make a pile of things more robust. But imo the -EAGAIN really needs
to stay. It doesn't hurt, userspace never sees it, and we keep
encapsulation of how exactly sw/hw reset is done contained in the reset
worker.

I'll respin and retest my gem_eio patch with removing check_wedge from
wait_request, but keeping the -EAGAIN.

> > > > > @@ -2252,6 +2250,14 @@ static int wait_for_space(struct drm_i915_gem_request *req, int bytes)
> > > > >  	if (ret)
> > > > >  		return ret;
> > > > >  
> > > > > +	/* If the request was completed due to a GPU hang, we want to
> > > > > +	 * error out before we continue to emit more commands to the GPU.
> > > > > +	 */
> > > > > +	ret = i915_gem_check_wedge(i915_reset_counter(&to_i915(engine->dev)->gpu_error),
> > > > > +				   to_i915(engine->dev)->mm.interruptible);
> > > > > +	if (ret)
> > > > > +		return ret;
> > > > 
> > > > Don't we still have the problem with -EIO now here, just less likely than
> > > > from intel_ring_begin?
> > > 
> > > What's the problem? intel_ring_begin() is supposed to return
> > > -EIO/-EAGAIN.
> > 
> > Hm, I thought the entire problem was that ring_begin can return -EIO,
> > since that's how mesa usually dies. But with your mail you're saying that we
> > accidentally wedge the gpu, and that's the part I don't understand yet.
> 
> Nah, mesa dies all the time from unchecked errors elsewhere. It just
> calls fprintf/exit on intel_do_flush_locked(), so that is the visible one.
> By the way, how are you getting on reviewing my mesa patches to fix that?

Hm, I hoped mesa folks would do that ;-)
-Daniel
Chris Wilson Nov. 26, 2015, 9:50 a.m. UTC | #7
On Thu, Nov 26, 2015 at 10:21:38AM +0100, Daniel Vetter wrote:
> On Wed, Nov 25, 2015 at 12:17:08PM +0000, Chris Wilson wrote:
> > On Wed, Nov 25, 2015 at 10:12:53AM +0100, Daniel Vetter wrote:
> > > On Tue, Nov 24, 2015 at 09:43:32PM +0000, Chris Wilson wrote:
> > > > On Tue, Nov 24, 2015 at 05:43:22PM +0100, Daniel Vetter wrote:
> > > > > On Fri, Nov 20, 2015 at 12:43:52PM +0000, Chris Wilson wrote:
> > > > > > Instead of querying the reset counter before every access to the ring,
> > > > > > query it the first time we touch the ring, and do a final compare when
> > > > > > submitting the request. For correctness, we need to then sanitize how
> > > > > > the reset_counter is incremented to prevent broken submission and
> > > > > > waiting across resets, in the process fixing the persistent -EIO we
> > > > > > still see today on failed waits.
> > > > > 
> > > > > tbh that explanation went straight over my head. Questions:
> > > > > - Where do we wait again over a reset? With all the wakeups we should
> > > > >   catch them properly. I guess this needs the detailed scenario to help me
> > > > >   out, since I have dim memories that there is something wrong ;-)
> > > > 
> > > > TDR. In the future (and in past speculative patches) we have proposed
> > > > keeping requests over a reset and requeuing them. That is a complication
> > > > to the simplification of bailing out from the wait. What I have in mind,
> > > > is the recovery code has to fix up the request list somehow, though that
> > > > will be fun.
> > > 
> > > But even with requeueing the waiters need to bail out and restart the
> > > ioctl. So I don't see why requeueing of requests matter.
> > 
> > But why should the waiter even wake up? It is not holding any locks;
> > it is just waiting for the request to complete. It is only those holding
> > a lock required to reset that we need to kick.
> 
> Because the request might not actually be completed when we do partial
> resets like with TDR. And then we'd have to undo this.

I don't think you have to undo. TDR and wait_request already don't
mesh well, so it needs to be solved either way. More than likely we
simply don't increment the global reset counter and just bump the seqno
for skipped requests.
 
> > > And my question was about what exactly is broken when waiting over a
> > > reset? As long as we detect the reset and restart the ioctl we should pick
> > > up any kinds of state fixups the reset work has done and recover
> > > perfectly. Irrespective of whether the reset work has requeued some of the
> > > requests or not.
> > 
> > But... The reset handler clears the requests, we cannot wait over a
> > reset. So waiting over a reset today is clearly a bug in the kernel.
> > 
> > Only in the future do we contemplate situations where a request may
> > survive a reset.
> 
> I think I phrased my question badly: What's broken with current waiters
> racing against resets? Currently they should all get woken and restart, and
> that seems to work. It does create some ioctl restart work when a reset
> happens, but I think that overhead is justified given that it allows us to
> have no knowledge of how exactly we reset hw and sw outside of reset
> functions.

Good. Because that doesn't change.
 
> > > > > - What precisely is the effect for waiters? I only see moving the
> > > > >   atomic_inc under the dev->struct_mutex, which shouldn't do a hole lot
> > > > >   for anyone. Plus not returning -EAGAIN when reset_in_progress, which
> > > > >   looks like it might result in missed wakeups and deadlocks with the
> > > > >   reset work.
> > > > 
> > > > At the moment, I can trivially wedge the machine. This patch fixes that.
> > > > The patch also ensures that we cannot raise unhandled errors from
> > > > wait-request (EIO/EAGAIN handling has to be explicitly added to the
> > > > caller).
> > > 
> > > Hm, how/where do we wedge the machine, and how does the fix work?
> > 
> > Being able to reset a previously wedged machine.
> 
> Ok, I think I get that problem now, after looking at gem_eio work.
> echo 1 > i915_wedged seems to have been broken for years. I haven't ever
> noticed that since I generally just kick the machine and restart when that
> happens.

It has been my get-out-of-jail card for a long time: manually grab the GPU
state and reset? Yes, please.
 
> The two-phase reset logic is some kind of hand-rolled special lock which
> makes sure nothing bad happens. As long as it's really two-phase.
> Essentially reset_in_progress means "everyone get off locks and stay off
> locks". With some small short-lived exceptions here and there (and well
> not so short-lived ones).
> 
> Of course we can just write reset code that doesn't deadlock, but given
> our history in this area I much prefer something that's more obviously
> correct.

No. It is nothing more than a crude leaky hack to allow you to call the ring
accessors from inside the reset handler.
 
> > > > > I /thought/ that the -EIO is from check_wedge in intel_ring_begin, but
> > > > > that part seems essentially unchanged.
> > > > 
> > > > For two reasons, we need to protect the access to the ring, and you
> > > > argued that (reporting of EIO from previous wedged GPUs) it was part of
> > > > the ABI. The other ABI that I think is important, is the reporting of
> > > > EIO if the user explicitly waits on a request and it is incomplete (i.e.
> > > > wait_ioctl).
> > > 
> > > Well then I don't get where the -EIO is from. That leaves a truly wedged
> > > gpu as the cause, and for that case I don't get how it can happen by
> > > accident with the current code.
> > 
> > We fail a reset, or fail to recover. Happens often enough.
> 
> Ok, looking at gem_eio the only case where we put out an EIO and shouldn't
> seems to be set_domain_ioctl. Is that right?

Not really. That we see EIO from set-domain is just fallout from the igt
framework. The only things gem_eio tests are the ABI points where EIO is
expected to be seen; it doesn't contain the negative tests for all the paths
that should never generate EIO.
 
> Well there's all the modeset stuff, and that just became a bit worse with
> Maarten's recent changes. So probably we need the same -EIO eating there
> too.

I'm puzzled, since i915_reset() only handles GPU reset, not the display
engine. The modesetting code should only have to worry about its
interactions with the GPU (any MI commands it is waiting on, or
activity) both of which should be through requests.
 
> > No. It doesn't add anything more to the scheme. All it does is change
> > the order in which the second increment is applied so that after the
> > HW/SW reset happens we can start using it from inside the reset handler
> > to reinitialise the state.
> 
> Well we can make that work, but it still feels like trading considerable
> amounts of complexity to get rid of reload_in_reset. Well it's not really
> any complexity right now with the BKL we have, but I don't see things as
> that simple when/if we ever split things up.

I disagree (now and then). reload_in_reset is crude and adds fragility
to the reset mechanism that you have had to add extra code to handle.
 
> > From my perspective, by completing the requests following a hang, we
> > make the kernel and userspace much more fault tolerant.
> 
> Userspace must restart when we return -EAGAIN. That's also ABI. Which
> means yes it will only ever see 0 here.

Hahahaha. My point is that the kernel breaks quite often. It doesn't
matter that userspace survives if the kernel misbehaves and the screen
stays frozen anyway. Having access to the screen despite GPU or driver
breakage is paramount to a robust system.
 
> What we can do is drop the check_wedge out of wait_request, since that
> should make a pile of things more robust. But imo the -EAGAIN really needs
> to stay. It doesn't hurt, userspace never sees it, and we keep
> encapsulation of how exactly sw/hw reset is done contained in the reset
> worker.

Why though? That is skipping over the question of "when is a request
complete" that I feel is central to the argument here. You still need
the code to decide whether to return 0 or EAGAIN, or else you still
cause false WARNs.
-Chris
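
The single decision point this whole 0-vs-EAGAIN argument turns on is
the hunk of the wait loop shown in the patch below; annotated as a
sketch, with Daniel's preferred alternative given as a comment (it is
not part of the patch):

/* Inside __i915_wait_request(), once woken: */
if (req->reset_counter != i915_reset_counter(&dev_priv->gpu_error)) {
        /* Patch as posted: the reset completed the request, so report
         * success and leave the -EIO/-EAGAIN upgrades to the few
         * ABI-sensitive callers.
         */
        ret = 0;
        /* Daniel's preference: ret = -EAGAIN, forcing an ioctl restart
         * on every reset and keeping all reset fixups encapsulated in
         * the reset worker.
         */
        break;
}
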
diff mbox

Patch

diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
index d83f35c8df34..107711ec956f 100644
--- a/drivers/gpu/drm/i915/i915_debugfs.c
+++ b/drivers/gpu/drm/i915/i915_debugfs.c
@@ -4415,7 +4415,7 @@  i915_wedged_get(void *data, u64 *val)
 	struct drm_device *dev = data;
 	struct drm_i915_private *dev_priv = dev->dev_private;
 
-	*val = atomic_read(&dev_priv->gpu_error.reset_counter);
+	*val = i915_terminally_wedged(&dev_priv->gpu_error);
 
 	return 0;
 }
diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index ffd99d27fac7..5838882e10a1 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -841,23 +841,31 @@  int i915_resume_legacy(struct drm_device *dev)
 int i915_reset(struct drm_device *dev)
 {
 	struct drm_i915_private *dev_priv = dev->dev_private;
+	struct i915_gpu_error *error = &dev_priv->gpu_error;
+	unsigned reset_counter;
 	bool simulated;
 	int ret;
 
 	intel_reset_gt_powersave(dev);
 
 	mutex_lock(&dev->struct_mutex);
+	atomic_clear_mask(I915_WEDGED, &error->reset_counter);
+	reset_counter = atomic_inc_return(&error->reset_counter);
+	if (WARN_ON(__i915_reset_in_progress(reset_counter))) {
+		ret = -EIO;
+		goto error;
+	}
 
 	i915_gem_reset(dev);
 
-	simulated = dev_priv->gpu_error.stop_rings != 0;
+	simulated = error->stop_rings != 0;
 
 	ret = intel_gpu_reset(dev);
 
 	/* Also reset the gpu hangman. */
 	if (simulated) {
 		DRM_INFO("Simulated gpu hang, resetting stop_rings\n");
-		dev_priv->gpu_error.stop_rings = 0;
+		error->stop_rings = 0;
 		if (ret == -ENODEV) {
 			DRM_INFO("Reset not implemented, but ignoring "
 				 "error for simulated gpu hangs\n");
@@ -870,8 +878,7 @@  int i915_reset(struct drm_device *dev)
 
 	if (ret) {
 		DRM_ERROR("Failed to reset chip: %i\n", ret);
-		mutex_unlock(&dev->struct_mutex);
-		return ret;
+		goto error;
 	}
 
 	intel_overlay_reset(dev_priv);
@@ -890,20 +897,14 @@  int i915_reset(struct drm_device *dev)
 	 * was running at the time of the reset (i.e. we weren't VT
 	 * switched away).
 	 */
-
-	/* Used to prevent gem_check_wedged returning -EAGAIN during gpu reset */
-	dev_priv->gpu_error.reload_in_reset = true;
-
 	ret = i915_gem_init_hw(dev);
-
-	dev_priv->gpu_error.reload_in_reset = false;
-
-	mutex_unlock(&dev->struct_mutex);
 	if (ret) {
 		DRM_ERROR("Failed hw init on reset %d\n", ret);
-		return ret;
+		goto error;
 	}
 
+	mutex_unlock(&dev->struct_mutex);
+
 	/*
 	 * rps/rc6 re-init is necessary to restore state lost after the
 	 * reset and the re-install of gt irqs. Skip for ironlake per
@@ -914,6 +915,11 @@  int i915_reset(struct drm_device *dev)
 		intel_enable_gt_powersave(dev);
 
 	return 0;
+
+error:
+	atomic_set_mask(I915_WEDGED, &error->reset_counter);
+	mutex_unlock(&dev->struct_mutex);
+	return ret;
 }
 
 static int i915_pci_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 3fe933e4e664..aa83b284cb2f 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1395,9 +1395,6 @@  struct i915_gpu_error {
 
 	/* For missed irq/seqno simulation. */
 	unsigned int test_irq_rings;
-
-	/* Used to prevent gem_check_wedged returning -EAGAIN during gpu reset   */
-	bool reload_in_reset;
 };
 
 enum modeset_restore {
@@ -2157,6 +2154,7 @@  struct drm_i915_gem_request {
 	/** On Which ring this request was generated */
 	struct drm_i915_private *i915;
 	struct intel_engine_cs *engine;
+	unsigned reset_counter;
 
 	/** GEM sequence number associated with this request. */
 	uint32_t seqno;
@@ -2891,23 +2889,47 @@  struct drm_i915_gem_request *
 i915_gem_find_active_request(struct intel_engine_cs *ring);
 
 bool i915_gem_retire_requests(struct drm_device *dev);
-int __must_check i915_gem_check_wedge(struct i915_gpu_error *error,
+int __must_check i915_gem_check_wedge(unsigned reset_counter,
 				      bool interruptible);
 
+static inline u32 i915_reset_counter(struct i915_gpu_error *error)
+{
+	return atomic_read(&error->reset_counter);
+}
+
+static inline bool __i915_reset_in_progress(u32 reset)
+{
+	return unlikely(reset & I915_RESET_IN_PROGRESS_FLAG);
+}
+
+static inline bool __i915_reset_in_progress_or_wedged(u32 reset)
+{
+	return unlikely(reset & (I915_RESET_IN_PROGRESS_FLAG | I915_WEDGED));
+}
+
+static inline bool __i915_terminally_wedged(u32 reset)
+{
+	return unlikely(reset & I915_WEDGED);
+}
+
 static inline bool i915_reset_in_progress(struct i915_gpu_error *error)
 {
-	return unlikely(atomic_read(&error->reset_counter)
-			& (I915_RESET_IN_PROGRESS_FLAG | I915_WEDGED));
+	return __i915_reset_in_progress(i915_reset_counter(error));
+}
+
+static inline bool i915_reset_in_progress_or_wedged(struct i915_gpu_error *error)
+{
+	return __i915_reset_in_progress_or_wedged(i915_reset_counter(error));
 }
 
 static inline bool i915_terminally_wedged(struct i915_gpu_error *error)
 {
-	return atomic_read(&error->reset_counter) & I915_WEDGED;
+	return __i915_terminally_wedged(i915_reset_counter(error));
 }
 
 static inline u32 i915_reset_count(struct i915_gpu_error *error)
 {
-	return ((atomic_read(&error->reset_counter) & ~I915_WEDGED) + 1) / 2;
+	return ((i915_reset_counter(error) & ~I915_WEDGED) + 1) / 2;
 }
 
 static inline bool i915_stop_ring_allow_ban(struct drm_i915_private *dev_priv)
@@ -2940,7 +2962,6 @@  void __i915_add_request(struct drm_i915_gem_request *req,
 #define i915_add_request_no_flush(req) \
 	__i915_add_request(req, NULL, false)
 int __i915_wait_request(struct drm_i915_gem_request *req,
-			unsigned reset_counter,
 			bool interruptible,
 			s64 *timeout,
 			struct intel_rps_client *rps);
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index d3609e111647..6ac9c80244fa 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -85,9 +85,7 @@  i915_gem_wait_for_error(struct i915_gpu_error *error)
 {
 	int ret;
 
-#define EXIT_COND (!i915_reset_in_progress(error) || \
-		   i915_terminally_wedged(error))
-	if (EXIT_COND)
+	if (!i915_reset_in_progress(error))
 		return 0;
 
 	/*
@@ -96,17 +94,16 @@  i915_gem_wait_for_error(struct i915_gpu_error *error)
 	 * we should simply try to bail out and fail as gracefully as possible.
 	 */
 	ret = wait_event_interruptible_timeout(error->reset_queue,
-					       EXIT_COND,
+					       !i915_reset_in_progress(error),
 					       10*HZ);
 	if (ret == 0) {
 		DRM_ERROR("Timed out waiting for the gpu reset to complete\n");
 		return -EIO;
 	} else if (ret < 0) {
 		return ret;
+	} else {
+		return 0;
 	}
-#undef EXIT_COND
-
-	return 0;
 }
 
 int i915_mutex_lock_interruptible(struct drm_device *dev)
@@ -1112,26 +1109,20 @@  put_rpm:
 }
 
 int
-i915_gem_check_wedge(struct i915_gpu_error *error,
+i915_gem_check_wedge(unsigned reset_counter,
 		     bool interruptible)
 {
-	if (i915_reset_in_progress(error)) {
+	if (__i915_reset_in_progress_or_wedged(reset_counter)) {
 		/* Non-interruptible callers can't handle -EAGAIN, hence return
 		 * -EIO unconditionally for these. */
 		if (!interruptible)
 			return -EIO;
 
 		/* Recovery complete, but the reset failed ... */
-		if (i915_terminally_wedged(error))
+		if (__i915_terminally_wedged(reset_counter))
 			return -EIO;
 
-		/*
-		 * Check if GPU Reset is in progress - we need intel_ring_begin
-		 * to work properly to reinit the hw state while the gpu is
-		 * still marked as reset-in-progress. Handle this with a flag.
-		 */
-		if (!error->reload_in_reset)
-			return -EAGAIN;
+		return -EAGAIN;
 	}
 
 	return 0;
@@ -1200,7 +1191,6 @@  static int __i915_spin_request(struct drm_i915_gem_request *req, int state)
 /**
  * __i915_wait_request - wait until execution of request has finished
  * @req: duh!
- * @reset_counter: reset sequence associated with the given request
  * @interruptible: do an interruptible wait (normally yes)
  * @timeout: in - how long to wait (NULL forever); out - how much time remaining
  *
@@ -1215,7 +1205,6 @@  static int __i915_spin_request(struct drm_i915_gem_request *req, int state)
  * errno with remaining time filled in timeout argument.
  */
 int __i915_wait_request(struct drm_i915_gem_request *req,
-			unsigned reset_counter,
 			bool interruptible,
 			s64 *timeout,
 			struct intel_rps_client *rps)
@@ -1264,12 +1253,12 @@  int __i915_wait_request(struct drm_i915_gem_request *req,
 
 		/* We need to check whether any gpu reset happened in between
 		 * the caller grabbing the seqno and now ... */
-		if (reset_counter != atomic_read(&dev_priv->gpu_error.reset_counter)) {
-			/* ... but upgrade the -EAGAIN to an -EIO if the gpu
-			 * is truely gone. */
-			ret = i915_gem_check_wedge(&dev_priv->gpu_error, interruptible);
-			if (ret == 0)
-				ret = -EAGAIN;
+		if (req->reset_counter != i915_reset_counter(&dev_priv->gpu_error)) {
+			/* As we do not requeue the request over a GPU reset,
+			 * if one does occur we know that the request is
+			 * effectively complete.
+			 */
+			ret = 0;
 			break;
 		}
 
@@ -1433,13 +1422,7 @@  i915_wait_request(struct drm_i915_gem_request *req)
 
 	BUG_ON(!mutex_is_locked(&dev_priv->dev->struct_mutex));
 
-	ret = i915_gem_check_wedge(&dev_priv->gpu_error, interruptible);
-	if (ret)
-		return ret;
-
-	ret = __i915_wait_request(req,
-				  atomic_read(&dev_priv->gpu_error.reset_counter),
-				  interruptible, NULL, NULL);
+	ret = __i915_wait_request(req, interruptible, NULL, NULL);
 	if (ret)
 		return ret;
 
@@ -1514,7 +1497,6 @@  i915_gem_object_wait_rendering__nonblocking(struct drm_i915_gem_object *obj,
 	struct drm_device *dev = obj->base.dev;
 	struct drm_i915_private *dev_priv = dev->dev_private;
 	struct drm_i915_gem_request *requests[I915_NUM_RINGS];
-	unsigned reset_counter;
 	int ret, i, n = 0;
 
 	BUG_ON(!mutex_is_locked(&dev->struct_mutex));
@@ -1523,12 +1505,6 @@  i915_gem_object_wait_rendering__nonblocking(struct drm_i915_gem_object *obj,
 	if (!obj->active)
 		return 0;
 
-	ret = i915_gem_check_wedge(&dev_priv->gpu_error, true);
-	if (ret)
-		return ret;
-
-	reset_counter = atomic_read(&dev_priv->gpu_error.reset_counter);
-
 	if (readonly) {
 		struct drm_i915_gem_request *req;
 
@@ -1550,9 +1526,9 @@  i915_gem_object_wait_rendering__nonblocking(struct drm_i915_gem_object *obj,
 	}
 
 	mutex_unlock(&dev->struct_mutex);
+	ret = 0;
 	for (i = 0; ret == 0 && i < n; i++)
-		ret = __i915_wait_request(requests[i], reset_counter, true,
-					  NULL, rps);
+		ret = __i915_wait_request(requests[i], true, NULL, rps);
 	mutex_lock(&dev->struct_mutex);
 
 	for (i = 0; i < n; i++) {
@@ -2707,6 +2683,7 @@  int i915_gem_request_alloc(struct intel_engine_cs *engine,
 			   struct drm_i915_gem_request **req_out)
 {
 	struct drm_i915_private *dev_priv = to_i915(engine->dev);
+	unsigned reset_counter = i915_reset_counter(&dev_priv->gpu_error);
 	struct drm_i915_gem_request *req;
 	int ret;
 
@@ -2715,6 +2692,10 @@  int i915_gem_request_alloc(struct intel_engine_cs *engine,
 
 	*req_out = NULL;
 
+	ret = i915_gem_check_wedge(reset_counter, dev_priv->mm.interruptible);
+	if (ret)
+		return ret;
+
 	req = kmem_cache_zalloc(dev_priv->requests, GFP_KERNEL);
 	if (req == NULL)
 		return -ENOMEM;
@@ -2725,6 +2706,7 @@  int i915_gem_request_alloc(struct intel_engine_cs *engine,
 
 	kref_init(&req->ref);
 	req->i915 = dev_priv;
+	req->reset_counter = reset_counter;
 	req->engine = engine;
 	req->ctx  = ctx;
 	i915_gem_context_reference(req->ctx);
@@ -3077,11 +3059,9 @@  void i915_gem_close_object(struct drm_gem_object *gem,
 int
 i915_gem_wait_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
 {
-	struct drm_i915_private *dev_priv = dev->dev_private;
 	struct drm_i915_gem_wait *args = data;
 	struct drm_i915_gem_object *obj;
 	struct drm_i915_gem_request *req[I915_NUM_RINGS];
-	unsigned reset_counter;
 	int i, n = 0;
 	int ret;
 
@@ -3115,7 +3095,6 @@  i915_gem_wait_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
 	}
 
 	drm_gem_object_unreference(&obj->base);
-	reset_counter = atomic_read(&dev_priv->gpu_error.reset_counter);
 
 	for (i = 0; i < I915_NUM_RINGS; i++) {
 		if (obj->last_read_req[i] == NULL)
@@ -3128,11 +3107,23 @@  i915_gem_wait_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
 
 	for (i = 0; i < n; i++) {
 		if (ret == 0)
-			ret = __i915_wait_request(req[i], reset_counter, true,
+			ret = __i915_wait_request(req[i], true,
 						  args->timeout_ns > 0 ? &args->timeout_ns : NULL,
 						  file->driver_priv);
+
+		/* If the GPU hung before this, report it. Ideally we only
+		 * report if this request cannot be completed. Currently
+		 * when we don't mark the guilty party and abort all
+		 * requests on reset, so just mark all as EIO.
+		 */
+		if (ret == 0 &&
+		    req[i]->reset_counter != i915_reset_counter(&req[i]->i915->gpu_error))
+			ret = -EIO;
+
 		i915_gem_request_unreference__unlocked(req[i]);
 	}
+
+
 	return ret;
 
 out:
@@ -3160,7 +3151,6 @@  __i915_gem_object_sync(struct drm_i915_gem_object *obj,
 	if (!i915_semaphore_is_enabled(from_req->i915)) {
 		struct drm_i915_private *i915 = from_req->i915;
 		ret = __i915_wait_request(from_req,
-					  atomic_read(&i915->gpu_error.reset_counter),
 					  i915->mm.interruptible,
 					  NULL,
 					  &i915->rps.semaphores);
@@ -4082,14 +4072,15 @@  i915_gem_ring_throttle(struct drm_device *dev, struct drm_file *file)
 	struct drm_i915_file_private *file_priv = file->driver_priv;
 	unsigned long recent_enough = jiffies - DRM_I915_THROTTLE_JIFFIES;
 	struct drm_i915_gem_request *request, *target = NULL;
-	unsigned reset_counter;
 	int ret;
 
 	ret = i915_gem_wait_for_error(&dev_priv->gpu_error);
 	if (ret)
 		return ret;
 
-	ret = i915_gem_check_wedge(&dev_priv->gpu_error, false);
+	/* ABI: return -EIO if wedged */
+	ret = i915_gem_check_wedge(i915_reset_counter(&dev_priv->gpu_error),
+				   false);
 	if (ret)
 		return ret;
 
@@ -4107,7 +4098,6 @@  i915_gem_ring_throttle(struct drm_device *dev, struct drm_file *file)
 
 		target = request;
 	}
-	reset_counter = atomic_read(&dev_priv->gpu_error.reset_counter);
 	if (target)
 		i915_gem_request_reference(target);
 	spin_unlock(&file_priv->mm.lock);
@@ -4115,7 +4105,7 @@  i915_gem_ring_throttle(struct drm_device *dev, struct drm_file *file)
 	if (target == NULL)
 		return 0;
 
-	ret = __i915_wait_request(target, reset_counter, true, NULL, NULL);
+	ret = __i915_wait_request(target, true, NULL, NULL);
 	if (ret == 0)
 		queue_delayed_work(dev_priv->wq, &dev_priv->mm.retire_work, 0);
 
diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index f9f5e9cf7360..186315208462 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -2200,7 +2200,6 @@  static void i915_error_wake_up(struct drm_i915_private *dev_priv,
 static void i915_reset_and_wakeup(struct drm_device *dev)
 {
 	struct drm_i915_private *dev_priv = to_i915(dev);
-	struct i915_gpu_error *error = &dev_priv->gpu_error;
 	char *error_event[] = { I915_ERROR_UEVENT "=1", NULL };
 	char *reset_event[] = { I915_RESET_UEVENT "=1", NULL };
 	char *reset_done_event[] = { I915_ERROR_UEVENT "=0", NULL };
@@ -2218,7 +2217,7 @@  static void i915_reset_and_wakeup(struct drm_device *dev)
 	 * the reset in-progress bit is only ever set by code outside of this
 	 * work we don't need to worry about any other races.
 	 */
-	if (i915_reset_in_progress(error) && !i915_terminally_wedged(error)) {
+	if (i915_reset_in_progress(&dev_priv->gpu_error)) {
 		DRM_DEBUG_DRIVER("resetting chip\n");
 		kobject_uevent_env(&dev->primary->kdev->kobj, KOBJ_CHANGE,
 				   reset_event);
@@ -2246,25 +2245,9 @@  static void i915_reset_and_wakeup(struct drm_device *dev)
 
 		intel_runtime_pm_put(dev_priv);
 
-		if (ret == 0) {
-			/*
-			 * After all the gem state is reset, increment the reset
-			 * counter and wake up everyone waiting for the reset to
-			 * complete.
-			 *
-			 * Since unlock operations are a one-sided barrier only,
-			 * we need to insert a barrier here to order any seqno
-			 * updates before
-			 * the counter increment.
-			 */
-			smp_mb__before_atomic();
-			atomic_inc(&dev_priv->gpu_error.reset_counter);
-
+		if (ret == 0)
 			kobject_uevent_env(&dev->primary->kdev->kobj,
 					   KOBJ_CHANGE, reset_done_event);
-		} else {
-			atomic_set_mask(I915_WEDGED, &error->reset_counter);
-		}
 
 		/*
 		 * Note: The wake_up also serves as a memory barrier so that
diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c
index cdd6074257af..8cb89320d28d 100644
--- a/drivers/gpu/drm/i915/intel_display.c
+++ b/drivers/gpu/drm/i915/intel_display.c
@@ -3298,8 +3298,7 @@  static bool intel_crtc_has_pending_flip(struct drm_crtc *crtc)
 	struct intel_crtc *intel_crtc = to_intel_crtc(crtc);
 	bool pending;
 
-	if (i915_reset_in_progress(&dev_priv->gpu_error) ||
-	    intel_crtc->reset_counter != atomic_read(&dev_priv->gpu_error.reset_counter))
+	if (intel_crtc->reset_counter != i915_reset_counter(&dev_priv->gpu_error))
 		return false;
 
 	spin_lock_irq(&dev->event_lock);
@@ -10751,8 +10750,7 @@  static bool page_flip_finished(struct intel_crtc *crtc)
 	struct drm_device *dev = crtc->base.dev;
 	struct drm_i915_private *dev_priv = dev->dev_private;
 
-	if (i915_reset_in_progress(&dev_priv->gpu_error) ||
-	    crtc->reset_counter != atomic_read(&dev_priv->gpu_error.reset_counter))
+	if (crtc->reset_counter != i915_reset_counter(&dev_priv->gpu_error))
 		return true;
 
 	/*
@@ -11180,9 +11178,7 @@  static void intel_mmio_flip_work_func(struct work_struct *work)
 		container_of(work, struct intel_mmio_flip, work);
 
 	if (mmio_flip->req)
-		WARN_ON(__i915_wait_request(mmio_flip->req,
-					    mmio_flip->crtc->reset_counter,
-					    false, NULL,
+		WARN_ON(__i915_wait_request(mmio_flip->req, false, NULL,
 					    &mmio_flip->i915->rps.mmioflips));
 
 	intel_do_mmio_flip(mmio_flip->crtc);
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
index f33db3495e35..b3538b66353b 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.c
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
@@ -2156,9 +2156,7 @@  int intel_ring_idle(struct intel_engine_cs *ring)
 			list);
 
 	/* Make sure we do not trigger any retires */
-	return __i915_wait_request(req,
-				   atomic_read(&req->i915->gpu_error.reset_counter),
-				   req->i915->mm.interruptible,
+	return __i915_wait_request(req, req->i915->mm.interruptible,
 				   NULL, NULL);
 }
 
@@ -2252,6 +2250,14 @@  static int wait_for_space(struct drm_i915_gem_request *req, int bytes)
 	if (ret)
 		return ret;
 
+	/* If the request was completed due to a GPU hang, we want to
+	 * error out before we continue to emit more commands to the GPU.
+	 */
+	ret = i915_gem_check_wedge(i915_reset_counter(&to_i915(engine->dev)->gpu_error),
+				   to_i915(engine->dev)->mm.interruptible);
+	if (ret)
+		return ret;
+
 	ring->space = space;
 	return 0;
 }
@@ -2316,11 +2322,6 @@  int intel_ring_begin(struct drm_i915_gem_request *req, int num_dwords)
 {
 	int ret;
 
-	ret = i915_gem_check_wedge(&req->i915->gpu_error,
-				   req->i915->mm.interruptible);
-	if (ret)
-		return ret;
-
 	ret = ring_prepare(req, num_dwords * sizeof(uint32_t));
 	if (ret)
 		return ret;