
[v7,02/20] drm/i915: Modify error handler for per engine hang recovery

Message ID 20170427231300.32841-3-michel.thierry@intel.com (mailing list archive)
State New, archived

Commit Message

Michel Thierry April 27, 2017, 11:12 p.m. UTC
From: Arun Siluvery <arun.siluvery@linux.intel.com>

This is a preparatory patch which modifies the error handler to do per-engine
hang recovery. The actual patch which implements this sequence follows
later in the series. The aim is to prepare the existing recovery function to
call this new function where applicable (which fails at this point
because the core implementation is lacking) and to continue recovery using
the legacy full GPU reset.

A helper function is also added to query the availability of engine
reset.

The error event behaviour used to notify the user of a reset is adapted to
engine reset so that it doesn't break users listening to these events. In
legacy mode we report an error event and a reset event before resetting the
GPU, and a reset-done event marking the completion of the reset. The same
behaviour is kept, but the reset event is only dispatched once even when
multiple engines are hung. Finally, once the reset is complete, we send the
reset-done event as usual.

Note that this implementation of engine reset is for i915 directly
submitting to the ELSP, where the driver manages the hang detection,
recovery and resubmission. With GuC submission these tasks are shared
between driver and firmware; i915 will still be responsible for detecting a
hang, and when it does it will have to request the GuC to reset that engine
and remind the firmware about the outstanding submissions. This will be
added in a different patch.

v2: rebase, advertise engine reset availability in platform definition,
add note about GuC submission.
v3: s/*engine_reset*/*reset_engine*/. (Chris)
Handle reset as 2-level resets, by first trying engine-only reset and
falling back to full/chip reset as needed, i.e. reset_engine will need the
struct_mutex.
v4: Pass the engine mask to i915_reset. (Chris)
v5: Rebase, update selftests.
v6: Rebase, prepare for mutex-less reset engine.
v7: Pass reset_engine mask as a function parameter, and iterate over the
engine mask for reset_engine. (Chris)

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Signed-off-by: Ian Lister <ian.lister@intel.com>
Signed-off-by: Tomas Elf <tomas.elf@intel.com>
Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.c     | 15 +++++++++++++++
 drivers/gpu/drm/i915/i915_drv.h     |  3 +++
 drivers/gpu/drm/i915/i915_irq.c     | 33 ++++++++++++++++++++++++++++++---
 drivers/gpu/drm/i915/i915_pci.c     |  5 ++++-
 drivers/gpu/drm/i915/intel_uncore.c | 11 +++++++++++
 5 files changed, 63 insertions(+), 4 deletions(-)

Comments

Chris Wilson April 29, 2017, 2:19 p.m. UTC | #1
On Thu, Apr 27, 2017 at 04:12:42PM -0700, Michel Thierry wrote:
> From: Arun Siluvery <arun.siluvery@linux.intel.com>
> 
> This is a preparatory patch which modifies error handler to do per engine
> hang recovery. The actual patch which implements this sequence follows
> later in the series. The aim is to prepare existing recovery function to
> adapt to this new function where applicable (which fails at this point
> because core implementation is lacking) and continue recovery using legacy
> full gpu reset.
> 
> A helper function is also added to query the availability of engine
> reset.
> 
> The error events behaviour that are used to notify user of reset are
> adapted to engine reset such that it doesn't break users listening to these
> events. In legacy we report an error event, a reset event before resetting
> the gpu and a reset done event marking the completion of reset. The same
> behaviour is adapted but reset event is only dispatched once even when
> multiple engines are hung. Finally once reset is complete we send reset
> done event as usual.
> 
> Note that this implementation of engine reset is for i915 directly
> submitting to the ELSP, where the driver manages the hang detection,
> recovery and resubmission. With GuC submission these tasks are shared
> between driver and firmware; i915 will still responsible for detecting a
> hang, and when it does it will have to request GuC to reset that Engine and
> remind the firmware about the outstanding submissions. This will be
> added in different patch.
> 
> v2: rebase, advertise engine reset availability in platform definition,
> add note about GuC submission.
> v3: s/*engine_reset*/*reset_engine*/. (Chris)
> Handle reset as 2 level resets, by first going to engine only and fall
> backing to full/chip reset as needed, i.e. reset_engine will need the
> struct_mutex.
> v4: Pass the engine mask to i915_reset. (Chris)
> v5: Rebase, update selftests.
> v6: Rebase, prepare for mutex-less reset engine.
> v7: Pass reset_engine mask as a function parameter, and iterate over the
> engine mask for reset_engine. (Chris)
> 
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> Signed-off-by: Ian Lister <ian.lister@intel.com>
> Signed-off-by: Tomas Elf <tomas.elf@intel.com>
> Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
> Signed-off-by: Michel Thierry <michel.thierry@intel.com>
> ---
>  drivers/gpu/drm/i915/i915_drv.c     | 15 +++++++++++++++
>  drivers/gpu/drm/i915/i915_drv.h     |  3 +++
>  drivers/gpu/drm/i915/i915_irq.c     | 33 ++++++++++++++++++++++++++++++---
>  drivers/gpu/drm/i915/i915_pci.c     |  5 ++++-
>  drivers/gpu/drm/i915/intel_uncore.c | 11 +++++++++++
>  5 files changed, 63 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
> index c7d68e789642..48c8b69d9bde 100644
> --- a/drivers/gpu/drm/i915/i915_drv.c
> +++ b/drivers/gpu/drm/i915/i915_drv.c
> @@ -1800,6 +1800,8 @@ void i915_reset(struct drm_i915_private *dev_priv)
>  	if (!test_bit(I915_RESET_HANDOFF, &error->flags))
>  		return;
>  
> +	DRM_DEBUG_DRIVER("resetting chip\n");

This is redundant since we have a "Resetting chip" already here. Just
kill it.

> +
>  	/* Clear any previous failed attempts at recovery. Time to try again. */
>  	if (!i915_gem_unset_wedged(dev_priv))
>  		goto wakeup;
> @@ -1863,6 +1865,19 @@ void i915_reset(struct drm_i915_private *dev_priv)
>  	goto finish;
>  }
>  
> +/**
> + * i915_reset_engine - reset GPU engine to recover from a hang
> + * @engine: engine to reset
> + *
> + * Reset a specific GPU engine. Useful if a hang is detected.
> + * Returns zero on successful reset or otherwise an error code.
> + */
> +int i915_reset_engine(struct intel_engine_cs *engine)
> +{
> +	/* FIXME: replace me with engine reset sequence */
> +	return -ENODEV;
> +}
> +
>  static int i915_pm_suspend(struct device *kdev)
>  {
>  	struct pci_dev *pdev = to_pci_dev(kdev);
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index e06af46f5a57..ab7e68626c49 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -814,6 +814,7 @@ struct intel_csr {
>  	func(has_ddi); \
>  	func(has_decoupled_mmio); \
>  	func(has_dp_mst); \
> +	func(has_reset_engine); \
>  	func(has_fbc); \
>  	func(has_fpga_dbg); \
>  	func(has_full_ppgtt); \
> @@ -3019,6 +3020,8 @@ extern void i915_driver_unload(struct drm_device *dev);
>  extern int intel_gpu_reset(struct drm_i915_private *dev_priv, u32 engine_mask);
>  extern bool intel_has_gpu_reset(struct drm_i915_private *dev_priv);
>  extern void i915_reset(struct drm_i915_private *dev_priv);
> +extern int i915_reset_engine(struct intel_engine_cs *engine);
> +extern bool intel_has_reset_engine(struct drm_i915_private *dev_priv);
>  extern int intel_guc_reset(struct drm_i915_private *dev_priv);
>  extern void intel_engine_init_hangcheck(struct intel_engine_cs *engine);
>  extern void intel_hangcheck_init(struct drm_i915_private *dev_priv);
> diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
> index fd97fe00cd0d..3a59ef1367ec 100644
> --- a/drivers/gpu/drm/i915/i915_irq.c
> +++ b/drivers/gpu/drm/i915/i915_irq.c
> @@ -2635,11 +2635,13 @@ static irqreturn_t gen8_irq_handler(int irq, void *arg)
>  /**
>   * i915_reset_and_wakeup - do process context error handling work
>   * @dev_priv: i915 device private
> + * @engine_mask: engine(s) hung - for reset-engine only.
>   *
>   * Fire an error uevent so userspace can see that a hang or error
>   * was detected.
>   */
> -static void i915_reset_and_wakeup(struct drm_i915_private *dev_priv)
> +static void
> +i915_reset_and_wakeup(struct drm_i915_private *dev_priv, u32 engine_mask)
>  {
>  	struct kobject *kobj = &dev_priv->drm.primary->kdev->kobj;
>  	char *error_event[] = { I915_ERROR_UEVENT "=1", NULL };
> @@ -2648,9 +2650,33 @@ static void i915_reset_and_wakeup(struct drm_i915_private *dev_priv)
>  
>  	kobject_uevent_env(kobj, KOBJ_CHANGE, error_event);
>  
> -	DRM_DEBUG_DRIVER("resetting chip\n");
> +	/*
> +	 * This event needs to be sent before performing gpu reset. When
> +	 * engine resets are supported we iterate through all engines and
> +	 * reset hung engines individually. To keep the event dispatch
> +	 * mechanism consistent with full gpu reset, this is only sent once
> +	 * even when multiple engines are hung. It is also safe to move this
> +	 * here because when we are in this function, we will definitely
> +	 * perform gpu reset.
> +	 */
>  	kobject_uevent_env(kobj, KOBJ_CHANGE, reset_event);
>  
> +	/* try engine reset first, and continue if fails; look mom, no mutex! */
> +	if (intel_has_reset_engine(dev_priv)) {
> +		struct intel_engine_cs *engine;
> +		unsigned int tmp;
> +
> +		for_each_engine_masked(engine, dev_priv, engine_mask, tmp) {
> +			if (i915_reset_engine(engine) == 0)
> +				engine_mask &= ~intel_engine_flag(engine);
> +		}
> +
> +		if (engine_mask)
> +			DRM_WARN("per-engine reset failed, promoting to full gpu reset\n");
> +		else
> +			goto finish;

This will look nicer if we did just try per-engine reset and then
quietly promote (it's not that quiet as we do get logging) to global.

for_each_engine_masked() {}
if (!engine_mask)


> +	}
> +
>  	intel_prepare_reset(dev_priv);
>  
>  	set_bit(I915_RESET_HANDOFF, &dev_priv->gpu_error.flags);
> @@ -2680,6 +2706,7 @@ static void i915_reset_and_wakeup(struct drm_i915_private *dev_priv)
>  		kobject_uevent_env(kobj,
>  				   KOBJ_CHANGE, reset_done_event);
>  
> +finish:
>  	/*
>  	 * Note: The wake_up also serves as a memory barrier so that
>  	 * waiters see the updated value of the dev_priv->gpu_error.
> @@ -2781,7 +2808,7 @@ void i915_handle_error(struct drm_i915_private *dev_priv,
>  			     &dev_priv->gpu_error.flags))
>  		goto out;
>  
> -	i915_reset_and_wakeup(dev_priv);
> +	i915_reset_and_wakeup(dev_priv, engine_mask);

? You don't need to wakeup the struct_mutex so we don't need this after
per-engine resets. Time to split up i915_reset_and_wakeup(), because we
certainly shouldn't be calling intel_finish_reset() without first calling
intel_prepare_reset(). Which is right here in my tree...

> +/*
> + * When GuC submission is enabled, GuC manages ELSP and can initiate the
> + * engine reset too. For now, fall back to full GPU reset if it is enabled.
> + */
> +bool intel_has_reset_engine(struct drm_i915_private *dev_priv)
> +{
> +	return (dev_priv->info.has_reset_engine &&
> +		!dev_priv->guc.execbuf_client &&
> +		i915.reset == 2);

i915.reset >= 2
-Chris
Michel Thierry May 8, 2017, 6:31 p.m. UTC | #2
On 4/29/2017 7:19 AM, Chris Wilson wrote:
> On Thu, Apr 27, 2017 at 04:12:42PM -0700, Michel Thierry wrote:
>> From: Arun Siluvery <arun.siluvery@linux.intel.com>
>>
>> This is a preparatory patch which modifies error handler to do per engine
>> hang recovery. The actual patch which implements this sequence follows
>> later in the series. The aim is to prepare existing recovery function to
>> adapt to this new function where applicable (which fails at this point
>> because core implementation is lacking) and continue recovery using legacy
>> full gpu reset.
>>
>> A helper function is also added to query the availability of engine
>> reset.
>>
>> The error events behaviour that are used to notify user of reset are
>> adapted to engine reset such that it doesn't break users listening to these
>> events. In legacy we report an error event, a reset event before resetting
>> the gpu and a reset done event marking the completion of reset. The same
>> behaviour is adapted but reset event is only dispatched once even when
>> multiple engines are hung. Finally once reset is complete we send reset
>> done event as usual.
>>
>> Note that this implementation of engine reset is for i915 directly
>> submitting to the ELSP, where the driver manages the hang detection,
>> recovery and resubmission. With GuC submission these tasks are shared
>> between driver and firmware; i915 will still responsible for detecting a
>> hang, and when it does it will have to request GuC to reset that Engine and
>> remind the firmware about the outstanding submissions. This will be
>> added in different patch.
>>
>> v2: rebase, advertise engine reset availability in platform definition,
>> add note about GuC submission.
>> v3: s/*engine_reset*/*reset_engine*/. (Chris)
>> Handle reset as 2 level resets, by first going to engine only and fall
>> backing to full/chip reset as needed, i.e. reset_engine will need the
>> struct_mutex.
>> v4: Pass the engine mask to i915_reset. (Chris)
>> v5: Rebase, update selftests.
>> v6: Rebase, prepare for mutex-less reset engine.
>> v7: Pass reset_engine mask as a function parameter, and iterate over the
>> engine mask for reset_engine. (Chris)
>>
>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
>> Signed-off-by: Ian Lister <ian.lister@intel.com>
>> Signed-off-by: Tomas Elf <tomas.elf@intel.com>
>> Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
>> Signed-off-by: Michel Thierry <michel.thierry@intel.com>
>> ---
>>  drivers/gpu/drm/i915/i915_drv.c     | 15 +++++++++++++++
>>  drivers/gpu/drm/i915/i915_drv.h     |  3 +++
>>  drivers/gpu/drm/i915/i915_irq.c     | 33 ++++++++++++++++++++++++++++++---
>>  drivers/gpu/drm/i915/i915_pci.c     |  5 ++++-
>>  drivers/gpu/drm/i915/intel_uncore.c | 11 +++++++++++
>>  5 files changed, 63 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
>> index c7d68e789642..48c8b69d9bde 100644
>> --- a/drivers/gpu/drm/i915/i915_drv.c
>> +++ b/drivers/gpu/drm/i915/i915_drv.c
>> @@ -1800,6 +1800,8 @@ void i915_reset(struct drm_i915_private *dev_priv)
>>  	if (!test_bit(I915_RESET_HANDOFF, &error->flags))
>>  		return;
>>
>> +	DRM_DEBUG_DRIVER("resetting chip\n");
>
> This is redundant since we have a "Resetting chip" already here. Just
> kill it.
>
>> +
>>  	/* Clear any previous failed attempts at recovery. Time to try again. */
>>  	if (!i915_gem_unset_wedged(dev_priv))
>>  		goto wakeup;
>> @@ -1863,6 +1865,19 @@ void i915_reset(struct drm_i915_private *dev_priv)
>>  	goto finish;
>>  }
>>
>> +/**
>> + * i915_reset_engine - reset GPU engine to recover from a hang
>> + * @engine: engine to reset
>> + *
>> + * Reset a specific GPU engine. Useful if a hang is detected.
>> + * Returns zero on successful reset or otherwise an error code.
>> + */
>> +int i915_reset_engine(struct intel_engine_cs *engine)
>> +{
>> +	/* FIXME: replace me with engine reset sequence */
>> +	return -ENODEV;
>> +}
>> +
>>  static int i915_pm_suspend(struct device *kdev)
>>  {
>>  	struct pci_dev *pdev = to_pci_dev(kdev);
>> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
>> index e06af46f5a57..ab7e68626c49 100644
>> --- a/drivers/gpu/drm/i915/i915_drv.h
>> +++ b/drivers/gpu/drm/i915/i915_drv.h
>> @@ -814,6 +814,7 @@ struct intel_csr {
>>  	func(has_ddi); \
>>  	func(has_decoupled_mmio); \
>>  	func(has_dp_mst); \
>> +	func(has_reset_engine); \
>>  	func(has_fbc); \
>>  	func(has_fpga_dbg); \
>>  	func(has_full_ppgtt); \
>> @@ -3019,6 +3020,8 @@ extern void i915_driver_unload(struct drm_device *dev);
>>  extern int intel_gpu_reset(struct drm_i915_private *dev_priv, u32 engine_mask);
>>  extern bool intel_has_gpu_reset(struct drm_i915_private *dev_priv);
>>  extern void i915_reset(struct drm_i915_private *dev_priv);
>> +extern int i915_reset_engine(struct intel_engine_cs *engine);
>> +extern bool intel_has_reset_engine(struct drm_i915_private *dev_priv);
>>  extern int intel_guc_reset(struct drm_i915_private *dev_priv);
>>  extern void intel_engine_init_hangcheck(struct intel_engine_cs *engine);
>>  extern void intel_hangcheck_init(struct drm_i915_private *dev_priv);
>> diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
>> index fd97fe00cd0d..3a59ef1367ec 100644
>> --- a/drivers/gpu/drm/i915/i915_irq.c
>> +++ b/drivers/gpu/drm/i915/i915_irq.c
>> @@ -2635,11 +2635,13 @@ static irqreturn_t gen8_irq_handler(int irq, void *arg)
>>  /**
>>   * i915_reset_and_wakeup - do process context error handling work
>>   * @dev_priv: i915 device private
>> + * @engine_mask: engine(s) hung - for reset-engine only.
>>   *
>>   * Fire an error uevent so userspace can see that a hang or error
>>   * was detected.
>>   */
>> -static void i915_reset_and_wakeup(struct drm_i915_private *dev_priv)
>> +static void
>> +i915_reset_and_wakeup(struct drm_i915_private *dev_priv, u32 engine_mask)
>>  {
>>  	struct kobject *kobj = &dev_priv->drm.primary->kdev->kobj;
>>  	char *error_event[] = { I915_ERROR_UEVENT "=1", NULL };
>> @@ -2648,9 +2650,33 @@ static void i915_reset_and_wakeup(struct drm_i915_private *dev_priv)
>>
>>  	kobject_uevent_env(kobj, KOBJ_CHANGE, error_event);
>>
>> -	DRM_DEBUG_DRIVER("resetting chip\n");
>> +	/*
>> +	 * This event needs to be sent before performing gpu reset. When
>> +	 * engine resets are supported we iterate through all engines and
>> +	 * reset hung engines individually. To keep the event dispatch
>> +	 * mechanism consistent with full gpu reset, this is only sent once
>> +	 * even when multiple engines are hung. It is also safe to move this
>> +	 * here because when we are in this function, we will definitely
>> +	 * perform gpu reset.
>> +	 */
>>  	kobject_uevent_env(kobj, KOBJ_CHANGE, reset_event);
>>
>> +	/* try engine reset first, and continue if fails; look mom, no mutex! */
>> +	if (intel_has_reset_engine(dev_priv)) {
>> +		struct intel_engine_cs *engine;
>> +		unsigned int tmp;
>> +
>> +		for_each_engine_masked(engine, dev_priv, engine_mask, tmp) {
>> +			if (i915_reset_engine(engine) == 0)
>> +				engine_mask &= ~intel_engine_flag(engine);
>> +		}
>> +
>> +		if (engine_mask)
>> +			DRM_WARN("per-engine reset failed, promoting to full gpu reset\n");
>> +		else
>> +			goto finish;
>
> This will look nicer if we did just try per-engine reset and then
> quitely promote (it's not that quiet as we do get logging) to global.
>
> for_each_engine_masked() {}
> if (!engine_mask)
>
>
>> +	}
>> +
>>  	intel_prepare_reset(dev_priv);
>>
>>  	set_bit(I915_RESET_HANDOFF, &dev_priv->gpu_error.flags);
>> @@ -2680,6 +2706,7 @@ static void i915_reset_and_wakeup(struct drm_i915_private *dev_priv)
>>  		kobject_uevent_env(kobj,
>>  				   KOBJ_CHANGE, reset_done_event);
>>
>> +finish:
>>  	/*
>>  	 * Note: The wake_up also serves as a memory barrier so that
>>  	 * waiters see the updated value of the dev_priv->gpu_error.
>> @@ -2781,7 +2808,7 @@ void i915_handle_error(struct drm_i915_private *dev_priv,
>>  			     &dev_priv->gpu_error.flags))
>>  		goto out;
>>
>> -	i915_reset_and_wakeup(dev_priv);
>> +	i915_reset_and_wakeup(dev_priv, engine_mask);
>
> ? You don't need to wakeup the struct_mutex so we don't need this after
> per-engine resets. Time to split up i915_reset_and_wakeup(), because we
> certainly shouldn't be calling intel_finish_reset() without first calling
> intel_prepare_reset(). Which is right here in my tree...
>

Looking at your tree, it wouldn't call finish_reset there either; only 
these two are called after a successful reset:

finish:
	clear_bit(I915_RESET_BACKOFF, &dev_priv->gpu_error.flags);
	wake_up_all(&dev_priv->gpu_error.reset_queue);

But you're right, we only need to clear the error flag, no need to call 
wake_up_all.

Should I move the per-engine reset to i915_handle_error, and then leave 
i915_reset_and_wakeup just for full resets?
That would also make the promotion from per-engine to global look a bit 
'clearer'.

Thanks,

>> +/*
>> + * When GuC submission is enabled, GuC manages ELSP and can initiate the
>> + * engine reset too. For now, fall back to full GPU reset if it is enabled.
>> + */
>> +bool intel_has_reset_engine(struct drm_i915_private *dev_priv)
>> +{
>> +	return (dev_priv->info.has_reset_engine &&
>> +		!dev_priv->guc.execbuf_client &&
>> +		i915.reset == 2);
>
> i915.reset >= 2
> -Chris
>
Michel Thierry May 12, 2017, 8:55 p.m. UTC | #3
On 5/8/2017 11:31 AM, Michel Thierry wrote:
> On 4/29/2017 7:19 AM, Chris Wilson wrote:
>> On Thu, Apr 27, 2017 at 04:12:42PM -0700, Michel Thierry wrote:
>>> From: Arun Siluvery <arun.siluvery@linux.intel.com>
>>>
...
>>> +    }
>>> +
>>>      intel_prepare_reset(dev_priv);
>>>
>>>      set_bit(I915_RESET_HANDOFF, &dev_priv->gpu_error.flags);
>>> @@ -2680,6 +2706,7 @@ static void i915_reset_and_wakeup(struct
>>> drm_i915_private *dev_priv)
>>>          kobject_uevent_env(kobj,
>>>                     KOBJ_CHANGE, reset_done_event);
>>>
>>> +finish:
>>>      /*
>>>       * Note: The wake_up also serves as a memory barrier so that
>>>       * waiters see the updated value of the dev_priv->gpu_error.
>>> @@ -2781,7 +2808,7 @@ void i915_handle_error(struct drm_i915_private
>>> *dev_priv,
>>>                   &dev_priv->gpu_error.flags))
>>>          goto out;
>>>
>>> -    i915_reset_and_wakeup(dev_priv);
>>> +    i915_reset_and_wakeup(dev_priv, engine_mask);
>>
>> ? You don't need to wakeup the struct_mutex so we don't need this after
>> per-engine resets. Time to split up i915_reset_and_wakeup(), because we
>> certainly shouldn't be calling intel_finish_reset() without first calling
>> intel_prepare_reset(). Which is right here in my tree...
>>
>
> Looking at your tree, it wouldn't call finish_reset there either, only
> these two are called after a successful reset:
>
> finish:
>     clear_bit(I915_RESET_BACKOFF, &dev_priv->gpu_error.flags);
>     wake_up_all(&dev_priv->gpu_error.reset_queue);
>
> But you're right, we only need to clear the error flag, no need to call
> wake_up_all.
>
> Should I move the per-engine reset to i915_handle_error, and then leave
> i915_reset_and_wakeup just for full resets?
> That would also make the promotion from per-engine to global look a bit
> 'clearer'.
>

I just noticed an issue if I don't call wake_up_all. There can be 
someone else waiting for the reset to complete 
(i915_mutex_lock_interruptible -> i915_gem_wait_for_error).

I915_RESET_BACKOFF has/had 2 roles: stop any other user from grabbing the 
struct_mutex (which we won't need in reset-engine) and prevent two 
concurrent reset attempts (which we still want). Time to add a new flag 
for the latter? (I915_RESET_ENGINE_IN_PROGRESS?)

Here's an example without calling wake_up_all (10s timeout):
[  126.816054] [drm:i915_reset_engine [i915]] resetting rcs0
...
[  137.499910] [IGT] gem_ringfill: exiting, ret=0

Compared to the one that does,
[   69.799519] [drm:i915_reset_engine [i915]] resetting rcs0
...
[   69.801335] [IGT] gem_tdr: exiting, ret=0

Thanks,

-Michel
Chris Wilson May 12, 2017, 9:09 p.m. UTC | #4
On Fri, May 12, 2017 at 01:55:11PM -0700, Michel Thierry wrote:
> On 5/8/2017 11:31 AM, Michel Thierry wrote:
> >On 4/29/2017 7:19 AM, Chris Wilson wrote:
> >>On Thu, Apr 27, 2017 at 04:12:42PM -0700, Michel Thierry wrote:
> >>>From: Arun Siluvery <arun.siluvery@linux.intel.com>
> >>>
> ...
> >>>+    }
> >>>+
> >>>     intel_prepare_reset(dev_priv);
> >>>
> >>>     set_bit(I915_RESET_HANDOFF, &dev_priv->gpu_error.flags);
> >>>@@ -2680,6 +2706,7 @@ static void i915_reset_and_wakeup(struct
> >>>drm_i915_private *dev_priv)
> >>>         kobject_uevent_env(kobj,
> >>>                    KOBJ_CHANGE, reset_done_event);
> >>>
> >>>+finish:
> >>>     /*
> >>>      * Note: The wake_up also serves as a memory barrier so that
> >>>      * waiters see the updated value of the dev_priv->gpu_error.
> >>>@@ -2781,7 +2808,7 @@ void i915_handle_error(struct drm_i915_private
> >>>*dev_priv,
> >>>                  &dev_priv->gpu_error.flags))
> >>>         goto out;
> >>>
> >>>-    i915_reset_and_wakeup(dev_priv);
> >>>+    i915_reset_and_wakeup(dev_priv, engine_mask);
> >>
> >>? You don't need to wakeup the struct_mutex so we don't need this after
> >>per-engine resets. Time to split up i915_reset_and_wakeup(), because we
> >>certainly shouldn't be calling intel_finish_reset() without first calling
> >>intel_prepare_reset(). Which is right here in my tree...
> >>
> >
> >Looking at your tree, it wouldn't call finish_reset there either, only
> >these two are called after a successful reset:
> >
> >finish:
> >    clear_bit(I915_RESET_BACKOFF, &dev_priv->gpu_error.flags);
> >    wake_up_all(&dev_priv->gpu_error.reset_queue);
> >
> >But you're right, we only need to clear the error flag, no need to call
> >wake_up_all.
> >
> >Should I move the per-engine reset to i915_handle_error, and then leave
> >i915_reset_and_wakeup just for full resets?
> >That would also make the promotion from per-engine to global look a bit
> >'clearer'.
> >
> 
> I just noticed an issue if I don't call wake_up_all. There can be
> someone else waiting for the reset to complete
> (i915_mutex_lock_interruptible -> i915_gem_wait_for_error).
> 
> I915_RESET_BACKOFF has/had 2 roles, stop any other user to grab the
> struct mutex (which we won't need in reset-engine) and prevent two
> concurrent reset attempts (which we still want). Time to add a new
> flag for the later? (I915_RESET_ENGINE_IN_PROGRESS?)

Yes, that would be a good idea to avoid dual purposing the bits. Now
that we do direct resets along the wait path, we can completely drop the
i915_mutex_interruptible(). (No one else should be holding the mutex
indefinitely.) I think that's a better approach -- I think we've already
moved all the EIO magic away to the ABI points where we deemed it
necessary.
-Chris
Michel Thierry May 12, 2017, 9:23 p.m. UTC | #5
On 5/12/2017 2:09 PM, Chris Wilson wrote:
> On Fri, May 12, 2017 at 01:55:11PM -0700, Michel Thierry wrote:
>> On 5/8/2017 11:31 AM, Michel Thierry wrote:
>>> On 4/29/2017 7:19 AM, Chris Wilson wrote:
>>>> On Thu, Apr 27, 2017 at 04:12:42PM -0700, Michel Thierry wrote:
>>>>> From: Arun Siluvery <arun.siluvery@linux.intel.com>
>>>>>
>> ...
>>>>> +    }
>>>>> +
>>>>>     intel_prepare_reset(dev_priv);
>>>>>
>>>>>     set_bit(I915_RESET_HANDOFF, &dev_priv->gpu_error.flags);
>>>>> @@ -2680,6 +2706,7 @@ static void i915_reset_and_wakeup(struct
>>>>> drm_i915_private *dev_priv)
>>>>>         kobject_uevent_env(kobj,
>>>>>                    KOBJ_CHANGE, reset_done_event);
>>>>>
>>>>> +finish:
>>>>>     /*
>>>>>      * Note: The wake_up also serves as a memory barrier so that
>>>>>      * waiters see the updated value of the dev_priv->gpu_error.
>>>>> @@ -2781,7 +2808,7 @@ void i915_handle_error(struct drm_i915_private
>>>>> *dev_priv,
>>>>>                  &dev_priv->gpu_error.flags))
>>>>>         goto out;
>>>>>
>>>>> -    i915_reset_and_wakeup(dev_priv);
>>>>> +    i915_reset_and_wakeup(dev_priv, engine_mask);
>>>>
>>>> ? You don't need to wakeup the struct_mutex so we don't need this after
>>>> per-engine resets. Time to split up i915_reset_and_wakeup(), because we
>>>> certainly shouldn't be calling intel_finish_reset() without first calling
>>>> intel_prepare_reset(). Which is right here in my tree...
>>>>
>>>
>>> Looking at your tree, it wouldn't call finish_reset there either, only
>>> these two are called after a successful reset:
>>>
>>> finish:
>>>    clear_bit(I915_RESET_BACKOFF, &dev_priv->gpu_error.flags);
>>>    wake_up_all(&dev_priv->gpu_error.reset_queue);
>>>
>>> But you're right, we only need to clear the error flag, no need to call
>>> wake_up_all.
>>>
>>> Should I move the per-engine reset to i915_handle_error, and then leave
>>> i915_reset_and_wakeup just for full resets?
>>> That would also make the promotion from per-engine to global look a bit
>>> 'clearer'.
>>>
>>
>> I just noticed an issue if I don't call wake_up_all. There can be
>> someone else waiting for the reset to complete
>> (i915_mutex_lock_interruptible -> i915_gem_wait_for_error).
>>
>> I915_RESET_BACKOFF has/had 2 roles, stop any other user to grab the
>> struct mutex (which we won't need in reset-engine) and prevent two
>> concurrent reset attempts (which we still want). Time to add a new
>> flag for the later? (I915_RESET_ENGINE_IN_PROGRESS?)
>
> Yes, that would be a good idea to avoid dual purposing the bits. Now
> that we do direct resets along the wait path, we can completely drop the
> i915_mutex_interruptible(). (No one else should be holding the mutex
> indefinitely.) I think that's a better approach -- I think we've already
> moved all the EIO magic aware to the ABI points where we deemed it
> necessary.

And it seems to work ok with the new flag and no wake_up. I'll run more 
tests.

Thanks

Patch

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index c7d68e789642..48c8b69d9bde 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -1800,6 +1800,8 @@  void i915_reset(struct drm_i915_private *dev_priv)
 	if (!test_bit(I915_RESET_HANDOFF, &error->flags))
 		return;
 
+	DRM_DEBUG_DRIVER("resetting chip\n");
+
 	/* Clear any previous failed attempts at recovery. Time to try again. */
 	if (!i915_gem_unset_wedged(dev_priv))
 		goto wakeup;
@@ -1863,6 +1865,19 @@  void i915_reset(struct drm_i915_private *dev_priv)
 	goto finish;
 }
 
+/**
+ * i915_reset_engine - reset GPU engine to recover from a hang
+ * @engine: engine to reset
+ *
+ * Reset a specific GPU engine. Useful if a hang is detected.
+ * Returns zero on successful reset or otherwise an error code.
+ */
+int i915_reset_engine(struct intel_engine_cs *engine)
+{
+	/* FIXME: replace me with engine reset sequence */
+	return -ENODEV;
+}
+
 static int i915_pm_suspend(struct device *kdev)
 {
 	struct pci_dev *pdev = to_pci_dev(kdev);
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index e06af46f5a57..ab7e68626c49 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -814,6 +814,7 @@  struct intel_csr {
 	func(has_ddi); \
 	func(has_decoupled_mmio); \
 	func(has_dp_mst); \
+	func(has_reset_engine); \
 	func(has_fbc); \
 	func(has_fpga_dbg); \
 	func(has_full_ppgtt); \
@@ -3019,6 +3020,8 @@  extern void i915_driver_unload(struct drm_device *dev);
 extern int intel_gpu_reset(struct drm_i915_private *dev_priv, u32 engine_mask);
 extern bool intel_has_gpu_reset(struct drm_i915_private *dev_priv);
 extern void i915_reset(struct drm_i915_private *dev_priv);
+extern int i915_reset_engine(struct intel_engine_cs *engine);
+extern bool intel_has_reset_engine(struct drm_i915_private *dev_priv);
 extern int intel_guc_reset(struct drm_i915_private *dev_priv);
 extern void intel_engine_init_hangcheck(struct intel_engine_cs *engine);
 extern void intel_hangcheck_init(struct drm_i915_private *dev_priv);
diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index fd97fe00cd0d..3a59ef1367ec 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -2635,11 +2635,13 @@  static irqreturn_t gen8_irq_handler(int irq, void *arg)
 /**
  * i915_reset_and_wakeup - do process context error handling work
  * @dev_priv: i915 device private
+ * @engine_mask: engine(s) hung - for reset-engine only.
  *
  * Fire an error uevent so userspace can see that a hang or error
  * was detected.
  */
-static void i915_reset_and_wakeup(struct drm_i915_private *dev_priv)
+static void
+i915_reset_and_wakeup(struct drm_i915_private *dev_priv, u32 engine_mask)
 {
 	struct kobject *kobj = &dev_priv->drm.primary->kdev->kobj;
 	char *error_event[] = { I915_ERROR_UEVENT "=1", NULL };
@@ -2648,9 +2650,33 @@  static void i915_reset_and_wakeup(struct drm_i915_private *dev_priv)
 
 	kobject_uevent_env(kobj, KOBJ_CHANGE, error_event);
 
-	DRM_DEBUG_DRIVER("resetting chip\n");
+	/*
+	 * This event needs to be sent before performing gpu reset. When
+	 * engine resets are supported we iterate through all engines and
+	 * reset hung engines individually. To keep the event dispatch
+	 * mechanism consistent with full gpu reset, this is only sent once
+	 * even when multiple engines are hung. It is also safe to move this
+	 * here because when we are in this function, we will definitely
+	 * perform gpu reset.
+	 */
 	kobject_uevent_env(kobj, KOBJ_CHANGE, reset_event);
 
+	/* try engine reset first, and continue if fails; look mom, no mutex! */
+	if (intel_has_reset_engine(dev_priv)) {
+		struct intel_engine_cs *engine;
+		unsigned int tmp;
+
+		for_each_engine_masked(engine, dev_priv, engine_mask, tmp) {
+			if (i915_reset_engine(engine) == 0)
+				engine_mask &= ~intel_engine_flag(engine);
+		}
+
+		if (engine_mask)
+			DRM_WARN("per-engine reset failed, promoting to full gpu reset\n");
+		else
+			goto finish;
+	}
+
 	intel_prepare_reset(dev_priv);
 
 	set_bit(I915_RESET_HANDOFF, &dev_priv->gpu_error.flags);
@@ -2680,6 +2706,7 @@  static void i915_reset_and_wakeup(struct drm_i915_private *dev_priv)
 		kobject_uevent_env(kobj,
 				   KOBJ_CHANGE, reset_done_event);
 
+finish:
 	/*
 	 * Note: The wake_up also serves as a memory barrier so that
 	 * waiters see the updated value of the dev_priv->gpu_error.
@@ -2781,7 +2808,7 @@  void i915_handle_error(struct drm_i915_private *dev_priv,
 			     &dev_priv->gpu_error.flags))
 		goto out;
 
-	i915_reset_and_wakeup(dev_priv);
+	i915_reset_and_wakeup(dev_priv, engine_mask);
 
 out:
 	intel_runtime_pm_put(dev_priv);
diff --git a/drivers/gpu/drm/i915/i915_pci.c b/drivers/gpu/drm/i915/i915_pci.c
index f87b0c4e564d..d5002b55cbd8 100644
--- a/drivers/gpu/drm/i915/i915_pci.c
+++ b/drivers/gpu/drm/i915/i915_pci.c
@@ -313,7 +313,8 @@  static const struct intel_device_info intel_haswell_info = {
 	BDW_COLORS, \
 	.has_logical_ring_contexts = 1, \
 	.has_full_48bit_ppgtt = 1, \
-	.has_64bit_reloc = 1
+	.has_64bit_reloc = 1, \
+	.has_reset_engine = 1
 
 static const struct intel_device_info intel_broadwell_info = {
 	BDW_FEATURES,
@@ -345,6 +346,7 @@  static const struct intel_device_info intel_cherryview_info = {
 	.has_gmch_display = 1,
 	.has_aliasing_ppgtt = 1,
 	.has_full_ppgtt = 1,
+	.has_reset_engine = 1,
 	.display_mmio_offset = VLV_DISPLAY_BASE,
 	GEN_CHV_PIPEOFFSETS,
 	CURSOR_OFFSETS,
@@ -394,6 +396,7 @@  static const struct intel_device_info intel_skylake_gt3_info = {
 	.has_aliasing_ppgtt = 1, \
 	.has_full_ppgtt = 1, \
 	.has_full_48bit_ppgtt = 1, \
+	.has_reset_engine = 1, \
 	GEN_DEFAULT_PIPEOFFSETS, \
 	IVB_CURSOR_OFFSETS, \
 	BDW_COLORS
diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
index 07a722f74fa1..ab5bdd110ac3 100644
--- a/drivers/gpu/drm/i915/intel_uncore.c
+++ b/drivers/gpu/drm/i915/intel_uncore.c
@@ -1776,6 +1776,17 @@  bool intel_has_gpu_reset(struct drm_i915_private *dev_priv)
 	return intel_get_gpu_reset(dev_priv) != NULL;
 }
 
+/*
+ * When GuC submission is enabled, GuC manages ELSP and can initiate the
+ * engine reset too. For now, fall back to full GPU reset if it is enabled.
+ */
+bool intel_has_reset_engine(struct drm_i915_private *dev_priv)
+{
+	return (dev_priv->info.has_reset_engine &&
+		!dev_priv->guc.execbuf_client &&
+		i915.reset == 2);
+}
+
 int intel_guc_reset(struct drm_i915_private *dev_priv)
 {
 	int ret;