[3/3] drm/panthor: Reset device and load FW after failed PM suspend

Message ID 20241011225906.3789965-3-adrian.larumbe@collabora.com (mailing list archive)
State New, archived
Series [1/3] drm/panthor: Fix runtime suspend sequence after OPP transition error

Commit Message

Adrián Larumbe Oct. 11, 2024, 10:57 p.m. UTC
On rk3588 SoCs, during a runtime PM suspend, the transition to the
lowest voltage/frequency pair might sometimes fail for reasons not yet
understood. In that case, even a slow FW reset will fail, leaving the
device's PM runtime status as unusable.

When that happens, successive attempts to resume the device upon running
a job will always fail.

Fix it by forcing a synchronous device reset, which will lead to a
successful FW reload, and also reset the device's PM runtime error
status before resuming it.

Signed-off-by: Adrián Larumbe <adrian.larumbe@collabora.com>
---
 drivers/gpu/drm/panthor/panthor_device.c | 10 ++++++++++
 drivers/gpu/drm/panthor/panthor_device.h |  2 ++
 drivers/gpu/drm/panthor/panthor_sched.c  |  7 +++++++
 3 files changed, 19 insertions(+)

Comments

Boris Brezillon Oct. 14, 2024, 7:27 a.m. UTC | #1
On Fri, 11 Oct 2024 23:57:01 +0100
Adrián Larumbe <adrian.larumbe@collabora.com> wrote:

> On rk3588 SoCs, during a runtime PM suspend, the transition to the
> lowest voltage/frequency pair might sometimes fail for reasons not yet
> understood. In that case, even a slow FW reset will fail, leaving the
> device's PM runtime status as unusable.
> 
> When that happens, successive attempts to resume the device upon running
> a job will always fail.
> 
> Fix it by forcing a synchronous device reset, which will lead to a
> successful FW reload, and also reset the device's PM runtime error
> status before resuming it.
> 
> Signed-off-by: Adrián Larumbe <adrian.larumbe@collabora.com>
> ---
>  drivers/gpu/drm/panthor/panthor_device.c | 10 ++++++++++
>  drivers/gpu/drm/panthor/panthor_device.h |  2 ++
>  drivers/gpu/drm/panthor/panthor_sched.c  |  7 +++++++
>  3 files changed, 19 insertions(+)
> 
> diff --git a/drivers/gpu/drm/panthor/panthor_device.c b/drivers/gpu/drm/panthor/panthor_device.c
> index 5430557bd0b8..ec6fed5e996b 100644
> --- a/drivers/gpu/drm/panthor/panthor_device.c
> +++ b/drivers/gpu/drm/panthor/panthor_device.c
> @@ -105,6 +105,16 @@ static void panthor_device_reset_cleanup(struct drm_device *ddev, void *data)
>  	destroy_workqueue(ptdev->reset.wq);
>  }
>  
> +int panthor_device_reset_sync(struct panthor_device *ptdev)
> +{
> +	panthor_fw_pre_reset(ptdev, false);
> +	panthor_mmu_pre_reset(ptdev);
> +	panthor_gpu_soft_reset(ptdev);
> +	panthor_gpu_l2_power_on(ptdev);
> +	panthor_mmu_post_reset(ptdev);
> +	return panthor_fw_post_reset(ptdev);
> +}
> +
>  static void panthor_device_reset_work(struct work_struct *work)
>  {
>  	struct panthor_device *ptdev = container_of(work, struct panthor_device, reset.work);
> diff --git a/drivers/gpu/drm/panthor/panthor_device.h b/drivers/gpu/drm/panthor/panthor_device.h
> index 0e68f5a70d20..05a5a7233378 100644
> --- a/drivers/gpu/drm/panthor/panthor_device.h
> +++ b/drivers/gpu/drm/panthor/panthor_device.h
> @@ -217,6 +217,8 @@ struct panthor_file {
>  int panthor_device_init(struct panthor_device *ptdev);
>  void panthor_device_unplug(struct panthor_device *ptdev);
>  
> +int panthor_device_reset_sync(struct panthor_device *ptdev);
> +
>  /**
>   * panthor_device_schedule_reset() - Schedules a reset operation
>   */
> diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/panthor/panthor_sched.c
> index c7b350fc3eba..9a854c8c5718 100644
> --- a/drivers/gpu/drm/panthor/panthor_sched.c
> +++ b/drivers/gpu/drm/panthor/panthor_sched.c
> @@ -3101,6 +3101,13 @@ queue_run_job(struct drm_sched_job *sched_job)
>  		return dma_fence_get(job->done_fence);
>  	}
>  
> +	if (ptdev->base.dev->power.runtime_error) {
> +		ret = panthor_device_reset_sync(ptdev);
> +		if (drm_WARN_ON(&ptdev->base, ret))
> +			return ERR_PTR(ret);
> +		drm_WARN_ON(&ptdev->base, pm_runtime_set_active(ptdev->base.dev));
> +	}

I'd rather pretend the suspend/resume worked (even if it didn't) and
deal with the consequences (force a slow reset on the next resume), than
spread the 'if-PM-op-failed-force-sync-reset' thing everywhere we do a
pm_runtime_resume_and_get(). Also not sure how resetting the GPU will
help fixing the OPP transition failure.

> +
>  	ret = pm_runtime_resume_and_get(ptdev->base.dev);
>  	if (drm_WARN_ON(&ptdev->base, ret))
>  		return ERR_PTR(ret);
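
The alternative suggested in the reply above (report the suspend as
successful, remember that it went wrong, and pay for it with a slow reset
on the next resume) can be modeled in a few lines of userspace C. This is
only a sketch under stated assumptions: struct model_dev, model_suspend()
and model_resume() are hypothetical names invented for illustration, not
panthor or runtime PM API, and the "slow reset" is reduced to a counter.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the driver's device state; not panthor code. */
struct model_dev {
	bool suspended;        /* state as seen by the PM core */
	bool needs_slow_reset; /* set when the suspend path hit an error */
	int slow_resets;       /* number of slow resets performed on resume */
};

/*
 * Suspend always reports success, so the PM core never latches an error.
 * A hardware failure (hw_ret != 0) is only recorded for the next resume.
 */
static int model_suspend(struct model_dev *dev, int hw_ret)
{
	if (hw_ret)
		dev->needs_slow_reset = true;
	dev->suspended = true;
	return 0;
}

/* Resume performs the deferred slow reset, if any, then clears the flag. */
static int model_resume(struct model_dev *dev)
{
	assert(dev->suspended);
	if (dev->needs_slow_reset) {
		dev->slow_resets++;
		dev->needs_slow_reset = false;
	}
	dev->suspended = false;
	return 0;
}
```

In this model the recovery lives in one place, the resume path, so callers
of the resume helper need no extra error-state check of their own.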

kernel test robot Oct. 16, 2024, 9:14 a.m. UTC | #2
Hi Adrián,

kernel test robot noticed the following build errors:

[auto build test ERROR on drm-misc/drm-misc-next]
[also build test ERROR on linus/master v6.12-rc3 next-20241016]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Adri-n-Larumbe/drm-panthor-Retry-OPP-transition-to-suspension-state-a-few-times/20241012-070112
base:   git://anongit.freedesktop.org/drm/drm-misc drm-misc-next
patch link:    https://lore.kernel.org/r/20241011225906.3789965-3-adrian.larumbe%40collabora.com
patch subject: [PATCH 3/3] drm/panthor: Reset device and load FW after failed PM suspend
config: i386-buildonly-randconfig-001-20241016 (https://download.01.org/0day-ci/archive/20241016/202410161634.8YjhTQM2-lkp@intel.com/config)
compiler: clang version 18.1.8 (https://github.com/llvm/llvm-project 3b5b5c1ec4a3095ab096dd780e84d7ab81f3d7ff)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241016/202410161634.8YjhTQM2-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202410161634.8YjhTQM2-lkp@intel.com/

All errors (new ones prefixed by >>):

>> drivers/gpu/drm/panthor/panthor_sched.c:3104:29: error: no member named 'runtime_error' in 'struct dev_pm_info'
    3104 |         if (ptdev->base.dev->power.runtime_error) {
         |             ~~~~~~~~~~~~~~~~~~~~~~ ^
   include/linux/compiler.h:55:47: note: expanded from macro 'if'
      55 | #define if(cond, ...) if ( __trace_if_var( !!(cond , ## __VA_ARGS__) ) )
         |                                               ^~~~
   include/linux/compiler.h:57:52: note: expanded from macro '__trace_if_var'
      57 | #define __trace_if_var(cond) (__builtin_constant_p(cond) ? (cond) : __trace_if_value(cond))
         |                                                    ^~~~
>> drivers/gpu/drm/panthor/panthor_sched.c:3104:29: error: no member named 'runtime_error' in 'struct dev_pm_info'
    3104 |         if (ptdev->base.dev->power.runtime_error) {
         |             ~~~~~~~~~~~~~~~~~~~~~~ ^
   include/linux/compiler.h:55:47: note: expanded from macro 'if'
      55 | #define if(cond, ...) if ( __trace_if_var( !!(cond , ## __VA_ARGS__) ) )
         |                                               ^~~~
   include/linux/compiler.h:57:61: note: expanded from macro '__trace_if_var'
      57 | #define __trace_if_var(cond) (__builtin_constant_p(cond) ? (cond) : __trace_if_value(cond))
         |                                                             ^~~~
>> drivers/gpu/drm/panthor/panthor_sched.c:3104:29: error: no member named 'runtime_error' in 'struct dev_pm_info'
    3104 |         if (ptdev->base.dev->power.runtime_error) {
         |             ~~~~~~~~~~~~~~~~~~~~~~ ^
   include/linux/compiler.h:55:47: note: expanded from macro 'if'
      55 | #define if(cond, ...) if ( __trace_if_var( !!(cond , ## __VA_ARGS__) ) )
         |                                               ^~~~
   include/linux/compiler.h:57:86: note: expanded from macro '__trace_if_var'
      57 | #define __trace_if_var(cond) (__builtin_constant_p(cond) ? (cond) : __trace_if_value(cond))
         |                                                                                      ^~~~
   include/linux/compiler.h:68:3: note: expanded from macro '__trace_if_value'
      68 |         (cond) ?                                        \
         |          ^~~~
   3 errors generated.
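
The errors above are consistent with a build in which CONFIG_PM is
disabled: in include/linux/pm.h the runtime_error member of
struct dev_pm_info sits among the runtime PM fields that are only compiled
in when CONFIG_PM is set. The randconfig itself is not reproduced here, so
treat that as an assumption. The general pattern, and one way to keep
callers building in both configurations, can be sketched in standalone C.
MINI_CONFIG_PM, struct mini_pm_info and mini_pm_runtime_error() are
hypothetical names for illustration only, not kernel API.

```c
#include <assert.h>

/* Hypothetical miniature of a struct whose error field is config-dependent. */
struct mini_pm_info {
	int usage_count;
#ifdef MINI_CONFIG_PM
	int runtime_error; /* only exists when the feature is compiled in */
#endif
};

/*
 * Accessor that compiles in both configurations. Callers use this instead
 * of touching the field directly, so a build without MINI_CONFIG_PM does
 * not fail with "no member named 'runtime_error'".
 */
static inline int mini_pm_runtime_error(const struct mini_pm_info *pm)
{
	assert(pm);
#ifdef MINI_CONFIG_PM
	return pm->runtime_error;
#else
	return 0; /* without the feature there is never an error to report */
#endif
}
```

Compiled without -DMINI_CONFIG_PM, the accessor always returns 0; with it,
the accessor returns whatever the field holds.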


vim +3104 drivers/gpu/drm/panthor/panthor_sched.c

  3081	
  3082	static struct dma_fence *
  3083	queue_run_job(struct drm_sched_job *sched_job)
  3084	{
  3085		struct panthor_job *job = container_of(sched_job, struct panthor_job, base);
  3086		struct panthor_group *group = job->group;
  3087		struct panthor_queue *queue = group->queues[job->queue_idx];
  3088		struct panthor_device *ptdev = group->ptdev;
  3089		struct panthor_scheduler *sched = ptdev->scheduler;
  3090		struct panthor_job_ringbuf_instrs instrs;
  3091		struct panthor_job_cs_params cs_params;
  3092		struct dma_fence *done_fence;
  3093		int ret;
  3094	
  3095		/* Stream size is zero, nothing to do except making sure all previously
  3096		 * submitted jobs are done before we signal the
  3097		 * drm_sched_job::s_fence::finished fence.
  3098		 */
  3099		if (!job->call_info.size) {
  3100			job->done_fence = dma_fence_get(queue->fence_ctx.last_fence);
  3101			return dma_fence_get(job->done_fence);
  3102		}
  3103	
> 3104		if (ptdev->base.dev->power.runtime_error) {
  3105			ret = panthor_device_reset_sync(ptdev);
  3106			if (drm_WARN_ON(&ptdev->base, ret))
  3107				return ERR_PTR(ret);
  3108			drm_WARN_ON(&ptdev->base, pm_runtime_set_active(ptdev->base.dev));
  3109		}
  3110	
  3111		ret = pm_runtime_resume_and_get(ptdev->base.dev);
  3112		if (drm_WARN_ON(&ptdev->base, ret))
  3113			return ERR_PTR(ret);
  3114	
  3115		mutex_lock(&sched->lock);
  3116		if (!group_can_run(group)) {
  3117			done_fence = ERR_PTR(-ECANCELED);
  3118			goto out_unlock;
  3119		}
  3120	
  3121		dma_fence_init(job->done_fence,
  3122			       &panthor_queue_fence_ops,
  3123			       &queue->fence_ctx.lock,
  3124			       queue->fence_ctx.id,
  3125			       atomic64_inc_return(&queue->fence_ctx.seqno));
  3126	
  3127		job->profiling.slot = queue->profiling.seqno++;
  3128		if (queue->profiling.seqno == queue->profiling.slot_count)
  3129			queue->profiling.seqno = 0;
  3130	
  3131		job->ringbuf.start = queue->iface.input->insert;
  3132	
  3133		get_job_cs_params(job, &cs_params);
  3134		prepare_job_instrs(&cs_params, &instrs);
  3135		copy_instrs_to_ringbuf(queue, job, &instrs);
  3136	
  3137		job->ringbuf.end = job->ringbuf.start + (instrs.count * sizeof(u64));
  3138	
  3139		panthor_job_get(&job->base);
  3140		spin_lock(&queue->fence_ctx.lock);
  3141		list_add_tail(&job->node, &queue->fence_ctx.in_flight_jobs);
  3142		spin_unlock(&queue->fence_ctx.lock);
  3143	
  3144		/* Make sure the ring buffer is updated before the INSERT
  3145		 * register.
  3146		 */
  3147		wmb();
  3148	
  3149		queue->iface.input->extract = queue->iface.output->extract;
  3150		queue->iface.input->insert = job->ringbuf.end;
  3151	
  3152		if (group->csg_id < 0) {
  3153			/* If the queue is blocked, we want to keep the timeout running, so we
  3154			 * can detect unbounded waits and kill the group when that happens.
  3155			 * Otherwise, we suspend the timeout so the time we spend waiting for
  3156			 * a CSG slot is not counted.
  3157			 */
  3158			if (!(group->blocked_queues & BIT(job->queue_idx)) &&
  3159			    !queue->timeout_suspended) {
  3160				queue->remaining_time = drm_sched_suspend_timeout(&queue->scheduler);
  3161				queue->timeout_suspended = true;
  3162			}
  3163	
  3164			group_schedule_locked(group, BIT(job->queue_idx));
  3165		} else {
  3166			gpu_write(ptdev, CSF_DOORBELL(queue->doorbell_id), 1);
  3167			if (!sched->pm.has_ref &&
  3168			    !(group->blocked_queues & BIT(job->queue_idx))) {
  3169				pm_runtime_get(ptdev->base.dev);
  3170				sched->pm.has_ref = true;
  3171			}
  3172			panthor_devfreq_record_busy(sched->ptdev);
  3173		}
  3174	
  3175		/* Update the last fence. */
  3176		dma_fence_put(queue->fence_ctx.last_fence);
  3177		queue->fence_ctx.last_fence = dma_fence_get(job->done_fence);
  3178	
  3179		done_fence = dma_fence_get(job->done_fence);
  3180	
  3181	out_unlock:
  3182		mutex_unlock(&sched->lock);
  3183		pm_runtime_mark_last_busy(ptdev->base.dev);
  3184		pm_runtime_put_autosuspend(ptdev->base.dev);
  3185	
  3186		return done_fence;
  3187	}
  3188
Patch

diff --git a/drivers/gpu/drm/panthor/panthor_device.c b/drivers/gpu/drm/panthor/panthor_device.c
index 5430557bd0b8..ec6fed5e996b 100644
--- a/drivers/gpu/drm/panthor/panthor_device.c
+++ b/drivers/gpu/drm/panthor/panthor_device.c
@@ -105,6 +105,16 @@  static void panthor_device_reset_cleanup(struct drm_device *ddev, void *data)
 	destroy_workqueue(ptdev->reset.wq);
 }
 
+int panthor_device_reset_sync(struct panthor_device *ptdev)
+{
+	panthor_fw_pre_reset(ptdev, false);
+	panthor_mmu_pre_reset(ptdev);
+	panthor_gpu_soft_reset(ptdev);
+	panthor_gpu_l2_power_on(ptdev);
+	panthor_mmu_post_reset(ptdev);
+	return panthor_fw_post_reset(ptdev);
+}
+
 static void panthor_device_reset_work(struct work_struct *work)
 {
 	struct panthor_device *ptdev = container_of(work, struct panthor_device, reset.work);
diff --git a/drivers/gpu/drm/panthor/panthor_device.h b/drivers/gpu/drm/panthor/panthor_device.h
index 0e68f5a70d20..05a5a7233378 100644
--- a/drivers/gpu/drm/panthor/panthor_device.h
+++ b/drivers/gpu/drm/panthor/panthor_device.h
@@ -217,6 +217,8 @@  struct panthor_file {
 int panthor_device_init(struct panthor_device *ptdev);
 void panthor_device_unplug(struct panthor_device *ptdev);
 
+int panthor_device_reset_sync(struct panthor_device *ptdev);
+
 /**
  * panthor_device_schedule_reset() - Schedules a reset operation
  */
diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/panthor/panthor_sched.c
index c7b350fc3eba..9a854c8c5718 100644
--- a/drivers/gpu/drm/panthor/panthor_sched.c
+++ b/drivers/gpu/drm/panthor/panthor_sched.c
@@ -3101,6 +3101,13 @@  queue_run_job(struct drm_sched_job *sched_job)
 		return dma_fence_get(job->done_fence);
 	}
 
+	if (ptdev->base.dev->power.runtime_error) {
+		ret = panthor_device_reset_sync(ptdev);
+		if (drm_WARN_ON(&ptdev->base, ret))
+			return ERR_PTR(ret);
+		drm_WARN_ON(&ptdev->base, pm_runtime_set_active(ptdev->base.dev));
+	}
+
 	ret = pm_runtime_resume_and_get(ptdev->base.dev);
 	if (drm_WARN_ON(&ptdev->base, ret))
 		return ERR_PTR(ret);