
drm/sched: add optional errno to drm_sched_start()

Message ID 20240726075550.1511-1-christian.koenig@amd.com (mailing list archive)
State New, archived
Series: drm/sched: add optional errno to drm_sched_start()

Commit Message

Christian König July 26, 2024, 7:55 a.m. UTC
The current implementation of drm_sched_start uses a hardcoded
-ECANCELED to dispose of a job when the parent/hw fence is NULL.
This results in drm_sched_job_done being called with -ECANCELED for
each job with a NULL parent in the pending list, making it difficult
to distinguish which recovery method was used: a queue reset or a
full GPU reset.

To improve this, we first try a soft recovery for timeout jobs and
use the error code -ENODATA. If soft recovery fails, we proceed with
a queue reset, where the error code remains -ENODATA for the job.
Finally, for a full GPU reset, we use error codes -ECANCELED or
-ETIME. This patch adds an error code parameter to drm_sched_start,
allowing us to differentiate between queue reset and GPU reset
failures. This enables user mode and test applications to validate
the expected correctness of the requested operation. After a
successful queue reset, the only way to continue normal operation is
to call drm_sched_job_done with the specific error code -ENODATA.

v1: Initial implementation by Jesse utilized amdgpu_device_lock_reset_domain
    and amdgpu_device_unlock_reset_domain to allow user mode to track
    the queue reset status and distinguish between queue reset and
    GPU reset.
v2: Christian suggested using the error codes -ENODATA for queue reset
    and -ECANCELED or -ETIME for GPU reset, returned to
    amdgpu_cs_wait_ioctl.
v3: To meet the requirements, we introduce a new function
    drm_sched_start_ex with an additional parameter to set
    dma_fence_set_error, allowing us to handle the specific error
    codes appropriately and dispose of bad jobs with the selected
    error code depending on whether it was a queue reset or GPU reset.
v4: Alex suggested using a new name, drm_sched_start_with_recovery_error,
    which more accurately describes the function's purpose.
    Additionally, it was recommended to add documentation details
    about the new method.
v5: Fixed declaration of new function drm_sched_start_with_recovery_error. (Alex)
v6 (chk): rebase on upstream changes, cleanup the commit message,
          drop the new function again and update all callers,
          apply the errno also to scheduler fences with hw fences

Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
Signed-off-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
Signed-off-by: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_arcturus.c | 2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c          | 4 ++--
 drivers/gpu/drm/etnaviv/etnaviv_sched.c             | 4 ++--
 drivers/gpu/drm/imagination/pvr_queue.c             | 4 ++--
 drivers/gpu/drm/lima/lima_sched.c                   | 2 +-
 drivers/gpu/drm/nouveau/nouveau_sched.c             | 2 +-
 drivers/gpu/drm/panfrost/panfrost_job.c             | 2 +-
 drivers/gpu/drm/panthor/panthor_mmu.c               | 2 +-
 drivers/gpu/drm/scheduler/sched_main.c              | 7 ++++---
 drivers/gpu/drm/v3d/v3d_sched.c                     | 2 +-
 include/drm/gpu_scheduler.h                         | 2 +-
 11 files changed, 17 insertions(+), 16 deletions(-)

Comments

Matthew Brost July 26, 2024, 12:30 p.m. UTC | #1
On Fri, Jul 26, 2024 at 09:55:50AM +0200, Christian König wrote:
> The current implementation of drm_sched_start uses a hardcoded
> -ECANCELED to dispose of a job when the parent/hw fence is NULL.
> This results in drm_sched_job_done being called with -ECANCELED for
> each job with a NULL parent in the pending list, making it difficult
> to distinguish between recovery methods, whether a queue reset or a
> full GPU reset was used.
> 
> To improve this, we first try a soft recovery for timeout jobs and
> use the error code -ENODATA. If soft recovery fails, we proceed with
> a queue reset, where the error code remains -ENODATA for the job.
> Finally, for a full GPU reset, we use error codes -ECANCELED or
> -ETIME. This patch adds an error code parameter to drm_sched_start,
> allowing us to differentiate between queue reset and GPU reset
> failures. This enables user mode and test applications to validate
> the expected correctness of the requested operation. After a
> successful queue reset, the only way to continue normal operation is
> to call drm_sched_job_done with the specific error code -ENODATA.
> 
> v1: Initial implementation by Jesse utilized amdgpu_device_lock_reset_domain
>     and amdgpu_device_unlock_reset_domain to allow user mode to track
>     the queue reset status and distinguish between queue reset and
>     GPU reset.
> v2: Christian suggested using the error codes -ENODATA for queue reset
>     and -ECANCELED or -ETIME for GPU reset, returned to
>     amdgpu_cs_wait_ioctl.
> v3: To meet the requirements, we introduce a new function
>     drm_sched_start_ex with an additional parameter to set
>     dma_fence_set_error, allowing us to handle the specific error
>     codes appropriately and dispose of bad jobs with the selected
>     error code depending on whether it was a queue reset or GPU reset.
> v4: Alex suggested using a new name, drm_sched_start_with_recovery_error,
>     which more accurately describes the function's purpose.
>     Additionally, it was recommended to add documentation details
>     about the new method.
> v5: Fixed declaration of new function drm_sched_start_with_recovery_error.(Alex)
> v6 (chk): rebase on upstream changes, cleanup the commit message,
>           drop the new function again and update all callers,
>           apply the errno also to scheduler fences with hw fences
> 

Seems reasonable to me, but all the callers pass an errno of zero to
drm_sched_start. Is that going to change in a follow-up?

Anyway, LGTM, but I will leave the RB to a user of this interface.
Acked-by: Matthew Brost <matthew.brost@intel.com>

> Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
> Signed-off-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
> Signed-off-by: Christian König <christian.koenig@amd.com>
> Cc: Alex Deucher <alexander.deucher@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_arcturus.c | 2 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c          | 4 ++--
>  drivers/gpu/drm/etnaviv/etnaviv_sched.c             | 4 ++--
>  drivers/gpu/drm/imagination/pvr_queue.c             | 4 ++--
>  drivers/gpu/drm/lima/lima_sched.c                   | 2 +-
>  drivers/gpu/drm/nouveau/nouveau_sched.c             | 2 +-
>  drivers/gpu/drm/panfrost/panfrost_job.c             | 2 +-
>  drivers/gpu/drm/panthor/panthor_mmu.c               | 2 +-
>  drivers/gpu/drm/scheduler/sched_main.c              | 7 ++++---
>  drivers/gpu/drm/v3d/v3d_sched.c                     | 2 +-
>  include/drm/gpu_scheduler.h                         | 2 +-
>  11 files changed, 17 insertions(+), 16 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_arcturus.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_arcturus.c
> index 2320df51c914..18135d8235f9 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_arcturus.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_arcturus.c
> @@ -300,7 +300,7 @@ static int suspend_resume_compute_scheduler(struct amdgpu_device *adev, bool sus
>  			if (r)
>  				goto out;
>  		} else {
> -			drm_sched_start(&ring->sched);
> +			drm_sched_start(&ring->sched, 0);
>  		}
>  	}
>  
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index c186fdb198ad..861827deb03f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -5862,7 +5862,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>  			if (!amdgpu_ring_sched_ready(ring))
>  				continue;
>  
> -			drm_sched_start(&ring->sched);
> +			drm_sched_start(&ring->sched, 0);
>  		}
>  
>  		if (!drm_drv_uses_atomic_modeset(adev_to_drm(tmp_adev)) && !job_signaled)
> @@ -6360,7 +6360,7 @@ void amdgpu_pci_resume(struct pci_dev *pdev)
>  		if (!amdgpu_ring_sched_ready(ring))
>  			continue;
>  
> -		drm_sched_start(&ring->sched);
> +		drm_sched_start(&ring->sched, 0);
>  	}
>  
>  	amdgpu_device_unset_mp1_state(adev);
> diff --git a/drivers/gpu/drm/etnaviv/etnaviv_sched.c b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
> index c53641aa146f..2c8666f8ec4a 100644
> --- a/drivers/gpu/drm/etnaviv/etnaviv_sched.c
> +++ b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
> @@ -72,12 +72,12 @@ static enum drm_gpu_sched_stat etnaviv_sched_timedout_job(struct drm_sched_job
>  
>  	drm_sched_resubmit_jobs(&gpu->sched);
>  
> -	drm_sched_start(&gpu->sched);
> +	drm_sched_start(&gpu->sched, 0);
>  	return DRM_GPU_SCHED_STAT_NOMINAL;
>  
>  out_no_timeout:
>  	/* restart scheduler after GPU is usable again */
> -	drm_sched_start(&gpu->sched);
> +	drm_sched_start(&gpu->sched, 0);
>  	return DRM_GPU_SCHED_STAT_NOMINAL;
>  }
>  
> diff --git a/drivers/gpu/drm/imagination/pvr_queue.c b/drivers/gpu/drm/imagination/pvr_queue.c
> index 20cb46012082..c4f08432882b 100644
> --- a/drivers/gpu/drm/imagination/pvr_queue.c
> +++ b/drivers/gpu/drm/imagination/pvr_queue.c
> @@ -782,7 +782,7 @@ static void pvr_queue_start(struct pvr_queue *queue)
>  		}
>  	}
>  
> -	drm_sched_start(&queue->scheduler);
> +	drm_sched_start(&queue->scheduler, 0);
>  }
>  
>  /**
> @@ -842,7 +842,7 @@ pvr_queue_timedout_job(struct drm_sched_job *s_job)
>  	}
>  	mutex_unlock(&pvr_dev->queues.lock);
>  
> -	drm_sched_start(sched);
> +	drm_sched_start(sched, 0);
>  
>  	return DRM_GPU_SCHED_STAT_NOMINAL;
>  }
> diff --git a/drivers/gpu/drm/lima/lima_sched.c b/drivers/gpu/drm/lima/lima_sched.c
> index 1a944edb6ddc..b40c90e97d7e 100644
> --- a/drivers/gpu/drm/lima/lima_sched.c
> +++ b/drivers/gpu/drm/lima/lima_sched.c
> @@ -463,7 +463,7 @@ static enum drm_gpu_sched_stat lima_sched_timedout_job(struct drm_sched_job *job
>  	lima_pm_idle(ldev);
>  
>  	drm_sched_resubmit_jobs(&pipe->base);
> -	drm_sched_start(&pipe->base);
> +	drm_sched_start(&pipe->base, 0);
>  
>  	return DRM_GPU_SCHED_STAT_NOMINAL;
>  }
> diff --git a/drivers/gpu/drm/nouveau/nouveau_sched.c b/drivers/gpu/drm/nouveau/nouveau_sched.c
> index eb6c3f9a01f5..4412f2711fb5 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_sched.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_sched.c
> @@ -379,7 +379,7 @@ nouveau_sched_timedout_job(struct drm_sched_job *sched_job)
>  	else
>  		NV_PRINTK(warn, job->cli, "Generic job timeout.\n");
>  
> -	drm_sched_start(sched);
> +	drm_sched_start(sched, 0);
>  
>  	return stat;
>  }
> diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c
> index df49d37d0e7e..d140800606bf 100644
> --- a/drivers/gpu/drm/panfrost/panfrost_job.c
> +++ b/drivers/gpu/drm/panfrost/panfrost_job.c
> @@ -727,7 +727,7 @@ panfrost_reset(struct panfrost_device *pfdev,
>  
>  	/* Restart the schedulers */
>  	for (i = 0; i < NUM_JOB_SLOTS; i++)
> -		drm_sched_start(&pfdev->js->queue[i].sched);
> +		drm_sched_start(&pfdev->js->queue[i].sched, 0);
>  
>  	/* Re-enable job interrupts now that everything has been restarted. */
>  	job_write(pfdev, JOB_INT_MASK,
> diff --git a/drivers/gpu/drm/panthor/panthor_mmu.c b/drivers/gpu/drm/panthor/panthor_mmu.c
> index d47972806d50..e630cdf47f99 100644
> --- a/drivers/gpu/drm/panthor/panthor_mmu.c
> +++ b/drivers/gpu/drm/panthor/panthor_mmu.c
> @@ -827,7 +827,7 @@ static void panthor_vm_stop(struct panthor_vm *vm)
>  
>  static void panthor_vm_start(struct panthor_vm *vm)
>  {
> -	drm_sched_start(&vm->sched);
> +	drm_sched_start(&vm->sched, 0);
>  }
>  
>  /**
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index ab53ab486fe6..f093616fe53c 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -674,9 +674,10 @@ EXPORT_SYMBOL(drm_sched_stop);
>   * drm_sched_start - recover jobs after a reset
>   *
>   * @sched: scheduler instance
> + * @errno: error to set on the pending fences
>   *
>   */
> -void drm_sched_start(struct drm_gpu_scheduler *sched)
> +void drm_sched_start(struct drm_gpu_scheduler *sched, int errno)
>  {
>  	struct drm_sched_job *s_job, *tmp;
>  
> @@ -691,13 +692,13 @@ void drm_sched_start(struct drm_gpu_scheduler *sched)
>  		atomic_add(s_job->credits, &sched->credit_count);
>  
>  		if (!fence) {
> -			drm_sched_job_done(s_job, -ECANCELED);
> +			drm_sched_job_done(s_job, errno ?: -ECANCELED);
>  			continue;
>  		}
>  
>  		if (dma_fence_add_callback(fence, &s_job->cb,
>  					   drm_sched_job_done_cb))
> -			drm_sched_job_done(s_job, fence->error);
> +			drm_sched_job_done(s_job, fence->error ?: errno);
>  	}
>  
>  	drm_sched_start_timeout_unlocked(sched);
> diff --git a/drivers/gpu/drm/v3d/v3d_sched.c b/drivers/gpu/drm/v3d/v3d_sched.c
> index 42d4f4a2dba2..cac02284cd19 100644
> --- a/drivers/gpu/drm/v3d/v3d_sched.c
> +++ b/drivers/gpu/drm/v3d/v3d_sched.c
> @@ -653,7 +653,7 @@ v3d_gpu_reset_for_timeout(struct v3d_dev *v3d, struct drm_sched_job *sched_job)
>  
>  	/* Unblock schedulers and restart their jobs. */
>  	for (q = 0; q < V3D_MAX_QUEUES; q++) {
> -		drm_sched_start(&v3d->queue[q].sched);
> +		drm_sched_start(&v3d->queue[q].sched, 0);
>  	}
>  
>  	mutex_unlock(&v3d->reset_lock);
> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> index fe8edb917360..a8d19b10f9b8 100644
> --- a/include/drm/gpu_scheduler.h
> +++ b/include/drm/gpu_scheduler.h
> @@ -579,7 +579,7 @@ bool drm_sched_wqueue_ready(struct drm_gpu_scheduler *sched);
>  void drm_sched_wqueue_stop(struct drm_gpu_scheduler *sched);
>  void drm_sched_wqueue_start(struct drm_gpu_scheduler *sched);
>  void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad);
> -void drm_sched_start(struct drm_gpu_scheduler *sched);
> +void drm_sched_start(struct drm_gpu_scheduler *sched, int errno);
>  void drm_sched_resubmit_jobs(struct drm_gpu_scheduler *sched);
>  void drm_sched_increase_karma(struct drm_sched_job *bad);
>  void drm_sched_reset_karma(struct drm_sched_job *bad);
> -- 
> 2.34.1
>
Daniel Vetter July 26, 2024, 2:21 p.m. UTC | #2
On Fri, Jul 26, 2024 at 09:55:50AM +0200, Christian König wrote:
> The current implementation of drm_sched_start uses a hardcoded
> -ECANCELED to dispose of a job when the parent/hw fence is NULL.
> This results in drm_sched_job_done being called with -ECANCELED for
> each job with a NULL parent in the pending list, making it difficult
> to distinguish between recovery methods, whether a queue reset or a
> full GPU reset was used.
> 
> To improve this, we first try a soft recovery for timeout jobs and
> use the error code -ENODATA. If soft recovery fails, we proceed with
> a queue reset, where the error code remains -ENODATA for the job.
> Finally, for a full GPU reset, we use error codes -ECANCELED or
> -ETIME. This patch adds an error code parameter to drm_sched_start,
> allowing us to differentiate between queue reset and GPU reset
> failures. This enables user mode and test applications to validate
> the expected correctness of the requested operation. After a
> successful queue reset, the only way to continue normal operation is
> to call drm_sched_job_done with the specific error code -ENODATA.
> 
> v1: Initial implementation by Jesse utilized amdgpu_device_lock_reset_domain
>     and amdgpu_device_unlock_reset_domain to allow user mode to track
>     the queue reset status and distinguish between queue reset and
>     GPU reset.
> v2: Christian suggested using the error codes -ENODATA for queue reset
>     and -ECANCELED or -ETIME for GPU reset, returned to
>     amdgpu_cs_wait_ioctl.
> v3: To meet the requirements, we introduce a new function
>     drm_sched_start_ex with an additional parameter to set
>     dma_fence_set_error, allowing us to handle the specific error
>     codes appropriately and dispose of bad jobs with the selected
>     error code depending on whether it was a queue reset or GPU reset.
> v4: Alex suggested using a new name, drm_sched_start_with_recovery_error,
>     which more accurately describes the function's purpose.
>     Additionally, it was recommended to add documentation details
>     about the new method.
> v5: Fixed declaration of new function drm_sched_start_with_recovery_error.(Alex)
> v6 (chk): rebase on upstream changes, cleanup the commit message,
>           drop the new function again and update all callers,
>           apply the errno also to scheduler fences with hw fences
> 
> Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
> Signed-off-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
> Signed-off-by: Christian König <christian.koenig@amd.com>
> Cc: Alex Deucher <alexander.deucher@amd.com>

Maybe I'm completely missing the point, and it's kind of hard to be sure
without the testcase/mesa-side code too, but for GL robustness I don't
think this is enough, because you also need to know whether it was your
own context or someone else's that caused the GPU reset. Probably biased,
but I think the per-ctx guilty/reset counters are more the right approach
here. Or something along those lines.

If we really want to stuff this into per-job fences then I think we should
at least try to document this mess in the sync_file uapi, for a bit of
consistency.

But yeah without the full picture no idea really what we want here.
-Sima

> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_arcturus.c | 2 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c          | 4 ++--
>  drivers/gpu/drm/etnaviv/etnaviv_sched.c             | 4 ++--
>  drivers/gpu/drm/imagination/pvr_queue.c             | 4 ++--
>  drivers/gpu/drm/lima/lima_sched.c                   | 2 +-
>  drivers/gpu/drm/nouveau/nouveau_sched.c             | 2 +-
>  drivers/gpu/drm/panfrost/panfrost_job.c             | 2 +-
>  drivers/gpu/drm/panthor/panthor_mmu.c               | 2 +-
>  drivers/gpu/drm/scheduler/sched_main.c              | 7 ++++---
>  drivers/gpu/drm/v3d/v3d_sched.c                     | 2 +-
>  include/drm/gpu_scheduler.h                         | 2 +-
>  11 files changed, 17 insertions(+), 16 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_arcturus.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_arcturus.c
> index 2320df51c914..18135d8235f9 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_arcturus.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_arcturus.c
> @@ -300,7 +300,7 @@ static int suspend_resume_compute_scheduler(struct amdgpu_device *adev, bool sus
>  			if (r)
>  				goto out;
>  		} else {
> -			drm_sched_start(&ring->sched);
> +			drm_sched_start(&ring->sched, 0);
>  		}
>  	}
>  
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index c186fdb198ad..861827deb03f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -5862,7 +5862,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>  			if (!amdgpu_ring_sched_ready(ring))
>  				continue;
>  
> -			drm_sched_start(&ring->sched);
> +			drm_sched_start(&ring->sched, 0);
>  		}
>  
>  		if (!drm_drv_uses_atomic_modeset(adev_to_drm(tmp_adev)) && !job_signaled)
> @@ -6360,7 +6360,7 @@ void amdgpu_pci_resume(struct pci_dev *pdev)
>  		if (!amdgpu_ring_sched_ready(ring))
>  			continue;
>  
> -		drm_sched_start(&ring->sched);
> +		drm_sched_start(&ring->sched, 0);
>  	}
>  
>  	amdgpu_device_unset_mp1_state(adev);
> diff --git a/drivers/gpu/drm/etnaviv/etnaviv_sched.c b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
> index c53641aa146f..2c8666f8ec4a 100644
> --- a/drivers/gpu/drm/etnaviv/etnaviv_sched.c
> +++ b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
> @@ -72,12 +72,12 @@ static enum drm_gpu_sched_stat etnaviv_sched_timedout_job(struct drm_sched_job
>  
>  	drm_sched_resubmit_jobs(&gpu->sched);
>  
> -	drm_sched_start(&gpu->sched);
> +	drm_sched_start(&gpu->sched, 0);
>  	return DRM_GPU_SCHED_STAT_NOMINAL;
>  
>  out_no_timeout:
>  	/* restart scheduler after GPU is usable again */
> -	drm_sched_start(&gpu->sched);
> +	drm_sched_start(&gpu->sched, 0);
>  	return DRM_GPU_SCHED_STAT_NOMINAL;
>  }
>  
> diff --git a/drivers/gpu/drm/imagination/pvr_queue.c b/drivers/gpu/drm/imagination/pvr_queue.c
> index 20cb46012082..c4f08432882b 100644
> --- a/drivers/gpu/drm/imagination/pvr_queue.c
> +++ b/drivers/gpu/drm/imagination/pvr_queue.c
> @@ -782,7 +782,7 @@ static void pvr_queue_start(struct pvr_queue *queue)
>  		}
>  	}
>  
> -	drm_sched_start(&queue->scheduler);
> +	drm_sched_start(&queue->scheduler, 0);
>  }
>  
>  /**
> @@ -842,7 +842,7 @@ pvr_queue_timedout_job(struct drm_sched_job *s_job)
>  	}
>  	mutex_unlock(&pvr_dev->queues.lock);
>  
> -	drm_sched_start(sched);
> +	drm_sched_start(sched, 0);
>  
>  	return DRM_GPU_SCHED_STAT_NOMINAL;
>  }
> diff --git a/drivers/gpu/drm/lima/lima_sched.c b/drivers/gpu/drm/lima/lima_sched.c
> index 1a944edb6ddc..b40c90e97d7e 100644
> --- a/drivers/gpu/drm/lima/lima_sched.c
> +++ b/drivers/gpu/drm/lima/lima_sched.c
> @@ -463,7 +463,7 @@ static enum drm_gpu_sched_stat lima_sched_timedout_job(struct drm_sched_job *job
>  	lima_pm_idle(ldev);
>  
>  	drm_sched_resubmit_jobs(&pipe->base);
> -	drm_sched_start(&pipe->base);
> +	drm_sched_start(&pipe->base, 0);
>  
>  	return DRM_GPU_SCHED_STAT_NOMINAL;
>  }
> diff --git a/drivers/gpu/drm/nouveau/nouveau_sched.c b/drivers/gpu/drm/nouveau/nouveau_sched.c
> index eb6c3f9a01f5..4412f2711fb5 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_sched.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_sched.c
> @@ -379,7 +379,7 @@ nouveau_sched_timedout_job(struct drm_sched_job *sched_job)
>  	else
>  		NV_PRINTK(warn, job->cli, "Generic job timeout.\n");
>  
> -	drm_sched_start(sched);
> +	drm_sched_start(sched, 0);
>  
>  	return stat;
>  }
> diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c
> index df49d37d0e7e..d140800606bf 100644
> --- a/drivers/gpu/drm/panfrost/panfrost_job.c
> +++ b/drivers/gpu/drm/panfrost/panfrost_job.c
> @@ -727,7 +727,7 @@ panfrost_reset(struct panfrost_device *pfdev,
>  
>  	/* Restart the schedulers */
>  	for (i = 0; i < NUM_JOB_SLOTS; i++)
> -		drm_sched_start(&pfdev->js->queue[i].sched);
> +		drm_sched_start(&pfdev->js->queue[i].sched, 0);
>  
>  	/* Re-enable job interrupts now that everything has been restarted. */
>  	job_write(pfdev, JOB_INT_MASK,
> diff --git a/drivers/gpu/drm/panthor/panthor_mmu.c b/drivers/gpu/drm/panthor/panthor_mmu.c
> index d47972806d50..e630cdf47f99 100644
> --- a/drivers/gpu/drm/panthor/panthor_mmu.c
> +++ b/drivers/gpu/drm/panthor/panthor_mmu.c
> @@ -827,7 +827,7 @@ static void panthor_vm_stop(struct panthor_vm *vm)
>  
>  static void panthor_vm_start(struct panthor_vm *vm)
>  {
> -	drm_sched_start(&vm->sched);
> +	drm_sched_start(&vm->sched, 0);
>  }
>  
>  /**
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index ab53ab486fe6..f093616fe53c 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -674,9 +674,10 @@ EXPORT_SYMBOL(drm_sched_stop);
>   * drm_sched_start - recover jobs after a reset
>   *
>   * @sched: scheduler instance
> + * @errno: error to set on the pending fences
>   *
>   */
> -void drm_sched_start(struct drm_gpu_scheduler *sched)
> +void drm_sched_start(struct drm_gpu_scheduler *sched, int errno)
>  {
>  	struct drm_sched_job *s_job, *tmp;
>  
> @@ -691,13 +692,13 @@ void drm_sched_start(struct drm_gpu_scheduler *sched)
>  		atomic_add(s_job->credits, &sched->credit_count);
>  
>  		if (!fence) {
> -			drm_sched_job_done(s_job, -ECANCELED);
> +			drm_sched_job_done(s_job, errno ?: -ECANCELED);
>  			continue;
>  		}
>  
>  		if (dma_fence_add_callback(fence, &s_job->cb,
>  					   drm_sched_job_done_cb))
> -			drm_sched_job_done(s_job, fence->error);
> +			drm_sched_job_done(s_job, fence->error ?: errno);
>  	}
>  
>  	drm_sched_start_timeout_unlocked(sched);
> diff --git a/drivers/gpu/drm/v3d/v3d_sched.c b/drivers/gpu/drm/v3d/v3d_sched.c
> index 42d4f4a2dba2..cac02284cd19 100644
> --- a/drivers/gpu/drm/v3d/v3d_sched.c
> +++ b/drivers/gpu/drm/v3d/v3d_sched.c
> @@ -653,7 +653,7 @@ v3d_gpu_reset_for_timeout(struct v3d_dev *v3d, struct drm_sched_job *sched_job)
>  
>  	/* Unblock schedulers and restart their jobs. */
>  	for (q = 0; q < V3D_MAX_QUEUES; q++) {
> -		drm_sched_start(&v3d->queue[q].sched);
> +		drm_sched_start(&v3d->queue[q].sched, 0);
>  	}
>  
>  	mutex_unlock(&v3d->reset_lock);
> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> index fe8edb917360..a8d19b10f9b8 100644
> --- a/include/drm/gpu_scheduler.h
> +++ b/include/drm/gpu_scheduler.h
> @@ -579,7 +579,7 @@ bool drm_sched_wqueue_ready(struct drm_gpu_scheduler *sched);
>  void drm_sched_wqueue_stop(struct drm_gpu_scheduler *sched);
>  void drm_sched_wqueue_start(struct drm_gpu_scheduler *sched);
>  void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad);
> -void drm_sched_start(struct drm_gpu_scheduler *sched);
> +void drm_sched_start(struct drm_gpu_scheduler *sched, int errno);
>  void drm_sched_resubmit_jobs(struct drm_gpu_scheduler *sched);
>  void drm_sched_increase_karma(struct drm_sched_job *bad);
>  void drm_sched_reset_karma(struct drm_sched_job *bad);
> -- 
> 2.34.1
>
Christian König July 29, 2024, 6:37 p.m. UTC | #3
Am 26.07.24 um 14:30 schrieb Matthew Brost:
> On Fri, Jul 26, 2024 at 09:55:50AM +0200, Christian König wrote:
>> The current implementation of drm_sched_start uses a hardcoded
>> -ECANCELED to dispose of a job when the parent/hw fence is NULL.
>> This results in drm_sched_job_done being called with -ECANCELED for
>> each job with a NULL parent in the pending list, making it difficult
>> to distinguish between recovery methods, whether a queue reset or a
>> full GPU reset was used.
>>
>> To improve this, we first try a soft recovery for timeout jobs and
>> use the error code -ENODATA. If soft recovery fails, we proceed with
>> a queue reset, where the error code remains -ENODATA for the job.
>> Finally, for a full GPU reset, we use error codes -ECANCELED or
>> -ETIME. This patch adds an error code parameter to drm_sched_start,
>> allowing us to differentiate between queue reset and GPU reset
>> failures. This enables user mode and test applications to validate
>> the expected correctness of the requested operation. After a
>> successful queue reset, the only way to continue normal operation is
>> to call drm_sched_job_done with the specific error code -ENODATA.
>>
>> v1: Initial implementation by Jesse utilized amdgpu_device_lock_reset_domain
>>      and amdgpu_device_unlock_reset_domain to allow user mode to track
>>      the queue reset status and distinguish between queue reset and
>>      GPU reset.
>> v2: Christian suggested using the error codes -ENODATA for queue reset
>>      and -ECANCELED or -ETIME for GPU reset, returned to
>>      amdgpu_cs_wait_ioctl.
>> v3: To meet the requirements, we introduce a new function
>>      drm_sched_start_ex with an additional parameter to set
>>      dma_fence_set_error, allowing us to handle the specific error
>>      codes appropriately and dispose of bad jobs with the selected
>>      error code depending on whether it was a queue reset or GPU reset.
>> v4: Alex suggested using a new name, drm_sched_start_with_recovery_error,
>>      which more accurately describes the function's purpose.
>>      Additionally, it was recommended to add documentation details
>>      about the new method.
>> v5: Fixed declaration of new function drm_sched_start_with_recovery_error.(Alex)
>> v6 (chk): rebase on upstream changes, cleanup the commit message,
>>            drop the new function again and update all callers,
>>            apply the errno also to scheduler fences with hw fences
>>
> Seems reasonable to me, but all the callers pass an errno of zero to
> drm_sched_start. Is that going to change in a follow-up?

Yes, exactly that's the idea. Just wanted to double check if the 
approach makes sense.

Regards,
Christian.

>
> Anyway, LGTM, but I will leave the RB to a user of this interface.
> Acked-by: Matthew Brost <matthew.brost@intel.com>
>
>> Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
>> Signed-off-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
>> Signed-off-by: Christian König <christian.koenig@amd.com>
>> Cc: Alex Deucher <alexander.deucher@amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_arcturus.c | 2 +-
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c          | 4 ++--
>>   drivers/gpu/drm/etnaviv/etnaviv_sched.c             | 4 ++--
>>   drivers/gpu/drm/imagination/pvr_queue.c             | 4 ++--
>>   drivers/gpu/drm/lima/lima_sched.c                   | 2 +-
>>   drivers/gpu/drm/nouveau/nouveau_sched.c             | 2 +-
>>   drivers/gpu/drm/panfrost/panfrost_job.c             | 2 +-
>>   drivers/gpu/drm/panthor/panthor_mmu.c               | 2 +-
>>   drivers/gpu/drm/scheduler/sched_main.c              | 7 ++++---
>>   drivers/gpu/drm/v3d/v3d_sched.c                     | 2 +-
>>   include/drm/gpu_scheduler.h                         | 2 +-
>>   11 files changed, 17 insertions(+), 16 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_arcturus.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_arcturus.c
>> index 2320df51c914..18135d8235f9 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_arcturus.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_arcturus.c
>> @@ -300,7 +300,7 @@ static int suspend_resume_compute_scheduler(struct amdgpu_device *adev, bool sus
>>   			if (r)
>>   				goto out;
>>   		} else {
>> -			drm_sched_start(&ring->sched);
>> +			drm_sched_start(&ring->sched, 0);
>>   		}
>>   	}
>>   
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index c186fdb198ad..861827deb03f 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -5862,7 +5862,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>   			if (!amdgpu_ring_sched_ready(ring))
>>   				continue;
>>   
>> -			drm_sched_start(&ring->sched);
>> +			drm_sched_start(&ring->sched, 0);
>>   		}
>>   
>>   		if (!drm_drv_uses_atomic_modeset(adev_to_drm(tmp_adev)) && !job_signaled)
>> @@ -6360,7 +6360,7 @@ void amdgpu_pci_resume(struct pci_dev *pdev)
>>   		if (!amdgpu_ring_sched_ready(ring))
>>   			continue;
>>   
>> -		drm_sched_start(&ring->sched);
>> +		drm_sched_start(&ring->sched, 0);
>>   	}
>>   
>>   	amdgpu_device_unset_mp1_state(adev);
>> diff --git a/drivers/gpu/drm/etnaviv/etnaviv_sched.c b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
>> index c53641aa146f..2c8666f8ec4a 100644
>> --- a/drivers/gpu/drm/etnaviv/etnaviv_sched.c
>> +++ b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
>> @@ -72,12 +72,12 @@ static enum drm_gpu_sched_stat etnaviv_sched_timedout_job(struct drm_sched_job
>>   
>>   	drm_sched_resubmit_jobs(&gpu->sched);
>>   
>> -	drm_sched_start(&gpu->sched);
>> +	drm_sched_start(&gpu->sched, 0);
>>   	return DRM_GPU_SCHED_STAT_NOMINAL;
>>   
>>   out_no_timeout:
>>   	/* restart scheduler after GPU is usable again */
>> -	drm_sched_start(&gpu->sched);
>> +	drm_sched_start(&gpu->sched, 0);
>>   	return DRM_GPU_SCHED_STAT_NOMINAL;
>>   }
>>   
>> diff --git a/drivers/gpu/drm/imagination/pvr_queue.c b/drivers/gpu/drm/imagination/pvr_queue.c
>> index 20cb46012082..c4f08432882b 100644
>> --- a/drivers/gpu/drm/imagination/pvr_queue.c
>> +++ b/drivers/gpu/drm/imagination/pvr_queue.c
>> @@ -782,7 +782,7 @@ static void pvr_queue_start(struct pvr_queue *queue)
>>   		}
>>   	}
>>   
>> -	drm_sched_start(&queue->scheduler);
>> +	drm_sched_start(&queue->scheduler, 0);
>>   }
>>   
>>   /**
>> @@ -842,7 +842,7 @@ pvr_queue_timedout_job(struct drm_sched_job *s_job)
>>   	}
>>   	mutex_unlock(&pvr_dev->queues.lock);
>>   
>> -	drm_sched_start(sched);
>> +	drm_sched_start(sched, 0);
>>   
>>   	return DRM_GPU_SCHED_STAT_NOMINAL;
>>   }
>> diff --git a/drivers/gpu/drm/lima/lima_sched.c b/drivers/gpu/drm/lima/lima_sched.c
>> index 1a944edb6ddc..b40c90e97d7e 100644
>> --- a/drivers/gpu/drm/lima/lima_sched.c
>> +++ b/drivers/gpu/drm/lima/lima_sched.c
>> @@ -463,7 +463,7 @@ static enum drm_gpu_sched_stat lima_sched_timedout_job(struct drm_sched_job *job
>>   	lima_pm_idle(ldev);
>>   
>>   	drm_sched_resubmit_jobs(&pipe->base);
>> -	drm_sched_start(&pipe->base);
>> +	drm_sched_start(&pipe->base, 0);
>>   
>>   	return DRM_GPU_SCHED_STAT_NOMINAL;
>>   }
>> diff --git a/drivers/gpu/drm/nouveau/nouveau_sched.c b/drivers/gpu/drm/nouveau/nouveau_sched.c
>> index eb6c3f9a01f5..4412f2711fb5 100644
>> --- a/drivers/gpu/drm/nouveau/nouveau_sched.c
>> +++ b/drivers/gpu/drm/nouveau/nouveau_sched.c
>> @@ -379,7 +379,7 @@ nouveau_sched_timedout_job(struct drm_sched_job *sched_job)
>>   	else
>>   		NV_PRINTK(warn, job->cli, "Generic job timeout.\n");
>>   
>> -	drm_sched_start(sched);
>> +	drm_sched_start(sched, 0);
>>   
>>   	return stat;
>>   }
>> diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c
>> index df49d37d0e7e..d140800606bf 100644
>> --- a/drivers/gpu/drm/panfrost/panfrost_job.c
>> +++ b/drivers/gpu/drm/panfrost/panfrost_job.c
>> @@ -727,7 +727,7 @@ panfrost_reset(struct panfrost_device *pfdev,
>>   
>>   	/* Restart the schedulers */
>>   	for (i = 0; i < NUM_JOB_SLOTS; i++)
>> -		drm_sched_start(&pfdev->js->queue[i].sched);
>> +		drm_sched_start(&pfdev->js->queue[i].sched, 0);
>>   
>>   	/* Re-enable job interrupts now that everything has been restarted. */
>>   	job_write(pfdev, JOB_INT_MASK,
>> diff --git a/drivers/gpu/drm/panthor/panthor_mmu.c b/drivers/gpu/drm/panthor/panthor_mmu.c
>> index d47972806d50..e630cdf47f99 100644
>> --- a/drivers/gpu/drm/panthor/panthor_mmu.c
>> +++ b/drivers/gpu/drm/panthor/panthor_mmu.c
>> @@ -827,7 +827,7 @@ static void panthor_vm_stop(struct panthor_vm *vm)
>>   
>>   static void panthor_vm_start(struct panthor_vm *vm)
>>   {
>> -	drm_sched_start(&vm->sched);
>> +	drm_sched_start(&vm->sched, 0);
>>   }
>>   
>>   /**
>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>> index ab53ab486fe6..f093616fe53c 100644
>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>> @@ -674,9 +674,10 @@ EXPORT_SYMBOL(drm_sched_stop);
>>    * drm_sched_start - recover jobs after a reset
>>    *
>>    * @sched: scheduler instance
>> + * @errno: error to set on the pending fences
>>    *
>>    */
>> -void drm_sched_start(struct drm_gpu_scheduler *sched)
>> +void drm_sched_start(struct drm_gpu_scheduler *sched, int errno)
>>   {
>>   	struct drm_sched_job *s_job, *tmp;
>>   
>> @@ -691,13 +692,13 @@ void drm_sched_start(struct drm_gpu_scheduler *sched)
>>   		atomic_add(s_job->credits, &sched->credit_count);
>>   
>>   		if (!fence) {
>> -			drm_sched_job_done(s_job, -ECANCELED);
>> +			drm_sched_job_done(s_job, errno ?: -ECANCELED);
>>   			continue;
>>   		}
>>   
>>   		if (dma_fence_add_callback(fence, &s_job->cb,
>>   					   drm_sched_job_done_cb))
>> -			drm_sched_job_done(s_job, fence->error);
>> +			drm_sched_job_done(s_job, fence->error ?: errno);
>>   	}
>>   
>>   	drm_sched_start_timeout_unlocked(sched);
>> diff --git a/drivers/gpu/drm/v3d/v3d_sched.c b/drivers/gpu/drm/v3d/v3d_sched.c
>> index 42d4f4a2dba2..cac02284cd19 100644
>> --- a/drivers/gpu/drm/v3d/v3d_sched.c
>> +++ b/drivers/gpu/drm/v3d/v3d_sched.c
>> @@ -653,7 +653,7 @@ v3d_gpu_reset_for_timeout(struct v3d_dev *v3d, struct drm_sched_job *sched_job)
>>   
>>   	/* Unblock schedulers and restart their jobs. */
>>   	for (q = 0; q < V3D_MAX_QUEUES; q++) {
>> -		drm_sched_start(&v3d->queue[q].sched);
>> +		drm_sched_start(&v3d->queue[q].sched, 0);
>>   	}
>>   
>>   	mutex_unlock(&v3d->reset_lock);
>> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
>> index fe8edb917360..a8d19b10f9b8 100644
>> --- a/include/drm/gpu_scheduler.h
>> +++ b/include/drm/gpu_scheduler.h
>> @@ -579,7 +579,7 @@ bool drm_sched_wqueue_ready(struct drm_gpu_scheduler *sched);
>>   void drm_sched_wqueue_stop(struct drm_gpu_scheduler *sched);
>>   void drm_sched_wqueue_start(struct drm_gpu_scheduler *sched);
>>   void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad);
>> -void drm_sched_start(struct drm_gpu_scheduler *sched);
>> +void drm_sched_start(struct drm_gpu_scheduler *sched, int errno);
>>   void drm_sched_resubmit_jobs(struct drm_gpu_scheduler *sched);
>>   void drm_sched_increase_karma(struct drm_sched_job *bad);
>>   void drm_sched_reset_karma(struct drm_sched_job *bad);
>> -- 
>> 2.34.1
>>
Christian König July 29, 2024, 6:43 p.m. UTC | #4
Am 26.07.24 um 16:21 schrieb Daniel Vetter:
> On Fri, Jul 26, 2024 at 09:55:50AM +0200, Christian König wrote:
>> The current implementation of drm_sched_start uses a hardcoded
>> -ECANCELED to dispose of a job when the parent/hw fence is NULL.
>> This results in drm_sched_job_done being called with -ECANCELED for
>> each job with a NULL parent in the pending list, making it difficult
>> to distinguish between recovery methods, whether a queue reset or a
>> full GPU reset was used.
>>
>> To improve this, we first try a soft recovery for timeout jobs and
>> use the error code -ENODATA. If soft recovery fails, we proceed with
>> a queue reset, where the error code remains -ENODATA for the job.
>> Finally, for a full GPU reset, we use error codes -ECANCELED or
>> -ETIME. This patch adds an error code parameter to drm_sched_start,
>> allowing us to differentiate between queue reset and GPU reset
>> failures. This enables user mode and test applications to validate
>> the expected correctness of the requested operation. After a
>> successful queue reset, the only way to continue normal operation is
>> to call drm_sched_job_done with the specific error code -ENODATA.
>>
>> v1: Initial implementation by Jesse utilized amdgpu_device_lock_reset_domain
>>      and amdgpu_device_unlock_reset_domain to allow user mode to track
>>      the queue reset status and distinguish between queue reset and
>>      GPU reset.
>> v2: Christian suggested using the error codes -ENODATA for queue reset
>>      and -ECANCELED or -ETIME for GPU reset, returned to
>>      amdgpu_cs_wait_ioctl.
>> v3: To meet the requirements, we introduce a new function
>>      drm_sched_start_ex with an additional parameter to set
>>      dma_fence_set_error, allowing us to handle the specific error
>>      codes appropriately and dispose of bad jobs with the selected
>>      error code depending on whether it was a queue reset or GPU reset.
>> v4: Alex suggested using a new name, drm_sched_start_with_recovery_error,
>>      which more accurately describes the function's purpose.
>>      Additionally, it was recommended to add documentation details
>>      about the new method.
>> v5: Fixed declaration of new function drm_sched_start_with_recovery_error.(Alex)
>> v6 (chk): rebase on upstream changes, cleanup the commit message,
>>            drop the new function again and update all callers,
>>            apply the errno also to scheduler fences with hw fences
>>
>> Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
>> Signed-off-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
>> Signed-off-by: Christian König <christian.koenig@amd.com>
>> Cc: Alex Deucher <alexander.deucher@amd.com>
> Maybe I'm extremely missing the point, but it's kind of hard to be sure
> without the testcase/mesa side code too, but for gl robustness I don't
> think this is enough, because you also need to know whether it was your
> context or someone else's that caused the gpu reset. Probably biased, but I
> think the per-ctx guilty/reset counters are more the right code here. Or
> something along those lines.

Exactly that ctx-based approach blew up pretty nicely because it doesn't 
match the lifetime of the ctx.

On the one hand you don't want the ctx to outlive the file descriptor 
which it was created with, since it points back to the fd; on the other 
hand, when you need it for error handling you need to keep it around 
until all submissions are completed.

In the end you have a really nice circular dependency.

> If we really want to stuff this into per-job fences then I think we should
> at least try to document this mess in the sync_file uapi, for a bit of
> consistency.

Good point. Going to add some documentation.

Regards,
Christian.

>
> But yeah without the full picture no idea really what we want here.
> -Sima
>
Daniel Vetter July 30, 2024, 8:36 a.m. UTC | #5
On Mon, Jul 29, 2024 at 08:43:05PM +0200, Christian König wrote:
> Am 26.07.24 um 16:21 schrieb Daniel Vetter:
> > Maybe I'm extremely missing the point, but it's kind of hard to be sure
> > without the testcase/mesa side code too, but for gl robustness I don't
> > think this is enough, because you also need to know whether it was your
> > context or someone else's that caused the gpu reset. Probably biased, but I
> > think the per-ctx guilty/reset counters are more the right code here. Or
> > something along those lines.
> 
> Exactly that ctx-based approach blew up pretty nicely because it doesn't
> match the lifetime of the ctx.
> 
> On the one hand you don't want the ctx to outlive the file descriptor which
> it was created with, since it points back to the fd; on the other hand, when
> you need it for error handling you need to keep it around until all
> submissions are completed.

Why does the ctx need to point back to the fd? At least with the reset
stats query approach you only ever go from fd to ctx, not the other way
around. Going from ctx to fd is indeed all kinds of enormous fun and
really not great.

I guess the "jobs keep ctx alive" is the age-old ctx refcounting fun, and
there are leaks involved if they outlive the fd in bad ways ... :-/

> In the end you have a really nice circular dependency.

Maybe as a follow-up: for ARB robustness or a Vulkan context, where we want
the context to die and refuse to accept any more jobs, can we get at that
error somehow? I think that's really the only worry I have with a job-error
approach for all this ...

> > If we really want to stuff this into per-job fences then I think we should
> > at least try to document this mess in the sync_file uapi, for a bit of
> > consistency.
> 
> Good point. Going to add some documentation.

Sounds good.

Cheers, Sima

> 
> Regards,
> Christian.
> 
> > 
> > But yeah without the full picture no idea really what we want here.
> > -Sima
> > 
Christian König July 30, 2024, 12:06 p.m. UTC | #6
Am 30.07.24 um 10:36 schrieb Daniel Vetter:
>> In the end you have a really nice circular dependency.
> Maybe as a follow-up: for ARB robustness or a Vulkan context, where we want
> the context to die and refuse to accept any more jobs, can we get at that
> error somehow? I think that's really the only worry I have with a job-error
> approach for all this ...

See drm_sched_entity_error(). The idea is that the driver uses this 
function in two ways:

1. In its prepare callback, so that when one submission fails all 
following ones from the same ctx are marked with an error number as well.

This is intentionally done in a driver callback so that the driver decides 
whether it wants subsequent submissions to fail or not. That can be helpful, 
for example, for in-kernel paging queues where submissions don't depend 
on each other and a failed submission shouldn't cancel all following ones.

For an example see amdgpu_job_prepare_job().

2. In its submission IOCTL, to reject new submissions and inform 
userspace that it needs to kick off some error handling.

Cheers,
Christian.

>
>>> If we really want to stuff this into per-job fences then I think we should
>>> at least try to document this mess in the sync_file uapi, for a bit of
>>> consistency.
>> Good point. Going to add some documentation.
> Sounds good.
>
> Cheers, Sima
>
>> Regards,
>> Christian.
>>
>>> But yeah without the full picture no idea really what we want here.
>>> -Sima
>>>
Daniel Vetter July 31, 2024, 8:43 p.m. UTC | #7
On Tue, Jul 30, 2024 at 02:06:08PM +0200, Christian König wrote:
> Am 30.07.24 um 10:36 schrieb Daniel Vetter:
> > > In the end you have a really nice circular dependency.
> > Maybe a follow up, so for arb robustness or vk context where we want the
> > context to die and refuse to accept any more jobs: We can get at that
> > error somehow? I think that's really the only worry I have with a job
> > error approach for all this ...
> 
> See drm_sched_entity_error(). The idea is that the driver uses this function
> in two ways:

Ah that's the other piece I missed ...

> 1. In its prepare callback, so that when one submission fails all following
> ones from the same ctx are marked with an error number as well.
> 
> This is intentionally done in a driver callback so that the driver decides
> whether subsequent submissions should fail or not. That can be helpful, for
> example, for in-kernel paging queues where submissions don't depend on each
> other and a failed submission shouldn't cancel all following ones.
> 
> For an example see amdgpu_job_prepare_job().
> 
> 2. In its submission IOCTL, to reject new submissions and inform userspace
> that it needs to kick off some error handling.

Would be good to add that to the docs. I think just one sentence in the
drm_sched_start kerneldoc saying that one should fish out the errno with
drm_sched_entity_error() would have avoided my confusion.

Plus I think your above text would make a good addition to the kerneldoc
for drm_sched_entity_error() itself.

Cheers, Sima

> 
> Cheers,
> Christian.
> 
> > 
> > > > If we really want to stuff this into per-job fences then I think we should
> > > > at least try to document this mess in the sync_file uapi, for a bit of
> > > > consistency.
> > > Good point. Going to add some documentation.
> > Sounds good.
> > 
> > Cheers, Sima
> > 
> > > Regards,
> > > Christian.
> > > 
> > > > But yeah without the full picture no idea really what we want here.
> > > > -Sima
> > > > 

Patch

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_arcturus.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_arcturus.c
index 2320df51c914..18135d8235f9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_arcturus.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_arcturus.c
@@ -300,7 +300,7 @@  static int suspend_resume_compute_scheduler(struct amdgpu_device *adev, bool sus
 			if (r)
 				goto out;
 		} else {
-			drm_sched_start(&ring->sched);
+			drm_sched_start(&ring->sched, 0);
 		}
 	}
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index c186fdb198ad..861827deb03f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5862,7 +5862,7 @@  int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 			if (!amdgpu_ring_sched_ready(ring))
 				continue;
 
-			drm_sched_start(&ring->sched);
+			drm_sched_start(&ring->sched, 0);
 		}
 
 		if (!drm_drv_uses_atomic_modeset(adev_to_drm(tmp_adev)) && !job_signaled)
@@ -6360,7 +6360,7 @@  void amdgpu_pci_resume(struct pci_dev *pdev)
 		if (!amdgpu_ring_sched_ready(ring))
 			continue;
 
-		drm_sched_start(&ring->sched);
+		drm_sched_start(&ring->sched, 0);
 	}
 
 	amdgpu_device_unset_mp1_state(adev);
diff --git a/drivers/gpu/drm/etnaviv/etnaviv_sched.c b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
index c53641aa146f..2c8666f8ec4a 100644
--- a/drivers/gpu/drm/etnaviv/etnaviv_sched.c
+++ b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
@@ -72,12 +72,12 @@  static enum drm_gpu_sched_stat etnaviv_sched_timedout_job(struct drm_sched_job
 
 	drm_sched_resubmit_jobs(&gpu->sched);
 
-	drm_sched_start(&gpu->sched);
+	drm_sched_start(&gpu->sched, 0);
 	return DRM_GPU_SCHED_STAT_NOMINAL;
 
 out_no_timeout:
 	/* restart scheduler after GPU is usable again */
-	drm_sched_start(&gpu->sched);
+	drm_sched_start(&gpu->sched, 0);
 	return DRM_GPU_SCHED_STAT_NOMINAL;
 }
 
diff --git a/drivers/gpu/drm/imagination/pvr_queue.c b/drivers/gpu/drm/imagination/pvr_queue.c
index 20cb46012082..c4f08432882b 100644
--- a/drivers/gpu/drm/imagination/pvr_queue.c
+++ b/drivers/gpu/drm/imagination/pvr_queue.c
@@ -782,7 +782,7 @@  static void pvr_queue_start(struct pvr_queue *queue)
 		}
 	}
 
-	drm_sched_start(&queue->scheduler);
+	drm_sched_start(&queue->scheduler, 0);
 }
 
 /**
@@ -842,7 +842,7 @@  pvr_queue_timedout_job(struct drm_sched_job *s_job)
 	}
 	mutex_unlock(&pvr_dev->queues.lock);
 
-	drm_sched_start(sched);
+	drm_sched_start(sched, 0);
 
 	return DRM_GPU_SCHED_STAT_NOMINAL;
 }
diff --git a/drivers/gpu/drm/lima/lima_sched.c b/drivers/gpu/drm/lima/lima_sched.c
index 1a944edb6ddc..b40c90e97d7e 100644
--- a/drivers/gpu/drm/lima/lima_sched.c
+++ b/drivers/gpu/drm/lima/lima_sched.c
@@ -463,7 +463,7 @@  static enum drm_gpu_sched_stat lima_sched_timedout_job(struct drm_sched_job *job
 	lima_pm_idle(ldev);
 
 	drm_sched_resubmit_jobs(&pipe->base);
-	drm_sched_start(&pipe->base);
+	drm_sched_start(&pipe->base, 0);
 
 	return DRM_GPU_SCHED_STAT_NOMINAL;
 }
diff --git a/drivers/gpu/drm/nouveau/nouveau_sched.c b/drivers/gpu/drm/nouveau/nouveau_sched.c
index eb6c3f9a01f5..4412f2711fb5 100644
--- a/drivers/gpu/drm/nouveau/nouveau_sched.c
+++ b/drivers/gpu/drm/nouveau/nouveau_sched.c
@@ -379,7 +379,7 @@  nouveau_sched_timedout_job(struct drm_sched_job *sched_job)
 	else
 		NV_PRINTK(warn, job->cli, "Generic job timeout.\n");
 
-	drm_sched_start(sched);
+	drm_sched_start(sched, 0);
 
 	return stat;
 }
diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c
index df49d37d0e7e..d140800606bf 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -727,7 +727,7 @@  panfrost_reset(struct panfrost_device *pfdev,
 
 	/* Restart the schedulers */
 	for (i = 0; i < NUM_JOB_SLOTS; i++)
-		drm_sched_start(&pfdev->js->queue[i].sched);
+		drm_sched_start(&pfdev->js->queue[i].sched, 0);
 
 	/* Re-enable job interrupts now that everything has been restarted. */
 	job_write(pfdev, JOB_INT_MASK,
diff --git a/drivers/gpu/drm/panthor/panthor_mmu.c b/drivers/gpu/drm/panthor/panthor_mmu.c
index d47972806d50..e630cdf47f99 100644
--- a/drivers/gpu/drm/panthor/panthor_mmu.c
+++ b/drivers/gpu/drm/panthor/panthor_mmu.c
@@ -827,7 +827,7 @@  static void panthor_vm_stop(struct panthor_vm *vm)
 
 static void panthor_vm_start(struct panthor_vm *vm)
 {
-	drm_sched_start(&vm->sched);
+	drm_sched_start(&vm->sched, 0);
 }
 
 /**
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index ab53ab486fe6..f093616fe53c 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -674,9 +674,10 @@  EXPORT_SYMBOL(drm_sched_stop);
  * drm_sched_start - recover jobs after a reset
  *
  * @sched: scheduler instance
+ * @errno: error to set on the pending fences
  *
  */
-void drm_sched_start(struct drm_gpu_scheduler *sched)
+void drm_sched_start(struct drm_gpu_scheduler *sched, int errno)
 {
 	struct drm_sched_job *s_job, *tmp;
 
@@ -691,13 +692,13 @@  void drm_sched_start(struct drm_gpu_scheduler *sched)
 		atomic_add(s_job->credits, &sched->credit_count);
 
 		if (!fence) {
-			drm_sched_job_done(s_job, -ECANCELED);
+			drm_sched_job_done(s_job, errno ?: -ECANCELED);
 			continue;
 		}
 
 		if (dma_fence_add_callback(fence, &s_job->cb,
 					   drm_sched_job_done_cb))
-			drm_sched_job_done(s_job, fence->error);
+			drm_sched_job_done(s_job, fence->error ?: errno);
 	}
 
 	drm_sched_start_timeout_unlocked(sched);
diff --git a/drivers/gpu/drm/v3d/v3d_sched.c b/drivers/gpu/drm/v3d/v3d_sched.c
index 42d4f4a2dba2..cac02284cd19 100644
--- a/drivers/gpu/drm/v3d/v3d_sched.c
+++ b/drivers/gpu/drm/v3d/v3d_sched.c
@@ -653,7 +653,7 @@  v3d_gpu_reset_for_timeout(struct v3d_dev *v3d, struct drm_sched_job *sched_job)
 
 	/* Unblock schedulers and restart their jobs. */
 	for (q = 0; q < V3D_MAX_QUEUES; q++) {
-		drm_sched_start(&v3d->queue[q].sched);
+		drm_sched_start(&v3d->queue[q].sched, 0);
 	}
 
 	mutex_unlock(&v3d->reset_lock);
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index fe8edb917360..a8d19b10f9b8 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -579,7 +579,7 @@  bool drm_sched_wqueue_ready(struct drm_gpu_scheduler *sched);
 void drm_sched_wqueue_stop(struct drm_gpu_scheduler *sched);
 void drm_sched_wqueue_start(struct drm_gpu_scheduler *sched);
 void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad);
-void drm_sched_start(struct drm_gpu_scheduler *sched);
+void drm_sched_start(struct drm_gpu_scheduler *sched, int errno);
 void drm_sched_resubmit_jobs(struct drm_gpu_scheduler *sched);
 void drm_sched_increase_karma(struct drm_sched_job *bad);
 void drm_sched_reset_karma(struct drm_sched_job *bad);