
[drm-misc-next,v4] drm/sched: implement dynamic job-flow control

Message ID: 20231031002655.38707-1-dakr@redhat.com
State: New, archived

Commit Message

Danilo Krummrich Oct. 31, 2023, 12:26 a.m. UTC
Currently, job flow control is implemented simply by limiting the number
of jobs in flight. Therefore, a scheduler is initialized with a credit
limit that corresponds to the number of jobs which can be sent to the
hardware.

This implies that for each job, drivers need to account for the maximum
job size possible in order to not overflow the ring buffer.

However, there are drivers, such as Nouveau, where the job size has a
rather large range. For such drivers, even job submissions filling less
than 1% of the ring can block subsequent submissions, which, in the
worst case, can lead to the ring running dry.

In order to overcome this issue, allow for tracking the actual job size
instead of the number of jobs. Therefore, add a field to track a job's
credit count, which represents the number of credits a job contributes
to the scheduler's credit limit.
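
As an illustration of the resulting interface (a hypothetical sketch, not
part of this patch; the my_* names and MY_RING_DWORDS are invented, only
the drm_sched_init() and drm_sched_job_init() signatures are the ones
changed here), a driver measuring jobs in ring dwords could initialize
the scheduler with the ring size as its credit limit and size each job
individually:

	/* Hypothetical driver code, for illustration only. */

	/* credit_limit: the ring holds MY_RING_DWORDS dwords in total. */
	ret = drm_sched_init(&my->sched, &my_sched_ops,
			     DRM_SCHED_PRIORITY_COUNT, MY_RING_DWORDS,
			     0, msecs_to_jiffies(500), NULL, NULL,
			     "my-ring", my->dev);

	/* credits: the dwords this particular job occupies (must be > 0). */
	ret = drm_sched_job_init(&job->base, entity,
				 my_job_ring_dwords(job), NULL);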

Signed-off-by: Danilo Krummrich <dakr@redhat.com>
---
Changes in V2:
==============
  - fixed up influence on scheduling fairness due to consideration of a job's
    size
    - If we reach a ready entity in drm_sched_select_entity() but can't actually
      queue a job from it due to size limitations, just give up and go to sleep
      until woken up due to a pending job finishing, rather than continue to try
      other entities.
  - added a callback to dynamically update a job's credits (Boris)
  - renamed 'units' to 'credits'
  - fixed commit message and comments as requested by Luben

Changes in V3:
==============
  - rebased onto V7 of the "DRM scheduler changes for Xe" series by Matt
  - move up drm_sched_can_queue() instead of adding a forward declaration
    (Boris)
  - add a drm_sched_available_credits() helper (Boris)
  - adjust control flow in drm_sched_rq_select_entity_fifo() to Luben's proposal
  - re-phrase a few comments and fix a typo (Luben)
  - change naming of all structures credit fields and function parameters to the
    following scheme
    - drm_sched_job::credits
    - drm_gpu_scheduler::credit_limit
    - drm_gpu_scheduler::credit_count
    - drm_sched_init(..., u32 credit_limit, ...)
    - drm_sched_job_init(..., u32 credits, ...)
  - add proper documentation for the scheduler's job-flow control mechanism

Changes in V4:
==============
  - address Luben's comments regarding documentation
  - switch to drm_WARN() variants
  - WARN_ON() drivers passing in zero job credits for both drm_sched_job_init()
    and the update_job_credits() callback
  - don't retry with another runq if job doesn't fit on the ring to prevent
    priority inversion
  - rebase onto drm-misc-next (will probably land before Matt's series)

Patch also available in [1].

[1] https://gitlab.freedesktop.org/nouvelles/kernel/-/commits/sched-credits/
---
 Documentation/gpu/drm-mm.rst                  |   6 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c       |   2 +-
 drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c  |   2 +-
 drivers/gpu/drm/lima/lima_sched.c             |   2 +-
 drivers/gpu/drm/msm/msm_gem_submit.c          |   2 +-
 drivers/gpu/drm/nouveau/nouveau_sched.c       |   2 +-
 drivers/gpu/drm/panfrost/panfrost_drv.c       |   2 +-
 .../gpu/drm/scheduler/gpu_scheduler_trace.h   |   2 +-
 drivers/gpu/drm/scheduler/sched_entity.c      |   4 +-
 drivers/gpu/drm/scheduler/sched_main.c        | 148 ++++++++++++++----
 drivers/gpu/drm/v3d/v3d_gem.c                 |   2 +-
 include/drm/gpu_scheduler.h                   |  31 +++-
 12 files changed, 156 insertions(+), 49 deletions(-)


base-commit: b2139fb5051554a7f297e4ad584ef1bc26c76d5d

Comments

kernel test robot Oct. 31, 2023, 7:14 a.m. UTC | #1
Hi Danilo,

kernel test robot noticed the following build errors:

[auto build test ERROR on b2139fb5051554a7f297e4ad584ef1bc26c76d5d]

url:    https://github.com/intel-lab-lkp/linux/commits/Danilo-Krummrich/drm-sched-implement-dynamic-job-flow-control/20231031-082915
base:   b2139fb5051554a7f297e4ad584ef1bc26c76d5d
patch link:    https://lore.kernel.org/r/20231031002655.38707-1-dakr%40redhat.com
patch subject: [PATCH drm-misc-next v4] drm/sched: implement dynamic job-flow control
config: arc-randconfig-002-20231031 (https://download.01.org/0day-ci/archive/20231031/202310311534.I8WQWOK0-lkp@intel.com/config)
compiler: arceb-elf-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231031/202310311534.I8WQWOK0-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202310311534.I8WQWOK0-lkp@intel.com/

All errors (new ones prefixed by >>):

   drivers/gpu/drm/panfrost/panfrost_job.c: In function 'panfrost_job_is_idle':
>> drivers/gpu/drm/panfrost/panfrost_job.c:966:52: error: 'struct drm_gpu_scheduler' has no member named 'hw_rq_count'
     966 |                 if (atomic_read(&js->queue[i].sched.hw_rq_count))
         |                                                    ^


vim +966 drivers/gpu/drm/panfrost/panfrost_job.c

f3ba91228e8e91 Rob Herring 2018-09-10  958  
f3ba91228e8e91 Rob Herring 2018-09-10  959  int panfrost_job_is_idle(struct panfrost_device *pfdev)
f3ba91228e8e91 Rob Herring 2018-09-10  960  {
f3ba91228e8e91 Rob Herring 2018-09-10  961  	struct panfrost_job_slot *js = pfdev->js;
f3ba91228e8e91 Rob Herring 2018-09-10  962  	int i;
f3ba91228e8e91 Rob Herring 2018-09-10  963  
f3ba91228e8e91 Rob Herring 2018-09-10  964  	for (i = 0; i < NUM_JOB_SLOTS; i++) {
f3ba91228e8e91 Rob Herring 2018-09-10  965  		/* If there are any jobs in the HW queue, we're not idle */
f3ba91228e8e91 Rob Herring 2018-09-10 @966  		if (atomic_read(&js->queue[i].sched.hw_rq_count))
kernel test robot Oct. 31, 2023, 8:51 a.m. UTC | #2
Hi Danilo,

kernel test robot noticed the following build errors:

[auto build test ERROR on b2139fb5051554a7f297e4ad584ef1bc26c76d5d]

url:    https://github.com/intel-lab-lkp/linux/commits/Danilo-Krummrich/drm-sched-implement-dynamic-job-flow-control/20231031-082915
base:   b2139fb5051554a7f297e4ad584ef1bc26c76d5d
patch link:    https://lore.kernel.org/r/20231031002655.38707-1-dakr%40redhat.com
patch subject: [PATCH drm-misc-next v4] drm/sched: implement dynamic job-flow control
config: loongarch-randconfig-002-20231031 (https://download.01.org/0day-ci/archive/20231031/202310311632.RpEHvHMk-lkp@intel.com/config)
compiler: loongarch64-linux-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231031/202310311632.RpEHvHMk-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202310311632.RpEHvHMk-lkp@intel.com/

All errors (new ones prefixed by >>):

   drivers/gpu/drm/etnaviv/etnaviv_gpu.c: In function 'etnaviv_gpu_rpm_suspend':
>> drivers/gpu/drm/etnaviv/etnaviv_gpu.c:1920:36: error: 'struct drm_gpu_scheduler' has no member named 'hw_rq_count'
    1920 |         if (atomic_read(&gpu->sched.hw_rq_count))
         |                                    ^


vim +1920 drivers/gpu/drm/etnaviv/etnaviv_gpu.c

a8c21a5451d831 The etnaviv authors 2015-12-03  1913  
a8c21a5451d831 The etnaviv authors 2015-12-03  1914  static int etnaviv_gpu_rpm_suspend(struct device *dev)
a8c21a5451d831 The etnaviv authors 2015-12-03  1915  {
a8c21a5451d831 The etnaviv authors 2015-12-03  1916  	struct etnaviv_gpu *gpu = dev_get_drvdata(dev);
a8c21a5451d831 The etnaviv authors 2015-12-03  1917  	u32 idle, mask;
a8c21a5451d831 The etnaviv authors 2015-12-03  1918  
f4163814813fb3 Lucas Stach         2018-11-05  1919  	/* If there are any jobs in the HW queue, we're not idle */
f4163814813fb3 Lucas Stach         2018-11-05 @1920  	if (atomic_read(&gpu->sched.hw_rq_count))
a8c21a5451d831 The etnaviv authors 2015-12-03  1921  		return -EBUSY;
a8c21a5451d831 The etnaviv authors 2015-12-03  1922  
1a910c11d35bfa Guido Günther       2020-03-02  1923  	/* Check whether the hardware (except FE and MC) is idle */
1a910c11d35bfa Guido Günther       2020-03-02  1924  	mask = gpu->idle_mask & ~(VIVS_HI_IDLE_STATE_FE |
1a910c11d35bfa Guido Günther       2020-03-02  1925  				  VIVS_HI_IDLE_STATE_MC);
a8c21a5451d831 The etnaviv authors 2015-12-03  1926  	idle = gpu_read(gpu, VIVS_HI_IDLE_STATE) & mask;
78f2bfa3181cd7 Guido Günther       2020-03-02  1927  	if (idle != mask) {
78f2bfa3181cd7 Guido Günther       2020-03-02  1928  		dev_warn_ratelimited(dev, "GPU not yet idle, mask: 0x%08x\n",
78f2bfa3181cd7 Guido Günther       2020-03-02  1929  				     idle);
a8c21a5451d831 The etnaviv authors 2015-12-03  1930  		return -EBUSY;
78f2bfa3181cd7 Guido Günther       2020-03-02  1931  	}
a8c21a5451d831 The etnaviv authors 2015-12-03  1932  
7cb544946a138d Lucas Stach         2023-06-07  1933  	etnaviv_gpu_hw_suspend(gpu);
7cb544946a138d Lucas Stach         2023-06-07  1934  
647d817d807127 Lucas Stach         2023-06-07  1935  	gpu->state = ETNA_GPU_STATE_IDENTIFIED;
647d817d807127 Lucas Stach         2023-06-07  1936  
7cb544946a138d Lucas Stach         2023-06-07  1937  	return etnaviv_gpu_clk_disable(gpu);
a8c21a5451d831 The etnaviv authors 2015-12-03  1938  }
a8c21a5451d831 The etnaviv authors 2015-12-03  1939
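
Both robot reports point at call sites that still read the removed
hw_rq_count field. Given the rename in this patch, the fix is presumably
just the matching conversion (a sketch, not taken from the patch):

	/* If there are any jobs in the HW queue, we're not idle */
	if (atomic_read(&gpu->sched.credit_count))
		return -EBUSY;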
Christian König Oct. 31, 2023, 11:13 a.m. UTC | #3
On 10/31/23 01:26, Danilo Krummrich wrote:
> [...]
>
> diff --git a/Documentation/gpu/drm-mm.rst b/Documentation/gpu/drm-mm.rst
> index 602010cb6894..acc5901ac840 100644
> --- a/Documentation/gpu/drm-mm.rst
> +++ b/Documentation/gpu/drm-mm.rst
> @@ -552,6 +552,12 @@ Overview
>   .. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
>      :doc: Overview
>   
> +Flow Control
> +------------
> +
> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> +   :doc: Flow Control
> +
>   Scheduler Function References
>   -----------------------------
>   
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> index 1f357198533f..62bb7fc7448a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> @@ -115,7 +115,7 @@ int amdgpu_job_alloc(struct amdgpu_device *adev, struct amdgpu_vm *vm,
>   	if (!entity)
>   		return 0;
>   
> -	return drm_sched_job_init(&(*job)->base, entity, owner);
> +	return drm_sched_job_init(&(*job)->base, entity, 1, owner);
>   }
>   
>   int amdgpu_job_alloc_with_ib(struct amdgpu_device *adev,
> diff --git a/drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c b/drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c
> index 2416c526f9b0..3d0f8d182506 100644
> --- a/drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c
> +++ b/drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c
> @@ -535,7 +535,7 @@ int etnaviv_ioctl_gem_submit(struct drm_device *dev, void *data,
>   
>   	ret = drm_sched_job_init(&submit->sched_job,
>   				 &ctx->sched_entity[args->pipe],
> -				 submit->ctx);
> +				 1, submit->ctx);
>   	if (ret)
>   		goto err_submit_put;
>   
> diff --git a/drivers/gpu/drm/lima/lima_sched.c b/drivers/gpu/drm/lima/lima_sched.c
> index 295f0353a02e..0febe8ceb43b 100644
> --- a/drivers/gpu/drm/lima/lima_sched.c
> +++ b/drivers/gpu/drm/lima/lima_sched.c
> @@ -123,7 +123,7 @@ int lima_sched_task_init(struct lima_sched_task *task,
>   	for (i = 0; i < num_bos; i++)
>   		drm_gem_object_get(&bos[i]->base.base);
>   
> -	err = drm_sched_job_init(&task->base, &context->base, vm);
> +	err = drm_sched_job_init(&task->base, &context->base, 1, vm);
>   	if (err) {
>   		kfree(task->bos);
>   		return err;
> diff --git a/drivers/gpu/drm/msm/msm_gem_submit.c b/drivers/gpu/drm/msm/msm_gem_submit.c
> index 99744de6c05a..c002cabe7b9c 100644
> --- a/drivers/gpu/drm/msm/msm_gem_submit.c
> +++ b/drivers/gpu/drm/msm/msm_gem_submit.c
> @@ -48,7 +48,7 @@ static struct msm_gem_submit *submit_create(struct drm_device *dev,
>   		return ERR_PTR(ret);
>   	}
>   
> -	ret = drm_sched_job_init(&submit->base, queue->entity, queue);
> +	ret = drm_sched_job_init(&submit->base, queue->entity, 1, queue);
>   	if (ret) {
>   		kfree(submit->hw_fence);
>   		kfree(submit);
> diff --git a/drivers/gpu/drm/nouveau/nouveau_sched.c b/drivers/gpu/drm/nouveau/nouveau_sched.c
> index e62b04109b2f..2a845f9c34ec 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_sched.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_sched.c
> @@ -89,7 +89,7 @@ nouveau_job_init(struct nouveau_job *job,
>   
>   	}
>   
> -	ret = drm_sched_job_init(&job->base, &entity->base, NULL);
> +	ret = drm_sched_job_init(&job->base, &entity->base, 1, NULL);
>   	if (ret)
>   		goto err_free_chains;
>   
> diff --git a/drivers/gpu/drm/panfrost/panfrost_drv.c b/drivers/gpu/drm/panfrost/panfrost_drv.c
> index b834777b409b..54d1c19bea84 100644
> --- a/drivers/gpu/drm/panfrost/panfrost_drv.c
> +++ b/drivers/gpu/drm/panfrost/panfrost_drv.c
> @@ -274,7 +274,7 @@ static int panfrost_ioctl_submit(struct drm_device *dev, void *data,
>   
>   	ret = drm_sched_job_init(&job->base,
>   				 &file_priv->sched_entity[slot],
> -				 NULL);
> +				 1, NULL);
>   	if (ret)
>   		goto out_put_job;
>   
> diff --git a/drivers/gpu/drm/scheduler/gpu_scheduler_trace.h b/drivers/gpu/drm/scheduler/gpu_scheduler_trace.h
> index 3143ecaaff86..f8ed093b7356 100644
> --- a/drivers/gpu/drm/scheduler/gpu_scheduler_trace.h
> +++ b/drivers/gpu/drm/scheduler/gpu_scheduler_trace.h
> @@ -51,7 +51,7 @@ DECLARE_EVENT_CLASS(drm_sched_job,
>   			   __assign_str(name, sched_job->sched->name);
>   			   __entry->job_count = spsc_queue_count(&entity->job_queue);
>   			   __entry->hw_job_count = atomic_read(
> -				   &sched_job->sched->hw_rq_count);
> +				   &sched_job->sched->credit_count);
>   			   ),
>   	    TP_printk("entity=%p, id=%llu, fence=%p, ring=%s, job count:%u, hw job count:%d",
>   		      __entry->entity, __entry->id,
> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
> index 409e4256f6e7..b79e0672b94f 100644
> --- a/drivers/gpu/drm/scheduler/sched_entity.c
> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> @@ -370,7 +370,7 @@ static void drm_sched_entity_wakeup(struct dma_fence *f,
>   		container_of(cb, struct drm_sched_entity, cb);
>   
>   	drm_sched_entity_clear_dep(f, cb);
> -	drm_sched_wakeup_if_can_queue(entity->rq->sched);
> +	drm_sched_wakeup_if_can_queue(entity->rq->sched, entity);
>   }
>   
>   /**
> @@ -602,7 +602,7 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
>   		if (drm_sched_policy == DRM_SCHED_POLICY_FIFO)
>   			drm_sched_rq_update_fifo(entity, submit_ts);
>   
> -		drm_sched_wakeup_if_can_queue(entity->rq->sched);
> +		drm_sched_wakeup_if_can_queue(entity->rq->sched, entity);
>   	}
>   }
>   EXPORT_SYMBOL(drm_sched_entity_push_job);
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index fd755e953487..5dfde149441a 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -48,6 +48,30 @@
>    * through the jobs entity pointer.
>    */
>   
> +/**
> + * DOC: Flow Control
> + *
> + * The DRM GPU scheduler provides a flow control mechanism to regulate the rate
> + * in which the jobs fetched from scheduler entities are executed.
> + *
> + * In this context the &drm_gpu_scheduler keeps track of a driver specified
> + * credit limit representing the capacity of this scheduler and a credit count;
> + * every &drm_sched_job carries a driver specified number of credits.
> + *
> + * Once a job is executed (but not yet finished), the job's credits contribute
> + * to the scheduler's credit count until the job is finished. If by executing
> + * one more job the scheduler's credit count would exceed the scheduler's
> + * credit limit, the job won't be executed. Instead, the scheduler will wait
> + * until the credit count has decreased enough to not overflow its credit limit.
> + * This implies waiting for previously executed jobs.
> + *
> + * Optionally, drivers may register a callback (update_job_credits) provided by
> + * struct drm_sched_backend_ops to update the job's credits dynamically. The
> + * scheduler executes this callback every time the scheduler considers a job for
> + * execution and subsequently checks whether the job fits the scheduler's credit
> + * limit.
> + */
> +
>   #include <linux/kthread.h>
>   #include <linux/wait.h>
>   #include <linux/sched.h>
> @@ -76,6 +100,46 @@ int drm_sched_policy = DRM_SCHED_POLICY_FIFO;
>   MODULE_PARM_DESC(sched_policy, "Specify the scheduling policy for entities on a run-queue, " __stringify(DRM_SCHED_POLICY_RR) " = Round Robin, " __stringify(DRM_SCHED_POLICY_FIFO) " = FIFO (default).");
>   module_param_named(sched_policy, drm_sched_policy, int, 0444);
>   
> +static u32 drm_sched_available_credits(struct drm_gpu_scheduler *sched)
> +{
> +	u32 credits;
> +
> +	drm_WARN_ON(sched, check_sub_overflow(sched->credit_limit,
> +					      atomic_read(&sched->credit_count),
> +					      &credits));
> +
> +	return credits;
> +}
> +
> +/**
> + * drm_sched_can_queue -- Can we queue more to the hardware?
> + * @sched: scheduler instance
> + * @entity: the scheduler entity
> + *
> + * Return true if we can push at least one more job from @entity, false
> + * otherwise.
> + */
> +static bool drm_sched_can_queue(struct drm_gpu_scheduler *sched,
> +				struct drm_sched_entity *entity)
> +{
> +	struct drm_sched_job *s_job;
> +
> +	s_job = to_drm_sched_job(spsc_queue_peek(&entity->job_queue));
> +	if (!s_job)
> +		return false;
> +

> +	if (sched->ops->update_job_credits) {
> +		s_job->credits = sched->ops->update_job_credits(s_job);
> +
> +		drm_WARN(sched, !s_job->credits,
> +			 "Jobs with zero credits bypass job-flow control\n");
> +	}

I'm still absolutely not keen on that.

Adding another callback adds more complexity to the scheduler driver 
interface, so at the end of the day we have more that could be 
implemented incorrectly.

So this needs to be justified somehow, and to be honest I don't see the 
value in filling the ring buffers more than necessary.

Regards,
Christian.


> +
> +	drm_WARN_ON(sched, s_job->credits > sched->credit_limit);
> +
> +	return drm_sched_available_credits(sched) >= s_job->credits;
> +}
> +
>   static __always_inline bool drm_sched_entity_compare_before(struct rb_node *a,
>   							    const struct rb_node *b)
>   {
> @@ -187,12 +251,14 @@ void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
>   /**
>    * drm_sched_rq_select_entity_rr - Select an entity which could provide a job to run
>    *
> + * @sched: the gpu scheduler
>    * @rq: scheduler run queue to check.
>    *
>    * Try to find a ready entity, returns NULL if none found.
>    */
>   static struct drm_sched_entity *
> -drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
> +drm_sched_rq_select_entity_rr(struct drm_gpu_scheduler *sched,
> +			      struct drm_sched_rq *rq)
>   {
>   	struct drm_sched_entity *entity;
>   
> @@ -202,6 +268,14 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>   	if (entity) {
>   		list_for_each_entry_continue(entity, &rq->entities, list) {
>   			if (drm_sched_entity_is_ready(entity)) {
> +				/* If we can't queue yet, preserve the current
> +				 * entity in terms of fairness.
> +				 */
> +				if (!drm_sched_can_queue(sched, entity)) {
> +					spin_unlock(&rq->lock);
> +					return ERR_PTR(-ENOSPC);
> +				}
> +
>   				rq->current_entity = entity;
>   				reinit_completion(&entity->entity_idle);
>   				spin_unlock(&rq->lock);
> @@ -211,8 +285,15 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>   	}
>   
>   	list_for_each_entry(entity, &rq->entities, list) {
> -
>   		if (drm_sched_entity_is_ready(entity)) {
> +			/* If we can't queue yet, preserve the current entity in
> +			 * terms of fairness.
> +			 */
> +			if (!drm_sched_can_queue(sched, entity)) {
> +				spin_unlock(&rq->lock);
> +				return ERR_PTR(-ENOSPC);
> +			}
> +
>   			rq->current_entity = entity;
>   			reinit_completion(&entity->entity_idle);
>   			spin_unlock(&rq->lock);
> @@ -231,12 +312,14 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>   /**
>    * drm_sched_rq_select_entity_fifo - Select an entity which provides a job to run
>    *
> + * @sched: the gpu scheduler
>    * @rq: scheduler run queue to check.
>    *
>    * Find oldest waiting ready entity, returns NULL if none found.
>    */
>   static struct drm_sched_entity *
> -drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
> +drm_sched_rq_select_entity_fifo(struct drm_gpu_scheduler *sched,
> +				struct drm_sched_rq *rq)
>   {
>   	struct rb_node *rb;
>   
> @@ -246,6 +329,14 @@ drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
>   
>   		entity = rb_entry(rb, struct drm_sched_entity, rb_tree_node);
>   		if (drm_sched_entity_is_ready(entity)) {
> +			/* If we can't queue yet, preserve the current entity in
> +			 * terms of fairness.
> +			 */
> +			if (!drm_sched_can_queue(sched, entity)) {
> +				spin_unlock(&rq->lock);
> +				return ERR_PTR(-ENOSPC);
> +			}
> +
>   			rq->current_entity = entity;
>   			reinit_completion(&entity->entity_idle);
>   			break;
> @@ -267,7 +358,7 @@ static void drm_sched_job_done(struct drm_sched_job *s_job, int result)
>   	struct drm_sched_fence *s_fence = s_job->s_fence;
>   	struct drm_gpu_scheduler *sched = s_fence->sched;
>   
> -	atomic_dec(&sched->hw_rq_count);
> +	atomic_sub(s_job->credits, &sched->credit_count);
>   	atomic_dec(sched->score);
>   
>   	trace_drm_sched_process_job(s_fence);
> @@ -468,7 +559,7 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>   					      &s_job->cb)) {
>   			dma_fence_put(s_job->s_fence->parent);
>   			s_job->s_fence->parent = NULL;
> -			atomic_dec(&sched->hw_rq_count);
> +			atomic_sub(s_job->credits, &sched->credit_count);
>   		} else {
>   			/*
>   			 * remove job from pending_list.
> @@ -529,7 +620,7 @@ void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery)
>   	list_for_each_entry_safe(s_job, tmp, &sched->pending_list, list) {
>   		struct dma_fence *fence = s_job->s_fence->parent;
>   
> -		atomic_inc(&sched->hw_rq_count);
> +		atomic_add(s_job->credits, &sched->credit_count);
>   
>   		if (!full_recovery)
>   			continue;
> @@ -613,6 +704,8 @@ EXPORT_SYMBOL(drm_sched_resubmit_jobs);
>    * drm_sched_job_init - init a scheduler job
>    * @job: scheduler job to init
>    * @entity: scheduler entity to use
> + * @credits: the number of credits this job contributes to the schedulers
> + * credit limit
>    * @owner: job owner for debugging
>    *
>    * Refer to drm_sched_entity_push_job() documentation
> @@ -630,7 +723,7 @@ EXPORT_SYMBOL(drm_sched_resubmit_jobs);
>    */
>   int drm_sched_job_init(struct drm_sched_job *job,
>   		       struct drm_sched_entity *entity,
> -		       void *owner)
> +		       u32 credits, void *owner)
>   {
>   	if (!entity->rq) {
>   		/* This will most likely be followed by missing frames
> @@ -647,6 +740,10 @@ int drm_sched_job_init(struct drm_sched_job *job,
>   		return -ENOMEM;
>   
>   	INIT_LIST_HEAD(&job->list);
> +	job->credits = credits;
> +
> +	drm_WARN(job->sched, !credits,
> +		 "Jobs with zero credits bypass job-flow control\n");
>   
>   	xa_init_flags(&job->dependencies, XA_FLAGS_ALLOC);
>   
> @@ -853,27 +950,17 @@ void drm_sched_job_cleanup(struct drm_sched_job *job)
>   }
>   EXPORT_SYMBOL(drm_sched_job_cleanup);
>   
> -/**
> - * drm_sched_can_queue -- Can we queue more to the hardware?
> - * @sched: scheduler instance
> - *
> - * Return true if we can push more jobs to the hw, otherwise false.
> - */
> -static bool drm_sched_can_queue(struct drm_gpu_scheduler *sched)
> -{
> -	return atomic_read(&sched->hw_rq_count) <
> -		sched->hw_submission_limit;
> -}
> -
>   /**
>    * drm_sched_wakeup_if_can_queue - Wake up the scheduler
>    * @sched: scheduler instance
> + * @entity: the scheduler entity
>    *
>    * Wake up the scheduler if we can queue jobs.
>    */
> -void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched)
> +void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched,
> +				   struct drm_sched_entity *entity)
>   {
> -	if (drm_sched_can_queue(sched))
> +	if (drm_sched_can_queue(sched, entity))
>   		wake_up_interruptible(&sched->wake_up_worker);
>   }
>   
> @@ -890,19 +977,16 @@ drm_sched_select_entity(struct drm_gpu_scheduler *sched)
>   	struct drm_sched_entity *entity;
>   	int i;
>   
> -	if (!drm_sched_can_queue(sched))
> -		return NULL;
> -
>   	/* Kernel run queue has higher priority than normal run queue*/
>   	for (i = sched->num_rqs - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
>   		entity = drm_sched_policy == DRM_SCHED_POLICY_FIFO ?
> -			drm_sched_rq_select_entity_fifo(sched->sched_rq[i]) :
> -			drm_sched_rq_select_entity_rr(sched->sched_rq[i]);
> +			drm_sched_rq_select_entity_fifo(sched, sched->sched_rq[i]) :
> +			drm_sched_rq_select_entity_rr(sched, sched->sched_rq[i]);
>   		if (entity)
>   			break;
>   	}
>   
> -	return entity;
> +	return IS_ERR(entity) ? NULL : entity;
>   }
>   
>   /**
> @@ -1043,7 +1127,7 @@ static int drm_sched_main(void *param)
>   
>   		s_fence = sched_job->s_fence;
>   
> -		atomic_inc(&sched->hw_rq_count);
> +		atomic_add(sched_job->credits, &sched->credit_count);
>   		drm_sched_job_begin(sched_job);
>   
>   		trace_drm_run_job(sched_job, entity);
> @@ -1078,7 +1162,7 @@ static int drm_sched_main(void *param)
>    * @sched: scheduler instance
>    * @ops: backend operations for this scheduler
>    * @num_rqs: number of runqueues, one for each priority, up to DRM_SCHED_PRIORITY_COUNT
> - * @hw_submission: number of hw submissions that can be in flight
> + * @credit_limit: the number of credits this scheduler can hold from all jobs
>    * @hang_limit: number of times to allow a job to hang before dropping it
>    * @timeout: timeout value in jiffies for the scheduler
>    * @timeout_wq: workqueue to use for timeout work. If NULL, the system_wq is
> @@ -1091,14 +1175,14 @@ static int drm_sched_main(void *param)
>    */
>   int drm_sched_init(struct drm_gpu_scheduler *sched,
>   		   const struct drm_sched_backend_ops *ops,
> -		   u32 num_rqs, uint32_t hw_submission, unsigned int hang_limit,
> +		   u32 num_rqs, u32 credit_limit, unsigned int hang_limit,
>   		   long timeout, struct workqueue_struct *timeout_wq,
>   		   atomic_t *score, const char *name, struct device *dev)
>   {
>   	int i, ret;
>   
>   	sched->ops = ops;
> -	sched->hw_submission_limit = hw_submission;
> +	sched->credit_limit = credit_limit;
>   	sched->name = name;
>   	sched->timeout = timeout;
>   	sched->timeout_wq = timeout_wq ? : system_wq;
> @@ -1140,7 +1224,7 @@ int drm_sched_init(struct drm_gpu_scheduler *sched,
>   	init_waitqueue_head(&sched->job_scheduled);
>   	INIT_LIST_HEAD(&sched->pending_list);
>   	spin_lock_init(&sched->job_list_lock);
> -	atomic_set(&sched->hw_rq_count, 0);
> +	atomic_set(&sched->credit_count, 0);
>   	INIT_DELAYED_WORK(&sched->work_tdr, drm_sched_job_timedout);
>   	atomic_set(&sched->_score, 0);
>   	atomic64_set(&sched->job_id_count, 0);
> diff --git a/drivers/gpu/drm/v3d/v3d_gem.c b/drivers/gpu/drm/v3d/v3d_gem.c
> index afa7d170d1ff..8dc5a7988488 100644
> --- a/drivers/gpu/drm/v3d/v3d_gem.c
> +++ b/drivers/gpu/drm/v3d/v3d_gem.c
> @@ -417,7 +417,7 @@ v3d_job_init(struct v3d_dev *v3d, struct drm_file *file_priv,
>   	job->free = free;
>   
>   	ret = drm_sched_job_init(&job->base, &v3d_priv->sched_entity[queue],
> -				 v3d_priv);
> +				 1, v3d_priv);
>   	if (ret)
>   		goto fail;
>   
> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> index 653e0eec9743..8bafa81300b2 100644
> --- a/include/drm/gpu_scheduler.h
> +++ b/include/drm/gpu_scheduler.h
> @@ -321,6 +321,7 @@ struct drm_sched_fence *to_drm_sched_fence(struct dma_fence *f);
>    * @sched: the scheduler instance on which this job is scheduled.
>    * @s_fence: contains the fences for the scheduling of job.
>    * @finish_cb: the callback for the finished fence.
> + * @credits: the number of credits this job contributes to the scheduler
>    * @work: Helper to reschdeule job kill to different context.
>    * @id: a unique id assigned to each job scheduled on the scheduler.
>    * @karma: increment on every hang caused by this job. If this exceeds the hang
> @@ -340,6 +341,8 @@ struct drm_sched_job {
>   	struct drm_gpu_scheduler	*sched;
>   	struct drm_sched_fence		*s_fence;
>   
> +	u32				credits;
> +
>   	/*
>   	 * work is used only after finish_cb has been used and will not be
>   	 * accessed anymore.
> @@ -463,13 +466,27 @@ struct drm_sched_backend_ops {
>            * and it's time to clean it up.
>   	 */
>   	void (*free_job)(struct drm_sched_job *sched_job);
> +
> +	/**
> +	 * @update_job_credits: Called once the scheduler is considering this
> +	 * job for execution.
> +	 *
> +	 * This callback returns the number of credits this job would take if
> +	 * pushed to the hardware. Drivers may use this to dynamically update
> +	 * the job's credit count. For instance, deduct the number of credits
> +	 * for already signalled native fences.
> +	 *
> +	 * This callback is optional.
> +	 */
> +	u32 (*update_job_credits)(struct drm_sched_job *sched_job);
>   };
>   
>   /**
>    * struct drm_gpu_scheduler - scheduler instance-specific data
>    *
>    * @ops: backend operations provided by the driver.
> - * @hw_submission_limit: the max size of the hardware queue.
> + * @credit_limit: the credit limit of this scheduler
> + * @credit_count: the current credit count of this scheduler
>    * @timeout: the time after which a job is removed from the scheduler.
>    * @name: name of the ring for which this scheduler is being used.
>    * @num_rqs: Number of run-queues. This is at most DRM_SCHED_PRIORITY_COUNT,
> @@ -480,7 +497,6 @@ struct drm_sched_backend_ops {
>    * @job_scheduled: once @drm_sched_entity_do_release is called the scheduler
>    *                 waits on this wait queue until all the scheduled jobs are
>    *                 finished.
> - * @hw_rq_count: the number of jobs currently in the hardware queue.
>    * @job_id_count: used to assign unique id to the each job.
>    * @timeout_wq: workqueue used to queue @work_tdr
>    * @work_tdr: schedules a delayed call to @drm_sched_job_timedout after the
> @@ -500,14 +516,14 @@ struct drm_sched_backend_ops {
>    */
>   struct drm_gpu_scheduler {
>   	const struct drm_sched_backend_ops	*ops;
> -	uint32_t			hw_submission_limit;
> +	u32				credit_limit;
> +	atomic_t			credit_count;
>   	long				timeout;
>   	const char			*name;
>   	u32                             num_rqs;
>   	struct drm_sched_rq             **sched_rq;
>   	wait_queue_head_t		wake_up_worker;
>   	wait_queue_head_t		job_scheduled;
> -	atomic_t			hw_rq_count;
>   	atomic64_t			job_id_count;
>   	struct workqueue_struct		*timeout_wq;
>   	struct delayed_work		work_tdr;
> @@ -524,14 +540,14 @@ struct drm_gpu_scheduler {
>   
>   int drm_sched_init(struct drm_gpu_scheduler *sched,
>   		   const struct drm_sched_backend_ops *ops,
> -		   u32 num_rqs, uint32_t hw_submission, unsigned int hang_limit,
> +		   u32 num_rqs, u32 credit_limit, unsigned int hang_limit,
>   		   long timeout, struct workqueue_struct *timeout_wq,
>   		   atomic_t *score, const char *name, struct device *dev);
>   
>   void drm_sched_fini(struct drm_gpu_scheduler *sched);
>   int drm_sched_job_init(struct drm_sched_job *job,
>   		       struct drm_sched_entity *entity,
> -		       void *owner);
> +		       u32 credits, void *owner);
>   void drm_sched_job_arm(struct drm_sched_job *job);
>   int drm_sched_job_add_dependency(struct drm_sched_job *job,
>   				 struct dma_fence *fence);
> @@ -552,7 +568,8 @@ void drm_sched_entity_modify_sched(struct drm_sched_entity *entity,
>                                      unsigned int num_sched_list);
>   
>   void drm_sched_job_cleanup(struct drm_sched_job *job);
> -void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched);
> +void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched,
> +				   struct drm_sched_entity *entity);
>   void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad);
>   void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery);
>   void drm_sched_resubmit_jobs(struct drm_gpu_scheduler *sched);
>
> base-commit: b2139fb5051554a7f297e4ad584ef1bc26c76d5d
Danilo Krummrich Oct. 31, 2023, 12:19 p.m. UTC | #4
On 10/31/23 12:13, Christian König wrote:
> 
> 
> On 10/31/23 01:26, Danilo Krummrich wrote:
>> [...]
> 
>> +    if (sched->ops->update_job_credits) {
>> +        s_job->credits = sched->ops->update_job_credits(s_job);
>> +
>> +        drm_WARN(sched, !s_job->credits,
>> +             "Jobs with zero credits bypass job-flow control\n");
>> +    }
> 
> I'm still absolutely not keen on that.
> 
> Adding another callback adds more complexity to the scheduler driver interface, so at the end of the day we have more that could be implemented incorrectly.

I already commented on this concern in [1]. If that isn't satisfactory yet,
please reply in [1], so that I don't need to re-write everything.

> 
> So this needs to be justified somehow, and to be honest I don't see the value in filling the ring buffers more than necessary.

I think both Boris and I gave a justification; it just seems you're not
accepting it. Again, if there are concerns left, please continue the discussion
in [1], so that we don't need to start over.

- Danilo

[1] https://lore.kernel.org/all/ZT%2FuPhLK7gan61Ns@pollux/

> 
> Regards,
> Christian.
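
For reference, a minimal sketch of the optional callback debated above,
assuming a driver that can cheaply recompute a job's ring footprint
(to_my_job(), my_job_required_dwords() and the my_* ops are invented
helpers; only the callback's signature and the zero-credits warning come
from this patch):

	/* Hypothetical driver code, for illustration only. */
	static u32 my_update_job_credits(struct drm_sched_job *sched_job)
	{
		struct my_job *job = to_my_job(sched_job);

		/*
		 * Commands that waited on fences which have signalled in
		 * the meantime no longer need to be emitted, so the job's
		 * footprint may have shrunk since submission. Never return
		 * 0: zero-credit jobs bypass flow control and trigger a
		 * WARN in the scheduler.
		 */
		return max_t(u32, 1, my_job_required_dwords(job));
	}

	static const struct drm_sched_backend_ops my_sched_ops = {
		.run_job		= my_run_job,
		.timedout_job		= my_timedout_job,
		.free_job		= my_free_job,
		.update_job_credits	= my_update_job_credits, /* optional */
	};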
>>       if (ret)
>>           goto fail;
>> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
>> index 653e0eec9743..8bafa81300b2 100644
>> --- a/include/drm/gpu_scheduler.h
>> +++ b/include/drm/gpu_scheduler.h
>> @@ -321,6 +321,7 @@ struct drm_sched_fence *to_drm_sched_fence(struct dma_fence *f);
>>    * @sched: the scheduler instance on which this job is scheduled.
>>    * @s_fence: contains the fences for the scheduling of job.
>>    * @finish_cb: the callback for the finished fence.
>> + * @credits: the number of credits this job contributes to the scheduler
>>    * @work: Helper to reschdeule job kill to different context.
>>    * @id: a unique id assigned to each job scheduled on the scheduler.
>>    * @karma: increment on every hang caused by this job. If this exceeds the hang
>> @@ -340,6 +341,8 @@ struct drm_sched_job {
>>       struct drm_gpu_scheduler    *sched;
>>       struct drm_sched_fence        *s_fence;
>> +    u32                credits;
>> +
>>       /*
>>        * work is used only after finish_cb has been used and will not be
>>        * accessed anymore.
>> @@ -463,13 +466,27 @@ struct drm_sched_backend_ops {
>>            * and it's time to clean it up.
>>        */
>>       void (*free_job)(struct drm_sched_job *sched_job);
>> +
>> +    /**
>> +     * @update_job_credits: Called once the scheduler is considering this
>> +     * job for execution.
>> +     *
>> +     * This callback returns the number of credits this job would take if
>> +     * pushed to the hardware. Drivers may use this to dynamically update
>> +     * the job's credit count. For instance, deduct the number of credits
>> +     * for already signalled native fences.
>> +     *
>> +     * This callback is optional.
>> +     */
>> +    u32 (*update_job_credits)(struct drm_sched_job *sched_job);
>>   };
>>   /**
>>    * struct drm_gpu_scheduler - scheduler instance-specific data
>>    *
>>    * @ops: backend operations provided by the driver.
>> - * @hw_submission_limit: the max size of the hardware queue.
>> + * @credit_limit: the credit limit of this scheduler
>> + * @credit_count: the current credit count of this scheduler
>>    * @timeout: the time after which a job is removed from the scheduler.
>>    * @name: name of the ring for which this scheduler is being used.
>>    * @num_rqs: Number of run-queues. This is at most DRM_SCHED_PRIORITY_COUNT,
>> @@ -480,7 +497,6 @@ struct drm_sched_backend_ops {
>>    * @job_scheduled: once @drm_sched_entity_do_release is called the scheduler
>>    *                 waits on this wait queue until all the scheduled jobs are
>>    *                 finished.
>> - * @hw_rq_count: the number of jobs currently in the hardware queue.
>>    * @job_id_count: used to assign unique id to the each job.
>>    * @timeout_wq: workqueue used to queue @work_tdr
>>    * @work_tdr: schedules a delayed call to @drm_sched_job_timedout after the
>> @@ -500,14 +516,14 @@ struct drm_sched_backend_ops {
>>    */
>>   struct drm_gpu_scheduler {
>>       const struct drm_sched_backend_ops    *ops;
>> -    uint32_t            hw_submission_limit;
>> +    u32                credit_limit;
>> +    atomic_t            credit_count;
>>       long                timeout;
>>       const char            *name;
>>       u32                             num_rqs;
>>       struct drm_sched_rq             **sched_rq;
>>       wait_queue_head_t        wake_up_worker;
>>       wait_queue_head_t        job_scheduled;
>> -    atomic_t            hw_rq_count;
>>       atomic64_t            job_id_count;
>>       struct workqueue_struct        *timeout_wq;
>>       struct delayed_work        work_tdr;
>> @@ -524,14 +540,14 @@ struct drm_gpu_scheduler {
>>   int drm_sched_init(struct drm_gpu_scheduler *sched,
>>              const struct drm_sched_backend_ops *ops,
>> -           u32 num_rqs, uint32_t hw_submission, unsigned int hang_limit,
>> +           u32 num_rqs, u32 credit_limit, unsigned int hang_limit,
>>              long timeout, struct workqueue_struct *timeout_wq,
>>              atomic_t *score, const char *name, struct device *dev);
>>   void drm_sched_fini(struct drm_gpu_scheduler *sched);
>>   int drm_sched_job_init(struct drm_sched_job *job,
>>                  struct drm_sched_entity *entity,
>> -               void *owner);
>> +               u32 credits, void *owner);
>>   void drm_sched_job_arm(struct drm_sched_job *job);
>>   int drm_sched_job_add_dependency(struct drm_sched_job *job,
>>                    struct dma_fence *fence);
>> @@ -552,7 +568,8 @@ void drm_sched_entity_modify_sched(struct drm_sched_entity *entity,
>>                                      unsigned int num_sched_list);
>>   void drm_sched_job_cleanup(struct drm_sched_job *job);
>> -void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched);
>> +void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched,
>> +                   struct drm_sched_entity *entity);
>>   void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad);
>>   void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery);
>>   void drm_sched_resubmit_jobs(struct drm_gpu_scheduler *sched);
>>
>> base-commit: b2139fb5051554a7f297e4ad584ef1bc26c76d5d
>
Luben Tuikov Nov. 1, 2023, 12:51 a.m. UTC | #5
Hi,

(PSA: luben.tuikov@amd.com should've bounced :-) I'm removing it from the To: field.)

On 2023-10-30 20:26, Danilo Krummrich wrote:
> Currently, job flow control is implemented simply by limiting the number
> of jobs in flight. Therefore, a scheduler is initialized with a credit
> limit that corresponds to the number of jobs which can be sent to the
> hardware.
> 
> This implies that for each job, drivers need to account for the maximum
> job size possible in order to not overflow the ring buffer.
> 
> However, there are drivers, such as Nouveau, where the job size has a
> rather large range. For such drivers it can easily happen that job
> submissions not even filling the ring by 1% can block subsequent
> submissions, which, in the worst case, can lead to the ring run dry.
> 
> In order to overcome this issue, allow for tracking the actual job size
> instead of the number of jobs. Therefore, add a field to track a job's
> credit count, which represents the number of credits a job contributes
> to the scheduler's credit limit.
> 
> Signed-off-by: Danilo Krummrich <dakr@redhat.com>
> ---
> Changes in V2:
> ==============
>   - fixed up influence on scheduling fairness due to consideration of a job's
>     size
>     - If we reach a ready entity in drm_sched_select_entity() but can't actually
>       queue a job from it due to size limitations, just give up and go to sleep
>       until woken up due to a pending job finishing, rather than continue to try
>       other entities.
>   - added a callback to dynamically update a job's credits (Boris)
>   - renamed 'units' to 'credits'
>   - fixed commit message and comments as requested by Luben
> 
> Changes in V3:
> ==============
>   - rebased onto V7 of the "DRM scheduler changes for Xe" series by Matt
>   - move up drm_sched_can_queue() instead of adding a forward declaration
>     (Boris)
>   - add a drm_sched_available_credits() helper (Boris)
>   - adjust control flow in drm_sched_rq_select_entity_fifo() to Luben's proposal
>   - re-phrase a few comments and fix a typo (Luben)
>   - change naming of all structures credit fields and function parameters to the
>     following scheme
>     - drm_sched_job::credits
>     - drm_gpu_scheduler::credit_limit
>     - drm_gpu_scheduler::credit_count
>     - drm_sched_init(..., u32 credit_limit, ...)
>     - drm_sched_job_init(..., u32 credits, ...)
>   - add proper documentation for the scheduler's job-flow control mechanism
> 
> Changes in V4:
> ==============
>   - address Lubens comments regarding documentation
>   - switch to drm_WARN() variants
>   - WARN_ON() drivers passing in zero job credits for both drm_sched_job_init()
>     and the update_job_credits() callback
>   - don't retry with another runq if job doesn't fit on the ring to prevent
>     priority inversion
>   - rebase onto drm-misc-next (will probably land before Matt's series)
> 
> Patch also available in [1].
> 
> [1] https://gitlab.freedesktop.org/nouvelles/kernel/-/commits/sched-credits/
> ---
>  Documentation/gpu/drm-mm.rst                  |   6 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c       |   2 +-
>  drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c  |   2 +-
>  drivers/gpu/drm/lima/lima_sched.c             |   2 +-
>  drivers/gpu/drm/msm/msm_gem_submit.c          |   2 +-
>  drivers/gpu/drm/nouveau/nouveau_sched.c       |   2 +-
>  drivers/gpu/drm/panfrost/panfrost_drv.c       |   2 +-
>  .../gpu/drm/scheduler/gpu_scheduler_trace.h   |   2 +-
>  drivers/gpu/drm/scheduler/sched_entity.c      |   4 +-
>  drivers/gpu/drm/scheduler/sched_main.c        | 148 ++++++++++++++----
>  drivers/gpu/drm/v3d/v3d_gem.c                 |   2 +-
>  include/drm/gpu_scheduler.h                   |  31 +++-
>  12 files changed, 156 insertions(+), 49 deletions(-)
> 
> diff --git a/Documentation/gpu/drm-mm.rst b/Documentation/gpu/drm-mm.rst
> index 602010cb6894..acc5901ac840 100644
> --- a/Documentation/gpu/drm-mm.rst
> +++ b/Documentation/gpu/drm-mm.rst
> @@ -552,6 +552,12 @@ Overview
>  .. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
>     :doc: Overview
>  
> +Flow Control
> +------------
> +
> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> +   :doc: Flow Control
> +
>  Scheduler Function References
>  -----------------------------
>  
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> index 1f357198533f..62bb7fc7448a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> @@ -115,7 +115,7 @@ int amdgpu_job_alloc(struct amdgpu_device *adev, struct amdgpu_vm *vm,
>  	if (!entity)
>  		return 0;
>  
> -	return drm_sched_job_init(&(*job)->base, entity, owner);
> +	return drm_sched_job_init(&(*job)->base, entity, 1, owner);
>  }
>  
>  int amdgpu_job_alloc_with_ib(struct amdgpu_device *adev,
> diff --git a/drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c b/drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c
> index 2416c526f9b0..3d0f8d182506 100644
> --- a/drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c
> +++ b/drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c
> @@ -535,7 +535,7 @@ int etnaviv_ioctl_gem_submit(struct drm_device *dev, void *data,
>  
>  	ret = drm_sched_job_init(&submit->sched_job,
>  				 &ctx->sched_entity[args->pipe],
> -				 submit->ctx);
> +				 1, submit->ctx);
>  	if (ret)
>  		goto err_submit_put;
>  
> diff --git a/drivers/gpu/drm/lima/lima_sched.c b/drivers/gpu/drm/lima/lima_sched.c
> index 295f0353a02e..0febe8ceb43b 100644
> --- a/drivers/gpu/drm/lima/lima_sched.c
> +++ b/drivers/gpu/drm/lima/lima_sched.c
> @@ -123,7 +123,7 @@ int lima_sched_task_init(struct lima_sched_task *task,
>  	for (i = 0; i < num_bos; i++)
>  		drm_gem_object_get(&bos[i]->base.base);
>  
> -	err = drm_sched_job_init(&task->base, &context->base, vm);
> +	err = drm_sched_job_init(&task->base, &context->base, 1, vm);
>  	if (err) {
>  		kfree(task->bos);
>  		return err;
> diff --git a/drivers/gpu/drm/msm/msm_gem_submit.c b/drivers/gpu/drm/msm/msm_gem_submit.c
> index 99744de6c05a..c002cabe7b9c 100644
> --- a/drivers/gpu/drm/msm/msm_gem_submit.c
> +++ b/drivers/gpu/drm/msm/msm_gem_submit.c
> @@ -48,7 +48,7 @@ static struct msm_gem_submit *submit_create(struct drm_device *dev,
>  		return ERR_PTR(ret);
>  	}
>  
> -	ret = drm_sched_job_init(&submit->base, queue->entity, queue);
> +	ret = drm_sched_job_init(&submit->base, queue->entity, 1, queue);
>  	if (ret) {
>  		kfree(submit->hw_fence);
>  		kfree(submit);
> diff --git a/drivers/gpu/drm/nouveau/nouveau_sched.c b/drivers/gpu/drm/nouveau/nouveau_sched.c
> index e62b04109b2f..2a845f9c34ec 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_sched.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_sched.c
> @@ -89,7 +89,7 @@ nouveau_job_init(struct nouveau_job *job,
>  
>  	}
>  
> -	ret = drm_sched_job_init(&job->base, &entity->base, NULL);
> +	ret = drm_sched_job_init(&job->base, &entity->base, 1, NULL);
>  	if (ret)
>  		goto err_free_chains;
>  
> diff --git a/drivers/gpu/drm/panfrost/panfrost_drv.c b/drivers/gpu/drm/panfrost/panfrost_drv.c
> index b834777b409b..54d1c19bea84 100644
> --- a/drivers/gpu/drm/panfrost/panfrost_drv.c
> +++ b/drivers/gpu/drm/panfrost/panfrost_drv.c
> @@ -274,7 +274,7 @@ static int panfrost_ioctl_submit(struct drm_device *dev, void *data,
>  
>  	ret = drm_sched_job_init(&job->base,
>  				 &file_priv->sched_entity[slot],
> -				 NULL);
> +				 1, NULL);
>  	if (ret)
>  		goto out_put_job;
>  
> diff --git a/drivers/gpu/drm/scheduler/gpu_scheduler_trace.h b/drivers/gpu/drm/scheduler/gpu_scheduler_trace.h
> index 3143ecaaff86..f8ed093b7356 100644
> --- a/drivers/gpu/drm/scheduler/gpu_scheduler_trace.h
> +++ b/drivers/gpu/drm/scheduler/gpu_scheduler_trace.h
> @@ -51,7 +51,7 @@ DECLARE_EVENT_CLASS(drm_sched_job,
>  			   __assign_str(name, sched_job->sched->name);
>  			   __entry->job_count = spsc_queue_count(&entity->job_queue);
>  			   __entry->hw_job_count = atomic_read(
> -				   &sched_job->sched->hw_rq_count);
> +				   &sched_job->sched->credit_count);
>  			   ),
>  	    TP_printk("entity=%p, id=%llu, fence=%p, ring=%s, job count:%u, hw job count:%d",
>  		      __entry->entity, __entry->id,
> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
> index 409e4256f6e7..b79e0672b94f 100644
> --- a/drivers/gpu/drm/scheduler/sched_entity.c
> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> @@ -370,7 +370,7 @@ static void drm_sched_entity_wakeup(struct dma_fence *f,
>  		container_of(cb, struct drm_sched_entity, cb);
>  
>  	drm_sched_entity_clear_dep(f, cb);
> -	drm_sched_wakeup_if_can_queue(entity->rq->sched);
> +	drm_sched_wakeup_if_can_queue(entity->rq->sched, entity);
>  }
>  
>  /**
> @@ -602,7 +602,7 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
>  		if (drm_sched_policy == DRM_SCHED_POLICY_FIFO)
>  			drm_sched_rq_update_fifo(entity, submit_ts);
>  
> -		drm_sched_wakeup_if_can_queue(entity->rq->sched);
> +		drm_sched_wakeup_if_can_queue(entity->rq->sched, entity);
>  	}
>  }
>  EXPORT_SYMBOL(drm_sched_entity_push_job);
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index fd755e953487..5dfde149441a 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -48,6 +48,30 @@
>   * through the jobs entity pointer.
>   */
>  
> +/**
> + * DOC: Flow Control
> + *
> + * The DRM GPU scheduler provides a flow control mechanism to regulate the rate
> + * in which the jobs fetched from scheduler entities are executed.
> + *
> + * In this context the &drm_gpu_scheduler keeps track of a driver specified
> + * credit limit representing the capacity of this scheduler and a credit count;
> + * every &drm_sched_job carries a driver specified number of credits.
> + *
> + * Once a job is executed (but not yet finished), the job's credits contribute
> + * to the scheduler's credit count until the job is finished. If by executing
> + * one more job the scheduler's credit count would exceed the scheduler's
> + * credit limit, the job won't be executed. Instead, the scheduler will wait
> + * until the credit count has decreased enough to not overflow its credit limit.
> + * This implies waiting for previously executed jobs.
> + *
> + * Optionally, drivers may register a callback (update_job_credits) provided by
> + * struct drm_sched_backend_ops to update the job's credits dynamically. The
> + * scheduler executes this callback every time the scheduler considers a job for
> + * execution and subsequently checks whether the job fits the scheduler's credit
> + * limit.
> + */

Excellent!

This reads really good right now and is very lucid. Thanks for updating it.
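
As a purely illustrative sketch of what this means for a driver (hypothetical
ring-buffer-backed driver; "ring", my_sched_ops and my_job_size_in_units() are
made-up names), one would size the scheduler in ring units and let each job
carry its actual size:

	/* One credit corresponds to one unit of ring buffer space. */
	ret = drm_sched_init(&ring->sched, &my_sched_ops,
			     DRM_SCHED_PRIORITY_COUNT,
			     ring->size_in_units,	/* credit_limit */
			     hang_limit, timeout,
			     NULL,			/* timeout_wq */
			     NULL,			/* score */
			     ring->name, dev);

	/* Each job contributes its actual size to the scheduler's credit
	 * count while it is in flight.
	 */
	ret = drm_sched_job_init(&job->base, &entity->base,
				 my_job_size_in_units(job), owner);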

> +
>  #include <linux/kthread.h>
>  #include <linux/wait.h>
>  #include <linux/sched.h>
> @@ -76,6 +100,46 @@ int drm_sched_policy = DRM_SCHED_POLICY_FIFO;
>  MODULE_PARM_DESC(sched_policy, "Specify the scheduling policy for entities on a run-queue, " __stringify(DRM_SCHED_POLICY_RR) " = Round Robin, " __stringify(DRM_SCHED_POLICY_FIFO) " = FIFO (default).");
>  module_param_named(sched_policy, drm_sched_policy, int, 0444);
>  
> +static u32 drm_sched_available_credits(struct drm_gpu_scheduler *sched)
> +{
> +	u32 credits;
> +
> +	drm_WARN_ON(sched, check_sub_overflow(sched->credit_limit,
> +					      atomic_read(&sched->credit_count),
> +					      &credits));
> +
> +	return credits;
> +}
> +
> +/**
> + * drm_sched_can_queue -- Can we queue more to the hardware?
> + * @sched: scheduler instance
> + * @entity: the scheduler entity
> + *
> + * Return true if we can push at least one more job from @entity, false
> + * otherwise.
> + */
> +static bool drm_sched_can_queue(struct drm_gpu_scheduler *sched,
> +				struct drm_sched_entity *entity)
> +{
> +	struct drm_sched_job *s_job;
> +
> +	s_job = to_drm_sched_job(spsc_queue_peek(&entity->job_queue));
> +	if (!s_job)
> +		return false;
> +
> +	if (sched->ops->update_job_credits) {
> +		s_job->credits = sched->ops->update_job_credits(s_job);
> +
> +		drm_WARN(sched, !s_job->credits,
> +			 "Jobs with zero credits bypass job-flow control\n");
> +	}
> +
> +	drm_WARN_ON(sched, s_job->credits > sched->credit_limit);
> +
> +	return drm_sched_available_credits(sched) >= s_job->credits;
> +}
> +

Yeah, this is good.
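
For illustration, a driver's update_job_credits() hook could then look roughly
like the sketch below (hypothetical driver; to_my_job() and
my_job_count_signalled_fences() are made-up helpers), following the "deduct
already signalled native fences" example from the documentation:

	static u32 my_update_job_credits(struct drm_sched_job *sched_job)
	{
		struct my_job *job = to_my_job(sched_job);
		u32 credits = sched_job->credits;
		u32 signalled = my_job_count_signalled_fences(job);

		/* A wait command on the ring is no longer needed for a
		 * native fence that has signalled in the meantime, so drop
		 * one credit per signalled fence -- but never return zero,
		 * since zero-credit jobs bypass job-flow control.
		 */
		return credits > signalled ? credits - signalled : 1;
	}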

>  static __always_inline bool drm_sched_entity_compare_before(struct rb_node *a,
>  							    const struct rb_node *b)
>  {
> @@ -187,12 +251,14 @@ void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
>  /**
>   * drm_sched_rq_select_entity_rr - Select an entity which could provide a job to run
>   *
> + * @sched: the gpu scheduler
>   * @rq: scheduler run queue to check.
>   *
>   * Try to find a ready entity, returns NULL if none found.
>   */
>  static struct drm_sched_entity *
> -drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
> +drm_sched_rq_select_entity_rr(struct drm_gpu_scheduler *sched,
> +			      struct drm_sched_rq *rq)
>  {
>  	struct drm_sched_entity *entity;
>  
> @@ -202,6 +268,14 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>  	if (entity) {
>  		list_for_each_entry_continue(entity, &rq->entities, list) {
>  			if (drm_sched_entity_is_ready(entity)) {
> +				/* If we can't queue yet, preserve the current
> +				 * entity in terms of fairness.
> +				 */
> +				if (!drm_sched_can_queue(sched, entity)) {
> +					spin_unlock(&rq->lock);
> +					return ERR_PTR(-ENOSPC);
> +				}
> +
>  				rq->current_entity = entity;
>  				reinit_completion(&entity->entity_idle);
>  				spin_unlock(&rq->lock);
> @@ -211,8 +285,15 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>  	}
>  
>  	list_for_each_entry(entity, &rq->entities, list) {
> -
>  		if (drm_sched_entity_is_ready(entity)) {
> +			/* If we can't queue yet, preserve the current entity in
> +			 * terms of fairness.
> +			 */
> +			if (!drm_sched_can_queue(sched, entity)) {
> +				spin_unlock(&rq->lock);
> +				return ERR_PTR(-ENOSPC);
> +			}
> +
>  			rq->current_entity = entity;
>  			reinit_completion(&entity->entity_idle);
>  			spin_unlock(&rq->lock);
> @@ -231,12 +312,14 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>  /**
>   * drm_sched_rq_select_entity_fifo - Select an entity which provides a job to run
>   *
> + * @sched: the gpu scheduler
>   * @rq: scheduler run queue to check.
>   *
>   * Find oldest waiting ready entity, returns NULL if none found.
>   */
>  static struct drm_sched_entity *
> -drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
> +drm_sched_rq_select_entity_fifo(struct drm_gpu_scheduler *sched,
> +				struct drm_sched_rq *rq)
>  {
>  	struct rb_node *rb;
>  
> @@ -246,6 +329,14 @@ drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
>  
>  		entity = rb_entry(rb, struct drm_sched_entity, rb_tree_node);
>  		if (drm_sched_entity_is_ready(entity)) {
> +			/* If we can't queue yet, preserve the current entity in
> +			 * terms of fairness.
> +			 */
> +			if (!drm_sched_can_queue(sched, entity)) {
> +				spin_unlock(&rq->lock);
> +				return ERR_PTR(-ENOSPC);
> +			}
> +

So, this kinda did an abrupt "return" in v2, then it was fixed fine in v3,
and now I see we return early with an error, which doesn't seem to be used
by anyone. In fact, in drm_sched_select_entity(), we do this,

-	return entity;
+	return IS_ERR(entity) ? NULL : entity;

So, it's perhaps best to leave this as it was in v3, and if in the future
we need to distinguish between types of errors, then that future patch
would do that and also show how this is used with new code and logic.

There is little value in over-engineering this right now, given that in
the future the logic may be more esoteric than we can think of.

>  			rq->current_entity = entity;
>  			reinit_completion(&entity->entity_idle);
>  			break;
> @@ -267,7 +358,7 @@ static void drm_sched_job_done(struct drm_sched_job *s_job, int result)
>  	struct drm_sched_fence *s_fence = s_job->s_fence;
>  	struct drm_gpu_scheduler *sched = s_fence->sched;
>  
> -	atomic_dec(&sched->hw_rq_count);
> +	atomic_sub(s_job->credits, &sched->credit_count);
>  	atomic_dec(sched->score);
>  
>  	trace_drm_sched_process_job(s_fence);
> @@ -468,7 +559,7 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>  					      &s_job->cb)) {
>  			dma_fence_put(s_job->s_fence->parent);
>  			s_job->s_fence->parent = NULL;
> -			atomic_dec(&sched->hw_rq_count);
> +			atomic_sub(s_job->credits, &sched->credit_count);
>  		} else {
>  			/*
>  			 * remove job from pending_list.
> @@ -529,7 +620,7 @@ void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery)
>  	list_for_each_entry_safe(s_job, tmp, &sched->pending_list, list) {
>  		struct dma_fence *fence = s_job->s_fence->parent;
>  
> -		atomic_inc(&sched->hw_rq_count);
> +		atomic_add(s_job->credits, &sched->credit_count);
>  
>  		if (!full_recovery)
>  			continue;
> @@ -613,6 +704,8 @@ EXPORT_SYMBOL(drm_sched_resubmit_jobs);
>   * drm_sched_job_init - init a scheduler job
>   * @job: scheduler job to init
>   * @entity: scheduler entity to use
> + * @credits: the number of credits this job contributes to the schedulers
> + * credit limit
>   * @owner: job owner for debugging
>   *
>   * Refer to drm_sched_entity_push_job() documentation
> @@ -630,7 +723,7 @@ EXPORT_SYMBOL(drm_sched_resubmit_jobs);
>   */
>  int drm_sched_job_init(struct drm_sched_job *job,
>  		       struct drm_sched_entity *entity,
> -		       void *owner)
> +		       u32 credits, void *owner)
>  {
>  	if (!entity->rq) {
>  		/* This will most likely be followed by missing frames
> @@ -647,6 +740,10 @@ int drm_sched_job_init(struct drm_sched_job *job,
>  		return -ENOMEM;
>  
>  	INIT_LIST_HEAD(&job->list);
> +	job->credits = credits;
> +
> +	drm_WARN(job->sched, !credits,
> +		 "Jobs with zero credits bypass job-flow control\n");
>  
>  	xa_init_flags(&job->dependencies, XA_FLAGS_ALLOC);
>  
> @@ -853,27 +950,17 @@ void drm_sched_job_cleanup(struct drm_sched_job *job)
>  }
>  EXPORT_SYMBOL(drm_sched_job_cleanup);
>  
> -/**
> - * drm_sched_can_queue -- Can we queue more to the hardware?
> - * @sched: scheduler instance
> - *
> - * Return true if we can push more jobs to the hw, otherwise false.
> - */
> -static bool drm_sched_can_queue(struct drm_gpu_scheduler *sched)
> -{
> -	return atomic_read(&sched->hw_rq_count) <
> -		sched->hw_submission_limit;
> -}
> -
>  /**
>   * drm_sched_wakeup_if_can_queue - Wake up the scheduler
>   * @sched: scheduler instance
> + * @entity: the scheduler entity
>   *
>   * Wake up the scheduler if we can queue jobs.
>   */
> -void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched)
> +void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched,
> +				   struct drm_sched_entity *entity)
>  {
> -	if (drm_sched_can_queue(sched))
> +	if (drm_sched_can_queue(sched, entity))
>  		wake_up_interruptible(&sched->wake_up_worker);
>  }
>  
> @@ -890,19 +977,16 @@ drm_sched_select_entity(struct drm_gpu_scheduler *sched)
>  	struct drm_sched_entity *entity;
>  	int i;
>  
> -	if (!drm_sched_can_queue(sched))
> -		return NULL;
> -
>  	/* Kernel run queue has higher priority than normal run queue*/
>  	for (i = sched->num_rqs - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
>  		entity = drm_sched_policy == DRM_SCHED_POLICY_FIFO ?
> -			drm_sched_rq_select_entity_fifo(sched->sched_rq[i]) :
> -			drm_sched_rq_select_entity_rr(sched->sched_rq[i]);
> +			drm_sched_rq_select_entity_fifo(sched, sched->sched_rq[i]) :
> +			drm_sched_rq_select_entity_rr(sched, sched->sched_rq[i]);
>  		if (entity)
>  			break;
>  	}
>  
> -	return entity;
> +	return IS_ERR(entity) ? NULL : entity;

Let's leave this alone as we did in v3. The smaller the diff-stat of this
patch, while still achieving the desired behaviour, the better the code.

>  }
>  
>  /**
> @@ -1043,7 +1127,7 @@ static int drm_sched_main(void *param)
>  
>  		s_fence = sched_job->s_fence;
>  
> -		atomic_inc(&sched->hw_rq_count);
> +		atomic_add(sched_job->credits, &sched->credit_count);
>  		drm_sched_job_begin(sched_job);
>  
>  		trace_drm_run_job(sched_job, entity);
> @@ -1078,7 +1162,7 @@ static int drm_sched_main(void *param)
>   * @sched: scheduler instance
>   * @ops: backend operations for this scheduler
>   * @num_rqs: number of runqueues, one for each priority, up to DRM_SCHED_PRIORITY_COUNT
> - * @hw_submission: number of hw submissions that can be in flight
> + * @credit_limit: the number of credits this scheduler can hold from all jobs
>   * @hang_limit: number of times to allow a job to hang before dropping it
>   * @timeout: timeout value in jiffies for the scheduler
>   * @timeout_wq: workqueue to use for timeout work. If NULL, the system_wq is
> @@ -1091,14 +1175,14 @@ static int drm_sched_main(void *param)
>   */
>  int drm_sched_init(struct drm_gpu_scheduler *sched,
>  		   const struct drm_sched_backend_ops *ops,
> -		   u32 num_rqs, uint32_t hw_submission, unsigned int hang_limit,
> +		   u32 num_rqs, u32 credit_limit, unsigned int hang_limit,
>  		   long timeout, struct workqueue_struct *timeout_wq,
>  		   atomic_t *score, const char *name, struct device *dev)
>  {
>  	int i, ret;
>  
>  	sched->ops = ops;
> -	sched->hw_submission_limit = hw_submission;
> +	sched->credit_limit = credit_limit;
>  	sched->name = name;
>  	sched->timeout = timeout;
>  	sched->timeout_wq = timeout_wq ? : system_wq;
> @@ -1140,7 +1224,7 @@ int drm_sched_init(struct drm_gpu_scheduler *sched,
>  	init_waitqueue_head(&sched->job_scheduled);
>  	INIT_LIST_HEAD(&sched->pending_list);
>  	spin_lock_init(&sched->job_list_lock);
> -	atomic_set(&sched->hw_rq_count, 0);
> +	atomic_set(&sched->credit_count, 0);
>  	INIT_DELAYED_WORK(&sched->work_tdr, drm_sched_job_timedout);
>  	atomic_set(&sched->_score, 0);
>  	atomic64_set(&sched->job_id_count, 0);
> diff --git a/drivers/gpu/drm/v3d/v3d_gem.c b/drivers/gpu/drm/v3d/v3d_gem.c
> index afa7d170d1ff..8dc5a7988488 100644
> --- a/drivers/gpu/drm/v3d/v3d_gem.c
> +++ b/drivers/gpu/drm/v3d/v3d_gem.c
> @@ -417,7 +417,7 @@ v3d_job_init(struct v3d_dev *v3d, struct drm_file *file_priv,
>  	job->free = free;
>  
>  	ret = drm_sched_job_init(&job->base, &v3d_priv->sched_entity[queue],
> -				 v3d_priv);
> +				 1, v3d_priv);
>  	if (ret)
>  		goto fail;
>  
> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> index 653e0eec9743..8bafa81300b2 100644
> --- a/include/drm/gpu_scheduler.h
> +++ b/include/drm/gpu_scheduler.h
> @@ -321,6 +321,7 @@ struct drm_sched_fence *to_drm_sched_fence(struct dma_fence *f);
>   * @sched: the scheduler instance on which this job is scheduled.
>   * @s_fence: contains the fences for the scheduling of job.
>   * @finish_cb: the callback for the finished fence.
> + * @credits: the number of credits this job contributes to the scheduler
>   * @work: Helper to reschdeule job kill to different context.
>   * @id: a unique id assigned to each job scheduled on the scheduler.
>   * @karma: increment on every hang caused by this job. If this exceeds the hang
> @@ -340,6 +341,8 @@ struct drm_sched_job {
>  	struct drm_gpu_scheduler	*sched;
>  	struct drm_sched_fence		*s_fence;
>  
> +	u32				credits;
> +
>  	/*
>  	 * work is used only after finish_cb has been used and will not be
>  	 * accessed anymore.
> @@ -463,13 +466,27 @@ struct drm_sched_backend_ops {
>           * and it's time to clean it up.
>  	 */
>  	void (*free_job)(struct drm_sched_job *sched_job);
> +
> +	/**
> +	 * @update_job_credits: Called once the scheduler is considering this
> +	 * job for execution.

"once" --> "when", a close cousin of "once", but clearer in code comments.
It is called in fact as many times as the job is considered to be pushed down
to hardware, if we couldn't previously.

> +	 *
> +	 * This callback returns the number of credits this job would take if

Too many repetitions of "this". Instead of "this job", say "the job".

> +	 * pushed to the hardware. Drivers may use this to dynamically update
> +	 * the job's credit count. For instance, deduct the number of credits
> +	 * for already signalled native fences.
> +	 *
> +	 * This callback is optional.
> +	 */

Yeah, this is really good and well said. It's very clear and lucid. Thanks
for updating it.

> +	u32 (*update_job_credits)(struct drm_sched_job *sched_job);
>  };
>  
>  /**
>   * struct drm_gpu_scheduler - scheduler instance-specific data
>   *
>   * @ops: backend operations provided by the driver.
> - * @hw_submission_limit: the max size of the hardware queue.
> + * @credit_limit: the credit limit of this scheduler
> + * @credit_count: the current credit count of this scheduler
>   * @timeout: the time after which a job is removed from the scheduler.
>   * @name: name of the ring for which this scheduler is being used.
>   * @num_rqs: Number of run-queues. This is at most DRM_SCHED_PRIORITY_COUNT,
> @@ -480,7 +497,6 @@ struct drm_sched_backend_ops {
>   * @job_scheduled: once @drm_sched_entity_do_release is called the scheduler
>   *                 waits on this wait queue until all the scheduled jobs are
>   *                 finished.
> - * @hw_rq_count: the number of jobs currently in the hardware queue.
>   * @job_id_count: used to assign unique id to the each job.
>   * @timeout_wq: workqueue used to queue @work_tdr
>   * @work_tdr: schedules a delayed call to @drm_sched_job_timedout after the
> @@ -500,14 +516,14 @@ struct drm_sched_backend_ops {
>   */
>  struct drm_gpu_scheduler {
>  	const struct drm_sched_backend_ops	*ops;
> -	uint32_t			hw_submission_limit;
> +	u32				credit_limit;
> +	atomic_t			credit_count;
>  	long				timeout;
>  	const char			*name;
>  	u32                             num_rqs;
>  	struct drm_sched_rq             **sched_rq;
>  	wait_queue_head_t		wake_up_worker;
>  	wait_queue_head_t		job_scheduled;
> -	atomic_t			hw_rq_count;
>  	atomic64_t			job_id_count;
>  	struct workqueue_struct		*timeout_wq;
>  	struct delayed_work		work_tdr;
> @@ -524,14 +540,14 @@ struct drm_gpu_scheduler {
>  
>  int drm_sched_init(struct drm_gpu_scheduler *sched,
>  		   const struct drm_sched_backend_ops *ops,
> -		   u32 num_rqs, uint32_t hw_submission, unsigned int hang_limit,
> +		   u32 num_rqs, u32 credit_limit, unsigned int hang_limit,
>  		   long timeout, struct workqueue_struct *timeout_wq,
>  		   atomic_t *score, const char *name, struct device *dev);
>  
>  void drm_sched_fini(struct drm_gpu_scheduler *sched);
>  int drm_sched_job_init(struct drm_sched_job *job,
>  		       struct drm_sched_entity *entity,
> -		       void *owner);
> +		       u32 credits, void *owner);
>  void drm_sched_job_arm(struct drm_sched_job *job);
>  int drm_sched_job_add_dependency(struct drm_sched_job *job,
>  				 struct dma_fence *fence);
> @@ -552,7 +568,8 @@ void drm_sched_entity_modify_sched(struct drm_sched_entity *entity,
>                                     unsigned int num_sched_list);
>  
>  void drm_sched_job_cleanup(struct drm_sched_job *job);
> -void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched);
> +void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched,
> +				   struct drm_sched_entity *entity);
>  void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad);
>  void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery);
>  void drm_sched_resubmit_jobs(struct drm_gpu_scheduler *sched);
> 
> base-commit: b2139fb5051554a7f297e4ad584ef1bc26c76d5d

Ah, thanks for adding base!

So, yeah--we're almost there!
Danilo Krummrich Nov. 1, 2023, 2:23 a.m. UTC | #6
Hi Luben,

On Tue, Oct 31, 2023 at 08:51:17PM -0400, Luben Tuikov wrote:
> Hi,
> 
> (PSA: luben.tuikov@amd.com should've bounced :-) I'm removing it from the To: field.)
> 
> On 2023-10-30 20:26, Danilo Krummrich wrote:
> > [snip]
> > 
> > @@ -246,6 +329,14 @@ drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
> >  
> >  		entity = rb_entry(rb, struct drm_sched_entity, rb_tree_node);
> >  		if (drm_sched_entity_is_ready(entity)) {
> > +			/* If we can't queue yet, preserve the current entity in
> > +			 * terms of fairness.
> > +			 */
> > +			if (!drm_sched_can_queue(sched, entity)) {
> > +				spin_unlock(&rq->lock);
> > +				return ERR_PTR(-ENOSPC);
> > +			}
> > +
> 
> So, this kinda did an abrupt "return" in v2, then it was fixed fine in v3,
> and now I see we return early with an error, which doesn't seem to be used
> by anyone. In fact, in drm_sched_select_entity(), we do this,
> 
> -	return entity;
> +	return IS_ERR(entity) ? NULL : entity;
> 
> So, it's perhaps best to leave this as it was in v3, and if in the future
> we need to distinguish between types of errors, then that future patch
> would do that and also show how this is used with new code and logic.
> 
> There is little value in over-engineering this right now, given that in
> the future the logic may be more esoteric than we can think of.

Ok, maybe what I do here is a bit subtle and requires a comment. Let me
explain.

The reason I return an ERR_PTR() instead of NULL is to indicate to
drm_sched_select_entity() that it should break out of the runqueue loop
(drm_sched_select_entity() breaks the loop when the returned entity is not
NULL).

Without that, we'd continue the runqueue loop in drm_sched_select_entity() and
retry with the next lower priority. This could lead to priority inversion
between the runqueues, since a lower priority runqueue with jobs requiring
fewer credits could stall a higher priority runqueue with jobs requiring more
credits.

Hence, in drm_sched_select_entity() we need to be able to distinguish between
the case where drm_sched_rq_select_entity_fifo()/drm_sched_rq_select_entity_rr()
can't find an entity at all and the case where they *can* find an entity, but
its job doesn't fit on the ring.

I think what makes it subtle in this patch is that in drm_sched_select_entity()
the existing condition already covers this:

	if (entity)
		break;

because we want to break out of this loop when we actually found an entity, or
when there is no space on the ring buffer, but keep checking the other
runqueues if entity is NULL.
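
To make the three cases explicit, here is a rough sketch of the resulting
loop semantics (select_entity_from_rq() is a made-up stand-in for the
fifo/rr variants, not code from the patch):

	for (i = sched->num_rqs - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
		entity = select_entity_from_rq(sched, sched->sched_rq[i]);

		/* Three possible outcomes:
		 *  - valid pointer: found a runnable entity whose job fits,
		 *    stop and run it
		 *  - ERR_PTR(-ENOSPC): an entity is ready, but its job does
		 *    not fit the available credits; stop without trying the
		 *    lower priority runqueues to avoid priority inversion
		 *  - NULL: nothing ready on this runqueue, try the next one
		 */
		if (entity)	/* true for valid pointers and ERR_PTR()s */
			break;
	}

	return IS_ERR(entity) ? NULL : entity;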

> 
> > [snip]
> > 
> > @@ -890,19 +977,16 @@ drm_sched_select_entity(struct drm_gpu_scheduler *sched)
> >  	struct drm_sched_entity *entity;
> >  	int i;
> >  
> > -	if (!drm_sched_can_queue(sched))
> > -		return NULL;
> > -
> >  	/* Kernel run queue has higher priority than normal run queue*/
> >  	for (i = sched->num_rqs - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
> >  		entity = drm_sched_policy == DRM_SCHED_POLICY_FIFO ?
> > -			drm_sched_rq_select_entity_fifo(sched->sched_rq[i]) :
> > -			drm_sched_rq_select_entity_rr(sched->sched_rq[i]);
> > +			drm_sched_rq_select_entity_fifo(sched, sched->sched_rq[i]) :
> > +			drm_sched_rq_select_entity_rr(sched, sched->sched_rq[i]);
> >  		if (entity)
> >  			break;
> >  	}
> >  
> > -	return entity;
> > +	return IS_ERR(entity) ? NULL : entity;
> 
> Let's leave this alone as we did in v3. The smaller the diff-stat of
> this patch while still achieving the desired behaviour--the better
> the code is.

See comment above.

> 
> >  }
> >  
> >  /**
> > @@ -1043,7 +1127,7 @@ static int drm_sched_main(void *param)
> >  
> >  		s_fence = sched_job->s_fence;
> >  
> > -		atomic_inc(&sched->hw_rq_count);
> > +		atomic_add(sched_job->credits, &sched->credit_count);
> >  		drm_sched_job_begin(sched_job);
> >  
> >  		trace_drm_run_job(sched_job, entity);
> > @@ -1078,7 +1162,7 @@ static int drm_sched_main(void *param)
> >   * @sched: scheduler instance
> >   * @ops: backend operations for this scheduler
> >   * @num_rqs: number of runqueues, one for each priority, up to DRM_SCHED_PRIORITY_COUNT
> > - * @hw_submission: number of hw submissions that can be in flight
> > + * @credit_limit: the number of credits this scheduler can hold from all jobs
> >   * @hang_limit: number of times to allow a job to hang before dropping it
> >   * @timeout: timeout value in jiffies for the scheduler
> >   * @timeout_wq: workqueue to use for timeout work. If NULL, the system_wq is
> > @@ -1091,14 +1175,14 @@ static int drm_sched_main(void *param)
> >   */
> >  int drm_sched_init(struct drm_gpu_scheduler *sched,
> >  		   const struct drm_sched_backend_ops *ops,
> > -		   u32 num_rqs, uint32_t hw_submission, unsigned int hang_limit,
> > +		   u32 num_rqs, u32 credit_limit, unsigned int hang_limit,
> >  		   long timeout, struct workqueue_struct *timeout_wq,
> >  		   atomic_t *score, const char *name, struct device *dev)
> >  {
> >  	int i, ret;
> >  
> >  	sched->ops = ops;
> > -	sched->hw_submission_limit = hw_submission;
> > +	sched->credit_limit = credit_limit;
> >  	sched->name = name;
> >  	sched->timeout = timeout;
> >  	sched->timeout_wq = timeout_wq ? : system_wq;
> > @@ -1140,7 +1224,7 @@ int drm_sched_init(struct drm_gpu_scheduler *sched,
> >  	init_waitqueue_head(&sched->job_scheduled);
> >  	INIT_LIST_HEAD(&sched->pending_list);
> >  	spin_lock_init(&sched->job_list_lock);
> > -	atomic_set(&sched->hw_rq_count, 0);
> > +	atomic_set(&sched->credit_count, 0);
> >  	INIT_DELAYED_WORK(&sched->work_tdr, drm_sched_job_timedout);
> >  	atomic_set(&sched->_score, 0);
> >  	atomic64_set(&sched->job_id_count, 0);
> > diff --git a/drivers/gpu/drm/v3d/v3d_gem.c b/drivers/gpu/drm/v3d/v3d_gem.c
> > index afa7d170d1ff..8dc5a7988488 100644
> > --- a/drivers/gpu/drm/v3d/v3d_gem.c
> > +++ b/drivers/gpu/drm/v3d/v3d_gem.c
> > @@ -417,7 +417,7 @@ v3d_job_init(struct v3d_dev *v3d, struct drm_file *file_priv,
> >  	job->free = free;
> >  
> >  	ret = drm_sched_job_init(&job->base, &v3d_priv->sched_entity[queue],
> > -				 v3d_priv);
> > +				 1, v3d_priv);
> >  	if (ret)
> >  		goto fail;
> >  
> > diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> > index 653e0eec9743..8bafa81300b2 100644
> > --- a/include/drm/gpu_scheduler.h
> > +++ b/include/drm/gpu_scheduler.h
> > @@ -321,6 +321,7 @@ struct drm_sched_fence *to_drm_sched_fence(struct dma_fence *f);
> >   * @sched: the scheduler instance on which this job is scheduled.
> >   * @s_fence: contains the fences for the scheduling of job.
> >   * @finish_cb: the callback for the finished fence.
> > + * @credits: the number of credits this job contributes to the scheduler
> >   * @work: Helper to reschdeule job kill to different context.
> >   * @id: a unique id assigned to each job scheduled on the scheduler.
> >   * @karma: increment on every hang caused by this job. If this exceeds the hang
> > @@ -340,6 +341,8 @@ struct drm_sched_job {
> >  	struct drm_gpu_scheduler	*sched;
> >  	struct drm_sched_fence		*s_fence;
> >  
> > +	u32				credits;
> > +
> >  	/*
> >  	 * work is used only after finish_cb has been used and will not be
> >  	 * accessed anymore.
> > @@ -463,13 +466,27 @@ struct drm_sched_backend_ops {
> >           * and it's time to clean it up.
> >  	 */
> >  	void (*free_job)(struct drm_sched_job *sched_job);
> > +
> > +	/**
> > +	 * @update_job_credits: Called once the scheduler is considering this
> > +	 * job for execution.
> 
> "once" --> "when", a close cousin of "once", but clearer in code comments.
> It is in fact called as many times as the job is considered for pushing
> down to the hardware, i.e. whenever a previous attempt couldn't queue it.

Sure, gonna change that.

> 
> > +	 *
> > +	 * This callback returns the number of credits this job would take if
> 
> Too many repetitions of "this". Instead of "this job", say "the job".

Gonna fix.
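
Just to illustrate the intended use, a driver-side implementation of the
callback could look roughly like the sketch below. The job layout, the
per-fence credit cost and the signalled-fence walk are hypothetical, not
taken from any driver in this series:

	/* Hypothetical driver job tracking its native fences. */
	struct foo_job {
		struct drm_sched_job base;
		u32 base_credits;		/* credits passed to drm_sched_job_init() */
		u32 credits_per_fence;		/* ring space reserved per fence wait */
		unsigned int num_fences;
		struct dma_fence **fences;	/* native fences of the job */
	};

	static u32 foo_update_job_credits(struct drm_sched_job *sched_job)
	{
		struct foo_job *job = container_of(sched_job, struct foo_job, base);
		u32 credits = job->base_credits;
		unsigned int i;

		/* Recompute from the initial estimate on every call: deduct
		 * the ring space reserved for native fence waits that have
		 * signalled meanwhile; keep at least one credit, since zero
		 * credits would bypass job-flow control.
		 */
		for (i = 0; i < job->num_fences; i++) {
			if (dma_fence_is_signaled(job->fences[i]) &&
			    credits > job->credits_per_fence)
				credits -= job->credits_per_fence;
		}

		return credits;
	}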

> 
> > +	 * pushed to the hardware. Drivers may use this to dynamically update
> > +	 * the job's credit count. For instance, deduct the number of credits
> > +	 * for already signalled native fences.
> > +	 *
> > +	 * This callback is optional.
> > +	 */
> 
> Yeah, this is really good and well said. It's very clear and lucid. Thanks
> for updating it.
> 
> > +	u32 (*update_job_credits)(struct drm_sched_job *sched_job);
> >  };
> >  
> >  /**
> >   * struct drm_gpu_scheduler - scheduler instance-specific data
> >   *
> >   * @ops: backend operations provided by the driver.
> > - * @hw_submission_limit: the max size of the hardware queue.
> > + * @credit_limit: the credit limit of this scheduler
> > + * @credit_count: the current credit count of this scheduler
> >   * @timeout: the time after which a job is removed from the scheduler.
> >   * @name: name of the ring for which this scheduler is being used.
> >   * @num_rqs: Number of run-queues. This is at most DRM_SCHED_PRIORITY_COUNT,
> > @@ -480,7 +497,6 @@ struct drm_sched_backend_ops {
> >   * @job_scheduled: once @drm_sched_entity_do_release is called the scheduler
> >   *                 waits on this wait queue until all the scheduled jobs are
> >   *                 finished.
> > - * @hw_rq_count: the number of jobs currently in the hardware queue.
> >   * @job_id_count: used to assign unique id to the each job.
> >   * @timeout_wq: workqueue used to queue @work_tdr
> >   * @work_tdr: schedules a delayed call to @drm_sched_job_timedout after the
> > @@ -500,14 +516,14 @@ struct drm_sched_backend_ops {
> >   */
> >  struct drm_gpu_scheduler {
> >  	const struct drm_sched_backend_ops	*ops;
> > -	uint32_t			hw_submission_limit;
> > +	u32				credit_limit;
> > +	atomic_t			credit_count;
> >  	long				timeout;
> >  	const char			*name;
> >  	u32                             num_rqs;
> >  	struct drm_sched_rq             **sched_rq;
> >  	wait_queue_head_t		wake_up_worker;
> >  	wait_queue_head_t		job_scheduled;
> > -	atomic_t			hw_rq_count;
> >  	atomic64_t			job_id_count;
> >  	struct workqueue_struct		*timeout_wq;
> >  	struct delayed_work		work_tdr;
> > @@ -524,14 +540,14 @@ struct drm_gpu_scheduler {
> >  
> >  int drm_sched_init(struct drm_gpu_scheduler *sched,
> >  		   const struct drm_sched_backend_ops *ops,
> > -		   u32 num_rqs, uint32_t hw_submission, unsigned int hang_limit,
> > +		   u32 num_rqs, u32 credit_limit, unsigned int hang_limit,
> >  		   long timeout, struct workqueue_struct *timeout_wq,
> >  		   atomic_t *score, const char *name, struct device *dev);
> >  
> >  void drm_sched_fini(struct drm_gpu_scheduler *sched);
> >  int drm_sched_job_init(struct drm_sched_job *job,
> >  		       struct drm_sched_entity *entity,
> > -		       void *owner);
> > +		       u32 credits, void *owner);
> >  void drm_sched_job_arm(struct drm_sched_job *job);
> >  int drm_sched_job_add_dependency(struct drm_sched_job *job,
> >  				 struct dma_fence *fence);
> > @@ -552,7 +568,8 @@ void drm_sched_entity_modify_sched(struct drm_sched_entity *entity,
> >                                     unsigned int num_sched_list);
> >  
> >  void drm_sched_job_cleanup(struct drm_sched_job *job);
> > -void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched);
> > +void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched,
> > +				   struct drm_sched_entity *entity);
> >  void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad);
> >  void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery);
> >  void drm_sched_resubmit_jobs(struct drm_gpu_scheduler *sched);
> > 
> > base-commit: b2139fb5051554a7f297e4ad584ef1bc26c76d5d
> 
> Ah, thanks for adding base!
> 
> So, yeah--we're almost there!
> -- 
> Regards,
> Luben
Luben Tuikov Nov. 1, 2023, 11:03 p.m. UTC | #7
On 2023-10-31 22:23, Danilo Krummrich wrote:
> Hi Luben,
> 

[snip]
>>> @@ -187,12 +251,14 @@ void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
>>>  /**
>>>   * drm_sched_rq_select_entity_rr - Select an entity which could provide a job to run
>>>   *
>>> + * @sched: the gpu scheduler
>>>   * @rq: scheduler run queue to check.
>>>   *
>>>   * Try to find a ready entity, returns NULL if none found.
>>>   */
>>>  static struct drm_sched_entity *
>>> -drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>>> +drm_sched_rq_select_entity_rr(struct drm_gpu_scheduler *sched,
>>> +			      struct drm_sched_rq *rq)
>>>  {
>>>  	struct drm_sched_entity *entity;
>>>  
>>> @@ -202,6 +268,14 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>>>  	if (entity) {
>>>  		list_for_each_entry_continue(entity, &rq->entities, list) {
>>>  			if (drm_sched_entity_is_ready(entity)) {
>>> +				/* If we can't queue yet, preserve the current
>>> +				 * entity in terms of fairness.
>>> +				 */
>>> +				if (!drm_sched_can_queue(sched, entity)) {
>>> +					spin_unlock(&rq->lock);
>>> +					return ERR_PTR(-ENOSPC);
>>> +				}
>>> +
>>>  				rq->current_entity = entity;
>>>  				reinit_completion(&entity->entity_idle);
>>>  				spin_unlock(&rq->lock);
>>> @@ -211,8 +285,15 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>>>  	}
>>>  
>>>  	list_for_each_entry(entity, &rq->entities, list) {
>>> -
>>>  		if (drm_sched_entity_is_ready(entity)) {
>>> +			/* If we can't queue yet, preserve the current entity in
>>> +			 * terms of fairness.
>>> +			 */
>>> +			if (!drm_sched_can_queue(sched, entity)) {
>>> +				spin_unlock(&rq->lock);
>>> +				return ERR_PTR(-ENOSPC);
>>> +			}
>>> +
>>>  			rq->current_entity = entity;
>>>  			reinit_completion(&entity->entity_idle);
>>>  			spin_unlock(&rq->lock);
>>> @@ -231,12 +312,14 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>>>  /**
>>>   * drm_sched_rq_select_entity_fifo - Select an entity which provides a job to run
>>>   *
>>> + * @sched: the gpu scheduler
>>>   * @rq: scheduler run queue to check.
>>>   *
>>>   * Find oldest waiting ready entity, returns NULL if none found.
>>>   */
>>>  static struct drm_sched_entity *
>>> -drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
>>> +drm_sched_rq_select_entity_fifo(struct drm_gpu_scheduler *sched,
>>> +				struct drm_sched_rq *rq)
>>>  {
>>>  	struct rb_node *rb;
>>>  
>>> @@ -246,6 +329,14 @@ drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
>>>  
>>>  		entity = rb_entry(rb, struct drm_sched_entity, rb_tree_node);
>>>  		if (drm_sched_entity_is_ready(entity)) {
>>> +			/* If we can't queue yet, preserve the current entity in
>>> +			 * terms of fairness.
>>> +			 */
>>> +			if (!drm_sched_can_queue(sched, entity)) {
>>> +				spin_unlock(&rq->lock);
>>> +				return ERR_PTR(-ENOSPC);
>>> +			}
>>> +
>>
>> So, this kinda did this abrupt "return" in v2, then it was fixed fine in v3,
>> and now I see we return early with an error, which doesn't seem to be used
>> by anyone. In fact, in drm_sched_select_entity(), we do this,
>>
>> -	return entity;
>> +	return IS_ERR(entity) ? NULL : entity;
>>
>> So, it's perhaps best to leave this as it was in v3, and if in the future
>> we need to distinguish between the type of error, then that future patch
>> would do that and also show how this is used with new code and logic.
>>
>> There is little value to over-engineer this right now, given that in
>> the future, the logic may be more esoteric than we can think of.
> 
> Ok, maybe what I do here is a little bit subtle and requires a comment. Let me
> explain.
> 
> The reason I return an ERR_PTR() instead of NULL is to indicate to
> drm_sched_select_entity() to break out of the runqueue loop
> (drm_sched_select_entity() breaks the loop when the returned entity is not
> NULL).
> 
> Without that, we'd continue the runqueue loop in drm_sched_select_entity() and
> retry with the next lower priority. This could lead to priority inversion
> between the runqueues, since a lower priority runqueue with jobs requiring
> fewer credits could stall a higher priority runqueue with jobs requiring
> more credits.
> 
> Hence, in drm_sched_select_entity() we need to be able to distinguish between
> the case where drm_sched_rq_select_entity_fifo()/drm_sched_rq_select_entity_rr()
> can't find an entity at all and the case where they *can* find an entity, but
> its job doesn't fit on the ring.
> 
> I think what makes it subtle in this patch is that the existing condition in
> drm_sched_select_entity(),
> 
> 	if (entity)
> 		break;
> 
> already covers both cases: we want to break out of this loop when we actually
> found an entity or when there is no space left on the ring buffer, but keep
> checking the other runqueues as long as entity is NULL.

Okay, that's fine, but please update the head comment of
drm_sched_rq_select_entity_{rr,fifo}() and of 
drm_sched_select_entity() to explain what is being returned.

This invariably adds to the documentation which some want added to DRM--let's
all start by documenting the code we add ourselves.

I'd imagine this would look something like the following, for each of
_{rr,fifo} respectively (remove the content in braces {}; four occurrences),

/**
 * drm_sched_rq_select_entity_{LT: rr,fifo} - Select an entity which provides a job to run
 * @sched: the gpu scheduler
 * @rq: scheduler run queue to check.
 *
 * Try to find a ready entity, returns NULL if none found. {LT: RR}
 * Find oldest waiting ready entity, returns NULL if none found. {LT: FIFO}
 *
 * Return an entity if one is found; return an error-pointer (!NULL) if an entity
 * was ready, but the scheduler had insufficient credits to accommodate its job;
 * return NULL if no ready entity was found. {LT: for both RR and FIFO.}
 */

And,

/**
 * drm_sched_select_entity - Select the next entity to process
 * @sched: the scheduler instance
 *
 * Return an entity to process or NULL if none are found.
 *
 * Note that we break out of the for-loop when "entity"
 * is non-null, which can also be an error-pointer--this ensures
 * we don't process lower priority run-queues. See the comments
 * in the respective entity-selection functions.
 */

[snip]
>>> +
>>> +	/**
>>> +	 * @update_job_credits: Called once the scheduler is considering this
>>> +	 * job for execution.
>>
>> "once" --> "when", a close cousin of "once", but clearer in code comments.
>> It is in fact called as many times as the job is considered for pushing
>> down to the hardware, i.e. whenever a previous attempt couldn't queue it.
> 
> Sure, gonna change that.
> 
>>
>>> +	 *
>>> +	 * This callback returns the number of credits this job would take if
>>
>> Too many repetitions of "this". Instead of "this job", say "the job".
> 
> Gonna fix.

Great.

Could you please rebase your patch on top of drm-misc-next--Matthew's Xe changes
went in this afternoon.

Thanks!
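
To make the credit accounting documented in the patch below concrete, here
is a toy walkthrough with made-up numbers, assuming a scheduler initialized
with a credit limit of 8:

	/*
	 * credit_limit = 8, credit_count = 0
	 *
	 * run job A (5 credits)  -> credit_count = 5
	 * run job B (3 credits)  -> credit_count = 8
	 * job C (2 credits) doesn't fit (8 + 2 > 8), the scheduler waits
	 * job A finishes         -> credit_count = 3
	 * run job C              -> credit_count = 5
	 */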

Patch

diff --git a/Documentation/gpu/drm-mm.rst b/Documentation/gpu/drm-mm.rst
index 602010cb6894..acc5901ac840 100644
--- a/Documentation/gpu/drm-mm.rst
+++ b/Documentation/gpu/drm-mm.rst
@@ -552,6 +552,12 @@  Overview
 .. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
    :doc: Overview
 
+Flow Control
+------------
+
+.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
+   :doc: Flow Control
+
 Scheduler Function References
 -----------------------------
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index 1f357198533f..62bb7fc7448a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -115,7 +115,7 @@  int amdgpu_job_alloc(struct amdgpu_device *adev, struct amdgpu_vm *vm,
 	if (!entity)
 		return 0;
 
-	return drm_sched_job_init(&(*job)->base, entity, owner);
+	return drm_sched_job_init(&(*job)->base, entity, 1, owner);
 }
 
 int amdgpu_job_alloc_with_ib(struct amdgpu_device *adev,
diff --git a/drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c b/drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c
index 2416c526f9b0..3d0f8d182506 100644
--- a/drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c
+++ b/drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c
@@ -535,7 +535,7 @@  int etnaviv_ioctl_gem_submit(struct drm_device *dev, void *data,
 
 	ret = drm_sched_job_init(&submit->sched_job,
 				 &ctx->sched_entity[args->pipe],
-				 submit->ctx);
+				 1, submit->ctx);
 	if (ret)
 		goto err_submit_put;
 
diff --git a/drivers/gpu/drm/lima/lima_sched.c b/drivers/gpu/drm/lima/lima_sched.c
index 295f0353a02e..0febe8ceb43b 100644
--- a/drivers/gpu/drm/lima/lima_sched.c
+++ b/drivers/gpu/drm/lima/lima_sched.c
@@ -123,7 +123,7 @@  int lima_sched_task_init(struct lima_sched_task *task,
 	for (i = 0; i < num_bos; i++)
 		drm_gem_object_get(&bos[i]->base.base);
 
-	err = drm_sched_job_init(&task->base, &context->base, vm);
+	err = drm_sched_job_init(&task->base, &context->base, 1, vm);
 	if (err) {
 		kfree(task->bos);
 		return err;
diff --git a/drivers/gpu/drm/msm/msm_gem_submit.c b/drivers/gpu/drm/msm/msm_gem_submit.c
index 99744de6c05a..c002cabe7b9c 100644
--- a/drivers/gpu/drm/msm/msm_gem_submit.c
+++ b/drivers/gpu/drm/msm/msm_gem_submit.c
@@ -48,7 +48,7 @@  static struct msm_gem_submit *submit_create(struct drm_device *dev,
 		return ERR_PTR(ret);
 	}
 
-	ret = drm_sched_job_init(&submit->base, queue->entity, queue);
+	ret = drm_sched_job_init(&submit->base, queue->entity, 1, queue);
 	if (ret) {
 		kfree(submit->hw_fence);
 		kfree(submit);
diff --git a/drivers/gpu/drm/nouveau/nouveau_sched.c b/drivers/gpu/drm/nouveau/nouveau_sched.c
index e62b04109b2f..2a845f9c34ec 100644
--- a/drivers/gpu/drm/nouveau/nouveau_sched.c
+++ b/drivers/gpu/drm/nouveau/nouveau_sched.c
@@ -89,7 +89,7 @@  nouveau_job_init(struct nouveau_job *job,
 
 	}
 
-	ret = drm_sched_job_init(&job->base, &entity->base, NULL);
+	ret = drm_sched_job_init(&job->base, &entity->base, 1, NULL);
 	if (ret)
 		goto err_free_chains;
 
diff --git a/drivers/gpu/drm/panfrost/panfrost_drv.c b/drivers/gpu/drm/panfrost/panfrost_drv.c
index b834777b409b..54d1c19bea84 100644
--- a/drivers/gpu/drm/panfrost/panfrost_drv.c
+++ b/drivers/gpu/drm/panfrost/panfrost_drv.c
@@ -274,7 +274,7 @@  static int panfrost_ioctl_submit(struct drm_device *dev, void *data,
 
 	ret = drm_sched_job_init(&job->base,
 				 &file_priv->sched_entity[slot],
-				 NULL);
+				 1, NULL);
 	if (ret)
 		goto out_put_job;
 
diff --git a/drivers/gpu/drm/scheduler/gpu_scheduler_trace.h b/drivers/gpu/drm/scheduler/gpu_scheduler_trace.h
index 3143ecaaff86..f8ed093b7356 100644
--- a/drivers/gpu/drm/scheduler/gpu_scheduler_trace.h
+++ b/drivers/gpu/drm/scheduler/gpu_scheduler_trace.h
@@ -51,7 +51,7 @@  DECLARE_EVENT_CLASS(drm_sched_job,
 			   __assign_str(name, sched_job->sched->name);
 			   __entry->job_count = spsc_queue_count(&entity->job_queue);
 			   __entry->hw_job_count = atomic_read(
-				   &sched_job->sched->hw_rq_count);
+				   &sched_job->sched->credit_count);
 			   ),
 	    TP_printk("entity=%p, id=%llu, fence=%p, ring=%s, job count:%u, hw job count:%d",
 		      __entry->entity, __entry->id,
diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
index 409e4256f6e7..b79e0672b94f 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -370,7 +370,7 @@  static void drm_sched_entity_wakeup(struct dma_fence *f,
 		container_of(cb, struct drm_sched_entity, cb);
 
 	drm_sched_entity_clear_dep(f, cb);
-	drm_sched_wakeup_if_can_queue(entity->rq->sched);
+	drm_sched_wakeup_if_can_queue(entity->rq->sched, entity);
 }
 
 /**
@@ -602,7 +602,7 @@  void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
 		if (drm_sched_policy == DRM_SCHED_POLICY_FIFO)
 			drm_sched_rq_update_fifo(entity, submit_ts);
 
-		drm_sched_wakeup_if_can_queue(entity->rq->sched);
+		drm_sched_wakeup_if_can_queue(entity->rq->sched, entity);
 	}
 }
 EXPORT_SYMBOL(drm_sched_entity_push_job);
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index fd755e953487..5dfde149441a 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -48,6 +48,30 @@ 
  * through the jobs entity pointer.
  */
 
+/**
+ * DOC: Flow Control
+ *
+ * The DRM GPU scheduler provides a flow control mechanism to regulate the rate
+ * at which the jobs fetched from scheduler entities are executed.
+ *
+ * In this context the &drm_gpu_scheduler keeps track of a driver-specified
+ * credit limit representing the capacity of this scheduler and a credit count;
+ * every &drm_sched_job carries a driver-specified number of credits.
+ *
+ * Once a job is executed (but not yet finished), the job's credits contribute
+ * to the scheduler's credit count until the job is finished. If by executing
+ * one more job the scheduler's credit count would exceed the scheduler's
+ * credit limit, the job won't be executed. Instead, the scheduler will wait
+ * until the credit count has decreased enough to not overflow its credit limit.
+ * This implies waiting for previously executed jobs.
+ *
+ * Optionally, drivers may register a callback (update_job_credits) provided by
+ * struct drm_sched_backend_ops to update the job's credits dynamically. The
+ * scheduler executes this callback every time the scheduler considers a job for
+ * execution and subsequently checks whether the job fits the scheduler's credit
+ * limit.
+ */
+
 #include <linux/kthread.h>
 #include <linux/wait.h>
 #include <linux/sched.h>
@@ -76,6 +100,46 @@  int drm_sched_policy = DRM_SCHED_POLICY_FIFO;
 MODULE_PARM_DESC(sched_policy, "Specify the scheduling policy for entities on a run-queue, " __stringify(DRM_SCHED_POLICY_RR) " = Round Robin, " __stringify(DRM_SCHED_POLICY_FIFO) " = FIFO (default).");
 module_param_named(sched_policy, drm_sched_policy, int, 0444);
 
+static u32 drm_sched_available_credits(struct drm_gpu_scheduler *sched)
+{
+	u32 credits;
+
+	drm_WARN_ON(sched, check_sub_overflow(sched->credit_limit,
+					      atomic_read(&sched->credit_count),
+					      &credits));
+
+	return credits;
+}
+
+/**
+ * drm_sched_can_queue -- Can we queue more to the hardware?
+ * @sched: scheduler instance
+ * @entity: the scheduler entity
+ *
+ * Return true if we can push at least one more job from @entity, false
+ * otherwise.
+ */
+static bool drm_sched_can_queue(struct drm_gpu_scheduler *sched,
+				struct drm_sched_entity *entity)
+{
+	struct drm_sched_job *s_job;
+
+	s_job = to_drm_sched_job(spsc_queue_peek(&entity->job_queue));
+	if (!s_job)
+		return false;
+
+	if (sched->ops->update_job_credits) {
+		s_job->credits = sched->ops->update_job_credits(s_job);
+
+		drm_WARN(sched, !s_job->credits,
+			 "Jobs with zero credits bypass job-flow control\n");
+	}
+
+	drm_WARN_ON(sched, s_job->credits > sched->credit_limit);
+
+	return drm_sched_available_credits(sched) >= s_job->credits;
+}
+
 static __always_inline bool drm_sched_entity_compare_before(struct rb_node *a,
 							    const struct rb_node *b)
 {
@@ -187,12 +251,14 @@  void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
 /**
  * drm_sched_rq_select_entity_rr - Select an entity which could provide a job to run
  *
+ * @sched: the gpu scheduler
  * @rq: scheduler run queue to check.
  *
  * Try to find a ready entity, returns NULL if none found.
  */
 static struct drm_sched_entity *
-drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
+drm_sched_rq_select_entity_rr(struct drm_gpu_scheduler *sched,
+			      struct drm_sched_rq *rq)
 {
 	struct drm_sched_entity *entity;
 
@@ -202,6 +268,14 @@  drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
 	if (entity) {
 		list_for_each_entry_continue(entity, &rq->entities, list) {
 			if (drm_sched_entity_is_ready(entity)) {
+				/* If we can't queue yet, preserve the current
+				 * entity in terms of fairness.
+				 */
+				if (!drm_sched_can_queue(sched, entity)) {
+					spin_unlock(&rq->lock);
+					return ERR_PTR(-ENOSPC);
+				}
+
 				rq->current_entity = entity;
 				reinit_completion(&entity->entity_idle);
 				spin_unlock(&rq->lock);
@@ -211,8 +285,15 @@  drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
 	}
 
 	list_for_each_entry(entity, &rq->entities, list) {
-
 		if (drm_sched_entity_is_ready(entity)) {
+			/* If we can't queue yet, preserve the current entity in
+			 * terms of fairness.
+			 */
+			if (!drm_sched_can_queue(sched, entity)) {
+				spin_unlock(&rq->lock);
+				return ERR_PTR(-ENOSPC);
+			}
+
 			rq->current_entity = entity;
 			reinit_completion(&entity->entity_idle);
 			spin_unlock(&rq->lock);
@@ -231,12 +312,14 @@  drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
 /**
  * drm_sched_rq_select_entity_fifo - Select an entity which provides a job to run
  *
+ * @sched: the gpu scheduler
  * @rq: scheduler run queue to check.
  *
  * Find oldest waiting ready entity, returns NULL if none found.
  */
 static struct drm_sched_entity *
-drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
+drm_sched_rq_select_entity_fifo(struct drm_gpu_scheduler *sched,
+				struct drm_sched_rq *rq)
 {
 	struct rb_node *rb;
 
@@ -246,6 +329,14 @@  drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
 
 		entity = rb_entry(rb, struct drm_sched_entity, rb_tree_node);
 		if (drm_sched_entity_is_ready(entity)) {
+			/* If we can't queue yet, preserve the current entity in
+			 * terms of fairness.
+			 */
+			if (!drm_sched_can_queue(sched, entity)) {
+				spin_unlock(&rq->lock);
+				return ERR_PTR(-ENOSPC);
+			}
+
 			rq->current_entity = entity;
 			reinit_completion(&entity->entity_idle);
 			break;
@@ -267,7 +358,7 @@  static void drm_sched_job_done(struct drm_sched_job *s_job, int result)
 	struct drm_sched_fence *s_fence = s_job->s_fence;
 	struct drm_gpu_scheduler *sched = s_fence->sched;
 
-	atomic_dec(&sched->hw_rq_count);
+	atomic_sub(s_job->credits, &sched->credit_count);
 	atomic_dec(sched->score);
 
 	trace_drm_sched_process_job(s_fence);
@@ -468,7 +559,7 @@  void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
 					      &s_job->cb)) {
 			dma_fence_put(s_job->s_fence->parent);
 			s_job->s_fence->parent = NULL;
-			atomic_dec(&sched->hw_rq_count);
+			atomic_sub(s_job->credits, &sched->credit_count);
 		} else {
 			/*
 			 * remove job from pending_list.
@@ -529,7 +620,7 @@  void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery)
 	list_for_each_entry_safe(s_job, tmp, &sched->pending_list, list) {
 		struct dma_fence *fence = s_job->s_fence->parent;
 
-		atomic_inc(&sched->hw_rq_count);
+		atomic_add(s_job->credits, &sched->credit_count);
 
 		if (!full_recovery)
 			continue;
@@ -613,6 +704,8 @@  EXPORT_SYMBOL(drm_sched_resubmit_jobs);
  * drm_sched_job_init - init a scheduler job
  * @job: scheduler job to init
  * @entity: scheduler entity to use
+ * @credits: the number of credits this job contributes to the scheduler's
+ * credit limit
  * @owner: job owner for debugging
  *
  * Refer to drm_sched_entity_push_job() documentation
@@ -630,7 +723,7 @@  EXPORT_SYMBOL(drm_sched_resubmit_jobs);
  */
 int drm_sched_job_init(struct drm_sched_job *job,
 		       struct drm_sched_entity *entity,
-		       void *owner)
+		       u32 credits, void *owner)
 {
 	if (!entity->rq) {
 		/* This will most likely be followed by missing frames
@@ -647,6 +740,10 @@  int drm_sched_job_init(struct drm_sched_job *job,
 		return -ENOMEM;
 
 	INIT_LIST_HEAD(&job->list);
+	job->credits = credits;
+
+	drm_WARN(job->sched, !credits,
+		 "Jobs with zero credits bypass job-flow control\n");
 
 	xa_init_flags(&job->dependencies, XA_FLAGS_ALLOC);
 
@@ -853,27 +950,17 @@  void drm_sched_job_cleanup(struct drm_sched_job *job)
 }
 EXPORT_SYMBOL(drm_sched_job_cleanup);
 
-/**
- * drm_sched_can_queue -- Can we queue more to the hardware?
- * @sched: scheduler instance
- *
- * Return true if we can push more jobs to the hw, otherwise false.
- */
-static bool drm_sched_can_queue(struct drm_gpu_scheduler *sched)
-{
-	return atomic_read(&sched->hw_rq_count) <
-		sched->hw_submission_limit;
-}
-
 /**
  * drm_sched_wakeup_if_can_queue - Wake up the scheduler
  * @sched: scheduler instance
+ * @entity: the scheduler entity
  *
  * Wake up the scheduler if we can queue jobs.
  */
-void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched)
+void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched,
+				   struct drm_sched_entity *entity)
 {
-	if (drm_sched_can_queue(sched))
+	if (drm_sched_can_queue(sched, entity))
 		wake_up_interruptible(&sched->wake_up_worker);
 }
 
@@ -890,19 +977,16 @@  drm_sched_select_entity(struct drm_gpu_scheduler *sched)
 	struct drm_sched_entity *entity;
 	int i;
 
-	if (!drm_sched_can_queue(sched))
-		return NULL;
-
 	/* Kernel run queue has higher priority than normal run queue*/
 	for (i = sched->num_rqs - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
 		entity = drm_sched_policy == DRM_SCHED_POLICY_FIFO ?
-			drm_sched_rq_select_entity_fifo(sched->sched_rq[i]) :
-			drm_sched_rq_select_entity_rr(sched->sched_rq[i]);
+			drm_sched_rq_select_entity_fifo(sched, sched->sched_rq[i]) :
+			drm_sched_rq_select_entity_rr(sched, sched->sched_rq[i]);
 		if (entity)
 			break;
 	}
 
-	return entity;
+	return IS_ERR(entity) ? NULL : entity;
 }
 
 /**
@@ -1043,7 +1127,7 @@  static int drm_sched_main(void *param)
 
 		s_fence = sched_job->s_fence;
 
-		atomic_inc(&sched->hw_rq_count);
+		atomic_add(sched_job->credits, &sched->credit_count);
 		drm_sched_job_begin(sched_job);
 
 		trace_drm_run_job(sched_job, entity);
@@ -1078,7 +1162,7 @@  static int drm_sched_main(void *param)
  * @sched: scheduler instance
  * @ops: backend operations for this scheduler
  * @num_rqs: number of runqueues, one for each priority, up to DRM_SCHED_PRIORITY_COUNT
- * @hw_submission: number of hw submissions that can be in flight
+ * @credit_limit: the number of credits this scheduler can hold from all jobs
  * @hang_limit: number of times to allow a job to hang before dropping it
  * @timeout: timeout value in jiffies for the scheduler
  * @timeout_wq: workqueue to use for timeout work. If NULL, the system_wq is
@@ -1091,14 +1175,14 @@  static int drm_sched_main(void *param)
  */
 int drm_sched_init(struct drm_gpu_scheduler *sched,
 		   const struct drm_sched_backend_ops *ops,
-		   u32 num_rqs, uint32_t hw_submission, unsigned int hang_limit,
+		   u32 num_rqs, u32 credit_limit, unsigned int hang_limit,
 		   long timeout, struct workqueue_struct *timeout_wq,
 		   atomic_t *score, const char *name, struct device *dev)
 {
 	int i, ret;
 
 	sched->ops = ops;
-	sched->hw_submission_limit = hw_submission;
+	sched->credit_limit = credit_limit;
 	sched->name = name;
 	sched->timeout = timeout;
 	sched->timeout_wq = timeout_wq ? : system_wq;
@@ -1140,7 +1224,7 @@  int drm_sched_init(struct drm_gpu_scheduler *sched,
 	init_waitqueue_head(&sched->job_scheduled);
 	INIT_LIST_HEAD(&sched->pending_list);
 	spin_lock_init(&sched->job_list_lock);
-	atomic_set(&sched->hw_rq_count, 0);
+	atomic_set(&sched->credit_count, 0);
 	INIT_DELAYED_WORK(&sched->work_tdr, drm_sched_job_timedout);
 	atomic_set(&sched->_score, 0);
 	atomic64_set(&sched->job_id_count, 0);
diff --git a/drivers/gpu/drm/v3d/v3d_gem.c b/drivers/gpu/drm/v3d/v3d_gem.c
index afa7d170d1ff..8dc5a7988488 100644
--- a/drivers/gpu/drm/v3d/v3d_gem.c
+++ b/drivers/gpu/drm/v3d/v3d_gem.c
@@ -417,7 +417,7 @@  v3d_job_init(struct v3d_dev *v3d, struct drm_file *file_priv,
 	job->free = free;
 
 	ret = drm_sched_job_init(&job->base, &v3d_priv->sched_entity[queue],
-				 v3d_priv);
+				 1, v3d_priv);
 	if (ret)
 		goto fail;
 
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index 653e0eec9743..8bafa81300b2 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -321,6 +321,7 @@  struct drm_sched_fence *to_drm_sched_fence(struct dma_fence *f);
  * @sched: the scheduler instance on which this job is scheduled.
  * @s_fence: contains the fences for the scheduling of job.
  * @finish_cb: the callback for the finished fence.
+ * @credits: the number of credits this job contributes to the scheduler
  * @work: Helper to reschdeule job kill to different context.
  * @id: a unique id assigned to each job scheduled on the scheduler.
  * @karma: increment on every hang caused by this job. If this exceeds the hang
@@ -340,6 +341,8 @@  struct drm_sched_job {
 	struct drm_gpu_scheduler	*sched;
 	struct drm_sched_fence		*s_fence;
 
+	u32				credits;
+
 	/*
 	 * work is used only after finish_cb has been used and will not be
 	 * accessed anymore.
@@ -463,13 +466,27 @@  struct drm_sched_backend_ops {
          * and it's time to clean it up.
 	 */
 	void (*free_job)(struct drm_sched_job *sched_job);
+
+	/**
+	 * @update_job_credits: Called once the scheduler is considering this
+	 * job for execution.
+	 *
+	 * This callback returns the number of credits this job would take if
+	 * pushed to the hardware. Drivers may use this to dynamically update
+	 * the job's credit count. For instance, deduct the number of credits
+	 * for already signalled native fences.
+	 *
+	 * This callback is optional.
+	 */
+	u32 (*update_job_credits)(struct drm_sched_job *sched_job);
 };
 
 /**
  * struct drm_gpu_scheduler - scheduler instance-specific data
  *
  * @ops: backend operations provided by the driver.
- * @hw_submission_limit: the max size of the hardware queue.
+ * @credit_limit: the credit limit of this scheduler
+ * @credit_count: the current credit count of this scheduler
  * @timeout: the time after which a job is removed from the scheduler.
  * @name: name of the ring for which this scheduler is being used.
  * @num_rqs: Number of run-queues. This is at most DRM_SCHED_PRIORITY_COUNT,
@@ -480,7 +497,6 @@  struct drm_sched_backend_ops {
  * @job_scheduled: once @drm_sched_entity_do_release is called the scheduler
  *                 waits on this wait queue until all the scheduled jobs are
  *                 finished.
- * @hw_rq_count: the number of jobs currently in the hardware queue.
  * @job_id_count: used to assign unique id to the each job.
  * @timeout_wq: workqueue used to queue @work_tdr
  * @work_tdr: schedules a delayed call to @drm_sched_job_timedout after the
@@ -500,14 +516,14 @@  struct drm_sched_backend_ops {
  */
 struct drm_gpu_scheduler {
 	const struct drm_sched_backend_ops	*ops;
-	uint32_t			hw_submission_limit;
+	u32				credit_limit;
+	atomic_t			credit_count;
 	long				timeout;
 	const char			*name;
 	u32                             num_rqs;
 	struct drm_sched_rq             **sched_rq;
 	wait_queue_head_t		wake_up_worker;
 	wait_queue_head_t		job_scheduled;
-	atomic_t			hw_rq_count;
 	atomic64_t			job_id_count;
 	struct workqueue_struct		*timeout_wq;
 	struct delayed_work		work_tdr;
@@ -524,14 +540,14 @@  struct drm_gpu_scheduler {
 
 int drm_sched_init(struct drm_gpu_scheduler *sched,
 		   const struct drm_sched_backend_ops *ops,
-		   u32 num_rqs, uint32_t hw_submission, unsigned int hang_limit,
+		   u32 num_rqs, u32 credit_limit, unsigned int hang_limit,
 		   long timeout, struct workqueue_struct *timeout_wq,
 		   atomic_t *score, const char *name, struct device *dev);
 
 void drm_sched_fini(struct drm_gpu_scheduler *sched);
 int drm_sched_job_init(struct drm_sched_job *job,
 		       struct drm_sched_entity *entity,
-		       void *owner);
+		       u32 credits, void *owner);
 void drm_sched_job_arm(struct drm_sched_job *job);
 int drm_sched_job_add_dependency(struct drm_sched_job *job,
 				 struct dma_fence *fence);
@@ -552,7 +568,8 @@  void drm_sched_entity_modify_sched(struct drm_sched_entity *entity,
                                    unsigned int num_sched_list);
 
 void drm_sched_job_cleanup(struct drm_sched_job *job);
-void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched);
+void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched,
+				   struct drm_sched_entity *entity);
 void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad);
 void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery);
 void drm_sched_resubmit_jobs(struct drm_gpu_scheduler *sched);