[1/5] drm/sched: Optimise drm_sched_entity_push_job

Message ID 20241014104637.83209-2-tursulin@igalia.com (mailing list archive)
State New, archived
Series Small DRM scheduler improvements

Commit Message

Tvrtko Ursulin Oct. 14, 2024, 10:46 a.m. UTC
From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>

In FIFO mode we can avoid dropping the lock only to immediately re-acquire
it by adding a new drm_sched_rq_update_fifo_locked() helper.

v2:
 * Remove drm_sched_rq_update_fifo() altogether. (Christian)

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Luben Tuikov <ltuikov89@gmail.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Philipp Stanner <pstanner@redhat.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
---
 drivers/gpu/drm/scheduler/sched_entity.c | 13 +++++++++----
 drivers/gpu/drm/scheduler/sched_main.c   |  6 +++---
 include/drm/gpu_scheduler.h              |  2 +-
 3 files changed, 13 insertions(+), 8 deletions(-)

Comments

Philipp Stanner Oct. 14, 2024, 11:32 a.m. UTC | #1
Hi,

On Mon, 2024-10-14 at 11:46 +0100, Tvrtko Ursulin wrote:
> From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
> 
> In FIFO mode We can avoid dropping the lock only to immediately re-
> acquire
> by adding a new drm_sched_rq_update_fifo_locked() helper.
> 

Please write detailed commit messages, as described here [1].
   1. Describe the problem: current state and why it's bad.
   2. Then, describe in imperative (present tense) form what the commit
      does about the problem.

Optionally, in between there can be information about why it's solved
this way and not another, etc.

Applies to the other patches, too.


[1] https://www.kernel.org/doc/html/latest/process/submitting-patches.html#describe-your-changes

> v2:
>  * Remove drm_sched_rq_update_fifo() altogether. (Christian)
> 
> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
> Cc: Christian König <christian.koenig@amd.com>
> Cc: Alex Deucher <alexander.deucher@amd.com>
> Cc: Luben Tuikov <ltuikov89@gmail.com>
> Cc: Matthew Brost <matthew.brost@intel.com>
> Cc: Philipp Stanner <pstanner@redhat.com>
> Reviewed-by: Christian König <christian.koenig@amd.com>
> ---
>  drivers/gpu/drm/scheduler/sched_entity.c | 13 +++++++++----
>  drivers/gpu/drm/scheduler/sched_main.c   |  6 +++---
>  include/drm/gpu_scheduler.h              |  2 +-
>  3 files changed, 13 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c
> b/drivers/gpu/drm/scheduler/sched_entity.c
> index 2951fcc2e6b1..b72cba292839 100644
> --- a/drivers/gpu/drm/scheduler/sched_entity.c
> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> @@ -514,8 +514,12 @@ struct drm_sched_job
> *drm_sched_entity_pop_job(struct drm_sched_entity *entity)
>   struct drm_sched_job *next;
>  
>   next = to_drm_sched_job(spsc_queue_peek(&entity->job_queue));
> - if (next)
> - drm_sched_rq_update_fifo(entity, next->submit_ts);
> + if (next) {
> + spin_lock(&entity->rq_lock);
> + drm_sched_rq_update_fifo_locked(entity,
> + next->submit_ts);
> + spin_unlock(&entity->rq_lock);
> + }
>   }
>  
>   /* Jobs and entities might have different lifecycles. Since we're
> @@ -613,10 +617,11 @@ void drm_sched_entity_push_job(struct
> drm_sched_job *sched_job)
>   sched = rq->sched;
>  
>   drm_sched_rq_add_entity(rq, entity);
> - spin_unlock(&entity->rq_lock);
>  
>   if (drm_sched_policy == DRM_SCHED_POLICY_FIFO)
> - drm_sched_rq_update_fifo(entity, submit_ts);
> + drm_sched_rq_update_fifo_locked(entity, submit_ts);
> +
> + spin_unlock(&entity->rq_lock);
>  
>   drm_sched_wakeup(sched);
>   }
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
> b/drivers/gpu/drm/scheduler/sched_main.c
> index e32b0f7d7e94..bbd1630407e4 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -169,14 +169,15 @@ static inline void
> drm_sched_rq_remove_fifo_locked(struct drm_sched_entity *enti
>   }
>  }
>  
> -void drm_sched_rq_update_fifo(struct drm_sched_entity *entity,
> ktime_t ts)
> +void drm_sched_rq_update_fifo_locked(struct drm_sched_entity
> *entity, ktime_t ts)

Since you touch the function name / signature already, would you mind
writing a small doc string that also mentions the locking requirements,
or the lack of the same?
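
For illustration only (the exact wording below is just a sketch, not
taken from the patch), something along these lines would already help:

	/**
	 * drm_sched_rq_update_fifo_locked - re-position an entity in the FIFO tree
	 * @entity: scheduler entity to update
	 * @ts: submission timestamp of the entity's next job
	 *
	 * Removes the entity from its run queue's FIFO rb tree, if present, and
	 * re-inserts it keyed by @ts.
	 *
	 * Locking: the caller must hold entity->rq_lock; entity->rq->lock is
	 * taken and released internally to protect the rb tree.
	 */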

>  {
>   /*
>   * Both locks need to be grabbed, one to protect from entity->rq
> change
>   * for entity from within concurrent drm_sched_entity_select_rq and
> the
>   * other to update the rb tree structure.
>   */

It seems to me that the comment above is now out of date, no?
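
For example (only a sketch, the wording is up to you), it could now read
something like:

	/*
	 * The entity->rq_lock must already be held by the caller. It protects
	 * against a concurrent entity->rq change from within
	 * drm_sched_entity_select_rq(). The rq->lock is taken here only to
	 * protect the rb tree structure.
	 */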


Thx for your efforts,
P.

> - spin_lock(&entity->rq_lock);
> + lockdep_assert_held(&entity->rq_lock);
> +
>   spin_lock(&entity->rq->lock);
>  
>   drm_sched_rq_remove_fifo_locked(entity);
> @@ -187,7 +188,6 @@ void drm_sched_rq_update_fifo(struct
> drm_sched_entity *entity, ktime_t ts)
>         drm_sched_entity_compare_before);
>  
>   spin_unlock(&entity->rq->lock);
> - spin_unlock(&entity->rq_lock);
>  }
>  
>  /**
> diff --git a/include/drm/gpu_scheduler.h
> b/include/drm/gpu_scheduler.h
> index e9f075f51db3..3658a6cb048e 100644
> --- a/include/drm/gpu_scheduler.h
> +++ b/include/drm/gpu_scheduler.h
> @@ -593,7 +593,7 @@ void drm_sched_rq_add_entity(struct drm_sched_rq
> *rq,
>  void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
>   struct drm_sched_entity *entity);
>  
> -void drm_sched_rq_update_fifo(struct drm_sched_entity *entity,
> ktime_t ts);
> +void drm_sched_rq_update_fifo_locked(struct drm_sched_entity
> *entity, ktime_t ts);
>  
>  int drm_sched_entity_init(struct drm_sched_entity *entity,
>     enum drm_sched_priority priority,
Tvrtko Ursulin Oct. 14, 2024, 12:07 p.m. UTC | #2
On 14/10/2024 12:32, Philipp Stanner wrote:
> Hi,
> 
> On Mon, 2024-10-14 at 11:46 +0100, Tvrtko Ursulin wrote:
>> From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
>>
>> In FIFO mode We can avoid dropping the lock only to immediately re-
>> acquire
>> by adding a new drm_sched_rq_update_fifo_locked() helper.
>>
> 
> Please write detailed commit messages, as described here [1].
>     1. Describe the problem: current state and why it's bad.
>     2. Then, describe in imperative (present tense) form what the commit
>        does about the problem.

Both pieces of info are already there:

1. Drops the lock to immediately re-acquire it.
2. We avoid that by adding a locked helper.
> Optionally, in between can be information about why it's solved this
> way and not another etc.
> 
> Applies to the other patches, too.
> 
> 
> [1] https://www.kernel.org/doc/html/latest/process/submitting-patches.html#describe-your-changes

Thanks, I am new here and did not know this.

Seriously, let's not be too blindly strict about this because, IMO, it
can get ridiculous.

One example where I previously accommodated your request is patch 3/5 from
this series:

"""
Current kerneldoc for struct drm_sched_rq incompletely documents what
fields are protected by the lock.

This is not good because it is misleading.

Lets fix it by listing all the elements which are protected by the lock.
"""

While this was the original commit text you weren't happy with:

"""
drm/sched: Re-order struct drm_sched_rq members for clarity

Lets re-order the members to make it clear which are protected by the
lock
and at the same time document it via kerneldoc.
"""

I maintain the original text was passable.

On top, this was just a respin to accommodate the merge process. All
approvals were done and dusted a couple of weeks or so ago, so asking for
yet another respin for such trivial objections is not great.

Regards,

Tvrtko

>> v2:
>>   * Remove drm_sched_rq_update_fifo() altogether. (Christian)
>>
>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
>> Cc: Christian König <christian.koenig@amd.com>
>> Cc: Alex Deucher <alexander.deucher@amd.com>
>> Cc: Luben Tuikov <ltuikov89@gmail.com>
>> Cc: Matthew Brost <matthew.brost@intel.com>
>> Cc: Philipp Stanner <pstanner@redhat.com>
>> Reviewed-by: Christian König <christian.koenig@amd.com>
>> ---
>>   drivers/gpu/drm/scheduler/sched_entity.c | 13 +++++++++----
>>   drivers/gpu/drm/scheduler/sched_main.c   |  6 +++---
>>   include/drm/gpu_scheduler.h              |  2 +-
>>   3 files changed, 13 insertions(+), 8 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c
>> b/drivers/gpu/drm/scheduler/sched_entity.c
>> index 2951fcc2e6b1..b72cba292839 100644
>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>> @@ -514,8 +514,12 @@ struct drm_sched_job
>> *drm_sched_entity_pop_job(struct drm_sched_entity *entity)
>>    struct drm_sched_job *next;
>>   
>>    next = to_drm_sched_job(spsc_queue_peek(&entity->job_queue));
>> - if (next)
>> - drm_sched_rq_update_fifo(entity, next->submit_ts);
>> + if (next) {
>> + spin_lock(&entity->rq_lock);
>> + drm_sched_rq_update_fifo_locked(entity,
>> + next->submit_ts);
>> + spin_unlock(&entity->rq_lock);
>> + }
>>    }
>>   
>>    /* Jobs and entities might have different lifecycles. Since we're
>> @@ -613,10 +617,11 @@ void drm_sched_entity_push_job(struct
>> drm_sched_job *sched_job)
>>    sched = rq->sched;
>>   
>>    drm_sched_rq_add_entity(rq, entity);
>> - spin_unlock(&entity->rq_lock);
>>   
>>    if (drm_sched_policy == DRM_SCHED_POLICY_FIFO)
>> - drm_sched_rq_update_fifo(entity, submit_ts);
>> + drm_sched_rq_update_fifo_locked(entity, submit_ts);
>> +
>> + spin_unlock(&entity->rq_lock);
>>   
>>    drm_sched_wakeup(sched);
>>    }
>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
>> b/drivers/gpu/drm/scheduler/sched_main.c
>> index e32b0f7d7e94..bbd1630407e4 100644
>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>> @@ -169,14 +169,15 @@ static inline void
>> drm_sched_rq_remove_fifo_locked(struct drm_sched_entity *enti
>>    }
>>   }
>>   
>> -void drm_sched_rq_update_fifo(struct drm_sched_entity *entity,
>> ktime_t ts)
>> +void drm_sched_rq_update_fifo_locked(struct drm_sched_entity
>> *entity, ktime_t ts)
> 
> Since you touch function name / signature already, would you mind
> writing a small doc string that also mentions the locking requirements
> or lack of the same?
> 
>>   {
>>    /*
>>    * Both locks need to be grabbed, one to protect from entity->rq
>> change
>>    * for entity from within concurrent drm_sched_entity_select_rq and
>> the
>>    * other to update the rb tree structure.
>>    */
> 
> It seems to me that the comment above is now out of date, no?
> 
> 
> Thx for your efforts,
> P.
> 
>> - spin_lock(&entity->rq_lock);
>> + lockdep_assert_held(&entity->rq_lock);
>> +
>>    spin_lock(&entity->rq->lock);
>>   
>>    drm_sched_rq_remove_fifo_locked(entity);
>> @@ -187,7 +188,6 @@ void drm_sched_rq_update_fifo(struct
>> drm_sched_entity *entity, ktime_t ts)
>>          drm_sched_entity_compare_before);
>>   
>>    spin_unlock(&entity->rq->lock);
>> - spin_unlock(&entity->rq_lock);
>>   }
>>   
>>   /**
>> diff --git a/include/drm/gpu_scheduler.h
>> b/include/drm/gpu_scheduler.h
>> index e9f075f51db3..3658a6cb048e 100644
>> --- a/include/drm/gpu_scheduler.h
>> +++ b/include/drm/gpu_scheduler.h
>> @@ -593,7 +593,7 @@ void drm_sched_rq_add_entity(struct drm_sched_rq
>> *rq,
>>   void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
>>    struct drm_sched_entity *entity);
>>   
>> -void drm_sched_rq_update_fifo(struct drm_sched_entity *entity,
>> ktime_t ts);
>> +void drm_sched_rq_update_fifo_locked(struct drm_sched_entity
>> *entity, ktime_t ts);
>>   
>>   int drm_sched_entity_init(struct drm_sched_entity *entity,
>>      enum drm_sched_priority priority,
>
Philipp Stanner Oct. 15, 2024, 7:11 a.m. UTC | #3
On Mon, 2024-10-14 at 13:07 +0100, Tvrtko Ursulin wrote:
> 
> On 14/10/2024 12:32, Philipp Stanner wrote:
> > Hi,
> > 
> > On Mon, 2024-10-14 at 11:46 +0100, Tvrtko Ursulin wrote:
> > > From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
> > > 
> > > In FIFO mode We can avoid dropping the lock only to immediately
> > > re-
> > > acquire
> > > by adding a new drm_sched_rq_update_fifo_locked() helper.
> > > 
> > 
> > Please write detailed commit messages, as described here [1].
> >     1. Describe the problem: current state and why it's bad.
> >     2. Then, describe in imperative (present tense) form what the
> > commit
> >        does about the problem.
> 
> Both pieces of info are already there:
> 
> 1. Drops the lock to immediately re-acquire it.
> 2. We avoid that by by adding a locked helper.
> > Optionally, in between can be information about why it's solved
> > this
> > way and not another etc.
> > 
> > Applies to the other patches, too.
> > 
> > 
> > [1]
> > https://www.kernel.org/doc/html/latest/process/submitting-patches.html#describe-your-changes
> 
> Thanks I am new here and did not know this.
> 
> Seriosuly, lets not be too blindly strict about this because it can
> get 
> IMO ridiculous.
> 
> One example when I previously accomodated your request is patch 3/5
> from 
> this series:
> 
> """
> Current kerneldoc for struct drm_sched_rq incompletely documents what
> fields are protected by the lock.
> 
> This is not good because it is misleading.
> 
> Lets fix it by listing all the elements which are protected by the
> lock.
> """
> 
> While this was the original commit text you weren't happy with:
> 
> """
> drm/sched: Re-order struct drm_sched_rq members for clarity
> 
> Lets re-order the members to make it clear which are protected by the
> lock
> and at the same time document it via kerneldoc.
> """
> 
> I maintain the original text was passable.
> 
> On top, this was just a respin to accomodate the merge process. All 
> approvals were done and dusted couple weeks or so ago so asking for
> yet 
> another respin for such trivial objections is not great.

I understand that you're unhappy, but please understand the position
I'm coming from. As you know, since you sent these patches within a
different series (and, thus, since I reviewed them), I was trusted with
co-maintaining this piece of shared infrastructure.

And since you've worked on it a bit now, I suppose you also know that
the GPU Scheduler is arguably in quite a bad shape, has far too little
documentation, has leaks, maybe race conditions, parts *where the
locking rules are unclear* and is probably only fully understood by a
small handful of people. I also argue that this is a *very*
complicated piece of software.

So I might be, or appear to be, a bit pedantic, but I'm not doing that to
terrorize you; it's because I want this thing to become well documented,
understandable, and bisectable. Working towards a canonical, idiot-proof
commit style is one measure that will help with that.

I want to offer you the following: I can be more relaxed with things
universally recognized as trivial (comment changes, struct member
reordering) – but when something like a lock is touched in any way, we
shall document that in the commit message as canonically as possible,
so someone who's less experienced and just bisected the commit
immediately understands what has been done (or rather: was supposed to
be done).

Greetings
P.



> 
> Regards,
> 
> Tvrtko
> 
> > > v2:
> > >   * Remove drm_sched_rq_update_fifo() altogether. (Christian)
> > > 
> > > Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
> > > Cc: Christian König <christian.koenig@amd.com>
> > > Cc: Alex Deucher <alexander.deucher@amd.com>
> > > Cc: Luben Tuikov <ltuikov89@gmail.com>
> > > Cc: Matthew Brost <matthew.brost@intel.com>
> > > Cc: Philipp Stanner <pstanner@redhat.com>
> > > Reviewed-by: Christian König <christian.koenig@amd.com>
> > > ---
> > >   drivers/gpu/drm/scheduler/sched_entity.c | 13 +++++++++----
> > >   drivers/gpu/drm/scheduler/sched_main.c   |  6 +++---
> > >   include/drm/gpu_scheduler.h              |  2 +-
> > >   3 files changed, 13 insertions(+), 8 deletions(-)
> > > 
> > > diff --git a/drivers/gpu/drm/scheduler/sched_entity.c
> > > b/drivers/gpu/drm/scheduler/sched_entity.c
> > > index 2951fcc2e6b1..b72cba292839 100644
> > > --- a/drivers/gpu/drm/scheduler/sched_entity.c
> > > +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> > > @@ -514,8 +514,12 @@ struct drm_sched_job
> > > *drm_sched_entity_pop_job(struct drm_sched_entity *entity)
> > >    struct drm_sched_job *next;
> > >   
> > >    next = to_drm_sched_job(spsc_queue_peek(&entity->job_queue));
> > > - if (next)
> > > - drm_sched_rq_update_fifo(entity, next->submit_ts);
> > > + if (next) {
> > > + spin_lock(&entity->rq_lock);
> > > + drm_sched_rq_update_fifo_locked(entity,
> > > + next->submit_ts);
> > > + spin_unlock(&entity->rq_lock);
> > > + }
> > >    }
> > >   
> > >    /* Jobs and entities might have different lifecycles. Since
> > > we're
> > > @@ -613,10 +617,11 @@ void drm_sched_entity_push_job(struct
> > > drm_sched_job *sched_job)
> > >    sched = rq->sched;
> > >   
> > >    drm_sched_rq_add_entity(rq, entity);
> > > - spin_unlock(&entity->rq_lock);
> > >   
> > >    if (drm_sched_policy == DRM_SCHED_POLICY_FIFO)
> > > - drm_sched_rq_update_fifo(entity, submit_ts);
> > > + drm_sched_rq_update_fifo_locked(entity, submit_ts);
> > > +
> > > + spin_unlock(&entity->rq_lock);
> > >   
> > >    drm_sched_wakeup(sched);
> > >    }
> > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c
> > > b/drivers/gpu/drm/scheduler/sched_main.c
> > > index e32b0f7d7e94..bbd1630407e4 100644
> > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > @@ -169,14 +169,15 @@ static inline void
> > > drm_sched_rq_remove_fifo_locked(struct drm_sched_entity *enti
> > >    }
> > >   }
> > >   
> > > -void drm_sched_rq_update_fifo(struct drm_sched_entity *entity,
> > > ktime_t ts)
> > > +void drm_sched_rq_update_fifo_locked(struct drm_sched_entity
> > > *entity, ktime_t ts)
> > 
> > Since you touch function name / signature already, would you mind
> > writing a small doc string that also mentions the locking
> > requirements
> > or lack of the same?
> > 
> > >   {
> > >    /*
> > >    * Both locks need to be grabbed, one to protect from entity-
> > > >rq
> > > change
> > >    * for entity from within concurrent drm_sched_entity_select_rq
> > > and
> > > the
> > >    * other to update the rb tree structure.
> > >    */
> > 
> > It seems to me that the comment above is now out of date, no?
> > 
> > 
> > Thx for your efforts,
> > P.
> > 
> > > - spin_lock(&entity->rq_lock);
> > > + lockdep_assert_held(&entity->rq_lock);
> > > +
> > >    spin_lock(&entity->rq->lock);
> > >   
> > >    drm_sched_rq_remove_fifo_locked(entity);
> > > @@ -187,7 +188,6 @@ void drm_sched_rq_update_fifo(struct
> > > drm_sched_entity *entity, ktime_t ts)
> > >          drm_sched_entity_compare_before);
> > >   
> > >    spin_unlock(&entity->rq->lock);
> > > - spin_unlock(&entity->rq_lock);
> > >   }
> > >   
> > >   /**
> > > diff --git a/include/drm/gpu_scheduler.h
> > > b/include/drm/gpu_scheduler.h
> > > index e9f075f51db3..3658a6cb048e 100644
> > > --- a/include/drm/gpu_scheduler.h
> > > +++ b/include/drm/gpu_scheduler.h
> > > @@ -593,7 +593,7 @@ void drm_sched_rq_add_entity(struct
> > > drm_sched_rq
> > > *rq,
> > >   void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
> > >    struct drm_sched_entity *entity);
> > >   
> > > -void drm_sched_rq_update_fifo(struct drm_sched_entity *entity,
> > > ktime_t ts);
> > > +void drm_sched_rq_update_fifo_locked(struct drm_sched_entity
> > > *entity, ktime_t ts);
> > >   
> > >   int drm_sched_entity_init(struct drm_sched_entity *entity,
> > >      enum drm_sched_priority priority,
> > 
>
Tvrtko Ursulin Oct. 15, 2024, 8:12 a.m. UTC | #4
On 15/10/2024 08:11, Philipp Stanner wrote:
> On Mon, 2024-10-14 at 13:07 +0100, Tvrtko Ursulin wrote:
>>
>> On 14/10/2024 12:32, Philipp Stanner wrote:
>>> Hi,
>>>
>>> On Mon, 2024-10-14 at 11:46 +0100, Tvrtko Ursulin wrote:
>>>> From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
>>>>
>>>> In FIFO mode We can avoid dropping the lock only to immediately
>>>> re-
>>>> acquire
>>>> by adding a new drm_sched_rq_update_fifo_locked() helper.
>>>>
>>>
>>> Please write detailed commit messages, as described here [1].
>>>      1. Describe the problem: current state and why it's bad.
>>>      2. Then, describe in imperative (present tense) form what the
>>> commit
>>>         does about the problem.
>>
>> Both pieces of info are already there:
>>
>> 1. Drops the lock to immediately re-acquire it.
>> 2. We avoid that by by adding a locked helper.
>>> Optionally, in between can be information about why it's solved
>>> this
>>> way and not another etc.
>>>
>>> Applies to the other patches, too.
>>>
>>>
>>> [1]
>>> https://www.kernel.org/doc/html/latest/process/submitting-patches.html#describe-your-changes
>>
>> Thanks I am new here and did not know this.
>>
>> Seriosuly, lets not be too blindly strict about this because it can
>> get
>> IMO ridiculous.
>>
>> One example when I previously accomodated your request is patch 3/5
>> from
>> this series:
>>
>> """
>> Current kerneldoc for struct drm_sched_rq incompletely documents what
>> fields are protected by the lock.
>>
>> This is not good because it is misleading.
>>
>> Lets fix it by listing all the elements which are protected by the
>> lock.
>> """
>>
>> While this was the original commit text you weren't happy with:
>>
>> """
>> drm/sched: Re-order struct drm_sched_rq members for clarity
>>
>> Lets re-order the members to make it clear which are protected by the
>> lock
>> and at the same time document it via kerneldoc.
>> """
>>
>> I maintain the original text was passable.
>>
>> On top, this was just a respin to accomodate the merge process. All
>> approvals were done and dusted couple weeks or so ago so asking for
>> yet
>> another respin for such trivial objections is not great.
> 
> I understand that you're unhappy, but please understand the position
> I'm coming from. As you know, since you sent these patches within a
> different series (and, thus, since I reviewed them), I was trusted with
> co-maintaining this piece of shared infrastructure.
> 
> And since you've worked on it a bit now, I suppose you also know that
> the GPU Scheduler is arguably in quite a bad shape, has far too little
> documentation, has leaks, maybe race conditions, parts *where the
> locking rules are unclear* and is probably only fully understood by a
> small hand full of people. I also argue that this is a *very*
> complicated piece of software.

We already went over that and agreed. Not least, I agreed a few years ago
already that the base is shaky. :)

Btw, if things align, I hope you will at some point see a follow-up
series from me which makes some significant simplifications and
improvements at the same time.
> So I might be or appear to be a bit pedantic, but I'm not doing that to
> terrorize you, but because I want this thing to become well documented,
> understandable, and bisectable. Working towards a canonical, idiot-
> proof commit style is one measure that will help with that.
> 
> I want to offer you the following: I can be more relaxed with things
> universally recognized as trivial (comment changes, struct member
> reordering) – but when something like a lock is touched in any way, we
> shall document that in the commit message as canonically as possible,
> so someone who's less experienced and just bisected the commit
> immediately understands what has been done (or rather: was supposed to
> be done).

So how would you suggest expanding this commit text so it doesn't read
as too repetitive?

Regards,

Tvrtko
Philipp Stanner Oct. 15, 2024, 11:38 a.m. UTC | #5
On Tue, 2024-10-15 at 09:12 +0100, Tvrtko Ursulin wrote:
> 
> On 15/10/2024 08:11, Philipp Stanner wrote:
> > On Mon, 2024-10-14 at 13:07 +0100, Tvrtko Ursulin wrote:
> > > 
> > > On 14/10/2024 12:32, Philipp Stanner wrote:
> > > > Hi,
> > > > 
> > > > On Mon, 2024-10-14 at 11:46 +0100, Tvrtko Ursulin wrote:
> > > > > From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
> > > > > 
> > > > > In FIFO mode We can avoid dropping the lock only to
> > > > > immediately
> > > > > re-
> > > > > acquire
> > > > > by adding a new drm_sched_rq_update_fifo_locked() helper.
> > > > > 
> > > > 
> > > > Please write detailed commit messages, as described here [1].
> > > >      1. Describe the problem: current state and why it's bad.
> > > >      2. Then, describe in imperative (present tense) form what
> > > > the
> > > > commit
> > > >         does about the problem.
> > > 
> > > Both pieces of info are already there:
> > > 
> > > 1. Drops the lock to immediately re-acquire it.
> > > 2. We avoid that by by adding a locked helper.
> > > > Optionally, in between can be information about why it's solved
> > > > this
> > > > way and not another etc.
> > > > 
> > > > Applies to the other patches, too.
> > > > 
> > > > 
> > > > [1]
> > > > https://www.kernel.org/doc/html/latest/process/submitting-patches.html#describe-your-changes
> > > 
> > > Thanks I am new here and did not know this.
> > > 
> > > Seriosuly, lets not be too blindly strict about this because it
> > > can
> > > get
> > > IMO ridiculous.
> > > 
> > > One example when I previously accomodated your request is patch
> > > 3/5
> > > from
> > > this series:
> > > 
> > > """
> > > Current kerneldoc for struct drm_sched_rq incompletely documents
> > > what
> > > fields are protected by the lock.
> > > 
> > > This is not good because it is misleading.
> > > 
> > > Lets fix it by listing all the elements which are protected by
> > > the
> > > lock.
> > > """
> > > 
> > > While this was the original commit text you weren't happy with:
> > > 
> > > """
> > > drm/sched: Re-order struct drm_sched_rq members for clarity
> > > 
> > > Lets re-order the members to make it clear which are protected by
> > > the
> > > lock
> > > and at the same time document it via kerneldoc.
> > > """
> > > 
> > > I maintain the original text was passable.
> > > 
> > > On top, this was just a respin to accomodate the merge process.
> > > All
> > > approvals were done and dusted couple weeks or so ago so asking
> > > for
> > > yet
> > > another respin for such trivial objections is not great.
> > 
> > I understand that you're unhappy, but please understand the
> > position
> > I'm coming from. As you know, since you sent these patches within a
> > different series (and, thus, since I reviewed them), I was trusted
> > with
> > co-maintaining this piece of shared infrastructure.
> > 
> > And since you've worked on it a bit now, I suppose you also know
> > that
> > the GPU Scheduler is arguably in quite a bad shape, has far too
> > little
> > documentation, has leaks, maybe race conditions, parts *where the
> > locking rules are unclear* and is probably only fully understood by
> > a
> > small hand full of people. I also argue that this is a *very*
> > complicated piece of software.
> 
> We already went over that and agreed. Not least I agreed the base is 
> shaky since few years  ago. :)
> 
> Btw if things align, I hope you will at some point see a follow up 
> series from me which makes some significant simplifications and 
> improvements at the same time.

Cool, good to hear!
(Would be even cooler if simplifications and improvements can be
delivered through separate patch series to be easier to review etc.)

> > So I might be or appear to be a bit pedantic, but I'm not doing
> > that to
> > terrorize you, but because I want this thing to become well
> > documented,
> > understandable, and bisectable. Working towards a canonical, idiot-
> > proof commit style is one measure that will help with that.
> > 
> > I want to offer you the following: I can be more relaxed with
> > things
> > universally recognized as trivial (comment changes, struct member
> > reordering) – but when something like a lock is touched in any way,
> > we
> > shall document that in the commit message as canonically as
> > possible,
> > so someone who's less experienced and just bisected the commit
> > immediately understands what has been done (or rather: was supposed
> > to
> > be done).
> 
> So how would you suggest to expand this commit text so it doesn't
> read 
> too self-repeating?

My issue with this particular commit message is mainly that it doesn't
make it obvious what the patch is supposed to do. One can make it quicker
and better to review by detailing it a bit more, so the reviewer can then
compare the commit message against what the code does. It seems to me,
for example, that the actual optimization is being done in
drm_sched_entity_push_job(), and that drm_sched_entity_pop_job() had to
be ported, too, for correctness.
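
To make that concrete, this is roughly what the
drm_sched_entity_push_job() path looks like before and after the patch
(a simplified sketch condensed from the diff, not the literal code):

	/* before: entity->rq_lock is dropped only to be immediately re-taken */
	spin_lock(&entity->rq_lock);
	drm_sched_rq_add_entity(rq, entity);
	spin_unlock(&entity->rq_lock);

	if (drm_sched_policy == DRM_SCHED_POLICY_FIFO)
		drm_sched_rq_update_fifo(entity, submit_ts); /* takes rq_lock again */

	/* after: one critical section, the helper asserts the lock is held */
	spin_lock(&entity->rq_lock);
	drm_sched_rq_add_entity(rq, entity);

	if (drm_sched_policy == DRM_SCHED_POLICY_FIFO)
		drm_sched_rq_update_fifo_locked(entity, submit_ts);

	spin_unlock(&entity->rq_lock);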

Another small thing that might be cool is something that makes it a bit
more obvious that this is an optimization, not a fix.

So I would probably write:

"So far, drm_sched_rq_update_fifo() automatically takes
drm_sched_entity.rq_lock. For DRM_SCHED_POLICY_FIFO, this is
inefficient because that lock is then taken, released and retaken in
drm_sched_entity_push_job().

Improve performance by moving the locking out of
drm_sched_rq_update_fifo()."

Not that much longer but makes it far more obvious what shall be
achieved where :]


(Let me read through the other patches briefly. Then we should be good
with v2 of this series... or would it be v3 then?)

Thanks,
P.


> 
> Regards,
> 
> Tvrtko
>
Tvrtko Ursulin Oct. 15, 2024, 1:14 p.m. UTC | #6
On 15/10/2024 12:38, Philipp Stanner wrote:
> On Tue, 2024-10-15 at 09:12 +0100, Tvrtko Ursulin wrote:
>>
>> On 15/10/2024 08:11, Philipp Stanner wrote:
>>> On Mon, 2024-10-14 at 13:07 +0100, Tvrtko Ursulin wrote:
>>>>
>>>> On 14/10/2024 12:32, Philipp Stanner wrote:
>>>>> Hi,
>>>>>
>>>>> On Mon, 2024-10-14 at 11:46 +0100, Tvrtko Ursulin wrote:
>>>>>> From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
>>>>>>
>>>>>> In FIFO mode We can avoid dropping the lock only to
>>>>>> immediately
>>>>>> re-
>>>>>> acquire
>>>>>> by adding a new drm_sched_rq_update_fifo_locked() helper.
>>>>>>
>>>>>
>>>>> Please write detailed commit messages, as described here [1].
>>>>>       1. Describe the problem: current state and why it's bad.
>>>>>       2. Then, describe in imperative (present tense) form what
>>>>> the
>>>>> commit
>>>>>          does about the problem.
>>>>
>>>> Both pieces of info are already there:
>>>>
>>>> 1. Drops the lock to immediately re-acquire it.
>>>> 2. We avoid that by by adding a locked helper.
>>>>> Optionally, in between can be information about why it's solved
>>>>> this
>>>>> way and not another etc.
>>>>>
>>>>> Applies to the other patches, too.
>>>>>
>>>>>
>>>>> [1]
>>>>> https://www.kernel.org/doc/html/latest/process/submitting-patches.html#describe-your-changes
>>>>
>>>> Thanks I am new here and did not know this.
>>>>
>>>> Seriosuly, lets not be too blindly strict about this because it
>>>> can
>>>> get
>>>> IMO ridiculous.
>>>>
>>>> One example when I previously accomodated your request is patch
>>>> 3/5
>>>> from
>>>> this series:
>>>>
>>>> """
>>>> Current kerneldoc for struct drm_sched_rq incompletely documents
>>>> what
>>>> fields are protected by the lock.
>>>>
>>>> This is not good because it is misleading.
>>>>
>>>> Lets fix it by listing all the elements which are protected by
>>>> the
>>>> lock.
>>>> """
>>>>
>>>> While this was the original commit text you weren't happy with:
>>>>
>>>> """
>>>> drm/sched: Re-order struct drm_sched_rq members for clarity
>>>>
>>>> Lets re-order the members to make it clear which are protected by
>>>> the
>>>> lock
>>>> and at the same time document it via kerneldoc.
>>>> """
>>>>
>>>> I maintain the original text was passable.
>>>>
>>>> On top, this was just a respin to accomodate the merge process.
>>>> All
>>>> approvals were done and dusted couple weeks or so ago so asking
>>>> for
>>>> yet
>>>> another respin for such trivial objections is not great.
>>>
>>> I understand that you're unhappy, but please understand the
>>> position
>>> I'm coming from. As you know, since you sent these patches within a
>>> different series (and, thus, since I reviewed them), I was trusted
>>> with
>>> co-maintaining this piece of shared infrastructure.
>>>
>>> And since you've worked on it a bit now, I suppose you also know
>>> that
>>> the GPU Scheduler is arguably in quite a bad shape, has far too
>>> little
>>> documentation, has leaks, maybe race conditions, parts *where the
>>> locking rules are unclear* and is probably only fully understood by
>>> a
>>> small hand full of people. I also argue that this is a *very*
>>> complicated piece of software.
>>
>> We already went over that and agreed. Not least I agreed the base is
>> shaky since few years  ago. :)
>>
>> Btw if things align, I hope you will at some point see a follow up
>> series from me which makes some significant simplifications and
>> improvements at the same time.
> 
> Cool, good to hear!
> (Would be even cooler if simplifications and improvements can be
> delivered through separate patch series to be easier to review etc.)

Yes, when I spot something I pull it ahead and/or send it standalone
when it makes sense. But it is early days and a big job.

>>> So I might be or appear to be a bit pedantic, but I'm not doing
>>> that to
>>> terrorize you, but because I want this thing to become well
>>> documented,
>>> understandable, and bisectable. Working towards a canonical, idiot-
>>> proof commit style is one measure that will help with that.
>>>
>>> I want to offer you the following: I can be more relaxed with
>>> things
>>> universally recognized as trivial (comment changes, struct member
>>> reordering) – but when something like a lock is touched in any way,
>>> we
>>> shall document that in the commit message as canonically as
>>> possible,
>>> so someone who's less experienced and just bisected the commit
>>> immediately understands what has been done (or rather: was supposed
>>> to
>>> be done).
>>
>> So how would you suggest to expand this commit text so it doesn't
>> read
>> too self-repeating?
> 
> My issue with this particular commit message is mainly that it doesn't
> make it obvious what the patch is supposed to do. So one can make it
> quicker and better to review by detailing it a bit more, so the
> reviewer then can compare commit message vs. what the code does. It
> seems to me for example that the actual optimization is being done in
> drm_sched_entity_push_job(), and drm_sched_entity_pop_job() had to be
> ported, too, for correctness

"It seems" aka the commit title says so. ;)

> Another small thing that might be cool is something that makes it a bit
> more obvious that this is an optimization, not a fix.
> 
> So I would probably write:
> 
> "So far, drm_sched_rq_update_fifo() automatically takes
> drm_sched_entity.rq_lock. For DRM_SCHED_POLICY_FIFO, this is
> inefficient because that lock is then taken, released and retaken in
> drm_sched_entity_push_job().
> 
> Improve performance by moving the locking out of
> drm_sched_rq_update_fifo()."
> 
> Not that much longer but makes it far more obvious what shall be
> achieved where :]

How about this:

"""
In FIFO mode (which is the default), both drm_sched_entity_push_job()
and drm_sched_rq_update_fifo(), where the former calls the latter, are
currently taking and releasing the same entity->rq_lock.

We can avoid that design inelegance, and also have a minuscule
efficiency improvement on the idle submit path, by introducing a new
drm_sched_rq_update_fifo_locked() helper and pulling up the lock taking
to its callers.
"""

> (Let me read through the other patches briefly. Then we should be good
> with v2 of this series.. or would it be v3 then?)

Depends how you count. By unique series titles or by fundamental content.

Regards,

Tvrtko
Philipp Stanner Oct. 15, 2024, 2 p.m. UTC | #7
On Tue, 2024-10-15 at 14:14 +0100, Tvrtko Ursulin wrote:
> 
> On 15/10/2024 12:38, Philipp Stanner wrote:
> > On Tue, 2024-10-15 at 09:12 +0100, Tvrtko Ursulin wrote:
> > > 
> > > On 15/10/2024 08:11, Philipp Stanner wrote:
> > > > On Mon, 2024-10-14 at 13:07 +0100, Tvrtko Ursulin wrote:
> > > > > 
> > > > > On 14/10/2024 12:32, Philipp Stanner wrote:
> > > > > > Hi,
> > > > > > 
> > > > > > On Mon, 2024-10-14 at 11:46 +0100, Tvrtko Ursulin wrote:
> > > > > > > From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
> > > > > > > 
> > > > > > > In FIFO mode We can avoid dropping the lock only to
> > > > > > > immediately
> > > > > > > re-
> > > > > > > acquire
> > > > > > > by adding a new drm_sched_rq_update_fifo_locked() helper.
> > > > > > > 
> > > > > > 
> > > > > > Please write detailed commit messages, as described here
> > > > > > [1].
> > > > > >       1. Describe the problem: current state and why it's
> > > > > > bad.
> > > > > >       2. Then, describe in imperative (present tense) form
> > > > > > what
> > > > > > the
> > > > > > commit
> > > > > >          does about the problem.
> > > > > 
> > > > > Both pieces of info are already there:
> > > > > 
> > > > > 1. Drops the lock to immediately re-acquire it.
> > > > > 2. We avoid that by by adding a locked helper.
> > > > > > Optionally, in between can be information about why it's
> > > > > > solved
> > > > > > this
> > > > > > way and not another etc.
> > > > > > 
> > > > > > Applies to the other patches, too.
> > > > > > 
> > > > > > 
> > > > > > [1]
> > > > > > https://www.kernel.org/doc/html/latest/process/submitting-patches.html#describe-your-changes
> > > > > 
> > > > > Thanks I am new here and did not know this.
> > > > > 
> > > > > Seriosuly, lets not be too blindly strict about this because
> > > > > it
> > > > > can
> > > > > get
> > > > > IMO ridiculous.
> > > > > 
> > > > > One example when I previously accomodated your request is
> > > > > patch
> > > > > 3/5
> > > > > from
> > > > > this series:
> > > > > 
> > > > > """
> > > > > Current kerneldoc for struct drm_sched_rq incompletely
> > > > > documents
> > > > > what
> > > > > fields are protected by the lock.
> > > > > 
> > > > > This is not good because it is misleading.
> > > > > 
> > > > > Lets fix it by listing all the elements which are protected
> > > > > by
> > > > > the
> > > > > lock.
> > > > > """
> > > > > 
> > > > > While this was the original commit text you weren't happy
> > > > > with:
> > > > > 
> > > > > """
> > > > > drm/sched: Re-order struct drm_sched_rq members for clarity
> > > > > 
> > > > > Lets re-order the members to make it clear which are
> > > > > protected by
> > > > > the
> > > > > lock
> > > > > and at the same time document it via kerneldoc.
> > > > > """
> > > > > 
> > > > > I maintain the original text was passable.
> > > > > 
> > > > > On top, this was just a respin to accomodate the merge
> > > > > process.
> > > > > All
> > > > > approvals were done and dusted couple weeks or so ago so
> > > > > asking
> > > > > for
> > > > > yet
> > > > > another respin for such trivial objections is not great.
> > > > 
> > > > I understand that you're unhappy, but please understand the
> > > > position
> > > > I'm coming from. As you know, since you sent these patches
> > > > within a
> > > > different series (and, thus, since I reviewed them), I was
> > > > trusted
> > > > with
> > > > co-maintaining this piece of shared infrastructure.
> > > > 
> > > > And since you've worked on it a bit now, I suppose you also
> > > > know
> > > > that
> > > > the GPU Scheduler is arguably in quite a bad shape, has far too
> > > > little
> > > > documentation, has leaks, maybe race conditions, parts *where
> > > > the
> > > > locking rules are unclear* and is probably only fully
> > > > understood by
> > > > a
> > > > small hand full of people. I also argue that this is a *very*
> > > > complicated piece of software.
> > > 
> > > We already went over that and agreed. Not least I agreed the base
> > > is
> > > shaky since few years  ago. :)
> > > 
> > > Btw if things align, I hope you will at some point see a follow
> > > up
> > > series from me which makes some significant simplifications and
> > > improvements at the same time.
> > 
> > Cool, good to hear!
> > (Would be even cooler if simplifications and improvements can be
> > delivered through separate patch series to be easier to review
> > etc.)
> 
> Yes, when I spot something I pull it ahead and/or standalone when it 
> makes sense. But it is early days and a big job.
> 
> > > > So I might be or appear to be a bit pedantic, but I'm not doing
> > > > that to
> > > > terrorize you, but because I want this thing to become well
> > > > documented,
> > > > understandable, and bisectable. Working towards a canonical,
> > > > idiot-
> > > > proof commit style is one measure that will help with that.
> > > > 
> > > > I want to offer you the following: I can be more relaxed with
> > > > things
> > > > universally recognized as trivial (comment changes, struct
> > > > member
> > > > reordering) – but when something like a lock is touched in any
> > > > way,
> > > > we
> > > > shall document that in the commit message as canonically as
> > > > possible,
> > > > so someone who's less experienced and just bisected the commit
> > > > immediately understands what has been done (or rather: was
> > > > supposed
> > > > to
> > > > be done).
> > > 
> > > So how would you suggest to expand this commit text so it doesn't
> > > read
> > > too self-repeating?
> > 
> > My issue with this particular commit message is mainly that it
> > doesn't
> > make it obvious what the patch is supposed to do. So one can make
> > it
> > quicker and better to review by detailing it a bit more, so the
> > reviewer then can compare commit message vs. what the code does. It
> > seems to me for example that the actual optimization is being done
> > in
> > drm_sched_entity_push_job(), and drm_sched_entity_pop_job() had to
> > be
> > ported, too, for correctness
> 
> "It seems" aka the commit title says so. ;)
> 
> > Another small thing that might be cool is something that makes it a
> > bit
> > more obvious that this is an optimization, not a fix.
> > 
> > So I would probably write:
> > 
> > "So far, drm_sched_rq_update_fifo() automatically takes
> > drm_sched_entity.rq_lock. For DRM_SCHED_POLICY_FIFO, this is
> > inefficient because that lock is then taken, released and retaken
> > in
> > drm_sched_entity_push_job().
> > 
> > Improve performance by moving the locking out of
> > drm_sched_rq_update_fifo()."
> > 
> > Not that much longer but makes it far more obvious what shall be
> > achieved where :]
> 
> How about this:
> 
> """
> In FIFO mode (which is the default), both drm_sched_entity_push_job()
> and drm_sched_rq_update_fifo(), where the latter calls the former,
> are 
> currently taking and releasing the same entity->rq_lock.
> 
> We can avoid that design inelegance, and also have a miniscule 
> efficiency improvement on the idle submit path, by introducing a new 
> drm_sched_rq_update_fifo_locked() helper and pulling up the lock
> taking 
> to its callers.
> """

That looks good to me

P.


> 
> > (Let me read through the other patches briefly. Then we should be
> > good
> > with v2 of this series.. or would it be v3 then?)
> 
> Depends how you count. By unique series titles or by fundamental
> content.
> 
> Regards,
> 
> Tvrtko
>
Tvrtko Ursulin Oct. 16, 2024, 7:41 a.m. UTC | #8
On 15/10/2024 15:00, Philipp Stanner wrote:
> On Tue, 2024-10-15 at 14:14 +0100, Tvrtko Ursulin wrote:
>>
>> On 15/10/2024 12:38, Philipp Stanner wrote:
>>> On Tue, 2024-10-15 at 09:12 +0100, Tvrtko Ursulin wrote:
>>>>
>>>> On 15/10/2024 08:11, Philipp Stanner wrote:
>>>>> On Mon, 2024-10-14 at 13:07 +0100, Tvrtko Ursulin wrote:
>>>>>>
>>>>>> On 14/10/2024 12:32, Philipp Stanner wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> On Mon, 2024-10-14 at 11:46 +0100, Tvrtko Ursulin wrote:
>>>>>>>> From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
>>>>>>>>
>>>>>>>> In FIFO mode We can avoid dropping the lock only to
>>>>>>>> immediately
>>>>>>>> re-
>>>>>>>> acquire
>>>>>>>> by adding a new drm_sched_rq_update_fifo_locked() helper.
>>>>>>>>
>>>>>>>
>>>>>>> Please write detailed commit messages, as described here
>>>>>>> [1].
>>>>>>>        1. Describe the problem: current state and why it's
>>>>>>> bad.
>>>>>>>        2. Then, describe in imperative (present tense) form
>>>>>>> what
>>>>>>> the
>>>>>>> commit
>>>>>>>           does about the problem.
>>>>>>
>>>>>> Both pieces of info are already there:
>>>>>>
>>>>>> 1. Drops the lock to immediately re-acquire it.
>>>>>> 2. We avoid that by by adding a locked helper.
>>>>>>> Optionally, in between can be information about why it's
>>>>>>> solved
>>>>>>> this
>>>>>>> way and not another etc.
>>>>>>>
>>>>>>> Applies to the other patches, too.
>>>>>>>
>>>>>>>
>>>>>>> [1]
>>>>>>> https://www.kernel.org/doc/html/latest/process/submitting-patches.html#describe-your-changes
>>>>>>
>>>>>> Thanks I am new here and did not know this.
>>>>>>
>>>>>> Seriosuly, lets not be too blindly strict about this because
>>>>>> it
>>>>>> can
>>>>>> get
>>>>>> IMO ridiculous.
>>>>>>
>>>>>> One example when I previously accomodated your request is
>>>>>> patch
>>>>>> 3/5
>>>>>> from
>>>>>> this series:
>>>>>>
>>>>>> """
>>>>>> Current kerneldoc for struct drm_sched_rq incompletely
>>>>>> documents
>>>>>> what
>>>>>> fields are protected by the lock.
>>>>>>
>>>>>> This is not good because it is misleading.
>>>>>>
>>>>>> Lets fix it by listing all the elements which are protected
>>>>>> by
>>>>>> the
>>>>>> lock.
>>>>>> """
>>>>>>
>>>>>> While this was the original commit text you weren't happy
>>>>>> with:
>>>>>>
>>>>>> """
>>>>>> drm/sched: Re-order struct drm_sched_rq members for clarity
>>>>>>
>>>>>> Lets re-order the members to make it clear which are
>>>>>> protected by
>>>>>> the
>>>>>> lock
>>>>>> and at the same time document it via kerneldoc.
>>>>>> """
>>>>>>
>>>>>> I maintain the original text was passable.
>>>>>>
>>>>>> On top, this was just a respin to accomodate the merge
>>>>>> process.
>>>>>> All
>>>>>> approvals were done and dusted couple weeks or so ago so
>>>>>> asking
>>>>>> for
>>>>>> yet
>>>>>> another respin for such trivial objections is not great.
>>>>>
>>>>> I understand that you're unhappy, but please understand the
>>>>> position
>>>>> I'm coming from. As you know, since you sent these patches
>>>>> within a
>>>>> different series (and, thus, since I reviewed them), I was
>>>>> trusted
>>>>> with
>>>>> co-maintaining this piece of shared infrastructure.
>>>>>
>>>>> And since you've worked on it a bit now, I suppose you also
>>>>> know
>>>>> that
>>>>> the GPU Scheduler is arguably in quite a bad shape, has far too
>>>>> little
>>>>> documentation, has leaks, maybe race conditions, parts *where
>>>>> the
>>>>> locking rules are unclear* and is probably only fully
>>>>> understood by
>>>>> a
>>>>> small hand full of people. I also argue that this is a *very*
>>>>> complicated piece of software.
>>>>
>>>> We already went over that and agreed. Not least I agreed the base
>>>> is
>>>> shaky since few years  ago. :)
>>>>
>>>> Btw if things align, I hope you will at some point see a follow
>>>> up
>>>> series from me which makes some significant simplifications and
>>>> improvements at the same time.
>>>
>>> Cool, good to hear!
>>> (Would be even cooler if simplifications and improvements can be
>>> delivered through separate patch series to be easier to review
>>> etc.)
>>
>> Yes, when I spot something I pull it ahead and/or standalone when it
>> makes sense. But it is early days and a big job.
>>
>>>>> So I might be or appear to be a bit pedantic, but I'm not doing
>>>>> that to
>>>>> terrorize you, but because I want this thing to become well
>>>>> documented,
>>>>> understandable, and bisectable. Working towards a canonical,
>>>>> idiot-
>>>>> proof commit style is one measure that will help with that.
>>>>>
>>>>> I want to offer you the following: I can be more relaxed with
>>>>> things
>>>>> universally recognized as trivial (comment changes, struct
>>>>> member
>>>>> reordering) – but when something like a lock is touched in any
>>>>> way,
>>>>> we
>>>>> shall document that in the commit message as canonically as
>>>>> possible,
>>>>> so someone who's less experienced and just bisected the commit
>>>>> immediately understands what has been done (or rather: was
>>>>> supposed
>>>>> to
>>>>> be done).
>>>>
>>>> So how would you suggest to expand this commit text so it doesn't
>>>> read
>>>> too self-repeating?
>>>
>>> My issue with this particular commit message is mainly that it
>>> doesn't
>>> make it obvious what the patch is supposed to do. So one can make
>>> it
>>> quicker and better to review by detailing it a bit more, so the
>>> reviewer then can compare commit message vs. what the code does. It
>>> seems to me for example that the actual optimization is being done
>>> in
>>> drm_sched_entity_push_job(), and drm_sched_entity_pop_job() had to
>>> be
>>> ported, too, for correctness
>>
>> "It seems" aka the commit title says so. ;)
>>
>>> Another small thing that might be cool is something that makes it a
>>> bit
>>> more obvious that this is an optimization, not a fix.
>>>
>>> So I would probably write:
>>>
>>> "So far, drm_sched_rq_update_fifo() automatically takes
>>> drm_sched_entity.rq_lock. For DRM_SCHED_POLICY_FIFO, this is
>>> inefficient because that lock is then taken, released and retaken
>>> in
>>> drm_sched_entity_push_job().
>>>
>>> Improve performance by moving the locking out of
>>> drm_sched_rq_update_fifo()."
>>>
>>> Not that much longer but makes it far more obvious what shall be
>>> achieved where :]
>>
>> How about this:
>>
>> """
>> In FIFO mode (which is the default), both drm_sched_entity_push_job()
>> and drm_sched_rq_update_fifo(), where the latter calls the former,
>> are
>> currently taking and releasing the same entity->rq_lock.
>>
>> We can avoid that design inelegance, and also have a miniscule
>> efficiency improvement on the idle submit path, by introducing a new
>> drm_sched_rq_update_fifo_locked() helper and pulling up the lock
>> taking
>> to its callers.
>> """
> 
> That looks good to me

Cool. So this for 1/5, your text and some tweaks for 4/5; anything else,
or can I respin with that?

Regards,

Tvrtko

> 
> P.
> 
> 
>>
>>> (Let me read through the other patches briefly. Then we should be
>>> good
>>> with v2 of this series.. or would it be v3 then?)
>>
>> Depends how you count. By unique series titles or by fundamental
>> content.
>>
>> Regards,
>>
>> Tvrtko
>>
>
Philipp Stanner Oct. 16, 2024, 9:15 a.m. UTC | #9
On Wed, 2024-10-16 at 08:41 +0100, Tvrtko Ursulin wrote:
> 
> On 15/10/2024 15:00, Philipp Stanner wrote:
> > > [...]
> > > How about this:
> > > 
> > > """
> > > In FIFO mode (which is the default), both
> > > drm_sched_entity_push_job()
> > > and drm_sched_rq_update_fifo(), where the latter calls the
> > > former,
> > > are
> > > currently taking and releasing the same entity->rq_lock.
> > > 
> > > We can avoid that design inelegance, and also have a miniscule
> > > efficiency improvement on the idle submit path, by introducing a
> > > new
> > > drm_sched_rq_update_fifo_locked() helper and pulling up the lock
> > > taking
> > > to its callers.
> > > """
> > 
> > That looks good to me
> 
> Cool. So this for 1/5, your text and some tweaks for 4/5, anything
> else 
> or I can respin with that?

Just went through all of them and they look good to me; I think we can
go with that.

Note that below your signature in the latest answer to patch 4 I
mentioned a second place where the old name was forgotten; I was not
sure if you saw that.

Regards,
P.



> 
> Regards,
> 
> Tvrtko
> 
> > 
> > P.
> > 
> > 
> > > 
> > > > (Let me read through the other patches briefly. Then we should
> > > > be
> > > > good
> > > > with v2 of this series.. or would it be v3 then?)
> > > 
> > > Depends how you count. By unique series titles or by fundamental
> > > content.
> > > 
> > > Regards,
> > > 
> > > Tvrtko
> > > 
> > 
>

Patch

diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
index 2951fcc2e6b1..b72cba292839 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -514,8 +514,12 @@  struct drm_sched_job *drm_sched_entity_pop_job(struct drm_sched_entity *entity)
 		struct drm_sched_job *next;
 
 		next = to_drm_sched_job(spsc_queue_peek(&entity->job_queue));
-		if (next)
-			drm_sched_rq_update_fifo(entity, next->submit_ts);
+		if (next) {
+			spin_lock(&entity->rq_lock);
+			drm_sched_rq_update_fifo_locked(entity,
+							next->submit_ts);
+			spin_unlock(&entity->rq_lock);
+		}
 	}
 
 	/* Jobs and entities might have different lifecycles. Since we're
@@ -613,10 +617,11 @@  void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
 		sched = rq->sched;
 
 		drm_sched_rq_add_entity(rq, entity);
-		spin_unlock(&entity->rq_lock);
 
 		if (drm_sched_policy == DRM_SCHED_POLICY_FIFO)
-			drm_sched_rq_update_fifo(entity, submit_ts);
+			drm_sched_rq_update_fifo_locked(entity, submit_ts);
+
+		spin_unlock(&entity->rq_lock);
 
 		drm_sched_wakeup(sched);
 	}
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index e32b0f7d7e94..bbd1630407e4 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -169,14 +169,15 @@  static inline void drm_sched_rq_remove_fifo_locked(struct drm_sched_entity *enti
 	}
 }
 
-void drm_sched_rq_update_fifo(struct drm_sched_entity *entity, ktime_t ts)
+void drm_sched_rq_update_fifo_locked(struct drm_sched_entity *entity, ktime_t ts)
 {
 	/*
 	 * Both locks need to be grabbed, one to protect from entity->rq change
 	 * for entity from within concurrent drm_sched_entity_select_rq and the
 	 * other to update the rb tree structure.
 	 */
-	spin_lock(&entity->rq_lock);
+	lockdep_assert_held(&entity->rq_lock);
+
 	spin_lock(&entity->rq->lock);
 
 	drm_sched_rq_remove_fifo_locked(entity);
@@ -187,7 +188,6 @@  void drm_sched_rq_update_fifo(struct drm_sched_entity *entity, ktime_t ts)
 		      drm_sched_entity_compare_before);
 
 	spin_unlock(&entity->rq->lock);
-	spin_unlock(&entity->rq_lock);
 }
 
 /**
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index e9f075f51db3..3658a6cb048e 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -593,7 +593,7 @@  void drm_sched_rq_add_entity(struct drm_sched_rq *rq,
 void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
 				struct drm_sched_entity *entity);
 
-void drm_sched_rq_update_fifo(struct drm_sched_entity *entity, ktime_t ts);
+void drm_sched_rq_update_fifo_locked(struct drm_sched_entity *entity, ktime_t ts);
 
 int drm_sched_entity_init(struct drm_sched_entity *entity,
 			  enum drm_sched_priority priority,