
[RFC,1/4] drm/sched: Add locking to drm_sched_entity_modify_sched

Message ID 20240906180618.12180-2-tursulin@igalia.com (mailing list archive)
State New, archived
Series DRM scheduler fixes, or not, or incorrect kind

Commit Message

Tvrtko Ursulin Sept. 6, 2024, 6:06 p.m. UTC
From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>

Without the locking amdgpu currently can race
amdgpu_ctx_set_entity_priority() and drm_sched_job_arm(), leading to the
latter accessing a potentially inconsistent entity->sched_list and
entity->num_sched_list pair.

The comment on drm_sched_entity_modify_sched() however says:

"""
 * Note that this must be called under the same common lock for @entity as
 * drm_sched_job_arm() and drm_sched_entity_push_job(), or the driver needs to
 * guarantee through some other means that this is never called while new jobs
 * can be pushed to @entity.
"""

It is unclear if that is referring to this race or something else.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Fixes: b37aced31eb0 ("drm/scheduler: implement a function to modify sched list")
Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Luben Tuikov <ltuikov89@gmail.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: David Airlie <airlied@gmail.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: dri-devel@lists.freedesktop.org
Cc: <stable@vger.kernel.org> # v5.7+
---
 drivers/gpu/drm/scheduler/sched_entity.c | 2 ++
 1 file changed, 2 insertions(+)

Comments

Philipp Stanner Sept. 9, 2024, 9:44 a.m. UTC | #1
On Fri, 2024-09-06 at 19:06 +0100, Tvrtko Ursulin wrote:
> From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
> 
> Without the locking amdgpu currently can race
> amdgpu_ctx_set_entity_priority() and drm_sched_job_arm(), 

I would explicitly say "amdgpu's amdgpu_ctx_set_entity_priority() races
through drm_sched_entity_modify_sched() with drm_sched_job_arm()".

The actual issue then seems to be drm_sched_job_arm() calling
drm_sched_entity_select_rq(). I would mention that, too.
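
Roughly, the interleaving would be something like this (simplified
sketch, not the literal code):

  Task A                                       Task B
  amdgpu_ctx_set_entity_priority()             drm_sched_job_arm()
    drm_sched_entity_modify_sched()              drm_sched_entity_select_rq()
      entity->sched_list = sched_list;             reads entity->sched_list
      entity->num_sched_list = num_sched_list;     reads entity->num_sched_list
                                                   (the pair may not match)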


> leading to the
> latter accesing potentially inconsitent entity->sched_list and
> entity->num_sched_list pair.
> 
> The comment on drm_sched_entity_modify_sched() however says:
> 
> """
>  * Note that this must be called under the same common lock for
> @entity as
>  * drm_sched_job_arm() and drm_sched_entity_push_job(), or the driver
> needs to
>  * guarantee through some other means that this is never called while
> new jobs
>  * can be pushed to @entity.
> """
> 
> It is unclear if that is referring to this race or something else.

That comment is indeed a bit awkward. Both drm_sched_entity_push_job()
and drm_sched_job_arm() take rq_lock. But
drm_sched_entity_modify_sched() doesn't.

The comment was written in 981b04d968561. Interestingly, in
drm_sched_entity_push_job(), this "common lock" is mentioned with the
soft requirement word "should" and apparently is more about keeping
sequence numbers in order when inserting.

I tend to think that the issue discovered by you is unrelated to that
comment. But if no one can make sense of the comment, should it maybe
be removed? A confusing comment is arguably worse than no comment.

P.

> 
> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
> Fixes: b37aced31eb0 ("drm/scheduler: implement a function to modify
> sched list")
> Cc: Christian König <christian.koenig@amd.com>
> Cc: Alex Deucher <alexander.deucher@amd.com>
> Cc: Luben Tuikov <ltuikov89@gmail.com>
> Cc: Matthew Brost <matthew.brost@intel.com>
> Cc: David Airlie <airlied@gmail.com>
> Cc: Daniel Vetter <daniel@ffwll.ch>
> Cc: dri-devel@lists.freedesktop.org
> Cc: <stable@vger.kernel.org> # v5.7+
> ---
>  drivers/gpu/drm/scheduler/sched_entity.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c
> b/drivers/gpu/drm/scheduler/sched_entity.c
> index 58c8161289fe..ae8be30472cd 100644
> --- a/drivers/gpu/drm/scheduler/sched_entity.c
> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> @@ -133,8 +133,10 @@ void drm_sched_entity_modify_sched(struct
> drm_sched_entity *entity,
>  {
>  	WARN_ON(!num_sched_list || !sched_list);
>  
> +	spin_lock(&entity->rq_lock);
>  	entity->sched_list = sched_list;
>  	entity->num_sched_list = num_sched_list;
> +	spin_unlock(&entity->rq_lock);
>  }
>  EXPORT_SYMBOL(drm_sched_entity_modify_sched);
>
Christian König Sept. 9, 2024, 11:29 a.m. UTC | #2
Am 09.09.24 um 11:44 schrieb Philipp Stanner:
> On Fri, 2024-09-06 at 19:06 +0100, Tvrtko Ursulin wrote:
>> From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
>>
>> Without the locking amdgpu currently can race
>> amdgpu_ctx_set_entity_priority() and drm_sched_job_arm(),
> I would explicitly say "amdgpu's amdgpu_ctx_set_entity_priority() races
> through drm_sched_entity_modify_sched() with drm_sched_job_arm()".
>
> The actual issue then seems to be drm_sched_job_arm() calling
> drm_sched_entity_select_rq(). I would mention that, too.
>
>
>> leading to the
>> latter accesing potentially inconsitent entity->sched_list and
>> entity->num_sched_list pair.
>>
>> The comment on drm_sched_entity_modify_sched() however says:
>>
>> """
>>   * Note that this must be called under the same common lock for
>> @entity as
>>   * drm_sched_job_arm() and drm_sched_entity_push_job(), or the driver
>> needs to
>>   * guarantee through some other means that this is never called while
>> new jobs
>>   * can be pushed to @entity.
>> """
>>
>> It is unclear if that is referring to this race or something else.
> That comment is indeed a bit awkward. Both drm_sched_entity_push_job()
> and drm_sched_job_arm() take rq_lock. But
> drm_sched_entity_modify_sched() doesn't.
>
> The comment was written in 981b04d968561. Interestingly, in
> drm_sched_entity_push_job(), this "common lock" is mentioned with the
> soft requirement word "should" and apparently is more about keeping
> sequence numbers in order when inserting.
>
> I tend to think that the issue discovered by you is unrelated to that
> comment. But if no one can make sense of the comment, should it maybe
> be removed? Confusing comment is arguably worse than no comment

Agree, in 981b04d968561 we probably mixed up the requirement that
submission needs a common lock with the requirement that rq/priority
needs to be protected by the rq_lock.

There is also the big FIXME in the drm_sched_entity documentation 
pointing out that this is most likely not implemented correctly.

I suggest moving the rq, priority and rq_lock fields together in the
drm_sched_entity structure and documenting that rq_lock protects the two.
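
Something along these lines perhaps (just a sketch of the relevant
members, exact placement and kerneldoc to be worked out):

	/* in struct drm_sched_entity */
	spinlock_t			rq_lock;
	/* The following members are protected by @rq_lock: */
	struct drm_sched_rq		*rq;
	enum drm_sched_priority		priority;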

Then audit the code to check whether all users of rq and priority
actually hold the correct locks while reading and writing them.

Regards,
Christian.

>
> P.
>
>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
>> Fixes: b37aced31eb0 ("drm/scheduler: implement a function to modify
>> sched list")
>> Cc: Christian König <christian.koenig@amd.com>
>> Cc: Alex Deucher <alexander.deucher@amd.com>
>> Cc: Luben Tuikov <ltuikov89@gmail.com>
>> Cc: Matthew Brost <matthew.brost@intel.com>
>> Cc: David Airlie <airlied@gmail.com>
>> Cc: Daniel Vetter <daniel@ffwll.ch>
>> Cc: dri-devel@lists.freedesktop.org
>> Cc: <stable@vger.kernel.org> # v5.7+
>> ---
>>   drivers/gpu/drm/scheduler/sched_entity.c | 2 ++
>>   1 file changed, 2 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c
>> b/drivers/gpu/drm/scheduler/sched_entity.c
>> index 58c8161289fe..ae8be30472cd 100644
>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>> @@ -133,8 +133,10 @@ void drm_sched_entity_modify_sched(struct
>> drm_sched_entity *entity,
>>   {
>>   	WARN_ON(!num_sched_list || !sched_list);
>>   
>> +	spin_lock(&entity->rq_lock);
>>   	entity->sched_list = sched_list;
>>   	entity->num_sched_list = num_sched_list;
>> +	spin_unlock(&entity->rq_lock);
>>   }
>>   EXPORT_SYMBOL(drm_sched_entity_modify_sched);
>>
Philipp Stanner Sept. 9, 2024, 12:13 p.m. UTC | #3
On Mon, 2024-09-09 at 13:29 +0200, Christian König wrote:
> Am 09.09.24 um 11:44 schrieb Philipp Stanner:
> > On Fri, 2024-09-06 at 19:06 +0100, Tvrtko Ursulin wrote:
> > > From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
> > > 
> > > Without the locking amdgpu currently can race
> > > amdgpu_ctx_set_entity_priority() and drm_sched_job_arm(),
> > I would explicitly say "amdgpu's amdgpu_ctx_set_entity_priority()
> > races
> > through drm_sched_entity_modify_sched() with drm_sched_job_arm()".
> > 
> > The actual issue then seems to be drm_sched_job_arm() calling
> > drm_sched_entity_select_rq(). I would mention that, too.
> > 
> > 
> > > leading to the
> > > latter accesing potentially inconsitent entity->sched_list and
> > > entity->num_sched_list pair.
> > > 
> > > The comment on drm_sched_entity_modify_sched() however says:
> > > 
> > > """
> > >   * Note that this must be called under the same common lock for
> > > @entity as
> > >   * drm_sched_job_arm() and drm_sched_entity_push_job(), or the
> > > driver
> > > needs to
> > >   * guarantee through some other means that this is never called
> > > while
> > > new jobs
> > >   * can be pushed to @entity.
> > > """
> > > 
> > > It is unclear if that is referring to this race or something
> > > else.
> > That comment is indeed a bit awkward. Both
> > drm_sched_entity_push_job()
> > and drm_sched_job_arm() take rq_lock. But
> > drm_sched_entity_modify_sched() doesn't.
> > 
> > The comment was written in 981b04d968561. Interestingly, in
> > drm_sched_entity_push_job(), this "common lock" is mentioned with
> > the
> > soft requirement word "should" and apparently is more about keeping
> > sequence numbers in order when inserting.
> > 
> > I tend to think that the issue discovered by you is unrelated to
> > that
> > comment. But if no one can make sense of the comment, should it
> > maybe
> > be removed? Confusing comment is arguably worse than no comment
> 
> Agree, we probably mixed up in 981b04d968561 that submission needs a 
> common lock and that rq/priority needs to be protected by the
> rq_lock.
> 
> There is also the big FIXME in the drm_sched_entity documentation 
> pointing out that this is most likely not implemented correctly.
> 
> I suggest to move the rq, priority and rq_lock fields together in the
> drm_sched_entity structure and document that rq_lock is protecting
> the two.

That could also be a great opportunity for improving the lock naming:

void drm_sched_rq_update_fifo(struct drm_sched_entity *entity, ktime_t ts)
{
	/*
	 * Both locks need to be grabbed, one to protect from entity->rq change
	 * for entity from within concurrent drm_sched_entity_select_rq and the
	 * other to update the rb tree structure.
	 */
	spin_lock(&entity->rq_lock);
	spin_lock(&entity->rq->lock);

[...]


P.


> 
> Then audit the code if all users of rq and priority actually hold the
> correct locks while reading and writing them.
> 
> Regards,
> Christian.
> 
> > 
> > P.
> > 
> > > Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
> > > Fixes: b37aced31eb0 ("drm/scheduler: implement a function to
> > > modify
> > > sched list")
> > > Cc: Christian König <christian.koenig@amd.com>
> > > Cc: Alex Deucher <alexander.deucher@amd.com>
> > > Cc: Luben Tuikov <ltuikov89@gmail.com>
> > > Cc: Matthew Brost <matthew.brost@intel.com>
> > > Cc: David Airlie <airlied@gmail.com>
> > > Cc: Daniel Vetter <daniel@ffwll.ch>
> > > Cc: dri-devel@lists.freedesktop.org
> > > Cc: <stable@vger.kernel.org> # v5.7+
> > > ---
> > >   drivers/gpu/drm/scheduler/sched_entity.c | 2 ++
> > >   1 file changed, 2 insertions(+)
> > > 
> > > diff --git a/drivers/gpu/drm/scheduler/sched_entity.c
> > > b/drivers/gpu/drm/scheduler/sched_entity.c
> > > index 58c8161289fe..ae8be30472cd 100644
> > > --- a/drivers/gpu/drm/scheduler/sched_entity.c
> > > +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> > > @@ -133,8 +133,10 @@ void drm_sched_entity_modify_sched(struct
> > > drm_sched_entity *entity,
> > >   {
> > >   	WARN_ON(!num_sched_list || !sched_list);
> > >   
> > > +	spin_lock(&entity->rq_lock);
> > >   	entity->sched_list = sched_list;
> > >   	entity->num_sched_list = num_sched_list;
> > > +	spin_unlock(&entity->rq_lock);
> > >   }
> > >   EXPORT_SYMBOL(drm_sched_entity_modify_sched);
> > >   
>
Christian König Sept. 9, 2024, 12:18 p.m. UTC | #4
Am 09.09.24 um 14:13 schrieb Philipp Stanner:
> On Mon, 2024-09-09 at 13:29 +0200, Christian König wrote:
>> Am 09.09.24 um 11:44 schrieb Philipp Stanner:
>>> On Fri, 2024-09-06 at 19:06 +0100, Tvrtko Ursulin wrote:
>>>> From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
>>>>
>>>> Without the locking amdgpu currently can race
>>>> amdgpu_ctx_set_entity_priority() and drm_sched_job_arm(),
>>> I would explicitly say "amdgpu's amdgpu_ctx_set_entity_priority()
>>> races
>>> through drm_sched_entity_modify_sched() with drm_sched_job_arm()".
>>>
>>> The actual issue then seems to be drm_sched_job_arm() calling
>>> drm_sched_entity_select_rq(). I would mention that, too.
>>>
>>>
>>>> leading to the
>>>> latter accesing potentially inconsitent entity->sched_list and
>>>> entity->num_sched_list pair.
>>>>
>>>> The comment on drm_sched_entity_modify_sched() however says:
>>>>
>>>> """
>>>>    * Note that this must be called under the same common lock for
>>>> @entity as
>>>>    * drm_sched_job_arm() and drm_sched_entity_push_job(), or the
>>>> driver
>>>> needs to
>>>>    * guarantee through some other means that this is never called
>>>> while
>>>> new jobs
>>>>    * can be pushed to @entity.
>>>> """
>>>>
>>>> It is unclear if that is referring to this race or something
>>>> else.
>>> That comment is indeed a bit awkward. Both
>>> drm_sched_entity_push_job()
>>> and drm_sched_job_arm() take rq_lock. But
>>> drm_sched_entity_modify_sched() doesn't.
>>>
>>> The comment was written in 981b04d968561. Interestingly, in
>>> drm_sched_entity_push_job(), this "common lock" is mentioned with
>>> the
>>> soft requirement word "should" and apparently is more about keeping
>>> sequence numbers in order when inserting.
>>>
>>> I tend to think that the issue discovered by you is unrelated to
>>> that
>>> comment. But if no one can make sense of the comment, should it
>>> maybe
>>> be removed? Confusing comment is arguably worse than no comment
>> Agree, we probably mixed up in 981b04d968561 that submission needs a
>> common lock and that rq/priority needs to be protected by the
>> rq_lock.
>>
>> There is also the big FIXME in the drm_sched_entity documentation
>> pointing out that this is most likely not implemented correctly.
>>
>> I suggest to move the rq, priority and rq_lock fields together in the
>> drm_sched_entity structure and document that rq_lock is protecting
>> the two.
> That could also be a great opportunity for improving the lock naming:

Well, that comment made me laugh because I pointed out the same thing
when the scheduler came out ~8 years ago and nobody has cared about it
since then.

But yeah completely agree :)

Christian.

>
> void drm_sched_rq_update_fifo(struct drm_sched_entity *entity, ktime_t ts)
> {
> 	/*
> 	 * Both locks need to be grabbed, one to protect from entity->rq change
> 	 * for entity from within concurrent drm_sched_entity_select_rq and the
> 	 * other to update the rb tree structure.
> 	 */
> 	spin_lock(&entity->rq_lock);
> 	spin_lock(&entity->rq->lock);
>
> [...]
>
>
> P.
>
>
>> Then audit the code if all users of rq and priority actually hold the
>> correct locks while reading and writing them.
>>
>> Regards,
>> Christian.
>>
>>> P.
>>>
>>>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
>>>> Fixes: b37aced31eb0 ("drm/scheduler: implement a function to
>>>> modify
>>>> sched list")
>>>> Cc: Christian König <christian.koenig@amd.com>
>>>> Cc: Alex Deucher <alexander.deucher@amd.com>
>>>> Cc: Luben Tuikov <ltuikov89@gmail.com>
>>>> Cc: Matthew Brost <matthew.brost@intel.com>
>>>> Cc: David Airlie <airlied@gmail.com>
>>>> Cc: Daniel Vetter <daniel@ffwll.ch>
>>>> Cc: dri-devel@lists.freedesktop.org
>>>> Cc: <stable@vger.kernel.org> # v5.7+
>>>> ---
>>>>    drivers/gpu/drm/scheduler/sched_entity.c | 2 ++
>>>>    1 file changed, 2 insertions(+)
>>>>
>>>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c
>>>> b/drivers/gpu/drm/scheduler/sched_entity.c
>>>> index 58c8161289fe..ae8be30472cd 100644
>>>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>>>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>>>> @@ -133,8 +133,10 @@ void drm_sched_entity_modify_sched(struct
>>>> drm_sched_entity *entity,
>>>>    {
>>>>    	WARN_ON(!num_sched_list || !sched_list);
>>>>    
>>>> +	spin_lock(&entity->rq_lock);
>>>>    	entity->sched_list = sched_list;
>>>>    	entity->num_sched_list = num_sched_list;
>>>> +	spin_unlock(&entity->rq_lock);
>>>>    }
>>>>    EXPORT_SYMBOL(drm_sched_entity_modify_sched);
>>>>
Tvrtko Ursulin Sept. 9, 2024, 12:37 p.m. UTC | #5
On 09/09/2024 13:18, Christian König wrote:
> Am 09.09.24 um 14:13 schrieb Philipp Stanner:
>> On Mon, 2024-09-09 at 13:29 +0200, Christian König wrote:
>>> Am 09.09.24 um 11:44 schrieb Philipp Stanner:
>>>> On Fri, 2024-09-06 at 19:06 +0100, Tvrtko Ursulin wrote:
>>>>> From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
>>>>>
>>>>> Without the locking amdgpu currently can race
>>>>> amdgpu_ctx_set_entity_priority() and drm_sched_job_arm(),
>>>> I would explicitly say "amdgpu's amdgpu_ctx_set_entity_priority()
>>>> races
>>>> through drm_sched_entity_modify_sched() with drm_sched_job_arm()".
>>>>
>>>> The actual issue then seems to be drm_sched_job_arm() calling
>>>> drm_sched_entity_select_rq(). I would mention that, too.
>>>>
>>>>
>>>>> leading to the
>>>>> latter accesing potentially inconsitent entity->sched_list and
>>>>> entity->num_sched_list pair.
>>>>>
>>>>> The comment on drm_sched_entity_modify_sched() however says:
>>>>>
>>>>> """
>>>>>    * Note that this must be called under the same common lock for
>>>>> @entity as
>>>>>    * drm_sched_job_arm() and drm_sched_entity_push_job(), or the
>>>>> driver
>>>>> needs to
>>>>>    * guarantee through some other means that this is never called
>>>>> while
>>>>> new jobs
>>>>>    * can be pushed to @entity.
>>>>> """
>>>>>
>>>>> It is unclear if that is referring to this race or something
>>>>> else.
>>>> That comment is indeed a bit awkward. Both
>>>> drm_sched_entity_push_job()
>>>> and drm_sched_job_arm() take rq_lock. But
>>>> drm_sched_entity_modify_sched() doesn't.
>>>>
>>>> The comment was written in 981b04d968561. Interestingly, in
>>>> drm_sched_entity_push_job(), this "common lock" is mentioned with
>>>> the
>>>> soft requirement word "should" and apparently is more about keeping
>>>> sequence numbers in order when inserting.
>>>>
>>>> I tend to think that the issue discovered by you is unrelated to
>>>> that
>>>> comment. But if no one can make sense of the comment, should it
>>>> maybe
>>>> be removed? Confusing comment is arguably worse than no comment
>>> Agree, we probably mixed up in 981b04d968561 that submission needs a
>>> common lock and that rq/priority needs to be protected by the
>>> rq_lock.
>>>
>>> There is also the big FIXME in the drm_sched_entity documentation
>>> pointing out that this is most likely not implemented correctly.
>>>
>>> I suggest to move the rq, priority and rq_lock fields together in the
>>> drm_sched_entity structure and document that rq_lock is protecting
>>> the two.
>> That could also be a great opportunity for improving the lock naming:
> 
> Well that comment made me laugh because I point out the same when the 
> scheduler came out ~8years ago and nobody cared about it since then.
> 
> But yeah completely agree :)

Maybe, but we need to keep in sight the fact that some of these fixes
may be good to backport, in which case re-naming exercises are best left
to follow.

Also..

>> void drm_sched_rq_update_fifo(struct drm_sched_entity *entity, ktime_t 
>> ts)
>> {
>>     /*
>>      * Both locks need to be grabbed, one to protect from entity->rq 
>> change
>>      * for entity from within concurrent drm_sched_entity_select_rq 
>> and the
>>      * other to update the rb tree structure.
>>      */
>>     spin_lock(&entity->rq_lock);
>>     spin_lock(&entity->rq->lock);

.. I agree this is quite unreadable and my initial reaction was a
similar ugh. However, what names would you guys suggest, and for what,
to make this better without lessening the logic of naming each
individually?

Regards,

Tvrtko

>> [...]
>>
>>
>> P.
>>
>>
>>> Then audit the code if all users of rq and priority actually hold the
>>> correct locks while reading and writing them.
>>>
>>> Regards,
>>> Christian.
>>>
>>>> P.
>>>>
>>>>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
>>>>> Fixes: b37aced31eb0 ("drm/scheduler: implement a function to
>>>>> modify
>>>>> sched list")
>>>>> Cc: Christian König <christian.koenig@amd.com>
>>>>> Cc: Alex Deucher <alexander.deucher@amd.com>
>>>>> Cc: Luben Tuikov <ltuikov89@gmail.com>
>>>>> Cc: Matthew Brost <matthew.brost@intel.com>
>>>>> Cc: David Airlie <airlied@gmail.com>
>>>>> Cc: Daniel Vetter <daniel@ffwll.ch>
>>>>> Cc: dri-devel@lists.freedesktop.org
>>>>> Cc: <stable@vger.kernel.org> # v5.7+
>>>>> ---
>>>>>    drivers/gpu/drm/scheduler/sched_entity.c | 2 ++
>>>>>    1 file changed, 2 insertions(+)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c
>>>>> b/drivers/gpu/drm/scheduler/sched_entity.c
>>>>> index 58c8161289fe..ae8be30472cd 100644
>>>>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>>>>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>>>>> @@ -133,8 +133,10 @@ void drm_sched_entity_modify_sched(struct
>>>>> drm_sched_entity *entity,
>>>>>    {
>>>>>        WARN_ON(!num_sched_list || !sched_list);
>>>>> +    spin_lock(&entity->rq_lock);
>>>>>        entity->sched_list = sched_list;
>>>>>        entity->num_sched_list = num_sched_list;
>>>>> +    spin_unlock(&entity->rq_lock);
>>>>>    }
>>>>>    EXPORT_SYMBOL(drm_sched_entity_modify_sched);
>
Philipp Stanner Sept. 9, 2024, 12:46 p.m. UTC | #6
On Mon, 2024-09-09 at 13:37 +0100, Tvrtko Ursulin wrote:
> 
> On 09/09/2024 13:18, Christian König wrote:
> > Am 09.09.24 um 14:13 schrieb Philipp Stanner:
> > > On Mon, 2024-09-09 at 13:29 +0200, Christian König wrote:
> > > > Am 09.09.24 um 11:44 schrieb Philipp Stanner:
> > > > > On Fri, 2024-09-06 at 19:06 +0100, Tvrtko Ursulin wrote:
> > > > > > From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
> > > > > > 
> > > > > > Without the locking amdgpu currently can race
> > > > > > amdgpu_ctx_set_entity_priority() and drm_sched_job_arm(),
> > > > > I would explicitly say "amdgpu's
> > > > > amdgpu_ctx_set_entity_priority()
> > > > > races
> > > > > through drm_sched_entity_modify_sched() with
> > > > > drm_sched_job_arm()".
> > > > > 
> > > > > The actual issue then seems to be drm_sched_job_arm() calling
> > > > > drm_sched_entity_select_rq(). I would mention that, too.
> > > > > 
> > > > > 
> > > > > > leading to the
> > > > > > latter accesing potentially inconsitent entity->sched_list
> > > > > > and
> > > > > > entity->num_sched_list pair.
> > > > > > 
> > > > > > The comment on drm_sched_entity_modify_sched() however
> > > > > > says:
> > > > > > 
> > > > > > """
> > > > > >    * Note that this must be called under the same common
> > > > > > lock for
> > > > > > @entity as
> > > > > >    * drm_sched_job_arm() and drm_sched_entity_push_job(),
> > > > > > or the
> > > > > > driver
> > > > > > needs to
> > > > > >    * guarantee through some other means that this is never
> > > > > > called
> > > > > > while
> > > > > > new jobs
> > > > > >    * can be pushed to @entity.
> > > > > > """
> > > > > > 
> > > > > > It is unclear if that is referring to this race or
> > > > > > something
> > > > > > else.
> > > > > That comment is indeed a bit awkward. Both
> > > > > drm_sched_entity_push_job()
> > > > > and drm_sched_job_arm() take rq_lock. But
> > > > > drm_sched_entity_modify_sched() doesn't.
> > > > > 
> > > > > The comment was written in 981b04d968561. Interestingly, in
> > > > > drm_sched_entity_push_job(), this "common lock" is mentioned
> > > > > with
> > > > > the
> > > > > soft requirement word "should" and apparently is more about
> > > > > keeping
> > > > > sequence numbers in order when inserting.
> > > > > 
> > > > > I tend to think that the issue discovered by you is unrelated
> > > > > to
> > > > > that
> > > > > comment. But if no one can make sense of the comment, should
> > > > > it
> > > > > maybe
> > > > > be removed? Confusing comment is arguably worse than no
> > > > > comment
> > > > Agree, we probably mixed up in 981b04d968561 that submission
> > > > needs a
> > > > common lock and that rq/priority needs to be protected by the
> > > > rq_lock.
> > > > 
> > > > There is also the big FIXME in the drm_sched_entity
> > > > documentation
> > > > pointing out that this is most likely not implemented
> > > > correctly.
> > > > 
> > > > I suggest to move the rq, priority and rq_lock fields together
> > > > in the
> > > > drm_sched_entity structure and document that rq_lock is
> > > > protecting
> > > > the two.
> > > That could also be a great opportunity for improving the lock
> > > naming:
> > 
> > Well that comment made me laugh because I point out the same when
> > the 
> > scheduler came out ~8years ago and nobody cared about it since
> > then.
> > 
> > But yeah completely agree :)
> 
> Maybe, but we need to keep in sight the fact some of these fixes may
> be 
> good to backport. In which case re-naming exercises are best left to
> follow.

My argument basically. It's good if fixes and other improvements are
separated, in general, unless there is a practical / good reason not
to.

> 
> Also..
> 
> > > void drm_sched_rq_update_fifo(struct drm_sched_entity *entity,
> > > ktime_t 
> > > ts)
> > > {
> > >     /*
> > >      * Both locks need to be grabbed, one to protect from entity-
> > > >rq 
> > > change
> > >      * for entity from within concurrent
> > > drm_sched_entity_select_rq 
> > > and the
> > >      * other to update the rb tree structure.
> > >      */
> > >     spin_lock(&entity->rq_lock);
> > >     spin_lock(&entity->rq->lock);
> 
> .. I agree this is quite unredable and my initial reaction was a
> similar 
> ugh. However.. What names would you guys suggest and for what to make
> this better and not lessen the logic of naming each individually?

According to the documentation, drm_sched_rq.lock does not protect the
entire runqueue, but "@lock: to modify the entities list."

So I would keep drm_sched_entity.rq_lock as it is, because it indeed
protects the entire runqueue.

And drm_sched_rq.lock could be named "entities_lock" or
"entities_list_lock" or something. That's debatable, but it should be
something that highlights that this lock is not for locking the entire
runqueue as the one in the entity apparently is.
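
With such a name the snippet quoted above would then read, for example
(name just for illustration, of course):

	spin_lock(&entity->rq_lock);
	spin_lock(&entity->rq->entities_lock);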


Cheers,
P.

> 
> Regards,
> 
> Tvrtko
> 
> > > [...]
> > > 
> > > 
> > > P.
> > > 
> > > 
> > > > Then audit the code if all users of rq and priority actually
> > > > hold the
> > > > correct locks while reading and writing them.
> > > > 
> > > > Regards,
> > > > Christian.
> > > > 
> > > > > P.
> > > > > 
> > > > > > Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
> > > > > > Fixes: b37aced31eb0 ("drm/scheduler: implement a function
> > > > > > to
> > > > > > modify
> > > > > > sched list")
> > > > > > Cc: Christian König <christian.koenig@amd.com>
> > > > > > Cc: Alex Deucher <alexander.deucher@amd.com>
> > > > > > Cc: Luben Tuikov <ltuikov89@gmail.com>
> > > > > > Cc: Matthew Brost <matthew.brost@intel.com>
> > > > > > Cc: David Airlie <airlied@gmail.com>
> > > > > > Cc: Daniel Vetter <daniel@ffwll.ch>
> > > > > > Cc: dri-devel@lists.freedesktop.org
> > > > > > Cc: <stable@vger.kernel.org> # v5.7+
> > > > > > ---
> > > > > >    drivers/gpu/drm/scheduler/sched_entity.c | 2 ++
> > > > > >    1 file changed, 2 insertions(+)
> > > > > > 
> > > > > > diff --git a/drivers/gpu/drm/scheduler/sched_entity.c
> > > > > > b/drivers/gpu/drm/scheduler/sched_entity.c
> > > > > > index 58c8161289fe..ae8be30472cd 100644
> > > > > > --- a/drivers/gpu/drm/scheduler/sched_entity.c
> > > > > > +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> > > > > > @@ -133,8 +133,10 @@ void
> > > > > > drm_sched_entity_modify_sched(struct
> > > > > > drm_sched_entity *entity,
> > > > > >    {
> > > > > >        WARN_ON(!num_sched_list || !sched_list);
> > > > > > +    spin_lock(&entity->rq_lock);
> > > > > >        entity->sched_list = sched_list;
> > > > > >        entity->num_sched_list = num_sched_list;
> > > > > > +    spin_unlock(&entity->rq_lock);
> > > > > >    }
> > > > > >    EXPORT_SYMBOL(drm_sched_entity_modify_sched);
> > 
>
Tvrtko Ursulin Sept. 9, 2024, 1:27 p.m. UTC | #7
On 09/09/2024 13:46, Philipp Stanner wrote:
> On Mon, 2024-09-09 at 13:37 +0100, Tvrtko Ursulin wrote:
>>
>> On 09/09/2024 13:18, Christian König wrote:
>>> Am 09.09.24 um 14:13 schrieb Philipp Stanner:
>>>> On Mon, 2024-09-09 at 13:29 +0200, Christian König wrote:
>>>>> Am 09.09.24 um 11:44 schrieb Philipp Stanner:
>>>>>> On Fri, 2024-09-06 at 19:06 +0100, Tvrtko Ursulin wrote:
>>>>>>> From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
>>>>>>>
>>>>>>> Without the locking amdgpu currently can race
>>>>>>> amdgpu_ctx_set_entity_priority() and drm_sched_job_arm(),
>>>>>> I would explicitly say "amdgpu's
>>>>>> amdgpu_ctx_set_entity_priority()
>>>>>> races
>>>>>> through drm_sched_entity_modify_sched() with
>>>>>> drm_sched_job_arm()".
>>>>>>
>>>>>> The actual issue then seems to be drm_sched_job_arm() calling
>>>>>> drm_sched_entity_select_rq(). I would mention that, too.
>>>>>>
>>>>>>
>>>>>>> leading to the
>>>>>>> latter accesing potentially inconsitent entity->sched_list
>>>>>>> and
>>>>>>> entity->num_sched_list pair.
>>>>>>>
>>>>>>> The comment on drm_sched_entity_modify_sched() however
>>>>>>> says:
>>>>>>>
>>>>>>> """
>>>>>>>     * Note that this must be called under the same common
>>>>>>> lock for
>>>>>>> @entity as
>>>>>>>     * drm_sched_job_arm() and drm_sched_entity_push_job(),
>>>>>>> or the
>>>>>>> driver
>>>>>>> needs to
>>>>>>>     * guarantee through some other means that this is never
>>>>>>> called
>>>>>>> while
>>>>>>> new jobs
>>>>>>>     * can be pushed to @entity.
>>>>>>> """
>>>>>>>
>>>>>>> It is unclear if that is referring to this race or
>>>>>>> something
>>>>>>> else.
>>>>>> That comment is indeed a bit awkward. Both
>>>>>> drm_sched_entity_push_job()
>>>>>> and drm_sched_job_arm() take rq_lock. But
>>>>>> drm_sched_entity_modify_sched() doesn't.
>>>>>>
>>>>>> The comment was written in 981b04d968561. Interestingly, in
>>>>>> drm_sched_entity_push_job(), this "common lock" is mentioned
>>>>>> with
>>>>>> the
>>>>>> soft requirement word "should" and apparently is more about
>>>>>> keeping
>>>>>> sequence numbers in order when inserting.
>>>>>>
>>>>>> I tend to think that the issue discovered by you is unrelated
>>>>>> to
>>>>>> that
>>>>>> comment. But if no one can make sense of the comment, should
>>>>>> it
>>>>>> maybe
>>>>>> be removed? Confusing comment is arguably worse than no
>>>>>> comment
>>>>> Agree, we probably mixed up in 981b04d968561 that submission
>>>>> needs a
>>>>> common lock and that rq/priority needs to be protected by the
>>>>> rq_lock.
>>>>>
>>>>> There is also the big FIXME in the drm_sched_entity
>>>>> documentation
>>>>> pointing out that this is most likely not implemented
>>>>> correctly.
>>>>>
>>>>> I suggest to move the rq, priority and rq_lock fields together
>>>>> in the
>>>>> drm_sched_entity structure and document that rq_lock is
>>>>> protecting
>>>>> the two.
>>>> That could also be a great opportunity for improving the lock
>>>> naming:
>>>
>>> Well that comment made me laugh because I point out the same when
>>> the
>>> scheduler came out ~8years ago and nobody cared about it since
>>> then.
>>>
>>> But yeah completely agree :)
>>
>> Maybe, but we need to keep in sight the fact some of these fixes may
>> be
>> good to backport. In which case re-naming exercises are best left to
>> follow.
> 
> My argument basically. It's good if fixes and other improvements are
> separated, in general, unless there is a practical / good reason not
> to.

Ah cool, I am happy to add follow-up patches after the fixes.

>> Also..
>>
>>>> void drm_sched_rq_update_fifo(struct drm_sched_entity *entity,
>>>> ktime_t
>>>> ts)
>>>> {
>>>>      /*
>>>>       * Both locks need to be grabbed, one to protect from entity-
>>>>> rq
>>>> change
>>>>       * for entity from within concurrent
>>>> drm_sched_entity_select_rq
>>>> and the
>>>>       * other to update the rb tree structure.
>>>>       */
>>>>      spin_lock(&entity->rq_lock);
>>>>      spin_lock(&entity->rq->lock);
>>
>> .. I agree this is quite unredable and my initial reaction was a
>> similar
>> ugh. However.. What names would you guys suggest and for what to make
>> this better and not lessen the logic of naming each individually?
> 
> According to the documentation, drm_sched_rq.lock does not protect the
> entire runque, but "@lock: to modify the entities list."
> 
> So I would keep drm_sched_entity.rq_lock as it is, because it indeed
> protects the entire runqueue.

Agreed on entity->rq_lock.

> And drm_sched_rq.lock could be named "entities_lock" or
> "entities_list_lock" or something. That's debatable, but it should be
> something that highlights that this lock is not for locking the entire
> runque as the one in the entity apparently is.

AFAICT it also protects rq->current_entity and rq->rb_tree_root in which 
case it is a bit more tricky. Only rq->sched is outside its scope. Hm. 
Maybe just re-arrange the struct to be like:

struct drm_sched_rq {
	struct drm_gpu_scheduler	*sched;

	spinlock_t			lock;
	/* Following members are protected by the @lock: */
	struct list_head		entities;
	struct drm_sched_entity		*current_entity;
	struct rb_root_cached		rb_tree_root;
};

I have no ideas for better naming. But this would be in line with
Christian's suggestion for tidying the order in drm_sched_entity.

I am also not sure what is the point of setting rq->current_entity in 
drm_sched_rq_select_entity_fifo().

Regards,

Tvrtko

> 
> 
> Cheers,
> P.
> 
>>
>> Regards,
>>
>> Tvrtko
>>
>>>> [...]
>>>>
>>>>
>>>> P.
>>>>
>>>>
>>>>> Then audit the code if all users of rq and priority actually
>>>>> hold the
>>>>> correct locks while reading and writing them.
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>> P.
>>>>>>
>>>>>>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
>>>>>>> Fixes: b37aced31eb0 ("drm/scheduler: implement a function
>>>>>>> to
>>>>>>> modify
>>>>>>> sched list")
>>>>>>> Cc: Christian König <christian.koenig@amd.com>
>>>>>>> Cc: Alex Deucher <alexander.deucher@amd.com>
>>>>>>> Cc: Luben Tuikov <ltuikov89@gmail.com>
>>>>>>> Cc: Matthew Brost <matthew.brost@intel.com>
>>>>>>> Cc: David Airlie <airlied@gmail.com>
>>>>>>> Cc: Daniel Vetter <daniel@ffwll.ch>
>>>>>>> Cc: dri-devel@lists.freedesktop.org
>>>>>>> Cc: <stable@vger.kernel.org> # v5.7+
>>>>>>> ---
>>>>>>>     drivers/gpu/drm/scheduler/sched_entity.c | 2 ++
>>>>>>>     1 file changed, 2 insertions(+)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c
>>>>>>> b/drivers/gpu/drm/scheduler/sched_entity.c
>>>>>>> index 58c8161289fe..ae8be30472cd 100644
>>>>>>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>>>>>>> @@ -133,8 +133,10 @@ void
>>>>>>> drm_sched_entity_modify_sched(struct
>>>>>>> drm_sched_entity *entity,
>>>>>>>     {
>>>>>>>         WARN_ON(!num_sched_list || !sched_list);
>>>>>>> +    spin_lock(&entity->rq_lock);
>>>>>>>         entity->sched_list = sched_list;
>>>>>>>         entity->num_sched_list = num_sched_list;
>>>>>>> +    spin_unlock(&entity->rq_lock);
>>>>>>>     }
>>>>>>>     EXPORT_SYMBOL(drm_sched_entity_modify_sched);
>>>
>>
>
Philipp Stanner Sept. 9, 2024, 1:40 p.m. UTC | #8
On Mon, 2024-09-09 at 14:27 +0100, Tvrtko Ursulin wrote:
> 
> On 09/09/2024 13:46, Philipp Stanner wrote:
> > On Mon, 2024-09-09 at 13:37 +0100, Tvrtko Ursulin wrote:
> > > 
> > > On 09/09/2024 13:18, Christian König wrote:
> > > > Am 09.09.24 um 14:13 schrieb Philipp Stanner:
> > > > > On Mon, 2024-09-09 at 13:29 +0200, Christian König wrote:
> > > > > > Am 09.09.24 um 11:44 schrieb Philipp Stanner:
> > > > > > > On Fri, 2024-09-06 at 19:06 +0100, Tvrtko Ursulin wrote:
> > > > > > > > From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
> > > > > > > > 
> > > > > > > > Without the locking amdgpu currently can race
> > > > > > > > amdgpu_ctx_set_entity_priority() and
> > > > > > > > drm_sched_job_arm(),
> > > > > > > I would explicitly say "amdgpu's
> > > > > > > amdgpu_ctx_set_entity_priority()
> > > > > > > races
> > > > > > > through drm_sched_entity_modify_sched() with
> > > > > > > drm_sched_job_arm()".
> > > > > > > 
> > > > > > > The actual issue then seems to be drm_sched_job_arm()
> > > > > > > calling
> > > > > > > drm_sched_entity_select_rq(). I would mention that, too.
> > > > > > > 
> > > > > > > 
> > > > > > > > leading to the
> > > > > > > > latter accesing potentially inconsitent entity-
> > > > > > > > >sched_list
> > > > > > > > and
> > > > > > > > entity->num_sched_list pair.
> > > > > > > > 
> > > > > > > > The comment on drm_sched_entity_modify_sched() however
> > > > > > > > says:
> > > > > > > > 
> > > > > > > > """
> > > > > > > >     * Note that this must be called under the same
> > > > > > > > common
> > > > > > > > lock for
> > > > > > > > @entity as
> > > > > > > >     * drm_sched_job_arm() and
> > > > > > > > drm_sched_entity_push_job(),
> > > > > > > > or the
> > > > > > > > driver
> > > > > > > > needs to
> > > > > > > >     * guarantee through some other means that this is
> > > > > > > > never
> > > > > > > > called
> > > > > > > > while
> > > > > > > > new jobs
> > > > > > > >     * can be pushed to @entity.
> > > > > > > > """
> > > > > > > > 
> > > > > > > > It is unclear if that is referring to this race or
> > > > > > > > something
> > > > > > > > else.
> > > > > > > That comment is indeed a bit awkward. Both
> > > > > > > drm_sched_entity_push_job()
> > > > > > > and drm_sched_job_arm() take rq_lock. But
> > > > > > > drm_sched_entity_modify_sched() doesn't.
> > > > > > > 
> > > > > > > The comment was written in 981b04d968561. Interestingly,
> > > > > > > in
> > > > > > > drm_sched_entity_push_job(), this "common lock" is
> > > > > > > mentioned
> > > > > > > with
> > > > > > > the
> > > > > > > soft requirement word "should" and apparently is more
> > > > > > > about
> > > > > > > keeping
> > > > > > > sequence numbers in order when inserting.
> > > > > > > 
> > > > > > > I tend to think that the issue discovered by you is
> > > > > > > unrelated
> > > > > > > to
> > > > > > > that
> > > > > > > comment. But if no one can make sense of the comment,
> > > > > > > should
> > > > > > > it
> > > > > > > maybe
> > > > > > > be removed? Confusing comment is arguably worse than no
> > > > > > > comment
> > > > > > Agree, we probably mixed up in 981b04d968561 that
> > > > > > submission
> > > > > > needs a
> > > > > > common lock and that rq/priority needs to be protected by
> > > > > > the
> > > > > > rq_lock.
> > > > > > 
> > > > > > There is also the big FIXME in the drm_sched_entity
> > > > > > documentation
> > > > > > pointing out that this is most likely not implemented
> > > > > > correctly.
> > > > > > 
> > > > > > I suggest to move the rq, priority and rq_lock fields
> > > > > > together
> > > > > > in the
> > > > > > drm_sched_entity structure and document that rq_lock is
> > > > > > protecting
> > > > > > the two.
> > > > > That could also be a great opportunity for improving the lock
> > > > > naming:
> > > > 
> > > > Well that comment made me laugh because I point out the same
> > > > when
> > > > the
> > > > scheduler came out ~8years ago and nobody cared about it since
> > > > then.
> > > > 
> > > > But yeah completely agree :)
> > > 
> > > Maybe, but we need to keep in sight the fact some of these fixes
> > > may
> > > be
> > > good to backport. In which case re-naming exercises are best left
> > > to
> > > follow.
> > 
> > My argument basically. It's good if fixes and other improvements
> > are
> > separated, in general, unless there is a practical / good reason
> > not
> > to.
> 
> Ah cool, I am happy to add follow up patches after the fixes.
> 
> > > Also..
> > > 
> > > > > void drm_sched_rq_update_fifo(struct drm_sched_entity
> > > > > *entity,
> > > > > ktime_t
> > > > > ts)
> > > > > {
> > > > >      /*
> > > > >       * Both locks need to be grabbed, one to protect from
> > > > > entity-
> > > > > > rq
> > > > > change
> > > > >       * for entity from within concurrent
> > > > > drm_sched_entity_select_rq
> > > > > and the
> > > > >       * other to update the rb tree structure.
> > > > >       */
> > > > >      spin_lock(&entity->rq_lock);
> > > > >      spin_lock(&entity->rq->lock);
> > > 
> > > .. I agree this is quite unredable and my initial reaction was a
> > > similar
> > > ugh. However.. What names would you guys suggest and for what to
> > > make
> > > this better and not lessen the logic of naming each individually?
> > 
> > According to the documentation, drm_sched_rq.lock does not protect
> > the
> > entire runque, but "@lock: to modify the entities list."
> > 
> > So I would keep drm_sched_entity.rq_lock as it is, because it
> > indeed
> > protects the entire runqueue.
> 
> Agreed on entity->rq_lock.
> 
> > And drm_sched_rq.lock could be named "entities_lock" or
> > "entities_list_lock" or something. That's debatable, but it should
> > be
> > something that highlights that this lock is not for locking the
> > entire
> > runque as the one in the entity apparently is.
> 
> AFAICT it also protects rq->current_entity and rq->rb_tree_root in
> which 
> case it is a bit more tricky.

Then in any case we'll also have to update the documentation snippet I
had quoted above.

btw. I'm not saying you have to do all of that; I'm also currently
working on some additional scheduler documentation.

>  Only rq->sched is outside its scope. Hm. 
> Maybe just re-arrange the struct to be like:
> 
> struct drm_sched_rq {
> 	struct drm_gpu_scheduler	*sched;
> 
> 	spinlock_t			lock;
> 	/* Following members are protected by the @lock: */
> 	struct list_head		entities;
> 	struct drm_sched_entity		*current_entity;
> 	struct rb_root_cached		rb_tree_root;
> };
> 
> I have no ideas for better naming.

Hmmm. Difficult.

Maybe rq_outer_lock <-> rq_inner_lock or "partial" and "whole".

Which explains why no one ever bothered renaming it.

>  But this would be inline with 
> Christian's suggestion for tidying the order in drm_sched_entity.
> 
> I am also not sure what is the point of setting rq->current_entity in
> drm_sched_rq_select_entity_fifo().

It seems to me that current_entity is only used (read) in
drm_sched_rq_remove_entity() and drm_sched_rq_select_entity_rr(). It 
seems to be only really useful for the RR function?

drm_sched_rq_select_entity_fifo() was added later, in 08fb97de03aa2,
and it's very likely that the author modelled it on the RR function. So
it's possible it's actually not needed and was just copied by accident.


P.

> 
> Regards,
> 
> Tvrtko
> 
> > 
> > 
> > Cheers,
> > P.
> > 
> > > 
> > > Regards,
> > > 
> > > Tvrtko
> > > 
> > > > > [...]
> > > > > 
> > > > > 
> > > > > P.
> > > > > 
> > > > > 
> > > > > > Then audit the code if all users of rq and priority
> > > > > > actually
> > > > > > hold the
> > > > > > correct locks while reading and writing them.
> > > > > > 
> > > > > > Regards,
> > > > > > Christian.
> > > > > > 
> > > > > > > P.
> > > > > > > 
> > > > > > > > Signed-off-by: Tvrtko Ursulin
> > > > > > > > <tvrtko.ursulin@igalia.com>
> > > > > > > > Fixes: b37aced31eb0 ("drm/scheduler: implement a
> > > > > > > > function
> > > > > > > > to
> > > > > > > > modify
> > > > > > > > sched list")
> > > > > > > > Cc: Christian König <christian.koenig@amd.com>
> > > > > > > > Cc: Alex Deucher <alexander.deucher@amd.com>
> > > > > > > > Cc: Luben Tuikov <ltuikov89@gmail.com>
> > > > > > > > Cc: Matthew Brost <matthew.brost@intel.com>
> > > > > > > > Cc: David Airlie <airlied@gmail.com>
> > > > > > > > Cc: Daniel Vetter <daniel@ffwll.ch>
> > > > > > > > Cc: dri-devel@lists.freedesktop.org
> > > > > > > > Cc: <stable@vger.kernel.org> # v5.7+
> > > > > > > > ---
> > > > > > > >     drivers/gpu/drm/scheduler/sched_entity.c | 2 ++
> > > > > > > >     1 file changed, 2 insertions(+)
> > > > > > > > 
> > > > > > > > diff --git a/drivers/gpu/drm/scheduler/sched_entity.c
> > > > > > > > b/drivers/gpu/drm/scheduler/sched_entity.c
> > > > > > > > index 58c8161289fe..ae8be30472cd 100644
> > > > > > > > --- a/drivers/gpu/drm/scheduler/sched_entity.c
> > > > > > > > +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> > > > > > > > @@ -133,8 +133,10 @@ void
> > > > > > > > drm_sched_entity_modify_sched(struct
> > > > > > > > drm_sched_entity *entity,
> > > > > > > >     {
> > > > > > > >         WARN_ON(!num_sched_list || !sched_list);
> > > > > > > > +    spin_lock(&entity->rq_lock);
> > > > > > > >         entity->sched_list = sched_list;
> > > > > > > >         entity->num_sched_list = num_sched_list;
> > > > > > > > +    spin_unlock(&entity->rq_lock);
> > > > > > > >     }
> > > > > > > >     EXPORT_SYMBOL(drm_sched_entity_modify_sched);
> > > > 
> > > 
> > 
>
Christian König Sept. 9, 2024, 1:50 p.m. UTC | #9
Am 09.09.24 um 15:27 schrieb Tvrtko Ursulin:
>
> On 09/09/2024 13:46, Philipp Stanner wrote:
>> On Mon, 2024-09-09 at 13:37 +0100, Tvrtko Ursulin wrote:
>>> [SNIP]
>>>>> That could also be a great opportunity for improving the lock
>>>>> naming:
>>>>
>>>> Well that comment made me laugh because I point out the same when
>>>> the
>>>> scheduler came out ~8years ago and nobody cared about it since
>>>> then.
>>>>
>>>> But yeah completely agree :)
>>>
>>> Maybe, but we need to keep in sight the fact some of these fixes may
>>> be
>>> good to backport. In which case re-naming exercises are best left to
>>> follow.
>>
>> My argument basically. It's good if fixes and other improvements are
>> separated, in general, unless there is a practical / good reason not
>> to.
>
> Ah cool, I am happy to add follow up patches after the fixes.

+1

>
>>> Also..
>>>
>>>>> void drm_sched_rq_update_fifo(struct drm_sched_entity *entity,
>>>>> ktime_t
>>>>> ts)
>>>>> {
>>>>>      /*
>>>>>       * Both locks need to be grabbed, one to protect from entity-
>>>>>> rq
>>>>> change
>>>>>       * for entity from within concurrent
>>>>> drm_sched_entity_select_rq
>>>>> and the
>>>>>       * other to update the rb tree structure.
>>>>>       */
>>>>>      spin_lock(&entity->rq_lock);
>>>>>      spin_lock(&entity->rq->lock);
>>>
>>> .. I agree this is quite unredable and my initial reaction was a
>>> similar
>>> ugh. However.. What names would you guys suggest and for what to make
>>> this better and not lessen the logic of naming each individually?
>>
>> According to the documentation, drm_sched_rq.lock does not protect the
>> entire runque, but "@lock: to modify the entities list."
>>
>> So I would keep drm_sched_entity.rq_lock as it is, because it indeed
>> protects the entire runqueue.
>
> Agreed on entity->rq_lock.

I would just name that lock "lock", since it should be a protection of
fields in the drm_sched_entity structure.

That those fields are the rq and priority members should not necessarily
have an influence on the name of the lock protecting it.

Only when we have multiple locks in the same structure do we need to
start giving them distinct names.
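
I.e. in the snippet quoted above that would then simply become (just to
illustrate the naming):

	spin_lock(&entity->lock);
	spin_lock(&entity->rq->lock);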

>
>> And drm_sched_rq.lock could be named "entities_lock" or
>> "entities_list_lock" or something. That's debatable, but it should be
>> something that highlights that this lock is not for locking the entire
>> runque as the one in the entity apparently is.
>
> AFAICT it also protects rq->current_entity and rq->rb_tree_root in 
> which case it is a bit more tricky. Only rq->sched is outside its 
> scope. Hm. Maybe just re-arrange the struct to be like:
>
> struct drm_sched_rq {
>     struct drm_gpu_scheduler    *sched;
>
>     spinlock_t            lock;
>     /* Following members are protected by the @lock: */
>     struct list_head        entities;
>     struct drm_sched_entity        *current_entity;
>     struct rb_root_cached        rb_tree_root;
> };
>
> I have no ideas for better naming. But this would be inline with 
> Christian's suggestion for tidying the order in drm_sched_entity.

+1

Yeah, I mean see the other structures we have in DRM and the general
Linux kernel. The stuff that is static is usually grouped together since
that is good for cache locality and documentation at the same time.

>
> I am also not sure what is the point of setting rq->current_entity in 
> drm_sched_rq_select_entity_fifo().

No idea either, Luben could answer that.

Christian.

>
> Regards,
>
> Tvrtko
>
>>
>>
>> Cheers,
>> P.
>>
>>>
>>> Regards,
>>>
>>> Tvrtko
>>>
>>>>> [...]
>>>>>
>>>>>
>>>>> P.
>>>>>
>>>>>
>>>>>> Then audit the code if all users of rq and priority actually
>>>>>> hold the
>>>>>> correct locks while reading and writing them.
>>>>>>
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>>>> P.
>>>>>>>
>>>>>>>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
>>>>>>>> Fixes: b37aced31eb0 ("drm/scheduler: implement a function
>>>>>>>> to
>>>>>>>> modify
>>>>>>>> sched list")
>>>>>>>> Cc: Christian König <christian.koenig@amd.com>
>>>>>>>> Cc: Alex Deucher <alexander.deucher@amd.com>
>>>>>>>> Cc: Luben Tuikov <ltuikov89@gmail.com>
>>>>>>>> Cc: Matthew Brost <matthew.brost@intel.com>
>>>>>>>> Cc: David Airlie <airlied@gmail.com>
>>>>>>>> Cc: Daniel Vetter <daniel@ffwll.ch>
>>>>>>>> Cc: dri-devel@lists.freedesktop.org
>>>>>>>> Cc: <stable@vger.kernel.org> # v5.7+
>>>>>>>> ---
>>>>>>>>     drivers/gpu/drm/scheduler/sched_entity.c | 2 ++
>>>>>>>>     1 file changed, 2 insertions(+)
>>>>>>>>
>>>>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c
>>>>>>>> b/drivers/gpu/drm/scheduler/sched_entity.c
>>>>>>>> index 58c8161289fe..ae8be30472cd 100644
>>>>>>>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>>>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>>>>>>>> @@ -133,8 +133,10 @@ void
>>>>>>>> drm_sched_entity_modify_sched(struct
>>>>>>>> drm_sched_entity *entity,
>>>>>>>>     {
>>>>>>>>         WARN_ON(!num_sched_list || !sched_list);
>>>>>>>> +    spin_lock(&entity->rq_lock);
>>>>>>>>         entity->sched_list = sched_list;
>>>>>>>>         entity->num_sched_list = num_sched_list;
>>>>>>>> +    spin_unlock(&entity->rq_lock);
>>>>>>>>     }
>>>>>>>>     EXPORT_SYMBOL(drm_sched_entity_modify_sched);
>>>>
>>>
>>

Patch

diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
index 58c8161289fe..ae8be30472cd 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -133,8 +133,10 @@  void drm_sched_entity_modify_sched(struct drm_sched_entity *entity,
 {
 	WARN_ON(!num_sched_list || !sched_list);
 
+	spin_lock(&entity->rq_lock);
 	entity->sched_list = sched_list;
 	entity->num_sched_list = num_sched_list;
+	spin_unlock(&entity->rq_lock);
 }
 EXPORT_SYMBOL(drm_sched_entity_modify_sched);