[2/2] drm/sched: serialize job_timeout and scheduler

Message ID	1630406139-19621-2-git-send-email-Monk.Liu@amd.com (mailing list archive)
State	New, archived
Headers	From: Monk Liu <Monk.Liu@amd.com> To: <amd-gfx@lists.freedesktop.org> CC: <dri-devel@lists.freedesktop.org>, Monk Liu <Monk.Liu@amd.com>, "jingwen chen" <jingwen.chen@amd.com> Subject: [PATCH 2/2] drm/sched: serialize job_timeout and scheduler Date: Tue, 31 Aug 2021 18:35:39 +0800 In-Reply-To: <1630406139-19621-1-git-send-email-Monk.Liu@amd.com>
Series	[1/2] drm/sched: fix the bug of time out calculation(v3), [2/2] drm/sched: serialize job_timeout and scheduler

Liu, Monk Aug. 31, 2021, 10:35 a.m. UTC

tested-by: jingwen chen <jingwen.chen@amd.com>
Signed-off-by: Monk Liu <Monk.Liu@amd.com>
Signed-off-by: jingwen chen <jingwen.chen@amd.com>
---
 drivers/gpu/drm/scheduler/sched_main.c | 24 ++++--------------------
 1 file changed, 4 insertions(+), 20 deletions(-)

Daniel Vetter Aug. 31, 2021, 12:59 p.m. UTC | #1

Can we please have some actual commit message here, with detailed
explanation of the race/bug/whatever, how you fix it and why this is the
best option?

On Tue, Aug 31, 2021 at 06:35:39PM +0800, Monk Liu wrote:
> tested-by: jingwen chen <jingwen.chen@amd.com>
> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
> Signed-off-by: jingwen chen <jingwen.chen@amd.com>
> ---
>  drivers/gpu/drm/scheduler/sched_main.c | 24 ++++--------------------
>  1 file changed, 4 insertions(+), 20 deletions(-)
> 
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index ecf8140..894fdb24 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -319,19 +319,17 @@ static void drm_sched_job_timedout(struct work_struct *work)
>  	sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
>  
>  	/* Protects against concurrent deletion in drm_sched_get_cleanup_job */
> +	if (!__kthread_should_park(sched->thread))

This is a __ function, i.e. considered internal, and it's lockless atomic,
i.e. unordered. And you're not explaining why this works.

Iow it's probably buggy, and just unconditionally parking the kthread
is probably the right thing to do. If it's not the right thing to do,
there's a bug here for sure.
-Daniel
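
To make the check-then-act window concrete, here is a minimal userspace sketch. It is not kernel code: `fake_kthread` and the `fake_*` helpers are made-up stand-ins for the park flag, and the `between` callback is only a way to force one particular interleaving deterministically. It covers the stale-read window only; it says nothing about unparks that arrive after the park call.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a kthread's park request flag (not kernel code). */
struct fake_kthread {
	atomic_bool should_park;
};

static bool fake_should_park(struct fake_kthread *t)
{
	/* Relaxed load: says nothing about what other contexts do next. */
	return atomic_load_explicit(&t->should_park, memory_order_relaxed);
}

static void fake_park(struct fake_kthread *t)
{
	atomic_store(&t->should_park, true);
}

static void fake_unpark(struct fake_kthread *t)
{
	atomic_store(&t->should_park, false);
}

/*
 * Check-then-act, as in the patch. `between` stands for whatever another
 * context does in the window after the check. Returns whether the thread
 * is parked by the time the caller would go on to touch the pending list.
 */
static bool park_if_needed(struct fake_kthread *t,
			   void (*between)(struct fake_kthread *))
{
	bool already;

	assert(t != NULL);
	already = fake_should_park(t);
	if (between)
		between(t);
	if (!already)
		fake_park(t);
	return fake_should_park(t);
}

/* Unconditional variant: no decision is based on a stale read. */
static bool park_always(struct fake_kthread *t,
			void (*between)(struct fake_kthread *))
{
	assert(t != NULL);
	if (between)
		between(t);
	fake_park(t);
	return fake_should_park(t);
}
```

Starting from a thread that someone else has already parked, `park_if_needed(&t, fake_unpark)` ends with the thread running, while `park_always(&t, fake_unpark)` ends with it parked.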

> +		kthread_park(sched->thread);
> +
>  	spin_lock(&sched->job_list_lock);
>  	job = list_first_entry_or_null(&sched->pending_list,
>  				       struct drm_sched_job, list);
>  
>  	if (job) {
> -		/*
> -		 * Remove the bad job so it cannot be freed by concurrent
> -		 * drm_sched_cleanup_jobs. It will be reinserted back after sched->thread
> -		 * is parked at which point it's safe.
> -		 */
> -		list_del_init(&job->list);
>  		spin_unlock(&sched->job_list_lock);
>  
> +		/* vendor's timeout_job should call drm_sched_start() */
>  		status = job->sched->ops->timedout_job(job);
>  
>  		/*
> @@ -393,20 +391,6 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>  	kthread_park(sched->thread);
>  
>  	/*
> -	 * Reinsert back the bad job here - now it's safe as
> -	 * drm_sched_get_cleanup_job cannot race against us and release the
> -	 * bad job at this point - we parked (waited for) any in progress
> -	 * (earlier) cleanups and drm_sched_get_cleanup_job will not be called
> -	 * now until the scheduler thread is unparked.
> -	 */
> -	if (bad && bad->sched == sched)
> -		/*
> -		 * Add at the head of the queue to reflect it was the earliest
> -		 * job extracted.
> -		 */
> -		list_add(&bad->list, &sched->pending_list);
> -
> -	/*
>  	 * Iterate the job list from later to  earlier one and either deactive
>  	 * their HW callbacks or remove them from pending list if they already
>  	 * signaled.
> -- 
> 2.7.4
>

Daniel Vetter Aug. 31, 2021, 1:01 p.m. UTC | #2

On Tue, Aug 31, 2021 at 02:59:02PM +0200, Daniel Vetter wrote:
> Can we please have some actual commit message here, with detailed
> explanation of the race/bug/whatever, how you fix it and why this is the
> best option?
> 
> On Tue, Aug 31, 2021 at 06:35:39PM +0800, Monk Liu wrote:
> > tested-by: jingwen chen <jingwen.chen@amd.com>
> > Signed-off-by: Monk Liu <Monk.Liu@amd.com>
> > Signed-off-by: jingwen chen <jingwen.chen@amd.com>
> > ---
> >  drivers/gpu/drm/scheduler/sched_main.c | 24 ++++--------------------
> >  1 file changed, 4 insertions(+), 20 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > index ecf8140..894fdb24 100644
> > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > @@ -319,19 +319,17 @@ static void drm_sched_job_timedout(struct work_struct *work)
> >  	sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
> >  
> >  	/* Protects against concurrent deletion in drm_sched_get_cleanup_job */
> > +	if (!__kthread_should_park(sched->thread))
> 
> This is a __ function, i.e. considered internal, and it's lockless atomic,
> i.e. unordered. And you're not explaining why this works.
> 
> > Iow it's probably buggy, and just unconditionally parking the kthread
> is probably the right thing to do. If it's not the right thing to do,
> there's a bug here for sure.

Also why don't we reuse the function drivers already have to stop a
scheduler thread? We seem to have two kthread_park now, that's probably
one too much.
-Daniel

> > +		kthread_park(sched->thread);
> > +
> >  	spin_lock(&sched->job_list_lock);
> >  	job = list_first_entry_or_null(&sched->pending_list,
> >  				       struct drm_sched_job, list);
> >  
> >  	if (job) {
> > -		/*
> > -		 * Remove the bad job so it cannot be freed by concurrent
> > -		 * drm_sched_cleanup_jobs. It will be reinserted back after sched->thread
> > -		 * is parked at which point it's safe.
> > -		 */
> > -		list_del_init(&job->list);
> >  		spin_unlock(&sched->job_list_lock);
> >  
> > +		/* vendor's timeout_job should call drm_sched_start() */
> >  		status = job->sched->ops->timedout_job(job);
> >  
> >  		/*
> > @@ -393,20 +391,6 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
> >  	kthread_park(sched->thread);
> >  
> >  	/*
> > -	 * Reinsert back the bad job here - now it's safe as
> > -	 * drm_sched_get_cleanup_job cannot race against us and release the
> > -	 * bad job at this point - we parked (waited for) any in progress
> > -	 * (earlier) cleanups and drm_sched_get_cleanup_job will not be called
> > -	 * now until the scheduler thread is unparked.
> > -	 */
> > -	if (bad && bad->sched == sched)
> > -		/*
> > -		 * Add at the head of the queue to reflect it was the earliest
> > -		 * job extracted.
> > -		 */
> > -		list_add(&bad->list, &sched->pending_list);
> > -
> > -	/*
> >  	 * Iterate the job list from later to  earlier one and either deactive
> >  	 * their HW callbacks or remove them from pending list if they already
> >  	 * signaled.
> > -- 
> > 2.7.4
> > 
> 
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

Andrey Grodzovsky Aug. 31, 2021, 1:53 p.m. UTC | #3

It says patch [2/2] but I can't find patch 1.

On 2021-08-31 6:35 a.m., Monk Liu wrote:
> tested-by: jingwen chen <jingwen.chen@amd.com>
> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
> Signed-off-by: jingwen chen <jingwen.chen@amd.com>
> ---
>   drivers/gpu/drm/scheduler/sched_main.c | 24 ++++--------------------
>   1 file changed, 4 insertions(+), 20 deletions(-)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index ecf8140..894fdb24 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -319,19 +319,17 @@ static void drm_sched_job_timedout(struct work_struct *work)
>   	sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
>   
>   	/* Protects against concurrent deletion in drm_sched_get_cleanup_job */
> +	if (!__kthread_should_park(sched->thread))
> +		kthread_park(sched->thread);
> +


As mentioned before, without serializing against other TDR handlers from
other schedulers you just race here against them, e.g. you parked it now
but another one in progress will unpark it as part of calling
drm_sched_start for other rings [1].
Unless I am missing something, since I haven't found patch [1/2].

[1] - https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c#L5041

Andrey
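
One interleaving of the race described above can be written down as a small deterministic model. This is only a sketch: `fake_sched`, `tdr_begin()` and `device_reset_finish()` are hypothetical names, and the assumption that a device-wide reset restarts every ring's scheduler is taken from the description above rather than from the driver source.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_RINGS 3

/* Hypothetical model of one scheduler per ring; not the kernel structs. */
struct fake_sched {
	bool parked;
};

/* Models parking at the top of one ring's timeout handler. */
static void tdr_begin(struct fake_sched *rings, int ring)
{
	assert(ring >= 0 && ring < NUM_RINGS);
	rings[ring].parked = true;
}

/*
 * Models the tail of a device-wide reset started by another ring's
 * handler: every scheduler is restarted, whoever parked it.
 */
static void device_reset_finish(struct fake_sched *rings)
{
	for (int i = 0; i < NUM_RINGS; i++)
		rings[i].parked = false;
}

/*
 * One possible interleaving: ring `a` parks itself, then a reset that was
 * already in flight for another ring restarts all schedulers.
 * Returns whether ring `a` is still parked afterwards.
 */
static bool still_parked_after_race(int a)
{
	struct fake_sched rings[NUM_RINGS] = { { false } };

	tdr_begin(rings, a);
	device_reset_finish(rings);
	return rings[a].parked;
}
```

In this model `still_parked_after_race(0)` returns false: the handler for ring 0 continues as if its scheduler thread were parked when it no longer is.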


>   	spin_lock(&sched->job_list_lock);
>   	job = list_first_entry_or_null(&sched->pending_list,
>   				       struct drm_sched_job, list);
>   
>   	if (job) {
> -		/*
> -		 * Remove the bad job so it cannot be freed by concurrent
> -		 * drm_sched_cleanup_jobs. It will be reinserted back after sched->thread
> -		 * is parked at which point it's safe.
> -		 */
> -		list_del_init(&job->list);
>   		spin_unlock(&sched->job_list_lock);
>   
> +		/* vendor's timeout_job should call drm_sched_start() */
>   		status = job->sched->ops->timedout_job(job);
>   
>   		/*
> @@ -393,20 +391,6 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>   	kthread_park(sched->thread);
>   
>   	/*
> -	 * Reinsert back the bad job here - now it's safe as
> -	 * drm_sched_get_cleanup_job cannot race against us and release the
> -	 * bad job at this point - we parked (waited for) any in progress
> -	 * (earlier) cleanups and drm_sched_get_cleanup_job will not be called
> -	 * now until the scheduler thread is unparked.
> -	 */
> -	if (bad && bad->sched == sched)
> -		/*
> -		 * Add at the head of the queue to reflect it was the earliest
> -		 * job extracted.
> -		 */
> -		list_add(&bad->list, &sched->pending_list);
> -
> -	/*
>   	 * Iterate the job list from later to  earlier one and either deactive
>   	 * their HW callbacks or remove them from pending list if they already
>   	 * signaled.

Daniel Vetter Aug. 31, 2021, 2:03 p.m. UTC | #4

On Tue, Aug 31, 2021 at 09:53:36AM -0400, Andrey Grodzovsky wrote:
> It says patch [2/2] but I can't find patch 1.
> 
> On 2021-08-31 6:35 a.m., Monk Liu wrote:
> > tested-by: jingwen chen <jingwen.chen@amd.com>
> > Signed-off-by: Monk Liu <Monk.Liu@amd.com>
> > Signed-off-by: jingwen chen <jingwen.chen@amd.com>
> > ---
> >   drivers/gpu/drm/scheduler/sched_main.c | 24 ++++--------------------
> >   1 file changed, 4 insertions(+), 20 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > index ecf8140..894fdb24 100644
> > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > @@ -319,19 +319,17 @@ static void drm_sched_job_timedout(struct work_struct *work)
> >   	sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
> >   	/* Protects against concurrent deletion in drm_sched_get_cleanup_job */
> > +	if (!__kthread_should_park(sched->thread))
> > +		kthread_park(sched->thread);
> > +
> 
> 
> As mentioned before, without serializing against other TDR handlers from
> other schedulers you just race here against them, e.g. you parked it now
> but another one in progress will unpark it as part of calling
> drm_sched_start for other rings [1].
> Unless I am missing something, since I haven't found patch [1/2].
> 
> [1] - https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c#L5041

You need to have your own wq and run all your tdr work on the same wq if
your reset has any cross-engine impact.

See

https://dri.freedesktop.org/docs/drm/gpu/drm-mm.html#c.drm_sched_backend_ops

for the ->timedout_job callback docs. I thought I brought this up already?
-Daniel
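
The property being relied on here is that an ordered workqueue (in the kernel, one created with `alloc_ordered_workqueue()`) runs at most one work item at a time, in queue order. The following is a single-threaded userspace model of just that property, with made-up names (`ordered_wq`, `tdr_handler`, `reset_stats`); it is a sketch of the semantics, not of how a driver would wire it up.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_WORK 8

typedef void (*work_fn)(void *arg);

/* Hypothetical single-threaded model of an ordered workqueue. */
struct ordered_wq {
	work_fn fn[MAX_WORK];
	void *arg[MAX_WORK];
	size_t head, tail;
};

/* Queue a work item; returns 0 on success, -1 if the queue is full. */
static int wq_queue(struct ordered_wq *wq, work_fn fn, void *arg)
{
	if (wq->tail == MAX_WORK)
		return -1;
	wq->fn[wq->tail] = fn;
	wq->arg[wq->tail] = arg;
	wq->tail++;
	return 0;
}

/* Run queued items strictly one after another, in queue order. */
static void wq_drain(struct ordered_wq *wq)
{
	while (wq->head < wq->tail) {
		size_t i = wq->head++;

		wq->fn[i](wq->arg[i]);
	}
}

/* Shared bookkeeping so the handlers can detect any overlap. */
struct reset_stats {
	int active;          /* handlers currently running */
	int max_active;      /* highest overlap ever observed */
	int order[MAX_WORK]; /* ring ids in the order they were handled */
	int count;
};

struct tdr_work {
	int ring;
	struct reset_stats *stats;
};

/* Stand-in for one ring's timeout handler. */
static void tdr_handler(void *arg)
{
	struct tdr_work *w = arg;
	struct reset_stats *s = w->stats;

	assert(s->count < MAX_WORK);
	s->active++;
	if (s->active > s->max_active)
		s->max_active = s->active;
	s->order[s->count++] = w->ring; /* the reset would happen here */
	s->active--;
}
```

Queueing the timeout work of two different rings on the same `ordered_wq` and draining it leaves `max_active` at 1, so no handler ever observes another one halfway through its park/reset/unpark sequence.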

> 
> Andrey
> 
> 
> >   	spin_lock(&sched->job_list_lock);
> >   	job = list_first_entry_or_null(&sched->pending_list,
> >   				       struct drm_sched_job, list);
> >   	if (job) {
> > -		/*
> > -		 * Remove the bad job so it cannot be freed by concurrent
> > -		 * drm_sched_cleanup_jobs. It will be reinserted back after sched->thread
> > -		 * is parked at which point it's safe.
> > -		 */
> > -		list_del_init(&job->list);
> >   		spin_unlock(&sched->job_list_lock);
> > +		/* vendor's timeout_job should call drm_sched_start() */
> >   		status = job->sched->ops->timedout_job(job);
> >   		/*
> > @@ -393,20 +391,6 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
> >   	kthread_park(sched->thread);
> >   	/*
> > -	 * Reinsert back the bad job here - now it's safe as
> > -	 * drm_sched_get_cleanup_job cannot race against us and release the
> > -	 * bad job at this point - we parked (waited for) any in progress
> > -	 * (earlier) cleanups and drm_sched_get_cleanup_job will not be called
> > -	 * now until the scheduler thread is unparked.
> > -	 */
> > -	if (bad && bad->sched == sched)
> > -		/*
> > -		 * Add at the head of the queue to reflect it was the earliest
> > -		 * job extracted.
> > -		 */
> > -		list_add(&bad->list, &sched->pending_list);
> > -
> > -	/*
> >   	 * Iterate the job list from later to  earlier one and either deactive
> >   	 * their HW callbacks or remove them from pending list if they already
> >   	 * signaled.

Andrey Grodzovsky Aug. 31, 2021, 2:20 p.m. UTC | #5

On 2021-08-31 10:03 a.m., Daniel Vetter wrote:
> On Tue, Aug 31, 2021 at 09:53:36AM -0400, Andrey Grodzovsky wrote:
>> It says patch [2/2] but I can't find patch 1.
>>
>> On 2021-08-31 6:35 a.m., Monk Liu wrote:
>>> tested-by: jingwen chen <jingwen.chen@amd.com>
>>> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
>>> Signed-off-by: jingwen chen <jingwen.chen@amd.com>
>>> ---
>>>    drivers/gpu/drm/scheduler/sched_main.c | 24 ++++--------------------
>>>    1 file changed, 4 insertions(+), 20 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>> index ecf8140..894fdb24 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>> @@ -319,19 +319,17 @@ static void drm_sched_job_timedout(struct work_struct *work)
>>>    	sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
>>>    	/* Protects against concurrent deletion in drm_sched_get_cleanup_job */
>>> +	if (!__kthread_should_park(sched->thread))
>>> +		kthread_park(sched->thread);
>>> +
>>
>> As mentioned before, without serializing against other TDR handlers from
>> other schedulers you just race here against them, e.g. you parked it now
>> but another one in progress will unpark it as part of calling
>> drm_sched_start for other rings [1].
>> Unless I am missing something, since I haven't found patch [1/2].
>>
>> [1] - https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c#L5041
> You need to have your own wq and run all your tdr work on the same wq if
> your reset has any cross-engine impact.


IMHO what is problematic in serializing vs. locking (with trylock and
bail out like we do in [1]) is multiple TO events arising from the same
cause: maybe one job just waits for another, and once the first has hung
the second will also appear to be hung, triggering its own TO event.
In this case multiple TO events will trigger multiple resets if we
serialize, but if we use a lock with trylock the second one will quietly
bail out.

[1] https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c#L4903

Andrey
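
The difference between the two styles can be sketched in userspace C. Everything here is hypothetical (`fake_dev`, `tdr_trylock()`, `tdr_serialized()`); the `meanwhile` callback simply simulates a second timeout firing while the first handler is still inside its reset, and the serialized variant deliberately has no recheck.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical device state; the names are illustrative only. */
struct fake_dev {
	atomic_flag in_reset;
	int resets;  /* resets actually performed */
	int bailed;  /* handlers that backed off */
};

typedef void (*concurrent_fn)(struct fake_dev *dev);

/*
 * Trylock style: whoever gets the flag resets, everyone else bails.
 * `meanwhile` simulates a second timeout firing while we hold the flag.
 * Returns true if this caller performed the reset.
 */
static bool tdr_trylock(struct fake_dev *dev, concurrent_fn meanwhile)
{
	assert(dev != NULL);
	if (atomic_flag_test_and_set(&dev->in_reset))
		return false; /* a reset is already in progress: bail out */
	if (meanwhile)
		meanwhile(dev);
	dev->resets++;
	atomic_flag_clear(&dev->in_reset);
	return true;
}

/* The second timeout, arriving while the first handler is mid-reset. */
static void second_timeout(struct fake_dev *dev)
{
	if (!tdr_trylock(dev, NULL))
		dev->bailed++;
}

/* Serialized style without any recheck: each queued handler resets. */
static void tdr_serialized(struct fake_dev *dev)
{
	assert(dev != NULL);
	dev->resets++;
}
```

With trylock, `tdr_trylock(&dev, second_timeout)` yields one reset and one bail-out; two back-to-back calls to `tdr_serialized()` yield two resets for the same root cause.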


>
> See
>
> https://dri.freedesktop.org/docs/drm/gpu/drm-mm.html#c.drm_sched_backend_ops
>
> for the ->timedout_job callback docs. I thought I brought this up already?
> -Daniel


Yes, this discussion is a continuation of your comment about 
serializing, I mentioned before that you proposed it.

Andrey


>
>> Andrey
>>
>>
>>>    	spin_lock(&sched->job_list_lock);
>>>    	job = list_first_entry_or_null(&sched->pending_list,
>>>    				       struct drm_sched_job, list);
>>>    	if (job) {
>>> -		/*
>>> -		 * Remove the bad job so it cannot be freed by concurrent
>>> -		 * drm_sched_cleanup_jobs. It will be reinserted back after sched->thread
>>> -		 * is parked at which point it's safe.
>>> -		 */
>>> -		list_del_init(&job->list);
>>>    		spin_unlock(&sched->job_list_lock);
>>> +		/* vendor's timeout_job should call drm_sched_start() */
>>>    		status = job->sched->ops->timedout_job(job);
>>>    		/*
>>> @@ -393,20 +391,6 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>>>    	kthread_park(sched->thread);
>>>    	/*
>>> -	 * Reinsert back the bad job here - now it's safe as
>>> -	 * drm_sched_get_cleanup_job cannot race against us and release the
>>> -	 * bad job at this point - we parked (waited for) any in progress
>>> -	 * (earlier) cleanups and drm_sched_get_cleanup_job will not be called
>>> -	 * now until the scheduler thread is unparked.
>>> -	 */
>>> -	if (bad && bad->sched == sched)
>>> -		/*
>>> -		 * Add at the head of the queue to reflect it was the earliest
>>> -		 * job extracted.
>>> -		 */
>>> -		list_add(&bad->list, &sched->pending_list);
>>> -
>>> -	/*
>>>    	 * Iterate the job list from later to  earlier one and either deactive
>>>    	 * their HW callbacks or remove them from pending list if they already
>>>    	 * signaled.

Daniel Vetter Aug. 31, 2021, 2:38 p.m. UTC | #6

On Tue, Aug 31, 2021 at 10:20:40AM -0400, Andrey Grodzovsky wrote:
> 
> On 2021-08-31 10:03 a.m., Daniel Vetter wrote:
> > On Tue, Aug 31, 2021 at 09:53:36AM -0400, Andrey Grodzovsky wrote:
> > > It says patch [2/2] but I can't find patch 1.
> > > 
> > > On 2021-08-31 6:35 a.m., Monk Liu wrote:
> > > > tested-by: jingwen chen <jingwen.chen@amd.com>
> > > > Signed-off-by: Monk Liu <Monk.Liu@amd.com>
> > > > Signed-off-by: jingwen chen <jingwen.chen@amd.com>
> > > > ---
> > > >    drivers/gpu/drm/scheduler/sched_main.c | 24 ++++--------------------
> > > >    1 file changed, 4 insertions(+), 20 deletions(-)
> > > > 
> > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > > > index ecf8140..894fdb24 100644
> > > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > > @@ -319,19 +319,17 @@ static void drm_sched_job_timedout(struct work_struct *work)
> > > >    	sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
> > > >    	/* Protects against concurrent deletion in drm_sched_get_cleanup_job */
> > > > +	if (!__kthread_should_park(sched->thread))
> > > > +		kthread_park(sched->thread);
> > > > +
> > > 
> > > As mentioned before, without serializing against other TDR handlers from
> > > other schedulers you just race here against them, e.g. you parked it now
> > > but another one in progress will unpark it as part of calling
> > > drm_sched_start for other rings [1].
> > > Unless I am missing something, since I haven't found patch [1/2].
> > > 
> > > [1] - https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c#L5041
> > You need to have your own wq and run all your tdr work on the same wq if
> > your reset has any cross-engine impact.
> 
> 
> > IMHO what is problematic in serializing vs. locking (with trylock and bail
> > out like we do in [1]) is multiple TO events arising from the same cause:
> > maybe one job just waits for another, and once the first has hung the
> > second will also appear to be hung, triggering its own TO event.
> > In this case multiple TO events will trigger multiple resets if we serialize,
> > but if we use a lock with trylock the second one will quietly bail out.
> 
> [1] https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c#L4903

Hm so I guess a single wq here, that will hold up all other TO. And they
should recheck whether the job is moving meanwhile.

Also unless you use hw semaphores the job shouldn't even start before the
deps are signalled, so not sure how this goes wrong?

The vm_id flush stuff can make things a bit more fun for your specific
case, but in your specific case you have to run all TO handlers on the
same ordered workqueue anyway (because trying to paper over this in other
ways doesn't work imo).

So I think this should all work, no need for tricky cross-scheduler
locking.
-Daniel
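
A rough userspace sketch of the "recheck whether the job is moving" idea, for handlers that all run from one ordered queue. The names (`fake_device`, `tdr_with_recheck()`) are invented, progress is modelled as a per-ring completed sequence number, and the reset is assumed, purely for illustration, to let every ring catch up with what was submitted.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_RINGS 2

/* Hypothetical per-device progress tracking; not the kernel's fence code. */
struct fake_device {
	unsigned int completed[NUM_RINGS]; /* last job each ring finished */
	unsigned int submitted[NUM_RINGS]; /* last job queued on each ring */
	int resets;
};

/* Assumption for this sketch: a reset lets every ring catch up. */
static void device_reset(struct fake_device *dev)
{
	for (int i = 0; i < NUM_RINGS; i++)
		dev->completed[i] = dev->submitted[i];
	dev->resets++;
}

/*
 * Timeout handler run from a single ordered queue. `job_seqno` is the job
 * that was pending when the timer expired. If the ring has completed it
 * in the meantime (for example because an earlier handler already reset
 * the device), there is nothing left to do.
 * Returns true if a reset was performed.
 */
static bool tdr_with_recheck(struct fake_device *dev, int ring,
			     unsigned int job_seqno)
{
	assert(ring >= 0 && ring < NUM_RINGS);
	if (dev->completed[ring] >= job_seqno)
		return false; /* the job moved on: skip the reset */
	device_reset(dev);
	return true;
}
```

If both rings time out for the same reason and the handlers run back to back, the first one resets the device and the second one sees progress and returns without a second reset.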

> 
> Andrey
> 
> 
> > 
> > See
> > 
> > https://dri.freedesktop.org/docs/drm/gpu/drm-mm.html#c.drm_sched_backend_ops
> > 
> > for the ->timedout_job callback docs. I thought I brought this up already?
> > -Daniel
> 
> 
> Yes, this discussion is a continuation of your comment about serializing, I
> mentioned before that you proposed it.
> 
> Andrey
> 
> 
> > 
> > > Andrey
> > > 
> > > 
> > > >    	spin_lock(&sched->job_list_lock);
> > > >    	job = list_first_entry_or_null(&sched->pending_list,
> > > >    				       struct drm_sched_job, list);
> > > >    	if (job) {
> > > > -		/*
> > > > -		 * Remove the bad job so it cannot be freed by concurrent
> > > > -		 * drm_sched_cleanup_jobs. It will be reinserted back after sched->thread
> > > > -		 * is parked at which point it's safe.
> > > > -		 */
> > > > -		list_del_init(&job->list);
> > > >    		spin_unlock(&sched->job_list_lock);
> > > > +		/* vendor's timeout_job should call drm_sched_start() */
> > > >    		status = job->sched->ops->timedout_job(job);
> > > >    		/*
> > > > @@ -393,20 +391,6 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
> > > >    	kthread_park(sched->thread);
> > > >    	/*
> > > > -	 * Reinsert back the bad job here - now it's safe as
> > > > -	 * drm_sched_get_cleanup_job cannot race against us and release the
> > > > -	 * bad job at this point - we parked (waited for) any in progress
> > > > -	 * (earlier) cleanups and drm_sched_get_cleanup_job will not be called
> > > > -	 * now until the scheduler thread is unparked.
> > > > -	 */
> > > > -	if (bad && bad->sched == sched)
> > > > -		/*
> > > > -		 * Add at the head of the queue to reflect it was the earliest
> > > > -		 * job extracted.
> > > > -		 */
> > > > -		list_add(&bad->list, &sched->pending_list);
> > > > -
> > > > -	/*
> > > >    	 * Iterate the job list from later to  earlier one and either deactive
> > > >    	 * their HW callbacks or remove them from pending list if they already
> > > >    	 * signaled.

Luben Tuikov Aug. 31, 2021, 3:06 p.m. UTC | #7

On 2021-08-31 08:59, Daniel Vetter wrote:
> Can we please have some actual commit message here, with detailed
> explanation of the race/bug/whatever, how you fix it and why this is the
> best option?

I agree with Daniel--a narrative form of a commit message is so much easier
for humans to digest. The "[what]"/"[why]"/"[how]" and "issue"/"fix" format is
somewhat dry and uninformative, and leaves much to be desired.

Regards,
Luben

>
> On Tue, Aug 31, 2021 at 06:35:39PM +0800, Monk Liu wrote:
>> tested-by: jingwen chen <jingwen.chen@amd.com>
>> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
>> Signed-off-by: jingwen chen <jingwen.chen@amd.com>
>> ---
>>  drivers/gpu/drm/scheduler/sched_main.c | 24 ++++--------------------
>>  1 file changed, 4 insertions(+), 20 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>> index ecf8140..894fdb24 100644
>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>> @@ -319,19 +319,17 @@ static void drm_sched_job_timedout(struct work_struct *work)
>>  	sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
>>  
>>  	/* Protects against concurrent deletion in drm_sched_get_cleanup_job */
>> +	if (!__kthread_should_park(sched->thread))
> This is a __ function, i.e. considered internal, and it's lockless atomic,
> i.e. unordered. And you're not explaining why this works.
>
> Iow it's probably buggy, and just unconditionally parking the kthread
> is probably the right thing to do. If it's not the right thing to do,
> there's a bug here for sure.
> -Daniel
>
>> +		kthread_park(sched->thread);
>> +
>>  	spin_lock(&sched->job_list_lock);
>>  	job = list_first_entry_or_null(&sched->pending_list,
>>  				       struct drm_sched_job, list);
>>  
>>  	if (job) {
>> -		/*
>> -		 * Remove the bad job so it cannot be freed by concurrent
>> -		 * drm_sched_cleanup_jobs. It will be reinserted back after sched->thread
>> -		 * is parked at which point it's safe.
>> -		 */
>> -		list_del_init(&job->list);
>>  		spin_unlock(&sched->job_list_lock);
>>  
>> +		/* vendor's timeout_job should call drm_sched_start() */
>>  		status = job->sched->ops->timedout_job(job);
>>  
>>  		/*
>> @@ -393,20 +391,6 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>>  	kthread_park(sched->thread);
>>  
>>  	/*
>> -	 * Reinsert back the bad job here - now it's safe as
>> -	 * drm_sched_get_cleanup_job cannot race against us and release the
>> -	 * bad job at this point - we parked (waited for) any in progress
>> -	 * (earlier) cleanups and drm_sched_get_cleanup_job will not be called
>> -	 * now until the scheduler thread is unparked.
>> -	 */
>> -	if (bad && bad->sched == sched)
>> -		/*
>> -		 * Add at the head of the queue to reflect it was the earliest
>> -		 * job extracted.
>> -		 */
>> -		list_add(&bad->list, &sched->pending_list);
>> -
>> -	/*
>>  	 * Iterate the job list from later to  earlier one and either deactive
>>  	 * their HW callbacks or remove them from pending list if they already
>>  	 * signaled.
>> -- 
>> 2.7.4
>>

Andrey Grodzovsky Aug. 31, 2021, 3:23 p.m. UTC | #8

On 2021-08-31 10:38 a.m., Daniel Vetter wrote:
> On Tue, Aug 31, 2021 at 10:20:40AM -0400, Andrey Grodzovsky wrote:
>> On 2021-08-31 10:03 a.m., Daniel Vetter wrote:
>>> On Tue, Aug 31, 2021 at 09:53:36AM -0400, Andrey Grodzovsky wrote:
>>>> It says patch [2/2] but I can't find patch 1.
>>>>
>>>> On 2021-08-31 6:35 a.m., Monk Liu wrote:
>>>>> tested-by: jingwen chen <jingwen.chen@amd.com>
>>>>> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
>>>>> Signed-off-by: jingwen chen <jingwen.chen@amd.com>
>>>>> ---
>>>>>     drivers/gpu/drm/scheduler/sched_main.c | 24 ++++--------------------
>>>>>     1 file changed, 4 insertions(+), 20 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>>>> index ecf8140..894fdb24 100644
>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>> @@ -319,19 +319,17 @@ static void drm_sched_job_timedout(struct work_struct *work)
>>>>>     	sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
>>>>>     	/* Protects against concurrent deletion in drm_sched_get_cleanup_job */
>>>>> +	if (!__kthread_should_park(sched->thread))
>>>>> +		kthread_park(sched->thread);
>>>>> +
>>>> As mentioned before, without serializing against other TDR handlers from
>>>> other schedulers you just race here against them, e.g. you parked it now
>>>> but another one in progress will unpark it as part of calling
>>>> drm_sched_start for other rings [1].
>>>> Unless I am missing something, since I haven't found patch [1/2].
>>>>
>>>> [1] - https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c#L5041
>>> You need to have your own wq and run all your tdr work on the same wq if
>>> your reset has any cross-engine impact.
>>
>> IMHO what is problematic in serializing vs. locking (with trylock and bail
>> out like we do in [1]) is multiple TO events arising from the same cause:
>> maybe one job just waits for another, and once the first has hung the
>> second will also appear to be hung, triggering its own TO event.
>> In this case multiple TO events will trigger multiple resets if we serialize,
>> but if we use a lock with trylock the second one will quietly bail out.
>>
>> [1] https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c#L4903
> Hm so I guess a single wq here, that will hold up all other TO. And they
> should recheck whether the job is moving meanwhile.


Can you clarify this? Which job should be moving? The dependent job?


>
> Also unless you use hw semaphores the job shouldn't even start before the
> deps are signalled, so not sure how this goes wrong?


What about a simple example where we submit a shader on one ring and a
simple WAIT_REG_MEM packet on another, to wait for the shader to write a
specific value to a specific memory location? Here both of them are
started in close proximity with no explicit dependencies involved (at
the scheduler level), and yet if the shader hangs the WAIT_REG_MEM job
will hang as well.
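
As a toy illustration of this scenario, here is a tick-based userspace model. It is entirely hypothetical: `count_timeouts()`, `TIMEOUT_TICKS` and the value `0xcafe` are made up, and the "rings" are just two flags advanced in a loop. The only point is that one root cause produces two independent timeouts when the scheduler sees no dependency.

```c
#include <assert.h>
#include <stdbool.h>

#define TIMEOUT_TICKS 10
#define NEVER (-1)

/*
 * Hypothetical tick-based model. Ring 0 runs a "shader" that writes
 * `expected` to a memory word at `shader_done_tick` (NEVER if it hangs).
 * Ring 1 runs a WAIT_REG_MEM-like poll on that word. The scheduler knows
 * of no dependency between the two jobs.
 * Returns the number of rings whose job is unfinished after TIMEOUT_TICKS.
 */
static int count_timeouts(int shader_done_tick)
{
	const unsigned int expected = 0xcafe;
	unsigned int mem = 0;
	bool shader_done = false, wait_done = false;

	assert(shader_done_tick >= NEVER);
	for (int tick = 0; tick < TIMEOUT_TICKS; tick++) {
		if (!shader_done && tick == shader_done_tick) {
			mem = expected; /* shader writes its result */
			shader_done = true;
		}
		if (!wait_done && mem == expected)
			wait_done = true; /* poll on the other ring succeeds */
	}
	return (shader_done ? 0 : 1) + (wait_done ? 0 : 1);
}
```

`count_timeouts(3)` reports no timeouts, while `count_timeouts(NEVER)` reports two: the hung shader and the poll that was waiting on it.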


>
> The vm_id flush stuff can make things a bit more fun for your specific
> case, but in your specific case you have to run all TO handlers on the
> same ordered workqueue anyway (because trying to paper over this in other
> ways doesn't work imo).


I didn't get this one.

Andrey


>
> So I think this should all work, no need for tricky cross-scheduler
> locking.
> -Daniel
>
>> Andrey
>>
>>
>>> See
>>>
>>> https://dri.freedesktop.org/docs/drm/gpu/drm-mm.html#c.drm_sched_backend_ops
>>>
>>> for the ->timeout_job callback docs. I thought I brought this up already?
>>> -Daniel
>>
>> Yes, this discussion is a continuation of your comment about serializing, I
>> mentioned before that you proposed it.
>>
>> Andrey
>>
>>
>>>> Andrey
>>>>
>>>>
>>>>>     	spin_lock(&sched->job_list_lock);
>>>>>     	job = list_first_entry_or_null(&sched->pending_list,
>>>>>     				       struct drm_sched_job, list);
>>>>>     	if (job) {
>>>>> -		/*
>>>>> -		 * Remove the bad job so it cannot be freed by concurrent
>>>>> -		 * drm_sched_cleanup_jobs. It will be reinserted back after sched->thread
>>>>> -		 * is parked at which point it's safe.
>>>>> -		 */
>>>>> -		list_del_init(&job->list);
>>>>>     		spin_unlock(&sched->job_list_lock);
>>>>> +		/* vendor's timeout_job should call drm_sched_start() */
>>>>>     		status = job->sched->ops->timedout_job(job);
>>>>>     		/*
>>>>> @@ -393,20 +391,6 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>>>>>     	kthread_park(sched->thread);
>>>>>     	/*
>>>>> -	 * Reinsert back the bad job here - now it's safe as
>>>>> -	 * drm_sched_get_cleanup_job cannot race against us and release the
>>>>> -	 * bad job at this point - we parked (waited for) any in progress
>>>>> -	 * (earlier) cleanups and drm_sched_get_cleanup_job will not be called
>>>>> -	 * now until the scheduler thread is unparked.
>>>>> -	 */
>>>>> -	if (bad && bad->sched == sched)
>>>>> -		/*
>>>>> -		 * Add at the head of the queue to reflect it was the earliest
>>>>> -		 * job extracted.
>>>>> -		 */
>>>>> -		list_add(&bad->list, &sched->pending_list);
>>>>> -
>>>>> -	/*
>>>>>     	 * Iterate the job list from later to  earlier one and either deactive
>>>>>     	 * their HW callbacks or remove them from pending list if they already
>>>>>     	 * signaled.

Luben Tuikov Aug. 31, 2021, 4:01 p.m. UTC | #9

On 2021-08-31 11:23, Andrey Grodzovsky wrote:
> On 2021-08-31 10:38 a.m., Daniel Vetter wrote:
>> On Tue, Aug 31, 2021 at 10:20:40AM -0400, Andrey Grodzovsky wrote:
>>> On 2021-08-31 10:03 a.m., Daniel Vetter wrote:
>>>> On Tue, Aug 31, 2021 at 09:53:36AM -0400, Andrey Grodzovsky wrote:
>>>>> It's says patch [2/2] but i can't find patch 1
>>>>>
>>>>> On 2021-08-31 6:35 a.m., Monk Liu wrote:
>>>>>> tested-by: jingwen chen <jingwen.chen@amd.com>
>>>>>> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
>>>>>> Signed-off-by: jingwen chen <jingwen.chen@amd.com>
>>>>>> ---
>>>>>>     drivers/gpu/drm/scheduler/sched_main.c | 24 ++++--------------------
>>>>>>     1 file changed, 4 insertions(+), 20 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>> index ecf8140..894fdb24 100644
>>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>> @@ -319,19 +319,17 @@ static void drm_sched_job_timedout(struct work_struct *work)
>>>>>>     	sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
>>>>>>     	/* Protects against concurrent deletion in drm_sched_get_cleanup_job */
>>>>>> +	if (!__kthread_should_park(sched->thread))
>>>>>> +		kthread_park(sched->thread);
>>>>>> +
>>>>> As mentioned before, without serializing against other TDR handlers from
>>>>> other
>>>>> schedulers you just race here against them, e.g. you parked it now but
>>>>> another
>>>>> one in progress will unpark it as part of calling  drm_sched_start for other
>>>>> rings[1]
>>>>> Unless I am missing something since I haven't found patch [1/2]
>>>>>
>>>>> [1] - https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c#L5041
>>>> You need to have your own wq and run all your tdr work on the same wq if
>>>> your reset has any cross-engine impact.
>>> IMHO what is problematic in serializing vs. locking (with trylock and bail
>>> out like we do in [1]) is for multiple TO events arising from same reason
>>> like maybe one job just waits for another and once first is hanged the
>>> second will also appear to be hanged triggering it's own TO event.
>>> In this case multiple TOs event will trigger multiple resets if we serialize
>>> but if we use lock with trylock the second one will quietly bail out.
>>>
>>> [1] https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c#L4903
>> Hm so I guess a single wq here, that will hold up all other TO. And they
>> should recheck whether the job is moving meanwhile.
>
> Can you clarify about this ? What job should be moving ? The dependent job ?
>
>
>> Also unless you use hw semaphores the job shouldn't even start before the
>> deps are singalled, so not sure how this goes wrong?
>
> What about a simple example where
> we actually can submit a shader on one ring and a simple
> WAIT_REG_MEM packet on another to wait for the shader to write
> a specific value to specific memory location. Here you have both of them 
> started
> in close proximity and no explicit dependencies involved (at the 
> scheduler level)
> and yet if the shader hangs also the WAIT_REG_MEM job will hang.
>
>
>> The vm_id flush stuff can make things a bit more fun for your specific
>> case, but in your specific case you have to run all TO handlers on the
>> same ordered workqueue anyway (because trying to paper over this in other
>> ways doesn't work imo).
>
> I didn't get this one.

So, a while back I tried to "serialize" this by moving timed-out jobs
into their own timed-out-dedicated list, then freeing them asynchronously,
but I never got it to work reliably due to races with low-level drivers and
assumptions made way back.

My idea was to atomically move timed-out jobs into their own list at the time
of timeout, and later free them asynchronously (or better yet, inquire about
their state, and free them or move them back--ideally the inquiry is atomic
and done at timeout time before being moved to the timeout list). Anyway...

But I found that the knobs and levers this needs weren't in place; I kept
running into problems with it, and it never materialized.

The paradigm was loosely "let someone else do it", like, "on an event,
move it to a list, and let someone else handle it", or "on an event, mark
it, and let someone else handle it". (loosely borrowed from an iSCSI target
I did many many years ago--it worked well and there were no races, even with
out-of-order executions.)
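
To make the "move it to a list, and let someone else handle it" idea concrete,
here is a minimal userspace sketch. It is only an illustration under stated
assumptions: struct sched, struct job, sched_on_timeout() and
sched_drain_timed_out() are hypothetical names, a C11 atomic_flag stands in
for the job list lock, and none of it is the DRM scheduler's real code. The
state inquiry and the move happen in one critical section, so a job is never
visible on both lists.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these names come from the DRM scheduler. */
struct node {
	struct node *prev, *next;
};

struct job {
	struct node link;  /* kept as first member so a node* converts to a job* */
	int id;
	int signaled;      /* non-zero if the hardware finished the job after all */
};

struct sched {
	atomic_flag lock;       /* tiny spinlock standing in for the job list lock */
	struct node pending;    /* jobs handed to the hardware */
	struct node timed_out;  /* jobs parked here for someone else to handle */
};

static void list_init(struct node *head) { head->prev = head->next = head; }
static int list_empty(const struct node *head) { return head->next == head; }

static void list_del(struct node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->prev = n->next = n;
}

static void list_add_tail(struct node *n, struct node *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

static void sched_lock(struct sched *s)
{
	while (atomic_flag_test_and_set_explicit(&s->lock, memory_order_acquire)) {
	}
}

static void sched_unlock(struct sched *s)
{
	atomic_flag_clear_explicit(&s->lock, memory_order_release);
}

static void sched_init(struct sched *s)
{
	atomic_flag_clear(&s->lock);
	list_init(&s->pending);
	list_init(&s->timed_out);
}

static void sched_submit(struct sched *s, struct job *j)
{
	sched_lock(s);
	list_add_tail(&j->link, &s->pending);
	sched_unlock(s);
}

/*
 * Timeout event: inquire about the oldest pending job and, if it has not
 * signaled, move it to the timed-out list in the same critical section.
 * Returns the moved job, or NULL if nothing was moved.
 */
static struct job *sched_on_timeout(struct sched *s)
{
	struct job *moved = NULL;

	assert(offsetof(struct job, link) == 0);

	sched_lock(s);
	if (!list_empty(&s->pending)) {
		struct job *first = (struct job *)s->pending.next;

		if (!first->signaled) {
			list_del(&first->link);
			list_add_tail(&first->link, &s->timed_out);
			moved = first;
		}
	}
	sched_unlock(s);
	return moved;
}

/* "Someone else": drain the timed-out list later; returns the number handled. */
static int sched_drain_timed_out(struct sched *s)
{
	int handled = 0;

	sched_lock(s);
	while (!list_empty(&s->timed_out)) {
		list_del(s->timed_out.next);
		handled++;
	}
	sched_unlock(s);
	return handled;
}
```

The sketch leaves out exactly the hard part described above: the low-level
drivers would also have to agree that a job on the timed-out list is no longer
theirs to free.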

If you guys have any ideas to that end, maybe we can try it out.

Regards,
Luben


>
> Andrey
>
>
>> So I think this should all work, no need for tricky cross-scheduler
>> locking.
>> -Daniel
>>
>>> Andrey
>>>
>>>
>>>> See
>>>>
>>>> https://dri.freedesktop.org/docs/drm/gpu/drm-mm.html#c.drm_sched_backend_ops
>>>>
>>>> for the ->timeout_job callback docs. I thought I brought this up already?
>>>> -Daniel
>>> Yes, this discussion is a continuation of your comment about serializing, I
>>> mentioned before that you proposed it.
>>>
>>> Andrey
>>>
>>>
>>>>> Andrey
>>>>>
>>>>>
>>>>>>     	spin_lock(&sched->job_list_lock);
>>>>>>     	job = list_first_entry_or_null(&sched->pending_list,
>>>>>>     				       struct drm_sched_job, list);
>>>>>>     	if (job) {
>>>>>> -		/*
>>>>>> -		 * Remove the bad job so it cannot be freed by concurrent
>>>>>> -		 * drm_sched_cleanup_jobs. It will be reinserted back after sched->thread
>>>>>> -		 * is parked at which point it's safe.
>>>>>> -		 */
>>>>>> -		list_del_init(&job->list);
>>>>>>     		spin_unlock(&sched->job_list_lock);
>>>>>> +		/* vendor's timeout_job should call drm_sched_start() */
>>>>>>     		status = job->sched->ops->timedout_job(job);
>>>>>>     		/*
>>>>>> @@ -393,20 +391,6 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>>>>>>     	kthread_park(sched->thread);
>>>>>>     	/*
>>>>>> -	 * Reinsert back the bad job here - now it's safe as
>>>>>> -	 * drm_sched_get_cleanup_job cannot race against us and release the
>>>>>> -	 * bad job at this point - we parked (waited for) any in progress
>>>>>> -	 * (earlier) cleanups and drm_sched_get_cleanup_job will not be called
>>>>>> -	 * now until the scheduler thread is unparked.
>>>>>> -	 */
>>>>>> -	if (bad && bad->sched == sched)
>>>>>> -		/*
>>>>>> -		 * Add at the head of the queue to reflect it was the earliest
>>>>>> -		 * job extracted.
>>>>>> -		 */
>>>>>> -		list_add(&bad->list, &sched->pending_list);
>>>>>> -
>>>>>> -	/*
>>>>>>     	 * Iterate the job list from later to  earlier one and either deactive
>>>>>>     	 * their HW callbacks or remove them from pending list if they already
>>>>>>     	 * signaled.

Andrey Grodzovsky Aug. 31, 2021, 8:56 p.m. UTC | #10

On 2021-08-31 12:01 p.m., Luben Tuikov wrote:
> On 2021-08-31 11:23, Andrey Grodzovsky wrote:
>> On 2021-08-31 10:38 a.m., Daniel Vetter wrote:
>>> On Tue, Aug 31, 2021 at 10:20:40AM -0400, Andrey Grodzovsky wrote:
>>>> On 2021-08-31 10:03 a.m., Daniel Vetter wrote:
>>>>> On Tue, Aug 31, 2021 at 09:53:36AM -0400, Andrey Grodzovsky wrote:
>>>>>> It's says patch [2/2] but i can't find patch 1
>>>>>>
>>>>>> On 2021-08-31 6:35 a.m., Monk Liu wrote:
>>>>>>> tested-by: jingwen chen <jingwen.chen@amd.com>
>>>>>>> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
>>>>>>> Signed-off-by: jingwen chen <jingwen.chen@amd.com>
>>>>>>> ---
>>>>>>>      drivers/gpu/drm/scheduler/sched_main.c | 24 ++++--------------------
>>>>>>>      1 file changed, 4 insertions(+), 20 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>> index ecf8140..894fdb24 100644
>>>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>> @@ -319,19 +319,17 @@ static void drm_sched_job_timedout(struct work_struct *work)
>>>>>>>      	sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
>>>>>>>      	/* Protects against concurrent deletion in drm_sched_get_cleanup_job */
>>>>>>> +	if (!__kthread_should_park(sched->thread))
>>>>>>> +		kthread_park(sched->thread);
>>>>>>> +
>>>>>> As mentioned before, without serializing against other TDR handlers from
>>>>>> other
>>>>>> schedulers you just race here against them, e.g. you parked it now but
>>>>>> another
>>>>>> one in progress will unpark it as part of calling  drm_sched_start for other
>>>>>> rings[1]
>>>>>> Unless I am missing something since I haven't found patch [1/2]
>>>>>>
>>>>>> [1] - https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c#L5041
>>>>> You need to have your own wq and run all your tdr work on the same wq if
>>>>> your reset has any cross-engine impact.
>>>> IMHO what is problematic in serializing vs. locking (with trylock and bail
>>>> out like we do in [1]) is for multiple TO events arising from same reason
>>>> like maybe one job just waits for another and once first is hanged the
>>>> second will also appear to be hanged triggering it's own TO event.
>>>> In this case multiple TOs event will trigger multiple resets if we serialize
>>>> but if we use lock with trylock the second one will quietly bail out.
>>>>
>>>> [1] https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c#L4903
>>> Hm so I guess a single wq here, that will hold up all other TO. And they
>>> should recheck whether the job is moving meanwhile.
>> Can you clarify about this ? What job should be moving ? The dependent job ?
>>
>>
>>> Also unless you use hw semaphores the job shouldn't even start before the
>>> deps are singalled, so not sure how this goes wrong?
>> What about a simple example where
>> we actually can submit a shader on one ring and a simple
>> WAIT_REG_MEM packet on another to wait for the shader to write
>> a specific value to specific memory location. Here you have both of them
>> started
>> in close proximity and no explicit dependencies involved (at the
>> scheduler level)
>> and yet if the shader hangs also the WAIT_REG_MEM job will hang.
>>
>>
>>> The vm_id flush stuff can make things a bit more fun for your specific
>>> case, but in your specific case you have to run all TO handlers on the
>>> same ordered workqueue anyway (because trying to paper over this in other
>>> ways doesn't work imo).
>> I didn't get this one.
> So, awhile back I tried to "serialize" this by moving timed-out jobs
> into their own timed-out-dedicated list, then freeing them asynchronously,
> but I never got it to work reliably due to races with low-level drivers and
> assumptions made way back.
>
> My idea was to atomic-move timed-out jobs into their own list, at the time of
> timeout, and later asynchronously to free them (or better yet, inquire about
> their state, and free them or move them back--ideally the inquiry is atomic
> and done at timeout time before being moved to the timeout list). Anyway...
>
> But I found out that all these knobs and levers weren't in place and I was
> getting problems with it and it never materialized.
>
> The paradigm was loosely "let someone else do it", like, "on an event,
> move it to a list, and let someone else handle it", or "on an event, mark
> it, and let someone else handle it". (loosely borrowed from an iSCSI target
> I did many many years ago--it worked well and there were no races, even with
> out-of-order executions.)
>
> If you guys have any ideas to that end, maybe we can try it out.
>
> Regards,
> Luben


I wonder if we really need this serialization at all. If we do HW fence
embedding at the drm_sched_job level instead of doing it only for amdgpu,
and modify all the drivers to support this, we can both remove this hack
and solve the race against concurrent drm_sched_cleanup_jobs job freeing
just by taking a reference to the HW fence of the job at the beginning of
drm_sched_job_timedout.
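
A rough userspace sketch of what I mean, with hypothetical names only
(struct fake_fence, struct fake_job, fence_get(), fence_put() are made up for
illustration and are not the kernel's dma_fence API). The assumption is that
the fence is embedded in the job, so the job's memory lives until the last
fence reference is dropped:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical refcounted fence; only models the lifetime rule. */
struct fake_fence {
	atomic_int refcount;
};

/* Hypothetical job with the fence embedded as its first member. */
struct fake_job {
	struct fake_fence hw_fence;
	int *freed_flag;  /* set to 1 when the job memory is released */
};

static struct fake_job *job_create(int *freed_flag)
{
	struct fake_job *job = malloc(sizeof(*job));

	if (!job)
		return NULL;
	atomic_init(&job->hw_fence.refcount, 1);  /* reference owned by the scheduler */
	job->freed_flag = freed_flag;
	*freed_flag = 0;
	return job;
}

static struct fake_fence *fence_get(struct fake_fence *f)
{
	atomic_fetch_add(&f->refcount, 1);
	return f;
}

/* Drops one reference; the containing job is freed with the last one. */
static void fence_put(struct fake_fence *f)
{
	if (atomic_fetch_sub(&f->refcount, 1) == 1) {
		struct fake_job *job = (struct fake_job *)f;  /* fence is the first member */

		assert(offsetof(struct fake_job, hw_fence) == 0);
		*job->freed_flag = 1;
		free(job);
	}
}
```

The timeout handler would call fence_get() first and fence_put() last, so a
concurrent cleanup path dropping the scheduler's reference cannot free the job
underneath it. The sketch assumes the handler can take its reference while the
job is still guaranteed alive (for example under the job list lock); that part
is not shown.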

Andrey


>
>
>> Andrey
>>
>>
>>> So I think this should all work, no need for tricky cross-scheduler
>>> locking.
>>> -Daniel
>>>
>>>> Andrey
>>>>
>>>>
>>>>> See
>>>>>
>>>>> https://dri.freedesktop.org/docs/drm/gpu/drm-mm.html#c.drm_sched_backend_ops
>>>>>
>>>>> for the ->timeout_job callback docs. I thought I brought this up already?
>>>>> -Daniel
>>>> Yes, this discussion is a continuation of your comment about serializing, I
>>>> mentioned before that you proposed it.
>>>>
>>>> Andrey
>>>>
>>>>
>>>>>> Andrey
>>>>>>
>>>>>>
>>>>>>>      	spin_lock(&sched->job_list_lock);
>>>>>>>      	job = list_first_entry_or_null(&sched->pending_list,
>>>>>>>      				       struct drm_sched_job, list);
>>>>>>>      	if (job) {
>>>>>>> -		/*
>>>>>>> -		 * Remove the bad job so it cannot be freed by concurrent
>>>>>>> -		 * drm_sched_cleanup_jobs. It will be reinserted back after sched->thread
>>>>>>> -		 * is parked at which point it's safe.
>>>>>>> -		 */
>>>>>>> -		list_del_init(&job->list);
>>>>>>>      		spin_unlock(&sched->job_list_lock);
>>>>>>> +		/* vendor's timeout_job should call drm_sched_start() */
>>>>>>>      		status = job->sched->ops->timedout_job(job);
>>>>>>>      		/*
>>>>>>> @@ -393,20 +391,6 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>>>>>>>      	kthread_park(sched->thread);
>>>>>>>      	/*
>>>>>>> -	 * Reinsert back the bad job here - now it's safe as
>>>>>>> -	 * drm_sched_get_cleanup_job cannot race against us and release the
>>>>>>> -	 * bad job at this point - we parked (waited for) any in progress
>>>>>>> -	 * (earlier) cleanups and drm_sched_get_cleanup_job will not be called
>>>>>>> -	 * now until the scheduler thread is unparked.
>>>>>>> -	 */
>>>>>>> -	if (bad && bad->sched == sched)
>>>>>>> -		/*
>>>>>>> -		 * Add at the head of the queue to reflect it was the earliest
>>>>>>> -		 * job extracted.
>>>>>>> -		 */
>>>>>>> -		list_add(&bad->list, &sched->pending_list);
>>>>>>> -
>>>>>>> -	/*
>>>>>>>      	 * Iterate the job list from later to  earlier one and either deactive
>>>>>>>      	 * their HW callbacks or remove them from pending list if they already
>>>>>>>      	 * signaled.

Luben Tuikov Aug. 31, 2021, 9:24 p.m. UTC | #11

On 2021-08-31 16:56, Andrey Grodzovsky wrote:
> On 2021-08-31 12:01 p.m., Luben Tuikov wrote:
>> On 2021-08-31 11:23, Andrey Grodzovsky wrote:
>>> On 2021-08-31 10:38 a.m., Daniel Vetter wrote:
>>>> On Tue, Aug 31, 2021 at 10:20:40AM -0400, Andrey Grodzovsky wrote:
>>>>> On 2021-08-31 10:03 a.m., Daniel Vetter wrote:
>>>>>> On Tue, Aug 31, 2021 at 09:53:36AM -0400, Andrey Grodzovsky wrote:
>>>>>>> It's says patch [2/2] but i can't find patch 1
>>>>>>>
>>>>>>> On 2021-08-31 6:35 a.m., Monk Liu wrote:
>>>>>>>> tested-by: jingwen chen <jingwen.chen@amd.com>
>>>>>>>> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
>>>>>>>> Signed-off-by: jingwen chen <jingwen.chen@amd.com>
>>>>>>>> ---
>>>>>>>>      drivers/gpu/drm/scheduler/sched_main.c | 24 ++++--------------------
>>>>>>>>      1 file changed, 4 insertions(+), 20 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>> index ecf8140..894fdb24 100644
>>>>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>> @@ -319,19 +319,17 @@ static void drm_sched_job_timedout(struct work_struct *work)
>>>>>>>>      	sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
>>>>>>>>      	/* Protects against concurrent deletion in drm_sched_get_cleanup_job */
>>>>>>>> +	if (!__kthread_should_park(sched->thread))
>>>>>>>> +		kthread_park(sched->thread);
>>>>>>>> +
>>>>>>> As mentioned before, without serializing against other TDR handlers from
>>>>>>> other
>>>>>>> schedulers you just race here against them, e.g. you parked it now but
>>>>>>> another
>>>>>>> one in progress will unpark it as part of calling  drm_sched_start for other
>>>>>>> rings[1]
>>>>>>> Unless I am missing something since I haven't found patch [1/2]
>>>>>>>
>>>>>>> [1] - https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c#L5041
>>>>>> You need to have your own wq and run all your tdr work on the same wq if
>>>>>> your reset has any cross-engine impact.
>>>>> IMHO what is problematic in serializing vs. locking (with trylock and bail
>>>>> out like we do in [1]) is for multiple TO events arising from same reason
>>>>> like maybe one job just waits for another and once first is hanged the
>>>>> second will also appear to be hanged triggering it's own TO event.
>>>>> In this case multiple TOs event will trigger multiple resets if we serialize
>>>>> but if we use lock with trylock the second one will quietly bail out.
>>>>>
>>>>> [1] https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c#L4903
>>>> Hm so I guess a single wq here, that will hold up all other TO. And they
>>>> should recheck whether the job is moving meanwhile.
>>> Can you clarify about this ? What job should be moving ? The dependent job ?
>>>
>>>
>>>> Also unless you use hw semaphores the job shouldn't even start before the
>>>> deps are singalled, so not sure how this goes wrong?
>>> What about a simple example where
>>> we actually can submit a shader on one ring and a simple
>>> WAIT_REG_MEM packet on another to wait for the shader to write
>>> a specific value to specific memory location. Here you have both of them
>>> started
>>> in close proximity and no explicit dependencies involved (at the
>>> scheduler level)
>>> and yet if the shader hangs also the WAIT_REG_MEM job will hang.
>>>
>>>
>>>> The vm_id flush stuff can make things a bit more fun for your specific
>>>> case, but in your specific case you have to run all TO handlers on the
>>>> same ordered workqueue anyway (because trying to paper over this in other
>>>> ways doesn't work imo).
>>> I didn't get this one.
>> So, awhile back I tried to "serialize" this by moving timed-out jobs
>> into their own timed-out-dedicated list, then freeing them asynchronously,
>> but I never got it to work reliably due to races with low-level drivers and
>> assumptions made way back.
>>
>> My idea was to atomic-move timed-out jobs into their own list, at the time of
>> timeout, and later asynchronously to free them (or better yet, inquire about
>> their state, and free them or move them back--ideally the inquiry is atomic
>> and done at timeout time before being moved to the timeout list). Anyway...
>>
>> But I found out that all these knobs and levers weren't in place and I was
>> getting problems with it and it never materialized.
>>
>> The paradigm was loosely "let someone else do it", like, "on an event,
>> move it to a list, and let someone else handle it", or "on an event, mark
>> it, and let someone else handle it". (loosely borrowed from an iSCSI target
>> I did many many years ago--it worked well and there were no races, even with
>> out-of-order executions.)
>>
>> If you guys have any ideas to that end, maybe we can try it out.
>>
>> Regards,
>> Luben
>
> I wonder if we really need this serialization at all, if we do HW fence 
> embedding
> at the drm_sched_job level instead of doing it only for amdgpu, and 
> modifying all
> the drivers to support this we can both remove this hack and solve the race
> against concurrent drm_sched_cleanup_jobs job freeing just by taking 
> reference
> to the hw fence of the job at the beginning of drm_sched_job_timedout
>
> Andrey

This sounds like the right approach to me.

(Convincing the low-level drivers of this might be a big task.)

Regards,
Luben

>
>
>>
>>> Andrey
>>>
>>>
>>>> So I think this should all work, no need for tricky cross-scheduler
>>>> locking.
>>>> -Daniel
>>>>
>>>>> Andrey
>>>>>
>>>>>
>>>>>> See
>>>>>>
>>>>>> https://dri.freedesktop.org/docs/drm/gpu/drm-mm.html#c.drm_sched_backend_ops
>>>>>>
>>>>>> for the ->timeout_job callback docs. I thought I brought this up already?
>>>>>> -Daniel
>>>>> Yes, this discussion is a continuation of your comment about serializing, I
>>>>> mentioned before that you proposed it.
>>>>>
>>>>> Andrey
>>>>>
>>>>>
>>>>>>> Andrey
>>>>>>>
>>>>>>>
>>>>>>>>      	spin_lock(&sched->job_list_lock);
>>>>>>>>      	job = list_first_entry_or_null(&sched->pending_list,
>>>>>>>>      				       struct drm_sched_job, list);
>>>>>>>>      	if (job) {
>>>>>>>> -		/*
>>>>>>>> -		 * Remove the bad job so it cannot be freed by concurrent
>>>>>>>> -		 * drm_sched_cleanup_jobs. It will be reinserted back after sched->thread
>>>>>>>> -		 * is parked at which point it's safe.
>>>>>>>> -		 */
>>>>>>>> -		list_del_init(&job->list);
>>>>>>>>      		spin_unlock(&sched->job_list_lock);
>>>>>>>> +		/* vendor's timeout_job should call drm_sched_start() */
>>>>>>>>      		status = job->sched->ops->timedout_job(job);
>>>>>>>>      		/*
>>>>>>>> @@ -393,20 +391,6 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>>>>>>>>      	kthread_park(sched->thread);
>>>>>>>>      	/*
>>>>>>>> -	 * Reinsert back the bad job here - now it's safe as
>>>>>>>> -	 * drm_sched_get_cleanup_job cannot race against us and release the
>>>>>>>> -	 * bad job at this point - we parked (waited for) any in progress
>>>>>>>> -	 * (earlier) cleanups and drm_sched_get_cleanup_job will not be called
>>>>>>>> -	 * now until the scheduler thread is unparked.
>>>>>>>> -	 */
>>>>>>>> -	if (bad && bad->sched == sched)
>>>>>>>> -		/*
>>>>>>>> -		 * Add at the head of the queue to reflect it was the earliest
>>>>>>>> -		 * job extracted.
>>>>>>>> -		 */
>>>>>>>> -		list_add(&bad->list, &sched->pending_list);
>>>>>>>> -
>>>>>>>> -	/*
>>>>>>>>      	 * Iterate the job list from later to  earlier one and either deactive
>>>>>>>>      	 * their HW callbacks or remove them from pending list if they already
>>>>>>>>      	 * signaled.

Liu, Monk Sept. 1, 2021, 12:52 a.m. UTC | #12

[AMD Official Use Only]

>> This is a __ function, i.e. considered internal, and it's lockless atomic, i.e. unordered. And you're not explaining why this works.

From what I can see, it is not the usual habit to put this kind of explanation in the code, but we can do it in mail.
We want to park the scheduler in job_timedout to serialize access to the job from both the scheduler and the TO handler. But inside the vendor's timedout_job callback, at least both panfrost and amd call drm_sched_stop() on all schedulers.

If we unconditionally call kthread_park() in job_timedout, then the timeout handler of the bailing job will try to call kthread_park() again on its scheduler, and that triggers a warning.

The scenario is:
1. Job1 on sched1 triggers a timeout, and sched1 is parked.
2. The vendor callback runs; it will usually stop all schedulers.
3. Job2 on sched2 triggers a timeout, so job_timedout also tries to park sched2, but sched2 was already stopped by the step above. (Job2's timeout is caused by job1, or by another VF.)
   So there will be a warning in the kernel log from this step. With the "__kthread_should_park()" check here we can avoid the warning; that is the only reason I need this __ function.
4. Before the vendor callback exits, it will unpark all schedulers.
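
To illustrate steps 2 to 4 for sched2, here is a small single-threaded model. It is only a sketch: struct fake_thread, fake_park(), fake_unpark() and replay_scenario() are hypothetical names and not the kernel kthread API; the model just counts how often a park is requested on a thread that already has a park request pending, which is the situation that produces the warning.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a parkable scheduler thread; not the kthread API. */
struct fake_thread {
	bool should_park;  /* models the "park requested" flag */
	int warnings;      /* counts park requests on an already parked thread */
};

static void fake_park(struct fake_thread *t)
{
	if (t->should_park) {
		t->warnings++;  /* stands in for the warning in the kernel log */
		return;
	}
	t->should_park = true;
}

static void fake_unpark(struct fake_thread *t)
{
	t->should_park = false;
}

/* Timeout handler variant that always parks. */
static void timeout_park_unconditional(struct fake_thread *t)
{
	fake_park(t);
}

/* Timeout handler variant that checks the flag first, as in the patch. */
static void timeout_park_guarded(struct fake_thread *t)
{
	if (!t->should_park)
		fake_park(t);
}

/* Replays steps 2 to 4 for sched2 and returns the number of warnings. */
static int replay_scenario(bool guarded)
{
	struct fake_thread sched2 = { false, 0 };

	fake_park(&sched2);  /* step 2: vendor callback stops all schedulers */
	if (guarded)         /* step 3: job2's timeout handler runs */
		timeout_park_guarded(&sched2);
	else
		timeout_park_unconditional(&sched2);
	fake_unpark(&sched2);  /* step 4: vendor callback unparks */
	return sched2.warnings;
}
```

In this model replay_scenario(false) counts one warning and replay_scenario(true) counts none. Note that the model runs the steps strictly one after another; with real concurrency the check and the park in the guarded variant are two separate steps, so another handler could park or unpark in between.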

On the other hand, if we don't do the kthread_park() and still drop deleting/reinserting the job from pending_list (which is what we want), we still have a small window in which to hit the race:
cleanup_job from the sched thread frees the job while the job is still being processed by job_timedout or the vendor's callback.

And the reason we want to avoid deleting/reinserting the timed-out job is that we (amd vendor) have our own way to run a diagnostic on all jobs in the pending lists of all schedulers; we want to cherry-pick,
from all schedulers' pending lists, the real bad job that caused this JOB TIMEOUT.

Besides, it is also more reasonable to park the scheduler while job_timedout is running; the two should access common members such as sched_job exclusively (because the spin_lock is released before running the vendor's callback).

I hope this explains our reasoning well enough.

Thanks 

------------------------------------------
Monk Liu | Cloud-GPU Core team
------------------------------------------

-----Original Message-----
From: Daniel Vetter <daniel@ffwll.ch> 
Sent: Tuesday, August 31, 2021 8:59 PM
To: Liu, Monk <Monk.Liu@amd.com>
Cc: amd-gfx@lists.freedesktop.org; dri-devel@lists.freedesktop.org; Chen, Jingwen <Jingwen.Chen@amd.com>
Subject: Re: [PATCH 2/2] drm/sched: serialize job_timeout and scheduler

Can we please have some actual commit message here, with detailed explanation of the race/bug/whatever, how you fix it and why this is the best option?

On Tue, Aug 31, 2021 at 06:35:39PM +0800, Monk Liu wrote:
> tested-by: jingwen chen <jingwen.chen@amd.com>
> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
> Signed-off-by: jingwen chen <jingwen.chen@amd.com>
> ---
>  drivers/gpu/drm/scheduler/sched_main.c | 24 ++++--------------------
>  1 file changed, 4 insertions(+), 20 deletions(-)
> 
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index ecf8140..894fdb24 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -319,19 +319,17 @@ static void drm_sched_job_timedout(struct work_struct *work)
>  	sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
>  
>  	/* Protects against concurrent deletion in drm_sched_get_cleanup_job 
> */
> +	if (!__kthread_should_park(sched->thread))

This is a __ function, i.e. considered internal, and it's lockless atomic, i.e. unordered. And you're not explaining why this works.

Iow it's probably buggy, and just unconditionally parking the kthread is probably the right thing to do. If it's not the right thing to do, there's a bug here for sure.
-Daniel

> +		kthread_park(sched->thread);
> +
>  	spin_lock(&sched->job_list_lock);
>  	job = list_first_entry_or_null(&sched->pending_list,
>  				       struct drm_sched_job, list);
>  
>  	if (job) {
> -		/*
> -		 * Remove the bad job so it cannot be freed by concurrent
> -		 * drm_sched_cleanup_jobs. It will be reinserted back after sched->thread
> -		 * is parked at which point it's safe.
> -		 */
> -		list_del_init(&job->list);
>  		spin_unlock(&sched->job_list_lock);
>  
> +		/* vendor's timeout_job should call drm_sched_start() */
>  		status = job->sched->ops->timedout_job(job);
>  
>  		/*
> @@ -393,20 +391,6 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>  	kthread_park(sched->thread);
>  
>  	/*
> -	 * Reinsert back the bad job here - now it's safe as
> -	 * drm_sched_get_cleanup_job cannot race against us and release the
> -	 * bad job at this point - we parked (waited for) any in progress
> -	 * (earlier) cleanups and drm_sched_get_cleanup_job will not be called
> -	 * now until the scheduler thread is unparked.
> -	 */
> -	if (bad && bad->sched == sched)
> -		/*
> -		 * Add at the head of the queue to reflect it was the earliest
> -		 * job extracted.
> -		 */
> -		list_add(&bad->list, &sched->pending_list);
> -
> -	/*
>  	 * Iterate the job list from later to  earlier one and either deactive
>  	 * their HW callbacks or remove them from pending list if they already
>  	 * signaled.
> --
> 2.7.4
>

Liu, Monk Sept. 1, 2021, 12:56 a.m. UTC | #13

[AMD Official Use Only]

>> Also why don't we reuse the function drivers already have to stop a scheduler thread? We seem to have two kthread_park now, that's probably one too much.
Are you referring to drm_sched_stop()?

That's different, and we don't need its logic here: it walks the pending list, removes all the callbacks, and so on. Meanwhile the vendor's timeout callback will call drm_sched_stop() in the proper way.
All my patch wants is to simply park the scheduler.
Besides, even if you call drm_sched_stop() in job_timeout you still run into the warning issue I hit.

Thanks 

------------------------------------------
Monk Liu | Cloud-GPU Core team
------------------------------------------


Liu, Monk Sept. 1, 2021, 1:29 a.m. UTC | #14

[AMD Official Use Only]

Okay, I will rework this patch.

Thanks 

------------------------------------------
Monk Liu | Cloud-GPU Core team
------------------------------------------


Rob Clark Nov. 8, 2021, 11:39 p.m. UTC | #15

I stumbled across this thread when I ran into the same issue, while
working out how to move drm/msm to use scheduler's retire +
timeout/recovery (and get rid of our own mirror list of in-flight
jobs).  We already have hw error detection enabled, and it can signal
quite fast, so assuming the first job on the list is the guilty job
just won't work.

But I was considering a slightly different approach to fixing this,
instead just handling it all in drm_sched_main() and getting rid of
the complicated kthread parking gymnastics.  Ie. something along the
lines of:

---------------------
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 67382621b429..4d6ce775c316 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -764,6 +764,45 @@ static bool drm_sched_blocked(struct drm_gpu_scheduler *sched)
        return false;
 }

+static bool handle_timeout(struct drm_gpu_scheduler *sched)
+{
+       struct drm_sched_job *bad;
+
+       if (!sched->has_timeout)
+               return false;
+
+       sched->has_timeout = false;
+
+       spin_lock(&sched->job_list_lock);
+       bad = list_first_entry_or_null(&sched->pending_list,
+                                      struct drm_sched_job, list);
+
+       if (!bad) {
+               spin_unlock(&sched->job_list_lock);
+               return false;
+       }
+
+       spin_unlock(&sched->job_list_lock);
+
+       if (sched->timeout_wq == system_wq) {
+               /*
+                * If driver has no specific requirements about serializing
+                * reset wrt. other engines, just call timedout_job() directly
+                */
+               sched->ops->timedout_job(bad);
+       } else {
+               /*
+                * Otherwise queue it on timeout_wq and wait for it to complete
+                */
+               ... more typing needed here ...
+       }
+
+       if (sched->free_guilty) {
+               sched->ops->free_job(bad);
+               sched->free_guilty = false;
+       }
+}
+
 /**
  * drm_sched_main - main scheduler thread
  *
@@ -787,6 +826,7 @@ static int drm_sched_main(void *param)

                 wait_event_interruptible(sched->wake_up_worker,
                                          (cleanup_job = drm_sched_get_cleanup_job(sched)) ||
+                                        handle_timeout(sched) ||
                                          (!drm_sched_blocked(sched) &&
                                           (entity = drm_sched_select_entity(sched))) ||
                                          kthread_should_stop());
---------------------

drm_sched_fault() and the sw timeout handler would just set
sched->has_timeout and kick sched->wake_up_worker.

And since we handle the timeout case after
drm_sched_get_cleanup_job(), we know that all of the successfully
completed jobs have already been popped off the list, and won't be
unfairly maligned.
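
To illustrate the ordering argument, here is a small userspace model. It is only a sketch, not scheduler code: struct model_job, struct model_sched, model_cleanup(), model_pick_guilty() and model_demo() are hypothetical names, and the pending list is a plain array rather than the kernel's list_head.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a drm_sched_job on the pending list. */
struct model_job {
	int id;
	bool finished;	/* true once the hardware fence has signaled */
};

/* Hypothetical pending list: submission order, oldest job at index head. */
struct model_sched {
	struct model_job *pending;
	size_t head;
	size_t count;
};

/*
 * Pop every finished job from the front of the list, roughly what
 * repeated drm_sched_get_cleanup_job() calls achieve. Returns the
 * number of jobs retired.
 */
static size_t model_cleanup(struct model_sched *s)
{
	size_t retired = 0;

	assert(s->head <= s->count);
	while (s->head < s->count && s->pending[s->head].finished) {
		s->head++;
		retired++;
	}
	return retired;
}

/* Blame whatever is first on the list right now; NULL if it is empty. */
static const struct model_job *model_pick_guilty(const struct model_sched *s)
{
	return s->head < s->count ? &s->pending[s->head] : NULL;
}

/*
 * Jobs 0 and 1 completed, job 2 hung. Returns the id that gets blamed,
 * with or without running cleanup first, or -1 if nothing is pending.
 */
int model_demo(bool cleanup_first)
{
	struct model_job jobs[] = {
		{ .id = 0, .finished = true },
		{ .id = 1, .finished = true },
		{ .id = 2, .finished = false },
	};
	struct model_sched s = { .pending = jobs, .head = 0, .count = 3 };
	const struct model_job *guilty;

	if (cleanup_first)
		model_cleanup(&s);
	guilty = model_pick_guilty(&s);
	return guilty ? guilty->id : -1;
}
```

In this model, model_demo(true) blames the hung job 2, while model_demo(false) blames job 0, which had already completed.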

BR,
-R


Daniel Vetter Nov. 9, 2021, 9:07 a.m. UTC | #16

On Mon, Nov 08, 2021 at 03:39:17PM -0800, Rob Clark wrote:
> I stumbled across this thread when I ran into the same issue, while
> working out how to move drm/msm to use scheduler's retire +
> timeout/recovery (and get rid of our own mirror list of in-flight
> jobs).  We already have hw error detection enabled, and it can signal
> quite fast, so assuming the first job on the list is the guilty job
> just won't work.
> 
> But I was considering a slightly different approach to fixing this,
> instead just handling it all in drm_sched_main() and getting rid of
> the complicated kthread parking gymnastics.  Ie. something along the
> lines of:

So handling timeouts in the main sched thread won't work as soon as you
have multiple engines and a reset that impacts across engines:

- Nothing is simplified since you still need to stop the other scheduler
  threads.

- You get deadlocks if 2 schedulers time out at the same time, and both
  want to stop the other one.

Hence workqueue. Now the rule for the wq is that you can only have one per
reset domain, so
- single engine you just take the one drm/sched provides
- if reset affects all your engines in the chip, then you allocate one in
  the drm_device and pass that to all
- if you have a complex of gpus all interconnected (e.g. xgmi hive for
  amd), then it's one wq for the entire hive

_All_ reset related things must be run on that workqueue or things break,
which means if you get a hw fault that also needs to be run there. I guess
we should either patch drm/sched to check you call that function from the
right workqueue, or just handle it internally.
-Daniel
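
The "one workqueue per reset domain" rule can be pictured with a small single-threaded userspace model. This is a sketch under stated assumptions, not drm/sched code: struct reset_domain, reset_queue(), reset_handle(), reset_drain() and reset_demo() are hypothetical names, and a FIFO drained by a single caller stands in for an ordered workqueue. It only shows the serialization, not the deadlock scenario.

```c
#include <assert.h>
#include <stddef.h>

#define RESET_QUEUE_LEN 8

/* Hypothetical: one instance shared by every scheduler reset together. */
struct reset_domain {
	int queued[RESET_QUEUE_LEN];	/* scheduler ids waiting for recovery */
	size_t head, tail;
	int active;			/* handlers currently running */
	int max_active;			/* highest value 'active' ever reached */
	int order[RESET_QUEUE_LEN];	/* ids in the order they were handled */
	size_t handled;
};

/* Queue recovery for one scheduler; returns 0 on success, -1 if full. */
static int reset_queue(struct reset_domain *dom, int sched_id)
{
	if (dom->tail == RESET_QUEUE_LEN)
		return -1;
	dom->queued[dom->tail++] = sched_id;
	return 0;
}

/*
 * Stand-in for a driver's recovery handler. While recovering scheduler 0
 * it notices a fault on scheduler 1 and queues that on the same domain
 * instead of handling it in place.
 */
static void reset_handle(struct reset_domain *dom, int sched_id)
{
	dom->active++;
	if (dom->active > dom->max_active)
		dom->max_active = dom->active;

	if (sched_id == 0)
		reset_queue(dom, 1);

	dom->order[dom->handled++] = sched_id;
	dom->active--;
}

/* Run queued items strictly one at a time, like an ordered workqueue. */
static void reset_drain(struct reset_domain *dom)
{
	while (dom->head < dom->tail)
		reset_handle(dom, dom->queued[dom->head++]);
}

/*
 * Schedulers 0 and 2 time out together; 0 also reports a fault on 1.
 * Copies the handling order into 'order' and returns the maximum number
 * of handlers that were ever active at once.
 */
int reset_demo(int *order, size_t len)
{
	struct reset_domain dom = { 0 };
	size_t i;

	reset_queue(&dom, 0);
	reset_queue(&dom, 2);
	reset_drain(&dom);

	assert(dom.handled <= len);
	for (i = 0; i < dom.handled; i++)
		order[i] = dom.order[i];
	return dom.max_active;
}
```

In this model the fault found during recovery is queued behind the work already pending, so the handlers run in the order 0, 2, 1 and never overlap.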


Rob Clark Nov. 9, 2021, 4:17 p.m. UTC | #17

On Tue, Nov 9, 2021 at 1:07 AM Daniel Vetter <daniel@ffwll.ch> wrote:
>
> On Mon, Nov 08, 2021 at 03:39:17PM -0800, Rob Clark wrote:
> > I stumbled across this thread when I ran into the same issue, while
> > working out how to move drm/msm to use scheduler's retire +
> > timeout/recovery (and get rid of our own mirror list of in-flight
> > jobs).  We already have hw error detection enabled, and it can signal
> > quite fast, so assuming the first job on the list is the guilty job
> > just won't work.
> >
> > But I was considering a slightly different approach to fixing this,
> > instead just handling it all in drm_sched_main() and getting rid of
> > the complicated kthread parking gymnastics.  Ie. something along the
> > lines of:
>
> So handling timeouts in the main sched thread won't work as soon as you
> have multiple engines and a reset that impacts across engines:
>
> - Nothing is simplified since you still need to stop the other scheduler
>   threads.
>
> - You get deadlocks if 2 schedulers time out at the same time, and both
>   want to stop the other one.
>
> Hence workqueue. Now the rule for the wq is that you can only have one per
> reset domain, so
> - single engine you just take the one drm/sched provides
> - if reset affects all your engines in the chip, then you allocate one in
>   the drm_device and pass that to all
> - if you have a complex of gpus all interconnected (e.g. xgmi hive for
>   amd), then it's one wq for the entire hive
>
> _All_ reset related things must be run on that workqueue or things break,
> which means if you get a hw fault that also needs to be run there. I guess
> we should either patch drm/sched to check you call that function from the
> right workqueue, or just handle it internally.

Hmm, ok.. I guess it would be useful to better document the reasoning
for the current design, that would have steered me more towards the
approach taken in this patch.

BR,
-R

> -Daniel
>
> >
> > ---------------------
> > diff --git a/drivers/gpu/drm/scheduler/sched_main.c
> > b/drivers/gpu/drm/scheduler/sched_main.c
> > index 67382621b429..4d6ce775c316 100644
> > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > @@ -764,6 +764,45 @@ static bool drm_sched_blocked(struct
> > drm_gpu_scheduler *sched)
> >         return false;
> >  }
> >
> > +static bool handle_timeout(struct drm_gpu_scheduler *sched)
> > +{
> > +       struct drm_sched_job *bad;
> > +
> > +       if (!sched->has_timeout)
> > +               return false;
> > +
> > +       sched->has_timeout = false;
> > +
> > +       spin_lock(&sched->job_list_lock);
> > +       bad = list_first_entry_or_null(&sched->pending_list,
> > +                                      struct drm_sched_job, list);
> > +
> > +       if (!bad) {
> > +               spin_unlock(&sched->job_list_lock);
> > +               return false;
> > +       }
> > +
> > +       spin_unlock(&sched->job_list_lock);
> > +
> > +       if (sched->timeout_wq == system_wq) {
> > +               /*
> > +                * If driver has no specific requirements about serializing
> > +                * reset wrt. other engines, just call timedout_job() directly
> > +                */
> > +               sched->ops->timedout_job(job);
> > +       } else {
> > +               /*
> > +                * Otherwise queue it on timeout_wq and wait for it to complete
> > +                */
> > +               ... more typing needed here ...
> > +       }
> > +
> > +       if (sched->free_guilty) {
> > +               sched->ops->free_job(job);
> > +               sched->free_guilty = false;
> > +       }
> > +}
> > +
> >  /**
> >   * drm_sched_main - main scheduler thread
> >   *
> > @@ -787,6 +826,7 @@ static int drm_sched_main(void *param)
> >
> >                 wait_event_interruptible(sched->wake_up_worker,
> >                                          (cleanup_job =
> > drm_sched_get_cleanup_job(sched)) ||
> > +                                        handle_timeout(sched) ||
> >                                          (!drm_sched_blocked(sched) &&
> >                                           (entity =
> > drm_sched_select_entity(sched))) ||
> >                                          kthread_should_stop());
> > ---------------------
> >
> > drm_sched_fault() and the sw timeout handler would just set
> > sched->has_timeout and kick sched->wake_up_worker.
> >
> > And since we handle the timeout case after
> > drm_sched_get_cleanup_job(), we know that all of the successfully
> > completed jobs have already been popped off the list, and won't be
> > unfairly maligned.
> >
> > BR,
> > -R
> >
> > On Tue, Aug 31, 2021 at 6:29 PM Liu, Monk <Monk.Liu@amd.com> wrote:
> > >
> > > [AMD Official Use Only]
> > >
> > > Okay, I will reprepare this patch
> > >
> > > Thanks
> > >
> > > ------------------------------------------
> > > Monk Liu | Cloud-GPU Core team
> > > ------------------------------------------
> > >
> > > -----Original Message-----
> > > From: Daniel Vetter <daniel@ffwll.ch>
> > > Sent: Tuesday, August 31, 2021 9:02 PM
> > > To: Liu, Monk <Monk.Liu@amd.com>
> > > Cc: amd-gfx@lists.freedesktop.org; dri-devel@lists.freedesktop.org; Chen, Jingwen <Jingwen.Chen@amd.com>
> > > Subject: Re: [PATCH 2/2] drm/sched: serialize job_timeout and scheduler
> > >
> > > On Tue, Aug 31, 2021 at 02:59:02PM +0200, Daniel Vetter wrote:
> > > > Can we please have some actual commit message here, with detailed
> > > > explanation of the race/bug/whatever, how you fix it and why this is
> > > > the best option?
> > > >
> > > > On Tue, Aug 31, 2021 at 06:35:39PM +0800, Monk Liu wrote:
> > > > > tested-by: jingwen chen <jingwen.chen@amd.com>
> > > > > Signed-off-by: Monk Liu <Monk.Liu@amd.com>
> > > > > Signed-off-by: jingwen chen <jingwen.chen@amd.com>
> > > > > ---
> > > > >  drivers/gpu/drm/scheduler/sched_main.c | 24
> > > > > ++++--------------------
> > > > >  1 file changed, 4 insertions(+), 20 deletions(-)
> > > > >
> > > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c
> > > > > b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > index ecf8140..894fdb24 100644
> > > > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > @@ -319,19 +319,17 @@ static void drm_sched_job_timedout(struct work_struct *work)
> > > > >     sched = container_of(work, struct drm_gpu_scheduler,
> > > > > work_tdr.work);
> > > > >
> > > > >     /* Protects against concurrent deletion in
> > > > > drm_sched_get_cleanup_job */
> > > > > +   if (!__kthread_should_park(sched->thread))
> > > >
> > > > This is a __ function, i.e. considered internal, and it's lockless
> > > > atomic, i.e. unordered. And you're not explaining why this works.
> > > >
> > > > Iow it's probably buggy, and an just unconditionally parking the
> > > > kthread is probably the right thing to do. If it's not the right thing
> > > > to do, there's a bug here for sure.
> > >
> > > Also why don't we reuse the function drivers already have to stop a scheduler thread? We seem to have two kthread_park now, that's probably one too much.
> > > -Daniel
> > >
> > > > > +           kthread_park(sched->thread);
> > > > > +
> > > > >     spin_lock(&sched->job_list_lock);
> > > > >     job = list_first_entry_or_null(&sched->pending_list,
> > > > >                                    struct drm_sched_job, list);
> > > > >
> > > > >     if (job) {
> > > > > -           /*
> > > > > -            * Remove the bad job so it cannot be freed by concurrent
> > > > > -            * drm_sched_cleanup_jobs. It will be reinserted back after sched->thread
> > > > > -            * is parked at which point it's safe.
> > > > > -            */
> > > > > -           list_del_init(&job->list);
> > > > >             spin_unlock(&sched->job_list_lock);
> > > > >
> > > > > +           /* vendor's timeout_job should call drm_sched_start() */
> > > > >             status = job->sched->ops->timedout_job(job);
> > > > >
> > > > >             /*
> > > > > @@ -393,20 +391,6 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
> > > > >     kthread_park(sched->thread);
> > > > >
> > > > >     /*
> > > > > -    * Reinsert back the bad job here - now it's safe as
> > > > > -    * drm_sched_get_cleanup_job cannot race against us and release the
> > > > > -    * bad job at this point - we parked (waited for) any in progress
> > > > > -    * (earlier) cleanups and drm_sched_get_cleanup_job will not be called
> > > > > -    * now until the scheduler thread is unparked.
> > > > > -    */
> > > > > -   if (bad && bad->sched == sched)
> > > > > -           /*
> > > > > -            * Add at the head of the queue to reflect it was the earliest
> > > > > -            * job extracted.
> > > > > -            */
> > > > > -           list_add(&bad->list, &sched->pending_list);
> > > > > -
> > > > > -   /*
> > > > >      * Iterate the job list from later to  earlier one and either deactive
> > > > >      * their HW callbacks or remove them from pending list if they already
> > > > >      * signaled.
> > > > > --
> > > > > 2.7.4
> > > > >
> > > >
> > > > --
> > > > Daniel Vetter
> > > > Software Engineer, Intel Corporation
> > > > > http://blog.ffwll.ch
> > >
> > > --
> > > Daniel Vetter
> > > Software Engineer, Intel Corporation
> > > > http://blog.ffwll.ch
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

Daniel Vetter Nov. 10, 2021, 9:50 a.m. UTC | #18

On Tue, Nov 09, 2021 at 08:17:01AM -0800, Rob Clark wrote:
> On Tue, Nov 9, 2021 at 1:07 AM Daniel Vetter <daniel@ffwll.ch> wrote:
> >
> > On Mon, Nov 08, 2021 at 03:39:17PM -0800, Rob Clark wrote:
> > > I stumbled across this thread when I ran into the same issue, while
> > > working out how to move drm/msm to use scheduler's retire +
> > > timeout/recovery (and get rid of our own mirror list of in-flight
> > > jobs).  We already have hw error detection enabled, and it can signal
> > > quite fast, so assuming the first job on the list is the guilty job
> > > just won't work.
> > >
> > > But I was considering a slightly different approach to fixing this,
> > > instead just handling it all in drm_sched_main() and getting rid of
> > > the complicated kthread parking gymnastics.  Ie. something along the
> > > lines of:
> >
> > So handling timeouts in the main sched thread won't work as soon as you
> > have multiple engines and a reset that impacts across engines:
> >
> > - Nothing is simplified since you still need to stop the other scheduler
> >   threads.
> >
> > - You get deadlocks if 2 schedulers time out at the same time, and both
> >   want to stop the other one.
> >
> > Hence workqueue. Now the rule for the wq is that you can only have one per
> > reset domain, so
> > - single engine you just take the one drm/sched provides
> > - if reset affects all your engines in the chip, then you allocate one in
> >   the drm_device and pass that to all
> > - if you have a complex of gpus all interconnected (e.g. xgmi hive for
> >   amd), then it's one wq for the entire hive
> >
> > _All_ reset related things must be run on that workqueue or things break,
> > which means if you get hw fault that also needs to be run there. I guess
> > we should either patch drm/sched to check you call that function from the
> > right workqueue, or just handle it internally.
> 
> Hmm, ok.. I guess it would be useful to better document the reasoning
> for the current design, that would have steered me more towards the
> approach taken in this patch.

Maybe this was because you worked on an old kernel? Boris did update the
kerneldoc as part of making gpu reset work for panfrost, which has this
multi-engine reset problem. If that's not yet clear then we need to
improve the docs further.

AMD's problem is even worse, because their reset domain is the entire xgmi
hive, so multiple pci devices.
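
As an illustration of the one-workqueue-per-reset-domain rule quoted
above, here is a minimal userspace sketch in C. It is not drm/sched code:
enum reset_scope, struct model_engine and same_reset_domain() are
hypothetical names invented for this sketch. The only assumption is that
two schedulers must share a timeout workqueue exactly when one reset would
hit both of their engines.

```c
#include <assert.h>
#include <stdbool.h>

/* How far a reset reaches; hypothetical names used only in this sketch. */
enum reset_scope {
	RESET_SCOPE_ENGINE, /* reset touches a single engine */
	RESET_SCOPE_DEVICE, /* reset touches every engine of one device */
	RESET_SCOPE_HIVE,   /* reset touches every device of one hive */
};

/* Hypothetical description of where an engine sits in the topology. */
struct model_engine {
	int hive_id;
	int device_id;
	int engine_id;
};

/*
 * Returns true if the two engines belong to the same reset domain for the
 * given scope, i.e. if their schedulers would have to share one workqueue.
 */
bool same_reset_domain(enum reset_scope scope,
		       const struct model_engine *a,
		       const struct model_engine *b)
{
	assert(a && b);

	if (a->hive_id != b->hive_id)
		return false;
	if (scope == RESET_SCOPE_HIVE)
		return true;

	if (a->device_id != b->device_id)
		return false;
	if (scope == RESET_SCOPE_DEVICE)
		return true;

	/* Engine scope: only the very same engine shares a domain. */
	return a->engine_id == b->engine_id;
}
```

With device-wide scope, two engines on one device land in the same domain,
so their schedulers would be handed the same workqueue; with hive-wide
scope the same holds for engines on different devices of one hive.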

Also there might be more issues in drm/sched of course, e.g. I've looked a
bit at ordering/barriers and I'm pretty sure a lot are still missing. Or at
least we should have comments in the code explaining why it all works.
-Daniel
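
On the ordering point, a minimal sketch of what explicit ordering for a
has_timeout-style flag could look like, written with C11 atomics in
userspace. All names (model_timeout_state, model_timeout_init(),
model_signal_timeout(), model_take_timeout()) are hypothetical, and kernel
code would use the kernel's own atomic and barrier primitives rather than
<stdatomic.h>. The sketch only shows the release/acquire pairing and a
single-step test-and-clear.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the scheduler state touched by a timeout. */
struct model_timeout_state {
	atomic_int faulting_job_id; /* payload written before the flag */
	atomic_bool has_timeout;    /* flag polled by the scheduler thread */
};

void model_timeout_init(struct model_timeout_state *s)
{
	atomic_init(&s->faulting_job_id, -1);
	atomic_init(&s->has_timeout, false);
}

/* Producer side (fault or timer path): write payload, then raise flag. */
void model_signal_timeout(struct model_timeout_state *s, int job_id)
{
	atomic_store_explicit(&s->faulting_job_id, job_id,
			      memory_order_relaxed);
	/* release: the payload store is ordered before the flag store */
	atomic_store_explicit(&s->has_timeout, true, memory_order_release);
}

/*
 * Consumer side: test and clear the flag in one atomic step. Returns true
 * and fills *job_id if a timeout was pending, false otherwise.
 */
bool model_take_timeout(struct model_timeout_state *s, int *job_id)
{
	/* acquire: pairs with the release store in model_signal_timeout() */
	if (!atomic_exchange_explicit(&s->has_timeout, false,
				      memory_order_acquire))
		return false;

	*job_id = atomic_load_explicit(&s->faulting_job_id,
				       memory_order_relaxed);
	return true;
}
```

The exchange replaces a separate check followed by a clear, so the check
and the clear cannot be split by a concurrent signal.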


Christian König Nov. 10, 2021, 10:09 a.m. UTC | #19

Am 10.11.21 um 10:50 schrieb Daniel Vetter:
> On Tue, Nov 09, 2021 at 08:17:01AM -0800, Rob Clark wrote:
>> On Tue, Nov 9, 2021 at 1:07 AM Daniel Vetter <daniel@ffwll.ch> wrote:
>>> On Mon, Nov 08, 2021 at 03:39:17PM -0800, Rob Clark wrote:
>>>> I stumbled across this thread when I ran into the same issue, while
>>>> working out how to move drm/msm to use scheduler's retire +
>>>> timeout/recovery (and get rid of our own mirror list of in-flight
>>>> jobs).  We already have hw error detection enabled, and it can signal
>>>> quite fast, so assuming the first job on the list is the guilty job
>>>> just won't work.
>>>>
>>>> But I was considering a slightly different approach to fixing this,
>>>> instead just handling it all in drm_sched_main() and getting rid of
>>>> the complicated kthread parking gymnastics.  Ie. something along the
>>>> lines of:
>>> So handling timeouts in the main sched thread won't work as soon as you
>>> have multiple engines and a reset that impacts across engines:
>>>
>>> - Nothing is simplified since you still need to stop the other scheduler
>>>    threads.
>>>
>>> - You get deadlocks if 2 schedulers time out at the same time, and both
>>>    want to stop the other one.
>>>
>>> Hence workqueue. Now the rule for the wq is that you can only have one per
>>> reset domain, so
>>> - single engine you just take the one drm/sched provides
>>> - if reset affects all your engines in the chip, then you allocate one in
>>>    the drm_device and pass that to all
>>> - if you have a complex of gpus all interconnected (e.g. xgmi hive for
>>>    amd), then it's one wq for the entire hive
>>>
>>> _All_ reset related things must be run on that workqueue or things break,
>>> which means if you get hw fault that also needs to be run there. I guess
>>> we should either patch drm/sched to check you call that function from the
>>> right workqueue, or just handle it internally.
>> Hmm, ok.. I guess it would be useful to better document the reasoning
>> for the current design, that would have steered me more towards the
>> approach taken in this patch.
> Maybe this was because you worked on an old kernel? Boris did update the
> kerneldoc as part of making gpu reset work for panfrost, which has this
> multi-engine reset problem. If that's not yet clear then we need to
> improve the docs further.
>
> AMD's problem is even worse, because their reset domain is the entire xgmi
> hive, so multiple pci devices.

I have been pushing for quite a while to get something like an
amdgpu_reset_domain structure for this, but unfortunately we don't
have that yet.

Maybe it would be a good idea to have something like a drm_sched_domain
with all the information needed for inter-scheduler handling.

E.g. a workqueue for reset etc...
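
A rough sketch of what such a domain object might group together, as a
userspace C model. Nothing here is an existing drm/sched interface:
model_sched, model_sched_domain and the three functions are hypothetical,
and the FIFO array merely stands in for an ordered workqueue that runs one
reset at a time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DOMAIN_MAX_SCHEDS 4

/* Hypothetical stand-in for a scheduler; not struct drm_gpu_scheduler. */
struct model_sched {
	bool stopped;      /* true while parked for a reset */
	int times_stopped; /* how often a reset has parked this scheduler */
};

/* Hypothetical domain: all schedulers one reset affects, plus a FIFO. */
struct model_sched_domain {
	struct model_sched *scheds[DOMAIN_MAX_SCHEDS];
	size_t num_scheds;
	size_t pending[DOMAIN_MAX_SCHEDS]; /* indices that reported a timeout */
	size_t num_pending;
	int resets_done;
};

bool domain_add_sched(struct model_sched_domain *d, struct model_sched *s)
{
	if (d->num_scheds == DOMAIN_MAX_SCHEDS)
		return false;
	d->scheds[d->num_scheds++] = s;
	return true;
}

/* Record a timeout from scheduler idx; nothing is stopped here. */
bool domain_report_timeout(struct model_sched_domain *d, size_t idx)
{
	if (idx >= d->num_scheds || d->num_pending == DOMAIN_MAX_SCHEDS)
		return false;
	d->pending[d->num_pending++] = idx;
	return true;
}

/*
 * Handle the oldest pending report: stop every scheduler in the domain,
 * reset, restart. Returns false if nothing was pending.
 */
bool domain_run_one_reset(struct model_sched_domain *d)
{
	size_t i;

	if (d->num_pending == 0)
		return false;

	/* Pop the oldest report (FIFO order). */
	for (i = 1; i < d->num_pending; i++)
		d->pending[i - 1] = d->pending[i];
	d->num_pending--;

	for (i = 0; i < d->num_scheds; i++) {
		/* Serialization means no other reset is in flight. */
		assert(!d->scheds[i]->stopped);
		d->scheds[i]->stopped = true;
		d->scheds[i]->times_stopped++;
	}

	d->resets_done++; /* the hardware reset would happen here */

	for (i = 0; i < d->num_scheds; i++)
		d->scheds[i]->stopped = false;
	return true;
}
```

Because reports are only queued and then drained one by one, two
schedulers timing out at the same moment never try to stop each other
directly; each reset sees the whole domain running and parks all of it.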

Regards,
Christian.

>
> Also there might be more issues in drm/sched of course, e.g. I've looked a
> bit at ordering/barriers and I'm pretty sure a lot are still missing. Or at
> least we should have comments in the code explaining why it all works.
> -Daniel

Andrey Grodzovsky Nov. 10, 2021, 12:50 p.m. UTC | #20

On 2021-11-10 5:09 a.m., Christian König wrote:
> Am 10.11.21 um 10:50 schrieb Daniel Vetter:
>> On Tue, Nov 09, 2021 at 08:17:01AM -0800, Rob Clark wrote:
>>> On Tue, Nov 9, 2021 at 1:07 AM Daniel Vetter <daniel@ffwll.ch> wrote:
>>>> On Mon, Nov 08, 2021 at 03:39:17PM -0800, Rob Clark wrote:
>>>>> I stumbled across this thread when I ran into the same issue, while
>>>>> working out how to move drm/msm to use scheduler's retire +
>>>>> timeout/recovery (and get rid of our own mirror list of in-flight
>>>>> jobs).  We already have hw error detection enabled, and it can signal
>>>>> quite fast, so assuming the first job on the list is the guilty job
>>>>> just won't work.
>>>>>
>>>>> But I was considering a slightly different approach to fixing this,
>>>>> instead just handling it all in drm_sched_main() and getting rid of
>>>>> the complicated kthread parking gymnastics.  Ie. something along the
>>>>> lines of:
>>>> So handling timeouts in the main sched thread won't work as soon as you
>>>> have multiple engines and a reset that impacts across engines:
>>>>
>>>> - Nothing is simplified since you still need to stop the other scheduler
>>>>    threads.
>>>>
>>>> - You get deadlocks if 2 schedulers time out at the same time, and both
>>>>    want to stop the other one.
>>>>
>>>> Hence workqueue. Now the rule for the wq is that you can only have one per
>>>> reset domain, so
>>>> - single engine you just take the one drm/sched provides
>>>> - if reset affects all your engines in the chip, then you allocate one in
>>>>    the drm_device and pass that to all
>>>> - if you have a complex of gpus all interconnected (e.g. xgmi hive for
>>>>    amd), then it's one wq for the entire hive
>>>>
>>>> _All_ reset related things must be run on that workqueue or things break,
>>>> which means if you get hw fault that also needs to be run there. I guess
>>>> we should either patch drm/sched to check you call that function from the
>>>> right workqueue, or just handle it internally.
>>> Hmm, ok.. I guess it would be useful to better document the reasoning
>>> for the current design, that would have steered me more towards the
>>> approach taken in this patch.
>> Maybe this was because you worked on an old kernel? Boris did update the
>> kerneldoc as part of making gpu reset work for panfrost, which has this
>> multi-engine reset problem. If that's not yet clear then we need to
>> improve the docs further.
>>
>> AMD's problem is even worse, because their reset domain is the entire xgmi
>> hive, so multiple pci devices.
>
> I've been pushing for quite a while to get something like an
> amdgpu_reset_domain structure or similar for this, but we unfortunately
> don't have that yet.
>
> Maybe it would be a good idea to have something like a drm_sched_domain or
> similar with all the necessary information for the inter-scheduler handling.
>
> E.g. a workqueue for reset etc...
>
> Regards,
> Christian.


I think Monk and Jingwen already switched the SRIOV case to Boris's
single-threaded queue interface. We can try to expand this to the general
bare-metal case for AMD and, along the way, add drm_sched_domain to the
scheduler.

Andrey


>
>>
>> Also there might be more issues in drm/sched ofc, e.g. I've looked a bit at
>> ordering/barriers and I'm pretty sure a lot are still missing. Or at least
>> we should have comments in the code explaining why it all works.
>> -Daniel
>>
>>> BR,
>>> -R
>>>
>>>> -Daniel
>>>>
>>>>> ---------------------
>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
>>>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>>>> index 67382621b429..4d6ce775c316 100644
>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>> @@ -764,6 +764,45 @@ static bool drm_sched_blocked(struct
>>>>> drm_gpu_scheduler *sched)
>>>>>          return false;
>>>>>   }
>>>>>
>>>>> +static bool handle_timeout(struct drm_gpu_scheduler *sched)
>>>>> +{
>>>>> +       struct drm_sched_job *bad;
>>>>> +
>>>>> +       if (!sched->has_timeout)
>>>>> +               return false;
>>>>> +
>>>>> +       sched->has_timeout = false;
>>>>> +
>>>>> +       spin_lock(&sched->job_list_lock);
>>>>> +       bad = list_first_entry_or_null(&sched->pending_list,
>>>>> +                                      struct drm_sched_job, list);
>>>>> +
>>>>> +       if (!bad) {
>>>>> +               spin_unlock(&sched->job_list_lock);
>>>>> +               return false;
>>>>> +       }
>>>>> +
>>>>> +       spin_unlock(&sched->job_list_lock);
>>>>> +
>>>>> +       if (sched->timeout_wq == system_wq) {
>>>>> +               /*
>>>>> +                * If driver has no specific requirements about serializing
>>>>> +                * reset wrt. other engines, just call timedout_job() directly
>>>>> +                */
>>>>> +               sched->ops->timedout_job(bad);
>>>>> +       } else {
>>>>> +               /*
>>>>> +                * Otherwise queue it on timeout_wq and wait for it to complete
>>>>> +                */
>>>>> +               ... more typing needed here ...
>>>>> +       }
>>>>> +
>>>>> +       if (sched->free_guilty) {
>>>>> +               sched->ops->free_job(bad);
>>>>> +               sched->free_guilty = false;
>>>>> +       }
>>>>> +}
>>>>> +
>>>>>   /**
>>>>>    * drm_sched_main - main scheduler thread
>>>>>    *
>>>>> @@ -787,6 +826,7 @@ static int drm_sched_main(void *param)
>>>>>
>>>>>                  wait_event_interruptible(sched->wake_up_worker,
>>>>>                                           (cleanup_job =
>>>>> drm_sched_get_cleanup_job(sched)) ||
>>>>> +                                        handle_timeout(sched) ||
>>>>>                                           (!drm_sched_blocked(sched) &&
>>>>>                                            (entity =
>>>>> drm_sched_select_entity(sched))) ||
>>>>>                                           kthread_should_stop());
>>>>> ---------------------
>>>>>
>>>>> drm_sched_fault() and the sw timeout handler would just set
>>>>> sched->has_timeout and kick sched->wake_up_worker.
>>>>>
>>>>> And since we handle the timeout case after
>>>>> drm_sched_get_cleanup_job(), we know that all of the successfully
>>>>> completed jobs have already been popped off the list, and won't be
>>>>> unfairly maligned.
>>>>>
>>>>> BR,
>>>>> -R
>>>>>
>>>>> On Tue, Aug 31, 2021 at 6:29 PM Liu, Monk <Monk.Liu@amd.com> wrote:
>>>>>> [AMD Official Use Only]
>>>>>>
>>>>>> Okay, I will reprepare this patch
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> ------------------------------------------
>>>>>> Monk Liu | Cloud-GPU Core team
>>>>>> ------------------------------------------
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Daniel Vetter <daniel@ffwll.ch>
>>>>>> Sent: Tuesday, August 31, 2021 9:02 PM
>>>>>> To: Liu, Monk <Monk.Liu@amd.com>
>>>>>> Cc: amd-gfx@lists.freedesktop.org; dri-devel@lists.freedesktop.org; Chen, Jingwen <Jingwen.Chen@amd.com>
>>>>>> Subject: Re: [PATCH 2/2] drm/sched: serialize job_timeout and scheduler
>>>>>>
>>>>>> On Tue, Aug 31, 2021 at 02:59:02PM +0200, Daniel Vetter wrote:
>>>>>>> Can we please have some actual commit message here, with detailed
>>>>>>> explanation of the race/bug/whatever, how you fix it and why this is
>>>>>>> the best option?
>>>>>>>
>>>>>>> On Tue, Aug 31, 2021 at 06:35:39PM +0800, Monk Liu wrote:
>>>>>>>> tested-by: jingwen chen <jingwen.chen@amd.com>
>>>>>>>> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
>>>>>>>> Signed-off-by: jingwen chen <jingwen.chen@amd.com>
>>>>>>>> ---
>>>>>>>>   drivers/gpu/drm/scheduler/sched_main.c | 24
>>>>>>>> ++++--------------------
>>>>>>>>   1 file changed, 4 insertions(+), 20 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>> index ecf8140..894fdb24 100644
>>>>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>> @@ -319,19 +319,17 @@ static void drm_sched_job_timedout(struct work_struct *work)
>>>>>>>>      sched = container_of(work, struct drm_gpu_scheduler,
>>>>>>>> work_tdr.work);
>>>>>>>>
>>>>>>>>      /* Protects against concurrent deletion in
>>>>>>>> drm_sched_get_cleanup_job */
>>>>>>>> +   if (!__kthread_should_park(sched->thread))
>>>>>>> This is a __ function, i.e. considered internal, and it's lockless
>>>>>>> atomic, i.e. unordered. And you're not explaining why this works.
>>>>>>>
>>>>>>> Iow it's probably buggy, and just unconditionally parking the
>>>>>>> kthread is probably the right thing to do. If it's not the right thing
>>>>>>> to do, there's a bug here for sure.
>>>>>> Also why don't we reuse the function drivers already have to stop a scheduler thread? We seem to have two kthread_park now, that's probably one too much.
>>>>>> -Daniel
>>>>>>
>>>>>>>> +           kthread_park(sched->thread);
>>>>>>>> +
>>>>>>>>      spin_lock(&sched->job_list_lock);
>>>>>>>>      job = list_first_entry_or_null(&sched->pending_list,
>>>>>>>>                                     struct drm_sched_job, list);
>>>>>>>>
>>>>>>>>      if (job) {
>>>>>>>> -           /*
>>>>>>>> -            * Remove the bad job so it cannot be freed by concurrent
>>>>>>>> -            * drm_sched_cleanup_jobs. It will be reinserted back after sched->thread
>>>>>>>> -            * is parked at which point it's safe.
>>>>>>>> -            */
>>>>>>>> -           list_del_init(&job->list);
>>>>>>>>              spin_unlock(&sched->job_list_lock);
>>>>>>>>
>>>>>>>> +           /* vendor's timeout_job should call drm_sched_start() */
>>>>>>>>              status = job->sched->ops->timedout_job(job);
>>>>>>>>
>>>>>>>>              /*
>>>>>>>> @@ -393,20 +391,6 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>>>>>>>>      kthread_park(sched->thread);
>>>>>>>>
>>>>>>>>      /*
>>>>>>>> -    * Reinsert back the bad job here - now it's safe as
>>>>>>>> -    * drm_sched_get_cleanup_job cannot race against us and release the
>>>>>>>> -    * bad job at this point - we parked (waited for) any in progress
>>>>>>>> -    * (earlier) cleanups and drm_sched_get_cleanup_job will not be called
>>>>>>>> -    * now until the scheduler thread is unparked.
>>>>>>>> -    */
>>>>>>>> -   if (bad && bad->sched == sched)
>>>>>>>> -           /*
>>>>>>>> -            * Add at the head of the queue to reflect it was the earliest
>>>>>>>> -            * job extracted.
>>>>>>>> -            */
>>>>>>>> -           list_add(&bad->list, &sched->pending_list);
>>>>>>>> -
>>>>>>>> -   /*
>>>>>>>>       * Iterate the job list from later to  earlier one and either deactive
>>>>>>>>       * their HW callbacks or remove them from pending list if they already
>>>>>>>>       * signaled.
>>>>>>>> -- 
>>>>>>>> 2.7.4
>>>>>>>>
>>>>>>> -- 
>>>>>>> Daniel Vetter
>>>>>>> Software Engineer, Intel Corporation
>>>>>>> http://blog.ffwll.ch
>>>>>> -- 
>>>>>> Daniel Vetter
>>>>>> Software Engineer, Intel Corporation
>>>>>> http://blog.ffwll.ch
>>>>>>
>>>> -- 
>>>> Daniel Vetter
>>>> Software Engineer, Intel Corporation
>>>> http://blog.ffwll.ch
>>>>
>

Daniel Vetter Nov. 10, 2021, 1:24 p.m. UTC | #21

On Wed, Nov 10, 2021 at 11:09:50AM +0100, Christian König wrote:
> Am 10.11.21 um 10:50 schrieb Daniel Vetter:
> > On Tue, Nov 09, 2021 at 08:17:01AM -0800, Rob Clark wrote:
> > > On Tue, Nov 9, 2021 at 1:07 AM Daniel Vetter <daniel@ffwll.ch> wrote:
> > > > On Mon, Nov 08, 2021 at 03:39:17PM -0800, Rob Clark wrote:
> > > > > I stumbled across this thread when I ran into the same issue, while
> > > > > working out how to move drm/msm to use scheduler's retire +
> > > > > timeout/recovery (and get rid of our own mirror list of in-flight
> > > > > jobs).  We already have hw error detection enabled, and it can signal
> > > > > quite fast, so assuming the first job on the list is the guilty job
> > > > > just won't work.
> > > > > 
> > > > > But I was considering a slightly different approach to fixing this,
> > > > > instead just handling it all in drm_sched_main() and getting rid of
> > > > > the complicated kthread parking gymnastics.  Ie. something along the
> > > > > lines of:
> > > > So handling timeouts in the main sched thread won't work as soon as you
> > > > have multiple engines and reset that impacts across engines:
> > > > 
> > > > - Nothing is simplified since you still need to stop the other scheduler
> > > >    threads.
> > > > 
> > > > - You get deadlocks if 2 schedulers time out at the same time, and both
> > > >    want to stop the other one.
> > > > 
> > > > Hence workqueue. Now the rule for the wq is that you can only have one per
> > > > reset domain, so
> > > > - single engine you just take the one drm/sched provides
> > > > - if reset affects all your engines in the chip, then you allocate one in
> > > >    the drm_device and pass that to all
> > > > - if you have a complex of gpus all interconnected (e.g. xgmi hive for
> > > >    amd), then it's one wq for the entire hive
> > > > 
> > > > _All_ reset related things must be run on that workqueue or things break,
> > > > which means if you get hw fault that also needs to be run there. I guess
> > > > we should either patch drm/sched to check you call that function from the
> > > > right workqueue, or just handle it internally.
> > > Hmm, ok.. I guess it would be useful to better document the reasoning
> > > for the current design, that would have steered me more towards the
> > > approach taken in this patch.
> > Maybe this was because you worked on an old kernel? Boris did update the
> > kerneldoc as part of making gpu reset work for panfrost, which has this
> > multi-engine reset problem. If that's not yet clear then we need to
> > improve the docs further.
> > 
> > AMD's problem is even worse, because their reset domain is the entire xgmi
> > hive, so multiple pci devices.
> 
> I've been pushing for quite a while to get something like an
> amdgpu_reset_domain structure or similar for this, but we unfortunately
> don't have that yet.
> 
> Maybe it would be a good idea to have something like a drm_sched_domain or
> similar with all the necessary information for the inter-scheduler handling.
> 
> E.g. a workqueue for reset etc...

Yeah I think as soon as we have more stuff than just the wq then a
drm_sched_reset_domain sounds good.

But if it's just driver stuff (e.g. the xgmi locking you have in amdgpu
reset comes to mind) then I think just a driver_reset_domain struct that
bundles all that stuff up seems good enough.

E.g. on i915 I'm also pondering whether some of the fw requests should be
processed by the reset wq, to avoid locking headaches, so I don't think
hiding that work too much in abstractions is a good idea.
-Daniel
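
For illustration only, a driver-private reset domain of the kind discussed
above could be modeled roughly as below. This is a userspace sketch, not
kernel code: struct driver_reset_domain, reset_domain_queue(),
reset_domain_run() and record_reset() are hypothetical names that do not
exist in drm/sched, and a plain FIFO stands in for an ordered workqueue. It
only shows the rule from the thread: every scheduler in one reset domain
funnels its reset work through one queue, so handlers run one at a time and
in submission order.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_RESET_WORK 8

/* Hypothetical model of one reset request (stands in for a work item). */
struct reset_work {
	int sched_id;                      /* scheduler that asked for the reset */
	void (*fn)(struct reset_work *w);  /* reset handler to run */
};

/*
 * Hypothetical bundle shared by all schedulers that reset together.
 * A real driver would use an ordered workqueue instead of this FIFO and
 * would keep its driver-specific reset locks in the same struct.
 */
struct driver_reset_domain {
	struct reset_work *pending[MAX_RESET_WORK];
	size_t head, tail;     /* FIFO indices; tail - head <= MAX_RESET_WORK */
	int active;            /* handlers currently running */
	int max_active_seen;   /* high-water mark; stays at 1 when serialized */
};

/* Queue reset work; returns 0 on success, -1 if the FIFO is full. */
static int reset_domain_queue(struct driver_reset_domain *d,
			      struct reset_work *w)
{
	if (d->tail - d->head >= MAX_RESET_WORK)
		return -1;
	d->pending[d->tail % MAX_RESET_WORK] = w;
	d->tail++;
	return 0;
}

/* Drain the FIFO strictly in order; returns the number of handlers run. */
static int reset_domain_run(struct driver_reset_domain *d)
{
	int handled = 0;

	while (d->head != d->tail) {
		struct reset_work *w = d->pending[d->head % MAX_RESET_WORK];

		d->head++;
		d->active++;
		if (d->active > d->max_active_seen)
			d->max_active_seen = d->active;
		w->fn(w);
		d->active--;
		handled++;
	}
	return handled;
}

/* Example handler: records the order in which resets were executed. */
static int reset_order[MAX_RESET_WORK];
static int reset_count;

static void record_reset(struct reset_work *w)
{
	assert(reset_count < MAX_RESET_WORK);
	reset_order[reset_count++] = w->sched_id;
}
```

With two schedulers queueing into the same domain, the handlers run in the
order they were queued and never overlap. A hw fault would have to be queued
the same way instead of being handled directly from the fault path.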

> 
> Regards,
> Christian.
> 
> > 
> > Also there might be more issues in drm/sched ofc, e.g. I've looked a bit at
> > ordering/barriers and I'm pretty sure a lot are still missing. Or at least
> > we should have comments in the code explaining why it all works.
> > -Daniel
> > 
> > > BR,
> > > -R
> > > 
> > > > -Daniel
> > > > 
> > > > > ---------------------
> > > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c
> > > > > b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > index 67382621b429..4d6ce775c316 100644
> > > > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > @@ -764,6 +764,45 @@ static bool drm_sched_blocked(struct
> > > > > drm_gpu_scheduler *sched)
> > > > >          return false;
> > > > >   }
> > > > > 
> > > > > +static bool handle_timeout(struct drm_gpu_scheduler *sched)
> > > > > +{
> > > > > +       struct drm_sched_job *bad;
> > > > > +
> > > > > +       if (!sched->has_timeout)
> > > > > +               return false;
> > > > > +
> > > > > +       sched->has_timeout = false;
> > > > > +
> > > > > +       spin_lock(&sched->job_list_lock);
> > > > > +       bad = list_first_entry_or_null(&sched->pending_list,
> > > > > +                                      struct drm_sched_job, list);
> > > > > +
> > > > > +       if (!bad) {
> > > > > +               spin_unlock(&sched->job_list_lock);
> > > > > +               return false;
> > > > > +       }
> > > > > +
> > > > > +       spin_unlock(&sched->job_list_lock);
> > > > > +
> > > > > +       if (sched->timeout_wq == system_wq) {
> > > > > +               /*
> > > > > +                * If driver has no specific requirements about serializing
> > > > > +                * reset wrt. other engines, just call timedout_job() directly
> > > > > +                */
> > > > > +               sched->ops->timedout_job(bad);
> > > > > +       } else {
> > > > > +               /*
> > > > > +                * Otherwise queue it on timeout_wq and wait for it to complete
> > > > > +                */
> > > > > +               ... more typing needed here ...
> > > > > +       }
> > > > > +
> > > > > +       if (sched->free_guilty) {
> > > > > +               sched->ops->free_job(bad);
> > > > > +               sched->free_guilty = false;
> > > > > +       }
> > > > > +}
> > > > > +
> > > > >   /**
> > > > >    * drm_sched_main - main scheduler thread
> > > > >    *
> > > > > @@ -787,6 +826,7 @@ static int drm_sched_main(void *param)
> > > > > 
> > > > >                  wait_event_interruptible(sched->wake_up_worker,
> > > > >                                           (cleanup_job =
> > > > > drm_sched_get_cleanup_job(sched)) ||
> > > > > +                                        handle_timeout(sched) ||
> > > > >                                           (!drm_sched_blocked(sched) &&
> > > > >                                            (entity =
> > > > > drm_sched_select_entity(sched))) ||
> > > > >                                           kthread_should_stop());
> > > > > ---------------------
> > > > > 
> > > > > drm_sched_fault() and the sw timeout handler would just set
> > > > > sched->has_timeout and kick sched->wake_up_worker.
> > > > > 
> > > > > And since we handle the timeout case after
> > > > > drm_sched_get_cleanup_job(), we know that all of the successfully
> > > > > completed jobs have already been popped off the list, and won't be
> > > > > unfairly maligned.
> > > > > 
> > > > > BR,
> > > > > -R
> > > > > 
> > > > > On Tue, Aug 31, 2021 at 6:29 PM Liu, Monk <Monk.Liu@amd.com> wrote:
> > > > > > [AMD Official Use Only]
> > > > > > 
> > > > > > Okay, I will reprepare this patch
> > > > > > 
> > > > > > Thanks
> > > > > > 
> > > > > > ------------------------------------------
> > > > > > Monk Liu | Cloud-GPU Core team
> > > > > > ------------------------------------------
> > > > > > 
> > > > > > -----Original Message-----
> > > > > > From: Daniel Vetter <daniel@ffwll.ch>
> > > > > > Sent: Tuesday, August 31, 2021 9:02 PM
> > > > > > To: Liu, Monk <Monk.Liu@amd.com>
> > > > > > Cc: amd-gfx@lists.freedesktop.org; dri-devel@lists.freedesktop.org; Chen, Jingwen <Jingwen.Chen@amd.com>
> > > > > > Subject: Re: [PATCH 2/2] drm/sched: serialize job_timeout and scheduler
> > > > > > 
> > > > > > On Tue, Aug 31, 2021 at 02:59:02PM +0200, Daniel Vetter wrote:
> > > > > > > Can we please have some actual commit message here, with detailed
> > > > > > > explanation of the race/bug/whatever, how you fix it and why this is
> > > > > > > the best option?
> > > > > > > 
> > > > > > > On Tue, Aug 31, 2021 at 06:35:39PM +0800, Monk Liu wrote:
> > > > > > > > tested-by: jingwen chen <jingwen.chen@amd.com>
> > > > > > > > Signed-off-by: Monk Liu <Monk.Liu@amd.com>
> > > > > > > > Signed-off-by: jingwen chen <jingwen.chen@amd.com>
> > > > > > > > ---
> > > > > > > >   drivers/gpu/drm/scheduler/sched_main.c | 24
> > > > > > > > ++++--------------------
> > > > > > > >   1 file changed, 4 insertions(+), 20 deletions(-)
> > > > > > > > 
> > > > > > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > > b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > > index ecf8140..894fdb24 100644
> > > > > > > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > > @@ -319,19 +319,17 @@ static void drm_sched_job_timedout(struct work_struct *work)
> > > > > > > >      sched = container_of(work, struct drm_gpu_scheduler,
> > > > > > > > work_tdr.work);
> > > > > > > > 
> > > > > > > >      /* Protects against concurrent deletion in
> > > > > > > > drm_sched_get_cleanup_job */
> > > > > > > > +   if (!__kthread_should_park(sched->thread))
> > > > > > > This is a __ function, i.e. considered internal, and it's lockless
> > > > > > > atomic, i.e. unordered. And you're not explaining why this works.
> > > > > > > 
> > > > > > > Iow it's probably buggy, and just unconditionally parking the
> > > > > > > kthread is probably the right thing to do. If it's not the right thing
> > > > > > > to do, there's a bug here for sure.
> > > > > > Also why don't we reuse the function drivers already have to stop a scheduler thread? We seem to have two kthread_park now, that's probably one too much.
> > > > > > -Daniel
> > > > > > 
> > > > > > > > +           kthread_park(sched->thread);
> > > > > > > > +
> > > > > > > >      spin_lock(&sched->job_list_lock);
> > > > > > > >      job = list_first_entry_or_null(&sched->pending_list,
> > > > > > > >                                     struct drm_sched_job, list);
> > > > > > > > 
> > > > > > > >      if (job) {
> > > > > > > > -           /*
> > > > > > > > -            * Remove the bad job so it cannot be freed by concurrent
> > > > > > > > -            * drm_sched_cleanup_jobs. It will be reinserted back after sched->thread
> > > > > > > > -            * is parked at which point it's safe.
> > > > > > > > -            */
> > > > > > > > -           list_del_init(&job->list);
> > > > > > > >              spin_unlock(&sched->job_list_lock);
> > > > > > > > 
> > > > > > > > +           /* vendor's timeout_job should call drm_sched_start() */
> > > > > > > >              status = job->sched->ops->timedout_job(job);
> > > > > > > > 
> > > > > > > >              /*
> > > > > > > > @@ -393,20 +391,6 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
> > > > > > > >      kthread_park(sched->thread);
> > > > > > > > 
> > > > > > > >      /*
> > > > > > > > -    * Reinsert back the bad job here - now it's safe as
> > > > > > > > -    * drm_sched_get_cleanup_job cannot race against us and release the
> > > > > > > > -    * bad job at this point - we parked (waited for) any in progress
> > > > > > > > -    * (earlier) cleanups and drm_sched_get_cleanup_job will not be called
> > > > > > > > -    * now until the scheduler thread is unparked.
> > > > > > > > -    */
> > > > > > > > -   if (bad && bad->sched == sched)
> > > > > > > > -           /*
> > > > > > > > -            * Add at the head of the queue to reflect it was the earliest
> > > > > > > > -            * job extracted.
> > > > > > > > -            */
> > > > > > > > -           list_add(&bad->list, &sched->pending_list);
> > > > > > > > -
> > > > > > > > -   /*
> > > > > > > >       * Iterate the job list from later to  earlier one and either deactive
> > > > > > > >       * their HW callbacks or remove them from pending list if they already
> > > > > > > >       * signaled.
> > > > > > > > --
> > > > > > > > 2.7.4
> > > > > > > > 
> > > > > > > --
> > > > > > > Daniel Vetter
> > > > > > > Software Engineer, Intel Corporation
> > > > > > > http://blog.ffwll.ch
> > > > > > --
> > > > > > Daniel Vetter
> > > > > > Software Engineer, Intel Corporation
> > > > > > http://blog.ffwll.ch
> > > > --
> > > > Daniel Vetter
> > > > Software Engineer, Intel Corporation
> > > > http://blog.ffwll.ch
>

Rob Clark Nov. 10, 2021, 7:15 p.m. UTC | #22

On Wed, Nov 10, 2021 at 1:50 AM Daniel Vetter <daniel@ffwll.ch> wrote:
>
> On Tue, Nov 09, 2021 at 08:17:01AM -0800, Rob Clark wrote:
> > On Tue, Nov 9, 2021 at 1:07 AM Daniel Vetter <daniel@ffwll.ch> wrote:
> > >
> > > On Mon, Nov 08, 2021 at 03:39:17PM -0800, Rob Clark wrote:
> > > > I stumbled across this thread when I ran into the same issue, while
> > > > working out how to move drm/msm to use scheduler's retire +
> > > > timeout/recovery (and get rid of our own mirror list of in-flight
> > > > jobs).  We already have hw error detection enabled, and it can signal
> > > > quite fast, so assuming the first job on the list is the guilty job
> > > > just won't work.
> > > >
> > > > But I was considering a slightly different approach to fixing this,
> > > > instead just handling it all in drm_sched_main() and getting rid of
> > > > the complicated kthread parking gymnastics.  Ie. something along the
> > > > lines of:
> > >
> > > So handling timeouts in the main sched thread won't work as soon as you
> > > have multiple engines and reset that impacts across engines:
> > >
> > > - Nothing is simplified since you still need to stop the other scheduler
> > >   threads.
> > >
> > > - You get deadlocks if 2 schedulers time out at the same time, and both
> > >   want to stop the other one.
> > >
> > > Hence workqueue. Now the rule for the wq is that you can only have one per
> > > reset domain, so
> > > - single engine you just take the one drm/sched provides
> > > - if reset affects all your engines in the chip, then you allocate one in
> > >   the drm_device and pass that to all
> > > - if you have a complex of gpus all interconnected (e.g. xgmi hive for
> > >   amd), then it's one wq for the entire hive
> > >
> > > _All_ reset related things must be run on that workqueue or things break,
> > > which means if you get hw fault that also needs to be run there. I guess
> > > we should either patch drm/sched to check you call that function from the
> > > right workqueue, or just handle it internally.
> >
> > Hmm, ok.. I guess it would be useful to better document the reasoning
> > for the current design, that would have steered me more towards the
> > approach taken in this patch.
>
> Maybe this was because you worked on an old kernel? Boris did update the
> kerneldoc as part of making gpu reset work for panfrost, which has this
> multi-engine reset problem. If that's not yet clear then we need to
> improve the docs further.

I saw that, and understood the ordered wq, but missed the implication
regarding having to park other scheduler kthreads.

BR,
-R
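
To make the parking implication concrete, here is a toy wait-for model of
the deadlock Daniel describes. It is a userspace sketch under stated
assumptions, not scheduler code: park_would_deadlock() and NOT_WAITING are
hypothetical names, and the model assumes a thread that is blocked waiting
for another thread to park cannot itself reach a park point.

```c
#include <assert.h>

#define NOT_WAITING (-1)

/*
 * waits_for[i] == j means scheduler thread i is blocked until thread j has
 * parked; NOT_WAITING means thread i is free to reach its park point.
 * Entries must be NOT_WAITING or a valid index in [0, n).
 * Returns 1 if some thread can never be parked (a wait cycle), else 0.
 */
static int park_would_deadlock(const int *waits_for, int n)
{
	for (int start = 0; start < n; start++) {
		int cur = start;

		/* Following more than n edits of the chain means a cycle. */
		for (int steps = 0; steps <= n; steps++) {
			if (waits_for[cur] == NOT_WAITING)
				break;
			cur = waits_for[cur];
			if (steps == n)
				return 1;
		}
	}
	return 0;
}
```

If both schedulers handle their timeout in their own thread and each tries
to park the other, the table is {1, 0} and the function reports a deadlock.
If the timeout work runs on a separate, serialized queue, neither scheduler
thread waits on the other ({NOT_WAITING, NOT_WAITING}) and both can be
parked.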

> AMD's problem is even worse, because their reset domain is the entire xgmi
> hive, so multiple pci devices.
>
> Also there might be more issues in drm/sched ofc, e.g. I've looked a bit at
> ordering/barriers and I'm pretty sure a lot are still missing. Or at least
> we should have comments in the code explaining why it all works.
> -Daniel
>
> >
> > BR,
> > -R
> >
> > > -Daniel
> > >
> > > >
> > > > ---------------------
> > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c
> > > > b/drivers/gpu/drm/scheduler/sched_main.c
> > > > index 67382621b429..4d6ce775c316 100644
> > > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > > @@ -764,6 +764,45 @@ static bool drm_sched_blocked(struct
> > > > drm_gpu_scheduler *sched)
> > > >         return false;
> > > >  }
> > > >
> > > > +static bool handle_timeout(struct drm_gpu_scheduler *sched)
> > > > +{
> > > > +       struct drm_sched_job *bad;
> > > > +
> > > > +       if (!sched->has_timeout)
> > > > +               return false;
> > > > +
> > > > +       sched->has_timeout = false;
> > > > +
> > > > +       spin_lock(&sched->job_list_lock);
> > > > +       bad = list_first_entry_or_null(&sched->pending_list,
> > > > +                                      struct drm_sched_job, list);
> > > > +
> > > > +       if (!bad) {
> > > > +               spin_unlock(&sched->job_list_lock);
> > > > +               return false;
> > > > +       }
> > > > +
> > > > +       spin_unlock(&sched->job_list_lock);
> > > > +
> > > > +       if (sched->timeout_wq == system_wq) {
> > > > +               /*
> > > > +                * If driver has no specific requirements about serializing
> > > > +                * reset wrt. other engines, just call timedout_job() directly
> > > > +                */
> > > > +               sched->ops->timedout_job(bad);
> > > > +       } else {
> > > > +               /*
> > > > +                * Otherwise queue it on timeout_wq and wait for it to complete
> > > > +                */
> > > > +               ... more typing needed here ...
> > > > +       }
> > > > +
> > > > +       if (sched->free_guilty) {
> > > > +               sched->ops->free_job(bad);
> > > > +               sched->free_guilty = false;
> > > > +       }
> > > > +}
> > > > +
> > > >  /**
> > > >   * drm_sched_main - main scheduler thread
> > > >   *
> > > > @@ -787,6 +826,7 @@ static int drm_sched_main(void *param)
> > > >
> > > >                 wait_event_interruptible(sched->wake_up_worker,
> > > >                                          (cleanup_job =
> > > > drm_sched_get_cleanup_job(sched)) ||
> > > > +                                        handle_timeout(sched) ||
> > > >                                          (!drm_sched_blocked(sched) &&
> > > >                                           (entity =
> > > > drm_sched_select_entity(sched))) ||
> > > >                                          kthread_should_stop());
> > > > ---------------------
> > > >
> > > > drm_sched_fault() and the sw timeout handler would just set
> > > > sched->has_timeout and kick sched->wake_up_worker.
> > > >
> > > > And since we handle the timeout case after
> > > > drm_sched_get_cleanup_job(), we know that all of the successfully
> > > > completed jobs have already been popped off the list, and won't be
> > > > unfairly maligned.
> > > >
> > > > BR,
> > > > -R
> > > >
> > > > On Tue, Aug 31, 2021 at 6:29 PM Liu, Monk <Monk.Liu@amd.com> wrote:
> > > > >
> > > > > [AMD Official Use Only]
> > > > >
> > > > > Okay, I will reprepare this patch
> > > > >
> > > > > Thanks
> > > > >
> > > > > ------------------------------------------
> > > > > Monk Liu | Cloud-GPU Core team
> > > > > ------------------------------------------
> > > > >
> > > > > -----Original Message-----
> > > > > From: Daniel Vetter <daniel@ffwll.ch>
> > > > > Sent: Tuesday, August 31, 2021 9:02 PM
> > > > > To: Liu, Monk <Monk.Liu@amd.com>
> > > > > Cc: amd-gfx@lists.freedesktop.org; dri-devel@lists.freedesktop.org; Chen, Jingwen <Jingwen.Chen@amd.com>
> > > > > Subject: Re: [PATCH 2/2] drm/sched: serialize job_timeout and scheduler
> > > > >
> > > > > On Tue, Aug 31, 2021 at 02:59:02PM +0200, Daniel Vetter wrote:
> > > > > > Can we please have some actual commit message here, with detailed
> > > > > > explanation of the race/bug/whatever, how you fix it and why this is
> > > > > > the best option?
> > > > > >
> > > > > > On Tue, Aug 31, 2021 at 06:35:39PM +0800, Monk Liu wrote:
> > > > > > > tested-by: jingwen chen <jingwen.chen@amd.com>
> > > > > > > Signed-off-by: Monk Liu <Monk.Liu@amd.com>
> > > > > > > Signed-off-by: jingwen chen <jingwen.chen@amd.com>
> > > > > > > ---
> > > > > > >  drivers/gpu/drm/scheduler/sched_main.c | 24
> > > > > > > ++++--------------------
> > > > > > >  1 file changed, 4 insertions(+), 20 deletions(-)
> > > > > > >
> > > > > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > index ecf8140..894fdb24 100644
> > > > > > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > @@ -319,19 +319,17 @@ static void drm_sched_job_timedout(struct work_struct *work)
> > > > > > >     sched = container_of(work, struct drm_gpu_scheduler,
> > > > > > > work_tdr.work);
> > > > > > >
> > > > > > >     /* Protects against concurrent deletion in
> > > > > > > drm_sched_get_cleanup_job */
> > > > > > > +   if (!__kthread_should_park(sched->thread))
> > > > > >
> > > > > > This is a __ function, i.e. considered internal, and it's lockless
> > > > > > atomic, i.e. unordered. And you're not explaining why this works.
> > > > > >
> > > > > > Iow it's probably buggy, and just unconditionally parking the
> > > > > > kthread is probably the right thing to do. If it's not the right thing
> > > > > > to do, there's a bug here for sure.
> > > > >
> > > > > Also why don't we reuse the function drivers already have to stop a scheduler thread? We seem to have two kthread_park now, that's probably one too much.
> > > > > -Daniel
> > > > >
> > > > > > > +           kthread_park(sched->thread);
> > > > > > > +
> > > > > > >     spin_lock(&sched->job_list_lock);
> > > > > > >     job = list_first_entry_or_null(&sched->pending_list,
> > > > > > >                                    struct drm_sched_job, list);
> > > > > > >
> > > > > > >     if (job) {
> > > > > > > -           /*
> > > > > > > -            * Remove the bad job so it cannot be freed by concurrent
> > > > > > > -            * drm_sched_cleanup_jobs. It will be reinserted back after sched->thread
> > > > > > > -            * is parked at which point it's safe.
> > > > > > > -            */
> > > > > > > -           list_del_init(&job->list);
> > > > > > >             spin_unlock(&sched->job_list_lock);
> > > > > > >
> > > > > > > +           /* vendor's timeout_job should call drm_sched_start() */
> > > > > > >             status = job->sched->ops->timedout_job(job);
> > > > > > >
> > > > > > >             /*
> > > > > > > @@ -393,20 +391,6 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
> > > > > > >     kthread_park(sched->thread);
> > > > > > >
> > > > > > >     /*
> > > > > > > -    * Reinsert back the bad job here - now it's safe as
> > > > > > > -    * drm_sched_get_cleanup_job cannot race against us and release the
> > > > > > > -    * bad job at this point - we parked (waited for) any in progress
> > > > > > > -    * (earlier) cleanups and drm_sched_get_cleanup_job will not be called
> > > > > > > -    * now until the scheduler thread is unparked.
> > > > > > > -    */
> > > > > > > -   if (bad && bad->sched == sched)
> > > > > > > -           /*
> > > > > > > -            * Add at the head of the queue to reflect it was the earliest
> > > > > > > -            * job extracted.
> > > > > > > -            */
> > > > > > > -           list_add(&bad->list, &sched->pending_list);
> > > > > > > -
> > > > > > > -   /*
> > > > > > >      * Iterate the job list from later to  earlier one and either deactive
> > > > > > >      * their HW callbacks or remove them from pending list if they already
> > > > > > >      * signaled.
> > > > > > > --
> > > > > > > 2.7.4
> > > > > > >
> > > > > >
> > > > > > --
> > > > > > Daniel Vetter
> > > > > > Software Engineer, Intel Corporation
> > > > > > > http://blog.ffwll.ch
> > > > >
> > > > > --
> > > > > Daniel Vetter
> > > > > Software Engineer, Intel Corporation
> > > > > > http://blog.ffwll.ch
> > >
> > > --
> > > Daniel Vetter
> > > Software Engineer, Intel Corporation
> > > http://blog.ffwll.ch
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

Andrey Grodzovsky Nov. 11, 2021, 3:54 p.m. UTC | #23

On 2021-11-10 8:24 a.m., Daniel Vetter wrote:
> On Wed, Nov 10, 2021 at 11:09:50AM +0100, Christian König wrote:
>> Am 10.11.21 um 10:50 schrieb Daniel Vetter:
>>> On Tue, Nov 09, 2021 at 08:17:01AM -0800, Rob Clark wrote:
>>>> On Tue, Nov 9, 2021 at 1:07 AM Daniel Vetter <daniel@ffwll.ch> wrote:
>>>>> On Mon, Nov 08, 2021 at 03:39:17PM -0800, Rob Clark wrote:
>>>>>> I stumbled across this thread when I ran into the same issue, while
>>>>>> working out how to move drm/msm to use scheduler's retire +
>>>>>> timeout/recovery (and get rid of our own mirror list of in-flight
>>>>>> jobs).  We already have hw error detection enabled, and it can signal
>>>>>> quite fast, so assuming the first job on the list is the guilty job
>>>>>> just won't work.
>>>>>>
>>>>>> But I was considering a slightly different approach to fixing this,
>>>>>> instead just handling it all in drm_sched_main() and getting rid of
>>>>>> the complicated kthread parking gymnastics.  Ie. something along the
>>>>>> lines of:
>>>>> So handling timeouts in the main sched thread won't work as soon as you
>>>>> have multiple engines and reset that impacts across engines:
>>>>>
>>>>> - Nothing is simplified since you still need to stop the other scheduler
>>>>>     threads.
>>>>>
>>>>> - You get deadlocks if 2 schedulers time out at the same time, and both
>>>>>     want to stop the other one.
>>>>>
>>>>> Hence workqueue. Now the rule for the wq is that you can only have one per
>>>>> reset domain, so
>>>>> - single engine you just take the one drm/sched provides
>>>>> - if reset affects all your engines in the chip, then you allocate one in
>>>>>     the drm_device and pass that to all
>>>>> - if you have a complex of gpus all interconnected (e.g. xgmi hive for
>>>>>     amd), then it's one wq for the entire hive
>>>>>
>>>>> _All_ reset related things must be run on that workqueue or things break,
>>>>> which means if you get a hw fault that also needs to be run there. I guess
>>>>> we should either patch drm/sched to check you call that function from the
>>>>> right workqueue, or just handle it internally.
>>>> Hmm, ok.. I guess it would be useful to better document the reasoning
>>>> for the current design, that would have steered me more towards the
>>>> approach taken in this patch.
>>> Maybe this was because you worked on an old kernel? Boris did update the
>>> kerneldoc as part of making gpu reset work for panfrost, which has this
>>> multi-engine reset problem. If that's not yet clear then we need to
>>> improve the docs further.
>>>
>>> AMD's problem is even worse, because their reset domain is the entire xgmi
>>> hive, so multiple pci devices.
>> I've been pushing for quite a while that we get something like an
>> amdgpu_reset_domain structure or similar for this, but we unfortunately
>> don't have that yet.
>>
>> Maybe it would be a good idea to have something like a drm_sched_domain or
>> similar with all the necessary information for the inter scheduler handling.
>>
>> E.g. a workqueue for reset etc...
> Yeah I think as soon as we have more stuff than just the wq then a
> drm_sched_reset_domain sounds good.
>
> But if it's just driver stuff (e.g. the xgmi locking you have in amdgpu
> reset comes to mind) then I think just a driver_reset_domain struct that
> bundles all that stuff up seems good enough.
>
> E.g. on i915 I'm also pondering whether some of the fw requests should be
> processed by the reset wq, to avoid locking headaches, so I don't think
> hiding that work too much in abstractions is a good idea.
> -Daniel


I suggest we keep the drm_sched_reset_domain as a base struct to hold the wq
(and possibly something else shared across drivers in the future) and then
embed it in a derived, driver-specific struct that holds driver-specific
stuff like the XGMI lock you mentioned.
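
A minimal userspace sketch of that layering, only to illustrate the idea.
Nothing here is taken from a posted patch: the struct layouts, the fake_wq
stand-in for a workqueue, the xgmi_lock_held placeholder and the helper
names are all assumptions made for illustration.

```c
#include <stddef.h>

/* Userspace stand-in for the kernel's container_of() macro. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical stand-in for struct workqueue_struct. */
struct fake_wq {
	const char *name;
};

/* Hypothetical base struct: only cross-driver state lives here. */
struct drm_sched_reset_domain {
	struct fake_wq *wq;	/* the one wq all reset work must run on */
};

/* Hypothetical driver struct that embeds the base struct. */
struct amdgpu_reset_domain {
	struct drm_sched_reset_domain base;
	int xgmi_lock_held;	/* placeholder for the driver's XGMI lock */
};

/* Recover the driver-specific struct from a pointer to its base. */
static struct amdgpu_reset_domain *
to_amdgpu_reset_domain(struct drm_sched_reset_domain *base)
{
	return container_of(base, struct amdgpu_reset_domain, base);
}

/* Two domains are serialized against each other only if they share a wq. */
static int reset_domains_share_wq(const struct drm_sched_reset_domain *a,
				  const struct drm_sched_reset_domain *b)
{
	return a->wq == b->wq;
}
```

Common code would only ever see a struct drm_sched_reset_domain pointer,
while the driver's timeout handler would call to_amdgpu_reset_domain() on
it to reach its own fields.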

Andrey


>
>> Regards,
>> Christian.
>>
>>> Also there might be more issues in drm/sched ofc, e.g. I've looked a bit at
>>> ordering/barriers and I'm pretty sure a lot are still missing. Or at least
>>> we should have comments in the code explaining why it all works.
>>> -Daniel
>>>
>>>> BR,
>>>> -R
>>>>
>>>>> -Daniel
>>>>>
>>>>>> ---------------------
>>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>> index 67382621b429..4d6ce775c316 100644
>>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>> @@ -764,6 +764,45 @@ static bool drm_sched_blocked(struct
>>>>>> drm_gpu_scheduler *sched)
>>>>>>           return false;
>>>>>>    }
>>>>>>
>>>>>> +static bool handle_timeout(struct drm_gpu_scheduler *sched)
>>>>>> +{
>>>>>> +       struct drm_sched_job *bad;
>>>>>> +
>>>>>> +       if (!sched->has_timeout)
>>>>>> +               return false;
>>>>>> +
>>>>>> +       sched->has_timeout = false;
>>>>>> +
>>>>>> +       spin_lock(&sched->job_list_lock);
>>>>>> +       bad = list_first_entry_or_null(&sched->pending_list,
>>>>>> +                                      struct drm_sched_job, list);
>>>>>> +
>>>>>> +       if (!bad) {
>>>>>> +               spin_unlock(&sched->job_list_lock);
>>>>>> +               return false;
>>>>>> +       }
>>>>>> +
>>>>>> +       spin_unlock(&sched->job_list_lock);
>>>>>> +
>>>>>> +       if (sched->timeout_wq == system_wq) {
>>>>>> +               /*
>>>>>> +                * If driver has no specific requirements about serializing
>>>>>> +                * reset wrt. other engines, just call timedout_job() directly
>>>>>> +                */
>>>>>> +               sched->ops->timedout_job(bad);
>>>>>> +       } else {
>>>>>> +               /*
>>>>>> +                * Otherwise queue it on timeout_wq and wait for it to complete
>>>>>> +                */
>>>>>> +               ... more typing needed here ...
>>>>>> +       }
>>>>>> +
>>>>>> +       if (sched->free_guilty) {
>>>>>> +               sched->ops->free_job(bad);
>>>>>> +               sched->free_guilty = false;
>>>>>> +       }
>>>>>> +}
>>>>>> +
>>>>>>    /**
>>>>>>     * drm_sched_main - main scheduler thread
>>>>>>     *
>>>>>> @@ -787,6 +826,7 @@ static int drm_sched_main(void *param)
>>>>>>
>>>>>>                   wait_event_interruptible(sched->wake_up_worker,
>>>>>>                                            (cleanup_job =
>>>>>> drm_sched_get_cleanup_job(sched)) ||
>>>>>> +                                        handle_timeout(sched) ||
>>>>>>                                            (!drm_sched_blocked(sched) &&
>>>>>>                                             (entity =
>>>>>> drm_sched_select_entity(sched))) ||
>>>>>>                                            kthread_should_stop());
>>>>>> ---------------------
>>>>>>
>>>>>> drm_sched_fault() and the sw timeout handler would just set
>>>>>> sched->has_timeout and kick sched->wake_up_worker.
>>>>>>
>>>>>> And since we handle the timeout case after
>>>>>> drm_sched_get_cleanup_job(), we know that all of the successfully
>>>>>> completed jobs have already been popped off the list, and won't be
>>>>>> unfairly maligned.
>>>>>>
>>>>>> BR,
>>>>>> -R
>>>>>>
>>>>>> On Tue, Aug 31, 2021 at 6:29 PM Liu, Monk <Monk.Liu@amd.com> wrote:
>>>>>>> [AMD Official Use Only]
>>>>>>>
>>>>>>> Okay, I will reprepare this patch
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>> ------------------------------------------
>>>>>>> Monk Liu | Cloud-GPU Core team
>>>>>>> ------------------------------------------
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Daniel Vetter <daniel@ffwll.ch>
>>>>>>> Sent: Tuesday, August 31, 2021 9:02 PM
>>>>>>> To: Liu, Monk <Monk.Liu@amd.com>
>>>>>>> Cc: amd-gfx@lists.freedesktop.org; dri-devel@lists.freedesktop.org; Chen, Jingwen <Jingwen.Chen@amd.com>
>>>>>>> Subject: Re: [PATCH 2/2] drm/sched: serialize job_timeout and scheduler
>>>>>>>
>>>>>>> On Tue, Aug 31, 2021 at 02:59:02PM +0200, Daniel Vetter wrote:
>>>>>>>> Can we please have some actual commit message here, with detailed
>>>>>>>> explanation of the race/bug/whatever, how you fix it and why this is
>>>>>>>> the best option?
>>>>>>>>
>>>>>>>> On Tue, Aug 31, 2021 at 06:35:39PM +0800, Monk Liu wrote:
>>>>>>>>> tested-by: jingwen chen <jingwen.chen@amd.com>
>>>>>>>>> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
>>>>>>>>> Signed-off-by: jingwen chen <jingwen.chen@amd.com>
>>>>>>>>> ---
>>>>>>>>>    drivers/gpu/drm/scheduler/sched_main.c | 24
>>>>>>>>> ++++--------------------
>>>>>>>>>    1 file changed, 4 insertions(+), 20 deletions(-)
>>>>>>>>>
>>>>>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>> index ecf8140..894fdb24 100644
>>>>>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>> @@ -319,19 +319,17 @@ static void drm_sched_job_timedout(struct work_struct *work)
>>>>>>>>>       sched = container_of(work, struct drm_gpu_scheduler,
>>>>>>>>> work_tdr.work);
>>>>>>>>>
>>>>>>>>>       /* Protects against concurrent deletion in
>>>>>>>>> drm_sched_get_cleanup_job */
>>>>>>>>> +   if (!__kthread_should_park(sched->thread))
>>>>>>>> This is a __ function, i.e. considered internal, and it's lockless
>>>>>>>> atomic, i.e. unordered. And you're not explaining why this works.
>>>>>>>>
>>>>>>>> Iow it's probably buggy, and just unconditionally parking the
>>>>>>>> kthread is probably the right thing to do. If it's not the right thing
>>>>>>>> to do, there's a bug here for sure.
>>>>>>> Also why don't we reuse the function drivers already have to stop a scheduler thread? We seem to have two kthread_park now, that's probably one too much.
>>>>>>> -Daniel
>>>>>>>
>>>>>>>>> +           kthread_park(sched->thread);
>>>>>>>>> +
>>>>>>>>>       spin_lock(&sched->job_list_lock);
>>>>>>>>>       job = list_first_entry_or_null(&sched->pending_list,
>>>>>>>>>                                      struct drm_sched_job, list);
>>>>>>>>>
>>>>>>>>>       if (job) {
>>>>>>>>> -           /*
>>>>>>>>> -            * Remove the bad job so it cannot be freed by concurrent
>>>>>>>>> -            * drm_sched_cleanup_jobs. It will be reinserted back after sched->thread
>>>>>>>>> -            * is parked at which point it's safe.
>>>>>>>>> -            */
>>>>>>>>> -           list_del_init(&job->list);
>>>>>>>>>               spin_unlock(&sched->job_list_lock);
>>>>>>>>>
>>>>>>>>> +           /* vendor's timeout_job should call drm_sched_start() */
>>>>>>>>>               status = job->sched->ops->timedout_job(job);
>>>>>>>>>
>>>>>>>>>               /*
>>>>>>>>> @@ -393,20 +391,6 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>>>>>>>>>       kthread_park(sched->thread);
>>>>>>>>>
>>>>>>>>>       /*
>>>>>>>>> -    * Reinsert back the bad job here - now it's safe as
>>>>>>>>> -    * drm_sched_get_cleanup_job cannot race against us and release the
>>>>>>>>> -    * bad job at this point - we parked (waited for) any in progress
>>>>>>>>> -    * (earlier) cleanups and drm_sched_get_cleanup_job will not be called
>>>>>>>>> -    * now until the scheduler thread is unparked.
>>>>>>>>> -    */
>>>>>>>>> -   if (bad && bad->sched == sched)
>>>>>>>>> -           /*
>>>>>>>>> -            * Add at the head of the queue to reflect it was the earliest
>>>>>>>>> -            * job extracted.
>>>>>>>>> -            */
>>>>>>>>> -           list_add(&bad->list, &sched->pending_list);
>>>>>>>>> -
>>>>>>>>> -   /*
>>>>>>>>>        * Iterate the job list from later to  earlier one and either deactive
>>>>>>>>>        * their HW callbacks or remove them from pending list if they already
>>>>>>>>>        * signaled.
>>>>>>>>> --
>>>>>>>>> 2.7.4
>>>>>>>>>
>>>>>>>> --
>>>>>>>> Daniel Vetter
>>>>>>>> Software Engineer, Intel Corporation
>>>>>>>> http://blog.ffwll.ch
>>>>>>> --
>>>>>>> Daniel Vetter
>>>>>>> Software Engineer, Intel Corporation
>>>>>>> http://blog.ffwll.ch
>>>>> --
>>>>> Daniel Vetter
>>>>> Software Engineer, Intel Corporation
>>>>> http://blog.ffwll.ch
