drm/amdgpu/fence: Fix oops due to non-matching drm_sched init/fini

Message ID 20230130214504.1305042-1-gpiccoli@igalia.com (mailing list archive)
State New, archived

Commit Message

Guilherme G. Piccoli Jan. 30, 2023, 9:45 p.m. UTC
Currently amdgpu calls drm_sched_fini() from the fence driver sw fini
routine - this function is expected to be called only after its
respective init function - drm_sched_init() - was executed successfully.

It happens that we recently faced a driver probe failure on the Steam
Deck, in which drm_sched_fini() was called even though its counterpart
had never been called, causing the following oops:

amdgpu: probe of 0000:04:00.0 failed with error -110
BUG: kernel NULL pointer dereference, address: 0000000000000090
PGD 0 P4D 0
Oops: 0002 [#1] PREEMPT SMP NOPTI
CPU: 0 PID: 609 Comm: systemd-udevd Not tainted 6.2.0-rc3-gpiccoli #338
Hardware name: Valve Jupiter/Jupiter, BIOS F7A0113 11/04/2022
RIP: 0010:drm_sched_fini+0x84/0xa0 [gpu_sched]
[...]
Call Trace:
 <TASK>
 amdgpu_fence_driver_sw_fini+0xc8/0xd0 [amdgpu]
 amdgpu_device_fini_sw+0x2b/0x3b0 [amdgpu]
 amdgpu_driver_release_kms+0x16/0x30 [amdgpu]
 devm_drm_dev_init_release+0x49/0x70
 [...]

To prevent that, check if the drm_sched was properly initialized for a
given ring before calling its fini counterpart.

Notice that ideally we'd use sched.ready for that; this field is set as
the last step of drm_sched_init(). But amdgpu seems to "override" the
meaning of this field - in the above oops, for example, a GFX ring caused
the crash, and its sched.ready field had been set to true in the ring init
routine, regardless of the state of the DRM scheduler. Hence, we ended up
using another sched field.

Fixes: 067f44c8b459 ("drm/amdgpu: avoid over-handle of fence driver fini in s3 test (v2)")
Cc: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Cc: Guchun Chen <guchun.chen@amd.com>
Cc: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
---


Hi folks, first of all thanks in advance for reviews / comments!
Notice that I've used the Fixes tag mostly as a way to bring this
to stable; I couldn't find a good candidate patch that added the
call to drm_sched_fini() without reaching way too old commits... so
067f44c8b459 seems a good candidate - or maybe not?

Now, with regards to sched.ready: I spent a bit of time figuring out
what was happening... would it be feasible to stop using it to mark
some kind of ring status? I think it should be possible to add a flag
to the ring structure for that, and free sched.ready from being
manipulated by the amdgpu driver - what are your thoughts on that?

I could try myself, but first of course I'd like to raise the
"temperature" on this topic and check if somebody is already working
on that.

Cheers,

Guilherme


 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)
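
For reference, the essence of the change (the full diff is quoted in
the replies below) is guarding the drm_sched_fini() call on a field
that only drm_sched_init() assigns:

	if (!ring->no_scheduler && ring->sched.name)
		drm_sched_fini(&ring->sched);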

Comments

Guilherme G. Piccoli Jan. 30, 2023, 9:51 p.m. UTC | #1
+ Luben

(sorry, missed that in the first submission).

On 30/01/2023 18:45, Guilherme G. Piccoli wrote:
> [...]
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> index 00444203220d..e154eb8241fb 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> @@ -618,7 +618,13 @@ void amdgpu_fence_driver_sw_fini(struct amdgpu_device *adev)
>  		if (!ring || !ring->fence_drv.initialized)
>  			continue;
>  
> -		if (!ring->no_scheduler)
> +		/*
> +		 * Notice we check for sched.name since there's some
> +		 * override on the meaning of sched.ready by amdgpu.
> +		 * The natural check would be sched.ready, which is
> +		 * set as drm_sched_init() finishes...
> +		 */
> +		if (!ring->no_scheduler && ring->sched.name)
>  			drm_sched_fini(&ring->sched);
>  
>  		for (j = 0; j <= ring->fence_drv.num_fences_mask; ++j)
Alex Deucher Jan. 30, 2023, 10:30 p.m. UTC | #2
On Mon, Jan 30, 2023 at 4:51 PM Guilherme G. Piccoli
<gpiccoli@igalia.com> wrote:
>
> [...]
> > Now, with regards to sched.ready: I spent a bit of time figuring out
> > what was happening... would it be feasible to stop using it to mark
> > some kind of ring status? I think it should be possible to add a flag
> > to the ring structure for that, and free sched.ready from being
> > manipulated by the amdgpu driver - what are your thoughts on that?

It's been a while, but IIRC, we used to have a ring->ready field in
the driver which at some point got migrated out of the driver into the
GPU scheduler and the driver side code never got cleaned up.  I think
we should probably just drop the driver messing with that field and
leave it up to the drm scheduler.

Alex


Chen, Guchun Jan. 31, 2023, 6:36 a.m. UTC | #3
Hi Piccoli,

I agree with Alex's point; using ring->sched.name for such a check is not a good way. BTW, can you please attach a full dmesg log of the bad case to help me understand more?

Regards,
Guchun

Chen, Guchun Jan. 31, 2023, 9:17 a.m. UTC | #4
Hi Piccoli,

Please ignore my request for a full dmesg log. I can reproduce the issue and get the same failure call stack by returning early with an error code prior to amdgpu_device_init_schedulers.

Regards,
Guchun
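
For illustration, a minimal sketch of the fault injection Guchun
describes - the surrounding code and the inject_probe_failure knob are
assumptions, not actual driver code:

	/* Bail out of device init right before the schedulers are set
	 * up: drm_sched_init() then never runs for any ring, yet the
	 * fence driver sw fini path still executes on teardown.
	 */
	if (inject_probe_failure)	/* assumed debug knob */
		return -ETIMEDOUT;	/* mimics the -110 probe failure */

	r = amdgpu_device_init_schedulers(adev);
	if (r)
		return r;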

Christian König Jan. 31, 2023, 10:58 a.m. UTC | #5
On 31.01.23 at 10:17, Chen, Guchun wrote:
> Hi Piccoli,
>
> Please ignore my request for a full dmesg log. I can reproduce the issue and get the same failure call stack by returning early with an error code prior to amdgpu_device_init_schedulers.
>
> Regards,
> Guchun
>
> [...]
> It's been a while, but IIRC, we used to have a ring->ready field in the driver which at some point got migrated out of the driver into the GPU scheduler and the driver side code never got cleaned up.  I think we should probably just drop the driver messing with that field and leave it up to the drm scheduler.

Using ring->ready is not a good idea since it doesn't specify whether
the scheduler was initialized, but rather whether bringup of the engine
worked.

It's perfectly possible that the scheduler was initialized, but the
IB test then failed later on.

I strongly suggest using scheduler->ops instead, since those are set
during init and are mandatory for correct operation.

Christian.
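
A sketch of how the guard would look under that suggestion (not the
actual v2 submission):

	/* sched.ops is assigned by drm_sched_init() and is mandatory
	 * for correct operation, so a NULL ops pointer reliably means
	 * the scheduler was never initialized for this ring.
	 */
	if (!ring->no_scheduler && ring->sched.ops)
		drm_sched_fini(&ring->sched);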

Chen, Guchun Jan. 31, 2023, 1:58 p.m. UTC | #6
Hi Christian,

Do you think it makes sense to set 'ring->sched.ready' to true in each ring's init, even before drm_sched_init() is executed in amdgpu_device_init_schedulers? After all, 'ready' is a member of the GPU scheduler structure.

Regards,
Guchun

Guilherme G. Piccoli Jan. 31, 2023, 2:32 p.m. UTC | #7
On 31/01/2023 10:58, Chen, Guchun wrote:
> Hi Christian,
> 
> Do you think it makes sense to set 'ring->sched.ready' to true in each ring's init, even before drm_sched_init() is executed in amdgpu_device_init_schedulers? After all, 'ready' is a member of the GPU scheduler structure.
> 
> Regards,
> Guchun
> 

Hi folks, thanks a lot for the feedback so far, much appreciated!

I'm feeling a bit confused, especially since there seem to be 2
orthogonal (yet related) topics being discussed; let me try to summarize
my understanding so we can then discuss further:

(a) The first problem is the one addressed in the patch - how to prevent
drm_sched_fini() from being called if drm_sched_init() wasn't called?

I've proposed sched.name; it seems Christian prefers sched.ops, correct?


(b) We can't use sched.ready, which would make sense... but amdgpu
overrides its meaning; the driver manipulates this value for its own
purpose of tracking ring init, or something like that.

This is the tangential topic: what should we do here? My understanding
of Alex's message is that we could have a "ready" field in the ring
structure and stop messing with sched.ready - does it make sense, Alex?

Guchun / Christian, does it also make sense for you?


Regarding (a), I could re-submit with s/sched.name/sched.ops, no
biggies - to be fair, I tested both before sending... I just chose name,
but any field that is properly initialized on drm_sched_init() would work.

Thanks,


Guilherme
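
A hypothetical sketch of option (b) - the hw_ready field and the call
site are illustrative assumptions, not actual driver code:

	struct amdgpu_ring {
		/* ... existing members ... */
		bool	hw_ready;	/* engine bringup worked; driver-owned */
	};

	/* Ring/IB test paths would then flip the driver-owned flag and
	 * leave sched.ready entirely to the DRM scheduler:
	 */
	ring->hw_ready = true;	/* instead of ring->sched.ready = true */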
Alex Deucher Jan. 31, 2023, 5:51 p.m. UTC | #8
On Tue, Jan 31, 2023 at 5:58 AM Christian König
<christian.koenig@amd.com> wrote:
>
> [...]
>
> Using ring->ready is not a good idea since it doesn't specify whether
> the scheduler was initialized, but rather whether bringup of the
> engine worked.

ring->ready no longer exists.  It was moved into the scheduler and the
driver side never got cleaned up.  Also, we set ring->sched.ready =
true once the rings are set up, but before we actually test them, so
it's currently a proxy for the ring being ready to test/use.

Alex
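
Roughly the ordering Alex describes, as a sketch (ring_test() and
ib_test() are placeholder names, not the real callbacks):

	amdgpu_ring_init(...);		/* ring set up                    */
	ring->sched.ready = true;	/* already marked "ready" here... */
	r = ring_test(ring);		/* ...before the ring test runs   */
	r = ib_test(ring);		/* and the IB test may still fail */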

Alex Deucher Jan. 31, 2023, 5:52 p.m. UTC | #9
On Tue, Jan 31, 2023 at 9:32 AM Guilherme G. Piccoli
<gpiccoli@igalia.com> wrote:
>
> [...]
>
> (b) We can't use sched.ready, which would make sense... but amdgpu
> overrides its meaning; the driver manipulates this value for its own
> purpose of tracking ring init, or something like that.
>
> This is the tangential topic: what should we do here? My understanding
> of Alex's message is that we could have a "ready" field in the ring
> structure and stop messing with sched.ready - does it make sense, Alex?

Yes, I think so.  The tricky part will be figuring out which current
sched.ready checks are checking for the scheduler being ready vs.
whether the ring itself is ready.

>
> Guchun / Christian, does it also make sense for you?
>
>
> Regarding (a), I could re-submit with s/sched.name/sched.ops, no
> biggies - to be fair, I tested both before sending... I just chose name,
> but any field that is properly initialized on drm_sched_init() would work.

Yeah, I think ops is fine.  You could even use sched.ready after we
clean up the use of that in the driver.  There are already a bunch of
places where we check sched.ready to check if the scheduler really is
ready.

Alex

Guilherme G. Piccoli Jan. 31, 2023, 6:22 p.m. UTC | #10
On 31/01/2023 14:52, Alex Deucher wrote:
> [...]
>> (b) We can't use sched.ready, which would make sense... but amdgpu
>> overrides its meaning; the driver manipulates this value for its own
>> purpose of tracking ring init, or something like that.
>>
>> This is the tangential topic: what should we do here? My understanding
>> of Alex's message is that we could have a "ready" field in the ring
>> structure and stop messing with sched.ready - does it make sense, Alex?
> 
> Yes, I think so.  The tricky part will be figuring out which current
> sched.ready checks are checking for the scheduler being ready vs.
> whether the ring itself is ready.
> 

Thanks, makes sense!

$ grep -nr "sched.ready" drivers/gpu/drm/amd/ | wc -l
83

Maybe not super tough, I hope heh

>>
>> Guchun / Christian, does it also make sense for you?
>>
>>
>> Regarding (a), I could re-submit with s/sched.name/sched.ops, no
>> biggies - to be fair, I tested both before sending... I just chose name,
>> but any field that is properly initialized on drm_sched_init() would work.
> 
> Yeah, I think ops is fine.  You could even use sched.ready after we
> clean up the use of that in the driver.  There are already a bunch of
> places where we check sched.ready to check if the scheduler really is
> ready.

Hmm... unfortunately, that doesn't work. This was a case in which
sched.ready was set to true in the ring init routine, but the scheduler
wasn't properly initialized. So a different key for comparison is
required... I'll re-submit with sched.ops.

After a potential rework of the driver to get rid of the sched.ready
manipulation, it could then be changed to properly use this flag...
does that make sense to you?

Tnx again for the prompt review!
Cheers,


Guilherme
Alex Deucher Jan. 31, 2023, 6:44 p.m. UTC | #11
On Tue, Jan 31, 2023 at 1:23 PM Guilherme G. Piccoli
<gpiccoli@igalia.com> wrote:
>
> [...]
>
> After a potential rework of the driver to get rid of the sched.ready
> manipulation, it could then be changed to properly use this flag...
> does that make sense to you?

Yeah, sounds good.

Thanks!

Alex

Christian König Feb. 1, 2023, 7:18 a.m. UTC | #12
Hi Guchun,

no, that doesn't make any sense at all.

The ready flag indicates that the scheduler is fully prepared for hw
submissions from userspace and is unrelated to the initialization
status. It's set to true after IB testing was successful, and set to
false only when a GPU reset fails and we can't get the hardware to work
any more.

Please use sched.ops instead as I suggested before.

Regards,
Christian.

On 31.01.23 at 14:58, Chen, Guchun wrote:
> Hi Christian,
>
> Do you think it makes sense to set 'ring->sched.ready' to true in each ring's init, even before drm_sched_init() is executed in amdgpu_device_init_schedulers? After all, 'ready' is a member of the GPU scheduler structure.
>
> Regards,
> Guchun
>
> -----Original Message-----
> From: Koenig, Christian <Christian.Koenig@amd.com>
> Sent: Tuesday, January 31, 2023 6:59 PM
> To: Chen, Guchun <Guchun.Chen@amd.com>; Alex Deucher <alexdeucher@gmail.com>; Guilherme G. Piccoli <gpiccoli@igalia.com>
> Cc: amd-gfx@lists.freedesktop.org; kernel@gpiccoli.net; Pan, Xinhui <Xinhui.Pan@amd.com>; dri-devel@lists.freedesktop.org; Tuikov, Luben <Luben.Tuikov@amd.com>; Limonciello, Mario <Mario.Limonciello@amd.com>; kernel-dev@igalia.com; Deucher, Alexander <Alexander.Deucher@amd.com>
> Subject: Re: [PATCH] drm/amdgpu/fence: Fix oops due to non-matching drm_sched init/fini
>
> Am 31.01.23 um 10:17 schrieb Chen, Guchun:
>> Hi Piccoli,
>>
>> Please ignore my request of full dmesg log. I can reproduce the issue and get the same failure callstack by returning early with an error code prior to amdgpu_device_init_schedulers.
>>
>> Regards,
>> Guchun
>>
>> -----Original Message-----
>> From: Chen, Guchun
>> Sent: Tuesday, January 31, 2023 2:37 PM
>> To: Alex Deucher <alexdeucher@gmail.com>; Guilherme G. Piccoli
>> <gpiccoli@igalia.com>
>> Cc: amd-gfx@lists.freedesktop.org; kernel@gpiccoli.net; Pan, Xinhui
>> <Xinhui.Pan@amd.com>; dri-devel@lists.freedesktop.org; Tuikov, Luben
>> [... quoted headers and the original patch mail trimmed; Alex Deucher wrote: ...]
>> It's been a while, but IIRC, we used to have a ring->ready field in the driver which at some point got migrated out of the driver into the GPU scheduler and the driver side code never got cleaned up.  I think we should probably just drop the driver messing with that field and leave it up to the drm scheduler.
> To use ring->ready is not a good idea, since it doesn't indicate whether the scheduler was initialized, but rather whether bringup of the engine worked.
>
> It's perfectly possible that the scheduler was initialized, but then the IB test failed later on.
>
> I strongly suggest using scheduler->ops instead, since those are set during init and mandatory for correct operation.
>
> Christian.
>
>> Alex
>>
>>
>>>> [... remainder of the original patch mail trimmed ...]
Alex Deucher Feb. 1, 2023, 2:24 p.m. UTC | #13
On Wed, Feb 1, 2023 at 2:18 AM Christian König
<ckoenig.leichtzumerken@gmail.com> wrote:
>
> Hi Guchun,
>
> no, that doesn't make any sense at all.
>
> The ready flag indicates that the scheduler is fully prepared for hw
> submissions from userspace and is unrelated to the initialization
> status. It's set to true after IB testing was successful, and set to
> false only when a GPU reset fails and we can't get the hardware to work
> any more.

That might have been the original intention, but right now sched.ready
gets set to true when we finish setting up the ring, before we do any
ring or IB tests.

Alex
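
To make the ordering concrete, the two writers of the flag behave roughly as
sketched below (reconstructed from this thread, not verbatim upstream code;
exact placement varies by kernel version):

        /* DRM scheduler core: on success, readiness is the last thing set. */
        int drm_sched_init(struct drm_gpu_scheduler *sched /* , ... */)
        {
                /* set up run queues, spawn the scheduler thread, etc. */
                sched->ready = true;    /* final step of a successful init */
                return 0;
        }

        /* amdgpu ring setup: the same flag is written again once the ring is
         * set up, before any ring/IB test has run (the "override" under
         * discussion). */
        ring->sched.ready = true;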

> [... rest of the quoted thread trimmed ...]
Christian König Feb. 1, 2023, 2:35 p.m. UTC | #14
Am 01.02.23 um 15:24 schrieb Alex Deucher:
> On Wed, Feb 1, 2023 at 2:18 AM Christian König
> <ckoenig.leichtzumerken@gmail.com> wrote:
>> Hi Guchun,
>>
>> no, that doesn't make any sense at all.
>>
>> The ready flag indicates that the scheduler is fully prepared for hw
>> submissions from userspace and is unrelated to the initialization
>> status. It's set to true after IB testing was successful, and set to
>> false only when a GPU reset fails and we can't get the hardware to work
>> any more.
> That might have been the original intention, but right now sched.ready
> gets set to true when we finish setting up the ring, before we do any
> ring or IB tests.

WHAT? Please not again.

I'm really tired of fixing this over and over again: the meaning of
ring->sched.ready is to block submissions when a GPU reset fails, AND
NOTHING ELSE!

The problem is that people seem to abuse it, and I have to fix it for the
fourth or fifth time now.

I'm going to send out patches,
Christian.
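
The life cycle being defended here would then be roughly the following (a
sketch of the described semantics, not upstream code):

        /* On an unrecoverable failure in the GPU reset path: */
        ring->sched.ready = false;      /* block further hw submissions */

        /* ...which the submission path then honors: */
        if (!ring->sched.ready)
                return -ENODEV;         /* hardware could not be recovered */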

>
> Alex
>
>> Please use sched.ops instead as I suggested before.
>>
>> Regards,
>> Christian.
>>
>> [... rest of the quoted thread trimmed ...]
Luben Tuikov Feb. 1, 2023, 4:21 p.m. UTC | #15
Hi Guilherme,

Since sched->ready is set to false both directly, in amdgpu_ring_fini(), and indirectly,
in amdgpu_fence_driver_sw_fini() (which calls drm_sched_fini(), which sets it to false),
we seem to have two competing policies:
"set ready to false to show that _fini() was called" and "set ready to false to disable IB submissions".

To that effect, your patch is generally correct, as it handles the case of an early failure
and unroll from (indirectly) amdgpu_device_init_schedulers().

Please resubmit your patch using .ops as Christian suggested: .name happens to be sufficient,
but .ops is necessary.

On a side-note: in the future we should probably distinguish
"this ring has an initialized and working scheduler" (looking up at DRM) from
"this ring can take on IBs to send them down to the hardware" (looking down at hardware).
Sched->ready seems to be overloaded with these disparate states, and this is why you need
to use .ops to guard calling drm_sched_fini().

Regards,
Luben
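
The separation proposed in that side-note might eventually look something like
the following (purely illustrative; both field names are hypothetical, not
existing amdgpu fields):

        struct amdgpu_ring {
                /* ... existing members ... */
                bool    sched_initialized;      /* drm_sched_init() succeeded
                                                 * (looking up at DRM) */
                bool    hw_ready;               /* engine passed ring/IB tests
                                                 * (looking down at hardware) */
        };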

On 2023-02-01 09:35, Christian König wrote:
> [... rest of the quoted thread trimmed ...]
Guilherme G. Piccoli Feb. 1, 2023, 4:53 p.m. UTC | #16
On 01/02/2023 13:21, Luben Tuikov wrote:
> [... Luben's message above quoted in full; trimmed ...]

Thanks a lot Luben, makes perfect sense!

Also, thanks to everyone who provided feedback here; very interesting
discussion.

Submitted V2:
https://lore.kernel.org/dri-devel/20230201164814.1353383-1-gpiccoli@igalia.com/
Cheers,


Guilherme

Patch

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index 00444203220d..e154eb8241fb 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -618,7 +618,13 @@  void amdgpu_fence_driver_sw_fini(struct amdgpu_device *adev)
 		if (!ring || !ring->fence_drv.initialized)
 			continue;
 
-		if (!ring->no_scheduler)
+		/*
+		 * Notice we check for sched.name since there's some
+		 * override on the meaning of sched.ready by amdgpu.
+		 * The natural check would be sched.ready, which is
+		 * set as drm_sched_init() finishes...
+		 */
+		if (!ring->no_scheduler && ring->sched.name)
 			drm_sched_fini(&ring->sched);
 
 		for (j = 0; j <= ring->fence_drv.num_fences_mask; ++j)
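
For reference, the v2 linked in the last message presumably switches the guard
from sched.name to sched.ops, as suggested above; a sketch of the expected hunk
(the actual v2 may differ):

-		if (!ring->no_scheduler && ring->sched.name)
+		/*
+		 * sched.ops is set by drm_sched_init() and is mandatory for
+		 * correct operation, so a non-NULL sched.ops reliably tells
+		 * us whether drm_sched_init() ran for this ring.
+		 */
+		if (!ring->no_scheduler && ring->sched.ops)
 			drm_sched_fini(&ring->sched);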