Message ID | 20230201164814.1353383-1-gpiccoli@igalia.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [v2] drm/amdgpu/fence: Fix oops due to non-matching drm_sched init/fini | expand |
Reviewed-by: Luben Tuikov <luben.tuikov@amd.com> Regards, Luben On 2023-02-01 11:48, Guilherme G. Piccoli wrote: > Currently amdgpu calls drm_sched_fini() from the fence driver sw fini > routine - such function is expected to be called only after the > respective init function - drm_sched_init() - was executed successfully. > > Happens that we faced a driver probe failure in the Steam Deck > recently, and the function drm_sched_fini() was called even without > its counter-part had been previously called, causing the following oops: > > amdgpu: probe of 0000:04:00.0 failed with error -110 > BUG: kernel NULL pointer dereference, address: 0000000000000090 > PGD 0 P4D 0 > Oops: 0002 [#1] PREEMPT SMP NOPTI > CPU: 0 PID: 609 Comm: systemd-udevd Not tainted 6.2.0-rc3-gpiccoli #338 > Hardware name: Valve Jupiter/Jupiter, BIOS F7A0113 11/04/2022 > RIP: 0010:drm_sched_fini+0x84/0xa0 [gpu_sched] > [...] > Call Trace: > <TASK> > amdgpu_fence_driver_sw_fini+0xc8/0xd0 [amdgpu] > amdgpu_device_fini_sw+0x2b/0x3b0 [amdgpu] > amdgpu_driver_release_kms+0x16/0x30 [amdgpu] > devm_drm_dev_init_release+0x49/0x70 > [...] > > To prevent that, check if the drm_sched was properly initialized for a > given ring before calling its fini counter-part. > > Notice ideally we'd use sched.ready for that; such field is set as the latest > thing on drm_sched_init(). But amdgpu seems to "override" the meaning of such > field - in the above oops for example, it was a GFX ring causing the crash, and > the sched.ready field was set to true in the ring init routine, regardless of > the state of the DRM scheduler. Hence, we ended-up using sched.ops as per > Christian's suggestion [0]. > > [0] https://lore.kernel.org/amd-gfx/984ee981-2906-0eaf-ccec-9f80975cb136@amd.com/ > > Fixes: 067f44c8b459 ("drm/amdgpu: avoid over-handle of fence driver fini in s3 test (v2)") > Suggested-by: Christian König <christian.koenig@amd.com> > Cc: Guchun Chen <guchun.chen@amd.com> > Cc: Luben Tuikov <luben.tuikov@amd.com> > Cc: Mario Limonciello <mario.limonciello@amd.com> > Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com> > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 8 +++++++- > 1 file changed, 7 insertions(+), 1 deletion(-) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c > index 00444203220d..3b962cb680a6 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c > @@ -618,7 +618,13 @@ void amdgpu_fence_driver_sw_fini(struct amdgpu_device *adev) > if (!ring || !ring->fence_drv.initialized) > continue; > > - if (!ring->no_scheduler) > + /* > + * Notice we check for sched.ops since there's some > + * override on the meaning of sched.ready by amdgpu. > + * The natural check would be sched.ready, which is > + * set as drm_sched_init() finishes... > + */ > + if (!ring->no_scheduler && ring->sched.ops) > drm_sched_fini(&ring->sched); > > for (j = 0; j <= ring->fence_drv.num_fences_mask; ++j)
Reviewed-by: Guchun Chen <guchun.chen@amd.com> Regards, Guchun -----Original Message----- From: Guilherme G. Piccoli <gpiccoli@igalia.com> Sent: Thursday, February 2, 2023 12:48 AM To: amd-gfx@lists.freedesktop.org Cc: dri-devel@lists.freedesktop.org; Deucher, Alexander <Alexander.Deucher@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; Pan, Xinhui <Xinhui.Pan@amd.com>; kernel@gpiccoli.net; kernel-dev@igalia.com; Guilherme G. Piccoli <gpiccoli@igalia.com>; Chen, Guchun <Guchun.Chen@amd.com>; Tuikov, Luben <Luben.Tuikov@amd.com>; Limonciello, Mario <Mario.Limonciello@amd.com> Subject: [PATCH v2] drm/amdgpu/fence: Fix oops due to non-matching drm_sched init/fini Currently amdgpu calls drm_sched_fini() from the fence driver sw fini routine - such function is expected to be called only after the respective init function - drm_sched_init() - was executed successfully. Happens that we faced a driver probe failure in the Steam Deck recently, and the function drm_sched_fini() was called even without its counter-part had been previously called, causing the following oops: amdgpu: probe of 0000:04:00.0 failed with error -110 BUG: kernel NULL pointer dereference, address: 0000000000000090 PGD 0 P4D 0 Oops: 0002 [#1] PREEMPT SMP NOPTI CPU: 0 PID: 609 Comm: systemd-udevd Not tainted 6.2.0-rc3-gpiccoli #338 Hardware name: Valve Jupiter/Jupiter, BIOS F7A0113 11/04/2022 RIP: 0010:drm_sched_fini+0x84/0xa0 [gpu_sched] [...] Call Trace: <TASK> amdgpu_fence_driver_sw_fini+0xc8/0xd0 [amdgpu] amdgpu_device_fini_sw+0x2b/0x3b0 [amdgpu] amdgpu_driver_release_kms+0x16/0x30 [amdgpu] devm_drm_dev_init_release+0x49/0x70 [...] To prevent that, check if the drm_sched was properly initialized for a given ring before calling its fini counter-part. Notice ideally we'd use sched.ready for that; such field is set as the latest thing on drm_sched_init(). But amdgpu seems to "override" the meaning of such field - in the above oops for example, it was a GFX ring causing the crash, and the sched.ready field was set to true in the ring init routine, regardless of the state of the DRM scheduler. Hence, we ended-up using sched.ops as per Christian's suggestion [0]. [0] https://lore.kernel.org/amd-gfx/984ee981-2906-0eaf-ccec-9f80975cb136@amd.com/ Fixes: 067f44c8b459 ("drm/amdgpu: avoid over-handle of fence driver fini in s3 test (v2)") Suggested-by: Christian König <christian.koenig@amd.com> Cc: Guchun Chen <guchun.chen@amd.com> Cc: Luben Tuikov <luben.tuikov@amd.com> Cc: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com> --- drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c index 00444203220d..3b962cb680a6 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c @@ -618,7 +618,13 @@ void amdgpu_fence_driver_sw_fini(struct amdgpu_device *adev) if (!ring || !ring->fence_drv.initialized) continue; - if (!ring->no_scheduler) + /* + * Notice we check for sched.ops since there's some + * override on the meaning of sched.ready by amdgpu. + * The natural check would be sched.ready, which is + * set as drm_sched_init() finishes... + */ + if (!ring->no_scheduler && ring->sched.ops) drm_sched_fini(&ring->sched); for (j = 0; j <= ring->fence_drv.num_fences_mask; ++j) -- 2.39.0
Am 01.02.23 um 17:48 schrieb Guilherme G. Piccoli: > Currently amdgpu calls drm_sched_fini() from the fence driver sw fini > routine - such function is expected to be called only after the > respective init function - drm_sched_init() - was executed successfully. > > Happens that we faced a driver probe failure in the Steam Deck > recently, and the function drm_sched_fini() was called even without > its counter-part had been previously called, causing the following oops: > > amdgpu: probe of 0000:04:00.0 failed with error -110 > BUG: kernel NULL pointer dereference, address: 0000000000000090 > PGD 0 P4D 0 > Oops: 0002 [#1] PREEMPT SMP NOPTI > CPU: 0 PID: 609 Comm: systemd-udevd Not tainted 6.2.0-rc3-gpiccoli #338 > Hardware name: Valve Jupiter/Jupiter, BIOS F7A0113 11/04/2022 > RIP: 0010:drm_sched_fini+0x84/0xa0 [gpu_sched] > [...] > Call Trace: > <TASK> > amdgpu_fence_driver_sw_fini+0xc8/0xd0 [amdgpu] > amdgpu_device_fini_sw+0x2b/0x3b0 [amdgpu] > amdgpu_driver_release_kms+0x16/0x30 [amdgpu] > devm_drm_dev_init_release+0x49/0x70 > [...] > > To prevent that, check if the drm_sched was properly initialized for a > given ring before calling its fini counter-part. > > Notice ideally we'd use sched.ready for that; such field is set as the latest > thing on drm_sched_init(). But amdgpu seems to "override" the meaning of such > field - in the above oops for example, it was a GFX ring causing the crash, and > the sched.ready field was set to true in the ring init routine, regardless of > the state of the DRM scheduler. Hence, we ended-up using sched.ops as per > Christian's suggestion [0]. > > [0] https://lore.kernel.org/amd-gfx/984ee981-2906-0eaf-ccec-9f80975cb136@amd.com/ > > Fixes: 067f44c8b459 ("drm/amdgpu: avoid over-handle of fence driver fini in s3 test (v2)") > Suggested-by: Christian König <christian.koenig@amd.com> > Cc: Guchun Chen <guchun.chen@amd.com> > Cc: Luben Tuikov <luben.tuikov@amd.com> > Cc: Mario Limonciello <mario.limonciello@amd.com> > Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com> > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 8 +++++++- > 1 file changed, 7 insertions(+), 1 deletion(-) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c > index 00444203220d..3b962cb680a6 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c > @@ -618,7 +618,13 @@ void amdgpu_fence_driver_sw_fini(struct amdgpu_device *adev) > if (!ring || !ring->fence_drv.initialized) > continue; > > - if (!ring->no_scheduler) > + /* > + * Notice we check for sched.ops since there's some > + * override on the meaning of sched.ready by amdgpu. > + * The natural check would be sched.ready, which is > + * set as drm_sched_init() finishes... > + */ > + if (!ring->no_scheduler && ring->sched.ops) > drm_sched_fini(&ring->sched); I think we should drop the check for no_scheduler here and just call drm_sched_fini() when the scheduler instance was initialized before. Background is that I've found a couple of places where we potentially set no_scheduler to false after the scheduler was already initialized. This is then probably leading to a memory leak or worth. Regards, Christian. > > for (j = 0; j <= ring->fence_drv.num_fences_mask; ++j)
On 02/02/2023 08:58, Christian König wrote: > [...] >> + if (!ring->no_scheduler && ring->sched.ops) >> drm_sched_fini(&ring->sched); > > I think we should drop the check for no_scheduler here and just call > drm_sched_fini() when the scheduler instance was initialized before. > > Background is that I've found a couple of places where we potentially > set no_scheduler to false after the scheduler was already initialized. > > This is then probably leading to a memory leak or worth. > > Regards, > Christian. Thanks Christian, nice finding! And thanks for the reviews Guchun / Luben =) I just submitted a V3 [0], but didn't add the review tags from Guchun / Luben, since the patch changed. Cheers, Guilherme [0] https://lore.kernel.org/dri-devel/20230202134856.1382169-1-gpiccoli@igalia.com/
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c index 00444203220d..3b962cb680a6 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c @@ -618,7 +618,13 @@ void amdgpu_fence_driver_sw_fini(struct amdgpu_device *adev) if (!ring || !ring->fence_drv.initialized) continue; - if (!ring->no_scheduler) + /* + * Notice we check for sched.ops since there's some + * override on the meaning of sched.ready by amdgpu. + * The natural check would be sched.ready, which is + * set as drm_sched_init() finishes... + */ + if (!ring->no_scheduler && ring->sched.ops) drm_sched_fini(&ring->sched); for (j = 0; j <= ring->fence_drv.num_fences_mask; ++j)
Currently amdgpu calls drm_sched_fini() from the fence driver sw fini routine - such function is expected to be called only after the respective init function - drm_sched_init() - was executed successfully. Happens that we faced a driver probe failure in the Steam Deck recently, and the function drm_sched_fini() was called even without its counter-part had been previously called, causing the following oops: amdgpu: probe of 0000:04:00.0 failed with error -110 BUG: kernel NULL pointer dereference, address: 0000000000000090 PGD 0 P4D 0 Oops: 0002 [#1] PREEMPT SMP NOPTI CPU: 0 PID: 609 Comm: systemd-udevd Not tainted 6.2.0-rc3-gpiccoli #338 Hardware name: Valve Jupiter/Jupiter, BIOS F7A0113 11/04/2022 RIP: 0010:drm_sched_fini+0x84/0xa0 [gpu_sched] [...] Call Trace: <TASK> amdgpu_fence_driver_sw_fini+0xc8/0xd0 [amdgpu] amdgpu_device_fini_sw+0x2b/0x3b0 [amdgpu] amdgpu_driver_release_kms+0x16/0x30 [amdgpu] devm_drm_dev_init_release+0x49/0x70 [...] To prevent that, check if the drm_sched was properly initialized for a given ring before calling its fini counter-part. Notice ideally we'd use sched.ready for that; such field is set as the latest thing on drm_sched_init(). But amdgpu seems to "override" the meaning of such field - in the above oops for example, it was a GFX ring causing the crash, and the sched.ready field was set to true in the ring init routine, regardless of the state of the DRM scheduler. Hence, we ended-up using sched.ops as per Christian's suggestion [0]. [0] https://lore.kernel.org/amd-gfx/984ee981-2906-0eaf-ccec-9f80975cb136@amd.com/ Fixes: 067f44c8b459 ("drm/amdgpu: avoid over-handle of fence driver fini in s3 test (v2)") Suggested-by: Christian König <christian.koenig@amd.com> Cc: Guchun Chen <guchun.chen@amd.com> Cc: Luben Tuikov <luben.tuikov@amd.com> Cc: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com> --- drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-)