
drm/sched: Fix amdgpu crash upon suspend/resume

Message ID 20250107140240.325899-1-philipp.reisner@linbit.com (mailing list archive)
State: New
Series: drm/sched: Fix amdgpu crash upon suspend/resume

Commit Message

Philipp Reisner Jan. 7, 2025, 2:02 p.m. UTC
The following OOPS plagues me on about every 10th suspend and resume:

[160640.791304] BUG: kernel NULL pointer dereference, address: 0000000000000008
[160640.791309] #PF: supervisor read access in kernel mode
[160640.791311] #PF: error_code(0x0000) - not-present page
[160640.791313] PGD 0 P4D 0
[160640.791316] Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
[160640.791320] CPU: 12 UID: 1001 PID: 648526 Comm: kscreenloc:cs0 Tainted: G           OE      6.11.7-300.fc41.x86_64 #1
[160640.791324] Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
[160640.791325] Hardware name: Micro-Star International Co., Ltd. MS-7A38/B450M PRO-VDH MAX (MS-7A38), BIOS B.B0 02/03/2021
[160640.791327] RIP: 0010:drm_sched_job_arm+0x23/0x60 [gpu_sched]
[160640.791337] Code: 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 55 53 48 8b 6f 60 48 85 ed 74 3f 48 89 fb 48 89 ef e8 31 39 00 00 48 8b 45 10 <48> 8b 50 08 48 89 53 18 8b 45 24 89 43 5c b8 01 00 00 00 f0 48 0f
[160640.791340] RSP: 0018:ffffb2ef5e6cb9b8 EFLAGS: 00010206
[160640.791342] RAX: 0000000000000000 RBX: ffff9d804cc62800 RCX: ffff9d784020f0d0
[160640.791344] RDX: 0000000000000000 RSI: ffff9d784d3b9cd0 RDI: ffff9d784020f638
[160640.791345] RBP: ffff9d784020f610 R08: ffff9d78414e4268 R09: 2072656c75646568
[160640.791346] R10: 686373205d6d7264 R11: 632072656c756465 R12: 0000000000000000
[160640.791347] R13: 0000000000000001 R14: ffffb2ef5e6cba38 R15: 0000000000000000
[160640.791349] FS:  00007f8f30aca6c0(0000) GS:ffff9d873ec00000(0000) knlGS:0000000000000000
[160640.791351] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[160640.791352] CR2: 0000000000000008 CR3: 000000069de82000 CR4: 0000000000350ef0
[160640.791354] Call Trace:
[160640.791357]  <TASK>
[160640.791360]  ? __die_body.cold+0x19/0x27
[160640.791367]  ? page_fault_oops+0x15a/0x2f0
[160640.791372]  ? exc_page_fault+0x7e/0x180
[160640.791376]  ? asm_exc_page_fault+0x26/0x30
[160640.791380]  ? drm_sched_job_arm+0x23/0x60 [gpu_sched]
[160640.791384]  ? drm_sched_job_arm+0x1f/0x60 [gpu_sched]
[160640.791390]  amdgpu_cs_ioctl+0x170c/0x1e40 [amdgpu]
[160640.792011]  ? __pfx_amdgpu_cs_ioctl+0x10/0x10 [amdgpu]
[160640.792341]  drm_ioctl_kernel+0xb0/0x100
[160640.792346]  drm_ioctl+0x28b/0x540
[160640.792349]  ? __pfx_amdgpu_cs_ioctl+0x10/0x10 [amdgpu]
[160640.792673]  amdgpu_drm_ioctl+0x4e/0x90 [amdgpu]
[160640.792994]  __x64_sys_ioctl+0x94/0xd0
[160640.792999]  do_syscall_64+0x82/0x160
[160640.793006]  ? __count_memcg_events+0x75/0x130
[160640.793009]  ? count_memcg_events.constprop.0+0x1a/0x30
[160640.793014]  ? handle_mm_fault+0x21b/0x330
[160640.793016]  ? do_user_addr_fault+0x55a/0x7b0
[160640.793020]  ? exc_page_fault+0x7e/0x180
[160640.793023]  entry_SYSCALL_64_after_hwframe+0x76/0x7e

The OOPS happens because the rq member of entity is NULL in
drm_sched_job_arm() after the call to drm_sched_entity_select_rq().

In drm_sched_entity_select_rq(), the code considers that
drm_sched_pick_best() might return NULL. When it does, the code assigns
NULL to entity->rq even if it had a non-NULL value before.

drm_sched_job_arm() does not deal with entities whose rq is NULL.

Fix this by leaving the entity on the engine it was on instead of
assigning NULL to its run queue member.

Link: https://retrace.fedoraproject.org/faf/reports/1038619/
Link: https://gitlab.freedesktop.org/drm/amd/-/issues/3746
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
---
 drivers/gpu/drm/scheduler/sched_entity.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

Comments

Christian König Jan. 8, 2025, 8:19 a.m. UTC | #1
Am 07.01.25 um 16:21 schrieb Philipp Reisner:
> [...]
>>> The OOPS happens because the rq member of entity is NULL in
>>> drm_sched_job_arm() after the call to drm_sched_entity_select_rq().
>>>
>>> In drm_sched_entity_select_rq(), the code considers that
>>> drm_sched_pick_best() might return NULL. When it does, the code assigns
>>> NULL to entity->rq even if it had a non-NULL value before.
>>>
>>> drm_sched_job_arm() does not deal with entities whose rq is NULL.
>>>
>>> Fix this by leaving the entity on the engine it was on instead of
>>> assigning NULL to its run queue member.
>> Well that is clearly not the correct approach to fixing this. So clearly
>> a NAK from my side.
>>
>> The real question is why is amdgpu_cs_ioctl() called when all of
>> userspace should be frozen?
>>
>> Regards,
>> Christian.
>>
> Could the OOPS happen at resume time? Might it be that the kernel
> activates user-space before all components of the GPU have finished
> their wakeup?
>
> Maybe drm_sched_pick_best() returns NULL because no scheduler is ready yet?

Yeah, that is exactly what I meant. It looks like either the suspend or 
the resume order is somehow messed up.

In other words, either some application tries to submit GPU work when it 
should already have been stopped, or it tries to submit GPU work before 
it is started.

> Apart from whether amdgpu_cs_ioctl() should run at this point, I still think the
> suggested change improves the code. drm_sched_pick_best() can return NULL.
> drm_sched_entity_select_rq() can handle the NULL (partially).
>
> drm_sched_job_arm() crashes on an entity that has rq set to NULL.

Which is actually not the worst outcome :)

With your patch applied we no longer crash immediately in the 
submission path, but the whole system could later deadlock because 
the core memory management waits for a GPU submission that never completes.

That is an even worse situation, because you can then no longer pinpoint 
where the problem is coming from.

> The handling of NULL values is half-baked.
>
> In my opinion, you should define whether drm_sched_pick_best() may put a NULL into
> rq. If your answer is yes, it might put a NULL there; then there should be a
> BUG_ON(!entity->rq) after the invocation of drm_sched_entity_select_rq().
> If your answer is no, the BUG_ON() should be in drm_sched_pick_best().

Yeah good point.

We might not want a BUG_ON(); that is only justified when it prevents 
further damage (e.g. random data corruption or similar).

I suggest using a WARN(!sched, "Submission without activated scheduler!"). 
This way the system at least has a chance of survival should the 
scheduler become ready later on.

On the other hand, the BUG_ON() or the NULL pointer deref should only 
kill the application thread which is submitting something before the 
driver is resumed. So that might help to pinpoint where the actual 
issue is.

Regards,
Christian.

>
> That helps guys with zero domain knowledge, like me, to figure out how
> this is all supposed to work.
>
> best regards,
>   Philipp
Alex Deucher Jan. 8, 2025, 2:26 p.m. UTC | #2
On Tue, Jan 7, 2025 at 9:09 AM Christian König <christian.koenig@amd.com> wrote:
>
> Am 07.01.25 um 15:02 schrieb Philipp Reisner:
> > The following OOPS plagues me on about every 10th suspend and resume:
> >
> > [...]
> >
> > The OOPS happens because the rq member of entity is NULL in
> > drm_sched_job_arm() after the call to drm_sched_entity_select_rq().
> >
> > In drm_sched_entity_select_rq(), the code considers that
> > drm_sched_pick_best() might return NULL. When it does, the code assigns
> > NULL to entity->rq even if it had a non-NULL value before.
> >
> > drm_sched_job_arm() does not deal with entities whose rq is NULL.
> >
> > Fix this by leaving the entity on the engine it was on instead of
> > assigning NULL to its run queue member.
>
> Well that is clearly not the correct approach to fixing this. So clearly
> a NAK from my side.
>
> The real question is why is amdgpu_cs_ioctl() called when all of
> userspace should be frozen?
>

Could this be due to amdgpu setting sched->ready when the rings are
finished initializing from long ago rather than when the scheduler has
been armed?

Alex


> Regards,
> Christian.
>
> >
> > Link: https://retrace.fedoraproject.org/faf/reports/1038619/
> > Link: https://gitlab.freedesktop.org/drm/amd/-/issues/3746
> > Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
> > ---
> >   drivers/gpu/drm/scheduler/sched_entity.c | 10 ++++++----
> >   1 file changed, 6 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
> > index a75eede8bf8d..495bc087588b 100644
> > --- a/drivers/gpu/drm/scheduler/sched_entity.c
> > +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> > @@ -557,10 +557,12 @@ void drm_sched_entity_select_rq(struct drm_sched_entity *entity)
> >
> >       spin_lock(&entity->rq_lock);
> >       sched = drm_sched_pick_best(entity->sched_list, entity->num_sched_list);
> > -     rq = sched ? sched->sched_rq[entity->priority] : NULL;
> > -     if (rq != entity->rq) {
> > -             drm_sched_rq_remove_entity(entity->rq, entity);
> > -             entity->rq = rq;
> > +     if (sched) {
> > +             rq = sched->sched_rq[entity->priority];
> > +             if (rq != entity->rq) {
> > +                     drm_sched_rq_remove_entity(entity->rq, entity);
> > +                     entity->rq = rq;
> > +             }
> >       }
> >       spin_unlock(&entity->rq_lock);
> >
>
Christian König Jan. 8, 2025, 2:35 p.m. UTC | #3
Am 08.01.25 um 15:26 schrieb Alex Deucher:
> On Tue, Jan 7, 2025 at 9:09 AM Christian König <christian.koenig@amd.com> wrote:
>> Am 07.01.25 um 15:02 schrieb Philipp Reisner:
>>> The following OOPS plagues me on about every 10th suspend and resume:
>>>
>>> [...]
>>>
>>> The OOPS happens because the rq member of entity is NULL in
>>> drm_sched_job_arm() after the call to drm_sched_entity_select_rq().
>>>
>>> In drm_sched_entity_select_rq(), the code considers that
>>> drm_sched_pick_best() might return NULL. When it does, the code assigns
>>> NULL to entity->rq even if it had a non-NULL value before.
>>>
>>> drm_sched_job_arm() does not deal with entities whose rq is NULL.
>>>
>>> Fix this by leaving the entity on the engine it was on instead of
>>> assigning NULL to its run queue member.
>> Well that is clearly not the correct approach to fixing this. So clearly
>> a NAK from my side.
>>
>> The real question is why is amdgpu_cs_ioctl() called when all of
>> userspace should be frozen?
>>
> Could this be due to amdgpu setting sched->ready when the rings are
> finished initializing from long ago rather than when the scheduler has
> been armed?

Yes, and that is absolutely intentional.

Either the driver is not done with its resume yet, or it has already 
started its suspend handler. So the scheduler backends are not started 
and the ready flag is false.

But some userspace application still tries to submit work.

If we were to wait for this work to finish, we would deadlock, so 
crashing on the NULL pointer deref is actually the lesser evil.

Christian.

>
> Alex
>
>
>> Regards,
>> Christian.
>>
>>> [...]

Patch

diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
index a75eede8bf8d..495bc087588b 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -557,10 +557,12 @@  void drm_sched_entity_select_rq(struct drm_sched_entity *entity)
 
 	spin_lock(&entity->rq_lock);
 	sched = drm_sched_pick_best(entity->sched_list, entity->num_sched_list);
-	rq = sched ? sched->sched_rq[entity->priority] : NULL;
-	if (rq != entity->rq) {
-		drm_sched_rq_remove_entity(entity->rq, entity);
-		entity->rq = rq;
+	if (sched) {
+		rq = sched->sched_rq[entity->priority];
+		if (rq != entity->rq) {
+			drm_sched_rq_remove_entity(entity->rq, entity);
+			entity->rq = rq;
+		}
 	}
 	spin_unlock(&entity->rq_lock);