Message ID | 20250415184318.2465197-3-sunil.khatri@amd.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | [v3,1/4] drm: add function drm_file_err to print proc information too | expand |
On 15/04/2025 19:43, Sunil Khatri wrote: > add process and pid information in the userqueue error > logging to make it more useful in resolving the error > by logs. > > Sample log: > [ 42.444297] [drm:amdgpu_userqueue_wait_for_signal [amdgpu]] *ERROR* Timed out waiting for fence f=000000001c74d978 for comm:Xwayland pid:3427 > [ 42.444669] [drm:amdgpu_userqueue_suspend [amdgpu]] *ERROR* Not suspending userqueue, timeout waiting for comm:Xwayland pid:3427 > [ 42.824729] [drm:amdgpu_userqueue_wait_for_signal [amdgpu]] *ERROR* Timed out waiting for fence f=0000000074407d3e for comm:systemd-logind pid:1058 > [ 42.825082] [drm:amdgpu_userqueue_suspend [amdgpu]] *ERROR* Not suspending userqueue, timeout waiting for comm:systemd-logind pid:1058 > > Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_userqueue.c | 14 ++++++++------ > 1 file changed, 8 insertions(+), 6 deletions(-) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_userqueue.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_userqueue.c > index 1867520ba258..05c1ee27a319 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_userqueue.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_userqueue.c > @@ -43,7 +43,7 @@ amdgpu_userqueue_cleanup(struct amdgpu_userq_mgr *uq_mgr, > if (f && !dma_fence_is_signaled(f)) { > ret = dma_fence_wait_timeout(f, true, msecs_to_jiffies(100)); > if (ret <= 0) { > - DRM_ERROR("Timed out waiting for fence f=%p\n", f); > + drm_file_err(uq_mgr->file, "Timed out waiting for fence f=%p\n", f); You decided to leave %p after all? > return; > } > } > @@ -440,7 +440,8 @@ amdgpu_userqueue_resume_all(struct amdgpu_userq_mgr *uq_mgr) > } > > if (ret) > - DRM_ERROR("Failed to map all the queues\n"); > + drm_file_err(uq_mgr->file, "Failed to map all the queue\n"); You lost the plural by accident. I am also not sure "all the queues" makes sense in this context versus "all queues" but it's inconsequential really. > + > return ret; > } > > @@ -598,7 +599,8 @@ amdgpu_userqueue_suspend_all(struct amdgpu_userq_mgr *uq_mgr) > } > > if (ret) > - DRM_ERROR("Couldn't unmap all the queues\n"); > + drm_file_err(uq_mgr->file, "Couldn't unmap all the queues\n"); > + > return ret; > } > > @@ -615,7 +617,7 @@ amdgpu_userqueue_wait_for_signal(struct amdgpu_userq_mgr *uq_mgr) > continue; > ret = dma_fence_wait_timeout(f, true, msecs_to_jiffies(100)); > if (ret <= 0) { > - DRM_ERROR("Timed out waiting for fence f=%p\n", f); > + drm_file_err(uq_mgr->file, "Timed out waiting for fence f=%p\n", f); > return -ETIMEDOUT; > } > } > @@ -634,13 +636,13 @@ amdgpu_userqueue_suspend(struct amdgpu_userq_mgr *uq_mgr, > /* Wait for any pending userqueue fence work to finish */ > ret = amdgpu_userqueue_wait_for_signal(uq_mgr); > if (ret) { > - DRM_ERROR("Not suspending userqueue, timeout waiting for work\n"); > + drm_file_err(uq_mgr->file, "Not suspending userqueue, timeout waiting\n"); > return; > } > > ret = amdgpu_userqueue_suspend_all(uq_mgr); > if (ret) { > - DRM_ERROR("Failed to evict userqueue\n"); > + drm_file_err(uq_mgr->file, "Failed to evict userqueue\n"); > return; It is pre-existing but strikes me as odd that failure to amdgpu_userqueue_suspend_all() logs a failure to *evict* instead of suspend (as the previous log does). Anyway, I did not look at the surrounding code so just thinking out loud. Regards, Tvrtko > } >
On 4/16/2025 12:56 PM, Tvrtko Ursulin wrote: > > On 15/04/2025 19:43, Sunil Khatri wrote: >> add process and pid information in the userqueue error >> logging to make it more useful in resolving the error >> by logs. >> >> Sample log: >> [ 42.444297] [drm:amdgpu_userqueue_wait_for_signal [amdgpu]] >> *ERROR* Timed out waiting for fence f=000000001c74d978 for >> comm:Xwayland pid:3427 >> [ 42.444669] [drm:amdgpu_userqueue_suspend [amdgpu]] *ERROR* Not >> suspending userqueue, timeout waiting for comm:Xwayland pid:3427 >> [ 42.824729] [drm:amdgpu_userqueue_wait_for_signal [amdgpu]] >> *ERROR* Timed out waiting for fence f=0000000074407d3e for >> comm:systemd-logind pid:1058 >> [ 42.825082] [drm:amdgpu_userqueue_suspend [amdgpu]] *ERROR* Not >> suspending userqueue, timeout waiting for comm:systemd-logind pid:1058 >> >> Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> >> --- >> drivers/gpu/drm/amd/amdgpu/amdgpu_userqueue.c | 14 ++++++++------ >> 1 file changed, 8 insertions(+), 6 deletions(-) >> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_userqueue.c >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_userqueue.c >> index 1867520ba258..05c1ee27a319 100644 >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_userqueue.c >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_userqueue.c >> @@ -43,7 +43,7 @@ amdgpu_userqueue_cleanup(struct amdgpu_userq_mgr >> *uq_mgr, >> if (f && !dma_fence_is_signaled(f)) { >> ret = dma_fence_wait_timeout(f, true, msecs_to_jiffies(100)); >> if (ret <= 0) { >> - DRM_ERROR("Timed out waiting for fence f=%p\n", f); >> + drm_file_err(uq_mgr->file, "Timed out waiting for fence >> f=%p\n", f); > > You decided to leave %p after all? Yes we are printing the fence ptr here to see which fence is timing out. Anyways right now intention of this patch is to add additional process information along with existing information like fence here. regards Sunil > >> return; >> } >> } >> @@ -440,7 +440,8 @@ amdgpu_userqueue_resume_all(struct >> amdgpu_userq_mgr *uq_mgr) >> } >> if (ret) >> - DRM_ERROR("Failed to map all the queues\n"); >> + drm_file_err(uq_mgr->file, "Failed to map all the queue\n"); > > You lost the plural by accident. Yes i will add 's'. Noted. > I am also not sure "all the queues" makes sense in this context versus "all queues" but it's inconsequential really. Regards Sunil > Yes it all queues from a uq_mgr. >> + >> return ret; >> } >> @@ -598,7 +599,8 @@ amdgpu_userqueue_suspend_all(struct >> amdgpu_userq_mgr *uq_mgr) >> } >> if (ret) >> - DRM_ERROR("Couldn't unmap all the queues\n"); >> + drm_file_err(uq_mgr->file, "Couldn't unmap all the queues\n"); >> + >> return ret; >> } >> @@ -615,7 +617,7 @@ amdgpu_userqueue_wait_for_signal(struct >> amdgpu_userq_mgr *uq_mgr) >> continue; >> ret = dma_fence_wait_timeout(f, true, msecs_to_jiffies(100)); >> if (ret <= 0) { >> - DRM_ERROR("Timed out waiting for fence f=%p\n", f); >> + drm_file_err(uq_mgr->file, "Timed out waiting for fence >> f=%p\n", f); >> return -ETIMEDOUT; >> } >> } >> @@ -634,13 +636,13 @@ amdgpu_userqueue_suspend(struct >> amdgpu_userq_mgr *uq_mgr, >> /* Wait for any pending userqueue fence work to finish */ >> ret = amdgpu_userqueue_wait_for_signal(uq_mgr); >> if (ret) { >> - DRM_ERROR("Not suspending userqueue, timeout waiting for >> work\n"); >> + drm_file_err(uq_mgr->file, "Not suspending userqueue, >> timeout waiting\n"); >> return; >> } >> ret = amdgpu_userqueue_suspend_all(uq_mgr); >> if (ret) { >> - DRM_ERROR("Failed to evict userqueue\n"); >> + drm_file_err(uq_mgr->file, "Failed to evict userqueue\n"); >> return; > > It is pre-existing but strikes me as odd that failure to > amdgpu_userqueue_suspend_all() logs a failure to *evict* instead of > suspend (as the previous log does). Anyway, I did not look at the > surrounding code so just thinking out loud. Yes suspend failed as all the fences were not evicted and thats why suspend failed. Anyways there are already alex patches which will change this to unmap as a code reorganisation for suspend/resume is in pipeline. regards Sunil > > Regards, > > Tvrtko > >> } >
Hi, Le 16/04/2025 à 12:01, Khatri, Sunil a écrit : > > On 4/16/2025 12:56 PM, Tvrtko Ursulin wrote: >> >> On 15/04/2025 19:43, Sunil Khatri wrote: >>> add process and pid information in the userqueue error >>> logging to make it more useful in resolving the error >>> by logs. >>> >>> Sample log: >>> [ 42.444297] [drm:amdgpu_userqueue_wait_for_signal [amdgpu]] *ERROR* Timed out waiting for >>> fence f=000000001c74d978 for comm:Xwayland pid:3427 >>> [ 42.444669] [drm:amdgpu_userqueue_suspend [amdgpu]] *ERROR* Not suspending userqueue, timeout >>> waiting for comm:Xwayland pid:3427 >>> [ 42.824729] [drm:amdgpu_userqueue_wait_for_signal [amdgpu]] *ERROR* Timed out waiting for >>> fence f=0000000074407d3e for comm:systemd-logind pid:1058 >>> [ 42.825082] [drm:amdgpu_userqueue_suspend [amdgpu]] *ERROR* Not suspending userqueue, timeout >>> waiting for comm:systemd-logind pid:1058 >>> >>> Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> >>> --- >>> drivers/gpu/drm/amd/amdgpu/amdgpu_userqueue.c | 14 ++++++++------ >>> 1 file changed, 8 insertions(+), 6 deletions(-) >>> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_userqueue.c b/drivers/gpu/drm/amd/amdgpu/ >>> amdgpu_userqueue.c >>> index 1867520ba258..05c1ee27a319 100644 >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_userqueue.c >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_userqueue.c >>> @@ -43,7 +43,7 @@ amdgpu_userqueue_cleanup(struct amdgpu_userq_mgr *uq_mgr, >>> if (f && !dma_fence_is_signaled(f)) { >>> ret = dma_fence_wait_timeout(f, true, msecs_to_jiffies(100)); >>> if (ret <= 0) { >>> - DRM_ERROR("Timed out waiting for fence f=%p\n", f); >>> + drm_file_err(uq_mgr->file, "Timed out waiting for fence f=%p\n", f); >> >> You decided to leave %p after all? > > Yes we are printing the fence ptr here to see which fence is timing out. Anyways right now intention > of this patch is to add additional process information along with existing information like fence here. > I agree with Tvrtko, "fence=%llu:%llu" would be better to identify "which fence is timing out". Pierre-Eric > regards > Sunil > >> >>> return; >>> } >>> } >>> @@ -440,7 +440,8 @@ amdgpu_userqueue_resume_all(struct amdgpu_userq_mgr *uq_mgr) >>> } >>> if (ret) >>> - DRM_ERROR("Failed to map all the queues\n"); >>> + drm_file_err(uq_mgr->file, "Failed to map all the queue\n"); >> >> You lost the plural by accident. > Yes i will add 's'. Noted. >> > I am also not sure "all the queues" makes sense in this context versus "all queues" but it's > inconsequential really. > Regards > Sunil >> Yes it all queues from a uq_mgr. >>> + >>> return ret; >>> } >>> @@ -598,7 +599,8 @@ amdgpu_userqueue_suspend_all(struct amdgpu_userq_mgr *uq_mgr) >>> } >>> if (ret) >>> - DRM_ERROR("Couldn't unmap all the queues\n"); >>> + drm_file_err(uq_mgr->file, "Couldn't unmap all the queues\n"); >>> + >>> return ret; >>> } >>> @@ -615,7 +617,7 @@ amdgpu_userqueue_wait_for_signal(struct amdgpu_userq_mgr *uq_mgr) >>> continue; >>> ret = dma_fence_wait_timeout(f, true, msecs_to_jiffies(100)); >>> if (ret <= 0) { >>> - DRM_ERROR("Timed out waiting for fence f=%p\n", f); >>> + drm_file_err(uq_mgr->file, "Timed out waiting for fence f=%p\n", f); >>> return -ETIMEDOUT; >>> } >>> } >>> @@ -634,13 +636,13 @@ amdgpu_userqueue_suspend(struct amdgpu_userq_mgr *uq_mgr, >>> /* Wait for any pending userqueue fence work to finish */ >>> ret = amdgpu_userqueue_wait_for_signal(uq_mgr); >>> if (ret) { >>> - DRM_ERROR("Not suspending userqueue, timeout waiting for work\n"); >>> + drm_file_err(uq_mgr->file, "Not suspending userqueue, timeout waiting\n"); >>> return; >>> } >>> ret = amdgpu_userqueue_suspend_all(uq_mgr); >>> if (ret) { >>> - DRM_ERROR("Failed to evict userqueue\n"); >>> + drm_file_err(uq_mgr->file, "Failed to evict userqueue\n"); >>> return; >> >> It is pre-existing but strikes me as odd that failure to amdgpu_userqueue_suspend_all() logs a >> failure to *evict* instead of suspend (as the previous log does). Anyway, I did not look at the >> surrounding code so just thinking out loud. > > Yes suspend failed as all the fences were not evicted and thats why suspend failed. Anyways there > are already alex patches which will change this to unmap as a code reorganisation for suspend/resume > is in pipeline. > > regards > > Sunil > >> >> Regards, >> >> Tvrtko >> >>> } >>
On 4/16/2025 5:37 PM, Pierre-Eric Pelloux-Prayer wrote: > Hi, > > Le 16/04/2025 à 12:01, Khatri, Sunil a écrit : >> >> On 4/16/2025 12:56 PM, Tvrtko Ursulin wrote: >>> >>> On 15/04/2025 19:43, Sunil Khatri wrote: >>>> add process and pid information in the userqueue error >>>> logging to make it more useful in resolving the error >>>> by logs. >>>> >>>> Sample log: >>>> [ 42.444297] [drm:amdgpu_userqueue_wait_for_signal [amdgpu]] >>>> *ERROR* Timed out waiting for fence f=000000001c74d978 for >>>> comm:Xwayland pid:3427 >>>> [ 42.444669] [drm:amdgpu_userqueue_suspend [amdgpu]] *ERROR* Not >>>> suspending userqueue, timeout waiting for comm:Xwayland pid:3427 >>>> [ 42.824729] [drm:amdgpu_userqueue_wait_for_signal [amdgpu]] >>>> *ERROR* Timed out waiting for fence f=0000000074407d3e for >>>> comm:systemd-logind pid:1058 >>>> [ 42.825082] [drm:amdgpu_userqueue_suspend [amdgpu]] *ERROR* Not >>>> suspending userqueue, timeout waiting for comm:systemd-logind pid:1058 >>>> >>>> Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> >>>> --- >>>> drivers/gpu/drm/amd/amdgpu/amdgpu_userqueue.c | 14 ++++++++------ >>>> 1 file changed, 8 insertions(+), 6 deletions(-) >>>> >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_userqueue.c >>>> b/drivers/gpu/drm/amd/amdgpu/ amdgpu_userqueue.c >>>> index 1867520ba258..05c1ee27a319 100644 >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_userqueue.c >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_userqueue.c >>>> @@ -43,7 +43,7 @@ amdgpu_userqueue_cleanup(struct amdgpu_userq_mgr >>>> *uq_mgr, >>>> if (f && !dma_fence_is_signaled(f)) { >>>> ret = dma_fence_wait_timeout(f, true, >>>> msecs_to_jiffies(100)); >>>> if (ret <= 0) { >>>> - DRM_ERROR("Timed out waiting for fence f=%p\n", f); >>>> + drm_file_err(uq_mgr->file, "Timed out waiting for >>>> fence f=%p\n", f); >>> >>> You decided to leave %p after all? >> >> Yes we are printing the fence ptr here to see which fence is timing >> out. Anyways right now intention of this patch is to add additional >> process information along with existing information like fence here. >> > > I agree with Tvrtko, "fence=%llu:%llu" would be better to identify > "which fence is timing out". I agree to it for sure, just that there are other places also where we are printing fence ptr and will take that up in another patch. Regards Sunil Khatri > > > Pierre-Eric > > >> regards >> Sunil >> >>> >>>> return; >>>> } >>>> } >>>> @@ -440,7 +440,8 @@ amdgpu_userqueue_resume_all(struct >>>> amdgpu_userq_mgr *uq_mgr) >>>> } >>>> if (ret) >>>> - DRM_ERROR("Failed to map all the queues\n"); >>>> + drm_file_err(uq_mgr->file, "Failed to map all the queue\n"); >>> >>> You lost the plural by accident. >> Yes i will add 's'. Noted. >>> >> I am also not sure "all the queues" makes sense in this context >> versus "all queues" but it's inconsequential really. >> Regards >> Sunil >>> Yes it all queues from a uq_mgr. >>>> + >>>> return ret; >>>> } >>>> @@ -598,7 +599,8 @@ amdgpu_userqueue_suspend_all(struct >>>> amdgpu_userq_mgr *uq_mgr) >>>> } >>>> if (ret) >>>> - DRM_ERROR("Couldn't unmap all the queues\n"); >>>> + drm_file_err(uq_mgr->file, "Couldn't unmap all the >>>> queues\n"); >>>> + >>>> return ret; >>>> } >>>> @@ -615,7 +617,7 @@ amdgpu_userqueue_wait_for_signal(struct >>>> amdgpu_userq_mgr *uq_mgr) >>>> continue; >>>> ret = dma_fence_wait_timeout(f, true, >>>> msecs_to_jiffies(100)); >>>> if (ret <= 0) { >>>> - DRM_ERROR("Timed out waiting for fence f=%p\n", f); >>>> + drm_file_err(uq_mgr->file, "Timed out waiting for >>>> fence f=%p\n", f); >>>> return -ETIMEDOUT; >>>> } >>>> } >>>> @@ -634,13 +636,13 @@ amdgpu_userqueue_suspend(struct >>>> amdgpu_userq_mgr *uq_mgr, >>>> /* Wait for any pending userqueue fence work to finish */ >>>> ret = amdgpu_userqueue_wait_for_signal(uq_mgr); >>>> if (ret) { >>>> - DRM_ERROR("Not suspending userqueue, timeout waiting for >>>> work\n"); >>>> + drm_file_err(uq_mgr->file, "Not suspending userqueue, >>>> timeout waiting\n"); >>>> return; >>>> } >>>> ret = amdgpu_userqueue_suspend_all(uq_mgr); >>>> if (ret) { >>>> - DRM_ERROR("Failed to evict userqueue\n"); >>>> + drm_file_err(uq_mgr->file, "Failed to evict userqueue\n"); >>>> return; >>> >>> It is pre-existing but strikes me as odd that failure to >>> amdgpu_userqueue_suspend_all() logs a failure to *evict* instead of >>> suspend (as the previous log does). Anyway, I did not look at the >>> surrounding code so just thinking out loud. >> >> Yes suspend failed as all the fences were not evicted and thats why >> suspend failed. Anyways there are already alex patches which will >> change this to unmap as a code reorganisation for suspend/resume is >> in pipeline. >> >> regards >> >> Sunil >> >>> >>> Regards, >>> >>> Tvrtko >>> >>>> } >>>
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_userqueue.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_userqueue.c index 1867520ba258..05c1ee27a319 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_userqueue.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_userqueue.c @@ -43,7 +43,7 @@ amdgpu_userqueue_cleanup(struct amdgpu_userq_mgr *uq_mgr, if (f && !dma_fence_is_signaled(f)) { ret = dma_fence_wait_timeout(f, true, msecs_to_jiffies(100)); if (ret <= 0) { - DRM_ERROR("Timed out waiting for fence f=%p\n", f); + drm_file_err(uq_mgr->file, "Timed out waiting for fence f=%p\n", f); return; } } @@ -440,7 +440,8 @@ amdgpu_userqueue_resume_all(struct amdgpu_userq_mgr *uq_mgr) } if (ret) - DRM_ERROR("Failed to map all the queues\n"); + drm_file_err(uq_mgr->file, "Failed to map all the queue\n"); + return ret; } @@ -598,7 +599,8 @@ amdgpu_userqueue_suspend_all(struct amdgpu_userq_mgr *uq_mgr) } if (ret) - DRM_ERROR("Couldn't unmap all the queues\n"); + drm_file_err(uq_mgr->file, "Couldn't unmap all the queues\n"); + return ret; } @@ -615,7 +617,7 @@ amdgpu_userqueue_wait_for_signal(struct amdgpu_userq_mgr *uq_mgr) continue; ret = dma_fence_wait_timeout(f, true, msecs_to_jiffies(100)); if (ret <= 0) { - DRM_ERROR("Timed out waiting for fence f=%p\n", f); + drm_file_err(uq_mgr->file, "Timed out waiting for fence f=%p\n", f); return -ETIMEDOUT; } } @@ -634,13 +636,13 @@ amdgpu_userqueue_suspend(struct amdgpu_userq_mgr *uq_mgr, /* Wait for any pending userqueue fence work to finish */ ret = amdgpu_userqueue_wait_for_signal(uq_mgr); if (ret) { - DRM_ERROR("Not suspending userqueue, timeout waiting for work\n"); + drm_file_err(uq_mgr->file, "Not suspending userqueue, timeout waiting\n"); return; } ret = amdgpu_userqueue_suspend_all(uq_mgr); if (ret) { - DRM_ERROR("Failed to evict userqueue\n"); + drm_file_err(uq_mgr->file, "Failed to evict userqueue\n"); return; }
add process and pid information in the userqueue error logging to make it more useful in resolving the error by logs. Sample log: [ 42.444297] [drm:amdgpu_userqueue_wait_for_signal [amdgpu]] *ERROR* Timed out waiting for fence f=000000001c74d978 for comm:Xwayland pid:3427 [ 42.444669] [drm:amdgpu_userqueue_suspend [amdgpu]] *ERROR* Not suspending userqueue, timeout waiting for comm:Xwayland pid:3427 [ 42.824729] [drm:amdgpu_userqueue_wait_for_signal [amdgpu]] *ERROR* Timed out waiting for fence f=0000000074407d3e for comm:systemd-logind pid:1058 [ 42.825082] [drm:amdgpu_userqueue_suspend [amdgpu]] *ERROR* Not suspending userqueue, timeout waiting for comm:systemd-logind pid:1058 Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> --- drivers/gpu/drm/amd/amdgpu/amdgpu_userqueue.c | 14 ++++++++------ 1 file changed, 8 insertions(+), 6 deletions(-)