diff mbox series

[1/2] drm/amdgpu: add the IP information of the soc

Message ID 20240312124148.257067-1-sunil.khatri@amd.com (mailing list archive)
State New, archived
Headers show
Series [1/2] drm/amdgpu: add the IP information of the soc | expand

Commit Message

Sunil Khatri March 12, 2024, 12:41 p.m. UTC
Add all the IP's information on a SOC to the
devcoredump.

Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

Comments

Sunil Khatri March 13, 2024, 7:45 p.m. UTC | #1
[AMD Official Use Only - General]

Gentle Reminder for review.

Regards,
Sunil

Get Outlook for Android<https://aka.ms/AAb9ysg>
Sunil Khatri March 13, 2024, 7:46 p.m. UTC | #2
[AMD Official Use Only - General]

Gentle reminder for review.

Regards
Sunil

Get Outlook for Android<https://aka.ms/AAb9ysg>
Alex Deucher March 13, 2024, 8:28 p.m. UTC | #3
On Tue, Mar 12, 2024 at 8:41 AM Sunil Khatri <sunil.khatri@amd.com> wrote:
>
> Add all the IP's information on a SOC to the
> devcoredump.
>
> Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 19 +++++++++++++++++++
>  1 file changed, 19 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
> index a0dbccad2f53..611fdb90a1fc 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
> @@ -196,6 +196,25 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, size_t count,
>                            coredump->reset_task_info.process_name,
>                            coredump->reset_task_info.pid);
>
> +       /* GPU IP's information of the SOC */
> +       if (coredump->adev) {
> +               drm_printf(&p, "\nIP Information\n");
> +               drm_printf(&p, "SOC Family: %d\n", coredump->adev->family);
> +               drm_printf(&p, "SOC Revision id: %d\n", coredump->adev->rev_id);
> +
> +               for (int i = 0; i < coredump->adev->num_ip_blocks; i++) {
> +                       struct amdgpu_ip_block *ip =
> +                               &coredump->adev->ip_blocks[i];
> +                       drm_printf(&p, "IP type: %d IP name: %s\n",
> +                                  ip->version->type,
> +                                  ip->version->funcs->name);
> +                       drm_printf(&p, "IP version: (%d,%d,%d)\n\n",
> +                                  ip->version->major,
> +                                  ip->version->minor,
> +                                  ip->version->rev);
> +               }
> +       }

I think the IP discovery table would be more useful.  Either walk the
adev->ip_versions structure, or just include the IP discovery binary.

Alex

> +
>         if (coredump->ring) {
>                 drm_printf(&p, "\nRing timed out details\n");
>                 drm_printf(&p, "IP Type: %d Ring Name: %s\n",
> --
> 2.34.1
>
Khatri, Sunil March 14, 2024, 5:44 a.m. UTC | #4
On 3/14/2024 1:58 AM, Alex Deucher wrote:
> On Tue, Mar 12, 2024 at 8:41 AM Sunil Khatri <sunil.khatri@amd.com> wrote:
>> Add all the IP's information on a SOC to the
>> devcoredump.
>>
>> Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 19 +++++++++++++++++++
>>   1 file changed, 19 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
>> index a0dbccad2f53..611fdb90a1fc 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
>> @@ -196,6 +196,25 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, size_t count,
>>                             coredump->reset_task_info.process_name,
>>                             coredump->reset_task_info.pid);
>>
>> +       /* GPU IP's information of the SOC */
>> +       if (coredump->adev) {
>> +               drm_printf(&p, "\nIP Information\n");
>> +               drm_printf(&p, "SOC Family: %d\n", coredump->adev->family);
>> +               drm_printf(&p, "SOC Revision id: %d\n", coredump->adev->rev_id);
>> +
>> +               for (int i = 0; i < coredump->adev->num_ip_blocks; i++) {
>> +                       struct amdgpu_ip_block *ip =
>> +                               &coredump->adev->ip_blocks[i];
>> +                       drm_printf(&p, "IP type: %d IP name: %s\n",
>> +                                  ip->version->type,
>> +                                  ip->version->funcs->name);
>> +                       drm_printf(&p, "IP version: (%d,%d,%d)\n\n",
>> +                                  ip->version->major,
>> +                                  ip->version->minor,
>> +                                  ip->version->rev);
>> +               }
>> +       }
> I think the IP discovery table would be more useful.  Either walk the
> adev->ip_versions structure, or just include the IP discovery binary.

I did explore the adev->ip_versions and if i just go through the array 
it doesn't give any useful information directly.
There are no ways to find directly from adev->ip_versions below things 
until i also reparse the discovery binary again like done the discovery 
amdgpu_discovery_reg_base_init and walk through the headers of various 
ips using discovery binary.
a. Which IP is available on soc or not.
b. How many instances are there
Also i again have to change back to major, minor and rev convention for 
this information to be useful. I am exploring it more if i find some 
other information i will update.

adev->ip_block[] is derived from ip discovery only for each block which 
is there on the SOC, so we are not reading information which isnt 
applicable for the soc. We have name , type and version no of the IPs 
available on the soc. If you want i could add no of instances of each IP 
too if you think that's useful information here. Could you share what 
information is missing in this approach so i can include that.

For dumping the IP discovery binary, i dont understand how that 
information would be useful directly and needs to be decoded like we are 
doing in discovery init. Please correct me if my understanding is wrong 
here.
> Alex
>
>> +
>>          if (coredump->ring) {
>>                  drm_printf(&p, "\nRing timed out details\n");
>>                  drm_printf(&p, "IP Type: %d Ring Name: %s\n",
>> --
>> 2.34.1
>>
Alex Deucher March 14, 2024, 2:42 p.m. UTC | #5
On Thu, Mar 14, 2024 at 1:44 AM Khatri, Sunil <sukhatri@amd.com> wrote:
>
>
> On 3/14/2024 1:58 AM, Alex Deucher wrote:
> > On Tue, Mar 12, 2024 at 8:41 AM Sunil Khatri <sunil.khatri@amd.com> wrote:
> >> Add all the IP's information on a SOC to the
> >> devcoredump.
> >>
> >> Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
> >> ---
> >>   drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 19 +++++++++++++++++++
> >>   1 file changed, 19 insertions(+)
> >>
> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
> >> index a0dbccad2f53..611fdb90a1fc 100644
> >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
> >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
> >> @@ -196,6 +196,25 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, size_t count,
> >>                             coredump->reset_task_info.process_name,
> >>                             coredump->reset_task_info.pid);
> >>
> >> +       /* GPU IP's information of the SOC */
> >> +       if (coredump->adev) {
> >> +               drm_printf(&p, "\nIP Information\n");
> >> +               drm_printf(&p, "SOC Family: %d\n", coredump->adev->family);
> >> +               drm_printf(&p, "SOC Revision id: %d\n", coredump->adev->rev_id);
> >> +
> >> +               for (int i = 0; i < coredump->adev->num_ip_blocks; i++) {
> >> +                       struct amdgpu_ip_block *ip =
> >> +                               &coredump->adev->ip_blocks[i];
> >> +                       drm_printf(&p, "IP type: %d IP name: %s\n",
> >> +                                  ip->version->type,
> >> +                                  ip->version->funcs->name);
> >> +                       drm_printf(&p, "IP version: (%d,%d,%d)\n\n",
> >> +                                  ip->version->major,
> >> +                                  ip->version->minor,
> >> +                                  ip->version->rev);
> >> +               }
> >> +       }
> > I think the IP discovery table would be more useful.  Either walk the
> > adev->ip_versions structure, or just include the IP discovery binary.
>
> I did explore the adev->ip_versions and if i just go through the array
> it doesn't give any useful information directly.
> There are no ways to find directly from adev->ip_versions below things
> until i also reparse the discovery binary again like done the discovery
> amdgpu_discovery_reg_base_init and walk through the headers of various
> ips using discovery binary.
> a. Which IP is available on soc or not.
> b. How many instances are there
> Also i again have to change back to major, minor and rev convention for
> this information to be useful. I am exploring it more if i find some
> other information i will update.
>
> adev->ip_block[] is derived from ip discovery only for each block which
> is there on the SOC, so we are not reading information which isnt
> applicable for the soc. We have name , type and version no of the IPs
> available on the soc. If you want i could add no of instances of each IP
> too if you think that's useful information here. Could you share what
> information is missing in this approach so i can include that.

I was hoping to get the actual IP versions for the IPs from IP
discovery rather than the versions from the ip_block array.  The
latter are common so you can end up with the same version used across
a wide variety of chips (e.g., all gfx10.x based chips use the same
gfx 10 IP code even if the actual IP version is different for most of
the chips).

>
> For dumping the IP discovery binary, i dont understand how that
> information would be useful directly and needs to be decoded like we are
> doing in discovery init. Please correct me if my understanding is wrong
> here.

It's probably not a high priority, I was just thinking it might be
useful to have in case there ended up being some problem related to
the IP discovery table on some boards.  E.g., we'd know that all
boards with a certain harvest config seem to align with a reported
problem.  Similar for vbios.  It's more for telemetry.  E.g., all the
boards reporting some problem have a particular powerplay config or
whatever.

Alex


> > Alex
> >
> >> +
> >>          if (coredump->ring) {
> >>                  drm_printf(&p, "\nRing timed out details\n");
> >>                  drm_printf(&p, "IP Type: %d Ring Name: %s\n",
> >> --
> >> 2.34.1
> >>
Khatri, Sunil March 14, 2024, 4:16 p.m. UTC | #6
On 3/14/2024 8:12 PM, Alex Deucher wrote:
> On Thu, Mar 14, 2024 at 1:44 AM Khatri, Sunil <sukhatri@amd.com> wrote:
>>
>> On 3/14/2024 1:58 AM, Alex Deucher wrote:
>>> On Tue, Mar 12, 2024 at 8:41 AM Sunil Khatri <sunil.khatri@amd.com> wrote:
>>>> Add all the IP's information on a SOC to the
>>>> devcoredump.
>>>>
>>>> Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
>>>> ---
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 19 +++++++++++++++++++
>>>>    1 file changed, 19 insertions(+)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
>>>> index a0dbccad2f53..611fdb90a1fc 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
>>>> @@ -196,6 +196,25 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, size_t count,
>>>>                              coredump->reset_task_info.process_name,
>>>>                              coredump->reset_task_info.pid);
>>>>
>>>> +       /* GPU IP's information of the SOC */
>>>> +       if (coredump->adev) {
>>>> +               drm_printf(&p, "\nIP Information\n");
>>>> +               drm_printf(&p, "SOC Family: %d\n", coredump->adev->family);
>>>> +               drm_printf(&p, "SOC Revision id: %d\n", coredump->adev->rev_id);
>>>> +
>>>> +               for (int i = 0; i < coredump->adev->num_ip_blocks; i++) {
>>>> +                       struct amdgpu_ip_block *ip =
>>>> +                               &coredump->adev->ip_blocks[i];
>>>> +                       drm_printf(&p, "IP type: %d IP name: %s\n",
>>>> +                                  ip->version->type,
>>>> +                                  ip->version->funcs->name);
>>>> +                       drm_printf(&p, "IP version: (%d,%d,%d)\n\n",
>>>> +                                  ip->version->major,
>>>> +                                  ip->version->minor,
>>>> +                                  ip->version->rev);
>>>> +               }
>>>> +       }
>>> I think the IP discovery table would be more useful.  Either walk the
>>> adev->ip_versions structure, or just include the IP discovery binary.
>> I did explore the adev->ip_versions and if i just go through the array
>> it doesn't give any useful information directly.
>> There are no ways to find directly from adev->ip_versions below things
>> until i also reparse the discovery binary again like done the discovery
>> amdgpu_discovery_reg_base_init and walk through the headers of various
>> ips using discovery binary.
>> a. Which IP is available on soc or not.
>> b. How many instances are there
>> Also i again have to change back to major, minor and rev convention for
>> this information to be useful. I am exploring it more if i find some
>> other information i will update.
>>
>> adev->ip_block[] is derived from ip discovery only for each block which
>> is there on the SOC, so we are not reading information which isnt
>> applicable for the soc. We have name , type and version no of the IPs
>> available on the soc. If you want i could add no of instances of each IP
>> too if you think that's useful information here. Could you share what
>> information is missing in this approach so i can include that.
> I was hoping to get the actual IP versions for the IPs from IP
> discovery rather than the versions from the ip_block array.  The
> latter are common so you can end up with the same version used across
> a wide variety of chips (e.g., all gfx10.x based chips use the same
> gfx 10 IP code even if the actual IP version is different for most of
> the chips).
Got it. let me check how to get it could be done rightly.
>
>> For dumping the IP discovery binary, i dont understand how that
>> information would be useful directly and needs to be decoded like we are
>> doing in discovery init. Please correct me if my understanding is wrong
>> here.
> It's probably not a high priority, I was just thinking it might be
> useful to have in case there ended up being some problem related to
> the IP discovery table on some boards.  E.g., we'd know that all
> boards with a certain harvest config seem to align with a reported
> problem.  Similar for vbios.  It's more for telemetry.  E.g., all the
> boards reporting some problem have a particular powerplay config or
> whatever.
I got it.
But two points of contention here in my understanding. The dump works 
only where there is reset and not sure if it could be used very early in 
development of not. Second point is that devcoredump is 4096 
bytes/4Kbyte of memory where we are dumping all the information. Not 
sure if that could be increased but it might not be enough if we are 
planning to dump all to it.

Another point is since we have sysfs/debugfs/info ioctl etc information 
available. We should sort out what really is helpful in debugging GPU 
hang and that's added in devcore.

Regards
Sunil

>
> Alex
>
>
>>> Alex
>>>
>>>> +
>>>>           if (coredump->ring) {
>>>>                   drm_printf(&p, "\nRing timed out details\n");
>>>>                   drm_printf(&p, "IP Type: %d Ring Name: %s\n",
>>>> --
>>>> 2.34.1
>>>>
Alex Deucher March 14, 2024, 5:51 p.m. UTC | #7
On Thu, Mar 14, 2024 at 12:16 PM Khatri, Sunil <sukhatri@amd.com> wrote:
>
>
> On 3/14/2024 8:12 PM, Alex Deucher wrote:
> > On Thu, Mar 14, 2024 at 1:44 AM Khatri, Sunil <sukhatri@amd.com> wrote:
> >>
> >> On 3/14/2024 1:58 AM, Alex Deucher wrote:
> >>> On Tue, Mar 12, 2024 at 8:41 AM Sunil Khatri <sunil.khatri@amd.com> wrote:
> >>>> Add all the IP's information on a SOC to the
> >>>> devcoredump.
> >>>>
> >>>> Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
> >>>> ---
> >>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 19 +++++++++++++++++++
> >>>>    1 file changed, 19 insertions(+)
> >>>>
> >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
> >>>> index a0dbccad2f53..611fdb90a1fc 100644
> >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
> >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
> >>>> @@ -196,6 +196,25 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, size_t count,
> >>>>                              coredump->reset_task_info.process_name,
> >>>>                              coredump->reset_task_info.pid);
> >>>>
> >>>> +       /* GPU IP's information of the SOC */
> >>>> +       if (coredump->adev) {
> >>>> +               drm_printf(&p, "\nIP Information\n");
> >>>> +               drm_printf(&p, "SOC Family: %d\n", coredump->adev->family);
> >>>> +               drm_printf(&p, "SOC Revision id: %d\n", coredump->adev->rev_id);
> >>>> +
> >>>> +               for (int i = 0; i < coredump->adev->num_ip_blocks; i++) {
> >>>> +                       struct amdgpu_ip_block *ip =
> >>>> +                               &coredump->adev->ip_blocks[i];
> >>>> +                       drm_printf(&p, "IP type: %d IP name: %s\n",
> >>>> +                                  ip->version->type,
> >>>> +                                  ip->version->funcs->name);
> >>>> +                       drm_printf(&p, "IP version: (%d,%d,%d)\n\n",
> >>>> +                                  ip->version->major,
> >>>> +                                  ip->version->minor,
> >>>> +                                  ip->version->rev);
> >>>> +               }
> >>>> +       }
> >>> I think the IP discovery table would be more useful.  Either walk the
> >>> adev->ip_versions structure, or just include the IP discovery binary.
> >> I did explore the adev->ip_versions and if i just go through the array
> >> it doesn't give any useful information directly.
> >> There are no ways to find directly from adev->ip_versions below things
> >> until i also reparse the discovery binary again like done the discovery
> >> amdgpu_discovery_reg_base_init and walk through the headers of various
> >> ips using discovery binary.
> >> a. Which IP is available on soc or not.
> >> b. How many instances are there
> >> Also i again have to change back to major, minor and rev convention for
> >> this information to be useful. I am exploring it more if i find some
> >> other information i will update.
> >>
> >> adev->ip_block[] is derived from ip discovery only for each block which
> >> is there on the SOC, so we are not reading information which isnt
> >> applicable for the soc. We have name , type and version no of the IPs
> >> available on the soc. If you want i could add no of instances of each IP
> >> too if you think that's useful information here. Could you share what
> >> information is missing in this approach so i can include that.
> > I was hoping to get the actual IP versions for the IPs from IP
> > discovery rather than the versions from the ip_block array.  The
> > latter are common so you can end up with the same version used across
> > a wide variety of chips (e.g., all gfx10.x based chips use the same
> > gfx 10 IP code even if the actual IP version is different for most of
> > the chips).
> Got it. let me check how to get it could be done rightly.
> >
> >> For dumping the IP discovery binary, i dont understand how that
> >> information would be useful directly and needs to be decoded like we are
> >> doing in discovery init. Please correct me if my understanding is wrong
> >> here.
> > It's probably not a high priority, I was just thinking it might be
> > useful to have in case there ended up being some problem related to
> > the IP discovery table on some boards.  E.g., we'd know that all
> > boards with a certain harvest config seem to align with a reported
> > problem.  Similar for vbios.  It's more for telemetry.  E.g., all the
> > boards reporting some problem have a particular powerplay config or
> > whatever.
> I got it.
> But two points of contention here in my understanding. The dump works
> only where there is reset and not sure if it could be used very early in
> development of not. Second point is that devcoredump is 4096
> bytes/4Kbyte of memory where we are dumping all the information. Not
> sure if that could be increased but it might not be enough if we are
> planning to dump all to it.

ah, ok.  Let's skip the IP versions in that case, we can use the
family and rev_id and external_rev_id to look up the IP versions.

Alex

>
> Another point is since we have sysfs/debugfs/info ioctl etc information
> available. We should sort out what really is helpful in debugging GPU
> hang and that's added in devcore.
>
> Regards
> Sunil
>
> >
> > Alex
> >
> >
> >>> Alex
> >>>
> >>>> +
> >>>>           if (coredump->ring) {
> >>>>                   drm_printf(&p, "\nRing timed out details\n");
> >>>>                   drm_printf(&p, "IP Type: %d Ring Name: %s\n",
> >>>> --
> >>>> 2.34.1
> >>>>
diff mbox series

Patch

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
index a0dbccad2f53..611fdb90a1fc 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
@@ -196,6 +196,25 @@  amdgpu_devcoredump_read(char *buffer, loff_t offset, size_t count,
 			   coredump->reset_task_info.process_name,
 			   coredump->reset_task_info.pid);
 
+	/* GPU IP's information of the SOC */
+	if (coredump->adev) {
+		drm_printf(&p, "\nIP Information\n");
+		drm_printf(&p, "SOC Family: %d\n", coredump->adev->family);
+		drm_printf(&p, "SOC Revision id: %d\n", coredump->adev->rev_id);
+
+		for (int i = 0; i < coredump->adev->num_ip_blocks; i++) {
+			struct amdgpu_ip_block *ip =
+				&coredump->adev->ip_blocks[i];
+			drm_printf(&p, "IP type: %d IP name: %s\n",
+				   ip->version->type,
+				   ip->version->funcs->name);
+			drm_printf(&p, "IP version: (%d,%d,%d)\n\n",
+				   ip->version->major,
+				   ip->version->minor,
+				   ip->version->rev);
+		}
+	}
+
 	if (coredump->ring) {
 		drm_printf(&p, "\nRing timed out details\n");
 		drm_printf(&p, "IP Type: %d Ring Name: %s\n",