diff mbox series

drm/amdgpu: resove reboot exception for si oland

Message ID 20230310074000.2078124-1-lizhenneng@kylinos.cn (mailing list archive)
State New, archived
Headers show
Series drm/amdgpu: resove reboot exception for si oland | expand

Commit Message

Zhenneng Li March 10, 2023, 7:39 a.m. UTC
During reboot test on arm64 platform, it may failure
on boot.

The error message are as follows:
[    6.996395][ 7] [  T295] [drm:amdgpu_device_ip_late_init [amdgpu]] *ERROR*
			    late_init of IP block <si_dpm> failed -22
[    7.006919][ 7] [  T295] amdgpu 0000:04:00.0: amdgpu_device_ip_late_init failed
[    7.014224][ 7] [  T295] amdgpu 0000:04:00.0: Fatal error during GPU init
---
 drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c | 3 ---
 1 file changed, 3 deletions(-)

Comments

Chen, Guchun March 10, 2023, 8:18 a.m. UTC | #1
> -----Original Message-----
> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of
> Zhenneng Li
> Sent: Friday, March 10, 2023 3:40 PM
> To: Deucher, Alexander <Alexander.Deucher@amd.com>
> Cc: David Airlie <airlied@linux.ie>; Pan, Xinhui <Xinhui.Pan@amd.com>;
> linux-kernel@vger.kernel.org; dri-devel@lists.freedesktop.org; Zhenneng Li
> <lizhenneng@kylinos.cn>; amd-gfx@lists.freedesktop.org; Daniel Vetter
> <daniel@ffwll.ch>; Koenig, Christian <Christian.Koenig@amd.com>
> Subject: [PATCH] drm/amdgpu: resove reboot exception for si oland
> 
> During reboot test on arm64 platform, it may failure on boot.
> 
> The error message are as follows:
> [    6.996395][ 7] [  T295] [drm:amdgpu_device_ip_late_init [amdgpu]]
> *ERROR*
> 			    late_init of IP block <si_dpm> failed -22
> [    7.006919][ 7] [  T295] amdgpu 0000:04:00.0: amdgpu_device_ip_late_init
> failed
> [    7.014224][ 7] [  T295] amdgpu 0000:04:00.0: Fatal error during GPU init
> ---
>  drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c | 3 ---
>  1 file changed, 3 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c
> b/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c
> index d6d9e3b1b2c0..dee51c757ac0 100644
> --- a/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c
> +++ b/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c
> @@ -7632,9 +7632,6 @@ static int si_dpm_late_init(void *handle)
>  	if (!adev->pm.dpm_enabled)
>  		return 0;
> 
> -	ret = si_set_temperature_range(adev);
> -	if (ret)
> -		return ret;

si_set_temperature_range should be platform agnostic. Can you please elaborate more?

Regards,
Guchun

>  #if 0 //TODO ?
>  	si_dpm_powergate_uvd(adev, true);
>  #endif
> --
> 2.25.1
Alex Deucher March 10, 2023, 3:21 p.m. UTC | #2
On Fri, Mar 10, 2023 at 3:18 AM Chen, Guchun <Guchun.Chen@amd.com> wrote:
>
>
> > -----Original Message-----
> > From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of
> > Zhenneng Li
> > Sent: Friday, March 10, 2023 3:40 PM
> > To: Deucher, Alexander <Alexander.Deucher@amd.com>
> > Cc: David Airlie <airlied@linux.ie>; Pan, Xinhui <Xinhui.Pan@amd.com>;
> > linux-kernel@vger.kernel.org; dri-devel@lists.freedesktop.org; Zhenneng Li
> > <lizhenneng@kylinos.cn>; amd-gfx@lists.freedesktop.org; Daniel Vetter
> > <daniel@ffwll.ch>; Koenig, Christian <Christian.Koenig@amd.com>
> > Subject: [PATCH] drm/amdgpu: resove reboot exception for si oland
> >
> > During reboot test on arm64 platform, it may failure on boot.
> >
> > The error message are as follows:
> > [    6.996395][ 7] [  T295] [drm:amdgpu_device_ip_late_init [amdgpu]]
> > *ERROR*
> >                           late_init of IP block <si_dpm> failed -22
> > [    7.006919][ 7] [  T295] amdgpu 0000:04:00.0: amdgpu_device_ip_late_init
> > failed
> > [    7.014224][ 7] [  T295] amdgpu 0000:04:00.0: Fatal error during GPU init
> > ---
> >  drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c | 3 ---
> >  1 file changed, 3 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c
> > b/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c
> > index d6d9e3b1b2c0..dee51c757ac0 100644
> > --- a/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c
> > +++ b/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c
> > @@ -7632,9 +7632,6 @@ static int si_dpm_late_init(void *handle)
> >       if (!adev->pm.dpm_enabled)
> >               return 0;
> >
> > -     ret = si_set_temperature_range(adev);
> > -     if (ret)
> > -             return ret;
>
> si_set_temperature_range should be platform agnostic. Can you please elaborate more?
>

Yes.  Not setting this means we won't get thermal interrupts.  We
shouldn't skip this.

Alex


> Regards,
> Guchun
>
> >  #if 0 //TODO ?
> >       si_dpm_powergate_uvd(adev, true);
> >  #endif
> > --
> > 2.25.1
>
Lazar, Lijo March 10, 2023, 3:55 p.m. UTC | #3
[AMD Official Use Only - General]

I recall that there was a previous discussion around this and that time we found that the range is already set earlier during DPM enablement.

The suspected root cause was enable/disable of thermal alert within this call to set range again.

Thanks,
Lijo
Zhenneng Li March 13, 2023, 1:04 a.m. UTC | #4
This bug is first reported here:

https://lore.kernel.org/lkml/1a620e7c-5b71-3d16-001a-0d79b292aca7@amd.com/

I modify the patch accroding mail list's discusstion,   and I do reboot 
test for tens of thousands of times about 10 machines on arm64,  there's 
no bug reported.

在 2023/3/10 16:18, Chen, Guchun 写道:
>> -----Original Message-----
>> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of
>> Zhenneng Li
>> Sent: Friday, March 10, 2023 3:40 PM
>> To: Deucher, Alexander <Alexander.Deucher@amd.com>
>> Cc: David Airlie <airlied@linux.ie>; Pan, Xinhui <Xinhui.Pan@amd.com>;
>> linux-kernel@vger.kernel.org; dri-devel@lists.freedesktop.org; Zhenneng Li
>> <lizhenneng@kylinos.cn>; amd-gfx@lists.freedesktop.org; Daniel Vetter
>> <daniel@ffwll.ch>; Koenig, Christian <Christian.Koenig@amd.com>
>> Subject: [PATCH] drm/amdgpu: resove reboot exception for si oland
>>
>> During reboot test on arm64 platform, it may failure on boot.
>>
>> The error message are as follows:
>> [    6.996395][ 7] [  T295] [drm:amdgpu_device_ip_late_init [amdgpu]]
>> *ERROR*
>> 			    late_init of IP block <si_dpm> failed -22
>> [    7.006919][ 7] [  T295] amdgpu 0000:04:00.0: amdgpu_device_ip_late_init
>> failed
>> [    7.014224][ 7] [  T295] amdgpu 0000:04:00.0: Fatal error during GPU init
>> ---
>>   drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c | 3 ---
>>   1 file changed, 3 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c
>> b/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c
>> index d6d9e3b1b2c0..dee51c757ac0 100644
>> --- a/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c
>> +++ b/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c
>> @@ -7632,9 +7632,6 @@ static int si_dpm_late_init(void *handle)
>>   	if (!adev->pm.dpm_enabled)
>>   		return 0;
>>
>> -	ret = si_set_temperature_range(adev);
>> -	if (ret)
>> -		return ret;
> si_set_temperature_range should be platform agnostic. Can you please elaborate more?
>
> Regards,
> Guchun
>
>>   #if 0 //TODO ?
>>   	si_dpm_powergate_uvd(adev, true);
>>   #endif
>> --
>> 2.25.1
diff mbox series

Patch

diff --git a/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c b/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c
index d6d9e3b1b2c0..dee51c757ac0 100644
--- a/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c
+++ b/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c
@@ -7632,9 +7632,6 @@  static int si_dpm_late_init(void *handle)
 	if (!adev->pm.dpm_enabled)
 		return 0;
 
-	ret = si_set_temperature_range(adev);
-	if (ret)
-		return ret;
 #if 0 //TODO ?
 	si_dpm_powergate_uvd(adev, true);
 #endif