Message ID | 20221118074810.380368-1-lizhenneng@kylinos.cn (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | drm/amdgpu: add mb for si | expand |
Am 18.11.22 um 08:48 schrieb Zhenneng Li: > During reboot test on arm64 platform, it may failure on boot, > so add this mb in smc. > > The error message are as follows: > [ 6.996395][ 7] [ T295] [drm:amdgpu_device_ip_late_init [amdgpu]] *ERROR* > late_init of IP block <si_dpm> failed -22 > [ 7.006919][ 7] [ T295] amdgpu 0000:04:00.0: amdgpu_device_ip_late_init failed > [ 7.014224][ 7] [ T295] amdgpu 0000:04:00.0: Fatal error during GPU init Memory barries are not supposed to be sprinkled around like this, you need to give a detailed explanation why this is necessary. Regards, Christian. > > Signed-off-by: Zhenneng Li <lizhenneng@kylinos.cn> > --- > drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c b/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c > index 8f994ffa9cd1..c7656f22278d 100644 > --- a/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c > +++ b/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c > @@ -155,6 +155,8 @@ bool amdgpu_si_is_smc_running(struct amdgpu_device *adev) > u32 rst = RREG32_SMC(SMC_SYSCON_RESET_CNTL); > u32 clk = RREG32_SMC(SMC_SYSCON_CLOCK_CNTL_0); > > + mb(); > + > if (!(rst & RST_REG) && !(clk & CK_DISABLE)) > return true; >
On 11/18/22 09:01, Christian König wrote: > Am 18.11.22 um 08:48 schrieb Zhenneng Li: >> During reboot test on arm64 platform, it may failure on boot, >> so add this mb in smc. >> >> The error message are as follows: >> [ 6.996395][ 7] [ T295] [drm:amdgpu_device_ip_late_init [amdgpu]] *ERROR* >> late_init of IP block <si_dpm> failed -22 >> [ 7.006919][ 7] [ T295] amdgpu 0000:04:00.0: amdgpu_device_ip_late_init failed >> [ 7.014224][ 7] [ T295] amdgpu 0000:04:00.0: Fatal error during GPU init > > Memory barries are not supposed to be sprinkled around like this, you need to give a detailed explanation why this is necessary. > > Regards, > Christian. > >> >> Signed-off-by: Zhenneng Li <lizhenneng@kylinos.cn> >> --- >> drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c | 2 ++ >> 1 file changed, 2 insertions(+) >> >> diff --git a/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c b/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c >> index 8f994ffa9cd1..c7656f22278d 100644 >> --- a/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c >> +++ b/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c >> @@ -155,6 +155,8 @@ bool amdgpu_si_is_smc_running(struct amdgpu_device *adev) >> u32 rst = RREG32_SMC(SMC_SYSCON_RESET_CNTL); >> u32 clk = RREG32_SMC(SMC_SYSCON_CLOCK_CNTL_0); >> + mb(); >> + >> if (!(rst & RST_REG) && !(clk & CK_DISABLE)) >> return true; In particular, it makes no sense in this specific place, since it cannot directly affect the values of rst & clk.
在 2022/11/18 17:18, Michel Dänzer 写道: > On 11/18/22 09:01, Christian König wrote: >> Am 18.11.22 um 08:48 schrieb Zhenneng Li: >>> During reboot test on arm64 platform, it may failure on boot, >>> so add this mb in smc. >>> >>> The error message are as follows: >>> [ 6.996395][ 7] [ T295] [drm:amdgpu_device_ip_late_init [amdgpu]] *ERROR* >>> late_init of IP block <si_dpm> failed -22 >>> [ 7.006919][ 7] [ T295] amdgpu 0000:04:00.0: amdgpu_device_ip_late_init failed >>> [ 7.014224][ 7] [ T295] amdgpu 0000:04:00.0: Fatal error during GPU init >> Memory barries are not supposed to be sprinkled around like this, you need to give a detailed explanation why this is necessary. >> >> Regards, >> Christian. >> >>> Signed-off-by: Zhenneng Li <lizhenneng@kylinos.cn> >>> --- >>> drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c | 2 ++ >>> 1 file changed, 2 insertions(+) >>> >>> diff --git a/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c b/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c >>> index 8f994ffa9cd1..c7656f22278d 100644 >>> --- a/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c >>> +++ b/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c >>> @@ -155,6 +155,8 @@ bool amdgpu_si_is_smc_running(struct amdgpu_device *adev) >>> u32 rst = RREG32_SMC(SMC_SYSCON_RESET_CNTL); >>> u32 clk = RREG32_SMC(SMC_SYSCON_CLOCK_CNTL_0); >>> + mb(); >>> + >>> if (!(rst & RST_REG) && !(clk & CK_DISABLE)) >>> return true; > In particular, it makes no sense in this specific place, since it cannot directly affect the values of rst & clk. I thinks so too. But when I do reboot test using nine desktop machines, there maybe report this error on one or two machines after Hundreds of times or Thousands of times reboot test, at the beginning, I use msleep() instead of mb(), these two methods are all works, but I don't know what is the root case. I use this method on other verdor's oland card, this error message are reported again. What could be the root reason? test environmen: graphics card: OLAND 0x1002:0x6611 0x1642:0x1869 0x87 driver: amdgpu os: ubuntu 2004 platform: arm64 kernel: 5.4.18 >
[AMD Official Use Only - General] Could the attached patch help? Evan > -----Original Message----- > From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of ??? > Sent: Friday, November 18, 2022 5:25 PM > To: Michel Dänzer <michel.daenzer@mailbox.org>; Koenig, Christian > <Christian.Koenig@amd.com>; Deucher, Alexander > <Alexander.Deucher@amd.com> > Cc: amd-gfx@lists.freedesktop.org; Pan, Xinhui <Xinhui.Pan@amd.com>; > linux-kernel@vger.kernel.org; dri-devel@lists.freedesktop.org > Subject: Re: [PATCH] drm/amdgpu: add mb for si > > > 在 2022/11/18 17:18, Michel Dänzer 写道: > > On 11/18/22 09:01, Christian König wrote: > >> Am 18.11.22 um 08:48 schrieb Zhenneng Li: > >>> During reboot test on arm64 platform, it may failure on boot, so add > >>> this mb in smc. > >>> > >>> The error message are as follows: > >>> [ 6.996395][ 7] [ T295] [drm:amdgpu_device_ip_late_init > >>> [amdgpu]] *ERROR* > >>> late_init of IP block <si_dpm> failed -22 [ > >>> 7.006919][ 7] [ T295] amdgpu 0000:04:00.0: > >>> amdgpu_device_ip_late_init failed [ 7.014224][ 7] [ T295] amdgpu > >>> 0000:04:00.0: Fatal error during GPU init > >> Memory barries are not supposed to be sprinkled around like this, you > need to give a detailed explanation why this is necessary. > >> > >> Regards, > >> Christian. > >> > >>> Signed-off-by: Zhenneng Li <lizhenneng@kylinos.cn> > >>> --- > >>> drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c | 2 ++ > >>> 1 file changed, 2 insertions(+) > >>> > >>> diff --git a/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c > >>> b/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c > >>> index 8f994ffa9cd1..c7656f22278d 100644 > >>> --- a/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c > >>> +++ b/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c > >>> @@ -155,6 +155,8 @@ bool amdgpu_si_is_smc_running(struct > >>> amdgpu_device *adev) > >>> u32 rst = RREG32_SMC(SMC_SYSCON_RESET_CNTL); > >>> u32 clk = RREG32_SMC(SMC_SYSCON_CLOCK_CNTL_0); > >>> + mb(); > >>> + > >>> if (!(rst & RST_REG) && !(clk & CK_DISABLE)) > >>> return true; > > In particular, it makes no sense in this specific place, since it cannot directly > affect the values of rst & clk. > > I thinks so too. > > But when I do reboot test using nine desktop machines, there maybe report > this error on one or two machines after Hundreds of times or Thousands of > times reboot test, at the beginning, I use msleep() instead of mb(), these > two methods are all works, but I don't know what is the root case. > > I use this method on other verdor's oland card, this error message are > reported again. > > What could be the root reason? > > test environmen: > > graphics card: OLAND 0x1002:0x6611 0x1642:0x1869 0x87 > > driver: amdgpu > > os: ubuntu 2004 > > platform: arm64 > > kernel: 5.4.18 > > >
That's not a patch but some binary file? Christian. Am 24.11.22 um 11:04 schrieb Quan, Evan: > [AMD Official Use Only - General] > > Could the attached patch help? > > Evan >> -----Original Message----- >> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of ??? >> Sent: Friday, November 18, 2022 5:25 PM >> To: Michel Dänzer <michel.daenzer@mailbox.org>; Koenig, Christian >> <Christian.Koenig@amd.com>; Deucher, Alexander >> <Alexander.Deucher@amd.com> >> Cc: amd-gfx@lists.freedesktop.org; Pan, Xinhui <Xinhui.Pan@amd.com>; >> linux-kernel@vger.kernel.org; dri-devel@lists.freedesktop.org >> Subject: Re: [PATCH] drm/amdgpu: add mb for si >> >> >> 在 2022/11/18 17:18, Michel Dänzer 写道: >>> On 11/18/22 09:01, Christian König wrote: >>>> Am 18.11.22 um 08:48 schrieb Zhenneng Li: >>>>> During reboot test on arm64 platform, it may failure on boot, so add >>>>> this mb in smc. >>>>> >>>>> The error message are as follows: >>>>> [ 6.996395][ 7] [ T295] [drm:amdgpu_device_ip_late_init >>>>> [amdgpu]] *ERROR* >>>>> late_init of IP block <si_dpm> failed -22 [ >>>>> 7.006919][ 7] [ T295] amdgpu 0000:04:00.0: >>>>> amdgpu_device_ip_late_init failed [ 7.014224][ 7] [ T295] amdgpu >>>>> 0000:04:00.0: Fatal error during GPU init >>>> Memory barries are not supposed to be sprinkled around like this, you >> need to give a detailed explanation why this is necessary. >>>> Regards, >>>> Christian. >>>> >>>>> Signed-off-by: Zhenneng Li <lizhenneng@kylinos.cn> >>>>> --- >>>>> drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c | 2 ++ >>>>> 1 file changed, 2 insertions(+) >>>>> >>>>> diff --git a/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c >>>>> b/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c >>>>> index 8f994ffa9cd1..c7656f22278d 100644 >>>>> --- a/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c >>>>> +++ b/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c >>>>> @@ -155,6 +155,8 @@ bool amdgpu_si_is_smc_running(struct >>>>> amdgpu_device *adev) >>>>> u32 rst = RREG32_SMC(SMC_SYSCON_RESET_CNTL); >>>>> u32 clk = RREG32_SMC(SMC_SYSCON_CLOCK_CNTL_0); >>>>> + mb(); >>>>> + >>>>> if (!(rst & RST_REG) && !(clk & CK_DISABLE)) >>>>> return true; >>> In particular, it makes no sense in this specific place, since it cannot directly >> affect the values of rst & clk. >> >> I thinks so too. >> >> But when I do reboot test using nine desktop machines, there maybe report >> this error on one or two machines after Hundreds of times or Thousands of >> times reboot test, at the beginning, I use msleep() instead of mb(), these >> two methods are all works, but I don't know what is the root case. >> >> I use this method on other verdor's oland card, this error message are >> reported again. >> >> What could be the root reason? >> >> test environmen: >> >> graphics card: OLAND 0x1002:0x6611 0x1642:0x1869 0x87 >> >> driver: amdgpu >> >> os: ubuntu 2004 >> >> platform: arm64 >> >> kernel: 5.4.18 >>
On 11/24/2022 3:34 PM, Quan, Evan wrote: > [AMD Official Use Only - General] > > Could the attached patch help? > > Evan >> -----Original Message----- >> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of ??? >> Sent: Friday, November 18, 2022 5:25 PM >> To: Michel Dänzer <michel.daenzer@mailbox.org>; Koenig, Christian >> <Christian.Koenig@amd.com>; Deucher, Alexander >> <Alexander.Deucher@amd.com> >> Cc: amd-gfx@lists.freedesktop.org; Pan, Xinhui <Xinhui.Pan@amd.com>; >> linux-kernel@vger.kernel.org; dri-devel@lists.freedesktop.org >> Subject: Re: [PATCH] drm/amdgpu: add mb for si >> >> >> 在 2022/11/18 17:18, Michel Dänzer 写道: >>> On 11/18/22 09:01, Christian König wrote: >>>> Am 18.11.22 um 08:48 schrieb Zhenneng Li: >>>>> During reboot test on arm64 platform, it may failure on boot, so add >>>>> this mb in smc. >>>>> >>>>> The error message are as follows: >>>>> [ 6.996395][ 7] [ T295] [drm:amdgpu_device_ip_late_init >>>>> [amdgpu]] *ERROR* >>>>> late_init of IP block <si_dpm> failed -22 [ >>>>> 7.006919][ 7] [ T295] amdgpu 0000:04:00.0: The issue is happening in late_init() which eventually does ret = si_thermal_enable_alert(adev, false); Just before this, si_thermal_start_thermal_controller is called in hw_init and that enables thermal alert. Maybe the issue is with enable/disable of thermal alerts in quick succession. Adding a delay inside si_thermal_start_thermal_controller might help. Thanks, Lijo >>>>> amdgpu_device_ip_late_init failed [ 7.014224][ 7] [ T295] amdgpu >>>>> 0000:04:00.0: Fatal error during GPU init >>>> Memory barries are not supposed to be sprinkled around like this, you >> need to give a detailed explanation why this is necessary. >>>> >>>> Regards, >>>> Christian. >>>> >>>>> Signed-off-by: Zhenneng Li <lizhenneng@kylinos.cn> >>>>> --- >>>>> drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c | 2 ++ >>>>> 1 file changed, 2 insertions(+) >>>>> >>>>> diff --git a/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c >>>>> b/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c >>>>> index 8f994ffa9cd1..c7656f22278d 100644 >>>>> --- a/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c >>>>> +++ b/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c >>>>> @@ -155,6 +155,8 @@ bool amdgpu_si_is_smc_running(struct >>>>> amdgpu_device *adev) >>>>> u32 rst = RREG32_SMC(SMC_SYSCON_RESET_CNTL); >>>>> u32 clk = RREG32_SMC(SMC_SYSCON_CLOCK_CNTL_0); >>>>> + mb(); >>>>> + >>>>> if (!(rst & RST_REG) && !(clk & CK_DISABLE)) >>>>> return true; >>> In particular, it makes no sense in this specific place, since it cannot directly >> affect the values of rst & clk. >> >> I thinks so too. >> >> But when I do reboot test using nine desktop machines, there maybe report >> this error on one or two machines after Hundreds of times or Thousands of >> times reboot test, at the beginning, I use msleep() instead of mb(), these >> two methods are all works, but I don't know what is the root case. >> >> I use this method on other verdor's oland card, this error message are >> reported again. >> >> What could be the root reason? >> >> test environmen: >> >> graphics card: OLAND 0x1002:0x6611 0x1642:0x1869 0x87 >> >> driver: amdgpu >> >> os: ubuntu 2004 >> >> platform: arm64 >> >> kernel: 5.4.18 >> >>>
On 11/24/2022 4:11 PM, Lazar, Lijo wrote: > > > On 11/24/2022 3:34 PM, Quan, Evan wrote: >> [AMD Official Use Only - General] >> >> Could the attached patch help? >> >> Evan >>> -----Original Message----- >>> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of ??? >>> Sent: Friday, November 18, 2022 5:25 PM >>> To: Michel Dänzer <michel.daenzer@mailbox.org>; Koenig, Christian >>> <Christian.Koenig@amd.com>; Deucher, Alexander >>> <Alexander.Deucher@amd.com> >>> Cc: amd-gfx@lists.freedesktop.org; Pan, Xinhui <Xinhui.Pan@amd.com>; >>> linux-kernel@vger.kernel.org; dri-devel@lists.freedesktop.org >>> Subject: Re: [PATCH] drm/amdgpu: add mb for si >>> >>> >>> 在 2022/11/18 17:18, Michel Dänzer 写道: >>>> On 11/18/22 09:01, Christian König wrote: >>>>> Am 18.11.22 um 08:48 schrieb Zhenneng Li: >>>>>> During reboot test on arm64 platform, it may failure on boot, so add >>>>>> this mb in smc. >>>>>> >>>>>> The error message are as follows: >>>>>> [ 6.996395][ 7] [ T295] [drm:amdgpu_device_ip_late_init >>>>>> [amdgpu]] *ERROR* >>>>>> late_init of IP block <si_dpm> failed -22 [ >>>>>> 7.006919][ 7] [ T295] amdgpu 0000:04:00.0: > > The issue is happening in late_init() which eventually does > > ret = si_thermal_enable_alert(adev, false); > > Just before this, si_thermal_start_thermal_controller is called in > hw_init and that enables thermal alert. > > Maybe the issue is with enable/disable of thermal alerts in quick > succession. Adding a delay inside si_thermal_start_thermal_controller > might help. > On a second look, temperature range is already set as part of si_thermal_start_thermal_controller in hw_init https://elixir.bootlin.com/linux/v6.1-rc6/source/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c#L6780 There is no need to set it again here - https://elixir.bootlin.com/linux/v6.1-rc6/source/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c#L7635 I think it is safe to remove the call from late_init altogether. Alex/Evan? Thanks, Lijo > Thanks, > Lijo > >>>>>> amdgpu_device_ip_late_init failed [ 7.014224][ 7] [ T295] amdgpu >>>>>> 0000:04:00.0: Fatal error during GPU init >>>>> Memory barries are not supposed to be sprinkled around like this, you >>> need to give a detailed explanation why this is necessary. >>>>> >>>>> Regards, >>>>> Christian. >>>>> >>>>>> Signed-off-by: Zhenneng Li <lizhenneng@kylinos.cn> >>>>>> --- >>>>>> drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c | 2 ++ >>>>>> 1 file changed, 2 insertions(+) >>>>>> >>>>>> diff --git a/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c >>>>>> b/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c >>>>>> index 8f994ffa9cd1..c7656f22278d 100644 >>>>>> --- a/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c >>>>>> +++ b/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c >>>>>> @@ -155,6 +155,8 @@ bool amdgpu_si_is_smc_running(struct >>>>>> amdgpu_device *adev) >>>>>> u32 rst = RREG32_SMC(SMC_SYSCON_RESET_CNTL); >>>>>> u32 clk = RREG32_SMC(SMC_SYSCON_CLOCK_CNTL_0); >>>>>> + mb(); >>>>>> + >>>>>> if (!(rst & RST_REG) && !(clk & CK_DISABLE)) >>>>>> return true; >>>> In particular, it makes no sense in this specific place, since it >>>> cannot directly >>> affect the values of rst & clk. >>> >>> I thinks so too. >>> >>> But when I do reboot test using nine desktop machines, there maybe >>> report >>> this error on one or two machines after Hundreds of times or >>> Thousands of >>> times reboot test, at the beginning, I use msleep() instead of mb(), >>> these >>> two methods are all works, but I don't know what is the root case. >>> >>> I use this method on other verdor's oland card, this error message are >>> reported again. >>> >>> What could be the root reason? >>> >>> test environmen: >>> >>> graphics card: OLAND 0x1002:0x6611 0x1642:0x1869 0x87 >>> >>> driver: amdgpu >>> >>> os: ubuntu 2004 >>> >>> platform: arm64 >>> >>> kernel: 5.4.18 >>> >>>>
[AMD Official Use Only - General] Did you see that? It's a patch which I created by git-format-patch. Anyway I will paste the changes below. I was suspecting maybe we need some waits for smu running. diff --git a/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c b/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c index 49c398ec0aaf..9f308a021b2d 100644 --- a/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c +++ b/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c @@ -6814,6 +6814,7 @@ static int si_dpm_enable(struct amdgpu_device *adev) struct si_power_info *si_pi = si_get_pi(adev); struct amdgpu_ps *boot_ps = adev->pm.dpm.boot_ps; int ret; + int i; if (amdgpu_si_is_smc_running(adev)) return -EINVAL; @@ -6909,6 +6910,17 @@ static int si_dpm_enable(struct amdgpu_device *adev) si_program_response_times(adev); si_program_ds_registers(adev); si_dpm_start_smc(adev); + /* Waiting for smc alive */ + for (i = 0; i < adev->usec_timeout; i++) { + if (amdgpu_si_is_smc_running(adev)) + break; + udelay(1); + } + if (i >= adev->usec_timeout) { + DRM_ERROR("Timedout on waiting for smu running\n"); + return -EINVAL; + } + ret = si_notify_smc_display_change(adev, false); if (ret) { DRM_ERROR("si_notify_smc_display_change failed\n"); BR Evan > -----Original Message----- > From: Christian König <ckoenig.leichtzumerken@gmail.com> > Sent: Thursday, November 24, 2022 6:06 PM > To: Quan, Evan <Evan.Quan@amd.com>; 李真能 <lizhenneng@kylinos.cn>; > Michel Dänzer <michel.daenzer@mailbox.org>; Koenig, Christian > <Christian.Koenig@amd.com>; Deucher, Alexander > <Alexander.Deucher@amd.com> > Cc: dri-devel@lists.freedesktop.org; Pan, Xinhui <Xinhui.Pan@amd.com>; > linux-kernel@vger.kernel.org; amd-gfx@lists.freedesktop.org > Subject: Re: [PATCH] drm/amdgpu: add mb for si > > That's not a patch but some binary file? > > Christian. > > Am 24.11.22 um 11:04 schrieb Quan, Evan: > > [AMD Official Use Only - General] > > > > Could the attached patch help? > > > > Evan > >> -----Original Message----- > >> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf > Of ??? > >> Sent: Friday, November 18, 2022 5:25 PM > >> To: Michel Dänzer <michel.daenzer@mailbox.org>; Koenig, Christian > >> <Christian.Koenig@amd.com>; Deucher, Alexander > >> <Alexander.Deucher@amd.com> > >> Cc: amd-gfx@lists.freedesktop.org; Pan, Xinhui <Xinhui.Pan@amd.com>; > >> linux-kernel@vger.kernel.org; dri-devel@lists.freedesktop.org > >> Subject: Re: [PATCH] drm/amdgpu: add mb for si > >> > >> > >> 在 2022/11/18 17:18, Michel Dänzer 写道: > >>> On 11/18/22 09:01, Christian König wrote: > >>>> Am 18.11.22 um 08:48 schrieb Zhenneng Li: > >>>>> During reboot test on arm64 platform, it may failure on boot, so > >>>>> add this mb in smc. > >>>>> > >>>>> The error message are as follows: > >>>>> [ 6.996395][ 7] [ T295] [drm:amdgpu_device_ip_late_init > >>>>> [amdgpu]] *ERROR* > >>>>> late_init of IP block <si_dpm> failed -22 [ > >>>>> 7.006919][ 7] [ T295] amdgpu 0000:04:00.0: > >>>>> amdgpu_device_ip_late_init failed [ 7.014224][ 7] [ T295] > >>>>> amdgpu > >>>>> 0000:04:00.0: Fatal error during GPU init > >>>> Memory barries are not supposed to be sprinkled around like this, > >>>> you > >> need to give a detailed explanation why this is necessary. > >>>> Regards, > >>>> Christian. > >>>> > >>>>> Signed-off-by: Zhenneng Li <lizhenneng@kylinos.cn> > >>>>> --- > >>>>> drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c | 2 ++ > >>>>> 1 file changed, 2 insertions(+) > >>>>> > >>>>> diff --git a/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c > >>>>> b/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c > >>>>> index 8f994ffa9cd1..c7656f22278d 100644 > >>>>> --- a/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c > >>>>> +++ b/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c > >>>>> @@ -155,6 +155,8 @@ bool amdgpu_si_is_smc_running(struct > >>>>> amdgpu_device *adev) > >>>>> u32 rst = RREG32_SMC(SMC_SYSCON_RESET_CNTL); > >>>>> u32 clk = RREG32_SMC(SMC_SYSCON_CLOCK_CNTL_0); > >>>>> + mb(); > >>>>> + > >>>>> if (!(rst & RST_REG) && !(clk & CK_DISABLE)) > >>>>> return true; > >>> In particular, it makes no sense in this specific place, since it > >>> cannot directly > >> affect the values of rst & clk. > >> > >> I thinks so too. > >> > >> But when I do reboot test using nine desktop machines, there maybe > >> report this error on one or two machines after Hundreds of times or > >> Thousands of times reboot test, at the beginning, I use msleep() > >> instead of mb(), these two methods are all works, but I don't know what > is the root case. > >> > >> I use this method on other verdor's oland card, this error message > >> are reported again. > >> > >> What could be the root reason? > >> > >> test environmen: > >> > >> graphics card: OLAND 0x1002:0x6611 0x1642:0x1869 0x87 > >> > >> driver: amdgpu > >> > >> os: ubuntu 2004 > >> > >> platform: arm64 > >> > >> kernel: 5.4.18 > >>
[AMD Official Use Only - General] > -----Original Message----- > From: Lazar, Lijo <Lijo.Lazar@amd.com> > Sent: Thursday, November 24, 2022 6:49 PM > To: Quan, Evan <Evan.Quan@amd.com>; 李真能 <lizhenneng@kylinos.cn>; > Michel Dänzer <michel.daenzer@mailbox.org>; Koenig, Christian > <Christian.Koenig@amd.com>; Deucher, Alexander > <Alexander.Deucher@amd.com> > Cc: amd-gfx@lists.freedesktop.org; Pan, Xinhui <Xinhui.Pan@amd.com>; > linux-kernel@vger.kernel.org; dri-devel@lists.freedesktop.org > Subject: Re: [PATCH] drm/amdgpu: add mb for si > > > > On 11/24/2022 4:11 PM, Lazar, Lijo wrote: > > > > > > On 11/24/2022 3:34 PM, Quan, Evan wrote: > >> [AMD Official Use Only - General] > >> > >> Could the attached patch help? > >> > >> Evan > >>> -----Original Message----- > >>> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf > Of ??? > >>> Sent: Friday, November 18, 2022 5:25 PM > >>> To: Michel Dänzer <michel.daenzer@mailbox.org>; Koenig, Christian > >>> <Christian.Koenig@amd.com>; Deucher, Alexander > >>> <Alexander.Deucher@amd.com> > >>> Cc: amd-gfx@lists.freedesktop.org; Pan, Xinhui <Xinhui.Pan@amd.com>; > >>> linux-kernel@vger.kernel.org; dri-devel@lists.freedesktop.org > >>> Subject: Re: [PATCH] drm/amdgpu: add mb for si > >>> > >>> > >>> 在 2022/11/18 17:18, Michel Dänzer 写道: > >>>> On 11/18/22 09:01, Christian König wrote: > >>>>> Am 18.11.22 um 08:48 schrieb Zhenneng Li: > >>>>>> During reboot test on arm64 platform, it may failure on boot, so > >>>>>> add this mb in smc. > >>>>>> > >>>>>> The error message are as follows: > >>>>>> [ 6.996395][ 7] [ T295] [drm:amdgpu_device_ip_late_init > >>>>>> [amdgpu]] *ERROR* > >>>>>> late_init of IP block <si_dpm> failed -22 [ > >>>>>> 7.006919][ 7] [ T295] amdgpu 0000:04:00.0: > > > > The issue is happening in late_init() which eventually does > > > > ret = si_thermal_enable_alert(adev, false); > > > > Just before this, si_thermal_start_thermal_controller is called in > > hw_init and that enables thermal alert. > > > > Maybe the issue is with enable/disable of thermal alerts in quick > > succession. Adding a delay inside si_thermal_start_thermal_controller > > might help. > > > > On a second look, temperature range is already set as part of > si_thermal_start_thermal_controller in hw_init > https://elixir.bootlin.com/linux/v6.1- > rc6/source/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c#L6780 > > There is no need to set it again here - > > https://elixir.bootlin.com/linux/v6.1- > rc6/source/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c#L7635 > > I think it is safe to remove the call from late_init altogether. Alex/Evan? > [Quan, Evan] Yes, it makes sense to me. But I'm not sure whether that’s related with the issue here. Since per my understandings, if the issue is caused by double calling of thermal_alert enablement, it will fail every time. That cannot explain why adding some delays or a mb() calling can help. BR Evan > Thanks, > Lijo > > > Thanks, > > Lijo > > > >>>>>> amdgpu_device_ip_late_init failed [ 7.014224][ 7] [ T295] amdgpu > >>>>>> 0000:04:00.0: Fatal error during GPU init > >>>>> Memory barries are not supposed to be sprinkled around like this, > you > >>> need to give a detailed explanation why this is necessary. > >>>>> > >>>>> Regards, > >>>>> Christian. > >>>>> > >>>>>> Signed-off-by: Zhenneng Li <lizhenneng@kylinos.cn> > >>>>>> --- > >>>>>> drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c | 2 ++ > >>>>>> 1 file changed, 2 insertions(+) > >>>>>> > >>>>>> diff --git a/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c > >>>>>> b/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c > >>>>>> index 8f994ffa9cd1..c7656f22278d 100644 > >>>>>> --- a/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c > >>>>>> +++ b/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c > >>>>>> @@ -155,6 +155,8 @@ bool amdgpu_si_is_smc_running(struct > >>>>>> amdgpu_device *adev) > >>>>>> u32 rst = RREG32_SMC(SMC_SYSCON_RESET_CNTL); > >>>>>> u32 clk = RREG32_SMC(SMC_SYSCON_CLOCK_CNTL_0); > >>>>>> + mb(); > >>>>>> + > >>>>>> if (!(rst & RST_REG) && !(clk & CK_DISABLE)) > >>>>>> return true; > >>>> In particular, it makes no sense in this specific place, since it > >>>> cannot directly > >>> affect the values of rst & clk. > >>> > >>> I thinks so too. > >>> > >>> But when I do reboot test using nine desktop machines, there maybe > >>> report > >>> this error on one or two machines after Hundreds of times or > >>> Thousands of > >>> times reboot test, at the beginning, I use msleep() instead of mb(), > >>> these > >>> two methods are all works, but I don't know what is the root case. > >>> > >>> I use this method on other verdor's oland card, this error message are > >>> reported again. > >>> > >>> What could be the root reason? > >>> > >>> test environmen: > >>> > >>> graphics card: OLAND 0x1002:0x6611 0x1642:0x1869 0x87 > >>> > >>> driver: amdgpu > >>> > >>> os: ubuntu 2004 > >>> > >>> platform: arm64 > >>> > >>> kernel: 5.4.18 > >>> > >>>>
On 11/25/2022 7:43 AM, Quan, Evan wrote: > [AMD Official Use Only - General] > > > >> -----Original Message----- >> From: Lazar, Lijo <Lijo.Lazar@amd.com> >> Sent: Thursday, November 24, 2022 6:49 PM >> To: Quan, Evan <Evan.Quan@amd.com>; 李真能 <lizhenneng@kylinos.cn>; >> Michel Dänzer <michel.daenzer@mailbox.org>; Koenig, Christian >> <Christian.Koenig@amd.com>; Deucher, Alexander >> <Alexander.Deucher@amd.com> >> Cc: amd-gfx@lists.freedesktop.org; Pan, Xinhui <Xinhui.Pan@amd.com>; >> linux-kernel@vger.kernel.org; dri-devel@lists.freedesktop.org >> Subject: Re: [PATCH] drm/amdgpu: add mb for si >> >> >> >> On 11/24/2022 4:11 PM, Lazar, Lijo wrote: >>> >>> On 11/24/2022 3:34 PM, Quan, Evan wrote: >>>> [AMD Official Use Only - General] >>>> >>>> Could the attached patch help? >>>> >>>> Evan >>>>> -----Original Message----- >>>>> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf >> Of ??? >>>>> Sent: Friday, November 18, 2022 5:25 PM >>>>> To: Michel Dänzer <michel.daenzer@mailbox.org>; Koenig, Christian >>>>> <Christian.Koenig@amd.com>; Deucher, Alexander >>>>> <Alexander.Deucher@amd.com> >>>>> Cc: amd-gfx@lists.freedesktop.org; Pan, Xinhui <Xinhui.Pan@amd.com>; >>>>> linux-kernel@vger.kernel.org; dri-devel@lists.freedesktop.org >>>>> Subject: Re: [PATCH] drm/amdgpu: add mb for si >>>>> >>>>> >>>>> 在 2022/11/18 17:18, Michel Dänzer 写道: >>>>>> On 11/18/22 09:01, Christian König wrote: >>>>>>> Am 18.11.22 um 08:48 schrieb Zhenneng Li: >>>>>>>> During reboot test on arm64 platform, it may failure on boot, so >>>>>>>> add this mb in smc. >>>>>>>> >>>>>>>> The error message are as follows: >>>>>>>> [ 6.996395][ 7] [ T295] [drm:amdgpu_device_ip_late_init >>>>>>>> [amdgpu]] *ERROR* >>>>>>>> late_init of IP block <si_dpm> failed -22 [ >>>>>>>> 7.006919][ 7] [ T295] amdgpu 0000:04:00.0: >>> The issue is happening in late_init() which eventually does >>> >>> ret = si_thermal_enable_alert(adev, false); >>> >>> Just before this, si_thermal_start_thermal_controller is called in >>> hw_init and that enables thermal alert. >>> >>> Maybe the issue is with enable/disable of thermal alerts in quick >>> succession. Adding a delay inside si_thermal_start_thermal_controller >>> might help. >>> >> On a second look, temperature range is already set as part of >> si_thermal_start_thermal_controller in hw_init >> https://elixir.bootlin.com/linux/v6.1- >> rc6/source/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c#L6780 >> >> There is no need to set it again here - >> >> https://elixir.bootlin.com/linux/v6.1- >> rc6/source/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c#L7635 >> >> I think it is safe to remove the call from late_init altogether. Alex/Evan? >> > [Quan, Evan] Yes, it makes sense to me. But I'm not sure whether that’s related with the issue here. > Since per my understandings, if the issue is caused by double calling of thermal_alert enablement, it will fail every time. > That cannot explain why adding some delays or a mb() calling can help. The side effect of the patch is just some random delay introduced for every SMC message The issue happens in late_init(). Between late_init() and dpm enablement, there are many smc messages sent which don't have this issue. So I think the issue is not with FW not running. Thus the only case I see is enable/disable of thermal alert in random succession. Thanks, Lijo > BR > Evan >> Thanks, >> Lijo >> >>> Thanks, >>> Lijo >>> >>>>>>>> amdgpu_device_ip_late_init failed [ 7.014224][ 7] [ T295] amdgpu >>>>>>>> 0000:04:00.0: Fatal error during GPU init >>>>>>> Memory barries are not supposed to be sprinkled around like this, >> you >>>>> need to give a detailed explanation why this is necessary. >>>>>>> Regards, >>>>>>> Christian. >>>>>>> >>>>>>>> Signed-off-by: Zhenneng Li <lizhenneng@kylinos.cn> >>>>>>>> --- >>>>>>>> drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c | 2 ++ >>>>>>>> 1 file changed, 2 insertions(+) >>>>>>>> >>>>>>>> diff --git a/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c >>>>>>>> b/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c >>>>>>>> index 8f994ffa9cd1..c7656f22278d 100644 >>>>>>>> --- a/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c >>>>>>>> +++ b/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c >>>>>>>> @@ -155,6 +155,8 @@ bool amdgpu_si_is_smc_running(struct >>>>>>>> amdgpu_device *adev) >>>>>>>> u32 rst = RREG32_SMC(SMC_SYSCON_RESET_CNTL); >>>>>>>> u32 clk = RREG32_SMC(SMC_SYSCON_CLOCK_CNTL_0); >>>>>>>> + mb(); >>>>>>>> + >>>>>>>> if (!(rst & RST_REG) && !(clk & CK_DISABLE)) >>>>>>>> return true; >>>>>> In particular, it makes no sense in this specific place, since it >>>>>> cannot directly >>>>> affect the values of rst & clk. >>>>> >>>>> I thinks so too. >>>>> >>>>> But when I do reboot test using nine desktop machines, there maybe >>>>> report >>>>> this error on one or two machines after Hundreds of times or >>>>> Thousands of >>>>> times reboot test, at the beginning, I use msleep() instead of mb(), >>>>> these >>>>> two methods are all works, but I don't know what is the root case. >>>>> >>>>> I use this method on other verdor's oland card, this error message are >>>>> reported again. >>>>> >>>>> What could be the root reason? >>>>> >>>>> test environmen: >>>>> >>>>> graphics card: OLAND 0x1002:0x6611 0x1642:0x1869 0x87 >>>>> >>>>> driver: amdgpu >>>>> >>>>> os: ubuntu 2004 >>>>> >>>>> platform: arm64 >>>>> >>>>> kernel: 5.4.18 >>>>>
diff --git a/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c b/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c index 8f994ffa9cd1..c7656f22278d 100644 --- a/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c +++ b/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c @@ -155,6 +155,8 @@ bool amdgpu_si_is_smc_running(struct amdgpu_device *adev) u32 rst = RREG32_SMC(SMC_SYSCON_RESET_CNTL); u32 clk = RREG32_SMC(SMC_SYSCON_CLOCK_CNTL_0); + mb(); + if (!(rst & RST_REG) && !(clk & CK_DISABLE)) return true;
During reboot test on arm64 platform, it may failure on boot, so add this mb in smc. The error message are as follows: [ 6.996395][ 7] [ T295] [drm:amdgpu_device_ip_late_init [amdgpu]] *ERROR* late_init of IP block <si_dpm> failed -22 [ 7.006919][ 7] [ T295] amdgpu 0000:04:00.0: amdgpu_device_ip_late_init failed [ 7.014224][ 7] [ T295] amdgpu 0000:04:00.0: Fatal error during GPU init Signed-off-by: Zhenneng Li <lizhenneng@kylinos.cn> --- drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c | 2 ++ 1 file changed, 2 insertions(+)