diff mbox series

[3/3] drm/amdgpu/vcn: Add parameter to force (en/dis)abling indirect SRAM mode

Message ID 20230116212004.860968-3-gpiccoli@igalia.com (mailing list archive)
State New, archived
Headers show
Series [1/3] drm/amdgpu/vcn: Adjust firmware names indentation | expand

Commit Message

Guilherme G. Piccoli Jan. 16, 2023, 9:20 p.m. UTC
Currently the FW loading path perform some checks based on IP model
and in case it is advertised as supported, the VCN indirect SRAM
mode is used.

Happens that in case there's any issue on FW and this mode ends-up
not being properly supported, the driver probe fails [0]. Debugging
this requires driver rebuilding, so to allow fast debug and experiments,
add a parameter to force setting indirect SRAM mode to true/false from
the kernel command-line; parameter default is -1, which doesn't change
the current driver's behavior.

[0] Example of this issue, observed on Steam Deck:

[drm] kiq ring mec 2 pipe 1 q 0
[drm] failed to load ucode VCN0_RAM(0x3A)
[drm] psp gfx command LOAD_IP_FW(0x6) failed and response status is (0xFFFF0000)
amdgpu 0000:04:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring vcn_dec_0 test failed (-110)
[drm:amdgpu_device_init.cold [amdgpu]] *ERROR* hw_init of IP block <vcn_v3_0> failed -110
amdgpu 0000:04:00.0: amdgpu: amdgpu_device_ip_init failed
amdgpu 0000:04:00.0: amdgpu: Fatal error during GPU init

Cc: James Zhu <James.Zhu@amd.com>
Cc: Lazar Lijo <Lijo.Lazar@amd.com>
Cc: Leo Liu <leo.liu@amd.com>
Cc: Mario Limonciello <mario.limonciello@amd.com>
Cc: Sonny Jiang <sonny.jiang@amd.com>
Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
---


This work is based on agd5f/amd-staging-drm-next branch.
Thanks in advance for reviews/comments!
Cheers,

Guilherme


 drivers/gpu/drm/amd/amdgpu/amdgpu.h     | 1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 9 +++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c | 3 +++
 3 files changed, 13 insertions(+)

Comments

Alex Deucher Jan. 16, 2023, 9:50 p.m. UTC | #1
On Mon, Jan 16, 2023 at 4:21 PM Guilherme G. Piccoli
<gpiccoli@igalia.com> wrote:
>
> Currently the FW loading path perform some checks based on IP model
> and in case it is advertised as supported, the VCN indirect SRAM
> mode is used.
>
> Happens that in case there's any issue on FW and this mode ends-up
> not being properly supported, the driver probe fails [0]. Debugging
> this requires driver rebuilding, so to allow fast debug and experiments,
> add a parameter to force setting indirect SRAM mode to true/false from
> the kernel command-line; parameter default is -1, which doesn't change
> the current driver's behavior.

Probably a question for Leo or James, but I was under the impression
non-indirect mode didn't work on production silicon for most chips,
but maybe I'm mis-remembering that.

Alex


>
> [0] Example of this issue, observed on Steam Deck:
>
> [drm] kiq ring mec 2 pipe 1 q 0
> [drm] failed to load ucode VCN0_RAM(0x3A)
> [drm] psp gfx command LOAD_IP_FW(0x6) failed and response status is (0xFFFF0000)
> amdgpu 0000:04:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring vcn_dec_0 test failed (-110)
> [drm:amdgpu_device_init.cold [amdgpu]] *ERROR* hw_init of IP block <vcn_v3_0> failed -110
> amdgpu 0000:04:00.0: amdgpu: amdgpu_device_ip_init failed
> amdgpu 0000:04:00.0: amdgpu: Fatal error during GPU init
>
> Cc: James Zhu <James.Zhu@amd.com>
> Cc: Lazar Lijo <Lijo.Lazar@amd.com>
> Cc: Leo Liu <leo.liu@amd.com>
> Cc: Mario Limonciello <mario.limonciello@amd.com>
> Cc: Sonny Jiang <sonny.jiang@amd.com>
> Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
> ---
>
>
> This work is based on agd5f/amd-staging-drm-next branch.
> Thanks in advance for reviews/comments!
> Cheers,
>
> Guilherme
>
>
>  drivers/gpu/drm/amd/amdgpu/amdgpu.h     | 1 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 9 +++++++++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c | 3 +++
>  3 files changed, 13 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index 872450a3a164..5d3c92c94f18 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -215,6 +215,7 @@ extern int amdgpu_noretry;
>  extern int amdgpu_force_asic_type;
>  extern int amdgpu_smartshift_bias;
>  extern int amdgpu_use_xgmi_p2p;
> +extern int amdgpu_indirect_sram;
>  #ifdef CONFIG_HSA_AMD
>  extern int sched_policy;
>  extern bool debug_evictions;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index 06aba201d4db..c7182c0bc841 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -187,6 +187,7 @@ int amdgpu_num_kcq = -1;
>  int amdgpu_smartshift_bias;
>  int amdgpu_use_xgmi_p2p = 1;
>  int amdgpu_vcnfw_log;
> +int amdgpu_indirect_sram = -1;
>
>  static void amdgpu_drv_delayed_reset_work_handler(struct work_struct *work);
>
> @@ -941,6 +942,14 @@ MODULE_PARM_DESC(smu_pptable_id,
>         "specify pptable id to be used (-1 = auto(default) value, 0 = use pptable from vbios, > 0 = soft pptable id)");
>  module_param_named(smu_pptable_id, amdgpu_smu_pptable_id, int, 0444);
>
> +/**
> + * DOC: indirect_sram (int)
> + * Allow users to force using (or not) the VCN indirect SRAM mode in the fw load
> + * code. Default is -1, meaning auto (aka, don't mess with driver's behavior).
> + */
> +MODULE_PARM_DESC(indirect_sram, "Force VCN indirect SRAM (-1 = auto (default), 0 = disabled, 1 = enabled)");
> +module_param_named(indirect_sram, amdgpu_indirect_sram, int, 0444);
> +
>  /* These devices are not supported by amdgpu.
>   * They are supported by the mach64, r128, radeon drivers
>   */
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
> index 1f880e162d9d..a2290087e01c 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
> @@ -137,6 +137,9 @@ int amdgpu_vcn_sw_init(struct amdgpu_device *adev)
>                 return -EINVAL;
>         }
>
> +       if (amdgpu_indirect_sram >= 0)
> +               adev->vcn.indirect_sram = (bool)amdgpu_indirect_sram;
> +
>         hdr = (const struct common_firmware_header *)adev->vcn.fw->data;
>         adev->vcn.fw_version = le32_to_cpu(hdr->ucode_version);
>
> --
> 2.39.0
>
Leo Liu Jan. 16, 2023, 10:27 p.m. UTC | #2
Secure part requires PSP load VCN boot sequence which is with indirect sram mode.

Regards,
Leo
Guilherme G. Piccoli Jan. 16, 2023, 10:34 p.m. UTC | #3
On 16/01/2023 19:27, Liu, Leo wrote:
> Secure part requires PSP load VCN boot sequence which is with indirect
> sram mode.
> 
> Regards,
> Leo

With that said, and with the plans of just setting the indirect sram
mode nevertheless (with no switch/cases anymore - I'll submit the V2
tomorrow if everybody agrees), does anybody oppose to have this
parameter for debug reasons?

I first considered doing through debugfs, but obviously was a (very) bad
idea, since this should be an early boot tuning heheh

Cheers,


Guilherme
Alex Deucher Jan. 16, 2023, 11 p.m. UTC | #4
On Mon, Jan 16, 2023 at 5:34 PM Guilherme G. Piccoli
<gpiccoli@igalia.com> wrote:
>
> On 16/01/2023 19:27, Liu, Leo wrote:
> > Secure part requires PSP load VCN boot sequence which is with indirect
> > sram mode.
> >
> > Regards,
> > Leo
>
> With that said, and with the plans of just setting the indirect sram
> mode nevertheless (with no switch/cases anymore - I'll submit the V2
> tomorrow if everybody agrees), does anybody oppose to have this
> parameter for debug reasons?

It's not clear to me when this would be used.  We only disable it
briefly during new asic bring up, after that we never touch it again.
No end user on production hardware should ever have to change it and
doing so would break VCN on their system.

Alex

>
> I first considered doing through debugfs, but obviously was a (very) bad
> idea, since this should be an early boot tuning heheh
>
> Cheers,
>
>
> Guilherme
Guilherme G. Piccoli Jan. 17, 2023, 12:47 a.m. UTC | #5
On 16/01/2023 20:00, Alex Deucher wrote:
> [...]
> 
> It's not clear to me when this would be used.  We only disable it
> briefly during new asic bring up, after that we never touch it again.
> No end user on production hardware should ever have to change it and
> doing so would break VCN on their system.
> 
> Alex

[+ Pierre-Loup]

Steam Deck is facing a pretty weird scenario then heheh

Commit 82132ecc543 ("drm/amdgpu: enable Vangogh VCN indirect sram mode")
corrected a long-term issue in which the indirect SRAM mode wasn't
enabled for Vangogh - and Deck GPU architecture is Vangogh, so it was
working perfectly with that disabled.

Happens that a bad FW candidate seems to have broken something - it was
a bit convoluted to debug, but we proved that disabling indirect SRAM is
a good workaround so far, it "restored the behavior" pre-82132ecc543.

Hence my proposal - this parameter would've made life so much easier,
and we're start using it "downstream" now. While I understand that of
course the FW should be fixed, meanwhile this is a cheap solution to
allow further debug and real use of the system.

Thanks,


Guilherme
Limonciello, Mario Jan. 17, 2023, 2:33 a.m. UTC | #6
On 1/16/2023 18:47, Guilherme G. Piccoli wrote:
> On 16/01/2023 20:00, Alex Deucher wrote:
>> [...]
>>
>> It's not clear to me when this would be used.  We only disable it
>> briefly during new asic bring up, after that we never touch it again.
>> No end user on production hardware should ever have to change it and
>> doing so would break VCN on their system.
>>
>> Alex
> 
> [+ Pierre-Loup]
> 
> Steam Deck is facing a pretty weird scenario then heheh
> 
> Commit 82132ecc543 ("drm/amdgpu: enable Vangogh VCN indirect sram mode")
> corrected a long-term issue in which the indirect SRAM mode wasn't
> enabled for Vangogh - and Deck GPU architecture is Vangogh, so it was
> working perfectly with that disabled.
> 
> Happens that a bad FW candidate seems to have broken something - it was
> a bit convoluted to debug, but we proved that disabling indirect SRAM is
> a good workaround so far, it "restored the behavior" pre-82132ecc543.
> 
> Hence my proposal - this parameter would've made life so much easier,
> and we're start using it "downstream" now. While I understand that of
> course the FW should be fixed, meanwhile this is a cheap solution to
> allow further debug and real use of the system.
> 

For debugging these type of problems, I think an effective debugging 
tactic would have been to mask the IP block (amdgpu.ip_block_mask).
Guilherme G. Piccoli Jan. 17, 2023, 2:33 p.m. UTC | #7
On 16/01/2023 23:33, Limonciello, Mario wrote:
> [...]
> 
> For debugging these type of problems, I think an effective debugging 
> tactic would have been to mask the IP block (amdgpu.ip_block_mask).

Thank you, it worked indeed - nice suggestion!

Though I see two problems with that: first, I'm not sure what's the
impact in the GPU functioning when I disable some IP block.

Second, the parameter is a bit hard to figure - we need to clear a bit
for the IP block we want to disable, and the doc suggest to read on
dmesg to get this information (it seems it changes depending on the HW
model), but I couldn't parse the proper bit from dmesg. Needed to
instrument the kernel to find the proper bit heh

The second part is easy to improve (we can just show this bit in
dmesg!), I might do that instead of proposing this parameter, which
seems didn't raise much excitement after all heheh

Finally, I'm still curious on why Deck was working fine with the
indirect SRAM mode disabled (by mistake) in many kernels - was it in
practice the same as disabling the VCN IP block?

Thanks,


Guilherme
Alex Deucher Jan. 17, 2023, 3:11 p.m. UTC | #8
On Tue, Jan 17, 2023 at 9:33 AM Guilherme G. Piccoli
<gpiccoli@igalia.com> wrote:
>
> On 16/01/2023 23:33, Limonciello, Mario wrote:
> > [...]
> >
> > For debugging these type of problems, I think an effective debugging
> > tactic would have been to mask the IP block (amdgpu.ip_block_mask).
>
> Thank you, it worked indeed - nice suggestion!
>
> Though I see two problems with that: first, I'm not sure what's the
> impact in the GPU functioning when I disable some IP block.
>
> Second, the parameter is a bit hard to figure - we need to clear a bit
> for the IP block we want to disable, and the doc suggest to read on
> dmesg to get this information (it seems it changes depending on the HW
> model), but I couldn't parse the proper bit from dmesg. Needed to
> instrument the kernel to find the proper bit heh
>
> The second part is easy to improve (we can just show this bit in
> dmesg!), I might do that instead of proposing this parameter, which
> seems didn't raise much excitement after all heheh
>
> Finally, I'm still curious on why Deck was working fine with the
> indirect SRAM mode disabled (by mistake) in many kernels - was it in
> practice the same as disabling the VCN IP block?

IIRC, it depends on the fuse recipe for the particular ASIC.

Alex


>
> Thanks,
>
>
> Guilherme
>
Limonciello, Mario Jan. 17, 2023, 4:24 p.m. UTC | #9
[Public]



> -----Original Message-----
> From: Alex Deucher <alexdeucher@gmail.com>
> Sent: Tuesday, January 17, 2023 09:11
> To: Guilherme G. Piccoli <gpiccoli@igalia.com>
> Cc: Limonciello, Mario <Mario.Limonciello@amd.com>; Liu, Leo
> <Leo.Liu@amd.com>; amd-gfx@lists.freedesktop.org; Jiang, Sonny
> <Sonny.Jiang@amd.com>; kernel@gpiccoli.net; Pan, Xinhui
> <Xinhui.Pan@amd.com>; dri-devel@lists.freedesktop.org; Lazar, Lijo
> <Lijo.Lazar@amd.com>; kernel-dev@igalia.com; Deucher, Alexander
> <Alexander.Deucher@amd.com>; Zhu, James <James.Zhu@amd.com>;
> Koenig, Christian <Christian.Koenig@amd.com>; Pierre-Loup Griffais
> <pgriffais@valvesoftware.com>
> Subject: Re: [PATCH 3/3] drm/amdgpu/vcn: Add parameter to force
> (en/dis)abling indirect SRAM mode
> 
> On Tue, Jan 17, 2023 at 9:33 AM Guilherme G. Piccoli
> <gpiccoli@igalia.com> wrote:
> >
> > On 16/01/2023 23:33, Limonciello, Mario wrote:
> > > [...]
> > >
> > > For debugging these type of problems, I think an effective debugging
> > > tactic would have been to mask the IP block (amdgpu.ip_block_mask).
> >
> > Thank you, it worked indeed - nice suggestion!
> >
> > Though I see two problems with that: first, I'm not sure what's the
> > impact in the GPU functioning when I disable some IP block.
> >

It depends on the individual block what the impact is.  For example
if you don't have VCN, then you can't do any accelerated video playback.

> > Second, the parameter is a bit hard to figure - we need to clear a bit
> > for the IP block we want to disable, and the doc suggest to read on
> > dmesg to get this information (it seems it changes depending on the HW
> > model), but I couldn't parse the proper bit from dmesg. Needed to
> > instrument the kernel to find the proper bit heh
> >

Isn't it this stuff (taken from a CZN system):

[    7.797779] [drm] add ip block number 0 <soc15_common>
[    7.797781] [drm] add ip block number 1 <gmc_v9_0>
[    7.797782] [drm] add ip block number 2 <vega10_ih>
[    7.797783] [drm] add ip block number 3 <psp>
[    7.797783] [drm] add ip block number 4 <smu>
[    7.797784] [drm] add ip block number 5 <dm>
[    7.797785] [drm] add ip block number 6 <gfx_v9_0>
[    7.797786] [drm] add ip block number 7 <sdma_v4_0>
[    7.797787] [drm] add ip block number 8 <vcn_v2_0>
[    7.797788] [drm] add ip block number 9 <jpeg_v2_0>

So for that system it would be bit 8 to disable vcn.

In terms of how debugging would work:
I would expect when you get your failure it will have been the previous
block # that failed, and so you can reboot with that block masked and
see if you get further.

> > The second part is easy to improve (we can just show this bit in
> > dmesg!), I might do that instead of proposing this parameter, which
> > seems didn't raise much excitement after all heheh
> >
> > Finally, I'm still curious on why Deck was working fine with the
> > indirect SRAM mode disabled (by mistake) in many kernels - was it in
> > practice the same as disabling the VCN IP block?
> 
> IIRC, it depends on the fuse recipe for the particular ASIC.
> 
> Alex
> 
> 
> >
> > Thanks,
> >
> >
> > Guilherme
> >
Guilherme G. Piccoli Jan. 17, 2023, 4:34 p.m. UTC | #10
On 17/01/2023 13:24, Limonciello, Mario wrote:
> [...]
>>> Though I see two problems with that: first, I'm not sure what's the
>>> impact in the GPU functioning when I disable some IP block.
>>>
> 
> It depends on the individual block what the impact is.  For example
> if you don't have VCN, then you can't do any accelerated video playback.
> 
>>> Second, the parameter is a bit hard to figure - we need to clear a bit
>>> for the IP block we want to disable, and the doc suggest to read on
>>> dmesg to get this information (it seems it changes depending on the HW
>>> model), but I couldn't parse the proper bit from dmesg. Needed to
>>> instrument the kernel to find the proper bit heh
>>>
> 
> Isn't it this stuff (taken from a CZN system):
> 
> [    7.797779] [drm] add ip block number 0 <soc15_common>
> [    7.797781] [drm] add ip block number 1 <gmc_v9_0>
> [    7.797782] [drm] add ip block number 2 <vega10_ih>
> [    7.797783] [drm] add ip block number 3 <psp>
> [    7.797783] [drm] add ip block number 4 <smu>
> [    7.797784] [drm] add ip block number 5 <dm>
> [    7.797785] [drm] add ip block number 6 <gfx_v9_0>
> [    7.797786] [drm] add ip block number 7 <sdma_v4_0>
> [    7.797787] [drm] add ip block number 8 <vcn_v2_0>
> [    7.797788] [drm] add ip block number 9 <jpeg_v2_0>
> 
> So for that system it would be bit 8 to disable vcn.
> 
> In terms of how debugging would work:
> I would expect when you get your failure it will have been the previous
> block # that failed, and so you can reboot with that block masked and
> see if you get further.
> 

Thanks Mario, much appreciated! You're totally right and I messed up not
seeing these obvious messages...

So, I'll just drop the parameter on V2.
Cheers,


Guilherme
diff mbox series

Patch

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 872450a3a164..5d3c92c94f18 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -215,6 +215,7 @@  extern int amdgpu_noretry;
 extern int amdgpu_force_asic_type;
 extern int amdgpu_smartshift_bias;
 extern int amdgpu_use_xgmi_p2p;
+extern int amdgpu_indirect_sram;
 #ifdef CONFIG_HSA_AMD
 extern int sched_policy;
 extern bool debug_evictions;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 06aba201d4db..c7182c0bc841 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -187,6 +187,7 @@  int amdgpu_num_kcq = -1;
 int amdgpu_smartshift_bias;
 int amdgpu_use_xgmi_p2p = 1;
 int amdgpu_vcnfw_log;
+int amdgpu_indirect_sram = -1;
 
 static void amdgpu_drv_delayed_reset_work_handler(struct work_struct *work);
 
@@ -941,6 +942,14 @@  MODULE_PARM_DESC(smu_pptable_id,
 	"specify pptable id to be used (-1 = auto(default) value, 0 = use pptable from vbios, > 0 = soft pptable id)");
 module_param_named(smu_pptable_id, amdgpu_smu_pptable_id, int, 0444);
 
+/**
+ * DOC: indirect_sram (int)
+ * Allow users to force using (or not) the VCN indirect SRAM mode in the fw load
+ * code. Default is -1, meaning auto (aka, don't mess with driver's behavior).
+ */
+MODULE_PARM_DESC(indirect_sram, "Force VCN indirect SRAM (-1 = auto (default), 0 = disabled, 1 = enabled)");
+module_param_named(indirect_sram, amdgpu_indirect_sram, int, 0444);
+
 /* These devices are not supported by amdgpu.
  * They are supported by the mach64, r128, radeon drivers
  */
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
index 1f880e162d9d..a2290087e01c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
@@ -137,6 +137,9 @@  int amdgpu_vcn_sw_init(struct amdgpu_device *adev)
 		return -EINVAL;
 	}
 
+	if (amdgpu_indirect_sram >= 0)
+		adev->vcn.indirect_sram = (bool)amdgpu_indirect_sram;
+
 	hdr = (const struct common_firmware_header *)adev->vcn.fw->data;
 	adev->vcn.fw_version = le32_to_cpu(hdr->ucode_version);