diff mbox series

[v4,3/8] drm/amdgpu: Fix SMU error failure

Message ID 1599072130-10043-4-git-send-email-andrey.grodzovsky@amd.com (mailing list archive)
State Not Applicable, archived
Headers show
Series Implement PCI Error Recovery on Navi12 | expand

Commit Message

Andrey Grodzovsky Sept. 2, 2020, 6:42 p.m. UTC
Wait for HW/PSP initiated ASIC reset to complete before
starting the recovery operations.

v2: Remove typo

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 22 ++++++++++++++++++++--
 1 file changed, 20 insertions(+), 2 deletions(-)

Comments

Bjorn Helgaas Sept. 2, 2020, 10:05 p.m. UTC | #1
On Wed, Sep 02, 2020 at 02:42:05PM -0400, Andrey Grodzovsky wrote:
> Wait for HW/PSP initiated ASIC reset to complete before
> starting the recovery operations.
> 
> v2: Remove typo
> 
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 22 ++++++++++++++++++++--
>  1 file changed, 20 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index e999f1f..412d07e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4838,14 +4838,32 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
>  {
>  	struct drm_device *dev = pci_get_drvdata(pdev);
>  	struct amdgpu_device *adev = drm_to_adev(dev);
> -	int r;
> +	int r, i;
>  	bool vram_lost;
> +	u32 memsize;
>  
>  	DRM_INFO("PCI error: slot reset callback!!\n");
>  
> +	/* wait for asic to come out of reset */

I know it's totally OCD, but it is a minor speed bump to read "asic"
here and "ASIC" in the commit log above and the new comment below.

> +	msleep(500);
> +
>  	pci_restore_state(pdev);
>  
> -	adev->in_pci_err_recovery = true;
> +	/* confirm  ASIC came out of reset */
> +	for (i = 0; i < adev->usec_timeout; i++) {
> +		memsize = amdgpu_asic_get_config_memsize(adev);
> +
> +		if (memsize != 0xffffffff)

I guess this is a spot where you actually depend on an MMIO read
returning 0xffffffff because adev->in_pci_err_recovery is false at
this point, so amdgpu_mm_rreg() will actually *do* the MMIO read
instead of returning 0.  Right?

> +			break;
> +		udelay(1);
> +	}
> +	if (memsize == 0xffffffff) {
> +		r = -ETIME;
> +		goto out;
> +	}
> +
> +	/* TODO Call amdgpu_pre_asic_reset instead */
> +	adev->in_pci_err_recovery = true;	
>  	r = amdgpu_device_ip_suspend(adev);
>  	adev->in_pci_err_recovery = false;
>  	if (r)
> -- 
> 2.7.4
>
Andrey Grodzovsky Sept. 3, 2020, 3:29 p.m. UTC | #2
On 9/2/20 6:05 PM, Bjorn Helgaas wrote:
> On Wed, Sep 02, 2020 at 02:42:05PM -0400, Andrey Grodzovsky wrote:
>> Wait for HW/PSP initiated ASIC reset to complete before
>> starting the recovery operations.
>>
>> v2: Remove typo
>>
>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>> Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 22 ++++++++++++++++++++--
>>   1 file changed, 20 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index e999f1f..412d07e 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -4838,14 +4838,32 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
>>   {
>>   	struct drm_device *dev = pci_get_drvdata(pdev);
>>   	struct amdgpu_device *adev = drm_to_adev(dev);
>> -	int r;
>> +	int r, i;
>>   	bool vram_lost;
>> +	u32 memsize;
>>   
>>   	DRM_INFO("PCI error: slot reset callback!!\n");
>>   
>> +	/* wait for asic to come out of reset */
> I know it's totally OCD, but it is a minor speed bump to read "asic"
> here and "ASIC" in the commit log above and the new comment below.
>
>> +	msleep(500);
>> +
>>   	pci_restore_state(pdev);
>>   
>> -	adev->in_pci_err_recovery = true;
>> +	/* confirm  ASIC came out of reset */
>> +	for (i = 0; i < adev->usec_timeout; i++) {
>> +		memsize = amdgpu_asic_get_config_memsize(adev);
>> +
>> +		if (memsize != 0xffffffff)
> I guess this is a spot where you actually depend on an MMIO read
> returning 0xffffffff because adev->in_pci_err_recovery is false at
> this point, so amdgpu_mm_rreg() will actually *do* the MMIO read
> instead of returning 0.  Right?


Yes, i picked this form another place where they use it to confirm PCI confspace and
the BARs specifically are restored and active and so you can read back sane MMIO 
regs values.

Andrey


>
>> +			break;
>> +		udelay(1);
>> +	}
>> +	if (memsize == 0xffffffff) {
>> +		r = -ETIME;
>> +		goto out;
>> +	}
>> +
>> +	/* TODO Call amdgpu_pre_asic_reset instead */
>> +	adev->in_pci_err_recovery = true;	
>>   	r = amdgpu_device_ip_suspend(adev);
>>   	adev->in_pci_err_recovery = false;
>>   	if (r)
>> -- 
>> 2.7.4
>>
diff mbox series

Patch

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index e999f1f..412d07e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4838,14 +4838,32 @@  pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
 {
 	struct drm_device *dev = pci_get_drvdata(pdev);
 	struct amdgpu_device *adev = drm_to_adev(dev);
-	int r;
+	int r, i;
 	bool vram_lost;
+	u32 memsize;
 
 	DRM_INFO("PCI error: slot reset callback!!\n");
 
+	/* wait for asic to come out of reset */
+	msleep(500);
+
 	pci_restore_state(pdev);
 
-	adev->in_pci_err_recovery = true;
+	/* confirm  ASIC came out of reset */
+	for (i = 0; i < adev->usec_timeout; i++) {
+		memsize = amdgpu_asic_get_config_memsize(adev);
+
+		if (memsize != 0xffffffff)
+			break;
+		udelay(1);
+	}
+	if (memsize == 0xffffffff) {
+		r = -ETIME;
+		goto out;
+	}
+
+	/* TODO Call amdgpu_pre_asic_reset instead */
+	adev->in_pci_err_recovery = true;	
 	r = amdgpu_device_ip_suspend(adev);
 	adev->in_pci_err_recovery = false;
 	if (r)