[1/1] drm/amdgpu: Use device wedged event

Message ID	20241212190909.28559-2-andrealmeid@igalia.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <dri-devel-bounces@lists.freedesktop.org> From: =?utf-8?q?Andr=C3=A9_Almeida?= <andrealmeid@igalia.com> To: Raag Jadav <raag.jadav@intel.com>, airlied@gmail.com, simona@ffwll.ch, lucas.demarchi@intel.com, rodrigo.vivi@intel.com, jani.nikula@linux.intel.com, andriy.shevchenko@linux.intel.com, lina@asahilina.net, michal.wajdeczko@intel.com, christian.koenig@amd.com Cc: intel-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org, himal.prasad.ghimiray@intel.com, aravind.iddamsetty@linux.intel.com, anshuman.gupta@intel.com, alexander.deucher@amd.com, andrealmeid@igalia.com, amd-gfx@lists.freedesktop.org, kernel-dev@igalia.com Subject: [PATCH 1/1] drm/amdgpu: Use device wedged event Date: Thu, 12 Dec 2024 16:09:09 -0300 Message-ID: <20241212190909.28559-2-andrealmeid@igalia.com> In-Reply-To: <20241212190909.28559-1-andrealmeid@igalia.com> References: <20241212190909.28559-1-andrealmeid@igalia.com> MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Precedence: list Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" <dri-devel-bounces@lists.freedesktop.org>
Series	drm/amdgpu: Use device wedged event \| expand [0/1] drm/amdgpu: Use device wedged event [1/1] drm/amdgpu: Use device wedged event

André Almeida Dec. 12, 2024, 7:09 p.m. UTC

Use DRM's device wedged event to notify userspace that a reset had
happened. For now, only use `none` method meant for telemetry
capture.

Signed-off-by: André Almeida <andrealmeid@igalia.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 +++
 1 file changed, 3 insertions(+)

Christian König Dec. 13, 2024, 7:34 a.m. UTC | #1

Am 12.12.24 um 20:09 schrieb André Almeida:
> Use DRM's device wedged event to notify userspace that a reset had
> happened. For now, only use `none` method meant for telemetry
> capture.
>
> Signed-off-by: André Almeida <andrealmeid@igalia.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 +++
>   1 file changed, 3 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 96316111300a..19e1a5493778 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -6057,6 +6057,9 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>   		dev_info(adev->dev, "GPU reset end with ret = %d\n", r);
>   
>   	atomic_set(&adev->reset_domain->reset_res, r);
> +
> +	drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE);

That looks really good in general. I would just make the 
DRM_WEDGE_RECOVERY_NONE depend on the value of "r".

@Shashank any objections? IIRC you have worked on the AMD specific event 
we never upstreamed.

Regards,
Christian.

> +
>   	return r;
>   }
>

Sharma, Shashank Dec. 13, 2024, 7:46 a.m. UTC | #2

[AMD Official Use Only - AMD Internal Distribution Only]

Looks good to me as well, With Christian's comment accommodated:
Acked-by: Shashank Sharma <shashank.sharma@amd.com>

Regards
Shashank

André Almeida Dec. 13, 2024, 2:15 p.m. UTC | #3

Hi Christian,

Em 13/12/2024 04:34, Christian König escreveu:
> Am 12.12.24 um 20:09 schrieb André Almeida:
>> Use DRM's device wedged event to notify userspace that a reset had
>> happened. For now, only use `none` method meant for telemetry
>> capture.
>>
>> Signed-off-by: André Almeida <andrealmeid@igalia.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 +++
>>   1 file changed, 3 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/ 
>> drm/amd/amdgpu/amdgpu_device.c
>> index 96316111300a..19e1a5493778 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -6057,6 +6057,9 @@ int amdgpu_device_gpu_recover(struct 
>> amdgpu_device *adev,
>>           dev_info(adev->dev, "GPU reset end with ret = %d\n", r);
>>       atomic_set(&adev->reset_domain->reset_res, r);
>> +
>> +    drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE);
> 
> That looks really good in general. I would just make the 
> DRM_WEDGE_RECOVERY_NONE depend on the value of "r".
> 

Why depend or `r`? A reset was triggered anyway, regardless of the 
success of it, shouldn't we tell userspace?

Raag Jadav Dec. 13, 2024, 2:36 p.m. UTC | #4

On Fri, Dec 13, 2024 at 11:15:31AM -0300, André Almeida wrote:
> Hi Christian,
> 
> Em 13/12/2024 04:34, Christian König escreveu:
> > Am 12.12.24 um 20:09 schrieb André Almeida:
> > > Use DRM's device wedged event to notify userspace that a reset had
> > > happened. For now, only use `none` method meant for telemetry
> > > capture.
> > > 
> > > Signed-off-by: André Almeida <andrealmeid@igalia.com>
> > > ---
> > >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 +++
> > >   1 file changed, 3 insertions(+)
> > > 
> > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > b/drivers/gpu/ drm/amd/amdgpu/amdgpu_device.c
> > > index 96316111300a..19e1a5493778 100644
> > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > @@ -6057,6 +6057,9 @@ int amdgpu_device_gpu_recover(struct
> > > amdgpu_device *adev,
> > >           dev_info(adev->dev, "GPU reset end with ret = %d\n", r);
> > >       atomic_set(&adev->reset_domain->reset_res, r);
> > > +
> > > +    drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE);
> > 
> > That looks really good in general. I would just make the
> > DRM_WEDGE_RECOVERY_NONE depend on the value of "r".
> > 
> 
> Why depend or `r`? A reset was triggered anyway, regardless of the success
> of it, shouldn't we tell userspace?

A failed reset would perhaps result in wedging, atleast that's how i915
is handling it.

Raag

André Almeida Dec. 13, 2024, 3:56 p.m. UTC | #5

Em 13/12/2024 11:36, Raag Jadav escreveu:
> On Fri, Dec 13, 2024 at 11:15:31AM -0300, André Almeida wrote:
>> Hi Christian,
>>
>> Em 13/12/2024 04:34, Christian König escreveu:
>>> Am 12.12.24 um 20:09 schrieb André Almeida:
>>>> Use DRM's device wedged event to notify userspace that a reset had
>>>> happened. For now, only use `none` method meant for telemetry
>>>> capture.
>>>>
>>>> Signed-off-by: André Almeida <andrealmeid@igalia.com>
>>>> ---
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 +++
>>>>    1 file changed, 3 insertions(+)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> b/drivers/gpu/ drm/amd/amdgpu/amdgpu_device.c
>>>> index 96316111300a..19e1a5493778 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> @@ -6057,6 +6057,9 @@ int amdgpu_device_gpu_recover(struct
>>>> amdgpu_device *adev,
>>>>            dev_info(adev->dev, "GPU reset end with ret = %d\n", r);
>>>>        atomic_set(&adev->reset_domain->reset_res, r);
>>>> +
>>>> +    drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE);
>>>
>>> That looks really good in general. I would just make the
>>> DRM_WEDGE_RECOVERY_NONE depend on the value of "r".
>>>
>>
>> Why depend or `r`? A reset was triggered anyway, regardless of the success
>> of it, shouldn't we tell userspace?
> 
> A failed reset would perhaps result in wedging, atleast that's how i915
> is handling it.
> 

Right, and I think this raises the question of what wedge recovery 
method should I add for amdgpu... Christian?

Christian König Dec. 16, 2024, 10:18 a.m. UTC | #6

Am 13.12.24 um 16:56 schrieb André Almeida:
> Em 13/12/2024 11:36, Raag Jadav escreveu:
>> On Fri, Dec 13, 2024 at 11:15:31AM -0300, André Almeida wrote:
>>> Hi Christian,
>>>
>>> Em 13/12/2024 04:34, Christian König escreveu:
>>>> Am 12.12.24 um 20:09 schrieb André Almeida:
>>>>> Use DRM's device wedged event to notify userspace that a reset had
>>>>> happened. For now, only use `none` method meant for telemetry
>>>>> capture.
>>>>>
>>>>> Signed-off-by: André Almeida <andrealmeid@igalia.com>
>>>>> ---
>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 +++
>>>>>    1 file changed, 3 insertions(+)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> b/drivers/gpu/ drm/amd/amdgpu/amdgpu_device.c
>>>>> index 96316111300a..19e1a5493778 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> @@ -6057,6 +6057,9 @@ int amdgpu_device_gpu_recover(struct
>>>>> amdgpu_device *adev,
>>>>>            dev_info(adev->dev, "GPU reset end with ret = %d\n", r);
>>>>> atomic_set(&adev->reset_domain->reset_res, r);
>>>>> +
>>>>> +    drm_dev_wedged_event(adev_to_drm(adev), 
>>>>> DRM_WEDGE_RECOVERY_NONE);
>>>>
>>>> That looks really good in general. I would just make the
>>>> DRM_WEDGE_RECOVERY_NONE depend on the value of "r".
>>>>
>>>
>>> Why depend or `r`? A reset was triggered anyway, regardless of the 
>>> success
>>> of it, shouldn't we tell userspace?
>>
>> A failed reset would perhaps result in wedging, atleast that's how i915
>> is handling it.
>>
>
> Right, and I think this raises the question of what wedge recovery 
> method should I add for amdgpu... Christian?
>

In theory a rebind should be enough to get the device going again, our 
BOCO does a bus reset on driver load anyway.

Regards,
Christian.

Lazar, Lijo Dec. 16, 2024, 10:38 a.m. UTC | #7

On 12/16/2024 3:48 PM, Christian König wrote:
> Am 13.12.24 um 16:56 schrieb André Almeida:
>> Em 13/12/2024 11:36, Raag Jadav escreveu:
>>> On Fri, Dec 13, 2024 at 11:15:31AM -0300, André Almeida wrote:
>>>> Hi Christian,
>>>>
>>>> Em 13/12/2024 04:34, Christian König escreveu:
>>>>> Am 12.12.24 um 20:09 schrieb André Almeida:
>>>>>> Use DRM's device wedged event to notify userspace that a reset had
>>>>>> happened. For now, only use `none` method meant for telemetry
>>>>>> capture.
>>>>>>
>>>>>> Signed-off-by: André Almeida <andrealmeid@igalia.com>
>>>>>> ---
>>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 +++
>>>>>>    1 file changed, 3 insertions(+)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> b/drivers/gpu/ drm/amd/amdgpu/amdgpu_device.c
>>>>>> index 96316111300a..19e1a5493778 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> @@ -6057,6 +6057,9 @@ int amdgpu_device_gpu_recover(struct
>>>>>> amdgpu_device *adev,
>>>>>>            dev_info(adev->dev, "GPU reset end with ret = %d\n", r);
>>>>>> atomic_set(&adev->reset_domain->reset_res, r);
>>>>>> +
>>>>>> +    drm_dev_wedged_event(adev_to_drm(adev),
>>>>>> DRM_WEDGE_RECOVERY_NONE);
>>>>>
>>>>> That looks really good in general. I would just make the
>>>>> DRM_WEDGE_RECOVERY_NONE depend on the value of "r".
>>>>>
>>>>
>>>> Why depend or `r`? A reset was triggered anyway, regardless of the
>>>> success
>>>> of it, shouldn't we tell userspace?
>>>
>>> A failed reset would perhaps result in wedging, atleast that's how i915
>>> is handling it.
>>>
>>
>> Right, and I think this raises the question of what wedge recovery
>> method should I add for amdgpu... Christian?
>>
> 
> In theory a rebind should be enough to get the device going again, our
> BOCO does a bus reset on driver load anyway.
> 

The behavior varies between SOCs. In certain ones, if driver reset
fails, that means it's really in a bad state and it would need system
reboot.

I had asked earlier about the utility of this one here. If this is just
to inform userspace that driver has done a reset and recovered, it would
need some additional context also. We have a mechanism in KFD which
sends the context in which a reset has to be done. Currently, that's
restricted to compute applications, but if this is in a similar line, we
would like to pass some additional info like job timeout, RAS error etc.

Thanks,
Lijo

> Regards,
> Christian.

André Almeida Dec. 16, 2024, 1:04 p.m. UTC | #8

Em 16/12/2024 07:38, Lazar, Lijo escreveu:
> 
> 
> On 12/16/2024 3:48 PM, Christian König wrote:
>> Am 13.12.24 um 16:56 schrieb André Almeida:
>>> Em 13/12/2024 11:36, Raag Jadav escreveu:
>>>> On Fri, Dec 13, 2024 at 11:15:31AM -0300, André Almeida wrote:
>>>>> Hi Christian,
>>>>>
>>>>> Em 13/12/2024 04:34, Christian König escreveu:
>>>>>> Am 12.12.24 um 20:09 schrieb André Almeida:
>>>>>>> Use DRM's device wedged event to notify userspace that a reset had
>>>>>>> happened. For now, only use `none` method meant for telemetry
>>>>>>> capture.
>>>>>>>
>>>>>>> Signed-off-by: André Almeida <andrealmeid@igalia.com>
>>>>>>> ---
>>>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 +++
>>>>>>>     1 file changed, 3 insertions(+)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>> b/drivers/gpu/ drm/amd/amdgpu/amdgpu_device.c
>>>>>>> index 96316111300a..19e1a5493778 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>> @@ -6057,6 +6057,9 @@ int amdgpu_device_gpu_recover(struct
>>>>>>> amdgpu_device *adev,
>>>>>>>             dev_info(adev->dev, "GPU reset end with ret = %d\n", r);
>>>>>>> atomic_set(&adev->reset_domain->reset_res, r);
>>>>>>> +
>>>>>>> +    drm_dev_wedged_event(adev_to_drm(adev),
>>>>>>> DRM_WEDGE_RECOVERY_NONE);
>>>>>>
>>>>>> That looks really good in general. I would just make the
>>>>>> DRM_WEDGE_RECOVERY_NONE depend on the value of "r".
>>>>>>
>>>>>
>>>>> Why depend or `r`? A reset was triggered anyway, regardless of the
>>>>> success
>>>>> of it, shouldn't we tell userspace?
>>>>
>>>> A failed reset would perhaps result in wedging, atleast that's how i915
>>>> is handling it.
>>>>
>>>
>>> Right, and I think this raises the question of what wedge recovery
>>> method should I add for amdgpu... Christian?
>>>
>>
>> In theory a rebind should be enough to get the device going again, our
>> BOCO does a bus reset on driver load anyway.
>>
> 
> The behavior varies between SOCs. In certain ones, if driver reset
> fails, that means it's really in a bad state and it would need system
> reboot.
> 

Is this documented somewhere? Then I could even add a 
DRM_WEDGE_RECOVERY_REBOOT so we can cover every scenario.

> I had asked earlier about the utility of this one here. If this is just
> to inform userspace that driver has done a reset and recovered, it would
> need some additional context also. We have a mechanism in KFD which
> sends the context in which a reset has to be done. Currently, that's
> restricted to compute applications, but if this is in a similar line, we
> would like to pass some additional info like job timeout, RAS error etc.
> 

DRM_WEDGE_RECOVERY_NONE is to inform userspace that driver has done a 
reset and recovered, but additional data about like which job timeout, 
RAS error and such belong to devcoredump I guess, where all data is 
gathered and collected later.

> Thanks,
> Lijo
> 
>> Regards,
>> Christian.
>

Christian König Dec. 16, 2024, 1:10 p.m. UTC | #9

Am 16.12.24 um 14:04 schrieb André Almeida:
> Em 16/12/2024 07:38, Lazar, Lijo escreveu:
>>
>>
>> On 12/16/2024 3:48 PM, Christian König wrote:
>>> Am 13.12.24 um 16:56 schrieb André Almeida:
>>>> Em 13/12/2024 11:36, Raag Jadav escreveu:
>>>>> On Fri, Dec 13, 2024 at 11:15:31AM -0300, André Almeida wrote:
>>>>>> Hi Christian,
>>>>>>
>>>>>> Em 13/12/2024 04:34, Christian König escreveu:
>>>>>>> Am 12.12.24 um 20:09 schrieb André Almeida:
>>>>>>>> Use DRM's device wedged event to notify userspace that a reset had
>>>>>>>> happened. For now, only use `none` method meant for telemetry
>>>>>>>> capture.
>>>>>>>>
>>>>>>>> Signed-off-by: André Almeida <andrealmeid@igalia.com>
>>>>>>>> ---
>>>>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 +++
>>>>>>>>     1 file changed, 3 insertions(+)
>>>>>>>>
>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>> b/drivers/gpu/ drm/amd/amdgpu/amdgpu_device.c
>>>>>>>> index 96316111300a..19e1a5493778 100644
>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>> @@ -6057,6 +6057,9 @@ int amdgpu_device_gpu_recover(struct
>>>>>>>> amdgpu_device *adev,
>>>>>>>>             dev_info(adev->dev, "GPU reset end with ret = 
>>>>>>>> %d\n", r);
>>>>>>>> atomic_set(&adev->reset_domain->reset_res, r);
>>>>>>>> +
>>>>>>>> +    drm_dev_wedged_event(adev_to_drm(adev),
>>>>>>>> DRM_WEDGE_RECOVERY_NONE);
>>>>>>>
>>>>>>> That looks really good in general. I would just make the
>>>>>>> DRM_WEDGE_RECOVERY_NONE depend on the value of "r".
>>>>>>>
>>>>>>
>>>>>> Why depend or `r`? A reset was triggered anyway, regardless of the
>>>>>> success
>>>>>> of it, shouldn't we tell userspace?
>>>>>
>>>>> A failed reset would perhaps result in wedging, atleast that's how 
>>>>> i915
>>>>> is handling it.
>>>>>
>>>>
>>>> Right, and I think this raises the question of what wedge recovery
>>>> method should I add for amdgpu... Christian?
>>>>
>>>
>>> In theory a rebind should be enough to get the device going again, our
>>> BOCO does a bus reset on driver load anyway.
>>>
>>
>> The behavior varies between SOCs. In certain ones, if driver reset
>> fails, that means it's really in a bad state and it would need system
>> reboot.
>>
>
> Is this documented somewhere? Then I could even add a 
> DRM_WEDGE_RECOVERY_REBOOT so we can cover every scenario.

Not publicly as far as I know. But indeed a driver reset has basically 
the same chance of succeeding than a driver reload.

I think the use case we have here is more that the administrator 
intentionally disabled the reset to allow HW investigation.

So far we did that with a rather broken we don't do anything at all 
approach.

>> I had asked earlier about the utility of this one here. If this is just
>> to inform userspace that driver has done a reset and recovered, it would
>> need some additional context also. We have a mechanism in KFD which
>> sends the context in which a reset has to be done. Currently, that's
>> restricted to compute applications, but if this is in a similar line, we
>> would like to pass some additional info like job timeout, RAS error etc.
>>
>
> DRM_WEDGE_RECOVERY_NONE is to inform userspace that driver has done a 
> reset and recovered, but additional data about like which job timeout, 
> RAS error and such belong to devcoredump I guess, where all data is 
> gathered and collected later.

I think somebody else mentioned it as well that the source of the issue, 
e.g. the PID of the submitting process would be helpful as well for 
supervising daemons which need to restart processes when they caused 
some issue.

We just postponed adding that till later.

Regards,
Christian.

>
>> Thanks,
>> Lijo
>>
>>> Regards,
>>> Christian.
>>
>

André Almeida Dec. 16, 2024, 1:15 p.m. UTC | #10

Em 16/12/2024 10:10, Christian König escreveu:
> Am 16.12.24 um 14:04 schrieb André Almeida:
>> Em 16/12/2024 07:38, Lazar, Lijo escreveu:
>>>
>>>
>>> On 12/16/2024 3:48 PM, Christian König wrote:
>>>> Am 13.12.24 um 16:56 schrieb André Almeida:
>>>>> Em 13/12/2024 11:36, Raag Jadav escreveu:
>>>>>> On Fri, Dec 13, 2024 at 11:15:31AM -0300, André Almeida wrote:
>>>>>>> Hi Christian,
>>>>>>>
>>>>>>> Em 13/12/2024 04:34, Christian König escreveu:
>>>>>>>> Am 12.12.24 um 20:09 schrieb André Almeida:
>>>>>>>>> Use DRM's device wedged event to notify userspace that a reset had
>>>>>>>>> happened. For now, only use `none` method meant for telemetry
>>>>>>>>> capture.
>>>>>>>>>
>>>>>>>>> Signed-off-by: André Almeida <andrealmeid@igalia.com>
>>>>>>>>> ---
>>>>>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 +++
>>>>>>>>>     1 file changed, 3 insertions(+)
>>>>>>>>>
>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>> b/drivers/gpu/ drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>> index 96316111300a..19e1a5493778 100644
>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>> @@ -6057,6 +6057,9 @@ int amdgpu_device_gpu_recover(struct
>>>>>>>>> amdgpu_device *adev,
>>>>>>>>>             dev_info(adev->dev, "GPU reset end with ret = 
>>>>>>>>> %d\n", r);
>>>>>>>>> atomic_set(&adev->reset_domain->reset_res, r);
>>>>>>>>> +
>>>>>>>>> +    drm_dev_wedged_event(adev_to_drm(adev),
>>>>>>>>> DRM_WEDGE_RECOVERY_NONE);
>>>>>>>>
>>>>>>>> That looks really good in general. I would just make the
>>>>>>>> DRM_WEDGE_RECOVERY_NONE depend on the value of "r".
>>>>>>>>
>>>>>>>
>>>>>>> Why depend or `r`? A reset was triggered anyway, regardless of the
>>>>>>> success
>>>>>>> of it, shouldn't we tell userspace?
>>>>>>
>>>>>> A failed reset would perhaps result in wedging, atleast that's how 
>>>>>> i915
>>>>>> is handling it.
>>>>>>
>>>>>
>>>>> Right, and I think this raises the question of what wedge recovery
>>>>> method should I add for amdgpu... Christian?
>>>>>
>>>>
>>>> In theory a rebind should be enough to get the device going again, our
>>>> BOCO does a bus reset on driver load anyway.
>>>>
>>>
>>> The behavior varies between SOCs. In certain ones, if driver reset
>>> fails, that means it's really in a bad state and it would need system
>>> reboot.
>>>
>>
>> Is this documented somewhere? Then I could even add a 
>> DRM_WEDGE_RECOVERY_REBOOT so we can cover every scenario.
> 
> Not publicly as far as I know. But indeed a driver reset has basically 
> the same chance of succeeding than a driver reload.
> 
> I think the use case we have here is more that the administrator 
> intentionally disabled the reset to allow HW investigation.
> 
> So far we did that with a rather broken we don't do anything at all 
> approach.

OK.

> 
>>> I had asked earlier about the utility of this one here. If this is just
>>> to inform userspace that driver has done a reset and recovered, it would
>>> need some additional context also. We have a mechanism in KFD which
>>> sends the context in which a reset has to be done. Currently, that's
>>> restricted to compute applications, but if this is in a similar line, we
>>> would like to pass some additional info like job timeout, RAS error etc.
>>>
>>
>> DRM_WEDGE_RECOVERY_NONE is to inform userspace that driver has done a 
>> reset and recovered, but additional data about like which job timeout, 
>> RAS error and such belong to devcoredump I guess, where all data is 
>> gathered and collected later.
> 
> I think somebody else mentioned it as well that the source of the issue, 
> e.g. the PID of the submitting process would be helpful as well for 
> supervising daemons which need to restart processes when they caused 
> some issue.
> 

It was me :) we have a use case that we would need the PID for the 
daemon indeed, but the daemon doesn't need to know what's the RAS error 
or the job name that timeouted, there's no immediate action to be taken 
with this information, contrary to the PID that we need to know.

> We just postponed adding that till later.
> 
> Regards,
> Christian.
> 
>>
>>> Thanks,
>>> Lijo
>>>
>>>> Regards,
>>>> Christian.
>>>
>>
>

Christian König Dec. 16, 2024, 1:36 p.m. UTC | #11

Am 16.12.24 um 14:15 schrieb André Almeida:
> Em 16/12/2024 10:10, Christian König escreveu:
>> Am 16.12.24 um 14:04 schrieb André Almeida:
>>> Em 16/12/2024 07:38, Lazar, Lijo escreveu:
>>>>
>>>>
>>>> On 12/16/2024 3:48 PM, Christian König wrote:
>>>>> Am 13.12.24 um 16:56 schrieb André Almeida:
>>>>>> Em 13/12/2024 11:36, Raag Jadav escreveu:
>>>>>>> On Fri, Dec 13, 2024 at 11:15:31AM -0300, André Almeida wrote:
>>>>>>>> Hi Christian,
>>>>>>>>
>>>>>>>> Em 13/12/2024 04:34, Christian König escreveu:
>>>>>>>>> Am 12.12.24 um 20:09 schrieb André Almeida:
>>>>>>>>>> Use DRM's device wedged event to notify userspace that a 
>>>>>>>>>> reset had
>>>>>>>>>> happened. For now, only use `none` method meant for telemetry
>>>>>>>>>> capture.
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: André Almeida <andrealmeid@igalia.com>
>>>>>>>>>> ---
>>>>>>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 +++
>>>>>>>>>>     1 file changed, 3 insertions(+)
>>>>>>>>>>
>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>> b/drivers/gpu/ drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>> index 96316111300a..19e1a5493778 100644
>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>> @@ -6057,6 +6057,9 @@ int amdgpu_device_gpu_recover(struct
>>>>>>>>>> amdgpu_device *adev,
>>>>>>>>>>             dev_info(adev->dev, "GPU reset end with ret = 
>>>>>>>>>> %d\n", r);
>>>>>>>>>> atomic_set(&adev->reset_domain->reset_res, r);
>>>>>>>>>> +
>>>>>>>>>> +    drm_dev_wedged_event(adev_to_drm(adev),
>>>>>>>>>> DRM_WEDGE_RECOVERY_NONE);
>>>>>>>>>
>>>>>>>>> That looks really good in general. I would just make the
>>>>>>>>> DRM_WEDGE_RECOVERY_NONE depend on the value of "r".
>>>>>>>>>
>>>>>>>>
>>>>>>>> Why depend or `r`? A reset was triggered anyway, regardless of the
>>>>>>>> success
>>>>>>>> of it, shouldn't we tell userspace?
>>>>>>>
>>>>>>> A failed reset would perhaps result in wedging, atleast that's 
>>>>>>> how i915
>>>>>>> is handling it.
>>>>>>>
>>>>>>
>>>>>> Right, and I think this raises the question of what wedge recovery
>>>>>> method should I add for amdgpu... Christian?
>>>>>>
>>>>>
>>>>> In theory a rebind should be enough to get the device going again, 
>>>>> our
>>>>> BOCO does a bus reset on driver load anyway.
>>>>>
>>>>
>>>> The behavior varies between SOCs. In certain ones, if driver reset
>>>> fails, that means it's really in a bad state and it would need system
>>>> reboot.
>>>>
>>>
>>> Is this documented somewhere? Then I could even add a 
>>> DRM_WEDGE_RECOVERY_REBOOT so we can cover every scenario.
>>
>> Not publicly as far as I know. But indeed a driver reset has 
>> basically the same chance of succeeding than a driver reload.
>>
>> I think the use case we have here is more that the administrator 
>> intentionally disabled the reset to allow HW investigation.
>>
>> So far we did that with a rather broken we don't do anything at all 
>> approach.
>
> OK.
>
>>
>>>> I had asked earlier about the utility of this one here. If this is 
>>>> just
>>>> to inform userspace that driver has done a reset and recovered, it 
>>>> would
>>>> need some additional context also. We have a mechanism in KFD which
>>>> sends the context in which a reset has to be done. Currently, that's
>>>> restricted to compute applications, but if this is in a similar 
>>>> line, we
>>>> would like to pass some additional info like job timeout, RAS error 
>>>> etc.
>>>>
>>>
>>> DRM_WEDGE_RECOVERY_NONE is to inform userspace that driver has done 
>>> a reset and recovered, but additional data about like which job 
>>> timeout, RAS error and such belong to devcoredump I guess, where all 
>>> data is gathered and collected later.
>>
>> I think somebody else mentioned it as well that the source of the 
>> issue, e.g. the PID of the submitting process would be helpful as 
>> well for supervising daemons which need to restart processes when 
>> they caused some issue.
>>
>
> It was me :) we have a use case that we would need the PID for the 
> daemon indeed, but the daemon doesn't need to know what's the RAS 
> error or the job name that timeouted, there's no immediate action to 
> be taken with this information, contrary to the PID that we need to know.

Yeah, that's all stuff for the device core dump I think.

But if you want to add the PID for high level control then that would 
make sense I think.

Regards,
Christian.

>
>> We just postponed adding that till later.
>>
>> Regards,
>> Christian.
>>
>>>
>>>> Thanks,
>>>> Lijo
>>>>
>>>>> Regards,
>>>>> Christian.
>>>>
>>>
>>
>

Lazar, Lijo Dec. 16, 2024, 1:36 p.m. UTC | #12

On 12/16/2024 6:45 PM, André Almeida wrote:
> 
> 
> Em 16/12/2024 10:10, Christian König escreveu:
>> Am 16.12.24 um 14:04 schrieb André Almeida:
>>> Em 16/12/2024 07:38, Lazar, Lijo escreveu:
>>>>
>>>>
>>>> On 12/16/2024 3:48 PM, Christian König wrote:
>>>>> Am 13.12.24 um 16:56 schrieb André Almeida:
>>>>>> Em 13/12/2024 11:36, Raag Jadav escreveu:
>>>>>>> On Fri, Dec 13, 2024 at 11:15:31AM -0300, André Almeida wrote:
>>>>>>>> Hi Christian,
>>>>>>>>
>>>>>>>> Em 13/12/2024 04:34, Christian König escreveu:
>>>>>>>>> Am 12.12.24 um 20:09 schrieb André Almeida:
>>>>>>>>>> Use DRM's device wedged event to notify userspace that a reset
>>>>>>>>>> had
>>>>>>>>>> happened. For now, only use `none` method meant for telemetry
>>>>>>>>>> capture.
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: André Almeida <andrealmeid@igalia.com>
>>>>>>>>>> ---
>>>>>>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 +++
>>>>>>>>>>     1 file changed, 3 insertions(+)
>>>>>>>>>>
>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>> b/drivers/gpu/ drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>> index 96316111300a..19e1a5493778 100644
>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>> @@ -6057,6 +6057,9 @@ int amdgpu_device_gpu_recover(struct
>>>>>>>>>> amdgpu_device *adev,
>>>>>>>>>>             dev_info(adev->dev, "GPU reset end with ret =
>>>>>>>>>> %d\n", r);
>>>>>>>>>> atomic_set(&adev->reset_domain->reset_res, r);
>>>>>>>>>> +
>>>>>>>>>> +    drm_dev_wedged_event(adev_to_drm(adev),
>>>>>>>>>> DRM_WEDGE_RECOVERY_NONE);
>>>>>>>>>
>>>>>>>>> That looks really good in general. I would just make the
>>>>>>>>> DRM_WEDGE_RECOVERY_NONE depend on the value of "r".
>>>>>>>>>
>>>>>>>>
>>>>>>>> Why depend or `r`? A reset was triggered anyway, regardless of the
>>>>>>>> success
>>>>>>>> of it, shouldn't we tell userspace?
>>>>>>>
>>>>>>> A failed reset would perhaps result in wedging, atleast that's
>>>>>>> how i915
>>>>>>> is handling it.
>>>>>>>
>>>>>>
>>>>>> Right, and I think this raises the question of what wedge recovery
>>>>>> method should I add for amdgpu... Christian?
>>>>>>
>>>>>
>>>>> In theory a rebind should be enough to get the device going again, our
>>>>> BOCO does a bus reset on driver load anyway.
>>>>>
>>>>
>>>> The behavior varies between SOCs. In certain ones, if driver reset
>>>> fails, that means it's really in a bad state and it would need system
>>>> reboot.
>>>>
>>>
>>> Is this documented somewhere? Then I could even add a
>>> DRM_WEDGE_RECOVERY_REBOOT so we can cover every scenario.
>>
>> Not publicly as far as I know. But indeed a driver reset has basically
>> the same chance of succeeding than a driver reload.
>>
>> I think the use case we have here is more that the administrator
>> intentionally disabled the reset to allow HW investigation.
>>
>> So far we did that with a rather broken we don't do anything at all
>> approach.
> 
> OK.
> 
>>
>>>> I had asked earlier about the utility of this one here. If this is just
>>>> to inform userspace that driver has done a reset and recovered, it
>>>> would
>>>> need some additional context also. We have a mechanism in KFD which
>>>> sends the context in which a reset has to be done. Currently, that's
>>>> restricted to compute applications, but if this is in a similar
>>>> line, we
>>>> would like to pass some additional info like job timeout, RAS error
>>>> etc.
>>>>
>>>
>>> DRM_WEDGE_RECOVERY_NONE is to inform userspace that driver has done a
>>> reset and recovered, but additional data about like which job
>>> timeout, RAS error and such belong to devcoredump I guess, where all
>>> data is gathered and collected later.
>>
>> I think somebody else mentioned it as well that the source of the
>> issue, e.g. the PID of the submitting process would be helpful as well
>> for supervising daemons which need to restart processes when they
>> caused some issue.
>>
> 
> It was me :) we have a use case that we would need the PID for the
> daemon indeed, but the daemon doesn't need to know what's the RAS error
> or the job name that timeouted, there's no immediate action to be taken
> with this information, contrary to the PID that we need to know.
> 

Regarding devcoredump - it's not done every time. For ex: RAS errors
have a different way to identify the source of error, hence we don't
need a coredump in such cases.

The intention is only to let the user know the reason for reset at a
high level, and probably add more things later like the engines or
queues that have reset etc.

Thanks,
Lijo

>> We just postponed adding that till later.
>>
>> Regards,
>> Christian.
>>
>>>
>>>> Thanks,
>>>> Lijo
>>>>
>>>>> Regards,
>>>>> Christian.
>>>>
>>>
>>
>

Christian König Dec. 16, 2024, 1:39 p.m. UTC | #13

Am 16.12.24 um 14:36 schrieb Lazar, Lijo:
>>>>> I had asked earlier about the utility of this one here. If this is just
>>>>> to inform userspace that driver has done a reset and recovered, it
>>>>> would
>>>>> need some additional context also. We have a mechanism in KFD which
>>>>> sends the context in which a reset has to be done. Currently, that's
>>>>> restricted to compute applications, but if this is in a similar
>>>>> line, we
>>>>> would like to pass some additional info like job timeout, RAS error
>>>>> etc.
>>>>>
>>>> DRM_WEDGE_RECOVERY_NONE is to inform userspace that driver has done a
>>>> reset and recovered, but additional data about like which job
>>>> timeout, RAS error and such belong to devcoredump I guess, where all
>>>> data is gathered and collected later.
>>> I think somebody else mentioned it as well that the source of the
>>> issue, e.g. the PID of the submitting process would be helpful as well
>>> for supervising daemons which need to restart processes when they
>>> caused some issue.
>>>
>> It was me :) we have a use case that we would need the PID for the
>> daemon indeed, but the daemon doesn't need to know what's the RAS error
>> or the job name that timeouted, there's no immediate action to be taken
>> with this information, contrary to the PID that we need to know.
>>
> Regarding devcoredump - it's not done every time. For ex: RAS errors
> have a different way to identify the source of error, hence we don't
> need a coredump in such cases.
>
> The intention is only to let the user know the reason for reset at a
> high level, and probably add more things later like the engines or
> queues that have reset etc.

Well what is the use case for that? That doesn't looks valuable to me.

RAS errors should generally be reported to the application who issued 
the submission.

As a system wide event they are only useful in things like logfiles I think.

Regards,
Christian.

>
> Thanks,
> Lijo
>
>>> We just postponed adding that till later.
>>>
>>> Regards,
>>> Christian.
>>>
>>>>> Thanks,
>>>>> Lijo
>>>>>
>>>>>> Regards,
>>>>>> Christian.

Lazar, Lijo Dec. 16, 2024, 1:44 p.m. UTC | #14

On 12/16/2024 7:09 PM, Christian König wrote:
> Am 16.12.24 um 14:36 schrieb Lazar, Lijo:
>>>>>> I had asked earlier about the utility of this one here. If this is just
>>>>>> to inform userspace that driver has done a reset and recovered, it
>>>>>> would
>>>>>> need some additional context also. We have a mechanism in KFD which
>>>>>> sends the context in which a reset has to be done. Currently, that's
>>>>>> restricted to compute applications, but if this is in a similar
>>>>>> line, we
>>>>>> would like to pass some additional info like job timeout, RAS error
>>>>>> etc.
>>>>>>
>>>>> DRM_WEDGE_RECOVERY_NONE is to inform userspace that driver has done a
>>>>> reset and recovered, but additional data about like which job
>>>>> timeout, RAS error and such belong to devcoredump I guess, where all
>>>>> data is gathered and collected later.
>>>> I think somebody else mentioned it as well that the source of the
>>>> issue, e.g. the PID of the submitting process would be helpful as well
>>>> for supervising daemons which need to restart processes when they
>>>> caused some issue.
>>>>
>>> It was me :) we have a use case that we would need the PID for the
>>> daemon indeed, but the daemon doesn't need to know what's the RAS error
>>> or the job name that timeouted, there's no immediate action to be taken
>>> with this information, contrary to the PID that we need to know.
>>>
>> Regarding devcoredump - it's not done every time. For ex: RAS errors
>> have a different way to identify the source of error, hence we don't
>> need a coredump in such cases.
>>
>> The intention is only to let the user know the reason for reset at a
>> high level, and probably add more things later like the engines or
>> queues that have reset etc.
> 
> Well what is the use case for that? That doesn't looks valuable to me.

It's mostly for in-band telemetry reporting through tools like amd-smi -
 more for admin purpose rather than any debug.

Thanks,
Lijo

> 
> RAS errors should generally be reported to the application who issued
> the submission.
> 
> As a system wide event they are only useful in things like logfiles I think.
> 
> Regards,
> Christian.
> 
>> Thanks,
>> Lijo
>>
>>>> We just postponed adding that till later.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>>> Thanks,
>>>>>> Lijo
>>>>>>
>>>>>>> Regards,
>>>>>>> Christian.
>

Raag Jadav Dec. 16, 2024, 1:57 p.m. UTC | #15

On Mon, Dec 16, 2024 at 10:15:00AM -0300, André Almeida wrote:
> Em 16/12/2024 10:10, Christian König escreveu:
> > Am 16.12.24 um 14:04 schrieb André Almeida:
> > > Em 16/12/2024 07:38, Lazar, Lijo escreveu:
> > > > 
> > > > 
> > > > On 12/16/2024 3:48 PM, Christian König wrote:
> > > > > Am 13.12.24 um 16:56 schrieb André Almeida:
> > > > > > Em 13/12/2024 11:36, Raag Jadav escreveu:
> > > > > > > On Fri, Dec 13, 2024 at 11:15:31AM -0300, André Almeida wrote:
> > > > > > > > Hi Christian,
> > > > > > > > 
> > > > > > > > Em 13/12/2024 04:34, Christian König escreveu:
> > > > > > > > > Am 12.12.24 um 20:09 schrieb André Almeida:
> > > > > > > > > > Use DRM's device wedged event to notify userspace that a reset had
> > > > > > > > > > happened. For now, only use `none` method meant for telemetry
> > > > > > > > > > capture.
> > > > > > > > > > 
> > > > > > > > > > Signed-off-by: André Almeida <andrealmeid@igalia.com>
> > > > > > > > > > ---
> > > > > > > > > >     drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 +++
> > > > > > > > > >     1 file changed, 3 insertions(+)
> > > > > > > > > > 
> > > > > > > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > > > > > > > b/drivers/gpu/ drm/amd/amdgpu/amdgpu_device.c
> > > > > > > > > > index 96316111300a..19e1a5493778 100644
> > > > > > > > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > > > > > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > > > > > > > @@ -6057,6 +6057,9 @@ int amdgpu_device_gpu_recover(struct
> > > > > > > > > > amdgpu_device *adev,
> > > > > > > > > >             dev_info(adev->dev, "GPU
> > > > > > > > > > reset end with ret = %d\n", r);
> > > > > > > > > > atomic_set(&adev->reset_domain->reset_res, r);
> > > > > > > > > > +
> > > > > > > > > > +    drm_dev_wedged_event(adev_to_drm(adev),
> > > > > > > > > > DRM_WEDGE_RECOVERY_NONE);
> > > > > > > > > 
> > > > > > > > > That looks really good in general. I would just make the
> > > > > > > > > DRM_WEDGE_RECOVERY_NONE depend on the value of "r".
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > Why depend or `r`? A reset was triggered anyway, regardless of the
> > > > > > > > success
> > > > > > > > of it, shouldn't we tell userspace?
> > > > > > > 
> > > > > > > A failed reset would perhaps result in wedging,
> > > > > > > atleast that's how i915
> > > > > > > is handling it.
> > > > > > > 
> > > > > > 
> > > > > > Right, and I think this raises the question of what wedge recovery
> > > > > > method should I add for amdgpu... Christian?
> > > > > > 
> > > > > 
> > > > > In theory a rebind should be enough to get the device going again, our
> > > > > BOCO does a bus reset on driver load anyway.
> > > > > 
> > > > 
> > > > The behavior varies between SOCs. In certain ones, if driver reset
> > > > fails, that means it's really in a bad state and it would need system
> > > > reboot.
> > > > 
> > > 
> > > Is this documented somewhere? Then I could even add a
> > > DRM_WEDGE_RECOVERY_REBOOT so we can cover every scenario.

This was present in drafts v5 through v7 but later dropped with the
understanding that it is unwise to let a drm device make system level
decisions and rather have something like WEDGED=unknown to let admin/user
decide how to deal with it.

https://patchwork.freedesktop.org/series/138069/

> > Not publicly as far as I know. But indeed a driver reset has basically
> > the same chance of succeeding than a driver reload.
> > 
> > I think the use case we have here is more that the administrator
> > intentionally disabled the reset to allow HW investigation.
> > 
> > So far we did that with a rather broken we don't do anything at all
> > approach.
> 
> OK.
> 
> > 
> > > > I had asked earlier about the utility of this one here. If this is just
> > > > to inform userspace that driver has done a reset and recovered, it would
> > > > need some additional context also. We have a mechanism in KFD which
> > > > sends the context in which a reset has to be done. Currently, that's
> > > > restricted to compute applications, but if this is in a similar line, we
> > > > would like to pass some additional info like job timeout, RAS error etc.
> > > > 
> > > 
> > > DRM_WEDGE_RECOVERY_NONE is to inform userspace that driver has done
> > > a reset and recovered, but additional data about like which job
> > > timeout, RAS error and such belong to devcoredump I guess, where all
> > > data is gathered and collected later.
> > 
> > I think somebody else mentioned it as well that the source of the issue,
> > e.g. the PID of the submitting process would be helpful as well for
> > supervising daemons which need to restart processes when they caused
> > some issue.
> > 
> 
> It was me :) we have a use case that we would need the PID for the daemon
> indeed, but the daemon doesn't need to know what's the RAS error or the job
> name that timeouted, there's no immediate action to be taken with this
> information, contrary to the PID that we need to know.

I think this calls for the standardization of telemetry (devcoredump, syslog
etc) but since each driver has its own way of doing it, it'd be quite an uphill
battle.

Raag

kernel test robot Dec. 20, 2024, 1:31 p.m. UTC | #16

Hi André,

kernel test robot noticed the following build errors:

[auto build test ERROR on linus/master]
[also build test ERROR on v6.13-rc3 next-20241220]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Andr-Almeida/drm-amdgpu-Use-device-wedged-event/20241213-031134
base:   linus/master
patch link:    https://lore.kernel.org/r/20241212190909.28559-2-andrealmeid%40igalia.com
patch subject: [PATCH 1/1] drm/amdgpu: Use device wedged event
config: arm-randconfig-001-20241220 (https://download.01.org/0day-ci/archive/20241220/202412202104.IwpOz5t5-lkp@intel.com/config)
compiler: clang version 19.1.3 (https://github.com/llvm/llvm-project ab51eccf88f5321e7c60591c5546b254b6afab99)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241220/202412202104.IwpOz5t5-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202412202104.IwpOz5t5-lkp@intel.com/

All errors (new ones prefixed by >>):

   In file included from drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:35:
   In file included from include/linux/iommu.h:10:
   In file included from include/linux/scatterlist.h:8:
   In file included from include/linux/mm.h:2223:
   include/linux/vmstat.h:518:36: warning: arithmetic between different enumeration types ('enum node_stat_item' and 'enum lru_list') [-Wenum-enum-conversion]
     518 |         return node_stat_name(NR_LRU_BASE + lru) + 3; // skip "nr_"
         |                               ~~~~~~~~~~~ ^ ~~~
>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:6061:2: error: call to undeclared function 'drm_dev_wedged_event'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
    6061 |         drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE);
         |         ^
>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:6061:42: error: use of undeclared identifier 'DRM_WEDGE_RECOVERY_NONE'
    6061 |         drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE);
         |                                                 ^
   1 warning and 2 errors generated.


vim +/drm_dev_wedged_event +6061 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c

  5794	
  5795	/**
  5796	 * amdgpu_device_gpu_recover - reset the asic and recover scheduler
  5797	 *
  5798	 * @adev: amdgpu_device pointer
  5799	 * @job: which job trigger hang
  5800	 * @reset_context: amdgpu reset context pointer
  5801	 *
  5802	 * Attempt to reset the GPU if it has hung (all asics).
  5803	 * Attempt to do soft-reset or full-reset and reinitialize Asic
  5804	 * Returns 0 for success or an error on failure.
  5805	 */
  5806	
  5807	int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
  5808				      struct amdgpu_job *job,
  5809				      struct amdgpu_reset_context *reset_context)
  5810	{
  5811		struct list_head device_list, *device_list_handle =  NULL;
  5812		bool job_signaled = false;
  5813		struct amdgpu_hive_info *hive = NULL;
  5814		struct amdgpu_device *tmp_adev = NULL;
  5815		int i, r = 0;
  5816		bool need_emergency_restart = false;
  5817		bool audio_suspended = false;
  5818		int retry_limit = AMDGPU_MAX_RETRY_LIMIT;
  5819	
  5820		/*
  5821		 * Special case: RAS triggered and full reset isn't supported
  5822		 */
  5823		need_emergency_restart = amdgpu_ras_need_emergency_restart(adev);
  5824	
  5825		/*
  5826		 * Flush RAM to disk so that after reboot
  5827		 * the user can read log and see why the system rebooted.
  5828		 */
  5829		if (need_emergency_restart && amdgpu_ras_get_context(adev) &&
  5830			amdgpu_ras_get_context(adev)->reboot) {
  5831			DRM_WARN("Emergency reboot.");
  5832	
  5833			ksys_sync_helper();
  5834			emergency_restart();
  5835		}
  5836	
  5837		dev_info(adev->dev, "GPU %s begin!\n",
  5838			need_emergency_restart ? "jobs stop":"reset");
  5839	
  5840		if (!amdgpu_sriov_vf(adev))
  5841			hive = amdgpu_get_xgmi_hive(adev);
  5842		if (hive)
  5843			mutex_lock(&hive->hive_lock);
  5844	
  5845		reset_context->job = job;
  5846		reset_context->hive = hive;
  5847		/*
  5848		 * Build list of devices to reset.
  5849		 * In case we are in XGMI hive mode, resort the device list
  5850		 * to put adev in the 1st position.
  5851		 */
  5852		INIT_LIST_HEAD(&device_list);
  5853		if (!amdgpu_sriov_vf(adev) && (adev->gmc.xgmi.num_physical_nodes > 1) && hive) {
  5854			list_for_each_entry(tmp_adev, &hive->device_list, gmc.xgmi.head) {
  5855				list_add_tail(&tmp_adev->reset_list, &device_list);
  5856				if (adev->shutdown)
  5857					tmp_adev->shutdown = true;
  5858			}
  5859			if (!list_is_first(&adev->reset_list, &device_list))
  5860				list_rotate_to_front(&adev->reset_list, &device_list);
  5861			device_list_handle = &device_list;
  5862		} else {
  5863			list_add_tail(&adev->reset_list, &device_list);
  5864			device_list_handle = &device_list;
  5865		}
  5866	
  5867		if (!amdgpu_sriov_vf(adev)) {
  5868			r = amdgpu_device_health_check(device_list_handle);
  5869			if (r)
  5870				goto end_reset;
  5871		}
  5872	
  5873		/* We need to lock reset domain only once both for XGMI and single device */
  5874		tmp_adev = list_first_entry(device_list_handle, struct amdgpu_device,
  5875					    reset_list);
  5876		amdgpu_device_lock_reset_domain(tmp_adev->reset_domain);
  5877	
  5878		/* block all schedulers and reset given job's ring */
  5879		list_for_each_entry(tmp_adev, device_list_handle, reset_list) {
  5880	
  5881			amdgpu_device_set_mp1_state(tmp_adev);
  5882	
  5883			/*
  5884			 * Try to put the audio codec into suspend state
  5885			 * before gpu reset started.
  5886			 *
  5887			 * Due to the power domain of the graphics device
  5888			 * is shared with AZ power domain. Without this,
  5889			 * we may change the audio hardware from behind
  5890			 * the audio driver's back. That will trigger
  5891			 * some audio codec errors.
  5892			 */
  5893			if (!amdgpu_device_suspend_display_audio(tmp_adev))
  5894				audio_suspended = true;
  5895	
  5896			amdgpu_ras_set_error_query_ready(tmp_adev, false);
  5897	
  5898			cancel_delayed_work_sync(&tmp_adev->delayed_init_work);
  5899	
  5900			amdgpu_amdkfd_pre_reset(tmp_adev, reset_context);
  5901	
  5902			/*
  5903			 * Mark these ASICs to be reseted as untracked first
  5904			 * And add them back after reset completed
  5905			 */
  5906			amdgpu_unregister_gpu_instance(tmp_adev);
  5907	
  5908			drm_client_dev_suspend(adev_to_drm(tmp_adev), false);
  5909	
  5910			/* disable ras on ALL IPs */
  5911			if (!need_emergency_restart &&
  5912			      amdgpu_device_ip_need_full_reset(tmp_adev))
  5913				amdgpu_ras_suspend(tmp_adev);
  5914	
  5915			for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
  5916				struct amdgpu_ring *ring = tmp_adev->rings[i];
  5917	
  5918				if (!amdgpu_ring_sched_ready(ring))
  5919					continue;
  5920	
  5921				drm_sched_stop(&ring->sched, job ? &job->base : NULL);
  5922	
  5923				if (need_emergency_restart)
  5924					amdgpu_job_stop_all_jobs_on_sched(&ring->sched);
  5925			}
  5926			atomic_inc(&tmp_adev->gpu_reset_counter);
  5927		}
  5928	
  5929		if (need_emergency_restart)
  5930			goto skip_sched_resume;
  5931	
  5932		/*
  5933		 * Must check guilty signal here since after this point all old
  5934		 * HW fences are force signaled.
  5935		 *
  5936		 * job->base holds a reference to parent fence
  5937		 */
  5938		if (job && dma_fence_is_signaled(&job->hw_fence)) {
  5939			job_signaled = true;
  5940			dev_info(adev->dev, "Guilty job already signaled, skipping HW reset");
  5941			goto skip_hw_reset;
  5942		}
  5943	
  5944	retry:	/* Rest of adevs pre asic reset from XGMI hive. */
  5945		list_for_each_entry(tmp_adev, device_list_handle, reset_list) {
  5946			r = amdgpu_device_pre_asic_reset(tmp_adev, reset_context);
  5947			/*TODO Should we stop ?*/
  5948			if (r) {
  5949				dev_err(tmp_adev->dev, "GPU pre asic reset failed with err, %d for drm dev, %s ",
  5950					  r, adev_to_drm(tmp_adev)->unique);
  5951				tmp_adev->asic_reset_res = r;
  5952			}
  5953		}
  5954	
  5955		/* Actual ASIC resets if needed.*/
  5956		/* Host driver will handle XGMI hive reset for SRIOV */
  5957		if (amdgpu_sriov_vf(adev)) {
  5958			if (amdgpu_ras_get_fed_status(adev) || amdgpu_virt_rcvd_ras_interrupt(adev)) {
  5959				dev_dbg(adev->dev, "Detected RAS error, wait for FLR completion\n");
  5960				amdgpu_ras_set_fed(adev, true);
  5961				set_bit(AMDGPU_HOST_FLR, &reset_context->flags);
  5962			}
  5963	
  5964			r = amdgpu_device_reset_sriov(adev, reset_context);
  5965			if (AMDGPU_RETRY_SRIOV_RESET(r) && (retry_limit--) > 0) {
  5966				amdgpu_virt_release_full_gpu(adev, true);
  5967				goto retry;
  5968			}
  5969			if (r)
  5970				adev->asic_reset_res = r;
  5971		} else {
  5972			r = amdgpu_do_asic_reset(device_list_handle, reset_context);
  5973			if (r && r == -EAGAIN)
  5974				goto retry;
  5975		}
  5976	
  5977		list_for_each_entry(tmp_adev, device_list_handle, reset_list) {
  5978			/*
  5979			 * Drop any pending non scheduler resets queued before reset is done.
  5980			 * Any reset scheduled after this point would be valid. Scheduler resets
  5981			 * were already dropped during drm_sched_stop and no new ones can come
  5982			 * in before drm_sched_start.
  5983			 */
  5984			amdgpu_device_stop_pending_resets(tmp_adev);
  5985		}
  5986	
  5987	skip_hw_reset:
  5988	
  5989		/* Post ASIC reset for all devs .*/
  5990		list_for_each_entry(tmp_adev, device_list_handle, reset_list) {
  5991	
  5992			for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
  5993				struct amdgpu_ring *ring = tmp_adev->rings[i];
  5994	
  5995				if (!amdgpu_ring_sched_ready(ring))
  5996					continue;
  5997	
  5998				drm_sched_start(&ring->sched, 0);
  5999			}
  6000	
  6001			if (!drm_drv_uses_atomic_modeset(adev_to_drm(tmp_adev)) && !job_signaled)
  6002				drm_helper_resume_force_mode(adev_to_drm(tmp_adev));
  6003	
  6004			if (tmp_adev->asic_reset_res)
  6005				r = tmp_adev->asic_reset_res;
  6006	
  6007			tmp_adev->asic_reset_res = 0;
  6008	
  6009			if (r) {
  6010				/* bad news, how to tell it to userspace ?
  6011				 * for ras error, we should report GPU bad status instead of
  6012				 * reset failure
  6013				 */
  6014				if (reset_context->src != AMDGPU_RESET_SRC_RAS ||
  6015				    !amdgpu_ras_eeprom_check_err_threshold(tmp_adev))
  6016					dev_info(tmp_adev->dev, "GPU reset(%d) failed\n",
  6017						atomic_read(&tmp_adev->gpu_reset_counter));
  6018				amdgpu_vf_error_put(tmp_adev, AMDGIM_ERROR_VF_GPU_RESET_FAIL, 0, r);
  6019			} else {
  6020				dev_info(tmp_adev->dev, "GPU reset(%d) succeeded!\n", atomic_read(&tmp_adev->gpu_reset_counter));
  6021				if (amdgpu_acpi_smart_shift_update(adev_to_drm(tmp_adev), AMDGPU_SS_DEV_D0))
  6022					DRM_WARN("smart shift update failed\n");
  6023			}
  6024		}
  6025	
  6026	skip_sched_resume:
  6027		list_for_each_entry(tmp_adev, device_list_handle, reset_list) {
  6028			/* unlock kfd: SRIOV would do it separately */
  6029			if (!need_emergency_restart && !amdgpu_sriov_vf(tmp_adev))
  6030				amdgpu_amdkfd_post_reset(tmp_adev);
  6031	
  6032			/* kfd_post_reset will do nothing if kfd device is not initialized,
  6033			 * need to bring up kfd here if it's not be initialized before
  6034			 */
  6035			if (!adev->kfd.init_complete)
  6036				amdgpu_amdkfd_device_init(adev);
  6037	
  6038			if (audio_suspended)
  6039				amdgpu_device_resume_display_audio(tmp_adev);
  6040	
  6041			amdgpu_device_unset_mp1_state(tmp_adev);
  6042	
  6043			amdgpu_ras_set_error_query_ready(tmp_adev, true);
  6044		}
  6045	
  6046		tmp_adev = list_first_entry(device_list_handle, struct amdgpu_device,
  6047						    reset_list);
  6048		amdgpu_device_unlock_reset_domain(tmp_adev->reset_domain);
  6049	
  6050	end_reset:
  6051		if (hive) {
  6052			mutex_unlock(&hive->hive_lock);
  6053			amdgpu_put_xgmi_hive(hive);
  6054		}
  6055	
  6056		if (r)
  6057			dev_info(adev->dev, "GPU reset end with ret = %d\n", r);
  6058	
  6059		atomic_set(&adev->reset_domain->reset_res, r);
  6060	
> 6061		drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE);
  6062	
  6063		return r;
  6064	}
  6065

kernel test robot Dec. 20, 2024, 2:06 p.m. UTC | #17

Hi André,

kernel test robot noticed the following build errors:

[auto build test ERROR on linus/master]
[also build test ERROR on drm-intel/for-linux-next drm-intel/for-linux-next-fixes drm-tip/drm-tip v6.13-rc3 next-20241220]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Andr-Almeida/drm-amdgpu-Use-device-wedged-event/20241213-031134
base:   linus/master
patch link:    https://lore.kernel.org/r/20241212190909.28559-2-andrealmeid%40igalia.com
patch subject: [PATCH 1/1] drm/amdgpu: Use device wedged event
config: s390-randconfig-001-20241220 (https://download.01.org/0day-ci/archive/20241220/202412202144.SalviKDh-lkp@intel.com/config)
compiler: s390-linux-gcc (GCC) 14.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241220/202412202144.SalviKDh-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202412202144.SalviKDh-lkp@intel.com/

All errors (new ones prefixed by >>):

   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c: In function 'amdgpu_device_gpu_recover':
>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:6061:9: error: implicit declaration of function 'drm_dev_wedged_event' [-Wimplicit-function-declaration]
    6061 |         drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE);
         |         ^~~~~~~~~~~~~~~~~~~~
>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:6061:49: error: 'DRM_WEDGE_RECOVERY_NONE' undeclared (first use in this function); did you mean 'DRM_MODE_ENCODER_NONE'?
    6061 |         drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE);
         |                                                 ^~~~~~~~~~~~~~~~~~~~~~~
         |                                                 DRM_MODE_ENCODER_NONE
   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:6061:49: note: each undeclared identifier is reported only once for each function it appears in


vim +/drm_dev_wedged_event +6061 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c

  5794	
  5795	/**
  5796	 * amdgpu_device_gpu_recover - reset the asic and recover scheduler
  5797	 *
  5798	 * @adev: amdgpu_device pointer
  5799	 * @job: which job trigger hang
  5800	 * @reset_context: amdgpu reset context pointer
  5801	 *
  5802	 * Attempt to reset the GPU if it has hung (all asics).
  5803	 * Attempt to do soft-reset or full-reset and reinitialize Asic
  5804	 * Returns 0 for success or an error on failure.
  5805	 */
  5806	
  5807	int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
  5808				      struct amdgpu_job *job,
  5809				      struct amdgpu_reset_context *reset_context)
  5810	{
  5811		struct list_head device_list, *device_list_handle =  NULL;
  5812		bool job_signaled = false;
  5813		struct amdgpu_hive_info *hive = NULL;
  5814		struct amdgpu_device *tmp_adev = NULL;
  5815		int i, r = 0;
  5816		bool need_emergency_restart = false;
  5817		bool audio_suspended = false;
  5818		int retry_limit = AMDGPU_MAX_RETRY_LIMIT;
  5819	
  5820		/*
  5821		 * Special case: RAS triggered and full reset isn't supported
  5822		 */
  5823		need_emergency_restart = amdgpu_ras_need_emergency_restart(adev);
  5824	
  5825		/*
  5826		 * Flush RAM to disk so that after reboot
  5827		 * the user can read log and see why the system rebooted.
  5828		 */
  5829		if (need_emergency_restart && amdgpu_ras_get_context(adev) &&
  5830			amdgpu_ras_get_context(adev)->reboot) {
  5831			DRM_WARN("Emergency reboot.");
  5832	
  5833			ksys_sync_helper();
  5834			emergency_restart();
  5835		}
  5836	
  5837		dev_info(adev->dev, "GPU %s begin!\n",
  5838			need_emergency_restart ? "jobs stop":"reset");
  5839	
  5840		if (!amdgpu_sriov_vf(adev))
  5841			hive = amdgpu_get_xgmi_hive(adev);
  5842		if (hive)
  5843			mutex_lock(&hive->hive_lock);
  5844	
  5845		reset_context->job = job;
  5846		reset_context->hive = hive;
  5847		/*
  5848		 * Build list of devices to reset.
  5849		 * In case we are in XGMI hive mode, resort the device list
  5850		 * to put adev in the 1st position.
  5851		 */
  5852		INIT_LIST_HEAD(&device_list);
  5853		if (!amdgpu_sriov_vf(adev) && (adev->gmc.xgmi.num_physical_nodes > 1) && hive) {
  5854			list_for_each_entry(tmp_adev, &hive->device_list, gmc.xgmi.head) {
  5855				list_add_tail(&tmp_adev->reset_list, &device_list);
  5856				if (adev->shutdown)
  5857					tmp_adev->shutdown = true;
  5858			}
  5859			if (!list_is_first(&adev->reset_list, &device_list))
  5860				list_rotate_to_front(&adev->reset_list, &device_list);
  5861			device_list_handle = &device_list;
  5862		} else {
  5863			list_add_tail(&adev->reset_list, &device_list);
  5864			device_list_handle = &device_list;
  5865		}
  5866	
  5867		if (!amdgpu_sriov_vf(adev)) {
  5868			r = amdgpu_device_health_check(device_list_handle);
  5869			if (r)
  5870				goto end_reset;
  5871		}
  5872	
  5873		/* We need to lock reset domain only once both for XGMI and single device */
  5874		tmp_adev = list_first_entry(device_list_handle, struct amdgpu_device,
  5875					    reset_list);
  5876		amdgpu_device_lock_reset_domain(tmp_adev->reset_domain);
  5877	
  5878		/* block all schedulers and reset given job's ring */
  5879		list_for_each_entry(tmp_adev, device_list_handle, reset_list) {
  5880	
  5881			amdgpu_device_set_mp1_state(tmp_adev);
  5882	
  5883			/*
  5884			 * Try to put the audio codec into suspend state
  5885			 * before gpu reset started.
  5886			 *
  5887			 * Due to the power domain of the graphics device
  5888			 * is shared with AZ power domain. Without this,
  5889			 * we may change the audio hardware from behind
  5890			 * the audio driver's back. That will trigger
  5891			 * some audio codec errors.
  5892			 */
  5893			if (!amdgpu_device_suspend_display_audio(tmp_adev))
  5894				audio_suspended = true;
  5895	
  5896			amdgpu_ras_set_error_query_ready(tmp_adev, false);
  5897	
  5898			cancel_delayed_work_sync(&tmp_adev->delayed_init_work);
  5899	
  5900			amdgpu_amdkfd_pre_reset(tmp_adev, reset_context);
  5901	
  5902			/*
  5903			 * Mark these ASICs to be reseted as untracked first
  5904			 * And add them back after reset completed
  5905			 */
  5906			amdgpu_unregister_gpu_instance(tmp_adev);
  5907	
  5908			drm_client_dev_suspend(adev_to_drm(tmp_adev), false);
  5909	
  5910			/* disable ras on ALL IPs */
  5911			if (!need_emergency_restart &&
  5912			      amdgpu_device_ip_need_full_reset(tmp_adev))
  5913				amdgpu_ras_suspend(tmp_adev);
  5914	
  5915			for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
  5916				struct amdgpu_ring *ring = tmp_adev->rings[i];
  5917	
  5918				if (!amdgpu_ring_sched_ready(ring))
  5919					continue;
  5920	
  5921				drm_sched_stop(&ring->sched, job ? &job->base : NULL);
  5922	
  5923				if (need_emergency_restart)
  5924					amdgpu_job_stop_all_jobs_on_sched(&ring->sched);
  5925			}
  5926			atomic_inc(&tmp_adev->gpu_reset_counter);
  5927		}
  5928	
  5929		if (need_emergency_restart)
  5930			goto skip_sched_resume;
  5931	
  5932		/*
  5933		 * Must check guilty signal here since after this point all old
  5934		 * HW fences are force signaled.
  5935		 *
  5936		 * job->base holds a reference to parent fence
  5937		 */
  5938		if (job && dma_fence_is_signaled(&job->hw_fence)) {
  5939			job_signaled = true;
  5940			dev_info(adev->dev, "Guilty job already signaled, skipping HW reset");
  5941			goto skip_hw_reset;
  5942		}
  5943	
  5944	retry:	/* Rest of adevs pre asic reset from XGMI hive. */
  5945		list_for_each_entry(tmp_adev, device_list_handle, reset_list) {
  5946			r = amdgpu_device_pre_asic_reset(tmp_adev, reset_context);
  5947			/*TODO Should we stop ?*/
  5948			if (r) {
  5949				dev_err(tmp_adev->dev, "GPU pre asic reset failed with err, %d for drm dev, %s ",
  5950					  r, adev_to_drm(tmp_adev)->unique);
  5951				tmp_adev->asic_reset_res = r;
  5952			}
  5953		}
  5954	
  5955		/* Actual ASIC resets if needed.*/
  5956		/* Host driver will handle XGMI hive reset for SRIOV */
  5957		if (amdgpu_sriov_vf(adev)) {
  5958			if (amdgpu_ras_get_fed_status(adev) || amdgpu_virt_rcvd_ras_interrupt(adev)) {
  5959				dev_dbg(adev->dev, "Detected RAS error, wait for FLR completion\n");
  5960				amdgpu_ras_set_fed(adev, true);
  5961				set_bit(AMDGPU_HOST_FLR, &reset_context->flags);
  5962			}
  5963	
  5964			r = amdgpu_device_reset_sriov(adev, reset_context);
  5965			if (AMDGPU_RETRY_SRIOV_RESET(r) && (retry_limit--) > 0) {
  5966				amdgpu_virt_release_full_gpu(adev, true);
  5967				goto retry;
  5968			}
  5969			if (r)
  5970				adev->asic_reset_res = r;
  5971		} else {
  5972			r = amdgpu_do_asic_reset(device_list_handle, reset_context);
  5973			if (r && r == -EAGAIN)
  5974				goto retry;
  5975		}
  5976	
  5977		list_for_each_entry(tmp_adev, device_list_handle, reset_list) {
  5978			/*
  5979			 * Drop any pending non scheduler resets queued before reset is done.
  5980			 * Any reset scheduled after this point would be valid. Scheduler resets
  5981			 * were already dropped during drm_sched_stop and no new ones can come
  5982			 * in before drm_sched_start.
  5983			 */
  5984			amdgpu_device_stop_pending_resets(tmp_adev);
  5985		}
  5986	
  5987	skip_hw_reset:
  5988	
  5989		/* Post ASIC reset for all devs .*/
  5990		list_for_each_entry(tmp_adev, device_list_handle, reset_list) {
  5991	
  5992			for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
  5993				struct amdgpu_ring *ring = tmp_adev->rings[i];
  5994	
  5995				if (!amdgpu_ring_sched_ready(ring))
  5996					continue;
  5997	
  5998				drm_sched_start(&ring->sched, 0);
  5999			}
  6000	
  6001			if (!drm_drv_uses_atomic_modeset(adev_to_drm(tmp_adev)) && !job_signaled)
  6002				drm_helper_resume_force_mode(adev_to_drm(tmp_adev));
  6003	
  6004			if (tmp_adev->asic_reset_res)
  6005				r = tmp_adev->asic_reset_res;
  6006	
  6007			tmp_adev->asic_reset_res = 0;
  6008	
  6009			if (r) {
  6010				/* bad news, how to tell it to userspace ?
  6011				 * for ras error, we should report GPU bad status instead of
  6012				 * reset failure
  6013				 */
  6014				if (reset_context->src != AMDGPU_RESET_SRC_RAS ||
  6015				    !amdgpu_ras_eeprom_check_err_threshold(tmp_adev))
  6016					dev_info(tmp_adev->dev, "GPU reset(%d) failed\n",
  6017						atomic_read(&tmp_adev->gpu_reset_counter));
  6018				amdgpu_vf_error_put(tmp_adev, AMDGIM_ERROR_VF_GPU_RESET_FAIL, 0, r);
  6019			} else {
  6020				dev_info(tmp_adev->dev, "GPU reset(%d) succeeded!\n", atomic_read(&tmp_adev->gpu_reset_counter));
  6021				if (amdgpu_acpi_smart_shift_update(adev_to_drm(tmp_adev), AMDGPU_SS_DEV_D0))
  6022					DRM_WARN("smart shift update failed\n");
  6023			}
  6024		}
  6025	
  6026	skip_sched_resume:
  6027		list_for_each_entry(tmp_adev, device_list_handle, reset_list) {
  6028			/* unlock kfd: SRIOV would do it separately */
  6029			if (!need_emergency_restart && !amdgpu_sriov_vf(tmp_adev))
  6030				amdgpu_amdkfd_post_reset(tmp_adev);
  6031	
  6032			/* kfd_post_reset will do nothing if kfd device is not initialized,
  6033			 * need to bring up kfd here if it's not be initialized before
  6034			 */
  6035			if (!adev->kfd.init_complete)
  6036				amdgpu_amdkfd_device_init(adev);
  6037	
  6038			if (audio_suspended)
  6039				amdgpu_device_resume_display_audio(tmp_adev);
  6040	
  6041			amdgpu_device_unset_mp1_state(tmp_adev);
  6042	
  6043			amdgpu_ras_set_error_query_ready(tmp_adev, true);
  6044		}
  6045	
  6046		tmp_adev = list_first_entry(device_list_handle, struct amdgpu_device,
  6047						    reset_list);
  6048		amdgpu_device_unlock_reset_domain(tmp_adev->reset_domain);
  6049	
  6050	end_reset:
  6051		if (hive) {
  6052			mutex_unlock(&hive->hive_lock);
  6053			amdgpu_put_xgmi_hive(hive);
  6054		}
  6055	
  6056		if (r)
  6057			dev_info(adev->dev, "GPU reset end with ret = %d\n", r);
  6058	
  6059		atomic_set(&adev->reset_domain->reset_res, r);
  6060	
> 6061		drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE);
  6062	
  6063		return r;
  6064	}
  6065

[1/1] drm/amdgpu: Use device wedged event

Commit Message

Comments

Patch