diff mbox series

[1/1] drm/amdgpu: Use device wedged event

Message ID 20241212190909.28559-2-andrealmeid@igalia.com (mailing list archive)
State New
Headers show
Series drm/amdgpu: Use device wedged event | expand

Commit Message

André Almeida Dec. 12, 2024, 7:09 p.m. UTC
Use DRM's device wedged event to notify userspace that a reset had
happened. For now, only use `none` method meant for telemetry
capture.

Signed-off-by: André Almeida <andrealmeid@igalia.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 +++
 1 file changed, 3 insertions(+)

Comments

André Almeida Dec. 13, 2024, 2:15 p.m. UTC | #1
Hi Christian,

Em 13/12/2024 04:34, Christian König escreveu:
> Am 12.12.24 um 20:09 schrieb André Almeida:
>> Use DRM's device wedged event to notify userspace that a reset had
>> happened. For now, only use `none` method meant for telemetry
>> capture.
>>
>> Signed-off-by: André Almeida <andrealmeid@igalia.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 +++
>>   1 file changed, 3 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/ 
>> drm/amd/amdgpu/amdgpu_device.c
>> index 96316111300a..19e1a5493778 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -6057,6 +6057,9 @@ int amdgpu_device_gpu_recover(struct 
>> amdgpu_device *adev,
>>           dev_info(adev->dev, "GPU reset end with ret = %d\n", r);
>>       atomic_set(&adev->reset_domain->reset_res, r);
>> +
>> +    drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE);
> 
> That looks really good in general. I would just make the 
> DRM_WEDGE_RECOVERY_NONE depend on the value of "r".
> 

Why depend or `r`? A reset was triggered anyway, regardless of the 
success of it, shouldn't we tell userspace?
Raag Jadav Dec. 13, 2024, 2:36 p.m. UTC | #2
On Fri, Dec 13, 2024 at 11:15:31AM -0300, André Almeida wrote:
> Hi Christian,
> 
> Em 13/12/2024 04:34, Christian König escreveu:
> > Am 12.12.24 um 20:09 schrieb André Almeida:
> > > Use DRM's device wedged event to notify userspace that a reset had
> > > happened. For now, only use `none` method meant for telemetry
> > > capture.
> > > 
> > > Signed-off-by: André Almeida <andrealmeid@igalia.com>
> > > ---
> > >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 +++
> > >   1 file changed, 3 insertions(+)
> > > 
> > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > b/drivers/gpu/ drm/amd/amdgpu/amdgpu_device.c
> > > index 96316111300a..19e1a5493778 100644
> > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > @@ -6057,6 +6057,9 @@ int amdgpu_device_gpu_recover(struct
> > > amdgpu_device *adev,
> > >           dev_info(adev->dev, "GPU reset end with ret = %d\n", r);
> > >       atomic_set(&adev->reset_domain->reset_res, r);
> > > +
> > > +    drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE);
> > 
> > That looks really good in general. I would just make the
> > DRM_WEDGE_RECOVERY_NONE depend on the value of "r".
> > 
> 
> Why depend or `r`? A reset was triggered anyway, regardless of the success
> of it, shouldn't we tell userspace?

A failed reset would perhaps result in wedging, atleast that's how i915
is handling it.

Raag
André Almeida Dec. 13, 2024, 3:56 p.m. UTC | #3
Em 13/12/2024 11:36, Raag Jadav escreveu:
> On Fri, Dec 13, 2024 at 11:15:31AM -0300, André Almeida wrote:
>> Hi Christian,
>>
>> Em 13/12/2024 04:34, Christian König escreveu:
>>> Am 12.12.24 um 20:09 schrieb André Almeida:
>>>> Use DRM's device wedged event to notify userspace that a reset had
>>>> happened. For now, only use `none` method meant for telemetry
>>>> capture.
>>>>
>>>> Signed-off-by: André Almeida <andrealmeid@igalia.com>
>>>> ---
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 +++
>>>>    1 file changed, 3 insertions(+)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> b/drivers/gpu/ drm/amd/amdgpu/amdgpu_device.c
>>>> index 96316111300a..19e1a5493778 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> @@ -6057,6 +6057,9 @@ int amdgpu_device_gpu_recover(struct
>>>> amdgpu_device *adev,
>>>>            dev_info(adev->dev, "GPU reset end with ret = %d\n", r);
>>>>        atomic_set(&adev->reset_domain->reset_res, r);
>>>> +
>>>> +    drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE);
>>>
>>> That looks really good in general. I would just make the
>>> DRM_WEDGE_RECOVERY_NONE depend on the value of "r".
>>>
>>
>> Why depend or `r`? A reset was triggered anyway, regardless of the success
>> of it, shouldn't we tell userspace?
> 
> A failed reset would perhaps result in wedging, atleast that's how i915
> is handling it.
> 

Right, and I think this raises the question of what wedge recovery 
method should I add for amdgpu... Christian?
Christian König Dec. 16, 2024, 10:18 a.m. UTC | #4
Am 13.12.24 um 16:56 schrieb André Almeida:
> Em 13/12/2024 11:36, Raag Jadav escreveu:
>> On Fri, Dec 13, 2024 at 11:15:31AM -0300, André Almeida wrote:
>>> Hi Christian,
>>>
>>> Em 13/12/2024 04:34, Christian König escreveu:
>>>> Am 12.12.24 um 20:09 schrieb André Almeida:
>>>>> Use DRM's device wedged event to notify userspace that a reset had
>>>>> happened. For now, only use `none` method meant for telemetry
>>>>> capture.
>>>>>
>>>>> Signed-off-by: André Almeida <andrealmeid@igalia.com>
>>>>> ---
>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 +++
>>>>>    1 file changed, 3 insertions(+)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> b/drivers/gpu/ drm/amd/amdgpu/amdgpu_device.c
>>>>> index 96316111300a..19e1a5493778 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> @@ -6057,6 +6057,9 @@ int amdgpu_device_gpu_recover(struct
>>>>> amdgpu_device *adev,
>>>>>            dev_info(adev->dev, "GPU reset end with ret = %d\n", r);
>>>>> atomic_set(&adev->reset_domain->reset_res, r);
>>>>> +
>>>>> +    drm_dev_wedged_event(adev_to_drm(adev), 
>>>>> DRM_WEDGE_RECOVERY_NONE);
>>>>
>>>> That looks really good in general. I would just make the
>>>> DRM_WEDGE_RECOVERY_NONE depend on the value of "r".
>>>>
>>>
>>> Why depend or `r`? A reset was triggered anyway, regardless of the 
>>> success
>>> of it, shouldn't we tell userspace?
>>
>> A failed reset would perhaps result in wedging, atleast that's how i915
>> is handling it.
>>
>
> Right, and I think this raises the question of what wedge recovery 
> method should I add for amdgpu... Christian?
>

In theory a rebind should be enough to get the device going again, our 
BOCO does a bus reset on driver load anyway.

Regards,
Christian.
Lazar, Lijo Dec. 16, 2024, 10:38 a.m. UTC | #5
On 12/16/2024 3:48 PM, Christian König wrote:
> Am 13.12.24 um 16:56 schrieb André Almeida:
>> Em 13/12/2024 11:36, Raag Jadav escreveu:
>>> On Fri, Dec 13, 2024 at 11:15:31AM -0300, André Almeida wrote:
>>>> Hi Christian,
>>>>
>>>> Em 13/12/2024 04:34, Christian König escreveu:
>>>>> Am 12.12.24 um 20:09 schrieb André Almeida:
>>>>>> Use DRM's device wedged event to notify userspace that a reset had
>>>>>> happened. For now, only use `none` method meant for telemetry
>>>>>> capture.
>>>>>>
>>>>>> Signed-off-by: André Almeida <andrealmeid@igalia.com>
>>>>>> ---
>>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 +++
>>>>>>    1 file changed, 3 insertions(+)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> b/drivers/gpu/ drm/amd/amdgpu/amdgpu_device.c
>>>>>> index 96316111300a..19e1a5493778 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> @@ -6057,6 +6057,9 @@ int amdgpu_device_gpu_recover(struct
>>>>>> amdgpu_device *adev,
>>>>>>            dev_info(adev->dev, "GPU reset end with ret = %d\n", r);
>>>>>> atomic_set(&adev->reset_domain->reset_res, r);
>>>>>> +
>>>>>> +    drm_dev_wedged_event(adev_to_drm(adev),
>>>>>> DRM_WEDGE_RECOVERY_NONE);
>>>>>
>>>>> That looks really good in general. I would just make the
>>>>> DRM_WEDGE_RECOVERY_NONE depend on the value of "r".
>>>>>
>>>>
>>>> Why depend or `r`? A reset was triggered anyway, regardless of the
>>>> success
>>>> of it, shouldn't we tell userspace?
>>>
>>> A failed reset would perhaps result in wedging, atleast that's how i915
>>> is handling it.
>>>
>>
>> Right, and I think this raises the question of what wedge recovery
>> method should I add for amdgpu... Christian?
>>
> 
> In theory a rebind should be enough to get the device going again, our
> BOCO does a bus reset on driver load anyway.
> 

The behavior varies between SOCs. In certain ones, if driver reset
fails, that means it's really in a bad state and it would need system
reboot.

I had asked earlier about the utility of this one here. If this is just
to inform userspace that driver has done a reset and recovered, it would
need some additional context also. We have a mechanism in KFD which
sends the context in which a reset has to be done. Currently, that's
restricted to compute applications, but if this is in a similar line, we
would like to pass some additional info like job timeout, RAS error etc.

Thanks,
Lijo

> Regards,
> Christian.
André Almeida Dec. 16, 2024, 1:04 p.m. UTC | #6
Em 16/12/2024 07:38, Lazar, Lijo escreveu:
> 
> 
> On 12/16/2024 3:48 PM, Christian König wrote:
>> Am 13.12.24 um 16:56 schrieb André Almeida:
>>> Em 13/12/2024 11:36, Raag Jadav escreveu:
>>>> On Fri, Dec 13, 2024 at 11:15:31AM -0300, André Almeida wrote:
>>>>> Hi Christian,
>>>>>
>>>>> Em 13/12/2024 04:34, Christian König escreveu:
>>>>>> Am 12.12.24 um 20:09 schrieb André Almeida:
>>>>>>> Use DRM's device wedged event to notify userspace that a reset had
>>>>>>> happened. For now, only use `none` method meant for telemetry
>>>>>>> capture.
>>>>>>>
>>>>>>> Signed-off-by: André Almeida <andrealmeid@igalia.com>
>>>>>>> ---
>>>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 +++
>>>>>>>     1 file changed, 3 insertions(+)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>> b/drivers/gpu/ drm/amd/amdgpu/amdgpu_device.c
>>>>>>> index 96316111300a..19e1a5493778 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>> @@ -6057,6 +6057,9 @@ int amdgpu_device_gpu_recover(struct
>>>>>>> amdgpu_device *adev,
>>>>>>>             dev_info(adev->dev, "GPU reset end with ret = %d\n", r);
>>>>>>> atomic_set(&adev->reset_domain->reset_res, r);
>>>>>>> +
>>>>>>> +    drm_dev_wedged_event(adev_to_drm(adev),
>>>>>>> DRM_WEDGE_RECOVERY_NONE);
>>>>>>
>>>>>> That looks really good in general. I would just make the
>>>>>> DRM_WEDGE_RECOVERY_NONE depend on the value of "r".
>>>>>>
>>>>>
>>>>> Why depend or `r`? A reset was triggered anyway, regardless of the
>>>>> success
>>>>> of it, shouldn't we tell userspace?
>>>>
>>>> A failed reset would perhaps result in wedging, atleast that's how i915
>>>> is handling it.
>>>>
>>>
>>> Right, and I think this raises the question of what wedge recovery
>>> method should I add for amdgpu... Christian?
>>>
>>
>> In theory a rebind should be enough to get the device going again, our
>> BOCO does a bus reset on driver load anyway.
>>
> 
> The behavior varies between SOCs. In certain ones, if driver reset
> fails, that means it's really in a bad state and it would need system
> reboot.
> 

Is this documented somewhere? Then I could even add a 
DRM_WEDGE_RECOVERY_REBOOT so we can cover every scenario.

> I had asked earlier about the utility of this one here. If this is just
> to inform userspace that driver has done a reset and recovered, it would
> need some additional context also. We have a mechanism in KFD which
> sends the context in which a reset has to be done. Currently, that's
> restricted to compute applications, but if this is in a similar line, we
> would like to pass some additional info like job timeout, RAS error etc.
> 

DRM_WEDGE_RECOVERY_NONE is to inform userspace that driver has done a 
reset and recovered, but additional data about like which job timeout, 
RAS error and such belong to devcoredump I guess, where all data is 
gathered and collected later.

> Thanks,
> Lijo
> 
>> Regards,
>> Christian.
>
Christian König Dec. 16, 2024, 1:10 p.m. UTC | #7
Am 16.12.24 um 14:04 schrieb André Almeida:
> Em 16/12/2024 07:38, Lazar, Lijo escreveu:
>>
>>
>> On 12/16/2024 3:48 PM, Christian König wrote:
>>> Am 13.12.24 um 16:56 schrieb André Almeida:
>>>> Em 13/12/2024 11:36, Raag Jadav escreveu:
>>>>> On Fri, Dec 13, 2024 at 11:15:31AM -0300, André Almeida wrote:
>>>>>> Hi Christian,
>>>>>>
>>>>>> Em 13/12/2024 04:34, Christian König escreveu:
>>>>>>> Am 12.12.24 um 20:09 schrieb André Almeida:
>>>>>>>> Use DRM's device wedged event to notify userspace that a reset had
>>>>>>>> happened. For now, only use `none` method meant for telemetry
>>>>>>>> capture.
>>>>>>>>
>>>>>>>> Signed-off-by: André Almeida <andrealmeid@igalia.com>
>>>>>>>> ---
>>>>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 +++
>>>>>>>>     1 file changed, 3 insertions(+)
>>>>>>>>
>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>> b/drivers/gpu/ drm/amd/amdgpu/amdgpu_device.c
>>>>>>>> index 96316111300a..19e1a5493778 100644
>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>> @@ -6057,6 +6057,9 @@ int amdgpu_device_gpu_recover(struct
>>>>>>>> amdgpu_device *adev,
>>>>>>>>             dev_info(adev->dev, "GPU reset end with ret = 
>>>>>>>> %d\n", r);
>>>>>>>> atomic_set(&adev->reset_domain->reset_res, r);
>>>>>>>> +
>>>>>>>> +    drm_dev_wedged_event(adev_to_drm(adev),
>>>>>>>> DRM_WEDGE_RECOVERY_NONE);
>>>>>>>
>>>>>>> That looks really good in general. I would just make the
>>>>>>> DRM_WEDGE_RECOVERY_NONE depend on the value of "r".
>>>>>>>
>>>>>>
>>>>>> Why depend or `r`? A reset was triggered anyway, regardless of the
>>>>>> success
>>>>>> of it, shouldn't we tell userspace?
>>>>>
>>>>> A failed reset would perhaps result in wedging, atleast that's how 
>>>>> i915
>>>>> is handling it.
>>>>>
>>>>
>>>> Right, and I think this raises the question of what wedge recovery
>>>> method should I add for amdgpu... Christian?
>>>>
>>>
>>> In theory a rebind should be enough to get the device going again, our
>>> BOCO does a bus reset on driver load anyway.
>>>
>>
>> The behavior varies between SOCs. In certain ones, if driver reset
>> fails, that means it's really in a bad state and it would need system
>> reboot.
>>
>
> Is this documented somewhere? Then I could even add a 
> DRM_WEDGE_RECOVERY_REBOOT so we can cover every scenario.

Not publicly as far as I know. But indeed a driver reset has basically 
the same chance of succeeding than a driver reload.

I think the use case we have here is more that the administrator 
intentionally disabled the reset to allow HW investigation.

So far we did that with a rather broken we don't do anything at all 
approach.

>> I had asked earlier about the utility of this one here. If this is just
>> to inform userspace that driver has done a reset and recovered, it would
>> need some additional context also. We have a mechanism in KFD which
>> sends the context in which a reset has to be done. Currently, that's
>> restricted to compute applications, but if this is in a similar line, we
>> would like to pass some additional info like job timeout, RAS error etc.
>>
>
> DRM_WEDGE_RECOVERY_NONE is to inform userspace that driver has done a 
> reset and recovered, but additional data about like which job timeout, 
> RAS error and such belong to devcoredump I guess, where all data is 
> gathered and collected later.

I think somebody else mentioned it as well that the source of the issue, 
e.g. the PID of the submitting process would be helpful as well for 
supervising daemons which need to restart processes when they caused 
some issue.

We just postponed adding that till later.

Regards,
Christian.

>
>> Thanks,
>> Lijo
>>
>>> Regards,
>>> Christian.
>>
>
André Almeida Dec. 16, 2024, 1:15 p.m. UTC | #8
Em 16/12/2024 10:10, Christian König escreveu:
> Am 16.12.24 um 14:04 schrieb André Almeida:
>> Em 16/12/2024 07:38, Lazar, Lijo escreveu:
>>>
>>>
>>> On 12/16/2024 3:48 PM, Christian König wrote:
>>>> Am 13.12.24 um 16:56 schrieb André Almeida:
>>>>> Em 13/12/2024 11:36, Raag Jadav escreveu:
>>>>>> On Fri, Dec 13, 2024 at 11:15:31AM -0300, André Almeida wrote:
>>>>>>> Hi Christian,
>>>>>>>
>>>>>>> Em 13/12/2024 04:34, Christian König escreveu:
>>>>>>>> Am 12.12.24 um 20:09 schrieb André Almeida:
>>>>>>>>> Use DRM's device wedged event to notify userspace that a reset had
>>>>>>>>> happened. For now, only use `none` method meant for telemetry
>>>>>>>>> capture.
>>>>>>>>>
>>>>>>>>> Signed-off-by: André Almeida <andrealmeid@igalia.com>
>>>>>>>>> ---
>>>>>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 +++
>>>>>>>>>     1 file changed, 3 insertions(+)
>>>>>>>>>
>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>> b/drivers/gpu/ drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>> index 96316111300a..19e1a5493778 100644
>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>> @@ -6057,6 +6057,9 @@ int amdgpu_device_gpu_recover(struct
>>>>>>>>> amdgpu_device *adev,
>>>>>>>>>             dev_info(adev->dev, "GPU reset end with ret = 
>>>>>>>>> %d\n", r);
>>>>>>>>> atomic_set(&adev->reset_domain->reset_res, r);
>>>>>>>>> +
>>>>>>>>> +    drm_dev_wedged_event(adev_to_drm(adev),
>>>>>>>>> DRM_WEDGE_RECOVERY_NONE);
>>>>>>>>
>>>>>>>> That looks really good in general. I would just make the
>>>>>>>> DRM_WEDGE_RECOVERY_NONE depend on the value of "r".
>>>>>>>>
>>>>>>>
>>>>>>> Why depend or `r`? A reset was triggered anyway, regardless of the
>>>>>>> success
>>>>>>> of it, shouldn't we tell userspace?
>>>>>>
>>>>>> A failed reset would perhaps result in wedging, atleast that's how 
>>>>>> i915
>>>>>> is handling it.
>>>>>>
>>>>>
>>>>> Right, and I think this raises the question of what wedge recovery
>>>>> method should I add for amdgpu... Christian?
>>>>>
>>>>
>>>> In theory a rebind should be enough to get the device going again, our
>>>> BOCO does a bus reset on driver load anyway.
>>>>
>>>
>>> The behavior varies between SOCs. In certain ones, if driver reset
>>> fails, that means it's really in a bad state and it would need system
>>> reboot.
>>>
>>
>> Is this documented somewhere? Then I could even add a 
>> DRM_WEDGE_RECOVERY_REBOOT so we can cover every scenario.
> 
> Not publicly as far as I know. But indeed a driver reset has basically 
> the same chance of succeeding than a driver reload.
> 
> I think the use case we have here is more that the administrator 
> intentionally disabled the reset to allow HW investigation.
> 
> So far we did that with a rather broken we don't do anything at all 
> approach.

OK.

> 
>>> I had asked earlier about the utility of this one here. If this is just
>>> to inform userspace that driver has done a reset and recovered, it would
>>> need some additional context also. We have a mechanism in KFD which
>>> sends the context in which a reset has to be done. Currently, that's
>>> restricted to compute applications, but if this is in a similar line, we
>>> would like to pass some additional info like job timeout, RAS error etc.
>>>
>>
>> DRM_WEDGE_RECOVERY_NONE is to inform userspace that driver has done a 
>> reset and recovered, but additional data about like which job timeout, 
>> RAS error and such belong to devcoredump I guess, where all data is 
>> gathered and collected later.
> 
> I think somebody else mentioned it as well that the source of the issue, 
> e.g. the PID of the submitting process would be helpful as well for 
> supervising daemons which need to restart processes when they caused 
> some issue.
> 

It was me :) we have a use case that we would need the PID for the 
daemon indeed, but the daemon doesn't need to know what's the RAS error 
or the job name that timeouted, there's no immediate action to be taken 
with this information, contrary to the PID that we need to know.

> We just postponed adding that till later.
> 
> Regards,
> Christian.
> 
>>
>>> Thanks,
>>> Lijo
>>>
>>>> Regards,
>>>> Christian.
>>>
>>
>
Christian König Dec. 16, 2024, 1:36 p.m. UTC | #9
Am 16.12.24 um 14:15 schrieb André Almeida:
> Em 16/12/2024 10:10, Christian König escreveu:
>> Am 16.12.24 um 14:04 schrieb André Almeida:
>>> Em 16/12/2024 07:38, Lazar, Lijo escreveu:
>>>>
>>>>
>>>> On 12/16/2024 3:48 PM, Christian König wrote:
>>>>> Am 13.12.24 um 16:56 schrieb André Almeida:
>>>>>> Em 13/12/2024 11:36, Raag Jadav escreveu:
>>>>>>> On Fri, Dec 13, 2024 at 11:15:31AM -0300, André Almeida wrote:
>>>>>>>> Hi Christian,
>>>>>>>>
>>>>>>>> Em 13/12/2024 04:34, Christian König escreveu:
>>>>>>>>> Am 12.12.24 um 20:09 schrieb André Almeida:
>>>>>>>>>> Use DRM's device wedged event to notify userspace that a 
>>>>>>>>>> reset had
>>>>>>>>>> happened. For now, only use `none` method meant for telemetry
>>>>>>>>>> capture.
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: André Almeida <andrealmeid@igalia.com>
>>>>>>>>>> ---
>>>>>>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 +++
>>>>>>>>>>     1 file changed, 3 insertions(+)
>>>>>>>>>>
>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>> b/drivers/gpu/ drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>> index 96316111300a..19e1a5493778 100644
>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>> @@ -6057,6 +6057,9 @@ int amdgpu_device_gpu_recover(struct
>>>>>>>>>> amdgpu_device *adev,
>>>>>>>>>>             dev_info(adev->dev, "GPU reset end with ret = 
>>>>>>>>>> %d\n", r);
>>>>>>>>>> atomic_set(&adev->reset_domain->reset_res, r);
>>>>>>>>>> +
>>>>>>>>>> +    drm_dev_wedged_event(adev_to_drm(adev),
>>>>>>>>>> DRM_WEDGE_RECOVERY_NONE);
>>>>>>>>>
>>>>>>>>> That looks really good in general. I would just make the
>>>>>>>>> DRM_WEDGE_RECOVERY_NONE depend on the value of "r".
>>>>>>>>>
>>>>>>>>
>>>>>>>> Why depend or `r`? A reset was triggered anyway, regardless of the
>>>>>>>> success
>>>>>>>> of it, shouldn't we tell userspace?
>>>>>>>
>>>>>>> A failed reset would perhaps result in wedging, atleast that's 
>>>>>>> how i915
>>>>>>> is handling it.
>>>>>>>
>>>>>>
>>>>>> Right, and I think this raises the question of what wedge recovery
>>>>>> method should I add for amdgpu... Christian?
>>>>>>
>>>>>
>>>>> In theory a rebind should be enough to get the device going again, 
>>>>> our
>>>>> BOCO does a bus reset on driver load anyway.
>>>>>
>>>>
>>>> The behavior varies between SOCs. In certain ones, if driver reset
>>>> fails, that means it's really in a bad state and it would need system
>>>> reboot.
>>>>
>>>
>>> Is this documented somewhere? Then I could even add a 
>>> DRM_WEDGE_RECOVERY_REBOOT so we can cover every scenario.
>>
>> Not publicly as far as I know. But indeed a driver reset has 
>> basically the same chance of succeeding than a driver reload.
>>
>> I think the use case we have here is more that the administrator 
>> intentionally disabled the reset to allow HW investigation.
>>
>> So far we did that with a rather broken we don't do anything at all 
>> approach.
>
> OK.
>
>>
>>>> I had asked earlier about the utility of this one here. If this is 
>>>> just
>>>> to inform userspace that driver has done a reset and recovered, it 
>>>> would
>>>> need some additional context also. We have a mechanism in KFD which
>>>> sends the context in which a reset has to be done. Currently, that's
>>>> restricted to compute applications, but if this is in a similar 
>>>> line, we
>>>> would like to pass some additional info like job timeout, RAS error 
>>>> etc.
>>>>
>>>
>>> DRM_WEDGE_RECOVERY_NONE is to inform userspace that driver has done 
>>> a reset and recovered, but additional data about like which job 
>>> timeout, RAS error and such belong to devcoredump I guess, where all 
>>> data is gathered and collected later.
>>
>> I think somebody else mentioned it as well that the source of the 
>> issue, e.g. the PID of the submitting process would be helpful as 
>> well for supervising daemons which need to restart processes when 
>> they caused some issue.
>>
>
> It was me :) we have a use case that we would need the PID for the 
> daemon indeed, but the daemon doesn't need to know what's the RAS 
> error or the job name that timeouted, there's no immediate action to 
> be taken with this information, contrary to the PID that we need to know.

Yeah, that's all stuff for the device core dump I think.

But if you want to add the PID for high level control then that would 
make sense I think.

Regards,
Christian.

>
>> We just postponed adding that till later.
>>
>> Regards,
>> Christian.
>>
>>>
>>>> Thanks,
>>>> Lijo
>>>>
>>>>> Regards,
>>>>> Christian.
>>>>
>>>
>>
>
Lazar, Lijo Dec. 16, 2024, 1:36 p.m. UTC | #10
On 12/16/2024 6:45 PM, André Almeida wrote:
> 
> 
> Em 16/12/2024 10:10, Christian König escreveu:
>> Am 16.12.24 um 14:04 schrieb André Almeida:
>>> Em 16/12/2024 07:38, Lazar, Lijo escreveu:
>>>>
>>>>
>>>> On 12/16/2024 3:48 PM, Christian König wrote:
>>>>> Am 13.12.24 um 16:56 schrieb André Almeida:
>>>>>> Em 13/12/2024 11:36, Raag Jadav escreveu:
>>>>>>> On Fri, Dec 13, 2024 at 11:15:31AM -0300, André Almeida wrote:
>>>>>>>> Hi Christian,
>>>>>>>>
>>>>>>>> Em 13/12/2024 04:34, Christian König escreveu:
>>>>>>>>> Am 12.12.24 um 20:09 schrieb André Almeida:
>>>>>>>>>> Use DRM's device wedged event to notify userspace that a reset
>>>>>>>>>> had
>>>>>>>>>> happened. For now, only use `none` method meant for telemetry
>>>>>>>>>> capture.
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: André Almeida <andrealmeid@igalia.com>
>>>>>>>>>> ---
>>>>>>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 +++
>>>>>>>>>>     1 file changed, 3 insertions(+)
>>>>>>>>>>
>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>> b/drivers/gpu/ drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>> index 96316111300a..19e1a5493778 100644
>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>> @@ -6057,6 +6057,9 @@ int amdgpu_device_gpu_recover(struct
>>>>>>>>>> amdgpu_device *adev,
>>>>>>>>>>             dev_info(adev->dev, "GPU reset end with ret =
>>>>>>>>>> %d\n", r);
>>>>>>>>>> atomic_set(&adev->reset_domain->reset_res, r);
>>>>>>>>>> +
>>>>>>>>>> +    drm_dev_wedged_event(adev_to_drm(adev),
>>>>>>>>>> DRM_WEDGE_RECOVERY_NONE);
>>>>>>>>>
>>>>>>>>> That looks really good in general. I would just make the
>>>>>>>>> DRM_WEDGE_RECOVERY_NONE depend on the value of "r".
>>>>>>>>>
>>>>>>>>
>>>>>>>> Why depend or `r`? A reset was triggered anyway, regardless of the
>>>>>>>> success
>>>>>>>> of it, shouldn't we tell userspace?
>>>>>>>
>>>>>>> A failed reset would perhaps result in wedging, atleast that's
>>>>>>> how i915
>>>>>>> is handling it.
>>>>>>>
>>>>>>
>>>>>> Right, and I think this raises the question of what wedge recovery
>>>>>> method should I add for amdgpu... Christian?
>>>>>>
>>>>>
>>>>> In theory a rebind should be enough to get the device going again, our
>>>>> BOCO does a bus reset on driver load anyway.
>>>>>
>>>>
>>>> The behavior varies between SOCs. In certain ones, if driver reset
>>>> fails, that means it's really in a bad state and it would need system
>>>> reboot.
>>>>
>>>
>>> Is this documented somewhere? Then I could even add a
>>> DRM_WEDGE_RECOVERY_REBOOT so we can cover every scenario.
>>
>> Not publicly as far as I know. But indeed a driver reset has basically
>> the same chance of succeeding than a driver reload.
>>
>> I think the use case we have here is more that the administrator
>> intentionally disabled the reset to allow HW investigation.
>>
>> So far we did that with a rather broken we don't do anything at all
>> approach.
> 
> OK.
> 
>>
>>>> I had asked earlier about the utility of this one here. If this is just
>>>> to inform userspace that driver has done a reset and recovered, it
>>>> would
>>>> need some additional context also. We have a mechanism in KFD which
>>>> sends the context in which a reset has to be done. Currently, that's
>>>> restricted to compute applications, but if this is in a similar
>>>> line, we
>>>> would like to pass some additional info like job timeout, RAS error
>>>> etc.
>>>>
>>>
>>> DRM_WEDGE_RECOVERY_NONE is to inform userspace that driver has done a
>>> reset and recovered, but additional data about like which job
>>> timeout, RAS error and such belong to devcoredump I guess, where all
>>> data is gathered and collected later.
>>
>> I think somebody else mentioned it as well that the source of the
>> issue, e.g. the PID of the submitting process would be helpful as well
>> for supervising daemons which need to restart processes when they
>> caused some issue.
>>
> 
> It was me :) we have a use case that we would need the PID for the
> daemon indeed, but the daemon doesn't need to know what's the RAS error
> or the job name that timeouted, there's no immediate action to be taken
> with this information, contrary to the PID that we need to know.
> 

Regarding devcoredump - it's not done every time. For ex: RAS errors
have a different way to identify the source of error, hence we don't
need a coredump in such cases.

The intention is only to let the user know the reason for reset at a
high level, and probably add more things later like the engines or
queues that have reset etc.

Thanks,
Lijo

>> We just postponed adding that till later.
>>
>> Regards,
>> Christian.
>>
>>>
>>>> Thanks,
>>>> Lijo
>>>>
>>>>> Regards,
>>>>> Christian.
>>>>
>>>
>>
>
Christian König Dec. 16, 2024, 1:39 p.m. UTC | #11
Am 16.12.24 um 14:36 schrieb Lazar, Lijo:
>>>>> I had asked earlier about the utility of this one here. If this is just
>>>>> to inform userspace that driver has done a reset and recovered, it
>>>>> would
>>>>> need some additional context also. We have a mechanism in KFD which
>>>>> sends the context in which a reset has to be done. Currently, that's
>>>>> restricted to compute applications, but if this is in a similar
>>>>> line, we
>>>>> would like to pass some additional info like job timeout, RAS error
>>>>> etc.
>>>>>
>>>> DRM_WEDGE_RECOVERY_NONE is to inform userspace that driver has done a
>>>> reset and recovered, but additional data about like which job
>>>> timeout, RAS error and such belong to devcoredump I guess, where all
>>>> data is gathered and collected later.
>>> I think somebody else mentioned it as well that the source of the
>>> issue, e.g. the PID of the submitting process would be helpful as well
>>> for supervising daemons which need to restart processes when they
>>> caused some issue.
>>>
>> It was me :) we have a use case that we would need the PID for the
>> daemon indeed, but the daemon doesn't need to know what's the RAS error
>> or the job name that timeouted, there's no immediate action to be taken
>> with this information, contrary to the PID that we need to know.
>>
> Regarding devcoredump - it's not done every time. For ex: RAS errors
> have a different way to identify the source of error, hence we don't
> need a coredump in such cases.
>
> The intention is only to let the user know the reason for reset at a
> high level, and probably add more things later like the engines or
> queues that have reset etc.

Well what is the use case for that? That doesn't looks valuable to me.

RAS errors should generally be reported to the application who issued 
the submission.

As a system wide event they are only useful in things like logfiles I think.

Regards,
Christian.

>
> Thanks,
> Lijo
>
>>> We just postponed adding that till later.
>>>
>>> Regards,
>>> Christian.
>>>
>>>>> Thanks,
>>>>> Lijo
>>>>>
>>>>>> Regards,
>>>>>> Christian.
Lazar, Lijo Dec. 16, 2024, 1:44 p.m. UTC | #12
On 12/16/2024 7:09 PM, Christian König wrote:
> Am 16.12.24 um 14:36 schrieb Lazar, Lijo:
>>>>>> I had asked earlier about the utility of this one here. If this is just
>>>>>> to inform userspace that driver has done a reset and recovered, it
>>>>>> would
>>>>>> need some additional context also. We have a mechanism in KFD which
>>>>>> sends the context in which a reset has to be done. Currently, that's
>>>>>> restricted to compute applications, but if this is in a similar
>>>>>> line, we
>>>>>> would like to pass some additional info like job timeout, RAS error
>>>>>> etc.
>>>>>>
>>>>> DRM_WEDGE_RECOVERY_NONE is to inform userspace that driver has done a
>>>>> reset and recovered, but additional data about like which job
>>>>> timeout, RAS error and such belong to devcoredump I guess, where all
>>>>> data is gathered and collected later.
>>>> I think somebody else mentioned it as well that the source of the
>>>> issue, e.g. the PID of the submitting process would be helpful as well
>>>> for supervising daemons which need to restart processes when they
>>>> caused some issue.
>>>>
>>> It was me :) we have a use case that we would need the PID for the
>>> daemon indeed, but the daemon doesn't need to know what's the RAS error
>>> or the job name that timeouted, there's no immediate action to be taken
>>> with this information, contrary to the PID that we need to know.
>>>
>> Regarding devcoredump - it's not done every time. For ex: RAS errors
>> have a different way to identify the source of error, hence we don't
>> need a coredump in such cases.
>>
>> The intention is only to let the user know the reason for reset at a
>> high level, and probably add more things later like the engines or
>> queues that have reset etc.
> 
> Well what is the use case for that? That doesn't looks valuable to me.

It's mostly for in-band telemetry reporting through tools like amd-smi -
 more for admin purpose rather than any debug.

Thanks,
Lijo

> 
> RAS errors should generally be reported to the application who issued
> the submission.
> 
> As a system wide event they are only useful in things like logfiles I think.
> 
> Regards,
> Christian.
> 
>> Thanks,
>> Lijo
>>
>>>> We just postponed adding that till later.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>>> Thanks,
>>>>>> Lijo
>>>>>>
>>>>>>> Regards,
>>>>>>> Christian.
>
Raag Jadav Dec. 16, 2024, 1:57 p.m. UTC | #13
On Mon, Dec 16, 2024 at 10:15:00AM -0300, André Almeida wrote:
> Em 16/12/2024 10:10, Christian König escreveu:
> > Am 16.12.24 um 14:04 schrieb André Almeida:
> > > Em 16/12/2024 07:38, Lazar, Lijo escreveu:
> > > > 
> > > > 
> > > > On 12/16/2024 3:48 PM, Christian König wrote:
> > > > > Am 13.12.24 um 16:56 schrieb André Almeida:
> > > > > > Em 13/12/2024 11:36, Raag Jadav escreveu:
> > > > > > > On Fri, Dec 13, 2024 at 11:15:31AM -0300, André Almeida wrote:
> > > > > > > > Hi Christian,
> > > > > > > > 
> > > > > > > > Em 13/12/2024 04:34, Christian König escreveu:
> > > > > > > > > Am 12.12.24 um 20:09 schrieb André Almeida:
> > > > > > > > > > Use DRM's device wedged event to notify userspace that a reset had
> > > > > > > > > > happened. For now, only use `none` method meant for telemetry
> > > > > > > > > > capture.
> > > > > > > > > > 
> > > > > > > > > > Signed-off-by: André Almeida <andrealmeid@igalia.com>
> > > > > > > > > > ---
> > > > > > > > > >     drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 +++
> > > > > > > > > >     1 file changed, 3 insertions(+)
> > > > > > > > > > 
> > > > > > > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > > > > > > > b/drivers/gpu/ drm/amd/amdgpu/amdgpu_device.c
> > > > > > > > > > index 96316111300a..19e1a5493778 100644
> > > > > > > > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > > > > > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > > > > > > > @@ -6057,6 +6057,9 @@ int amdgpu_device_gpu_recover(struct
> > > > > > > > > > amdgpu_device *adev,
> > > > > > > > > >             dev_info(adev->dev, "GPU
> > > > > > > > > > reset end with ret = %d\n", r);
> > > > > > > > > > atomic_set(&adev->reset_domain->reset_res, r);
> > > > > > > > > > +
> > > > > > > > > > +    drm_dev_wedged_event(adev_to_drm(adev),
> > > > > > > > > > DRM_WEDGE_RECOVERY_NONE);
> > > > > > > > > 
> > > > > > > > > That looks really good in general. I would just make the
> > > > > > > > > DRM_WEDGE_RECOVERY_NONE depend on the value of "r".
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > Why depend or `r`? A reset was triggered anyway, regardless of the
> > > > > > > > success
> > > > > > > > of it, shouldn't we tell userspace?
> > > > > > > 
> > > > > > > A failed reset would perhaps result in wedging,
> > > > > > > atleast that's how i915
> > > > > > > is handling it.
> > > > > > > 
> > > > > > 
> > > > > > Right, and I think this raises the question of what wedge recovery
> > > > > > method should I add for amdgpu... Christian?
> > > > > > 
> > > > > 
> > > > > In theory a rebind should be enough to get the device going again, our
> > > > > BOCO does a bus reset on driver load anyway.
> > > > > 
> > > > 
> > > > The behavior varies between SOCs. In certain ones, if driver reset
> > > > fails, that means it's really in a bad state and it would need system
> > > > reboot.
> > > > 
> > > 
> > > Is this documented somewhere? Then I could even add a
> > > DRM_WEDGE_RECOVERY_REBOOT so we can cover every scenario.

This was present in drafts v5 through v7 but later dropped with the
understanding that it is unwise to let a drm device make system level
decisions and rather have something like WEDGED=unknown to let admin/user
decide how to deal with it.

https://patchwork.freedesktop.org/series/138069/

> > Not publicly as far as I know. But indeed a driver reset has basically
> > the same chance of succeeding than a driver reload.
> > 
> > I think the use case we have here is more that the administrator
> > intentionally disabled the reset to allow HW investigation.
> > 
> > So far we did that with a rather broken we don't do anything at all
> > approach.
> 
> OK.
> 
> > 
> > > > I had asked earlier about the utility of this one here. If this is just
> > > > to inform userspace that driver has done a reset and recovered, it would
> > > > need some additional context also. We have a mechanism in KFD which
> > > > sends the context in which a reset has to be done. Currently, that's
> > > > restricted to compute applications, but if this is in a similar line, we
> > > > would like to pass some additional info like job timeout, RAS error etc.
> > > > 
> > > 
> > > DRM_WEDGE_RECOVERY_NONE is to inform userspace that driver has done
> > > a reset and recovered, but additional data about like which job
> > > timeout, RAS error and such belong to devcoredump I guess, where all
> > > data is gathered and collected later.
> > 
> > I think somebody else mentioned it as well that the source of the issue,
> > e.g. the PID of the submitting process would be helpful as well for
> > supervising daemons which need to restart processes when they caused
> > some issue.
> > 
> 
> It was me :) we have a use case that we would need the PID for the daemon
> indeed, but the daemon doesn't need to know what's the RAS error or the job
> name that timeouted, there's no immediate action to be taken with this
> information, contrary to the PID that we need to know.

I think this calls for the standardization of telemetry (devcoredump, syslog
etc) but since each driver has its own way of doing it, it'd be quite an uphill
battle.

Raag
diff mbox series

Patch

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 96316111300a..19e1a5493778 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -6057,6 +6057,9 @@  int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 		dev_info(adev->dev, "GPU reset end with ret = %d\n", r);
 
 	atomic_set(&adev->reset_domain->reset_res, r);
+
+	drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE);
+
 	return r;
 }