[v4,1/3] drm: Introduce device wedged event

Message ID 20240906094225.3082162-2-raag.jadav@intel.com (mailing list archive)
State New, archived
Series Introduce DRM device wedged event

Commit Message

Raag Jadav Sept. 6, 2024, 9:42 a.m. UTC
Introduce a device wedged event, which notifies userspace of the wedged
(hung/unusable) state of a DRM device through a uevent. This is
especially useful when the device is in an unrecoverable state and
requires userspace intervention for recovery.

The implementation is intended to be vendor agnostic. Userspace
consumers (e.g. a sysadmin) can define udev rules to parse this event
and take the appropriate action to recover the device.

Consumer expectations:
----------------------
1) Unbind driver
2) Reset bus device
3) Re-bind driver
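
As a rough illustration of what a consumer has to match on, the sketch
below scans a uevent payload for the WEDGED=1 variable emitted by this
patch. It assumes the payload is a buffer of NUL-terminated "KEY=value"
strings, as delivered on the kobject uevent netlink socket, and that the
event carries ACTION=change and SUBSYSTEM=drm because it is sent with
KOBJ_CHANGE on the DRM minor's device. The helper names uevent_get() and
uevent_is_wedged() are hypothetical and not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Look up key in a uevent payload made of NUL-terminated "KEY=value"
 * strings packed back to back. Returns a pointer to the value inside
 * buf, or NULL if the key is absent.
 */
static const char *uevent_get(const char *buf, size_t len, const char *key)
{
	size_t klen, off = 0;

	assert(buf != NULL && key != NULL);
	klen = strlen(key);

	while (off < len) {
		const char *entry = buf + off;
		const char *end = memchr(entry, '\0', len - off);
		size_t elen;

		if (end == NULL)
			break;		/* unterminated tail: ignore it */
		elen = (size_t)(end - entry);

		if (elen > klen && entry[klen] == '=' &&
		    strncmp(entry, key, klen) == 0)
			return entry + klen + 1;

		off += elen + 1;	/* skip the entry and its NUL */
	}
	return NULL;
}

/* Nonzero if the payload is a DRM change event carrying WEDGED=1. */
static int uevent_is_wedged(const char *buf, size_t len)
{
	const char *action = uevent_get(buf, len, "ACTION");
	const char *subsys = uevent_get(buf, len, "SUBSYSTEM");
	const char *wedged = uevent_get(buf, len, "WEDGED");

	return action && subsys && wedged &&
	       strcmp(action, "change") == 0 &&
	       strcmp(subsys, "drm") == 0 &&
	       strcmp(wedged, "1") == 0;
}
```

A udev rule would express the same match declaratively; the C form is
only meant to show which variables the consumer depends on.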

v4: s/drm_dev_wedged/drm_dev_wedged_event
    Use drm_info() (Jani)
    Kernel doc adjustment (Aravind)

Signed-off-by: Raag Jadav <raag.jadav@intel.com>
---
 drivers/gpu/drm/drm_drv.c | 20 ++++++++++++++++++++
 include/drm/drm_drv.h     |  1 +
 2 files changed, 21 insertions(+)

Comments

Asahi Lina Sept. 7, 2024, 11:38 a.m. UTC | #1
On 9/6/24 6:42 PM, Raag Jadav wrote:
> Introduce device wedged event, which will notify userspace of wedged
> (hanged/unusable) state of the DRM device through a uevent. This is
> useful especially in cases where the device is in unrecoverable state
> and requires userspace intervention for recovery.
> 
> Purpose of this implementation is to be vendor agnostic. Userspace
> consumers (sysadmin) can define udev rules to parse this event and
> take respective action to recover the device.
> 
> Consumer expectations:
> ----------------------
> 1) Unbind driver
> 2) Reset bus device
> 3) Re-bind driver

Is this supposed to be normative? For drm/asahi we have a "wedged"
concept (firmware crashed), but the only possible recovery action is a
full system reboot (which might still be desirable to allow userspace to
trigger automatically in some scenarios) since there is no bus-level
reset and no firmware reload possible.

~~ Lina
Lucas De Marchi Sept. 7, 2024, 3:07 p.m. UTC | #2
On Sat, Sep 07, 2024 at 08:38:30PM GMT, Asahi Lina wrote:
>
>
>On 9/6/24 6:42 PM, Raag Jadav wrote:
>> Introduce device wedged event, which will notify userspace of wedged
>> (hanged/unusable) state of the DRM device through a uevent. This is
>> useful especially in cases where the device is in unrecoverable state
>> and requires userspace intervention for recovery.
>>
>> Purpose of this implementation is to be vendor agnostic. Userspace
>> consumers (sysadmin) can define udev rules to parse this event and
>> take respective action to recover the device.
>>
>> Consumer expectations:
>> ----------------------
>> 1) Unbind driver
>> 2) Reset bus device
>> 3) Re-bind driver
>
>Is this supposed to be normative? For drm/asahi we have a "wedged"
>concept (firmware crashed), but the only possible recovery action is a
>full system reboot (which might still be desirable to allow userspace to
>trigger automatically in some scenarios) since there is no bus-level
>reset and no firmware reload possible.

Maybe let drivers hint at the possible/supported recovery mechanisms,
and then the sysadmin chooses what to do?

Lucas De Marchi
Asahi Lina Sept. 8, 2024, 2:08 p.m. UTC | #3
On 9/8/24 12:07 AM, Lucas De Marchi wrote:
> On Sat, Sep 07, 2024 at 08:38:30PM GMT, Asahi Lina wrote:
>>
>>
>> On 9/6/24 6:42 PM, Raag Jadav wrote:
>>> Introduce device wedged event, which will notify userspace of wedged
>>> (hanged/unusable) state of the DRM device through a uevent. This is
>>> useful especially in cases where the device is in unrecoverable state
>>> and requires userspace intervention for recovery.
>>>
>>> Purpose of this implementation is to be vendor agnostic. Userspace
>>> consumers (sysadmin) can define udev rules to parse this event and
>>> take respective action to recover the device.
>>>
>>> Consumer expectations:
>>> ----------------------
>>> 1) Unbind driver
>>> 2) Reset bus device
>>> 3) Re-bind driver
>>
>> Is this supposed to be normative? For drm/asahi we have a "wedged"
>> concept (firmware crashed), but the only possible recovery action is a
>> full system reboot (which might still be desirable to allow userspace to
>> trigger automatically in some scenarios) since there is no bus-level
>> reset and no firmware reload possible.
> 
> maybe let drivers hint possible/supported recovery mechanisms and then
> sysadmin chooses what to do?

How would we do this? A textual value for the event or something like
that? ("WEDGED=bus-reset" vs "WEDGED=reboot"?)
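
A minimal sketch of what such a textual value could look like, written
as plain userspace C so it can be compiled on its own. The enum, the
name table and both helpers are hypothetical; only the two strings
"bus-reset" and "reboot" come from the suggestion above, and nothing
like this exists in the v4 patch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical recovery methods, named after the examples above. */
enum wedge_method {
	WEDGE_BUS_RESET,
	WEDGE_REBOOT,
	WEDGE_METHOD_COUNT
};

static const char *const wedge_method_names[WEDGE_METHOD_COUNT] = {
	[WEDGE_BUS_RESET] = "bus-reset",
	[WEDGE_REBOOT]    = "reboot",
};

/*
 * Producer side: format the uevent variable for a method into buf.
 * Returns 0 on success, -1 if the method is unknown or buf is too small.
 */
static int wedge_format_env(enum wedge_method m, char *buf, size_t size)
{
	int n;

	assert(buf != NULL);
	if ((unsigned int)m >= WEDGE_METHOD_COUNT)
		return -1;

	n = snprintf(buf, size, "WEDGED=%s", wedge_method_names[m]);
	return (n < 0 || (size_t)n >= size) ? -1 : 0;
}

/* Consumer side: map the value of WEDGED=<value> back to a method, or -1. */
static int wedge_parse_value(const char *value)
{
	int m;

	assert(value != NULL);
	for (m = 0; m < WEDGE_METHOD_COUNT; m++)
		if (strcmp(value, wedge_method_names[m]) == 0)
			return m;
	return -1;
}
```

A consumer that only understands the v4 "WEDGED=1" format would get -1
from wedge_parse_value("1"), so a change of value format would need to
be handled explicitly on the userspace side.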

~~ Lina
Lucas De Marchi Sept. 9, 2024, 8:01 p.m. UTC | #4
On Sun, Sep 08, 2024 at 11:08:39PM GMT, Asahi Lina wrote:
>
>
>On 9/8/24 12:07 AM, Lucas De Marchi wrote:
>> On Sat, Sep 07, 2024 at 08:38:30PM GMT, Asahi Lina wrote:
>>>
>>>
>>> On 9/6/24 6:42 PM, Raag Jadav wrote:
>>>> Introduce device wedged event, which will notify userspace of wedged
>>>> (hanged/unusable) state of the DRM device through a uevent. This is
>>>> useful especially in cases where the device is in unrecoverable state
>>>> and requires userspace intervention for recovery.
>>>>
>>>> Purpose of this implementation is to be vendor agnostic. Userspace
>>>> consumers (sysadmin) can define udev rules to parse this event and
>>>> take respective action to recover the device.
>>>>
>>>> Consumer expectations:
>>>> ----------------------
>>>> 1) Unbind driver
>>>> 2) Reset bus device
>>>> 3) Re-bind driver
>>>
>>> Is this supposed to be normative? For drm/asahi we have a "wedged"
>>> concept (firmware crashed), but the only possible recovery action is a
>>> full system reboot (which might still be desirable to allow userspace to
>>> trigger automatically in some scenarios) since there is no bus-level
>>> reset and no firmware reload possible.
>>
>> maybe let drivers hint possible/supported recovery mechanisms and then
>> sysadmin chooses what to do?
>
>How would we do this? A textual value for the event or something like
>that? ("WEDGED=bus-reset" vs "WEDGED=reboot"?)

If there's a need for more than one, then I think exposing the supported
ones sorted by "side effect" in sysfs would be good. Something like:

	$ cat /sys/class/drm/card0/device/wedge_recover
	rebind
	bus-reset
	reboot

Although if there is actually an unrecoverable state like "reboot", you
could simply remove the underlying device from the kernel side, with no
userspace intervention.
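
To make the listing idea concrete, here is a small sketch of how the
contents of such a file could be generated from a bitmask of supported
mechanisms, least disruptive first. It is plain userspace C that only
mimics what a sysfs show() callback would do; the flag names and
wedge_recover_show() are assumptions for illustration, and the
wedge_recover file itself is only a proposal at this point.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical mechanism flags, ordered from least to most disruptive. */
#define WEDGE_RECOVER_REBIND    (1u << 0)
#define WEDGE_RECOVER_BUS_RESET (1u << 1)
#define WEDGE_RECOVER_REBOOT    (1u << 2)

/* Index i holds the name for flag (1u << i). */
static const char *const wedge_recover_names[] = {
	"rebind", "bus-reset", "reboot",
};

/*
 * Write one supported mechanism per line into buf, least disruptive
 * first. Returns the number of bytes written (excluding the NUL), or
 * -1 if buf is too small.
 */
static int wedge_recover_show(unsigned int supported, char *buf, size_t size)
{
	size_t count = sizeof(wedge_recover_names) / sizeof(wedge_recover_names[0]);
	size_t used = 0;
	size_t i;

	assert(buf != NULL && size > 0);
	buf[0] = '\0';

	for (i = 0; i < count; i++) {
		size_t n;

		if (!(supported & (1u << i)))
			continue;

		n = strlen(wedge_recover_names[i]);
		if (used + n + 2 > size)	/* name + '\n' + NUL */
			return -1;

		memcpy(buf + used, wedge_recover_names[i], n);
		buf[used + n] = '\n';
		used += n + 1;
		buf[used] = '\0';
	}
	return (int)used;
}
```

A driver like the drm/asahi case described earlier would pass only the
reboot flag and end up with a single line.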

Lucas De Marchi

>
>~~ Lina
Vivi, Rodrigo Sept. 9, 2024, 8:43 p.m. UTC | #5
On Sun, Sep 08, 2024 at 11:08:39PM +0900, Asahi Lina wrote:
> 
> 
> On 9/8/24 12:07 AM, Lucas De Marchi wrote:
> > On Sat, Sep 07, 2024 at 08:38:30PM GMT, Asahi Lina wrote:
> >>
> >>
> >> On 9/6/24 6:42 PM, Raag Jadav wrote:
> >>> Introduce device wedged event, which will notify userspace of wedged
> >>> (hanged/unusable) state of the DRM device through a uevent. This is
> >>> useful especially in cases where the device is in unrecoverable state
> >>> and requires userspace intervention for recovery.
> >>>
> >>> Purpose of this implementation is to be vendor agnostic. Userspace
> >>> consumers (sysadmin) can define udev rules to parse this event and
> >>> take respective action to recover the device.
> >>>
> >>> Consumer expectations:
> >>> ----------------------
> >>> 1) Unbind driver
> >>> 2) Reset bus device
> >>> 3) Re-bind driver
> >>
> >> Is this supposed to be normative? For drm/asahi we have a "wedged"
> >> concept (firmware crashed), but the only possible recovery action is a
> >> full system reboot (which might still be desirable to allow userspace to
> >> trigger automatically in some scenarios) since there is no bus-level
> >> reset and no firmware reload possible.
> > 
> > maybe let drivers hint possible/supported recovery mechanisms and then
> > sysadmin chooses what to do?
> 
> How would we do this? A textual value for the event or something like
> that? ("WEDGED=bus-reset" vs "WEDGED=reboot"?)

Looks like a good idea.

In our case, though, it is not just a 'bus-reset' but
unbind + bus reset + rebind. It should still be okay to use a
'bus-reset' kind of text and have the driver document the meaning.
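
For reference, the unbind + reset + rebind sequence could be laid out
as three sysfs writes. The sketch below only builds the list of writes
and does not touch the system. It assumes the device sits on the PCI
bus and uses the standard PCI sysfs files (the driver's unbind and bind
files and the device's reset file); the struct and function names are
hypothetical, and other buses would need different paths.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* One sysfs write: the file to open and the string to write into it. */
struct sysfs_write {
	char path[128];
	char value[32];
};

/* True if an snprintf() result n fit into a buffer of the given size. */
static int fits(int n, size_t size)
{
	return n >= 0 && (size_t)n < size;
}

/*
 * Fill steps[0..2] with the writes for unbind, reset and re-bind.
 * driver is the bound driver's name and bdf the PCI address
 * (e.g. "0000:03:00.0"). Returns 0 on success, -1 if a string is too long.
 */
static int pci_recovery_plan(const char *driver, const char *bdf,
			     struct sysfs_write steps[3])
{
	int ok = 1;

	assert(driver != NULL && bdf != NULL && steps != NULL);
	memset(steps, 0, 3 * sizeof(steps[0]));

	/* 1) Unbind driver: write the address to the driver's unbind file. */
	ok &= fits(snprintf(steps[0].path, sizeof(steps[0].path),
			    "/sys/bus/pci/drivers/%s/unbind", driver),
		   sizeof(steps[0].path));
	ok &= fits(snprintf(steps[0].value, sizeof(steps[0].value), "%s", bdf),
		   sizeof(steps[0].value));

	/* 2) Reset bus device: write "1" to the device's reset file. */
	ok &= fits(snprintf(steps[1].path, sizeof(steps[1].path),
			    "/sys/bus/pci/devices/%s/reset", bdf),
		   sizeof(steps[1].path));
	ok &= fits(snprintf(steps[1].value, sizeof(steps[1].value), "1"),
		   sizeof(steps[1].value));

	/* 3) Re-bind driver: write the address to the driver's bind file. */
	ok &= fits(snprintf(steps[2].path, sizeof(steps[2].path),
			    "/sys/bus/pci/drivers/%s/bind", driver),
		   sizeof(steps[2].path));
	ok &= fits(snprintf(steps[2].value, sizeof(steps[2].value), "%s", bdf),
		   sizeof(steps[2].value));

	return ok ? 0 : -1;
}
```

Keeping the plan separate from the code that performs the writes makes
it possible to log or dry-run the recovery before acting on it.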

> 
> ~~ Lina
Matt Roper Sept. 9, 2024, 9:53 p.m. UTC | #6
On Fri, Sep 06, 2024 at 03:12:23PM +0530, Raag Jadav wrote:
> Introduce device wedged event, which will notify userspace of wedged
> (hanged/unusable) state of the DRM device through a uevent. This is
> useful especially in cases where the device is in unrecoverable state
> and requires userspace intervention for recovery.
> 
> Purpose of this implementation is to be vendor agnostic. Userspace
> consumers (sysadmin) can define udev rules to parse this event and
> take respective action to recover the device.
> 
> Consumer expectations:
> ----------------------
> 1) Unbind driver
> 2) Reset bus device
> 3) Re-bind driver
> 
> v4: s/drm_dev_wedged/drm_dev_wedged_event
>     Use drm_info() (Jani)
>     Kernel doc adjustment (Aravind)
> 
> Signed-off-by: Raag Jadav <raag.jadav@intel.com>
> ---
>  drivers/gpu/drm/drm_drv.c | 20 ++++++++++++++++++++
>  include/drm/drm_drv.h     |  1 +
>  2 files changed, 21 insertions(+)
> 
> diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
> index 93543071a500..cca5d8295eb7 100644
> --- a/drivers/gpu/drm/drm_drv.c
> +++ b/drivers/gpu/drm/drm_drv.c
> @@ -499,6 +499,26 @@ void drm_dev_unplug(struct drm_device *dev)
>  }
>  EXPORT_SYMBOL(drm_dev_unplug);
>  
> +/**
> + * drm_dev_wedged_event - generate a device wedged uevent
> + * @dev: DRM device
> + *
> + * This generates a device wedged uevent for the DRM device specified by @dev,
> + * on the basis of which, userspace may take respective action to recover the
> + * device. Currently we only set WEDGED=1 in the uevent environment, but this
> + * can be expanded in the future.

Just to clarify, is "wedged" intended to always mean "the entire device
is unusable" or are there cases where it would also get sent if only
part of the device is in a bad state?  For example, using i915/Xe
terminology, maybe the GT is dead but display is still working.  Or one
GT is dead, but another is still alive.

Basically, is this event intended as a signal that userspace should stop
trying to do _anything_ with the device, or just that the device has
degraded functionality in some way (and maybe userspace can still do
something useful if it's lucky)?  It would be good to clarify that in
the docs here in case different drivers have different ideas about how
this is expected to work.


Matt

> + */
> +void drm_dev_wedged_event(struct drm_device *dev)
> +{
> +	char *event_string = "WEDGED=1";
> +	char *envp[] = { event_string, NULL };
> +
> +	drm_info(dev, "device wedged, generating uevent\n");
> +
> +	kobject_uevent_env(&dev->primary->kdev->kobj, KOBJ_CHANGE, envp);
> +}
> +EXPORT_SYMBOL(drm_dev_wedged_event);
> +
>  /*
>   * DRM internal mount
>   * We want to be able to allocate our own "struct address_space" to control
> diff --git a/include/drm/drm_drv.h b/include/drm/drm_drv.h
> index cd37936c3926..eed5e54c74fd 100644
> --- a/include/drm/drm_drv.h
> +++ b/include/drm/drm_drv.h
> @@ -489,6 +489,7 @@ void drm_put_dev(struct drm_device *dev);
>  bool drm_dev_enter(struct drm_device *dev, int *idx);
>  void drm_dev_exit(int idx);
>  void drm_dev_unplug(struct drm_device *dev);
> +void drm_dev_wedged_event(struct drm_device *dev);
>  
>  /**
>   * drm_dev_is_unplugged - is a DRM device unplugged
> -- 
> 2.34.1
>
Raag Jadav Sept. 10, 2024, 3:49 p.m. UTC | #7
On Mon, Sep 09, 2024 at 02:53:23PM -0700, Matt Roper wrote:
> On Fri, Sep 06, 2024 at 03:12:23PM +0530, Raag Jadav wrote:
> > Introduce device wedged event, which will notify userspace of wedged
> > (hanged/unusable) state of the DRM device through a uevent. This is
> > useful especially in cases where the device is in unrecoverable state
> > and requires userspace intervention for recovery.
> > 
> > Purpose of this implementation is to be vendor agnostic. Userspace
> > consumers (sysadmin) can define udev rules to parse this event and
> > take respective action to recover the device.
> > 
> > Consumer expectations:
> > ----------------------
> > 1) Unbind driver
> > 2) Reset bus device
> > 3) Re-bind driver
> > 
> > v4: s/drm_dev_wedged/drm_dev_wedged_event
> >     Use drm_info() (Jani)
> >     Kernel doc adjustment (Aravind)
> > 
> > Signed-off-by: Raag Jadav <raag.jadav@intel.com>
> > ---
> >  drivers/gpu/drm/drm_drv.c | 20 ++++++++++++++++++++
> >  include/drm/drm_drv.h     |  1 +
> >  2 files changed, 21 insertions(+)
> > 
> > diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
> > index 93543071a500..cca5d8295eb7 100644
> > --- a/drivers/gpu/drm/drm_drv.c
> > +++ b/drivers/gpu/drm/drm_drv.c
> > @@ -499,6 +499,26 @@ void drm_dev_unplug(struct drm_device *dev)
> >  }
> >  EXPORT_SYMBOL(drm_dev_unplug);
> >  
> > +/**
> > + * drm_dev_wedged_event - generate a device wedged uevent
> > + * @dev: DRM device
> > + *
> > + * This generates a device wedged uevent for the DRM device specified by @dev,
> > + * on the basis of which, userspace may take respective action to recover the
> > + * device. Currently we only set WEDGED=1 in the uevent environment, but this
> > + * can be expanded in the future.
> 
> Just to clarify, is "wedged" intended to always mean "the entire device
> is unusable" or are there cases where it would also get sent if only
> part of the device is in a bad state?  For example, using i915/Xe
> terminology, maybe the GT is dead but display is still working.  Or one
> GT is dead, but another is still alive.

The idea is to provide drivers a way to recover through userspace intervention.
It is up to the drivers to decide when they see the need for recovery and how
they want to recover.

> Basically, is this event intended as a signal that userspace should stop
> trying to do _anything_ with the device, or just that the device has
> degraded functionality in some way (and maybe userspace can still do
> something useful if it's lucky)?  It would be good to clarify that in
> the docs here in case different drivers have different ideas about how
> this is expected to work.

And hence the open discussion. Improvements are welcome :)

Raag
Raag Jadav Sept. 10, 2024, 3:53 p.m. UTC | #8
On Mon, Sep 09, 2024 at 03:01:50PM -0500, Lucas De Marchi wrote:
> On Sun, Sep 08, 2024 at 11:08:39PM GMT, Asahi Lina wrote:
> > On 9/8/24 12:07 AM, Lucas De Marchi wrote:
> > > On Sat, Sep 07, 2024 at 08:38:30PM GMT, Asahi Lina wrote:
> > > > On 9/6/24 6:42 PM, Raag Jadav wrote:
> > > > > Introduce device wedged event, which will notify userspace of wedged
> > > > > (hanged/unusable) state of the DRM device through a uevent. This is
> > > > > useful especially in cases where the device is in unrecoverable state
> > > > > and requires userspace intervention for recovery.
> > > > > 
> > > > > Purpose of this implementation is to be vendor agnostic. Userspace
> > > > > consumers (sysadmin) can define udev rules to parse this event and
> > > > > take respective action to recover the device.
> > > > > 
> > > > > Consumer expectations:
> > > > > ----------------------
> > > > > 1) Unbind driver
> > > > > 2) Reset bus device
> > > > > 3) Re-bind driver
> > > > 
> > > > Is this supposed to be normative? For drm/asahi we have a "wedged"
> > > > concept (firmware crashed), but the only possible recovery action is a
> > > > full system reboot (which might still be desirable to allow userspace to
> > > > trigger automatically in some scenarios) since there is no bus-level
> > > > reset and no firmware reload possible.
> > > 
> > > maybe let drivers hint possible/supported recovery mechanisms and then
> > > sysadmin chooses what to do?
> > 
> > How would we do this? A textual value for the event or something like
> > that? ("WEDGED=bus-reset" vs "WEDGED=reboot"?)
> 
> If there's a need for more than one, then I think exposing the supported
> ones sorted by "side effect" in sysfs would be good. Something like:
> 
> 	$ cat /sys/class/drm/card0/device/wedge_recover
> 	rebind
> 	bus-reset
> 	reboot

How do we expect the drivers to flag supported ones? Extra hooks?

Raag
Lucas De Marchi Sept. 10, 2024, 4:06 p.m. UTC | #9
On Tue, Sep 10, 2024 at 06:53:19PM GMT, Raag Jadav wrote:
>On Mon, Sep 09, 2024 at 03:01:50PM -0500, Lucas De Marchi wrote:
>> On Sun, Sep 08, 2024 at 11:08:39PM GMT, Asahi Lina wrote:
>> > On 9/8/24 12:07 AM, Lucas De Marchi wrote:
>> > > On Sat, Sep 07, 2024 at 08:38:30PM GMT, Asahi Lina wrote:
>> > > > On 9/6/24 6:42 PM, Raag Jadav wrote:
>> > > > > Introduce device wedged event, which will notify userspace of wedged
>> > > > > (hanged/unusable) state of the DRM device through a uevent. This is
>> > > > > useful especially in cases where the device is in unrecoverable state
>> > > > > and requires userspace intervention for recovery.
>> > > > >
>> > > > > Purpose of this implementation is to be vendor agnostic. Userspace
>> > > > > consumers (sysadmin) can define udev rules to parse this event and
>> > > > > take respective action to recover the device.
>> > > > >
>> > > > > Consumer expectations:
>> > > > > ----------------------
>> > > > > 1) Unbind driver
>> > > > > 2) Reset bus device
>> > > > > 3) Re-bind driver
>> > > >
>> > > > Is this supposed to be normative? For drm/asahi we have a "wedged"
>> > > > concept (firmware crashed), but the only possible recovery action is a
>> > > > full system reboot (which might still be desirable to allow userspace to
>> > > > trigger automatically in some scenarios) since there is no bus-level
>> > > > reset and no firmware reload possible.
>> > >
>> > > maybe let drivers hint possible/supported recovery mechanisms and then
>> > > sysadmin chooses what to do?
>> >
>> > How would we do this? A textual value for the event or something like
>> > that? ("WEDGED=bus-reset" vs "WEDGED=reboot"?)
>>
>> If there's a need for more than one, then I think exposing the supported
>> ones sorted by "side effect" in sysfs would be good. Something like:
>>
>> 	$ cat /sys/class/drm/card0/device/wedge_recover
>> 	rebind
>> 	bus-reset
>> 	reboot
>
>How do we expect the drivers to flag supported ones? Extra hooks?

As in the comment above: wedge_recover would be a sysfs attribute exposed
by the driver to userspace, listing the supported mechanisms.

WEDGED=<mechanism> (crafted by the driver itself or through explicit
helper functions in drm) would then report to userspace the minimum
mechanism needed for recovery.
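
On the consumer side this could reduce to a simple comparison: the
event names the minimum mechanism needed, and the admin configures the
most disruptive mechanism they are willing to automate. The sketch
below assumes the three mechanism names and their ordering from the
wedge_recover example earlier in the thread; wedge_rank() and
wedge_should_recover() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Mechanism names from the thread, least to most disruptive. */
static const char *const wedge_levels[] = { "rebind", "bus-reset", "reboot" };

/* Rank of a mechanism name, or -1 if it is not recognised. */
static int wedge_rank(const char *name)
{
	size_t i;

	for (i = 0; i < sizeof(wedge_levels) / sizeof(wedge_levels[0]); i++)
		if (strcmp(name, wedge_levels[i]) == 0)
			return (int)i;
	return -1;
}

/*
 * Decide whether to act on WEDGED=<required> given the admin's limit.
 * Returns 1 to perform the required recovery, 0 to leave the device
 * alone for manual handling (including when either name is unknown).
 */
static int wedge_should_recover(const char *required, const char *max_allowed)
{
	int need, limit;

	assert(required != NULL && max_allowed != NULL);
	need = wedge_rank(required);
	limit = wedge_rank(max_allowed);

	return need >= 0 && limit >= 0 && need <= limit;
}
```

With a limit of "bus-reset", for instance, a device reporting
WEDGED=reboot would be left alone rather than rebooting the machine
automatically.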

Lucas De Marchi
Patch

diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
index 93543071a500..cca5d8295eb7 100644
--- a/drivers/gpu/drm/drm_drv.c
+++ b/drivers/gpu/drm/drm_drv.c
@@ -499,6 +499,26 @@  void drm_dev_unplug(struct drm_device *dev)
 }
 EXPORT_SYMBOL(drm_dev_unplug);
 
+/**
+ * drm_dev_wedged_event - generate a device wedged uevent
+ * @dev: DRM device
+ *
+ * This generates a device wedged uevent for the DRM device specified by @dev,
+ * on the basis of which, userspace may take respective action to recover the
+ * device. Currently we only set WEDGED=1 in the uevent environment, but this
+ * can be expanded in the future.
+ */
+void drm_dev_wedged_event(struct drm_device *dev)
+{
+	char *event_string = "WEDGED=1";
+	char *envp[] = { event_string, NULL };
+
+	drm_info(dev, "device wedged, generating uevent\n");
+
+	kobject_uevent_env(&dev->primary->kdev->kobj, KOBJ_CHANGE, envp);
+}
+EXPORT_SYMBOL(drm_dev_wedged_event);
+
 /*
  * DRM internal mount
  * We want to be able to allocate our own "struct address_space" to control
diff --git a/include/drm/drm_drv.h b/include/drm/drm_drv.h
index cd37936c3926..eed5e54c74fd 100644
--- a/include/drm/drm_drv.h
+++ b/include/drm/drm_drv.h
@@ -489,6 +489,7 @@  void drm_put_dev(struct drm_device *dev);
 bool drm_dev_enter(struct drm_device *dev, int *idx);
 void drm_dev_exit(int idx);
 void drm_dev_unplug(struct drm_device *dev);
+void drm_dev_wedged_event(struct drm_device *dev);
 
 /**
  * drm_dev_is_unplugged - is a DRM device unplugged