[v6,1/4] drm: Introduce device wedged event

Message ID 20240923035826.624196-2-raag.jadav@intel.com
State New
Series Introduce DRM device wedged event

Commit Message

Raag Jadav Sept. 23, 2024, 3:58 a.m. UTC
Introduce a device wedged event, which notifies userspace of the wedged
(hung/unusable) state of a DRM device through a uevent. This is
especially useful in cases where the device is no longer operating as
expected and has become unrecoverable from driver context.

The purpose of this implementation is to give drivers a way to recover
through userspace intervention. Different drivers may have different
ideas of a "wedged device" depending on their hardware implementation,
hence the vendor-agnostic nature of the event. It is up to the drivers
to decide when they need recovery and which of the available methods
they want to use.

The current implementation defines three recovery methods, of which
drivers can choose to support one or more. The preferred recovery
method is sent in the uevent environment as WEDGED=<method>.
Userspace consumers (sysadmins) can define udev rules to parse this event
and take the corresponding action to recover the device.

 Method    | Consumer expectations
-----------|-----------------------------------
 rebind    | unbind + rebind driver
 bus-reset | unbind + reset bus device + rebind
 reboot    | reboot system

v4: s/drm_dev_wedged/drm_dev_wedged_event
    Use drm_info() (Jani)
    Kernel doc adjustment (Aravind)
v5: Send recovery method with uevent (Lina)
v6: Access wedge_recovery_opts[] using helper function (Jani)
    Use snprintf() (Jani)

Signed-off-by: Raag Jadav <raag.jadav@intel.com>
---
 drivers/gpu/drm/drm_drv.c | 41 +++++++++++++++++++++++++++++++++++++++
 include/drm/drm_device.h  | 24 +++++++++++++++++++++++
 include/drm/drm_drv.h     | 18 +++++++++++++++++
 3 files changed, 83 insertions(+)

Comments

Andy Shevchenko Sept. 23, 2024, 8:38 a.m. UTC | #1
On Mon, Sep 23, 2024 at 09:28:23AM +0530, Raag Jadav wrote:
> v4: s/drm_dev_wedged/drm_dev_wedged_event
>     Use drm_info() (Jani)
>     Kernel doc adjustment (Aravind)
> v5: Send recovery method with uevent (Lina)
> v6: Access wedge_recovery_opts[] using helper function (Jani)
>     Use snprintf() (Jani)

Hmm... Isn't the changelog in the cover letter enough?

...

> +/*
> + * Available recovery methods for wedged device. To be sent along with device
> + * wedged uevent.
> + */
> +#define WEDGE_LEN	32	/* Need 16+ */

This "Need 16+" comment seems unfinished as it doesn't say why.
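
One possible reading of the "16+" (an assumption, since the patch does not
say): the longest event string with the current method names is
"WEDGED=bus-reset", which is 16 characters plus the terminating NUL. A
userspace sketch of that arithmetic, where WEDGE_LONGEST and format_wedged()
are hypothetical names:

```c
#include <assert.h>
#include <stdio.h>

#define WEDGE_LEN	32

/* Longest event string given the three method names in this patch. */
#define WEDGE_LONGEST	"WEDGED=bus-reset"

/* sizeof() counts the terminating NUL, so this is 17 bytes. */
static_assert(sizeof(WEDGE_LONGEST) <= WEDGE_LEN, "event string must fit");

/* Returns the number of characters written, or -1 on error or truncation. */
int format_wedged(char *buf, size_t len, const char *method)
{
	int n = snprintf(buf, len, "WEDGED=%s", method);

	return (n < 0 || (size_t)n >= len) ? -1 : n;
}
```

With a 16-byte buffer the "bus-reset" case would be truncated, which may be
what the comment is hinting at.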

...

> +int drm_dev_wedged_event(struct drm_device *dev, enum wedge_recovery_method method)
> +{
> +	char event_string[WEDGE_LEN] = {};
> +	char *envp[] = { event_string, NULL };
> +
> +	if (!test_bit(method, &dev->wedge_recovery)) {
> +		drm_err(dev, "device wedged, recovery method not supported\n");
> +		return -EOPNOTSUPP;
> +	}

> +	snprintf(event_string, sizeof(event_string), "WEDGED=%s", recovery_method_name(method));

Is sprintf.h being included already?

> +	drm_info(dev, "device wedged, generating uevent\n");
> +	return kobject_uevent_env(&dev->primary->kdev->kobj, KOBJ_CHANGE, envp);
> +}

...

> +/**
> + * enum wedge_recovery_method - Recovery method for wedged device in order
> + * of severity. To be set as bit fields in drm_device.wedge_recovery variable.
> + * Drivers can choose to support any one or multiple of them depending on their
> + * needs.
> + */

> +

Redundant blank line.

> +enum wedge_recovery_method {
> +	/** @DRM_WEDGE_RECOVERY_REBIND: unbind + rebind driver */
> +	DRM_WEDGE_RECOVERY_REBIND,
> +
> +	/** @DRM_WEDGE_RECOVERY_BUS_RESET: unbind + reset bus device + rebind */
> +	DRM_WEDGE_RECOVERY_BUS_RESET,
> +
> +	/** @DRM_WEDGE_RECOVERY_REBOOT: reboot system */
> +	DRM_WEDGE_RECOVERY_REBOOT,
> +
> +	/** @DRM_WEDGE_RECOVERY_MAX: for bounds checking, do not use */
> +	DRM_WEDGE_RECOVERY_MAX
> +};

...

> +extern const char *const wedge_recovery_opts[];

It's not NULL-terminated. How will users know that they have a valid index?

Either you NULL-terminate that, or export the size as well (personally I would
go with the first approach).
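
A minimal userspace sketch of the first approach, assuming a NULL sentinel is
placed in the DRM_WEDGE_RECOVERY_MAX slot. wedge_recovery_find() is a
hypothetical helper for illustration, not something the patch provides:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum wedge_recovery_method {
	DRM_WEDGE_RECOVERY_REBIND,
	DRM_WEDGE_RECOVERY_BUS_RESET,
	DRM_WEDGE_RECOVERY_REBOOT,
	DRM_WEDGE_RECOVERY_MAX
};

/* Same table as in the patch, plus a NULL sentinel in the MAX slot. */
const char *const wedge_recovery_opts[] = {
	[DRM_WEDGE_RECOVERY_REBIND] = "rebind",
	[DRM_WEDGE_RECOVERY_BUS_RESET] = "bus-reset",
	[DRM_WEDGE_RECOVERY_REBOOT] = "reboot",
	[DRM_WEDGE_RECOVERY_MAX] = NULL,
};

/*
 * Walk the table until the sentinel, so the caller needs neither the
 * array size nor DRM_WEDGE_RECOVERY_MAX. Returns the index, or -1 if
 * the name is not in the table.
 */
int wedge_recovery_find(const char *const *opts, const char *name)
{
	assert(opts && name);

	for (int i = 0; opts[i]; i++) {
		if (strcmp(opts[i], name) == 0)
			return i;
	}
	return -1;
}
```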

...

> +static inline bool recovery_method_is_valid(enum wedge_recovery_method method)
> +{
> +	if (method >= DRM_WEDGE_RECOVERY_REBIND && method < DRM_WEDGE_RECOVERY_MAX)
> +		return true;
> +
> +	return false;

Besides the fact that this can be written as

	return method >= DRM_WEDGE_RECOVERY_REBIND && method < DRM_WEDGE_RECOVERY_MAX;

> +}

this seems to be a runtime approach to what we can check at compile time, i.e. with
static_assert(). It's also possible to have that as a third approach, but it's less robust.
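
A userspace sketch of what such a compile-time check could look like, assuming
C11 static_assert() from <assert.h>. The helpers take an int here (unlike the
patch, which takes the enum) so that out-of-range values can be shown; the
ARRAY_SIZE() macro is defined locally for the example:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum wedge_recovery_method {
	DRM_WEDGE_RECOVERY_REBIND,
	DRM_WEDGE_RECOVERY_BUS_RESET,
	DRM_WEDGE_RECOVERY_REBOOT,
	DRM_WEDGE_RECOVERY_MAX
};

static const char *const wedge_recovery_opts[] = {
	[DRM_WEDGE_RECOVERY_REBIND] = "rebind",
	[DRM_WEDGE_RECOVERY_BUS_RESET] = "bus-reset",
	[DRM_WEDGE_RECOVERY_REBOOT] = "reboot",
};

#define ARRAY_SIZE(a)	(sizeof(a) / sizeof((a)[0]))

/*
 * Build fails if the table size no longer matches the enum, e.g. when a
 * method is appended to the enum without a table entry. A missing entry
 * in the middle would still leave a NULL hole that this cannot detect.
 */
static_assert(ARRAY_SIZE(wedge_recovery_opts) == DRM_WEDGE_RECOVERY_MAX,
	      "wedge_recovery_opts[] out of sync with enum");

/* Range check for values that are only known at runtime. */
bool recovery_method_is_valid(int method)
{
	return method >= DRM_WEDGE_RECOVERY_REBIND && method < DRM_WEDGE_RECOVERY_MAX;
}

/* Returns the method name, or NULL for an out-of-range value. */
const char *recovery_method_name(int method)
{
	return recovery_method_is_valid(method) ? wedge_recovery_opts[method] : NULL;
}
```

The static_assert() only guards the table definition; a range check is still
needed for any value that reaches the helpers at runtime.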

Raag Jadav Sept. 23, 2024, 2:35 p.m. UTC | #2
On Mon, Sep 23, 2024 at 11:38:55AM +0300, Andy Shevchenko wrote:
> On Mon, Sep 23, 2024 at 09:28:23AM +0530, Raag Jadav wrote:
> > v4: s/drm_dev_wedged/drm_dev_wedged_event
> >     Use drm_info() (Jani)
> >     Kernel doc adjustment (Aravind)
> > v5: Send recovery method with uevent (Lina)
> > v6: Access wedge_recovery_opts[] using helper function (Jani)
> >     Use snprintf() (Jani)
> 
> > Hmm... Isn't the changelog in the cover letter enough?

That was my initial thought, but I was told otherwise ¯\_(ツ)_/¯

> ...
> 
> > +extern const char *const wedge_recovery_opts[];
> 
> > It's not NULL-terminated. How will users know that they have a valid index?

It's expected to be accessed using recovery_*() helpers.
 
> Either you NULL-terminate that, or export the size as well (personally I would
> go with the first approach).
> 
> ...
> 
> > +static inline bool recovery_method_is_valid(enum wedge_recovery_method method)
> > +{
> > +	if (method >= DRM_WEDGE_RECOVERY_REBIND && method < DRM_WEDGE_RECOVERY_MAX)
> > +		return true;
> > +
> > +	return false;
> 
> > Besides the fact that this can be written as
> 
> 	return method >= DRM_WEDGE_RECOVERY_REBIND && method < DRM_WEDGE_RECOVERY_MAX;
> 
> > +}
> 
> > this seems to be a runtime approach to what we can check at compile time, i.e. with static_assert()

My understanding is that we have runtime users that the compiler may not be
able to resolve.

Raag

Andy Shevchenko Sept. 23, 2024, 2:57 p.m. UTC | #3
On Mon, Sep 23, 2024 at 05:35:23PM +0300, Raag Jadav wrote:
> On Mon, Sep 23, 2024 at 11:38:55AM +0300, Andy Shevchenko wrote:
> > On Mon, Sep 23, 2024 at 09:28:23AM +0530, Raag Jadav wrote:

...

> > > +extern const char *const wedge_recovery_opts[];
> > 
> > > It's not NULL-terminated. How will users know that they have a valid index?
> 
> It's expected to be accessed using recovery_*() helpers.

If so, this has to be static then.

> > Either you NULL-terminate that, or export the size as well (personally I would
> > go with the first approach).

Jani Nikula Sept. 23, 2024, 10:01 p.m. UTC | #4
On Mon, 23 Sep 2024, Andy Shevchenko <andriy.shevchenko@linux.intel.com> wrote:
> On Mon, Sep 23, 2024 at 05:35:23PM +0300, Raag Jadav wrote:
>> On Mon, Sep 23, 2024 at 11:38:55AM +0300, Andy Shevchenko wrote:
>> > On Mon, Sep 23, 2024 at 09:28:23AM +0530, Raag Jadav wrote:
>
> ...
>
>> > > +extern const char *const wedge_recovery_opts[];
>> > 
>> > It's not NULL-terminated. How will users know that they have a valid index?
>> 
>> It's expected to be accessed using recovery_*() helpers.
>
> If so, this has to be static then.

Yeah, please make the helpers regular functions. Static inlines are just
harmful here.

BR,
Jani.

>
>> > Either you NULL-terminate that, or export the size as well (personally I would
>> > go with the first approach).
Patch

diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
index ac30b0ec9d93..03a5d9009689 100644
--- a/drivers/gpu/drm/drm_drv.c
+++ b/drivers/gpu/drm/drm_drv.c
@@ -70,6 +70,18 @@  static struct dentry *drm_debugfs_root;
 
 DEFINE_STATIC_SRCU(drm_unplug_srcu);
 
+/*
+ * Available recovery methods for wedged device. To be sent along with device
+ * wedged uevent.
+ */
+#define WEDGE_LEN	32	/* Need 16+ */
+
+const char *const wedge_recovery_opts[] = {
+	[DRM_WEDGE_RECOVERY_REBIND] = "rebind",
+	[DRM_WEDGE_RECOVERY_BUS_RESET] = "bus-reset",
+	[DRM_WEDGE_RECOVERY_REBOOT] = "reboot",
+};
+
 /*
  * DRM Minors
  * A DRM device can provide several char-dev interfaces on the DRM-Major. Each
@@ -497,6 +509,35 @@  void drm_dev_unplug(struct drm_device *dev)
 }
 EXPORT_SYMBOL(drm_dev_unplug);
 
+/**
+ * drm_dev_wedged_event - generate a device wedged uevent
+ * @dev: DRM device
+ * @method: method to be used for recovery
+ *
+ * This generates a device wedged uevent for the DRM device specified by @dev.
+ * Recovery @method from wedge_recovery_opts[] (if supported by the device) is
+ * sent in the uevent environment as WEDGED=<method>, on the basis of which,
+ * userspace may take respective action to recover the device.
+ *
+ * Returns: 0 on success, or negative error code otherwise.
+ */
+int drm_dev_wedged_event(struct drm_device *dev, enum wedge_recovery_method method)
+{
+	char event_string[WEDGE_LEN] = {};
+	char *envp[] = { event_string, NULL };
+
+	if (!test_bit(method, &dev->wedge_recovery)) {
+		drm_err(dev, "device wedged, recovery method not supported\n");
+		return -EOPNOTSUPP;
+	}
+
+	snprintf(event_string, sizeof(event_string), "WEDGED=%s", recovery_method_name(method));
+
+	drm_info(dev, "device wedged, generating uevent\n");
+	return kobject_uevent_env(&dev->primary->kdev->kobj, KOBJ_CHANGE, envp);
+}
+EXPORT_SYMBOL(drm_dev_wedged_event);
+
 /*
  * DRM internal mount
  * We want to be able to allocate our own "struct address_space" to control
diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
index c91f87b5242d..f1a71763c22a 100644
--- a/include/drm/drm_device.h
+++ b/include/drm/drm_device.h
@@ -40,6 +40,27 @@  enum switch_power_state {
 	DRM_SWITCH_POWER_DYNAMIC_OFF = 3,
 };
 
+/**
+ * enum wedge_recovery_method - Recovery method for wedged device in order
+ * of severity. To be set as bit fields in drm_device.wedge_recovery variable.
+ * Drivers can choose to support any one or multiple of them depending on their
+ * needs.
+ */
+
+enum wedge_recovery_method {
+	/** @DRM_WEDGE_RECOVERY_REBIND: unbind + rebind driver */
+	DRM_WEDGE_RECOVERY_REBIND,
+
+	/** @DRM_WEDGE_RECOVERY_BUS_RESET: unbind + reset bus device + rebind */
+	DRM_WEDGE_RECOVERY_BUS_RESET,
+
+	/** @DRM_WEDGE_RECOVERY_REBOOT: reboot system */
+	DRM_WEDGE_RECOVERY_REBOOT,
+
+	/** @DRM_WEDGE_RECOVERY_MAX: for bounds checking, do not use */
+	DRM_WEDGE_RECOVERY_MAX
+};
+
 /**
  * struct drm_device - DRM device structure
  *
@@ -317,6 +338,9 @@  struct drm_device {
 	 * Root directory for debugfs files.
 	 */
 	struct dentry *debugfs_root;
+
+	/** @wedge_recovery: Supported recovery methods for wedged device */
+	unsigned long wedge_recovery;
 };
 
 #endif
diff --git a/include/drm/drm_drv.h b/include/drm/drm_drv.h
index 02ea4e3248fd..83d44e153557 100644
--- a/include/drm/drm_drv.h
+++ b/include/drm/drm_drv.h
@@ -45,6 +45,8 @@  struct drm_mode_create_dumb;
 struct drm_printer;
 struct sg_table;
 
+extern const char *const wedge_recovery_opts[];
+
 /**
  * enum drm_driver_feature - feature flags
  *
@@ -461,6 +463,7 @@  void drm_put_dev(struct drm_device *dev);
 bool drm_dev_enter(struct drm_device *dev, int *idx);
 void drm_dev_exit(int idx);
 void drm_dev_unplug(struct drm_device *dev);
+int drm_dev_wedged_event(struct drm_device *dev, enum wedge_recovery_method method);
 
 /**
  * drm_dev_is_unplugged - is a DRM device unplugged
@@ -551,4 +554,19 @@  static inline void drm_debugfs_dev_init(struct drm_device *dev, struct dentry *r
 }
 #endif
 
+static inline bool recovery_method_is_valid(enum wedge_recovery_method method)
+{
+	if (method >= DRM_WEDGE_RECOVERY_REBIND && method < DRM_WEDGE_RECOVERY_MAX)
+		return true;
+
+	return false;
+}
+
+static inline const char *recovery_method_name(enum wedge_recovery_method method)
+{
+	if (recovery_method_is_valid(method))
+		return wedge_recovery_opts[method];
+
+	return NULL;
+}
 #endif