[v7,1/5] drm: Introduce device wedged event

Message ID: 20240930073845.347326-2-raag.jadav@intel.com
State: New, archived
From: Raag Jadav <raag.jadav@intel.com>
To: airlied@gmail.com, simona@ffwll.ch, lucas.demarchi@intel.com, thomas.hellstrom@linux.intel.com, rodrigo.vivi@intel.com, jani.nikula@linux.intel.com, andriy.shevchenko@linux.intel.com, joonas.lahtinen@linux.intel.com, tursulin@ursulin.net, lina@asahilina.net
Cc: intel-xe@lists.freedesktop.org, intel-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org, himal.prasad.ghimiray@intel.com, francois.dugast@intel.com, aravind.iddamsetty@linux.intel.com, anshuman.gupta@intel.com, andi.shyti@linux.intel.com, matthew.d.roper@intel.com, Raag Jadav <raag.jadav@intel.com>
Subject: [PATCH v7 1/5] drm: Introduce device wedged event
Date: Mon, 30 Sep 2024 13:08:41 +0530
In-Reply-To: <20240930073845.347326-1-raag.jadav@intel.com>
Series: Introduce DRM device wedged event
    [v7,0/5] Introduce DRM device wedged event
    [v7,1/5] drm: Introduce device wedged event
    [v7,2/5] drm: Expose wedge recovery methods
    [v7,3/5] drm/doc: Document device wedged event
    [v7,4/5] drm/xe: Use device wedged event
    [v7,5/5] drm/i915: Use device wedged event

Raag Jadav Sept. 30, 2024, 7:38 a.m. UTC

Introduce device wedged event, which will notify userspace of wedged
(hanged/unusable) state of the DRM device through a uevent. This is
useful especially in cases where the device is no longer operating as
expected even after a hardware reset and has become unrecoverable from
driver context.

Purpose of this implementation is to provide drivers a generic way to
recover with the help of userspace intervention. Different drivers may
have different ideas of a "wedged device" depending on their hardware
implementation, and hence the vendor agnostic nature of the event.
It is up to the drivers to decide when they see the need for recovery
and how they want to recover from the available methods.

Current implementation defines three recovery methods, out of which,
drivers can choose to support any one or multiple of them. Preferred
recovery method will be sent in the uevent environment as WEDGED=<method>.
Userspace consumers (sysadmin) can define udev rules to parse this event
and take respective action to recover the device.

    =============== ==================================
    Recovery method Consumer expectations
    =============== ==================================
    rebind          unbind + rebind driver
    bus-reset       unbind + reset bus device + rebind
    reboot          reboot system
    =============== ==================================

v4: s/drm_dev_wedged/drm_dev_wedged_event
    Use drm_info() (Jani)
    Kernel doc adjustment (Aravind)
v5: Send recovery method with uevent (Lina)
v6: Access wedge_recovery_opts[] using helper function (Jani)
    Use snprintf() (Jani)
v7: Convert recovery helpers into regular functions (Andy, Jani)
    Aesthetic adjustments (Andy)
    Handle invalid method cases

Signed-off-by: Raag Jadav <raag.jadav@intel.com>
---
 drivers/gpu/drm/drm_drv.c | 77 +++++++++++++++++++++++++++++++++++++++
 include/drm/drm_device.h  | 23 ++++++++++++
 include/drm/drm_drv.h     |  3 ++
 3 files changed, 103 insertions(+)
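
For illustration only, here is a minimal userspace sketch of how a consumer might map the WEDGED=<method> value from the uevent environment to the expectations in the recovery table of the commit message. The helper name wedged_action() and its returned description strings are hypothetical and not part of this patch. A real consumer would more likely be a udev rule than C code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: map one uevent environment entry to the action the
 * consumer is expected to take. Returns NULL if the entry is not a WEDGED=
 * variable or carries a method this sketch does not know.
 */
static const char *wedged_action(const char *env)
{
	static const char prefix[] = "WEDGED=";
	static const struct {
		const char *method;
		const char *action;
	} map[] = {
		{ "rebind",    "unbind + rebind driver" },
		{ "bus-reset", "unbind + reset bus device + rebind" },
		{ "reboot",    "reboot system" },
	};
	const char *method;
	size_t i;

	/* sizeof(prefix) counts the trailing NUL, hence the "- 1". */
	if (strncmp(env, prefix, sizeof(prefix) - 1) != 0)
		return NULL;

	method = env + sizeof(prefix) - 1;
	for (i = 0; i < sizeof(map) / sizeof(map[0]); i++) {
		if (strcmp(method, map[i].method) == 0)
			return map[i].action;
	}

	return NULL;
}
```

Unknown methods return NULL rather than a default action, so a consumer built this way would ignore values it does not understand.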

Andy Shevchenko Sept. 30, 2024, 12:59 p.m. UTC | #1

On Mon, Sep 30, 2024 at 01:08:41PM +0530, Raag Jadav wrote:

...

> +/*
> + * Available recovery methods for wedged device. To be sent along with device
> + * wedged uevent.
> + */
> +static const char *const drm_wedge_recovery_opts[] = {
> +	[DRM_WEDGE_RECOVERY_REBIND] = "rebind",
> +	[DRM_WEDGE_RECOVERY_BUS_RESET] = "bus-reset",
> +	[DRM_WEDGE_RECOVERY_REBOOT] = "reboot",
> +};

The place for static_assert() is here, as it is closer to the actual data we test...

> +static bool drm_wedge_recovery_is_valid(enum drm_wedge_recovery method)
> +{
> +	static_assert(ARRAY_SIZE(drm_wedge_recovery_opts) == DRM_WEDGE_RECOVERY_MAX);

...it doesn't fully belong to this function (or only to this function).

> +	return method >= DRM_WEDGE_RECOVERY_REBIND && method < DRM_WEDGE_RECOVERY_MAX;
> +}

Why do we need this one-liner (after above comment being addressed) as a
separate function?
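
A minimal userspace sketch of the two review points above, assuming C11: the static_assert() sits at file scope directly under the array it guards, and the range check is folded into its only caller. ARRAY_SIZE is redefined here as a stand-in for the kernel macro, and the single unsigned comparison is this sketch's substitute for the two-sided check in the patch. It is not the form any later revision necessarily took.

```c
#include <assert.h>	/* static_assert (C11) */
#include <stddef.h>

/* Userspace stand-in for the kernel's ARRAY_SIZE(). */
#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

enum drm_wedge_recovery {
	DRM_WEDGE_RECOVERY_REBIND,
	DRM_WEDGE_RECOVERY_BUS_RESET,
	DRM_WEDGE_RECOVERY_REBOOT,
	DRM_WEDGE_RECOVERY_MAX
};

static const char *const drm_wedge_recovery_opts[] = {
	[DRM_WEDGE_RECOVERY_REBIND] = "rebind",
	[DRM_WEDGE_RECOVERY_BUS_RESET] = "bus-reset",
	[DRM_WEDGE_RECOVERY_REBOOT] = "reboot",
};
/* Compile-time check placed next to the data it guards. */
static_assert(ARRAY_SIZE(drm_wedge_recovery_opts) == DRM_WEDGE_RECOVERY_MAX,
	      "recovery table out of sync with enum");

const char *drm_wedge_recovery_name(enum drm_wedge_recovery method)
{
	/* Range check folded into the only caller; the cast also rejects
	 * negative values. */
	if ((unsigned int)method < DRM_WEDGE_RECOVERY_MAX)
		return drm_wedge_recovery_opts[method];

	return NULL;
}
```

With the assertion at file scope, adding an enumerator without a matching string breaks the build even if no function references the table.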

Raag Jadav Oct. 1, 2024, 5:08 a.m. UTC | #2

On Mon, Sep 30, 2024 at 03:59:59PM +0300, Andy Shevchenko wrote:
> On Mon, Sep 30, 2024 at 01:08:41PM +0530, Raag Jadav wrote:
> 
> ...
> 
> > +/*
> > + * Available recovery methods for wedged device. To be sent along with device
> > + * wedged uevent.
> > + */
> > +static const char *const drm_wedge_recovery_opts[] = {
> > +	[DRM_WEDGE_RECOVERY_REBIND] = "rebind",
> > +	[DRM_WEDGE_RECOVERY_BUS_RESET] = "bus-reset",
> > +	[DRM_WEDGE_RECOVERY_REBOOT] = "reboot",
> > +};
> 
> Place for static_assert() is here, as it closer to the actual data we test...

Shouldn't it be at the point of access?
If no, why do we care about the data when it's not being used?

> > +static bool drm_wedge_recovery_is_valid(enum drm_wedge_recovery method)
> > +{
> > +	static_assert(ARRAY_SIZE(drm_wedge_recovery_opts) == DRM_WEDGE_RECOVERY_MAX);
> 
> ...it doesn't fully belong to this function (or only to this function).

The purpose of having a helper is to have a single point of access, no?

Side note: It also goes well with is_valid() semantic IMHO.

> > +	return method >= DRM_WEDGE_RECOVERY_REBIND && method < DRM_WEDGE_RECOVERY_MAX;
> > +}
> 
> Why do we need this one-liner (after above comment being addressed) as a
> separate function?

I'm not sure if I'm following you. Method is not a constant here, we'll get it
on the stack.

Raag

Andy Shevchenko Oct. 1, 2024, 12:07 p.m. UTC | #3

On Tue, Oct 01, 2024 at 08:08:18AM +0300, Raag Jadav wrote:
> On Mon, Sep 30, 2024 at 03:59:59PM +0300, Andy Shevchenko wrote:
> > On Mon, Sep 30, 2024 at 01:08:41PM +0530, Raag Jadav wrote:

...

> > > +static const char *const drm_wedge_recovery_opts[] = {
> > > +	[DRM_WEDGE_RECOVERY_REBIND] = "rebind",
> > > +	[DRM_WEDGE_RECOVERY_BUS_RESET] = "bus-reset",
> > > +	[DRM_WEDGE_RECOVERY_REBOOT] = "reboot",
> > > +};
> > 
> > Place for static_assert() is here, as it closer to the actual data we test...
> 
> Shouldn't it be at the point of access?

No, the idea of static_assert() is in word 'static', meaning it's allowed to be
used in the global space.

> If no, why do we care about the data when it's not being used?

What is this supposed to mean? The assertion enforces that two bounds
defined by different means (the size constant and the real size of the
array) stay in sync.

...

> > > +static bool drm_wedge_recovery_is_valid(enum drm_wedge_recovery method)
> > > +{
> > > +	static_assert(ARRAY_SIZE(drm_wedge_recovery_opts) == DRM_WEDGE_RECOVERY_MAX);
> > 
> > ...it doesn't fully belong to this function (or only to this function).
> 
> The purpose of having a helper is to have a single point of access, no?

What single access are you talking about? It seems you are trying to solve
a non-existent issue. There is a function that is used exactly once
and it's a one-liner. There is no point in keeping it separate (at least
right now).

> Side note: It also goes well with is_valid() semantic IMHO.

It doesn't matter at all, it's unrelated to the point.

> > > +	return method >= DRM_WEDGE_RECOVERY_REBIND && method < DRM_WEDGE_RECOVERY_MAX;
> > > +}
> > 
> > Why do we need this one-liner (after above comment being addressed) as a
> > separate function?
> 
> I'm not sure if I'm following you. Method is not a constant here, we'll get it
> on the stack.

I elaborated above.

Michal Wajdeczko Oct. 1, 2024, 12:20 p.m. UTC | #4

Hi,

sorry for late comments,

On 30.09.2024 09:38, Raag Jadav wrote:
> Introduce device wedged event, which will notify userspace of wedged
> (hanged/unusable) state of the DRM device through a uevent. This is
> useful especially in cases where the device is no longer operating as
> expected even after a hardware reset and has become unrecoverable from
> driver context.
> 
> Purpose of this implementation is to provide drivers a generic way to
> recover with the help of userspace intervention. Different drivers may
> have different ideas of a "wedged device" depending on their hardware
> implementation, and hence the vendor agnostic nature of the event.
> It is up to the drivers to decide when they see the need for recovery
> and how they want to recover from the available methods.

what about when driver just wants to tell that it is in unusable state,
but recovery method is unknown or not possible?

> 
> Current implementation defines three recovery methods, out of which,
> drivers can choose to support any one or multiple of them. Preferred
> recovery method will be sent in the uevent environment as WEDGED=<method>.

could this be something like below instead:

	WEDGED=<reason>
	RECOVERY=<method>[,<method>]

then the driver will have a chance to tell what happened _and_ additionally
provide hint(s) on how to recover from that situation
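
A rough userspace sketch of what the two-variable format proposed above could look like when built with snprintf(). Everything here is hypothetical: the helper name wedged_format_env(), its signature, and the idea of passing the methods as a string array are assumptions made for illustration, and the thread does not define any reason strings.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical: format "WEDGED=<reason>" and "RECOVERY=<m1>[,<m2>...]" into
 * caller-provided buffers. Returns 0 on success, -1 if either string would
 * be truncated.
 */
static int wedged_format_env(char *wedged, size_t wlen,
			     char *recovery, size_t rlen,
			     const char *reason,
			     const char *const *methods, size_t nmethods)
{
	size_t off, i;
	int n;

	n = snprintf(wedged, wlen, "WEDGED=%s", reason);
	if (n < 0 || (size_t)n >= wlen)
		return -1;

	n = snprintf(recovery, rlen, "RECOVERY=");
	if (n < 0 || (size_t)n >= rlen)
		return -1;
	off = (size_t)n;

	/* Append each method, comma-separated after the first. */
	for (i = 0; i < nmethods; i++) {
		n = snprintf(recovery + off, rlen - off, "%s%s",
			     i ? "," : "", methods[i]);
		if (n < 0 || (size_t)n >= rlen - off)
			return -1;
		off += (size_t)n;
	}

	return 0;
}
```

In kernel code the two buffers would become two entries in the envp[] array passed to kobject_uevent_env(), followed by the terminating NULL.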

> Userspace consumers (sysadmin) can define udev rules to parse this event
> and take respective action to recover the device.
> 
>     =============== ==================================
>     Recovery method Consumer expectations
>     =============== ==================================
>     rebind          unbind + rebind driver
>     bus-reset       unbind + reset bus device + rebind
>     reboot          reboot system

btw, what if driver detects a really broken HW, or has no clue what will
help here, shouldn't we have a "none" method?

>     =============== ==================================
> 
> v4: s/drm_dev_wedged/drm_dev_wedged_event
>     Use drm_info() (Jani)
>     Kernel doc adjustment (Aravind)
> v5: Send recovery method with uevent (Lina)
> v6: Access wedge_recovery_opts[] using helper function (Jani)
>     Use snprintf() (Jani)
> v7: Convert recovery helpers into regular functions (Andy, Jani)
>     Aesthetic adjustments (Andy)
>     Handle invalid method cases
> 
> Signed-off-by: Raag Jadav <raag.jadav@intel.com>
> ---
>  drivers/gpu/drm/drm_drv.c | 77 +++++++++++++++++++++++++++++++++++++++
>  include/drm/drm_device.h  | 23 ++++++++++++
>  include/drm/drm_drv.h     |  3 ++
>  3 files changed, 103 insertions(+)
> 
> diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
> index ac30b0ec9d93..cfe9600da2ee 100644
> --- a/drivers/gpu/drm/drm_drv.c
> +++ b/drivers/gpu/drm/drm_drv.c
> @@ -26,6 +26,8 @@
>   * DEALINGS IN THE SOFTWARE.
>   */
>  
> +#include <linux/array_size.h>
> +#include <linux/build_bug.h>
>  #include <linux/debugfs.h>
>  #include <linux/fs.h>
>  #include <linux/module.h>
> @@ -33,6 +35,7 @@
>  #include <linux/mount.h>
>  #include <linux/pseudo_fs.h>
>  #include <linux/slab.h>
> +#include <linux/sprintf.h>
>  #include <linux/srcu.h>
>  #include <linux/xarray.h>
>  
> @@ -70,6 +73,42 @@ static struct dentry *drm_debugfs_root;
>  
>  DEFINE_STATIC_SRCU(drm_unplug_srcu);
>  
> +/*
> + * Available recovery methods for wedged device. To be sent along with device
> + * wedged uevent.
> + */
> +static const char *const drm_wedge_recovery_opts[] = {
> +	[DRM_WEDGE_RECOVERY_REBIND] = "rebind",
> +	[DRM_WEDGE_RECOVERY_BUS_RESET] = "bus-reset",
> +	[DRM_WEDGE_RECOVERY_REBOOT] = "reboot",
> +};
> +
> +static bool drm_wedge_recovery_is_valid(enum drm_wedge_recovery method)
> +{
> +	static_assert(ARRAY_SIZE(drm_wedge_recovery_opts) == DRM_WEDGE_RECOVERY_MAX);
> +
> +	return method >= DRM_WEDGE_RECOVERY_REBIND && method < DRM_WEDGE_RECOVERY_MAX;
> +}
> +
> +/**
> + * drm_wedge_recovery_name - provide wedge recovery name
> + * @method: method to be used for recovery
> + *
> + * This validates wedge recovery @method against the available ones in

do we really need to validate an enum? maybe the problem is that there
is MAX enumerator that just shouldn't be there?

> + * drm_wedge_recovery_opts[] and provides respective recovery name in string
> + * format if found valid.
> + *
> + * Returns: pointer to const recovery string on success, NULL otherwise.
> + */
> +const char *drm_wedge_recovery_name(enum drm_wedge_recovery method)
> +{
> +	if (drm_wedge_recovery_is_valid(method))
> +		return drm_wedge_recovery_opts[method];

as we only have 3 methods, maybe simple switch() will do the work?
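
A minimal sketch of the switch() variant suggested here, written as standalone C so it compiles outside the kernel. It assumes the MAX enumerator is dropped, as raised in the earlier comment. Whether a later revision adopted this form is not established in this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Same enumerators as the patch, minus DRM_WEDGE_RECOVERY_MAX. */
enum drm_wedge_recovery {
	DRM_WEDGE_RECOVERY_REBIND,
	DRM_WEDGE_RECOVERY_BUS_RESET,
	DRM_WEDGE_RECOVERY_REBOOT,
};

const char *drm_wedge_recovery_name(enum drm_wedge_recovery method)
{
	/* No default case, so the compiler can flag a missing enumerator
	 * when built with -Wswitch. */
	switch (method) {
	case DRM_WEDGE_RECOVERY_REBIND:
		return "rebind";
	case DRM_WEDGE_RECOVERY_BUS_RESET:
		return "bus-reset";
	case DRM_WEDGE_RECOVERY_REBOOT:
		return "reboot";
	}

	/* Reached only for values outside the enum. */
	return NULL;
}
```

This removes both the string table and the separate validity helper, at the cost of one case label per method.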

> +
> +	return NULL;
> +}
> +EXPORT_SYMBOL(drm_wedge_recovery_name);
> +
>  /*
>   * DRM Minors
>   * A DRM device can provide several char-dev interfaces on the DRM-Major. Each
> @@ -497,6 +536,44 @@ void drm_dev_unplug(struct drm_device *dev)
>  }
>  EXPORT_SYMBOL(drm_dev_unplug);
>  
> +/**
> + * drm_dev_wedged_event - generate a device wedged uevent
> + * @dev: DRM device
> + * @method: method to be used for recovery
> + *
> + * This generates a device wedged uevent for the DRM device specified by @dev.
> + * Recovery @method from drm_wedge_recovery_opts[] (if supprted by the device)

typo

> + * is sent in the uevent environment as WEDGED=<method>, on the basis of which,
> + * userspace may take respective action to recover the device.
> + *
> + * Returns: 0 on success, or negative error code otherwise.
> + */
> +int drm_dev_wedged_event(struct drm_device *dev, enum drm_wedge_recovery method)
> +{
> +	/* Event string length up to 16+ characters with available methods */
> +	char event_string[32] = {};

magic 32 here and likely don't need to be initialized with { }
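
One possible way to address the magic number, sketched in userspace C: derive the buffer size from the prefix and the longest method name instead of hard-coding 32. The macro names and the wedge_format() helper are hypothetical, and the sketch assumes "bus-reset" stays the longest method.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define WEDGE_PREFIX	"WEDGED="
#define WEDGE_LONGEST	"bus-reset"	/* longest method name in this patch */
/* sizeof() of a string literal counts its NUL, so drop one of the two. */
#define WEDGE_STR_LEN	(sizeof(WEDGE_PREFIX) + sizeof(WEDGE_LONGEST) - 1)

/* Hypothetical helper: returns the length snprintf() wanted to write. */
static int wedge_format(char *buf, size_t len, const char *recovery)
{
	/* No zero-initialisation of buf is needed: snprintf() always
	 * NUL-terminates when len > 0. */
	return snprintf(buf, len, WEDGE_PREFIX "%s", recovery);
}
```

With these definitions WEDGE_STR_LEN evaluates to 17, which matches the "16+ characters" mentioned in the patch comment plus the terminating NUL.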

> +	char *envp[] = { event_string, NULL };
> +	const char *recovery;
> +
> +	recovery = drm_wedge_recovery_name(method);
> +	if (!recovery) {
> +		drm_err(dev, "device wedged, invalid recovery method %d\n", method);

maybe use drm_WARN() to see who is abusing the API ?

> +		return -EINVAL;

but shouldn't we still trigger an event with "none" recovery?

> +	}
> +
> +	if (!test_bit(method, &dev->wedge_recovery)) {
> +		drm_err(dev, "device wedged, %s based recovery not supported\n",
> +			drm_wedge_recovery_name(method));

do we really need this kind of guard? it will be a driver code that will
call this function, so likely it knows better what will work to recover

> +		return -EOPNOTSUPP;
> +	}
> +
> +	snprintf(event_string, sizeof(event_string), "WEDGED=%s", recovery);
> +
> +	drm_info(dev, "device wedged, generating uevent for %s based recovery\n", recovery);

nit:
	drm_info(dev, "device wedged, needs %s to recover\n", recovery);

> +	return kobject_uevent_env(&dev->primary->kdev->kobj, KOBJ_CHANGE, envp);
> +}
> +EXPORT_SYMBOL(drm_dev_wedged_event);
> +
>  /*
>   * DRM internal mount
>   * We want to be able to allocate our own "struct address_space" to control
> diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
> index c91f87b5242d..fed6f20e52fb 100644
> --- a/include/drm/drm_device.h
> +++ b/include/drm/drm_device.h
> @@ -40,6 +40,26 @@ enum switch_power_state {
>  	DRM_SWITCH_POWER_DYNAMIC_OFF = 3,
>  };
>  
> +/**
> + * enum drm_wedge_recovery - Recovery method for wedged device in order of
> + * severity. To be set as bit fields in drm_device.wedge_recovery variable.
> + * Drivers can choose to support any one or multiple of them depending on
> + * their needs.
> + */
> +enum drm_wedge_recovery {
> +	/** @DRM_WEDGE_RECOVERY_REBIND: unbind + rebind driver */
> +	DRM_WEDGE_RECOVERY_REBIND,
> +
> +	/** @DRM_WEDGE_RECOVERY_BUS_RESET: unbind + reset bus device + rebind */
> +	DRM_WEDGE_RECOVERY_BUS_RESET,
> +
> +	/** @DRM_WEDGE_RECOVERY_REBOOT: reboot system */
> +	DRM_WEDGE_RECOVERY_REBOOT,
> +
> +	/** @DRM_WEDGE_RECOVERY_MAX: for bounds checking, do not use */
> +	DRM_WEDGE_RECOVERY_MAX
> +};
> +
>  /**
>   * struct drm_device - DRM device structure
>   *
> @@ -317,6 +337,9 @@ struct drm_device {
>  	 * Root directory for debugfs files.
>  	 */
>  	struct dentry *debugfs_root;
> +
> +	/** @wedge_recovery: Supported recovery methods for wedged device */
> +	unsigned long wedge_recovery;

hmm, so before the driver can ask for a reboot as a recovery method from
wedge it has to somehow add 'reboot' as an available method? why is that?

and if you insist that this is useful then at least document how this
should be initialized (so as not to force developers to look at the
drm_dev_wedged_event code where it's used)
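
To make the question concrete, here is a userspace sketch of how the wedge_recovery bitmask would presumably be populated and then checked. The test_bit(method, ...) call in the patch implies that each enumerator is used as a bit number. The struct and helper names below are stand-ins invented for this sketch. In kernel code the set side would likely be set_bit() on dev->wedge_recovery, which is an assumption about patch 2/5 rather than something shown here.

```c
#include <assert.h>
#include <stdbool.h>

enum drm_wedge_recovery {
	DRM_WEDGE_RECOVERY_REBIND,
	DRM_WEDGE_RECOVERY_BUS_RESET,
	DRM_WEDGE_RECOVERY_REBOOT,
	DRM_WEDGE_RECOVERY_MAX
};

/* Stand-in for the one struct drm_device field this sketch needs. */
struct fake_drm_device {
	unsigned long wedge_recovery;
};

/* Probe time: mark a method as supported by setting its bit. */
static void fake_wedge_recovery_enable(struct fake_drm_device *dev,
				       enum drm_wedge_recovery method)
{
	dev->wedge_recovery |= 1UL << method;
}

/* Event time: userspace equivalent of test_bit(method, &dev->wedge_recovery). */
static bool fake_wedge_recovery_supported(const struct fake_drm_device *dev,
					  enum drm_wedge_recovery method)
{
	return (dev->wedge_recovery >> method) & 1UL;
}
```

A driver that enables only rebind and reboot ends up with the mask 0x5, and a later request for bus-reset would hit the -EOPNOTSUPP path in drm_dev_wedged_event().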

>  };
>  
>  #endif
> diff --git a/include/drm/drm_drv.h b/include/drm/drm_drv.h
> index 02ea4e3248fd..d8dbc77010b0 100644
> --- a/include/drm/drm_drv.h
> +++ b/include/drm/drm_drv.h
> @@ -462,6 +462,9 @@ bool drm_dev_enter(struct drm_device *dev, int *idx);
>  void drm_dev_exit(int idx);
>  void drm_dev_unplug(struct drm_device *dev);
>  
> +const char *drm_wedge_recovery_name(enum drm_wedge_recovery method);
> +int drm_dev_wedged_event(struct drm_device *dev, enum drm_wedge_recovery method);
> +
>  /**
>   * drm_dev_is_unplugged - is a DRM device unplugged
>   * @dev: DRM device

Raag Jadav Oct. 1, 2024, 2:18 p.m. UTC | #5

On Tue, Oct 01, 2024 at 03:07:59PM +0300, Andy Shevchenko wrote:
> On Tue, Oct 01, 2024 at 08:08:18AM +0300, Raag Jadav wrote:
> > On Mon, Sep 30, 2024 at 03:59:59PM +0300, Andy Shevchenko wrote:
> > > On Mon, Sep 30, 2024 at 01:08:41PM +0530, Raag Jadav wrote:
> 
> ...
> 
> > > > +static const char *const drm_wedge_recovery_opts[] = {
> > > > +	[DRM_WEDGE_RECOVERY_REBIND] = "rebind",
> > > > +	[DRM_WEDGE_RECOVERY_BUS_RESET] = "bus-reset",
> > > > +	[DRM_WEDGE_RECOVERY_REBOOT] = "reboot",
> > > > +};
> > > 
> > > Place for static_assert() is here, as it closer to the actual data we test...
> > 
> > Shouldn't it be at the point of access?
> 
> No, the idea of static_assert() is in word 'static', meaning it's allowed to be
> used in the global space.
> 
> > If no, why do we care about the data when it's not being used?
> 
> What does this suppose to mean? The assertion is for enforcing the boundaries
> that are defined by different means (constant of the size and real size of
> an array).

The point was to simply not assert without an active user of the array, which is
not the case now but may be possible with growing functionality in the future.

Raag

Andy Shevchenko Oct. 1, 2024, 2:54 p.m. UTC | #6

On Tue, Oct 01, 2024 at 05:18:33PM +0300, Raag Jadav wrote:
> On Tue, Oct 01, 2024 at 03:07:59PM +0300, Andy Shevchenko wrote:
> > On Tue, Oct 01, 2024 at 08:08:18AM +0300, Raag Jadav wrote:
> > > On Mon, Sep 30, 2024 at 03:59:59PM +0300, Andy Shevchenko wrote:
> > > > On Mon, Sep 30, 2024 at 01:08:41PM +0530, Raag Jadav wrote:

...

> > > > > +static const char *const drm_wedge_recovery_opts[] = {
> > > > > +	[DRM_WEDGE_RECOVERY_REBIND] = "rebind",
> > > > > +	[DRM_WEDGE_RECOVERY_BUS_RESET] = "bus-reset",
> > > > > +	[DRM_WEDGE_RECOVERY_REBOOT] = "reboot",
> > > > > +};
> > > > 
> > > > Place for static_assert() is here, as it closer to the actual data we test...
> > > 
> > > Shouldn't it be at the point of access?
> > 
> > No, the idea of static_assert() is in word 'static', meaning it's allowed to be
> > used in the global space.
> > 
> > > If no, why do we care about the data when it's not being used?
> > 
> > What does this suppose to mean? The assertion is for enforcing the boundaries
> > that are defined by different means (constant of the size and real size of
> > an array).
> 
> The point was to simply not assert without an active user of the array, which is
> not the case now but may be possible with growing functionality in the future.

static_assert() is a compile-time check. How is it even related to this?
So, i.o.w., you are contradicting yourself in this code: on one hand you want
compile-time static checker, on the other you do not want it and rely on the
usage of the function.

Possible solutions:
1) remove static_assert() completely;
2) move it as I said.

Raag Jadav Oct. 1, 2024, 4:42 p.m. UTC | #7

On Tue, Oct 01, 2024 at 05:54:46PM +0300, Andy Shevchenko wrote:
> On Tue, Oct 01, 2024 at 05:18:33PM +0300, Raag Jadav wrote:
> > On Tue, Oct 01, 2024 at 03:07:59PM +0300, Andy Shevchenko wrote:
> > > On Tue, Oct 01, 2024 at 08:08:18AM +0300, Raag Jadav wrote:
> > > > On Mon, Sep 30, 2024 at 03:59:59PM +0300, Andy Shevchenko wrote:
> > > > > On Mon, Sep 30, 2024 at 01:08:41PM +0530, Raag Jadav wrote:
> 
> ...
> 
> > > > > > +static const char *const drm_wedge_recovery_opts[] = {
> > > > > > +	[DRM_WEDGE_RECOVERY_REBIND] = "rebind",
> > > > > > +	[DRM_WEDGE_RECOVERY_BUS_RESET] = "bus-reset",
> > > > > > +	[DRM_WEDGE_RECOVERY_REBOOT] = "reboot",
> > > > > > +};
> > > > > 
> > > > > Place for static_assert() is here, as it closer to the actual data we test...
> > > > 
> > > > Shouldn't it be at the point of access?
> > > 
> > > No, the idea of static_assert() is in word 'static', meaning it's allowed to be
> > > used in the global space.
> > > 
> > > > If no, why do we care about the data when it's not being used?
> > > 
> > > What does this suppose to mean? The assertion is for enforcing the boundaries
> > > that are defined by different means (constant of the size and real size of
> > > an array).
> > 
> > The point was to simply not assert without an active user of the array, which is
> > not the case now but may be possible with growing functionality in the future.
> 
> static_assert() is a compile-time check. How is it even related to this?

Yes, I understand. Semantically it made more sense to me is all, since core
helpers can always end up in config based ifdeffery.

Anyway, I'll update.

Raag

Raag Jadav Oct. 3, 2024, 12:23 p.m. UTC | #8

On Tue, Oct 01, 2024 at 02:20:29PM +0200, Michal Wajdeczko wrote:
> Hi,
> 
> sorry for late comments,

Sure, no problem.

> On 30.09.2024 09:38, Raag Jadav wrote:
> > Introduce device wedged event, which will notify userspace of wedged
> > (hanged/unusable) state of the DRM device through a uevent. This is
> > useful especially in cases where the device is no longer operating as
> > expected even after a hardware reset and has become unrecoverable from
> > driver context.
> > 
> > Purpose of this implementation is to provide drivers a generic way to
> > recover with the help of userspace intervention. Different drivers may
> > have different ideas of a "wedged device" depending on their hardware
> > implementation, and hence the vendor agnostic nature of the event.
> > It is up to the drivers to decide when they see the need for recovery
> > and how they want to recover from the available methods.
> 
> what about when driver just wants to tell that it is in unusable state,
> but recovery method is unknown or not possible?

Interesting... However, what would be the consumer expectation for it?
If the expectation is to not recover, why send an event at all?

> > 
> > Current implementation defines three recovery methods, out of which,
> > drivers can choose to support any one or multiple of them. Preferred
> > recovery method will be sent in the uevent environment as WEDGED=<method>.
> 
> could this be something like below instead:
> 
> 	WEDGED=<reason>
> 	RECOVERY=<method>[,<method>]
> 
> then driver will have a chance to tell what happen _and_ additionally
> provide a hint(s) how to recover from that situation

Documentation/gpu/drm-uapi.rst +337

UMD can issue an ioctl to the KMD to check the reset status

...or <reason> for wedging, which KMD will signify with an error code...

UMD will then proceed to report it to the application using the appropriate
API error code

(should've explicitly added, sorry)

> > Userspace consumers (sysadmin) can define udev rules to parse this event
> > and take respective action to recover the device.
> > 
> >     =============== ==================================
> >     Recovery method Consumer expectations
> >     =============== ==================================
> >     rebind          unbind + rebind driver
> >     bus-reset       unbind + reset bus device + rebind
> >     reboot          reboot system
> 
> btw, what if driver detects a really broken HW, or has no clue what will
> help here, shouldn't we have a "none" method?

Sure. But same as above, we have to define expectations.

> >     =============== ==================================
> > 
> > v4: s/drm_dev_wedged/drm_dev_wedged_event
> >     Use drm_info() (Jani)
> >     Kernel doc adjustment (Aravind)
> > v5: Send recovery method with uevent (Lina)
> > v6: Access wedge_recovery_opts[] using helper function (Jani)
> >     Use snprintf() (Jani)
> > v7: Convert recovery helpers into regular functions (Andy, Jani)
> >     Aesthetic adjustments (Andy)
> >     Handle invalid method cases
> > 
> > Signed-off-by: Raag Jadav <raag.jadav@intel.com>
> > ---
> >  drivers/gpu/drm/drm_drv.c | 77 +++++++++++++++++++++++++++++++++++++++
> >  include/drm/drm_device.h  | 23 ++++++++++++
> >  include/drm/drm_drv.h     |  3 ++
> >  3 files changed, 103 insertions(+)
> > 
> > diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
> > index ac30b0ec9d93..cfe9600da2ee 100644
> > --- a/drivers/gpu/drm/drm_drv.c
> > +++ b/drivers/gpu/drm/drm_drv.c
> > @@ -26,6 +26,8 @@
> >   * DEALINGS IN THE SOFTWARE.
> >   */
> >  
> > +#include <linux/array_size.h>
> > +#include <linux/build_bug.h>
> >  #include <linux/debugfs.h>
> >  #include <linux/fs.h>
> >  #include <linux/module.h>
> > @@ -33,6 +35,7 @@
> >  #include <linux/mount.h>
> >  #include <linux/pseudo_fs.h>
> >  #include <linux/slab.h>
> > +#include <linux/sprintf.h>
> >  #include <linux/srcu.h>
> >  #include <linux/xarray.h>
> >  
> > @@ -70,6 +73,42 @@ static struct dentry *drm_debugfs_root;
> >  
> >  DEFINE_STATIC_SRCU(drm_unplug_srcu);
> >  
> > +/*
> > + * Available recovery methods for wedged device. To be sent along with device
> > + * wedged uevent.
> > + */
> > +static const char *const drm_wedge_recovery_opts[] = {
> > +	[DRM_WEDGE_RECOVERY_REBIND] = "rebind",
> > +	[DRM_WEDGE_RECOVERY_BUS_RESET] = "bus-reset",
> > +	[DRM_WEDGE_RECOVERY_REBOOT] = "reboot",
> > +};
> > +
> > +static bool drm_wedge_recovery_is_valid(enum drm_wedge_recovery method)
> > +{
> > +	static_assert(ARRAY_SIZE(drm_wedge_recovery_opts) == DRM_WEDGE_RECOVERY_MAX);
> > +
> > +	return method >= DRM_WEDGE_RECOVERY_REBIND && method < DRM_WEDGE_RECOVERY_MAX;
> > +}
> > +
> > +/**
> > + * drm_wedge_recovery_name - provide wedge recovery name
> > + * @method: method to be used for recovery
> > + *
> > + * This validates wedge recovery @method against the available ones in
> 
> do we really need to validate an enum?

I'm all for trusting the drivers explicitly, but since this is a core feature
I thought we'd have some guard rails (for abusers).

> maybe the problem is that there is MAX enumerator that just shouldn't be there?

With MAX in place we won't need to adjust the helpers to match with enum
modifications in the future (if any).

> > + * drm_wedge_recovery_opts[] and provides respective recovery name in string
> > + * format if found valid.
> > + *
> > + * Returns: pointer to const recovery string on success, NULL otherwise.
> > + */
> > +const char *drm_wedge_recovery_name(enum drm_wedge_recovery method)
> > +{
> > +	if (drm_wedge_recovery_is_valid(method))
> > +		return drm_wedge_recovery_opts[method];
> 
> as we only have 3 methods, maybe simple switch() will do the work?

Sure.

> > +
> > +	return NULL;
> > +}
> > +EXPORT_SYMBOL(drm_wedge_recovery_name);
> > +
> >  /*
> >   * DRM Minors
> >   * A DRM device can provide several char-dev interfaces on the DRM-Major. Each
> > @@ -497,6 +536,44 @@ void drm_dev_unplug(struct drm_device *dev)
> >  }
> >  EXPORT_SYMBOL(drm_dev_unplug);
> >  
> > +/**
> > + * drm_dev_wedged_event - generate a device wedged uevent
> > + * @dev: DRM device
> > + * @method: method to be used for recovery
> > + *
> > + * This generates a device wedged uevent for the DRM device specified by @dev.
> > + * Recovery @method from drm_wedge_recovery_opts[] (if supprted by the device)
> 
> typo

Good catch.

> > + * is sent in the uevent environment as WEDGED=<method>, on the basis of which,
> > + * userspace may take respective action to recover the device.
> > + *
> > + * Returns: 0 on success, or negative error code otherwise.
> > + */
> > +int drm_dev_wedged_event(struct drm_device *dev, enum drm_wedge_recovery method)
> > +{
> > +	/* Event string length up to 16+ characters with available methods */
> > +	char event_string[32] = {};
> 
> magic 32 here

Anything to add to the event string length comment above?

> > +	char *envp[] = { event_string, NULL };
> > +	const char *recovery;
> > +
> > +	recovery = drm_wedge_recovery_name(method);
> > +	if (!recovery) {
> > +		drm_err(dev, "device wedged, invalid recovery method %d\n", method);
> 
> maybe use drm_WARN() to see who is abusing the API ?

Sure.

> > +		return -EINVAL;
> 
> but shouldn't we still trigger an event with "none" recovery?

Explained above.

> > +	}
> > +
> > +	if (!test_bit(method, &dev->wedge_recovery)) {
> > +		drm_err(dev, "device wedged, %s based recovery not supported\n",
> > +			drm_wedge_recovery_name(method));
> 
> do we really need this kind of guard? it will be a driver code that will
> call this function, so likely it knows better what will work to recover

Agreed, although an unsupported method could cause undefined behaviour.

> > +		return -EOPNOTSUPP;
> > +	}
> > +
> > +	snprintf(event_string, sizeof(event_string), "WEDGED=%s", recovery);
> > +
> > +	drm_info(dev, "device wedged, generating uevent for %s based recovery\n", recovery);
> 
> nit:
> 	drm_info(dev, "device wedged, needs %s to recover\n", recovery);

Sure.

> > +	return kobject_uevent_env(&dev->primary->kdev->kobj, KOBJ_CHANGE, envp);
> > +}
> > +EXPORT_SYMBOL(drm_dev_wedged_event);
> > +
> >  /*
> >   * DRM internal mount
> >   * We want to be able to allocate our own "struct address_space" to control
> > diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
> > index c91f87b5242d..fed6f20e52fb 100644
> > --- a/include/drm/drm_device.h
> > +++ b/include/drm/drm_device.h
> > @@ -40,6 +40,26 @@ enum switch_power_state {
> >  	DRM_SWITCH_POWER_DYNAMIC_OFF = 3,
> >  };
> >  
> > +/**
> > + * enum drm_wedge_recovery - Recovery method for wedged device in order of
> > + * severity. To be set as bit fields in drm_device.wedge_recovery variable.
> > + * Drivers can choose to support any one or multiple of them depending on
> > + * their needs.
> > + */
> > +enum drm_wedge_recovery {
> > +	/** @DRM_WEDGE_RECOVERY_REBIND: unbind + rebind driver */
> > +	DRM_WEDGE_RECOVERY_REBIND,
> > +
> > +	/** @DRM_WEDGE_RECOVERY_BUS_RESET: unbind + reset bus device + rebind */
> > +	DRM_WEDGE_RECOVERY_BUS_RESET,
> > +
> > +	/** @DRM_WEDGE_RECOVERY_REBOOT: reboot system */
> > +	DRM_WEDGE_RECOVERY_REBOOT,
> > +
> > +	/** @DRM_WEDGE_RECOVERY_MAX: for bounds checking, do not use */
> > +	DRM_WEDGE_RECOVERY_MAX
> > +};
> > +
> >  /**
> >   * struct drm_device - DRM device structure
> >   *
> > @@ -317,6 +337,9 @@ struct drm_device {
> >  	 * Root directory for debugfs files.
> >  	 */
> >  	struct dentry *debugfs_root;
> > +
> > +	/** @wedge_recovery: Supported recovery methods for wedged device */
> > +	unsigned long wedge_recovery;
> 
> hmm, so before the driver can ask for a reboot as a recovery method from
> wedge it has to somehow add 'reboot' as available method? why it that?

It's for consumers to use as fallbacks in case the preferred recovery method
(sent along with uevent) doesn't work out. (patch 2/5)

> and if you insist that this is useful then at least document how this
> should be initialized (to not forcing developers to look at
> drm_dev_wedged_event code where it's used)

Sure.

Raag

Raag Jadav Oct. 8, 2024, 3:02 p.m. UTC | #9

On Thu, Oct 03, 2024 at 03:23:22PM +0300, Raag Jadav wrote:
> On Tue, Oct 01, 2024 at 02:20:29PM +0200, Michal Wajdeczko wrote:
> > On 30.09.2024 09:38, Raag Jadav wrote:
> > >  
> > > +/**
> > > + * enum drm_wedge_recovery - Recovery method for wedged device in order of
> > > + * severity. To be set as bit fields in drm_device.wedge_recovery variable.
> > > + * Drivers can choose to support any one or multiple of them depending on
> > > + * their needs.
> > > + */
> > > +enum drm_wedge_recovery {
> > > +	/** @DRM_WEDGE_RECOVERY_REBIND: unbind + rebind driver */
> > > +	DRM_WEDGE_RECOVERY_REBIND,
> > > +
> > > +	/** @DRM_WEDGE_RECOVERY_BUS_RESET: unbind + reset bus device + rebind */
> > > +	DRM_WEDGE_RECOVERY_BUS_RESET,
> > > +
> > > +	/** @DRM_WEDGE_RECOVERY_REBOOT: reboot system */
> > > +	DRM_WEDGE_RECOVERY_REBOOT,
> > > +
> > > +	/** @DRM_WEDGE_RECOVERY_MAX: for bounds checking, do not use */
> > > +	DRM_WEDGE_RECOVERY_MAX
> > > +};
> > > +
> > >  /**
> > >   * struct drm_device - DRM device structure
> > >   *
> > > @@ -317,6 +337,9 @@ struct drm_device {
> > >  	 * Root directory for debugfs files.
> > >  	 */
> > >  	struct dentry *debugfs_root;
> > > +
> > > +	/** @wedge_recovery: Supported recovery methods for wedged device */
> > > +	unsigned long wedge_recovery;
> > 
> > hmm, so before the driver can ask for a reboot as a recovery method from
> > wedge it has to somehow add 'reboot' as available method? why it that?
> 
> It's for consumers to use as fallbacks in case the preferred recovery method
> (sent along with uevent) don't workout. (patch 2/5)

On second thought...

Lucas, do we have a convincing enough use case for fallback recovery?
If <method> were to fail, I would expect there to be even bigger problems
like kernel crash or unrecoverable hardware failure.

At that point is it worth retrying?

Raag

Lucas De Marchi Oct. 10, 2024, 1:02 p.m. UTC | #10

On Tue, Oct 08, 2024 at 06:02:43PM +0300, Raag Jadav wrote:
>On Thu, Oct 03, 2024 at 03:23:22PM +0300, Raag Jadav wrote:
>> On Tue, Oct 01, 2024 at 02:20:29PM +0200, Michal Wajdeczko wrote:
>> > On 30.09.2024 09:38, Raag Jadav wrote:
>> > >
>> > > +/**
>> > > + * enum drm_wedge_recovery - Recovery method for wedged device in order of
>> > > + * severity. To be set as bit fields in drm_device.wedge_recovery variable.
>> > > + * Drivers can choose to support any one or multiple of them depending on
>> > > + * their needs.
>> > > + */
>> > > +enum drm_wedge_recovery {
>> > > +	/** @DRM_WEDGE_RECOVERY_REBIND: unbind + rebind driver */
>> > > +	DRM_WEDGE_RECOVERY_REBIND,
>> > > +
>> > > +	/** @DRM_WEDGE_RECOVERY_BUS_RESET: unbind + reset bus device + rebind */
>> > > +	DRM_WEDGE_RECOVERY_BUS_RESET,
>> > > +
>> > > +	/** @DRM_WEDGE_RECOVERY_REBOOT: reboot system */
>> > > +	DRM_WEDGE_RECOVERY_REBOOT,
>> > > +
>> > > +	/** @DRM_WEDGE_RECOVERY_MAX: for bounds checking, do not use */
>> > > +	DRM_WEDGE_RECOVERY_MAX
>> > > +};
>> > > +
>> > >  /**
>> > >   * struct drm_device - DRM device structure
>> > >   *
>> > > @@ -317,6 +337,9 @@ struct drm_device {
>> > >  	 * Root directory for debugfs files.
>> > >  	 */
>> > >  	struct dentry *debugfs_root;
>> > > +
>> > > +	/** @wedge_recovery: Supported recovery methods for wedged device */
>> > > +	unsigned long wedge_recovery;
>> >
>> > hmm, so before the driver can ask for a reboot as a recovery method from
>> > wedge it has to somehow add 'reboot' as available method? why it that?
>>
>> It's for consumers to use as fallbacks in case the preferred recovery method
>> (sent along with uevent) don't workout. (patch 2/5)
>
>On second thought...
>
>Lucas, do we have a convincing enough usecase for fallback recovery?
>If <method> were to fail, I would expect there to be even bigger problems
>like kernel crash or unrecoverable hardware failure.
>
>At that point is it worth retrying?

When we were talking about this, I brought up allowing the driver to
report which wedge recovery mechanisms are supported when the
notification is sent. It was not intended as a fallback mechanism.

So if the driver sends a notification with:

	DRM_WEDGE_RECOVERY_REBIND | DRM_WEDGE_RECOVERY_BUS_RESET | DRM_WEDGE_RECOVERY_REBOOT

it means any of these would be suitable, with the first being the option
with the fewest side effects. I don't think we are advising userspace to
use a fallback, just informing it of what the driver/device supports.
Depending on the error, the driver may leave only

	DRM_WEDGE_RECOVERY_REBOOT

That name could actually be DRM_WEDGE_RECOVERY_NONE, because in that
state the driver doesn't really know what can be done to recover.
With that we can drop _MAX and use _NONE for bounds checking. I think
we can also omit it in the notification as it's clear:

	WEDGED
	DRM_WEDGE_RECOVERY_REBIND | DRM_WEDGE_RECOVERY_BUS_RESET

This means the driver can use any of these options to recover

	WEDGED
	DRM_WEDGE_RECOVERY_BUS_RESET

only bus reset would fix it

	WEDGED
	
driver doesn't know anything that could fix it. It may be a soft-reboot,
hard-reboot, firmware flashing etc... We just don't know.
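
To make the idea concrete, here is a rough userspace sketch of turning such
a mask into an event string. Everything in it is an assumption for
illustration: the bit-flag macros (v7 uses a plain enum), the
comma-separated layout, the "none" spelling for an empty mask and the
function name wedge_format_event().

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical bit flags, listed from least to most disruptive. */
#define WEDGE_RECOVERY_REBIND		(1UL << 0)
#define WEDGE_RECOVERY_BUS_RESET	(1UL << 1)

static const struct {
	unsigned long flag;
	const char *name;
} wedge_methods[] = {
	{ WEDGE_RECOVERY_REBIND,	"rebind" },
	{ WEDGE_RECOVERY_BUS_RESET,	"bus-reset" },
};

/*
 * Write "WEDGED=<m1>,<m2>" for every flag set in mask. An empty mask gives
 * "WEDGED=none" (assumed spelling for the "we just don't know" case).
 * Returns 0 on success, -1 if buf is too small.
 */
int wedge_format_event(char *buf, size_t len, unsigned long mask)
{
	size_t used, i;
	int n, first = 1;

	assert(buf != NULL && len > 0);

	n = snprintf(buf, len, "WEDGED=");
	if (n < 0 || (size_t)n >= len)
		return -1;
	used = (size_t)n;

	for (i = 0; i < sizeof(wedge_methods) / sizeof(wedge_methods[0]); i++) {
		if (!(mask & wedge_methods[i].flag))
			continue;

		/* Separate entries with a comma, except before the first. */
		n = snprintf(buf + used, len - used, "%s%s",
			     first ? "" : ",", wedge_methods[i].name);
		if (n < 0 || (size_t)n >= len - used)
			return -1;
		used += (size_t)n;
		first = 0;
	}

	if (first) {
		n = snprintf(buf + used, len - used, "none");
		if (n < 0 || (size_t)n >= len - used)
			return -1;
	}

	return 0;
}
```

A mask of REBIND | BUS_RESET would then produce "WEDGED=rebind,bus-reset",
and a consumer could pick whichever listed method it is able to carry out.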

Lucas De Marchi

Raag Jadav Oct. 11, 2024, 8:47 a.m. UTC | #11

On Thu, Oct 10, 2024 at 08:02:10AM -0500, Lucas De Marchi wrote:
> On Tue, Oct 08, 2024 at 06:02:43PM +0300, Raag Jadav wrote:
> > On Thu, Oct 03, 2024 at 03:23:22PM +0300, Raag Jadav wrote:
> > > On Tue, Oct 01, 2024 at 02:20:29PM +0200, Michal Wajdeczko wrote:
> > > > On 30.09.2024 09:38, Raag Jadav wrote:
> > > > >
> > > > > +/**
> > > > > + * enum drm_wedge_recovery - Recovery method for wedged device in order of
> > > > > + * severity. To be set as bit fields in drm_device.wedge_recovery variable.
> > > > > + * Drivers can choose to support any one or multiple of them depending on
> > > > > + * their needs.
> > > > > + */
> > > > > +enum drm_wedge_recovery {
> > > > > +	/** @DRM_WEDGE_RECOVERY_REBIND: unbind + rebind driver */
> > > > > +	DRM_WEDGE_RECOVERY_REBIND,
> > > > > +
> > > > > +	/** @DRM_WEDGE_RECOVERY_BUS_RESET: unbind + reset bus device + rebind */
> > > > > +	DRM_WEDGE_RECOVERY_BUS_RESET,
> > > > > +
> > > > > +	/** @DRM_WEDGE_RECOVERY_REBOOT: reboot system */
> > > > > +	DRM_WEDGE_RECOVERY_REBOOT,
> > > > > +
> > > > > +	/** @DRM_WEDGE_RECOVERY_MAX: for bounds checking, do not use */
> > > > > +	DRM_WEDGE_RECOVERY_MAX
> > > > > +};
> > > > > +
> > > > >  /**
> > > > >   * struct drm_device - DRM device structure
> > > > >   *
> > > > > @@ -317,6 +337,9 @@ struct drm_device {
> > > > >  	 * Root directory for debugfs files.
> > > > >  	 */
> > > > >  	struct dentry *debugfs_root;
> > > > > +
> > > > > +	/** @wedge_recovery: Supported recovery methods for wedged device */
> > > > > +	unsigned long wedge_recovery;
> > > >
> > > > hmm, so before the driver can ask for a reboot as a recovery method from
> > > > wedge it has to somehow add 'reboot' as available method? why it that?
> > > 
> > > It's for consumers to use as fallbacks in case the preferred recovery method
> > > (sent along with uevent) don't workout. (patch 2/5)
> > 
> > On second thought...
> > 
> > Lucas, do we have a convincing enough usecase for fallback recovery?
> > If <method> were to fail, I would expect there to be even bigger problems
> > like kernel crash or unrecoverable hardware failure.
> > 
> > At that point is it worth retrying?
> 
> when we were talking about this, I brought it up about allowing the
> driver to inform what was the supported wedge recovery mechanisms
> when the notification is sent. Not to be intended as fallback mechanism.
> 
> So if the driver sends a notification with:
> 
> 	DRM_WEDGE_RECOVERY_REBIND | DRM_WEDGE_RECOVERY_BUS_RESET | DRM_WEDGE_RECOVERY_REBOOT
> 
> it means any of these would be suitable, with the first being the option
> with less side-effect. I don't think we are advising userspace to use
> fallback, just informing what the driver/device supports. Depending on
> the error, the driver may leave only
> 
> 	DRM_WEDGE_RECOVERY_REBOOT
> 
> That name could actually be DRM_WEDGE_RECOVERY_NONE. Because at that
> state the driver doesn't really know what can be done to recover.
> With that we can drop _MAX and use _NONE for bounding check. I think
> we can also omit it in the notification as it's clear:
> 
> 	WEDGED
> 	DRM_WEDGE_RECOVERY_REBIND | DRM_WEDGE_RECOVERY_BUS_RESET
> 
> This means the driver can use any of these options to recover
> 
> 	WEDGED
> 	DRM_WEDGE_RECOVERY_BUS_RESET
> 
> only bus reset would fix it
> 
> 	WEDGED
> 	
> driver doesn't know anything that could fix it. It may be a soft-reboot,
> hard-reboot, firmware flashing etc... We just don't know.

With this I think we can drop sysfs.
(Already too many ABIs to deal with)

Raag

Raag Jadav Oct. 17, 2024, 2:47 a.m. UTC | #12

On Mon, Sep 30, 2024 at 01:08:41PM +0530, Raag Jadav wrote:
> Introduce device wedged event, which will notify userspace of wedged
> (hanged/unusable) state of the DRM device through a uevent. This is
> useful especially in cases where the device is no longer operating as
> expected even after a hardware reset and has become unrecoverable from
> driver context.
> 
> Purpose of this implementation is to provide drivers a generic way to
> recover with the help of userspace intervention. Different drivers may
> have different ideas of a "wedged device" depending on their hardware
> implementation, and hence the vendor agnostic nature of the event.
> It is up to the drivers to decide when they see the need for recovery
> and how they want to recover from the available methods.
> 
> Current implementation defines three recovery methods, out of which,
> drivers can choose to support any one or multiple of them. Preferred
> recovery method will be sent in the uevent environment as WEDGED=<method>.
> Userspace consumers (sysadmin) can define udev rules to parse this event
> and take respective action to recover the device.
> 
>     =============== ==================================
>     Recovery method Consumer expectations
>     =============== ==================================
>     rebind          unbind + rebind driver
>     bus-reset       unbind + reset bus device + rebind
>     reboot          reboot system
>     =============== ==================================
> 
> v4: s/drm_dev_wedged/drm_dev_wedged_event
>     Use drm_info() (Jani)
>     Kernel doc adjustment (Aravind)
> v5: Send recovery method with uevent (Lina)
> v6: Access wedge_recovery_opts[] using helper function (Jani)
>     Use snprintf() (Jani)
> v7: Convert recovery helpers into regular functions (Andy, Jani)
>     Aesthetic adjustments (Andy)
>     Handle invalid method cases
> 
> Signed-off-by: Raag Jadav <raag.jadav@intel.com>
> ---

Cc'ing amd, collabora and others as I found semi-related work at

https://lore.kernel.org/dri-devel/20230627132323.115440-1-andrealmeid@igalia.com/
https://lore.kernel.org/amd-gfx/20240725150055.1991893-1-alexander.deucher@amd.com/
https://lore.kernel.org/dri-devel/20241011225906.3789965-3-adrian.larumbe@collabora.com/
https://lore.kernel.org/amd-gfx/CAAxE2A5v_RkZ9ex4=7jiBSKVb22_1FAj0AANBcmKtETt5c3gVA@mail.gmail.com/


Please share feedback about usefulness and adoption of this.
Improvements are welcome.

Raag

>  drivers/gpu/drm/drm_drv.c | 77 +++++++++++++++++++++++++++++++++++++++
>  include/drm/drm_device.h  | 23 ++++++++++++
>  include/drm/drm_drv.h     |  3 ++
>  3 files changed, 103 insertions(+)
> 
> diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
> index ac30b0ec9d93..cfe9600da2ee 100644
> --- a/drivers/gpu/drm/drm_drv.c
> +++ b/drivers/gpu/drm/drm_drv.c
> @@ -26,6 +26,8 @@
>   * DEALINGS IN THE SOFTWARE.
>   */
>  
> +#include <linux/array_size.h>
> +#include <linux/build_bug.h>
>  #include <linux/debugfs.h>
>  #include <linux/fs.h>
>  #include <linux/module.h>
> @@ -33,6 +35,7 @@
>  #include <linux/mount.h>
>  #include <linux/pseudo_fs.h>
>  #include <linux/slab.h>
> +#include <linux/sprintf.h>
>  #include <linux/srcu.h>
>  #include <linux/xarray.h>
>  
> @@ -70,6 +73,42 @@ static struct dentry *drm_debugfs_root;
>  
>  DEFINE_STATIC_SRCU(drm_unplug_srcu);
>  
> +/*
> + * Available recovery methods for wedged device. To be sent along with device
> + * wedged uevent.
> + */
> +static const char *const drm_wedge_recovery_opts[] = {
> +	[DRM_WEDGE_RECOVERY_REBIND] = "rebind",
> +	[DRM_WEDGE_RECOVERY_BUS_RESET] = "bus-reset",
> +	[DRM_WEDGE_RECOVERY_REBOOT] = "reboot",
> +};
> +
> +static bool drm_wedge_recovery_is_valid(enum drm_wedge_recovery method)
> +{
> +	static_assert(ARRAY_SIZE(drm_wedge_recovery_opts) == DRM_WEDGE_RECOVERY_MAX);
> +
> +	return method >= DRM_WEDGE_RECOVERY_REBIND && method < DRM_WEDGE_RECOVERY_MAX;
> +}
> +
> +/**
> + * drm_wedge_recovery_name - provide wedge recovery name
> + * @method: method to be used for recovery
> + *
> + * This validates wedge recovery @method against the available ones in
> + * drm_wedge_recovery_opts[] and provides respective recovery name in string
> + * format if found valid.
> + *
> + * Returns: pointer to const recovery string on success, NULL otherwise.
> + */
> +const char *drm_wedge_recovery_name(enum drm_wedge_recovery method)
> +{
> +	if (drm_wedge_recovery_is_valid(method))
> +		return drm_wedge_recovery_opts[method];
> +
> +	return NULL;
> +}
> +EXPORT_SYMBOL(drm_wedge_recovery_name);
> +
>  /*
>   * DRM Minors
>   * A DRM device can provide several char-dev interfaces on the DRM-Major. Each
> @@ -497,6 +536,44 @@ void drm_dev_unplug(struct drm_device *dev)
>  }
>  EXPORT_SYMBOL(drm_dev_unplug);
>  
> +/**
> + * drm_dev_wedged_event - generate a device wedged uevent
> + * @dev: DRM device
> + * @method: method to be used for recovery
> + *
> + * This generates a device wedged uevent for the DRM device specified by @dev.
> + * Recovery @method from drm_wedge_recovery_opts[] (if supprted by the device)
> + * is sent in the uevent environment as WEDGED=<method>, on the basis of which,
> + * userspace may take respective action to recover the device.
> + *
> + * Returns: 0 on success, or negative error code otherwise.
> + */
> +int drm_dev_wedged_event(struct drm_device *dev, enum drm_wedge_recovery method)
> +{
> +	/* Event string length up to 16+ characters with available methods */
> +	char event_string[32] = {};
> +	char *envp[] = { event_string, NULL };
> +	const char *recovery;
> +
> +	recovery = drm_wedge_recovery_name(method);
> +	if (!recovery) {
> +		drm_err(dev, "device wedged, invalid recovery method %d\n", method);
> +		return -EINVAL;
> +	}
> +
> +	if (!test_bit(method, &dev->wedge_recovery)) {
> +		drm_err(dev, "device wedged, %s based recovery not supported\n",
> +			drm_wedge_recovery_name(method));
> +		return -EOPNOTSUPP;
> +	}
> +
> +	snprintf(event_string, sizeof(event_string), "WEDGED=%s", recovery);
> +
> +	drm_info(dev, "device wedged, generating uevent for %s based recovery\n", recovery);
> +	return kobject_uevent_env(&dev->primary->kdev->kobj, KOBJ_CHANGE, envp);
> +}
> +EXPORT_SYMBOL(drm_dev_wedged_event);
> +
>  /*
>   * DRM internal mount
>   * We want to be able to allocate our own "struct address_space" to control
> diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
> index c91f87b5242d..fed6f20e52fb 100644
> --- a/include/drm/drm_device.h
> +++ b/include/drm/drm_device.h
> @@ -40,6 +40,26 @@ enum switch_power_state {
>  	DRM_SWITCH_POWER_DYNAMIC_OFF = 3,
>  };
>  
> +/**
> + * enum drm_wedge_recovery - Recovery method for wedged device in order of
> + * severity. To be set as bit fields in drm_device.wedge_recovery variable.
> + * Drivers can choose to support any one or multiple of them depending on
> + * their needs.
> + */
> +enum drm_wedge_recovery {
> +	/** @DRM_WEDGE_RECOVERY_REBIND: unbind + rebind driver */
> +	DRM_WEDGE_RECOVERY_REBIND,
> +
> +	/** @DRM_WEDGE_RECOVERY_BUS_RESET: unbind + reset bus device + rebind */
> +	DRM_WEDGE_RECOVERY_BUS_RESET,
> +
> +	/** @DRM_WEDGE_RECOVERY_REBOOT: reboot system */
> +	DRM_WEDGE_RECOVERY_REBOOT,
> +
> +	/** @DRM_WEDGE_RECOVERY_MAX: for bounds checking, do not use */
> +	DRM_WEDGE_RECOVERY_MAX
> +};
> +
>  /**
>   * struct drm_device - DRM device structure
>   *
> @@ -317,6 +337,9 @@ struct drm_device {
>  	 * Root directory for debugfs files.
>  	 */
>  	struct dentry *debugfs_root;
> +
> +	/** @wedge_recovery: Supported recovery methods for wedged device */
> +	unsigned long wedge_recovery;
>  };
>  
>  #endif
> diff --git a/include/drm/drm_drv.h b/include/drm/drm_drv.h
> index 02ea4e3248fd..d8dbc77010b0 100644
> --- a/include/drm/drm_drv.h
> +++ b/include/drm/drm_drv.h
> @@ -462,6 +462,9 @@ bool drm_dev_enter(struct drm_device *dev, int *idx);
>  void drm_dev_exit(int idx);
>  void drm_dev_unplug(struct drm_device *dev);
>  
> +const char *drm_wedge_recovery_name(enum drm_wedge_recovery method);
> +int drm_dev_wedged_event(struct drm_device *dev, enum drm_wedge_recovery method);
> +
>  /**
>   * drm_dev_is_unplugged - is a DRM device unplugged
>   * @dev: DRM device
> -- 
> 2.34.1
>

Christian König Oct. 17, 2024, 7:59 a.m. UTC | #13

Am 17.10.24 um 04:47 schrieb Raag Jadav:
> On Mon, Sep 30, 2024 at 01:08:41PM +0530, Raag Jadav wrote:
>> Introduce device wedged event, which will notify userspace of wedged
>> (hanged/unusable) state of the DRM device through a uevent. This is
>> useful especially in cases where the device is no longer operating as
>> expected even after a hardware reset and has become unrecoverable from
>> driver context.

Well, "introduce" is probably the wrong wording, since i915 already has that
and amdgpu looked into it but never upstreamed the support.

I would rather say standardize.

>>
>> Purpose of this implementation is to provide drivers a generic way to
>> recover with the help of userspace intervention. Different drivers may
>> have different ideas of a "wedged device" depending on their hardware
>> implementation, and hence the vendor agnostic nature of the event.
>> It is up to the drivers to decide when they see the need for recovery
>> and how they want to recover from the available methods.
>>
>> Current implementation defines three recovery methods, out of which,
>> drivers can choose to support any one or multiple of them. Preferred
>> recovery method will be sent in the uevent environment as WEDGED=<method>.
>> Userspace consumers (sysadmin) can define udev rules to parse this event
>> and take respective action to recover the device.
>>
>>      =============== ==================================
>>      Recovery method Consumer expectations
>>      =============== ==================================
>>      rebind          unbind + rebind driver
>>      bus-reset       unbind + reset bus device + rebind
>>      reboot          reboot system
>>      =============== ==================================

Well that sounds like userspace would need to be involved in recovery.

That in turn is a complete no-go since we at least need to signal all
dma_fences to unblock the kernel. In other words, things like bus reset
need to happen inside the kernel and *not* in userspace.

What we can do is to signal to userspace: Hey, a bus reset of device X
happened, maybe restart the container, daemon, or whatever service was
using this device.

Regards,
Christian.

>>
>> v4: s/drm_dev_wedged/drm_dev_wedged_event
>>      Use drm_info() (Jani)
>>      Kernel doc adjustment (Aravind)
>> v5: Send recovery method with uevent (Lina)
>> v6: Access wedge_recovery_opts[] using helper function (Jani)
>>      Use snprintf() (Jani)
>> v7: Convert recovery helpers into regular functions (Andy, Jani)
>>      Aesthetic adjustments (Andy)
>>      Handle invalid method cases
>>
>> Signed-off-by: Raag Jadav <raag.jadav@intel.com>
>> ---
> Cc'ing amd, collabora and others as I found semi-related work at
>
> https://lore.kernel.org/dri-devel/20230627132323.115440-1-andrealmeid@igalia.com/
> https://lore.kernel.org/amd-gfx/20240725150055.1991893-1-alexander.deucher@amd.com/
> https://lore.kernel.org/dri-devel/20241011225906.3789965-3-adrian.larumbe@collabora.com/
> https://lore.kernel.org/amd-gfx/CAAxE2A5v_RkZ9ex4=7jiBSKVb22_1FAj0AANBcmKtETt5c3gVA@mail.gmail.com/
>
>
> Please share feedback about usefulness and adoption of this.
> Improvements are welcome.
>
> Raag
>
>>   drivers/gpu/drm/drm_drv.c | 77 +++++++++++++++++++++++++++++++++++++++
>>   include/drm/drm_device.h  | 23 ++++++++++++
>>   include/drm/drm_drv.h     |  3 ++
>>   3 files changed, 103 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
>> index ac30b0ec9d93..cfe9600da2ee 100644
>> --- a/drivers/gpu/drm/drm_drv.c
>> +++ b/drivers/gpu/drm/drm_drv.c
>> @@ -26,6 +26,8 @@
>>    * DEALINGS IN THE SOFTWARE.
>>    */
>>   
>> +#include <linux/array_size.h>
>> +#include <linux/build_bug.h>
>>   #include <linux/debugfs.h>
>>   #include <linux/fs.h>
>>   #include <linux/module.h>
>> @@ -33,6 +35,7 @@
>>   #include <linux/mount.h>
>>   #include <linux/pseudo_fs.h>
>>   #include <linux/slab.h>
>> +#include <linux/sprintf.h>
>>   #include <linux/srcu.h>
>>   #include <linux/xarray.h>
>>   
>> @@ -70,6 +73,42 @@ static struct dentry *drm_debugfs_root;
>>   
>>   DEFINE_STATIC_SRCU(drm_unplug_srcu);
>>   
>> +/*
>> + * Available recovery methods for wedged device. To be sent along with device
>> + * wedged uevent.
>> + */
>> +static const char *const drm_wedge_recovery_opts[] = {
>> +	[DRM_WEDGE_RECOVERY_REBIND] = "rebind",
>> +	[DRM_WEDGE_RECOVERY_BUS_RESET] = "bus-reset",
>> +	[DRM_WEDGE_RECOVERY_REBOOT] = "reboot",
>> +};
>> +
>> +static bool drm_wedge_recovery_is_valid(enum drm_wedge_recovery method)
>> +{
>> +	static_assert(ARRAY_SIZE(drm_wedge_recovery_opts) == DRM_WEDGE_RECOVERY_MAX);
>> +
>> +	return method >= DRM_WEDGE_RECOVERY_REBIND && method < DRM_WEDGE_RECOVERY_MAX;
>> +}
>> +
>> +/**
>> + * drm_wedge_recovery_name - provide wedge recovery name
>> + * @method: method to be used for recovery
>> + *
>> + * This validates wedge recovery @method against the available ones in
>> + * drm_wedge_recovery_opts[] and provides respective recovery name in string
>> + * format if found valid.
>> + *
>> + * Returns: pointer to const recovery string on success, NULL otherwise.
>> + */
>> +const char *drm_wedge_recovery_name(enum drm_wedge_recovery method)
>> +{
>> +	if (drm_wedge_recovery_is_valid(method))
>> +		return drm_wedge_recovery_opts[method];
>> +
>> +	return NULL;
>> +}
>> +EXPORT_SYMBOL(drm_wedge_recovery_name);
>> +
>>   /*
>>    * DRM Minors
>>    * A DRM device can provide several char-dev interfaces on the DRM-Major. Each
>> @@ -497,6 +536,44 @@ void drm_dev_unplug(struct drm_device *dev)
>>   }
>>   EXPORT_SYMBOL(drm_dev_unplug);
>>   
>> +/**
>> + * drm_dev_wedged_event - generate a device wedged uevent
>> + * @dev: DRM device
>> + * @method: method to be used for recovery
>> + *
>> + * This generates a device wedged uevent for the DRM device specified by @dev.
>> + * Recovery @method from drm_wedge_recovery_opts[] (if supprted by the device)
>> + * is sent in the uevent environment as WEDGED=<method>, on the basis of which,
>> + * userspace may take respective action to recover the device.
>> + *
>> + * Returns: 0 on success, or negative error code otherwise.
>> + */
>> +int drm_dev_wedged_event(struct drm_device *dev, enum drm_wedge_recovery method)
>> +{
>> +	/* Event string length up to 16+ characters with available methods */
>> +	char event_string[32] = {};
>> +	char *envp[] = { event_string, NULL };
>> +	const char *recovery;
>> +
>> +	recovery = drm_wedge_recovery_name(method);
>> +	if (!recovery) {
>> +		drm_err(dev, "device wedged, invalid recovery method %d\n", method);
>> +		return -EINVAL;
>> +	}
>> +
>> +	if (!test_bit(method, &dev->wedge_recovery)) {
>> +		drm_err(dev, "device wedged, %s based recovery not supported\n",
>> +			drm_wedge_recovery_name(method));
>> +		return -EOPNOTSUPP;
>> +	}
>> +
>> +	snprintf(event_string, sizeof(event_string), "WEDGED=%s", recovery);
>> +
>> +	drm_info(dev, "device wedged, generating uevent for %s based recovery\n", recovery);
>> +	return kobject_uevent_env(&dev->primary->kdev->kobj, KOBJ_CHANGE, envp);
>> +}
>> +EXPORT_SYMBOL(drm_dev_wedged_event);
>> +
>>   /*
>>    * DRM internal mount
>>    * We want to be able to allocate our own "struct address_space" to control
>> diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
>> index c91f87b5242d..fed6f20e52fb 100644
>> --- a/include/drm/drm_device.h
>> +++ b/include/drm/drm_device.h
>> @@ -40,6 +40,26 @@ enum switch_power_state {
>>   	DRM_SWITCH_POWER_DYNAMIC_OFF = 3,
>>   };
>>   
>> +/**
>> + * enum drm_wedge_recovery - Recovery method for wedged device in order of
>> + * severity. To be set as bit fields in drm_device.wedge_recovery variable.
>> + * Drivers can choose to support any one or multiple of them depending on
>> + * their needs.
>> + */
>> +enum drm_wedge_recovery {
>> +	/** @DRM_WEDGE_RECOVERY_REBIND: unbind + rebind driver */
>> +	DRM_WEDGE_RECOVERY_REBIND,
>> +
>> +	/** @DRM_WEDGE_RECOVERY_BUS_RESET: unbind + reset bus device + rebind */
>> +	DRM_WEDGE_RECOVERY_BUS_RESET,
>> +
>> +	/** @DRM_WEDGE_RECOVERY_REBOOT: reboot system */
>> +	DRM_WEDGE_RECOVERY_REBOOT,
>> +
>> +	/** @DRM_WEDGE_RECOVERY_MAX: for bounds checking, do not use */
>> +	DRM_WEDGE_RECOVERY_MAX
>> +};
>> +
>>   /**
>>    * struct drm_device - DRM device structure
>>    *
>> @@ -317,6 +337,9 @@ struct drm_device {
>>   	 * Root directory for debugfs files.
>>   	 */
>>   	struct dentry *debugfs_root;
>> +
>> +	/** @wedge_recovery: Supported recovery methods for wedged device */
>> +	unsigned long wedge_recovery;
>>   };
>>   
>>   #endif
>> diff --git a/include/drm/drm_drv.h b/include/drm/drm_drv.h
>> index 02ea4e3248fd..d8dbc77010b0 100644
>> --- a/include/drm/drm_drv.h
>> +++ b/include/drm/drm_drv.h
>> @@ -462,6 +462,9 @@ bool drm_dev_enter(struct drm_device *dev, int *idx);
>>   void drm_dev_exit(int idx);
>>   void drm_dev_unplug(struct drm_device *dev);
>>   
>> +const char *drm_wedge_recovery_name(enum drm_wedge_recovery method);
>> +int drm_dev_wedged_event(struct drm_device *dev, enum drm_wedge_recovery method);
>> +
>>   /**
>>    * drm_dev_is_unplugged - is a DRM device unplugged
>>    * @dev: DRM device
>> -- 
>> 2.34.1
>>

Rodrigo Vivi Oct. 17, 2024, 4:43 p.m. UTC | #14

On Thu, Oct 17, 2024 at 09:59:10AM +0200, Christian König wrote:
> Am 17.10.24 um 04:47 schrieb Raag Jadav:
> > On Mon, Sep 30, 2024 at 01:08:41PM +0530, Raag Jadav wrote:
> > > Introduce device wedged event, which will notify userspace of wedged
> > > (hanged/unusable) state of the DRM device through a uevent. This is
> > > useful especially in cases where the device is no longer operating as
> > > expected even after a hardware reset and has become unrecoverable from
> > > driver context.
> 
> Well introduce is probably the wrong wording since i915 already has that and
> amdgpu looked into it but never upstreamed the support.

In i915 we have the reset and error uevents, but not one specific to 'wedge'.
This would indeed be a new one.

> 
> I would rather say standardize.
> 
> > > 
> > > Purpose of this implementation is to provide drivers a generic way to
> > > recover with the help of userspace intervention. Different drivers may
> > > have different ideas of a "wedged device" depending on their hardware
> > > implementation, and hence the vendor agnostic nature of the event.
> > > It is up to the drivers to decide when they see the need for recovery
> > > and how they want to recover from the available methods.
> > > 
> > > Current implementation defines three recovery methods, out of which,
> > > drivers can choose to support any one or multiple of them. Preferred
> > > recovery method will be sent in the uevent environment as WEDGED=<method>.
> > > Userspace consumers (sysadmin) can define udev rules to parse this event
> > > and take respective action to recover the device.
> > > 
> > >      =============== ==================================
> > >      Recovery method Consumer expectations
> > >      =============== ==================================
> > >      rebind          unbind + rebind driver
> > >      bus-reset       unbind + reset bus device + rebind
> > >      reboot          reboot system
> > >      =============== ==================================
> 
> Well that sounds like userspace would need to be involved in recovery.
> 
> That in turn is a complete no-go since we at least need to signal all
> dma_fences to unblock the kernel. In other words things like bus reset needs
> to happen inside the kernel and *not* in userspace.
> 
> What we can do is to signal to userspace: Hey a bus reset of device X
> happened, maybe restart container, daemon, whatever service which was using
> this device.

Well, when we declare a device 'wedged' it is because we don't want to take
any drastic measures inside the kernel and want to leave it in a protected
and unusable state, in such a way that users wouldn't lose display, for
instance, or at least so that the device is left in a debuggable state.

Then, the instructions here are to tell what could possibly be attempted
from userspace to get the device to a usable state.

The 'wedge' mode (the one emitting this uevent) needs to be responsible
for signaling all the fences and everything needed for a clean unbind
and whatever next step might be indicated to userspace.

That should already be part of any wedged mode, regardless of the uevent
to inform the userspace here.
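
As an illustration of the consumer side, a small userspace sketch that maps
the WEDGED=<method> string from the uevent environment to one of the
actions in the table above. The enum wedge_action and the function
wedge_action_from_env() are hypothetical names; how the environment string
reaches the program (udev rule, netlink listener, ...) is left out.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical consumer-side actions, mirroring the recovery method table. */
enum wedge_action {
	WEDGE_ACT_IGNORE,	/* not a wedged event, or unknown method */
	WEDGE_ACT_REBIND,	/* unbind + rebind driver */
	WEDGE_ACT_BUS_RESET,	/* unbind + reset bus device + rebind */
	WEDGE_ACT_REBOOT,	/* reboot system */
};

/* Map one uevent environment entry to the action the consumer should take. */
enum wedge_action wedge_action_from_env(const char *env)
{
	static const char prefix[] = "WEDGED=";
	const char *method;

	assert(env != NULL);

	/* Entries that do not start with "WEDGED=" are not for us. */
	if (strncmp(env, prefix, sizeof(prefix) - 1) != 0)
		return WEDGE_ACT_IGNORE;

	method = env + sizeof(prefix) - 1;

	if (strcmp(method, "rebind") == 0)
		return WEDGE_ACT_REBIND;
	if (strcmp(method, "bus-reset") == 0)
		return WEDGE_ACT_BUS_RESET;
	if (strcmp(method, "reboot") == 0)
		return WEDGE_ACT_REBOOT;

	/* Unknown methods are ignored rather than guessed at. */
	return WEDGE_ACT_IGNORE;
}
```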

> 
> Regards,
> Christian.
> 
> > > 
> > > v4: s/drm_dev_wedged/drm_dev_wedged_event
> > >      Use drm_info() (Jani)
> > >      Kernel doc adjustment (Aravind)
> > > v5: Send recovery method with uevent (Lina)
> > > v6: Access wedge_recovery_opts[] using helper function (Jani)
> > >      Use snprintf() (Jani)
> > > v7: Convert recovery helpers into regular functions (Andy, Jani)
> > >      Aesthetic adjustments (Andy)
> > >      Handle invalid method cases
> > > 
> > > Signed-off-by: Raag Jadav <raag.jadav@intel.com>
> > > ---
> > Cc'ing amd, collabora and others as I found semi-related work at
> > 
> > https://lore.kernel.org/dri-devel/20230627132323.115440-1-andrealmeid@igalia.com/
> > https://lore.kernel.org/amd-gfx/20240725150055.1991893-1-alexander.deucher@amd.com/
> > https://lore.kernel.org/dri-devel/20241011225906.3789965-3-adrian.larumbe@collabora.com/
> > https://lore.kernel.org/amd-gfx/CAAxE2A5v_RkZ9ex4=7jiBSKVb22_1FAj0AANBcmKtETt5c3gVA@mail.gmail.com/
> > 
> > 
> > Please share feedback about usefulness and adoption of this.
> > Improvements are welcome.
> > 
> > Raag
> > 
> > >   drivers/gpu/drm/drm_drv.c | 77 +++++++++++++++++++++++++++++++++++++++
> > >   include/drm/drm_device.h  | 23 ++++++++++++
> > >   include/drm/drm_drv.h     |  3 ++
> > >   3 files changed, 103 insertions(+)
> > > 
> > > diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
> > > index ac30b0ec9d93..cfe9600da2ee 100644
> > > --- a/drivers/gpu/drm/drm_drv.c
> > > +++ b/drivers/gpu/drm/drm_drv.c
> > > @@ -26,6 +26,8 @@
> > >    * DEALINGS IN THE SOFTWARE.
> > >    */
> > > +#include <linux/array_size.h>
> > > +#include <linux/build_bug.h>
> > >   #include <linux/debugfs.h>
> > >   #include <linux/fs.h>
> > >   #include <linux/module.h>
> > > @@ -33,6 +35,7 @@
> > >   #include <linux/mount.h>
> > >   #include <linux/pseudo_fs.h>
> > >   #include <linux/slab.h>
> > > +#include <linux/sprintf.h>
> > >   #include <linux/srcu.h>
> > >   #include <linux/xarray.h>
> > > @@ -70,6 +73,42 @@ static struct dentry *drm_debugfs_root;
> > >   DEFINE_STATIC_SRCU(drm_unplug_srcu);
> > > +/*
> > > + * Available recovery methods for wedged device. To be sent along with device
> > > + * wedged uevent.
> > > + */
> > > +static const char *const drm_wedge_recovery_opts[] = {
> > > +	[DRM_WEDGE_RECOVERY_REBIND] = "rebind",
> > > +	[DRM_WEDGE_RECOVERY_BUS_RESET] = "bus-reset",
> > > +	[DRM_WEDGE_RECOVERY_REBOOT] = "reboot",
> > > +};
> > > +
> > > +static bool drm_wedge_recovery_is_valid(enum drm_wedge_recovery method)
> > > +{
> > > +	static_assert(ARRAY_SIZE(drm_wedge_recovery_opts) == DRM_WEDGE_RECOVERY_MAX);
> > > +
> > > +	return method >= DRM_WEDGE_RECOVERY_REBIND && method < DRM_WEDGE_RECOVERY_MAX;
> > > +}
> > > +
> > > +/**
> > > + * drm_wedge_recovery_name - provide wedge recovery name
> > > + * @method: method to be used for recovery
> > > + *
> > > + * This validates wedge recovery @method against the available ones in
> > > + * drm_wedge_recovery_opts[] and provides respective recovery name in string
> > > + * format if found valid.
> > > + *
> > > + * Returns: pointer to const recovery string on success, NULL otherwise.
> > > + */
> > > +const char *drm_wedge_recovery_name(enum drm_wedge_recovery method)
> > > +{
> > > +	if (drm_wedge_recovery_is_valid(method))
> > > +		return drm_wedge_recovery_opts[method];
> > > +
> > > +	return NULL;
> > > +}
> > > +EXPORT_SYMBOL(drm_wedge_recovery_name);
> > > +
> > >   /*
> > >    * DRM Minors
> > >    * A DRM device can provide several char-dev interfaces on the DRM-Major. Each
> > > @@ -497,6 +536,44 @@ void drm_dev_unplug(struct drm_device *dev)
> > >   }
> > >   EXPORT_SYMBOL(drm_dev_unplug);
> > > +/**
> > > + * drm_dev_wedged_event - generate a device wedged uevent
> > > + * @dev: DRM device
> > > + * @method: method to be used for recovery
> > > + *
> > > + * This generates a device wedged uevent for the DRM device specified by @dev.
> > > + * Recovery @method from drm_wedge_recovery_opts[] (if supported by the device)
> > > + * is sent in the uevent environment as WEDGED=<method>, on the basis of which,
> > > + * userspace may take respective action to recover the device.
> > > + *
> > > + * Returns: 0 on success, or negative error code otherwise.
> > > + */
> > > +int drm_dev_wedged_event(struct drm_device *dev, enum drm_wedge_recovery method)
> > > +{
> > > +	/* Event string length up to 16+ characters with available methods */
> > > +	char event_string[32] = {};
> > > +	char *envp[] = { event_string, NULL };
> > > +	const char *recovery;
> > > +
> > > +	recovery = drm_wedge_recovery_name(method);
> > > +	if (!recovery) {
> > > +		drm_err(dev, "device wedged, invalid recovery method %d\n", method);
> > > +		return -EINVAL;
> > > +	}
> > > +
> > > +	if (!test_bit(method, &dev->wedge_recovery)) {
> > > +		drm_err(dev, "device wedged, %s based recovery not supported\n",
> > > +			drm_wedge_recovery_name(method));
> > > +		return -EOPNOTSUPP;
> > > +	}
> > > +
> > > +	snprintf(event_string, sizeof(event_string), "WEDGED=%s", recovery);
> > > +
> > > +	drm_info(dev, "device wedged, generating uevent for %s based recovery\n", recovery);
> > > +	return kobject_uevent_env(&dev->primary->kdev->kobj, KOBJ_CHANGE, envp);
> > > +}
> > > +EXPORT_SYMBOL(drm_dev_wedged_event);
> > > +
> > >   /*
> > >    * DRM internal mount
> > >    * We want to be able to allocate our own "struct address_space" to control
> > > diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
> > > index c91f87b5242d..fed6f20e52fb 100644
> > > --- a/include/drm/drm_device.h
> > > +++ b/include/drm/drm_device.h
> > > @@ -40,6 +40,26 @@ enum switch_power_state {
> > >   	DRM_SWITCH_POWER_DYNAMIC_OFF = 3,
> > >   };
> > > +/**
> > > + * enum drm_wedge_recovery - Recovery method for wedged device in order of
> > > + * severity. To be set as bit fields in drm_device.wedge_recovery variable.
> > > + * Drivers can choose to support any one or multiple of them depending on
> > > + * their needs.
> > > + */
> > > +enum drm_wedge_recovery {
> > > +	/** @DRM_WEDGE_RECOVERY_REBIND: unbind + rebind driver */
> > > +	DRM_WEDGE_RECOVERY_REBIND,
> > > +
> > > +	/** @DRM_WEDGE_RECOVERY_BUS_RESET: unbind + reset bus device + rebind */
> > > +	DRM_WEDGE_RECOVERY_BUS_RESET,
> > > +
> > > +	/** @DRM_WEDGE_RECOVERY_REBOOT: reboot system */
> > > +	DRM_WEDGE_RECOVERY_REBOOT,
> > > +
> > > +	/** @DRM_WEDGE_RECOVERY_MAX: for bounds checking, do not use */
> > > +	DRM_WEDGE_RECOVERY_MAX
> > > +};
> > > +
> > >   /**
> > >    * struct drm_device - DRM device structure
> > >    *
> > > @@ -317,6 +337,9 @@ struct drm_device {
> > >   	 * Root directory for debugfs files.
> > >   	 */
> > >   	struct dentry *debugfs_root;
> > > +
> > > +	/** @wedge_recovery: Supported recovery methods for wedged device */
> > > +	unsigned long wedge_recovery;
> > >   };
> > >   #endif
> > > diff --git a/include/drm/drm_drv.h b/include/drm/drm_drv.h
> > > index 02ea4e3248fd..d8dbc77010b0 100644
> > > --- a/include/drm/drm_drv.h
> > > +++ b/include/drm/drm_drv.h
> > > @@ -462,6 +462,9 @@ bool drm_dev_enter(struct drm_device *dev, int *idx);
> > >   void drm_dev_exit(int idx);
> > >   void drm_dev_unplug(struct drm_device *dev);
> > > +const char *drm_wedge_recovery_name(enum drm_wedge_recovery method);
> > > +int drm_dev_wedged_event(struct drm_device *dev, enum drm_wedge_recovery method);
> > > +
> > >   /**
> > >    * drm_dev_is_unplugged - is a DRM device unplugged
> > >    * @dev: DRM device
> > > -- 
> > > 2.34.1
> > > 
>
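
For readers following the quoted patch, here is a minimal userspace sketch of
how the event string appears to be built: validate the method, check that the
device advertises it, then format WEDGED=<method>. It only models the logic of
drm_wedge_recovery_name() and drm_dev_wedged_event() as quoted above. The
names wedge_recovery_name() and wedge_event_string() are hypothetical, the
bitmask test stands in for test_bit(), and the -ENOSPC guard is an addition
that is not in the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Userspace stand-in for enum drm_wedge_recovery from the quoted patch. */
enum wedge_recovery {
	WEDGE_RECOVERY_REBIND,
	WEDGE_RECOVERY_BUS_RESET,
	WEDGE_RECOVERY_REBOOT,
	WEDGE_RECOVERY_MAX	/* for bounds checking only */
};

/* Same strings as drm_wedge_recovery_opts[] in the patch. */
static const char *const wedge_recovery_opts[] = {
	[WEDGE_RECOVERY_REBIND] = "rebind",
	[WEDGE_RECOVERY_BUS_RESET] = "bus-reset",
	[WEDGE_RECOVERY_REBOOT] = "reboot",
};

/* Keep the string table and the enum in sync at compile time. */
_Static_assert(sizeof(wedge_recovery_opts) / sizeof(wedge_recovery_opts[0]) ==
	       WEDGE_RECOVERY_MAX, "recovery table out of sync with enum");

/* Hypothetical helper: returns the method name, or NULL if out of range. */
const char *wedge_recovery_name(int method)
{
	if (method >= WEDGE_RECOVERY_REBIND && method < WEDGE_RECOVERY_MAX)
		return wedge_recovery_opts[method];

	return NULL;
}

/*
 * Hypothetical helper: fills buf with "WEDGED=<method>".
 * 'supported' is a bitmask of methods the device advertises, modelling
 * drm_device.wedge_recovery. Returns 0 or a negative errno value.
 */
int wedge_event_string(unsigned long supported, int method,
		       char *buf, size_t len)
{
	const char *name = wedge_recovery_name(method);

	assert(buf != NULL);

	if (!name)
		return -EINVAL;		/* invalid recovery method */

	if (!(supported & (1UL << method)))
		return -EOPNOTSUPP;	/* method not supported by the device */

	/* Assumption: reject truncation explicitly; the patch uses char[32]. */
	if (strlen("WEDGED=") + strlen(name) >= len)
		return -ENOSPC;

	snprintf(buf, len, "WEDGED=%s", name);
	return 0;
}
```

In the real patch the last step hands the string to kobject_uevent_env() with
KOBJ_CHANGE; that part has no userspace equivalent and is left out here.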

André Almeida Oct. 17, 2024, 7:16 p.m. UTC | #15

Hi Raag,

Em 30/09/2024 04:38, Raag Jadav escreveu:
> Introduce device wedged event, which will notify userspace of wedged
> (hanged/unusable) state of the DRM device through a uevent. This is
> useful especially in cases where the device is no longer operating as
> expected even after a hardware reset and has become unrecoverable from
> driver context.
> 
> Purpose of this implementation is to provide drivers a generic way to
> recover with the help of userspace intervention. Different drivers may
> have different ideas of a "wedged device" depending on their hardware
> implementation, and hence the vendor agnostic nature of the event.
> It is up to the drivers to decide when they see the need for recovery
> and how they want to recover from the available methods.
> 
> Current implementation defines three recovery methods, out of which,
> drivers can choose to support any one or multiple of them. Preferred
> recovery method will be sent in the uevent environment as WEDGED=<method>.
> Userspace consumers (sysadmin) can define udev rules to parse this event
> and take respective action to recover the device.
> 
>      =============== ==================================
>      Recovery method Consumer expectations
>      =============== ==================================
>      rebind          unbind + rebind driver
>      bus-reset       unbind + reset bus device + rebind
>      reboot          reboot system
>      =============== ==================================
> 
>

I proposed something similar in the past: 
https://lore.kernel.org/dri-devel/20221125175203.52481-1-andrealmeid@igalia.com/

The motivation was that amdgpu was getting stuck after every GPU reset,
and there was just a black screen. The uevent would then trigger a
daemon to reset the compositor and get things back together. As you
can see in my thread, the feature was blocked in favor of getting better
overall GPU reset from the kernel side.

What kind of scenarios make i915/xe need userspace involvement? I tested
a bunch of resets in i915 but never managed to get the driver stuck.

For the bus-reset, amdgpu does that too, but it doesn't require 
userspace intervention.

Christian König Oct. 18, 2024, 10:58 a.m. UTC | #16

Am 17.10.24 um 18:43 schrieb Rodrigo Vivi:
> On Thu, Oct 17, 2024 at 09:59:10AM +0200, Christian König wrote:
>>>> Purpose of this implementation is to provide drivers a generic way to
>>>> recover with the help of userspace intervention. Different drivers may
>>>> have different ideas of a "wedged device" depending on their hardware
>>>> implementation, and hence the vendor agnostic nature of the event.
>>>> It is up to the drivers to decide when they see the need for recovery
>>>> and how they want to recover from the available methods.
>>>>
>>>> Current implementation defines three recovery methods, out of which,
>>>> drivers can choose to support any one or multiple of them. Preferred
>>>> recovery method will be sent in the uevent environment as WEDGED=<method>.
>>>> Userspace consumers (sysadmin) can define udev rules to parse this event
>>>> and take respective action to recover the device.
>>>>
>>>>       =============== ==================================
>>>>       Recovery method Consumer expectations
>>>>       =============== ==================================
>>>>       rebind          unbind + rebind driver
>>>>       bus-reset       unbind + reset bus device + rebind
>>>>       reboot          reboot system
>>>>       =============== ==================================
>> Well that sounds like userspace would need to be involved in recovery.
>>
>> That in turn is a complete no-go since we at least need to signal all
>> dma_fences to unblock the kernel. In other words things like bus reset needs
>> to happen inside the kernel and *not* in userspace.
>>
>> What we can do is to signal to userspace: Hey a bus reset of device X
>> happened, maybe restart container, daemon, whatever service which was using
>> this device.
> Well, when we declare device 'wedged' it is because we don't want to take
> any drastic measures inside the kernel and want to leave it in a protected
> and unusable state. In a way that users wouldn't lose display for instance,
> or at least the device is in a debugable state.

Uff, that needs to be very very well documented or otherwise the whole 
approach is an absolutely clear NAK from my side as DMA-buf maintainer.

>
> Then, the instructions here is to tell what could possibly be attempted
> from userspace to get the device to an usable state.
>
> The 'wedge' mode (the one emiting this uevent) needs to be responsible
> for signaling all the fences and everything needed for a clean unbind
> and whatever next step might be indicated to userspace.
>
> That should already be part of any wedged mode, regardless the uevent
> to inform the userspace here.

You need to approach that from a different side. With the current patch 
set you are ignoring documented mandatory driver behavior as far as I 
can see.

So first of all describe in the documentation what the wedged mode is 
and what requirements a driver has to fulfill to enter it: 
https://docs.kernel.org/gpu/drm-uapi.html#device-reset

Especially document that all system memory accesses of the device need
to be blocked by (for example) disabling DMA accesses in the PCI config
space.

When it is guaranteed that the device can't access any system memory any
more, the device driver should signal all pending fences of this device.

And only after all of that is done can the driver send a uevent to
inform userspace that it can debug the hanged state.

As far as I can see this makes the enum of how to recover the device
superfluous because you will most likely always need a bus reset to get
out of this again.

Regards,
Christian.
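
The ordering requested above (block device access to system memory, then
signal all pending fences, and only then send the uevent) can be sketched as
a small userspace model. This is not kernel code: struct wedge_model and the
model_*() functions are hypothetical names used purely to illustrate the
sequence, and the -EBUSY return for out-of-order calls is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a device entering wedged mode. */
struct wedge_model {
	bool dma_blocked;	/* device can no longer access system memory */
	int pending_fences;	/* fences not yet signaled */
	bool uevent_sent;	/* userspace has been notified */
};

/* Step 1: cut off device access to system memory. */
void model_block_dma(struct wedge_model *m)
{
	m->dma_blocked = true;
}

/* Step 2: signal pending fences, allowed only once DMA is blocked. */
int model_signal_fences(struct wedge_model *m)
{
	if (!m->dma_blocked)
		return -EBUSY;

	m->pending_fences = 0;
	return 0;
}

/* Step 3: notify userspace, allowed only after steps 1 and 2. */
int model_send_uevent(struct wedge_model *m)
{
	if (!m->dma_blocked || m->pending_fences > 0)
		return -EBUSY;

	m->uevent_sent = true;
	return 0;
}

/* Runs the three steps in the order described in the message above. */
int model_enter_wedged(struct wedge_model *m)
{
	int ret;

	assert(m != NULL);

	model_block_dma(m);

	ret = model_signal_fences(m);
	if (ret)
		return ret;

	return model_send_uevent(m);
}
```

The point of the guards is that the notification step refuses to run while
the device could still touch system memory or while fences remain pending.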

Raag Jadav Oct. 18, 2024, 12:46 p.m. UTC | #17

On Fri, Oct 18, 2024 at 12:58:09PM +0200, Christian König wrote:
> Am 17.10.24 um 18:43 schrieb Rodrigo Vivi:
> > On Thu, Oct 17, 2024 at 09:59:10AM +0200, Christian König wrote:
> > > > > Purpose of this implementation is to provide drivers a generic way to
> > > > > recover with the help of userspace intervention. Different drivers may
> > > > > have different ideas of a "wedged device" depending on their hardware
> > > > > implementation, and hence the vendor agnostic nature of the event.
> > > > > It is up to the drivers to decide when they see the need for recovery
> > > > > and how they want to recover from the available methods.
> > > > > 
> > > > > Current implementation defines three recovery methods, out of which,
> > > > > drivers can choose to support any one or multiple of them. Preferred
> > > > > recovery method will be sent in the uevent environment as WEDGED=<method>.
> > > > > Userspace consumers (sysadmin) can define udev rules to parse this event
> > > > > and take respective action to recover the device.
> > > > > 
> > > > >       =============== ==================================
> > > > >       Recovery method Consumer expectations
> > > > >       =============== ==================================
> > > > >       rebind          unbind + rebind driver
> > > > >       bus-reset       unbind + reset bus device + rebind
> > > > >       reboot          reboot system
> > > > >       =============== ==================================
> > > Well that sounds like userspace would need to be involved in recovery.
> > > 
> > > That in turn is a complete no-go since we at least need to signal all
> > > dma_fences to unblock the kernel. In other words things like bus reset needs
> > > to happen inside the kernel and *not* in userspace.
> > > 
> > > What we can do is to signal to userspace: Hey a bus reset of device X
> > > happened, maybe restart container, daemon, whatever service which was using
> > > this device.
> > Well, when we declare device 'wedged' it is because we don't want to take
> > any drastic measures inside the kernel and want to leave it in a protected
> > and unusable state. In a way that users wouldn't lose display for instance,
> > or at least the device is in a debugable state.
> 
> Uff, that needs to be very very well documented or otherwise the whole
> approach is an absolutely clear NAK from my side as DMA-buf maintainer.
> 
> > 
> > Then, the instructions here is to tell what could possibly be attempted
> > from userspace to get the device to an usable state.
> > 
> > The 'wedge' mode (the one emiting this uevent) needs to be responsible
> > for signaling all the fences and everything needed for a clean unbind
> > and whatever next step might be indicated to userspace.
> > 
> > That should already be part of any wedged mode, regardless the uevent
> > to inform the userspace here.
> 
> You need to approach that from a different side. With the current patch set
> you are ignoring documented mandatory driver behavior as far as I can see.
> 
> So first of all describe in the documentation what the wedged mode is and
> what requirements a driver has to fulfill to enter it:
> https://docs.kernel.org/gpu/drm-uapi.html#device-reset
>
> Especially document that all system memory accesses of the device needs to
> be blocked by (for example) disabling DMA accesses in the PCI config space.
> 
> When it is guaranteed that the device can't access any system memory any
> more the device driver should signal all pending fences of this device.
> 
> And only after all of that is done the driver  can send an uevent to inform
> userspace that it can debug the hanged state.

Sure, will do.

> As far as I can see this makes the enum how to recover the device
> superfluous because you will most likely always need a bus reset to get out
> of this again.

That depends on the kind of fault the device has encountered and the bus it is
sitting on. There could be buses that don't support reset.

Raag

Christian König Oct. 18, 2024, 12:54 p.m. UTC | #18

Am 18.10.24 um 14:46 schrieb Raag Jadav:
>> As far as I can see this makes the enum how to recover the device
>> superfluous because you will most likely always need a bus reset to get out
>> of this again.
> That depends on the kind of fault the device has encountered and the bus it is
> sitting on. There could be buses that don't support reset.

That is even more of an argument not to expose this in the uevent.

Getting the device working again is strongly device dependent and can't 
be handled in a generic way.

Regards,
Christian.

>
> Raag

Raag Jadav Oct. 18, 2024, 2:09 p.m. UTC | #19

On Fri, Oct 18, 2024 at 02:54:38PM +0200, Christian König wrote:
> Am 18.10.24 um 14:46 schrieb Raag Jadav:
> > > As far as I can see this makes the enum how to recover the device
> > > superfluous because you will most likely always need a bus reset to get out
> > > of this again.
> > That depends on the kind of fault the device has encountered and the bus it is
> > sitting on. There could be buses that don't support reset.
> 
> That is even more an argument to not expose this in the uevent.
> 
> Getting the device working again is strongly device dependent and can't be
> handled in a generic way.

My understanding is that the proposed methods can be handled in a generic way
and are useful for the devices that do support them. This way userspace can
at least have a hint about recovery.

For others we can have something like WEDGED=none (as proposed by Michal and
Lucas in other threads) and let admin/user decide how to deal with it.

Raag
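
As a rough illustration of the "hint" idea, a userspace consumer could map
the WEDGED=<method> value to an action and fall back to doing nothing for
anything it does not recognise. The sketch below assumes the three method
strings from the patch plus the WEDGED=none value that is only a proposal in
this thread; enum wedge_action and wedge_action_from_env() are hypothetical
names, not an existing API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical actions a userspace consumer might take. */
enum wedge_action {
	ACTION_NONE,		/* no automatic recovery; leave it to the admin */
	ACTION_REBIND,		/* unbind + rebind driver */
	ACTION_BUS_RESET,	/* unbind + reset bus device + rebind */
	ACTION_REBOOT		/* reboot system */
};

/*
 * Parses one uevent environment entry such as "WEDGED=rebind".
 * Unknown keys, unknown values and "WEDGED=none" all map to ACTION_NONE.
 */
enum wedge_action wedge_action_from_env(const char *env)
{
	static const char key[] = "WEDGED=";
	const char *val;

	assert(env != NULL);

	/* Not a WEDGED entry at all. */
	if (strncmp(env, key, sizeof(key) - 1) != 0)
		return ACTION_NONE;

	val = env + sizeof(key) - 1;

	if (strcmp(val, "rebind") == 0)
		return ACTION_REBIND;
	if (strcmp(val, "bus-reset") == 0)
		return ACTION_BUS_RESET;
	if (strcmp(val, "reboot") == 0)
		return ACTION_REBOOT;

	/* "none" (proposed) or anything unrecognised. */
	return ACTION_NONE;
}
```

Treating unknown values as ACTION_NONE keeps such a consumer safe if more
methods are added later or if the set of values changes during review.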

Rodrigo Vivi Oct. 18, 2024, 2:56 p.m. UTC | #20

On Thu, Oct 17, 2024 at 04:16:09PM -0300, André Almeida wrote:
> Hi Raag,
> 
> Em 30/09/2024 04:38, Raag Jadav escreveu:
> > Introduce device wedged event, which will notify userspace of wedged
> > (hanged/unusable) state of the DRM device through a uevent. This is
> > useful especially in cases where the device is no longer operating as
> > expected even after a hardware reset and has become unrecoverable from
> > driver context.
> > 
> > Purpose of this implementation is to provide drivers a generic way to
> > recover with the help of userspace intervention. Different drivers may
> > have different ideas of a "wedged device" depending on their hardware
> > implementation, and hence the vendor agnostic nature of the event.
> > It is up to the drivers to decide when they see the need for recovery
> > and how they want to recover from the available methods.
> > 
> > Current implementation defines three recovery methods, out of which,
> > drivers can choose to support any one or multiple of them. Preferred
> > recovery method will be sent in the uevent environment as WEDGED=<method>.
> > Userspace consumers (sysadmin) can define udev rules to parse this event
> > and take respective action to recover the device.
> > 
> >      =============== ==================================
> >      Recovery method Consumer expectations
> >      =============== ==================================
> >      rebind          unbind + rebind driver
> >      bus-reset       unbind + reset bus device + rebind
> >      reboot          reboot system
> >      =============== ==================================
> > 
> > 
> 
> I proposed something similar in the past: https://lore.kernel.org/dri-devel/20221125175203.52481-1-andrealmeid@igalia.com/
> 
> The motivation was that amdgpu was getting stuck after every GPU reset, and
> there was just a black screen. The uevent would then trigger a daemon to
> reset the compositor and getting things back together. As you can see in my
> thread, the feature was blocked in favor of getting better overall GPU reset
> from the kernel side.
> 
> Which kind of scenarios are making i915/xe the need to have userspace
> involvement? I tested a bunch of resets in i915 but never managed to get the
> driver stuck.

2 scenarios:

1. Multiple levels of reset have failed and the device was declared wedged.
This is rare indeed as the resets have improved a lot.
2. Debug case. We can boot the driver with an option to declare the device
wedged at any timeout, so the device can be debugged.

> 
> For the bus-reset, amdgpu does that too, but it doesn't require userspace
> intervention.

How do you trigger that?

Alex Deucher Oct. 18, 2024, 3:31 p.m. UTC | #21

On Fri, Oct 18, 2024 at 11:23 AM Rodrigo Vivi <rodrigo.vivi@intel.com> wrote:
>
> On Thu, Oct 17, 2024 at 04:16:09PM -0300, André Almeida wrote:
> > Hi Raag,
> >
> > Em 30/09/2024 04:38, Raag Jadav escreveu:
> > > Introduce device wedged event, which will notify userspace of wedged
> > > (hanged/unusable) state of the DRM device through a uevent. This is
> > > useful especially in cases where the device is no longer operating as
> > > expected even after a hardware reset and has become unrecoverable from
> > > driver context.
> > >
> > > Purpose of this implementation is to provide drivers a generic way to
> > > recover with the help of userspace intervention. Different drivers may
> > > have different ideas of a "wedged device" depending on their hardware
> > > implementation, and hence the vendor agnostic nature of the event.
> > > It is up to the drivers to decide when they see the need for recovery
> > > and how they want to recover from the available methods.
> > >
> > > Current implementation defines three recovery methods, out of which,
> > > drivers can choose to support any one or multiple of them. Preferred
> > > recovery method will be sent in the uevent environment as WEDGED=<method>.
> > > Userspace consumers (sysadmin) can define udev rules to parse this event
> > > and take respective action to recover the device.
> > >
> > >      =============== ==================================
> > >      Recovery method Consumer expectations
> > >      =============== ==================================
> > >      rebind          unbind + rebind driver
> > >      bus-reset       unbind + reset bus device + rebind
> > >      reboot          reboot system
> > >      =============== ==================================
> > >
> > >
> >
> > I proposed something similar in the past: https://lore.kernel.org/dri-devel/20221125175203.52481-1-andrealmeid@igalia.com/
> >
> > The motivation was that amdgpu was getting stuck after every GPU reset, and
> > there was just a black screen. The uevent would then trigger a daemon to
> > reset the compositor and getting things back together. As you can see in my
> > thread, the feature was blocked in favor of getting better overall GPU reset
> > from the kernel side.
> >
> > Which kind of scenarios are making i915/xe the need to have userspace
> > involvement? I tested a bunch of resets in i915 but never managed to get the
> > driver stuck.
>
> 2 scenarios:
>
> 1. Multiple levels of reset has failed and device was declared wedged. This is
> rare indeed as the resets improved a lot.
> 2. Debug case. We can boot the driver with option to declare device wedged at
> any timeout, so the device can be debugged.
>
> >
> > For the bus-reset, amdgpu does that too, but it doesn't require userspace
> > intervention.
>
> How do you trigger that?

What do you mean by bus reset?  I think Christian is just referring
to a full adapter reset (as opposed to a queue reset or something more
fine grained).  The driver can reset the device via MMIO or firmware,
depending on the device.  I think there are also PCI helpers for
things like PCI FLR.

Alex

André Almeida Oct. 18, 2024, 5:56 p.m. UTC | #22

Em 18/10/2024 12:31, Alex Deucher escreveu:
> On Fri, Oct 18, 2024 at 11:23 AM Rodrigo Vivi <rodrigo.vivi@intel.com> wrote:
>>
>> On Thu, Oct 17, 2024 at 04:16:09PM -0300, André Almeida wrote:
>>> Hi Raag,
>>>
>>> Em 30/09/2024 04:38, Raag Jadav escreveu:
>>>> Introduce device wedged event, which will notify userspace of wedged
>>>> (hanged/unusable) state of the DRM device through a uevent. This is
>>>> useful especially in cases where the device is no longer operating as
>>>> expected even after a hardware reset and has become unrecoverable from
>>>> driver context.
>>>>
>>>> Purpose of this implementation is to provide drivers a generic way to
>>>> recover with the help of userspace intervention. Different drivers may
>>>> have different ideas of a "wedged device" depending on their hardware
>>>> implementation, and hence the vendor agnostic nature of the event.
>>>> It is up to the drivers to decide when they see the need for recovery
>>>> and how they want to recover from the available methods.
>>>>
>>>> Current implementation defines three recovery methods, out of which,
>>>> drivers can choose to support any one or multiple of them. Preferred
>>>> recovery method will be sent in the uevent environment as WEDGED=<method>.
>>>> Userspace consumers (sysadmin) can define udev rules to parse this event
>>>> and take respective action to recover the device.
>>>>
>>>>       =============== ==================================
>>>>       Recovery method Consumer expectations
>>>>       =============== ==================================
>>>>       rebind          unbind + rebind driver
>>>>       bus-reset       unbind + reset bus device + rebind
>>>>       reboot          reboot system
>>>>       =============== ==================================
>>>>
>>>>
>>>
>>> I proposed something similar in the past: https://lore.kernel.org/dri-devel/20221125175203.52481-1-andrealmeid@igalia.com/
>>>
>>> The motivation was that amdgpu was getting stuck after every GPU reset, and
>>> there was just a black screen. The uevent would then trigger a daemon to
>>> reset the compositor and getting things back together. As you can see in my
>>> thread, the feature was blocked in favor of getting better overall GPU reset
>>> from the kernel side.
>>>
>>> Which kind of scenarios are making i915/xe the need to have userspace
>>> involvement? I tested a bunch of resets in i915 but never managed to get the
>>> driver stuck.
>>
>> 2 scenarios:
>>
>> 1. Multiple levels of reset has failed and device was declared wedged. This is
>> rare indeed as the resets improved a lot.
>> 2. Debug case. We can boot the driver with option to declare device wedged at
>> any timeout, so the device can be debugged.
>>
>>>
>>> For the bus-reset, amdgpu does that too, but it doesn't require userspace
>>> intervention.
>>
>> How do you trigger that?
> 
> What do you mean by bus reset?  I think Chrisitian is just referring
> to a full adapter reset (as opposed to a queue reset or something more
> fine grained).  Driver can reset the device via MMIO or firmware,
> depending on the device.  I think there are also PCI helpers for
> things like PCI FLR.
> 

I was referring to AMD_RESET_PCI:

"Does a full bus reset using core Linux subsystem PCI reset and does a 
secondary bus reset or FLR, depending on what the underlying hardware 
supports."

And that can be triggered by using `amdgpu_reset_method=5` as the module 
option.

Alex Deucher Oct. 18, 2024, 9:07 p.m. UTC | #23

On Fri, Oct 18, 2024 at 1:56 PM André Almeida <andrealmeid@igalia.com> wrote:
>
> Em 18/10/2024 12:31, Alex Deucher escreveu:
> > On Fri, Oct 18, 2024 at 11:23 AM Rodrigo Vivi <rodrigo.vivi@intel.com> wrote:
> >>
> >> On Thu, Oct 17, 2024 at 04:16:09PM -0300, André Almeida wrote:
> >>> Hi Raag,
> >>>
> >>> Em 30/09/2024 04:38, Raag Jadav escreveu:
> >>>> Introduce device wedged event, which will notify userspace of wedged
> >>>> (hanged/unusable) state of the DRM device through a uevent. This is
> >>>> useful especially in cases where the device is no longer operating as
> >>>> expected even after a hardware reset and has become unrecoverable from
> >>>> driver context.
> >>>>
> >>>> Purpose of this implementation is to provide drivers a generic way to
> >>>> recover with the help of userspace intervention. Different drivers may
> >>>> have different ideas of a "wedged device" depending on their hardware
> >>>> implementation, and hence the vendor agnostic nature of the event.
> >>>> It is up to the drivers to decide when they see the need for recovery
> >>>> and how they want to recover from the available methods.
> >>>>
> >>>> Current implementation defines three recovery methods, out of which,
> >>>> drivers can choose to support any one or multiple of them. Preferred
> >>>> recovery method will be sent in the uevent environment as WEDGED=<method>.
> >>>> Userspace consumers (sysadmin) can define udev rules to parse this event
> >>>> and take respective action to recover the device.
> >>>>
> >>>>       =============== ==================================
> >>>>       Recovery method Consumer expectations
> >>>>       =============== ==================================
> >>>>       rebind          unbind + rebind driver
> >>>>       bus-reset       unbind + reset bus device + rebind
> >>>>       reboot          reboot system
> >>>>       =============== ==================================
> >>>>
> >>>>
> >>>
> >>> I proposed something similar in the past: https://lore.kernel.org/dri-devel/20221125175203.52481-1-andrealmeid@igalia.com/
> >>>
> >>> The motivation was that amdgpu was getting stuck after every GPU reset, and
> >>> there was just a black screen. The uevent would then trigger a daemon to
> >>> reset the compositor and getting things back together. As you can see in my
> >>> thread, the feature was blocked in favor of getting better overall GPU reset
> >>> from the kernel side.
> >>>
> >>> Which kind of scenarios are making i915/xe the need to have userspace
> >>> involvement? I tested a bunch of resets in i915 but never managed to get the
> >>> driver stuck.
> >>
> >> 2 scenarios:
> >>
> >> 1. Multiple levels of reset has failed and device was declared wedged. This is
> >> rare indeed as the resets improved a lot.
> >> 2. Debug case. We can boot the driver with option to declare device wedged at
> >> any timeout, so the device can be debugged.
> >>
> >>>
> >>> For the bus-reset, amdgpu does that too, but it doesn't require userspace
> >>> intervention.
> >>
> >> How do you trigger that?
> >
> > What do you mean by bus reset?  I think Chrisitian is just referring
> > to a full adapter reset (as opposed to a queue reset or something more
> > fine grained).  Driver can reset the device via MMIO or firmware,
> > depending on the device.  I think there are also PCI helpers for
> > things like PCI FLR.
> >
>
> I was referring to AMD_RESET_PCI:
>
> "Does a full bus reset using core Linux subsystem PCI reset and does a
> secondary bus reset or FLR, depending on what the underlying hardware
> supports."
>
> And that can be triggered by using `amdgpu_reset_method=5` as the module
> option.
>

That option doesn't actually do anything useful on most AMD GPUs.  We
don't support FLR on most boards and SBR doesn't work once the driver
has been loaded except for really old chips.  That said, internally
these all end up being mode1 or mode2 resets which the driver can
trigger directly and which are the defaults.

Alex

Raag Jadav Oct. 19, 2024, 7:08 p.m. UTC | #24

On Thu, Oct 17, 2024 at 04:16:09PM -0300, André Almeida wrote:
> Hi Raag,
> 
> On 30/09/2024 04:38, Raag Jadav wrote:
> > Introduce device wedged event, which will notify userspace of wedged
> > (hanged/unusable) state of the DRM device through a uevent. This is
> > useful especially in cases where the device is no longer operating as
> > expected even after a hardware reset and has become unrecoverable from
> > driver context.
> > 
> > Purpose of this implementation is to provide drivers a generic way to
> > recover with the help of userspace intervention. Different drivers may
> > have different ideas of a "wedged device" depending on their hardware
> > implementation, and hence the vendor agnostic nature of the event.
> > It is up to the drivers to decide when they see the need for recovery
> > and how they want to recover from the available methods.
> > 
> > Current implementation defines three recovery methods, out of which,
> > drivers can choose to support any one or multiple of them. Preferred
> > recovery method will be sent in the uevent environment as WEDGED=<method>.
> > Userspace consumers (sysadmin) can define udev rules to parse this event
> > and take respective action to recover the device.
> > 
> >      =============== ==================================
> >      Recovery method Consumer expectations
> >      =============== ==================================
> >      rebind          unbind + rebind driver
> >      bus-reset       unbind + reset bus device + rebind
> >      reboot          reboot system
> >      =============== ==================================
> > 
> > 
> 
> I proposed something similar in the past:
> https://lore.kernel.org/dri-devel/20221125175203.52481-1-andrealmeid@igalia.com/

Thanks for sharing. I went through it and I think we can use some of the ideas
with some generic adaptation.

While we can always execute scripts on uevent, it'd be good to have a userspace
daemon applying automated policies for wedge cases based on admin/user needs.
This way we can also manage repeat offenders.
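
As a rough illustration of the daemon idea above, the sketch below parses the
WEDGED value out of a uevent environment and picks a recovery method,
escalating for repeat offenders. Only the WEDGED key and the three method names
come from the patch description; the WedgePolicy class, the max_repeats
threshold and the escalation rule are assumptions made for illustration, not
part of any existing tool.

```python
from collections import Counter

# Recovery methods named in the patch description, least to most disruptive.
METHODS = ("rebind", "bus-reset", "reboot")


def parse_wedged(env_lines):
    """Return the WEDGED value from 'KEY=value' uevent lines, or None."""
    for line in env_lines:
        key, sep, value = line.partition("=")
        if sep and key == "WEDGED":
            return value
    return None


class WedgePolicy:
    """Hypothetical policy: escalate one method per max_repeats events."""

    def __init__(self, max_repeats=2):
        self.max_repeats = max_repeats
        self.seen = Counter()  # wedge events counted per device

    def decide(self, device, env_lines):
        method = parse_wedged(env_lines)
        if method not in METHODS:
            return None  # not a wedged event, or an unknown method
        self.seen[device] += 1
        # Number of escalation steps earned by earlier events on this device.
        step = (self.seen[device] - 1) // self.max_repeats
        index = min(METHODS.index(method) + step, len(METHODS) - 1)
        return METHODS[index]
```

With max_repeats=2, a device that keeps reporting WEDGED=rebind gets a rebind
for the first two events, a bus-reset for the next two, and a reboot after
that. A real daemon would also want to age out old events, which this sketch
does not attempt.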

Xe has devcoredump so telemetry would also be a nice addition.

Great opportunity to collaborate here.

> The motivation was that amdgpu was getting stuck after every GPU reset, and
> there was just a black screen. The uevent would then trigger a daemon to
> reset the compositor and get things back together. As you can see in my
> thread, the feature was blocked in favor of getting better overall GPU reset
> from the kernel side.

We have hardware-level resets but (although rare) they're also prone to failure.
We do what we can to recover from driver context but it adds to the complexity
over time. Something like wedging, if done right, would be much more robust IMHO.

Raag

Rodrigo Vivi Oct. 24, 2024, 5:48 p.m. UTC | #25

On Fri, Oct 18, 2024 at 05:07:22PM -0400, Alex Deucher wrote:
> On Fri, Oct 18, 2024 at 1:56 PM André Almeida <andrealmeid@igalia.com> wrote:
> >
> > On 18/10/2024 12:31, Alex Deucher wrote:
> > > On Fri, Oct 18, 2024 at 11:23 AM Rodrigo Vivi <rodrigo.vivi@intel.com> wrote:
> > >>
> > >> On Thu, Oct 17, 2024 at 04:16:09PM -0300, André Almeida wrote:
> > >>> Hi Raag,
> > >>>
> > >>> On 30/09/2024 04:38, Raag Jadav wrote:
> > >>>> Introduce device wedged event, which will notify userspace of wedged
> > >>>> (hanged/unusable) state of the DRM device through a uevent. This is
> > >>>> useful especially in cases where the device is no longer operating as
> > >>>> expected even after a hardware reset and has become unrecoverable from
> > >>>> driver context.
> > >>>>
> > >>>> Purpose of this implementation is to provide drivers a generic way to
> > >>>> recover with the help of userspace intervention. Different drivers may
> > >>>> have different ideas of a "wedged device" depending on their hardware
> > >>>> implementation, and hence the vendor agnostic nature of the event.
> > >>>> It is up to the drivers to decide when they see the need for recovery
> > >>>> and how they want to recover from the available methods.
> > >>>>
> > >>>> Current implementation defines three recovery methods, out of which,
> > >>>> drivers can choose to support any one or multiple of them. Preferred
> > >>>> recovery method will be sent in the uevent environment as WEDGED=<method>.
> > >>>> Userspace consumers (sysadmin) can define udev rules to parse this event
> > >>>> and take respective action to recover the device.
> > >>>>
> > >>>>       =============== ==================================
> > >>>>       Recovery method Consumer expectations
> > >>>>       =============== ==================================
> > >>>>       rebind          unbind + rebind driver
> > >>>>       bus-reset       unbind + reset bus device + rebind
> > >>>>       reboot          reboot system
> > >>>>       =============== ==================================
> > >>>>
> > >>>>
> > >>>
> > >>> I proposed something similar in the past: https://lore.kernel.org/dri-devel/20221125175203.52481-1-andrealmeid@igalia.com/
> > >>>
> > >>> The motivation was that amdgpu was getting stuck after every GPU reset, and
> > >>> there was just a black screen. The uevent would then trigger a daemon to
> > >>> reset the compositor and get things back together. As you can see in my
> > >>> thread, the feature was blocked in favor of getting better overall GPU reset
> > >>> from the kernel side.
> > >>>
> > >>> Which kind of scenarios are making i915/xe the need to have userspace
> > >>> involvement? I tested a bunch of resets in i915 but never managed to get the
> > >>> driver stuck.
> > >>
> > >> 2 scenarios:
> > >>
> > >> 1. Multiple levels of reset have failed and device was declared wedged. This is
> > >> rare indeed as the resets improved a lot.
> > >> 2. Debug case. We can boot the driver with option to declare device wedged at
> > >> any timeout, so the device can be debugged.
> > >>
> > >>>
> > >>> For the bus-reset, amdgpu does that too, but it doesn't require userspace
> > >>> intervention.
> > >>
> > >> How do you trigger that?
> > >
> > > > What do you mean by bus reset?  I think Christian is just referring
> > > to a full adapter reset (as opposed to a queue reset or something more
> > > fine grained).  Driver can reset the device via MMIO or firmware,
> > > depending on the device.  I think there are also PCI helpers for
> > > things like PCI FLR.
> > >
> >
> > I was referring to AMD_RESET_PCI:
> >
> > "Does a full bus reset using core Linux subsystem PCI reset and does a
> > secondary bus reset or FLR, depending on what the underlying hardware
> > supports."
> >
> > And that can be triggered by using `amdgpu_reset_method=5` as the module
> > option.
> >
> 
> That option doesn't actually do anything useful on most AMD GPUs.  We
> don't support FLR on most boards and SBR doesn't work once the driver
> has been loaded except for really old chips.  That said, internally
> these all end up being mode1 or mode2 resets which the driver can
> trigger directly and which are the defaults.

okay, this is the same for us then.
And this is the main reason that we have this option:
- unbind + reset bus device + rebind

unbind by itself needs to be a supported and working case regardless of
the reset state. Then this sequence should be fine.

Afaik there's no way that the driver itself could call for the bus
reset.
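
For a PCI device, the unbind + reset bus device + rebind sequence could be
driven from userspace through the standard PCI sysfs attributes, roughly as
sketched below. The helper names and the example device address are
hypothetical, and the reset attribute is only present when the PCI core has a
reset method available for the device.

```python
def bus_reset_steps(bdf, driver, sysfs="/sys"):
    """Return the ordered (path, value) sysfs writes for a bus-reset recovery."""
    drv = f"{sysfs}/bus/pci/drivers/{driver}"
    dev = f"{sysfs}/bus/pci/devices/{bdf}"
    return [
        (f"{drv}/unbind", bdf),  # detach the driver from the device
        (f"{dev}/reset", "1"),   # ask the PCI core to reset the device
        (f"{drv}/bind", bdf),    # attach the driver again
    ]


def write_sysfs(path, value):
    """Write a value to a sysfs attribute (needs root on a real system)."""
    with open(path, "w") as f:
        f.write(value)


def recover(bdf, driver, write=write_sysfs):
    """Run the steps in order; a failed write raises and stops the sequence."""
    for path, value in bus_reset_steps(bdf, driver):
        write(path, value)
```

For example, recover("0000:03:00.0", "xe") would attempt the three writes in
order. Stopping at the first failure matters here: if unbind does not work,
resetting the device underneath a bound driver is unlikely to end well.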

> 
> Alex
