Message ID | 20240930073845.347326-2-raag.jadav@intel.com (mailing list archive) |
---|---|
State | New, archived |
Series | Introduce DRM device wedged event |
On Mon, Sep 30, 2024 at 01:08:41PM +0530, Raag Jadav wrote:
> Introduce device wedged event, which will notify userspace of wedged
> (hanged/unusable) state of the DRM device through a uevent. This is
> useful especially in cases where the device is no longer operating as
> expected even after a hardware reset and has become unrecoverable from
> driver context.
>
> Purpose of this implementation is to provide drivers a generic way to
> recover with the help of userspace intervention. Different drivers may
> have different ideas of a "wedged device" depending on their hardware
> implementation, and hence the vendor agnostic nature of the event.
> It is up to the drivers to decide when they see the need for recovery
> and how they want to recover from the available methods.
>
> Current implementation defines three recovery methods, out of which,
> drivers can choose to support any one or multiple of them. Preferred
> recovery method will be sent in the uevent environment as WEDGED=<method>.
> Userspace consumers (sysadmin) can define udev rules to parse this event
> and take respective action to recover the device.
>
> =============== ==================================
> Recovery method Consumer expectations
> =============== ==================================
> rebind          unbind + rebind driver
> bus-reset       unbind + reset bus device + rebind
> reboot          reboot system
> =============== ==================================

...

> +/*
> + * Available recovery methods for wedged device. To be sent along with device
> + * wedged uevent.
> + */
> +static const char *const drm_wedge_recovery_opts[] = {
> +	[DRM_WEDGE_RECOVERY_REBIND] = "rebind",
> +	[DRM_WEDGE_RECOVERY_BUS_RESET] = "bus-reset",
> +	[DRM_WEDGE_RECOVERY_REBOOT] = "reboot",
> +};

Place for static_assert() is here, as it closer to the actual data we test...

> +static bool drm_wedge_recovery_is_valid(enum drm_wedge_recovery method)
> +{
> +	static_assert(ARRAY_SIZE(drm_wedge_recovery_opts) == DRM_WEDGE_RECOVERY_MAX);

...it doesn't fully belong to this function (or only to this function).

> +	return method >= DRM_WEDGE_RECOVERY_REBIND && method < DRM_WEDGE_RECOVERY_MAX;
> +}

Why do we need this one-liner (after above comment being addressed) as a
separate function?

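For illustration, the two review comments above could be addressed roughly as follows. This is a userspace sketch, not the code that was eventually merged: ARRAY_SIZE() is redefined locally as a stand-in for the kernel macro, static_assert() comes from C11 <assert.h> rather than <linux/build_bug.h>, and the assertion message string is an addition. The range check is folded into the only caller and written as a single unsigned compare, which is an assumption about how the one-liner might be inlined.

```c
#include <assert.h>	/* static_assert (C11) */
#include <stddef.h>	/* NULL */

/* Userspace stand-in for the kernel's ARRAY_SIZE(). */
#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

enum drm_wedge_recovery {
	DRM_WEDGE_RECOVERY_REBIND,
	DRM_WEDGE_RECOVERY_BUS_RESET,
	DRM_WEDGE_RECOVERY_REBOOT,
	DRM_WEDGE_RECOVERY_MAX	/* for bounds checking, do not use */
};

static const char *const drm_wedge_recovery_opts[] = {
	[DRM_WEDGE_RECOVERY_REBIND] = "rebind",
	[DRM_WEDGE_RECOVERY_BUS_RESET] = "bus-reset",
	[DRM_WEDGE_RECOVERY_REBOOT] = "reboot",
};
/* File-scope compile-time check, placed next to the data it guards. */
static_assert(ARRAY_SIZE(drm_wedge_recovery_opts) == DRM_WEDGE_RECOVERY_MAX,
	      "recovery name table out of sync with enum drm_wedge_recovery");

const char *drm_wedge_recovery_name(enum drm_wedge_recovery method)
{
	/* One unsigned compare rejects both negative and too-large values. */
	if ((unsigned int)method < DRM_WEDGE_RECOVERY_MAX)
		return drm_wedge_recovery_opts[method];

	return NULL;
}
```

Because the assertion sits at file scope, it is evaluated whenever the file is compiled, regardless of whether any function ends up using the table.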
On Mon, Sep 30, 2024 at 03:59:59PM +0300, Andy Shevchenko wrote:
> On Mon, Sep 30, 2024 at 01:08:41PM +0530, Raag Jadav wrote:
> > Introduce device wedged event, which will notify userspace of wedged
> > (hanged/unusable) state of the DRM device through a uevent. This is
> > useful especially in cases where the device is no longer operating as
> > expected even after a hardware reset and has become unrecoverable from
> > driver context.
> >
> > Purpose of this implementation is to provide drivers a generic way to
> > recover with the help of userspace intervention. Different drivers may
> > have different ideas of a "wedged device" depending on their hardware
> > implementation, and hence the vendor agnostic nature of the event.
> > It is up to the drivers to decide when they see the need for recovery
> > and how they want to recover from the available methods.
> >
> > Current implementation defines three recovery methods, out of which,
> > drivers can choose to support any one or multiple of them. Preferred
> > recovery method will be sent in the uevent environment as WEDGED=<method>.
> > Userspace consumers (sysadmin) can define udev rules to parse this event
> > and take respective action to recover the device.
> >
> > =============== ==================================
> > Recovery method Consumer expectations
> > =============== ==================================
> > rebind          unbind + rebind driver
> > bus-reset       unbind + reset bus device + rebind
> > reboot          reboot system
> > =============== ==================================
>
> ...
>
> > +/*
> > + * Available recovery methods for wedged device. To be sent along with device
> > + * wedged uevent.
> > + */
> > +static const char *const drm_wedge_recovery_opts[] = {
> > +	[DRM_WEDGE_RECOVERY_REBIND] = "rebind",
> > +	[DRM_WEDGE_RECOVERY_BUS_RESET] = "bus-reset",
> > +	[DRM_WEDGE_RECOVERY_REBOOT] = "reboot",
> > +};
>
> Place for static_assert() is here, as it closer to the actual data we test...

Shouldn't it be at the point of access?
If no, why do we care about the data when it's not being used?

> > +static bool drm_wedge_recovery_is_valid(enum drm_wedge_recovery method)
> > +{
> > +	static_assert(ARRAY_SIZE(drm_wedge_recovery_opts) == DRM_WEDGE_RECOVERY_MAX);
>
> ...it doesn't fully belong to this function (or only to this function).

The purpose of having a helper is to have a single point of access, no?

Side note: It also goes well with is_valid() semantic IMHO.

> > +	return method >= DRM_WEDGE_RECOVERY_REBIND && method < DRM_WEDGE_RECOVERY_MAX;
> > +}
>
> Why do we need this one-liner (after above comment being addressed) as a
> separate function?

I'm not sure if I'm following you. Method is not a constant here, we'll get it
on the stack.

Raag

On Tue, Oct 01, 2024 at 08:08:18AM +0300, Raag Jadav wrote:
> On Mon, Sep 30, 2024 at 03:59:59PM +0300, Andy Shevchenko wrote:
> > On Mon, Sep 30, 2024 at 01:08:41PM +0530, Raag Jadav wrote:

...

> > > +static const char *const drm_wedge_recovery_opts[] = {
> > > +	[DRM_WEDGE_RECOVERY_REBIND] = "rebind",
> > > +	[DRM_WEDGE_RECOVERY_BUS_RESET] = "bus-reset",
> > > +	[DRM_WEDGE_RECOVERY_REBOOT] = "reboot",
> > > +};
> >
> > Place for static_assert() is here, as it closer to the actual data we test...
>
> Shouldn't it be at the point of access?

No, the idea of static_assert() is in word 'static', meaning it's allowed to be
used in the global space.

> If no, why do we care about the data when it's not being used?

What does this suppose to mean? The assertion is for enforcing the boundaries
that are defined by different means (constant of the size and real size of
an array).

...

> > > +static bool drm_wedge_recovery_is_valid(enum drm_wedge_recovery method)
> > > +{
> > > +	static_assert(ARRAY_SIZE(drm_wedge_recovery_opts) == DRM_WEDGE_RECOVERY_MAX);
> >
> > ...it doesn't fully belong to this function (or only to this function).
>
> The purpose of having a helper is to have a single point of access, no?

What single access you are talking about? It seems you are trying to solve
non-existing issue. There is a function that is being used exactly once and
it's a one-liner. There is no point to have it being separated (at least
right now).

> Side note: It also goes well with is_valid() semantic IMHO.

It doesn't matter at all, it's unrelated to the point.

> > > +	return method >= DRM_WEDGE_RECOVERY_REBIND && method < DRM_WEDGE_RECOVERY_MAX;
> > > +}
> >
> > Why do we need this one-liner (after above comment being addressed) as a
> > separate function?
>
> I'm not sure if I'm following you. Method is not a constant here, we'll get it
> on the stack.

I elaborated above.

Hi,

sorry for late comments,

On 30.09.2024 09:38, Raag Jadav wrote:
> Introduce device wedged event, which will notify userspace of wedged
> (hanged/unusable) state of the DRM device through a uevent. This is
> useful especially in cases where the device is no longer operating as
> expected even after a hardware reset and has become unrecoverable from
> driver context.
>
> Purpose of this implementation is to provide drivers a generic way to
> recover with the help of userspace intervention. Different drivers may
> have different ideas of a "wedged device" depending on their hardware
> implementation, and hence the vendor agnostic nature of the event.
> It is up to the drivers to decide when they see the need for recovery
> and how they want to recover from the available methods.

what about when driver just wants to tell that it is in unusable state,
but recovery method is unknown or not possible?

>
> Current implementation defines three recovery methods, out of which,
> drivers can choose to support any one or multiple of them. Preferred
> recovery method will be sent in the uevent environment as WEDGED=<method>.

could this be something like below instead:

	WEDGED=<reason>
	RECOVERY=<method>[,<method>]

then driver will have a chance to tell what happen _and_ additionally
provide a hint(s) how to recover from that situation

> Userspace consumers (sysadmin) can define udev rules to parse this event
> and take respective action to recover the device.
>
> =============== ==================================
> Recovery method Consumer expectations
> =============== ==================================
> rebind          unbind + rebind driver
> bus-reset       unbind + reset bus device + rebind
> reboot          reboot system

btw, what if driver detects a really broken HW, or has no clue what will
help here, shouldn't we have a "none" method?

> =============== ==================================
>
> v4: s/drm_dev_wedged/drm_dev_wedged_event
>     Use drm_info() (Jani)
>     Kernel doc adjustment (Aravind)
> v5: Send recovery method with uevent (Lina)
> v6: Access wedge_recovery_opts[] using helper function (Jani)
>     Use snprintf() (Jani)
> v7: Convert recovery helpers into regular functions (Andy, Jani)
>     Aesthetic adjustments (Andy)
>     Handle invalid method cases
>
> Signed-off-by: Raag Jadav <raag.jadav@intel.com>
> ---
>  drivers/gpu/drm/drm_drv.c | 77 +++++++++++++++++++++++++++++++++++++++
>  include/drm/drm_device.h  | 23 ++++++++++++
>  include/drm/drm_drv.h     |  3 ++
>  3 files changed, 103 insertions(+)
>
> diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
> index ac30b0ec9d93..cfe9600da2ee 100644
> --- a/drivers/gpu/drm/drm_drv.c
> +++ b/drivers/gpu/drm/drm_drv.c
> @@ -26,6 +26,8 @@
>   * DEALINGS IN THE SOFTWARE.
>   */
>
> +#include <linux/array_size.h>
> +#include <linux/build_bug.h>
>  #include <linux/debugfs.h>
>  #include <linux/fs.h>
>  #include <linux/module.h>
> @@ -33,6 +35,7 @@
>  #include <linux/mount.h>
>  #include <linux/pseudo_fs.h>
>  #include <linux/slab.h>
> +#include <linux/sprintf.h>
>  #include <linux/srcu.h>
>  #include <linux/xarray.h>
>
> @@ -70,6 +73,42 @@ static struct dentry *drm_debugfs_root;
>
>  DEFINE_STATIC_SRCU(drm_unplug_srcu);
>
> +/*
> + * Available recovery methods for wedged device. To be sent along with device
> + * wedged uevent.
> + */
> +static const char *const drm_wedge_recovery_opts[] = {
> +	[DRM_WEDGE_RECOVERY_REBIND] = "rebind",
> +	[DRM_WEDGE_RECOVERY_BUS_RESET] = "bus-reset",
> +	[DRM_WEDGE_RECOVERY_REBOOT] = "reboot",
> +};
> +
> +static bool drm_wedge_recovery_is_valid(enum drm_wedge_recovery method)
> +{
> +	static_assert(ARRAY_SIZE(drm_wedge_recovery_opts) == DRM_WEDGE_RECOVERY_MAX);
> +
> +	return method >= DRM_WEDGE_RECOVERY_REBIND && method < DRM_WEDGE_RECOVERY_MAX;
> +}
> +
> +/**
> + * drm_wedge_recovery_name - provide wedge recovery name
> + * @method: method to be used for recovery
> + *
> + * This validates wedge recovery @method against the available ones in

do we really need to validate an enum?

maybe the problem is that there is MAX enumerator that just shouldn't be
there?

> + * drm_wedge_recovery_opts[] and provides respective recovery name in string
> + * format if found valid.
> + *
> + * Returns: pointer to const recovery string on success, NULL otherwise.
> + */
> +const char *drm_wedge_recovery_name(enum drm_wedge_recovery method)
> +{
> +	if (drm_wedge_recovery_is_valid(method))
> +		return drm_wedge_recovery_opts[method];

as we only have 3 methods, maybe simple switch() will do the work?

> +
> +	return NULL;
> +}
> +EXPORT_SYMBOL(drm_wedge_recovery_name);
> +
>  /*
>   * DRM Minors
>   * A DRM device can provide several char-dev interfaces on the DRM-Major. Each
> @@ -497,6 +536,44 @@ void drm_dev_unplug(struct drm_device *dev)
>  }
>  EXPORT_SYMBOL(drm_dev_unplug);
>
> +/**
> + * drm_dev_wedged_event - generate a device wedged uevent
> + * @dev: DRM device
> + * @method: method to be used for recovery
> + *
> + * This generates a device wedged uevent for the DRM device specified by @dev.
> + * Recovery @method from drm_wedge_recovery_opts[] (if supprted by the device)

typo

> + * is sent in the uevent environment as WEDGED=<method>, on the basis of which,
> + * userspace may take respective action to recover the device.
> + *
> + * Returns: 0 on success, or negative error code otherwise.
> + */
> +int drm_dev_wedged_event(struct drm_device *dev, enum drm_wedge_recovery method)
> +{
> +	/* Event string length up to 16+ characters with available methods */
> +	char event_string[32] = {};

magic 32 here

and likely don't need to be initialized with { }

> +	char *envp[] = { event_string, NULL };
> +	const char *recovery;
> +
> +	recovery = drm_wedge_recovery_name(method);
> +	if (!recovery) {
> +		drm_err(dev, "device wedged, invalid recovery method %d\n", method);

maybe use drm_WARN() to see who is abusing the API ?

> +		return -EINVAL;

but shouldn't we still trigger an event with "none" recovery?

> +	}
> +
> +	if (!test_bit(method, &dev->wedge_recovery)) {
> +		drm_err(dev, "device wedged, %s based recovery not supported\n",
> +			drm_wedge_recovery_name(method));

do we really need this kind of guard? it will be a driver code that will
call this function, so likely it knows better what will work to recover

> +		return -EOPNOTSUPP;
> +	}
> +
> +	snprintf(event_string, sizeof(event_string), "WEDGED=%s", recovery);
> +
> +	drm_info(dev, "device wedged, generating uevent for %s based recovery\n", recovery);

nit:
	drm_info(dev, "device wedged, needs %s to recover\n", recovery);

> +	return kobject_uevent_env(&dev->primary->kdev->kobj, KOBJ_CHANGE, envp);
> +}
> +EXPORT_SYMBOL(drm_dev_wedged_event);
> +
>  /*
>   * DRM internal mount
>   * We want to be able to allocate our own "struct address_space" to control
> diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
> index c91f87b5242d..fed6f20e52fb 100644
> --- a/include/drm/drm_device.h
> +++ b/include/drm/drm_device.h
> @@ -40,6 +40,26 @@ enum switch_power_state {
>  	DRM_SWITCH_POWER_DYNAMIC_OFF = 3,
>  };
>
> +/**
> + * enum drm_wedge_recovery - Recovery method for wedged device in order of
> + * severity. To be set as bit fields in drm_device.wedge_recovery variable.
> + * Drivers can choose to support any one or multiple of them depending on
> + * their needs.
> + */
> +enum drm_wedge_recovery {
> +	/** @DRM_WEDGE_RECOVERY_REBIND: unbind + rebind driver */
> +	DRM_WEDGE_RECOVERY_REBIND,
> +
> +	/** @DRM_WEDGE_RECOVERY_BUS_RESET: unbind + reset bus device + rebind */
> +	DRM_WEDGE_RECOVERY_BUS_RESET,
> +
> +	/** @DRM_WEDGE_RECOVERY_REBOOT: reboot system */
> +	DRM_WEDGE_RECOVERY_REBOOT,
> +
> +	/** @DRM_WEDGE_RECOVERY_MAX: for bounds checking, do not use */
> +	DRM_WEDGE_RECOVERY_MAX
> +};
> +
>  /**
>   * struct drm_device - DRM device structure
>   *
> @@ -317,6 +337,9 @@ struct drm_device {
>  	 * Root directory for debugfs files.
>  	 */
>  	struct dentry *debugfs_root;
> +
> +	/** @wedge_recovery: Supported recovery methods for wedged device */
> +	unsigned long wedge_recovery;

hmm, so before the driver can ask for a reboot as a recovery method from
wedge it has to somehow add 'reboot' as available method? why it that?

and if you insist that this is useful then at least document how this
should be initialized (to not forcing developers to look at
drm_dev_wedged_event code where it's used)

>  };
>
>  #endif
> diff --git a/include/drm/drm_drv.h b/include/drm/drm_drv.h
> index 02ea4e3248fd..d8dbc77010b0 100644
> --- a/include/drm/drm_drv.h
> +++ b/include/drm/drm_drv.h
> @@ -462,6 +462,9 @@ bool drm_dev_enter(struct drm_device *dev, int *idx);
>  void drm_dev_exit(int idx);
>  void drm_dev_unplug(struct drm_device *dev);
>
> +const char *drm_wedge_recovery_name(enum drm_wedge_recovery method);
> +int drm_dev_wedged_event(struct drm_device *dev, enum drm_wedge_recovery method);
> +
>  /**
>   * drm_dev_is_unplugged - is a DRM device unplugged
>   * @dev: DRM device

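As an illustration of two of the suggestions in this review (a plain switch() for the name lookup, and no MAX enumerator), a lookup could be sketched as below. This is a standalone userspace sketch, not the reviewer's or the author's code; dropping DRM_WEDGE_RECOVERY_MAX from the enum is the assumption being explored.

```c
#include <stddef.h>	/* NULL */

/* Enum as in the patch, minus the DRM_WEDGE_RECOVERY_MAX sentinel. */
enum drm_wedge_recovery {
	DRM_WEDGE_RECOVERY_REBIND,
	DRM_WEDGE_RECOVERY_BUS_RESET,
	DRM_WEDGE_RECOVERY_REBOOT,
};

const char *drm_wedge_recovery_name(enum drm_wedge_recovery method)
{
	/*
	 * No default label on purpose: with -Wswitch the compiler can then
	 * flag an enumerator that was added without a matching case.
	 */
	switch (method) {
	case DRM_WEDGE_RECOVERY_REBIND:
		return "rebind";
	case DRM_WEDGE_RECOVERY_BUS_RESET:
		return "bus-reset";
	case DRM_WEDGE_RECOVERY_REBOOT:
		return "reboot";
	}

	/* Reached only for values outside the enum. */
	return NULL;
}
```

With this shape there is no separate table, so neither the static_assert() nor the validity helper is needed; the trade-off is that each new method needs a new case.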
On Tue, Oct 01, 2024 at 03:07:59PM +0300, Andy Shevchenko wrote:
> On Tue, Oct 01, 2024 at 08:08:18AM +0300, Raag Jadav wrote:
> > On Mon, Sep 30, 2024 at 03:59:59PM +0300, Andy Shevchenko wrote:
> > > On Mon, Sep 30, 2024 at 01:08:41PM +0530, Raag Jadav wrote:
>
> ...
>
> > > > +static const char *const drm_wedge_recovery_opts[] = {
> > > > +	[DRM_WEDGE_RECOVERY_REBIND] = "rebind",
> > > > +	[DRM_WEDGE_RECOVERY_BUS_RESET] = "bus-reset",
> > > > +	[DRM_WEDGE_RECOVERY_REBOOT] = "reboot",
> > > > +};
> > >
> > > Place for static_assert() is here, as it closer to the actual data we test...
> >
> > Shouldn't it be at the point of access?
>
> No, the idea of static_assert() is in word 'static', meaning it's allowed to be
> used in the global space.
>
> > If no, why do we care about the data when it's not being used?
>
> What does this suppose to mean? The assertion is for enforcing the boundaries
> that are defined by different means (constant of the size and real size of
> an array).

The point was to simply not assert without an active user of the array, which is
not the case now but may be possible with growing functionality in the future.

Raag

On Tue, Oct 01, 2024 at 05:18:33PM +0300, Raag Jadav wrote:
> On Tue, Oct 01, 2024 at 03:07:59PM +0300, Andy Shevchenko wrote:
> > On Tue, Oct 01, 2024 at 08:08:18AM +0300, Raag Jadav wrote:
> > > On Mon, Sep 30, 2024 at 03:59:59PM +0300, Andy Shevchenko wrote:
> > > > On Mon, Sep 30, 2024 at 01:08:41PM +0530, Raag Jadav wrote:

...

> > > > > +static const char *const drm_wedge_recovery_opts[] = {
> > > > > +	[DRM_WEDGE_RECOVERY_REBIND] = "rebind",
> > > > > +	[DRM_WEDGE_RECOVERY_BUS_RESET] = "bus-reset",
> > > > > +	[DRM_WEDGE_RECOVERY_REBOOT] = "reboot",
> > > > > +};
> > > >
> > > > Place for static_assert() is here, as it closer to the actual data we test...
> > >
> > > Shouldn't it be at the point of access?
> >
> > No, the idea of static_assert() is in word 'static', meaning it's allowed to be
> > used in the global space.
> >
> > > If no, why do we care about the data when it's not being used?
> >
> > What does this suppose to mean? The assertion is for enforcing the boundaries
> > that are defined by different means (constant of the size and real size of
> > an array).
>
> The point was to simply not assert without an active user of the array, which is
> not the case now but may be possible with growing functionality in the future.

static_assert() is a compile-time check. How is it even related to this?

So, i.o.w., you are contradicting yourself in this code: on one hand you want
compile-time static checker, on the other you do not want it and rely on the
usage of the function.

Possible solutions:
1) remove static_assert() completely;
2) move it as I said.

On Tue, Oct 01, 2024 at 05:54:46PM +0300, Andy Shevchenko wrote:
> On Tue, Oct 01, 2024 at 05:18:33PM +0300, Raag Jadav wrote:
> > On Tue, Oct 01, 2024 at 03:07:59PM +0300, Andy Shevchenko wrote:
> > > On Tue, Oct 01, 2024 at 08:08:18AM +0300, Raag Jadav wrote:
> > > > On Mon, Sep 30, 2024 at 03:59:59PM +0300, Andy Shevchenko wrote:
> > > > > On Mon, Sep 30, 2024 at 01:08:41PM +0530, Raag Jadav wrote:
>
> ...
>
> > > > > > +static const char *const drm_wedge_recovery_opts[] = {
> > > > > > +	[DRM_WEDGE_RECOVERY_REBIND] = "rebind",
> > > > > > +	[DRM_WEDGE_RECOVERY_BUS_RESET] = "bus-reset",
> > > > > > +	[DRM_WEDGE_RECOVERY_REBOOT] = "reboot",
> > > > > > +};
> > > > >
> > > > > Place for static_assert() is here, as it closer to the actual data we test...
> > > >
> > > > Shouldn't it be at the point of access?
> > >
> > > No, the idea of static_assert() is in word 'static', meaning it's allowed to be
> > > used in the global space.
> > >
> > > > If no, why do we care about the data when it's not being used?
> > >
> > > What does this suppose to mean? The assertion is for enforcing the boundaries
> > > that are defined by different means (constant of the size and real size of
> > > an array).
> >
> > The point was to simply not assert without an active user of the array, which is
> > not the case now but may be possible with growing functionality in the future.
>
> static_assert() is a compile-time check. How is it even related to this?

Yes, I understand. Semantically it made more sense to me is all, since
core helpers can always end up in config based ifdeffery.

Anyway, I'll update.

Raag

On Tue, Oct 01, 2024 at 02:20:29PM +0200, Michal Wajdeczko wrote:
> Hi,
>
> sorry for late comments,

Sure, no problem.

> On 30.09.2024 09:38, Raag Jadav wrote:
> > Introduce device wedged event, which will notify userspace of wedged
> > (hanged/unusable) state of the DRM device through a uevent. This is
> > useful especially in cases where the device is no longer operating as
> > expected even after a hardware reset and has become unrecoverable from
> > driver context.
> >
> > Purpose of this implementation is to provide drivers a generic way to
> > recover with the help of userspace intervention. Different drivers may
> > have different ideas of a "wedged device" depending on their hardware
> > implementation, and hence the vendor agnostic nature of the event.
> > It is up to the drivers to decide when they see the need for recovery
> > and how they want to recover from the available methods.
>
> what about when driver just wants to tell that it is in unusable state,
> but recovery method is unknown or not possible?

Interesting... However, what would be the consumer expectation for it?
If the expectation is to not recover, why send an event at all?

> >
> > Current implementation defines three recovery methods, out of which,
> > drivers can choose to support any one or multiple of them. Preferred
> > recovery method will be sent in the uevent environment as WEDGED=<method>.
>
> could this be something like below instead:
>
> 	WEDGED=<reason>
> 	RECOVERY=<method>[,<method>]
>
> then driver will have a chance to tell what happen _and_ additionally
> provide a hint(s) how to recover from that situation

Documentation/gpu/drm-uapi.rst +337

UMD can issue an ioctl to the KMD to check the reset status
...or <reason> for wedging, which KMD will signify with an error code...
UMD will then proceed to report it to the application using the appropriate
API error code

(should've explicitly added, sorry)

> > Userspace consumers (sysadmin) can define udev rules to parse this event
> > and take respective action to recover the device.
> >
> > =============== ==================================
> > Recovery method Consumer expectations
> > =============== ==================================
> > rebind          unbind + rebind driver
> > bus-reset       unbind + reset bus device + rebind
> > reboot          reboot system
>
> btw, what if driver detects a really broken HW, or has no clue what will
> help here, shouldn't we have a "none" method?

Sure. But same as above, we have to define expectations.

> > =============== ==================================
> >
> > v4: s/drm_dev_wedged/drm_dev_wedged_event
> >     Use drm_info() (Jani)
> >     Kernel doc adjustment (Aravind)
> > v5: Send recovery method with uevent (Lina)
> > v6: Access wedge_recovery_opts[] using helper function (Jani)
> >     Use snprintf() (Jani)
> > v7: Convert recovery helpers into regular functions (Andy, Jani)
> >     Aesthetic adjustments (Andy)
> >     Handle invalid method cases
> >
> > Signed-off-by: Raag Jadav <raag.jadav@intel.com>
> > ---
> >  drivers/gpu/drm/drm_drv.c | 77 +++++++++++++++++++++++++++++++++++++++
> >  include/drm/drm_device.h  | 23 ++++++++++++
> >  include/drm/drm_drv.h     |  3 ++
> >  3 files changed, 103 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
> > index ac30b0ec9d93..cfe9600da2ee 100644
> > --- a/drivers/gpu/drm/drm_drv.c
> > +++ b/drivers/gpu/drm/drm_drv.c
> > @@ -26,6 +26,8 @@
> >   * DEALINGS IN THE SOFTWARE.
> >   */
> >
> > +#include <linux/array_size.h>
> > +#include <linux/build_bug.h>
> >  #include <linux/debugfs.h>
> >  #include <linux/fs.h>
> >  #include <linux/module.h>
> > @@ -33,6 +35,7 @@
> >  #include <linux/mount.h>
> >  #include <linux/pseudo_fs.h>
> >  #include <linux/slab.h>
> > +#include <linux/sprintf.h>
> >  #include <linux/srcu.h>
> >  #include <linux/xarray.h>
> >
> > @@ -70,6 +73,42 @@ static struct dentry *drm_debugfs_root;
> >
> >  DEFINE_STATIC_SRCU(drm_unplug_srcu);
> >
> > +/*
> > + * Available recovery methods for wedged device. To be sent along with device
> > + * wedged uevent.
> > + */
> > +static const char *const drm_wedge_recovery_opts[] = {
> > +	[DRM_WEDGE_RECOVERY_REBIND] = "rebind",
> > +	[DRM_WEDGE_RECOVERY_BUS_RESET] = "bus-reset",
> > +	[DRM_WEDGE_RECOVERY_REBOOT] = "reboot",
> > +};
> > +
> > +static bool drm_wedge_recovery_is_valid(enum drm_wedge_recovery method)
> > +{
> > +	static_assert(ARRAY_SIZE(drm_wedge_recovery_opts) == DRM_WEDGE_RECOVERY_MAX);
> > +
> > +	return method >= DRM_WEDGE_RECOVERY_REBIND && method < DRM_WEDGE_RECOVERY_MAX;
> > +}
> > +
> > +/**
> > + * drm_wedge_recovery_name - provide wedge recovery name
> > + * @method: method to be used for recovery
> > + *
> > + * This validates wedge recovery @method against the available ones in
>
> do we really need to validate an enum?

I'm all for trusting the drivers explicitly, but since this is a core feature
I thought we'd have some guard rails (for abusers).

> maybe the problem is that there is MAX enumerator that just shouldn't be
> there?

With MAX in place we won't need to adjust the helpers to match with enum
modifications in the future (if any).

> > + * drm_wedge_recovery_opts[] and provides respective recovery name in string
> > + * format if found valid.
> > + *
> > + * Returns: pointer to const recovery string on success, NULL otherwise.
> > + */
> > +const char *drm_wedge_recovery_name(enum drm_wedge_recovery method)
> > +{
> > +	if (drm_wedge_recovery_is_valid(method))
> > +		return drm_wedge_recovery_opts[method];
>
> as we only have 3 methods, maybe simple switch() will do the work?

Sure.

> > +
> > +	return NULL;
> > +}
> > +EXPORT_SYMBOL(drm_wedge_recovery_name);
> > +
> >  /*
> >   * DRM Minors
> >   * A DRM device can provide several char-dev interfaces on the DRM-Major. Each
> > @@ -497,6 +536,44 @@ void drm_dev_unplug(struct drm_device *dev)
> >  }
> >  EXPORT_SYMBOL(drm_dev_unplug);
> >
> > +/**
> > + * drm_dev_wedged_event - generate a device wedged uevent
> > + * @dev: DRM device
> > + * @method: method to be used for recovery
> > + *
> > + * This generates a device wedged uevent for the DRM device specified by @dev.
> > + * Recovery @method from drm_wedge_recovery_opts[] (if supprted by the device)
>
> typo

Good catch.

> > + * is sent in the uevent environment as WEDGED=<method>, on the basis of which,
> > + * userspace may take respective action to recover the device.
> > + *
> > + * Returns: 0 on success, or negative error code otherwise.
> > + */
> > +int drm_dev_wedged_event(struct drm_device *dev, enum drm_wedge_recovery method)
> > +{
> > +	/* Event string length up to 16+ characters with available methods */
> > +	char event_string[32] = {};
>
> magic 32 here

Anything to add to the event string length comment above?

> > +	char *envp[] = { event_string, NULL };
> > +	const char *recovery;
> > +
> > +	recovery = drm_wedge_recovery_name(method);
> > +	if (!recovery) {
> > +		drm_err(dev, "device wedged, invalid recovery method %d\n", method);
>
> maybe use drm_WARN() to see who is abusing the API ?

Sure.

> > +		return -EINVAL;
>
> but shouldn't we still trigger an event with "none" recovery?

Explained above.

> > +	}
> > +
> > +	if (!test_bit(method, &dev->wedge_recovery)) {
> > +		drm_err(dev, "device wedged, %s based recovery not supported\n",
> > +			drm_wedge_recovery_name(method));
>
> do we really need this kind of guard? it will be a driver code that will
> call this function, so likely it knows better what will work to recover

Agree, although unsupported method could cause undefined behaviour.

> > +		return -EOPNOTSUPP;
> > +	}
> > +
> > +	snprintf(event_string, sizeof(event_string), "WEDGED=%s", recovery);
> > +
> > +	drm_info(dev, "device wedged, generating uevent for %s based recovery\n", recovery);
>
> nit:
> 	drm_info(dev, "device wedged, needs %s to recover\n", recovery);

Sure.

> > +	return kobject_uevent_env(&dev->primary->kdev->kobj, KOBJ_CHANGE, envp);
> > +}
> > +EXPORT_SYMBOL(drm_dev_wedged_event);
> > +
> >  /*
> >   * DRM internal mount
> >   * We want to be able to allocate our own "struct address_space" to control
> > diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
> > index c91f87b5242d..fed6f20e52fb 100644
> > --- a/include/drm/drm_device.h
> > +++ b/include/drm/drm_device.h
> > @@ -40,6 +40,26 @@ enum switch_power_state {
> >  	DRM_SWITCH_POWER_DYNAMIC_OFF = 3,
> >  };
> >
> > +/**
> > + * enum drm_wedge_recovery - Recovery method for wedged device in order of
> > + * severity. To be set as bit fields in drm_device.wedge_recovery variable.
> > + * Drivers can choose to support any one or multiple of them depending on
> > + * their needs.
> > + */
> > +enum drm_wedge_recovery {
> > +	/** @DRM_WEDGE_RECOVERY_REBIND: unbind + rebind driver */
> > +	DRM_WEDGE_RECOVERY_REBIND,
> > +
> > +	/** @DRM_WEDGE_RECOVERY_BUS_RESET: unbind + reset bus device + rebind */
> > +	DRM_WEDGE_RECOVERY_BUS_RESET,
> > +
> > +	/** @DRM_WEDGE_RECOVERY_REBOOT: reboot system */
> > +	DRM_WEDGE_RECOVERY_REBOOT,
> > +
> > +	/** @DRM_WEDGE_RECOVERY_MAX: for bounds checking, do not use */
> > +	DRM_WEDGE_RECOVERY_MAX
> > +};
> > +
> >  /**
> >   * struct drm_device - DRM device structure
> >   *
> > @@ -317,6 +337,9 @@ struct drm_device {
> >  	 * Root directory for debugfs files.
> >  	 */
> >  	struct dentry *debugfs_root;
> > +
> > +	/** @wedge_recovery: Supported recovery methods for wedged device */
> > +	unsigned long wedge_recovery;
>
> hmm, so before the driver can ask for a reboot as a recovery method from
> wedge it has to somehow add 'reboot' as available method? why it that?

It's for consumers to use as fallbacks in case the preferred recovery method
(sent along with uevent) don't workout. (patch 2/5)

> and if you insist that this is useful then at least document how this
> should be initialized (to not forcing developers to look at
> drm_dev_wedged_event code where it's used)

Sure.

Raag

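On the "magic 32" exchange above, one possible way to make the buffer size self-documenting is to derive it from the prefix and the longest method name, and to check snprintf() for truncation. The following is a userspace sketch only; the macro and function names (WEDGED_PREFIX, WEDGED_METHOD_MAX_LEN, WEDGED_EVENT_LEN, wedged_format_event) are hypothetical and do not appear in the patch.

```c
#include <stdio.h>	/* snprintf, size_t */

/* Hypothetical names, introduced only for this sketch. */
#define WEDGED_PREFIX		"WEDGED="
/* Length of the longest method name among rebind, bus-reset and reboot. */
#define WEDGED_METHOD_MAX_LEN	(sizeof("bus-reset") - 1)
/* sizeof() of the prefix already counts the terminating NUL. */
#define WEDGED_EVENT_LEN	(sizeof(WEDGED_PREFIX) + WEDGED_METHOD_MAX_LEN)

/* Returns 0 on success, -1 if the event string would not fit in @buf. */
int wedged_format_event(char *buf, size_t len, const char *recovery)
{
	int n = snprintf(buf, len, WEDGED_PREFIX "%s", recovery);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

With the three methods in the patch, WEDGED_EVENT_LEN works out to 17 bytes, which matches the "16+ characters" remark in the original comment. snprintf() always NUL-terminates a non-empty buffer, so zero-initializing the buffer beforehand is not required for correctness.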
On Thu, Oct 03, 2024 at 03:23:22PM +0300, Raag Jadav wrote:
> On Tue, Oct 01, 2024 at 02:20:29PM +0200, Michal Wajdeczko wrote:
> > On 30.09.2024 09:38, Raag Jadav wrote:
> > >
> > > +/**
> > > + * enum drm_wedge_recovery - Recovery method for wedged device in order of
> > > + * severity. To be set as bit fields in drm_device.wedge_recovery variable.
> > > + * Drivers can choose to support any one or multiple of them depending on
> > > + * their needs.
> > > + */
> > > +enum drm_wedge_recovery {
> > > +	/** @DRM_WEDGE_RECOVERY_REBIND: unbind + rebind driver */
> > > +	DRM_WEDGE_RECOVERY_REBIND,
> > > +
> > > +	/** @DRM_WEDGE_RECOVERY_BUS_RESET: unbind + reset bus device + rebind */
> > > +	DRM_WEDGE_RECOVERY_BUS_RESET,
> > > +
> > > +	/** @DRM_WEDGE_RECOVERY_REBOOT: reboot system */
> > > +	DRM_WEDGE_RECOVERY_REBOOT,
> > > +
> > > +	/** @DRM_WEDGE_RECOVERY_MAX: for bounds checking, do not use */
> > > +	DRM_WEDGE_RECOVERY_MAX
> > > +};
> > > +
> > >  /**
> > >   * struct drm_device - DRM device structure
> > >   *
> > > @@ -317,6 +337,9 @@ struct drm_device {
> > >  	 * Root directory for debugfs files.
> > >  	 */
> > >  	struct dentry *debugfs_root;
> > > +
> > > +	/** @wedge_recovery: Supported recovery methods for wedged device */
> > > +	unsigned long wedge_recovery;
> >
> > hmm, so before the driver can ask for a reboot as a recovery method from
> > wedge it has to somehow add 'reboot' as available method? why it that?
>
> It's for consumers to use as fallbacks in case the preferred recovery method
> (sent along with uevent) don't workout. (patch 2/5)

On second thought...

Lucas, do we have a convincing enough usecase for fallback recovery?
If <method> were to fail, I would expect there to be even bigger problems
like kernel crash or unrecoverable hardware failure.

At that point is it worth retrying?

Raag

On Tue, Oct 08, 2024 at 06:02:43PM +0300, Raag Jadav wrote: >On Thu, Oct 03, 2024 at 03:23:22PM +0300, Raag Jadav wrote: >> On Tue, Oct 01, 2024 at 02:20:29PM +0200, Michal Wajdeczko wrote: >> > On 30.09.2024 09:38, Raag Jadav wrote: >> > > >> > > +/** >> > > + * enum drm_wedge_recovery - Recovery method for wedged device in order of >> > > + * severity. To be set as bit fields in drm_device.wedge_recovery variable. >> > > + * Drivers can choose to support any one or multiple of them depending on >> > > + * their needs. >> > > + */ >> > > +enum drm_wedge_recovery { >> > > + /** @DRM_WEDGE_RECOVERY_REBIND: unbind + rebind driver */ >> > > + DRM_WEDGE_RECOVERY_REBIND, >> > > + >> > > + /** @DRM_WEDGE_RECOVERY_BUS_RESET: unbind + reset bus device + rebind */ >> > > + DRM_WEDGE_RECOVERY_BUS_RESET, >> > > + >> > > + /** @DRM_WEDGE_RECOVERY_REBOOT: reboot system */ >> > > + DRM_WEDGE_RECOVERY_REBOOT, >> > > + >> > > + /** @DRM_WEDGE_RECOVERY_MAX: for bounds checking, do not use */ >> > > + DRM_WEDGE_RECOVERY_MAX >> > > +}; >> > > + >> > > /** >> > > * struct drm_device - DRM device structure >> > > * >> > > @@ -317,6 +337,9 @@ struct drm_device { >> > > * Root directory for debugfs files. >> > > */ >> > > struct dentry *debugfs_root; >> > > + >> > > + /** @wedge_recovery: Supported recovery methods for wedged device */ >> > > + unsigned long wedge_recovery; >> > >> > hmm, so before the driver can ask for a reboot as a recovery method from >> > wedge it has to somehow add 'reboot' as available method? why it that? >> >> It's for consumers to use as fallbacks in case the preferred recovery method >> (sent along with uevent) don't workout. (patch 2/5) > >On second thought... > >Lucas, do we have a convincing enough usecase for fallback recovery? >If <method> were to fail, I would expect there to be even bigger problems >like kernel crash or unrecoverable hardware failure. > >At that point is it worth retrying? 
when we were talking about this, I brought it up about allowing the
driver to inform what was the supported wedge recovery mechanisms
when the notification is sent. Not to be intended as fallback mechanism.

So if the driver sends a notification with:

	DRM_WEDGE_RECOVERY_REBIND | DRM_WEDGE_RECOVERY_BUS_RESET | DRM_WEDGE_RECOVERY_REBOOT

it means any of these would be suitable, with the first being the option
with less side-effect. I don't think we are advising userspace to use
fallback, just informing what the driver/device supports. Depending on
the error, the driver may leave only

	DRM_WEDGE_RECOVERY_REBOOT

That name could actually be DRM_WEDGE_RECOVERY_NONE. Because at that
state the driver doesn't really know what can be done to recover.
With that we can drop _MAX and use _NONE for bounding check. I think
we can also omit it in the notification as it's clear:

	WEDGED
	DRM_WEDGE_RECOVERY_REBIND | DRM_WEDGE_RECOVERY_BUS_RESET

This means the driver can use any of these options to recover

	WEDGED
	DRM_WEDGE_RECOVERY_BUS_RESET

only bus reset would fix it

	WEDGED

driver doesn't know anything that could fix it. It may be a soft-reboot,
hard-reboot, firmware flashing etc... We just don't know.

Lucas De Marchi
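To make the bitmask idea above concrete, here is a minimal userspace sketch of
how a set of supported methods could be turned into a single uevent string.
Everything prefixed with EX_/ex_ is hypothetical and only for illustration: the
flag values, the comma-separated output format and the "none" spelling for an
empty mask are assumptions of this sketch, not something the patch or the
discussion has settled on.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical bit flags for this sketch; not the enum from the patch. */
#define EX_WEDGE_RECOVERY_REBIND    (1UL << 0)
#define EX_WEDGE_RECOVERY_BUS_RESET (1UL << 1)

/* Names indexed by bit position, least disruptive method first. */
static const char *const ex_wedge_names[] = { "rebind", "bus-reset" };
#define EX_WEDGE_COUNT (sizeof(ex_wedge_names) / sizeof(ex_wedge_names[0]))

/*
 * Build "WEDGED=<m1>,<m2>" from a bitmask of supported methods.
 * An empty mask yields "WEDGED=none" (assumed spelling for "the driver
 * does not know what would fix it"). Unknown bits are ignored.
 * Returns 0 on success, -1 if the buffer is too small.
 */
static int ex_wedged_string(unsigned long methods, char *buf, size_t len)
{
	const char *sep = "";
	size_t used, i;
	int n;

	n = snprintf(buf, len, "WEDGED=");
	if (n < 0 || (size_t)n >= len)
		return -1;
	used = (size_t)n;

	/* Drop bits that have no name in the table. */
	methods &= (1UL << EX_WEDGE_COUNT) - 1;
	if (!methods) {
		n = snprintf(buf + used, len - used, "none");
		return (n < 0 || (size_t)n >= len - used) ? -1 : 0;
	}

	for (i = 0; i < EX_WEDGE_COUNT; i++) {
		if (!(methods & (1UL << i)))
			continue;
		n = snprintf(buf + used, len - used, "%s%s", sep, ex_wedge_names[i]);
		if (n < 0 || (size_t)n >= len - used)
			return -1;
		used += (size_t)n;
		sep = ",";
	}
	return 0;
}
```

With this shape, a consumer only has to split the value on commas and pick the
first method it is willing to perform, which matches the "first being the
option with less side-effect" ordering described above.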
On Thu, Oct 10, 2024 at 08:02:10AM -0500, Lucas De Marchi wrote: > On Tue, Oct 08, 2024 at 06:02:43PM +0300, Raag Jadav wrote: > > On Thu, Oct 03, 2024 at 03:23:22PM +0300, Raag Jadav wrote: > > > On Tue, Oct 01, 2024 at 02:20:29PM +0200, Michal Wajdeczko wrote: > > > > On 30.09.2024 09:38, Raag Jadav wrote: > > > > > > > > > > +/** > > > > > + * enum drm_wedge_recovery - Recovery method for wedged device in order of > > > > > + * severity. To be set as bit fields in drm_device.wedge_recovery variable. > > > > > + * Drivers can choose to support any one or multiple of them depending on > > > > > + * their needs. > > > > > + */ > > > > > +enum drm_wedge_recovery { > > > > > + /** @DRM_WEDGE_RECOVERY_REBIND: unbind + rebind driver */ > > > > > + DRM_WEDGE_RECOVERY_REBIND, > > > > > + > > > > > + /** @DRM_WEDGE_RECOVERY_BUS_RESET: unbind + reset bus device + rebind */ > > > > > + DRM_WEDGE_RECOVERY_BUS_RESET, > > > > > + > > > > > + /** @DRM_WEDGE_RECOVERY_REBOOT: reboot system */ > > > > > + DRM_WEDGE_RECOVERY_REBOOT, > > > > > + > > > > > + /** @DRM_WEDGE_RECOVERY_MAX: for bounds checking, do not use */ > > > > > + DRM_WEDGE_RECOVERY_MAX > > > > > +}; > > > > > + > > > > > /** > > > > > * struct drm_device - DRM device structure > > > > > * > > > > > @@ -317,6 +337,9 @@ struct drm_device { > > > > > * Root directory for debugfs files. > > > > > */ > > > > > struct dentry *debugfs_root; > > > > > + > > > > > + /** @wedge_recovery: Supported recovery methods for wedged device */ > > > > > + unsigned long wedge_recovery; > > > > > > > > hmm, so before the driver can ask for a reboot as a recovery method from > > > > wedge it has to somehow add 'reboot' as available method? why it that? > > > > > > It's for consumers to use as fallbacks in case the preferred recovery method > > > (sent along with uevent) don't workout. (patch 2/5) > > > > On second thought... > > > > Lucas, do we have a convincing enough usecase for fallback recovery? 
> > If <method> were to fail, I would expect there to be even bigger problems > > like kernel crash or unrecoverable hardware failure. > > > > At that point is it worth retrying? > > when we were talking about this, I brought it up about allowing the > driver to inform what was the supported wedge recovery mechanisms > when the notification is sent. Not to be intended as fallback mechanism. > > So if the driver sends a notification with: > > DRM_WEDGE_RECOVERY_REBIND | DRM_WEDGE_RECOVERY_BUS_RESET | DRM_WEDGE_RECOVERY_REBOOT > > it means any of these would be suitable, with the first being the option > with less side-effect. I don't think we are advising userspace to use > fallback, just informing what the driver/device supports. Depending on > the error, the driver may leave only > > DRM_WEDGE_RECOVERY_REBOOT > > That name could actually be DRM_WEDGE_RECOVERY_NONE. Because at that > state the driver doesn't really know what can be done to recover. > With that we can drop _MAX and use _NONE for bounding check. I think > we can also omit it in the notification as it's clear: > > WEDGED > DRM_WEDGE_RECOVERY_REBIND | DRM_WEDGE_RECOVERY_BUS_RESET > > This means the driver can use any of these options to recover > > WEDGED > DRM_WEDGE_RECOVERY_BUS_RESET > > only bus reset would fix it > > WEDGED > > driver doesn't know anything that could fix it. It may be a soft-reboot, > hard-reboot, firmware flashing etc... We just don't know. With this I think we can drop sysfs. (Already too many ABIs to deal with) Raag
On Mon, Sep 30, 2024 at 01:08:41PM +0530, Raag Jadav wrote: > Introduce device wedged event, which will notify userspace of wedged > (hanged/unusable) state of the DRM device through a uevent. This is > useful especially in cases where the device is no longer operating as > expected even after a hardware reset and has become unrecoverable from > driver context. > > Purpose of this implementation is to provide drivers a generic way to > recover with the help of userspace intervention. Different drivers may > have different ideas of a "wedged device" depending on their hardware > implementation, and hence the vendor agnostic nature of the event. > It is up to the drivers to decide when they see the need for recovery > and how they want to recover from the available methods. > > Current implementation defines three recovery methods, out of which, > drivers can choose to support any one or multiple of them. Preferred > recovery method will be sent in the uevent environment as WEDGED=<method>. > Userspace consumers (sysadmin) can define udev rules to parse this event > and take respective action to recover the device. 
> > =============== ================================== > Recovery method Consumer expectations > =============== ================================== > rebind unbind + rebind driver > bus-reset unbind + reset bus device + rebind > reboot reboot system > =============== ================================== > > v4: s/drm_dev_wedged/drm_dev_wedged_event > Use drm_info() (Jani) > Kernel doc adjustment (Aravind) > v5: Send recovery method with uevent (Lina) > v6: Access wedge_recovery_opts[] using helper function (Jani) > Use snprintf() (Jani) > v7: Convert recovery helpers into regular functions (Andy, Jani) > Aesthetic adjustments (Andy) > Handle invalid method cases > > Signed-off-by: Raag Jadav <raag.jadav@intel.com> > --- Cc'ing amd, collabora and others as I found semi-related work at https://lore.kernel.org/dri-devel/20230627132323.115440-1-andrealmeid@igalia.com/ https://lore.kernel.org/amd-gfx/20240725150055.1991893-1-alexander.deucher@amd.com/ https://lore.kernel.org/dri-devel/20241011225906.3789965-3-adrian.larumbe@collabora.com/ https://lore.kernel.org/amd-gfx/CAAxE2A5v_RkZ9ex4=7jiBSKVb22_1FAj0AANBcmKtETt5c3gVA@mail.gmail.com/ Please share feedback about usefulness and adoption of this. Improvements are welcome. Raag > drivers/gpu/drm/drm_drv.c | 77 +++++++++++++++++++++++++++++++++++++++ > include/drm/drm_device.h | 23 ++++++++++++ > include/drm/drm_drv.h | 3 ++ > 3 files changed, 103 insertions(+) > > diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c > index ac30b0ec9d93..cfe9600da2ee 100644 > --- a/drivers/gpu/drm/drm_drv.c > +++ b/drivers/gpu/drm/drm_drv.c > @@ -26,6 +26,8 @@ > * DEALINGS IN THE SOFTWARE. 
> */ > > +#include <linux/array_size.h> > +#include <linux/build_bug.h> > #include <linux/debugfs.h> > #include <linux/fs.h> > #include <linux/module.h> > @@ -33,6 +35,7 @@ > #include <linux/mount.h> > #include <linux/pseudo_fs.h> > #include <linux/slab.h> > +#include <linux/sprintf.h> > #include <linux/srcu.h> > #include <linux/xarray.h> > > @@ -70,6 +73,42 @@ static struct dentry *drm_debugfs_root; > > DEFINE_STATIC_SRCU(drm_unplug_srcu); > > +/* > + * Available recovery methods for wedged device. To be sent along with device > + * wedged uevent. > + */ > +static const char *const drm_wedge_recovery_opts[] = { > + [DRM_WEDGE_RECOVERY_REBIND] = "rebind", > + [DRM_WEDGE_RECOVERY_BUS_RESET] = "bus-reset", > + [DRM_WEDGE_RECOVERY_REBOOT] = "reboot", > +}; > + > +static bool drm_wedge_recovery_is_valid(enum drm_wedge_recovery method) > +{ > + static_assert(ARRAY_SIZE(drm_wedge_recovery_opts) == DRM_WEDGE_RECOVERY_MAX); > + > + return method >= DRM_WEDGE_RECOVERY_REBIND && method < DRM_WEDGE_RECOVERY_MAX; > +} > + > +/** > + * drm_wedge_recovery_name - provide wedge recovery name > + * @method: method to be used for recovery > + * > + * This validates wedge recovery @method against the available ones in > + * drm_wedge_recovery_opts[] and provides respective recovery name in string > + * format if found valid. > + * > + * Returns: pointer to const recovery string on success, NULL otherwise. > + */ > +const char *drm_wedge_recovery_name(enum drm_wedge_recovery method) > +{ > + if (drm_wedge_recovery_is_valid(method)) > + return drm_wedge_recovery_opts[method]; > + > + return NULL; > +} > +EXPORT_SYMBOL(drm_wedge_recovery_name); > + > /* > * DRM Minors > * A DRM device can provide several char-dev interfaces on the DRM-Major. 
Each > @@ -497,6 +536,44 @@ void drm_dev_unplug(struct drm_device *dev) > } > EXPORT_SYMBOL(drm_dev_unplug); > > +/** > + * drm_dev_wedged_event - generate a device wedged uevent > + * @dev: DRM device > + * @method: method to be used for recovery > + * > + * This generates a device wedged uevent for the DRM device specified by @dev. > + * Recovery @method from drm_wedge_recovery_opts[] (if supprted by the device) > + * is sent in the uevent environment as WEDGED=<method>, on the basis of which, > + * userspace may take respective action to recover the device. > + * > + * Returns: 0 on success, or negative error code otherwise. > + */ > +int drm_dev_wedged_event(struct drm_device *dev, enum drm_wedge_recovery method) > +{ > + /* Event string length up to 16+ characters with available methods */ > + char event_string[32] = {}; > + char *envp[] = { event_string, NULL }; > + const char *recovery; > + > + recovery = drm_wedge_recovery_name(method); > + if (!recovery) { > + drm_err(dev, "device wedged, invalid recovery method %d\n", method); > + return -EINVAL; > + } > + > + if (!test_bit(method, &dev->wedge_recovery)) { > + drm_err(dev, "device wedged, %s based recovery not supported\n", > + drm_wedge_recovery_name(method)); > + return -EOPNOTSUPP; > + } > + > + snprintf(event_string, sizeof(event_string), "WEDGED=%s", recovery); > + > + drm_info(dev, "device wedged, generating uevent for %s based recovery\n", recovery); > + return kobject_uevent_env(&dev->primary->kdev->kobj, KOBJ_CHANGE, envp); > +} > +EXPORT_SYMBOL(drm_dev_wedged_event); > + > /* > * DRM internal mount > * We want to be able to allocate our own "struct address_space" to control > diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h > index c91f87b5242d..fed6f20e52fb 100644 > --- a/include/drm/drm_device.h > +++ b/include/drm/drm_device.h > @@ -40,6 +40,26 @@ enum switch_power_state { > DRM_SWITCH_POWER_DYNAMIC_OFF = 3, > }; > > +/** > + * enum drm_wedge_recovery - Recovery method for 
wedged device in order of > + * severity. To be set as bit fields in drm_device.wedge_recovery variable. > + * Drivers can choose to support any one or multiple of them depending on > + * their needs. > + */ > +enum drm_wedge_recovery { > + /** @DRM_WEDGE_RECOVERY_REBIND: unbind + rebind driver */ > + DRM_WEDGE_RECOVERY_REBIND, > + > + /** @DRM_WEDGE_RECOVERY_BUS_RESET: unbind + reset bus device + rebind */ > + DRM_WEDGE_RECOVERY_BUS_RESET, > + > + /** @DRM_WEDGE_RECOVERY_REBOOT: reboot system */ > + DRM_WEDGE_RECOVERY_REBOOT, > + > + /** @DRM_WEDGE_RECOVERY_MAX: for bounds checking, do not use */ > + DRM_WEDGE_RECOVERY_MAX > +}; > + > /** > * struct drm_device - DRM device structure > * > @@ -317,6 +337,9 @@ struct drm_device { > * Root directory for debugfs files. > */ > struct dentry *debugfs_root; > + > + /** @wedge_recovery: Supported recovery methods for wedged device */ > + unsigned long wedge_recovery; > }; > > #endif > diff --git a/include/drm/drm_drv.h b/include/drm/drm_drv.h > index 02ea4e3248fd..d8dbc77010b0 100644 > --- a/include/drm/drm_drv.h > +++ b/include/drm/drm_drv.h > @@ -462,6 +462,9 @@ bool drm_dev_enter(struct drm_device *dev, int *idx); > void drm_dev_exit(int idx); > void drm_dev_unplug(struct drm_device *dev); > > +const char *drm_wedge_recovery_name(enum drm_wedge_recovery method); > +int drm_dev_wedged_event(struct drm_device *dev, enum drm_wedge_recovery method); > + > /** > * drm_dev_is_unplugged - is a DRM device unplugged > * @dev: DRM device > -- > 2.34.1 >
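For anyone looking at this from the consumer side, the WEDGED=<method> string
from the patch above could be parsed in userspace roughly as follows. This is
only a sketch: the ex_ names and the enum are hypothetical, and only the three
method strings ("rebind", "bus-reset", "reboot") are taken from the patch.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical consumer-side view of the recovery methods. */
enum ex_wedge_method {
	EX_METHOD_INVALID = -1,
	EX_METHOD_REBIND,
	EX_METHOD_BUS_RESET,
	EX_METHOD_REBOOT,
};

/*
 * Parse one uevent environment entry of the form "WEDGED=<method>".
 * Returns the matching method, or EX_METHOD_INVALID if the entry is not
 * a WEDGED key or carries a method string this sketch does not know.
 */
static enum ex_wedge_method ex_parse_wedged(const char *env)
{
	static const char prefix[] = "WEDGED=";
	static const char *const names[] = { "rebind", "bus-reset", "reboot" };
	size_t i;

	if (!env || strncmp(env, prefix, sizeof(prefix) - 1) != 0)
		return EX_METHOD_INVALID;

	/* Skip the key and compare the value against the known methods. */
	env += sizeof(prefix) - 1;
	for (i = 0; i < sizeof(names) / sizeof(names[0]); i++) {
		if (strcmp(env, names[i]) == 0)
			return (enum ex_wedge_method)i;
	}
	return EX_METHOD_INVALID;
}
```

A udev helper or daemon would call this on each environment entry of the
change event and dispatch on the result, treating EX_METHOD_INVALID as
"do nothing".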
On 17.10.24 at 04:47, Raag Jadav wrote:
> On Mon, Sep 30, 2024 at 01:08:41PM +0530, Raag Jadav wrote:
>> Introduce device wedged event, which will notify userspace of wedged
>> (hanged/unusable) state of the DRM device through a uevent. This is
>> useful especially in cases where the device is no longer operating as
>> expected even after a hardware reset and has become unrecoverable from
>> driver context.

Well, "introduce" is probably the wrong wording since i915 already has that and
amdgpu looked into it but never upstreamed the support.

I would rather say standardize.

>>
>> Purpose of this implementation is to provide drivers a generic way to
>> recover with the help of userspace intervention. Different drivers may
>> have different ideas of a "wedged device" depending on their hardware
>> implementation, and hence the vendor agnostic nature of the event.
>> It is up to the drivers to decide when they see the need for recovery
>> and how they want to recover from the available methods.
>>
>> Current implementation defines three recovery methods, out of which,
>> drivers can choose to support any one or multiple of them. Preferred
>> recovery method will be sent in the uevent environment as WEDGED=<method>.
>> Userspace consumers (sysadmin) can define udev rules to parse this event
>> and take respective action to recover the device.
>>
>> =============== ==================================
>> Recovery method Consumer expectations
>> =============== ==================================
>> rebind          unbind + rebind driver
>> bus-reset       unbind + reset bus device + rebind
>> reboot          reboot system
>> =============== ==================================

Well that sounds like userspace would need to be involved in recovery.

That in turn is a complete no-go since we at least need to signal all
dma_fences to unblock the kernel. In other words things like bus reset need
to happen inside the kernel and *not* in userspace.

What we can do is to signal to userspace: Hey a bus reset of device X
happened, maybe restart container, daemon, whatever service which was using
this device.

Regards,
Christian.
On Thu, Oct 17, 2024 at 09:59:10AM +0200, Christian König wrote: > Am 17.10.24 um 04:47 schrieb Raag Jadav: > > On Mon, Sep 30, 2024 at 01:08:41PM +0530, Raag Jadav wrote: > > > Introduce device wedged event, which will notify userspace of wedged > > > (hanged/unusable) state of the DRM device through a uevent. This is > > > useful especially in cases where the device is no longer operating as > > > expected even after a hardware reset and has become unrecoverable from > > > driver context. > > Well introduce is probably the wrong wording since i915 already has that and > amdgpu looked into it but never upstreamed the support. in i915 we have the reset and error uevents, but not one specific for 'wedge'. This would indeed be a new one. > > I would rather say standardize. > > > > > > > Purpose of this implementation is to provide drivers a generic way to > > > recover with the help of userspace intervention. Different drivers may > > > have different ideas of a "wedged device" depending on their hardware > > > implementation, and hence the vendor agnostic nature of the event. > > > It is up to the drivers to decide when they see the need for recovery > > > and how they want to recover from the available methods. > > > > > > Current implementation defines three recovery methods, out of which, > > > drivers can choose to support any one or multiple of them. Preferred > > > recovery method will be sent in the uevent environment as WEDGED=<method>. > > > Userspace consumers (sysadmin) can define udev rules to parse this event > > > and take respective action to recover the device. 
> > >
> > > =============== ==================================
> > > Recovery method Consumer expectations
> > > =============== ==================================
> > > rebind          unbind + rebind driver
> > > bus-reset       unbind + reset bus device + rebind
> > > reboot          reboot system
> > > =============== ==================================
>
> Well that sounds like userspace would need to be involved in recovery.
>
> That in turn is a complete no-go since we at least need to signal all
> dma_fences to unblock the kernel. In other words things like bus reset needs
> to happen inside the kernel and *not* in userspace.
>
> What we can do is to signal to userspace: Hey a bus reset of device X
> happened, maybe restart container, daemon, whatever service which was using
> this device.

Well, when we declare a device 'wedged' it is because we don't want to take
any drastic measures inside the kernel and want to leave it in a protected
and unusable state. In a way that users wouldn't lose display for instance,
or at least the device is in a debuggable state.

Then, the instruction here is to tell what could possibly be attempted
from userspace to get the device to a usable state.

The 'wedge' mode (the one emitting this uevent) needs to be responsible for
signaling all the fences and everything needed for a clean unbind and whatever
next step might be indicated to userspace.

That should already be part of any wedged mode, regardless of the uevent to
inform the userspace here.

>
> Regards,
> Christian.
> > > > > > > v4: s/drm_dev_wedged/drm_dev_wedged_event > > > Use drm_info() (Jani) > > > Kernel doc adjustment (Aravind) > > > v5: Send recovery method with uevent (Lina) > > > v6: Access wedge_recovery_opts[] using helper function (Jani) > > > Use snprintf() (Jani) > > > v7: Convert recovery helpers into regular functions (Andy, Jani) > > > Aesthetic adjustments (Andy) > > > Handle invalid method cases > > > > > > Signed-off-by: Raag Jadav <raag.jadav@intel.com> > > > --- > > Cc'ing amd, collabora and others as I found semi-related work at > > > > https://lore.kernel.org/dri-devel/20230627132323.115440-1-andrealmeid@igalia.com/ > > https://lore.kernel.org/amd-gfx/20240725150055.1991893-1-alexander.deucher@amd.com/ > > https://lore.kernel.org/dri-devel/20241011225906.3789965-3-adrian.larumbe@collabora.com/ > > https://lore.kernel.org/amd-gfx/CAAxE2A5v_RkZ9ex4=7jiBSKVb22_1FAj0AANBcmKtETt5c3gVA@mail.gmail.com/ > > > > > > Please share feedback about usefulness and adoption of this. > > Improvements are welcome. > > > > Raag > > > > > drivers/gpu/drm/drm_drv.c | 77 +++++++++++++++++++++++++++++++++++++++ > > > include/drm/drm_device.h | 23 ++++++++++++ > > > include/drm/drm_drv.h | 3 ++ > > > 3 files changed, 103 insertions(+) > > > > > > diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c > > > index ac30b0ec9d93..cfe9600da2ee 100644 > > > --- a/drivers/gpu/drm/drm_drv.c > > > +++ b/drivers/gpu/drm/drm_drv.c > > > @@ -26,6 +26,8 @@ > > > * DEALINGS IN THE SOFTWARE. 
> > > */ > > > +#include <linux/array_size.h> > > > +#include <linux/build_bug.h> > > > #include <linux/debugfs.h> > > > #include <linux/fs.h> > > > #include <linux/module.h> > > > @@ -33,6 +35,7 @@ > > > #include <linux/mount.h> > > > #include <linux/pseudo_fs.h> > > > #include <linux/slab.h> > > > +#include <linux/sprintf.h> > > > #include <linux/srcu.h> > > > #include <linux/xarray.h> > > > @@ -70,6 +73,42 @@ static struct dentry *drm_debugfs_root; > > > DEFINE_STATIC_SRCU(drm_unplug_srcu); > > > +/* > > > + * Available recovery methods for wedged device. To be sent along with device > > > + * wedged uevent. > > > + */ > > > +static const char *const drm_wedge_recovery_opts[] = { > > > + [DRM_WEDGE_RECOVERY_REBIND] = "rebind", > > > + [DRM_WEDGE_RECOVERY_BUS_RESET] = "bus-reset", > > > + [DRM_WEDGE_RECOVERY_REBOOT] = "reboot", > > > +}; > > > + > > > +static bool drm_wedge_recovery_is_valid(enum drm_wedge_recovery method) > > > +{ > > > + static_assert(ARRAY_SIZE(drm_wedge_recovery_opts) == DRM_WEDGE_RECOVERY_MAX); > > > + > > > + return method >= DRM_WEDGE_RECOVERY_REBIND && method < DRM_WEDGE_RECOVERY_MAX; > > > +} > > > + > > > +/** > > > + * drm_wedge_recovery_name - provide wedge recovery name > > > + * @method: method to be used for recovery > > > + * > > > + * This validates wedge recovery @method against the available ones in > > > + * drm_wedge_recovery_opts[] and provides respective recovery name in string > > > + * format if found valid. > > > + * > > > + * Returns: pointer to const recovery string on success, NULL otherwise. > > > + */ > > > +const char *drm_wedge_recovery_name(enum drm_wedge_recovery method) > > > +{ > > > + if (drm_wedge_recovery_is_valid(method)) > > > + return drm_wedge_recovery_opts[method]; > > > + > > > + return NULL; > > > +} > > > +EXPORT_SYMBOL(drm_wedge_recovery_name); > > > + > > > /* > > > * DRM Minors > > > * A DRM device can provide several char-dev interfaces on the DRM-Major. 
Each > > > @@ -497,6 +536,44 @@ void drm_dev_unplug(struct drm_device *dev) > > > } > > > EXPORT_SYMBOL(drm_dev_unplug); > > > +/** > > > + * drm_dev_wedged_event - generate a device wedged uevent > > > + * @dev: DRM device > > > + * @method: method to be used for recovery > > > + * > > > + * This generates a device wedged uevent for the DRM device specified by @dev. > > > + * Recovery @method from drm_wedge_recovery_opts[] (if supprted by the device) > > > + * is sent in the uevent environment as WEDGED=<method>, on the basis of which, > > > + * userspace may take respective action to recover the device. > > > + * > > > + * Returns: 0 on success, or negative error code otherwise. > > > + */ > > > +int drm_dev_wedged_event(struct drm_device *dev, enum drm_wedge_recovery method) > > > +{ > > > + /* Event string length up to 16+ characters with available methods */ > > > + char event_string[32] = {}; > > > + char *envp[] = { event_string, NULL }; > > > + const char *recovery; > > > + > > > + recovery = drm_wedge_recovery_name(method); > > > + if (!recovery) { > > > + drm_err(dev, "device wedged, invalid recovery method %d\n", method); > > > + return -EINVAL; > > > + } > > > + > > > + if (!test_bit(method, &dev->wedge_recovery)) { > > > + drm_err(dev, "device wedged, %s based recovery not supported\n", > > > + drm_wedge_recovery_name(method)); > > > + return -EOPNOTSUPP; > > > + } > > > + > > > + snprintf(event_string, sizeof(event_string), "WEDGED=%s", recovery); > > > + > > > + drm_info(dev, "device wedged, generating uevent for %s based recovery\n", recovery); > > > + return kobject_uevent_env(&dev->primary->kdev->kobj, KOBJ_CHANGE, envp); > > > +} > > > +EXPORT_SYMBOL(drm_dev_wedged_event); > > > + > > > /* > > > * DRM internal mount > > > * We want to be able to allocate our own "struct address_space" to control > > > diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h > > > index c91f87b5242d..fed6f20e52fb 100644 > > > --- a/include/drm/drm_device.h 
> > > +++ b/include/drm/drm_device.h > > > @@ -40,6 +40,26 @@ enum switch_power_state { > > > DRM_SWITCH_POWER_DYNAMIC_OFF = 3, > > > }; > > > +/** > > > + * enum drm_wedge_recovery - Recovery method for wedged device in order of > > > + * severity. To be set as bit fields in drm_device.wedge_recovery variable. > > > + * Drivers can choose to support any one or multiple of them depending on > > > + * their needs. > > > + */ > > > +enum drm_wedge_recovery { > > > + /** @DRM_WEDGE_RECOVERY_REBIND: unbind + rebind driver */ > > > + DRM_WEDGE_RECOVERY_REBIND, > > > + > > > + /** @DRM_WEDGE_RECOVERY_BUS_RESET: unbind + reset bus device + rebind */ > > > + DRM_WEDGE_RECOVERY_BUS_RESET, > > > + > > > + /** @DRM_WEDGE_RECOVERY_REBOOT: reboot system */ > > > + DRM_WEDGE_RECOVERY_REBOOT, > > > + > > > + /** @DRM_WEDGE_RECOVERY_MAX: for bounds checking, do not use */ > > > + DRM_WEDGE_RECOVERY_MAX > > > +}; > > > + > > > /** > > > * struct drm_device - DRM device structure > > > * > > > @@ -317,6 +337,9 @@ struct drm_device { > > > * Root directory for debugfs files. > > > */ > > > struct dentry *debugfs_root; > > > + > > > + /** @wedge_recovery: Supported recovery methods for wedged device */ > > > + unsigned long wedge_recovery; > > > }; > > > #endif > > > diff --git a/include/drm/drm_drv.h b/include/drm/drm_drv.h > > > index 02ea4e3248fd..d8dbc77010b0 100644 > > > --- a/include/drm/drm_drv.h > > > +++ b/include/drm/drm_drv.h > > > @@ -462,6 +462,9 @@ bool drm_dev_enter(struct drm_device *dev, int *idx); > > > void drm_dev_exit(int idx); > > > void drm_dev_unplug(struct drm_device *dev); > > > +const char *drm_wedge_recovery_name(enum drm_wedge_recovery method); > > > +int drm_dev_wedged_event(struct drm_device *dev, enum drm_wedge_recovery method); > > > + > > > /** > > > * drm_dev_is_unplugged - is a DRM device unplugged > > > * @dev: DRM device > > > -- > > > 2.34.1 > > > >
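For readers following the quoted hunk, the two checks that drm_dev_wedged_event() makes before generating the uevent (a valid method, then a method the device has advertised in its wedge_recovery bitmask) can be modelled in plain userspace C. This is only a sketch under assumptions: the enum is copied from the patch, while recovery_opts, recovery_name() and wedged_event_check() are hypothetical stand-ins, and a plain bit mask replaces the kernel's test_bit().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Userspace copy of the enum from the patch, for illustration only. */
enum drm_wedge_recovery {
	DRM_WEDGE_RECOVERY_REBIND,
	DRM_WEDGE_RECOVERY_BUS_RESET,
	DRM_WEDGE_RECOVERY_REBOOT,
	DRM_WEDGE_RECOVERY_MAX
};

/* Hypothetical stand-in for drm_wedge_recovery_opts[]. */
static const char *const recovery_opts[] = {
	[DRM_WEDGE_RECOVERY_REBIND] = "rebind",
	[DRM_WEDGE_RECOVERY_BUS_RESET] = "bus-reset",
	[DRM_WEDGE_RECOVERY_REBOOT] = "reboot",
};
/* Compile-time check kept next to the table it guards. */
static_assert(sizeof(recovery_opts) / sizeof(recovery_opts[0]) ==
	      DRM_WEDGE_RECOVERY_MAX, "recovery_opts out of sync with enum");

/* Return the method name, or NULL if the method is out of range. */
static const char *recovery_name(int method)
{
	if (method >= 0 && method < DRM_WEDGE_RECOVERY_MAX)
		return recovery_opts[method];
	return NULL;
}

/* Model of the two checks made before the uevent is generated. */
static int wedged_event_check(unsigned long supported, int method)
{
	if (!recovery_name(method))
		return -EINVAL;		/* unknown method */
	if (!(supported & (1UL << method)))
		return -EOPNOTSUPP;	/* not advertised by the device */
	return 0;
}
```

A driver that advertises only rebind and reboot would pass the check for those two methods and get -EOPNOTSUPP for bus-reset.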
Hi Raag,

On 30/09/2024 04:38, Raag Jadav wrote:
> Introduce device wedged event, which will notify userspace of wedged
> (hanged/unusable) state of the DRM device through a uevent. This is
> useful especially in cases where the device is no longer operating as
> expected even after a hardware reset and has become unrecoverable from
> driver context.
>
> Purpose of this implementation is to provide drivers a generic way to
> recover with the help of userspace intervention. Different drivers may
> have different ideas of a "wedged device" depending on their hardware
> implementation, and hence the vendor agnostic nature of the event.
> It is up to the drivers to decide when they see the need for recovery
> and how they want to recover from the available methods.
>
> Current implementation defines three recovery methods, out of which,
> drivers can choose to support any one or multiple of them. Preferred
> recovery method will be sent in the uevent environment as WEDGED=<method>.
> Userspace consumers (sysadmin) can define udev rules to parse this event
> and take respective action to recover the device.
>
> =============== ==================================
> Recovery method Consumer expectations
> =============== ==================================
> rebind          unbind + rebind driver
> bus-reset       unbind + reset bus device + rebind
> reboot          reboot system
> =============== ==================================
>

I proposed something similar in the past:
https://lore.kernel.org/dri-devel/20221125175203.52481-1-andrealmeid@igalia.com/

The motivation was that amdgpu was getting stuck after every GPU reset, and
there was just a black screen. The uevent would then trigger a daemon to
reset the compositor and get things back together. As you can see in my
thread, the feature was blocked in favor of getting better overall GPU reset
from the kernel side.

Which kinds of scenarios make i915/xe need userspace involvement? I tested a
bunch of resets in i915 but never managed to get the driver stuck.

For the bus-reset, amdgpu does that too, but it doesn't require userspace
intervention.
On 17.10.24 at 18:43, Rodrigo Vivi wrote:
> On Thu, Oct 17, 2024 at 09:59:10AM +0200, Christian König wrote:
>>>> Purpose of this implementation is to provide drivers a generic way to
>>>> recover with the help of userspace intervention. Different drivers may
>>>> have different ideas of a "wedged device" depending on their hardware
>>>> implementation, and hence the vendor agnostic nature of the event.
>>>> It is up to the drivers to decide when they see the need for recovery
>>>> and how they want to recover from the available methods.
>>>>
>>>> Current implementation defines three recovery methods, out of which,
>>>> drivers can choose to support any one or multiple of them. Preferred
>>>> recovery method will be sent in the uevent environment as WEDGED=<method>.
>>>> Userspace consumers (sysadmin) can define udev rules to parse this event
>>>> and take respective action to recover the device.
>>>>
>>>> =============== ==================================
>>>> Recovery method Consumer expectations
>>>> =============== ==================================
>>>> rebind          unbind + rebind driver
>>>> bus-reset       unbind + reset bus device + rebind
>>>> reboot          reboot system
>>>> =============== ==================================
>> Well that sounds like userspace would need to be involved in recovery.
>>
>> That in turn is a complete no-go since we at least need to signal all
>> dma_fences to unblock the kernel. In other words things like bus reset needs
>> to happen inside the kernel and *not* in userspace.
>>
>> What we can do is to signal to userspace: Hey a bus reset of device X
>> happened, maybe restart container, daemon, whatever service which was using
>> this device.
> Well, when we declare device 'wedged' it is because we don't want to take
> any drastic measures inside the kernel and want to leave it in a protected
> and unusable state. In a way that users wouldn't lose display for instance,
> or at least the device is in a debugable state.

Uff, that needs to be very very well documented or otherwise the whole
approach is an absolutely clear NAK from my side as DMA-buf maintainer.

>
> Then, the instructions here is to tell what could possibly be attempted
> from userspace to get the device to an usable state.
>
> The 'wedge' mode (the one emiting this uevent) needs to be responsible
> for signaling all the fences and everything needed for a clean unbind
> and whatever next step might be indicated to userspace.
>
> That should already be part of any wedged mode, regardless the uevent
> to inform the userspace here.

You need to approach that from a different side. With the current patch set
you are ignoring documented mandatory driver behavior as far as I can see.

So first of all describe in the documentation what the wedged mode is and
what requirements a driver has to fulfill to enter it:
https://docs.kernel.org/gpu/drm-uapi.html#device-reset

Especially document that all system memory accesses of the device need to
be blocked by (for example) disabling DMA accesses in the PCI config space.

When it is guaranteed that the device can't access any system memory any
more the device driver should signal all pending fences of this device.

And only after all of that is done the driver can send a uevent to inform
userspace that it can debug the hanged state.

As far as I can see this makes the enum how to recover the device
superfluous because you will most likely always need a bus reset to get out
of this again.

Regards,
Christian.
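The ordering asked for above (block system memory access, then signal all pending fences, and only then send the uevent) can be written down as a small state model. This is a toy userspace sketch, not kernel code: struct wedge_model and the model_*() helpers are hypothetical names invented for illustration, and the -EBUSY return for an out-of-order step is an assumption of the sketch.

```c
#include <errno.h>
#include <stdbool.h>

/* Toy model of a device entering wedged mode; not kernel code. */
struct wedge_model {
	bool dma_blocked;		/* device can no longer reach system memory */
	unsigned int pending_fences;	/* fences not yet signalled */
	bool uevent_sent;		/* userspace has been notified */
};

/* Step 1: cut off system memory access (e.g. DMA disabled in PCI config space). */
static void model_block_dma(struct wedge_model *m)
{
	m->dma_blocked = true;
}

/* Step 2: signal all pending fences; only allowed once DMA is blocked. */
static int model_signal_fences(struct wedge_model *m)
{
	if (!m->dma_blocked)
		return -EBUSY;
	m->pending_fences = 0;
	return 0;
}

/* Step 3: notify userspace; only allowed after steps 1 and 2 are done. */
static int model_send_uevent(struct wedge_model *m)
{
	if (!m->dma_blocked || m->pending_fences)
		return -EBUSY;
	m->uevent_sent = true;
	return 0;
}
```

In this model, sending the uevent first, or signalling fences while the device can still reach memory, is refused rather than silently reordered.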
On Fri, Oct 18, 2024 at 12:58:09PM +0200, Christian König wrote: > Am 17.10.24 um 18:43 schrieb Rodrigo Vivi: > > On Thu, Oct 17, 2024 at 09:59:10AM +0200, Christian König wrote: > > > > > Purpose of this implementation is to provide drivers a generic way to > > > > > recover with the help of userspace intervention. Different drivers may > > > > > have different ideas of a "wedged device" depending on their hardware > > > > > implementation, and hence the vendor agnostic nature of the event. > > > > > It is up to the drivers to decide when they see the need for recovery > > > > > and how they want to recover from the available methods. > > > > > > > > > > Current implementation defines three recovery methods, out of which, > > > > > drivers can choose to support any one or multiple of them. Preferred > > > > > recovery method will be sent in the uevent environment as WEDGED=<method>. > > > > > Userspace consumers (sysadmin) can define udev rules to parse this event > > > > > and take respective action to recover the device. > > > > > > > > > > =============== ================================== > > > > > Recovery method Consumer expectations > > > > > =============== ================================== > > > > > rebind unbind + rebind driver > > > > > bus-reset unbind + reset bus device + rebind > > > > > reboot reboot system > > > > > =============== ================================== > > > Well that sounds like userspace would need to be involved in recovery. > > > > > > That in turn is a complete no-go since we at least need to signal all > > > dma_fences to unblock the kernel. In other words things like bus reset needs > > > to happen inside the kernel and *not* in userspace. > > > > > > What we can do is to signal to userspace: Hey a bus reset of device X > > > happened, maybe restart container, daemon, whatever service which was using > > > this device. 
> > Well, when we declare device 'wedged' it is because we don't want to take > > any drastic measures inside the kernel and want to leave it in a protected > > and unusable state. In a way that users wouldn't lose display for instance, > > or at least the device is in a debugable state. > > Uff, that needs to be very very well documented or otherwise the whole > approach is an absolutely clear NAK from my side as DMA-buf maintainer. > > > > > Then, the instructions here is to tell what could possibly be attempted > > from userspace to get the device to an usable state. > > > > The 'wedge' mode (the one emiting this uevent) needs to be responsible > > for signaling all the fences and everything needed for a clean unbind > > and whatever next step might be indicated to userspace. > > > > That should already be part of any wedged mode, regardless the uevent > > to inform the userspace here. > > You need to approach that from a different side. With the current patch set > you are ignoring documented mandatory driver behavior as far as I can see. > > So first of all describe in the documentation what the wedged mode is and > what requirements a driver has to fulfill to enter it: > https://docs.kernel.org/gpu/drm-uapi.html#device-reset > > Especially document that all system memory accesses of the device needs to > be blocked by (for example) disabling DMA accesses in the PCI config space. > > When it is guaranteed that the device can't access any system memory any > more the device driver should signal all pending fences of this device. > > And only after all of that is done the driver can send an uevent to inform > userspace that it can debug the hanged state. Sure, will do. > As far as I can see this makes the enum how to recover the device > superfluous because you will most likely always need a bus reset to get out > of this again. That depends on the kind of fault the device has encountered and the bus it is sitting on. 
There could be buses that don't support reset. Raag
On 18.10.24 at 14:46, Raag Jadav wrote:
>> As far as I can see this makes the enum how to recover the device
>> superfluous because you will most likely always need a bus reset to get out
>> of this again.
> That depends on the kind of fault the device has encountered and the bus it is
> sitting on. There could be buses that don't support reset.

That is even more an argument to not expose this in the uevent.

Getting the device working again is strongly device dependent and can't be
handled in a generic way.

Regards,
Christian.

>
> Raag
On Fri, Oct 18, 2024 at 02:54:38PM +0200, Christian König wrote:
> On 18.10.24 at 14:46, Raag Jadav wrote:
> > > As far as I can see this makes the enum how to recover the device
> > > superfluous because you will most likely always need a bus reset to get out
> > > of this again.
> > That depends on the kind of fault the device has encountered and the bus it is
> > sitting on. There could be buses that don't support reset.
>
> That is even more an argument to not expose this in the uevent.
>
> Getting the device working again is strongly device dependent and can't be
> handled in a generic way.

My understanding is that the proposed methods can be handled in a generic way
and are useful for the devices that do support it. This way the userspace can
at least have a hint about recovery.

For others we can have something like WEDGED=none (as proposed by Michal and
Lucas in other threads) and let admin/user decide how to deal with it.

Raag
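To make the userspace side of this exchange concrete, here is a minimal sketch of how a consumer might map the WEDGED=<method> entry from the uevent environment to an action. The names enum wedge_action and parse_wedged_env() are hypothetical and not part of the series. The "none" value is the one suggested in other threads and is not implemented by this patch; it is included only to show how a no-hint case could be handled.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical userspace-side actions; names are illustrative only. */
enum wedge_action {
	WEDGE_ACTION_UNKNOWN = -1,	/* not a WEDGED= entry, or unrecognised method */
	WEDGE_ACTION_NONE,		/* "none": proposed elsewhere, not in this patch */
	WEDGE_ACTION_REBIND,		/* unbind + rebind driver */
	WEDGE_ACTION_BUS_RESET,		/* unbind + reset bus device + rebind */
	WEDGE_ACTION_REBOOT,		/* reboot system */
};

/* Map one uevent environment entry such as "WEDGED=rebind" to an action. */
static enum wedge_action parse_wedged_env(const char *env)
{
	static const char prefix[] = "WEDGED=";
	static const struct {
		const char *name;
		enum wedge_action action;
	} methods[] = {
		{ "none", WEDGE_ACTION_NONE },
		{ "rebind", WEDGE_ACTION_REBIND },
		{ "bus-reset", WEDGE_ACTION_BUS_RESET },
		{ "reboot", WEDGE_ACTION_REBOOT },
	};
	size_t i;

	/* Reject entries that do not start with the WEDGED= key. */
	if (!env || strncmp(env, prefix, sizeof(prefix) - 1) != 0)
		return WEDGE_ACTION_UNKNOWN;

	env += sizeof(prefix) - 1;
	for (i = 0; i < sizeof(methods) / sizeof(methods[0]); i++) {
		if (strcmp(env, methods[i].name) == 0)
			return methods[i].action;
	}
	return WEDGE_ACTION_UNKNOWN;
}
```

Treating an unrecognised method as WEDGE_ACTION_UNKNOWN rather than guessing leaves the policy decision with the admin/user, in line with the suggestion above.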
On Thu, Oct 17, 2024 at 04:16:09PM -0300, André Almeida wrote: > Hi Raag, > > Em 30/09/2024 04:38, Raag Jadav escreveu: > > Introduce device wedged event, which will notify userspace of wedged > > (hanged/unusable) state of the DRM device through a uevent. This is > > useful especially in cases where the device is no longer operating as > > expected even after a hardware reset and has become unrecoverable from > > driver context. > > > > Purpose of this implementation is to provide drivers a generic way to > > recover with the help of userspace intervention. Different drivers may > > have different ideas of a "wedged device" depending on their hardware > > implementation, and hence the vendor agnostic nature of the event. > > It is up to the drivers to decide when they see the need for recovery > > and how they want to recover from the available methods. > > > > Current implementation defines three recovery methods, out of which, > > drivers can choose to support any one or multiple of them. Preferred > > recovery method will be sent in the uevent environment as WEDGED=<method>. > > Userspace consumers (sysadmin) can define udev rules to parse this event > > and take respective action to recover the device. > > > > =============== ================================== > > Recovery method Consumer expectations > > =============== ================================== > > rebind unbind + rebind driver > > bus-reset unbind + reset bus device + rebind > > reboot reboot system > > =============== ================================== > > > > > > I proposed something similar in the past: https://lore.kernel.org/dri-devel/20221125175203.52481-1-andrealmeid@igalia.com/ > > The motivation was that amdgpu was getting stuck after every GPU reset, and > there was just a black screen. The uevent would then trigger a daemon to > reset the compositor and getting things back together. 
> As you can see in my
> thread, the feature was blocked in favor of getting better overall GPU reset
> from the kernel side.
>
> Which kind of scenarios are making i915/xe the need to have userspace
> involvement? I tested a bunch of resets in i915 but never managed to get the
> driver stuck.

2 scenarios:

1. Multiple levels of reset has failed and device was declared wedged. This is
   rare indeed as the resets improved a lot.
2. Debug case. We can boot the driver with option to declare device wedged at
   any timeout, so the device can be debugged.

>
> For the bus-reset, amdgpu does that too, but it doesn't require userspace
> intervention.

How do you trigger that?
On Fri, Oct 18, 2024 at 11:23 AM Rodrigo Vivi <rodrigo.vivi@intel.com> wrote: > > On Thu, Oct 17, 2024 at 04:16:09PM -0300, André Almeida wrote: > > Hi Raag, > > > > Em 30/09/2024 04:38, Raag Jadav escreveu: > > > Introduce device wedged event, which will notify userspace of wedged > > > (hanged/unusable) state of the DRM device through a uevent. This is > > > useful especially in cases where the device is no longer operating as > > > expected even after a hardware reset and has become unrecoverable from > > > driver context. > > > > > > Purpose of this implementation is to provide drivers a generic way to > > > recover with the help of userspace intervention. Different drivers may > > > have different ideas of a "wedged device" depending on their hardware > > > implementation, and hence the vendor agnostic nature of the event. > > > It is up to the drivers to decide when they see the need for recovery > > > and how they want to recover from the available methods. > > > > > > Current implementation defines three recovery methods, out of which, > > > drivers can choose to support any one or multiple of them. Preferred > > > recovery method will be sent in the uevent environment as WEDGED=<method>. > > > Userspace consumers (sysadmin) can define udev rules to parse this event > > > and take respective action to recover the device. > > > > > > =============== ================================== > > > Recovery method Consumer expectations > > > =============== ================================== > > > rebind unbind + rebind driver > > > bus-reset unbind + reset bus device + rebind > > > reboot reboot system > > > =============== ================================== > > > > > > > > > > I proposed something similar in the past: https://lore.kernel.org/dri-devel/20221125175203.52481-1-andrealmeid@igalia.com/ > > > > The motivation was that amdgpu was getting stuck after every GPU reset, and > > there was just a black screen. 
The uevent would then trigger a daemon to > > reset the compositor and getting things back together. As you can see in my > > thread, the feature was blocked in favor of getting better overall GPU reset > > from the kernel side. > > > > Which kind of scenarios are making i915/xe the need to have userspace > > involvement? I tested a bunch of resets in i915 but never managed to get the > > driver stuck. > > 2 scenarios: > > 1. Multiple levels of reset has failed and device was declared wedged. This is > rare indeed as the resets improved a lot. > 2. Debug case. We can boot the driver with option to declare device wedged at > any timeout, so the device can be debugged. > > > > > For the bus-reset, amdgpu does that too, but it doesn't require userspace > > intervention. > > How do you trigger that? What do you mean by bus reset? I think Chrisitian is just referring to a full adapter reset (as opposed to a queue reset or something more fine grained). Driver can reset the device via MMIO or firmware, depending on the device. I think there are also PCI helpers for things like PCI FLR. Alex
Em 18/10/2024 12:31, Alex Deucher escreveu: > On Fri, Oct 18, 2024 at 11:23 AM Rodrigo Vivi <rodrigo.vivi@intel.com> wrote: >> >> On Thu, Oct 17, 2024 at 04:16:09PM -0300, André Almeida wrote: >>> Hi Raag, >>> >>> Em 30/09/2024 04:38, Raag Jadav escreveu: >>>> Introduce device wedged event, which will notify userspace of wedged >>>> (hanged/unusable) state of the DRM device through a uevent. This is >>>> useful especially in cases where the device is no longer operating as >>>> expected even after a hardware reset and has become unrecoverable from >>>> driver context. >>>> >>>> Purpose of this implementation is to provide drivers a generic way to >>>> recover with the help of userspace intervention. Different drivers may >>>> have different ideas of a "wedged device" depending on their hardware >>>> implementation, and hence the vendor agnostic nature of the event. >>>> It is up to the drivers to decide when they see the need for recovery >>>> and how they want to recover from the available methods. >>>> >>>> Current implementation defines three recovery methods, out of which, >>>> drivers can choose to support any one or multiple of them. Preferred >>>> recovery method will be sent in the uevent environment as WEDGED=<method>. >>>> Userspace consumers (sysadmin) can define udev rules to parse this event >>>> and take respective action to recover the device. >>>> >>>> =============== ================================== >>>> Recovery method Consumer expectations >>>> =============== ================================== >>>> rebind unbind + rebind driver >>>> bus-reset unbind + reset bus device + rebind >>>> reboot reboot system >>>> =============== ================================== >>>> >>>> >>> >>> I proposed something similar in the past: https://lore.kernel.org/dri-devel/20221125175203.52481-1-andrealmeid@igalia.com/ >>> >>> The motivation was that amdgpu was getting stuck after every GPU reset, and >>> there was just a black screen. 
The uevent would then trigger a daemon to >>> reset the compositor and getting things back together. As you can see in my >>> thread, the feature was blocked in favor of getting better overall GPU reset >>> from the kernel side. >>> >>> Which kind of scenarios are making i915/xe the need to have userspace >>> involvement? I tested a bunch of resets in i915 but never managed to get the >>> driver stuck. >> >> 2 scenarios: >> >> 1. Multiple levels of reset has failed and device was declared wedged. This is >> rare indeed as the resets improved a lot. >> 2. Debug case. We can boot the driver with option to declare device wedged at >> any timeout, so the device can be debugged. >> >>> >>> For the bus-reset, amdgpu does that too, but it doesn't require userspace >>> intervention. >> >> How do you trigger that? > > What do you mean by bus reset? I think Chrisitian is just referring > to a full adapter reset (as opposed to a queue reset or something more > fine grained). Driver can reset the device via MMIO or firmware, > depending on the device. I think there are also PCI helpers for > things like PCI FLR. > I was referring to AMD_RESET_PCI: "Does a full bus reset using core Linux subsystem PCI reset and does a secondary bus reset or FLR, depending on what the underlying hardware supports." And that can be triggered by using `amdgpu_reset_method=5` as the module option.
On Fri, Oct 18, 2024 at 1:56 PM André Almeida <andrealmeid@igalia.com> wrote: > > Em 18/10/2024 12:31, Alex Deucher escreveu: > > On Fri, Oct 18, 2024 at 11:23 AM Rodrigo Vivi <rodrigo.vivi@intel.com> wrote: > >> > >> On Thu, Oct 17, 2024 at 04:16:09PM -0300, André Almeida wrote: > >>> Hi Raag, > >>> > >>> Em 30/09/2024 04:38, Raag Jadav escreveu: > >>>> Introduce device wedged event, which will notify userspace of wedged > >>>> (hanged/unusable) state of the DRM device through a uevent. This is > >>>> useful especially in cases where the device is no longer operating as > >>>> expected even after a hardware reset and has become unrecoverable from > >>>> driver context. > >>>> > >>>> Purpose of this implementation is to provide drivers a generic way to > >>>> recover with the help of userspace intervention. Different drivers may > >>>> have different ideas of a "wedged device" depending on their hardware > >>>> implementation, and hence the vendor agnostic nature of the event. > >>>> It is up to the drivers to decide when they see the need for recovery > >>>> and how they want to recover from the available methods. > >>>> > >>>> Current implementation defines three recovery methods, out of which, > >>>> drivers can choose to support any one or multiple of them. Preferred > >>>> recovery method will be sent in the uevent environment as WEDGED=<method>. > >>>> Userspace consumers (sysadmin) can define udev rules to parse this event > >>>> and take respective action to recover the device. 
> >>>> > >>>> =============== ================================== > >>>> Recovery method Consumer expectations > >>>> =============== ================================== > >>>> rebind unbind + rebind driver > >>>> bus-reset unbind + reset bus device + rebind > >>>> reboot reboot system > >>>> =============== ================================== > >>>> > >>>> > >>> > >>> I proposed something similar in the past: https://lore.kernel.org/dri-devel/20221125175203.52481-1-andrealmeid@igalia.com/ > >>> > >>> The motivation was that amdgpu was getting stuck after every GPU reset, and > >>> there was just a black screen. The uevent would then trigger a daemon to > >>> reset the compositor and getting things back together. As you can see in my > >>> thread, the feature was blocked in favor of getting better overall GPU reset > >>> from the kernel side. > >>> > >>> Which kind of scenarios are making i915/xe the need to have userspace > >>> involvement? I tested a bunch of resets in i915 but never managed to get the > >>> driver stuck. > >> > >> 2 scenarios: > >> > >> 1. Multiple levels of reset has failed and device was declared wedged. This is > >> rare indeed as the resets improved a lot. > >> 2. Debug case. We can boot the driver with option to declare device wedged at > >> any timeout, so the device can be debugged. > >> > >>> > >>> For the bus-reset, amdgpu does that too, but it doesn't require userspace > >>> intervention. > >> > >> How do you trigger that? > > > > What do you mean by bus reset? I think Chrisitian is just referring > > to a full adapter reset (as opposed to a queue reset or something more > > fine grained). Driver can reset the device via MMIO or firmware, > > depending on the device. I think there are also PCI helpers for > > things like PCI FLR. > > > > I was referring to AMD_RESET_PCI: > > "Does a full bus reset using core Linux subsystem PCI reset and does a > secondary bus reset or FLR, depending on what the underlying hardware > supports." 
> > And that can be triggered by using `amdgpu_reset_method=5` as the module > option. > That option doesn't actually do anything useful on most AMD GPUs. We don't support FLR on most boards and SBR doesn't work once the driver has been loaded except for really old chips. That said, internally these all end up being mode1 or mode2 resets which the driver can trigger directly and which are the defaults. Alex
On Thu, Oct 17, 2024 at 04:16:09PM -0300, André Almeida wrote: > Hi Raag, > > Em 30/09/2024 04:38, Raag Jadav escreveu: > > Introduce device wedged event, which will notify userspace of wedged > > (hanged/unusable) state of the DRM device through a uevent. This is > > useful especially in cases where the device is no longer operating as > > expected even after a hardware reset and has become unrecoverable from > > driver context. > > > > Purpose of this implementation is to provide drivers a generic way to > > recover with the help of userspace intervention. Different drivers may > > have different ideas of a "wedged device" depending on their hardware > > implementation, and hence the vendor agnostic nature of the event. > > It is up to the drivers to decide when they see the need for recovery > > and how they want to recover from the available methods. > > > > Current implementation defines three recovery methods, out of which, > > drivers can choose to support any one or multiple of them. Preferred > > recovery method will be sent in the uevent environment as WEDGED=<method>. > > Userspace consumers (sysadmin) can define udev rules to parse this event > > and take respective action to recover the device. > > > > =============== ================================== > > Recovery method Consumer expectations > > =============== ================================== > > rebind unbind + rebind driver > > bus-reset unbind + reset bus device + rebind > > reboot reboot system > > =============== ================================== > > > > > > I proposed something similar in the past: > https://lore.kernel.org/dri-devel/20221125175203.52481-1-andrealmeid@igalia.com/ Thanks for sharing. I went through it and I think we can use some of the ideas with generic adaption. While we can always execute scripts on uevent, it'd be good to have a userspace daemon applying automated policies for wedge cases based on admin/user needs. This way we can also manage repeat offenders. 
Xe has devcoredump so telemetry would also be a nice addition. Great opportunity to collaborate here. > The motivation was that amdgpu was getting stuck after every GPU reset, and > there was just a black screen. The uevent would then trigger a daemon to > reset the compositor and getting things back together. As you can see in my > thread, the feature was blocked in favor of getting better overall GPU reset > from the kernel side. We have hardware level resets but (although rare) they're also prone to failure. We do what we can to recover from driver context but it adds on to the complexity overtime. Something like wedging, if done right, would be much more robust IMHO. Raag
On Fri, Oct 18, 2024 at 05:07:22PM -0400, Alex Deucher wrote: > On Fri, Oct 18, 2024 at 1:56 PM André Almeida <andrealmeid@igalia.com> wrote: > > > > Em 18/10/2024 12:31, Alex Deucher escreveu: > > > On Fri, Oct 18, 2024 at 11:23 AM Rodrigo Vivi <rodrigo.vivi@intel.com> wrote: > > >> > > >> On Thu, Oct 17, 2024 at 04:16:09PM -0300, André Almeida wrote: > > >>> Hi Raag, > > >>> > > >>> Em 30/09/2024 04:38, Raag Jadav escreveu: > > >>>> Introduce device wedged event, which will notify userspace of wedged > > >>>> (hanged/unusable) state of the DRM device through a uevent. This is > > >>>> useful especially in cases where the device is no longer operating as > > >>>> expected even after a hardware reset and has become unrecoverable from > > >>>> driver context. > > >>>> > > >>>> Purpose of this implementation is to provide drivers a generic way to > > >>>> recover with the help of userspace intervention. Different drivers may > > >>>> have different ideas of a "wedged device" depending on their hardware > > >>>> implementation, and hence the vendor agnostic nature of the event. > > >>>> It is up to the drivers to decide when they see the need for recovery > > >>>> and how they want to recover from the available methods. > > >>>> > > >>>> Current implementation defines three recovery methods, out of which, > > >>>> drivers can choose to support any one or multiple of them. Preferred > > >>>> recovery method will be sent in the uevent environment as WEDGED=<method>. > > >>>> Userspace consumers (sysadmin) can define udev rules to parse this event > > >>>> and take respective action to recover the device. 
> > >>>> > > >>>> =============== ================================== > > >>>> Recovery method Consumer expectations > > >>>> =============== ================================== > > >>>> rebind unbind + rebind driver > > >>>> bus-reset unbind + reset bus device + rebind > > >>>> reboot reboot system > > >>>> =============== ================================== > > >>>> > > >>>> > > >>> > > >>> I proposed something similar in the past: https://lore.kernel.org/dri-devel/20221125175203.52481-1-andrealmeid@igalia.com/ > > >>> > > >>> The motivation was that amdgpu was getting stuck after every GPU reset, and > > >>> there was just a black screen. The uevent would then trigger a daemon to > > >>> reset the compositor and getting things back together. As you can see in my > > >>> thread, the feature was blocked in favor of getting better overall GPU reset > > >>> from the kernel side. > > >>> > > >>> Which kind of scenarios are making i915/xe the need to have userspace > > >>> involvement? I tested a bunch of resets in i915 but never managed to get the > > >>> driver stuck. > > >> > > >> 2 scenarios: > > >> > > >> 1. Multiple levels of reset has failed and device was declared wedged. This is > > >> rare indeed as the resets improved a lot. > > >> 2. Debug case. We can boot the driver with option to declare device wedged at > > >> any timeout, so the device can be debugged. > > >> > > >>> > > >>> For the bus-reset, amdgpu does that too, but it doesn't require userspace > > >>> intervention. > > >> > > >> How do you trigger that? > > > > > > What do you mean by bus reset? I think Chrisitian is just referring > > > to a full adapter reset (as opposed to a queue reset or something more > > > fine grained). Driver can reset the device via MMIO or firmware, > > > depending on the device. I think there are also PCI helpers for > > > things like PCI FLR. 
> > > > > > > I was referring to AMD_RESET_PCI: > > > > "Does a full bus reset using core Linux subsystem PCI reset and does a > > secondary bus reset or FLR, depending on what the underlying hardware > > supports." > > > > And that can be triggered by using `amdgpu_reset_method=5` as the module > > option. > > > > That option doesn't actually do anything useful on most AMD GPUs. We > don't support FLR on most boards and SBR doesn't work once the driver > has been loaded except for really old chips. That said, internally > these all end up being mode1 or mode2 resets which the driver can > trigger directly and which are the defaults. okay, this is the same for us then. And this is the main reason that we have this option: - unbind + reset bus device + rebind unbind by itself needs to be a supported and working case regardless the reset state. Then this sequence should be fine. Afaik there's no way that the driver itself could call for the bus reset. > > Alex
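As an illustration of the "unbind + reset bus device + rebind" sequence from the userspace side, the sketch below builds the three sysfs writes for a PCI device. It assumes the device sits on PCI and that the standard PCI sysfs files are present; the reset attribute is only exposed when the PCI core has a reset method for the device. struct sysfs_step and build_bus_reset_steps() are hypothetical names, and the sketch only prepares the writes, it does not perform them.

```c
#include <stdio.h>

/* One sysfs write: the file to open and the string to write into it. */
struct sysfs_step {
	char path[128];
	char value[32];
};

/*
 * Fill the three writes for "unbind + reset bus device + rebind" on a PCI
 * device. Returns 0 on success, -1 if any path or value would be truncated.
 */
static int build_bus_reset_steps(const char *driver, const char *bdf,
				 struct sysfs_step steps[3])
{
	int n[6];
	int i;

	/* Step 1: detach the driver by writing the device address to unbind. */
	n[0] = snprintf(steps[0].path, sizeof(steps[0].path),
			"/sys/bus/pci/drivers/%s/unbind", driver);
	n[1] = snprintf(steps[0].value, sizeof(steps[0].value), "%s", bdf);

	/* Step 2: ask the PCI core to reset the device by writing 1 to reset. */
	n[2] = snprintf(steps[1].path, sizeof(steps[1].path),
			"/sys/bus/pci/devices/%s/reset", bdf);
	n[3] = snprintf(steps[1].value, sizeof(steps[1].value), "1");

	/* Step 3: attach the driver again by writing the address to bind. */
	n[4] = snprintf(steps[2].path, sizeof(steps[2].path),
			"/sys/bus/pci/drivers/%s/bind", driver);
	n[5] = snprintf(steps[2].value, sizeof(steps[2].value), "%s", bdf);

	for (i = 0; i < 6; i++) {
		/* Even indices are paths, odd indices are values. */
		size_t cap = (i % 2) ? sizeof(steps[0].value)
				     : sizeof(steps[0].path);

		if (n[i] < 0 || (size_t)n[i] >= cap)
			return -1;
	}
	return 0;
}
```

For example, with a driver name of "xe" and a device address of "0000:03:00.0" (both placeholders), the steps are a write of the address to /sys/bus/pci/drivers/xe/unbind, a write of 1 to /sys/bus/pci/devices/0000:03:00.0/reset, and a write of the address to /sys/bus/pci/drivers/xe/bind, in that order.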
diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
index ac30b0ec9d93..cfe9600da2ee 100644
--- a/drivers/gpu/drm/drm_drv.c
+++ b/drivers/gpu/drm/drm_drv.c
@@ -26,6 +26,8 @@
  * DEALINGS IN THE SOFTWARE.
  */
 
+#include <linux/array_size.h>
+#include <linux/build_bug.h>
 #include <linux/debugfs.h>
 #include <linux/fs.h>
 #include <linux/module.h>
@@ -33,6 +35,7 @@
 #include <linux/mount.h>
 #include <linux/pseudo_fs.h>
 #include <linux/slab.h>
+#include <linux/sprintf.h>
 #include <linux/srcu.h>
 #include <linux/xarray.h>
 
@@ -70,6 +73,42 @@ static struct dentry *drm_debugfs_root;
 
 DEFINE_STATIC_SRCU(drm_unplug_srcu);
 
+/*
+ * Available recovery methods for wedged device. To be sent along with device
+ * wedged uevent.
+ */
+static const char *const drm_wedge_recovery_opts[] = {
+	[DRM_WEDGE_RECOVERY_REBIND] = "rebind",
+	[DRM_WEDGE_RECOVERY_BUS_RESET] = "bus-reset",
+	[DRM_WEDGE_RECOVERY_REBOOT] = "reboot",
+};
+
+static bool drm_wedge_recovery_is_valid(enum drm_wedge_recovery method)
+{
+	static_assert(ARRAY_SIZE(drm_wedge_recovery_opts) == DRM_WEDGE_RECOVERY_MAX);
+
+	return method >= DRM_WEDGE_RECOVERY_REBIND && method < DRM_WEDGE_RECOVERY_MAX;
+}
+
+/**
+ * drm_wedge_recovery_name - provide wedge recovery name
+ * @method: method to be used for recovery
+ *
+ * This validates wedge recovery @method against the available ones in
+ * drm_wedge_recovery_opts[] and provides respective recovery name in string
+ * format if found valid.
+ *
+ * Returns: pointer to const recovery string on success, NULL otherwise.
+ */
+const char *drm_wedge_recovery_name(enum drm_wedge_recovery method)
+{
+	if (drm_wedge_recovery_is_valid(method))
+		return drm_wedge_recovery_opts[method];
+
+	return NULL;
+}
+EXPORT_SYMBOL(drm_wedge_recovery_name);
+
 /*
  * DRM Minors
  * A DRM device can provide several char-dev interfaces on the DRM-Major. Each
@@ -497,6 +536,44 @@ void drm_dev_unplug(struct drm_device *dev)
 }
 EXPORT_SYMBOL(drm_dev_unplug);
 
+/**
+ * drm_dev_wedged_event - generate a device wedged uevent
+ * @dev: DRM device
+ * @method: method to be used for recovery
+ *
+ * This generates a device wedged uevent for the DRM device specified by @dev.
+ * Recovery @method from drm_wedge_recovery_opts[] (if supported by the device)
+ * is sent in the uevent environment as WEDGED=<method>, on the basis of which,
+ * userspace may take respective action to recover the device.
+ *
+ * Returns: 0 on success, or negative error code otherwise.
+ */
+int drm_dev_wedged_event(struct drm_device *dev, enum drm_wedge_recovery method)
+{
+	/* Event string length up to 16+ characters with available methods */
+	char event_string[32] = {};
+	char *envp[] = { event_string, NULL };
+	const char *recovery;
+
+	recovery = drm_wedge_recovery_name(method);
+	if (!recovery) {
+		drm_err(dev, "device wedged, invalid recovery method %d\n", method);
+		return -EINVAL;
+	}
+
+	if (!test_bit(method, &dev->wedge_recovery)) {
+		drm_err(dev, "device wedged, %s based recovery not supported\n",
+			drm_wedge_recovery_name(method));
+		return -EOPNOTSUPP;
+	}
+
+	snprintf(event_string, sizeof(event_string), "WEDGED=%s", recovery);
+
+	drm_info(dev, "device wedged, generating uevent for %s based recovery\n", recovery);
+	return kobject_uevent_env(&dev->primary->kdev->kobj, KOBJ_CHANGE, envp);
+}
+EXPORT_SYMBOL(drm_dev_wedged_event);
+
 /*
  * DRM internal mount
  * We want to be able to allocate our own "struct address_space" to control
diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
index c91f87b5242d..fed6f20e52fb 100644
--- a/include/drm/drm_device.h
+++ b/include/drm/drm_device.h
@@ -40,6 +40,26 @@ enum switch_power_state {
 	DRM_SWITCH_POWER_DYNAMIC_OFF = 3,
 };
 
+/**
+ * enum drm_wedge_recovery - Recovery method for wedged device in order of
+ * severity. To be set as bit fields in drm_device.wedge_recovery variable.
+ * Drivers can choose to support any one or multiple of them depending on
+ * their needs.
+ */
+enum drm_wedge_recovery {
+	/** @DRM_WEDGE_RECOVERY_REBIND: unbind + rebind driver */
+	DRM_WEDGE_RECOVERY_REBIND,
+
+	/** @DRM_WEDGE_RECOVERY_BUS_RESET: unbind + reset bus device + rebind */
+	DRM_WEDGE_RECOVERY_BUS_RESET,
+
+	/** @DRM_WEDGE_RECOVERY_REBOOT: reboot system */
+	DRM_WEDGE_RECOVERY_REBOOT,
+
+	/** @DRM_WEDGE_RECOVERY_MAX: for bounds checking, do not use */
+	DRM_WEDGE_RECOVERY_MAX
+};
+
 /**
  * struct drm_device - DRM device structure
  *
@@ -317,6 +337,9 @@ struct drm_device {
 	 * Root directory for debugfs files.
 	 */
 	struct dentry *debugfs_root;
+
+	/** @wedge_recovery: Supported recovery methods for wedged device */
+	unsigned long wedge_recovery;
 };
 
 #endif
diff --git a/include/drm/drm_drv.h b/include/drm/drm_drv.h
index 02ea4e3248fd..d8dbc77010b0 100644
--- a/include/drm/drm_drv.h
+++ b/include/drm/drm_drv.h
@@ -462,6 +462,9 @@ bool drm_dev_enter(struct drm_device *dev, int *idx);
 void drm_dev_exit(int idx);
 void drm_dev_unplug(struct drm_device *dev);
 
+const char *drm_wedge_recovery_name(enum drm_wedge_recovery method);
+int drm_dev_wedged_event(struct drm_device *dev, enum drm_wedge_recovery method);
+
 /**
  * drm_dev_is_unplugged - is a DRM device unplugged
  * @dev: DRM device
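To make the two failure paths of drm_dev_wedged_event() easier to follow, here is a small userspace model of its checks: an out-of-range method yields -EINVAL, a valid method that the device has not opted into yields -EOPNOTSUPP, and otherwise the "WEDGED=<method>" string is produced. This is a sketch only. `struct model_dev`, `model_recovery_opts` and `model_wedged_event()` are hypothetical stand-ins, the kernel's test_bit() is replaced by a plain mask test, and no uevent is sent.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Mirrors enum drm_wedge_recovery from the patch. */
enum drm_wedge_recovery {
	DRM_WEDGE_RECOVERY_REBIND,
	DRM_WEDGE_RECOVERY_BUS_RESET,
	DRM_WEDGE_RECOVERY_REBOOT,
	DRM_WEDGE_RECOVERY_MAX
};

static const char *const model_recovery_opts[] = {
	[DRM_WEDGE_RECOVERY_REBIND] = "rebind",
	[DRM_WEDGE_RECOVERY_BUS_RESET] = "bus-reset",
	[DRM_WEDGE_RECOVERY_REBOOT] = "reboot",
};

/* Size check kept next to the table it guards. */
_Static_assert(sizeof(model_recovery_opts) / sizeof(model_recovery_opts[0]) ==
	       DRM_WEDGE_RECOVERY_MAX, "recovery table out of sync with enum");

/* Hypothetical stand-in for the wedge_recovery field of struct drm_device. */
struct model_dev {
	unsigned long wedge_recovery;
};

/* Models the validation order of drm_dev_wedged_event(); sends nothing. */
int model_wedged_event(const struct model_dev *dev, int method,
		       char *buf, size_t len)
{
	assert(dev && buf);

	if (method < 0 || method >= DRM_WEDGE_RECOVERY_MAX)
		return -EINVAL;

	/* Plain mask test in place of the kernel's test_bit(). */
	if (!(dev->wedge_recovery & (1UL << method)))
		return -EOPNOTSUPP;

	snprintf(buf, len, "WEDGED=%s", model_recovery_opts[method]);
	return 0;
}
```

In this model, a device that opts into rebind only would set `wedge_recovery` to `1UL << DRM_WEDGE_RECOVERY_REBIND`. A request for bus-reset on that device is then rejected before any event string is built.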
Introduce device wedged event, which will notify userspace of wedged
(hanged/unusable) state of the DRM device through a uevent. This is
useful especially in cases where the device is no longer operating as
expected even after a hardware reset and has become unrecoverable from
driver context.

Purpose of this implementation is to provide drivers a generic way to
recover with the help of userspace intervention. Different drivers may
have different ideas of a "wedged device" depending on their hardware
implementation, and hence the vendor agnostic nature of the event.
It is up to the drivers to decide when they see the need for recovery
and how they want to recover from the available methods.

Current implementation defines three recovery methods, out of which,
drivers can choose to support any one or multiple of them. Preferred
recovery method will be sent in the uevent environment as WEDGED=<method>.
Userspace consumers (sysadmin) can define udev rules to parse this event
and take respective action to recover the device.

    =============== ==================================
    Recovery method Consumer expectations
    =============== ==================================
    rebind          unbind + rebind driver
    bus-reset       unbind + reset bus device + rebind
    reboot          reboot system
    =============== ==================================

v4: s/drm_dev_wedged/drm_dev_wedged_event
    Use drm_info() (Jani)
    Kernel doc adjustment (Aravind)
v5: Send recovery method with uevent (Lina)
v6: Access wedge_recovery_opts[] using helper function (Jani)
    Use snprintf() (Jani)
v7: Convert recovery helpers into regular functions (Andy, Jani)
    Aesthetic adjustments (Andy)
    Handle invalid method cases

Signed-off-by: Raag Jadav <raag.jadav@intel.com>
---
 drivers/gpu/drm/drm_drv.c | 77 +++++++++++++++++++++++++++++++++++++++
 include/drm/drm_device.h  | 23 ++++++++++++
 include/drm/drm_drv.h     |  3 ++
 3 files changed, 103 insertions(+)
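
On the consumer side, a helper invoked from a udev rule would need to map the WEDGED=<method> entry to one of the three actions in the table above. A minimal sketch of that mapping follows. `enum wedged_action` and `wedged_parse_env()` are hypothetical names, not part of this series, and how the environment entry reaches the helper is left open.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical consumer-side actions, one per recovery method. */
enum wedged_action {
	WEDGED_ACT_NONE,	/* not a wedged event, or unknown method */
	WEDGED_ACT_REBIND,	/* unbind + rebind driver */
	WEDGED_ACT_BUS_RESET,	/* unbind + reset bus device + rebind */
	WEDGED_ACT_REBOOT,	/* reboot system */
};

/*
 * Map one uevent environment entry, e.g. "WEDGED=rebind", to an action.
 * Anything that is not a recognised WEDGED entry maps to WEDGED_ACT_NONE.
 */
enum wedged_action wedged_parse_env(const char *env)
{
	static const char key[] = "WEDGED=";
	static const struct {
		const char *name;
		enum wedged_action act;
	} map[] = {
		{ "rebind",    WEDGED_ACT_REBIND },
		{ "bus-reset", WEDGED_ACT_BUS_RESET },
		{ "reboot",    WEDGED_ACT_REBOOT },
	};
	const char *val;
	size_t i;

	assert(env != NULL);

	/* sizeof(key) - 1 is the key length without the trailing NUL. */
	if (strncmp(env, key, sizeof(key) - 1))
		return WEDGED_ACT_NONE;

	val = env + sizeof(key) - 1;
	for (i = 0; i < sizeof(map) / sizeof(map[0]); i++) {
		if (!strcmp(val, map[i].name))
			return map[i].act;
	}

	return WEDGED_ACT_NONE;
}
```

Returning WEDGED_ACT_NONE for unknown values keeps such a consumer from acting on method names it does not understand, for example if further methods are added later.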