From patchwork Mon Sep 30 07:38:41 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Raag Jadav X-Patchwork-Id: 13815490 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 4B394CF6497 for ; Mon, 30 Sep 2024 07:39:38 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id D1B5510E3AE; Mon, 30 Sep 2024 07:39:37 +0000 (UTC) Authentication-Results: gabe.freedesktop.org; dkim=pass (2048-bit key; unprotected) header.d=intel.com header.i=@intel.com header.b="CApce3RB"; dkim-atps=neutral Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.11]) by gabe.freedesktop.org (Postfix) with ESMTPS id 5877110E3AD; Mon, 30 Sep 2024 07:39:36 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1727681977; x=1759217977; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=cji0HqaVHxxAff4YFRcNBaRc8c4+jaWjEWUzYo4hhmg=; b=CApce3RBNjEQSyr8vbU+YCgwakjsflAi1aZcSGnWASnk7bsCj123Q9SI e9+kz3PtRo8ZyrA0gJ3jxFH15nPqrpsnyjzSNiheLkWHFzrxQoX3UuYgK vsazTwfSO05ot0d7aRXsKbsJItGjEYATUf+bsdgIQUViBF3wDJQ1nNcQo zJTyxmsHckwQAl9cR2XtBVGz06L7g/IC0iT8vcxViQguHVHe2fJlNcqq5 wMU6Wffdf9LmPj1ZsolSO2VuzjMJWYQlNKnr0rsYVg63wuSjBvxaBeL5t TsnMtl9wmiN311du+jylE2GI6IwN8sbnPV1R6pjMxYK2fi24igYJdmb34 A==; X-CSE-ConnectionGUID: 39DW+inxRKewQknROnLbhQ== X-CSE-MsgGUID: SWbfpKH6SJSla6WrMkv4ng== X-IronPort-AV: E=McAfee;i="6700,10204,11210"; a="37315453" X-IronPort-AV: E=Sophos;i="6.11,165,1725346800"; d="scan'208";a="37315453" Received: from fmviesa006.fm.intel.com ([10.60.135.146]) by 
orvoesa103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 30 Sep 2024 00:39:37 -0700 X-CSE-ConnectionGUID: N7zqWVSWSL+oIviDCJgJdQ== X-CSE-MsgGUID: SleehpyWR/iQi0DvhDorgQ== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.11,165,1725346800"; d="scan'208";a="72797411" Received: from jraag-nuc8i7beh.iind.intel.com ([10.145.169.79]) by fmviesa006.fm.intel.com with ESMTP; 30 Sep 2024 00:39:30 -0700 From: Raag Jadav To: airlied@gmail.com, simona@ffwll.ch, lucas.demarchi@intel.com, thomas.hellstrom@linux.intel.com, rodrigo.vivi@intel.com, jani.nikula@linux.intel.com, andriy.shevchenko@linux.intel.com, joonas.lahtinen@linux.intel.com, tursulin@ursulin.net, lina@asahilina.net Cc: intel-xe@lists.freedesktop.org, intel-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org, himal.prasad.ghimiray@intel.com, francois.dugast@intel.com, aravind.iddamsetty@linux.intel.com, anshuman.gupta@intel.com, andi.shyti@linux.intel.com, matthew.d.roper@intel.com, Raag Jadav Subject: [PATCH v7 1/5] drm: Introduce device wedged event Date: Mon, 30 Sep 2024 13:08:41 +0530 Message-Id: <20240930073845.347326-2-raag.jadav@intel.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20240930073845.347326-1-raag.jadav@intel.com> References: <20240930073845.347326-1-raag.jadav@intel.com> MIME-Version: 1.0 X-BeenThere: intel-gfx@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Intel graphics driver community testing & development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: intel-gfx-bounces@lists.freedesktop.org Sender: "Intel-gfx" Introduce device wedged event, which will notify userspace of wedged (hanged/unusable) state of the DRM device through a uevent. This is useful especially in cases where the device is no longer operating as expected even after a hardware reset and has become unrecoverable from driver context. 
Purpose of this implementation is to provide drivers a generic way to
recover with the help of userspace intervention. Different drivers may
have different ideas of a "wedged device" depending on their hardware
implementation, and hence the vendor agnostic nature of the event.
It is up to the drivers to decide when they see the need for recovery
and how they want to recover from the available methods.

Current implementation defines three recovery methods, out of which,
drivers can choose to support any one or multiple of them. Preferred
recovery method will be sent in the uevent environment as WEDGED=.
Userspace consumers (sysadmin) can define udev rules to parse this event
and take respective action to recover the device.

    =============== ==================================
    Recovery method Consumer expectations
    =============== ==================================
    rebind          unbind + rebind driver
    bus-reset       unbind + reset bus device + rebind
    reboot          reboot system
    =============== ==================================

v4: s/drm_dev_wedged/drm_dev_wedged_event
    Use drm_info() (Jani)
    Kernel doc adjustment (Aravind)
v5: Send recovery method with uevent (Lina)
v6: Access wedge_recovery_opts[] using helper function (Jani)
    Use snprintf() (Jani)
v7: Convert recovery helpers into regular functions (Andy, Jani)
    Aesthetic adjustments (Andy)
    Handle invalid method cases

Signed-off-by: Raag Jadav
---
 drivers/gpu/drm/drm_drv.c | 77 +++++++++++++++++++++++++++++++++++++++
 include/drm/drm_device.h  | 23 ++++++++++++
 include/drm/drm_drv.h     |  3 ++
 3 files changed, 103 insertions(+)

diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
index ac30b0ec9d93..cfe9600da2ee 100644
--- a/drivers/gpu/drm/drm_drv.c
+++ b/drivers/gpu/drm/drm_drv.c
@@ -26,6 +26,8 @@
  * DEALINGS IN THE SOFTWARE.
  */
+#include
+#include
 #include
 #include
 #include
@@ -33,6 +35,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
@@ -70,6 +73,42 @@ static struct dentry *drm_debugfs_root;
 DEFINE_STATIC_SRCU(drm_unplug_srcu);
+/*
+ * Available recovery methods for wedged device. To be sent along with device
+ * wedged uevent.
+ */
+static const char *const drm_wedge_recovery_opts[] = {
+	[DRM_WEDGE_RECOVERY_REBIND] = "rebind",
+	[DRM_WEDGE_RECOVERY_BUS_RESET] = "bus-reset",
+	[DRM_WEDGE_RECOVERY_REBOOT] = "reboot",
+};
+
+static bool drm_wedge_recovery_is_valid(enum drm_wedge_recovery method)
+{
+	static_assert(ARRAY_SIZE(drm_wedge_recovery_opts) == DRM_WEDGE_RECOVERY_MAX);
+
+	return method >= DRM_WEDGE_RECOVERY_REBIND && method < DRM_WEDGE_RECOVERY_MAX;
+}
+
+/**
+ * drm_wedge_recovery_name - provide wedge recovery name
+ * @method: method to be used for recovery
+ *
+ * This validates wedge recovery @method against the available ones in
+ * drm_wedge_recovery_opts[] and provides respective recovery name in string
+ * format if found valid.
+ *
+ * Returns: pointer to const recovery string on success, NULL otherwise.
+ */
+const char *drm_wedge_recovery_name(enum drm_wedge_recovery method)
+{
+	if (drm_wedge_recovery_is_valid(method))
+		return drm_wedge_recovery_opts[method];
+
+	return NULL;
+}
+EXPORT_SYMBOL(drm_wedge_recovery_name);
+
 /*
  * DRM Minors
  * A DRM device can provide several char-dev interfaces on the DRM-Major. Each
@@ -497,6 +536,44 @@ void drm_dev_unplug(struct drm_device *dev)
 }
 EXPORT_SYMBOL(drm_dev_unplug);
+/**
+ * drm_dev_wedged_event - generate a device wedged uevent
+ * @dev: DRM device
+ * @method: method to be used for recovery
+ *
+ * This generates a device wedged uevent for the DRM device specified by @dev.
+ * Recovery @method from drm_wedge_recovery_opts[] (if supported by the device)
+ * is sent in the uevent environment as WEDGED=, on the basis of which,
+ * userspace may take respective action to recover the device.
+ *
+ * Returns: 0 on success, or negative error code otherwise.
+ */
+int drm_dev_wedged_event(struct drm_device *dev, enum drm_wedge_recovery method)
+{
+	/* Event string length up to 16+ characters with available methods */
+	char event_string[32] = {};
+	char *envp[] = { event_string, NULL };
+	const char *recovery;
+
+	recovery = drm_wedge_recovery_name(method);
+	if (!recovery) {
+		drm_err(dev, "device wedged, invalid recovery method %d\n", method);
+		return -EINVAL;
+	}
+
+	if (!test_bit(method, &dev->wedge_recovery)) {
+		drm_err(dev, "device wedged, %s based recovery not supported\n",
+			drm_wedge_recovery_name(method));
+		return -EOPNOTSUPP;
+	}
+
+	snprintf(event_string, sizeof(event_string), "WEDGED=%s", recovery);
+
+	drm_info(dev, "device wedged, generating uevent for %s based recovery\n", recovery);
+	return kobject_uevent_env(&dev->primary->kdev->kobj, KOBJ_CHANGE, envp);
+}
+EXPORT_SYMBOL(drm_dev_wedged_event);
+
 /*
  * DRM internal mount
  * We want to be able to allocate our own "struct address_space" to control
diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
index c91f87b5242d..fed6f20e52fb 100644
--- a/include/drm/drm_device.h
+++ b/include/drm/drm_device.h
@@ -40,6 +40,26 @@ enum switch_power_state {
 	DRM_SWITCH_POWER_DYNAMIC_OFF = 3,
 };
+/**
+ * enum drm_wedge_recovery - Recovery method for wedged device in order of
+ * severity. To be set as bit fields in drm_device.wedge_recovery variable.
+ * Drivers can choose to support any one or multiple of them depending on
+ * their needs.
+ */ +enum drm_wedge_recovery { + /** @DRM_WEDGE_RECOVERY_REBIND: unbind + rebind driver */ + DRM_WEDGE_RECOVERY_REBIND, + + /** @DRM_WEDGE_RECOVERY_BUS_RESET: unbind + reset bus device + rebind */ + DRM_WEDGE_RECOVERY_BUS_RESET, + + /** @DRM_WEDGE_RECOVERY_REBOOT: reboot system */ + DRM_WEDGE_RECOVERY_REBOOT, + + /** @DRM_WEDGE_RECOVERY_MAX: for bounds checking, do not use */ + DRM_WEDGE_RECOVERY_MAX +}; + /** * struct drm_device - DRM device structure * @@ -317,6 +337,9 @@ struct drm_device { * Root directory for debugfs files. */ struct dentry *debugfs_root; + + /** @wedge_recovery: Supported recovery methods for wedged device */ + unsigned long wedge_recovery; }; #endif diff --git a/include/drm/drm_drv.h b/include/drm/drm_drv.h index 02ea4e3248fd..d8dbc77010b0 100644 --- a/include/drm/drm_drv.h +++ b/include/drm/drm_drv.h @@ -462,6 +462,9 @@ bool drm_dev_enter(struct drm_device *dev, int *idx); void drm_dev_exit(int idx); void drm_dev_unplug(struct drm_device *dev); +const char *drm_wedge_recovery_name(enum drm_wedge_recovery method); +int drm_dev_wedged_event(struct drm_device *dev, enum drm_wedge_recovery method); + /** * drm_dev_is_unplugged - is a DRM device unplugged * @dev: DRM device From patchwork Mon Sep 30 07:38:42 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Raag Jadav X-Patchwork-Id: 13815491 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 9E9C5CF6491 for ; Mon, 30 Sep 2024 07:39:43 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id 388C010E3B2; Mon, 30 Sep 2024 07:39:43 +0000 (UTC) Authentication-Results: 
gabe.freedesktop.org; dkim=pass (2048-bit key; unprotected) header.d=intel.com header.i=@intel.com header.b="ad6woCvv"; dkim-atps=neutral Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.11]) by gabe.freedesktop.org (Postfix) with ESMTPS id 6839710E3B1; Mon, 30 Sep 2024 07:39:41 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1727681982; x=1759217982; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=zBfj6l/LDRVXnzgkyVxes51ZF+Xrnbm5dhhr663CIvs=; b=ad6woCvvuHwwIN23FIwhqyZ0OxA9Jcw5orVhCFyjGmBkoDmHJiMvOSVC 027KjKWUh5YW6tkJyESfQS9rk1XC755fSyF6tpyYUKFcwqqsZO9+j5REB jg425afxyOjjBEQ7vBdghBcm4DPMMqR54T+zD1z2B28bBAWw8M3aAtKFY NycIQbYo70fd7IzVsN/KReGXojKih9YziST0J4L2WMc/Z58T7as9PO4Oa n37+lvSII/xNcMWwDY/yeyDt5xQPGfh4+dLeLz2fwQdTXabWh2mbhUtet bCWq6od4rwFIf/2svrSNQMrHECdNCdoo8Le+1jcrVWS2z68RHklNF18Qq w==; X-CSE-ConnectionGUID: URG+Xx4MQX+5i6up3i/NoQ== X-CSE-MsgGUID: eRMA4fJ9S7mDTnFLSwyUlw== X-IronPort-AV: E=McAfee;i="6700,10204,11210"; a="37315477" X-IronPort-AV: E=Sophos;i="6.11,165,1725346800"; d="scan'208";a="37315477" Received: from fmviesa006.fm.intel.com ([10.60.135.146]) by orvoesa103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 30 Sep 2024 00:39:42 -0700 X-CSE-ConnectionGUID: xIGQS4jqTZu+c19v925Fgg== X-CSE-MsgGUID: vFFskLctSZaS9zFwN/49Ug== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.11,165,1725346800"; d="scan'208";a="72797435" Received: from jraag-nuc8i7beh.iind.intel.com ([10.145.169.79]) by fmviesa006.fm.intel.com with ESMTP; 30 Sep 2024 00:39:36 -0700 From: Raag Jadav To: airlied@gmail.com, simona@ffwll.ch, lucas.demarchi@intel.com, thomas.hellstrom@linux.intel.com, rodrigo.vivi@intel.com, jani.nikula@linux.intel.com, andriy.shevchenko@linux.intel.com, joonas.lahtinen@linux.intel.com, tursulin@ursulin.net, lina@asahilina.net Cc: intel-xe@lists.freedesktop.org, intel-gfx@lists.freedesktop.org, 
dri-devel@lists.freedesktop.org, himal.prasad.ghimiray@intel.com, francois.dugast@intel.com, aravind.iddamsetty@linux.intel.com, anshuman.gupta@intel.com, andi.shyti@linux.intel.com, matthew.d.roper@intel.com, Raag Jadav Subject: [PATCH v7 2/5] drm: Expose wedge recovery methods Date: Mon, 30 Sep 2024 13:08:42 +0530 Message-Id: <20240930073845.347326-3-raag.jadav@intel.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20240930073845.347326-1-raag.jadav@intel.com> References: <20240930073845.347326-1-raag.jadav@intel.com> MIME-Version: 1.0 X-BeenThere: intel-gfx@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Intel graphics driver community testing & development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: intel-gfx-bounces@lists.freedesktop.org Sender: "Intel-gfx" Now that we have device wedged event in place, add wedge_recovery sysfs attribute which will expose recovery methods supported by the DRM device. This is useful for userspace consumers in cases where the device supports multiple recovery methods which can be used as fallbacks. 
$ cat /sys/class/drm/card/wedge_recovery rebind bus-reset reboot Signed-off-by: Raag Jadav --- drivers/gpu/drm/drm_sysfs.c | 22 ++++++++++++++++++++++ 1 file changed, 22 insertions(+) diff --git a/drivers/gpu/drm/drm_sysfs.c b/drivers/gpu/drm/drm_sysfs.c index fb3bbb6adcd1..bd77b35ceb8a 100644 --- a/drivers/gpu/drm/drm_sysfs.c +++ b/drivers/gpu/drm/drm_sysfs.c @@ -24,6 +24,7 @@ #include #include #include +#include #include #include #include @@ -508,6 +509,26 @@ void drm_sysfs_connector_property_event(struct drm_connector *connector, } EXPORT_SYMBOL(drm_sysfs_connector_property_event); +static ssize_t wedge_recovery_show(struct device *device, + struct device_attribute *attr, char *buf) +{ + struct drm_minor *minor = to_drm_minor(device); + struct drm_device *dev = minor->dev; + unsigned int method, count = DRM_WEDGE_RECOVERY_REBIND; + + for_each_set_bit(method, &dev->wedge_recovery, DRM_WEDGE_RECOVERY_MAX) + count += sysfs_emit_at(buf, count, "%s\n", drm_wedge_recovery_name(method)); + + return count; +} +static DEVICE_ATTR_RO(wedge_recovery); + +static struct attribute *minor_dev_attrs[] = { + &dev_attr_wedge_recovery.attr, + NULL +}; +ATTRIBUTE_GROUPS(minor_dev); + struct device *drm_sysfs_minor_alloc(struct drm_minor *minor) { const char *minor_str; @@ -532,6 +553,7 @@ struct device *drm_sysfs_minor_alloc(struct drm_minor *minor) kdev->devt = MKDEV(DRM_MAJOR, minor->index); kdev->class = drm_class; kdev->type = &drm_sysfs_device_minor; + kdev->groups = minor_dev_groups; } kdev->parent = minor->dev->dev; From patchwork Mon Sep 30 07:38:43 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Raag Jadav X-Patchwork-Id: 13815492 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client 
certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 0B308CF6497 for ; Mon, 30 Sep 2024 07:39:51 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id AC51E10E3B3; Mon, 30 Sep 2024 07:39:50 +0000 (UTC) Authentication-Results: gabe.freedesktop.org; dkim=pass (2048-bit key; unprotected) header.d=intel.com header.i=@intel.com header.b="b6gcz2mX"; dkim-atps=neutral Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.11]) by gabe.freedesktop.org (Postfix) with ESMTPS id 45B0A10E3B5; Mon, 30 Sep 2024 07:39:47 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1727681988; x=1759217988; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=Ot8CwKy9aFzh2XqNcy0XgW2wiUIEEhhDu8b4OvvgZBk=; b=b6gcz2mXxB054qp3N3/4kp0d12ZDLY8WZgaB3xbNpktLRKdFAO8bQ6kA NyUETvtHI9gDusbEvqONyup023crbPHkCcm3G7UiOJv143eAHdxOFI7Zo zbisflLITH0sSosuEEJZShRk5IDUxN5FVS1Ufe9bJM65R2emlYMKxAZeQ EJaw9f35X5P8shiOVl+uC2Nf7AOndESqqZ+GroPeQqg91Zcl3nU1Jm2ST HnrbF+DBHQGSs80O9ycZMFzS/Vxov5LS/OQHPzIgdw1DZ2pI/ywEqaKKz zWQ/zhanlJvSdxVQ6vK2xE9D4u319FQxGMkrbekdr3rO0t/UqObsxYKew A==; X-CSE-ConnectionGUID: UPzi2pLlR5uwjgkSbQp+/A== X-CSE-MsgGUID: zZjmMFj+THyEKQH03veAoQ== X-IronPort-AV: E=McAfee;i="6700,10204,11210"; a="37315493" X-IronPort-AV: E=Sophos;i="6.11,165,1725346800"; d="scan'208";a="37315493" Received: from fmviesa006.fm.intel.com ([10.60.135.146]) by orvoesa103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 30 Sep 2024 00:39:48 -0700 X-CSE-ConnectionGUID: Ax07ZjdkQgeqzgCqfeWq+w== X-CSE-MsgGUID: 7fRadhUZSXOf62vRIilCNw== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.11,165,1725346800"; d="scan'208";a="72797453" Received: from jraag-nuc8i7beh.iind.intel.com ([10.145.169.79]) by fmviesa006.fm.intel.com with ESMTP; 30 Sep 2024 00:39:42 -0700 From: Raag Jadav To: airlied@gmail.com, simona@ffwll.ch, 
lucas.demarchi@intel.com, thomas.hellstrom@linux.intel.com, rodrigo.vivi@intel.com, jani.nikula@linux.intel.com, andriy.shevchenko@linux.intel.com, joonas.lahtinen@linux.intel.com, tursulin@ursulin.net, lina@asahilina.net Cc: intel-xe@lists.freedesktop.org, intel-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org, himal.prasad.ghimiray@intel.com, francois.dugast@intel.com, aravind.iddamsetty@linux.intel.com, anshuman.gupta@intel.com, andi.shyti@linux.intel.com, matthew.d.roper@intel.com, Raag Jadav Subject: [PATCH v7 3/5] drm/doc: Document device wedged event Date: Mon, 30 Sep 2024 13:08:43 +0530 Message-Id: <20240930073845.347326-4-raag.jadav@intel.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20240930073845.347326-1-raag.jadav@intel.com> References: <20240930073845.347326-1-raag.jadav@intel.com> MIME-Version: 1.0 X-BeenThere: intel-gfx@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Intel graphics driver community testing & development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: intel-gfx-bounces@lists.freedesktop.org Sender: "Intel-gfx" Add documentation for device wedged event along with its consumer expectations. For now it is amended to 'Device reset' chapter, but with extended functionality in the future it can be refactored into its own chapter. Signed-off-by: Raag Jadav --- Documentation/gpu/drm-uapi.rst | 42 ++++++++++++++++++++++++++++++++++ 1 file changed, 42 insertions(+) diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst index 370d820be248..c1186dfd283d 100644 --- a/Documentation/gpu/drm-uapi.rst +++ b/Documentation/gpu/drm-uapi.rst @@ -313,6 +313,22 @@ driver separately, with no common DRM interface. Ideally this should be properly integrated at DRM scheduler to provide a common ground for all drivers. After a reset, KMD should reject new command submissions for affected contexts. 
+Drivers can optionally make use of device wedged event (implemented as
+drm_dev_wedged_event() in DRM subsystem) which notifies userspace of wedged
+(hanged/unusable) state of the DRM device through a uevent. This is useful
+especially in cases where the device is no longer operating as expected even
+after a hardware reset and has become unrecoverable from driver context.
+Purpose of this implementation is to provide drivers a generic way to recover
+with the help of userspace intervention, and hence the vendor agnostic nature
+of the event.
+
+Different drivers may have different ideas of a "wedged device" depending on
+their hardware implementation. It is up to the drivers to decide when they see
+the need for recovery and how they want to recover from the available methods.
+Current implementation defines three recovery methods, out of which, drivers
+can choose to support any one or multiple of them. Preferred recovery method
+will be sent in the uevent environment as WEDGED=.
+
 User Mode Driver
 ----------------
@@ -323,6 +339,32 @@
 if the UMD requires it. After detecting a reset, UMD will then proceed to report
 it to the application using the appropriate API error code, as explained in the
 section below about robustness.
+On device wedged scenario, userspace will receive a uevent from KMD with
+its preferred recovery method in the uevent environment as WEDGED=.
+Userspace consumers (sysadmin) can define udev rules to parse this event
+and take respective action to recover the device.
+
+.. table:: Wedged Device Recovery
+
+    =============== ==================================
+    Recovery method Consumer expectations
+    =============== ==================================
+    rebind          unbind + rebind driver
+    bus-reset       unbind + reset bus device + rebind
+    reboot          reboot system
+    =============== ==================================
+
+Userspace consumers can optionally read the recovery methods supported by the
+device via ``wedge_recovery`` sysfs attribute::
+
+    $ cat /sys/class/drm/card/wedge_recovery
+    rebind
+    bus-reset
+    reboot
+
+This is useful in cases where the device supports multiple recovery methods
+which can be used as fallbacks.
+
 Robustness
 ----------

From patchwork Mon Sep 30 07:38:44 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Raag Jadav
X-Patchwork-Id: 13815493
Return-Path:
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 57F93CF6497 for ; Mon, 30 Sep 2024 07:39:55 +0000 (UTC)
Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id 02B2E10E3B0; Mon, 30 Sep 2024 07:39:55 +0000 (UTC)
Authentication-Results: gabe.freedesktop.org; dkim=pass (2048-bit key; unprotected) header.d=intel.com header.i=@intel.com header.b="LsaU5xH2"; dkim-atps=neutral
Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.11]) by gabe.freedesktop.org (Postfix) with ESMTPS id 4D66210E3B0; Mon, 30 Sep 2024 07:39:53 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1727681994; x=1759217994; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding;
bh=IbBxaLr90y+i/HUYv7f8Dy2FQ8ilzw0hvAzy09Og+vk=; b=LsaU5xH2E/kQa1XteidqE2bgC9IRcZsU+ofN6XrFy2bVGpPuWFfMr8yZ w6e33shz/StEwjamNSASrtgsZSnfzSuh7/vB9NmYAV5ed5qL+UiYc+unG mcGzsBzMbX8n09D+q+tK7WfAEwOf5EZxgfNohhxuUOUBbfE8zvtWKXW4O P9heLwXmhCHYKq67PvEgMqT/BJPxfpzrrWvo3YmMYysNZF+Zj9NNfWVpv gBfVadAbPcmymp4jqK6hq0g+1S2N0FeV0wT16+Q2U/ld1FCDdvTkU6i7W venk/xOBiRZIZ7zcqatRnMqLWXt/Jvqsoeu5JCyeXiMBiELz5OWOn9/YV Q==; X-CSE-ConnectionGUID: HcCOipYjSkieeHP/1yVgjg== X-CSE-MsgGUID: NVTXffNiQiaeCjOHBTjIIw== X-IronPort-AV: E=McAfee;i="6700,10204,11210"; a="37315507" X-IronPort-AV: E=Sophos;i="6.11,165,1725346800"; d="scan'208";a="37315507" Received: from fmviesa006.fm.intel.com ([10.60.135.146]) by orvoesa103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 30 Sep 2024 00:39:54 -0700 X-CSE-ConnectionGUID: kRb58UdsSBKgiljshzycVw== X-CSE-MsgGUID: iWUYq9N1TCqhdePt1C5Drw== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.11,165,1725346800"; d="scan'208";a="72797465" Received: from jraag-nuc8i7beh.iind.intel.com ([10.145.169.79]) by fmviesa006.fm.intel.com with ESMTP; 30 Sep 2024 00:39:47 -0700 From: Raag Jadav To: airlied@gmail.com, simona@ffwll.ch, lucas.demarchi@intel.com, thomas.hellstrom@linux.intel.com, rodrigo.vivi@intel.com, jani.nikula@linux.intel.com, andriy.shevchenko@linux.intel.com, joonas.lahtinen@linux.intel.com, tursulin@ursulin.net, lina@asahilina.net Cc: intel-xe@lists.freedesktop.org, intel-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org, himal.prasad.ghimiray@intel.com, francois.dugast@intel.com, aravind.iddamsetty@linux.intel.com, anshuman.gupta@intel.com, andi.shyti@linux.intel.com, matthew.d.roper@intel.com, Raag Jadav Subject: [PATCH v7 4/5] drm/xe: Use device wedged event Date: Mon, 30 Sep 2024 13:08:44 +0530 Message-Id: <20240930073845.347326-5-raag.jadav@intel.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20240930073845.347326-1-raag.jadav@intel.com> References: <20240930073845.347326-1-raag.jadav@intel.com> MIME-Version: 1.0 X-BeenThere: 
intel-gfx@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Intel graphics driver community testing & development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: intel-gfx-bounces@lists.freedesktop.org Sender: "Intel-gfx" This was previously attempted as xe specific reset uevent but dropped in commit 77a0d4d1cea2 ("drm/xe/uapi: Remove reset uevent for now") as part of refactoring. Now that we have device wedged event provided by DRM core, make use of it and support both driver rebind and bus-reset based recovery. With this in place userspace will be notified of wedged device, on the basis of which, userspace may take respective action to recover the device. $ udevadm monitor --property --kernel monitor will print the received events for: KERNEL - the kernel uevent KERNEL[265.802982] change /devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:01.0/0000:03:00.0/drm/card0 (drm) ACTION=change DEVPATH=/devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:01.0/0000:03:00.0/drm/card0 SUBSYSTEM=drm WEDGED=bus-reset DEVNAME=/dev/dri/card0 DEVTYPE=drm_minor SEQNUM=5208 MAJOR=226 MINOR=0 v2: Change authorship to Himal (Aravind) Add uevent for all device wedged cases (Aravind) v3: Generic re-implementation in DRM subsystem (Lucas) v4: Change authorship to Raag (Aravind) Signed-off-by: Raag Jadav --- drivers/gpu/drm/xe/xe_device.c | 17 +++++++++++++++-- drivers/gpu/drm/xe/xe_device.h | 1 + drivers/gpu/drm/xe/xe_pci.c | 2 ++ 3 files changed, 18 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c index 8e9b551c7033..bbf2052a91ba 100644 --- a/drivers/gpu/drm/xe/xe_device.c +++ b/drivers/gpu/drm/xe/xe_device.c @@ -785,6 +785,15 @@ int xe_device_probe(struct xe_device *xe) return err; } +void xe_setup_wedge_recovery(struct xe_device *xe) +{ + struct drm_device *dev = &xe->drm; + + /* Support both driver rebind and bus-reset based recovery. 
*/ + set_bit(DRM_WEDGE_RECOVERY_REBIND, &dev->wedge_recovery); + set_bit(DRM_WEDGE_RECOVERY_BUS_RESET, &dev->wedge_recovery); +} + static void xe_device_remove_display(struct xe_device *xe) { xe_display_unregister(xe); @@ -991,11 +1000,12 @@ static void xe_device_wedged_fini(struct drm_device *drm, void *arg) * xe_device_declare_wedged - Declare device wedged * @xe: xe device instance * - * This is a final state that can only be cleared with a mudule + * This is a final state that can only be cleared with a module * re-probe (unbind + bind). * In this state every IOCTL will be blocked so the GT cannot be used. * In general it will be called upon any critical error such as gt reset - * failure or guc loading failure. + * failure or guc loading failure. Userspace will be notified of this state + * by a DRM uevent. * If xe.wedged module parameter is set to 2, this function will be called * on every single execution timeout (a.k.a. GPU hang) right after devcoredump * snapshot capture. In this mode, GT reset won't be attempted so the state of @@ -1025,6 +1035,9 @@ void xe_device_declare_wedged(struct xe_device *xe) "IOCTLs and executions are blocked. 
Only a rebind may clear the failure\n" "Please file a _new_ bug report at https://gitlab.freedesktop.org/drm/xe/kernel/issues/new\n", dev_name(xe->drm.dev)); + + /* Notify userspace of wedged device */ + drm_dev_wedged_event(&xe->drm, DRM_WEDGE_RECOVERY_BUS_RESET); } for_each_gt(gt, xe, id) diff --git a/drivers/gpu/drm/xe/xe_device.h b/drivers/gpu/drm/xe/xe_device.h index 4c3f0ebe78a9..ca4b3935a982 100644 --- a/drivers/gpu/drm/xe/xe_device.h +++ b/drivers/gpu/drm/xe/xe_device.h @@ -186,6 +186,7 @@ static inline bool xe_device_wedged(struct xe_device *xe) return atomic_read(&xe->wedged.flag); } +void xe_setup_wedge_recovery(struct xe_device *xe); void xe_device_declare_wedged(struct xe_device *xe); struct xe_file *xe_file_get(struct xe_file *xef); diff --git a/drivers/gpu/drm/xe/xe_pci.c b/drivers/gpu/drm/xe/xe_pci.c index edaeefd2d648..e7a1d59c40a9 100644 --- a/drivers/gpu/drm/xe/xe_pci.c +++ b/drivers/gpu/drm/xe/xe_pci.c @@ -860,6 +860,8 @@ static int xe_pci_probe(struct pci_dev *pdev, const struct pci_device_id *ent) if (err) goto err_driver_cleanup; + xe_setup_wedge_recovery(xe); + drm_dbg(&xe->drm, "d3cold: capable=%s\n", str_yes_no(xe->d3cold.capable)); From patchwork Mon Sep 30 07:38:45 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Raag Jadav X-Patchwork-Id: 13815494 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id C11E5CF649D for ; Mon, 30 Sep 2024 07:40:00 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id 659C310E3B5; Mon, 30 Sep 2024 07:40:00 +0000 (UTC) Authentication-Results: gabe.freedesktop.org; dkim=pass (2048-bit 
key; unprotected) header.d=intel.com header.i=@intel.com header.b="KcsHX4NM"; dkim-atps=neutral Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.11]) by gabe.freedesktop.org (Postfix) with ESMTPS id 8297410E3BC; Mon, 30 Sep 2024 07:39:58 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1727681999; x=1759217999; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=Sl6Qv8wOCtprWGv2/+M1HJjrHpTQPyogDzqG/JnxWnc=; b=KcsHX4NM97jBUZaKU8zINU5v733oNdNRSgp1mP4uqL17YRN0N/7CkLoC PtbKuTXg8udUXtOzm9g20tWzcW5CEHP//AZ/3tX4DKUgbXtVTbUGULOd+ Goy1Rgy0besVF1smYklUzB8GYXR4jIWjj/abvM2+HiTfpPm41MMafbFZ2 34aQzZGLQT89x1GUqwtW+pVDCQlfk5iLE0eNRRV5yt96cKzoWOq9OSJ8F wvaKXu3YiUX4dPijp70rlYdueKyg3WDM5uFQvmMSb9gXq1YTvA86G+RoJ Q6EcBMQv/d2jIK4GJH6S3Lgnd3bLUUuIb1N7gB0hQkbEOJWQ4ghqkXIJh w==; X-CSE-ConnectionGUID: MmgkMwuiRgOsqfnGKBU28A== X-CSE-MsgGUID: fq5L/kSeSlKT0WgNGU0z8g== X-IronPort-AV: E=McAfee;i="6700,10204,11210"; a="37315523" X-IronPort-AV: E=Sophos;i="6.11,165,1725346800"; d="scan'208";a="37315523" Received: from fmviesa006.fm.intel.com ([10.60.135.146]) by orvoesa103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 30 Sep 2024 00:39:59 -0700 X-CSE-ConnectionGUID: hKsre1f4TCWB1TBWmBPQ5w== X-CSE-MsgGUID: n/16mOQaQGW+WCpR5FkaKA== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.11,165,1725346800"; d="scan'208";a="72797481" Received: from jraag-nuc8i7beh.iind.intel.com ([10.145.169.79]) by fmviesa006.fm.intel.com with ESMTP; 30 Sep 2024 00:39:53 -0700 From: Raag Jadav To: airlied@gmail.com, simona@ffwll.ch, lucas.demarchi@intel.com, thomas.hellstrom@linux.intel.com, rodrigo.vivi@intel.com, jani.nikula@linux.intel.com, andriy.shevchenko@linux.intel.com, joonas.lahtinen@linux.intel.com, tursulin@ursulin.net, lina@asahilina.net Cc: intel-xe@lists.freedesktop.org, intel-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org, himal.prasad.ghimiray@intel.com, 
francois.dugast@intel.com, aravind.iddamsetty@linux.intel.com, anshuman.gupta@intel.com, andi.shyti@linux.intel.com, matthew.d.roper@intel.com, Raag Jadav Subject: [PATCH v7 5/5] drm/i915: Use device wedged event Date: Mon, 30 Sep 2024 13:08:45 +0530 Message-Id: <20240930073845.347326-6-raag.jadav@intel.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20240930073845.347326-1-raag.jadav@intel.com> References: <20240930073845.347326-1-raag.jadav@intel.com> MIME-Version: 1.0 X-BeenThere: intel-gfx@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Intel graphics driver community testing & development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: intel-gfx-bounces@lists.freedesktop.org Sender: "Intel-gfx" Now that we have device wedged event provided by DRM core, make use of it and support both driver rebind and bus-reset based recovery. With this in place, userspace will be notified of wedged device on gt reset failure. Signed-off-by: Raag Jadav --- drivers/gpu/drm/i915/gt/intel_reset.c | 2 ++ drivers/gpu/drm/i915/i915_driver.c | 10 ++++++++++ 2 files changed, 12 insertions(+) diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c b/drivers/gpu/drm/i915/gt/intel_reset.c index 8f1ea95471ef..02f357d4e4fb 100644 --- a/drivers/gpu/drm/i915/gt/intel_reset.c +++ b/drivers/gpu/drm/i915/gt/intel_reset.c @@ -1418,6 +1418,8 @@ static void intel_gt_reset_global(struct intel_gt *gt, if (!test_bit(I915_WEDGED, &gt->reset.flags)) kobject_uevent_env(kobj, KOBJ_CHANGE, reset_done_event); + else + drm_dev_wedged_event(&gt->i915->drm, DRM_WEDGE_RECOVERY_BUS_RESET); } /** diff --git a/drivers/gpu/drm/i915/i915_driver.c b/drivers/gpu/drm/i915/i915_driver.c index fe905d65ddf7..389d9fc67eeb 100644 --- a/drivers/gpu/drm/i915/i915_driver.c +++ b/drivers/gpu/drm/i915/i915_driver.c @@ -711,6 +711,15 @@ static void i915_welcome_messages(struct drm_i915_private *dev_priv) "DRM_I915_DEBUG_RUNTIME_PM enabled\n"); } +static void 
i915_setup_wedge_recovery(struct drm_i915_private *i915) +{ + struct drm_device *dev = &i915->drm; + + /* Support both driver rebind and bus-reset based recovery. */ + set_bit(DRM_WEDGE_RECOVERY_REBIND, &dev->wedge_recovery); + set_bit(DRM_WEDGE_RECOVERY_BUS_RESET, &dev->wedge_recovery); +} + static struct drm_i915_private * i915_driver_create(struct pci_dev *pdev, const struct pci_device_id *ent) { @@ -812,6 +821,7 @@ int i915_driver_probe(struct pci_dev *pdev, const struct pci_device_id *ent) enable_rpm_wakeref_asserts(&i915->runtime_pm); + i915_setup_wedge_recovery(i915); i915_welcome_messages(i915); i915->do_release = true;
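
The series leaves the consumer side to userspace. As a rough illustration of
what such a consumer might do with the WEDGED= property, here is a minimal
sketch in plain C. It assumes the consumer has already obtained one KEY=VALUE
property string from the uevent (for example through udev). The names
wedge_method, wedge_table, wedge_method_from_property() and wedge_action() are
hypothetical and are not part of these patches; only the three method strings
and their consumer expectations are taken from the table in patch 1.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical userspace-side names; not part of the kernel patches. */
enum wedge_method {
	WEDGE_NONE = -1,	/* not a recognised WEDGED= property */
	WEDGE_REBIND,
	WEDGE_BUS_RESET,
	WEDGE_REBOOT,
};

struct wedge_entry {
	const char *name;	/* value carried after "WEDGED=" */
	const char *action;	/* consumer expectation from the series */
};

/* Mirrors the recovery method table from the commit message. */
static const struct wedge_entry wedge_table[] = {
	[WEDGE_REBIND]    = { "rebind",    "unbind + rebind driver" },
	[WEDGE_BUS_RESET] = { "bus-reset", "unbind + reset bus device + rebind" },
	[WEDGE_REBOOT]    = { "reboot",    "reboot system" },
};

/* Map one uevent property such as "WEDGED=bus-reset" to a method. */
enum wedge_method wedge_method_from_property(const char *prop)
{
	static const char key[] = "WEDGED=";
	size_t i;

	if (!prop || strncmp(prop, key, sizeof(key) - 1) != 0)
		return WEDGE_NONE;

	prop += sizeof(key) - 1;
	for (i = 0; i < sizeof(wedge_table) / sizeof(wedge_table[0]); i++)
		if (strcmp(prop, wedge_table[i].name) == 0)
			return (enum wedge_method)i;

	/* Unknown value: leave the decision to the caller. */
	return WEDGE_NONE;
}

/* Describe what the consumer is expected to do, or NULL if unknown. */
const char *wedge_action(enum wedge_method method)
{
	if (method < WEDGE_REBIND || method > WEDGE_REBOOT)
		return NULL;

	return wedge_table[method].action;
}
```

A consumer could call wedge_method_from_property() on each property of a drm
"change" uevent and act only when the result is not WEDGE_NONE. Treating
unknown values as WEDGE_NONE keeps the sketch tolerant of methods that may be
added later; whether to ignore or escalate in that case is a policy choice for
the consumer.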