[v2] drm/xe/uapi: Bring back reset uevent

Message ID	20240812074812.1457164-1-raag.jadav@intel.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <dri-devel-bounces@lists.freedesktop.org> From: Raag Jadav <raag.jadav@intel.com> To: lucas.demarchi@intel.com, thomas.hellstrom@linux.intel.com Cc: intel-xe@lists.freedesktop.org, dri-devel@lists.freedesktop.org, himal.prasad.ghimiray@intel.com, francois.dugast@intel.com, rodrigo.vivi@intel.com, aravind.iddamsetty@linux.intel.com, anshuman.gupta@intel.com, Raag Jadav <raag.jadav@intel.com> Subject: [PATCH v2] drm/xe/uapi: Bring back reset uevent Date: Mon, 12 Aug 2024 13:18:12 +0530 Message-Id: <20240812074812.1457164-1-raag.jadav@intel.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: list Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" <dri-devel-bounces@lists.freedesktop.org>
Series	[v2] drm/xe/uapi: Bring back reset uevent \| expand [v2] drm/xe/uapi: Bring back reset uevent

Raag Jadav Aug. 12, 2024, 7:48 a.m. UTC

From: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>

This was dropped in commit 77a0d4d1cea2 ("drm/xe/uapi: Remove reset
uevent for now") as part of refactoring.

Now that we have better uapi semantics and naming for the uevent,
bring it back. With this in place, userspace will be notified of
wedged device along with its reason.

$ udevadm monitor --property --kernel
monitor will print the received events for:
KERNEL - the kernel uevent

KERNEL[871.188570] change   /devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:01.0/0000:03:00.0 (pci)
ACTION=change
DEVPATH=/devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:01.0/0000:03:00.0
SUBSYSTEM=pci
DEVICE_STATUS=NEEDS_RESET
REASON=GT_RESET_FAILED
TILE_ID=0
GT_ID=0
DRIVER=xe
PCI_CLASS=30000
PCI_ID=8086:56B1
PCI_SUBSYS_ID=8086:1210
PCI_SLOT_NAME=0000:03:00.0
MODALIAS=pci:v00008086d000056B1sv00008086sd00001210bc03sc00i00
SEQNUM=6104

v2: Change authorship to Himal (Aravind)
    Add uevent for all device wedged cases (Aravind)

Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Signed-off-by: Raag Jadav <raag.jadav@intel.com>
---
 drivers/gpu/drm/xe/xe_device.c     | 10 +++++++++-
 drivers/gpu/drm/xe/xe_device.h     |  2 +-
 drivers/gpu/drm/xe/xe_gt.c         | 23 +++++++++++++++++++----
 drivers/gpu/drm/xe/xe_guc.c        | 13 ++++++++++++-
 drivers/gpu/drm/xe/xe_guc_submit.c | 13 ++++++++++++-
 include/uapi/drm/xe_drm.h          | 29 +++++++++++++++++++++++++++++
 6 files changed, 82 insertions(+), 8 deletions(-)

Ghimiray, Himal Prasad Aug. 12, 2024, 8:06 a.m. UTC | #1

On 12-08-2024 13:18, Raag Jadav wrote:
> From: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
> 
> This was dropped in commit 77a0d4d1cea2 ("drm/xe/uapi: Remove reset
> uevent for now") as part of refactoring.
> 
> Now that we have better uapi semantics and naming for the uevent,
> bring it back. With this in place, userspace will be notified of
> wedged device along with its reason.
> 
> $ udevadm monitor --property --kernel
> monitor will print the received events for:
> KERNEL - the kernel uevent
> 
> KERNEL[871.188570] change   /devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:01.0/0000:03:00.0 (pci)
> ACTION=change
> DEVPATH=/devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:01.0/0000:03:00.0
> SUBSYSTEM=pci
> DEVICE_STATUS=NEEDS_RESET
> REASON=GT_RESET_FAILED
> TILE_ID=0
> GT_ID=0
> DRIVER=xe
> PCI_CLASS=30000
> PCI_ID=8086:56B1
> PCI_SUBSYS_ID=8086:1210
> PCI_SLOT_NAME=0000:03:00.0
> MODALIAS=pci:v00008086d000056B1sv00008086sd00001210bc03sc00i00
> SEQNUM=6104
> 
> v2: Change authorship to Himal (Aravind)
>      Add uevent for all device wedged cases (Aravind)
> 

Thanks for the Patch.

You are extending the UAPI beyond GT reset.
Please add your Co-authored-by:

> Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
> Signed-off-by: Raag Jadav <raag.jadav@intel.com>
> ---
>   drivers/gpu/drm/xe/xe_device.c     | 10 +++++++++-
>   drivers/gpu/drm/xe/xe_device.h     |  2 +-
>   drivers/gpu/drm/xe/xe_gt.c         | 23 +++++++++++++++++++----
>   drivers/gpu/drm/xe/xe_guc.c        | 13 ++++++++++++-
>   drivers/gpu/drm/xe/xe_guc_submit.c | 13 ++++++++++++-
>   include/uapi/drm/xe_drm.h          | 29 +++++++++++++++++++++++++++++
>   6 files changed, 82 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
> index 1aba6f9eaa19..d975bdce4a7d 100644
> --- a/drivers/gpu/drm/xe/xe_device.c
> +++ b/drivers/gpu/drm/xe/xe_device.c
> @@ -955,6 +955,7 @@ static void xe_device_wedged_fini(struct drm_device *drm, void *arg)
>   /**
>    * xe_device_declare_wedged - Declare device wedged
>    * @xe: xe device instance
> + * @event_params: parameters to be sent along with uevent
>    *
>    * This is a final state that can only be cleared with a mudule

since you are here. Please fix typo s/mudule/module.

>    * re-probe (unbind + bind).
> @@ -965,8 +966,10 @@ static void xe_device_wedged_fini(struct drm_device *drm, void *arg)
>    * on every single execution timeout (a.k.a. GPU hang) right after devcoredump
>    * snapshot capture. In this mode, GT reset won't be attempted so the state of
>    * the issue is preserved for further debugging.
> + * Caller is expected to pass respective parameters to be sent along with
> + * uevent. Pass NULL in case of no params.
>    */
> -void xe_device_declare_wedged(struct xe_device *xe)
> +void xe_device_declare_wedged(struct xe_device *xe, char **event_params)
>   {
>   	struct xe_gt *gt;
>   	u8 id;
> @@ -984,12 +987,17 @@ void xe_device_declare_wedged(struct xe_device *xe)
>   	xe_pm_runtime_get_noresume(xe);
>   
>   	if (!atomic_xchg(&xe->wedged.flag, 1)) {
> +		struct pci_dev *pdev = to_pci_dev(xe->drm.dev);
> +
>   		xe->needs_flr_on_fini = true;
>   		drm_err(&xe->drm,
>   			"CRITICAL: Xe has declared device %s as wedged.\n"
>   			"IOCTLs and executions are blocked. Only a rebind may clear the failure\n"
>   			"Please file a _new_ bug report at https://gitlab.freedesktop.org/drm/xe/kernel/issues/new\n",
>   			dev_name(xe->drm.dev));
> +
> +		/* Notify userspace about reset required */
> +		kobject_uevent_env(&pdev->dev.kobj, KOBJ_CHANGE, event_params);
>   	}
>   
>   	for_each_gt(gt, xe, id)
> diff --git a/drivers/gpu/drm/xe/xe_device.h b/drivers/gpu/drm/xe/xe_device.h
> index db6cc8d0d6b8..5d40fc6f0904 100644
> --- a/drivers/gpu/drm/xe/xe_device.h
> +++ b/drivers/gpu/drm/xe/xe_device.h
> @@ -174,7 +174,7 @@ static inline bool xe_device_wedged(struct xe_device *xe)
>   	return atomic_read(&xe->wedged.flag);
>   }
>   
> -void xe_device_declare_wedged(struct xe_device *xe);
> +void xe_device_declare_wedged(struct xe_device *xe, char **reset_event);
>   
>   struct xe_file *xe_file_get(struct xe_file *xef);
>   void xe_file_put(struct xe_file *xef);
> diff --git a/drivers/gpu/drm/xe/xe_gt.c b/drivers/gpu/drm/xe/xe_gt.c
> index 58895ed22f6e..519f3c2cf9e2 100644
> --- a/drivers/gpu/drm/xe/xe_gt.c
> +++ b/drivers/gpu/drm/xe/xe_gt.c
> @@ -741,6 +741,24 @@ static int do_gt_restart(struct xe_gt *gt)
>   	return 0;
>   }
>   
> +static void xe_gt_reset_failed(struct xe_gt *gt, int err)
> +{
> +	char *event_params[5];
> +
> +	xe_gt_err(gt, "reset failed (%pe)\n", ERR_PTR(err));
> +
> +	event_params[0] = DRM_XE_RESET_REQUIRED_UEVENT;
> +	event_params[1] = DRM_XE_RESET_REQUIRED_UEVENT_REASON_GT;
> +	event_params[2] = kasprintf(GFP_KERNEL, "TILE_ID=%d", gt_to_tile(gt)->id);
> +	event_params[3] = kasprintf(GFP_KERNEL, "GT_ID=%d", gt->info.id);
> +	event_params[4] = NULL;
> +
> +	xe_device_declare_wedged(gt_to_xe(gt), event_params);
> +
> +	kfree(event_params[2]);
> +	kfree(event_params[3]);
> +}
> +
>   static int gt_reset(struct xe_gt *gt)
>   {
>   	int err;
> @@ -796,10 +814,7 @@ static int gt_reset(struct xe_gt *gt)
>   	XE_WARN_ON(xe_uc_start(&gt->uc));
>   	xe_pm_runtime_put(gt_to_xe(gt));
>   err_fail:
> -	xe_gt_err(gt, "reset failed (%pe)\n", ERR_PTR(err));
> -
> -	xe_device_declare_wedged(gt_to_xe(gt));
> -
> +	xe_gt_reset_failed(gt, err);
>   	return err;
>   }
>   
> diff --git a/drivers/gpu/drm/xe/xe_guc.c b/drivers/gpu/drm/xe/xe_guc.c
> index de0fe9e65746..b544012f5b11 100644
> --- a/drivers/gpu/drm/xe/xe_guc.c
> +++ b/drivers/gpu/drm/xe/xe_guc.c
> @@ -560,6 +560,17 @@ static s32 guc_pc_get_cur_freq(struct xe_guc_pc *guc_pc)
>   	return ret ? ret : freq;
>   }
>   
> +static void xe_guc_load_failed(struct xe_gt *gt)
> +{
> +	char *event_params[3];
> +
> +	event_params[0] = DRM_XE_RESET_REQUIRED_UEVENT;
> +	event_params[1] = DRM_XE_RESET_REQUIRED_UEVENT_REASON_GUC;
> +	event_params[2] = NULL;
> +
> +	xe_device_declare_wedged(gt_to_xe(gt), event_params);
> +}
> +
>   /*
>    * Wait for the GuC to start up.
>    *
> @@ -684,7 +695,7 @@ static void guc_wait_ucode(struct xe_guc *guc)
>   			break;
>   		}
>   
> -		xe_device_declare_wedged(gt_to_xe(gt));
> +		xe_guc_load_failed(gt);
>   	} else if (delta_ms > GUC_LOAD_TIME_WARN_MS) {
>   		xe_gt_warn(gt, "excessive init time: %lldms! [status = 0x%08X, timeouts = %d]\n",
>   			   delta_ms, status, count);
> diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
> index 460808507947..33ed6221f465 100644
> --- a/drivers/gpu/drm/xe/xe_guc_submit.c
> +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
> @@ -891,6 +891,17 @@ void xe_guc_submit_wedge(struct xe_guc *guc)
>   	mutex_unlock(&guc->submission_state.lock);
>   }
>   
> +static void xe_exec_queue_timedout(struct xe_device *xe)
> +{
> +	char *event_params[3];
> +
> +	event_params[0] = DRM_XE_RESET_REQUIRED_UEVENT;
> +	event_params[1] = DRM_XE_RESET_REQUIRED_UEVENT_REASON_TOUT;
> +	event_params[2] = NULL;
> +
> +	xe_device_declare_wedged(xe, event_params);
> +}
> +
>   static bool guc_submit_hint_wedged(struct xe_guc *guc)
>   {
>   	struct xe_device *xe = guc_to_xe(guc);
> @@ -901,7 +912,7 @@ static bool guc_submit_hint_wedged(struct xe_guc *guc)
>   	if (xe_device_wedged(xe))
>   		return true;
>   
> -	xe_device_declare_wedged(xe);
> +	xe_exec_queue_timedout(xe);
>   
>   	return true;
>   }
> diff --git a/include/uapi/drm/xe_drm.h b/include/uapi/drm/xe_drm.h
> index b6fbe4988f2e..dd2f36710057 100644
> --- a/include/uapi/drm/xe_drm.h
> +++ b/include/uapi/drm/xe_drm.h
> @@ -20,6 +20,7 @@ extern "C" {
>    *   2. Extension definition and helper structs
>    *   3. IOCTL's Query structs in the order of the Query's entries.
>    *   4. The rest of IOCTL structs in the order of IOCTL declaration.
> + *   5. uEvents
>    */
>   
>   /**
> @@ -1694,6 +1695,34 @@ struct drm_xe_oa_stream_info {
>   	__u64 reserved[3];
>   };
>   
> +/**
> + * DOC: uevent generated by xe on it's pci node.
> + *
> + * DRM_XE_RESET_REQUIRED_UEVENT - Event is generated when device needs reset.
> + * The REASON is provided along with the event for which reset is required.
> + * On the basis of REASONS, additional information might be supplied.
> + */
> +#define DRM_XE_RESET_REQUIRED_UEVENT "DEVICE_STATUS=NEEDS_RESET"
> +
> +/**
> + * DRM_XE_RESET_REQUIRED_UEVENT_REASON_GT - Reason provided to
> + * DRM_XE_RESET_REQUIRED_UEVENT incase of gt reset failure. The additional
> + * information supplied is tile id and gt id for which reset has failed.
> + */
> +#define DRM_XE_RESET_REQUIRED_UEVENT_REASON_GT "REASON=GT_RESET_FAILED"
> +
> +/**
> + * DRM_XE_RESET_REQUIRED_UEVENT_REASON_GUC - Reason provided to
> + * DRM_XE_RESET_REQUIRED_UEVENT incase of guc fw load failure.
> + */
> +#define DRM_XE_RESET_REQUIRED_UEVENT_REASON_GUC "REASON=GUC_LOAD_FAILED"
> +
> +/**
> + * DRM_XE_RESET_REQUIRED_UEVENT_REASON_TOUT - Reason provided to
> + * DRM_XE_RESET_REQUIRED_UEVENT incase of exec queue timeout.
> + */
> +#define DRM_XE_RESET_REQUIRED_UEVENT_REASON_TOUT "REASON=EXEC_QUEUE_TIMEDOUT"
> +
>   #if defined(__cplusplus)
>   }
>   #endif

Aravind Iddamsetty Aug. 12, 2024, 9:38 a.m. UTC | #2

On 12/08/24 13:18, Raag Jadav wrote:
> From: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
>
> This was dropped in commit 77a0d4d1cea2 ("drm/xe/uapi: Remove reset
> uevent for now") as part of refactoring.
>
> Now that we have better uapi semantics and naming for the uevent,
> bring it back. With this in place, userspace will be notified of
> wedged device along with its reason.
>
> $ udevadm monitor --property --kernel
> monitor will print the received events for:
> KERNEL - the kernel uevent
>
> KERNEL[871.188570] change   /devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:01.0/0000:03:00.0 (pci)
> ACTION=change
> DEVPATH=/devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:01.0/0000:03:00.0
> SUBSYSTEM=pci
> DEVICE_STATUS=NEEDS_RESET
> REASON=GT_RESET_FAILED
> TILE_ID=0
> GT_ID=0
> DRIVER=xe
> PCI_CLASS=30000
> PCI_ID=8086:56B1
> PCI_SUBSYS_ID=8086:1210
> PCI_SLOT_NAME=0000:03:00.0
> MODALIAS=pci:v00008086d000056B1sv00008086sd00001210bc03sc00i00
> SEQNUM=6104
>
> v2: Change authorship to Himal (Aravind)
>     Add uevent for all device wedged cases (Aravind)
>
> Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
> Signed-off-by: Raag Jadav <raag.jadav@intel.com>
> ---
>  drivers/gpu/drm/xe/xe_device.c     | 10 +++++++++-
>  drivers/gpu/drm/xe/xe_device.h     |  2 +-
>  drivers/gpu/drm/xe/xe_gt.c         | 23 +++++++++++++++++++----
>  drivers/gpu/drm/xe/xe_guc.c        | 13 ++++++++++++-
>  drivers/gpu/drm/xe/xe_guc_submit.c | 13 ++++++++++++-
>  include/uapi/drm/xe_drm.h          | 29 +++++++++++++++++++++++++++++
>  6 files changed, 82 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
> index 1aba6f9eaa19..d975bdce4a7d 100644
> --- a/drivers/gpu/drm/xe/xe_device.c
> +++ b/drivers/gpu/drm/xe/xe_device.c
> @@ -955,6 +955,7 @@ static void xe_device_wedged_fini(struct drm_device *drm, void *arg)
>  /**
>   * xe_device_declare_wedged - Declare device wedged
>   * @xe: xe device instance
> + * @event_params: parameters to be sent along with uevent
>   *
>   * This is a final state that can only be cleared with a mudule
>   * re-probe (unbind + bind).
> @@ -965,8 +966,10 @@ static void xe_device_wedged_fini(struct drm_device *drm, void *arg)
>   * on every single execution timeout (a.k.a. GPU hang) right after devcoredump
>   * snapshot capture. In this mode, GT reset won't be attempted so the state of
>   * the issue is preserved for further debugging.
> + * Caller is expected to pass respective parameters to be sent along with
> + * uevent. Pass NULL in case of no params.
>   */
> -void xe_device_declare_wedged(struct xe_device *xe)
> +void xe_device_declare_wedged(struct xe_device *xe, char **event_params)
>  {
>  	struct xe_gt *gt;
>  	u8 id;
> @@ -984,12 +987,17 @@ void xe_device_declare_wedged(struct xe_device *xe)
>  	xe_pm_runtime_get_noresume(xe);
>  
>  	if (!atomic_xchg(&xe->wedged.flag, 1)) {
> +		struct pci_dev *pdev = to_pci_dev(xe->drm.dev);
> +
>  		xe->needs_flr_on_fini = true;
>  		drm_err(&xe->drm,
>  			"CRITICAL: Xe has declared device %s as wedged.\n"
>  			"IOCTLs and executions are blocked. Only a rebind may clear the failure\n"
>  			"Please file a _new_ bug report at https://gitlab.freedesktop.org/drm/xe/kernel/issues/new\n",
>  			dev_name(xe->drm.dev));
> +
> +		/* Notify userspace about reset required */
> +		kobject_uevent_env(&pdev->dev.kobj, KOBJ_CHANGE, event_params);
>  	}
>  
>  	for_each_gt(gt, xe, id)
> diff --git a/drivers/gpu/drm/xe/xe_device.h b/drivers/gpu/drm/xe/xe_device.h
> index db6cc8d0d6b8..5d40fc6f0904 100644
> --- a/drivers/gpu/drm/xe/xe_device.h
> +++ b/drivers/gpu/drm/xe/xe_device.h
> @@ -174,7 +174,7 @@ static inline bool xe_device_wedged(struct xe_device *xe)
>  	return atomic_read(&xe->wedged.flag);
>  }
>  
> -void xe_device_declare_wedged(struct xe_device *xe);
> +void xe_device_declare_wedged(struct xe_device *xe, char **reset_event);
>  
>  struct xe_file *xe_file_get(struct xe_file *xef);
>  void xe_file_put(struct xe_file *xef);
> diff --git a/drivers/gpu/drm/xe/xe_gt.c b/drivers/gpu/drm/xe/xe_gt.c
> index 58895ed22f6e..519f3c2cf9e2 100644
> --- a/drivers/gpu/drm/xe/xe_gt.c
> +++ b/drivers/gpu/drm/xe/xe_gt.c
> @@ -741,6 +741,24 @@ static int do_gt_restart(struct xe_gt *gt)
>  	return 0;
>  }
>  
> +static void xe_gt_reset_failed(struct xe_gt *gt, int err)
> +{
> +	char *event_params[5];
> +
> +	xe_gt_err(gt, "reset failed (%pe)\n", ERR_PTR(err));
> +
> +	event_params[0] = DRM_XE_RESET_REQUIRED_UEVENT;
> +	event_params[1] = DRM_XE_RESET_REQUIRED_UEVENT_REASON_GT;
> +	event_params[2] = kasprintf(GFP_KERNEL, "TILE_ID=%d", gt_to_tile(gt)->id);
> +	event_params[3] = kasprintf(GFP_KERNEL, "GT_ID=%d", gt->info.id);

the TILE_ID and GT_ID can be passed for other events as well, with that you can
have a common function to send uevent which would take reason as an input.
> +	event_params[4] = NULL;
> +
> +	xe_device_declare_wedged(gt_to_xe(gt), event_params);
> +
> +	kfree(event_params[2]);
> +	kfree(event_params[3]);
> +}
> +
>  static int gt_reset(struct xe_gt *gt)
>  {
>  	int err;
> @@ -796,10 +814,7 @@ static int gt_reset(struct xe_gt *gt)
>  	XE_WARN_ON(xe_uc_start(&gt->uc));
>  	xe_pm_runtime_put(gt_to_xe(gt));
>  err_fail:
> -	xe_gt_err(gt, "reset failed (%pe)\n", ERR_PTR(err));
> -
> -	xe_device_declare_wedged(gt_to_xe(gt));
> -
> +	xe_gt_reset_failed(gt, err);
>  	return err;
>  }
>  
> diff --git a/drivers/gpu/drm/xe/xe_guc.c b/drivers/gpu/drm/xe/xe_guc.c
> index de0fe9e65746..b544012f5b11 100644
> --- a/drivers/gpu/drm/xe/xe_guc.c
> +++ b/drivers/gpu/drm/xe/xe_guc.c
> @@ -560,6 +560,17 @@ static s32 guc_pc_get_cur_freq(struct xe_guc_pc *guc_pc)
>  	return ret ? ret : freq;
>  }
>  
> +static void xe_guc_load_failed(struct xe_gt *gt)
> +{
> +	char *event_params[3];
> +
> +	event_params[0] = DRM_XE_RESET_REQUIRED_UEVENT;
> +	event_params[1] = DRM_XE_RESET_REQUIRED_UEVENT_REASON_GUC;
> +	event_params[2] = NULL;
> +
> +	xe_device_declare_wedged(gt_to_xe(gt), event_params);
> +}
> +
>  /*
>   * Wait for the GuC to start up.
>   *
> @@ -684,7 +695,7 @@ static void guc_wait_ucode(struct xe_guc *guc)
>  			break;
>  		}
>  
> -		xe_device_declare_wedged(gt_to_xe(gt));
> +		xe_guc_load_failed(gt);
>  	} else if (delta_ms > GUC_LOAD_TIME_WARN_MS) {
>  		xe_gt_warn(gt, "excessive init time: %lldms! [status = 0x%08X, timeouts = %d]\n",
>  			   delta_ms, status, count);
> diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
> index 460808507947..33ed6221f465 100644
> --- a/drivers/gpu/drm/xe/xe_guc_submit.c
> +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
> @@ -891,6 +891,17 @@ void xe_guc_submit_wedge(struct xe_guc *guc)
>  	mutex_unlock(&guc->submission_state.lock);
>  }
>  
> +static void xe_exec_queue_timedout(struct xe_device *xe)
> +{
> +	char *event_params[3];
> +
> +	event_params[0] = DRM_XE_RESET_REQUIRED_UEVENT;
> +	event_params[1] = DRM_XE_RESET_REQUIRED_UEVENT_REASON_TOUT;
> +	event_params[2] = NULL;
> +
> +	xe_device_declare_wedged(xe, event_params);
> +}
> +
>  static bool guc_submit_hint_wedged(struct xe_guc *guc)
>  {
>  	struct xe_device *xe = guc_to_xe(guc);
> @@ -901,7 +912,7 @@ static bool guc_submit_hint_wedged(struct xe_guc *guc)
>  	if (xe_device_wedged(xe))
>  		return true;
>  

i see that this is even called from xe_guc_exec_queue_lr_cleanup which is for long running queue
from various call paths need to check in what scenarios those happen.
> -	xe_device_declare_wedged(xe);
> +	xe_exec_queue_timedout(xe);
>  
>  	return true;
>  }
> diff --git a/include/uapi/drm/xe_drm.h b/include/uapi/drm/xe_drm.h
> index b6fbe4988f2e..dd2f36710057 100644
> --- a/include/uapi/drm/xe_drm.h
> +++ b/include/uapi/drm/xe_drm.h
> @@ -20,6 +20,7 @@ extern "C" {
>   *   2. Extension definition and helper structs
>   *   3. IOCTL's Query structs in the order of the Query's entries.
>   *   4. The rest of IOCTL structs in the order of IOCTL declaration.
> + *   5. uEvents
>   */
>  
>  /**
> @@ -1694,6 +1695,34 @@ struct drm_xe_oa_stream_info {
>  	__u64 reserved[3];
>  };
>  
> +/**
> + * DOC: uevent generated by xe on it's pci node.
> + *
> + * DRM_XE_RESET_REQUIRED_UEVENT - Event is generated when device needs reset.
> + * The REASON is provided along with the event for which reset is required.
> + * On the basis of REASONS, additional information might be supplied.
> + */
> +#define DRM_XE_RESET_REQUIRED_UEVENT "DEVICE_STATUS=NEEDS_RESET"
> +
> +/**
> + * DRM_XE_RESET_REQUIRED_UEVENT_REASON_GT - Reason provided to
> + * DRM_XE_RESET_REQUIRED_UEVENT incase of gt reset failure. The additional
> + * information supplied is tile id and gt id for which reset has failed.
> + */
> +#define DRM_XE_RESET_REQUIRED_UEVENT_REASON_GT "REASON=GT_RESET_FAILED"
> +
> +/**
> + * DRM_XE_RESET_REQUIRED_UEVENT_REASON_GUC - Reason provided to
> + * DRM_XE_RESET_REQUIRED_UEVENT incase of guc fw load failure.
> + */
> +#define DRM_XE_RESET_REQUIRED_UEVENT_REASON_GUC "REASON=GUC_LOAD_FAILED"
> +
> +/**
> + * DRM_XE_RESET_REQUIRED_UEVENT_REASON_TOUT - Reason provided to
> + * DRM_XE_RESET_REQUIRED_UEVENT incase of exec queue timeout.
> + */
> +#define DRM_XE_RESET_REQUIRED_UEVENT_REASON_TOUT "REASON=EXEC_QUEUE_TIMEDOUT"

better to  name DRM_XE_RESET_REQUIRED_UEVENT_REASON_EXEC_QUEUE_TIMEOUT for clarity

Thanks,
Aravind.
> +
>  #if defined(__cplusplus)
>  }
>  #endif

Raag Jadav Aug. 13, 2024, 1:28 p.m. UTC | #3

On Mon, Aug 12, 2024 at 03:08:14PM +0530, Aravind Iddamsetty wrote:
> 
> On 12/08/24 13:18, Raag Jadav wrote:
> > From: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
> >
> > This was dropped in commit 77a0d4d1cea2 ("drm/xe/uapi: Remove reset
> > uevent for now") as part of refactoring.
> >
> > Now that we have better uapi semantics and naming for the uevent,
> > bring it back. With this in place, userspace will be notified of
> > wedged device along with its reason.
> >
> > $ udevadm monitor --property --kernel
> > monitor will print the received events for:
> > KERNEL - the kernel uevent
> >
> > KERNEL[871.188570] change   /devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:01.0/0000:03:00.0 (pci)
> > ACTION=change
> > DEVPATH=/devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:01.0/0000:03:00.0
> > SUBSYSTEM=pci
> > DEVICE_STATUS=NEEDS_RESET
> > REASON=GT_RESET_FAILED
> > TILE_ID=0
> > GT_ID=0
> > DRIVER=xe
> > PCI_CLASS=30000
> > PCI_ID=8086:56B1
> > PCI_SUBSYS_ID=8086:1210
> > PCI_SLOT_NAME=0000:03:00.0
> > MODALIAS=pci:v00008086d000056B1sv00008086sd00001210bc03sc00i00
> > SEQNUM=6104
> >
> > v2: Change authorship to Himal (Aravind)
> >     Add uevent for all device wedged cases (Aravind)
> >
> > Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
> > Signed-off-by: Raag Jadav <raag.jadav@intel.com>
> > ---
> >  drivers/gpu/drm/xe/xe_device.c     | 10 +++++++++-
> >  drivers/gpu/drm/xe/xe_device.h     |  2 +-
> >  drivers/gpu/drm/xe/xe_gt.c         | 23 +++++++++++++++++++----
> >  drivers/gpu/drm/xe/xe_guc.c        | 13 ++++++++++++-
> >  drivers/gpu/drm/xe/xe_guc_submit.c | 13 ++++++++++++-
> >  include/uapi/drm/xe_drm.h          | 29 +++++++++++++++++++++++++++++
> >  6 files changed, 82 insertions(+), 8 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
> > index 1aba6f9eaa19..d975bdce4a7d 100644
> > --- a/drivers/gpu/drm/xe/xe_device.c
> > +++ b/drivers/gpu/drm/xe/xe_device.c
> > @@ -955,6 +955,7 @@ static void xe_device_wedged_fini(struct drm_device *drm, void *arg)
> >  /**
> >   * xe_device_declare_wedged - Declare device wedged
> >   * @xe: xe device instance
> > + * @event_params: parameters to be sent along with uevent
> >   *
> >   * This is a final state that can only be cleared with a mudule
> >   * re-probe (unbind + bind).
> > @@ -965,8 +966,10 @@ static void xe_device_wedged_fini(struct drm_device *drm, void *arg)
> >   * on every single execution timeout (a.k.a. GPU hang) right after devcoredump
> >   * snapshot capture. In this mode, GT reset won't be attempted so the state of
> >   * the issue is preserved for further debugging.
> > + * Caller is expected to pass respective parameters to be sent along with
> > + * uevent. Pass NULL in case of no params.
> >   */
> > -void xe_device_declare_wedged(struct xe_device *xe)
> > +void xe_device_declare_wedged(struct xe_device *xe, char **event_params)
> >  {
> >  	struct xe_gt *gt;
> >  	u8 id;
> > @@ -984,12 +987,17 @@ void xe_device_declare_wedged(struct xe_device *xe)
> >  	xe_pm_runtime_get_noresume(xe);
> >  
> >  	if (!atomic_xchg(&xe->wedged.flag, 1)) {
> > +		struct pci_dev *pdev = to_pci_dev(xe->drm.dev);
> > +
> >  		xe->needs_flr_on_fini = true;
> >  		drm_err(&xe->drm,
> >  			"CRITICAL: Xe has declared device %s as wedged.\n"
> >  			"IOCTLs and executions are blocked. Only a rebind may clear the failure\n"
> >  			"Please file a _new_ bug report at https://gitlab.freedesktop.org/drm/xe/kernel/issues/new\n",
> >  			dev_name(xe->drm.dev));
> > +
> > +		/* Notify userspace about reset required */
> > +		kobject_uevent_env(&pdev->dev.kobj, KOBJ_CHANGE, event_params);
> >  	}
> >  
> >  	for_each_gt(gt, xe, id)
> > diff --git a/drivers/gpu/drm/xe/xe_device.h b/drivers/gpu/drm/xe/xe_device.h
> > index db6cc8d0d6b8..5d40fc6f0904 100644
> > --- a/drivers/gpu/drm/xe/xe_device.h
> > +++ b/drivers/gpu/drm/xe/xe_device.h
> > @@ -174,7 +174,7 @@ static inline bool xe_device_wedged(struct xe_device *xe)
> >  	return atomic_read(&xe->wedged.flag);
> >  }
> >  
> > -void xe_device_declare_wedged(struct xe_device *xe);
> > +void xe_device_declare_wedged(struct xe_device *xe, char **reset_event);
> >  
> >  struct xe_file *xe_file_get(struct xe_file *xef);
> >  void xe_file_put(struct xe_file *xef);
> > diff --git a/drivers/gpu/drm/xe/xe_gt.c b/drivers/gpu/drm/xe/xe_gt.c
> > index 58895ed22f6e..519f3c2cf9e2 100644
> > --- a/drivers/gpu/drm/xe/xe_gt.c
> > +++ b/drivers/gpu/drm/xe/xe_gt.c
> > @@ -741,6 +741,24 @@ static int do_gt_restart(struct xe_gt *gt)
> >  	return 0;
> >  }
> >  
> > +static void xe_gt_reset_failed(struct xe_gt *gt, int err)
> > +{
> > +	char *event_params[5];
> > +
> > +	xe_gt_err(gt, "reset failed (%pe)\n", ERR_PTR(err));
> > +
> > +	event_params[0] = DRM_XE_RESET_REQUIRED_UEVENT;
> > +	event_params[1] = DRM_XE_RESET_REQUIRED_UEVENT_REASON_GT;
> > +	event_params[2] = kasprintf(GFP_KERNEL, "TILE_ID=%d", gt_to_tile(gt)->id);
> > +	event_params[3] = kasprintf(GFP_KERNEL, "GT_ID=%d", gt->info.id);
> 
> the TILE_ID and GT_ID can be passed for other events as well, with that you can
> have a common function to send uevent which would take reason as an input.

But is that required for all cases? There could be potential cases atleast
in the future where it is not needed.

> > +	event_params[4] = NULL;
> > +
> > +	xe_device_declare_wedged(gt_to_xe(gt), event_params);
> > +
> > +	kfree(event_params[2]);
> > +	kfree(event_params[3]);
> > +}
> > +
> >  static int gt_reset(struct xe_gt *gt)
> >  {
> >  	int err;
> > @@ -796,10 +814,7 @@ static int gt_reset(struct xe_gt *gt)
> >  	XE_WARN_ON(xe_uc_start(&gt->uc));
> >  	xe_pm_runtime_put(gt_to_xe(gt));
> >  err_fail:
> > -	xe_gt_err(gt, "reset failed (%pe)\n", ERR_PTR(err));
> > -
> > -	xe_device_declare_wedged(gt_to_xe(gt));
> > -
> > +	xe_gt_reset_failed(gt, err);
> >  	return err;
> >  }
> >  
> > diff --git a/drivers/gpu/drm/xe/xe_guc.c b/drivers/gpu/drm/xe/xe_guc.c
> > index de0fe9e65746..b544012f5b11 100644
> > --- a/drivers/gpu/drm/xe/xe_guc.c
> > +++ b/drivers/gpu/drm/xe/xe_guc.c
> > @@ -560,6 +560,17 @@ static s32 guc_pc_get_cur_freq(struct xe_guc_pc *guc_pc)
> >  	return ret ? ret : freq;
> >  }
> >  
> > +static void xe_guc_load_failed(struct xe_gt *gt)
> > +{
> > +	char *event_params[3];
> > +
> > +	event_params[0] = DRM_XE_RESET_REQUIRED_UEVENT;
> > +	event_params[1] = DRM_XE_RESET_REQUIRED_UEVENT_REASON_GUC;
> > +	event_params[2] = NULL;
> > +
> > +	xe_device_declare_wedged(gt_to_xe(gt), event_params);
> > +}
> > +
> >  /*
> >   * Wait for the GuC to start up.
> >   *
> > @@ -684,7 +695,7 @@ static void guc_wait_ucode(struct xe_guc *guc)
> >  			break;
> >  		}
> >  
> > -		xe_device_declare_wedged(gt_to_xe(gt));
> > +		xe_guc_load_failed(gt);
> >  	} else if (delta_ms > GUC_LOAD_TIME_WARN_MS) {
> >  		xe_gt_warn(gt, "excessive init time: %lldms! [status = 0x%08X, timeouts = %d]\n",
> >  			   delta_ms, status, count);
> > diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
> > index 460808507947..33ed6221f465 100644
> > --- a/drivers/gpu/drm/xe/xe_guc_submit.c
> > +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
> > @@ -891,6 +891,17 @@ void xe_guc_submit_wedge(struct xe_guc *guc)
> >  	mutex_unlock(&guc->submission_state.lock);
> >  }
> >  
> > +static void xe_exec_queue_timedout(struct xe_device *xe)
> > +{
> > +	char *event_params[3];
> > +
> > +	event_params[0] = DRM_XE_RESET_REQUIRED_UEVENT;
> > +	event_params[1] = DRM_XE_RESET_REQUIRED_UEVENT_REASON_TOUT;
> > +	event_params[2] = NULL;
> > +
> > +	xe_device_declare_wedged(xe, event_params);
> > +}
> > +
> >  static bool guc_submit_hint_wedged(struct xe_guc *guc)
> >  {
> >  	struct xe_device *xe = guc_to_xe(guc);
> > @@ -901,7 +912,7 @@ static bool guc_submit_hint_wedged(struct xe_guc *guc)
> >  	if (xe_device_wedged(xe))
> >  		return true;
> >  
> 
> i see that this is even called from xe_guc_exec_queue_lr_cleanup which is for long running queue
> from various call paths need to check in what scenarios those happen.

Should we add a reason for long running queue?

Raag

Lucas De Marchi Aug. 13, 2024, 2:13 p.m. UTC | #4

On Mon, Aug 12, 2024 at 01:18:12PM GMT, Raag Jadav wrote:
>From: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
>
>This was dropped in commit 77a0d4d1cea2 ("drm/xe/uapi: Remove reset
>uevent for now") as part of refactoring.
>
>Now that we have better uapi semantics and naming for the uevent,
>bring it back. With this in place, userspace will be notified of
>wedged device along with its reason.

Title of this patch, this paragraph and what's being done in the
implemenation don't match since reset != wedged

>
>$ udevadm monitor --property --kernel
>monitor will print the received events for:
>KERNEL - the kernel uevent
>
>KERNEL[871.188570] change   /devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:01.0/0000:03:00.0 (pci)
>ACTION=change
>DEVPATH=/devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:01.0/0000:03:00.0
>SUBSYSTEM=pci
>DEVICE_STATUS=NEEDS_RESET
>REASON=GT_RESET_FAILED
>TILE_ID=0
>GT_ID=0
>DRIVER=xe
>PCI_CLASS=30000
>PCI_ID=8086:56B1
>PCI_SUBSYS_ID=8086:1210
>PCI_SLOT_NAME=0000:03:00.0
>MODALIAS=pci:v00008086d000056B1sv00008086sd00001210bc03sc00i00
>SEQNUM=6104
>
>v2: Change authorship to Himal (Aravind)
>    Add uevent for all device wedged cases (Aravind)
>
>Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
>Signed-off-by: Raag Jadav <raag.jadav@intel.com>
>---
> drivers/gpu/drm/xe/xe_device.c     | 10 +++++++++-
> drivers/gpu/drm/xe/xe_device.h     |  2 +-
> drivers/gpu/drm/xe/xe_gt.c         | 23 +++++++++++++++++++----
> drivers/gpu/drm/xe/xe_guc.c        | 13 ++++++++++++-
> drivers/gpu/drm/xe/xe_guc_submit.c | 13 ++++++++++++-
> include/uapi/drm/xe_drm.h          | 29 +++++++++++++++++++++++++++++
> 6 files changed, 82 insertions(+), 8 deletions(-)
>
>diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
>index 1aba6f9eaa19..d975bdce4a7d 100644
>--- a/drivers/gpu/drm/xe/xe_device.c
>+++ b/drivers/gpu/drm/xe/xe_device.c
>@@ -955,6 +955,7 @@ static void xe_device_wedged_fini(struct drm_device *drm, void *arg)
> /**
>  * xe_device_declare_wedged - Declare device wedged
>  * @xe: xe device instance
>+ * @event_params: parameters to be sent along with uevent
>  *
>  * This is a final state that can only be cleared with a mudule
>  * re-probe (unbind + bind).
>@@ -965,8 +966,10 @@ static void xe_device_wedged_fini(struct drm_device *drm, void *arg)
>  * on every single execution timeout (a.k.a. GPU hang) right after devcoredump
>  * snapshot capture. In this mode, GT reset won't be attempted so the state of
>  * the issue is preserved for further debugging.
>+ * Caller is expected to pass respective parameters to be sent along with
>+ * uevent. Pass NULL in case of no params.
>  */
>-void xe_device_declare_wedged(struct xe_device *xe)
>+void xe_device_declare_wedged(struct xe_device *xe, char **event_params)
> {
> 	struct xe_gt *gt;
> 	u8 id;
>@@ -984,12 +987,17 @@ void xe_device_declare_wedged(struct xe_device *xe)
> 	xe_pm_runtime_get_noresume(xe);
>
> 	if (!atomic_xchg(&xe->wedged.flag, 1)) {
>+		struct pci_dev *pdev = to_pci_dev(xe->drm.dev);
>+

Given the names of the uevents, I'm not sure if the intention is to
notify when a gt reset happens or only when it fails
(and wedged.mode != 0)

Lucas De Marchi

> 		xe->needs_flr_on_fini = true;
> 		drm_err(&xe->drm,
> 			"CRITICAL: Xe has declared device %s as wedged.\n"
> 			"IOCTLs and executions are blocked. Only a rebind may clear the failure\n"
> 			"Please file a _new_ bug report at https://gitlab.freedesktop.org/drm/xe/kernel/issues/new\n",
> 			dev_name(xe->drm.dev));
>+
>+		/* Notify userspace about reset required */
>+		kobject_uevent_env(&pdev->dev.kobj, KOBJ_CHANGE, event_params);
> 	}
>
> 	for_each_gt(gt, xe, id)
>diff --git a/drivers/gpu/drm/xe/xe_device.h b/drivers/gpu/drm/xe/xe_device.h
>index db6cc8d0d6b8..5d40fc6f0904 100644
>--- a/drivers/gpu/drm/xe/xe_device.h
>+++ b/drivers/gpu/drm/xe/xe_device.h
>@@ -174,7 +174,7 @@ static inline bool xe_device_wedged(struct xe_device *xe)
> 	return atomic_read(&xe->wedged.flag);
> }
>
>-void xe_device_declare_wedged(struct xe_device *xe);
>+void xe_device_declare_wedged(struct xe_device *xe, char **reset_event);
>
> struct xe_file *xe_file_get(struct xe_file *xef);
> void xe_file_put(struct xe_file *xef);
>diff --git a/drivers/gpu/drm/xe/xe_gt.c b/drivers/gpu/drm/xe/xe_gt.c
>index 58895ed22f6e..519f3c2cf9e2 100644
>--- a/drivers/gpu/drm/xe/xe_gt.c
>+++ b/drivers/gpu/drm/xe/xe_gt.c
>@@ -741,6 +741,24 @@ static int do_gt_restart(struct xe_gt *gt)
> 	return 0;
> }
>
>+static void xe_gt_reset_failed(struct xe_gt *gt, int err)
>+{
>+	char *event_params[5];
>+
>+	xe_gt_err(gt, "reset failed (%pe)\n", ERR_PTR(err));
>+
>+	event_params[0] = DRM_XE_RESET_REQUIRED_UEVENT;
>+	event_params[1] = DRM_XE_RESET_REQUIRED_UEVENT_REASON_GT;
>+	event_params[2] = kasprintf(GFP_KERNEL, "TILE_ID=%d", gt_to_tile(gt)->id);
>+	event_params[3] = kasprintf(GFP_KERNEL, "GT_ID=%d", gt->info.id);
>+	event_params[4] = NULL;
>+
>+	xe_device_declare_wedged(gt_to_xe(gt), event_params);
>+
>+	kfree(event_params[2]);
>+	kfree(event_params[3]);
>+}
>+
> static int gt_reset(struct xe_gt *gt)
> {
> 	int err;
>@@ -796,10 +814,7 @@ static int gt_reset(struct xe_gt *gt)
> 	XE_WARN_ON(xe_uc_start(&gt->uc));
> 	xe_pm_runtime_put(gt_to_xe(gt));
> err_fail:
>-	xe_gt_err(gt, "reset failed (%pe)\n", ERR_PTR(err));
>-
>-	xe_device_declare_wedged(gt_to_xe(gt));
>-
>+	xe_gt_reset_failed(gt, err);
> 	return err;
> }
>
>diff --git a/drivers/gpu/drm/xe/xe_guc.c b/drivers/gpu/drm/xe/xe_guc.c
>index de0fe9e65746..b544012f5b11 100644
>--- a/drivers/gpu/drm/xe/xe_guc.c
>+++ b/drivers/gpu/drm/xe/xe_guc.c
>@@ -560,6 +560,17 @@ static s32 guc_pc_get_cur_freq(struct xe_guc_pc *guc_pc)
> 	return ret ? ret : freq;
> }
>
>+static void xe_guc_load_failed(struct xe_gt *gt)
>+{
>+	char *event_params[3];
>+
>+	event_params[0] = DRM_XE_RESET_REQUIRED_UEVENT;
>+	event_params[1] = DRM_XE_RESET_REQUIRED_UEVENT_REASON_GUC;
>+	event_params[2] = NULL;
>+
>+	xe_device_declare_wedged(gt_to_xe(gt), event_params);
>+}
>+
> /*
>  * Wait for the GuC to start up.
>  *
>@@ -684,7 +695,7 @@ static void guc_wait_ucode(struct xe_guc *guc)
> 			break;
> 		}
>
>-		xe_device_declare_wedged(gt_to_xe(gt));
>+		xe_guc_load_failed(gt);
> 	} else if (delta_ms > GUC_LOAD_TIME_WARN_MS) {
> 		xe_gt_warn(gt, "excessive init time: %lldms! [status = 0x%08X, timeouts = %d]\n",
> 			   delta_ms, status, count);
>diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
>index 460808507947..33ed6221f465 100644
>--- a/drivers/gpu/drm/xe/xe_guc_submit.c
>+++ b/drivers/gpu/drm/xe/xe_guc_submit.c
>@@ -891,6 +891,17 @@ void xe_guc_submit_wedge(struct xe_guc *guc)
> 	mutex_unlock(&guc->submission_state.lock);
> }
>
>+static void xe_exec_queue_timedout(struct xe_device *xe)
>+{
>+	char *event_params[3];
>+
>+	event_params[0] = DRM_XE_RESET_REQUIRED_UEVENT;
>+	event_params[1] = DRM_XE_RESET_REQUIRED_UEVENT_REASON_TOUT;
>+	event_params[2] = NULL;
>+
>+	xe_device_declare_wedged(xe, event_params);
>+}
>+
> static bool guc_submit_hint_wedged(struct xe_guc *guc)
> {
> 	struct xe_device *xe = guc_to_xe(guc);
>@@ -901,7 +912,7 @@ static bool guc_submit_hint_wedged(struct xe_guc *guc)
> 	if (xe_device_wedged(xe))
> 		return true;
>
>-	xe_device_declare_wedged(xe);
>+	xe_exec_queue_timedout(xe);
>
> 	return true;
> }
>diff --git a/include/uapi/drm/xe_drm.h b/include/uapi/drm/xe_drm.h
>index b6fbe4988f2e..dd2f36710057 100644
>--- a/include/uapi/drm/xe_drm.h
>+++ b/include/uapi/drm/xe_drm.h
>@@ -20,6 +20,7 @@ extern "C" {
>  *   2. Extension definition and helper structs
>  *   3. IOCTL's Query structs in the order of the Query's entries.
>  *   4. The rest of IOCTL structs in the order of IOCTL declaration.
>+ *   5. uEvents
>  */
>
> /**
>@@ -1694,6 +1695,34 @@ struct drm_xe_oa_stream_info {
> 	__u64 reserved[3];
> };
>
>+/**
>+ * DOC: uevent generated by xe on it's pci node.
>+ *
>+ * DRM_XE_RESET_REQUIRED_UEVENT - Event is generated when device needs reset.
>+ * The REASON is provided along with the event for which reset is required.
>+ * On the basis of REASONS, additional information might be supplied.
>+ */
>+#define DRM_XE_RESET_REQUIRED_UEVENT "DEVICE_STATUS=NEEDS_RESET"
>+
>+/**
>+ * DRM_XE_RESET_REQUIRED_UEVENT_REASON_GT - Reason provided to
>+ * DRM_XE_RESET_REQUIRED_UEVENT incase of gt reset failure. The additional
>+ * information supplied is tile id and gt id for which reset has failed.
>+ */
>+#define DRM_XE_RESET_REQUIRED_UEVENT_REASON_GT "REASON=GT_RESET_FAILED"
>+
>+/**
>+ * DRM_XE_RESET_REQUIRED_UEVENT_REASON_GUC - Reason provided to
>+ * DRM_XE_RESET_REQUIRED_UEVENT incase of guc fw load failure.
>+ */
>+#define DRM_XE_RESET_REQUIRED_UEVENT_REASON_GUC "REASON=GUC_LOAD_FAILED"
>+
>+/**
>+ * DRM_XE_RESET_REQUIRED_UEVENT_REASON_TOUT - Reason provided to
>+ * DRM_XE_RESET_REQUIRED_UEVENT incase of exec queue timeout.
>+ */
>+#define DRM_XE_RESET_REQUIRED_UEVENT_REASON_TOUT "REASON=EXEC_QUEUE_TIMEDOUT"


TOUT / TIMEDOUT etc need some love.

DRM_XE_WEDGED_UEVENT, then document in the same place all the additional
params that can be passed to this event.

or DRM_XE_DEVICE_STATUS_UEVENT, or DRM_XE_RESET_UEVENT.

who is the expected consumer of that uevent and what could be done?

Lucas De Marchi


>+
> #if defined(__cplusplus)
> }
> #endif
>-- 
>2.34.1
>

Rodrigo Vivi Aug. 13, 2024, 4:54 p.m. UTC | #5

On Tue, Aug 13, 2024 at 04:28:32PM +0300, Raag Jadav wrote:
> On Mon, Aug 12, 2024 at 03:08:14PM +0530, Aravind Iddamsetty wrote:
> > 
> > On 12/08/24 13:18, Raag Jadav wrote:
> > > From: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
> > >
> > > This was dropped in commit 77a0d4d1cea2 ("drm/xe/uapi: Remove reset
> > > uevent for now") as part of refactoring.
> > >
> > > Now that we have better uapi semantics and naming for the uevent,
> > > bring it back. With this in place, userspace will be notified of
> > > wedged device along with its reason.
> > >
> > > $ udevadm monitor --property --kernel
> > > monitor will print the received events for:
> > > KERNEL - the kernel uevent
> > >
> > > KERNEL[871.188570] change   /devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:01.0/0000:03:00.0 (pci)
> > > ACTION=change
> > > DEVPATH=/devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:01.0/0000:03:00.0
> > > SUBSYSTEM=pci
> > > DEVICE_STATUS=NEEDS_RESET
> > > REASON=GT_RESET_FAILED
> > > TILE_ID=0
> > > GT_ID=0
> > > DRIVER=xe
> > > PCI_CLASS=30000
> > > PCI_ID=8086:56B1
> > > PCI_SUBSYS_ID=8086:1210
> > > PCI_SLOT_NAME=0000:03:00.0
> > > MODALIAS=pci:v00008086d000056B1sv00008086sd00001210bc03sc00i00
> > > SEQNUM=6104
> > >
> > > v2: Change authorship to Himal (Aravind)
> > >     Add uevent for all device wedged cases (Aravind)
> > >
> > > Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
> > > Signed-off-by: Raag Jadav <raag.jadav@intel.com>
> > > ---
> > >  drivers/gpu/drm/xe/xe_device.c     | 10 +++++++++-
> > >  drivers/gpu/drm/xe/xe_device.h     |  2 +-
> > >  drivers/gpu/drm/xe/xe_gt.c         | 23 +++++++++++++++++++----
> > >  drivers/gpu/drm/xe/xe_guc.c        | 13 ++++++++++++-
> > >  drivers/gpu/drm/xe/xe_guc_submit.c | 13 ++++++++++++-
> > >  include/uapi/drm/xe_drm.h          | 29 +++++++++++++++++++++++++++++
> > >  6 files changed, 82 insertions(+), 8 deletions(-)
> > >
> > > diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
> > > index 1aba6f9eaa19..d975bdce4a7d 100644
> > > --- a/drivers/gpu/drm/xe/xe_device.c
> > > +++ b/drivers/gpu/drm/xe/xe_device.c
> > > @@ -955,6 +955,7 @@ static void xe_device_wedged_fini(struct drm_device *drm, void *arg)
> > >  /**
> > >   * xe_device_declare_wedged - Declare device wedged
> > >   * @xe: xe device instance
> > > + * @event_params: parameters to be sent along with uevent
> > >   *
> > >   * This is a final state that can only be cleared with a mudule
> > >   * re-probe (unbind + bind).
> > > @@ -965,8 +966,10 @@ static void xe_device_wedged_fini(struct drm_device *drm, void *arg)
> > >   * on every single execution timeout (a.k.a. GPU hang) right after devcoredump
> > >   * snapshot capture. In this mode, GT reset won't be attempted so the state of
> > >   * the issue is preserved for further debugging.
> > > + * Caller is expected to pass respective parameters to be sent along with
> > > + * uevent. Pass NULL in case of no params.
> > >   */
> > > -void xe_device_declare_wedged(struct xe_device *xe)
> > > +void xe_device_declare_wedged(struct xe_device *xe, char **event_params)
> > >  {
> > >  	struct xe_gt *gt;
> > >  	u8 id;
> > > @@ -984,12 +987,17 @@ void xe_device_declare_wedged(struct xe_device *xe)
> > >  	xe_pm_runtime_get_noresume(xe);
> > >  
> > >  	if (!atomic_xchg(&xe->wedged.flag, 1)) {
> > > +		struct pci_dev *pdev = to_pci_dev(xe->drm.dev);
> > > +
> > >  		xe->needs_flr_on_fini = true;
> > >  		drm_err(&xe->drm,
> > >  			"CRITICAL: Xe has declared device %s as wedged.\n"
> > >  			"IOCTLs and executions are blocked. Only a rebind may clear the failure\n"
> > >  			"Please file a _new_ bug report at https://gitlab.freedesktop.org/drm/xe/kernel/issues/new\n",
> > >  			dev_name(xe->drm.dev));
> > > +
> > > +		/* Notify userspace about reset required */
> > > +		kobject_uevent_env(&pdev->dev.kobj, KOBJ_CHANGE, event_params);
> > >  	}
> > >  
> > >  	for_each_gt(gt, xe, id)
> > > diff --git a/drivers/gpu/drm/xe/xe_device.h b/drivers/gpu/drm/xe/xe_device.h
> > > index db6cc8d0d6b8..5d40fc6f0904 100644
> > > --- a/drivers/gpu/drm/xe/xe_device.h
> > > +++ b/drivers/gpu/drm/xe/xe_device.h
> > > @@ -174,7 +174,7 @@ static inline bool xe_device_wedged(struct xe_device *xe)
> > >  	return atomic_read(&xe->wedged.flag);
> > >  }
> > >  
> > > -void xe_device_declare_wedged(struct xe_device *xe);
> > > +void xe_device_declare_wedged(struct xe_device *xe, char **reset_event);
> > >  
> > >  struct xe_file *xe_file_get(struct xe_file *xef);
> > >  void xe_file_put(struct xe_file *xef);
> > > diff --git a/drivers/gpu/drm/xe/xe_gt.c b/drivers/gpu/drm/xe/xe_gt.c
> > > index 58895ed22f6e..519f3c2cf9e2 100644
> > > --- a/drivers/gpu/drm/xe/xe_gt.c
> > > +++ b/drivers/gpu/drm/xe/xe_gt.c
> > > @@ -741,6 +741,24 @@ static int do_gt_restart(struct xe_gt *gt)
> > >  	return 0;
> > >  }
> > >  
> > > +static void xe_gt_reset_failed(struct xe_gt *gt, int err)
> > > +{
> > > +	char *event_params[5];
> > > +
> > > +	xe_gt_err(gt, "reset failed (%pe)\n", ERR_PTR(err));
> > > +
> > > +	event_params[0] = DRM_XE_RESET_REQUIRED_UEVENT;
> > > +	event_params[1] = DRM_XE_RESET_REQUIRED_UEVENT_REASON_GT;
> > > +	event_params[2] = kasprintf(GFP_KERNEL, "TILE_ID=%d", gt_to_tile(gt)->id);
> > > +	event_params[3] = kasprintf(GFP_KERNEL, "GT_ID=%d", gt->info.id);
> > 
> > the TILE_ID and GT_ID can be passed for other events as well, with that you can
> > have a common function to send uevent which would take reason as an input.
> 
> But is that required for all cases? There could be potential cases atleast
> in the future where it is not needed.
> 
> > > +	event_params[4] = NULL;
> > > +
> > > +	xe_device_declare_wedged(gt_to_xe(gt), event_params);
> > > +
> > > +	kfree(event_params[2]);
> > > +	kfree(event_params[3]);
> > > +}
> > > +
> > >  static int gt_reset(struct xe_gt *gt)
> > >  {
> > >  	int err;
> > > @@ -796,10 +814,7 @@ static int gt_reset(struct xe_gt *gt)
> > >  	XE_WARN_ON(xe_uc_start(&gt->uc));
> > >  	xe_pm_runtime_put(gt_to_xe(gt));
> > >  err_fail:
> > > -	xe_gt_err(gt, "reset failed (%pe)\n", ERR_PTR(err));
> > > -
> > > -	xe_device_declare_wedged(gt_to_xe(gt));
> > > -
> > > +	xe_gt_reset_failed(gt, err);
> > >  	return err;
> > >  }
> > >  
> > > diff --git a/drivers/gpu/drm/xe/xe_guc.c b/drivers/gpu/drm/xe/xe_guc.c
> > > index de0fe9e65746..b544012f5b11 100644
> > > --- a/drivers/gpu/drm/xe/xe_guc.c
> > > +++ b/drivers/gpu/drm/xe/xe_guc.c
> > > @@ -560,6 +560,17 @@ static s32 guc_pc_get_cur_freq(struct xe_guc_pc *guc_pc)
> > >  	return ret ? ret : freq;
> > >  }
> > >  
> > > +static void xe_guc_load_failed(struct xe_gt *gt)
> > > +{
> > > +	char *event_params[3];
> > > +
> > > +	event_params[0] = DRM_XE_RESET_REQUIRED_UEVENT;
> > > +	event_params[1] = DRM_XE_RESET_REQUIRED_UEVENT_REASON_GUC;
> > > +	event_params[2] = NULL;
> > > +
> > > +	xe_device_declare_wedged(gt_to_xe(gt), event_params);
> > > +}
> > > +
> > >  /*
> > >   * Wait for the GuC to start up.
> > >   *
> > > @@ -684,7 +695,7 @@ static void guc_wait_ucode(struct xe_guc *guc)
> > >  			break;
> > >  		}
> > >  
> > > -		xe_device_declare_wedged(gt_to_xe(gt));
> > > +		xe_guc_load_failed(gt);
> > >  	} else if (delta_ms > GUC_LOAD_TIME_WARN_MS) {
> > >  		xe_gt_warn(gt, "excessive init time: %lldms! [status = 0x%08X, timeouts = %d]\n",
> > >  			   delta_ms, status, count);
> > > diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
> > > index 460808507947..33ed6221f465 100644
> > > --- a/drivers/gpu/drm/xe/xe_guc_submit.c
> > > +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
> > > @@ -891,6 +891,17 @@ void xe_guc_submit_wedge(struct xe_guc *guc)
> > >  	mutex_unlock(&guc->submission_state.lock);
> > >  }
> > >  
> > > +static void xe_exec_queue_timedout(struct xe_device *xe)
> > > +{
> > > +	char *event_params[3];
> > > +
> > > +	event_params[0] = DRM_XE_RESET_REQUIRED_UEVENT;
> > > +	event_params[1] = DRM_XE_RESET_REQUIRED_UEVENT_REASON_TOUT;
> > > +	event_params[2] = NULL;
> > > +
> > > +	xe_device_declare_wedged(xe, event_params);
> > > +}
> > > +
> > >  static bool guc_submit_hint_wedged(struct xe_guc *guc)
> > >  {
> > >  	struct xe_device *xe = guc_to_xe(guc);
> > > @@ -901,7 +912,7 @@ static bool guc_submit_hint_wedged(struct xe_guc *guc)
> > >  	if (xe_device_wedged(xe))
> > >  		return true;
> > >  
> > 
> > i see that this is even called from xe_guc_exec_queue_lr_cleanup which is for long running queue
> > from various call paths need to check in what scenarios those happen.
> 
> Should we add a reason for long running queue?

Both of your questions would probably be answered if this was getting developed
at the same time of the user space usage of these uevents.

> 
> Raag

Aravind Iddamsetty Aug. 14, 2024, 6:31 a.m. UTC | #6

On 13/08/24 19:43, Lucas De Marchi wrote:
> On Mon, Aug 12, 2024 at 01:18:12PM GMT, Raag Jadav wrote:
>> From: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
>>
>> This was dropped in commit 77a0d4d1cea2 ("drm/xe/uapi: Remove reset
>> uevent for now") as part of refactoring.
>>
>> Now that we have better uapi semantics and naming for the uevent,
>> bring it back. With this in place, userspace will be notified of
>> wedged device along with its reason.
>
> Title of this patch, this paragraph and what's being done in the
> implemenation don't match since reset != wedged
>
>>
>> $ udevadm monitor --property --kernel
>> monitor will print the received events for:
>> KERNEL - the kernel uevent
>>
>> KERNEL[871.188570] change   /devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:01.0/0000:03:00.0 (pci)
>> ACTION=change
>> DEVPATH=/devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:01.0/0000:03:00.0
>> SUBSYSTEM=pci
>> DEVICE_STATUS=NEEDS_RESET
>> REASON=GT_RESET_FAILED
>> TILE_ID=0
>> GT_ID=0
>> DRIVER=xe
>> PCI_CLASS=30000
>> PCI_ID=8086:56B1
>> PCI_SUBSYS_ID=8086:1210
>> PCI_SLOT_NAME=0000:03:00.0
>> MODALIAS=pci:v00008086d000056B1sv00008086sd00001210bc03sc00i00
>> SEQNUM=6104
>>
>> v2: Change authorship to Himal (Aravind)
>>    Add uevent for all device wedged cases (Aravind)
>>
>> Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
>> Signed-off-by: Raag Jadav <raag.jadav@intel.com>
>> ---
>> drivers/gpu/drm/xe/xe_device.c     | 10 +++++++++-
>> drivers/gpu/drm/xe/xe_device.h     |  2 +-
>> drivers/gpu/drm/xe/xe_gt.c         | 23 +++++++++++++++++++----
>> drivers/gpu/drm/xe/xe_guc.c        | 13 ++++++++++++-
>> drivers/gpu/drm/xe/xe_guc_submit.c | 13 ++++++++++++-
>> include/uapi/drm/xe_drm.h          | 29 +++++++++++++++++++++++++++++
>> 6 files changed, 82 insertions(+), 8 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
>> index 1aba6f9eaa19..d975bdce4a7d 100644
>> --- a/drivers/gpu/drm/xe/xe_device.c
>> +++ b/drivers/gpu/drm/xe/xe_device.c
>> @@ -955,6 +955,7 @@ static void xe_device_wedged_fini(struct drm_device *drm, void *arg)
>> /**
>>  * xe_device_declare_wedged - Declare device wedged
>>  * @xe: xe device instance
>> + * @event_params: parameters to be sent along with uevent
>>  *
>>  * This is a final state that can only be cleared with a mudule
>>  * re-probe (unbind + bind).
>> @@ -965,8 +966,10 @@ static void xe_device_wedged_fini(struct drm_device *drm, void *arg)
>>  * on every single execution timeout (a.k.a. GPU hang) right after devcoredump
>>  * snapshot capture. In this mode, GT reset won't be attempted so the state of
>>  * the issue is preserved for further debugging.
>> + * Caller is expected to pass respective parameters to be sent along with
>> + * uevent. Pass NULL in case of no params.
>>  */
>> -void xe_device_declare_wedged(struct xe_device *xe)
>> +void xe_device_declare_wedged(struct xe_device *xe, char **event_params)
>> {
>>     struct xe_gt *gt;
>>     u8 id;
>> @@ -984,12 +987,17 @@ void xe_device_declare_wedged(struct xe_device *xe)
>>     xe_pm_runtime_get_noresume(xe);
>>
>>     if (!atomic_xchg(&xe->wedged.flag, 1)) {
>> +        struct pci_dev *pdev = to_pci_dev(xe->drm.dev);
>> +
>
> Given the names of the uevents, I'm not sure if the intention is to
> notify when a gt reset happens or only when it fails
> (and wedged.mode != 0)

The intention is to raise an uevent whenever the device is declared wedged
and pass the reason for the same in uevent.


Thanks,
Aravind.
>
> Lucas De Marchi
>
>>         xe->needs_flr_on_fini = true;
>>         drm_err(&xe->drm,
>>             "CRITICAL: Xe has declared device %s as wedged.\n"
>>             "IOCTLs and executions are blocked. Only a rebind may clear the failure\n"
>>             "Please file a _new_ bug report at https://gitlab.freedesktop.org/drm/xe/kernel/issues/new\n",
>>             dev_name(xe->drm.dev));
>> +
>> +        /* Notify userspace about reset required */
>> +        kobject_uevent_env(&pdev->dev.kobj, KOBJ_CHANGE, event_params);
>>     }
>>
>>     for_each_gt(gt, xe, id)
>> diff --git a/drivers/gpu/drm/xe/xe_device.h b/drivers/gpu/drm/xe/xe_device.h
>> index db6cc8d0d6b8..5d40fc6f0904 100644
>> --- a/drivers/gpu/drm/xe/xe_device.h
>> +++ b/drivers/gpu/drm/xe/xe_device.h
>> @@ -174,7 +174,7 @@ static inline bool xe_device_wedged(struct xe_device *xe)
>>     return atomic_read(&xe->wedged.flag);
>> }
>>
>> -void xe_device_declare_wedged(struct xe_device *xe);
>> +void xe_device_declare_wedged(struct xe_device *xe, char **reset_event);
>>
>> struct xe_file *xe_file_get(struct xe_file *xef);
>> void xe_file_put(struct xe_file *xef);
>> diff --git a/drivers/gpu/drm/xe/xe_gt.c b/drivers/gpu/drm/xe/xe_gt.c
>> index 58895ed22f6e..519f3c2cf9e2 100644
>> --- a/drivers/gpu/drm/xe/xe_gt.c
>> +++ b/drivers/gpu/drm/xe/xe_gt.c
>> @@ -741,6 +741,24 @@ static int do_gt_restart(struct xe_gt *gt)
>>     return 0;
>> }
>>
>> +static void xe_gt_reset_failed(struct xe_gt *gt, int err)
>> +{
>> +    char *event_params[5];
>> +
>> +    xe_gt_err(gt, "reset failed (%pe)\n", ERR_PTR(err));
>> +
>> +    event_params[0] = DRM_XE_RESET_REQUIRED_UEVENT;
>> +    event_params[1] = DRM_XE_RESET_REQUIRED_UEVENT_REASON_GT;
>> +    event_params[2] = kasprintf(GFP_KERNEL, "TILE_ID=%d", gt_to_tile(gt)->id);
>> +    event_params[3] = kasprintf(GFP_KERNEL, "GT_ID=%d", gt->info.id);
>> +    event_params[4] = NULL;
>> +
>> +    xe_device_declare_wedged(gt_to_xe(gt), event_params);
>> +
>> +    kfree(event_params[2]);
>> +    kfree(event_params[3]);
>> +}
>> +
>> static int gt_reset(struct xe_gt *gt)
>> {
>>     int err;
>> @@ -796,10 +814,7 @@ static int gt_reset(struct xe_gt *gt)
>>     XE_WARN_ON(xe_uc_start(&gt->uc));
>>     xe_pm_runtime_put(gt_to_xe(gt));
>> err_fail:
>> -    xe_gt_err(gt, "reset failed (%pe)\n", ERR_PTR(err));
>> -
>> -    xe_device_declare_wedged(gt_to_xe(gt));
>> -
>> +    xe_gt_reset_failed(gt, err);
>>     return err;
>> }
>>
>> diff --git a/drivers/gpu/drm/xe/xe_guc.c b/drivers/gpu/drm/xe/xe_guc.c
>> index de0fe9e65746..b544012f5b11 100644
>> --- a/drivers/gpu/drm/xe/xe_guc.c
>> +++ b/drivers/gpu/drm/xe/xe_guc.c
>> @@ -560,6 +560,17 @@ static s32 guc_pc_get_cur_freq(struct xe_guc_pc *guc_pc)
>>     return ret ? ret : freq;
>> }
>>
>> +static void xe_guc_load_failed(struct xe_gt *gt)
>> +{
>> +    char *event_params[3];
>> +
>> +    event_params[0] = DRM_XE_RESET_REQUIRED_UEVENT;
>> +    event_params[1] = DRM_XE_RESET_REQUIRED_UEVENT_REASON_GUC;
>> +    event_params[2] = NULL;
>> +
>> +    xe_device_declare_wedged(gt_to_xe(gt), event_params);
>> +}
>> +
>> /*
>>  * Wait for the GuC to start up.
>>  *
>> @@ -684,7 +695,7 @@ static void guc_wait_ucode(struct xe_guc *guc)
>>             break;
>>         }
>>
>> -        xe_device_declare_wedged(gt_to_xe(gt));
>> +        xe_guc_load_failed(gt);
>>     } else if (delta_ms > GUC_LOAD_TIME_WARN_MS) {
>>         xe_gt_warn(gt, "excessive init time: %lldms! [status = 0x%08X, timeouts = %d]\n",
>>                delta_ms, status, count);
>> diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
>> index 460808507947..33ed6221f465 100644
>> --- a/drivers/gpu/drm/xe/xe_guc_submit.c
>> +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
>> @@ -891,6 +891,17 @@ void xe_guc_submit_wedge(struct xe_guc *guc)
>>     mutex_unlock(&guc->submission_state.lock);
>> }
>>
>> +static void xe_exec_queue_timedout(struct xe_device *xe)
>> +{
>> +    char *event_params[3];
>> +
>> +    event_params[0] = DRM_XE_RESET_REQUIRED_UEVENT;
>> +    event_params[1] = DRM_XE_RESET_REQUIRED_UEVENT_REASON_TOUT;
>> +    event_params[2] = NULL;
>> +
>> +    xe_device_declare_wedged(xe, event_params);
>> +}
>> +
>> static bool guc_submit_hint_wedged(struct xe_guc *guc)
>> {
>>     struct xe_device *xe = guc_to_xe(guc);
>> @@ -901,7 +912,7 @@ static bool guc_submit_hint_wedged(struct xe_guc *guc)
>>     if (xe_device_wedged(xe))
>>         return true;
>>
>> -    xe_device_declare_wedged(xe);
>> +    xe_exec_queue_timedout(xe);
>>
>>     return true;
>> }
>> diff --git a/include/uapi/drm/xe_drm.h b/include/uapi/drm/xe_drm.h
>> index b6fbe4988f2e..dd2f36710057 100644
>> --- a/include/uapi/drm/xe_drm.h
>> +++ b/include/uapi/drm/xe_drm.h
>> @@ -20,6 +20,7 @@ extern "C" {
>>  *   2. Extension definition and helper structs
>>  *   3. IOCTL's Query structs in the order of the Query's entries.
>>  *   4. The rest of IOCTL structs in the order of IOCTL declaration.
>> + *   5. uEvents
>>  */
>>
>> /**
>> @@ -1694,6 +1695,34 @@ struct drm_xe_oa_stream_info {
>>     __u64 reserved[3];
>> };
>>
>> +/**
>> + * DOC: uevent generated by xe on it's pci node.
>> + *
>> + * DRM_XE_RESET_REQUIRED_UEVENT - Event is generated when device needs reset.
>> + * The REASON is provided along with the event for which reset is required.
>> + * On the basis of REASONS, additional information might be supplied.
>> + */
>> +#define DRM_XE_RESET_REQUIRED_UEVENT "DEVICE_STATUS=NEEDS_RESET"
>> +
>> +/**
>> + * DRM_XE_RESET_REQUIRED_UEVENT_REASON_GT - Reason provided to
>> + * DRM_XE_RESET_REQUIRED_UEVENT incase of gt reset failure. The additional
>> + * information supplied is tile id and gt id for which reset has failed.
>> + */
>> +#define DRM_XE_RESET_REQUIRED_UEVENT_REASON_GT "REASON=GT_RESET_FAILED"
>> +
>> +/**
>> + * DRM_XE_RESET_REQUIRED_UEVENT_REASON_GUC - Reason provided to
>> + * DRM_XE_RESET_REQUIRED_UEVENT incase of guc fw load failure.
>> + */
>> +#define DRM_XE_RESET_REQUIRED_UEVENT_REASON_GUC "REASON=GUC_LOAD_FAILED"
>> +
>> +/**
>> + * DRM_XE_RESET_REQUIRED_UEVENT_REASON_TOUT - Reason provided to
>> + * DRM_XE_RESET_REQUIRED_UEVENT incase of exec queue timeout.
>> + */
>> +#define DRM_XE_RESET_REQUIRED_UEVENT_REASON_TOUT "REASON=EXEC_QUEUE_TIMEDOUT"
>
>
> TOUT / TIMEDOUT etc need some love.
>
> DRM_XE_WEDGED_UEVENT, then document in the same place all the additional
> params that can be passed to this event.
>
> or DRM_XE_DEVICE_STATUS_UEVENT, or DRM_XE_RESET_UEVENT.
>
> who is the expected consumer of that uevent and what could be done?
>
> Lucas De Marchi
>
>
>> +
>> #if defined(__cplusplus)
>> }
>> #endif
>> -- 
>> 2.34.1
>>

Aravind Iddamsetty Aug. 14, 2024, 6:46 a.m. UTC | #7

On 13/08/24 22:24, Rodrigo Vivi wrote:
> On Tue, Aug 13, 2024 at 04:28:32PM +0300, Raag Jadav wrote:
>> On Mon, Aug 12, 2024 at 03:08:14PM +0530, Aravind Iddamsetty wrote:
>>> On 12/08/24 13:18, Raag Jadav wrote:
>>>> From: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
>>>>
>>>> This was dropped in commit 77a0d4d1cea2 ("drm/xe/uapi: Remove reset
>>>> uevent for now") as part of refactoring.
>>>>
>>>> Now that we have better uapi semantics and naming for the uevent,
>>>> bring it back. With this in place, userspace will be notified of
>>>> wedged device along with its reason.
>>>>
>>>> $ udevadm monitor --property --kernel
>>>> monitor will print the received events for:
>>>> KERNEL - the kernel uevent
>>>>
>>>> KERNEL[871.188570] change   /devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:01.0/0000:03:00.0 (pci)
>>>> ACTION=change
>>>> DEVPATH=/devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:01.0/0000:03:00.0
>>>> SUBSYSTEM=pci
>>>> DEVICE_STATUS=NEEDS_RESET
>>>> REASON=GT_RESET_FAILED
>>>> TILE_ID=0
>>>> GT_ID=0
>>>> DRIVER=xe
>>>> PCI_CLASS=30000
>>>> PCI_ID=8086:56B1
>>>> PCI_SUBSYS_ID=8086:1210
>>>> PCI_SLOT_NAME=0000:03:00.0
>>>> MODALIAS=pci:v00008086d000056B1sv00008086sd00001210bc03sc00i00
>>>> SEQNUM=6104
>>>>
>>>> v2: Change authorship to Himal (Aravind)
>>>>     Add uevent for all device wedged cases (Aravind)
>>>>
>>>> Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
>>>> Signed-off-by: Raag Jadav <raag.jadav@intel.com>
>>>> ---
>>>>  drivers/gpu/drm/xe/xe_device.c     | 10 +++++++++-
>>>>  drivers/gpu/drm/xe/xe_device.h     |  2 +-
>>>>  drivers/gpu/drm/xe/xe_gt.c         | 23 +++++++++++++++++++----
>>>>  drivers/gpu/drm/xe/xe_guc.c        | 13 ++++++++++++-
>>>>  drivers/gpu/drm/xe/xe_guc_submit.c | 13 ++++++++++++-
>>>>  include/uapi/drm/xe_drm.h          | 29 +++++++++++++++++++++++++++++
>>>>  6 files changed, 82 insertions(+), 8 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
>>>> index 1aba6f9eaa19..d975bdce4a7d 100644
>>>> --- a/drivers/gpu/drm/xe/xe_device.c
>>>> +++ b/drivers/gpu/drm/xe/xe_device.c
>>>> @@ -955,6 +955,7 @@ static void xe_device_wedged_fini(struct drm_device *drm, void *arg)
>>>>  /**
>>>>   * xe_device_declare_wedged - Declare device wedged
>>>>   * @xe: xe device instance
>>>> + * @event_params: parameters to be sent along with uevent
>>>>   *
>>>>   * This is a final state that can only be cleared with a mudule
>>>>   * re-probe (unbind + bind).
>>>> @@ -965,8 +966,10 @@ static void xe_device_wedged_fini(struct drm_device *drm, void *arg)
>>>>   * on every single execution timeout (a.k.a. GPU hang) right after devcoredump
>>>>   * snapshot capture. In this mode, GT reset won't be attempted so the state of
>>>>   * the issue is preserved for further debugging.
>>>> + * Caller is expected to pass respective parameters to be sent along with
>>>> + * uevent. Pass NULL in case of no params.
>>>>   */
>>>> -void xe_device_declare_wedged(struct xe_device *xe)
>>>> +void xe_device_declare_wedged(struct xe_device *xe, char **event_params)
>>>>  {
>>>>  	struct xe_gt *gt;
>>>>  	u8 id;
>>>> @@ -984,12 +987,17 @@ void xe_device_declare_wedged(struct xe_device *xe)
>>>>  	xe_pm_runtime_get_noresume(xe);
>>>>  
>>>>  	if (!atomic_xchg(&xe->wedged.flag, 1)) {
>>>> +		struct pci_dev *pdev = to_pci_dev(xe->drm.dev);
>>>> +
>>>>  		xe->needs_flr_on_fini = true;
>>>>  		drm_err(&xe->drm,
>>>>  			"CRITICAL: Xe has declared device %s as wedged.\n"
>>>>  			"IOCTLs and executions are blocked. Only a rebind may clear the failure\n"
>>>>  			"Please file a _new_ bug report at https://gitlab.freedesktop.org/drm/xe/kernel/issues/new\n",
>>>>  			dev_name(xe->drm.dev));
>>>> +
>>>> +		/* Notify userspace about reset required */
>>>> +		kobject_uevent_env(&pdev->dev.kobj, KOBJ_CHANGE, event_params);
>>>>  	}
>>>>  
>>>>  	for_each_gt(gt, xe, id)
>>>> diff --git a/drivers/gpu/drm/xe/xe_device.h b/drivers/gpu/drm/xe/xe_device.h
>>>> index db6cc8d0d6b8..5d40fc6f0904 100644
>>>> --- a/drivers/gpu/drm/xe/xe_device.h
>>>> +++ b/drivers/gpu/drm/xe/xe_device.h
>>>> @@ -174,7 +174,7 @@ static inline bool xe_device_wedged(struct xe_device *xe)
>>>>  	return atomic_read(&xe->wedged.flag);
>>>>  }
>>>>  
>>>> -void xe_device_declare_wedged(struct xe_device *xe);
>>>> +void xe_device_declare_wedged(struct xe_device *xe, char **reset_event);
>>>>  
>>>>  struct xe_file *xe_file_get(struct xe_file *xef);
>>>>  void xe_file_put(struct xe_file *xef);
>>>> diff --git a/drivers/gpu/drm/xe/xe_gt.c b/drivers/gpu/drm/xe/xe_gt.c
>>>> index 58895ed22f6e..519f3c2cf9e2 100644
>>>> --- a/drivers/gpu/drm/xe/xe_gt.c
>>>> +++ b/drivers/gpu/drm/xe/xe_gt.c
>>>> @@ -741,6 +741,24 @@ static int do_gt_restart(struct xe_gt *gt)
>>>>  	return 0;
>>>>  }
>>>>  
>>>> +static void xe_gt_reset_failed(struct xe_gt *gt, int err)
>>>> +{
>>>> +	char *event_params[5];
>>>> +
>>>> +	xe_gt_err(gt, "reset failed (%pe)\n", ERR_PTR(err));
>>>> +
>>>> +	event_params[0] = DRM_XE_RESET_REQUIRED_UEVENT;
>>>> +	event_params[1] = DRM_XE_RESET_REQUIRED_UEVENT_REASON_GT;
>>>> +	event_params[2] = kasprintf(GFP_KERNEL, "TILE_ID=%d", gt_to_tile(gt)->id);
>>>> +	event_params[3] = kasprintf(GFP_KERNEL, "GT_ID=%d", gt->info.id);
>>> the TILE_ID and GT_ID can be passed for other events as well, with that you can
>>> have a common function to send uevent which would take reason as an input.
>> But is that required for all cases? There could be potential cases atleast
>> in the future where it is not needed.


At least in these cases it makes sense as they (other reasons)
can be associated to a GT and a Tile. If in future they arises a
reason where these details are not needed i guess we can handle that.


>>
>>>> +	event_params[4] = NULL;
>>>> +
>>>> +	xe_device_declare_wedged(gt_to_xe(gt), event_params);
>>>> +
>>>> +	kfree(event_params[2]);
>>>> +	kfree(event_params[3]);
>>>> +}
>>>> +
>>>>  static int gt_reset(struct xe_gt *gt)
>>>>  {
>>>>  	int err;
>>>> @@ -796,10 +814,7 @@ static int gt_reset(struct xe_gt *gt)
>>>>  	XE_WARN_ON(xe_uc_start(&gt->uc));
>>>>  	xe_pm_runtime_put(gt_to_xe(gt));
>>>>  err_fail:
>>>> -	xe_gt_err(gt, "reset failed (%pe)\n", ERR_PTR(err));
>>>> -
>>>> -	xe_device_declare_wedged(gt_to_xe(gt));
>>>> -
>>>> +	xe_gt_reset_failed(gt, err);
>>>>  	return err;
>>>>  }
>>>>  
>>>> diff --git a/drivers/gpu/drm/xe/xe_guc.c b/drivers/gpu/drm/xe/xe_guc.c
>>>> index de0fe9e65746..b544012f5b11 100644
>>>> --- a/drivers/gpu/drm/xe/xe_guc.c
>>>> +++ b/drivers/gpu/drm/xe/xe_guc.c
>>>> @@ -560,6 +560,17 @@ static s32 guc_pc_get_cur_freq(struct xe_guc_pc *guc_pc)
>>>>  	return ret ? ret : freq;
>>>>  }
>>>>  
>>>> +static void xe_guc_load_failed(struct xe_gt *gt)
>>>> +{
>>>> +	char *event_params[3];
>>>> +
>>>> +	event_params[0] = DRM_XE_RESET_REQUIRED_UEVENT;
>>>> +	event_params[1] = DRM_XE_RESET_REQUIRED_UEVENT_REASON_GUC;
>>>> +	event_params[2] = NULL;
>>>> +
>>>> +	xe_device_declare_wedged(gt_to_xe(gt), event_params);
>>>> +}
>>>> +
>>>>  /*
>>>>   * Wait for the GuC to start up.
>>>>   *
>>>> @@ -684,7 +695,7 @@ static void guc_wait_ucode(struct xe_guc *guc)
>>>>  			break;
>>>>  		}
>>>>  
>>>> -		xe_device_declare_wedged(gt_to_xe(gt));
>>>> +		xe_guc_load_failed(gt);
>>>>  	} else if (delta_ms > GUC_LOAD_TIME_WARN_MS) {
>>>>  		xe_gt_warn(gt, "excessive init time: %lldms! [status = 0x%08X, timeouts = %d]\n",
>>>>  			   delta_ms, status, count);
>>>> diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
>>>> index 460808507947..33ed6221f465 100644
>>>> --- a/drivers/gpu/drm/xe/xe_guc_submit.c
>>>> +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
>>>> @@ -891,6 +891,17 @@ void xe_guc_submit_wedge(struct xe_guc *guc)
>>>>  	mutex_unlock(&guc->submission_state.lock);
>>>>  }
>>>>  
>>>> +static void xe_exec_queue_timedout(struct xe_device *xe)
>>>> +{
>>>> +	char *event_params[3];
>>>> +
>>>> +	event_params[0] = DRM_XE_RESET_REQUIRED_UEVENT;
>>>> +	event_params[1] = DRM_XE_RESET_REQUIRED_UEVENT_REASON_TOUT;
>>>> +	event_params[2] = NULL;
>>>> +
>>>> +	xe_device_declare_wedged(xe, event_params);
>>>> +}
>>>> +
>>>>  static bool guc_submit_hint_wedged(struct xe_guc *guc)
>>>>  {
>>>>  	struct xe_device *xe = guc_to_xe(guc);
>>>> @@ -901,7 +912,7 @@ static bool guc_submit_hint_wedged(struct xe_guc *guc)
>>>>  	if (xe_device_wedged(xe))
>>>>  		return true;
>>>>  
>>> i see that this is even called from xe_guc_exec_queue_lr_cleanup which is for long running queue
>>> from various call paths need to check in what scenarios those happen.
>> Should we add a reason for long running queue?
> Both of your questions would probably be answered if this was getting developed
> at the same time of the user space usage of these uevents.

Can't we consider the generic Linux user as a consumer, via udevadm.

Thanks,
Aravind.
>
>> Raag

Raag Jadav Aug. 14, 2024, 9:12 a.m. UTC | #8

On Wed, Aug 14, 2024 at 12:16:41PM +0530, Aravind Iddamsetty wrote:
>
>On 13/08/24 22:24, Rodrigo Vivi wrote:
>> On Tue, Aug 13, 2024 at 04:28:32PM +0300, Raag Jadav wrote:
>>> On Mon, Aug 12, 2024 at 03:08:14PM +0530, Aravind Iddamsetty wrote:
>>>> On 12/08/24 13:18, Raag Jadav wrote:
>>>>> From: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
>>>>>
>>>>> This was dropped in commit 77a0d4d1cea2 ("drm/xe/uapi: Remove reset
>>>>> uevent for now") as part of refactoring.
>>>>>
>>>>> Now that we have better uapi semantics and naming for the uevent,
>>>>> bring it back. With this in place, userspace will be notified of
>>>>> wedged device along with its reason.
>>>>>
>>>>> $ udevadm monitor --property --kernel
>>>>> monitor will print the received events for:
>>>>> KERNEL - the kernel uevent
>>>>>
>>>>> KERNEL[871.188570] change   /devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:01.0/0000:03:00.0 (pci)
>>>>> ACTION=change
>>>>> DEVPATH=/devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:01.0/0000:03:00.0
>>>>> SUBSYSTEM=pci
>>>>> DEVICE_STATUS=NEEDS_RESET
>>>>> REASON=GT_RESET_FAILED
>>>>> TILE_ID=0
>>>>> GT_ID=0
>>>>> DRIVER=xe
>>>>> PCI_CLASS=30000
>>>>> PCI_ID=8086:56B1
>>>>> PCI_SUBSYS_ID=8086:1210
>>>>> PCI_SLOT_NAME=0000:03:00.0
>>>>> MODALIAS=pci:v00008086d000056B1sv00008086sd00001210bc03sc00i00
>>>>> SEQNUM=6104
>>>>>
>>>>> v2: Change authorship to Himal (Aravind)
>>>>>     Add uevent for all device wedged cases (Aravind)
>>>>>
>>>>> Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
>>>>> Signed-off-by: Raag Jadav <raag.jadav@intel.com>
>>>>> ---
>>>>>  drivers/gpu/drm/xe/xe_device.c     | 10 +++++++++-
>>>>>  drivers/gpu/drm/xe/xe_device.h     |  2 +-
>>>>>  drivers/gpu/drm/xe/xe_gt.c         | 23 +++++++++++++++++++----
>>>>>  drivers/gpu/drm/xe/xe_guc.c        | 13 ++++++++++++-
>>>>>  drivers/gpu/drm/xe/xe_guc_submit.c | 13 ++++++++++++-
>>>>>  include/uapi/drm/xe_drm.h          | 29 +++++++++++++++++++++++++++++
>>>>>  6 files changed, 82 insertions(+), 8 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
>>>>> index 1aba6f9eaa19..d975bdce4a7d 100644
>>>>> --- a/drivers/gpu/drm/xe/xe_device.c
>>>>> +++ b/drivers/gpu/drm/xe/xe_device.c
>>>>> @@ -955,6 +955,7 @@ static void xe_device_wedged_fini(struct drm_device *drm, void *arg)
>>>>>  /**
>>>>>   * xe_device_declare_wedged - Declare device wedged
>>>>>   * @xe: xe device instance
>>>>> + * @event_params: parameters to be sent along with uevent
>>>>>   *
>>>>>   * This is a final state that can only be cleared with a mudule
>>>>>   * re-probe (unbind + bind).
>>>>> @@ -965,8 +966,10 @@ static void xe_device_wedged_fini(struct drm_device *drm, void *arg)
>>>>>   * on every single execution timeout (a.k.a. GPU hang) right after devcoredump
>>>>>   * snapshot capture. In this mode, GT reset won't be attempted so the state of
>>>>>   * the issue is preserved for further debugging.
>>>>> + * Caller is expected to pass respective parameters to be sent along with
>>>>> + * uevent. Pass NULL in case of no params.
>>>>>   */
>>>>> -void xe_device_declare_wedged(struct xe_device *xe)
>>>>> +void xe_device_declare_wedged(struct xe_device *xe, char **event_params)
>>>>>  {
>>>>>  	struct xe_gt *gt;
>>>>>  	u8 id;
>>>>> @@ -984,12 +987,17 @@ void xe_device_declare_wedged(struct xe_device *xe)
>>>>>  	xe_pm_runtime_get_noresume(xe);
>>>>>  
>>>>>  	if (!atomic_xchg(&xe->wedged.flag, 1)) {
>>>>> +		struct pci_dev *pdev = to_pci_dev(xe->drm.dev);
>>>>> +
>>>>>  		xe->needs_flr_on_fini = true;
>>>>>  		drm_err(&xe->drm,
>>>>>  			"CRITICAL: Xe has declared device %s as wedged.\n"
>>>>>  			"IOCTLs and executions are blocked. Only a rebind may clear the failure\n"
>>>>>  			"Please file a _new_ bug report at https://gitlab.freedesktop.org/drm/xe/kernel/issues/new\n",
>>>>>  			dev_name(xe->drm.dev));
>>>>> +
>>>>> +		/* Notify userspace about reset required */
>>>>> +		kobject_uevent_env(&pdev->dev.kobj, KOBJ_CHANGE, event_params);
>>>>>  	}
>>>>>  
>>>>>  	for_each_gt(gt, xe, id)
>>>>> diff --git a/drivers/gpu/drm/xe/xe_device.h b/drivers/gpu/drm/xe/xe_device.h
>>>>> index db6cc8d0d6b8..5d40fc6f0904 100644
>>>>> --- a/drivers/gpu/drm/xe/xe_device.h
>>>>> +++ b/drivers/gpu/drm/xe/xe_device.h
>>>>> @@ -174,7 +174,7 @@ static inline bool xe_device_wedged(struct xe_device *xe)
>>>>>  	return atomic_read(&xe->wedged.flag);
>>>>>  }
>>>>>  
>>>>> -void xe_device_declare_wedged(struct xe_device *xe);
>>>>> +void xe_device_declare_wedged(struct xe_device *xe, char **reset_event);
>>>>>  
>>>>>  struct xe_file *xe_file_get(struct xe_file *xef);
>>>>>  void xe_file_put(struct xe_file *xef);
>>>>> diff --git a/drivers/gpu/drm/xe/xe_gt.c b/drivers/gpu/drm/xe/xe_gt.c
>>>>> index 58895ed22f6e..519f3c2cf9e2 100644
>>>>> --- a/drivers/gpu/drm/xe/xe_gt.c
>>>>> +++ b/drivers/gpu/drm/xe/xe_gt.c
>>>>> @@ -741,6 +741,24 @@ static int do_gt_restart(struct xe_gt *gt)
>>>>>  	return 0;
>>>>>  }
>>>>>  
>>>>> +static void xe_gt_reset_failed(struct xe_gt *gt, int err)
>>>>> +{
>>>>> +	char *event_params[5];
>>>>> +
>>>>> +	xe_gt_err(gt, "reset failed (%pe)\n", ERR_PTR(err));
>>>>> +
>>>>> +	event_params[0] = DRM_XE_RESET_REQUIRED_UEVENT;
>>>>> +	event_params[1] = DRM_XE_RESET_REQUIRED_UEVENT_REASON_GT;
>>>>> +	event_params[2] = kasprintf(GFP_KERNEL, "TILE_ID=%d", gt_to_tile(gt)->id);
>>>>> +	event_params[3] = kasprintf(GFP_KERNEL, "GT_ID=%d", gt->info.id);
>>>> the TILE_ID and GT_ID can be passed for other events as well, with that you can
>>>> have a common function to send uevent which would take reason as an input.
>>> But is that required for all cases? There could be potential cases atleast
>>> in the future where it is not needed.
> 
> 
> At least in these cases it makes sense as they (other reasons)
> can be associated to a GT and a Tile. If in future they arises a
> reason where these details are not needed i guess we can handle that.

But then we'll have to modify it with every new addition, which doesn't
look like a win. With current implementation the callers atleast have
the autonomy to send params as-needed.

Raag

Aravind Iddamsetty Aug. 14, 2024, 12:49 p.m. UTC | #9

On 14/08/24 14:42, Raag Jadav wrote:
> On Wed, Aug 14, 2024 at 12:16:41PM +0530, Aravind Iddamsetty wrote:
>> On 13/08/24 22:24, Rodrigo Vivi wrote:
>>> On Tue, Aug 13, 2024 at 04:28:32PM +0300, Raag Jadav wrote:
>>>> On Mon, Aug 12, 2024 at 03:08:14PM +0530, Aravind Iddamsetty wrote:
>>>>> On 12/08/24 13:18, Raag Jadav wrote:
>>>>>> From: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
>>>>>>
>>>>>> This was dropped in commit 77a0d4d1cea2 ("drm/xe/uapi: Remove reset
>>>>>> uevent for now") as part of refactoring.
>>>>>>
>>>>>> Now that we have better uapi semantics and naming for the uevent,
>>>>>> bring it back. With this in place, userspace will be notified of
>>>>>> wedged device along with its reason.
>>>>>>
>>>>>> $ udevadm monitor --property --kernel
>>>>>> monitor will print the received events for:
>>>>>> KERNEL - the kernel uevent
>>>>>>
>>>>>> KERNEL[871.188570] change   /devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:01.0/0000:03:00.0 (pci)
>>>>>> ACTION=change
>>>>>> DEVPATH=/devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:01.0/0000:03:00.0
>>>>>> SUBSYSTEM=pci
>>>>>> DEVICE_STATUS=NEEDS_RESET
>>>>>> REASON=GT_RESET_FAILED
>>>>>> TILE_ID=0
>>>>>> GT_ID=0
>>>>>> DRIVER=xe
>>>>>> PCI_CLASS=30000
>>>>>> PCI_ID=8086:56B1
>>>>>> PCI_SUBSYS_ID=8086:1210
>>>>>> PCI_SLOT_NAME=0000:03:00.0
>>>>>> MODALIAS=pci:v00008086d000056B1sv00008086sd00001210bc03sc00i00
>>>>>> SEQNUM=6104
>>>>>>
>>>>>> v2: Change authorship to Himal (Aravind)
>>>>>>     Add uevent for all device wedged cases (Aravind)
>>>>>>
>>>>>> Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
>>>>>> Signed-off-by: Raag Jadav <raag.jadav@intel.com>
>>>>>> ---
>>>>>>  drivers/gpu/drm/xe/xe_device.c     | 10 +++++++++-
>>>>>>  drivers/gpu/drm/xe/xe_device.h     |  2 +-
>>>>>>  drivers/gpu/drm/xe/xe_gt.c         | 23 +++++++++++++++++++----
>>>>>>  drivers/gpu/drm/xe/xe_guc.c        | 13 ++++++++++++-
>>>>>>  drivers/gpu/drm/xe/xe_guc_submit.c | 13 ++++++++++++-
>>>>>>  include/uapi/drm/xe_drm.h          | 29 +++++++++++++++++++++++++++++
>>>>>>  6 files changed, 82 insertions(+), 8 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
>>>>>> index 1aba6f9eaa19..d975bdce4a7d 100644
>>>>>> --- a/drivers/gpu/drm/xe/xe_device.c
>>>>>> +++ b/drivers/gpu/drm/xe/xe_device.c
>>>>>> @@ -955,6 +955,7 @@ static void xe_device_wedged_fini(struct drm_device *drm, void *arg)
>>>>>>  /**
>>>>>>   * xe_device_declare_wedged - Declare device wedged
>>>>>>   * @xe: xe device instance
>>>>>> + * @event_params: parameters to be sent along with uevent
>>>>>>   *
>>>>>>   * This is a final state that can only be cleared with a mudule
>>>>>>   * re-probe (unbind + bind).
>>>>>> @@ -965,8 +966,10 @@ static void xe_device_wedged_fini(struct drm_device *drm, void *arg)
>>>>>>   * on every single execution timeout (a.k.a. GPU hang) right after devcoredump
>>>>>>   * snapshot capture. In this mode, GT reset won't be attempted so the state of
>>>>>>   * the issue is preserved for further debugging.
>>>>>> + * Caller is expected to pass respective parameters to be sent along with
>>>>>> + * uevent. Pass NULL in case of no params.
>>>>>>   */
>>>>>> -void xe_device_declare_wedged(struct xe_device *xe)
>>>>>> +void xe_device_declare_wedged(struct xe_device *xe, char **event_params)
>>>>>>  {
>>>>>>  	struct xe_gt *gt;
>>>>>>  	u8 id;
>>>>>> @@ -984,12 +987,17 @@ void xe_device_declare_wedged(struct xe_device *xe)
>>>>>>  	xe_pm_runtime_get_noresume(xe);
>>>>>>  
>>>>>>  	if (!atomic_xchg(&xe->wedged.flag, 1)) {
>>>>>> +		struct pci_dev *pdev = to_pci_dev(xe->drm.dev);
>>>>>> +
>>>>>>  		xe->needs_flr_on_fini = true;
>>>>>>  		drm_err(&xe->drm,
>>>>>>  			"CRITICAL: Xe has declared device %s as wedged.\n"
>>>>>>  			"IOCTLs and executions are blocked. Only a rebind may clear the failure\n"
>>>>>>  			"Please file a _new_ bug report at https://gitlab.freedesktop.org/drm/xe/kernel/issues/new\n",
>>>>>>  			dev_name(xe->drm.dev));
>>>>>> +
>>>>>> +		/* Notify userspace about reset required */
>>>>>> +		kobject_uevent_env(&pdev->dev.kobj, KOBJ_CHANGE, event_params);
>>>>>>  	}
>>>>>>  
>>>>>>  	for_each_gt(gt, xe, id)
>>>>>> diff --git a/drivers/gpu/drm/xe/xe_device.h b/drivers/gpu/drm/xe/xe_device.h
>>>>>> index db6cc8d0d6b8..5d40fc6f0904 100644
>>>>>> --- a/drivers/gpu/drm/xe/xe_device.h
>>>>>> +++ b/drivers/gpu/drm/xe/xe_device.h
>>>>>> @@ -174,7 +174,7 @@ static inline bool xe_device_wedged(struct xe_device *xe)
>>>>>>  	return atomic_read(&xe->wedged.flag);
>>>>>>  }
>>>>>>  
>>>>>> -void xe_device_declare_wedged(struct xe_device *xe);
>>>>>> +void xe_device_declare_wedged(struct xe_device *xe, char **reset_event);
>>>>>>  
>>>>>>  struct xe_file *xe_file_get(struct xe_file *xef);
>>>>>>  void xe_file_put(struct xe_file *xef);
>>>>>> diff --git a/drivers/gpu/drm/xe/xe_gt.c b/drivers/gpu/drm/xe/xe_gt.c
>>>>>> index 58895ed22f6e..519f3c2cf9e2 100644
>>>>>> --- a/drivers/gpu/drm/xe/xe_gt.c
>>>>>> +++ b/drivers/gpu/drm/xe/xe_gt.c
>>>>>> @@ -741,6 +741,24 @@ static int do_gt_restart(struct xe_gt *gt)
>>>>>>  	return 0;
>>>>>>  }
>>>>>>  
>>>>>> +static void xe_gt_reset_failed(struct xe_gt *gt, int err)
>>>>>> +{
>>>>>> +	char *event_params[5];
>>>>>> +
>>>>>> +	xe_gt_err(gt, "reset failed (%pe)\n", ERR_PTR(err));
>>>>>> +
>>>>>> +	event_params[0] = DRM_XE_RESET_REQUIRED_UEVENT;
>>>>>> +	event_params[1] = DRM_XE_RESET_REQUIRED_UEVENT_REASON_GT;
>>>>>> +	event_params[2] = kasprintf(GFP_KERNEL, "TILE_ID=%d", gt_to_tile(gt)->id);
>>>>>> +	event_params[3] = kasprintf(GFP_KERNEL, "GT_ID=%d", gt->info.id);
>>>>> the TILE_ID and GT_ID can be passed for other events as well, with that you can
>>>>> have a common function to send uevent which would take reason as an input.
>>>> But is that required for all cases? There could be potential cases atleast
>>>> in the future where it is not needed.
>>
>> At least in these cases it makes sense as they (other reasons)
>> can be associated to a GT and a Tile. If in future they arises a
>> reason where these details are not needed i guess we can handle that.
> But then we'll have to modify it with every new addition, which doesn't
> look like a win. With current implementation the callers atleast have
> the autonomy to send params as-needed.

My intention is to have a common code at least for these events
as much as possible as all the fields are same except reason.

The only place in future that might not have a tile or gt is 
for gobal (entire device) level event for that I thought we 
can have a check based on reason type. Otherwise most of the 
reasons can be associated to a Tile or a GT.

Thanks,
Aravind,
>
> Raag

Lucas De Marchi Aug. 14, 2024, 2:24 p.m. UTC | #10

On Wed, Aug 14, 2024 at 12:16:41PM GMT, Aravind Iddamsetty wrote:
>>>> i see that this is even called from xe_guc_exec_queue_lr_cleanup which is for long running queue
>>>> from various call paths need to check in what scenarios those happen.
>>> Should we add a reason for long running queue?
>> Both of your questions would probably be answered if this was getting developed
>> at the same time of the user space usage of these uevents.
>
>Can't we consider the generic Linux user as a consumer, via udevadm.

No, udevadm just confirms that there is an event being sent. Main thing
to understand here is "what is this event useful for? what can be done
as outcome of receiving this event?". From your other reply it seems
this is about "device is wedged/toast, and driver can't recover without
userspace intervention since userspace may have different policies"

What would be some examples of actions for userspace to take?

	- Unbind driver, kill clients, rebind?
	- FLR?
	- Something else?

Is this currently implemented somewhere?

Also, is it possible to make is a generic drm event handling so distros
don't have to setup udev rules for each vendor?

Lucas De Marchi

>
>Thanks,
>Aravind.
>>
>>> Raag

Aravind Iddamsetty Aug. 16, 2024, 4:29 a.m. UTC | #11

On 14/08/24 19:54, Lucas De Marchi wrote:
> On Wed, Aug 14, 2024 at 12:16:41PM GMT, Aravind Iddamsetty wrote:
>>>>> i see that this is even called from xe_guc_exec_queue_lr_cleanup which is for long running queue
>>>>> from various call paths need to check in what scenarios those happen.
>>>> Should we add a reason for long running queue?
>>> Both of your questions would probably be answered if this was getting developed
>>> at the same time of the user space usage of these uevents.
>>
>> Can't we consider the generic Linux user as a consumer, via udevadm.
>
> No, udevadm just confirms that there is an event being sent. Main thing
> to understand here is "what is this event useful for? what can be done
> as outcome of receiving this event?". From your other reply it seems
> this is about "device is wedged/toast, and driver can't recover without
> userspace intervention since userspace may have different policies"
>
> What would be some examples of actions for userspace to take?
>
>     - Unbind driver, kill clients, rebind?
>     - FLR?
>     - Something else?

The expectation from userspace on receiving this event is to do a reset
most probably SBR(unbind->sbr->rebind).

The other requirement of this event is for L0 Sysman

https://spec.oneapi.io/level-zero/latest/sysman/api.html#_CPPv4N21zes_event_type_flag_t41ZES_EVENT_TYPE_FLAG_DEVICE_RESET_REQUIREDE

https://spec.oneapi.io/level-zero/latest/sysman/api.html#_CPPv4N18zes_device_state_t5resetE

So we expect the admin do the above actions

>
> Is this currently implemented somewhere?
Do you mean by some other driver or subsystem?
>
> Also, is it possible to make is a generic drm event handling so distros
> don't have to setup udev rules for each vendor?

I think doing via drm event mandates the observability applications(L0)
to have open connection to the device to process these which is not desired.

Thanks,
Aravind.
>
>
>
> Lucas De Marchi
>
>>
>> Thanks,
>> Aravind.
>>>
>>>> Raag

Lucas De Marchi Aug. 16, 2024, 5:41 a.m. UTC | #12

On Fri, Aug 16, 2024 at 09:59:31AM GMT, Aravind Iddamsetty wrote:
>
>On 14/08/24 19:54, Lucas De Marchi wrote:
>> On Wed, Aug 14, 2024 at 12:16:41PM GMT, Aravind Iddamsetty wrote:
>>>>>> i see that this is even called from xe_guc_exec_queue_lr_cleanup which is for long running queue
>>>>>> from various call paths need to check in what scenarios those happen.
>>>>> Should we add a reason for long running queue?
>>>> Both of your questions would probably be answered if this was getting developed
>>>> at the same time of the user space usage of these uevents.
>>>
>>> Can't we consider the generic Linux user as a consumer, via udevadm.
>>
>> No, udevadm just confirms that there is an event being sent. Main thing
>> to understand here is "what is this event useful for? what can be done
>> as outcome of receiving this event?". From your other reply it seems
>> this is about "device is wedged/toast, and driver can't recover without
>> userspace intervention since userspace may have different policies"
>>
>> What would be some examples of actions for userspace to take?
>>
>>     - Unbind driver, kill clients, rebind?
>>     - FLR?
>>     - Something else?
>
>The expectation from userspace on receiving this event is to do a reset
>most probably SBR(unbind->sbr->rebind).
>
>The other requirement of this event is for L0 Sysman
>
>https://spec.oneapi.io/level-zero/latest/sysman/api.html#_CPPv4N21zes_event_type_flag_t41ZES_EVENT_TYPE_FLAG_DEVICE_RESET_REQUIREDE
>
>https://spec.oneapi.io/level-zero/latest/sysman/api.html#_CPPv4N18zes_device_state_t5resetE
>
>So we expect the admin do the above actions

so these things should be in the commit message. 

where's the pending implementation of this spec?  I imagine somewhere
we'd see something on top of libudev generating these L0 events. The
events themselves are not much more than confirming via udevadm, but
then at least we know the implementation covers the expectations from
the the spec without having to come back later fixing up stuff.

>
>>
>> Is this currently implemented somewhere?
>Do you mean by some other driver or subsystem?
>>
>> Also, is it possible to make is a generic drm event handling so distros
>> don't have to setup udev rules for each vendor?
>
>I think doing via drm event mandates the observability applications(L0)
>to have open connection to the device to process these which is not desired.

I'm not talking about keeping the device open. Here you are defining
xe-specific uevents. Could this be drm-specific so someone implementing
that specification you mentioned doesn't have to implement N different
uevents for N different vendors?  And a distro defining a policy via
udev rules doesn't have to do the same thing slightly differently N times?

Lucas De Marchi

>
>Thanks,
>Aravind.
>>
>>
>>
>> Lucas De Marchi
>>
>>>
>>> Thanks,
>>> Aravind.
>>>>
>>>>> Raag

[v2] drm/xe/uapi: Bring back reset uevent

Commit Message

Comments

Patch