From patchwork Wed Feb 19 20:28:41 2025
X-Patchwork-Submitter: Jonathan Cavitt
X-Patchwork-Id: 13982863
From: Jonathan Cavitt <jonathan.cavitt@intel.com>
To: intel-xe@lists.freedesktop.org
Cc: saurabhg.gupta@intel.com, alex.zuo@intel.com, jonathan.cavitt@intel.com,
    joonas.lahtinen@linux.intel.com, tvrtko.ursulin@igalia.com,
    lucas.demarchi@intel.com, matthew.brost@intel.com,
    dri-devel@lists.freedesktop.org, simona.vetter@ffwll.ch
Subject: [PATCH v4 1/6] drm/xe/xe_exec_queue: Add ID param to exec queue struct
Date: Wed, 19 Feb 2025 20:28:41 +0000
Message-ID: <20250219202847.127425-2-jonathan.cavitt@intel.com>
In-Reply-To: <20250219202847.127425-1-jonathan.cavitt@intel.com>

Add the exec queue ID to the exec queue struct. This is useful for
performing a reverse lookup into the xef->exec_queue xarray.
Signed-off-by: Jonathan Cavitt <jonathan.cavitt@intel.com>
---
 drivers/gpu/drm/xe/xe_exec_queue.c       | 1 +
 drivers/gpu/drm/xe/xe_exec_queue_types.h | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_exec_queue.c b/drivers/gpu/drm/xe/xe_exec_queue.c
index 23a9f519ce1c..4a98a5d0e405 100644
--- a/drivers/gpu/drm/xe/xe_exec_queue.c
+++ b/drivers/gpu/drm/xe/xe_exec_queue.c
@@ -709,6 +709,7 @@ int xe_exec_queue_create_ioctl(struct drm_device *dev, void *data,
 	if (err)
 		goto kill_exec_queue;
 
+	q->id = id;
 	args->exec_queue_id = id;
 
 	return 0;
diff --git a/drivers/gpu/drm/xe/xe_exec_queue_types.h b/drivers/gpu/drm/xe/xe_exec_queue_types.h
index 6eb7ff091534..088d838218e9 100644
--- a/drivers/gpu/drm/xe/xe_exec_queue_types.h
+++ b/drivers/gpu/drm/xe/xe_exec_queue_types.h
@@ -55,6 +55,8 @@ struct xe_exec_queue {
 	struct xe_vm *vm;
 	/** @class: class of this exec queue */
 	enum xe_engine_class class;
+	/** @id: exec queue ID as reported during create ioctl */
+	u32 id;
 	/**
 	 * @logical_mask: logical mask of where job submitted to exec queue can run
 	 */

From patchwork Wed Feb 19 20:28:42 2025
X-Patchwork-Submitter: Jonathan Cavitt
X-Patchwork-Id: 13982868
From: Jonathan Cavitt <jonathan.cavitt@intel.com>
To: intel-xe@lists.freedesktop.org
Cc: saurabhg.gupta@intel.com, alex.zuo@intel.com, jonathan.cavitt@intel.com,
    joonas.lahtinen@linux.intel.com, tvrtko.ursulin@igalia.com,
    lucas.demarchi@intel.com, matthew.brost@intel.com,
    dri-devel@lists.freedesktop.org, simona.vetter@ffwll.ch
Subject: [PATCH v4 2/6] drm/xe/xe_gt_pagefault: Migrate pagefault struct to header
Date: Wed, 19 Feb 2025 20:28:42 +0000
Message-ID: <20250219202847.127425-3-jonathan.cavitt@intel.com>
In-Reply-To: <20250219202847.127425-1-jonathan.cavitt@intel.com>

Migrate the pagefault struct from xe_gt_pagefault.c to the
xe_gt_pagefault.h header file, along with the associated enum values.

Signed-off-by: Jonathan Cavitt <jonathan.cavitt@intel.com>
---
 drivers/gpu/drm/xe/xe_gt_pagefault.c | 27 ---------------------------
 drivers/gpu/drm/xe/xe_gt_pagefault.h | 28 ++++++++++++++++++++++++++++
 2 files changed, 28 insertions(+), 27 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_gt_pagefault.c b/drivers/gpu/drm/xe/xe_gt_pagefault.c
index 46701ca11ce0..fe18e3ec488a 100644
--- a/drivers/gpu/drm/xe/xe_gt_pagefault.c
+++ b/drivers/gpu/drm/xe/xe_gt_pagefault.c
@@ -22,33 +22,6 @@
 #include "xe_trace_bo.h"
 #include "xe_vm.h"
 
-struct pagefault {
-	u64 page_addr;
-	u32 asid;
-	u16 pdata;
-	u8 vfid;
-	u8 access_type;
-	u8 fault_type;
-	u8 fault_level;
-	u8 engine_class;
-	u8 engine_instance;
-	u8 fault_unsuccessful;
-	bool trva_fault;
-};
-
-enum access_type {
-	ACCESS_TYPE_READ = 0,
-	ACCESS_TYPE_WRITE = 1,
-	ACCESS_TYPE_ATOMIC = 2,
-	ACCESS_TYPE_RESERVED = 3,
-};
-
-enum fault_type {
-	NOT_PRESENT = 0,
-	WRITE_ACCESS_VIOLATION = 1,
-	ATOMIC_ACCESS_VIOLATION = 2,
-};
-
 struct acc {
 	u64 va_range_base;
 	u32 asid;
diff --git a/drivers/gpu/drm/xe/xe_gt_pagefault.h b/drivers/gpu/drm/xe/xe_gt_pagefault.h
index 839c065a5e4c..e9911da5c8a7 100644
--- a/drivers/gpu/drm/xe/xe_gt_pagefault.h
+++ b/drivers/gpu/drm/xe/xe_gt_pagefault.h
@@ -11,6 +11,34 @@
 struct xe_gt;
 struct xe_guc;
 
+struct pagefault {
+	u64 page_addr;
+	u32 asid;
+	u16 pdata;
+	u8 vfid;
+	u8 access_type;
+	u8 fault_type;
+	u8 fault_level;
+	u8 engine_class;
+	u8 engine_instance;
+	u8 fault_unsuccessful;
+	bool prefetch;
+	bool trva_fault;
+};
+
+enum access_type {
+	ACCESS_TYPE_READ = 0,
+	ACCESS_TYPE_WRITE = 1,
+	ACCESS_TYPE_ATOMIC = 2,
+	ACCESS_TYPE_RESERVED = 3,
+};
+
+enum fault_type {
+	NOT_PRESENT = 0,
+	WRITE_ACCESS_VIOLATION = 1,
+	ATOMIC_ACCESS_VIOLATION = 2,
+};
+
 int xe_gt_pagefault_init(struct xe_gt *gt);
 void xe_gt_pagefault_reset(struct xe_gt *gt);
 int xe_guc_pagefault_handler(struct xe_guc *guc, u32 *msg, u32 len);

From patchwork Wed Feb 19 20:28:43 2025
X-Patchwork-Submitter: Jonathan Cavitt
X-Patchwork-Id: 13982864
From: Jonathan Cavitt <jonathan.cavitt@intel.com>
To: intel-xe@lists.freedesktop.org
Cc: saurabhg.gupta@intel.com, alex.zuo@intel.com, jonathan.cavitt@intel.com,
    joonas.lahtinen@linux.intel.com, tvrtko.ursulin@igalia.com,
    lucas.demarchi@intel.com, matthew.brost@intel.com,
    dri-devel@lists.freedesktop.org, simona.vetter@ffwll.ch
Subject: [PATCH v4 3/6] drm/xe/xe_drm_client: Add per drm client pagefault info
Date: Wed, 19 Feb 2025 20:28:43 +0000
Message-ID: <20250219202847.127425-4-jonathan.cavitt@intel.com>
In-Reply-To: <20250219202847.127425-1-jonathan.cavitt@intel.com>

Add additional information to drm client so it can report up to the last
50 exec queues to have been banned on it, as well as the last pagefault
seen when said exec queues were banned. Since we cannot reasonably
associate a pagefault with a specific exec queue, we currently report
the last seen pagefault on the associated hw engine instead. The last
pagefault seen per exec queue is saved to the hw engine, and the
pagefault is updated during the pagefault handling process in
xe_gt_pagefault. The last seen pagefault is reset when the engine is
reset because any future exec queue bans likely were not caused by said
pagefault after the reset.

v2: Remove exec queue from blame list on destroy and recreate (Joonas)
v3: Do not print as part of xe_drm_client_fdinfo (Joonas)
v4: Fix formatting and kzalloc during lock warnings

Signed-off-by: Jonathan Cavitt <jonathan.cavitt@intel.com>
---
 drivers/gpu/drm/xe/xe_drm_client.c      | 68 +++++++++++++++++++++++++
 drivers/gpu/drm/xe/xe_drm_client.h      | 42 +++++++++++++++
 drivers/gpu/drm/xe/xe_exec_queue.c      |  7 +++
 drivers/gpu/drm/xe/xe_gt_pagefault.c    | 17 +++++++
 drivers/gpu/drm/xe/xe_guc_submit.c      | 15 ++++++
 drivers/gpu/drm/xe/xe_hw_engine.c       |  4 ++
 drivers/gpu/drm/xe/xe_hw_engine_types.h |  8 +++
 7 files changed, 161 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_drm_client.c b/drivers/gpu/drm/xe/xe_drm_client.c
index 2d4874d2b922..1bc978ae4c2f 100644
--- a/drivers/gpu/drm/xe/xe_drm_client.c
+++ b/drivers/gpu/drm/xe/xe_drm_client.c
@@ -17,6 +17,7 @@
 #include "xe_exec_queue.h"
 #include "xe_force_wake.h"
 #include "xe_gt.h"
+#include "xe_gt_pagefault.h"
 #include "xe_hw_engine.h"
 #include "xe_pm.h"
 #include "xe_trace.h"
@@ -97,6 +98,8 @@ struct xe_drm_client *xe_drm_client_alloc(void)
 #ifdef CONFIG_PROC_FS
 	spin_lock_init(&client->bos_lock);
 	INIT_LIST_HEAD(&client->bos_list);
+	spin_lock_init(&client->blame_lock);
+	INIT_LIST_HEAD(&client->blame_list);
 #endif
 	return client;
 }
@@ -164,6 +167,71 @@ void xe_drm_client_remove_bo(struct xe_bo *bo)
 	xe_drm_client_put(client);
 }
 
+static void free_blame(struct blame *b)
+{
+	list_del(&b->list);
+	kfree(b->pf);
+	kfree(b);
+}
+
+void xe_drm_client_add_blame(struct xe_drm_client *client,
+			     struct xe_exec_queue *q)
+{
+	struct blame *b = NULL;
+	struct pagefault *pf = NULL;
+	struct xe_file *xef = q->xef;
+	struct xe_hw_engine *hwe = q->hwe;
+
+	b = kzalloc(sizeof(*b), GFP_KERNEL);
+	xe_assert(xef->xe, b);
+
+	spin_lock(&client->blame_lock);
+	list_add_tail(&b->list, &client->blame_list);
+	client->blame_len++;
+	/**
+	 * Limit the number of blames in the blame list to prevent memory overuse.
+	 */
+	if (client->blame_len > MAX_BLAME_LEN) {
+		struct blame *rem = list_first_entry(&client->blame_list, struct blame, list);
+
+		free_blame(rem);
+		client->blame_len--;
+	}
+	spin_unlock(&client->blame_lock);
+
+	/**
+	 * Duplicate pagefault on engine to blame, if one may have caused the
+	 * exec queue to be banned.
+	 */
+	b->pf = NULL;
+	pf = kzalloc(sizeof(*pf), GFP_KERNEL);
+	spin_lock(&hwe->pf.lock);
+	if (hwe->pf.info) {
+		memcpy(pf, hwe->pf.info, sizeof(struct pagefault));
+		b->pf = pf;
+	} else {
+		kfree(pf);
+	}
+	spin_unlock(&hwe->pf.lock);
+
+	/** Save blame data to list element */
+	b->exec_queue_id = q->id;
+}
+
+void xe_drm_client_remove_blame(struct xe_drm_client *client,
+				struct xe_exec_queue *q)
+{
+	struct blame *b, *tmp;
+
+	spin_lock(&client->blame_lock);
+	list_for_each_entry_safe(b, tmp, &client->blame_list, list)
+		if (b->exec_queue_id == q->id) {
+			free_blame(b);
+			client->blame_len--;
+		}
+	spin_unlock(&client->blame_lock);
+}
+
 static void bo_meminfo(struct xe_bo *bo,
		       struct drm_memory_stats stats[TTM_NUM_MEM_TYPES])
 {
diff --git a/drivers/gpu/drm/xe/xe_drm_client.h b/drivers/gpu/drm/xe/xe_drm_client.h
index a9649aa36011..b3d9b279d55f 100644
--- a/drivers/gpu/drm/xe/xe_drm_client.h
+++ b/drivers/gpu/drm/xe/xe_drm_client.h
@@ -13,9 +13,22 @@
 #include
 #include
 
+#define MAX_BLAME_LEN 50
+
 struct drm_file;
 struct drm_printer;
+struct pagefault;
 struct xe_bo;
+struct xe_exec_queue;
+
+struct blame {
+	/** @exec_queue_id: ID number of banned exec queue */
+	u32 exec_queue_id;
+	/** @pf: pagefault on engine of banned exec queue, if any at time */
+	struct pagefault *pf;
+	/** @list: link into @xe_drm_client.blame_list */
+	struct list_head list;
+};
 
 struct xe_drm_client {
 	struct kref kref;
@@ -31,6 +44,21 @@ struct xe_drm_client {
 	 * Protected by @bos_lock.
 	 */
 	struct list_head bos_list;
+	/**
+	 * @blame_lock: lock protecting @blame_list
+	 */
+	spinlock_t blame_lock;
+	/**
+	 * @blame_list: list of banned exec queues associated with this drm
+	 * client, as well as any pagefaults at time of ban.
+	 *
+	 * Protected by @blame_lock.
+	 */
+	struct list_head blame_list;
+	/**
+	 * @blame_len: length of @blame_list
+	 */
+	unsigned int blame_len;
 #endif
 };
 
@@ -57,6 +85,10 @@ void xe_drm_client_fdinfo(struct drm_printer *p, struct drm_file *file);
 void xe_drm_client_add_bo(struct xe_drm_client *client,
			  struct xe_bo *bo);
 void xe_drm_client_remove_bo(struct xe_bo *bo);
+void xe_drm_client_add_blame(struct xe_drm_client *client,
+			     struct xe_exec_queue *q);
+void xe_drm_client_remove_blame(struct xe_drm_client *client,
+				struct xe_exec_queue *q);
 #else
 static inline void xe_drm_client_add_bo(struct xe_drm_client *client,
					struct xe_bo *bo)
@@ -66,5 +98,15 @@ static inline void xe_drm_client_add_bo(struct xe_drm_client *client,
 static inline void xe_drm_client_remove_bo(struct xe_bo *bo)
 {
 }
+
+static inline void xe_drm_client_add_blame(struct xe_drm_client *client,
+					   struct xe_exec_queue *q)
+{
+}
+
+static inline void xe_drm_client_remove_blame(struct xe_drm_client *client,
+					      struct xe_exec_queue *q)
+{
+}
 #endif
 #endif
diff --git a/drivers/gpu/drm/xe/xe_exec_queue.c b/drivers/gpu/drm/xe/xe_exec_queue.c
index 4a98a5d0e405..f8bcf43b2a0e 100644
--- a/drivers/gpu/drm/xe/xe_exec_queue.c
+++ b/drivers/gpu/drm/xe/xe_exec_queue.c
@@ -13,6 +13,7 @@
 #include
 
 #include "xe_device.h"
+#include "xe_drm_client.h"
 #include "xe_gt.h"
 #include "xe_hw_engine_class_sysfs.h"
 #include "xe_hw_engine_group.h"
@@ -712,6 +713,12 @@ int xe_exec_queue_create_ioctl(struct drm_device *dev, void *data,
 	q->id = id;
 	args->exec_queue_id = id;
 
+	/**
+	 * If an exec queue in the blame list shares the same exec queue
+	 * ID, remove it from the blame list to avoid confusion.
+	 */
+	xe_drm_client_remove_blame(q->xef->client, q);
+
 	return 0;
 
 kill_exec_queue:
diff --git a/drivers/gpu/drm/xe/xe_gt_pagefault.c b/drivers/gpu/drm/xe/xe_gt_pagefault.c
index fe18e3ec488a..b95501076569 100644
--- a/drivers/gpu/drm/xe/xe_gt_pagefault.c
+++ b/drivers/gpu/drm/xe/xe_gt_pagefault.c
@@ -330,6 +330,21 @@ int xe_guc_pagefault_handler(struct xe_guc *guc, u32 *msg, u32 len)
 	return full ? -ENOSPC : 0;
 }
 
+static void save_pagefault_to_engine(struct xe_gt *gt, struct pagefault *pf)
+{
+	struct xe_hw_engine *hwe;
+
+	hwe = xe_gt_hw_engine(gt, pf->engine_class, pf->engine_instance, false);
+	if (hwe) {
+		spin_lock(&hwe->pf.lock);
+		/** Info initializes as NULL, so alloc if first pagefault */
+		if (!hwe->pf.info)
+			hwe->pf.info = kzalloc(sizeof(*pf), GFP_KERNEL);
+		memcpy(hwe->pf.info, pf, sizeof(*pf));
+		spin_unlock(&hwe->pf.lock);
+	}
+}
+
 #define USM_QUEUE_MAX_RUNTIME_MS 20
 
 static void pf_queue_work_func(struct work_struct *w)
@@ -352,6 +367,8 @@ static void pf_queue_work_func(struct work_struct *w)
 		drm_dbg(&xe->drm, "Fault response: Unsuccessful %d\n", ret);
 	}
 
+	save_pagefault_to_engine(gt, &pf);
+
 	reply.dw0 = FIELD_PREP(PFR_VALID, 1) |
		    FIELD_PREP(PFR_SUCCESS, pf.fault_unsuccessful) |
		    FIELD_PREP(PFR_REPLY, PFR_ACCESS) |
diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
index 913c74d6e2ae..92de926bd505 100644
--- a/drivers/gpu/drm/xe/xe_guc_submit.c
+++ b/drivers/gpu/drm/xe/xe_guc_submit.c
@@ -20,11 +20,13 @@
 #include "xe_assert.h"
 #include "xe_devcoredump.h"
 #include "xe_device.h"
+#include "xe_drm_client.h"
 #include "xe_exec_queue.h"
 #include "xe_force_wake.h"
 #include "xe_gpu_scheduler.h"
 #include "xe_gt.h"
 #include "xe_gt_clock.h"
+#include "xe_gt_pagefault.h"
 #include "xe_gt_printk.h"
 #include "xe_guc.h"
 #include "xe_guc_capture.h"
@@ -146,6 +148,7 @@ static bool exec_queue_banned(struct xe_exec_queue *q)
 static void set_exec_queue_banned(struct xe_exec_queue *q)
 {
 	atomic_or(EXEC_QUEUE_STATE_BANNED, &q->guc->state);
+	xe_drm_client_add_blame(q->xef->client, q);
 }
 
 static bool exec_queue_suspended(struct xe_exec_queue *q)
@@ -1971,6 +1974,7 @@ int xe_guc_deregister_done_handler(struct xe_guc *guc, u32 *msg, u32 len)
 int xe_guc_exec_queue_reset_handler(struct xe_guc *guc, u32 *msg, u32 len)
 {
 	struct xe_gt *gt = guc_to_gt(guc);
+	struct xe_hw_engine *hwe;
 	struct xe_exec_queue *q;
 	u32 guc_id;
 
@@ -1983,11 +1987,22 @@ int xe_guc_exec_queue_reset_handler(struct xe_guc *guc, u32 *msg, u32 len)
 	if (unlikely(!q))
 		return -EPROTO;
 
+	hwe = q->hwe;
+
 	xe_gt_info(gt, "Engine reset: engine_class=%s, logical_mask: 0x%x, guc_id=%d",
		   xe_hw_engine_class_to_str(q->class), q->logical_mask, guc_id);
 
 	trace_xe_exec_queue_reset(q);
 
+	/**
+	 * Clear last pagefault from engine. Any future exec queue bans likely were
+	 * not caused by said pagefault now that the engine has reset.
+	 */
+	spin_lock(&hwe->pf.lock);
+	kfree(hwe->pf.info);
+	hwe->pf.info = NULL;
+	spin_unlock(&hwe->pf.lock);
+
 	/*
	 * A banned engine is a NOP at this point (came from
	 * guc_exec_queue_timedout_job). Otherwise, kick drm scheduler to cancel
diff --git a/drivers/gpu/drm/xe/xe_hw_engine.c b/drivers/gpu/drm/xe/xe_hw_engine.c
index fc447751fe78..69f61e4905e2 100644
--- a/drivers/gpu/drm/xe/xe_hw_engine.c
+++ b/drivers/gpu/drm/xe/xe_hw_engine.c
@@ -21,6 +21,7 @@
 #include "xe_gsc.h"
 #include "xe_gt.h"
 #include "xe_gt_ccs_mode.h"
+#include "xe_gt_pagefault.h"
 #include "xe_gt_printk.h"
 #include "xe_gt_mcr.h"
 #include "xe_gt_topology.h"
@@ -557,6 +558,9 @@ static void hw_engine_init_early(struct xe_gt *gt, struct xe_hw_engine *hwe,
 		hwe->eclass->defaults = hwe->eclass->sched_props;
 	}
 
+	hwe->pf.info = NULL;
+	spin_lock_init(&hwe->pf.lock);
+
 	xe_reg_sr_init(&hwe->reg_sr, hwe->name, gt_to_xe(gt));
 	xe_tuning_process_engine(hwe);
 	xe_wa_process_engine(hwe);
diff --git a/drivers/gpu/drm/xe/xe_hw_engine_types.h b/drivers/gpu/drm/xe/xe_hw_engine_types.h
index e4191a7a2c31..2e1be9481d9b 100644
--- a/drivers/gpu/drm/xe/xe_hw_engine_types.h
+++ b/drivers/gpu/drm/xe/xe_hw_engine_types.h
@@ -64,6 +64,7 @@ enum xe_hw_engine_id {
 struct xe_bo;
 struct xe_execlist_port;
 struct xe_gt;
+struct pagefault;
 
 /**
  * struct xe_hw_engine_class_intf - per hw engine class struct interface
@@ -150,6 +151,13 @@ struct xe_hw_engine {
 	struct xe_oa_unit *oa_unit;
 	/** @hw_engine_group: the group of hw engines this one belongs to */
 	struct xe_hw_engine_group *hw_engine_group;
+	/** @pf: the last pagefault seen on this engine */
+	struct {
+		/** @pf.info: info containing last seen pagefault details */
+		struct pagefault *info;
+		/** @pf.lock: lock protecting @pf.info */
+		spinlock_t lock;
+	} pf;
 };
 
 enum xe_hw_engine_snapshot_source_id {

From patchwork Wed Feb 19 20:28:44 2025
X-Patchwork-Submitter: Jonathan Cavitt
X-Patchwork-Id: 13982866
From: Jonathan Cavitt <jonathan.cavitt@intel.com>
To: intel-xe@lists.freedesktop.org
Cc: saurabhg.gupta@intel.com, alex.zuo@intel.com, jonathan.cavitt@intel.com,
    joonas.lahtinen@linux.intel.com, tvrtko.ursulin@igalia.com,
    lucas.demarchi@intel.com, matthew.brost@intel.com,
    dri-devel@lists.freedesktop.org, simona.vetter@ffwll.ch
Subject: [PATCH v4 4/6] drm/xe/xe_drm_client: Add per drm client reset stats
Date: Wed, 19 Feb 2025 20:28:44 +0000
Message-ID: <20250219202847.127425-5-jonathan.cavitt@intel.com>
In-Reply-To: <20250219202847.127425-1-jonathan.cavitt@intel.com>

Add a counter to xe_drm_client that tracks the number of times the
engine has been reset since the drm client was created.
Signed-off-by: Jonathan Cavitt <jonathan.cavitt@intel.com>
---
 drivers/gpu/drm/xe/xe_drm_client.h | 2 ++
 drivers/gpu/drm/xe/xe_guc_submit.c | 4 +++-
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/xe/xe_drm_client.h b/drivers/gpu/drm/xe/xe_drm_client.h
index b3d9b279d55f..6579c4b60ae7 100644
--- a/drivers/gpu/drm/xe/xe_drm_client.h
+++ b/drivers/gpu/drm/xe/xe_drm_client.h
@@ -59,6 +59,8 @@ struct xe_drm_client {
 	 * @blame_len: length of @blame_list
 	 */
 	unsigned int blame_len;
+	/** @reset_count: number of times this drm client has seen an engine reset */
+	atomic_t reset_count;
 #endif
 };
 
diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
index 92de926bd505..5d899de3dd83 100644
--- a/drivers/gpu/drm/xe/xe_guc_submit.c
+++ b/drivers/gpu/drm/xe/xe_guc_submit.c
@@ -1988,7 +1988,9 @@ int xe_guc_exec_queue_reset_handler(struct xe_guc *guc, u32 *msg, u32 len)
 		return -EPROTO;
 
 	hwe = q->hwe;
-
+#ifdef CONFIG_PROC_FS
+	atomic_inc(&q->xef->client->reset_count);
+#endif
 	xe_gt_info(gt, "Engine reset: engine_class=%s, logical_mask: 0x%x, guc_id=%d",
		   xe_hw_engine_class_to_str(q->class), q->logical_mask, guc_id);
 

From patchwork Wed Feb 19 20:28:45 2025
X-Patchwork-Submitter: Jonathan Cavitt
X-Patchwork-Id: 13982865
From: Jonathan Cavitt <jonathan.cavitt@intel.com>
To: intel-xe@lists.freedesktop.org
Cc: saurabhg.gupta@intel.com, alex.zuo@intel.com, jonathan.cavitt@intel.com,
    joonas.lahtinen@linux.intel.com, tvrtko.ursulin@igalia.com,
    lucas.demarchi@intel.com, matthew.brost@intel.com,
    dri-devel@lists.freedesktop.org, simona.vetter@ffwll.ch
Subject: [PATCH v4 5/6] drm/xe/xe_query: Pass drm file to query funcs
Date: Wed, 19 Feb 2025 20:28:45 +0000
Message-ID: <20250219202847.127425-6-jonathan.cavitt@intel.com>
In-Reply-To: <20250219202847.127425-1-jonathan.cavitt@intel.com>
References: <20250219202847.127425-1-jonathan.cavitt@intel.com>

Pass the drm file to the query funcs in xe_query.c.  This will be
necessary for a future query.

Signed-off-by: Jonathan Cavitt
---
 drivers/gpu/drm/xe/xe_query.c | 39 ++++++++++++++++++++++++-----------
 1 file changed, 27 insertions(+), 12 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_query.c b/drivers/gpu/drm/xe/xe_query.c
index 042f87a688e7..3aad4737bfec 100644
--- a/drivers/gpu/drm/xe/xe_query.c
+++ b/drivers/gpu/drm/xe/xe_query.c
@@ -110,7 +110,8 @@ hwe_read_timestamp(struct xe_hw_engine *hwe, u64 *engine_ts, u64 *cpu_ts,
 
 static int
 query_engine_cycles(struct xe_device *xe,
-		    struct drm_xe_device_query *query)
+		    struct drm_xe_device_query *query,
+		    struct drm_file *file)
 {
 	struct drm_xe_query_engine_cycles __user *query_ptr;
 	struct drm_xe_engine_class_instance *eci;
@@ -179,7 +180,8 @@ query_engine_cycles(struct xe_device *xe,
 }
 
 static int query_engines(struct xe_device *xe,
-			 struct drm_xe_device_query *query)
+			 struct drm_xe_device_query *query,
+			 struct drm_file *file)
 {
 	size_t size = calc_hw_engine_info_size(xe);
 	struct drm_xe_query_engines __user *query_ptr =
@@ -240,7 +242,8 @@ static size_t calc_mem_regions_size(struct xe_device *xe)
 }
 
 static int query_mem_regions(struct xe_device *xe,
-			     struct drm_xe_device_query *query)
+			     struct drm_xe_device_query *query,
+			     struct drm_file *file)
 {
 	size_t size = calc_mem_regions_size(xe);
 	struct drm_xe_query_mem_regions *mem_regions;
@@ -310,7 +313,9 @@ static int query_mem_regions(struct xe_device *xe,
 	return ret;
 }
 
-static int query_config(struct xe_device *xe, struct drm_xe_device_query *query)
+static int query_config(struct xe_device *xe,
+			struct drm_xe_device_query *query,
+			struct drm_file *file)
 {
 	const u32 num_params = DRM_XE_QUERY_CONFIG_MAX_EXEC_QUEUE_PRIORITY + 1;
 	size_t size =
@@ -351,7 +356,9 @@ static int query_config(struct xe_device *xe, struct drm_xe_device_query *query)
 	return 0;
 }
 
-static int query_gt_list(struct xe_device *xe, struct drm_xe_device_query *query)
+static int query_gt_list(struct xe_device *xe,
+			 struct drm_xe_device_query *query,
+			 struct drm_file *file)
 {
 	struct xe_gt *gt;
 	size_t size = sizeof(struct drm_xe_query_gt_list) +
@@ -422,7 +429,8 @@ static int query_gt_list(struct xe_device *xe, struct drm_xe_device_query *query
 }
 
 static int query_hwconfig(struct xe_device *xe,
-			  struct drm_xe_device_query *query)
+			  struct drm_xe_device_query *query,
+			  struct drm_file *file)
 {
 	struct xe_gt *gt = xe_root_mmio_gt(xe);
 	size_t size = xe_guc_hwconfig_size(&gt->uc.guc);
@@ -490,7 +498,8 @@ static int copy_mask(void __user **ptr,
 }
 
 static int query_gt_topology(struct xe_device *xe,
-			     struct drm_xe_device_query *query)
+			     struct drm_xe_device_query *query,
+			     struct drm_file *file)
 {
 	void __user *query_ptr = u64_to_user_ptr(query->data);
 	size_t size = calc_topo_query_size(xe);
@@ -549,7 +558,9 @@ static int query_gt_topology(struct xe_device *xe,
 }
 
 static int
-query_uc_fw_version(struct xe_device *xe, struct drm_xe_device_query *query)
+query_uc_fw_version(struct xe_device *xe,
+		    struct drm_xe_device_query *query,
+		    struct drm_file *file)
 {
 	struct drm_xe_query_uc_fw_version __user *query_ptr = u64_to_user_ptr(query->data);
 	size_t size = sizeof(struct drm_xe_query_uc_fw_version);
@@ -639,7 +650,8 @@ static size_t calc_oa_unit_query_size(struct xe_device *xe)
 }
 
 static int query_oa_units(struct xe_device *xe,
-			  struct drm_xe_device_query *query)
+			  struct drm_xe_device_query *query,
+			  struct drm_file *file)
 {
 	void __user *query_ptr = u64_to_user_ptr(query->data);
 	size_t size = calc_oa_unit_query_size(xe);
@@ -699,7 +711,9 @@ static int query_oa_units(struct xe_device *xe,
 	return ret ? -EFAULT : 0;
 }
 
-static int query_pxp_status(struct xe_device *xe, struct drm_xe_device_query *query)
+static int query_pxp_status(struct xe_device *xe,
+			    struct drm_xe_device_query *query,
+			    struct drm_file *file)
 {
 	struct drm_xe_query_pxp_status __user *query_ptr = u64_to_user_ptr(query->data);
 	size_t size = sizeof(struct drm_xe_query_pxp_status);
@@ -727,7 +741,8 @@ static int query_pxp_status(struct xe_device *xe, struct drm_xe_device_query *qu
 }
 
 static int (* const xe_query_funcs[])(struct xe_device *xe,
-				      struct drm_xe_device_query *query) = {
+				      struct drm_xe_device_query *query,
+				      struct drm_file *file) = {
 	query_engines,
 	query_mem_regions,
 	query_config,
@@ -757,5 +772,5 @@ int xe_query_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
 	if (XE_IOCTL_DBG(xe, !xe_query_funcs[idx]))
 		return -EINVAL;
 
-	return xe_query_funcs[idx](xe, query);
+	return xe_query_funcs[idx](xe, query, file);
 }

From patchwork Wed Feb 19 20:28:46 2025
X-Patchwork-Submitter: Jonathan Cavitt
X-Patchwork-Id: 13982867
From: Jonathan Cavitt
To: intel-xe@lists.freedesktop.org
Cc: saurabhg.gupta@intel.com, alex.zuo@intel.com, jonathan.cavitt@intel.com, joonas.lahtinen@linux.intel.com, tvrtko.ursulin@igalia.com, lucas.demarchi@intel.com, matthew.brost@intel.com, dri-devel@lists.freedesktop.org, simona.vetter@ffwll.ch
Subject: [PATCH v4 6/6] drm/xe/xe_query: Add support for per-drm-client reset stat querying
Date: Wed, 19 Feb 2025 20:28:46 +0000
Message-ID:
 <20250219202847.127425-7-jonathan.cavitt@intel.com>
In-Reply-To: <20250219202847.127425-1-jonathan.cavitt@intel.com>
References: <20250219202847.127425-1-jonathan.cavitt@intel.com>

Add support for userspace to query per drm client reset stats via the
query ioctl.  This includes the number of engine resets the drm client
has observed, as well as a list of up to the last 50 relevant exec
queue bans and their associated causal pagefaults (if they exist).

v2: Report EOPNOTSUPP if CONFIG_PROC_FS is not set in the kernel
config, as it is required to trace the reset count and exec queue bans.

Signed-off-by: Jonathan Cavitt
---
 drivers/gpu/drm/xe/xe_query.c | 70 +++++++++++++++++++++++++++++++++++
 include/uapi/drm/xe_drm.h     | 50 +++++++++++++++++++++++++
 2 files changed, 120 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_query.c b/drivers/gpu/drm/xe/xe_query.c
index 3aad4737bfec..671bc4270b93 100644
--- a/drivers/gpu/drm/xe/xe_query.c
+++ b/drivers/gpu/drm/xe/xe_query.c
@@ -16,10 +16,12 @@
 #include "regs/xe_gt_regs.h"
 #include "xe_bo.h"
 #include "xe_device.h"
+#include "xe_drm_client.h"
 #include "xe_exec_queue.h"
 #include "xe_force_wake.h"
 #include "xe_ggtt.h"
 #include "xe_gt.h"
+#include "xe_gt_pagefault.h"
 #include "xe_guc_hwconfig.h"
 #include "xe_macros.h"
 #include "xe_mmio.h"
@@ -740,6 +742,73 @@ static int query_pxp_status(struct xe_device *xe,
 	return 0;
 }
 
+static size_t calc_reset_stats_size(struct xe_drm_client *client)
+{
+	size_t size = sizeof(struct drm_xe_query_reset_stats);
+#ifdef CONFIG_PROC_FS
+	spin_lock(&client->blame_lock);
+	size += sizeof(struct drm_xe_exec_queue_ban) * client->blame_len;
+	spin_unlock(&client->blame_lock);
+#endif
+	return size;
+}
+
+static int query_reset_stats(struct xe_device *xe,
+			     struct drm_xe_device_query *query,
+			     struct drm_file *file)
+{
+	void __user *query_ptr = u64_to_user_ptr(query->data);
+	struct drm_xe_query_reset_stats resp;
+	struct xe_file *xef = to_xe_file(file);
+	struct xe_drm_client *client = xef->client;
+	struct blame *b;
+	size_t size = calc_reset_stats_size(client);
+	int i = 0;
+
+#ifdef CONFIG_PROC_FS
+	if (query->size == 0) {
+		query->size = size;
+		return 0;
+	} else if (XE_IOCTL_DBG(xe, query->size != size)) {
+		return -EINVAL;
+	}
+
+	if (copy_from_user(&resp, query_ptr, size))
+		return -EFAULT;
+
+	resp.reset_count = atomic_read(&client->reset_count);
+
+	spin_lock(&client->blame_lock);
+	resp.ban_count = client->blame_len;
+	list_for_each_entry(b, &client->blame_list, list) {
+		struct drm_xe_exec_queue_ban *ban = &resp.ban_list[i++];
+		struct pagefault *pf = b->pf;
+
+		ban->exec_queue_id = b->exec_queue_id;
+		ban->pf_found = pf ? 1 : 0;
+		if (!pf)
+			continue;
+
+		ban->access_type = pf->access_type;
+		ban->fault_type = pf->fault_type;
+		ban->vfid = pf->vfid;
+		ban->asid = pf->asid;
+		ban->pdata = pf->pdata;
+		ban->engine_class = xe_to_user_engine_class[pf->engine_class];
+		ban->engine_instance = pf->engine_instance;
+		ban->fault_addr = pf->page_addr;
+	}
+	spin_unlock(&client->blame_lock);
+
+	if (copy_to_user(query_ptr, &resp, size))
+		return -EFAULT;
+
+	return 0;
+#else
+	return -EOPNOTSUPP;
+#endif
+}
+
 static int (* const xe_query_funcs[])(struct xe_device *xe,
 				      struct drm_xe_device_query *query,
 				      struct drm_file *file) = {
@@ -753,6 +822,7 @@ static int (* const xe_query_funcs[])(struct xe_device *xe,
 	query_uc_fw_version,
 	query_oa_units,
 	query_pxp_status,
+	query_reset_stats,
 };
 
 int xe_query_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
diff --git a/include/uapi/drm/xe_drm.h b/include/uapi/drm/xe_drm.h
index 892f54d3aa09..ffeb2a79e084 100644
--- a/include/uapi/drm/xe_drm.h
+++ b/include/uapi/drm/xe_drm.h
@@ -682,6 +682,7 @@ struct drm_xe_query_pxp_status {
  *  - %DRM_XE_DEVICE_QUERY_GT_TOPOLOGY
  *  - %DRM_XE_DEVICE_QUERY_ENGINE_CYCLES
  *  - %DRM_XE_DEVICE_QUERY_PXP_STATUS
+ *  - %DRM_XE_DEVICE_QUERY_RESET_STATS
  *
  * If size is set to 0, the driver fills it with the required size for
  * the requested type of data to query. If size is equal to the required
@@ -735,6 +736,7 @@ struct drm_xe_device_query {
 #define DRM_XE_DEVICE_QUERY_UC_FW_VERSION	7
 #define DRM_XE_DEVICE_QUERY_OA_UNITS		8
 #define DRM_XE_DEVICE_QUERY_PXP_STATUS		9
+#define DRM_XE_DEVICE_QUERY_RESET_STATS		10
 	/** @query: The type of data to query */
 	__u32 query;
 
@@ -1845,6 +1847,54 @@ enum drm_xe_pxp_session_type {
 	DRM_XE_PXP_TYPE_HWDRM = 1,
 };
 
+/**
+ * struct drm_xe_exec_queue_ban - Per drm client exec queue ban info returned
+ * from @DRM_XE_DEVICE_QUERY_RESET_STATS query. Includes the exec queue ID and
+ * all associated pagefault information, if relevant.
+ */
+struct drm_xe_exec_queue_ban {
+	/** @exec_queue_id: ID of banned exec queue */
+	__u32 exec_queue_id;
+	/**
+	 * @pf_found: whether or not the ban is associated with a pagefault.
+	 * If not, all pagefault data will default to 0 and will not be relevant.
+	 */
+	__u8 pf_found;
+	/** @access_type: access type of associated pagefault */
+	__u8 access_type;
+	/** @fault_type: fault type of associated pagefault */
+	__u8 fault_type;
+	/** @vfid: VFID of associated pagefault */
+	__u8 vfid;
+	/** @asid: ASID of associated pagefault */
+	__u32 asid;
+	/** @pdata: PDATA of associated pagefault */
+	__u16 pdata;
+	/** @engine_class: engine class of associated pagefault */
+	__u8 engine_class;
+	/** @engine_instance: engine instance of associated pagefault */
+	__u8 engine_instance;
+	/** @fault_addr: faulted address of associated pagefault */
+	__u64 fault_addr;
+};
+
+/**
+ * struct drm_xe_query_reset_stats - Per drm client reset stats query.
+ */
+struct drm_xe_query_reset_stats {
+	/** @extensions: Pointer to the first extension struct, if any */
+	__u64 extensions;
+	/** @reset_count: Number of times the drm client has observed an engine reset */
+	__u64 reset_count;
+	/** @ban_count: number of exec queue bans saved by the drm client */
+	__u64 ban_count;
+	/**
+	 * @ban_list: flexible array of struct drm_xe_exec_queue_ban, reporting all
+	 * observed exec queue bans on the drm client.
+	 */
+	struct drm_xe_exec_queue_ban ban_list[];
+};
+
 /* ID of the protected content session managed by Xe when PXP is active */
 #define DRM_XE_PXP_HWDRM_DEFAULT_SESSION 0xf