diff mbox series

drm/i915/guc: Fix missing ecodes

Message ID 20230125004935.1986479-1-John.C.Harrison@Intel.com (mailing list archive)
State New, archived

Commit Message

John Harrison Jan. 25, 2023, 12:49 a.m. UTC
From: John Harrison <John.C.Harrison@Intel.com>

Error captures are tagged with an 'ecode'. This is a pseudo-unique magic
number that is meant to distinguish similar seeming bugs with
different underlying signatures. It is a combination of two ring state
registers. Unfortunately, the register state being used is only valid
in execlist mode. In GuC mode, the register state exists in a separate
list of arbitrary register address/value pairs rather than the named
entry structure. So, search through that list to find the two exciting
registers and copy them over to the structure's named members.

Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Fixes: a6f0f9cf330a ("drm/i915/guc: Plumb GuC-capture into gpu_coredump")
Cc: Alan Previn <alan.previn.teres.alexis@intel.com>
Cc: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Cc: Matt Roper <matthew.d.roper@intel.com>
Cc: Aravind Iddamsetty <aravind.iddamsetty@intel.com>
Cc: Michael Cheng <michael.cheng@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Bruce Chang <yu.bruce.chang@intel.com>
Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Cc: Matthew Auld <matthew.auld@intel.com>
---
 .../gpu/drm/i915/gt/uc/intel_guc_capture.c    | 22 +++++++++++++++++++
 1 file changed, 22 insertions(+)
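
The lookup described in the commit message can be modelled outside the kernel. The following is a minimal user-space sketch, not the driver code: `struct mmio_reg`, `struct engine_dump`, the `DEMO_REG_*` offsets and both function names are hypothetical stand-ins. It also assumes the ecode is formed by XORing the two register values, which the message above does not spell out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the driver's types; not the real i915 definitions. */
struct mmio_reg {
	uint32_t offset;	/* register address */
	uint32_t value;		/* captured register value */
};

struct engine_dump {
	uint32_t ipehr;
	uint32_t instdone;
};

/* Made-up register offsets, for illustration only. */
#define DEMO_REG_IPEHR    0x2068u
#define DEMO_REG_INSTDONE 0x206cu

/*
 * Walk the arbitrary address/value list and copy the two registers of
 * interest into the named members. Entries that match neither offset
 * are skipped; a register absent from the list leaves its member as is.
 */
static void find_ecode_regs(struct engine_dump *ee,
			    const struct mmio_reg *regs, size_t num_regs)
{
	size_t i;

	assert(ee != NULL);

	for (i = 0; i < num_regs; i++) {
		if (regs[i].offset == DEMO_REG_IPEHR)
			ee->ipehr = regs[i].value;
		else if (regs[i].offset == DEMO_REG_INSTDONE)
			ee->instdone = regs[i].value;
	}
}

/* Assumption: the ecode combines the two values with XOR. */
static uint32_t make_ecode(const struct engine_dump *ee)
{
	return ee->ipehr ^ ee->instdone;
}
```

With a list holding 0x11110000 at `DEMO_REG_IPEHR` and 0x00002222 at `DEMO_REG_INSTDONE`, `make_ecode()` returns 0x11112222. If neither register is in the list, both members keep their initial values, which is why a GuC-mode capture could end up without a meaningful ecode before this fix.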

Comments

Alan Previn Jan. 26, 2023, 7:17 p.m. UTC | #1
Firstly, thanks for catching this miss.
Since I only have one trivial nit and one non-blocker ask,
and the non-blocker ask will not impact the patch intent as it merely
tweaks an existing debug message, I believe we have an rb:

Reviewed-by: Alan Previn <alan.previn.teres.alexis@intel.com>

On Tue, 2023-01-24 at 16:49 -0800, Harrison, John C wrote:
> From: John Harrison <John.C.Harrison@Intel.com>
> 
> Error captures are tagged with an 'ecode'. This is a pseudo-unique magic
> number that is meant to distinguish similar seeming bugs with
> different underlying signatures. It is a combination of two ring state
> registers. Unfortunately, the register state being used is only valid
> in execlist mode. In GuC mode, the register state exists in a separate
> list of arbitrary register address/value pairs rather than the named
> entry structure. So, search through that list to find the two exciting
> registers and copy them over to the structure's named members.
> 
> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
> Fixes: a6f0f9cf330a ("drm/i915/guc: Plumb GuC-capture into gpu_coredump")
> Cc: Alan Previn <alan.previn.teres.alexis@intel.com>
> Cc: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
> Cc: Lucas De Marchi <lucas.demarchi@intel.com>
> Cc: Jani Nikula <jani.nikula@linux.intel.com>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
> Cc: Matt Roper <matthew.d.roper@intel.com>
> Cc: Aravind Iddamsetty <aravind.iddamsetty@intel.com>
> Cc: Michael Cheng <michael.cheng@intel.com>
> Cc: Matthew Brost <matthew.brost@intel.com>
> Cc: Bruce Chang <yu.bruce.chang@intel.com>
> Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
> Cc: Matthew Auld <matthew.auld@intel.com>
> ---
>  .../gpu/drm/i915/gt/uc/intel_guc_capture.c    | 22 +++++++++++++++++++
>  1 file changed, 22 insertions(+)
> 
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
> index 1c1b85073b4bd..4e0b06ceed96d 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
> @@ -1571,6 +1571,27 @@ int intel_guc_capture_print_engine_node(struct drm_i915_error_state_buf *ebuf,
>  
>  #endif //CONFIG_DRM_I915_CAPTURE_ERROR
>  
> +static void guc_capture_find_ecode(struct intel_engine_coredump *ee)
> +{
> +       struct gcap_reg_list_info *reginfo;
> +       struct guc_mmio_reg *regs;
> +       i915_reg_t reg_ipehr = RING_IPEHR(0);
> +       i915_reg_t reg_instdone = RING_INSTDONE(0);
> +       int i;
> +
> +       if (!ee->guc_capture_node)
> +               return;
> +
> +       reginfo = ee->guc_capture_node->reginfo + GUC_CAPTURE_LIST_TYPE_ENGINE_INSTANCE;
> +       regs = reginfo->regs;
> +       for (i = 0; i < reginfo->num_regs; i++) {
> +               if (regs[i].offset == reg_ipehr.reg)
> +                       ee->ipehr = regs[i].value;
> +               if (regs[i].offset == reg_instdone.reg)
nit: "else if"?
> +                       ee->instdone.instdone = regs[i].value;
> +       }
> +}
> +
>  void intel_guc_capture_free_node(struct intel_engine_coredump *ee)
>  {
>         if (!ee || !ee->guc_capture_node)
> @@ -1612,6 +1633,7 @@ void intel_guc_capture_get_matching_node(struct intel_gt *gt,
>                         list_del(&n->link);
>                         ee->guc_capture_node = n;
>                         ee->capture = guc->capture;
> +                       guc_capture_find_ecode(ee);
>                         return;
>                 }
>         }

alan: only one non-blocker request:
while we are here, could we update the debug message when we can't find a matching captured node?
Current code:
	drm_dbg(&i915->drm, "GuC capture can't match ee to node\n");
New suggestion:
	drm_dbg(&i915->drm, "GuC capture can't find node for ee-ctx: lrca = 0x%08x | gucid = 0x%08x\n",
		ce->lrc.lrca, ce->guc_id.id);

John Harrison Jan. 28, 2023, 2:28 a.m. UTC | #2
On 1/26/2023 11:17, Teres Alexis, Alan Previn wrote:
> Firstly, thanks for catching this miss.
> Since I only have one trivial nit and one non-blocker ask,
> and the non-blocker ask will not impact the patch intent as it merely
> tweaks an existing debug message, I believe we have an rb:
>
> Reviewed-by: Alan Previn <alan.previn.teres.alexis@intel.com>
>
> On Tue, 2023-01-24 at 16:49 -0800, Harrison, John C wrote:
>> From: John Harrison <John.C.Harrison@Intel.com>
>>
>> Error captures are tagged with an 'ecode'. This is a pseudo-unique magic
>> number that is meant to distinguish similar seeming bugs with
>> different underlying signatures. It is a combination of two ring state
>> registers. Unfortunately, the register state being used is only valid
>> in execlist mode. In GuC mode, the register state exists in a separate
>> list of arbitrary register address/value pairs rather than the named
>> entry structure. So, search through that list to find the two exciting
>> registers and copy them over to the structure's named members.
>>
>> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
>> Fixes: a6f0f9cf330a ("drm/i915/guc: Plumb GuC-capture into gpu_coredump")
>> Cc: Alan Previn <alan.previn.teres.alexis@intel.com>
>> Cc: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
>> Cc: Lucas De Marchi <lucas.demarchi@intel.com>
>> Cc: Jani Nikula <jani.nikula@linux.intel.com>
>> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
>> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
>> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
>> Cc: Matt Roper <matthew.d.roper@intel.com>
>> Cc: Aravind Iddamsetty <aravind.iddamsetty@intel.com>
>> Cc: Michael Cheng <michael.cheng@intel.com>
>> Cc: Matthew Brost <matthew.brost@intel.com>
>> Cc: Bruce Chang <yu.bruce.chang@intel.com>
>> Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
>> Cc: Matthew Auld <matthew.auld@intel.com>
>> ---
>>   .../gpu/drm/i915/gt/uc/intel_guc_capture.c    | 22 +++++++++++++++++++
>>   1 file changed, 22 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
>> index 1c1b85073b4bd..4e0b06ceed96d 100644
>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
>> @@ -1571,6 +1571,27 @@ int intel_guc_capture_print_engine_node(struct drm_i915_error_state_buf *ebuf,
>>   
>>   #endif //CONFIG_DRM_I915_CAPTURE_ERROR
>>   
>> +static void guc_capture_find_ecode(struct intel_engine_coredump *ee)
>> +{
>> +       struct gcap_reg_list_info *reginfo;
>> +       struct guc_mmio_reg *regs;
>> +       i915_reg_t reg_ipehr = RING_IPEHR(0);
>> +       i915_reg_t reg_instdone = RING_INSTDONE(0);
>> +       int i;
>> +
>> +       if (!ee->guc_capture_node)
>> +               return;
>> +
>> +       reginfo = ee->guc_capture_node->reginfo + GUC_CAPTURE_LIST_TYPE_ENGINE_INSTANCE;
>> +       regs = reginfo->regs;
>> +       for (i = 0; i < reginfo->num_regs; i++) {
>> +               if (regs[i].offset == reg_ipehr.reg)
>> +                       ee->ipehr = regs[i].value;
>> +               if (regs[i].offset == reg_instdone.reg)
> nit: "else if"?
>> +                       ee->instdone.instdone = regs[i].value;
>> +       }
>> +}
>> +
>>   void intel_guc_capture_free_node(struct intel_engine_coredump *ee)
>>   {
>>          if (!ee || !ee->guc_capture_node)
>> @@ -1612,6 +1633,7 @@ void intel_guc_capture_get_matching_node(struct intel_gt *gt,
>>                          list_del(&n->link);
>>                          ee->guc_capture_node = n;
>>                          ee->capture = guc->capture;
>> +                       guc_capture_find_ecode(ee);
>>                          return;
>>                  }
>>          }
> alan: only one non-blocker request:
> while we are here, could we update the debug message when we can't find a matching captured node?
> Current code:
> 	drm_dbg(&i915->drm, "GuC capture can't match ee to node\n");
> New suggestion:
> 	drm_dbg(&i915->drm, "GuC capture can't find node for ee-ctx: lrca = 0x%08x | gucid = 0x%08x\n",
> 		ce->lrc.lrca, ce->guc_id.id);
Regarding the search test, there seem to be some incorrect terms in 
there. The if itself is also not the easiest to read with some terms 
across multiple lines and other lines with multiple terms. Breaking it down:
     (n->eng_inst == GUC_ID_TO_ENGINE_INSTANCE(ee->engine->guc_id) &&
     n->eng_class == GUC_ID_TO_ENGINE_CLASS(ee->engine->guc_id) &&
     n->guc_id &&
Why does the GuC id have to be non zero? Zero is a valid id. And even if 
it isn't, comparing to ce->guc_id.id is sufficient to filter out 
anything bad.
     n->guc_id == ce->guc_id.id &&
     (n->lrca & CTX_GTT_ADDRESS_MASK) &&
Again, address zero is not invalid but the next test makes this one 
redundant anyway.
     (n->lrca & CTX_GTT_ADDRESS_MASK) == (ce->lrc.lrca & CTX_GTT_ADDRESS_MASK)) {

Any objection to dropping the !zero tests and reformatting the whole thing?

John.
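
For illustration, the simplified test proposed above (non-zero checks dropped, one term per line) might look roughly like the stand-alone model below. This is a sketch only: `struct capture_node`, `struct capture_key`, `node_matches()` and the `DEMO_GTT_ADDRESS_MASK` value are hypothetical and do not come from the driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mask; the real CTX_GTT_ADDRESS_MASK is defined by the driver. */
#define DEMO_GTT_ADDRESS_MASK 0xfffff000u

/* Hypothetical model of a captured node's identifying fields. */
struct capture_node {
	uint32_t eng_class;
	uint32_t eng_inst;
	uint32_t guc_id;
	uint32_t lrca;
};

/* Hypothetical model of the engine/context being looked up. */
struct capture_key {
	uint32_t eng_class;
	uint32_t eng_inst;
	uint32_t guc_id;
	uint32_t lrca;
};

/*
 * One comparison per line and no separate "non-zero" terms: an id or
 * address of zero is accepted as long as it equals the value searched for.
 */
static bool node_matches(const struct capture_node *n,
			 const struct capture_key *k)
{
	assert(n != NULL && k != NULL);

	return n->eng_inst == k->eng_inst &&
	       n->eng_class == k->eng_class &&
	       n->guc_id == k->guc_id &&
	       (n->lrca & DEMO_GTT_ADDRESS_MASK) ==
	       (k->lrca & DEMO_GTT_ADDRESS_MASK);
}
```

In this model a node with GuC id 0 and a masked address of 0 still matches a key carrying the same values, which the original condition with the extra non-zero terms would have rejected.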

Patch

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
index 1c1b85073b4bd..4e0b06ceed96d 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
@@ -1571,6 +1571,27 @@  int intel_guc_capture_print_engine_node(struct drm_i915_error_state_buf *ebuf,
 
 #endif //CONFIG_DRM_I915_CAPTURE_ERROR
 
+static void guc_capture_find_ecode(struct intel_engine_coredump *ee)
+{
+	struct gcap_reg_list_info *reginfo;
+	struct guc_mmio_reg *regs;
+	i915_reg_t reg_ipehr = RING_IPEHR(0);
+	i915_reg_t reg_instdone = RING_INSTDONE(0);
+	int i;
+
+	if (!ee->guc_capture_node)
+		return;
+
+	reginfo = ee->guc_capture_node->reginfo + GUC_CAPTURE_LIST_TYPE_ENGINE_INSTANCE;
+	regs = reginfo->regs;
+	for (i = 0; i < reginfo->num_regs; i++) {
+		if (regs[i].offset == reg_ipehr.reg)
+			ee->ipehr = regs[i].value;
+		if (regs[i].offset == reg_instdone.reg)
+			ee->instdone.instdone = regs[i].value;
+	}
+}
+
 void intel_guc_capture_free_node(struct intel_engine_coredump *ee)
 {
 	if (!ee || !ee->guc_capture_node)
@@ -1612,6 +1633,7 @@  void intel_guc_capture_get_matching_node(struct intel_gt *gt,
 			list_del(&n->link);
 			ee->guc_capture_node = n;
 			ee->capture = guc->capture;
+			guc_capture_find_ecode(ee);
 			return;
 		}
 	}