diff mbox

drm/i915: Include GuC fw version in error state

Message ID 20170223231137.7682-1-michel.thierry@intel.com (mailing list archive)
State New, archived
Headers show

Commit Message

Michel Thierry Feb. 23, 2017, 11:11 p.m. UTC
There was no way to check if the platform is running the latest firmware.

Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Arkadiusz Hiler <arkadiusz.hiler@intel.com>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_gpu_error.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

Comments

sagar.a.kamble@intel.com Feb. 24, 2017, 3:43 a.m. UTC | #1
Reviewed-by: Sagar Arun Kamble <sagar.a.kamble@intel.com>

On 2/24/2017 4:41 AM, Michel Thierry wrote:
> There was no way to check if the platform is running the latest firmware.
>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: Arkadiusz Hiler <arkadiusz.hiler@intel.com>
> Signed-off-by: Michel Thierry <michel.thierry@intel.com>
> ---
>   drivers/gpu/drm/i915/i915_gpu_error.c | 10 ++++++++++
>   1 file changed, 10 insertions(+)
>
> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
> index 2b1d15668192..e022187916ee 100644
> --- a/drivers/gpu/drm/i915/i915_gpu_error.c
> +++ b/drivers/gpu/drm/i915/i915_gpu_error.c
> @@ -632,6 +632,16 @@ int i915_error_state_to_str(struct drm_i915_error_state_buf *m,
>   			   CSR_VERSION_MINOR(csr->version));
>   	}
>   
> +	if (HAS_GUC_UCODE(dev_priv)) {
> +		struct intel_uc_fw *guc_fw = &dev_priv->guc.fw;
> +
> +		err_printf(m, "GuC loaded: %s\n",
> +			   yesno(guc_fw->load_status ==
> +				 INTEL_UC_FIRMWARE_SUCCESS));
> +		err_printf(m, "GuC fw version: %d.%d\n",
> +			   guc_fw->major_ver_found, guc_fw->minor_ver_found);
> +	}
> +
>   	err_printf(m, "EIR: 0x%08x\n", error->eir);
>   	err_printf(m, "IER: 0x%08x\n", error->ier);
>   	for (i = 0; i < error->ngtier; i++)
Chris Wilson Feb. 24, 2017, 9:13 a.m. UTC | #2
On Fri, Feb 24, 2017 at 09:13:05AM +0530, Kamble, Sagar A wrote:
>    Reviewed-by: Sagar Arun Kamble [1]<sagar.a.kamble@intel.com>
> 
>    On 2/24/2017 4:41 AM, Michel Thierry wrote:
> 
>  There was no way to check if the platform is running the latest firmware.
> 
>  Cc: Tvrtko Ursulin [2]<tvrtko.ursulin@intel.com>
>  Cc: Arkadiusz Hiler [3]<arkadiusz.hiler@intel.com>
>  Signed-off-by: Michel Thierry [4]<michel.thierry@intel.com>
>  ---
>   drivers/gpu/drm/i915/i915_gpu_error.c | 10 ++++++++++
>   1 file changed, 10 insertions(+)
> 
>  diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
>  index 2b1d15668192..e022187916ee 100644
>  --- a/drivers/gpu/drm/i915/i915_gpu_error.c
>  +++ b/drivers/gpu/drm/i915/i915_gpu_error.c
>  @@ -632,6 +632,16 @@ int i915_error_state_to_str(struct drm_i915_error_state_buf *m,
>                             CSR_VERSION_MINOR(csr->version));
>          }
> 
>  +       if (HAS_GUC_UCODE(dev_priv)) {
>  +               struct intel_uc_fw *guc_fw = &dev_priv->guc.fw;
>  +
>  +               err_printf(m, "GuC loaded: %s\n",
>  +                          yesno(guc_fw->load_status ==
>  +                                INTEL_UC_FIRMWARE_SUCCESS));
>  +               err_printf(m, "GuC fw version: %d.%d\n",
>  +                          guc_fw->major_ver_found, guc_fw->minor_ver_found);
>  +       }
>  +

Hmm. The firmware may change between the hang and cat
/sys/class/drm/card0/error (as it will be reloaded after the reset).
-Chris
Michal Wajdeczko Feb. 24, 2017, 10:40 a.m. UTC | #3
On Thu, Feb 23, 2017 at 03:11:37PM -0800, Michel Thierry wrote:
> There was no way to check if the platform is running the latest firmware.

Can we also add similar patch for the HuC ?

> 
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: Arkadiusz Hiler <arkadiusz.hiler@intel.com>
> Signed-off-by: Michel Thierry <michel.thierry@intel.com>
> ---
>  drivers/gpu/drm/i915/i915_gpu_error.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
> index 2b1d15668192..e022187916ee 100644
> --- a/drivers/gpu/drm/i915/i915_gpu_error.c
> +++ b/drivers/gpu/drm/i915/i915_gpu_error.c
> @@ -632,6 +632,16 @@ int i915_error_state_to_str(struct drm_i915_error_state_buf *m,
>  			   CSR_VERSION_MINOR(csr->version));
>  	}
>  
> +	if (HAS_GUC_UCODE(dev_priv)) {
> +		struct intel_uc_fw *guc_fw = &dev_priv->guc.fw;

I would preffer to use HAS_GUC and intel_guc* here.


> +
> +		err_printf(m, "GuC loaded: %s\n",
> +			   yesno(guc_fw->load_status ==
> +				 INTEL_UC_FIRMWARE_SUCCESS));

Hmm, as we do have more detailed load status, why limiting it to yes/no only?


-Michal  

> +		err_printf(m, "GuC fw version: %d.%d\n",
> +			   guc_fw->major_ver_found, guc_fw->minor_ver_found);
> +	}
> +
>  	err_printf(m, "EIR: 0x%08x\n", error->eir);
>  	err_printf(m, "IER: 0x%08x\n", error->ier);
>  	for (i = 0; i < error->ngtier; i++)
> -- 
> 2.11.0
> 
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gfx
Michal Wajdeczko Feb. 24, 2017, 10:43 a.m. UTC | #4
On Fri, Feb 24, 2017 at 09:13:29AM +0000, Chris Wilson wrote:
> On Fri, Feb 24, 2017 at 09:13:05AM +0530, Kamble, Sagar A wrote:
> >    Reviewed-by: Sagar Arun Kamble [1]<sagar.a.kamble@intel.com>
> > 
> >    On 2/24/2017 4:41 AM, Michel Thierry wrote:
> > 
> >  There was no way to check if the platform is running the latest firmware.
> > 
> >  Cc: Tvrtko Ursulin [2]<tvrtko.ursulin@intel.com>
> >  Cc: Arkadiusz Hiler [3]<arkadiusz.hiler@intel.com>
> >  Signed-off-by: Michel Thierry [4]<michel.thierry@intel.com>
> >  ---
> >   drivers/gpu/drm/i915/i915_gpu_error.c | 10 ++++++++++
> >   1 file changed, 10 insertions(+)
> > 
> >  diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
> >  index 2b1d15668192..e022187916ee 100644
> >  --- a/drivers/gpu/drm/i915/i915_gpu_error.c
> >  +++ b/drivers/gpu/drm/i915/i915_gpu_error.c
> >  @@ -632,6 +632,16 @@ int i915_error_state_to_str(struct drm_i915_error_state_buf *m,
> >                             CSR_VERSION_MINOR(csr->version));
> >          }
> > 
> >  +       if (HAS_GUC_UCODE(dev_priv)) {
> >  +               struct intel_uc_fw *guc_fw = &dev_priv->guc.fw;
> >  +
> >  +               err_printf(m, "GuC loaded: %s\n",
> >  +                          yesno(guc_fw->load_status ==
> >  +                                INTEL_UC_FIRMWARE_SUCCESS));
> >  +               err_printf(m, "GuC fw version: %d.%d\n",
> >  +                          guc_fw->major_ver_found, guc_fw->minor_ver_found);
> >  +       }
> >  +
> 
> Hmm. The firmware may change between the hang and cat
> /sys/class/drm/card0/error (as it will be reloaded after the reset).

Btw, maybe we should add counter that will be incremented on each fw reload
and reported here ?

-Michal
Chris Wilson Feb. 24, 2017, 10:49 a.m. UTC | #5
On Fri, Feb 24, 2017 at 11:43:32AM +0100, Michal Wajdeczko wrote:
> On Fri, Feb 24, 2017 at 09:13:29AM +0000, Chris Wilson wrote:
> > On Fri, Feb 24, 2017 at 09:13:05AM +0530, Kamble, Sagar A wrote:
> > >    Reviewed-by: Sagar Arun Kamble [1]<sagar.a.kamble@intel.com>
> > > 
> > >    On 2/24/2017 4:41 AM, Michel Thierry wrote:
> > > 
> > >  There was no way to check if the platform is running the latest firmware.
> > > 
> > >  Cc: Tvrtko Ursulin [2]<tvrtko.ursulin@intel.com>
> > >  Cc: Arkadiusz Hiler [3]<arkadiusz.hiler@intel.com>
> > >  Signed-off-by: Michel Thierry [4]<michel.thierry@intel.com>
> > >  ---
> > >   drivers/gpu/drm/i915/i915_gpu_error.c | 10 ++++++++++
> > >   1 file changed, 10 insertions(+)
> > > 
> > >  diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
> > >  index 2b1d15668192..e022187916ee 100644
> > >  --- a/drivers/gpu/drm/i915/i915_gpu_error.c
> > >  +++ b/drivers/gpu/drm/i915/i915_gpu_error.c
> > >  @@ -632,6 +632,16 @@ int i915_error_state_to_str(struct drm_i915_error_state_buf *m,
> > >                             CSR_VERSION_MINOR(csr->version));
> > >          }
> > > 
> > >  +       if (HAS_GUC_UCODE(dev_priv)) {
> > >  +               struct intel_uc_fw *guc_fw = &dev_priv->guc.fw;
> > >  +
> > >  +               err_printf(m, "GuC loaded: %s\n",
> > >  +                          yesno(guc_fw->load_status ==
> > >  +                                INTEL_UC_FIRMWARE_SUCCESS));
> > >  +               err_printf(m, "GuC fw version: %d.%d\n",
> > >  +                          guc_fw->major_ver_found, guc_fw->minor_ver_found);
> > >  +       }
> > >  +
> > 
> > Hmm. The firmware may change between the hang and cat
> > /sys/class/drm/card0/error (as it will be reloaded after the reset).
> 
> Btw, maybe we should add counter that will be incremented on each fw reload
> and reported here ?

If it occurs to you that we need it for post-mortem debugging and having
it is worth more than any potential confusion....

I can see the need for knowing what guc/huc/dmc/etc was running at the
time of a hang - I just hope that what was previously running before an
earlier reset doesn't contribute. But that's why we focus on the first
error in a system...
-Chris
sagar.a.kamble@intel.com Feb. 24, 2017, 3:45 p.m. UTC | #6
On 2/24/2017 4:19 PM, Chris Wilson wrote:
> On Fri, Feb 24, 2017 at 11:43:32AM +0100, Michal Wajdeczko wrote:
>> On Fri, Feb 24, 2017 at 09:13:29AM +0000, Chris Wilson wrote:
>>> On Fri, Feb 24, 2017 at 09:13:05AM +0530, Kamble, Sagar A wrote:
>>>>     Reviewed-by: Sagar Arun Kamble [1]<sagar.a.kamble@intel.com>
>>>>
>>>>     On 2/24/2017 4:41 AM, Michel Thierry wrote:
>>>>
>>>>   There was no way to check if the platform is running the latest firmware.
>>>>
>>>>   Cc: Tvrtko Ursulin [2]<tvrtko.ursulin@intel.com>
>>>>   Cc: Arkadiusz Hiler [3]<arkadiusz.hiler@intel.com>
>>>>   Signed-off-by: Michel Thierry [4]<michel.thierry@intel.com>
>>>>   ---
>>>>    drivers/gpu/drm/i915/i915_gpu_error.c | 10 ++++++++++
>>>>    1 file changed, 10 insertions(+)
>>>>
>>>>   diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
>>>>   index 2b1d15668192..e022187916ee 100644
>>>>   --- a/drivers/gpu/drm/i915/i915_gpu_error.c
>>>>   +++ b/drivers/gpu/drm/i915/i915_gpu_error.c
>>>>   @@ -632,6 +632,16 @@ int i915_error_state_to_str(struct drm_i915_error_state_buf *m,
>>>>                              CSR_VERSION_MINOR(csr->version));
>>>>           }
>>>>
>>>>   +       if (HAS_GUC_UCODE(dev_priv)) {
>>>>   +               struct intel_uc_fw *guc_fw = &dev_priv->guc.fw;
>>>>   +
>>>>   +               err_printf(m, "GuC loaded: %s\n",
>>>>   +                          yesno(guc_fw->load_status ==
>>>>   +                                INTEL_UC_FIRMWARE_SUCCESS));
>>>>   +               err_printf(m, "GuC fw version: %d.%d\n",
>>>>   +                          guc_fw->major_ver_found, guc_fw->minor_ver_found);
>>>>   +       }
>>>>   +
>>> Hmm. The firmware may change between the hang and cat
>>> /sys/class/drm/card0/error (as it will be reloaded after the reset).
>> Btw, maybe we should add counter that will be incremented on each fw reload
>> and reported here ?
> If it occurs to you that we need it for post-mortem debugging and having
> it is worth more than any potential confusion....
>
> I can see the need for knowing what guc/huc/dmc/etc was running at the
> time of a hang - I just hope that what was previously running before an
> earlier reset doesn't contribute. But that's why we focus on the first
> error in a system...
> -Chris
>
GT reset count is present already in error state. GuC kernel parameters 
are present and this
change will help us identify which firmware issue was encountered.
So I feel printing ver_found should be enough.
Michel Thierry Feb. 24, 2017, 4:15 p.m. UTC | #7
On 2/24/2017 2:40 AM, Michal Wajdeczko wrote:
> On Thu, Feb 23, 2017 at 03:11:37PM -0800, Michel Thierry wrote:
>> There was no way to check if the platform is running the latest firmware.
>
> Can we also add similar patch for the HuC ?
>

Please don't tell me the HuC can hang the gpu too.

>>
>> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>> Cc: Arkadiusz Hiler <arkadiusz.hiler@intel.com>
>> Signed-off-by: Michel Thierry <michel.thierry@intel.com>
>> ---
>>  drivers/gpu/drm/i915/i915_gpu_error.c | 10 ++++++++++
>>  1 file changed, 10 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
>> index 2b1d15668192..e022187916ee 100644
>> --- a/drivers/gpu/drm/i915/i915_gpu_error.c
>> +++ b/drivers/gpu/drm/i915/i915_gpu_error.c
>> @@ -632,6 +632,16 @@ int i915_error_state_to_str(struct drm_i915_error_state_buf *m,
>>  			   CSR_VERSION_MINOR(csr->version));
>>  	}
>>
>> +	if (HAS_GUC_UCODE(dev_priv)) {
>> +		struct intel_uc_fw *guc_fw = &dev_priv->guc.fw;
>
> I would preffer to use HAS_GUC and intel_guc* here.
>
>
>> +
>> +		err_printf(m, "GuC loaded: %s\n",
>> +			   yesno(guc_fw->load_status ==
>> +				 INTEL_UC_FIRMWARE_SUCCESS));
>
> Hmm, as we do have more detailed load status, why limiting it to yes/no only?
>
>
> -Michal
>
>> +		err_printf(m, "GuC fw version: %d.%d\n",
>> +			   guc_fw->major_ver_found, guc_fw->minor_ver_found);
>> +	}
>> +
>>  	err_printf(m, "EIR: 0x%08x\n", error->eir);
>>  	err_printf(m, "IER: 0x%08x\n", error->ier);
>>  	for (i = 0; i < error->ngtier; i++)
>> --
>> 2.11.0
>>
>> _______________________________________________
>> Intel-gfx mailing list
>> Intel-gfx@lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/intel-gfx
Michel Thierry Feb. 24, 2017, 4:21 p.m. UTC | #8
On 2/24/2017 8:15 AM, Michel Thierry wrote:
>
>
> On 2/24/2017 2:40 AM, Michal Wajdeczko wrote:
>> On Thu, Feb 23, 2017 at 03:11:37PM -0800, Michel Thierry wrote:
>>> There was no way to check if the platform is running the latest
>>> firmware.
>>
>> Can we also add similar patch for the HuC ?
>>
>
> Please don't tell me the HuC can hang the gpu too.
>
>>>
>>> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>> Cc: Arkadiusz Hiler <arkadiusz.hiler@intel.com>
>>> Signed-off-by: Michel Thierry <michel.thierry@intel.com>
>>> ---
>>>  drivers/gpu/drm/i915/i915_gpu_error.c | 10 ++++++++++
>>>  1 file changed, 10 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c
>>> b/drivers/gpu/drm/i915/i915_gpu_error.c
>>> index 2b1d15668192..e022187916ee 100644
>>> --- a/drivers/gpu/drm/i915/i915_gpu_error.c
>>> +++ b/drivers/gpu/drm/i915/i915_gpu_error.c
>>> @@ -632,6 +632,16 @@ int i915_error_state_to_str(struct
>>> drm_i915_error_state_buf *m,
>>>                 CSR_VERSION_MINOR(csr->version));
>>>      }
>>>
>>> +    if (HAS_GUC_UCODE(dev_priv)) {
>>> +        struct intel_uc_fw *guc_fw = &dev_priv->guc.fw;
>>
>> I would preffer to use HAS_GUC and intel_guc* here.
>>
>>
>>> +
>>> +        err_printf(m, "GuC loaded: %s\n",
>>> +               yesno(guc_fw->load_status ==
>>> +                 INTEL_UC_FIRMWARE_SUCCESS));
>>
>> Hmm, as we do have more detailed load status, why limiting it to
>> yes/no only?

Will it help in the post-mortem debug?
My idea was, if the fw didn't load, we can take it completely out of the 
picture.

>> -Michal
>>
>>> +        err_printf(m, "GuC fw version: %d.%d\n",
>>> +               guc_fw->major_ver_found, guc_fw->minor_ver_found);
>>> +    }
>>> +
>>>      err_printf(m, "EIR: 0x%08x\n", error->eir);
>>>      err_printf(m, "IER: 0x%08x\n", error->ier);
>>>      for (i = 0; i < error->ngtier; i++)
>>> --
>>> 2.11.0
>>>
>>> _______________________________________________
>>> Intel-gfx mailing list
>>> Intel-gfx@lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/intel-gfx
Michel Thierry Feb. 24, 2017, 4:30 p.m. UTC | #9
On 2/24/2017 2:49 AM, Chris Wilson wrote:
> On Fri, Feb 24, 2017 at 11:43:32AM +0100, Michal Wajdeczko wrote:
>> On Fri, Feb 24, 2017 at 09:13:29AM +0000, Chris Wilson wrote:
>>> On Fri, Feb 24, 2017 at 09:13:05AM +0530, Kamble, Sagar A wrote:
>>>>    Reviewed-by: Sagar Arun Kamble [1]<sagar.a.kamble@intel.com>
>>>>
>>>>    On 2/24/2017 4:41 AM, Michel Thierry wrote:
>>>>
>>>>  There was no way to check if the platform is running the latest firmware.
>>>>
>>>>  Cc: Tvrtko Ursulin [2]<tvrtko.ursulin@intel.com>
>>>>  Cc: Arkadiusz Hiler [3]<arkadiusz.hiler@intel.com>
>>>>  Signed-off-by: Michel Thierry [4]<michel.thierry@intel.com>
>>>>  ---
>>>>   drivers/gpu/drm/i915/i915_gpu_error.c | 10 ++++++++++
>>>>   1 file changed, 10 insertions(+)
>>>>
>>>>  diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
>>>>  index 2b1d15668192..e022187916ee 100644
>>>>  --- a/drivers/gpu/drm/i915/i915_gpu_error.c
>>>>  +++ b/drivers/gpu/drm/i915/i915_gpu_error.c
>>>>  @@ -632,6 +632,16 @@ int i915_error_state_to_str(struct drm_i915_error_state_buf *m,
>>>>                             CSR_VERSION_MINOR(csr->version));
>>>>          }
>>>>
>>>>  +       if (HAS_GUC_UCODE(dev_priv)) {
>>>>  +               struct intel_uc_fw *guc_fw = &dev_priv->guc.fw;
>>>>  +
>>>>  +               err_printf(m, "GuC loaded: %s\n",
>>>>  +                          yesno(guc_fw->load_status ==
>>>>  +                                INTEL_UC_FIRMWARE_SUCCESS));
>>>>  +               err_printf(m, "GuC fw version: %d.%d\n",
>>>>  +                          guc_fw->major_ver_found, guc_fw->minor_ver_found);
>>>>  +       }
>>>>  +
>>>
>>> Hmm. The firmware may change between the hang and cat
>>> /sys/class/drm/card0/error (as it will be reloaded after the reset).
>>
>> Btw, maybe we should add counter that will be incremented on each fw reload
>> and reported here ?
>
> If it occurs to you that we need it for post-mortem debugging and having
> it is worth more than any potential confusion....
>
> I can see the need for knowing what guc/huc/dmc/etc was running at the
> time of a hang - I just hope that what was previously running before an
> earlier reset doesn't contribute. But that's why we focus on the first
> error in a system...

Can the firmware change?
Last time I checked the filename was hard-coded in the driver. It's true 
that the load process could fail and then the information be incorrect.
Chris Wilson Feb. 24, 2017, 5:15 p.m. UTC | #10
On Fri, Feb 24, 2017 at 08:30:43AM -0800, Michel Thierry wrote:
> On 2/24/2017 2:49 AM, Chris Wilson wrote:
> >On Fri, Feb 24, 2017 at 11:43:32AM +0100, Michal Wajdeczko wrote:
> >>On Fri, Feb 24, 2017 at 09:13:29AM +0000, Chris Wilson wrote:
> >>>On Fri, Feb 24, 2017 at 09:13:05AM +0530, Kamble, Sagar A wrote:
> >>>>   Reviewed-by: Sagar Arun Kamble [1]<sagar.a.kamble@intel.com>
> >>>>
> >>>>   On 2/24/2017 4:41 AM, Michel Thierry wrote:
> >>>>
> >>>> There was no way to check if the platform is running the latest firmware.
> >>>>
> >>>> Cc: Tvrtko Ursulin [2]<tvrtko.ursulin@intel.com>
> >>>> Cc: Arkadiusz Hiler [3]<arkadiusz.hiler@intel.com>
> >>>> Signed-off-by: Michel Thierry [4]<michel.thierry@intel.com>
> >>>> ---
> >>>>  drivers/gpu/drm/i915/i915_gpu_error.c | 10 ++++++++++
> >>>>  1 file changed, 10 insertions(+)
> >>>>
> >>>> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
> >>>> index 2b1d15668192..e022187916ee 100644
> >>>> --- a/drivers/gpu/drm/i915/i915_gpu_error.c
> >>>> +++ b/drivers/gpu/drm/i915/i915_gpu_error.c
> >>>> @@ -632,6 +632,16 @@ int i915_error_state_to_str(struct drm_i915_error_state_buf *m,
> >>>>                            CSR_VERSION_MINOR(csr->version));
> >>>>         }
> >>>>
> >>>> +       if (HAS_GUC_UCODE(dev_priv)) {
> >>>> +               struct intel_uc_fw *guc_fw = &dev_priv->guc.fw;
> >>>> +
> >>>> +               err_printf(m, "GuC loaded: %s\n",
> >>>> +                          yesno(guc_fw->load_status ==
> >>>> +                                INTEL_UC_FIRMWARE_SUCCESS));
> >>>> +               err_printf(m, "GuC fw version: %d.%d\n",
> >>>> +                          guc_fw->major_ver_found, guc_fw->minor_ver_found);
> >>>> +       }
> >>>> +
> >>>
> >>>Hmm. The firmware may change between the hang and cat
> >>>/sys/class/drm/card0/error (as it will be reloaded after the reset).
> >>
> >>Btw, maybe we should add counter that will be incremented on each fw reload
> >>and reported here ?
> >
> >If it occurs to you that we need it for post-mortem debugging and having
> >it is worth more than any potential confusion....
> >
> >I can see the need for knowing what guc/huc/dmc/etc was running at the
> >time of a hang - I just hope that what was previously running before an
> >earlier reset doesn't contribute. But that's why we focus on the first
> >error in a system...
> 
> Can the firmware change?
> Last time I checked the filename was hard-coded in the driver. It's
> true that the load process could fail and then the information be
> incorrect.

Assume it won't be hardcoded for ever (or at least no more than a week)...
And yes, the filesystem state may have changed since the previous load.
-Chris
Michel Thierry Feb. 24, 2017, 5:32 p.m. UTC | #11
On 2/24/2017 9:15 AM, Chris Wilson wrote:
> On Fri, Feb 24, 2017 at 08:30:43AM -0800, Michel Thierry wrote:
>> On 2/24/2017 2:49 AM, Chris Wilson wrote:
>>> On Fri, Feb 24, 2017 at 11:43:32AM +0100, Michal Wajdeczko wrote:
>>>> On Fri, Feb 24, 2017 at 09:13:29AM +0000, Chris Wilson wrote:
>>>>> On Fri, Feb 24, 2017 at 09:13:05AM +0530, Kamble, Sagar A wrote:
>>>>>>   Reviewed-by: Sagar Arun Kamble [1]<sagar.a.kamble@intel.com>
>>>>>>
>>>>>>   On 2/24/2017 4:41 AM, Michel Thierry wrote:
>>>>>>
>>>>>> There was no way to check if the platform is running the latest firmware.
>>>>>>
>>>>>> Cc: Tvrtko Ursulin [2]<tvrtko.ursulin@intel.com>
>>>>>> Cc: Arkadiusz Hiler [3]<arkadiusz.hiler@intel.com>
>>>>>> Signed-off-by: Michel Thierry [4]<michel.thierry@intel.com>
>>>>>> ---
>>>>>>  drivers/gpu/drm/i915/i915_gpu_error.c | 10 ++++++++++
>>>>>>  1 file changed, 10 insertions(+)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
>>>>>> index 2b1d15668192..e022187916ee 100644
>>>>>> --- a/drivers/gpu/drm/i915/i915_gpu_error.c
>>>>>> +++ b/drivers/gpu/drm/i915/i915_gpu_error.c
>>>>>> @@ -632,6 +632,16 @@ int i915_error_state_to_str(struct drm_i915_error_state_buf *m,
>>>>>>                            CSR_VERSION_MINOR(csr->version));
>>>>>>         }
>>>>>>
>>>>>> +       if (HAS_GUC_UCODE(dev_priv)) {
>>>>>> +               struct intel_uc_fw *guc_fw = &dev_priv->guc.fw;
>>>>>> +
>>>>>> +               err_printf(m, "GuC loaded: %s\n",
>>>>>> +                          yesno(guc_fw->load_status ==
>>>>>> +                                INTEL_UC_FIRMWARE_SUCCESS));
>>>>>> +               err_printf(m, "GuC fw version: %d.%d\n",
>>>>>> +                          guc_fw->major_ver_found, guc_fw->minor_ver_found);
>>>>>> +       }
>>>>>> +
>>>>>
>>>>> Hmm. The firmware may change between the hang and cat
>>>>> /sys/class/drm/card0/error (as it will be reloaded after the reset).
>>>>
>>>> Btw, maybe we should add counter that will be incremented on each fw reload
>>>> and reported here ?
>>>
>>> If it occurs to you that we need it for post-mortem debugging and having
>>> it is worth more than any potential confusion....
>>>
>>> I can see the need for knowing what guc/huc/dmc/etc was running at the
>>> time of a hang - I just hope that what was previously running before an
>>> earlier reset doesn't contribute. But that's why we focus on the first
>>> error in a system...
>>
>> Can the firmware change?
>> Last time I checked the filename was hard-coded in the driver. It's
>> true that the load process could fail and then the information be
>> incorrect.
>
> Assume it won't be hardcoded for ever (or at least no more than a week)...
> And yes, the filesystem state may have changed since the previous load.

ok, I'll add an i915_capture_fw_state to collect the information before 
the reset (for dmc/guc/huc).
diff mbox

Patch

diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
index 2b1d15668192..e022187916ee 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -632,6 +632,16 @@  int i915_error_state_to_str(struct drm_i915_error_state_buf *m,
 			   CSR_VERSION_MINOR(csr->version));
 	}
 
+	if (HAS_GUC_UCODE(dev_priv)) {
+		struct intel_uc_fw *guc_fw = &dev_priv->guc.fw;
+
+		err_printf(m, "GuC loaded: %s\n",
+			   yesno(guc_fw->load_status ==
+				 INTEL_UC_FIRMWARE_SUCCESS));
+		err_printf(m, "GuC fw version: %d.%d\n",
+			   guc_fw->major_ver_found, guc_fw->minor_ver_found);
+	}
+
 	err_printf(m, "EIR: 0x%08x\n", error->eir);
 	err_printf(m, "IER: 0x%08x\n", error->ier);
 	for (i = 0; i < error->ngtier; i++)