diff mbox series

[v2,1/2] drm/i915: Sanitycheck MMIO access early in driver load

Message ID 20230321170936.478631-2-andi.shyti@linux.intel.com (mailing list archive)
State New, archived
Headers show
Series Report MMIO communication problems more clearly | expand

Commit Message

Andi Shyti March 21, 2023, 5:09 p.m. UTC
From: Matt Roper <matthew.d.roper@intel.com>

We occasionally see the PCI device in a non-accessible state at the
point the driver is loaded.  When this happens, all BAR accesses will
read back as 0xFFFFFFFF.  Rather than reading registers and
misinterpreting their (invalid) values, let's specifically check for
0xFFFFFFFF in a register that cannot have that value to see if the
device is accessible.

Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Signed-off-by: Andi Shyti <andi.shyti@linux.intel.com>
---
 drivers/gpu/drm/i915/intel_uncore.c | 35 +++++++++++++++++++++++++++++
 1 file changed, 35 insertions(+)

Comments

Andi Shyti March 21, 2023, 5:15 p.m. UTC | #1
> We occasionally see the PCI device in a non-accessible state at the
> point the driver is loaded.  When this happens, all BAR accesses will
> read back as 0xFFFFFFFF.  Rather than reading registers and
> misinterpreting their (invalid) values, let's specifically check for
> 0xFFFFFFFF in a register that cannot have that value to see if the
> device is accessible.
> 
> Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> Signed-off-by: Andi Shyti <andi.shyti@linux.intel.com>

Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com>

Andi
Matt Roper March 21, 2023, 9:55 p.m. UTC | #2
On Tue, Mar 21, 2023 at 06:09:35PM +0100, Andi Shyti wrote:
> From: Matt Roper <matthew.d.roper@intel.com>
> 
> We occasionally see the PCI device in a non-accessible state at the
> point the driver is loaded.  When this happens, all BAR accesses will
> read back as 0xFFFFFFFF.  Rather than reading registers and
> misinterpreting their (invalid) values, let's specifically check for
> 0xFFFFFFFF in a register that cannot have that value to see if the
> device is accessible.
> 
> Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> Signed-off-by: Andi Shyti <andi.shyti@linux.intel.com>
> ---
>  drivers/gpu/drm/i915/intel_uncore.c | 35 +++++++++++++++++++++++++++++
>  1 file changed, 35 insertions(+)
> 
> diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
> index e1e1f34490c8e..0b69081d6d285 100644
> --- a/drivers/gpu/drm/i915/intel_uncore.c
> +++ b/drivers/gpu/drm/i915/intel_uncore.c
> @@ -2602,11 +2602,46 @@ static int uncore_forcewake_init(struct intel_uncore *uncore)
>  	return 0;
>  }
>  
> +static int sanity_check_mmio_access(struct intel_uncore *uncore)
> +{
> +	struct drm_i915_private *i915 = uncore->i915;
> +	int ret;
> +
> +	if (GRAPHICS_VER(i915) < 8)
> +		return 0;
> +
> +	/*
> +	 * Sanitycheck that MMIO access to the device is working properly.  If
> +	 * the CPU is unable to communcate with a PCI device, BAR reads will
> +	 * return 0xFFFFFFFF.  Let's make sure the device isn't in this state
> +	 * before we start trying to access registers.
> +	 *
> +	 * We use the primary GT's forcewake register as our guinea pig since
> +	 * it's been around since HSW and it's a masked register so the upper
> +	 * 16 bits can never read back as 1's if device access is operating
> +	 * properly.
> +	 *
> +	 * If MMIO isn't working, we'll wait up to 2 seconds to see if it
> +	 * recovers, then give up.
> +	 */
> +	ret = intel_wait_for_register_fw(uncore, FORCEWAKE_MT, 0, 0, 2000000);

It looks like you lost the check for 0xFFFFFFFF specifically.  In fact
with a mask/value of 0, isn't this always going to just always pass
immediately?

We don't know what the value of this register will be (there may or may
not be some bits set), but we need to make sure that it isn't 0xFFFFFFFF
because that means we're not even truly accessing the register, just
hitting a PCI BAR read failure.


Matt

> +	if (ret == -ETIMEDOUT) {
> +		drm_err(&i915->drm, "Device is non-operational; MMIO access returns 0xFFFFFFFF!\n");
> +		return -EIO;
> +	}
> +
> +	return 0;
> +}
> +
>  int intel_uncore_init_mmio(struct intel_uncore *uncore)
>  {
>  	struct drm_i915_private *i915 = uncore->i915;
>  	int ret;
>  
> +	ret = sanity_check_mmio_access(uncore);
> +	if (ret)
> +		return ret;
> +
>  	/*
>  	 * The boot firmware initializes local memory and assesses its health.
>  	 * If memory training fails, the punit will have been instructed to
> -- 
> 2.39.2
>
Andi Shyti March 21, 2023, 10:43 p.m. UTC | #3
Hi Matt,

> > We occasionally see the PCI device in a non-accessible state at the
> > point the driver is loaded.  When this happens, all BAR accesses will
> > read back as 0xFFFFFFFF.  Rather than reading registers and
> > misinterpreting their (invalid) values, let's specifically check for
> > 0xFFFFFFFF in a register that cannot have that value to see if the
> > device is accessible.
> > 
> > Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
> > Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> > Signed-off-by: Andi Shyti <andi.shyti@linux.intel.com>
> > ---
> >  drivers/gpu/drm/i915/intel_uncore.c | 35 +++++++++++++++++++++++++++++
> >  1 file changed, 35 insertions(+)
> > 
> > diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
> > index e1e1f34490c8e..0b69081d6d285 100644
> > --- a/drivers/gpu/drm/i915/intel_uncore.c
> > +++ b/drivers/gpu/drm/i915/intel_uncore.c
> > @@ -2602,11 +2602,46 @@ static int uncore_forcewake_init(struct intel_uncore *uncore)
> >  	return 0;
> >  }
> >  
> > +static int sanity_check_mmio_access(struct intel_uncore *uncore)
> > +{
> > +	struct drm_i915_private *i915 = uncore->i915;
> > +	int ret;
> > +
> > +	if (GRAPHICS_VER(i915) < 8)
> > +		return 0;
> > +
> > +	/*
> > +	 * Sanitycheck that MMIO access to the device is working properly.  If
> > +	 * the CPU is unable to communcate with a PCI device, BAR reads will
> > +	 * return 0xFFFFFFFF.  Let's make sure the device isn't in this state
> > +	 * before we start trying to access registers.
> > +	 *
> > +	 * We use the primary GT's forcewake register as our guinea pig since
> > +	 * it's been around since HSW and it's a masked register so the upper
> > +	 * 16 bits can never read back as 1's if device access is operating
> > +	 * properly.
> > +	 *
> > +	 * If MMIO isn't working, we'll wait up to 2 seconds to see if it
> > +	 * recovers, then give up.
> > +	 */
> > +	ret = intel_wait_for_register_fw(uncore, FORCEWAKE_MT, 0, 0, 2000000);
> 
> It looks like you lost the check for 0xFFFFFFFF specifically.  In fact
> with a mask/value of 0, isn't this always going to just always pass
> immediately?

uh... yes... sorry, I just got confused and lost track of the
goal of the patch.

Sorry, then please ignore... I don't see then how
intel_wait_for_register_fw() can be used with a '!='.

Please, ignore this v2.

Thanks and sorry, again,
Andi

> We don't know what the value of this register will be (there may or may
> not be some bits set), but we need to make sure that it isn't 0xFFFFFFFF
> because that means we're not even truly accessing the register, just
> hitting a PCI BAR read failure.
> 
> 
> Matt
> 
> > +	if (ret == -ETIMEDOUT) {
> > +		drm_err(&i915->drm, "Device is non-operational; MMIO access returns 0xFFFFFFFF!\n");
> > +		return -EIO;
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> >  int intel_uncore_init_mmio(struct intel_uncore *uncore)
> >  {
> >  	struct drm_i915_private *i915 = uncore->i915;
> >  	int ret;
> >  
> > +	ret = sanity_check_mmio_access(uncore);
> > +	if (ret)
> > +		return ret;
> > +
> >  	/*
> >  	 * The boot firmware initializes local memory and assesses its health.
> >  	 * If memory training fails, the punit will have been instructed to
> > -- 
> > 2.39.2
> > 
> 
> -- 
> Matt Roper
> Graphics Software Engineer
> Linux GPU Platform Enablement
> Intel Corporation
Andrzej Hajda March 22, 2023, 8:40 a.m. UTC | #4
On 21.03.2023 23:43, Andi Shyti wrote:
> Hi Matt,
> 
>>> We occasionally see the PCI device in a non-accessible state at the
>>> point the driver is loaded.  When this happens, all BAR accesses will
>>> read back as 0xFFFFFFFF.  Rather than reading registers and
>>> misinterpreting their (invalid) values, let's specifically check for
>>> 0xFFFFFFFF in a register that cannot have that value to see if the
>>> device is accessible.
>>>
>>> Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
>>> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
>>> Signed-off-by: Andi Shyti <andi.shyti@linux.intel.com>
>>> ---
>>>   drivers/gpu/drm/i915/intel_uncore.c | 35 +++++++++++++++++++++++++++++
>>>   1 file changed, 35 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
>>> index e1e1f34490c8e..0b69081d6d285 100644
>>> --- a/drivers/gpu/drm/i915/intel_uncore.c
>>> +++ b/drivers/gpu/drm/i915/intel_uncore.c
>>> @@ -2602,11 +2602,46 @@ static int uncore_forcewake_init(struct intel_uncore *uncore)
>>>   	return 0;
>>>   }
>>>   
>>> +static int sanity_check_mmio_access(struct intel_uncore *uncore)
>>> +{
>>> +	struct drm_i915_private *i915 = uncore->i915;
>>> +	int ret;
>>> +
>>> +	if (GRAPHICS_VER(i915) < 8)
>>> +		return 0;
>>> +
>>> +	/*
>>> +	 * Sanitycheck that MMIO access to the device is working properly.  If
>>> +	 * the CPU is unable to communcate with a PCI device, BAR reads will
>>> +	 * return 0xFFFFFFFF.  Let's make sure the device isn't in this state
>>> +	 * before we start trying to access registers.
>>> +	 *
>>> +	 * We use the primary GT's forcewake register as our guinea pig since
>>> +	 * it's been around since HSW and it's a masked register so the upper
>>> +	 * 16 bits can never read back as 1's if device access is operating
>>> +	 * properly.


Maybe we can just check upper 16bits then. Sth like:
ret = intel_wait_for_register_fw(uncore, FORCEWAKE_MT, GENMASK(31, 16), 
0, 2000000);

Regards
Andrzej

>>> +	 *
>>> +	 * If MMIO isn't working, we'll wait up to 2 seconds to see if it
>>> +	 * recovers, then give up.
>>> +	 */
>>> +	ret = intel_wait_for_register_fw(uncore, FORCEWAKE_MT, 0, 0, 2000000);
>>
>> It looks like you lost the check for 0xFFFFFFFF specifically.  In fact
>> with a mask/value of 0, isn't this always going to just always pass
>> immediately?
> 
> uh... yes... sorry, I just got confused and lost track of the
> goal of the patch.
> 
> Sorry, then please ignore... I don't see then how
> intel_wait_for_register_fw() can be used with a '!='.
> 
> Please, ignore this v2.
> 
> Thanks and sorry, again,
> Andi
> 
>> We don't know what the value of this register will be (there may or may
>> not be some bits set), but we need to make sure that it isn't 0xFFFFFFFF
>> because that means we're not even truly accessing the register, just
>> hitting a PCI BAR read failure.
>>
>>
>> Matt
>>
>>> +	if (ret == -ETIMEDOUT) {
>>> +		drm_err(&i915->drm, "Device is non-operational; MMIO access returns 0xFFFFFFFF!\n");
>>> +		return -EIO;
>>> +	}
>>> +
>>> +	return 0;
>>> +}
>>> +
>>>   int intel_uncore_init_mmio(struct intel_uncore *uncore)
>>>   {
>>>   	struct drm_i915_private *i915 = uncore->i915;
>>>   	int ret;
>>>   
>>> +	ret = sanity_check_mmio_access(uncore);
>>> +	if (ret)
>>> +		return ret;
>>> +
>>>   	/*
>>>   	 * The boot firmware initializes local memory and assesses its health.
>>>   	 * If memory training fails, the punit will have been instructed to
>>> -- 
>>> 2.39.2
>>>
>>
>> -- 
>> Matt Roper
>> Graphics Software Engineer
>> Linux GPU Platform Enablement
>> Intel Corporation
diff mbox series

Patch

diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
index e1e1f34490c8e..0b69081d6d285 100644
--- a/drivers/gpu/drm/i915/intel_uncore.c
+++ b/drivers/gpu/drm/i915/intel_uncore.c
@@ -2602,11 +2602,46 @@  static int uncore_forcewake_init(struct intel_uncore *uncore)
 	return 0;
 }
 
+static int sanity_check_mmio_access(struct intel_uncore *uncore)
+{
+	struct drm_i915_private *i915 = uncore->i915;
+	int ret;
+
+	if (GRAPHICS_VER(i915) < 8)
+		return 0;
+
+	/*
+	 * Sanitycheck that MMIO access to the device is working properly.  If
+	 * the CPU is unable to communcate with a PCI device, BAR reads will
+	 * return 0xFFFFFFFF.  Let's make sure the device isn't in this state
+	 * before we start trying to access registers.
+	 *
+	 * We use the primary GT's forcewake register as our guinea pig since
+	 * it's been around since HSW and it's a masked register so the upper
+	 * 16 bits can never read back as 1's if device access is operating
+	 * properly.
+	 *
+	 * If MMIO isn't working, we'll wait up to 2 seconds to see if it
+	 * recovers, then give up.
+	 */
+	ret = intel_wait_for_register_fw(uncore, FORCEWAKE_MT, 0, 0, 2000000);
+	if (ret == -ETIMEDOUT) {
+		drm_err(&i915->drm, "Device is non-operational; MMIO access returns 0xFFFFFFFF!\n");
+		return -EIO;
+	}
+
+	return 0;
+}
+
 int intel_uncore_init_mmio(struct intel_uncore *uncore)
 {
 	struct drm_i915_private *i915 = uncore->i915;
 	int ret;
 
+	ret = sanity_check_mmio_access(uncore);
+	if (ret)
+		return ret;
+
 	/*
 	 * The boot firmware initializes local memory and assesses its health.
 	 * If memory training fails, the punit will have been instructed to