diff mbox series

[1/2] drm/i915: limit double GT reset to pre-MTL

Message ID 20230320211039.1513368-1-daniele.ceraolospurio@intel.com (mailing list archive)
State New, archived
Headers show
Series [1/2] drm/i915: limit double GT reset to pre-MTL | expand

Commit Message

Daniele Ceraolo Spurio March 20, 2023, 9:10 p.m. UTC
Commit 3db9d590557d ("drm/i915/gt: Reset twice") modified the code to
always hit the GDRST register twice when doing a reset, with the
reported aim to fix invalid post-reset engine state on some platforms
(Jasperlake being the only one actually mentioned).

This is a problem on MTL, due to the fact that we have to apply a time
consuming WA (coming in the next patch) every time we hit the GDRST
register in a way that can include the GSC engine. Even post MTL, the
expectation is that we'll have some work to do before and after hitting
the GDRST if the GSC is involved.

Since the issue requiring the double reset seems to be limited to older
platforms, instead of trying to handle the double-reset on MTL and
future platforms it is just easier to turn it off. The default on MTL is
also for GuC to own engine reset, with i915 only covering full-GT reset.

Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Andi Shyti <andi.shyti@linux.intel.com>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Cc: Gwan-gyeong Mun <gwan-gyeong.mun@intel.com>
Cc: John Harrison <John.C.Harrison@Intel.com>
---
 drivers/gpu/drm/i915/gt/intel_reset.c | 35 +++++++++++++++------------
 1 file changed, 20 insertions(+), 15 deletions(-)

Comments

Andi Shyti March 20, 2023, 9:41 p.m. UTC | #1
Hi Daniele,

On Mon, Mar 20, 2023 at 02:10:38PM -0700, Daniele Ceraolo Spurio wrote:
> Commit 3db9d590557d ("drm/i915/gt: Reset twice") modified the code to
> always hit the GDRST register twice when doing a reset, with the
> reported aim to fix invalid post-reset engine state on some platforms
> (Jasperlake being the only one actually mentioned).
> 
> This is a problem on MTL, due to the fact that we have to apply a time
> consuming WA (coming in the next patch) every time we hit the GDRST
> register in a way that can include the GSC engine. Even post MTL, the
> expectation is that we'll have some work to do before and after hitting
> the GDRST if the GSC is involved.
> 
> Since the issue requiring the double reset seems to be limited to older
> platforms, instead of trying to handle the double-reset on MTL and
> future platforms it is just easier to turn it off. The default on MTL is
> also for GuC to own engine reset, with i915 only covering full-GT reset.
> 
> Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Andi Shyti <andi.shyti@linux.intel.com>
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> Cc: Gwan-gyeong Mun <gwan-gyeong.mun@intel.com>
> Cc: John Harrison <John.C.Harrison@Intel.com>

Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com>

Andi
John Harrison March 22, 2023, 7:44 p.m. UTC | #2
On 3/20/2023 14:10, Daniele Ceraolo Spurio wrote:
> Commit 3db9d590557d ("drm/i915/gt: Reset twice") modified the code to
> always hit the GDRST register twice when doing a reset, with the
> reported aim to fix invalid post-reset engine state on some platforms
> (Jasperlake being the only one actually mentioned).
It still concerns me that there are no actual details about this issue 
from a hardware perspective as to what/why it goes wrong, the comment is 
fully of non-definitive language - 'appears to', 'should', 'is still a 
concern that'. And there is no w/a number associated with it. It all 
feels extremely suspect and warrants a great big FIXME tag as a minimum.

John.


>
> This is a problem on MTL, due to the fact that we have to apply a time
> consuming WA (coming in the next patch) every time we hit the GDRST
> register in a way that can include the GSC engine. Even post MTL, the
> expectation is that we'll have some work to do before and after hitting
> the GDRST if the GSC is involved.
>
> Since the issue requiring the double reset seems to be limited to older
> platforms, instead of trying to handle the double-reset on MTL and
> future platforms it is just easier to turn it off. The default on MTL is
> also for GuC to own engine reset, with i915 only covering full-GT reset.
>
> Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Andi Shyti <andi.shyti@linux.intel.com>
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> Cc: Gwan-gyeong Mun <gwan-gyeong.mun@intel.com>
> Cc: John Harrison <John.C.Harrison@Intel.com>
> ---
>   drivers/gpu/drm/i915/gt/intel_reset.c | 35 +++++++++++++++------------
>   1 file changed, 20 insertions(+), 15 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c b/drivers/gpu/drm/i915/gt/intel_reset.c
> index 0bb9094fdacd..2c3463f77e5c 100644
> --- a/drivers/gpu/drm/i915/gt/intel_reset.c
> +++ b/drivers/gpu/drm/i915/gt/intel_reset.c
> @@ -268,9 +268,27 @@ static int ilk_do_reset(struct intel_gt *gt, intel_engine_mask_t engine_mask,
>   static int gen6_hw_domain_reset(struct intel_gt *gt, u32 hw_domain_mask)
>   {
>   	struct intel_uncore *uncore = gt->uncore;
> -	int loops = 2;
> +	int loops;
>   	int err;
>   
> +	/*
> +	 * On some platforms, e.g. Jasperlake, we see that the engine register
> +	 * state is not cleared until shortly after GDRST reports completion,
> +	 * causing a failure as we try to immediately resume while the internal
> +	 * state is still in flux. If we immediately repeat the reset, the
> +	 * second reset appears to serialise with the first, and since it is a
> +	 * no-op, the registers should retain their reset value. However, there
> +	 * is still a concern that upon leaving the second reset, the internal
> +	 * engine state is still in flux and not ready for resuming.
> +	 *
> +	 * Starting on MTL, there are some prep steps that we need to do when
> +	 * resetting some engines that need to be applied every time we write to
> +	 * GEN6_GDRST. As those are time consuming (tens of ms), we don't want
> +	 * to perform that twice, so, since the Jasperlake issue hasn't been
> +	 * observed on MTL, we avoid repeating the reset on newer platforms.
> +	 */
> +	loops = GRAPHICS_VER_FULL(gt->i915) < IP_VER(12, 70) ? 2 : 1;
> +
>   	/*
>   	 * GEN6_GDRST is not in the gt power well, no need to check
>   	 * for fifo space for the write or forcewake the chip for
> @@ -279,20 +297,7 @@ static int gen6_hw_domain_reset(struct intel_gt *gt, u32 hw_domain_mask)
>   	do {
>   		intel_uncore_write_fw(uncore, GEN6_GDRST, hw_domain_mask);
>   
> -		/*
> -		 * Wait for the device to ack the reset requests.
> -		 *
> -		 * On some platforms, e.g. Jasperlake, we see that the
> -		 * engine register state is not cleared until shortly after
> -		 * GDRST reports completion, causing a failure as we try
> -		 * to immediately resume while the internal state is still
> -		 * in flux. If we immediately repeat the reset, the second
> -		 * reset appears to serialise with the first, and since
> -		 * it is a no-op, the registers should retain their reset
> -		 * value. However, there is still a concern that upon
> -		 * leaving the second reset, the internal engine state
> -		 * is still in flux and not ready for resuming.
> -		 */
> +		/* Wait for the device to ack the reset requests. */
>   		err = __intel_wait_for_register_fw(uncore, GEN6_GDRST,
>   						   hw_domain_mask, 0,
>   						   2000, 0,
Daniele Ceraolo Spurio March 22, 2023, 9:03 p.m. UTC | #3
On 3/22/2023 12:44 PM, John Harrison wrote:
> On 3/20/2023 14:10, Daniele Ceraolo Spurio wrote:
>> Commit 3db9d590557d ("drm/i915/gt: Reset twice") modified the code to
>> always hit the GDRST register twice when doing a reset, with the
>> reported aim to fix invalid post-reset engine state on some platforms
>> (Jasperlake being the only one actually mentioned).
> It still concerns me that there are no actual details about this issue 
> from a hardware perspective as to what/why it goes wrong, the comment 
> is fully of non-definitive language - 'appears to', 'should', 'is 
> still a concern that'. And there is no w/a number associated with it. 
> It all feels extremely suspect and warrants a great big FIXME tag as a 
> minimum.

I agree that the whole thing is unclear and we could add a FIXME, but 
IMO that is outside the scope of this patch as I'm not adding the code 
in question. This should be discussed with the original author/reviewers.

Daniele

>
> John.
>
>
>>
>> This is a problem on MTL, due to the fact that we have to apply a time
>> consuming WA (coming in the next patch) every time we hit the GDRST
>> register in a way that can include the GSC engine. Even post MTL, the
>> expectation is that we'll have some work to do before and after hitting
>> the GDRST if the GSC is involved.
>>
>> Since the issue requiring the double reset seems to be limited to older
>> platforms, instead of trying to handle the double-reset on MTL and
>> future platforms it is just easier to turn it off. The default on MTL is
>> also for GuC to own engine reset, with i915 only covering full-GT reset.
>>
>> Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>> Cc: Andi Shyti <andi.shyti@linux.intel.com>
>> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
>> Cc: Gwan-gyeong Mun <gwan-gyeong.mun@intel.com>
>> Cc: John Harrison <John.C.Harrison@Intel.com>
>> ---
>>   drivers/gpu/drm/i915/gt/intel_reset.c | 35 +++++++++++++++------------
>>   1 file changed, 20 insertions(+), 15 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c 
>> b/drivers/gpu/drm/i915/gt/intel_reset.c
>> index 0bb9094fdacd..2c3463f77e5c 100644
>> --- a/drivers/gpu/drm/i915/gt/intel_reset.c
>> +++ b/drivers/gpu/drm/i915/gt/intel_reset.c
>> @@ -268,9 +268,27 @@ static int ilk_do_reset(struct intel_gt *gt, 
>> intel_engine_mask_t engine_mask,
>>   static int gen6_hw_domain_reset(struct intel_gt *gt, u32 
>> hw_domain_mask)
>>   {
>>       struct intel_uncore *uncore = gt->uncore;
>> -    int loops = 2;
>> +    int loops;
>>       int err;
>>   +    /*
>> +     * On some platforms, e.g. Jasperlake, we see that the engine 
>> register
>> +     * state is not cleared until shortly after GDRST reports 
>> completion,
>> +     * causing a failure as we try to immediately resume while the 
>> internal
>> +     * state is still in flux. If we immediately repeat the reset, the
>> +     * second reset appears to serialise with the first, and since 
>> it is a
>> +     * no-op, the registers should retain their reset value. 
>> However, there
>> +     * is still a concern that upon leaving the second reset, the 
>> internal
>> +     * engine state is still in flux and not ready for resuming.
>> +     *
>> +     * Starting on MTL, there are some prep steps that we need to do 
>> when
>> +     * resetting some engines that need to be applied every time we 
>> write to
>> +     * GEN6_GDRST. As those are time consuming (tens of ms), we 
>> don't want
>> +     * to perform that twice, so, since the Jasperlake issue hasn't 
>> been
>> +     * observed on MTL, we avoid repeating the reset on newer 
>> platforms.
>> +     */
>> +    loops = GRAPHICS_VER_FULL(gt->i915) < IP_VER(12, 70) ? 2 : 1;
>> +
>>       /*
>>        * GEN6_GDRST is not in the gt power well, no need to check
>>        * for fifo space for the write or forcewake the chip for
>> @@ -279,20 +297,7 @@ static int gen6_hw_domain_reset(struct intel_gt 
>> *gt, u32 hw_domain_mask)
>>       do {
>>           intel_uncore_write_fw(uncore, GEN6_GDRST, hw_domain_mask);
>>   -        /*
>> -         * Wait for the device to ack the reset requests.
>> -         *
>> -         * On some platforms, e.g. Jasperlake, we see that the
>> -         * engine register state is not cleared until shortly after
>> -         * GDRST reports completion, causing a failure as we try
>> -         * to immediately resume while the internal state is still
>> -         * in flux. If we immediately repeat the reset, the second
>> -         * reset appears to serialise with the first, and since
>> -         * it is a no-op, the registers should retain their reset
>> -         * value. However, there is still a concern that upon
>> -         * leaving the second reset, the internal engine state
>> -         * is still in flux and not ready for resuming.
>> -         */
>> +        /* Wait for the device to ack the reset requests. */
>>           err = __intel_wait_for_register_fw(uncore, GEN6_GDRST,
>>                              hw_domain_mask, 0,
>>                              2000, 0,
>
Andi Shyti March 23, 2023, 12:16 a.m. UTC | #4
Hi,

> On 3/22/2023 12:44 PM, John Harrison wrote:
> > On 3/20/2023 14:10, Daniele Ceraolo Spurio wrote:
> > > Commit 3db9d590557d ("drm/i915/gt: Reset twice") modified the code to
> > > always hit the GDRST register twice when doing a reset, with the
> > > reported aim to fix invalid post-reset engine state on some platforms
> > > (Jasperlake being the only one actually mentioned).
> >
> > It still concerns me that there are no actual details about this issue
> > from a hardware perspective as to what/why it goes wrong, the comment is
> > fully of non-definitive language - 'appears to', 'should', 'is still a
> > concern that'. And there is no w/a number associated with it. It all
> > feels extremely suspect and warrants a great big FIXME tag as a minimum.
> 
> I agree that the whole thing is unclear and we could add a FIXME, but IMO
> that is outside the scope of this patch as I'm not adding the code in
> question. This should be discussed with the original author/reviewers.

Sorry for chiming in a bit late. I'm with Daniele on this one;
the patch just takes things back to how they were before Chris's
patch, so we should look into the reasoning behind Chris's patch
itself.

As mentioned in the commit log, in Jasperlake (and only
Jasperlake), a second reset attempt is needed to clear the
register state. If I remember right, Chris discovered this
through experimentation, and I don't think any hardware folks
have documented it.

Chris can probably give more details on this.

In general, I'm on board with Daniele's patch since it's doing
what the driver has always done, which is why I gave it a quick
review already in V1.

On the other hand, the price to pay having something like this:

  loops = GRAPHICS_VER_FULL(gt->i915) < IP_VER(12, 70) ? 2 : 1;

Is the following perhaps a bit more self-explanatory?

  /*
   * The big comment with explanation
   */
  if (IS_PLATFORM(i915, INTEL_JASPERLAKE))
  	/* try again */ ;

Andi
diff mbox series

Patch

diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c b/drivers/gpu/drm/i915/gt/intel_reset.c
index 0bb9094fdacd..2c3463f77e5c 100644
--- a/drivers/gpu/drm/i915/gt/intel_reset.c
+++ b/drivers/gpu/drm/i915/gt/intel_reset.c
@@ -268,9 +268,27 @@  static int ilk_do_reset(struct intel_gt *gt, intel_engine_mask_t engine_mask,
 static int gen6_hw_domain_reset(struct intel_gt *gt, u32 hw_domain_mask)
 {
 	struct intel_uncore *uncore = gt->uncore;
-	int loops = 2;
+	int loops;
 	int err;
 
+	/*
+	 * On some platforms, e.g. Jasperlake, we see that the engine register
+	 * state is not cleared until shortly after GDRST reports completion,
+	 * causing a failure as we try to immediately resume while the internal
+	 * state is still in flux. If we immediately repeat the reset, the
+	 * second reset appears to serialise with the first, and since it is a
+	 * no-op, the registers should retain their reset value. However, there
+	 * is still a concern that upon leaving the second reset, the internal
+	 * engine state is still in flux and not ready for resuming.
+	 *
+	 * Starting on MTL, there are some prep steps that we need to do when
+	 * resetting some engines that need to be applied every time we write to
+	 * GEN6_GDRST. As those are time consuming (tens of ms), we don't want
+	 * to perform that twice, so, since the Jasperlake issue hasn't been
+	 * observed on MTL, we avoid repeating the reset on newer platforms.
+	 */
+	loops = GRAPHICS_VER_FULL(gt->i915) < IP_VER(12, 70) ? 2 : 1;
+
 	/*
 	 * GEN6_GDRST is not in the gt power well, no need to check
 	 * for fifo space for the write or forcewake the chip for
@@ -279,20 +297,7 @@  static int gen6_hw_domain_reset(struct intel_gt *gt, u32 hw_domain_mask)
 	do {
 		intel_uncore_write_fw(uncore, GEN6_GDRST, hw_domain_mask);
 
-		/*
-		 * Wait for the device to ack the reset requests.
-		 *
-		 * On some platforms, e.g. Jasperlake, we see that the
-		 * engine register state is not cleared until shortly after
-		 * GDRST reports completion, causing a failure as we try
-		 * to immediately resume while the internal state is still
-		 * in flux. If we immediately repeat the reset, the second
-		 * reset appears to serialise with the first, and since
-		 * it is a no-op, the registers should retain their reset
-		 * value. However, there is still a concern that upon
-		 * leaving the second reset, the internal engine state
-		 * is still in flux and not ready for resuming.
-		 */
+		/* Wait for the device to ack the reset requests. */
 		err = __intel_wait_for_register_fw(uncore, GEN6_GDRST,
 						   hw_domain_mask, 0,
 						   2000, 0,