diff mbox series

[1/2] drm/i915: Fix skl_pcode_try_request function

Message ID 20220408125200.9069-2-stanislav.lisovskiy@intel.com (mailing list archive)
State New, archived
Headers show
Series Fix issues in skl_pcode_request | expand

Commit Message

Stanislav Lisovskiy April 8, 2022, 12:51 p.m. UTC
Currently skl_pcode_try_request function doesn't
properly handle return value it gets from
snb_pcode_rw, but treats status != 0 as success,
returning true, which basically doesn't allow
to use retry/timeout mechanisms if PCode happens
to be busy and returns EGAIN or some other status
code not equal to 0.

We saw this on real hw and also tried simulating this
by always returning -EAGAIN from snb_pcode_rw for 6 times, which
currently will just result in false success, while it should
have tried until timeout is reached:

[   22.357729] i915 0000:00:02.0: [drm:intel_cdclk_dump_config [i915]] Changing CDCLK to
307200 kHz, VCO 614400 kHz, ref 38400 kHz, bypass 19200 kHz, voltage level 0
[   22.357831] i915 0000:00:02.0: [drm:__snb_pcode_rw [i915]] Returning EAGAIN retry 1
[   22.357892] i915 0000:00:02.0: [drm:skl_pcode_request [i915]] Success, exiting
[   22.357936] i915 0000:00:02.0: [drm] ERROR Failed to inform PCU about cdclk change (err -11, freq 307200)

We see en error because higher level api, still notices that status was wrong,
however we still did try only once.

We fix it by requiring _both_ the status to be 0 and
request/reply match for success(true) and function
should return failure(false) if either status turns
out to be EAGAIN, EBUSY or whatever or reply/request
masks do not match.

So now we see this in the logs:

[   22.318667] i915 0000:00:02.0: [drm:intel_cdclk_dump_config [i915]] Changing CDCLK to
307200 kHz, VCO 614400 kHz, ref 38400 kHz, bypass 19200 kHz, voltage level 0
[   22.318782] i915 0000:00:02.0: [drm:__snb_pcode_rw [i915]] Returning EAGAIN retry 1
[   22.318849] i915 0000:00:02.0: [drm:__snb_pcode_rw [i915]] Returning EAGAIN retry 2
[   22.319006] i915 0000:00:02.0: [drm:__snb_pcode_rw [i915]] Returning EAGAIN retry 3
[   22.319091] i915 0000:00:02.0: [drm:__snb_pcode_rw [i915]] Returning EAGAIN retry 4
[   22.319158] i915 0000:00:02.0: [drm:__snb_pcode_rw [i915]] Returning EAGAIN retry 5
[   22.319224] i915 0000:00:02.0: [drm:__snb_pcode_rw [i915]] Returning EAGAIN retry 6

Reviewed-by: Vinod Govindapillai <vinod.govindapillai@intel.com>
Signed-off-by: Stanislav Lisovskiy <stanislav.lisovskiy@intel.com>
---
 drivers/gpu/drm/i915/intel_pcode.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Ville Syrjälä April 8, 2022, 3:46 p.m. UTC | #1
On Fri, Apr 08, 2022 at 03:51:59PM +0300, Stanislav Lisovskiy wrote:
> Currently skl_pcode_try_request function doesn't
> properly handle return value it gets from
> snb_pcode_rw, but treats status != 0 as success,
> returning true, which basically doesn't allow
> to use retry/timeout mechanisms if PCode happens
> to be busy and returns EGAIN or some other status
> code not equal to 0.
> 
> We saw this on real hw and also tried simulating this
> by always returning -EAGAIN from snb_pcode_rw for 6 times, which
> currently will just result in false success, while it should
> have tried until timeout is reached:
> 
> [   22.357729] i915 0000:00:02.0: [drm:intel_cdclk_dump_config [i915]] Changing CDCLK to
> 307200 kHz, VCO 614400 kHz, ref 38400 kHz, bypass 19200 kHz, voltage level 0
> [   22.357831] i915 0000:00:02.0: [drm:__snb_pcode_rw [i915]] Returning EAGAIN retry 1
> [   22.357892] i915 0000:00:02.0: [drm:skl_pcode_request [i915]] Success, exiting
> [   22.357936] i915 0000:00:02.0: [drm] ERROR Failed to inform PCU about cdclk change (err -11, freq 307200)
> 
> We see en error because higher level api, still notices that status was wrong,
> however we still did try only once.
> 
> We fix it by requiring _both_ the status to be 0 and
> request/reply match for success(true) and function
> should return failure(false) if either status turns
> out to be EAGAIN, EBUSY or whatever or reply/request
> masks do not match.
> 
> So now we see this in the logs:
> 
> [   22.318667] i915 0000:00:02.0: [drm:intel_cdclk_dump_config [i915]] Changing CDCLK to
> 307200 kHz, VCO 614400 kHz, ref 38400 kHz, bypass 19200 kHz, voltage level 0
> [   22.318782] i915 0000:00:02.0: [drm:__snb_pcode_rw [i915]] Returning EAGAIN retry 1
> [   22.318849] i915 0000:00:02.0: [drm:__snb_pcode_rw [i915]] Returning EAGAIN retry 2
> [   22.319006] i915 0000:00:02.0: [drm:__snb_pcode_rw [i915]] Returning EAGAIN retry 3
> [   22.319091] i915 0000:00:02.0: [drm:__snb_pcode_rw [i915]] Returning EAGAIN retry 4
> [   22.319158] i915 0000:00:02.0: [drm:__snb_pcode_rw [i915]] Returning EAGAIN retry 5
> [   22.319224] i915 0000:00:02.0: [drm:__snb_pcode_rw [i915]] Returning EAGAIN retry 6
> 
> Reviewed-by: Vinod Govindapillai <vinod.govindapillai@intel.com>
> Signed-off-by: Stanislav Lisovskiy <stanislav.lisovskiy@intel.com>
> ---
>  drivers/gpu/drm/i915/intel_pcode.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/i915/intel_pcode.c b/drivers/gpu/drm/i915/intel_pcode.c
> index 391a37492ce5..fb6c43e8a02f 100644
> --- a/drivers/gpu/drm/i915/intel_pcode.c
> +++ b/drivers/gpu/drm/i915/intel_pcode.c
> @@ -136,7 +136,7 @@ static bool skl_pcode_try_request(struct drm_i915_private *i915, u32 mbox,
>  {
>  	*status = __snb_pcode_rw(i915, mbox, &request, NULL, 500, 0, true);
>  
> -	return *status || ((request & reply_mask) == reply);
> +	return (*status == 0) && ((request & reply_mask) == reply);

The problem with this is that now we'll keep pointlessly banging it
even if it returns a real error.

We should never really see that -EAGAIN since it indicates that our
timeout is too short. So the real fix should be to increase that
timeout. But I guess we could do a belt-and-suspenders approach
where we also keep repeating on -EGAIN. But I'm thinking -EAGAIN
should WARN as well to make sure we notice that our timeout is wrong.

>  }
>  
>  /**
> -- 
> 2.24.1.485.gad05a3d8e5
Ville Syrjälä April 8, 2022, 3:59 p.m. UTC | #2
On Fri, Apr 08, 2022 at 06:46:00PM +0300, Ville Syrjälä wrote:
> On Fri, Apr 08, 2022 at 03:51:59PM +0300, Stanislav Lisovskiy wrote:
> > Currently skl_pcode_try_request function doesn't
> > properly handle return value it gets from
> > snb_pcode_rw, but treats status != 0 as success,
> > returning true, which basically doesn't allow
> > to use retry/timeout mechanisms if PCode happens
> > to be busy and returns EGAIN or some other status
> > code not equal to 0.
> > 
> > We saw this on real hw and also tried simulating this
> > by always returning -EAGAIN from snb_pcode_rw for 6 times, which
> > currently will just result in false success, while it should
> > have tried until timeout is reached:
> > 
> > [   22.357729] i915 0000:00:02.0: [drm:intel_cdclk_dump_config [i915]] Changing CDCLK to
> > 307200 kHz, VCO 614400 kHz, ref 38400 kHz, bypass 19200 kHz, voltage level 0
> > [   22.357831] i915 0000:00:02.0: [drm:__snb_pcode_rw [i915]] Returning EAGAIN retry 1
> > [   22.357892] i915 0000:00:02.0: [drm:skl_pcode_request [i915]] Success, exiting
> > [   22.357936] i915 0000:00:02.0: [drm] ERROR Failed to inform PCU about cdclk change (err -11, freq 307200)
> > 
> > We see en error because higher level api, still notices that status was wrong,
> > however we still did try only once.
> > 
> > We fix it by requiring _both_ the status to be 0 and
> > request/reply match for success(true) and function
> > should return failure(false) if either status turns
> > out to be EAGAIN, EBUSY or whatever or reply/request
> > masks do not match.
> > 
> > So now we see this in the logs:
> > 
> > [   22.318667] i915 0000:00:02.0: [drm:intel_cdclk_dump_config [i915]] Changing CDCLK to
> > 307200 kHz, VCO 614400 kHz, ref 38400 kHz, bypass 19200 kHz, voltage level 0
> > [   22.318782] i915 0000:00:02.0: [drm:__snb_pcode_rw [i915]] Returning EAGAIN retry 1

Hmm. That is weird. The timestamp difference is only ~100 usec even
though we are supposed to use that 500 usec timeout. So did some
previous pcode access already timeout and leave the mailbox busy 
before we even do this request, or what is going on?

> > [   22.318849] i915 0000:00:02.0: [drm:__snb_pcode_rw [i915]] Returning EAGAIN retry 2
> > [   22.319006] i915 0000:00:02.0: [drm:__snb_pcode_rw [i915]] Returning EAGAIN retry 3
> > [   22.319091] i915 0000:00:02.0: [drm:__snb_pcode_rw [i915]] Returning EAGAIN retry 4
> > [   22.319158] i915 0000:00:02.0: [drm:__snb_pcode_rw [i915]] Returning EAGAIN retry 5
> > [   22.319224] i915 0000:00:02.0: [drm:__snb_pcode_rw [i915]] Returning EAGAIN retry 6
> > 
> > Reviewed-by: Vinod Govindapillai <vinod.govindapillai@intel.com>
> > Signed-off-by: Stanislav Lisovskiy <stanislav.lisovskiy@intel.com>
> > ---
> >  drivers/gpu/drm/i915/intel_pcode.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/intel_pcode.c b/drivers/gpu/drm/i915/intel_pcode.c
> > index 391a37492ce5..fb6c43e8a02f 100644
> > --- a/drivers/gpu/drm/i915/intel_pcode.c
> > +++ b/drivers/gpu/drm/i915/intel_pcode.c
> > @@ -136,7 +136,7 @@ static bool skl_pcode_try_request(struct drm_i915_private *i915, u32 mbox,
> >  {
> >  	*status = __snb_pcode_rw(i915, mbox, &request, NULL, 500, 0, true);
> >  
> > -	return *status || ((request & reply_mask) == reply);
> > +	return (*status == 0) && ((request & reply_mask) == reply);
> 
> The problem with this is that now we'll keep pointlessly banging it
> even if it returns a real error.
> 
> We should never really see that -EAGAIN since it indicates that our
> timeout is too short. So the real fix should be to increase that
> timeout. But I guess we could do a belt-and-suspenders approach
> where we also keep repeating on -EGAIN. But I'm thinking -EAGAIN
> should WARN as well to make sure we notice that our timeout is wrong.
> 
> >  }
> >  
> >  /**
> > -- 
> > 2.24.1.485.gad05a3d8e5
> 
> -- 
> Ville Syrjälä
> Intel
Stanislav Lisovskiy April 8, 2022, 5:39 p.m. UTC | #3
On Fri, Apr 08, 2022 at 06:46:00PM +0300, Ville Syrjälä wrote:
> On Fri, Apr 08, 2022 at 03:51:59PM +0300, Stanislav Lisovskiy wrote:
> > Currently skl_pcode_try_request function doesn't
> > properly handle return value it gets from
> > snb_pcode_rw, but treats status != 0 as success,
> > returning true, which basically doesn't allow
> > to use retry/timeout mechanisms if PCode happens
> > to be busy and returns EGAIN or some other status
> > code not equal to 0.
> > 
> > We saw this on real hw and also tried simulating this
> > by always returning -EAGAIN from snb_pcode_rw for 6 times, which
> > currently will just result in false success, while it should
> > have tried until timeout is reached:
> > 
> > [   22.357729] i915 0000:00:02.0: [drm:intel_cdclk_dump_config [i915]] Changing CDCLK to
> > 307200 kHz, VCO 614400 kHz, ref 38400 kHz, bypass 19200 kHz, voltage level 0
> > [   22.357831] i915 0000:00:02.0: [drm:__snb_pcode_rw [i915]] Returning EAGAIN retry 1
> > [   22.357892] i915 0000:00:02.0: [drm:skl_pcode_request [i915]] Success, exiting
> > [   22.357936] i915 0000:00:02.0: [drm] ERROR Failed to inform PCU about cdclk change (err -11, freq 307200)
> > 
> > We see en error because higher level api, still notices that status was wrong,
> > however we still did try only once.
> > 
> > We fix it by requiring _both_ the status to be 0 and
> > request/reply match for success(true) and function
> > should return failure(false) if either status turns
> > out to be EAGAIN, EBUSY or whatever or reply/request
> > masks do not match.
> > 
> > So now we see this in the logs:
> > 
> > [   22.318667] i915 0000:00:02.0: [drm:intel_cdclk_dump_config [i915]] Changing CDCLK to
> > 307200 kHz, VCO 614400 kHz, ref 38400 kHz, bypass 19200 kHz, voltage level 0
> > [   22.318782] i915 0000:00:02.0: [drm:__snb_pcode_rw [i915]] Returning EAGAIN retry 1
> > [   22.318849] i915 0000:00:02.0: [drm:__snb_pcode_rw [i915]] Returning EAGAIN retry 2
> > [   22.319006] i915 0000:00:02.0: [drm:__snb_pcode_rw [i915]] Returning EAGAIN retry 3
> > [   22.319091] i915 0000:00:02.0: [drm:__snb_pcode_rw [i915]] Returning EAGAIN retry 4
> > [   22.319158] i915 0000:00:02.0: [drm:__snb_pcode_rw [i915]] Returning EAGAIN retry 5
> > [   22.319224] i915 0000:00:02.0: [drm:__snb_pcode_rw [i915]] Returning EAGAIN retry 6
> > 
> > Reviewed-by: Vinod Govindapillai <vinod.govindapillai@intel.com>
> > Signed-off-by: Stanislav Lisovskiy <stanislav.lisovskiy@intel.com>
> > ---
> >  drivers/gpu/drm/i915/intel_pcode.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/intel_pcode.c b/drivers/gpu/drm/i915/intel_pcode.c
> > index 391a37492ce5..fb6c43e8a02f 100644
> > --- a/drivers/gpu/drm/i915/intel_pcode.c
> > +++ b/drivers/gpu/drm/i915/intel_pcode.c
> > @@ -136,7 +136,7 @@ static bool skl_pcode_try_request(struct drm_i915_private *i915, u32 mbox,
> >  {
> >  	*status = __snb_pcode_rw(i915, mbox, &request, NULL, 500, 0, true);
> >  
> > -	return *status || ((request & reply_mask) == reply);
> > +	return (*status == 0) && ((request & reply_mask) == reply);
> 
> The problem with this is that now we'll keep pointlessly banging it
> even if it returns a real error.

Question is that we should then define when it is pointless and when
its not. 
Not sure, if we should retry on any status - maybe only if its EBUSY
or smth. But even if at some point we get some issue with PCode
which is resolved after retry as currently should we may be still retry?

Anyway I guess this isn't how this is supposed to work - i.e if we
get status != 0, we should fail, but not succeed, also I guess
request not matching the reply mask should make it fail.

Stan

> 
> We should never really see that -EAGAIN since it indicates that our
> timeout is too short. So the real fix should be to increase that
> timeout. But I guess we could do a belt-and-suspenders approach
> where we also keep repeating on -EGAIN. But I'm thinking -EAGAIN
> should WARN as well to make sure we notice that our timeout is wrong.
> 
> >  }
> >  
> >  /**
> > -- 
> > 2.24.1.485.gad05a3d8e5
> 
> -- 
> Ville Syrjälä
> Intel
Stanislav Lisovskiy April 8, 2022, 5:49 p.m. UTC | #4
On Fri, Apr 08, 2022 at 06:59:07PM +0300, Ville Syrjälä wrote:
> On Fri, Apr 08, 2022 at 06:46:00PM +0300, Ville Syrjälä wrote:
> > On Fri, Apr 08, 2022 at 03:51:59PM +0300, Stanislav Lisovskiy wrote:
> > > Currently skl_pcode_try_request function doesn't
> > > properly handle return value it gets from
> > > snb_pcode_rw, but treats status != 0 as success,
> > > returning true, which basically doesn't allow
> > > to use retry/timeout mechanisms if PCode happens
> > > to be busy and returns EGAIN or some other status
> > > code not equal to 0.
> > > 
> > > We saw this on real hw and also tried simulating this
> > > by always returning -EAGAIN from snb_pcode_rw for 6 times, which
> > > currently will just result in false success, while it should
> > > have tried until timeout is reached:
> > > 
> > > [   22.357729] i915 0000:00:02.0: [drm:intel_cdclk_dump_config [i915]] Changing CDCLK to
> > > 307200 kHz, VCO 614400 kHz, ref 38400 kHz, bypass 19200 kHz, voltage level 0
> > > [   22.357831] i915 0000:00:02.0: [drm:__snb_pcode_rw [i915]] Returning EAGAIN retry 1
> > > [   22.357892] i915 0000:00:02.0: [drm:skl_pcode_request [i915]] Success, exiting
> > > [   22.357936] i915 0000:00:02.0: [drm] ERROR Failed to inform PCU about cdclk change (err -11, freq 307200)
> > > 
> > > We see en error because higher level api, still notices that status was wrong,
> > > however we still did try only once.
> > > 
> > > We fix it by requiring _both_ the status to be 0 and
> > > request/reply match for success(true) and function
> > > should return failure(false) if either status turns
> > > out to be EAGAIN, EBUSY or whatever or reply/request
> > > masks do not match.
> > > 
> > > So now we see this in the logs:
> > > 
> > > [   22.318667] i915 0000:00:02.0: [drm:intel_cdclk_dump_config [i915]] Changing CDCLK to
> > > 307200 kHz, VCO 614400 kHz, ref 38400 kHz, bypass 19200 kHz, voltage level 0
> > > [   22.318782] i915 0000:00:02.0: [drm:__snb_pcode_rw [i915]] Returning EAGAIN retry 1
> 
> Hmm. That is weird. The timestamp difference is only ~100 usec even
> though we are supposed to use that 500 usec timeout. So did some
> previous pcode access already timeout and leave the mailbox busy 
> before we even do this request, or what is going on?

Ah that is not the real example. 
What I did to check how it works is just added something like this:

+ static int retries = 0;

+  if (++retry < 6)
+      return -EAGAIN

to _snb_pcode_rw.

So before the patch, skl_pcode_try_request returned True when getting -EAGAIN and 
immediately bailed out, assigning ret = 0 and goto out, as if it succeeded while it obvously not.

> 
> > > [   22.318849] i915 0000:00:02.0: [drm:__snb_pcode_rw [i915]] Returning EAGAIN retry 2
> > > [   22.319006] i915 0000:00:02.0: [drm:__snb_pcode_rw [i915]] Returning EAGAIN retry 3
> > > [   22.319091] i915 0000:00:02.0: [drm:__snb_pcode_rw [i915]] Returning EAGAIN retry 4
> > > [   22.319158] i915 0000:00:02.0: [drm:__snb_pcode_rw [i915]] Returning EAGAIN retry 5
> > > [   22.319224] i915 0000:00:02.0: [drm:__snb_pcode_rw [i915]] Returning EAGAIN retry 6

That is how it behaves with this patch, i.e status != 0 makes skl_pcode_try_request
return False(as it should) which then enables skl_pcode_request retry machinery.

In real case we have something similar to this but can't really reproduce it that easily,
so was kinda simulating that issue to check.

Stan

> > > 
> > > Reviewed-by: Vinod Govindapillai <vinod.govindapillai@intel.com>
> > > Signed-off-by: Stanislav Lisovskiy <stanislav.lisovskiy@intel.com>
> > > ---
> > >  drivers/gpu/drm/i915/intel_pcode.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > 
> > > diff --git a/drivers/gpu/drm/i915/intel_pcode.c b/drivers/gpu/drm/i915/intel_pcode.c
> > > index 391a37492ce5..fb6c43e8a02f 100644
> > > --- a/drivers/gpu/drm/i915/intel_pcode.c
> > > +++ b/drivers/gpu/drm/i915/intel_pcode.c
> > > @@ -136,7 +136,7 @@ static bool skl_pcode_try_request(struct drm_i915_private *i915, u32 mbox,
> > >  {
> > >  	*status = __snb_pcode_rw(i915, mbox, &request, NULL, 500, 0, true);
> > >  
> > > -	return *status || ((request & reply_mask) == reply);
> > > +	return (*status == 0) && ((request & reply_mask) == reply);
> > 
> > The problem with this is that now we'll keep pointlessly banging it
> > even if it returns a real error.
> > 
> > We should never really see that -EAGAIN since it indicates that our
> > timeout is too short. So the real fix should be to increase that
> > timeout. But I guess we could do a belt-and-suspenders approach
> > where we also keep repeating on -EGAIN. But I'm thinking -EAGAIN
> > should WARN as well to make sure we notice that our timeout is wrong.
> > 
> > >  }
> > >  
> > >  /**
> > > -- 
> > > 2.24.1.485.gad05a3d8e5
> > 
> > -- 
> > Ville Syrjälä
> > Intel
> 
> -- 
> Ville Syrjälä
> Intel
diff mbox series

Patch

diff --git a/drivers/gpu/drm/i915/intel_pcode.c b/drivers/gpu/drm/i915/intel_pcode.c
index 391a37492ce5..fb6c43e8a02f 100644
--- a/drivers/gpu/drm/i915/intel_pcode.c
+++ b/drivers/gpu/drm/i915/intel_pcode.c
@@ -136,7 +136,7 @@  static bool skl_pcode_try_request(struct drm_i915_private *i915, u32 mbox,
 {
 	*status = __snb_pcode_rw(i915, mbox, &request, NULL, 500, 0, true);
 
-	return *status || ((request & reply_mask) == reply);
+	return (*status == 0) && ((request & reply_mask) == reply);
 }
 
 /**