diff mbox series

drm/i915: Fix skl_pcode_try_request function

Message ID 20220405104114.20780-1-stanislav.lisovskiy@intel.com (mailing list archive)
State New, archived
Headers show
Series drm/i915: Fix skl_pcode_try_request function | expand

Commit Message

Stanislav Lisovskiy April 5, 2022, 10:41 a.m. UTC
Currently skl_pcode_try_request function doesn't
properly handle return value it gets from
snb_pcode_rw, but treats status != 0 as success,
returning true, which basically doesn't allow
to use retry/timeout mechanisms if PCode happens
to be busy and returns EGAIN or some other status
code not equal to 0.

We saw this on real hw and also tried simulating this
by always returning -EAGAIN from snb_pcode_rw for 6 times, which
currently will just result in false success, while it should
have tried until timeout is reached:

[   22.357729] i915 0000:00:02.0: [drm:intel_cdclk_dump_config [i915]] Changing CDCLK to
307200 kHz, VCO 614400 kHz, ref 38400 kHz, bypass 19200 kHz, voltage level 0
[   22.357831] i915 0000:00:02.0: [drm:__snb_pcode_rw [i915]] Returning EAGAIN retry 1
[   22.357892] i915 0000:00:02.0: [drm:skl_pcode_request [i915]] Success, exiting
[   22.357936] i915 0000:00:02.0: [drm] ERROR Failed to inform PCU about cdclk change (err -11, freq 307200)

We see en error because higher level api, still notices that status was wrong,
however we still did try only once.

We fix it by requiring _both_ the status to be 0 and
request/reply match for success(true) and function
should return failure(false) if either status turns
out to be EAGAIN, EBUSY or whatever or reply/request
masks do not match.

So now we see this in the logs:

[   22.318667] i915 0000:00:02.0: [drm:intel_cdclk_dump_config [i915]] Changing CDCLK to
307200 kHz, VCO 614400 kHz, ref 38400 kHz, bypass 19200 kHz, voltage level 0
[   22.318782] i915 0000:00:02.0: [drm:__snb_pcode_rw [i915]] Returning EAGAIN retry 1
[   22.318849] i915 0000:00:02.0: [drm:__snb_pcode_rw [i915]] Returning EAGAIN retry 2
[   22.319006] i915 0000:00:02.0: [drm:__snb_pcode_rw [i915]] Returning EAGAIN retry 3
[   22.319091] i915 0000:00:02.0: [drm:__snb_pcode_rw [i915]] Returning EAGAIN retry 4
[   22.319158] i915 0000:00:02.0: [drm:__snb_pcode_rw [i915]] Returning EAGAIN retry 5
[   22.319224] i915 0000:00:02.0: [drm:__snb_pcode_rw [i915]] Returning EAGAIN retry 6

Signed-off-by: Stanislav Lisovskiy <stanislav.lisovskiy@intel.com>
---
 drivers/gpu/drm/i915/intel_pcode.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Govindapillai, Vinod April 5, 2022, 9:51 p.m. UTC | #1
Hi Stan

Nice Find! Couple of clarifications, please check inline...

On Tue, 2022-04-05 at 13:41 +0300, Stanislav Lisovskiy wrote:
> Currently skl_pcode_try_request function doesn't
> properly handle return value it gets from
> snb_pcode_rw, but treats status != 0 as success,
> returning true, which basically doesn't allow
> to use retry/timeout mechanisms if PCode happens
> to be busy and returns EGAIN or some other status
> code not equal to 0.
> 
> We saw this on real hw and also tried simulating this
> by always returning -EAGAIN from snb_pcode_rw for 6 times, which
> currently will just result in false success, while it should
> have tried until timeout is reached:
> 
> [   22.357729] i915 0000:00:02.0: [drm:intel_cdclk_dump_config [i915]] Changing CDCLK to
> 307200 kHz, VCO 614400 kHz, ref 38400 kHz, bypass 19200 kHz, voltage level 0
> [   22.357831] i915 0000:00:02.0: [drm:__snb_pcode_rw [i915]] Returning EAGAIN retry 1
> [   22.357892] i915 0000:00:02.0: [drm:skl_pcode_request [i915]] Success, exiting
> [   22.357936] i915 0000:00:02.0: [drm] ERROR Failed to inform PCU about cdclk change (err -11,
> freq 307200)
> 
> We see en error because higher level api, still notices that status was wrong,
> however we still did try only once.
> 
> We fix it by requiring _both_ the status to be 0 and
> request/reply match for success(true) and function
> should return failure(false) if either status turns
> out to be EAGAIN, EBUSY or whatever or reply/request
> masks do not match.
> 
> So now we see this in the logs:
> 
> [   22.318667] i915 0000:00:02.0: [drm:intel_cdclk_dump_config [i915]] Changing CDCLK to
> 307200 kHz, VCO 614400 kHz, ref 38400 kHz, bypass 19200 kHz, voltage level 0
> [   22.318782] i915 0000:00:02.0: [drm:__snb_pcode_rw [i915]] Returning EAGAIN retry 1
> [   22.318849] i915 0000:00:02.0: [drm:__snb_pcode_rw [i915]] Returning EAGAIN retry 2
> [   22.319006] i915 0000:00:02.0: [drm:__snb_pcode_rw [i915]] Returning EAGAIN retry 3
> [   22.319091] i915 0000:00:02.0: [drm:__snb_pcode_rw [i915]] Returning EAGAIN retry 4
> [   22.319158] i915 0000:00:02.0: [drm:__snb_pcode_rw [i915]] Returning EAGAIN retry 5
> [   22.319224] i915 0000:00:02.0: [drm:__snb_pcode_rw [i915]] Returning EAGAIN retry 6
> 
> Signed-off-by: Stanislav Lisovskiy <stanislav.lisovskiy@intel.com>

> ---
>  drivers/gpu/drm/i915/intel_pcode.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/i915/intel_pcode.c b/drivers/gpu/drm/i915/intel_pcode.c
> index 391a37492ce5..fb6c43e8a02f 100644
> --- a/drivers/gpu/drm/i915/intel_pcode.c
> +++ b/drivers/gpu/drm/i915/intel_pcode.c
> @@ -136,7 +136,7 @@ static bool skl_pcode_try_request(struct drm_i915_private *i915, u32 mbox,
>  {
>  	*status = __snb_pcode_rw(i915, mbox, &request, NULL, 500, 0, true);
>  
> -	return *status || ((request & reply_mask) == reply);
> +	return (*status == 0) && ((request & reply_mask) == reply);

Here I wonder whether you need to check what sort of __snb_pcode_rw return values need the retry!
Isn't only ETIMEDOUT need the retry? Other return error codes can be probably be ignored from retry?

And should the, "return ret ? ret : status;" in "skl_pcode_request" be change to "return ret ?
status : ret;" to reflect the correct error code to calling functions after this change?


>  }
>  
>  /**
Stanislav Lisovskiy April 6, 2022, 7:48 a.m. UTC | #2
On Wed, Apr 06, 2022 at 12:51:29AM +0300, Govindapillai, Vinod wrote:
> Hi Stan
> 
> Nice Find! Couple of clarifications, please check inline...
> 
> On Tue, 2022-04-05 at 13:41 +0300, Stanislav Lisovskiy wrote:
> > Currently skl_pcode_try_request function doesn't
> > properly handle return value it gets from
> > snb_pcode_rw, but treats status != 0 as success,
> > returning true, which basically doesn't allow
> > to use retry/timeout mechanisms if PCode happens
> > to be busy and returns EGAIN or some other status
> > code not equal to 0.
> >
> > We saw this on real hw and also tried simulating this
> > by always returning -EAGAIN from snb_pcode_rw for 6 times, which
> > currently will just result in false success, while it should
> > have tried until timeout is reached:
> >
> > [   22.357729] i915 0000:00:02.0: [drm:intel_cdclk_dump_config [i915]] Changing CDCLK to
> > 307200 kHz, VCO 614400 kHz, ref 38400 kHz, bypass 19200 kHz, voltage level 0
> > [   22.357831] i915 0000:00:02.0: [drm:__snb_pcode_rw [i915]] Returning EAGAIN retry 1
> > [   22.357892] i915 0000:00:02.0: [drm:skl_pcode_request [i915]] Success, exiting
> > [   22.357936] i915 0000:00:02.0: [drm] ERROR Failed to inform PCU about cdclk change (err -11,
> > freq 307200)
> >
> > We see en error because higher level api, still notices that status was wrong,
> > however we still did try only once.
> >
> > We fix it by requiring _both_ the status to be 0 and
> > request/reply match for success(true) and function
> > should return failure(false) if either status turns
> > out to be EAGAIN, EBUSY or whatever or reply/request
> > masks do not match.
> >
> > So now we see this in the logs:
> >
> > [   22.318667] i915 0000:00:02.0: [drm:intel_cdclk_dump_config [i915]] Changing CDCLK to
> > 307200 kHz, VCO 614400 kHz, ref 38400 kHz, bypass 19200 kHz, voltage level 0
> > [   22.318782] i915 0000:00:02.0: [drm:__snb_pcode_rw [i915]] Returning EAGAIN retry 1
> > [   22.318849] i915 0000:00:02.0: [drm:__snb_pcode_rw [i915]] Returning EAGAIN retry 2
> > [   22.319006] i915 0000:00:02.0: [drm:__snb_pcode_rw [i915]] Returning EAGAIN retry 3
> > [   22.319091] i915 0000:00:02.0: [drm:__snb_pcode_rw [i915]] Returning EAGAIN retry 4
> > [   22.319158] i915 0000:00:02.0: [drm:__snb_pcode_rw [i915]] Returning EAGAIN retry 5
> > [   22.319224] i915 0000:00:02.0: [drm:__snb_pcode_rw [i915]] Returning EAGAIN retry 6
> >
> > Signed-off-by: Stanislav Lisovskiy <stanislav.lisovskiy@intel.com>
> 
> > ---
> >  drivers/gpu/drm/i915/intel_pcode.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/drivers/gpu/drm/i915/intel_pcode.c b/drivers/gpu/drm/i915/intel_pcode.c
> > index 391a37492ce5..fb6c43e8a02f 100644
> > --- a/drivers/gpu/drm/i915/intel_pcode.c
> > +++ b/drivers/gpu/drm/i915/intel_pcode.c
> > @@ -136,7 +136,7 @@ static bool skl_pcode_try_request(struct drm_i915_private *i915, u32 mbox,
> >  {
> >       *status = __snb_pcode_rw(i915, mbox, &request, NULL, 500, 0, true);
> >
> > -     return *status || ((request & reply_mask) == reply);
> > +     return (*status == 0) && ((request & reply_mask) == reply);
> 
> Here I wonder whether you need to check what sort of __snb_pcode_rw return values need the retry!
> Isn't only ETIMEDOUT need the retry? Other return error codes can be probably be ignored from retry?

Hi Vinod! Thanks for comments.

Well theoretically yes, but in practice I think we would prefer to retry in almost all of the cases.
There are also multiple error codes when you need to retry, such as "EAGAIN", "EBUSY" and probably
some others. Thats is probably why original code also doesn't make a difference.

> 
> And should the, "return ret ? ret : status;" in "skl_pcode_request" be change to "return ret ?
> status : ret;" to reflect the correct error code to calling functions after this change?

I think the logic here is such that ret value is somewhat more important than the status, so 
if ret is something not zero - we always prefer returning ret, for the enduser to know what
was the ret ERROR code.
If ret is 0, then we can "afford" to let the enduser know, what was actually the status.

To me it actually sounds a bit wrong, I think we should get status pointer, like "&status"
and modify it, so that calling site _always_ knows both status and ret, mixing those two
is a dangerous strategy which exactly caused some coder confusion and probably the bug, that
this patch is fixing.

Stan

> 
> 
> >  }
> >
> >  /**
Govindapillai, Vinod April 6, 2022, 9:19 a.m. UTC | #3
Hi,

On Wed, 2022-04-06 at 10:48 +0300, Lisovskiy, Stanislav wrote:
> On Wed, Apr 06, 2022 at 12:51:29AM +0300, Govindapillai, Vinod wrote:
> > Hi Stan
> > 
> > Nice Find! Couple of clarifications, please check inline...
> > 
> > On Tue, 2022-04-05 at 13:41 +0300, Stanislav Lisovskiy wrote:
> > > Currently skl_pcode_try_request function doesn't
> > > properly handle return value it gets from
> > > snb_pcode_rw, but treats status != 0 as success,
> > > returning true, which basically doesn't allow
> > > to use retry/timeout mechanisms if PCode happens
> > > to be busy and returns EGAIN or some other status
> > > code not equal to 0.
> > > 
> > > We saw this on real hw and also tried simulating this
> > > by always returning -EAGAIN from snb_pcode_rw for 6 times, which
> > > currently will just result in false success, while it should
> > > have tried until timeout is reached:
> > > 
> > > [   22.357729] i915 0000:00:02.0: [drm:intel_cdclk_dump_config [i915]] Changing CDCLK to
> > > 307200 kHz, VCO 614400 kHz, ref 38400 kHz, bypass 19200 kHz, voltage level 0
> > > [   22.357831] i915 0000:00:02.0: [drm:__snb_pcode_rw [i915]] Returning EAGAIN retry 1
> > > [   22.357892] i915 0000:00:02.0: [drm:skl_pcode_request [i915]] Success, exiting
> > > [   22.357936] i915 0000:00:02.0: [drm] ERROR Failed to inform PCU about cdclk change (err
> > > -11,
> > > freq 307200)
> > > 
> > > We see en error because higher level api, still notices that status was wrong,
> > > however we still did try only once.
> > > 
> > > We fix it by requiring _both_ the status to be 0 and
> > > request/reply match for success(true) and function
> > > should return failure(false) if either status turns
> > > out to be EAGAIN, EBUSY or whatever or reply/request
> > > masks do not match.
> > > 
> > > So now we see this in the logs:
> > > 
> > > [   22.318667] i915 0000:00:02.0: [drm:intel_cdclk_dump_config [i915]] Changing CDCLK to
> > > 307200 kHz, VCO 614400 kHz, ref 38400 kHz, bypass 19200 kHz, voltage level 0
> > > [   22.318782] i915 0000:00:02.0: [drm:__snb_pcode_rw [i915]] Returning EAGAIN retry 1
> > > [   22.318849] i915 0000:00:02.0: [drm:__snb_pcode_rw [i915]] Returning EAGAIN retry 2
> > > [   22.319006] i915 0000:00:02.0: [drm:__snb_pcode_rw [i915]] Returning EAGAIN retry 3
> > > [   22.319091] i915 0000:00:02.0: [drm:__snb_pcode_rw [i915]] Returning EAGAIN retry 4
> > > [   22.319158] i915 0000:00:02.0: [drm:__snb_pcode_rw [i915]] Returning EAGAIN retry 5
> > > [   22.319224] i915 0000:00:02.0: [drm:__snb_pcode_rw [i915]] Returning EAGAIN retry 6
> > > 
> > > Signed-off-by: Stanislav Lisovskiy <stanislav.lisovskiy@intel.com>
> > > ---
> > >  drivers/gpu/drm/i915/intel_pcode.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > 
> > > diff --git a/drivers/gpu/drm/i915/intel_pcode.c b/drivers/gpu/drm/i915/intel_pcode.c
> > > index 391a37492ce5..fb6c43e8a02f 100644
> > > --- a/drivers/gpu/drm/i915/intel_pcode.c
> > > +++ b/drivers/gpu/drm/i915/intel_pcode.c
> > > @@ -136,7 +136,7 @@ static bool skl_pcode_try_request(struct drm_i915_private *i915, u32 mbox,
> > >  {
> > >       *status = __snb_pcode_rw(i915, mbox, &request, NULL, 500, 0, true);
> > > 
> > > -     return *status || ((request & reply_mask) == reply);
> > > +     return (*status == 0) && ((request & reply_mask) == reply);
> > 
> > Here I wonder whether you need to check what sort of __snb_pcode_rw return values need the
> > retry!
> > Isn't only ETIMEDOUT need the retry? Other return error codes can be probably be ignored from
> > retry?
> 
> Hi Vinod! Thanks for comments.
> 
> Well theoretically yes, but in practice I think we would prefer to retry in almost all of the
> cases.
> There are also multiple error codes when you need to retry, such as "EAGAIN", "EBUSY" and probably
> some others. Thats is probably why original code also doesn't make a difference.

Ack

> 
> > And should the, "return ret ? ret : status;" in "skl_pcode_request" be change to "return ret ?
> > status : ret;" to reflect the correct error code to calling functions after this change?
> 
> I think the logic here is such that ret value is somewhat more important than the status, so 
> if ret is something not zero - we always prefer returning ret, for the enduser to know what
> was the ret ERROR code.
> If ret is 0, then we can "afford" to let the enduser know, what was actually the status.
> 
> To me it actually sounds a bit wrong, I think we should get status pointer, like "&status"
> and modify it, so that calling site _always_ knows both status and ret, mixing those two
> is a dangerous strategy which exactly caused some coder confusion and probably the bug, that
> this patch is fixing.
> 
> Stan

Agreed, the original code was indeed bit complicated! 

But what I meant was, after your patch, "ret" will be either "0" or "ETIMEDOUT". 
If ret = 0, then "status" would had been 0 too based on your change in this patch.
If ret != 0, then "status" might have values other than ETIMEDOUT.

So, 
"return ret ? status : ret;" might be better instead of the original "return ret ? ret : status;"
especially after your patch.

Anyway, not sure if the calling functions care about the return value much other than just logging.
It was indeed a quite good fix!

Reviewd-By: Vinod Govindapillai <vinod.govindapillai@intel.com>

> 
> > 
> > >  }
> > > 
> > >  /**
Stanislav Lisovskiy April 6, 2022, 9:58 a.m. UTC | #4
On Wed, Apr 06, 2022 at 12:19:19PM +0300, Govindapillai, Vinod wrote:
> Hi,
> 
> On Wed, 2022-04-06 at 10:48 +0300, Lisovskiy, Stanislav wrote:
> > On Wed, Apr 06, 2022 at 12:51:29AM +0300, Govindapillai, Vinod wrote:
> > > Hi Stan
> > >
> > > Nice Find! Couple of clarifications, please check inline...
> > >
> > > On Tue, 2022-04-05 at 13:41 +0300, Stanislav Lisovskiy wrote:
> > > > Currently skl_pcode_try_request function doesn't
> > > > properly handle return value it gets from
> > > > snb_pcode_rw, but treats status != 0 as success,
> > > > returning true, which basically doesn't allow
> > > > to use retry/timeout mechanisms if PCode happens
> > > > to be busy and returns EGAIN or some other status
> > > > code not equal to 0.
> > > >
> > > > We saw this on real hw and also tried simulating this
> > > > by always returning -EAGAIN from snb_pcode_rw for 6 times, which
> > > > currently will just result in false success, while it should
> > > > have tried until timeout is reached:
> > > >
> > > > [   22.357729] i915 0000:00:02.0: [drm:intel_cdclk_dump_config [i915]] Changing CDCLK to
> > > > 307200 kHz, VCO 614400 kHz, ref 38400 kHz, bypass 19200 kHz, voltage level 0
> > > > [   22.357831] i915 0000:00:02.0: [drm:__snb_pcode_rw [i915]] Returning EAGAIN retry 1
> > > > [   22.357892] i915 0000:00:02.0: [drm:skl_pcode_request [i915]] Success, exiting
> > > > [   22.357936] i915 0000:00:02.0: [drm] ERROR Failed to inform PCU about cdclk change (err
> > > > -11,
> > > > freq 307200)
> > > >
> > > > We see en error because higher level api, still notices that status was wrong,
> > > > however we still did try only once.
> > > >
> > > > We fix it by requiring _both_ the status to be 0 and
> > > > request/reply match for success(true) and function
> > > > should return failure(false) if either status turns
> > > > out to be EAGAIN, EBUSY or whatever or reply/request
> > > > masks do not match.
> > > >
> > > > So now we see this in the logs:
> > > >
> > > > [   22.318667] i915 0000:00:02.0: [drm:intel_cdclk_dump_config [i915]] Changing CDCLK to
> > > > 307200 kHz, VCO 614400 kHz, ref 38400 kHz, bypass 19200 kHz, voltage level 0
> > > > [   22.318782] i915 0000:00:02.0: [drm:__snb_pcode_rw [i915]] Returning EAGAIN retry 1
> > > > [   22.318849] i915 0000:00:02.0: [drm:__snb_pcode_rw [i915]] Returning EAGAIN retry 2
> > > > [   22.319006] i915 0000:00:02.0: [drm:__snb_pcode_rw [i915]] Returning EAGAIN retry 3
> > > > [   22.319091] i915 0000:00:02.0: [drm:__snb_pcode_rw [i915]] Returning EAGAIN retry 4
> > > > [   22.319158] i915 0000:00:02.0: [drm:__snb_pcode_rw [i915]] Returning EAGAIN retry 5
> > > > [   22.319224] i915 0000:00:02.0: [drm:__snb_pcode_rw [i915]] Returning EAGAIN retry 6
> > > >
> > > > Signed-off-by: Stanislav Lisovskiy <stanislav.lisovskiy@intel.com>
> > > > ---
> > > >  drivers/gpu/drm/i915/intel_pcode.c | 2 +-
> > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > >
> > > > diff --git a/drivers/gpu/drm/i915/intel_pcode.c b/drivers/gpu/drm/i915/intel_pcode.c
> > > > index 391a37492ce5..fb6c43e8a02f 100644
> > > > --- a/drivers/gpu/drm/i915/intel_pcode.c
> > > > +++ b/drivers/gpu/drm/i915/intel_pcode.c
> > > > @@ -136,7 +136,7 @@ static bool skl_pcode_try_request(struct drm_i915_private *i915, u32 mbox,
> > > >  {
> > > >       *status = __snb_pcode_rw(i915, mbox, &request, NULL, 500, 0, true);
> > > >
> > > > -     return *status || ((request & reply_mask) == reply);
> > > > +     return (*status == 0) && ((request & reply_mask) == reply);
> > >
> > > Here I wonder whether you need to check what sort of __snb_pcode_rw return values need the
> > > retry!
> > > Isn't only ETIMEDOUT need the retry? Other return error codes can be probably be ignored from
> > > retry?
> >
> > Hi Vinod! Thanks for comments.
> >
> > Well theoretically yes, but in practice I think we would prefer to retry in almost all of the
> > cases.
> > There are also multiple error codes when you need to retry, such as "EAGAIN", "EBUSY" and probably
> > some others. Thats is probably why original code also doesn't make a difference.
> 
> Ack
> 
> >
> > > And should the, "return ret ? ret : status;" in "skl_pcode_request" be change to "return ret ?
> > > status : ret;" to reflect the correct error code to calling functions after this change?
> >
> > I think the logic here is such that ret value is somewhat more important than the status, so
> > if ret is something not zero - we always prefer returning ret, for the enduser to know what
> > was the ret ERROR code.
> > If ret is 0, then we can "afford" to let the enduser know, what was actually the status.
> >
> > To me it actually sounds a bit wrong, I think we should get status pointer, like "&status"
> > and modify it, so that calling site _always_ knows both status and ret, mixing those two
> > is a dangerous strategy which exactly caused some coder confusion and probably the bug, that
> > this patch is fixing.
> >
> > Stan
> 
> Agreed, the original code was indeed bit complicated!
> 
> But what I meant was, after your patch, "ret" will be either "0" or "ETIMEDOUT".
> If ret = 0, then "status" would had been 0 too based on your change in this patch.
> If ret != 0, then "status" might have values other than ETIMEDOUT.
> 
> So,
> "return ret ? status : ret;" might be better instead of the original "return ret ? ret : status;"
> especially after your patch.
> 
> Anyway, not sure if the calling functions care about the return value much other than just logging.
> It was indeed a quite good fix!

Yes, your proposal actually makes more sense than current code anyway, however I'm still thinking
if we should may be propagate &status to call site always, need to think a bit.
Would be nice to get Ville's perspective on this.

> 
> Reviewd-By: Vinod Govindapillai <vinod.govindapillai@intel.com>

Thanks,

Stan

> 
> >
> > >
> > > >  }
> > > >
> > > >  /**
Govindapillai, Vinod April 6, 2022, 10:15 a.m. UTC | #5
On Wed, 2022-04-06 at 12:58 +0300, Lisovskiy, Stanislav wrote:
> On Wed, Apr 06, 2022 at 12:19:19PM +0300, Govindapillai, Vinod wrote:
> > Hi,
> > 
> > On Wed, 2022-04-06 at 10:48 +0300, Lisovskiy, Stanislav wrote:
> > > On Wed, Apr 06, 2022 at 12:51:29AM +0300, Govindapillai, Vinod wrote:
> > > > Hi Stan
> > > > 
> > > > Nice Find! Couple of clarifications, please check inline...
> > > > 
> > > > On Tue, 2022-04-05 at 13:41 +0300, Stanislav Lisovskiy wrote:
> > > > > Currently skl_pcode_try_request function doesn't
> > > > > properly handle return value it gets from
> > > > > snb_pcode_rw, but treats status != 0 as success,
> > > > > returning true, which basically doesn't allow
> > > > > to use retry/timeout mechanisms if PCode happens
> > > > > to be busy and returns EGAIN or some other status
> > > > > code not equal to 0.
> > > > > 
> > > > > We saw this on real hw and also tried simulating this
> > > > > by always returning -EAGAIN from snb_pcode_rw for 6 times, which
> > > > > currently will just result in false success, while it should
> > > > > have tried until timeout is reached:
> > > > > 
> > > > > [   22.357729] i915 0000:00:02.0: [drm:intel_cdclk_dump_config [i915]] Changing CDCLK to
> > > > > 307200 kHz, VCO 614400 kHz, ref 38400 kHz, bypass 19200 kHz, voltage level 0
> > > > > [   22.357831] i915 0000:00:02.0: [drm:__snb_pcode_rw [i915]] Returning EAGAIN retry 1
> > > > > [   22.357892] i915 0000:00:02.0: [drm:skl_pcode_request [i915]] Success, exiting
> > > > > [   22.357936] i915 0000:00:02.0: [drm] ERROR Failed to inform PCU about cdclk change (err
> > > > > -11,
> > > > > freq 307200)
> > > > > 
> > > > > We see en error because higher level api, still notices that status was wrong,
> > > > > however we still did try only once.
> > > > > 
> > > > > We fix it by requiring _both_ the status to be 0 and
> > > > > request/reply match for success(true) and function
> > > > > should return failure(false) if either status turns
> > > > > out to be EAGAIN, EBUSY or whatever or reply/request
> > > > > masks do not match.
> > > > > 
> > > > > So now we see this in the logs:
> > > > > 
> > > > > [   22.318667] i915 0000:00:02.0: [drm:intel_cdclk_dump_config [i915]] Changing CDCLK to
> > > > > 307200 kHz, VCO 614400 kHz, ref 38400 kHz, bypass 19200 kHz, voltage level 0
> > > > > [   22.318782] i915 0000:00:02.0: [drm:__snb_pcode_rw [i915]] Returning EAGAIN retry 1
> > > > > [   22.318849] i915 0000:00:02.0: [drm:__snb_pcode_rw [i915]] Returning EAGAIN retry 2
> > > > > [   22.319006] i915 0000:00:02.0: [drm:__snb_pcode_rw [i915]] Returning EAGAIN retry 3
> > > > > [   22.319091] i915 0000:00:02.0: [drm:__snb_pcode_rw [i915]] Returning EAGAIN retry 4
> > > > > [   22.319158] i915 0000:00:02.0: [drm:__snb_pcode_rw [i915]] Returning EAGAIN retry 5
> > > > > [   22.319224] i915 0000:00:02.0: [drm:__snb_pcode_rw [i915]] Returning EAGAIN retry 6
> > > > > 
> > > > > Signed-off-by: Stanislav Lisovskiy <stanislav.lisovskiy@intel.com>
> > > > > ---
> > > > >  drivers/gpu/drm/i915/intel_pcode.c | 2 +-
> > > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > > > 
> > > > > diff --git a/drivers/gpu/drm/i915/intel_pcode.c b/drivers/gpu/drm/i915/intel_pcode.c
> > > > > index 391a37492ce5..fb6c43e8a02f 100644
> > > > > --- a/drivers/gpu/drm/i915/intel_pcode.c
> > > > > +++ b/drivers/gpu/drm/i915/intel_pcode.c
> > > > > @@ -136,7 +136,7 @@ static bool skl_pcode_try_request(struct drm_i915_private *i915, u32
> > > > > mbox,
> > > > >  {
> > > > >       *status = __snb_pcode_rw(i915, mbox, &request, NULL, 500, 0, true);
> > > > > 
> > > > > -     return *status || ((request & reply_mask) == reply);
> > > > > +     return (*status == 0) && ((request & reply_mask) == reply);
> > > > 
> > > > Here I wonder whether you need to check what sort of __snb_pcode_rw return values need the
> > > > retry!
> > > > Isn't only ETIMEDOUT need the retry? Other return error codes can be probably be ignored
> > > > from
> > > > retry?
> > > 
> > > Hi Vinod! Thanks for comments.
> > > 
> > > Well theoretically yes, but in practice I think we would prefer to retry in almost all of the
> > > cases.
> > > There are also multiple error codes when you need to retry, such as "EAGAIN", "EBUSY" and
> > > probably
> > > some others. Thats is probably why original code also doesn't make a difference.
> > 
> > Ack
> > 
> > > > And should the, "return ret ? ret : status;" in "skl_pcode_request" be change to "return ret
> > > > ?
> > > > status : ret;" to reflect the correct error code to calling functions after this change?
> > > 
> > > I think the logic here is such that ret value is somewhat more important than the status, so
> > > if ret is something not zero - we always prefer returning ret, for the enduser to know what
> > > was the ret ERROR code.
> > > If ret is 0, then we can "afford" to let the enduser know, what was actually the status.
> > > 
> > > To me it actually sounds a bit wrong, I think we should get status pointer, like "&status"
> > > and modify it, so that calling site _always_ knows both status and ret, mixing those two
> > > is a dangerous strategy which exactly caused some coder confusion and probably the bug, that
> > > this patch is fixing.
> > > 
> > > Stan
> > 
> > Agreed, the original code was indeed bit complicated!
> > 
> > But what I meant was, after your patch, "ret" will be either "0" or "ETIMEDOUT".
> > If ret = 0, then "status" would had been 0 too based on your change in this patch.
> > If ret != 0, then "status" might have values other than ETIMEDOUT.
> > 
> > So,
> > "return ret ? status : ret;" might be better instead of the original "return ret ? ret :
> > status;"
> > especially after your patch.
> > 
> > Anyway, not sure if the calling functions care about the return value much other than just
> > logging.
> > It was indeed a quite good fix!
> 
> Yes, your proposal actually makes more sense than current code anyway, however I'm still thinking
> if we should may be propagate &status to call site always, 

Yes! my bad :) actually "return ret ? status : ret" is indeed "return status"!


> need to think a bit.
> Would be nice to get Ville's perspective on this.
> 
> > Reviewd-By: Vinod Govindapillai <vinod.govindapillai@intel.com>
> 
> Thanks,
> 
> Stan
> 
> > > > >  }
> > > > > 
> > > > >  /**
diff mbox series

Patch

diff --git a/drivers/gpu/drm/i915/intel_pcode.c b/drivers/gpu/drm/i915/intel_pcode.c
index 391a37492ce5..fb6c43e8a02f 100644
--- a/drivers/gpu/drm/i915/intel_pcode.c
+++ b/drivers/gpu/drm/i915/intel_pcode.c
@@ -136,7 +136,7 @@  static bool skl_pcode_try_request(struct drm_i915_private *i915, u32 mbox,
 {
 	*status = __snb_pcode_rw(i915, mbox, &request, NULL, 500, 0, true);
 
-	return *status || ((request & reply_mask) == reply);
+	return (*status == 0) && ((request & reply_mask) == reply);
 }
 
 /**