diff mbox

[1/3] thermal: handle get_temp() errors properly

Message ID 1479513177-81504-1-git-send-email-briannorris@chromium.org (mailing list archive)
State New, archived
Headers show

Commit Message

Brian Norris Nov. 18, 2016, 11:52 p.m. UTC
If using CONFIG_THERMAL_EMULATION, there's a corner case where we might
get an error from the zone's get_temp() callback, but we'll ignore that
and keep using its value. Let's just error out properly instead.

Signed-off-by: Brian Norris <briannorris@chromium.org>
---
 drivers/thermal/thermal_core.c | 3 +++
 1 file changed, 3 insertions(+)

Comments

Caesar Wang Nov. 19, 2016, 3:21 a.m. UTC | #1
Brian,
在 2016年11月19日 07:52, Brian Norris 写道:
> If using CONFIG_THERMAL_EMULATION, there's a corner case where we might
> get an error from the zone's get_temp() callback, but we'll ignore that
> and keep using its value. Let's just error out properly instead.
>
> Signed-off-by: Brian Norris <briannorris@chromium.org>
Tested-by: Caesar Wang <wxt@rock-chips.com>

[    8.111296] thermal thermal_zone4: power_allocator: sustainable_power 
will be estimated
[    8.119420] thermal_zone_get_temp:537 the ret=-19, no such device, 
look like the A/D value had no ready yet.
..
Anyway, this patch is useful for improving thermal.

-Caesar
> ---
>   drivers/thermal/thermal_core.c | 3 +++
>   1 file changed, 3 insertions(+)
>
> diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c
> index 911fd964c742..0fa497f10d25 100644
> --- a/drivers/thermal/thermal_core.c
> +++ b/drivers/thermal/thermal_core.c
> @@ -494,6 +494,8 @@ int thermal_zone_get_temp(struct thermal_zone_device *tz, int *temp)
>   	mutex_lock(&tz->lock);
>   
>   	ret = tz->ops->get_temp(tz, temp);
> +	if (ret)
> +		goto exit_unlock;
>   
>   	if (IS_ENABLED(CONFIG_THERMAL_EMULATION) && tz->emul_temperature) {
>   		for (count = 0; count < tz->trips; count++) {
> @@ -514,6 +516,7 @@ int thermal_zone_get_temp(struct thermal_zone_device *tz, int *temp)
>   			*temp = tz->emul_temperature;
>   	}
>    
> +exit_unlock:
>   	mutex_unlock(&tz->lock);
>   exit:
>   	return ret;
Eduardo Valentin Nov. 19, 2016, 3:41 a.m. UTC | #2
On Fri, Nov 18, 2016 at 03:52:55PM -0800, Brian Norris wrote:
> If using CONFIG_THERMAL_EMULATION, there's a corner case where we might
> get an error from the zone's get_temp() callback, but we'll ignore that
> and keep using its value. Let's just error out properly instead.
> 
> Signed-off-by: Brian Norris <briannorris@chromium.org>
> ---
>  drivers/thermal/thermal_core.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c
> index 911fd964c742..0fa497f10d25 100644
> --- a/drivers/thermal/thermal_core.c
> +++ b/drivers/thermal/thermal_core.c
> @@ -494,6 +494,8 @@ int thermal_zone_get_temp(struct thermal_zone_device *tz, int *temp)
>  	mutex_lock(&tz->lock);
>  
>  	ret = tz->ops->get_temp(tz, temp);
> +	if (ret)
> +		goto exit_unlock;

Yeah, but the follow through is intentional, if I am not mistaken.


>  
>  	if (IS_ENABLED(CONFIG_THERMAL_EMULATION) && tz->emul_temperature) {

Even if the driver is not able to read real temperature, but emul temp
is configured, then there is still opportunity to report the emulated
temperature.

>  		for (count = 0; count < tz->trips; count++) {
> @@ -514,6 +516,7 @@ int thermal_zone_get_temp(struct thermal_zone_device *tz, int *temp)
>  			*temp = tz->emul_temperature;

And if you check the lines at the bottom of the loop, you will see that,
in the fail case, we will stil compare to what is the content of temp,
which might be problematic.

I would prefer we consider the patch I sent
some time ago:
https://patchwork.kernel.org/patch/7876381/

>  	}
>   
> +exit_unlock:
>  	mutex_unlock(&tz->lock);
>  exit:
>  	return ret;
> -- 
> 2.8.0.rc3.226.g39d4020
>
Brian Norris Nov. 19, 2016, 5:30 a.m. UTC | #3
Hi,

On Fri, Nov 18, 2016 at 07:41:59PM -0800, Eduardo Valentin wrote:
> On Fri, Nov 18, 2016 at 03:52:55PM -0800, Brian Norris wrote:
> > If using CONFIG_THERMAL_EMULATION, there's a corner case where we might
> > get an error from the zone's get_temp() callback, but we'll ignore that
> > and keep using its value. Let's just error out properly instead.
> > 
> > Signed-off-by: Brian Norris <briannorris@chromium.org>
> > ---
> >  drivers/thermal/thermal_core.c | 3 +++
> >  1 file changed, 3 insertions(+)
> > 
> > diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c
> > index 911fd964c742..0fa497f10d25 100644
> > --- a/drivers/thermal/thermal_core.c
> > +++ b/drivers/thermal/thermal_core.c
> > @@ -494,6 +494,8 @@ int thermal_zone_get_temp(struct thermal_zone_device *tz, int *temp)
> >  	mutex_lock(&tz->lock);
> >  
> >  	ret = tz->ops->get_temp(tz, temp);
> > +	if (ret)
> > +		goto exit_unlock;
> 
> Yeah, but the follow through is intentional, if I am not mistaken.

OK...but it has a bug. It potentially utilizes an uninitialized value
for *temp.

> >  
> >  	if (IS_ENABLED(CONFIG_THERMAL_EMULATION) && tz->emul_temperature) {
> 
> Even if the driver is not able to read real temperature, but emul temp
> is configured, then there is still opportunity to report the emulated
> temperature.

OK, maybe, but you should avoid doing this comparison then:

513                 if (!ret && *temp < crit_temp)
514                         *temp = tz->emul_temperature;

Note that 'ret' might be 0 (from the calls to ->get_trip_type()), and then
you're comparing with the uninitialized value of *temp. So you need some
solution that accounts for this and decides to ignore the real
temperature properly.

> >  		for (count = 0; count < tz->trips; count++) {
> > @@ -514,6 +516,7 @@ int thermal_zone_get_temp(struct thermal_zone_device *tz, int *temp)
> >  			*temp = tz->emul_temperature;
> 
> And if you check the lines at the bottom of the loop, you will see that,
> in the fail case, we will stil compare to what is the content of temp,
> which might be problematic.

Yes...are you saying the same thing I am above?

> I would prefer we consider the patch I sent
> some time ago:
> https://patchwork.kernel.org/patch/7876381/

Honestly I didn't look that deeply into the framework here (and I also
don't use CONFIG_THERMAL_EMULATION), I was just fixing something that
was obviously wrong.

But on first read, that patch looks good to me -- although it'd be good
to note the uninitialized value fix in the comit log. Any reason that
didn't end up getting merged? It looks like it got reviewed, and you're
a thermal subsystem maintainer...

Brian
Zhang, Rui Nov. 22, 2016, 7:52 a.m. UTC | #4
Hi, Brian,

On Fri, 2016-11-18 at 21:30 -0800, Brian Norris wrote:
> Hi,
> 
> On Fri, Nov 18, 2016 at 07:41:59PM -0800, Eduardo Valentin wrote:
> > 
> > On Fri, Nov 18, 2016 at 03:52:55PM -0800, Brian Norris wrote:
> > > 
> > > If using CONFIG_THERMAL_EMULATION, there's a corner case where we
> > > might
> > > get an error from the zone's get_temp() callback, but we'll
> > > ignore that
> > > and keep using its value. Let's just error out properly instead.
> > > 
> > > Signed-off-by: Brian Norris <briannorris@chromium.org>
> > > ---
> > >  drivers/thermal/thermal_core.c | 3 +++
> > >  1 file changed, 3 insertions(+)
> > > 
> > > diff --git a/drivers/thermal/thermal_core.c
> > > b/drivers/thermal/thermal_core.c
> > > index 911fd964c742..0fa497f10d25 100644
> > > --- a/drivers/thermal/thermal_core.c
> > > +++ b/drivers/thermal/thermal_core.c
> > > @@ -494,6 +494,8 @@ int thermal_zone_get_temp(struct
> > > thermal_zone_device *tz, int *temp)
> > >  	mutex_lock(&tz->lock);
> > >  
> > >  	ret = tz->ops->get_temp(tz, temp);
> > > +	if (ret)
> > > +		goto exit_unlock;
> > Yeah, but the follow through is intentional, if I am not mistaken.
> OK...but it has a bug. It potentially utilizes an uninitialized value
> for *temp.
> 
Agreed.
> > 
> > > 
> > >  
> > >  	if (IS_ENABLED(CONFIG_THERMAL_EMULATION) && tz-
> > > >emul_temperature) {
> > Even if the driver is not able to read real temperature, but emul
> > temp
> > is configured, then there is still opportunity to report the
> > emulated
> > temperature.
> OK, maybe, but you should avoid doing this comparison then:
> 
> 513                 if (!ret && *temp < crit_temp)
> 514                         *temp = tz->emul_temperature;
> 
> Note that 'ret' might be 0 (from the calls to ->get_trip_type()), and
> then
> you're comparing with the uninitialized value of *temp. So you need
> some
> solution that accounts for this and decides to ignore the real
> temperature properly.
> 
right.
> > 
> > > 
> > >  		for (count = 0; count < tz->trips; count++) {
> > > @@ -514,6 +516,7 @@ int thermal_zone_get_temp(struct
> > > thermal_zone_device *tz, int *temp)
> > >  			*temp = tz->emul_temperature;
> > And if you check the lines at the bottom of the loop, you will see
> > that,
> > in the fail case, we will stil compare to what is the content of
> > temp,
> > which might be problematic.
> Yes...are you saying the same thing I am above?
> 
> > 
> > I would prefer we consider the patch I sent
> > some time ago:
> > https://patchwork.kernel.org/patch/7876381/
> Honestly I didn't look that deeply into the framework here (and I
> also
> don't use CONFIG_THERMAL_EMULATION), I was just fixing something that
> was obviously wrong.
> 
> But on first read, that patch looks good to me -- although it'd be
> good
> to note the uninitialized value fix in the comit log. Any reason that
> didn't end up getting merged? It looks like it got reviewed, and
> you're
> a thermal subsystem maintainer...
> 
hmmm, I forgot why I missed this one in the end.
Eduardo,
would you mind refresh and resend the patch?

thanks,
rui
> Brian
> --
> To unsubscribe from this list: send the line "unsubscribe linux-pm"
> in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eduardo Valentin Nov. 22, 2016, 11 a.m. UTC | #5
On Tue, Nov 22, 2016 at 03:52:25PM +0800, Zhang Rui wrote:
> Hi, Brian,
> 
> On Fri, 2016-11-18 at 21:30 -0800, Brian Norris wrote:
> > Hi,
> > 
> > On Fri, Nov 18, 2016 at 07:41:59PM -0800, Eduardo Valentin wrote:
> > > 
> > > On Fri, Nov 18, 2016 at 03:52:55PM -0800, Brian Norris wrote:
> > > > 
> > > > If using CONFIG_THERMAL_EMULATION, there's a corner case where we
> > > > might
> > > > get an error from the zone's get_temp() callback, but we'll
> > > > ignore that
> > > > and keep using its value. Let's just error out properly instead.
> > > > 
> > > > Signed-off-by: Brian Norris <briannorris@chromium.org>
> > > > ---
> > > >  drivers/thermal/thermal_core.c | 3 +++
> > > >  1 file changed, 3 insertions(+)
> > > > 
> > > > diff --git a/drivers/thermal/thermal_core.c
> > > > b/drivers/thermal/thermal_core.c
> > > > index 911fd964c742..0fa497f10d25 100644
> > > > --- a/drivers/thermal/thermal_core.c
> > > > +++ b/drivers/thermal/thermal_core.c
> > > > @@ -494,6 +494,8 @@ int thermal_zone_get_temp(struct
> > > > thermal_zone_device *tz, int *temp)
> > > >  	mutex_lock(&tz->lock);
> > > >  
> > > >  	ret = tz->ops->get_temp(tz, temp);
> > > > +	if (ret)
> > > > +		goto exit_unlock;
> > > Yeah, but the follow through is intentional, if I am not mistaken.
> > OK...but it has a bug. It potentially utilizes an uninitialized value
> > for *temp.
> > 
> Agreed.

I also agree that this section of current get_temp is buggy. That is why 
I sent the patch some time ago.

> > > 
> > > > 
> > > >  
> > > >  	if (IS_ENABLED(CONFIG_THERMAL_EMULATION) && tz-
> > > > >emul_temperature) {
> > > Even if the driver is not able to read real temperature, but emul
> > > temp
> > > is configured, then there is still opportunity to report the
> > > emulated
> > > temperature.
> > OK, maybe, but you should avoid doing this comparison then:
> > 
> > 513                 if (!ret && *temp < crit_temp)
> > 514                         *temp = tz->emul_temperature;
> > 
> > Note that 'ret' might be 0 (from the calls to ->get_trip_type()), and
> > then
> > you're comparing with the uninitialized value of *temp. So you need
> > some
> > solution that accounts for this and decides to ignore the real
> > temperature properly.
> > 
> right.
> > > 
> > > > 
> > > >  		for (count = 0; count < tz->trips; count++) {
> > > > @@ -514,6 +516,7 @@ int thermal_zone_get_temp(struct
> > > > thermal_zone_device *tz, int *temp)
> > > >  			*temp = tz->emul_temperature;
> > > And if you check the lines at the bottom of the loop, you will see
> > > that,
> > > in the fail case, we will stil compare to what is the content of
> > > temp,
> > > which might be problematic.
> > Yes...are you saying the same thing I am above?

Yes, Brian, we are concerned about the same bug.


> > 
> > > 
> > > I would prefer we consider the patch I sent
> > > some time ago:
> > > https://patchwork.kernel.org/patch/7876381/
> > Honestly I didn't look that deeply into the framework here (and I
> > also
> > don't use CONFIG_THERMAL_EMULATION), I was just fixing something that
> > was obviously wrong.

Yeah, but that is why we need people to look the code considering all
features. :-)


> > 
> > But on first read, that patch looks good to me -- although it'd be
> > good
> > to note the uninitialized value fix in the comit log. Any reason that
> > didn't end up getting merged? It looks like it got reviewed, and
> > you're
> > a thermal subsystem maintainer...
> > 

I do not remember why Rui postponed it. A note of clarification, for
things that touch thermal core, I agree with Rui that they go through
his tree. Besides, I tend to avoid acking and sending my own patches
without proper review, which was not the case of that patch, that was
just postponed and fell into the cracks somehow.

> hmmm, I forgot why I missed this one in the end.
> Eduardo,
> would you mind refresh and resend the patch?

Yeah sure. I have at least three extra patch sets on thermal core on
my queue. But I would like to get first the thermal sysfs reorg in
first. This fix is one of the changes that will go on top of the thermal
sysfs reorg.

BR,

Eduardo
Brian Norris Nov. 22, 2016, 10:27 p.m. UTC | #6
On Tue, Nov 22, 2016 at 03:00:47AM -0800, Eduardo Valentin wrote:
> On Tue, Nov 22, 2016 at 03:52:25PM +0800, Zhang Rui wrote:
> > On Fri, 2016-11-18 at 21:30 -0800, Brian Norris wrote:
> > > On Fri, Nov 18, 2016 at 07:41:59PM -0800, Eduardo Valentin wrote:
> > > > I would prefer we consider the patch I sent
> > > > some time ago:
> > > > https://patchwork.kernel.org/patch/7876381/
> > > Honestly I didn't look that deeply into the framework here (and I
> > > also
> > > don't use CONFIG_THERMAL_EMULATION), I was just fixing something that
> > > was obviously wrong.
> 
> Yeah, but that is why we need people to look the code considering all
> features. :-)

Well, there are bugfixes and there are features. My patch fixed the bug
in the simplest way possible; it didn't break CONFIG_THERMAL_EMULATION
any further than it already was, and it'll still work if get_temp()
doesn't return an error.

I'd say your patch is essentially adding a feature, and IMO that's not
the best way to fix a bug. You can fix the bug and *then* add the
feature.

Anyway, I'm not going to tell you how to run your subsystem. If your
patch goes through, that's probably just as well.

[...]

> > hmmm, I forgot why I missed this one in the end.
> > Eduardo,
> > would you mind refresh and resend the patch?
> 
> Yeah sure. I have at least three extra patch sets on thermal core on
> my queue. But I would like to get first the thermal sysfs reorg in
> first. This fix is one of the changes that will go on top of the thermal
> sysfs reorg.

So, the bugfix depends on feature work? I guess I'll check back in
another year to see what the status of the bugfix is :)

Brian
Dmitry Torokhov Sept. 8, 2017, 6:15 p.m. UTC | #7
On Tue, Nov 22, 2016 at 2:27 PM, Brian Norris <briannorris@chromium.org> wrote:
> On Tue, Nov 22, 2016 at 03:00:47AM -0800, Eduardo Valentin wrote:
>> On Tue, Nov 22, 2016 at 03:52:25PM +0800, Zhang Rui wrote:
>> > On Fri, 2016-11-18 at 21:30 -0800, Brian Norris wrote:
>> > > On Fri, Nov 18, 2016 at 07:41:59PM -0800, Eduardo Valentin wrote:
>> > > > I would prefer we consider the patch I sent
>> > > > some time ago:
>> > > > https://patchwork.kernel.org/patch/7876381/
>> > > Honestly I didn't look that deeply into the framework here (and I
>> > > also
>> > > don't use CONFIG_THERMAL_EMULATION), I was just fixing something that
>> > > was obviously wrong.
>>
>> Yeah, but that is why we need people to look the code considering all
>> features. :-)
>
> Well, there are bugfixes and there are features. My patch fixed the bug
> in the simplest way possible; it didn't break CONFIG_THERMAL_EMULATION
> any further than it already was, and it'll still work if get_temp()
> doesn't return an error.
>
> I'd say your patch is essentially adding a feature, and IMO that's not
> the best way to fix a bug. You can fix the bug and *then* add the
> feature.
>
> Anyway, I'm not going to tell you how to run your subsystem. If your
> patch goes through, that's probably just as well.
>
> [...]
>
>> > hmmm, I forgot why I missed this one in the end.
>> > Eduardo,
>> > would you mind refresh and resend the patch?
>>
>> Yeah sure. I have at least three extra patch sets on thermal core on
>> my queue. But I would like to get first the thermal sysfs reorg in
>> first. This fix is one of the changes that will go on top of the thermal
>> sysfs reorg.
>
> So, the bugfix depends on feature work? I guess I'll check back in
> another year to see what the status of the bugfix is :)

Not quite a year, but the status is still the same ;)

By the way, I do not quite understand why we want to mess with
emulated temperature when hardware reports errors. I'd say when
get_temp() fails we need to let upper layers know right away. Only
when we read temperature successfully and we are sure that the
temperature is not above critical level we should allow reporting
emulated value.

Can we please apply the patch?

Thanks.
diff mbox

Patch

diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c
index 911fd964c742..0fa497f10d25 100644
--- a/drivers/thermal/thermal_core.c
+++ b/drivers/thermal/thermal_core.c
@@ -494,6 +494,8 @@  int thermal_zone_get_temp(struct thermal_zone_device *tz, int *temp)
 	mutex_lock(&tz->lock);
 
 	ret = tz->ops->get_temp(tz, temp);
+	if (ret)
+		goto exit_unlock;
 
 	if (IS_ENABLED(CONFIG_THERMAL_EMULATION) && tz->emul_temperature) {
 		for (count = 0; count < tz->trips; count++) {
@@ -514,6 +516,7 @@  int thermal_zone_get_temp(struct thermal_zone_device *tz, int *temp)
 			*temp = tz->emul_temperature;
 	}
  
+exit_unlock:
 	mutex_unlock(&tz->lock);
 exit:
 	return ret;