diff mbox series

hwmon: (k10temp) Report negative temperatures

Message ID 20230523204932.2679-1-Baski.Kannan@amd.com (mailing list archive)
State Changes Requested
Headers show
Series hwmon: (k10temp) Report negative temperatures | expand

Commit Message

Baskaran Kannan May 23, 2023, 8:49 p.m. UTC
Currently, the tctl and die temperatures are rounded off to zero
if they are less than 0. There are industrial processors which work
below zero.

To display the correct temperature remove the rounding off.

Signed-off-by: Baskaran Kannan <Baski.Kannan@amd.com>
---
 drivers/hwmon/k10temp.c | 4 ----
 1 file changed, 4 deletions(-)

Comments

Guenter Roeck May 23, 2023, 9:46 p.m. UTC | #1
On Tue, May 23, 2023 at 03:49:32PM -0500, Baskaran Kannan wrote:
> Currently, the tctl and die temperatures are rounded off to zero
> if they are less than 0. There are industrial processors which work
> below zero.

This was introduced with commit aef17ca12719 ("hwmon: (k10temp) Only apply
temperature offset if result is positive"). This patch would effecively
revert that change. Given the reason for introducing it I am not convinced
that it is a good idea to unconditionally revert it.

Guenter

> 
> To display the correct temperature remove the rounding off.
> 
> Signed-off-by: Baskaran Kannan <Baski.Kannan@amd.com>
> ---
>  drivers/hwmon/k10temp.c | 4 ----
>  1 file changed, 4 deletions(-)
> 
> diff --git a/drivers/hwmon/k10temp.c b/drivers/hwmon/k10temp.c
> index 7b177b9fbb09..489ad0b1bc74 100644
> --- a/drivers/hwmon/k10temp.c
> +++ b/drivers/hwmon/k10temp.c
> @@ -204,13 +204,9 @@ static int k10temp_read_temp(struct device *dev, u32 attr, int channel,
>  		switch (channel) {
>  		case 0:		/* Tctl */
>  			*val = get_raw_temp(data);
> -			if (*val < 0)
> -				*val = 0;
>  			break;
>  		case 1:		/* Tdie */
>  			*val = get_raw_temp(data) - data->temp_offset;
> -			if (*val < 0)
> -				*val = 0;
>  			break;
>  		case 2 ... 13:		/* Tccd{1-12} */
>  			amd_smn_read(amd_pci_dev_to_node_id(data->pdev),
> -- 
> 2.25.1
>
Guenter Roeck June 8, 2023, 1:51 p.m. UTC | #2
On Tue, May 23, 2023 at 02:46:46PM -0700, Guenter Roeck wrote:
> On Tue, May 23, 2023 at 03:49:32PM -0500, Baskaran Kannan wrote:
> > Currently, the tctl and die temperatures are rounded off to zero
> > if they are less than 0. There are industrial processors which work
> > below zero.
> 
> This was introduced with commit aef17ca12719 ("hwmon: (k10temp) Only apply
> temperature offset if result is positive"). This patch would effecively
> revert that change. Given the reason for introducing it I am not convinced
> that it is a good idea to unconditionally revert it.
> 

Any comments ? I am not inclined to accept this patch as-is. What are
the industrial processors ? Is there a means to detect them ?

Guenter

> Guenter
> 
> > 
> > To display the correct temperature remove the rounding off.
> > 
> > Signed-off-by: Baskaran Kannan <Baski.Kannan@amd.com>
> > ---
> >  drivers/hwmon/k10temp.c | 4 ----
> >  1 file changed, 4 deletions(-)
> > 
> > diff --git a/drivers/hwmon/k10temp.c b/drivers/hwmon/k10temp.c
> > index 7b177b9fbb09..489ad0b1bc74 100644
> > --- a/drivers/hwmon/k10temp.c
> > +++ b/drivers/hwmon/k10temp.c
> > @@ -204,13 +204,9 @@ static int k10temp_read_temp(struct device *dev, u32 attr, int channel,
> >  		switch (channel) {
> >  		case 0:		/* Tctl */
> >  			*val = get_raw_temp(data);
> > -			if (*val < 0)
> > -				*val = 0;
> >  			break;
> >  		case 1:		/* Tdie */
> >  			*val = get_raw_temp(data) - data->temp_offset;
> > -			if (*val < 0)
> > -				*val = 0;
> >  			break;
> >  		case 2 ... 13:		/* Tccd{1-12} */
> >  			amd_smn_read(amd_pci_dev_to_node_id(data->pdev),
> > -- 
> > 2.25.1
> >
Baskaran Kannan June 8, 2023, 5:09 p.m. UTC | #3
[AMD Official Use Only - General]

The patch you have mentioned, aef17ca12719, sounds like a work-around for a problem found in some Ryzen Threadripper processors.
If I understand correctly, this work-around (aef17ca12719) has been provided as a blanket fix for all the processors.

The Industrial Processor in question is the Epyc3k i3255.
AMD Family 17h (boot_cpu_data.x86)
AMD model 00h - 0fh (boot_cpu_data.x86_model)
Model Name - contains string "3255"

It supports temperature ranging from -40 degree Celsius to 105 deg Celsius.
We have customers' machines running at -20 deg Celsius. They require that the correct temperature be passed to their tools.

-----Original Message-----
From: Guenter Roeck <groeck7@gmail.com> On Behalf Of Guenter Roeck
Sent: Thursday, June 8, 2023 8:52 AM
To: Kannan, Baski <Baski.Kannan@amd.com>
Cc: Moger, Babu <Babu.Moger@amd.com>; clemens@ladisch.de; jdelvare@suse.com; linux-hwmon@vger.kernel.org; linux-kernel@vger.kernel.org
Subject: Re: [PATCH] hwmon: (k10temp) Report negative temperatures

Caution: This message originated from an External Source. Use proper caution when opening attachments, clicking links, or responding.


On Tue, May 23, 2023 at 02:46:46PM -0700, Guenter Roeck wrote:
> On Tue, May 23, 2023 at 03:49:32PM -0500, Baskaran Kannan wrote:
> > Currently, the tctl and die temperatures are rounded off to zero if
> > they are less than 0. There are industrial processors which work
> > below zero.
>
> This was introduced with commit aef17ca12719 ("hwmon: (k10temp) Only
> apply temperature offset if result is positive"). This patch would
> effecively revert that change. Given the reason for introducing it I
> am not convinced that it is a good idea to unconditionally revert it.
>

Any comments ? I am not inclined to accept this patch as-is. What are the industrial processors ? Is there a means to detect them ?

Guenter

> Guenter
>
> >
> > To display the correct temperature remove the rounding off.
> >
> > Signed-off-by: Baskaran Kannan <Baski.Kannan@amd.com>
> > ---
> >  drivers/hwmon/k10temp.c | 4 ----
> >  1 file changed, 4 deletions(-)
> >
> > diff --git a/drivers/hwmon/k10temp.c b/drivers/hwmon/k10temp.c index
> > 7b177b9fbb09..489ad0b1bc74 100644
> > --- a/drivers/hwmon/k10temp.c
> > +++ b/drivers/hwmon/k10temp.c
> > @@ -204,13 +204,9 @@ static int k10temp_read_temp(struct device *dev, u32 attr, int channel,
> >             switch (channel) {
> >             case 0:         /* Tctl */
> >                     *val = get_raw_temp(data);
> > -                   if (*val < 0)
> > -                           *val = 0;
> >                     break;
> >             case 1:         /* Tdie */
> >                     *val = get_raw_temp(data) - data->temp_offset;
> > -                   if (*val < 0)
> > -                           *val = 0;
> >                     break;
> >             case 2 ... 13:          /* Tccd{1-12} */
> >                     amd_smn_read(amd_pci_dev_to_node_id(data->pdev),
> > --
> > 2.25.1
> >
Guenter Roeck June 8, 2023, 6:03 p.m. UTC | #4
On 6/8/23 10:09, Kannan, Baski wrote:
> [AMD Official Use Only - General]
> 
> The patch you have mentioned, aef17ca12719, sounds like a work-around for a problem found in some Ryzen Threadripper processors.
> If I understand correctly, this work-around (aef17ca12719) has been provided as a blanket fix for all the processors.
> 

Due to lack of better knowledge and understanding, yes. See
https://github.com/lm-sensors/lm-sensors/issues/70. That doesn't
mean that a blanket revert would be appropriate.

> The Industrial Processor in question is the Epyc3k i3255.
> AMD Family 17h (boot_cpu_data.x86)
> AMD model 00h - 0fh (boot_cpu_data.x86_model)
> Model Name - contains string "3255"
> 
> It supports temperature ranging from -40 degree Celsius to 105 deg Celsius.
> We have customers' machines running at -20 deg Celsius. They require that the correct temperature be passed to their tools.
> 

We have two options: Either limit the workaround to the list of processors
which may be affected by the original problem, or do not apply it to
processors which are known to _not_ be affected by the problem. Either
can easily be implemented by adding a flag to struct k10temp_data and
setting it in the probe function.

No one outside AMD knows which processors may or may not be affected
by the original problem. It was reported on 1950X at the time, but
it may exist on all processors with the ability to set Sense MI Skew
(and possibly Sense MI Offset), whatever that is. With that in mind,
the fix will have to be provided by AMD.

Guenter

> -----Original Message-----
> From: Guenter Roeck <groeck7@gmail.com> On Behalf Of Guenter Roeck
> Sent: Thursday, June 8, 2023 8:52 AM
> To: Kannan, Baski <Baski.Kannan@amd.com>
> Cc: Moger, Babu <Babu.Moger@amd.com>; clemens@ladisch.de; jdelvare@suse.com; linux-hwmon@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: Re: [PATCH] hwmon: (k10temp) Report negative temperatures
> 
> Caution: This message originated from an External Source. Use proper caution when opening attachments, clicking links, or responding.
> 
> 
> On Tue, May 23, 2023 at 02:46:46PM -0700, Guenter Roeck wrote:
>> On Tue, May 23, 2023 at 03:49:32PM -0500, Baskaran Kannan wrote:
>>> Currently, the tctl and die temperatures are rounded off to zero if
>>> they are less than 0. There are industrial processors which work
>>> below zero.
>>
>> This was introduced with commit aef17ca12719 ("hwmon: (k10temp) Only
>> apply temperature offset if result is positive"). This patch would
>> effecively revert that change. Given the reason for introducing it I
>> am not convinced that it is a good idea to unconditionally revert it.
>>
> 
> Any comments ? I am not inclined to accept this patch as-is. What are the industrial processors ? Is there a means to detect them ?
> 
> Guenter
> 
>> Guenter
>>
>>>
>>> To display the correct temperature remove the rounding off.
>>>
>>> Signed-off-by: Baskaran Kannan <Baski.Kannan@amd.com>
>>> ---
>>>   drivers/hwmon/k10temp.c | 4 ----
>>>   1 file changed, 4 deletions(-)
>>>
>>> diff --git a/drivers/hwmon/k10temp.c b/drivers/hwmon/k10temp.c index
>>> 7b177b9fbb09..489ad0b1bc74 100644
>>> --- a/drivers/hwmon/k10temp.c
>>> +++ b/drivers/hwmon/k10temp.c
>>> @@ -204,13 +204,9 @@ static int k10temp_read_temp(struct device *dev, u32 attr, int channel,
>>>              switch (channel) {
>>>              case 0:         /* Tctl */
>>>                      *val = get_raw_temp(data);
>>> -                   if (*val < 0)
>>> -                           *val = 0;
>>>                      break;
>>>              case 1:         /* Tdie */
>>>                      *val = get_raw_temp(data) - data->temp_offset;
>>> -                   if (*val < 0)
>>> -                           *val = 0;
>>>                      break;
>>>              case 2 ... 13:          /* Tccd{1-12} */
>>>                      amd_smn_read(amd_pci_dev_to_node_id(data->pdev),
>>> --
>>> 2.25.1
>>>
Baskaran Kannan June 8, 2023, 6:25 p.m. UTC | #5
[AMD Official Use Only - General]

To not spawn any new problems, we can go ahead with option 2.  i.e., "do not apply it to processors which are known to _not_ be affected by the problem."

Thanks
- Baski

-----Original Message-----
From: Guenter Roeck <groeck7@gmail.com> On Behalf Of Guenter Roeck
Sent: Thursday, June 8, 2023 1:03 PM
To: Kannan, Baski <Baski.Kannan@amd.com>
Cc: Moger, Babu <Babu.Moger@amd.com>; clemens@ladisch.de; jdelvare@suse.com; linux-hwmon@vger.kernel.org; linux-kernel@vger.kernel.org; Ramayanam, Pavan <Pavan.Ramayanam@amd.com>
Subject: Re: [PATCH] hwmon: (k10temp) Report negative temperatures

Caution: This message originated from an External Source. Use proper caution when opening attachments, clicking links, or responding.


On 6/8/23 10:09, Kannan, Baski wrote:
> [AMD Official Use Only - General]
>
> The patch you have mentioned, aef17ca12719, sounds like a work-around for a problem found in some Ryzen Threadripper processors.
> If I understand correctly, this work-around (aef17ca12719) has been provided as a blanket fix for all the processors.
>

Due to lack of better knowledge and understanding, yes. See https://github.com/lm-sensors/lm-sensors/issues/70. That doesn't mean that a blanket revert would be appropriate.

> The Industrial Processor in question is the Epyc3k i3255.
> AMD Family 17h (boot_cpu_data.x86)
> AMD model 00h - 0fh (boot_cpu_data.x86_model) Model Name - contains
> string "3255"
>
> It supports temperature ranging from -40 degree Celsius to 105 deg Celsius.
> We have customers' machines running at -20 deg Celsius. They require that the correct temperature be passed to their tools.
>

We have two options: Either limit the workaround to the list of processors which may be affected by the original problem, or do not apply it to processors which are known to _not_ be affected by the problem. Either can easily be implemented by adding a flag to struct k10temp_data and setting it in the probe function.

No one outside AMD knows which processors may or may not be affected by the original problem. It was reported on 1950X at the time, but it may exist on all processors with the ability to set Sense MI Skew (and possibly Sense MI Offset), whatever that is. With that in mind, the fix will have to be provided by AMD.

Guenter

> -----Original Message-----
> From: Guenter Roeck <groeck7@gmail.com> On Behalf Of Guenter Roeck
> Sent: Thursday, June 8, 2023 8:52 AM
> To: Kannan, Baski <Baski.Kannan@amd.com>
> Cc: Moger, Babu <Babu.Moger@amd.com>; clemens@ladisch.de;
> jdelvare@suse.com; linux-hwmon@vger.kernel.org;
> linux-kernel@vger.kernel.org
> Subject: Re: [PATCH] hwmon: (k10temp) Report negative temperatures
>
> Caution: This message originated from an External Source. Use proper caution when opening attachments, clicking links, or responding.
>
>
> On Tue, May 23, 2023 at 02:46:46PM -0700, Guenter Roeck wrote:
>> On Tue, May 23, 2023 at 03:49:32PM -0500, Baskaran Kannan wrote:
>>> Currently, the tctl and die temperatures are rounded off to zero if
>>> they are less than 0. There are industrial processors which work
>>> below zero.
>>
>> This was introduced with commit aef17ca12719 ("hwmon: (k10temp) Only
>> apply temperature offset if result is positive"). This patch would
>> effecively revert that change. Given the reason for introducing it I
>> am not convinced that it is a good idea to unconditionally revert it.
>>
>
> Any comments ? I am not inclined to accept this patch as-is. What are the industrial processors ? Is there a means to detect them ?
>
> Guenter
>
>> Guenter
>>
>>>
>>> To display the correct temperature remove the rounding off.
>>>
>>> Signed-off-by: Baskaran Kannan <Baski.Kannan@amd.com>
>>> ---
>>>   drivers/hwmon/k10temp.c | 4 ----
>>>   1 file changed, 4 deletions(-)
>>>
>>> diff --git a/drivers/hwmon/k10temp.c b/drivers/hwmon/k10temp.c index
>>> 7b177b9fbb09..489ad0b1bc74 100644
>>> --- a/drivers/hwmon/k10temp.c
>>> +++ b/drivers/hwmon/k10temp.c
>>> @@ -204,13 +204,9 @@ static int k10temp_read_temp(struct device *dev, u32 attr, int channel,
>>>              switch (channel) {
>>>              case 0:         /* Tctl */
>>>                      *val = get_raw_temp(data);
>>> -                   if (*val < 0)
>>> -                           *val = 0;
>>>                      break;
>>>              case 1:         /* Tdie */
>>>                      *val = get_raw_temp(data) - data->temp_offset;
>>> -                   if (*val < 0)
>>> -                           *val = 0;
>>>                      break;
>>>              case 2 ... 13:          /* Tccd{1-12} */
>>>
>>> amd_smn_read(amd_pci_dev_to_node_id(data->pdev),
>>> --
>>> 2.25.1
>>>
Guenter Roeck June 8, 2023, 7:17 p.m. UTC | #6
On 6/8/23 11:25, Kannan, Baski wrote:
> [AMD Official Use Only - General]
> 
> To not spawn any new problems, we can go ahead with option 2.  i.e., "do not apply it to processors which are known to _not_ be affected by the problem."
> 

Sounds good to me.

Guenter

> Thanks
> - Baski
> 
> -----Original Message-----
> From: Guenter Roeck <groeck7@gmail.com> On Behalf Of Guenter Roeck
> Sent: Thursday, June 8, 2023 1:03 PM
> To: Kannan, Baski <Baski.Kannan@amd.com>
> Cc: Moger, Babu <Babu.Moger@amd.com>; clemens@ladisch.de; jdelvare@suse.com; linux-hwmon@vger.kernel.org; linux-kernel@vger.kernel.org; Ramayanam, Pavan <Pavan.Ramayanam@amd.com>
> Subject: Re: [PATCH] hwmon: (k10temp) Report negative temperatures
> 
> Caution: This message originated from an External Source. Use proper caution when opening attachments, clicking links, or responding.
> 
> 
> On 6/8/23 10:09, Kannan, Baski wrote:
>> [AMD Official Use Only - General]
>>
>> The patch you have mentioned, aef17ca12719, sounds like a work-around for a problem found in some Ryzen Threadripper processors.
>> If I understand correctly, this work-around (aef17ca12719) has been provided as a blanket fix for all the processors.
>>
> 
> Due to lack of better knowledge and understanding, yes. See https://github.com/lm-sensors/lm-sensors/issues/70. That doesn't mean that a blanket revert would be appropriate.
> 
>> The Industrial Processor in question is the Epyc3k i3255.
>> AMD Family 17h (boot_cpu_data.x86)
>> AMD model 00h - 0fh (boot_cpu_data.x86_model) Model Name - contains
>> string "3255"
>>
>> It supports temperature ranging from -40 degree Celsius to 105 deg Celsius.
>> We have customers' machines running at -20 deg Celsius. They require that the correct temperature be passed to their tools.
>>
> 
> We have two options: Either limit the workaround to the list of processors which may be affected by the original problem, or do not apply it to processors which are known to _not_ be affected by the problem. Either can easily be implemented by adding a flag to struct k10temp_data and setting it in the probe function.
> 
> No one outside AMD knows which processors may or may not be affected by the original problem. It was reported on 1950X at the time, but it may exist on all processors with the ability to set Sense MI Skew (and possibly Sense MI Offset), whatever that is. With that in mind, the fix will have to be provided by AMD.
> 
> Guenter
> 
>> -----Original Message-----
>> From: Guenter Roeck <groeck7@gmail.com> On Behalf Of Guenter Roeck
>> Sent: Thursday, June 8, 2023 8:52 AM
>> To: Kannan, Baski <Baski.Kannan@amd.com>
>> Cc: Moger, Babu <Babu.Moger@amd.com>; clemens@ladisch.de;
>> jdelvare@suse.com; linux-hwmon@vger.kernel.org;
>> linux-kernel@vger.kernel.org
>> Subject: Re: [PATCH] hwmon: (k10temp) Report negative temperatures
>>
>> Caution: This message originated from an External Source. Use proper caution when opening attachments, clicking links, or responding.
>>
>>
>> On Tue, May 23, 2023 at 02:46:46PM -0700, Guenter Roeck wrote:
>>> On Tue, May 23, 2023 at 03:49:32PM -0500, Baskaran Kannan wrote:
>>>> Currently, the tctl and die temperatures are rounded off to zero if
>>>> they are less than 0. There are industrial processors which work
>>>> below zero.
>>>
>>> This was introduced with commit aef17ca12719 ("hwmon: (k10temp) Only
>>> apply temperature offset if result is positive"). This patch would
>>> effecively revert that change. Given the reason for introducing it I
>>> am not convinced that it is a good idea to unconditionally revert it.
>>>
>>
>> Any comments ? I am not inclined to accept this patch as-is. What are the industrial processors ? Is there a means to detect them ?
>>
>> Guenter
>>
>>> Guenter
>>>
>>>>
>>>> To display the correct temperature remove the rounding off.
>>>>
>>>> Signed-off-by: Baskaran Kannan <Baski.Kannan@amd.com>
>>>> ---
>>>>    drivers/hwmon/k10temp.c | 4 ----
>>>>    1 file changed, 4 deletions(-)
>>>>
>>>> diff --git a/drivers/hwmon/k10temp.c b/drivers/hwmon/k10temp.c index
>>>> 7b177b9fbb09..489ad0b1bc74 100644
>>>> --- a/drivers/hwmon/k10temp.c
>>>> +++ b/drivers/hwmon/k10temp.c
>>>> @@ -204,13 +204,9 @@ static int k10temp_read_temp(struct device *dev, u32 attr, int channel,
>>>>               switch (channel) {
>>>>               case 0:         /* Tctl */
>>>>                       *val = get_raw_temp(data);
>>>> -                   if (*val < 0)
>>>> -                           *val = 0;
>>>>                       break;
>>>>               case 1:         /* Tdie */
>>>>                       *val = get_raw_temp(data) - data->temp_offset;
>>>> -                   if (*val < 0)
>>>> -                           *val = 0;
>>>>                       break;
>>>>               case 2 ... 13:          /* Tccd{1-12} */
>>>>
>>>> amd_smn_read(amd_pci_dev_to_node_id(data->pdev),
>>>> --
>>>> 2.25.1
>>>>
>
diff mbox series

Patch

diff --git a/drivers/hwmon/k10temp.c b/drivers/hwmon/k10temp.c
index 7b177b9fbb09..489ad0b1bc74 100644
--- a/drivers/hwmon/k10temp.c
+++ b/drivers/hwmon/k10temp.c
@@ -204,13 +204,9 @@  static int k10temp_read_temp(struct device *dev, u32 attr, int channel,
 		switch (channel) {
 		case 0:		/* Tctl */
 			*val = get_raw_temp(data);
-			if (*val < 0)
-				*val = 0;
 			break;
 		case 1:		/* Tdie */
 			*val = get_raw_temp(data) - data->temp_offset;
-			if (*val < 0)
-				*val = 0;
 			break;
 		case 2 ... 13:		/* Tccd{1-12} */
 			amd_smn_read(amd_pci_dev_to_node_id(data->pdev),