diff mbox series

cpufreq: governor: Fix negative 'idle_time' handling in dbs_update()

Message ID 20250210130659.3533182-1-zhanjie9@hisilicon.com (mailing list archive)
State Superseded, archived
Headers show
Series cpufreq: governor: Fix negative 'idle_time' handling in dbs_update() | expand

Commit Message

Jie Zhan Feb. 10, 2025, 1:06 p.m. UTC
We observed an issue that the cpu frequency can't raise up with a 100% cpu
load when nohz is off and the 'conservative' governor is selected.

'idle_time' can be negative if it's obtained from get_cpu_idle_time_jiffy()
when nohz is off.  This was found and explained in commit 9485e4ca0b48
("cpufreq: governor: Fix handling of special cases in dbs_update()").

However, commit 7592019634f8 ("cpufreq: governors: Fix long idle detection
logic in load calculation") introduced a comparison between 'idle_time' and
'samling_rate' to detect a long idle interval.  While 'idle_time' is
converted to int before comparison, it's actually promoted to unsigned
again when compared with an unsigned 'sampling_rate'.  Hence, this leads to
wrong idle interval detection when it's in fact 100% busy and sets
policy_dbs->idle_periods to a very large value.  'conservative' adjusts the
frequency to minimum because of the large 'idle_periods', such that the
frequency can't raise up.  'Ondemand' doesn't use policy_dbs->idle_periods
so it fortunately avoids the issue.

Modify negative 'idle_time' to 0 before any use of it in dbs_update().

Fixes: 7592019634f8 ("cpufreq: governors: Fix long idle detection logic in load calculation")
Signed-off-by: Jie Zhan <zhanjie9@hisilicon.com>
---
 drivers/cpufreq/cpufreq_governor.c | 41 ++++++++++++++----------------
 1 file changed, 19 insertions(+), 22 deletions(-)

Comments

Chen Yu Feb. 10, 2025, 4:14 p.m. UTC | #1
Hi,

> -----Original Message-----
> From: Jie Zhan <zhanjie9@hisilicon.com>
> Sent: Monday, February 10, 2025 9:07 PM
> To: rafael@kernel.org; viresh.kumar@linaro.org
> Cc: linux-pm@vger.kernel.org; Chen, Yu C <yu.c.chen@intel.com>;
> linuxarm@huawei.com; jonathan.cameron@huawei.com;
> zhanjie9@hisilicon.com; zhenglifeng1@huawei.com; lihuisong@huawei.com;
> wanghuiqiang@huawei.com; fanghao11@huawei.com;
> prime.zeng@hisilicon.com
> Subject: [PATCH] cpufreq: governor: Fix negative 'idle_time' handling in
> dbs_update()
> 
> We observed an issue that the cpu frequency can't raise up with a 100% cpu
> load when nohz is off and the 'conservative' governor is selected.
> 
> 'idle_time' can be negative if it's obtained from get_cpu_idle_time_jiffy()
> when nohz is off.  This was found and explained in commit 9485e4ca0b48
> ("cpufreq: governor: Fix handling of special cases in dbs_update()").
> 
> However, commit 7592019634f8 ("cpufreq: governors: Fix long idle detection
> logic in load calculation") introduced a comparison between 'idle_time' and
> 'samling_rate' to detect a long idle interval.  While 'idle_time' is converted to
> int before comparison, it's actually promoted to unsigned again when
> compared with an unsigned 'sampling_rate'.  Hence, this leads to wrong idle
> interval detection when it's in fact 100% busy and sets policy_dbs-
> >idle_periods to a very large value.  'conservative' adjusts the frequency to
> minimum because of the large 'idle_periods', such that the frequency can't
> raise up.  'Ondemand' doesn't use policy_dbs->idle_periods so it fortunately
> avoids the issue.
> 
> Modify negative 'idle_time' to 0 before any use of it in dbs_update().
> 
> Fixes: 7592019634f8 ("cpufreq: governors: Fix long idle detection logic in load
> calculation")
> Signed-off-by: Jie Zhan <zhanjie9@hisilicon.com>
> ---
>  drivers/cpufreq/cpufreq_governor.c | 41 ++++++++++++++----------------
>  1 file changed, 19 insertions(+), 22 deletions(-)
> 
> diff --git a/drivers/cpufreq/cpufreq_governor.c
> b/drivers/cpufreq/cpufreq_governor.c
> index af44ee6a6430..cd045ca2268c 100644
> --- a/drivers/cpufreq/cpufreq_governor.c
> +++ b/drivers/cpufreq/cpufreq_governor.c
> @@ -147,6 +147,20 @@ unsigned int dbs_update(struct cpufreq_policy
> *policy)
> 
>  		idle_time = cur_idle_time - j_cdbs->prev_cpu_idle;
>  		j_cdbs->prev_cpu_idle = cur_idle_time;
> +		/*
> +		 * idle_time can be negative if cur_idle_time is returned by
> +		 * get_cpu_idle_time_jiffy() when NO_HZ is off.  In that case
> +		 * idle_time is roughly equal to the difference between
> +		 * time_elapsed and "busy time" obtained from CPU statistics.
> +		 * Then, the "busy time" can end up being greater than
> +		 * time_elapsed (for example, if jiffies_64 and the CPU
> +		 * statistics are updated by different CPUs), so idle_time may
> +		 * in fact be negative.  That means, though, that the CPU was
> +		 * busy all the time (on the rough average) during the last
> +		 * sampling interval, so idle_time should be regarded as 0 in
> +		 * order to make the further process correct.
> +		 */
> +		idle_time = (int)idle_time < 0 ? 0 : idle_time;

Thanks for catching this. I was wondering if it would be safer to be converted into something below:
idle_time = cur_idle_time  >  j_cdbs->prev_cpu_idle  ? cur_idle_time - j_cdbs->prev_cpu_idle : 0;
Because I was thinking if comparing (int)idle_time with X (X could be any type)
is compiler dependent?

thanks,
Chenyu

> 
>  		if (ignore_nice) {
>  			u64 cur_nice = kcpustat_field(&kcpustat_cpu(j),
> CPUTIME_NICE, j); @@ -162,7 +176,7 @@ unsigned int dbs_update(struct
> cpufreq_policy *policy)
>  			 * calls, so the previous load value can be used then.
>  			 */
>  			load = j_cdbs->prev_load;
> -		} else if (unlikely((int)idle_time > 2 * sampling_rate &&
> +		} else if (unlikely(idle_time > 2 * sampling_rate &&
>  				    j_cdbs->prev_load)) {
>  			/*
>  			 * If the CPU had gone completely idle and a task has
> @@ -189,30 +203,13 @@ unsigned int dbs_update(struct cpufreq_policy
> *policy)
>  			load = j_cdbs->prev_load;
>  			j_cdbs->prev_load = 0;
>  		} else {
> -			if (time_elapsed >= idle_time) {
> -				load = 100 * (time_elapsed - idle_time) /
> time_elapsed;
> -			} else {
> -				/*
> -				 * That can happen if idle_time is returned by
> -				 * get_cpu_idle_time_jiffy().  In that case
> -				 * idle_time is roughly equal to the difference
> -				 * between time_elapsed and "busy time"
> obtained
> -				 * from CPU statistics.  Then, the "busy time"
> -				 * can end up being greater than
> time_elapsed
> -				 * (for example, if jiffies_64 and the CPU
> -				 * statistics are updated by different CPUs),
> -				 * so idle_time may in fact be negative.  That
> -				 * means, though, that the CPU was busy all
> -				 * the time (on the rough average) during the
> -				 * last sampling interval and 100 can be
> -				 * returned as the load.
> -				 */
> -				load = (int)idle_time < 0 ? 100 : 0;
> -			}
> +			load = time_elapsed > idle_time ?
> +			       100 * (time_elapsed - idle_time) / time_elapsed :
> +			       0;
>  			j_cdbs->prev_load = load;
>  		}
> 
> -		if (unlikely((int)idle_time > 2 * sampling_rate)) {
> +		if (unlikely(idle_time > 2 * sampling_rate)) {
>  			unsigned int periods = idle_time / sampling_rate;
> 
>  			if (periods < idle_periods)
> --
> 2.33.0
Jie Zhan Feb. 11, 2025, 12:48 p.m. UTC | #2
Hi Chen Yu,

On 11/02/2025 00:14, Chen, Yu C wrote:
> Hi,
> 
>> -----Original Message-----
>> From: Jie Zhan <zhanjie9@hisilicon.com>
>> Sent: Monday, February 10, 2025 9:07 PM
>> To: rafael@kernel.org; viresh.kumar@linaro.org
>> Cc: linux-pm@vger.kernel.org; Chen, Yu C <yu.c.chen@intel.com>;
>> linuxarm@huawei.com; jonathan.cameron@huawei.com;
>> zhanjie9@hisilicon.com; zhenglifeng1@huawei.com; lihuisong@huawei.com;
>> wanghuiqiang@huawei.com; fanghao11@huawei.com;
>> prime.zeng@hisilicon.com
>> Subject: [PATCH] cpufreq: governor: Fix negative 'idle_time' handling in
>> dbs_update()
>>
>> We observed an issue that the cpu frequency can't raise up with a 100% cpu
>> load when nohz is off and the 'conservative' governor is selected.
>>
>> 'idle_time' can be negative if it's obtained from get_cpu_idle_time_jiffy()
>> when nohz is off.  This was found and explained in commit 9485e4ca0b48
>> ("cpufreq: governor: Fix handling of special cases in dbs_update()").
>>
>> However, commit 7592019634f8 ("cpufreq: governors: Fix long idle detection
>> logic in load calculation") introduced a comparison between 'idle_time' and
>> 'samling_rate' to detect a long idle interval.  While 'idle_time' is converted to
>> int before comparison, it's actually promoted to unsigned again when
>> compared with an unsigned 'sampling_rate'.  Hence, this leads to wrong idle
>> interval detection when it's in fact 100% busy and sets policy_dbs-
>>> idle_periods to a very large value.  'conservative' adjusts the frequency to
>> minimum because of the large 'idle_periods', such that the frequency can't
>> raise up.  'Ondemand' doesn't use policy_dbs->idle_periods so it fortunately
>> avoids the issue.
>>
>> Modify negative 'idle_time' to 0 before any use of it in dbs_update().
>>
>> Fixes: 7592019634f8 ("cpufreq: governors: Fix long idle detection logic in load
>> calculation")
>> Signed-off-by: Jie Zhan <zhanjie9@hisilicon.com>
>> ---
>>  drivers/cpufreq/cpufreq_governor.c | 41 ++++++++++++++----------------
>>  1 file changed, 19 insertions(+), 22 deletions(-)
>>
>> diff --git a/drivers/cpufreq/cpufreq_governor.c
>> b/drivers/cpufreq/cpufreq_governor.c
>> index af44ee6a6430..cd045ca2268c 100644
>> --- a/drivers/cpufreq/cpufreq_governor.c
>> +++ b/drivers/cpufreq/cpufreq_governor.c
>> @@ -147,6 +147,20 @@ unsigned int dbs_update(struct cpufreq_policy
>> *policy)
>>
>>  		idle_time = cur_idle_time - j_cdbs->prev_cpu_idle;
>>  		j_cdbs->prev_cpu_idle = cur_idle_time;
>> +		/*
>> +		 * idle_time can be negative if cur_idle_time is returned by
>> +		 * get_cpu_idle_time_jiffy() when NO_HZ is off.  In that case
>> +		 * idle_time is roughly equal to the difference between
>> +		 * time_elapsed and "busy time" obtained from CPU statistics.
>> +		 * Then, the "busy time" can end up being greater than
>> +		 * time_elapsed (for example, if jiffies_64 and the CPU
>> +		 * statistics are updated by different CPUs), so idle_time may
>> +		 * in fact be negative.  That means, though, that the CPU was
>> +		 * busy all the time (on the rough average) during the last
>> +		 * sampling interval, so idle_time should be regarded as 0 in
>> +		 * order to make the further process correct.
>> +		 */
>> +		idle_time = (int)idle_time < 0 ? 0 : idle_time;
> 
> Thanks for catching this. I was wondering if it would be safer to be converted into something below:
> idle_time = cur_idle_time  >  j_cdbs->prev_cpu_idle  ? cur_idle_time - j_cdbs->prev_cpu_idle : 0;
> Because I was thinking if comparing (int)idle_time with X (X could be any type)
> is compiler dependent?
> 
> thanks,
> Chenyu

As far as I know, integer literals would be signed int in this case
regardless of the compiler used, but I'm happy to do what you said if
that's less confusing. 

Thanks,
Jie

...
diff mbox series

Patch

diff --git a/drivers/cpufreq/cpufreq_governor.c b/drivers/cpufreq/cpufreq_governor.c
index af44ee6a6430..cd045ca2268c 100644
--- a/drivers/cpufreq/cpufreq_governor.c
+++ b/drivers/cpufreq/cpufreq_governor.c
@@ -147,6 +147,20 @@  unsigned int dbs_update(struct cpufreq_policy *policy)
 
 		idle_time = cur_idle_time - j_cdbs->prev_cpu_idle;
 		j_cdbs->prev_cpu_idle = cur_idle_time;
+		/*
+		 * idle_time can be negative if cur_idle_time is returned by
+		 * get_cpu_idle_time_jiffy() when NO_HZ is off.  In that case
+		 * idle_time is roughly equal to the difference between
+		 * time_elapsed and "busy time" obtained from CPU statistics.
+		 * Then, the "busy time" can end up being greater than
+		 * time_elapsed (for example, if jiffies_64 and the CPU
+		 * statistics are updated by different CPUs), so idle_time may
+		 * in fact be negative.  That means, though, that the CPU was
+		 * busy all the time (on the rough average) during the last
+		 * sampling interval, so idle_time should be regarded as 0 in
+		 * order to make the further process correct.
+		 */
+		idle_time = (int)idle_time < 0 ? 0 : idle_time;
 
 		if (ignore_nice) {
 			u64 cur_nice = kcpustat_field(&kcpustat_cpu(j), CPUTIME_NICE, j);
@@ -162,7 +176,7 @@  unsigned int dbs_update(struct cpufreq_policy *policy)
 			 * calls, so the previous load value can be used then.
 			 */
 			load = j_cdbs->prev_load;
-		} else if (unlikely((int)idle_time > 2 * sampling_rate &&
+		} else if (unlikely(idle_time > 2 * sampling_rate &&
 				    j_cdbs->prev_load)) {
 			/*
 			 * If the CPU had gone completely idle and a task has
@@ -189,30 +203,13 @@  unsigned int dbs_update(struct cpufreq_policy *policy)
 			load = j_cdbs->prev_load;
 			j_cdbs->prev_load = 0;
 		} else {
-			if (time_elapsed >= idle_time) {
-				load = 100 * (time_elapsed - idle_time) / time_elapsed;
-			} else {
-				/*
-				 * That can happen if idle_time is returned by
-				 * get_cpu_idle_time_jiffy().  In that case
-				 * idle_time is roughly equal to the difference
-				 * between time_elapsed and "busy time" obtained
-				 * from CPU statistics.  Then, the "busy time"
-				 * can end up being greater than time_elapsed
-				 * (for example, if jiffies_64 and the CPU
-				 * statistics are updated by different CPUs),
-				 * so idle_time may in fact be negative.  That
-				 * means, though, that the CPU was busy all
-				 * the time (on the rough average) during the
-				 * last sampling interval and 100 can be
-				 * returned as the load.
-				 */
-				load = (int)idle_time < 0 ? 100 : 0;
-			}
+			load = time_elapsed > idle_time ?
+			       100 * (time_elapsed - idle_time) / time_elapsed :
+			       0;
 			j_cdbs->prev_load = load;
 		}
 
-		if (unlikely((int)idle_time > 2 * sampling_rate)) {
+		if (unlikely(idle_time > 2 * sampling_rate)) {
 			unsigned int periods = idle_time / sampling_rate;
 
 			if (periods < idle_periods)