diff mbox

cpufreq: governor: Fix prev_load initialization in cpufreq_governor_start()

Message ID 5284094.6jVZZT3rEB@vostro.rjw.lan (mailing list archive)
State Accepted, archived
Delegated to: Rafael Wysocki
Headers show

Commit Message

Rafael J. Wysocki April 25, 2016, 1:07 a.m. UTC
From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

The way cpufreq_governor_start() initializes j_cdbs->prev_load is
questionable.

First off, j_cdbs->prev_cpu_wall used as a denominator in the
computation may be zero.  The case this happens is when
get_cpu_idle_time_us() returns -1 and get_cpu_idle_time_jiffy()
used to return that number is called exactly at the jiffies_64
wrap time.  It is rather hard to trigger that error, but it is not
impossible and it will just crash the kernel then.

Second, j_cdbs->prev_load is computed as the average load during
the entire time since the system started and it may not reflect the
load in the previous sampling period (as it is expected to).
That doesn't play well with the way dbs_update() uses that value.
Namely, if the update time delta (wall_time) happens do be greater
than twice the sampling rate on the first invocation of it, the
initial value of j_cdbs->prev_load (which may be completely off) will
be returned to the caller as the current load (unless it is equal to
zero and unless another CPU sharing the same policy object has a
greater load value).

For this reason, notice that the prev_load field of struct cpu_dbs_info
is only used by dbs_update() and only in that one place, so if
cpufreq_governor_start() is modified to always initialize it to 0,
it will make dbs_update() always compute the actual load first time
it checks the update time delta against the doubled sampling rate
(after initialization) and there won't be any side effects of it.

Consequently, modify cpufreq_governor_start() as described.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
 drivers/cpufreq/cpufreq_governor.c |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)


--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Viresh Kumar April 25, 2016, 4:14 a.m. UTC | #1
On 25-04-16, 03:07, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> 
> The way cpufreq_governor_start() initializes j_cdbs->prev_load is
> questionable.
> 
> First off, j_cdbs->prev_cpu_wall used as a denominator in the
> computation may be zero.  The case this happens is when
> get_cpu_idle_time_us() returns -1 and get_cpu_idle_time_jiffy()
> used to return that number is called exactly at the jiffies_64
> wrap time.  It is rather hard to trigger that error, but it is not
> impossible and it will just crash the kernel then.
> 
> Second, j_cdbs->prev_load is computed as the average load during
> the entire time since the system started and it may not reflect the
> load in the previous sampling period (as it is expected to).
> That doesn't play well with the way dbs_update() uses that value.
> Namely, if the update time delta (wall_time) happens do be greater
> than twice the sampling rate on the first invocation of it, the
> initial value of j_cdbs->prev_load (which may be completely off) will
> be returned to the caller as the current load (unless it is equal to
> zero and unless another CPU sharing the same policy object has a
> greater load value).
> 
> For this reason, notice that the prev_load field of struct cpu_dbs_info
> is only used by dbs_update() and only in that one place, so if
> cpufreq_governor_start() is modified to always initialize it to 0,
> it will make dbs_update() always compute the actual load first time
> it checks the update time delta against the doubled sampling rate
> (after initialization) and there won't be any side effects of it.
> 
> Consequently, modify cpufreq_governor_start() as described.
> 
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
>  drivers/cpufreq/cpufreq_governor.c |    8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> Index: linux-pm/drivers/cpufreq/cpufreq_governor.c
> ===================================================================
> --- linux-pm.orig/drivers/cpufreq/cpufreq_governor.c
> +++ linux-pm/drivers/cpufreq/cpufreq_governor.c
> @@ -508,12 +508,12 @@ static int cpufreq_governor_start(struct
>  
>  	for_each_cpu(j, policy->cpus) {
>  		struct cpu_dbs_info *j_cdbs = &per_cpu(cpu_dbs, j);
> -		unsigned int prev_load;
>  
>  		j_cdbs->prev_cpu_idle = get_cpu_idle_time(j, &j_cdbs->prev_cpu_wall, io_busy);
> -
> -		prev_load = j_cdbs->prev_cpu_wall - j_cdbs->prev_cpu_idle;
> -		j_cdbs->prev_load = 100 * prev_load / (unsigned int)j_cdbs->prev_cpu_wall;
> +		/*
> +		 * Make the first invocation of dbs_update() compute the load.
> +		 */
> +		j_cdbs->prev_load = 0;
>  
>  		if (ignore_nice)
>  			j_cdbs->prev_cpu_nice = kcpustat_cpu(j).cpustat[CPUTIME_NICE];

I tried to understand why the

commit 18b46abd0009 ("cpufreq: governor: Be friendly towards
latency-sensitive bursty workloads")

modify the START section and added this stuff and I completely failed
to understand it now. Do you remember why was this added at all ?
Rafael J. Wysocki April 25, 2016, 11:24 a.m. UTC | #2
On Mon, Apr 25, 2016 at 6:14 AM, Viresh Kumar <viresh.kumar@linaro.org> wrote:
> On 25-04-16, 03:07, Rafael J. Wysocki wrote:
>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>
>> The way cpufreq_governor_start() initializes j_cdbs->prev_load is
>> questionable.
>>
>> First off, j_cdbs->prev_cpu_wall used as a denominator in the
>> computation may be zero.  The case this happens is when
>> get_cpu_idle_time_us() returns -1 and get_cpu_idle_time_jiffy()
>> used to return that number is called exactly at the jiffies_64
>> wrap time.  It is rather hard to trigger that error, but it is not
>> impossible and it will just crash the kernel then.
>>
>> Second, j_cdbs->prev_load is computed as the average load during
>> the entire time since the system started and it may not reflect the
>> load in the previous sampling period (as it is expected to).
>> That doesn't play well with the way dbs_update() uses that value.
>> Namely, if the update time delta (wall_time) happens do be greater
>> than twice the sampling rate on the first invocation of it, the
>> initial value of j_cdbs->prev_load (which may be completely off) will
>> be returned to the caller as the current load (unless it is equal to
>> zero and unless another CPU sharing the same policy object has a
>> greater load value).
>>
>> For this reason, notice that the prev_load field of struct cpu_dbs_info
>> is only used by dbs_update() and only in that one place, so if
>> cpufreq_governor_start() is modified to always initialize it to 0,
>> it will make dbs_update() always compute the actual load first time
>> it checks the update time delta against the doubled sampling rate
>> (after initialization) and there won't be any side effects of it.
>>
>> Consequently, modify cpufreq_governor_start() as described.
>>
>> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>> ---
>>  drivers/cpufreq/cpufreq_governor.c |    8 ++++----
>>  1 file changed, 4 insertions(+), 4 deletions(-)
>>
>> Index: linux-pm/drivers/cpufreq/cpufreq_governor.c
>> ===================================================================
>> --- linux-pm.orig/drivers/cpufreq/cpufreq_governor.c
>> +++ linux-pm/drivers/cpufreq/cpufreq_governor.c
>> @@ -508,12 +508,12 @@ static int cpufreq_governor_start(struct
>>
>>       for_each_cpu(j, policy->cpus) {
>>               struct cpu_dbs_info *j_cdbs = &per_cpu(cpu_dbs, j);
>> -             unsigned int prev_load;
>>
>>               j_cdbs->prev_cpu_idle = get_cpu_idle_time(j, &j_cdbs->prev_cpu_wall, io_busy);
>> -
>> -             prev_load = j_cdbs->prev_cpu_wall - j_cdbs->prev_cpu_idle;
>> -             j_cdbs->prev_load = 100 * prev_load / (unsigned int)j_cdbs->prev_cpu_wall;
>> +             /*
>> +              * Make the first invocation of dbs_update() compute the load.
>> +              */
>> +             j_cdbs->prev_load = 0;
>>
>>               if (ignore_nice)
>>                       j_cdbs->prev_cpu_nice = kcpustat_cpu(j).cpustat[CPUTIME_NICE];
>
> I tried to understand why the
>
> commit 18b46abd0009 ("cpufreq: governor: Be friendly towards
> latency-sensitive bursty workloads")
>
> modify the START section and added this stuff and I completely failed
> to understand it now. Do you remember why was this added at all ?

The big comment in dbs_update() explains it, but not the initialization part.

I guess the initialization tried to be smart and avoid the "almost
zero load" effect in cases when the CPU is idle to start with, but
that's questionable as explained in my changelog.  I guess I should
add a "Fixes:" tag for that commit to the patch. :-)
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Viresh Kumar April 25, 2016, 11:27 a.m. UTC | #3
On 25-04-16, 13:24, Rafael J. Wysocki wrote:
> On Mon, Apr 25, 2016 at 6:14 AM, Viresh Kumar <viresh.kumar@linaro.org> wrote:
> > On 25-04-16, 03:07, Rafael J. Wysocki wrote:
> >> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >>
> >> The way cpufreq_governor_start() initializes j_cdbs->prev_load is
> >> questionable.
> >>
> >> First off, j_cdbs->prev_cpu_wall used as a denominator in the
> >> computation may be zero.  The case this happens is when
> >> get_cpu_idle_time_us() returns -1 and get_cpu_idle_time_jiffy()
> >> used to return that number is called exactly at the jiffies_64
> >> wrap time.  It is rather hard to trigger that error, but it is not
> >> impossible and it will just crash the kernel then.
> >>
> >> Second, j_cdbs->prev_load is computed as the average load during
> >> the entire time since the system started and it may not reflect the
> >> load in the previous sampling period (as it is expected to).
> >> That doesn't play well with the way dbs_update() uses that value.
> >> Namely, if the update time delta (wall_time) happens do be greater
> >> than twice the sampling rate on the first invocation of it, the
> >> initial value of j_cdbs->prev_load (which may be completely off) will
> >> be returned to the caller as the current load (unless it is equal to
> >> zero and unless another CPU sharing the same policy object has a
> >> greater load value).
> >>
> >> For this reason, notice that the prev_load field of struct cpu_dbs_info
> >> is only used by dbs_update() and only in that one place, so if
> >> cpufreq_governor_start() is modified to always initialize it to 0,
> >> it will make dbs_update() always compute the actual load first time
> >> it checks the update time delta against the doubled sampling rate
> >> (after initialization) and there won't be any side effects of it.
> >>
> >> Consequently, modify cpufreq_governor_start() as described.
> >>
> >> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >> ---
> >>  drivers/cpufreq/cpufreq_governor.c |    8 ++++----
> >>  1 file changed, 4 insertions(+), 4 deletions(-)
> >>
> >> Index: linux-pm/drivers/cpufreq/cpufreq_governor.c
> >> ===================================================================
> >> --- linux-pm.orig/drivers/cpufreq/cpufreq_governor.c
> >> +++ linux-pm/drivers/cpufreq/cpufreq_governor.c
> >> @@ -508,12 +508,12 @@ static int cpufreq_governor_start(struct
> >>
> >>       for_each_cpu(j, policy->cpus) {
> >>               struct cpu_dbs_info *j_cdbs = &per_cpu(cpu_dbs, j);
> >> -             unsigned int prev_load;
> >>
> >>               j_cdbs->prev_cpu_idle = get_cpu_idle_time(j, &j_cdbs->prev_cpu_wall, io_busy);
> >> -
> >> -             prev_load = j_cdbs->prev_cpu_wall - j_cdbs->prev_cpu_idle;
> >> -             j_cdbs->prev_load = 100 * prev_load / (unsigned int)j_cdbs->prev_cpu_wall;
> >> +             /*
> >> +              * Make the first invocation of dbs_update() compute the load.
> >> +              */
> >> +             j_cdbs->prev_load = 0;
> >>
> >>               if (ignore_nice)
> >>                       j_cdbs->prev_cpu_nice = kcpustat_cpu(j).cpustat[CPUTIME_NICE];
> >
> > I tried to understand why the
> >
> > commit 18b46abd0009 ("cpufreq: governor: Be friendly towards
> > latency-sensitive bursty workloads")
> >
> > modify the START section and added this stuff and I completely failed
> > to understand it now. Do you remember why was this added at all ?
> 
> The big comment in dbs_update() explains it, but not the initialization part.
> 
> I guess the initialization tried to be smart and avoid the "almost
> zero load" effect in cases when the CPU is idle to start with, but
> that's questionable as explained in my changelog.  I guess I should
> add a "Fixes:" tag for that commit to the patch. :-)

Acked-by: Viresh Kumar <viresh.kumar@linaro.org>

:)
Chen Yu April 25, 2016, 3:45 p.m. UTC | #4
SGksDQoNCj4gLS0tLS1PcmlnaW5hbCBNZXNzYWdlLS0tLS0NCj4gRnJvbTogUmFmYWVsIEouIFd5
c29ja2kgW21haWx0bzpyandAcmp3eXNvY2tpLm5ldF0NCj4gU2VudDogTW9uZGF5LCBBcHJpbCAy
NSwgMjAxNiA5OjA4IEFNDQo+IFRvOiBMaW51eCBQTSBsaXN0DQo+IENjOiBMaW51eCBLZXJuZWwg
TWFpbGluZyBMaXN0OyBWaXJlc2ggS3VtYXI7IENoZW4sIFl1IEMNCj4gU3ViamVjdDogW1BBVENI
XSBjcHVmcmVxOiBnb3Zlcm5vcjogRml4IHByZXZfbG9hZCBpbml0aWFsaXphdGlvbiBpbg0KPiBj
cHVmcmVxX2dvdmVybm9yX3N0YXJ0KCkNCj4gDQo+IEZyb206IFJhZmFlbCBKLiBXeXNvY2tpIDxy
YWZhZWwuai53eXNvY2tpQGludGVsLmNvbT4NCj4gDQo+IFRoZSB3YXkgY3B1ZnJlcV9nb3Zlcm5v
cl9zdGFydCgpIGluaXRpYWxpemVzIGpfY2Ricy0+cHJldl9sb2FkIGlzIHF1ZXN0aW9uYWJsZS4N
Cj4gDQo+IEZpcnN0IG9mZiwgal9jZGJzLT5wcmV2X2NwdV93YWxsIHVzZWQgYXMgYSBkZW5vbWlu
YXRvciBpbiB0aGUgY29tcHV0YXRpb24gbWF5DQo+IGJlIHplcm8uICBUaGUgY2FzZSB0aGlzIGhh
cHBlbnMgaXMgd2hlbg0KPiBnZXRfY3B1X2lkbGVfdGltZV91cygpIHJldHVybnMgLTEgYW5kIGdl
dF9jcHVfaWRsZV90aW1lX2ppZmZ5KCkgdXNlZCB0byByZXR1cm4NCj4gdGhhdCBudW1iZXIgaXMg
Y2FsbGVkIGV4YWN0bHkgYXQgdGhlIGppZmZpZXNfNjQgd3JhcCB0aW1lLiAgSXQgaXMgcmF0aGVy
IGhhcmQgdG8NCj4gdHJpZ2dlciB0aGF0IGVycm9yLCBidXQgaXQgaXMgbm90IGltcG9zc2libGUg
YW5kIGl0IHdpbGwganVzdCBjcmFzaCB0aGUga2VybmVsIHRoZW4uDQo+IA0KPiBTZWNvbmQsIGpf
Y2Ricy0+cHJldl9sb2FkIGlzIGNvbXB1dGVkIGFzIHRoZSBhdmVyYWdlIGxvYWQgZHVyaW5nIHRo
ZSBlbnRpcmUNCj4gdGltZSBzaW5jZSB0aGUgc3lzdGVtIHN0YXJ0ZWQgYW5kIGl0IG1heSBub3Qg
cmVmbGVjdCB0aGUgbG9hZCBpbiB0aGUgcHJldmlvdXMNCj4gc2FtcGxpbmcgcGVyaW9kIChhcyBp
dCBpcyBleHBlY3RlZCB0bykuDQo+IFRoYXQgZG9lc24ndCBwbGF5IHdlbGwgd2l0aCB0aGUgd2F5
IGRic191cGRhdGUoKSB1c2VzIHRoYXQgdmFsdWUuDQo+IE5hbWVseSwgaWYgdGhlIHVwZGF0ZSB0
aW1lIGRlbHRhICh3YWxsX3RpbWUpIGhhcHBlbnMgZG8gYmUgZ3JlYXRlciB0aGFuIHR3aWNlDQpo
YXBwZW5zIHMvZG8vdG8gYmU/DQo+IHRoZSBzYW1wbGluZyByYXRlIG9uIHRoZSBmaXJzdCBpbnZv
Y2F0aW9uIG9mIGl0LCB0aGUgaW5pdGlhbCB2YWx1ZSBvZiBqX2NkYnMtDQo+ID5wcmV2X2xvYWQg
KHdoaWNoIG1heSBiZSBjb21wbGV0ZWx5IG9mZikgd2lsbCBiZSByZXR1cm5lZCB0byB0aGUgY2Fs
bGVyIGFzIHRoZQ0KPiBjdXJyZW50IGxvYWQgKHVubGVzcyBpdCBpcyBlcXVhbCB0byB6ZXJvIGFu
ZCB1bmxlc3MgYW5vdGhlciBDUFUgc2hhcmluZyB0aGUgc2FtZQ0KPiBwb2xpY3kgb2JqZWN0IGhh
cyBhIGdyZWF0ZXIgbG9hZCB2YWx1ZSkuDQo+IA0KPiBGb3IgdGhpcyByZWFzb24sIG5vdGljZSB0
aGF0IHRoZSBwcmV2X2xvYWQgZmllbGQgb2Ygc3RydWN0IGNwdV9kYnNfaW5mbyBpcyBvbmx5DQo+
IHVzZWQgYnkgZGJzX3VwZGF0ZSgpIGFuZCBvbmx5IGluIHRoYXQgb25lIHBsYWNlLCBzbyBpZg0K
PiBjcHVmcmVxX2dvdmVybm9yX3N0YXJ0KCkgaXMgbW9kaWZpZWQgdG8gYWx3YXlzIGluaXRpYWxp
emUgaXQgdG8gMCwgaXQgd2lsbCBtYWtlDQo+IGRic191cGRhdGUoKSBhbHdheXMgY29tcHV0ZSB0
aGUgYWN0dWFsIGxvYWQgZmlyc3QgdGltZSBpdCBjaGVja3MgdGhlIHVwZGF0ZQ0KPiB0aW1lIGRl
bHRhIGFnYWluc3QgdGhlIGRvdWJsZWQgc2FtcGxpbmcgcmF0ZSAoYWZ0ZXIgaW5pdGlhbGl6YXRp
b24pIGFuZCB0aGVyZQ0KPiB3b24ndCBiZSBhbnkgc2lkZSBlZmZlY3RzIG9mIGl0Lg0KPiANCj4g
Q29uc2VxdWVudGx5LCBtb2RpZnkgY3B1ZnJlcV9nb3Zlcm5vcl9zdGFydCgpIGFzIGRlc2NyaWJl
ZC4NCj4gDQo+IFNpZ25lZC1vZmYtYnk6IFJhZmFlbCBKLiBXeXNvY2tpIDxyYWZhZWwuai53eXNv
Y2tpQGludGVsLmNvbT4NCj4gLS0tDQoNCkFja2VkLWJ5OiBDaGVuIFl1IDx5dS5jLmNoZW5AaW50
ZWwuY29tPg0KDQp0aGFua3MsDQpZdQ0K
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

Index: linux-pm/drivers/cpufreq/cpufreq_governor.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_governor.c
+++ linux-pm/drivers/cpufreq/cpufreq_governor.c
@@ -508,12 +508,12 @@  static int cpufreq_governor_start(struct
 
 	for_each_cpu(j, policy->cpus) {
 		struct cpu_dbs_info *j_cdbs = &per_cpu(cpu_dbs, j);
-		unsigned int prev_load;
 
 		j_cdbs->prev_cpu_idle = get_cpu_idle_time(j, &j_cdbs->prev_cpu_wall, io_busy);
-
-		prev_load = j_cdbs->prev_cpu_wall - j_cdbs->prev_cpu_idle;
-		j_cdbs->prev_load = 100 * prev_load / (unsigned int)j_cdbs->prev_cpu_wall;
+		/*
+		 * Make the first invocation of dbs_update() compute the load.
+		 */
+		j_cdbs->prev_load = 0;
 
 		if (ignore_nice)
 			j_cdbs->prev_cpu_nice = kcpustat_cpu(j).cpustat[CPUTIME_NICE];