Message ID: 1842158.0Xhak3Uaac@vostro.rjw.lan (mailing list archive)
State: Superseded, archived
Delegated to: Rafael Wysocki
Hi Rafael,

On 2 March 2016 at 03:27, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>
> Add a new cpufreq scaling governor, called "schedutil", that uses
> scheduler-provided CPU utilization information as input for making
> its decisions.
>
> Doing that is possible after commit fe7034338ba0 (cpufreq: Add
> mechanism for registering utilization update callbacks) that
> introduced cpufreq_update_util() called by the scheduler on
> utilization changes (from CFS) and RT/DL task status updates.
> In particular, CPU frequency scaling decisions may be based on
> the utilization data passed to cpufreq_update_util() by CFS.
>
> The new governor is relatively simple.
>
> The frequency selection formula used by it is essentially the same
> as the one used by the "ondemand" governor, although it doesn't use
> the additional up_threshold parameter, but instead of computing the
> load as the "non-idle CPU time" to "total CPU time" ratio, it takes
> the utilization data provided by CFS as input. More specifically,
> it represents "load" as the util/max ratio, where util and max
> are the utilization and CPU capacity coming from CFS.

[snip]

> +
> +static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time,
> +				unsigned long util, unsigned long max,
> +				unsigned int next_freq)
> +{
> +	struct cpufreq_policy *policy = sg_policy->policy;
> +	unsigned int rel;
> +
> +	if (next_freq > policy->max)
> +		next_freq = policy->max;
> +	else if (next_freq < policy->min)
> +		next_freq = policy->min;
> +
> +	sg_policy->last_freq_update_time = time;
> +	if (sg_policy->next_freq == next_freq)
> +		return;
> +
> +	sg_policy->next_freq = next_freq;
> +	/*
> +	 * If utilization is less than max / 4, use RELATION_C to allow the
> +	 * minimum frequency to be selected more often in case the distance from
> +	 * it to the next available frequency in the table is significant.
> +	 */
> +	rel = util < (max >> 2) ? CPUFREQ_RELATION_C : CPUFREQ_RELATION_L;
> +	if (policy->fast_switch_possible) {
> +		cpufreq_driver_fast_switch(policy, next_freq, rel);
> +	} else {
> +		sg_policy->relation = rel;
> +		sg_policy->work_in_progress = true;
> +		irq_work_queue(&sg_policy->irq_work);
> +	}
> +}
> +
> +static void sugov_update_single(struct update_util_data *data, u64 time,
> +				unsigned long util, unsigned long max)
> +{
> +	struct sugov_cpu *sg_cpu = container_of(data, struct sugov_cpu, update_util);
> +	struct sugov_policy *sg_policy = sg_cpu->sg_policy;
> +	unsigned int min_f, max_f, next_f;
> +
> +	if (!sugov_should_update_freq(sg_policy, time))
> +		return;
> +
> +	min_f = sg_policy->policy->cpuinfo.min_freq;
> +	max_f = sg_policy->policy->cpuinfo.max_freq;
> +	next_f = util > max ? max_f : min_f + util * (max_f - min_f) / max;

I think it has been pointed out in another email thread, but you should
change the way next_f is computed. util reflects the utilization of a
CPU from 0 to its max compute capacity, whereas ondemand was using the
load at the current frequency during the last time window. I understand
that you want to keep the same formula as ondemand as a starting point,
but you use a different input to calculate the next frequency, so I
don't see the rationale for keeping this formula.

That said, even the simple formula next_f = util > max ? max_f :
util * max_f / max will not work properly if frequency invariance is
enabled, because the utilization becomes capped by the current compute
capacity, so next_f will never be higher than the current frequency
(unless a task moves onto the rq). That was one reason for using a
threshold in the sched-freq proposal (and there is ongoing development
to try to solve this limitation).

IIUC, frequency invariance is not enabled on your platform, so you have
not seen the problem, but you have probably seen that the selection of
your next_f was not really stable. Without frequency invariance, the
utilization will be overestimated when running at a lower frequency, so
the governor will probably select a frequency that is higher than
necessary, but then the utilization will decrease at this higher
frequency, so the governor will probably decrease the frequency, and so
on, until it finds the frequency that generates the right utilization
value.

Regards,
Vincent

> +
> +	sugov_update_commit(sg_policy, time, util, max, next_f);
> +}
> +
> +static unsigned int sugov_next_freq(struct sugov_policy *sg_policy,
> +				    unsigned long util, unsigned long max)
> +{
> +	struct cpufreq_policy *policy = sg_policy->policy;
> +	unsigned int min_f = policy->cpuinfo.min_freq;
> +	unsigned int max_f = policy->cpuinfo.max_freq;
> +	u64 last_freq_update_time = sg_policy->last_freq_update_time;
> +	unsigned int j;
> +
> +	if (util > max)
> +		return max_f;
> +
> +	for_each_cpu(j, policy->cpus) {
> +		struct sugov_cpu *j_sg_cpu;
> +		unsigned long j_util, j_max;
> +		u64 delta_ns;
> +
> +		if (j == smp_processor_id())
> +			continue;
> +
> +		j_sg_cpu = &per_cpu(sugov_cpu, j);
> +		/*
> +		 * If the CPU utilization was last updated before the previous
> +		 * frequency update and the time elapsed between the last update
> +		 * of the CPU utilization and the last frequency update is long
> +		 * enough, don't take the CPU into account as it probably is
> +		 * idle now.
> +		 */
> +		delta_ns = last_freq_update_time - j_sg_cpu->last_update;
> +		if ((s64)delta_ns > NSEC_PER_SEC / HZ)
> +			continue;
> +
> +		j_util = j_sg_cpu->util;
> +		j_max = j_sg_cpu->max;
> +		if (j_util > j_max)
> +			return max_f;
> +
> +		if (j_util * max > j_max * util) {
> +			util = j_util;
> +			max = j_max;
> +		}
> +	}
> +
> +	return min_f + util * (max_f - min_f) / max;
> +}
> +

[snip]
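For reference, the cap Vincent describes follows directly from the
definitions: with frequency invariance enabled, the utilization observed
while running at frequency f satisfies

    util <= max * (f / max_f)

so the simple proportional formula is bounded by the current frequency:

    next_f = util * max_f / max <= f

In other words, without headroom, a threshold, or some other mechanism, a
busy CPU whose utilization is capped at the current OPP can never request
a higher one.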
On Wed, Mar 2, 2016 at 6:10 PM, Vincent Guittot <vincent.guittot@linaro.org> wrote: > Hi Rafael, > > > On 2 March 2016 at 03:27, Rafael J. Wysocki <rjw@rjwysocki.net> wrote: >> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> >> >> Add a new cpufreq scaling governor, called "schedutil", that uses >> scheduler-provided CPU utilization information as input for making >> its decisions. >> >> Doing that is possible after commit fe7034338ba0 (cpufreq: Add >> mechanism for registering utilization update callbacks) that >> introduced cpufreq_update_util() called by the scheduler on >> utilization changes (from CFS) and RT/DL task status updates. >> In particular, CPU frequency scaling decisions may be based on >> the the utilization data passed to cpufreq_update_util() by CFS. >> >> The new governor is relatively simple. >> >> The frequency selection formula used by it is essentially the same >> as the one used by the "ondemand" governor, although it doesn't use >> the additional up_threshold parameter, but instead of computing the >> load as the "non-idle CPU time" to "total CPU time" ratio, it takes >> the utilization data provided by CFS as input. More specifically, >> it represents "load" as the util/max ratio, where util and max >> are the utilization and CPU capacity coming from CFS. >> > > [snip] > >> + >> +static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time, >> + unsigned long util, unsigned long max, >> + unsigned int next_freq) >> +{ >> + struct cpufreq_policy *policy = sg_policy->policy; >> + unsigned int rel; >> + >> + if (next_freq > policy->max) >> + next_freq = policy->max; >> + else if (next_freq < policy->min) >> + next_freq = policy->min; >> + >> + sg_policy->last_freq_update_time = time; >> + if (sg_policy->next_freq == next_freq) >> + return; >> + >> + sg_policy->next_freq = next_freq; >> + /* >> + * If utilization is less than max / 4, use RELATION_C to allow the >> + * minimum frequency to be selected more often in case the distance from >> + * it to the next available frequency in the table is significant. >> + */ >> + rel = util < (max >> 2) ? CPUFREQ_RELATION_C : CPUFREQ_RELATION_L; >> + if (policy->fast_switch_possible) { >> + cpufreq_driver_fast_switch(policy, next_freq, rel); >> + } else { >> + sg_policy->relation = rel; >> + sg_policy->work_in_progress = true; >> + irq_work_queue(&sg_policy->irq_work); >> + } >> +} >> + >> +static void sugov_update_single(struct update_util_data *data, u64 time, >> + unsigned long util, unsigned long max) >> +{ >> + struct sugov_cpu *sg_cpu = container_of(data, struct sugov_cpu, update_util); >> + struct sugov_policy *sg_policy = sg_cpu->sg_policy; >> + unsigned int min_f, max_f, next_f; >> + >> + if (!sugov_should_update_freq(sg_policy, time)) >> + return; >> + >> + min_f = sg_policy->policy->cpuinfo.min_freq; >> + max_f = sg_policy->policy->cpuinfo.max_freq; >> + next_f = util > max ? max_f : min_f + util * (max_f - min_f) / max; > > I think it has been pointed out in another email's thread but you > should change the way the next_f is computed. util reflects the > utilization of a CPU from 0 to its max compute capacity whereas > ondemand was using the load at the current frequency during the last > time window. I have understood that you want to keep same formula than > ondemand as a starting point but you use a different input to > calculate the next frequency so i don't see the rational of keeping > this formula. 
It is a formula that causes the entire available frequency range to be utilized proportionally to the utilization as reported by the scheduler (modulo the policy->min/max limits). Its (significant IMO) advantage is that it doesn't require any additional factors that would need to be determined somehow. > Saying that, even the simple formula next_f = util > max > ? max_f : util * (max_f) / max will not work properly if the frequency > invariance is enable because the utilization becomes capped by the > current compute capacity so next_f will never be higher than current > freq (unless a task move on the rq). That was one reason of using a > threshold in sched-freq proposal (and there are on going dev to try to > solve this limitation). Well, a different formula will have to be used along with frequency invariance, then. > IIIUC, frequency invariance is not enable on your platform so you have > not seen the problem but you have probably see that selection of your > next_f was not really stable. Without frequency invariance, the > utilization will be overestimated when running at lower frequency so > the governor will probably select a frequency that is higher than > necessary but then the utilization will decrease at this higher > frequency so the governor will probably decrease the frequency and so > on until you found the right frequency that will generate the right > utilisation value I don't have any problems with that to be honest and if you aim at selecting the perfect frequency at the first attempt, then good luck with that anyway. Now, I'm not saying that the formula used in this patch cannot be improved or similar. It very well may be possible to improve it. I'm only saying that it is good enough to start with, because of the reasons mentioned above. Still, if you can suggest to me what other formula specifically should be used here, I'll consider using it. Which will probably mean comparing the two and seeing which one leads to better results. Thanks, Rafael -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Mar 2, 2016 at 6:58 PM, Rafael J. Wysocki <rafael@kernel.org> wrote: > On Wed, Mar 2, 2016 at 6:10 PM, Vincent Guittot > <vincent.guittot@linaro.org> wrote: >> Hi Rafael, >> >> >> On 2 March 2016 at 03:27, Rafael J. Wysocki <rjw@rjwysocki.net> wrote: >>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> >>> >>> Add a new cpufreq scaling governor, called "schedutil", that uses >>> scheduler-provided CPU utilization information as input for making >>> its decisions. >>> >>> Doing that is possible after commit fe7034338ba0 (cpufreq: Add >>> mechanism for registering utilization update callbacks) that >>> introduced cpufreq_update_util() called by the scheduler on >>> utilization changes (from CFS) and RT/DL task status updates. >>> In particular, CPU frequency scaling decisions may be based on >>> the the utilization data passed to cpufreq_update_util() by CFS. >>> >>> The new governor is relatively simple. >>> >>> The frequency selection formula used by it is essentially the same >>> as the one used by the "ondemand" governor, although it doesn't use >>> the additional up_threshold parameter, but instead of computing the >>> load as the "non-idle CPU time" to "total CPU time" ratio, it takes >>> the utilization data provided by CFS as input. More specifically, >>> it represents "load" as the util/max ratio, where util and max >>> are the utilization and CPU capacity coming from CFS. >>> >> >> [snip] >> >>> + >>> +static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time, >>> + unsigned long util, unsigned long max, >>> + unsigned int next_freq) >>> +{ >>> + struct cpufreq_policy *policy = sg_policy->policy; >>> + unsigned int rel; >>> + >>> + if (next_freq > policy->max) >>> + next_freq = policy->max; >>> + else if (next_freq < policy->min) >>> + next_freq = policy->min; >>> + >>> + sg_policy->last_freq_update_time = time; >>> + if (sg_policy->next_freq == next_freq) >>> + return; >>> + >>> + sg_policy->next_freq = next_freq; >>> + /* >>> + * If utilization is less than max / 4, use RELATION_C to allow the >>> + * minimum frequency to be selected more often in case the distance from >>> + * it to the next available frequency in the table is significant. >>> + */ >>> + rel = util < (max >> 2) ? CPUFREQ_RELATION_C : CPUFREQ_RELATION_L; >>> + if (policy->fast_switch_possible) { >>> + cpufreq_driver_fast_switch(policy, next_freq, rel); >>> + } else { >>> + sg_policy->relation = rel; >>> + sg_policy->work_in_progress = true; >>> + irq_work_queue(&sg_policy->irq_work); >>> + } >>> +} >>> + >>> +static void sugov_update_single(struct update_util_data *data, u64 time, >>> + unsigned long util, unsigned long max) >>> +{ >>> + struct sugov_cpu *sg_cpu = container_of(data, struct sugov_cpu, update_util); >>> + struct sugov_policy *sg_policy = sg_cpu->sg_policy; >>> + unsigned int min_f, max_f, next_f; >>> + >>> + if (!sugov_should_update_freq(sg_policy, time)) >>> + return; >>> + >>> + min_f = sg_policy->policy->cpuinfo.min_freq; >>> + max_f = sg_policy->policy->cpuinfo.max_freq; >>> + next_f = util > max ? max_f : min_f + util * (max_f - min_f) / max; >> >> I think it has been pointed out in another email's thread but you >> should change the way the next_f is computed. util reflects the >> utilization of a CPU from 0 to its max compute capacity whereas >> ondemand was using the load at the current frequency during the last >> time window. 
I have understood that you want to keep same formula than >> ondemand as a starting point but you use a different input to >> calculate the next frequency so i don't see the rational of keeping >> this formula. > > It is a formula that causes the entire available frequency range to be > utilized proportionally to the utilization as reported by the > scheduler (modulo the policy->min/max limits). Its (significant IMO) > advantage is that it doesn't require any additional factors that would > need to be determined somehow. In case a more formal derivation of this formula is needed, it is based on the following 3 assumptions: (1) Performance is a linear function of frequency. (2) Required performance is a linear function of the utilization ratio x = util/max as provided by the scheduler (0 <= x <= 1). (3) The minimum possible frequency (min_freq) corresponds to x = 0 and the maximum possible frequency (max_freq) corresponds to x = 1. (1) and (2) combined imply that f = a * x + b (f - frequency, a, b - constants to be determined) and then (3) quite trivially leads to b = min_freq and a = max_freq - min_freq. Now, of course, the linearity assumptions may be questioned, but then it's just the first approximation. If you go any further, though, you end up with an expansion series like this: f(x) = c_0 + c_1 * x + c_2 * x^2 + c_3 * x^3 + ... where all of the c_j need to be determined in principle. With luck, if you can guess what kind of a function f(x) may be, it may be possible to reduce the number of coefficients to determine, but question is whether or not that is going to work universally for all systems. Thanks, Rafael -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Mar 02, 2016 at 11:49:48PM +0100, Rafael J. Wysocki wrote:

> >>> +	min_f = sg_policy->policy->cpuinfo.min_freq;
> >>> +	max_f = sg_policy->policy->cpuinfo.max_freq;
> >>> +	next_f = util > max ? max_f : min_f + util * (max_f - min_f) / max;

> In case a more formal derivation of this formula is needed, it is
> based on the following 3 assumptions:
>
> (1) Performance is a linear function of frequency.
> (2) Required performance is a linear function of the utilization ratio
>     x = util/max as provided by the scheduler (0 <= x <= 1).
> (3) The minimum possible frequency (min_freq) corresponds to x = 0 and
>     the maximum possible frequency (max_freq) corresponds to x = 1.
>
> (1) and (2) combined imply that
>
>     f = a * x + b
>
> (f - frequency, a, b - constants to be determined) and then (3) quite
> trivially leads to b = min_freq and a = max_freq - min_freq.

3 is the problem, that just doesn't make sense and is probably the
reason why you see very little selection of the min freq.

Suppose a machine with the following frequencies:

	500, 750, 1000

And a utilization of 0.4, how does asking for 500 + 0.4 * (1000-500) =
700 make any sense? Per your point 1, it should be asking for
0.4 * 1000 = 400.

Because, per 1, at 500 it runs exactly half as fast as at 1000, and we
only need 0.4 times as much. Therefore 500 is more than sufficient.

Note: we all know that 1 is a 'broken' assumption, but lacking anything
better I think it's a reasonable one to make.
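For concreteness, a small standalone sketch (plain C, not kernel code;
the 500 and 1000 bounds come from Peter's hypothetical machine) that
tabulates the two mappings side by side:

	#include <stdio.h>

	int main(void)
	{
		const unsigned int min_f = 500, max_f = 1000;

		/* x is util/max scaled to [0,100] to keep this integer-only. */
		for (unsigned int x = 0; x <= 100; x += 20) {
			/* Mapping from the patch: [0,1] -> [min_f, max_f]. */
			unsigned int f_range = min_f + x * (max_f - min_f) / 100;
			/* Proportional mapping Peter argues for: [0,1] -> [0, max_f]. */
			unsigned int f_prop = x * max_f / 100;

			printf("x=%3u%%  range-mapped=%4u  proportional=%4u\n",
			       x, f_range, f_prop);
		}
		/*
		 * At x=40% the range-mapped form requests 700 while the
		 * proportional form requests 400, for which the 500 OPP
		 * already suffices; the range-mapped form returns min_f
		 * only at x=0, which is why min freq is rarely selected.
		 */
		return 0;
	}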
On 03/03/16 13:20, Peter Zijlstra wrote: > On Wed, Mar 02, 2016 at 11:49:48PM +0100, Rafael J. Wysocki wrote: > > >>> + min_f = sg_policy->policy->cpuinfo.min_freq; > > >>> + max_f = sg_policy->policy->cpuinfo.max_freq; > > >>> + next_f = util > max ? max_f : min_f + util * (max_f - min_f) / max; > > > In case a more formal derivation of this formula is needed, it is > > based on the following 3 assumptions: > > > > (1) Performance is a linear function of frequency. > > (2) Required performance is a linear function of the utilization ratio > > x = util/max as provided by the scheduler (0 <= x <= 1). > > > (3) The minimum possible frequency (min_freq) corresponds to x = 0 and > > the maximum possible frequency (max_freq) corresponds to x = 1. > > > > (1) and (2) combined imply that > > > > f = a * x + b > > > > (f - frequency, a, b - constants to be determined) and then (3) quite > > trivially leads to b = min_freq and a = max_freq - min_freq. > > 3 is the problem, that just doesn't make sense and is probably the > reason why you see very little selection of the min freq. > > Suppose a machine with the following frequencies: > > 500, 750, 1000 > > And a utilization of 0.4, how does asking for 500 + 0.4 * (1000-500) = > 700 make any sense? Per your point 1, it should should be asking for > 0.4 * 1000 = 400. > > Because, per 1, at 500 it runs exactly half as fast as at 1000, and we > only need 0.4 times as much. Therefore 500 is more than sufficient. > Oh, and that is probably also why the governor can reach max OPP with freq invariance enabled (the point Vincent was making). When we run at 500 the util signal is capped at that capacity, but the formula makes us requesting more, so we can jump to the next step and so on. -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
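Juri's ratchet is easy to reproduce by iterating the patch's formula
under the capped utilization. A minimal standalone simulation (plain C,
not kernel code; it assumes a fully busy CPU, so that with frequency
invariance util/max equals f/max_f, and reuses the hypothetical
500/750/1000 table):

	#include <stdio.h>

	static const unsigned int freqs[] = { 500, 750, 1000 };

	/* RELATION_L-style pick: lowest table frequency at or above target. */
	static unsigned int pick(unsigned int target)
	{
		for (int i = 0; i < 3; i++)
			if (freqs[i] >= target)
				return freqs[i];
		return freqs[2];
	}

	int main(void)
	{
		const unsigned int min_f = 500, max_f = 1000;
		unsigned int f = min_f;

		for (int step = 0; step < 3; step++) {
			/*
			 * Fully busy CPU with frequency invariant accounting:
			 * util/max == f/max_f, so the patch's formula reduces
			 * to min_f + (f / max_f) * (max_f - min_f).
			 */
			unsigned int target = min_f + f * (max_f - min_f) / max_f;

			printf("at %u, request %u, land on %u\n",
			       f, target, pick(target));
			f = pick(target);
		}
		/* Climbs 500 -> 750 -> 1000 and then stays pinned at max. */
		return 0;
	}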
On 2 March 2016 at 18:58, Rafael J. Wysocki <rafael@kernel.org> wrote: > On Wed, Mar 2, 2016 at 6:10 PM, Vincent Guittot > <vincent.guittot@linaro.org> wrote: >> Hi Rafael, >> >> >> On 2 March 2016 at 03:27, Rafael J. Wysocki <rjw@rjwysocki.net> wrote: >>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> >>> >>> Add a new cpufreq scaling governor, called "schedutil", that uses >>> scheduler-provided CPU utilization information as input for making >>> its decisions. >>> >>> Doing that is possible after commit fe7034338ba0 (cpufreq: Add >>> mechanism for registering utilization update callbacks) that >>> introduced cpufreq_update_util() called by the scheduler on >>> utilization changes (from CFS) and RT/DL task status updates. >>> In particular, CPU frequency scaling decisions may be based on >>> the the utilization data passed to cpufreq_update_util() by CFS. >>> >>> The new governor is relatively simple. >>> >>> The frequency selection formula used by it is essentially the same >>> as the one used by the "ondemand" governor, although it doesn't use >>> the additional up_threshold parameter, but instead of computing the >>> load as the "non-idle CPU time" to "total CPU time" ratio, it takes >>> the utilization data provided by CFS as input. More specifically, >>> it represents "load" as the util/max ratio, where util and max >>> are the utilization and CPU capacity coming from CFS. >>> >> >> [snip] >> >>> + >>> +static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time, >>> + unsigned long util, unsigned long max, >>> + unsigned int next_freq) >>> +{ >>> + struct cpufreq_policy *policy = sg_policy->policy; >>> + unsigned int rel; >>> + >>> + if (next_freq > policy->max) >>> + next_freq = policy->max; >>> + else if (next_freq < policy->min) >>> + next_freq = policy->min; >>> + >>> + sg_policy->last_freq_update_time = time; >>> + if (sg_policy->next_freq == next_freq) >>> + return; >>> + >>> + sg_policy->next_freq = next_freq; >>> + /* >>> + * If utilization is less than max / 4, use RELATION_C to allow the >>> + * minimum frequency to be selected more often in case the distance from >>> + * it to the next available frequency in the table is significant. >>> + */ >>> + rel = util < (max >> 2) ? CPUFREQ_RELATION_C : CPUFREQ_RELATION_L; >>> + if (policy->fast_switch_possible) { >>> + cpufreq_driver_fast_switch(policy, next_freq, rel); >>> + } else { >>> + sg_policy->relation = rel; >>> + sg_policy->work_in_progress = true; >>> + irq_work_queue(&sg_policy->irq_work); >>> + } >>> +} >>> + >>> +static void sugov_update_single(struct update_util_data *data, u64 time, >>> + unsigned long util, unsigned long max) >>> +{ >>> + struct sugov_cpu *sg_cpu = container_of(data, struct sugov_cpu, update_util); >>> + struct sugov_policy *sg_policy = sg_cpu->sg_policy; >>> + unsigned int min_f, max_f, next_f; >>> + >>> + if (!sugov_should_update_freq(sg_policy, time)) >>> + return; >>> + >>> + min_f = sg_policy->policy->cpuinfo.min_freq; >>> + max_f = sg_policy->policy->cpuinfo.max_freq; >>> + next_f = util > max ? max_f : min_f + util * (max_f - min_f) / max; >> >> I think it has been pointed out in another email's thread but you >> should change the way the next_f is computed. util reflects the >> utilization of a CPU from 0 to its max compute capacity whereas >> ondemand was using the load at the current frequency during the last >> time window. 
I have understood that you want to keep same formula than >> ondemand as a starting point but you use a different input to >> calculate the next frequency so i don't see the rational of keeping >> this formula. > > It is a formula that causes the entire available frequency range to be > utilized proportionally to the utilization as reported by the > scheduler (modulo the policy->min/max limits). Its (significant IMO) > advantage is that it doesn't require any additional factors that would > need to be determined somehow. > >> Saying that, even the simple formula next_f = util > max >> ? max_f : util * (max_f) / max will not work properly if the frequency >> invariance is enable because the utilization becomes capped by the >> current compute capacity so next_f will never be higher than current >> freq (unless a task move on the rq). That was one reason of using a >> threshold in sched-freq proposal (and there are on going dev to try to >> solve this limitation). > > Well, a different formula will have to be used along with frequency > invariance, then. > >> IIIUC, frequency invariance is not enable on your platform so you have >> not seen the problem but you have probably see that selection of your >> next_f was not really stable. Without frequency invariance, the >> utilization will be overestimated when running at lower frequency so >> the governor will probably select a frequency that is higher than >> necessary but then the utilization will decrease at this higher >> frequency so the governor will probably decrease the frequency and so >> on until you found the right frequency that will generate the right >> utilisation value > > I don't have any problems with that to be honest and if you aim at > selecting the perfect frequency at the first attempt, then good luck > with that anyway. I mainly want to prevent any useless and periodic frequency switch because of an utilization that changes with the current frequency (if frequency invariance is not used) and that can make the formula selects another frequency than the current one. That what i can see when testing it . Sorry for the late reply, i was trying to do some test on my board but was facing some crash issue (not link with your patchset). So i have done some tests and i can see such instable behavior. I have generated a load of 33% at max frequency (3ms runs every 9ms) and i can see the frequency that toggles without any good reason. Saying that, i can see similar thing with ondemand. Vincent > > Now, I'm not saying that the formula used in this patch cannot be > improved or similar. It very well may be possible to improve it. I'm > only saying that it is good enough to start with, because of the > reasons mentioned above. > > Still, if you can suggest to me what other formula specifically should > be used here, I'll consider using it. Which will probably mean > comparing the two and seeing which one leads to better results. > > Thanks, > Rafael -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 2 March 2016 at 23:49, Rafael J. Wysocki <rafael@kernel.org> wrote: > On Wed, Mar 2, 2016 at 6:58 PM, Rafael J. Wysocki <rafael@kernel.org> wrote: >> On Wed, Mar 2, 2016 at 6:10 PM, Vincent Guittot >> <vincent.guittot@linaro.org> wrote: >>> Hi Rafael, >>> >>> >>> On 2 March 2016 at 03:27, Rafael J. Wysocki <rjw@rjwysocki.net> wrote: >>>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> >>>> >>>> Add a new cpufreq scaling governor, called "schedutil", that uses >>>> scheduler-provided CPU utilization information as input for making >>>> its decisions. >>>> >>>> Doing that is possible after commit fe7034338ba0 (cpufreq: Add >>>> mechanism for registering utilization update callbacks) that >>>> introduced cpufreq_update_util() called by the scheduler on >>>> utilization changes (from CFS) and RT/DL task status updates. >>>> In particular, CPU frequency scaling decisions may be based on >>>> the the utilization data passed to cpufreq_update_util() by CFS. >>>> >>>> The new governor is relatively simple. >>>> >>>> The frequency selection formula used by it is essentially the same >>>> as the one used by the "ondemand" governor, although it doesn't use >>>> the additional up_threshold parameter, but instead of computing the >>>> load as the "non-idle CPU time" to "total CPU time" ratio, it takes >>>> the utilization data provided by CFS as input. More specifically, >>>> it represents "load" as the util/max ratio, where util and max >>>> are the utilization and CPU capacity coming from CFS. >>>> >>> >>> [snip] >>> >>>> + >>>> +static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time, >>>> + unsigned long util, unsigned long max, >>>> + unsigned int next_freq) >>>> +{ >>>> + struct cpufreq_policy *policy = sg_policy->policy; >>>> + unsigned int rel; >>>> + >>>> + if (next_freq > policy->max) >>>> + next_freq = policy->max; >>>> + else if (next_freq < policy->min) >>>> + next_freq = policy->min; >>>> + >>>> + sg_policy->last_freq_update_time = time; >>>> + if (sg_policy->next_freq == next_freq) >>>> + return; >>>> + >>>> + sg_policy->next_freq = next_freq; >>>> + /* >>>> + * If utilization is less than max / 4, use RELATION_C to allow the >>>> + * minimum frequency to be selected more often in case the distance from >>>> + * it to the next available frequency in the table is significant. >>>> + */ >>>> + rel = util < (max >> 2) ? CPUFREQ_RELATION_C : CPUFREQ_RELATION_L; >>>> + if (policy->fast_switch_possible) { >>>> + cpufreq_driver_fast_switch(policy, next_freq, rel); >>>> + } else { >>>> + sg_policy->relation = rel; >>>> + sg_policy->work_in_progress = true; >>>> + irq_work_queue(&sg_policy->irq_work); >>>> + } >>>> +} >>>> + >>>> +static void sugov_update_single(struct update_util_data *data, u64 time, >>>> + unsigned long util, unsigned long max) >>>> +{ >>>> + struct sugov_cpu *sg_cpu = container_of(data, struct sugov_cpu, update_util); >>>> + struct sugov_policy *sg_policy = sg_cpu->sg_policy; >>>> + unsigned int min_f, max_f, next_f; >>>> + >>>> + if (!sugov_should_update_freq(sg_policy, time)) >>>> + return; >>>> + >>>> + min_f = sg_policy->policy->cpuinfo.min_freq; >>>> + max_f = sg_policy->policy->cpuinfo.max_freq; >>>> + next_f = util > max ? max_f : min_f + util * (max_f - min_f) / max; >>> >>> I think it has been pointed out in another email's thread but you >>> should change the way the next_f is computed. 
util reflects the >>> utilization of a CPU from 0 to its max compute capacity whereas >>> ondemand was using the load at the current frequency during the last >>> time window. I have understood that you want to keep same formula than >>> ondemand as a starting point but you use a different input to >>> calculate the next frequency so i don't see the rational of keeping >>> this formula. >> >> It is a formula that causes the entire available frequency range to be >> utilized proportionally to the utilization as reported by the >> scheduler (modulo the policy->min/max limits). Its (significant IMO) >> advantage is that it doesn't require any additional factors that would >> need to be determined somehow. > > In case a more formal derivation of this formula is needed, it is > based on the following 3 assumptions: > > (1) Performance is a linear function of frequency. > (2) Required performance is a linear function of the utilization ratio > x = util/max as provided by the scheduler (0 <= x <= 1). Just to mention that the utilization that you are using, varies with the frequency which add another variable in your equation > (3) The minimum possible frequency (min_freq) corresponds to x = 0 and > the maximum possible frequency (max_freq) corresponds to x = 1. > > (1) and (2) combined imply that > > f = a * x + b > > (f - frequency, a, b - constants to be determined) and then (3) quite > trivially leads to b = min_freq and a = max_freq - min_freq. > > Now, of course, the linearity assumptions may be questioned, but then > it's just the first approximation. If you go any further, though, you > end up with an expansion series like this: > > f(x) = c_0 + c_1 * x + c_2 * x^2 + c_3 * x^3 + ... > > where all of the c_j need to be determined in principle. With luck, > if you can guess what kind of a function f(x) may be, it may be > possible to reduce the number of coefficients to determine, but > question is whether or not that is going to work universally for all > systems. > > Thanks, > Rafael -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Mar 03, 2016 at 03:01:15PM +0100, Vincent Guittot wrote:
> > In case a more formal derivation of this formula is needed, it is
> > based on the following 3 assumptions:
> >
> > (1) Performance is a linear function of frequency.
> > (2) Required performance is a linear function of the utilization ratio
> >     x = util/max as provided by the scheduler (0 <= x <= 1).
>
> Just to mention that the utilization that you are using varies with
> the frequency, which adds another variable to your equation

Right, x86 hasn't implemented arch_scale_freq_capacity(), so the
utilization values we use are all over the map. If we lower the freq,
the util will go up, which would result in us bumping the freq again,
etc.
On Thu, Mar 3, 2016 at 1:20 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Wed, Mar 02, 2016 at 11:49:48PM +0100, Rafael J. Wysocki wrote:
>> >>> +	min_f = sg_policy->policy->cpuinfo.min_freq;
>> >>> +	max_f = sg_policy->policy->cpuinfo.max_freq;
>> >>> +	next_f = util > max ? max_f : min_f + util * (max_f - min_f) / max;
>
>> In case a more formal derivation of this formula is needed, it is
>> based on the following 3 assumptions:
>>
>> (1) Performance is a linear function of frequency.
>> (2) Required performance is a linear function of the utilization ratio
>>     x = util/max as provided by the scheduler (0 <= x <= 1).
>> (3) The minimum possible frequency (min_freq) corresponds to x = 0 and
>>     the maximum possible frequency (max_freq) corresponds to x = 1.
>>
>> (1) and (2) combined imply that
>>
>>     f = a * x + b
>>
>> (f - frequency, a, b - constants to be determined) and then (3) quite
>> trivially leads to b = min_freq and a = max_freq - min_freq.
>
> 3 is the problem, that just doesn't make sense and is probably the
> reason why you see very little selection of the min freq.

It is about mapping the entire [0,1] interval to the available
frequency range. It will overprovision things (the smaller x, the
more), but then it may help the race-to-idle a bit in theory.

> Suppose a machine with the following frequencies:
>
>	500, 750, 1000
>
> And a utilization of 0.4, how does asking for 500 + 0.4 * (1000-500) =
> 700 make any sense? Per your point 1, it should be asking for
> 0.4 * 1000 = 400.
>
> Because, per 1, at 500 it runs exactly half as fast as at 1000, and we
> only need 0.4 times as much. Therefore 500 is more than sufficient.

OK, but then I don't see why this reasoning only applies to the lower
bound of the frequency range. Is there any reason why x = 1 should be
the only point mapping to max_freq?

If not, then I think it's reasonable to map the middle of the
available frequency range to x = 0.5 and then we have b = 0 and a =
(max_freq + min_freq) / 2. I'll try that and see how it goes.

> Note: we all know that 1 is a 'broken' assumption, but lacking anything
> better I think it's a reasonable one to make.

Right.
On Thu, Mar 03, 2016 at 05:24:32PM +0100, Rafael J. Wysocki wrote:
> On Thu, Mar 3, 2016 at 1:20 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Wed, Mar 02, 2016 at 11:49:48PM +0100, Rafael J. Wysocki wrote:
> >> >>> +	min_f = sg_policy->policy->cpuinfo.min_freq;
> >> >>> +	max_f = sg_policy->policy->cpuinfo.max_freq;
> >> >>> +	next_f = util > max ? max_f : min_f + util * (max_f - min_f) / max;
> >
> >> In case a more formal derivation of this formula is needed, it is
> >> based on the following 3 assumptions:
> >>
> >> (1) Performance is a linear function of frequency.
> >> (2) Required performance is a linear function of the utilization ratio
> >>     x = util/max as provided by the scheduler (0 <= x <= 1).
> >
> >> (3) The minimum possible frequency (min_freq) corresponds to x = 0 and
> >>     the maximum possible frequency (max_freq) corresponds to x = 1.
> >>
> >> (1) and (2) combined imply that
> >>
> >>     f = a * x + b
> >>
> >> (f - frequency, a, b - constants to be determined) and then (3) quite
> >> trivially leads to b = min_freq and a = max_freq - min_freq.
> >
> > 3 is the problem, that just doesn't make sense and is probably the
> > reason why you see very little selection of the min freq.
>
> It is about mapping the entire [0,1] interval to the available
> frequency range.

Yeah, but I don't see why that makes sense.

> It will overprovision things (the smaller x, the more), but then it
> may help the race-to-idle a bit in theory.

So, since we also have the cpuidle information, could we not make a
better guess at race-to-idle?

> > Suppose a machine with the following frequencies:
> >
> >	500, 750, 1000
> >
> > And a utilization of 0.4, how does asking for 500 + 0.4 * (1000-500) =
> > 700 make any sense? Per your point 1, it should be asking for
> > 0.4 * 1000 = 400.
> >
> > Because, per 1, at 500 it runs exactly half as fast as at 1000, and we
> > only need 0.4 times as much. Therefore 500 is more than sufficient.
>
> OK, but then I don't see why this reasoning only applies to the lower
> bound of the frequency range. Is there any reason why x = 1 should be
> the only point mapping to max_freq?

Well, everything that goes over the second to last freq would end up at
the last (max) freq.

Take again the 500,750,1000 example, everything that's >750 would end
up at 1000 (for relation_l, >875 for _c).

But given the platform's cpuidle information, maybe coupled with an avg
idle estimate, we can compute the benefit of race-to-idle and over
provision based on that, right?

> If not, then I think it's reasonable to map the middle of the
> available frequency range to x = 0.5 and then we have b = 0 and a =
> (max_freq + min_freq) / 2.

So I really think that approach falls apart on the low util bits, you
effectively always run above min speed, even if min is already vastly
over provisioned.
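A sketch of the two relations Peter refers to (illustrative only; the
kernel's actual table lookup lives in cpufreq_frequency_table_target(),
and CPUFREQ_RELATION_L/CPUFREQ_RELATION_C are the real flag names):

	#include <stdio.h>
	#include <stdlib.h>

	#define NFREQ 3
	static const unsigned int table[NFREQ] = { 500, 750, 1000 };

	/* RELATION_L: lowest frequency at or above the target. */
	static unsigned int pick_l(unsigned int target)
	{
		for (int i = 0; i < NFREQ; i++)
			if (table[i] >= target)
				return table[i];
		return table[NFREQ - 1];
	}

	/* RELATION_C: frequency closest to the target, in either direction. */
	static unsigned int pick_c(unsigned int target)
	{
		unsigned int best = table[0];

		for (int i = 1; i < NFREQ; i++)
			if (abs((int)table[i] - (int)target) <
			    abs((int)best - (int)target))
				best = table[i];
		return best;
	}

	int main(void)
	{
		/* 800 lands above 750: _L rounds up to 1000, _C stays at 750. */
		printf("target 800: L=%u C=%u\n", pick_l(800), pick_c(800));
		/* Only targets above 875 round up to 1000 under _C. */
		printf("target 900: L=%u C=%u\n", pick_l(900), pick_c(900));
		return 0;
	}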
On Thu, Mar 03, 2016 at 05:37:35PM +0100, Peter Zijlstra wrote:
> On Thu, Mar 03, 2016 at 05:24:32PM +0100, Rafael J. Wysocki wrote:
> > >> f = a * x + b
>
> > If not, then I think it's reasonable to map the middle of the
> > available frequency range to x = 0.5 and then we have b = 0 and a =
> > (max_freq + min_freq) / 2.
>
> So I really think that approach falls apart on the low util bits, you
> effectively always run above min speed, even if min is already vastly
> over provisioned.

Ah, never mind, I cannot read. Yes, that is worth trying I suppose. But
the b=0, a=1 thing seems more natural still.
On 03/03/16 17:37, Peter Zijlstra wrote: > On Thu, Mar 03, 2016 at 05:24:32PM +0100, Rafael J. Wysocki wrote: > > On Thu, Mar 3, 2016 at 1:20 PM, Peter Zijlstra <peterz@infradead.org> wrote: > > > On Wed, Mar 02, 2016 at 11:49:48PM +0100, Rafael J. Wysocki wrote: > > >> >>> + min_f = sg_policy->policy->cpuinfo.min_freq; > > >> >>> + max_f = sg_policy->policy->cpuinfo.max_freq; > > >> >>> + next_f = util > max ? max_f : min_f + util * (max_f - min_f) / max; > > > > > >> In case a more formal derivation of this formula is needed, it is > > >> based on the following 3 assumptions: > > >> > > >> (1) Performance is a linear function of frequency. > > >> (2) Required performance is a linear function of the utilization ratio > > >> x = util/max as provided by the scheduler (0 <= x <= 1). > > > > > >> (3) The minimum possible frequency (min_freq) corresponds to x = 0 and > > >> the maximum possible frequency (max_freq) corresponds to x = 1. > > >> > > >> (1) and (2) combined imply that > > >> > > >> f = a * x + b > > >> > > >> (f - frequency, a, b - constants to be determined) and then (3) quite > > >> trivially leads to b = min_freq and a = max_freq - min_freq. > > > > > > 3 is the problem, that just doesn't make sense and is probably the > > > reason why you see very little selection of the min freq. > > > > It is about mapping the entire [0,1] interval to the available frequency range. > > Yeah, but I don't see why that makes sense.. > > > I till overprovision things (the smaller x the more), but then it may > > help the race-to-idle a bit in theory. > > So, since we also have the cpuidle information, could we not make a > better guess at race-to-idle? > > > > Suppose a machine with the following frequencies: > > > > > > 500, 750, 1000 > > > > > > And a utilization of 0.4, how does asking for 500 + 0.4 * (1000-500) = > > > 700 make any sense? Per your point 1, it should should be asking for > > > 0.4 * 1000 = 400. > > > > > > Because, per 1, at 500 it runs exactly half as fast as at 1000, and we > > > only need 0.4 times as much. Therefore 500 is more than sufficient. > > > > OK, but then I don't see why this reasoning only applies to the lower > > bound of the frequency range. Is there any reason why x = 1 should be > > the only point mapping to max_freq? > > Well, everything that goes over the second to last freq would end up at > the last (max) freq. > > Take again the 500,750,1000 example, everything that's >750 would end up > at 1000 (for relation_l, >875 for _c). > > But given the platform's cpuidle information, maybe coupled with an avg > idle est, we can compute the benefit of race-to-idle and over provision > based on that, right? > Shouldn't this kind of considerations be a scheduler thing? I'm not really getting why we want to put more "intelligence" in a new governor. Also, if I understand Ingo's point correctly, I think we want to make this kind of policy decisions inside the scheduler. -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Mar 03, 2016 at 04:55:44PM +0000, Juri Lelli wrote: > On 03/03/16 17:37, Peter Zijlstra wrote: > > But given the platform's cpuidle information, maybe coupled with an avg > > idle est, we can compute the benefit of race-to-idle and over provision > > based on that, right? > > > > Shouldn't this kind of considerations be a scheduler thing? I'm not > really getting why we want to put more "intelligence" in a new governor. > Also, if I understand Ingo's point correctly, I think we want to make > this kind of policy decisions inside the scheduler. Well sure, put it in kernel/sched/cpufreq.c or wherever. My point was more that we don't have to guess/hardcode race-to-idle assumptions but can actually calculate some of that. -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 03/03/16 17:56, Peter Zijlstra wrote: > On Thu, Mar 03, 2016 at 04:55:44PM +0000, Juri Lelli wrote: > > On 03/03/16 17:37, Peter Zijlstra wrote: > > > But given the platform's cpuidle information, maybe coupled with an avg > > > idle est, we can compute the benefit of race-to-idle and over provision > > > based on that, right? > > > > > > > Shouldn't this kind of considerations be a scheduler thing? I'm not > > really getting why we want to put more "intelligence" in a new governor. > > Also, if I understand Ingo's point correctly, I think we want to make > > this kind of policy decisions inside the scheduler. > > Well sure, put it in kernel/sched/cpufreq.c or wherever. My point was > more that we don't have to guess/hardcode race-to-idle assumptions but > can actually calculate some of that. > Right, thanks for clarifying! -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 03/03/2016 05:07 AM, Vincent Guittot wrote:
> I mainly want to prevent any useless and periodic frequency switch
> because of an utilization that changes with the current frequency (if
> frequency invariance is not used) and that can make the formula
> selects another frequency than the current one. That what i can see
> when testing it .
>
> Sorry for the late reply, i was trying to do some test on my board but
> was facing some crash issue (not link with your patchset). So i have
> done some tests and i can see such instable behavior. I have generated
> a load of 33% at max frequency (3ms runs every 9ms) and i can see the
> frequency that toggles without any good reason. Saying that, i can see
> similar thing with ondemand.

FWIW I ran some performance numbers on my chromebook 2. Initially I
forgot to bring in the frequency invariance support but that yielded an
opportunity to see the impact of it.

The tests below consist of a periodic workload. The OH (overhead)
numbers show how close the workload got to running as slow as fmin
(100% = as slow as powersave gov, 0% = as fast as perf gov). The OR
(overrun) number is the count of instances where the busy work exceeded
the period.

First a comparison of schedutil with and without frequency invariance.
Run and period are in milliseconds.

                        scu (no inv)    scu (w/inv)
 run  period  busy %    OR  OH          OR  OH
   1     100    1.00%    0  79.72%       0  95.86%
  10    1000    1.00%    0  24.52%       0  71.61%
   1      10   10.00%    0  21.25%       0  41.78%
  10     100   10.00%    0  26.06%       0  47.96%
 100    1000   10.00%    0   6.36%       0  26.03%
   6      33   18.18%    0  15.67%       0  31.61%
  66     333   19.82%    0   8.94%       0  29.46%
   4      10   40.00%    0   6.26%       0  12.93%
  40     100   40.00%    0   6.93%       2  14.08%
 400    1000   40.00%    0   1.65%       0  11.58%
   5       9   55.56%    0   3.70%       0   7.70%
  50      90   55.56%    1   4.19%       6   8.06%
 500     900   55.56%    0   1.35%       5   6.94%
   9      12   75.00%    0   1.60%      56   3.59%
  90     120   75.00%    0   1.88%      21   3.94%
 900    1200   75.00%    0   0.73%       4   4.41%

Frequency invariance causes schedutil overhead to increase noticeably.
I haven't dug into traces or anything. Perhaps this is due to the
algorithm overshooting then overcorrecting etc., I do not yet know.

Here is a comparison, with frequency invariance, of ondemand and
interactive with schedfreq and schedutil. The first two columns (run
and period) are omitted so the table will fit.

         ondemand       interactive    schedfreq      schedutil
busy %   OR  OH         OR  OH         OR  OH         OR  OH
 1.00%    0  68.96%      0  100.04%     0  78.49%      0  95.86%
 1.00%    0  25.04%      0   22.59%     0  72.56%      0  71.61%
10.00%    0  21.75%      0   63.08%     0  52.40%      0  41.78%
10.00%    0  12.17%      0   14.41%     0  17.33%      0  47.96%
10.00%    0   2.57%      0    2.17%     0   0.29%      0  26.03%
18.18%    0  12.39%      0    9.39%     0  17.34%      0  31.61%
19.82%    0   3.74%      0    3.42%     0  12.26%      0  29.46%
40.00%    2   6.26%      1   12.23%     0   6.15%      0  12.93%
40.00%    0   0.47%      0    0.05%     0   2.68%      2  14.08%
40.00%    0   0.60%      0    0.50%     0   1.22%      0  11.58%
55.56%    2   4.25%      5    5.97%     0   2.51%      0   7.70%
55.56%    0   1.89%      0    0.04%     0   1.71%      6   8.06%
55.56%    0   0.50%      0    0.47%     0   1.82%      5   6.94%
75.00%    2   1.65%      1    0.46%     0   0.26%     56   3.59%
75.00%    0   1.68%      0    0.05%     0   0.49%     21   3.94%
75.00%    0   0.28%      0    0.23%     0   0.62%      4   4.41%

Aside from the 2nd and 3rd tests schedutil is showing decreased
performance across the board. The fifth test is particularly bad. The
catch is that I do not have power numbers to go with this data, as I'm
not currently equipped to gather them. So more analysis is definitely
needed to capture the full story.

thanks,
Steve
On Thu, Mar 3, 2016 at 9:06 PM, Steve Muckle <steve.muckle@linaro.org> wrote: > On 03/03/2016 05:07 AM, Vincent Guittot wrote: >> I mainly want to prevent any useless and periodic frequency switch >> because of an utilization that changes with the current frequency (if >> frequency invariance is not used) and that can make the formula >> selects another frequency than the current one. That what i can see >> when testing it . >> >> Sorry for the late reply, i was trying to do some test on my board but >> was facing some crash issue (not link with your patchset). So i have >> done some tests and i can see such instable behavior. I have generated >> a load of 33% at max frequency (3ms runs every 9ms) and i can see the >> frequency that toggles without any good reason. Saying that, i can see >> similar thing with ondemand. > > FWIW I ran some performance numbers on my chromebook 2. Initially I > forgot to bring in the frequency invariance support but that yielded an > opportunity to see the impact of it. > > The tests below consist of a periodic workload. The OH (overhead) > numbers show how close the workload got to running as slow as fmin (100% > = as slow as powersave gov, 0% = as fast as perf gov). The OR (overrun) > number is the count of instances where the busy work exceeded the period. > > First a comparison of schedutil with and without frequency invariance. > Run and period are in milliseconds. > > scu (no inv) scu (w/inv) > run period busy % OR OH OR OH > 1 100 1.00% 0 79.72% 0 95.86% > 10 1000 1.00% 0 24.52% 0 71.61% > 1 10 10.00% 0 21.25% 0 41.78% > 10 100 10.00% 0 26.06% 0 47.96% > 100 1000 10.00% 0 6.36% 0 26.03% > 6 33 18.18% 0 15.67% 0 31.61% > 66 333 19.82% 0 8.94% 0 29.46% > 4 10 40.00% 0 6.26% 0 12.93% > 40 100 40.00% 0 6.93% 2 14.08% > 400 1000 40.00% 0 1.65% 0 11.58% > 5 9 55.56% 0 3.70% 0 7.70% > 50 90 55.56% 1 4.19% 6 8.06% > 500 900 55.56% 0 1.35% 5 6.94% > 9 12 75.00% 0 1.60% 56 3.59% > 90 120 75.00% 0 1.88% 21 3.94% > 900 1200 75.00% 0 0.73% 4 4.41% > > Frequency invariance causes schedutil overhead to increase noticeably. I > haven't dug into traces or anything. Perhaps this is due to the > algorithm overshooting then overcorrecting etc., I do not yet know. So as I said, the formula I used didn't take invariance into account, so that's quite as expected. > Here is a comparison, with frequency invariance, of ondemand and > interactive with schedfreq and schedutil. The first two columns (run and > period) are omitted so the table will fit. > > ondemand interactive schedfreq schedutil > busy % OR OH OR OH OR OH OR OH > 1.00% 0 68.96% 0 100.04% 0 78.49% 0 95.86% > 1.00% 0 25.04% 0 22.59% 0 72.56% 0 71.61% > 10.00% 0 21.75% 0 63.08% 0 52.40% 0 41.78% > 10.00% 0 12.17% 0 14.41% 0 17.33% 0 47.96% > 10.00% 0 2.57% 0 2.17% 0 0.29% 0 26.03% > 18.18% 0 12.39% 0 9.39% 0 17.34% 0 31.61% > 19.82% 0 3.74% 0 3.42% 0 12.26% 0 29.46% > 40.00% 2 6.26% 1 12.23% 0 6.15% 0 12.93% > 40.00% 0 0.47% 0 0.05% 0 2.68% 2 14.08% > 40.00% 0 0.60% 0 0.50% 0 1.22% 0 11.58% > 55.56% 2 4.25% 5 5.97% 0 2.51% 0 7.70% > 55.56% 0 1.89% 0 0.04% 0 1.71% 6 8.06% > 55.56% 0 0.50% 0 0.47% 0 1.82% 5 6.94% > 75.00% 2 1.65% 1 0.46% 0 0.26% 56 3.59% > 75.00% 0 1.68% 0 0.05% 0 0.49% 21 3.94% > 75.00% 0 0.28% 0 0.23% 0 0.62% 4 4.41% > > Aside from the 2nd and 3rd tests schedutil is showing decreased > performance across the board. The fifth test is particularly bad. I guess you mean performance in terms of the overhead? 
Thanks,
Rafael
On 03/03/2016 12:20 PM, Rafael J. Wysocki wrote: >> Here is a comparison, with frequency invariance, of ondemand and >> interactive with schedfreq and schedutil. The first two columns (run and >> period) are omitted so the table will fit. >> >> ondemand interactive schedfreq schedutil >> busy % OR OH OR OH OR OH OR OH >> 1.00% 0 68.96% 0 100.04% 0 78.49% 0 95.86% >> 1.00% 0 25.04% 0 22.59% 0 72.56% 0 71.61% >> 10.00% 0 21.75% 0 63.08% 0 52.40% 0 41.78% >> 10.00% 0 12.17% 0 14.41% 0 17.33% 0 47.96% >> 10.00% 0 2.57% 0 2.17% 0 0.29% 0 26.03% >> 18.18% 0 12.39% 0 9.39% 0 17.34% 0 31.61% >> 19.82% 0 3.74% 0 3.42% 0 12.26% 0 29.46% >> 40.00% 2 6.26% 1 12.23% 0 6.15% 0 12.93% >> 40.00% 0 0.47% 0 0.05% 0 2.68% 2 14.08% >> 40.00% 0 0.60% 0 0.50% 0 1.22% 0 11.58% >> 55.56% 2 4.25% 5 5.97% 0 2.51% 0 7.70% >> 55.56% 0 1.89% 0 0.04% 0 1.71% 6 8.06% >> 55.56% 0 0.50% 0 0.47% 0 1.82% 5 6.94% >> 75.00% 2 1.65% 1 0.46% 0 0.26% 56 3.59% >> 75.00% 0 1.68% 0 0.05% 0 0.49% 21 3.94% >> 75.00% 0 0.28% 0 0.23% 0 0.62% 4 4.41% >> >> Aside from the 2nd and 3rd tests schedutil is showing decreased >> performance across the board. The fifth test is particularly bad. > > I guess you mean performance in terms of the overhead? Correct. This overhead metric describes how fast the workload completes, with 0% equaling the perf governor and 100% equaling the powersave governor. So it's a reflection of general performance using the governor. It's called "overhead" I imagine (the metric predates my involvement) as it is something introduced/caused by the policy of the governor. thanks, Steve -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
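The 0% and 100% anchors suggest (this normalization is an inference, not
something spelled out in the thread) that the metric is a linear
rescaling of completion time between the two reference governors:

    OH = (T_gov - T_perf) / (T_powersave - T_perf) * 100%

where T_gov, T_perf and T_powersave are the completion times of the same
periodic workload under the governor being tested, the performance
governor and the powersave governor, respectively.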
On Thu, Mar 3, 2016 at 5:47 PM, Peter Zijlstra <peterz@infradead.org> wrote: > On Thu, Mar 03, 2016 at 05:37:35PM +0100, Peter Zijlstra wrote: >> On Thu, Mar 03, 2016 at 05:24:32PM +0100, Rafael J. Wysocki wrote: >> > >> f = a * x + b > >> > If not, then I think it's reasonable to map the middle of the >> > available frequency range to x = 0.5 and then we have b = 0 and a = >> > (max_freq + min_freq) / 2. That actually should be a = max_freq + min_freq, because I want (max_freq + min_freq) / 2 = a / 2. >> So I really think that approach falls apart on the low util bits, you >> effectively always run above min speed, even if min is already vstly >> over provisioned. > > Ah nevermind, I cannot read. Yes that is worth trying I suppose. But the > b=0,a=1 thing seems more natural still. It is somewhat imbalanced, though. If all of the values of x are equally probable (or equally frequent), the probability of running above the middle frequency is lower than the probability of running below it. -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thursday, March 03, 2016 01:37:59 PM Steve Muckle wrote:
> On 03/03/2016 12:20 PM, Rafael J. Wysocki wrote:
> >> Here is a comparison, with frequency invariance, of ondemand and
> >> interactive with schedfreq and schedutil. The first two columns (run
> >> and period) are omitted so the table will fit.
> >>
> >>          ondemand       interactive    schedfreq      schedutil
> >> busy %   OR  OH         OR  OH         OR  OH         OR  OH
> >>  1.00%    0  68.96%      0  100.04%     0  78.49%      0  95.86%
> >>  1.00%    0  25.04%      0   22.59%     0  72.56%      0  71.61%
> >> 10.00%    0  21.75%      0   63.08%     0  52.40%      0  41.78%
> >> 10.00%    0  12.17%      0   14.41%     0  17.33%      0  47.96%
> >> 10.00%    0   2.57%      0    2.17%     0   0.29%      0  26.03%
> >> 18.18%    0  12.39%      0    9.39%     0  17.34%      0  31.61%
> >> 19.82%    0   3.74%      0    3.42%     0  12.26%      0  29.46%
> >> 40.00%    2   6.26%      1   12.23%     0   6.15%      0  12.93%
> >> 40.00%    0   0.47%      0    0.05%     0   2.68%      2  14.08%
> >> 40.00%    0   0.60%      0    0.50%     0   1.22%      0  11.58%
> >> 55.56%    2   4.25%      5    5.97%     0   2.51%      0   7.70%
> >> 55.56%    0   1.89%      0    0.04%     0   1.71%      6   8.06%
> >> 55.56%    0   0.50%      0    0.47%     0   1.82%      5   6.94%
> >> 75.00%    2   1.65%      1    0.46%     0   0.26%     56   3.59%
> >> 75.00%    0   1.68%      0    0.05%     0   0.49%     21   3.94%
> >> 75.00%    0   0.28%      0    0.23%     0   0.62%      4   4.41%
> >>
> >> Aside from the 2nd and 3rd tests schedutil is showing decreased
> >> performance across the board. The fifth test is particularly bad.
> >
> > I guess you mean performance in terms of the overhead?
>
> Correct. This overhead metric describes how fast the workload completes,
> with 0% equaling the perf governor and 100% equaling the powersave
> governor. So it's a reflection of general performance using the
> governor. It's called "overhead" I imagine (the metric predates my
> involvement) as it is something introduced/caused by the policy of the
> governor.

If my understanding of the frequency invariant utilization idea is
correct, it is about re-scaling utilization so it is always relative to
the capacity at the max frequency.

If that's the case, then instead of using

    x = util_raw / max

we will use something like

    y = (util_raw / max) * (f / max_freq) (f - current frequency).

This means that

    (1) x = y * max_freq / f

Now, say we have an agreed-on (linear) formula for f depending on x:

    f = a * x + b

and if you say "Look, if I substitute y for x in this formula, it
doesn't produce correct results", then I can only say "It doesn't,
because it can't". It *obviously* won't work, because instead of
substituting y for x, you need to substitute the right-hand side of (1)
for it. Then you'll get

    f = a * y * max_freq / f + b

which is obviously nonlinear, so there's no hope that the same formula
will ever work for both "raw" and "frequency invariant" utilization.

To me this means that looking for a formula that will work for both is
just pointless and there are 3 possibilities:

(a) Look for a good enough formula to apply to "raw" utilization and
    then switch over when all architectures start to use "frequency
    invariant" utilization.
(b) Make all architectures use "frequency invariant" and then look for
    a working formula (seems rather less than realistic to me to be
    honest).
(c) Code for using either "raw" or "frequency invariant" depending on
    a callback flag or something like that.

I, personally, would go for (a) at this point, because that's the
easiest one, but (c) would be doable too IMO, so I don't care that much
as long as it is not (b).

Thanks,
Rafael
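To make the incompatibility concrete with the patch's constants
(b = min_freq = 500, a = max_freq - min_freq = 500, max_freq = 1000): a
task needing 0.4 of the maximum capacity while the CPU runs at f = 500
keeps the CPU busy 80% of the time, so x = 0.8 but
y = 0.8 * 500/1000 = 0.4. Feeding each into the same f = a * x + b gives

    raw:       500 + 0.8 * 500 = 900
    invariant: 500 + 0.4 * 500 = 700

two different requests for the same physical workload, which is exactly
why one fixed formula cannot serve both kinds of input.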
On Mon, Mar 07, 2016 at 03:41:15AM +0100, Rafael J. Wysocki wrote: > If my understanding of the frequency invariant utilization idea is correct, > it is about re-scaling utilization so it is always relative to the capacity > at the max frequency. Right. So if a workload runs for 5ms @1GHz and 10ms @500MHz, it would still result in the exact same utilization. > If that's the case, then instead of using > x = util_raw / max > we will use something like > y = (util_raw / max) * (f / max_freq) (f - current frequency). I don't get the last term. Assuming fixed frequency hardware (we can't really assume anything else) I get to: util = util_raw * (current_freq / max_freq) (1) x = util / max (2) > so there's no hope that the same formula will ever work for both "raw" > and "frequency invariant" utilization. Here I agree, however the above (current_freq / max_freq) term is easily computable, and really the only thing we can assume if the arch doesn't implement freq invariant accounting. > (c) Code for using either "raw" or "frequency invariant" depending on > a callback flag or something like that. Seeing how frequency invariance is an arch feature, and cpufreq drivers are also typically arch specific, do we really need a flag at this level? In any case, I think the only difference between the two formulas should be the addition of (1) for the platforms that do not already implement frequency invariance. That is actually correct for platforms which do as told with their DVFS bits. And there's really not much else we can do short of implementing the scheduler arch hook to do better. > (b) Make all architectures use "frequency invariant" and then look for a > working formula (seems rather less than realistic to me to be honest). There was a proposal to implement arch_scale_freq_capacity() as a weak function and have it serve the cpufreq selected frequency for (1) so that everything would default to that. We didn't do that because that makes the function call and multiplications unconditional. It's cheaper to add (1) to the cpufreq side when selecting a freq rather than every single time we update the util statistics.
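As an illustration of (1), a standalone sketch (made-up numbers, util_raw on the scheduler's 0..1024 scale, not kernel code) showing that 5ms of work @1GHz and the same work taking 10ms @500MHz produce the same invariant utilization:

#include <stdio.h>

static unsigned long invariant_util(unsigned long util_raw,
				    unsigned long cur_freq,
				    unsigned long max_freq)
{
	return util_raw * cur_freq / max_freq;	/* formula (1) above */
}

int main(void)
{
	/* 5ms busy in a 10ms window at 1 GHz: util_raw = 512 */
	printf("@1GHz:   %lu\n", invariant_util(512, 1000000, 1000000));
	/* the same work takes 10ms at 500 MHz: util_raw = 1024 */
	printf("@500MHz: %lu\n", invariant_util(1024, 500000, 1000000));
	return 0;
}

Both cases print 512, i.e. the exact same utilization, as described above.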
On Tue, Mar 8, 2016 at 12:27 PM, Peter Zijlstra <peterz@infradead.org> wrote: > On Mon, Mar 07, 2016 at 03:41:15AM +0100, Rafael J. Wysocki wrote: > >> If my understanding of the frequency invariant utilization idea is correct, >> it is about re-scaling utilization so it is always relative to the capacity >> at the max frequency. > > Right. So if a workload runs for 5ms @1GHz and 10ms @500MHz, it would > still result in the exact same utilization. > >> If that's the case, then instead of using >> x = util_raw / max >> we will use something like >> y = (util_raw / max) * (f / max_freq) (f - current frequency). > > I don't get the last term. The "(f - current frequency)" thing? It doesn't belong to the formula, sorry for the confusion. So it is almost the same as your (1) below (except for the max in the denominator), so my y is your x. :-) > Assuming fixed frequency hardware (we can't > really assume anything else) I get to: > > util = util_raw * (current_freq / max_freq) (1) > x = util / max (2) > >> so there's no hope that the same formula will ever work for both "raw" >> and "frequency invariant" utilization. > > Here I agree, however the above (current_freq / max_freq) term is easily > computable, and really the only thing we can assume if the arch doesn't > implement freq invariant accounting. Right. >> (c) Code for using either "raw" or "frequency invariant" depending on >> a callback flag or something like that. > > Seeing how frequency invariance is an arch feature, and cpufreq drivers > are also typically arch specific, do we really need a flag at this > level? The next frequency is selected by the governor and that's why. The driver gets a frequency to set only. Now, the governor needs to work with different platforms, so it needs to know how to deal with the given one. > In any case, I think the only difference between the two formulas should > be the addition of (1) for the platforms that do not already implement > frequency invariance. OK So I'm reading this as a statement that linear is a better approximation for frequency invariant utilization. This means that on platforms where the utilization is frequency invariant we should use next_freq = a * x (where x is given by (2) above) and for platforms where the utilization is not frequency invariant next_freq = a * x * current_freq / max_freq and all boils down to finding a. Now, it seems reasonable for a to be something like (1 + 1/n) * max_freq, so for non-frequency invariant we get next_freq = (1 + 1/n) * current_freq * x > That is actually correct for platforms which do as told with their DVFS > bits. And there's really not much else we can do short of implementing > the scheduler arch hook to do better. > >> (b) Make all architectures use "frequency invariant" and then look for a >> working formula (seems rather less than realistic to me to be honest). > > There was a proposal to implement arch_scale_freq_capacity() as a weak > function and have it serve the cpufreq selected frequency for (1) so > that everything would default to that. > > We didn't do that because that makes the function call and > multiplications unconditional. It's cheaper to add (1) to the cpufreq > side when selecting a freq rather than every single time we update > the util statistics. That's fine by me. My point was that we need different formulas for the frequency invariant case and the other one, basically.
On Tue, Mar 08, 2016 at 07:00:57PM +0100, Rafael J. Wysocki wrote: > On Tue, Mar 8, 2016 at 12:27 PM, Peter Zijlstra <peterz@infradead.org> wrote: > > Seeing how frequency invariance is an arch feature, and cpufreq drivers > > are also typically arch specific, do we really need a flag at this > > level? > > The next frequency is selected by the governor and that's why. The > driver gets a frequency to set only. > > Now, the governor needs to work with different platforms, so it needs > to know how to deal with the given one. Ah, indeed. In any case, the availability of arch_sched_scale_freq() is a compile time thingy, so we can, at compile time, know what to use. > > In any case, I think the only difference between the two formulas should > > be the addition of (1) for the platforms that do not already implement > > frequency invariance. > > OK > > So I'm reading this as a statement that linear is a better > approximation for frequency invariant utilization. Well, (1) is what the scheduler does with frequency invariance, except that allows a more flexible definition of 'current frequency' by asking for it every time we update the util stats. But if a platform doesn't need this, ie. it has a fixed frequency, or simply doesn't provide anything like this, assuming we run at the frequency we asked for is a reasonable assumption, no? > This means that on platforms where the utilization is frequency > invariant we should use > > next_freq = a * x > > (where x is given by (2) above) and for platforms where the > utilization is not frequency invariant > > next_freq = a * x * current_freq / max_freq > > and all boils down to finding a. Right. > Now, it seems reasonable for a to be something like (1 + 1/n) * > max_freq, so for non-frequency invariant we get > > next_freq = (1 + 1/n) * current_freq * x This seems like a big leap; where does: (1 + 1/n) * max_freq come from? And what is 'n'?
On Tue, Mar 8, 2016 at 8:26 PM, Peter Zijlstra <peterz@infradead.org> wrote: > On Tue, Mar 08, 2016 at 07:00:57PM +0100, Rafael J. Wysocki wrote: >> On Tue, Mar 8, 2016 at 12:27 PM, Peter Zijlstra <peterz@infradead.org> wrote: > >> > Seeing how frequency invariance is an arch feature, and cpufreq drivers >> > are also typically arch specific, do we really need a flag at this >> > level? >> >> The next frequency is selected by the governor and that's why. The >> driver gets a frequency to set only. >> >> Now, the governor needs to work with different platforms, so it needs >> to know how to deal with the given one. > > Ah, indeed. In any case, the availability of arch_sched_scale_freq() is > a compile time thingy, so we can, at compile time, know what to use. > >> > In any case, I think the only difference between the two formulas should >> > be the addition of (1) for the platforms that do not already implement >> > frequency invariance. >> >> OK >> >> So I'm reading this as a statement that linear is a better >> approximation for frequency invariant utilization. > > Well, (1) is what the scheduler does with frequency invariance, except > that allows a more flexible definition of 'current frequency' by asking > for it every time we update the util stats. > > But if a platform doesn't need this, ie. it has a fixed frequency, or > simply doesn't provide anything like this, assuming we run at the > frequency we asked for is a reasonable assumption, no? > >> This means that on platforms where the utilization is frequency >> invariant we should use >> >> next_freq = a * x >> >> (where x is given by (2) above) and for platforms where the >> utilization is not frequency invariant >> >> next_freq = a * x * current_freq / max_freq >> >> and all boils down to finding a. > > Right. However, that doesn't seem to be in agreement with Steve's results posted earlier in this thread. Also theoretically, with frequency invariance, the only way you can get to 100% utilization is by running at the max frequency, so the closer to 100% you get, the faster you need to run to get any further. That indicates nonlinear to me. >> Now, it seems reasonable for a to be something like (1 + 1/n) * >> max_freq, so for non-frequency invariant we get >> >> next_freq = (1 + 1/n) * current_freq * x > > This seems like a big leap; where does: > > (1 + 1/n) * max_freq > > come from? And what is 'n'? a = max_freq gives next_freq = max_freq for x = 1, but with that choice of a you may never get to x = 1 with frequency invariance because of the feedback effect mentioned above, so the 1/n produces the extra boost needed for that (n is a positive integer). Quite frankly, to me it looks like linear really is a better approximation for "raw" utilization. That is, for frequency invariant x we should take: next_freq = a * x * max_freq / current_freq (and if x is not frequency invariant, the right-hand side becomes a * x). Then, the extra boost needed to get to x = 1 for the frequency invariant case is produced by the (max_freq / current_freq) factor that is greater than 1 as long as we are not running at max_freq, and a can be chosen as max_freq.
Hi, sorry if I didn't reply yet. Trying to cope with jetlag and talks/meetings these days :-). Let me see if I'm getting what you are discussing, though. On 08/03/16 21:05, Rafael J. Wysocki wrote: > On Tue, Mar 8, 2016 at 8:26 PM, Peter Zijlstra <peterz@infradead.org> wrote: > > On Tue, Mar 08, 2016 at 07:00:57PM +0100, Rafael J. Wysocki wrote: [...] > a = max_freq gives next_freq = max_freq for x = 1, but with that > choice of a you may never get to x = 1 with frequency invariance > because of the feedback effect mentioned above, so the 1/n produces > the extra boost needed for that (n is a positive integer). > > Quite frankly, to me it looks like linear really is a better > approximation for "raw" utilization. That is, for frequency invariant > x we should take: > > next_freq = a * x * max_freq / current_freq > > (and if x is not frequency invariant, the right-hand side becomes a * > x). Then, the extra boost needed to get to x = 1 for the frequency > invariant case is produced by the (max_freq / current_freq) factor that is > greater than 1 as long as we are not running at max_freq, and a can be > chosen as max_freq. > Expanding terms again, your original formula (without the 1.1 factor of the last version) was: next_freq = util / max_cap * max_freq and this doesn't work when we have freq invariance since util won't go over curr_cap. What you propose above is to add another factor, so that we have: next_freq = util / max_cap * max_freq / curr_freq * max_freq which should give us the opportunity to reach max_freq also with freq invariance. This should actually be the same as doing: next_freq = util / max_cap * max_cap / curr_cap * max_freq We are basically scaling how much the cpu is busy at curr_cap back to the 0..1024 scale. And we use this to select next_freq. Also, we can simplify this to: next_freq = util / curr_cap * max_freq and we save some ops. However, if that is correct, I think we might have a problem, as we are skewing OPP selection towards higher frequencies. Let's suppose we have a platform with 3 OPPs: freq cap 1200 1024 900 768 600 512 As soon as a task reaches a utilization of 257 we will be selecting the second OPP as next_freq = 257 / 512 * 1200 ~ 602 While the cpu is only 50% busy in this case. And we will go to the max OPP when reaching ~492 (~64% of 768). That said, I guess this might work as a first solution, but we will probably need something better in the future. I understand Rafael's concerns regarding margins, but it seems to me that some kind of additional parameter will be probably needed anyway to fix this. Just to say again how we handle this in schedfreq, with a -20% margin applied to the lowest OPP we will get to the next one when utilization reaches ~410 (80% busy at curr OPP), and so on for the subsequent ones, which is less aggressive and might be better IMHO. Best, - Juri
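The arithmetic of the first step of that example, as a standalone sketch (the 3-OPP table above is hypothetical, not kernel code):

#include <stdio.h>

int main(void)
{
	unsigned int max_freq = 1200;	/* MHz */
	unsigned int curr_cap = 512;	/* capacity at the 600 MHz OPP */
	unsigned int util = 257;	/* just past 50% busy at 600 MHz */

	/* next_freq = util / curr_cap * max_freq, in integer math */
	printf("next_freq = %u MHz\n", util * max_freq / curr_cap);
	return 0;
}

This prints 602; with RELATION_L-style selection a target of ~602 already lands on the 900 MHz OPP even though the CPU is only about 50% busy, which is the skew being described.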
On Tue, Mar 08, 2016 at 09:05:50PM +0100, Rafael J. Wysocki wrote: > >> This means that on platforms where the utilization is frequency > >> invariant we should use > >> > >> next_freq = a * x > >> > >> (where x is given by (2) above) and for platforms where the > >> utilization is not frequency invariant > >> > >> next_freq = a * x * current_freq / max_freq > >> > >> and all boils down to finding a. > > > > Right. > > However, that doesn't seem to be in agreement with Steve's results > posted earlier in this thread. I could not make anything of those numbers. > Also theoretically, with frequency invariance, the only way you can get > to 100% utilization is by running at the max frequency, so the closer > to 100% you get, the faster you need to run to get any further. That > indicates nonlinear to me. I'm not seeing that, you get that by using a > 1. No need for non-linear. > >> Now, it seems reasonable for a to be something like (1 + 1/n) * > >> max_freq, so for non-frequency invariant we get > >> > >> next_freq = (1 + 1/n) * current_freq * x > > > > This seems like a big leap; where does: > > > > (1 + 1/n) * max_freq > > > > come from? And what is 'n'? > a = max_freq gives next_freq = max_freq for x = 1, next_freq = a * x * current_freq / max_freq [ a := max_freq, x := 1 ] -> = max_freq * 1 * current_freq / max_freq = current_freq != max_freq But I think I see what you're saying; because at x = 1, current_frequency must be max_frequency. Per your earlier point. > but with that choice of a you may never get to x = 1 with frequency > invariance because of the feedback effect mentioned above, so the 1/n > produces the extra boost needed for that (n is a positive integer). OK, so that gets us: a = (1 + 1/n) ; n > 0 [ I would not have chosen (1 + 1/n), but let's stick to that ] So for n = 4 that gets you: a = 1.25, which effectively gets you an 80% utilization tipping point. That is, 1.25 * .8 = 1, iow. you'll pick the next frequency (assuming RELATION_L like selection). Together this gets you: next_freq = (1 + 1/n) * max_freq * x * current_freq / max_freq = (1 + 1/n) * x * current_freq Again, with n = 4, x > .8 will result in a next_freq > current_freq, and hence (RELATION_L) pick a higher one. > Quite frankly, to me it looks like linear really is a better > approximation for "raw" utilization. That is, for frequency invariant > x we should take: > > next_freq = a * x * max_freq / current_freq (it's very confusing how you use 'x' for both invariant and non-invariant). That doesn't make sense, remember: util = \Sum_i u_i * freq_i / max_freq (1) Which for systems where freq_i is constant reduces to: util = util_raw * current_freq / max_freq (2) But you cannot reverse this. IOW you cannot try and divide out current_freq on a frequency invariant metric. So going by: next_freq = (1 + 1/n) * max_freq * util (3) if we substitute (2) into (3) we get: = (1 + 1/n) * max_freq * util_raw * current_freq / max_freq = (1 + 1/n) * current_freq * util_raw (4) Which gets you two formulas with the same general behaviour. As (2) is the only approximation of (1) we can make.
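The 80% tipping point of (4) can be checked numerically; a standalone sketch (frequencies made up, util_raw on a 0..1024 scale, not kernel code):

#include <stdio.h>

int main(void)
{
	unsigned int cur_freq = 900000;	/* kHz, hypothetical */
	unsigned int util_raw;

	for (util_raw = 750; util_raw <= 900; util_raw += 50) {
		/* next = 1.25 * cur_freq * util_raw / 1024, i.e. n = 4 */
		unsigned long next = (unsigned long)(cur_freq + (cur_freq >> 2))
					* util_raw / 1024;
		printf("util_raw=%u -> next=%lu (%s)\n", util_raw, next,
		       next > cur_freq ? "bump" : "stay");
	}
	return 0;
}

The crossover sits near util_raw = 819, i.e. 80% of 1024: below it next_freq stays at or under current_freq, above it a RELATION_L-style selection picks a higher OPP.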
On Wed, Mar 9, 2016 at 5:39 PM, Peter Zijlstra <peterz@infradead.org> wrote: > On Tue, Mar 08, 2016 at 09:05:50PM +0100, Rafael J. Wysocki wrote: >> >> This means that on platforms where the utilization is frequency >> >> invariant we should use >> >> >> >> next_freq = a * x >> >> >> >> (where x is given by (2) above) and for platforms where the >> >> utilization is not frequency invariant >> >> >> >> next_freq = a * x * current_freq / max_freq >> >> >> >> and all boils down to finding a. >> > >> > Right. >> >> However, that doesn't seem to be in agreement with the Steve's results >> posted earlier in this thread. > > I could not make anything of those numbers. > >> Also theoretically, with frequency invariant, the only way you can get >> to 100% utilization is by running at the max frequency, so the closer >> to 100% you get, the faster you need to run to get any further. That >> indicates nonlinear to me. > > I'm not seeing that, you get that by using a > 1. No need for > non-linear. OK >> >> Now, it seems reasonable for a to be something like (1 + 1/n) * >> >> max_freq, so for non-frequency invariant we get >> >> >> >> nex_freq = (1 + 1/n) * current_freq * x (*) (see below) >> > This seems like a big leap; where does: >> > >> > (1 + 1/n) * max_freq >> > >> > come from? And what is 'n'? > >> a = max_freq gives next_freq = max_freq for x = 1, > > next_freq = a * x * current_freq / max_freq > > [ a := max_freq, x := 1 ] -> > > = max_freq * 1 * current_freq / max_freq > = current_freq > > != max_freq > > But I think I see what you're saying; because at x = 1, > current_frequency must be max_frequency. Per your earlier point. Correct. >> but with that choice of a you may never get to x = 1 with frequency >> invariant because of the feedback effect mentioned above, so the 1/n >> produces the extra boost needed for that (n is a positive integer). > > OK, so that gets us: > > a = (1 + 1/n) ; n > 0 > > [ I would not have chosen (1 + 1/n), but lets stick to that ] Well, what would you choose then? :-) > So for n = 4 that gets you: a = 1.25, which effectively gets you an 80% > utilization tipping point. That is, 1.25 * .8 = 1, iow. you'll pick the > next frequency (assuming RELATION_L like selection). > > Together this gets you: > > next_freq = (1 + 1/n) * max_freq * x * current_freq / max_freq > = (1 + 1/n) * x * current_freq That seems to be what I said above (*), isn't it? > Again, with n = 4, x > .8 will result in a next_freq > current_freq, and > hence (RELATION_L) pick a higher one. OK >> Quite frankly, to me it looks like linear really is a better >> approximation for "raw" utilization. That is, for frequency invariant >> x we should take: >> >> next_freq = a * x * max_freq / current_freq > > (its very confusing how you use 'x' for both invariant and > non-invariant). > > That doesn't make sense, remember: > > util = \Sum_i u_i * freq_i / max_freq (1) > > Which for systems where freq_i is constant reduces to: > > util = util_raw * current_freq / max_freq (2) > > But you cannot reverse this. IOW you cannot try and divide out > current_freq on a frequency invariant metric. I see. > So going by: > > next_freq = (1 + 1/n) * max_freq * util (3) I think that should be next_freq = (1 + 1/n) * max_freq * util / max (where max is the second argument of cpufreq_update_util) or the dimensions on both sides don't match. 
> if we substitute (2) into (3) we get: > > = (1 + 1/n) * max_freq * util_raw * current_freq / max_freq > = (1 + 1/n) * current_freq * util_raw (4) > > Which gets you two formulas with the same general behaviour. As (2) is > the only approximation of (1) we can make. OK So since utilization is not frequency invariant in the current mainline (or linux-next for that matter) AFAIC, I'm going to use the following in the next version of the schedutil patch series: next_freq = 1.25 * current_freq * util_raw / max where util_raw and max are what I get from cpufreq_update_util(). 1.25 is for the 80% tipping point, which I think is reasonable.
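In integer arithmetic that selection could look something like the minimal sketch below (illustrative names only, not the posted patch; the shift implements the 1.25 factor):

static unsigned int next_freq_raw(unsigned int cur_freq,
				  unsigned long util_raw, unsigned long max)
{
	/* 1.25 * cur_freq * util_raw / max; types simplified for the sketch */
	return (cur_freq + (cur_freq >> 2)) * util_raw / max;
}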
On Wed, Mar 9, 2016 at 11:15 AM, Juri Lelli <juri.lelli@arm.com> wrote: > Hi, > > sorry if I didn't reply yet. Trying to cope with jetlag and > talks/meetings these days :-). Let me see if I'm getting what you are > discussing, though. > > On 08/03/16 21:05, Rafael J. Wysocki wrote: >> On Tue, Mar 8, 2016 at 8:26 PM, Peter Zijlstra <peterz@infradead.org> wrote: >> > On Tue, Mar 08, 2016 at 07:00:57PM +0100, Rafael J. Wysocki wrote: >> >> On Tue, Mar 8, 2016 at 12:27 PM, Peter Zijlstra <peterz@infradead.org> wrote: > > [...] > >> a = max_freq gives next_freq = max_freq for x = 1, but with that >> choice of a you may never get to x = 1 with frequency invariant >> because of the feedback effect mentioned above, so the 1/n produces >> the extra boost needed for that (n is a positive integer). >> >> Quite frankly, to me it looks like linear really is a better >> approximation for "raw" utilization. That is, for frequency invariant >> x we should take: >> >> next_freq = a * x * max_freq / current_freq >> >> (and if x is not frequency invariant, the right-hand side becomes a * >> x). Then, the extra boost needed to get to x = 1 for frequency >> invariant is produced by the (max_freq / current_freq) factor that is >> greater than 1 as long as we are not running at max_freq and a can be >> chosen as max_freq. >> > > Expanding terms again, your original formula (without the 1.1 factor of > the last version) was: > > next_freq = util / max_cap * max_freq > > and this doesn't work when we have freq invariance since util won't go > over curr_cap. Can you please remind me what curr_cap is? > What you propose above is to add another factor, so that we have: > > next_freq = util / max_cap * max_freq / curr_freq * max_freq > > which should give us the opportunity to reach max_freq also with freq > invariance. > > This should actually be the same of doing: > > next_freq = util / max_cap * max_cap / curr_cap * max_freq > > We are basically scaling how much the cpu is busy at curr_cap back to > the 0..1024 scale. And we use this to select next_freq. Also, we can > simplify this to: > > next_freq = util / curr_cap * max_freq > > and we save some ops. > > However, if that is correct, I think we might have a problem, as we are > skewing OPP selection towards higher frequencies. Let's suppose we have > a platform with 3 OPPs: > > freq cap > 1200 1024 > 900 768 > 600 512 > > As soon a task reaches an utilization of 257 we will be selecting the > second OPP as > > next_freq = 257 / 512 * 1200 ~ 602 > > While the cpu is only 50% busy in this case. And we will go at max OPP > when reaching ~492 (~64% of 768). > > That said, I guess this might work as a first solution, but we will > probably need something better in the future. I understand Rafael's > concerns regardin margins, but it seems to me that some kind of > additional parameter will be probably needed anyway to fix this. > Just to say again how we handle this in schedfreq, with a -20% margin > applied to the lowest OPP we will get to the next one when utilization > reaches ~410 (80% busy at curr OPP), and so on for the subsequent ones, > which is less aggressive and might be better IMHO. Well, Peter says that my idea is incorrect, so I'll go for next_freq = C * current_freq * util_raw / max where C > 1 (and likely C < 1.5) instead. That means C has to be determined somehow or guessed. The 80% tipping point condition seems reasonable to me, though, which leads to C = 1.25. 
On 10 March 2016 at 06:28, Rafael J. Wysocki <rafael@kernel.org> wrote: > On Wed, Mar 9, 2016 at 5:39 PM, Peter Zijlstra <peterz@infradead.org> wrote: >> On Tue, Mar 08, 2016 at 09:05:50PM +0100, Rafael J. Wysocki wrote: >>> >> This means that on platforms where the utilization is frequency >>> >> invariant we should use >>> >> >>> >> next_freq = a * x >>> >> >>> >> (where x is given by (2) above) and for platforms where the >>> >> utilization is not frequency invariant >>> >> >>> >> next_freq = a * x * current_freq / max_freq >>> >> >>> >> and all boils down to finding a. >>> > >>> > Right. >>> >>> However, that doesn't seem to be in agreement with the Steve's results >>> posted earlier in this thread. >> >> I could not make anything of those numbers. >> >>> Also theoretically, with frequency invariant, the only way you can get >>> to 100% utilization is by running at the max frequency, so the closer >>> to 100% you get, the faster you need to run to get any further. That >>> indicates nonlinear to me. >> >> I'm not seeing that, you get that by using a > 1. No need for >> non-linear. > > OK > >>> >> Now, it seems reasonable for a to be something like (1 + 1/n) * >>> >> max_freq, so for non-frequency invariant we get >>> >> >>> >> nex_freq = (1 + 1/n) * current_freq * x > > (*) (see below) > >>> > This seems like a big leap; where does: >>> > >>> > (1 + 1/n) * max_freq >>> > >>> > come from? And what is 'n'? >> >>> a = max_freq gives next_freq = max_freq for x = 1, >> >> next_freq = a * x * current_freq / max_freq >> >> [ a := max_freq, x := 1 ] -> >> >> = max_freq * 1 * current_freq / max_freq >> = current_freq >> >> != max_freq >> >> But I think I see what you're saying; because at x = 1, >> current_frequency must be max_frequency. Per your earlier point. > > Correct. > >>> but with that choice of a you may never get to x = 1 with frequency >>> invariant because of the feedback effect mentioned above, so the 1/n >>> produces the extra boost needed for that (n is a positive integer). >> >> OK, so that gets us: >> >> a = (1 + 1/n) ; n > 0 >> >> [ I would not have chosen (1 + 1/n), but lets stick to that ] > > Well, what would you choose then? :-) > >> So for n = 4 that gets you: a = 1.25, which effectively gets you an 80% >> utilization tipping point. That is, 1.25 * .8 = 1, iow. you'll pick the >> next frequency (assuming RELATION_L like selection). >> >> Together this gets you: >> >> next_freq = (1 + 1/n) * max_freq * x * current_freq / max_freq >> = (1 + 1/n) * x * current_freq > > That seems to be what I said above (*), isn't it? > >> Again, with n = 4, x > .8 will result in a next_freq > current_freq, and >> hence (RELATION_L) pick a higher one. > > OK > >>> Quite frankly, to me it looks like linear really is a better >>> approximation for "raw" utilization. That is, for frequency invariant >>> x we should take: >>> >>> next_freq = a * x * max_freq / current_freq >> >> (its very confusing how you use 'x' for both invariant and >> non-invariant). >> >> That doesn't make sense, remember: >> >> util = \Sum_i u_i * freq_i / max_freq (1) >> >> Which for systems where freq_i is constant reduces to: >> >> util = util_raw * current_freq / max_freq (2) >> >> But you cannot reverse this. IOW you cannot try and divide out >> current_freq on a frequency invariant metric. > > I see. 
> >> So going by: >> >> next_freq = (1 + 1/n) * max_freq * util (3) > > I think that should be > > next_freq = (1 + 1/n) * max_freq * util / max > > (where max is the second argument of cpufreq_update_util) or the > dimensions on both sides don't match. > >> if we substitute (2) into (3) we get: >> >> = (1 + 1/n) * max_freq * util_raw * current_freq / max_freq >> = (1 + 1/n) * current_freq * util_raw (4) >> >> Which gets you two formulas with the same general behaviour. As (2) is >> the only approximation of (1) we can make. > > OK > > So since utilization is not frequency invariant in the current > mainline (or linux-next for that matter) AFAIC, I'm going to use the > following in the next version of the schedutil patch series: > > next_freq = 1.25 * current_freq * util_raw / max > > where util_raw and max are what I get from cpufreq_update_util(). > > 1.25 is for the 80% tipping point, which I think is reasonable. We have the arch_scale_freq_capacity function, which is arch dependent and can be used to merge the two formulas that Peter described above. By default, arch_scale_freq_capacity returns SCHED_CAPACITY_SCALE, which is the max capacity, but when arch_scale_freq_capacity is defined by an architecture, it returns current_freq * max_capacity / max_freq, so can't we use arch_scale_freq_capacity() in your formula? Taking your formula above, it becomes: next_freq = 1.25 * current_freq * util / arch_scale_freq_capacity() Without the invariance feature, we have the same formula as above: next_freq = 1.25 * current_freq * util_raw / max because SCHED_CAPACITY_SCALE is the max capacity. With the invariance feature, we have next_freq = 1.25 * current_freq * util / (current_freq * max_capacity / max_freq) = 1.25 * util * max_freq / max which is the formula that has to be used with frequency invariant utilization. So we have one formula that works for both configurations (this is not really optimized for invariant systems because we multiply and then divide by current_freq in two different places, but it's better than a wrong formula). Now, arch_scale_freq_capacity is available in the kernel/sched/sched.h header file, which can only be accessed by scheduler code... Maybe we can pass the arch_scale_freq_capacity value instead of the max one as a parameter in the update_util function prototype. Vincent
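As a sketch of what Vincent is proposing (names approximate; cur_scale stands in for an arch_scale_freq_capacity()-style value on a 0..SCHED_CAPACITY_SCALE scale, as described above):

static unsigned int next_freq_unified(unsigned int cur_freq, unsigned long util,
				      unsigned long cur_scale)
{
	/* 1.25 * cur_freq * util / arch_scale_freq_capacity() */
	return (cur_freq + (cur_freq >> 2)) * util / cur_scale;
}

Peter's objection to dividing by a quantity derived from a rapidly changing current frequency follows below.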
On 10/03/16 00:41, Rafael J. Wysocki wrote: > On Wed, Mar 9, 2016 at 11:15 AM, Juri Lelli <juri.lelli@arm.com> wrote: > > Hi, > > > > sorry if I didn't reply yet. Trying to cope with jetlag and > > talks/meetings these days :-). Let me see if I'm getting what you are > > discussing, though. > > > > On 08/03/16 21:05, Rafael J. Wysocki wrote: > >> On Tue, Mar 8, 2016 at 8:26 PM, Peter Zijlstra <peterz@infradead.org> wrote: > >> > On Tue, Mar 08, 2016 at 07:00:57PM +0100, Rafael J. Wysocki wrote: > >> >> On Tue, Mar 8, 2016 at 12:27 PM, Peter Zijlstra <peterz@infradead.org> wrote: > > > > [...] > > > >> a = max_freq gives next_freq = max_freq for x = 1, but with that > >> choice of a you may never get to x = 1 with frequency invariant > >> because of the feedback effect mentioned above, so the 1/n produces > >> the extra boost needed for that (n is a positive integer). > >> > >> Quite frankly, to me it looks like linear really is a better > >> approximation for "raw" utilization. That is, for frequency invariant > >> x we should take: > >> > >> next_freq = a * x * max_freq / current_freq > >> > >> (and if x is not frequency invariant, the right-hand side becomes a * > >> x). Then, the extra boost needed to get to x = 1 for frequency > >> invariant is produced by the (max_freq / current_freq) factor that is > >> greater than 1 as long as we are not running at max_freq and a can be > >> chosen as max_freq. > >> > > > > Expanding terms again, your original formula (without the 1.1 factor of > > the last version) was: > > > > next_freq = util / max_cap * max_freq > > > > and this doesn't work when we have freq invariance since util won't go > > over curr_cap. > > Can you please remind me what curr_cap is? > The capacity at current frequency. > > What you propose above is to add another factor, so that we have: > > > > next_freq = util / max_cap * max_freq / curr_freq * max_freq > > > > which should give us the opportunity to reach max_freq also with freq > > invariance. > > > > This should actually be the same of doing: > > > > next_freq = util / max_cap * max_cap / curr_cap * max_freq > > > > We are basically scaling how much the cpu is busy at curr_cap back to > > the 0..1024 scale. And we use this to select next_freq. Also, we can > > simplify this to: > > > > next_freq = util / curr_cap * max_freq > > > > and we save some ops. > > > > However, if that is correct, I think we might have a problem, as we are > > skewing OPP selection towards higher frequencies. Let's suppose we have > > a platform with 3 OPPs: > > > > freq cap > > 1200 1024 > > 900 768 > > 600 512 > > > > As soon a task reaches an utilization of 257 we will be selecting the > > second OPP as > > > > next_freq = 257 / 512 * 1200 ~ 602 > > > > While the cpu is only 50% busy in this case. And we will go at max OPP > > when reaching ~492 (~64% of 768). > > > > That said, I guess this might work as a first solution, but we will > > probably need something better in the future. I understand Rafael's > > concerns regardin margins, but it seems to me that some kind of > > additional parameter will be probably needed anyway to fix this. > > Just to say again how we handle this in schedfreq, with a -20% margin > > applied to the lowest OPP we will get to the next one when utilization > > reaches ~410 (80% busy at curr OPP), and so on for the subsequent ones, > > which is less aggressive and might be better IMHO. 
> > Well, Peter says that my idea is incorrect, so I'll go for > > next_freq = C * current_freq * util_raw / max > > where C > 1 (and likely C < 1.5) instead. > > That means C has to be determined somehow or guessed. The 80% tipping > point condition seems reasonable to me, though, which leads to C = > 1.25. > Right. So, when using freq. invariant util we have: next_freq = C * curr_freq * util / curr_cap as util_raw = util * max / curr_cap What Vincent is saying makes sense, though. If we use arch_scale_freq_capacity() as the denominator instead of max, we can use a single formula for both cases. Best, - Juri
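Spelling out the substitution (assuming ideal invariance, i.e. curr_cap = max * curr_freq / max_freq, with the names as used above):

next_freq = C * curr_freq * util_raw / max
          = C * curr_freq * (util * max / curr_cap) / max
          = C * curr_freq * util / curr_cap
          = C * util * max_freq / max

so the raw-utilization formula and the invariant one do coincide, provided the capacity scaling tracks the running frequency exactly.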
On Thu, Mar 10, 2016 at 12:28:52AM +0100, Rafael J. Wysocki wrote: > > [ I would not have chosen (1 + 1/n), but let's stick to that ] > > Well, what would you choose then? :-) 1/p ; 0 < p < 1 or so, where p then represents the percentile threshold where you want to bump to the next freq. > I think that should be > > next_freq = (1 + 1/n) * max_freq * util / max > > (where max is the second argument of cpufreq_update_util) or the > dimensions on both sides don't match. Well yes, but so far we were treating util (and util_raw) as 0 < u < 1 values, so already normalized against max. But yes. > > if we substitute (2) into (3) we get: > > > > = (1 + 1/n) * max_freq * util_raw * current_freq / max_freq > > = (1 + 1/n) * current_freq * util_raw (4) > > > > Which gets you two formulas with the same general behaviour. As (2) is > > the only approximation of (1) we can make. > > OK > > So since utilization is not frequency invariant in the current > mainline (or linux-next for that matter) AFAIC, I'm going to use the > following in the next version of the schedutil patch series: > > next_freq = 1.25 * current_freq * util_raw / max > > where util_raw and max are what I get from cpufreq_update_util(). > > 1.25 is for the 80% tipping point, which I think is reasonable. OK.
On Thu, Mar 10, 2016 at 10:44:21AM +0700, Vincent Guittot wrote: > We have the arch_scale_freq_capacity function, which is arch dependent > and can be used to merge the two formulas that Peter described above. > By default, arch_scale_freq_capacity returns SCHED_CAPACITY_SCALE, which > is the max capacity, > but when arch_scale_freq_capacity is defined by an architecture, > it returns current_freq * max_capacity / max_freq However, current_freq is a very fluid thing, it might (and will) change very rapidly on some platforms. This is the same point I made earlier, you cannot try and divide out current_freq from the invariant measure. > so can't we use arch_scale_freq_capacity() in your formula? Taking your formula > above, it becomes: > next_freq = 1.25 * current_freq * util / arch_scale_freq_capacity() No, that cannot work, nor does it make any sense, per the above. > With the invariance feature, we have: > > next_freq = 1.25 * current_freq * util / (current_freq * max_capacity / max_freq) > = 1.25 * util * max_freq / max > > which is the formula that has to be used with frequency invariant > utilization. Wrong, you cannot talk about current_freq in the invariant case. > Maybe we can pass the arch_scale_freq_capacity value instead of the max one > as a parameter in the update_util function prototype. No, since it's a compile time thing, we can simply do: #ifdef arch_scale_freq_capacity next_freq = (1 + 1/n) * max_freq * (util / max) #else next_freq = (1 + 1/n) * current_freq * (util_raw / max) #endif
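Folded into a governor helper, the compile-time split might look like this sketch (illustrative names and simplified parameters, not from the posted patch; n = 4 assumed):

static unsigned int get_next_freq(unsigned int max_freq, unsigned int cur_freq,
				  unsigned long util, unsigned long max)
{
#ifdef arch_scale_freq_capacity
	unsigned int freq = max_freq;	/* util is frequency invariant */
#else
	unsigned int freq = cur_freq;	/* util is raw */
#endif
	return (freq + (freq >> 2)) * util / max;	/* the (1 + 1/4) factor */
}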
On 10 March 2016 at 17:07, Peter Zijlstra <peterz@infradead.org> wrote: > On Thu, Mar 10, 2016 at 10:44:21AM +0700, Vincent Guittot wrote: >> We have the arch_scale_freq_capacity function, which is arch dependent >> and can be used to merge the two formulas that Peter described above. >> By default, arch_scale_freq_capacity returns SCHED_CAPACITY_SCALE, which >> is the max capacity, >> but when arch_scale_freq_capacity is defined by an architecture, > >> it returns current_freq * max_capacity / max_freq > > However, current_freq is a very fluid thing, it might (and will) change > very rapidly on some platforms. > > This is the same point I made earlier, you cannot try and divide out > current_freq from the invariant measure. > >> so can't we use arch_scale_freq_capacity() in your formula? Taking your formula >> above, it becomes: >> next_freq = 1.25 * current_freq * util / arch_scale_freq_capacity() > > No, that cannot work, nor does it make any sense, per the above. > >> With the invariance feature, we have: >> >> next_freq = 1.25 * current_freq * util / (current_freq * max_capacity / max_freq) >> = 1.25 * util * max_freq / max >> >> which is the formula that has to be used with frequency invariant >> utilization. > > Wrong, you cannot talk about current_freq in the invariant case. > >> Maybe we can pass the arch_scale_freq_capacity value instead of the max one >> as a parameter in the update_util function prototype. > > No, since it's a compile time thing, we can simply do: > > #ifdef arch_scale_freq_capacity > next_freq = (1 + 1/n) * max_freq * (util / max) > #else > next_freq = (1 + 1/n) * current_freq * (util_raw / max) > #endif Selecting the formula at compile time is clearly better. I wrongly thought that it couldn't be accepted as a solution.
On Thu, Mar 10, 2016 at 05:23:54PM +0700, Vincent Guittot wrote: > > No, since it's a compile time thing, we can simply do: > > > > #ifdef arch_scale_freq_capacity > > next_freq = (1 + 1/n) * max_freq * (util / max) > > #else > > next_freq = (1 + 1/n) * current_freq * (util_raw / max) > > #endif > > Selecting the formula at compile time is clearly better. I wrongly thought that > it couldn't be accepted as a solution. Well, it's bound to get more 'interesting' since I foresee implementations not always actually doing the invariant thing. Take for example the thing I sent: lkml.kernel.org/r/20160303162829.GB6375@twins.programming.kicks-ass.net it both shows why you cannot talk about current_freq but also that the above needs a little more help (for the !X86_FEATURE_APERFMPERF case). But the !arch_scale_freq_capacity case should indeed be that simple.
On Thu, Mar 10, 2016 at 11:30:08AM +0100, Peter Zijlstra wrote: > On Thu, Mar 10, 2016 at 05:23:54PM +0700, Vincent Guittot wrote: > > > > No, since it's a compile time thing, we can simply do: > > > > > > #ifdef arch_scale_freq_capacity > > > next_freq = (1 + 1/n) * max_freq * (util / max) > > > #else > > > next_freq = (1 + 1/n) * current_freq * (util_raw / max) > > > #endif > > > > Selecting the formula at compile time is clearly better. I wrongly thought that > > it couldn't be accepted as a solution. > > Well, it's bound to get more 'interesting' since I foresee implementations > not always actually doing the invariant thing. > > Take for example the thing I sent: > > lkml.kernel.org/r/20160303162829.GB6375@twins.programming.kicks-ass.net > > it both shows why you cannot talk about current_freq but also that the > above needs a little more help (for the !X86_FEATURE_APERFMPERF case). > > But the !arch_scale_freq_capacity case should indeed be that simple. Maybe something like: #ifdef arch_scale_freq_capacity #ifndef arch_scale_freq_invariant #define arch_scale_freq_invariant() (true) #endif #else /* arch_scale_freq_capacity */ #define arch_scale_freq_invariant() (false) #endif if (arch_scale_freq_invariant()) And have archs that have a conditional arch_scale_freq_capacity() implementation provide an arch_scale_freq_invariant() implementation.
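With such a predicate the split can move from the preprocessor into ordinary code, and where the macro is a compile-time constant the unused branch can simply be optimized away; a sketch (illustrative names and simplified parameters again):

static unsigned int get_next_freq(unsigned int max_freq, unsigned int cur_freq,
				  unsigned long util, unsigned long max)
{
	/* invariant util scales against max_freq, raw util against cur_freq */
	unsigned int freq = arch_scale_freq_invariant() ? max_freq : cur_freq;

	return (freq + (freq >> 2)) * util / max;
}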
On Thursday, March 10, 2016 11:30:34 AM Juri Lelli wrote: > On 10/03/16 00:41, Rafael J. Wysocki wrote: > > On Wed, Mar 9, 2016 at 11:15 AM, Juri Lelli <juri.lelli@arm.com> wrote: > > > Hi, > > > > > > sorry if I didn't reply yet. Trying to cope with jetlag and > > > talks/meetings these days :-). Let me see if I'm getting what you are > > > discussing, though. > > > > > > On 08/03/16 21:05, Rafael J. Wysocki wrote: > > >> On Tue, Mar 8, 2016 at 8:26 PM, Peter Zijlstra <peterz@infradead.org> wrote: > > >> > On Tue, Mar 08, 2016 at 07:00:57PM +0100, Rafael J. Wysocki wrote: > > >> >> On Tue, Mar 8, 2016 at 12:27 PM, Peter Zijlstra <peterz@infradead.org> wrote: > > > > > > [...] > > > > > >> a = max_freq gives next_freq = max_freq for x = 1, but with that > > >> choice of a you may never get to x = 1 with frequency invariant > > >> because of the feedback effect mentioned above, so the 1/n produces > > >> the extra boost needed for that (n is a positive integer). > > >> > > >> Quite frankly, to me it looks like linear really is a better > > >> approximation for "raw" utilization. That is, for frequency invariant > > >> x we should take: > > >> > > >> next_freq = a * x * max_freq / current_freq > > >> > > >> (and if x is not frequency invariant, the right-hand side becomes a * > > >> x). Then, the extra boost needed to get to x = 1 for frequency > > >> invariant is produced by the (max_freq / current_freq) factor that is > > >> greater than 1 as long as we are not running at max_freq and a can be > > >> chosen as max_freq. > > >> > > > > > > Expanding terms again, your original formula (without the 1.1 factor of > > > the last version) was: > > > > > > next_freq = util / max_cap * max_freq > > > > > > and this doesn't work when we have freq invariance since util won't go > > > over curr_cap. > > > > Can you please remind me what curr_cap is? > > > > The capacity at current frequency. I see, thanks! > > > What you propose above is to add another factor, so that we have: > > > > > > next_freq = util / max_cap * max_freq / curr_freq * max_freq > > > > > > which should give us the opportunity to reach max_freq also with freq > > > invariance. > > > > > > This should actually be the same of doing: > > > > > > next_freq = util / max_cap * max_cap / curr_cap * max_freq > > > > > > We are basically scaling how much the cpu is busy at curr_cap back to > > > the 0..1024 scale. And we use this to select next_freq. Also, we can > > > simplify this to: > > > > > > next_freq = util / curr_cap * max_freq > > > > > > and we save some ops. > > > > > > However, if that is correct, I think we might have a problem, as we are > > > skewing OPP selection towards higher frequencies. Let's suppose we have > > > a platform with 3 OPPs: > > > > > > freq cap > > > 1200 1024 > > > 900 768 > > > 600 512 > > > > > > As soon a task reaches an utilization of 257 we will be selecting the > > > second OPP as > > > > > > next_freq = 257 / 512 * 1200 ~ 602 > > > > > > While the cpu is only 50% busy in this case. And we will go at max OPP > > > when reaching ~492 (~64% of 768). > > > > > > That said, I guess this might work as a first solution, but we will > > > probably need something better in the future. I understand Rafael's > > > concerns regardin margins, but it seems to me that some kind of > > > additional parameter will be probably needed anyway to fix this. 
> > > Just to say again how we handle this in schedfreq, with a -20% margin > > > applied to the lowest OPP we will get to the next one when utilization > > > reaches ~410 (80% busy at curr OPP), and so on for the subsequent ones, > > > which is less aggressive and might be better IMHO. > > > > Well, Peter says that my idea is incorrect, so I'll go for > > > > next_freq = C * current_freq * util_raw / max > > > > where C > 1 (and likely C < 1.5) instead. > > > > That means C has to be determined somehow or guessed. The 80% tipping > > point condition seems reasonable to me, though, which leads to C = > > 1.25. > > > > Right. So, when using freq. invariant util we have: > > next_freq = C * curr_freq * util / curr_cap > > as > > util_raw = util * max / curr_cap > > What Vincent is saying makes sense, though. If we use > arch_scale_freq_capacity() as the denominator instead of max, we can use a > single formula for both cases. I'm not convinced about that yet, but let me think about it some more. :-) Thanks, Rafael
On Thursday, March 10, 2016 11:56:14 AM Peter Zijlstra wrote: > On Thu, Mar 10, 2016 at 11:30:08AM +0100, Peter Zijlstra wrote: > > On Thu, Mar 10, 2016 at 05:23:54PM +0700, Vincent Guittot wrote: > > > > > > No, since it's a compile time thing, we can simply do: > > > > > > > > #ifdef arch_scale_freq_capacity > > > > next_freq = (1 + 1/n) * max_freq * (util / max) > > > > #else > > > > next_freq = (1 + 1/n) * current_freq * (util_raw / max) > > > > #endif > > > > > > Selecting the formula at compile time is clearly better. I wrongly thought that > > > it couldn't be accepted as a solution. > > > > Well, it's bound to get more 'interesting' since I foresee implementations > > not always actually doing the invariant thing. > > > > Take for example the thing I sent: > > > > lkml.kernel.org/r/20160303162829.GB6375@twins.programming.kicks-ass.net > > > > it both shows why you cannot talk about current_freq but also that the > > above needs a little more help (for the !X86_FEATURE_APERFMPERF case). > > > > But the !arch_scale_freq_capacity case should indeed be that simple. > > Maybe something like: > > #ifdef arch_scale_freq_capacity > #ifndef arch_scale_freq_invariant > #define arch_scale_freq_invariant() (true) > #endif > #else /* arch_scale_freq_capacity */ > #define arch_scale_freq_invariant() (false) > #endif > > if (arch_scale_freq_invariant()) > > And have archs that have a conditional arch_scale_freq_capacity() > implementation provide an arch_scale_freq_invariant() implementation. Yeah, looks workable to me.
Quoting Rafael J. Wysocki (2016-03-09 15:41:34) > On Wed, Mar 9, 2016 at 11:15 AM, Juri Lelli <juri.lelli@arm.com> wrote: > > Hi, > > > > sorry if I didn't reply yet. Trying to cope with jetlag and > > talks/meetings these days :-). Let me see if I'm getting what you are > > discussing, though. > > > > On 08/03/16 21:05, Rafael J. Wysocki wrote: > >> On Tue, Mar 8, 2016 at 8:26 PM, Peter Zijlstra <peterz@infradead.org> wrote: > >> > On Tue, Mar 08, 2016 at 07:00:57PM +0100, Rafael J. Wysocki wrote: > >> >> On Tue, Mar 8, 2016 at 12:27 PM, Peter Zijlstra <peterz@infradead.org> wrote: > > > > [...] > > > >> a = max_freq gives next_freq = max_freq for x = 1, but with that > >> choice of a you may never get to x = 1 with frequency invariant > >> because of the feedback effect mentioned above, so the 1/n produces > >> the extra boost needed for that (n is a positive integer). > >> > >> Quite frankly, to me it looks like linear really is a better > >> approximation for "raw" utilization. That is, for frequency invariant > >> x we should take: > >> > >> next_freq = a * x * max_freq / current_freq > >> > >> (and if x is not frequency invariant, the right-hand side becomes a * > >> x). Then, the extra boost needed to get to x = 1 for frequency > >> invariant is produced by the (max_freq / current_freq) factor that is > >> greater than 1 as long as we are not running at max_freq and a can be > >> chosen as max_freq. > >> > > > > Expanding terms again, your original formula (without the 1.1 factor of > > the last version) was: > > > > next_freq = util / max_cap * max_freq > > > > and this doesn't work when we have freq invariance since util won't go > > over curr_cap. > > Can you please remind me what curr_cap is? > > > What you propose above is to add another factor, so that we have: > > > > next_freq = util / max_cap * max_freq / curr_freq * max_freq > > > > which should give us the opportunity to reach max_freq also with freq > > invariance. > > > > This should actually be the same of doing: > > > > next_freq = util / max_cap * max_cap / curr_cap * max_freq > > > > We are basically scaling how much the cpu is busy at curr_cap back to > > the 0..1024 scale. And we use this to select next_freq. Also, we can > > simplify this to: > > > > next_freq = util / curr_cap * max_freq > > > > and we save some ops. > > > > However, if that is correct, I think we might have a problem, as we are > > skewing OPP selection towards higher frequencies. Let's suppose we have > > a platform with 3 OPPs: > > > > freq cap > > 1200 1024 > > 900 768 > > 600 512 > > > > As soon a task reaches an utilization of 257 we will be selecting the > > second OPP as > > > > next_freq = 257 / 512 * 1200 ~ 602 > > > > While the cpu is only 50% busy in this case. And we will go at max OPP > > when reaching ~492 (~64% of 768). > > > > That said, I guess this might work as a first solution, but we will > > probably need something better in the future. I understand Rafael's > > concerns regardin margins, but it seems to me that some kind of > > additional parameter will be probably needed anyway to fix this. > > Just to say again how we handle this in schedfreq, with a -20% margin > > applied to the lowest OPP we will get to the next one when utilization > > reaches ~410 (80% busy at curr OPP), and so on for the subsequent ones, > > which is less aggressive and might be better IMHO. > > Well, Peter says that my idea is incorrect, so I'll go for > > next_freq = C * current_freq * util_raw / max > > where C > 1 (and likely C < 1.5) instead. 
> > That means C has to be determined somehow or guessed. The 80% tipping > point condition seems reasonable to me, though, which leads to C = > 1.25. Right, that is the same value used in the schedfreq series: +/* + * Capacity margin added to CFS and RT capacity requests to provide + * some head room if task utilization further increases. + */ +unsigned int capacity_margin = 1280; Regards, Mike
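The 1280 value corresponds to the same C: assuming SCHED_CAPACITY_SCALE is 1024, 1280 / 1024 = 1.25. A trivial standalone check (hypothetical util value, not kernel code):

#include <stdio.h>

int main(void)
{
	unsigned int capacity_margin = 1280, scale = 1024;
	unsigned long util = 512;	/* 50% busy */

	/* request = util * margin / scale == 1.25 * util */
	printf("request = %lu\n", util * capacity_margin / scale);	/* 640 */
	return 0;
}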
Index: linux-pm/drivers/cpufreq/cpufreq_schedutil.c
===================================================================
--- /dev/null
+++ linux-pm/drivers/cpufreq/cpufreq_schedutil.c
@@ -0,0 +1,501 @@
+/*
+ * CPUFreq governor based on scheduler-provided CPU utilization data.
+ *
+ * Copyright (C) 2016, Intel Corporation
+ * Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/percpu-defs.h>
+#include <linux/slab.h>
+
+#include "cpufreq_governor.h"
+
+struct sugov_tunables {
+	struct gov_tunables gt;
+	unsigned int rate_limit_us;
+};
+
+struct sugov_policy {
+	struct cpufreq_policy *policy;
+
+	struct sugov_tunables *tunables;
+	struct list_head tunables_hook;
+
+	raw_spinlock_t update_lock;  /* For shared policies */
+	u64 last_freq_update_time;
+	s64 freq_update_delay_ns;
+	unsigned int next_freq;
+
+	/* The next fields are only needed if fast switch cannot be used. */
+	unsigned int relation;
+	struct irq_work irq_work;
+	struct work_struct work;
+	struct mutex work_lock;
+	bool work_in_progress;
+
+	bool need_freq_update;
+};
+
+struct sugov_cpu {
+	struct update_util_data update_util;
+	struct sugov_policy *sg_policy;
+
+	/* The fields below are only needed when sharing a policy. */
+	unsigned long util;
+	unsigned long max;
+	u64 last_update;
+};
+
+static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu);
+
+/************************ Governor internals ***********************/
+
+static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
+{
+	u64 delta_ns;
+
+	if (sg_policy->work_in_progress)
+		return false;
+
+	if (unlikely(sg_policy->need_freq_update)) {
+		sg_policy->need_freq_update = false;
+		return true;
+	}
+
+	delta_ns = time - sg_policy->last_freq_update_time;
+	return (s64)delta_ns >= sg_policy->freq_update_delay_ns;
+}
+
+static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time,
+				unsigned long util, unsigned long max,
+				unsigned int next_freq)
+{
+	struct cpufreq_policy *policy = sg_policy->policy;
+	unsigned int rel;
+
+	if (next_freq > policy->max)
+		next_freq = policy->max;
+	else if (next_freq < policy->min)
+		next_freq = policy->min;
+
+	sg_policy->last_freq_update_time = time;
+	if (sg_policy->next_freq == next_freq)
+		return;
+
+	sg_policy->next_freq = next_freq;
+	/*
+	 * If utilization is less than max / 4, use RELATION_C to allow the
+	 * minimum frequency to be selected more often in case the distance from
+	 * it to the next available frequency in the table is significant.
+	 */
+	rel = util < (max >> 2) ? CPUFREQ_RELATION_C : CPUFREQ_RELATION_L;
+	if (policy->fast_switch_possible) {
+		cpufreq_driver_fast_switch(policy, next_freq, rel);
+	} else {
+		sg_policy->relation = rel;
+		sg_policy->work_in_progress = true;
+		irq_work_queue(&sg_policy->irq_work);
+	}
+}
+
+static void sugov_update_single(struct update_util_data *data, u64 time,
+				unsigned long util, unsigned long max)
+{
+	struct sugov_cpu *sg_cpu = container_of(data, struct sugov_cpu, update_util);
+	struct sugov_policy *sg_policy = sg_cpu->sg_policy;
+	unsigned int min_f, max_f, next_f;
+
+	if (!sugov_should_update_freq(sg_policy, time))
+		return;
+
+	min_f = sg_policy->policy->cpuinfo.min_freq;
+	max_f = sg_policy->policy->cpuinfo.max_freq;
+	next_f = util > max ? max_f : min_f + util * (max_f - min_f) / max;
+
+	sugov_update_commit(sg_policy, time, util, max, next_f);
+}
+
+static unsigned int sugov_next_freq(struct sugov_policy *sg_policy,
+				    unsigned long util, unsigned long max)
+{
+	struct cpufreq_policy *policy = sg_policy->policy;
+	unsigned int min_f = policy->cpuinfo.min_freq;
+	unsigned int max_f = policy->cpuinfo.max_freq;
+	u64 last_freq_update_time = sg_policy->last_freq_update_time;
+	unsigned int j;
+
+	if (util > max)
+		return max_f;
+
+	for_each_cpu(j, policy->cpus) {
+		struct sugov_cpu *j_sg_cpu;
+		unsigned long j_util, j_max;
+		u64 delta_ns;
+
+		if (j == smp_processor_id())
+			continue;
+
+		j_sg_cpu = &per_cpu(sugov_cpu, j);
+		/*
+		 * If the CPU utilization was last updated before the previous
+		 * frequency update and the time elapsed between the last update
+		 * of the CPU utilization and the last frequency update is long
+		 * enough, don't take the CPU into account as it probably is
+		 * idle now.
+		 */
+		delta_ns = last_freq_update_time - j_sg_cpu->last_update;
+		if ((s64)delta_ns > NSEC_PER_SEC / HZ)
+			continue;
+
+		j_util = j_sg_cpu->util;
+		j_max = j_sg_cpu->max;
+		if (j_util > j_max)
+			return max_f;
+
+		if (j_util * max > j_max * util) {
+			util = j_util;
+			max = j_max;
+		}
+	}
+
+	return min_f + util * (max_f - min_f) / max;
+}
+
+static void sugov_update_shared(struct update_util_data *data, u64 time,
+				unsigned long util, unsigned long max)
+{
+	struct sugov_cpu *sg_cpu = container_of(data, struct sugov_cpu, update_util);
+	struct sugov_policy *sg_policy = sg_cpu->sg_policy;
+	unsigned int next_f;
+
+	raw_spin_lock(&sg_policy->update_lock);
+
+	sg_cpu->util = util;
+	sg_cpu->max = max;
+	sg_cpu->last_update = time;
+
+	if (sugov_should_update_freq(sg_policy, time)) {
+		next_f = sugov_next_freq(sg_policy, util, max);
+		sugov_update_commit(sg_policy, time, util, max, next_f);
+	}
+
+	raw_spin_unlock(&sg_policy->update_lock);
+}
+
+static void sugov_work(struct work_struct *work)
+{
+	struct sugov_policy *sg_policy = container_of(work, struct sugov_policy, work);
+
+	mutex_lock(&sg_policy->work_lock);
+	__cpufreq_driver_target(sg_policy->policy, sg_policy->next_freq,
+				sg_policy->relation);
+	mutex_unlock(&sg_policy->work_lock);
+
+	sg_policy->work_in_progress = false;
+}
+
+static void sugov_irq_work(struct irq_work *irq_work)
+{
+	struct sugov_policy *sg_policy;
+
+	sg_policy = container_of(irq_work, struct sugov_policy, irq_work);
+	schedule_work(&sg_policy->work);
+}
+
+/************************** sysfs interface ************************/
+
+static struct sugov_tunables *global_tunables;
+static DEFINE_MUTEX(global_tunables_lock);
+
+static inline struct sugov_tunables *to_sugov_tunables(struct gov_tunables *gt)
+{
+	return container_of(gt, struct sugov_tunables, gt);
+}
+
+static ssize_t rate_limit_us_show(struct gov_tunables *gt, char *buf)
+{
+	struct sugov_tunables *tunables = to_sugov_tunables(gt);
+
+	return sprintf(buf, "%u\n", tunables->rate_limit_us);
+}
+
+static ssize_t rate_limit_us_store(struct gov_tunables *gt, const char *buf,
+				   size_t count)
+{
+	struct sugov_tunables *tunables = to_sugov_tunables(gt);
+	struct sugov_policy *sg_policy;
+	unsigned int rate_limit_us;
+	int ret;
+
+	ret = sscanf(buf, "%u", &rate_limit_us);
+	if (ret != 1)
+		return -EINVAL;
+
+	tunables->rate_limit_us = rate_limit_us;
+
+	list_for_each_entry(sg_policy, &gt->policy_list, tunables_hook)
+		sg_policy->freq_update_delay_ns = rate_limit_us * NSEC_PER_USEC;
+
+	return count;
+}
+
+static struct governor_attr rate_limit_us = __ATTR_RW(rate_limit_us);
+
+static struct attribute *sugov_attributes[] = {
+	&rate_limit_us.attr,
+	NULL
+};
+
+static struct kobj_type sugov_tunables_ktype = {
+	.default_attrs = sugov_attributes,
+	.sysfs_ops = &governor_sysfs_ops,
+};
+
+/********************** cpufreq governor interface *********************/
+
+static struct cpufreq_governor schedutil_gov;
+
+static struct sugov_policy *sugov_policy_alloc(struct cpufreq_policy *policy)
+{
+	struct sugov_policy *sg_policy;
+
+	sg_policy = kzalloc(sizeof(*sg_policy), GFP_KERNEL);
+	if (!sg_policy)
+		return NULL;
+
+	sg_policy->policy = policy;
+	init_irq_work(&sg_policy->irq_work, sugov_irq_work);
+	INIT_WORK(&sg_policy->work, sugov_work);
+	mutex_init(&sg_policy->work_lock);
+	raw_spin_lock_init(&sg_policy->update_lock);
+	return sg_policy;
+}
+
+static void sugov_policy_free(struct sugov_policy *sg_policy)
+{
+	mutex_destroy(&sg_policy->work_lock);
+	kfree(sg_policy);
+}
+
+static struct sugov_tunables *sugov_tunables_alloc(struct sugov_policy *sg_policy)
+{
+	struct sugov_tunables *tunables;
+
+	tunables = kzalloc(sizeof(*tunables), GFP_KERNEL);
+	if (tunables)
+		gov_tunables_init(&tunables->gt, &sg_policy->tunables_hook);
+
+	return tunables;
+}
+
+static void sugov_tunables_free(struct sugov_tunables *tunables)
+{
+	if (!have_governor_per_policy())
+		global_tunables = NULL;
+
+	kfree(tunables);
+}
+
+static int sugov_init(struct cpufreq_policy *policy)
+{
+	struct sugov_policy *sg_policy;
+	struct sugov_tunables *tunables;
+	unsigned int lat;
+	int ret = 0;
+
+	/* State should be equivalent to EXIT */
+	if (policy->governor_data)
+		return -EBUSY;
+
+	sg_policy = sugov_policy_alloc(policy);
+	if (!sg_policy)
+		return -ENOMEM;
+
+	mutex_lock(&global_tunables_lock);
+
+	if (global_tunables) {
+		if (WARN_ON(have_governor_per_policy())) {
+			ret = -EINVAL;
+			goto free_sg_policy;
+		}
+		policy->governor_data = sg_policy;
+		sg_policy->tunables = global_tunables;
+
+		gov_tunables_get(&global_tunables->gt, &sg_policy->tunables_hook);
+		goto out;
+	}
+
+	tunables = sugov_tunables_alloc(sg_policy);
+	if (!tunables) {
+		ret = -ENOMEM;
+		goto free_sg_policy;
+	}
+
+	tunables->rate_limit_us = LATENCY_MULTIPLIER;
+	lat = policy->cpuinfo.transition_latency / NSEC_PER_USEC;
+	if (lat)
+		tunables->rate_limit_us *= lat;
+
+	if (!have_governor_per_policy())
+		global_tunables = tunables;
+
+	policy->governor_data = sg_policy;
+	sg_policy->tunables = tunables;
+
+	ret = kobject_init_and_add(&tunables->gt.kobj, &sugov_tunables_ktype,
+				   get_governor_parent_kobj(policy), "%s",
+				   schedutil_gov.name);
+	if (!ret)
+		goto out;
+
+	/* Failure, so roll back. */
+	policy->governor_data = NULL;
+	sugov_tunables_free(tunables);
+
+ free_sg_policy:
+	pr_err("cpufreq: schedutil governor initialization failed (error %d)\n", ret);
+	sugov_policy_free(sg_policy);
+
+ out:
+	mutex_unlock(&global_tunables_lock);
+	return ret;
+}
+
+static int sugov_exit(struct cpufreq_policy *policy)
+{
+	struct sugov_policy *sg_policy = policy->governor_data;
+	struct sugov_tunables *tunables = sg_policy->tunables;
+	unsigned int count;
+
+	mutex_lock(&global_tunables_lock);
+
+	count = gov_tunables_put(&tunables->gt, &sg_policy->tunables_hook);
+	policy->governor_data = NULL;
+	if (!count)
+		sugov_tunables_free(tunables);
+
+	mutex_unlock(&global_tunables_lock);
+
+	sugov_policy_free(sg_policy);
+	return 0;
+}
+
+static int sugov_start(struct cpufreq_policy *policy)
+{
+	struct sugov_policy *sg_policy = policy->governor_data;
+	unsigned int cpu;
+
+	sg_policy->freq_update_delay_ns = sg_policy->tunables->rate_limit_us * NSEC_PER_USEC;
+	sg_policy->last_freq_update_time = 0;
+	sg_policy->next_freq = UINT_MAX;
+	sg_policy->work_in_progress = false;
+	sg_policy->need_freq_update = false;
+
+	for_each_cpu(cpu, policy->cpus) {
+		struct sugov_cpu *sg_cpu = &per_cpu(sugov_cpu, cpu);
+
+		sg_cpu->sg_policy = sg_policy;
+		if (policy_is_shared(policy)) {
+			sg_cpu->util = ULONG_MAX;
+			sg_cpu->max = 0;
+			sg_cpu->last_update = 0;
+			sg_cpu->update_util.func = sugov_update_shared;
+		} else {
+			sg_cpu->update_util.func = sugov_update_single;
+		}
+		cpufreq_set_update_util_data(cpu, &sg_cpu->update_util);
+	}
+	return 0;
+}
+
+static int sugov_stop(struct cpufreq_policy *policy)
+{
+	struct sugov_policy *sg_policy = policy->governor_data;
+	unsigned int cpu;
+
+	for_each_cpu(cpu, policy->cpus)
+		cpufreq_set_update_util_data(cpu, NULL);
+
+	synchronize_sched();
+
+	irq_work_sync(&sg_policy->irq_work);
+	cancel_work_sync(&sg_policy->work);
+	return 0;
+}
+
+static int sugov_limits(struct cpufreq_policy *policy)
+{
+	struct sugov_policy *sg_policy = policy->governor_data;
+
+	if (!policy->fast_switch_possible) {
+		mutex_lock(&sg_policy->work_lock);
+
+		if (policy->max < policy->cur)
+			__cpufreq_driver_target(policy, policy->max,
+						CPUFREQ_RELATION_H);
+		else if (policy->min > policy->cur)
+			__cpufreq_driver_target(policy, policy->min,
+						CPUFREQ_RELATION_L);
+
+		mutex_unlock(&sg_policy->work_lock);
+	}
+
+	sg_policy->need_freq_update = true;
+	return 0;
+}
+
+int sugov_governor(struct cpufreq_policy *policy, unsigned int event)
+{
+	if (event == CPUFREQ_GOV_POLICY_INIT) {
+		return sugov_init(policy);
+	} else if (policy->governor_data) {
+		switch (event) {
+		case CPUFREQ_GOV_POLICY_EXIT:
+			return sugov_exit(policy);
+		case CPUFREQ_GOV_START:
+			return sugov_start(policy);
+		case CPUFREQ_GOV_STOP:
+			return sugov_stop(policy);
+		case CPUFREQ_GOV_LIMITS:
+			return sugov_limits(policy);
+		}
+	}
+	return -EINVAL;
+}
+
+static struct cpufreq_governor schedutil_gov = {
+	.name = "schedutil",
+	.governor = sugov_governor,
+	.max_transition_latency = TRANSITION_LATENCY_LIMIT,
+	.owner = THIS_MODULE,
+};
+
+static int __init sugov_module_init(void)
+{
+	return cpufreq_register_governor(&schedutil_gov);
+}
+
+static void __exit sugov_module_exit(void)
+{
+	cpufreq_unregister_governor(&schedutil_gov);
+}
+
+MODULE_AUTHOR("Rafael J. Wysocki <rafael.j.wysocki@intel.com>");
+MODULE_DESCRIPTION("Utilization-based CPU frequency selection");
+MODULE_LICENSE("GPL");
+
+#ifdef CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL
+struct cpufreq_governor *cpufreq_default_governor(void)
+{
+	return &schedutil_gov;
+}
+
+fs_initcall(sugov_module_init);
+#else
+module_init(sugov_module_init);
+#endif
+module_exit(sugov_module_exit);
Index: linux-pm/drivers/cpufreq/Kconfig
===================================================================
--- linux-pm.orig/drivers/cpufreq/Kconfig
+++ linux-pm/drivers/cpufreq/Kconfig
@@ -107,6 +107,16 @@ config CPU_FREQ_DEFAULT_GOV_CONSERVATIVE
 	  Be aware that not all cpufreq drivers support the conservative
 	  governor. If unsure have a look at the help section of the
 	  driver. Fallback governor will be the performance governor.
+
+config CPU_FREQ_DEFAULT_GOV_SCHEDUTIL
+	bool "schedutil"
+	select CPU_FREQ_GOV_SCHEDUTIL
+	select CPU_FREQ_GOV_PERFORMANCE
+	help
+	  Use the 'schedutil' CPUFreq governor by default. If unsure,
+	  have a look at the help section of that governor. The fallback
+	  governor will be 'performance'.
+
 endchoice
 
 config CPU_FREQ_GOV_PERFORMANCE
@@ -188,6 +198,22 @@ config CPU_FREQ_GOV_CONSERVATIVE
 
 	  If in doubt, say N.
 
+config CPU_FREQ_GOV_SCHEDUTIL
+	tristate "'schedutil' cpufreq policy governor"
+	depends on CPU_FREQ
+	select CPU_FREQ_GOV_TUNABLES
+	select IRQ_WORK
+	help
+	  The frequency selection formula used by this governor is analogous
+	  to the one used by 'ondemand', but instead of computing CPU load
+	  as the "non-idle CPU time" to "total CPU time" ratio, it uses CPU
+	  utilization data provided by the scheduler as input.
+
+	  To compile this driver as a module, choose M here: the
+	  module will be called cpufreq_schedutil.
+
+	  If in doubt, say N.
+
 comment "CPU frequency scaling drivers"
 
 config CPUFREQ_DT
Index: linux-pm/drivers/cpufreq/Makefile
===================================================================
--- linux-pm.orig/drivers/cpufreq/Makefile
+++ linux-pm/drivers/cpufreq/Makefile
@@ -10,6 +10,7 @@ obj-$(CONFIG_CPU_FREQ_GOV_POWERSAVE)	+=
 obj-$(CONFIG_CPU_FREQ_GOV_USERSPACE)	+= cpufreq_userspace.o
 obj-$(CONFIG_CPU_FREQ_GOV_ONDEMAND)	+= cpufreq_ondemand.o
 obj-$(CONFIG_CPU_FREQ_GOV_CONSERVATIVE)	+= cpufreq_conservative.o
+obj-$(CONFIG_CPU_FREQ_GOV_SCHEDUTIL)	+= cpufreq_schedutil.o
 obj-$(CONFIG_CPU_FREQ_GOV_COMMON)		+= cpufreq_governor.o
 obj-$(CONFIG_CPU_FREQ_GOV_TUNABLES)	+= cpufreq_governor_tunables.o
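To make the selection formula in sugov_update_single() and sugov_next_freq() concrete: next_f = min_f + util * (max_f - min_f) / max, with an early bailout to max_f when util exceeds max. Below is a small standalone sketch with made-up frequencies (500..2000 MHz, capacity scale 1024); next_freq() mirrors the patch's arithmetic but is illustrative only, not kernel code:

    #include <stdio.h>

    /* Mirror of the patch's frequency selection arithmetic (illustrative). */
    static unsigned int next_freq(unsigned int min_f, unsigned int max_f,
                                  unsigned long util, unsigned long max)
    {
            if (util > max)
                    return max_f;
            return min_f + util * (max_f - min_f) / max;
    }

    int main(void)
    {
            printf("%u\n", next_freq(500, 2000, 1024, 1024)); /* 2000: fully loaded */
            printf("%u\n", next_freq(500, 2000, 512, 1024));  /* 1250: halfway */
            /* util < max / 4 here, so sugov_update_commit() would use RELATION_C. */
            printf("%u\n", next_freq(500, 2000, 256, 1024));  /* 875 */
            return 0;
    }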