Message ID | 20221031121314.1381472-3-kajetan.puchalski@arm.com (mailing list archive) |
---|---|
State | Superseded, archived |
Series | cpuidle: teo: Introduce util-awareness |
Hi Kajetan,

On Mon, Oct 31, 2022 at 5:14 AM Kajetan Puchalski
<kajetan.puchalski@arm.com> wrote:

... [delete some]...

>  /**
>   * teo_update - Update CPU metrics after wakeup.
>   * @drv: cpuidle driver containing state data.
> @@ -303,7 +359,9 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
>  	int i;
>
>  	if (dev->last_state_idx >= 0) {
> -		teo_update(drv, dev);
> +		/* don't update metrics if the cpu was utilized during the last sleep */
> +		if (!cpu_data->utilized)
> +			teo_update(drv, dev);
>  		dev->last_state_idx = -1;
>  	}

Ignoring the metrics is not the correct thing to do.
Depending on the workflow, it can severely bias the idle states deeper
than they should be because most of the needed information to select
the appropriate shallow state is tossed out.

Example 1:
2 pairs of ping pongs = 4 threads
Parameters chosen such that idle state 2 would be a most used state.
CPU frequency governor: Schedutil.
CPU frequency scaling driver: intel_cpufreq.
HWP: Disabled
Processor: i5-10600K (6 cores 12 cpus).
Kernel: 6.1-rc3
Run length: 1e8 cycles
Idle governor:
teo: 11.73 uSecs/loop ; idle state 1 ~3.5e6 exits/sec
menu: 12.1 uSecs/loop ; idle state 1 ~3.3e6 exits/sec
util-v3: 15.2 uSecs/loop ; idle state 1 ~200 exits/sec
util-v4: 11.63 uSecs/loop ; idle state 1 ~3.5e6 exits/sec

Where util-v4 is the same as this patch (util-v3) with the above code reverted.

Note: less time per loop is better.

Example 2: Same but parameters selected such that idle state 0 would
be a most used idle state.
Run Length: 4e8 cycles
Idle governor:
teo: 3.1 uSecs/loop ; idle state 0 ~1.2e6 exits/sec
menu: 3.1 uSecs/loop ; idle state 0 ~1.3e6 exits/sec
util-v3: 5.1 uSecs/loop ; idle state 0 ~4 exits/sec
util-v4: ? uSecs/loop ; idle state 0 ~1.2e6 exits/sec (partial result)

Note: the util-v4 test is still in progress, but it is late in my time zone.
But I can tell from the idle state usage, which I can observe once per
minute, that the issue is, at least mostly, fixed.

... Doug
On Mon, Oct 31, 2022 at 11:24 PM Doug Smythies <dsmythies@telus.net> wrote:
>
> Hi Kajetan,
>
> On Mon, Oct 31, 2022 at 5:14 AM Kajetan Puchalski
> <kajetan.puchalski@arm.com> wrote:
>
> ... [delete some]...
>
> >  /**
> >   * teo_update - Update CPU metrics after wakeup.
> >   * @drv: cpuidle driver containing state data.
> > @@ -303,7 +359,9 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
> >  	int i;
> >
> >  	if (dev->last_state_idx >= 0) {
> > -		teo_update(drv, dev);
> > +		/* don't update metrics if the cpu was utilized during the last sleep */
> > +		if (!cpu_data->utilized)
> > +			teo_update(drv, dev);
> >  		dev->last_state_idx = -1;
> >  	}
>
> Ignoring the metrics is not the correct thing to do.
> Depending on the workflow, it can severely bias the idle states deeper
> than they should be because most of the needed information to select
> the appropriate shallow state is tossed out.
>
> Example 1:
> 2 pairs of ping pongs = 4 threads
> Parameters chosen such that idle state 2 would be a most used state.

Sorry, typo, I meant idle state 1 would be most used.

> CPU frequency governor: Schedutil.
> CPU frequency scaling driver: intel_cpufreq.
> HWP: Disabled
> Processor: i5-10600K (6 cores 12 cpus).
> Kernel: 6.1-rc3
> Run length: 1e8 cycles
> Idle governor:
> teo: 11.73 uSecs/loop ; idle state 1 ~3.5e6 exits/sec
> menu: 12.1 uSecs/loop ; idle state 1 ~3.3e6 exits/sec
> util-v3: 15.2 uSecs/loop ; idle state 1 ~200 exits/sec
> util-v4: 11.63 uSecs/loop ; idle state 1 ~3.5e6 exits/sec
>
> Where util-v4 is the same as this patch (util-v3) with the above code reverted.
>
> Note: less time per loop is better.
>
> Example 2: Same but parameters selected such that idle state 0 would
> be a most used idle state.
> Run Length: 4e8 cycles
> Idle governor:
> teo: 3.1 uSecs/loop ; idle state 0 ~1.2e6 exits/sec
> menu: 3.1 uSecs/loop ; idle state 0 ~1.3e6 exits/sec
> util-v3: 5.1 uSecs/loop ; idle state 0 ~4 exits/sec
> util-v4: ? uSecs/loop ; idle state 0 ~1.2e6 exits/sec (partial result)

util-v4: 3.15 uSecs/loop ; idle state 0 ~1.2e6 exits/sec

For the above tests we do not expect teo-util to have much impact.

For completeness:

Test 3: Same but parameters selected such that idle states 2 and 3
would be used the most.
Run Length: 3.42e5 cycles
CPU frequency scaling governor: schedutil.
CPU frequency scaling driver: intel_cpufreq.
Idle governor:
teo: 4005 uSecs/loop ; IS 2: 1917 IS 3: 107.4 exits/sec
menu: 3113 uSecs/loop ; IS 2: 1020 IS 3: 1576 exits/sec
util-v3: 3457 uSecs/loop ; IS 2: 1139 IS 3: 1000 exits/sec
util-v4: 3526 uSecs/loop ; IS 2: 2029 IS 3: 109 exits/sec

Now, things are very noisy with the schedutil governor, so...

Test 4: Same as test 3 except for frequency scaling governor.
Run Length: 3.42e5 cycles
CPU frequency scaling governor: performance.
CPU frequency scaling driver: intel_pstate.
Idle governor:
teo: 2688 uSecs/loop ; IS 2: 2489 IS 3: 16 exits/sec
menu: 2596 uSecs/loop ; IS 2: 865 IS 3: 2005 exits/sec
util-v3: 2766 uSecs/loop ; IS 2: 1049 IS 3: 1394 exits/sec
util-v4: 2756 uSecs/loop ; IS 2: 2440 IS 3: 24 exits/sec

... Doug
On Mon, Oct 31, 2022 at 11:24:06PM -0700, Doug Smythies wrote:

> Ignoring the metrics is not the correct thing to do.
> Depending on the workflow, it can severely bias the idle states deeper
> than they should be because most of the needed information to select
> the appropriate shallow state is tossed out.

Interesting, thanks for flagging this actually. Arguably it made sense
originally but with v3 changes it probably causes more harm than good.
From testing it on my own side it still looks all right without the
skipping metrics part so I'm fine with just dropping it from the patch.
V4 it is. I'll include my updated numbers and more tests in the new
cover letter.

> Example 1:
> 2 pairs of ping pongs = 4 threads
> Parameters chosen such that idle state 2 would be a most used state.
> CPU frequency governor: Schedutil.
> CPU frequency scaling driver: intel_cpufreq.
> HWP: Disabled
> Processor: i5-10600K (6 cores 12 cpus).
> Kernel: 6.1-rc3
> Run length: 1e8 cycles
> Idle governor:
> teo: 11.73 uSecs/loop ; idle state 1 ~3.5e6 exits/sec
> menu: 12.1 uSecs/loop ; idle state 1 ~3.3e6 exits/sec
> util-v3: 15.2 uSecs/loop ; idle state 1 ~200 exits/sec
> util-v4: 11.63 uSecs/loop ; idle state 1 ~3.5e6 exits/sec
>
> Where util-v4 is the same as this patch (util-v3) with the above code reverted.
>
> Note: less time per loop is better.
>
> Example 2: Same but parameters selected such that idle state 0 would
> be a most used idle state.
> Run Length: 4e8 cycles
> Idle governor:
> teo: 3.1 uSecs/loop ; idle state 0 ~1.2e6 exits/sec
> menu: 3.1 uSecs/loop ; idle state 0 ~1.3e6 exits/sec
> util-v3: 5.1 uSecs/loop ; idle state 0 ~4 exits/sec
> util-v4: ? uSecs/loop ; idle state 0 ~1.2e6 exits/sec (partial result)

Thanks a lot for the testing!

---
Kajetan
On Mon, Oct 31, 2022 at 1:14 PM Kajetan Puchalski
<kajetan.puchalski@arm.com> wrote:
>
> Modern interactive systems, such as recent Android phones, tend to have
> power efficient shallow idle states. Selecting deeper idle states on a
> device while a latency-sensitive workload is running can adversely impact
> performance due to increased latency. Additionally, if the CPU wakes up
> from a deeper sleep before its target residency as is often the case, it
> results in a waste of energy on top of that.
>
> This patch extends the TEO governor with a mechanism adding util-awareness,
> effectively providing a way for the governor to reduce the selected idle
> state by 1 when the CPU is being utilized over a certain threshold while
> still trying to select the deepest possible state using TEO's metrics when
> the CPU is not being utilized. This is now possible since the CPU
> utilization is exported from the scheduler with the sched_cpu_util function
> and already used e.g. in the thermal governor IPA.
>
> Under this implementation, when the CPU is being utilised and the
> selected candidate state is C1, it will be reduced to C0 as long as C0
> is not a polling state. This effectively should make the patch have no
> effect on most Intel systems.
>
> This can provide drastically decreased latency and performance benefits in
> certain types of mobile workloads that are sensitive to latency,
> such as Geekbench 5.
>
> Signed-off-by: Kajetan Puchalski <kajetan.puchalski@arm.com>
> ---
>  drivers/cpuidle/governors/teo.c | 84 ++++++++++++++++++++++++++++++++-
>  1 file changed, 82 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/cpuidle/governors/teo.c b/drivers/cpuidle/governors/teo.c
> index e2864474a98d..0f38b3ecab3c 100644
> --- a/drivers/cpuidle/governors/teo.c
> +++ b/drivers/cpuidle/governors/teo.c
> @@ -2,8 +2,13 @@
>  /*
>   * Timer events oriented CPU idle governor
>   *
> + * TEO governor:
>   * Copyright (C) 2018 - 2021 Intel Corporation
>   * Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> + *
> + * Util-awareness mechanism:
> + * Copyright (C) 2022 Arm Ltd.
> + * Author: Kajetan Puchalski <kajetan.puchalski@arm.com>
>   */
>
>  /**
> @@ -99,14 +104,49 @@
>   * select the given idle state instead of the candidate one.
>   *
>   * 3. By default, select the candidate state.
> + *
> + * Util-awareness mechanism:
> + *
> + * The idea behind the util-awareness extension is that there are two distinct
> + * scenarios for the CPU which should result in two different approaches to idle
> + * state selection - utilized and not utilized.
> + *
> + * In this case, 'utilized' means that the average runqueue util of the CPU is
> + * above a certain threshold.
> + *
> + * When the CPU is utilized while going into idle, more likely than not it will
> + * be woken up to do more work soon and so a shallower idle state should be
> + * selected to minimise latency and maximise performance. When the CPU is not
> + * being utilized, the usual metrics-based approach to selecting the deepest
> + * available idle state should be preferred to take advantage of the power
> + * saving.
> + *
> + * In order to achieve this, the governor uses a utilization threshold.
> + * The threshold is computed per-cpu as a percentage of the CPU's capacity
> + * by bit shifting the capacity value. Based on testing, the shift of 6 (~1.56%)
> + * seems to be getting the best results.
> + *
> + * Before selecting the next idle state, the governor compares the current CPU
> + * util to the precomputed util threshold. If it's below, it defaults to the
> + * TEO metrics mechanism. If it's above and the currently selected candidate is
> + * C1, the idle state will be reduced to C0 as long as C0 is not a polling state.
>   */
>
>  #include <linux/cpuidle.h>
>  #include <linux/jiffies.h>
>  #include <linux/kernel.h>
> +#include <linux/sched.h>
>  #include <linux/sched/clock.h>
> +#include <linux/sched/topology.h>
>  #include <linux/tick.h>
>
> +/*
> + * The number of bits to shift the cpu's capacity by in order to determine
> + * the utilized threshold
> + */
> +#define UTIL_THRESHOLD_SHIFT 6

Why is this particular number regarded as the best one?

> +
> +
>  /*
>   * The PULSE value is added to metrics when they grow and the DECAY_SHIFT value
>   * is used for decreasing metrics on a regular basis.
> @@ -137,9 +177,11 @@ struct teo_bin {
>   * @time_span_ns: Time between idle state selection and post-wakeup update.
>   * @sleep_length_ns: Time till the closest timer event (at the selection time).
>   * @state_bins: Idle state data bins for this CPU.
> - * @total: Grand total of the "intercepts" and "hits" mertics for all bins.
> + * @total: Grand total of the "intercepts" and "hits" metrics for all bins.
>   * @next_recent_idx: Index of the next @recent_idx entry to update.
>   * @recent_idx: Indices of bins corresponding to recent "intercepts".
> + * @util_threshold: Threshold above which the CPU is considered utilized
> + * @utilized: Whether the last sleep on the CPU happened while utilized
>   */
>  struct teo_cpu {
>  	s64 time_span_ns;
> @@ -148,10 +190,24 @@ struct teo_cpu {
>  	unsigned int total;
>  	int next_recent_idx;
>  	int recent_idx[NR_RECENT];
> +	unsigned long util_threshold;
> +	bool utilized;
>  };
>
>  static DEFINE_PER_CPU(struct teo_cpu, teo_cpus);
>
> +/**
> + * teo_get_util - Update the CPU utilized status
> + * @dev: Target CPU
> + * @cpu_data: Governor CPU data for the target CPU
> + */
> +static void teo_get_util(struct cpuidle_device *dev, struct teo_cpu *cpu_data)
> +{
> +	unsigned long util = sched_cpu_util(dev->cpu);
> +
> +	cpu_data->utilized = util > cpu_data->util_threshold;

Why exactly do you need the local variable here?

Then, if there's only one caller, maybe this could be folded into it?

> +}
> +
>  /**
>   * teo_update - Update CPU metrics after wakeup.
>   * @drv: cpuidle driver containing state data.
> @@ -303,7 +359,9 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
>  	int i;
>
>  	if (dev->last_state_idx >= 0) {
> -		teo_update(drv, dev);
> +		/* don't update metrics if the cpu was utilized during the last sleep */

Why?

The metrics are related to idle duration and cpu_data->utilized
indicates whether or not latency should be reduced. These are
different things.

Moreover, this is just one data point and there may not be any direct
connection between it and the decision made in this particular cycle.

> +		if (!cpu_data->utilized)
> +			teo_update(drv, dev);
>  		dev->last_state_idx = -1;
>  	}
>
> @@ -323,6 +381,21 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
>  		goto end;
>  	}
>
> +	teo_get_util(dev, cpu_data);
> +	/* the cpu is being utilized and there's only 2 states to choose from */
> +	/* no need to consider metrics, choose the shallowest non-polling state and exit */

A proper kernel-coding-style 2-line comment, please!

Also I would say "utilized beyond the threshold" instead of just
"utilized" and "there are only 2 states" (plural).

> +	if (drv->state_count < 3 && cpu_data->utilized) {
> +		for (i = 0; i < drv->state_count; ++i) {
> +			if (dev->states_usage[i].disable ||
> +			    drv->states[i].flags & CPUIDLE_FLAG_POLLING)
> +				continue;
> +			break;

This looks odd. It would be more straightforward to do

if (!dev->states_usage[i].disable && !(drv->states[i].flags &
CPUIDLE_FLAG_POLLING)) {
	idx = i;
	goto end;
}

without the "break" and "continue".

> +		}
> +
> +		idx = i;
> +		goto end;
> +	}
> +
>  	/*
>  	 * Find the deepest idle state whose target residency does not exceed
>  	 * the current sleep length and the deepest idle state not deeper than
> @@ -454,6 +527,11 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
>  	if (idx > constraint_idx)
>  		idx = constraint_idx;
>
> +	/* if the CPU is being utilized and C1 is the selected candidate */
> +	/* choose a shallower non-polling state to improve latency */

Again, the kernel coding style for multi-line comments is different
from the above.

> +	if (cpu_data->utilized && idx == 1)

I've changed my mind with respect to adding the idx == 1 check to
this. If the goal is to reduce latency for the "loaded" CPUs, this
applies to deeper idle states too.

> +		idx = teo_find_shallower_state(drv, dev, idx, duration_ns, true);
> +
>  end:
>  	/*
>  	 * Don't stop the tick if the selected state is a polling one or if the
> @@ -510,9 +588,11 @@ static int teo_enable_device(struct cpuidle_driver *drv,
>  				    struct cpuidle_device *dev)
>  {
>  	struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu);
> +	unsigned long max_capacity = arch_scale_cpu_capacity(dev->cpu);
>  	int i;
>
>  	memset(cpu_data, 0, sizeof(*cpu_data));
> +	cpu_data->util_threshold = max_capacity >> UTIL_THRESHOLD_SHIFT;
>
>  	for (i = 0; i < NR_RECENT; i++)
>  		cpu_data->recent_idx[i] = -1;
> --
On 2022.11.25 10:27 Rafael wrote:
> On Mon, Oct 31, 2022 at 1:14 PM Kajetan wrote:

... [delete some] ...

>> +}
>> +
>>  /**
>>   * teo_update - Update CPU metrics after wakeup.
>>   * @drv: cpuidle driver containing state data.
>> @@ -303,7 +359,9 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
>>  	int i;
>>
>>  	if (dev->last_state_idx >= 0) {
>> -		teo_update(drv, dev);
>> +		/* don't update metrics if the cpu was utilized during the last sleep */
>
> Why?
>
> The metrics are related to idle duration and cpu_data->utilized
> indicates whether or not latency should be reduced. These are
> different things.
>
> Moreover, this is just one data point and there may not be any direct
> connection between it and the decision made in this particular cycle.

Hi Rafael,

Yes, agreed. And version 4 of this patch set deleted this stuff.

See also my similar feedback, with accompanying test data that
verified the issue in [1].

[1] https://lore.kernel.org/linux-pm/CAAYoRsUgm6KyJCDowGKFVuMwJepnVN8NFEenjd3O-FN7+BETSw@mail.gmail.com/

... Doug
On 2022.11.25 10:27 Rafael wrote:
> On Mon, Oct 31, 2022 at 1:14 PM Kajetan wrote:

... [delete some] ...

>>  	/*
>>  	 * Find the deepest idle state whose target residency does not exceed
>>  	 * the current sleep length and the deepest idle state not deeper than
>> @@ -454,6 +527,11 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
>>  	if (idx > constraint_idx)
>>  		idx = constraint_idx;
>>
>> +	/* if the CPU is being utilized and C1 is the selected candidate */
>> +	/* choose a shallower non-polling state to improve latency */
>
> Again, the kernel coding style for multi-line comments is different
> from the above.
>
>> +	if (cpu_data->utilized && idx == 1)
>
> I've changed my mind with respect to adding the idx == 1 check to
> this. If the goal is to reduce latency for the "loaded" CPUs, this
> applies to deeper idle states too.

After taking idle state 0 (POLL) out of it, the energy cost for reducing
the selected idle state by 1 was still high in some cases, at least on my
Intel processor. That was mainly for idle state 2 being bumped to idle
state 1. I don't recall significant differences bumping idle state 3 to
idle state 2, but I don't know about other Intel processors.

So, there is a trade-off here where we might want to accept this higher
energy consumption for no gain in some workflows versus the higher
energy for gain in other workflows, or not.

Example 1: Higher energy, for no benefit:
Workflow: a medium load at 211 work/sleep frequency.
This data is for one thread, but I looked at up to 6 threads.
No performance metric, the work just has to finish before
the next cycle begins.
CPU frequency scaling driver: intel_pstate
CPU frequency scaling governor: powersave
No HWP.
Kernel 6.1-rc3

teo: ~14.8 watts
util-v4 without the "idx == 1" above: 16.1 watts (+8.8%)
More info:
http://smythies.com/~doug/linux/idle/teo-util/consume/dwell-v4/

Example 2: Lower energy, but no loss in performance:
Workflow: 500 threads, light load per thread,
approximately 10 hertz work/sleep frequency per thread.
CPU frequency scaling driver: intel_cpufreq
CPU frequency scaling governor: schedutil
No HWP.
Kernel 6.1-rc3

teo: ~70 watts
util-v4 without the "idx == 1" above: ~59 watts (-16%)
Execution times were the same
More info:
http://smythies.com/~doug/linux/idle/teo-util/waiter/
Note: legend util-v4-1 is util-v4 without the "idx == 1".

I have also added util-v4-1 to some of the previous results.

For reference, my testing processor:
Intel(R) Core(TM) i5-10600K CPU @ 4.10GHz

$ grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name
/sys/devices/system/cpu/cpu0/cpuidle/state0/name:POLL
/sys/devices/system/cpu/cpu0/cpuidle/state1/name:C1_ACPI
/sys/devices/system/cpu/cpu0/cpuidle/state2/name:C2_ACPI
/sys/devices/system/cpu/cpu0/cpuidle/state3/name:C3_ACPI

$ grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/desc
/sys/devices/system/cpu/cpu0/cpuidle/state0/desc:CPUIDLE CORE POLL IDLE
/sys/devices/system/cpu/cpu0/cpuidle/state1/desc:ACPI FFH MWAIT 0x0
/sys/devices/system/cpu/cpu0/cpuidle/state2/desc:ACPI FFH MWAIT 0x30
/sys/devices/system/cpu/cpu0/cpuidle/state3/desc:ACPI FFH MWAIT 0x60

... Doug
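
The percentages quoted next to the power figures above follow from a plain relative-change calculation. A minimal sketch, assuming the teo reading is taken as the baseline; `percent_change` is a hypothetical helper name used only for this illustration:

```c
/*
 * Relative change of `value` against `baseline`, in percent.
 * Hypothetical helper for illustration; it is not part of the patch.
 */
static double percent_change(double baseline, double value)
{
	return (value - baseline) / baseline * 100.0;
}
```

With the figures from the email, 14.8 W to 16.1 W comes out at roughly +8.8%, and 70 W to 59 W at roughly -15.7%, which the email rounds to -16%. Both wattages in the second example are themselves approximate.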
On Fri, Nov 25, 2022 at 07:27:13PM +0100, Rafael J. Wysocki wrote:

> > +/*
> > + * The number of bits to shift the cpu's capacity by in order to determine
> > + * the utilized threshold
> > + */
> > +#define UTIL_THRESHOLD_SHIFT 6
>
> Why is this particular number regarded as the best one?

Based on my testing this number achieved the best balance of power and
performance on average. It also makes sense from looking at the util
plots. The resulting threshold is high enough to not be triggered by
background noise and low enough to react quickly when activity starts
ramping up.

> > +static void teo_get_util(struct cpuidle_device *dev, struct teo_cpu *cpu_data)
> > +{
> > +	unsigned long util = sched_cpu_util(dev->cpu);
> > +
> > +	cpu_data->utilized = util > cpu_data->util_threshold;
>
> Why exactly do you need the local variable here?

It's not necessarily needed, I can replace it with comparing the result
of the call directly.

> Then, if there's only one caller, maybe this could be folded into it?

I do think it's nicer to have it separated in its own helper function, that
way if anything more has to be done with the util it'll all be
self-contained. Having only one caller shouldn't be a big issue, it's
also the case for teo_middle_of_bin and teo_find_shallower_state in the
current TEO implementation.

> > +		/* don't update metrics if the cpu was utilized during the last sleep */
>
> Why?
>
> The metrics are related to idle duration and cpu_data->utilized
> indicates whether or not latency should be reduced. These are
> different things.
>
> Moreover, this is just one data point and there may not be any direct
> connection between it and the decision made in this particular cycle.

Agreed, v4 already has this part removed.

> > +		if (!cpu_data->utilized)
> > +			teo_update(drv, dev);
> >  		dev->last_state_idx = -1;
> >  	}
> >
> > @@ -323,6 +381,21 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
> >  		goto end;
> >  	}
> >
> > +	teo_get_util(dev, cpu_data);
> > +	/* the cpu is being utilized and there's only 2 states to choose from */
> > +	/* no need to consider metrics, choose the shallowest non-polling state and exit */
>
> A proper kernel-coding-style 2-line comment, please!
>
> Also I would say "utilized beyond the threshold" instead of just
> "utilized" and "there are only 2 states" (plural).

Both good points, I'll fix that.

> > +	if (drv->state_count < 3 && cpu_data->utilized) {
> > +		for (i = 0; i < drv->state_count; ++i) {
> > +			if (dev->states_usage[i].disable ||
> > +			    drv->states[i].flags & CPUIDLE_FLAG_POLLING)
> > +				continue;
> > +			break;
>
> This looks odd. It would be more straightforward to do
>
> if (!dev->states_usage[i].disable && !(drv->states[i].flags &
> CPUIDLE_FLAG_POLLING)) {
> 	idx = i;
> 	goto end;
> }
>
> without the "break" and "continue".

Fair enough, this works as well.

> I've changed my mind with respect to adding the idx == 1 check to
> this. If the goal is to reduce latency for the "loaded" CPUs, this
> applies to deeper idle states too.

I see, this has no effect on arm devices one way or the other so I don't
mind, it's completely up to you. In light of Doug's test results
regarding this change, should I remove the check in v5?

Thanks,
Kajetan
On Mon, Nov 28, 2022 at 3:33 PM Kajetan Puchalski
<kajetan.puchalski@arm.com> wrote:
>
> On Fri, Nov 25, 2022 at 07:27:13PM +0100, Rafael J. Wysocki wrote:
> > > +/*
> > > + * The number of bits to shift the cpu's capacity by in order to determine
> > > + * the utilized threshold
> > > + */
> > > +#define UTIL_THRESHOLD_SHIFT 6
> >
> > Why is this particular number regarded as the best one?
>
> Based on my testing this number achieved the best balance of power and
> performance on average. It also makes sense from looking at the util
> plots. The resulting threshold is high enough to not be triggered by
> background noise and low enough to react quickly when activity starts
> ramping up.

It would be good to put some of this information (or maybe even all of
it) into the comment above the symbol definition.

> > > +static void teo_get_util(struct cpuidle_device *dev, struct teo_cpu *cpu_data)
> > > +{
> > > +	unsigned long util = sched_cpu_util(dev->cpu);
> > > +
> > > +	cpu_data->utilized = util > cpu_data->util_threshold;
> >
> > Why exactly do you need the local variable here?
>
> It's not necessarily needed, I can replace it with comparing the result
> of the call directly.
>
> > Then, if there's only one caller, maybe this could be folded into it?
>
> I do think it's nicer to have it separated in its own helper function, that
> way if anything more has to be done with the util it'll all be
> self-contained. Having only one caller shouldn't be a big issue, it's
> also the case for teo_middle_of_bin and teo_find_shallower_state in the
> current TEO implementation.

OK

> > > +		/* don't update metrics if the cpu was utilized during the last sleep */
> >
> > Why?
> >
> > The metrics are related to idle duration and cpu_data->utilized
> > indicates whether or not latency should be reduced. These are
> > different things.
> >
> > Moreover, this is just one data point and there may not be any direct
> > connection between it and the decision made in this particular cycle.
>
> Agreed, v4 already has this part removed.
>
> > > +		if (!cpu_data->utilized)
> > > +			teo_update(drv, dev);
> > >  		dev->last_state_idx = -1;
> > >  	}
> > >
> > > @@ -323,6 +381,21 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
> > >  		goto end;
> > >  	}
> > >
> > > +	teo_get_util(dev, cpu_data);
> > > +	/* the cpu is being utilized and there's only 2 states to choose from */
> > > +	/* no need to consider metrics, choose the shallowest non-polling state and exit */
> >
> > A proper kernel-coding-style 2-line comment, please!
> >
> > Also I would say "utilized beyond the threshold" instead of just
> > "utilized" and "there are only 2 states" (plural).
>
> Both good points, I'll fix that.
>
> > > +	if (drv->state_count < 3 && cpu_data->utilized) {
> > > +		for (i = 0; i < drv->state_count; ++i) {
> > > +			if (dev->states_usage[i].disable ||
> > > +			    drv->states[i].flags & CPUIDLE_FLAG_POLLING)
> > > +				continue;
> > > +			break;
> >
> > This looks odd. It would be more straightforward to do
> >
> > if (!dev->states_usage[i].disable && !(drv->states[i].flags &
> > CPUIDLE_FLAG_POLLING)) {
> > 	idx = i;
> > 	goto end;
> > }
> >
> > without the "break" and "continue".
>
> Fair enough, this works as well.
>
> > I've changed my mind with respect to adding the idx == 1 check to
> > this. If the goal is to reduce latency for the "loaded" CPUs, this
> > applies to deeper idle states too.
>
> I see, this has no effect on arm devices one way or the other so I don't
> mind, it's completely up to you. In light of Doug's test results
> regarding this change, should I remove the check in v5?

Yes, please.
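
The loop restructuring and the comment-style request discussed in the review can be tried outside the kernel with a small standalone sketch. Everything here is an assumption made for illustration: `struct mock_idle_state`, `MOCK_FLAG_POLLING` and `shallowest_non_polling()` are hypothetical stand-ins for the cpuidle structures and `CPUIDLE_FLAG_POLLING`, and the function returns -1 where the suggested kernel code would simply fall through to the regular selection path. It is not the code that went into any later version of the patch.

```c
#include <stdbool.h>

/* Hypothetical stand-in for CPUIDLE_FLAG_POLLING; the value is arbitrary. */
#define MOCK_FLAG_POLLING 0x1u

/* Hypothetical, simplified view of one idle state (not a kernel type). */
struct mock_idle_state {
	unsigned int flags;	/* MOCK_FLAG_POLLING or 0 */
	bool disabled;		/* models dev->states_usage[i].disable */
};

/*
 * Return the index of the shallowest state that is enabled and is not a
 * polling state, or -1 if no state qualifies. This block comment also
 * shows the multi-line comment layout the review asks for.
 */
static int shallowest_non_polling(const struct mock_idle_state *states,
				  int state_count)
{
	int i;

	for (i = 0; i < state_count; i++) {
		if (!states[i].disabled &&
		    !(states[i].flags & MOCK_FLAG_POLLING))
			return i;
	}

	return -1;
}
```

One behavioural difference is visible from the quoted hunk alone: in the loop as posted, if every state is disabled or polling, `i` reaches `drv->state_count` before `idx = i` runs. With a positive test inside the loop, that case is reported separately instead of producing an out-of-range index.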
diff --git a/drivers/cpuidle/governors/teo.c b/drivers/cpuidle/governors/teo.c
index e2864474a98d..0f38b3ecab3c 100644
--- a/drivers/cpuidle/governors/teo.c
+++ b/drivers/cpuidle/governors/teo.c
@@ -2,8 +2,13 @@
 /*
  * Timer events oriented CPU idle governor
  *
+ * TEO governor:
  * Copyright (C) 2018 - 2021 Intel Corporation
  * Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
+ *
+ * Util-awareness mechanism:
+ * Copyright (C) 2022 Arm Ltd.
+ * Author: Kajetan Puchalski <kajetan.puchalski@arm.com>
  */
 
 /**
@@ -99,14 +104,49 @@
  * select the given idle state instead of the candidate one.
  *
  * 3. By default, select the candidate state.
+ *
+ * Util-awareness mechanism:
+ *
+ * The idea behind the util-awareness extension is that there are two distinct
+ * scenarios for the CPU which should result in two different approaches to idle
+ * state selection - utilized and not utilized.
+ *
+ * In this case, 'utilized' means that the average runqueue util of the CPU is
+ * above a certain threshold.
+ *
+ * When the CPU is utilized while going into idle, more likely than not it will
+ * be woken up to do more work soon and so a shallower idle state should be
+ * selected to minimise latency and maximise performance. When the CPU is not
+ * being utilized, the usual metrics-based approach to selecting the deepest
+ * available idle state should be preferred to take advantage of the power
+ * saving.
+ *
+ * In order to achieve this, the governor uses a utilization threshold.
+ * The threshold is computed per-cpu as a percentage of the CPU's capacity
+ * by bit shifting the capacity value. Based on testing, the shift of 6 (~1.56%)
+ * seems to be getting the best results.
+ *
+ * Before selecting the next idle state, the governor compares the current CPU
+ * util to the precomputed util threshold. If it's below, it defaults to the
+ * TEO metrics mechanism. If it's above and the currently selected candidate is
+ * C1, the idle state will be reduced to C0 as long as C0 is not a polling state.
  */
 
 #include <linux/cpuidle.h>
 #include <linux/jiffies.h>
 #include <linux/kernel.h>
+#include <linux/sched.h>
 #include <linux/sched/clock.h>
+#include <linux/sched/topology.h>
 #include <linux/tick.h>
 
+/*
+ * The number of bits to shift the cpu's capacity by in order to determine
+ * the utilized threshold
+ */
+#define UTIL_THRESHOLD_SHIFT 6
+
+
 /*
  * The PULSE value is added to metrics when they grow and the DECAY_SHIFT value
  * is used for decreasing metrics on a regular basis.
@@ -137,9 +177,11 @@ struct teo_bin {
  * @time_span_ns: Time between idle state selection and post-wakeup update.
  * @sleep_length_ns: Time till the closest timer event (at the selection time).
  * @state_bins: Idle state data bins for this CPU.
- * @total: Grand total of the "intercepts" and "hits" mertics for all bins.
+ * @total: Grand total of the "intercepts" and "hits" metrics for all bins.
  * @next_recent_idx: Index of the next @recent_idx entry to update.
  * @recent_idx: Indices of bins corresponding to recent "intercepts".
+ * @util_threshold: Threshold above which the CPU is considered utilized
+ * @utilized: Whether the last sleep on the CPU happened while utilized
  */
 struct teo_cpu {
 	s64 time_span_ns;
@@ -148,10 +190,24 @@ struct teo_cpu {
 	unsigned int total;
 	int next_recent_idx;
 	int recent_idx[NR_RECENT];
+	unsigned long util_threshold;
+	bool utilized;
 };
 
 static DEFINE_PER_CPU(struct teo_cpu, teo_cpus);
 
+/**
+ * teo_get_util - Update the CPU utilized status
+ * @dev: Target CPU
+ * @cpu_data: Governor CPU data for the target CPU
+ */
+static void teo_get_util(struct cpuidle_device *dev, struct teo_cpu *cpu_data)
+{
+	unsigned long util = sched_cpu_util(dev->cpu);
+
+	cpu_data->utilized = util > cpu_data->util_threshold;
+}
+
 /**
  * teo_update - Update CPU metrics after wakeup.
  * @drv: cpuidle driver containing state data.
@@ -303,7 +359,9 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
 	int i;
 
 	if (dev->last_state_idx >= 0) {
-		teo_update(drv, dev);
+		/* don't update metrics if the cpu was utilized during the last sleep */
+		if (!cpu_data->utilized)
+			teo_update(drv, dev);
 		dev->last_state_idx = -1;
 	}
 
@@ -323,6 +381,21 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
 		goto end;
 	}
 
+	teo_get_util(dev, cpu_data);
+	/* the cpu is being utilized and there's only 2 states to choose from */
+	/* no need to consider metrics, choose the shallowest non-polling state and exit */
+	if (drv->state_count < 3 && cpu_data->utilized) {
+		for (i = 0; i < drv->state_count; ++i) {
+			if (dev->states_usage[i].disable ||
+			    drv->states[i].flags & CPUIDLE_FLAG_POLLING)
+				continue;
+			break;
+		}
+
+		idx = i;
+		goto end;
+	}
+
 	/*
 	 * Find the deepest idle state whose target residency does not exceed
 	 * the current sleep length and the deepest idle state not deeper than
@@ -454,6 +527,11 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
 	if (idx > constraint_idx)
 		idx = constraint_idx;
 
+	/* if the CPU is being utilized and C1 is the selected candidate */
+	/* choose a shallower non-polling state to improve latency */
+	if (cpu_data->utilized && idx == 1)
+		idx = teo_find_shallower_state(drv, dev, idx, duration_ns, true);
+
 end:
 	/*
 	 * Don't stop the tick if the selected state is a polling one or if the
@@ -510,9 +588,11 @@ static int teo_enable_device(struct cpuidle_driver *drv,
 				    struct cpuidle_device *dev)
 {
 	struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu);
+	unsigned long max_capacity = arch_scale_cpu_capacity(dev->cpu);
 	int i;
 
 	memset(cpu_data, 0, sizeof(*cpu_data));
+	cpu_data->util_threshold = max_capacity >> UTIL_THRESHOLD_SHIFT;
 
 	for (i = 0; i < NR_RECENT; i++)
 		cpu_data->recent_idx[i] = -1;
Modern interactive systems, such as recent Android phones, tend to have
power efficient shallow idle states. Selecting deeper idle states on a
device while a latency-sensitive workload is running can adversely impact
performance due to increased latency. Additionally, if the CPU wakes up
from a deeper sleep before its target residency as is often the case, it
results in a waste of energy on top of that.

This patch extends the TEO governor with a mechanism adding util-awareness,
effectively providing a way for the governor to reduce the selected idle
state by 1 when the CPU is being utilized over a certain threshold while
still trying to select the deepest possible state using TEO's metrics when
the CPU is not being utilized. This is now possible since the CPU
utilization is exported from the scheduler with the sched_cpu_util function
and already used e.g. in the thermal governor IPA.

Under this implementation, when the CPU is being utilised and the
selected candidate state is C1, it will be reduced to C0 as long as C0
is not a polling state. This effectively should make the patch have no
effect on most Intel systems.

This can provide drastically decreased latency and performance benefits in
certain types of mobile workloads that are sensitive to latency,
such as Geekbench 5.

Signed-off-by: Kajetan Puchalski <kajetan.puchalski@arm.com>
---
 drivers/cpuidle/governors/teo.c | 84 ++++++++++++++++++++++++++++++++-
 1 file changed, 82 insertions(+), 2 deletions(-)
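
The threshold arithmetic from the patch comment ("the shift of 6 (~1.56%)") can be checked with a short standalone sketch. It is only an illustration under stated assumptions: `util_threshold()` and `cpu_utilized()` are hypothetical helper names that mirror the assignment in `teo_enable_device()` and the comparison in `teo_get_util()`, and the capacities passed in are plain example numbers, not values read from `arch_scale_cpu_capacity()`.

```c
/* Same shift value as UTIL_THRESHOLD_SHIFT in the patch. */
#define UTIL_THRESHOLD_SHIFT 6

/*
 * Hypothetical stand-in for the assignment in teo_enable_device():
 * the threshold is the capacity divided by 2^6 = 64, rounded down.
 */
static unsigned long util_threshold(unsigned long max_capacity)
{
	return max_capacity >> UTIL_THRESHOLD_SHIFT;
}

/*
 * Hypothetical stand-in for the comparison in teo_get_util(): the CPU
 * counts as utilized only when util is strictly above the threshold.
 */
static int cpu_utilized(unsigned long util, unsigned long max_capacity)
{
	return util > util_threshold(max_capacity);
}
```

For a capacity of 1024 the threshold is 16, and 16/1024 = 1/64 = 1.5625%, which matches the "~1.56%" in the comment. Because the shift truncates, a smaller example capacity such as 160 gives a threshold of 2 (1.25% of 160), so the effective percentage can sit slightly below 1.56% on lower-capacity CPUs.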