Message ID | 20221003144914.160547-1-kajetan.puchalski@arm.com (mailing list archive) |
---|---|
Series | cpuidle: teo: Introduce optional util-awareness |
On Mon, Oct 3, 2022 at 7:49 AM Kajetan Puchalski <kajetan.puchalski@arm.com> wrote:
>
> Hi,
>
> At the moment, all the available idle governors operate mainly based on their own past performance
> without taking into account any scheduling information.
...

I tried V2 on my Intel i5-10600K, but there was little difference
when compared to V1 in terms of enormous increase in processor power
consumption for the periodic type tests I ran. However, sometimes the
power use would drop down to comparable to the normal teo or menu
governors for a short time, which V1 didn't do.

I would suggest a completely new governor for your
only 2 idle states, and of similar power use, scenario.
Incredibly simple:

If CPU utilization >= threshold
  select idle state 0
Else
  Select idle state 1
Endif

As previously mentioned, you are overriding
everything that teo does anyhow.

Note to other readers: I also sent Kajetan an off-list email with more
tests descriptions and web links to results graphs and such.

... Doug
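The two-state rule suggested above can be sketched in plain C. This is only an illustration, not kernel code: the function name `simple_select` and the threshold value are assumptions made for the example, and a real governor would read utilization from the scheduler rather than take it as a parameter.

```c
/* Hypothetical threshold, chosen only for this example. */
#define UTIL_THRESHOLD 64UL

/*
 * Two-state selection as described in the email:
 * a busy CPU gets idle state 0, an otherwise quiet CPU gets idle state 1.
 */
static int simple_select(unsigned long cpu_util)
{
    if (cpu_util >= UTIL_THRESHOLD)
        return 0;
    return 1;
}
```

With this threshold, `simple_select(64)` picks state 0 and `simple_select(63)` picks state 1; no timer or wakeup history is consulted at all.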
On Thu, Oct 06, 2022 at 05:06:58PM -0700, Doug Smythies wrote:
>
> I would suggest a completely new governor for your
> only 2 idle states, and of similar power use, scenario.
> Incredibly simple:
>
> If CPU utilization >= threshold
>   select idle state 0
> Else
>   Select idle state 1
> Endif

That would be easy to implement at least but sadly just doesn't really
work at all. It would just result in huge amounts of too deep C1 sleeps
which are the main problem for us at the moment. For all intents and
purposes, only ever selecting C0 gives better results than this approach
on our hardware.

> As previously mentioned, you are overriding
> everything that teo does anyhow.

To an extent yes, this just came from observations of how cpuidle
behaves on arm. We tried an approach with only ever using C0 and it was
all right on power usage, the device would just heat up too much and
lose performance. That's why this approach of using TEO when the core is
completely idle and moving to the lower state when it's not achieves the
best of both worlds.

> Note to other readers: I also sent Kajetan an off-list email with more
> tests descriptions and web links to results graphs and such.

Looks like the email got lost in my inbox, could be some email filtering
or something like that.. Could you possibly resend and cc my private
email (kajetan.puchalski@tuta.io)?

So far I've not directly received any of your emails, they show up on
the mailing list archive but not in my inbox. Could be an IT issue on my
end but might be worth checking your email setup regardless.

> ... Doug

Thanks,
Kajetan
Hi Doug, Kajetan,

On 10/10/22 11:09, Kajetan Puchalski wrote:
> On Thu, Oct 06, 2022 at 05:06:58PM -0700, Doug Smythies wrote:
>>
>> I would suggest a completely new governor for your
>> only 2 idle states, and of similar power use, scenario.
>> Incredibly simple:
>>
>> If CPU utilization >= threshold
>>   select idle state 0
>> Else
>>   Select idle state 1
>> Endif
>
> That would be easy to implement at least but sadly just doesn't really
> work at all. It would just result in huge amounts of too deep C1 sleeps
> which are the main problem for us at the moment. For all intents and
> purposes, only ever selecting C0 gives better results than this approach
> on our hardware.
>
>> As previously mentioned, you are overriding
>> everything that teo does anyhow.
>
> To an extent yes, this just came from observations of how cpuidle
> behaves on arm. We tried an approach with only ever using C0 and it was
> all right on power usage, the device would just heat up too much and
> lose performance. That's why this approach of using TEO when the core is
> completely idle and moving to the lower state when it's not achieves the
> best of both worlds.
>
>> Note to other readers: I also sent Kajetan an off-list email with more
>> tests descriptions and web links to results graphs and such.

Thanks Doug for testing this work, we really appreciate that.
Doug, maybe you could post these into some public space, so the wider
community would also see them. Kajetan has used github to put a report
with testing results containing some graphs/plots.

>
> Looks like the email got lost in my inbox, could be some email filtering
> or something like that.. Could you possibly resend and cc my private
> email (kajetan.puchalski@tuta.io)?
> So far I've not directly received any of your emails, they show up on
> the mailing list archive but not in my inbox. Could be an IT issue on my
> end but might be worth checking your email setup regardless.

I also have some issues with the email inbox.
My apologies Doug, we will sort it.

Regards,
Lukasz
On Mon, Oct 3, 2022 at 4:50 PM Kajetan Puchalski <kajetan.puchalski@arm.com> wrote:
>
> Hi,
>
> At the moment, all the available idle governors operate mainly based on their own past performance

Not true, at least for the menu and teo governors that use the
information on the distribution of CPU wakeups that is available to
them and try to predict the next idle duration with the help of it.
This has a little to do with their performance.

> without taking into account any scheduling information. Especially on interactive systems, this
> results in them frequently selecting a deeper idle state and then waking up before its target
> residency is hit, thus leading to increased wakeup latency and lower performance with no power
> saving. For 'menu' while web browsing on Android for instance, those types of wakeups ('too deep')
> account for over 24% of all wakeups.

How is this measured?

> At the same time, on some platforms C0 can be power efficient enough to warrant wanting to prefer
> it over C1.

Well, energy-efficiency is relative, so strictly speaking it is
invalid to say "power efficient enough".

Also, as far as idle CPUs are concerned, we are talking about the
situation in which no useful work is done at all, so the state drawing
less power is always more energy-efficient than the one drawing more
power.

You may argue that predicting idle durations that are too long too
often leads to both excessive task wakeup latency and excessive energy
usage at the same time, but this may very well mean that the target
residency value for C1 is too low.

> Sleeps that happened in C0 while they could have used C1 ('too shallow') only save
> less power than they otherwise could have. Too deep sleeps, on the other hand, harm performance
> and nullify the potential power saving from using C1 in the first place. While taking this into
> account, it is clear that on balance it is preferable for an idle governor to have more too shallow
> sleeps instead of more too deep sleeps on those kinds of platforms.

True.

> Currently the best available governor under this metric is TEO which on average results in less than
> half the percentage of too deep sleeps compared to 'menu', getting much better wakeup latencies and
> increased performance in the process.

Well, good to hear that, but some numbers in support of that claim
would be nice to have too.

> This proposed optional extension to TEO would specifically tune it for minimising too deep
> sleeps and minimising latency to achieve better performance. To this end, before selecting the next
> idle state it uses the avg_util signal of a CPU's runqueue in order to determine to what extent the
> CPU is being utilized.

Which has no bearing on what the CPU idle time governors have to do
which is (1) to predict the next idle duration as precisely as
reasonably possible and (2) to minimise the cost in terms of task
wakeup latencies associated with using deep idle states.

The avg_util value tells us nothing about how much the CPU is going to
be idle this time and it also tells us nothing about the
latency-sensitivity of the workload.

Yes, it tells us how much idle time there was on the given CPU in the
past, on the average, but there is zero information about the
distribution of that idle time in it.

So in the first place please tell me why it fundamentally makes sense
to use avg_util in CPU idle time management at all.
On Wed, Oct 12, 2022 at 08:50:39PM +0200, Rafael J. Wysocki wrote:
> On Mon, Oct 3, 2022 at 4:50 PM Kajetan Puchalski
> <kajetan.puchalski@arm.com> wrote:
> >
> > Hi,
> >
> > At the moment, all the available idle governors operate mainly based on their own past performance
>
> Not true, at least for the menu and teo governors that use the
> information on the distribution of CPU wakeups that is available to
> them and try to predict the next idle duration with the help of it.
> This has a little to do with their performance.

You're right of course, I should have written "their own past
correctness" as that's what I was referring to. I just meant that for
instance with TEO the initial timer-based choice is only adjusted using
the governor's own metrics and not any information from anywhere else in
the system.

> > without taking into account any scheduling information. Especially on interactive systems, this
> > results in them frequently selecting a deeper idle state and then waking up before its target
> > residency is hit, thus leading to increased wakeup latency and lower performance with no power
> > saving. For 'menu' while web browsing on Android for instance, those types of wakeups ('too deep')
> > account for over 24% of all wakeups.
>
> How is this measured?

Using the cpu_idle_miss trace event. Over the course of a benchmark run
I collect all cpu_idle and cpu_idle_miss trace events. Then I divide the
number of too deep misses by the total number of cpu_idle wakeup events
which gives me the percentage. Those are the percentages described as
'gmean too deep %' in the tables included in the cover letter. Gmean
because I run the benchmarks for many iterations and then take an
average of those percentages to account for outliers.
PCMark Web Browsing is a 'benchmark' that just amounts to browsing the
web on Android, hence I can use data from it to talk about what the
system behaviour under normal usage would be.

> > At the same time, on some platforms C0 can be power efficient enough to warrant wanting to prefer
> > it over C1.
>
> Well, energy-efficiency is relative, so strictly speaking it is
> invalid to say "power efficient enough".

Yes, by 'enough' I meant that the power savings of C0 vs C1 on arm are
fairly comparable as opposed to other platforms. From Doug's data
collected on an Intel CPU, the power usage difference of only-C0
compared to only-C1 was over 20-fold ie 46w vs 2.6w. With only C0
enabled on Pixel 6 that difference is closer to something like 4%. It's
just fundamentally different hardware. With 4% being your ceiling you
can talk about performance/latency tradeoffs etc, if you're talking
about potential over 1700% increases, not so much.

> Also, as far as idle CPUs are concerned, we are talking about the
> situation in which no useful work is done at all, so the state drawing
> less power is always more energy-efficient than the one drawing more
> power.

Yes, assuming the CPU is woken up after the target residency of the
state has been met. If the wakeup happens too early then for that
situation C0 would've been more power efficient than C1 even though C1
technically draws less power, right? That's what we're trying to fix
here, we just noticed that for mobile interactive workloads at least
we're getting this situation way too often.

The result being that this util-aware TEO variant while using much less
C1 and decreasing the percentage of too deep sleeps from ~24% to ~3% in
PCMark Web Browsing also uses almost 2% less power. Clearly the power is
being wasted on not hitting C1 residency over and over.

> You may argue that predicting idle durations that are too long too
> often leads to both excessive task wakeup latency and excessive energy
> usage at the same time, but this may very well mean that the target
> residency value for C1 is too low.

We get residency values from DT and they're meant to be the descriptions
of each CPU's hardware so I don't think tweaking them to get better
results would be a good idea. Unless I'm misunderstanding what you mean?

> > Currently the best available governor under this metric is TEO which on average results in less than
> > half the percentage of too deep sleeps compared to 'menu', getting much better wakeup latencies and
> > increased performance in the process.
>
> Well, good to hear that, but some numbers in support of that claim
> would be nice to have too.

Those are the numbers I included in the cover letter for the two
benchmarks, they've been very consistent in terms of the pattern
across all the runs and workloads I've seen. For too deep % for
instance in GB5 we had on average menu 16.6%, TEO 9.6%, TEO+util 4.19%.
For PCMark Web Browsing menu 24.15%, TEO 10.32%, TEO+util 3.2%. The
values differ per-workload but every dataset I've seen had that same
'staircase' pattern.

> > This proposed optional extension to TEO would specifically tune it for minimising too deep
> > sleeps and minimising latency to achieve better performance. To this end, before selecting the next
> > idle state it uses the avg_util signal of a CPU's runqueue in order to determine to what extent the
> > CPU is being utilized.
>
> Which has no bearing on what the CPU idle time governors have to do
> which is (1) to predict the next idle duration as precisely as
> reasonably possible and (2) to minimise the cost in terms of task
> wakeup latencies associated with using deep idle states.
>
> The avg_util value tells us nothing about how much the CPU is going to
> be idle this time and it also tells us nothing about the
> latency-sensitivity of the workload.
>
> Yes, it tells us how much idle time there was on the given CPU in the
> past, on the average, but there is zero information about the
> distribution of that idle time in it.
>
> So in the first place please tell me why it fundamentally makes sense
> to use avg_util in CPU idle time management at all.

Right, the idea here is slightly similar to that of temporal locality.
We obviously can't predict the future which is sort of what an idle
governor tries to achieve. Focusing on timer events makes a lot of sense
and is probably close to as good as it gets in estimating future
behaviour.

The observation we're relying on here is simply that if the
CPU was doing enough work in the recent past for its avg_util to still
be raised while going into idle, it is very likely that the same CPU
might be needed again soon. From my tests that assumption tends to be
correct quite often. In those situations, when avg_util is high and the
next timer event is far enough for C1 to be selected, a lot of the time
the CPU does actually get woken up before the residency is hit leading
to all the issues described above.

I don't think using avg_util as the *only* input for idle management
would be a good idea at all. The way I see it, it serves as a very good
hint to determine if we are likely to get a wakeup between now and the
next timer event and provides an additional dimension for decision
making. While the current metrics only adjust themselves after making a
certain number of mistakes and are a "trailing" adjusting mechanism,
using avg_util this way provides a "leading" mechanism that potentially
lets us not make those mistakes in the first place. It's not just
theory either, it very clearly works and gets results, at least on the
platforms/workloads we've been looking at.

On the Intel & power usage angle you might have seen in the discussion,
Doug sent me some interesting data privately. As far as I can tell the
main issue there is that C0 on Intel doesn't actually do power saving so
moving the state selection down to it is a pretty bad idea because C1
could be very close in terms of latency and save much more power.
A potential solution could be altering the v2 to only decrease the state
selection by 1 if it's above 1, ie 2->1 but not 1->0. It's fine for us
because arm systems with 2 states use the early exit path anyway. It'd
just amount to changing this hunk:

+       if (cpu_data->utilized && idx > 0 && !dev->states_usage[idx-1].disable)
+               idx--;

to:

+       if (cpu_data->utilized && idx > 1 && !dev->states_usage[idx-1].disable)
+               idx--;

What would you think about that? Should make it much less intense for
Intel systems.

Thanks a lot for your interest,
Kajetan
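To make the difference between the two hunks concrete, here is a minimal user-space sketch. The struct `state_usage`, the function `util_adjust` and its `min_idx` parameter are stand-ins invented for this example (the real code operates on `struct cpuidle_device` inside the governor); `min_idx = 0` models the original v2 condition and `min_idx = 1` the proposed variant.

```c
#include <stdbool.h>

/* Simplified stand-in for the per-state "disable" flag. */
struct state_usage {
    bool disable;
};

/*
 * Step one state shallower when the CPU is utilized, but only if the
 * candidate index is above min_idx and the shallower state is enabled.
 */
static int util_adjust(int idx, bool utilized,
                       const struct state_usage *usage, int min_idx)
{
    if (utilized && idx > min_idx && !usage[idx - 1].disable)
        idx--;
    return idx;
}
```

Under these assumptions, a utilized CPU with a candidate state of 1 drops to state 0 with `min_idx = 0` but stays at state 1 with `min_idx = 1`, which is the 1->0 transition the second hunk is meant to avoid.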
Hi All,

On Thu, Oct 13, 2022 at 4:12 AM Kajetan Puchalski <kajetan.puchalski@arm.com> wrote:
> On Wed, Oct 12, 2022 at 08:50:39PM +0200, Rafael J. Wysocki wrote:
> > On Mon, Oct 3, 2022 at 4:50 PM Kajetan Puchalski
> > <kajetan.puchalski@arm.com> wrote:
...
> On the Intel & power usage angle you might have seen in the discussion,
> Doug sent me some interesting data privately. As far as I can tell the
> main issue there is that C0 on Intel doesn't actually do power saving so
> moving the state selection down to it is a pretty bad idea because C1
> could be very close in terms of latency and save much more power.
>
> A potential solution could be altering the v2 to only decrease the state
> selection by 1 if it's above 1, ie 2->1 but not 1->0. It's fine for us
> because arm systems with 2 states use the early exit path anyway. It'd
> just amount to changing this hunk:
>
> +       if (cpu_data->utilized && idx > 0 && !dev->states_usage[idx-1].disable)
> +               idx--;
>
> to:
>
> +       if (cpu_data->utilized && idx > 1 && !dev->states_usage[idx-1].disable)
> +               idx--;
>
> What would you think about that? Should make it much less intense for
> Intel systems.

I tested the above, which you sent me as patch version v2-2.

By default, my Intel i5-10600K has 4 idle states:

$ grep . /sys/devices/system/cpu/cpu7/cpuidle/state*/name
/sys/devices/system/cpu/cpu7/cpuidle/state0/name:POLL
/sys/devices/system/cpu/cpu7/cpuidle/state1/name:C1_ACPI
/sys/devices/system/cpu/cpu7/cpuidle/state2/name:C2_ACPI
/sys/devices/system/cpu/cpu7/cpuidle/state3/name:C3_ACPI

Idle driver governor legend:
teo: the normal teo idle governor
menu: the normal menu idle governor
util or v1: the original patch
util-v2 or v2: V2 of the patch
util-v2-2 or v2-2: the suggestion further up in this thread.

Test 1: Timer based periodic:

A load sweep from 0 to 100%, then 100% to 0, first 73 hertz,
then 113, 211, 347 and finally 401 hertz work/sleep frequency.
Single thread.

http://smythies.com/~doug/linux/idle/teo-util/consume/idle-1/

Summary, average processor package powers (watts):

teo             menu            v1              v2              v2-2
10.19399        10.74804        22.12791        21.0431         11.27865
                5.44%           117.07%         106.43%         10.64%

There is no performance measurement for this test,
it just has to finish the work packet before the next period starts.
Note that overruns do occur as the workload approaches 100%,
but I do not record that data, as typically the lower
workload percentages are the area of interest.

Test 2: Ping-pong test rotating through 6 different cores,
with a variable packet of work to do at each stop.
This test goes gradually through different idle states and
is not timer based. A different 2 core test (which I have not done)
is used to better explore the idle state 0 to idle state 1 transition.
This test has a performance measurement.
The CPU scaling governor was set to performance.
HWP was disabled.

http://smythies.com/~doug/linux/idle/teo-util/ping-sweep/6-1/loop-times.png
http://smythies.com/~doug/linux/idle/teo-util/ping-sweep/6-1/loop-times-detail-a.png
http://smythies.com/~doug/linux/idle/teo-util/ping-sweep/6-1/

Summary:

Average processor package power (watts):

teo             v2-2            menu
27.3881         29.98293        28.04096
                9.47%           2.38%

Execution time for the entire test (minutes):

teo             v2-2            menu
56              54.667          55.333
                -2.38%          -1.19%

However, notice that in the idle-state 0 and 1 region, V2-2 uses
more power and its loop time is longer (less is better),
but also in the deeper idle states regions V2-2 uses more power
and also runs faster.

teo: 36.4 watts and 10.3533 usecs/loop.
menu: 36.8 watts and 10.1604 usecs/loop.
util-v2-2: 38.8 watts and 11.2358 usecs/loop.

and

teo: 15.2 watts and 1,777.6 usecs/loop.
menu: 15.6 watts and 1767.4 usecs/loop.
util-v2-2: 17.4 watts and 1618.7 usecs/loop.

... Doug
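The percentage rows in the summaries above appear to be differences relative to the teo column. A small helper, with the hypothetical name `pct_diff`, shows the arithmetic under that assumption:

```c
/*
 * Percentage change of value relative to baseline.
 * Positive means value is higher than the baseline.
 */
static double pct_diff(double value, double baseline)
{
    return (value - baseline) / baseline * 100.0;
}
```

For example, `pct_diff(10.74804, 10.19399)` gives roughly 5.4, matching the menu column of Test 1, and `pct_diff(54.667, 56.0)` gives roughly -2.4, matching the v2-2 execution time in Test 2.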
Hi Rafael,

> The avg_util value tells us nothing about how much the CPU is going to
> be idle this time and it also tells us nothing about the
> latency-sensitivity of the workload.
>
> Yes, it tells us how much idle time there was on the given CPU in the
> past, on the average, but there is zero information about the
> distribution of that idle time in it.
>
> So in the first place please tell me why it fundamentally makes sense
> to use avg_util in CPU idle time management at all.

I have an alternative suggestion that could be a reasonable way forward
here. Instead of applying util-awareness on top of TEO where it would
have to be reconciled with how TEO is currently expected to work, I just
wrote a simple completely new governor which operates only on timer
events alongside util values.

The idea is this:
1. Find the deepest state based on residency and time until the next timer event
2. If sched_cpu_util() is above the threshold, select a shallower non-polling state

There's no other metrics or anything else under the current
implementation. I can't say how it would work on Intel systems and in
the presence of more idle states but having a completely separate
governor would be very useful for us to tune it specifically for our use
cases and types of systems (ie. ones with 2 idle states and no polling
states).

As it stands it performs quite well and achieves better results
(especially in terms of latency) than both menu & TEO but slightly worse
than the previously suggested TEO + util. As far as we're concerned
that's okay, we can work from there to try to find a way of doing
metrics or improving the algorithm that would be more tailored to using
the util approach. I think it's much cleaner than what we were
discussing previously since that was effectively overriding most of what
TEO was doing.

Here are some numbers to visualize the results. They were all obtained
in the same way as the ones in the cover letter so you can refer to that
in case something isn't clear.

'teo_util' is of course TEO + util as in the patchset.
'idleutil' is this entirely new proposed minimal governor.

1. Geekbench 5 (latency-sensitive, heavy load test)

+-----------------+----------+---------+-------------+
| metric          | kernel   |   value | perc_diff   |
|-----------------+----------+---------+-------------|
| multicore_score | menu     |  2832.3 | 0.0%        |
| multicore_score | teo      |  2815.3 | -0.6%       |
| multicore_score | teo_util |  2880.6 | 1.7%        |
| multicore_score | idleutil |  2859.3 | 0.95%       |
+-----------------+----------+---------+-------------+

Percentages & types of idle misses

+-----------+-------------+--------------+
| kernel    | type        | percentage   |
|-----------+-------------+--------------|
| menu      | too deep    | 15.613%      |
| teo       | too deep    | 9.376%       |
| teo_util  | too deep    | 4.581%       |
| idleutil  | too deep    | 5.464%       |
| menu      | too shallow | 2.611%       |
| teo       | too shallow | 6.099%       |
| teo_util  | too shallow | 14.141%      |
| idleutil  | too shallow | 13.282%      |
+-----------+-------------+--------------+

Power usage [mW]

+--------------+----------+----------+---------+-------------+
| chan_name    | metric   | kernel   |   value | perc_diff   |
|--------------+----------+----------+---------+-------------|
| total_power  | gmean    | menu     |  2705.9 | 0.0%        |
| total_power  | gmean    | teo      |  2668.2 | -1.39%      |
| total_power  | gmean    | teo_util |  2710.2 | 0.16%       |
| total_power  | gmean    | idleutil |  2657.9 | -1.78%      |
+--------------+----------+----------+---------+-------------+

Wakeup latency

+-----------------+----------+----------+-------------+-------------+
| comm            | metric   | kernel   | value       | perc_diff   |
|-----------------+----------+----------+-------------+-------------|
| AsyncTask #1    | gmean    | menu     | 66.85μs     | 0.0%        |
| AsyncTask #1    | gmean    | teo      | 66.79μs     | -0.09%      |
| AsyncTask #1    | gmean    | teo_util | 57.84μs     | -13.47%     |
| AsyncTask #1    | gmean    | idleutil | 62.61μs     | -6.35%      |
| labs.geekbench5 | gmean    | menu     | 80.62μs     | 0.0%        |
| labs.geekbench5 | gmean    | teo      | 94.75μs     | 17.52%      |
| labs.geekbench5 | gmean    | teo_util | 52.98μs     | -34.28%     |
| labs.geekbench5 | gmean    | idleutil | 68.58μs     | -14.93%     |
+-----------------+----------+----------+-------------+-------------+

2. PCMark Web Browsing (non latency-sensitive, normal usage test)

+----------------+----------+---------+-------------+
| metric         | kernel   |   value | perc_diff   |
|----------------+----------+---------+-------------|
| PcmaWebV2Score | menu     |  5232   | 0.0%        |
| PcmaWebV2Score | teo      |  5219.8 | -0.23%      |
| PcmaWebV2Score | teo_util |  5249.7 | 0.34%       |
| PcmaWebV2Score | idleutil |  5215.7 | -0.31%      |
+----------------+----------+---------+-------------+

Percentages & types of idle misses

+-----------+-------------+--------------+
| kernel    | type        | percentage   |
|-----------+-------------+--------------|
| menu      | too deep    | 24.814%      |
| teo       | too deep    | 11.65%       |
| teo_util  | too deep    | 3.753%       |
| idleutil  | too deep    | 4.304%       |
| menu      | too shallow | 3.101%       |
| teo       | too shallow | 8.578%       |
| teo_util  | too shallow | 18.309%      |
| idleutil  | too shallow | 17.638%      |
+-----------+-------------+--------------+

Power usage [mW]

+--------------+----------+----------+---------+-------------+
| chan_name    | metric   | kernel   |   value | perc_diff   |
|--------------+----------+----------+---------+-------------|
| total_power  | gmean    | menu     |   179.2 | 0.0%        |
| total_power  | gmean    | teo      |   184.8 | 3.1%        |
| total_power  | gmean    | teo_util |   180.5 | 0.71%       |
| total_power  | gmean    | idleutil |   185   | 3.24%       |
+--------------+----------+----------+---------+-------------+

Wakeup latency

+-----------------+----------+----------+-------------+-------------+
| comm            | metric   | kernel   | value       | perc_diff   |
|-----------------+----------+----------+-------------+-------------|
| CrRendererMain  | gmean    | menu     | 236.63μs    | 0.0%        |
| CrRendererMain  | gmean    | teo      | 201.85μs    | -14.7%      |
| CrRendererMain  | gmean    | teo_util | 111.76μs    | -52.77%     |
| CrRendererMain  | gmean    | idleutil | 105.55μs    | -55.39%     |
| chmark:workload | gmean    | menu     | 100.30μs    | 0.0%        |
| chmark:workload | gmean    | teo      | 80.20μs     | -20.04%     |
| chmark:workload | gmean    | teo_util | 53.81μs     | -46.35%     |
| chmark:workload | gmean    | idleutil | 71.29μs     | -28.92%     |
| RenderThread    | gmean    | menu     | 37.97μs     | 0.0%        |
| RenderThread    | gmean    | teo      | 31.69μs     | -16.54%     |
| RenderThread    | gmean    | teo_util | 34.32μs     | -9.63%      |
| RenderThread    | gmean    | idleutil | 35.78μs     | -5.77%      |
| surfaceflinger  | gmean    | menu     | 97.57μs     | 0.0%        |
| surfaceflinger  | gmean    | teo      | 98.86μs     | 1.31%       |
| surfaceflinger  | gmean    | teo_util | 72.59μs     | -25.6%      |
| surfaceflinger  | gmean    | idleutil | 56.23μs     | -42.37%     |
+-----------------+----------+----------+-------------+-------------+

I also have similar data for Jankbench & Speedometer with right about
the same results, I'll skip those for now for brevity.
Would you like me to send a patch with this new governor instead? What
would you think about this instead of the previously suggested approach?

Thanks,
Kajetan
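The two-step idea described in the email above can be sketched as a small user-space model. This is not the proposed governor's code, which is not shown in the thread: the struct `idle_state`, the function names and the explicit `util`/`util_threshold` parameters are assumptions for illustration, and "select a shallower non-polling state" is interpreted here as stepping down by one state when that state does not poll.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified model of an idle state; field names are illustrative. */
struct idle_state {
    uint64_t target_residency_ns; /* minimum sleep for the state to pay off */
    bool polling;                 /* true for a polling state such as POLL */
};

/*
 * Step 1: deepest state whose target residency fits before the next
 * timer event. States are assumed sorted from shallowest to deepest.
 */
static int deepest_fitting_state(const struct idle_state *states, int count,
                                 uint64_t sleep_ns)
{
    int idx = 0;

    for (int i = 1; i < count; i++) {
        if (states[i].target_residency_ns <= sleep_ns)
            idx = i;
    }
    return idx;
}

/* Step 2: if the CPU is utilized, step down one state unless it polls. */
static int idleutil_select(const struct idle_state *states, int count,
                           uint64_t sleep_ns, unsigned long util,
                           unsigned long util_threshold)
{
    int idx = deepest_fitting_state(states, count, sleep_ns);

    if (util > util_threshold && idx > 0 && !states[idx - 1].polling)
        idx--;
    return idx;
}
```

On a two-state system with no polling state, a utilized CPU always ends up in state 0 here, while an idle one gets state 1 whenever the next timer is far enough away. On a system whose state 0 polls, the sketch never steps from state 1 down to state 0.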
Hi Kajetan,

On 20/10/2022 18:20, Kajetan Puchalski wrote:
> Hi Rafael,
>
>> The avg_util value tells us nothing about how much the CPU is going to
>> be idle this time and it also tells us nothing about the
>> latency-sensitivity of the workload.
>>
>> Yes, it tells us how much idle time there was on the given CPU in the
>> past, on the average, but there is zero information about the
>> distribution of that idle time in it.
>>
>> So in the first place please tell me why it fundamentally makes sense
>> to use avg_util in CPU idle time management at all.
>
> I have an alternative suggestion that could be a reasonable way forward
> here. Instead of applying util-awareness on top of TEO where it would
> have to be reconciled with how TEO is currently expected to work, I just
> wrote a simple completely new governor which operates only on timer
> events alongside util values.

I second the idea. I took a long time to investigate how to improve the
governor and reached the conclusion that having a dedicated governor for
mobile platforms makes sense. Also the behavior is very platform dependent.

Regarding the utilization, one of the issues is the kernel threads
preventing a task from waking up on the same CPU and forcing its migration
at wake up time. So the prediction is screwed up at that time. There is
a paper talking about this issue [1].

I've done a 'mobile' governor, including the next interrupt prediction
[2]. It is very simple and almost has the same results as the teo on my
platform (rock960).

I'm not planning to upstream it because I don't have spare time to
improve the results and take care of the IPIs part.

Also the paradigm is radically different and you may be interested in
the approach.

So if you want to rework, improve, test, upstream it, feel free to reuse
the code.

  -- Daniel

[1] Dynamic workload characterization for power efficient scheduling on
CMP systems : https://cseweb.ucsd.edu//~tullsen/islped10.pdf

[2] https://git.linaro.org/people/daniel.lezcano/linux.git/commit/?h=cpuidle/mobile-governor-v5.1&id=de1edb05e3c342f0738b414aa84263d6555b7462
Hi Doug, Thank you for your effort in testing these patches and different governors. We really appreciate that, since this helped us to better understand the platform that you are using. It is different to what we have and our workloads. That's why I have some comments. It would be hard to combine these two worlds and requirements. I have some concerns to the tests, the setup and the platform. I can see a reason why this patch has to prove the strengths on this platform and environment. Please see my comments below. On 10/13/22 23:12, Doug Smythies wrote: > Hi All, > > On Thu, Oct 13, 2022 at 4:12 AM Kajetan Puchalski > <kajetan.puchalski@arm.com> wrote: >> On Wed, Oct 12, 2022 at 08:50:39PM +0200, Rafael J. Wysocki wrote: >>> On Mon, Oct 3, 2022 at 4:50 PM Kajetan Puchalski >>> <kajetan.puchalski@arm.com> wrote: > ... > >> On the Intel & power usage angle you might have seen in the discussion, >> Doug sent me some interesting data privately. As far as I can tell the >> main issue there is that C0 on Intel doesn't actually do power saving so >> moving the state selection down to it is a pretty bad idea because C1 >> could be very close in terms of latency and save much more power. >> >> A potential solution could be altering the v2 to only decrease the state >> selection by 1 if it's above 1, ie 2->1 but not 1->0. It's fine for us >> because arm systems with 2 states use the early exit path anyway. It'd >> just amount to changing this hunk: >> >> + if (cpu_data->utilized && idx > 0 && !dev->states_usage[idx-1].disable) >> + idx--; >> >> to: >> >> + if (cpu_data->utilized && idx > 1 && !dev->states_usage[idx-1].disable) >> + idx--; >> >> What would you think about that? Should make it much less intense for >> Intel systems. > > I tested the above, which you sent me as patch version v2-2. > > By default, my Intel i5-10600K has 4 idle states: > > $ grep . 
/sys/devices/system/cpu/cpu7/cpuidle/state*/name > /sys/devices/system/cpu/cpu7/cpuidle/state0/name:POLL This active polling state type worries me a bit. We don't have such on our platforms. Our shallowest idle state is really different. We don't have active polling and there is no need for such. > /sys/devices/system/cpu/cpu7/cpuidle/state1/name:C1_ACPI > /sys/devices/system/cpu/cpu7/cpuidle/state2/name:C2_ACPI > /sys/devices/system/cpu/cpu7/cpuidle/state3/name:C3_ACPI > > Idle driver governor legend: > teo: the normal teo idle governor > menu: the normal menu idle governor > util or v1: the original patch > util-v2 or v2: V2 of the patch > util-v2-2 or v2-2: the suggestion further up in this thread. > > Test 1: Timer based periodic: > > A load sweep from 0 to 100%, then 100% to 0, first 73 hertz, then 113, > 211,347 and finally 401 hertz work/sleep frequency. Single thread. This 'Single thread' worries me a bit as well. Probably the task don't migrate at all over CPUs, or very unlikely. > > http://smythies.com/~doug/linux/idle/teo-util/consume/idle-1/ > > Summary, average processor package powers (watts): > > teo menu v1 v2 v2-2 > 10.19399 10.74804 22.12791 21.0431 11.27865 > 5.44% 117.07% 106.43% 10.64% > > There is no performance measurement for this test, it just has to > finish the work packet before the next period starts. Note that > overruns do occur as the workload approaches 100%, but I do not record > that data, as typically the lower workload percentages are the area of > interest. > > Test 2: Ping-pong test rotating through 6 different cores, with a > variable packet of work to do at each stop. This test goes gradually > through different idle states and is not timer based. A different 2 > core test (which I have not done) is used to better explore the idle > state 0 to idle state 1 transition. This test has a performance > measurement. The CPU scaling governor was set to performance. HWP was The 'performance' governor also worries me here. 
When the CPU frequency is fixed, some basic statistics mechanisms would be good enough for reasoning. In our world, a few conditions are different:

1. The CPU frequency changes. We work with SchedUtil and adjust the frequency quite often. Therefore, simple statistics which are not aware of the frequency change and its impact on the CPU compute capacity might be misleading. The utilization signal of the CPU runqueue brings that information to our idle decisions.

2. Single-threaded workloads aren't typical apps. When we deal with many tasks and the task scheduler migrates them across many CPUs, we would like to 'see' this. The 'old-school' statistics observing only the local CPU usage are not able to figure out fast enough that some bigger task just migrated to that CPU. With the utilization of the runqueue, we know that upfront, because the task utilization was subtracted from the old CPU's runqueue and added to the new CPU's runqueue. Our approach with this util signal would allow us to make a better decision in these two use cases:
   a) a task is leaving the CPU and the rq util drops dramatically, so we can go into a deeper sleep immediately
   b) a task has just arrived on this CPU and the rq util got a higher value, so we shouldn't go into a deep idle state, since there is a 'not small' task.

3. Power saving in the shallowest idle state on our platform was improved recently, which creates scope for both saving power and increasing performance.

It would be fair to let TEO continue its evolution (on the platforms that it was designed for) and create a new governor which would better address the needs of other platforms and workloads. I will ask Rafael if that can happen. Kajetan has a tiny patch with basic mechanisms, which performs really well. I will ask him to send it so Rafael could have a look and decide. We could then develop/improve that new governor with ideas from other experienced engineers in mobile platforms.

Regards,
Lukasz
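As an illustration of point 2 above, here is a minimal userspace sketch of the bookkeeping being described: when a task migrates, its utilization leaves the old CPU's runqueue and shows up on the new one at once, so an idle decision on either CPU can react immediately. This is a toy model, not kernel code; `struct rq_model`, `migrate_task_util()`, `cpu_looks_utilized()`, the fixed `NR_CPUS` and any threshold value passed in are all assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 4

/* Hypothetical per-CPU runqueue utilization, on a 0..1024 style scale. */
struct rq_model {
	unsigned int util[NR_CPUS];
};

/* Move a task's utilization from the source runqueue to the destination. */
static void migrate_task_util(struct rq_model *rq, int src, int dst,
			      unsigned int task_util)
{
	assert(src >= 0 && src < NR_CPUS && dst >= 0 && dst < NR_CPUS);
	assert(rq->util[src] >= task_util);

	rq->util[src] -= task_util;	/* case a): old CPU may now sleep deeper */
	rq->util[dst] += task_util;	/* case b): new CPU should stay shallow */
}

/* True when the runqueue looks busy enough to avoid a deep idle state. */
static bool cpu_looks_utilized(const struct rq_model *rq, int cpu,
			       unsigned int threshold)
{
	return rq->util[cpu] > threshold;
}
```

Statistics that only observe past idle behaviour on the local CPU would need several wakeups to notice the same change; in this model it is visible right after the migration.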
Hi Rafael,

On 10/13/22 12:12, Kajetan Puchalski wrote:
> On Wed, Oct 12, 2022 at 08:50:39PM +0200, Rafael J. Wysocki wrote:
>> On Mon, Oct 3, 2022 at 4:50 PM Kajetan Puchalski
>> <kajetan.puchalski@arm.com> wrote:
>>>
>>> Hi,
>>>
>>> At the moment, all the available idle governors operate mainly based on their own past performance
>>
>> Not true, at least for the menu and teo governors that use the
>> information on the distribution of CPU wakeups that is available to
>> them and try to predict the next idle duration with the help of it.
>> This has a little to do with their performance.
>
> You're right of course, I should have written "their own past
> correctness" as that's what I was referring to. I just meant that for
> instance with TEO the initial timer-based choice is only adjusted using
> the governor's own metrics and not any information from anywhere else in
> the system.
>

[snip]

Would it be possible to consider a new, small and simple idle governor which is better suited to those other workloads and platforms? Kajetan has one and can send it to the LKML, so you could have a look.

I have sent a detailed explanation about this to Doug in this thread (I don't want to duplicate it). It looks like it would be hard to meet both worlds' requirements.

Regards,
Lukasz
On 10/20/22 20:52, Daniel Lezcano wrote: > > Hi Kajetan, > > On 20/10/2022 18:20, Kajetan Puchalski wrote: >> Hi Rafael, >> >>> The avg_util value tells us nothing about how much the CPU is going to >>> be idle this time and it also tells us nothing about the >>> latency-sensitivity of the workload. >>> >>> Yes, it tells us how much idle time there was on the given CPU in the >>> past, on the average, but there is zero information about the >>> distribution of that idle time in it. >>> >>> So in the first place please tell me why it fundamentally makes sense >>> to use avg_util in CPU idle time management at all. >> >> I have an alternative suggestion that could be a reasonable way forward >> here. Instead of applying util-awareness on top of TEO where it would >> have to be reconciled with how TEO is currently expected to work, I just >> wrote a simple completely new governor which operates only on timer >> events alongside util values. > > I second the idea. I took a long time to investigate how to improve the > governor and reached the conclusion having a dedicated governor for > mobile platform makes sense. Also the behavior is very platform dependent. > > Regarding the utilization, one of the issue is the kernel threads > preventing a task to wake up on the same CPU and forcing its migration > at wake up time. So the prediction is screwed up at that time. > > There is a paper talking this issue [1] > > I've done a 'mobile' governor, including the next interrupt prediction > [2]. It is very simple and almost has the same results as the teo on my > platform (rock960). > > I'm not planning to upstream it because I don't have spare time to > improve the results and take care of the IPIs. part. > > Also the paradigm is radically different and you may be interested in > the approach. > > So if you want to rework, improve, test, upstream it, feel free to reuse > the code. 
>
> -- Daniel
>
> [1] Dynamic workload characterization for power efficient scheduling on
> CMP systems : https://cseweb.ucsd.edu//~tullsen/islped10.pdf
>
> [2]
> https://git.linaro.org/people/daniel.lezcano/linux.git/commit/?h=cpuidle/mobile-governor-v5.1&id=de1edb05e3c342f0738b414aa84263d6555b7462
>
>

Thanks Daniel! I forgot about your work in this area. As I have responded in some other email in this thread, we might start from a new small governor and then others can contribute.

Even this small governor that Kajetan showed me performs really well on Pixel 6.

Regards,
Lukasz
On 28/10/2022 09:08, Lukasz Luba wrote:

[ ... ]

>> [1] Dynamic workload characterization for power efficient scheduling
>> on CMP systems : https://cseweb.ucsd.edu//~tullsen/islped10.pdf
>>
>> [2]
>> https://git.linaro.org/people/daniel.lezcano/linux.git/commit/?h=cpuidle/mobile-governor-v5.1&id=de1edb05e3c342f0738b414aa84263d6555b7462
>>
>>
>
> Thanks Daniel! I forgot about your work in this area. As I have
> responded in some other email in this thread, we might start
> from a new small governor and then others can contribute.
>
> Even this small governor that Kajetan showed me performs really
> well on Pixel 6.

Do you have other ARM64 platforms to compare with?
On 10/28/22 08:11, Daniel Lezcano wrote:
> On 28/10/2022 09:08, Lukasz Luba wrote:
>
> [ ... ]
>
>>> [1] Dynamic workload characterization for power efficient scheduling
>>> on CMP systems : https://cseweb.ucsd.edu//~tullsen/islped10.pdf
>>>
>>> [2]
>>> https://git.linaro.org/people/daniel.lezcano/linux.git/commit/?h=cpuidle/mobile-governor-v5.1&id=de1edb05e3c342f0738b414aa84263d6555b7462
>>>
>>>
>>>
>>
>> Thanks Daniel! I forgot about your work in this area. As I have
>> responded in some other email in this thread, we might start
>> from a new small governor and then others can contribute.
>>
>> Even this small governor that Kajetan showed me performs really
>> well on Pixel 6.
>
> Do you have other ARM64 platforms to compare with?
>

Yes, we do, including some older platforms (about 10 years old now; it would be nice to check how they perform). We also have a big arm64 server to try it on. Those results will be available with the new patch once we decide to go with the new governor.
On Thu, Oct 13, 2022 at 1:12 PM Kajetan Puchalski <kajetan.puchalski@arm.com> wrote: > > On Wed, Oct 12, 2022 at 08:50:39PM +0200, Rafael J. Wysocki wrote: > > On Mon, Oct 3, 2022 at 4:50 PM Kajetan Puchalski > > <kajetan.puchalski@arm.com> wrote: > > > > > > Hi, > > > > > > At the moment, all the available idle governors operate mainly based on their own past performance > > > > Not true, at least for the menu and teo governors that use the > > information on the distribution of CPU wakeups that is available to > > them and try to predict the next idle duration with the help of it. > > This has a little to do with their performance. > > You're right of course, I should have written "their own past > correctness" as that's what I was referring to. I just meant that for > instance with TEO the initial timer-based choice is only adjusted using > the governor's own metrics and not any information from anywhere else in > the system. The last sentence is just right, so that's what I would say in the changelog. > > > without taking into account any scheduling information. Especially on interactive systems, this > > > results in them frequently selecting a deeper idle state and then waking up before its target > > > residency is hit, thus leading to increased wakeup latency and lower performance with no power > > > saving. For 'menu' while web browsing on Android for instance, those types of wakeups ('too deep') > > > account for over 24% of all wakeups. > > > > How is this measured? > > Using the cpu_idle_miss trace event. Over the course of a benchmark run > I collect all cpu_idle and cpu_idle_miss trace events. Then I divide the > number of too deep misses by the total number of cpu_idle wakeup events > which gives me the percentage. Those are the percentages described as > 'gmean too deep %' in the tables included in the cover letter. Gmean > because I run the benchmarks for many iterations and then take an > average of those percentages to account for outliers. 
> PCMark Web Browsing is a 'benchmark' that just amounts to browsing the > web on Android, hence I can use data from it to talk about what the > system behaviour under normal usage would be. > > > > At the same time, on some platforms C0 can be power efficient enough to warrant wanting to prefer > > > it over C1. > > > > Well, energy-efficiency is relative, so strictly speaking it is > > invalid to say "power efficient enough". > > Yes, by 'enough' I meant that the power savings of C0 vs C1 on arm are > fairly comparable as opposed to other platforms. From Doug's data > collected on an Intel CPU, the power usage difference of only-C0 > compared to only-C1 was over 20-fold ie 46w vs 2.6w. With only C0 > enabled on Pixel 6 that difference is closer to something like 4%. It's > just fundamentally different hardware. With 4% being your ceiling you > can talk about performance/latency tradeoffs etc, if you're talking > about potential over 1700% increases, not so much. The above is very close to a proper problem statement. IIUC, on the hardware in question the power difference between the first available idle state (state 0) and the next idle state (state 1) is relatively small, but the target residency of state 1 is relatively large and if it is missed, energy is wasted and the extra cost in terms of latency is relatively high. At the same time, this is the idle duration range where the latency matters the most, so it is desirable to reduce the likelihood of mispredicting higher idle duration in this range beyond what the teo governor does by itself. Also, unlike on Intel systems, state 0 actually is an idle state (on Intel systems state 0 is a polling state and it is there to avoid the latency cost of C1 in the cases when it wouldn't save any energy due to the nonzero target residency). Fair enough. 
> > Also, as far as idle CPUs are concerned, we are talking about the
> > situation in which no useful work is done at all, so the state drawing
> > less power is always more energy-efficient than the one drawing more
> > power.
>
> Yes, assuming the CPU is woken up after the target residency of the
> state has been met. If the wakeup happens too early then for that
> situation C0 would've been more power efficient than C1 even though C1
> technically draws less power, right? That's what we're trying to fix
> here, we just noticed that for mobile interactive workloads at least
> we're getting this situation way too often.

Well, the interactive workloads are likely to be similar on any client systems (and the term "mobile" is somewhat vague).

> The result being that this util-aware TEO variant while using much less
> C1 and decreasing the percentage of too deep sleeps from ~24% to ~3% in
> PCMark Web Browsing also uses almost 2% less power. Clearly the power is
> being wasted on not hitting C1 residency over and over.

Hmm. The PCMark Web Browsing table in your cover letter doesn't indicate that. The "gmean power usage" there for "teo + util-aware" is 205, whereas for "teo" alone it is 187.8. This is still arguably balanced by the latency difference (~100 us vs ~185 us, respectively), but this looks like trading energy for performance.

As a side note, unmodified "teo" shows some nice 10%-range improvements in terms of both power and latency over "menu" in this case, even though it underestimates the idle duration much more often (which kind of supports the idea that underestimating the idle duration is better than overestimating it).

> > You may argue that predicting idle durations that are too long too
> > often leads to both excessive task wakeup latency and excessive energy
> > usage at the same time, but this may very well mean that the target
> > residency value for C1 is too low.
> > We get residency values from DT and they're meant to be the descriptions > of each CPU's hardware so I don't think tweaking them to get better > results would be a good idea. Unless I'm misunderstanding what you mean? I mean that the target residency values from DT may be less than perfect. Also, they really are input for the governor's decisions, nothing more, so putting values that are likely to yield more desirable governor behavior in there is not a bad idea. > > > Currently the best available governor under this metric is TEO which on average results in less than > > > half the percentage of too deep sleeps compared to 'menu', getting much better wakeup latencies and > > > increased performance in the process. > > > > Well, good to hear that, but some numbers in support of that claim > > would be nice to have too. > > Those are the numbers I included in the cover letter for the two > benchmarks, they've been very consistent in terms of the pattern > across all the runs and workloads I've seen. For too deep % for > instance in GB5 we had on average menu 16.6%, TEO 9.6%, TEO+util 4.19%. > For PCMark Web Browsing menu 24.15%, TEO 10.32%, TEO+util 3.2%. The > values differ per-workload but every dataset I've seen had that same > 'staircase' pattern. I see. > > > This proposed optional extension to TEO would specifically tune it for minimising too deep > > > sleeps and minimising latency to achieve better performance. To this end, before selecting the next > > > idle state it uses the avg_util signal of a CPU's runqueue in order to determine to what extent the > > > CPU is being utilized. > > > > Which has no bearing on what the CPU idle time governors have to do > > which is (1) to predict the next idle duration as precisely as > > reasonably possible and (2) to minimise the cost in terms of task > > wakeup latencies associated with using deep idle states. 
> > > > The avg_util value tells us nothing about how much the CPU is going to > > be idle this time and it also tells us nothing about the > > latency-sensitivity of the workload. > > > > Yes, it tells us how much idle time there was on the given CPU in the > > past, on the average, but there is zero information about the > > distribution of that idle time in it. > > > > So in the first place please tell me why it fundamentally makes sense > > to use avg_util in CPU idle time management at all. > > Right, the idea here is slightly similar to that of temporal locality. > We obviously can't predict the future which is sort of what an idle > governor tries to achieve. Focusing on timer events makes a lot of sense > and is probably close to as good as it gets in estimating future > behaviour. > > The observation we're relying on here is simply that if the > CPU was doing enough work in the recent past for its avg_util to still > be raised while going into idle, it is very likely that the same CPU > might be needed again soon. From my tests that assumption tends to be > correct quite often. In those situations, when avg_util is high and the > next timer event is far enough for C1 to be selected, a lot of the time > the CPU does actually get woken up before the residency is hit leading > to all the issues described above. This is quite reasonable overall. > I don't think using avg_util as the *only* input for idle management > would be a good idea at all. The way I see it, it serves as a very good hint > to determine if we are likely to get a wakeup between now and the next > timer event and provides an additional dimension for decision making. > While the current metrics only adjust themselves after making a certain > number of mistakes and are a "trailing" adjusting mechanism, using > avg_util this way provides a "leading" mechanism that potentially lets > us not make those mistakes in the first place. 
It's not just theory > either, it very clearly works and gets results, at least on the > platforms/workloads we've been looking at. > > > On the Intel & power usage angle you might have seen in the discussion, > Doug sent me some interesting data privately. As far as I can tell the > main issue there is that C0 on Intel doesn't actually do power saving so > moving the state selection down to it is a pretty bad idea because C1 > could be very close in terms of latency and save much more power. > > A potential solution could be altering the v2 to only decrease the state > selection by 1 if it's above 1, ie 2->1 but not 1->0. It's fine for us > because arm systems with 2 states use the early exit path anyway. It'd > just amount to changing this hunk: > > + if (cpu_data->utilized && idx > 0 && !dev->states_usage[idx-1].disable) > + idx--; > > to: > > + if (cpu_data->utilized && idx > 1 && !dev->states_usage[idx-1].disable) > + idx--; > > What would you think about that? Definitely it should not be changed if the previous state is a polling one which can be checked right away. That would take care of the "Intel case" automatically. > Should make it much less intense for Intel systems. So I think that this adjustment only makes sense if the current candidate state is state 1 and state 0 is not polling. In the other cases the cost of missing an opportunity to save energy would be too high for the observed performance gain.
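The condition suggested at the end of this message (adjust only when the candidate is state 1 and state 0 is not a polling state) can be sketched as a small userspace model. This is only an illustration under stated assumptions: `struct idle_state_model` and `adjust_for_utilization()` are hypothetical names, not the kernel's `struct cpuidle_state` or anything from the patch, and the `disabled` field merely stands in for the `states_usage[...].disable` check in the quoted hunk.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for one idle state. */
struct idle_state_model {
	bool polling;	/* busy-waits instead of saving power (e.g. POLL) */
	bool disabled;	/* state has been switched off */
};

/*
 * Return the state index to use after the utilization check.
 * The candidate is lowered only from state 1 to state 0, and only
 * when state 0 is a real (non-polling, enabled) idle state.
 */
static int adjust_for_utilization(const struct idle_state_model *states,
				  int idx, bool utilized)
{
	assert(idx >= 0);

	if (utilized && idx == 1 && !states[0].polling && !states[0].disabled)
		return 0;

	return idx;
}
```

With a two-state layout where state 0 is a genuine idle state, a utilized CPU drops from state 1 to state 0. With a layout whose state 0 is a polling state, the candidate is left alone, which is the "Intel case" mentioned above.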
Hi, On Thu, Oct 20, 2022 at 6:21 PM Kajetan Puchalski <kajetan.puchalski@arm.com> wrote: > > Hi Rafael, > > > The avg_util value tells us nothing about how much the CPU is going to > > be idle this time and it also tells us nothing about the > > latency-sensitivity of the workload. > > > > Yes, it tells us how much idle time there was on the given CPU in the > > past, on the average, but there is zero information about the > > distribution of that idle time in it. > > > > So in the first place please tell me why it fundamentally makes sense > > to use avg_util in CPU idle time management at all. > > I have an alternative suggestion that could be a reasonable way forward > here. Instead of applying util-awareness on top of TEO where it would > have to be reconciled with how TEO is currently expected to work, I just > wrote a simple completely new governor which operates only on timer > events alongside util values. > > The idea is this: > 1. Find the deepest state based on residency and time until the next timer event > 2. If sched_cpu_util() is above the threshold, select a shallower non-polling state > > There's no other metrics or anything else under the current > implementation. I can't say how it would work on Intel systems and in > the presence of more idle states but having a completely separate > governor would be very useful for us to tune it specifically for our use > cases and types of systems (ie. ones with 2 idle states and no polling > states). So this is not a totally bad idea IMV and the simplicity of this new governor is certainly attractive. However, it is likely to underperform in the cases when the interrupt activity is not directly related to the average CPU load, like when CPUs do much work, but they are not interrupted very often. > As it stands it performs quite well and achieves better results > (especially in terms of latency) than both menu & TEO but slightly worse > than the previously suggested TEO + util. 
Well, precisely, because teo takes other factors into account too. > As far as we're concerned > that's okay, we can work from there to try to find a way of doing > metrics or improving the algorithm that would be more tailored to using > the util approach. I think it's much cleaner than what we were > discussing previously since that was effectively overriding most of what > TEO was doing. > > Here are some numbers to visualize the results. They were all obtained > in the same way as the ones in the cover letter so you can refer to that > in case something isn't clear. > > 'teo_util' is of course TEO + util as in the patchset. > 'idleutil' is this entirely new proposed minimal governor. > > 1. Geekbench 5 (latency-sensitive, heavy load test) > > +-----------------+----------+---------+-------------+ > | metric | kernel | value | perc_diff | > |-----------------+----------+---------+-------------| > | multicore_score | menu | 2832.3 | 0.0% | > | multicore_score | teo | 2815.3 | -0.6% | > | multicore_score | teo_util | 2880.6 | 1.7% | > | multicore_score | idleutil | 2859.3 | 0.95% | > +-----------------+----------+---------+-------------+ > > Percentages & types of idle misses > > +-----------+-------------+--------------+ > | kernel | type | percentage | > |-----------+-------------+--------------| > | menu | too deep | 15.613% | > | teo | too deep | 9.376% | > | teo_util | too deep | 4.581% | > | idleutil | too deep | 5.464% | > | menu | too shallow | 2.611% | > | teo | too shallow | 6.099% | > | teo_util | too shallow | 14.141% | > | idleutil | too shallow | 13.282% | > +-----------+-------------+--------------+ > > Power usage [mW] > > +--------------+----------+----------+---------+-------------+ > | chan_name | metric | kernel | value | perc_diff | > |--------------+----------+----------+---------+-------------| > | total_power | gmean | menu | 2705.9 | 0.0% | > | total_power | gmean | teo | 2668.2 | -1.39% | > | total_power | gmean | teo_util | 2710.2 | 
0.16% |
> | total_power  | gmean    | idleutil | 2657.9  | -1.78%      |
> +--------------+----------+----------+---------+-------------+
>
> Wakeup latency
>
> +-----------------+----------+----------+-------------+-------------+
> | comm            | metric   | kernel   | value       | perc_diff   |
> |-----------------+----------+----------+-------------+-------------|
> | AsyncTask #1    | gmean    | menu     | 66.85μs     | 0.0%        |
> | AsyncTask #1    | gmean    | teo      | 66.79μs     | -0.09%      |
> | AsyncTask #1    | gmean    | teo_util | 57.84μs     | -13.47%     |
> | AsyncTask #1    | gmean    | idleutil | 62.61μs     | -6.35%      |
> | labs.geekbench5 | gmean    | menu     | 80.62μs     | 0.0%        |
> | labs.geekbench5 | gmean    | teo      | 94.75μs     | 17.52%      |
> | labs.geekbench5 | gmean    | teo_util | 52.98μs     | -34.28%     |
> | labs.geekbench5 | gmean    | idleutil | 68.58μs     | -14.93%     |
> +-----------------+----------+----------+-------------+-------------+
>
> 2. PCMark Web Browsing (non latency-sensitive, normal usage test)
>
> +----------------+----------+---------+-------------+
> | metric         | kernel   | value   | perc_diff   |
> |----------------+----------+---------+-------------|
> | PcmaWebV2Score | menu     | 5232    | 0.0%        |
> | PcmaWebV2Score | teo      | 5219.8  | -0.23%      |
> | PcmaWebV2Score | teo_util | 5249.7  | 0.34%       |
> | PcmaWebV2Score | idleutil | 5215.7  | -0.31%      |
> +----------------+----------+---------+-------------+
>
> Percentages & types of idle misses
>
> +-----------+-------------+--------------+
> | kernel    | type        | percentage   |
> |-----------+-------------+--------------|
> | menu      | too deep    | 24.814%      |
> | teo       | too deep    | 11.65%       |
> | teo_util  | too deep    | 3.753%       |
> | idleutil  | too deep    | 4.304%       |
> | menu      | too shallow | 3.101%       |
> | teo       | too shallow | 8.578%       |
> | teo_util  | too shallow | 18.309%      |
> | idleutil  | too shallow | 17.638%      |
> +-----------+-------------+--------------+
>
> Power usage [mW]
>
> +--------------+----------+----------+---------+-------------+
> | chan_name    | metric   | kernel   | value   | perc_diff   |
> |--------------+----------+----------+---------+-------------|
> | total_power  | gmean    | menu     | 179.2   | 0.0%        |
> | total_power  | gmean    | teo      | 184.8   | 3.1%        |
> | total_power  | gmean    | teo_util | 180.5   | 0.71%       |
> | total_power  | gmean    | idleutil | 185     | 3.24%       |
> +--------------+----------+----------+---------+-------------+
>
> Wakeup latency
>
> +-----------------+----------+----------+-------------+-------------+
> | comm            | metric   | kernel   | value       | perc_diff   |
> |-----------------+----------+----------+-------------+-------------|
> | CrRendererMain  | gmean    | menu     | 236.63μs    | 0.0%        |
> | CrRendererMain  | gmean    | teo      | 201.85μs    | -14.7%      |
> | CrRendererMain  | gmean    | teo_util | 111.76μs    | -52.77%     |
> | CrRendererMain  | gmean    | idleutil | 105.55μs    | -55.39%     |
> | chmark:workload | gmean    | menu     | 100.30μs    | 0.0%        |
> | chmark:workload | gmean    | teo      | 80.20μs     | -20.04%     |
> | chmark:workload | gmean    | teo_util | 53.81μs     | -46.35%     |
> | chmark:workload | gmean    | idleutil | 71.29μs     | -28.92%     |
> | RenderThread    | gmean    | menu     | 37.97μs     | 0.0%        |
> | RenderThread    | gmean    | teo      | 31.69μs     | -16.54%     |
> | RenderThread    | gmean    | teo_util | 34.32μs     | -9.63%      |
> | RenderThread    | gmean    | idleutil | 35.78μs     | -5.77%      |
> | surfaceflinger  | gmean    | menu     | 97.57μs     | 0.0%        |
> | surfaceflinger  | gmean    | teo      | 98.86μs     | 1.31%       |
> | surfaceflinger  | gmean    | teo_util | 72.59μs     | -25.6%      |
> | surfaceflinger  | gmean    | idleutil | 56.23μs     | -42.37%     |
> +-----------------+----------+----------+-------------+-------------+
>
> I also have similar data for Jankbench & Speedometer with right about
> the same results, I'll skip those for now for brevity.
> Would you like me to send a patch with this new governor instead? What
> would you think about this instead of the previously suggested approach?

I would still kind of prefer to improve teo so it covers the known use
cases better, especially that modified teo is likely to give you better
results than the new simplistic one.

Please see my other reply for the possible direction of improvement.

Thanks!
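For readers checking the perc_diff columns in the quoted tables: the wakeup latency figures are consistent with a relative change against the menu row of the same group. Below is a minimal sketch of that arithmetic in C, assuming that definition. The function name perc_diff is hypothetical and is not taken from the benchmark tooling used in the thread.

```c
/*
 * Relative change of `value` against `baseline`, in percent.
 * Assumption: the tables use the menu governor row as the baseline.
 */
static double perc_diff(double baseline, double value)
{
	return (value - baseline) / baseline * 100.0;
}
```

With the CrRendererMain latencies from the table, perc_diff(236.63, 201.85) comes out at about -14.70 and perc_diff(236.63, 111.76) at about -52.77, which agrees with the teo and teo_util rows.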
On Thu, Oct 20, 2022 at 9:52 PM Daniel Lezcano <daniel.lezcano@linaro.org> wrote:
>
>
> Hi Kajetan,
>
> On 20/10/2022 18:20, Kajetan Puchalski wrote:
> > Hi Rafael,
> >
> >> The avg_util value tells us nothing about how much the CPU is going to
> >> be idle this time and it also tells us nothing about the
> >> latency-sensitivity of the workload.
> >>
> >> Yes, it tells us how much idle time there was on the given CPU in the
> >> past, on the average, but there is zero information about the
> >> distribution of that idle time in it.
> >>
> >> So in the first place please tell me why it fundamentally makes sense
> >> to use avg_util in CPU idle time management at all.
> >
> > I have an alternative suggestion that could be a reasonable way forward
> > here. Instead of applying util-awareness on top of TEO where it would
> > have to be reconciled with how TEO is currently expected to work, I just
> > wrote a simple completely new governor which operates only on timer
> > events alongside util values.
>
> I second the idea. I took a long time to investigate how to improve the
> governor and reached the conclusion having a dedicated governor for
> mobile platform makes sense.

Please define "mobile".

> Also the behavior is very platform dependent.

I'm not sure what you mean.
On Thu, Oct 27, 2022 at 9:56 PM Lukasz Luba <lukasz.luba@arm.com> wrote:
>
> Hi Doug,
>
> Thank you for your effort in testing these patches and different
> governors. We really appreciate that, since this helped us to
> better understand the platform that you are using. It is different
> to what we have and our workloads. That's why I have some comments.
>
> It would be hard to combine these two worlds and requirements.
> I have some concerns to the tests, the setup and the platform.
> I can see a reason why this patch has to prove the
> strengths on this platform and environment.
> Please see my comments below.
>
> On 10/13/22 23:12, Doug Smythies wrote:
> > Hi All,
> >
> > On Thu, Oct 13, 2022 at 4:12 AM Kajetan Puchalski
> > <kajetan.puchalski@arm.com> wrote:
> >> On Wed, Oct 12, 2022 at 08:50:39PM +0200, Rafael J. Wysocki wrote:
> >>> On Mon, Oct 3, 2022 at 4:50 PM Kajetan Puchalski
> >>> <kajetan.puchalski@arm.com> wrote:
> > ...
> >
> >> On the Intel & power usage angle you might have seen in the discussion,
> >> Doug sent me some interesting data privately. As far as I can tell the
> >> main issue there is that C0 on Intel doesn't actually do power saving so
> >> moving the state selection down to it is a pretty bad idea because C1
> >> could be very close in terms of latency and save much more power.
> >>
> >> A potential solution could be altering the v2 to only decrease the state
> >> selection by 1 if it's above 1, ie 2->1 but not 1->0. It's fine for us
> >> because arm systems with 2 states use the early exit path anyway. It'd
> >> just amount to changing this hunk:
> >>
> >> +	if (cpu_data->utilized && idx > 0 && !dev->states_usage[idx-1].disable)
> >> +		idx--;
> >>
> >> to:
> >>
> >> +	if (cpu_data->utilized && idx > 1 && !dev->states_usage[idx-1].disable)
> >> +		idx--;
> >>
> >> What would you think about that? Should make it much less intense for
> >> Intel systems.
> >
> > I tested the above, which you sent me as patch version v2-2.
> >
> > By default, my Intel i5-10600K has 4 idle states:
> >
> > $ grep . /sys/devices/system/cpu/cpu7/cpuidle/state*/name
> > /sys/devices/system/cpu/cpu7/cpuidle/state0/name:POLL
>
> This active polling state type worries me a bit. We don't have
> such on our platforms. Our shallowest idle state is really different.
> We don't have active polling and there is no need for such.

So as I said in a reply to Kajetan, the way to go is to avoid them
when you do this utilization-based optimization.

CPUIDLE_FLAG_POLLING is for that and it is used already in the code.

Moreover, as I said in the other message, IMO the utilization-based
optimization makes the most sense when the current candidate state is
state 1, so it may not make sense to do it on Intel systems at all.
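The condition suggested in the reply above (demote only when the candidate is state 1 and state 0 is not a polling state) can be sketched as a small standalone C function. This is an illustration only, not code from the patch series: the struct, the SKETCH_FLAG_POLLING macro and the function name are hypothetical stand-ins for the kernel's cpuidle state data and CPUIDLE_FLAG_POLLING, so that the snippet compiles outside the kernel tree.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's CPUIDLE_FLAG_POLLING bit. */
#define SKETCH_FLAG_POLLING (1u << 0)

/* Hypothetical, reduced view of one idle state. */
struct sketch_state {
	unsigned int flags;	/* state properties, e.g. SKETCH_FLAG_POLLING */
	bool disabled;		/* state has been disabled */
};

/*
 * Given the candidate index picked by the governor, return the index to use.
 * The candidate is demoted only from state 1 to state 0, and only if
 * state 0 is enabled and is not a polling state.
 */
static int sketch_adjust_for_util(const struct sketch_state *states,
				  int idx, bool cpu_utilized)
{
	if (cpu_utilized && idx == 1 &&
	    !(states[0].flags & SKETCH_FLAG_POLLING) &&
	    !states[0].disabled)
		return 0;

	return idx;
}
```

On a platform whose state 0 is POLL, as in the sysfs listing quoted above, this function never changes the candidate, which is the "it may not make sense to do it on Intel systems at all" outcome. On a two-state platform without polling, a utilized CPU with candidate state 1 is moved to state 0.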
Hi,

On Thu, Oct 27, 2022 at 10:04 PM Lukasz Luba <lukasz.luba@arm.com> wrote:
>
> Hi Rafael,
>
> On 10/13/22 12:12, Kajetan Puchalski wrote:
> > On Wed, Oct 12, 2022 at 08:50:39PM +0200, Rafael J. Wysocki wrote:
> >> On Mon, Oct 3, 2022 at 4:50 PM Kajetan Puchalski
> >> <kajetan.puchalski@arm.com> wrote:
> >>>
> >>> Hi,
> >>>
> >>> At the moment, all the available idle governors operate mainly based on their own past performance
> >>
> >> Not true, at least for the menu and teo governors that use the
> >> information on the distribution of CPU wakeups that is available to
> >> them and try to predict the next idle duration with the help of it.
> >> This has a little to do with their performance.
> >
> > You're right of course, I should have written "their own past
> > correctness" as that's what I was referring to. I just meant that for
> > instance with TEO the initial timer-based choice is only adjusted using
> > the governor's own metrics and not any information from anywhere else in
> > the system.
> >
> > [snip]
>
> Would it be possible to consider a new small and simple idle governor
> which is better suited for those other workloads and platforms?
> Kajetan has such one and can send to the LKML, so you could have a look.
>
> I have sent some detailed explanation about this to Doug in this
> thread (don't want to duplicate it).
>
> It looks like it would be hard to meet both worlds' requirements.

It may or may not be the case. Let's first explore the original idea
of improving "teo" a bit more.

As I said in another message in this thread, there are clear cases in
which the new governor is likely to underperform, because it only
takes 2 sources of information into account (the next timer event and
CPU utilization).

If some more information is to be taken into account without adding
too much overhead, "teo" is the simplest thing that I could imagine
(but perhaps my imagination is limited).
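To make the "2 sources of information" point concrete, here is a rough C sketch of what a governor driven only by the next timer event and CPU utilization could look like. It is not the governor Kajetan wrote, which is not shown in this thread. Every name, the threshold parameter and the selection rule are assumptions made for illustration.

```c
#include <stdint.h>

/* Hypothetical, reduced view of one idle state. */
struct sketch_idle_state {
	uint64_t target_residency_ns;	/* minimum sleep length for the state to pay off */
};

/*
 * Pick the deepest state whose target residency fits before the next timer,
 * then fall back to the shallowest state if the CPU looks busy.
 * States are assumed to be ordered from shallowest (index 0) to deepest.
 */
static int sketch_select(const struct sketch_idle_state *states, int nr_states,
			 uint64_t ns_to_next_timer,
			 unsigned long util, unsigned long util_threshold)
{
	int idx = 0;

	for (int i = 1; i < nr_states; i++) {
		if (states[i].target_residency_ns > ns_to_next_timer)
			break;
		idx = i;
	}

	/* Utilization is the only other input: a busy CPU stays shallow. */
	if (util >= util_threshold)
		idx = 0;

	return idx;
}
```

Nothing in this sketch accounts for wakeups that are not timers, for example interrupts arriving well before the next timer on a lightly utilized CPU. That is the kind of case where the reply above expects such a governor to underperform compared with teo, which tracks the distribution of past wakeups.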
On Fri, Oct 28, 2022 at 03:12:43PM +0200, Rafael J. Wysocki wrote:

> > The result being that this util-aware TEO variant while using much less
> > C1 and decreasing the percentage of too deep sleeps from ~24% to ~3% in
> > PCMark Web Browsing also uses almost 2% less power. Clearly the power is
> > being wasted on not hitting C1 residency over and over.
>
> Hmm. The PCMark Web Browsing table in your cover letter doesn't indicate that.
>
> The "gmean power usage" there for "teo + util-aware" is 205, whereas
> for "teo" alone it is 187.8. This is still arguably balanced by the
> latency difference (~100 us vs ~185 us, respectively), but this looks
> like trading energy for performance.

In this case yes, I meant 2% less compared to menu but you're right of
course.

[...]

> Definitely it should not be changed if the previous state is a polling
> one which can be checked right away. That would take care of the
> "Intel case" automatically.

Makes sense, I already used the polling flag to implement this in this other
governor I mentioned.

> > > Should make it much less intense for Intel systems.
>
> So I think that this adjustment only makes sense if the current
> candidate state is state 1 and state 0 is not polling. In the other
> cases the cost of missing an opportunity to save energy would be too
> high for the observed performance gain.

Interesting, but only applying it to C1 and only when C0 isn't polling would
make it effectively not do anything on Intel systems, right?
From what I've seen on Doug's plots even C1 is hardly ever used on his platform, most
sleeps end up in the deepest possible state.

Checking for the polling flag is a good idea regardless so I can send a
v3 with that. If you'd like me to also restrict the entire mechanism to
only working on C1 as you suggested then I'm okay with including that in
the v3 as well. What do you think?

Thanks a lot for all your time & input,
Kajetan
On Fri, Oct 28, 2022 at 5:01 PM Kajetan Puchalski <kajetan.puchalski@arm.com> wrote:
>
> On Fri, Oct 28, 2022 at 03:12:43PM +0200, Rafael J. Wysocki wrote:
>
> > > The result being that this util-aware TEO variant while using much less
> > > C1 and decreasing the percentage of too deep sleeps from ~24% to ~3% in
> > > PCMark Web Browsing also uses almost 2% less power. Clearly the power is
> > > being wasted on not hitting C1 residency over and over.
> >
> > Hmm. The PCMark Web Browsing table in your cover letter doesn't indicate that.
> >
> > The "gmean power usage" there for "teo + util-aware" is 205, whereas
> > for "teo" alone it is 187.8. This is still arguably balanced by the
> > latency difference (~100 us vs ~185 us, respectively), but this looks
> > like trading energy for performance.
>
> In this case yes, I meant 2% less compared to menu but you're right of
> course.
>
> [...]
>
> > Definitely it should not be changed if the previous state is a polling
> > one which can be checked right away. That would take care of the
> > "Intel case" automatically.
>
> Makes sense, I already used the polling flag to implement this in this other
> governor I mentioned.
>
> > > > Should make it much less intense for Intel systems.
> >
> > So I think that this adjustment only makes sense if the current
> > candidate state is state 1 and state 0 is not polling. In the other
> > cases the cost of missing an opportunity to save energy would be too
> > high for the observed performance gain.
>
> Interesting, but only applying it to C1 and only when C0 isn't polling would
> make it effectively not do anything on Intel systems, right?

Indeed.

> From what I've seen on Doug's plots even C1 is hardly ever used on his platform, most
> sleeps end up in the deepest possible state.

That depends a lot on the workload. There are workloads in which C1
is mostly used and the deeper idle states aren't.

> Checking for the polling flag is a good idea regardless so I can send a
> v3 with that. If you'd like me to also restrict the entire mechanism to
> only working on C1 as you suggested then I'm okay with including that in
> the v3 as well. What do you think?

It would be good to do that and see if there are any significant
differences in the results.

> Thanks a lot for all your time & input,

No problem at all.
On Fri, Oct 28, 2022 at 5:04 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
>
> On Fri, Oct 28, 2022 at 5:01 PM Kajetan Puchalski
> <kajetan.puchalski@arm.com> wrote:
> >
> > On Fri, Oct 28, 2022 at 03:12:43PM +0200, Rafael J. Wysocki wrote:
> >
> > > > The result being that this util-aware TEO variant while using much less
> > > > C1 and decreasing the percentage of too deep sleeps from ~24% to ~3% in
> > > > PCMark Web Browsing also uses almost 2% less power. Clearly the power is
> > > > being wasted on not hitting C1 residency over and over.
> > >
> > > Hmm. The PCMark Web Browsing table in your cover letter doesn't indicate that.
> > >
> > > The "gmean power usage" there for "teo + util-aware" is 205, whereas
> > > for "teo" alone it is 187.8. This is still arguably balanced by the
> > > latency difference (~100 us vs ~185 us, respectively), but this looks
> > > like trading energy for performance.
> >
> > In this case yes, I meant 2% less compared to menu but you're right of
> > course.
> >
> > [...]
> >
> > > Definitely it should not be changed if the previous state is a polling
> > > one which can be checked right away. That would take care of the
> > > "Intel case" automatically.
> >
> > Makes sense, I already used the polling flag to implement this in this other
> > governor I mentioned.
> >
> > > > > Should make it much less intense for Intel systems.
> > >
> > > So I think that this adjustment only makes sense if the current
> > > candidate state is state 1 and state 0 is not polling. In the other
> > > cases the cost of missing an opportunity to save energy would be too
> > > high for the observed performance gain.
> >
> > Interesting, but only applying it to C1 and only when C0 isn't polling would
> > make it effectively not do anything on Intel systems, right?
>
> Indeed.
>
> > From what I've seen on Doug's plots even C1 is hardly ever used on his platform, most
> > sleeps end up in the deepest possible state.
>
> That depends a lot on the workload. There are workloads in which C1
> is mostly used and the deeper idle states aren't.
>
> > Checking for the polling flag is a good idea regardless so I can send a
> > v3 with that. If you'd like me to also restrict the entire mechanism to
> > only working on C1 as you suggested then I'm okay with including that in
> > the v3 as well. What do you think?
>
> It would be good to do that and see if there are any significant
> differences in the results.

BTW, you may as well drop the extra #ifdeffery from the v3, I don't
think that it is particularly useful.