Message ID: 20221102152808.2978590-1-kajetan.puchalski@arm.com (mailing list archive)
Series: cpuidle: teo: Introduce util-awareness
Hi Rafael,

On Wed, Nov 02, 2022 at 03:28:06PM +0000, Kajetan Puchalski wrote:

[...]

> v3 -> v4:
> - remove the chunk of code skipping metrics updates when the CPU was utilized
> - include new test results and more benchmarks in the cover letter

[...]

It's been some time so I just wanted to bump this, what do you think
about this v4? Doug has already tested it; results for his machine are
attached to the v3 thread.

Thanks,
Kajetan
On Mon, Nov 21, 2022 at 1:23 PM Kajetan Puchalski <kajetan.puchalski@arm.com> wrote:
>
> Hi Rafael,
>
> On Wed, Nov 02, 2022 at 03:28:06PM +0000, Kajetan Puchalski wrote:
>
> [...]
>
> > v3 -> v4:
> > - remove the chunk of code skipping metrics updates when the CPU was utilized
> > - include new test results and more benchmarks in the cover letter
>
> [...]
>
> It's been some time so I just wanted to bump this, what do you think
> about this v4? Doug has already tested it; results for his machine are
> attached to the v3 thread.

I have some comments, but they keep being pushed down by more urgent things, sorry.

First off, I think that the information from your cover letter should go into
the patch changelog (at least the majority of it), as it's relevant for the
motivation part.

Also I think that this optimization is really trading energy for performance
and that should be emphasized. IOW, it is not about improving the prediction
accuracy (which is what the cover letter and changelog seem to be claiming),
but about reducing the expected CPU wakeup latency in some cases.

I'll send more comments later today if I have the time, or later this week otherwise.
On 2022.11.21 04:23 Kajetan Puchalski wrote:
> Hi Rafael,
>
> On Wed, Nov 02, 2022 at 03:28:06PM +0000, Kajetan Puchalski wrote:
>
> [...]
>
>> v3 -> v4:
>> - remove the chunk of code skipping metrics updates when the CPU was utilized
>> - include new test results and more benchmarks in the cover letter
>
> [...]
>
> It's been some time so I just wanted to bump this, what do you think
> about this v4? Doug has already tested it; results for his machine are
> attached to the v3 thread.

Hi All,

I continued to test this and included the proposed ladder idle governor
in my continued testing. (Which is why I added Rui as an addressee.)

However, I ran out of time. Here is what I have:

Kernel: 6.1-rc3 and with patch sets
Processor: Intel(R) Core(TM) i5-10600K CPU @ 4.10GHz
CPU scaling driver: intel_cpufreq
HWP disabled.
Unless otherwise stated, performance CPU scaling governor.

Legend:
teo: the current teo idle governor
util-v4: the RFC utilization teo patch set version 4.
menu: the menu idle governor
ladder-old: the current ladder idle governor
ladder: the RFC ladder patchset.

Workflow: shell-intensive serialized workloads.
Variable: PIDs per second.
Note: Single threaded.
Master reference: forced CPU affinity to 1 CPU.
Performance Results:
http://smythies.com/~doug/linux/idle/teo-util/graphs/pids-perf.png
Schedutil Results:
http://smythies.com/~doug/linux/idle/teo-util/graphs/pids-su.png

Workflow: sleeping ebizzy 128 threads.
Variable: interval (uSecs).
Performance Results:
http://smythies.com/~doug/linux/idle/teo-util/graphs/ebizzy-128-perf.png
Performance power and idle data:
http://smythies.com/~doug/linux/idle/teo-util/ebizzy/perf/
Schedutil Results:
http://smythies.com/~doug/linux/idle/teo-util/graphs/pids-su.png
Schedutil power and idle data:
http://smythies.com/~doug/linux/idle/teo-util/ebizzy/su/

Workflow: 6 core ping-pong.
Variable: amount of work per token transfer.
Forced CPU affinity, 16.67% load per core (6 CPUs idle, 6 busy).
Overview:
http://smythies.com/~doug/linux/idle/teo-util/graphs/6-core-ping-pong-sweep.png
Short loop times detail:
http://smythies.com/~doug/linux/idle/teo-util/graphs/6-core-ping-pong-sweep-detail-a.png
Power and idle data:
http://smythies.com/~doug/linux/idle/teo-util/ping-sweep/6-4/
The transition between 35 and 40 minutes will be the subject of some future investigation.

Workflow: periodic 73, 113, 211, 347, 401 work/sleep frequency.
Summary: Nothing interesting.
Variable: work packet (load), ramps up and then down.
Single threaded.
Power and idle data:
http://smythies.com/~doug/linux/idle/teo-util/consume/idle-3/
Higher resolution power data:
http://smythies.com/~doug/linux/idle/teo-util/consume/ps73/
http://smythies.com/~doug/linux/idle/teo-util/consume/ps113/
http://smythies.com/~doug/linux/idle/teo-util/consume/ps211/
http://smythies.com/~doug/linux/idle/teo-util/consume/ps347/
http://smythies.com/~doug/linux/idle/teo-util/consume/ps401/

Workflow: fast speed 2 pair, 4 threads ping-pong.
Variable: none, this is a dwell test.
Results:
http://smythies.com/~doug/linux/idle/teo-util/many-0-400000000-2/times.txt
Performance power and idle data:
http://smythies.com/~doug/linux/idle/teo-util/many-0-400000000-2/perf/
Schedutil power and idle data:
http://smythies.com/~doug/linux/idle/teo-util/many-0-400000000-2/su/

Workflow: medium speed 2 pair, 4 threads ping-pong.
Variable: none, this is a dwell test.
Results:
http://smythies.com/~doug/linux/idle/teo-util/many-3000-100000000-2/times.txt
Performance power and idle data:
http://smythies.com/~doug/linux/idle/teo-util/many-3000-100000000-2/perf/
Schedutil power and idle data:
http://smythies.com/~doug/linux/idle/teo-util/many-3000-100000000-2/su/

Workflow: slow speed 2 pair, 4 threads ping-pong.
Variable: none, this is a dwell test.
Results:
http://smythies.com/~doug/linux/idle/teo-util/many-1000000-342000-2/times.txt
Performance power and idle data:
http://smythies.com/~doug/linux/idle/teo-util/many-1000000-342000-2/perf/
Schedutil power and idle data:
http://smythies.com/~doug/linux/idle/teo-util/many-1000000-342000-2/su/

Results summary (results are uSeconds per loop; less is better):

Slow ping pong - 2 pairs, 4 threads.
Performance:
ladder-old: Average: 2583 (-0.56%)
ladder:     Average: 2617 (+0.81%)
menu:       Average: 2596 Reference Time.
teo:        Average: 2689 (+3.6%)
util-v4:    Average: 2665 (+2.7%)
Schedutil:
ladder-old: Average: 4490 (+44%)
ladder:     Average: 3296 (+5.9%)
menu:       Average: 3113 Reference Time.
teo:        Average: 4005 (+29%)
util-v4:    Average: 3527 (+13%)

Medium ping pong - 2 pairs, 4 threads.
Performance:
ladder-old: Average: 11.8214 (+4.6%)
ladder:     Average: 11.7730 (+4.2%)
menu:       Average: 11.2971 Reference Time.
teo:        Average: 11.355 (+5.1%)
util-v4:    Average: 11.3364 (+3.4%)
Schedutil:
ladder-old: Average: 15.6813 (+30%)
ladder:     Average: 15.4338 (+28%)
menu:       Average: 12.0868 Reference Time.
teo:        Average: 11.7367 (-2.9%)
util-v4:    Average: 11.6352 (-3.7%)

Fast ping pong - 2 pairs, 4 threads.
Performance:
ladder-old: Average: 4.009 (+39%)
ladder:     Average: 3.844 (+33%)
menu:       Average: 2.891 Reference Time.
teo:        Average: 3.053 (+5.6%)
util-v4:    Average: 2.985 (+3.2%)
Schedutil:
ladder-old: Average: 5.053 (+64%)
ladder:     Average: 5.278 (+71%)
menu:       Average: 3.078 Reference Time.
teo:        Average: 3.106 (+0.91%)
util-v4:    Average: 3.15 (+2.35%)

... Doug
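For readers who have not seen this style of test before, the ping-pong dwell tests above are token-passing loops between pinned processes. The sketch below is only an illustration of that general idea, not Doug's actual test program; the single-pair layout, transfer count, and busy-work size are made-up placeholders.

```python
# Hypothetical sketch of one ping-pong pair: two processes bounce a token
# over a pair of pipes, spinning on a small work loop per transfer.
import os
import time

TRANSFERS = 100000   # token transfers per run (placeholder)
WORK_LOOPS = 3000    # busy-work iterations per transfer (placeholder)

def spin(n):
    # Simple busy loop standing in for the "work packet" per transfer.
    x = 0
    for i in range(n):
        x += i
    return x

def player(rx, tx, serve):
    # One side of the pair: wait for the token, do some work, send it back.
    if serve:
        os.write(tx, b"x")
    for _ in range(TRANSFERS):
        os.read(rx, 1)
        spin(WORK_LOOPS)
        os.write(tx, b"x")

def main():
    a_r, a_w = os.pipe()
    b_r, b_w = os.pipe()
    start = time.monotonic()
    if os.fork() == 0:
        player(a_r, b_w, serve=False)
        os._exit(0)
    player(b_r, a_w, serve=True)
    os.wait()
    elapsed = time.monotonic() - start
    print(f"{elapsed / TRANSFERS * 1e6:.3f} uSec per loop")

if __name__ == "__main__":
    main()
```

A real run would additionally pin each process to a CPU (for example with os.sched_setaffinity) and repeat the measurement under each idle governor being compared.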
On Wed, 2022-11-23 at 20:08 -0800, Doug Smythies wrote:
> On 2022.11.21 04:23 Kajetan Puchalski wrote:
> > Hi Rafael,
> >
> > On Wed, Nov 02, 2022 at 03:28:06PM +0000, Kajetan Puchalski wrote:
> >
> > [...]
> >
> > > v3 -> v4:
> > > - remove the chunk of code skipping metrics updates when the CPU was utilized
> > > - include new test results and more benchmarks in the cover letter
> >
> > [...]
> >
> > It's been some time so I just wanted to bump this, what do you think
> > about this v4? Doug has already tested it; results for his machine are
> > attached to the v3 thread.
>
> Hi All,
>
> I continued to test this and included the proposed ladder idle governor
> in my continued testing. (Which is why I added Rui as an addressee.)

Hi, Doug,

Really appreciated your testing data on this.
I have some dumb questions and I need your help so that I can better
understand some of the graphs. :)

> However, I ran out of time. Here is what I have:
>
> Kernel: 6.1-rc3 and with patch sets
> Processor: Intel(R) Core(TM) i5-10600K CPU @ 4.10GHz
> CPU scaling driver: intel_cpufreq
> HWP disabled.
> Unless otherwise stated, performance CPU scaling governor.
>
> Legend:
> teo: the current teo idle governor
> util-v4: the RFC utilization teo patch set version 4.
> menu: the menu idle governor
> ladder-old: the current ladder idle governor
> ladder: the RFC ladder patchset.
>
> Workflow: shell-intensive serialized workloads.
> Variable: PIDs per second.
> Note: Single threaded.
> Master reference: forced CPU affinity to 1 CPU.
> Performance Results:
> http://smythies.com/~doug/linux/idle/teo-util/graphs/pids-perf.png
> Schedutil Results:
> http://smythies.com/~doug/linux/idle/teo-util/graphs/pids-su.png

what does 1cpu mean?

> Workflow: sleeping ebizzy 128 threads.
> Variable: interval (uSecs).
> Performance Results:
> http://smythies.com/~doug/linux/idle/teo-util/graphs/ebizzy-128-perf.png
> Performance power and idle data:
> http://smythies.com/~doug/linux/idle/teo-util/ebizzy/perf/

for the "Idle state 0/1/2/3 was too deep" graphs, may I know how you
assert that an idle state is too deep/shallow?

thanks,
rui
> > Workflow: sleeping ebizzy 128 threads.
> > Variable: interval (uSecs).
> > Performance Results:
> > http://smythies.com/~doug/linux/idle/teo-util/graphs/ebizzy-128-perf.png
> > Performance power and idle data:
> > http://smythies.com/~doug/linux/idle/teo-util/ebizzy/perf/
>
> for the "Idle state 0/1/2/3 was too deep" graphs, may I know how you
> assert that an idle state is too deep/shallow?

is this obtained from the cpu_idle_miss trace event?

thanks,
rui
On 2022.11.26 08:26 Rui wrote:
> On Wed, 2022-11-23 at 20:08 -0800, Doug Smythies wrote:
>> On 2022.11.21 04:23 Kajetan Puchalski wrote:
>>> On Wed, Nov 02, 2022 at 03:28:06PM +0000, Kajetan Puchalski wrote:
>>>
>>> [...]
>>>
>>>> v3 -> v4:
>>>> - remove the chunk of code skipping metrics updates when the CPU was utilized
>>>> - include new test results and more benchmarks in the cover letter
>>>
>>> [...]
>>>
>>> It's been some time so I just wanted to bump this, what do you think
>>> about this v4? Doug has already tested it; results for his machine are
>>> attached to the v3 thread.
>>
>> Hi All,
>>
>> I continued to test this and included the proposed ladder idle governor
>> in my continued testing. (Which is why I added Rui as an addressee.)
>
> Hi, Doug,

Hi Rui,

> Really appreciated your testing data on this.
> I have some dumb questions and I need your help so that I can better
> understand some of the graphs. :)
>
>> However, I ran out of time. Here is what I have:
>>
>> Kernel: 6.1-rc3 and with patch sets
>> Processor: Intel(R) Core(TM) i5-10600K CPU @ 4.10GHz
>> CPU scaling driver: intel_cpufreq
>> HWP disabled.
>> Unless otherwise stated, performance CPU scaling governor.
>>
>> Legend:
>> teo: the current teo idle governor
>> util-v4: the RFC utilization teo patch set version 4.
>> menu: the menu idle governor
>> ladder-old: the current ladder idle governor
>> ladder: the RFC ladder patchset.
>>
>> Workflow: shell-intensive serialized workloads.
>> Variable: PIDs per second.
>> Note: Single threaded.
>> Master reference: forced CPU affinity to 1 CPU.

This is the 1cpu on the graph.

>> Performance Results:
>> http://smythies.com/~doug/linux/idle/teo-util/graphs/pids-perf.png
>> Schedutil Results:
>> http://smythies.com/~doug/linux/idle/teo-util/graphs/pids-su.png
>
> what does 1cpu mean?

For shell-intensive serialized workflow, or:

Dountil the list of tasks is finished:
  Start the next task in the list of stuff to do (with a new PID).
  Wait for it to finish
Enduntil

We know it represents a challenge for CPU frequency scaling drivers,
schedulers, and therefore idle drivers.

We also know that the best performance is achieved by overriding
the scheduler and forcing CPU affinity. I use this "best" case as the
master reference, using the label 1cpu on the graph.

>> Workflow: sleeping ebizzy 128 threads.
>> Variable: interval (uSecs).
>> Performance Results:
>> http://smythies.com/~doug/linux/idle/teo-util/graphs/ebizzy-128-perf.png
>> Performance power and idle data:
>> http://smythies.com/~doug/linux/idle/teo-util/ebizzy/perf/
>
> for the "Idle state 0/1/2/3 was too deep" graphs, may I know how you
> assert that an idle state is too deep/shallow?

I get those stats directly from the kernel driver statistics. For example:

$ grep . /sys/devices/system/cpu/cpu4/cpuidle/state*/above
/sys/devices/system/cpu/cpu4/cpuidle/state0/above:0
/sys/devices/system/cpu/cpu4/cpuidle/state1/above:38085
/sys/devices/system/cpu/cpu4/cpuidle/state2/above:7668
/sys/devices/system/cpu/cpu4/cpuidle/state3/above:6823

$ grep . /sys/devices/system/cpu/cpu4/cpuidle/state*/below
/sys/devices/system/cpu/cpu4/cpuidle/state0/below:72059
/sys/devices/system/cpu/cpu4/cpuidle/state1/below:246573
/sys/devices/system/cpu/cpu4/cpuidle/state2/below:7817
/sys/devices/system/cpu/cpu4/cpuidle/state3/below:0

I keep track of the changes per sample interval and graph the sum for
all CPUs as a percentage of the usage of that idle state.

Because I can never remember what "above" and "below" actually mean,
I use the terms "was too shallow" and "was too deep".

... Doug
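Doug's description boils down to sampling the per-state counters periodically and plotting the deltas. A minimal sketch of that idea, assuming only the standard sysfs layout shown in his grep output (this is illustrative, not his actual tooling, and the ten-second interval is arbitrary):

```python
# Rough sketch (not Doug's actual tooling): sample the cpuidle "above",
# "below" and "usage" counters for every CPU and idle state, then report
# the per-interval change in above/below as a percentage of the change
# in usage for that state, summed over all CPUs.
import glob
import time

def read_counters():
    counts = {}
    for path in glob.glob("/sys/devices/system/cpu/cpu*/cpuidle/state*/"):
        state = path.rstrip("/").rsplit("/", 1)[-1]          # e.g. "state2"
        entry = counts.setdefault(state, {"above": 0, "below": 0, "usage": 0})
        for name in entry:
            with open(path + name) as f:
                entry[name] += int(f.read())
    return counts

def main(interval=10.0):
    before = read_counters()
    time.sleep(interval)
    after = read_counters()
    for state in sorted(after):
        d_usage = after[state]["usage"] - before[state]["usage"]
        if d_usage <= 0:
            continue
        too_deep = after[state]["above"] - before[state]["above"]
        too_shallow = after[state]["below"] - before[state]["below"]
        print(f"{state}: was too deep {100.0 * too_deep / d_usage:.2f}%, "
              f"was too shallow {100.0 * too_shallow / d_usage:.2f}%")

if __name__ == "__main__":
    main()
```

Each printed percentage is the fraction of entries into that idle state during the interval which the kernel flagged as having been too deep ("above") or too shallow ("below").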
On Sat, 2022-11-26 at 13:56 -0800, Doug Smythies wrote:
> On 2022.11.26 08:26 Rui wrote:
> > On Wed, 2022-11-23 at 20:08 -0800, Doug Smythies wrote:
> > > On 2022.11.21 04:23 Kajetan Puchalski wrote:
> > > > On Wed, Nov 02, 2022 at 03:28:06PM +0000, Kajetan Puchalski wrote:
> > > >
> > > > [...]
> > > >
> > > > > v3 -> v4:
> > > > > - remove the chunk of code skipping metrics updates when the CPU was utilized
> > > > > - include new test results and more benchmarks in the cover letter
> > > >
> > > > [...]
> > > >
> > > > It's been some time so I just wanted to bump this, what do you think
> > > > about this v4? Doug has already tested it; results for his machine are
> > > > attached to the v3 thread.
> > >
> > > Hi All,
> > >
> > > I continued to test this and included the proposed ladder idle governor
> > > in my continued testing. (Which is why I added Rui as an addressee.)
> >
> > Hi, Doug,
>
> Hi Rui,
>
> > Really appreciated your testing data on this.
> > I have some dumb questions and I need your help so that I can better
> > understand some of the graphs. :)
> >
> > > However, I ran out of time. Here is what I have:
> > >
> > > Kernel: 6.1-rc3 and with patch sets
> > > Processor: Intel(R) Core(TM) i5-10600K CPU @ 4.10GHz
> > > CPU scaling driver: intel_cpufreq
> > > HWP disabled.
> > > Unless otherwise stated, performance CPU scaling governor.
> > >
> > > Legend:
> > > teo: the current teo idle governor
> > > util-v4: the RFC utilization teo patch set version 4.
> > > menu: the menu idle governor
> > > ladder-old: the current ladder idle governor
> > > ladder: the RFC ladder patchset.
> > >
> > > Workflow: shell-intensive serialized workloads.
> > > Variable: PIDs per second.
> > > Note: Single threaded.
> > > Master reference: forced CPU affinity to 1 CPU.
>
> This is the 1cpu on the graph.
>
> > > Performance Results:
> > > http://smythies.com/~doug/linux/idle/teo-util/graphs/pids-perf.png
> > > Schedutil Results:
> > > http://smythies.com/~doug/linux/idle/teo-util/graphs/pids-su.png
> >
> > what does 1cpu mean?
>
> For shell-intensive serialized workflow, or:
>
> Dountil the list of tasks is finished:
>   Start the next task in the list of stuff to do (with a new PID).
>   Wait for it to finish
> Enduntil
>
> We know it represents a challenge for CPU frequency scaling drivers,
> schedulers, and therefore idle drivers.
>
> We also know that the best performance is achieved by overriding
> the scheduler and forcing CPU affinity. I use this "best" case as the
> master reference, using the label 1cpu on the graph.

Got it.

> > > Workflow: sleeping ebizzy 128 threads.
> > > Variable: interval (uSecs).
> > > Performance Results:
> > > http://smythies.com/~doug/linux/idle/teo-util/graphs/ebizzy-128-perf.png
> > > Performance power and idle data:
> > > http://smythies.com/~doug/linux/idle/teo-util/ebizzy/perf/
> >
> > for the "Idle state 0/1/2/3 was too deep" graphs, may I know how you
> > assert that an idle state is too deep/shallow?
>
> I get those stats directly from the kernel driver statistics. For example:
>
> $ grep . /sys/devices/system/cpu/cpu4/cpuidle/state*/above
> /sys/devices/system/cpu/cpu4/cpuidle/state0/above:0
> /sys/devices/system/cpu/cpu4/cpuidle/state1/above:38085
> /sys/devices/system/cpu/cpu4/cpuidle/state2/above:7668
> /sys/devices/system/cpu/cpu4/cpuidle/state3/above:6823
>
> $ grep . /sys/devices/system/cpu/cpu4/cpuidle/state*/below
> /sys/devices/system/cpu/cpu4/cpuidle/state0/below:72059
> /sys/devices/system/cpu/cpu4/cpuidle/state1/below:246573
> /sys/devices/system/cpu/cpu4/cpuidle/state2/below:7817
> /sys/devices/system/cpu/cpu4/cpuidle/state3/below:0
>
> I keep track of the changes per sample interval and graph the sum for
> all CPUs as a percentage of the usage of that idle state.
>
> Because I can never remember what "above" and "below" actually mean,
> I use the terms "was too shallow" and "was too deep".

I just checked the code.
My understanding is that
"above" means the previous idle state residency was too short, and a
shallower state would have been a better match, and
"below" means the previous idle state residency was too long, and a
deeper state would have been a better match.

So probably "above" means "should be shallower" or "was too deep", and
"below" means "should be deeper" or "was too shallow"?

thanks,
rui
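Rui's reading matches the sysfs ABI documentation: "above" counts entries whose observed idle duration turned out to be too short for the chosen state's target residency (the state was too deep), while "below" counts entries where a deeper state would have matched the observed duration better (the state was too shallow). As a rough illustration of that interpretation only, with hypothetical target-residency values and none of the kernel's handling of disabled states:

```python
# Illustration only (made-up residency numbers, not the kernel code): how an
# observed idle duration maps onto the "above" / "was too deep" and
# "below" / "was too shallow" counters for the state that was entered.
TARGET_RESIDENCY_US = [0, 2, 120, 1034]   # hypothetical state0..state3 values

def classify(entered_state, observed_us):
    """Return 'above', 'below' or 'ok' for the entered idle state."""
    if observed_us < TARGET_RESIDENCY_US[entered_state]:
        # Did not stay idle long enough to amortize this state:
        # a shallower state would have been the better choice.
        return "above"       # i.e. "was too deep"
    next_deeper = entered_state + 1
    if (next_deeper < len(TARGET_RESIDENCY_US)
            and observed_us >= TARGET_RESIDENCY_US[next_deeper]):
        # Stayed idle long enough for the next deeper state:
        # that state would have been the better choice.
        return "below"       # i.e. "was too shallow"
    return "ok"

if __name__ == "__main__":
    print(classify(2, 40))     # "above": state2 entered, but woke too early
    print(classify(1, 5000))   # "below": state1 entered, but slept much longer
```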