Message ID | 20240606090050.327614-1-christian.loehle@arm.com (mailing list archive)
---|---
Series | cpuidle: teo: fixes and improvements
On 6/6/24 10:00, Christian Loehle wrote:
> Hi all,
> so my investigation into teo led to the following fixes and
> improvements. Logically they are mostly independent; that's why this
> cover letter is quite short, details are in the patches.
>
> 1/6:
> As discussed, the utilization threshold is too high; while
> there are benefits in certain workloads, there are quite a few
> regressions, too.
> 2/6:
> Especially with the new util threshold, stopping the tick makes little
> sense when the CPU is detected as utilized, so don't.
> 3/6:
> Particularly with WFI, even if it's the only state, stopping the tick
> has benefits, so enable that in the early bail out.
> 4/6:
> Stopping the tick with 0 cost (if the idle state dictates it) is too
> aggressive IMO, so add a 1ms constant cost.
> XXX: This has the issue of now being counted as idle_miss, so we could
> consider adding this to the states, too, but the simple implementation
> of this would have the downside that the cost is added to deeper states
> even if the tick is already off.
> 5/6:
> Remove the 'recent' intercept logic, see my findings in:
> https://lore.kernel.org/lkml/0ce2d536-1125-4df8-9a5b-0d5e389cd8af@arm.com/
> I haven't found a way to salvage this properly, so I removed it.
> The regular intercept seems to decay fast enough to not need this, but
> we could change it if that turns out not to be the case.
> 6/6:
> The rest of the intercept logic had issues, too.
> See the commit.
>
> TODO: add some measurements of common workloads and some simple sanity
> tests (like Vincent described: check in low-utilization workloads whether
> the state selection looks reasonable).
> I have some, but more (and more standardized) would be beneficial.
>
> Happy for anyone to take a look and test as well.
>
> Some numbers for context:
> Maybe some numbers for context, I'll probably add them to the cover letter.
>
> Comparing:
> - IO workload (intercept heavy)
> - Timer workload, very low utilization (check for deepest state)
> - hackbench (high utilization)
> all on RK3399 with CONFIG_HZ=100.
> target_residencies: 1, 900, 2000
>
> 1. IO workload, 5 runs, results sorted, in read IOPS.
> fio --minimal --time_based --name=fiotest --filename=/dev/nvme0n1 --runtime=30 --rw=randread --bs=4k --ioengine=psync --iodepth=1 --direct=1 | cut -d \; -f 8;
>
> teo fixed:
> /dev/nvme0n1
> [4597, 4673, 4727, 4741, 4756]
> /dev/mmcblk2
> [5753, 5832, 5837, 5911, 5949]
> /dev/mmcblk1
> [2059, 2062, 2070, 2071, 2080]
>
> teo mainline:
> /dev/nvme0n1
> [3793, 3825, 3846, 3865, 3964]
> /dev/mmcblk2
> [3831, 4110, 4154, 4203, 4228]
> /dev/mmcblk1
> [1559, 1564, 1596, 1611, 1618]
>
> menu:
> /dev/nvme0n1
> [2571, 2630, 2804, 2813, 2917]
> /dev/mmcblk2
> [4181, 4260, 5062, 5260, 5329]
> /dev/mmcblk1
> [1567, 1581, 1585, 1603, 1769]
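(A loop along these lines reproduces this kind of per-device measurement; it is
only a sketch, not necessarily the exact invocation used. The device list is
taken from the results above; field 8 of fio's terse output is read IOPS.)

    for dev in /dev/nvme0n1 /dev/mmcblk2 /dev/mmcblk1; do
        echo "$dev"
        # 5 runs, keep only the read-IOPS field of fio's terse output, sorted
        for i in $(seq 1 5); do
            fio --minimal --time_based --name=fiotest --filename="$dev" \
                --runtime=30 --rw=randread --bs=4k --ioengine=psync \
                --iodepth=1 --direct=1 | cut -d \; -f 8
        done | sort -n
    done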
> echo "0 2097152 zero" | dmsetup create dm-zeros > echo "0 2097152 delay /dev/mapper/dm-zeros 0 50" | dmsetup create dm-slow > (Each IO is delayed by timer of 50ms, should be mostly in state2) > > teo fixed: > 3269 cpu_idle total > 48 cpu_idle_miss > 30 cpu_idle_miss above > 18 cpu_idle_miss below > > teo mainline: > 3221 cpu_idle total > 1269 cpu_idle_miss > 22 cpu_idle_miss above > 1247 cpu_idle_miss below > > menu: > 3433 cpu_idle total > 114 cpu_idle_miss > 61 cpu_idle_miss above > 53 cpu_idle_miss below > > Residencies: Hmm, maybe actually including them would've been helpful too: (Over 5s workload, only showing LITTLE cluster) teo fixed: idle_state 2.0 4.813378 -1.0 0.210820 1.0 0.202778 0.0 0.062426 teo mainline: idle_state 1.0 4.895766 -1.0 0.098063 0.0 0.253069 menu: idle_state 2.0 4.528356 -1.0 0.241486 1.0 0.345829 0.0 0.202505 > > tldr: overall teo fixed spends more time in state2 while having > fewer idle_miss than menu. > teo mainline was just way too aggressive at selecting shallow states. > > 3. Hackbench, 5 runs > for i in $(seq 0 4); do hackbench -l 100 -g 100 ; sleep 1; done > > teo fixed: > Time: 4.807 > Time: 4.856 > Time: 5.072 > Time: 4.934 > Time: 4.962 > > teo mainline: > Time: 4.945 > Time: 5.021 > Time: 4.927 > Time: 4.923 > Time: 5.137 > > menu: > Time: 4.991 > Time: 4.884 > Time: 4.880 > Time: 4.946 > Time: 4.980 > > tldr: all comparable, teo mainline slightly worse > > Kind Regards, > Christian > > Christian Loehle (6): > cpuidle: teo: Increase util-threshold > cpuidle: teo: Don't stop tick on utilized > cpuidle: teo: Don't always stop tick on one state > cpuidle: teo: Increase minimum time to stop tick > cpuidle: teo: Remove recent intercepts metric > cpuidle: teo: Don't count non-existent intercepts > > drivers/cpuidle/governors/teo.c | 121 +++++++++++++------------------- > 1 file changed, 48 insertions(+), 73 deletions(-) > > -- > 2.34.1 >
On Thu, Jun 6, 2024 at 1:59 PM Christian Loehle <christian.loehle@arm.com> wrote:
>
> On 6/6/24 10:00, Christian Loehle wrote:
>
> [...]
>
> > Residencies:
>
> Hmm, maybe actually including them would've been helpful too:
> (Over 5s workload, only showing LITTLE cluster)
> teo fixed:
> idle_state
> 2.0    4.813378
> -1.0   0.210820

Just to clarify, what does -1.0 mean here?

> 1.0    0.202778
> 0.0    0.062426
>
> [...]
On 6/6/24 13:29, Rafael J. Wysocki wrote:
> On Thu, Jun 6, 2024 at 1:59 PM Christian Loehle
> <christian.loehle@arm.com> wrote:
>>
>> On 6/6/24 10:00, Christian Loehle wrote:
>>
>> [...]
>>
>> Hmm, maybe actually including them would've been helpful too:
>> (Over 5s workload, only showing LITTLE cluster)
>> teo fixed:
>> idle_state
>> 2.0    4.813378
>> -1.0   0.210820
>
> Just to clarify, what does -1.0 mean here?

Good point, I should've mentioned that. This adds up the residencies just from
the cpu_idle trace event. -1 stands for PWR_EVENT_EXIT, i.e. non-idle; it's
quite useful when looking at absolute numbers rather than just idle-state
ratios.
tldr: the time the CPU isn't in any idle state.

> [snip]
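A minimal sketch of reproducing such per-state residency sums from a cpu_idle
trace (captured with events/power/cpu_idle enabled, as in the earlier sketch).
The trace line format assumed here is the default event output, where
PWR_EVENT_EXIT typically appears as state=4294967295 and is folded into -1
below; the sums are across all CPUs seen in the trace:

    awk '
    /cpu_idle:/ {
        # pick out the timestamp, state= and cpu_id= fields of each event line
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^[0-9]+\.[0-9]+:$/) { t = $i; sub(/:$/, "", t) }
            if ($i ~ /^state=/)  s = substr($i, 7)
            if ($i ~ /^cpu_id=/) c = substr($i, 8)
        }
        if (s + 0 == 4294967295) s = -1          # PWR_EVENT_EXIT, i.e. non-idle
        # time since this CPU's previous cpu_idle event was spent in its previous state
        if (c in last_t) resid[last_s[c]] += t - last_t[c]
        last_t[c] = t; last_s[c] = s
    }
    END { for (s in resid) printf "state %s: %.6f s\n", s, resid[s] }
    ' /sys/kernel/tracing/trace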