Message ID: 20240809073120.250974-1-aboorvad@linux.ibm.com (mailing list archive)
Series: cpuidle/menu: Address performance drop from favoring physical over polling cpuidle state
On 8/9/24 08:31, Aboorva Devarajan wrote: > This patch aims to discuss a potential performance degradation that can occur > in certain workloads when the menu governor prioritizes selecting a physical > idle state over a polling state for short idle durations. > > Note: This patch is intended to showcase a performance degradation, applying > this patch could lead to increased power consumption due to the trade-off between > performance and power efficiency, potentially causing a higher preference for > performance at the expense of power usage. > Not really a menu expert, but at this point I don't know who dares call themselves one. The elephant in the room would be: Does teo work better for you? > ================================================== > System details in which the degradation is observed: > > $ uname -r > 6.10.0+ > > $ lscpu > Architecture: ppc64le > Byte Order: Little Endian > CPU(s): 160 > On-line CPU(s) list: 0-159 > Model name: POWER10 (architected), altivec supported > Model: 2.0 (pvr 0080 0200) > Thread(s) per core: 8 > Core(s) per socket: 3 > Socket(s): 6 > Physical sockets: 4 > Physical chips: 2 > Physical cores/chip: 6 > Virtualization features: > Hypervisor vendor: pHyp > Virtualization type: para > Caches (sum of all): > L1d: 1.3 MiB (40 instances) > L1i: 1.9 MiB (40 instances) > L2: 40 MiB (40 instances) > L3: 160 MiB (40 instances) > NUMA: > NUMA node(s): 6 > NUMA node0 CPU(s): 0-31 > NUMA node1 CPU(s): 32-71 > NUMA node2 CPU(s): 72-79 > NUMA node3 CPU(s): 80-87 > NUMA node4 CPU(s): 88-119 > NUMA node5 CPU(s): 120-159 > > > $ cpupower idle-info > CPUidle driver: pseries_idle > CPUidle governor: menu > analyzing CPU 0: > > Number of idle states: 2 > Available idle states: snooze CEDE > snooze: > Flags/Description: snooze > Latency: 0 > Residency: 0 > Usage: 6229 > Duration: 402142 > CEDE: > Flags/Description: CEDE > Latency: 12 > Residency: 120 > Usage: 191411 > Duration: 36329999037 > > ================================================== > > The menu governor contains a condition that selects physical idle states over, > such as the CEDE state over polling state, by checking if their exit latency meets > the latency requirements. This can lead to performance drops in workloads with > frequent short idle periods. > > The specific condition which causes degradation is as below (menu governor): > > ``` > if (s->target_residency_ns > predicted_ns) { > ... > if ((drv->states[idx].flags & CPUIDLE_FLAG_POLLING) && > s->exit_latency_ns <= latency_req && > s->target_residency_ns <= data->next_timer_ns) { > predicted_ns = s->target_residency_ns; > idx = i; > break; > } > ... > } > ``` > > This condition can cause the menu governor to choose the CEDE state on Power > Systems (residency: 120 us, exit latency: 12 us) over a polling state, even > when the predicted idle duration is much shorter than the target residency > of the physical state. This misprediction leads to performance degradation > in certain workloads. > So clearly the condition s->target_residency_ns <= data->next_timer_ns) is supposed to prevent this, but data->next_timer_ns isn't accurate, have you got any idea what it's set to in your workload usually? Seems like your workload is timer-based, so the idle duration should be predicted accurately. > ================================================== > Test Results > ================================================== > > This issue can be clearly observed with the below test. 
> > A test with multiple wakee threads and a single waker thread was run to > demonstrate this issue. The waker thread periodically wakes up the wakee > threads after a specific sleep duration, creating a repeating of sleep -> wake > pattern. The test was run for a stipulated period, and cpuidle statistics are > collected. > > ./cpuidle-test -a 0 -b 10 -b 20 -b 30 -b 40 -b 50 -b 60 -b 70 -r 20 -t 60 > > ================================================== > Results (Baseline Kernel): > ================================================== > Wakee 0[PID 8295] affined to CPUs: 10, > Wakee 2[PID 8297] affined to CPUs: 30, > Wakee 3[PID 8298] affined to CPUs: 40, > Wakee 1[PID 8296] affined to CPUs: 20, > Wakee 4[PID 8299] affined to CPUs: 50, > Wakee 5[PID 8300] affined to CPUs: 60, > Wakee 6[PID 8301] affined to CPUs: 70, > Waker[PID 8302] affined to CPUs: 0, > > |-----------------------------------|-------------------------|-----------------------------| > | Metric | snooze | CEDE | > |-----------------------------------|-------------------------|-----------------------------| > | Usage | 47815 | 2030160 | > | Above | 0 | 2030043 | > | Below | 0 | 0 | > | Time Spent (us) | 976317 (1.63%) | 51046474 (85.08%) | > | Overall average sleep duration | 28.721 us | | > | Overall average wakeup latency | 6.858 us | | > |-----------------------------------|-------------------------|-----------------------------| > > In this test, without the patch, the CPU often enters the CEDE state for > sleep durations of around 20-30 microseconds, even though the CEDE state's > residency time is 120 microseconds. This happens because the menu governor > prioritizes the physical idle state (CEDE) if its exit latency is within > the latency limits. It also uses next_timer_ns for comparison, which can > be farther off than the actual idle duration as it is more predictable, > instead of using predicted idle duration as a comparision point with the > target residency. Ideally that shouldn't be the case though (next_timer_ns be farther off the actual idle duration). > [snip]
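For reference, a minimal standalone rendering of the selection path quoted above, with the two pseries states plugged in as illustrative values. This is a sketch of the branch under discussion, not the in-kernel menu_select(); the state table, the ~30 us predicted idle duration and the "no timer due soon" value are assumptions taken from the numbers in this thread.

```
#include <stdio.h>
#include <stdint.h>

#define CPUIDLE_FLAG_POLLING 0x1

struct state {
	const char *name;
	unsigned int flags;
	int64_t exit_latency_ns;
	int64_t target_residency_ns;
};

/* Simplified version of the quoted loop: only the branches that matter
 * for the snooze-vs-CEDE decision are kept. */
static int select_state(const struct state *states, int count,
			int64_t predicted_ns, int64_t next_timer_ns,
			int64_t latency_req)
{
	int idx = -1;

	for (int i = 0; i < count; i++) {
		const struct state *s = &states[i];

		if (idx == -1)
			idx = i; /* first enabled state */

		if (s->target_residency_ns > predicted_ns) {
			/*
			 * The condition under discussion: a polling state that
			 * was already chosen is replaced by the physical state
			 * whenever the next timer is far enough away, even
			 * though the predicted idle duration is shorter than
			 * the physical state's target residency.
			 */
			if ((states[idx].flags & CPUIDLE_FLAG_POLLING) &&
			    s->exit_latency_ns <= latency_req &&
			    s->target_residency_ns <= next_timer_ns) {
				idx = i;
				break;
			}
			continue;
		}
		if (s->exit_latency_ns > latency_req)
			break;
		idx = i;
	}
	return idx;
}

int main(void)
{
	/* snooze: polling, 0 latency/residency; CEDE: 12 us / 120 us */
	const struct state pseries[] = {
		{ "snooze", CPUIDLE_FLAG_POLLING, 0, 0 },
		{ "CEDE",   0, 12000, 120000 },
	};

	/*
	 * ~30 us predicted idle (as in the baseline test above), no latency
	 * constraint, and a next timer far beyond CEDE's 120 us residency --
	 * any such value gives the same outcome as KTIME_MAX here.
	 */
	int idx = select_state(pseries, 2, 30000, INT64_MAX, INT64_MAX);

	printf("selected: %s\n", pseries[idx].name); /* prints "CEDE" */
	return 0;
}
```

With predicted_ns well below 120 us but next_timer_ns far away, the polling state already selected at idx 0 is overridden by CEDE, which is consistent with the high "Above" count in the baseline table above.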
On Tue, 2024-08-13 at 13:56 +0100, Christian Loehle wrote: Hi Christian, Thanks a lot for your comments. ... > On 8/9/24 08:31, Aboorva Devarajan wrote: > > This patch aims to discuss a potential performance degradation that can occur > > in certain workloads when the menu governor prioritizes selecting a physical > > idle state over a polling state for short idle durations. > > > > Note: This patch is intended to showcase a performance degradation, applying > > this patch could lead to increased power consumption due to the trade-off between > > performance and power efficiency, potentially causing a higher preference for > > performance at the expense of power usage. > > > > Not really a menu expert, but at this point I don't know who dares call > themselves one. > The elephant in the room would be: Does teo work better for you? > I ran some tests with the teo governor enabled, but it didn’t make a lot of difference. The results are presented below. > > ================================================== > > System details in which the degradation is observed: > > > > $ uname -r > > 6.10.0+ > > > > $ lscpu > > Architecture: ppc64le > > Byte Order: Little Endian > > CPU(s): 160 > > On-line CPU(s) list: 0-159 > > Model name: POWER10 (architected), altivec supported > > Model: 2.0 (pvr 0080 0200) > > Thread(s) per core: 8 > > Core(s) per socket: 3 > > Socket(s): 6 > > Physical sockets: 4 > > Physical chips: 2 > > Physical cores/chip: 6 > > Virtualization features: > > Hypervisor vendor: pHyp > > Virtualization type: para > > Caches (sum of all): > > L1d: 1.3 MiB (40 instances) > > L1i: 1.9 MiB (40 instances) > > L2: 40 MiB (40 instances) > > L3: 160 MiB (40 instances) > > NUMA: > > NUMA node(s): 6 > > NUMA node0 CPU(s): 0-31 > > NUMA node1 CPU(s): 32-71 > > NUMA node2 CPU(s): 72-79 > > NUMA node3 CPU(s): 80-87 > > NUMA node4 CPU(s): 88-119 > > NUMA node5 CPU(s): 120-159 > > > > > > $ cpupower idle-info > > CPUidle driver: pseries_idle > > CPUidle governor: menu > > analyzing CPU 0: > > > > Number of idle states: 2 > > Available idle states: snooze CEDE > > snooze: > > Flags/Description: snooze > > Latency: 0 > > Residency: 0 > > Usage: 6229 > > Duration: 402142 > > CEDE: > > Flags/Description: CEDE > > Latency: 12 > > Residency: 120 > > Usage: 191411 > > Duration: 36329999037 > > > > ================================================== > > > > The menu governor contains a condition that selects physical idle states over, > > such as the CEDE state over polling state, by checking if their exit latency meets > > the latency requirements. This can lead to performance drops in workloads with > > frequent short idle periods. > > > > The specific condition which causes degradation is as below (menu governor): > > > > ``` > > if (s->target_residency_ns > predicted_ns) { > > ... > > if ((drv->states[idx].flags & CPUIDLE_FLAG_POLLING) && > > s->exit_latency_ns <= latency_req && > > s->target_residency_ns <= data->next_timer_ns) { > > predicted_ns = s->target_residency_ns; > > idx = i; > > break; > > } > > ... > > } > > ``` > > > > This condition can cause the menu governor to choose the CEDE state on Power > > Systems (residency: 120 us, exit latency: 12 us) over a polling state, even > > when the predicted idle duration is much shorter than the target residency > > of the physical state. This misprediction leads to performance degradation > > in certain workloads. 
> > > > So clearly the condition > s->target_residency_ns <= data->next_timer_ns) > is supposed to prevent this, but data->next_timer_ns isn't accurate, > have you got any idea what it's set to in your workload usually? > Seems like your workload is timer-based, so the idle duration should be > predicted accurately. > Yes, that's right ideally this condition should have prevented this, but `data->next_timer_ns` is almost always greater than the actual idle duration which seems inaccurate. > > > ================================================== > > Test Results > > ================================================== > > > > This issue can be clearly observed with the below test. > > > > A test with multiple wakee threads and a single waker thread was run to > > demonstrate this issue. The waker thread periodically wakes up the wakee > > threads after a specific sleep duration, creating a repeating of sleep -> wake > > pattern. The test was run for a stipulated period, and cpuidle statistics are > > collected. > > > > ./cpuidle-test -a 0 -b 10 -b 20 -b 30 -b 40 -b 50 -b 60 -b 70 -r 20 -t 60 > > > > ================================================== > > Results (Baseline Kernel): > > ================================================== > > Wakee 0[PID 8295] affined to CPUs: 10, > > Wakee 2[PID 8297] affined to CPUs: 30, > > Wakee 3[PID 8298] affined to CPUs: 40, > > Wakee 1[PID 8296] affined to CPUs: 20, > > Wakee 4[PID 8299] affined to CPUs: 50, > > Wakee 5[PID 8300] affined to CPUs: 60, > > Wakee 6[PID 8301] affined to CPUs: 70, > > Waker[PID 8302] affined to CPUs: 0, > > > > > -----------------------------------|-------------------------|-----------------------------| > > > Metric | snooze | CEDE | > > > -----------------------------------|-------------------------|-----------------------------| > > > Usage | 47815 | 2030160 | > > > Above | 0 | 2030043 | > > > Below | 0 | 0 | > > > Time Spent (us) | 976317 (1.63%) | 51046474 (85.08%) | > > > Overall average sleep duration | 28.721 us | | > > > Overall average wakeup latency | 6.858 us | | > > > -----------------------------------|-------------------------|-----------------------------| > > > > In this test, without the patch, the CPU often enters the CEDE state for > > sleep durations of around 20-30 microseconds, even though the CEDE state's > > residency time is 120 microseconds. This happens because the menu governor > > prioritizes the physical idle state (CEDE) if its exit latency is within > > the latency limits. It also uses next_timer_ns for comparison, which can > > be farther off than the actual idle duration as it is more predictable, > > instead of using predicted idle duration as a comparision point with the > > target residency. > > Ideally that shouldn't be the case though (next_timer_ns be farther off thactual idle duration) I ran some experiments based on your suggestions. Below are the relative average runtimes and percentage differences compared to the base version: Picked a single representative workload for simplicity: Note: Lower (% Difference) the better. 
|---------------------------|-----------------|--------------|
| Configuration             | Average Runtime | % Difference |
|---------------------------|-----------------|--------------|
| Base (menu)               | 1.00            | 0.00%        |
| Base + Patch 1 (menu)     | 0.92            | -8.00%       |
| Base + Patch 2 (menu)     | 0.98            | -2.00%       |
| Base (teo)                | 1.01            | +1.00%       |
|---------------------------|-----------------|--------------|

Patch 1: https://lore.kernel.org/all/20240809073120.250974-2-aboorvad@linux.ibm.com/
Patch 2: https://lore.kernel.org/all/c20a07e4-b9e6-4a66-80f5-63d679b17c3b@arm.com/

It seems that Patch 2 does provide a slight improvement in runtime, but not as significant as Patch 1. Additionally, teo does not seem to help in this case.

Regarding the condition `s->target_residency_ns <= data->next_timer_ns`, it appears that `data->next_timer_ns` is consistently greater than both the actual idle duration and `s->target_residency_ns`, so this condition nearly always holds true, indicating some inaccuracy. I'll investigate this further and follow up with more details.

Regards,
Aboorva
On 8/20/24 09:51, Aboorva Devarajan wrote: > On Tue, 2024-08-13 at 13:56 +0100, Christian Loehle wrote: > > Hi Christian, > > Thanks a lot for your comments. > ... >> On 8/9/24 08:31, Aboorva Devarajan wrote: >>> This patch aims to discuss a potential performance degradation that can occur >>> in certain workloads when the menu governor prioritizes selecting a physical >>> idle state over a polling state for short idle durations. >>> >>> Note: This patch is intended to showcase a performance degradation, applying >>> this patch could lead to increased power consumption due to the trade-off between >>> performance and power efficiency, potentially causing a higher preference for >>> performance at the expense of power usage. >>> >> >> Not really a menu expert, but at this point I don't know who dares call >> themselves one. >> The elephant in the room would be: Does teo work better for you? >> > > I ran some tests with the teo governor enabled, but it didn’t make a > lot of difference. The results are presented below. > >>> ================================================== >>> System details in which the degradation is observed: >>> >>> $ uname -r >>> 6.10.0+ >>> >>> $ lscpu >>> Architecture: ppc64le >>> Byte Order: Little Endian >>> CPU(s): 160 >>> On-line CPU(s) list: 0-159 >>> Model name: POWER10 (architected), altivec supported >>> Model: 2.0 (pvr 0080 0200) >>> Thread(s) per core: 8 >>> Core(s) per socket: 3 >>> Socket(s): 6 >>> Physical sockets: 4 >>> Physical chips: 2 >>> Physical cores/chip: 6 >>> Virtualization features: >>> Hypervisor vendor: pHyp >>> Virtualization type: para >>> Caches (sum of all): >>> L1d: 1.3 MiB (40 instances) >>> L1i: 1.9 MiB (40 instances) >>> L2: 40 MiB (40 instances) >>> L3: 160 MiB (40 instances) >>> NUMA: >>> NUMA node(s): 6 >>> NUMA node0 CPU(s): 0-31 >>> NUMA node1 CPU(s): 32-71 >>> NUMA node2 CPU(s): 72-79 >>> NUMA node3 CPU(s): 80-87 >>> NUMA node4 CPU(s): 88-119 >>> NUMA node5 CPU(s): 120-159 >>> >>> >>> $ cpupower idle-info >>> CPUidle driver: pseries_idle >>> CPUidle governor: menu >>> analyzing CPU 0: >>> >>> Number of idle states: 2 >>> Available idle states: snooze CEDE >>> snooze: >>> Flags/Description: snooze >>> Latency: 0 >>> Residency: 0 >>> Usage: 6229 >>> Duration: 402142 >>> CEDE: >>> Flags/Description: CEDE >>> Latency: 12 >>> Residency: 120 >>> Usage: 191411 >>> Duration: 36329999037 >>> >>> ================================================== >>> >>> The menu governor contains a condition that selects physical idle states over, >>> such as the CEDE state over polling state, by checking if their exit latency meets >>> the latency requirements. This can lead to performance drops in workloads with >>> frequent short idle periods. >>> >>> The specific condition which causes degradation is as below (menu governor): >>> >>> ``` >>> if (s->target_residency_ns > predicted_ns) { >>> ... >>> if ((drv->states[idx].flags & CPUIDLE_FLAG_POLLING) && >>> s->exit_latency_ns <= latency_req && >>> s->target_residency_ns <= data->next_timer_ns) { >>> predicted_ns = s->target_residency_ns; >>> idx = i; >>> break; >>> } >>> ... >>> } >>> ``` >>> >>> This condition can cause the menu governor to choose the CEDE state on Power >>> Systems (residency: 120 us, exit latency: 12 us) over a polling state, even >>> when the predicted idle duration is much shorter than the target residency >>> of the physical state. This misprediction leads to performance degradation >>> in certain workloads. 
>>> >> >> So clearly the condition >> s->target_residency_ns <= data->next_timer_ns) >> is supposed to prevent this, but data->next_timer_ns isn't accurate, >> have you got any idea what it's set to in your workload usually? >> Seems like your workload is timer-based, so the idle duration should be >> predicted accurately. >> > > Yes, that's right ideally this condition should have prevented this, > but `data->next_timer_ns` is almost always greater than the actual > idle duration which seems inaccurate. > >> >>> ================================================== >>> Test Results >>> ================================================== >>> >>> This issue can be clearly observed with the below test. >>> >>> A test with multiple wakee threads and a single waker thread was run to >>> demonstrate this issue. The waker thread periodically wakes up the wakee >>> threads after a specific sleep duration, creating a repeating of sleep -> wake >>> pattern. The test was run for a stipulated period, and cpuidle statistics are >>> collected. >>> >>> ./cpuidle-test -a 0 -b 10 -b 20 -b 30 -b 40 -b 50 -b 60 -b 70 -r 20 -t 60 >>> >>> ================================================== >>> Results (Baseline Kernel): >>> ================================================== >>> Wakee 0[PID 8295] affined to CPUs: 10, >>> Wakee 2[PID 8297] affined to CPUs: 30, >>> Wakee 3[PID 8298] affined to CPUs: 40, >>> Wakee 1[PID 8296] affined to CPUs: 20, >>> Wakee 4[PID 8299] affined to CPUs: 50, >>> Wakee 5[PID 8300] affined to CPUs: 60, >>> Wakee 6[PID 8301] affined to CPUs: 70, >>> Waker[PID 8302] affined to CPUs: 0, >>> >>>> -----------------------------------|-------------------------|-----------------------------| >>>> Metric | snooze | CEDE | >>>> -----------------------------------|-------------------------|-----------------------------| >>>> Usage | 47815 | 2030160 | >>>> Above | 0 | 2030043 | >>>> Below | 0 | 0 | >>>> Time Spent (us) | 976317 (1.63%) | 51046474 (85.08%) | >>>> Overall average sleep duration | 28.721 us | | >>>> Overall average wakeup latency | 6.858 us | | >>>> -----------------------------------|-------------------------|-----------------------------| >>> >>> In this test, without the patch, the CPU often enters the CEDE state for >>> sleep durations of around 20-30 microseconds, even though the CEDE state's >>> residency time is 120 microseconds. This happens because the menu governor >>> prioritizes the physical idle state (CEDE) if its exit latency is within >>> the latency limits. It also uses next_timer_ns for comparison, which can >>> be farther off than the actual idle duration as it is more predictable, >>> instead of using predicted idle duration as a comparision point with the >>> target residency. >> >> Ideally that shouldn't be the case though (next_timer_ns be farther off thactual idle duration) > > I ran some experiments based on your suggestions. Below are the > relative average runtimes and percentage differences compared to > the base version: > > Picked a single representative workload for simplicity: > > Note: Lower (% Difference) the better. 
> |---------------------------|-----------------|--------------| > | Configuration | Average Runtime | % Difference | > |---------------------------|-----------------|--------------| > | Base (menu) | 1.00 | 0.00% | > | Base + Patch 1 (menu) | 0.92 | -8.00% | > | Base + Patch 2 (menu) | 0.98 | -2.00% | > | Base (teo) | 1.01 | +1.00% | > |---------------------------|-----------------|--------------| > Patch 1: https://lore.kernel.org/all/20240809073120.250974-2-aboorvad@linux.ibm.com/ > Patch 2: https://lore.kernel.org/all/c20a07e4-b9e6-4a66-80f5-63d679b17c3b@arm.com/ > > It seems that Patch 2 does provide a slight improvement in runtime, but > not significantly like Patch 1. Additionally, teo does not seem > to help in this case. > > Regarding the condition `s->target_residency_ns <= data->next_timer_ns`, > it appears that `data->next_timer_ns` is consistently greater than > both actual idle duration and `s->target_residency_ns` so this condition > nearly always holds true, indicating some inaccuracy. I'll investigate > this further and follow up with more details. The wakeup source(s), since they don't seem to be timer events would be interesting, although a bit of a hassle to get right. What's the workload anyway? > > Regards, > Aboorva >
On Wed, 2024-08-21 at 11:55 +0100, Christian Loehle wrote: > On 8/20/24 09:51, Aboorva Devarajan wrote: > > On Tue, 2024-08-13 at 13:56 +0100, Christian Loehle wrote: > > ... > The wakeup source(s), since they don't seem to be timer events would be > interesting, although a bit of a hassle to get right. > What's the workload anyway? > Hi Christian, Apologies for the delayed response. I identified the part of the internal workload responsible for performance improvement with the patch and it appears to be the OLTP queries, and yes the wakeup sources are not timer events. --------------------------------------------------------------------- Additionally to reproduce the issue using a more open and accessible benchmark, conducted similar experiments using pgbench and I observed similar improvements with the patch, particularly when running the read intensive select query benchmarks. --------------------------------------------------------------------- System Details --------------------------------------------------------------------- $ lscpu Architecture: ppc64le Byte Order: Little Endian CPU(s): 224 On-line CPU(s) list: 0-223 Model name: POWER10 (architected), altivec supported Model: 2.0 (pvr 0080 0200) Thread(s) per core: 8 Core(s) per socket: 3 Socket(s): 8 Virtualization features: Hypervisor vendor: pHyp Virtualization type: para --------------------------------------------------------------------- $ cpupower idle-info CPUidle driver: pseries_idle CPUidle governor: menu analyzing CPU 0: Number of idle states: 2 Available idle states: snooze CEDE snooze: Flags/Description: snooze Latency: 0 Residency: 0 Usage: 6229 Duration: 402142 CEDE: Flags/Description: CEDE Latency: 12 Residency: 120 Usage: 191411 Duration: 36329999037 --------------------------------------------------------------------- PostgreSQL Benchmark: --------------------------------------------------------------------- I ran pgbench with 224 clients and 20 threads for 600 seconds, performing only SELECT queries against the pgbench database to evaluate performance under read-intensive workloads: $ pgbench -c 224 -j 20 -T 600 -S pgbench Latency: |---|-------------|------------|------------| | # | Before (ms) | After (ms) | Change (%) | |Run| Patch | Patch | | |---|-------------|------------|------------| | 1 | 0.343 | 0.287 | -16.31% | | 2 | 0.334 | 0.286 | -14.37% | | 3 | 0.333 | 0.286 | -14.11% | | 4 | 0.341 | 0.288 | -15.55% | | 5 | 0.342 | 0.288 | -15.79% | |---|-------------|------------|------------| Latency Reduction: After applying the patch, the latency decreased by 14% to 16% across multiple runs. Throughput per second: |---|-------------|------------|------------| | # | Before | After | Change (%) | |Run| Patch | Patch | | |---|-------------|------------|------------| | 1 | 652,544.23 | 780,613.42 | +19.63% | | 2 | 670,243.45 | 784,348.44 | +17.04% | | 3 | 673,495.39 | 784,458.12 | +16.48% | | 4 | 656,609.16 | 778,531.20 | +18.57% | | 5 | 654,644.52 | 778,322.88 | +18.88% | |---|-------------|------------|------------| Transactions per second Improvement: The patch led to an increase in TPS, ranging from 16% to 19%. This indicates that the patch significantly reduces latency while improving throughput (TPS). pgbench is an OLTP benchmark and doesn't do timer based wakeups, this is in-line with the improvements I saw in the originally reported OLTP workload as well. 
--------------------------------------------------------------------- Additional CPU Idle test with custom benchmark: --------------------------------------------------------------------- I also wrote a simple benchmark [1] to analyze the impact of cpuidle state selection, comparing timer-based wakeups and non-timer (pipe-based) wakeups. This test involves a single waker thread periodically waking up a single wakee thread, simulating a repeating sleep-wake pattern. The test was run with both timer-based and pipe-based wakeups, and cpuidle statistics (usage, time, above, below) were collected. Timer based wakeup: +------------------------+---------------------+---------------------+ | Metric | Without Patch | With Patch | +------------------------+---------------------+---------------------+ | Wakee thread - CPU | 110 | 110 | | Waker thread - CPU | 20 | 20 | | Sleep Interval | 50 us | 50 us | | Total Wakeups | - | - | | Avg Wakeup Latency | - | - | | Actual Sleep time | 52.639 us | 52.627 us | +------------------------+---------------------+---------------------+ | Idle State 0 Usage Diff| 94,879 | 94,879 | | Idle State 0 Time Diff | 4,700,323 ns | 4,697,576 ns | | Idle State 1 Usage Diff| 0 | 0 | | Idle State 1 Time Diff | 0 ns | 0 ns | +------------------------+---------------------+---------------------+ | Total Above Usage | 0 (0.00%) | (0.00%) | +------------------------+---------------------+---------------------+ | Total Below Usage | 0 (0.00%) | 0 (0.00%) | +------------------------+---------------------+---------------------+ In timer-based wakeups, the menu governor effectively predicts the idle duration both with and without the patch. This ensures that there are few or no instances of "Above" usage, allowing the CPU to remain in the correct idle state. The condition (s->target_residency_ns <= data->next_timer_ns) in the menu governor ensures that a physical idle state is not prioritized when a timer event is expected before the target residency of the first physical idle state. As a result, the patch has no impact in this case, and performance remains stable with timer based wakeups. Pipe based wakeup (non-timer wakeup): +------------------------+---------------------+---------------------+ | Metric | Without Patch | With Patch | +------------------------+---------------------+---------------------+ | Wakee thread - CPU | 110 | 110 | | Waker thread - CPU | 20 | 20 | | Sleep Interval | 50 us | 50 us | | Total Wakeups | 97031 | 96583 | | Avg Wakeup Latency | 7.070 us | 4.879 us | | Actual Sleep time | 51.366 us | 51.605 us | +------------------------+---------------------+---------------------+ | Idle State 0 Usage Diff| 1209 | 96,586 | | Idle State 0 Time Diff | 55,579 ns | 4,510,003 ns | | Idle State 1 Usage Diff| 95,826 | 5 | | Idle State 1 Time Diff | 4,522,639 ns | 198 ns | +------------------------+---------------------+---------------------+ +------------------------+---------------------+---------------------+ | **Total Above Usage** | 95,824 (98.75%) | 5 (0.01%) | +------------------------+---------------------+---------------------+ +------------------------+---------------------+---------------------+ | Total Below Usage | 0 (0.00%) | 0 (0.00%) | +------------------------+---------------------+---------------------+ In the pipe-based wakeup scenario, before the patch was applied, the "Above" metric was notably high around 98.75%. This suggests that the menu governor frequently selected a deeper idle state like CEDE, even when the idle period was relatively short. 
This happened because the menu governor is inclined to prioritize the physical idle state (CEDE) even when its target residency (s->target_residency_ns) is longer than the predicted idle time (predicted_ns); data->next_timer_ns is not relevant here since these are non-timer wakeups.

In this test, despite the actual idle period being around 50 microseconds, the menu governor still chose the CEDE state, which has a target residency of 120 microseconds.

---------------------------------------------------------------------

While timer-based wakeups performed well even without the patch, workloads that don't have timers as a wakeup source but do have predictable idle durations shorter than the first physical idle state's target residency benefit significantly from the patch.

It would be helpful to understand why prioritizing deep physical idle states over shallow ones, even for short idle periods that don't meet the target residency as described above, is considered more beneficial.

Is there something I could be missing here?

Any comments or suggestions will be really helpful.

[1] https://github.com/AboorvaDevarajan/linux-utils/tree/main/cpuidle/cpuidle_wakeup

Thanks,
Aboorva
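For anyone who wants to reproduce the wakeup pattern without the full benchmark from [1], here is a minimal sketch of the pipe-based waker/wakee pair described above; CPU affinities, latency measurement and cpuidle statistics collection are intentionally omitted, and the 50 us pacing value is taken from the tables above.

```
#include <stdio.h>
#include <unistd.h>
#include <pthread.h>
#include <time.h>

static int pipefd[2];

/* Wakee: blocks in read(), so its CPU idles until the waker writes --
 * a non-timer wakeup from the wakee's point of view. */
static void *wakee(void *arg)
{
	char c;

	(void)arg;
	while (read(pipefd[0], &c, 1) == 1 && c != 'q')
		;
	return NULL;
}

int main(void)
{
	/* Waker paces itself with a ~50 us sleep, like the test above. */
	struct timespec interval = { .tv_sec = 0, .tv_nsec = 50 * 1000 };
	pthread_t t;

	if (pipe(pipefd))
		return 1;
	if (pthread_create(&t, NULL, wakee, NULL))
		return 1;

	for (int i = 0; i < 100000; i++) {
		nanosleep(&interval, NULL);
		if (write(pipefd[1], "w", 1) != 1)	/* wake the wakee */
			break;
	}
	if (write(pipefd[1], "q", 1) != 1)		/* tell it to exit */
		perror("write");
	pthread_join(t, NULL);
	return 0;
}
```

Build with -pthread; pinning the waker and wakee threads to different CPUs (for example via pthread_setaffinity_np()) reproduces the cross-CPU setup used in the tables above.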
On 9/19/24 06:02, Aboorva Devarajan wrote: > On Wed, 2024-08-21 at 11:55 +0100, Christian Loehle wrote: > >> On 8/20/24 09:51, Aboorva Devarajan wrote: >>> On Tue, 2024-08-13 at 13:56 +0100, Christian Loehle wrote: >>> > ... >> The wakeup source(s), since they don't seem to be timer events would be >> interesting, although a bit of a hassle to get right. >> What's the workload anyway? >> > > Hi Christian, > > Apologies for the delayed response. No worries! > > I identified the part of the internal workload responsible for performance > improvement with the patch and it appears to be the OLTP queries, and yes > the wakeup sources are not timer events. > > --------------------------------------------------------------------- > > Additionally to reproduce the issue using a more open and accessible > benchmark, conducted similar experiments using pgbench and I observed > similar improvements with the patch, particularly when running the > read intensive select query benchmarks. > > --------------------------------------------------------------------- > System Details > --------------------------------------------------------------------- > $ lscpu > Architecture: ppc64le > Byte Order: Little Endian > CPU(s): 224 > On-line CPU(s) list: 0-223 > Model name: POWER10 (architected), altivec supported > Model: 2.0 (pvr 0080 0200) > Thread(s) per core: 8 > Core(s) per socket: 3 > Socket(s): 8 > Virtualization features: > Hypervisor vendor: pHyp > Virtualization type: para > --------------------------------------------------------------------- > > $ cpupower idle-info > CPUidle driver: pseries_idle > CPUidle governor: menu > analyzing CPU 0: > > Number of idle states: 2 > Available idle states: snooze CEDE > snooze: > Flags/Description: snooze > > Latency: 0 > Residency: 0 > Usage: 6229 > Duration: 402142 > CEDE: > Flags/Description: CEDE > Latency: 12 > Residency: 120 > Usage: 191411 > Duration: 36329999037 > > --------------------------------------------------------------------- > PostgreSQL Benchmark: > --------------------------------------------------------------------- > > I ran pgbench with 224 clients and 20 threads for 600 seconds, > performing only SELECT queries against the pgbench database to > evaluate performance under read-intensive workloads: > > $ pgbench -c 224 -j 20 -T 600 -S pgbench > > Latency: > > |---|-------------|------------|------------| > | # | Before (ms) | After (ms) | Change (%) | > |Run| Patch | Patch | | > |---|-------------|------------|------------| > | 1 | 0.343 | 0.287 | -16.31% | > | 2 | 0.334 | 0.286 | -14.37% | > | 3 | 0.333 | 0.286 | -14.11% | > | 4 | 0.341 | 0.288 | -15.55% | > | 5 | 0.342 | 0.288 | -15.79% | > |---|-------------|------------|------------| > > Latency Reduction: After applying the patch, the latency decreased > by 14% to 16% across multiple runs. > > Throughput per second: > > |---|-------------|------------|------------| > | # | Before | After | Change (%) | > |Run| Patch | Patch | | > |---|-------------|------------|------------| > | 1 | 652,544.23 | 780,613.42 | +19.63% | > | 2 | 670,243.45 | 784,348.44 | +17.04% | > | 3 | 673,495.39 | 784,458.12 | +16.48% | > | 4 | 656,609.16 | 778,531.20 | +18.57% | > | 5 | 654,644.52 | 778,322.88 | +18.88% | > |---|-------------|------------|------------| Do you happen to have the idle stats here, too? Especially above/below see comment below. > > Transactions per second Improvement: The patch led to an increase in > TPS, ranging from 16% to 19%. 
> > This indicates that the patch significantly reduces latency while > improving throughput (TPS). pgbench is an OLTP benchmark and doesn't > do timer based wakeups, this is in-line with the improvements > I saw in the originally reported OLTP workload as well. > > --------------------------------------------------------------------- > Additional CPU Idle test with custom benchmark: > --------------------------------------------------------------------- > I also wrote a simple benchmark [1] to analyze the impact of cpuidle > state selection, comparing timer-based wakeups and non-timer > (pipe-based) wakeups. > > This test involves a single waker thread periodically waking up a > single wakee thread, simulating a repeating sleep-wake pattern. The > test was run with both timer-based and pipe-based wakeups, and cpuidle > statistics (usage, time, above, below) were collected. Thanks, pretty cool and just quickly checking it seems to be all there: hrtimer, different cpusets for wakee and waker, the pipe wakeup. That could be useful, only thing missing is the license header ;) > > Timer based wakeup: > > +------------------------+---------------------+---------------------+ > | Metric | Without Patch | With Patch | > +------------------------+---------------------+---------------------+ > | Wakee thread - CPU | 110 | 110 | > | Waker thread - CPU | 20 | 20 | > | Sleep Interval | 50 us | 50 us | > | Total Wakeups | - | - | > | Avg Wakeup Latency | - | - | > | Actual Sleep time | 52.639 us | 52.627 us | > +------------------------+---------------------+---------------------+ > | Idle State 0 Usage Diff| 94,879 | 94,879 | > | Idle State 0 Time Diff | 4,700,323 ns | 4,697,576 ns | > | Idle State 1 Usage Diff| 0 | 0 | > | Idle State 1 Time Diff | 0 ns | 0 ns | > +------------------------+---------------------+---------------------+ > | Total Above Usage | 0 (0.00%) | (0.00%) | > +------------------------+---------------------+---------------------+ > | Total Below Usage | 0 (0.00%) | 0 (0.00%) | > +------------------------+---------------------+---------------------+ > > In timer-based wakeups, the menu governor effectively predicts the idle > duration both with and without the patch. This ensures that there are > few or no instances of "Above" usage, allowing the CPU to remain in the > correct idle state. > > The condition (s->target_residency_ns <= data->next_timer_ns) in the menu > governor ensures that a physical idle state is not prioritized when a > timer event is expected before the target residency of the first physical > idle state. > > As a result, the patch has no impact in this case, and performance > remains stable with timer based wakeups. 
> > Pipe based wakeup (non-timer wakeup): > > +------------------------+---------------------+---------------------+ > | Metric | Without Patch | With Patch | > +------------------------+---------------------+---------------------+ > | Wakee thread - CPU | 110 | 110 | > | Waker thread - CPU | 20 | 20 | > | Sleep Interval | 50 us | 50 us | > | Total Wakeups | 97031 | 96583 | > | Avg Wakeup Latency | 7.070 us | 4.879 us | > | Actual Sleep time | 51.366 us | 51.605 us | > +------------------------+---------------------+---------------------+ > | Idle State 0 Usage Diff| 1209 | 96,586 | > | Idle State 0 Time Diff | 55,579 ns | 4,510,003 ns | > | Idle State 1 Usage Diff| 95,826 | 5 | > | Idle State 1 Time Diff | 4,522,639 ns | 198 ns | > +------------------------+---------------------+---------------------+ > +------------------------+---------------------+---------------------+ > | **Total Above Usage** | 95,824 (98.75%) | 5 (0.01%) | > +------------------------+---------------------+---------------------+ > +------------------------+---------------------+---------------------+ > | Total Below Usage | 0 (0.00%) | 0 (0.00%) | > +------------------------+---------------------+---------------------+ > > In the pipe-based wakeup scenario, before the patch was applied, the > "Above" metric was notably high around 98.75%. This suggests that the > menu governor frequently selected a deeper idle state like CEDE, even > when the idle period was relatively short. > > This happened because the menu governor is inclined to prioritize the > physical idle state (CEDE) even when the target residency time of the > physical idle state (s->target_residency_ns) was longer than the > predicted idle time (predicted_ns), so data->next_timer_ns won't be > relevant here in non-timer wakeups. > > In this test, despite the actual idle period being around 50 microseconds, > the menu governor still chose CEDE state, which has a target residency of > 120 microseconds. And the idle_miss statistics confirms that this was mostly wrong decisions in hindsight. I'll go through the menu code again, this indeed shouldn't be happening. I'd be very surprised if upstream teo performed as badly (or actually badly at all) on this, although it doesn't seem to help your actual workload, would you mind giving that a try too? Regards, Christian > > --------------------------------------------------------------------- > > While timer-based wakeups performed well even without the patch, workloads > that don't have timers as wakeup source but have predictable idle durations > shorter than the first idle state's target residency benefit significantly > from the patch. > > It will be helpful to understand why prioritizing deep physical idle > states over shallow ones even for short idle periods that don’t meet > the target residency like mentioned above is considered more beneficial. > > Is there something I could be missing here? > > Any comments or suggestions will be really helpful. > > [1] https://github.com/AboorvaDevarajan/linux-utils/tree/main/cpuidle/cpuidle_wakeup > > Thanks, > Aboorva >
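Regarding the requested above/below numbers for the pgbench runs: a minimal sketch (assuming the standard cpuidle sysfs layout) that dumps the per-state counters for one CPU; snapshotting it before and after a run and diffing gives the same usage/above/below deltas used in the tables above.

```
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static long read_counter(const char *base, const char *file)
{
	char path[512];
	long v = -1;
	FILE *f;

	snprintf(path, sizeof(path), "%s/%s", base, file);
	f = fopen(path, "r");
	if (f) {
		if (fscanf(f, "%ld", &v) != 1)
			v = -1;
		fclose(f);
	}
	return v;
}

int main(int argc, char **argv)
{
	int cpu = argc > 1 ? atoi(argv[1]) : 0;

	for (int state = 0; ; state++) {
		char base[256], path[512], name[64] = "?";
		FILE *f;

		snprintf(base, sizeof(base),
			 "/sys/devices/system/cpu/cpu%d/cpuidle/state%d",
			 cpu, state);
		snprintf(path, sizeof(path), "%s/name", base);
		f = fopen(path, "r");
		if (!f)
			break;		/* no more idle states */
		if (fscanf(f, "%63s", name) != 1)
			strcpy(name, "?");
		fclose(f);

		printf("cpu%d %-8s usage=%ld time=%ld above=%ld below=%ld\n",
		       cpu, name,
		       read_counter(base, "usage"),
		       read_counter(base, "time"),
		       read_counter(base, "above"),
		       read_counter(base, "below"));
	}
	return 0;
}
```

The time counter is reported in microseconds, while usage/above/below are event counts, matching the "Usage Diff" and "Total Above Usage" rows in the tables above.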
On 9/19/24 09:49, Christian Loehle wrote: > On 9/19/24 06:02, Aboorva Devarajan wrote: >> On Wed, 2024-08-21 at 11:55 +0100, Christian Loehle wrote: > >> >> Timer based wakeup: >> >> +------------------------+---------------------+---------------------+ >> | Metric | Without Patch | With Patch | >> +------------------------+---------------------+---------------------+ >> | Wakee thread - CPU | 110 | 110 | >> | Waker thread - CPU | 20 | 20 | >> | Sleep Interval | 50 us | 50 us | >> | Total Wakeups | - | - | >> | Avg Wakeup Latency | - | - | >> | Actual Sleep time | 52.639 us | 52.627 us | >> +------------------------+---------------------+---------------------+ >> | Idle State 0 Usage Diff| 94,879 | 94,879 | >> | Idle State 0 Time Diff | 4,700,323 ns | 4,697,576 ns | >> | Idle State 1 Usage Diff| 0 | 0 | >> | Idle State 1 Time Diff | 0 ns | 0 ns | >> +------------------------+---------------------+---------------------+ >> | Total Above Usage | 0 (0.00%) | (0.00%) | >> +------------------------+---------------------+---------------------+ >> | Total Below Usage | 0 (0.00%) | 0 (0.00%) | >> +------------------------+---------------------+---------------------+ >> >> In timer-based wakeups, the menu governor effectively predicts the idle >> duration both with and without the patch. This ensures that there are >> few or no instances of "Above" usage, allowing the CPU to remain in the >> correct idle state. >> >> The condition (s->target_residency_ns <= data->next_timer_ns) in the menu >> governor ensures that a physical idle state is not prioritized when a >> timer event is expected before the target residency of the first physical >> idle state. >> >> As a result, the patch has no impact in this case, and performance >> remains stable with timer based wakeups. >> >> Pipe based wakeup (non-timer wakeup): >> >> +------------------------+---------------------+---------------------+ >> | Metric | Without Patch | With Patch | >> +------------------------+---------------------+---------------------+ >> | Wakee thread - CPU | 110 | 110 | >> | Waker thread - CPU | 20 | 20 | >> | Sleep Interval | 50 us | 50 us | >> | Total Wakeups | 97031 | 96583 | >> | Avg Wakeup Latency | 7.070 us | 4.879 us | >> | Actual Sleep time | 51.366 us | 51.605 us | >> +------------------------+---------------------+---------------------+ >> | Idle State 0 Usage Diff| 1209 | 96,586 | >> | Idle State 0 Time Diff | 55,579 ns | 4,510,003 ns | >> | Idle State 1 Usage Diff| 95,826 | 5 | >> | Idle State 1 Time Diff | 4,522,639 ns | 198 ns | >> +------------------------+---------------------+---------------------+ >> +------------------------+---------------------+---------------------+ >> | **Total Above Usage** | 95,824 (98.75%) | 5 (0.01%) | >> +------------------------+---------------------+---------------------+ >> +------------------------+---------------------+---------------------+ >> | Total Below Usage | 0 (0.00%) | 0 (0.00%) | >> +------------------------+---------------------+---------------------+ >> >> In the pipe-based wakeup scenario, before the patch was applied, the >> "Above" metric was notably high around 98.75%. This suggests that the >> menu governor frequently selected a deeper idle state like CEDE, even >> when the idle period was relatively short. 
>> >> This happened because the menu governor is inclined to prioritize the >> physical idle state (CEDE) even when the target residency time of the >> physical idle state (s->target_residency_ns) was longer than the >> predicted idle time (predicted_ns), so data->next_timer_ns won't be >> relevant here in non-timer wakeups. >> >> In this test, despite the actual idle period being around 50 microseconds, >> the menu governor still chose CEDE state, which has a target residency of >> 120 microseconds. > > And the idle_miss statistics confirms that this was mostly wrong decisions > in hindsight. > I'll go through the menu code again, this indeed shouldn't be happening. > I'd be very surprised if upstream teo performed as badly (or actually badly > at all) on this, although it doesn't seem to help your actual workload, > would you mind giving that a try too? > > Regards, > Christian So with a workload as static as this, the recent interval detection of menu should take affect. It records the last 8 idle interval lengths and does some statistics do see if they are somewhat stable and uses those instead of the next timer event. While I wouldn't vouch for the statistics just yet, the results don't seem to be completely unreasonable. Do you mind trying a) printing some snapshots of these intervals and the resulting return value of get_typical_interval()? b) Trying this out? Setting these values to some around 50us, that returns 50us for me as expected and therefore should not select an idle state above that. -->8-- --- a/drivers/cpuidle/governors/menu.c +++ b/drivers/cpuidle/governors/menu.c @@ -112,6 +112,8 @@ struct menu_device { int interval_ptr; }; +static int intervals_dummy[] = {50, 52, 48, 50, 51, 49, 51, 51}; + static inline int which_bucket(u64 duration_ns, unsigned int nr_iowaiters) { int bucket = 0; @@ -177,7 +179,7 @@ static unsigned int get_typical_interval(struct menu_device *data) sum = 0; divisor = 0; for (i = 0; i < INTERVALS; i++) { - unsigned int value = data->intervals[i]; + unsigned int value = intervals_dummy[i]; if (value <= thresh) { sum += value; divisor++; @@ -200,7 +202,7 @@ static unsigned int get_typical_interval(struct menu_device *data) /* Then try to determine variance */ variance = 0; for (i = 0; i < INTERVALS; i++) { - unsigned int value = data->intervals[i]; + unsigned int value = intervals_dummy[i]; if (value <= thresh) { int64_t diff = (int64_t)value - avg; variance += diff * diff;
On 9/19/24 10:43, Christian Loehle wrote: > On 9/19/24 09:49, Christian Loehle wrote: >> On 9/19/24 06:02, Aboorva Devarajan wrote: >>> On Wed, 2024-08-21 at 11:55 +0100, Christian Loehle wrote: >> >>> >>> Timer based wakeup: >>> >>> +------------------------+---------------------+---------------------+ >>> | Metric | Without Patch | With Patch | >>> +------------------------+---------------------+---------------------+ >>> | Wakee thread - CPU | 110 | 110 | >>> | Waker thread - CPU | 20 | 20 | >>> | Sleep Interval | 50 us | 50 us | >>> | Total Wakeups | - | - | >>> | Avg Wakeup Latency | - | - | >>> | Actual Sleep time | 52.639 us | 52.627 us | >>> +------------------------+---------------------+---------------------+ >>> | Idle State 0 Usage Diff| 94,879 | 94,879 | >>> | Idle State 0 Time Diff | 4,700,323 ns | 4,697,576 ns | >>> | Idle State 1 Usage Diff| 0 | 0 | >>> | Idle State 1 Time Diff | 0 ns | 0 ns | >>> +------------------------+---------------------+---------------------+ >>> | Total Above Usage | 0 (0.00%) | (0.00%) | >>> +------------------------+---------------------+---------------------+ >>> | Total Below Usage | 0 (0.00%) | 0 (0.00%) | >>> +------------------------+---------------------+---------------------+ >>> >>> In timer-based wakeups, the menu governor effectively predicts the idle >>> duration both with and without the patch. This ensures that there are >>> few or no instances of "Above" usage, allowing the CPU to remain in the >>> correct idle state. >>> >>> The condition (s->target_residency_ns <= data->next_timer_ns) in the menu >>> governor ensures that a physical idle state is not prioritized when a >>> timer event is expected before the target residency of the first physical >>> idle state. >>> >>> As a result, the patch has no impact in this case, and performance >>> remains stable with timer based wakeups. >>> >>> Pipe based wakeup (non-timer wakeup): >>> >>> +------------------------+---------------------+---------------------+ >>> | Metric | Without Patch | With Patch | >>> +------------------------+---------------------+---------------------+ >>> | Wakee thread - CPU | 110 | 110 | >>> | Waker thread - CPU | 20 | 20 | >>> | Sleep Interval | 50 us | 50 us | >>> | Total Wakeups | 97031 | 96583 | >>> | Avg Wakeup Latency | 7.070 us | 4.879 us | >>> | Actual Sleep time | 51.366 us | 51.605 us | >>> +------------------------+---------------------+---------------------+ >>> | Idle State 0 Usage Diff| 1209 | 96,586 | >>> | Idle State 0 Time Diff | 55,579 ns | 4,510,003 ns | >>> | Idle State 1 Usage Diff| 95,826 | 5 | >>> | Idle State 1 Time Diff | 4,522,639 ns | 198 ns | >>> +------------------------+---------------------+---------------------+ >>> +------------------------+---------------------+---------------------+ >>> | **Total Above Usage** | 95,824 (98.75%) | 5 (0.01%) | >>> +------------------------+---------------------+---------------------+ >>> +------------------------+---------------------+---------------------+ >>> | Total Below Usage | 0 (0.00%) | 0 (0.00%) | >>> +------------------------+---------------------+---------------------+ >>> >>> In the pipe-based wakeup scenario, before the patch was applied, the >>> "Above" metric was notably high around 98.75%. This suggests that the >>> menu governor frequently selected a deeper idle state like CEDE, even >>> when the idle period was relatively short. 
>>> >>> This happened because the menu governor is inclined to prioritize the >>> physical idle state (CEDE) even when the target residency time of the >>> physical idle state (s->target_residency_ns) was longer than the >>> predicted idle time (predicted_ns), so data->next_timer_ns won't be >>> relevant here in non-timer wakeups. >>> >>> In this test, despite the actual idle period being around 50 microseconds, >>> the menu governor still chose CEDE state, which has a target residency of >>> 120 microseconds. >> >> And the idle_miss statistics confirms that this was mostly wrong decisions >> in hindsight. >> I'll go through the menu code again, this indeed shouldn't be happening. >> I'd be very surprised if upstream teo performed as badly (or actually badly >> at all) on this, although it doesn't seem to help your actual workload, >> would you mind giving that a try too? >> >> Regards, >> Christian > > So with a workload as static as this, the recent interval detection of menu > should take affect. It records the last 8 idle interval lengths and does some > statistics do see if they are somewhat stable and uses those instead of the > next timer event. > While I wouldn't vouch for the statistics just yet, the results don't seem > to be completely unreasonable. > Do you mind trying a) printing some snapshots of these intervals and the > resulting return value of get_typical_interval()? > b) Trying this out? Setting these values to some around 50us, that returns > 50us for me as expected and therefore should not select an idle state above > that. > > -->8-- > > --- a/drivers/cpuidle/governors/menu.c > +++ b/drivers/cpuidle/governors/menu.c > @@ -112,6 +112,8 @@ struct menu_device { > int interval_ptr; > }; > > +static int intervals_dummy[] = {50, 52, 48, 50, 51, 49, 51, 51}; So just to illustrate this a bit more: {51, 80, 90, 54, 72, 60, 80, 117} is insignificant, but decrementing the last: {51, 80, 90, 54, 72, 60, 80, 116} is significant (75us returned) That sounds about reasonable but I think something like that would also be observable for you. It's an unfortunate edge case. I wouldn't want to mess with the ancient statistics (and silently break an unknown amount of setups). I much prefer the teo intercept approach that makes this idle state dependent and doesn't try to get one formula right for all setups. That is all assuming that a) get_typical_interval() doesn't return a significant interval length for you and b) replacing that with dummy significant values then fixes it (and not the error being some menu logic that I overlooked). Thinking out loud regarding the menu logic: If we have ~50us predicted interval and your state setup the code then goes as follows: predicted_ns = 50000 ... data->next_timer_ns = KTIME_MAX; delta_tick = TICK_NSEC / 2; data->bucket = which_bucket(KTIME_MAX, nr_iowaiters); // bucket highest state probably And then we have the big for loop for state 0 POLLING and state 1 CEDE target_residency_ns==100000: trivially we have for i = 0: if (idx == -1) idx = i; /* first enabled state */ and for i = 1: if (s->target_residency_ns > predicted_ns) { // True ... if (predicted_ns < TICK_NSEC) // True break; and should end up with idx == 0 and *stop_tick = false, all good. If my above assumptions in a) and b) do not hold then I must've made a mistake reading the code just now.
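A quick user-space replica of the avg/variance check for playing with interval sets like the ones above. This is a single pass only: the upstream function additionally retries after discarding upside outliers, which is omitted here, and the thresholds below are my reading of current menu.c rather than anything quoted in this thread.

```
#include <stdio.h>
#include <stdint.h>

#define INTERVALS 8

/* Returns the "typical" interval in us, or UINT32_MAX if the samples are
 * judged too unstable.  Thresholds: stddev <= 20 us (variance <= 400) or
 * avg > 6 * stddev; with a fixed divisor of 8 samples the "at least 6 of
 * 8 samples" part of the upstream check is always satisfied. */
static unsigned int typical_interval(const unsigned int *intervals)
{
	uint64_t sum = 0, variance = 0;
	unsigned int avg;
	int i;

	for (i = 0; i < INTERVALS; i++)
		sum += intervals[i];
	avg = sum / INTERVALS;

	for (i = 0; i < INTERVALS; i++) {
		int64_t diff = (int64_t)intervals[i] - avg;
		variance += (uint64_t)(diff * diff);
	}
	variance /= INTERVALS;

	printf("avg=%u us variance=%llu us^2 -> ", avg,
	       (unsigned long long)variance);

	if (((uint64_t)avg * avg > variance * 36) || variance <= 400) {
		printf("significant\n");
		return avg;
	}
	printf("insignificant\n");
	return UINT32_MAX;
}

int main(void)
{
	const unsigned int stable[INTERVALS]     = { 50, 52, 48, 50, 51, 49, 51, 51 };
	const unsigned int spiky[INTERVALS]      = { 51, 80, 90, 54, 72, 60, 80, 117 };
	const unsigned int borderline[INTERVALS] = { 51, 80, 90, 54, 72, 60, 80, 116 };

	typical_interval(stable);	/* ~50 us, tiny variance          */
	typical_interval(spiky);	/* variance just above the cutoff */
	typical_interval(borderline);	/* variance lands exactly on 400  */
	return 0;
}
```

Because the retry after dropping the top outlier is omitted, borderline sets such as the one ending in 117 can come out differently here than in an instrumented kernel; treat this as a rough aid, not a reference implementation.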
On 9/19/24 15:07, Christian Loehle wrote: > On 9/19/24 10:43, Christian Loehle wrote: >> On 9/19/24 09:49, Christian Loehle wrote: >>> On 9/19/24 06:02, Aboorva Devarajan wrote: >>>> On Wed, 2024-08-21 at 11:55 +0100, Christian Loehle wrote: >>> >>>> >>>> Timer based wakeup: >>>> >>>> +------------------------+---------------------+---------------------+ >>>> | Metric | Without Patch | With Patch | >>>> +------------------------+---------------------+---------------------+ >>>> | Wakee thread - CPU | 110 | 110 | >>>> | Waker thread - CPU | 20 | 20 | >>>> | Sleep Interval | 50 us | 50 us | >>>> | Total Wakeups | - | - | >>>> | Avg Wakeup Latency | - | - | >>>> | Actual Sleep time | 52.639 us | 52.627 us | >>>> +------------------------+---------------------+---------------------+ >>>> | Idle State 0 Usage Diff| 94,879 | 94,879 | >>>> | Idle State 0 Time Diff | 4,700,323 ns | 4,697,576 ns | >>>> | Idle State 1 Usage Diff| 0 | 0 | >>>> | Idle State 1 Time Diff | 0 ns | 0 ns | >>>> +------------------------+---------------------+---------------------+ >>>> | Total Above Usage | 0 (0.00%) | (0.00%) | >>>> +------------------------+---------------------+---------------------+ >>>> | Total Below Usage | 0 (0.00%) | 0 (0.00%) | >>>> +------------------------+---------------------+---------------------+ >>>> >>>> In timer-based wakeups, the menu governor effectively predicts the idle >>>> duration both with and without the patch. This ensures that there are >>>> few or no instances of "Above" usage, allowing the CPU to remain in the >>>> correct idle state. >>>> >>>> The condition (s->target_residency_ns <= data->next_timer_ns) in the menu >>>> governor ensures that a physical idle state is not prioritized when a >>>> timer event is expected before the target residency of the first physical >>>> idle state. >>>> >>>> As a result, the patch has no impact in this case, and performance >>>> remains stable with timer based wakeups. >>>> >>>> Pipe based wakeup (non-timer wakeup): >>>> >>>> +------------------------+---------------------+---------------------+ >>>> | Metric | Without Patch | With Patch | >>>> +------------------------+---------------------+---------------------+ >>>> | Wakee thread - CPU | 110 | 110 | >>>> | Waker thread - CPU | 20 | 20 | >>>> | Sleep Interval | 50 us | 50 us | >>>> | Total Wakeups | 97031 | 96583 | >>>> | Avg Wakeup Latency | 7.070 us | 4.879 us | >>>> | Actual Sleep time | 51.366 us | 51.605 us | >>>> +------------------------+---------------------+---------------------+ >>>> | Idle State 0 Usage Diff| 1209 | 96,586 | >>>> | Idle State 0 Time Diff | 55,579 ns | 4,510,003 ns | >>>> | Idle State 1 Usage Diff| 95,826 | 5 | >>>> | Idle State 1 Time Diff | 4,522,639 ns | 198 ns | >>>> +------------------------+---------------------+---------------------+ >>>> +------------------------+---------------------+---------------------+ >>>> | **Total Above Usage** | 95,824 (98.75%) | 5 (0.01%) | >>>> +------------------------+---------------------+---------------------+ >>>> +------------------------+---------------------+---------------------+ >>>> | Total Below Usage | 0 (0.00%) | 0 (0.00%) | >>>> +------------------------+---------------------+---------------------+ >>>> >>>> In the pipe-based wakeup scenario, before the patch was applied, the >>>> "Above" metric was notably high around 98.75%. This suggests that the >>>> menu governor frequently selected a deeper idle state like CEDE, even >>>> when the idle period was relatively short. 
>>>> >>>> This happened because the menu governor is inclined to prioritize the >>>> physical idle state (CEDE) even when the target residency time of the >>>> physical idle state (s->target_residency_ns) was longer than the >>>> predicted idle time (predicted_ns), so data->next_timer_ns won't be >>>> relevant here in non-timer wakeups. >>>> >>>> In this test, despite the actual idle period being around 50 microseconds, >>>> the menu governor still chose CEDE state, which has a target residency of >>>> 120 microseconds. >>> >>> And the idle_miss statistics confirms that this was mostly wrong decisions >>> in hindsight. >>> I'll go through the menu code again, this indeed shouldn't be happening. >>> I'd be very surprised if upstream teo performed as badly (or actually badly >>> at all) on this, although it doesn't seem to help your actual workload, >>> would you mind giving that a try too? >>> >>> Regards, >>> Christian >> >> So with a workload as static as this, the recent interval detection of menu >> should take affect. It records the last 8 idle interval lengths and does some >> statistics do see if they are somewhat stable and uses those instead of the >> next timer event. >> While I wouldn't vouch for the statistics just yet, the results don't seem >> to be completely unreasonable. >> Do you mind trying a) printing some snapshots of these intervals and the >> resulting return value of get_typical_interval()? >> b) Trying this out? Setting these values to some around 50us, that returns >> 50us for me as expected and therefore should not select an idle state above >> that. >> >> -->8-- >> >> --- a/drivers/cpuidle/governors/menu.c >> +++ b/drivers/cpuidle/governors/menu.c >> @@ -112,6 +112,8 @@ struct menu_device { >> int interval_ptr; >> }; >> >> +static int intervals_dummy[] = {50, 52, 48, 50, 51, 49, 51, 51}; > > So just to illustrate this a bit more: > {51, 80, 90, 54, 72, 60, 80, 117} is insignificant, but decrementing the last: > {51, 80, 90, 54, 72, 60, 80, 116} is significant (75us returned) > That sounds about reasonable but I think something like that would > also be observable for you. It's an unfortunate edge case. > I wouldn't want to mess with the ancient statistics (and silently break > an unknown amount of setups). I much prefer the teo intercept approach that > makes this idle state dependent and doesn't try to get one formula right > for all setups. > > That is all assuming that a) get_typical_interval() doesn't return a > significant interval length for you and b) replacing that with dummy > significant values then fixes it (and not the error being some menu > logic that I overlooked). > > Thinking out loud regarding the menu logic: > If we have ~50us predicted interval and your state setup the code then > goes as follows: > > predicted_ns = 50000 > ... > data->next_timer_ns = KTIME_MAX; > delta_tick = TICK_NSEC / 2; > data->bucket = which_bucket(KTIME_MAX, nr_iowaiters); // bucket highest state probably > > And then we have the big for loop for state 0 POLLING and state 1 CEDE target_residency_ns==100000: > trivially we have for i = 0: > if (idx == -1) > idx = i; /* first enabled state */ > and for i = 1: > if (s->target_residency_ns > predicted_ns) { // True > ... > if (predicted_ns < TICK_NSEC) // True > break; > and should end up with idx == 0 and *stop_tick = false, all good. > If my above assumptions in a) and b) do not hold then I must've made a > mistake reading the code just now. > And of course I realise right after sending I did indeed make a mistake. 
RESIDENCY_THRESHOLD_NS is 15us, so the timer is actually polled and the fields are populated accurately, but also in the for loop the polling check is applied to the current candidate, idx == 0. predicted_ns should still be the value from get_typical_interval(), but that is unable to ever enforce a polling state, which makes your proposed code change the only obvious fix. Essentially, with only one non-polling state, the get_typical_interval() logic might as well be skipped; it never affects anything.

If I'm correct this time then I'm not really sure how to proceed. On the one hand, if we trust predicted_ns we should indeed select a state with predicted_ns < s->target_residency_ns, no matter if polling or not, but that would be quite a behaviour change for x86 users too.

I'm still curious why teo wouldn't solve your problem; its intercept logic doesn't care whether the state it ends up selecting is a polling one.
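[To make the selection flow under discussion easier to follow, here is a minimal, self-contained model of it in plain userspace C. It is not the kernel's menu.c: the function and type names are made up, and the state parameters are simply the snooze/CEDE values quoted in this thread (0us/120us residency, 12us exit latency). With a ~50us prediction and no timer due soon, the walk still ends on the physical state, because the branch that fires once the shallower candidate is a polling state does not consult predicted_ns at all.]

```c
/*
 * Much-simplified standalone model of the selection flow discussed above
 * (not the kernel code): one polling state plus one physical state, a
 * predicted_ns coming from the recent-interval statistics, and the
 * "prefer a physical state over busy polling unless a timer fires soon"
 * promotion applied to the already-chosen polling candidate.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct state {
	const char *name;
	bool polling;
	uint64_t target_residency_ns;
	uint64_t exit_latency_ns;
};

static int pick_state(const struct state *s, int count, uint64_t predicted_ns,
		      uint64_t latency_req_ns, uint64_t next_timer_ns)
{
	int idx = 0; /* first enabled state */

	for (int i = 1; i < count; i++) {
		if (s[i].target_residency_ns > predicted_ns) {
			/*
			 * The promotion under discussion: if the current
			 * candidate is a polling state, the deeper state's
			 * exit latency fits, and no timer is due before its
			 * residency, take the physical state anyway --
			 * predicted_ns plays no further role here.
			 */
			if (s[idx].polling &&
			    s[i].exit_latency_ns <= latency_req_ns &&
			    s[i].target_residency_ns <= next_timer_ns)
				idx = i;
			break;
		}
		if (s[i].exit_latency_ns > latency_req_ns)
			break;
		idx = i;
	}
	return idx;
}

int main(void)
{
	const struct state states[] = {
		{ "snooze", true,  0,          0 },
		{ "CEDE",   false, 120 * 1000, 12 * 1000 },
	};
	/* ~50us predicted sleep, no latency limit, no timer for 1ms. */
	int idx = pick_state(states, 2, 50 * 1000, UINT64_MAX, 1000 * 1000);

	printf("selected: %s\n", states[idx].name); /* prints "CEDE" */
	return 0;
}
```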
On Thu, 2024-09-19 at 09:49 +0100, Christian Loehle wrote: Hi Christian, > On 9/19/24 06:02, Aboorva Devarajan wrote: > > On Wed, 2024-08-21 at 11:55 +0100, Christian Loehle wrote: > > > > > On 8/20/24 09:51, Aboorva Devarajan wrote: > > > > On Tue, 2024-08-13 at 13:56 +0100, Christian Loehle wrote: > > > > > > ... > > > The wakeup source(s), since they don't seem to be timer events would be > > > interesting, although a bit of a hassle to get right. > > > What's the workload anyway? > > > > > > > I identified the part of the internal workload responsible for performance > > improvement with the patch and it appears to be the OLTP queries, and yes > > the wakeup sources are not timer events. > > > > --------------------------------------------------------------------- > > > > Additionally to reproduce the issue using a more open and accessible > > benchmark, conducted similar experiments using pgbench and I observed > > similar improvements with the patch, particularly when running the > > read intensive select query benchmarks. > > > > --------------------------------------------------------------------- > > System Details > > --------------------------------------------------------------------- > > $ lscpu > > Architecture: ppc64le > > Byte Order: Little Endian > > CPU(s): 224 > > On-line CPU(s) list: 0-223 > > Model name: POWER10 (architected), altivec supported > > Model: 2.0 (pvr 0080 0200) > > Thread(s) per core: 8 > > Core(s) per socket: 3 > > Socket(s): 8 > > Virtualization features: > > Hypervisor vendor: pHyp > > Virtualization type: para > > --------------------------------------------------------------------- > > > > $ cpupower idle-info > > CPUidle driver: pseries_idle > > CPUidle governor: menu > > analyzing CPU 0: > > > > Number of idle states: 2 > > Available idle states: snooze CEDE > > snooze: > > Flags/Description: snooze > > > > Latency: 0 > > Residency: 0 > > Usage: 6229 > > Duration: 402142 > > CEDE: > > Flags/Description: CEDE > > Latency: 12 > > Residency: 120 > > Usage: 191411 > > Duration: 36329999037 > > > > --------------------------------------------------------------------- > > PostgreSQL Benchmark: > > --------------------------------------------------------------------- > > > > I ran pgbench with 224 clients and 20 threads for 600 seconds, > > performing only SELECT queries against the pgbench database to > > evaluate performance under read-intensive workloads: > > > > $ pgbench -c 224 -j 20 -T 600 -S pgbench > > > > Latency: > > > > > ---|-------------|------------|------------| > > > # | Before (ms) | After (ms) | Change (%) | > > > Run| Patch | Patch | | > > > ---|-------------|------------|------------| > > > 1 | 0.343 | 0.287 | -16.31% | > > > 2 | 0.334 | 0.286 | -14.37% | > > > 3 | 0.333 | 0.286 | -14.11% | > > > 4 | 0.341 | 0.288 | -15.55% | > > > 5 | 0.342 | 0.288 | -15.79% | > > > ---|-------------|------------|------------| > > > > Latency Reduction: After applying the patch, the latency decreased > > by 14% to 16% across multiple runs. 
> > > > Throughput per second: > > > > > ---|-------------|------------|------------| > > > # | Before | After | Change (%) | > > > Run| Patch | Patch | | > > > ---|-------------|------------|------------| > > > 1 | 652,544.23 | 780,613.42 | +19.63% | > > > 2 | 670,243.45 | 784,348.44 | +17.04% | > > > 3 | 673,495.39 | 784,458.12 | +16.48% | > > > 4 | 656,609.16 | 778,531.20 | +18.57% | > > > 5 | 654,644.52 | 778,322.88 | +18.88% | > > > ---|-------------|------------|------------| > > Do you happen to have the idle stats here, too? > Especially above/below see comment below.

I ran the benchmark again and collected the idle statistics on a different system since I no longer have access to the previous one.

Here are the stats I gathered, including data for the menu, teo, and ladder governors:

Metric                     | Ladder      | Teo         | Menu        | Menu Patched
--------------------------------------------------------------------------------------------
Transactions Processed     | 42,902,188  | 48,499,709  | 42,653,477  | 48,648,867
Latency (ms)               | 0.313       | 0.277       | 0.315       | 0.276
TPS                        | 715,340.96  | 808,672.59  | 711,123.13  | 811,137.40
--------------------------------------------------------------------------------------------
Total Usage Difference     | 46,680,184  | 66,348,992  | 47,892,509  | 62,366,337
Usage Diff (CEDE)          | 46,680,184  | 7,612,960   | 45,421,436  | 19,237,430
Usage Diff (Snooze)        | 0           | 58,736,032  | 2,471,073   | 43,128,907
--------------------------------------------------------------------------------------------
Total Time Difference (s)  | 5,552.20    | 5,242.75    | 5,534.62    | 5,238.46
Time Diff (CEDE)           | 5,552.20    | 2,497.89    | 5,431.91    | 3,324.99
Time Diff (Snooze)         | 0.00        | 2,744.86    | 102.71      | 1,913.47
--------------------------------------------------------------------------------------------
Total Above Diff           | 40,381,398  | 4,520,998   | 38,942,725  | 15,072,345
Above Diff (CEDE)          | 40,381,398  | 4,520,998   | 38,942,725  | 15,072,345
Above Diff (Snooze)        | 0           | 0           | 0           | 0
--------------------------------------------------------------------------------------------
Total Below Diff           | 0           | 12,362,392  | 405,357     | 8,172,845
Below Diff (CEDE)          | 0           | 0           | 0           | 0
Below Diff (Snooze)        | 0           | 12,362,392  | 405,357     | 8,172,845
--------------------------------------------------------------------------------------------
% Above w.r.t. Usage       | 86.51%      | 6.82%       | 81.30%      | 24.17%
--------------------------------------------------------------------------------------------
% Below w.r.t. Usage       | 0%          | 18.64%      | 0.85%       | 13.10%
--------------------------------------------------------------------------------------------

--------------------------------------------------------------------------------------------
Note:

% Above w.r.t Usage = (Total Above Difference / Total Usage Difference) * 100
% Below w.r.t Usage = (Total Below Difference / Total Usage Difference) * 100

Menu, Teo, Ladder: This is with menu, teo and ladder governor enabled respectively on v6.12-rc1.

Menu Patched: This is with menu governor enabled on v6.12-rc +
https://lore.kernel.org/all/20240809073120.250974-2-aboorvad@linux.ibm.com/
--------------------------------------------------------------------------------------------

--------------------------------------------------------------------------------------------
Inference:
--------------------------------------------------------------------------------------------

Transactions Processed: The number of transactions processed in Menu Patched is
14.06% higher compared to Menu.
Latency: There is a 12.38% reduction in latency in Menu Patched compared to Menu. TPS (Transactions Per Second): The TPS in Menu Patched is 14.06% higher than in Menu. -------------------------------------------------------------------------------------------- While this patch hasn't completely eliminated the cpuidle miss ratio, but it has improved see Above w.r.t Usage, Below w.r.t Usage. I'll keep investigating why there's still a 24% miss rate in the "above" and 13% in the "below" stats after the patch. This could be a different issue. Additionally, I've seen good performance improvements using the teo governor with the pgbench benchmark, although it didn't provide the same benefit in the original test. > > > Transactions per second Improvement: The patch led to an increase in > > TPS, ranging from 16% to 19%. > > > > This indicates that the patch significantly reduces latency while > > improving throughput (TPS). pgbench is an OLTP benchmark and doesn't > > do timer based wakeups, this is in-line with the improvements > > I saw in the originally reported OLTP workload as well. > > > > --------------------------------------------------------------------- > > Additional CPU Idle test with custom benchmark: > > --------------------------------------------------------------------- > > I also wrote a simple benchmark [1] to analyze the impact of cpuidle > > state selection, comparing timer-based wakeups and non-timer > > (pipe-based) wakeups. > > > > This test involves a single waker thread periodically waking up a > > single wakee thread, simulating a repeating sleep-wake pattern. The > > test was run with both timer-based and pipe-based wakeups, and cpuidle > > statistics (usage, time, above, below) were collected. > > Thanks, pretty cool and just quickly checking it seems to be all there: > hrtimer, different cpusets for wakee and waker, the pipe wakeup. > That could be useful, only thing missing is the license header ;) sure, I will add this. Thanks for pointing it out. > > > Timer based wakeup: > > > > +------------------------+---------------------+---------------------+ > > > Metric | Without Patch | With Patch | > > +------------------------+---------------------+---------------------+ > > > Wakee thread - CPU | 110 | 110 | > > > Waker thread - CPU | 20 | 20 | > > > Sleep Interval | 50 us | 50 us | > > > Total Wakeups | - | - | > > > Avg Wakeup Latency | - | - | > > > Actual Sleep time | 52.639 us | 52.627 us | > > +------------------------+---------------------+---------------------+ > > > Idle State 0 Usage Diff| 94,879 | 94,879 | > > > Idle State 0 Time Diff | 4,700,323 ns | 4,697,576 ns | > > > Idle State 1 Usage Diff| 0 | 0 | > > > Idle State 1 Time Diff | 0 ns | 0 ns | > > +------------------------+---------------------+---------------------+ > > > Total Above Usage | 0 (0.00%) | (0.00%) | > > +------------------------+---------------------+---------------------+ > > > Total Below Usage | 0 (0.00%) | 0 (0.00%) | > > +------------------------+---------------------+---------------------+ > > > > In timer-based wakeups, the menu governor effectively predicts the idle > > duration both with and without the patch. This ensures that there are > > few or no instances of "Above" usage, allowing the CPU to remain in the > > correct idle state. 
> > > > The condition (s->target_residency_ns <= data->next_timer_ns) in the menu > > governor ensures that a physical idle state is not prioritized when a > > timer event is expected before the target residency of the first physical > > idle state. > > > > As a result, the patch has no impact in this case, and performance > > remains stable with timer based wakeups. > > > > Pipe based wakeup (non-timer wakeup): > > > > +------------------------+---------------------+---------------------+ > > > Metric | Without Patch | With Patch | > > +------------------------+---------------------+---------------------+ > > > Wakee thread - CPU | 110 | 110 | > > > Waker thread - CPU | 20 | 20 | > > > Sleep Interval | 50 us | 50 us | > > > Total Wakeups | 97031 | 96583 | > > > Avg Wakeup Latency | 7.070 us | 4.879 us | > > > Actual Sleep time | 51.366 us | 51.605 us | > > +------------------------+---------------------+---------------------+ > > > Idle State 0 Usage Diff| 1209 | 96,586 | > > > Idle State 0 Time Diff | 55,579 ns | 4,510,003 ns | > > > Idle State 1 Usage Diff| 95,826 | 5 | > > > Idle State 1 Time Diff | 4,522,639 ns | 198 ns | > > +------------------------+---------------------+---------------------+ > > +------------------------+---------------------+---------------------+ > > > **Total Above Usage** | 95,824 (98.75%) | 5 (0.01%) | > > +------------------------+---------------------+---------------------+ > > +------------------------+---------------------+---------------------+ > > > Total Below Usage | 0 (0.00%) | 0 (0.00%) | > > +------------------------+---------------------+---------------------+ > > > > In the pipe-based wakeup scenario, before the patch was applied, the > > "Above" metric was notably high around 98.75%. This suggests that the > > menu governor frequently selected a deeper idle state like CEDE, even > > when the idle period was relatively short. > > > > This happened because the menu governor is inclined to prioritize the > > physical idle state (CEDE) even when the target residency time of the > > physical idle state (s->target_residency_ns) was longer than the > > predicted idle time (predicted_ns), so data->next_timer_ns won't be > > relevant here in non-timer wakeups. > > > > In this test, despite the actual idle period being around 50 microseconds, > > the menu governor still chose CEDE state, which has a target residency of > > 120 microseconds. > > And the idle_miss statistics confirms that this was mostly wrong decisions > in hindsight. > I'll go through the menu code again, this indeed shouldn't be happening. > I'd be very surprised if upstream teo performed as badly (or actually badly > at all) on this, although it doesn't seem to help your actual workload, > would you mind giving that a try too? I ran a similar benchmark using teo cpuidle governor, presented the averaged out values across 10 runs (has low standard deviation). 
Below are the results for a pipe-based wakeup with an approximate 50 microsecond sleep interval:

Pipe based wakeup with approx 50 us as sleep interval:

Metric                        Ladder         Menu           Teo            Menu Patched
----------------------------------------------------------------------------------------
Wakeups                       579,690        579,951        578,328        578,363
Avg wakeup latency (us)       7.456          7.112          4.46067        4.48733
========================================================================================
Sleep interval (us)           51.7333        51.7097        51.8533        51.8477
========================================================================================
State 0 Usage diff            0              7,324          578,361        578,407
State 0 Time diff (ns)        0              340,115        2.75953e+07    2.75696e+07
State 0 Above diff            0              0              0              0
State 0 Below diff            0              0              0.333333       0.666667
========================================================================================
State 1 Usage diff            580,114        572,644        0.666667       9.33333
State 1 Time diff (ns)        2.74189e+07    2.73901e+07    20.6667        445.333
State 1 Above diff            579,993        572,535        0.666667       9.33333
State 1 Below diff            0              0              0              0
Total Above                   579,993        572,535        0.666667       9.33333
Total Below                   0              0              0.333333       0.666667
========================================================================================
Total Above Percent           99.98%         98.7167%       0%             0%   --> [1]
Total Below Percent           0%             0%             0%             0%
========================================================================================

Timer based wakeup with approx 50 us as sleep interval:

Metric                        Ladder         Menu           Teo            Menu Patched
----------------------------------------------------------------------------------------
Sleep interval (us)           53.775         52.3687        52.746         52.327
========================================================================================
State 0 Usage diff            0              572,575        568,419        573,109
State 0 Time diff (ns)        0              2.84142e+07    2.81956e+07    2.84548e+07
State 0 Above diff            0              0              0              0
State 0 Below diff            0              0.333333       1              0.333333
========================================================================================
State 1 Usage diff            558,285        0.333333       0              0
State 1 Time diff (ns)        2.80965e+07    17             0              0
State 1 Above diff            558,215        0.333333       0              0
State 1 Below diff            0              0              0              0
========================================================================================
Total Above                   558,215        0.333333       0              0
Total Below                   0              0.333333       1              0.333333
========================================================================================
Total Above Percent           99.99%         0%             0%             0%
Total Below Percent           0%             0%             0%             0%
========================================================================================

[1] The results do show that the teo governor doesn't have this issue.

> > Regards, > Christian > > > --------------------------------------------------------------------- > > > > While timer-based wakeups performed well even without the patch, workloads > > that don't have timers as wakeup source but have predictable idle durations > > shorter than the first idle state's target residency benefit significantly > > from the patch. > > > > It will be helpful to understand why prioritizing deep physical idle > > states over shallow ones even for short idle periods that don’t meet > > the target residency like mentioned above is considered more beneficial. > > > > Is there something I could be missing here? > > > > Any comments or suggestions will be really helpful. > > > > [1] https://github.com/AboorvaDevarajan/linux-utils/tree/main/cpuidle/cpuidle_wakeup > > > > Thanks, > > Aboorva > >
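[For readers who want to approximate the pipe-based numbers above without the full benchmark linked at [1], a bare-bones sketch of the same waker/wakee pattern is shown below. This is not the linked benchmark: error handling is omitted, the CPU numbers 20 and 110 are simply the waker/wakee CPUs from the tables in this thread, and the reported latency includes the pipe write overhead.]

```c
/*
 * Minimal pipe-based waker/wakee sketch: the waker sleeps ~50us and then
 * writes a timestamp into a pipe; the wakee blocks in read() (so its CPU
 * goes idle between wakeups) and accumulates the wakeup latency.
 * Build with: gcc -O2 -pthread wakeup.c -o wakeup
 */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define ITERATIONS 100000

static int pipefd[2];

static void pin_to_cpu(int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

static void *waker(void *arg)
{
	pin_to_cpu(20);				/* waker CPU from the tables */
	for (int i = 0; i < ITERATIONS; i++) {
		usleep(50);			/* ~50us between wakeups */
		uint64_t t = now_ns();
		write(pipefd[1], &t, sizeof(t));
	}
	return NULL;
}

int main(void)
{
	pthread_t tid;
	uint64_t t, total = 0, n = 0;

	pipe(pipefd);
	pin_to_cpu(110);			/* wakee CPU from the tables */
	pthread_create(&tid, NULL, waker, NULL);

	while (n < ITERATIONS && read(pipefd[0], &t, sizeof(t)) == sizeof(t)) {
		total += now_ns() - t;		/* wakeup latency incl. pipe overhead */
		n++;
	}
	pthread_join(tid, NULL);
	printf("wakeups: %llu, avg wakeup latency: %.3f us\n",
	       (unsigned long long)n, total / (double)n / 1000.0);
	return 0;
}
```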
On Thu, 2024-09-19 at 10:43 +0100, Christian Loehle wrote: > On 9/19/24 09:49, Christian Loehle wrote: > > On 9/19/24 06:02, Aboorva Devarajan wrote: > > > On Wed, 2024-08-21 at 11:55 +0100, Christian Loehle wrote: > > > Timer based wakeup: > > > > > > +------------------------+---------------------+---------------------+ > > > > Metric | Without Patch | With Patch | > > > +------------------------+---------------------+---------------------+ > > > > Wakee thread - CPU | 110 | 110 | > > > > Waker thread - CPU | 20 | 20 | > > > > Sleep Interval | 50 us | 50 us | > > > > Total Wakeups | - | - | > > > > Avg Wakeup Latency | - | - | > > > > Actual Sleep time | 52.639 us | 52.627 us | > > > +------------------------+---------------------+---------------------+ > > > > Idle State 0 Usage Diff| 94,879 | 94,879 | > > > > Idle State 0 Time Diff | 4,700,323 ns | 4,697,576 ns | > > > > Idle State 1 Usage Diff| 0 | 0 | > > > > Idle State 1 Time Diff | 0 ns | 0 ns | > > > +------------------------+---------------------+---------------------+ > > > > Total Above Usage | 0 (0.00%) | (0.00%) | > > > +------------------------+---------------------+---------------------+ > > > > Total Below Usage | 0 (0.00%) | 0 (0.00%) | > > > +------------------------+---------------------+---------------------+ > > > > > > In timer-based wakeups, the menu governor effectively predicts the idle > > > duration both with and without the patch. This ensures that there are > > > few or no instances of "Above" usage, allowing the CPU to remain in the > > > correct idle state. > > > > > > The condition (s->target_residency_ns <= data->next_timer_ns) in the menu > > > governor ensures that a physical idle state is not prioritized when a > > > timer event is expected before the target residency of the first physical > > > idle state. > > > > > > As a result, the patch has no impact in this case, and performance > > > remains stable with timer based wakeups. > > > > > > Pipe based wakeup (non-timer wakeup): > > > > > > +------------------------+---------------------+---------------------+ > > > > Metric | Without Patch | With Patch | > > > +------------------------+---------------------+---------------------+ > > > > Wakee thread - CPU | 110 | 110 | > > > > Waker thread - CPU | 20 | 20 | > > > > Sleep Interval | 50 us | 50 us | > > > > Total Wakeups | 97031 | 96583 | > > > > Avg Wakeup Latency | 7.070 us | 4.879 us | > > > > Actual Sleep time | 51.366 us | 51.605 us | > > > +------------------------+---------------------+---------------------+ > > > > Idle State 0 Usage Diff| 1209 | 96,586 | > > > > Idle State 0 Time Diff | 55,579 ns | 4,510,003 ns | > > > > Idle State 1 Usage Diff| 95,826 | 5 | > > > > Idle State 1 Time Diff | 4,522,639 ns | 198 ns | > > > +------------------------+---------------------+---------------------+ > > > +------------------------+---------------------+---------------------+ > > > > **Total Above Usage** | 95,824 (98.75%) | 5 (0.01%) | > > > +------------------------+---------------------+---------------------+ > > > +------------------------+---------------------+---------------------+ > > > > Total Below Usage | 0 (0.00%) | 0 (0.00%) | > > > +------------------------+---------------------+---------------------+ > > > > > > In the pipe-based wakeup scenario, before the patch was applied, the > > > "Above" metric was notably high around 98.75%. This suggests that the > > > menu governor frequently selected a deeper idle state like CEDE, even > > > when the idle period was relatively short. 
> > > > > > This happened because the menu governor is inclined to prioritize the > > > physical idle state (CEDE) even when the target residency time of the > > > physical idle state (s->target_residency_ns) was longer than the > > > predicted idle time (predicted_ns), so data->next_timer_ns won't be > > > relevant here in non-timer wakeups. > > > > > > In this test, despite the actual idle period being around 50 microseconds, > > > the menu governor still chose CEDE state, which has a target residency of > > > 120 microseconds. > > > > And the idle_miss statistics confirms that this was mostly wrong decisions > > in hindsight. > > I'll go through the menu code again, this indeed shouldn't be happening. > > I'd be very surprised if upstream teo performed as badly (or actually badly > > at all) on this, although it doesn't seem to help your actual workload, > > would you mind giving that a try too? > > > > Regards, > > Christian > > So with a workload as static as this, the recent interval detection of menu > should take affect. It records the last 8 idle interval lengths and does some > statistics do see if they are somewhat stable and uses those instead of the > next timer event. > While I wouldn't vouch for the statistics just yet, the results don't seem > to be completely unreasonable. > Do you mind trying > a) printing some snapshots of these intervals and the > resulting return value of get_typical_interval()? Christian, Here are the stats I get when printing the intervals and corresponding predicted_ns for an idle duration of 50 us set from user space. In this case, a pipe wakeup with a 50 us sleep time in user space translates to approximately 34 us of CPU idle time in the kernel, with very little deviation. ------------------------------------------------------------------------------------- process cpu timestamp trace <idle>-0 [200] d.... 123.969564: menu_select: CPU[200]: Interval[2] = 34 <idle>-0 [200] d.... 123.969564: menu_select: CPU[200]: Interval[3] = 34 <idle>-0 [200] d.... 123.969564: menu_select: CPU[200]: Interval[4] = 34 <idle>-0 [200] d.... 123.969564: menu_select: CPU[200]: Interval[5] = 34 <idle>-0 [200] d.... 123.969565: menu_select: CPU[200]: Interval[6] = 34 <idle>-0 [200] d.... 123.969565: menu_select: CPU[200]: Interval[7] = 34 <idle>-0 [200] d.... 123.969565: menu_select: CPU[200]: Predicted ns = 34000 <idle>-0 [200] d.... 123.969615: menu_select: CPU[200]: Interval[0] = 34 <idle>-0 [200] d.... 123.969616: menu_select: CPU[200]: Interval[1] = 34 <idle>-0 [200] d.... 123.969616: menu_select: CPU[200]: Interval[2] = 34 <idle>-0 [200] d.... 123.969616: menu_select: CPU[200]: Interval[3] = 34 <idle>-0 [200] d.... 123.969616: menu_select: CPU[200]: Interval[4] = 34 <idle>-0 [200] d.... 123.969616: menu_select: CPU[200]: Interval[5] = 34 <idle>-0 [200] d.... 123.969616: menu_select: CPU[200]: Interval[6] = 34 <idle>-0 [200] d.... 123.969616: menu_select: CPU[200]: Interval[7] = 34 <idle>-0 [200] d.... 123.969617: menu_select: CPU[200]: Predicted ns = 34000 ... ... ... ... <idle>-0 [200] d.... 123.969667: menu_select: CPU[200]: Interval[0] = 34 <idle>-0 [200] d.... 123.969667: menu_select: CPU[200]: Interval[1] = 34 <idle>-0 [200] d.... 123.969667: menu_select: CPU[200]: Interval[2] = 34 <idle>-0 [200] d.... 123.969667: menu_select: CPU[200]: Interval[3] = 34 <idle>-0 [200] d.... 123.969667: menu_select: CPU[200]: Interval[4] = 34 <idle>-0 [200] d.... 123.969668: menu_select: CPU[200]: Interval[5] = 34 <idle>-0 [200] d.... 
123.969668: menu_select: CPU[200]: Interval[6] = 34 <idle>-0 [200] d.... 123.969668: menu_select: CPU[200]: Interval[7] = 33 <idle>-0 [200] d.... 123.969668: menu_select: CPU[200]: Predicted ns = 33000 This pattern repeats until I stop the workload. ---------------------------------------------------------------------- --------------- > b) Trying this out? Setting these values to some around 50us, that returns > 50us for me as expected and therefore should not select an idle state above > that. > > -->8-- > > --- a/drivers/cpuidle/governors/menu.c > +++ b/drivers/cpuidle/governors/menu.c > @@ -112,6 +112,8 @@ struct menu_device { > int interval_ptr; > }; > > +static int intervals_dummy[] = {50, 52, 48, 50, 51, 49, 51, 51}; > + > static inline int which_bucket(u64 duration_ns, unsigned int nr_iowaiters) > { > int bucket = 0; > @@ -177,7 +179,7 @@ static unsigned int get_typical_interval(struct menu_device *data) > sum = 0; > divisor = 0; > for (i = 0; i < INTERVALS; i++) { > - unsigned int value = data->intervals[i]; > + unsigned int value = intervals_dummy[i]; > if (value <= thresh) { > sum += value; > divisor++; > @@ -200,7 +202,7 @@ static unsigned int get_typical_interval(struct menu_device *data) > /* Then try to determine variance */ > variance = 0; > for (i = 0; i < INTERVALS; i++) { > - unsigned int value = data->intervals[i]; > + unsigned int value = intervals_dummy[i]; > if (value <= thresh) { > int64_t diff = (int64_t)value - avg; > variance += diff * diff; > Yes, I tried the above patch and I still see the issue: +-----------------------------+------------------+------------------+------------------+ | Metric | ladder | menu | teo | +=============================+==================+==================+==================+ | Wakeups | 579817 | 579298 | 578359 | +-----------------------------+------------------+------------------+------------------+ | Average wakeup latency (us) | 7.32 | 7.099 | 4.538 | +-----------------------------+------------------+------------------+------------------+ | Sleep interval (us) | 51.722 | 51.768 | 51.852 | +-----------------------------+------------------+------------------+------------------+ | State 0 Usage diff | 0 | 7056 | 578402 | +-----------------------------+------------------+------------------+------------------+ | State 0 Time diff (ns) | 0 | 350579 | 2.75562e+07 | +-----------------------------+------------------+------------------+------------------+ | State 0 Above diff | 0 | 0 | 0 | +-----------------------------+------------------+------------------+------------------+ | State 0 Below diff | 0 | 1 | 0 | +-----------------------------+------------------+------------------+------------------+ | State 1 Usage diff | 580239 | 572168 | 0 | +-----------------------------+------------------+------------------+------------------+ | State 1 Time diff (ns) | 2.74763e+07 | 2.73917e+07 | 0 | +-----------------------------+------------------+------------------+------------------+ | State 1 Above diff | 580140 | 572070 | 0 | +-----------------------------+------------------+------------------+------------------+ | State 1 Below diff | 0 | 0 | 0 | +-----------------------------+------------------+------------------+------------------+ | Total Above | 580140 | 572070 | 0 | +-----------------------------+------------------+------------------+------------------+ | Total Below | 0 | 1 | 0 | +=============================+==================+==================+==================+ | Total Above Percent | 99.98 | 98.76 | 0 | 
+-----------------------------+------------------+------------------+------------------+
| Total Below Percent         | 0                | 0                | 0                |
+======================================================================================+

--------------------------------------------------------------------------------------------
Note:
% Total Above Percent (w.r.t. Total Usage) = (Total Above / Total Usage) * 100
% Total Below Percent (w.r.t. Total Usage) = (Total Below / Total Usage) * 100
--------------------------------------------------------------------------------------------

This confirms that the issue is not occurring in the way predicted_ns is computed in get_typical_interval(), but rather in the menu governor's CPU idle state selection logic.

Thanks,
Aboorva
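[For reference, instrumentation of the kind that produces the Interval[]/Predicted ns trace lines above can be as small as the sketch below, meant to live in drivers/cpuidle/governors/menu.c and be called from menu_select() with the microsecond value returned by get_typical_interval(). This is a hypothetical helper, not the exact debug patch used in this thread.]

```c
/*
 * Hypothetical debug helper: dump the recent-interval buffer and the
 * prediction right where menu_select() consumes it, so each prediction
 * can be correlated with the sleep length that actually follows.
 */
static void menu_trace_intervals(struct menu_device *data, unsigned int typical_us)
{
	int i;

	for (i = 0; i < INTERVALS; i++)
		trace_printk("CPU[%d]: Interval[%d] = %u\n",
			     smp_processor_id(), i, data->intervals[i]);

	trace_printk("CPU[%d]: Predicted ns = %llu\n",
		     smp_processor_id(),
		     (unsigned long long)typical_us * NSEC_PER_USEC);
}
```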
On Thu, 2024-09-19 at 15:38 +0100, Christian Loehle wrote: > On 9/19/24 15:07, Christian Loehle wrote: > > On 9/19/24 10:43, Christian Loehle wrote: > > > On 9/19/24 09:49, Christian Loehle wrote: > > > > On 9/19/24 06:02, Aboorva Devarajan wrote: > > > > > On Wed, 2024-08-21 at 11:55 +0100, Christian Loehle wrote: > > > > > Timer based wakeup: > > > > > > > > > > +------------------------+---------------------+---------------------+ > > > > > > Metric | Without Patch | With Patch | > > > > > +------------------------+---------------------+---------------------+ > > > > > > Wakee thread - CPU | 110 | 110 | > > > > > > Waker thread - CPU | 20 | 20 | > > > > > > Sleep Interval | 50 us | 50 us | > > > > > > Total Wakeups | - | - | > > > > > > Avg Wakeup Latency | - | - | > > > > > > Actual Sleep time | 52.639 us | 52.627 us | > > > > > +------------------------+---------------------+---------------------+ > > > > > > Idle State 0 Usage Diff| 94,879 | 94,879 | > > > > > > Idle State 0 Time Diff | 4,700,323 ns | 4,697,576 ns | > > > > > > Idle State 1 Usage Diff| 0 | 0 | > > > > > > Idle State 1 Time Diff | 0 ns | 0 ns | > > > > > +------------------------+---------------------+---------------------+ > > > > > > Total Above Usage | 0 (0.00%) | (0.00%) | > > > > > +------------------------+---------------------+---------------------+ > > > > > > Total Below Usage | 0 (0.00%) | 0 (0.00%) | > > > > > +------------------------+---------------------+---------------------+ > > > > > > > > > > In timer-based wakeups, the menu governor effectively predicts the idle > > > > > duration both with and without the patch. This ensures that there are > > > > > few or no instances of "Above" usage, allowing the CPU to remain in the > > > > > correct idle state. > > > > > > > > > > The condition (s->target_residency_ns <= data->next_timer_ns) in the menu > > > > > governor ensures that a physical idle state is not prioritized when a > > > > > timer event is expected before the target residency of the first physical > > > > > idle state. > > > > > > > > > > As a result, the patch has no impact in this case, and performance > > > > > remains stable with timer based wakeups. 
> > > > > > > > > > Pipe based wakeup (non-timer wakeup): > > > > > > > > > > +------------------------+---------------------+---------------------+ > > > > > > Metric | Without Patch | With Patch | > > > > > +------------------------+---------------------+---------------------+ > > > > > > Wakee thread - CPU | 110 | 110 | > > > > > > Waker thread - CPU | 20 | 20 | > > > > > > Sleep Interval | 50 us | 50 us | > > > > > > Total Wakeups | 97031 | 96583 | > > > > > > Avg Wakeup Latency | 7.070 us | 4.879 us | > > > > > > Actual Sleep time | 51.366 us | 51.605 us | > > > > > +------------------------+---------------------+---------------------+ > > > > > > Idle State 0 Usage Diff| 1209 | 96,586 | > > > > > > Idle State 0 Time Diff | 55,579 ns | 4,510,003 ns | > > > > > > Idle State 1 Usage Diff| 95,826 | 5 | > > > > > > Idle State 1 Time Diff | 4,522,639 ns | 198 ns | > > > > > +------------------------+---------------------+---------------------+ > > > > > +------------------------+---------------------+---------------------+ > > > > > > **Total Above Usage** | 95,824 (98.75%) | 5 (0.01%) | > > > > > +------------------------+---------------------+---------------------+ > > > > > +------------------------+---------------------+---------------------+ > > > > > > Total Below Usage | 0 (0.00%) | 0 (0.00%) | > > > > > +------------------------+---------------------+---------------------+ > > > > > > > > > > In the pipe-based wakeup scenario, before the patch was applied, the > > > > > "Above" metric was notably high around 98.75%. This suggests that the > > > > > menu governor frequently selected a deeper idle state like CEDE, even > > > > > when the idle period was relatively short. > > > > > > > > > > This happened because the menu governor is inclined to prioritize the > > > > > physical idle state (CEDE) even when the target residency time of the > > > > > physical idle state (s->target_residency_ns) was longer than the > > > > > predicted idle time (predicted_ns), so data->next_timer_ns won't be > > > > > relevant here in non-timer wakeups. > > > > > > > > > > In this test, despite the actual idle period being around 50 microseconds, > > > > > the menu governor still chose CEDE state, which has a target residency of > > > > > 120 microseconds. > > > > > > > > And the idle_miss statistics confirms that this was mostly wrong decisions > > > > in hindsight. > > > > I'll go through the menu code again, this indeed shouldn't be happening. > > > > I'd be very surprised if upstream teo performed as badly (or actually badly > > > > at all) on this, although it doesn't seem to help your actual workload, > > > > would you mind giving that a try too? > > > > > > > > Regards, > > > > Christian > > > > > > So with a workload as static as this, the recent interval detection of menu > > > should take affect. It records the last 8 idle interval lengths and does some > > > statistics do see if they are somewhat stable and uses those instead of the > > > next timer event. > > > While I wouldn't vouch for the statistics just yet, the results don't seem > > > to be completely unreasonable. > > > Do you mind trying a) printing some snapshots of these intervals and the > > > resulting return value of get_typical_interval()? > > > b) Trying this out? Setting these values to some around 50us, that returns > > > 50us for me as expected and therefore should not select an idle state above > > > that. 
> > > > > > -->8-- > > > > > > --- a/drivers/cpuidle/governors/menu.c > > > +++ b/drivers/cpuidle/governors/menu.c > > > @@ -112,6 +112,8 @@ struct menu_device { > > > int interval_ptr; > > > }; > > > > > > +static int intervals_dummy[] = {50, 52, 48, 50, 51, 49, 51, 51}; > > > > So just to illustrate this a bit more: > > {51, 80, 90, 54, 72, 60, 80, 117} is insignificant, but decrementing the last: > > {51, 80, 90, 54, 72, 60, 80, 116} is significant (75us returned) > > That sounds about reasonable but I think something like that would > > also be observable for you. It's an unfortunate edge case. > > I wouldn't want to mess with the ancient statistics (and silently break > > an unknown amount of setups). I much prefer the teo intercept approach that > > makes this idle state dependent and doesn't try to get one formula right > > for all setups. > > > > That is all assuming that a) get_typical_interval() doesn't return a > > significant interval length for you and b) replacing that with dummy > > significant values then fixes it (and not the error being some menu > > logic that I overlooked). > > > > Thinking out loud regarding the menu logic: > > If we have ~50us predicted interval and your state setup the code then > > goes as follows: > > > > predicted_ns = 50000 > > ... > > data->next_timer_ns = KTIME_MAX; > > delta_tick = TICK_NSEC / 2; > > data->bucket = which_bucket(KTIME_MAX, nr_iowaiters); // bucket highest state probably > > > > And then we have the big for loop for state 0 POLLING and state 1 CEDE target_residency_ns==100000: > > trivially we have for i = 0: > > if (idx == -1) > > idx = i; /* first enabled state */ > > and for i = 1: > > if (s->target_residency_ns > predicted_ns) { // True > > ... > > if (predicted_ns < TICK_NSEC) // True > > break; > > and should end up with idx == 0 and *stop_tick = false, all good. > > If my above assumptions in a) and b) do not hold then I must've made a > > mistake reading the code just now. > > > > And of course I realise right after sending I did indeed make a mistake. > RESIDENCY_THRESHOLD_NS is 15us so timer is actually polled and the fields > populated accurately, but also in the for loop it checks polling for idx == 0. > predicted_ns should still be the value from get_typical_interval(), but > that is unable to ever enforce a polling state, making your proposed code > change the only obvious fix. > Essentially with only one non-polling state the get_typical_interval() > logic might as well be skipped, it never affects anything. > > If I'm correct this time then I'm not really sure how to proceed. > On the one hand, if we trust predicted_ns we should indeed select a state > with predicted_ns < s->target_residency_ns, no matter if polling or not, > but that would be quite a behaviour change for x86 users too. That's right, so far this is the only solution I can think of that improves performance, But, I'm open to trying alternative suggestions if any. > > I'm still curious on why teo wouldn't solve your problem, it's intercept > logic doesn't care if it selects a polling state or not because of it. > I confirmed that "teo" doesn't have the reported issue, and when compared to the vanilla menu governor switching to "teo" does improve the performance of some workloads, though not the original workload I tested. I'll look into why that's the case. However, we would still want to resolve the issue with the menu governor, as it's the default governor currently enabled and have been tested so far. 
Switching to "teo" could be an option in the future, though.

Thanks,
Aboorva
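[To make the behaviour change being discussed concrete, here is the same kind of toy model as earlier in the thread, but trusting the prediction outright. It is an illustration only, not the actual patch posted in this series; the names and structure are made up, and the state parameters are again the snooze/CEDE values from this thread.]

```c
/*
 * Illustration only: trust predicted_ns and stop at the deepest state
 * whose target residency still fits it, whether or not the shallower
 * candidate is a polling state.
 */
#include <stdint.h>
#include <stdio.h>

struct toy_state {
	const char *name;
	uint64_t target_residency_ns;
	uint64_t exit_latency_ns;
};

static int pick_trusting_prediction(const struct toy_state *s, int count,
				    uint64_t predicted_ns, uint64_t latency_req_ns)
{
	int idx = 0;

	for (int i = 1; i < count; i++) {
		if (s[i].target_residency_ns > predicted_ns ||
		    s[i].exit_latency_ns > latency_req_ns)
			break;
		idx = i;
	}
	return idx;
}

int main(void)
{
	const struct toy_state states[] = {
		{ "snooze", 0,          0 },
		{ "CEDE",   120 * 1000, 12 * 1000 },
	};

	/* ~50us predicted: stays on snooze; ~150us predicted: picks CEDE. */
	printf("%s\n", states[pick_trusting_prediction(states, 2,  50 * 1000, UINT64_MAX)].name);
	printf("%s\n", states[pick_trusting_prediction(states, 2, 150 * 1000, UINT64_MAX)].name);
	return 0;
}
```

With this variant the ~50us pipe workload lands on the polling state, matching the "With Patch" columns earlier in the thread where state 0 absorbs nearly all wakeups.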
Hi Aboorva, On 10/21/24 06:27, Aboorva Devarajan wrote: > [...] > I ran the benchmark again and collected the idle statistics on a different > system since I no longer have access to the previous one. > > Here are the stats I gathered, including data for the menu, teo, and ladder governors: > > Metric | Ladder | Teo | Menu | Menu Patched > -------------------------------------------------------------------------------------------- > Transactions Processed | 42,902,188 | 48,499,709 | 42,653,477 | 48,648,867 > Latency (ms) | 0.313 | 0.277 | 0.315 | 0.276 > TPS | 715,340.96 | 808,672.59 | 711,123.13 | 811,137.40 > -------------------------------------------------------------------------------------------- > Total Usage Difference | 46,680,184 | 66,348,992 | 47,892,509 | 62,366,337 > Usage Diff (CEDE) | 46,680,184 | 7,612,960 | 45,421,436 | 19,237,430 > Usage Diff (Snooze) | 0 | 58,736,032 | 2,471,073 | 43,128,907 > -------------------------------------------------------------------------------------------- > Total Time Difference (s) | 5,552.20 | 5,242.75 | 5,534.62 | 5,238.46 > Time Diff (CEDE) | 5,552.20 | 2,497.89 | 5,431.91 | 3,324.99 > Time Diff (Snooze) | 0.00 | 2,744.86 | 102.71 | 1,913.47 > -------------------------------------------------------------------------------------------- > Total Above Diff | 40,381,398 | 4,520,998 | 38,942,725 | 15,072,345 > Above Diff (CEDE) | 40,381,398 | 4,520,998 | 38,942,725 | 15,072,345 > Above Diff (Snooze) | 0 | 0 | 0 | 0 > -------------------------------------------------------------------------------------------- > Total Below Diff | 0 | 12,362,392 | 405,357 | 8,172,845 > Below Diff (CEDE) | 0 | 0 | 0 | 0 > Below Diff (Snooze) | 0 | 12,362,392 | 405,357 | 8,172,845 > -------------------------------------------------------------------------------------------- > % Above w.r.t. Usage | 86.51% | 6.82% | 81.30% | 24.17% > -------------------------------------------------------------------------------------------- > % Below w.r.t. Usage | 0% | 18.64% | 0.85% | 13.10% > -------------------------------------------------------------------------------------------- > > -------------------------------------------------------------------------------------------- > > Note: > > % Above w.r.t Usage = (Total Above Difference / Total Usage Difference) * 100 > % Below w.r.t Usage = (Total Below Difference / Total Usage Difference) * 100 > > > Menu, Teo, Ladder: This is with menu, teo and ladder governor enabled respectively on v6.12-rc1. > > Menu Patched: This is with menu governor enabled on v6.12-rc + > https://lore.kernel.org/all/20240809073120.250974-2-aboorvad@linux.ibm.com/ > -------------------------------------------------------------------------------------------- > > -------------------------------------------------------------------------------------------- > Inference: > -------------------------------------------------------------------------------------------- > > Transactions Processed: The number of transactions processed in Menu Patched is > 14.06% higher compared to Menu. > > Latency: There is a 12.38% reduction in latency in Menu Patched compared to Menu. > > TPS (Transactions Per Second): The TPS in Menu Patched is 14.06% higher than in > Menu. > -------------------------------------------------------------------------------------------- > > While this patch hasn't completely eliminated the cpuidle miss ratio, but it > has improved see Above w.r.t Usage, Below w.r.t Usage. 
> > I'll keep investigating why there's still a 24% miss rate in the "above" and > 13% in the "below" stats after the patch. This could be a different issue. > Additionally, I've seen good performance improvements using the teo governor > with the pgbench benchmark, although it didn't provide the same benefit in the > original test. Good to hear and thank you for testing. Indeed the menu idle misses sound somewhat high, given that teo scores fewer. I'd be very curious about the workload where menu patched significantly outperforms teo if you ever come across that again. > [...] > > I ran a similar benchmark using teo cpuidle governor, presented the averaged out > values across 10 runs (has low standard deviation). Below are the results for a > pipe-based wakeup with an approximate 50 microsecond sleep interval: > > > Pipe based wakeup with approx 50 us as sleep interval: > > Metric Ladder Menu Teo Menu Patched > ---------------------------------------------------------------------------------------- > Wakeups 579,690 579,951 578,328 578,363 > Avg wakeup latency (us) 7.456 7.112 4.46067 4.48733 > ======================================================================================== > Sleep interval (us) 51.7333 51.7097 51.8533 51.8477 > ======================================================================================== > State 0 Usage diff 0 7,324 578,361 578,407 > State 0 Time diff (ns) 0 340,115 2.75953e+07 2.75696e+07 > State 0 Above diff 0 0 0 0 > State 0 Below diff 0 0 0.333333 0.666667 > ======================================================================================== > State 1 Usage diff 580,114 572,644 0.666667 9.33333 > State 1 Time diff (ns) 2.74189e+07 2.73901e+07 20.6667 445.333 > State 1 Above diff 579,993 572,535 0.666667 9.33333 > State 1 Below diff 0 0 0 0 > Total Above 579,993 572,535 0.666667 9.33333 > Total Below 0 0 0.333333 0.666667 > ======================================================================================== > Total Above Percent 99.98% 98.7167% 0% 0% --> [1] > Total Below Percent 0% 0% 0% 0% > ======================================================================================== > > Timer based wakeup with approx 50 us as sleep interval: > > Metric Ladder Menu Teo Menu Patched > ---------------------------------------------------------------------------------------- > Sleep interval (us) 53.775 52.3687 52.746 52.327 > ======================================================================================== > State 0 Usage diff 0 572,575 568,419 573,109 > State 0 Time diff (ns) 0 2.84142e+07 2.81956e+07 2.84548e+07 > State 0 Above diff 0 0 0 0 > State 0 Below diff 0 0.333333 1 0.333333 > ======================================================================================== > State 1 Usage diff 558,285 0.333333 0 0 > State 1 Time diff (ns) 2.80965e+07 17 0 0 > State 1 Above diff 558,215 0.333333 0 0 > State 1 Below diff 0 0 0 0 > ======================================================================================== > Total Above 558,215 0.333333 0 0 > Total Below 0 0.333333 1 0.333333 > ======================================================================================== > Total Above Percent 99.99% 0% 0% 0% > Total Below Percent 0% 0% 0% 0% > ======================================================================================== > > > [1] The results does show that the teo governor doesn't have this issue. Glad to see that confirmed, thank you. Regards, Christian
On 10/21/24 09:02, Aboorva Devarajan wrote: > On Thu, 2024-09-19 at 15:38 +0100, Christian Loehle wrote: >> [...] >> And of course I realise right after sending I did indeed make a mistake. >> RESIDENCY_THRESHOLD_NS is 15us so timer is actually polled and the fields >> populated accurately, but also in the for loop it checks polling for idx == 0. >> predicted_ns should still be the value from get_typical_interval(), but >> that is unable to ever enforce a polling state, making your proposed code >> change the only obvious fix. > >> Essentially with only one non-polling state the get_typical_interval() >> logic might as well be skipped, it never affects anything. >> >> If I'm correct this time then I'm not really sure how to proceed. >> On the one hand, if we trust predicted_ns we should indeed select a state >> with predicted_ns < s->target_residency_ns, no matter if polling or not, >> but that would be quite a behaviour change for x86 users too. > > That's right, so far this is the only solution I can think of that > improves performance, But, I'm open to trying alternative suggestions > if any.

It's the most reasonable I can think of anyway.

>> I'm still curious on why teo wouldn't solve your problem, it's intercept >> logic doesn't care if it selects a polling state or not because of it. >> > I confirmed that "teo" doesn't have the reported issue, and when compared > to the vanilla menu governor switching to "teo" does improve the performance > of some workloads, though not the original workload I tested. I'll look into > why that's the case.

I'd be really curious about the workload that improves with your patch but doesn't perform well on teo.

> However, we would still want to resolve the issue with the menu governor, > as it's the default governor currently enabled and have been tested so > far. While switching to "teo" could be an option in the future.

Understandable, and I think your change makes sense, although, as I dug up, that logic is quite old. Modern arm64 but also x86 platforms are AFAICS unlikely to be affected by your change, as they have efficient/responsive physical idle states (or, for arm64, usually no polling state at all), so the prediction logic would have to spit out something like <1000 ns to take effect, and at that point clearing the statistical-significance check is quite unlikely. OTOH, if there were power regressions, they might be on platforms where they would never be reported.

Anyway, Rafael's call here. If he's willing to take it I would include the historical part in the commit message so that doesn't need to be dug up again.

Regards,
Christian
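[As a closing illustration of the "intercept" idea that keeps coming up in this thread, a deliberately crude model is enough to show why a recent history of sub-residency wakeups pulls the selection to the shallow state regardless of its polling flag. This is emphatically not teo's actual algorithm or data structures, just a sketch of the idea.]

```c
/*
 * Crude intercept-style toy: remember, for the last N sleeps, whether the
 * CPU woke up before the deep state's target residency; if that happened
 * most of the time, pick the shallower state next, polling or not.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define HISTORY 8

struct intercept_model {
	bool woke_early[HISTORY];	/* wakeup before deep residency? */
	int pos;
};

static void model_record(struct intercept_model *m, uint64_t slept_ns,
			 uint64_t deep_residency_ns)
{
	m->woke_early[m->pos] = slept_ns < deep_residency_ns;
	m->pos = (m->pos + 1) % HISTORY;
}

/* Returns 1 for the deep state, 0 for the shallow state. */
static int model_select(const struct intercept_model *m)
{
	int early = 0;

	for (int i = 0; i < HISTORY; i++)
		early += m->woke_early[i];

	return early * 2 > HISTORY ? 0 : 1;
}

int main(void)
{
	struct intercept_model m = { 0 };
	const uint64_t cede_residency_ns = 120 * 1000;

	/* Pipe wakeups every ~50us: every sleep ends well below 120us. */
	for (int i = 0; i < HISTORY; i++)
		model_record(&m, 50 * 1000, cede_residency_ns);

	printf("selected state: %d\n", model_select(&m));	/* 0 (shallow) */
	return 0;
}
```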