
[0/1] cpuidle/menu: Address performance drop from favoring physical over polling cpuidle state

Message ID 20240809073120.250974-1-aboorvad@linux.ibm.com (mailing list archive)

Message

Aboorva Devarajan Aug. 9, 2024, 7:31 a.m. UTC
This patch is intended to start a discussion about a performance degradation
that can occur in certain workloads when the menu governor prioritizes
selecting a physical idle state over a polling state for short idle durations.

Note: This patch is meant to showcase the degradation; applying it could lead
to increased power consumption, since it trades power efficiency for
performance by favoring the polling state more often.

==================================================
System details in which the degradation is observed:

$ uname -r
6.10.0+

$ lscpu
Architecture:             ppc64le
  Byte Order:             Little Endian
CPU(s):                   160
  On-line CPU(s) list:    0-159
Model name:               POWER10 (architected), altivec supported
  Model:                  2.0 (pvr 0080 0200)
  Thread(s) per core:     8
  Core(s) per socket:     3
  Socket(s):              6
  Physical sockets:       4
  Physical chips:         2
  Physical cores/chip:    6
Virtualization features:
  Hypervisor vendor:      pHyp
  Virtualization type:    para
Caches (sum of all):
  L1d:                    1.3 MiB (40 instances)
  L1i:                    1.9 MiB (40 instances)
  L2:                     40 MiB (40 instances)
  L3:                     160 MiB (40 instances)
NUMA:
  NUMA node(s):           6
  NUMA node0 CPU(s):      0-31
  NUMA node1 CPU(s):      32-71
  NUMA node2 CPU(s):      72-79
  NUMA node3 CPU(s):      80-87
  NUMA node4 CPU(s):      88-119
  NUMA node5 CPU(s):      120-159


$ cpupower idle-info
CPUidle driver: pseries_idle
CPUidle governor: menu
analyzing CPU 0:

Number of idle states: 2
Available idle states: snooze CEDE
snooze:
Flags/Description: snooze
Latency: 0
Residency: 0
Usage: 6229
Duration: 402142
CEDE:
Flags/Description: CEDE
Latency: 12
Residency: 120
Usage: 191411
Duration: 36329999037

==================================================

The menu governor contains a condition that prefers a physical idle state,
such as the CEDE state, over a polling state whenever the physical state's
exit latency meets the latency requirements. This can lead to performance
drops in workloads with frequent short idle periods.

The specific condition that causes the degradation is shown below (menu
governor):

```
if (s->target_residency_ns > predicted_ns) {
    ...
    if ((drv->states[idx].flags & CPUIDLE_FLAG_POLLING) &&
        s->exit_latency_ns <= latency_req &&
        s->target_residency_ns <= data->next_timer_ns) {
        predicted_ns = s->target_residency_ns;
        idx = i;
        break;
    }
    ...
}
```

This condition can cause the menu governor to choose the CEDE state on Power
Systems (residency: 120 us, exit latency: 12 us) over a polling state, even
when the predicted idle duration is much shorter than the target residency
of the physical state. This misprediction leads to performance degradation
in certain workloads.

==================================================
Test Results
==================================================

This issue can be clearly observed with the below test.

A test with multiple wakee threads and a single waker thread was run to
demonstrate this issue. The waker thread periodically wakes up the wakee
threads after a specific sleep duration, creating a repeating sleep -> wake
pattern. The test was run for a fixed period, and cpuidle statistics were
collected.

./cpuidle-test -a 0 -b 10 -b 20 -b 30 -b 40 -b 50 -b 60 -b 70 -r 20 -t 60

==================================================
Results (Baseline Kernel):
==================================================
Wakee 0[PID 8295] affined to CPUs: 10,
Wakee 2[PID 8297] affined to CPUs: 30,
Wakee 3[PID 8298] affined to CPUs: 40,
Wakee 1[PID 8296] affined to CPUs: 20,
Wakee 4[PID 8299] affined to CPUs: 50,
Wakee 5[PID 8300] affined to CPUs: 60,
Wakee 6[PID 8301] affined to CPUs: 70,
Waker[PID 8302] affined to CPUs: 0,

|-----------------------------------|-------------------------|-----------------------------|
| Metric                            | snooze                  | CEDE                        |
|-----------------------------------|-------------------------|-----------------------------|
| Usage                             | 47815                   | 2030160                     |
| Above                             | 0                       | 2030043                     |
| Below                             | 0                       | 0                           |
| Time Spent (us)                   | 976317 (1.63%)          | 51046474 (85.08%)           |
| Overall average sleep duration    | 28.721 us               |                             |
| Overall average wakeup latency    | 6.858 us                |                             |
|-----------------------------------|-------------------------|-----------------------------|

In this test, without the patch, the CPU often enters the CEDE state for
sleep durations of around 20-30 microseconds, even though the CEDE state's
target residency is 120 microseconds. This happens because the menu governor
prioritizes the physical idle state (CEDE) if its exit latency is within
the latency limits. It also compares the target residency against
next_timer_ns, which can be much farther off than the actual idle duration,
instead of against the predicted idle duration.

Inference: The "Above" metric in the table counts instances where a state
was entered but the observed idle duration turned out to be shorter than
that state's target residency. The data shows that the CEDE state's "Above"
count is high: around 99.9% of CEDE entries were deeper than the observed
idle duration warranted.

==================================================
Results (With Patch):
==================================================
Wakee 0[PID 8295] affined to CPUs: 10,
Wakee 2[PID 8297] affined to CPUs: 30,
Wakee 3[PID 8298] affined to CPUs: 40,
Wakee 1[PID 8296] affined to CPUs: 20,
Wakee 4[PID 8299] affined to CPUs: 50,
Wakee 5[PID 8300] affined to CPUs: 60,
Wakee 6[PID 8301] affined to CPUs: 70,
Waker[PID 8302] affined to CPUs: 0,

|-----------------------------------|-------------------------|-----------------------------|
| Metric                            | snooze                  | CEDE                        |
|-----------------------------------|-------------------------|-----------------------------|
| Usage                             | 1853343                 | 6                           |
| Above                             | 0                       | 6                           |
| Below                             | 2                       | 0                           |
| Time Spent (us)                   | 51055087 (85.09%)       | 356 (0.00%)                 |
| Overall average sleep duration    | 32.268 us               |                             |
| Overall average wakeup latency    | 4.382 us                |                             |
|-----------------------------------|-------------------------|-----------------------------|

When the condition that prioritizes the selection of physical idle states
based on exit latency is removed, these metrics improve significantly.
The reduction in the "Above" count for the CEDE state shows that the
governor's cpuidle state selections are now more accurate than before,
leading to better overall performance.

==================================================
Performance Improvements:
==================================================
The patch improved the performance of OLTP benchmarks by around 5-10%.

The following table summarizes the performance improvements observed in
multiple OLTP query tests:

|----|----------|------------------|---------------------------|
| #  | Base (s) | Base + Patch (s) | % Runtime Change          |
|    | Kernel   | Kernel           | (negative = improvement)  |
|----|----------|------------------|---------------------------|
| 1  | 15.77    | 14.14            | -10.34%                   |
| 2  | 0.87     | 0.86             | -0.90%                    |
| 3  | 4.05     | 3.78             | -6.70%                    |
| 4  | 0.61     | 0.59             | -3.10%                    |
| 5  | 1.18     | 1.14             | -3.47%                    |
| 6  | 4.11     | 3.84             | -6.59%                    |
| 7  | 0.75     | 0.75             | -0.24%                    |
| 8  | 14.91    | 13.56            | -9.03%                    |
| 9  | 4.02     | 3.86             | -3.90%                    |
| 10 | 14.93    | 13.65            | -8.57%                    |
| 11 | 15.62    | 14.03            | -10.20%                   |
| 12 | 2.15     | 2.05             | -4.52%                    |
| 13 | 7.92     | 7.35             | -7.29%                    |
| 14 | 4.08     | 4.07             | -0.08%                    |
| 15 | 2.28     | 2.10             | -8.12%                    |
| 16 | 1.14     | 1.11             | -3.09%                    |
| 17 | 2.14     | 2.06             | -3.64%                    |
| 18 | 2.54     | 2.36             | -7.16%                    |
| 19 | 3.72     | 3.68             | -1.05%                    |
| 20 | 2.12     | 2.05             | -3.28%                    |
| 21 | 8.17     | 7.46             | -8.69%                    |
| 22 | 3.98     | 3.83             | -3.58%                    |
| 23 | 2.45     | 2.30             | -6.12%                    |
| 24 | 2.13     | 2.06             | -3.61%                    |
| 25 | 15.05    | 13.74            | -8.70%                    |
| 26 | 4.49     | 4.16             | -7.43%                    |
| 27 | 3.75     | 3.66             | -2.37%                    |
| 28 | 2.90     | 2.87             | -0.96%                    |
| 29 | 4.37     | 4.03             | -7.83%                    |
| 30 | 2.57     | 2.43             | -5.19%                    |
| 31 | 2.78     | 2.71             | -2.30%                    |
| 32 | 3.94     | 3.78             | -4.27%                    |
| 33 | 0.70     | 0.69             | -1.48%                    |
| 34 | 2.15     | 2.08             | -3.62%                    |
| 35 | 8.30     | 7.52             | -9.38%                    |
| 36 | 2.15     | 2.08             | -3.25%                    |
| 37 | 4.37     | 4.03             | -7.88%                    |
| 38 | 4.16     | 3.88             | -6.78%                    |
| 39 | 1.12     | 1.12             | -0.02%                    |
| 40 | 1.22     | 1.17             | -4.19%                    |
| 41 | 8.45     | 7.67             | -9.27%                    |
| 42 | 7.91     | 7.31             | -7.51%                    |
| 43 | 1.12     | 1.12             | -0.08%                    |
| 44 | 0.68     | 0.65             | -3.86%                    |
| 45 | 2.12     | 2.00             | -5.47%                    |
| 46 | 3.11     | 3.04             | -2.36%                    |
| 47 | 2.29     | 2.13             | -6.88%                    |
| 48 | 8.37     | 7.61             | -9.02%                    |
| 49 | 2.61     | 2.46             | -5.75%                    |
| 50 | 7.92     | 7.30             | -7.79%                    |
| 51 | 2.85     | 2.77             | -2.58%                    |
| 52 | 0.79     | 0.78             | -0.87%                    |
| 53 | 2.90     | 2.79             | -3.63%                    |
| 54 | 8.03     | 7.40             | -7.87%                    |
| 55 | 2.14     | 2.00             | -6.53%                    |
| 56 | 15.54    | 13.93            | -10.38%                   |
| 57 | 8.07     | 7.45             | -7.62%                    |
| 58 | 8.48     | 7.70             | -9.20%                    |
| 59 | 1.08     | 1.04             | -3.80%                    |
| 60 | 1.14     | 1.13             | -0.91%                    |
| 61 | 4.28     | 3.95             | -7.84%                    |
| 62 | 8.40     | 7.62             | -9.36%                    |
| 63 | 7.91     | 7.38             | -6.78%                    |
| 64 | 2.16     | 2.10             | -2.92%                    |
| 65 | 7.78     | 7.18             | -7.60%                    |
| 66 | 4.22     | 4.19             | -0.84%                    |
| 67 | 2.99     | 2.88             | -3.61%                    |
| 68 | 8.44     | 7.56             | -10.33%                   |
| 69 | 4.10     | 3.74             | -8.89%                    |
| 70 | 4.25     | 3.98             | -6.37%                    |
| 71 | 0.61     | 0.61             | -0.39%                    |
| 72 | 3.89     | 3.56             | -8.39%                    |
|----|----------|------------------|---------------------------|


==================================================
Observation:
==================================================

The condition under discussion is intended to prioritize physical idle states
when their exit latency is within acceptable limits, thereby improving power
efficiency and overall SMT folding. However, on systems where the target
residency is significantly higher than the exit latency, it leads to frequent
selection of states like CEDE on Power Systems, which is inefficient for
short sleep durations.

On systems where the target residency and exit latency are roughly equal,
this condition may not adversely affect performance.

In general, there seems to be a preference for prioritizing energy
efficiency, as shown by commit 0c313cb ("cpuidle: menu: Fall back to
polling if next timer event is near"). That change went back to using
next_timer_ns as the comparison point instead of the predicted idle
duration. While next_timer_ns is more deterministic, it does not account
for very short idle durations, which can hurt performance in certain
situations, even though it helps reduce energy consumption.

Any feedback or suggestions on how to better balance the power-performance
tradeoff in the menu governor, so that the performance issue in specific
cases like the one above can be addressed, would be very helpful.

Thanks
Aboorva

Aboorva Devarajan (1):
  cpuidle/menu: avoid prioritizing physical state over polling state

 drivers/cpuidle/governors/menu.c | 11 -----------
 1 file changed, 11 deletions(-)

Comments

Christian Loehle Aug. 13, 2024, 12:56 p.m. UTC | #1
On 8/9/24 08:31, Aboorva Devarajan wrote:
> This patch aims to discuss a potential performance degradation that can occur
> in certain workloads when the menu governor prioritizes selecting a physical
> idle state over a polling state for short idle durations. 
> 
> Note: This patch is intended to showcase a performance degradation, applying
> this patch could lead to increased power consumption due to the trade-off between
> performance and power efficiency, potentially causing a higher preference for
> performance at the expense of power usage.
> 

Not really a menu expert, but at this point I don't know who dares call
themselves one.
The elephant in the room would be: Does teo work better for you?

> [snip]
> ==================================================
> 
> The menu governor contains a condition that selects physical idle states over,
> such as the CEDE state over polling state, by checking if their exit latency meets
> the latency requirements. This can lead to performance drops in workloads with
> frequent short idle periods.
> 
> The specific condition which causes degradation is as below (menu governor):
> 
> ```
> if (s->target_residency_ns > predicted_ns) {
>     ...
>     if ((drv->states[idx].flags & CPUIDLE_FLAG_POLLING) &&
>         s->exit_latency_ns <= latency_req &&
>         s->target_residency_ns <= data->next_timer_ns) {
>         predicted_ns = s->target_residency_ns;
>         idx = i;
>         break;
>     }
>     ...
> }
> ```
> 
> This condition can cause the menu governor to choose the CEDE state on Power
> Systems (residency: 120 us, exit latency: 12 us) over a polling state, even
> when the predicted idle duration is much shorter than the target residency
> of the physical state. This misprediction leads to performance degradation
> in certain workloads.
> 

So clearly the condition
s->target_residency_ns <= data->next_timer_ns
is supposed to prevent this, but data->next_timer_ns isn't accurate.
Have you got any idea what it's set to in your workload usually?
Seems like your workload is timer-based, so the idle duration should be
predicted accurately.


> ==================================================
> Test Results
> ==================================================
> 
> This issue can be clearly observed with the below test.
> 
> A test with multiple wakee threads and a single waker thread was run to
> demonstrate this issue. The waker thread periodically wakes up the wakee
> threads after a specific sleep duration, creating a repeating of sleep -> wake
> pattern. The test was run for a stipulated period, and cpuidle statistics are
> collected.
> 
> ./cpuidle-test -a 0 -b 10 -b 20 -b 30 -b 40 -b 50 -b 60 -b 70 -r 20 -t 60
> 
> ==================================================
> Results (Baseline Kernel):
> ==================================================
> Wakee 0[PID 8295] affined to CPUs: 10,
> Wakee 2[PID 8297] affined to CPUs: 30,
> Wakee 3[PID 8298] affined to CPUs: 40,
> Wakee 1[PID 8296] affined to CPUs: 20,
> Wakee 4[PID 8299] affined to CPUs: 50,
> Wakee 5[PID 8300] affined to CPUs: 60,
> Wakee 6[PID 8301] affined to CPUs: 70,
> Waker[PID 8302] affined to CPUs: 0,
> 
> |-----------------------------------|-------------------------|-----------------------------|
> | Metric                            | snooze                  | CEDE                        |
> |-----------------------------------|-------------------------|-----------------------------|
> | Usage                             | 47815                   | 2030160                     |
> | Above                             | 0                       | 2030043                     |
> | Below                             | 0                       | 0                           |
> | Time Spent (us)                   | 976317 (1.63%)          | 51046474 (85.08%)           |
> | Overall average sleep duration    | 28.721 us               |                             |
> | Overall average wakeup latency    | 6.858 us                |                             |
> |-----------------------------------|-------------------------|-----------------------------|
> 
> In this test, without the patch, the CPU often enters the CEDE state for
> sleep durations of around 20-30 microseconds, even though the CEDE state's
> residency time is 120 microseconds. This happens because the menu governor
> prioritizes the physical idle state (CEDE) if its exit latency is within
> the latency limits. It also uses next_timer_ns for comparison, which can
> be farther off than the actual idle duration as it is more predictable,
> instead of using predicted idle duration as a comparision point with the
> target residency.

Ideally that shouldn't be the case though (next_timer_ns be farther off the
actual idle duration).

> [snip]
Aboorva Devarajan Aug. 20, 2024, 8:51 a.m. UTC | #2
On Tue, 2024-08-13 at 13:56 +0100, Christian Loehle wrote:

Hi Christian,

Thanks a lot for your comments.
...
> On 8/9/24 08:31, Aboorva Devarajan wrote:
> > This patch aims to discuss a potential performance degradation that can occur
> > in certain workloads when the menu governor prioritizes selecting a physical
> > idle state over a polling state for short idle durations. 
> > 
> > Note: This patch is intended to showcase a performance degradation, applying
> > this patch could lead to increased power consumption due to the trade-off between
> > performance and power efficiency, potentially causing a higher preference for
> > performance at the expense of power usage.
> > 
> 
> Not really a menu expert, but at this point I don't know who dares call
> themselves one.
> The elephant in the room would be: Does teo work better for you?
> 

I ran some tests with the teo governor enabled, but it didn’t make a
lot of difference. The results are presented below.

> > [snip]
> > ==================================================
> > 
> > The menu governor contains a condition that selects physical idle states over,
> > such as the CEDE state over polling state, by checking if their exit latency meets
> > the latency requirements. This can lead to performance drops in workloads with
> > frequent short idle periods.
> > 
> > The specific condition which causes degradation is as below (menu governor):
> > 
> > ```
> > if (s->target_residency_ns > predicted_ns) {
> >     ...
> >     if ((drv->states[idx].flags & CPUIDLE_FLAG_POLLING) &&
> >         s->exit_latency_ns <= latency_req &&
> >         s->target_residency_ns <= data->next_timer_ns) {
> >         predicted_ns = s->target_residency_ns;
> >         idx = i;
> >         break;
> >     }
> >     ...
> > }
> > ```
> > 
> > This condition can cause the menu governor to choose the CEDE state on Power
> > Systems (residency: 120 us, exit latency: 12 us) over a polling state, even
> > when the predicted idle duration is much shorter than the target residency
> > of the physical state. This misprediction leads to performance degradation
> > in certain workloads.
> > 
> 
> So clearly the condition
> s->target_residency_ns <= data->next_timer_ns)
> is supposed to prevent this, but data->next_timer_ns isn't accurate,
> have you got any idea what it's set to in your workload usually?
> Seems like your workload is timer-based, so the idle duration should be
> predicted accurately.
> 

Yes, that's right. Ideally this condition should have prevented this,
but `data->next_timer_ns` is almost always greater than the actual
idle duration, which seems inaccurate.

> 
> > ==================================================
> > Test Results
> > ==================================================
> > 
> > This issue can be clearly observed with the below test.
> > 
> > A test with multiple wakee threads and a single waker thread was run to
> > demonstrate this issue. The waker thread periodically wakes up the wakee
> > threads after a specific sleep duration, creating a repeating of sleep -> wake
> > pattern. The test was run for a stipulated period, and cpuidle statistics are
> > collected.
> > 
> > ./cpuidle-test -a 0 -b 10 -b 20 -b 30 -b 40 -b 50 -b 60 -b 70 -r 20 -t 60
> > 
> > ==================================================
> > Results (Baseline Kernel):
> > ==================================================
> > Wakee 0[PID 8295] affined to CPUs: 10,
> > Wakee 2[PID 8297] affined to CPUs: 30,
> > Wakee 3[PID 8298] affined to CPUs: 40,
> > Wakee 1[PID 8296] affined to CPUs: 20,
> > Wakee 4[PID 8299] affined to CPUs: 50,
> > Wakee 5[PID 8300] affined to CPUs: 60,
> > Wakee 6[PID 8301] affined to CPUs: 70,
> > Waker[PID 8302] affined to CPUs: 0,
> > 
> > |-----------------------------------|-------------------------|-----------------------------|
> > | Metric                            | snooze                  | CEDE                        |
> > |-----------------------------------|-------------------------|-----------------------------|
> > | Usage                             | 47815                   | 2030160                     |
> > | Above                             | 0                       | 2030043                     |
> > | Below                             | 0                       | 0                           |
> > | Time Spent (us)                   | 976317 (1.63%)          | 51046474 (85.08%)           |
> > | Overall average sleep duration    | 28.721 us               |                             |
> > | Overall average wakeup latency    | 6.858 us                |                             |
> > |-----------------------------------|-------------------------|-----------------------------|
> > 
> > In this test, without the patch, the CPU often enters the CEDE state for
> > sleep durations of around 20-30 microseconds, even though the CEDE state's
> > residency time is 120 microseconds. This happens because the menu governor
> > prioritizes the physical idle state (CEDE) if its exit latency is within
> > the latency limits. It also uses next_timer_ns for comparison, which can
> > be farther off than the actual idle duration as it is more predictable,
> > instead of using predicted idle duration as a comparision point with the
> > target residency.
> 
> Ideally that shouldn't be the case though (next_timer_ns being farther off
> the actual idle duration).

I ran some experiments based on your suggestions. Below are the
relative average runtimes and percentage differences compared to
the base version:

Picked a single representative workload for simplicity:

Note: Lower (% Difference) the better.
|---------------------------|-----------------|--------------|
| Configuration             | Average Runtime | % Difference |
|---------------------------|-----------------|--------------|
| Base (menu)               | 1.00            | 0.00%        |
| Base + Patch 1 (menu)     | 0.92            | -8.00%       |
| Base + Patch 2 (menu)     | 0.98            | -2.00%       |
| Base (teo)                | 1.01            | +1.00%       |
|---------------------------|-----------------|--------------|
Patch 1: https://lore.kernel.org/all/20240809073120.250974-2-aboorvad@linux.ibm.com/  
Patch 2: https://lore.kernel.org/all/c20a07e4-b9e6-4a66-80f5-63d679b17c3b@arm.com/

It seems that Patch 2 does provide a slight improvement in runtime,
though not as significant as Patch 1. Additionally, teo does not seem
to help in this case.

Regarding the condition `s->target_residency_ns <= data->next_timer_ns`,
it appears that `data->next_timer_ns` is consistently greater than both
the actual idle duration and `s->target_residency_ns`, so this condition
nearly always holds true, indicating some inaccuracy. I'll investigate
this further and follow up with more details.

Regards,
Aboorva
Christian Loehle Aug. 21, 2024, 10:55 a.m. UTC | #3
On 8/20/24 09:51, Aboorva Devarajan wrote:
> On Tue, 2024-08-13 at 13:56 +0100, Christian Loehle wrote:
> 
> Hi Christian,
> 
> Thanks a lot for your comments.
> ...
>> On 8/9/24 08:31, Aboorva Devarajan wrote:
>>> This patch aims to discuss a potential performance degradation that can occur
>>> in certain workloads when the menu governor prioritizes selecting a physical
>>> idle state over a polling state for short idle durations. 
>>>
>>> Note: This patch is intended to showcase a performance degradation, applying
>>> this patch could lead to increased power consumption due to the trade-off between
>>> performance and power efficiency, potentially causing a higher preference for
>>> performance at the expense of power usage.
>>>
>>
>> Not really a menu expert, but at this point I don't know who dares call
>> themselves one.
>> The elephant in the room would be: Does teo work better for you?
>>
> 
> I ran some tests with the teo governor enabled, but it didn’t make a
> lot of difference. The results are presented below.
> 
>>> [snip]
>>>
>>> ==================================================
>>>
>>> The menu governor contains a condition that favors physical idle states,
>>> such as CEDE, over polling states whenever their exit latency meets the
>>> latency requirements. This can lead to performance drops in workloads
>>> with frequent short idle periods.
>>>
>>> The specific condition which causes degradation is as below (menu governor):
>>>
>>> ```
>>> if (s->target_residency_ns > predicted_ns) {
>>>     ...
>>>     if ((drv->states[idx].flags & CPUIDLE_FLAG_POLLING) &&
>>>         s->exit_latency_ns <= latency_req &&
>>>         s->target_residency_ns <= data->next_timer_ns) {
>>>         predicted_ns = s->target_residency_ns;
>>>         idx = i;
>>>         break;
>>>     }
>>>     ...
>>> }
>>> ```
>>>
>>> This condition can cause the menu governor to choose the CEDE state on Power
>>> Systems (residency: 120 us, exit latency: 12 us) over a polling state, even
>>> when the predicted idle duration is much shorter than the target residency
>>> of the physical state. This misprediction leads to performance degradation
>>> in certain workloads.
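
To make the failure mode concrete, the branch above can be mimicked in a short Python sketch. This is a hypothetical simplification, not the actual menu code: the state dictionaries and the `select_state` helper are invented for illustration, with the residency/latency numbers taken from the CEDE state listed earlier.

```python
# Hypothetical simplification of the contested menu-governor branch for a
# two-state driver (polling snooze + physical CEDE). The dictionaries and
# numbers are illustrative only, mirroring the CEDE state listed earlier
# (residency 120 us, exit latency 12 us), not the real kernel structures.

def select_state(states, predicted_ns, next_timer_ns, latency_req_ns):
    idx = 0
    for i, s in enumerate(states):
        if s["target_residency_ns"] > predicted_ns:
            # The branch under discussion: even though this state is deeper
            # than the predicted idle time, promote past a polling state if
            # the exit latency fits and the residency beats the next timer.
            if (states[idx]["polling"]
                    and s["exit_latency_ns"] <= latency_req_ns
                    and s["target_residency_ns"] <= next_timer_ns):
                idx = i
            break
        if s["exit_latency_ns"] <= latency_req_ns:
            idx = i
    return idx

SNOOZE = {"polling": True,  "target_residency_ns": 0,       "exit_latency_ns": 0}
CEDE   = {"polling": False, "target_residency_ns": 120_000, "exit_latency_ns": 12_000}

# Predicted idle ~30 us but an inaccurate next_timer_ns of 1 ms: CEDE (idx 1)
# is chosen despite 30 us << 120 us target residency -> an "Above" miss.
assert select_state([SNOOZE, CEDE], 30_000, 1_000_000, 50_000) == 1
# With an accurate next timer at 50 us, snooze (idx 0) is kept.
assert select_state([SNOOZE, CEDE], 30_000, 50_000, 50_000) == 0
```

As the two assertions show, the choice hinges entirely on next_timer_ns rather than the predicted idle duration.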
>>>
>>
>> So clearly the condition
>> s->target_residency_ns <= data->next_timer_ns)
>> is supposed to prevent this, but data->next_timer_ns isn't accurate,
>> have you got any idea what it's set to in your workload usually?
>> Seems like your workload is timer-based, so the idle duration should be
>> predicted accurately.
>>
> 
> Yes, that's right: ideally this condition should have prevented this,
> but `data->next_timer_ns` is almost always greater than the actual
> idle duration, which seems inaccurate.
> 
>>
>>> ==================================================
>>> Test Results
>>> ==================================================
>>>
>>> This issue can be clearly observed with the below test.
>>>
>>> A test with multiple wakee threads and a single waker thread was run to
>>> demonstrate this issue. The waker thread periodically wakes up the wakee
>>> threads after a specific sleep duration, creating a repeating sleep ->
>>> wake pattern. The test was run for a stipulated period, and cpuidle
>>> statistics were collected.
>>>
>>> ./cpuidle-test -a 0 -b 10 -b 20 -b 30 -b 40 -b 50 -b 60 -b 70 -r 20 -t 60
>>>
>>> ==================================================
>>> Results (Baseline Kernel):
>>> ==================================================
>>> Wakee 0[PID 8295] affined to CPUs: 10,
>>> Wakee 2[PID 8297] affined to CPUs: 30,
>>> Wakee 3[PID 8298] affined to CPUs: 40,
>>> Wakee 1[PID 8296] affined to CPUs: 20,
>>> Wakee 4[PID 8299] affined to CPUs: 50,
>>> Wakee 5[PID 8300] affined to CPUs: 60,
>>> Wakee 6[PID 8301] affined to CPUs: 70,
>>> Waker[PID 8302] affined to CPUs: 0,
>>>
>>>> -----------------------------------|-------------------------|-----------------------------|
>>>> Metric                            | snooze                  | CEDE                        |
>>>> -----------------------------------|-------------------------|-----------------------------|
>>>> Usage                             | 47815                   | 2030160                     |
>>>> Above                             | 0                       | 2030043                     |
>>>> Below                             | 0                       | 0                           |
>>>> Time Spent (us)                   | 976317 (1.63%)          | 51046474 (85.08%)           |
>>>> Overall average sleep duration    | 28.721 us               |                             |
>>>> Overall average wakeup latency    | 6.858 us                |                             |
>>>> -----------------------------------|-------------------------|-----------------------------|
>>>
>>> In this test, without the patch, the CPU often enters the CEDE state for
>>> sleep durations of around 20-30 microseconds, even though the CEDE state's
>>> target residency is 120 microseconds. This happens because the menu
>>> governor prioritizes the physical idle state (CEDE) if its exit latency is
>>> within the latency limits. It also compares the target residency against
>>> next_timer_ns, which can be much farther off from the actual idle duration
>>> than the predicted idle duration is, instead of using the predicted idle
>>> duration as the comparison point.
>>
>> Ideally that shouldn't be the case though (next_timer_ns being farther
>> off than the actual idle duration)
> 
> I ran some experiments based on your suggestions. Below are the
> relative average runtimes and percentage differences compared to
> the base version:
> 
> Picked a single representative workload for simplicity:
> 
> Note: Lower (% Difference) the better.
> |---------------------------|-----------------|--------------|
> | Configuration             | Average Runtime | % Difference |
> |---------------------------|-----------------|--------------|
> | Base (menu)               | 1.00            | 0.00%        |
> | Base + Patch 1 (menu)     | 0.92            | -8.00%       |
> | Base + Patch 2 (menu)     | 0.98            | -2.00%       |
> | Base (teo)                | 1.01            | +1.00%       |
> |---------------------------|-----------------|--------------|
> Patch 1: https://lore.kernel.org/all/20240809073120.250974-2-aboorvad@linux.ibm.com/  
> Patch 2: https://lore.kernel.org/all/c20a07e4-b9e6-4a66-80f5-63d679b17c3b@arm.com/
> 
> It seems that Patch 2 does provide a slight improvement in runtime, but
> not significantly like Patch 1. Additionally, teo does not seem
> to help in this case.
> 
> Regarding the condition `s->target_residency_ns <= data->next_timer_ns`,
> it appears that `data->next_timer_ns` is consistently greater than
> both actual idle duration and `s->target_residency_ns` so this condition
> nearly always holds true, indicating some inaccuracy. I'll investigate
> this further and follow up with more details.

The wakeup source(s), since they don't seem to be timer events, would be
interesting, although a bit of a hassle to get right.
What's the workload anyway?

> 
> Regards,
> Aboorva
>
Aboorva Devarajan Sept. 19, 2024, 5:02 a.m. UTC | #4
On Wed, 2024-08-21 at 11:55 +0100, Christian Loehle wrote:

> On 8/20/24 09:51, Aboorva Devarajan wrote:
> > On Tue, 2024-08-13 at 13:56 +0100, Christian Loehle wrote:
> > 
...
> The wakeup source(s), since they don't seem to be timer events would be
> interesting, although a bit of a hassle to get right. 
> What's the workload anyway?
> 

Hi Christian,

Apologies for the delayed response.

I identified the part of the internal workload responsible for the
performance improvement with the patch: it appears to be the OLTP
queries, and yes, the wakeup sources are not timer events.

---------------------------------------------------------------------

Additionally, to reproduce the issue using a more open and accessible
benchmark, I conducted similar experiments using pgbench and observed
similar improvements with the patch, particularly when running the
read-intensive SELECT query benchmarks.

---------------------------------------------------------------------
System Details
---------------------------------------------------------------------
$ lscpu
Architecture:             ppc64le
  Byte Order:             Little Endian
CPU(s):                   224
  On-line CPU(s) list:    0-223
Model name:               POWER10 (architected), altivec supported
  Model:                  2.0 (pvr 0080 0200)
  Thread(s) per core:     8
  Core(s) per socket:     3
  Socket(s):              8
Virtualization features:  
  Hypervisor vendor:      pHyp
  Virtualization type:    para
---------------------------------------------------------------------

$ cpupower idle-info
CPUidle driver: pseries_idle
CPUidle governor: menu
analyzing CPU 0:

Number of idle states: 2
Available idle states: snooze CEDE
snooze:
Flags/Description: snooze
Latency: 0
Residency: 0
Usage: 6229
Duration: 402142
CEDE:
Flags/Description: CEDE
Latency: 12
Residency: 120
Usage: 191411
Duration: 36329999037

---------------------------------------------------------------------
PostgreSQL Benchmark:
---------------------------------------------------------------------

I ran pgbench with 224 clients and 20 threads for 600 seconds,
performing only SELECT queries against the pgbench database to 
evaluate performance under read-intensive workloads:

$ pgbench -c 224 -j 20 -T 600 -S pgbench

Latency:

|---|-------------|------------|------------|
| # | Before (ms) | After (ms) | Change (%) |
|Run| Patch       | Patch      |            |
|---|-------------|------------|------------|
| 1 | 0.343       | 0.287      | -16.31%    |
| 2 | 0.334       | 0.286      | -14.37%    |
| 3 | 0.333       | 0.286      | -14.11%    |
| 4 | 0.341       | 0.288      | -15.55%    |
| 5 | 0.342       | 0.288      | -15.79%    |
|---|-------------|------------|------------|

Latency Reduction: After applying the patch, the latency decreased
by 14% to 16% across multiple runs.

Throughput per second:

|---|-------------|------------|------------|
| # | Before      | After      | Change (%) |
|Run| Patch       | Patch      |            |
|---|-------------|------------|------------|
| 1 | 652,544.23  | 780,613.42 | +19.63%    |
| 2 | 670,243.45  | 784,348.44 | +17.04%    |
| 3 | 673,495.39  | 784,458.12 | +16.48%    |
| 4 | 656,609.16  | 778,531.20 | +18.57%    |
| 5 | 654,644.52  | 778,322.88 | +18.88%    |
|---|-------------|------------|------------|

Transactions per second Improvement: The patch led to an increase in
TPS, ranging from 16% to 19%.

This indicates that the patch significantly reduces latency while
improving throughput (TPS). pgbench is an OLTP benchmark and doesn't
do timer-based wakeups, so this is in line with the improvements
I saw in the originally reported OLTP workload as well.

---------------------------------------------------------------------
Additional CPU Idle test with custom benchmark:
---------------------------------------------------------------------
I also wrote a simple benchmark [1] to analyze the impact of cpuidle
state selection, comparing timer-based wakeups and non-timer
(pipe-based) wakeups.

This test involves a single waker thread periodically waking up a
single wakee thread, simulating a repeating sleep-wake pattern. The
test was run with both timer-based and pipe-based wakeups, and cpuidle
statistics (usage, time, above, below) were collected.
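
For illustration, the sleep-wake pattern can be sketched in Python. This is a hypothetical sketch only: the real benchmark [1] is written in C and additionally pins threads to CPUs and snapshots cpuidle sysfs statistics, none of which is reproduced here.

```python
# Hypothetical Python sketch of the pipe-based (non-timer) sleep-wake
# pattern: one waker thread periodically writes to a pipe, one wakee thread
# blocks on the read side, so the wakee's CPU idles between wakeups.
import os
import threading
import time

SLEEP_INTERVAL_S = 50e-6   # 50 us between wakeups, as in the tables below
ITERATIONS = 1000

r, w = os.pipe()

def waker():
    for _ in range(ITERATIONS):
        time.sleep(SLEEP_INTERVAL_S)  # period between wakeups
        os.write(w, b"x")             # non-timer wakeup for the wakee

def wakee(latencies):
    for _ in range(ITERATIONS):
        t0 = time.monotonic_ns()
        os.read(r, 1)                 # blocks; this is where the CPU idles
        latencies.append(time.monotonic_ns() - t0)

lat = []
threads = [threading.Thread(target=waker),
           threading.Thread(target=wakee, args=(lat,))]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"avg observed sleep: {sum(lat) / len(lat) / 1000:.1f} us")
```

Swapping `time.sleep` plus the pipe write for a plain timed sleep in the wakee gives the timer-based variant of the same pattern.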

Timer based wakeup:

+------------------------+---------------------+---------------------+
| Metric                 | Without Patch       | With Patch          |
+------------------------+---------------------+---------------------+
| Wakee thread - CPU     | 110                 | 110                 |
| Waker thread - CPU     | 20                  | 20                  |
| Sleep Interval         | 50 us               | 50 us               |
| Total Wakeups          | -                   | -                   |
| Avg Wakeup Latency     | -                   | -                   |
| Actual Sleep time      | 52.639 us           | 52.627 us           |
+------------------------+---------------------+---------------------+
| Idle State 0 Usage Diff| 94,879              | 94,879              |
| Idle State 0 Time Diff | 4,700,323 ns        | 4,697,576 ns        |
| Idle State 1 Usage Diff| 0                   | 0                   |
| Idle State 1 Time Diff | 0 ns                | 0 ns                |
+------------------------+---------------------+---------------------+
| Total Above Usage      | 0 (0.00%)           | 0 (0.00%)           |
+------------------------+---------------------+---------------------+
| Total Below Usage      | 0 (0.00%)           | 0 (0.00%)           |
+------------------------+---------------------+---------------------+

In timer-based wakeups, the menu governor effectively predicts the idle
duration both with and without the patch. This ensures that there are
few or no instances of "Above" usage, allowing the CPU to remain in the
correct idle state.

The condition (s->target_residency_ns <= data->next_timer_ns) in the menu
governor ensures that a physical idle state is not prioritized when a
timer event is expected before the target residency of the first physical
idle state.

As a result, the patch has no impact in this case, and performance
remains stable with timer based wakeups.

Pipe based wakeup (non-timer wakeup):

+------------------------+---------------------+---------------------+
| Metric                 | Without Patch       | With Patch          |
+------------------------+---------------------+---------------------+
| Wakee thread - CPU     | 110                 | 110                 |
| Waker thread - CPU     | 20                  | 20                  |
| Sleep Interval         | 50 us               | 50 us               |
| Total Wakeups          | 97031               | 96583               |
| Avg Wakeup Latency     | 7.070 us            | 4.879 us            |
| Actual Sleep time      | 51.366 us           | 51.605 us           |
+------------------------+---------------------+---------------------+
| Idle State 0 Usage Diff| 1209                | 96,586              |
| Idle State 0 Time Diff | 55,579 ns           | 4,510,003 ns        |
| Idle State 1 Usage Diff| 95,826              | 5                   |
| Idle State 1 Time Diff | 4,522,639 ns        | 198 ns              |
+------------------------+---------------------+---------------------+
| Total Above Usage      | 95,824 (98.75%)     | 5 (0.01%)           |
+------------------------+---------------------+---------------------+
| Total Below Usage      | 0 (0.00%)           | 0 (0.00%)           |
+------------------------+---------------------+---------------------+

In the pipe-based wakeup scenario, before the patch was applied, the
"Above" metric was notably high, around 98.75%. This suggests that the
menu governor frequently selected a deeper idle state like CEDE, even
when the idle period was relatively short.

This happened because the menu governor is inclined to prioritize the
physical idle state (CEDE) even when its target residency
(s->target_residency_ns) is longer than the predicted idle time
(predicted_ns); data->next_timer_ns is not a useful bound here since
these wakeups are not timer-based.

In this test, despite the actual idle period being around 50 microseconds,
the menu governor still chose CEDE state, which has a target residency of
120 microseconds.
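
As a quick sanity check, the "Above" percentages in the pipe-based table can be recomputed from the per-state usage diffs (a trivial helper written for this check, not part of the benchmark):

```python
# Recompute "Total Above Usage" percentages from the per-state usage diffs
# in the pipe-based table above.
def above_pct(above_count, state_usage_diffs):
    total = sum(state_usage_diffs)
    return round(100 * above_count / total, 2)

print(above_pct(95_824, [1_209, 95_826]))  # without patch -> 98.75
print(above_pct(5, [96_586, 5]))           # with patch    -> 0.01
```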

---------------------------------------------------------------------

While timer-based wakeups performed well even without the patch, workloads
that don't have timers as wakeup source but have predictable idle durations
shorter than the first idle state's target residency benefit significantly
from the patch.

It would be helpful to understand why prioritizing deep physical idle
states over shallow ones, even for short idle periods that don't meet
the target residency as described above, is considered more beneficial.

Is there something I could be missing here?

Any comments or suggestions will be really helpful.

[1] https://github.com/AboorvaDevarajan/linux-utils/tree/main/cpuidle/cpuidle_wakeup

Thanks,
Aboorva
Christian Loehle Sept. 19, 2024, 8:49 a.m. UTC | #5
On 9/19/24 06:02, Aboorva Devarajan wrote:
> On Wed, 2024-08-21 at 11:55 +0100, Christian Loehle wrote:
> 
>> On 8/20/24 09:51, Aboorva Devarajan wrote:
>>> On Tue, 2024-08-13 at 13:56 +0100, Christian Loehle wrote:
>>>
> ...
>> The wakeup source(s), since they don't seem to be timer events would be
>> interesting, although a bit of a hassle to get right. 
>> What's the workload anyway?
>>
> 
> Hi Christian,
> 
> Apologies for the delayed response.

No worries!

> 
> I identified the part of the internal workload responsible for performance
> improvement with the patch and it appears to be the OLTP queries, and yes
> the wakeup sources are not timer events.
> 
> ---------------------------------------------------------------------
> 
> Additionally to reproduce the issue using a more open and accessible
> benchmark, conducted similar experiments using pgbench and I observed
> similar improvements with the patch, particularly when running the
> read intensive select query benchmarks.
> 
> ---------------------------------------------------------------------
> System Details
> ---------------------------------------------------------------------
> $ lscpu
> Architecture:             ppc64le
>   Byte Order:             Little Endian
> CPU(s):                   224
>   On-line CPU(s) list:    0-223
> Model name:               POWER10 (architected), altivec supported
>   Model:                  2.0 (pvr 0080 0200)
>   Thread(s) per core:     8
>   Core(s) per socket:     3
>   Socket(s):              8
> Virtualization features:  
>   Hypervisor vendor:      pHyp
>   Virtualization type:    para
> ---------------------------------------------------------------------
> 
> $ cpupower idle-info
> CPUidle driver: pseries_idle
> CPUidle governor: menu
> analyzing CPU 0:
> 
> Number of idle states: 2
> Available idle states: snooze CEDE
> snooze:
> Flags/Description: snooze
> Latency: 0
> Residency: 0
> Usage: 6229
> Duration: 402142
> CEDE:
> Flags/Description: CEDE
> Latency: 12
> Residency: 120
> Usage: 191411
> Duration: 36329999037
> 
> ---------------------------------------------------------------------
> PostgreSQL Benchmark:
> ---------------------------------------------------------------------
> 
> I ran pgbench with 224 clients and 20 threads for 600 seconds,
> performing only SELECT queries against the pgbench database to 
> evaluate performance under read-intensive workloads:
> 
> $ pgbench -c 224 -j 20 -T 600 -S pgbench
> 
> Latency:
> 
> |---|-------------|------------|------------|
> | # | Before (ms) | After (ms) | Change (%) |
> |Run| Patch       | Patch      |            |
> |---|-------------|------------|------------|
> | 1 | 0.343       | 0.287      | -16.31%    |
> | 2 | 0.334       | 0.286      | -14.37%    |
> | 3 | 0.333       | 0.286      | -14.11%    |
> | 4 | 0.341       | 0.288      | -15.55%    |
> | 5 | 0.342       | 0.288      | -15.79%    |
> |---|-------------|------------|------------|
> 
> Latency Reduction: After applying the patch, the latency decreased
> by 14% to 16% across multiple runs.
> 
> Throughput per second:
> 
> |---|-------------|------------|------------|
> | # | Before      | After      | Change (%) |
> |Run| Patch       | Patch      |            |
> |---|-------------|------------|------------|
> | 1 | 652,544.23  | 780,613.42 | +19.63%    |
> | 2 | 670,243.45  | 784,348.44 | +17.04%    |
> | 3 | 673,495.39  | 784,458.12 | +16.48%    |
> | 4 | 656,609.16  | 778,531.20 | +18.57%    |
> | 5 | 654,644.52  | 778,322.88 | +18.88%    |
> |---|-------------|------------|------------|

Do you happen to have the idle stats here, too?
Especially above/below see comment below.

> 
> Transactions per second Improvement: The patch led to an increase in
> TPS, ranging from 16% to 19%.
> 
> This indicates that the patch significantly reduces latency while
> improving throughput (TPS).  pgbench is an OLTP benchmark and doesn't
> do timer based wakeups, this is in-line with the improvements
> I saw in the originally reported OLTP workload as well. 
> 
> ---------------------------------------------------------------------
> Additional CPU Idle test with custom benchmark:
> ---------------------------------------------------------------------
> I also wrote a simple benchmark [1] to analyze the impact of cpuidle
> state selection, comparing timer-based wakeups and non-timer
> (pipe-based) wakeups.
> 
> This test involves a single waker thread periodically waking up a
> single wakee thread, simulating a repeating sleep-wake pattern. The
> test was run with both timer-based and pipe-based wakeups, and cpuidle
> statistics (usage, time, above, below) were collected.

Thanks, pretty cool and just quickly checking it seems to be all there:
hrtimer, different cpusets for wakee and waker, the pipe wakeup.
That could be useful, only thing missing is the license header ;)

> 
> Timer based wakeup:
> 
> +------------------------+---------------------+---------------------+
> | Metric                 | Without Patch       | With Patch          |
> +------------------------+---------------------+---------------------+
> | Wakee thread - CPU     | 110                 | 110                 |
> | Waker thread - CPU     | 20                  | 20                  |
> | Sleep Interval         | 50 us               | 50 us               |
> | Total Wakeups          | -                   | -                   |
> | Avg Wakeup Latency     | -                   | -                   |
> | Actual Sleep time      | 52.639 us           | 52.627 us           |
> +------------------------+---------------------+---------------------+
> | Idle State 0 Usage Diff| 94,879              | 94,879              |
> | Idle State 0 Time Diff | 4,700,323 ns        | 4,697,576 ns        |
> | Idle State 1 Usage Diff| 0                   | 0                   |
> | Idle State 1 Time Diff | 0 ns                | 0 ns                |
> +------------------------+---------------------+---------------------+
> | Total Above Usage      | 0 (0.00%)           | 0 (0.00%)           |
> +------------------------+---------------------+---------------------+
> | Total Below Usage      | 0 (0.00%)           | 0 (0.00%)           |
> +------------------------+---------------------+---------------------+
> 
> In timer-based wakeups, the menu governor effectively predicts the idle
> duration both with and without the patch. This ensures that there are
> few or no instances of "Above" usage, allowing the CPU to remain in the
> correct idle state.
> 
> The condition (s->target_residency_ns <= data->next_timer_ns) in the menu
> governor ensures that a physical idle state is not prioritized when a
> timer event is expected before the target residency of the first physical
> idle state.
> 
> As a result, the patch has no impact in this case, and performance
> remains stable with timer based wakeups.
> 
> Pipe based wakeup (non-timer wakeup):
> 
> +------------------------+---------------------+---------------------+
> | Metric                 | Without Patch       | With Patch          |
> +------------------------+---------------------+---------------------+
> | Wakee thread - CPU     | 110                 | 110                 |
> | Waker thread - CPU     | 20                  | 20                  |
> | Sleep Interval         | 50 us               | 50 us               |
> | Total Wakeups          | 97031               | 96583               |
> | Avg Wakeup Latency     | 7.070 us            | 4.879 us            |
> | Actual Sleep time      | 51.366 us           | 51.605 us           |
> +------------------------+---------------------+---------------------+
> | Idle State 0 Usage Diff| 1209                | 96,586              |
> | Idle State 0 Time Diff | 55,579 ns           | 4,510,003 ns        |
> | Idle State 1 Usage Diff| 95,826              | 5                   |
> | Idle State 1 Time Diff | 4,522,639 ns        | 198 ns              |
> +------------------------+---------------------+---------------------+
> | Total Above Usage      | 95,824 (98.75%)     | 5 (0.01%)           |
> +------------------------+---------------------+---------------------+
> | Total Below Usage      | 0 (0.00%)           | 0 (0.00%)           |
> +------------------------+---------------------+---------------------+
> 
> In the pipe-based wakeup scenario, before the patch was applied, the 
> "Above" metric was notably high around 98.75%. This suggests that the
> menu governor frequently selected a deeper idle state like CEDE, even
> when the idle period was relatively short.
> 
> This happened because the menu governor is inclined to prioritize the
> physical idle state (CEDE) even when the target residency time of the
> physical idle state (s->target_residency_ns) was longer than the
> predicted idle time (predicted_ns), so data->next_timer_ns won't be
> relevant here in non-timer wakeups.
> 
> In this test, despite the actual idle period being around 50 microseconds,
> the menu governor still chose CEDE state, which has a target residency of
> 120 microseconds.

And the idle_miss statistics confirm that these were mostly wrong
decisions in hindsight.
I'll go through the menu code again; this indeed shouldn't be happening.
I'd be very surprised if upstream teo performed as badly (or actually badly
at all) on this, although it doesn't seem to help your actual workload,
would you mind giving that a try too?

Regards,
Christian

> 
> ---------------------------------------------------------------------
> 
> While timer-based wakeups performed well even without the patch, workloads
> that don't have timers as wakeup source but have predictable idle durations
> shorter than the first idle state's target residency benefit significantly
> from the patch.
> 
> It will be helpful to understand why prioritizing deep physical idle
> states over shallow ones even for short idle periods that don’t meet
> the target residency like mentioned above is considered more beneficial.
> 
> Is there something I could be missing here?
> 
> Any comments or suggestions will be really helpful.
> 
> [1] https://github.com/AboorvaDevarajan/linux-utils/tree/main/cpuidle/cpuidle_wakeup
> 
> Thanks,
> Aboorva
>
Christian Loehle Sept. 19, 2024, 9:43 a.m. UTC | #6
On 9/19/24 09:49, Christian Loehle wrote:
> On 9/19/24 06:02, Aboorva Devarajan wrote:
>> On Wed, 2024-08-21 at 11:55 +0100, Christian Loehle wrote:
> 
>>
>> Timer based wakeup:
>>
>> +------------------------+---------------------+---------------------+
>> | Metric                 | Without Patch       | With Patch          |
>> +------------------------+---------------------+---------------------+
>> | Wakee thread - CPU     | 110                 | 110                 |
>> | Waker thread - CPU     | 20                  | 20                  |
>> | Sleep Interval         | 50 us               | 50 us               |
>> | Total Wakeups          | -                   | -                   |
>> | Avg Wakeup Latency     | -                   | -                   |
>> | Actual Sleep time      | 52.639 us           | 52.627 us           |
>> +------------------------+---------------------+---------------------+
>> | Idle State 0 Usage Diff| 94,879              | 94,879              |
>> | Idle State 0 Time Diff | 4,700,323 ns        | 4,697,576 ns        |
>> | Idle State 1 Usage Diff| 0                   | 0                   |
>> | Idle State 1 Time Diff | 0 ns                | 0 ns                |
>> +------------------------+---------------------+---------------------+
>> | Total Above Usage      | 0 (0.00%)           | 0 (0.00%)           |
>> +------------------------+---------------------+---------------------+
>> | Total Below Usage      | 0 (0.00%)           | 0 (0.00%)           |
>> +------------------------+---------------------+---------------------+
>>
>> In timer-based wakeups, the menu governor effectively predicts the idle
>> duration both with and without the patch. This ensures that there are
>> few or no instances of "Above" usage, allowing the CPU to remain in the
>> correct idle state.
>>
>> The condition (s->target_residency_ns <= data->next_timer_ns) in the menu
>> governor ensures that a physical idle state is not prioritized when a
>> timer event is expected before the target residency of the first physical
>> idle state.
>>
>> As a result, the patch has no impact in this case, and performance
>> remains stable with timer based wakeups.
>>
>> Pipe based wakeup (non-timer wakeup):
>>
>> +------------------------+---------------------+---------------------+
>> | Metric                 | Without Patch       | With Patch          |
>> +------------------------+---------------------+---------------------+
>> | Wakee thread - CPU     | 110                 | 110                 |
>> | Waker thread - CPU     | 20                  | 20                  |
>> | Sleep Interval         | 50 us               | 50 us               |
>> | Total Wakeups          | 97031               | 96583               |
>> | Avg Wakeup Latency     | 7.070 us            | 4.879 us            |
>> | Actual Sleep time      | 51.366 us           | 51.605 us           |
>> +------------------------+---------------------+---------------------+
>> | Idle State 0 Usage Diff| 1209                | 96,586              |
>> | Idle State 0 Time Diff | 55,579 ns           | 4,510,003 ns        |
>> | Idle State 1 Usage Diff| 95,826              | 5                   |
>> | Idle State 1 Time Diff | 4,522,639 ns        | 198 ns              |
>> +------------------------+---------------------+---------------------+
>> | Total Above Usage      | 95,824 (98.75%)     | 5 (0.01%)           |
>> +------------------------+---------------------+---------------------+
>> | Total Below Usage      | 0 (0.00%)           | 0 (0.00%)           |
>> +------------------------+---------------------+---------------------+
>>
>> In the pipe-based wakeup scenario, before the patch was applied, the 
>> "Above" metric was notably high around 98.75%. This suggests that the
>> menu governor frequently selected a deeper idle state like CEDE, even
>> when the idle period was relatively short.
>>
>> This happened because the menu governor is inclined to prioritize the
>> physical idle state (CEDE) even when the target residency time of the
>> physical idle state (s->target_residency_ns) was longer than the
>> predicted idle time (predicted_ns), so data->next_timer_ns won't be
>> relevant here in non-timer wakeups.
>>
>> In this test, despite the actual idle period being around 50 microseconds,
>> the menu governor still chose CEDE state, which has a target residency of
>> 120 microseconds.
> 
> And the idle_miss statistics confirm that these were mostly wrong
> decisions in hindsight.
> I'll go through the menu code again; this indeed shouldn't be happening.
> I'd be very surprised if upstream teo performed as badly (or actually badly
> at all) on this, although it doesn't seem to help your actual workload,
> would you mind giving that a try too?
> 
> Regards,
> Christian

So with a workload as static as this, the recent interval detection of menu
should take effect. It records the last 8 idle interval lengths and does some
statistics to see if they are somewhat stable, and uses those instead of the
next timer event.
While I wouldn't vouch for the statistics just yet, the results don't seem
to be completely unreasonable.
Do you mind trying a) printing some snapshots of these intervals and the
resulting return value of get_typical_interval()?
b) Trying this out? Setting these values to around 50us returns 50us for
me as expected, and therefore an idle state above that should not be
selected.

-->8--

--- a/drivers/cpuidle/governors/menu.c
+++ b/drivers/cpuidle/governors/menu.c
@@ -112,6 +112,8 @@ struct menu_device {
        int             interval_ptr;
 };
 
+static int intervals_dummy[] = {50, 52, 48, 50, 51, 49, 51, 51};
+
 static inline int which_bucket(u64 duration_ns, unsigned int nr_iowaiters)
 {
        int bucket = 0;
@@ -177,7 +179,7 @@ static unsigned int get_typical_interval(struct menu_device *data)
        sum = 0;
        divisor = 0;
        for (i = 0; i < INTERVALS; i++) {
-               unsigned int value = data->intervals[i];
+               unsigned int value = intervals_dummy[i];
                if (value <= thresh) {
                        sum += value;
                        divisor++;
@@ -200,7 +202,7 @@ static unsigned int get_typical_interval(struct menu_device *data)
        /* Then try to determine variance */
        variance = 0;
        for (i = 0; i < INTERVALS; i++) {
-               unsigned int value = data->intervals[i];
+               unsigned int value = intervals_dummy[i];
                if (value <= thresh) {
                        int64_t diff = (int64_t)value - avg;
                        variance += diff * diff;
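
For reference, the statistics being discussed can be sketched outside the kernel. This is a simplified re-implementation, not the exact menu.c code: the real get_typical_interval() also caps the variance against overflow and has an extra surviving-sample-count condition on one acceptance branch.

```c
#include <stdint.h>

#define INTERVALS 8

/*
 * Simplified sketch of get_typical_interval() from
 * drivers/cpuidle/governors/menu.c: compute mean and variance of the
 * last INTERVALS idle lengths; if the sample is too noisy, drop the
 * largest value and retry.
 */
static unsigned int typical_interval(const unsigned int intervals[INTERVALS])
{
	unsigned int thresh = UINT32_MAX;

	for (;;) {
		uint64_t sum = 0, variance = 0;
		unsigned int avg, max = 0, divisor = 0;
		int i;

		for (i = 0; i < INTERVALS; i++) {
			unsigned int value = intervals[i];

			if (value <= thresh) {
				sum += value;
				divisor++;
				if (value > max)
					max = value;
			}
		}
		avg = (unsigned int)(sum / divisor);

		for (i = 0; i < INTERVALS; i++) {
			unsigned int value = intervals[i];

			if (value <= thresh) {
				int64_t diff = (int64_t)value - avg;

				variance += (uint64_t)(diff * diff);
			}
		}
		variance /= divisor;

		/* "Stable enough": mean dominates stddev, or variance is tiny. */
		if ((uint64_t)avg * avg > variance * 36 || variance <= 400)
			return avg;

		/* Too noisy and too few samples left: no prediction. */
		if (divisor * 4 <= INTERVALS * 3)
			return UINT32_MAX;

		thresh = max - 1; /* discard the largest outlier and retry */
	}
}
```

With the dummy values above ({50, 52, 48, 50, 51, 49, 51, 51}) this returns 50, matching the expectation that no idle state above ~50us should be selected.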
Christian Loehle Sept. 19, 2024, 2:07 p.m. UTC | #7
On 9/19/24 10:43, Christian Loehle wrote:
> On 9/19/24 09:49, Christian Loehle wrote:
>> On 9/19/24 06:02, Aboorva Devarajan wrote:
>>> On Wed, 2024-08-21 at 11:55 +0100, Christian Loehle wrote:
>>
>>>
>>> Timer based wakeup:
>>>
>>> +------------------------+---------------------+---------------------+
>>> | Metric                 | Without Patch       | With Patch          |
>>> +------------------------+---------------------+---------------------+
>>> | Wakee thread - CPU     | 110                 | 110                 |
>>> | Waker thread - CPU     | 20                  | 20                  |
>>> | Sleep Interval         | 50 us               | 50 us               |
>>> | Total Wakeups          | -                   | -                   |
>>> | Avg Wakeup Latency     | -                   | -                   |
>>> | Actual Sleep time      | 52.639 us           | 52.627 us           |
>>> +------------------------+---------------------+---------------------+
>>> | Idle State 0 Usage Diff| 94,879              | 94,879              |
>>> | Idle State 0 Time Diff | 4,700,323 ns        | 4,697,576 ns        |
>>> | Idle State 1 Usage Diff| 0                   | 0                   |
>>> | Idle State 1 Time Diff | 0 ns                | 0 ns                |
>>> +------------------------+---------------------+---------------------+
>>> | Total Above Usage      | 0 (0.00%)           | 0 (0.00%)           |
>>> +------------------------+---------------------+---------------------+
>>> | Total Below Usage      | 0 (0.00%)           | 0 (0.00%)           |
>>> +------------------------+---------------------+---------------------+
>>>
>>> In timer-based wakeups, the menu governor effectively predicts the idle
>>> duration both with and without the patch. This ensures that there are
>>> few or no instances of "Above" usage, allowing the CPU to remain in the
>>> correct idle state.
>>>
>>> The condition (s->target_residency_ns <= data->next_timer_ns) in the menu
>>> governor ensures that a physical idle state is not prioritized when a
>>> timer event is expected before the target residency of the first physical
>>> idle state.
>>>
>>> As a result, the patch has no impact in this case, and performance
>>> remains stable with timer based wakeups.
>>>
>>> Pipe based wakeup (non-timer wakeup):
>>>
>>> +------------------------+---------------------+---------------------+
>>> | Metric                 | Without Patch       | With Patch          |
>>> +------------------------+---------------------+---------------------+
>>> | Wakee thread - CPU     | 110                 | 110                 |
>>> | Waker thread - CPU     | 20                  | 20                  |
>>> | Sleep Interval         | 50 us               | 50 us               |
>>> | Total Wakeups          | 97031               | 96583               |
>>> | Avg Wakeup Latency     | 7.070 us            | 4.879 us            |
>>> | Actual Sleep time      | 51.366 us           | 51.605 us           |
>>> +------------------------+---------------------+---------------------+
>>> | Idle State 0 Usage Diff| 1209                | 96,586              |
>>> | Idle State 0 Time Diff | 55,579 ns           | 4,510,003 ns        |
>>> | Idle State 1 Usage Diff| 95,826              | 5                   |
>>> | Idle State 1 Time Diff | 4,522,639 ns        | 198 ns              |
>>> +------------------------+---------------------+---------------------+
>>> +------------------------+---------------------+---------------------+
>>> | **Total Above Usage**  | 95,824 (98.75%)     | 5 (0.01%)           |
>>> +------------------------+---------------------+---------------------+     
>>> +------------------------+---------------------+---------------------+
>>> | Total Below Usage      | 0 (0.00%)           | 0 (0.00%)           |
>>> +------------------------+---------------------+---------------------+
>>>
>>> In the pipe-based wakeup scenario, before the patch was applied, the 
>>> "Above" metric was notably high around 98.75%. This suggests that the
>>> menu governor frequently selected a deeper idle state like CEDE, even
>>> when the idle period was relatively short.
>>>
>>> This happened because the menu governor is inclined to prioritize the
>>> physical idle state (CEDE) even when the target residency time of the
>>> physical idle state (s->target_residency_ns) was longer than the
>>> predicted idle time (predicted_ns), so data->next_timer_ns won't be
>>> relevant here in non-timer wakeups.
>>>
>>> In this test, despite the actual idle period being around 50 microseconds,
>>> the menu governor still chose CEDE state, which has a target residency of
>>> 120 microseconds.
>>
>> And the idle_miss statistics confirms that this was mostly wrong decisions
>> in hindsight.
>> I'll go through the menu code again, this indeed shouldn't be happening.
>> I'd be very surprised if upstream teo performed as badly (or actually badly
>> at all) on this, although it doesn't seem to help your actual workload,
>> would you mind giving that a try too?
>>
>> Regards,
>> Christian
> 
> So with a workload as static as this, the recent interval detection of menu
> should take effect. It records the last 8 idle interval lengths and does some
> statistics to see if they are somewhat stable, and uses those instead of the
> next timer event.
> While I wouldn't vouch for the statistics just yet, the results don't seem
> to be completely unreasonable.
> Do you mind trying a) printing some snapshots of these intervals and the
> resulting return value of get_typical_interval()?
> b) Trying this out? Setting these values to something around 50us returns
> 50us for me as expected, and therefore should not select an idle state above
> that.
> 
> -->8--
> 
> --- a/drivers/cpuidle/governors/menu.c
> +++ b/drivers/cpuidle/governors/menu.c
> @@ -112,6 +112,8 @@ struct menu_device {
>         int             interval_ptr;
>  };
>  
> +static int intervals_dummy[] = {50, 52, 48, 50, 51, 49, 51, 51};

So just to illustrate this a bit more:
{51, 80, 90, 54, 72, 60, 80, 117} is insignificant, but decrementing the last:
{51, 80, 90, 54, 72, 60, 80, 116} is significant (75us returned)
That sounds about reasonable, but I think something like that would
also be observable for you. It's an unfortunate edge case.
I wouldn't want to mess with the ancient statistics (and silently break
an unknown number of setups). I much prefer the teo intercept approach,
which makes this idle-state dependent and doesn't try to get one formula
right for all setups.

That is all assuming that a) get_typical_interval() doesn't return a
significant interval length for you and b) replacing that with dummy
significant values then fixes it (and not the error being some menu
logic that I overlooked).

Thinking out loud regarding the menu logic:
If we have a ~50us predicted interval and your state setup, the code then
goes as follows:

predicted_ns = 50000
...
data->next_timer_ns = KTIME_MAX; 
delta_tick = TICK_NSEC / 2;  
data->bucket = which_bucket(KTIME_MAX, nr_iowaiters); // bucket highest state probably

And then we have the big for loop for state 0 POLLING and state 1 CEDE target_residency_ns==100000:
trivially we have for i = 0:
	if (idx == -1)                                                  
		idx = i; /* first enabled state */
and for i = 1:
	if (s->target_residency_ns > predicted_ns) { // True
...
		if (predicted_ns < TICK_NSEC) // True                      
			break;
and should end up with idx == 0 and *stop_tick = false, all good.
If my above assumptions in a) and b) do not hold then I must've made a
mistake reading the code just now.
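
To make that trace concrete, the loop shape can be condensed into a minimal standalone sketch. This is a hypothetical two-state table mirroring snooze/CEDE (CEDE target residency 120us per the cpupower output); it deliberately omits the exit-latency check and the CPUIDLE_FLAG_POLLING special case, and assumes HZ=100 for TICK_NSEC, which doesn't change the outcome here.

```c
#include <stdint.h>

#define TICK_NSEC 10000000ull /* assumed HZ=100; exact value irrelevant here */

struct state {
	uint64_t target_residency_ns;
};

/*
 * Simplified shape of menu_select()'s state loop: keep climbing while a
 * state's target residency fits under the predicted idle time, and stop
 * once it no longer fits (with *stop_tick staying false in the real code
 * when predicted_ns < TICK_NSEC).
 */
static int pick_state(const struct state *s, int n, uint64_t predicted_ns)
{
	int idx = -1, i;

	for (i = 0; i < n; i++) {
		if (idx == -1)
			idx = i; /* first enabled state */
		if (s[i].target_residency_ns > predicted_ns) {
			if (predicted_ns < TICK_NSEC)
				break;
		} else {
			idx = i;
		}
	}
	return idx;
}
```

With predicted_ns = 50000 and states {snooze, CEDE(120000)} this returns state 0, as concluded above; a predicted interval past the CEDE residency returns state 1.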
Christian Loehle Sept. 19, 2024, 2:38 p.m. UTC | #8
On 9/19/24 15:07, Christian Loehle wrote:
> On 9/19/24 10:43, Christian Loehle wrote:
>> On 9/19/24 09:49, Christian Loehle wrote:
>>> On 9/19/24 06:02, Aboorva Devarajan wrote:
>>>> On Wed, 2024-08-21 at 11:55 +0100, Christian Loehle wrote:
>>>
>>>>
>>>> Timer based wakeup:
>>>>
>>>> +------------------------+---------------------+---------------------+
>>>> | Metric                 | Without Patch       | With Patch          |
>>>> +------------------------+---------------------+---------------------+
>>>> | Wakee thread - CPU     | 110                 | 110                 |
>>>> | Waker thread - CPU     | 20                  | 20                  |
>>>> | Sleep Interval         | 50 us               | 50 us               |
>>>> | Total Wakeups          | -                   | -                   |
>>>> | Avg Wakeup Latency     | -                   | -                   |
>>>> | Actual Sleep time      | 52.639 us           | 52.627 us           |
>>>> +------------------------+---------------------+---------------------+
>>>> | Idle State 0 Usage Diff| 94,879              | 94,879              |
>>>> | Idle State 0 Time Diff | 4,700,323 ns        | 4,697,576 ns        |
>>>> | Idle State 1 Usage Diff| 0                   | 0                   |
>>>> | Idle State 1 Time Diff | 0 ns                | 0 ns                |
>>>> +------------------------+---------------------+---------------------+
>>>> | Total Above Usage      | 0 (0.00%)           | 0 (0.00%)           |
>>>> +------------------------+---------------------+---------------------+
>>>> | Total Below Usage      | 0 (0.00%)           | 0 (0.00%)           |
>>>> +------------------------+---------------------+---------------------+
>>>>
>>>> In timer-based wakeups, the menu governor effectively predicts the idle
>>>> duration both with and without the patch. This ensures that there are
>>>> few or no instances of "Above" usage, allowing the CPU to remain in the
>>>> correct idle state.
>>>>
>>>> The condition (s->target_residency_ns <= data->next_timer_ns) in the menu
>>>> governor ensures that a physical idle state is not prioritized when a
>>>> timer event is expected before the target residency of the first physical
>>>> idle state.
>>>>
>>>> As a result, the patch has no impact in this case, and performance
>>>> remains stable with timer based wakeups.
>>>>
>>>> Pipe based wakeup (non-timer wakeup):
>>>>
>>>> +------------------------+---------------------+---------------------+
>>>> | Metric                 | Without Patch       | With Patch          |
>>>> +------------------------+---------------------+---------------------+
>>>> | Wakee thread - CPU     | 110                 | 110                 |
>>>> | Waker thread - CPU     | 20                  | 20                  |
>>>> | Sleep Interval         | 50 us               | 50 us               |
>>>> | Total Wakeups          | 97031               | 96583               |
>>>> | Avg Wakeup Latency     | 7.070 us            | 4.879 us            |
>>>> | Actual Sleep time      | 51.366 us           | 51.605 us           |
>>>> +------------------------+---------------------+---------------------+
>>>> | Idle State 0 Usage Diff| 1209                | 96,586              |
>>>> | Idle State 0 Time Diff | 55,579 ns           | 4,510,003 ns        |
>>>> | Idle State 1 Usage Diff| 95,826              | 5                   |
>>>> | Idle State 1 Time Diff | 4,522,639 ns        | 198 ns              |
>>>> +------------------------+---------------------+---------------------+
>>>> +------------------------+---------------------+---------------------+
>>>> | **Total Above Usage**  | 95,824 (98.75%)     | 5 (0.01%)           |
>>>> +------------------------+---------------------+---------------------+     
>>>> +------------------------+---------------------+---------------------+
>>>> | Total Below Usage      | 0 (0.00%)           | 0 (0.00%)           |
>>>> +------------------------+---------------------+---------------------+
>>>>
>>>> In the pipe-based wakeup scenario, before the patch was applied, the 
>>>> "Above" metric was notably high around 98.75%. This suggests that the
>>>> menu governor frequently selected a deeper idle state like CEDE, even
>>>> when the idle period was relatively short.
>>>>
>>>> This happened because the menu governor is inclined to prioritize the
>>>> physical idle state (CEDE) even when the target residency time of the
>>>> physical idle state (s->target_residency_ns) was longer than the
>>>> predicted idle time (predicted_ns), so data->next_timer_ns won't be
>>>> relevant here in non-timer wakeups.
>>>>
>>>> In this test, despite the actual idle period being around 50 microseconds,
>>>> the menu governor still chose CEDE state, which has a target residency of
>>>> 120 microseconds.
>>>
>>> And the idle_miss statistics confirms that this was mostly wrong decisions
>>> in hindsight.
>>> I'll go through the menu code again, this indeed shouldn't be happening.
>>> I'd be very surprised if upstream teo performed as badly (or actually badly
>>> at all) on this, although it doesn't seem to help your actual workload,
>>> would you mind giving that a try too?
>>>
>>> Regards,
>>> Christian
>>
>> So with a workload as static as this, the recent interval detection of menu
>> should take effect. It records the last 8 idle interval lengths and does some
>> statistics to see if they are somewhat stable, and uses those instead of the
>> next timer event.
>> While I wouldn't vouch for the statistics just yet, the results don't seem
>> to be completely unreasonable.
>> Do you mind trying a) printing some snapshots of these intervals and the
>> resulting return value of get_typical_interval()?
>> b) Trying this out? Setting these values to something around 50us returns
>> 50us for me as expected, and therefore should not select an idle state above
>> that.
>>
>> -->8--
>>
>> --- a/drivers/cpuidle/governors/menu.c
>> +++ b/drivers/cpuidle/governors/menu.c
>> @@ -112,6 +112,8 @@ struct menu_device {
>>         int             interval_ptr;
>>  };
>>  
>> +static int intervals_dummy[] = {50, 52, 48, 50, 51, 49, 51, 51};
> 
> So just to illustrate this a bit more:
> {51, 80, 90, 54, 72, 60, 80, 117} is insignificant, but decrementing the last:
> {51, 80, 90, 54, 72, 60, 80, 116} is significant (75us returned)
> That sounds about reasonable, but I think something like that would
> also be observable for you. It's an unfortunate edge case.
> I wouldn't want to mess with the ancient statistics (and silently break
> an unknown number of setups). I much prefer the teo intercept approach,
> which makes this idle-state dependent and doesn't try to get one formula
> right for all setups.
> 
> That is all assuming that a) get_typical_interval() doesn't return a
> significant interval length for you and b) replacing that with dummy
> significant values then fixes it (and not the error being some menu
> logic that I overlooked).
> 
> Thinking out loud regarding the menu logic:
> If we have a ~50us predicted interval and your state setup, the code then
> goes as follows:
> 
> predicted_ns = 50000
> ...
> data->next_timer_ns = KTIME_MAX; 
> delta_tick = TICK_NSEC / 2;  
> data->bucket = which_bucket(KTIME_MAX, nr_iowaiters); // bucket highest state probably
> 
> And then we have the big for loop for state 0 POLLING and state 1 CEDE target_residency_ns==100000:
> trivially we have for i = 0:
> 	if (idx == -1)                                                  
> 		idx = i; /* first enabled state */
> and for i = 1:
> 	if (s->target_residency_ns > predicted_ns) { // True
> ...
> 		if (predicted_ns < TICK_NSEC) // True                      
> 			break;
> and should end up with idx == 0 and *stop_tick = false, all good.
> If my above assumptions in a) and b) do not hold then I must've made a
> mistake reading the code just now.
> 

And of course I realise right after sending that I did indeed make a mistake.
RESIDENCY_THRESHOLD_NS is 15us, so the timer is actually polled and the fields
are populated accurately, but the for loop also checks for polling at idx == 0.
predicted_ns should still be the value from get_typical_interval(), but
that is unable to ever enforce a polling state, making your proposed code
change the only obvious fix.
Essentially, with only one non-polling state, the get_typical_interval()
logic might as well be skipped; it never affects anything.

If I'm correct this time then I'm not really sure how to proceed.
On the one hand, if we trust predicted_ns we should indeed select a state
with predicted_ns < s->target_residency_ns, no matter if polling or not,
but that would be quite a behaviour change for x86 users too.

I'm still curious why teo wouldn't solve your problem; its intercept
logic doesn't care whether it selects a polling state or not.
Aboorva Devarajan Oct. 21, 2024, 5:27 a.m. UTC | #9
On Thu, 2024-09-19 at 09:49 +0100, Christian Loehle wrote:


Hi Christian, 

> On 9/19/24 06:02, Aboorva Devarajan wrote:
> > On Wed, 2024-08-21 at 11:55 +0100, Christian Loehle wrote:
> > 
> > > On 8/20/24 09:51, Aboorva Devarajan wrote:
> > > > On Tue, 2024-08-13 at 13:56 +0100, Christian Loehle wrote:
> > > > 
> > ...
> > > The wakeup source(s), since they don't seem to be timer events would be
> > > interesting, although a bit of a hassle to get right. 
> > > What's the workload anyway?
> > > 
> > 
> > I identified the part of the internal workload responsible for performance
> > improvement with the patch and it appears to be the OLTP queries, and yes
> > the wakeup sources are not timer events.
> > 
> > ---------------------------------------------------------------------
> > 
> > Additionally to reproduce the issue using a more open and accessible
> > benchmark, conducted similar experiments using pgbench and I observed
> > similar improvements with the patch, particularly when running the
> > read intensive select query benchmarks.
> > 
> > ---------------------------------------------------------------------
> > System Details
> > ---------------------------------------------------------------------
> > $ lscpu
> > Architecture:             ppc64le
> >   Byte Order:             Little Endian
> > CPU(s):                   224
> >   On-line CPU(s) list:    0-223
> > Model name:               POWER10 (architected), altivec supported
> >   Model:                  2.0 (pvr 0080 0200)
> >   Thread(s) per core:     8
> >   Core(s) per socket:     3
> >   Socket(s):              8
> > Virtualization features:  
> >   Hypervisor vendor:      pHyp
> >   Virtualization type:    para
> > ---------------------------------------------------------------------
> > 
> > $ cpupower idle-info
> > CPUidle driver: pseries_idle
> > CPUidle governor: menu
> > analyzing CPU 0:
> > 
> > Number of idle states: 2
> > Available idle states: snooze CEDE
> > snooze:
> > Flags/Description: snooze
> > 
> > Latency: 0
> > Residency: 0
> > Usage: 6229
> > Duration: 402142
> > CEDE:
> > Flags/Description: CEDE
> > Latency: 12
> > Residency: 120
> > Usage: 191411
> > Duration: 36329999037
> > 
> > ---------------------------------------------------------------------
> > PostgreSQL Benchmark:
> > ---------------------------------------------------------------------
> > 
> > I ran pgbench with 224 clients and 20 threads for 600 seconds,
> > performing only SELECT queries against the pgbench database to 
> > evaluate performance under read-intensive workloads:
> > 
> > $ pgbench -c 224 -j 20 -T 600 -S pgbench
> > 
> > Latency:
> > 
> > > ---|-------------|------------|------------|
> > >  # | Before (ms) | After (ms) | Change (%) |
> > > Run| Patch       | Patch      |            |
> > > ---|-------------|------------|------------|
> > >  1 | 0.343       | 0.287      | -16.31%    |
> > >  2 | 0.334       | 0.286      | -14.37%    |
> > >  3 | 0.333       | 0.286      | -14.11%    |
> > >  4 | 0.341       | 0.288      | -15.55%    |
> > >  5 | 0.342       | 0.288      | -15.79%    |
> > > ---|-------------|------------|------------|
> > 
> > Latency Reduction: After applying the patch, the latency decreased
> > by 14% to 16% across multiple runs.
> > 
> > Throughput per second:
> > 
> > > ---|-------------|------------|------------|
> > >  # | Before      | After      | Change (%) |
> > > Run| Patch       | Patch      |            |
> > > ---|-------------|------------|------------|
> > >  1 | 652,544.23  | 780,613.42 | +19.63%    | 
> > >  2 | 670,243.45  | 784,348.44 | +17.04%    |
> > >  3 | 673,495.39  | 784,458.12 | +16.48%    |
> > >  4 | 656,609.16  | 778,531.20 | +18.57%    |
> > >  5 | 654,644.52  | 778,322.88 | +18.88%    |
> > > ---|-------------|------------|------------|
> 
> Do you happen to have the idle stats here, too?
> Especially above/below see comment below.

I ran the benchmark again and collected the idle statistics on a different
system since I no longer have access to the previous one.

Here are the stats I gathered, including data for the menu, teo, and ladder governors:

Metric                        | Ladder        | Teo           | Menu          | Menu Patched
--------------------------------------------------------------------------------------------
Transactions Processed        | 42,902,188    | 48,499,709    | 42,653,477    | 48,648,867   
Latency (ms)                  | 0.313         | 0.277         | 0.315         | 0.276        
TPS                           | 715,340.96    | 808,672.59    | 711,123.13    | 811,137.40  
--------------------------------------------------------------------------------------------
Total Usage Difference        | 46,680,184    | 66,348,992    | 47,892,509    | 62,366,337
Usage Diff (CEDE)             | 46,680,184    | 7,612,960     | 45,421,436    | 19,237,430
Usage Diff (Snooze)           | 0             | 58,736,032    | 2,471,073     | 43,128,907
--------------------------------------------------------------------------------------------
Total Time Difference (s)     | 5,552.20      | 5,242.75      | 5,534.62      | 5,238.46     
Time Diff (CEDE)              | 5,552.20      | 2,497.89      | 5,431.91      | 3,324.99
Time Diff (Snooze)            | 0.00          | 2,744.86      | 102.71        | 1,913.47
--------------------------------------------------------------------------------------------
Total Above Diff              | 40,381,398    | 4,520,998     | 38,942,725    | 15,072,345
Above Diff (CEDE)             | 40,381,398    | 4,520,998     | 38,942,725    | 15,072,345
Above Diff (Snooze)           | 0             | 0             | 0             | 0
--------------------------------------------------------------------------------------------
Total Below Diff              | 0             | 12,362,392    | 405,357       | 8,172,845
Below Diff (CEDE)             | 0             | 0             | 0             | 0
Below Diff (Snooze)           | 0             | 12,362,392    | 405,357       | 8,172,845
--------------------------------------------------------------------------------------------
% Above w.r.t. Usage          | 86.51%        | 6.82%         | 81.30%        | 24.17%
--------------------------------------------------------------------------------------------
% Below w.r.t. Usage          | 0%            | 18.64%        | 0.85%         | 13.10%
--------------------------------------------------------------------------------------------

--------------------------------------------------------------------------------------------

Note:

% Above w.r.t Usage = (Total Above Difference / Total Usage Difference) * 100
% Below w.r.t Usage = (Total Below Difference / Total Usage Difference) * 100
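
As a quick sanity check, plugging the "Menu Patched" column into these formulas reproduces the table's percentages (a standalone check, not kernel code):

```c
#include <math.h>

/* %-above / %-below w.r.t. usage, per the formulas in the note above */
static double pct_of_usage(double diff, double usage_diff)
{
	return diff / usage_diff * 100.0;
}
```

e.g. pct_of_usage(15072345, 62366337) is approximately 24.17 and pct_of_usage(8172845, 62366337) is approximately 13.10, matching the "% Above" and "% Below" rows for Menu Patched.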


Menu, Teo, Ladder: This is with menu, teo and ladder governor enabled respectively on v6.12-rc1.

Menu Patched: This is with menu governor enabled on v6.12-rc + 
              https://lore.kernel.org/all/20240809073120.250974-2-aboorvad@linux.ibm.com/
--------------------------------------------------------------------------------------------

--------------------------------------------------------------------------------------------
Inference:
--------------------------------------------------------------------------------------------

Transactions Processed: The number of transactions processed in Menu Patched is
14.06% higher compared to Menu.

Latency: There is a 12.38% reduction in latency in Menu Patched compared to Menu.

TPS (Transactions Per Second): The TPS in Menu Patched is 14.06% higher than in
Menu.
--------------------------------------------------------------------------------------------

While this patch hasn't completely eliminated the cpuidle miss ratio, it has
improved it; see the "% Above w.r.t. Usage" and "% Below w.r.t. Usage" rows.

I'll keep investigating why there's still a 24% miss rate in the "above" and
13% in the "below" stats after the patch. This could be a different issue.
Additionally, I've seen good performance improvements using the teo governor
with the pgbench benchmark, although it didn't provide the same benefit in the
original test.

> 
> > Transactions per second Improvement: The patch led to an increase in
> > TPS, ranging from 16% to 19%.
> > 
> > This indicates that the patch significantly reduces latency while
> > improving throughput (TPS).  pgbench is an OLTP benchmark and doesn't
> > do timer based wakeups, this is in-line with the improvements
> > I saw in the originally reported OLTP workload as well. 
> > 
> > ---------------------------------------------------------------------
> > Additional CPU Idle test with custom benchmark:
> > ---------------------------------------------------------------------
> > I also wrote a simple benchmark [1] to analyze the impact of cpuidle
> > state selection, comparing timer-based wakeups and non-timer
> > (pipe-based) wakeups.
> > 
> > This test involves a single waker thread periodically waking up a
> > single wakee thread, simulating a repeating sleep-wake pattern. The
> > test was run with both timer-based and pipe-based wakeups, and cpuidle
> > statistics (usage, time, above, below) were collected.
> 
> Thanks, pretty cool and just quickly checking it seems to be all there:
> hrtimer, different cpusets for wakee and waker, the pipe wakeup.
> That could be useful, only thing missing is the license header ;)

Sure, I will add this. Thanks for pointing it out.

> 
> > Timer based wakeup:
> > 
> > +------------------------+---------------------+---------------------+
> > > Metric                 | Without Patch       | With Patch          |
> > +------------------------+---------------------+---------------------+
> > > Wakee thread - CPU     | 110                 | 110                 |
> > > Waker thread - CPU     | 20                  | 20                  |
> > > Sleep Interval         | 50 us               | 50 us               |
> > > Total Wakeups          | -                   | -                   |
> > > Avg Wakeup Latency     | -                   | -                   |
> > > Actual Sleep time      | 52.639 us           | 52.627 us           |
> > +------------------------+---------------------+---------------------+
> > > Idle State 0 Usage Diff| 94,879              | 94,879              |
> > > Idle State 0 Time Diff | 4,700,323 ns        | 4,697,576 ns        |
> > > Idle State 1 Usage Diff| 0                   | 0                   |
> > > Idle State 1 Time Diff | 0 ns                | 0 ns                |
> > +------------------------+---------------------+---------------------+
> > > Total Above Usage      | 0 (0.00%)           | 0 (0.00%)           |
> > +------------------------+---------------------+---------------------+
> > > Total Below Usage      | 0 (0.00%)           | 0 (0.00%)           |
> > +------------------------+---------------------+---------------------+
> > 
> > In timer-based wakeups, the menu governor effectively predicts the idle
> > duration both with and without the patch. This ensures that there are
> > few or no instances of "Above" usage, allowing the CPU to remain in the
> > correct idle state.
> > 
> > The condition (s->target_residency_ns <= data->next_timer_ns) in the menu
> > governor ensures that a physical idle state is not prioritized when a
> > timer event is expected before the target residency of the first physical
> > idle state.
> > 
> > As a result, the patch has no impact in this case, and performance
> > remains stable with timer based wakeups.
> > 
> > Pipe based wakeup (non-timer wakeup):
> > 
> > +------------------------+---------------------+---------------------+
> > > Metric                 | Without Patch       | With Patch          |
> > +------------------------+---------------------+---------------------+
> > > Wakee thread - CPU     | 110                 | 110                 |
> > > Waker thread - CPU     | 20                  | 20                  |
> > > Sleep Interval         | 50 us               | 50 us               |
> > > Total Wakeups          | 97031               | 96583               |
> > > Avg Wakeup Latency     | 7.070 us            | 4.879 us            |
> > > Actual Sleep time      | 51.366 us           | 51.605 us           |
> > +------------------------+---------------------+---------------------+
> > > Idle State 0 Usage Diff| 1209                | 96,586              |
> > > Idle State 0 Time Diff | 55,579 ns           | 4,510,003 ns        |
> > > Idle State 1 Usage Diff| 95,826              | 5                   |
> > > Idle State 1 Time Diff | 4,522,639 ns        | 198 ns              |
> > +------------------------+---------------------+---------------------+
> > +------------------------+---------------------+---------------------+
> > > **Total Above Usage**  | 95,824 (98.75%)     | 5 (0.01%)           |
> > +------------------------+---------------------+---------------------+     
> > +------------------------+---------------------+---------------------+
> > > Total Below Usage      | 0 (0.00%)           | 0 (0.00%)           |
> > +------------------------+---------------------+---------------------+
> > 
> > In the pipe-based wakeup scenario, before the patch was applied, the 
> > "Above" metric was notably high around 98.75%. This suggests that the
> > menu governor frequently selected a deeper idle state like CEDE, even
> > when the idle period was relatively short.
> > 
> > This happened because the menu governor is inclined to prioritize the
> > physical idle state (CEDE) even when the target residency time of the
> > physical idle state (s->target_residency_ns) was longer than the
> > predicted idle time (predicted_ns), so data->next_timer_ns won't be
> > relevant here in non-timer wakeups.
> > 
> > In this test, despite the actual idle period being around 50 microseconds,
> > the menu governor still chose CEDE state, which has a target residency of
> > 120 microseconds.
> 
> And the idle_miss statistics confirms that this was mostly wrong decisions
> in hindsight.
> I'll go through the menu code again, this indeed shouldn't be happening.
> I'd be very surprised if upstream teo performed as badly (or actually badly
> at all) on this, although it doesn't seem to help your actual workload,
> would you mind giving that a try too?


I ran a similar benchmark using the teo cpuidle governor and averaged the
values across 10 runs (low standard deviation). Below are the results for a
pipe-based wakeup with an approximate 50 microsecond sleep interval:


Pipe based wakeup with approx 50 us as sleep interval:

Metric                    Ladder           Menu            Teo              Menu Patched
----------------------------------------------------------------------------------------
Wakeups                   579,690          579,951         578,328          578,363
Avg wakeup latency (us)   7.456            7.112           4.46067          4.48733
========================================================================================
Sleep interval (us)       51.7333          51.7097         51.8533          51.8477
========================================================================================
State 0 Usage diff        0                7,324           578,361          578,407
State 0 Time diff (ns)    0                340,115         2.75953e+07      2.75696e+07
State 0 Above diff        0                0               0                0
State 0 Below diff        0                0               0.333333         0.666667
========================================================================================
State 1 Usage diff        580,114          572,644         0.666667         9.33333
State 1 Time diff (ns)    2.74189e+07      2.73901e+07     20.6667          445.333
State 1 Above diff        579,993          572,535         0.666667         9.33333
State 1 Below diff        0                0               0                0
Total Above               579,993          572,535         0.666667         9.33333
Total Below               0                0               0.333333         0.666667
========================================================================================
Total Above Percent       99.98%           98.7167%        0%               0%             --> [1]
Total Below Percent       0%               0%              0%               0%
========================================================================================

Timer based wakeup with approx 50 us as sleep interval:

Metric                    Ladder           Menu            Teo              Menu Patched
----------------------------------------------------------------------------------------
Sleep interval (us)        53.775           52.3687         52.746           52.327
========================================================================================
State 0 Usage diff         0                572,575         568,419          573,109
State 0 Time diff (ns)     0                2.84142e+07     2.81956e+07      2.84548e+07
State 0 Above diff         0                0               0                0
State 0 Below diff         0                0.333333        1                0.333333
========================================================================================
State 1 Usage diff         558,285          0.333333        0                0
State 1 Time diff (ns)     2.80965e+07      17              0                0
State 1 Above diff         558,215          0.333333        0                0
State 1 Below diff         0                0               0                0
========================================================================================
Total Above                558,215          0.333333        0                0
Total Below                0                0.333333        1                0.333333
========================================================================================
Total Above Percent        99.99%           0%              0%               0%
Total Below Percent        0%               0%              0%               0%
========================================================================================


[1] The results do show that the teo governor doesn't have this issue.

> 
> Regards,
> Christian
> 
> > ---------------------------------------------------------------------
> > 
> > While timer-based wakeups performed well even without the patch, workloads
> > that don't have timers as wakeup source but have predictable idle durations
> > shorter than the first idle state's target residency benefit significantly
> > from the patch.
> > 
> > It will be helpful to understand why prioritizing deep physical idle
> > states over shallow ones even for short idle periods that don’t meet
> > the target residency like mentioned above is considered more beneficial.
> > 
> > Is there something I could be missing here?
> > 
> > Any comments or suggestions will be really helpful.
> > 
> > [1] https://github.com/AboorvaDevarajan/linux-utils/tree/main/cpuidle/cpuidle_wakeup
> > 
> > Thanks,
> > Aboorva
> >
Aboorva Devarajan Oct. 21, 2024, 7:40 a.m. UTC | #10
On Thu, 2024-09-19 at 10:43 +0100, Christian Loehle wrote:

> On 9/19/24 09:49, Christian Loehle wrote:
> > On 9/19/24 06:02, Aboorva Devarajan wrote:
> > > On Wed, 2024-08-21 at 11:55 +0100, Christian Loehle wrote:
> > > Timer based wakeup:
> > > 
> > > +------------------------+---------------------+---------------------+
> > > > Metric                 | Without Patch       | With Patch          |
> > > +------------------------+---------------------+---------------------+
> > > > Wakee thread - CPU     | 110                 | 110                 |
> > > > Waker thread - CPU     | 20                  | 20                  |
> > > > Sleep Interval         | 50 us               | 50 us               |
> > > > Total Wakeups          | -                   | -                   |
> > > > Avg Wakeup Latency     | -                   | -                   |
> > > > Actual Sleep time      | 52.639 us           | 52.627 us           |
> > > +------------------------+---------------------+---------------------+
> > > > Idle State 0 Usage Diff| 94,879              | 94,879              |
> > > > Idle State 0 Time Diff | 4,700,323 ns        | 4,697,576 ns        |
> > > > Idle State 1 Usage Diff| 0                   | 0                   |
> > > > Idle State 1 Time Diff | 0 ns                | 0 ns                |
> > > +------------------------+---------------------+---------------------+
> > > > Total Above Usage      | 0 (0.00%)           | 0 (0.00%)           |
> > > +------------------------+---------------------+---------------------+
> > > > Total Below Usage      | 0 (0.00%)           | 0 (0.00%)           |
> > > +------------------------+---------------------+---------------------+
> > > 
> > > In timer-based wakeups, the menu governor effectively predicts the idle
> > > duration both with and without the patch. This ensures that there are
> > > few or no instances of "Above" usage, allowing the CPU to remain in the
> > > correct idle state.
> > > 
> > > The condition (s->target_residency_ns <= data->next_timer_ns) in the menu
> > > governor ensures that a physical idle state is not prioritized when a
> > > timer event is expected before the target residency of the first physical
> > > idle state.
> > > 
> > > As a result, the patch has no impact in this case, and performance
> > > remains stable with timer based wakeups.
> > > 
> > > Pipe based wakeup (non-timer wakeup):
> > > 
> > > +------------------------+---------------------+---------------------+
> > > > Metric                 | Without Patch       | With Patch          |
> > > +------------------------+---------------------+---------------------+
> > > > Wakee thread - CPU     | 110                 | 110                 |
> > > > Waker thread - CPU     | 20                  | 20                  |
> > > > Sleep Interval         | 50 us               | 50 us               |
> > > > Total Wakeups          | 97031               | 96583               |
> > > > Avg Wakeup Latency     | 7.070 us            | 4.879 us            |
> > > > Actual Sleep time      | 51.366 us           | 51.605 us           |
> > > +------------------------+---------------------+---------------------+
> > > > Idle State 0 Usage Diff| 1209                | 96,586              |
> > > > Idle State 0 Time Diff | 55,579 ns           | 4,510,003 ns        |
> > > > Idle State 1 Usage Diff| 95,826              | 5                   |
> > > > Idle State 1 Time Diff | 4,522,639 ns        | 198 ns              |
> > > +------------------------+---------------------+---------------------+
> > > +------------------------+---------------------+---------------------+
> > > > **Total Above Usage**  | 95,824 (98.75%)     | 5 (0.01%)           |
> > > +------------------------+---------------------+---------------------+     
> > > +------------------------+---------------------+---------------------+
> > > > Total Below Usage      | 0 (0.00%)           | 0 (0.00%)           |
> > > +------------------------+---------------------+---------------------+
> > > 
> > > In the pipe-based wakeup scenario, before the patch was applied, the 
> > > "Above" metric was notably high around 98.75%. This suggests that the
> > > menu governor frequently selected a deeper idle state like CEDE, even
> > > when the idle period was relatively short.
> > > 
> > > This happened because the menu governor is inclined to prioritize the
> > > physical idle state (CEDE) even when the target residency time of the
> > > physical idle state (s->target_residency_ns) was longer than the
> > > predicted idle time (predicted_ns), so data->next_timer_ns won't be
> > > relevant here in non-timer wakeups.
> > > 
> > > In this test, despite the actual idle period being around 50 microseconds,
> > > the menu governor still chose CEDE state, which has a target residency of
> > > 120 microseconds.
> > 
> > And the idle_miss statistics confirms that this was mostly wrong decisions
> > in hindsight.
> > I'll go through the menu code again, this indeed shouldn't be happening.
> > I'd be very surprised if upstream teo performed as badly (or actually badly
> > at all) on this, although it doesn't seem to help your actual workload,
> > would you mind giving that a try too?
> > 
> > Regards,
> > Christian
> 
> So with a workload as static as this, the recent interval detection of menu
> should take effect. It records the last 8 idle interval lengths and does some
> statistics to see if they are somewhat stable and uses those instead of the
> next timer event.
> While I wouldn't vouch for the statistics just yet, the results don't seem
> to be completely unreasonable.
> Do you mind trying

> a) printing some snapshots of these intervals and the
> resulting return value of get_typical_interval()?

Christian,

Here are the stats I get when printing the intervals and corresponding predicted_ns
for an idle duration of 50 us set from user space. In this case, a pipe wakeup with
a 50 us sleep time in user space translates to approximately 34 us of CPU idle time
in the kernel, with very little deviation.

-------------------------------------------------------------------------------------
process        cpu           timestamp   trace
<idle>-0       [200] d....   123.969564: menu_select: CPU[200]: Interval[2] = 34
<idle>-0       [200] d....   123.969564: menu_select: CPU[200]: Interval[3] = 34
<idle>-0       [200] d....   123.969564: menu_select: CPU[200]: Interval[4] = 34
<idle>-0       [200] d....   123.969564: menu_select: CPU[200]: Interval[5] = 34
<idle>-0       [200] d....   123.969565: menu_select: CPU[200]: Interval[6] = 34
<idle>-0       [200] d....   123.969565: menu_select: CPU[200]: Interval[7] = 34
<idle>-0       [200] d....   123.969565: menu_select: CPU[200]: Predicted ns = 34000
<idle>-0       [200] d....   123.969615: menu_select: CPU[200]: Interval[0] = 34
<idle>-0       [200] d....   123.969616: menu_select: CPU[200]: Interval[1] = 34
<idle>-0       [200] d....   123.969616: menu_select: CPU[200]: Interval[2] = 34
<idle>-0       [200] d....   123.969616: menu_select: CPU[200]: Interval[3] = 34
<idle>-0       [200] d....   123.969616: menu_select: CPU[200]: Interval[4] = 34
<idle>-0       [200] d....   123.969616: menu_select: CPU[200]: Interval[5] = 34
<idle>-0       [200] d....   123.969616: menu_select: CPU[200]: Interval[6] = 34
<idle>-0       [200] d....   123.969616: menu_select: CPU[200]: Interval[7] = 34
<idle>-0       [200] d....   123.969617: menu_select: CPU[200]: Predicted ns = 34000
...
...
...
...
<idle>-0       [200] d....   123.969667: menu_select: CPU[200]: Interval[0] = 34
<idle>-0       [200] d....   123.969667: menu_select: CPU[200]: Interval[1] = 34
<idle>-0       [200] d....   123.969667: menu_select: CPU[200]: Interval[2] = 34
<idle>-0       [200] d....   123.969667: menu_select: CPU[200]: Interval[3] = 34
<idle>-0       [200] d....   123.969667: menu_select: CPU[200]: Interval[4] = 34
<idle>-0       [200] d....   123.969668: menu_select: CPU[200]: Interval[5] = 34
<idle>-0       [200] d....   123.969668: menu_select: CPU[200]: Interval[6] = 34
<idle>-0       [200] d....   123.969668: menu_select: CPU[200]: Interval[7] = 33
<idle>-0       [200] d....   123.969668: menu_select: CPU[200]: Predicted ns = 33000

This pattern repeats until I stop the workload.
-------------------------------------------------------------------------------------

> b) Trying this out? Setting these values to some around 50us, that returns
> 50us for me as expected and therefore should not select an idle state above
> that.
> 
> -->8--
> 
> --- a/drivers/cpuidle/governors/menu.c
> +++ b/drivers/cpuidle/governors/menu.c
> @@ -112,6 +112,8 @@ struct menu_device {
>         int             interval_ptr;
>  };
>  
> +static int intervals_dummy[] = {50, 52, 48, 50, 51, 49, 51, 51};
> +
>  static inline int which_bucket(u64 duration_ns, unsigned int nr_iowaiters)
>  {
>         int bucket = 0;
> @@ -177,7 +179,7 @@ static unsigned int get_typical_interval(struct menu_device *data)
>         sum = 0;
>         divisor = 0;
>         for (i = 0; i < INTERVALS; i++) {
> -               unsigned int value = data->intervals[i];
> +               unsigned int value = intervals_dummy[i];
>                 if (value <= thresh) {
>                         sum += value;
>                         divisor++;
> @@ -200,7 +202,7 @@ static unsigned int get_typical_interval(struct menu_device *data)
>         /* Then try to determine variance */
>         variance = 0;
>         for (i = 0; i < INTERVALS; i++) {
> -               unsigned int value = data->intervals[i];
> +               unsigned int value = intervals_dummy[i];
>                 if (value <= thresh) {
>                         int64_t diff = (int64_t)value - avg;
>                         variance += diff * diff;
> 

Yes, I tried the above patch and I still see the issue:

+-----------------------------+------------------+------------------+------------------+
|           Metric            |     ladder       |      menu        |      teo         |
+=============================+==================+==================+==================+
|           Wakeups           |     579817       |     579298       |     578359       |
+-----------------------------+------------------+------------------+------------------+
| Average wakeup latency (us) |      7.32        |     7.099        |     4.538        |
+-----------------------------+------------------+------------------+------------------+
|     Sleep interval (us)     |      51.722      |    51.768        |     51.852       |
+-----------------------------+------------------+------------------+------------------+
|     State 0 Usage diff      |      0           |     7056         |     578402       |
+-----------------------------+------------------+------------------+------------------+
|   State 0 Time diff (ns)    |      0           |    350579        |    2.75562e+07   |
+-----------------------------+------------------+------------------+------------------+
|     State 0 Above diff      |      0           |      0           |      0           |
+-----------------------------+------------------+------------------+------------------+
|     State 0 Below diff      |      0           |      1           |      0           |
+-----------------------------+------------------+------------------+------------------+
|     State 1 Usage diff      |    580239        |    572168        |      0           |
+-----------------------------+------------------+------------------+------------------+
|   State 1 Time diff (ns)    |   2.74763e+07    |   2.73917e+07    |      0           |
+-----------------------------+------------------+------------------+------------------+
|     State 1 Above diff      |    580140        |    572070        |      0           |
+-----------------------------+------------------+------------------+------------------+
|     State 1 Below diff      |      0           |      0           |      0           |
+-----------------------------+------------------+------------------+------------------+
|         Total Above         |    580140        |    572070        |      0           |
+-----------------------------+------------------+------------------+------------------+
|         Total Below         |      0           |      1           |      0           |
+=============================+==================+==================+==================+
|     Total Above Percent     |     99.98        |     98.76        |      0           |
+-----------------------------+------------------+------------------+------------------+
|     Total Below Percent     |      0           |      0           |      0           |
+======================================================================================+

--------------------------------------------------------------------------------------------
Note:

% Total Above Percent (w.r.t. Total Usage) = (Total Above / Total Usage) * 100
% Total Below Percent (w.r.t. Total Usage) = (Total Below / Total Usage) * 100
--------------------------------------------------------------------------------------------

This confirms that the issue is not occurring due to the way predicted_ns is
computed in get_typical_interval(), but is rather related to the menu
governor's CPU idle state selection logic.

Thanks,
Aboorva
Aboorva Devarajan Oct. 21, 2024, 8:02 a.m. UTC | #11
On Thu, 2024-09-19 at 15:38 +0100, Christian Loehle wrote:
> On 9/19/24 15:07, Christian Loehle wrote:
> > On 9/19/24 10:43, Christian Loehle wrote:
> > > On 9/19/24 09:49, Christian Loehle wrote:
> > > > On 9/19/24 06:02, Aboorva Devarajan wrote:
> > > > > On Wed, 2024-08-21 at 11:55 +0100, Christian Loehle wrote:
> > > > > Timer based wakeup:
> > > > > 
> > > > > +------------------------+---------------------+---------------------+
> > > > > > Metric                 | Without Patch       | With Patch          |
> > > > > +------------------------+---------------------+---------------------+
> > > > > > Wakee thread - CPU     | 110                 | 110                 |
> > > > > > Waker thread - CPU     | 20                  | 20                  |
> > > > > > Sleep Interval         | 50 us               | 50 us               |
> > > > > > Total Wakeups          | -                   | -                   |
> > > > > > Avg Wakeup Latency     | -                   | -                   |
> > > > > > Actual Sleep time      | 52.639 us           | 52.627 us           |
> > > > > +------------------------+---------------------+---------------------+
> > > > > > Idle State 0 Usage Diff| 94,879              | 94,879              |
> > > > > > Idle State 0 Time Diff | 4,700,323 ns        | 4,697,576 ns        |
> > > > > > Idle State 1 Usage Diff| 0                   | 0                   |
> > > > > > Idle State 1 Time Diff | 0 ns                | 0 ns                |
> > > > > +------------------------+---------------------+---------------------+
> > > > > > Total Above Usage      | 0 (0.00%)           | 0 (0.00%)           |
> > > > > +------------------------+---------------------+---------------------+
> > > > > > Total Below Usage      | 0 (0.00%)           | 0 (0.00%)           |
> > > > > +------------------------+---------------------+---------------------+
> > > > > 
> > > > > In timer-based wakeups, the menu governor effectively predicts the idle
> > > > > duration both with and without the patch. This ensures that there are
> > > > > few or no instances of "Above" usage, allowing the CPU to remain in the
> > > > > correct idle state.
> > > > > 
> > > > > The condition (s->target_residency_ns <= data->next_timer_ns) in the menu
> > > > > governor ensures that a physical idle state is not prioritized when a
> > > > > timer event is expected before the target residency of the first physical
> > > > > idle state.
> > > > > 
> > > > > As a result, the patch has no impact in this case, and performance
> > > > > remains stable with timer based wakeups.
> > > > > 
> > > > > Pipe based wakeup (non-timer wakeup):
> > > > > 
> > > > > +------------------------+---------------------+---------------------+
> > > > > > Metric                 | Without Patch       | With Patch          |
> > > > > +------------------------+---------------------+---------------------+
> > > > > > Wakee thread - CPU     | 110                 | 110                 |
> > > > > > Waker thread - CPU     | 20                  | 20                  |
> > > > > > Sleep Interval         | 50 us               | 50 us               |
> > > > > > Total Wakeups          | 97031               | 96583               |
> > > > > > Avg Wakeup Latency     | 7.070 us            | 4.879 us            |
> > > > > > Actual Sleep time      | 51.366 us           | 51.605 us           |
> > > > > +------------------------+---------------------+---------------------+
> > > > > > Idle State 0 Usage Diff| 1209                | 96,586              |
> > > > > > Idle State 0 Time Diff | 55,579 ns           | 4,510,003 ns        |
> > > > > > Idle State 1 Usage Diff| 95,826              | 5                   |
> > > > > > Idle State 1 Time Diff | 4,522,639 ns        | 198 ns              |
> > > > > +------------------------+---------------------+---------------------+
> > > > > +------------------------+---------------------+---------------------+
> > > > > > **Total Above Usage**  | 95,824 (98.75%)     | 5 (0.01%)           |
> > > > > +------------------------+---------------------+---------------------+     
> > > > > +------------------------+---------------------+---------------------+
> > > > > > Total Below Usage      | 0 (0.00%)           | 0 (0.00%)           |
> > > > > +------------------------+---------------------+---------------------+
> > > > > 
> > > > > In the pipe-based wakeup scenario, before the patch was applied, the 
> > > > > "Above" metric was notably high around 98.75%. This suggests that the
> > > > > menu governor frequently selected a deeper idle state like CEDE, even
> > > > > when the idle period was relatively short.
> > > > > 
> > > > > This happened because the menu governor is inclined to prioritize the
> > > > > physical idle state (CEDE) even when the target residency time of the
> > > > > physical idle state (s->target_residency_ns) was longer than the
> > > > > predicted idle time (predicted_ns), so data->next_timer_ns won't be
> > > > > relevant here in non-timer wakeups.
> > > > > 
> > > > > In this test, despite the actual idle period being around 50 microseconds,
> > > > > the menu governor still chose CEDE state, which has a target residency of
> > > > > 120 microseconds.
> > > > 
> > > > And the idle_miss statistics confirms that this was mostly wrong decisions
> > > > in hindsight.
> > > > I'll go through the menu code again, this indeed shouldn't be happening.
> > > > I'd be very surprised if upstream teo performed as badly (or actually badly
> > > > at all) on this, although it doesn't seem to help your actual workload,
> > > > would you mind giving that a try too?
> > > > 
> > > > Regards,
> > > > Christian
> > > 
> > > So with a workload as static as this, the recent interval detection of menu
> > > should take effect. It records the last 8 idle interval lengths and does some
> > > statistics to see if they are somewhat stable and uses those instead of the
> > > next timer event.
> > > While I wouldn't vouch for the statistics just yet, the results don't seem
> > > to be completely unreasonable.
> > > Do you mind trying a) printing some snapshots of these intervals and the
> > > resulting return value of get_typical_interval()?
> > > b) Trying this out? Setting these values to some around 50us, that returns
> > > 50us for me as expected and therefore should not select an idle state above
> > > that.
> > > 
> > > -->8--
> > > 
> > > --- a/drivers/cpuidle/governors/menu.c
> > > +++ b/drivers/cpuidle/governors/menu.c
> > > @@ -112,6 +112,8 @@ struct menu_device {
> > >         int             interval_ptr;
> > >  };
> > >  
> > > +static int intervals_dummy[] = {50, 52, 48, 50, 51, 49, 51, 51};
> > 
> > So just to illustrate this a bit more:
> > {51, 80, 90, 54, 72, 60, 80, 117} is insignificant, but decrementing the last:
> > {51, 80, 90, 54, 72, 60, 80, 116} is significant (75us returned)
> > That sounds about reasonable but I think something like that would
> > also be observable for you. It's an unfortunate edge case.
> > I wouldn't want to mess with the ancient statistics (and silently break
> > an unknown amount of setups). I much prefer the teo intercept approach that
> > makes this idle state dependent and doesn't try to get one formula right
> > for all setups.
> > 
> > That is all assuming that a) get_typical_interval() doesn't return a
> > significant interval length for you and b) replacing that with dummy
> > significant values then fixes it (and not the error being some menu
> > logic that I overlooked).
> > 
> > Thinking out loud regarding the menu logic:
> > If we have ~50us predicted interval and your state setup the code then
> > goes as follows:
> > 
> > predicted_ns = 50000
> > ...
> > data->next_timer_ns = KTIME_MAX; 
> > delta_tick = TICK_NSEC / 2;  
> > data->bucket = which_bucket(KTIME_MAX, nr_iowaiters); // bucket highest state probably
> > 
> > And then we have the big for loop for state 0 POLLING and state 1 CEDE target_residency_ns==100000:
> > trivially we have for i = 0:
> > 	if (idx == -1)                                                  
> > 		idx = i; /* first enabled state */
> > and for i = 1:
> > 	if (s->target_residency_ns > predicted_ns) { // True
> > ...
> > 		if (predicted_ns < TICK_NSEC) // True                      
> > 			break;
> > and should end up with idx == 0 and *stop_tick = false, all good.
> > If my above assumptions in a) and b) do not hold then I must've made a
> > mistake reading the code just now.
> > 
> 
> And of course I realise right after sending I did indeed make a mistake.
> RESIDENCY_THRESHOLD_NS is 15us so timer is actually polled and the fields
> populated accurately, but also in the for loop it checks polling for idx == 0.
> predicted_ns should still be the value from get_typical_interval(), but
> that is unable to ever enforce a polling state, making your proposed code
> change the only obvious fix.

> Essentially with only one non-polling state the get_typical_interval()
> logic might as well be skipped, it never affects anything.
> 
> If I'm correct this time then I'm not really sure how to proceed.
> On the one hand, if we trust predicted_ns we should indeed select a state
> with predicted_ns < s->target_residency_ns, no matter if polling or not,
> but that would be quite a behaviour change for x86 users too.

That's right. So far this is the only solution I can think of that
improves performance, but I'm open to trying alternative suggestions,
if any.
> 
> I'm still curious on why teo wouldn't solve your problem, its intercept
> logic doesn't care if it selects a polling state or not because of it.
> 
I confirmed that "teo" doesn't have the reported issue, and compared to the
vanilla menu governor, switching to "teo" does improve the performance of
some workloads, though not the original workload I tested. I'll look into
why that is the case.

However, we would still want to resolve the issue with the menu governor, as
it is the default governor currently enabled and has been tested so far,
while switching to "teo" could be an option in the future.

Thanks,
Aboorva
Christian Loehle Oct. 21, 2024, 8:43 a.m. UTC | #12
Hi Aboorva,

On 10/21/24 06:27, Aboorva Devarajan wrote:
> [...] 
> I ran the benchmark again and collected the idle statistics on a different
> system since I no longer have access to the previous one.
> 
> Here are the stats I gathered, including data for the menu, teo, and ladder governors:
> 
> Metric                        | Ladder        | Teo           | Menu          | Menu Patched
> --------------------------------------------------------------------------------------------
> Transactions Processed        | 42,902,188    | 48,499,709    | 42,653,477    | 48,648,867   
> Latency (ms)                  | 0.313         | 0.277         | 0.315         | 0.276        
> TPS                           | 715,340.96    | 808,672.59    | 711,123.13    | 811,137.40  
> --------------------------------------------------------------------------------------------
> Total Usage Difference        | 46,680,184    | 66,348,992    | 47,892,509    | 62,366,337
> Usage Diff (CEDE)             | 46,680,184    | 7,612,960     | 45,421,436    | 19,237,430
> Usage Diff (Snooze)           | 0             | 58,736,032    | 2,471,073     | 43,128,907
> --------------------------------------------------------------------------------------------
> Total Time Difference (s)     | 5,552.20      | 5,242.75      | 5,534.62      | 5,238.46     
> Time Diff (CEDE)              | 5,552.20      | 2,497.89      | 5,431.91      | 3,324.99
> Time Diff (Snooze)            | 0.00          | 2,744.86      | 102.71        | 1,913.47
> --------------------------------------------------------------------------------------------
> Total Above Diff              | 40,381,398    | 4,520,998     | 38,942,725    | 15,072,345
> Above Diff (CEDE)             | 40,381,398    | 4,520,998     | 38,942,725    | 15,072,345
> Above Diff (Snooze)           | 0             | 0             | 0             | 0
> --------------------------------------------------------------------------------------------
> Total Below Diff              | 0             | 12,362,392    | 405,357       | 8,172,845
> Below Diff (CEDE)             | 0             | 0             | 0             | 0
> Below Diff (Snooze)           | 0             | 12,362,392    | 405,357       | 8,172,845
> --------------------------------------------------------------------------------------------
> % Above w.r.t. Usage          | 86.51%        | 6.82%         | 81.30%        | 24.17%
> --------------------------------------------------------------------------------------------
> % Below w.r.t. Usage          | 0%            | 18.64%        | 0.85%         | 13.10%
> --------------------------------------------------------------------------------------------
> 
> --------------------------------------------------------------------------------------------
> 
> Note:
> 
> % Above w.r.t Usage = (Total Above Difference / Total Usage Difference) * 100
> % Below w.r.t Usage = (Total Below Difference / Total Usage Difference) * 100
> 
> 
> Menu, Teo, Ladder: This is with menu, teo and ladder governor enabled respectively on v6.12-rc1.
> 
> Menu Patched: This is with menu governor enabled on v6.12-rc + 
>               https://lore.kernel.org/all/20240809073120.250974-2-aboorvad@linux.ibm.com/
> --------------------------------------------------------------------------------------------
> 
> --------------------------------------------------------------------------------------------
> Inference:
> --------------------------------------------------------------------------------------------
> 
> Transactions Processed: The number of transactions processed in Menu Patched is
> 14.06% higher compared to Menu.
> 
> Latency: There is a 12.38% reduction in latency in Menu Patched compared to Menu.
> 
> TPS (Transactions Per Second): The TPS in Menu Patched is 14.06% higher than in
> Menu.
> --------------------------------------------------------------------------------------------
> 
> While this patch hasn't completely eliminated the cpuidle miss ratio, it has
> improved it; see the "% Above w.r.t. Usage" and "% Below w.r.t. Usage" rows.
> 
> I'll keep investigating why there's still a 24% miss rate in the "above" and
> 13% in the "below" stats after the patch. This could be a different issue.
> Additionally, I've seen good performance improvements using the teo governor
> with the pgbench benchmark, although it didn't provide the same benefit in the
> original test.

Good to hear and thank you for testing.
Indeed the menu idle misses sound somewhat high, given that teo scores fewer.
I'd be very curious about the workload where menu patched significantly outperforms
teo if you ever come across that again.


> [...] 
> 
> I ran a similar benchmark using the teo cpuidle governor and averaged the
> values across 10 runs (low standard deviation). Below are the results for a
> pipe-based wakeup with an approximately 50 microsecond sleep interval:
> 
> 
> Pipe based wakeup with approx 50 us as sleep interval:
> 
> Metric                    Ladder           Menu            Teo              Menu Patched
> ----------------------------------------------------------------------------------------
> Wakeups                   579,690          579,951         578,328          578,363
> Avg wakeup latency (us)   7.456            7.112           4.46067          4.48733
> ========================================================================================
> Sleep interval (us)       51.7333          51.7097         51.8533          51.8477
> ========================================================================================
> State 0 Usage diff        0                7,324           578,361          578,407
> State 0 Time diff (ns)    0                340,115         2.75953e+07      2.75696e+07
> State 0 Above diff        0                0               0                0
> State 0 Below diff        0                0               0.333333         0.666667
> ========================================================================================
> State 1 Usage diff        580,114          572,644         0.666667         9.33333
> State 1 Time diff (ns)    2.74189e+07      2.73901e+07     20.6667          445.333
> State 1 Above diff        579,993          572,535         0.666667         9.33333
> State 1 Below diff        0                0               0                0
> Total Above               579,993          572,535         0.666667         9.33333
> Total Below               0                0               0.333333         0.666667
> ========================================================================================
> Total Above Percent       99.98%           98.7167%        0%               0%             --> [1]
> Total Below Percent       0%               0%              0%               0%
> ========================================================================================
> 
> Timer based wakeup with approx 50 us as sleep interval:
> 
> Metric                    Ladder           Menu            Teo              Menu Patched
> ----------------------------------------------------------------------------------------
> Sleep interval (us)        53.775           52.3687         52.746           52.327
> ========================================================================================
> State 0 Usage diff         0                572,575         568,419          573,109
> State 0 Time diff (ns)     0                2.84142e+07     2.81956e+07      2.84548e+07
> State 0 Above diff         0                0               0                0
> State 0 Below diff         0                0.333333        1                0.333333
> ========================================================================================
> State 1 Usage diff         558,285          0.333333        0                0
> State 1 Time diff (ns)     2.80965e+07      17              0                0
> State 1 Above diff         558,215          0.333333        0                0
> State 1 Below diff         0                0               0                0
> ========================================================================================
> Total Above                558,215          0.333333        0                0
> Total Below                0                0.333333        1                0.333333
> ========================================================================================
> Total Above Percent        99.99%           0%              0%               0%
> Total Below Percent        0%               0%              0%               0%
> ========================================================================================
> 
> 
> [1] The results do show that the teo governor doesn't have this issue.
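[Editor's note: the pipe-based wakeup workload described above can be sketched roughly as below. This is my reconstruction of the setup, not the author's actual benchmark; the iteration count and ~50 us interval are placeholders.]

```c
#include <time.h>
#include <unistd.h>
#include <sys/wait.h>

/*
 * A child writes one byte roughly every 50 us; the parent blocks in read(),
 * so each parent wakeup is pipe/IPI driven rather than timer driven --
 * exercising exactly the short-idle path the governors predict for.
 * Returns the number of wakeups the parent observed.
 */
static long run_pipe_wakeups(int iters)
{
	int fds[2];
	char b = 0;
	long wakeups = 0;

	if (pipe(fds))
		return -1;

	if (fork() == 0) {			/* child: periodic writer */
		struct timespec ts = { 0, 50 * 1000 };	/* ~50 us */
		for (int i = 0; i < iters; i++) {
			nanosleep(&ts, NULL);
			if (write(fds[1], &b, 1) != 1)
				break;
		}
		_exit(0);
	}

	for (int i = 0; i < iters; i++) {	/* parent: blocking reader */
		if (read(fds[0], &b, 1) != 1)
			break;
		wakeups++;
	}
	wait(NULL);
	return wakeups;
}
```

The timer-based variant would simply have the parent nanosleep() itself instead of blocking on the pipe.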

Glad to see that confirmed, thank you.

Regards,
Christian
Christian Loehle Oct. 21, 2024, 9:03 a.m. UTC | #13
On 10/21/24 09:02, Aboorva Devarajan wrote:
> On Thu, 2024-09-19 at 15:38 +0100, Christian Loehle wrote:
>> [...]
>> And of course I realise right after sending I did indeed make a mistake.
>> RESIDENCY_THRESHOLD_NS is 15us so timer is actually polled and the fields
>> populated accurately, but also in the for loop it checks polling for idx == 0.
>> predicted_ns should still be the value from get_typical_interval(), but
>> that is unable to ever enforce a polling state, making your proposed code
>> change the only obvious fix.
> 
>> Essentially with only one non-polling state the get_typical_interval()
>> logic might as well be skipped, it never affects anything.
>>
>> If I'm correct this time then I'm not really sure how to proceed.
>> On the one hand, if we trust predicted_ns we should indeed select a state
>> with predicted_ns < s->target_residency_ns, no matter if polling or not,
>> but that would be quite a behaviour change for x86 users too.
> 
> That's right; so far this is the only solution I can think of that
> improves performance, but I'm open to trying alternative suggestions,
> if any.

It's the most reasonable I can think of anyway.

>> I'm still curious why teo wouldn't solve your problem; because of its
>> intercept logic, it doesn't care whether it selects a polling state or not.
>>
> I confirmed that "teo" doesn't have the reported issue, and compared to the
> vanilla menu governor, switching to "teo" does improve the performance of
> some workloads, though not the original workload I tested. I'll look into
> why that's the case.

I'd be really curious about the workload that improves with your patch but
doesn't perform well on teo.

> However, we would still want to resolve the issue with the menu governor,
> as it's the default governor currently enabled and is what has been tested
> so far, while switching to "teo" could be an option in the future.

Understandable, and I think your change makes sense, although that logic is
quite old, as I dug up.
Modern arm64 but also x86 platforms AFAICS are unlikely to be affected by
your change, as they have efficient/responsive physical idle states (or, for
arm64, usually don't have any polling state at all), so the prediction logic
would have to spit out something like <1000 ns to take effect, and at that
point clearing the statistical-significance bar is quite unlikely.
OTOH, if there were power regressions, they might be on platforms where they
would never be reported.
Anyway, Rafael's call here.
If he's willing to take it, I would include the historical part in the commit
message so that it doesn't need to be dug up again.

Regards,
Christian