[RFC,v3,0/5] Introduce Cpufreq Active Stats

Message ID 20220406220809.22555-1-lukasz.luba@arm.com (mailing list archive)

Message

Lukasz Luba April 6, 2022, 10:08 p.m. UTC
Hi all,

This is the 3rd version of the patch set, which tries to address issues caused
by the lack of proper information about CPU performance over time.

The issue description:
1. "Cpufreq statistics cover the time when CPUs are in idle states, so they
   are not suitable for certain purposes, like thermal control." Rafael [2]
2. Thermal governor Intelligent Power Allocation (IPA) has to estimate power
   for the last period, e.g. 100ms, for each CPU in the cluster, to grant new
   power and set the max possible frequency. Currently, in some cases it gets
   a big error when the frequency of the CPU changed in the middle of the
   period. This is because IPA reads only the current frequency of the CPU and
   is not aware of all the other frequencies which were actively (not in idle)
   used in the last 100ms.
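
To illustrate point 2 with made-up numbers, here is a minimal sketch. The
power table, the residencies and the function names are assumptions for
illustration only, not data or code from the patches. It compares an estimate
that uses only the frequency sampled at the end of the period with one that
weights each frequency by the time the CPU was actively running at it:

```python
# Hypothetical per-frequency active power in mW; not taken from any real platform.
POWER_MW = {500_000: 100, 1_000_000: 300, 1_500_000: 700}

PERIOD_NS = 100_000_000  # the 100 ms IPA period, in ns


def estimate_from_current_freq(current_freq_khz, busy_ns):
    """Snapshot estimate: assume all busy time ran at the sampled frequency."""
    return POWER_MW[current_freq_khz] * busy_ns / PERIOD_NS


def estimate_from_active_residency(residency_ns):
    """Weight each frequency's power by the time the CPU actively ran at it."""
    return sum(POWER_MW[f] * ns for f, ns in residency_ns.items()) / PERIOD_NS


# Assumed scenario: 60 ms running at 1.5 GHz, 20 ms running at 500 MHz,
# and 20 ms in idle.
residency = {1_500_000: 60_000_000, 500_000: 20_000_000}
busy = sum(residency.values())

snapshot = estimate_from_current_freq(500_000, busy)  # frequency at period end
weighted = estimate_from_active_residency(residency)
print(snapshot, weighted)
```

With these assumed numbers the snapshot estimate is 80 mW while the
residency-weighted estimate is 440 mW, which shows how a frequency change in
the middle of the period can skew the result.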

This code focuses on tracking the idle entry/exit events for each CPU and
combining them with the tracked frequency statistics inside internal
per-CPU statistics arrays. The old cpufreq stats have one statistics array
shared by the policy (all CPUs) and do not take into account the periods
when each CPU was idle.
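
The idea described above can be sketched in a few lines of Python. This is
only a toy model under stated assumptions: the class and method names are
hypothetical and are not the API introduced by the patches, and it ignores
locking, hotplug and everything else the kernel code has to handle.

```python
from collections import defaultdict


class ActiveStats:
    """Toy per-CPU tracker of time spent running (not idle) at each frequency."""

    def __init__(self, num_cpus, freq_khz, now_ns=0):
        self.freq = freq_khz                  # frequency shared by the policy
        self.idle = [False] * num_cpus        # per-CPU idle flag
        self.last_ns = [now_ns] * num_cpus    # per-CPU time of the last update
        self.residency = [defaultdict(int) for _ in range(num_cpus)]

    def _account(self, cpu, now_ns):
        # Charge the elapsed time to the current frequency only if the CPU ran.
        if not self.idle[cpu]:
            self.residency[cpu][self.freq] += now_ns - self.last_ns[cpu]
        self.last_ns[cpu] = now_ns

    def idle_enter(self, cpu, now_ns):
        self._account(cpu, now_ns)
        self.idle[cpu] = True

    def idle_exit(self, cpu, now_ns):
        self._account(cpu, now_ns)
        self.idle[cpu] = False

    def freq_change(self, new_freq_khz, now_ns):
        # A policy-wide frequency change closes the open interval on every CPU.
        for cpu in range(len(self.idle)):
            self._account(cpu, now_ns)
        self.freq = new_freq_khz


stats = ActiveStats(num_cpus=2, freq_khz=1_000_000)
stats.idle_enter(1, 10)            # CPU1 goes idle at t=10
stats.freq_change(1_500_000, 40)   # the policy switches frequency at t=40
stats.idle_exit(1, 70)             # CPU1 wakes up at t=70
stats.freq_change(1_000_000, 100)  # closes the interval at t=100
```

In this sketch CPU0 never idles, so it accumulates the full 40 + 60 time
units, while CPU1 is charged only for the 10 and 30 units it was actually
running. A single shared array for the policy could not express that
difference.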

Sometimes the IPA error between the old estimation signal and reality is quite
big (>50%).

changelog:
v3:
- moved the core implementation into cpufreq instead of
  creating a new framework (as suggested by Rafael)
- updated all function names and APIs
v2 [1]


Regards,
Lukasz Luba

[1] https://lore.kernel.org/all/20210706131828.22309-1-lukasz.luba@arm.com/
[2] https://lore.kernel.org/all/CAJZ5v0gzpfT__EyrVuZSr32ms7-YJZw7qEok0WZECv1iDRRvWA@mail.gmail.com/

Lukasz Luba (5):
  cpufreq: stats: Introduce Cpufreq Active Stats
  cpuidle: Add Cpufreq Active Stats calls tracking idle entry/exit
  thermal: Add interface to cooling devices to handle governor change
  thermal: power allocator: Prepare power actors and calm down when not
    used
  thermal: cpufreq_cooling: Improve power estimation using Cpufreq
    Active Stats

 MAINTAINERS                           |   2 +-
 drivers/cpufreq/cpufreq_stats.c       | 872 ++++++++++++++++++++++++++
 drivers/cpuidle/cpuidle.c             |   5 +
 drivers/thermal/cpufreq_cooling.c     | 131 ++++
 drivers/thermal/gov_power_allocator.c |  71 +++
 include/linux/cpufreq_stats.h         | 131 ++++
 include/linux/thermal.h               |   1 +
 7 files changed, 1212 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/cpufreq_stats.h

Comments

Viresh Kumar April 26, 2022, 3:11 a.m. UTC | #1
On 06-04-22, 23:08, Lukasz Luba wrote:
> Hi all,
> 
> This is the 3rd version of patch set which tries to address issues which are
> due to missing proper information about CPU performance in time.
> 
> The issue description:
> 1. "Cpufreq statistics cover the time when CPUs are in idle states, so they
>    are not suitable for certain purposes, like thermal control." Rafael [2]
> 2. Thermal governor Intelligent Power Allocation (IPA) has to estimate power,
>    for the last period, e.g. 100ms, for each CPU in the Cluster, to grant new
>    power and set max possible frequency. Currently in some cases it gets big
>    error, when the frequency of CPU changed in the middle. It is due to the
>    fact that IPA reads the current frequency for the CPU, not aware of all
>    other frequencies which were actively (not in idle) used in the last 100ms.
> 
> This code focuses on tracking the events of idle entry/exit for each CPU
> and combine them with the frequency tracked statistics inside internal
> statistics arrays (per-CPU). In the old cpufreq stats we have one shared
> statistics array for the policy (all CPUs) and not take into account
> periods when each CPU was in idle.
> 
> Sometimes the IPA error between old estimation signal and reality is quite
> big (>50%).

It would have been useful to show how the stats hierarchy looks in userspace
now.

Lukasz Luba April 26, 2022, 7:46 a.m. UTC | #2
On 4/26/22 04:11, Viresh Kumar wrote:
> It would have been useful to show how the stats hierarchy looks in userspace
> now.
> 

I haven't modified your current cpufreq stats; they are still counting
total time (idle + running) for the given frequency. I think this is
still useful for some userspace tools. The newly proposed stats don't
have a sysfs interface to read them. I don't know if userspace would
be interested in this information (the running-only time). IIRC Android
uses bpf mechanisms to get this information to userspace.

Viresh Kumar April 26, 2022, 7:54 a.m. UTC | #3
On 26-04-22, 08:46, Lukasz Luba wrote:
> I haven't modify your current cpufreq stats, they are still counting
> total time (idle + running) for the given frequency. I think this is
> still useful for some userspace tools. These new proposed stats don't
> have such sysfs interface to read them. I don't know if userspace would
> be interested in this information (the running only time). IIRC Android
> uses bpf mechanisms to get this information to the userspace.

I saw some debugfs bits there; aren't you exposing any data via it? I
am just asking, not suggesting :)

Lukasz Luba April 26, 2022, 7:59 a.m. UTC | #4
On 4/26/22 08:54, Viresh Kumar wrote:
> On 26-04-22, 08:46, Lukasz Luba wrote:
>> I haven't modify your current cpufreq stats, they are still counting
>> total time (idle + running) for the given frequency. I think this is
>> still useful for some userspace tools. These new proposed stats don't
>> have such sysfs interface to read them. I don't know if userspace would
>> be interested in this information (the running only time). IIRC Android
>> uses bpf mechanisms to get this information to the userspace.
> 
> I saw some debugfs bits there, aren't you exposing any data via it ? I
> am just asking about, not suggesting :)
> 

:) but I didn't dare to make it sysfs. I don't know if anything in
user-space would be interested (apart from my test scripts).

Viresh Kumar April 26, 2022, 8:02 a.m. UTC | #5
On 26-04-22, 08:59, Lukasz Luba wrote:
> :) but I didn't dare to make it sysfs. I don't know if anything in
> user-space would be interested (apart from my test scripts).

Sure, I was talking about the hierarchy in debugfs only. It will be useful if
you can show how it looks and what data is exposed.

Lukasz Luba April 26, 2022, 2:40 p.m. UTC | #6
On 4/26/22 09:02, Viresh Kumar wrote:
> On 26-04-22, 08:59, Lukasz Luba wrote:
>> :) but I didn't dare to make it sysfs. I don't know if anything in
>> user-space would be interested (apart from my test scripts).
> 
> Sure, I was talking about hierarchy in debugfs only. Will be useful if
> you can show how it looks and what all data is exposed.
> 

I've created a new way of sharing such things. Please check the rendered
notebook at [1]. You can find the raw output of that debugfs at cell 9, or
in cell 11 as a dictionary. The residency is in ns. You can also find a
diff of two snapshots for all cpus at cell 16. We randomly use Little
cpus: 0,3,4,5.
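
For readers who do not open the notebook, a diff of two residency snapshots
could look roughly like the sketch below. The dictionary layout
({cpu: {frequency_khz: residency_ns}}) and the numbers are assumptions made
for this example; they are not copied from the notebook or from the debugfs
output.

```python
def diff_snapshots(before, after):
    """Return the active residency (ns) gained per CPU and frequency."""
    delta = {}
    for cpu, freqs in after.items():
        old = before.get(cpu, {})
        # A frequency missing from the first snapshot counts as 0 ns.
        delta[cpu] = {f: ns - old.get(f, 0) for f, ns in freqs.items()}
    return delta


# Made-up snapshots in the assumed layout {cpu: {frequency_khz: residency_ns}}.
snap1 = {0: {450_000: 1_000, 800_000: 5_000}, 3: {450_000: 2_000}}
snap2 = {0: {450_000: 1_500, 800_000: 9_000}, 3: {450_000: 2_000, 800_000: 700}}

delta = diff_snapshots(snap1, snap2)
print(delta)
```

A CPU that stayed idle between the two snapshots shows a delta of 0 for every
frequency, which is the behaviour the new stats are meant to capture.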

At the bottom you can find plots for all cpus, showing their active residency
at each frequency. Cpu1 and cpu2 are big; cpu2 has been hotplugged out, so
there is an empty plot (which is good).

BTW, if you are interested in a comparison of different input power
estimation mechanisms, you can find it here [2]. There are 4 different
power signals. One is real, from the Juno power/energy meters; the rest
are SW estimations of avg power for the 100ms period. As you can see
in the cell 25 plot, the new proposal in this patch set is better
than the two previous ones used in mainline. The last plot shows the real
power signal and the new avg signal. The plot is interactive and
supports 'Box Zoom' on the right (scroll to the right to see that toolbox).

Regards,
Lukasz

[1] https://nbviewer.org/github/lukaszluba-arm/lisa/blob/public_tests/ipa_input_power-debugfs.ipynb
[2] https://nbviewer.org/github/lukaszluba-arm/lisa/blob/public_tests/ipa_input_power.ipynb