Message ID: 20201102120115.29993-1-nicola.mazzucato@arm.com (mailing list archive)
Series: CPUFreq: Add support for cpu performance dependencies
Hi All,

This is a continuation of the previous v2, where we focused mostly on the
dt binding. I am seeking some feedback/comments on the following two
approaches.

Intro:
We have seen that in a system where performance control and hardware
description do not match (i.e. per-cpu), we still need the information of
how the v/f lines are shared among the cpus. We call this information
"performance dependencies". We get this info through opp-shared (the
previous 2 patches aim for that).

Problem:
How do we share such info (retrieved from a cpufreq driver) with other
consumers that rely on it? I have two proposals.

First proposal:
Having a new cpumask 'dependent_cpus' to represent the CPU performance
dependencies seems a viable and scalable way. The main reason for a new
cpumask is that although 'related_cpus' could be (or could have been) used
for such a purpose, its meaning has changed over time [1].

Provided that the cpufreq driver populates dependent_cpus and sets
shared_type accordingly, the s/w components that rely on such description
(we focus on energy-model and cpufreq_cooling for now) will always be
provided with the correct information when picking the new cpumask.

Proposed changes (at high level):

1) cpufreq: Add a new dependent_cpus cpumask in cpufreq_policy

* New cpumask addition

    struct cpufreq_policy {
            cpumask_var_t           related_cpus;   /* Online + Offline CPUs */
            cpumask_var_t           real_cpus;      /* Related and present */
    +
    +       /*
    +        * CPUs with hardware clk/perf dependencies
    +        *
    +        * For sw components that rely on h/w info of clk dependencies
    +        * when hw coordinates. This cpumask should always reflect the
    +        * hw dependencies.
    +        */
    +       cpumask_var_t           dependent_cpus; /* all clk-dependent cpus */
            unsigned int            shared_type;    /* ACPI: ANY or ALL affected CPUs */

* Fallback mechanism for dependent_cpus. With this, s/w components can
  always pick dependent_cpus regardless of the coordination type.
    static int cpufreq_online(unsigned int cpu)
            ...
            /* related_cpus should at least include policy->cpus. */
            cpumask_copy(policy->related_cpus, policy->cpus);
    +
    +       /* dependent_cpus should differ only when hw coordination is in place */
    +       if (policy->shared_type != CPUFREQ_SHARED_TYPE_HW)
    +               cpumask_copy(policy->dependent_cpus, policy->cpus);
    }

* Add a sysfs attribute for dependent_cpus

2) drivers/thermal/cpufreq_cooling: Replace related_cpus with dependent_cpus

3) drivers/cpufreq/scmi-cpufreq: Get perf-dependencies from
   dev_pm_opp_of_get_sharing_cpus()

* Call dev_pm_opp_of_get_sharing_cpus()
* Populate policy->dependent_cpus if possible
* Set policy->shared_type accordingly
* Provide to EM the correct performance dependencies information

    -       em_dev_register_perf_domain(cpu_dev, nr_opp, &em_cb, policy->cpus);
    +       /*
    +        * EM needs accurate information about perf boundaries, thus
    +        * provide the correct cpumask.
    +        */
    +       if (policy->shared_type == CPUFREQ_SHARED_TYPE_HW)
    +               em_dev_register_perf_domain(cpu_dev, nr_opp, &em_cb,
    +                                           policy->dependent_cpus);
    +       else
    +               em_dev_register_perf_domain(cpu_dev, nr_opp, &em_cb,
    +                                           policy->cpus);

Second proposal:
Another option could be for each driver to store the performance
dependencies internally and let the driver directly provide the correct
cpumask to any consumer.

Whilst this works fine for the energy model (see above), it is not the case
(currently) for cpufreq_cooling, so we would need to add a new interface
for it. That way the driver can call the registration function directly.
This seems to be possible since the cpufreq core can skip registering
cpufreq_cooling (flag CPUFREQ_IS_COOLING_DEV).

In comparison with the first proposal this one is less scalable, since each
driver will have to call the registration functions, while in some cases
the cpufreq core could do it.

Any comments/preferences?

Many Thanks
Nicola

[1] https://lkml.org/lkml/2020/9/24/399
On 02-11-20, 12:01, Nicola Mazzucato wrote:
> Hi All,
>
> In this V3 posting I have replaced the new dt-binding with minor changes/
> improvements for opp (since we are now using opp tables instead).
> The RFC still stands on how to make this info available to sw consumers.
>
> In the RFC, I am proposing a simple addition of a performance dependencies
> cpumask in CPUFreq core and an example of how drivers and consumers would
> make use of it.
> I also propose an alternative approach, which does not require changes in
> CPUFreq core, but - possibly - in all the consumers.
>
> This is to support systems where exposed cpu performance controls are more
> fine-grained than the platform's ability to scale cpus independently.

I was talking to Vincent about what you are doing here and we got a bit
confused, so here are a few questions that we had:

- Based on my previous understanding, you don't want software coordination
  of frequencies (which is done by cpufreq today), but want the hardware
  to do that, and so you want per-cpu cpufreq policies.

- What's the real benefit of hardware coordination? I want to make sure I
  fully understand that.

- Because of hardware co-ordination of otherwise co-ordinated CPUs, a few
  things break. Thermal and EAS are some of the examples, and so you are
  trying to fix them here by providing them the missing information again.

- One other thing that breaks with this is freq-invariance in the
  scheduler, as the scheduler won't see the real frequencies the various
  CPUs are running at. Most of the hardware we have today doesn't have
  counters, like AMUs, and I am not sure if all future ones based on SCMI
  will have them either, so how are they going to be fixed?

  And if we even have to fix this (freq invariance), what's hardware
  coordination giving us that makes all this worth it?

Sorry about the long list :)
Hi Viresh, thanks for looking into this.

On 11/3/20 10:18 AM, Viresh Kumar wrote:
> On 02-11-20, 12:01, Nicola Mazzucato wrote:
[...]
> I was talking to Vincent about what you are doing here and we got a bit
> confused, so here are a few questions that we had:
>
> - Based on my previous understanding, you don't want software
>   coordination of frequencies (which is done by cpufreq today), but want
>   the hardware to do that, and so you want per-cpu cpufreq policies.

Correct. And this has been done for quite some time on some platforms.

> - What's the real benefit of hardware coordination? I want to make sure
>   I fully understand that.

The hardware coordination that comes out of having per-cpu cpufreq
policies is not new, and works just fine on most systems.

The benefit of having per-cpu controls is that the firmware will take care
of the performance of the entire system. It is purely a delegation to
firmware for the performance optimizations.

> - Because of hardware co-ordination of otherwise co-ordinated CPUs, a
>   few things break. Thermal and EAS are some of the examples, and so you
>   are trying to fix them here by providing them the missing information
>   again.

Correct. And for this I have proposed two ways.
> - One other thing that breaks with this is freq-invariance in the
>   scheduler, as the scheduler won't see the real frequencies the various
>   CPUs are running at. Most of the hardware we have today doesn't have
>   counters, like AMUs, and I am not sure if all future ones based on
>   SCMI will have them either, so how are they going to be fixed?

Correct. freq-invariance without counters is trying to do its best based
on the information it has available. It definitely relies on the knowledge
of the v/f domains to work at its best, so I think in the per-cpu case it
will follow the same approach as the others being affected (EAS, thermal).

>   And if we even have to fix this (freq invariance), what's hardware
>   coordination giving us that makes all this worth it?

I suppose this is more a generic question for all the platforms running
with h/w coordination, but in our case it is that the f/w will take care
of the performance optimizations for us :)

> Sorry about the long list :)

No problem at all. Thank you for your time on this and I hope I have made
bits clearer.

Nicola
On Wed, 4 Nov 2020 at 19:02, Nicola Mazzucato <nicola.mazzucato@arm.com> wrote:
>
> Hi Viresh, thanks for looking into this.
>
> On 11/3/20 10:18 AM, Viresh Kumar wrote:
[...]
> > - Because of hardware co-ordination of otherwise co-ordinated CPUs,
> > few things break.
> >   Thermal and EAS are some of the examples, and so you are trying to
> >   fix them here by providing them the missing information again.
>
> Correct. And for this I have proposed two ways.
>
> > - One other thing that breaks with this is freq-invariance in the
> >   scheduler, as the scheduler won't see the real frequencies the
> >   various CPUs are running at. Most of the hardware we have today
> >   doesn't have counters, like AMUs, and I am not sure if all future
> >   ones based on SCMI will have them either, so how are they going to
> >   be fixed?
>
> Correct. freq-invariance without counters is trying to do its best based
> on the information it has available. It definitely relies on the
> knowledge of the v/f domains to work at its best, so I think in the
> per-cpu case it will follow the same approach as the others being
> affected (EAS, thermal).

As frequency invariance has the same problem as EAS and thermal, it would
be good to see the solution for it as part of this proposal, like for EAS
and thermal.

> > And if we even have to fix this (freq invariance), what's hardware
> > coordination giving us that makes all this worth it?
>
> I suppose this is more a generic question for all the platforms running
> with h/w coordination, but in our case it is that the f/w will take care
> of the performance optimizations for us :)
>
> > Sorry about the long list :)
>
> No problem at all. Thank you for your time on this and I hope I have
> made bits clearer.
>
> Nicola
Hi guys,

On Thursday 05 Nov 2020 at 15:25:53 (+0100), Vincent Guittot wrote:
[..]
> > Correct. freq-invariance without counters is trying to do its best
> > based on the information it has available. It definitely relies on the
> > knowledge of the v/f domains to work at its best, so I think in the
> > per-cpu case it will follow the same approach as the others being
> > affected (EAS, thermal).
>
> As frequency invariance has the same problem as EAS and thermal, it
> would be good to see the solution for it as part of this proposal, like
> for EAS and thermal.

I think I was waiting for a consensus on patch 3/3, although I believe the
discussion at [1] tended towards option 2: "each driver to store
internally the performance dependencies and let the driver directly
provide the correct cpumask for any consumer."

The alternative was option 1: "add a new dependent_cpus cpumask in
cpufreq_policy", as Nicola mentioned in the commit message for 3/3.

If the choice is clear, I'm happy to take the FIE fixes in a separate set.

Thanks,
Ionela.

[1] https://lore.kernel.org/linux-arm-kernel/20200924095347.32148-3-nicola.mazzucato@arm.com/

> > > And if we even have to fix this (freq invariance), what's hardware
> > > coordination giving us that makes all this worth it?
> >
> > I suppose this is more a generic question for all the platforms
> > running with h/w coordination, but in our case it is that the f/w will
> > take care of the performance optimizations for us :)
> >
> > > Sorry about the long list :)
> >
> > No problem at all. Thank you for your time on this and I hope I have
> > made bits clearer.
> >
> > Nicola
On 02-11-20, 12:01, Nicola Mazzucato wrote:
> This is a continuation of the previous v2, where we focused mostly on
> the dt binding.
>
> I am seeking some feedback/comments on the following two approaches.
>
[...]
>
> Problem:
> How do we share such info (retrieved from a cpufreq driver) with other
> consumers that rely on it? I have two proposals.

I haven't really stopped thinking about what and how we should solve this,
but I have a few concerns first.

> 2) drivers/thermal/cpufreq_cooling: Replace related_cpus with dependent_cpus

I am not sure if I understand completely how this is going to be
modified/work.

The only use of related_cpus in the cooling driver is in the helper
cdev->get_requested_power(), where we need to find the total power being
consumed by the devices controlled by the cooling device. Right?

Now, the cooling devices today are very closely related to the cpufreq
policy; the registration function itself takes a cpufreq policy as an
argument.

Consider that you have an octa-core platform and all the CPUs are
dependent on each other. With your suggested changes and hw control, we
will have a different cpufreq policy for each CPU. And so we will have a
cooling device, cdev, for each CPU as well. When the IPA governor calls
cdev->get_requested_power(), why should we ever bother to traverse the
list of dependent_cpus and not related_cpus only?

Otherwise the same CPU will have its load contributed to the power of 8
cooling devices.
On 02-11-20, 12:01, Nicola Mazzucato wrote:
> Hi All,
>
> This is a continuation of the previous v2, where we focused mostly on
> the dt binding.
>
[...]
>
> Problem:
> How do we share such info (retrieved from a cpufreq driver) with other
> consumers that rely on it? I have two proposals.

FWIW, we already have a case in the kernel where something similar is
done: commit f4fd3797848a ("acpi-cpufreq: Add new sysfs attribute
freqdomain_cpus"), and the information is shared via:

/sys/devices/system/cpu/cpu#/cpufreq/freqdomain_cpus
Hi Viresh,

On 11/6/20 9:20 AM, Viresh Kumar wrote:
> On 02-11-20, 12:01, Nicola Mazzucato wrote:
[...]
> I haven't really stopped thinking about what and how we should solve
> this, but I have a few concerns first.
>
>> 2) drivers/thermal/cpufreq_cooling: Replace related_cpus with dependent_cpus
>
> I am not sure if I understand completely how this is going to be
> modified/work.
>
> The only use of related_cpus in the cooling driver is in the helper
> cdev->get_requested_power(), where we need to find the total power being
> consumed by the devices controlled by the cooling device. Right?
>
> Now, the cooling devices today are very closely related to the cpufreq
> policy; the registration function itself takes a cpufreq policy as an
> argument.
>
> Consider that you have an octa-core platform and all the CPUs are
> dependent on each other. With your suggested changes and hw control, we
> will have a different cpufreq policy for each CPU. And so we will have a
> cooling device, cdev, for each CPU as well. When the IPA governor calls
> cdev->get_requested_power(), why should we ever bother to traverse the
> list of dependent_cpus and not related_cpus only?
>
> Otherwise the same CPU will have its load contributed to the power of 8
> cooling devices.

Good question.

How about a different interface for those cpufreq drivers?
That new registration API would allow us to specify the cpumask. Or it
could rely on the EM cpumask: em_span_cpus(em).

Currently we have two ways to register a cooling device:
1. when the cpufreq driver sets the flag CPUFREQ_IS_COOLING_DEV, the core
   will register the cooling device
2. the cpufreq driver can explicitly call the registration function:
   cpufreq_cooling_register() with 'policy' as argument

That would need a substantial change to the cpufreq cooling code, from
policy oriented to a custom driver's cpumask (like the EM registration).

Regards,
Lukasz
On 06-11-20, 10:37, Lukasz Luba wrote:
> Good question.
>
> How about a different interface for those cpufreq drivers?
> That new registration API would allow us to specify the cpumask. Or it
> could rely on the EM cpumask: em_span_cpus(em).
>
> Currently we have two ways to register a cooling device:
> 1. when the cpufreq driver sets the flag CPUFREQ_IS_COOLING_DEV, the
>    core will register the cooling device
> 2. the cpufreq driver can explicitly call the registration function:
>    cpufreq_cooling_register() with 'policy' as argument
>
> That would need a substantial change to the cpufreq cooling code, from
> policy oriented to a custom driver's cpumask (like the EM registration).

I am even wondering if we should really make that change. Why do we need
the combined load of the CPUs to be sent back to the IPA governor? Why
shouldn't they all do that (they == cdev)?

This is a bit confusing to me, sorry about that. The cpufreq governors
take a look at all the CPUs' utilization and set the frequency based on
the highest utilization (and not the total util).

While in this case we present the total load of the CPUs to the IPA (based
on the current frequency of the CPUs), in response to which it tells us
the frequency at which all the CPUs of the policy can run (I am not even
sure if that is the right thing to do, as the CPUs have different loads).
And how do we fit this dependent_cpus thing into this?

Sorry, I am not sure what's the right way of doing things here.
On 11/6/20 10:55 AM, Viresh Kumar wrote:
> On 06-11-20, 10:37, Lukasz Luba wrote:
[...]
> I am even wondering if we should really make that change. Why do we need
> the combined load of the CPUs to be sent back to the IPA governor? Why
> shouldn't they all do that (they == cdev)?
>
> This is a bit confusing to me, sorry about that. The cpufreq governors
> take a look at all the CPUs' utilization and set the frequency based on
> the highest utilization (and not the total util).
>
> While in this case we present the total load of the CPUs to the IPA
> (based on the current frequency of the CPUs), in response to which it
> tells us the frequency at which all the CPUs of the policy can run (I am
> not even sure if that is the right thing to do, as the CPUs have
> different loads). And how do we fit this dependent_cpus thing into this?
>
> Sorry, I am not sure what's the right way of doing things here.

I also had similar doubts, because if we make frequency requests
independently for each CPU, why not have N cooling devs, which will
independently set the QoS max freq for them...

What convinced me:
EAS and FIE would know the 'real' frequency of the cluster, and IPA can
use it also and have only one cooling device per cluster.

We would like to keep this old style of 'one cooling device per cpuset'.
I don't have a strong opinion, and if it turned out that there are some
errors in the freq estimation for the cluster, then maybe it does make
more sense to have a cdev per CPU...
On 06-11-20, 11:14, Lukasz Luba wrote:
> I also had similar doubts, because if we make frequency requests
> independently for each CPU, why not have N cooling devs, which will
> independently set the QoS max freq for them...
>
> What convinced me:
> EAS and FIE would know the 'real' frequency of the cluster, and IPA can
> use it also and have only one cooling device per cluster.
>
> We would like to keep this old style of 'one cooling device per cpuset'.
> I don't have a strong opinion, and if it turned out that there are some
> errors in the freq estimation for the cluster, then maybe it does make
> more sense to have a cdev per CPU...

Let me rephrase my question. What is it that doesn't work _correctly_ with
a cdev per cpufreq policy in your case? What doesn't work well if the
thermal stuff keeps looking at only the related_cpus thing and not the
cpu-perf-dependencies thing?
On 11/9/20 6:57 AM, Viresh Kumar wrote:
> On 06-11-20, 11:14, Lukasz Luba wrote:
[...]
> Let me rephrase my question. What is it that doesn't work _correctly_
> with a cdev per cpufreq policy in your case? What doesn't work well if
> the thermal stuff keeps looking at only the related_cpus thing and not
> the cpu-perf-dependencies thing?

We don't have a platform with this per-cpu freq request yet. Thus it's
hard to answer your question. The EAS would work in 'old style' - cluster
mode. I don't know how IPA would work on such a HW and SW configuration.
To figure this out I need a real platform.
On 16-11-20, 11:33, Lukasz Luba wrote:
> On 11/9/20 6:57 AM, Viresh Kumar wrote:
[...]
> We don't have a platform with this per-cpu freq request yet. Thus it's
> hard to answer your question. The EAS would work in 'old style' -
> cluster mode. I don't know how IPA would work on such a HW and SW
> configuration. To figure this out I need a real platform.

Hmm, so who are going to be the users of this new stuff (dependent CPUs)?
I don't think cpufreq-cooling should be updated, unless there is a
compelling reason to.

The other one is the energy model? Why does it need this information?

Who else?
Hi Viresh,

On 11/17/20 10:11 AM, Viresh Kumar wrote:
> On 16-11-20, 11:33, Lukasz Luba wrote:
[...]
>> We don't have a platform with this per-cpu freq request yet. Thus it's
>> hard to answer your question. The EAS would work in 'old style' -
>> cluster mode. I don't know how IPA would work on such a HW and SW
>> configuration. To figure this out I need a real platform.
>
> Hmm, so who are going to be the users of this new stuff (dependent
> CPUs)?

In general, any platform that has hardware coordination in place where
some components need to use the information.

> I don't think cpufreq-cooling should be updated, unless there is a
> compelling reason to.
>
> The other one is the energy model? Why does it need this information?

The reasons have probably got lost in the emails, but in a nutshell the EM
needs accurate information on performance boundaries to achieve correct
task placement.

> Who else?

Freq-invariance has been mentioned. I suppose the fix will depend on which
strategy we prefer to solve this.
As a reminder, the two solutions are:
1) a dependent_cpus cpumask in cpufreq, with the involved entities picking
   this info, or
2) a dependent_cpus cpumask in the driver, but some entities' interfaces
   may need to change

Hope it helps,
Nicola
On 11/17/20 10:11 AM, Viresh Kumar wrote:
> On 16-11-20, 11:33, Lukasz Luba wrote:
[...]
> Hmm, so who are going to be the users of this new stuff (dependent
> CPUs)? I don't think cpufreq-cooling should be updated, unless there is
> a compelling reason to.

Fair enough. What if we come back with experiment results in the future?
(Currently we are trying to conduct experiments with Nicola on a modified
Juno firmware and kernel.)

> The other one is the energy model? Why does it need this information?

The energy model needs this information because it's used during the sched
domain re-build. The sched domain is then used in the EAS main functions:
feec() [1] and compute_energy() [2].
What EAS normally does is 'try different options to put a task on
different CPUs and observe the potential energy costs'. For example, we
need the 'cluster' frequency because when you put a task on a CPU, its
freq might need to be set a bit higher. This would affect all CPUs in the
cluster, not only one, and we capture that energy cost. Then we try to put
the task on another CPU in that cluster, and if it appears that there is
no need to raise the freq for the whole cluster, then it wins.

> Who else?

FIE..
On 17-11-20, 10:47, Nicola Mazzucato wrote:
> Freq-invariance has been mentioned. I suppose the fix will depend on
> which strategy we prefer to solve this.

I am not sure how FIE will use this information, as I thought the problem
is about knowing the current frequency of a CPU instead of which CPUs are
related. Anyway, the EM is a good enough reason to get this stuff going.

> As a reminder, the two solutions are:
> 1) a dependent_cpus cpumask in cpufreq, with the involved entities
>    picking this info

Yeah, this one. And it will be called freqdomain_cpus. Add support for
freqdomain_cpus in the core, update acpi-cpufreq to reuse it instead of
adding its own, and then you can have it in other drivers.
On 17-11-20, 10:47, Lukasz Luba wrote:
> Fair enough. What if we come back with experiment results in the future?
> Currently we are trying to conduct experiments with Nicola on a modified
> Juno firmware and kernel.

Sure. If we think it improves things, why not. I just wanted to make
sure we understand why we would go and do this change now.
On Tue, Nov 17, 2020 at 11:53 AM Viresh Kumar <viresh.kumar@linaro.org> wrote:
>
> On 17-11-20, 10:47, Nicola Mazzucato wrote:
> > Freq-invariance has been mentioned. I suppose the fix will depend on
> > which strategy we prefer to solve this.
>
> I am not sure how FIE will use this information, as I thought the
> problem is about knowing the current frequency on a CPU instead of
> which CPUs are related. Anyway, EM is good enough to get this stuff
> going.
>
> > As a reminder, two solutions:
> > 1) dependent_cpus cpumask in cpufreq and involved entities pick this info
>
> Yeah, this one. And it will be called freqdomain_cpus. Add support for
> freqdomain_cpus in core, update acpi-cpufreq to reuse it instead of
> adding its own and you can have it in other drivers then.

Is this really a cpufreq thing, though, or is it arch stuff? I think
the latter, because it is not necessary for anything in cpufreq.

Yes, acpi-cpufreq happens to know this information, because it uses
processor_perflib, but the latter may as well be used by the arch
enumeration of CPUs and the freqdomain_cpus mask may be populated from
there.

As far as cpufreq is concerned, if the interface to the hardware is
per-CPU, there is one CPU per policy and cpufreq has no business
knowing anything about the underlying hardware coordination.
On 17-11-20, 14:06, Rafael J. Wysocki wrote:
> Is this really a cpufreq thing, though, or is it arch stuff? I think
> the latter, because it is not necessary for anything in cpufreq.
>
> Yes, acpi-cpufreq happens to know this information, because it uses
> processor_perflib, but the latter may as well be used by the arch
> enumeration of CPUs and the freqdomain_cpus mask may be populated from
> there.
>
> As far as cpufreq is concerned, if the interface to the hardware is
> per-CPU, there is one CPU per policy and cpufreq has no business
> knowing anything about the underlying hardware coordination.

It won't be used by cpufreq for now at least, and yes, I understand your
concern. I opted for this because we already have a cpufreq
implementation of the same thing, and it is usually better to reuse that
kind of stuff than to invent it all over again.
On Wed, Nov 18, 2020 at 5:42 AM Viresh Kumar <viresh.kumar@linaro.org> wrote:
>
> On 17-11-20, 14:06, Rafael J. Wysocki wrote:
> > Is this really a cpufreq thing, though, or is it arch stuff? I think
> > the latter, because it is not necessary for anything in cpufreq.
> >
> > Yes, acpi-cpufreq happens to know this information, because it uses
> > processor_perflib, but the latter may as well be used by the arch
> > enumeration of CPUs and the freqdomain_cpus mask may be populated from
> > there.
> >
> > As far as cpufreq is concerned, if the interface to the hardware is
> > per-CPU, there is one CPU per policy and cpufreq has no business
> > knowing anything about the underlying hardware coordination.
>
> It won't be used by cpufreq for now at least and yes I understand your
> concern. I opted for this because we already have a cpufreq
> implementation for the same thing and it is usually better to reuse
> this kind of stuff instead of inventing it over.

Do you mean related_cpus and real_cpus? That's the granularity of the
interface to the hardware I'm talking about.

Strictly speaking, it means "these CPUs share a HW interface for perf
control" and it need not mean "these CPUs are in the same clock/voltage
domain". Specifically, it need not mean "these CPUs are the only CPUs
in the given clock/voltage domain". That's what it means when the
control is exercised by manipulating OPPs directly, but not in general.

In the ACPI case, for example, what the firmware tells you need not
reflect the HW topology in principle. It only tells you whether or not
it wants you to coordinate a given group of CPUs and in what way, but
this may not be the whole picture from the HW perspective. If you need
the latter, you need more information in general (at least you need to
assume that what the firmware tells you actually does reflect the HW
topology on the given SoC).

So yes, in the particular case of OPP-based perf control, cpufreq
happens to have the same information that is needed by the other
subsystems, but otherwise it may not, and what I'm saying is that it
generally is a mistake to expect cpufreq to have that information or to
be able to obtain it without the help of the arch/platform code.
Hence, it would be a mistake to design an interface based on that
expectation.

Or, looking at it from a different angle, today a cpufreq driver is
only required to specify the granularity of the HW interface for perf
control via related_cpus. It is not required to obtain extra
information beyond that. If a new mask to be populated by it is added,
the driver may need to do more work which is not necessary from the
perf control perspective. That doesn't look particularly clean to me.

Moreover, adding such a mask to cpufreq_policy would make its users
depend on cpufreq somewhat artificially, which need not even be useful.

IMO, the information needed by all of the subsystems in question should
be obtained and made available at the arch/platform level, and everyone
who needs it should be able to access it from there, including the
cpufreq driver for the given platform if that's what it needs to do.

BTW, cpuidle may need the information in question too, so why should it
be provided via cpufreq rather than via cpuidle?
On 18-11-20, 13:00, Rafael J. Wysocki wrote:
> On Wed, Nov 18, 2020 at 5:42 AM Viresh Kumar <viresh.kumar@linaro.org> wrote:
> >
> > On 17-11-20, 14:06, Rafael J. Wysocki wrote:
> > > Is this really a cpufreq thing, though, or is it arch stuff? I think
> > > the latter, because it is not necessary for anything in cpufreq.
> > >
> > > Yes, acpi-cpufreq happens to know this information, because it uses
> > > processor_perflib, but the latter may as well be used by the arch
> > > enumeration of CPUs and the freqdomain_cpus mask may be populated from
> > > there.
> > >
> > > As far as cpufreq is concerned, if the interface to the hardware is
> > > per-CPU, there is one CPU per policy and cpufreq has no business
> > > knowing anything about the underlying hardware coordination.
> >
> > It won't be used by cpufreq for now at least and yes I understand your
> > concern. I opted for this because we already have a cpufreq
> > implementation for the same thing and it is usually better to reuse
> > this kind of stuff instead of inventing it over.
>
> Do you mean related_cpus and real_cpus?

Sorry about the confusion, I meant freqdomain_cpus only.

> That's the granularity of the interface to the hardware I'm talking about.
>
> Strictly speaking, it means "these CPUs share a HW interface for perf
> control" and it need not mean "these CPUs are in the same
> clock/voltage domain". Specifically, it need not mean "these CPUs are
> the only CPUs in the given clock/voltage domain". That's what it
> means when the control is exercised by manipulating OPPs directly, but
> not in general.
>
> In the ACPI case, for example, what the firmware tells you need not
> reflect the HW topology in principle. It only tells you whether or
> not it wants you to coordinate a given group of CPUs and in what way,
> but this may not be the whole picture from the HW perspective. If you
> need the latter, you need more information in general (at least you
> need to assume that what the firmware tells you actually does reflect
> the HW topology on the given SoC).
>
> So yes, in the particular case of OPP-based perf control, cpufreq
> happens to have the same information that is needed by the other
> subsystems, but otherwise it may not and what I'm saying is that it
> generally is a mistake to expect cpufreq to have that information or
> to be able to obtain it without the help of the arch/platform code.
> Hence, it would be a mistake to design an interface based on that
> expectation.
>
> Or looking at it from a different angle, today a cpufreq driver is
> only required to specify the granularity of the HW interface for perf
> control via related_cpus. It is not required to obtain extra
> information beyond that. If a new mask to be populated by it is
> added, the driver may need to do more work which is not necessary from
> the perf control perspective. That doesn't look particularly clean to
> me.
>
> Moreover, adding such a mask to cpufreq_policy would make the users of
> it depend on cpufreq sort of artificially, which need not be useful
> even.
>
> IMO, the information needed by all of the subsystems in question
> should be obtained and made available at the arch/platform level and
> everyone who needs it should be able to access it from there,
> including the cpufreq driver for the given platform if that's what it
> needs to do.
>
> BTW, cpuidle may need the information in question too, so why should
> it be provided via cpufreq rather than via cpuidle?

Right.
On 02-11-20, 12:01, Nicola Mazzucato wrote:
> Second proposal:
>
> Another option could be for each driver to store internally the performance
> dependencies and let the driver directly provide the correct cpumask for
> any consumer.

From the discussion that happened in this thread, it looks like this is
what we are going to do.

> Whilst this works fine for energy model (see above), it is not the case

Good.

> (currently) for cpufreq_cooling, thus we would need to add a new interface for

And this is not required for now, as we discussed.