Message ID | 20231111014933.1934562-2-davidai@google.com (mailing list archive) |
---|---|
State | Superseded, archived |
Series | Improve VM CPUfreq and task placement behavior |
On 10-11-23, 17:49, David Dai wrote:
> diff --git a/Documentation/devicetree/bindings/cpufreq/qemu,cpufreq-virtual.yaml b/Documentation/devicetree/bindings/cpufreq/qemu,cpufreq-virtual.yaml
> +$id: http://devicetree.org/schemas/cpufreq/qemu,cpufreq-virtual.yaml#
> +properties:
> +  compatible:
> +    const: qemu,virtual-cpufreq

Not sure why we need to mention QEMU here. Why limit this to just QEMU?
On Sat, 11 Nov 2023 01:49:29 +0000,
David Dai <davidai@google.com> wrote:
>
> Adding bindings to represent a virtual cpufreq device.
>
> Virtual machines may expose MMIO regions for a virtual cpufreq device
> for guests to read frequency information or to request frequency
> selection. The virtual cpufreq device has an individual controller for
> each frequency domain.

I would really refrain from having absolute frequencies here. A
virtual machine can be migrated, and there are *zero* guarantees that
the target system has the same clock range as the source.

This really should be a relative number, much like the capacity. That,
at least, can be migrated across systems.

Thanks,

	M.
On Wed, Nov 15, 2023 at 11:57:41AM +0530, Viresh Kumar wrote:
> On 10-11-23, 17:49, David Dai wrote:
> > diff --git a/Documentation/devicetree/bindings/cpufreq/qemu,cpufreq-virtual.yaml b/Documentation/devicetree/bindings/cpufreq/qemu,cpufreq-virtual.yaml
> > +$id: http://devicetree.org/schemas/cpufreq/qemu,cpufreq-virtual.yaml#
> > +properties:
> > +  compatible:
> > +    const: qemu,virtual-cpufreq
>
> Not sure why we need to mention QEMU here. Why limit this to just QEMU?

Because that is the implementation. And also to discourage anyone from
using this on their hardware or thinking it is generic and the way to
extend it to another implementation is adding properties.

Rob
On Wed, Nov 15, 2023 at 12:49 AM Marc Zyngier <maz@kernel.org> wrote:
>
> On Sat, 11 Nov 2023 01:49:29 +0000,
> David Dai <davidai@google.com> wrote:
> >
> > Adding bindings to represent a virtual cpufreq device.
> >
> > Virtual machines may expose MMIO regions for a virtual cpufreq device
> > for guests to read frequency information or to request frequency
> > selection. The virtual cpufreq device has an individual controller for
> > each frequency domain.
>
> I would really refrain from having absolute frequencies here. A
> virtual machine can be migrated, and there are *zero* guarantees that
> the target system has the same clock range as the source.
>
> This really should be a relative number, much like the capacity. That,
> at least, can be migrated across systems.

There's nothing in this patch that mandates absolute frequency.
In true KVM philosophy, we leave it to the VMM to decide.

-Saravana
On Thu, 07 Dec 2023 22:44:36 +0000,
Saravana Kannan <saravanak@google.com> wrote:
>
> On Wed, Nov 15, 2023 at 12:49 AM Marc Zyngier <maz@kernel.org> wrote:
> >
> > On Sat, 11 Nov 2023 01:49:29 +0000,
> > David Dai <davidai@google.com> wrote:
> > >
> > > Adding bindings to represent a virtual cpufreq device.
> > >
> > > Virtual machines may expose MMIO regions for a virtual cpufreq device
> > > for guests to read frequency information or to request frequency
> > > selection. The virtual cpufreq device has an individual controller for
> > > each frequency domain.
> >
> > I would really refrain from having absolute frequencies here. A
> > virtual machine can be migrated, and there are *zero* guarantees that
> > the target system has the same clock range as the source.
> >
> > This really should be a relative number, much like the capacity. That,
> > at least, can be migrated across systems.
>
> There's nothing in this patch that mandates absolute frequency.
> In true KVM philosophy, we leave it to the VMM to decide.

This has nothing to do with KVM. It would apply to any execution
environment, including QEMU in TCG mode.

To quote the original patch:

+    description:
+      Address and size of region containing frequency controls for each of the
+      frequency domains. Regions for each frequency domain is placed
+      contiguously and contain registers for controlling DVFS(Dynamic Frequency
+      and Voltage) characteristics. The size of the region is proportional to
+      total number of frequency domains.

What part of that indicates that *relative* frequencies are
acceptable? The example explicitly uses the opp-v2 binding, which
clearly is about absolute frequency.

To reiterate: absolute frequencies are not the right tool for the job,
and they should explicitly be described as relative in the spec. Not
left as a "whatev'" option for the execution environment to interpret.

	M.
On Fri, Nov 10, 2023 at 05:49:29PM -0800, David Dai wrote: > Adding bindings to represent a virtual cpufreq device. > > Virtual machines may expose MMIO regions for a virtual cpufreq device > for guests to read frequency information or to request frequency > selection. The virtual cpufreq device has an individual controller for > each frequency domain. > > Co-developed-by: Saravana Kannan <saravanak@google.com> > Signed-off-by: Saravana Kannan <saravanak@google.com> > Signed-off-by: David Dai <davidai@google.com> > --- > .../cpufreq/qemu,cpufreq-virtual.yaml | 99 +++++++++++++++++++ > 1 file changed, 99 insertions(+) > create mode 100644 Documentation/devicetree/bindings/cpufreq/qemu,cpufreq-virtual.yaml > > diff --git a/Documentation/devicetree/bindings/cpufreq/qemu,cpufreq-virtual.yaml b/Documentation/devicetree/bindings/cpufreq/qemu,cpufreq-virtual.yaml > new file mode 100644 > index 000000000000..16606cf1fd1a > --- /dev/null > +++ b/Documentation/devicetree/bindings/cpufreq/qemu,cpufreq-virtual.yaml > @@ -0,0 +1,99 @@ > +# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause > +%YAML 1.2 > +--- > +$id: http://devicetree.org/schemas/cpufreq/qemu,cpufreq-virtual.yaml# > +$schema: http://devicetree.org/meta-schemas/core.yaml# > + > +title: Virtual CPUFreq > + > +maintainers: > + - David Dai <davidai@google.com> > + - Saravana Kannan <saravanak@google.com> > + > +description: > + Virtual CPUFreq is a virtualized driver in guest kernels that sends frequency > + selection of its vCPUs as a hint to the host through MMIO regions. Each vCPU > + is associated with a frequency domain which can be shared with other vCPUs. > + Each frequency domain has its own set of registers for frequency controls. > + Are these register controls described somewhere ? The reason I ask is we should be able to have single implementation of this virtual cpufreq irrespective of the firmware used(DT vs ACPI) IMO. 
> +properties: > + compatible: > + const: qemu,virtual-cpufreq > + > + reg: > + maxItems: 1 > + description: > + Address and size of region containing frequency controls for each of the > + frequency domains. Regions for each frequency domain is placed > + contiugously and contain registers for controlling DVFS(Dynamic Frequency > + and Voltage) characteristics. The size of the region is proportional to > + total number of frequency domains. > + > +required: > + - compatible > + - reg > + > +additionalProperties: false > + > +examples: > + - | > + // This example shows a two CPU configuration with a frequency domain > + // for each CPU. > + cpus { > + #address-cells = <1>; > + #size-cells = <0>; > + > + cpu@0 { > + compatible = "arm,armv8"; > + device_type = "cpu"; > + reg = <0x0>; > + operating-points-v2 = <&opp_table0>; > + }; > + > + cpu@1 { > + compatible = "arm,armv8"; > + device_type = "cpu"; > + reg = <0x0>; > + operating-points-v2 = <&opp_table1>; > + }; > + }; > + > + opp_table0: opp-table-0 { > + compatible = "operating-points-v2"; > + opp-shared; > + > + opp1098000000 { > + opp-hz = /bits/ 64 <1098000000>; > + opp-level = <1>; > + }; > + > + opp1197000000 { > + opp-hz = /bits/ 64 <1197000000>; > + opp-level = <2>; > + }; > + }; > + > + opp_table1: opp-table-1 { > + compatible = "operating-points-v2"; > + opp-shared; > + > + opp1106000000 { > + opp-hz = /bits/ 64 <1106000000>; > + opp-level = <1>; > + }; > + > + opp1277000000 { > + opp-hz = /bits/ 64 <1277000000>; > + opp-level = <2>; > + }; > + }; > I think using OPP with absolute frequencies seems not appropriate here. Why can't these fetched from the registers and have some abstract values instead ? > + soc { > + #address-cells = <1>; > + #size-cells = <1>; > + > + cpufreq@1040000 { > + compatible = "qemu,virtual-cpufreq"; > + reg = <0x1040000 0x10>; So just 16bytes for 2 CPU system ? How does the register layout look like ? 
I assume just 4 x 32-bit registers: 2 for reading and 2 for setting the
frequencies?

--
Regards,
Sudeep
Sorry for the delay in response. Was very busy for a while and then holidays started. On Fri, Dec 8, 2023 at 12:52 AM Marc Zyngier <maz@kernel.org> wrote: > > On Thu, 07 Dec 2023 22:44:36 +0000, > Saravana Kannan <saravanak@google.com> wrote: > > > > On Wed, Nov 15, 2023 at 12:49 AM Marc Zyngier <maz@kernel.org> wrote: > > > > > > On Sat, 11 Nov 2023 01:49:29 +0000, > > > David Dai <davidai@google.com> wrote: > > > > > > > > Adding bindings to represent a virtual cpufreq device. > > > > > > > > Virtual machines may expose MMIO regions for a virtual cpufreq device > > > > for guests to read frequency information or to request frequency > > > > selection. The virtual cpufreq device has an individual controller for > > > > each frequency domain. > > > > > > I would really refrain form having absolute frequencies here. A > > > virtual machine can be migrated, and there are *zero* guarantees that > > > the target system has the same clock range as the source. > > > > > > This really should be a relative number, much like the capacity. That, > > > at least, can be migrated across systems. > > > > There's nothing in this patch that mandates absolute frequency. > > In true KVM philosophy, we leave it to the VMM to decide. > > This has nothing to do with KVM. It would apply to any execution > environment, including QEMU in TCG mode. > > To quote the original patch: > > + description: > + Address and size of region containing frequency controls for each of the > + frequency domains. Regions for each frequency domain is placed > + contiugously and contain registers for controlling DVFS(Dynamic Frequency > + and Voltage) characteristics. The size of the region is proportional to > + total number of frequency domains. > > What part of that indicates that *relative* frequencies are > acceptable? The example explicitly uses the opp-v2 binding, which > clearly is about absolute frequency. We can update the doc to make that clearer and update the example too. 
> To reiterate: absolute frequencies are not the right tool for the job,
> and they should explicitly be described as relative in the spec. Not
> left as a "whatev'" option for the execution environment to interpret.

I think it depends on the use case. If there's no plan to migrate the
VM across different devices, there's no need to do the unnecessary
normalization back and forth.

And if we can translate between pCPU frequency and a normalized
frequency, we can do the same for whatever made-up frequencies too. In
fact, we plan to do exactly that in our internal use cases for this.
There's nothing here that prevents the VMM from doing that.

Also, if there are hardware virtualized performance counters (AMU,
CPPC, etc.) that are used for frequency normalization, then we have to
use the real frequencies in those devices, otherwise the "current
frequency" can be 2 GHz while the max normalized frequency is 1024
kHz. That'll mess up load tracking.

Thanks,
Saravana
On Fri, Dec 8, 2023 at 4:47 AM Sudeep Holla <sudeep.holla@arm.com> wrote: > > On Fri, Nov 10, 2023 at 05:49:29PM -0800, David Dai wrote: > > Adding bindings to represent a virtual cpufreq device. > > > > Virtual machines may expose MMIO regions for a virtual cpufreq device > > for guests to read frequency information or to request frequency > > selection. The virtual cpufreq device has an individual controller for > > each frequency domain. > > > > Co-developed-by: Saravana Kannan <saravanak@google.com> > > Signed-off-by: Saravana Kannan <saravanak@google.com> > > Signed-off-by: David Dai <davidai@google.com> > > --- > > .../cpufreq/qemu,cpufreq-virtual.yaml | 99 +++++++++++++++++++ > > 1 file changed, 99 insertions(+) > > create mode 100644 Documentation/devicetree/bindings/cpufreq/qemu,cpufreq-virtual.yaml > > > > diff --git a/Documentation/devicetree/bindings/cpufreq/qemu,cpufreq-virtual.yaml b/Documentation/devicetree/bindings/cpufreq/qemu,cpufreq-virtual.yaml > > new file mode 100644 > > index 000000000000..16606cf1fd1a > > --- /dev/null > > +++ b/Documentation/devicetree/bindings/cpufreq/qemu,cpufreq-virtual.yaml > > @@ -0,0 +1,99 @@ > > +# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause > > +%YAML 1.2 > > +--- > > +$id: http://devicetree.org/schemas/cpufreq/qemu,cpufreq-virtual.yaml# > > +$schema: http://devicetree.org/meta-schemas/core.yaml# > > + > > +title: Virtual CPUFreq > > + > > +maintainers: > > + - David Dai <davidai@google.com> > > + - Saravana Kannan <saravanak@google.com> > > + > > +description: > > + Virtual CPUFreq is a virtualized driver in guest kernels that sends frequency > > + selection of its vCPUs as a hint to the host through MMIO regions. Each vCPU > > + is associated with a frequency domain which can be shared with other vCPUs. > > + Each frequency domain has its own set of registers for frequency controls. > > + > > Are these register controls described somewhere ? 
The reason I ask is we > should be able to have single implementation of this virtual cpufreq > irrespective of the firmware used(DT vs ACPI) IMO. Agree that we want the same driver to work for DT and ACPI. This doc was written to be similar to other DT binding docs that don't describe the registers in the DT binding. The registers are pretty straight forward (can tell from the code too). One register to "set" the frequency and another to "get" the current frequency. We don't have any ACPI expertise/hardware to test this on or care about it right now. But David looked at some ACPI drivers and we think it should be trivial to add ACPI support to this. Just a different set of probe functions to register and populate the CPUfreq table. > > > +properties: > > + compatible: > > + const: qemu,virtual-cpufreq > > + > > + reg: > > + maxItems: 1 > > + description: > > + Address and size of region containing frequency controls for each of the > > + frequency domains. Regions for each frequency domain is placed > > + contiugously and contain registers for controlling DVFS(Dynamic Frequency > > + and Voltage) characteristics. The size of the region is proportional to > > + total number of frequency domains. > > + > > +required: > > + - compatible > > + - reg > > + > > +additionalProperties: false > > + > > +examples: > > + - | > > + // This example shows a two CPU configuration with a frequency domain > > + // for each CPU. 
> > + cpus { > > + #address-cells = <1>; > > + #size-cells = <0>; > > + > > + cpu@0 { > > + compatible = "arm,armv8"; > > + device_type = "cpu"; > > + reg = <0x0>; > > + operating-points-v2 = <&opp_table0>; > > + }; > > + > > + cpu@1 { > > + compatible = "arm,armv8"; > > + device_type = "cpu"; > > + reg = <0x0>; > > + operating-points-v2 = <&opp_table1>; > > + }; > > + }; > > + > > + opp_table0: opp-table-0 { > > + compatible = "operating-points-v2"; > > + opp-shared; > > + > > + opp1098000000 { > > + opp-hz = /bits/ 64 <1098000000>; > > + opp-level = <1>; > > + }; > > + > > + opp1197000000 { > > + opp-hz = /bits/ 64 <1197000000>; > > + opp-level = <2>; > > + }; > > + }; > > + > > + opp_table1: opp-table-1 { > > + compatible = "operating-points-v2"; > > + opp-shared; > > + > > + opp1106000000 { > > + opp-hz = /bits/ 64 <1106000000>; > > + opp-level = <1>; > > + }; > > + > > + opp1277000000 { > > + opp-hz = /bits/ 64 <1277000000>; > > + opp-level = <2>; > > + }; > > + }; > > > > I think using OPP with absolute frequencies seems not appropriate here. > Why can't these fetched from the registers and have some abstract values > instead ? Whether the frequencies are real or you want to cap it to 1024 and normalize it, you still need to populate the CPUfreq table. And we didn't want to reinvent the wheel and want to use existing means of representing the table in as cross-architecture as possible -- so, DT and ACPI should cover them all. For example, if we want to say CPU0 and 1 for a single CPUfreq policy, that's all already doable in DT. We don't want to reinvent new schemes/register interfaces for that. > > > + soc { > > + #address-cells = <1>; > > + #size-cells = <1>; > > + > > + cpufreq@1040000 { > > + compatible = "qemu,virtual-cpufreq"; > > + reg = <0x1040000 0x10>; > > So just 16bytes for 2 CPU system ? How does the register layout look like ? > I assume just 4 x 32bit registers: 2 for reading and 2 for setting the > frequencies ? Yup. 
2 registers per CPU frequency domain or policy.

Thanks,
Saravana
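To make the register arithmetic in this exchange concrete, here is a minimal sketch, assuming 32-bit registers and two registers per frequency domain as confirmed above, with the per-domain regions placed contiguously as the binding describes. The order of the two registers inside a domain is not stated in the thread, so GET_FREQ_OFFSET, SET_FREQ_OFFSET and all function names below are hypothetical.

```python
REG_WIDTH = 4        # bytes; 32-bit registers, as assumed in the question above
REGS_PER_DOMAIN = 2  # one "get" and one "set" register per frequency domain

# Assumption: the thread does not say which register comes first inside a
# domain. This ordering is purely illustrative.
GET_FREQ_OFFSET = 0x0
SET_FREQ_OFFSET = 0x4


def domain_stride() -> int:
    """Bytes occupied by the registers of a single frequency domain."""
    return REG_WIDTH * REGS_PER_DOMAIN


def region_size(num_domains: int) -> int:
    """Size of the MMIO region; it grows linearly with the domain count."""
    if num_domains < 1:
        raise ValueError("at least one frequency domain is required")
    return num_domains * domain_stride()


def register_address(base: int, domain: int, reg_offset: int) -> int:
    """Address of one register, with domains laid out back to back."""
    if domain < 0:
        raise ValueError("domain index must be non-negative")
    if reg_offset not in (GET_FREQ_OFFSET, SET_FREQ_OFFSET):
        raise ValueError("unknown register offset")
    return base + domain * domain_stride() + reg_offset


if __name__ == "__main__":
    base = 0x1040000  # base address taken from the binding example
    print(hex(region_size(2)))
    for domain in range(2):
        print(domain,
              hex(register_address(base, domain, GET_FREQ_OFFSET)),
              hex(register_address(base, domain, SET_FREQ_OFFSET)))
```

With two domains this gives a 0x10-byte region, which matches the reg = <0x1040000 0x10> entry in the example.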
On Fri, 12 Jan 2024 22:02:39 +0000, Saravana Kannan <saravanak@google.com> wrote: > > Sorry for the delay in response. Was very busy for a while and then > holidays started. > > On Fri, Dec 8, 2023 at 12:52 AM Marc Zyngier <maz@kernel.org> wrote: > > > > On Thu, 07 Dec 2023 22:44:36 +0000, > > Saravana Kannan <saravanak@google.com> wrote: > > > > > > On Wed, Nov 15, 2023 at 12:49 AM Marc Zyngier <maz@kernel.org> wrote: > > > > > > > > On Sat, 11 Nov 2023 01:49:29 +0000, > > > > David Dai <davidai@google.com> wrote: > > > > > > > > > > Adding bindings to represent a virtual cpufreq device. > > > > > > > > > > Virtual machines may expose MMIO regions for a virtual cpufreq device > > > > > for guests to read frequency information or to request frequency > > > > > selection. The virtual cpufreq device has an individual controller for > > > > > each frequency domain. > > > > > > > > I would really refrain form having absolute frequencies here. A > > > > virtual machine can be migrated, and there are *zero* guarantees that > > > > the target system has the same clock range as the source. > > > > > > > > This really should be a relative number, much like the capacity. That, > > > > at least, can be migrated across systems. > > > > > > There's nothing in this patch that mandates absolute frequency. > > > In true KVM philosophy, we leave it to the VMM to decide. > > > > This has nothing to do with KVM. It would apply to any execution > > environment, including QEMU in TCG mode. > > > > To quote the original patch: > > > > + description: > > + Address and size of region containing frequency controls for each of the > > + frequency domains. Regions for each frequency domain is placed > > + contiugously and contain registers for controlling DVFS(Dynamic Frequency > > + and Voltage) characteristics. The size of the region is proportional to > > + total number of frequency domains. > > > > What part of that indicates that *relative* frequencies are > > acceptable? 
The example explicitly uses the opp-v2 binding, which
> > clearly is about absolute frequency.
>
> We can update the doc to make that clearer and update the example too.
>
> > To reiterate: absolute frequencies are not the right tool for the job,
> > and they should explicitly be described as relative in the spec. Not
> > left as a "whatev'" option for the execution environment to interpret.
>
> I think it depends on the use case. If there's no plan to migrate the
> VM across different devices, there's no need to do the unnecessary
> normalization back and forth.

VM migration is a given, especially when QEMU is involved. Designing
something that doesn't support it is a bug, plain and simple.

> And if we can translate between pCPU frequency and a normalized
> frequency, we can do the same for whatever made up frequencies too. In
> fact, we plan to do exactly that in our internal use cases for this.
> There's nothing here that prevents the VMM from doing that.
>
> Also, if there are hardware virtualized performance counters (AMU,
> CPPC, etc) that are used for frequency normalization, then we have to
> use the real frequencies in those devices otherwise the "current
> frequency" can be 2 GHz while the max normalized frequency is 1024
> KHz. That'll mess up load tracking.

And that's exactly why this shouldn't be a *frequency*, but a
performance scale or some other unit-less coefficient. Just like the
big-little capacity.

	M.
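One way to picture the unit-less coefficient requested here is the kind of scaling the scheduler applies to CPU capacity, where the largest value maps to 1024. The sketch below is only an illustration under that assumption: the helper names are hypothetical, normalizing per frequency domain is an arbitrary choice, and the inputs are the opp-hz entries from the binding example.

```python
SCALE = 1024  # same fixed-point scale Linux uses for CPU capacity


def normalize(freq_hz: int, max_freq_hz: int) -> int:
    """Map an absolute frequency onto a 0..SCALE relative value."""
    if max_freq_hz <= 0:
        raise ValueError("max_freq_hz must be positive")
    if not 0 <= freq_hz <= max_freq_hz:
        raise ValueError("freq_hz must lie within 0..max_freq_hz")
    return freq_hz * SCALE // max_freq_hz


def normalize_table(opps_hz: list[int]) -> list[int]:
    """Normalize every operating point against the highest one in the table."""
    top = max(opps_hz)
    return [normalize(f, top) for f in opps_hz]


if __name__ == "__main__":
    # opp-hz values of opp_table0 and opp_table1 from the binding example
    print(normalize_table([1_098_000_000, 1_197_000_000]))
    print(normalize_table([1_106_000_000, 1_277_000_000]))
```

Because only ratios are exposed, a host whose domain tops out at a different clock still presents the same top value of 1024 to the guest, which is the property that makes such a table survive migration.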
On 11/11/2023 01:49, David Dai wrote: > Adding bindings to represent a virtual cpufreq device. > > Virtual machines may expose MMIO regions for a virtual cpufreq device > for guests to read frequency information or to request frequency > selection. The virtual cpufreq device has an individual controller for > each frequency domain. > > Co-developed-by: Saravana Kannan <saravanak@google.com> > Signed-off-by: Saravana Kannan <saravanak@google.com> > Signed-off-by: David Dai <davidai@google.com> > --- > .../cpufreq/qemu,cpufreq-virtual.yaml | 99 +++++++++++++++++++ > 1 file changed, 99 insertions(+) > create mode 100644 Documentation/devicetree/bindings/cpufreq/qemu,cpufreq-virtual.yaml > > diff --git a/Documentation/devicetree/bindings/cpufreq/qemu,cpufreq-virtual.yaml b/Documentation/devicetree/bindings/cpufreq/qemu,cpufreq-virtual.yaml > new file mode 100644 > index 000000000000..16606cf1fd1a > --- /dev/null > +++ b/Documentation/devicetree/bindings/cpufreq/qemu,cpufreq-virtual.yaml > @@ -0,0 +1,99 @@ > +# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause > +%YAML 1.2 > +--- > +$id: http://devicetree.org/schemas/cpufreq/qemu,cpufreq-virtual.yaml# > +$schema: http://devicetree.org/meta-schemas/core.yaml# > + > +title: Virtual CPUFreq > + > +maintainers: > + - David Dai <davidai@google.com> > + - Saravana Kannan <saravanak@google.com> > + > +description: > + Virtual CPUFreq is a virtualized driver in guest kernels that sends frequency > + selection of its vCPUs as a hint to the host through MMIO regions. Each vCPU > + is associated with a frequency domain which can be shared with other vCPUs. > + Each frequency domain has its own set of registers for frequency controls. > + > +properties: > + compatible: > + const: qemu,virtual-cpufreq > + > + reg: > + maxItems: 1 > + description: > + Address and size of region containing frequency controls for each of the > + frequency domains. 
Regions for each frequency domain is placed > + contiugously and contain registers for controlling DVFS(Dynamic Frequency > + and Voltage) characteristics. The size of the region is proportional to > + total number of frequency domains. > + > +required: > + - compatible > + - reg > + > +additionalProperties: false > + > +examples: > + - | > + // This example shows a two CPU configuration with a frequency domain > + // for each CPU. > + cpus { > + #address-cells = <1>; > + #size-cells = <0>; > + > + cpu@0 { > + compatible = "arm,armv8"; > + device_type = "cpu"; > + reg = <0x0>; > + operating-points-v2 = <&opp_table0>; > + }; > + > + cpu@1 { > + compatible = "arm,armv8"; > + device_type = "cpu"; > + reg = <0x0>; > + operating-points-v2 = <&opp_table1>; > + }; > + }; > + > + opp_table0: opp-table-0 { > + compatible = "operating-points-v2"; > + opp-shared; > + > + opp1098000000 { > + opp-hz = /bits/ 64 <1098000000>; > + opp-level = <1>; > + }; > + > + opp1197000000 { > + opp-hz = /bits/ 64 <1197000000>; > + opp-level = <2>; > + }; > + }; > + > + opp_table1: opp-table-1 { > + compatible = "operating-points-v2"; > + opp-shared; > + > + opp1106000000 { > + opp-hz = /bits/ 64 <1106000000>; > + opp-level = <1>; > + }; > + > + opp1277000000 { > + opp-hz = /bits/ 64 <1277000000>; > + opp-level = <2>; > + }; > + }; NIT: If my understanding is correct, it might be worth re-iterating that these OPPs should mirror the host frequency domain this vCPU is pinned to. Also, since VM migration has been mentioned elsewhere in this thread, am I right in saying that you can't change these OPPs after registration? So, even if one wants to migrate, one has to migrate to an SoC with the same frequency domains anyway, otherwise the OPPs in the VM are entirely bogus? > + > + soc { > + #address-cells = <1>; > + #size-cells = <1>; > + > + cpufreq@1040000 { > + compatible = "qemu,virtual-cpufreq"; > + reg = <0x1040000 0x10>; > + }; > + };
On Sat, Jan 13, 2024 at 1:37 AM Marc Zyngier <maz@kernel.org> wrote: > > On Fri, 12 Jan 2024 22:02:39 +0000, > Saravana Kannan <saravanak@google.com> wrote: > > > > Sorry for the delay in response. Was very busy for a while and then > > holidays started. > > > > On Fri, Dec 8, 2023 at 12:52 AM Marc Zyngier <maz@kernel.org> wrote: > > > > > > On Thu, 07 Dec 2023 22:44:36 +0000, > > > Saravana Kannan <saravanak@google.com> wrote: > > > > > > > > On Wed, Nov 15, 2023 at 12:49 AM Marc Zyngier <maz@kernel.org> wrote: > > > > > > > > > > On Sat, 11 Nov 2023 01:49:29 +0000, > > > > > David Dai <davidai@google.com> wrote: > > > > > > > > > > > > Adding bindings to represent a virtual cpufreq device. > > > > > > > > > > > > Virtual machines may expose MMIO regions for a virtual cpufreq device > > > > > > for guests to read frequency information or to request frequency > > > > > > selection. The virtual cpufreq device has an individual controller for > > > > > > each frequency domain. > > > > > > > > > > I would really refrain form having absolute frequencies here. A > > > > > virtual machine can be migrated, and there are *zero* guarantees that > > > > > the target system has the same clock range as the source. > > > > > > > > > > This really should be a relative number, much like the capacity. That, > > > > > at least, can be migrated across systems. > > > > > > > > There's nothing in this patch that mandates absolute frequency. > > > > In true KVM philosophy, we leave it to the VMM to decide. > > > > > > This has nothing to do with KVM. It would apply to any execution > > > environment, including QEMU in TCG mode. > > > > > > To quote the original patch: > > > > > > + description: > > > + Address and size of region containing frequency controls for each of the > > > + frequency domains. Regions for each frequency domain is placed > > > + contiugously and contain registers for controlling DVFS(Dynamic Frequency > > > + and Voltage) characteristics. 
The size of the region is proportional to
> > > +      total number of frequency domains.
> > >
> > > What part of that indicates that *relative* frequencies are
> > > acceptable? The example explicitly uses the opp-v2 binding, which
> > > clearly is about absolute frequency.
> >
> > We can update the doc to make that clearer and update the example too.
> >
> > > To reiterate: absolute frequencies are not the right tool for the job,
> > > and they should explicitly be described as relative in the spec. Not
> > > left as a "whatev'" option for the execution environment to interpret.
> >
> > I think it depends on the use case. If there's no plan to migrate the
> > VM across different devices, there's no need to do the unnecessary
> > normalization back and forth.
>
> VM migration is a given, specially when QEMU is involved. Designing
> something that doesn't support it is a bug, plain and simple.

I'm not saying this patch series doesn't allow migrating. I'm just
pointing out that normalization might not always be worth it for a VMM
to do.

We'll update the example and documentation to use normalized values.
CPUfreq itself uses kHz for the tables and we can't really change that
when we are emulating CPU freq scaling. Same goes for OPP table DT
property names. But the values we use can be 0 to 1024 Hz and be
normalized.

> > And if we can translate between pCPU frequency and a normalized
> > frequency, we can do the same for whatever made up frequencies too. In
> > fact, we plan to do exactly that in our internal use cases for this.
> > There's nothing here that prevents the VMM from doing that.
> >
> > Also, if there are hardware virtualized performance counters (AMU,
> > CPPC, etc) that are used for frequency normalization, then we have to
> > use the real frequencies in those devices otherwise the "current
> > frequency" can be 2 GHz while the max normalized frequency is 1024
> > KHz. That'll mess up load tracking.
>
> And that's exactly why this shouldn't be a *frequency*, but a
> performance scale or some other unit-less coefficient. Just like the
> big-little capacity.

Sorry, I wasn't clear in my explanation. Let me explain it better.

When performance counters are available, the scheduler uses them to
compute the current average CPU frequency over a scheduler tick. It
looks at the performance counters to figure out how many CPU cycles
have gone by and how much time has gone by, and does (delta cycles /
delta seconds) to compute the current CPU frequency in Hz.

So, when a HW virtualized performance counter is available, and the
scheduler inside the VM uses it to compute the average CPU frequency,
the resulting frequency is going to be the real physical CPU
frequency. The CPU frequency value the scheduler uses to compute the
PELT load will then be completely different from the normalized
values, and the load tracking will be completely wrong.

Using these performance counters inside the VM avoids a lot of MMIO
access exits to the host or memory accesses. So it makes sense that a
VM might want to use them. I was just trying to say that in that
situation, a VMM might have a good reason to just use the real
frequencies inside the VM too.

In any case, we can update the doc to use normalized values/examples
and call out this caveat.

Thanks,
Saravana
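The arithmetic described in this reply can be sketched as follows. This is a simplified model, not the kernel's implementation: the function names are hypothetical, and the 4 ms window and cycle count are made-up inputs chosen to reproduce the 2 GHz versus 1024 kHz mismatch mentioned earlier in the thread.

```python
NSEC_PER_SEC = 1_000_000_000
SCALE = 1024  # fixed-point scale used for relative frequency/capacity values


def average_freq_hz(delta_cycles: int, delta_ns: int) -> int:
    """Average frequency over a window: delta cycles / delta seconds."""
    if delta_ns <= 0:
        raise ValueError("delta_ns must be positive")
    return delta_cycles * NSEC_PER_SEC // delta_ns


def freq_scale(cur_freq_hz: int, max_freq_hz: int) -> int:
    """Current frequency relative to the table maximum, on a 0..SCALE range
    when both values use the same unit system."""
    if max_freq_hz <= 0:
        raise ValueError("max_freq_hz must be positive")
    return cur_freq_hz * SCALE // max_freq_hz


if __name__ == "__main__":
    # Counter-derived value: 8,000,000 cycles in a 4 ms window.
    cur = average_freq_hz(8_000_000, 4_000_000)
    print(cur)

    # Table maximum expressed as a real frequency: the ratio stays in range.
    print(freq_scale(cur, 2_000_000_000))

    # Table maximum expressed as a normalized "1024 kHz": the ratio is far
    # outside 0..SCALE, which is the load-tracking problem described above.
    print(freq_scale(cur, 1_024_000))
```

The last call shows why counter-derived frequencies and a normalized table cannot be mixed without an explicit conversion between the two.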
diff --git a/Documentation/devicetree/bindings/cpufreq/qemu,cpufreq-virtual.yaml b/Documentation/devicetree/bindings/cpufreq/qemu,cpufreq-virtual.yaml
new file mode 100644
index 000000000000..16606cf1fd1a
--- /dev/null
+++ b/Documentation/devicetree/bindings/cpufreq/qemu,cpufreq-virtual.yaml
@@ -0,0 +1,99 @@
+# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/cpufreq/qemu,cpufreq-virtual.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Virtual CPUFreq
+
+maintainers:
+  - David Dai <davidai@google.com>
+  - Saravana Kannan <saravanak@google.com>
+
+description:
+  Virtual CPUFreq is a virtualized driver in guest kernels that sends frequency
+  selection of its vCPUs as a hint to the host through MMIO regions. Each vCPU
+  is associated with a frequency domain which can be shared with other vCPUs.
+  Each frequency domain has its own set of registers for frequency controls.
+
+properties:
+  compatible:
+    const: qemu,virtual-cpufreq
+
+  reg:
+    maxItems: 1
+    description:
+      Address and size of region containing frequency controls for each of the
+      frequency domains. Regions for each frequency domain are placed
+      contiguously and contain registers for controlling DVFS (Dynamic Voltage
+      and Frequency Scaling) characteristics. The size of the region is
+      proportional to the total number of frequency domains.
+
+required:
+  - compatible
+  - reg
+
+additionalProperties: false
+
+examples:
+  - |
+    // This example shows a two CPU configuration with a frequency domain
+    // for each CPU.
+    cpus {
+        #address-cells = <1>;
+        #size-cells = <0>;
+
+        cpu@0 {
+            compatible = "arm,armv8";
+            device_type = "cpu";
+            reg = <0x0>;
+            operating-points-v2 = <&opp_table0>;
+        };
+
+        cpu@1 {
+            compatible = "arm,armv8";
+            device_type = "cpu";
+            reg = <0x1>;
+            operating-points-v2 = <&opp_table1>;
+        };
+    };
+
+    opp_table0: opp-table-0 {
+        compatible = "operating-points-v2";
+        opp-shared;
+
+        opp1098000000 {
+            opp-hz = /bits/ 64 <1098000000>;
+            opp-level = <1>;
+        };
+
+        opp1197000000 {
+            opp-hz = /bits/ 64 <1197000000>;
+            opp-level = <2>;
+        };
+    };
+
+    opp_table1: opp-table-1 {
+        compatible = "operating-points-v2";
+        opp-shared;
+
+        opp1106000000 {
+            opp-hz = /bits/ 64 <1106000000>;
+            opp-level = <1>;
+        };
+
+        opp1277000000 {
+            opp-hz = /bits/ 64 <1277000000>;
+            opp-level = <2>;
+        };
+    };
+
+    soc {
+        #address-cells = <1>;
+        #size-cells = <1>;
+
+        cpufreq@1040000 {
+            compatible = "qemu,virtual-cpufreq";
+            reg = <0x1040000 0x10>;
+        };
+    };