[v3,3/4] KVM: X86: Add a capability to configure bus frequency for APIC timer

Message ID f393da364d3389f8e65c7fae3e5d9210ffe7a2db.1702974319.git.isaku.yamahata@intel.com (mailing list archive)
State New, archived
Series KVM: X86: Make bus clock frequency for vapic timer configurable

Commit Message

Isaku Yamahata Dec. 19, 2023, 8:34 a.m. UTC
Add KVM_CAP_X86_BUS_FREQUENCY_CONTROL capability to configure the core
crystal clock (or processor's bus clock) for APIC timer emulation.  Allow
KVM_ENABLE_CAP(KVM_CAP_X86_BUS_FREQUENCY_CONTROL) to set the
frequency.

The TDX architecture hard-codes the APIC bus frequency to 25MHz.  TDX
mandates that it be enumerated in CPUID leaf 0x15 and doesn't allow the VMM
to override the value.  The KVM APIC timer emulation hard-codes the
frequency to 1GHz.  KVM doesn't enumerate it to the guest unless the
userspace VMM sets CPUID leaf 0x15 via KVM_SET_CPUID.  If CPUID leaf 0x15
is enumerated, the guest kernel uses it as the APIC bus frequency.  If
not, the guest kernel measures the frequency against other known timers
such as the ACPI timer or the legacy PIT.  As a result, a TDX guest kernel
receives timer interrupts 40 times as often as it expects: (1GHz, the
frequency KVM uses) / (25MHz, the frequency TDX CPUID enumerates).  To
ensure that the guest doesn't have a conflicting view of the APIC bus
frequency, allow userspace to tell KVM to use the same frequency that TDX
mandates instead of the default 1GHz.
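
The mismatch described above can be checked with a few lines of arithmetic.
The following is only an illustrative sketch, not code from the patch: the
helper names are hypothetical, and the divide configuration is assumed to be
divide-by-1.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Length of one APIC bus cycle, in nanoseconds, for a bus clock of freq_hz. */
static uint64_t bus_cycle_ns(uint64_t freq_hz)
{
	return NSEC_PER_SEC / freq_hz;
}

/*
 * Time until the timer expires for an initial count of tmict, assuming
 * divide-by-1 (hypothetical helper, not the kernel's tmict_to_ns()).
 */
static uint64_t timer_period_ns(uint32_t tmict, uint64_t freq_hz)
{
	return (uint64_t)tmict * bus_cycle_ns(freq_hz);
}
```

A guest that believes the bus runs at 25MHz programs an initial count of
250000 to get a 10ms tick.  If the emulation counts down at 1GHz, the same
count expires after 250000ns, i.e. 40 times sooner.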

There are several options to address this.
1. Make KVM able to configure the APIC bus frequency (this patch).
   Pro: It resembles the existing hardware.  Recent Intel CPUs
        adopt 25MHz.
   Con: Requires the VMM to emulate the APIC timer at 25MHz.
2. Make the TDX architecture enumerate a configurable frequency in
   CPUID 0x15, or not enumerate it at all.
   Pro: Any APIC bus frequency is allowed.
   Con: Deviation from the real hardware.
3. Make the TDX guest kernel use 1GHz when it's running on KVM.
   Con: The kernel ignores CPUID leaf 0x15.
4. Change CPUID.15H under TDX to report the crystal clock frequency as
   1 GHz.
   Pro: This has been the virtual APIC frequency for KVM guests for 13
        years.
   Pro: This requires changing only one hard-coded constant in TDX.
   Con: It doesn't work with other VMMs as TDX isn't specific to KVM.

This patch doesn't affect the TSC deadline timer emulation.  The APIC timer
emulation path calculates the TSC value from the TMICT register value and
uses the TSC deadline timer path.  This patch touches only the APIC
timer-specific code.

[1] https://lore.kernel.org/lkml/20231006011255.4163884-1-vannapurve@google.com/
Reported-by: Vishal Annapurve <vannapurve@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

---
Changes v3:
- Added reviewed-by Maxim Levitsky.
- minor update of the commit message.

Changes v2:
- Add check if vcpu isn't created.
- Add check if lapic chip is in-kernel emulation.
- Fix build error for i386.
- Add document to api.rst.
- typo in the commit message.
---
 Documentation/virt/kvm/api.rst | 14 ++++++++++++++
 arch/x86/kvm/x86.c             | 33 +++++++++++++++++++++++++++++++++
 include/uapi/linux/kvm.h       |  1 +
 3 files changed, 48 insertions(+)

Comments

Xiaoyao Li Dec. 29, 2023, 4:41 a.m. UTC | #1
On 12/19/2023 4:34 PM, Isaku Yamahata wrote:
> +7.34 KVM_CAP_X86_BUS_FREQUENCY_CONTROL
> +--------------------------------------
> +
> +:Architectures: x86
> +:Target: VM
> +:Parameters: args[0] is the value of apic bus clock frequency
> +:Returns: 0 on success, -EINVAL if args[0] contains invalid value for the
> +          frequency, or -ENXIO if virtual local APIC isn't enabled by
> +          KVM_CREATE_IRQCHIP, or -EBUSY if any vcpu is created.
> +
> +This capability sets the APIC bus clock frequency (or core crystal clock
> +frequency) for kvm to emulate APIC in the kernel.  

Isaku,

you are mixing up the `bus clock` frequency and the `core crystal clock`
frequency. They are different.

- When CPUID 0x15 doesn't exist, or CPUID 0x15 doesn't enumerate core 
crystal clock frequency, the APIC timer frequency is the processor's bus 
clock.

- When CPUID 0x15 does enumerate the core crystal clock frequency, the 
APIC timer frequency is the core crystal clock frequency.

This patch only enables the user-configurable bus clock frequency, or 
specifically APIC timer frequency. It doesn't enable the configuration 
of core crystal clock frequency. Userspace can configure core crystal 
clock frequency by passing a valid CPUID 0x15 leaf into KVM_SET_CPUID2, 
not by this KVM_CAP.
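
The two cases above can be written down as a small selection function.  This
is a sketch for illustration only: the struct and function names are
hypothetical, and the register layout (EAX = ratio denominator, EBX = ratio
numerator, ECX = crystal frequency in Hz) is taken from the architectural
definition of CPUID leaf 0x15 rather than from this patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Register values returned by CPUID leaf 0x15 (hypothetical helper type). */
struct cpuid_15h {
	uint32_t eax;	/* denominator of the TSC / crystal clock ratio */
	uint32_t ebx;	/* numerator of the TSC / crystal clock ratio */
	uint32_t ecx;	/* core crystal clock in Hz, 0 if not enumerated */
};

/*
 * Frequency the guest will assume for the APIC timer: the core crystal
 * clock if leaf 0x15 exists and enumerates it, the bus clock otherwise.
 */
static uint64_t apic_timer_hz(uint32_t max_basic_leaf,
			      const struct cpuid_15h *leaf,
			      uint64_t bus_clock_hz)
{
	if (max_basic_leaf >= 0x15 && leaf != NULL && leaf->ecx != 0)
		return leaf->ecx;

	return bus_clock_hz;
}
```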

> The default value is 1000000
> +(1GHz).
Sean Christopherson Feb. 23, 2024, 7:33 p.m. UTC | #2
On Tue, Dec 19, 2023, Isaku Yamahata wrote:
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 7025b3751027..cc976df2651e 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -7858,6 +7858,20 @@ This capability is aimed to mitigate the threat that malicious VMs can
>  cause CPU stuck (due to event windows don't open up) and make the CPU
>  unavailable to host or other VMs.
>  
> +7.34 KVM_CAP_X86_BUS_FREQUENCY_CONTROL

BUS_FREQUENCY_CONTROL is simultaneously too long, yet not descriptive enough.
Depending on whether people get hung up on nanoseconds not being a "frequency",
either KVM_CAP_X86_APIC_BUS_FREQUENCY or KVM_CAP_X86_APIC_BUS_CYCLES_NS.

Also, this series needs to be rebased onto kvm-x86/next.

> +:Architectures: x86
> +:Target: VM
> +:Parameters: args[0] is the value of apic bus clock frequency
> +:Returns: 0 on success, -EINVAL if args[0] contains invalid value for the
> +          frequency, or -ENXIO if virtual local APIC isn't enabled by
> +          KVM_CREATE_IRQCHIP, or -EBUSY if any vcpu is created.
> +
> +This capability sets the APIC bus clock frequency (or core crystal clock
> +frequency) for kvm to emulate APIC in the kernel.  The default value is 1000000
> +(1GHz).

If we're going to add a capability, might as well make KVM's default value
discoverable.

This also needs to clarify the units.  Having to count the number of zeros in the
code to figure that out is ridiculous.

And as Xiaoyao pointed out, this is NOT the core crystal clock.  Though conversely, this
documentation should make it clear that setting CPUID 0x15 is userspace's problem.
E.g.

7.35 KVM_CAP_X86_APIC_BUS_FREQUENCY
-----------------------------------

:Architectures: x86
:Target: VM
:Parameters: args[0] is the desired APIC bus clock rate, in nanoseconds
:Returns: 0 on success, -EINVAL if args[0] contains invalid value for the
          frequency, or -ENXIO if virtual local APIC isn't enabled by
          KVM_CREATE_IRQCHIP, or -EBUSY if any vcpu is created.

This capability sets VM's APIC bus clock frequency, used by KVM's in-kernel
virtual APIC when emulating APIC timers.  KVM's default value can be retrieved
by KVM_CHECK_EXTENSION.

Note: Userspace is responsible for correctly configuring CPUID 0x15, a.k.a. the
core crystal clock frequency, if a non-zero CPUID 0x15 is exposed to the guest.

>  8. Other capabilities.
>  ======================
>  
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index d7d865f7c847..97f81d612366 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -4625,6 +4625,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>  	case KVM_CAP_ENABLE_CAP:
>  	case KVM_CAP_VM_DISABLE_NX_HUGE_PAGES:
>  	case KVM_CAP_IRQFD_RESAMPLE:
> +	case KVM_CAP_X86_BUS_FREQUENCY_CONTROL:
>  		r = 1;

And instead of returning '1', return APIC_BUS_CYCLE_NS_DEFAULT (which is amusingly
also '1', but there's no reason to rely on that, it's unnecessarily confusing).

>  		break;
>  	case KVM_CAP_EXIT_HYPERCALL:
> @@ -6616,6 +6617,38 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
>  		}
>  		mutex_unlock(&kvm->lock);
>  		break;
> +	case KVM_CAP_X86_BUS_FREQUENCY_CONTROL: {
> +		u64 bus_frequency = cap->args[0];
> +		u64 bus_cycle_ns;
> +
> +		if (!bus_frequency)
> +			return -EINVAL;
> +		/* CPUID[0x15] only support 32bits.  */

So?  This capability might exist to play nice with TDX forcing CPUID 0x15, but
that doesn't mean the capability is beholden to CPUID 0x15.

> +		if (bus_frequency != (u32)bus_frequency)
> +			return -EINVAL;
> +
> +		/* Cast to avoid 64bit division on 32bit platform. */
> +		bus_cycle_ns = 1000000000UL / (u32)bus_frequency;

Why take the userspace value as a frequency?  That will unnecessarily result in
loss of fidelity if 1000000000UL isn't cleanly divisible by bus_frequency, e.g.
reversing the math in the Hyper-V code will yield the "wrong" frequency.
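
A quick sketch of the rounding problem, using hypothetical helper names that
are not part of the patch:

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Same integer division as the patch: frequency in Hz to cycle length in ns. */
static uint64_t hz_to_cycle_ns(uint32_t freq_hz)
{
	return NSEC_PER_SEC / freq_hz;
}

/* Reverse conversion, as a consumer of the stored cycle length would do. */
static uint64_t cycle_ns_to_hz(uint64_t cycle_ns)
{
	return NSEC_PER_SEC / cycle_ns;
}
```

With these helpers, 25MHz round-trips exactly (40ns and back), but 300MHz
becomes 3ns and comes back as 333333333Hz, and anything above 1GHz truncates
to 0ns.  Taking the value in nanoseconds avoids the first conversion entirely.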

> +		if (!bus_cycle_ns)

This needs to guard against overflow in tmict_to_ns().  The max divide count is
14, I think?  Whatever this yields:

	apic->divide_count = 0x1 << (tmp2 & 0x7);

So from that, we can derive the max allowed bus_cycle_ns.
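
A rough sketch of that derivation follows.  Only the final shift is quoted
from the code above; the computation of tmp2 from the divide configuration
register (bits 0, 1 and 3 select the divisor) is an assumption, as is the
premise that tmict_to_ns() multiplies a 32-bit tmict by the cycle length and
the divide count in 64-bit arithmetic.  All function names are hypothetical.

```c
#include <stdint.h>

/* Decode the divide value from a 4-bit divide configuration value. */
static uint32_t apic_divide_count(uint32_t tdcr)
{
	uint32_t tmp1 = tdcr & 0xf;
	/* Assumed bit shuffling: gather bits 0, 1 and 3, then add one. */
	uint32_t tmp2 = ((tmp1 & 0x3) | ((tmp1 & 0x8) >> 1)) + 1;

	return 0x1u << (tmp2 & 0x7);
}

/* Largest divide value reachable over all 16 register encodings. */
static uint32_t apic_max_divide_count(void)
{
	uint32_t tdcr, max = 0;

	for (tdcr = 0; tdcr < 16; tdcr++) {
		uint32_t d = apic_divide_count(tdcr);

		if (d > max)
			max = d;
	}
	return max;
}

/*
 * Largest bus_cycle_ns for which tmict * bus_cycle_ns * divide_count
 * cannot overflow 64 bits, for any 32-bit tmict.
 */
static uint64_t max_bus_cycle_ns(void)
{
	return UINT64_MAX / ((uint64_t)UINT32_MAX * apic_max_divide_count());
}
```

Under these assumptions the largest divide value is 128 rather than 14, and
the resulting bound on bus_cycle_ns is 33554432.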

> +			return -EINVAL;

Use break, like literally every other case statement.  Burying a return in the
middle of this pile will result in breakage if kvm_vm_ioctl_enable_cap() ever
gains an epilogue.

> +
> +		r = 0;
> +		mutex_lock(&kvm->lock);
> +		/*
> +		 * Don't allow to change the frequency dynamically during vcpu
> +		 * running to avoid potentially bizarre behavior.
> +		 */
> +		if (kvm->created_vcpus)
> +			r = -EBUSY;

EINVAL, not EBUSY, because userspace can't magically uncreate vCPUs.

> +		/* This is for in-kernel vAPIC emulation. */

Meh, just drop the comment.  Same for the one above created_vcpus, it's fairly
self-explanatory.

> +		else if (!irqchip_in_kernel(kvm))
> +			r = -ENXIO;

This should go before created_vcpus, e.g. creating a vCPU shouldn't change the
error code.

> +
> +		if (!r)

Make this an else...

> +			kvm->arch.apic_bus_cycle_ns = bus_cycle_ns;

> +		mutex_unlock(&kvm->lock);
> +		return r;

		break;

Something like:

	case KVM_CAP_X86_APIC_BUS_FREQUENCY: {
		u64 bus_cycle_ns = cap->args[0];

		r = -EINVAL;
		if (!bus_cycle_ns || bus_cycle_ns > (whatever cause overflow))
			break;

		r = 0;
		mutex_lock(&kvm->lock);
		if (!irqchip_in_kernel(kvm))
			r = -ENXIO;
		else if (kvm->created_vcpus)
			r = -EINVAL;
		else
			kvm->arch.apic_bus_cycle_ns = bus_cycle_ns;
		mutex_unlock(&kvm->lock);
		break;
	}
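
The ordering of the checks can be exercised outside the kernel with a small
model.  This is a sketch only: struct vm_model and
model_set_apic_bus_cycle_ns() are hypothetical stand-ins, locking is omitted,
and the overflow bound is passed in by the caller because the suggestion
above leaves it unspecified.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the pieces of struct kvm that the check uses. */
struct vm_model {
	bool irqchip_in_kernel;
	unsigned int created_vcpus;
	uint64_t apic_bus_cycle_ns;
};

static int model_set_apic_bus_cycle_ns(struct vm_model *vm,
				       uint64_t bus_cycle_ns,
				       uint64_t max_bus_cycle_ns)
{
	/* Reject zero and anything above the caller-supplied bound. */
	if (!bus_cycle_ns || bus_cycle_ns > max_bus_cycle_ns)
		return -EINVAL;

	/* The kernel code would take kvm->lock around the checks below. */

	/* Checked first so that creating a vCPU doesn't change the error. */
	if (!vm->irqchip_in_kernel)
		return -ENXIO;

	/* Userspace can't undo vCPU creation, hence EINVAL and not EBUSY. */
	if (vm->created_vcpus)
		return -EINVAL;

	vm->apic_bus_cycle_ns = bus_cycle_ns;
	return 0;
}
```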
Isaku Yamahata March 8, 2024, 1:36 a.m. UTC | #3
On Fri, Feb 23, 2024 at 11:33:54AM -0800,
Sean Christopherson <seanjc@google.com> wrote:

> On Tue, Dec 19, 2023, Isaku Yamahata wrote:
> > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > index 7025b3751027..cc976df2651e 100644
> > --- a/Documentation/virt/kvm/api.rst
> > +++ b/Documentation/virt/kvm/api.rst
> > @@ -7858,6 +7858,20 @@ This capability is aimed to mitigate the threat that malicious VMs can
> >  cause CPU stuck (due to event windows don't open up) and make the CPU
> >  unavailable to host or other VMs.
> >  
> > +7.34 KVM_CAP_X86_BUS_FREQUENCY_CONTROL
> 
> BUS_FREQUENCY_CONTROL is simultaneously too long, yet not descriptive enough.
> Depending on whether people get hung up on nanoseconds not being a "frequency",
> either KVM_CAP_X86_APIC_BUS_FREQUENCY or KVM_CAP_X86_APIC_BUS_CYCLES_NS.
> 
> Also, this series needs to be rebased onto kvm-x86/next.

Thanks for the feedback and the concrete suggestions for the patch.
I agree with them and will incorporate them in the next respin.

Patch

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 7025b3751027..cc976df2651e 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -7858,6 +7858,20 @@  This capability is aimed to mitigate the threat that malicious VMs can
 cause CPU stuck (due to event windows don't open up) and make the CPU
 unavailable to host or other VMs.
 
+7.34 KVM_CAP_X86_BUS_FREQUENCY_CONTROL
+--------------------------------------
+
+:Architectures: x86
+:Target: VM
+:Parameters: args[0] is the value of apic bus clock frequency
+:Returns: 0 on success, -EINVAL if args[0] contains invalid value for the
+          frequency, or -ENXIO if virtual local APIC isn't enabled by
+          KVM_CREATE_IRQCHIP, or -EBUSY if any vcpu is created.
+
+This capability sets the APIC bus clock frequency (or core crystal clock
+frequency) for kvm to emulate APIC in the kernel.  The default value is 1000000
+(1GHz).
+
 8. Other capabilities.
 ======================
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d7d865f7c847..97f81d612366 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4625,6 +4625,7 @@  int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_ENABLE_CAP:
 	case KVM_CAP_VM_DISABLE_NX_HUGE_PAGES:
 	case KVM_CAP_IRQFD_RESAMPLE:
+	case KVM_CAP_X86_BUS_FREQUENCY_CONTROL:
 		r = 1;
 		break;
 	case KVM_CAP_EXIT_HYPERCALL:
@@ -6616,6 +6617,38 @@  int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
 		}
 		mutex_unlock(&kvm->lock);
 		break;
+	case KVM_CAP_X86_BUS_FREQUENCY_CONTROL: {
+		u64 bus_frequency = cap->args[0];
+		u64 bus_cycle_ns;
+
+		if (!bus_frequency)
+			return -EINVAL;
+		/* CPUID[0x15] only support 32bits.  */
+		if (bus_frequency != (u32)bus_frequency)
+			return -EINVAL;
+
+		/* Cast to avoid 64bit division on 32bit platform. */
+		bus_cycle_ns = 1000000000UL / (u32)bus_frequency;
+		if (!bus_cycle_ns)
+			return -EINVAL;
+
+		r = 0;
+		mutex_lock(&kvm->lock);
+		/*
+		 * Don't allow to change the frequency dynamically during vcpu
+		 * running to avoid potentially bizarre behavior.
+		 */
+		if (kvm->created_vcpus)
+			r = -EBUSY;
+		/* This is for in-kernel vAPIC emulation. */
+		else if (!irqchip_in_kernel(kvm))
+			r = -ENXIO;
+
+		if (!r)
+			kvm->arch.apic_bus_cycle_ns = bus_cycle_ns;
+		mutex_unlock(&kvm->lock);
+		return r;
+	}
 	default:
 		r = -EINVAL;
 		break;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 211b86de35ac..d74a057df173 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1201,6 +1201,7 @@  struct kvm_ppc_resize_hpt {
 #define KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE 228
 #define KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES 229
 #define KVM_CAP_ARM_SUPPORTED_REG_MASK_RANGES 230
+#define KVM_CAP_X86_BUS_FREQUENCY_CONTROL 231
 
 #ifdef KVM_CAP_IRQ_ROUTING