[1/2] KVM: VMX: remove kvm-intel.flexpriority module parameter
diff mbox series

Message ID 20181001212534.20073-2-sean.j.christopherson@intel.com
State New
Headers show
Series
  • KVM: VMX: clean up virtual APIC control handling
Related show

Commit Message

Sean Christopherson Oct. 1, 2018, 9:25 p.m. UTC
Many moons ago, commit 4c9fc8ef5017 ("KVM: VMX: Add module option to
disable flexpriority") introduced kvm-intel.flexpriority as it was
"Useful for debugging".  At that time, kvm-intel.flexpriority only
determined whether or not KVM would enable VIRTUALIZE_APIC_ACCESSES.
In short, it was intended as a way to disable virtualization of APIC
accesses for debug purposes.  Nowadays, kvm-intel.flexpriority is a
haphazard param that is inconsistently honored by KVM, e.g. it still
controls VIRTUALIZE_APIC_ACCESSES but not TPR_SHADOW, CR8-exiting or
VIRTUALIZE_X2APIC_MODE, and only for non-nested guests.  Disabling
the param also announces support for KVM_TPR_ACCESS_REPORTING (via
KVM_CAP_VAPIC), which may be functionally desirable, e.g. Qemu uses
KVM_TPR_ACCESS_REPORTING only to patch MMIO APIC access, but isn't
exactly accurate given its name since KVM may not intercept/report
TPR accesses via CR8 or MSR.

Remove kvm-intel.flexpriority as its usefulness for debug is dubious
given the current code base, while its existence is confusing and
can complicate code changes and/or lead to new bugs being introduced.
For example, as of commit 8d860bbeedef ("kvm: vmx: Basic APIC
virtualization controls have three settings"), KVM will disable
VIRTUALIZE_APIC_ACCESSES when a nested guest writes APIC_BASE MSR and
kvm-intel.flexpriority=0, whereas previously KVM would allow a nested
guest to enable VIRTUALIZE_APIC_ACCESSES so long as it's supported in
hardware.  I.e. KVM now advertises VIRTUALIZE_APIC_ACCESSES to a guest
but doesn't (always) allow setting it when kvm-intel.flexpriority=0,
and may even initially allow the control and then clear it when the
nested guest writes APIC_BASE MSR, which is decidedly odd even if it
doesn't cause functional issues.

Fixes: 8d860bbeedef ("kvm: vmx: Basic APIC virtualization controls have three settings")
Cc: Jim Mattson <jmattson@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 .../admin-guide/kernel-parameters.txt         |  4 ----
 arch/x86/kvm/vmx.c                            | 19 ++++---------------
 2 files changed, 4 insertions(+), 19 deletions(-)

Comments

Jim Mattson Oct. 1, 2018, 9:47 p.m. UTC | #1
On Mon, Oct 1, 2018 at 2:25 PM, Sean Christopherson
<sean.j.christopherson@intel.com> wrote:
> Many moons ago, commit 4c9fc8ef5017 ("KVM: VMX: Add module option to
> disable flexpriority") introduced kvm-intel.flexpriority as it was
> "Useful for debugging".  At that time, kvm-intel.flexpriority only
> determined whether or not KVM would enable VIRTUALIZE_APIC_ACCESSES.
> In short, it was intended as a way to disable virtualization of APIC
> accesses for debug purposes.  Nowadays, kvm-intel.flexpriority is a
> haphazard param that is inconsistently honored by KVM, e.g. it still
> controls VIRTUALIZE_APIC_ACCESSES but not TPR_SHADOW, CR8-exiting or
> VIRTUALIZE_X2APIC_MODE, and only for non-nested guests.  Disabling
> the param also announces support for KVM_TPR_ACCESS_REPORTING (via
> KVM_CAP_VAPIC), which may be functionally desirable, e.g. Qemu uses
> KVM_TPR_ACCESS_REPORTING only to patch MMIO APIC access, but isn't
> exactly accurate given its name since KVM may not intercept/report
> TPR accesses via CR8 or MSR.
>
> Remove kvm-intel.flexpriority as its usefulness for debug is dubious
> given the current code base, while its existence is confusing and
> can complicate code changes and/or lead to new bugs being introduced.
> For example, as of commit 8d860bbeedef ("kvm: vmx: Basic APIC
> virtualization controls have three settings"), KVM will disable
> VIRTUALIZE_APIC_ACCESSES when a nested guest writes APIC_BASE MSR and
> kvm-intel.flexpriority=0, whereas previously KVM would allow a nested
> guest to enable VIRTUALIZE_APIC_ACCESSES so long as it's supported in
> hardware.  I.e. KVM now advertises VIRTUALIZE_APIC_ACCESSES to a guest
> but doesn't (always) allow setting it when kvm-intel.flexpriority=0,
> and may even initially allow the control and then clear it when the
> nested guest writes APIC_BASE MSR, which is decidedly odd even if it
> doesn't cause functional issues.
>
> Fixes: 8d860bbeedef ("kvm: vmx: Basic APIC virtualization controls have three settings")
> Cc: Jim Mattson <jmattson@google.com>
> Cc: stable@vger.kernel.org
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>

I'm happy to see this go, but others may prefer to keep it and make it
work as one might expect it to (i.e. clearing the module parameter
should make kvm behave as if the host didn't advertise the feature).

Reviewed-by: Jim Mattson <jmattson@google.com>
Paolo Bonzini Oct. 3, 2018, 8:29 a.m. UTC | #2
On 01/10/2018 23:25, Sean Christopherson wrote:
> kvm-intel.flexpriority is a
> haphazard param that is inconsistently honored by KVM, e.g. it still
> controls VIRTUALIZE_APIC_ACCESSES but not TPR_SHADOW, CR8-exiting or
> VIRTUALIZE_X2APIC_MODE

This is on purpose: the original purpose of VIRTUALIZE_APIC_ACCESSES was
to speed up Windows XP and 2003, which didn't use CR8.  Processors
without it could already execute newer Windows versions without the
overhead of emulating TPR accesses (notice how VIRTUALIZE_APIC_ACCESSES
is a secondary execution control, while TPR_SHADOW is a primary control).

Likewise, VIRTUALIZE_X2APIC_MODE only deals with MSR reads and writes,
not MMIO accesses.  We could add a separate knob to turn it off, but I
never did it because the code paths are not as different as for
flexpriority or, say, !APICv or !vNMI.

> , and only for non-nested guests.

Not sure if this is a bug, but it is certainly cleaner if flexpriority=0
masks VIRTUALIZE_APIC_ACCESSES from nested guests too.

Paolo
Jim Mattson Oct. 3, 2018, 12:14 p.m. UTC | #3
On Wed, Oct 3, 2018 at 1:29 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> On 01/10/2018 23:25, Sean Christopherson wrote:
>> kvm-intel.flexpriority is a
>> haphazard param that is inconsistently honored by KVM, e.g. it still
>> controls VIRTUALIZE_APIC_ACCESSES but not TPR_SHADOW, CR8-exiting or
>> VIRTUALIZE_X2APIC_MODE
>
> This is on purpose: the original purpose of VIRTUALIZE_APIC_ACCESSES was
> to speed up Windows XP and 2003, which didn't use CR8.  Processors
> without it could already execute newer Windows versions without the
> overhead of emulating TPR accesses (notice how VIRTUALIZE_APIC_ACCESSES
> is a secondary execution control, while TPR_SHADOW is a primary control).

If the setting is only for specific guests, why is it a kernel
parameter rather than a per-VM capability?
Paolo Bonzini Oct. 3, 2018, 12:40 p.m. UTC | #4
On 03/10/2018 14:14, Jim Mattson wrote:
> On Wed, Oct 3, 2018 at 1:29 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>> On 01/10/2018 23:25, Sean Christopherson wrote:
>>> kvm-intel.flexpriority is a
>>> haphazard param that is inconsistently honored by KVM, e.g. it still
>>> controls VIRTUALIZE_APIC_ACCESSES but not TPR_SHADOW, CR8-exiting or
>>> VIRTUALIZE_X2APIC_MODE
>>
>> This is on purpose: the original purpose of VIRTUALIZE_APIC_ACCESSES was
>> to speed up Windows XP and 2003, which didn't use CR8.  Processors
>> without it could already execute newer Windows versions without the
>> overhead of emulating TPR accesses (notice how VIRTUALIZE_APIC_ACCESSES
>> is a secondary execution control, while TPR_SHADOW is a primary control).
> 
> If the setting is only for specific guests, why is it a kernel
> parameter rather than a per-VM capability?

Do you mean the flexpriority module parameter or VIRTUALIZE_APIC_ACCESSES?

VIRTUALIZE_APIC_ACCESSES does nothing on newer Windows guests, because
they use CR8 to write to the TPR.  However, it's also not a problem if
you enable it.

The flexpriority module parameter instead is a kernel parameter because
it's really only useful for debugging problems that only happen on old
machines (and to a lesser extent, problems that only happen on pre-AVIC
AMD machines).  Making it a per-VM capability would not be particularly
interesting, and it adds weight to the userspace ABI compatibility
(userspace ABI for example is the reason why we cannot drop userspace
irqchip support).  But since these machines still exist and people are
using KVM on them, we have to keep the debugging aid around.

Paolo
Jim Mattson Oct. 3, 2018, 4:36 p.m. UTC | #5
On Wed, Oct 3, 2018 at 5:40 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:

> The flexpriority module parameter instead is a kernel parameter because
> it's really only useful for debugging problems that only happen on old
> machines (and to a lesser extent, problems that only happen on pre-AVIC
> AMD machines).  Making it a per-VM capability would not be particularly
> interesting, and it adds weight to the userspace ABI compatibility
> (userspace ABI for example is the reason why we cannot drop userspace
> irqchip support).  But since these machines still exist and people are
> using KVM on them, we have to keep the debugging aid around.

It seems that the best way to mock up an older CPU that doesn't
support FlexPriority is to:
(a) intercept RDMSR of IA32_VMX_PROCBASED_CTLS2 and clear bit 0 of %rdx, and
(b) intercept VMWRITE to the secondary processor-based VM-execution
controls field and emulate VMfail when a 1 is written to bit 0.

Pushing the code higher up in the stack complicates things quite a
bit, and encourages bizarre behaviors like we have today, where, for
instance, FlexPriority support is still enumerated in the
IA32_VMX_PROCBASED_CTLS2 value synthesized for L1, and in fact, L1 can
still enable FlexPriority for L2. That certainly isn't consistent with
the behavior of old machines.
Paolo Bonzini Oct. 3, 2018, 4:54 p.m. UTC | #6
On 03/10/2018 18:36, Jim Mattson wrote:
> It seems that the best way to mock up an older CPU that doesn't
> support FlexPriority is to:
> (a) intercept RDMSR of IA32_VMX_PROCBASED_CTLS2 and clear bit 0 of %rdx, and
> (b) intercept VMWRITE to the secondary processor-based VM-execution
> controls field and emulate VMfail when a 1 is written to bit 0.

Not VMWRITE, but VMLAUNCH/VMRESUME I suppose?  The patch that I sent
this morning adds exactly these two things to the flexpriority parameter
(actually only the MSR has to be changed, because execution controls are
checked against the MSR values and changing them fixes (b) as well).

But it's not clear, are you suggesting to use nested virtualization
(using "intercept" makes me think of nested)?  That also introduces the
possible bugs (and hidden bugs) from nesting, so it is worse than the
module parameter.

> Pushing the code higher up in the stack complicates things quite a
> bit, and encourages bizarre behaviors like we have today, where, for
> instance, FlexPriority support is still enumerated in the
> IA32_VMX_PROCBASED_CTLS2 value synthesized for L1, and in fact, L1 can
> still enable FlexPriority for L2. That certainly isn't consistent with
> the behavior of old machines.

Yup, I agree.   In general we do have cases where we provide
functionality that is not there on older machines, but for flexpriority
the behavior is messy and nowhere near real hardware's.  In other cases
(enable_vnmi for example) it's possibly odd and we may want to change
that, but the severity is smaller.

Paolo
Jim Mattson Oct. 3, 2018, 5:39 p.m. UTC | #7
On Wed, Oct 3, 2018 at 9:54 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> On 03/10/2018 18:36, Jim Mattson wrote:
>> It seems that the best way to mock up an older CPU that doesn't
>> support FlexPriority is to:
>> (a) intercept RDMSR of IA32_VMX_PROCBASED_CTLS2 and clear bit 0 of %rdx, and
>> (b) intercept VMWRITE to the secondary processor-based VM-execution
>> controls field and emulate VMfail when a 1 is written to bit 0.
>
> Not VMWRITE, but VMLAUNCH/VMRESUME I suppose?  The patch that I sent
> this morning adds exactly these two things to the flexpriority parameter
> (actually only the MSR has to be changed, because execution controls are
> checked against the MSR values and changing them fixes (b) as well).

You're right, of course. VMLAUNCH/VMRESUME would be the place to check.

> But it's not clear, are you suggesting to use nested virtualization
> (using "intercept" makes me think of nested)?  That also introduces the
> possible bugs (and hidden bugs) from nesting, so it is worse than the
> module parameter.

No, not at all. I'm suggesting adopting a convention where all RDMSRs
of IA32_VMX_PROCBASED_CTLS2 are done through a single function that
can mock up an older CPU based on module parameters. Similarly for
VMLAUNCH/VMRESUME.
Paolo Bonzini Oct. 4, 2018, 11:59 a.m. UTC | #8
On 03/10/2018 19:39, Jim Mattson wrote:
> On Wed, Oct 3, 2018 at 9:54 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>> But it's not clear, are you suggesting to use nested virtualization
>> (using "intercept" makes me think of nested)?  That also introduces the
>> possible bugs (and hidden bugs) from nesting, so it is worse than the
>> module parameter.
> 
> No, not at all. I'm suggesting adopting a convention where all RDMSRs
> of IA32_VMX_PROCBASED_CTLS2 are done through a single function that
> can mock up an older CPU based on module parameters. Similarly for
> VMLAUNCH/VMRESUME.

Oh I see.  Yeah, that would be nice.

Paolo

Patch
diff mbox series

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 9871e649ffef..670f0b0c6806 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1983,10 +1983,6 @@ 
 			[KVM,Intel] Enable emulation of invalid guest states
 			Default is 0 (disabled)
 
-	kvm-intel.flexpriority=
-			[KVM,Intel] Disable FlexPriority feature (TPR shadow).
-			Default is 1 (enabled)
-
 	kvm-intel.nested=
 			[KVM,Intel] Enable VMX nesting (nVMX).
 			Default is 0 (disabled)
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 1d26f3c4985b..e1b8ea9a2bc4 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -78,9 +78,6 @@  module_param_named(vpid, enable_vpid, bool, 0444);
 static bool __read_mostly enable_vnmi = 1;
 module_param_named(vnmi, enable_vnmi, bool, S_IRUGO);
 
-static bool __read_mostly flexpriority_enabled = 1;
-module_param_named(flexpriority, flexpriority_enabled, bool, S_IRUGO);
-
 static bool __read_mostly enable_ept = 1;
 module_param_named(ept, enable_ept, bool, S_IRUGO);
 
@@ -1857,7 +1854,7 @@  static inline bool cpu_has_vmx_basic_inout(void)
 
 static inline bool cpu_need_virtualize_apic_accesses(struct kvm_vcpu *vcpu)
 {
-	return flexpriority_enabled && lapic_in_kernel(vcpu);
+	return cpu_has_vmx_flexpriority() && lapic_in_kernel(vcpu);
 }
 
 static inline bool cpu_has_vmx_vpid(void)
@@ -1924,11 +1921,6 @@  static bool vmx_umip_emulated(void)
 		SECONDARY_EXEC_DESC;
 }
 
-static inline bool report_flexpriority(void)
-{
-	return flexpriority_enabled;
-}
-
 static inline unsigned nested_cpu_vmx_misc_cr3_count(struct kvm_vcpu *vcpu)
 {
 	return vmx_misc_cr3_count(to_vmx(vcpu)->nested.msrs.misc_low);
@@ -7895,9 +7887,6 @@  static __init int hardware_setup(void)
 	if (!cpu_has_vmx_unrestricted_guest() || !enable_ept)
 		enable_unrestricted_guest = 0;
 
-	if (!cpu_has_vmx_flexpriority())
-		flexpriority_enabled = 0;
-
 	if (!cpu_has_virtual_nmis())
 		enable_vnmi = 0;
 
@@ -7906,7 +7895,7 @@  static __init int hardware_setup(void)
 	 * page upon invalidation.  No need to do anything if not
 	 * using the APIC_ACCESS_ADDR VMCS field.
 	 */
-	if (!flexpriority_enabled)
+	if (!cpu_has_vmx_flexpriority())
 		kvm_x86_ops->set_apic_access_page_addr = NULL;
 
 	if (!cpu_has_vmx_tpr_shadow())
@@ -10233,7 +10222,7 @@  static void vmx_set_virtual_apic_mode(struct kvm_vcpu *vcpu)
 	case LAPIC_MODE_DISABLED:
 		break;
 	case LAPIC_MODE_XAPIC:
-		if (flexpriority_enabled) {
+		if (cpu_has_vmx_flexpriority()) {
 			sec_exec_control |=
 				SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
 			vmx_flush_tlb(vcpu, true);
@@ -14007,7 +13996,7 @@  static struct kvm_x86_ops vmx_x86_ops __ro_after_init = {
 	.check_processor_compatibility = vmx_check_processor_compat,
 	.hardware_enable = hardware_enable,
 	.hardware_disable = hardware_disable,
-	.cpu_has_accelerated_tpr = report_flexpriority,
+	.cpu_has_accelerated_tpr = cpu_has_vmx_flexpriority,
 	.has_emulated_msr = vmx_has_emulated_msr,
 
 	.vm_init = vmx_vm_init,