
[1/7] KVM: nVMX: Intercept VMWRITEs to read-only shadow VMCS fields

Message ID 20190507153629.3681-2-sean.j.christopherson@intel.com (mailing list archive)
State New, archived
Series KVM: nVMX: Optimize VMCS data copying

Commit Message

Sean Christopherson May 7, 2019, 3:36 p.m. UTC
Allowing L1 to VMWRITE read-only fields is only beneficial in a double
nesting scenario, e.g. no sane VMM will VMWRITE VM_EXIT_REASON in normal
non-nested operation.  Intercepting RO fields means KVM doesn't need to
sync them from the shadow VMCS to vmcs12 when running L2.  The obvious
downside is that L1 will VM-Exit more often when running L3, but it's
likely safe to assume most folks would happily sacrifice a bit of L3
performance, which may not even be noticeable in the grand scheme, to
improve L2 performance across the board.

Not intercepting fields tagged read-only also allows for additional
optimizations, e.g. marking GUEST_{CS,SS}_AR_BYTES as SHADOW_FIELD_RO
since those fields are rarely written by VMMs, but read frequently.
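
For reference, the SHADOW_FIELD_RO/RW tagging lives in the X-macro header
arch/x86/kvm/vmx/vmcs_shadow_fields.h.  A rough sketch of what tagging the
AR_BYTES fields read-only could look like (the field list below is
illustrative, not the complete set):

#ifndef SHADOW_FIELD_RO
#define SHADOW_FIELD_RO(x)
#endif
#ifndef SHADOW_FIELD_RW
#define SHADOW_FIELD_RW(x)
#endif

/* Read-only for L1: VMREAD passes through, VMWRITE still exits. */
SHADOW_FIELD_RO(VM_EXIT_REASON)
SHADOW_FIELD_RO(GUEST_CS_AR_BYTES)
SHADOW_FIELD_RO(GUEST_SS_AR_BYTES)

/* Read-write for L1: neither VMREAD nor VMWRITE exits. */
SHADOW_FIELD_RW(GUEST_RIP)
SHADOW_FIELD_RW(GUEST_INTERRUPTIBILITY_INFO)

#undef SHADOW_FIELD_RO
#undef SHADOW_FIELD_RW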

When utilizing a shadow VMCS with asymmetric R/W and R/O bitmaps, fields
that cause VM-Exit on VMWRITE but not VMREAD need to be propagated to
the shadow VMCS during VMWRITE emulation, otherwise a subsequent VMREAD
from L1 will consume a stale value.

Note, KVM currently utilizes asymmetric bitmaps when "VMWRITE any field"
is not exposed to L1, but only so that it can reject the VMWRITE, i.e.
propagating the VMWRITE to the shadow VMCS is a new requirement, not a
bug fix.
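
For illustration, a simplified sketch of how the asymmetric bitmaps could
be built (loosely modeled on KVM's init_vmcs_shadow_fields(); a set bit
means "intercept", and the 64-bit field handling and sanity checks are
omitted):

static void init_shadow_bitmaps_sketch(void)
{
	int i;

	/* Default: intercept VMREAD and VMWRITE for every field. */
	memset(vmx_vmread_bitmap, 0xff, PAGE_SIZE);
	memset(vmx_vmwrite_bitmap, 0xff, PAGE_SIZE);

	/* SHADOW_FIELD_RO: let VMREAD through, keep exiting on VMWRITE. */
	for (i = 0; i < max_shadow_read_only_fields; i++)
		clear_bit(shadow_read_only_fields[i], vmx_vmread_bitmap);

	/* SHADOW_FIELD_RW: let both VMREAD and VMWRITE through. */
	for (i = 0; i < max_shadow_read_write_fields; i++) {
		clear_bit(shadow_read_write_fields[i], vmx_vmread_bitmap);
		clear_bit(shadow_read_write_fields[i], vmx_vmwrite_bitmap);
	}
}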

Eliminating the copying of RO fields reduces the latency of nested
VM-Entry (copy_shadow_to_vmcs12()) by ~100 cycles (plus 40-50 cycles
if/when the AR_BYTES fields are exposed RO).
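
For context, cycle numbers like the above can be gathered with a throwaway
rdtsc() pair around the copy; a purely illustrative sketch, not part of
this patch:

static u64 time_copy_shadow_to_vmcs12(struct vcpu_vmx *vmx)
{
	u64 start = rdtsc();

	copy_shadow_to_vmcs12(vmx);

	return rdtsc() - start;	/* elapsed TSC cycles, unserialized */
}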

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 arch/x86/kvm/vmx/nested.c | 72 +++++++++++++++++----------------------
 1 file changed, 31 insertions(+), 41 deletions(-)

Comments

Paolo Bonzini June 6, 2019, 1:26 p.m. UTC | #1
On 07/05/19 17:36, Sean Christopherson wrote:
> Note, "writable" in this context means
> + * "writable by the guest", i.e. tagged SHADOW_FIELD_RW.  Note #2, the set of
> + * fields tagged SHADOW_FIELD_RO may or may not align with the "read-only"
> + * VM-exit information fields (which are actually writable if the vCPU is
> + * configured to support "VMWRITE to any supported field in the VMCS").


"There's a few more pages of P.S.'s here"
(https://youtu.be/rKlrTJ7WN18?t=170)

Paolo
Jim Mattson June 13, 2019, 5:02 p.m. UTC | #2
On Tue, May 7, 2019 at 8:36 AM Sean Christopherson
<sean.j.christopherson@intel.com> wrote:

> Not intercepting fields tagged read-only also allows for additional
> optimizations, e.g. marking GUEST_{CS,SS}_AR_BYTES as SHADOW_FIELD_RO
> since those fields are rarely written by VMMs, but read frequently.

Do you have data to support this, or is this just a gut feeling? The
last time I looked at Virtual Box (which was admittedly a long time
ago), it liked to read and write just about every VMCS guest-state
field it could find on every VM-exit.

The decision of which fields to shadow is really something that should
be done dynamically, depending on the behavior of the guest hypervisor
(which may vary depending on the L2 guest it's running!). Making the
decision statically is bound to result in a poor outcome for some
scenarios.

When I measured this several years ago, taking one VM-exit for a
VMREAD or VMWRITE was more expensive than needlessly shadowing it
~35-40 times.
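
An entirely hypothetical sketch of that dynamic approach, e.g. counting
VMREAD/VMWRITE exits per field and promoting hot fields into the shadow
set once they cross a threshold (none of these helpers exist in KVM
today):

#define SHADOW_PROMOTE_THRESHOLD	64		/* arbitrary */

static u32 field_exit_count[MAX_TRACKED_FIELDS];	/* hypothetical */

static void note_vmread_vmwrite_exit(struct kvm_vcpu *vcpu, u16 field)
{
	u32 idx = field_to_index(field);		/* hypothetical */

	if (++field_exit_count[idx] == SHADOW_PROMOTE_THRESHOLD)
		promote_field_to_shadow(vcpu, field);	/* hypothetical */
}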
Paolo Bonzini June 13, 2019, 5:18 p.m. UTC | #3
On 13/06/19 19:02, Jim Mattson wrote:
> On Tue, May 7, 2019 at 8:36 AM Sean Christopherson
> <sean.j.christopherson@intel.com> wrote:
> 
>> Not intercepting fields tagged read-only also allows for additional
>> optimizations, e.g. marking GUEST_{CS,SS}_AR_BYTES as SHADOW_FIELD_RO
>> since those fields are rarely written by VMMs, but read frequently.
> 
> Do you have data to support this, or is this just a gut feeling? The
> last time I looked at Virtual Box (which was admittedly a long time
> ago), it liked to read and write just about every VMCS guest-state
> field it could find on every VM-exit.

I have never looked at VirtualBox, but most other hypervisors do have a
common set of fields (give or take a couple) that they like to read
and/or write on most if not every vmexit.

Also, while this may vary dynamically based on the L2 guest that is
running, this is much less true for unrestricted-guest processors.
Without data on _which_ scenarios are bad for a static set of shadowed
fields, I'm not really happy to add even more complexity.

Paolo

> The decision of which fields to shadow is really something that should
> be done dynamically, depending on the behavior of the guest hypervisor
> (which may vary depending on the L2 guest it's running!). Making the
> decision statically is bound to result in a poor outcome for some
> scenarios.
Jim Mattson June 13, 2019, 5:36 p.m. UTC | #4
On Thu, Jun 13, 2019 at 10:18 AM Paolo Bonzini <pbonzini@redhat.com> wrote:

> Also, while this may vary dynamically based on the L2 guest that is
> running, this is much less true for unrestricted-guest processors.
> Without data on _which_ scenarios are bad for a static set of shadowed
> fields, I'm not really happy to add even more complexity.

Data supporting which scenarios would lead you to entertain more
complexity? Is it even worth collecting data on L3 performance, for
example? :-)
Paolo Bonzini June 13, 2019, 5:59 p.m. UTC | #5
On 13/06/19 19:36, Jim Mattson wrote:
>> Also, while this may vary dynamically based on the L2 guest that is
>> running, this is much less true for unrestricted-guest processors.
>> Without data on _which_ scenarios are bad for a static set of shadowed
>> fields, I'm not really happy to add even more complexity.
>
> Data supporting which scenarios would lead you to entertain more
> complexity?

For example, it would be interesting if some L1 hypervisor had 2x slower
vmexits on some L2 guests but otherwise fit the current set of shadowed
fields.

Paolo

> Is it even worth collecting data on L3 performance, for
> example? :-)

Patch

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 04b40a98f60b..1f7c4af70903 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -1102,14 +1102,6 @@  static int vmx_restore_vmx_misc(struct vcpu_vmx *vmx, u64 data)
 	vmx->nested.msrs.misc_low = data;
 	vmx->nested.msrs.misc_high = data >> 32;
 
-	/*
-	 * If L1 has read-only VM-exit information fields, use the
-	 * less permissive vmx_vmwrite_bitmap to specify write
-	 * permissions for the shadow VMCS.
-	 */
-	if (enable_shadow_vmcs && !nested_cpu_has_vmwrite_any_field(&vmx->vcpu))
-		vmcs_write64(VMWRITE_BITMAP, __pa(vmx_vmwrite_bitmap));
-
 	return 0;
 }
 
@@ -1298,41 +1290,27 @@  int vmx_get_vmx_msr(struct nested_vmx_msrs *msrs, u32 msr_index, u64 *pdata)
 }
 
 /*
- * Copy the writable VMCS shadow fields back to the VMCS12, in case
- * they have been modified by the L1 guest. Note that the "read-only"
- * VM-exit information fields are actually writable if the vCPU is
- * configured to support "VMWRITE to any supported field in the VMCS."
+ * Copy the writable VMCS shadow fields back to the VMCS12, in case they have
+ * been modified by the L1 guest.  Note, "writable" in this context means
+ * "writable by the guest", i.e. tagged SHADOW_FIELD_RW.  Note #2, the set of
+ * fields tagged SHADOW_FIELD_RO may or may not align with the "read-only"
+ * VM-exit information fields (which are actually writable if the vCPU is
+ * configured to support "VMWRITE to any supported field in the VMCS").
  */
 static void copy_shadow_to_vmcs12(struct vcpu_vmx *vmx)
 {
-	const u16 *fields[] = {
-		shadow_read_write_fields,
-		shadow_read_only_fields
-	};
-	const int max_fields[] = {
-		max_shadow_read_write_fields,
-		max_shadow_read_only_fields
-	};
-	int i, q;
-	unsigned long field;
-	u64 field_value;
 	struct vmcs *shadow_vmcs = vmx->vmcs01.shadow_vmcs;
+	struct vmcs12 *vmcs12 = get_vmcs12(&vmx->vcpu);
+	unsigned long field;
+	int i;
 
 	preempt_disable();
 
 	vmcs_load(shadow_vmcs);
 
-	for (q = 0; q < ARRAY_SIZE(fields); q++) {
-		for (i = 0; i < max_fields[q]; i++) {
-			field = fields[q][i];
-			field_value = __vmcs_readl(field);
-			vmcs12_write_any(get_vmcs12(&vmx->vcpu), field, field_value);
-		}
-		/*
-		 * Skip the VM-exit information fields if they are read-only.
-		 */
-		if (!nested_cpu_has_vmwrite_any_field(&vmx->vcpu))
-			break;
+	for (i = 0; i < max_shadow_read_write_fields; i++) {
+		field = shadow_read_write_fields[i];
+		vmcs12_write_any(vmcs12, field, __vmcs_readl(field));
 	}
 
 	vmcs_clear(shadow_vmcs);
@@ -4511,6 +4489,24 @@  static int handle_vmwrite(struct kvm_vcpu *vcpu)
 			 * path of prepare_vmcs02.
 			 */
 			break;
+
+#define SHADOW_FIELD_RO(x) case x:
+#include "vmcs_shadow_fields.h"
+			/*
+			 * L1 can read these fields without exiting, ensure the
+			 * shadow VMCS is up-to-date.
+			 */
+			if (enable_shadow_vmcs) {
+				preempt_disable();
+				vmcs_load(vmx->vmcs01.shadow_vmcs);
+
+				__vmcs_writel(field, field_value);
+
+				vmcs_clear(vmx->vmcs01.shadow_vmcs);
+				vmcs_load(vmx->loaded_vmcs->vmcs);
+				preempt_enable();
+			}
+			/* fall through */
 		default:
 			vmx->nested.dirty_vmcs12 = true;
 			break;
@@ -5456,14 +5452,8 @@  static int vmx_set_nested_state(struct kvm_vcpu *vcpu,
 void nested_vmx_vcpu_setup(void)
 {
 	if (enable_shadow_vmcs) {
-		/*
-		 * At vCPU creation, "VMWRITE to any supported field
-		 * in the VMCS" is supported, so use the more
-		 * permissive vmx_vmread_bitmap to specify both read
-		 * and write permissions for the shadow VMCS.
-		 */
 		vmcs_write64(VMREAD_BITMAP, __pa(vmx_vmread_bitmap));
-		vmcs_write64(VMWRITE_BITMAP, __pa(vmx_vmread_bitmap));
+		vmcs_write64(VMWRITE_BITMAP, __pa(vmx_vmwrite_bitmap));
 	}
 }