
[v3] KVM: VMX: do not disable interception for MSR_IA32_SPEC_CTRL on eIBRS

Message ID 20220520204115.67580-1-jon@nutanix.com (mailing list archive)
State New, archived

Commit Message

Jon Kohler May 20, 2022, 8:41 p.m. UTC
Avoid expensive rdmsr on every VM Exit for MSR_IA32_SPEC_CTRL on
eIBRS enabled systems iff the guest only sets IA32_SPEC_CTRL[0] (IBRS)
and not [1] (STIBP) or [2] (SSBD) by not disabling interception in
the MSR bitmap. Note: this logic applies only to eIBRS, as Intel's
guidance has long been that eIBRS only needs to be set once, so most
guests with eIBRS awareness should behave nicely. We would not want to
accidentally regress misbehaving guests on pre-eIBRS systems, which
might be spamming the IBRS MSR without the hypervisor being able to see
it today.

eIBRS enabled guests using just IBRS will only write the SPEC_CTRL MSR
once or twice per vCPU on boot, so it is far better to take those
VM exits on boot than to read and save this MSR on every single
VM exit forever. This outcome was suggested in Andrea's commit
2f46993d83ff ("x86: change default to spec_store_bypass_disable=prctl spectre_v2_user=prctl");
however, since interception is still unilaterally disabled, the rdmsr
tax remains even after that commit.

This is a significant win for eIBRS enabled systems, as this rdmsr
accounts for roughly 50% of the time spent in vmx_vcpu_run() as
observed by perf top disassembly, and is in the critical path for all
VM-Exits, including fastpath exits.

Update relevant comments in vmx_vcpu_run() and opportunistically
update comments for both MSR_IA32_SPEC_CTRL and MSR_IA32_PRED_CMD to
make it clear how L1 vs L2 handling works.
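For readers skimming the thread, the decision being changed can be modeled as a small standalone sketch. This is illustrative only: `keep_intercepting` is a hypothetical name, not KVM's; the real check lives in vmx_set_msr() in arch/x86/kvm/vmx/vmx.c, shown in the patch below.

```c
#include <stdbool.h>
#include <stdint.h>

/* Architectural IA32_SPEC_CTRL bits (Intel SDM). */
#define SPEC_CTRL_IBRS    (1ULL << 0)
#define SPEC_CTRL_STIBP   (1ULL << 1)
#define SPEC_CTRL_SSBD    (1ULL << 2)

/* IA32_ARCH_CAPABILITIES bit advertising enhanced IBRS. */
#define ARCH_CAP_IBRS_ALL (1ULL << 1)

/*
 * Illustrative model of the patch's decision: SPEC_CTRL interception
 * stays enabled (so no RDMSR is needed on every VM-Exit) when the
 * guest's write is zero, or when eIBRS is advertised to the guest and
 * the guest is enabling only IBRS.  Any other non-zero write disables
 * interception, as before the patch.
 */
static bool keep_intercepting(uint64_t data, uint64_t arch_capabilities)
{
	return !data ||
	       (data == SPEC_CTRL_IBRS &&
		(arch_capabilities & ARCH_CAP_IBRS_ALL));
}
```

An eIBRS-aware guest that sets only IBRS once at boot thus takes one or two cheap intercepted writes instead of forcing an RDMSR on every subsequent VM-Exit.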

Fixes: 2f46993d83ff ("x86: change default to spec_store_bypass_disable=prctl spectre_v2_user=prctl")
Signed-off-by: Jon Kohler <jon@nutanix.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Waiman Long <longman@redhat.com>
---
v1
 - https://lore.kernel.org/all/20220512174427.3608-1-jon@nutanix.com/
v1 -> v2:
 - https://lore.kernel.org/all/20220520195303.58692-1-jon@nutanix.com/
 - Addressed comments on approach from Sean.
v2 -> v3:
 - Addressed comments on approach from Sean.

 arch/x86/kvm/vmx/vmx.c | 62 ++++++++++++++++++++----------------------
 1 file changed, 29 insertions(+), 33 deletions(-)

--
2.30.1 (Apple Git-130)

Comments

Sean Christopherson May 25, 2022, 5:04 p.m. UTC | #1
On Fri, May 20, 2022, Jon Kohler wrote:
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 610355b9ccce..1c725d17d984 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -2057,20 +2057,32 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>  			return 1;
> 
>  		vmx->spec_ctrl = data;
> -		if (!data)
> +
> +		/*
> +		 * Disable interception on the first non-zero write, unless the
> +		 * guest is hosted on an eIBRS system and setting only

The "unless guest is hosted on an eIBRS system" blurb is wrong and doesn't match
the code.  Again, it's all about whether eIBRS is advertised to the guest.  With
some other minor tweaking to wrangle the comment to 80 chars...

		/*
		 * Disable interception on the first non-zero write, except if
		 * eIBRS is advertised to the guest and the guest is enabling
		 * _only_ IBRS.  On eIBRS systems, kernels typically set IBRS
		 * once at boot and never touch it post-boot.  All other bits,
		 * and IBRS on non-eIBRS systems, are often set on a per-task
		 * basis, i.e. change frequently, so the benefit of avoiding
		 * VM-exits during guest context switches outweighs the cost of
		 * RDMSR on every VM-Exit to save the guest's value.
		 */

> +		 * SPEC_CTRL_IBRS, which is typically set once at boot and never

Uber nit, when it makes sense, avoid regurgitating the code verbatim, e.g. refer
to "setting SPEC_CTRL_IBRS" as "enabling IBRS".  That little bit of abstraction
can sometimes help unfamiliar readers understand the effect of the code, whereas
copy+pasting bits of the code doesn't provide any additional context.

> +		 * touched again.  All other bits are often set on a per-task
> +		 * basis, i.e. may change frequently, so the benefit of avoiding
> +		 * VM-exits during guest context switches outweighs the cost of
> +		 * RDMSR on every VM-Exit to save the guest's value.
> +		 */
> +		if (!data ||
> +			(data == SPEC_CTRL_IBRS &&
> +			 (vcpu->arch.arch_capabilities & ARCH_CAP_IBRS_ALL)))

Align the two halves of the logical-OR, i.e.

		if (!data ||
		    (data == SPEC_CTRL_IBRS &&
		     (vcpu->arch.arch_capabilities & ARCH_CAP_IBRS_ALL)))
			break;
Jon Kohler May 25, 2022, 5:14 p.m. UTC | #2
> On May 25, 2022, at 1:04 PM, Sean Christopherson <seanjc@google.com> wrote:
> 
> On Fri, May 20, 2022, Jon Kohler wrote:
>> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>> index 610355b9ccce..1c725d17d984 100644
>> --- a/arch/x86/kvm/vmx/vmx.c
>> +++ b/arch/x86/kvm/vmx/vmx.c
>> @@ -2057,20 +2057,32 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>> 			return 1;
>> 
>> 		vmx->spec_ctrl = data;
>> -		if (!data)
>> +
>> +		/*
>> +		 * Disable interception on the first non-zero write, unless the
>> +		 * guest is hosted on an eIBRS system and setting only
> 
> The "unless guest is hosted on an eIBRS system" blurb is wrong and doesn't match

Ah right, thanks for catching that

> the code.  Again, it's all about whether eIBRS is advertised to the guest.  With
> some other minor tweaking to wrangle the comment to 80 chars...

RE 80 chars - quick question (and forgive the silly question here), but how are you
counting that? I’ve got my editor cutting at 79 cols, where tab size is accounted
for as 4 cols, so the longest line on my side for this patch is 72-73 or so.

These also pass the checkpatch.pl script as well, so I just want to make sure
going forward I’m formatting them appropriately.

Let me know and I’ll incorporate all of this into a v4 after I hear back :)

> 
> 		/*
> 		 * Disable interception on the first non-zero write, except if
> 		 * eIBRS is advertised to the guest and the guest is enabling
> 		 * _only_ IBRS.  On eIBRS systems, kernels typically set IBRS
> 		 * once at boot and never touch it post-boot.  All other bits,
> 		 * and IBRS on non-eIBRS systems, are often set on a per-task
> 		 * basis, i.e. change frequently, so the benefit of avoiding
> 		 * VM-exits during guest context switches outweighs the cost of
> 		 * RDMSR on every VM-Exit to save the guest's value.
> 		 */

Thanks for the suggestion, this text works for me

> 
>> +		 * SPEC_CTRL_IBRS, which is typically set once at boot and never
> 
> Uber nit, when it makes sense, avoid regurgitating the code verbatim, e.g. refer
> to "setting SPEC_CTRL_IBRS" as "enabling IBRS".  That little bit of abstraction
> can sometimes help unfamiliar readers understand the effect of the code, whereas
> copy+pasting bits of the code doesn't provide any additional context.

Ok thanks for this as well, I appreciate the feedback!

> 
>> +		 * touched again.  All other bits are often set on a per-task
>> +		 * basis, i.e. may change frequently, so the benefit of avoiding
>> +		 * VM-exits during guest context switches outweighs the cost of
>> +		 * RDMSR on every VM-Exit to save the guest's value.
>> +		 */
>> +		if (!data ||
>> +			(data == SPEC_CTRL_IBRS &&
>> +			 (vcpu->arch.arch_capabilities & ARCH_CAP_IBRS_ALL)))
> 
> Align the two halves of the logical-OR, i.e.
> 
> 		if (!data ||
> 		    (data == SPEC_CTRL_IBRS &&
> 		     (vcpu->arch.arch_capabilities & ARCH_CAP_IBRS_ALL)))
> 			break;

Ack on this
Sean Christopherson May 25, 2022, 5:32 p.m. UTC | #3
On Wed, May 25, 2022, Jon Kohler wrote:
> > On May 25, 2022, at 1:04 PM, Sean Christopherson <seanjc@google.com> wrote:
> > the code.  Again, it's all about whether eIBRS is advertised to the guest.  With
> > some other minor tweaking to wrangle the comment to 80 chars...
> 
> RE 80 chars - quick question (and forgive the silly question here), but how are you
> counting that? I’ve got my editor cutting at 79 cols, where tab size is accounted
> for as 4 cols, so the longest line on my side for this patch is 72-73 or so.

Tabs are 8 cols in the kernel.  FWIW, your patch was totally fine with respect to
wrapping, my comment was purely to give the heads up that I arbitrarily tweaked
other bits of the comment to adjust for my suggested reword.
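To see why an editor configured for 4-column tabs undercounts, here is a minimal sketch (not from the thread) of how display columns are counted with kernel-style 8-column tab stops:

```c
#include <stddef.h>

/*
 * Display width of a line with tab stops every `tabstop` columns,
 * which is how the kernel (and checkpatch.pl) count line length.
 * Each tab advances to the next multiple of `tabstop`.
 */
static size_t display_width(const char *s, size_t tabstop)
{
	size_t col = 0;

	for (; *s && *s != '\n'; s++)
		col = (*s == '\t') ? (col / tabstop + 1) * tabstop
				   : col + 1;
	return col;
}
```

With 8-column stops, a line starting with three tabs already sits at column 24 before any code begins, which is how a line that appears to be ~73 columns at tabstop=4 can exceed 80 in the kernel's counting.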

> These also pass the checkpatch.pl script as well, so I just want to make sure
> going forward I’m formatting them appropriately.

For future reference, checkpatch.pl only yells if a line length exceeds 100 chars.
The 80 char limit is a soft limit, with 100 chars being a mostly-firm limit.
checkpatch used to yell at 80, but was changed because too many people were
interpreting 80 chars as a hard limit and blindly wrapping code to make checkpatch
happy at the cost of yielding less readable code.

Whether or not to run over the soft limit is definitely subjective, just try to
use common sense.  E.g. overflowing by 1-2 chars, not a big deal, especially if
the statement would otherwise fit on a single line, i.e. doesn't already wrap.

A decent example is the SGX MSRs, which allows the SGX_LC_ENABLED check to run
over a little, but wraps the data update.  The reasoning is that

		if (!msr_info->host_initiated &&
		    (!guest_cpuid_has(vcpu, X86_FEATURE_SGX_LC) ||
		    ((vmx->msr_ia32_feature_control & FEAT_CTL_LOCKED) &&
		    !(vmx->msr_ia32_feature_control & FEAT_CTL_SGX_LC_ENABLED))))
			return 1;

is much more readable than 

		if (!msr_info->host_initiated &&
		    (!guest_cpuid_has(vcpu, X86_FEATURE_SGX_LC) ||
		    ((vmx->msr_ia32_feature_control & FEAT_CTL_LOCKED) &&
		    !(vmx->msr_ia32_feature_control &
		      FEAT_CTL_SGX_LC_ENABLED))))
			return 1;

but this

		vmx->msr_ia32_sgxlepubkeyhash
			[msr_index - MSR_IA32_SGXLEPUBKEYHASH0] = data;

isn't _that_ much worse than running the line way out, and ~93 chars
gets to be a bit too long.

		vmx->msr_ia32_sgxlepubkeyhash[msr_index - MSR_IA32_SGXLEPUBKEYHASH0] = data;
Jim Mattson May 25, 2022, 5:46 p.m. UTC | #4
On Wed, May 25, 2022 at 10:14 AM Jon Kohler <jon@nutanix.com> wrote:
> RE 80 chars - quick question (and forgive the silly question here), but how are you
> counting that? I’ve got my editor cutting at 79 cols, where tab size is accounted
> for as 4 cols, so the longest line on my side for this patch is 72-73 or so.

Tab stops are every 8 characters. :-)

Patch

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 610355b9ccce..1c725d17d984 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -2057,20 +2057,32 @@  static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 			return 1;

 		vmx->spec_ctrl = data;
-		if (!data)
+
+		/*
+		 * Disable interception on the first non-zero write, unless the
+		 * guest is hosted on an eIBRS system and setting only
+		 * SPEC_CTRL_IBRS, which is typically set once at boot and never
+		 * touched again.  All other bits are often set on a per-task
+		 * basis, i.e. may change frequently, so the benefit of avoiding
+		 * VM-exits during guest context switches outweighs the cost of
+		 * RDMSR on every VM-Exit to save the guest's value.
+		 */
+		if (!data ||
+			(data == SPEC_CTRL_IBRS &&
+			 (vcpu->arch.arch_capabilities & ARCH_CAP_IBRS_ALL)))
 			break;

 		/*
-		 * For non-nested:
-		 * When it's written (to non-zero) for the first time, pass
-		 * it through.
-		 *
-		 * For nested:
-		 * The handling of the MSR bitmap for L2 guests is done in
-		 * nested_vmx_prepare_msr_bitmap. We should not touch the
-		 * vmcs02.msr_bitmap here since it gets completely overwritten
-		 * in the merging. We update the vmcs01 here for L1 as well
-		 * since it will end up touching the MSR anyway now.
+		 * Update vmcs01.msr_bitmap even if L2 is active, i.e. disable
+		 * interception for the vCPU on the first write regardless of
+		 * whether the WRMSR came from L1 or L2.  vmcs02's bitmap is a
+		 * combination of vmcs01 and vmcs12 bitmaps, and will be
+		 * recomputed by nested_vmx_prepare_msr_bitmap() on the next
+		 * nested VM-Enter.  Note, this does mean that future WRMSRs
+		 * from L2 will be intercepted until the next nested VM-Exit if
+		 * L2 was the first to write, but L1 exposing the MSR to L2
+		 * without first writing it is unlikely and not worth the
+		 * extra bit of complexity.
 		 */
 		vmx_disable_intercept_for_msr(vcpu,
 					      MSR_IA32_SPEC_CTRL,
@@ -2098,15 +2110,9 @@  static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		wrmsrl(MSR_IA32_PRED_CMD, PRED_CMD_IBPB);

 		/*
-		 * For non-nested:
-		 * When it's written (to non-zero) for the first time, pass
-		 * it through.
-		 *
-		 * For nested:
-		 * The handling of the MSR bitmap for L2 guests is done in
-		 * nested_vmx_prepare_msr_bitmap. We should not touch the
-		 * vmcs02.msr_bitmap here since it gets completely overwritten
-		 * in the merging.
+		 * Disable interception on the first IBPB, odds are good IBPB
+		 * will be a frequent guest action.  See the comment for
+		 * MSR_IA32_SPEC_CTRL for details on the nested interaction.
 		 */
 		vmx_disable_intercept_for_msr(vcpu, MSR_IA32_PRED_CMD, MSR_TYPE_W);
 		break;
@@ -6887,19 +6893,9 @@  static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
 	vmx_vcpu_enter_exit(vcpu, vmx);

 	/*
-	 * We do not use IBRS in the kernel. If this vCPU has used the
-	 * SPEC_CTRL MSR it may have left it on; save the value and
-	 * turn it off. This is much more efficient than blindly adding
-	 * it to the atomic save/restore list. Especially as the former
-	 * (Saving guest MSRs on vmexit) doesn't even exist in KVM.
-	 *
-	 * For non-nested case:
-	 * If the L01 MSR bitmap does not intercept the MSR, then we need to
-	 * save it.
-	 *
-	 * For nested case:
-	 * If the L02 MSR bitmap does not intercept the MSR, then we need to
-	 * save it.
+	 * Save SPEC_CTRL if it may have been written by the guest, the current
+	 * value in hardware is used by x86_spec_ctrl_restore_host() to avoid
+	 * WRMSR if the current value matches the host's desired value.
 	 */
 	if (unlikely(!msr_write_intercepted(vmx, MSR_IA32_SPEC_CTRL)))
 		vmx->spec_ctrl = native_read_msr(MSR_IA32_SPEC_CTRL);