diff mbox series

[v3] x86/speculation, KVM: only IBPB for switch_mm_always_ibpb on vCPU load

Message ID 20220422162103.32736-1-jon@nutanix.com (mailing list archive)
State New, archived
Headers show
Series [v3] x86/speculation, KVM: only IBPB for switch_mm_always_ibpb on vCPU load | expand

Commit Message

Jon Kohler April 22, 2022, 4:21 p.m. UTC
On vmx_vcpu_load_vmcs and svm_vcpu_load, respect the user-controlled
configuration for conditional IBPB and only attempt the IBPB MSR write when
switching between different guest vCPUs IFF switch_mm_always_ibpb,
which fixes a situation where the kernel will issue IBPB
unconditionally even when conditional IBPB is enabled.

If a user has spectre_v2_user mitigation enabled, in any
configuration, and the underlying processor supports X86_FEATURE_IBPB,
X86_FEATURE_USE_IBPB is set and any calls to
indirect_branch_prediction_barrier() will write the IBPB MSR.
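
For reference, indirect_branch_prediction_barrier() is roughly the
following (a simplified sketch of arch/x86/include/asm/nospec-branch.h
around the time of this series); the MSR write is patched in only when
X86_FEATURE_USE_IBPB is set:

  static inline void indirect_branch_prediction_barrier(void)
  {
  	u64 val = PRED_CMD_IBPB;

  	/* No-op unless X86_FEATURE_USE_IBPB is set. */
  	alternative_msr_write(MSR_IA32_PRED_CMD, val, X86_FEATURE_USE_IBPB);
  }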

Depending on the spectre_v2_user configuration, either
switch_mm_always_ibpb key or switch_mm_cond_ibpb key will be set.

Both switch_mm_always_ibpb and switch_mm_cond_ibpb are handled by
switch_mm() -> cond_mitigation(), which works well in cases where
switching vCPUs (i.e. switching tasks) also switches mm_struct;
however, this misses a paranoid case where user space may be running
multiple guests in a single process (i.e. single mm_struct). This
presents two issues:

Issue 1:
This paranoid case is already covered by vmx_vcpu_load_vmcs and
svm_vcpu_load; however, this is done by calling
indirect_branch_prediction_barrier() and thus the kernel
unconditionally issues IBPB if X86_FEATURE_USE_IBPB is set.

Issue 2:
For a conditional configuration, this paranoid case is nonsensical.
If userspace runs multiple VMs in the same process, enables cond_ibpb,
_and_ sets TIF_SPEC_IB, then it isn't getting full protection in any case,
e.g. if userspace is handling an exit-to-userspace condition for two
vCPUs from different VMs, then the kernel could switch between those
two vCPUs' tasks without bouncing through KVM and thus without doing
KVM's IBPB.

Fix both by using an intermediary call to x86_virt_guest_switch_ibpb(),
which gates the IBPB MSR write IFF switch_mm_always_ibpb is true.

switch_mm_cond_ibpb is intentionally ignored from the KVM code side
as it really is nonsensical given the common case is already well
covered by switch_mm(), so issuing an additional IBPB from KVM is
just pure overhead.
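
For context, cond_mitigation() in arch/x86/mm/tlb.c already implements
both policies on mm switch; a simplified sketch (omitting STIBP/MDS
handling and the exact bookkeeping around cpu_tlbstate):

  static void cond_mitigation(struct task_struct *next)
  {
  	unsigned long prev_mm, next_mm;

  	next_mm = mm_mangle_tif_spec_bits(next);
  	prev_mm = this_cpu_read(cpu_tlbstate.last_user_mm_spec);

  	/* Conditional mode: IBPB only if either mm opted in via TIF_SPEC_IB. */
  	if (static_branch_likely(&switch_mm_cond_ibpb)) {
  		if (next_mm != prev_mm &&
  		    (next_mm | prev_mm) & LAST_USER_MM_IBPB)
  			indirect_branch_prediction_barrier();
  	}

  	/* Always mode: IBPB on any switch to a different mm. */
  	if (static_branch_unlikely(&switch_mm_always_ibpb)) {
  		if ((prev_mm & ~LAST_USER_MM_SPEC_MASK) !=
  		    (unsigned long)next->mm)
  			indirect_branch_prediction_barrier();
  	}
  }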

Note: switch_mm_always_ibpb key is user controlled via spectre_v2_user
and will be true for the following configurations (see the sketch below):
  spectre_v2_user=on
  spectre_v2_user=prctl,ibpb
  spectre_v2_user=seccomp,ibpb
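
That mapping is established in spectre_v2_user_select_mitigation() in
arch/x86/kernel/cpu/bugs.c; roughly (simplified from the code at the
time of this series):

  if (boot_cpu_has(X86_FEATURE_IBPB)) {
  	setup_force_cpu_cap(X86_FEATURE_USE_IBPB);

  	switch (cmd) {
  	case SPECTRE_V2_USER_CMD_FORCE:
  	case SPECTRE_V2_USER_CMD_PRCTL_IBPB:
  	case SPECTRE_V2_USER_CMD_SECCOMP_IBPB:
  		static_branch_enable(&switch_mm_always_ibpb);
  		break;
  	case SPECTRE_V2_USER_CMD_PRCTL:
  	case SPECTRE_V2_USER_CMD_AUTO:
  	case SPECTRE_V2_USER_CMD_SECCOMP:
  		static_branch_enable(&switch_mm_cond_ibpb);
  		break;
  	default:
  		break;
  	}
  }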

Signed-off-by: Jon Kohler <jon@nutanix.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Waiman Long <longman@redhat.com>
---
v1 -> v2:
 - Addressed comments on approach from Sean.
v2 -> v3:
 - Updated spec-ctrl.h comments and commit msg to incorporate
   additional feedback from Sean.

 arch/x86/include/asm/spec-ctrl.h | 14 ++++++++++++++
 arch/x86/kernel/cpu/bugs.c       |  6 +++++-
 arch/x86/kvm/svm/svm.c           |  2 +-
 arch/x86/kvm/vmx/vmx.c           |  2 +-
 4 files changed, 21 insertions(+), 3 deletions(-)

--
2.30.1 (Apple Git-130)

Comments

Jon Kohler April 28, 2022, 12:51 p.m. UTC | #1
> On Apr 22, 2022, at 12:21 PM, Jon Kohler <jon@nutanix.com> wrote:
> 
> On vmx_vcpu_load_vmcs and svm_vcpu_load, respect user controlled
> configuration for conditional IBPB and only attempt IBPB MSR when
> switching between different guest vCPUs IFF switch_mm_always_ibpb,
> which fixes a situation where the kernel will issue IBPB
> unconditionally even when conditional IBPB is enabled.
> 
> If a user has spectre_v2_user mitigation enabled, in any
> configuration, and the underlying processor supports X86_FEATURE_IBPB,
> X86_FEATURE_USE_IBPB is set and any calls to
> indirect_branch_prediction_barrier() will issue IBPB MSR.
> 
> Depending on the spectre_v2_user configuration, either
> switch_mm_always_ibpb key or switch_mm_cond_ibpb key will be set.
> 
> Both switch_mm_always_ibpb and switch_mm_cond_ibpb are handled by
> switch_mm() -> cond_mitigation(), which works well in cases where
> switching vCPUs (i.e. switching tasks) also switches mm_struct;
> however, this misses a paranoid case where user space may be running
> multiple guests in a single process (i.e. single mm_struct). This
> presents two issues:
> 
> Issue 1:
> This paranoid case is already covered by vmx_vcpu_load_vmcs and
> svm_vcpu_load; however, this is done by calling
> indirect_branch_prediction_barrier() and thus the kernel
> unconditionally issues IBPB if X86_FEATURE_USE_IBPB is set.
> 
> Issue 2:
> For a conditional configuration, this paranoid case is nonsensical.
> If userspace runs multiple VMs in the same process, enables cond_ibpb,
> _and_ sets TIF_SPEC_IB, then it isn't getting full protection in any case,
> e.g. if userspace is handling an exit-to-userspace condition for two
> vCPUs from different VMs, then the kernel could switch between those
> two vCPUs' tasks without bouncing through KVM and thus without doing
> KVM's IBPB.
> 
> Fix both by using intermediary call to x86_virt_guest_switch_ibpb(),
> which gates IBPB MSR IFF switch_mm_always_ibpb is true.
> 
> switch_mm_cond_ibpb is intentionally ignored from the KVM code side
> as it really is nonsensical given the common case is already well
> covered by switch_mm(), so issuing an additional IBPB from KVM is
> just pure overhead.
> 
> Note: switch_mm_always_ibpb key is user controlled via spectre_v2_user
> and will be true for the following configurations:
>  spectre_v2_user=on
>  spectre_v2_user=prctl,ibpb
>  spectre_v2_user=seccomp,ibpb
> 
> Signed-off-by: Jon Kohler <jon@nutanix.com>
> Cc: Sean Christopherson <seanjc@google.com>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Josh Poimboeuf <jpoimboe@redhat.com>
> Cc: Waiman Long <longman@redhat.com>
> ---
> v1 -> v2:
> - Addressed comments on approach from Sean.
> v2 -> v3:
> - Updated spec-ctrl.h comments and commit msg to incorporate
>   additional feedback from Sean.
> 

Gentle ping on this one, thanks, Jon

> arch/x86/include/asm/spec-ctrl.h | 14 ++++++++++++++
> arch/x86/kernel/cpu/bugs.c       |  6 +++++-
> arch/x86/kvm/svm/svm.c           |  2 +-
> arch/x86/kvm/vmx/vmx.c           |  2 +-
> 4 files changed, 21 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/include/asm/spec-ctrl.h b/arch/x86/include/asm/spec-ctrl.h
> index 5393babc0598..99d3341d2e21 100644
> --- a/arch/x86/include/asm/spec-ctrl.h
> +++ b/arch/x86/include/asm/spec-ctrl.h
> @@ -85,4 +85,18 @@ static inline void speculative_store_bypass_ht_init(void) { }
> extern void speculation_ctrl_update(unsigned long tif);
> extern void speculation_ctrl_update_current(void);
> 
> +/*
> + * Issue IBPB when switching guest vCPUs IFF switch_mm_always_ibpb.
> + * For the more common case of running VMs in their own dedicated process,
> + * switching vCPUs that belong to different VMs, i.e. switching tasks,
> + * will also switch mm_structs and thus do IBPB via cond_mitigation();
> + * however, in the always_ibpb case, take a paranoid approach and issue
> + * IBPB on both switch_mm() and vCPU switch.
> + */
> +static inline void x86_virt_guest_switch_ibpb(void)
> +{
> +	if (static_branch_unlikely(&switch_mm_always_ibpb))
> +		indirect_branch_prediction_barrier();
> +}
> +
> #endif
> diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
> index 6296e1ebed1d..6aafb0279cbc 100644
> --- a/arch/x86/kernel/cpu/bugs.c
> +++ b/arch/x86/kernel/cpu/bugs.c
> @@ -68,8 +68,12 @@ u64 __ro_after_init x86_amd_ls_cfg_ssbd_mask;
> DEFINE_STATIC_KEY_FALSE(switch_to_cond_stibp);
> /* Control conditional IBPB in switch_mm() */
> DEFINE_STATIC_KEY_FALSE(switch_mm_cond_ibpb);
> -/* Control unconditional IBPB in switch_mm() */
> +/* Control unconditional IBPB in both switch_mm() and
> + * x86_virt_guest_switch_ibpb().
> + * See notes on x86_virt_guest_switch_ibpb() for KVM use case details.
> + */
> DEFINE_STATIC_KEY_FALSE(switch_mm_always_ibpb);
> +EXPORT_SYMBOL_GPL(switch_mm_always_ibpb);
> 
> /* Control MDS CPU buffer clear before returning to user space */
> DEFINE_STATIC_KEY_FALSE(mds_user_clear);
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index bd4c64b362d2..fc08c94df888 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -1302,7 +1302,7 @@ static void svm_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
> 
> 	if (sd->current_vmcb != svm->vmcb) {
> 		sd->current_vmcb = svm->vmcb;
> -		indirect_branch_prediction_barrier();
> +		x86_virt_guest_switch_ibpb();
> 	}
> 	if (kvm_vcpu_apicv_active(vcpu))
> 		__avic_vcpu_load(vcpu, cpu);
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 04d170c4b61e..a8eed9b8221b 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -1270,7 +1270,7 @@ void vmx_vcpu_load_vmcs(struct kvm_vcpu *vcpu, int cpu,
> 		 * The L1 VMM can protect itself with retpolines, IBPB or IBRS.
> 		 */
> 		if (!buddy || WARN_ON_ONCE(buddy->vmcs != prev))
> -			indirect_branch_prediction_barrier();
> +			x86_virt_guest_switch_ibpb();
> 	}
> 
> 	if (!already_loaded) {
> --
> 2.30.1 (Apple Git-130)
>
Borislav Petkov April 29, 2022, 4:59 p.m. UTC | #2
On Fri, Apr 22, 2022 at 12:21:01PM -0400, Jon Kohler wrote:
> Both switch_mm_always_ibpb and switch_mm_cond_ibpb are handled by
> switch_mm() -> cond_mitigation(), which works well in cases where
> switching vCPUs (i.e. switching tasks) also switches mm_struct;
> however, this misses a paranoid case where user space may be running
> multiple guests in a single process (i.e. single mm_struct).

You lost me here. I admit I'm no virt guy so you'll have to explain in
more detail what use case that is that you want to support.

What guests share mm_struct?

What is the paranoid aspect here? You want to protect a single guest
from all the others sharing a mm_struct?

> +/*
> + * Issue IBPB when switching guest vCPUs IFF switch_mm_always_ibpb.
> + * For the more common case of running VMs in their own dedicated process,
> + * switching vCPUs that belong to different VMs, i.e. switching tasks,
> + * will also switch mm_structs and thus do IBPB via cond_mitigation();
> + * however, in the always_ibpb case, take a paranoid approach and issue
> + * IBPB on both switch_mm() and vCPU switch.
> + */
> +static inline void x86_virt_guest_switch_ibpb(void)
> +{
> +	if (static_branch_unlikely(&switch_mm_always_ibpb))
> +		indirect_branch_prediction_barrier();

If this switch is going to be conditional, then make it so:

static void x86_do_cond_ibpb(void)
{
	if (static_branch_likely(&switch_mm_cond_ibpb))
		indirect_branch_prediction_barrier();
}

and there's nothing "virt" about it - might as well call the function
what it does. And I'd put that function in bugs.c...

> diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
> index 6296e1ebed1d..6aafb0279cbc 100644
> --- a/arch/x86/kernel/cpu/bugs.c
> +++ b/arch/x86/kernel/cpu/bugs.c
> @@ -68,8 +68,12 @@ u64 __ro_after_init x86_amd_ls_cfg_ssbd_mask;
>  DEFINE_STATIC_KEY_FALSE(switch_to_cond_stibp);
>  /* Control conditional IBPB in switch_mm() */
>  DEFINE_STATIC_KEY_FALSE(switch_mm_cond_ibpb);
> -/* Control unconditional IBPB in switch_mm() */
> +/* Control unconditional IBPB in both switch_mm() and
> + * x86_virt_guest_switch_ibpb().
> + * See notes on x86_virt_guest_switch_ibpb() for KVM use case details.
> + */
>  DEFINE_STATIC_KEY_FALSE(switch_mm_always_ibpb);
> +EXPORT_SYMBOL_GPL(switch_mm_always_ibpb);

... so that I don't export that key to modules.

I'd like to have the big picture clarified first, why we need this, etc.

Thx.
Jon Kohler April 29, 2022, 5:31 p.m. UTC | #3
> On Apr 29, 2022, at 12:59 PM, Borislav Petkov <bp@alien8.de> wrote:
> 
> On Fri, Apr 22, 2022 at 12:21:01PM -0400, Jon Kohler wrote:
>> Both switch_mm_always_ibpb and switch_mm_cond_ibpb are handled by
>> switch_mm() -> cond_mitigation(), which works well in cases where
>> switching vCPUs (i.e. switching tasks) also switches mm_struct;
>> however, this misses a paranoid case where user space may be running
>> multiple guests in a single process (i.e. single mm_struct).
> 
> You lost me here. I admit I'm no virt guy so you'll have to explain in
> more detail what use case that is that you want to support.
> 
> What guests share mm_struct?

Selftests IIUC, but there may be other VMMs that do funny stuff. Said
another way, I don’t think we actively restrict user space from doing
this as far as I know.

> 
> What is the paranoid aspect here? You want to protect a single guest
> from all the others sharing a mm_struct?

The paranoid aspect here is KVM is issuing an *additional* IBPB on
top of what already happens in switch_mm(). 

IMHO KVM side IBPB for most use cases isn’t really necessary, but 
the general concept is that you want to protect a vCPU of guest A
from a vCPU of guest B, so you issue a prediction barrier on vCPU switch.

*however* that protection already happens in switch_mm(), because
guest A and B are likely to use different mm_structs, so the only point
of having this support in KVM seems to be to “kill it with fire” for 
paranoid users who might be doing some tomfoolery that would 
somehow bypass switch_mm() protection (such as somehow 
sharing an mm_struct).

One argument could just say: let’s apply the KISS principle and rip this out of
KVM entirely, and users who do this shared mm_struct stuff can just
know that their security profile is possibly less than it could be. That
would certainly simplify the heck out of all of this, but you could probably
set your watch to the next email saying that we broke someone's use
case and they’d be all grumpy.

That’s why we’ve proposed keeping this to the always_ibpb path only,
so that if you intentionally configure always_ibpb, you’re going to get
barriers in both KVM and in switch_mm.

Sean feel free to chime in if I missed the mark with this explanation,
you’ve got a way with words for these things :)

> 
>> +/*
>> + * Issue IBPB when switching guest vCPUs IFF switch_mm_always_ibpb.
>> + * For the more common case of running VMs in their own dedicated process,
>> + * switching vCPUs that belong to different VMs, i.e. switching tasks,
>> + * will also switch mm_structs and thus do IBPB via cond_mitigation();
>> + * however, in the always_ibpb case, take a paranoid approach and issue
>> + * IBPB on both switch_mm() and vCPU switch.
>> + */
>> +static inline void x86_virt_guest_switch_ibpb(void)
>> +{
>> +	if (static_branch_unlikely(&switch_mm_always_ibpb))
>> +		indirect_branch_prediction_barrier();
> 
> If this switch is going to be conditional, then make it so:
> 
> static void x86_do_cond_ibpb(void)
> {
> 	if (static_branch_likely(&switch_mm_cond_ibpb))
> 		indirect_branch_prediction_barrier();
> }

In the context of this change, we only want to do this barrier on the always_ibpb
path, so we wouldn’t want to do cond_ibpb here. But you bring up a good point:
we could change it to x86_do_always_ibpb() and move it to bugs.c; that would
simplify things. Thanks.

> 
> and there's nothing "virt" about it - might as well call the function
> what it does. And I'd put that function in bugs.c...
> 
>> diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
>> index 6296e1ebed1d..6aafb0279cbc 100644
>> --- a/arch/x86/kernel/cpu/bugs.c
>> +++ b/arch/x86/kernel/cpu/bugs.c
>> @@ -68,8 +68,12 @@ u64 __ro_after_init x86_amd_ls_cfg_ssbd_mask;
>> DEFINE_STATIC_KEY_FALSE(switch_to_cond_stibp);
>> /* Control conditional IBPB in switch_mm() */
>> DEFINE_STATIC_KEY_FALSE(switch_mm_cond_ibpb);
>> -/* Control unconditional IBPB in switch_mm() */
>> +/* Control unconditional IBPB in both switch_mm() and
>> + * x86_virt_guest_switch_ibpb().
>> + * See notes on x86_virt_guest_switch_ibpb() for KVM use case details.
>> + */
>> DEFINE_STATIC_KEY_FALSE(switch_mm_always_ibpb);
>> +EXPORT_SYMBOL_GPL(switch_mm_always_ibpb);
> 
> ... so that I don't export that key to modules.
> 
> I'd like to have the big picture clarified first, why we need this, etc.

No problem, and I appreciate the help and review! Let me know if the
above makes sense and I’m happy to issue a v4 patch for this.

Cheers,
Jon


> Thx.
> 
> -- 
> Regards/Gruss,
>    Boris.
> 
> https://people.kernel.org/tglx/notes-about-netiquette
Borislav Petkov April 29, 2022, 7:32 p.m. UTC | #4
On Fri, Apr 29, 2022 at 05:31:16PM +0000, Jon Kohler wrote:
> Selftests IIUC, but there may be other VMMs that do funny stuff. Said
> another way, I don’t think we actively restrict user space from doing
> this as far as I know.

"selftests", "there may be"?!

This doesn't sound like a real-life use case to me and we don't do
changes just because. Sorry.

> The paranoid aspect here is KVM is issuing an *additional* IBPB on
> top of what already happens in switch_mm(). 

Yeah, I know how that works.

> IMHO KVM side IBPB for most use cases isn’t really necessarily but 
> the general concept is that you want to protect vCPU from guest A
> from guest B, so you issue a prediction barrier on vCPU switch.
> 
> *however* that protection already happens in switch_mm(), because
> guest A and B are likely to use different mm_struct, so the only point
> of having this support in KVM seems to be to “kill it with fire” for 
> paranoid users who might be doing some tomfoolery that would 
> somehow bypass switch_mm() protection (such as somehow 
> sharing a struct).

Yeah, no, this all sounds like something highly hypothetical or there's
a use case of which you don't want to talk about publicly.

Either way, from what I'm reading I'm not in the least convinced that
this is needed.
Jon Kohler April 29, 2022, 8:08 p.m. UTC | #5
> On Apr 29, 2022, at 3:32 PM, Borislav Petkov <bp@alien8.de> wrote:
> 
> On Fri, Apr 29, 2022 at 05:31:16PM +0000, Jon Kohler wrote:
>> Selftests IIUC, but there may be other VMMs that do funny stuff. Said
>> another way, I don’t think we actively restrict user space from doing
>> this as far as I know.
> 
> "selftests", "there may be"?!
> 
> This doesn't sound like a real-life use case to me and we don't do
> changes just because. Sorry.

I appreciate your direct feedback, thank you for helping here.

Let’s separate the discussion into two parts. 
1: Bug Fixing -> an IBPB is being unconditionally issued even when
the user selects conditional. That needs to be fixed and this patch fixes
that.

2: Design -> do we even want to have this IBPB here in KVM at all?

In previous discussions (v1/v2 patch) with Sean, we talked about this
not making a whole lot of sense in general; however, we landed on
trying to not regress users who might, for whatever reason, care about 
this IBPB.

I’ve shared a bit more detail on our use case below. I’m fine with nuking
this IBPB entirely, just want to be mindful of use cases from the rest of
the community that we might not normally cross in our day to day.

I’m happy to take feedback on this and integrate it into a v4 patch for
both of these parts, in terms of both code and correctness in the change
log.

> 
>> The paranoid aspect here is KVM is issuing an *additional* IBPB on
>> top of what already happens in switch_mm(). 
> 
> Yeah, I know how that works.
> 
>> IMHO KVM side IBPB for most use cases isn’t really necessarily but 
>> the general concept is that you want to protect vCPU from guest A
>> from guest B, so you issue a prediction barrier on vCPU switch.
>> 
>> *however* that protection already happens in switch_mm(), because
>> guest A and B are likely to use different mm_struct, so the only point
>> of having this support in KVM seems to be to “kill it with fire” for 
>> paranoid users who might be doing some tomfoolery that would 
>> somehow bypass switch_mm() protection (such as somehow 
>> sharing a struct).
> 
> Yeah, no, this all sounds like something highly hypothetical or there's
> a use case of which you don't want to talk about publicly.

We’re an open book here, so I’m happy to share what we’re up to
publicly. Our use case is 100% qemu-kvm, which is all separate 
processes/structs and is covered a-ok by the switch_mm() path. We
noticed this bug in one of our scalability tests, which oversubscribes
the host with many 2 vCPU VMs and runs a load runner that increases
load one machine at a time, so that we can see the trend of response
time of an app as host load increases. 

Given that the host is heavily oversubscribed, there is a lot of
vCPU switching, and in particular switching in between vCPUs 
belonging to different guests, so that we hit this particular barrier in 
vcpu_load constantly. Taking this fix helped smooth out that response
time curve a bit. Happy to share more specific data if you’d like.

> 
> Either way, from what I'm reading I'm not in the least convinced that
> this is needed.
> 
> -- 
> Regards/Gruss,
>    Boris.
> 
> https://people.kernel.org/tglx/notes-about-netiquette
Sean Christopherson April 29, 2022, 8:29 p.m. UTC | #6
On Fri, Apr 29, 2022, Borislav Petkov wrote:
> On Fri, Apr 29, 2022 at 05:31:16PM +0000, Jon Kohler wrote:
> > Selftests IIUC, but there may be other VMMs that do funny stuff. Said
> > another way, I don’t think we actively restrict user space from doing
> > this as far as I know.
> 
> "selftests", "there may be"?!
> 
> This doesn't sound like a real-life use case to me and we don't do
> changes just because. Sorry.
> 
> > The paranoid aspect here is KVM is issuing an *additional* IBPB on
> > top of what already happens in switch_mm(). 
> 
> Yeah, I know how that works.
> 
> > IMHO KVM side IBPB for most use cases isn’t really necessarily but 
> > the general concept is that you want to protect vCPU from guest A
> > from guest B, so you issue a prediction barrier on vCPU switch.
> > 
> > *however* that protection already happens in switch_mm(), because
> > guest A and B are likely to use different mm_struct, so the only point
> > of having this support in KVM seems to be to “kill it with fire” for 
> > paranoid users who might be doing some tomfoolery that would 
> > somehow bypass switch_mm() protection (such as somehow 
> > sharing a struct).
> 
> Yeah, no, this all sounds like something highly hypothetical or there's
> a use case of which you don't want to talk about publicly.

What Jon is trying to do is eliminate IBPB that already exists in KVM.  The catch
is that, in theory, someone not-Jon could be running multiple VMs in a single
address space, e.g. VM-based containers.  So if we simply delete the IBPB, then
we could theoretically and silently break a user.  That's why there's a bunch of
hand-waving.

> Either way, from what I'm reading I'm not in the least convinced that
> this is needed.

Can you clarify what "this" is?  Does "this" mean "this patch", or does it mean
"this IBPB when switching vCPUs"?  Because if it means the latter, then I think
you're in violent agreement; the IBPB when switching vCPUs is pointless and
unnecessary.
Borislav Petkov April 29, 2022, 8:59 p.m. UTC | #7
On Fri, Apr 29, 2022 at 08:29:30PM +0000, Sean Christopherson wrote:
> That's why there's a bunch of hand-waving.

Well, I'm still not sure what this patch is trying to fix but both your
latest replies do sound clearer...

> Can you clarify what "this" is?  Does "this" mean "this patch", or does it mean

This patch.

> "this IBPB when switching vCPUs"?  Because if it means the latter, then I think
> you're in violent agreement; the IBPB when switching vCPUs is pointless and
> unnecessary.

Ok, let's concentrate on the bug first - whether a second IBPB - so to
speak - is needed. Doing some git archeology points to:

  15d45071523d ("KVM/x86: Add IBPB support")

which - and I'm surprised - goes to great lengths to explain what
those IBPB calls in KVM protect against. From that commit message, for
example:

"    * Mitigate attacks from guest/ring3->host/ring3.
      These would require a IBPB during context switch in host, or after
      VMEXIT."

so with my very limited virt understanding, when you vmexit, you don't
do switch_mm(), right?

If so, you need to do a barrier. Regardless of conditional IBPB or not
as you want to protect the host from a malicious guest.

In general, the whole mitigation strategies are enumerated in

Documentation/admin-guide/hw-vuln/spectre.rst

There's also a "3. VM mitigation" section.

And so on...

Bottomline is this: at the time, we went to great lengths to document
what the attacks are and how we are protecting against them.

So now, if you want to change all that, I'd like to see

- the attack scenario is this

- we don't think it is relevant because...

- therefore, we don't need to protect against it anymore

and all that should be properly documented.

Otherwise, there's no surviving this mess.

Thx!
Sean Christopherson April 29, 2022, 9:59 p.m. UTC | #8
On Fri, Apr 29, 2022, Borislav Petkov wrote:
> On Fri, Apr 29, 2022 at 08:29:30PM +0000, Sean Christopherson wrote:
> > That's why there's a bunch of hand-waving.
> 
> Well, I'm still not sure what this patch is trying to fix but both your
> latest replies do sound clearer...
> 
> > Can you clarify what "this" is?  Does "this" mean "this patch", or does it mean
> 
> This patch.
> 
> > "this IBPB when switching vCPUs"?  Because if it means the latter, then I think
> > you're in violent agreement; the IBPB when switching vCPUs is pointless and
> > unnecessary.
> 
> Ok, let's concentrate on the bug first - whether a second IBPB - so to
> speak - is needed. Doing some git archeology points to:
> 
>   15d45071523d ("KVM/x86: Add IBPB support")
> 
> which - and I'm surprised - goes to great lengths to explain what
> those IBPB calls in KVM protect against. From that commit message, for
> example:
> 
> "    * Mitigate attacks from guest/ring3->host/ring3.
>       These would require a IBPB during context switch in host, or after
>       VMEXIT."

Except that changelog snippet doesn't actually state what KVM does, it states what
a hypervisor _could_ do to protect the host from the guest via IBPB.

> so with my very limited virt understanding, when you vmexit, you don't
> do switch_mm(), right?

Correct, but KVM also doesn't do IBPB on VM-Exit (or VM-Entry), nor does KVM do
IBPB before exiting to userspace.  The IBPB we want to whack is issued only when
KVM is switching vCPUs.

> If so, you need to do a barrier. Regardless of conditional IBPB or not
> as you want to protect the host from a malicious guest.
> 
> In general, the whole mitigation strategies are enumerated in
> 
> Documentation/admin-guide/hw-vuln/spectre.rst
> 
> There's also a "3. VM mitigation" section.
> 
> And so on...
> 
> Bottomline is this: at the time, we went to great lengths to document
> what the attacks are and how we are protecting against them.

Except that _none_ of that documentation explains why the hell KVM does IBPB when
switching between vCPUs.  The only item is this snippet from the changelog:

    * Mitigate guests from being attacked by other guests.
      - This is addressed by issing IBPB when we do a guest switch.

And that's the one that I pointed out in v1 as being flawed/wrong, and how Jon
ended up with this patch.

  : But stepping back, why does KVM do its own IBPB in the first place?  The goal is
  : to prevent one vCPU from attacking the next vCPU run on the same pCPU.  But unless
  : userspace is running multiple VMs in the same process/mm_struct, switching vCPUs,
  : i.e. switching tasks, will also switch mm_structs and thus do IBPB via cond_mitigation.
  :
  : If userspace runs multiple VMs in the same process, enables cond_ibpb, _and_ sets
  : TIF_SPEC_IB, then it's being stupid and isn't getting full protection in any case,
  : e.g. if userspace is handling an exit-to-userspace condition for two vCPUs from
  : different VMs, then the kernel could switch between those two vCPUs' tasks without
  : bouncing through KVM and thus without doing KVM's IBPB.
  :
  : I can kinda see doing this for always_ibpb, e.g. if userspace is unaware of spectre
  : and is naively running multiple VMs in the same process.
  :
  : What am I missing?
Borislav Petkov April 29, 2022, 10:22 p.m. UTC | #9
On Fri, Apr 29, 2022 at 09:59:52PM +0000, Sean Christopherson wrote:
> Correct, but KVM also doesn't do IBPB on VM-Exit (or VM-Entry),

Why doesn't it do that? Not needed?

> nor does KVM do IBPB before exiting to userspace.

Same question.

> The IBPB we want to whack is issued only when KVM is switching vCPUs.

Then please document it properly as I've already requested.

> Except that _none_ of that documentation explains why the hell KVM
> does IBPB when switching betwen vCPUs.

Probably because the folks involved in those patches weren't mainly virt
people. Although I see a bunch of virt people on CC on that
patch.

>   : But stepping back, why does KVM do its own IBPB in the first place?  The goal is
>   : to prevent one vCPU from attacking the next vCPU run on the same pCPU.  But unless
>   : userspace is running multiple VMs in the same process/mm_struct, switching vCPUs,
>   : i.e. switching tasks, will also switch mm_structs and thus do IBPB via cond_mitigation.
>   :
>   : If userspace runs multiple VMs in the same process,

This keeps popping up. Who does that? Can I get a real-life example to
such VM-based containers or what the hell that is, pls?

> enables cond_ibpb, _and_ sets
>   : TIF_SPEC_IB, then it's being stupid and isn't getting full protection in any case,
>   : e.g. if userspace is handling an exit-to-userspace condition for two vCPUs from
>   : different VMs, then the kernel could switch between those two vCPUs' tasks without
>   : bouncing through KVM and thus without doing KVM's IBPB.
>   :
>   : I can kinda see doing this for always_ibpb, e.g. if userspace is unaware of spectre
>   : and is naively running multiple VMs in the same process.

So this needs a clearer definition: what protection are we even talking
about when the address spaces of processes are shared? My naïve
thinking would be: none. They're sharing address space - branch pred.
poisoning between the two is the least of their worries.

So to cut to the chase: it sounds to me like you don't want to do IBPB
at all on vCPU switch. And the process switch case is taken care of by
switch_mm().
Sean Christopherson April 29, 2022, 11:23 p.m. UTC | #10
On Sat, Apr 30, 2022, Borislav Petkov wrote:
> On Fri, Apr 29, 2022 at 09:59:52PM +0000, Sean Christopherson wrote:
> > Correct, but KVM also doesn't do IBPB on VM-Exit (or VM-Entry),
> 
> Why doesn't it do that? Not needed?

The host kernel is protected via RETPOLINE and by flushing the RSB immediately
after VM-Exit.

> > nor does KVM do IBPB before exiting to userspace.
> 
> Same question.

I don't know definitively.  My guess is that IBPB is far too costly to do on every
exit, and so the onus was put on userspace to recompile with RETPOLINE.  What I
don't know is why it wasn't implemented as an opt-out feature.

> > The IBPB we want to whack is issued only when KVM is switching vCPUs.
> 
> Then please document it properly as I've already requested.

I'll write up the bits I have my head wrapped around.

> > Except that _none_ of that documentation explains why the hell KVM
> > does IBPB when switching betwen vCPUs.
> 
> Probably because the folks involved in those patches weren't the hell
> mainly virt people. Although I see a bunch of virt people on CC on that
> patch.
> 
> >   : But stepping back, why does KVM do its own IBPB in the first place?  The goal is
> >   : to prevent one vCPU from attacking the next vCPU run on the same pCPU.  But unless
> >   : userspace is running multiple VMs in the same process/mm_struct, switching vCPUs,
> >   : i.e. switching tasks, will also switch mm_structs and thus do IBPB via cond_mitigation.
> >   :
> >   : If userspace runs multiple VMs in the same process,
> 
> This keeps popping up. Who does that? Can I get a real-life example to
> such VM-based containers or what the hell that is, pls?

I don't know of any actual examples.  But, it's trivially easy to create multiple
VMs in a single process, and so proving the negative that no one runs multiple VMs
in a single address space is essentially impossible.

The container thing is just one scenario I can think of where userspace might
actually benefit from sharing an address space, e.g. it would allow backing the
image for large number of VMs with a single set of read-only VMAs.
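
Purely as a hypothetical illustration (error handling and the rest of the
VM/vCPU setup omitted), a single process can trivially end up with two VMs,
and thus two distinct VMCS/VMCB to switch between, under one mm_struct
using nothing but the standard /dev/kvm ioctls:

  #include <fcntl.h>
  #include <sys/ioctl.h>
  #include <linux/kvm.h>

  int main(void)
  {
  	int kvm = open("/dev/kvm", O_RDWR);
  	int vm_a = ioctl(kvm, KVM_CREATE_VM, 0);
  	int vm_b = ioctl(kvm, KVM_CREATE_VM, 0);
  	int vcpu_a = ioctl(vm_a, KVM_CREATE_VCPU, 0);
  	int vcpu_b = ioctl(vm_b, KVM_CREATE_VCPU, 0);

  	/*
  	 * If this process alternated KVM_RUN on vcpu_a and vcpu_b on the
  	 * same pCPU, every switch would change the loaded VMCS/VMCB (and
  	 * hit KVM's IBPB) without ever changing mm_struct, so
  	 * cond_mitigation() would never issue one.
  	 */
  	(void)vcpu_a;
  	(void)vcpu_b;
  	return 0;
  }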

> > enables cond_ibpb, _and_ sets
> >   : TIF_SPEC_IB, then it's being stupid and isn't getting full protection in any case,
> >   : e.g. if userspace is handling an exit-to-userspace condition for two vCPUs from
> >   : different VMs, then the kernel could switch between those two vCPUs' tasks without
> >   : bouncing through KVM and thus without doing KVM's IBPB.
> >   :
> >   : I can kinda see doing this for always_ibpb, e.g. if userspace is unaware of spectre
> >   : and is naively running multiple VMs in the same process.
> 
> So this needs a clearer definition: what protection are we even talking
> about when the address spaces of processes are shared? My naïve
> thinking would be: none. They're sharing address space - branch pred.
> poisoning between the two is the least of their worries.

I truly have no idea, which is part of the reason I brought it up in the first
place.  I'd have happily just whacked KVM's IBPB entirely, but it seemed prudent
to preserve the existing behavior if someone went out of their way to enable
switch_mm_always_ibpb.

> So to cut to the chase: it sounds to me like you don't want to do IBPB
> at all on vCPU switch.

Yes, or do it iff switch_mm_always_ibpb is enabled to maintain "compatibility".

> And the process switch case is taken care of by switch_mm().

Yep.
Borislav Petkov April 30, 2022, 9:50 a.m. UTC | #11
On Fri, Apr 29, 2022 at 11:23:32PM +0000, Sean Christopherson wrote:
> The host kernel is protected via RETPOLINE and by flushing the RSB immediately
> after VM-Exit.

Ah, right.

> I don't know definitively.  My guess is that IBPB is far too costly to do on every
> exit, and so the onus was put on userspace to recompile with RETPOLINE.  What I
> don't know is why it wasn't implemented as an opt-out feature.

Or, we could add the same logic on the exit path as in cond_mitigation()
and test for LAST_USER_MM_IBPB when the host has selected
switch_mm_cond_ibpb and thus allows for certain guests to be
protected...

Although, that use case sounds kinda meh: AFAIU, the attack vector here
would be, protecting the guest from a malicious kernel. I guess this
might matter for encrypted guests tho.

> I'll write up the bits I have my head wrapped around.

That's nice, thanks!

> I don't know of any actual examples.  But, it's trivially easy to create multiple
> VMs in a single process, and so proving the negative that no one runs multiple VMs
> in a single address space is essentially impossible.
> 
> The container thing is just one scenario I can think of where userspace might
> actually benefit from sharing an address space, e.g. it would allow backing the
> image for large number of VMs with a single set of read-only VMAs.

Why I keep harping about this: so let's say we eventually add something
and then months, years from now we cannot find out anymore why that
thing was added. We will likely remove it or start wasting time figuring
out why that thing was added in the first place.

This very questioning keeps popping up almost on a daily basis during
refactoring so I'd like for us to be better at documenting *why* we're
doing a certain solution or function or whatever.

And this is doubly important when it comes to the hw mitigations because
if you look at the problem space and all the possible ifs and whens and
but(t)s, your head will spin in no time.

So I'm really sceptical when there's not even a single actual use case
to a proposed change.

So Jon said something about oversubscription and a lot of vCPU
switching. That should be there in the docs as the use case and
explaining why dropping IBPB during vCPU switches is redundant.

> I truly have no idea, which is part of the reason I brought it up in the first
> place.  I'd have happily just whacked KVM's IBPB entirely, but it seemed prudent
> to preserve the existing behavior if someone went out of their way to enable
> switch_mm_always_ibpb.

So let me try to understand this use case: you have a guest and a bunch
of vCPUs which belong to it. And that guest gets switched between those
vCPUs and KVM does IBPB flushes between those vCPUs.

So either I'm missing something - which is possible - but if not, that
"protection" doesn't make any sense - it is all within the same guest!
So that existing behavior was silly to begin with so we might just as
well kill it.

> Yes, or do it iff switch_mm_always_ibpb is enabled to maintain "compability".

Yap, and I'm questioning even the smallest shred of reasoning for having
that IBPB flush there *at* *all*.

And here's the thing with documenting all that: we will document and
say, IBPB between vCPU switches is non-sensical. Then, when someone comes
later and says, "but but, it makes sense because of X" and we hadn't
thought about X at the time, we will change it and document it again and
this way you'll have everything explicit there, how we arrived at the
current situation and be able to always go, "ah, ok, that's why we did
it."

I hope I'm making some sense...
Jon Kohler April 30, 2022, 2:50 p.m. UTC | #12
> On Apr 30, 2022, at 5:50 AM, Borislav Petkov <bp@alien8.de> wrote:
> 
> On Fri, Apr 29, 2022 at 11:23:32PM +0000, Sean Christopherson wrote:
>> The host kernel is protected via RETPOLINE and by flushing the RSB immediately
>> after VM-Exit.
> 
> Ah, right.
> 
>> I don't know definitively.  My guess is that IBPB is far too costly to do on every
>> exit, and so the onus was put on userspace to recompile with RETPOLINE.  What I
>> don't know is why it wasn't implemented as an opt-out feature.
> 
> Or, we could add the same logic on the exit path as in cond_mitigation()
> and test for LAST_USER_MM_IBPB when the host has selected
> switch_mm_cond_ibpb and thus allows for certain guests to be
> protected...

This is roughly what the v1 of this patch did. *if* we keep it here, to fix this bug
we’d have to bring this logic here for sure.

> 
> Although, that use case sounds kinda meh: AFAIU, the attack vector here
> would be, protecting the guest from a malicious kernel. I guess this
> might matter for encrypted guests tho.
> 
>> I'll write up the bits I have my head wrapped around.
> 
> That's nice, thanks!

Agreed, thx Sean, I really appreciate it.

>> I don't know of any actual examples.  But, it's trivially easy to create multiple
>> VMs in a single process, and so proving the negative that no one runs multiple VMs
>> in a single address space is essentially impossible.
>> 
>> The container thing is just one scenario I can think of where userspace might
>> actually benefit from sharing an address space, e.g. it would allow backing the
>> image for large number of VMs with a single set of read-only VMAs.
> 
> Why I keep harping about this: so let's say we eventually add something
> and then months, years from now we cannot find out anymore why that
> thing was added. We will likely remove it or start wasting time figuring
> out why that thing was added in the first place.
> 
> This very questioning keeps popping up almost on a daily basis during
> refactoring so I'd like for us to be better at documenting *why* we're
> doing a certain solution or function or whatever.

This is 100% a fair ask, I appreciate the diligence, as we’ve all been there
on the ‘other side’ of changes to complex areas and spent hours digging on
git history, LKML threads, SDM/APM, and other sources trying to derive
why the heck something is the way it is. I’m 100% game to make sure this
is a good change, so truly thank you for helping hold a high quality bar.

> And this is doubly important when it comes to the hw mitigations because
> if you look at the problem space and all the possible ifs and whens and
> but(t)s, your head will spin in no time.
> 
> So I'm really sceptical when there's not even a single actual use case
> to a proposed change.
> 
> So Jon said something about oversubscription and a lot of vCPU
> switching. That should be there in the docs as the use case and
> explaining why dropping IBPB during vCPU switches is redundant.
> 
>> I truly have no idea, which is part of the reason I brought it up in the first
>> place.  I'd have happily just whacked KVM's IBPB entirely, but it seemed prudent
>> to preserve the existing behavior if someone went out of their way to enable
>> switch_mm_always_ibpb.
> 
> So let me try to understand this use case: you have a guest and a bunch
> of vCPUs which belong to it. And that guest gets switched between those
> vCPUs and KVM does IBPB flushes between those vCPUs.
> 
> So either I'm missing something - which is possible - but if not, that
> "protection" doesn't make any sense - it is all within the same guest!
> So that existing behavior was silly to begin with so we might just as
> well kill it.

Close, it's not 1 guest with a bunch of vCPUs, it's a bunch of guests with
a small number of vCPUs; that's the small nuance here, which is one of 
the reasons why this was hard to see from the beginning. 

AFAIK, the KVM IBPB is avoided when switching in between vCPUs
belonging to the same vmcs/vmcb (i.e. the same guest), e.g. you could 
have one VM highly oversubscribed to the host and you wouldn’t see
either the KVM IBPB or the switch_mm IBPB. All good. 

Reference vmx_vcpu_load_vmcs() and svm_vcpu_load() and the 
conditionals prior to the barrier.

However, the pain ramps up when you have a bunch of separate guests,
especially with a small number of vCPUs per guest, so the switching is more
likely to be in between completely separate guests. Think small servers or
virtual desktops. That's what the scalability test I described in my previous note
does, which effectively gets us to the point where each and every load is to
a different guest, so we’re firing KVM IBPB each time.

But even then, we’re in agreement that it's silly because the switch_mm()
path takes care of it.

> 
>> Yes, or do it iff switch_mm_always_ibpb is enabled to maintain "compability".
> 
> Yap, and I'm questioning the even smallest shred of reasoning for having
> that IBPB flush there *at* *all*.
> 
> And here's the thing with documenting all that: we will document and
> say, IBPB between vCPU flushes is non-sensical. Then, when someone comes
> later and says, "but but, it makes sense because of X" and we hadn't
> thought about X at the time, we will change it and document it again and
> this way you'll have everything explicit there, how we arrived at the
> current situation and be able to always go, "ah, ok, that's why we did
> it."
> 
> I hope I'm making some sense...

Makes sense to me. Let’s wait for Sean’s writeup to clarify and keep
drilling down on this. This “but but” was exactly why we wanted to leave at
least a “shred” behind :) but agreed, we need to double down on documentation.

> -- 
> Regards/Gruss,
>    Boris.
> 
> https://people.kernel.org/tglx/notes-about-netiquette
Borislav Petkov April 30, 2022, 4:08 p.m. UTC | #13
On Sat, Apr 30, 2022 at 02:50:35PM +0000, Jon Kohler wrote:
> This is 100% a fair ask, I appreciate the diligence, as we’ve all been there
> on the ‘other side’ of changes to complex areas and spend hours digging on
> git history, LKML threads, SDM/APM, and other sources trying to derive
> why the heck something is the way it is.

Yap, that's basically proving my point and why I want stuff to be
properly documented so that the question "why was it done this way" can
always be answered satisfactorily.

> AFAIK, the KVM IBPB is avoided when switching in between vCPUs
> belonging to the same vmcs/vmcb (i.e. the same guest), e.g. you could 
> have one VM highly oversubscribed to the host and you wouldn’t see
> either the KVM IBPB or the switch_mm IBPB. All good. 
> 
> Reference vmx_vcpu_load_vmcs() and svm_vcpu_load() and the 
> conditionals prior to the barrier.

So this is where something's still missing.

> However, the pain ramps up when you have a bunch of separate guests,
> especially with a small amount of vCPUs per guest, so the switching is more
> likely to be in between completely separate guests.

If the guests are completely separate, then it should fall into the
switch_mm() case.

Unless it has something to do with, as I looked at the SVM side of
things, the VMCBs:

	if (sd->current_vmcb != svm->vmcb) {

So it is not only different guests but also within the same guest and
when the VMCB of the vCPU is not the current one.

But then if the VMCB of the vCPU is not the current, per-CPU VMCB, that
CPU ran another guest, so to protect the current guest from that other
guest, its branch pred should be flushed.

But I'm likely missing a virt aspect here so I'd let Sean explain what
the rules are...

Thx.
Jon Kohler May 6, 2022, 3:42 p.m. UTC | #14
> On Apr 30, 2022, at 12:08 PM, Borislav Petkov <bp@alien8.de> wrote:
> 
> On Sat, Apr 30, 2022 at 02:50:35PM +0000, Jon Kohler wrote:
>> This is 100% a fair ask, I appreciate the diligence, as we’ve all been there
>> on the ‘other side’ of changes to complex areas and spend hours digging on
>> git history, LKML threads, SDM/APM, and other sources trying to derive
>> why the heck something is the way it is.
> 
> Yap, that's basically proving my point and why I want stuff to be
> properly documented so that the question "why was it done this way" can
> always be answered satisfactorily.
> 
>> AFAIK, the KVM IBPB is avoided when switching in between vCPUs
>> belonging to the same vmcs/vmcb (i.e. the same guest), e.g. you could 
>> have one VM highly oversubscribed to the host and you wouldn’t see
>> either the KVM IBPB or the switch_mm IBPB. All good. 
>> 
>> Reference vmx_vcpu_load_vmcs() and svm_vcpu_load() and the 
>> conditionals prior to the barrier.
> 
> So this is where something's still missing.
> 
>> However, the pain ramps up when you have a bunch of separate guests,
>> especially with a small amount of vCPUs per guest, so the switching is more
>> likely to be in between completely separate guests.
> 
> If the guests are completely separate, then it should fall into the
> switch_mm() case.
> 
> Unless it has something to do with, as I looked at the SVM side of
> things, the VMCBs:
> 
> 	if (sd->current_vmcb != svm->vmcb) {
> 
> So it is not only different guests but also within the same guest and
> when the VMCB of the vCPU is not the current one.
> 
> But then if VMCB of the vCPU is not the current, per-CPU VMCB, then that
> CPU ran another guest so in order for that other guest to attack the
> current guest, then its branch pred should be flushed.
> 
> But I'm likely missing a virt aspect here so I'd let Sean explain what
> the rules are...

Hey Sean - touching back on this one to see if you were able to get some
thoughts together on this?

Thanks - Jon

> 
> Thx.
> 
> -- 
> Regards/Gruss,
>    Boris.
> 
> https://people.kernel.org/tglx/notes-about-netiquette
Sean Christopherson May 10, 2022, 2:22 p.m. UTC | #15
On Sat, Apr 30, 2022, Jon Kohler wrote:
> 
> > On Apr 30, 2022, at 5:50 AM, Borislav Petkov <bp@alien8.de> wrote:
> > So let me try to understand this use case: you have a guest and a bunch
> > of vCPUs which belong to it. And that guest gets switched between those
> > vCPUs and KVM does IBPB flushes between those vCPUs.
> > 
> > So either I'm missing something - which is possible - but if not, that
> > "protection" doesn't make any sense - it is all within the same guest!
> > So that existing behavior was silly to begin with so we might just as
> > well kill it.
> 
> Close, its not 1 guest with a bunch of vCPU, its a bunch of guests with
> a small amount of vCPUs, thats the small nuance here, which is one of 
> the reasons why this was hard to see from the beginning. 
> 
> AFAIK, the KVM IBPB is avoided when switching in between vCPUs
> belonging to the same vmcs/vmcb (i.e. the same guest), e.g. you could 
> have one VM highly oversubscribed to the host and you wouldn’t see
> either the KVM IBPB or the switch_mm IBPB. All good. 

No, KVM does not avoid IBPB when switching between vCPUs in a single VM.  Every
vCPU has a separate VMCS/VMCB, and so the scenario described above where a single
VM has a bunch of vCPUs running on a limited set of logical CPUs will emit IBPB
on every single switch.
Sean Christopherson May 10, 2022, 2:44 p.m. UTC | #16
On Sat, Apr 30, 2022, Borislav Petkov wrote:
> On Sat, Apr 30, 2022 at 02:50:35PM +0000, Jon Kohler wrote:
> > This is 100% a fair ask, I appreciate the diligence, as we’ve all been there
> > on the ‘other side’ of changes to complex areas and spend hours digging on
> > git history, LKML threads, SDM/APM, and other sources trying to derive
> > why the heck something is the way it is.
> 
> Yap, that's basically proving my point and why I want stuff to be
> properly documented so that the question "why was it done this way" can
> always be answered satisfactorily.
> 
> > AFAIK, the KVM IBPB is avoided when switching in between vCPUs
> > belonging to the same vmcs/vmcb (i.e. the same guest), e.g. you could 
> > have one VM highly oversubscribed to the host and you wouldn’t see
> > either the KVM IBPB or the switch_mm IBPB. All good. 
> > 
> > Reference vmx_vcpu_load_vmcs() and svm_vcpu_load() and the 
> > conditionals prior to the barrier.
> 
> So this is where something's still missing.
> 
> > However, the pain ramps up when you have a bunch of separate guests,
> > especially with a small amount of vCPUs per guest, so the switching is more
> > likely to be in between completely separate guests.
> 
> If the guests are completely separate, then it should fall into the
> switch_mm() case.
> 
> Unless it has something to do with, as I looked at the SVM side of
> things, the VMCBs:
> 
> 	if (sd->current_vmcb != svm->vmcb) {
> 
> So it is not only different guests but also within the same guest and
> when the VMCB of the vCPU is not the current one.

Yep.

> But then if VMCB of the vCPU is not the current, per-CPU VMCB, then that
> CPU ran another guest so in order for that other guest to attack the
> current guest, then its branch pred should be flushed.

That CPU ran a different _vCPU_; whether or not it ran a different guest, i.e. a
different security domain, is unknown.

> But I'm likely missing a virt aspect here so I'd let Sean explain what
> the rules are...

I don't think you're missing anything.  I think the original 15d45071523d ("KVM/x86:
Add IBPB support") was simply wrong.

As I see it:

  1. If the vCPUs belong to the same VM, they are in the same security domain and
     do not need an IBPB.

  2. If the vCPUs belong to different VMs, and each VM is in its own mm_struct,
     defer to switch_mm_irqs_off() to handle IBPB as an mm switch is guaranteed to
     occur prior to loading a vCPU belonging to a different VM.
 
  3. If the vCPUs belong to different VMs, but multiple VMs share an mm_struct,
     then the security benefits of an IBPB when switching vCPUs are dubious, at best.

If we only consider #1 and #2, then KVM doesn't need an IBPB, period.

#3 is the only one that's a grey area.  I have no objection to omitting IBPB entirely
even in that case, because none of us can identify any tangible value in doing so.
Jon Kohler May 10, 2022, 2:49 p.m. UTC | #17
> On May 10, 2022, at 10:22 AM, Sean Christopherson <seanjc@google.com> wrote:
> 
> On Sat, Apr 30, 2022, Jon Kohler wrote:
>> 
>>> On Apr 30, 2022, at 5:50 AM, Borislav Petkov <bp@alien8.de> wrote:
>>> So let me try to understand this use case: you have a guest and a bunch
>>> of vCPUs which belong to it. And that guest gets switched between those
>>> vCPUs and KVM does IBPB flushes between those vCPUs.
>>> 
>>> So either I'm missing something - which is possible - but if not, that
>>> "protection" doesn't make any sense - it is all within the same guest!
>>> So that existing behavior was silly to begin with so we might just as
>>> well kill it.
>> 
>> Close, its not 1 guest with a bunch of vCPU, its a bunch of guests with
>> a small amount of vCPUs, thats the small nuance here, which is one of 
>> the reasons why this was hard to see from the beginning. 
>> 
>> AFAIK, the KVM IBPB is avoided when switching in between vCPUs
>> belonging to the same vmcs/vmcb (i.e. the same guest), e.g. you could 
>> have one VM highly oversubscribed to the host and you wouldn’t see
>> either the KVM IBPB or the switch_mm IBPB. All good. 
> 
> No, KVM does not avoid IBPB when switching between vCPUs in a single VM.  Every
> vCPU has a separate VMCS/VMCB, and so the scenario described above where a single
> VM has a bunch of vCPUs running on a limited set of logical CPUs will emit IBPB
> on every single switch.

Ah! Right, ok, thanks for helping me get my wires uncrossed there. I was getting
confused by the nested optimization made in
5c911beff ("KVM: nVMX: Skip IBPB when switching between vmcs01 and vmcs02").

So the only time we’d *not* issue IBPB is if the current per-vCPU vmcs/vmcb is
still loaded in the non-nested case, or when switching between vmcs01 and vmcs02
of the same vCPU in the nested case.

Walking through my thoughts again here with this fresh in my mind:

In that example, say guest A has vCPU0 and vCPU1 and has to switch
in between the two on the same pCPU; it isn’t doing a switch_mm()
because it's the same mm_struct. However, I’d wager to say that if you
had an attacker on the guest VM, executing an attack on vCPU0 with
the intent of attacking vCPU1 (which is up to run next), you have far
bigger problems, as that would imply the guest is completely
compromised, so why would they even waste time on a complex
prediction attack when they have that level of system access in
the first place?

Going back to the original commit documentation that Boris called out, 
that specifically says:

   * Mitigate guests from being attacked by other guests.
     - This is addressed by issing IBPB when we do a guest switch.

Since you need to go through switch_mm() to change mm_struct from
Guest A to Guest B, it makes no sense to issue the barrier in KVM,
as the kernel is already giving that “for free” (from KVM’s perspective) as
the guest-to-guest transition is already covered by cond_mitigation().

That would apply equally for switches within both nested and non-nested
cases, because switch_mm needs to be called between guests.
Jon Kohler May 10, 2022, 3:03 p.m. UTC | #18
> On May 10, 2022, at 10:44 AM, Sean Christopherson <seanjc@google.com> wrote:
> 
> On Sat, Apr 30, 2022, Borislav Petkov wrote:
>> On Sat, Apr 30, 2022 at 02:50:35PM +0000, Jon Kohler wrote:
>>> This is 100% a fair ask, I appreciate the diligence, as we’ve all been there
>>> on the ‘other side’ of changes to complex areas and spend hours digging on
>>> git history, LKML threads, SDM/APM, and other sources trying to derive
>>> why the heck something is the way it is.
>> 
>> Yap, that's basically proving my point and why I want stuff to be
>> properly documented so that the question "why was it done this way" can
>> always be answered satisfactorily.
>> 
>>> AFAIK, the KVM IBPB is avoided when switching in between vCPUs
>>> belonging to the same vmcs/vmcb (i.e. the same guest), e.g. you could 
>>> have one VM highly oversubscribed to the host and you wouldn’t see
>>> either the KVM IBPB or the switch_mm IBPB. All good. 
>>> 
>>> Reference vmx_vcpu_load_vmcs() and svm_vcpu_load() and the 
>>> conditionals prior to the barrier.
>> 
>> So this is where something's still missing.
>> 
>>> However, the pain ramps up when you have a bunch of separate guests,
>>> especially with a small amount of vCPUs per guest, so the switching is more
>>> likely to be in between completely separate guests.
>> 
>> If the guests are completely separate, then it should fall into the
>> switch_mm() case.
>> 
>> Unless it has something to do with, as I looked at the SVM side of
>> things, the VMCBs:
>> 
>> 	if (sd->current_vmcb != svm->vmcb) {
>> 
>> So it is not only different guests but also within the same guest and
>> when the VMCB of the vCPU is not the current one.
> 
> Yep.
> 
>> But then if VMCB of the vCPU is not the current, per-CPU VMCB, then that
>> CPU ran another guest so in order for that other guest to attack the
>> current guest, then its branch pred should be flushed.
> 
> That CPU ran a different _vCPU_; whether or not it ran a different guest, i.e. a
> different security domain, is unknown.
> 
>> But I'm likely missing a virt aspect here so I'd let Sean explain what
>> the rules are...
> 
> I don't think you're missing anything.  I think the original 15d45071523d ("KVM/x86:
> Add IBPB support") was simply wrong.
> 
> As I see it:
> 
>  1. If the vCPUs belong to the same VM, they are in the same security domain and
>     do not need an IBPB.
> 
>  2. If the vCPUs belong to different VMs, and each VM is in its own mm_struct,
>     defer to switch_mm_irqs_off() to handle IBPB as an mm switch is guaranteed to
>     occur prior to loading a vCPU belonging to a different VM.
> 
>  3. If the vCPUs belong to different VMs, but multiple VMs share an mm_struct,
>     then the security benefits of an IBPB when switching vCPUs are dubious, at best.
> 
> If we only consider #1 and #2, then KVM doesn't need an IBPB, period.
> 
> #3 is the only one that's a grey area.  I have no objection to omitting IBPB entirely
> even in that case, because none of us can identify any tangible value in doing so.

Thanks, Sean. Our messages crossed in flight; I sent a reply to your earlier message
just a bit ago. This is super helpful for framing this up.

What would you think framing the patch like this:

    x86/speculation, KVM: remove IBPB on vCPU load

    Remove IBPB that is done on KVM vCPU load, as the guest-to-guest
    attack surface is already covered by switch_mm_irqs_off() ->
    cond_mitigation().

    The original 15d45071523d ("KVM/x86: Add IBPB support") was simply wrong in
    its guest-to-guest design intention. There are three scenarios at play
    here:

    1. If the vCPUs belong to the same VM, they are in the same security 
    domain and do not need an IBPB.
    2. If the vCPUs belong to different VMs, and each VM is in its own mm_struct,
    switch_mm_irqs_off() will handle IBPB as an mm switch is guaranteed to
    occur prior to loading a vCPU belonging to a different VM.
    3. If the vCPUs belong to different VMs, but multiple VMs share an mm_struct,
    then the security benefits of an IBPB when switching vCPUs are dubious, 
    at best.

    Issuing IBPB from KVM vCPU load would only cover #3, but there are no
    real world tangible use cases for such a configuration. If multiple VMs
    are sharing an mm_struct, prediction attacks are the least of their
    security worries.

    Fixes: 15d45071523d ("KVM/x86: Add IBPB support")
    (Reviewed-by/Signed-off-by tags from people here)
    (Code change simply whacks IBPB in KVM vmx/svm and that's it)
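
For clarity, the change on the SVM side would roughly look like this (sketch
only; the vmx_vcpu_load_vmcs() side is analogous):

	if (sd->current_vmcb != svm->vmcb) {
		sd->current_vmcb = svm->vmcb;
		/*
		 * No IBPB here: switching to a vCPU of a different VM that
		 * lives in a different process already went through
		 * switch_mm_irqs_off() -> cond_mitigation() before this point.
		 */
	}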
Sean Christopherson May 10, 2022, 3:50 p.m. UTC | #19
On Tue, May 10, 2022, Jon Kohler wrote:
> 
> > On May 10, 2022, at 10:44 AM, Sean Christopherson <seanjc@google.com> wrote:
> > 
> > On Sat, Apr 30, 2022, Borislav Petkov wrote:
> >> But I'm likely missing a virt aspect here so I'd let Sean explain what
> >> the rules are...
> > 
> > I don't think you're missing anything.  I think the original 15d45071523d ("KVM/x86:
> > Add IBPB support") was simply wrong.
> > 
> > As I see it:
> > 
> >  1. If the vCPUs belong to the same VM, they are in the same security domain and
> >     do not need an IBPB.
> > 
> >  2. If the vCPUs belong to different VMs, and each VM is in its own mm_struct,
> >     defer to switch_mm_irqs_off() to handle IBPB as an mm switch is guaranteed to
> >     occur prior to loading a vCPU belonging to a different VM.
> > 
> >  3. If the vCPUs belong to different VMs, but multiple VMs share an mm_struct,
> >     then the security benefits of an IBPB when switching vCPUs are dubious, at best.
> > 
> > If we only consider #1 and #2, then KVM doesn't need an IBPB, period.
> > 
> > #3 is the only one that's a grey area.  I have no objection to omitting IBPB entirely
> > even in that case, because none of us can identify any tangible value in doing so.
> 
> Thanks, Sean. Our messages crossed in flight, I sent a reply to your earlier message
> just a bit ago. This is super helpful to frame this up.
> 
> What would you think framing the patch like this:
> 
>     x86/speculation, KVM: remove IBPB on vCPU load
> 
>     Remove IBPB that is done on KVM vCPU load, as the guest-to-guest
>     attack surface is already covered by switch_mm_irqs_off() ->
>     cond_mitigation().
> 
>     The original 15d45071523d ("KVM/x86: Add IBPB support") was simply wrong in
>     its guest-to-guest design intention. There are three scenarios at play
>     here:
> 
>     1. If the vCPUs belong to the same VM, they are in the same security 
>     domain and do not need an IBPB.
>     2. If the vCPUs belong to different VMs, and each VM is in its own mm_struct,
>     switch_mm_irqs_off() will handle IBPB as an mm switch is guaranteed to
>     occur prior to loading a vCPU belonging to a different VM.
>     3. If the vCPUs belong to different VMs, but multiple VMs share an mm_struct,
>     then the security benefits of an IBPB when switching vCPUs are dubious, 
>     at best.
> 
>     Issuing IBPB from KVM vCPU load would only cover #3, but there are no

Just to hedge, there are no _known_ use cases.

>     real world tangible use cases for such a configuration.

and I would further qualify this with:

      but there are no known real world, tangible use cases for running multiple
      VMs belonging to different security domains in a shared address space.

Running multiple VMs in a single address space is plausible and sane, _if_ they
are all in the same security domain or security is not a concern.  That way the
statement isn't invalidated if someone pops up with a use case for running multiple
VMs but has no security story.

Other than that, LGTM.

>     If multiple VMs
>     are sharing an mm_struct, prediction attacks are the least of their
>     security worries.
> 
>     Fixes: 15d45071523d ("KVM/x86: Add IBPB support")
>     (Reviewed-by/Signed-off-by tags from people here)
>     (Code change simply whacks IBPB in KVM vmx/svm and that's it)
> 
> 
>
Borislav Petkov May 12, 2022, 1:44 p.m. UTC | #20
On Tue, May 10, 2022 at 03:50:31PM +0000, Sean Christopherson wrote:
> >     x86/speculation, KVM: remove IBPB on vCPU load
> > 
> >     Remove IBPB that is done on KVM vCPU load, as the guest-to-guest
> >     attack surface is already covered by switch_mm_irqs_off() ->
> >     cond_mitigation().
> > 
> >     The original 15d45071523d ("KVM/x86: Add IBPB support") was simply wrong in
> >     its guest-to-guest design intention. There are three scenarios at play
> >     here:
> > 
> >     1. If the vCPUs belong to the same VM, they are in the same security 
> >     domain and do not need an IBPB.
> >     2. If the vCPUs belong to different VMs, and each VM is in its own mm_struct,
> >     switch_mm_irqs_off() will handle IBPB as an mm switch is guaranteed to
> >     occur prior to loading a vCPU belonging to a different VM.
> >     3. If the vCPUs belong to different VMs, but multiple VMs share an mm_struct,
> >     then the security benefits of an IBPB when switching vCPUs are dubious, 
> >     at best.
> > 
> >     Issuing IBPB from KVM vCPU load would only cover #3, but there are no
> 
> Just to hedge, there are no _known_ use cases.
> 
> >     real world tangible use cases for such a configuration.
> 
> and I would further qualify this with:
> 
>       but there are no known real world, tangible use cases for running multiple
>       VMs belonging to different security domains in a shared address space.
> 
> Running multiple VMs in a single address space is plausible and sane, _if_ they
> are all in the same security domain or security is not a concern.  That way the
> statement isn't invalidated if someone pops up with a use case for running multiple
> VMs but has no security story.
> 
> Other than that, LGTM.
> 
> >     If multiple VMs
> >     are sharing an mm_struct, prediction attacks are the least of their
> >     security worries.
> > 
> >     Fixes: 15d45071523d ("KVM/x86: Add IBPB support")
> >     (Reviewed-by/Signed-off-by tags from people here)
> >     (Code change simply whacks IBPB in KVM vmx/svm and that's it)

I agree with all that I've read so far - the only thing that's missing is:

	(Documentation in Documentation/admin-guide/hw-vuln/spectre.rst about what the use
	 cases are and what we're protecting against and what we're *not* protecting
	 against because <raisins>).

Thx.
Jon Kohler May 12, 2022, 5:56 p.m. UTC | #21
> On May 12, 2022, at 9:44 AM, Borislav Petkov <bp@alien8.de> wrote:
> 
> On Tue, May 10, 2022 at 03:50:31PM +0000, Sean Christopherson wrote:
>>>    x86/speculation, KVM: remove IBPB on vCPU load
>>> 
>>>    Remove IBPB that is done on KVM vCPU load, as the guest-to-guest
>>>    attack surface is already covered by switch_mm_irqs_off() ->
>>>    cond_mitigation().
>>> 
>>>    The original 15d45071523d ("KVM/x86: Add IBPB support") was simply wrong in
>>>    its guest-to-guest design intention. There are three scenarios at play
>>>    here:
>>> 
>>>    1. If the vCPUs belong to the same VM, they are in the same security 
>>>    domain and do not need an IBPB.
>>>    2. If the vCPUs belong to different VMs, and each VM is in its own mm_struct,
>>>    switch_mm_irqs_off() will handle IBPB as an mm switch is guaranteed to
>>>    occur prior to loading a vCPU belonging to a different VM.
>>>    3. If the vCPUs belong to different VMs, but multiple VMs share an mm_struct,
>>>    then the security benefits of an IBPB when switching vCPUs are dubious, 
>>>    at best.
>>> 
>>>    Issuing IBPB from KVM vCPU load would only cover #3, but there are no
>> 
>> Just to hedge, there are no _known_ use cases.
>> 
>>>    real world tangible use cases for such a configuration.
>> 
>> and I would further qualify this with:
>> 
>>      but there are no known real world, tangible use cases for running multiple
>>      VMs belonging to different security domains in a shared address space.
>> 
>> Running multiple VMs in a single address space is plausible and sane, _if_ they
>> are all in the same security domain or security is not a concern.  That way the
>> statement isn't invalidated if someone pops up with a use case for running multiple
>> VMs but has no security story.
>> 
>> Other than that, LGTM.
>> 
>>>    If multiple VMs
>>>    are sharing an mm_struct, prediction attacks are the least of their
>>>    security worries.
>>> 
>>>    Fixes: 15d45071523d ("KVM/x86: Add IBPB support")
>>>    (Reviewed-by/Signed-off-by tags from people here)
>>>    (Code change simply whacks IBPB in KVM vmx/svm and that's it)
> 
> I agree with all that I've read so far - the only thing that's missing is:
> 
> 	(Documentation in Documentation/admin-guide/hw-vuln/spectre.rst about what the use
> 	 cases are and what we're protecting against and what we're *not* protecting
> 	 against because <raisins>).
> 
> Thx.

Ok, thanks Boris. I’ll review that doc and make those modifications in v4, and make sure
that you are cc’d.

Thanks again,
Jon

> 
> -- 
> Regards/Gruss,
>    Boris.
> 
> https://people.kernel.org/tglx/notes-about-netiquette
diff mbox series

Patch

diff --git a/arch/x86/include/asm/spec-ctrl.h b/arch/x86/include/asm/spec-ctrl.h
index 5393babc0598..99d3341d2e21 100644
--- a/arch/x86/include/asm/spec-ctrl.h
+++ b/arch/x86/include/asm/spec-ctrl.h
@@ -85,4 +85,18 @@  static inline void speculative_store_bypass_ht_init(void) { }
 extern void speculation_ctrl_update(unsigned long tif);
 extern void speculation_ctrl_update_current(void);

+/*
+ * Issue IBPB when switching guest vCPUs IFF switch_mm_always_ibpb.
+ * For the more common case of running VMs in their own dedicated process,
+ * switching vCPUs that belong to different VMs, i.e. switching tasks,
+ * will also switch mm_structs and thus do IBPB via cond_mitigation();
+ * however, in the always_ibpb case, take a paranoid approach and issue
+ * IBPB on both switch_mm() and vCPU switch.
+ */
+static inline void x86_virt_guest_switch_ibpb(void)
+{
+	if (static_branch_unlikely(&switch_mm_always_ibpb))
+		indirect_branch_prediction_barrier();
+}
+
 #endif
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index 6296e1ebed1d..6aafb0279cbc 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -68,8 +68,12 @@  u64 __ro_after_init x86_amd_ls_cfg_ssbd_mask;
 DEFINE_STATIC_KEY_FALSE(switch_to_cond_stibp);
 /* Control conditional IBPB in switch_mm() */
 DEFINE_STATIC_KEY_FALSE(switch_mm_cond_ibpb);
-/* Control unconditional IBPB in switch_mm() */
+/* Control unconditional IBPB in both switch_mm() and
+ * x86_virt_guest_switch_ibpb().
+ * See notes on x86_virt_guest_switch_ibpb() for KVM use case details.
+ */
 DEFINE_STATIC_KEY_FALSE(switch_mm_always_ibpb);
+EXPORT_SYMBOL_GPL(switch_mm_always_ibpb);

 /* Control MDS CPU buffer clear before returning to user space */
 DEFINE_STATIC_KEY_FALSE(mds_user_clear);
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index bd4c64b362d2..fc08c94df888 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1302,7 +1302,7 @@  static void svm_vcpu_load(struct kvm_vcpu *vcpu, int cpu)

 	if (sd->current_vmcb != svm->vmcb) {
 		sd->current_vmcb = svm->vmcb;
-		indirect_branch_prediction_barrier();
+		x86_virt_guest_switch_ibpb();
 	}
 	if (kvm_vcpu_apicv_active(vcpu))
 		__avic_vcpu_load(vcpu, cpu);
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 04d170c4b61e..a8eed9b8221b 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -1270,7 +1270,7 @@  void vmx_vcpu_load_vmcs(struct kvm_vcpu *vcpu, int cpu,
 		 * The L1 VMM can protect itself with retpolines, IBPB or IBRS.
 		 */
 		if (!buddy || WARN_ON_ONCE(buddy->vmcs != prev))
-			indirect_branch_prediction_barrier();
+			x86_virt_guest_switch_ibpb();
 	}

 	if (!already_loaded) {