diff mbox series

x86/bhi: avoid hardware mitigation for 'spectre_bhi=vmexit'

Message ID 20240912141156.231429-1-jon@nutanix.com (mailing list archive)
State New, archived
Headers show
Series x86/bhi: avoid hardware mitigation for 'spectre_bhi=vmexit' | expand

Commit Message

Jon Kohler Sept. 12, 2024, 2:11 p.m. UTC
On hardware that supports BHI_DIS_S/X86_FEATURE_BHI_CTRL, do not use
hardware mitigation when using BHI_MITIGATION_VMEXIT_ONLY, as this
causes the value of MSR_IA32_SPEC_CTRL to change, which inflicts
additional KVM overhead.

Example: In a typical eIBRS enabled system, such as Intel SPR, the
SPEC_CTRL may be commonly set to val == 1 to reflect eIBRS enablement;
however, SPEC_CTRL_BHI_DIS_S causes val == 1025. If the guests that
KVM is virtualizing do not also set the guest side value == 1025,
KVM will constantly have to wrmsr toggle the guest vs host value on
both entry and exit, delaying both.

Signed-off-by: Jon Kohler <jon@nutanix.com>
---
 arch/x86/kernel/cpu/bugs.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

Comments

Pawan Gupta Sept. 12, 2024, 3:14 p.m. UTC | #1
On Thu, Sep 12, 2024 at 07:11:56AM -0700, Jon Kohler wrote:
> On hardware that supports BHI_DIS_S/X86_FEATURE_BHI_CTRL, do not use
> hardware mitigation when using BHI_MITIGATION_VMEXIT_ONLY, as this
> causes the value of MSR_IA32_SPEC_CTRL to change, which inflicts
> additional KVM overhead.
> 
> Example: In a typical eIBRS enabled system, such as Intel SPR, the
> SPEC_CTRL may be commonly set to val == 1 to reflect eIBRS enablement;
> however, SPEC_CTRL_BHI_DIS_S causes val == 1025. If the guests that
> KVM is virtualizing do not also set the guest side value == 1025,
> KVM will constantly have to wrmsr toggle the guest vs host value on
> both entry and exit, delaying both.
> 
> Signed-off-by: Jon Kohler <jon@nutanix.com>
> ---
>  arch/x86/kernel/cpu/bugs.c | 12 ++++++++++--
>  1 file changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
> index 45675da354f3..df7535f5e882 100644
> --- a/arch/x86/kernel/cpu/bugs.c
> +++ b/arch/x86/kernel/cpu/bugs.c
> @@ -1662,8 +1662,16 @@ static void __init bhi_select_mitigation(void)
>  			return;
>  	}
>  
> -	/* Mitigate in hardware if supported */
> -	if (spec_ctrl_bhi_dis())
> +	/*
> +	 * Mitigate in hardware if appropriate.
> +	 * Note: for vmexit only, do not mitigate in hardware to avoid changing
> +	 * the value of MSR_IA32_SPEC_CTRL to include SPEC_CTRL_BHI_DIS_S. If a
> +	 * guest does not also set their own SPEC_CTRL to include this, KVM has
> +	 * to toggle on every vmexit and vmentry if the host value does not
> +	 * match the guest value. Instead, depend on software loop mitigation
> +	 * only.
> +	 */
> +	if (bhi_mitigation != BHI_MITIGATION_VMEXIT_ONLY && spec_ctrl_bhi_dis())
>  		return;

This makes the system vulnerable. The current software loop is not
effective on parts that support BHI_DIS_S. There is a separate loop for
SPR, see Listing 2(long sequence) in Software BHB-clearing sequence
section here:

  https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/branch-history-injection.html

It is only worth implementing the long sequence in VMEXIT_ONLY mode if it is
significantly better than toggling the MSR.
Jon Kohler Sept. 12, 2024, 3:44 p.m. UTC | #2
> On Sep 12, 2024, at 11:14 AM, Pawan Gupta <pawan.kumar.gupta@linux.intel.com> wrote:
> 
> !-------------------------------------------------------------------|
>  CAUTION: External Email
> 
> |-------------------------------------------------------------------!
> 
> On Thu, Sep 12, 2024 at 07:11:56AM -0700, Jon Kohler wrote:
>> On hardware that supports BHI_DIS_S/X86_FEATURE_BHI_CTRL, do not use
>> hardware mitigation when using BHI_MITIGATION_VMEXIT_ONLY, as this
>> causes the value of MSR_IA32_SPEC_CTRL to change, which inflicts
>> additional KVM overhead.
>> 
>> Example: In a typical eIBRS enabled system, such as Intel SPR, the
>> SPEC_CTRL may be commonly set to val == 1 to reflect eIBRS enablement;
>> however, SPEC_CTRL_BHI_DIS_S causes val == 1025. If the guests that
>> KVM is virtualizing do not also set the guest side value == 1025,
>> KVM will constantly have to wrmsr toggle the guest vs host value on
>> both entry and exit, delaying both.
>> 
>> Signed-off-by: Jon Kohler <jon@nutanix.com>
>> ---
>> arch/x86/kernel/cpu/bugs.c | 12 ++++++++++--
>> 1 file changed, 10 insertions(+), 2 deletions(-)
>> 
>> diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
>> index 45675da354f3..df7535f5e882 100644
>> --- a/arch/x86/kernel/cpu/bugs.c
>> +++ b/arch/x86/kernel/cpu/bugs.c
>> @@ -1662,8 +1662,16 @@ static void __init bhi_select_mitigation(void)
>> return;
>> }
>> 
>> - /* Mitigate in hardware if supported */
>> - if (spec_ctrl_bhi_dis())
>> + /*
>> +  * Mitigate in hardware if appropriate.
>> +  * Note: for vmexit only, do not mitigate in hardware to avoid changing
>> +  * the value of MSR_IA32_SPEC_CTRL to include SPEC_CTRL_BHI_DIS_S. If a
>> +  * guest does not also set their own SPEC_CTRL to include this, KVM has
>> +  * to toggle on every vmexit and vmentry if the host value does not
>> +  * match the guest value. Instead, depend on software loop mitigation
>> +  * only.
>> +  */
>> + if (bhi_mitigation != BHI_MITIGATION_VMEXIT_ONLY && spec_ctrl_bhi_dis())
>> return;
> 
> This makes the system vulnerable. The current software loop is not
> effective on parts that support BHI_DIS_S. There is a separate loop for
> SPR, see Listing 2(long sequence) in Software BHB-clearing sequence
> section here:
> 
>  https://urldefense.proofpoint.com/v2/url?u=https-3A__www.intel.com_content_www_us_en_developer_articles_technical_software-2Dsecurity-2Dguidance_technical-2Ddocumentation_branch-2Dhistory-2Dinjection.html&d=DwIBAg&c=s883GpUCOChKOHiocYtGcg&r=NGPRGGo37mQiSXgHKm5rCQ&m=65GTWhwjPejAvs_QaMrB6pw1KqVI5zZUx6r6hsaSoPHDDulpwh96Q7zedYS6CGoT&s=bQ8X4mjfOPUZgWUx7YQA4rkDADXoOX2-pmn-65xPL9g&e=
> 
> It is only worth implementing the long sequence in VMEXIT_ONLY mode if it is
> significantly better than toggling the MSR.

Thanks for the pointer! I hadn’t seen that second sequence. I’ll do measurements on
three cases and come back with data from an SPR system.
1. as-is (wrmsr on entry and exit)
2. Short sequence (as a baseline)
3. Long sequence
Pawan Gupta Sept. 12, 2024, 4:24 p.m. UTC | #3
On Thu, Sep 12, 2024 at 03:44:38PM +0000, Jon Kohler wrote:
> > It is only worth implementing the long sequence in VMEXIT_ONLY mode if it is
> > significantly better than toggling the MSR.
> 
> Thanks for the pointer! I hadn’t seen that second sequence. I’ll do measurements on
> three cases and come back with data from an SPR system.
> 1. as-is (wrmsr on entry and exit)
> 2. Short sequence (as a baseline)
> 3. Long sequence

I wonder if virtual SPEC_CTRL feature introduced in below series can
provide speedup, as it can replace the MSR toggling with faster VMCS
operations:

  https://lore.kernel.org/kvm/20240410143446.797262-1-chao.gao@intel.com/

Adding Chao for their opinion.
Chao Gao Sept. 13, 2024, 5:28 a.m. UTC | #4
On Thu, Sep 12, 2024 at 09:24:40AM -0700, Pawan Gupta wrote:
>On Thu, Sep 12, 2024 at 03:44:38PM +0000, Jon Kohler wrote:
>> > It is only worth implementing the long sequence in VMEXIT_ONLY mode if it is
>> > significantly better than toggling the MSR.
>> 
>> Thanks for the pointer! I hadn’t seen that second sequence. I’ll do measurements on
>> three cases and come back with data from an SPR system.
>> 1. as-is (wrmsr on entry and exit)
>> 2. Short sequence (as a baseline)
>> 3. Long sequence
>
>I wonder if virtual SPEC_CTRL feature introduced in below series can
>provide speedup, as it can replace the MSR toggling with faster VMCS
>operations:

"virtual SPEC_CTRL" won't provide speedup. the wrmsr on entry/exit is still
need if guest's (effective) value and host's value are different.

"virtual SPEC_CTRL" just prevents guests from toggling some bits. It doesn't
switch the MSR between guest value and host value on entry/exit. so, KVM has
to do the switching with wrmsr/rdmsr instructions. A new feature, "load
IA32_SPEC_CTRL" VMX control (refer to Chapter 15 in ISE spec[*]), can help but
it isn't supported on SPR.

[*]: https://cdrdv2.intel.com/v1/dl/getContent/671368

>
>  https://lore.kernel.org/kvm/20240410143446.797262-1-chao.gao@intel.com/
>
>Adding Chao for their opinion.
Chao Gao Sept. 13, 2024, 5:39 a.m. UTC | #5
On Thu, Sep 12, 2024 at 07:11:56AM -0700, Jon Kohler wrote:
>On hardware that supports BHI_DIS_S/X86_FEATURE_BHI_CTRL, do not use
>hardware mitigation when using BHI_MITIGATION_VMEXIT_ONLY, as this
>causes the value of MSR_IA32_SPEC_CTRL to change, which inflicts
>additional KVM overhead.
>
>Example: In a typical eIBRS enabled system, such as Intel SPR, the
>SPEC_CTRL may be commonly set to val == 1 to reflect eIBRS enablement;
>however, SPEC_CTRL_BHI_DIS_S causes val == 1025. If the guests that
>KVM is virtualizing do not also set the guest side value == 1025,
>KVM will constantly have to wrmsr toggle the guest vs host value on
>both entry and exit, delaying both.

Putting aside the security concern, this patch isn't a net positive
because it causes additional overhead to guests with spec_ctrl = 1025.

>
>Signed-off-by: Jon Kohler <jon@nutanix.com>
>---
> arch/x86/kernel/cpu/bugs.c | 12 ++++++++++--
> 1 file changed, 10 insertions(+), 2 deletions(-)
>
>diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
>index 45675da354f3..df7535f5e882 100644
>--- a/arch/x86/kernel/cpu/bugs.c
>+++ b/arch/x86/kernel/cpu/bugs.c
>@@ -1662,8 +1662,16 @@ static void __init bhi_select_mitigation(void)
> 			return;
> 	}
> 
>-	/* Mitigate in hardware if supported */
>-	if (spec_ctrl_bhi_dis())
>+	/*
>+	 * Mitigate in hardware if appropriate.
>+	 * Note: for vmexit only, do not mitigate in hardware to avoid changing
>+	 * the value of MSR_IA32_SPEC_CTRL to include SPEC_CTRL_BHI_DIS_S. If a
>+	 * guest does not also set their own SPEC_CTRL to include this, KVM has
>+	 * to toggle on every vmexit and vmentry if the host value does not
>+	 * match the guest value. Instead, depend on software loop mitigation
>+	 * only.
>+	 */
>+	if (bhi_mitigation != BHI_MITIGATION_VMEXIT_ONLY && spec_ctrl_bhi_dis())
> 		return;
> 
> 	if (!IS_ENABLED(CONFIG_X86_64))
>-- 
>2.43.0
>
>
Jon Kohler Sept. 13, 2024, 3:51 p.m. UTC | #6
> On Sep 13, 2024, at 1:28 AM, Chao Gao <chao.gao@intel.com> wrote:
> 
> !-------------------------------------------------------------------|
>  CAUTION: External Email
> 
> |-------------------------------------------------------------------!
> 
> On Thu, Sep 12, 2024 at 09:24:40AM -0700, Pawan Gupta wrote:
>> On Thu, Sep 12, 2024 at 03:44:38PM +0000, Jon Kohler wrote:
>>>> It is only worth implementing the long sequence in VMEXIT_ONLY mode if it is
>>>> significantly better than toggling the MSR.
>>> 
>>> Thanks for the pointer! I hadn’t seen that second sequence. I’ll do measurements on
>>> three cases and come back with data from an SPR system.
>>> 1. as-is (wrmsr on entry and exit)
>>> 2. Short sequence (as a baseline)
>>> 3. Long sequence
>> 

Pawan,

Thanks for the pointer to the long sequence. I've tested it along with 
Listing 3 (TSX Abort sequence) using KUT tscdeadline_immed test. TSX 
abort sequence performs better unless BHI mitigation is off or 
host/guest spec_ctrl values match, avoiding WRMSR toggling. Having the
values match the DIS_S value is easier said than done across a fleet
that is already using eIBRS heavily.

Test System:
- Intel Xeon Gold 6442Y, microcode 0x2b0005c0
- Linux 6.6.34 + patches, qemu 8.2
- KVM Unit Tests @ latest (17f6f2fd) with tscdeadline_immed + edits:
- Toggle spec ctrl before test in main()
- Use cpu type SapphireRapids-v2

Test string:
TESTNAME=vmexit_tscdeadline_immed TIMEOUT=90s MACHINE= ACCEL= taskset -c 26 ./x86/run x86/vmexit.flat \
-smp 1 -cpu SapphireRapids-v2,+x2apic,+tsc-deadline -append tscdeadline_immed |grep tscdeadline

Test Results:
1. spectre_bhi=on, host spec_ctrl=1025, guest spec_ctrl=1: tscdeadline_immed 3878 (WRMSR toggling)
2. spectre_bhi=on, host spec_ctrl=1025, guest spec_ctrl=1025: tscdeadline_immed 3153 (no WRMSR toggling)
3. spectre_bhi=vmexit, BHB long sequence, host/guest spec_ctrl=1: tscdeadline_immed 3629 (still better than test 1, penalty only on exit)
4. spectre_bhi=vmexit, TSX abort sequence, host/guest spec_ctrl=1: tscdeadline_immed 3294 (best general purpose performance)
5. spectre_bhi=vmexit, TSX abort sequence, host spec_ctrl=1, guest spec_ctrl=1025: tscdeadline_immed 4011 (needs optimization)

In short, there is a significant speedup to be had here.

As for test 5, honest that is somewhat invalid because it would be
dependent on the VMM user space showing BHI_CTRL.

QEMU as an example does not do that, so even with latest qemu and latest
kernel, guests will still use BHB loop even on SPR++ today, and they
could use the TSX loop with this proposed change if the VMM exposes RTM
feature.

I'm happy to post a V2 patch with my TSX changes, or take any other
suggestions here. 

Thanks all,
Jon

>> I wonder if virtual SPEC_CTRL feature introduced in below series can
>> provide speedup, as it can replace the MSR toggling with faster VMCS
>> operations:
> 
> "virtual SPEC_CTRL" won't provide speedup. the wrmsr on entry/exit is still
> need if guest's (effective) value and host's value are different.
> 
> "virtual SPEC_CTRL" just prevents guests from toggling some bits. It doesn't
> switch the MSR between guest value and host value on entry/exit. so, KVM has
> to do the switching with wrmsr/rdmsr instructions. A new feature, "load
> IA32_SPEC_CTRL" VMX control (refer to Chapter 15 in ISE spec[*]), can help but
> it isn't supported on SPR.
> 
> [*]: https://urldefense.proofpoint.com/v2/url?u=https-3A__cdrdv2.intel.com_v1_dl_getContent_671368&d=DwIDaQ&c=s883GpUCOChKOHiocYtGcg&r=NGPRGGo37mQiSXgHKm5rCQ&m=c7SFjczyXeO5McE4firUZaiOVuLBVwLXAzKV9WQqMqKCCEwSvVk0V4cko-falQYo&s=-hskrlhrR4iuT2sz0KkGJn7hCSAGIteu3_TGQzPgh8I&e= 
> 
>> 
>> https://urldefense.proofpoint.com/v2/url?u=https-3A__lore.kernel.org_kvm_20240410143446.797262-2D1-2Dchao.gao-40intel.com_&d=DwIDaQ&c=s883GpUCOChKOHiocYtGcg&r=NGPRGGo37mQiSXgHKm5rCQ&m=c7SFjczyXeO5McE4firUZaiOVuLBVwLXAzKV9WQqMqKCCEwSvVk0V4cko-falQYo&s=rsaEdAN9KEjtAMSN-ke4x4R87FgfxsvCsdwbCFk7VOE&e= 
>> 
>> Adding Chao for their opinion.
Jon Kohler Sept. 13, 2024, 3:52 p.m. UTC | #7
> On Sep 13, 2024, at 1:39 AM, Chao Gao <chao.gao@intel.com> wrote:
> 
> !-------------------------------------------------------------------|
>  CAUTION: External Email
> 
> |-------------------------------------------------------------------!
> 
> On Thu, Sep 12, 2024 at 07:11:56AM -0700, Jon Kohler wrote:
>> On hardware that supports BHI_DIS_S/X86_FEATURE_BHI_CTRL, do not use
>> hardware mitigation when using BHI_MITIGATION_VMEXIT_ONLY, as this
>> causes the value of MSR_IA32_SPEC_CTRL to change, which inflicts
>> additional KVM overhead.
>> 
>> Example: In a typical eIBRS enabled system, such as Intel SPR, the
>> SPEC_CTRL may be commonly set to val == 1 to reflect eIBRS enablement;
>> however, SPEC_CTRL_BHI_DIS_S causes val == 1025. If the guests that
>> KVM is virtualizing do not also set the guest side value == 1025,
>> KVM will constantly have to wrmsr toggle the guest vs host value on
>> both entry and exit, delaying both.
> 
> Putting aside the security concern, this patch isn't a net positive
> because it causes additional overhead to guests with spec_ctrl = 1025.

See my other note with a different approach and testing to match.

Responding to this point here that this would require VMM to expose BHI_CTRL
which for example QEMU does not, so this is not yet a problem at least
today

> 
>> 
>> Signed-off-by: Jon Kohler <jon@nutanix.com>
>> ---
>> arch/x86/kernel/cpu/bugs.c | 12 ++++++++++--
>> 1 file changed, 10 insertions(+), 2 deletions(-)
>> 
>> diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
>> index 45675da354f3..df7535f5e882 100644
>> --- a/arch/x86/kernel/cpu/bugs.c
>> +++ b/arch/x86/kernel/cpu/bugs.c
>> @@ -1662,8 +1662,16 @@ static void __init bhi_select_mitigation(void)
>> return;
>> }
>> 
>> - /* Mitigate in hardware if supported */
>> - if (spec_ctrl_bhi_dis())
>> + /*
>> + * Mitigate in hardware if appropriate.
>> + * Note: for vmexit only, do not mitigate in hardware to avoid changing
>> + * the value of MSR_IA32_SPEC_CTRL to include SPEC_CTRL_BHI_DIS_S. If a
>> + * guest does not also set their own SPEC_CTRL to include this, KVM has
>> + * to toggle on every vmexit and vmentry if the host value does not
>> + * match the guest value. Instead, depend on software loop mitigation
>> + * only.
>> + */
>> + if (bhi_mitigation != BHI_MITIGATION_VMEXIT_ONLY && spec_ctrl_bhi_dis())
>> return;
>> 
>> if (!IS_ENABLED(CONFIG_X86_64))
>> -- 
>> 2.43.0
>> 
>>
Pawan Gupta Sept. 13, 2024, 5:33 p.m. UTC | #8
On Fri, Sep 13, 2024 at 03:51:01PM +0000, Jon Kohler wrote:
> 
> 
> > On Sep 13, 2024, at 1:28 AM, Chao Gao <chao.gao@intel.com> wrote:
> > 
> > !-------------------------------------------------------------------|
> >  CAUTION: External Email
> > 
> > |-------------------------------------------------------------------!
> > 
> > On Thu, Sep 12, 2024 at 09:24:40AM -0700, Pawan Gupta wrote:
> >> On Thu, Sep 12, 2024 at 03:44:38PM +0000, Jon Kohler wrote:
> >>>> It is only worth implementing the long sequence in VMEXIT_ONLY mode if it is
> >>>> significantly better than toggling the MSR.
> >>> 
> >>> Thanks for the pointer! I hadn’t seen that second sequence. I’ll do measurements on
> >>> three cases and come back with data from an SPR system.
> >>> 1. as-is (wrmsr on entry and exit)
> >>> 2. Short sequence (as a baseline)
> >>> 3. Long sequence
> >> 
> 
> Pawan,
> 
> Thanks for the pointer to the long sequence. I've tested it along with 
> Listing 3 (TSX Abort sequence) using KUT tscdeadline_immed test. TSX 
> abort sequence performs better unless BHI mitigation is off or 
> host/guest spec_ctrl values match, avoiding WRMSR toggling. Having the
> values match the DIS_S value is easier said than done across a fleet
> that is already using eIBRS heavily.
> 
> Test System:
> - Intel Xeon Gold 6442Y, microcode 0x2b0005c0
> - Linux 6.6.34 + patches, qemu 8.2
> - KVM Unit Tests @ latest (17f6f2fd) with tscdeadline_immed + edits:
> - Toggle spec ctrl before test in main()
> - Use cpu type SapphireRapids-v2
> 
> Test string:
> TESTNAME=vmexit_tscdeadline_immed TIMEOUT=90s MACHINE= ACCEL= taskset -c 26 ./x86/run x86/vmexit.flat \
> -smp 1 -cpu SapphireRapids-v2,+x2apic,+tsc-deadline -append tscdeadline_immed |grep tscdeadline
> 
> Test Results:
> 1. spectre_bhi=on, host spec_ctrl=1025, guest spec_ctrl=1: tscdeadline_immed 3878 (WRMSR toggling)
> 2. spectre_bhi=on, host spec_ctrl=1025, guest spec_ctrl=1025: tscdeadline_immed 3153 (no WRMSR toggling)
> 3. spectre_bhi=vmexit, BHB long sequence, host/guest spec_ctrl=1: tscdeadline_immed 3629 (still better than test 1, penalty only on exit)
> 4. spectre_bhi=vmexit, TSX abort sequence, host/guest spec_ctrl=1: tscdeadline_immed 3294 (best general purpose performance)

This looks promising.

> 5. spectre_bhi=vmexit, TSX abort sequence, host spec_ctrl=1, guest spec_ctrl=1025: tscdeadline_immed 4011 (needs optimization)

Once QEMU adds support for exposing BHI_CTRL, this is a very likely
scenario. To optimize this, host needs to have BHI_DIS_S set. We also need
to account for the case where some guests set BHI_DIS_S and others dont.

> In short, there is a significant speedup to be had here.
> 
> As for test 5, honest that is somewhat invalid because it would be
> dependent on the VMM user space showing BHI_CTRL.

Right.

> QEMU as an example does not do that, so even with latest qemu and latest
> kernel, guests will still use BHB loop even on SPR++ today, and they
> could use the TSX loop with this proposed change if the VMM exposes RTM
> feature.

I did not know that QEMU does not expose CPUID.BHI_CTRL. Chao, could you
please help getting this feature exposed in QEMU?

> I'm happy to post a V2 patch with my TSX changes, or take any other
> suggestions here.

With CPUID.BHI_CTRL exposed to guests, this:

> 2. spectre_bhi=on, host spec_ctrl=1025, guest spec_ctrl=1025: tscdeadline_immed 3153 (no WRMSR toggling)

will be the most common case, which is also the best performing. Isn't it
better to aim for this?
Jon Kohler Sept. 13, 2024, 6:01 p.m. UTC | #9
> On Sep 13, 2024, at 1:33 PM, Pawan Gupta <pawan.kumar.gupta@linux.intel.com> wrote:
> 
> !-------------------------------------------------------------------|
>  CAUTION: External Email
> 
> |-------------------------------------------------------------------!
> 
> On Fri, Sep 13, 2024 at 03:51:01PM +0000, Jon Kohler wrote:
>> 
>> 
>>> On Sep 13, 2024, at 1:28 AM, Chao Gao <chao.gao@intel.com> wrote:
>>> 
>>> !-------------------------------------------------------------------|
>>> CAUTION: External Email
>>> 
>>> |-------------------------------------------------------------------!
>>> 
>>> On Thu, Sep 12, 2024 at 09:24:40AM -0700, Pawan Gupta wrote:
>>>> On Thu, Sep 12, 2024 at 03:44:38PM +0000, Jon Kohler wrote:
>>>>>> It is only worth implementing the long sequence in VMEXIT_ONLY mode if it is
>>>>>> significantly better than toggling the MSR.
>>>>> 
>>>>> Thanks for the pointer! I hadn’t seen that second sequence. I’ll do measurements on
>>>>> three cases and come back with data from an SPR system.
>>>>> 1. as-is (wrmsr on entry and exit)
>>>>> 2. Short sequence (as a baseline)
>>>>> 3. Long sequence
>>>> 
>> 
>> Pawan,
>> 
>> Thanks for the pointer to the long sequence. I've tested it along with 
>> Listing 3 (TSX Abort sequence) using KUT tscdeadline_immed test. TSX 
>> abort sequence performs better unless BHI mitigation is off or 
>> host/guest spec_ctrl values match, avoiding WRMSR toggling. Having the
>> values match the DIS_S value is easier said than done across a fleet
>> that is already using eIBRS heavily.
>> 
>> Test System:
>> - Intel Xeon Gold 6442Y, microcode 0x2b0005c0
>> - Linux 6.6.34 + patches, qemu 8.2
>> - KVM Unit Tests @ latest (17f6f2fd) with tscdeadline_immed + edits:
>> - Toggle spec ctrl before test in main()
>> - Use cpu type SapphireRapids-v2
>> 
>> Test string:
>> TESTNAME=vmexit_tscdeadline_immed TIMEOUT=90s MACHINE= ACCEL= taskset -c 26 ./x86/run x86/vmexit.flat \
>> -smp 1 -cpu SapphireRapids-v2,+x2apic,+tsc-deadline -append tscdeadline_immed |grep tscdeadline
>> 
>> Test Results:
>> 1. spectre_bhi=on, host spec_ctrl=1025, guest spec_ctrl=1: tscdeadline_immed 3878 (WRMSR toggling)
>> 2. spectre_bhi=on, host spec_ctrl=1025, guest spec_ctrl=1025: tscdeadline_immed 3153 (no WRMSR toggling)
>> 3. spectre_bhi=vmexit, BHB long sequence, host/guest spec_ctrl=1: tscdeadline_immed 3629 (still better than test 1, penalty only on exit)
>> 4. spectre_bhi=vmexit, TSX abort sequence, host/guest spec_ctrl=1: tscdeadline_immed 3294 (best general purpose performance)
> 
> This looks promising.

Thanks! I’ll send out a v2 so you can see how it comes together.

> 
>> 5. spectre_bhi=vmexit, TSX abort sequence, host spec_ctrl=1, guest spec_ctrl=1025: tscdeadline_immed 4011 (needs optimization)
> 
> Once QEMU adds support for exposing BHI_CTRL, this is a very likely
> scenario. To optimize this, host needs to have BHI_DIS_S set. We also need
> to account for the case where some guests set BHI_DIS_S and others dont.

QEMU base enablement is only one part of the puzzle. That would mean
a cpu type change (e.g. SapphireRapids-Vxxx), which VMM control planes
need to pickup (e.g. libvirt), in addition to guest OS’s needing to pick it up too.

Even then, it isn’t always automatic. Windows for example disables their
BHI mitigation by default, requiring admin intervention to manually modify the
registry to enable it: 
https://msrc.microsoft.com/update-guide/vulnerability/CVE-2022-0001 

I don’t know offhand if that is BHI_DIS_S or just a clear loop, it doesn’t say

Server SKUs are disabled by default: https://support.microsoft.com/en-us/topic/kb4072698-windows-server-and-azure-stack-hci-guidance-to-protect-against-silicon-based-microarchitectural-and-speculative-execution-side-channel-vulnerabilities-2f965763-00e2-8f98-b632-0d96f30c8c8e 
Desktop/Client SKUs are disabled by default: https://support.microsoft.com/en-us/topic/kb4073119-windows-client-guidance-for-it-pros-to-protect-against-silicon-based-microarchitectural-and-speculative-execution-side-channel-vulnerabilities-35820a8a-ae13-1299-88cc-357f104f5b11

> 
>> In short, there is a significant speedup to be had here.
>> 
>> As for test 5, honest that is somewhat invalid because it would be
>> dependent on the VMM user space showing BHI_CTRL.
> 
> Right.
> 
>> QEMU as an example does not do that, so even with latest qemu and latest
>> kernel, guests will still use BHB loop even on SPR++ today, and they
>> could use the TSX loop with this proposed change if the VMM exposes RTM
>> feature.
> 
> I did not know that QEMU does not expose CPUID.BHI_CTRL. Chao, could you
> please help getting this feature exposed in QEMU?
> 
>> I'm happy to post a V2 patch with my TSX changes, or take any other
>> suggestions here.
> 
> With CPUID.BHI_CTRL exposed to guests, this:
> 
>> 2. spectre_bhi=on, host spec_ctrl=1025, guest spec_ctrl=1025: tscdeadline_immed 3153 (no WRMSR toggling)
> 
> will be the most common case, which is also the best performing. Isn't it
> better to aim for this?

I agree, but I also honestly think this is a very large hill to climb.

This will only happen when the host and guest have full understanding of this
mitigation and the guest reboots to reinitialize.

In both enterprise and cloud environments, it may be an extremely long time
before there is full alignment between these two points at a broader fleet level. 

In some use cases, such as virtual appliances or older operating systems
that may *never* get updated to understand BHI_CTRL or as I pointed out
for Windows SKUs, Microsoft just straight up disabled it by default, so we’d
be imposing a non-trivial tax on them from the outset.
Jim Mattson Sept. 13, 2024, 6:39 p.m. UTC | #10
On Thu, Sep 12, 2024 at 10:29 PM Chao Gao <chao.gao@intel.com> wrote:
>
> On Thu, Sep 12, 2024 at 09:24:40AM -0700, Pawan Gupta wrote:
> >On Thu, Sep 12, 2024 at 03:44:38PM +0000, Jon Kohler wrote:
> >> > It is only worth implementing the long sequence in VMEXIT_ONLY mode if it is
> >> > significantly better than toggling the MSR.
> >>
> >> Thanks for the pointer! I hadn’t seen that second sequence. I’ll do measurements on
> >> three cases and come back with data from an SPR system.
> >> 1. as-is (wrmsr on entry and exit)
> >> 2. Short sequence (as a baseline)
> >> 3. Long sequence
> >
> >I wonder if virtual SPEC_CTRL feature introduced in below series can
> >provide speedup, as it can replace the MSR toggling with faster VMCS
> >operations:
>
> "virtual SPEC_CTRL" won't provide speedup. the wrmsr on entry/exit is still
> need if guest's (effective) value and host's value are different.

I believe that is the case here. The guest's effective value is 1025.
If the guest knew about BHI_DIS_S, it would actually set it to 1025,
but older guests set it to 1.

The IA32_SPEC_CTRL mask and shadow fields should be perfect for this.

> "virtual SPEC_CTRL" just prevents guests from toggling some bits. It doesn't
> switch the MSR between guest value and host value on entry/exit. so, KVM has
> to do the switching with wrmsr/rdmsr instructions. A new feature, "load
> IA32_SPEC_CTRL" VMX control (refer to Chapter 15 in ISE spec[*]), can help but
> it isn't supported on SPR.
>
> [*]: https://cdrdv2.intel.com/v1/dl/getContent/671368
>
> >
> >  https://lore.kernel.org/kvm/20240410143446.797262-1-chao.gao@intel.com/
> >
> >Adding Chao for their opinion.
>
Jim Mattson Sept. 13, 2024, 11:04 p.m. UTC | #11
On Fri, Sep 13, 2024 at 11:39 AM Jim Mattson <jmattson@google.com> wrote:
>
> On Thu, Sep 12, 2024 at 10:29 PM Chao Gao <chao.gao@intel.com> wrote:
> >
> > On Thu, Sep 12, 2024 at 09:24:40AM -0700, Pawan Gupta wrote:
> > >On Thu, Sep 12, 2024 at 03:44:38PM +0000, Jon Kohler wrote:
> > >> > It is only worth implementing the long sequence in VMEXIT_ONLY mode if it is
> > >> > significantly better than toggling the MSR.
> > >>
> > >> Thanks for the pointer! I hadn’t seen that second sequence. I’ll do measurements on
> > >> three cases and come back with data from an SPR system.
> > >> 1. as-is (wrmsr on entry and exit)
> > >> 2. Short sequence (as a baseline)
> > >> 3. Long sequence
> > >
> > >I wonder if virtual SPEC_CTRL feature introduced in below series can
> > >provide speedup, as it can replace the MSR toggling with faster VMCS
> > >operations:
> >
> > "virtual SPEC_CTRL" won't provide speedup. the wrmsr on entry/exit is still
> > need if guest's (effective) value and host's value are different.
>
> I believe that is the case here. The guest's effective value is 1025.
> If the guest knew about BHI_DIS_S, it would actually set it to 1025,
> but older guests set it to 1.
>
> The IA32_SPEC_CTRL mask and shadow fields should be perfect for this.

In fact, this is the guidance given in
https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/branch-history-injection.html:

The VMM should use the “virtualize IA32_SPEC_CTRL” VM-execution
control to cause BHI_DIS_S to be set (see the VMM Support for
BHB-clearing Software Sequences section) whenever:
o The VMM is running on a processor for which the short software
sequence may not be effective:
  - Specifically, it does not enumerate BHI_NO, but does enumerate
BHI_DIS_S, and is not an Atom-only processor.

In other words, the VMM should set bit 10 in the IA32_SPEC_CTRL mask
on SPR. As long as the *effective* guest IA32_SPEC_CTRL value matches
the host value, there is no need to write the MSR on VM-{entry,exit}.

There is no need to disable BHI_DIS_S on the host and use the TSX
abort sequence in its place.

Besides, with the TSX abort approach, what are you going to do about
guests that *do* set BHI_DIS_S? If that bit is clear on the host,
they'll suffer the overhead of writing the MSR on VM-{entry,exit}.

> > "virtual SPEC_CTRL" just prevents guests from toggling some bits. It doesn't
> > switch the MSR between guest value and host value on entry/exit. so, KVM has
> > to do the switching with wrmsr/rdmsr instructions. A new feature, "load
> > IA32_SPEC_CTRL" VMX control (refer to Chapter 15 in ISE spec[*]), can help but
> > it isn't supported on SPR.
> >
> > [*]: https://cdrdv2.intel.com/v1/dl/getContent/671368
> >
> > >
> > >  https://lore.kernel.org/kvm/20240410143446.797262-1-chao.gao@intel.com/
> > >
> > >Adding Chao for their opinion.
> >
Pawan Gupta Sept. 14, 2024, 12:16 a.m. UTC | #12
On Fri, Sep 13, 2024 at 04:04:56PM -0700, Jim Mattson wrote:
> > The IA32_SPEC_CTRL mask and shadow fields should be perfect for this.
> 
> In fact, this is the guidance given in
> https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/branch-history-injection.html:
> 
> The VMM should use the “virtualize IA32_SPEC_CTRL” VM-execution
> control to cause BHI_DIS_S to be set (see the VMM Support for
> BHB-clearing Software Sequences section) whenever:
> o The VMM is running on a processor for which the short software
> sequence may not be effective:
>   - Specifically, it does not enumerate BHI_NO, but does enumerate
> BHI_DIS_S, and is not an Atom-only processor.
> 
> In other words, the VMM should set bit 10 in the IA32_SPEC_CTRL mask
> on SPR. As long as the *effective* guest IA32_SPEC_CTRL value matches
> the host value, there is no need to write the MSR on VM-{entry,exit}.

With host setting the effective BHI_DIS_S for guest using virtual
SPEC_CTRL, there will be no way for guest to opt-out of BHI mitigation.
Or if the guest is mitigating BHI with the software sequence, it will
still get the hardware mitigation also.

To overcome this, the guest and KVM need to implement
MSR_VIRTUAL_MITIGATION_CTRL to allow guest to opt-out of hardware
mitigation.

> There is no need to disable BHI_DIS_S on the host and use the TSX
> abort sequence in its place.

Exactly.
Jim Mattson Sept. 14, 2024, 2:35 a.m. UTC | #13
On Fri, Sep 13, 2024 at 5:16 PM Pawan Gupta
<pawan.kumar.gupta@linux.intel.com> wrote:
>
> On Fri, Sep 13, 2024 at 04:04:56PM -0700, Jim Mattson wrote:
> > > The IA32_SPEC_CTRL mask and shadow fields should be perfect for this.
> >
> > In fact, this is the guidance given in
> > https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/branch-history-injection.html:
> >
> > The VMM should use the “virtualize IA32_SPEC_CTRL” VM-execution
> > control to cause BHI_DIS_S to be set (see the VMM Support for
> > BHB-clearing Software Sequences section) whenever:
> > o The VMM is running on a processor for which the short software
> > sequence may not be effective:
> >   - Specifically, it does not enumerate BHI_NO, but does enumerate
> > BHI_DIS_S, and is not an Atom-only processor.
> >
> > In other words, the VMM should set bit 10 in the IA32_SPEC_CTRL mask
> > on SPR. As long as the *effective* guest IA32_SPEC_CTRL value matches
> > the host value, there is no need to write the MSR on VM-{entry,exit}.
>
> With host setting the effective BHI_DIS_S for guest using virtual
> SPEC_CTRL, there will be no way for guest to opt-out of BHI mitigation.
> Or if the guest is mitigating BHI with the software sequence, it will
> still get the hardware mitigation also.
>
> To overcome this, the guest and KVM need to implement
> MSR_VIRTUAL_MITIGATION_CTRL to allow guest to opt-out of hardware
> mitigation.

I don't think there is much value in this additional complexity. If
the guest opts out of BHI mitigation, it will pay dearly for it,
because then the effective value of the guest IA32_SPEC_CTRL will not
be the same as the host value, and KVM will have to write the MSR on
every VM-{entry,exit}. That's likely to be a higher cost than
BHI_DIS_S in VMX non-root operation.

> > There is no need to disable BHI_DIS_S on the host and use the TSX
> > abort sequence in its place.
>
> Exactly.
diff mbox series

Patch

diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index 45675da354f3..df7535f5e882 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -1662,8 +1662,16 @@  static void __init bhi_select_mitigation(void)
 			return;
 	}
 
-	/* Mitigate in hardware if supported */
-	if (spec_ctrl_bhi_dis())
+	/*
+	 * Mitigate in hardware if appropriate.
+	 * Note: for vmexit only, do not mitigate in hardware to avoid changing
+	 * the value of MSR_IA32_SPEC_CTRL to include SPEC_CTRL_BHI_DIS_S. If a
+	 * guest does not also set their own SPEC_CTRL to include this, KVM has
+	 * to toggle on every vmexit and vmentry if the host value does not
+	 * match the guest value. Instead, depend on software loop mitigation
+	 * only.
+	 */
+	if (bhi_mitigation != BHI_MITIGATION_VMEXIT_ONLY && spec_ctrl_bhi_dis())
 		return;
 
 	if (!IS_ENABLED(CONFIG_X86_64))