Message ID | 20240411072445.522731-1-alexandre.chartre@oracle.com (mailing list archive)
---|---
State | New, archived
Series | KVM: x86: Set BHI_NO in guest when host is not affected by BHI
On 11.04.24 10:24, Alexandre Chartre wrote:
> When a system is not affected by the BHI bug then KVM should
> configure guests with BHI_NO to ensure they won't enable any
> BHI mitigation.
>
> Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
> ---
>  arch/x86/kvm/x86.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 984ea2089efc..f43d3c15a6b7 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -1678,6 +1678,9 @@ static u64 kvm_get_arch_capabilities(void)
>  	if (!boot_cpu_has_bug(X86_BUG_GDS) || gds_ucode_mitigated())
>  		data |= ARCH_CAP_GDS_NO;
>
> +	if (!boot_cpu_has_bug(X86_BUG_BHI))
> +		data |= ARCH_CAP_BHI_NO;
> +

But this is already handled since ARCH_CAP_BHI_NO is added to
KVM_SUPPORTED_ARCH_CAP, so when the host caps are read that bit is
going to be set there -- if it's set for the physical CPU, of course.

>  	return data;
>  }
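For reference, the propagation path Nikolay refers to: kvm_get_arch_capabilities() starts from the host's MSR value masked by KVM_SUPPORTED_ARCH_CAP, then force-sets additional *_NO bits. A simplified sketch of the upstream logic around v6.9 (variable and macro names follow the kernel of that era; details elided):

/*
 * Simplified sketch of kvm_get_arch_capabilities() as of ~v6.9: host
 * ARCH_CAPABILITIES bits reach the guest only if they are in
 * KVM_SUPPORTED_ARCH_CAP. ARCH_CAP_BHI_NO is in that mask, so a host
 * that has the bit passes it through; the patch above additionally
 * sets it when the host didn't set X86_BUG_BHI.
 */
static u64 kvm_get_arch_capabilities(void)
{
	u64 data = host_arch_capabilities & KVM_SUPPORTED_ARCH_CAP;

	/* ... KVM then ORs in *_NO bits it can prove, e.g.: */
	if (!boot_cpu_has_bug(X86_BUG_GDS) || gds_ucode_mitigated())
		data |= ARCH_CAP_GDS_NO;

	return data;
}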
On 4/11/24 09:34, Nikolay Borisov wrote:
> On 11.04.24 10:24, Alexandre Chartre wrote:
>> When a system is not affected by the BHI bug then KVM should
>> configure guests with BHI_NO to ensure they won't enable any
>> BHI mitigation.
>>
>> Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
>> ---
>>  arch/x86/kvm/x86.c | 3 +++
>>  1 file changed, 3 insertions(+)
>>
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 984ea2089efc..f43d3c15a6b7 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -1678,6 +1678,9 @@ static u64 kvm_get_arch_capabilities(void)
>>  	if (!boot_cpu_has_bug(X86_BUG_GDS) || gds_ucode_mitigated())
>>  		data |= ARCH_CAP_GDS_NO;
>>
>> +	if (!boot_cpu_has_bug(X86_BUG_BHI))
>> +		data |= ARCH_CAP_BHI_NO;
>> +
>
> But this is already handled since ARCH_CAP_BHI_NO is added to
> KVM_SUPPORTED_ARCH_CAP, so when the host caps are read that bit is
> going to be set there -- if it's set for the physical CPU, of course.

Correct: if the host has ARCH_CAP_BHI_NO then it will be propagated to
the guest. But the host might not have ARCH_CAP_BHI_NO set and still
not be affected by BHI. That's the case, for example, of Skylake
servers: they don't have ARCH_CAP_BHI_NO, but they are not affected by
BHI because they don't have eIBRS. However, a guest will think it is
affected because it doesn't know whether eIBRS is present on the
system (virtualization can hide it).

I tested this on Skylake.

Without the patch, with both host and guest running 6.9.0-rc3, the BHI
mitigations are:

 - Host:  BHI: Not affected
 - Guest: BHI: SW loop, KVM: SW loop

=> The guest enables the BHI SW loop mitigation although the host
doesn't need any mitigation.

With the patch on the host, and the guest still running 6.9.0-rc3, the
BHI mitigations are:

 - Host:  BHI: Not affected
 - Guest: BHI: Not affected

=> Now the guest doesn't enable the BHI mitigation, like the host.

Thanks,

alex.
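For anyone reproducing the test above: the "BHI: ..." state is reported at the end of the spectre_v2 line in sysfs. A minimal userspace sketch to read it (standard sysfs path on current kernels):

#include <stdio.h>

/*
 * Print the spectre_v2 mitigation line, which is where the
 * "BHI: SW loop, KVM: SW loop" status quoted above is reported.
 */
int main(void)
{
	char buf[256];
	FILE *f = fopen("/sys/devices/system/cpu/vulnerabilities/spectre_v2", "r");

	if (!f) {
		perror("fopen");
		return 1;
	}
	if (fgets(buf, sizeof(buf), f))
		fputs(buf, stdout);
	fclose(f);
	return 0;
}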
On Thu, Apr 11, 2024 at 09:24:45AM +0200, Alexandre Chartre wrote:
> When a system is not affected by the BHI bug then KVM should
> configure guests with BHI_NO to ensure they won't enable any
> BHI mitigation.
>
> Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>

No Fixes: tag here?

thanks,

greg k-h
On 11.04.24 10:24, Alexandre Chartre wrote:
> When a system is not affected by the BHI bug then KVM should
> configure guests with BHI_NO to ensure they won't enable any
> BHI mitigation.
>
> Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
> ---
>  arch/x86/kvm/x86.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 984ea2089efc..f43d3c15a6b7 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -1678,6 +1678,9 @@ static u64 kvm_get_arch_capabilities(void)
>  	if (!boot_cpu_has_bug(X86_BUG_GDS) || gds_ucode_mitigated())
>  		data |= ARCH_CAP_GDS_NO;
>
> +	if (!boot_cpu_has_bug(X86_BUG_BHI))
> +		data |= ARCH_CAP_BHI_NO;
> +
>  	return data;
>  }

Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
On 4/11/24 09:51, Greg KH wrote:
> On Thu, Apr 11, 2024 at 09:24:45AM +0200, Alexandre Chartre wrote:
>> When a system is not affected by the BHI bug then KVM should
>> configure guests with BHI_NO to ensure they won't enable any
>> BHI mitigation.
>>
>> Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
>
> No Fixes: tag here?

This should have been done with commit ed2e8d49b54d ("KVM: x86: Add
BHI_NO"), so:

Fixes: ed2e8d49b54d ("KVM: x86: Add BHI_NO")

Thanks,

alex.
On 11/04/2024 8:24 am, Alexandre Chartre wrote:
> When a system is not affected by the BHI bug then KVM should
> configure guests with BHI_NO to ensure they won't enable any
> BHI mitigation.
>
> Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
> ---
>  arch/x86/kvm/x86.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 984ea2089efc..f43d3c15a6b7 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -1678,6 +1678,9 @@ static u64 kvm_get_arch_capabilities(void)
>  	if (!boot_cpu_has_bug(X86_BUG_GDS) || gds_ucode_mitigated())
>  		data |= ARCH_CAP_GDS_NO;
>
> +	if (!boot_cpu_has_bug(X86_BUG_BHI))
> +		data |= ARCH_CAP_BHI_NO;

This isn't true or safe.

Linux only sets X86_BUG_BHI on a subset of affected parts.

Skylake, for example, *is* affected by BHI. It's just that the existing
mitigations are believed to suffice to mitigate BHI too.

"You happen to be safe if you're doing something else too" doesn't
remotely have the same meaning as "the hardware doesn't have a
history-based predictor".

~Andrew
On 4/11/24 10:43, Andrew Cooper wrote:
> On 11/04/2024 8:24 am, Alexandre Chartre wrote:
>> When a system is not affected by the BHI bug then KVM should
>> configure guests with BHI_NO to ensure they won't enable any
>> BHI mitigation.
>>
>> Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
>> ---
>>  arch/x86/kvm/x86.c | 3 +++
>>  1 file changed, 3 insertions(+)
>>
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 984ea2089efc..f43d3c15a6b7 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -1678,6 +1678,9 @@ static u64 kvm_get_arch_capabilities(void)
>>  	if (!boot_cpu_has_bug(X86_BUG_GDS) || gds_ucode_mitigated())
>>  		data |= ARCH_CAP_GDS_NO;
>>
>> +	if (!boot_cpu_has_bug(X86_BUG_BHI))
>> +		data |= ARCH_CAP_BHI_NO;
>
> This isn't true or safe.
>
> Linux only sets X86_BUG_BHI on a subset of affected parts.
>
> Skylake, for example, *is* affected by BHI. It's just that the existing
> mitigations are believed to suffice to mitigate BHI too.
>
> "You happen to be safe if you're doing something else too" doesn't
> remotely have the same meaning as "the hardware doesn't have a
> history-based predictor".

So you mean we can't set ARCH_CAP_BHI_NO for the guest because we don't
know if the guest will run the (other) existing mitigations which are
believed to suffice to mitigate BHI?

The problem is that we can end up with a guest running extra BHI
mitigations while this is not needed. Could we inform the guest that
eIBRS is not available on the system, so a Linux guest doesn't run with
extra BHI mitigations?

Thanks,

alex.
On 11/04/2024 10:33 am, Alexandre Chartre wrote:
> On 4/11/24 10:43, Andrew Cooper wrote:
>> On 11/04/2024 8:24 am, Alexandre Chartre wrote:
>>> When a system is not affected by the BHI bug then KVM should
>>> configure guests with BHI_NO to ensure they won't enable any
>>> BHI mitigation.
>>>
>>> Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
>>> ---
>>>  arch/x86/kvm/x86.c | 3 +++
>>>  1 file changed, 3 insertions(+)
>>>
>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>> index 984ea2089efc..f43d3c15a6b7 100644
>>> --- a/arch/x86/kvm/x86.c
>>> +++ b/arch/x86/kvm/x86.c
>>> @@ -1678,6 +1678,9 @@ static u64 kvm_get_arch_capabilities(void)
>>>  	if (!boot_cpu_has_bug(X86_BUG_GDS) || gds_ucode_mitigated())
>>>  		data |= ARCH_CAP_GDS_NO;
>>>
>>> +	if (!boot_cpu_has_bug(X86_BUG_BHI))
>>> +		data |= ARCH_CAP_BHI_NO;
>>
>> This isn't true or safe.
>>
>> Linux only sets X86_BUG_BHI on a subset of affected parts.
>>
>> Skylake, for example, *is* affected by BHI. It's just that the
>> existing mitigations are believed to suffice to mitigate BHI too.
>>
>> "You happen to be safe if you're doing something else too" doesn't
>> remotely have the same meaning as "the hardware doesn't have a
>> history-based predictor".
>
> So you mean we can't set ARCH_CAP_BHI_NO for the guest because we
> don't know if the guest will run the (other) existing mitigations
> which are believed to suffice to mitigate BHI?

Correct.

Also, when a VM really is migrating between different CPUs, things get
far more complicated.

> The problem is that we can end up with a guest running extra BHI
> mitigations while this is not needed. Could we inform the guest that
> eIBRS is not available on the system, so a Linux guest doesn't run
> with extra BHI mitigations?

Well, that's why Intel specified some MSRs at 0x5000xxxx.

Except I don't know anyone currently interested in implementing them,
and I'm still not sure if they work correctly for some of the more
complicated migration cases.

~Andrew
>> The problem is that we can end up with a guest running extra BHI
>> mitigations while this is not needed. Could we inform the guest that
>> eIBRS is not available on the system, so a Linux guest doesn't run
>> with extra BHI mitigations?
>
> Well, that's why Intel specified some MSRs at 0x5000xxxx.

Yes. But note that there is a subtle difference. Those MSRs are used by
the guest to communicate in-use software mitigations to the host. Such
information is stable across migration. Here we need the host to
communicate that eIBRS isn't available to the guest; this isn't stable,
as the guest may be migrated from a host without eIBRS to one with it.

> Except I don't know anyone currently interested in implementing them,
> and I'm still not sure if they work correctly for some of the more
> complicated migration cases.

Looks like you have the same opinion on the Intel-defined virtual MSRs
as Sean. If we all agree that the issue here and the effectiveness
problem of the short BHB-clearing sequence need to be resolved, and
don't think the Intel-defined virtual MSRs can handle all cases
correctly, we have to define a better interface through community
collaboration, as Sean suggested.
On 4/11/24 13:14, Chao Gao wrote:
>>> The problem is that we can end up with a guest running extra BHI
>>> mitigations while this is not needed. Could we inform the guest that
>>> eIBRS is not available on the system, so a Linux guest doesn't run
>>> with extra BHI mitigations?
>>
>> Well, that's why Intel specified some MSRs at 0x5000xxxx.
>
> Yes. But note that there is a subtle difference. Those MSRs are used
> by the guest to communicate in-use software mitigations to the host.
> Such information is stable across migration. Here we need the host to
> communicate that eIBRS isn't available to the guest; this isn't
> stable, as the guest may be migrated from a host without eIBRS to one
> with it.
>
>> Except I don't know anyone currently interested in implementing
>> them, and I'm still not sure if they work correctly for some of the
>> more complicated migration cases.
>
> Looks like you have the same opinion on the Intel-defined virtual
> MSRs as Sean. If we all agree that the issue here and the
> effectiveness problem of the short BHB-clearing sequence need to be
> resolved, and don't think the Intel-defined virtual MSRs can handle
> all cases correctly, we have to define a better interface through
> community collaboration, as Sean suggested.

Another solution could be to add CPUs to cpu_vuln_whitelist with
BHI_NO (e.g. explicitly add CPUs which don't have eIBRS). That way,
the kernel will figure out the right mitigation on the host and guest.

alex.
On Thu, Apr 11, 2024 at 11:34 AM Alexandre Chartre
<alexandre.chartre@oracle.com> wrote:
>
> So you mean we can't set ARCH_CAP_BHI_NO for the guest because we
> don't know if the guest will run the (other) existing mitigations
> which are believed to suffice to mitigate BHI?
>
> The problem is that we can end up with a guest running extra BHI
> mitigations while this is not needed. Could we inform the guest that
> eIBRS is not available on the system, so a Linux guest doesn't run
> with extra BHI mitigations?

The (Linux or otherwise) guest will make its own determinations as to
whether BHI mitigations are necessary. If the guest uses eIBRS, it
will run with mitigations. If you hide bit 1 of
MSR_IA32_ARCH_CAPABILITIES from the guest, it may decide to disable
it. But if the guest decides to use eIBRS, I think it should use
mitigations even if the host doesn't.

It's a different story if the host isn't susceptible altogether. The
ARCH_CAP_BHI_NO bit *can* be set if the processor doesn't have the bug
at all, which would be true if cpu_matches(cpu_vuln_whitelist,
NO_BHI). I would apply a patch to do that.

Paolo
On 4/11/24 15:22, Paolo Bonzini wrote:
> On Thu, Apr 11, 2024 at 11:34 AM Alexandre Chartre
> <alexandre.chartre@oracle.com> wrote:
>>
>> So you mean we can't set ARCH_CAP_BHI_NO for the guest because we
>> don't know if the guest will run the (other) existing mitigations
>> which are believed to suffice to mitigate BHI?
>>
>> The problem is that we can end up with a guest running extra BHI
>> mitigations while this is not needed. Could we inform the guest that
>> eIBRS is not available on the system, so a Linux guest doesn't run
>> with extra BHI mitigations?
>
> The (Linux or otherwise) guest will make its own determinations as to
> whether BHI mitigations are necessary. If the guest uses eIBRS, it
> will run with mitigations. If you hide bit 1 of
> MSR_IA32_ARCH_CAPABILITIES from the guest, it may decide to disable
> it. But if the guest decides to use eIBRS, I think it should use
> mitigations even if the host doesn't.

The problem is not on servers which have eIBRS, but on servers which
don't.

If there is no eIBRS on the server, then the guest doesn't know whether
there is effectively no eIBRS on the server or eIBRS is hidden by the
virtualization, so it applies the BHI mitigation even when that's not
needed (i.e. when eIBRS is effectively not present on the server).

> It's a different story if the host isn't susceptible altogether. The
> ARCH_CAP_BHI_NO bit *can* be set if the processor doesn't have the
> bug at all, which would be true if cpu_matches(cpu_vuln_whitelist,
> NO_BHI). I would apply a patch to do that.

Right. I have just suggested to enumerate CPUs which don't have eIBRS
with NO_BHI, but we would need that precise list of CPUs.

alex.
On 11/04/2024 2:32 pm, Alexandre Chartre wrote:
> On 4/11/24 15:22, Paolo Bonzini wrote:
>> On Thu, Apr 11, 2024 at 11:34 AM Alexandre Chartre
>> <alexandre.chartre@oracle.com> wrote:
>>>
>>> So you mean we can't set ARCH_CAP_BHI_NO for the guest because we
>>> don't know if the guest will run the (other) existing mitigations
>>> which are believed to suffice to mitigate BHI?
>>>
>>> The problem is that we can end up with a guest running extra BHI
>>> mitigations while this is not needed. Could we inform the guest
>>> that eIBRS is not available on the system, so a Linux guest doesn't
>>> run with extra BHI mitigations?
>>
>> The (Linux or otherwise) guest will make its own determinations as
>> to whether BHI mitigations are necessary. If the guest uses eIBRS,
>> it will run with mitigations. If you hide bit 1 of
>> MSR_IA32_ARCH_CAPABILITIES from the guest, it may decide to disable
>> it. But if the guest decides to use eIBRS, I think it should use
>> mitigations even if the host doesn't.
>
> The problem is not on servers which have eIBRS, but on servers which
> don't.
>
> If there is no eIBRS on the server, then the guest doesn't know
> whether there is effectively no eIBRS on the server or eIBRS is
> hidden by the virtualization, so it applies the BHI mitigation even
> when that's not needed (i.e. when eIBRS is effectively not present on
> the server).
>
>> It's a different story if the host isn't susceptible altogether. The
>> ARCH_CAP_BHI_NO bit *can* be set if the processor doesn't have the
>> bug at all, which would be true if cpu_matches(cpu_vuln_whitelist,
>> NO_BHI). I would apply a patch to do that.
>
> Right. I have just suggested to enumerate CPUs which don't have eIBRS
> with NO_BHI, but we would need that precise list of CPUs.

Intel stated that there are no current CPUs for which NO_BHI would be
true.

What I take this to mean is "no CPUs, analysing backwards as far as
Intel cared to go".

~Andrew
On 4/11/24 16:13, Andrew Cooper wrote:
> On 11/04/2024 2:32 pm, Alexandre Chartre wrote:
>> On 4/11/24 15:22, Paolo Bonzini wrote:
>>> On Thu, Apr 11, 2024 at 11:34 AM Alexandre Chartre
>>> <alexandre.chartre@oracle.com> wrote:
>>>>
>>>> So you mean we can't set ARCH_CAP_BHI_NO for the guest because we
>>>> don't know if the guest will run the (other) existing mitigations
>>>> which are believed to suffice to mitigate BHI?
>>>>
>>>> The problem is that we can end up with a guest running extra BHI
>>>> mitigations while this is not needed. Could we inform the guest
>>>> that eIBRS is not available on the system, so a Linux guest
>>>> doesn't run with extra BHI mitigations?
>>>
>>> The (Linux or otherwise) guest will make its own determinations as
>>> to whether BHI mitigations are necessary. If the guest uses eIBRS,
>>> it will run with mitigations. If you hide bit 1 of
>>> MSR_IA32_ARCH_CAPABILITIES from the guest, it may decide to disable
>>> it. But if the guest decides to use eIBRS, I think it should use
>>> mitigations even if the host doesn't.
>>
>> The problem is not on servers which have eIBRS, but on servers which
>> don't.
>>
>> If there is no eIBRS on the server, then the guest doesn't know
>> whether there is effectively no eIBRS on the server or eIBRS is
>> hidden by the virtualization, so it applies the BHI mitigation even
>> when that's not needed (i.e. when eIBRS is effectively not present
>> on the server).
>>
>>> It's a different story if the host isn't susceptible altogether.
>>> The ARCH_CAP_BHI_NO bit *can* be set if the processor doesn't have
>>> the bug at all, which would be true if
>>> cpu_matches(cpu_vuln_whitelist, NO_BHI). I would apply a patch to
>>> do that.
>>
>> Right. I have just suggested to enumerate CPUs which don't have
>> eIBRS with NO_BHI, but we would need that precise list of CPUs.
>
> Intel stated that there are no current CPUs for which NO_BHI would be
> true.
>
> What I take this to mean is "no CPUs, analysing backwards as far as
> Intel cared to go".

Still, we could enumerate CPUs which don't have eIBRS independently of
NO_BHI (if we have a list of such CPUs) and set X86_BUG_BHI for CPUs
with eIBRS.

So in arch/x86/kernel/cpu/common.c, replace:

	/* When virtualized, eIBRS could be hidden, assume vulnerable */
	if (!(ia32_cap & ARCH_CAP_BHI_NO) &&
	    !cpu_matches(cpu_vuln_whitelist, NO_BHI) &&
	    (boot_cpu_has(X86_FEATURE_IBRS_ENHANCED) ||
	     boot_cpu_has(X86_FEATURE_HYPERVISOR)))
		setup_force_cpu_bug(X86_BUG_BHI);

with something like:

	if (!(ia32_cap & ARCH_CAP_BHI_NO) &&
	    !cpu_matches(cpu_vuln_whitelist, NO_BHI) &&
	    (boot_cpu_has(X86_FEATURE_IBRS_ENHANCED) ||
	     !cpu_matches(cpu_vuln_whitelist, NO_EIBRS)))
		setup_force_cpu_bug(X86_BUG_BHI);

alex.
On Thu, Apr 11, 2024 at 4:34 PM Alexandre Chartre
<alexandre.chartre@oracle.com> wrote:
> Still, we could enumerate CPUs which don't have eIBRS independently
> of NO_BHI (if we have a list of such CPUs) and set X86_BUG_BHI for
> CPUs with eIBRS.
>
> So in arch/x86/kernel/cpu/common.c, replace:
>
>	/* When virtualized, eIBRS could be hidden, assume vulnerable */
>	if (!(ia32_cap & ARCH_CAP_BHI_NO) &&
>	    !cpu_matches(cpu_vuln_whitelist, NO_BHI) &&
>	    (boot_cpu_has(X86_FEATURE_IBRS_ENHANCED) ||
>	     boot_cpu_has(X86_FEATURE_HYPERVISOR)))
>		setup_force_cpu_bug(X86_BUG_BHI);
>
> with something like:
>
>	if (!(ia32_cap & ARCH_CAP_BHI_NO) &&
>	    !cpu_matches(cpu_vuln_whitelist, NO_BHI) &&
>	    (boot_cpu_has(X86_FEATURE_IBRS_ENHANCED) ||
>	     !cpu_matches(cpu_vuln_whitelist, NO_EIBRS)))
>		setup_force_cpu_bug(X86_BUG_BHI);

No, that you cannot do, because the hypervisor can and will fake the
family/model/stepping.

However, looking again at the original patch you submitted, I think
the review was confusing the host and guest sides. If the host is "not
affected", i.e. the host *genuinely* does not have eIBRS, it can set
BHI_NO and that will bypass the "always mitigate in a guest" bit.

I think that's robust, and boot_cpu_has_bug(X86_BUG_BHI) is the right
way to do it.

If a VM migration pool has both !eIBRS and eIBRS machines, it will
combine the two; on one hand it will not set the eIBRS bit (bit 1), on
the other hand it will not set BHI_NO either, and it will set the
hypervisor bit. The result is that the guest *will* use mitigations.

To double check, from the point of view of a nested hypervisor, you
could set BHI_NO in a nested guest:

* if the nested hypervisor has BHI_NO passed from the outer level
* or if its CPUID passes cpu_matches(cpu_vuln_whitelist, NO_BHI)
* but it won't matter whether the nested hypervisor lacks eIBRS,
  because that bit is not reliable in a VM

The logic you'd use in KVM therefore is:

	(ia32_cap & ARCH_CAP_BHI_NO) ||
	cpu_matches(cpu_vuln_whitelist, NO_BHI) ||
	(!boot_cpu_has(X86_FEATURE_IBRS_ENHANCED) &&
	 !boot_cpu_has(X86_FEATURE_HYPERVISOR))

but that is exactly !boot_cpu_has_bug(X86_BUG_BHI) and is therefore
what Alexandre's patch does.

So I'll wait for further comments, but I think the patch is correct.

Paolo
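To make the equivalence Paolo spells out concrete, here is a hypothetical helper (the function name is invented, and cpu_matches()/cpu_vuln_whitelist are static to arch/x86/kernel/cpu/common.c, so this is a sketch rather than code KVM could compile as-is):

/*
 * Hypothetical helper: the condition under which KVM may advertise
 * ARCH_CAP_BHI_NO is exactly the negation of the condition under
 * which common.c sets X86_BUG_BHI.
 */
static bool kvm_may_set_bhi_no(u64 ia32_cap)
{
	return (ia32_cap & ARCH_CAP_BHI_NO) ||
	       cpu_matches(cpu_vuln_whitelist, NO_BHI) ||
	       (!boot_cpu_has(X86_FEATURE_IBRS_ENHANCED) &&
		!boot_cpu_has(X86_FEATURE_HYPERVISOR));
	/* equivalently: !boot_cpu_has_bug(X86_BUG_BHI) after setup */
}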
On 4/11/24 16:46, Paolo Bonzini wrote:
> On Thu, Apr 11, 2024 at 4:34 PM Alexandre Chartre
> <alexandre.chartre@oracle.com> wrote:
>> Still, we could enumerate CPUs which don't have eIBRS independently
>> of NO_BHI (if we have a list of such CPUs) and set X86_BUG_BHI for
>> CPUs with eIBRS.
>>
>> So in arch/x86/kernel/cpu/common.c, replace:
>>
>>	/* When virtualized, eIBRS could be hidden, assume vulnerable */
>>	if (!(ia32_cap & ARCH_CAP_BHI_NO) &&
>>	    !cpu_matches(cpu_vuln_whitelist, NO_BHI) &&
>>	    (boot_cpu_has(X86_FEATURE_IBRS_ENHANCED) ||
>>	     boot_cpu_has(X86_FEATURE_HYPERVISOR)))
>>		setup_force_cpu_bug(X86_BUG_BHI);
>>
>> with something like:
>>
>>	if (!(ia32_cap & ARCH_CAP_BHI_NO) &&
>>	    !cpu_matches(cpu_vuln_whitelist, NO_BHI) &&
>>	    (boot_cpu_has(X86_FEATURE_IBRS_ENHANCED) ||
>>	     !cpu_matches(cpu_vuln_whitelist, NO_EIBRS)))
>>		setup_force_cpu_bug(X86_BUG_BHI);
>
> No, that you cannot do, because the hypervisor can and will fake the
> family/model/stepping.
>
> However, looking again at the original patch you submitted, I think
> the review was confusing the host and guest sides. If the host is
> "not affected", i.e. the host *genuinely* does not have eIBRS, it can
> set BHI_NO and that will bypass the "always mitigate in a guest" bit.
>
> I think that's robust, and boot_cpu_has_bug(X86_BUG_BHI) is the right
> way to do it.
>
> If a VM migration pool has both !eIBRS and eIBRS machines, it will
> combine the two; on one hand it will not set the eIBRS bit (bit 1),
> on the other hand it will not set BHI_NO either, and it will set the
> hypervisor bit. The result is that the guest *will* use mitigations.
>
> To double check, from the point of view of a nested hypervisor, you
> could set BHI_NO in a nested guest:
>
> * if the nested hypervisor has BHI_NO passed from the outer level
> * or if its CPUID passes cpu_matches(cpu_vuln_whitelist, NO_BHI)
> * but it won't matter whether the nested hypervisor lacks eIBRS,
>   because that bit is not reliable in a VM
>
> The logic you'd use in KVM therefore is:
>
>	(ia32_cap & ARCH_CAP_BHI_NO) ||
>	cpu_matches(cpu_vuln_whitelist, NO_BHI) ||
>	(!boot_cpu_has(X86_FEATURE_IBRS_ENHANCED) &&
>	 !boot_cpu_has(X86_FEATURE_HYPERVISOR))
>
> but that is exactly !boot_cpu_has_bug(X86_BUG_BHI) and is therefore
> what Alexandre's patch does.
>
> So I'll wait for further comments, but I think the patch is correct.

I think that Andrew's concern is that if there is no eIBRS on the host
then we do not set X86_BUG_BHI on the host, because we know the kernel
which is running and this kernel has some mitigations (other than the
explicit BHI mitigations) and these mitigations are enough to prevent
BHI. But the CPU is still affected by BHI.

If you set ARCH_CAP_BHI_NO for the guest, then you tell the guest that
the CPU is not affected by BHI although it is. The guest can be running
a different kernel or OS which doesn't necessarily have the (basic)
mitigations used in the host kernel that mitigate BHI.

alex.
On Thu, Apr 11, 2024 at 5:13 PM Alexandre Chartre
<alexandre.chartre@oracle.com> wrote:
> I think that Andrew's concern is that if there is no eIBRS on the
> host then we do not set X86_BUG_BHI on the host, because we know the
> kernel which is running and this kernel has some mitigations (other
> than the explicit BHI mitigations) and these mitigations are enough
> to prevent BHI. But the CPU is still affected by BHI.

Hmm, then I'm confused. It's what I wrote before: "The (Linux or
otherwise) guest will make its own determinations as to whether BHI
mitigations are necessary. If the guest uses eIBRS, it will run with
mitigations" -- but you said machines without eIBRS are fine.

If instead they are only fine _with Linux_, then yeah, we cannot set
BHI_NO in general. What we can do is define a new bit that is in the
KVM leaves. The new bit is effectively !eIBRS, except that it is
defined in such a way that, in a mixed migration pool, both eIBRS and
the new bit will be 0.

Paolo
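One possible shape for such a bit, sketched against KVM's paravirt feature CPUID leaf; the feature name and bit number below are invented for illustration, not an existing interface:

/*
 * Hypothetical sketch of the bit Paolo describes. In a mixed
 * migration pool the VMM leaves both this bit and the eIBRS
 * capability clear, so guests assume the worst.
 */
#define KVM_FEATURE_NO_EIBRS	19	/* hypothetical CPUID 0x40000001.EAX bit */

static bool guest_can_skip_extra_bhi_mitigation(void)
{
	/*
	 * Host guarantees there is no eIBRS anywhere in the pool: the
	 * legacy IBRS/retpoline mitigations then suffice for BHI too.
	 */
	return kvm_para_has_feature(KVM_FEATURE_NO_EIBRS);
}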
On Thu, Apr 11, 2024 at 05:20:30PM +0200, Paolo Bonzini wrote:
> On Thu, Apr 11, 2024 at 5:13 PM Alexandre Chartre
> <alexandre.chartre@oracle.com> wrote:
>> I think that Andrew's concern is that if there is no eIBRS on the
>> host then we do not set X86_BUG_BHI on the host, because we know the
>> kernel which is running and this kernel has some mitigations (other
>> than the explicit BHI mitigations) and these mitigations are enough
>> to prevent BHI. But the CPU is still affected by BHI.
>
> Hmm, then I'm confused. It's what I wrote before: "The (Linux or
> otherwise) guest will make its own determinations as to whether BHI
> mitigations are necessary. If the guest uses eIBRS, it will run with
> mitigations" -- but you said machines without eIBRS are fine.
>
> If instead they are only fine _with Linux_, then yeah, we cannot set
> BHI_NO in general. What we can do is define a new bit that is in the
> KVM leaves. The new bit is effectively !eIBRS, except that it is
> defined in such a way that, in a mixed migration pool, both eIBRS and
> the new bit will be 0.

This looks like a good solution.

We can also introduce a new bit indicating the effectiveness of the
short BHB-clearing sequence. KVM advertises this bit for all
pre-SPR/ADL parts. Only if the bit is 1 will guests use the short
BHB-clearing sequence; otherwise guests should use the long sequence.
In a mixed migration pool, the VMM shouldn't expose the bit to guests.

> Paolo
On Thu, Apr 11, 2024 at 11:56:39PM +0800, Chao Gao wrote:
> On Thu, Apr 11, 2024 at 05:20:30PM +0200, Paolo Bonzini wrote:
>> On Thu, Apr 11, 2024 at 5:13 PM Alexandre Chartre
>> <alexandre.chartre@oracle.com> wrote:
>>> I think that Andrew's concern is that if there is no eIBRS on the
>>> host then we do not set X86_BUG_BHI on the host, because we know
>>> the kernel which is running and this kernel has some mitigations
>>> (other than the explicit BHI mitigations) and these mitigations are
>>> enough to prevent BHI. But the CPU is still affected by BHI.
>>
>> Hmm, then I'm confused. It's what I wrote before: "The (Linux or
>> otherwise) guest will make its own determinations as to whether BHI
>> mitigations are necessary. If the guest uses eIBRS, it will run with
>> mitigations" -- but you said machines without eIBRS are fine.
>>
>> If instead they are only fine _with Linux_, then yeah, we cannot set
>> BHI_NO in general. What we can do is define a new bit that is in the
>> KVM leaves. The new bit is effectively !eIBRS, except that it is
>> defined in such a way that, in a mixed migration pool, both eIBRS
>> and the new bit will be 0.
>
> This looks like a good solution.
>
> We can also introduce a new bit indicating the effectiveness of the
> short BHB-clearing sequence. KVM advertises this bit for all
> pre-SPR/ADL parts. Only if the bit is 1 will guests use the short
> BHB-clearing sequence; otherwise guests should use the long sequence.
> In a mixed migration pool, the VMM shouldn't expose the bit to
> guests.

Is there a link to this 'short BHB-clearing sequence'?

But on your email, should Skylake guests enable IBRS (or retpoline)
and have the short BHB-clearing sequence?

And should IceLake/CascadeLake use eIBRS (or retpoline) and the short
BHB-clearing sequence?

If we already know all of this, why does the hypervisor need to
advertise it to the guest? They can look up the CPU data to make this
determination, no?

I don't actually understand how one could do a mixed migration pool
with the various mitigations one has to engage (or not) based on the
host one is running under.
On Thu, Apr 11, 2024 at 04:50:12PM -0400, Konrad Rzeszutek Wilk wrote:
> On Thu, Apr 11, 2024 at 11:56:39PM +0800, Chao Gao wrote:
>> On Thu, Apr 11, 2024 at 05:20:30PM +0200, Paolo Bonzini wrote:
>>> On Thu, Apr 11, 2024 at 5:13 PM Alexandre Chartre
>>> <alexandre.chartre@oracle.com> wrote:
>>>> I think that Andrew's concern is that if there is no eIBRS on the
>>>> host then we do not set X86_BUG_BHI on the host, because we know
>>>> the kernel which is running and this kernel has some mitigations
>>>> (other than the explicit BHI mitigations) and these mitigations
>>>> are enough to prevent BHI. But the CPU is still affected by BHI.
>>>
>>> Hmm, then I'm confused. It's what I wrote before: "The (Linux or
>>> otherwise) guest will make its own determinations as to whether BHI
>>> mitigations are necessary. If the guest uses eIBRS, it will run
>>> with mitigations" -- but you said machines without eIBRS are fine.
>>>
>>> If instead they are only fine _with Linux_, then yeah, we cannot
>>> set BHI_NO in general. What we can do is define a new bit that is
>>> in the KVM leaves. The new bit is effectively !eIBRS, except that
>>> it is defined in such a way that, in a mixed migration pool, both
>>> eIBRS and the new bit will be 0.
>>
>> This looks like a good solution.
>>
>> We can also introduce a new bit indicating the effectiveness of the
>> short BHB-clearing sequence. KVM advertises this bit for all
>> pre-SPR/ADL parts. Only if the bit is 1 will guests use the short
>> BHB-clearing sequence; otherwise guests should use the long
>> sequence. In a mixed migration pool, the VMM shouldn't expose the
>> bit to guests.
>
> Is there a link to this 'short BHB-clearing sequence'?

https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/branch-history-injection.html#inpage-nav-4-4

> But on your email, should Skylake guests enable IBRS (or retpoline)
> and have the short BHB-clearing sequence?
>
> And should IceLake/CascadeLake use eIBRS (or retpoline) and the short
> BHB-clearing sequence?
>
> If we already know all of this, why does the hypervisor need to
> advertise it to the guest? They can look up the CPU data to make this
> determination, no?
>
> I don't actually understand how one could do a mixed migration pool
> with the various mitigations one has to engage (or not) based on the
> host one is running under.

In my understanding, it is done at the cost of performance. The idea
is to report the "worst" case in a mixed migration pool to guests,
i.e.:

  Hey, you are running on a host where eIBRS is available (and/or the
  short BHB-clearing sequence is ineffective). Now, select your
  mitigation for BHI.

Then no matter which system in the pool the guest is migrated to, the
guest is not vulnerable if it deployed a mitigation for the "worst"
case (in general, this means a mitigation with larger overhead).

The good thing is that migration in a mixed pool won't compromise the
security level of guests, and guests in a homogeneous pool won't
experience any performance loss.
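Chao's "worst case" rule amounts to merging capabilities conservatively across every host in the pool; a hypothetical VMM-side sketch (all names invented):

#include <stdbool.h>

/*
 * Hypothetical sketch: advertise a guarantee to guests only if every
 * host in the migration pool can honor it, so a migrated guest never
 * relies on a guarantee its current host does not provide.
 */
struct pool_host {
	bool has_eibrs;		/* eIBRS present on this host */
	bool short_seq_ok;	/* short BHB-clearing sequence effective */
};

static void merge_pool_caps(const struct pool_host *hosts, int n,
			    bool *adv_eibrs, bool *adv_short_seq)
{
	*adv_eibrs = true;
	*adv_short_seq = true;

	for (int i = 0; i < n; i++) {
		*adv_eibrs &= hosts[i].has_eibrs;
		*adv_short_seq &= hosts[i].short_seq_ok;
	}
}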
On Fri, Apr 12, 2024 at 11:24:54AM +0800, Chao Gao wrote:
> On Thu, Apr 11, 2024 at 04:50:12PM -0400, Konrad Rzeszutek Wilk wrote:
>> On Thu, Apr 11, 2024 at 11:56:39PM +0800, Chao Gao wrote:
>>> On Thu, Apr 11, 2024 at 05:20:30PM +0200, Paolo Bonzini wrote:
>>>> On Thu, Apr 11, 2024 at 5:13 PM Alexandre Chartre
>>>> <alexandre.chartre@oracle.com> wrote:
>>>>> I think that Andrew's concern is that if there is no eIBRS on the
>>>>> host then we do not set X86_BUG_BHI on the host, because we know
>>>>> the kernel which is running and this kernel has some mitigations
>>>>> (other than the explicit BHI mitigations) and these mitigations
>>>>> are enough to prevent BHI. But the CPU is still affected by BHI.
>>>>
>>>> Hmm, then I'm confused. It's what I wrote before: "The (Linux or
>>>> otherwise) guest will make its own determinations as to whether
>>>> BHI mitigations are necessary. If the guest uses eIBRS, it will
>>>> run with mitigations" -- but you said machines without eIBRS are
>>>> fine.
>>>>
>>>> If instead they are only fine _with Linux_, then yeah, we cannot
>>>> set BHI_NO in general. What we can do is define a new bit that is
>>>> in the KVM leaves. The new bit is effectively !eIBRS, except that
>>>> it is defined in such a way that, in a mixed migration pool, both
>>>> eIBRS and the new bit will be 0.
>>>
>>> This looks like a good solution.
>>>
>>> We can also introduce a new bit indicating the effectiveness of the
>>> short BHB-clearing sequence. KVM advertises this bit for all
>>> pre-SPR/ADL parts. Only if the bit is 1 will guests use the short
>>> BHB-clearing sequence; otherwise guests should use the long
>>> sequence. In a mixed migration pool, the VMM shouldn't expose the
>>> bit to guests.
>>
>> Is there a link to this 'short BHB-clearing sequence'?
>
> https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/branch-history-injection.html#inpage-nav-4-4
>
>> But on your email, should Skylake guests enable IBRS (or retpoline)
>> and have the short BHB-clearing sequence?
>>
>> And should IceLake/CascadeLake use eIBRS (or retpoline) and the
>> short BHB-clearing sequence?
>>
>> If we already know all of this, why does the hypervisor need to
>> advertise it to the guest? They can look up the CPU data to make
>> this determination, no?
>>
>> I don't actually understand how one could do a mixed migration pool
>> with the various mitigations one has to engage (or not) based on the
>> host one is running under.
>
> In my understanding, it is done at the cost of performance. The idea
> is to report the "worst" case in a mixed migration pool to guests,
> i.e.:
>
>   Hey, you are running on a host where eIBRS is available (and/or the
>   short BHB-clearing sequence is ineffective). Now, select your
>   mitigation for BHI.

And if the pool could be Skylake, IceLake, CascadeLake, or Sapphire
Rapids, then the lowest common one is IBRS. And you would expose the
long BHB-clearing sequence for all of them too, methinks?

I would think that the use case for this mixed migration pool is not
workable anymore unless you expose the family/model/stepping so that
the VM can engage the right mitigation, and then re-engage the correct
one when it is live migrated (so from Skylake to CascadeLake you turn
on eIBRS and the BHI mitigation; when you move from CascadeLake to
Skylake you turn off the BHI mitigation and enable retpoline).

But nobody has done that work and nobody will, so why are we debating
this?

> Then no matter which system in the pool the guest is migrated to, the
> guest is not vulnerable if it deployed a mitigation for the "worst"
> case (in general, this means a mitigation with larger overhead).
>
> The good thing is that migration in a mixed pool won't compromise the
> security level of guests, and guests in a homogeneous pool won't
> experience any performance loss.

I understand what you are saying, but I can't actually see someone
wanting to do this: either you get horrible performance (engaging all
the mitigations makes the VM 50% slower), or you switch mitigations at
runtime (but nobody has done that work, and nobody will, as Thomas
will not want runtime knobs for mitigations).

So how about we just try to solve the problem for the 99% of
homogeneous pools? And not make Skylake guests slower than they
already are?
On 4/11/24 15:20, Alexandre Chartre wrote:
> On 4/11/24 13:14, Chao Gao wrote:
>>>> The problem is that we can end up with a guest running extra BHI
>>>> mitigations while this is not needed. Could we inform the guest
>>>> that eIBRS is not available on the system, so a Linux guest
>>>> doesn't run with extra BHI mitigations?
>>>
>>> Well, that's why Intel specified some MSRs at 0x5000xxxx.
>>
>> Yes. But note that there is a subtle difference. Those MSRs are used
>> by the guest to communicate in-use software mitigations to the host.
>> Such information is stable across migration. Here we need the host
>> to communicate that eIBRS isn't available to the guest; this isn't
>> stable, as the guest may be migrated from a host without eIBRS to
>> one with it.
>>
>>> Except I don't know anyone currently interested in implementing
>>> them, and I'm still not sure if they work correctly for some of the
>>> more complicated migration cases.
>>
>> Looks like you have the same opinion on the Intel-defined virtual
>> MSRs as Sean. If we all agree that the issue here and the
>> effectiveness problem of the short BHB-clearing sequence need to be
>> resolved, and don't think the Intel-defined virtual MSRs can handle
>> all cases correctly, we have to define a better interface through
>> community collaboration, as Sean suggested.
>
> Another solution could be to add CPUs to cpu_vuln_whitelist with
> BHI_NO (e.g. explicitly add CPUs which don't have eIBRS). That way,
> the kernel will figure out the right mitigation on the host and
> guest.

More precisely, we could do something like this (this is just an
example; the list is clearly incomplete):

diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 754d91857d63..80477170ccc0 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1182,6 +1182,24 @@ static const __initconst struct x86_cpu_id cpu_vuln_whitelist[] = {
 	VULNWL_INTEL(ATOM_TREMONT_L,		NO_EIBRS_PBRSB),
 	VULNWL_INTEL(ATOM_TREMONT_D,		NO_ITLB_MULTIHIT | NO_EIBRS_PBRSB),
 
+	/*
+	 * The following Intel CPUs are affected by BHI, but they don't have
+	 * the eIBRS feature. In that case, the default Spectre v2 mitigations
+	 * are enough to also mitigate BHI. We mark these CPUs with NO_BHI so
+	 * that X86_BUG_BHI doesn't get set and no extra BHI mitigation is
+	 * enabled.
+	 *
+	 * This prevents guest VMs from enabling extra BHI mitigation when
+	 * it is not needed. On the host, X86_BUG_BHI is never set for CPUs
+	 * which don't have the eIBRS feature. But this doesn't happen in
+	 * guest VMs, as the virtualization can hide the eIBRS feature.
+	 */
+	VULNWL_INTEL(IVYBRIDGE_X,		NO_BHI),
+	VULNWL_INTEL(HASWELL_X,			NO_BHI),
+	VULNWL_INTEL(BROADWELL_X,		NO_BHI),
+	VULNWL_INTEL(SKYLAKE_X,			NO_BHI),
+
 	/* AMD Family 0xf - 0x12 */
 	VULNWL_AMD(0x0f,	NO_MELTDOWN | NO_SSB | NO_L1TF | NO_MDS | NO_SWAPGS | NO_ITLB_MULTIHIT | NO_MMIO | NO_BHI),
 	VULNWL_AMD(0x10,	NO_MELTDOWN | NO_SSB | NO_L1TF | NO_MDS | NO_SWAPGS | NO_ITLB_MULTIHIT | NO_MMIO | NO_BHI),

alex.
> +	/*
> +	 * The following Intel CPUs are affected by BHI, but they don't have
> +	 * the eIBRS feature. In that case, the default Spectre v2 mitigations
> +	 * are enough to also mitigate BHI. We mark these CPUs with NO_BHI so
> +	 * that X86_BUG_BHI doesn't get set and no extra BHI mitigation is
> +	 * enabled.
> +	 *
> +	 * This prevents guest VMs from enabling extra BHI mitigation when
> +	 * it is not needed. On the host, X86_BUG_BHI is never set for CPUs
> +	 * which don't have the eIBRS feature. But this doesn't happen in
> +	 * guest VMs, as the virtualization can hide the eIBRS feature.
> +	 */
> +	VULNWL_INTEL(IVYBRIDGE_X,		NO_BHI),
> +	VULNWL_INTEL(HASWELL_X,			NO_BHI),
> +	VULNWL_INTEL(BROADWELL_X,		NO_BHI),
> +	VULNWL_INTEL(SKYLAKE_X,			NO_BHI),

Isn't this at odds with the existing comment?

	/* When virtualized, eIBRS could be hidden, assume vulnerable */

Because it seems now that we've got two relatively conflicting pieces
of vulnerability information when running under a hypervisor.
On 4/15/24 19:17, Dave Hansen wrote:
>> +	/*
>> +	 * The following Intel CPUs are affected by BHI, but they don't have
>> +	 * the eIBRS feature. In that case, the default Spectre v2 mitigations
>> +	 * are enough to also mitigate BHI. We mark these CPUs with NO_BHI so
>> +	 * that X86_BUG_BHI doesn't get set and no extra BHI mitigation is
>> +	 * enabled.
>> +	 *
>> +	 * This prevents guest VMs from enabling extra BHI mitigation when
>> +	 * it is not needed. On the host, X86_BUG_BHI is never set for CPUs
>> +	 * which don't have the eIBRS feature. But this doesn't happen in
>> +	 * guest VMs, as the virtualization can hide the eIBRS feature.
>> +	 */
>> +	VULNWL_INTEL(IVYBRIDGE_X,		NO_BHI),
>> +	VULNWL_INTEL(HASWELL_X,			NO_BHI),
>> +	VULNWL_INTEL(BROADWELL_X,		NO_BHI),
>> +	VULNWL_INTEL(SKYLAKE_X,			NO_BHI),
>
> Isn't this at odds with the existing comment?
>
> 	/* When virtualized, eIBRS could be hidden, assume vulnerable */
>
> Because it seems now that we've got two relatively conflicting pieces
> of vulnerability information when running under a hypervisor.

It's not at odds; it's additional information. The comment covers the
general case.

When running under a hypervisor, the kernel can't rely on CPU features
to find out if the server has eIBRS or not, because the virtualization
can be hiding eIBRS. And that's the problem, because the kernel might
enable BHI mitigation while it's not needed.

For example, on Skylake: on the host, the kernel won't see eIBRS so it
won't set X86_BUG_BHI. But in a guest on the same platform, the kernel
will set X86_BUG_BHI because it doesn't know if the server effectively
doesn't have eIBRS or if eIBRS is hidden by virtualization.

With the patch, the kernel can know if the CPU it is running on (e.g.
Skylake) needs extra BHI mitigation or not. Then it can safely not
enable BHI mitigation, no matter if it is running on the host or in a
guest.

alex.
On Tue, Apr 16, 2024 at 10:41:58AM +0200, Alexandre Chartre wrote:
> On 4/15/24 19:17, Dave Hansen wrote:
>>> +	/*
>>> +	 * The following Intel CPUs are affected by BHI, but they don't have
>>> +	 * the eIBRS feature. In that case, the default Spectre v2 mitigations
>>> +	 * are enough to also mitigate BHI. We mark these CPUs with NO_BHI so
>>> +	 * that X86_BUG_BHI doesn't get set and no extra BHI mitigation is
>>> +	 * enabled.
>>> +	 *
>>> +	 * This prevents guest VMs from enabling extra BHI mitigation when
>>> +	 * it is not needed. On the host, X86_BUG_BHI is never set for CPUs
>>> +	 * which don't have the eIBRS feature. But this doesn't happen in
>>> +	 * guest VMs, as the virtualization can hide the eIBRS feature.
>>> +	 */
>>> +	VULNWL_INTEL(IVYBRIDGE_X,		NO_BHI),
>>> +	VULNWL_INTEL(HASWELL_X,			NO_BHI),
>>> +	VULNWL_INTEL(BROADWELL_X,		NO_BHI),
>>> +	VULNWL_INTEL(SKYLAKE_X,			NO_BHI),
>>
>> Isn't this at odds with the existing comment?
>>
>> 	/* When virtualized, eIBRS could be hidden, assume vulnerable */
>>
>> Because it seems now that we've got two relatively conflicting
>> pieces of vulnerability information when running under a hypervisor.
>
> It's not at odds; it's additional information. The comment covers the
> general case.
>
> When running under a hypervisor, the kernel can't rely on CPU
> features to find out if the server has eIBRS or not, because the
> virtualization can be hiding eIBRS. And that's the problem, because
> the kernel might enable BHI mitigation while it's not needed.
>
> For example, on Skylake: on the host, the kernel won't see eIBRS so
> it won't set X86_BUG_BHI. But in a guest on the same platform, the
> kernel will set X86_BUG_BHI because it doesn't know if the server
> effectively doesn't have eIBRS or if eIBRS is hidden by
> virtualization.
>
> With the patch, the kernel can know if the CPU it is running on (e.g.
> Skylake) needs extra BHI mitigation or not. Then it can safely not
> enable BHI mitigation, no matter if it is running on the host or in a
> guest.

Where do we want to go with this one?

The problem (which I think is not understood) is that on Skylake there
is no Enhanced IBRS support. There is either IBRS or retpoline -- and
when IBRS is enabled, it does the job of mitigating against BHI.

And as of right now, on Skylake guests we enable BHI _and_ IBRS, for
extra slowdown.

We can't disable IBRS as it mitigates against other bugs too, so how
do you folks want to automatically disable BHI on Skylake with the
least amount of code?
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 984ea2089efc..f43d3c15a6b7 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1678,6 +1678,9 @@ static u64 kvm_get_arch_capabilities(void)
 	if (!boot_cpu_has_bug(X86_BUG_GDS) || gds_ucode_mitigated())
 		data |= ARCH_CAP_GDS_NO;
 
+	if (!boot_cpu_has_bug(X86_BUG_BHI))
+		data |= ARCH_CAP_BHI_NO;
+
 	return data;
 }
When a system is not affected by the BHI bug then KVM should configure
guests with BHI_NO to ensure they won't enable any BHI mitigation.

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
---
 arch/x86/kvm/x86.c | 3 +++
 1 file changed, 3 insertions(+)