Message ID | 2863b0a9-ca7c-3cce-104d-0b6685b0b383@suse.com (mailing list archive)
---|---
State | New, archived
Series | x86/spec-ctrl: IBPB improvements
On Tue, Feb 14, 2023 at 05:12:08PM +0100, Jan Beulich wrote: > Since both kernel and user mode run in ring 3, they run in the same > "predictor mode". That's only true when IBRS is enabled; otherwise all CPU modes share the same predictor mode? > While the kernel could take care of this itself, doing > so would be yet another item distinguishing PV from native. Additionally > we're in a much better position to issue the barrier command, and we can > save a #GP (for privileged instruction emulation) this way. > > To allow to recover performance, introduce a new VM assist allowing the > guest kernel to suppress this barrier. Make availability of the assist > dependent upon the command line control, such that kernels have a way to > know whether their request actually took any effect. > > Note that because of its use in PV64_VM_ASSIST_MASK, the declaration of > opt_ibpb_mode_switch can't live in asm/spec_ctrl.h. > > Signed-off-by: Jan Beulich <jbeulich@suse.com> > --- > Is the placement of the clearing of opt_ibpb_ctxt_switch correct in > parse_spec_ctrl()? Shouldn't it live ahead of the "disable_common" > label, as being about guest protection, not Xen's? > > Adding setting of the variable to the "pv" sub-case in parse_spec_ctrl() > didn't seem quite right to me, considering that we default it to the > opposite of opt_ibpb_entry_pv. > --- > v4: Correct the print_details() change. Re-base in particular over > changes earlier in the series. > v3: Leverage exit-IBPB. Introduce separate command line control. > v2: Leverage entry-IBPB. Add VM assist. Re-base. 
> > --- a/docs/misc/xen-command-line.pandoc > +++ b/docs/misc/xen-command-line.pandoc > @@ -2320,8 +2320,8 @@ By default SSBD will be mitigated at run > ### spec-ctrl (x86) > > `= List of [ <bool>, xen=<bool>, {pv,hvm}=<bool>, > > {msr-sc,rsb,md-clear,ibpb-entry}=<bool>|{pv,hvm}=<bool>, > -> bti-thunk=retpoline|lfence|jmp, {ibrs,ibpb,ssbd,psfd, > -> eager-fpu,l1d-flush,branch-harden,srb-lock, > +> bti-thunk=retpoline|lfence|jmp, {ibrs,ibpb,ibpb-mode-switch, > +> ssbd,psfd,eager-fpu,l1d-flush,branch-harden,srb-lock, > > unpriv-mmio}=<bool> ]` > > Controls for speculative execution sidechannel mitigations. By default, Xen > @@ -2403,7 +2403,10 @@ default. > > On hardware supporting IBPB (Indirect Branch Prediction Barrier), the `ibpb=` > option can be used to force (the default) or prevent Xen from issuing branch > -prediction barriers on vcpu context switches. > +prediction barriers on vcpu context switches. On such hardware the > +`ibpb-mode-switch` option can be used to control whether, by default, Xen > +would issue branch prediction barriers when 64-bit PV guests switch from > +user to kernel mode. If enabled, guest kernels can opt out of this behavior. > > On all hardware, the `eager-fpu=` option can be used to force or prevent Xen > from using fully eager FPU context switches. This is currently implemented as > --- a/xen/arch/x86/include/asm/domain.h > +++ b/xen/arch/x86/include/asm/domain.h > @@ -742,6 +742,8 @@ static inline void pv_inject_sw_interrup > pv_inject_event(&event); > } > > +extern int8_t opt_ibpb_mode_switch; > + > #define PV32_VM_ASSIST_MASK ((1UL << VMASST_TYPE_4gb_segments) | \ > (1UL << VMASST_TYPE_4gb_segments_notify) | \ > (1UL << VMASST_TYPE_writable_pagetables) | \ > @@ -753,7 +755,9 @@ static inline void pv_inject_sw_interrup > * but we can't make such requests fail all of the sudden. 
> */ > #define PV64_VM_ASSIST_MASK (PV32_VM_ASSIST_MASK | \ > - (1UL << VMASST_TYPE_m2p_strict)) > + (1UL << VMASST_TYPE_m2p_strict) | \ > + ((opt_ibpb_mode_switch + 0UL) << \ > + VMASST_TYPE_mode_switch_no_ibpb)) I'm wondering whether it isn't kind of weird to offer the option to PV domUs if opt_ibpb_entry_pv is set, as then the guest mode switch will always (implicitly) do an IBPB, as it requires a hypercall and thus takes an entry point into Xen. I guess it's worth having it just as a way to signal to Xen that the hypervisor does perform an IBPB, even if the guest cannot disable it. > #define HVM_VM_ASSIST_MASK (1UL << VMASST_TYPE_runstate_update_flag) > > #define arch_vm_assist_valid_mask(d) \ > --- a/xen/arch/x86/pv/domain.c > +++ b/xen/arch/x86/pv/domain.c > @@ -455,6 +455,7 @@ static void _toggle_guest_pt(struct vcpu > void toggle_guest_mode(struct vcpu *v) > { > const struct domain *d = v->domain; > + struct cpu_info *cpu_info = get_cpu_info(); > unsigned long gs_base; > > ASSERT(!is_pv_32bit_vcpu(v)); > @@ -467,15 +468,21 @@ void toggle_guest_mode(struct vcpu *v) > if ( v->arch.flags & TF_kernel_mode ) > v->arch.pv.gs_base_kernel = gs_base; > else > + { > v->arch.pv.gs_base_user = gs_base; > + > + if ( opt_ibpb_mode_switch && > + !(d->arch.spec_ctrl_flags & SCF_entry_ibpb) && > + !VM_ASSIST(d, mode_switch_no_ibpb) ) > + cpu_info->spec_ctrl_flags |= SCF_new_pred_ctxt; Likewise, similar to the remarks I've made before: if doing an IBPB on entry is enough to cover for the case here, it must also be fine to issue the IBPB right here, instead of deferring it to the return to guest context? The only concern would be (as you mentioned before) to avoid clearing valid Xen predictions, but I would rather see some figures about what effect delaying until the return to guest has vs issuing it right here. 
> + } > + > asm volatile ( "swapgs" ); > > _toggle_guest_pt(v); > > if ( d->arch.pv.xpti ) > { > - struct cpu_info *cpu_info = get_cpu_info(); > - > cpu_info->root_pgt_changed = true; > cpu_info->pv_cr3 = __pa(this_cpu(root_pgt)) | > (d->arch.pv.pcid ? get_pcid_bits(v, true) : 0); > --- a/xen/arch/x86/spec_ctrl.c > +++ b/xen/arch/x86/spec_ctrl.c > @@ -60,6 +60,7 @@ bool __ro_after_init opt_ssbd; > int8_t __initdata opt_psfd = -1; > > int8_t __ro_after_init opt_ibpb_ctxt_switch = -1; > +int8_t __ro_after_init opt_ibpb_mode_switch = -1; > int8_t __read_mostly opt_eager_fpu = -1; > int8_t __read_mostly opt_l1d_flush = -1; > static bool __initdata opt_branch_harden = true; > @@ -111,6 +112,8 @@ static int __init cf_check parse_spec_ct > if ( opt_pv_l1tf_domu < 0 ) > opt_pv_l1tf_domu = 0; > > + opt_ibpb_mode_switch = 0; > + > if ( opt_tsx == -1 ) > opt_tsx = -3; > > @@ -271,6 +274,8 @@ static int __init cf_check parse_spec_ct > /* Misc settings. */ > else if ( (val = parse_boolean("ibpb", s, ss)) >= 0 ) > opt_ibpb_ctxt_switch = val; > + else if ( (val = parse_boolean("ibpb-mode-switch", s, ss)) >= 0 ) > + opt_ibpb_mode_switch = val; > else if ( (val = parse_boolean("eager-fpu", s, ss)) >= 0 ) > opt_eager_fpu = val; > else if ( (val = parse_boolean("l1d-flush", s, ss)) >= 0 ) > @@ -527,16 +532,18 @@ static void __init print_details(enum in > > #endif > #ifdef CONFIG_PV > - printk(" Support for PV VMs:%s%s%s%s%s%s\n", > + printk(" Support for PV VMs:%s%s%s%s%s%s%s\n", > (boot_cpu_has(X86_FEATURE_SC_MSR_PV) || > boot_cpu_has(X86_FEATURE_SC_RSB_PV) || > boot_cpu_has(X86_FEATURE_IBPB_ENTRY_PV) || > - opt_eager_fpu || opt_md_clear_pv) ? "" : " None", > + opt_eager_fpu || opt_md_clear_pv || > + opt_ibpb_mode_switch) ? "" : " None", > boot_cpu_has(X86_FEATURE_SC_MSR_PV) ? " MSR_SPEC_CTRL" : "", > boot_cpu_has(X86_FEATURE_SC_RSB_PV) ? " RSB" : "", > opt_eager_fpu ? " EAGER_FPU" : "", > opt_md_clear_pv ? " MD_CLEAR" : "", > - boot_cpu_has(X86_FEATURE_IBPB_ENTRY_PV) ? 
" IBPB-entry" : ""); > + boot_cpu_has(X86_FEATURE_IBPB_ENTRY_PV) ? " IBPB-entry" : "", > + opt_ibpb_mode_switch ? " IBPB-mode-switch" : ""); > > printk(" XPTI (64-bit PV only): Dom0 %s, DomU %s (with%s PCID)\n", > opt_xpti_hwdom ? "enabled" : "disabled", > @@ -804,7 +811,8 @@ static void __init ibpb_calculations(voi > /* Check we have hardware IBPB support before using it... */ > if ( !boot_cpu_has(X86_FEATURE_IBRSB) && !boot_cpu_has(X86_FEATURE_IBPB) ) > { > - opt_ibpb_entry_hvm = opt_ibpb_entry_pv = opt_ibpb_ctxt_switch = 0; > + opt_ibpb_entry_hvm = opt_ibpb_entry_pv = 0; > + opt_ibpb_mode_switch = opt_ibpb_ctxt_switch = 0; > opt_ibpb_entry_dom0 = false; > return; > } > @@ -859,6 +867,18 @@ static void __init ibpb_calculations(voi > setup_force_cpu_cap(X86_FEATURE_NEW_PRED_CTXT_PV); > setup_force_cpu_cap(X86_FEATURE_NEW_PRED_CTXT_HVM); > } > + > +#ifdef CONFIG_PV > + /* > + * If we're using IBPB-on-entry to protect against PV guests, then > + * there's no need to also issue IBPB on a guest user->kernel switch. > + */ > + if ( opt_ibpb_mode_switch == -1 ) > + opt_ibpb_mode_switch = !opt_ibpb_entry_pv || > + (!opt_ibpb_entry_dom0 && !opt_dom0_pvh); > + if ( opt_ibpb_mode_switch ) > + setup_force_cpu_cap(X86_FEATURE_NEW_PRED_CTXT_PV); > +#endif > } > > /* Calculate whether this CPU is vulnerable to L1TF. */ > --- a/xen/include/public/xen.h > +++ b/xen/include/public/xen.h > @@ -554,6 +554,16 @@ DEFINE_XEN_GUEST_HANDLE(mmuext_op_t); > */ > #define VMASST_TYPE_m2p_strict 32 > > +/* > + * x86-64 guests: Suppress IBPB on guest-user to guest-kernel mode switch. I think this needs to be more vague, as it's not true that the IBPB will be suppressed if Xen is unconditionally issuing one on all guest entry points. Maybe adding: "Setting the assist signals Xen that the IBPB can be avoided from a guest perspective, however Xen might still issue one for other reasons." 
> + * > + * By default (on affected and capable hardware) as a safety measure Xen, > + * to cover for the fact that guest-kernel and guest-user modes are both > + * running in ring 3 (and hence share prediction context), would issue a > + * barrier for user->kernel mode switches of PV guests. > + */ > +#define VMASST_TYPE_mode_switch_no_ibpb 33 Would it be possible to define the assist as VMASST_TYPE_mode_switch_ibpb and have it on when enabled? So that the guest would disable it if unneeded? IMO negated options are in general harder to understand. Thanks, Roger.
On 18.12.2023 18:24, Roger Pau Monné wrote: > On Tue, Feb 14, 2023 at 05:12:08PM +0100, Jan Beulich wrote: >> Since both kernel and user mode run in ring 3, they run in the same >> "predictor mode". > > That only true when IBRS is enabled, otherwise all CPU modes share the > same predictor mode? But here we only care about ring 3 anyway? >> @@ -753,7 +755,9 @@ static inline void pv_inject_sw_interrup >> * but we can't make such requests fail all of the sudden. >> */ >> #define PV64_VM_ASSIST_MASK (PV32_VM_ASSIST_MASK | \ >> - (1UL << VMASST_TYPE_m2p_strict)) >> + (1UL << VMASST_TYPE_m2p_strict) | \ >> + ((opt_ibpb_mode_switch + 0UL) << \ >> + VMASST_TYPE_mode_switch_no_ibpb)) > > I'm wondering that it's kind of weird to offer the option to PV domUs > if opt_ibpb_entry_pv is set, as then the guest mode switch will always > (implicitly) do a IBPB as requiring an hypercall and thus take an > entry point into Xen. > > I guess it's worth having it just as a way to signal to Xen that the > hypervisor does perform an IBPB, even if the guest cannot disable it. I'm afraid I'm confused by your reply. Not only, but also because the latter sentence looks partly backwards / non-logical to me. 
>> --- a/xen/arch/x86/pv/domain.c >> +++ b/xen/arch/x86/pv/domain.c >> @@ -455,6 +455,7 @@ static void _toggle_guest_pt(struct vcpu >> void toggle_guest_mode(struct vcpu *v) >> { >> const struct domain *d = v->domain; >> + struct cpu_info *cpu_info = get_cpu_info(); >> unsigned long gs_base; >> >> ASSERT(!is_pv_32bit_vcpu(v)); >> @@ -467,15 +468,21 @@ void toggle_guest_mode(struct vcpu *v) >> if ( v->arch.flags & TF_kernel_mode ) >> v->arch.pv.gs_base_kernel = gs_base; >> else >> + { >> v->arch.pv.gs_base_user = gs_base; >> + >> + if ( opt_ibpb_mode_switch && >> + !(d->arch.spec_ctrl_flags & SCF_entry_ibpb) && >> + !VM_ASSIST(d, mode_switch_no_ibpb) ) >> + cpu_info->spec_ctrl_flags |= SCF_new_pred_ctxt; > > Likewise similar to the remarks I've made before, if doing an IBPB on > entry is enough to cover for the case here, it must also be fine to > issue the IBPB right here, instead of deferring to return to guest > context? > > The only concern would be (as you mentioned before) to avoid clearing > valid Xen predictions, but I would rather see some figures about what > effect the delaying to return to guest has vs issuing it right here. Part of the reason (aiui) to do things on the exit path was to consolidate the context switch induced one and the user->kernel switch one into the same place and mechanism. >> --- a/xen/include/public/xen.h >> +++ b/xen/include/public/xen.h >> @@ -554,6 +554,16 @@ DEFINE_XEN_GUEST_HANDLE(mmuext_op_t); >> */ >> #define VMASST_TYPE_m2p_strict 32 >> >> +/* >> + * x86-64 guests: Suppress IBPB on guest-user to guest-kernel mode switch. > > I think this needs to be more vague, as it's not true that the IBPB > will be suppressed if Xen is unconditionally issuing one on all guest > entry points. > > Maybe adding: > > "Setting the assist signals Xen that the IBPB can be avoided from a > guest perspective, however Xen might still issue one for other > reasons." I've done s/Suppress/Permit skipping/. 
I wouldn't want to go further, as that then becomes related to implementation details imo. IOW of course Xen may issue IBPB whenever it thinks there's a possible need. >> + * >> + * By default (on affected and capable hardware) as a safety measure Xen, >> + * to cover for the fact that guest-kernel and guest-user modes are both >> + * running in ring 3 (and hence share prediction context), would issue a >> + * barrier for user->kernel mode switches of PV guests. >> + */ >> +#define VMASST_TYPE_mode_switch_no_ibpb 33 > > Would it be possible to define the assist as > VMASST_TYPE_mode_switch_ibpb and have it on when enabled? So that the > guest would disable it if unneeded? IMO negated options are in > general harder to understand. Negative options aren't nice, yes, but VM assists start out as all clear. The guest needs to change a "false" to a "true", and thus it cannot be a positive option here, as we want the default (off) to be safe/secure. Jan
On Tue, Dec 19, 2023 at 10:56:16AM +0100, Jan Beulich wrote: > On 18.12.2023 18:24, Roger Pau Monné wrote: > > On Tue, Feb 14, 2023 at 05:12:08PM +0100, Jan Beulich wrote: > >> Since both kernel and user mode run in ring 3, they run in the same > >> "predictor mode". > > > > That only true when IBRS is enabled, otherwise all CPU modes share the > > same predictor mode? > > But here we only care about ring 3 anyway? Hm, yes, what I wanted to say is that this is not exclusive to 64bit PV, as with IBRS disabled ring 3 and ring 0 share the same predictor mode. Anyway, not relevant. > > >> @@ -753,7 +755,9 @@ static inline void pv_inject_sw_interrup > >> * but we can't make such requests fail all of the sudden. > >> */ > >> #define PV64_VM_ASSIST_MASK (PV32_VM_ASSIST_MASK | \ > >> - (1UL << VMASST_TYPE_m2p_strict)) > >> + (1UL << VMASST_TYPE_m2p_strict) | \ > >> + ((opt_ibpb_mode_switch + 0UL) << \ > >> + VMASST_TYPE_mode_switch_no_ibpb)) > > > > I'm wondering that it's kind of weird to offer the option to PV domUs > > if opt_ibpb_entry_pv is set, as then the guest mode switch will always > > (implicitly) do a IBPB as requiring an hypercall and thus take an > > entry point into Xen. > > > > I guess it's worth having it just as a way to signal to Xen that the > > hypervisor does perform an IBPB, even if the guest cannot disable it. > > I'm afraid I'm confused by your reply. Not only, but also because the > latter sentence looks partly backwards / non-logical to me. Sorry, I think I didn't word that very well. The remark is tied to the one below about the vmassist 'possibly' allowing the guest to disable IBPB on guest user -> kernel context switches, but Xen might unconditionally do additional IBPBs that the guest cannot disable. 
> >> --- a/xen/arch/x86/pv/domain.c > >> +++ b/xen/arch/x86/pv/domain.c > >> @@ -455,6 +455,7 @@ static void _toggle_guest_pt(struct vcpu > >> void toggle_guest_mode(struct vcpu *v) > >> { > >> const struct domain *d = v->domain; > >> + struct cpu_info *cpu_info = get_cpu_info(); > >> unsigned long gs_base; > >> > >> ASSERT(!is_pv_32bit_vcpu(v)); > >> @@ -467,15 +468,21 @@ void toggle_guest_mode(struct vcpu *v) > >> if ( v->arch.flags & TF_kernel_mode ) > >> v->arch.pv.gs_base_kernel = gs_base; > >> else > >> + { > >> v->arch.pv.gs_base_user = gs_base; > >> + > >> + if ( opt_ibpb_mode_switch && > >> + !(d->arch.spec_ctrl_flags & SCF_entry_ibpb) && > >> + !VM_ASSIST(d, mode_switch_no_ibpb) ) > >> + cpu_info->spec_ctrl_flags |= SCF_new_pred_ctxt; > > > > Likewise similar to the remarks I've made before, if doing an IBPB on > > entry is enough to cover for the case here, it must also be fine to > > issue the IBPB right here, instead of deferring to return to guest > > context? > > > > The only concern would be (as you mentioned before) to avoid clearing > > valid Xen predictions, but I would rather see some figures about what > > effect the delaying to return to guest has vs issuing it right here. > > Part of the reason (aiui) to do things on the exit path was to > consolidate the context switch induced one and the user->kernel switch > one into the same place and mechanism. Isn't it kind of a very specific case that we end up doing a user->kernel switch as part of a context switch? IOW: would require the vCPU to be scheduled out at that very specific point. > >> + * > >> + * By default (on affected and capable hardware) as a safety measure Xen, > >> + * to cover for the fact that guest-kernel and guest-user modes are both > >> + * running in ring 3 (and hence share prediction context), would issue a > >> + * barrier for user->kernel mode switches of PV guests. 
> >> + */ > >> +#define VMASST_TYPE_mode_switch_no_ibpb 33 > > > > Would it be possible to define the assist as > > VMASST_TYPE_mode_switch_ibpb and have it on when enabled? So that the > > guest would disable it if unneeded? IMO negated options are in > > general harder to understand. > > Negative options aren't nice, yes, but VM assists start out as all > clear. Are you sure? I see VMASST_TYPE_pae_extended_cr3 getting set in dom0_construct_pv() and that makes me wonder whether other bits couldn't start set also. Maybe there's some restriction I'm missing, but I don't see any wording in the description of the interface that states that all assists are supposed to start disabled. Thanks, Roger.
On 19.12.2023 12:48, Roger Pau Monné wrote: > On Tue, Dec 19, 2023 at 10:56:16AM +0100, Jan Beulich wrote: >> On 18.12.2023 18:24, Roger Pau Monné wrote: >>> On Tue, Feb 14, 2023 at 05:12:08PM +0100, Jan Beulich wrote: >>>> --- a/xen/arch/x86/pv/domain.c >>>> +++ b/xen/arch/x86/pv/domain.c >>>> @@ -455,6 +455,7 @@ static void _toggle_guest_pt(struct vcpu >>>> void toggle_guest_mode(struct vcpu *v) >>>> { >>>> const struct domain *d = v->domain; >>>> + struct cpu_info *cpu_info = get_cpu_info(); >>>> unsigned long gs_base; >>>> >>>> ASSERT(!is_pv_32bit_vcpu(v)); >>>> @@ -467,15 +468,21 @@ void toggle_guest_mode(struct vcpu *v) >>>> if ( v->arch.flags & TF_kernel_mode ) >>>> v->arch.pv.gs_base_kernel = gs_base; >>>> else >>>> + { >>>> v->arch.pv.gs_base_user = gs_base; >>>> + >>>> + if ( opt_ibpb_mode_switch && >>>> + !(d->arch.spec_ctrl_flags & SCF_entry_ibpb) && >>>> + !VM_ASSIST(d, mode_switch_no_ibpb) ) >>>> + cpu_info->spec_ctrl_flags |= SCF_new_pred_ctxt; >>> >>> Likewise similar to the remarks I've made before, if doing an IBPB on >>> entry is enough to cover for the case here, it must also be fine to >>> issue the IBPB right here, instead of deferring to return to guest >>> context? >>> >>> The only concern would be (as you mentioned before) to avoid clearing >>> valid Xen predictions, but I would rather see some figures about what >>> effect the delaying to return to guest has vs issuing it right here. >> >> Part of the reason (aiui) to do things on the exit path was to >> consolidate the context switch induced one and the user->kernel switch >> one into the same place and mechanism. > > Isn't it kind of a very specific case that we end up doing a > user->kernel switch as part of a context switch? IOW: would require > the vCPU to be scheduled out at that very specific point. No, there's no user->kernel switch at the same time as context switch. 
What I was trying to explain is that with the actual IBPB being issued on exit to guest, both the context switch path and the user->kernel mode switch path set the same indicator, for the exit path to consume. >>>> + * >>>> + * By default (on affected and capable hardware) as a safety measure Xen, >>>> + * to cover for the fact that guest-kernel and guest-user modes are both >>>> + * running in ring 3 (and hence share prediction context), would issue a >>>> + * barrier for user->kernel mode switches of PV guests. >>>> + */ >>>> +#define VMASST_TYPE_mode_switch_no_ibpb 33 >>> >>> Would it be possible to define the assist as >>> VMASST_TYPE_mode_switch_ibpb and have it on when enabled? So that the >>> guest would disable it if unneeded? IMO negated options are in >>> general harder to understand. >> >> Negative options aren't nice, yes, but VM assists start out as all >> clear. > > Are you sure? I see VMASST_TYPE_pae_extended_cr3 getting set in > dom0_construct_pv() and that makes me wonder whether other bits > couldn't start set also. > > Maybe there's some restriction I'm missing, but I don't see any > wording in the description of the interface that states that all > assists are supposed to start disabled. Well, that setting of pae_extended_cr3 is in response to the kernel's notes section having a respective indicator. So we still only set the bit in response to what the kernel's asking us to do, just that here we carry out the request ahead of launching the kernel. Also consider what would happen during migration if there was a default-on assist: At the destination we can't know whether the source simply didn't know of the bit, or whether the guest elected to clear it. Jan
On Tue, Dec 19, 2023 at 03:06:50PM +0100, Jan Beulich wrote: > On 19.12.2023 12:48, Roger Pau Monné wrote: > > On Tue, Dec 19, 2023 at 10:56:16AM +0100, Jan Beulich wrote: > >> On 18.12.2023 18:24, Roger Pau Monné wrote: > >>> On Tue, Feb 14, 2023 at 05:12:08PM +0100, Jan Beulich wrote: > >>>> --- a/xen/arch/x86/pv/domain.c > >>>> +++ b/xen/arch/x86/pv/domain.c > >>>> @@ -455,6 +455,7 @@ static void _toggle_guest_pt(struct vcpu > >>>> void toggle_guest_mode(struct vcpu *v) > >>>> { > >>>> const struct domain *d = v->domain; > >>>> + struct cpu_info *cpu_info = get_cpu_info(); > >>>> unsigned long gs_base; > >>>> > >>>> ASSERT(!is_pv_32bit_vcpu(v)); > >>>> @@ -467,15 +468,21 @@ void toggle_guest_mode(struct vcpu *v) > >>>> if ( v->arch.flags & TF_kernel_mode ) > >>>> v->arch.pv.gs_base_kernel = gs_base; > >>>> else > >>>> + { > >>>> v->arch.pv.gs_base_user = gs_base; > >>>> + > >>>> + if ( opt_ibpb_mode_switch && > >>>> + !(d->arch.spec_ctrl_flags & SCF_entry_ibpb) && > >>>> + !VM_ASSIST(d, mode_switch_no_ibpb) ) > >>>> + cpu_info->spec_ctrl_flags |= SCF_new_pred_ctxt; > >>> > >>> Likewise similar to the remarks I've made before, if doing an IBPB on > >>> entry is enough to cover for the case here, it must also be fine to > >>> issue the IBPB right here, instead of deferring to return to guest > >>> context? > >>> > >>> The only concern would be (as you mentioned before) to avoid clearing > >>> valid Xen predictions, but I would rather see some figures about what > >>> effect the delaying to return to guest has vs issuing it right here. > >> > >> Part of the reason (aiui) to do things on the exit path was to > >> consolidate the context switch induced one and the user->kernel switch > >> one into the same place and mechanism. > > > > Isn't it kind of a very specific case that we end up doing a > > user->kernel switch as part of a context switch? IOW: would require > > the vCPU to be scheduled out at that very specific point. 
> > No, there's no user->kernel switch at the same time as context switch. > What I was trying to explain is that with the actual IBPB being issued > on exit to guest, both the context switch path and the user->kernel > mode switch path set the same indicator, for the exit path to consume. Deferring to the exit-to-guest path could be OK, but unless strictly needed, which I don't think is the case, I would request that the IBPB be executed in C context rather than an assembly one. > >>>> + * > >>>> + * By default (on affected and capable hardware) as a safety measure Xen, > >>>> + * to cover for the fact that guest-kernel and guest-user modes are both > >>>> + * running in ring 3 (and hence share prediction context), would issue a > >>>> + * barrier for user->kernel mode switches of PV guests. > >>>> + */ > >>>> +#define VMASST_TYPE_mode_switch_no_ibpb 33 > >>> > >>> Would it be possible to define the assist as > >>> VMASST_TYPE_mode_switch_ibpb and have it on when enabled? So that the > >>> guest would disable it if unneeded? IMO negated options are in > >>> general harder to understand. > >> > >> Negative options aren't nice, yes, but VM assists start out as all > >> clear. > > > > Are you sure? I see VMASST_TYPE_pae_extended_cr3 getting set in > > dom0_construct_pv() and that makes me wonder whether other bits > > couldn't start set also. > > > > Maybe there's some restriction I'm missing, but I don't see any > > wording in the description of the interface that states that all > > assists are supposed to start disabled. > > Well, that setting of pae_extended_cr3 is in response to the kernel's > notes section having a respective indicator. 
> > Also consider what would happen during migration if there was a > default-on assist: At the destination we can't know whether the > source simply didn't know of the bit, or whether the guest elected to > clear it. Hm, I see, so I was indeed missing that aspect. VM assist is passed as a plain bitmap, and there's no indication of which assists the VM had available on the source side if they weren't enabled. Thanks, Roger.
On Tue, Dec 19, 2023 at 04:11:09PM +0100, Roger Pau Monné wrote: > On Tue, Dec 19, 2023 at 03:06:50PM +0100, Jan Beulich wrote: > > On 19.12.2023 12:48, Roger Pau Monné wrote: > > > On Tue, Dec 19, 2023 at 10:56:16AM +0100, Jan Beulich wrote: > > >> On 18.12.2023 18:24, Roger Pau Monné wrote: > > >>> On Tue, Feb 14, 2023 at 05:12:08PM +0100, Jan Beulich wrote: > > >>>> --- a/xen/arch/x86/pv/domain.c > > >>>> +++ b/xen/arch/x86/pv/domain.c > > >>>> @@ -455,6 +455,7 @@ static void _toggle_guest_pt(struct vcpu > > >>>> void toggle_guest_mode(struct vcpu *v) > > >>>> { > > >>>> const struct domain *d = v->domain; > > >>>> + struct cpu_info *cpu_info = get_cpu_info(); > > >>>> unsigned long gs_base; > > >>>> > > >>>> ASSERT(!is_pv_32bit_vcpu(v)); > > >>>> @@ -467,15 +468,21 @@ void toggle_guest_mode(struct vcpu *v) > > >>>> if ( v->arch.flags & TF_kernel_mode ) > > >>>> v->arch.pv.gs_base_kernel = gs_base; > > >>>> else > > >>>> + { > > >>>> v->arch.pv.gs_base_user = gs_base; > > >>>> + > > >>>> + if ( opt_ibpb_mode_switch && > > >>>> + !(d->arch.spec_ctrl_flags & SCF_entry_ibpb) && > > >>>> + !VM_ASSIST(d, mode_switch_no_ibpb) ) > > >>>> + cpu_info->spec_ctrl_flags |= SCF_new_pred_ctxt; > > >>> > > >>> Likewise similar to the remarks I've made before, if doing an IBPB on > > >>> entry is enough to cover for the case here, it must also be fine to > > >>> issue the IBPB right here, instead of deferring to return to guest > > >>> context? > > >>> > > >>> The only concern would be (as you mentioned before) to avoid clearing > > >>> valid Xen predictions, but I would rather see some figures about what > > >>> effect the delaying to return to guest has vs issuing it right here. > > >> > > >> Part of the reason (aiui) to do things on the exit path was to > > >> consolidate the context switch induced one and the user->kernel switch > > >> one into the same place and mechanism. 
> > > > > > Isn't it kind of a very specific case that we end up doing a > > > user->kernel switch as part of a context switch? IOW: would require > > > the vCPU to be scheduled out at that very specific point. > > > > No, there's no user->kernel switch at the same time as context switch. > > What I was trying to explain is that with the actual IBPB being issued > > on exit to guest, both the context switch path and the user->kernel > > mode switch path set the same indicator, for the exit path to consume. > > Deferring to exit to guest path could be OK, but unless strictly > needed, which I don't think it's the case, I would request for IBPB to > be executed in C context rather than assembly one. > > > >>>> + * > > >>>> + * By default (on affected and capable hardware) as a safety measure Xen, > > >>>> + * to cover for the fact that guest-kernel and guest-user modes are both > > >>>> + * running in ring 3 (and hence share prediction context), would issue a > > >>>> + * barrier for user->kernel mode switches of PV guests. > > >>>> + */ > > >>>> +#define VMASST_TYPE_mode_switch_no_ibpb 33 > > >>> > > >>> Would it be possible to define the assist as > > >>> VMASST_TYPE_mode_switch_ibpb and have it on when enabled? So that the > > >>> guest would disable it if unneeded? IMO negated options are in > > >>> general harder to understand. > > >> > > >> Negative options aren't nice, yes, but VM assists start out as all > > >> clear. > > > > > > Are you sure? I see VMASST_TYPE_pae_extended_cr3 getting set in > > > dom0_construct_pv() and that makes me wonder whether other bits > > > couldn't start set also. > > > > > > Maybe there's some restriction I'm missing, but I don't see any > > > wording in the description of the interface that states that all > > > assists are supposed to start disabled. > > > > Well, that setting of pae_extended_cr3 is in response to the kernel's > > notes section having a respective indicator. 
So we still only set the > > bit in response to what the kernel's asking us to do, just that here > > we carry out the request ahead of launching the kernel. > > > > Also consider what would happen during migration if there was a > > default-on assist: At the destination we can't know whether the > > source simply didn't know of the bit, or whether the guest elected to > > clear it. > > Hm, I see, so I was indeed missing that aspect. VM assist is passed > as a plain bitmap, and there's no signal on which assists the VM had > available on the source side if not enabled. Sorry, please bear with me, as I've been further thinking about this. Why does the assist need to be default-on? It's my understanding that the guest can execute the IBPB itself by writing to the MSR, but that's suboptimal in the user -> kernel context switch as it then involves two traps into Xen; still, the guest is not left insecure, as it just needs to write the MSR itself like on native. In fact, if we add an IBPB by default as part of the amd64 PV user -> kernel guest context switch, we are likely doing a double IBPB on guests not aware of the assist. IOW: I don't know why doing the assist as guest opt-in would be insecure; in fact I think it's the best approach (again I might be missing something). Thanks, Roger.
On 19.12.2023 18:07, Roger Pau Monné wrote:
> On Tue, Dec 19, 2023 at 04:11:09PM +0100, Roger Pau Monné wrote:
>> On Tue, Dec 19, 2023 at 03:06:50PM +0100, Jan Beulich wrote:
>>> On 19.12.2023 12:48, Roger Pau Monné wrote:
>>>> On Tue, Dec 19, 2023 at 10:56:16AM +0100, Jan Beulich wrote:
>>>>> On 18.12.2023 18:24, Roger Pau Monné wrote:
>>>>>> On Tue, Feb 14, 2023 at 05:12:08PM +0100, Jan Beulich wrote:
>>>>>>> --- a/xen/arch/x86/pv/domain.c
>>>>>>> +++ b/xen/arch/x86/pv/domain.c
>>>>>>> @@ -455,6 +455,7 @@ static void _toggle_guest_pt(struct vcpu
>>>>>>>  void toggle_guest_mode(struct vcpu *v)
>>>>>>>  {
>>>>>>>      const struct domain *d = v->domain;
>>>>>>> +    struct cpu_info *cpu_info = get_cpu_info();
>>>>>>>      unsigned long gs_base;
>>>>>>>
>>>>>>>      ASSERT(!is_pv_32bit_vcpu(v));
>>>>>>> @@ -467,15 +468,21 @@ void toggle_guest_mode(struct vcpu *v)
>>>>>>>      if ( v->arch.flags & TF_kernel_mode )
>>>>>>>          v->arch.pv.gs_base_kernel = gs_base;
>>>>>>>      else
>>>>>>> +    {
>>>>>>>          v->arch.pv.gs_base_user = gs_base;
>>>>>>> +
>>>>>>> +        if ( opt_ibpb_mode_switch &&
>>>>>>> +             !(d->arch.spec_ctrl_flags & SCF_entry_ibpb) &&
>>>>>>> +             !VM_ASSIST(d, mode_switch_no_ibpb) )
>>>>>>> +            cpu_info->spec_ctrl_flags |= SCF_new_pred_ctxt;
>>>>>>
>>>>>> Likewise similar to the remarks I've made before, if doing an IBPB on
>>>>>> entry is enough to cover for the case here, it must also be fine to
>>>>>> issue the IBPB right here, instead of deferring to return to guest
>>>>>> context?
>>>>>>
>>>>>> The only concern would be (as you mentioned before) to avoid clearing
>>>>>> valid Xen predictions, but I would rather see some figures about what
>>>>>> effect the delaying to return to guest has vs issuing it right here.
>>>>>
>>>>> Part of the reason (aiui) to do things on the exit path was to
>>>>> consolidate the context switch induced one and the user->kernel switch
>>>>> one into the same place and mechanism.
>>>>
>>>> Isn't it kind of a very specific case that we end up doing a
>>>> user->kernel switch as part of a context switch?  IOW: would require
>>>> the vCPU to be scheduled out at that very specific point.
>>>
>>> No, there's no user->kernel switch at the same time as context switch.
>>> What I was trying to explain is that with the actual IBPB being issued
>>> on exit to guest, both the context switch path and the user->kernel
>>> mode switch path set the same indicator, for the exit path to consume.
>>
>> Deferring to exit to guest path could be OK, but unless strictly
>> needed, which I don't think it's the case, I would request for IBPB to
>> be executed in C context rather than assembly one.
>>
>>>>>>> + *
>>>>>>> + * By default (on affected and capable hardware) as a safety measure Xen,
>>>>>>> + * to cover for the fact that guest-kernel and guest-user modes are both
>>>>>>> + * running in ring 3 (and hence share prediction context), would issue a
>>>>>>> + * barrier for user->kernel mode switches of PV guests.
>>>>>>> + */
>>>>>>> +#define VMASST_TYPE_mode_switch_no_ibpb 33
>>>>>>
>>>>>> Would it be possible to define the assist as
>>>>>> VMASST_TYPE_mode_switch_ibpb and have it on when enabled?  So that the
>>>>>> guest would disable it if unneeded?  IMO negated options are in
>>>>>> general harder to understand.
>>>>>
>>>>> Negative options aren't nice, yes, but VM assists start out as all
>>>>> clear.
>>>>
>>>> Are you sure?  I see VMASST_TYPE_pae_extended_cr3 getting set in
>>>> dom0_construct_pv() and that makes me wonder whether other bits
>>>> couldn't start set also.
>>>>
>>>> Maybe there's some restriction I'm missing, but I don't see any
>>>> wording in the description of the interface that states that all
>>>> assists are supposed to start disabled.
>>>
>>> Well, that setting of pae_extended_cr3 is in response to the kernel's
>>> notes section having a respective indicator.  So we still only set the
>>> bit in response to what the kernel's asking us to do, just that here
>>> we carry out the request ahead of launching the kernel.
>>>
>>> Also consider what would happen during migration if there was a
>>> default-on assist: At the destination we can't know whether the
>>> source simply didn't know of the bit, or whether the guest elected to
>>> clear it.
>>
>> Hm, I see, so I was indeed missing that aspect.  VM assist is passed
>> as a plain bitmap, and there's no signal on which assists the VM had
>> available on the source side if not enabled.
>
> Sorry, please bear with me, as I've been further thinking about this.
>
> Why does the assist needs to be default-on?  It's my understanding
> that the guest can execute the IBPB itself by writing to the MSR, but
> that's suboptimal in the user -> kernel context switch as it then
> involves two traps into Xen, but the guest is not left insecure, it
> just needs to write the MSR itself like on native.
>
> In fact, if we add an IBPB by default as part of amd64 PV user ->
> kernel guest context switch, we are likely doing a double IBPB on
> guests not aware of the assist.
>
> IOW: I don't know why doing the assist as guest opt-in would be
> insecure, in fact I think it's the best approach (again I might be
> missing something).

By issuing IBPB by default we can make guests safe (in this regard)
irrespective of their awareness of IBPB, and in particular their
awareness of IBPB being needed explicitly on the user->kernel mode
transition (where on native, with IBRS enabled, sufficient separation
exists iirc).  IOW we're trying to cater for a 64-bit-PV special aspect
by default.  (Andrew, please correct me if there's anything wrong in
here.)

Jan
On Wed, Dec 20, 2023 at 10:25:57AM +0100, Jan Beulich wrote:
> On 19.12.2023 18:07, Roger Pau Monné wrote:
> > On Tue, Dec 19, 2023 at 04:11:09PM +0100, Roger Pau Monné wrote:
> >> On Tue, Dec 19, 2023 at 03:06:50PM +0100, Jan Beulich wrote:
> >>> On 19.12.2023 12:48, Roger Pau Monné wrote:
> >>>> On Tue, Dec 19, 2023 at 10:56:16AM +0100, Jan Beulich wrote:
> >>>>> On 18.12.2023 18:24, Roger Pau Monné wrote:
> >>>>>> On Tue, Feb 14, 2023 at 05:12:08PM +0100, Jan Beulich wrote:
> >>>>>>> --- a/xen/arch/x86/pv/domain.c
> >>>>>>> +++ b/xen/arch/x86/pv/domain.c
> >>>>>>> @@ -455,6 +455,7 @@ static void _toggle_guest_pt(struct vcpu
> >>>>>>>  void toggle_guest_mode(struct vcpu *v)
> >>>>>>>  {
> >>>>>>>      const struct domain *d = v->domain;
> >>>>>>> +    struct cpu_info *cpu_info = get_cpu_info();
> >>>>>>>      unsigned long gs_base;
> >>>>>>>
> >>>>>>>      ASSERT(!is_pv_32bit_vcpu(v));
> >>>>>>> @@ -467,15 +468,21 @@ void toggle_guest_mode(struct vcpu *v)
> >>>>>>>      if ( v->arch.flags & TF_kernel_mode )
> >>>>>>>          v->arch.pv.gs_base_kernel = gs_base;
> >>>>>>>      else
> >>>>>>> +    {
> >>>>>>>          v->arch.pv.gs_base_user = gs_base;
> >>>>>>> +
> >>>>>>> +        if ( opt_ibpb_mode_switch &&
> >>>>>>> +             !(d->arch.spec_ctrl_flags & SCF_entry_ibpb) &&
> >>>>>>> +             !VM_ASSIST(d, mode_switch_no_ibpb) )
> >>>>>>> +            cpu_info->spec_ctrl_flags |= SCF_new_pred_ctxt;
> >>>>>>
> >>>>>> Likewise similar to the remarks I've made before, if doing an IBPB on
> >>>>>> entry is enough to cover for the case here, it must also be fine to
> >>>>>> issue the IBPB right here, instead of deferring to return to guest
> >>>>>> context?
> >>>>>>
> >>>>>> The only concern would be (as you mentioned before) to avoid clearing
> >>>>>> valid Xen predictions, but I would rather see some figures about what
> >>>>>> effect the delaying to return to guest has vs issuing it right here.
> >>>>>
> >>>>> Part of the reason (aiui) to do things on the exit path was to
> >>>>> consolidate the context switch induced one and the user->kernel switch
> >>>>> one into the same place and mechanism.
> >>>>
> >>>> Isn't it kind of a very specific case that we end up doing a
> >>>> user->kernel switch as part of a context switch?  IOW: would require
> >>>> the vCPU to be scheduled out at that very specific point.
> >>>
> >>> No, there's no user->kernel switch at the same time as context switch.
> >>> What I was trying to explain is that with the actual IBPB being issued
> >>> on exit to guest, both the context switch path and the user->kernel
> >>> mode switch path set the same indicator, for the exit path to consume.
> >>
> >> Deferring to exit to guest path could be OK, but unless strictly
> >> needed, which I don't think it's the case, I would request for IBPB to
> >> be executed in C context rather than assembly one.
> >>
> >>>>>>> + *
> >>>>>>> + * By default (on affected and capable hardware) as a safety measure Xen,
> >>>>>>> + * to cover for the fact that guest-kernel and guest-user modes are both
> >>>>>>> + * running in ring 3 (and hence share prediction context), would issue a
> >>>>>>> + * barrier for user->kernel mode switches of PV guests.
> >>>>>>> + */
> >>>>>>> +#define VMASST_TYPE_mode_switch_no_ibpb 33
> >>>>>>
> >>>>>> Would it be possible to define the assist as
> >>>>>> VMASST_TYPE_mode_switch_ibpb and have it on when enabled?  So that the
> >>>>>> guest would disable it if unneeded?  IMO negated options are in
> >>>>>> general harder to understand.
> >>>>>
> >>>>> Negative options aren't nice, yes, but VM assists start out as all
> >>>>> clear.
> >>>>
> >>>> Are you sure?  I see VMASST_TYPE_pae_extended_cr3 getting set in
> >>>> dom0_construct_pv() and that makes me wonder whether other bits
> >>>> couldn't start set also.
> >>>>
> >>>> Maybe there's some restriction I'm missing, but I don't see any
> >>>> wording in the description of the interface that states that all
> >>>> assists are supposed to start disabled.
> >>>
> >>> Well, that setting of pae_extended_cr3 is in response to the kernel's
> >>> notes section having a respective indicator.  So we still only set the
> >>> bit in response to what the kernel's asking us to do, just that here
> >>> we carry out the request ahead of launching the kernel.
> >>>
> >>> Also consider what would happen during migration if there was a
> >>> default-on assist: At the destination we can't know whether the
> >>> source simply didn't know of the bit, or whether the guest elected to
> >>> clear it.
> >>
> >> Hm, I see, so I was indeed missing that aspect.  VM assist is passed
> >> as a plain bitmap, and there's no signal on which assists the VM had
> >> available on the source side if not enabled.
> >
> > Sorry, please bear with me, as I've been further thinking about this.
> >
> > Why does the assist needs to be default-on?  It's my understanding
> > that the guest can execute the IBPB itself by writing to the MSR, but
> > that's suboptimal in the user -> kernel context switch as it then
> > involves two traps into Xen, but the guest is not left insecure, it
> > just needs to write the MSR itself like on native.
> >
> > In fact, if we add an IBPB by default as part of amd64 PV user ->
> > kernel guest context switch, we are likely doing a double IBPB on
> > guests not aware of the assist.
> >
> > IOW: I don't know why doing the assist as guest opt-in would be
> > insecure, in fact I think it's the best approach (again I might be
> > missing something).
>
> By issuing IBPB by default we can make guests safe (in this regard)
> irrespective of their awareness of IBPB, and in particular their
> awareness of IBPB being needed explicitly on the user->kernel mode
> transition (where on native, with IBRS enabled, sufficient separation
> exists iirc).  IOW we're trying to cater for a 64-bit-PV special aspect
> by default.  (Andrew, please correct me if there's anything wrong in
> here.)

Hm, maybe.  My point would be that PV is already specific enough that
OSes shouldn't expect things like IBRS to work as on native, and hence
should be aware of user and kernel running in the same privilege mode
and issue the IBPB themselves.

Setting that aside, would it make sense to tie the IBPB on guest
user -> kernel switches to the guest having enabled IBRS?  AFAICT IBRS
on 64bit PV is useless, as from the predictor PoV both user and kernel
space share the same mode.  Hence a PV guest enabling IBRS could be
used as a signal for Xen to execute IBPB on user -> kernel guest
context switches?

Thanks, Roger.
--- a/docs/misc/xen-command-line.pandoc
+++ b/docs/misc/xen-command-line.pandoc
@@ -2320,8 +2320,8 @@ By default SSBD will be mitigated at run
 ### spec-ctrl (x86)
 > `= List of [ <bool>, xen=<bool>, {pv,hvm}=<bool>,
 >              {msr-sc,rsb,md-clear,ibpb-entry}=<bool>|{pv,hvm}=<bool>,
->              bti-thunk=retpoline|lfence|jmp, {ibrs,ibpb,ssbd,psfd,
->              eager-fpu,l1d-flush,branch-harden,srb-lock,
+>              bti-thunk=retpoline|lfence|jmp, {ibrs,ibpb,ibpb-mode-switch,
+>              ssbd,psfd,eager-fpu,l1d-flush,branch-harden,srb-lock,
 >              unpriv-mmio}=<bool> ]`
 
 Controls for speculative execution sidechannel mitigations. By default, Xen
@@ -2403,7 +2403,10 @@ default.
 
 On hardware supporting IBPB (Indirect Branch Prediction Barrier), the `ibpb=`
 option can be used to force (the default) or prevent Xen from issuing branch
-prediction barriers on vcpu context switches.
+prediction barriers on vcpu context switches. On such hardware the
+`ibpb-mode-switch` option can be used to control whether, by default, Xen
+would issue branch prediction barriers when 64-bit PV guests switch from
+user to kernel mode. If enabled, guest kernels can opt out of this behavior.
 
 On all hardware, the `eager-fpu=` option can be used to force or prevent Xen
 from using fully eager FPU context switches. This is currently implemented as
--- a/xen/arch/x86/include/asm/domain.h
+++ b/xen/arch/x86/include/asm/domain.h
@@ -742,6 +742,8 @@ static inline void pv_inject_sw_interrup
     pv_inject_event(&event);
 }
 
+extern int8_t opt_ibpb_mode_switch;
+
 #define PV32_VM_ASSIST_MASK ((1UL << VMASST_TYPE_4gb_segments)        | \
                              (1UL << VMASST_TYPE_4gb_segments_notify) | \
                              (1UL << VMASST_TYPE_writable_pagetables) | \
@@ -753,7 +755,9 @@ static inline void pv_inject_sw_interrup
  * but we can't make such requests fail all of the sudden.
  */
 #define PV64_VM_ASSIST_MASK (PV32_VM_ASSIST_MASK                      | \
-                             (1UL << VMASST_TYPE_m2p_strict))
+                             (1UL << VMASST_TYPE_m2p_strict)          | \
+                             ((opt_ibpb_mode_switch + 0UL) <<           \
+                              VMASST_TYPE_mode_switch_no_ibpb))
 #define HVM_VM_ASSIST_MASK  (1UL << VMASST_TYPE_runstate_update_flag)
 
 #define arch_vm_assist_valid_mask(d) \
--- a/xen/arch/x86/pv/domain.c
+++ b/xen/arch/x86/pv/domain.c
@@ -455,6 +455,7 @@ static void _toggle_guest_pt(struct vcpu
 void toggle_guest_mode(struct vcpu *v)
 {
     const struct domain *d = v->domain;
+    struct cpu_info *cpu_info = get_cpu_info();
     unsigned long gs_base;
 
     ASSERT(!is_pv_32bit_vcpu(v));
@@ -467,15 +468,21 @@ void toggle_guest_mode(struct vcpu *v)
     if ( v->arch.flags & TF_kernel_mode )
         v->arch.pv.gs_base_kernel = gs_base;
     else
+    {
         v->arch.pv.gs_base_user = gs_base;
+
+        if ( opt_ibpb_mode_switch &&
+             !(d->arch.spec_ctrl_flags & SCF_entry_ibpb) &&
+             !VM_ASSIST(d, mode_switch_no_ibpb) )
+            cpu_info->spec_ctrl_flags |= SCF_new_pred_ctxt;
+    }
+
     asm volatile ( "swapgs" );
 
     _toggle_guest_pt(v);
 
     if ( d->arch.pv.xpti )
     {
-        struct cpu_info *cpu_info = get_cpu_info();
-
         cpu_info->root_pgt_changed = true;
         cpu_info->pv_cr3 = __pa(this_cpu(root_pgt)) |
                            (d->arch.pv.pcid ? get_pcid_bits(v, true) : 0);
--- a/xen/arch/x86/spec_ctrl.c
+++ b/xen/arch/x86/spec_ctrl.c
@@ -60,6 +60,7 @@ bool __ro_after_init opt_ssbd;
 int8_t __initdata opt_psfd = -1;
 
 int8_t __ro_after_init opt_ibpb_ctxt_switch = -1;
+int8_t __ro_after_init opt_ibpb_mode_switch = -1;
 int8_t __read_mostly opt_eager_fpu = -1;
 int8_t __read_mostly opt_l1d_flush = -1;
 static bool __initdata opt_branch_harden = true;
@@ -111,6 +112,8 @@ static int __init cf_check parse_spec_ct
         if ( opt_pv_l1tf_domu < 0 )
             opt_pv_l1tf_domu = 0;
 
+        opt_ibpb_mode_switch = 0;
+
         if ( opt_tsx == -1 )
             opt_tsx = -3;
 
@@ -271,6 +274,8 @@ static int __init cf_check parse_spec_ct
         /* Misc settings. */
         else if ( (val = parse_boolean("ibpb", s, ss)) >= 0 )
             opt_ibpb_ctxt_switch = val;
+        else if ( (val = parse_boolean("ibpb-mode-switch", s, ss)) >= 0 )
+            opt_ibpb_mode_switch = val;
         else if ( (val = parse_boolean("eager-fpu", s, ss)) >= 0 )
             opt_eager_fpu = val;
         else if ( (val = parse_boolean("l1d-flush", s, ss)) >= 0 )
@@ -527,16 +532,18 @@ static void __init print_details(enum in
 #endif
 
 #ifdef CONFIG_PV
-    printk("  Support for PV VMs:%s%s%s%s%s%s\n",
+    printk("  Support for PV VMs:%s%s%s%s%s%s%s\n",
            (boot_cpu_has(X86_FEATURE_SC_MSR_PV) ||
             boot_cpu_has(X86_FEATURE_SC_RSB_PV) ||
             boot_cpu_has(X86_FEATURE_IBPB_ENTRY_PV) ||
-            opt_eager_fpu || opt_md_clear_pv)       ? ""               : " None",
+            opt_eager_fpu || opt_md_clear_pv ||
+            opt_ibpb_mode_switch)                   ? ""               : " None",
            boot_cpu_has(X86_FEATURE_SC_MSR_PV)      ? " MSR_SPEC_CTRL" : "",
            boot_cpu_has(X86_FEATURE_SC_RSB_PV)      ? " RSB"           : "",
            opt_eager_fpu                            ? " EAGER_FPU"     : "",
            opt_md_clear_pv                          ? " MD_CLEAR"      : "",
-           boot_cpu_has(X86_FEATURE_IBPB_ENTRY_PV)  ? " IBPB-entry"    : "");
+           boot_cpu_has(X86_FEATURE_IBPB_ENTRY_PV)  ? " IBPB-entry"    : "",
+           opt_ibpb_mode_switch                     ? " IBPB-mode-switch" : "");
 
     printk("  XPTI (64-bit PV only): Dom0 %s, DomU %s (with%s PCID)\n",
            opt_xpti_hwdom ? "enabled" : "disabled",
@@ -804,7 +811,8 @@ static void __init ibpb_calculations(voi
     /* Check we have hardware IBPB support before using it... */
     if ( !boot_cpu_has(X86_FEATURE_IBRSB) && !boot_cpu_has(X86_FEATURE_IBPB) )
     {
-        opt_ibpb_entry_hvm = opt_ibpb_entry_pv = opt_ibpb_ctxt_switch = 0;
+        opt_ibpb_entry_hvm = opt_ibpb_entry_pv = 0;
+        opt_ibpb_mode_switch = opt_ibpb_ctxt_switch = 0;
         opt_ibpb_entry_dom0 = false;
         return;
     }
@@ -859,6 +867,18 @@ static void __init ibpb_calculations(voi
         setup_force_cpu_cap(X86_FEATURE_NEW_PRED_CTXT_PV);
         setup_force_cpu_cap(X86_FEATURE_NEW_PRED_CTXT_HVM);
     }
+
+#ifdef CONFIG_PV
+    /*
+     * If we're using IBPB-on-entry to protect against PV guests, then
+     * there's no need to also issue IBPB on a guest user->kernel switch.
+     */
+    if ( opt_ibpb_mode_switch == -1 )
+        opt_ibpb_mode_switch = !opt_ibpb_entry_pv ||
+                               (!opt_ibpb_entry_dom0 && !opt_dom0_pvh);
+    if ( opt_ibpb_mode_switch )
+        setup_force_cpu_cap(X86_FEATURE_NEW_PRED_CTXT_PV);
+#endif
 }
 
 /* Calculate whether this CPU is vulnerable to L1TF. */
--- a/xen/include/public/xen.h
+++ b/xen/include/public/xen.h
@@ -554,6 +554,16 @@ DEFINE_XEN_GUEST_HANDLE(mmuext_op_t);
  */
 #define VMASST_TYPE_m2p_strict           32
 
+/*
+ * x86-64 guests: Suppress IBPB on guest-user to guest-kernel mode switch.
+ *
+ * By default (on affected and capable hardware) as a safety measure Xen,
+ * to cover for the fact that guest-kernel and guest-user modes are both
+ * running in ring 3 (and hence share prediction context), would issue a
+ * barrier for user->kernel mode switches of PV guests.
+ */
+#define VMASST_TYPE_mode_switch_no_ibpb  33
+
 #if __XEN_INTERFACE_VERSION__ < 0x00040600
 #define MAX_VMASST_TYPE 3
 #endif
Since both kernel and user mode run in ring 3, they run in the same
"predictor mode".  While the kernel could take care of this itself, doing
so would be yet another item distinguishing PV from native.  Additionally
we're in a much better position to issue the barrier command, and we can
save a #GP (for privileged instruction emulation) this way.

To allow recovering the lost performance, introduce a new VM assist
allowing the guest kernel to suppress this barrier.  Make availability of
the assist dependent upon the command line control, such that kernels
have a way to know whether their request actually took any effect.

Note that because of its use in PV64_VM_ASSIST_MASK, the declaration of
opt_ibpb_mode_switch can't live in asm/spec_ctrl.h.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
Is the placement of the clearing of opt_ibpb_ctxt_switch correct in
parse_spec_ctrl()?  Shouldn't it live ahead of the "disable_common"
label, as being about guest protection, not Xen's?

Adding setting of the variable to the "pv" sub-case in parse_spec_ctrl()
didn't seem quite right to me, considering that we default it to the
opposite of opt_ibpb_entry_pv.
---
v4: Correct the print_details() change.  Re-base in particular over
    changes earlier in the series.
v3: Leverage exit-IBPB.  Introduce separate command line control.
v2: Leverage entry-IBPB.  Add VM assist.  Re-base.