
[v4,4/4] x86/PV: issue branch prediction barrier when switching 64-bit guest to kernel mode

Message ID 2863b0a9-ca7c-3cce-104d-0b6685b0b383@suse.com (mailing list archive)
State New, archived
Series x86/spec-ctrl: IBPB improvements

Commit Message

Jan Beulich Feb. 14, 2023, 4:12 p.m. UTC
Since both kernel and user mode run in ring 3, they run in the same
"predictor mode". While the kernel could take care of this itself, doing
so would be yet another item distinguishing PV from native. Additionally
we're in a much better position to issue the barrier command, and we can
save a #GP (for privileged instruction emulation) this way.

To allow performance to be recovered, introduce a new VM assist allowing the
guest kernel to suppress this barrier. Make availability of the assist
dependent upon the command line control, such that kernels have a way to
know whether their request actually took any effect.

Note that because of its use in PV64_VM_ASSIST_MASK, the declaration of
opt_ibpb_mode_switch can't live in asm/spec_ctrl.h.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
Is the placement of the clearing of opt_ibpb_ctxt_switch correct in
parse_spec_ctrl()? Shouldn't it live ahead of the "disable_common"
label, as being about guest protection, not Xen's?

Adding setting of the variable to the "pv" sub-case in parse_spec_ctrl()
didn't seem quite right to me, considering that we default it to the
opposite of opt_ibpb_entry_pv.
---
v4: Correct the print_details() change. Re-base in particular over
    changes earlier in the series.
v3: Leverage exit-IBPB. Introduce separate command line control.
v2: Leverage entry-IBPB. Add VM assist. Re-base.

Comments

Roger Pau Monné Dec. 18, 2023, 5:24 p.m. UTC | #1
On Tue, Feb 14, 2023 at 05:12:08PM +0100, Jan Beulich wrote:
> Since both kernel and user mode run in ring 3, they run in the same
> "predictor mode".

That's only true when IBRS is enabled; otherwise all CPU modes share the
same predictor mode?

> While the kernel could take care of this itself, doing
> so would be yet another item distinguishing PV from native. Additionally
> we're in a much better position to issue the barrier command, and we can
> save a #GP (for privileged instruction emulation) this way.
> 
> To allow to recover performance, introduce a new VM assist allowing the
> guest kernel to suppress this barrier. Make availability of the assist
> dependent upon the command line control, such that kernels have a way to
> know whether their request actually took any effect.
> 
> Note that because of its use in PV64_VM_ASSIST_MASK, the declaration of
> opt_ibpb_mode_switch can't live in asm/spec_ctrl.h.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> Is the placement of the clearing of opt_ibpb_ctxt_switch correct in
> parse_spec_ctrl()? Shouldn't it live ahead of the "disable_common"
> label, as being about guest protection, not Xen's?
> 
> Adding setting of the variable to the "pv" sub-case in parse_spec_ctrl()
> didn't seem quite right to me, considering that we default it to the
> opposite of opt_ibpb_entry_pv.
> ---
> v4: Correct the print_details() change. Re-base in particular over
>     changes earlier in the series.
> v3: Leverage exit-IBPB. Introduce separate command line control.
> v2: Leverage entry-IBPB. Add VM assist. Re-base.
> 
> --- a/docs/misc/xen-command-line.pandoc
> +++ b/docs/misc/xen-command-line.pandoc
> @@ -2320,8 +2320,8 @@ By default SSBD will be mitigated at run
>  ### spec-ctrl (x86)
>  > `= List of [ <bool>, xen=<bool>, {pv,hvm}=<bool>,
>  >              {msr-sc,rsb,md-clear,ibpb-entry}=<bool>|{pv,hvm}=<bool>,
> ->              bti-thunk=retpoline|lfence|jmp, {ibrs,ibpb,ssbd,psfd,
> ->              eager-fpu,l1d-flush,branch-harden,srb-lock,
> +>              bti-thunk=retpoline|lfence|jmp, {ibrs,ibpb,ibpb-mode-switch,
> +>              ssbd,psfd,eager-fpu,l1d-flush,branch-harden,srb-lock,
>  >              unpriv-mmio}=<bool> ]`
>  
>  Controls for speculative execution sidechannel mitigations.  By default, Xen
> @@ -2403,7 +2403,10 @@ default.
>  
>  On hardware supporting IBPB (Indirect Branch Prediction Barrier), the `ibpb=`
>  option can be used to force (the default) or prevent Xen from issuing branch
> -prediction barriers on vcpu context switches.
> +prediction barriers on vcpu context switches.  On such hardware the
> +`ibpb-mode-switch` option can be used to control whether, by default, Xen
> +would issue branch prediction barriers when 64-bit PV guests switch from
> user to kernel mode.  If enabled, guest kernels can opt out of this behavior.
>  
>  On all hardware, the `eager-fpu=` option can be used to force or prevent Xen
>  from using fully eager FPU context switches.  This is currently implemented as
> --- a/xen/arch/x86/include/asm/domain.h
> +++ b/xen/arch/x86/include/asm/domain.h
> @@ -742,6 +742,8 @@ static inline void pv_inject_sw_interrup
>      pv_inject_event(&event);
>  }
>  
> +extern int8_t opt_ibpb_mode_switch;
> +
>  #define PV32_VM_ASSIST_MASK ((1UL << VMASST_TYPE_4gb_segments)        | \
>                               (1UL << VMASST_TYPE_4gb_segments_notify) | \
>                               (1UL << VMASST_TYPE_writable_pagetables) | \
> @@ -753,7 +755,9 @@ static inline void pv_inject_sw_interrup
>   * but we can't make such requests fail all of the sudden.
>   */
>  #define PV64_VM_ASSIST_MASK (PV32_VM_ASSIST_MASK                      | \
> -                             (1UL << VMASST_TYPE_m2p_strict))
> +                             (1UL << VMASST_TYPE_m2p_strict)          | \
> +                             ((opt_ibpb_mode_switch + 0UL) <<           \
> +                              VMASST_TYPE_mode_switch_no_ibpb))

I'm wondering whether it's kind of weird to offer the option to PV domUs
if opt_ibpb_entry_pv is set, as then the guest mode switch will always
(implicitly) do an IBPB, since it requires a hypercall and thus takes an
entry point into Xen.

I guess it's worth having it just as a way to signal to Xen that the
hypervisor does perform an IBPB, even if the guest cannot disable it.

>  #define HVM_VM_ASSIST_MASK  (1UL << VMASST_TYPE_runstate_update_flag)
>  
>  #define arch_vm_assist_valid_mask(d) \
> --- a/xen/arch/x86/pv/domain.c
> +++ b/xen/arch/x86/pv/domain.c
> @@ -455,6 +455,7 @@ static void _toggle_guest_pt(struct vcpu
>  void toggle_guest_mode(struct vcpu *v)
>  {
>      const struct domain *d = v->domain;
> +    struct cpu_info *cpu_info = get_cpu_info();
>      unsigned long gs_base;
>  
>      ASSERT(!is_pv_32bit_vcpu(v));
> @@ -467,15 +468,21 @@ void toggle_guest_mode(struct vcpu *v)
>      if ( v->arch.flags & TF_kernel_mode )
>          v->arch.pv.gs_base_kernel = gs_base;
>      else
> +    {
>          v->arch.pv.gs_base_user = gs_base;
> +
> +        if ( opt_ibpb_mode_switch &&
> +             !(d->arch.spec_ctrl_flags & SCF_entry_ibpb) &&
> +             !VM_ASSIST(d, mode_switch_no_ibpb) )
> +            cpu_info->spec_ctrl_flags |= SCF_new_pred_ctxt;

Likewise, similar to the remarks I've made before: if doing an IBPB on
entry is enough to cover the case here, it must also be fine to
issue the IBPB right here, instead of deferring it until the return to
guest context?

The only concern would be (as you mentioned before) to avoid clearing
valid Xen predictions, but I would rather see some figures about what
effect delaying until the return to guest has vs issuing it right here.
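For reference, the gating condition under discussion in
toggle_guest_mode() can be modeled standalone; the flag values here are
illustrative, not Xen's actual bit assignments:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SCF_entry_ibpb     (1u << 0)
#define SCF_new_pred_ctxt  (1u << 1)

/*
 * Model of the hunk quoted above: the barrier is deferred to the exit
 * path only when the mode-switch control is enabled, no IBPB-on-entry
 * already covers the domain, and the guest has not opted out via the
 * VM assist.
 */
static uint32_t mode_switch_flags(bool opt_ibpb_mode_switch,
                                  uint32_t domain_spec_ctrl_flags,
                                  bool assist_no_ibpb)
{
    uint32_t cpu_flags = 0;

    if ( opt_ibpb_mode_switch &&
         !(domain_spec_ctrl_flags & SCF_entry_ibpb) &&
         !assist_no_ibpb )
        cpu_flags |= SCF_new_pred_ctxt;

    return cpu_flags;
}
```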

> +    }
> +
>      asm volatile ( "swapgs" );
>  
>      _toggle_guest_pt(v);
>  
>      if ( d->arch.pv.xpti )
>      {
> -        struct cpu_info *cpu_info = get_cpu_info();
> -
>          cpu_info->root_pgt_changed = true;
>          cpu_info->pv_cr3 = __pa(this_cpu(root_pgt)) |
>                             (d->arch.pv.pcid ? get_pcid_bits(v, true) : 0);
> --- a/xen/arch/x86/spec_ctrl.c
> +++ b/xen/arch/x86/spec_ctrl.c
> @@ -60,6 +60,7 @@ bool __ro_after_init opt_ssbd;
>  int8_t __initdata opt_psfd = -1;
>  
>  int8_t __ro_after_init opt_ibpb_ctxt_switch = -1;
> +int8_t __ro_after_init opt_ibpb_mode_switch = -1;
>  int8_t __read_mostly opt_eager_fpu = -1;
>  int8_t __read_mostly opt_l1d_flush = -1;
>  static bool __initdata opt_branch_harden = true;
> @@ -111,6 +112,8 @@ static int __init cf_check parse_spec_ct
>              if ( opt_pv_l1tf_domu < 0 )
>                  opt_pv_l1tf_domu = 0;
>  
> +            opt_ibpb_mode_switch = 0;
> +
>              if ( opt_tsx == -1 )
>                  opt_tsx = -3;
>  
> @@ -271,6 +274,8 @@ static int __init cf_check parse_spec_ct
>          /* Misc settings. */
>          else if ( (val = parse_boolean("ibpb", s, ss)) >= 0 )
>              opt_ibpb_ctxt_switch = val;
> +        else if ( (val = parse_boolean("ibpb-mode-switch", s, ss)) >= 0 )
> +            opt_ibpb_mode_switch = val;
>          else if ( (val = parse_boolean("eager-fpu", s, ss)) >= 0 )
>              opt_eager_fpu = val;
>          else if ( (val = parse_boolean("l1d-flush", s, ss)) >= 0 )
> @@ -527,16 +532,18 @@ static void __init print_details(enum in
>  
>  #endif
>  #ifdef CONFIG_PV
> -    printk("  Support for PV VMs:%s%s%s%s%s%s\n",
> +    printk("  Support for PV VMs:%s%s%s%s%s%s%s\n",
>             (boot_cpu_has(X86_FEATURE_SC_MSR_PV) ||
>              boot_cpu_has(X86_FEATURE_SC_RSB_PV) ||
>              boot_cpu_has(X86_FEATURE_IBPB_ENTRY_PV) ||
> -            opt_eager_fpu || opt_md_clear_pv)        ? ""               : " None",
> +            opt_eager_fpu || opt_md_clear_pv ||
> +            opt_ibpb_mode_switch)                    ? ""               : " None",
>             boot_cpu_has(X86_FEATURE_SC_MSR_PV)       ? " MSR_SPEC_CTRL" : "",
>             boot_cpu_has(X86_FEATURE_SC_RSB_PV)       ? " RSB"           : "",
>             opt_eager_fpu                             ? " EAGER_FPU"     : "",
>             opt_md_clear_pv                           ? " MD_CLEAR"      : "",
> -           boot_cpu_has(X86_FEATURE_IBPB_ENTRY_PV)   ? " IBPB-entry"    : "");
> +           boot_cpu_has(X86_FEATURE_IBPB_ENTRY_PV)   ? " IBPB-entry"    : "",
> +           opt_ibpb_mode_switch                      ? " IBPB-mode-switch" : "");
>  
>      printk("  XPTI (64-bit PV only): Dom0 %s, DomU %s (with%s PCID)\n",
>             opt_xpti_hwdom ? "enabled" : "disabled",
> @@ -804,7 +811,8 @@ static void __init ibpb_calculations(voi
>      /* Check we have hardware IBPB support before using it... */
>      if ( !boot_cpu_has(X86_FEATURE_IBRSB) && !boot_cpu_has(X86_FEATURE_IBPB) )
>      {
> -        opt_ibpb_entry_hvm = opt_ibpb_entry_pv = opt_ibpb_ctxt_switch = 0;
> +        opt_ibpb_entry_hvm = opt_ibpb_entry_pv = 0;
> +        opt_ibpb_mode_switch = opt_ibpb_ctxt_switch = 0;
>          opt_ibpb_entry_dom0 = false;
>          return;
>      }
> @@ -859,6 +867,18 @@ static void __init ibpb_calculations(voi
>          setup_force_cpu_cap(X86_FEATURE_NEW_PRED_CTXT_PV);
>          setup_force_cpu_cap(X86_FEATURE_NEW_PRED_CTXT_HVM);
>      }
> +
> +#ifdef CONFIG_PV
> +    /*
> +     * If we're using IBPB-on-entry to protect against PV guests, then
> +     * there's no need to also issue IBPB on a guest user->kernel switch.
> +     */
> +    if ( opt_ibpb_mode_switch == -1 )
> +        opt_ibpb_mode_switch = !opt_ibpb_entry_pv ||
> +                               (!opt_ibpb_entry_dom0 && !opt_dom0_pvh);
> +    if ( opt_ibpb_mode_switch )
> +        setup_force_cpu_cap(X86_FEATURE_NEW_PRED_CTXT_PV);
> +#endif
>  }
>  
>  /* Calculate whether this CPU is vulnerable to L1TF. */
> --- a/xen/include/public/xen.h
> +++ b/xen/include/public/xen.h
> @@ -554,6 +554,16 @@ DEFINE_XEN_GUEST_HANDLE(mmuext_op_t);
>   */
>  #define VMASST_TYPE_m2p_strict           32
>  
> +/*
> + * x86-64 guests: Suppress IBPB on guest-user to guest-kernel mode switch.

I think this needs to be more vague, as it's not true that the IBPB
will be suppressed if Xen is unconditionally issuing one on all guest
entry points.

Maybe adding:

"Setting the assist signals Xen that the IBPB can be avoided from a
guest perspective, however Xen might still issue one for other
reasons."

> + *
> + * By default (on affected and capable hardware) as a safety measure Xen,
> + * to cover for the fact that guest-kernel and guest-user modes are both
> + * running in ring 3 (and hence share prediction context), would issue a
> + * barrier for user->kernel mode switches of PV guests.
> + */
> +#define VMASST_TYPE_mode_switch_no_ibpb  33

Would it be possible to define the assist as
VMASST_TYPE_mode_switch_ibpb and have it on when enabled?  So that the
guest would disable it if unneeded?  IMO negated options are in
general harder to understand.

Thanks, Roger.
Jan Beulich Dec. 19, 2023, 9:56 a.m. UTC | #2
On 18.12.2023 18:24, Roger Pau Monné wrote:
> On Tue, Feb 14, 2023 at 05:12:08PM +0100, Jan Beulich wrote:
>> Since both kernel and user mode run in ring 3, they run in the same
>> "predictor mode".
> 
> That only true when IBRS is enabled, otherwise all CPU modes share the
> same predictor mode?

But here we only care about ring 3 anyway?

>> @@ -753,7 +755,9 @@ static inline void pv_inject_sw_interrup
>>   * but we can't make such requests fail all of the sudden.
>>   */
>>  #define PV64_VM_ASSIST_MASK (PV32_VM_ASSIST_MASK                      | \
>> -                             (1UL << VMASST_TYPE_m2p_strict))
>> +                             (1UL << VMASST_TYPE_m2p_strict)          | \
>> +                             ((opt_ibpb_mode_switch + 0UL) <<           \
>> +                              VMASST_TYPE_mode_switch_no_ibpb))
> 
> I'm wondering that it's kind of weird to offer the option to PV domUs
> if opt_ibpb_entry_pv is set, as then the guest mode switch will always
> (implicitly) do a IBPB as requiring an hypercall and thus take an
> entry point into Xen.
> 
> I guess it's worth having it just as a way to signal to Xen that the
> hypervisor does perform an IBPB, even if the guest cannot disable it.

I'm afraid I'm confused by your reply. Not only, but also because the
latter sentence looks partly backwards / non-logical to me.

>> --- a/xen/arch/x86/pv/domain.c
>> +++ b/xen/arch/x86/pv/domain.c
>> @@ -455,6 +455,7 @@ static void _toggle_guest_pt(struct vcpu
>>  void toggle_guest_mode(struct vcpu *v)
>>  {
>>      const struct domain *d = v->domain;
>> +    struct cpu_info *cpu_info = get_cpu_info();
>>      unsigned long gs_base;
>>  
>>      ASSERT(!is_pv_32bit_vcpu(v));
>> @@ -467,15 +468,21 @@ void toggle_guest_mode(struct vcpu *v)
>>      if ( v->arch.flags & TF_kernel_mode )
>>          v->arch.pv.gs_base_kernel = gs_base;
>>      else
>> +    {
>>          v->arch.pv.gs_base_user = gs_base;
>> +
>> +        if ( opt_ibpb_mode_switch &&
>> +             !(d->arch.spec_ctrl_flags & SCF_entry_ibpb) &&
>> +             !VM_ASSIST(d, mode_switch_no_ibpb) )
>> +            cpu_info->spec_ctrl_flags |= SCF_new_pred_ctxt;
> 
> Likewise similar to the remarks I've made before, if doing an IBPB on
> entry is enough to cover for the case here, it must also be fine to
> issue the IBPB right here, instead of deferring to return to guest
> context?
> 
> The only concern would be (as you mentioned before) to avoid clearing
> valid Xen predictions, but I would rather see some figures about what
> effect the delaying to return to guest has vs issuing it right here.

Part of the reason (aiui) to do things on the exit path was to
consolidate the context switch induced one and the user->kernel switch
one into the same place and mechanism.

>> --- a/xen/include/public/xen.h
>> +++ b/xen/include/public/xen.h
>> @@ -554,6 +554,16 @@ DEFINE_XEN_GUEST_HANDLE(mmuext_op_t);
>>   */
>>  #define VMASST_TYPE_m2p_strict           32
>>  
>> +/*
>> + * x86-64 guests: Suppress IBPB on guest-user to guest-kernel mode switch.
> 
> I think this needs to be more vague, as it's not true that the IBPB
> will be suppressed if Xen is unconditionally issuing one on all guest
> entry points.
> 
> Maybe adding:
> 
> "Setting the assist signals Xen that the IBPB can be avoided from a
> guest perspective, however Xen might still issue one for other
> reasons."

I've done s/Suppress/Permit skipping/. I wouldn't want to go further,
as that then becomes related to implementation details imo. IOW of
course Xen may issue IBPB whenever it thinks there's a possible need.

>> + *
>> + * By default (on affected and capable hardware) as a safety measure Xen,
>> + * to cover for the fact that guest-kernel and guest-user modes are both
>> + * running in ring 3 (and hence share prediction context), would issue a
>> + * barrier for user->kernel mode switches of PV guests.
>> + */
>> +#define VMASST_TYPE_mode_switch_no_ibpb  33
> 
> Would it be possible to define the assist as
> VMASST_TYPE_mode_switch_ibpb and have it on when enabled?  So that the
> guest would disable it if unneeded?  IMO negated options are in
> general harder to understand.

Negative options aren't nice, yes, but VM assists start out as all
clear. The guest needs to change a "false" to a "true", and thus it
cannot be a positive option here, as we want the default (off) to be
safe/secure.

Jan
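The default-clear, opt-in semantics Jan describes can be sketched with a
toy model; `toy_domain` and `toy_vm_assist_enable` are hypothetical
stand-ins, not Xen's actual interfaces:

```c
#include <assert.h>
#include <errno.h>

#define VMASST_TYPE_mode_switch_no_ibpb 33

struct toy_domain {
    unsigned long vm_assist;   /* starts all clear: default is secure */
    unsigned long valid_mask;  /* assists the hypervisor advertises */
};

/*
 * A guest can only flip a bit the hypervisor's valid mask offers; if
 * the command line control disabled the feature, the request fails and
 * the default (barrier issued) stays in force.
 */
static int toy_vm_assist_enable(struct toy_domain *d, unsigned int type)
{
    if ( !(d->valid_mask & (1UL << type)) )
        return -EINVAL;
    d->vm_assist |= 1UL << type;
    return 0;
}
```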
Roger Pau Monné Dec. 19, 2023, 11:48 a.m. UTC | #3
On Tue, Dec 19, 2023 at 10:56:16AM +0100, Jan Beulich wrote:
> On 18.12.2023 18:24, Roger Pau Monné wrote:
> > On Tue, Feb 14, 2023 at 05:12:08PM +0100, Jan Beulich wrote:
> >> Since both kernel and user mode run in ring 3, they run in the same
> >> "predictor mode".
> > 
> > That only true when IBRS is enabled, otherwise all CPU modes share the
> > same predictor mode?
> 
> But here we only care about ring 3 anyway?

Hm, yes, what I wanted to say is that this is not exclusive to 64-bit
PV, as with IBRS disabled, ring 3 and ring 0 share the same predictor
mode.  Anyway, not relevant.

> 
> >> @@ -753,7 +755,9 @@ static inline void pv_inject_sw_interrup
> >>   * but we can't make such requests fail all of the sudden.
> >>   */
> >>  #define PV64_VM_ASSIST_MASK (PV32_VM_ASSIST_MASK                      | \
> >> -                             (1UL << VMASST_TYPE_m2p_strict))
> >> +                             (1UL << VMASST_TYPE_m2p_strict)          | \
> >> +                             ((opt_ibpb_mode_switch + 0UL) <<           \
> >> +                              VMASST_TYPE_mode_switch_no_ibpb))
> > 
> > I'm wondering that it's kind of weird to offer the option to PV domUs
> > if opt_ibpb_entry_pv is set, as then the guest mode switch will always
> > (implicitly) do a IBPB as requiring an hypercall and thus take an
> > entry point into Xen.
> > 
> > I guess it's worth having it just as a way to signal to Xen that the
> > hypervisor does perform an IBPB, even if the guest cannot disable it.
> 
> I'm afraid I'm confused by your reply. Not only, but also because the
> latter sentence looks partly backwards / non-logical to me.

Sorry, I think I didn't word that very well.  The remark is tied to
the one below about the vmassist 'possibly' allowing the guest to
disable IBPB on guest user -> kernel context switches, but Xen might
unconditionally do additional IBPBs that the guest cannot disable.

> >> --- a/xen/arch/x86/pv/domain.c
> >> +++ b/xen/arch/x86/pv/domain.c
> >> @@ -455,6 +455,7 @@ static void _toggle_guest_pt(struct vcpu
> >>  void toggle_guest_mode(struct vcpu *v)
> >>  {
> >>      const struct domain *d = v->domain;
> >> +    struct cpu_info *cpu_info = get_cpu_info();
> >>      unsigned long gs_base;
> >>  
> >>      ASSERT(!is_pv_32bit_vcpu(v));
> >> @@ -467,15 +468,21 @@ void toggle_guest_mode(struct vcpu *v)
> >>      if ( v->arch.flags & TF_kernel_mode )
> >>          v->arch.pv.gs_base_kernel = gs_base;
> >>      else
> >> +    {
> >>          v->arch.pv.gs_base_user = gs_base;
> >> +
> >> +        if ( opt_ibpb_mode_switch &&
> >> +             !(d->arch.spec_ctrl_flags & SCF_entry_ibpb) &&
> >> +             !VM_ASSIST(d, mode_switch_no_ibpb) )
> >> +            cpu_info->spec_ctrl_flags |= SCF_new_pred_ctxt;
> > 
> > Likewise similar to the remarks I've made before, if doing an IBPB on
> > entry is enough to cover for the case here, it must also be fine to
> > issue the IBPB right here, instead of deferring to return to guest
> > context?
> > 
> > The only concern would be (as you mentioned before) to avoid clearing
> > valid Xen predictions, but I would rather see some figures about what
> > effect the delaying to return to guest has vs issuing it right here.
> 
> Part of the reason (aiui) to do things on the exit path was to
> consolidate the context switch induced one and the user->kernel switch
> one into the same place and mechanism.

Isn't it kind of a very specific case that we end up doing a
user->kernel switch as part of a context switch?  IOW: would require
the vCPU to be scheduled out at that very specific point.

> >> + *
> >> + * By default (on affected and capable hardware) as a safety measure Xen,
> >> + * to cover for the fact that guest-kernel and guest-user modes are both
> >> + * running in ring 3 (and hence share prediction context), would issue a
> >> + * barrier for user->kernel mode switches of PV guests.
> >> + */
> >> +#define VMASST_TYPE_mode_switch_no_ibpb  33
> > 
> > Would it be possible to define the assist as
> > VMASST_TYPE_mode_switch_ibpb and have it on when enabled?  So that the
> > guest would disable it if unneeded?  IMO negated options are in
> > general harder to understand.
> 
> Negative options aren't nice, yes, but VM assists start out as all
> clear.

Are you sure?  I see VMASST_TYPE_pae_extended_cr3 getting set in
dom0_construct_pv() and that makes me wonder whether other bits
couldn't start set also.

Maybe there's some restriction I'm missing, but I don't see any
wording in the description of the interface that states that all
assists are supposed to start disabled.

Thanks, Roger.
Jan Beulich Dec. 19, 2023, 2:06 p.m. UTC | #4
On 19.12.2023 12:48, Roger Pau Monné wrote:
> On Tue, Dec 19, 2023 at 10:56:16AM +0100, Jan Beulich wrote:
>> On 18.12.2023 18:24, Roger Pau Monné wrote:
>>> On Tue, Feb 14, 2023 at 05:12:08PM +0100, Jan Beulich wrote:
>>>> --- a/xen/arch/x86/pv/domain.c
>>>> +++ b/xen/arch/x86/pv/domain.c
>>>> @@ -455,6 +455,7 @@ static void _toggle_guest_pt(struct vcpu
>>>>  void toggle_guest_mode(struct vcpu *v)
>>>>  {
>>>>      const struct domain *d = v->domain;
>>>> +    struct cpu_info *cpu_info = get_cpu_info();
>>>>      unsigned long gs_base;
>>>>  
>>>>      ASSERT(!is_pv_32bit_vcpu(v));
>>>> @@ -467,15 +468,21 @@ void toggle_guest_mode(struct vcpu *v)
>>>>      if ( v->arch.flags & TF_kernel_mode )
>>>>          v->arch.pv.gs_base_kernel = gs_base;
>>>>      else
>>>> +    {
>>>>          v->arch.pv.gs_base_user = gs_base;
>>>> +
>>>> +        if ( opt_ibpb_mode_switch &&
>>>> +             !(d->arch.spec_ctrl_flags & SCF_entry_ibpb) &&
>>>> +             !VM_ASSIST(d, mode_switch_no_ibpb) )
>>>> +            cpu_info->spec_ctrl_flags |= SCF_new_pred_ctxt;
>>>
>>> Likewise similar to the remarks I've made before, if doing an IBPB on
>>> entry is enough to cover for the case here, it must also be fine to
>>> issue the IBPB right here, instead of deferring to return to guest
>>> context?
>>>
>>> The only concern would be (as you mentioned before) to avoid clearing
>>> valid Xen predictions, but I would rather see some figures about what
>>> effect the delaying to return to guest has vs issuing it right here.
>>
>> Part of the reason (aiui) to do things on the exit path was to
>> consolidate the context switch induced one and the user->kernel switch
>> one into the same place and mechanism.
> 
> Isn't it kind of a very specific case that we end up doing a
> user->kernel switch as part of a context switch?  IOW: would require
> the vCPU to be scheduled out at that very specific point.

No, there's no user->kernel switch at the same time as context switch.
What I was trying to explain is that with the actual IBPB being issued
on exit to guest, both the context switch path and the user->kernel
mode switch path set the same indicator, for the exit path to consume.
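This consolidation can be modeled as a small sketch; names and flag
values are illustrative, not Xen's actual implementation:

```c
#include <assert.h>

#define SCF_new_pred_ctxt (1u << 1)

static unsigned int cpu_flags;
static unsigned int barriers_issued;

/* Both producers set the same per-CPU indicator... */
static void on_context_switch(void)  { cpu_flags |= SCF_new_pred_ctxt; }
static void on_user_to_kernel(void)  { cpu_flags |= SCF_new_pred_ctxt; }

/* ...and the single exit-to-guest path consumes it, so overlapping
 * requests collapse into one barrier. The increment stands in for the
 * actual IBPB (a write to MSR_PRED_CMD). */
static void exit_to_guest(void)
{
    if ( cpu_flags & SCF_new_pred_ctxt )
    {
        barriers_issued++;
        cpu_flags &= ~SCF_new_pred_ctxt;
    }
}
```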

>>>> + *
>>>> + * By default (on affected and capable hardware) as a safety measure Xen,
>>>> + * to cover for the fact that guest-kernel and guest-user modes are both
>>>> + * running in ring 3 (and hence share prediction context), would issue a
>>>> + * barrier for user->kernel mode switches of PV guests.
>>>> + */
>>>> +#define VMASST_TYPE_mode_switch_no_ibpb  33
>>>
>>> Would it be possible to define the assist as
>>> VMASST_TYPE_mode_switch_ibpb and have it on when enabled?  So that the
>>> guest would disable it if unneeded?  IMO negated options are in
>>> general harder to understand.
>>
>> Negative options aren't nice, yes, but VM assists start out as all
>> clear.
> 
> Are you sure?  I see VMASST_TYPE_pae_extended_cr3 getting set in
> dom0_construct_pv() and that makes me wonder whether other bits
> couldn't start set also.
> 
> Maybe there's some restriction I'm missing, but I don't see any
> wording in the description of the interface that states that all
> assists are supposed to start disabled.

Well, that setting of pae_extended_cr3 is in response to the kernel's
notes section having a respective indicator. So we still only set the
bit in response to what the kernel's asking us to do, just that here
we carry out the request ahead of launching the kernel.

Also consider what would happen during migration if there was a
default-on assist: At the destination we can't know whether the
source simply didn't know of the bit, or whether the guest elected to
clear it.

Jan
Roger Pau Monné Dec. 19, 2023, 3:11 p.m. UTC | #5
On Tue, Dec 19, 2023 at 03:06:50PM +0100, Jan Beulich wrote:
> On 19.12.2023 12:48, Roger Pau Monné wrote:
> > On Tue, Dec 19, 2023 at 10:56:16AM +0100, Jan Beulich wrote:
> >> On 18.12.2023 18:24, Roger Pau Monné wrote:
> >>> On Tue, Feb 14, 2023 at 05:12:08PM +0100, Jan Beulich wrote:
> >>>> --- a/xen/arch/x86/pv/domain.c
> >>>> +++ b/xen/arch/x86/pv/domain.c
> >>>> @@ -455,6 +455,7 @@ static void _toggle_guest_pt(struct vcpu
> >>>>  void toggle_guest_mode(struct vcpu *v)
> >>>>  {
> >>>>      const struct domain *d = v->domain;
> >>>> +    struct cpu_info *cpu_info = get_cpu_info();
> >>>>      unsigned long gs_base;
> >>>>  
> >>>>      ASSERT(!is_pv_32bit_vcpu(v));
> >>>> @@ -467,15 +468,21 @@ void toggle_guest_mode(struct vcpu *v)
> >>>>      if ( v->arch.flags & TF_kernel_mode )
> >>>>          v->arch.pv.gs_base_kernel = gs_base;
> >>>>      else
> >>>> +    {
> >>>>          v->arch.pv.gs_base_user = gs_base;
> >>>> +
> >>>> +        if ( opt_ibpb_mode_switch &&
> >>>> +             !(d->arch.spec_ctrl_flags & SCF_entry_ibpb) &&
> >>>> +             !VM_ASSIST(d, mode_switch_no_ibpb) )
> >>>> +            cpu_info->spec_ctrl_flags |= SCF_new_pred_ctxt;
> >>>
> >>> Likewise similar to the remarks I've made before, if doing an IBPB on
> >>> entry is enough to cover for the case here, it must also be fine to
> >>> issue the IBPB right here, instead of deferring to return to guest
> >>> context?
> >>>
> >>> The only concern would be (as you mentioned before) to avoid clearing
> >>> valid Xen predictions, but I would rather see some figures about what
> >>> effect the delaying to return to guest has vs issuing it right here.
> >>
> >> Part of the reason (aiui) to do things on the exit path was to
> >> consolidate the context switch induced one and the user->kernel switch
> >> one into the same place and mechanism.
> > 
> > Isn't it kind of a very specific case that we end up doing a
> > user->kernel switch as part of a context switch?  IOW: would require
> > the vCPU to be scheduled out at that very specific point.
> 
> No, there's no user->kernel switch at the same time as context switch.
> What I was trying to explain is that with the actual IBPB being issued
> on exit to guest, both the context switch path and the user->kernel
> mode switch path set the same indicator, for the exit path to consume.

Deferring to the exit-to-guest path could be OK, but unless strictly
needed, which I don't think is the case, I would request that the IBPB
be executed in C context rather than assembly.

> >>>> + *
> >>>> + * By default (on affected and capable hardware) as a safety measure Xen,
> >>>> + * to cover for the fact that guest-kernel and guest-user modes are both
> >>>> + * running in ring 3 (and hence share prediction context), would issue a
> >>>> + * barrier for user->kernel mode switches of PV guests.
> >>>> + */
> >>>> +#define VMASST_TYPE_mode_switch_no_ibpb  33
> >>>
> >>> Would it be possible to define the assist as
> >>> VMASST_TYPE_mode_switch_ibpb and have it on when enabled?  So that the
> >>> guest would disable it if unneeded?  IMO negated options are in
> >>> general harder to understand.
> >>
> >> Negative options aren't nice, yes, but VM assists start out as all
> >> clear.
> > 
> > Are you sure?  I see VMASST_TYPE_pae_extended_cr3 getting set in
> > dom0_construct_pv() and that makes me wonder whether other bits
> > couldn't start set also.
> > 
> > Maybe there's some restriction I'm missing, but I don't see any
> > wording in the description of the interface that states that all
> > assists are supposed to start disabled.
> 
> Well, that setting of pae_extended_cr3 is in response to the kernel's
> notes section having a respective indicator. So we still only set the
> bit in response to what the kernel's asking us to do, just that here
> we carry out the request ahead of launching the kernel.
> 
> Also consider what would happen during migration if there was a
> default-on assist: At the destination we can't know whether the
> source simply didn't know of the bit, or whether the guest elected to
> clear it.

Hm, I see, so I was indeed missing that aspect.  VM assist is passed
as a plain bitmap, and there's no signal on which assists the VM had
available on the source side if not enabled.

Thanks, Roger.
Roger Pau Monné Dec. 19, 2023, 5:07 p.m. UTC | #6
On Tue, Dec 19, 2023 at 04:11:09PM +0100, Roger Pau Monné wrote:
> On Tue, Dec 19, 2023 at 03:06:50PM +0100, Jan Beulich wrote:
> > On 19.12.2023 12:48, Roger Pau Monné wrote:
> > > On Tue, Dec 19, 2023 at 10:56:16AM +0100, Jan Beulich wrote:
> > >> On 18.12.2023 18:24, Roger Pau Monné wrote:
> > >>> On Tue, Feb 14, 2023 at 05:12:08PM +0100, Jan Beulich wrote:
> > >>>> --- a/xen/arch/x86/pv/domain.c
> > >>>> +++ b/xen/arch/x86/pv/domain.c
> > >>>> @@ -455,6 +455,7 @@ static void _toggle_guest_pt(struct vcpu
> > >>>>  void toggle_guest_mode(struct vcpu *v)
> > >>>>  {
> > >>>>      const struct domain *d = v->domain;
> > >>>> +    struct cpu_info *cpu_info = get_cpu_info();
> > >>>>      unsigned long gs_base;
> > >>>>  
> > >>>>      ASSERT(!is_pv_32bit_vcpu(v));
> > >>>> @@ -467,15 +468,21 @@ void toggle_guest_mode(struct vcpu *v)
> > >>>>      if ( v->arch.flags & TF_kernel_mode )
> > >>>>          v->arch.pv.gs_base_kernel = gs_base;
> > >>>>      else
> > >>>> +    {
> > >>>>          v->arch.pv.gs_base_user = gs_base;
> > >>>> +
> > >>>> +        if ( opt_ibpb_mode_switch &&
> > >>>> +             !(d->arch.spec_ctrl_flags & SCF_entry_ibpb) &&
> > >>>> +             !VM_ASSIST(d, mode_switch_no_ibpb) )
> > >>>> +            cpu_info->spec_ctrl_flags |= SCF_new_pred_ctxt;
> > >>>
> > >>> Likewise similar to the remarks I've made before, if doing an IBPB on
> > >>> entry is enough to cover for the case here, it must also be fine to
> > >>> issue the IBPB right here, instead of deferring to return to guest
> > >>> context?
> > >>>
> > >>> The only concern would be (as you mentioned before) to avoid clearing
> > >>> valid Xen predictions, but I would rather see some figures about what
> > >>> effect the delaying to return to guest has vs issuing it right here.
> > >>
> > >> Part of the reason (aiui) to do things on the exit path was to
> > >> consolidate the context switch induced one and the user->kernel switch
> > >> one into the same place and mechanism.
> > > 
> > > Isn't it kind of a very specific case that we end up doing a
> > > user->kernel switch as part of a context switch?  IOW: would require
> > > the vCPU to be scheduled out at that very specific point.
> > 
> > No, there's no user->kernel switch at the same time as context switch.
> > What I was trying to explain is that with the actual IBPB being issued
> > on exit to guest, both the context switch path and the user->kernel
> > mode switch path set the same indicator, for the exit path to consume.
> 
> Deferring to exit to guest path could be OK, but unless strictly
> needed, which I don't think it's the case, I would request for IBPB to
> be executed in C context rather than assembly one.
> 
> > >>>> + *
> > >>>> + * By default (on affected and capable hardware) as a safety measure Xen,
> > >>>> + * to cover for the fact that guest-kernel and guest-user modes are both
> > >>>> + * running in ring 3 (and hence share prediction context), would issue a
> > >>>> + * barrier for user->kernel mode switches of PV guests.
> > >>>> + */
> > >>>> +#define VMASST_TYPE_mode_switch_no_ibpb  33
> > >>>
> > >>> Would it be possible to define the assist as
> > >>> VMASST_TYPE_mode_switch_ibpb and have it on when enabled?  So that the
> > >>> guest would disable it if unneeded?  IMO negated options are in
> > >>> general harder to understand.
> > >>
> > >> Negative options aren't nice, yes, but VM assists start out as all
> > >> clear.
> > > 
> > > Are you sure?  I see VMASST_TYPE_pae_extended_cr3 getting set in
> > > dom0_construct_pv() and that makes me wonder whether other bits
> > > couldn't start set also.
> > > 
> > > Maybe there's some restriction I'm missing, but I don't see any
> > > wording in the description of the interface that states that all
> > > assists are supposed to start disabled.
> > 
> > Well, that setting of pae_extended_cr3 is in response to the kernel's
> > notes section having a respective indicator. So we still only set the
> > bit in response to what the kernel's asking us to do, just that here
> > we carry out the request ahead of launching the kernel.
> > 
> > Also consider what would happen during migration if there was a
> > default-on assist: At the destination we can't know whether the
> > source simply didn't know of the bit, or whether the guest elected to
> > clear it.
> 
> Hm, I see, so I was indeed missing that aspect.  VM assist is passed
> as a plain bitmap, and there's no signal on which assists the VM had
> available on the source side if not enabled.

Sorry, please bear with me, as I've been further thinking about this.

Why does the assist need to be default-on?  It's my understanding
that the guest can execute the IBPB itself by writing to the MSR, but
that's suboptimal in the user -> kernel context switch as it then
involves two traps into Xen, but the guest is not left insecure, it
just needs to write the MSR itself like on native.

In fact, if we add an IBPB by default as part of amd64 PV user ->
kernel guest context switch, we are likely doing a double IBPB on
guests not aware of the assist.

IOW: I don't know why doing the assist as guest opt-in would be
insecure, in fact I think it's the best approach (again I might be
missing something).

Thanks, Roger.
Jan Beulich Dec. 20, 2023, 9:25 a.m. UTC | #7
On 19.12.2023 18:07, Roger Pau Monné wrote:
> On Tue, Dec 19, 2023 at 04:11:09PM +0100, Roger Pau Monné wrote:
>> On Tue, Dec 19, 2023 at 03:06:50PM +0100, Jan Beulich wrote:
>>> On 19.12.2023 12:48, Roger Pau Monné wrote:
>>>> On Tue, Dec 19, 2023 at 10:56:16AM +0100, Jan Beulich wrote:
>>>>> On 18.12.2023 18:24, Roger Pau Monné wrote:
>>>>>> On Tue, Feb 14, 2023 at 05:12:08PM +0100, Jan Beulich wrote:
>>>>>>> --- a/xen/arch/x86/pv/domain.c
>>>>>>> +++ b/xen/arch/x86/pv/domain.c
>>>>>>> @@ -455,6 +455,7 @@ static void _toggle_guest_pt(struct vcpu
>>>>>>>  void toggle_guest_mode(struct vcpu *v)
>>>>>>>  {
>>>>>>>      const struct domain *d = v->domain;
>>>>>>> +    struct cpu_info *cpu_info = get_cpu_info();
>>>>>>>      unsigned long gs_base;
>>>>>>>  
>>>>>>>      ASSERT(!is_pv_32bit_vcpu(v));
>>>>>>> @@ -467,15 +468,21 @@ void toggle_guest_mode(struct vcpu *v)
>>>>>>>      if ( v->arch.flags & TF_kernel_mode )
>>>>>>>          v->arch.pv.gs_base_kernel = gs_base;
>>>>>>>      else
>>>>>>> +    {
>>>>>>>          v->arch.pv.gs_base_user = gs_base;
>>>>>>> +
>>>>>>> +        if ( opt_ibpb_mode_switch &&
>>>>>>> +             !(d->arch.spec_ctrl_flags & SCF_entry_ibpb) &&
>>>>>>> +             !VM_ASSIST(d, mode_switch_no_ibpb) )
>>>>>>> +            cpu_info->spec_ctrl_flags |= SCF_new_pred_ctxt;
>>>>>>
>>>>>> Likewise similar to the remarks I've made before, if doing an IBPB on
>>>>>> entry is enough to cover for the case here, it must also be fine to
>>>>>> issue the IBPB right here, instead of deferring to return to guest
>>>>>> context?
>>>>>>
>>>>>> The only concern would be (as you mentioned before) to avoid clearing
>>>>>> valid Xen predictions, but I would rather see some figures about what
>>>>>> effect the delaying to return to guest has vs issuing it right here.
>>>>>
>>>>> Part of the reason (aiui) to do things on the exit path was to
>>>>> consolidate the context switch induced one and the user->kernel switch
>>>>> one into the same place and mechanism.
>>>>
>>>> Isn't it kind of a very specific case that we end up doing a
>>>> user->kernel switch as part of a context switch?  IOW: would require
>>>> the vCPU to be scheduled out at that very specific point.
>>>
>>> No, there's no user->kernel switch at the same time as context switch.
>>> What I was trying to explain is that with the actual IBPB being issued
>>> on exit to guest, both the context switch path and the user->kernel
>>> mode switch path set the same indicator, for the exit path to consume.
>>
>> Deferring to exit to guest path could be OK, but unless strictly
>> needed, which I don't think it's the case, I would request for IBPB to
>> be executed in C context rather than assembly one.
>>
>>>>>>> + *
>>>>>>> + * By default (on affected and capable hardware) as a safety measure Xen,
>>>>>>> + * to cover for the fact that guest-kernel and guest-user modes are both
>>>>>>> + * running in ring 3 (and hence share prediction context), would issue a
>>>>>>> + * barrier for user->kernel mode switches of PV guests.
>>>>>>> + */
>>>>>>> +#define VMASST_TYPE_mode_switch_no_ibpb  33
>>>>>>
>>>>>> Would it be possible to define the assist as
>>>>>> VMASST_TYPE_mode_switch_ibpb and have it on when enabled?  So that the
>>>>>> guest would disable it if unneeded?  IMO negated options are in
>>>>>> general harder to understand.
>>>>>
>>>>> Negative options aren't nice, yes, but VM assists start out as all
>>>>> clear.
>>>>
>>>> Are you sure?  I see VMASST_TYPE_pae_extended_cr3 getting set in
>>>> dom0_construct_pv() and that makes me wonder whether other bits
>>>> couldn't start set also.
>>>>
>>>> Maybe there's some restriction I'm missing, but I don't see any
>>>> wording in the description of the interface that states that all
>>>> assists are supposed to start disabled.
>>>
>>> Well, that setting of pae_extended_cr3 is in response to the kernel's
>>> notes section having a respective indicator. So we still only set the
>>> bit in response to what the kernel's asking us to do, just that here
>>> we carry out the request ahead of launching the kernel.
>>>
>>> Also consider what would happen during migration if there was a
>>> default-on assist: At the destination we can't know whether the
>>> source simply didn't know of the bit, or whether the guest elected to
>>> clear it.
>>
>> Hm, I see, so I was indeed missing that aspect.  VM assist is passed
>> as a plain bitmap, and there's no signal on which assists the VM had
>> available on the source side if not enabled.
> 
> Sorry, please bear with me, as I've been further thinking about this.
> 
> Why does the assist need to be default-on?  It's my understanding
> that the guest can execute the IBPB itself by writing to the MSR, but
> that's suboptimal in the user -> kernel context switch as it then
> involves two traps into Xen, but the guest is not left insecure, it
> just needs to write the MSR itself like on native.
> 
> In fact, if we add an IBPB by default as part of amd64 PV user ->
> kernel guest context switch, we are likely doing a double IBPB on
> guests not aware of the assist.
> 
> IOW: I don't know why doing the assist as guest opt-in would be
> insecure, in fact I think it's the best approach (again I might be
> missing something).

By issuing IBPB by default we can make guests safe (in this regard)
irrespective of their awareness of IBPB, and in particular their
awareness of IBPB being needed explicitly on the user->kernel mode
transition (where on native, with IBRS enabled, sufficient separation
exists iirc). IOW we're trying to cater for a 64-bit-PV special aspect
by default. (Andrew, please correct me if there's anything wrong in
here.)

Jan
Roger Pau Monné Dec. 20, 2023, 9:59 a.m. UTC | #8
On Wed, Dec 20, 2023 at 10:25:57AM +0100, Jan Beulich wrote:
> On 19.12.2023 18:07, Roger Pau Monné wrote:
> > On Tue, Dec 19, 2023 at 04:11:09PM +0100, Roger Pau Monné wrote:
> >> On Tue, Dec 19, 2023 at 03:06:50PM +0100, Jan Beulich wrote:
> >>> On 19.12.2023 12:48, Roger Pau Monné wrote:
> >>>> On Tue, Dec 19, 2023 at 10:56:16AM +0100, Jan Beulich wrote:
> >>>>> On 18.12.2023 18:24, Roger Pau Monné wrote:
> >>>>>> On Tue, Feb 14, 2023 at 05:12:08PM +0100, Jan Beulich wrote:
> >>>>>>> --- a/xen/arch/x86/pv/domain.c
> >>>>>>> +++ b/xen/arch/x86/pv/domain.c
> >>>>>>> @@ -455,6 +455,7 @@ static void _toggle_guest_pt(struct vcpu
> >>>>>>>  void toggle_guest_mode(struct vcpu *v)
> >>>>>>>  {
> >>>>>>>      const struct domain *d = v->domain;
> >>>>>>> +    struct cpu_info *cpu_info = get_cpu_info();
> >>>>>>>      unsigned long gs_base;
> >>>>>>>  
> >>>>>>>      ASSERT(!is_pv_32bit_vcpu(v));
> >>>>>>> @@ -467,15 +468,21 @@ void toggle_guest_mode(struct vcpu *v)
> >>>>>>>      if ( v->arch.flags & TF_kernel_mode )
> >>>>>>>          v->arch.pv.gs_base_kernel = gs_base;
> >>>>>>>      else
> >>>>>>> +    {
> >>>>>>>          v->arch.pv.gs_base_user = gs_base;
> >>>>>>> +
> >>>>>>> +        if ( opt_ibpb_mode_switch &&
> >>>>>>> +             !(d->arch.spec_ctrl_flags & SCF_entry_ibpb) &&
> >>>>>>> +             !VM_ASSIST(d, mode_switch_no_ibpb) )
> >>>>>>> +            cpu_info->spec_ctrl_flags |= SCF_new_pred_ctxt;
> >>>>>>
> >>>>>> Likewise similar to the remarks I've made before, if doing an IBPB on
> >>>>>> entry is enough to cover for the case here, it must also be fine to
> >>>>>> issue the IBPB right here, instead of deferring to return to guest
> >>>>>> context?
> >>>>>>
> >>>>>> The only concern would be (as you mentioned before) to avoid clearing
> >>>>>> valid Xen predictions, but I would rather see some figures about what
> >>>>>> effect the delaying to return to guest has vs issuing it right here.
> >>>>>
> >>>>> Part of the reason (aiui) to do things on the exit path was to
> >>>>> consolidate the context switch induced one and the user->kernel switch
> >>>>> one into the same place and mechanism.
> >>>>
> >>>> Isn't it kind of a very specific case that we end up doing a
> >>>> user->kernel switch as part of a context switch?  IOW: would require
> >>>> the vCPU to be scheduled out at that very specific point.
> >>>
> >>> No, there's no user->kernel switch at the same time as context switch.
> >>> What I was trying to explain is that with the actual IBPB being issued
> >>> on exit to guest, both the context switch path and the user->kernel
> >>> mode switch path set the same indicator, for the exit path to consume.
> >>
> >> Deferring to exit to guest path could be OK, but unless strictly
> >> needed, which I don't think it's the case, I would request for IBPB to
> >> be executed in C context rather than assembly one.
> >>
> >>>>>>> + *
> >>>>>>> + * By default (on affected and capable hardware) as a safety measure Xen,
> >>>>>>> + * to cover for the fact that guest-kernel and guest-user modes are both
> >>>>>>> + * running in ring 3 (and hence share prediction context), would issue a
> >>>>>>> + * barrier for user->kernel mode switches of PV guests.
> >>>>>>> + */
> >>>>>>> +#define VMASST_TYPE_mode_switch_no_ibpb  33
> >>>>>>
> >>>>>> Would it be possible to define the assist as
> >>>>>> VMASST_TYPE_mode_switch_ibpb and have it on when enabled?  So that the
> >>>>>> guest would disable it if unneeded?  IMO negated options are in
> >>>>>> general harder to understand.
> >>>>>
> >>>>> Negative options aren't nice, yes, but VM assists start out as all
> >>>>> clear.
> >>>>
> >>>> Are you sure?  I see VMASST_TYPE_pae_extended_cr3 getting set in
> >>>> dom0_construct_pv() and that makes me wonder whether other bits
> >>>> couldn't start set also.
> >>>>
> >>>> Maybe there's some restriction I'm missing, but I don't see any
> >>>> wording in the description of the interface that states that all
> >>>> assists are supposed to start disabled.
> >>>
> >>> Well, that setting of pae_extended_cr3 is in response to the kernel's
> >>> notes section having a respective indicator. So we still only set the
> >>> bit in response to what the kernel's asking us to do, just that here
> >>> we carry out the request ahead of launching the kernel.
> >>>
> >>> Also consider what would happen during migration if there was a
> >>> default-on assist: At the destination we can't know whether the
> >>> source simply didn't know of the bit, or whether the guest elected to
> >>> clear it.
> >>
> >> Hm, I see, so I was indeed missing that aspect.  VM assist is passed
> >> as a plain bitmap, and there's no signal on which assists the VM had
> >> available on the source side if not enabled.
> > 
> > Sorry, please bear with me, as I've been further thinking about this.
> > 
> > Why does the assist need to be default-on?  It's my understanding
> > that the guest can execute the IBPB itself by writing to the MSR, but
> > that's suboptimal in the user -> kernel context switch as it then
> > involves two traps into Xen, but the guest is not left insecure, it
> > just needs to write the MSR itself like on native.
> > 
> > In fact, if we add an IBPB by default as part of amd64 PV user ->
> > kernel guest context switch, we are likely doing a double IBPB on
> > guests not aware of the assist.
> > 
> > IOW: I don't know why doing the assist as guest opt-in would be
> > insecure, in fact I think it's the best approach (again I might be
> > missing something).
> 
> By issuing IBPB by default we can make guests safe (in this regard)
> irrespective of their awareness of IBPB, and in particular their
> awareness of IBPB being needed explicitly on the user->kernel mode
> transition (where on native, with IBRS enabled, sufficient separation
> exists iirc). IOW we're trying to cater for a 64-bit-PV special aspect
> by default. (Andrew, please correct me if there's anything wrong in
> here.)

Hm, maybe.  My point would be that PV is already specific enough that
OSes shouldn't expect things like IBRS to work as on native, and hence
should be aware of user and kernel running in the same privilege mode
and issue the IBPB themselves.

Setting that aside, would it make sense to tie the IBPB on guest user
-> kernel switches to the guest having enabled IBRS?  AFAICT IBRS on
64bit PV is useless, as from the predictor PoV both user and kernel
space share the same mode.  Hence a PV guest enabling IBRS could be
used as a signal for Xen to execute IBPB on user -> kernel guest
context switches?

Thanks, Roger.
Patch

--- a/docs/misc/xen-command-line.pandoc
+++ b/docs/misc/xen-command-line.pandoc
@@ -2320,8 +2320,8 @@  By default SSBD will be mitigated at run
 ### spec-ctrl (x86)
 > `= List of [ <bool>, xen=<bool>, {pv,hvm}=<bool>,
 >              {msr-sc,rsb,md-clear,ibpb-entry}=<bool>|{pv,hvm}=<bool>,
->              bti-thunk=retpoline|lfence|jmp, {ibrs,ibpb,ssbd,psfd,
->              eager-fpu,l1d-flush,branch-harden,srb-lock,
+>              bti-thunk=retpoline|lfence|jmp, {ibrs,ibpb,ibpb-mode-switch,
+>              ssbd,psfd,eager-fpu,l1d-flush,branch-harden,srb-lock,
 >              unpriv-mmio}=<bool> ]`
 
 Controls for speculative execution sidechannel mitigations.  By default, Xen
@@ -2403,7 +2403,10 @@  default.
 
 On hardware supporting IBPB (Indirect Branch Prediction Barrier), the `ibpb=`
 option can be used to force (the default) or prevent Xen from issuing branch
-prediction barriers on vcpu context switches.
+prediction barriers on vcpu context switches.  On such hardware the
+`ibpb-mode-switch` option can be used to control whether, by default, Xen
+would issue branch prediction barriers when 64-bit PV guests switch from
+user to kernel mode.  If enabled, guest kernels can opt out of this behavior.
 
 On all hardware, the `eager-fpu=` option can be used to force or prevent Xen
 from using fully eager FPU context switches.  This is currently implemented as
--- a/xen/arch/x86/include/asm/domain.h
+++ b/xen/arch/x86/include/asm/domain.h
@@ -742,6 +742,8 @@  static inline void pv_inject_sw_interrup
     pv_inject_event(&event);
 }
 
+extern int8_t opt_ibpb_mode_switch;
+
 #define PV32_VM_ASSIST_MASK ((1UL << VMASST_TYPE_4gb_segments)        | \
                              (1UL << VMASST_TYPE_4gb_segments_notify) | \
                              (1UL << VMASST_TYPE_writable_pagetables) | \
@@ -753,7 +755,9 @@  static inline void pv_inject_sw_interrup
  * but we can't make such requests fail all of the sudden.
  */
 #define PV64_VM_ASSIST_MASK (PV32_VM_ASSIST_MASK                      | \
-                             (1UL << VMASST_TYPE_m2p_strict))
+                             (1UL << VMASST_TYPE_m2p_strict)          | \
+                             ((opt_ibpb_mode_switch + 0UL) <<           \
+                              VMASST_TYPE_mode_switch_no_ibpb))
 #define HVM_VM_ASSIST_MASK  (1UL << VMASST_TYPE_runstate_update_flag)
 
 #define arch_vm_assist_valid_mask(d) \
--- a/xen/arch/x86/pv/domain.c
+++ b/xen/arch/x86/pv/domain.c
@@ -455,6 +455,7 @@  static void _toggle_guest_pt(struct vcpu
 void toggle_guest_mode(struct vcpu *v)
 {
     const struct domain *d = v->domain;
+    struct cpu_info *cpu_info = get_cpu_info();
     unsigned long gs_base;
 
     ASSERT(!is_pv_32bit_vcpu(v));
@@ -467,15 +468,21 @@  void toggle_guest_mode(struct vcpu *v)
     if ( v->arch.flags & TF_kernel_mode )
         v->arch.pv.gs_base_kernel = gs_base;
     else
+    {
         v->arch.pv.gs_base_user = gs_base;
+
+        if ( opt_ibpb_mode_switch &&
+             !(d->arch.spec_ctrl_flags & SCF_entry_ibpb) &&
+             !VM_ASSIST(d, mode_switch_no_ibpb) )
+            cpu_info->spec_ctrl_flags |= SCF_new_pred_ctxt;
+    }
+
     asm volatile ( "swapgs" );
 
     _toggle_guest_pt(v);
 
     if ( d->arch.pv.xpti )
     {
-        struct cpu_info *cpu_info = get_cpu_info();
-
         cpu_info->root_pgt_changed = true;
         cpu_info->pv_cr3 = __pa(this_cpu(root_pgt)) |
                            (d->arch.pv.pcid ? get_pcid_bits(v, true) : 0);
--- a/xen/arch/x86/spec_ctrl.c
+++ b/xen/arch/x86/spec_ctrl.c
@@ -60,6 +60,7 @@  bool __ro_after_init opt_ssbd;
 int8_t __initdata opt_psfd = -1;
 
 int8_t __ro_after_init opt_ibpb_ctxt_switch = -1;
+int8_t __ro_after_init opt_ibpb_mode_switch = -1;
 int8_t __read_mostly opt_eager_fpu = -1;
 int8_t __read_mostly opt_l1d_flush = -1;
 static bool __initdata opt_branch_harden = true;
@@ -111,6 +112,8 @@  static int __init cf_check parse_spec_ct
             if ( opt_pv_l1tf_domu < 0 )
                 opt_pv_l1tf_domu = 0;
 
+            opt_ibpb_mode_switch = 0;
+
             if ( opt_tsx == -1 )
                 opt_tsx = -3;
 
@@ -271,6 +274,8 @@  static int __init cf_check parse_spec_ct
         /* Misc settings. */
         else if ( (val = parse_boolean("ibpb", s, ss)) >= 0 )
             opt_ibpb_ctxt_switch = val;
+        else if ( (val = parse_boolean("ibpb-mode-switch", s, ss)) >= 0 )
+            opt_ibpb_mode_switch = val;
         else if ( (val = parse_boolean("eager-fpu", s, ss)) >= 0 )
             opt_eager_fpu = val;
         else if ( (val = parse_boolean("l1d-flush", s, ss)) >= 0 )
@@ -527,16 +532,18 @@  static void __init print_details(enum in
 
 #endif
 #ifdef CONFIG_PV
-    printk("  Support for PV VMs:%s%s%s%s%s%s\n",
+    printk("  Support for PV VMs:%s%s%s%s%s%s%s\n",
            (boot_cpu_has(X86_FEATURE_SC_MSR_PV) ||
             boot_cpu_has(X86_FEATURE_SC_RSB_PV) ||
             boot_cpu_has(X86_FEATURE_IBPB_ENTRY_PV) ||
-            opt_eager_fpu || opt_md_clear_pv)        ? ""               : " None",
+            opt_eager_fpu || opt_md_clear_pv ||
+            opt_ibpb_mode_switch)                    ? ""               : " None",
            boot_cpu_has(X86_FEATURE_SC_MSR_PV)       ? " MSR_SPEC_CTRL" : "",
            boot_cpu_has(X86_FEATURE_SC_RSB_PV)       ? " RSB"           : "",
            opt_eager_fpu                             ? " EAGER_FPU"     : "",
            opt_md_clear_pv                           ? " MD_CLEAR"      : "",
-           boot_cpu_has(X86_FEATURE_IBPB_ENTRY_PV)   ? " IBPB-entry"    : "");
+           boot_cpu_has(X86_FEATURE_IBPB_ENTRY_PV)   ? " IBPB-entry"    : "",
+           opt_ibpb_mode_switch                      ? " IBPB-mode-switch" : "");
 
     printk("  XPTI (64-bit PV only): Dom0 %s, DomU %s (with%s PCID)\n",
            opt_xpti_hwdom ? "enabled" : "disabled",
@@ -804,7 +811,8 @@  static void __init ibpb_calculations(voi
     /* Check we have hardware IBPB support before using it... */
     if ( !boot_cpu_has(X86_FEATURE_IBRSB) && !boot_cpu_has(X86_FEATURE_IBPB) )
     {
-        opt_ibpb_entry_hvm = opt_ibpb_entry_pv = opt_ibpb_ctxt_switch = 0;
+        opt_ibpb_entry_hvm = opt_ibpb_entry_pv = 0;
+        opt_ibpb_mode_switch = opt_ibpb_ctxt_switch = 0;
         opt_ibpb_entry_dom0 = false;
         return;
     }
@@ -859,6 +867,18 @@  static void __init ibpb_calculations(voi
         setup_force_cpu_cap(X86_FEATURE_NEW_PRED_CTXT_PV);
         setup_force_cpu_cap(X86_FEATURE_NEW_PRED_CTXT_HVM);
     }
+
+#ifdef CONFIG_PV
+    /*
+     * If we're using IBPB-on-entry to protect against PV guests, then
+     * there's no need to also issue IBPB on a guest user->kernel switch.
+     */
+    if ( opt_ibpb_mode_switch == -1 )
+        opt_ibpb_mode_switch = !opt_ibpb_entry_pv ||
+                               (!opt_ibpb_entry_dom0 && !opt_dom0_pvh);
+    if ( opt_ibpb_mode_switch )
+        setup_force_cpu_cap(X86_FEATURE_NEW_PRED_CTXT_PV);
+#endif
 }
 
 /* Calculate whether this CPU is vulnerable to L1TF. */
--- a/xen/include/public/xen.h
+++ b/xen/include/public/xen.h
@@ -554,6 +554,16 @@  DEFINE_XEN_GUEST_HANDLE(mmuext_op_t);
  */
 #define VMASST_TYPE_m2p_strict           32
 
+/*
+ * x86-64 guests: Suppress IBPB on guest-user to guest-kernel mode switch.
+ *
+ * By default (on affected and capable hardware) as a safety measure Xen,
+ * to cover for the fact that guest-kernel and guest-user modes are both
+ * running in ring 3 (and hence share prediction context), would issue a
+ * barrier for user->kernel mode switches of PV guests.
+ */
+#define VMASST_TYPE_mode_switch_no_ibpb  33
+
 #if __XEN_INTERFACE_VERSION__ < 0x00040600
 #define MAX_VMASST_TYPE                  3
 #endif