| Message ID | 20230110171845.20542-6-andrew.cooper3@citrix.com |
|---|---|
| State | New, archived |
| Series | x86: Protection Key Supervisor support |
On 10.01.2023 18:18, Andrew Cooper wrote:
> +static inline void wrpkrs(uint32_t pkrs)
> +{
> +    uint32_t *this_pkrs = &this_cpu(pkrs);
> +
> +    if ( *this_pkrs != pkrs )
> +    {
> +        *this_pkrs = pkrs;
> +
> +        wrmsr_ns(MSR_PKRS, pkrs, 0);
> +    }
> +}
> +
> +static inline void wrpkrs_and_cache(uint32_t pkrs)
> +{
> +    this_cpu(pkrs) = pkrs;
> +    wrmsr_ns(MSR_PKRS, pkrs, 0);
> +}

Just to confirm - there's no anticipation of uses of this in async
contexts, i.e. there's no concern about the ordering of cache vs hardware
writes?

> --- a/xen/arch/x86/setup.c
> +++ b/xen/arch/x86/setup.c
> @@ -54,6 +54,7 @@
>  #include <asm/spec_ctrl.h>
>  #include <asm/guest.h>
>  #include <asm/microcode.h>
> +#include <asm/prot-key.h>
>  #include <asm/pv/domain.h>
>
>  /* opt_nosmp: If true, secondary processors are ignored. */
> @@ -1804,6 +1805,9 @@ void __init noreturn __start_xen(unsigned long mbi_p)
>      if ( opt_invpcid && cpu_has_invpcid )
>          use_invpcid = true;
>
> +    if ( cpu_has_pks )
> +        wrpkrs_and_cache(0); /* Must be before setting CR4.PKS */

Same question here as for PKRU wrt the BSP during S3 resume.

Jan
On 12/01/2023 1:10 pm, Jan Beulich wrote:
> On 10.01.2023 18:18, Andrew Cooper wrote:
>> +static inline void wrpkrs(uint32_t pkrs)
>> +{
>> +    uint32_t *this_pkrs = &this_cpu(pkrs);
>> +
>> +    if ( *this_pkrs != pkrs )
>> +    {
>> +        *this_pkrs = pkrs;
>> +
>> +        wrmsr_ns(MSR_PKRS, pkrs, 0);
>> +    }
>> +}
>> +
>> +static inline void wrpkrs_and_cache(uint32_t pkrs)
>> +{
>> +    this_cpu(pkrs) = pkrs;
>> +    wrmsr_ns(MSR_PKRS, pkrs, 0);
>> +}
> Just to confirm - there's no anticipation of uses of this in async
> contexts, i.e. there's no concern about the ordering of cache vs hardware
> writes?

No.  The only thing modifying MSR_PKRS does is change how the pagewalk
works for the current thread (specifically, the determination of Access
Rights).  There is no relevance outside of the core, especially for Xen's
local copy of the register value.

What WRMSRNS does guarantee is that older instructions will complete
before the MSR gets updated, and that subsequent instructions won't start,
so WRMSRNS acts "atomically" with respect to instruction order.

Also remember that not all WRMSRs are serialising.  e.g. the X2APIC MSRs
are explicitly not, and this is an oversight in practice for
MSR_X2APIC_ICR at least.

>> --- a/xen/arch/x86/setup.c
>> +++ b/xen/arch/x86/setup.c
>> @@ -54,6 +54,7 @@
>>  #include <asm/spec_ctrl.h>
>>  #include <asm/guest.h>
>>  #include <asm/microcode.h>
>> +#include <asm/prot-key.h>
>>  #include <asm/pv/domain.h>
>>
>>  /* opt_nosmp: If true, secondary processors are ignored. */
>> @@ -1804,6 +1805,9 @@ void __init noreturn __start_xen(unsigned long mbi_p)
>>      if ( opt_invpcid && cpu_has_invpcid )
>>          use_invpcid = true;
>>
>> +    if ( cpu_has_pks )
>> +        wrpkrs_and_cache(0); /* Must be before setting CR4.PKS */
> Same question here as for PKRU wrt the BSP during S3 resume.

I had reasoned not, but it turns out that I'm wrong.

It's important to reset the cache back to 0 here.  (Handling PKRU is
different - I'll follow up on the other email..)

~Andrew
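For reference, the wrmsr_ns() helper used above comes from earlier in this
series.  In rough sketch form (assuming the WRMSRNS opcode 0f 01 c6 and an
X86_FEATURE_WRMSRNS synthetic feature bit; the committed form may differ in
detail), it is an alternatives-patched choice between WRMSRNS and a plain
serialising WRMSR:

/*
 * Sketch of a non-serialising WRMSR, falling back to a plain WRMSR on
 * hardware without WRMSRNS.  WRMSR is 2 bytes and WRMSRNS (0f 01 c6) is 3;
 * a redundant CS prefix pads WRMSR so both alternatives are the same
 * length.
 */
static inline void wrmsr_ns(uint32_t msr, uint32_t lo, uint32_t hi)
{
    alternative_input(".byte 0x2e; wrmsr",
                      ".byte 0x0f, 0x01, 0xc6", X86_FEATURE_WRMSRNS,
                      "c" (msr), "a" (lo), "d" (hi));
}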
On 12/01/2023 4:51 pm, Andrew Cooper wrote:
> On 12/01/2023 1:10 pm, Jan Beulich wrote:
>> On 10.01.2023 18:18, Andrew Cooper wrote:
>>> --- a/xen/arch/x86/setup.c
>>> +++ b/xen/arch/x86/setup.c
>>> @@ -54,6 +54,7 @@
>>>  #include <asm/spec_ctrl.h>
>>>  #include <asm/guest.h>
>>>  #include <asm/microcode.h>
>>> +#include <asm/prot-key.h>
>>>  #include <asm/pv/domain.h>
>>>
>>>  /* opt_nosmp: If true, secondary processors are ignored. */
>>> @@ -1804,6 +1805,9 @@ void __init noreturn __start_xen(unsigned long mbi_p)
>>>      if ( opt_invpcid && cpu_has_invpcid )
>>>          use_invpcid = true;
>>>
>>> +    if ( cpu_has_pks )
>>> +        wrpkrs_and_cache(0); /* Must be before setting CR4.PKS */
>> Same question here as for PKRU wrt the BSP during S3 resume.
> I had reasoned not, but it turns out that I'm wrong.
>
> It's important to reset the cache back to 0 here.  (Handling PKRU is
> different - I'll follow up on the other email..)

diff --git a/xen/arch/x86/acpi/power.c b/xen/arch/x86/acpi/power.c
index d23335391c67..de9317e8c573 100644
--- a/xen/arch/x86/acpi/power.c
+++ b/xen/arch/x86/acpi/power.c
@@ -299,6 +299,13 @@ static int enter_state(u32 state)

     update_mcu_opt_ctrl();

+    /*
+     * Should be before restoring CR4, but that is earlier in asm.  We rely
+     * on MSR_PKRS actually being 0 out of S3 resume.
+     */
+    if ( cpu_has_pks )
+        wrpkrs_and_cache(0);
+
     /* (re)initialise SYSCALL/SYSENTER state, amongst other things. */
     percpu_traps_init();

I've folded this hunk, to sort out the S3 resume path.

As it's the final hunk before the entire series can be committed, I shan't
bother sending a v3 just for this.

~Andrew
On 16.01.2023 14:00, Andrew Cooper wrote:
> On 12/01/2023 4:51 pm, Andrew Cooper wrote:
>> On 12/01/2023 1:10 pm, Jan Beulich wrote:
>>> On 10.01.2023 18:18, Andrew Cooper wrote:
>>>> --- a/xen/arch/x86/setup.c
>>>> +++ b/xen/arch/x86/setup.c
>>>> @@ -54,6 +54,7 @@
>>>>  #include <asm/spec_ctrl.h>
>>>>  #include <asm/guest.h>
>>>>  #include <asm/microcode.h>
>>>> +#include <asm/prot-key.h>
>>>>  #include <asm/pv/domain.h>
>>>>
>>>>  /* opt_nosmp: If true, secondary processors are ignored. */
>>>> @@ -1804,6 +1805,9 @@ void __init noreturn __start_xen(unsigned long mbi_p)
>>>>      if ( opt_invpcid && cpu_has_invpcid )
>>>>          use_invpcid = true;
>>>>
>>>> +    if ( cpu_has_pks )
>>>> +        wrpkrs_and_cache(0); /* Must be before setting CR4.PKS */
>>> Same question here as for PKRU wrt the BSP during S3 resume.
>> I had reasoned not, but it turns out that I'm wrong.
>>
>> It's important to reset the cache back to 0 here.  (Handling PKRU is
>> different - I'll follow up on the other email..)
>
> diff --git a/xen/arch/x86/acpi/power.c b/xen/arch/x86/acpi/power.c
> index d23335391c67..de9317e8c573 100644
> --- a/xen/arch/x86/acpi/power.c
> +++ b/xen/arch/x86/acpi/power.c
> @@ -299,6 +299,13 @@ static int enter_state(u32 state)
>
>      update_mcu_opt_ctrl();
>
> +    /*
> +     * Should be before restoring CR4, but that is earlier in asm.  We rely
> +     * on MSR_PKRS actually being 0 out of S3 resume.
> +     */
> +    if ( cpu_has_pks )
> +        wrpkrs_and_cache(0);
> +
>      /* (re)initialise SYSCALL/SYSENTER state, amongst other things. */
>      percpu_traps_init();
>
>
> I've folded this hunk, to sort out the S3 resume path.

The comment is a little misleading imo - it looks to justify that nothing
needs doing.  Could you add "..., but our cache needs clearing" to clarify
why, despite our relying on zero being in the register (which I find
problematic, considering that the doc doesn't even spell out reset state),
the write is needed?

> As it's the final hunk before the entire series can be committed, I shan't
> bother sending a v3 just for this.

If you're seeing reasons not to be concerned of the unspecified reset
state, then feel free to add my A-b (but not R-b) here.

Jan
On 16/01/2023 2:17 pm, Jan Beulich wrote:
> On 16.01.2023 14:00, Andrew Cooper wrote:
>> On 12/01/2023 4:51 pm, Andrew Cooper wrote:
>>> On 12/01/2023 1:10 pm, Jan Beulich wrote:
>>>> On 10.01.2023 18:18, Andrew Cooper wrote:
>>>>> --- a/xen/arch/x86/setup.c
>>>>> +++ b/xen/arch/x86/setup.c
>>>>> @@ -54,6 +54,7 @@
>>>>>  #include <asm/spec_ctrl.h>
>>>>>  #include <asm/guest.h>
>>>>>  #include <asm/microcode.h>
>>>>> +#include <asm/prot-key.h>
>>>>>  #include <asm/pv/domain.h>
>>>>>
>>>>>  /* opt_nosmp: If true, secondary processors are ignored. */
>>>>> @@ -1804,6 +1805,9 @@ void __init noreturn __start_xen(unsigned long mbi_p)
>>>>>      if ( opt_invpcid && cpu_has_invpcid )
>>>>>          use_invpcid = true;
>>>>>
>>>>> +    if ( cpu_has_pks )
>>>>> +        wrpkrs_and_cache(0); /* Must be before setting CR4.PKS */
>>>> Same question here as for PKRU wrt the BSP during S3 resume.
>>> I had reasoned not, but it turns out that I'm wrong.
>>>
>>> It's important to reset the cache back to 0 here.  (Handling PKRU is
>>> different - I'll follow up on the other email..)
>> diff --git a/xen/arch/x86/acpi/power.c b/xen/arch/x86/acpi/power.c
>> index d23335391c67..de9317e8c573 100644
>> --- a/xen/arch/x86/acpi/power.c
>> +++ b/xen/arch/x86/acpi/power.c
>> @@ -299,6 +299,13 @@ static int enter_state(u32 state)
>>
>>      update_mcu_opt_ctrl();
>>
>> +    /*
>> +     * Should be before restoring CR4, but that is earlier in asm.  We
>> +     * rely on MSR_PKRS actually being 0 out of S3 resume.
>> +     */
>> +    if ( cpu_has_pks )
>> +        wrpkrs_and_cache(0);
>> +
>>      /* (re)initialise SYSCALL/SYSENTER state, amongst other things. */
>>      percpu_traps_init();
>>
>>
>> I've folded this hunk, to sort out the S3 resume path.
> The comment is a little misleading imo - it looks to justify that nothing
> needs doing.  Could you add "..., but our cache needs clearing" to clarify
> why, despite our relying on zero being in the register (which I find
> problematic, considering that the doc doesn't even spell out reset state),
> the write is needed?

Xen doesn't actually set CR4.PKS at all (yet).

I'm just trying to do a reasonable job of leaving Xen in a position where
it doesn't explode the instant we want to start using PKS ourselves.

S3 resume is out of a full core poweroff.  Registers (which aren't
modified by firmware) will have their architectural reset values, and for
MSR_PKRS, that's 0.

I will add a comment about resetting the cache, because that is the point
of doing this operation.

If we do find firmware which really does mess around with MSR_PKRS (and
isn't restoring the pre-S3 value), then this clause needs moving down into
asm before we restore %cr4 fully, but TBH, I hope I've had time to try and
unify the boot and S3 resume paths a bit before then.

>> As it's the final hunk before the entire series can be committed, I
>> shan't bother sending a v3 just for this.
> If you're seeing reasons not to be concerned of the unspecified reset
> state, then feel free to add my A-b (but not R-b) here.

There are several reasons not to be concerned.  I guess I'll take your ack
then.  Thanks.

~Andrew
On 16.01.2023 15:57, Andrew Cooper wrote:
> On 16/01/2023 2:17 pm, Jan Beulich wrote:
>> On 16.01.2023 14:00, Andrew Cooper wrote:
>>> --- a/xen/arch/x86/acpi/power.c
>>> +++ b/xen/arch/x86/acpi/power.c
>>> @@ -299,6 +299,13 @@ static int enter_state(u32 state)
>>>
>>>      update_mcu_opt_ctrl();
>>>
>>> +    /*
>>> +     * Should be before restoring CR4, but that is earlier in asm.  We
>>> +     * rely on MSR_PKRS actually being 0 out of S3 resume.
>>> +     */
>>> +    if ( cpu_has_pks )
>>> +        wrpkrs_and_cache(0);
>>> +
>>>      /* (re)initialise SYSCALL/SYSENTER state, amongst other things. */
>>>      percpu_traps_init();
>>>
>>>
>>> I've folded this hunk, to sort out the S3 resume path.
>> The comment is a little misleading imo - it looks to justify that nothing
>> needs doing.  Could you add "..., but our cache needs clearing" to clarify
>> why, despite our relying on zero being in the register (which I find
>> problematic, considering that the doc doesn't even spell out reset state),
>> the write is needed?
> Xen doesn't actually set CR4.PKS at all (yet).
>
> I'm just trying to do a reasonable job of leaving Xen in a position
> where it doesn't explode the instant we want to start using PKS ourselves.
>
> S3 resume is out of a full core poweroff.  Registers (which aren't
> modified by firmware) will have their architectural reset values, and
> for MSR_PKRS, that's 0.

And where have you found that to be spelled out?  It is this lack of
specification (afaics) which is concerning me.

Jan
On 16/01/2023 3:04 pm, Jan Beulich wrote:
> On 16.01.2023 15:57, Andrew Cooper wrote:
>> On 16/01/2023 2:17 pm, Jan Beulich wrote:
>>> On 16.01.2023 14:00, Andrew Cooper wrote:
>>>> --- a/xen/arch/x86/acpi/power.c
>>>> +++ b/xen/arch/x86/acpi/power.c
>>>> @@ -299,6 +299,13 @@ static int enter_state(u32 state)
>>>>
>>>>      update_mcu_opt_ctrl();
>>>>
>>>> +    /*
>>>> +     * Should be before restoring CR4, but that is earlier in asm.  We
>>>> +     * rely on MSR_PKRS actually being 0 out of S3 resume.
>>>> +     */
>>>> +    if ( cpu_has_pks )
>>>> +        wrpkrs_and_cache(0);
>>>> +
>>>>      /* (re)initialise SYSCALL/SYSENTER state, amongst other things. */
>>>>      percpu_traps_init();
>>>>
>>>>
>>>> I've folded this hunk, to sort out the S3 resume path.
>>> The comment is a little misleading imo - it looks to justify that nothing
>>> needs doing.  Could you add "..., but our cache needs clearing" to clarify
>>> why, despite our relying on zero being in the register (which I find
>>> problematic, considering that the doc doesn't even spell out reset state),
>>> the write is needed?
>> Xen doesn't actually set CR4.PKS at all (yet).
>>
>> I'm just trying to do a reasonable job of leaving Xen in a position
>> where it doesn't explode the instant we want to start using PKS ourselves.
>>
>> S3 resume is out of a full core poweroff.  Registers (which aren't
>> modified by firmware) will have their architectural reset values, and
>> for MSR_PKRS, that's 0.
> And where have you found that to be spelled out?  It is this lack of
> specification (afaics) which is concerning me.

I have a request for an update to table 10-1 already pending.  MSR_PKRS
isn't plausibly different from PKRU.

(And even if it is different, this still won't matter while Xen doesn't
use CR4.PKS itself.)

~Andrew
diff --git a/xen/arch/x86/cpu/common.c b/xen/arch/x86/cpu/common.c
index 2bcdd08b2fb5..f44c907e8a43 100644
--- a/xen/arch/x86/cpu/common.c
+++ b/xen/arch/x86/cpu/common.c
@@ -58,6 +58,8 @@ static unsigned int forced_caps[NCAPINTS];

 DEFINE_PER_CPU(bool, full_gdt_loaded);

+DEFINE_PER_CPU(uint32_t, pkrs);
+
 void __init setup_clear_cpu_cap(unsigned int cap)
 {
 	const uint32_t *dfs;
diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index 43a4865d1c76..b1f493f009fd 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -58,6 +58,7 @@
 #include <asm/event.h>
 #include <asm/mce.h>
 #include <asm/monitor.h>
+#include <asm/prot-key.h>
 #include <public/arch-x86/cpuid.h>

 static bool_t __initdata opt_force_ept;
@@ -536,6 +537,7 @@ static void vmx_restore_host_msrs(void)

 static void vmx_save_guest_msrs(struct vcpu *v)
 {
+    const struct cpuid_policy *cp = v->domain->arch.cpuid;
     struct vcpu_msrs *msrs = v->arch.msrs;

     /*
@@ -549,10 +551,14 @@ static void vmx_save_guest_msrs(struct vcpu *v)
         rdmsrl(MSR_RTIT_OUTPUT_MASK, msrs->rtit.output_mask);
         rdmsrl(MSR_RTIT_STATUS, msrs->rtit.status);
     }
+
+    if ( cp->feat.pks )
+        msrs->pkrs = rdpkrs_and_cache();
 }

 static void vmx_restore_guest_msrs(struct vcpu *v)
 {
+    const struct cpuid_policy *cp = v->domain->arch.cpuid;
     const struct vcpu_msrs *msrs = v->arch.msrs;

     write_gs_shadow(v->arch.hvm.vmx.shadow_gs);
@@ -569,6 +575,9 @@ static void vmx_restore_guest_msrs(struct vcpu *v)
         wrmsrl(MSR_RTIT_OUTPUT_MASK, msrs->rtit.output_mask);
         wrmsrl(MSR_RTIT_STATUS, msrs->rtit.status);
     }
+
+    if ( cp->feat.pks )
+        wrpkrs(msrs->pkrs);
 }

 void vmx_update_cpu_exec_control(struct vcpu *v)
diff --git a/xen/arch/x86/include/asm/msr.h b/xen/arch/x86/include/asm/msr.h
index 191e54068856..7946b6b24c11 100644
--- a/xen/arch/x86/include/asm/msr.h
+++ b/xen/arch/x86/include/asm/msr.h
@@ -373,6 +373,15 @@ struct vcpu_msrs
         };
     } rtit;

+    /*
+     * 0x000006e1 - MSR_PKRS - Protection Key Supervisor.
+     *
+     * Exposed R/W to guests.  Xen doesn't use PKS yet, so only context
+     * switched per vcpu.  When in current context, live value is in hardware,
+     * and this value is stale.
+     */
+    uint32_t pkrs;
+
     /* 0x00000da0 - MSR_IA32_XSS */
     struct {
         uint64_t raw;
diff --git a/xen/arch/x86/include/asm/prot-key.h b/xen/arch/x86/include/asm/prot-key.h
index 63a2e22f3fa0..0dcd31b7ea68 100644
--- a/xen/arch/x86/include/asm/prot-key.h
+++ b/xen/arch/x86/include/asm/prot-key.h
@@ -5,8 +5,11 @@
 #ifndef ASM_PROT_KEY_H
 #define ASM_PROT_KEY_H

+#include <xen/percpu.h>
 #include <xen/types.h>

+#include <asm/msr.h>
+
 #define PKEY_AD 1 /* Access Disable */
 #define PKEY_WD 2 /* Write Disable */

@@ -28,4 +31,55 @@ static inline void wrpkru(uint32_t pkru)
                    :: "a" (pkru), "d" (0), "c" (0) );
 }

+/*
+ * Xen does not use PKS.
+ *
+ * Guest kernel use is expected to be one default key, except for tiny windows
+ * with a double write to switch to a non-default key in a permitted critical
+ * section.
+ *
+ * As such, we want MSR_PKRS un-intercepted.  Furthermore, as we only need it
+ * in Xen for emulation or migration purposes (i.e. possibly never in a
+ * domain's lifetime), we don't want to re-sync the hardware value on every
+ * vmexit.
+ *
+ * Therefore, we read and cache the guest value in ctxt_switch_from(), in the
+ * expectation that we can short-circuit the write in ctxt_switch_to().
+ * During regular operations in current context, the guest value is in
+ * hardware and the per-cpu cache is stale.
+ */
+DECLARE_PER_CPU(uint32_t, pkrs);
+
+static inline uint32_t rdpkrs(void)
+{
+    uint32_t pkrs, tmp;
+
+    rdmsr(MSR_PKRS, pkrs, tmp);
+
+    return pkrs;
+}
+
+static inline uint32_t rdpkrs_and_cache(void)
+{
+    return this_cpu(pkrs) = rdpkrs();
+}
+
+static inline void wrpkrs(uint32_t pkrs)
+{
+    uint32_t *this_pkrs = &this_cpu(pkrs);
+
+    if ( *this_pkrs != pkrs )
+    {
+        *this_pkrs = pkrs;
+
+        wrmsr_ns(MSR_PKRS, pkrs, 0);
+    }
+}
+
+static inline void wrpkrs_and_cache(uint32_t pkrs)
+{
+    this_cpu(pkrs) = pkrs;
+    wrmsr_ns(MSR_PKRS, pkrs, 0);
+}
+
 #endif /* ASM_PROT_KEY_H */
diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index 6deadcf74763..567a0a42ac50 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -54,6 +54,7 @@
 #include <asm/spec_ctrl.h>
 #include <asm/guest.h>
 #include <asm/microcode.h>
+#include <asm/prot-key.h>
 #include <asm/pv/domain.h>

 /* opt_nosmp: If true, secondary processors are ignored. */
@@ -1804,6 +1805,9 @@ void __init noreturn __start_xen(unsigned long mbi_p)
     if ( opt_invpcid && cpu_has_invpcid )
         use_invpcid = true;

+    if ( cpu_has_pks )
+        wrpkrs_and_cache(0); /* Must be before setting CR4.PKS */
+
     init_speculation_mitigations();

     init_idle_domain();
diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
index 52beed9d8d6d..b26758c2c89f 100644
--- a/xen/arch/x86/smpboot.c
+++ b/xen/arch/x86/smpboot.c
@@ -42,6 +42,7 @@
 #include <asm/microcode.h>
 #include <asm/msr.h>
 #include <asm/mtrr.h>
+#include <asm/prot-key.h>
 #include <asm/setup.h>
 #include <asm/spec_ctrl.h>
 #include <asm/time.h>
@@ -364,6 +365,9 @@ void start_secondary(void *unused)

     /* Full exception support from here on in. */

+    if ( cpu_has_pks )
+        wrpkrs_and_cache(0); /* Must be before setting CR4.PKS */
+
     /* Safe to enable feature such as CR4.MCE with the IDT set up now. */
     write_cr4(mmu_cr4_features);
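To illustrate the "tiny window with a double write" pattern that the
prot-key.h comment above anticipates from guest kernels, a hypothetical
critical section might look like the sketch below.  PKEY_BITS(),
DEFAULT_PKRS and the protected data are invented for illustration; only
PKEY_AD and the architectural 2-bits-per-key layout of MSR_PKRS come from
the patch:

/* Each protection key owns a 2-bit {AD,WD} field in MSR_PKRS. */
#define PKEY_BITS(key, perms) ((uint32_t)(perms) << ((key) * 2))

/* Hypothetical default: key 1 is Access-Disabled, all other keys open. */
#define DEFAULT_PKRS PKEY_BITS(1, PKEY_AD)

static void write_key1_protected(uint64_t *data, uint64_t val)
{
    /* First write: clear AD for key 1, opening the access window. */
    wrmsr(MSR_PKRS, DEFAULT_PKRS & ~PKEY_BITS(1, PKEY_AD), 0);

    *data = val;    /* access data mapped with protection key 1 */

    /* Second write: restore the default, closing the window. */
    wrmsr(MSR_PKRS, DEFAULT_PKRS, 0);
}

A guest doing only this keeps MSR_PKRS at its default value almost all of
the time, which is what makes the lazy save/restore in
vmx_save_guest_msrs() / vmx_restore_guest_msrs() above worthwhile.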
Under PKS, MSR_PKRS is available and based on the CPUID policy alone, and
usable independently of CR4.PKS.  See the large comment in prot-key.h for
details of the context switching arrangement.

Use WRMSRNS right away, as we don't care about serialising properties for
context switching this MSR.

Sanitise MSR_PKRS on boot.  In anticipation of wanting to use PKS for Xen
in the future, arrange for the sanitisation to occur prior to potentially
setting CR4.PKS; if PKEY0.{AD,WD} leak in from a previous context, we will
triple fault immediately on setting CR4.PKS.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Wei Liu <wl@xen.org>
CC: Kevin Tian <kevin.tian@intel.com>

v2:
 * Use WRMSRNS
 * Sanitise MSR_PKRS on boot.
---
 xen/arch/x86/cpu/common.c           |  2 ++
 xen/arch/x86/hvm/vmx/vmx.c          |  9 +++++++
 xen/arch/x86/include/asm/msr.h      |  9 +++++++
 xen/arch/x86/include/asm/prot-key.h | 54 +++++++++++++++++++++++++++++++++++++
 xen/arch/x86/setup.c                |  4 +++
 xen/arch/x86/smpboot.c              |  4 +++
 6 files changed, 82 insertions(+)