Message ID | 8ba8f016-0aed-277b-bbea-80022d057791@suse.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | xvmalloc() / x86 xstate area / x86 CPUID / AMX+XFD | expand |
On 22/04/2021 15:44, Jan Beulich wrote: > vCPU-s get maximum size areas allocated initially. Hidden (and in > particular default-off) features may allow for a smaller size area to > suffice. > > Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com> > Signed-off-by: Jan Beulich <jbeulich@suse.com> > --- > v2: Use 1ul instead of 1ull. Re-base. > --- > This could be further shrunk if we used XSAVEC / if we really used > XSAVES, as then we don't need to also cover the holes. But since we > currently use neither of the two in reality, this would require more > work than just adding the alternative size calculation here. > > Seeing that both vcpu_init_fpu() and cpuid_policy_updated() get called > from arch_vcpu_create(), I'm not sure we really need this two-stage > approach - the slightly longer period of time during which > v->arch.xsave_area would remain NULL doesn't look all that problematic. > But since xstate_alloc_save_area() gets called for idle vCPU-s, it has > to stay anyway in some form, so the extra code churn may not be worth > it. > > Instead of cpuid_policy_xcr0_max(), cpuid_policy_xstates() may be the > interface to use here. But it remains to be determined whether the > xcr0_accum field is meant to be inclusive of XSS (in which case it would > better be renamed) or exclusive. Right now there's no difference as we > don't support any XSS-controlled features. I've been figuring out what we need to for supervisors states. The current code is not in a good shape, but I also think some of the changes in this series take us in an unhelpful direction. I've got a cleanup series which I will post shortly. It interacts texturally although not fundamentally with this series, but does fix several issues. For supervisor states, we need use XSAVES unilaterally, even for PV. This is because XSS_CET_S needs to be the HVM kernel's context, or Xen's in PV context (specifically, MSR_PL0_SSP which is the shstk equivalent of TSS.RSP0). A consequence is that Xen's data handling shall use the compressed format, and include supervisor states. (While in principle we could manage CET_S, CET_U, and potentially PT when vmtrace gets expanded, each WRMSR there is a similar order of magnitude to an XSAVES/XRSTORS instruction.) I'm planning a host xss setting, similar to mmu_cr4_features, which shall be the setting in context for everything other than HVM vcpus (which need the guest setting in context, and/or the VT-x bodge to support host-only states). Amongst other things, all context switch paths, including from-HVM, need to step XSS up to the host setting to let XSAVES function correctly. However, a consequence of this is that the size of the xsave area needs deriving from host, as well as guest-max state. i.e. even if some VMs aren't using CET, we still need space in the xsave areas to function correctly when a single VM is using CET. Another consequence is that we need to rethink our hypercall behaviour. There is no such thing as supervisor states in an uncompressed XSAVE image, which means we can't continue with that being the ABI. I've also found some substantial issues with how we handle xcr0/xcr0_accum and plan to address these. There is no such thing as xcr0 without the bottom bit set, ever, and xcr0_accum needs to default to X87|SSE seeing as that's how we use it anyway. However, in a context switch, I expect we'll still be using xcr0_accum | host_xss when it comes to the context switch path. In terms of actual context switching, we want to be using XSAVES/XRSTORS whenever it is available, even if we're not using supervisor states. XSAVES has both the inuse and modified optimisations, without the broken consequence of XSAVEOPT (which is firmly in the "don't ever use this" bucket now). There's no point ever using XSAVEC. There is no hardware where it exists in the absence of XSAVES, and can't even in theoretical circumstances due to (perhaps unintentional) linkage of the CPUID data. XSAVEC also doesn't use the modified optimisation, and is therefore strictly worse than XSAVES, even when MSR_XSS is 0. Therefore, our choice of instruction wants to be XSAVES, or XSAVE, or FXSAVE, depending on hardware capability. ~Andrew
On 03.05.2021 15:57, Andrew Cooper wrote: > On 22/04/2021 15:44, Jan Beulich wrote: >> vCPU-s get maximum size areas allocated initially. Hidden (and in >> particular default-off) features may allow for a smaller size area to >> suffice. >> >> Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com> >> Signed-off-by: Jan Beulich <jbeulich@suse.com> >> --- >> v2: Use 1ul instead of 1ull. Re-base. >> --- >> This could be further shrunk if we used XSAVEC / if we really used >> XSAVES, as then we don't need to also cover the holes. But since we >> currently use neither of the two in reality, this would require more >> work than just adding the alternative size calculation here. >> >> Seeing that both vcpu_init_fpu() and cpuid_policy_updated() get called >> from arch_vcpu_create(), I'm not sure we really need this two-stage >> approach - the slightly longer period of time during which >> v->arch.xsave_area would remain NULL doesn't look all that problematic. >> But since xstate_alloc_save_area() gets called for idle vCPU-s, it has >> to stay anyway in some form, so the extra code churn may not be worth >> it. >> >> Instead of cpuid_policy_xcr0_max(), cpuid_policy_xstates() may be the >> interface to use here. But it remains to be determined whether the >> xcr0_accum field is meant to be inclusive of XSS (in which case it would >> better be renamed) or exclusive. Right now there's no difference as we >> don't support any XSS-controlled features. > > I've been figuring out what we need to for supervisors states. The > current code is not in a good shape, but I also think some of the > changes in this series take us in an unhelpful direction. From reading through the rest your reply I'm not sure I see what you mean. ORing in host_xss at certain points shouldn't be a big deal. > I've got a cleanup series which I will post shortly. It interacts > texturally although not fundamentally with this series, but does fix > several issues. > > For supervisor states, we need use XSAVES unilaterally, even for PV. > This is because XSS_CET_S needs to be the HVM kernel's context, or Xen's > in PV context (specifically, MSR_PL0_SSP which is the shstk equivalent > of TSS.RSP0). > > > A consequence is that Xen's data handling shall use the compressed > format, and include supervisor states. (While in principle we could > manage CET_S, CET_U, and potentially PT when vmtrace gets expanded, each > WRMSR there is a similar order of magnitude to an XSAVES/XRSTORS > instruction.) I agree. > I'm planning a host xss setting, similar to mmu_cr4_features, which > shall be the setting in context for everything other than HVM vcpus > (which need the guest setting in context, and/or the VT-x bodge to > support host-only states). Amongst other things, all context switch > paths, including from-HVM, need to step XSS up to the host setting to > let XSAVES function correctly. > > However, a consequence of this is that the size of the xsave area needs > deriving from host, as well as guest-max state. i.e. even if some VMs > aren't using CET, we still need space in the xsave areas to function > correctly when a single VM is using CET. Right - as said above, taking this into consideration here shouldn't be overly problematic. > Another consequence is that we need to rethink our hypercall behaviour. > There is no such thing as supervisor states in an uncompressed XSAVE > image, which means we can't continue with that being the ABI. I don't think the hypercall input / output blob needs to follow any specific hardware layout. > I've also found some substantial issues with how we handle > xcr0/xcr0_accum and plan to address these. There is no such thing as > xcr0 without the bottom bit set, ever, and xcr0_accum needs to default > to X87|SSE seeing as that's how we use it anyway. However, in a context > switch, I expect we'll still be using xcr0_accum | host_xss when it > comes to the context switch path. Right, and to avoid confusion I think we also want to move from xcr0_accum to e.g. xstate_accum, covering both XCR0 and XSS parts all in one go. > In terms of actual context switching, we want to be using XSAVES/XRSTORS > whenever it is available, even if we're not using supervisor states. > XSAVES has both the inuse and modified optimisations, without the broken > consequence of XSAVEOPT (which is firmly in the "don't ever use this" > bucket now). The XSAVEOPT anomaly is affecting user mode only, isn't it? Or are you talking of something I have forgot about? > There's no point ever using XSAVEC. There is no hardware where it > exists in the absence of XSAVES, and can't even in theoretical > circumstances due to (perhaps unintentional) linkage of the CPUID data. > XSAVEC also doesn't use the modified optimisation, and is therefore > strictly worse than XSAVES, even when MSR_XSS is 0. > > Therefore, our choice of instruction wants to be XSAVES, or XSAVE, or > FXSAVE, depending on hardware capability. Makes sense to me (perhaps - see above - minus your omission of XSAVEOPT here). Jan
On 03/05/2021 15:22, Jan Beulich wrote: >> Another consequence is that we need to rethink our hypercall behaviour. >> There is no such thing as supervisor states in an uncompressed XSAVE >> image, which means we can't continue with that being the ABI. > I don't think the hypercall input / output blob needs to follow any > specific hardware layout. Currently, the blob is { xcr0, xcr0_accum, uncompressed image }. As we haven't supported any compressed states yet, we are at liberty to create a forward compatible change by logically s/xcr0/xstate/ and permitting an uncompressed image. Irritatingly, we have xcr0=0 as a permitted state and out in the field, for "no xsave state". This contributes a substantial quantity of complexity in our xstate logic, and invalidates the easy fix I had for not letting the HVM initpath explode. The first task is to untangle the non-architectural xcr0=0 case, and to support compressed images. Size parsing needs to be split into two, as for compressed images, we need to consume XSTATE_BV and XCOMP_BV to cross-check the size. I think we also want a rule that Xen will always send compressed if it is using XSAVES (/XSAVEC in the interim?) We do not want to be working with uncompressed images at all, now that MPX is a reasonable sized hole in the middle. Cleaning this up will then unblock v2 of the existing xstate cleanup series I posted. >> In terms of actual context switching, we want to be using XSAVES/XRSTORS >> whenever it is available, even if we're not using supervisor states. >> XSAVES has both the inuse and modified optimisations, without the broken >> consequence of XSAVEOPT (which is firmly in the "don't ever use this" >> bucket now). > The XSAVEOPT anomaly is affecting user mode only, isn't it? Or are > you talking of something I have forgot about? It's not safe to use at all in L1 xen, because the tracking leaks between non-root contexts. I can't remember if there are further problems for an L0 xen, but I have a nagging feeling that there is. ~Andrew
On 11.05.2021 18:41, Andrew Cooper wrote: > On 03/05/2021 15:22, Jan Beulich wrote: >>> Another consequence is that we need to rethink our hypercall behaviour. >>> There is no such thing as supervisor states in an uncompressed XSAVE >>> image, which means we can't continue with that being the ABI. >> I don't think the hypercall input / output blob needs to follow any >> specific hardware layout. > > Currently, the blob is { xcr0, xcr0_accum, uncompressed image }. > > As we haven't supported any compressed states yet, we are at liberty to > create a forward compatible change by logically s/xcr0/xstate/ and > permitting an uncompressed image. > > Irritatingly, we have xcr0=0 as a permitted state and out in the field, > for "no xsave state". This contributes a substantial quantity of > complexity in our xstate logic, and invalidates the easy fix I had for > not letting the HVM initpath explode. > > The first task is to untangle the non-architectural xcr0=0 case, and to > support compressed images. Size parsing needs to be split into two, as > for compressed images, we need to consume XSTATE_BV and XCOMP_BV to > cross-check the size. Not sure about the need to eliminate the xcr0=0 (or xstates=0) case. Which isn't to say I'm opposed if you want to do so and it's not overly intrusive. > I think we also want a rule that Xen will always send compressed if it > is using XSAVES (/XSAVEC in the interim?) If this is sufficiently neutral to tool stack code, why not (albeit I don't think there needs to be a "rule" - Xen should be free to provide what it deems best, with consumers in the position to easily recognize the format; similarly Xen should be consuming whatever it gets handed, as long as that's valid state). Luckily the layout is visible just through tool-stack-only interfaces. > We do not want to be working > with uncompressed images at all, now that MPX is a reasonable sized hole > in the middle. They're together no larger (128 bytes) than the LWP hole right ahead of them (at 0x340). I agree avoiding uncompressed format is worthwhile, but perhaps quite a bit more so for systems with higher components following unavailable even bigger ones. Jan
--- a/xen/arch/x86/domain.c +++ b/xen/arch/x86/domain.c @@ -281,7 +281,21 @@ void update_guest_memory_policy(struct v } } -void domain_cpu_policy_changed(struct domain *d) +/* + * Called during vcpu construction, and each time the toolstack changes the + * CPUID configuration for the domain. + */ +static int __must_check cpuid_policy_updated(struct vcpu *v) +{ + int rc = xstate_update_save_area(v); + + if ( !rc && is_hvm_vcpu(v) ) + hvm_cpuid_policy_changed(v); + + return rc; +} + +int domain_cpu_policy_changed(struct domain *d) { const struct cpuid_policy *p = d->arch.cpuid; struct vcpu *v; @@ -439,13 +453,18 @@ void domain_cpu_policy_changed(struct do for_each_vcpu ( d, v ) { - cpuid_policy_updated(v); + int rc = cpuid_policy_updated(v); + + if ( rc ) + return rc; /* If PMU version is zero then the guest doesn't have VPMU */ if ( boot_cpu_data.x86_vendor == X86_VENDOR_INTEL && p->basic.pmu_version == 0 ) vpmu_destroy(v); } + + return 0; } #ifndef CONFIG_BIGMEM @@ -584,7 +603,7 @@ int arch_vcpu_create(struct vcpu *v) { vpmu_initialise(v); - cpuid_policy_updated(v); + rc = cpuid_policy_updated(v); } return rc; @@ -859,11 +878,11 @@ int arch_domain_create(struct domain *d, */ d->arch.x87_fip_width = cpu_has_fpu_sel ? 0 : 8; - domain_cpu_policy_changed(d); - d->arch.msr_relaxed = config->arch.misc_flags & XEN_X86_MSR_RELAXED; - return 0; + rc = domain_cpu_policy_changed(d); + if ( !rc ) + return 0; fail: d->is_dying = DOMDYING_dead; @@ -2471,16 +2490,6 @@ int domain_relinquish_resources(struct d return 0; } -/* - * Called during vcpu construction, and each time the toolstack changes the - * CPUID configuration for the domain. - */ -void cpuid_policy_updated(struct vcpu *v) -{ - if ( is_hvm_vcpu(v) ) - hvm_cpuid_policy_changed(v); -} - void arch_dump_domain_info(struct domain *d) { paging_dump_domain_info(d); --- a/xen/arch/x86/domctl.c +++ b/xen/arch/x86/domctl.c @@ -89,7 +89,7 @@ static int update_domain_cpu_policy(stru recalculate_cpuid_policy(d); /* Recalculate relevant dom/vcpu state now the policy has changed. */ - domain_cpu_policy_changed(d); + ret = domain_cpu_policy_changed(d); out: /* Free whichever cpuid/msr structs are not installed in struct domain. */ --- a/xen/arch/x86/xstate.c +++ b/xen/arch/x86/xstate.c @@ -542,6 +542,41 @@ int xstate_alloc_save_area(struct vcpu * return 0; } +int xstate_update_save_area(struct vcpu *v) +{ + unsigned int i, size, old; + struct xsave_struct *save_area; + uint64_t xcr0_max = cpuid_policy_xcr0_max(v->domain->arch.cpuid); + + ASSERT(!is_idle_vcpu(v)); + + if ( !cpu_has_xsave ) + return 0; + + if ( v->arch.xcr0_accum & ~xcr0_max ) + return -EBUSY; + + for ( size = old = XSTATE_AREA_MIN_SIZE, i = 2; i < xstate_features; ++i ) + { + if ( xcr0_max & (1ul << i) ) + size = max(size, xstate_offsets[i] + xstate_sizes[i]); + if ( v->arch.xcr0_accum & (1ul << i) ) + old = max(old, xstate_offsets[i] + xstate_sizes[i]); + } + + save_area = _xvrealloc(v->arch.xsave_area, size, __alignof(*save_area)); + if ( !save_area ) + return -ENOMEM; + + ASSERT(old <= size); + memset((void *)save_area + old, 0, size - old); + + v->arch.xsave_area = save_area; + v->arch.fpu_ctxt = &v->arch.xsave_area->fpu_sse; + + return 0; +} + void xstate_free_save_area(struct vcpu *v) { XVFREE(v->arch.xsave_area); --- a/xen/include/asm-x86/domain.h +++ b/xen/include/asm-x86/domain.h @@ -78,8 +78,6 @@ void toggle_guest_mode(struct vcpu *); /* x86/64: toggle guest page tables between kernel and user modes. */ void toggle_guest_pt(struct vcpu *); -void cpuid_policy_updated(struct vcpu *v); - /* * Initialise a hypercall-transfer page. The given pointer must be mapped * in Xen virtual address space (accesses are not validated or checked). @@ -670,7 +668,7 @@ struct guest_memory_policy void update_guest_memory_policy(struct vcpu *v, struct guest_memory_policy *policy); -void domain_cpu_policy_changed(struct domain *d); +int __must_check domain_cpu_policy_changed(struct domain *d); bool update_runstate_area(struct vcpu *); bool update_secondary_system_time(struct vcpu *, --- a/xen/include/asm-x86/xstate.h +++ b/xen/include/asm-x86/xstate.h @@ -106,6 +106,7 @@ void compress_xsave_states(struct vcpu * /* extended state init and cleanup functions */ void xstate_free_save_area(struct vcpu *v); int xstate_alloc_save_area(struct vcpu *v); +int xstate_update_save_area(struct vcpu *v); void xstate_init(struct cpuinfo_x86 *c); unsigned int xstate_ctxt_size(u64 xcr0);
vCPU-s get maximum size areas allocated initially. Hidden (and in particular default-off) features may allow for a smaller size area to suffice. Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> --- v2: Use 1ul instead of 1ull. Re-base. --- This could be further shrunk if we used XSAVEC / if we really used XSAVES, as then we don't need to also cover the holes. But since we currently use neither of the two in reality, this would require more work than just adding the alternative size calculation here. Seeing that both vcpu_init_fpu() and cpuid_policy_updated() get called from arch_vcpu_create(), I'm not sure we really need this two-stage approach - the slightly longer period of time during which v->arch.xsave_area would remain NULL doesn't look all that problematic. But since xstate_alloc_save_area() gets called for idle vCPU-s, it has to stay anyway in some form, so the extra code churn may not be worth it. Instead of cpuid_policy_xcr0_max(), cpuid_policy_xstates() may be the interface to use here. But it remains to be determined whether the xcr0_accum field is meant to be inclusive of XSS (in which case it would better be renamed) or exclusive. Right now there's no difference as we don't support any XSS-controlled features.