diff mbox series

[v3,03/22] x86/xstate: re-size save area when CPUID policy changes

Message ID 8ba8f016-0aed-277b-bbea-80022d057791@suse.com (mailing list archive)
State New, archived
Headers show
Series xvmalloc() / x86 xstate area / x86 CPUID / AMX+XFD | expand

Commit Message

Jan Beulich April 22, 2021, 2:44 p.m. UTC
vCPU-s get maximum size areas allocated initially. Hidden (and in
particular default-off) features may allow for a smaller size area to
suffice.

Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v2: Use 1ul instead of 1ull. Re-base.
---
This could be further shrunk if we used XSAVEC / if we really used
XSAVES, as then we don't need to also cover the holes. But since we
currently use neither of the two in reality, this would require more
work than just adding the alternative size calculation here.

Seeing that both vcpu_init_fpu() and cpuid_policy_updated() get called
from arch_vcpu_create(), I'm not sure we really need this two-stage
approach - the slightly longer period of time during which
v->arch.xsave_area would remain NULL doesn't look all that problematic.
But since xstate_alloc_save_area() gets called for idle vCPU-s, it has
to stay anyway in some form, so the extra code churn may not be worth
it.

Instead of cpuid_policy_xcr0_max(), cpuid_policy_xstates() may be the
interface to use here. But it remains to be determined whether the
xcr0_accum field is meant to be inclusive of XSS (in which case it would
better be renamed) or exclusive. Right now there's no difference as we
don't support any XSS-controlled features.

Comments

Andrew Cooper May 3, 2021, 1:57 p.m. UTC | #1
On 22/04/2021 15:44, Jan Beulich wrote:
> vCPU-s get maximum size areas allocated initially. Hidden (and in
> particular default-off) features may allow for a smaller size area to
> suffice.
>
> Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> v2: Use 1ul instead of 1ull. Re-base.
> ---
> This could be further shrunk if we used XSAVEC / if we really used
> XSAVES, as then we don't need to also cover the holes. But since we
> currently use neither of the two in reality, this would require more
> work than just adding the alternative size calculation here.
>
> Seeing that both vcpu_init_fpu() and cpuid_policy_updated() get called
> from arch_vcpu_create(), I'm not sure we really need this two-stage
> approach - the slightly longer period of time during which
> v->arch.xsave_area would remain NULL doesn't look all that problematic.
> But since xstate_alloc_save_area() gets called for idle vCPU-s, it has
> to stay anyway in some form, so the extra code churn may not be worth
> it.
>
> Instead of cpuid_policy_xcr0_max(), cpuid_policy_xstates() may be the
> interface to use here. But it remains to be determined whether the
> xcr0_accum field is meant to be inclusive of XSS (in which case it would
> better be renamed) or exclusive. Right now there's no difference as we
> don't support any XSS-controlled features.

I've been figuring out what we need to for supervisors states.  The
current code is not in a good shape, but I also think some of the
changes in this series take us in an unhelpful direction.

I've got a cleanup series which I will post shortly.  It interacts
texturally although not fundamentally with this series, but does fix
several issues.

For supervisor states, we need use XSAVES unilaterally, even for PV. 
This is because XSS_CET_S needs to be the HVM kernel's context, or Xen's
in PV context (specifically, MSR_PL0_SSP which is the shstk equivalent
of TSS.RSP0).


A consequence is that Xen's data handling shall use the compressed
format, and include supervisor states.  (While in principle we could
manage CET_S, CET_U, and potentially PT when vmtrace gets expanded, each
WRMSR there is a similar order of magnitude to an XSAVES/XRSTORS
instruction.)

I'm planning a host xss setting, similar to mmu_cr4_features, which
shall be the setting in context for everything other than HVM vcpus
(which need the guest setting in context, and/or the VT-x bodge to
support host-only states).  Amongst other things, all context switch
paths, including from-HVM, need to step XSS up to the host setting to
let XSAVES function correctly.

However, a consequence of this is that the size of the xsave area needs
deriving from host, as well as guest-max state.  i.e. even if some VMs
aren't using CET, we still need space in the xsave areas to function
correctly when a single VM is using CET.

Another consequence is that we need to rethink our hypercall behaviour. 
There is no such thing as supervisor states in an uncompressed XSAVE
image, which means we can't continue with that being the ABI.

I've also found some substantial issues with how we handle
xcr0/xcr0_accum and plan to address these.  There is no such thing as
xcr0 without the bottom bit set, ever, and xcr0_accum needs to default
to X87|SSE seeing as that's how we use it anyway.  However, in a context
switch, I expect we'll still be using xcr0_accum | host_xss when it
comes to the context switch path.

In terms of actual context switching, we want to be using XSAVES/XRSTORS
whenever it is available, even if we're not using supervisor states. 
XSAVES has both the inuse and modified optimisations, without the broken
consequence of XSAVEOPT (which is firmly in the "don't ever use this"
bucket now).

There's no point ever using XSAVEC.  There is no hardware where it
exists in the absence of XSAVES, and can't even in theoretical
circumstances due to (perhaps unintentional) linkage of the CPUID data. 
XSAVEC also doesn't use the modified optimisation, and is therefore
strictly worse than XSAVES, even when MSR_XSS is 0.

Therefore, our choice of instruction wants to be XSAVES, or XSAVE, or
FXSAVE, depending on hardware capability.

~Andrew
Jan Beulich May 3, 2021, 2:22 p.m. UTC | #2
On 03.05.2021 15:57, Andrew Cooper wrote:
> On 22/04/2021 15:44, Jan Beulich wrote:
>> vCPU-s get maximum size areas allocated initially. Hidden (and in
>> particular default-off) features may allow for a smaller size area to
>> suffice.
>>
>> Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> ---
>> v2: Use 1ul instead of 1ull. Re-base.
>> ---
>> This could be further shrunk if we used XSAVEC / if we really used
>> XSAVES, as then we don't need to also cover the holes. But since we
>> currently use neither of the two in reality, this would require more
>> work than just adding the alternative size calculation here.
>>
>> Seeing that both vcpu_init_fpu() and cpuid_policy_updated() get called
>> from arch_vcpu_create(), I'm not sure we really need this two-stage
>> approach - the slightly longer period of time during which
>> v->arch.xsave_area would remain NULL doesn't look all that problematic.
>> But since xstate_alloc_save_area() gets called for idle vCPU-s, it has
>> to stay anyway in some form, so the extra code churn may not be worth
>> it.
>>
>> Instead of cpuid_policy_xcr0_max(), cpuid_policy_xstates() may be the
>> interface to use here. But it remains to be determined whether the
>> xcr0_accum field is meant to be inclusive of XSS (in which case it would
>> better be renamed) or exclusive. Right now there's no difference as we
>> don't support any XSS-controlled features.
> 
> I've been figuring out what we need to for supervisors states.  The
> current code is not in a good shape, but I also think some of the
> changes in this series take us in an unhelpful direction.

From reading through the rest your reply I'm not sure I see what you
mean. ORing in host_xss at certain points shouldn't be a big deal.

> I've got a cleanup series which I will post shortly.  It interacts
> texturally although not fundamentally with this series, but does fix
> several issues.
> 
> For supervisor states, we need use XSAVES unilaterally, even for PV. 
> This is because XSS_CET_S needs to be the HVM kernel's context, or Xen's
> in PV context (specifically, MSR_PL0_SSP which is the shstk equivalent
> of TSS.RSP0).
> 
> 
> A consequence is that Xen's data handling shall use the compressed
> format, and include supervisor states.  (While in principle we could
> manage CET_S, CET_U, and potentially PT when vmtrace gets expanded, each
> WRMSR there is a similar order of magnitude to an XSAVES/XRSTORS
> instruction.)

I agree.

> I'm planning a host xss setting, similar to mmu_cr4_features, which
> shall be the setting in context for everything other than HVM vcpus
> (which need the guest setting in context, and/or the VT-x bodge to
> support host-only states).  Amongst other things, all context switch
> paths, including from-HVM, need to step XSS up to the host setting to
> let XSAVES function correctly.
> 
> However, a consequence of this is that the size of the xsave area needs
> deriving from host, as well as guest-max state.  i.e. even if some VMs
> aren't using CET, we still need space in the xsave areas to function
> correctly when a single VM is using CET.

Right - as said above, taking this into consideration here shouldn't
be overly problematic.

> Another consequence is that we need to rethink our hypercall behaviour. 
> There is no such thing as supervisor states in an uncompressed XSAVE
> image, which means we can't continue with that being the ABI.

I don't think the hypercall input / output blob needs to follow any
specific hardware layout.

> I've also found some substantial issues with how we handle
> xcr0/xcr0_accum and plan to address these.  There is no such thing as
> xcr0 without the bottom bit set, ever, and xcr0_accum needs to default
> to X87|SSE seeing as that's how we use it anyway.  However, in a context
> switch, I expect we'll still be using xcr0_accum | host_xss when it
> comes to the context switch path.

Right, and to avoid confusion I think we also want to move from
xcr0_accum to e.g. xstate_accum, covering both XCR0 and XSS parts
all in one go.

> In terms of actual context switching, we want to be using XSAVES/XRSTORS
> whenever it is available, even if we're not using supervisor states. 
> XSAVES has both the inuse and modified optimisations, without the broken
> consequence of XSAVEOPT (which is firmly in the "don't ever use this"
> bucket now).

The XSAVEOPT anomaly is affecting user mode only, isn't it? Or are
you talking of something I have forgot about?

> There's no point ever using XSAVEC.  There is no hardware where it
> exists in the absence of XSAVES, and can't even in theoretical
> circumstances due to (perhaps unintentional) linkage of the CPUID data. 
> XSAVEC also doesn't use the modified optimisation, and is therefore
> strictly worse than XSAVES, even when MSR_XSS is 0.
> 
> Therefore, our choice of instruction wants to be XSAVES, or XSAVE, or
> FXSAVE, depending on hardware capability.

Makes sense to me (perhaps - see above - minus your omission of
XSAVEOPT here).

Jan
Andrew Cooper May 11, 2021, 4:41 p.m. UTC | #3
On 03/05/2021 15:22, Jan Beulich wrote:
>> Another consequence is that we need to rethink our hypercall behaviour. 
>> There is no such thing as supervisor states in an uncompressed XSAVE
>> image, which means we can't continue with that being the ABI.
> I don't think the hypercall input / output blob needs to follow any
> specific hardware layout.

Currently, the blob is { xcr0, xcr0_accum, uncompressed image }.

As we haven't supported any compressed states yet, we are at liberty to
create a forward compatible change by logically s/xcr0/xstate/ and
permitting an uncompressed image.

Irritatingly, we have xcr0=0 as a permitted state and out in the field,
for "no xsave state".  This contributes a substantial quantity of
complexity in our xstate logic, and invalidates the easy fix I had for
not letting the HVM initpath explode.

The first task is to untangle the non-architectural xcr0=0 case, and to
support compressed images.  Size parsing needs to be split into two, as
for compressed images, we need to consume XSTATE_BV and XCOMP_BV to
cross-check the size.

I think we also want a rule that Xen will always send compressed if it
is using XSAVES (/XSAVEC in the interim?)  We do not want to be working
with uncompressed images at all, now that MPX is a reasonable sized hole
in the middle.

Cleaning this up will then unblock v2 of the existing xstate cleanup
series I posted.

>> In terms of actual context switching, we want to be using XSAVES/XRSTORS
>> whenever it is available, even if we're not using supervisor states. 
>> XSAVES has both the inuse and modified optimisations, without the broken
>> consequence of XSAVEOPT (which is firmly in the "don't ever use this"
>> bucket now).
> The XSAVEOPT anomaly is affecting user mode only, isn't it? Or are
> you talking of something I have forgot about?

It's not safe to use at all in L1 xen, because the tracking leaks
between non-root contexts.  I can't remember if there are further
problems for an L0 xen, but I have a nagging feeling that there is.

~Andrew
Jan Beulich May 17, 2021, 7:33 a.m. UTC | #4
On 11.05.2021 18:41, Andrew Cooper wrote:
> On 03/05/2021 15:22, Jan Beulich wrote:
>>> Another consequence is that we need to rethink our hypercall behaviour. 
>>> There is no such thing as supervisor states in an uncompressed XSAVE
>>> image, which means we can't continue with that being the ABI.
>> I don't think the hypercall input / output blob needs to follow any
>> specific hardware layout.
> 
> Currently, the blob is { xcr0, xcr0_accum, uncompressed image }.
> 
> As we haven't supported any compressed states yet, we are at liberty to
> create a forward compatible change by logically s/xcr0/xstate/ and
> permitting an uncompressed image.
> 
> Irritatingly, we have xcr0=0 as a permitted state and out in the field,
> for "no xsave state".  This contributes a substantial quantity of
> complexity in our xstate logic, and invalidates the easy fix I had for
> not letting the HVM initpath explode.
> 
> The first task is to untangle the non-architectural xcr0=0 case, and to
> support compressed images.  Size parsing needs to be split into two, as
> for compressed images, we need to consume XSTATE_BV and XCOMP_BV to
> cross-check the size.

Not sure about the need to eliminate the xcr0=0 (or xstates=0) case.
Which isn't to say I'm opposed if you want to do so and it's not
overly intrusive.

> I think we also want a rule that Xen will always send compressed if it
> is using XSAVES (/XSAVEC in the interim?)

If this is sufficiently neutral to tool stack code, why not (albeit
I don't think there needs to be a "rule" - Xen should be free to
provide what it deems best, with consumers in the position to easily
recognize the format; similarly Xen should be consuming whatever it
gets handed, as long as that's valid state). Luckily the layout is
visible just through tool-stack-only interfaces.

>  We do not want to be working
> with uncompressed images at all, now that MPX is a reasonable sized hole
> in the middle.

They're together no larger (128 bytes) than the LWP hole right ahead
of them (at 0x340). I agree avoiding uncompressed format is worthwhile,
but perhaps quite a bit more so for systems with higher components
following unavailable even bigger ones.

Jan
diff mbox series

Patch

--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -281,7 +281,21 @@  void update_guest_memory_policy(struct v
     }
 }
 
-void domain_cpu_policy_changed(struct domain *d)
+/*
+ * Called during vcpu construction, and each time the toolstack changes the
+ * CPUID configuration for the domain.
+ */
+static int __must_check cpuid_policy_updated(struct vcpu *v)
+{
+    int rc = xstate_update_save_area(v);
+
+    if ( !rc && is_hvm_vcpu(v) )
+        hvm_cpuid_policy_changed(v);
+
+    return rc;
+}
+
+int domain_cpu_policy_changed(struct domain *d)
 {
     const struct cpuid_policy *p = d->arch.cpuid;
     struct vcpu *v;
@@ -439,13 +453,18 @@  void domain_cpu_policy_changed(struct do
 
     for_each_vcpu ( d, v )
     {
-        cpuid_policy_updated(v);
+        int rc = cpuid_policy_updated(v);
+
+        if ( rc )
+            return rc;
 
         /* If PMU version is zero then the guest doesn't have VPMU */
         if ( boot_cpu_data.x86_vendor == X86_VENDOR_INTEL &&
              p->basic.pmu_version == 0 )
             vpmu_destroy(v);
     }
+
+    return 0;
 }
 
 #ifndef CONFIG_BIGMEM
@@ -584,7 +603,7 @@  int arch_vcpu_create(struct vcpu *v)
     {
         vpmu_initialise(v);
 
-        cpuid_policy_updated(v);
+        rc = cpuid_policy_updated(v);
     }
 
     return rc;
@@ -859,11 +878,11 @@  int arch_domain_create(struct domain *d,
      */
     d->arch.x87_fip_width = cpu_has_fpu_sel ? 0 : 8;
 
-    domain_cpu_policy_changed(d);
-
     d->arch.msr_relaxed = config->arch.misc_flags & XEN_X86_MSR_RELAXED;
 
-    return 0;
+    rc = domain_cpu_policy_changed(d);
+    if ( !rc )
+        return 0;
 
  fail:
     d->is_dying = DOMDYING_dead;
@@ -2471,16 +2490,6 @@  int domain_relinquish_resources(struct d
     return 0;
 }
 
-/*
- * Called during vcpu construction, and each time the toolstack changes the
- * CPUID configuration for the domain.
- */
-void cpuid_policy_updated(struct vcpu *v)
-{
-    if ( is_hvm_vcpu(v) )
-        hvm_cpuid_policy_changed(v);
-}
-
 void arch_dump_domain_info(struct domain *d)
 {
     paging_dump_domain_info(d);
--- a/xen/arch/x86/domctl.c
+++ b/xen/arch/x86/domctl.c
@@ -89,7 +89,7 @@  static int update_domain_cpu_policy(stru
     recalculate_cpuid_policy(d);
 
     /* Recalculate relevant dom/vcpu state now the policy has changed. */
-    domain_cpu_policy_changed(d);
+    ret = domain_cpu_policy_changed(d);
 
  out:
     /* Free whichever cpuid/msr structs are not installed in struct domain. */
--- a/xen/arch/x86/xstate.c
+++ b/xen/arch/x86/xstate.c
@@ -542,6 +542,41 @@  int xstate_alloc_save_area(struct vcpu *
     return 0;
 }
 
+int xstate_update_save_area(struct vcpu *v)
+{
+    unsigned int i, size, old;
+    struct xsave_struct *save_area;
+    uint64_t xcr0_max = cpuid_policy_xcr0_max(v->domain->arch.cpuid);
+
+    ASSERT(!is_idle_vcpu(v));
+
+    if ( !cpu_has_xsave )
+        return 0;
+
+    if ( v->arch.xcr0_accum & ~xcr0_max )
+        return -EBUSY;
+
+    for ( size = old = XSTATE_AREA_MIN_SIZE, i = 2; i < xstate_features; ++i )
+    {
+        if ( xcr0_max & (1ul << i) )
+            size = max(size, xstate_offsets[i] + xstate_sizes[i]);
+        if ( v->arch.xcr0_accum & (1ul << i) )
+            old = max(old, xstate_offsets[i] + xstate_sizes[i]);
+    }
+
+    save_area = _xvrealloc(v->arch.xsave_area, size, __alignof(*save_area));
+    if ( !save_area )
+        return -ENOMEM;
+
+    ASSERT(old <= size);
+    memset((void *)save_area + old, 0, size - old);
+
+    v->arch.xsave_area = save_area;
+    v->arch.fpu_ctxt = &v->arch.xsave_area->fpu_sse;
+
+    return 0;
+}
+
 void xstate_free_save_area(struct vcpu *v)
 {
     XVFREE(v->arch.xsave_area);
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -78,8 +78,6 @@  void toggle_guest_mode(struct vcpu *);
 /* x86/64: toggle guest page tables between kernel and user modes. */
 void toggle_guest_pt(struct vcpu *);
 
-void cpuid_policy_updated(struct vcpu *v);
-
 /*
  * Initialise a hypercall-transfer page. The given pointer must be mapped
  * in Xen virtual address space (accesses are not validated or checked).
@@ -670,7 +668,7 @@  struct guest_memory_policy
 void update_guest_memory_policy(struct vcpu *v,
                                 struct guest_memory_policy *policy);
 
-void domain_cpu_policy_changed(struct domain *d);
+int __must_check domain_cpu_policy_changed(struct domain *d);
 
 bool update_runstate_area(struct vcpu *);
 bool update_secondary_system_time(struct vcpu *,
--- a/xen/include/asm-x86/xstate.h
+++ b/xen/include/asm-x86/xstate.h
@@ -106,6 +106,7 @@  void compress_xsave_states(struct vcpu *
 /* extended state init and cleanup functions */
 void xstate_free_save_area(struct vcpu *v);
 int xstate_alloc_save_area(struct vcpu *v);
+int xstate_update_save_area(struct vcpu *v);
 void xstate_init(struct cpuinfo_x86 *c);
 unsigned int xstate_ctxt_size(u64 xcr0);