Message ID | 1464714345-26571-1-git-send-email-andrew.cooper3@citrix.com (mailing list archive) |
---|---|
State | New, archived |
>>> On 31.05.16 at 19:05, <andrew.cooper3@citrix.com> wrote:
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

Reviewed-by: Jan Beulich <jbeulich@suse.com>

with one spelling correction:

> +# Overview
> +
> +On native hardware, a kernel will boot, detect features, typically optimise
> +certain codepaths based on the available features, and expect the features to
> +remain available until it shuts down.
> +
> +The same expectation exists for virtual machines, and it is up to the
> +hypervisor/toolstack to fulfil this expectation for the lifetime of the

fulfill

Jan
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Andrew Cooper writes ("[PATCH for-4.7] docs: Feature Levelling feature document"):
> +N.B. `xl`, being inherently a single-host toolstack, doesn't make use of these
> +levelling improvements. These features are of interest to higher level
> +toolstacks such as `libvirt` or `XAPI`.

I don't think this is quite the right spin, IYSWIM. xl does not
currently provide any way to sort this stuff out. But in principle, I
think there would be ways that it could.

I would prefer a wording which was more encouraging to future
improvements. Shall I suggest something ?

Ian.
On 01/06/16 11:25, Ian Jackson wrote:
> Andrew Cooper writes ("[PATCH for-4.7] docs: Feature Levelling feature document"):
>> +N.B. `xl`, being inherently a single-host toolstack, doesn't make use of these
>> +levelling improvements. These features are of interest to higher level
>> +toolstacks such as `libvirt` or `XAPI`.
> I don't think this is quite the right spin, IYSWIM. xl does not
> currently provide any way to sort this stuff out. But in principle, I
> think there would be ways that it could.
>
> I would prefer a wording which was more encouraging to future
> improvements. Shall I suggest something ?

I guess there are two different issues here. (Note: I am specifically
distinguishing `xl` as a toolstack itself, from libxl which is just a
library.)

Simply exposing the levelling/featureset information in `xl info` is
certainly a possible thing to do. Joao has some plans for surfacing the
levelling information in libxl for libvirt to use.

However, without a fundamental redesign of how xl works, it isn't going
to gain multi-host knowledge and consideration during domain creation.

~Andrew
Andrew Cooper writes ("Re: [PATCH for-4.7] docs: Feature Levelling feature document"):
> On 01/06/16 11:25, Ian Jackson wrote:
> > I would prefer a wording which was more encouraging to future
> > improvements. Shall I suggest something ?
>
> I guess there are two different issues here. (Note: I am specifically
> distinguishing `xl` as a toolstack itself, from libxl which is just a
> library.)
>
> Simply exposing the levelling/featureset information in `xl info` is
> certainly a possible thing to do. Joao has some plans for surfacing the
> levelling information in libxl for libvirt to use.

Right.

> However, without a fundamental redesign of how xl works, it isn't going
> to gain multi-host knowledge and consideration during domain creation.

IMO xl ought to have the moving parts necessary to allow an
administrator to:
 1. collect feature information from their hosts;
 2. combine that information into the desired feature set to expose to
    guests;
 3. specify the feature set in their host configuration;
 4. do all of the above conveniently, without seddery.

We should assume that the administrator has available tools like
GNU parallel, ansible, or whatever.

I don't want to design this now but I do want the feature levelling
documentation to welcome suggestions for it, or at least not to seem
to rule it out.

Ian.
On 01/06/16 13:14, Ian Jackson wrote:
> Andrew Cooper writes ("Re: [PATCH for-4.7] docs: Feature Levelling feature document"):
>> On 01/06/16 11:25, Ian Jackson wrote:
>>> I would prefer a wording which was more encouraging to future
>>> improvements. Shall I suggest something ?
>> I guess there are two different issues here. (Note: I am specifically
>> distinguishing `xl` as a toolstack itself, from libxl which is just a
>> library.)
>>
>> Simply exposing the levelling/featureset information in `xl info` is
>> certainly a possible thing to do. Joao has some plans for surfacing the
>> levelling information in libxl for libvirt to use.
> Right.
>
>> However, without a fundamental redesign of how xl works, it isn't going
>> to gain multi-host knowledge and consideration during domain creation.
> IMO xl ought to have the moving parts necessary to allow an
> administrator to: 1. collect feature information from their hosts;
> 2. combine that information into the desired feature set to expose to
> guests; 3. specify the feature set in their host configuration; 4. do
> all of the above conveniently, without seddery.
>
> We should assume that the administrator has available tools like
> GNU parallel, ansible, or whatever.
>
> I don't want to design this now but I do want the feature levelling
> documentation to welcome suggestions for it, or at least not to seem
> to rule it out.

1) is currently available via the `xen-cpuid` binary introduced,
although I intended it more as a developer tool.

Combining is the awkward part, but in the common case, it is just a
bitwise AND of the bitmaps provided by `xen-cpuid`.

3) I don't know what you mean about their host configuration. Do you mean
guest configuration? All of this works in combination with the existing
cpuid= guest configuration.

~Andrew
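A minimal sketch of the common-case combining step mentioned above, where levelling is just a bitwise AND of the per-host featureset bitmaps. The input format (one string of colon-separated 32-bit hex words per host) and all function names are assumptions made for illustration; they are not taken from `xen-cpuid` or libxl, so the parsing would need adapting to whatever the tool actually prints.

```python
from functools import reduce
from itertools import zip_longest


def parse_featureset(text):
    """Parse colon-separated hex words (assumed format) into a list of ints."""
    return [int(word, 16) for word in text.strip().split(":")]


def level_featuresets(featuresets):
    """Return the features common to every host: a word-by-word bitwise AND.

    A host reporting fewer words is treated as having zeros in the missing
    words, so features it does not know about are dropped from the result.
    """
    def and_pair(a, b):
        return [x & y for x, y in zip_longest(a, b, fillvalue=0)]

    return reduce(and_pair, featuresets)


def format_featureset(words):
    """Format a list of 32-bit words back into colon-separated hex."""
    return ":".join(f"{word:08x}" for word in words)
```

Feeding one parsed featureset per host into `level_featuresets` yields the set of features present everywhere. Translating that result into a `cpuid=` guest configuration is a separate step and is not shown here.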
On 01/06/16 10:29, Jan Beulich wrote:
>>>> On 31.05.16 at 19:05, <andrew.cooper3@citrix.com> wrote:
>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> Reviewed-by: Jan Beulich <jbeulich@suse.com>
>
> with one spelling correction:
>
>> +# Overview
>> +
>> +On native hardware, a kernel will boot, detect features, typically optimise
>> +certain codepaths based on the available features, and expect the features to
>> +remain available until it shuts down.
>> +
>> +The same expectation exists for virtual machines, and it is up to the
>> +hypervisor/toolstack to fulfil this expectation for the lifetime of the
> fulfill

That is the American spelling. The English spelling does not have a
double l.

~Andrew
>>> On 03.06.16 at 17:36, <andrew.cooper3@citrix.com> wrote:
> On 01/06/16 10:29, Jan Beulich wrote:
>>>>> On 31.05.16 at 19:05, <andrew.cooper3@citrix.com> wrote:
>>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
>> Reviewed-by: Jan Beulich <jbeulich@suse.com>
>>
>> with one spelling correction:
>>
>>> +# Overview
>>> +
>>> +On native hardware, a kernel will boot, detect features, typically optimise
>>> +certain codepaths based on the available features, and expect the features to
>>> +remain available until it shuts down.
>>> +
>>> +The same expectation exists for virtual machines, and it is up to the
>>> +hypervisor/toolstack to fulfil this expectation for the lifetime of the
>> fulfill
>
> That is the American spelling. The English spelling does not have a
> double l.

Oh, very interesting. I would never have thought of this kind of a
difference between British and American English, the more that
you also write "fill" afaik, not "fil". But - good to know, thanks!

Jan
On 03/06/16 16:42, Jan Beulich wrote:
>>>> On 03.06.16 at 17:36, <andrew.cooper3@citrix.com> wrote:
>> On 01/06/16 10:29, Jan Beulich wrote:
>>>>>> On 31.05.16 at 19:05, <andrew.cooper3@citrix.com> wrote:
>>>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
>>> Reviewed-by: Jan Beulich <jbeulich@suse.com>
>>>
>>> with one spelling correction:
>>>
>>>> +# Overview
>>>> +
>>>> +On native hardware, a kernel will boot, detect features, typically optimise
>>>> +certain codepaths based on the available features, and expect the features to
>>>> +remain available until it shuts down.
>>>> +
>>>> +The same expectation exists for virtual machines, and it is up to the
>>>> +hypervisor/toolstack to fulfil this expectation for the lifetime of the
>>> fulfill
>> That is the American spelling. The English spelling does not have a
>> double l.
> Oh, very interesting. I would never have thought of this kind of a
> difference between British and American English, the more that
> you also write "fill" afaik, not "fil". But - good to know, thanks!

Because English is so well known for its consistency :)

~Andrew
diff --git a/docs/features/feature-levelling.pandoc b/docs/features/feature-levelling.pandoc
new file mode 100644
index 0000000..50bf099
--- /dev/null
+++ b/docs/features/feature-levelling.pandoc
@@ -0,0 +1,211 @@
+% Feature Levelling
+% Draft 1
+
+\clearpage
+
+# Basics
+
+---------------- ----------------------------------------------------
+         Status: **Supported**
+
+   Architecture: x86
+
+      Component: Hypervisor, toolstack, guest
+---------------- ----------------------------------------------------
+
+
+# Overview
+
+On native hardware, a kernel will boot, detect features, typically optimise
+certain codepaths based on the available features, and expect the features to
+remain available until it shuts down.
+
+The same expectation exists for virtual machines, and it is up to the
+hypervisor/toolstack to fulfil this expectation for the lifetime of the
+virtual machine, including across migrate/suspend/resume.
+
+
+# User details
+
+Many factors affect the featureset which a VM may use:
+
+* The CPU itself
+* The BIOS/firmware/microcode version and settings
+* The hypervisor version and command line settings
+* Further restrictions the toolstack chooses to apply
+
+A firmware or software upgrade might reduce the available set of features
+(e.g. Intel disabling TSX in a microcode update for certain Haswell/Broadwell
+processors), as may editing the settings.
+
+It is unsafe to make any assumption about features remaining consistent across
+a host reboot. Xen recalculates all information from scratch each boot, and
+provides the information for the toolstack to consume.
+
+N.B. `xl`, being inherently a single-host toolstack, doesn't make use of these
+levelling improvements. These features are of interest to higher level
+toolstacks such as `libvirt` or `XAPI`.
+
+
+# Technical details
+
+The `CPUID` instruction is used by software to query for features. In the
+virtualisation usecase, guest software should query Xen rather than hardware
+directly. However, `CPUID` is an unprivileged instruction which doesn't
+fault, complicating the task of hiding hardware features from guests.
+
+Important files:
+
+* Hypervisor
+    * `xen/arch/x86/cpu/*.c`
+    * `xen/arch/x86/cpuid.c`
+    * `xen/include/asm-x86/cpuid-autogen.h`
+    * `xen/include/public/arch-x86/cpufeatureset.h`
+    * `xen/tools/gen-cpuid.py`
+* `libxc`
+    * `tools/libxc/xc_cpuid_x86.c`
+
+## Ability to control CPUID
+
+### HVM
+
+HVM guests (using `Intel VT-x` or `AMD SVM`) will unconditionally exit to Xen
+on all `CPUID` instructions, allowing Xen full control over all information.
+
+### PV
+
+The `CPUID` instruction is unprivileged, so executing it in a PV guest will
+not trap, leaving Xen no direct ability to control the information returned.
+
+### Xen Forced Emulation Prefix
+
+Xen-aware PV software can make use of the 'Forced Emulation Prefix'
+
+> `ud2a; .ascii 'xen'; cpuid`
+
+which Xen recognises as a deliberate attempt to get the fully-controlled
+`CPUID` information rather than the hardware-reported information. This only
+works with cooperative software.
+
+### Masking and Override MSRs
+
+AMD CPUs from the `K8` onwards support _Feature Override_ MSRs, which allow
+direct control of the values returned for certain `CPUID` leaves. These MSRs
+allow any result to be returned, including the ability to advertise features
+which are not actually supported.
+
+Intel CPUs between `Nehalem` and `SandyBridge` have differing numbers of
+_Feature Mask_ MSRs, which are a simple AND-mask applied to all `CPUID`
+instructions requesting specific feature bitmap sets. The exact MSRs, and
+which feature bitmap sets they affect are hardware specific. These MSRs allow
+features to be hidden by clearing the appropriate bit in the mask, but does
+not allow unsupported features to be advertised.
+
+### CPUID Faulting
+
+Intel CPUs from `IvyBridge` onwards have _CPUID Faulting_, which allows Xen to
+cause `CPUID` instruction executed in PV guests to fault. This allows Xen
+full control over all information, exactly like HVM guests.
+
+## Compile time
+
+As some features depend on other features, it is important that, when
+disabling a certain feature, we disable all features which depend on it. This
+allows runtime logic to be simplified, by being able to rely on testing only
+the single appropriate feature, rather than the entire feature dependency
+chain.
+
+To speed up runtime calculation of feature dependencies, the dependency chain
+is calculated and flattened by `xen/tools/gen-cpuid.py` to create
+`xen/include/asm-x86/cpuid-autogen.h` from
+`xen/include/public/arch-x86/cpufeatureset.h`, allowing the runtime code to
+disable all dependent features of a specific disabled feature in constant
+time.
+
+## Host boot
+
+As Xen boots, it will enumerate the features it can see. This is stored as
+the _raw\_featureset_.
+
+Errata checks and command line arguments are then taken into account to reduce
+the _raw\_featureset_ into the _host\_featureset_, which is the set of
+features Xen uses. On hardware with masking/override MSRs, the default MSR
+values are picked from the _host\_featureset_.
+
+The _host\_featureset_ is then used to calculate the _pv\_featureset_ and
+_hvm\_featureset_, which are the maximum featuresets Xen is willing to offer
+to PV and HVM guests respectively.
+
+In addition, Xen will calculate how much control it has over non-cooperative
+PV `CPUID` instructions, storing this information as _levelling\_caps_.
+
+## Domain creation
+
+The toolstack can query each of the calculated featureset via
+`XEN_SYSCTL_get_cpu_featureset`, and query for the levelling caps via
+`XEN_SYSCTL_get_cpu_levelling_caps`.
+
+These data should be used by the toolstack when choosing the eventual
+featureset to offer to the guest.
+
+Once a featureset has been chosen, it is set (implicitly or explicitly) via
+`XEN_DOMCTL_set_cpuid`. Xen will clamp the toolstacks choice to the
+appropriate PV or HVM featureset. On hardware with masking/override MSRs, the
+guest cpuid policy is reflected in the MSRs, which are context switched with
+other vcpu state.
+
+# Limitations
+
+A guest which ignores the provided feature information and manually probes for
+features will be able to find some of them. e.g. There is no way of forcibly
+preventing a guest from using 1GB superpages if the hardware supports it.
+
+Some information simply cannot be hidden from guests. There is no way to
+control certain behaviour such as the hardware MXCSR\_MASK or x87 FPU exception
+behaviour.
+
+
+# Testing
+
+Feature levelling is a very wide area, and used all over the hypervisor.
+Please ask on xen-devel for help identifying more specific tests which could
+be of use.
+
+
+# Known issues / Areas for improvement
+
+Xen currently has no concept of per-{socket,core,thread} CPUID information.
+As a result, details such as APIC IDs, topology and cache information do not
+match real hardware, and do not match the documented expectations in the Intel
+and AMD system manuals.
+
+The CPU feature flags are the only information which the toolstack has a
+sensible interface for querying and levelling. Other information in the CPUID
+policy is important and should be levelled (e.g. maxphysaddr).
+
+The CPUID policy is currently regenerated from scratch by the receiving side,
+once memory and vcpu content has been restored. This means that the receiving
+Xen cannot verify the memory/vcpu content against the CPUID policy, and can
+end up running a guest which will subsequently crash. The CPUID policy should
+be at the head of the migration stream.
+
+MSRs are another source of features for guests. There is no general provision
+for controlling the available MSRs. E.g. 64bit versions of Windows notice
+changes in IA32\_MISC\_ENABLE, and suffer a BSOD 0x109 (Critical Structure
+Corruption)
+
+
+# References
+
+[Intel Flexmigration](http://www.intel.co.uk/content/dam/www/public/us/en/documents/application-notes/virtualization-technology-flexmigration-application-note.pdf)
+
+[AMD Extended Migration Technology](http://developer.amd.com/wordpress/media/2012/10/43781-3.00-PUB_Live-Virtual-Machine-Migration-on-AMD-processors.pdf)
+
+
+# History
+
+------------------------------------------------------------------------
+Date       Revision Version  Notes
+---------- -------- -------- -------------------------------------------
+2016-05-31 1        Xen 4.7  Document written
+---------- -------- -------- -------------------------------------------
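The "Compile time" section of the patch describes flattening the feature dependency chain so that disabling one feature removes all of its dependents in a single step. A rough sketch of that idea follows. The dependency table is a small illustrative subset chosen for this example, not a copy of the table in `xen/tools/gen-cpuid.py`, Python sets stand in for the bitmaps Xen uses, and `flatten_deps` and `disable_feature` are hypothetical names.

```python
# Illustrative only: feature -> features that directly depend on it.
DIRECT_DEPS = {
    "XSAVE": ["AVX", "XSAVEOPT"],
    "AVX": ["AVX2", "FMA"],
}


def flatten_deps(direct):
    """Map each feature to every feature that transitively depends on it."""
    deep = {}
    for feature, children in direct.items():
        seen = set()
        stack = list(children)
        while stack:
            dep = stack.pop()
            if dep in seen:
                continue
            seen.add(dep)
            # Follow the chain: dependents of a dependent are also dependents.
            stack.extend(direct.get(dep, ()))
        deep[feature] = seen
    return deep


def disable_feature(featureset, feature, deep):
    """Remove a feature and, using the flattened table, all its dependents."""
    return featureset - {feature} - deep.get(feature, set())
```

Because the table is flattened ahead of time, the disable step never has to walk the chain. With bitmaps instead of sets it becomes a fixed number of word-wise mask operations, which is the constant-time property the document refers to.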
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Ian Jackson <Ian.Jackson@eu.citrix.com>
CC: Wei Liu <wei.liu2@citrix.com>
CC: Jan Beulich <JBeulich@suse.com>
CC: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 docs/features/feature-levelling.pandoc | 211 +++++++++++++++++++++++++++++++++
 1 file changed, 211 insertions(+)
 create mode 100644 docs/features/feature-levelling.pandoc
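Two mechanisms in the feature document come down to simple bit operations. Intel-style feature mask MSRs are an AND-mask that can hide features but not invent them, whereas AMD-style override MSRs return whatever value is programmed. The clamping of a toolstack's requested featureset to the maximum PV or HVM featureset can likewise be modelled as a word-wise AND. The sketch below models these on plain integers. The function names are hypothetical, the clamp assumes both featuresets have the same number of words, and nothing here touches real MSRs or the `XEN_DOMCTL_set_cpuid` interface.

```python
def masked_cpuid(hardware, mask):
    """Intel-style feature mask: bits can only be cleared, never added."""
    return hardware & mask


def overridden_cpuid(hardware, override):
    """AMD-style override: the programmed value is returned as-is, so it
    can advertise bits the hardware does not actually have."""
    return override


def clamp_featureset(requested, maximum):
    """Clamp a requested featureset to the maximum one, word by word.

    Assumes both lists have the same length; any bit not present in
    `maximum` is dropped from the result.
    """
    return [req & lim for req, lim in zip(requested, maximum)]
```

The asymmetry between the first two functions is why levelling with mask MSRs can only ever reduce what a guest sees, while override MSRs leave it to the caller not to advertise unsupported features.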