[for-4.7] docs: Feature Levelling feature document

Message ID 1464714345-26571-1-git-send-email-andrew.cooper3@citrix.com (mailing list archive)
State New, archived

Commit Message

Andrew Cooper May 31, 2016, 5:05 p.m. UTC
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Ian Jackson <Ian.Jackson@eu.citrix.com>
CC: Wei Liu <wei.liu2@citrix.com>
CC: Jan Beulich <JBeulich@suse.com>
CC: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 docs/features/feature-levelling.pandoc | 211 +++++++++++++++++++++++++++++++++
 1 file changed, 211 insertions(+)
 create mode 100644 docs/features/feature-levelling.pandoc

Comments

Jan Beulich June 1, 2016, 9:29 a.m. UTC | #1
>>> On 31.05.16 at 19:05, <andrew.cooper3@citrix.com> wrote:
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

Reviewed-by: Jan Beulich <jbeulich@suse.com>

with one spelling correction:

> +# Overview
> +
> +On native hardware, a kernel will boot, detect features, typically optimise
> +certain codepaths based on the available features, and expect the features to
> +remain available until it shuts down.
> +
> +The same expectation exists for virtual machines, and it is up to the
> +hypervisor/toolstack to fulfil this expectation for the lifetime of the

fulfill

Jan
Wei Liu June 1, 2016, 9:41 a.m. UTC | #2
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Ian Jackson June 1, 2016, 10:25 a.m. UTC | #3
Andrew Cooper writes ("[PATCH for-4.7] docs: Feature Levelling feature document"):
> +N.B. `xl`, being inherently a single-host toolstack, doesn't make use of these
> +levelling improvements.  These features are of interest to higher level
> +toolstacks such as `libvirt` or `XAPI`.

I don't think this is quite the right spin, IYSWIM.  xl does not
currently provide any way to sort this stuff out.  But in principle, I
think there would be ways that it could.

I would prefer a wording which was more encouraging to future
improvements.  Shall I suggest something ?

Ian.
Andrew Cooper June 1, 2016, 12:05 p.m. UTC | #4
On 01/06/16 11:25, Ian Jackson wrote:
> Andrew Cooper writes ("[PATCH for-4.7] docs: Feature Levelling feature document"):
>> +N.B. `xl`, being inherently a single-host toolstack, doesn't make use of these
>> +levelling improvements.  These features are of interest to higher level
>> +toolstacks such as `libvirt` or `XAPI`.
> I don't think this is quite the right spin, IYSWIM.  xl does not
> currently provide any way to sort this stuff out.  But in principle, I
> think there would be ways that it could.
>
> I would prefer a wording which was more encouraging to future
> improvements.  Shall I suggest something ?

I guess there are two different issues here.  (Note: I am specifically
distinguishing `xl` as a toolstack itself, from libxl which is just a
library.)

Simply exposing the levelling/featureset information in `xl info` is
certainly a possible thing to do.  Joao has some plans for surfacing the
levelling information in libxl for libvirt to use.

However, without a fundamental redesign of how xl works, it isn't going
to gain multi-host knowledge and consideration during domain creation.

~Andrew
Ian Jackson June 1, 2016, 12:14 p.m. UTC | #5
Andrew Cooper writes ("Re: [PATCH for-4.7] docs: Feature Levelling feature document"):
> On 01/06/16 11:25, Ian Jackson wrote:
> > I would prefer a wording which was more encouraging to future
> > improvements.  Shall I suggest something ?
> 
> I guess there are two different issues here.  (Note: I am specifically
> distinguishing `xl` as a toolstack itself, from libxl which is just a
> library.)
> 
> Simply exposing the levelling/featureset information in `xl info` is
> certainly a possible thing to do.  Joao has some plans for surfacing the
> levelling information in libxl for libvirt to use.

Right.

> However, without a fundamental redesign of how xl works, it isn't going
> to gain multi-host knowledge and consideration during domain creation.

IMO xl ought to have the moving parts necessary to allow an
administrator to: 1. collect feature information from their hosts;
2. combine that information into the desired feature set to expose to
guests; 3. specify the feature set in their host configuration; 4. do
all of the above conveniently, without seddery.

We should assume that the administrator has available tools like
GNU parallel, ansible, or whatever.

I don't want to design this now but I do want the feature levelling
documentation to welcome suggestions for it, or at least not to seem
to rule it out.

Ian.
Andrew Cooper June 1, 2016, 1:11 p.m. UTC | #6
On 01/06/16 13:14, Ian Jackson wrote:
> Andrew Cooper writes ("Re: [PATCH for-4.7] docs: Feature Levelling feature document"):
>> On 01/06/16 11:25, Ian Jackson wrote:
>>> I would prefer a wording which was more encouraging to future
>>> improvements.  Shall I suggest something ?
>> I guess there are two different issues here.  (Note: I am specifically
>> distinguishing `xl` as a toolstack itself, from libxl which is just a
>> library.)
>>
>> Simply exposing the levelling/featureset information in `xl info` is
>> certainly a possible thing to do.  Joao has some plans for surfacing the
>> levelling information in libxl for libvirt to use.
> Right.
>
>> However, without a fundamental redesign of how xl works, it isn't going
>> to gain multi-host knowledge and consideration during domain creation.
> IMO xl ought to have the moving parts necessary to allow an
> administrator to: 1. collect feature information from their hosts;
> 2. combine that information into the desired feature set to expose to
> guests; 3. specify the feature set in their host configuration; 4. do
> all of the above conveniently, without seddery.
>
> We should assume that the administrator has available tools like
> GNU parallel, ansible, or whatever.
>
> I don't want to design this now but I do want the feature levelling
> documentation to welcome suggestions for it, or at least not to seem
> to rule it out.

1) is currently available via the `xen-cpuid` binary introduced,
although I intended it more as a developer tool.

Combining is the awkward part, but in the common case, it is just a
bitwise AND of the bitmaps provided by `xen-cpuid`.
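
As a rough sketch of that common case (not part of the patch): assume each
host's featureset has already been collected as a list of 32-bit words.
Parsing the `xen-cpuid` output is not shown, and the function name and the
word values below are made up for illustration.

```python
from functools import reduce


def level_featuresets(featuresets):
    """Return the common featureset: the bitwise AND of every host's words."""
    if not featuresets:
        raise ValueError("need at least one featureset")
    width = max(len(fs) for fs in featuresets)
    # Assumption: a host reporting fewer words has none of the features
    # described by the missing words, so pad with zeroes.
    padded = [list(fs) + [0] * (width - len(fs)) for fs in featuresets]
    return [reduce(lambda a, b: a & b, words) for words in zip(*padded)]


# Hypothetical data: two hosts, three 32-bit feature words each.
host_a = [0x178BFBFF, 0x80B82201, 0x00000003]
host_b = [0x178BFBFF, 0x00982201, 0x00000001]

common = level_featuresets([host_a, host_b])
print(" ".join(f"{word:08x}" for word in common))
```

A feature bit survives only if every host sets it, which is why the result
is safe to offer to a guest that may migrate between any of those hosts.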

3) I don't know what you mean about their host configuration.  Do you
mean guest configuration?

All of this works in combination with the existing cpuid= guest
configuration.

~Andrew
Andrew Cooper June 3, 2016, 3:36 p.m. UTC | #7
On 01/06/16 10:29, Jan Beulich wrote:
>>>> On 31.05.16 at 19:05, <andrew.cooper3@citrix.com> wrote:
>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> Reviewed-by: Jan Beulich <jbeulich@suse.com>
>
> with one spelling correction:
>
>> +# Overview
>> +
>> +On native hardware, a kernel will boot, detect features, typically optimise
>> +certain codepaths based on the available features, and expect the features to
>> +remain available until it shuts down.
>> +
>> +The same expectation exists for virtual machines, and it is up to the
>> +hypervisor/toolstack to fulfil this expectation for the lifetime of the
> fulfill

That is the American spelling.  The English spelling does not have a
double l.

~Andrew
Jan Beulich June 3, 2016, 3:42 p.m. UTC | #8
>>> On 03.06.16 at 17:36, <andrew.cooper3@citrix.com> wrote:
> On 01/06/16 10:29, Jan Beulich wrote:
>>>>> On 31.05.16 at 19:05, <andrew.cooper3@citrix.com> wrote:
>>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
>> Reviewed-by: Jan Beulich <jbeulich@suse.com>
>>
>> with one spelling correction:
>>
>>> +# Overview
>>> +
>>> +On native hardware, a kernel will boot, detect features, typically optimise
>>> +certain codepaths based on the available features, and expect the features to
>>> +remain available until it shuts down.
>>> +
>>> +The same expectation exists for virtual machines, and it is up to the
>>> +hypervisor/toolstack to fulfil this expectation for the lifetime of the
>> fulfill
> 
> That is the American spelling.  The English spelling does not have a
> double l.

Oh, very interesting. I would never have thought of this kind of a
difference between British and American English, all the more since
you also write "fill" afaik, not "fil". But - good to know, thanks!

Jan
Andrew Cooper June 3, 2016, 3:53 p.m. UTC | #9
On 03/06/16 16:42, Jan Beulich wrote:
>>>> On 03.06.16 at 17:36, <andrew.cooper3@citrix.com> wrote:
>> On 01/06/16 10:29, Jan Beulich wrote:
>>>>>> On 31.05.16 at 19:05, <andrew.cooper3@citrix.com> wrote:
>>>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
>>> Reviewed-by: Jan Beulich <jbeulich@suse.com>
>>>
>>> with one spelling correction:
>>>
>>>> +# Overview
>>>> +
>>>> +On native hardware, a kernel will boot, detect features, typically optimise
>>>> +certain codepaths based on the available features, and expect the features to
>>>> +remain available until it shuts down.
>>>> +
>>>> +The same expectation exists for virtual machines, and it is up to the
>>>> +hypervisor/toolstack to fulfil this expectation for the lifetime of the
>>> fulfill
>> That is the American spelling.  The English spelling does not have a
>> double l.
> Oh, very interesting. I would never have thought of this kind of a
> difference between British and American English, all the more since
> you also write "fill" afaik, not "fil". But - good to know, thanks!

Because English is so well known for its consistency :)

~Andrew

Patch

diff --git a/docs/features/feature-levelling.pandoc b/docs/features/feature-levelling.pandoc
new file mode 100644
index 0000000..50bf099
--- /dev/null
+++ b/docs/features/feature-levelling.pandoc
@@ -0,0 +1,211 @@ 
+% Feature Levelling
+% Draft 1
+
+\clearpage
+
+# Basics
+
+---------------- ----------------------------------------------------
+         Status: **Supported**
+
+   Architecture: x86
+
+      Component: Hypervisor, toolstack, guest
+---------------- ----------------------------------------------------
+
+
+# Overview
+
+On native hardware, a kernel will boot, detect features, typically optimise
+certain codepaths based on the available features, and expect the features to
+remain available until it shuts down.
+
+The same expectation exists for virtual machines, and it is up to the
+hypervisor/toolstack to fulfil this expectation for the lifetime of the
+virtual machine, including across migrate/suspend/resume.
+
+
+# User details
+
+Many factors affect the featureset which a VM may use:
+
+* The CPU itself
+* The BIOS/firmware/microcode version and settings
+* The hypervisor version and command line settings
+* Further restrictions the toolstack chooses to apply
+
+A firmware or software upgrade might reduce the available set of features
+(e.g. Intel disabling TSX in a microcode update for certain Haswell/Broadwell
+processors), as may editing the settings.
+
+It is unsafe to make any assumption about features remaining consistent across
+a host reboot.  Xen recalculates all information from scratch each boot, and
+provides the information for the toolstack to consume.
+
+N.B. `xl`, being inherently a single-host toolstack, doesn't make use of these
+levelling improvements.  These features are of interest to higher level
+toolstacks such as `libvirt` or `XAPI`.
+
+
+# Technical details
+
+The `CPUID` instruction is used by software to query for features.  In the
+virtualisation use case, guest software should query Xen rather than hardware
+directly.  However, `CPUID` is an unprivileged instruction which doesn't
+fault, complicating the task of hiding hardware features from guests.
+
+Important files:
+
+* Hypervisor
+    * `xen/arch/x86/cpu/*.c`
+    * `xen/arch/x86/cpuid.c`
+    * `xen/include/asm-x86/cpuid-autogen.h`
+    * `xen/include/public/arch-x86/cpufeatureset.h`
+    * `xen/tools/gen-cpuid.py`
+* `libxc`
+    * `tools/libxc/xc_cpuid_x86.c`
+
+## Ability to control CPUID
+
+### HVM
+
+HVM guests (using `Intel VT-x` or `AMD SVM`) will unconditionally exit to Xen
+on all `CPUID` instructions, allowing Xen full control over all information.
+
+### PV
+
+The `CPUID` instruction is unprivileged, so executing it in a PV guest will
+not trap, leaving Xen no direct ability to control the information returned.
+
+### Xen Forced Emulation Prefix
+
+Xen-aware PV software can make use of the 'Forced Emulation Prefix'
+
+> `ud2a; .ascii 'xen'; cpuid`
+
+which Xen recognises as a deliberate attempt to get the fully-controlled
+`CPUID` information rather than the hardware-reported information.  This only
+works with cooperative software.
+
+### Masking and Override MSRs
+
+AMD CPUs from the `K8` onwards support _Feature Override_ MSRs, which allow
+direct control of the values returned for certain `CPUID` leaves.  These MSRs
+allow any result to be returned, including the ability to advertise features
+which are not actually supported.
+
+Intel CPUs between `Nehalem` and `SandyBridge` have differing numbers of
+_Feature Mask_ MSRs, which are a simple AND-mask applied to all `CPUID`
+instructions requesting specific feature bitmap sets.  The exact MSRs, and
+which feature bitmap sets they affect, are hardware specific.  These MSRs allow
+features to be hidden by clearing the appropriate bit in the mask, but do
+not allow unsupported features to be advertised.
+
+### CPUID Faulting
+
+Intel CPUs from `IvyBridge` onwards have _CPUID Faulting_, which allows Xen to
+cause `CPUID` instructions executed in PV guests to fault.  This allows Xen
+full control over all information, exactly like HVM guests.
+
+## Compile time
+
+As some features depend on other features, it is important that, when
+disabling a certain feature, we disable all features which depend on it.  This
+allows runtime logic to be simplified, by being able to rely on testing only
+the single appropriate feature, rather than the entire feature dependency
+chain.
+
+To speed up runtime calculation of feature dependencies, the dependency chain
+is calculated and flattened by `xen/tools/gen-cpuid.py` to create
+`xen/include/asm-x86/cpuid-autogen.h` from
+`xen/include/public/arch-x86/cpufeatureset.h`, allowing the runtime code to
+disable all dependent features of a specific disabled feature in constant
+time.
+
+## Host boot
+
+As Xen boots, it will enumerate the features it can see.  This is stored as
+the _raw\_featureset_.
+
+Errata checks and command line arguments are then taken into account to reduce
+the _raw\_featureset_ into the _host\_featureset_, which is the set of
+features Xen uses.  On hardware with masking/override MSRs, the default MSR
+values are picked from the _host\_featureset_.
+
+The _host\_featureset_ is then used to calculate the _pv\_featureset_ and
+_hvm\_featureset_, which are the maximum featuresets Xen is willing to offer
+to PV and HVM guests respectively.
+
+In addition, Xen will calculate how much control it has over non-cooperative
+PV `CPUID` instructions, storing this information as _levelling\_caps_.
+
+## Domain creation
+
+The toolstack can query each of the calculated featuresets via
+`XEN_SYSCTL_get_cpu_featureset`, and query for the levelling caps via
+`XEN_SYSCTL_get_cpu_levelling_caps`.
+
+These data should be used by the toolstack when choosing the eventual
+featureset to offer to the guest.
+
+Once a featureset has been chosen, it is set (implicitly or explicitly) via
+`XEN_DOMCTL_set_cpuid`.  Xen will clamp the toolstack's choice to the
+appropriate PV or HVM featureset.  On hardware with masking/override MSRs, the
+guest cpuid policy is reflected in the MSRs, which are context switched with
+other vcpu state.
+
+# Limitations
+
+A guest which ignores the provided feature information and manually probes for
+features will be able to find some of them.  For example, there is no way of
+forcibly preventing a guest from using 1GB superpages if the hardware supports it.
+
+Some information simply cannot be hidden from guests.  There is no way to
+control certain behaviour such as the hardware MXCSR\_MASK or x87 FPU exception
+behaviour.
+
+
+# Testing
+
+Feature levelling is a very wide area, and is used all over the hypervisor.
+Please ask on xen-devel for help identifying more specific tests which could
+be of use.
+
+
+# Known issues / Areas for improvement
+
+Xen currently has no concept of per-{socket,core,thread} CPUID information.
+As a result, details such as APIC IDs, topology and cache information do not
+match real hardware, and do not match the documented expectations in the Intel
+and AMD system manuals.
+
+The CPU feature flags are the only information which the toolstack has a
+sensible interface for querying and levelling.  Other information in the CPUID
+policy is important and should be levelled (e.g. maxphysaddr).
+
+The CPUID policy is currently regenerated from scratch by the receiving side,
+once memory and vcpu content has been restored.  This means that the receiving
+Xen cannot verify the memory/vcpu content against the CPUID policy, and can
+end up running a guest which will subsequently crash.  The CPUID policy should
+be at the head of the migration stream.
+
+MSRs are another source of features for guests.  There is no general provision
+for controlling the available MSRs.  For example, 64-bit versions of Windows
+notice changes in IA32\_MISC\_ENABLE, and suffer a BSOD 0x109 (Critical
+Structure Corruption).
+
+
+# References
+
+[Intel Flexmigration](http://www.intel.co.uk/content/dam/www/public/us/en/documents/application-notes/virtualization-technology-flexmigration-application-note.pdf)
+
+[AMD Extended Migration Technology](http://developer.amd.com/wordpress/media/2012/10/43781-3.00-PUB_Live-Virtual-Machine-Migration-on-AMD-processors.pdf)
+
+
+# History
+
+------------------------------------------------------------------------
+Date       Revision Version  Notes
+---------- -------- -------- -------------------------------------------
+2016-05-31 1        Xen 4.7  Document written
+---------- -------- -------- -------------------------------------------
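
For illustration only, the flattening described under "Compile time" in the
patch can be sketched in Python as follows.  The feature names, bit
positions and dependency table here are hypothetical and are not taken from
`gen-cpuid.py` or `cpufeatureset.h`.  The sketch only shows how a
precomputed transitive closure lets every dependent of a disabled feature be
cleared with a single mask.

```python
# Hypothetical feature bits and direct dependencies (feature -> features that
# depend directly on it).  Illustrative only; these are not Xen's real tables.
FEATURES = {"XSAVE": 0, "AVX": 1, "AVX2": 2, "FMA": 3, "SSE": 4}
DIRECT_DEPS = {
    "XSAVE": ["AVX"],
    "AVX": ["AVX2", "FMA"],
}


def flatten(direct):
    """Build time: map each feature to a bitmask of all transitive dependents."""

    def dependents(feature, seen):
        for child in direct.get(feature, []):
            if child not in seen:
                seen.add(child)
                dependents(child, seen)
        return seen

    flat = {}
    for feature in direct:
        mask = 0
        for dep in dependents(feature, set()):
            mask |= 1 << FEATURES[dep]
        flat[feature] = mask
    return flat


DEEP_DEPS = flatten(DIRECT_DEPS)


def disable(featureset, feature):
    """Runtime: clear a feature and all of its dependents with one lookup."""
    mask = (1 << FEATURES[feature]) | DEEP_DEPS.get(feature, 0)
    return featureset & ~mask


everything = (1 << len(FEATURES)) - 1
print(f"{disable(everything, 'XSAVE'):05b}")
```

The recursive walk happens once, ahead of time; the runtime path is a single
table lookup and an AND, regardless of how deep the dependency chain is.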