[RFC,v2] Add SUPPORT.md

Message ID	20170911170159.3083-1-george.dunlap@citrix.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <xen-devel-bounces@lists.xen.org> From: George Dunlap <george.dunlap@citrix.com> To: <xen-devel@lists.xenproject.org> Date: Mon, 11 Sep 2017 18:01:59 +0100 Message-ID: <20170911170159.3083-1-george.dunlap@citrix.com> MIME-Version: 1.0 Cc: Stefano Stabellini <sstabellini@kernel.org>, Wei Liu <wei.liu2@citrix.com>, Andrew Cooper <andrew.cooper3@citrix.com>, Dario Faggioli <dario.faggioli@citrix.com>, Tim Deegan <tim@xen.org>, George Dunlap <george.dunlap@citrix.com>, Julien Grall <julien.grall@arm.com>, Paul Durrant <paul.durrant@citrix.com>, Jan Beulich <jbeulich@suse.com>, Tamas K Lengyel <tamas.lengyel@zentific.com>, Anthony Perard <anthony.perard@citrix.com>, Ian Jackson <ian.jackson@citrix.com>, Roger Pau Monne <roger.pau@citrix.com> Subject: [Xen-devel] [PATCH RFC v2] Add SUPPORT.md Precedence: list Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: base64 Errors-To: xen-devel-bounces@lists.xen.org Sender: "Xen-devel" <xen-devel-bounces@lists.xen.org>

George Dunlap Sept. 11, 2017, 5:01 p.m. UTC

Add a machine-readable file to describe what features are in what
state of being 'supported', as well as information about how long this
release will be supported, and so on.

The document should be formatted using "semantic newlines" [1], to make
changes easier.

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>

[1] http://rhodesmill.org/brandon/2012/one-sentence-per-line/
---

Sorry, I wrote a 'changes since v1' but managed to lose it.  I'll
reply to this mail tomorrow with a list of changes.

CC: Ian Jackson <ian.jackson@citrix.com>
CC: Wei Liu <wei.liu2@citrix.com>
CC: Andrew Cooper <andrew.cooper3@citrix.com>
CC: Jan Beulich <jbeulich@suse.com>
CC: Tim Deegan <tim@xen.org>
CC: Dario Faggioli <dario.faggioli@citrix.com>
CC: Tamas K Lengyel <tamas.lengyel@zentific.com>
CC: Roger Pau Monne <roger.pau@citrix.com>
CC: Stefano Stabellini <sstabellini@kernel.org>
CC: Anthony Perard <anthony.perard@citrix.com>
CC: Paul Durrant <paul.durrant@citrix.com>
CC: Konrad Wilk <konrad.wilk@oracle.com>
CC: Julien Grall <julien.grall@arm.com>
---
 SUPPORT.md | 821 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 821 insertions(+)
 create mode 100644 SUPPORT.md

Andrew Cooper Sept. 11, 2017, 5:53 p.m. UTC | #1

On 11/09/17 18:01, George Dunlap wrote:
> +### x86/PV
> +
> +    Status: Supported
> +
> +Traditional Xen Project PV guest

What's a "Xen Project" PV guest?  Just Xen here.

Also, a perhaps a statement of "No hardware requirements" ?

> +### x86/RAM
> +
> +    Limit, x86: 16TiB
> +    Limit, ARM32: 16GiB
> +    Limit, ARM64: 5TiB
> +
> +[XXX: Andy to suggest what this should say for x86]

The limit for x86 is either 16TiB or 123TiB, depending on
CONFIG_BIGMEM.  CONFIG_BIGMEM is exposed via menuconfig without
XEN_CONFIG_EXPERT, so falls into at least some kind of support statement.

As for practical limits, I don't think its reasonable to claim anything
which we can't test.  What are the specs in the MA colo?

> +
> +## Limits/Guest
> +
> +### Virtual CPUs
> +
> +    Limit, x86 PV: 512

Where did this number come from?  The actual limit as enforced in Xen is
8192, and it has been like that for a very long time (i.e. the 3.x days)

[root@fusebot ~]# python
Python 2.7.5 (default, Nov 20 2015, 02:00:19)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from xen.lowlevel.xc import xc as XC
>>> xc = XC()
>>> xc.domain_create()
1
>>> xc.domain_max_vcpus(1, 8192)
0
>>> xc.domain_create()
2
>>> xc.domain_max_vcpus(2, 8193)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
xen.lowlevel.xc.Error: (22, 'Invalid argument')

Trying to shut such a domain down however does tickle a host watchdog
timeout as the for_each_vcpu() loops in domain_kill() are very long.

> +    Limit, x86 HVM: 128
> +    Limit, ARM32: 8
> +    Limit, ARM64: 128
> +
> +[XXX Andrew Cooper: Do want to add "Limit-Security" here for some of these?]

32 for each.  64 vcpu HVM guests can excerpt enough p2m lock pressure to
trigger a 5 second host watchdog timeout.

> +
> +### Virtual RAM
> +
> +    Limit, x86 PV: >1TB
> +    Limit, x86 HVM: 1TB
> +    Limit, ARM32: 16GiB
> +    Limit, ARM64: 1TB

There is no specific upper bound on the size of PV or HVM guests that I
am aware of.  1.5TB HVM domains definitely work, because that's what we
test and support in XenServer.

> +
> +### x86 PV/Event Channels
> +
> +    Limit: 131072

Why do we call out event channel limits but not grant table limits? 
Also, why is this x86?  The 2l and fifo ABIs are arch agnostic, as far
as I am aware.

> +## High Availability and Fault Tolerance
> +
> +### Live Migration, Save & Restore
> +
> +    Status, x86: Supported

With caveats.  From docs/features/migration.pandoc

* x86 HVM guest physmap operations (not reflected in logdirty bitmap)
* x86 HVM with PoD pages (attempts to map cause PoD allocations)
* x86 HVM with nested-virt (no relevant information included in the stream)
* x86 PV ballooning (P2M marked dirty, target frame not marked)
* x86 PV P2M structure changes (not noticed, stale mappings used) for
  guests not using the linear p2m layout

Also, features such as vNUMA and nested virt (which are two I know for
certain) have all state discarded on the source side, because they were
never suitably plumbed in.

~Andrew

Jürgen Groß Sept. 12, 2017, 5:09 a.m. UTC | #2

On 11/09/17 19:01, George Dunlap wrote:
> Add a machine-readable file to describe what features are in what
> state of being 'supported', as well as information about how long this
> release will be supported, and so on.
> 
> The document should be formatted using "semantic newlines" [1], to make
> changes easier.
> 
> Signed-off-by: Ian Jackson <ian.jackson@citrix.com>
> Signed-off-by: George Dunlap <george.dunlap@citrix.com>
> 
> [1] http://rhodesmill.org/brandon/2012/one-sentence-per-line/
> ---
> 
> Sorry, I wrote a 'changes since v1' but managed to lose it.  I'll
> reply to this mail tomorrow with a list of changes.
> 
> CC: Ian Jackson <ian.jackson@citrix.com>
> CC: Wei Liu <wei.liu2@citrix.com>
> CC: Andrew Cooper <andrew.cooper3@citrix.com>
> CC: Jan Beulich <jbeulich@suse.com>
> CC: Tim Deegan <tim@xen.org>
> CC: Dario Faggioli <dario.faggioli@citrix.com>
> CC: Tamas K Lengyel <tamas.lengyel@zentific.com>
> CC: Roger Pau Monne <roger.pau@citrix.com>
> CC: Stefano Stabellini <sstabellini@kernel.org>
> CC: Anthony Perard <anthony.perard@citrix.com>
> CC: Paul Durrant <paul.durrant@citrix.com>
> CC: Konrad Wilk <konrad.wilk@oracle.com>
> CC: Julien Grall <julien.grall@arm.com>
> ---
>  SUPPORT.md | 821 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 821 insertions(+)
>  create mode 100644 SUPPORT.md
> 
> diff --git a/SUPPORT.md b/SUPPORT.md
> new file mode 100644
> index 0000000000..e30664feca
> --- /dev/null
> +++ b/SUPPORT.md
> @@ -0,0 +1,821 @@
> +# Support statement for this release
> +
> +This document describes the support status and in particular the
> +security support status of the Xen branch within which you find it.
> +
> +See the bottom of the file for the definitions of the support status
> +levels etc.
> +
> +# Release Support
> +
> +    Xen-Version: 4.10-unstable
> +    Initial-Release: n/a
> +    Supported-Until: TBD
> +    Security-Support-Until: Unreleased - not yet security-supported
> +
> +# Feature Support
> +

> +### Virtual RAM
> +
> +    Limit, x86 PV: >1TB

2047GB PV guests have been tested to work, including live migration.
Tests with larger guests are just ongoing (needed my live migration
patch which is upstream now).


Juergen

Jan Beulich Sept. 12, 2017, 9:48 a.m. UTC | #3

>>> On 11.09.17 at 19:53, <andrew.cooper3@citrix.com> wrote:
> As for practical limits, I don't think its reasonable to claim anything
> which we can't test.  What are the specs in the MA colo?

I don't think the MA colo's limits ought to be the only ones applicable
here, and it looks like you think this way too:

>> +### Virtual RAM
>> +
>> +    Limit, x86 PV: >1TB
>> +    Limit, x86 HVM: 1TB
>> +    Limit, ARM32: 16GiB
>> +    Limit, ARM64: 1TB
> 
> There is no specific upper bound on the size of PV or HVM guests that I
> am aware of.  1.5TB HVM domains definitely work, because that's what we
> test and support in XenServer.

I'm pretty sure the MA colo can't create 1.5Tb guests, yet them
being tested and supported by XenServer should generally suffice
for upstream to also consider them supported. Same would then
go for other distros testing and supporting certain larger limits
(without extra patches to enable that).

Jan

Wei Liu Sept. 12, 2017, 9:49 a.m. UTC | #4

On Mon, Sep 11, 2017 at 06:53:55PM +0100, Andrew Cooper wrote:
> On 11/09/17 18:01, George Dunlap wrote:
> > +### x86/PV
> > +
> > +    Status: Supported
> > +
> > +Traditional Xen Project PV guest
> 
> What's a "Xen Project" PV guest?  Just Xen here.
> 
> Also, a perhaps a statement of "No hardware requirements" ?
> 
> > +### x86/RAM
> > +
> > +    Limit, x86: 16TiB
> > +    Limit, ARM32: 16GiB
> > +    Limit, ARM64: 5TiB
> > +
> > +[XXX: Andy to suggest what this should say for x86]
> 
> The limit for x86 is either 16TiB or 123TiB, depending on
> CONFIG_BIGMEM.  CONFIG_BIGMEM is exposed via menuconfig without
> XEN_CONFIG_EXPERT, so falls into at least some kind of support statement.
> 
> As for practical limits, I don't think its reasonable to claim anything
> which we can't test.  What are the specs in the MA colo?

Nowhere near the TB range.

I think it would be okay for downstream like XenServer and/or OVM to
provide some numbers.

Roger Pau Monne Sept. 12, 2017, 10:39 a.m. UTC | #5

On Mon, Sep 11, 2017 at 06:01:59PM +0100, George Dunlap wrote:
> +## Toolstack
> +
> +### xl
> +
> +    Status: Supported
> +
> +### Direct-boot kernel image format
> +
> +    Supported, x86: bzImage

ELF

> +    Supported, ARM32: zImage
> +    Supported, ARM64: Image
> +
> +Format which the toolstack accept for direct-boot kernels

IMHO it would be good to provide references to the specs, for ELF that
should be:

http://refspecs.linuxbase.org/elf/elf.pdf

> +### Qemu based disk backend (qdisk) for xl
> +
> +    Status: Supported
> +
> +### Open vSwitch integration for xl
> +
> +    Status: Supported

Status, Linux: Supported

I haven't played with vswitch on FreeBSD at all.

> +
> +### systemd support for xl
> +
> +    Status: Supported
> +
> +### JSON output support for xl
> +
> +    Status: Experimental
> +
> +Output of information in machine-parseable JSON format
> +
> +### AHCI support for xl
> +
> +    Status, x86: Supported
> +
> +### ACPI guest
> +
> +    Status, x86 HVM: Supported
> +    Status, ARM: Tech Preview

status, x86 PVH: Tech preview

> +
> +### PVUSB support for xl
> +
> +    Status: Supported
> +
> +### HVM USB passthrough for xl
> +
> +    Status, x86: Supported
> +
> +### QEMU backend hotplugging for xl
> +
> +    Status: Supported

What's this exactly? Is it referring to hot-adding PV disk and nics?
If so it shouldn't specifically reference xl, the same can be done
with blkback or netback for example.

> +### Virtual cpu hotplug
> +
> +    Status: Supported
> +
> +## Toolstack/3rd party
> +
> +### libvirt driver for xl
> +
> +    Status: Supported, Security support external
> +
> +## Debugging, analysis, and crash post-mortem
> +
> +### gdbsx
> +
> +    Status, x86: Supported
> +
> +Debugger to debug ELF guests
> +
> +### Guest serial sonsole
> +
> +    Status: Supported
> +
> +Logs key hypervisor and Dom0 kernel events to a file
> +
> +### Soft-reset for PV guests
> +
> +    Status: Supported
> +	
> +Soft-reset allows a new kernel to start 'from scratch' with a fresh VM state, 
> +but with all the memory from the previous state of the VM intact.
> +This is primarily designed to allow "crash kernels", 
> +which can do core dumps of memory to help with debugging in the event of a crash.
> +
> +### xentrace
> +
> +    Status, x86: Supported
> +
> +Tool to capture Xen trace buffer data
> +
> +### gcov
> +
> +    Status: Supported, Not security supported
> +
> +Export hypervisor coverage data suitable for analysis by gcov or lcov.
> +
> +## Memory Management
> +
> +### Memory Ballooning
> +
> +    Status: Supported
> +
> +### Memory Sharing
> +
> +    Status, x86 HVM: Tech Preview
> +    Status, ARM: Tech Preview
> +
> +Allow sharing of identical pages between guests
> +
> +### Memory Paging
> +
> +    Status, x86 HVM: Experimenal
> +
> +Allow pages belonging to guests to be paged to disk
> +
> +### Transcendent Memory
> +
> +    Status: Experimental
> +
> +[XXX Add description]
> +
> +### Alternative p2m
> +
> +    Status, x86 HVM: Tech Preview
> +    Status, ARM: Tech Preview
> +
> +Allows external monitoring of hypervisor memory
> +by maintaining multiple physical to machine (p2m) memory mappings.
> +
> +## Resource Management
> +
> +### CPU Pools
> +
> +    Status: Supported
> +
> +Groups physical cpus into distinct groups called "cpupools",
> +with each pool having the capability of using different schedulers and scheduling properties.
> +
> +### Credit Scheduler
> +
> +    Status: Supported
> +
> +The default scheduler, which is a weighted proportional fair share virtual CPU scheduler.
> +
> +### Credit2 Scheduler
> +
> +    Status: Supported
> +
> +Credit2 is a general purpose scheduler for Xen,
> +designed with particular focus on fairness, responsiveness and scalability
> +
> +### RTDS based Scheduler
> +
> +    Status: Experimental
> +
> +A soft real-time CPU scheduler built to provide guaranteed CPU capacity to guest VMs on SMP hosts
> +
> +### ARINC653 Scheduler
> +
> +    Status: Supported, Not security supported
> +
> +A periodically repeating fixed timeslice scheduler. Multicore support is not yet implemented.
> +
> +### Null Scheduler
> +
> +    Status: Experimental
> +
> +A very simple, very static scheduling policy 
> +that always schedules the same vCPU(s) on the same pCPU(s). 
> +It is designed for maximum determinism and minimum overhead
> +on embedded platforms.
> +
> +### Numa scheduler affinity
> +
> +    Status, x86: Supported
> +
> +Enables Numa aware scheduling in Xen
> +
> +## Scalability
> +
> +### 1GB/2MB super page support
> +
> +    Status: Supported

This needs something like:

Status, x86 HVM/PVH: Supported

IIRC on ARM page sizes are different (64K?)

> +
> +### x86/PV-on-HVM
> +
> +    Status: Supported
> +
> +This is a useful label for a set of hypervisor features
> +which add paravirtualized functionality to HVM guests 
> +for improved performance and scalability.  
> +This includes exposing event channels to HVM guests.
> +
> +### x86/Deliver events to PVHVM guests using Xen event channels
> +
> +    Status: Supported

I think this should be labeled as "x86/HVM deliver guest events using
event channels", and the x86/PV-on-HVM section removed.

> +
> +## High Availability and Fault Tolerance
> +
> +### Live Migration, Save & Restore
> +
> +    Status, x86: Supported
> +
> +### Remus Fault Tolerance
> +
> +    Status: Experimental
> +
> +### COLO Manager
> +
> +    Status: Experimental
> +
> +### x86/vMCE
> +
> +    Status: Supported
> +
> +Forward Machine Check Exceptions to Appropriate guests
> +
> +## Virtual driver support, guest side
> +
> +[XXX Consider adding 'frontend' and 'backend' to the titles in these two sections to make it clearer]
> +
> +### Blkfront
> +
> +    Status, Linux: Supported
> +    Status, FreeBSD: Supported, Security support external
> +    Status, Windows: Supported

Status, NetBSD: Supported, Security support external

> +
> +Guest-side driver capable of speaking the Xen PV block protocol
> +
> +### Netfront
> +
> +    Status, Linux: Supported
> +    States, Windows: Supported
> +    Status, FreeBSD: Supported, Security support external
> +    Status, NetBSD: Supported, Security support external
> +    Status, OpenBSD: Supported, Security support external
> +
> +Guest-side driver capable of speaking the Xen PV networking protocol
> +
> +### Xen Framebuffer
> +
> +    Status, Linux (xen-fbfront): Supported
> +
> +Guest-side driver capable of speaking the Xen PV Framebuffer protocol
> +
> +### Xen Console
> +
> +    Status, Linux (hvc_xen): Supported
> +    Status, Windows: Supported
> +
> +Guest-side driver capable of speaking the Xen PV console protocol

Status, FreeBSD: Supported, Security support external
Status, NetBSD: Supported, Security support external

> +
> +### Xen PV keyboard
> +
> +    Status, Linux (xen-kbdfront): Supported
> +    Status, Windows: Supported
> +
> +Guest-side driver capable of speaking the Xen PV keyboard protocol
> +
> +[XXX 'Supported' here depends on the version we ship in 4.10 having some fixes]
> +
> +### Xen PVUSB protocol
> +
> +    Status, Linux: Supported
> +
> +### Xen PV SCSI protocol
> +
> +    Status, Linux: Supported, with caveats

Should both of the above items be labeled with frontend/backend?

And do we really need the 'Xen' prefix in all the items? Seems quite
redundant.

> +
> +NB that while the pvSCSU frontend is in Linux and tested regularly,
> +there is currently no xl support.
> +
> +### Xen TPMfront

PV TPM frotnend

> +
> +    Status, Linux (xen-tpmfront): Tech Preview
> +
> +Guest-side driver capable of speaking the Xen PV TPM protocol
> +
> +### Xen 9pfs frontend
> +
> +    Status, Linux: Tech Preview
> +
> +Guest-side driver capable of speaking the Xen 9pfs protocol
> +
> +### PVCalls frontend
> +
> +    Status, Linux: Tech Preview
> +
> +Guest-side driver capable of making pv system calls

Didn't we merge the backend, but not the frontend?

> +
> +## Virtual device support, host side
> +
> +### Blkback
> +
> +    Status, Linux (blkback): Supported
> +    Status, FreeBSD (blkback): Supported
                                           ^, security support
                                            external

Status, NetBSD (xbdback): Supported, security support external
> +    Status, QEMU (xen_disk): Supported
> +    Status, Blktap2: Deprecated
> +
> +Host-side implementations of the Xen PV block protocol
> +
> +### Netback
> +
> +    Status, Linux (netback): Supported
> +    Status, FreeBSD (netback): Supported

Status, NetBSD (xennetback): Supported

Both FreeBSD & NetBSD: security support external.

> +
> +Host-side implementations of Xen PV network protocol
> +
> +### Xen Framebuffer
> +
> +    Status, Linux: Supported

Frontend?

> +    Status, QEMU: Supported

Backend?

I don't recall Linux having a backend for the pv fb.

> +
> +Host-side implementaiton of the Xen PV framebuffer protocol
> +
> +### Xen Console (xenconsoled)

Console backend

> +
> +    Status: Supported
> +
> +Host-side implementation of the Xen PV console protocol
> +
> +### Xen PV keyboard

PV keyboard backend

> +
> +    Status, QEMU: Supported
> +
> +Host-side implementation fo the Xen PV keyboard protocol
> +
> +### Xen PV USB

PV USB Backend

> +
> +    Status, Linux: Experimental

? The backend is in QEMU.

> +    Status, QEMU: Supported
> +
> +Host-side implementation of the Xen PV USB protocol
> +
> +### Xen PV SCSI protocol

Does this refer to the backend or the frontend?

> +
> +    Status, Linux: Supported, with caveats
> +
> +NB that while the pvSCI backend is in Linux and tested regularly,
> +there is currently no xl support.
> +
> +### Xen PV TPM
> +
> +    Status: Tech Preview

This seems to be duplicated with the item "Xen TPMfront".

> +
> +### Xen 9pfs

backend

> +
> +    Status, QEMU: Tech Preview
> +
> +### PVCalls
> +
> +    Status, Linux: Tech Preview

? backend, frontend?

> +
> +### Online resize of virtual disks
> +
> +    Status: Supported

I would remove this.

> +## Security
> +
> +### Driver Domains
> +
> +    Status: Supported
> +
> +### Device Model Stub Domains
> +
> +    Status: Supported, with caveats
> +
> +Vulnerabilities of a device model stub domain to a hostile driver domain are excluded from security support.
> +
> +### KCONFIG Expert
> +
> +    Status: Experimental
> +
> +### Live Patching
> +
> +    Status, x86: Supported
> +    Status, ARM: Experimental
> +
> +Compile time disabled
> +
> +### Virtual Machine Introspection
> +
> +    Status, x86: Supported, not security supported
> +
> +### XSM & FLASK
> +
> +    Status: Experimental
> +
> +Compile time disabled
> +
> +### XSM & FLASK support for IS_PRIV
> +
> +    Status: Experimental
> +
> +Compile time disabled
> +
> +## Hardware
> +
> +### x86/Nested PV
> +
> +    Status, x86 HVM: Tech Preview
> +
> +This means running a Xen hypervisor inside an HVM domain,
> +with support for PV L2 guests only
> +(i.e., hardware virtualization extensions not provided
> +to the guest).
> +
> +This works, but has performance limitations
> +because the L1 dom0 can only access emulated L1 devices.
> +
> +### x86/Nested HVM
> +
> +    Status, x86 HVM: Experimental
> +
> +This means running a Xen hypervisor inside an HVM domain,
> +with support for running both PV and HVM L2 guests
> +(i.e., hardware virtualization extensions provided
> +to the guest).
> +
> +### x86/HVM iPXE
> +
> +    Status: Supported, with caveats
> +
> +Booting a guest via PXE.
> +PXE inherently places full trust of the guest in the network,
> +and so should only be used
> +when the guest network is under the same administrative control
> +as the guest itself.
> +
> +### x86/HVM BIOS
> +
> +    Status: Supported
> +
> +Booting a guest via guest BIOS firmware
> +
> +### x86/HVM EFI
> +
> +	Status: Supported
> +
> +Booting a guest via guest EFI firmware

Maybe this is too generic? We certainly don't support ROMBIOS with
qemu-trad, or SeaBIOS with qemu-upstream.

> +### x86/Physical CPU Hotplug
> +
> +    Status: Supported
> +
> +### x86/Physical Memory Hotplug
> +
> +    Status: Supported
> +
> +### x86/PCI Passthrough PV
> +
> +    Status: Supported, Not security supported
> +
> +PV passthrough cannot be done safely.
> +
> +[XXX Not even with an IOMMU?]
> +
> +### x86/PCI Passthrough HVM
> +
> +    Status: Supported, with caveats
> +
> +Many hardware device and motherboard combinations are not possible to use safely.
> +The XenProject will support bugs in PCI passthrough for Xen,
> +but the user is responsible to ensure that the hardware combination they use
> +is sufficiently secure for their needs,
> +and should assume that any combination is insecure
> +unless they have reason to believe otherwise.
> +
> +### ARM/Non-PCI device passthrough
> +
> +    Status: Supported
> +
> +### x86/Advanced Vector eXtension
> +
> +    Status: Supported
> +
> +### vPMU
> +
> +    Status, x86: Supported, Not security supported
> +
> +Virtual Performance Management Unit for HVM guests
> +
> +Disabled by default (enable with hypervisor command line option).
> +This feature is not security supported: see http://xenbits.xen.org/xsa/advisory-163.html
> +
> +### Intel Platform QoS Technologies
> +
> +    Status: Tech Preview
> +
> +### ARM/ACPI (host)
> +
> +    Status: Experimental

"ACPI host" (since we already have "ACPI guest" above).

Status, ARM: experimental
Status, x86 PV: supported
Status, x86 PVH: experimental

> +### ARM/SMMUv1
> +
> +    Status: Supported
> +
> +### ARM/SMMUv2
> +
> +    Status: Supported
> +
> +### ARM/GICv3 ITS
> +
> +    Status: Experimental
> +
> +Extension to the GICv3 interrupt controller to support MSI.
> +
> +### ARM: 16K and 64K pages in guests
> +
> +    Status: Supported, with caveats
> +
> +No support for QEMU backends in a 16K or 64K domain.

Needs to be merged with the "1GB/2MB super page support"?

Thanks, Roger.

George Dunlap Sept. 12, 2017, 1:14 p.m. UTC | #6

On 09/11/2017 06:01 PM, George Dunlap wrote:
> Add a machine-readable file to describe what features are in what
> state of being 'supported', as well as information about how long this
> release will be supported, and so on.
> 
> The document should be formatted using "semantic newlines" [1], to make
> changes easier.
> 
> Signed-off-by: Ian Jackson <ian.jackson@citrix.com>
> Signed-off-by: George Dunlap <george.dunlap@citrix.com>
> 
> [1] http://rhodesmill.org/brandon/2012/one-sentence-per-line/
> ---
> 
> Sorry, I wrote a 'changes since v1' but managed to lose it.  I'll
> reply to this mail tomorrow with a list of changes.

Changes since v1:
- Moved PV-on-HVM from 'Guest Types' to 'Scalability'
- Renamed all "Preview" to "Tech Preview" for consistency
- Removed PVH dom0 support (since it doesn't work at all)
- Fixed "Virtual RAM"
- JSON: Preview -> Experimental
- ACPI: Added x86
- Virtual cpu hotplug -> Supported on all platforms
- Created "External support" section, moved all external links there
- Renamed "Tooling" section to "Debugging, analysis, and crash post-mortem"
- Moved 'Soft-reset' to "Debugging, ..."
- Moved vPMU to "Hardware" section
- vMCE -> x86/vMCE
- Updates on various virtual device driver statuses
- Removed QEMU pv netback (xen_nic)
- VMI -> Supported, not security supported
- Break out Nested PV separately to Nested HVM
- Add x86/HVM BIOS and EFI entries
- ARM/SMMU -> v1 & v2
- Updated ARM/ITS description

Rich Persaud Sept. 12, 2017, 3:35 p.m. UTC | #7

> On Sep 11, 2017, at 13:01, George Dunlap <george.dunlap@citrix.com> wrote:
> 
> +### XSM & FLASK
> +
> +    Status: Experimental
> +
> +Compile time disabled
> +
> +### XSM & FLASK support for IS_PRIV
> +
> +    Status: Experimental

In which specific areas is XSM lacking in Functional completeness, Functional stability and/or Interface stability, resulting in "Experimental" status?  What changes to XSM would be needed for it to qualify for "Supported" status?

If there will be no security support for features in Experimental status, would Xen Project accept patches to fix XSM security issues?  Could downstream projects issue CVEs for XSM security issues, if these will not be issued by Xen Project?

Rich

Stefano Stabellini Sept. 12, 2017, 7:52 p.m. UTC | #8

On Tue, 12 Sep 2017, Roger Pau Monné wrote:
> On Mon, Sep 11, 2017 at 06:01:59PM +0100, George Dunlap wrote:
> > +## Toolstack
> > +
> > +### xl
> > +
> > +    Status: Supported
> > +
> > +### Direct-boot kernel image format
> > +
> > +    Supported, x86: bzImage
> 
> ELF
> 
> > +    Supported, ARM32: zImage
> > +    Supported, ARM64: Image
> > +
> > +Format which the toolstack accept for direct-boot kernels
> 
> IMHO it would be good to provide references to the specs, for ELF that
> should be:
> 
> http://refspecs.linuxbase.org/elf/elf.pdf
> 
> > +### Qemu based disk backend (qdisk) for xl
> > +
> > +    Status: Supported
> > +
> > +### Open vSwitch integration for xl
> > +
> > +    Status: Supported
> 
> Status, Linux: Supported
> 
> I haven't played with vswitch on FreeBSD at all.
> 
> > +
> > +### systemd support for xl
> > +
> > +    Status: Supported
> > +
> > +### JSON output support for xl
> > +
> > +    Status: Experimental
> > +
> > +Output of information in machine-parseable JSON format
> > +
> > +### AHCI support for xl
> > +
> > +    Status, x86: Supported
> > +
> > +### ACPI guest
> > +
> > +    Status, x86 HVM: Supported
> > +    Status, ARM: Tech Preview
> 
> status, x86 PVH: Tech preview
> 
> > +
> > +### PVUSB support for xl
> > +
> > +    Status: Supported
> > +
> > +### HVM USB passthrough for xl
> > +
> > +    Status, x86: Supported
> > +
> > +### QEMU backend hotplugging for xl
> > +
> > +    Status: Supported
> 
> What's this exactly? Is it referring to hot-adding PV disk and nics?
> If so it shouldn't specifically reference xl, the same can be done
> with blkback or netback for example.
> 
> > +### Virtual cpu hotplug
> > +
> > +    Status: Supported
> > +
> > +## Toolstack/3rd party
> > +
> > +### libvirt driver for xl
> > +
> > +    Status: Supported, Security support external
> > +
> > +## Debugging, analysis, and crash post-mortem
> > +
> > +### gdbsx
> > +
> > +    Status, x86: Supported
> > +
> > +Debugger to debug ELF guests
> > +
> > +### Guest serial sonsole
> > +
> > +    Status: Supported
> > +
> > +Logs key hypervisor and Dom0 kernel events to a file
> > +
> > +### Soft-reset for PV guests
> > +
> > +    Status: Supported
> > +	
> > +Soft-reset allows a new kernel to start 'from scratch' with a fresh VM state, 
> > +but with all the memory from the previous state of the VM intact.
> > +This is primarily designed to allow "crash kernels", 
> > +which can do core dumps of memory to help with debugging in the event of a crash.
> > +
> > +### xentrace
> > +
> > +    Status, x86: Supported
> > +
> > +Tool to capture Xen trace buffer data
> > +
> > +### gcov
> > +
> > +    Status: Supported, Not security supported
> > +
> > +Export hypervisor coverage data suitable for analysis by gcov or lcov.
> > +
> > +## Memory Management
> > +
> > +### Memory Ballooning
> > +
> > +    Status: Supported
> > +
> > +### Memory Sharing
> > +
> > +    Status, x86 HVM: Tech Preview
> > +    Status, ARM: Tech Preview
> > +
> > +Allow sharing of identical pages between guests
> > +
> > +### Memory Paging
> > +
> > +    Status, x86 HVM: Experimenal
> > +
> > +Allow pages belonging to guests to be paged to disk
> > +
> > +### Transcendent Memory
> > +
> > +    Status: Experimental
> > +
> > +[XXX Add description]
> > +
> > +### Alternative p2m
> > +
> > +    Status, x86 HVM: Tech Preview
> > +    Status, ARM: Tech Preview
> > +
> > +Allows external monitoring of hypervisor memory
> > +by maintaining multiple physical to machine (p2m) memory mappings.
> > +
> > +## Resource Management
> > +
> > +### CPU Pools
> > +
> > +    Status: Supported
> > +
> > +Groups physical cpus into distinct groups called "cpupools",
> > +with each pool having the capability of using different schedulers and scheduling properties.
> > +
> > +### Credit Scheduler
> > +
> > +    Status: Supported
> > +
> > +The default scheduler, which is a weighted proportional fair share virtual CPU scheduler.
> > +
> > +### Credit2 Scheduler
> > +
> > +    Status: Supported
> > +
> > +Credit2 is a general purpose scheduler for Xen,
> > +designed with particular focus on fairness, responsiveness and scalability
> > +
> > +### RTDS based Scheduler
> > +
> > +    Status: Experimental
> > +
> > +A soft real-time CPU scheduler built to provide guaranteed CPU capacity to guest VMs on SMP hosts
> > +
> > +### ARINC653 Scheduler
> > +
> > +    Status: Supported, Not security supported
> > +
> > +A periodically repeating fixed timeslice scheduler. Multicore support is not yet implemented.
> > +
> > +### Null Scheduler
> > +
> > +    Status: Experimental
> > +
> > +A very simple, very static scheduling policy 
> > +that always schedules the same vCPU(s) on the same pCPU(s). 
> > +It is designed for maximum determinism and minimum overhead
> > +on embedded platforms.
> > +
> > +### Numa scheduler affinity
> > +
> > +    Status, x86: Supported
> > +
> > +Enables Numa aware scheduling in Xen
> > +
> > +## Scalability
> > +
> > +### 1GB/2MB super page support
> > +
> > +    Status: Supported
> 
> This needs something like:
> 
> Status, x86 HVM/PVH: Supported
> 
> IIRC on ARM page sizes are different (64K?)

There is a separate entry for different page granularities. 2MB and 1GB
super-pages, both based on 4K granularity, are supported on ARM too.


> > +
> > +### x86/PV-on-HVM
> > +
> > +    Status: Supported
> > +
> > +This is a useful label for a set of hypervisor features
> > +which add paravirtualized functionality to HVM guests 
> > +for improved performance and scalability.  
> > +This includes exposing event channels to HVM guests.
> > +
> > +### x86/Deliver events to PVHVM guests using Xen event channels
> > +
> > +    Status: Supported
> 
> I think this should be labeled as "x86/HVM deliver guest events using
> event channels", and the x86/PV-on-HVM section removed.
> 
> > +
> > +## High Availability and Fault Tolerance
> > +
> > +### Live Migration, Save & Restore
> > +
> > +    Status, x86: Supported
> > +
> > +### Remus Fault Tolerance
> > +
> > +    Status: Experimental
> > +
> > +### COLO Manager
> > +
> > +    Status: Experimental
> > +
> > +### x86/vMCE
> > +
> > +    Status: Supported
> > +
> > +Forward Machine Check Exceptions to Appropriate guests
> > +
> > +## Virtual driver support, guest side
> > +
> > +[XXX Consider adding 'frontend' and 'backend' to the titles in these two sections to make it clearer]
> > +
> > +### Blkfront
> > +
> > +    Status, Linux: Supported
> > +    Status, FreeBSD: Supported, Security support external
> > +    Status, Windows: Supported
> 
> Status, NetBSD: Supported, Security support external
> 
> > +
> > +Guest-side driver capable of speaking the Xen PV block protocol
> > +
> > +### Netfront
> > +
> > +    Status, Linux: Supported
> > +    States, Windows: Supported
> > +    Status, FreeBSD: Supported, Security support external
> > +    Status, NetBSD: Supported, Security support external
> > +    Status, OpenBSD: Supported, Security support external
> > +
> > +Guest-side driver capable of speaking the Xen PV networking protocol
> > +
> > +### Xen Framebuffer
> > +
> > +    Status, Linux (xen-fbfront): Supported
> > +
> > +Guest-side driver capable of speaking the Xen PV Framebuffer protocol
> > +
> > +### Xen Console
> > +
> > +    Status, Linux (hvc_xen): Supported
> > +    Status, Windows: Supported
> > +
> > +Guest-side driver capable of speaking the Xen PV console protocol
> 
> Status, FreeBSD: Supported, Security support external
> Status, NetBSD: Supported, Security support external
> 
> > +
> > +### Xen PV keyboard
> > +
> > +    Status, Linux (xen-kbdfront): Supported
> > +    Status, Windows: Supported
> > +
> > +Guest-side driver capable of speaking the Xen PV keyboard protocol
> > +
> > +[XXX 'Supported' here depends on the version we ship in 4.10 having some fixes]
> > +
> > +### Xen PVUSB protocol
> > +
> > +    Status, Linux: Supported
> > +
> > +### Xen PV SCSI protocol
> > +
> > +    Status, Linux: Supported, with caveats
> 
> Should both of the above items be labeled with frontend/backend?
> 
> And do we really need the 'Xen' prefix in all the items? Seems quite
> redundant.
> 
> > +
> > +NB that while the pvSCSU frontend is in Linux and tested regularly,
> > +there is currently no xl support.
> > +
> > +### Xen TPMfront
> 
> PV TPM frotnend
> 
> > +
> > +    Status, Linux (xen-tpmfront): Tech Preview
> > +
> > +Guest-side driver capable of speaking the Xen PV TPM protocol
> > +
> > +### Xen 9pfs frontend
> > +
> > +    Status, Linux: Tech Preview
> > +
> > +Guest-side driver capable of speaking the Xen 9pfs protocol
> > +
> > +### PVCalls frontend
> > +
> > +    Status, Linux: Tech Preview
> > +
> > +Guest-side driver capable of making pv system calls
> 
> Didn't we merge the backend, but not the frontend?
> 
> > +
> > +## Virtual device support, host side
> > +
> > +### Blkback
> > +
> > +    Status, Linux (blkback): Supported
> > +    Status, FreeBSD (blkback): Supported
>                                            ^, security support
>                                             external
> 
> Status, NetBSD (xbdback): Supported, security support external
> > +    Status, QEMU (xen_disk): Supported
> > +    Status, Blktap2: Deprecated
> > +
> > +Host-side implementations of the Xen PV block protocol
> > +
> > +### Netback
> > +
> > +    Status, Linux (netback): Supported
> > +    Status, FreeBSD (netback): Supported
> 
> Status, NetBSD (xennetback): Supported
> 
> Both FreeBSD & NetBSD: security support external.
> 
> > +
> > +Host-side implementations of Xen PV network protocol
> > +
> > +### Xen Framebuffer
> > +
> > +    Status, Linux: Supported
> 
> Frontend?

Yes, please. If you write "Xen Framebuffer" I only take it to mean the
protocol as should be documented somewhere under docs/. Then I read
Linux, and I don't understand what you mean. Then I read QEMU and I have
to guess you are talking about the backend?


> > +    Status, QEMU: Supported
> 
> Backend?
> 
> I don't recall Linux having a backend for the pv fb.
> 
> > +
> > +Host-side implementaiton of the Xen PV framebuffer protocol
> > +
> > +### Xen Console (xenconsoled)
> 
> Console backend
> 
> > +
> > +    Status: Supported
> > +
> > +Host-side implementation of the Xen PV console protocol
> > +
> > +### Xen PV keyboard
> 
> PV keyboard backend
> 
> > +
> > +    Status, QEMU: Supported
> > +
> > +Host-side implementation fo the Xen PV keyboard protocol
> > +
> > +### Xen PV USB
> 
> PV USB Backend
> 
> > +
> > +    Status, Linux: Experimental
> 
> ? The backend is in QEMU.
> 
> > +    Status, QEMU: Supported
> > +
> > +Host-side implementation of the Xen PV USB protocol
> > +
> > +### Xen PV SCSI protocol
> 
> Does this refer to the backend or the frontend?
> 
> > +
> > +    Status, Linux: Supported, with caveats
> > +
> > +NB that while the pvSCI backend is in Linux and tested regularly,
> > +there is currently no xl support.
> > +
> > +### Xen PV TPM
> > +
> > +    Status: Tech Preview
> 
> This seems to be duplicated with the item "Xen TPMfront".
> 
> > +
> > +### Xen 9pfs
> 
> backend
> 
> > +
> > +    Status, QEMU: Tech Preview
> > +
> > +### PVCalls
> > +
> > +    Status, Linux: Tech Preview
> 
> ? backend, frontend?
> 
> > +
> > +### Online resize of virtual disks
> > +
> > +    Status: Supported
> 
> I would remove this.
> 
> > +## Security
> > +
> > +### Driver Domains
> > +
> > +    Status: Supported
> > +
> > +### Device Model Stub Domains
> > +
> > +    Status: Supported, with caveats
> > +
> > +Vulnerabilities of a device model stub domain to a hostile driver domain are excluded from security support.
> > +
> > +### KCONFIG Expert
> > +
> > +    Status: Experimental
> > +
> > +### Live Patching
> > +
> > +    Status, x86: Supported
> > +    Status, ARM: Experimental
> > +
> > +Compile time disabled
> > +
> > +### Virtual Machine Introspection
> > +
> > +    Status, x86: Supported, not security supported
> > +
> > +### XSM & FLASK
> > +
> > +    Status: Experimental
> > +
> > +Compile time disabled
> > +
> > +### XSM & FLASK support for IS_PRIV
> > +
> > +    Status: Experimental
> > +
> > +Compile time disabled
> > +
> > +## Hardware
> > +
> > +### x86/Nested PV
> > +
> > +    Status, x86 HVM: Tech Preview
> > +
> > +This means running a Xen hypervisor inside an HVM domain,
> > +with support for PV L2 guests only
> > +(i.e., hardware virtualization extensions not provided
> > +to the guest).
> > +
> > +This works, but has performance limitations
> > +because the L1 dom0 can only access emulated L1 devices.
> > +
> > +### x86/Nested HVM
> > +
> > +    Status, x86 HVM: Experimental
> > +
> > +This means running a Xen hypervisor inside an HVM domain,
> > +with support for running both PV and HVM L2 guests
> > +(i.e., hardware virtualization extensions provided
> > +to the guest).
> > +
> > +### x86/HVM iPXE
> > +
> > +    Status: Supported, with caveats
> > +
> > +Booting a guest via PXE.
> > +PXE inherently places full trust of the guest in the network,
> > +and so should only be used
> > +when the guest network is under the same administrative control
> > +as the guest itself.
> > +
> > +### x86/HVM BIOS
> > +
> > +    Status: Supported
> > +
> > +Booting a guest via guest BIOS firmware
> > +
> > +### x86/HVM EFI
> > +
> > +	Status: Supported
> > +
> > +Booting a guest via guest EFI firmware
> 
> Maybe this is too generic? We certainly don't support ROMBIOS with
> qemu-trad, or SeaBIOS with qemu-upstream.
> 
> > +### x86/Physical CPU Hotplug
> > +
> > +    Status: Supported
> > +
> > +### x86/Physical Memory Hotplug
> > +
> > +    Status: Supported
> > +
> > +### x86/PCI Passthrough PV
> > +
> > +    Status: Supported, Not security supported
> > +
> > +PV passthrough cannot be done safely.
> > +
> > +[XXX Not even with an IOMMU?]
> > +
> > +### x86/PCI Passthrough HVM
> > +
> > +    Status: Supported, with caveats
> > +
> > +Many hardware device and motherboard combinations are not possible to use safely.
> > +The XenProject will support bugs in PCI passthrough for Xen,
> > +but the user is responsible to ensure that the hardware combination they use
> > +is sufficiently secure for their needs,
> > +and should assume that any combination is insecure
> > +unless they have reason to believe otherwise.
> > +
> > +### ARM/Non-PCI device passthrough
> > +
> > +    Status: Supported
> > +
> > +### x86/Advanced Vector eXtension
> > +
> > +    Status: Supported
> > +
> > +### vPMU
> > +
> > +    Status, x86: Supported, Not security supported
> > +
> > +Virtual Performance Management Unit for HVM guests
> > +
> > +Disabled by default (enable with hypervisor command line option).
> > +This feature is not security supported: see http://xenbits.xen.org/xsa/advisory-163.html
> > +
> > +### Intel Platform QoS Technologies
> > +
> > +    Status: Tech Preview
> > +
> > +### ARM/ACPI (host)
> > +
> > +    Status: Experimental
> 
> "ACPI host" (since we already have "ACPI guest" above).
> 
> Status, ARM: experimental
> Status, x86 PV: supported
> Status, x86 PVH: experimental
> 
> > +### ARM/SMMUv1
> > +
> > +    Status: Supported
> > +
> > +### ARM/SMMUv2
> > +
> > +    Status: Supported
> > +
> > +### ARM/GICv3 ITS
> > +
> > +    Status: Experimental
> > +
> > +Extension to the GICv3 interrupt controller to support MSI.
> > +
> > +### ARM: 16K and 64K pages in guests
> > +
> > +    Status: Supported, with caveats
> > +
> > +No support for QEMU backends in a 16K or 64K domain.
> 
> Needs to be merged with the "1GB/2MB super page support"?
 
Super-pages are different from page granularity. 1GB and 2MB pages are
based on the same 4K page granularity, while 512MB pages are based on
64K granularity. Does it make sense?

Maybe we want to say "ARM: 16K and 64K page granularity in guest" to
clarify.

Julien Grall Sept. 12, 2017, 8:09 p.m. UTC | #9

Hi,

On 12/09/2017 20:52, Stefano Stabellini wrote:
> On Tue, 12 Sep 2017, Roger Pau Monné wrote:
>> On Mon, Sep 11, 2017 at 06:01:59PM +0100, George Dunlap wrote:
>>> +## Scalability
>>> +
>>> +### 1GB/2MB super page support
>>> +
>>> +    Status: Supported
>>
>> This needs something like:
>>
>> Status, x86 HVM/PVH: Supported
>>
>> IIRC on ARM page sizes are different (64K?)
>
> There is a separate entry for different page granularities. 2MB and 1GB
> super-pages, both based on 4K granularity, are supported on ARM too.

This entry and the entry "ARM: 16K and 64K pages in guests" are two 
different things.

Here we speak about the hypervisor whereas the other one is about guests 
itself.

At the moment, the hypervisor only supports 4K. The guests can support 
4K, 16K, 64K. The later two are only for AArch64 guest.

It is probably worth to rename the other entry to "ARM: 4K, 16K, 64K 
pages in guests" for avoiding confusion.

[...]

>>> +### ARM: 16K and 64K pages in guests
>>> +
>>> +    Status: Supported, with caveats
>>> +
>>> +No support for QEMU backends in a 16K or 64K domain.
>>
>> Needs to be merged with the "1GB/2MB super page support"?
>
> Super-pages are different from page granularity. 1GB and 2MB pages are
> based on the same 4K page granularity, while 512MB pages are based on
> 64K granularity. Does it make sense?
> Maybe we want to say "ARM: 16K and 64K page granularity in guest" to
> clarify.

Each entry is related to different components. The first entry is about 
the hypervisor, whilst this one is about guests. We really don't care 
whether the guests is going to use superpage because at the end of the 
day this will get handle by the hardware directly.

The only thing we care is those guests to be able to interact with Xen 
(the interface is based on 4K granularity at the moment).
So I am not sure what we are trying to clarify at the end...

Cheers,

Stefano Stabellini Sept. 25, 2017, 11:10 p.m. UTC | #10

On Mon, 11 Sep 2017, George Dunlap wrote:
> +### RTDS based Scheduler
> +
> +    Status: Experimental
> +
> +A soft real-time CPU scheduler built to provide guaranteed CPU capacity to guest VMs on SMP hosts
> +
> +### ARINC653 Scheduler
> +
> +    Status: Supported, Not security supported
> +
> +A periodically repeating fixed timeslice scheduler. Multicore support is not yet implemented.
> +
> +### Null Scheduler
> +
> +    Status: Experimental
> +
> +A very simple, very static scheduling policy 
> +that always schedules the same vCPU(s) on the same pCPU(s). 
> +It is designed for maximum determinism and minimum overhead
> +on embedded platforms.

Hi all,

I have just noticed that none of the non-credit schedulers are security
supported. Would it make sense to try to support at least one of them?

For example, RTDS is not new and Dario is co-maintaining it. It is
currently marked as Supported in the MAINTAINERS file. Is it really fair
to mark it as "Experimental" in SUPPORT.md?

The Null scheduler was new when we started this discussion, but now that
Xen 4.10 is entering code freeze, Null scheduler is not so new anymore.
We didn't get any bug reports during the 4.10 development window. By the
time this document is accepted and Xen 4.10 is out, Null could be a
candidate for "Supported" too.

Thoughts?

Cheers,

Stefano

Dario Faggioli Sept. 26, 2017, 7:12 a.m. UTC | #11

[Cc-list modified by removing someone and adding someone else]

On Mon, 2017-09-25 at 16:10 -0700, Stefano Stabellini wrote:
> On Mon, 11 Sep 2017, George Dunlap wrote:
> > +### RTDS based Scheduler
> > +
> > +    Status: Experimental
> > +
> > +A soft real-time CPU scheduler built to provide guaranteed CPU
> > capacity to guest VMs on SMP hosts
> > +
> > +### ARINC653 Scheduler
> > +
> > +    Status: Supported, Not security supported
> > +
> > +A periodically repeating fixed timeslice scheduler. Multicore
> > support is not yet implemented.
> > +
> > +### Null Scheduler
> > +
> > +    Status: Experimental
> > +
> > +A very simple, very static scheduling policy 
> > +that always schedules the same vCPU(s) on the same pCPU(s). 
> > +It is designed for maximum determinism and minimum overhead
> > +on embedded platforms.
> 
> Hi all,
> 
Hey!

> I have just noticed that none of the non-credit schedulers are
> security
> supported. Would it make sense to try to support at least one of
> them?
> 
Yes, that indeed would be great.

> For example, RTDS is not new and Dario is co-maintaining it. It is
> currently marked as Supported in the MAINTAINERS file. Is it really
> fair
> to mark it as "Experimental" in SUPPORT.md?
> 
True, but there still one small missing piece in RTDS, before I'd feel
comfortable about telling people "here, it's ready, use it at will",
which is the work conserving mode.

There are patches out for this, and they were posted before last
posting date, so, in theory, they still can go in 4.10.

> The Null scheduler was new when we started this discussion, but now
> that
> Xen 4.10 is entering code freeze, Null scheduler is not so new
> anymore.
> We didn't get any bug reports during the 4.10 development window. By
> the
> time this document is accepted and Xen 4.10 is out, Null could be a
> candidate for "Supported" too.
> 
Yes, especially considering how simple it is, there should be no big
issues preventing that to happen.

There's one thing, though: it's not tested in OSSTest. I can actually
try to have a quick look about creating a job that does that (I mean
like today).

The trickiest part is the need to limit the number of Dom0 vCPUs, to a
number that would allow the creation and the local migration of guests
(considering that the number of pCPUs of the testbox in the MA colo
varies, and that we have some ARM boards with like 1 or 2 CPUs).

Actually, the best candidate for gaining security support, is IMO
ARINC. Code is also rather simple and "stable" (hasn't changed in the
last... years!) and it's used by DornerWorks' people for some of their
projects (I think?). It's also not tested in OSSTest, though, and
considering how special purpose it is, I think we're not totally
comfortable marking it as Sec-Supported, without feedback from the
maintainers.

George, Josh, Robert?

Regards,
Dario

George Dunlap Sept. 26, 2017, 10:34 a.m. UTC | #12

On 09/26/2017 12:10 AM, Stefano Stabellini wrote:
> On Mon, 11 Sep 2017, George Dunlap wrote:
>> +### RTDS based Scheduler
>> +
>> +    Status: Experimental
>> +
>> +A soft real-time CPU scheduler built to provide guaranteed CPU capacity to guest VMs on SMP hosts
>> +
>> +### ARINC653 Scheduler
>> +
>> +    Status: Supported, Not security supported
>> +
>> +A periodically repeating fixed timeslice scheduler. Multicore support is not yet implemented.
>> +
>> +### Null Scheduler
>> +
>> +    Status: Experimental
>> +
>> +A very simple, very static scheduling policy 
>> +that always schedules the same vCPU(s) on the same pCPU(s). 
>> +It is designed for maximum determinism and minimum overhead
>> +on embedded platforms.
> 
> Hi all,
> 
> I have just noticed that none of the non-credit schedulers are security
> supported. Would it make sense to try to support at least one of them?
> 
> For example, RTDS is not new and Dario is co-maintaining it. It is
> currently marked as Supported in the MAINTAINERS file. Is it really fair
> to mark it as "Experimental" in SUPPORT.md?
> 
> The Null scheduler was new when we started this discussion, but now that
> Xen 4.10 is entering code freeze, Null scheduler is not so new anymore.
> We didn't get any bug reports during the 4.10 development window. By the
> time this document is accepted and Xen 4.10 is out, Null could be a
> candidate for "Supported" too.
> 
> Thoughts?

One thing we've been talking about for a long time is having more of a
formal process for getting features into the 'supported' state; and one
of the key criteria for that was to make sure that the feature was
getting regular testing somewhere (preferably in osstest, but at least
*somewhere*).

A lot of these features we have no idea how much testing they're getting
or even if they work reliably; so we put them in 'experimental' or
'preview' by default, until someone who is working on those features
wants to argue otherwise.  If Meng (or someone) wanted RTDS to be
considered 'supported', he could come to us and ask for it to be
considered 'supported'; and we can discuss what criteria we'd use to
decide whether to change it or not.

And of course, all of this (both the "ask for it to be considered
supported" and "make sure it's regularly tested") is really just a proxy
for "How much do people care about this feature".  If people care enough
about the feature to notice that it's listed as 'experimental' and set
up regular testing, then we should care enough to give it security
support.  If nobody cares enough about the feature to even notice it's
not listed as 'supported', or to give it regular testing (again, not
even necessarily in osstest), then I think we're justified in not caring
enough to give it security support.

As Dario said, the null scheduler could do with just getting into osstest.

 -George

Robert VanVossen Sept. 27, 2017, 12:57 p.m. UTC | #13

On 9/26/2017 3:12 AM, Dario Faggioli wrote:
> [Cc-list modified by removing someone and adding someone else]
> 
> On Mon, 2017-09-25 at 16:10 -0700, Stefano Stabellini wrote:
>> On Mon, 11 Sep 2017, George Dunlap wrote:
>>> +### RTDS based Scheduler
>>> +
>>> +    Status: Experimental
>>> +
>>> +A soft real-time CPU scheduler built to provide guaranteed CPU
>>> capacity to guest VMs on SMP hosts
>>> +
>>> +### ARINC653 Scheduler
>>> +
>>> +    Status: Supported, Not security supported
>>> +
>>> +A periodically repeating fixed timeslice scheduler. Multicore
>>> support is not yet implemented.
>>> +
>>> +### Null Scheduler
>>> +
>>> +    Status: Experimental
>>> +
>>> +A very simple, very static scheduling policy 
>>> +that always schedules the same vCPU(s) on the same pCPU(s). 
>>> +It is designed for maximum determinism and minimum overhead
>>> +on embedded platforms.
>>
>> Hi all,
>>
> Hey!
> 
>> I have just noticed that none of the non-credit schedulers are
>> security
>> supported. Would it make sense to try to support at least one of
>> them?
>>
> Yes, that indeed would be great.
> 
>> For example, RTDS is not new and Dario is co-maintaining it. It is
>> currently marked as Supported in the MAINTAINERS file. Is it really
>> fair
>> to mark it as "Experimental" in SUPPORT.md?
>>
> True, but there still one small missing piece in RTDS, before I'd feel
> comfortable about telling people "here, it's ready, use it at will",
> which is the work conserving mode.
> 
> There are patches out for this, and they were posted before last
> posting date, so, in theory, they still can go in 4.10.
> 
>> The Null scheduler was new when we started this discussion, but now
>> that
>> Xen 4.10 is entering code freeze, Null scheduler is not so new
>> anymore.
>> We didn't get any bug reports during the 4.10 development window. By
>> the
>> time this document is accepted and Xen 4.10 is out, Null could be a
>> candidate for "Supported" too.
>>
> Yes, especially considering how simple it is, there should be no big
> issues preventing that to happen.
> 
> There's one thing, though: it's not tested in OSSTest. I can actually
> try to have a quick look about creating a job that does that (I mean
> like today).
> 
> The trickiest part is the need to limit the number of Dom0 vCPUs, to a
> number that would allow the creation and the local migration of guests
> (considering that the number of pCPUs of the testbox in the MA colo
> varies, and that we have some ARM boards with like 1 or 2 CPUs).
> 
> 
> Actually, the best candidate for gaining security support, is IMO
> ARINC. Code is also rather simple and "stable" (hasn't changed in the
> last... years!) and it's used by DornerWorks' people for some of their
> projects (I think?). It's also not tested in OSSTest, though, and
> considering how special purpose it is, I think we're not totally
> comfortable marking it as Sec-Supported, without feedback from the
> maintainers.
> 
> George, Josh, Robert?
>

Yes, we do still use the ARINC653 scheduler. Since it is so simple, it hasn't
really needed any modifications in the last couple years.

We are not really sure what kind of feedback you are looking from us in regards
to marking it sec-supported, but would be happy to try and answer any questions.
If you have any specific questions or requests, we can discuss it internally and
get back to you.

Thanks,
Robbie VanVossen

Dario Faggioli Sept. 27, 2017, 1:48 p.m. UTC | #14

On Wed, 2017-09-27 at 08:57 -0400, Robert VanVossen wrote:
> On 9/26/2017 3:12 AM, Dario Faggioli wrote:
> > [Cc-list modified by removing someone and adding someone else]
> > 
> > Actually, the best candidate for gaining security support, is IMO
> > ARINC. Code is also rather simple and "stable" (hasn't changed in
> > the
> > last... years!) and it's used by DornerWorks' people for some of
> > their
> > projects (I think?). It's also not tested in OSSTest, though, and
> > considering how special purpose it is, I think we're not totally
> > comfortable marking it as Sec-Supported, without feedback from the
> > maintainers.
> > 
> > George, Josh, Robert?
> > 
> 
> Yes, we do still use the ARINC653 scheduler. Since it is so simple,
> it hasn't
> really needed any modifications in the last couple years.
> 
Hehe :-)

> We are not really sure what kind of feedback you are looking from us
> in regards
> to marking it sec-supported, but would be happy to try and answer any
> questions.
> If you have any specific questions or requests, we can discuss it
> internally and
> get back to you.
> 
Right. So, that's something we are still in the process of defining
properly.

To have an idea, you may have a look at George's email, in this thread
(you weren't Cc-ed yet):
https://www.mail-archive.com/xen-devel@lists.xen.org/msg123768.html

And also to this other ones:
https://www.mail-archive.com/xen-devel@lists.xen.org/msg84376.html
https://lists.xenproject.org/archives/html/xen-devel/2016-11/msg00171.h
tml

Regards,
Dario

Lars Kurth Oct. 9, 2017, 1:53 p.m. UTC | #15

> On 12 Sep 2017, at 16:35, Rich Persaud <persaur@gmail.com> wrote:
> 
>> On Sep 11, 2017, at 13:01, George Dunlap <george.dunlap@citrix.com> wrote:
>> 
>> +### XSM & FLASK
>> +
>> +    Status: Experimental
>> +
>> +Compile time disabled
>> +
>> +### XSM & FLASK support for IS_PRIV
>> +
>> +    Status: Experimental
> 
> In which specific areas is XSM lacking in Functional completeness, Functional stability and/or Interface stability, resulting in "Experimental" status?  What changes to XSM would be needed for it to qualify for "Supported" status?

I think the issue in this case may be lack of automated testing or known testing - see https://www.mail-archive.com/xen-devel@lists.xen.org/msg123768.html <https://www.mail-archive.com/xen-devel@lists.xen.org/msg123768.html>  
I am not quite sure what the status of XSM testing in OSSTEST is: I think there is something there, but not sure what. 

> If there will be no security support for features in Experimental status, would Xen Project accept patches to fix XSM security issues?  Could downstream projects issue CVEs for XSM security issues, if these will not be issued by Xen Project?

This question I have to defer to members of the security team.

Lars

Lars Kurth Oct. 9, 2017, 2:14 p.m. UTC | #16

> On 27 Sep 2017, at 13:57, Robert VanVossen <robert.vanvossen@dornerworks.com> wrote:
> 
> 
> 
> On 9/26/2017 3:12 AM, Dario Faggioli wrote:
>> [Cc-list modified by removing someone and adding someone else]
>> 
>> On Mon, 2017-09-25 at 16:10 -0700, Stefano Stabellini wrote:
>>> On Mon, 11 Sep 2017, George Dunlap wrote:
>>>> +### RTDS based Scheduler
>>>> +
>>>> +    Status: Experimental
>>>> +
>>>> +A soft real-time CPU scheduler built to provide guaranteed CPU
>>>> capacity to guest VMs on SMP hosts
>>>> +
>>>> +### ARINC653 Scheduler
>>>> +
>>>> +    Status: Supported, Not security supported
>>>> +
>>>> +A periodically repeating fixed timeslice scheduler. Multicore
>>>> support is not yet implemented.
>>>> +
>>>> +### Null Scheduler
>>>> +
>>>> +    Status: Experimental
>>>> +
>>>> +A very simple, very static scheduling policy 
>>>> +that always schedules the same vCPU(s) on the same pCPU(s). 
>>>> +It is designed for maximum determinism and minimum overhead
>>>> +on embedded platforms.

...

>> Actually, the best candidate for gaining security support, is IMO
>> ARINC. Code is also rather simple and "stable" (hasn't changed in the
>> last... years!) and it's used by DornerWorks' people for some of their
>> projects (I think?). It's also not tested in OSSTest, though, and
>> considering how special purpose it is, I think we're not totally
>> comfortable marking it as Sec-Supported, without feedback from the
>> maintainers.
>> 
>> George, Josh, Robert?
>> 
> 
> Yes, we do still use the ARINC653 scheduler. Since it is so simple, it hasn't
> really needed any modifications in the last couple years.
> 
> We are not really sure what kind of feedback you are looking from us in regards
> to marking it sec-supported, but would be happy to try and answer any questions.
> If you have any specific questions or requests, we can discuss it internally and
> get back to you.

I think there are two sets of issues: one around testing, which Dario outlined.

For example, if you had some test harnesses that could be run on Xen release 
candidates, which verify that the scheduler works as expected, that would
help. It would imply a commitment to run the tests on release candidates.

The second question is what happens if someone reported a security issue on
the scheduler. The security team would not have the capability to fix issues in 
the ARINC scheduler: so it would be necessary to pull in an expert under 
embargo to help triage the issue, fix the issue and prove that the fix works. This 
would most likely require "the expert" to work to the timeline of the security
team (which may require prioritising it over other work), as once a security issue 
has been reported, the reporter may insist on a disclosure schedule. If we didn't 
have a fix in time, because we don't get expert bandwidth, we could be forced to 
disclose an XSA without a fix.

Does this make sense?

Lars

George Dunlap Oct. 23, 2017, 4:22 p.m. UTC | #17

On 09/11/2017 06:53 PM, Andrew Cooper wrote:
> On 11/09/17 18:01, George Dunlap wrote:
>> +### x86/PV
>> +
>> +    Status: Supported
>> +
>> +Traditional Xen Project PV guest
> 
> What's a "Xen Project" PV guest?  Just Xen here.
> 
> Also, a perhaps a statement of "No hardware requirements" ?

OK.

> 
>> +### x86/RAM
>> +
>> +    Limit, x86: 16TiB
>> +    Limit, ARM32: 16GiB
>> +    Limit, ARM64: 5TiB
>> +
>> +[XXX: Andy to suggest what this should say for x86]
> 
> The limit for x86 is either 16TiB or 123TiB, depending on
> CONFIG_BIGMEM.  CONFIG_BIGMEM is exposed via menuconfig without
> XEN_CONFIG_EXPERT, so falls into at least some kind of support statement.
> 
> As for practical limits, I don't think its reasonable to claim anything
> which we can't test.  What are the specs in the MA colo?

At the moment the "Limit" tag specifically says that it's theoretical
and may not work.

We could add another tag, "Limit-tested", or something like that.

Or, we could simply have the Limit-security be equal to the highest
amount which has been tested (either by osstest or downstreams).

For simplicity's sake I'd go with the second one.

Shall I write an e-mail with a more direct query for the maximum amounts
of various numbers tested by the XenProject (via osstest), Citrix, SuSE,
and Oracle?

>> +
>> +## Limits/Guest
>> +
>> +### Virtual CPUs
>> +
>> +    Limit, x86 PV: 512
> 
> Where did this number come from?  The actual limit as enforced in Xen is
> 8192, and it has been like that for a very long time (i.e. the 3.x days)

Looks like Lars copied this from
https://wiki.xenproject.org/wiki/Xen_Project_Release_Features.  Not sure
where it came from before that.

> [root@fusebot ~]# python
> Python 2.7.5 (default, Nov 20 2015, 02:00:19)
> [GCC 4.8.5 20150623 (Red Hat 4.8.5-4)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
>>>> from xen.lowlevel.xc import xc as XC
>>>> xc = XC()
>>>> xc.domain_create()
> 1
>>>> xc.domain_max_vcpus(1, 8192)
> 0
>>>> xc.domain_create()
> 2
>>>> xc.domain_max_vcpus(2, 8193)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> xen.lowlevel.xc.Error: (22, 'Invalid argument')
> 
> Trying to shut such a domain down however does tickle a host watchdog
> timeout as the for_each_vcpu() loops in domain_kill() are very long.

For now I'll set 'Limit' to 8192, and 'Limit-security' to 512.
Depending on what I get for the "test limit" survey I may adjust it
afterwards.

>> +    Limit, x86 HVM: 128
>> +    Limit, ARM32: 8
>> +    Limit, ARM64: 128
>> +
>> +[XXX Andrew Cooper: Do want to add "Limit-Security" here for some of these?]
> 
> 32 for each.  64 vcpu HVM guests can excerpt enough p2m lock pressure to
> trigger a 5 second host watchdog timeout.

Is that "32 for x86 PV and x86 HVM", or "32 for x86 HVM and ARM64"?  Or
something else?

>> +### Virtual RAM
>> +
>> +    Limit, x86 PV: >1TB
>> +    Limit, x86 HVM: 1TB
>> +    Limit, ARM32: 16GiB
>> +    Limit, ARM64: 1TB
> 
> There is no specific upper bound on the size of PV or HVM guests that I
> am aware of.  1.5TB HVM domains definitely work, because that's what we
> test and support in XenServer.

Are there limits for 32-bit guests?  There's some complicated limit
having to do with the m2p, right?

>> +
>> +### x86 PV/Event Channels
>> +
>> +    Limit: 131072
> 
> Why do we call out event channel limits but not grant table limits? 
> Also, why is this x86?  The 2l and fifo ABIs are arch agnostic, as far
> as I am aware.

Sure, but I'm pretty sure that ARM guests don't (perhaps cannot?) use PV
event channels.

> 
>> +## High Availability and Fault Tolerance
>> +
>> +### Live Migration, Save & Restore
>> +
>> +    Status, x86: Supported
> 
> With caveats.  From docs/features/migration.pandoc

This would extend the meaning of "caveats" from "when it's not security
supported" to "when it doesn't work"; which is probably the best thing
at the moment.

> * x86 HVM with nested-virt (no relevant information included in the stream)
[snip]
> Also, features such as vNUMA and nested virt (which are two I know for
> certain) have all state discarded on the source side, because they were
> never suitably plumbed in.

OK, I'll list these, as well as PCI pass-through.

(Actually, vNUMA doesn't seem to be on the list!)

And we should probably add a safety-catch to prevent a VM started with
any of these from being live-migrated.

In fact, if possible, that should be a whitelist: Any configuration that
isn't specifically known to work with migration should cause a migration
command to be refused.

What about the following features?

 * Guest serial console
 * Crash kernels
 * Transcendent Memory
 * Alternative p2m
 * vMCE
 * vPMU
 * Intel Platform QoS
 * Remus
 * COLO
 * PV protocols: Keyboard, PVUSB, PVSCSI, PVTPM, 9pfs, pvcalls?
 * FlASK?
 * CPU / memory hotplug?

> * x86 HVM guest physmap operations (not reflected in logdirty bitmap)
> * x86 PV P2M structure changes (not noticed, stale mappings used) for
>   guests not using the linear p2m layout

I'm afraid this isn't really appropriate for a user-facing document.
Users don't directly do physmap operations, nor p2m structure changes.
We need to tell them specifically which features they can or cannot use.

> * x86 HVM with PoD pages (attempts to map cause PoD allocations)

This shouldn't be any more dangerous than a guest-side sweep, should it?
 You may waste a lot of time reclaiming zero pages, but it seems like it
should only be a relatively minor performance issue, not a correctness
issue.

The main "problem" (in terms of "surprising behavior") would be that on
the remote side any PoD pages will actually be allocated zero pages.  So
if your guest was booted with memmax=4096 and memory=2048, but your
balloon driver had only ballooned down to 3000 for some reason (and then
stopped), the remote side would want 3000 MiB (not 2048, as one might
expect).

> * x86 PV ballooning (P2M marked dirty, target frame not marked)

Er, this should probably be fixed.  What exactly is the problem here?

 -George

Andrew Cooper Oct. 23, 2017, 5:55 p.m. UTC | #18

On 23/10/17 17:22, George Dunlap wrote:
> On 09/11/2017 06:53 PM, Andrew Cooper wrote:
>> On 11/09/17 18:01, George Dunlap wrote:
>>> +### x86/RAM
>>> +
>>> +    Limit, x86: 16TiB
>>> +    Limit, ARM32: 16GiB
>>> +    Limit, ARM64: 5TiB
>>> +
>>> +[XXX: Andy to suggest what this should say for x86]
>> The limit for x86 is either 16TiB or 123TiB, depending on
>> CONFIG_BIGMEM.  CONFIG_BIGMEM is exposed via menuconfig without
>> XEN_CONFIG_EXPERT, so falls into at least some kind of support statement.
>>
>> As for practical limits, I don't think its reasonable to claim anything
>> which we can't test.  What are the specs in the MA colo?
> At the moment the "Limit" tag specifically says that it's theoretical
> and may not work.
>
> We could add another tag, "Limit-tested", or something like that.
>
> Or, we could simply have the Limit-security be equal to the highest
> amount which has been tested (either by osstest or downstreams).
>
> For simplicity's sake I'd go with the second one.

It think it would be very helpful to distinguish the upper limits from
the supported limits.  There will be a large difference between the two.

Limit-Theoretical and Limit-Supported ?

In all cases, we should identify why the limit is where it is, even if
that is only "maximum people have tested to".  Other

>
> Shall I write an e-mail with a more direct query for the maximum amounts
> of various numbers tested by the XenProject (via osstest), Citrix, SuSE,
> and Oracle?

For XenServer,
http://docs.citrix.com/content/dam/docs/en-us/xenserver/current-release/downloads/xenserver-config-limits.pdf

>> [root@fusebot ~]# python
>> Python 2.7.5 (default, Nov 20 2015, 02:00:19)
>> [GCC 4.8.5 20150623 (Red Hat 4.8.5-4)] on linux2
>> Type "help", "copyright", "credits" or "license" for more information.
>>>>> from xen.lowlevel.xc import xc as XC
>>>>> xc = XC()
>>>>> xc.domain_create()
>> 1
>>>>> xc.domain_max_vcpus(1, 8192)
>> 0
>>>>> xc.domain_create()
>> 2
>>>>> xc.domain_max_vcpus(2, 8193)
>> Traceback (most recent call last):
>>   File "<stdin>", line 1, in <module>
>> xen.lowlevel.xc.Error: (22, 'Invalid argument')
>>
>> Trying to shut such a domain down however does tickle a host watchdog
>> timeout as the for_each_vcpu() loops in domain_kill() are very long.
> For now I'll set 'Limit' to 8192, and 'Limit-security' to 512.
> Depending on what I get for the "test limit" survey I may adjust it
> afterwards.

The largest production x86 server I am aware of is a Skylake-S system
with 496 threads.  512 is not a plausibly-tested number.

>
>>> +    Limit, x86 HVM: 128
>>> +    Limit, ARM32: 8
>>> +    Limit, ARM64: 128
>>> +
>>> +[XXX Andrew Cooper: Do want to add "Limit-Security" here for some of these?]
>> 32 for each.  64 vcpu HVM guests can excerpt enough p2m lock pressure to
>> trigger a 5 second host watchdog timeout.
> Is that "32 for x86 PV and x86 HVM", or "32 for x86 HVM and ARM64"?  Or
> something else?

The former.  I'm not qualified to comment on any of the ARM limits.

There are several non-trivial for_each_vcpu() loops in the domain_kill
path which aren't handled by continuations.  ISTR 128 vcpus is enough to
trip a watchdog timeout when freeing pagetables.

>
>>> +### Virtual RAM
>>> +
>>> +    Limit, x86 PV: >1TB
>>> +    Limit, x86 HVM: 1TB
>>> +    Limit, ARM32: 16GiB
>>> +    Limit, ARM64: 1TB
>> There is no specific upper bound on the size of PV or HVM guests that I
>> am aware of.  1.5TB HVM domains definitely work, because that's what we
>> test and support in XenServer.
> Are there limits for 32-bit guests?  There's some complicated limit
> having to do with the m2p, right?

32bit PV guests need to live in MFNs under the 128G boundary, despite
the fact their p2m handling supports 4TB of RAM.

The PVinPVH plan will lift this limitation, at which point it will be
possible to have many 128G 32bit PV(inPVH) VMs on a large system. 
(OTOH, I'm not aware of any 32bit PV guest which itself supports more
than 64G of RAM, other than perhaps SLES 11.)

>
>>> +
>>> +### x86 PV/Event Channels
>>> +
>>> +    Limit: 131072
>> Why do we call out event channel limits but not grant table limits? 
>> Also, why is this x86?  The 2l and fifo ABIs are arch agnostic, as far
>> as I am aware.
> Sure, but I'm pretty sure that ARM guests don't (perhaps cannot?) use PV
> event channels.

This is mixing the hypervisor API/ABI capabilities with the actual
abilities of guests (which is also different to what Linux would use in
the guests).

ARM guests, as well as x86 HVM with APICV (configured properly) will
actively want to avoid the guest event channel interface, because its
slower.

This solitary evtchn limit serves no useful purpose IMO.

>
>>> +## High Availability and Fault Tolerance
>>> +
>>> +### Live Migration, Save & Restore
>>> +
>>> +    Status, x86: Supported
>> With caveats.  From docs/features/migration.pandoc
> This would extend the meaning of "caveats" from "when it's not security
> supported" to "when it doesn't work"; which is probably the best thing
> at the moment.

I wasn't specifically taking your meaning of caveats.

>
>> * x86 HVM with nested-virt (no relevant information included in the stream)
> [snip]
>> Also, features such as vNUMA and nested virt (which are two I know for
>> certain) have all state discarded on the source side, because they were
>> never suitably plumbed in.
> OK, I'll list these, as well as PCI pass-through.
>
> (Actually, vNUMA doesn't seem to be on the list!)
>
> And we should probably add a safety-catch to prevent a VM started with
> any of these from being live-migrated.
>
> In fact, if possible, that should be a whitelist: Any configuration that
> isn't specifically known to work with migration should cause a migration
> command to be refused.

Absolutely everything should be in whitelist form, but Xen has 14 years
of history to clean up after.

> What about the following features?

What do you mean "what about"?  Do you mean "are they migrate safe?"?

Assuming that that is what you mean,

>  * Guest serial console

Which consoles?  A qemu emulated-serial will be qemus problem to deal
with.  Anything xenconsoled based will be the guests problem to deal
with, so pass.

>  * Crash kernels

These are internal to the guest until the point of crash, at which point
you may need SHUTDOWN_soft_reset support to crash successfully.  I don't
think there is any migration interaction.

>  * Transcendent Memory

Excluded from security support by XSA-17.

Legacy migration claimed to have TMEM migration support, but the code
was sufficiently broken that I persuaded Konrad to not block Migration
v2 on getting TMEM working again.  Its current state is "will be lost on
migrate if you try to use it", because it also turns out it is
nontrivial to work out if there are TMEM pages needing moving.

>  * Alternative p2m

Lost on migrate.

>  * vMCE

There appears to be code to move state in the migrate stream.  Whether
it works or not is an entirely different matter.

>  * vPMU

Lost on migrate.  Furthermore, levelling vPMU is far harder than
levelling CPUID.  Anything using vPMU and migrated to non-identical
hardware likely to blow up at the destination when a previously
established PMU setting now takes a #GP fault.

>  * Intel Platform QoS

Not exposed to guests at all, so it has no migration interaction atm.

>  * Remus
>  * COLO

These are both migration protocols themselves, so don't really fit into
this category.  Anything with works in normal migration should work when
using these.

>  * PV protocols: Keyboard, PVUSB, PVSCSI, PVTPM, 9pfs, pvcalls?

Pass.  These will be far more to do with what is arranged in the
receiving dom0 by the toolstack.

PVTPM is the only one I'm aware of with state held outside of the rings,
and I'm not aware of any support for moving that state.

>  * FlASK?

I don't know what you mean by this.  Flask is a setting in the
hypervisor, and isn't exposed to the guest.

>  * CPU / memory hotplug?

We don't have memory hotplug, and CPU hotplug is complicated.  PV guests
don't have hotplug (they have "give the guest $MAX and ask it politely
to give some back"), while for HVM guests it is currently performed by
Qemu.  PVH is going to complicate things further with various bits being
performed by Xen.

>
>> * x86 HVM guest physmap operations (not reflected in logdirty bitmap)
>> * x86 PV P2M structure changes (not noticed, stale mappings used) for
>>   guests not using the linear p2m layout
> I'm afraid this isn't really appropriate for a user-facing document.
> Users don't directly do physmap operations, nor p2m structure changes.
> We need to tell them specifically which features they can or cannot use.

I didn't intend this to be a straight copy/paste into the user facing
document, but rather to highlight the already-known issues.

In practice, this means "no ballooning", except you've got no way of
stopping the guest using add_to/remove_from physmap on itself, so there
is nothing the toolstack can do to prevent a guest from accidentally
falling into these traps.

>
>> * x86 HVM with PoD pages (attempts to map cause PoD allocations)
> This shouldn't be any more dangerous than a guest-side sweep, should it?

Except that for XSA-150, the sweep isn't guest wide.  It is only of the
last 32 allocated frames.

>  You may waste a lot of time reclaiming zero pages, but it seems like it
> should only be a relatively minor performance issue, not a correctness
> issue.

The overwhelmingly common case is that when the migration stream tries
to map a gfn, the demand population causes a crash on the source side,
because xenforeignmemory_map() does a P2M_ALLOC lookup and can't find a
frame.

>
> The main "problem" (in terms of "surprising behavior") would be that on
> the remote side any PoD pages will actually be allocated zero pages.  So
> if your guest was booted with memmax=4096 and memory=2048, but your
> balloon driver had only ballooned down to 3000 for some reason (and then
> stopped), the remote side would want 3000 MiB (not 2048, as one might
> expect).

If there are too many frames in the migration stream, the destination
side will fail because of going over allocation.

>
>> * x86 PV ballooning (P2M marked dirty, target frame not marked)
> Er, this should probably be fixed.  What exactly is the problem here?

P2M structure changes don't cause all frames under the change to be
resent.  This is mainly a problem when ballooning out a frame (which has
already been sent in the stream), at which point we get too much memory
on the destination side, and go over allocation.

~Andrew

Stefano Stabellini Oct. 23, 2017, 8:57 p.m. UTC | #19

On Mon, 23 Oct 2017, Andrew Cooper wrote:
> >>> +### x86 PV/Event Channels
> >>> +
> >>> +    Limit: 131072
> >> Why do we call out event channel limits but not grant table limits? 
> >> Also, why is this x86?  The 2l and fifo ABIs are arch agnostic, as far
> >> as I am aware.
> > Sure, but I'm pretty sure that ARM guests don't (perhaps cannot?) use PV
> > event channels.
> 
> This is mixing the hypervisor API/ABI capabilities with the actual
> abilities of guests (which is also different to what Linux would use in
> the guests).
> 
> ARM guests, as well as x86 HVM with APICV (configured properly) will
> actively want to avoid the guest event channel interface, because its
> slower.
> 
> This solitary evtchn limit serves no useful purpose IMO.

Just a clarification: ARM guests have event channels. They are delivered
to the guest using a single PPI (per processor interrupt). I am pretty
sure that limit on the number of event channels on ARM is the same as on
x86 because they both depend on the same fifo ABI.

George Dunlap Oct. 24, 2017, 10:27 a.m. UTC | #20

On 10/23/2017 06:55 PM, Andrew Cooper wrote:
> On 23/10/17 17:22, George Dunlap wrote:
>> On 09/11/2017 06:53 PM, Andrew Cooper wrote:
>>> On 11/09/17 18:01, George Dunlap wrote:
>>>> +### x86/RAM
>>>> +
>>>> +    Limit, x86: 16TiB
>>>> +    Limit, ARM32: 16GiB
>>>> +    Limit, ARM64: 5TiB
>>>> +
>>>> +[XXX: Andy to suggest what this should say for x86]
>>> The limit for x86 is either 16TiB or 123TiB, depending on
>>> CONFIG_BIGMEM.  CONFIG_BIGMEM is exposed via menuconfig without
>>> XEN_CONFIG_EXPERT, so falls into at least some kind of support statement.
>>>
>>> As for practical limits, I don't think its reasonable to claim anything
>>> which we can't test.  What are the specs in the MA colo?
>> At the moment the "Limit" tag specifically says that it's theoretical
>> and may not work.
>>
>> We could add another tag, "Limit-tested", or something like that.
>>
>> Or, we could simply have the Limit-security be equal to the highest
>> amount which has been tested (either by osstest or downstreams).
>>
>> For simplicity's sake I'd go with the second one.
> 
> It think it would be very helpful to distinguish the upper limits from
> the supported limits.  There will be a large difference between the two.
> 
> Limit-Theoretical and Limit-Supported ?

Well "supported" without any modifiers implies "security supported".  So
perhaps we could just `s/Limit-security/Limit-supported/;` ?

> 
> In all cases, we should identify why the limit is where it is, even if
> that is only "maximum people have tested to".  Other

This document is already fairly complicated, and a massive amount of
work (as each line is basically an invitation to bike-shedding).  If
it's OK with you, I'll leave the introduction of where the limit comes
from for a motivated individual to add in a subsequent patch. :-)

>> Shall I write an e-mail with a more direct query for the maximum amounts
>> of various numbers tested by the XenProject (via osstest), Citrix, SuSE,
>> and Oracle?
> 
> For XenServer,
> http://docs.citrix.com/content/dam/docs/en-us/xenserver/current-release/downloads/xenserver-config-limits.pdf
> 
>>> [root@fusebot ~]# python
>>> Python 2.7.5 (default, Nov 20 2015, 02:00:19)
>>> [GCC 4.8.5 20150623 (Red Hat 4.8.5-4)] on linux2
>>> Type "help", "copyright", "credits" or "license" for more information.
>>>>>> from xen.lowlevel.xc import xc as XC
>>>>>> xc = XC()
>>>>>> xc.domain_create()
>>> 1
>>>>>> xc.domain_max_vcpus(1, 8192)
>>> 0
>>>>>> xc.domain_create()
>>> 2
>>>>>> xc.domain_max_vcpus(2, 8193)
>>> Traceback (most recent call last):
>>>   File "<stdin>", line 1, in <module>
>>> xen.lowlevel.xc.Error: (22, 'Invalid argument')
>>>
>>> Trying to shut such a domain down however does tickle a host watchdog
>>> timeout as the for_each_vcpu() loops in domain_kill() are very long.
>> For now I'll set 'Limit' to 8192, and 'Limit-security' to 512.
>> Depending on what I get for the "test limit" survey I may adjust it
>> afterwards.
> 
> The largest production x86 server I am aware of is a Skylake-S system
> with 496 threads.  512 is not a plausibly-tested number.
> 
>>
>>>> +    Limit, x86 HVM: 128
>>>> +    Limit, ARM32: 8
>>>> +    Limit, ARM64: 128
>>>> +
>>>> +[XXX Andrew Cooper: Do want to add "Limit-Security" here for some of these?]
>>> 32 for each.  64 vcpu HVM guests can excerpt enough p2m lock pressure to
>>> trigger a 5 second host watchdog timeout.
>> Is that "32 for x86 PV and x86 HVM", or "32 for x86 HVM and ARM64"?  Or
>> something else?
> 
> The former.  I'm not qualified to comment on any of the ARM limits.
> 
> There are several non-trivial for_each_vcpu() loops in the domain_kill
> path which aren't handled by continuations.  ISTR 128 vcpus is enough to
> trip a watchdog timeout when freeing pagetables.

I don't think 32 is a really practical limit.  I'm inclined to say that
if a rogue guest can crash a host with 33 vcpus, we should issue an XSA
and fix it.

>>>> +### Virtual RAM
>>>> +
>>>> +    Limit, x86 PV: >1TB
>>>> +    Limit, x86 HVM: 1TB
>>>> +    Limit, ARM32: 16GiB
>>>> +    Limit, ARM64: 1TB
>>> There is no specific upper bound on the size of PV or HVM guests that I
>>> am aware of.  1.5TB HVM domains definitely work, because that's what we
>>> test and support in XenServer.
>> Are there limits for 32-bit guests?  There's some complicated limit
>> having to do with the m2p, right?
> 
> 32bit PV guests need to live in MFNs under the 128G boundary, despite
> the fact their p2m handling supports 4TB of RAM.

That's what I was looking for.  Let me see if I can find a concise way
to represent that.

> The PVinPVH plan will lift this limitation, at which point it will be
> possible to have many 128G 32bit PV(inPVH) VMs on a large system. 
> (OTOH, I'm not aware of any 32bit PV guest which itself supports more
> than 64G of RAM, other than perhaps SLES 11.)

Right, but PVinPVH is a different monster.

>>>> +
>>>> +### x86 PV/Event Channels
>>>> +
>>>> +    Limit: 131072
>>> Why do we call out event channel limits but not grant table limits? 
>>> Also, why is this x86?  The 2l and fifo ABIs are arch agnostic, as far
>>> as I am aware.
>> Sure, but I'm pretty sure that ARM guests don't (perhaps cannot?) use PV
>> event channels.
> 
> This is mixing the hypervisor API/ABI capabilities with the actual
> abilities of guests (which is also different to what Linux would use in
> the guests).

I'd say rather that you are mixing up the technical abilities of a
system with user-facing features.  :-)  At the moment there is no reason
for any ARM user to even think about event channels, so there's no
reason to bother them with the technical details.  If at some point that
changes, we can modify the document.

> ARM guests, as well as x86 HVM with APICV (configured properly) will
> actively want to avoid the guest event channel interface, because its
> slower.
> 
> This solitary evtchn limit serves no useful purpose IMO.

There may be a point to what you're saying: The event channel limit
normally manifests itself as a limit on the number of guests / total
devices.

On the other hand, having these kinds of limits around does make sense.

Let me give it some thoughts.  (If anyone else has any opinions...)

>>>> +## High Availability and Fault Tolerance
>>>> +
>>>> +### Live Migration, Save & Restore
>>>> +
>>>> +    Status, x86: Supported
>>> * x86 HVM with nested-virt (no relevant information included in the stream)
>> [snip]
>>> Also, features such as vNUMA and nested virt (which are two I know for
>>> certain) have all state discarded on the source side, because they were
>>> never suitably plumbed in.
>> OK, I'll list these, as well as PCI pass-through.
>>
>> (Actually, vNUMA doesn't seem to be on the list!)
>>
>> And we should probably add a safety-catch to prevent a VM started with
>> any of these from being live-migrated.
>>
>> In fact, if possible, that should be a whitelist: Any configuration that
>> isn't specifically known to work with migration should cause a migration
>> command to be refused.
> 
> Absolutely everything should be in whitelist form, but Xen has 14 years
> of history to clean up after.
> 
>> What about the following features?
> 
> What do you mean "what about"?  Do you mean "are they migrate safe?"?

"Are they compatible with migration", yes.  By which I mean, "Do they
operate as one would reasonably expect?"

>>  * Guest serial console
> 
> Which consoles?  A qemu emulated-serial will be qemus problem to deal
> with.  Anything xenconsoled based will be the guests problem to deal
> with, so pass.

If the guest sets up extra consoles, these will show up in some
appropriately-discoverable place after the migrate?

>>  * Crash kernels
> 
> These are internal to the guest until the point of crash, at which point
> you may need SHUTDOWN_soft_reset support to crash successfully.  I don't
> think there is any migration interaction.

For some reason I thought you had to upload your kernel before the soft
reset.  If the crash kernel lives entirely in the guest until the crash
actually happens, then yes, this should be safe.

>>  * Transcendent Memory
> 
> Excluded from security support by XSA-17.
> 
> Legacy migration claimed to have TMEM migration support, but the code
> was sufficiently broken that I persuaded Konrad to not block Migration
> v2 on getting TMEM working again.  Its current state is "will be lost on
> migrate if you try to use it", because it also turns out it is
> nontrivial to work out if there are TMEM pages needing moving.
> 
>>  * Alternative p2m
> 
> Lost on migrate.
> 
>>  * vMCE
> 
> There appears to be code to move state in the migrate stream.  Whether
> it works or not is an entirely different matter.
> 
>>  * vPMU
> 
> Lost on migrate.  Furthermore, levelling vPMU is far harder than
> levelling CPUID.  Anything using vPMU and migrated to non-identical
> hardware likely to blow up at the destination when a previously
> established PMU setting now takes a #GP fault.
> 
>>  * Intel Platform QoS
> 
> Not exposed to guests at all, so it has no migration interaction atm.

Well suppose a user limited a guest to using only 1k of L3 cache, and
then saved and restored it.  Would she be surprised that the QoS limit
disappeared?

I think so, so we should probably call it out.

>>  * Remus
>>  * COLO
> 
> These are both migration protocols themselves, so don't really fit into
> this category.  Anything with works in normal migration should work when
> using these.

The question is, "If I have a VM which is using Remus, can I call `xl
migrate/(save+restore)` on it?"

I.e., suppose I have a VM on host A (local) being replicated to host X
(remote) via REMUS.  Can I migrate that VM to host B (also local), while
maintaining the replication to host X?

Sounds like the answer is "no", so these are not compatible.

>>  * PV protocols: Keyboard, PVUSB, PVSCSI, PVTPM, 9pfs, pvcalls?
> 
> Pass.  These will be far more to do with what is arranged in the
> receiving dom0 by the toolstack.

No, no pass.  This is exactly the question:  If I call "xl migrate" or
"xl save+xl restore" on a VM using these, will the toolstack on receive
/ restore re-arrange these features in a sensible way?

If the answer is "no", then these are not compatible with migration.

> PVTPM is the only one I'm aware of with state held outside of the rings,
> and I'm not aware of any support for moving that state.
> 
>>  * FlASK?
> 
> I don't know what you mean by this.  Flask is a setting in the
> hypervisor, and isn't exposed to the guest.

Yes, so if I as an administrator give a VM a certain label limiting or
extending its functionality, and then I do a migrate/save+restore, will
that label be applied afterwards?

If the answer is 'no' then we need to specify it.

>>  * CPU / memory hotplug?
> 
> We don't have memory hotplug, and CPU hotplug is complicated.  PV guests
> don't have hotplug (they have "give the guest $MAX and ask it politely
> to give some back"), while for HVM guests it is currently performed by
> Qemu.  PVH is going to complicate things further with various bits being
> performed by Xen.
> 
>>
>>> * x86 HVM guest physmap operations (not reflected in logdirty bitmap)
>>> * x86 PV P2M structure changes (not noticed, stale mappings used) for
>>>   guests not using the linear p2m layout
>> I'm afraid this isn't really appropriate for a user-facing document.
>> Users don't directly do physmap operations, nor p2m structure changes.
>> We need to tell them specifically which features they can or cannot use.
> 
> I didn't intend this to be a straight copy/paste into the user facing
> document, but rather to highlight the already-known issues.
> 
> In practice, this means "no ballooning", except you've got no way of
> stopping the guest using add_to/remove_from physmap on itself, so there
> is nothing the toolstack can do to prevent a guest from accidentally
> falling into these traps.

Hmm, I see: just because we didn't write code to do something doesn't
mean someone else hasn't done it.

I'd probably list the user-level features in this document, and point
people to the pandoc document for more detail.

[More later]

 -George

Julien Grall Oct. 24, 2017, 10:29 a.m. UTC | #21

Hi,

On 23/10/2017 18:55, Andrew Cooper wrote:
> On 23/10/17 17:22, George Dunlap wrote:
>> On 09/11/2017 06:53 PM, Andrew Cooper wrote:
>>> On 11/09/17 18:01, George Dunlap wrote:
>>>> +    Limit, x86 HVM: 128
>>>> +    Limit, ARM32: 8
>>>> +    Limit, ARM64: 128
>>>> +
>>>> +[XXX Andrew Cooper: Do want to add "Limit-Security" here for some of these?]
>>> 32 for each.  64 vcpu HVM guests can excerpt enough p2m lock pressure to
>>> trigger a 5 second host watchdog timeout.
>> Is that "32 for x86 PV and x86 HVM", or "32 for x86 HVM and ARM64"?  Or
>> something else?
>
> The former.  I'm not qualified to comment on any of the ARM limits.

That's a good question. On Arm32 the number of vCPUs is limited by the 
GICv2 implementation.

On Arm64, GICv2 platform can only support up to 8 vCPUs. GICv3 is 
theoretically 4096. But it is capped to 128 vCPUs, IIRC it was just to 
match x86.

>
> There are several non-trivial for_each_vcpu() loops in the domain_kill
> path which aren't handled by continuations.  ISTR 128 vcpus is enough to
> trip a watchdog timeout when freeing pagetables.

On Arm, we have similar for_each_vcpu() in the vGIC code to inject SPIs 
(see vgic_to_sgi). I haven't tried it so far with a high number of 
vCPUs. So I am not sure if we should stick to 128 too. Stefano do you 
have any opinions?

Cheers,

Andrew Cooper Oct. 24, 2017, 11:42 a.m. UTC | #22

On 24/10/17 11:27, George Dunlap wrote:
> On 10/23/2017 06:55 PM, Andrew Cooper wrote:
>> On 23/10/17 17:22, George Dunlap wrote:
>>> On 09/11/2017 06:53 PM, Andrew Cooper wrote:
>>>> On 11/09/17 18:01, George Dunlap wrote:
>>>>> +### x86/RAM
>>>>> +
>>>>> +    Limit, x86: 16TiB
>>>>> +    Limit, ARM32: 16GiB
>>>>> +    Limit, ARM64: 5TiB
>>>>> +
>>>>> +[XXX: Andy to suggest what this should say for x86]
>>>> The limit for x86 is either 16TiB or 123TiB, depending on
>>>> CONFIG_BIGMEM.  CONFIG_BIGMEM is exposed via menuconfig without
>>>> XEN_CONFIG_EXPERT, so falls into at least some kind of support statement.
>>>>
>>>> As for practical limits, I don't think its reasonable to claim anything
>>>> which we can't test.  What are the specs in the MA colo?
>>> At the moment the "Limit" tag specifically says that it's theoretical
>>> and may not work.
>>>
>>> We could add another tag, "Limit-tested", or something like that.
>>>
>>> Or, we could simply have the Limit-security be equal to the highest
>>> amount which has been tested (either by osstest or downstreams).
>>>
>>> For simplicity's sake I'd go with the second one.
>> It think it would be very helpful to distinguish the upper limits from
>> the supported limits.  There will be a large difference between the two.
>>
>> Limit-Theoretical and Limit-Supported ?
> Well "supported" without any modifiers implies "security supported".  So
> perhaps we could just `s/Limit-security/Limit-supported/;` ?

By this, you mean use Limit-Supported throughout this document?  That
sounds like a good plan.

>
>>>>> +    Limit, x86 HVM: 128
>>>>> +    Limit, ARM32: 8
>>>>> +    Limit, ARM64: 128
>>>>> +
>>>>> +[XXX Andrew Cooper: Do want to add "Limit-Security" here for some of these?]
>>>> 32 for each.  64 vcpu HVM guests can excerpt enough p2m lock pressure to
>>>> trigger a 5 second host watchdog timeout.
>>> Is that "32 for x86 PV and x86 HVM", or "32 for x86 HVM and ARM64"?  Or
>>> something else?
>> The former.  I'm not qualified to comment on any of the ARM limits.
>>
>> There are several non-trivial for_each_vcpu() loops in the domain_kill
>> path which aren't handled by continuations.  ISTR 128 vcpus is enough to
>> trip a watchdog timeout when freeing pagetables.
> I don't think 32 is a really practical limit.

What do you mean by practical here, and what evidence are you basing
this on?

Amongst other things, there is an ABI boundary in Xen at 32 vcpus, and
given how often it is broken in Linux, its clear that there isn't
regular testing happening beyond this limit.

> I'm inclined to say that if a rogue guest can crash a host with 33 vcpus, we should issue an XSA
> and fix it.

The reason XenServer limits at 32 vcpus is that I can crash Xen with a
64 vcpu HVM domain.  The reason it hasn't been my top priority to fix
this is because there is very little customer interest in pushing this
limit higher.

Obviously, we should fix issues as and when they are discovered, and
work towards increasing the limits in the longterm, but saying "this
limit seems too low, so lets provisionally set it higher" is short
sighted and a recipe for more XSAs.

>>>>> +
>>>>> +### x86 PV/Event Channels
>>>>> +
>>>>> +    Limit: 131072
>>>> Why do we call out event channel limits but not grant table limits? 
>>>> Also, why is this x86?  The 2l and fifo ABIs are arch agnostic, as far
>>>> as I am aware.
>>> Sure, but I'm pretty sure that ARM guests don't (perhaps cannot?) use PV
>>> event channels.
>> This is mixing the hypervisor API/ABI capabilities with the actual
>> abilities of guests (which is also different to what Linux would use in
>> the guests).
> I'd say rather that you are mixing up the technical abilities of a
> system with user-facing features.  :-)  At the moment there is no reason
> for any ARM user to even think about event channels, so there's no
> reason to bother them with the technical details.  If at some point that
> changes, we can modify the document.

You do realise that receiving an event is entirely asymmetric with
sending an event?

Even on ARM, {net,blk}front needs to speak event_{2l,fifo} with Xen to
bind and use its interdomain event channel(s) with {net,blk}back.

>
>> ARM guests, as well as x86 HVM with APICV (configured properly) will
>> actively want to avoid the guest event channel interface, because its
>> slower.
>>
>> This solitary evtchn limit serves no useful purpose IMO.
> There may be a point to what you're saying: The event channel limit
> normally manifests itself as a limit on the number of guests / total
> devices.
>
> On the other hand, having these kinds of limits around does make sense.
>
> Let me give it some thoughts.  (If anyone else has any opinions...)

The event_fifo limit is per-domain, not system-wide.

In general this only matters for a monolithic dom0, as it is one end of
each event channel in the system.

>
>>>>> +## High Availability and Fault Tolerance
>>>>> +
>>>>> +### Live Migration, Save & Restore
>>>>> +
>>>>> +    Status, x86: Supported
>>>> * x86 HVM with nested-virt (no relevant information included in the stream)
>>> [snip]
>>>> Also, features such as vNUMA and nested virt (which are two I know for
>>>> certain) have all state discarded on the source side, because they were
>>>> never suitably plumbed in.
>>> OK, I'll list these, as well as PCI pass-through.
>>>
>>> (Actually, vNUMA doesn't seem to be on the list!)
>>>
>>> And we should probably add a safety-catch to prevent a VM started with
>>> any of these from being live-migrated.
>>>
>>> In fact, if possible, that should be a whitelist: Any configuration that
>>> isn't specifically known to work with migration should cause a migration
>>> command to be refused.
>> Absolutely everything should be in whitelist form, but Xen has 14 years
>> of history to clean up after.
>>
>>> What about the following features?
>> What do you mean "what about"?  Do you mean "are they migrate safe?"?
> "Are they compatible with migration", yes.  By which I mean, "Do they
> operate as one would reasonably expect?"
>
>>>  * Guest serial console
>> Which consoles?  A qemu emulated-serial will be qemus problem to deal
>> with.  Anything xenconsoled based will be the guests problem to deal
>> with, so pass.
> If the guest sets up extra consoles, these will show up in some
> appropriately-discoverable place after the migrate?

That is a complete can of worms.  Where do you draw the line?  log files
will get spliced across the migrate point, and `xl console $DOM` will
terminate, but whether this is "reasonably expected" is very subjective.

>
>>>  * Crash kernels
>> These are internal to the guest until the point of crash, at which point
>> you may need SHUTDOWN_soft_reset support to crash successfully.  I don't
>> think there is any migration interaction.
> For some reason I thought you had to upload your kernel before the soft
> reset.  If the crash kernel lives entirely in the guest until the crash
> actually happens, then yes, this should be safe.
>
>>>  * Transcendent Memory
>> Excluded from security support by XSA-17.
>>
>> Legacy migration claimed to have TMEM migration support, but the code
>> was sufficiently broken that I persuaded Konrad to not block Migration
>> v2 on getting TMEM working again.  Its current state is "will be lost on
>> migrate if you try to use it", because it also turns out it is
>> nontrivial to work out if there are TMEM pages needing moving.
>>
>>>  * Alternative p2m
>> Lost on migrate.
>>
>>>  * vMCE
>> There appears to be code to move state in the migrate stream.  Whether
>> it works or not is an entirely different matter.
>>
>>>  * vPMU
>> Lost on migrate.  Furthermore, levelling vPMU is far harder than
>> levelling CPUID.  Anything using vPMU and migrated to non-identical
>> hardware likely to blow up at the destination when a previously
>> established PMU setting now takes a #GP fault.
>>
>>>  * Intel Platform QoS
>> Not exposed to guests at all, so it has no migration interaction atm.
> Well suppose a user limited a guest to using only 1k of L3 cache, and
> then saved and restored it.  Would she be surprised that the QoS limit
> disappeared?
>
> I think so, so we should probably call it out.

Oh - you mean the xl configuration.

A quick `git grep` says that libxl_psr.c isn't referenced by any other
code in libxl, which means that the settings almost certainly get lost
on migrate.

>
>>>  * Remus
>>>  * COLO
>> These are both migration protocols themselves, so don't really fit into
>> this category.  Anything with works in normal migration should work when
>> using these.
> The question is, "If I have a VM which is using Remus, can I call `xl
> migrate/(save+restore)` on it?"

There is no such thing as "A VM using Remus/COLO" which isn't migrating.

Calling `xl migrate` a second time is user error, and they get to keep
all the pieces.

>
> I.e., suppose I have a VM on host A (local) being replicated to host X
> (remote) via REMUS.  Can I migrate that VM to host B (also local), while
> maintaining the replication to host X?
>
> Sounds like the answer is "no", so these are not compatible.

I think your expectations are off here.

To move a VM which is using remus/colo, you let it fail-over to the
destination then start replicating it again to a 3rd location.

Attempting to do what you describe is equivalent to `xl migrate $DOM $X
& xl migrate $DOM $Y` and expecting any pieces to remain intact.

(As a complete guess) what will most likely happen is that one stream
will get memory corruption, and the other stream will take a hard error
on the source side, because both of them are trying to be the
controlling entity for logdirty mode.  One stream has logdirty turned
off behind its back, and the other gets a hard error for trying to
enable logdirty mode a second time.

>
>>>  * PV protocols: Keyboard, PVUSB, PVSCSI, PVTPM, 9pfs, pvcalls?
>> Pass.  These will be far more to do with what is arranged in the
>> receiving dom0 by the toolstack.
> No, no pass.  This is exactly the question:  If I call "xl migrate" or
> "xl save+xl restore" on a VM using these, will the toolstack on receive
> / restore re-arrange these features in a sensible way?
>
> If the answer is "no", then these are not compatible with migration.

The answer is no until proved otherwise.  I do not know the answer to
these (hence the pass), although I heavily suspect the answer is
definitely no for PVTVM.

>
>> PVTPM is the only one I'm aware of with state held outside of the rings,
>> and I'm not aware of any support for moving that state.
>>
>>>  * FlASK?
>> I don't know what you mean by this.  Flask is a setting in the
>> hypervisor, and isn't exposed to the guest.
> Yes, so if I as an administrator give a VM a certain label limiting or
> extending its functionality, and then I do a migrate/save+restore, will
> that label be applied afterwards?
>
> If the answer is 'no' then we need to specify it.

I don't know the answer.

~Andrew

George Dunlap Oct. 24, 2017, 2 p.m. UTC | #23

On Tue, Sep 12, 2017 at 4:35 PM, Rich Persaud <persaur@gmail.com> wrote:
>> On Sep 11, 2017, at 13:01, George Dunlap <george.dunlap@citrix.com> wrote:
>>
>> +### XSM & FLASK
>> +
>> +    Status: Experimental
>> +
>> +Compile time disabled
>> +
>> +### XSM & FLASK support for IS_PRIV
>> +
>> +    Status: Experimental
>
> In which specific areas is XSM lacking in Functional completeness, Functional stability and/or Interface stability, resulting in "Experimental" status?  What changes to XSM would be needed for it to qualify for "Supported" status?

So first of all, I guess there's two "features" here: One is XSM /
FLASK itself, which downstreams such OpenXT can use do make their own
policies.  The second is the "default FLASK policy", shipped with Xen,
which has rules and labels for things in a "normal" Xen system: domUs,
driver domains, stub domains, dom0, &c.  There was a time when you
could simply enable that and a basic Xen System would Just Work, and
(in theory) would be more secure than the default Xen system.  It
probably makes sense to treat these separately.

Two problems we have so far: The first is that the policy bitrots
fairly quickly.  At the moment we don't have proper testing, and we
don't really have anyone that knows how to fix it if it does break.

The second problem is that while functional testing can show that the
default policy is *at least* as permissive as not having FLASK enabled
at all, it's a lot more difficult to show that having FLASK enabled
isn't in some cases *more permissive* than we would like to be by
default.  We've noticed issues before where enabling XSM accidentally
gives a domU access to hypercalls or settings it wouldn't have access
to otherwise.  Absent some way of automatically catching these
changes, we're not sure we could recommend people use the default
policy, even if we had confidence (via testing) that it wouldn't break
people's functionality on update.

The "default policy bitrot" problem won't be one for you, because (as
I understand it) you write your own custom policies.  But the second
issue should be more concerning: when you update to a new version of
Xen, what confidence do you have that your old policies will still
adequately restrict guests from dangerous new functionality?

I think sorting the second question out is basically what it would
take to call FLASK by itself (as opposed to the default policy)
"Supported".  (And if you can make an argument that this is already
sorted, then we can list FLASK itself as "supported".)

> If there will be no security support for features in Experimental status, would Xen Project accept patches to fix XSM security issues?  Could downstream projects issue CVEs for XSM security issues, if these will not be issued by Xen Project?

Experimental status is about 1) our assessment of how reliable the
feature is, and 2) whether we will issue XSAs if security-related bugs
are found.  We will of course accept patches to improve functionality,
and it's likely that if someone only *reports* a bug that people on
the list will be able to come up with a fix.

Regarding CVEs, I guess what you care about is whether as our own CNA,
the XenProject would be willing to issue CVEs for XSM security issues,
and/or perhaps whether we would mind if you asked Mitre directly
instead.

That's slightly a different topic, which we should probably discuss
when we become a CNA.  But to give you an idea where I'm at, I think
the question is: What kind of a bug do you think you'd issue a CVE for
(and/or, an XSA)?

 -George

George Dunlap Oct. 24, 2017, 3:22 p.m. UTC | #24

On Fri, Sep 15, 2017 at 3:51 PM, Konrad Rzeszutek Wilk
<konrad.wilk@oracle.com> wrote:
>> +### Soft-reset for PV guests
>
> s/PV/HVM/

Is it?  I thought this was for RHEL 5 PV guests to be able to do crash kernels.

>> +### Transcendent Memory
>> +
>> +    Status: Experimental
>> +
>> +[XXX Add description]
>
> Guests with tmem drivers autoballoon memory out allowing a fluid
> and dynamic memory allocation - in effect memory overcommit without
> the need to swap. Only works with Linux guests (as it requires
> OS drivers).

But autoballooning doesn't require any support in Xen, right?  I
thought the TMEM support in Xen was more about the trancendent memory
backends.

> ..snip..
>> +### Live Patching
>> +
>> +    Status, x86: Supported
>> +    Status, ARM: Experimental
>> +
>> +Compile time disabled
>
> for ARM.
>
> As the patch will do:
>
>  config LIVEPATCH
> -       bool "Live patching support (TECH PREVIEW)"
> -       default n
> +       bool "Live patching support"
> +       default X86
>         depends on HAS_BUILD_ID = "y"
>         ---help---
>           Allows a running Xen hypervisor to be dynamically patched using

Ack

 -George

George Dunlap Oct. 25, 2017, 10:59 a.m. UTC | #25

On Tue, Oct 24, 2017 at 12:42 PM, Andrew Cooper
<andrew.cooper3@citrix.com> wrote:
> On 24/10/17 11:27, George Dunlap wrote:
>> On 10/23/2017 06:55 PM, Andrew Cooper wrote:
>>> On 23/10/17 17:22, George Dunlap wrote:
>>>> On 09/11/2017 06:53 PM, Andrew Cooper wrote:
>>>>> On 11/09/17 18:01, George Dunlap wrote:
>>>>>> +### x86/RAM
>>>>>> +
>>>>>> +    Limit, x86: 16TiB
>>>>>> +    Limit, ARM32: 16GiB
>>>>>> +    Limit, ARM64: 5TiB
>>>>>> +
>>>>>> +[XXX: Andy to suggest what this should say for x86]
>>>>> The limit for x86 is either 16TiB or 123TiB, depending on
>>>>> CONFIG_BIGMEM.  CONFIG_BIGMEM is exposed via menuconfig without
>>>>> XEN_CONFIG_EXPERT, so falls into at least some kind of support statement.
>>>>>
>>>>> As for practical limits, I don't think its reasonable to claim anything
>>>>> which we can't test.  What are the specs in the MA colo?
>>>> At the moment the "Limit" tag specifically says that it's theoretical
>>>> and may not work.
>>>>
>>>> We could add another tag, "Limit-tested", or something like that.
>>>>
>>>> Or, we could simply have the Limit-security be equal to the highest
>>>> amount which has been tested (either by osstest or downstreams).
>>>>
>>>> For simplicity's sake I'd go with the second one.
>>> It think it would be very helpful to distinguish the upper limits from
>>> the supported limits.  There will be a large difference between the two.
>>>
>>> Limit-Theoretical and Limit-Supported ?
>> Well "supported" without any modifiers implies "security supported".  So
>> perhaps we could just `s/Limit-security/Limit-supported/;` ?
>
> By this, you mean use Limit-Supported throughout this document?  That
> sounds like a good plan.

Yes, that's basically what I meant.

>>>>>> +    Limit, x86 HVM: 128
>>>>>> +    Limit, ARM32: 8
>>>>>> +    Limit, ARM64: 128
>>>>>> +
>>>>>> +[XXX Andrew Cooper: Do want to add "Limit-Security" here for some of these?]
>>>>> 32 for each.  64 vcpu HVM guests can excerpt enough p2m lock pressure to
>>>>> trigger a 5 second host watchdog timeout.
>>>> Is that "32 for x86 PV and x86 HVM", or "32 for x86 HVM and ARM64"?  Or
>>>> something else?
>>> The former.  I'm not qualified to comment on any of the ARM limits.
>>>
>>> There are several non-trivial for_each_vcpu() loops in the domain_kill
>>> path which aren't handled by continuations.  ISTR 128 vcpus is enough to
>>> trip a watchdog timeout when freeing pagetables.
>> I don't think 32 is a really practical limit.
>
> What do you mean by practical here, and what evidence are you basing
> this on?
>
> Amongst other things, there is an ABI boundary in Xen at 32 vcpus, and
> given how often it is broken in Linux, its clear that there isn't
> regular testing happening beyond this limit.

Is that true for dom0 as well?

>> I'm inclined to say that if a rogue guest can crash a host with 33 vcpus, we should issue an XSA
>> and fix it.
>
> The reason XenServer limits at 32 vcpus is that I can crash Xen with a
> 64 vcpu HVM domain.  The reason it hasn't been my top priority to fix
> this is because there is very little customer interest in pushing this
> limit higher.
>
> Obviously, we should fix issues as and when they are discovered, and
> work towards increasing the limits in the longterm, but saying "this
> limit seems too low, so lets provisionally set it higher" is short
> sighted and a recipe for more XSAs.

OK -- I'll set this to 32 for now and see if anyone else wants to
argue for a different value.

>>>>>> +
>>>>>> +### x86 PV/Event Channels
>>>>>> +
>>>>>> +    Limit: 131072
>>>>> Why do we call out event channel limits but not grant table limits?
>>>>> Also, why is this x86?  The 2l and fifo ABIs are arch agnostic, as far
>>>>> as I am aware.
>>>> Sure, but I'm pretty sure that ARM guests don't (perhaps cannot?) use PV
>>>> event channels.
>>> This is mixing the hypervisor API/ABI capabilities with the actual
>>> abilities of guests (which is also different to what Linux would use in
>>> the guests).
>> I'd say rather that you are mixing up the technical abilities of a
>> system with user-facing features.  :-)  At the moment there is no reason
>> for any ARM user to even think about event channels, so there's no
>> reason to bother them with the technical details.  If at some point that
>> changes, we can modify the document.
>
> You do realise that receiving an event is entirely asymmetric with
> sending an event?
>
> Even on ARM, {net,blk}front needs to speak event_{2l,fifo} with Xen to
> bind and use its interdomain event channel(s) with {net,blk}back.

I guess I didn't realize that (and just noticed Stefano's comment
saying ARM uses event channels).

>>> ARM guests, as well as x86 HVM with APICV (configured properly) will
>>> actively want to avoid the guest event channel interface, because its
>>> slower.
>>>
>>> This solitary evtchn limit serves no useful purpose IMO.
>> There may be a point to what you're saying: The event channel limit
>> normally manifests itself as a limit on the number of guests / total
>> devices.
>>
>> On the other hand, having these kinds of limits around does make sense.
>>
>> Let me give it some thoughts.  (If anyone else has any opinions...)
>
> The event_fifo limit is per-domain, not system-wide.
>
> In general this only matters for a monolithic dom0, as it is one end of
> each event channel in the system.

Sure -- and that's why the limit used to matter.  It doesn't seem to
matter at the moment because you now hit other resource bottlenecks
before you hit the event channel limit.

>>>>  * Guest serial console
>>> Which consoles?  A qemu emulated-serial will be qemus problem to deal
>>> with.  Anything xenconsoled based will be the guests problem to deal
>>> with, so pass.
>> If the guest sets up extra consoles, these will show up in some
>> appropriately-discoverable place after the migrate?
>
> That is a complete can of worms.  Where do you draw the line?  log files
> will get spliced across the migrate point, and `xl console $DOM` will
> terminate, but whether this is "reasonably expected" is very subjective.

Log files getting spliced and `xl console` terminating is I think
reasonable to expect.  I was more talking about the "channel" feature
(see xl.cfg man page on 'channels') -- will the device file show up on
the remote dom0 after migration?

But I suppose that feature doesn't really belong under "debugging,
analysis, and crash post-mortem".

>>>>  * Intel Platform QoS
>>> Not exposed to guests at all, so it has no migration interaction atm.
>> Well suppose a user limited a guest to using only 1k of L3 cache, and
>> then saved and restored it.  Would she be surprised that the QoS limit
>> disappeared?
>>
>> I think so, so we should probably call it out.
>
> Oh - you mean the xl configuration.
>
> A quick `git grep` says that libxl_psr.c isn't referenced by any other
> code in libxl, which means that the settings almost certainly get lost
> on migrate.

Can't you modify restrictions after the VM is started?  But either
way, they won't be there after migrate, which may be surprising.

>>>>  * Remus
>>>>  * COLO
>>> These are both migration protocols themselves, so don't really fit into
>>> this category.  Anything with works in normal migration should work when
>>> using these.
>> The question is, "If I have a VM which is using Remus, can I call `xl
>> migrate/(save+restore)` on it?"
>
> There is no such thing as "A VM using Remus/COLO" which isn't migrating.
>
> Calling `xl migrate` a second time is user error, and they get to keep
> all the pieces.
>
>>
>> I.e., suppose I have a VM on host A (local) being replicated to host X
>> (remote) via REMUS.  Can I migrate that VM to host B (also local), while
>> maintaining the replication to host X?
>>
>> Sounds like the answer is "no", so these are not compatible.
>
> I think your expectations are off here.
>
> To move a VM which is using remus/colo, you let it fail-over to the
> destination then start replicating it again to a 3rd location.
>
> Attempting to do what you describe is equivalent to `xl migrate $DOM $X
> & xl migrate $DOM $Y` and expecting any pieces to remain intact.
>
> (As a complete guess) what will most likely happen is that one stream
> will get memory corruption, and the other stream will take a hard error
> on the source side, because both of them are trying to be the
> controlling entity for logdirty mode.  One stream has logdirty turned
> off behind its back, and the other gets a hard error for trying to
> enable logdirty mode a second time.

You're confusing mechanism with interface again.  Migration is the
internal mechanism Remus and COLO use, but a user doesn't type "xl
migrate" for any of them, so how are they supposed to know that it's
the same mechanism being used?  And in any case, being able to migrate
a replicated VM from one "local" host to another (as I've described)
seems like a pretty cool feature to me.  If I had time and inclination
to make COLO or Remus awesome I'd try to implement it.  From a user's
perspective, I don't think it's at all a given that it doesn't work;
so we need to tell them.

>>>>  * PV protocols: Keyboard, PVUSB, PVSCSI, PVTPM, 9pfs, pvcalls?
>>> Pass.  These will be far more to do with what is arranged in the
>>> receiving dom0 by the toolstack.
>> No, no pass.  This is exactly the question:  If I call "xl migrate" or
>> "xl save+xl restore" on a VM using these, will the toolstack on receive
>> / restore re-arrange these features in a sensible way?
>>
>> If the answer is "no", then these are not compatible with migration.
>
> The answer is no until proved otherwise.  I do not know the answer to
> these (hence the pass), although I heavily suspect the answer is
> definitely no for PVTVM.

Right -- these questions weren't necessarily directed at you, but were
meant to be part of the ongoing discussion.

 -George

Andrew Cooper Oct. 25, 2017, 11:30 a.m. UTC | #26

On 25/10/17 11:59, George Dunlap wrote:
>>>>>>> +    Limit, x86 HVM: 128
>>>>>>> +    Limit, ARM32: 8
>>>>>>> +    Limit, ARM64: 128
>>>>>>> +
>>>>>>> +[XXX Andrew Cooper: Do want to add "Limit-Security" here for some of these?]
>>>>>> 32 for each.  64 vcpu HVM guests can excerpt enough p2m lock pressure to
>>>>>> trigger a 5 second host watchdog timeout.
>>>>> Is that "32 for x86 PV and x86 HVM", or "32 for x86 HVM and ARM64"?  Or
>>>>> something else?
>>>> The former.  I'm not qualified to comment on any of the ARM limits.
>>>>
>>>> There are several non-trivial for_each_vcpu() loops in the domain_kill
>>>> path which aren't handled by continuations.  ISTR 128 vcpus is enough to
>>>> trip a watchdog timeout when freeing pagetables.
>>> I don't think 32 is a really practical limit.
>> What do you mean by practical here, and what evidence are you basing
>> this on?
>>
>> Amongst other things, there is an ABI boundary in Xen at 32 vcpus, and
>> given how often it is broken in Linux, its clear that there isn't
>> regular testing happening beyond this limit.
> Is that true for dom0 as well?

Yes.  The problem is:

struct shared_info {
    struct vcpu_info vcpu_info[XEN_LEGACY_MAX_VCPUS];
...

and while there are ways to make a larger number of vcpus work, it
requires additional hypercalls to make alternate arrangements for the
vcpus beyond the 32 boundary, and these arrangements appear to be broken
more often than not around suspend/resume.

>
>>> I'm inclined to say that if a rogue guest can crash a host with 33 vcpus, we should issue an XSA
>>> and fix it.
>> The reason XenServer limits at 32 vcpus is that I can crash Xen with a
>> 64 vcpu HVM domain.  The reason it hasn't been my top priority to fix
>> this is because there is very little customer interest in pushing this
>> limit higher.
>>
>> Obviously, we should fix issues as and when they are discovered, and
>> work towards increasing the limits in the longterm, but saying "this
>> limit seems too low, so lets provisionally set it higher" is short
>> sighted and a recipe for more XSAs.
> OK -- I'll set this to 32 for now and see if anyone else wants to
> argue for a different value.

Sounds good to me.

>
>>>>>>> +
>>>>>>> +### x86 PV/Event Channels
>>>>>>> +
>>>>>>> +    Limit: 131072
>>>>>> Why do we call out event channel limits but not grant table limits?
>>>>>> Also, why is this x86?  The 2l and fifo ABIs are arch agnostic, as far
>>>>>> as I am aware.
>>>>> Sure, but I'm pretty sure that ARM guests don't (perhaps cannot?) use PV
>>>>> event channels.
>>>> This is mixing the hypervisor API/ABI capabilities with the actual
>>>> abilities of guests (which is also different to what Linux would use in
>>>> the guests).
>>> I'd say rather that you are mixing up the technical abilities of a
>>> system with user-facing features.  :-)  At the moment there is no reason
>>> for any ARM user to even think about event channels, so there's no
>>> reason to bother them with the technical details.  If at some point that
>>> changes, we can modify the document.
>> You do realise that receiving an event is entirely asymmetric with
>> sending an event?
>>
>> Even on ARM, {net,blk}front needs to speak event_{2l,fifo} with Xen to
>> bind and use its interdomain event channel(s) with {net,blk}back.
> I guess I didn't realize that (and just noticed Stefano's comment
> saying ARM uses event channels).
>
>>>> ARM guests, as well as x86 HVM with APICV (configured properly) will
>>>> actively want to avoid the guest event channel interface, because its
>>>> slower.
>>>>
>>>> This solitary evtchn limit serves no useful purpose IMO.
>>> There may be a point to what you're saying: The event channel limit
>>> normally manifests itself as a limit on the number of guests / total
>>> devices.
>>>
>>> On the other hand, having these kinds of limits around does make sense.
>>>
>>> Let me give it some thoughts.  (If anyone else has any opinions...)
>> The event_fifo limit is per-domain, not system-wide.
>>
>> In general this only matters for a monolithic dom0, as it is one end of
>> each event channel in the system.
> Sure -- and that's why the limit used to matter.  It doesn't seem to
> matter at the moment because you now hit other resource bottlenecks
> before you hit the event channel limit.

This point highlights why conjoining the information is misleading.

A dom0 which (for whatever reason) chooses to use event_2l will still
hit the event channel bottlekneck before other resource bottleknecks.

I'd expect the information to look a little more like this (formatting
subject to improvement)

## Event channels

### Event Channel 2-level ABI
Limit-theoretical (per guest): 1024 (32bit guest), 4096 (64bit guest)
Supported

### Event Channel FIFO ABI
Limit-theoretical (per guest): 131072
Supported

(We may want a shorthand for "this is the theoretical limit, and we
support it all the way up to the limit").

>
>>>>>  * Guest serial console
>>>> Which consoles?  A qemu emulated-serial will be qemus problem to deal
>>>> with.  Anything xenconsoled based will be the guests problem to deal
>>>> with, so pass.
>>> If the guest sets up extra consoles, these will show up in some
>>> appropriately-discoverable place after the migrate?
>> That is a complete can of worms.  Where do you draw the line?  log files
>> will get spliced across the migrate point, and `xl console $DOM` will
>> terminate, but whether this is "reasonably expected" is very subjective.
> Log files getting spliced and `xl console` terminating is I think
> reasonable to expect.  I was more talking about the "channel" feature
> (see xl.cfg man page on 'channels') -- will the device file show up on
> the remote dom0 after migration?

A cursory `git grep` doesn't show anything promising.

>
> But I suppose that feature doesn't really belong under "debugging,
> analysis, and crash post-mortem".
>
>>>>>  * Intel Platform QoS
>>>> Not exposed to guests at all, so it has no migration interaction atm.
>>> Well suppose a user limited a guest to using only 1k of L3 cache, and
>>> then saved and restored it.  Would she be surprised that the QoS limit
>>> disappeared?
>>>
>>> I think so, so we should probably call it out.
>> Oh - you mean the xl configuration.
>>
>> A quick `git grep` says that libxl_psr.c isn't referenced by any other
>> code in libxl, which means that the settings almost certainly get lost
>> on migrate.
> Can't you modify restrictions after the VM is started?  But either
> way, they won't be there after migrate, which may be surprising.

It appears that the libxl side of this basically stateless, and just
shuffles settings between the xl cmdline and Xen.

>
>>>>>  * Remus
>>>>>  * COLO
>>>> These are both migration protocols themselves, so don't really fit into
>>>> this category.  Anything with works in normal migration should work when
>>>> using these.
>>> The question is, "If I have a VM which is using Remus, can I call `xl
>>> migrate/(save+restore)` on it?"
>> There is no such thing as "A VM using Remus/COLO" which isn't migrating.
>>
>> Calling `xl migrate` a second time is user error, and they get to keep
>> all the pieces.
>>
>>> I.e., suppose I have a VM on host A (local) being replicated to host X
>>> (remote) via REMUS.  Can I migrate that VM to host B (also local), while
>>> maintaining the replication to host X?
>>>
>>> Sounds like the answer is "no", so these are not compatible.
>> I think your expectations are off here.
>>
>> To move a VM which is using remus/colo, you let it fail-over to the
>> destination then start replicating it again to a 3rd location.
>>
>> Attempting to do what you describe is equivalent to `xl migrate $DOM $X
>> & xl migrate $DOM $Y` and expecting any pieces to remain intact.
>>
>> (As a complete guess) what will most likely happen is that one stream
>> will get memory corruption, and the other stream will take a hard error
>> on the source side, because both of them are trying to be the
>> controlling entity for logdirty mode.  One stream has logdirty turned
>> off behind its back, and the other gets a hard error for trying to
>> enable logdirty mode a second time.
> You're confusing mechanism with interface again.  Migration is the
> internal mechanism Remus and COLO use, but a user doesn't type "xl
> migrate" for any of them, so how are they supposed to know that it's
> the same mechanism being used?  And in any case, being able to migrate
> a replicated VM from one "local" host to another (as I've described)
> seems like a pretty cool feature to me.  If I had time and inclination
> to make COLO or Remus awesome I'd try to implement it.  From a user's
> perspective, I don't think it's at all a given that it doesn't work;
> so we need to tell them.

I don't think its reasonable to expect people to be able to use
Remus/COLO without knowing that it is migration.

OTOH, you are correct that calling `xl migrate` on top of an
already-running Remus/COLO session (or indeed, on top of a plain
migrate) will cause everything to blow up, and there are no interlocks
to prevent such an explosion from happening.

~Andrew

Jan Beulich Oct. 26, 2017, 9:19 a.m. UTC | #27

>>> On 25.10.17 at 13:30, <andrew.cooper3@citrix.com> wrote:
> On 25/10/17 11:59, George Dunlap wrote:
>>>>>>>> +    Limit, x86 HVM: 128
>>>>>>>> +    Limit, ARM32: 8
>>>>>>>> +    Limit, ARM64: 128
>>>>>>>> +
>>>>>>>> +[XXX Andrew Cooper: Do want to add "Limit-Security" here for some of these?]
>>>>>>> 32 for each.  64 vcpu HVM guests can excerpt enough p2m lock pressure to
>>>>>>> trigger a 5 second host watchdog timeout.
>>>>>> Is that "32 for x86 PV and x86 HVM", or "32 for x86 HVM and ARM64"?  Or
>>>>>> something else?
>>>>> The former.  I'm not qualified to comment on any of the ARM limits.
>>>>>
>>>>> There are several non-trivial for_each_vcpu() loops in the domain_kill
>>>>> path which aren't handled by continuations.  ISTR 128 vcpus is enough to
>>>>> trip a watchdog timeout when freeing pagetables.
>>>> I don't think 32 is a really practical limit.
>>> What do you mean by practical here, and what evidence are you basing
>>> this on?
>>>
>>> Amongst other things, there is an ABI boundary in Xen at 32 vcpus, and
>>> given how often it is broken in Linux, its clear that there isn't
>>> regular testing happening beyond this limit.
>> Is that true for dom0 as well?
> 
> Yes.  The problem is:
> 
> struct shared_info {
>     struct vcpu_info vcpu_info[XEN_LEGACY_MAX_VCPUS];
> ...
> 
> and while there are ways to make a larger number of vcpus work, it
> requires additional hypercalls to make alternate arrangements for the
> vcpus beyond the 32 boundary, and these arrangements appear to be broken
> more often than not around suspend/resume.

But I guess the implied part of George's question was: Wouldn't
we expect Dom0 to be more frequently tested with > 32 vCPU-s,
as quite likely not everyone has dom0_max_vcpus= in place?

Jan

Andrew Cooper Oct. 26, 2017, 10:59 a.m. UTC | #28

On 26/10/17 10:19, Jan Beulich wrote:
>>>> On 25.10.17 at 13:30, <andrew.cooper3@citrix.com> wrote:
>> On 25/10/17 11:59, George Dunlap wrote:
>>>>>>>>> +    Limit, x86 HVM: 128
>>>>>>>>> +    Limit, ARM32: 8
>>>>>>>>> +    Limit, ARM64: 128
>>>>>>>>> +
>>>>>>>>> +[XXX Andrew Cooper: Do want to add "Limit-Security" here for some of these?]
>>>>>>>> 32 for each.  64 vcpu HVM guests can excerpt enough p2m lock pressure to
>>>>>>>> trigger a 5 second host watchdog timeout.
>>>>>>> Is that "32 for x86 PV and x86 HVM", or "32 for x86 HVM and ARM64"?  Or
>>>>>>> something else?
>>>>>> The former.  I'm not qualified to comment on any of the ARM limits.
>>>>>>
>>>>>> There are several non-trivial for_each_vcpu() loops in the domain_kill
>>>>>> path which aren't handled by continuations.  ISTR 128 vcpus is enough to
>>>>>> trip a watchdog timeout when freeing pagetables.
>>>>> I don't think 32 is a really practical limit.
>>>> What do you mean by practical here, and what evidence are you basing
>>>> this on?
>>>>
>>>> Amongst other things, there is an ABI boundary in Xen at 32 vcpus, and
>>>> given how often it is broken in Linux, its clear that there isn't
>>>> regular testing happening beyond this limit.
>>> Is that true for dom0 as well?
>> Yes.  The problem is:
>>
>> struct shared_info {
>>     struct vcpu_info vcpu_info[XEN_LEGACY_MAX_VCPUS];
>> ...
>>
>> and while there are ways to make a larger number of vcpus work, it
>> requires additional hypercalls to make alternate arrangements for the
>> vcpus beyond the 32 boundary, and these arrangements appear to be broken
>> more often than not around suspend/resume.
> But I guess the implied part of George's question was: Wouldn't
> we expect Dom0 to be more frequently tested with > 32 vCPU-s,
> as quite likely not everyone has dom0_max_vcpus= in place?

I'm going to make a wild guess and say the intersection of people with
server class hardware and not using dom0_max_vcpus= is very small.

XenServer for example tops out at 16 dom0 vcpus, because performance
(aggregate disk/network throughput) plateaus at that point, and extra
cpu resource is far better spent running the VMs.

~Andrew

Nathan Studer Oct. 27, 2017, 3:09 p.m. UTC | #29

On 10/09/2017 10:14 AM, Lars Kurth wrote:
> 
>> On 27 Sep 2017, at 13:57, Robert VanVossen <robert.vanvossen@dornerworks.com> wrote:
>>
>>
>>
>> On 9/26/2017 3:12 AM, Dario Faggioli wrote:
>>> [Cc-list modified by removing someone and adding someone else]
>>>
>>> On Mon, 2017-09-25 at 16:10 -0700, Stefano Stabellini wrote:
>>>> On Mon, 11 Sep 2017, George Dunlap wrote:
>>>>> +### RTDS based Scheduler
>>>>> +
>>>>> +    Status: Experimental
>>>>> +
>>>>> +A soft real-time CPU scheduler built to provide guaranteed CPU
>>>>> capacity to guest VMs on SMP hosts
>>>>> +
>>>>> +### ARINC653 Scheduler
>>>>> +
>>>>> +    Status: Supported, Not security supported
>>>>> +
>>>>> +A periodically repeating fixed timeslice scheduler. Multicore
>>>>> support is not yet implemented.
>>>>> +
>>>>> +### Null Scheduler
>>>>> +
>>>>> +    Status: Experimental
>>>>> +
>>>>> +A very simple, very static scheduling policy 
>>>>> +that always schedules the same vCPU(s) on the same pCPU(s). 
>>>>> +It is designed for maximum determinism and minimum overhead
>>>>> +on embedded platforms.
> 
> ...
> 
>>> Actually, the best candidate for gaining security support, is IMO
>>> ARINC. Code is also rather simple and "stable" (hasn't changed in the
>>> last... years!) and it's used by DornerWorks' people for some of their
>>> projects (I think?). It's also not tested in OSSTest, though, and
>>> considering how special purpose it is, I think we're not totally
>>> comfortable marking it as Sec-Supported, without feedback from the
>>> maintainers.
>>>
>>> George, Josh, Robert?
>>>
>>
>> Yes, we do still use the ARINC653 scheduler. Since it is so simple, it hasn't
>> really needed any modifications in the last couple years.
>>
>> We are not really sure what kind of feedback you are looking from us in regards
>> to marking it sec-supported, but would be happy to try and answer any questions.
>> If you have any specific questions or requests, we can discuss it internally and
>> get back to you.
> 
> I think there are two sets of issues: one around testing, which Dario outlined.
> 
> For example, if you had some test harnesses that could be run on Xen release 
> candidates, which verify that the scheduler works as expected, that would
> help. It would imply a commitment to run the tests on release candidates.

We have an internal Xen test harness that we use to test the scheduler, but I
assume you would like it converted to use OSSTest instead, so that the
tests could be integrated into the main test suite someday?

> 
> The second question is what happens if someone reported a security issue on
> the scheduler. The security team would not have the capability to fix issues in 
> the ARINC scheduler: so it would be necessary to pull in an expert under 
> embargo to help triage the issue, fix the issue and prove that the fix works. This 
> would most likely require "the expert" to work to the timeline of the security
> team (which may require prioritising it over other work), as once a security issue 
> has been reported, the reporter may insist on a disclosure schedule. If we didn't 
> have a fix in time, because we don't get expert bandwidth, we could be forced to 
> disclose an XSA without a fix.

We can support this and have enough staff familiar with the scheduler that
prioritizing security issues shouldn't be a problem.  The maintainers (Robbie
and Josh) can triage issues if and when the time comes, but if you need a more
dedicated "expert" for this type of issue, then that would likely be me.

Sorry for the relatively late response.

     Nate

> 
> Does this make sense?
> 
> Lars
> 
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> https://lists.xen.org/xen-devel
>

George Dunlap Nov. 1, 2017, 4:57 p.m. UTC | #30

On 09/12/2017 11:39 AM, Roger Pau Monné wrote:
> On Mon, Sep 11, 2017 at 06:01:59PM +0100, George Dunlap wrote:
>> +## Toolstack
>> +
>> +### xl
>> +
>> +    Status: Supported
>> +
>> +### Direct-boot kernel image format
>> +
>> +    Supported, x86: bzImage
> 
> ELF
> 
>> +    Supported, ARM32: zImage
>> +    Supported, ARM64: Image
>> +
>> +Format which the toolstack accept for direct-boot kernels
> 
> IMHO it would be good to provide references to the specs, for ELF that
> should be:
> 
> http://refspecs.linuxbase.org/elf/elf.pdf

I'm having trouble evaluating these to recommendations because I don't
really know what the point of this section is.  Who wants this
information and why?

I think most end-users will want to build a Linux / whatever binary.
From that perspective, "bzImage" is probably the thing people want to
know about.  If you're doing unikernels or rolling your own custom
system somehow, knowing that it's ELF is probably more useful.

>> +### Qemu based disk backend (qdisk) for xl
>> +
>> +    Status: Supported
>> +
>> +### Open vSwitch integration for xl
>> +
>> +    Status: Supported
> 
> Status, Linux: Supported
> 
> I haven't played with vswitch on FreeBSD at all.

Ack

>> +### systemd support for xl
>> +
>> +    Status: Supported
>> +
>> +### JSON output support for xl
>> +
>> +    Status: Experimental
>> +
>> +Output of information in machine-parseable JSON format
>> +
>> +### AHCI support for xl
>> +
>> +    Status, x86: Supported
>> +
>> +### ACPI guest
>> +
>> +    Status, x86 HVM: Supported
>> +    Status, ARM: Tech Preview
> 
> status, x86 PVH: Tech preview

Is the interface and functionality mostly stable?  Or are the interfaces
likely to change / people using it likely to have crashes?

>> +### PVUSB support for xl
>> +
>> +    Status: Supported
>> +
>> +### HVM USB passthrough for xl
>> +
>> +    Status, x86: Supported
>> +
>> +### QEMU backend hotplugging for xl
>> +
>> +    Status: Supported
> 
> What's this exactly? Is it referring to hot-adding PV disk and nics?
> If so it shouldn't specifically reference xl, the same can be done
> with blkback or netback for example.

I think it means, xl knows how to hotplug QEMU backends.  There was a
time when I think this wasn't true.


>> +## Scalability
>> +
>> +### 1GB/2MB super page support
>> +
>> +    Status: Supported
> 
> This needs something like:
> 
> Status, x86 HVM/PVH: Supported

Sounds good -- I'll have a line for ARM as well.

> IIRC on ARM page sizes are different (64K?)
> 
>> +
>> +### x86/PV-on-HVM
>> +
>> +    Status: Supported
>> +
>> +This is a useful label for a set of hypervisor features
>> +which add paravirtualized functionality to HVM guests 
>> +for improved performance and scalability.  
>> +This includes exposing event channels to HVM guests.
>> +
>> +### x86/Deliver events to PVHVM guests using Xen event channels
>> +
>> +    Status: Supported
> 
> I think this should be labeled as "x86/HVM deliver guest events using
> event channels", and the x86/PV-on-HVM section removed.

Actually, I think 'PVHVM' should be the feature and this one should be
removed.


>> +### Blkfront
>> +
>> +    Status, Linux: Supported
>> +    Status, FreeBSD: Supported, Security support external
>> +    Status, Windows: Supported
> 
> Status, NetBSD: Supported, Security support external

Ack


>> +### Xen Console
>> +
>> +    Status, Linux (hvc_xen): Supported
>> +    Status, Windows: Supported
>> +
>> +Guest-side driver capable of speaking the Xen PV console protocol
> 
> Status, FreeBSD: Supported, Security support external
> Status, NetBSD: Supported, Security support external

Ack

> 
>> +
>> +### Xen PV keyboard
>> +
>> +    Status, Linux (xen-kbdfront): Supported
>> +    Status, Windows: Supported
>> +
>> +Guest-side driver capable of speaking the Xen PV keyboard protocol
>> +
>> +[XXX 'Supported' here depends on the version we ship in 4.10 having some fixes]
>> +
>> +### Xen PVUSB protocol
>> +
>> +    Status, Linux: Supported
>> +
>> +### Xen PV SCSI protocol
>> +
>> +    Status, Linux: Supported, with caveats
> 
> Should both of the above items be labeled with frontend/backend?

Done.

> And do we really need the 'Xen' prefix in all the items? Seems quite
> redundant.

Let me think about that.

>> +
>> +NB that while the pvSCSU frontend is in Linux and tested regularly,
>> +there is currently no xl support.
>> +
>> +### Xen TPMfront
> 
> PV TPM frotnend

Ack

>> +### PVCalls frontend
>> +
>> +    Status, Linux: Tech Preview
>> +
>> +Guest-side driver capable of making pv system calls
> 
> Didn't we merge the backend, but not the frontend?

No idea

>> +
>> +## Virtual device support, host side
>> +
>> +### Blkback
>> +
>> +    Status, Linux (blkback): Supported
>> +    Status, FreeBSD (blkback): Supported
>                                            ^, security support
>                                             external

Ack

> Status, NetBSD (xbdback): Supported, security support external
>> +    Status, QEMU (xen_disk): Supported
>> +    Status, Blktap2: Deprecated
>> +
>> +Host-side implementations of the Xen PV block protocol
>> +
>> +### Netback
>> +
>> +    Status, Linux (netback): Supported
>> +    Status, FreeBSD (netback): Supported
> 
> Status, NetBSD (xennetback): Supported
> 
> Both FreeBSD & NetBSD: security support external.

Ack

> 
>> +
>> +Host-side implementations of Xen PV network protocol
>> +
>> +### Xen Framebuffer
>> +
>> +    Status, Linux: Supported
> 
> Frontend?

>> +    Status, QEMU: Supported
> 
> Backend?
> 
> I don't recall Linux having a backend for the pv fb.

And it's hard to see how a Linux backend for pv FB would work in
practice; this is probably a mistake (maybe a c&p error).  I'll remove it.

>> +
>> +Host-side implementaiton of the Xen PV framebuffer protocol
>> +
>> +### Xen Console (xenconsoled)
> 
> Console backend
> 
>> +
>> +    Status: Supported
>> +
>> +Host-side implementation of the Xen PV console protocol
>> +
>> +### Xen PV keyboard
> 
> PV keyboard backend
> 
>> +
>> +    Status, QEMU: Supported
>> +
>> +Host-side implementation fo the Xen PV keyboard protocol
>> +
>> +### Xen PV USB
> 
> PV USB Backend
> 
>> +
>> +    Status, Linux: Experimental
> 
> ? The backend is in QEMU.

Juergen also has patches for a backend in Linux.

>> +### Online resize of virtual disks
>> +
>> +    Status: Supported
> 
> I would remove this.

I agree it probably doesn't belong here.

It might be useful to have a list of stuff like this as a prompt for
writing tests.  (Although perhaps good coverage support would be better.)


>> +### x86/HVM iPXE
>> +
>> +    Status: Supported, with caveats
>> +
>> +Booting a guest via PXE.
>> +PXE inherently places full trust of the guest in the network,
>> +and so should only be used
>> +when the guest network is under the same administrative control
>> +as the guest itself.
>> +
>> +### x86/HVM BIOS
>> +
>> +    Status: Supported
>> +
>> +Booting a guest via guest BIOS firmware
>> +
>> +### x86/HVM EFI
>> +
>> +	Status: Supported
>> +
>> +Booting a guest via guest EFI firmware
> 
> Maybe this is too generic? We certainly don't support ROMBIOS with
> qemu-trad, or SeaBIOS with qemu-upstream.

You mean we don't support SeaBIOS w/ qemu-trad or ROMBIOS with
qemu-upstream?

That probably doesn't matter so much, as SeaBIOS / ROMBIOS is mostly an
internal implementation detail.  But do we support booting EFI with
qemu-traditional?  If not, then you can't (for instance) boot an EFI
guest with stubdomains.

But then that opens up another factor in the matrix we need to track. :-/

>> +### ARM/ACPI (host)
>> +
>> +    Status: Experimental
> 
> "ACPI host" (since we already have "ACPI guest" above).

Yeah, I actually moved this to a separate "host hardware support" section.

> Status, ARM: experimental
> Status, x86 PV: supported
> Status, x86 PVH: experimental

Oh, this is for PV and PVH dom0.  I'll add a note to that effect.

Actually, how much of this is Xen support vs dom0 OS support?  Does this
need to be specified Linux / FreeBSD /&c?

 -George

George Dunlap Nov. 1, 2017, 5:01 p.m. UTC | #31

On 09/12/2017 08:52 PM, Stefano Stabellini wrote:
>>> +### Xen Framebuffer
>>> +
>>> +    Status, Linux: Supported
>>
>> Frontend?
> 
> Yes, please. If you write "Xen Framebuffer" I only take it to mean the
> protocol as should be documented somewhere under docs/. Then I read
> Linux, and I don't understand what you mean. Then I read QEMU and I have
> to guess you are talking about the backend?

Well this was in the "backend" section, so it was just completely wrong.
 I've removed it. :-)

>>> +### ARM: 16K and 64K pages in guests
>>> +
>>> +    Status: Supported, with caveats
>>> +
>>> +No support for QEMU backends in a 16K or 64K domain.
>>
>> Needs to be merged with the "1GB/2MB super page support"?
>  
> Super-pages are different from page granularity. 1GB and 2MB pages are
> based on the same 4K page granularity, while 512MB pages are based on
> 64K granularity. Does it make sense?

It does -- wondering what the best way to describe this concisely is.
Would it make sense to say "L2 and L3 superpages", and then explain in
the comment that for 4k page granularity that's 2MiB and 1GiB, and for
64k granularity it's 512MiB?

> Maybe we want to say "ARM: 16K and 64K page granularity in guest" to
> clarify.

Clarifying that this is "page granularity" would be helpful.

If we had a document describing this in more detail we could point to
that also might be useful.

 -George

Konrad Rzeszutek Wilk Nov. 1, 2017, 5:10 p.m. UTC | #32

On Tue, Oct 24, 2017 at 04:22:38PM +0100, George Dunlap wrote:
> On Fri, Sep 15, 2017 at 3:51 PM, Konrad Rzeszutek Wilk
> <konrad.wilk@oracle.com> wrote:
> >> +### Soft-reset for PV guests
> >
> > s/PV/HVM/
> 
> Is it?  I thought this was for RHEL 5 PV guests to be able to do crash kernels.
> 
> >> +### Transcendent Memory
> >> +
> >> +    Status: Experimental
> >> +
> >> +[XXX Add description]
> >
> > Guests with tmem drivers autoballoon memory out allowing a fluid
> > and dynamic memory allocation - in effect memory overcommit without
> > the need to swap. Only works with Linux guests (as it requires
> > OS drivers).
> 
> But autoballooning doesn't require any support in Xen, right?  I
> thought the TMEM support in Xen was more about the trancendent memory
> backends.

frontends you mean? That is Linux guests when compiled with XEN_TMEM will
balloon down (using the self-shrinker) to using the normal balloon code
(XENMEM_decrease_reservation, XENMEM_populate_physmap) to make the
guest smaller. Then the Linux code starts hitting the case where it starts
swapping memory out - and that is where the tmem comes in and the
pages are swapped out to the hypervisor.

There is also the secondary cache (cleancache) which just puts pages
in the hypervisor temporary cache, kind of like an L3. For that you don't
need ballooning.
> 
> > ..snip..
> >> +### Live Patching
> >> +
> >> +    Status, x86: Supported
> >> +    Status, ARM: Experimental
> >> +
> >> +Compile time disabled
> >
> > for ARM.
> >
> > As the patch will do:
> >
> >  config LIVEPATCH
> > -       bool "Live patching support (TECH PREVIEW)"
> > -       default n
> > +       bool "Live patching support"
> > +       default X86
> >         depends on HAS_BUILD_ID = "y"
> >         ---help---
> >           Allows a running Xen hypervisor to be dynamically patched using
> 
> Ack
> 
>  -George
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> https://lists.xen.org/xen-devel

George Dunlap Nov. 2, 2017, 10:46 a.m. UTC | #33

On 11/01/2017 05:10 PM, Konrad Rzeszutek Wilk wrote:
> On Tue, Oct 24, 2017 at 04:22:38PM +0100, George Dunlap wrote:
>> On Fri, Sep 15, 2017 at 3:51 PM, Konrad Rzeszutek Wilk
>> <konrad.wilk@oracle.com> wrote:
>>>> +### Soft-reset for PV guests
>>>
>>> s/PV/HVM/
>>
>> Is it?  I thought this was for RHEL 5 PV guests to be able to do crash kernels.
>>
>>>> +### Transcendent Memory
>>>> +
>>>> +    Status: Experimental
>>>> +
>>>> +[XXX Add description]
>>>
>>> Guests with tmem drivers autoballoon memory out allowing a fluid
>>> and dynamic memory allocation - in effect memory overcommit without
>>> the need to swap. Only works with Linux guests (as it requires
>>> OS drivers).
>>
>> But autoballooning doesn't require any support in Xen, right?  I
>> thought the TMEM support in Xen was more about the trancendent memory
>> backends.
> 
> frontends you mean? That is Linux guests when compiled with XEN_TMEM will
> balloon down (using the self-shrinker) to using the normal balloon code
> (XENMEM_decrease_reservation, XENMEM_populate_physmap) to make the
> guest smaller. Then the Linux code starts hitting the case where it starts
> swapping memory out - and that is where the tmem comes in and the
> pages are swapped out to the hypervisor.

Right -- so TMEM itself actually consists of this ephemeral and
non-ephemeral memory pools.  Autoballooning is just a trick to get Linux
to put the least-used pages into one of the pools.

How about this:

---
Transcendent Memory (tmem) allows the creation of hypervisor memory
pools which guests can use to store memory rather than caching in its
own memory or swapping to disk.  Having these in the hypervisor can
allow more efficient aggregate use of memory across VMs.
---

 -George

Konrad Rzeszutek Wilk Nov. 2, 2017, 3:23 p.m. UTC | #34

On Thu, Nov 02, 2017 at 10:46:20AM +0000, George Dunlap wrote:
> On 11/01/2017 05:10 PM, Konrad Rzeszutek Wilk wrote:
> > On Tue, Oct 24, 2017 at 04:22:38PM +0100, George Dunlap wrote:
> >> On Fri, Sep 15, 2017 at 3:51 PM, Konrad Rzeszutek Wilk
> >> <konrad.wilk@oracle.com> wrote:
> >>>> +### Soft-reset for PV guests
> >>>
> >>> s/PV/HVM/
> >>
> >> Is it?  I thought this was for RHEL 5 PV guests to be able to do crash kernels.
> >>
> >>>> +### Transcendent Memory
> >>>> +
> >>>> +    Status: Experimental
> >>>> +
> >>>> +[XXX Add description]
> >>>
> >>> Guests with tmem drivers autoballoon memory out allowing a fluid
> >>> and dynamic memory allocation - in effect memory overcommit without
> >>> the need to swap. Only works with Linux guests (as it requires
> >>> OS drivers).
> >>
> >> But autoballooning doesn't require any support in Xen, right?  I
> >> thought the TMEM support in Xen was more about the trancendent memory
> >> backends.
> > 
> > frontends you mean? That is Linux guests when compiled with XEN_TMEM will
> > balloon down (using the self-shrinker) to using the normal balloon code
> > (XENMEM_decrease_reservation, XENMEM_populate_physmap) to make the
> > guest smaller. Then the Linux code starts hitting the case where it starts
> > swapping memory out - and that is where the tmem comes in and the
> > pages are swapped out to the hypervisor.
> 
> Right -- so TMEM itself actually consists of this ephemeral and
> non-ephemeral memory pools.  Autoballooning is just a trick to get Linux
> to put the least-used pages into one of the pools.

<nods>
> 
> How about this:
> 
> ---
> Transcendent Memory (tmem) allows the creation of hypervisor memory
> pools which guests can use to store memory rather than caching in its
> own memory or swapping to disk.  Having these in the hypervisor can
> allow more efficient aggregate use of memory across VMs.
> ---

<purrs> Perfect!
> 
>  -George

George Dunlap Nov. 2, 2017, 5:34 p.m. UTC | #35

On 10/27/2017 04:09 PM, NathanStuder wrote:
> 
> 
> On 10/09/2017 10:14 AM, Lars Kurth wrote:
>>
>>> On 27 Sep 2017, at 13:57, Robert VanVossen <robert.vanvossen@dornerworks.com> wrote:
>>>
>>>
>>>
>>> On 9/26/2017 3:12 AM, Dario Faggioli wrote:
>>>> [Cc-list modified by removing someone and adding someone else]
>>>>
>>>> On Mon, 2017-09-25 at 16:10 -0700, Stefano Stabellini wrote:
>>>>> On Mon, 11 Sep 2017, George Dunlap wrote:
>>>>>> +### RTDS based Scheduler
>>>>>> +
>>>>>> +    Status: Experimental
>>>>>> +
>>>>>> +A soft real-time CPU scheduler built to provide guaranteed CPU
>>>>>> capacity to guest VMs on SMP hosts
>>>>>> +
>>>>>> +### ARINC653 Scheduler
>>>>>> +
>>>>>> +    Status: Supported, Not security supported
>>>>>> +
>>>>>> +A periodically repeating fixed timeslice scheduler. Multicore
>>>>>> support is not yet implemented.
>>>>>> +
>>>>>> +### Null Scheduler
>>>>>> +
>>>>>> +    Status: Experimental
>>>>>> +
>>>>>> +A very simple, very static scheduling policy 
>>>>>> +that always schedules the same vCPU(s) on the same pCPU(s). 
>>>>>> +It is designed for maximum determinism and minimum overhead
>>>>>> +on embedded platforms.
>>
>> ...
>>
>>>> Actually, the best candidate for gaining security support, is IMO
>>>> ARINC. Code is also rather simple and "stable" (hasn't changed in the
>>>> last... years!) and it's used by DornerWorks' people for some of their
>>>> projects (I think?). It's also not tested in OSSTest, though, and
>>>> considering how special purpose it is, I think we're not totally
>>>> comfortable marking it as Sec-Supported, without feedback from the
>>>> maintainers.
>>>>
>>>> George, Josh, Robert?
>>>>
>>>
>>> Yes, we do still use the ARINC653 scheduler. Since it is so simple, it hasn't
>>> really needed any modifications in the last couple years.
>>>
>>> We are not really sure what kind of feedback you are looking from us in regards
>>> to marking it sec-supported, but would be happy to try and answer any questions.
>>> If you have any specific questions or requests, we can discuss it internally and
>>> get back to you.
>>
>> I think there are two sets of issues: one around testing, which Dario outlined.
>>
>> For example, if you had some test harnesses that could be run on Xen release 
>> candidates, which verify that the scheduler works as expected, that would
>> help. It would imply a commitment to run the tests on release candidates.
> 
> We have an internal Xen test harness that we use to test the scheduler, but I
> assume you would like it converted to use OSSTest instead, so that the
> tests could be integrated into the main test suite someday?

In our past discussions I don't think anyone has thought the "everything
has to be tested in osstest" strategy is really feasible.  So I think we
were going for a model where it just had to be regularly tested
*somewhere*, more or less as a marker for "is this functionality
important enough to people to give security support".

>> The second question is what happens if someone reported a security issue on
>> the scheduler. The security team would not have the capability to fix issues in 
>> the ARINC scheduler: so it would be necessary to pull in an expert under 
>> embargo to help triage the issue, fix the issue and prove that the fix works. This 
>> would most likely require "the expert" to work to the timeline of the security
>> team (which may require prioritising it over other work), as once a security issue 
>> has been reported, the reporter may insist on a disclosure schedule. If we didn't 
>> have a fix in time, because we don't get expert bandwidth, we could be forced to 
>> disclose an XSA without a fix.
> 
> We can support this and have enough staff familiar with the scheduler that
> prioritizing security issues shouldn't be a problem.  The maintainers (Robbie
> and Josh) can triage issues if and when the time comes, but if you need a more
> dedicated "expert" for this type of issue, then that would likely be me.

OK -- in that case, if it's OK with you, I'll list ArinC as 'Supported'.

Thanks,
 -George

Nathan Studer Nov. 2, 2017, 8:42 p.m. UTC | #36

On 11/02/2017 01:34 PM, George Dunlap wrote:
> On 10/27/2017 04:09 PM, NathanStuder wrote:
>>
>>
>> On 10/09/2017 10:14 AM, Lars Kurth wrote:
>>>
>>>> On 27 Sep 2017, at 13:57, Robert VanVossen <robert.vanvossen@dornerworks.com> wrote:
>>>>
>>>>
>>>>
>>>> On 9/26/2017 3:12 AM, Dario Faggioli wrote:
>>>>> [Cc-list modified by removing someone and adding someone else]
>>>>>
>>>>> On Mon, 2017-09-25 at 16:10 -0700, Stefano Stabellini wrote:
>>>>>> On Mon, 11 Sep 2017, George Dunlap wrote:
>>>>>>> +### RTDS based Scheduler
>>>>>>> +
>>>>>>> +    Status: Experimental
>>>>>>> +
>>>>>>> +A soft real-time CPU scheduler built to provide guaranteed CPU
>>>>>>> capacity to guest VMs on SMP hosts
>>>>>>> +
>>>>>>> +### ARINC653 Scheduler
>>>>>>> +
>>>>>>> +    Status: Supported, Not security supported
>>>>>>> +
>>>>>>> +A periodically repeating fixed timeslice scheduler. Multicore
>>>>>>> support is not yet implemented.
>>>>>>> +
>>>>>>> +### Null Scheduler
>>>>>>> +
>>>>>>> +    Status: Experimental
>>>>>>> +
>>>>>>> +A very simple, very static scheduling policy 
>>>>>>> +that always schedules the same vCPU(s) on the same pCPU(s). 
>>>>>>> +It is designed for maximum determinism and minimum overhead
>>>>>>> +on embedded platforms.
>>>
>>> ...
>>>
>>>>> Actually, the best candidate for gaining security support, is IMO
>>>>> ARINC. Code is also rather simple and "stable" (hasn't changed in the
>>>>> last... years!) and it's used by DornerWorks' people for some of their
>>>>> projects (I think?). It's also not tested in OSSTest, though, and
>>>>> considering how special purpose it is, I think we're not totally
>>>>> comfortable marking it as Sec-Supported, without feedback from the
>>>>> maintainers.
>>>>>
>>>>> George, Josh, Robert?
>>>>>
>>>>
>>>> Yes, we do still use the ARINC653 scheduler. Since it is so simple, it hasn't
>>>> really needed any modifications in the last couple years.
>>>>
>>>> We are not really sure what kind of feedback you are looking from us in regards
>>>> to marking it sec-supported, but would be happy to try and answer any questions.
>>>> If you have any specific questions or requests, we can discuss it internally and
>>>> get back to you.
>>>
>>> I think there are two sets of issues: one around testing, which Dario outlined.
>>>
>>> For example, if you had some test harnesses that could be run on Xen release 
>>> candidates, which verify that the scheduler works as expected, that would
>>> help. It would imply a commitment to run the tests on release candidates.
>>
>> We have an internal Xen test harness that we use to test the scheduler, but I
>> assume you would like it converted to use OSSTest instead, so that the
>> tests could be integrated into the main test suite someday?
> 
> In our past discussions I don't think anyone has thought the "everything
> has to be tested in osstest" strategy is really feasible.  So I think we
> were going for a model where it just had to be regularly tested
> *somewhere*, more or less as a marker for "is this functionality
> important enough to people to give security support".
> 
>>> The second question is what happens if someone reported a security issue on
>>> the scheduler. The security team would not have the capability to fix issues in 
>>> the ARINC scheduler: so it would be necessary to pull in an expert under 
>>> embargo to help triage the issue, fix the issue and prove that the fix works. This 
>>> would most likely require "the expert" to work to the timeline of the security
>>> team (which may require prioritising it over other work), as once a security issue 
>>> has been reported, the reporter may insist on a disclosure schedule. If we didn't 
>>> have a fix in time, because we don't get expert bandwidth, we could be forced to 
>>> disclose an XSA without a fix.
>>
>> We can support this and have enough staff familiar with the scheduler that
>> prioritizing security issues shouldn't be a problem.  The maintainers (Robbie
>> and Josh) can triage issues if and when the time comes, but if you need a more
>> dedicated "expert" for this type of issue, then that would likely be me.
> 
> OK -- in that case, if it's OK with you, I'll list ArinC as 'Supported'.

We're good with that.  Thanks.

     Nate

> 
> Thanks,
>  -George
>

[RFC,v2] Add SUPPORT.md

Commit Message

Comments

Patch