
docs/designs: Add a design document for transparent live migration

Message ID 20200127161430.3312-1-pdurrant@amazon.com (mailing list archive)
State New, archived
Series docs/designs: Add a design document for transparent live migration

Commit Message

Paul Durrant Jan. 27, 2020, 4:14 p.m. UTC
It has become apparent to some large cloud providers that the current
model of co-operative migration of guests under Xen is not usable as it
places trust in software running inside the guest, which is likely
beyond the provider's trust boundary.
This patch introduces a proposal for a 'transparent' live migration,
designed to overcome the need for this trust.

Signed-off-by: Paul Durrant <pdurrant@amazon.com>
---
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: George Dunlap <George.Dunlap@eu.citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Julien Grall <julien@xen.org>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Wei Liu <wl@xen.org>
---
 docs/designs/transparent-migration.md | 266 ++++++++++++++++++++++++++
 1 file changed, 266 insertions(+)
 create mode 100644 docs/designs/transparent-migration.md

Comments

Ian Jackson Jan. 27, 2020, 4:46 p.m. UTC | #1
Paul Durrant writes ("[PATCH] docs/designs: Add a design document for transparent live migration"):
> It has become apparent to some large cloud providers that the current
> model of co-operative migration of guests under Xen is not usable as it
> places trust in software running inside the guest, which is likely
> beyond the provider's trust boundary.
> This patch introduces a proposal for a 'transparent' live migration,
> designed to overcome the need for this trust.

I have reviewed this and it seems like an accurate summary of the
situation, and a plausible proposal.  I wonder if some of the
existing-situation text could go into other documents.

I have some very minor comments.

I don't like the term `transparent'.  It is often abused in other
contexts.  It can be unclear to whom things are transparent.  In a very
real sense existing migration is `transparent' to a domain's network
peers, for example.  How about `oblivious' ?

I don't think `trust' is right, either.  I think you mean `reliance'
or something.  `Trust' makes it sound like the guest can cause trouble
for the host.  Whereas the problem you are addressing here is that
the guest can cause trouble *for itself* by not operating the
migration protocols correctly.  This is an operational inconvenience,
but `trust' implies a security issue.

Ian.
David Woodhouse Feb. 3, 2020, 9:48 a.m. UTC | #2
On Mon, 2020-01-27 at 16:46 +0000, Ian Jackson wrote:
> I don't like the term `transparent'.  It is often abused in other
> contexts.  It can be clear to whom things are transparent.  In a very
> real sense existing migration is `transparent' to a domain's network
> peers, for example.  How about `oblivious' ?

The term we generally use is 'guest transparent live migration', in
which the additional word addresses that potential ambiguity. We thus
have GT migration in addition to SR (suspend/resume) migration.

Perhaps it's just familiarity, but I very much prefer that to
'oblivious' (guests aren't necessarily oblivious to it; they just aren't
required to *do* anything), and to 'non-cooperative' which for me is
too easy to conflate with 'uncooperative' and might cause an inference
that it's for guests which have done something *wrong*.

Patch

diff --git a/docs/designs/transparent-migration.md b/docs/designs/transparent-migration.md
new file mode 100644
index 0000000000..9f26d4da6d
--- /dev/null
+++ b/docs/designs/transparent-migration.md
@@ -0,0 +1,266 @@ 
+# Transparent Migration of Guests on Xen
+
+## Background
+
+The term **transparent migration** needs qualification. Here it is taken to
+mean migration of a guest without the co-operation of software running inside
+that guest. It is not taken to mean that a guest which is aware it is
+virtualized under Xen will see no changes at all across a migration; rather,
+no part of migration should require any explicit action by the guest
+(including re-reading any state that it may have cached).
+
+The normal model of migration in Xen is driven by the guest because it was
+originally implemented for PV guests, where the guest must be aware it is
+running under Xen and is hence expected to co-operate. This model dates from
+an era when it was assumed that the host administrator had control of at least
+the privileged software running in the guest (i.e. the guest kernel) which may
+still be true in an enterprise deployment but is not generally true in a cloud
+environment. The aim of transparent migration is to provide a model which is
+purely host driven, requiring no co-operation from or trust in the software
+running in the guest, and is thus suitable for cloud scenarios.
+
+PV guests are out of scope for this project because, as is outlined above, they
+have a symbiotic relationship with the hypervisor and therefore a certain level
+of co-operation can be assumed.
+HVM guests can already be migrated on Xen without guest co-operation but only
+if they don’t have PV drivers installed[1] or are in power state S3. The
+reason for not expecting co-operation if the guest is in S3 is obvious, but the
+reason co-operation is expected if PV drivers are installed is due to the
+nature of PV protocols.
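+
+The check described in footnote [1] can be performed from the toolstack side
+via the libxenctrl API. The following is a minimal sketch only (it assumes
+direct use of `xc_hvm_param_get()`; the exact check performed by the toolstack
+may differ):
+
+```c
+#include <stdbool.h>
+#include <xenctrl.h>
+#include <xen/hvm/params.h>
+
+/* Return true if the guest appears to have PV drivers active, i.e. it has
+ * set HVM_PARAM_CALLBACK_IRQ to a non-zero value (see footnote [1]). */
+static bool guest_has_pv_drivers(uint32_t domid)
+{
+    xc_interface *xch = xc_interface_open(NULL, NULL, 0);
+    uint64_t val = 0;
+    bool result = false;
+
+    if (!xch)
+        return false;
+
+    if (xc_hvm_param_get(xch, domid, HVM_PARAM_CALLBACK_IRQ, &val) == 0)
+        result = (val != 0);
+
+    xc_interface_close(xch);
+    return result;
+}
+```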
+
+## Xenstore Nodes and Domain ID
+
+The PV driver model consists of a *frontend* and a *backend*. The frontend runs
+inside the guest domain and the backend runs inside a *service domain*, which
+may or may not be domain 0. The frontend and backend typically pass data via
+memory pages which are shared between the two domains, but this channel of
+communication is generally established using xenstore (the store protocol
+itself being an exception to this for obvious chicken-and-egg reasons).
+
+Typical protocol establishment is based on use of two separate xenstore
+*areas*. If we consider PV drivers for the *netif* protocol (i.e. class vif)
+and assume the guest has domid X, the service domain has domid Y, and the vif
+has index Z, then the frontend area will reside under the parent node:
+
+`/local/domain/X/device/vif/Z`
+
+All backends in the service domain, by convention, typically reside under the
+parent node:
+
+`/local/domain/Y/backend`
+
+and the normal backend area for vif Z would be:
+
+`/local/domain/Y/backend/vif/X/Z`
+
+but this should not be assumed.
+
+The toolstack will place two nodes in the frontend area to explicitly locate
+the backend:
+
+* `backend`: the fully qualified xenstore path of the backend area
+* `backend-id`: the domid of the service domain
+
+and similarly two nodes in the backend area to locate the frontend area:
+
+* `frontend`: the fully qualified xenstore path of the frontend area
+* `frontend-id`: the domid of the guest domain
+
+
+The guest domain only has write permission to the frontend area and similarly
+the service domain only has write permission to the backend area, but both ends
+have read permission to both areas.
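+
+For illustration, a frontend can locate its backend purely from these nodes.
+The following is a minimal sketch using the libxenstore API (the `fe_area`
+argument and the printing are purely illustrative):
+
+```c
+#include <stdio.h>
+#include <stdlib.h>
+#include <xenstore.h>
+
+/* Read the backend location for a frontend device area such as
+ * /local/domain/X/device/vif/Z (passed as 'fe_area'). */
+static void show_backend(const char *fe_area)
+{
+    struct xs_handle *xsh = xs_open(0);
+    char path[256];
+    unsigned int len;
+    char *backend, *backend_id;
+
+    if (!xsh)
+        return;
+
+    snprintf(path, sizeof(path), "%s/backend", fe_area);
+    backend = xs_read(xsh, XBT_NULL, path, &len);
+
+    snprintf(path, sizeof(path), "%s/backend-id", fe_area);
+    backend_id = xs_read(xsh, XBT_NULL, path, &len);
+
+    if (backend && backend_id)
+        printf("backend area %s in domid %s\n", backend, backend_id);
+
+    free(backend);
+    free(backend_id);
+    xs_close(xsh);
+}
+```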
+
+Under both frontend and backend areas is a node called *state*. This is key to
+protocol establishment. Upon PV device creation the toolstack will set the
+value of both state nodes to 1 (XenbusStateInitialising[2]). This should cause
+enumeration of appropriate devices in both the guest and service domains. The
+backend device, once it has written any necessary protocol-specific information
+into the xenstore backend area (to be read by the frontend driver), will update
+the backend state node to 2 (XenbusStateInitWait). From this point on PV
+protocols differ slightly; the following illustration is true of the netif
+protocol.
+Upon seeing a backend state value of 2, the frontend driver will then read the
+protocol-specific information, write details of the grant references (for shared
+pages) and event channel ports (for signalling) that it has created, and set
+the state node in the frontend area to 4 (XenbusStateConnected). Upon seeing this
+frontend state, the backend driver will then read the grant references (mapping
+the shared pages) and event channel ports (opening its end of them) and set the
+state node in the backend area to 4. Protocol establishment is now complete and
+the frontend and backend start to pass data.
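+
+The state values referred to above are defined in
+`xen/include/public/io/xenbus.h` [2]. The frontend side of the handshake just
+described reduces, in outline, to the following sketch (the helper functions
+are hypothetical placeholders for xenstore accesses and ring setup):
+
+```c
+#include <xen/io/xenbus.h>   /* enum xenbus_state */
+
+/* Hypothetical helpers that read/write the 'state' nodes via xenstore. */
+extern enum xenbus_state read_backend_state(void);
+extern void write_frontend_state(enum xenbus_state s);
+extern void setup_rings_and_event_channels(void);
+
+/* Simplified netif frontend connect path: wait for the backend to reach
+ * InitWait, publish grant references and event channel ports, then move
+ * the frontend state node to Connected. */
+static void frontend_connect(void)
+{
+    while (read_backend_state() != XenbusStateInitWait)
+        ; /* a real driver would use a xenstore watch rather than polling */
+
+    setup_rings_and_event_channels();
+    write_frontend_state(XenbusStateConnected);
+}
+```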
+
+The domid of both ends of a PV protocol forms a key part of negotiating the
+data plane for that protocol, because it is encoded into both xenstore nodes
+and node paths. The guest’s own domid and the domid of the service domain are
+visible to the guest in xenstore (and hence may be cached internally), and
+neither is necessarily preserved across migration. It is therefore necessary
+to have the co-operation of the frontend in re-negotiating the protocol using
+the new domids after migration.
+Moreover, the `backend-id` value will be used by the frontend driver in setting up
+grant table entries and event channels to communicate with the service domain,
+so the co-operation of the guest is required to re-establish these in the new
+host environment after migration.
+
+Thus if we are to change the model and support migration of a guest with PV
+drivers, without the co-operation of the frontend driver code, the paths and
+values in both the frontend and backend xenstore areas must remain unchanged
+and valid in the new host environment, and the grant table entries and event
+channels must be preserved (and remain operational once guest execution is
+resumed).
+Because the service domain’s domid is used directly by the guest in setting
+up grant entries and event channels, the backend drivers in the new host
+environment must be provided by a service domain with the same domid. Also,
+because the guest can sample its own domid from the frontend area and use it in
+hypercalls (e.g. HVMOP_set_param) rather than DOMID_SELF, the guest domid must
+also be preserved to maintain the ABI.
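+
+To illustrate the last point, a guest that has cached its own domid may issue
+hypercalls of the following form. This is a sketch only: the `hypercall2()`
+wrapper is a hypothetical stand-in for the guest's real hypercall mechanism,
+while the structure and constants come from the public `hvm/hvm_op.h` and
+`hvm/params.h` headers:
+
+```c
+#include <xen/xen.h>            /* DOMID_SELF, __HYPERVISOR_hvm_op */
+#include <xen/hvm/hvm_op.h>     /* HVMOP_set_param, struct xen_hvm_param */
+#include <xen/hvm/params.h>     /* HVM_PARAM_CALLBACK_IRQ */
+
+/* Hypothetical guest-side hypercall wrapper. */
+extern long hypercall2(unsigned int op, unsigned long cmd, void *arg);
+
+/* A guest may legitimately pass its own (cached) domid here instead of
+ * DOMID_SELF, which is one reason the domid must be preserved across a
+ * transparent migration. */
+static void set_callback_irq(domid_t my_domid, uint64_t value)
+{
+    struct xen_hvm_param p = {
+        .domid = my_domid,      /* could equally be DOMID_SELF */
+        .index = HVM_PARAM_CALLBACK_IRQ,
+        .value = value,
+    };
+
+    hypercall2(__HYPERVISOR_hvm_op, HVMOP_set_param, &p);
+}
+```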
+
+Furthermore, it will be necessary to modify backend drivers to re-establish
+communication with frontend drivers without perturbing the content of the
+backend area or requiring any changes to the values of the xenstore state nodes.
+
+## Other Para-Virtual State
+
+### Shared Rings
+
+Because the console and store protocol shared pages are actually part of the
+guest memory image (in an E820 reserved region just below 4G), their content
+will be migrated as part of the guest memory image. Hence no additional code
+is required to prevent any guest-visible change in the content.
+
+### Shared Info
+
+There is already a record defined in *LibXenCtrl Domain Image Format* [3]
+called `SHARED_INFO` which simply contains a complete copy of the domain’s
+shared info page. It is not currently included in an HVM (type `0x0002`)
+migration stream. It may be feasible to include it as an optional record
+but it is not clear that the content of the shared info page ever needs
+to be preserved for an HVM guest.
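+
+For reference, the parts of the shared info page discussed below are,
+in abbreviated form, as follows (paraphrased from `struct shared_info` in the
+public `xen.h` header; padding and compat details are omitted and the structure
+name is invented for this abridged view):
+
+```c
+#include <xen/xen.h>
+
+/* Abridged view of the shared info page (see xen/include/public/xen.h). */
+struct shared_info_abridged {
+    struct vcpu_info vcpu_info[XEN_LEGACY_MAX_VCPUS];     /* 32 legacy vCPUs */
+    xen_ulong_t evtchn_pending[sizeof(xen_ulong_t) * 8];  /* 2l event bits   */
+    xen_ulong_t evtchn_mask[sizeof(xen_ulong_t) * 8];
+    uint32_t wc_version;   /* wall-clock state; the RTC state in the         */
+    uint32_t wc_sec;       /* HVM_CONTEXT save record should already         */
+    uint32_t wc_nsec;      /* carry the same information                     */
+    struct arch_shared_info arch;   /* PV P2M information, not relevant for  */
+};                                  /* an HVM guest                          */
+```
+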
+For a PV guest the `arch_shared_info` sub-structure contains important
+information about the guest’s P2M, but this information is not relevant for
+an HVM guest, where the P2M is not directly manipulated by the guest. The other
+state contained in the `shared_info` structure relates to the domain wall-clock
+(the state of which should already be transferred by the `RTC` HVM context
+information, which is contained in the `HVM_CONTEXT` save record) and some event
+channel state (particularly if using the *2l* protocol). Event channel state
+will need to be fully transferred if we are not going to require guest
+co-operation to re-open the channels, and so it should be possible to re-build a
+shared info page for an HVM guest from that other state.
+Note that the shared info page also contains an array of `XEN_LEGACY_MAX_VCPUS`
+(32) `vcpu_info` structures. A domain may nominate a different guest physical
+address to use for the vcpu info. This is mandatory if a domain wants to use
+more than 32 vCPUs and optional for the first 32 (legacy) vCPUs. This mapping is
+not currently transferred in the migration stream, so it will either need to be
+added into an existing save record, or an additional type of save record will
+be needed.
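+
+The nomination of an alternative `vcpu_info` location mentioned above is done
+with the `VCPUOP_register_vcpu_info` hypercall. A sketch of the guest-side
+call follows (the `hypercall3()` wrapper is again a hypothetical stand-in):
+
+```c
+#include <xen/xen.h>     /* __HYPERVISOR_vcpu_op */
+#include <xen/vcpu.h>    /* VCPUOP_register_vcpu_info, struct vcpu_register_vcpu_info */
+
+/* Hypothetical guest-side hypercall wrapper. */
+extern long hypercall3(unsigned int op, unsigned long cmd,
+                       unsigned long vcpuid, void *arg);
+
+/* Move a vCPU's vcpu_info out of the shared info page to a guest-chosen
+ * location. It is this frame/offset mapping that the migration stream
+ * does not currently carry. */
+static long register_vcpu_info(unsigned int vcpu, uint64_t gfn,
+                               uint32_t offset)
+{
+    struct vcpu_register_vcpu_info info = {
+        .mfn    = gfn,   /* interpreted as a gfn for a (translated) HVM guest */
+        .offset = offset,
+    };
+
+    return hypercall3(__HYPERVISOR_vcpu_op, VCPUOP_register_vcpu_info, vcpu,
+                      &info);
+}
+```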
+
+### Xenstore Watches
+
+As mentioned above, no domain Xenstore state is currently transferred in the
+migration stream. There is a record defined in *LibXenLight Domain Image
+Format* [4] called `EMULATOR_XENSTORE_DATA` for transferring Xenstore nodes
+relating to emulators but no record type is defined for nodes relating to the
+domain itself, nor for registered *watches*. A Xenstore watch is a mechanism
+used by PV frontend and backend drivers to request a notification if the value
+of a particular node (e.g. the other end’s state node) changes, so it is
+important that watches continue to function after a migration. One or more new
+save records will therefore be required to transfer Xenstore state. It will
+also be necessary to extend the *store* protocol[5] with mechanisms to allow
+the toolstack to acquire the list of watches that the guest has registered and
+for the toolstack to register a watch on behalf of a domain.
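+
+For illustration, registering and receiving a watch through libxenstore looks
+like the following minimal sketch. The path and token are examples only; the
+mechanism by which the toolstack would enumerate or re-register a *guest's*
+watches is exactly what the proposed protocol extension would need to add:
+
+```c
+#include <stdio.h>
+#include <stdlib.h>
+#include <xenstore.h>
+
+/* Watch the backend 'state' node of a device and report the first event. */
+static void watch_backend_state(struct xs_handle *xsh, const char *be_area)
+{
+    char path[256];
+    char **event;
+    unsigned int num;
+
+    snprintf(path, sizeof(path), "%s/state", be_area);
+
+    if (!xs_watch(xsh, path, "be-state"))
+        return;
+
+    /* Blocks until the watch fires (including the initial event);
+     * event[XS_WATCH_PATH] is the path, event[XS_WATCH_TOKEN] the token. */
+    event = xs_read_watch(xsh, &num);
+    if (event) {
+        printf("watch fired: %s (token %s)\n",
+               event[XS_WATCH_PATH], event[XS_WATCH_TOKEN]);
+        free(event);
+    }
+
+    xs_unwatch(xsh, path, "be-state");
+}
+```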
+
+### Event channels
+
+Event channels are essentially the para-virtual equivalent of interrupts. They
+are an important part of most PV protocols. Normally a frontend driver creates
+an *inter-domain* event channel between its own domain and the domain running
+the backend, which it discovers using the `backend-id` node in Xenstore (see
+above), by making an `EVTCHNOP_alloc_unbound` hypercall. This hypercall
+allocates an event channel object in the hypervisor and assigns a *local port*
+number which is then written into the frontend area in Xenstore. The backend
+driver then reads this port number and *binds* to the event channel by
+specifying the value of `frontend-id` as the *remote domain* and the port
+number as the *remote port* in an `EVTCHNOP_bind_interdomain` hypercall. Once
+connection is established in this fashion frontend and backend drivers can use
+the event channel as a *mailbox* to notify each other when a shared ring has
+been updated with new requests or response structures.
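+
+In terms of the public event channel interface
+(`xen/include/public/event_channel.h`), the exchange just described looks
+roughly as follows; the `hypercall2()` wrapper is a hypothetical stand-in for
+the actual hypercall mechanism at each end:
+
+```c
+#include <xen/xen.h>             /* DOMID_SELF, __HYPERVISOR_event_channel_op */
+#include <xen/event_channel.h>   /* EVTCHNOP_*, struct evtchn_* */
+
+/* Hypothetical hypercall wrapper. */
+extern long hypercall2(unsigned int op, unsigned long cmd, void *arg);
+
+/* Frontend side: allocate an unbound channel for 'backend_domid' and
+ * return the local port, which is then written into the frontend area. */
+static evtchn_port_t frontend_alloc(domid_t backend_domid)
+{
+    struct evtchn_alloc_unbound alloc = {
+        .dom        = DOMID_SELF,
+        .remote_dom = backend_domid,
+    };
+
+    hypercall2(__HYPERVISOR_event_channel_op, EVTCHNOP_alloc_unbound, &alloc);
+    return alloc.port;
+}
+
+/* Backend side: bind to the port the frontend advertised in xenstore. */
+static evtchn_port_t backend_bind(domid_t frontend_domid,
+                                  evtchn_port_t remote_port)
+{
+    struct evtchn_bind_interdomain bind = {
+        .remote_dom  = frontend_domid,
+        .remote_port = remote_port,
+    };
+
+    hypercall2(__HYPERVISOR_event_channel_op, EVTCHNOP_bind_interdomain, &bind);
+    return bind.local_port;
+}
+```
+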
+Currently no event channel state is preserved on migration, requiring frontend
+and backend drivers to create and bind a completely new set of event channels in
+order to re-establish a protocol connection. Hence, one or more new save
+records will be required to transfer event channel state in order to avoid the
+need for explicit action by frontend drivers running in the guest. Note that
+the local port numbers need to be preserved in this state as they are the only
+context the guest has to refer to the hypervisor event channel objects.
+Note also that the PV *store* (Xenstore access) and *console* protocols also
+rely on event channels which are set up by the toolstack. Normally, early in
+migration, the toolstack running on the remote host would set up a new pair of
+event channels for these protocols in the destination domain. These may not be
+assigned the same local port numbers as were used by the protocols in the source
+domain. For transparent migration these channels must either be created with
+fixed port numbers, or their creation must be avoided and instead be included
+in the general event channel state record(s).
+
+### Grant table
+
+The grant table is essentially the para-virtual equivalent of an IOMMU. For
+example, the shared rings of a PV protocol are *granted* by a frontend driver
+to the backend driver by allocating *grant entries* in the guest’s table,
+filling in details of the memory pages and then writing the *grant references*
+(the index values of the grant entries) into Xenstore. The grant references of
+the protocol buffers themselves are typically written directly into the request
+structures passed via a shared ring.
+The guest is responsible for managing its own grant table. No hypercall is
+required to grant a memory page to another domain. It is sufficient to find an
+unused grant entry and set bits in the entry to give read and/or write access
+to a remote domain (which is also specified in the entry, along with the page
+frame number). Thus the layout and content of the grant table logically forms part of
+the guest state.
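+
+As an illustration of "setting bits in the entry", granting a page to a backend
+with the v1 grant table format amounts to something like the sketch below
+(types and flags are from the public `grant_table.h`; the `gnttab` pointer and
+the barrier are simplifications):
+
+```c
+#include <xen/grant_table.h>   /* grant_entry_v1_t, grant_ref_t, GTF_* */
+
+/* The guest's grant table, mapped somewhere in its address space. */
+extern grant_entry_v1_t *gnttab;
+
+/* Grant 'backend_domid' access to guest frame 'gfn' using entry 'ref'.
+ * The grant reference 'ref' is what gets written into xenstore or into
+ * ring requests; no hypercall is involved in creating the grant itself. */
+static void grant_page(grant_ref_t ref, domid_t backend_domid,
+                       unsigned long gfn, int readonly)
+{
+    gnttab[ref].domid = backend_domid;
+    gnttab[ref].frame = gfn;
+    /* The flags write must become visible last, so the entry only becomes
+     * valid once domid and frame are in place. */
+    __sync_synchronize();
+    gnttab[ref].flags = GTF_permit_access | (readonly ? GTF_readonly : 0);
+}
+```
+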
+Currently no grant table state is migrated, requiring a guest to separately
+maintain any state that it wishes to persist elsewhere in its memory image and
+then restore it after migration. Thus to avoid the need for such explicit
+action by the guest, one or more new save records will be required to migrate
+the contents of the grant table.
+
+## Outline Proposal
+
+* PV backend drivers will be modified to unilaterally re-establish connection
+to a frontend if the backend state node is restored with value 4
+(XenbusStateConnected)[6] (see the sketch after this list).
+* The toolstack should be modified to allow domid to be randomized on initial
+creation or default migration, but make it identical to the source domain on
+transparent migration. Transparent migration will have to be denied if the
+domid is unavailable on the target host, but randomization of domid on creation
+should hopefully minimize the likelihood of this. Transparent migration to
+localhost will clearly not be possible. Patches have already been sent to
+`xen-devel` to make this change[7].
+* `xenstored` should be modified to implement the new mechanisms needed. See
+*Other Para-Virtual State* above. A further design document will propose
+additional protocol messages.
+* Within the migration stream extra save records will be defined as required.
+See *Other Para-Virtual State* above. A further design document will propose
+modifications to the LibXenLight and LibXenCtrl Domain Image Formats.
+* An option should be added to the toolstack to initiate a transparent
+migration, instead of the (default) potentially co-operative migration.
+Essentially this should skip the check to see if PV drivers are present and
+migrate as if there are none, but also enable the extra save records. Note that
+at least some of the extra records should only form part of a transparent
+migration stream. For example, migrating event channel state would be
+counter-productive in a normal migration as this would essentially leak event channel
+objects at the receiving end. Others, such as grant table state, could
+potentially harmlessly form part of a normal migration stream.
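+
+The first item above might, in a backend driver, look conceptually like the
+sketch below. This is only an outline of the idea with hypothetical helper
+names, not the actual `xen-blkback`/`xen-netback` changes referenced in
+footnote [6]:
+
+```c
+#include <xen/io/xenbus.h>
+
+/* Hypothetical helpers wrapping xenstore access and ring setup. */
+extern enum xenbus_state read_frontend_state(void);
+extern enum xenbus_state read_backend_state(void);
+extern int remap_rings_and_event_channels(void);  /* re-read existing refs/ports */
+
+/* Called when the backend device is (re)created on the destination host.
+ * If both state nodes were restored as Connected, skip the normal
+ * handshake and simply re-map what the frontend already published. */
+static void backend_probe(void)
+{
+    if (read_backend_state() == XenbusStateConnected &&
+        read_frontend_state() == XenbusStateConnected) {
+        remap_rings_and_event_channels();
+        return;
+    }
+
+    /* Otherwise fall back to the normal XenbusStateInitWait handshake. */
+}
+```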
+
+* * *
+[1] PV drivers are deemed to be installed if the HVM parameter
+*HVM_PARAM_CALLBACK_IRQ* has been set to a non-zero value.
+
+[2] See https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=xen/include/public/io/xenbus.h
+
+[3] See https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=docs/specs/libxc-migration-stream.pandoc
+
+[4] See https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=docs/specs/libxl-migration-stream.pandoc
+
+[5] See https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=docs/misc/xenstore.txt
+
+[6] `xen-blkback` and `xen-netback` have already been modified in Linux to do
+this.
+
+[7] See https://lists.xenproject.org/archives/html/xen-devel/2020-01/msg00632.html
+