From patchwork Mon Jan 27 16:14:30 2020
X-Patchwork-Submitter: Paul Durrant
X-Patchwork-Id: 11352869
From: Paul Durrant
Date: Mon, 27 Jan 2020 16:14:30 +0000
Message-ID: <20200127161430.3312-1-pdurrant@amazon.com>
Subject: [Xen-devel] [PATCH] docs/designs: Add a design document for transparent live migration
X-Mailer: git-send-email 2.20.1
Cc: Stefano Stabellini, Julien Grall, Wei Liu, Konrad Rzeszutek Wilk,
 George Dunlap, Andrew Cooper, Paul Durrant, Ian Jackson

It has become apparent to some large cloud providers that the current model of
co-operative migration of guests under Xen is not usable as it places trust in
software running inside the guest, which is likely beyond the provider's trust
boundary. This patch introduces a proposal for 'transparent' live migration,
designed to overcome the need for this trust.

Signed-off-by: Paul Durrant
---
Cc: Andrew Cooper
Cc: George Dunlap
Cc: Ian Jackson
Cc: Jan Beulich
Cc: Julien Grall
Cc: Konrad Rzeszutek Wilk
Cc: Stefano Stabellini
Cc: Wei Liu
---
 docs/designs/transparent-migration.md | 266 ++++++++++++++++++++++++++
 1 file changed, 266 insertions(+)
 create mode 100644 docs/designs/transparent-migration.md

diff --git a/docs/designs/transparent-migration.md b/docs/designs/transparent-migration.md
new file mode 100644
index 0000000000..9f26d4da6d
--- /dev/null
+++ b/docs/designs/transparent-migration.md
@@ -0,0 +1,266 @@

# Transparent Migration of Guests on Xen

## Background

The term **transparent migration** needs qualification. Here it is taken to
mean migration of a guest without the co-operation of software running inside
that guest. It does not mean that a guest which is aware it is virtualized
under Xen will see *no* changes across a migration, but no part of migration
should require any explicit action by the guest (including re-reading any
state that it may have cached).

The normal model of migration in Xen is driven by the guest because it was
originally implemented for PV guests, where the guest must be aware it is
running under Xen and is hence expected to co-operate. This model dates from
an era when it was assumed that the host administrator had control of at least
the privileged software running in the guest (i.e. the guest kernel), which may
still be true in an enterprise deployment but is not generally true in a cloud
environment. The aim of transparent migration is to provide a model which is
purely host driven, requiring no co-operation from or trust in the software
running in the guest, and is thus suitable for cloud scenarios.

PV guests are out of scope for this project because, as is outlined above, they
have a symbiotic relationship with the hypervisor and therefore a certain level
of co-operation can be assumed.
HVM guests can already be migrated on Xen without guest co-operation but only
if they don’t have PV drivers installed[1] or are in power state S3. The
reason for not expecting co-operation if the guest is in S3 is obvious, but the
reason co-operation is expected if PV drivers are installed is due to the
nature of PV protocols.

## Xenstore Nodes and Domain ID

The PV driver model consists of a *frontend* and a *backend*. The frontend runs
inside the guest domain and the backend runs inside a *service domain*, which
may or may not be domain 0.
The frontend and backend typically pass data via memory pages which are shared
between the two domains, but this channel of communication is generally
established using xenstore (the store protocol itself being an exception to
this for obvious chicken-and-egg reasons).

Typical protocol establishment is based on use of two separate xenstore
*areas*. If we consider PV drivers for the *netif* protocol (i.e. class vif)
and assume the guest has domid X, the service domain has domid Y, and the vif
has index Z, then the frontend area will reside under the parent node:

`/local/domain/X/device/vif/Z`

All backends, by convention, typically reside under the parent node:

`/local/domain/Y/backend`

and the normal backend area for vif Z would be:

`/local/domain/Y/backend/vif/X/Z`

but this should not be assumed.

The toolstack will place two nodes in the frontend area to explicitly locate
the backend:

 * `backend`: the fully qualified xenstore path of the backend area
 * `backend-id`: the domid of the service domain

and similarly two nodes in the backend area to locate the frontend area:

 * `frontend`: the fully qualified xenstore path of the frontend area
 * `frontend-id`: the domid of the guest domain

The guest domain only has write permission to the frontend area and similarly
the service domain only has write permission to the backend area, but both ends
have read permission to both areas.

Under both frontend and backend areas is a node called *state*. This is key to
protocol establishment. Upon PV device creation the toolstack will set the
value of both state nodes to 1 (XenbusStateInitialising[2]). This should cause
enumeration of appropriate devices in both the guest and service domains. The
backend device, once it has written any necessary protocol specific information
into the xenstore backend area (to be read by the frontend driver), will update
the backend state node to 2 (XenbusStateInitWait). From this point on PV
protocols differ slightly; the following illustration is true of the netif
protocol.
Upon seeing a backend state value of 2, the frontend driver will read the
protocol specific information, write details of the grant references (for
shared pages) and event channel ports (for signalling) that it has created, and
set the state node in the frontend area to 4 (XenbusStateConnected). Upon
seeing this frontend state, the backend driver will read the grant references
(mapping the shared pages) and event channel ports (opening its end of them)
and set the state node in the backend area to 4. Protocol establishment is now
complete and the frontend and backend start to pass data.

Because the domid of both ends of a PV protocol forms a key part of negotiating
the data plane for that protocol (it is encoded into both xenstore nodes and
node paths), because the guest’s own domid and the domid of the service domain
are visible to the guest in xenstore (and hence may be cached internally), and
because neither is necessarily preserved during migration, it is necessary to
have the co-operation of the frontend in re-negotiating the protocol using the
new domids after migration.
Moreover, the backend-id value will be used by the frontend driver in setting
up grant table entries and event channels to communicate with the service
domain, so the co-operation of the guest is required to re-establish these in
the new host environment after migration.
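As a concrete illustration of how domids are baked into the protocol plumbing,
the following minimal sketch builds the example paths above. The domid and vif
index values are arbitrary placeholders chosen for this illustration, not
values mandated by Xen or by this proposal.

```c
/*
 * Illustration only: how domids end up encoded in the xenstore paths and
 * nodes used for PV protocol negotiation. The domid values (X = 7, Y = 0)
 * and the vif index are arbitrary examples.
 */
#include <stdio.h>

int main(void)
{
    unsigned int guest_domid = 7;    /* X: the domain running the frontend */
    unsigned int backend_domid = 0;  /* Y: the service domain running the backend */
    unsigned int vif_idx = 0;        /* Z: index of the vif device */
    char frontend_area[64], backend_area[64];

    /* The frontend area lives in the guest's portion of xenstore. */
    snprintf(frontend_area, sizeof(frontend_area),
             "/local/domain/%u/device/vif/%u", guest_domid, vif_idx);

    /* The backend area lives in the service domain's portion of xenstore. */
    snprintf(backend_area, sizeof(backend_area),
             "/local/domain/%u/backend/vif/%u/%u",
             backend_domid, guest_domid, vif_idx);

    /*
     * Both paths, and the backend-id/frontend-id nodes written beneath them,
     * change if either domid changes across migration - which is why frontend
     * co-operation is currently required.
     */
    printf("%s/backend-id = %u\n", frontend_area, backend_domid);
    printf("%s/frontend-id = %u\n", backend_area, guest_domid);
    return 0;
}
```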
Thus if we are to change the model and support migration of a guest with PV
drivers, without the co-operation of the frontend driver code, the paths and
values in both the frontend and backend xenstore areas must remain unchanged
and valid in the new host environment, and the grant table entries and event
channels must be preserved (and remain operational once guest execution is
resumed).
Because the service domain’s domid is used directly by the guest in setting
up grant entries and event channels, the backend drivers in the new host
environment must be provided by a service domain with the same domid. Also,
because the guest can sample its own domid from the frontend area and use it in
hypercalls (e.g. HVMOP_set_param) rather than DOMID_SELF, the guest domid must
also be preserved to maintain the ABI.

Furthermore, it will be necessary to modify backend drivers to re-establish
communication with frontend drivers without perturbing the content of the
backend area or requiring any changes to the values of the xenstore state
nodes.

## Other Para-Virtual State

### Shared Rings

Because the console and store protocol shared pages are actually part of the
guest memory image (in an E820 reserved region just below 4G), their content
will be migrated as part of the guest memory image. Hence no additional code
is required to prevent any guest visible change in the content.

### Shared Info

There is already a record defined in *LibXenCtrl Domain Image Format* [3]
called `SHARED_INFO` which simply contains a complete copy of the domain’s
shared info page. It is not currently included in an HVM (type `0x0002`)
migration stream. It may be feasible to include it as an optional record,
but it is not clear that the content of the shared info page ever needs
to be preserved for an HVM guest.
For a PV guest the `arch_shared_info` sub-structure contains important
information about the guest’s P2M, but this information is not relevant for
an HVM guest, where the P2M is not directly manipulated by the guest. The other
state contained in the `shared_info` structure relates to the domain wall-clock
(the state of which should already be transferred by the `RTC` HVM context
information contained in the `HVM_CONTEXT` save record) and some event
channel state (particularly if using the *2l* protocol). Event channel state
will need to be fully transferred if we are not going to require the guest’s
co-operation to re-open the channels, and so it should be possible to re-build
a shared info page for an HVM guest from such other state.
Note that the shared info page also contains an array of `XEN_LEGACY_MAX_VCPUS`
(32) `vcpu_info` structures. A domain may nominate a different guest physical
address to use for the vcpu info. This is mandatory if a domain wants to
use more than 32 vCPUs and optional for the legacy vCPUs. This mapping is not
currently transferred in the migration stream, so it will either need to be
added into an existing save record, or an additional type of save record will
be needed.
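For orientation, the sketch below paraphrases the layout of the shared info
page discussed above, abridged and simplified from the public `xen.h` header
on x86; the `_sketch` names, reduced field types and omitted sub-structures
are illustrative rather than authoritative.

```c
/*
 * Abridged sketch of the shared info page layout (x86), paraphrased from
 * xen/include/public/xen.h. Consult the real header for the authoritative
 * definition; types and omissions here are simplifications.
 */
#include <stdint.h>

#define XEN_LEGACY_MAX_VCPUS 32

struct vcpu_info_sketch {
    uint8_t  evtchn_upcall_pending;  /* per-vCPU event delivery flag */
    uint8_t  evtchn_upcall_mask;
    uint64_t evtchn_pending_sel;     /* selects words in evtchn_pending[] */
    /* ... per-vCPU arch state and time info omitted ... */
};

struct shared_info_sketch {
    /*
     * Only the first 32 vCPUs can have their vcpu_info here; others must be
     * registered at a guest-chosen address, a mapping not currently carried
     * in the migration stream.
     */
    struct vcpu_info_sketch vcpu_info[XEN_LEGACY_MAX_VCPUS];

    /*
     * 2-level event channel pending/mask bitmaps: state that would need to
     * be transferred if the guest is not to re-open its event channels.
     */
    uint64_t evtchn_pending[64];
    uint64_t evtchn_mask[64];

    /*
     * Domain wall-clock; for HVM guests largely duplicated by the RTC state
     * in the HVM_CONTEXT save record.
     */
    uint32_t wc_version;
    uint32_t wc_sec;
    uint32_t wc_nsec;

    /* arch_shared_info (e.g. PV P2M details) omitted: not relevant to HVM. */
};
```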
### Xenstore Watches

As mentioned above, no domain Xenstore state is currently transferred in the
migration stream. There is a record defined in *LibXenLight Domain Image
Format* [4] called `EMULATOR_XENSTORE_DATA` for transferring Xenstore nodes
relating to emulators, but no record type is defined for nodes relating to the
domain itself, nor for registered *watches*. A XenStore watch is a mechanism
used by PV frontend and backend drivers to request a notification if the value
of a particular node (e.g. the other end’s state node) changes, so it is
important that watches continue to function after a migration. One or more new
save records will therefore be required to transfer Xenstore state. It will
also be necessary to extend the *store* protocol[5] with mechanisms to allow
the toolstack to acquire the list of watches that the guest has registered and
to register a watch on behalf of a domain.

### Event channels

Event channels are essentially the para-virtual equivalent of interrupts. They
are an important part of most PV protocols. Normally a frontend driver creates
an *inter-domain* event channel between its own domain and the domain running
the backend, which it discovers using the `backend-id` node in Xenstore (see
above), by making an `EVTCHNOP_alloc_unbound` hypercall. This hypercall
allocates an event channel object in the hypervisor and assigns a *local port*
number, which is then written into the frontend area in Xenstore. The backend
driver then reads this port number and *binds* to the event channel by passing
the value of `frontend-id` as the *remote domain* and the port number as the
*remote port* to an `EVTCHNOP_bind_interdomain` hypercall. Once connection is
established in this fashion, frontend and backend drivers can use the event
channel as a *mailbox* to notify each other when a shared ring has been updated
with new request or response structures.
Currently no event channel state is preserved on migration, requiring frontend
and backend drivers to create and bind a completely new set of event channels
in order to re-establish a protocol connection. Hence, one or more new save
records will be required to transfer event channel state in order to avoid the
need for explicit action by frontend drivers running in the guest. Note that
the local port numbers need to be preserved in this state as they are the only
context the guest has to refer to the hypervisor event channel objects.
Note also that the PV *store* (Xenstore access) and *console* protocols also
rely on event channels, which are set up by the toolstack. Normally, early in
migration, the toolstack running on the remote host would set up a new pair of
event channels for these protocols in the destination domain. These may not be
assigned the same local port numbers as those used by the protocols running in
the source domain. For transparent migration these channels must either be
created with fixed port numbers, or their creation must be avoided and the
channels instead be included in the general event channel state record(s).

### Grant table

The grant table is essentially the para-virtual equivalent of an IOMMU. For
example, the shared rings of a PV protocol are *granted* by a frontend driver
to the backend driver by allocating *grant entries* in the guest’s table,
filling in details of the memory pages and then writing the *grant references*
(the index values of the grant entries) into Xenstore. The grant references of
the protocol buffers themselves are typically written directly into the request
structures passed via a shared ring.
The guest is responsible for managing its own grant table. No hypercall is
required to grant a memory page to another domain. It is sufficient to find an
unused grant entry and set bits in the entry to give read and/or write access
to a remote domain, also specified in the entry along with the page frame
number. Thus the layout and content of the grant table logically form part of
the guest state.
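The following minimal sketch shows what "setting bits in the entry" amounts to
for a version 1 grant entry. The structure layout and flag values follow the
public `grant_table.h` header, while the table pointer, the barrier macro and
the `grant_page()` helper are hypothetical placeholders for this illustration.

```c
/*
 * Minimal sketch: a frontend grants a page to the backend without any
 * hypercall, simply by filling in a v1 grant entry in its own grant table.
 * grant_entry_v1 and the GTF_* values follow xen/include/public/grant_table.h;
 * grant_table, wmb() and grant_page() are illustrative placeholders.
 */
#include <stdint.h>

typedef uint16_t domid_t;

struct grant_entry_v1 {
    uint16_t flags;     /* GTF_* flags: type of grant and access rights */
    domid_t  domid;     /* domain being granted access */
    uint32_t frame;     /* guest page frame being granted */
};

#define GTF_permit_access (1U << 0)   /* permit access to the frame */
#define GTF_readonly      (1U << 2)   /* restrict the grant to read-only */

/* Hypothetical: the guest maps its grant table wherever it chooses. */
extern struct grant_entry_v1 *grant_table;
#define wmb() __sync_synchronize()

static void grant_page(unsigned int ref, domid_t backend_domid,
                       uint32_t gfn, int readonly)
{
    struct grant_entry_v1 *entry = &grant_table[ref];

    entry->domid = backend_domid;   /* note: the backend's domid again */
    entry->frame = gfn;
    wmb();                          /* frame/domid visible before flags */
    entry->flags = GTF_permit_access | (readonly ? GTF_readonly : 0);
    /*
     * 'ref' is the grant reference the frontend then writes into xenstore
     * or into request structures on a shared ring.
     */
}
```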
Currently no grant table state is migrated, requiring a guest to separately
maintain any state that it wishes to persist elsewhere in its memory image and
then restore it after migration. Thus, to avoid the need for such explicit
action by the guest, one or more new save records will be required to migrate
the contents of the grant table.

## Outline Proposal

* PV backend drivers will be modified to unilaterally re-establish connection
to a frontend if the backend state node is restored with value 4
(XenbusStateConnected)[6]; a sketch of this reconnection logic is given after
the footnotes below.
* The toolstack should be modified to allow the domid to be randomized on
initial creation or default migration, but to make it identical to that of the
source domain on transparent migration. Transparent migration will have to be
denied if the domid is unavailable on the target host, but randomization of
the domid on creation should hopefully minimize the likelihood of this.
Transparent migration to localhost will clearly not be possible. Patches have
already been sent to `xen-devel` to make this change[7].
* `xenstored` should be modified to implement the new mechanisms needed. See
*Other Para-Virtual State* above. A further design document will propose
additional protocol messages.
* Within the migration stream extra save records will be defined as required.
See *Other Para-Virtual State* above. A further design document will propose
modifications to the LibXenLight and LibXenCtrl Domain Image Formats.
* An option should be added to the toolstack to initiate a transparent
migration instead of the (default) potentially co-operative migration.
Essentially this should skip the check to see whether PV drivers are present
and migrate as if there are none, but also enable the extra save records. Note
that at least some of the extra records should only form part of a transparent
migration stream. For example, migrating event channel state would be
counterproductive in a normal migration as it would essentially leak event
channel objects at the receiving end. Others, such as grant table state, could
potentially form a harmless part of a normal migration stream.

* * *
[1] PV drivers are deemed to be installed if the HVM parameter
*HVM_PARAM_CALLBACK_IRQ* has been set to a non-zero value.

[2] See https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=xen/include/public/io/xenbus.h

[3] See https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=docs/specs/libxc-migration-stream.pandoc

[4] See https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=docs/specs/libxl-migration-stream.pandoc

[5] See https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=docs/misc/xenstore.txt

[6] `xen-blkback` and `xen-netback` have already been modified in Linux to do
this.

[7] See https://lists.xenproject.org/archives/html/xen-devel/2020-01/msg00632.html
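As a rough illustration of the first bullet of the outline proposal (and of
the kind of behaviour footnote [6] refers to), the sketch below shows the
general shape of the reconnection logic. It is a sketch only: the helper
functions and the `backend_restore()` entry point are hypothetical
placeholders, not an existing driver API.

```c
/*
 * Rough sketch of the proposed backend behaviour: if, on restore, the
 * backend's own state node already reads XenbusStateConnected, skip the
 * normal InitWait handshake and reconnect using the frontend-provided
 * details, without touching either state node. Helper functions are
 * hypothetical placeholders for whatever xenbus/xenstore API a real
 * backend framework provides.
 */

/* Values follow xen/include/public/io/xenbus.h. */
enum xenbus_state {
    XenbusStateUnknown      = 0,
    XenbusStateInitialising = 1,
    XenbusStateInitWait     = 2,
    XenbusStateInitialised  = 3,
    XenbusStateConnected    = 4,
};

/* Hypothetical helpers. */
extern enum xenbus_state read_own_state(void *dev);
extern int read_frontend_ring_details(void *dev);   /* grant refs, evtchn ports */
extern int map_rings_and_bind_evtchns(void *dev);

static int backend_restore(void *dev)
{
    /*
     * The backend area has been restored verbatim, so a state of
     * XenbusStateConnected means the frontend believes the protocol is
     * still fully established.
     */
    if (read_own_state(dev) != XenbusStateConnected)
        return 0;   /* fall back to the normal handshake */

    /*
     * Re-read the (unchanged) grant references and event channel ports
     * from the frontend area and re-establish the data path.
     */
    if (read_frontend_ring_details(dev))
        return -1;

    return map_rings_and_bind_evtchns(dev);
}
```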