[RFC,1/2] docs/design: Add a design document for Live Update

Message ID 20210506104259.16928-2-julien@xen.org (mailing list archive)
State New
Series Add a design document for Live Updating Xen

Commit Message

Julien Grall May 6, 2021, 10:42 a.m. UTC
From: Julien Grall <jgrall@amazon.com>

Administrators often require updating the Xen hypervisor to address
security vulnerabilities, introduce new features, or fix software defects.
Currently, we offer the following methods to perform the update:

    * Rebooting the guests and the host: this is highly disruptive to running
      guests.
    * Migrating off the guests, rebooting the host: this currently requires
      the guest to cooperate (see [1] for a non-cooperative solution) and it
      may not always be possible to migrate it off (e.g. lack of capacity, use
      of local storage...).
    * Live patching: This is the least disruptive of the existing methods.
      However, it can be difficult to prepare the livepatch if the change is
      large or there are data structures to update.

This patch will introduce a new proposal called "Live Update" which will
activate new software without noticeable downtime (i.e. no - or minimal -
customer pain).

Signed-off-by: Julien Grall <jgrall@amazon.com>
---
 docs/designs/liveupdate.md | 254 +++++++++++++++++++++++++++++++++++++
 1 file changed, 254 insertions(+)
 create mode 100644 docs/designs/liveupdate.md

Comments

Paul Durrant May 6, 2021, 2:43 p.m. UTC | #1
On 06/05/2021 11:42, Julien Grall wrote:
> From: Julien Grall <jgrall@amazon.com>
> 

Looks good in general... just a few comments below...

> Administrators often require updating the Xen hypervisor to address
> security vulnerabilities, introduce new features, or fix software defects.
> Currently, we offer the following methods to perform the update:
> 
>      * Rebooting the guests and the host: this is highly disrupting to running
>        guests.
>      * Migrating off the guests, rebooting the host: this currently requires
>        the guest to cooperate (see [1] for a non-cooperative solution) and it
>        may not always be possible to migrate it off (i.e lack of capacity, use
>        of local storage...).
>      * Live patching: This is the less disruptive of the existing methods.
>        However, it can be difficult to prepare the livepatch if the change is
>        large or there are data structures to update.
> 
> This patch will introduce a new proposal called "Live Update" which will
> activate new software without noticeable downtime (i.e no - or minimal -
> customer).
> 
> Signed-off-by: Julien Grall <jgrall@amazon.com>
> ---
>   docs/designs/liveupdate.md | 254 +++++++++++++++++++++++++++++++++++++
>   1 file changed, 254 insertions(+)
>   create mode 100644 docs/designs/liveupdate.md
> 
> diff --git a/docs/designs/liveupdate.md b/docs/designs/liveupdate.md
> new file mode 100644
> index 000000000000..32993934f4fe
> --- /dev/null
> +++ b/docs/designs/liveupdate.md
> @@ -0,0 +1,254 @@
> +# Live Updating Xen
> +
> +## Background
> +
> +Administrators often require updating the Xen hypervisor to address security
> +vulnerabilities, introduce new features, or fix software defects.  Currently,
> +we offer the following methods to perform the update:
> +
> +    * Rebooting the guests and the host: this is highly disrupting to running
> +      guests.
> +    * Migrating off the guests, rebooting the host: this currently requires
> +      the guest to cooperate (see [1] for a non-cooperative solution) and it
> +      may not always be possible to migrate it off (i.e lack of capacity, use
> +      of local storage...).
> +    * Live patching: This is the less disruptive of the existing methods.
> +      However, it can be difficult to prepare the livepatch if the change is
> +      large or there are data structures to update.
> +
> +This document will present a new approach called "Live Update" which will
> +activate new software without noticeable downtime (i.e no - or minimal -
> +customer pain).
> +
> +## Terminology
> +
> +xen#1: Xen version currently active and running on a droplet.  This is the
> +“source” for the Live Update operation.  This version can actually be newer
> +than xen#2 in case of a rollback operation.
> +
> +xen#2: Xen version that's the “target” of the Live Update operation. This
> +version will become the active version after successful Live Update.  This
> +version of Xen can actually be older than xen#1 in case of a rollback
> +operation.
> +
> +## High-level overview
> +
> +Xen has a framework to bring a new image of the Xen hypervisor in memory using
> +kexec.  The existing framework does not meet the baseline functionality for
> +Live Update, since kexec results in a restart for the hypervisor, host, Dom0,
> +and all the guests.

Feels like there's a sentence or two missing here. The subject has 
jumped from a framework that is not fit for purpose to 'the operation'.

> +
> +The operation can be divided in roughly 4 parts:
> +
> +    1. Trigger: The operation will by triggered from outside the hypervisor
> +       (e.g. dom0 userspace).
> +    2. Save: The state will be stabilized by pausing the domains and
> +       serialized by xen#1.
> +    3. Hand-over: xen#1 will pass the serialized state and transfer control to
> +       xen#2.
> +    4. Restore: The state will be deserialized by xen#2.
> +
> +All the domains will be paused before xen#1 is starting to save the states,

s/is starting/starts

> +and any domain that was running before Live Update will be unpaused after
> +xen#2 has finished to restore the states.  This is to prevent a domain to try

s/finished to restore/finished restoring

and

s/domain to try/domain trying

> +to modify the state of another domain while it is being saved/restored.
> +
> +The current approach could be seen as non-cooperative migration with a twist:
> +all the domains (including dom0) are not expected be involved in the Live
> +Update process.
> +
> +The major differences compare to live migration are:

s/compare/compared

> +
> +    * The state is not transferred to another host, but instead locally to
> +      xen#2.
> +    * The memory content or device state (for passthrough) does not need to
> +      be part of the stream. Instead we need to preserve it.
> +    * PV backends, device emulators, xenstored are not recreated but preserved
> +      (as these are part of dom0).
> +
> +
> +Domains in process of being destroyed (*XEN\_DOMCTL\_destroydomain*) will need
> +to be preserved because another entity may have mappings (e.g foreign, grant)
> +on them.
> +
> +## Trigger
> +
> +Live update is built on top of the kexec interface to prepare the command line,
> +load xen#2 and trigger the operation.  A new kexec type has been introduced
> +(*KEXEC\_TYPE\_LIVE\_UPDATE*) to notify Xen to Live Update.
> +
> +The Live Update will be triggered from outside the hypervisor (e.g. dom0
> +userspace).  Support for the operation has been added in kexec-tools 2.0.21.
> +
> +All the domains will be paused before xen#1 is starting to save the states.

You already said this in the previous section.

> +In Xen, *domain\_pause()* will pause the vCPUs as soon as they can be re-
> +scheduled.  In other words, a pause request will not wait for asynchronous
> +requests (e.g. I/O) to finish.  For Live Update, this is not an ideal time to
> +pause because it will require more xen#1 internal state to be transferred.
> +Therefore, all the domains will be paused at an architectural restartable
> +boundary.
> +
> +Live update will not happen synchronously to the request but when all the
> +domains are quiescent.  As domains running device emulators (e.g Dom0) will
> +be part of the process to quiesce HVM domains, we will need to let them run
> +until xen#1 is actually starting to save the state.  HVM vCPUs will be paused
> +as soon as any pending asynchronous request has finished.
> +
> +In the current implementation, all PV domains will continue to run while the
> +rest will be paused as soon as possible.  Note this approach is assuming that
> +device emulators are only running in PV domains.
> +
> +It should be easy to extend to PVH domains not requiring device emulations.
> +It will require more thought if we need to run device models in HVM domains as
> +there might be inter-dependency.
> +
> +## Save
> +
> +xen#1 will be responsible to preserve and serialize the state of each existing
> +domain and any system-wide state (e.g M2P).

s/to preserve and serialize/for preserving and serializing

> +
> +Each domain will be serialized independently using a modified migration stream,
> +if there is any dependency between domains (such as for IOREQ server) they will
> +be recorded using a domid. All the complexity of resolving the dependencies are
> +left to the restore path in xen#2 (more in the *Restore* section).
> +
> +At the moment, the domains are saved one by one in a single thread, but it
> +would be possible to consider multi-threading if it takes too long. Although
> +this may require some adjustment in the stream format.
> +
> +As we want to be able to Live Update between major versions of Xen (e.g Xen
> +4.11 -> Xen 4.15), the states preserved should not be a dump of Xen internal
> +structure but instead the minimal information that allow us to recreate the
> +domains.
> +
> +For instance, we don't want to preserve the frametable (and therefore
> +*struct page\_info*) as-is because the refcounting may be different across
> +between xen#1 and xen#2 (see XSA-299). Instead, we want to be able to recreate
> +*struct page\_info* based on minimal information that are considered stable
> +(such as the page type).
> +
> +Note that upgrading between version of Xen will also require all the hypercalls
> +to be stable. This will not be covered by this document.
> +
> +## Hand over
> +
> +### Memory usage restrictions
> +
> +xen#2 must take care not to use any memory pages which already belong to
> +guests.  To facilitate this, a number of contiguous region of memory are
> +reserved for the boot allocator, known as *live update bootmem*.
> +
> +xen#1 will always reserve a region just below Xen (the size is controlled by
> +the Xen command line parameter liveupdate) to allow Xen growing and provide
> +information about LiveUpdate (see the section *Breadcrumb*).  The region will be
> +passed to xen#2 using the same command line option but with the base address
> +specified.
> +
> +For simplicity, additional regions will be provided in the stream.  They will
> +consist of region that could be re-used by xen#2 during boot (such as the

s/region/a region

   Paul

> +xen#1's frametable memory).
> +
> +xen#2 must not use any pages outside those regions until it has consumed the
> +Live Update data stream and determined which pages are already in use by
> +running domains or need to be re-used as-is by Xen (e.g M2P).
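
To illustrate the restriction above, here is a minimal sketch of the check
xen#2's early allocator would need before handing out a page. The struct
lu_region layout and the name lu_mfn_in_bootmem() are hypothetical and not
taken from the patch; the regions are assumed to be described as a start
frame plus a page count.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one live update bootmem region, in frames. */
struct lu_region {
    uint64_t start_mfn; /* first frame of the region (inclusive) */
    uint64_t nr_pages;  /* number of frames in the region */
};

/*
 * Return true if mfn lies inside one of the nr regions, i.e. if xen#2 may
 * use the page before it has consumed the Live Update data stream.
 */
bool lu_mfn_in_bootmem(const struct lu_region *regions, size_t nr,
                       uint64_t mfn)
{
    for (size_t i = 0; i < nr; i++) {
        /* A region must not wrap around the end of the frame space. */
        assert(regions[i].nr_pages <= UINT64_MAX - regions[i].start_mfn);

        if (mfn >= regions[i].start_mfn &&
            mfn - regions[i].start_mfn < regions[i].nr_pages)
            return true;
    }

    return false;
}
```

Any frame for which this returns false would have to be treated as
potentially owned by a domain until the stream has been parsed.
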
> +
> +At run time, Xen may use memory from the reserved region for any purpose that
> +does not require preservation over a Live Update; in particular it __must__ not be
> +mapped to a domain or used by any Xen state requiring to be preserved (e.g
> +M2P).  In other word, the xenheap pages could be allocated from the reserved
> +regions if we remove the concept of shared xenheap pages.
> +
> +The xen#2's binary may be bigger (or smaller) compare to xen#1's binary.  So
> +for the purpose of loading xen#2 binary, kexec should treat the reserved memory
> +right below xen#1 and its region as a single contiguous space. xen#2 will be
> +loaded right at the top of the contiguous space and the rest of the memory will
> +be the new reserved memory (this may shrink or grow).  For that reason, freed
> +init memory from xen#1 image is also treated as reserved liveupdate update
> +bootmem.
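
The arithmetic described in the paragraph above can be sketched as follows.
This is only an illustration under stated assumptions: the names
(lu_layout, lu_place_xen2) are hypothetical, alignment requirements are
ignored, and the reserved region is assumed to sit immediately below the
xen#1 image so that both form one contiguous space.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result of placing xen#2 in the contiguous space. */
struct lu_layout {
    uint64_t xen2_start;     /* load address of xen#2 */
    uint64_t reserved_start; /* base of the new reserved region */
    uint64_t reserved_size;  /* size of the new reserved region */
};

/*
 * Treat [reserved_start, reserved_start + reserved_size + xen1_size) as one
 * contiguous space, load xen#2 at its top and keep the rest as the new
 * reserved region. Returns 0 on success, -1 if xen#2 does not fit.
 */
int lu_place_xen2(uint64_t reserved_start, uint64_t reserved_size,
                  uint64_t xen1_size, uint64_t xen2_size,
                  struct lu_layout *out)
{
    uint64_t total = reserved_size + xen1_size;

    assert(out != 0);

    if (xen2_size > total)
        return -1;

    out->xen2_start = reserved_start + total - xen2_size;
    out->reserved_start = reserved_start;
    out->reserved_size = total - xen2_size;

    return 0;
}
```

With a 4 MiB reserved region and a 2 MiB xen#1, a 3 MiB xen#2 leaves 3 MiB
reserved, so the reserved region shrinks by the amount the binary grew.
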
> +
> +### Live Update data stream
> +
> +During handover, xen#1 creates a Live Update data stream containing all the
> +information required by the new Xen#2 to restore all the domains.
> +
> +Data pages for this stream may be allocated anywhere in physical memory outside
> +the *live update bootmem* regions.
> +
> +As calling __vmap()__/__vunmap()__ has a cost on the downtime.  We want to reduce the
> +number of call to __vmap()__ when restoring the stream.  Therefore the stream
> +will be contiguously virtually mapped in xen#2.  xen#1 will create an array of
> +MFNs of the allocated data pages, suitable for passing to __vmap()__.  The
> +array will be physically contiguous but the MFNs don't need to be physically
> +contiguous.
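
As an illustration of what the single virtually contiguous mapping buys,
here is a small sketch of how an offset in the stream relates to the MFN
array. It does not use Xen's vmap() interface; the function name and the
4 KiB page size are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LU_PAGE_SIZE 4096u /* assumption: 4 KiB pages */

/*
 * Translate a byte offset within the (virtually contiguous) stream into the
 * physical address backing it. mfns[] is the array built by xen#1; its
 * entries do not need to be physically contiguous.
 */
uint64_t lu_stream_offset_to_paddr(const uint64_t *mfns, size_t nr_pages,
                                   uint64_t offset)
{
    uint64_t idx = offset / LU_PAGE_SIZE;

    /* The offset must fall within the pages described by the array. */
    assert(idx < nr_pages);

    return mfns[idx] * LU_PAGE_SIZE + offset % LU_PAGE_SIZE;
}
```

Once the whole array has been mapped in one go, the restore code can walk
the stream with plain pointer arithmetic instead of doing this lookup (or a
map/unmap pair) for every record.
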
> +
> +### Breadcrumb
> +
> +Since the Live Update data stream is created during the final **kexec\_exec**
> +hypercall, its address cannot be passed on the command line to the new Xen
> +since the command line needs to have been set up by **kexec(8)** in userspace
> +long beforehand.
> +
> +Thus, to allow the new Xen to find the data stream, xen#1 places a breadcrumb
> +in the first words of the Live Update bootmem, containing the number of data
> +pages, and the physical address of the contiguous MFN array.
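
A minimal sketch of such a breadcrumb, to make the description concrete.
The document only says it holds the number of data pages and the physical
address of the MFN array; the struct name, field names, field widths and
helper functions below are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout: names and 64-bit field widths are assumptions. */
struct lu_breadcrumb {
    uint64_t nr_data_pages;   /* number of data pages in the stream */
    uint64_t mfn_array_paddr; /* physical address of the contiguous MFN array */
};

/* xen#1 side: store the breadcrumb in the first words of the bootmem. */
void lu_breadcrumb_write(void *bootmem, size_t bootmem_size,
                         uint64_t nr_data_pages, uint64_t mfn_array_paddr)
{
    struct lu_breadcrumb bc = { nr_data_pages, mfn_array_paddr };

    assert(bootmem_size >= sizeof(bc));
    memcpy(bootmem, &bc, sizeof(bc));
}

/* xen#2 side: read the breadcrumb back from the start of the bootmem. */
struct lu_breadcrumb lu_breadcrumb_read(const void *bootmem)
{
    struct lu_breadcrumb bc;

    memcpy(&bc, bootmem, sizeof(bc));
    return bc;
}
```

Since xen#1 and xen#2 may be different versions, whatever layout is chosen
here would effectively become part of the Live Update ABI.
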
> +
> +### IOMMU
> +
> +Where devices are passed through to domains, it may not be possible to quiesce
> +those devices for the purpose of performing the update.
> +
> +If performing Live Update with assigned devices, xen#1 will leave the IOMMU
> +mappings active during the handover (thus implying that IOMMU page tables may
> +not be allocated in the *live update bootmem* region either).
> +
> +xen#2 must take control of the IOMMU without causing those mappings to become
> +invalid even for a short period of time.  In other words, xen#2 should not
> +re-setup the IOMMUs.  On hardware which does not support Posted Interrupts,
> +interrupts may need to be generated on resume.
> +
> +## Restore
> +
> +After xen#2 initialized itself and map the stream, it will be responsible to
> +restore the state of the system and each domain.
> +
> +Unlike the save part, it is not possible to restore a domain in a single pass.
> +There are dependencies between:
> +
> +    1. different states of a domain.  For instance, the event channels ABI
> +       used (2l vs fifo) requires to be restored before restoring the event
> +       channels.
> +    2. the same "state" within a domain.  For instance, in case of PV domain,
> +       the pages' ownership requires to be restored before restoring the type
> +       of the page (e.g is it an L4, L1... table?).
> +
> +    3. domains.  For instance when restoring the grant mapping, it will be
> +       necessary to have the page's owner in hand to do proper refcounting.
> +       Therefore the pages' ownership have to be restored first.
> +
> +Dependencies will be resolved using either multiple passes (for dependency
> +type 2 and 3) or using a specific ordering between records (for dependency
> +type 1).
> +
> +Each domain will be restored in 3 passes:
> +
> +    * Pass 0: Create the domain and restore the P2M for HVM. This can be broken
> +      down in 3 parts:
> +      * Allocate a domain via _domain\_create()_ but skip part that requires
> +        extra records (e.g HAP, P2M).
> +      * Restore any parts which needs to be done before create the vCPUs. This
> +        including restoring the P2M and whether HAP is used.
> +      * Create the vCPUs. Note this doesn't restore the state of the vCPUs.
> +    * Pass 1: It will restore the pages' ownership and the grant-table frames
> +    * Pass 2: This steps will restore any domain states (e.g vCPU state, event
> +      channels) that wasn't
> +
> +A domain should not have a dependency on another domain within the same pass.
> +Therefore it would be possible to take advantage of all the CPUs to restore
> +domains in parallel and reduce the overall downtime.
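
The pass structure described above can be sketched as a loop in which every
domain completes pass N before any domain starts pass N+1. This is a
single-threaded illustration only; lu_restore_all(), the callback type and
the tracing helper are hypothetical names, not functions from the patch.

```c
#include <assert.h>
#include <stddef.h>

#define LU_NR_PASSES 3 /* passes 0, 1 and 2 from the design document */

/* Hypothetical per-domain, per-pass restore hook; returns 0 on success. */
typedef int (*lu_restore_fn)(unsigned int domid, unsigned int pass, void *ctx);

/*
 * Run each pass over all domains before moving to the next pass. Domains do
 * not depend on each other within a pass, so the inner loop is the part that
 * could be spread across CPUs.
 */
int lu_restore_all(const unsigned int *domids, size_t nr_domains,
                   lu_restore_fn restore, void *ctx)
{
    assert(restore != NULL);

    for (unsigned int pass = 0; pass < LU_NR_PASSES; pass++) {
        for (size_t i = 0; i < nr_domains; i++) {
            int rc = restore(domids[i], pass, ctx);

            if (rc != 0)
                return rc;
        }
    }

    return 0;
}

/* Example hook that only records the order in which steps are invoked. */
struct lu_trace {
    unsigned int domid[16];
    unsigned int pass[16];
    size_t n;
};

int lu_trace_step(unsigned int domid, unsigned int pass, void *ctx)
{
    struct lu_trace *t = ctx;

    if (t->n >= 16)
        return -1;

    t->domid[t->n] = domid;
    t->pass[t->n] = pass;
    t->n++;

    return 0;
}
```

For two domains the hook is called six times, with both domains finishing
pass 0 before either enters pass 1.
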
> +
> +Once all the domains have been restored, they will be unpaused if they were
> +running before Live Update.
> +
> +* * *
> +[1] https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=docs/designs/non-cooperative-migration.md;h=4b876d809fb5b8aac02d29fd7760a5c0d5b86d87;hb=HEAD
> +
>
Hongyan Xia May 7, 2021, 9:18 a.m. UTC | #2
On Thu, 2021-05-06 at 11:42 +0100, Julien Grall wrote:
> From: Julien Grall <jgrall@amazon.com>
> 
> Administrators often require updating the Xen hypervisor to address
> security vulnerabilities, introduce new features, or fix software
> defects.
> Currently, we offer the following methods to perform the update:
> 
>     * Rebooting the guests and the host: this is highly disrupting to
> running
>       guests.
>     * Migrating off the guests, rebooting the host: this currently
> requires
>       the guest to cooperate (see [1] for a non-cooperative solution)
> and it
>       may not always be possible to migrate it off (i.e lack of
> capacity, use
>       of local storage...).
>     * Live patching: This is the less disruptive of the existing
> methods.
>       However, it can be difficult to prepare the livepatch if the
> change is
>       large or there are data structures to update.

It might be worth mentioning that live patching slowly consumes memory,
fragments the Xen image, and degrades performance (especially when the
patched code is on the critical path).

> 
> This patch will introduce a new proposal called "Live Update" which
> will
> activate new software without noticeable downtime (i.e no - or
> minimal -
> customer).
> 
> Signed-off-by: Julien Grall <jgrall@amazon.com>
> ---
>  docs/designs/liveupdate.md | 254
> +++++++++++++++++++++++++++++++++++++
>  1 file changed, 254 insertions(+)
>  create mode 100644 docs/designs/liveupdate.md
> 
> diff --git a/docs/designs/liveupdate.md b/docs/designs/liveupdate.md
> new file mode 100644
> index 000000000000..32993934f4fe
> --- /dev/null
> +++ b/docs/designs/liveupdate.md
> @@ -0,0 +1,254 @@
> +# Live Updating Xen
> +
> +## Background
> +
> +Administrators often require updating the Xen hypervisor to address
> security
> +vulnerabilities, introduce new features, or fix software
> defects.  Currently,
> +we offer the following methods to perform the update:
> +
> +    * Rebooting the guests and the host: this is highly disrupting
> to running
> +      guests.
> +    * Migrating off the guests, rebooting the host: this currently
> requires
> +      the guest to cooperate (see [1] for a non-cooperative
> solution) and it
> +      may not always be possible to migrate it off (i.e lack of
> capacity, use
> +      of local storage...).
> +    * Live patching: This is the less disruptive of the existing
> methods.
> +      However, it can be difficult to prepare the livepatch if the
> change is
> +      large or there are data structures to update.
> +
> +This document will present a new approach called "Live Update" which
> will
> +activate new software without noticeable downtime (i.e no - or
> minimal -
> +customer pain).
> +
> +## Terminology
> +
> +xen#1: Xen version currently active and running on a droplet.  This
> is the
> +“source” for the Live Update operation.  This version can actually
> be newer
> +than xen#2 in case of a rollback operation.
> +
> +xen#2: Xen version that's the “target” of the Live Update operation.
> This
> +version will become the active version after successful Live
> Update.  This
> +version of Xen can actually be older than xen#1 in case of a
> rollback
> +operation.

This is a bit redundant, since the rollback case was already mentioned under
xen#1.

> +
> +## High-level overview
> +
> +Xen has a framework to bring a new image of the Xen hypervisor in
> memory using
> +kexec.  The existing framework does not meet the baseline
> functionality for
> +Live Update, since kexec results in a restart for the hypervisor,
> host, Dom0,
> +and all the guests.
> +
> +The operation can be divided in roughly 4 parts:
> +
> +    1. Trigger: The operation will by triggered from outside the
> hypervisor
> +       (e.g. dom0 userspace).
> +    2. Save: The state will be stabilized by pausing the domains and
> +       serialized by xen#1.
> +    3. Hand-over: xen#1 will pass the serialized state and transfer
> control to
> +       xen#2.
> +    4. Restore: The state will be deserialized by xen#2.
> +
> +All the domains will be paused before xen#1 is starting to save the
> states,
> +and any domain that was running before Live Update will be unpaused
> after
> +xen#2 has finished to restore the states.  This is to prevent a
> domain to try
> +to modify the state of another domain while it is being
> saved/restored.
> +
> +The current approach could be seen as non-cooperative migration with
> a twist:
> +all the domains (including dom0) are not expected be involved in the
> Live
> +Update process.
> +
> +The major differences compare to live migration are:
> +
> +    * The state is not transferred to another host, but instead
> locally to
> +      xen#2.
> +    * The memory content or device state (for passthrough) does not
> need to
> +      be part of the stream. Instead we need to preserve it.
> +    * PV backends, device emulators, xenstored are not recreated but
> preserved
> +      (as these are part of dom0).
> +
> +
> +Domains in process of being destroyed (*XEN\_DOMCTL\_destroydomain*)
> will need
> +to be preserved because another entity may have mappings (e.g
> foreign, grant)
> +on them.
> +
> +## Trigger
> +
> +Live update is built on top of the kexec interface to prepare the
> command line,
> +load xen#2 and trigger the operation.  A new kexec type has been
> introduced
> +(*KEXEC\_TYPE\_LIVE\_UPDATE*) to notify Xen to Live Update.
> +
> +The Live Update will be triggered from outside the hypervisor (e.g.
> dom0
> +userspace).  Support for the operation has been added in kexec-tools 
> 2.0.21.
> +
> +All the domains will be paused before xen#1 is starting to save the
> states.
> +In Xen, *domain\_pause()* will pause the vCPUs as soon as they can
> be re-
> +scheduled.  In other words, a pause request will not wait for
> asynchronous
> +requests (e.g. I/O) to finish.  For Live Update, this is not an
> ideal time to
> +pause because it will require more xen#1 internal state to be
> transferred.
> +Therefore, all the domains will be paused at an architectural
> restartable
> +boundary.
> +
> +Live update will not happen synchronously to the request but when
> all the
> +domains are quiescent.  As domains running device emulators (e.g
> Dom0) will
> +be part of the process to quiesce HVM domains, we will need to let
> them run
> +until xen#1 is actually starting to save the state.  HVM vCPUs will
> be paused
> +as soon as any pending asynchronous request has finished.
> +
> +In the current implementation, all PV domains will continue to run
> while the
> +rest will be paused as soon as possible.  Note this approach is
> assuming that
> +device emulators are only running in PV domains.
> +
> +It should be easy to extend to PVH domains not requiring device
> emulations.
> +It will require more thought if we need to run device models in HVM
> domains as
> +there might be inter-dependency.
> +
> +## Save
> +
> +xen#1 will be responsible to preserve and serialize the state of
> each existing
> +domain and any system-wide state (e.g M2P).
> +
> +Each domain will be serialized independently using a modified
> migration stream,
> +if there is any dependency between domains (such as for IOREQ
> server) they will
> +be recorded using a domid. All the complexity of resolving the
> dependencies are
> +left to the restore path in xen#2 (more in the *Restore* section).
> +
> +At the moment, the domains are saved one by one in a single thread,
> but it
> +would be possible to consider multi-threading if it takes too long.
> Although
> +this may require some adjustment in the stream format.
> +
> +As we want to be able to Live Update between major versions of Xen
> (e.g Xen
> +4.11 -> Xen 4.15), the states preserved should not be a dump of Xen
> internal
> +structure but instead the minimal information that allow us to
> recreate the
> +domains.
> +
> +For instance, we don't want to preserve the frametable (and
> therefore
> +*struct page\_info*) as-is because the refcounting may be different
> across
> +between xen#1 and xen#2 (see XSA-299). Instead, we want to be able
> to recreate
> +*struct page\_info* based on minimal information that are considered
> stable
> +(such as the page type).
> +
> +Note that upgrading between version of Xen will also require all the
> hypercalls
> +to be stable. This will not be covered by this document.
> +
> +## Hand over
> +
> +### Memory usage restrictions
> +
> +xen#2 must take care not to use any memory pages which already
> belong to
> +guests.  To facilitate this, a number of contiguous region of memory
> are
> +reserved for the boot allocator, known as *live update bootmem*.
> +
> +xen#1 will always reserve a region just below Xen (the size is
> controlled by
> +the Xen command line parameter liveupdate) to allow Xen growing and
> provide
> +information about LiveUpdate (see the section *Breadcrumb*).  The
> region will be
> +passed to xen#2 using the same command line option but with the base
> address
> +specified.

The size given with the command line option may not be the same, depending
on the size of xen#2.

> +
> +For simplicity, additional regions will be provided in the
> stream.  They will
> +consist of region that could be re-used by xen#2 during boot (such
> as the
> +xen#1's frametable memory).
> +
> +xen#2 must not use any pages outside those regions until it has
> consumed the
> +Live Update data stream and determined which pages are already in
> use by
> +running domains or need to be re-used as-is by Xen (e.g M2P).
> +
> +At run time, Xen may use memory from the reserved region for any
> purpose that
> +does not require preservation over a Live Update; in particular it
> __must__ not be
> +mapped to a domain or used by any Xen state requiring to be
> preserved (e.g
> +M2P).  In other word, the xenheap pages could be allocated from the
> reserved
> +regions if we remove the concept of shared xenheap pages.
> +
> +The xen#2's binary may be bigger (or smaller) compare to xen#1's
> binary.  So
> +for the purpose of loading xen#2 binary, kexec should treat the
> reserved memory
> +right below xen#1 and its region as a single contiguous space. xen#2
> will be
> +loaded right at the top of the contiguous space and the rest of the
> memory will
> +be the new reserved memory (this may shrink or grow).  For that
> reason, freed
> +init memory from xen#1 image is also treated as reserved liveupdate
> update

s/update//

This is explained quite well actually, but I wonder if we can move this
part closer to the liveupdate command line section (they both talk
about the initial bootmem region and Xen size changes). After that, we
then talk about multiple regions and how we should use them.

> +bootmem.
> +
> +### Live Update data stream
> +
> +During handover, xen#1 creates a Live Update data stream containing
> all the
> +information required by the new Xen#2 to restore all the domains.
> +
> +Data pages for this stream may be allocated anywhere in physical
> memory outside
> +the *live update bootmem* regions.
> +
> +As calling __vmap()__/__vunmap()__ has a cost on the downtime.  We
> want to reduce the
> +number of call to __vmap()__ when restoring the stream.  Therefore
> the stream
> +will be contiguously virtually mapped in xen#2.  xen#1 will create
> an array of

Using vmap during restore for a contiguous range sounds more like an
implementation and optimisation detail to me than an ABI requirement, so I
would s/the stream will be/the stream can be/.

> +MFNs of the allocated data pages, suitable for passing to
> __vmap()__.  The
> +array will be physically contiguous but the MFNs don't need to be
> physically
> +contiguous.
> +
> +### Breadcrumb
> +
> +Since the Live Update data stream is created during the final
> **kexec\_exec**
> +hypercall, its address cannot be passed on the command line to the
> new Xen
> +since the command line needs to have been set up by **kexec(8)** in
> userspace
> +long beforehand.
> +
> +Thus, to allow the new Xen to find the data stream, xen#1 places a
> breadcrumb
> +in the first words of the Live Update bootmem, containing the number
> of data
> +pages, and the physical address of the contiguous MFN array.
> +
> +### IOMMU
> +
> +Where devices are passed through to domains, it may not be possible
> to quiesce
> +those devices for the purpose of performing the update.
> +
> +If performing Live Update with assigned devices, xen#1 will leave
> the IOMMU
> +mappings active during the handover (thus implying that IOMMU page
> tables may
> +not be allocated in the *live update bootmem* region either).
> +
> +xen#2 must take control of the IOMMU without causing those mappings
> to become
> +invalid even for a short period of time.  In other words, xen#2
> should not
> +re-setup the IOMMUs.  On hardware which does not support Posted
> Interrupts,
> +interrupts may need to be generated on resume.
> +
> +## Restore
> +
> +After xen#2 initialized itself and map the stream, it will be
> responsible to
> +restore the state of the system and each domain.
> +
> +Unlike the save part, it is not possible to restore a domain in a
> single pass.
> +There are dependencies between:
> +
> +    1. different states of a domain.  For instance, the event
> channels ABI
> +       used (2l vs fifo) requires to be restored before restoring
> the event
> +       channels.
> +    2. the same "state" within a domain.  For instance, in case of
> PV domain,
> +       the pages' ownership requires to be restored before restoring
> the type
> +       of the page (e.g is it an L4, L1... table?).
> +
> +    3. domains.  For instance when restoring the grant mapping, it
> will be
> +       necessary to have the page's owner in hand to do proper
> refcounting.
> +       Therefore the pages' ownership have to be restored first.
> +
> +Dependencies will be resolved using either multiple passes (for
> dependency
> +type 2 and 3) or using a specific ordering between records (for
> dependency
> +type 1).
> +
> +Each domain will be restored in 3 passes:
> +
> +    * Pass 0: Create the domain and restore the P2M for HVM. This
> can be broken
> +      down in 3 parts:
> +      * Allocate a domain via _domain\_create()_ but skip part that
> requires
> +        extra records (e.g HAP, P2M).
> +      * Restore any parts which needs to be done before create the
> vCPUs. This
> +        including restoring the P2M and whether HAP is used.
> +      * Create the vCPUs. Note this doesn't restore the state of the
> vCPUs.
> +    * Pass 1: It will restore the pages' ownership and the grant-
> table frames
> +    * Pass 2: This steps will restore any domain states (e.g vCPU
> state, event
> +      channels) that wasn't

Sentence seems incomplete.

Hongyan
Jan Beulich May 7, 2021, 9:52 a.m. UTC | #3
On 06.05.2021 12:42, Julien Grall wrote:
> +## High-level overview
> +
> +Xen has a framework to bring a new image of the Xen hypervisor in memory using
> +kexec.  The existing framework does not meet the baseline functionality for
> +Live Update, since kexec results in a restart for the hypervisor, host, Dom0,
> +and all the guests.
> +
> +The operation can be divided in roughly 4 parts:
> +
> +    1. Trigger: The operation will by triggered from outside the hypervisor
> +       (e.g. dom0 userspace).
> +    2. Save: The state will be stabilized by pausing the domains and
> +       serialized by xen#1.
> +    3. Hand-over: xen#1 will pass the serialized state and transfer control to
> +       xen#2.
> +    4. Restore: The state will be deserialized by xen#2.
> +
> +All the domains will be paused before xen#1 is starting to save the states,
> +and any domain that was running before Live Update will be unpaused after
> +xen#2 has finished to restore the states.  This is to prevent a domain to try
> +to modify the state of another domain while it is being saved/restored.
> +
> +The current approach could be seen as non-cooperative migration with a twist:
> +all the domains (including dom0) are not expected be involved in the Live
> +Update process.
> +
> +The major differences compare to live migration are:
> +
> +    * The state is not transferred to another host, but instead locally to
> +      xen#2.
> +    * The memory content or device state (for passthrough) does not need to
> +      be part of the stream. Instead we need to preserve it.
> +    * PV backends, device emulators, xenstored are not recreated but preserved
> +      (as these are part of dom0).

Isn't dom0 too limiting here?

> +## Trigger
> +
> +Live update is built on top of the kexec interface to prepare the command line,
> +load xen#2 and trigger the operation.  A new kexec type has been introduced
> +(*KEXEC\_TYPE\_LIVE\_UPDATE*) to notify Xen to Live Update.
> +
> +The Live Update will be triggered from outside the hypervisor (e.g. dom0
> +userspace).  Support for the operation has been added in kexec-tools 2.0.21.
> +
> +All the domains will be paused before xen#1 is starting to save the states.
> +In Xen, *domain\_pause()* will pause the vCPUs as soon as they can be re-
> +scheduled.  In other words, a pause request will not wait for asynchronous
> +requests (e.g. I/O) to finish.  For Live Update, this is not an ideal time to
> +pause because it will require more xen#1 internal state to be transferred.
> +Therefore, all the domains will be paused at an architectural restartable
> +boundary.

To me this leaves entirely unclear what this then means. domain_pause()
not being suitable is one thing, but what _is_ suitable seems worth
mentioning. Among other things I'd be curious to know what this would
mean for pending hypercall continuations.

> +## Save
> +
> +xen#1 will be responsible to preserve and serialize the state of each existing
> +domain and any system-wide state (e.g M2P).
> +
> +Each domain will be serialized independently using a modified migration stream,
> +if there is any dependency between domains (such as for IOREQ server) they will
> +be recorded using a domid. All the complexity of resolving the dependencies are
> +left to the restore path in xen#2 (more in the *Restore* section).
> +
> +At the moment, the domains are saved one by one in a single thread, but it
> +would be possible to consider multi-threading if it takes too long. Although
> +this may require some adjustment in the stream format.
> +
> +As we want to be able to Live Update between major versions of Xen (e.g Xen
> +4.11 -> Xen 4.15), the states preserved should not be a dump of Xen internal
> +structure but instead the minimal information that allow us to recreate the
> +domains.
> +
> +For instance, we don't want to preserve the frametable (and therefore
> +*struct page\_info*) as-is because the refcounting may be different across
> +between xen#1 and xen#2 (see XSA-299). Instead, we want to be able to recreate
> +*struct page\_info* based on minimal information that are considered stable
> +(such as the page type).

Perhaps leaving it at this very generic description is fine, but I can
easily see cases (which may not even be corner ones) where this quickly
gets problematic: What if xen#2 has state xen#1 didn't (properly) record?
Such information may not be possible to take out of thin air. Is the
consequence then that in such a case LU won't work? If so, is it perhaps
worthwhile having a Limitations section somewhere?

> +## Hand over
> +
> +### Memory usage restrictions
> +
> +xen#2 must take care not to use any memory pages which already belong to
> +guests.  To facilitate this, a number of contiguous region of memory are
> +reserved for the boot allocator, known as *live update bootmem*.
> +
> +xen#1 will always reserve a region just below Xen (the size is controlled by
> +the Xen command line parameter liveupdate) to allow Xen growing and provide
> +information about LiveUpdate (see the section *Breadcrumb*).  The region will be
> +passed to xen#2 using the same command line option but with the base address
> +specified.

I particularly don't understand the "to allow Xen growing" aspect here:
xen#2 needs to be placed in a different memory range anyway until xen#1
has handed over control. Are you suggesting it gets moved over to xen#1's
original physical range (not necessarily an exact match), and then
perhaps to start below where xen#1 started? Why would you do this? Xen
intentionally lives at a 2Mb boundary, such that in principle (for EFI:
in fact) large page mappings are possible. I also see no reason to reuse
the same physical area of memory for Xen itself - all you need is for
Xen's virtual addresses to be properly mapped to the new physical range.
I wonder what I'm missing here.

> +For simplicity, additional regions will be provided in the stream.  They will
> +consist of region that could be re-used by xen#2 during boot (such as the
> +xen#1's frametable memory).
> +
> +xen#2 must not use any pages outside those regions until it has consumed the
> +Live Update data stream and determined which pages are already in use by
> +running domains or need to be re-used as-is by Xen (e.g M2P).

Is the M2P really in the "need to be re-used" group, not just "can
be re-used for simplicity and efficiency reasons"?

> +## Restore
> +
> +After xen#2 initialized itself and map the stream, it will be responsible to
> +restore the state of the system and each domain.
> +
> +Unlike the save part, it is not possible to restore a domain in a single pass.
> +There are dependencies between:
> +
> +    1. different states of a domain.  For instance, the event channels ABI
> +       used (2l vs fifo) requires to be restored before restoring the event
> +       channels.
> +    2. the same "state" within a domain.  For instance, in case of PV domain,
> +       the pages' ownership requires to be restored before restoring the type
> +       of the page (e.g is it an L4, L1... table?).
> +
> +    3. domains.  For instance when restoring the grant mapping, it will be
> +       necessary to have the page's owner in hand to do proper refcounting.
> +       Therefore the pages' ownership have to be restored first.
> +
> +Dependencies will be resolved using either multiple passes (for dependency
> +type 2 and 3) or using a specific ordering between records (for dependency
> +type 1).
> +
> +Each domain will be restored in 3 passes:
> +
> +    * Pass 0: Create the domain and restore the P2M for HVM. This can be broken
> +      down in 3 parts:
> +      * Allocate a domain via _domain\_create()_ but skip part that requires
> +        extra records (e.g HAP, P2M).
> +      * Restore any parts which needs to be done before create the vCPUs. This
> +        including restoring the P2M and whether HAP is used.
> +      * Create the vCPUs. Note this doesn't restore the state of the vCPUs.
> +    * Pass 1: It will restore the pages' ownership and the grant-table frames
> +    * Pass 2: This steps will restore any domain states (e.g vCPU state, event
> +      channels) that wasn't

What about foreign mappings (which are part of the P2M)? Can they be
validly restored prior to restoring page ownership? In how far do you
fully trust xen#1's state to be fully consistent anyway, rather than
perhaps checking it?

> +A domain should not have a dependency on another domain within the same pass.
> +Therefore it would be possible to take advantage of all the CPUs to restore
> +domains in parallel and reduce the overall downtime.

"Dependency" may be ambiguous here. For example, an interdomain event
channel to me necessarily expresses a dependency between two domains.

Jan
Julien Grall May 7, 2021, 10 a.m. UTC | #4
Hi Hongyan,

On 07/05/2021 10:18, Hongyan Xia wrote:
> On Thu, 2021-05-06 at 11:42 +0100, Julien Grall wrote:
>> From: Julien Grall <jgrall@amazon.com>
>>
>> Administrators often require updating the Xen hypervisor to address
>> security vulnerabilities, introduce new features, or fix software
>> defects.
>> Currently, we offer the following methods to perform the update:
>>
>>      * Rebooting the guests and the host: this is highly disrupting to
>> running
>>        guests.
>>      * Migrating off the guests, rebooting the host: this currently
>> requires
>>        the guest to cooperate (see [1] for a non-cooperative solution)
>> and it
>>        may not always be possible to migrate it off (i.e lack of
>> capacity, use
>>        of local storage...).
>>      * Live patching: This is the less disruptive of the existing
>> methods.
>>        However, it can be difficult to prepare the livepatch if the
>> change is
>>        large or there are data structures to update.
> 
> Might want to mention that live patching slowly consumes memory and
> fragments the Xen image and degrades performance (especially when the
> patched code is on the critical path).
My goal wasn't to list all the drawbacks of each existing method. 
Instead, I wanted to give one simple, important reason for each of them.

I would prefer to keep the list as it is unless someone needs more 
arguments about introducing a new approach.

> 
>>
>> This patch will introduce a new proposal called "Live Update" which
>> will
>> activate new software without noticeable downtime (i.e no - or
>> minimal -
>> customer).
>>
>> Signed-off-by: Julien Grall <jgrall@amazon.com>
>> ---
>>   docs/designs/liveupdate.md | 254
>> +++++++++++++++++++++++++++++++++++++
>>   1 file changed, 254 insertions(+)
>>   create mode 100644 docs/designs/liveupdate.md
>>
>> diff --git a/docs/designs/liveupdate.md b/docs/designs/liveupdate.md
>> new file mode 100644
>> index 000000000000..32993934f4fe
>> --- /dev/null
>> +++ b/docs/designs/liveupdate.md
>> @@ -0,0 +1,254 @@
>> +# Live Updating Xen
>> +
>> +## Background
>> +
>> +Administrators often require updating the Xen hypervisor to address
>> security
>> +vulnerabilities, introduce new features, or fix software
>> defects.  Currently,
>> +we offer the following methods to perform the update:
>> +
>> +    * Rebooting the guests and the host: this is highly disrupting
>> to running
>> +      guests.
>> +    * Migrating off the guests, rebooting the host: this currently
>> requires
>> +      the guest to cooperate (see [1] for a non-cooperative
>> solution) and it
>> +      may not always be possible to migrate it off (i.e lack of
>> capacity, use
>> +      of local storage...).
>> +    * Live patching: This is the less disruptive of the existing
>> methods.
>> +      However, it can be difficult to prepare the livepatch if the
>> change is
>> +      large or there are data structures to update.
>> +
>> +This document will present a new approach called "Live Update" which
>> will
>> +activate new software without noticeable downtime (i.e no - or
>> minimal -
>> +customer pain).
>> +
>> +## Terminology
>> +
>> +xen#1: Xen version currently active and running on a droplet.  This
>> is the
>> +“source” for the Live Update operation.  This version can actually
>> be newer
>> +than xen#2 in case of a rollback operation.
>> +
>> +xen#2: Xen version that's the “target” of the Live Update operation.
>> This
>> +version will become the active version after successful Live
>> Update.  This
>> +version of Xen can actually be older than xen#1 in case of a
>> rollback
>> +operation.
> 
> A bit redundant since it was mentioned in Xen 1 already.

Definitions tend to be redundant, so I would prefer to keep it like that.

> 
>> +
>> +## High-level overview
>> +
>> +Xen has a framework to bring a new image of the Xen hypervisor in
>> memory using
>> +kexec.  The existing framework does not meet the baseline
>> functionality for
>> +Live Update, since kexec results in a restart for the hypervisor,
>> host, Dom0,
>> +and all the guests.
>> +
>> +The operation can be divided in roughly 4 parts:
>> +
>> +    1. Trigger: The operation will by triggered from outside the
>> hypervisor
>> +       (e.g. dom0 userspace).
>> +    2. Save: The state will be stabilized by pausing the domains and
>> +       serialized by xen#1.
>> +    3. Hand-over: xen#1 will pass the serialized state and transfer
>> control to
>> +       xen#2.
>> +    4. Restore: The state will be deserialized by xen#2.
>> +
>> +All the domains will be paused before xen#1 is starting to save the
>> states,
>> +and any domain that was running before Live Update will be unpaused
>> after
>> +xen#2 has finished to restore the states.  This is to prevent a
>> domain to try
>> +to modify the state of another domain while it is being
>> saved/restored.
>> +
>> +The current approach could be seen as non-cooperative migration with
>> a twist:
>> +all the domains (including dom0) are not expected be involved in the
>> Live
>> +Update process.
>> +
>> +The major differences compare to live migration are:
>> +
>> +    * The state is not transferred to another host, but instead
>> locally to
>> +      xen#2.
>> +    * The memory content or device state (for passthrough) does not
>> need to
>> +      be part of the stream. Instead we need to preserve it.
>> +    * PV backends, device emulators, xenstored are not recreated but
>> preserved
>> +      (as these are part of dom0).
>> +
>> +
>> +Domains in process of being destroyed (*XEN\_DOMCTL\_destroydomain*)
>> will need
>> +to be preserved because another entity may have mappings (e.g
>> foreign, grant)
>> +on them.
>> +
>> +## Trigger
>> +
>> +Live update is built on top of the kexec interface to prepare the
>> command line,
>> +load xen#2 and trigger the operation.  A new kexec type has been
>> introduced
>> +(*KEXEC\_TYPE\_LIVE\_UPDATE*) to notify Xen to Live Update.
>> +
>> +The Live Update will be triggered from outside the hypervisor (e.g.
>> dom0
>> +userspace).  Support for the operation has been added in kexec-tools
>> 2.0.21.
>> +
>> +All the domains will be paused before xen#1 is starting to save the
>> states.
>> +In Xen, *domain\_pause()* will pause the vCPUs as soon as they can
>> be re-
>> +scheduled.  In other words, a pause request will not wait for
>> asynchronous
>> +requests (e.g. I/O) to finish.  For Live Update, this is not an
>> ideal time to
>> +pause because it will require more xen#1 internal state to be
>> transferred.
>> +Therefore, all the domains will be paused at an architectural
>> restartable
>> +boundary.
>> +
>> +Live update will not happen synchronously to the request but when
>> all the
>> +domains are quiescent.  As domains running device emulators (e.g
>> Dom0) will
>> +be part of the process to quiesce HVM domains, we will need to let
>> them run
>> +until xen#1 is actually starting to save the state.  HVM vCPUs will
>> be paused
>> +as soon as any pending asynchronous request has finished.
>> +
>> +In the current implementation, all PV domains will continue to run
>> while the
>> +rest will be paused as soon as possible.  Note this approach is
>> assuming that
>> +device emulators are only running in PV domains.
>> +
>> +It should be easy to extend to PVH domains not requiring device
>> emulations.
>> +It will require more thought if we need to run device models in HVM
>> domains as
>> +there might be inter-dependency.
>> +
>> +## Save
>> +
>> +xen#1 will be responsible to preserve and serialize the state of
>> each existing
>> +domain and any system-wide state (e.g M2P).
>> +
>> +Each domain will be serialized independently using a modified
>> migration stream,
>> +if there is any dependency between domains (such as for IOREQ
>> server) they will
>> +be recorded using a domid. All the complexity of resolving the
>> dependencies are
>> +left to the restore path in xen#2 (more in the *Restore* section).
>> +
>> +At the moment, the domains are saved one by one in a single thread,
>> but it
>> +would be possible to consider multi-threading if it takes too long.
>> Although
>> +this may require some adjustment in the stream format.
>> +
>> +As we want to be able to Live Update between major versions of Xen
>> (e.g Xen
>> +4.11 -> Xen 4.15), the states preserved should not be a dump of Xen
>> internal
>> +structure but instead the minimal information that allow us to
>> recreate the
>> +domains.
>> +
>> +For instance, we don't want to preserve the frametable (and
>> therefore
>> +*struct page\_info*) as-is because the refcounting may be different
>> across
>> +between xen#1 and xen#2 (see XSA-299). Instead, we want to be able
>> to recreate
>> +*struct page\_info* based on minimal information that are considered
>> stable
>> +(such as the page type).
>> +
>> +Note that upgrading between version of Xen will also require all the
>> hypercalls
>> +to be stable. This will not be covered by this document.
>> +
>> +## Hand over
>> +
>> +### Memory usage restrictions
>> +
>> +xen#2 must take care not to use any memory pages which already
>> belong to
>> +guests.  To facilitate this, a number of contiguous region of memory
>> are
>> +reserved for the boot allocator, known as *live update bootmem*.
>> +
>> +xen#1 will always reserve a region just below Xen (the size is
>> controlled by
>> +the Xen command line parameter liveupdate) to allow Xen growing and
>> provide
>> +information about LiveUpdate (see the section *Breadcrumb*).  The
>> region will be
>> +passed to xen#2 using the same command line option but with the base
>> address
>> +specified.
> 
> The size of the command line option may not be the same depending on
> the size of Xen #2.

Good point, I will update it.

> 
>> +
>> +For simplicity, additional regions will be provided in the
>> stream.  They will
>> +consist of region that could be re-used by xen#2 during boot (such
>> as the
>> +xen#1's frametable memory).
>> +
>> +xen#2 must not use any pages outside those regions until it has
>> consumed the
>> +Live Update data stream and determined which pages are already in
>> use by
>> +running domains or need to be re-used as-is by Xen (e.g M2P).
>> +
>> +At run time, Xen may use memory from the reserved region for any
>> purpose that
>> +does not require preservation over a Live Update; in particular it
>> __must__ not be
>> +mapped to a domain or used by any Xen state requiring to be
>> preserved (e.g
>> +M2P).  In other word, the xenheap pages could be allocated from the
>> reserved
>> +regions if we remove the concept of shared xenheap pages.
>> +
>> +The xen#2's binary may be bigger (or smaller) compare to xen#1's
>> binary.  So
>> +for the purpose of loading xen#2 binary, kexec should treat the
>> reserved memory
>> +right below xen#1 and its region as a single contiguous space. xen#2
>> will be
>> +loaded right at the top of the contiguous space and the rest of the
>> memory will
>> +be the new reserved memory (this may shrink or grow).  For that
>> reason, freed
>> +init memory from xen#1 image is also treated as reserved liveupdate
>> update
> 
> s/update//
> 
> This is explained quite well actually, but I wonder if we can move this
> part closer to the liveupdate command line section (they both talk
> about the initial bootmem region and Xen size changes). After that, we
> then talk about multiple regions and how we should use them.

Just for clarification, do you mean moving after "The region will be 
passed to xen#2 using the same command line option but with the base 
address specified."?

>> +bootmem.
>> +
>> +### Live Update data stream
>> +
>> +During handover, xen#1 creates a Live Update data stream containing
>> all the
>> +information required by the new Xen#2 to restore all the domains.
>> +
>> +Data pages for this stream may be allocated anywhere in physical
>> memory outside
>> +the *live update bootmem* regions.
>> +
>> +As calling __vmap()__/__vunmap()__ has a cost on the downtime.  We
>> want to reduce the
>> +number of call to __vmap()__ when restoring the stream.  Therefore
>> the stream
>> +will be contiguously virtually mapped in xen#2.  xen#1 will create
>> an array of
> 
> Using vmap during restore for a contiguous range sounds more like
> implementation and optimisation detail to me rather than an ABI
> requirement, so I would s/the stream will be/the stream can be/.

I will do.

> 
>> +MFNs of the allocated data pages, suitable for passing to
>> __vmap()__.  The
>> +array will be physically contiguous but the MFNs don't need to be
>> physically
>> +contiguous.
>> +
>> +### Breadcrumb
>> +
>> +Since the Live Update data stream is created during the final
>> **kexec\_exec**
>> +hypercall, its address cannot be passed on the command line to the
>> new Xen
>> +since the command line needs to have been set up by **kexec(8)** in
>> userspace
>> +long beforehand.
>> +
>> +Thus, to allow the new Xen to find the data stream, xen#1 places a
>> breadcrumb
>> +in the first words of the Live Update bootmem, containing the number
>> of data
>> +pages, and the physical address of the contiguous MFN array.
>> +
>> +### IOMMU
>> +
>> +Where devices are passed through to domains, it may not be possible
>> to quiesce
>> +those devices for the purpose of performing the update.
>> +
>> +If performing Live Update with assigned devices, xen#1 will leave
>> the IOMMU
>> +mappings active during the handover (thus implying that IOMMU page
>> tables may
>> +not be allocated in the *live update bootmem* region either).
>> +
>> +xen#2 must take control of the IOMMU without causing those mappings
>> to become
>> +invalid even for a short period of time.  In other words, xen#2
>> should not
>> +re-setup the IOMMUs.  On hardware which does not support Posted
>> Interrupts,
>> +interrupts may need to be generated on resume.
>> +
>> +## Restore
>> +
>> +After xen#2 initialized itself and map the stream, it will be
>> responsible to
>> +restore the state of the system and each domain.
>> +
>> +Unlike the save part, it is not possible to restore a domain in a
>> single pass.
>> +There are dependencies between:
>> +
>> +    1. different states of a domain.  For instance, the event
>> channels ABI
>> +       used (2l vs fifo) requires to be restored before restoring
>> the event
>> +       channels.
>> +    2. the same "state" within a domain.  For instance, in case of
>> PV domain,
>> +       the pages' ownership requires to be restored before restoring
>> the type
>> +       of the page (e.g is it an L4, L1... table?).
>> +
>> +    3. domains.  For instance when restoring the grant mapping, it
>> will be
>> +       necessary to have the page's owner in hand to do proper
>> refcounting.
>> +       Therefore the pages' ownership have to be restored first.
>> +
>> +Dependencies will be resolved using either multiple passes (for
>> dependency
>> +type 2 and 3) or using a specific ordering between records (for
>> dependency
>> +type 1).
>> +
>> +Each domain will be restored in 3 passes:
>> +
>> +    * Pass 0: Create the domain and restore the P2M for HVM. This
>> can be broken
>> +      down in 3 parts:
>> +      * Allocate a domain via _domain\_create()_ but skip part that
>> requires
>> +        extra records (e.g HAP, P2M).
>> +      * Restore any parts which needs to be done before create the
>> vCPUs. This
>> +        including restoring the P2M and whether HAP is used.
>> +      * Create the vCPUs. Note this doesn't restore the state of the
>> vCPUs.
>> +    * Pass 1: It will restore the pages' ownership and the grant-
>> table frames
>> +    * Pass 2: This steps will restore any domain states (e.g vCPU
>> state, event
>> +      channels) that wasn't
> 
> Sentence seems incomplete.

I can add 'already restored' if that clarifies it?

Cheers,
Julien Grall May 7, 2021, 11:44 a.m. UTC | #5
Hi Jan,

On 07/05/2021 10:52, Jan Beulich wrote:
> On 06.05.2021 12:42, Julien Grall wrote:
>> +## High-level overview
>> +
>> +Xen has a framework to bring a new image of the Xen hypervisor in memory using
>> +kexec.  The existing framework does not meet the baseline functionality for
>> +Live Update, since kexec results in a restart for the hypervisor, host, Dom0,
>> +and all the guests.
>> +
>> +The operation can be divided in roughly 4 parts:
>> +
>> +    1. Trigger: The operation will by triggered from outside the hypervisor
>> +       (e.g. dom0 userspace).
>> +    2. Save: The state will be stabilized by pausing the domains and
>> +       serialized by xen#1.
>> +    3. Hand-over: xen#1 will pass the serialized state and transfer control to
>> +       xen#2.
>> +    4. Restore: The state will be deserialized by xen#2.
>> +
>> +All the domains will be paused before xen#1 is starting to save the states,
>> +and any domain that was running before Live Update will be unpaused after
>> +xen#2 has finished to restore the states.  This is to prevent a domain to try
>> +to modify the state of another domain while it is being saved/restored.
>> +
>> +The current approach could be seen as non-cooperative migration with a twist:
>> +all the domains (including dom0) are not expected be involved in the Live
>> +Update process.
>> +
>> +The major differences compare to live migration are:
>> +
>> +    * The state is not transferred to another host, but instead locally to
>> +      xen#2.
>> +    * The memory content or device state (for passthrough) does not need to
>> +      be part of the stream. Instead we need to preserve it.
>> +    * PV backends, device emulators, xenstored are not recreated but preserved
>> +      (as these are part of dom0).
> 
> Isn't dom0 too limiting here?

Good point. I can replace it with "as these are living outside of the 
hypervisor".

> 
>> +## Trigger
>> +
>> +Live update is built on top of the kexec interface to prepare the command line,
>> +load xen#2 and trigger the operation.  A new kexec type has been introduced
>> +(*KEXEC\_TYPE\_LIVE\_UPDATE*) to notify Xen to Live Update.
>> +
>> +The Live Update will be triggered from outside the hypervisor (e.g. dom0
>> +userspace).  Support for the operation has been added in kexec-tools 2.0.21.
>> +
>> +All the domains will be paused before xen#1 is starting to save the states.
>> +In Xen, *domain\_pause()* will pause the vCPUs as soon as they can be re-
>> +scheduled.  In other words, a pause request will not wait for asynchronous
>> +requests (e.g. I/O) to finish.  For Live Update, this is not an ideal time to
>> +pause because it will require more xen#1 internal state to be transferred.
>> +Therefore, all the domains will be paused at an architectural restartable
>> +boundary.
> 
> To me this leaves entirely unclear what this then means. domain_pause()
> not being suitable is one thing, but what _is_ suitable seems worth
> mentioning.

I haven't mentioned anything because there is nothing directly suitable 
for Live Update. What we want is a behavior similar to 
``domain_shutdown()`` but without clobbering ``d->shutdown_code``, as 
we would need to transfer it.

This is quite similar to what live migration is doing as, AFAICT, it 
will "shutdown" the domain with the reason SHUTDOWN_suspend.
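
To illustrate the distinction, here is a toy model (not Xen code: 
``struct toy_domain``, its fields and both helpers are hypothetical names 
made up for this sketch). A regular shutdown overwrites the recorded 
reason, whereas the Live Update variant would quiesce the domain and leave 
the reason intact so that it can be transferred to xen#2:

```c
#include <stdbool.h>

/* Toy stand-in for the relevant state; NOT Xen's struct domain. */
struct toy_domain {
    bool is_shut_down;
    unsigned int shutdown_code;  /* reason recorded by a real shutdown */
    bool quiesced_for_lu;        /* hypothetical flag for Live Update */
};

/* Models the effect described above: the stored reason is overwritten. */
static void toy_domain_shutdown(struct toy_domain *d, unsigned int reason)
{
    d->is_shut_down = true;
    d->shutdown_code = reason;
}

/* Hypothetical Live Update variant: stop the domain, keep the reason. */
static void toy_domain_quiesce_for_lu(struct toy_domain *d)
{
    d->quiesced_for_lu = true;
    /* d->shutdown_code is deliberately left untouched. */
}
```

With this split, whatever reason a guest recorded before the update is 
still available when the state is serialized.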

> Among other things I'd be curious to know what this would
> mean for pending hypercall continuations.

Most of the hypercalls are fine because the state is encoded in the vCPU 
registers and can continue on a new Xen.

The problematic ones are:
   1) Hypercalls running in a tasklet (mostly SYSCTL_*)
   2) XEN_DOMCTL_destroydomain
   3) EVTCHNOP_reset{,_cont}

For 1), we need to make sure the tasklets are completed before Live 
Update happens.

For 2), we could decide to wait until it is finished, but it can take a 
while (in some of our testing it takes ~20ish to destroy) or it may 
never finish (e.g. zombie domain). The question of how to deal with them 
is still open because we can't really recreate them using 
domain_create() (some state may have already been relinquished).

For 3), you may remember the discussion we had on security ML during 
XSA-344. One possibility would be to restart the command from scratch 
(or not transfer the event channel at all).

> 
>> +## Save
>> +
>> +xen#1 will be responsible to preserve and serialize the state of each existing
>> +domain and any system-wide state (e.g M2P).
>> +
>> +Each domain will be serialized independently using a modified migration stream,
>> +if there is any dependency between domains (such as for IOREQ server) they will
>> +be recorded using a domid. All the complexity of resolving the dependencies are
>> +left to the restore path in xen#2 (more in the *Restore* section).
>> +
>> +At the moment, the domains are saved one by one in a single thread, but it
>> +would be possible to consider multi-threading if it takes too long. Although
>> +this may require some adjustment in the stream format.
>> +
>> +As we want to be able to Live Update between major versions of Xen (e.g Xen
>> +4.11 -> Xen 4.15), the states preserved should not be a dump of Xen internal
>> +structure but instead the minimal information that allow us to recreate the
>> +domains.
>> +
>> +For instance, we don't want to preserve the frametable (and therefore
>> +*struct page\_info*) as-is because the refcounting may be different across
>> +between xen#1 and xen#2 (see XSA-299). Instead, we want to be able to recreate
>> +*struct page\_info* based on minimal information that are considered stable
>> +(such as the page type).
> 
> Perhaps leaving it at this very generic description is fine, but I can
> easily see cases (which may not even be corner ones) where this quickly
> gets problematic: What if xen#2 has state xen#1 didn't (properly) record?
> Such information may not be possible to take out of thin air. Is the
> consequence then that in such a case LU won't work?
I can see cases where the state may not be recorded by xen#1, but so far I 
am struggling to find a case where we could not fake it in xen#2. Do 
you have any example?

>> +## Hand over
>> +
>> +### Memory usage restrictions
>> +
>> +xen#2 must take care not to use any memory pages which already belong to
>> +guests.  To facilitate this, a number of contiguous region of memory are
>> +reserved for the boot allocator, known as *live update bootmem*.
>> +
>> +xen#1 will always reserve a region just below Xen (the size is controlled by
>> +the Xen command line parameter liveupdate) to allow Xen growing and provide
>> +information about LiveUpdate (see the section *Breadcrumb*).  The region will be
>> +passed to xen#2 using the same command line option but with the base address
>> +specified.
> 
> I particularly don't understand the "to allow Xen growing" aspect here:
> xen#2 needs to be placed in a different memory range anyway until xen#1
> has handed over control.
> Are you suggesting it gets moved over to xen#1's
> original physical range (not necessarily an exact match), and then
> perhaps to start below where xen#1 started? 

That's correct.

> Why would you do this?

There are a few reasons:
   1) kexec-tools is in charge of selecting the physical address where 
the kernel (or Xen in our case) will be loaded. So we need to tell kexec 
where a good place to load the new binary is.
   2) Otherwise, xen#2 may end up being loaded in a "random" and therefore 
possibly inconvenient place.

> Xen intentionally lives at a 2Mb boundary, such that in principle (for EFI:
> in fact) large page mappings are possible.

Right, xen#2 will still be loaded at a 2MB boundary. But it may be 2MB 
lower than the original one.
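
As a rough illustration of that placement (a sketch only: 
``xen2_load_base()`` and its parameters are hypothetical names, not 
kexec-tools or Xen code), the base address can be derived by putting the 
image at the top of the contiguous space and rounding down to a 2MB 
boundary:

```c
#include <assert.h>
#include <stdint.h>

#define XEN_ALIGN (2ULL << 20) /* 2MB, the alignment discussed above */

/*
 * Hypothetical helper: space_end is the (exclusive) end of the contiguous
 * space made of the reserved region plus the xen#1 image, image_size is
 * the size of the xen#2 image. Returns the highest 2MB-aligned base at
 * which the image still fits below space_end. Checking that the result
 * stays above the bottom of the reserved region is left to the caller.
 */
static uint64_t xen2_load_base(uint64_t space_end, uint64_t image_size)
{
    assert(image_size <= space_end);
    return (space_end - image_size) & ~(XEN_ALIGN - 1);
}
```

For example, with the space ending at 0x40000000, a 3MB image would be 
based at 0x3fc00000 and a 5MB image at 0x3fa00000, i.e. 2MB lower.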

> I also see no reason to reuse
> the same physical area of memory for Xen itself - all you need is for
> Xen's virtual addresses to be properly mapped to the new physical range.
> I wonder what I'm missing here.
It is a known convenient place. It may be difficult to find a similar 
spot on hosts that have been running for a long time.

> 
>> +For simplicity, additional regions will be provided in the stream.  They will
>> +consist of region that could be re-used by xen#2 during boot (such as the
>> +xen#1's frametable memory).
>> +
>> +xen#2 must not use any pages outside those regions until it has consumed the
>> +Live Update data stream and determined which pages are already in use by
>> +running domains or need to be re-used as-is by Xen (e.g M2P).
> 
> Is the M2P really in the "need to be re-used" group, not just "can
> be re-used for simplicity and efficiency reasons"?

The MFNs are shared with privileged guests (e.g. dom0). So, I believe, 
the M2P needs to reside at the same place.

The efficiency is an additional benefit.

> 
>> +## Restore
>> +
>> +After xen#2 initialized itself and map the stream, it will be responsible to
>> +restore the state of the system and each domain.
>> +
>> +Unlike the save part, it is not possible to restore a domain in a single pass.
>> +There are dependencies between:
>> +
>> +    1. different states of a domain.  For instance, the event channels ABI
>> +       used (2l vs fifo) requires to be restored before restoring the event
>> +       channels.
>> +    2. the same "state" within a domain.  For instance, in case of PV domain,
>> +       the pages' ownership requires to be restored before restoring the type
>> +       of the page (e.g is it an L4, L1... table?).
>> +
>> +    3. domains.  For instance when restoring the grant mapping, it will be
>> +       necessary to have the page's owner in hand to do proper refcounting.
>> +       Therefore the pages' ownership have to be restored first.
>> +
>> +Dependencies will be resolved using either multiple passes (for dependency
>> +type 2 and 3) or using a specific ordering between records (for dependency
>> +type 1).
>> +
>> +Each domain will be restored in 3 passes:
>> +
>> +    * Pass 0: Create the domain and restore the P2M for HVM. This can be broken
>> +      down in 3 parts:
>> +      * Allocate a domain via _domain\_create()_ but skip part that requires
>> +        extra records (e.g HAP, P2M).
>> +      * Restore any parts which needs to be done before create the vCPUs. This
>> +        including restoring the P2M and whether HAP is used.
>> +      * Create the vCPUs. Note this doesn't restore the state of the vCPUs.
>> +    * Pass 1: It will restore the pages' ownership and the grant-table frames
>> +    * Pass 2: This steps will restore any domain states (e.g vCPU state, event
>> +      channels) that wasn't
> 
> What about foreign mappings (which are part of the P2M)? Can they be
> validly restored prior to restoring page ownership?

Our plan is to transfer the P2M as-is because it is used by the IOMMU. 
So the P2M may be restored before it is fully validated.

> In how far do you
> fully trust xen#1's state to be fully consistent anyway, rather than
> perhaps checking it?

This is a tricky question. If the state is not consistent, then it may 
be difficult to get around it. To continue on the example of foreign 
mapping, what if Xen#2 thinks dom0 does not have the right to map it? We can't
easily (?) recover from that.

So far, you need to put some trust in xen#1 state. IOW, you would not be 
able to blindly replace a reboot with LiveUpdating the hypervisor. This 
will need to be tested.

In our PoC, we decided to crash as soon as there is an inconsistent 
state. This avoids continuing to run on a possibly unsafe setup but
will introduce pain to the VM users.

As a future improvement, we could look at being able to Live Update with
inconsistent state.

> 
>> +A domain should not have a dependency on another domain within the same pass.
>> +Therefore it would be possible to take advantage of all the CPUs to restore
>> +domains in parallel and reduce the overall downtime.
> 
> "Dependency" may be ambiguous here. For example, an interdomain event
> channel to me necessarily expresses a dependency between two domains.

That's a good point. AFAICT, the interdomain can be restored either way. 
So I will rephrase it.

Thank you for the feedback.

Cheers,
Jan Beulich May 7, 2021, 12:15 p.m. UTC | #6
On 07.05.2021 13:44, Julien Grall wrote:
> On 07/05/2021 10:52, Jan Beulich wrote:
>> On 06.05.2021 12:42, Julien Grall wrote:
>>> +## Trigger
>>> +
>>> +Live update is built on top of the kexec interface to prepare the command line,
>>> +load xen#2 and trigger the operation.  A new kexec type has been introduced
>>> +(*KEXEC\_TYPE\_LIVE\_UPDATE*) to notify Xen to Live Update.
>>> +
>>> +The Live Update will be triggered from outside the hypervisor (e.g. dom0
>>> +userspace).  Support for the operation has been added in kexec-tools 2.0.21.
>>> +
>>> +All the domains will be paused before xen#1 is starting to save the states.
>>> +In Xen, *domain\_pause()* will pause the vCPUs as soon as they can be re-
>>> +scheduled.  In other words, a pause request will not wait for asynchronous
>>> +requests (e.g. I/O) to finish.  For Live Update, this is not an ideal time to
>>> +pause because it will require more xen#1 internal state to be transferred.
>>> +Therefore, all the domains will be paused at an architectural restartable
>>> +boundary.
>>
>> To me this leaves entirely unclear what this then means. domain_pause()
>> not being suitable is one thing, but what _is_ suitable seems worth
>> mentioning.
> 
> I haven't mentioned anything because there is nothing directly suitable 
> for Live Update. What we want is a behavior similar to 
> ``domain_shutdown()`` but without cloberring ``d->shutdown_code()`` as 
> we would need to transfer it.
> 
> This is quite similar to what live migration is doing as, AFAICT, it 
> will "shutdown" the domain with the reason SHUTDOWN_suspend.
> 
>> Among other things I'd be curious to know what this would
>> mean for pending hypercall continuations.
> 
> Most of the hypercalls are fine because the state is encoded in the vCPU 
> registers and can continue on a new Xen.
> 
> The problematic one are:
>    1) Hypercalls running in a tasklet (mostly SYSCTL_*)
>    2) XEN_DOMCTL_destroydomain
>    3) EVTCHNOP_reset{,_cont}

4) paging_domctl_continuation
5) various PV mm hypercalls leaving state in struct page_info or
the old_guest_table per-vCPU field

> For 1), we need to make sure the tasklets are completed before Live 
> Update happens.
> 
> For 2), we could decide to wait until it is finished but it can take a 
> while (on some of our testing it takes ~20ish to destroy) or it can 
> never finish (e.g. zombie domain). The question is still open on how to 
> deal with them because we can't really recreate them using 
> domain_create() (some state may have already been relinquished).
> 
> For 3), you may remember the discussion we had on security ML during 
> XSA-344. One possibility would be to restart the command from scratch 
> (or not transfer the event channel at all).

Yes, I do recall that.

>>> +## Save
>>> +
>>> +xen#1 will be responsible to preserve and serialize the state of each existing
>>> +domain and any system-wide state (e.g M2P).
>>> +
>>> +Each domain will be serialized independently using a modified migration stream,
>>> +if there is any dependency between domains (such as for IOREQ server) they will
>>> +be recorded using a domid. All the complexity of resolving the dependencies are
>>> +left to the restore path in xen#2 (more in the *Restore* section).
>>> +
>>> +At the moment, the domains are saved one by one in a single thread, but it
>>> +would be possible to consider multi-threading if it takes too long. Although
>>> +this may require some adjustment in the stream format.
>>> +
>>> +As we want to be able to Live Update between major versions of Xen (e.g Xen
>>> +4.11 -> Xen 4.15), the states preserved should not be a dump of Xen internal
>>> +structure but instead the minimal information that allow us to recreate the
>>> +domains.
>>> +
>>> +For instance, we don't want to preserve the frametable (and therefore
>>> +*struct page\_info*) as-is because the refcounting may be different across
>>> +between xen#1 and xen#2 (see XSA-299). Instead, we want to be able to recreate
>>> +*struct page\_info* based on minimal information that are considered stable
>>> +(such as the page type).
>>
>> Perhaps leaving it at this very generic description is fine, but I can
>> easily see cases (which may not even be corner ones) where this quickly
>> gets problematic: What if xen#2 has state xen#1 didn't (properly) record?
>> Such information may not be possible to take out of thin air. Is the
>> consequence then that in such a case LU won't work?
> I can see cases where the state may not be record by xen#1, but so far I 
> am struggling to find a case where we could not fake them in xen#2. Do 
> you have any example?

The thing that came to mind were the state representation (and logic)
changes done for XSA-299.

>>> +## Hand over
>>> +
>>> +### Memory usage restrictions
>>> +
>>> +xen#2 must take care not to use any memory pages which already belong to
>>> +guests.  To facilitate this, a number of contiguous region of memory are
>>> +reserved for the boot allocator, known as *live update bootmem*.
>>> +
>>> +xen#1 will always reserve a region just below Xen (the size is controlled by
>>> +the Xen command line parameter liveupdate) to allow Xen growing and provide
>>> +information about LiveUpdate (see the section *Breadcrumb*).  The region will be
>>> +passed to xen#2 using the same command line option but with the base address
>>> +specified.
>>
>> I particularly don't understand the "to allow Xen growing" aspect here:
>> xen#2 needs to be placed in a different memory range anyway until xen#1
>> has handed over control.
>> Are you suggesting it gets moved over to xen#1's
>> original physical range (not necessarily an exact match), and then
>> perhaps to start below where xen#1 started? 
> 
> That's correct.
> 
>> Why would you do this?
> 
> There are a few reasons:
>    1) kexec-tools is in charge of selecting the physical address where 
> the kernel (or Xen in our case) will be loaded. So we need to tell kexec 
> where is a good place to load the new binary.
>    2) xen#2 may end up to be loaded in a "random" and therefore possibly 
> inconvenient place.

"Inconvenient" should be avoidable as long as the needed alignment
can be guaranteed. In particular I don't think there's too much in
the way in order to have (x86) Xen run on physical memory above
4Gb.

>> Xen intentionally lives at a 2Mb boundary, such that in principle (for EFI:
>> in fact) large page mappings are possible.
> 
> Right, xen#2 will still be loaded at a 2MB boundary. But it may be 2MB 
> lower than the original one.

Oh, I see. The wording made me think you would move it down in
smaller steps. I think somewhere (perhaps in a reply to someone
else) it was said that you'd place it such that its upper address
matches that of xen#1.

>> I also see no reason to reuse
>> the same physical area of memory for Xen itself - all you need is for
>> Xen's virtual addresses to be properly mapped to the new physical range.
>> I wonder what I'm missing here.
> It is a known convenient place. It may be difficult to find a similar 
> spot on host that have been long-running.

I'm not convinced: If it was placed in the kexec area at a 2Mb
boundary, it could just run from there. If the kexec area is
large enough, this would work any number of times (as occupied
ranges become available again when the next LU cycle ends).

>>> +For simplicity, additional regions will be provided in the stream.  They will
>>> +consist of region that could be re-used by xen#2 during boot (such as the
>>> +xen#1's frametable memory).
>>> +
>>> +xen#2 must not use any pages outside those regions until it has consumed the
>>> +Live Update data stream and determined which pages are already in use by
>>> +running domains or need to be re-used as-is by Xen (e.g M2P).
>>
>> Is the M2P really in the "need to be re-used" group, not just "can
>> be re-used for simplicity and efficiency reasons"?
> 
> The MFNs are shared with privileged guests (e.g. dom0). So, I believe, 
> the M2P needs to reside at the same place.

Oh, yes, good point.

>>> +## Restore
>>> +
>>> +After xen#2 initialized itself and map the stream, it will be responsible to
>>> +restore the state of the system and each domain.
>>> +
>>> +Unlike the save part, it is not possible to restore a domain in a single pass.
>>> +There are dependencies between:
>>> +
>>> +    1. different states of a domain.  For instance, the event channels ABI
>>> +       used (2l vs fifo) requires to be restored before restoring the event
>>> +       channels.
>>> +    2. the same "state" within a domain.  For instance, in case of PV domain,
>>> +       the pages' ownership requires to be restored before restoring the type
>>> +       of the page (e.g is it an L4, L1... table?).
>>> +
>>> +    3. domains.  For instance when restoring the grant mapping, it will be
>>> +       necessary to have the page's owner in hand to do proper refcounting.
>>> +       Therefore the pages' ownership have to be restored first.
>>> +
>>> +Dependencies will be resolved using either multiple passes (for dependency
>>> +type 2 and 3) or using a specific ordering between records (for dependency
>>> +type 1).
>>> +
>>> +Each domain will be restored in 3 passes:
>>> +
>>> +    * Pass 0: Create the domain and restore the P2M for HVM. This can be broken
>>> +      down in 3 parts:
>>> +      * Allocate a domain via _domain\_create()_ but skip part that requires
>>> +        extra records (e.g HAP, P2M).
>>> +      * Restore any parts which needs to be done before create the vCPUs. This
>>> +        including restoring the P2M and whether HAP is used.
>>> +      * Create the vCPUs. Note this doesn't restore the state of the vCPUs.
>>> +    * Pass 1: It will restore the pages' ownership and the grant-table frames
>>> +    * Pass 2: This steps will restore any domain states (e.g vCPU state, event
>>> +      channels) that wasn't
>>
>> What about foreign mappings (which are part of the P2M)? Can they be
>> validly restored prior to restoring page ownership?
> 
> Our plan is to transfer the P2M as-is because it is used by the IOMMU. 
> So the P2M may be restored before it is fully validated.
> 
>> In how far do you
>> fully trust xen#1's state to be fully consistent anyway, rather than
>> perhaps checking it?
> 
> This is a tricky question. If the state is not consistent, then it may 
> be difficult to get around it. To continue on the example of foreign 
> mapping, what if Xen#2 thinks dom0 has not the right to map it? We can't 
> easily (?) recover from that.
> 
> So far, you need to put some trust in xen#1 state. IOW, you would not be 
> able to blindly replace a reboot with LiveUpdating the hypervisor. This 
> will need to be tested.

But this then eliminates a subset of the intended use cases: If e.g.
a refcounting bug needed to be fixed in Xen, and if you don't know
whether xen#1 has actually accumulated any badness, you still won't
be able to avoid the reboot.

Jan
Xia, Hongyan May 7, 2021, 2:59 p.m. UTC | #7
On Fri, 2021-05-07 at 14:15 +0200, Jan Beulich wrote:
> On 07.05.2021 13:44, Julien Grall wrote:
[...]
> > 
> > It is a known convenient place. It may be difficult to find a
> > similar 
> > spot on host that have been long-running.
> 
> I'm not convinced: If it was placed in the kexec area at a 2Mb
> boundary, it could just run from there. If the kexec area is
> large enough, this would work any number of times (as occupied
> ranges become available again when the next LU cycle ends).

To make sure the next Xen can be loaded and run anywhere in case kexec
cannot find large enough memory under 4G, we need to:

1. teach kexec to load the whole image contiguously. At the moment
kexec prepares scattered 4K pages which are not runnable until they are
copied to a contiguous destination. (What if it can't find a contiguous
range?)

2. teach Xen that it can be jumped into with some existing page tables
which point to itself above 4G. We can't do real/protected mode entry
because it needs to start below 4G physically. Maybe a modified version
of the EFI entry path (my familiarity with Xen EFI entry is limited)?

3. rewrite all the early boot bits that assume Xen is under 4G and its
bundled page tables for below 4G.

These are the obstacles off the top of my head. So I think there is no
fundamental reason why we have to place Xen #2 where Xen #1 was, but
doing so is a massive reduction of pain which allows us to reuse much
of the existing Xen code.

Maybe, this part does not have to be part of the ABI and we just
suggest this as one way of loading the next Xen to cope with growth?
This is the best way I can think of (loading Xen where it was and
expanding into the reserved bootmem if needed) that does not need to
rewrite a lot of early boot code and can pretty much guarantee success
even if memory is tight and fragmented.

Hongyan
Jan Beulich May 7, 2021, 3:28 p.m. UTC | #8
On 07.05.2021 16:59, Xia, Hongyan wrote:
> On Fri, 2021-05-07 at 14:15 +0200, Jan Beulich wrote:
>> On 07.05.2021 13:44, Julien Grall wrote:
> [...]
>>>
>>> It is a known convenient place. It may be difficult to find a
>>> similar 
>>> spot on host that have been long-running.
>>
>> I'm not convinced: If it was placed in the kexec area at a 2Mb
>> boundary, it could just run from there. If the kexec area is
>> large enough, this would work any number of times (as occupied
>> ranges become available again when the next LU cycle ends).
> 
> To make sure the next Xen can be loaded and run anywhere in case kexec
> cannot find large enough memory under 4G, we need to:
> 
> 1. teach kexec to load the whole image contiguously. At the moment
> kexec prepares scattered 4K pages which are not runnable until they are
> copied to a contiguous destination. (What if it can't find a contiguous
> range?)
> 
> 2. teach Xen that it can be jumped into with some existing page tables
> which point to itself above 4G. We can't do real/protected mode entry
> because it needs to start below 4G physically. Maybe a modified version
> of the EFI entry path (my familiarity with Xen EFI entry is limited)?
> 
> 3. rewrite all the early boot bits that assume Xen is under 4G and its
> bundled page tables for below 4G.
> 
> These are the obstacles off the top of my head. So I think there is no
> fundamental reason why we have to place Xen #2 where Xen #1 was, but
> doing so is a massive reduction of pain which allows us to reuse much
> of the existing Xen code.
> 
> Maybe, this part does not have to be part of the ABI and we just
> suggest this as one way of loading the next Xen to cope with growth?
> This is the best way I can think of (loading Xen where it was and
> expand into the reserved bootmem if needed) that does not need to
> rewrite a lot of early boot code and can pretty much guarantee success
> even if memory is tight and fragmented.

Yeah, all of this as an initial implementation plan sounds fine to
me. But it should then be called out as such (rather than as part of
how things ought to [remain to] be).

Jan

Patch

diff --git a/docs/designs/liveupdate.md b/docs/designs/liveupdate.md
new file mode 100644
index 000000000000..32993934f4fe
--- /dev/null
+++ b/docs/designs/liveupdate.md
@@ -0,0 +1,254 @@ 
+# Live Updating Xen
+
+## Background
+
+Administrators often require updating the Xen hypervisor to address security
+vulnerabilities, introduce new features, or fix software defects.  Currently,
+we offer the following methods to perform the update:
+
+    * Rebooting the guests and the host: this is highly disrupting to running
+      guests.
+    * Migrating off the guests, rebooting the host: this currently requires
+      the guest to cooperate (see [1] for a non-cooperative solution) and it
+      may not always be possible to migrate it off (i.e lack of capacity, use
+      of local storage...).
+    * Live patching: This is the least disruptive of the existing methods.
+      However, it can be difficult to prepare the livepatch if the change is
+      large or there are data structures to update.
+
+This document will present a new approach called "Live Update" which will
+activate new software without noticeable downtime (i.e no - or minimal -
+customer pain).
+
+## Terminology
+
+xen#1: Xen version currently active and running on a droplet.  This is the
+“source” for the Live Update operation.  This version can actually be newer
+than xen#2 in case of a rollback operation.
+
+xen#2: Xen version that's the “target” of the Live Update operation. This
+version will become the active version after successful Live Update.  This
+version of Xen can actually be older than xen#1 in case of a rollback
+operation.
+
+## High-level overview
+
+Xen has a framework to bring a new image of the Xen hypervisor in memory using
+kexec.  The existing framework does not meet the baseline functionality for
+Live Update, since kexec results in a restart for the hypervisor, host, Dom0,
+and all the guests.
+
+The operation can be divided in roughly 4 parts:
+
+    1. Trigger: The operation will be triggered from outside the hypervisor
+       (e.g. dom0 userspace).
+    2. Save: The state will be stabilized by pausing the domains and
+       serialized by xen#1.
+    3. Hand-over: xen#1 will pass the serialized state and transfer control to
+       xen#2.
+    4. Restore: The state will be deserialized by xen#2.
+
+All the domains will be paused before xen#1 starts saving the states,
+and any domain that was running before Live Update will be unpaused after
+xen#2 has finished restoring the states.  This is to prevent a domain from
+trying to modify the state of another domain while it is being saved/restored.
+
+The current approach could be seen as non-cooperative migration with a twist:
+none of the domains (including dom0) are expected to be involved in the Live
+Update process.
+
+The major differences compared to live migration are:
+
+    * The state is not transferred to another host, but instead locally to
+      xen#2.
+    * The memory content or device state (for passthrough) does not need to
+      be part of the stream. Instead we need to preserve it.
+    * PV backends, device emulators, xenstored are not recreated but preserved
+      (as these are part of dom0).
+
+
+Domains in process of being destroyed (*XEN\_DOMCTL\_destroydomain*) will need
+to be preserved because another entity may have mappings (e.g foreign, grant)
+on them.
+
+## Trigger
+
+Live update is built on top of the kexec interface to prepare the command line,
+load xen#2 and trigger the operation.  A new kexec type has been introduced
+(*KEXEC\_TYPE\_LIVE\_UPDATE*) to notify Xen to Live Update.
+
+The Live Update will be triggered from outside the hypervisor (e.g. dom0
+userspace).  Support for the operation has been added in kexec-tools 2.0.21.
+
+All the domains will be paused before xen#1 starts saving the states.
+In Xen, *domain\_pause()* will pause the vCPUs as soon as they can be re-
+scheduled.  In other words, a pause request will not wait for asynchronous
+requests (e.g. I/O) to finish.  For Live Update, this is not an ideal time to
+pause because it will require more xen#1 internal state to be transferred.
+Therefore, all the domains will be paused at an architecturally restartable
+boundary.
+
+Live update will not happen synchronously to the request but when all the
+domains are quiescent.  As domains running device emulators (e.g Dom0) will
+be part of the process to quiesce HVM domains, we will need to let them run
+until xen#1 is actually starting to save the state.  HVM vCPUs will be paused
+as soon as any pending asynchronous request has finished.
+
+In the current implementation, all PV domains will continue to run while the
+rest will be paused as soon as possible.  Note this approach is assuming that
+device emulators are only running in PV domains.
+
+It should be easy to extend to PVH domains not requiring device emulations.
+It will require more thought if we need to run device models in HVM domains as
+there might be inter-dependency.
+
+## Save
+
+xen#1 will be responsible to preserve and serialize the state of each existing
+domain and any system-wide state (e.g M2P).
+
+Each domain will be serialized independently using a modified migration stream.
+If there is any dependency between domains (such as for IOREQ server), it will
+be recorded using a domid. All the complexity of resolving the dependencies is
+left to the restore path in xen#2 (more in the *Restore* section).
+
+At the moment, the domains are saved one by one in a single thread, but it
+would be possible to consider multi-threading if it takes too long, although
+this may require some adjustment in the stream format.
+
+As we want to be able to Live Update between major versions of Xen (e.g Xen
+4.11 -> Xen 4.15), the states preserved should not be a dump of Xen internal
+structures but instead the minimal information that allows us to recreate the
+domains.
+
+For instance, we don't want to preserve the frametable (and therefore
+*struct page\_info*) as-is because the refcounting may differ
+between xen#1 and xen#2 (see XSA-299). Instead, we want to be able to recreate
+*struct page\_info* based on minimal information that is considered stable
+(such as the page type).
+
+Note that upgrading between versions of Xen will also require all the hypercalls
+to be stable. This will not be covered by this document.
+
+## Hand over
+
+### Memory usage restrictions
+
+xen#2 must take care not to use any memory pages which already belong to
+guests.  To facilitate this, a number of contiguous regions of memory are
+reserved for the boot allocator, known as *live update bootmem*.
+
+xen#1 will always reserve a region just below Xen (the size is controlled by
+the Xen command line parameter liveupdate) to allow Xen to grow and to provide
+information about LiveUpdate (see the section *Breadcrumb*).  The region will be
+passed to xen#2 using the same command line option but with the base address
+specified.
+
+For simplicity, additional regions will be provided in the stream.  They will
+consist of regions that could be re-used by xen#2 during boot (such as
+xen#1's frametable memory).
+
+xen#2 must not use any pages outside those regions until it has consumed the
+Live Update data stream and determined which pages are already in use by
+running domains or need to be re-used as-is by Xen (e.g M2P).
+
+At run time, Xen may use memory from the reserved region for any purpose that
+does not require preservation over a Live Update; in particular it __must__ not be
+mapped to a domain or used by any Xen state that needs to be preserved (e.g
+M2P).  In other words, the xenheap pages could be allocated from the reserved
+regions if we remove the concept of shared xenheap pages.
+
+The xen#2 binary may be bigger (or smaller) than the xen#1 binary.  So
+for the purpose of loading xen#2 binary, kexec should treat the reserved memory
+right below xen#1 and its region as a single contiguous space. xen#2 will be
+loaded right at the top of the contiguous space and the rest of the memory will
+be the new reserved memory (this may shrink or grow).  For that reason, freed
+init memory from the xen#1 image is also treated as reserved live update
+bootmem.
+
+### Live Update data stream
+
+During handover, xen#1 creates a Live Update data stream containing all the
+information required by the new Xen#2 to restore all the domains.
+
+Data pages for this stream may be allocated anywhere in physical memory outside
+the *live update bootmem* regions.
+
+Calling __vmap()__/__vunmap()__ has a cost on the downtime, so we want to reduce the
+number of calls to __vmap()__ when restoring the stream.  Therefore the stream
+will be contiguously virtually mapped in xen#2.  xen#1 will create an array of
+MFNs of the allocated data pages, suitable for passing to __vmap()__.  The
+array will be physically contiguous but the MFNs don't need to be physically
+contiguous.
+
+### Breadcrumb
+
+Since the Live Update data stream is created during the final **kexec\_exec**
+hypercall, its address cannot be passed on the command line to the new Xen
+because the command line needs to have been set up by **kexec(8)** in userspace
+long beforehand.
+
+Thus, to allow the new Xen to find the data stream, xen#1 places a breadcrumb
+in the first words of the Live Update bootmem, containing the number of data
+pages, and the physical address of the contiguous MFN array.
+
+### IOMMU
+
+Where devices are passed through to domains, it may not be possible to quiesce
+those devices for the purpose of performing the update.
+
+If performing Live Update with assigned devices, xen#1 will leave the IOMMU
+mappings active during the handover (thus implying that IOMMU page tables may
+not be allocated in the *live update bootmem* region either).
+
+xen#2 must take control of the IOMMU without causing those mappings to become
+invalid even for a short period of time.  In other words, xen#2 should not
+re-setup the IOMMUs.  On hardware which does not support Posted Interrupts,
+interrupts may need to be generated on resume.
+
+## Restore
+
+After xen#2 has initialized itself and mapped the stream, it will be responsible
+for restoring the state of the system and each domain.
+
+Unlike the save part, it is not possible to restore a domain in a single pass.
+There are dependencies between:
+
+    1. different states of a domain.  For instance, the event channels ABI
+       used (2l vs fifo) needs to be restored before restoring the event
+       channels.
+    2. the same "state" within a domain.  For instance, in case of PV domain,
+       the pages' ownership needs to be restored before restoring the type
+       of the page (e.g is it an L4, L1... table?).
+
+    3. domains.  For instance when restoring the grant mapping, it will be
+       necessary to have the page's owner in hand to do proper refcounting.
+       Therefore the pages' ownership has to be restored first.
+
+Dependencies will be resolved using either multiple passes (for dependency
+types 2 and 3) or using a specific ordering between records (for dependency
+type 1).
+
+Each domain will be restored in 3 passes:
+
+    * Pass 0: Create the domain and restore the P2M for HVM. This can be broken
+      down in 3 parts:
+      * Allocate a domain via _domain\_create()_ but skip the parts that require
+        extra records (e.g HAP, P2M).
+      * Restore any parts which need to be done before creating the vCPUs. This
+        includes restoring the P2M and whether HAP is used.
+      * Create the vCPUs. Note this doesn't restore the state of the vCPUs.
+    * Pass 1: It will restore the pages' ownership and the grant-table frames
+    * Pass 2: This step will restore any domain state (e.g vCPU state, event
+      channels) that wasn't
+
+A domain should not have a dependency on another domain within the same pass.
+Therefore it would be possible to take advantage of all the CPUs to restore
+domains in parallel and reduce the overall downtime.
+
+Once all the domains have been restored, they will be unpaused if they were
+running before Live Update.
+
+* * *
+[1] https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=docs/designs/non-cooperative-migration.md;h=4b876d809fb5b8aac02d29fd7760a5c0d5b86d87;hb=HEAD
+
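
As a rough illustration of the *Memory usage restrictions* and *Breadcrumb* sections of the patch above, the following C sketch shows how xen#2 could be placed at the top of the contiguous space formed by the reserved bootmem and the xen#1 image, and how a breadcrumb could be written at the start of the remaining bootmem. It is not taken from the series: all names (`struct lu_breadcrumb`, `struct lu_placement`, `lu_place_xen2()`, `lu_write_breadcrumb()`, `lu_read_breadcrumb()`) are hypothetical, the breadcrumb field order and widths are assumptions, and aligning the start of xen#2 down to a 2MB boundary is one possible reading of "loaded right at the top".

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Xen is loaded at a 2MB boundary, as discussed in the thread. */
#define LU_XEN_ALIGN (2ULL << 20)

/*
 * Hypothetical breadcrumb layout. The design only states that it contains
 * the number of data pages and the physical address of the contiguous MFN
 * array; the field order and widths are assumptions.
 */
struct lu_breadcrumb {
    uint64_t nr_pages;        /* number of data pages in the stream */
    uint64_t mfn_array_paddr; /* physical address of the MFN array */
};

/* Result of placing xen#2 (hypothetical helper type). */
struct lu_placement {
    uint64_t xen2_start;    /* 2MB-aligned load address of xen#2 */
    uint64_t bootmem_start; /* start of the new reserved bootmem */
    uint64_t bootmem_size;  /* reserved memory left below xen#2 */
};

/*
 * Treat [region_start, xen1_end) as a single contiguous space (reserved
 * bootmem followed by the xen#1 image) and place xen#2 as high as possible
 * with a 2MB-aligned start. Whatever is left below xen#2 becomes the new
 * reserved bootmem, so it shrinks or grows with the size of xen#2.
 * Returns 0 on success, -1 if xen#2 does not fit or no room is left for
 * the breadcrumb.
 */
static int lu_place_xen2(uint64_t region_start, uint64_t xen1_end,
                         uint64_t xen2_size, struct lu_placement *out)
{
    uint64_t start;

    if (xen2_size == 0 || xen1_end <= region_start ||
        xen2_size > xen1_end - region_start)
        return -1;

    /* Highest 2MB-aligned address at which the image still ends in range. */
    start = (xen1_end - xen2_size) & ~(LU_XEN_ALIGN - 1);
    if (start < region_start ||
        start - region_start < sizeof(struct lu_breadcrumb))
        return -1;

    out->xen2_start = start;
    out->bootmem_start = region_start;
    out->bootmem_size = start - region_start;
    return 0;
}

/* xen#1 side: store the breadcrumb in the first words of the bootmem. */
static void lu_write_breadcrumb(void *bootmem, uint64_t nr_pages,
                                uint64_t mfn_array_paddr)
{
    struct lu_breadcrumb bc = {
        .nr_pages = nr_pages,
        .mfn_array_paddr = mfn_array_paddr,
    };

    assert(bootmem != NULL);
    memcpy(bootmem, &bc, sizeof(bc));
}

/* xen#2 side: read the breadcrumb back to locate the data stream. */
static struct lu_breadcrumb lu_read_breadcrumb(const void *bootmem)
{
    struct lu_breadcrumb bc;

    assert(bootmem != NULL);
    memcpy(&bc, bootmem, sizeof(bc));
    return bc;
}
```

For example, with the contiguous space spanning 16MB to 32MB and a 3MB xen#2 image, `lu_place_xen2()` picks 28MB as the load address and leaves 12MB of reserved bootmem below it.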