diff mbox series

[v7,2/2] docs/designs: Add a design document for migration of xenstore data

Message ID 20200318111144.12894-3-paul@xen.org (mailing list archive)
State Superseded
Headers show
Series docs: Migration design documents | expand

Commit Message

Paul Durrant March 18, 2020, 11:11 a.m. UTC
From: Paul Durrant <pdurrant@amazon.com>

This patch details proposes extra migration data and xenstore protocol
extensions to support non-cooperative live migration of guests.

NOTE: doc/misc/xenstore.txt is also amened to replace the <mfn> term
      for the INTRODUCE operation with the <gfn>, since this is what
      it actually is.

Signed-off-by: Paul Durrant <paul@xen.org>
---
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: George Dunlap <George.Dunlap@eu.citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Julien Grall <julien@xen.org>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Wei Liu <wl@xen.org>

v7:
 - Addressed further comments from Julien
 - Switched migration records to defined structures instead of tuples

v6:
 - Addressed comments from Julien

v5:
 - Add QUIESCE
 - Make semantics of <index> in GET_DOMAIN_WATCHES more clear

v4:
 - Drop the restrictions on special paths

v3:
 - New in v3
---
 docs/designs/xenstore-migration.md | 256 +++++++++++++++++++++++++++++
 docs/misc/xenstore.txt             |   6 +-
 2 files changed, 259 insertions(+), 3 deletions(-)
 create mode 100644 docs/designs/xenstore-migration.md

Comments

Julien Grall March 23, 2020, 11:33 a.m. UTC | #1
Hi Paul,

On 18/03/2020 11:11, Paul Durrant wrote:
> From: Paul Durrant <pdurrant@amazon.com>
> 
> This patch details proposes extra migration data and xenstore protocol
> extensions to support non-cooperative live migration of guests.
> 
> NOTE: doc/misc/xenstore.txt is also amened to replace the <mfn> term

NIT: s/amened/amended/

>        for the INTRODUCE operation with the <gfn>, since this is what
>        it actually is.
> 
> Signed-off-by: Paul Durrant <paul@xen.org>
> ---
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> Cc: George Dunlap <George.Dunlap@eu.citrix.com>
> Cc: Ian Jackson <ian.jackson@eu.citrix.com>
> Cc: Jan Beulich <jbeulich@suse.com>
> Cc: Julien Grall <julien@xen.org>
> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> Cc: Stefano Stabellini <sstabellini@kernel.org>
> Cc: Wei Liu <wl@xen.org>
> 
> v7:
>   - Addressed further comments from Julien
>   - Switched migration records to defined structures instead of tuples
> 
> v6:
>   - Addressed comments from Julien
> 
> v5:
>   - Add QUIESCE
>   - Make semantics of <index> in GET_DOMAIN_WATCHES more clear
> 
> v4:
>   - Drop the restrictions on special paths
> 
> v3:
>   - New in v3
> ---
>   docs/designs/xenstore-migration.md | 256 +++++++++++++++++++++++++++++
>   docs/misc/xenstore.txt             |   6 +-
>   2 files changed, 259 insertions(+), 3 deletions(-)
>   create mode 100644 docs/designs/xenstore-migration.md
> 
> diff --git a/docs/designs/xenstore-migration.md b/docs/designs/xenstore-migration.md
> new file mode 100644
> index 0000000000..e7e6593002
> --- /dev/null
> +++ b/docs/designs/xenstore-migration.md
> @@ -0,0 +1,256 @@
> +# Xenstore Migration
> +
> +## Background
> +
> +The design for *Non-Cooperative Migration of Guests*[1] explains that extra
> +save records are required in the migrations stream to allow a guest running
> +PV drivers to be migrated without its co-operation. Moreover the save
> +records must include details of registered xenstore watches as well as
> +content; information that cannot currently be recovered from `xenstored`,
> +and hence some extension to the xenstore protocol[2] will also be required.
> +
> +The *libxenlight Domain Image Format* specification[3] already defines a
> +record type `EMULATOR_XENSTORE_DATA` but this is not suitable for
> +transferring xenstore data pertaining to the domain directly as it is
> +specified such that keys are relative to the path
> +`/local/domain/$dm_domid/device-model/$domid`. Thus it is necessary to
> +define at least one new save record type.
> +
> +## Proposal
> +
> +### New Save Record
> +
> +A new mandatory record type should be defined within the libxenlight Domain
> +Image Format:
> +
> +`0x00000007: DOMAIN_XENSTORE_DATA`
> +
> +An arbitrary number of these records may be present in the migration
> +stream and may appear in any order. The format of each record should be as
> +follows:
> +
> +
> +```
> +    0       1       2       3       4       5       6       7    octet
> ++-------+-------+-------+-------+-------+-------+-------+-------+
> +| type                          | record specific data          |
> ++-------------------------------+                               |
> +...
> ++---------------------------------------------------------------+
> +```
> +
> +where type is one of the following values
> +
> +
> +| Field  | Description                                      |
> +|--------|--------------------------------------------------|
> +| `type` | 0x00000000: invalid                              |
> +|        | 0x00000001: NODE_DATA                            |
> +|        | 0x00000002: WATCH_DATA                           |
> +|        | 0x00000003: TRANSACTION_DATA                     |
> +|        | 0x00000004 - 0xFFFFFFFF: reserved for future use |
> +
> +
> +and data is one of the record data formats described in the following
> +sections.
> +
> +
> +NOTE: The record data does not contain an overall length because the
> +libxenlight record header specifies the length.
> +
> +
> +**NODE_DATA**
> +
> +
> +Each NODE_DATA record specifies a single node in xenstore and is formatted
> +as follows:
> +
> +
> +```
> +    0       1       2       3     octet
> ++-------+-------+-------+-------+
> +|   1   |   0   |   0   |   0   |


The stream may either be big-endian or little-endian. This will be 
specified in the header of the stream.

But here, this suggests that the type will always be described in 
little-endian.

I think it would be best to only describe the "record specific data".

> ++-------------------------------+
> +| path length                   |
> ++-------------------------------+
> +| path data                     |
> +...
> +| pad (0 to 3 octets)           |
> ++-------------------------------+
> +| perm count (N)                |
> ++-------------------------------+
> +| perm0                         |
> ++-------------------------------+
> +...
> ++-------------------------------+
> +| permN                         |
> ++-------------------------------+
> +| value length                  |
> ++-------------------------------+
> +| value data                    |
> +...
> +| pad (0 to 3 octets)           |
> ++-------------------------------+
> +```
> +
> +where perm0..N are formatted as follows:
> +
> +
> +```
> +    0       1       2       3     octet
> ++-------+-------+-------+-------+
> +| perm  | pad   | domid         |
> ++-------------------------------+
> +```
> +
> +
> +path length and value length are specified in octets (excluding the NUL
> +terminator of the path).

IIUC, this means the "path data" would not contain a NUL at the end.am I 
correct?

> perm should be one of the ASCII values `w`, `r`,
> +`b` or `n` as described in [2]. All pad values should be 0.
> +All paths should be absolute (i.e. start with `/`) and as described in
> +[2].
> +
> +
> +**WATCH_DATA**
> +
> +
> +Each WATCH_DATA record specifies a registered watch and is formatted as
> +follows:
> +
> +
> +```
> +    0       1       2       3     octet
> ++-------+-------+-------+-------+
> +|   2   |   0   |   0   |   0   |

Same remark here about the endianess.

> ++-------------------------------+
> +| wpath length                  |
> ++-------------------------------+
> +| wpath data                    |
> +...
> +| pad (0 to 3 octets)           |
> ++-------------------------------+
> +...
> ++-------------------------------+
> +| token length                  |
> ++-------------------------------+
> +| token data                    |
> +...
> +| pad (0 to 3 octets)           |
> ++-------------------------------+
> +```
> +
> +wpath length and token length are specified in octets (excluding the NUL
> +terminator). The wpath should be as described for the `WATCH` operation in
> +[2]. The token is an arbitrary string of octets not containing any NUL
> +values.
> +
> +
> +**TRANSACTION_DATA**
> +
> +
> +Each TRANSACTION_DATA record specifies an open transaction and is formatted
> +as follows:
> +
> +
> +```
> +    0       1       2       3     octet
> ++-------+-------+-------+-------+
> +|   3   |   0   |   0   |   0   |

Same remark here about the endianess.

> ++-------------------------------+
> +| tx_id                         |
> ++-------------------------------+
> +```
> +
> +where tx_id is the non-zero identifier values of an open transaction.
> +
> +
> +### Protocol Extension
> +
> +Before xenstore state is migrated it is necessary to wait for any pending
> +reads, writes, watch registrations etc. to complete, and also to make sure
> +that xenstored does not start processing any new requests (so that new
> +requests remain pending on the shared ring for subsequent processing on the
> +new host). Hence the following operation is needed:
> +
> +```
> +QUIESCE                 <domid>|
> +
> +Complete processing of any request issued by the specified domain, and
> +do not process any further requests from the shared ring.
> +```
> +
> +The `WATCH` operation does not allow specification of a `<domid>`; it is
> +assumed that the watch pertains to the domain that owns the shared ring
> +over which the operation is passed. Hence, for the tool-stack to be able
> +to register a watch on behalf of a domain a new operation is needed:
> +
> +```
> +ADD_DOMAIN_WATCHES      <domid>|<watch>|+
> +
> +Adds watches on behalf of the specified domain.
> +
> +<watch> is a NUL separated tuple of <path>|<token>. The semantics of this
> +operation are identical to the domain issuing WATCH <path>|<token>| for
> +each <watch>.
> +```
> +
> +The watch information for a domain also needs to be extracted from the
> +sending xenstored so the following operation is also needed:
> +
> +```
> +GET_DOMAIN_WATCHES      <domid>|<index>   <gencnt>|<watch>|*
> +
> +Gets the list of watches that are currently registered for the domain.
> +
> +<watch> is a NUL separated tuple of <path>|<token>. The sub-list returned
> +will start at <index> items into the the overall list of watches and may
> +be truncated (at a <watch> boundary) such that the returned data fits
> +within XENSTORE_PAYLOAD_MAX.
> +
> +If <index> is beyond the end of the overall list then the returned sub-
> +list will be empty. If the value of <gencnt> changes then it indicates
> +that the overall watch list has changed and thus it may be necessary
> +to re-issue the operation for previous values of <index>.
> +```
> +
> +To deal with transactions that were pending when the domain is migrated
> +it is necessary to start transactions with the same tx_id on behalf of the
> +domain in the receiving xenstored.
> +
> +NOTE: For safety each such transaction should result in an `EAGAIN` when
> +the `TRANSACTION_END` operation is peformed, as modifications made under

NIT: s/peformed/performed/

> +the tx_id will not be part of the migration stream.
> +
> +The `TRANSACTION_START` operation does not allow specification of a
> +`<domid>`; it is assumed that the transaction pertains to the domain that
> +owns the shared ring over which the operation is passed. Neither does it
> +allow a `<transid>` to be specified; it is always chosen by xenstored.
> +Hence, for the tool-stack to be able to open a transaction on behalf of a
> +domain a new operation is needed:
> +
> +```
> +START_DOMAIN_TRANSACTION    <domid>|<transid>|
> +
> +Starts a transaction on behalf of a domain.
> +
> +The semantics of this are similar to the domain issuing
> +TRANSACTION_START and receiving the specified <transid> as the response.
> +The main difference is that the transaction will be immediately marked as
> +'conflicting' such that when the domain isses TRANSACTION_END T|, it will
> +result in EAGAIN.
> +```
> +
> +It may also be desirable to state in the protocol specification that
> +the `INTRODUCE` operation should not clear the `<gfn>` specified such that
> +a `RELEASE` operation followed by an `INTRODUCE` operation form an
> +idempotent pair. The current implementation of *C xentored* does this
> +(in the `domain_conn_reset()` function) but this could be dropped as this
> +behaviour is not currently specified and the page will always be zeroed
> +for a newly created domain.
> +
> +
> +* * *
> +
> +[1] See https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=docs/designs/non-cooperative-migration.md
> +[2] See https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=docs/misc/xenstore.txt
> +[3] See https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=docs/specs/libxl-migration-stream.pandoc
> diff --git a/docs/misc/xenstore.txt b/docs/misc/xenstore.txt
> index 6f8569d576..51e6b12931 100644
> --- a/docs/misc/xenstore.txt
> +++ b/docs/misc/xenstore.txt
> @@ -254,7 +254,7 @@ TRANSACTION_END		F|
>   
>   ---------- Domain management and xenstored communications ----------
>   
> -INTRODUCE		<domid>|<mfn>|<evtchn>|?
> +INTRODUCE		<domid>|<gfn>|<evtchn>|?
>   	Notifies xenstored to communicate with this domain.
>   
>   	INTRODUCE is currently only used by xend (during domain
> @@ -262,12 +262,12 @@ INTRODUCE		<domid>|<mfn>|<evtchn>|?
>   	xenstored prevents its use other than by dom0.
>   
>   	<domid> must be a real domain id (not 0 and not a special
> -	DOMID_... value).  <mfn> must be a machine page in that domain
> +	DOMID_... value).  <gfn> must be a machine page in that domain

Now that you are using "gfn", do you also want to update "machine page"?

>   	represented in signed decimal (!).  <evtchn> must be event
>   	channel is an unbound event channel in <domid> (likewise in
>   	decimal), on which xenstored will call bind_interdomain.
>   	Violations of these rules may result in undefined behaviour;
> -	for example passing a high-bit-set 32-bit mfn as an unsigned
> +	for example passing a high-bit-set 32-bit gfn as an unsigned
>   	decimal will attempt to use 0x7fffffff instead (!).
>   
>   RELEASE			<domid>|
> 

Cheeers,
Paul Durrant March 23, 2020, 11:47 a.m. UTC | #2
> -----Original Message-----
> From: Julien Grall <julien@xen.org>
> Sent: 23 March 2020 11:34
> To: Paul Durrant <paul@xen.org>; xen-devel@lists.xenproject.org
> Cc: Paul Durrant <pdurrant@amazon.com>; Andrew Cooper <andrew.cooper3@citrix.com>; George Dunlap
> <George.Dunlap@eu.citrix.com>; Ian Jackson <ian.jackson@eu.citrix.com>; Jan Beulich
> <jbeulich@suse.com>; Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>; Stefano Stabellini
> <sstabellini@kernel.org>; Wei Liu <wl@xen.org>
> Subject: Re: [PATCH v7 2/2] docs/designs: Add a design document for migration of xenstore data
> 
> Hi Paul,
> 
> On 18/03/2020 11:11, Paul Durrant wrote:
> > From: Paul Durrant <pdurrant@amazon.com>
> >
> > This patch details proposes extra migration data and xenstore protocol
> > extensions to support non-cooperative live migration of guests.
> >
> > NOTE: doc/misc/xenstore.txt is also amened to replace the <mfn> term
> 
> NIT: s/amened/amended/
> 

Yes.

> >        for the INTRODUCE operation with the <gfn>, since this is what
> >        it actually is.
> >
> > Signed-off-by: Paul Durrant <paul@xen.org>
> > ---
> > Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> > Cc: George Dunlap <George.Dunlap@eu.citrix.com>
> > Cc: Ian Jackson <ian.jackson@eu.citrix.com>
> > Cc: Jan Beulich <jbeulich@suse.com>
> > Cc: Julien Grall <julien@xen.org>
> > Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> > Cc: Stefano Stabellini <sstabellini@kernel.org>
> > Cc: Wei Liu <wl@xen.org>
> >
> > v7:
> >   - Addressed further comments from Julien
> >   - Switched migration records to defined structures instead of tuples
> >
> > v6:
> >   - Addressed comments from Julien
> >
> > v5:
> >   - Add QUIESCE
> >   - Make semantics of <index> in GET_DOMAIN_WATCHES more clear
> >
> > v4:
> >   - Drop the restrictions on special paths
> >
> > v3:
> >   - New in v3
> > ---
> >   docs/designs/xenstore-migration.md | 256 +++++++++++++++++++++++++++++
> >   docs/misc/xenstore.txt             |   6 +-
> >   2 files changed, 259 insertions(+), 3 deletions(-)
> >   create mode 100644 docs/designs/xenstore-migration.md
> >
> > diff --git a/docs/designs/xenstore-migration.md b/docs/designs/xenstore-migration.md
> > new file mode 100644
> > index 0000000000..e7e6593002
> > --- /dev/null
> > +++ b/docs/designs/xenstore-migration.md
> > @@ -0,0 +1,256 @@
> > +# Xenstore Migration
> > +
> > +## Background
> > +
> > +The design for *Non-Cooperative Migration of Guests*[1] explains that extra
> > +save records are required in the migrations stream to allow a guest running
> > +PV drivers to be migrated without its co-operation. Moreover the save
> > +records must include details of registered xenstore watches as well as
> > +content; information that cannot currently be recovered from `xenstored`,
> > +and hence some extension to the xenstore protocol[2] will also be required.
> > +
> > +The *libxenlight Domain Image Format* specification[3] already defines a
> > +record type `EMULATOR_XENSTORE_DATA` but this is not suitable for
> > +transferring xenstore data pertaining to the domain directly as it is
> > +specified such that keys are relative to the path
> > +`/local/domain/$dm_domid/device-model/$domid`. Thus it is necessary to
> > +define at least one new save record type.
> > +
> > +## Proposal
> > +
> > +### New Save Record
> > +
> > +A new mandatory record type should be defined within the libxenlight Domain
> > +Image Format:
> > +
> > +`0x00000007: DOMAIN_XENSTORE_DATA`
> > +
> > +An arbitrary number of these records may be present in the migration
> > +stream and may appear in any order. The format of each record should be as
> > +follows:
> > +
> > +
> > +```
> > +    0       1       2       3       4       5       6       7    octet
> > ++-------+-------+-------+-------+-------+-------+-------+-------+
> > +| type                          | record specific data          |
> > ++-------------------------------+                               |
> > +...
> > ++---------------------------------------------------------------+
> > +```
> > +
> > +where type is one of the following values
> > +
> > +
> > +| Field  | Description                                      |
> > +|--------|--------------------------------------------------|
> > +| `type` | 0x00000000: invalid                              |
> > +|        | 0x00000001: NODE_DATA                            |
> > +|        | 0x00000002: WATCH_DATA                           |
> > +|        | 0x00000003: TRANSACTION_DATA                     |
> > +|        | 0x00000004 - 0xFFFFFFFF: reserved for future use |
> > +
> > +
> > +and data is one of the record data formats described in the following
> > +sections.
> > +
> > +
> > +NOTE: The record data does not contain an overall length because the
> > +libxenlight record header specifies the length.
> > +
> > +
> > +**NODE_DATA**
> > +
> > +
> > +Each NODE_DATA record specifies a single node in xenstore and is formatted
> > +as follows:
> > +
> > +
> > +```
> > +    0       1       2       3     octet
> > ++-------+-------+-------+-------+
> > +|   1   |   0   |   0   |   0   |
> 
> 
> The stream may either be big-endian or little-endian. This will be
> specified in the header of the stream.
> 
> But here, this suggests that the type will always be described in
> little-endian.
> 
> I think it would be best to only describe the "record specific data".
> 

True, I'll just say 'NODE_DATA' (or similar) in the first line.

> > ++-------------------------------+
> > +| path length                   |
> > ++-------------------------------+
> > +| path data                     |
> > +...
> > +| pad (0 to 3 octets)           |
> > ++-------------------------------+
> > +| perm count (N)                |
> > ++-------------------------------+
> > +| perm0                         |
> > ++-------------------------------+
> > +...
> > ++-------------------------------+
> > +| permN                         |
> > ++-------------------------------+
> > +| value length                  |
> > ++-------------------------------+
> > +| value data                    |
> > +...
> > +| pad (0 to 3 octets)           |
> > ++-------------------------------+
> > +```
> > +
> > +where perm0..N are formatted as follows:
> > +
> > +
> > +```
> > +    0       1       2       3     octet
> > ++-------+-------+-------+-------+
> > +| perm  | pad   | domid         |
> > ++-------------------------------+
> > +```
> > +
> > +
> > +path length and value length are specified in octets (excluding the NUL
> > +terminator of the path).
> 
> IIUC, this means the "path data" would not contain a NUL at the end.am I
> correct?

Yes, it is supposed to apply to both.

> 
> > perm should be one of the ASCII values `w`, `r`,
> > +`b` or `n` as described in [2]. All pad values should be 0.
> > +All paths should be absolute (i.e. start with `/`) and as described in
> > +[2].
> > +
> > +
> > +**WATCH_DATA**
> > +
> > +
> > +Each WATCH_DATA record specifies a registered watch and is formatted as
> > +follows:
> > +
> > +
> > +```
> > +    0       1       2       3     octet
> > ++-------+-------+-------+-------+
> > +|   2   |   0   |   0   |   0   |
> 
> Same remark here about the endianess.
> 
> > ++-------------------------------+
> > +| wpath length                  |
> > ++-------------------------------+
> > +| wpath data                    |
> > +...
> > +| pad (0 to 3 octets)           |
> > ++-------------------------------+
> > +...
> > ++-------------------------------+
> > +| token length                  |
> > ++-------------------------------+
> > +| token data                    |
> > +...
> > +| pad (0 to 3 octets)           |
> > ++-------------------------------+
> > +```
> > +
> > +wpath length and token length are specified in octets (excluding the NUL
> > +terminator). The wpath should be as described for the `WATCH` operation in
> > +[2]. The token is an arbitrary string of octets not containing any NUL
> > +values.
> > +
> > +
> > +**TRANSACTION_DATA**
> > +
> > +
> > +Each TRANSACTION_DATA record specifies an open transaction and is formatted
> > +as follows:
> > +
> > +
> > +```
> > +    0       1       2       3     octet
> > ++-------+-------+-------+-------+
> > +|   3   |   0   |   0   |   0   |
> 
> Same remark here about the endianess.
> 
> > ++-------------------------------+
> > +| tx_id                         |
> > ++-------------------------------+
> > +```
> > +
> > +where tx_id is the non-zero identifier values of an open transaction.
> > +
> > +
> > +### Protocol Extension
> > +
> > +Before xenstore state is migrated it is necessary to wait for any pending
> > +reads, writes, watch registrations etc. to complete, and also to make sure
> > +that xenstored does not start processing any new requests (so that new
> > +requests remain pending on the shared ring for subsequent processing on the
> > +new host). Hence the following operation is needed:
> > +
> > +```
> > +QUIESCE                 <domid>|
> > +
> > +Complete processing of any request issued by the specified domain, and
> > +do not process any further requests from the shared ring.
> > +```
> > +
> > +The `WATCH` operation does not allow specification of a `<domid>`; it is
> > +assumed that the watch pertains to the domain that owns the shared ring
> > +over which the operation is passed. Hence, for the tool-stack to be able
> > +to register a watch on behalf of a domain a new operation is needed:
> > +
> > +```
> > +ADD_DOMAIN_WATCHES      <domid>|<watch>|+
> > +
> > +Adds watches on behalf of the specified domain.
> > +
> > +<watch> is a NUL separated tuple of <path>|<token>. The semantics of this
> > +operation are identical to the domain issuing WATCH <path>|<token>| for
> > +each <watch>.
> > +```
> > +
> > +The watch information for a domain also needs to be extracted from the
> > +sending xenstored so the following operation is also needed:
> > +
> > +```
> > +GET_DOMAIN_WATCHES      <domid>|<index>   <gencnt>|<watch>|*
> > +
> > +Gets the list of watches that are currently registered for the domain.
> > +
> > +<watch> is a NUL separated tuple of <path>|<token>. The sub-list returned
> > +will start at <index> items into the the overall list of watches and may
> > +be truncated (at a <watch> boundary) such that the returned data fits
> > +within XENSTORE_PAYLOAD_MAX.
> > +
> > +If <index> is beyond the end of the overall list then the returned sub-
> > +list will be empty. If the value of <gencnt> changes then it indicates
> > +that the overall watch list has changed and thus it may be necessary
> > +to re-issue the operation for previous values of <index>.
> > +```
> > +
> > +To deal with transactions that were pending when the domain is migrated
> > +it is necessary to start transactions with the same tx_id on behalf of the
> > +domain in the receiving xenstored.
> > +
> > +NOTE: For safety each such transaction should result in an `EAGAIN` when
> > +the `TRANSACTION_END` operation is peformed, as modifications made under
> 
> NIT: s/peformed/performed/
> 

Ok.

> > +the tx_id will not be part of the migration stream.
> > +
> > +The `TRANSACTION_START` operation does not allow specification of a
> > +`<domid>`; it is assumed that the transaction pertains to the domain that
> > +owns the shared ring over which the operation is passed. Neither does it
> > +allow a `<transid>` to be specified; it is always chosen by xenstored.
> > +Hence, for the tool-stack to be able to open a transaction on behalf of a
> > +domain a new operation is needed:
> > +
> > +```
> > +START_DOMAIN_TRANSACTION    <domid>|<transid>|
> > +
> > +Starts a transaction on behalf of a domain.
> > +
> > +The semantics of this are similar to the domain issuing
> > +TRANSACTION_START and receiving the specified <transid> as the response.
> > +The main difference is that the transaction will be immediately marked as
> > +'conflicting' such that when the domain isses TRANSACTION_END T|, it will
> > +result in EAGAIN.
> > +```
> > +
> > +It may also be desirable to state in the protocol specification that
> > +the `INTRODUCE` operation should not clear the `<gfn>` specified such that
> > +a `RELEASE` operation followed by an `INTRODUCE` operation form an
> > +idempotent pair. The current implementation of *C xentored* does this
> > +(in the `domain_conn_reset()` function) but this could be dropped as this
> > +behaviour is not currently specified and the page will always be zeroed
> > +for a newly created domain.
> > +
> > +
> > +* * *
> > +
> > +[1] See https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=docs/designs/non-cooperative-
> migration.md
> > +[2] See https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=docs/misc/xenstore.txt
> > +[3] See https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=docs/specs/libxl-migration-stream.pandoc
> > diff --git a/docs/misc/xenstore.txt b/docs/misc/xenstore.txt
> > index 6f8569d576..51e6b12931 100644
> > --- a/docs/misc/xenstore.txt
> > +++ b/docs/misc/xenstore.txt
> > @@ -254,7 +254,7 @@ TRANSACTION_END		F|
> >
> >   ---------- Domain management and xenstored communications ----------
> >
> > -INTRODUCE		<domid>|<mfn>|<evtchn>|?
> > +INTRODUCE		<domid>|<gfn>|<evtchn>|?
> >   	Notifies xenstored to communicate with this domain.
> >
> >   	INTRODUCE is currently only used by xend (during domain
> > @@ -262,12 +262,12 @@ INTRODUCE		<domid>|<mfn>|<evtchn>|?
> >   	xenstored prevents its use other than by dom0.
> >
> >   	<domid> must be a real domain id (not 0 and not a special
> > -	DOMID_... value).  <mfn> must be a machine page in that domain
> > +	DOMID_... value).  <gfn> must be a machine page in that domain
> 
> Now that you are using "gfn", do you also want to update "machine page"?
> 

Oops, yes.

 Paul

> >   	represented in signed decimal (!).  <evtchn> must be event
> >   	channel is an unbound event channel in <domid> (likewise in
> >   	decimal), on which xenstored will call bind_interdomain.
> >   	Violations of these rules may result in undefined behaviour;
> > -	for example passing a high-bit-set 32-bit mfn as an unsigned
> > +	for example passing a high-bit-set 32-bit gfn as an unsigned
> >   	decimal will attempt to use 0x7fffffff instead (!).
> >
> >   RELEASE			<domid>|
> >
> 
> Cheeers,
> 
> --
> Julien Grall
diff mbox series

Patch

diff --git a/docs/designs/xenstore-migration.md b/docs/designs/xenstore-migration.md
new file mode 100644
index 0000000000..e7e6593002
--- /dev/null
+++ b/docs/designs/xenstore-migration.md
@@ -0,0 +1,256 @@ 
+# Xenstore Migration
+
+## Background
+
+The design for *Non-Cooperative Migration of Guests*[1] explains that extra
+save records are required in the migrations stream to allow a guest running
+PV drivers to be migrated without its co-operation. Moreover the save
+records must include details of registered xenstore watches as well as
+content; information that cannot currently be recovered from `xenstored`,
+and hence some extension to the xenstore protocol[2] will also be required.
+
+The *libxenlight Domain Image Format* specification[3] already defines a
+record type `EMULATOR_XENSTORE_DATA` but this is not suitable for
+transferring xenstore data pertaining to the domain directly as it is
+specified such that keys are relative to the path
+`/local/domain/$dm_domid/device-model/$domid`. Thus it is necessary to
+define at least one new save record type.
+
+## Proposal
+
+### New Save Record
+
+A new mandatory record type should be defined within the libxenlight Domain
+Image Format:
+
+`0x00000007: DOMAIN_XENSTORE_DATA`
+
+An arbitrary number of these records may be present in the migration
+stream and may appear in any order. The format of each record should be as
+follows:
+
+
+```
+    0       1       2       3       4       5       6       7    octet
++-------+-------+-------+-------+-------+-------+-------+-------+
+| type                          | record specific data          |
++-------------------------------+                               |
+...
++---------------------------------------------------------------+
+```
+
+where type is one of the following values
+
+
+| Field  | Description                                      |
+|--------|--------------------------------------------------|
+| `type` | 0x00000000: invalid                              |
+|        | 0x00000001: NODE_DATA                            |
+|        | 0x00000002: WATCH_DATA                           |
+|        | 0x00000003: TRANSACTION_DATA                     |
+|        | 0x00000004 - 0xFFFFFFFF: reserved for future use |
+
+
+and data is one of the record data formats described in the following
+sections.
+
+
+NOTE: The record data does not contain an overall length because the
+libxenlight record header specifies the length.
+
+
+**NODE_DATA**
+
+
+Each NODE_DATA record specifies a single node in xenstore and is formatted
+as follows:
+
+
+```
+    0       1       2       3     octet
++-------+-------+-------+-------+
+|   1   |   0   |   0   |   0   |
++-------------------------------+
+| path length                   |
++-------------------------------+
+| path data                     |
+...
+| pad (0 to 3 octets)           |
++-------------------------------+
+| perm count (N)                |
++-------------------------------+
+| perm0                         |
++-------------------------------+
+...
++-------------------------------+
+| permN                         |
++-------------------------------+
+| value length                  |
++-------------------------------+
+| value data                    |
+...
+| pad (0 to 3 octets)           |
++-------------------------------+
+```
+
+where perm0..N are formatted as follows:
+
+
+```
+    0       1       2       3     octet
++-------+-------+-------+-------+
+| perm  | pad   | domid         |
++-------------------------------+
+```
+
+
+path length and value length are specified in octets (excluding the NUL
+terminator of the path). perm should be one of the ASCII values `w`, `r`,
+`b` or `n` as described in [2]. All pad values should be 0.
+All paths should be absolute (i.e. start with `/`) and as described in
+[2].
+
+
+**WATCH_DATA**
+
+
+Each WATCH_DATA record specifies a registered watch and is formatted as
+follows:
+
+
+```
+    0       1       2       3     octet
++-------+-------+-------+-------+
+|   2   |   0   |   0   |   0   |
++-------------------------------+
+| wpath length                  |
++-------------------------------+
+| wpath data                    |
+...
+| pad (0 to 3 octets)           |
++-------------------------------+
+...
++-------------------------------+
+| token length                  |
++-------------------------------+
+| token data                    |
+...
+| pad (0 to 3 octets)           |
++-------------------------------+
+```
+
+wpath length and token length are specified in octets (excluding the NUL
+terminator). The wpath should be as described for the `WATCH` operation in
+[2]. The token is an arbitrary string of octets not containing any NUL
+values.
+
+
+**TRANSACTION_DATA**
+
+
+Each TRANSACTION_DATA record specifies an open transaction and is formatted
+as follows:
+
+
+```
+    0       1       2       3     octet
++-------+-------+-------+-------+
+|   3   |   0   |   0   |   0   |
++-------------------------------+
+| tx_id                         |
++-------------------------------+
+```
+
+where tx_id is the non-zero identifier values of an open transaction.
+
+
+### Protocol Extension
+
+Before xenstore state is migrated it is necessary to wait for any pending
+reads, writes, watch registrations etc. to complete, and also to make sure
+that xenstored does not start processing any new requests (so that new
+requests remain pending on the shared ring for subsequent processing on the
+new host). Hence the following operation is needed:
+
+```
+QUIESCE                 <domid>|
+
+Complete processing of any request issued by the specified domain, and
+do not process any further requests from the shared ring.
+```
+
+The `WATCH` operation does not allow specification of a `<domid>`; it is
+assumed that the watch pertains to the domain that owns the shared ring
+over which the operation is passed. Hence, for the tool-stack to be able
+to register a watch on behalf of a domain a new operation is needed:
+
+```
+ADD_DOMAIN_WATCHES      <domid>|<watch>|+
+
+Adds watches on behalf of the specified domain.
+
+<watch> is a NUL separated tuple of <path>|<token>. The semantics of this
+operation are identical to the domain issuing WATCH <path>|<token>| for
+each <watch>.
+```
+
+The watch information for a domain also needs to be extracted from the
+sending xenstored so the following operation is also needed:
+
+```
+GET_DOMAIN_WATCHES      <domid>|<index>   <gencnt>|<watch>|*
+
+Gets the list of watches that are currently registered for the domain.
+
+<watch> is a NUL separated tuple of <path>|<token>. The sub-list returned
+will start at <index> items into the the overall list of watches and may
+be truncated (at a <watch> boundary) such that the returned data fits
+within XENSTORE_PAYLOAD_MAX.
+
+If <index> is beyond the end of the overall list then the returned sub-
+list will be empty. If the value of <gencnt> changes then it indicates
+that the overall watch list has changed and thus it may be necessary
+to re-issue the operation for previous values of <index>.
+```
+
+To deal with transactions that were pending when the domain is migrated
+it is necessary to start transactions with the same tx_id on behalf of the
+domain in the receiving xenstored.
+
+NOTE: For safety each such transaction should result in an `EAGAIN` when
+the `TRANSACTION_END` operation is peformed, as modifications made under
+the tx_id will not be part of the migration stream.
+
+The `TRANSACTION_START` operation does not allow specification of a
+`<domid>`; it is assumed that the transaction pertains to the domain that
+owns the shared ring over which the operation is passed. Neither does it
+allow a `<transid>` to be specified; it is always chosen by xenstored.
+Hence, for the tool-stack to be able to open a transaction on behalf of a
+domain a new operation is needed:
+
+```
+START_DOMAIN_TRANSACTION    <domid>|<transid>|
+
+Starts a transaction on behalf of a domain.
+
+The semantics of this are similar to the domain issuing
+TRANSACTION_START and receiving the specified <transid> as the response.
+The main difference is that the transaction will be immediately marked as
+'conflicting' such that when the domain isses TRANSACTION_END T|, it will
+result in EAGAIN.
+```
+
+It may also be desirable to state in the protocol specification that
+the `INTRODUCE` operation should not clear the `<gfn>` specified such that
+a `RELEASE` operation followed by an `INTRODUCE` operation form an
+idempotent pair. The current implementation of *C xentored* does this
+(in the `domain_conn_reset()` function) but this could be dropped as this
+behaviour is not currently specified and the page will always be zeroed
+for a newly created domain.
+
+
+* * *
+
+[1] See https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=docs/designs/non-cooperative-migration.md
+[2] See https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=docs/misc/xenstore.txt
+[3] See https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=docs/specs/libxl-migration-stream.pandoc
diff --git a/docs/misc/xenstore.txt b/docs/misc/xenstore.txt
index 6f8569d576..51e6b12931 100644
--- a/docs/misc/xenstore.txt
+++ b/docs/misc/xenstore.txt
@@ -254,7 +254,7 @@  TRANSACTION_END		F|
 
 ---------- Domain management and xenstored communications ----------
 
-INTRODUCE		<domid>|<mfn>|<evtchn>|?
+INTRODUCE		<domid>|<gfn>|<evtchn>|?
 	Notifies xenstored to communicate with this domain.
 
 	INTRODUCE is currently only used by xend (during domain
@@ -262,12 +262,12 @@  INTRODUCE		<domid>|<mfn>|<evtchn>|?
 	xenstored prevents its use other than by dom0.
 
 	<domid> must be a real domain id (not 0 and not a special
-	DOMID_... value).  <mfn> must be a machine page in that domain
+	DOMID_... value).  <gfn> must be a machine page in that domain
 	represented in signed decimal (!).  <evtchn> must be event
 	channel is an unbound event channel in <domid> (likewise in
 	decimal), on which xenstored will call bind_interdomain.
 	Violations of these rules may result in undefined behaviour;
-	for example passing a high-bit-set 32-bit mfn as an unsigned
+	for example passing a high-bit-set 32-bit gfn as an unsigned
 	decimal will attempt to use 0x7fffffff instead (!).
 
 RELEASE			<domid>|