diff mbox series

[v2] docs/designs: re-work the xenstore migration document...

Message ID 20200427075342.149-1-paul@xen.org (mailing list archive)
State Superseded
Headers show
Series [v2] docs/designs: re-work the xenstore migration document... | expand

Commit Message

Paul Durrant April 27, 2020, 7:53 a.m. UTC
From: Paul Durrant <pdurrant@amazon.com>

... to specify a separate migration stream that will also be suitable for
live update.

The original scope of the document was to support non-cooperative migration
of guests [1] but, since then, live update of xenstored has been brought into
scope. Thus it makes more sense to define a separate image format for
serializing xenstore state that is suitable for both purposes.

The document has been limited to specifying a new image format. The mechanism
for acquiring the image for live update or migration is not covered as that
is more appropriately dealt with by a patch to docs/misc/xenstore.txt. It is
also expected that, when the first implementation of live update or migration
making use of this specification is committed, that the document is moved from
docs/designs into docs/specs.

[1] See https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=docs/designs/non-cooperative-migration.md

Signed-off-by: Paul Durrant <pdurrant@amazon.com>
---
Juergen Gross <jgross@suse.com>
Andrew Cooper <andrew.cooper3@citrix.com>
George Dunlap <george.dunlap@citrix.com>
Ian Jackson <ian.jackson@eu.citrix.com>
Jan Beulich <jbeulich@suse.com>
Julien Grall <julien@xen.org>
Stefano Stabellini <sstabellini@kernel.org>
Wei Liu <wl@xen.org>

v2:
 - Address comments from Juergen
---
 docs/designs/xenstore-migration.md | 475 +++++++++++++++++++----------
 1 file changed, 312 insertions(+), 163 deletions(-)

Comments

Jürgen Groß April 27, 2020, 10:36 a.m. UTC | #1
On 27.04.20 09:53, Paul Durrant wrote:
> From: Paul Durrant <pdurrant@amazon.com>
> 
> ... to specify a separate migration stream that will also be suitable for
> live update.
> 
> The original scope of the document was to support non-cooperative migration
> of guests [1] but, since then, live update of xenstored has been brought into
> scope. Thus it makes more sense to define a separate image format for
> serializing xenstore state that is suitable for both purposes.
> 
> The document has been limited to specifying a new image format. The mechanism
> for acquiring the image for live update or migration is not covered as that
> is more appropriately dealt with by a patch to docs/misc/xenstore.txt. It is
> also expected that, when the first implementation of live update or migration
> making use of this specification is committed, that the document is moved from
> docs/designs into docs/specs.
> 
> [1] See https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=docs/designs/non-cooperative-migration.md
> 
> Signed-off-by: Paul Durrant <pdurrant@amazon.com>
> ---
> Juergen Gross <jgross@suse.com>
> Andrew Cooper <andrew.cooper3@citrix.com>
> George Dunlap <george.dunlap@citrix.com>
> Ian Jackson <ian.jackson@eu.citrix.com>
> Jan Beulich <jbeulich@suse.com>
> Julien Grall <julien@xen.org>
> Stefano Stabellini <sstabellini@kernel.org>
> Wei Liu <wl@xen.org>

Mind adding CC: before those mail addresses in order to let git add
those to the recipients list?

> 
> v2:
>   - Address comments from Juergen

Not all unfortunately. :-(

> +### CONNECTION_DATA
>   
> -Each WATCH_DATA record specifies a registered watch and is formatted as
> -follows:
> +For live update the image format will contain a `CONNECTION_DATA` record for
> +each connection to xenstore. For migration it will only contain a record for
> +the domain being migrated.
>   
>   
>   ```
> -    0       1       2       3     octet
> -+-------+-------+-------+-------+
> -| WATCH_DATA                    |
> -+-------------------------------+
> -| wpath length                  |
> -+-------------------------------+
> -| wpath data                    |
> -...
> -| pad (0 to 3 octets)           |
> -+-------------------------------+
> +    0       1       2       3       4       5       6       7    octet
> ++-------+-------+-------+-------+-------+-------+-------+-------+
> +| conn-id                       | pad                           |
> ++---------------+-----------------------------------------------+
> +| conn-type     | conn-spec
>   ...

I asked whether it wouldn't be better to drop the pad and move conn-type
and a 2-byte (unified) flag field at its position. This together ...

> ++-------------------------------+-------------------------------+
> +| data-len                      | data
>   +-------------------------------+
> -| token length                  |
> -+-------------------------------+
> -| token data                    |
>   ...
> -| pad (0 to 3 octets)           |
> -+-------------------------------+
>   ```
>   
> -wpath length and token length are specified in octets (excluding the NUL
> -terminator). The wpath should be as described for the `WATCH` operation in
> -[2]. The token is an arbitrary string of octets not containing any NUL
> -values.
>   
> +| Field       | Description                                     |
> +|-------------|-------------------------------------------------|
> +| `conn-id`   | A non-zero number used to identify this         |
> +|             | connection in subsequent connection-specific    |
> +|             | records                                         |
> +|             |                                                 |
> +| `conn-type` | 0x0000: shared ring                             |
> +|             | 0x0001: socket                                  |
> +|             |                                                 |
> +| `conn-spec` | See below                                       |
> +|             |                                                 |
> +| `data-len`  | The length (in octets) of any pending data not  |
> +|             | yet written to the connection                   |
> +|             |                                                 |
> +| `data`      | Pending data (may be empty)                     |
>   
> -**TRANSACTION_DATA**
> +The format of `conn-spec` is dependent upon `conn-type`.
>   
> +\pagebreak
>   
> -Each TRANSACTION_DATA record specifies an open transaction and is formatted
> -as follows:
> +For `shared ring` connections it is as follows:
>   
>   
>   ```
> -    0       1       2       3     octet
> -+-------+-------+-------+-------+
> -| TRANSACTION_DATA              |
> -+-------------------------------+
> -| tx_id                         |
> -+-------------------------------+
> +    0       1       2       3       4       5       6       7    octet
> +                +-------+-------+-------+-------+-------+-------+
> +                | domid         | tdomid        | flags         |
> ++---------------+---------------+---------------+---------------+
> +| revtchn                       | levtchn                       |
> ++-------------------------------+-------------------------------+
> +| mfn                                                           |
> ++---------------------------------------------------------------+

... with dropping levtchn (which isn't needed IMO) will make it much
easier to have a union in C (which needs to be aligned to 8 bytes
and have a length of a multiple of 8 bytes due to mfn).

So something like:

struct xs_state_connection {
     uint32_t conn_id;
     uint16_t conn_type;
#define XS_STATE_CONN_TYPE_RING   0
#define XS_STATE_CONN_TYPE_SOCKET 1
     uint16_t flags;
#define XS_STATE_CONN_INTRODUCED  0x0001
#define XS_STATE_CONN_RELEASED    0x0002
#define XS_STATE_CONN_READONLY    0x0004
     union {
         struct {
             uint16_t domid;
             uint16_t tdomid;
#define XS_STATE_DOMID_INVALID  0xffffU
             uint32_t evtchn;
             uint64_t mfn;
#define XS_STATE_MFN_INVALID    0xffffffffffffffffUL
         } ring;
         int32_t socket_fd;
     } spec;
     uint32_t data_out_len;
     uint8_t  data[];
};

>   ```
>   
> -where tx_id is the non-zero identifier values of an open transaction.
> -
>   
> -### Protocol Extension
> +| Field      | Description                                      |
> +|------------|--------------------------------------------------|
> +| `domid`    | The domain-id that owns the shared page          |
> +|            |                                                  |
> +| `tdomid`   | The domain-id that `domid` acts on behalf of if  |
> +|            | it has been subject to an SET_TARGET             |
> +|            | operation [2] or DOMID_INVALID otherwise         |

DOMID_INVALID needs to be defined (or we need a reference where it is
coming from).

> +|            |                                                  |
> +| `flags`    | A bit-wise OR of:                                |
> +|            | 0x0001: INTRODUCE has been issued                |
> +|            | 0x0002: RELEASE has been issued                  |
> +|            |                                                  |
> +| `revtchn`  | The port number of the interdomain channel used  |
> +|            | by `domid` to communicate with xenstored         |
> +|            |                                                  |
> +| `levtchn`  | For a live update this will be the port number   |
> +|            | of the interdomain channel used by xenstored     |
> +|            | itself otherwise, for migration, it will be -1   |
> +|            |                                                  |
> +| `mfn`      | The MFN of the shared page for a live update or  |
> +|            | INVALID_MFN otherwise                            |

INVALID_MFN is an internal detail of the hypervisor. We should have a
local definition for it here (as in my example above).

> +
> +Since the ABI guarantees that entry 1 in `domid`'s grant table will always
> +contain the GFN of the shared page, so for a live update `mfn` can be used to
> +give confidence that `domid` has not been re-cycled during the update.
> +
> +
> +For `socket` connections it is as follows:
>   
> -Before xenstore state is migrated it is necessary to wait for any pending
> -reads, writes, watch registrations etc. to complete, and also to make sure
> -that xenstored does not start processing any new requests (so that new
> -requests remain pending on the shared ring for subsequent processing on the
> -new host). Hence the following operation is needed:
>   
>   ```
> -QUIESCE                 <domid>|
> -
> -Complete processing of any request issued by the specified domain, and
> -do not process any further requests from the shared ring.
> +    0       1       2       3       4       5       6       7    octet
> +                +-------+-------+-------+-------+-------+-------+
> +                | flags         | socket-fd                     |
> +                +---------------+-------------------------------+

> -START_DOMAIN_TRANSACTION    <domid>|<transid>|
> +    0       1       2       3    octet
> ++-------+-------+-------+-------+
> +| conn-id                       |
> ++-------------------------------+
> +| tx-id                         |
> ++---------------+---------------+
> +| access        | perm-count    |
> ++---------------+---------------+
> +| perm1                         |
> ++-------------------------------+
> +...
> ++-------------------------------+
> +| permN                         |
> ++---------------+---------------+
> +| path-len      | value-len     |
> ++---------------+---------------+
> +| path
> +...
> +| value
> +...
> +```
> +
> +
> +| Field        | Description                                    |
> +|--------------|------------------------------------------------|
> +| `conn-id`    | If this value is non-zero then this record     |
> +|              | related to a pending transaction               |
> +|              |                                                |
> +| `tx-id`      | This value should be ignored if `conn-id` is   |
> +|              | zero. Otherwise it specifies the id of the     |
> +|              | pending transaction                            |
> +|              |                                                |
> +| `access`     | This value should be ignored if this record    |
> +|              | does not relate to a pending transaction,      |
> +|              | otherwise it specifies the accesses made to    |
> +|              | the node and hence is a bitwise OR of:         |
> +|              |                                                |
> +|              | 0x0001: read                                   |
> +|              | 0x0002: written                                |
> +|              |                                                |
> +|              | The value will be zero for a deleted node      |
> +|              |                                                |
> +| `perm-count` | The number (N) of node permission specifiers   |
> +|              | (which will be 0 for a node deleted in a       |
> +|              | pending transaction)                           |
> +|              |                                                |
> +| `perm1..N`   | A list of zero or more node permission         |
> +|              | specifiers (see below)                         |
> +|              |                                                |
> +| `path-len`   | The length (in octets) of `path` including the |
> +|              | NUL terminator                                 |
> +|              |                                                |
> +| `value-len`  | The length (in octets) of `value` (which will  |
> +|              | be zero for a deleted node)                    |

I asked you to put the path-len and value-len fields before the perm
array.


Juergen
Paul Durrant April 27, 2020, 10:45 a.m. UTC | #2
> -----Original Message-----
> From: Jürgen Groß <jgross@suse.com>
> Sent: 27 April 2020 11:37
> To: Paul Durrant <paul@xen.org>; xen-devel@lists.xenproject.org
> Cc: Paul Durrant <pdurrant@amazon.com>
> Subject: Re: [PATCH v2] docs/designs: re-work the xenstore migration document...
> 
> On 27.04.20 09:53, Paul Durrant wrote:
> > From: Paul Durrant <pdurrant@amazon.com>
> >
> > ... to specify a separate migration stream that will also be suitable for
> > live update.
> >
> > The original scope of the document was to support non-cooperative migration
> > of guests [1] but, since then, live update of xenstored has been brought into
> > scope. Thus it makes more sense to define a separate image format for
> > serializing xenstore state that is suitable for both purposes.
> >
> > The document has been limited to specifying a new image format. The mechanism
> > for acquiring the image for live update or migration is not covered as that
> > is more appropriately dealt with by a patch to docs/misc/xenstore.txt. It is
> > also expected that, when the first implementation of live update or migration
> > making use of this specification is committed, that the document is moved from
> > docs/designs into docs/specs.
> >
> > [1] See https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=docs/designs/non-cooperative-migration.md
> >
> > Signed-off-by: Paul Durrant <pdurrant@amazon.com>
> > ---
> > Juergen Gross <jgross@suse.com>
> > Andrew Cooper <andrew.cooper3@citrix.com>
> > George Dunlap <george.dunlap@citrix.com>
> > Ian Jackson <ian.jackson@eu.citrix.com>
> > Jan Beulich <jbeulich@suse.com>
> > Julien Grall <julien@xen.org>
> > Stefano Stabellini <sstabellini@kernel.org>
> > Wei Liu <wl@xen.org>
> 
> Mind adding CC: before those mail addresses in order to let git add
> those to the recipients list?
> 

D'oh... good spot.

> >
> > v2:
> >   - Address comments from Juergen
> 
> Not all unfortunately. :-(
> 

OK.

> > +### CONNECTION_DATA
> >
> > -Each WATCH_DATA record specifies a registered watch and is formatted as
> > -follows:
> > +For live update the image format will contain a `CONNECTION_DATA` record for
> > +each connection to xenstore. For migration it will only contain a record for
> > +the domain being migrated.
> >
> >
> >   ```
> > -    0       1       2       3     octet
> > -+-------+-------+-------+-------+
> > -| WATCH_DATA                    |
> > -+-------------------------------+
> > -| wpath length                  |
> > -+-------------------------------+
> > -| wpath data                    |
> > -...
> > -| pad (0 to 3 octets)           |
> > -+-------------------------------+
> > +    0       1       2       3       4       5       6       7    octet
> > ++-------+-------+-------+-------+-------+-------+-------+-------+
> > +| conn-id                       | pad                           |
> > ++---------------+-----------------------------------------------+
> > +| conn-type     | conn-spec
> >   ...
> 
> I asked whether it wouldn't be better to drop the pad and move conn-type
> and a 2-byte (unified) flag field at its position. This together ...
> 
> > ++-------------------------------+-------------------------------+
> > +| data-len                      | data
> >   +-------------------------------+
> > -| token length                  |
> > -+-------------------------------+
> > -| token data                    |
> >   ...
> > -| pad (0 to 3 octets)           |
> > -+-------------------------------+
> >   ```
> >
> > -wpath length and token length are specified in octets (excluding the NUL
> > -terminator). The wpath should be as described for the `WATCH` operation in
> > -[2]. The token is an arbitrary string of octets not containing any NUL
> > -values.
> >
> > +| Field       | Description                                     |
> > +|-------------|-------------------------------------------------|
> > +| `conn-id`   | A non-zero number used to identify this         |
> > +|             | connection in subsequent connection-specific    |
> > +|             | records                                         |
> > +|             |                                                 |
> > +| `conn-type` | 0x0000: shared ring                             |
> > +|             | 0x0001: socket                                  |
> > +|             |                                                 |
> > +| `conn-spec` | See below                                       |
> > +|             |                                                 |
> > +| `data-len`  | The length (in octets) of any pending data not  |
> > +|             | yet written to the connection                   |
> > +|             |                                                 |
> > +| `data`      | Pending data (may be empty)                     |
> >
> > -**TRANSACTION_DATA**
> > +The format of `conn-spec` is dependent upon `conn-type`.
> >
> > +\pagebreak
> >
> > -Each TRANSACTION_DATA record specifies an open transaction and is formatted
> > -as follows:
> > +For `shared ring` connections it is as follows:
> >
> >
> >   ```
> > -    0       1       2       3     octet
> > -+-------+-------+-------+-------+
> > -| TRANSACTION_DATA              |
> > -+-------------------------------+
> > -| tx_id                         |
> > -+-------------------------------+
> > +    0       1       2       3       4       5       6       7    octet
> > +                +-------+-------+-------+-------+-------+-------+
> > +                | domid         | tdomid        | flags         |
> > ++---------------+---------------+---------------+---------------+
> > +| revtchn                       | levtchn                       |
> > ++-------------------------------+-------------------------------+
> > +| mfn                                                           |
> > ++---------------------------------------------------------------+
> 
> ... with dropping levtchn (which isn't needed IMO) will make it much
> easier to have a union in C (which needs to be aligned to 8 bytes
> and have a length of a multiple of 8 bytes due to mfn).
> 
> So something like:
> 
> struct xs_state_connection {
>      uint32_t conn_id;
>      uint16_t conn_type;
> #define XS_STATE_CONN_TYPE_RING   0
> #define XS_STATE_CONN_TYPE_SOCKET 1
>      uint16_t flags;
> #define XS_STATE_CONN_INTRODUCED  0x0001
> #define XS_STATE_CONN_RELEASED    0x0002
> #define XS_STATE_CONN_READONLY    0x0004
>      union {
>          struct {
>              uint16_t domid;
>              uint16_t tdomid;
> #define XS_STATE_DOMID_INVALID  0xffffU
>              uint32_t evtchn;
>              uint64_t mfn;
> #define XS_STATE_MFN_INVALID    0xffffffffffffffffUL
>          } ring;
>          int32_t socket_fd;
>      } spec;
>      uint32_t data_out_len;
>      uint8_t  data[];
> };

The issue is making sure that the mfn is properly aligned. If I can drop the levtchn then this gets easier.

> 
> >   ```
> >
> > -where tx_id is the non-zero identifier values of an open transaction.
> > -
> >
> > -### Protocol Extension
> > +| Field      | Description                                      |
> > +|------------|--------------------------------------------------|
> > +| `domid`    | The domain-id that owns the shared page          |
> > +|            |                                                  |
> > +| `tdomid`   | The domain-id that `domid` acts on behalf of if  |
> > +|            | it has been subject to an SET_TARGET             |
> > +|            | operation [2] or DOMID_INVALID otherwise         |
> 
> DOMID_INVALID needs to be defined (or we need a reference where it is
> coming from).

OK. It's in a public header... I'll reference it.

> 
> > +|            |                                                  |
> > +| `flags`    | A bit-wise OR of:                                |
> > +|            | 0x0001: INTRODUCE has been issued                |
> > +|            | 0x0002: RELEASE has been issued                  |
> > +|            |                                                  |
> > +| `revtchn`  | The port number of the interdomain channel used  |
> > +|            | by `domid` to communicate with xenstored         |
> > +|            |                                                  |
> > +| `levtchn`  | For a live update this will be the port number   |
> > +|            | of the interdomain channel used by xenstored     |
> > +|            | itself otherwise, for migration, it will be -1   |
> > +|            |                                                  |
> > +| `mfn`      | The MFN of the shared page for a live update or  |
> > +|            | INVALID_MFN otherwise                            |
> 
> INVALID_MFN is an internal detail of the hypervisor. We should have a
> local definition for it here (as in my example above).

OK. I'll define it as all-1s.

> 
> > +
> > +Since the ABI guarantees that entry 1 in `domid`'s grant table will always
> > +contain the GFN of the shared page, so for a live update `mfn` can be used to
> > +give confidence that `domid` has not been re-cycled during the update.
> > +
> > +
> > +For `socket` connections it is as follows:
> >
> > -Before xenstore state is migrated it is necessary to wait for any pending
> > -reads, writes, watch registrations etc. to complete, and also to make sure
> > -that xenstored does not start processing any new requests (so that new
> > -requests remain pending on the shared ring for subsequent processing on the
> > -new host). Hence the following operation is needed:
> >
> >   ```
> > -QUIESCE                 <domid>|
> > -
> > -Complete processing of any request issued by the specified domain, and
> > -do not process any further requests from the shared ring.
> > +    0       1       2       3       4       5       6       7    octet
> > +                +-------+-------+-------+-------+-------+-------+
> > +                | flags         | socket-fd                     |
> > +                +---------------+-------------------------------+
> 
> > -START_DOMAIN_TRANSACTION    <domid>|<transid>|
> > +    0       1       2       3    octet
> > ++-------+-------+-------+-------+
> > +| conn-id                       |
> > ++-------------------------------+
> > +| tx-id                         |
> > ++---------------+---------------+
> > +| access        | perm-count    |
> > ++---------------+---------------+
> > +| perm1                         |
> > ++-------------------------------+
> > +...
> > ++-------------------------------+
> > +| permN                         |
> > ++---------------+---------------+
> > +| path-len      | value-len     |
> > ++---------------+---------------+
> > +| path
> > +...
> > +| value
> > +...
> > +```
> > +
> > +
> > +| Field        | Description                                    |
> > +|--------------|------------------------------------------------|
> > +| `conn-id`    | If this value is non-zero then this record     |
> > +|              | related to a pending transaction               |
> > +|              |                                                |
> > +| `tx-id`      | This value should be ignored if `conn-id` is   |
> > +|              | zero. Otherwise it specifies the id of the     |
> > +|              | pending transaction                            |
> > +|              |                                                |
> > +| `access`     | This value should be ignored if this record    |
> > +|              | does not relate to a pending transaction,      |
> > +|              | otherwise it specifies the accesses made to    |
> > +|              | the node and hence is a bitwise OR of:         |
> > +|              |                                                |
> > +|              | 0x0001: read                                   |
> > +|              | 0x0002: written                                |
> > +|              |                                                |
> > +|              | The value will be zero for a deleted node      |
> > +|              |                                                |
> > +| `perm-count` | The number (N) of node permission specifiers   |
> > +|              | (which will be 0 for a node deleted in a       |
> > +|              | pending transaction)                           |
> > +|              |                                                |
> > +| `perm1..N`   | A list of zero or more node permission         |
> > +|              | specifiers (see below)                         |
> > +|              |                                                |
> > +| `path-len`   | The length (in octets) of `path` including the |
> > +|              | NUL terminator                                 |
> > +|              |                                                |
> > +| `value-len`  | The length (in octets) of `value` (which will  |
> > +|              | be zero for a deleted node)                    |
> 
> I asked you to put the path-len and value-len fields before the perm
> array.

I missed that. I'll move them.

  Paul

> 
> 
> Juergen
Jürgen Groß April 27, 2020, 11:13 a.m. UTC | #3
On 27.04.20 12:45, Paul Durrant wrote:
>> -----Original Message-----
>> From: Jürgen Groß <jgross@suse.com>
>> Sent: 27 April 2020 11:37
>> To: Paul Durrant <paul@xen.org>; xen-devel@lists.xenproject.org
>> Cc: Paul Durrant <pdurrant@amazon.com>
>> Subject: Re: [PATCH v2] docs/designs: re-work the xenstore migration document...
>>
>> On 27.04.20 09:53, Paul Durrant wrote:
>>> From: Paul Durrant <pdurrant@amazon.com>
>>>
>>> ... to specify a separate migration stream that will also be suitable for
>>> live update.
>>>
>>> The original scope of the document was to support non-cooperative migration
>>> of guests [1] but, since then, live update of xenstored has been brought into
>>> scope. Thus it makes more sense to define a separate image format for
>>> serializing xenstore state that is suitable for both purposes.
>>>
>>> The document has been limited to specifying a new image format. The mechanism
>>> for acquiring the image for live update or migration is not covered as that
>>> is more appropriately dealt with by a patch to docs/misc/xenstore.txt. It is
>>> also expected that, when the first implementation of live update or migration
>>> making use of this specification is committed, that the document is moved from
>>> docs/designs into docs/specs.
>>>
>>> [1] See https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=docs/designs/non-cooperative-migration.md
>>>
>>> Signed-off-by: Paul Durrant <pdurrant@amazon.com>
>>> ---
>>> Juergen Gross <jgross@suse.com>
>>> Andrew Cooper <andrew.cooper3@citrix.com>
>>> George Dunlap <george.dunlap@citrix.com>
>>> Ian Jackson <ian.jackson@eu.citrix.com>
>>> Jan Beulich <jbeulich@suse.com>
>>> Julien Grall <julien@xen.org>
>>> Stefano Stabellini <sstabellini@kernel.org>
>>> Wei Liu <wl@xen.org>
>>
>> Mind adding CC: before those mail addresses in order to let git add
>> those to the recipients list?
>>
> 
> D'oh... good spot.
> 
>>>
>>> v2:
>>>    - Address comments from Juergen
>>
>> Not all unfortunately. :-(
>>
> 
> OK.
> 
>>> +### CONNECTION_DATA
>>>
>>> -Each WATCH_DATA record specifies a registered watch and is formatted as
>>> -follows:
>>> +For live update the image format will contain a `CONNECTION_DATA` record for
>>> +each connection to xenstore. For migration it will only contain a record for
>>> +the domain being migrated.
>>>
>>>
>>>    ```
>>> -    0       1       2       3     octet
>>> -+-------+-------+-------+-------+
>>> -| WATCH_DATA                    |
>>> -+-------------------------------+
>>> -| wpath length                  |
>>> -+-------------------------------+
>>> -| wpath data                    |
>>> -...
>>> -| pad (0 to 3 octets)           |
>>> -+-------------------------------+
>>> +    0       1       2       3       4       5       6       7    octet
>>> ++-------+-------+-------+-------+-------+-------+-------+-------+
>>> +| conn-id                       | pad                           |
>>> ++---------------+-----------------------------------------------+
>>> +| conn-type     | conn-spec
>>>    ...
>>
>> I asked whether it wouldn't be better to drop the pad and move conn-type
>> and a 2-byte (unified) flag field at its position. This together ...
>>
>>> ++-------------------------------+-------------------------------+
>>> +| data-len                      | data
>>>    +-------------------------------+
>>> -| token length                  |
>>> -+-------------------------------+
>>> -| token data                    |
>>>    ...
>>> -| pad (0 to 3 octets)           |
>>> -+-------------------------------+
>>>    ```
>>>
>>> -wpath length and token length are specified in octets (excluding the NUL
>>> -terminator). The wpath should be as described for the `WATCH` operation in
>>> -[2]. The token is an arbitrary string of octets not containing any NUL
>>> -values.
>>>
>>> +| Field       | Description                                     |
>>> +|-------------|-------------------------------------------------|
>>> +| `conn-id`   | A non-zero number used to identify this         |
>>> +|             | connection in subsequent connection-specific    |
>>> +|             | records                                         |
>>> +|             |                                                 |
>>> +| `conn-type` | 0x0000: shared ring                             |
>>> +|             | 0x0001: socket                                  |
>>> +|             |                                                 |
>>> +| `conn-spec` | See below                                       |
>>> +|             |                                                 |
>>> +| `data-len`  | The length (in octets) of any pending data not  |
>>> +|             | yet written to the connection                   |
>>> +|             |                                                 |
>>> +| `data`      | Pending data (may be empty)                     |
>>>
>>> -**TRANSACTION_DATA**
>>> +The format of `conn-spec` is dependent upon `conn-type`.
>>>
>>> +\pagebreak
>>>
>>> -Each TRANSACTION_DATA record specifies an open transaction and is formatted
>>> -as follows:
>>> +For `shared ring` connections it is as follows:
>>>
>>>
>>>    ```
>>> -    0       1       2       3     octet
>>> -+-------+-------+-------+-------+
>>> -| TRANSACTION_DATA              |
>>> -+-------------------------------+
>>> -| tx_id                         |
>>> -+-------------------------------+
>>> +    0       1       2       3       4       5       6       7    octet
>>> +                +-------+-------+-------+-------+-------+-------+
>>> +                | domid         | tdomid        | flags         |
>>> ++---------------+---------------+---------------+---------------+
>>> +| revtchn                       | levtchn                       |
>>> ++-------------------------------+-------------------------------+
>>> +| mfn                                                           |
>>> ++---------------------------------------------------------------+
>>
>> ... with dropping levtchn (which isn't needed IMO) will make it much
>> easier to have a union in C (which needs to be aligned to 8 bytes
>> and have a length of a multiple of 8 bytes due to mfn).
>>
>> So something like:
>>
>> struct xs_state_connection {
>>       uint32_t conn_id;
>>       uint16_t conn_type;
>> #define XS_STATE_CONN_TYPE_RING   0
>> #define XS_STATE_CONN_TYPE_SOCKET 1
>>       uint16_t flags;
>> #define XS_STATE_CONN_INTRODUCED  0x0001
>> #define XS_STATE_CONN_RELEASED    0x0002
>> #define XS_STATE_CONN_READONLY    0x0004
>>       union {
>>           struct {
>>               uint16_t domid;
>>               uint16_t tdomid;
>> #define XS_STATE_DOMID_INVALID  0xffffU
>>               uint32_t evtchn;
>>               uint64_t mfn;
>> #define XS_STATE_MFN_INVALID    0xffffffffffffffffUL
>>           } ring;
>>           int32_t socket_fd;
>>       } spec;
>>       uint32_t data_out_len;
>>       uint8_t  data[];
>> };
> 
> The issue is making sure that the mfn is properly aligned. If I can drop the levtchn then this gets easier.
> 
>>
>>>    ```
>>>
>>> -where tx_id is the non-zero identifier values of an open transaction.
>>> -
>>>
>>> -### Protocol Extension
>>> +| Field      | Description                                      |
>>> +|------------|--------------------------------------------------|
>>> +| `domid`    | The domain-id that owns the shared page          |
>>> +|            |                                                  |
>>> +| `tdomid`   | The domain-id that `domid` acts on behalf of if  |
>>> +|            | it has been subject to an SET_TARGET             |
>>> +|            | operation [2] or DOMID_INVALID otherwise         |
>>
>> DOMID_INVALID needs to be defined (or we need a reference where it is
>> coming from).
> 
> OK. It's in a public header... I'll reference it.
> 
>>
>>> +|            |                                                  |
>>> +| `flags`    | A bit-wise OR of:                                |
>>> +|            | 0x0001: INTRODUCE has been issued                |

Just realized, I think we can drop those flags.

Reasoning: if INTRODUCE hasn't been issued, there can't be an active
connection to Xenstore for that domain, as Xenstore doesn't know about
the parameters to connect (especially the event channel is missing).

>>> +|            | 0x0002: RELEASE has been issued                  |

And the same applies here: RELEASE will drop the connection to the
domain, so it can't appear in a connection record.


Juergen
Paul Durrant April 27, 2020, 11:25 a.m. UTC | #4
> -----Original Message-----
> From: Jürgen Groß <jgross@suse.com>
> Sent: 27 April 2020 12:13
> To: paul@xen.org; xen-devel@lists.xenproject.org
> Cc: 'Paul Durrant' <pdurrant@amazon.com>
> Subject: Re: [PATCH v2] docs/designs: re-work the xenstore migration document...
> 
> On 27.04.20 12:45, Paul Durrant wrote:
> >> -----Original Message-----
> >> From: Jürgen Groß <jgross@suse.com>
> >> Sent: 27 April 2020 11:37
> >> To: Paul Durrant <paul@xen.org>; xen-devel@lists.xenproject.org
> >> Cc: Paul Durrant <pdurrant@amazon.com>
> >> Subject: Re: [PATCH v2] docs/designs: re-work the xenstore migration document...
> >>
> >> On 27.04.20 09:53, Paul Durrant wrote:
> >>> From: Paul Durrant <pdurrant@amazon.com>
> >>>
> >>> ... to specify a separate migration stream that will also be suitable for
> >>> live update.
> >>>
> >>> The original scope of the document was to support non-cooperative migration
> >>> of guests [1] but, since then, live update of xenstored has been brought into
> >>> scope. Thus it makes more sense to define a separate image format for
> >>> serializing xenstore state that is suitable for both purposes.
> >>>
> >>> The document has been limited to specifying a new image format. The mechanism
> >>> for acquiring the image for live update or migration is not covered as that
> >>> is more appropriately dealt with by a patch to docs/misc/xenstore.txt. It is
> >>> also expected that, when the first implementation of live update or migration
> >>> making use of this specification is committed, that the document is moved from
> >>> docs/designs into docs/specs.
> >>>
> >>> [1] See https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=docs/designs/non-cooperative-
> migration.md
> >>>
> >>> Signed-off-by: Paul Durrant <pdurrant@amazon.com>
> >>> ---
> >>> Juergen Gross <jgross@suse.com>
> >>> Andrew Cooper <andrew.cooper3@citrix.com>
> >>> George Dunlap <george.dunlap@citrix.com>
> >>> Ian Jackson <ian.jackson@eu.citrix.com>
> >>> Jan Beulich <jbeulich@suse.com>
> >>> Julien Grall <julien@xen.org>
> >>> Stefano Stabellini <sstabellini@kernel.org>
> >>> Wei Liu <wl@xen.org>
> >>
> >> Mind adding CC: before those mail addresses in order to let git add
> >> those to the recipients list?
> >>
> >
> > D'oh... good spot.
> >
> >>>
> >>> v2:
> >>>    - Address comments from Juergen
> >>
> >> Not all unfortunately. :-(
> >>
> >
> > OK.
> >
> >>> +### CONNECTION_DATA
> >>>
> >>> -Each WATCH_DATA record specifies a registered watch and is formatted as
> >>> -follows:
> >>> +For live update the image format will contain a `CONNECTION_DATA` record for
> >>> +each connection to xenstore. For migration it will only contain a record for
> >>> +the domain being migrated.
> >>>
> >>>
> >>>    ```
> >>> -    0       1       2       3     octet
> >>> -+-------+-------+-------+-------+
> >>> -| WATCH_DATA                    |
> >>> -+-------------------------------+
> >>> -| wpath length                  |
> >>> -+-------------------------------+
> >>> -| wpath data                    |
> >>> -...
> >>> -| pad (0 to 3 octets)           |
> >>> -+-------------------------------+
> >>> +    0       1       2       3       4       5       6       7    octet
> >>> ++-------+-------+-------+-------+-------+-------+-------+-------+
> >>> +| conn-id                       | pad                           |
> >>> ++---------------+-----------------------------------------------+
> >>> +| conn-type     | conn-spec
> >>>    ...
> >>
> >> I asked whether it wouldn't be better to drop the pad and move conn-type
> >> and a 2-byte (unified) flag field at its position. This together ...
> >>
> >>> ++-------------------------------+-------------------------------+
> >>> +| data-len                      | data
> >>>    +-------------------------------+
> >>> -| token length                  |
> >>> -+-------------------------------+
> >>> -| token data                    |
> >>>    ...
> >>> -| pad (0 to 3 octets)           |
> >>> -+-------------------------------+
> >>>    ```
> >>>
> >>> -wpath length and token length are specified in octets (excluding the NUL
> >>> -terminator). The wpath should be as described for the `WATCH` operation in
> >>> -[2]. The token is an arbitrary string of octets not containing any NUL
> >>> -values.
> >>>
> >>> +| Field       | Description                                     |
> >>> +|-------------|-------------------------------------------------|
> >>> +| `conn-id`   | A non-zero number used to identify this         |
> >>> +|             | connection in subsequent connection-specific    |
> >>> +|             | records                                         |
> >>> +|             |                                                 |
> >>> +| `conn-type` | 0x0000: shared ring                             |
> >>> +|             | 0x0001: socket                                  |
> >>> +|             |                                                 |
> >>> +| `conn-spec` | See below                                       |
> >>> +|             |                                                 |
> >>> +| `data-len`  | The length (in octets) of any pending data not  |
> >>> +|             | yet written to the connection                   |
> >>> +|             |                                                 |
> >>> +| `data`      | Pending data (may be empty)                     |
> >>>
> >>> -**TRANSACTION_DATA**
> >>> +The format of `conn-spec` is dependent upon `conn-type`.
> >>>
> >>> +\pagebreak
> >>>
> >>> -Each TRANSACTION_DATA record specifies an open transaction and is formatted
> >>> -as follows:
> >>> +For `shared ring` connections it is as follows:
> >>>
> >>>
> >>>    ```
> >>> -    0       1       2       3     octet
> >>> -+-------+-------+-------+-------+
> >>> -| TRANSACTION_DATA              |
> >>> -+-------------------------------+
> >>> -| tx_id                         |
> >>> -+-------------------------------+
> >>> +    0       1       2       3       4       5       6       7    octet
> >>> +                +-------+-------+-------+-------+-------+-------+
> >>> +                | domid         | tdomid        | flags         |
> >>> ++---------------+---------------+---------------+---------------+
> >>> +| revtchn                       | levtchn                       |
> >>> ++-------------------------------+-------------------------------+
> >>> +| mfn                                                           |
> >>> ++---------------------------------------------------------------+
> >>
> >> ... with dropping levtchn (which isn't needed IMO) will make it much
> >> easier to have a union in C (which needs to be aligned to 8 bytes
> >> and have a length of a multiple of 8 bytes due to mfn).
> >>
> >> So something like:
> >>
> >> struct xs_state_connection {
> >>       uint32_t conn_id;
> >>       uint16_t conn_type;
> >> #define XS_STATE_CONN_TYPE_RING   0
> >> #define XS_STATE_CONN_TYPE_SOCKET 1
> >>       uint16_t flags;
> >> #define XS_STATE_CONN_INTRODUCED  0x0001
> >> #define XS_STATE_CONN_RELEASED    0x0002
> >> #define XS_STATE_CONN_READONLY    0x0004
> >>       union {
> >>           struct {
> >>               uint16_t domid;
> >>               uint16_t tdomid;
> >> #define XS_STATE_DOMID_INVALID  0xffffU
> >>               uint32_t evtchn;
> >>               uint64_t mfn;
> >> #define XS_STATE_MFN_INVALID    0xffffffffffffffffUL
> >>           } ring;
> >>           int32_t socket_fd;
> >>       } spec;
> >>       uint32_t data_out_len;
> >>       uint8_t  data[];
> >> };
> >
> > The issue is making sure that the mfn is properly aligned. If I can drop the levtchn then this gets
> easier.
> >
> >>
> >>>    ```
> >>>
> >>> -where tx_id is the non-zero identifier values of an open transaction.
> >>> -
> >>>
> >>> -### Protocol Extension
> >>> +| Field      | Description                                      |
> >>> +|------------|--------------------------------------------------|
> >>> +| `domid`    | The domain-id that owns the shared page          |
> >>> +|            |                                                  |
> >>> +| `tdomid`   | The domain-id that `domid` acts on behalf of if  |
> >>> +|            | it has been subject to an SET_TARGET             |
> >>> +|            | operation [2] or DOMID_INVALID otherwise         |
> >>
> >> DOMID_INVALID needs to be defined (or we need a reference where it is
> >> coming from).
> >
> > OK. It's in a public header... I'll reference it.
> >
> >>
> >>> +|            |                                                  |
> >>> +| `flags`    | A bit-wise OR of:                                |
> >>> +|            | 0x0001: INTRODUCE has been issued                |
> 
> Just realized, I think we can drop those flags.
> 
> Reasoning: if INTRODUCE hasn't been issued, there can't be an active
> connection to Xenstore for that domain, as Xenstore doesn't know about
> the parameters to connect (especially the event channel is missing).
> 
> >>> +|            | 0x0002: RELEASE has been issued                  |
> 
> And the same applies here: RELEASE will drop the connection to the
> domain, so it can't appear in a connection record.
> 

I think the presence of the RESUME command in xenstore.txt makes it non-obvious that we can forget about a domain once RELEASE has been called. The text there does say:

"It is not clear whether this is possible since one would
normally expect a domain not to be restarted after being shut
down without being destroyed in the meantime.  There are
currently no users of this request in xen-unstable."

So, perhaps this would be the time to remove RESUME from the spec?

  Paul

> 
> Juergen
Jürgen Groß April 27, 2020, 1:30 p.m. UTC | #5
On 27.04.20 13:25, Paul Durrant wrote:
>> -----Original Message-----
>> From: Jürgen Groß <jgross@suse.com>
>> Sent: 27 April 2020 12:13
>> To: paul@xen.org; xen-devel@lists.xenproject.org
>> Cc: 'Paul Durrant' <pdurrant@amazon.com>
>> Subject: Re: [PATCH v2] docs/designs: re-work the xenstore migration document...
>>
>> On 27.04.20 12:45, Paul Durrant wrote:
>>>> -----Original Message-----
>>>> From: Jürgen Groß <jgross@suse.com>
>>>> Sent: 27 April 2020 11:37
>>>> To: Paul Durrant <paul@xen.org>; xen-devel@lists.xenproject.org
>>>> Cc: Paul Durrant <pdurrant@amazon.com>
>>>> Subject: Re: [PATCH v2] docs/designs: re-work the xenstore migration document...
>>>>
>>>> On 27.04.20 09:53, Paul Durrant wrote:
>>>>> From: Paul Durrant <pdurrant@amazon.com>
>>>>>
>>>>> ... to specify a separate migration stream that will also be suitable for
>>>>> live update.
>>>>>
>>>>> The original scope of the document was to support non-cooperative migration
>>>>> of guests [1] but, since then, live update of xenstored has been brought into
>>>>> scope. Thus it makes more sense to define a separate image format for
>>>>> serializing xenstore state that is suitable for both purposes.
>>>>>
>>>>> The document has been limited to specifying a new image format. The mechanism
>>>>> for acquiring the image for live update or migration is not covered as that
>>>>> is more appropriately dealt with by a patch to docs/misc/xenstore.txt. It is
>>>>> also expected that, when the first implementation of live update or migration
>>>>> making use of this specification is committed, that the document is moved from
>>>>> docs/designs into docs/specs.
>>>>>
>>>>> [1] See https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=docs/designs/non-cooperative-
>> migration.md
>>>>>
>>>>> Signed-off-by: Paul Durrant <pdurrant@amazon.com>
>>>>> ---
>>>>> Juergen Gross <jgross@suse.com>
>>>>> Andrew Cooper <andrew.cooper3@citrix.com>
>>>>> George Dunlap <george.dunlap@citrix.com>
>>>>> Ian Jackson <ian.jackson@eu.citrix.com>
>>>>> Jan Beulich <jbeulich@suse.com>
>>>>> Julien Grall <julien@xen.org>
>>>>> Stefano Stabellini <sstabellini@kernel.org>
>>>>> Wei Liu <wl@xen.org>
>>>>
>>>> Mind adding CC: before those mail addresses in order to let git add
>>>> those to the recipients list?
>>>>
>>>
>>> D'oh... good spot.
>>>
>>>>>
>>>>> v2:
>>>>>     - Address comments from Juergen
>>>>
>>>> Not all unfortunately. :-(
>>>>
>>>
>>> OK.
>>>
>>>>> +### CONNECTION_DATA
>>>>>
>>>>> -Each WATCH_DATA record specifies a registered watch and is formatted as
>>>>> -follows:
>>>>> +For live update the image format will contain a `CONNECTION_DATA` record for
>>>>> +each connection to xenstore. For migration it will only contain a record for
>>>>> +the domain being migrated.
>>>>>
>>>>>
>>>>>     ```
>>>>> -    0       1       2       3     octet
>>>>> -+-------+-------+-------+-------+
>>>>> -| WATCH_DATA                    |
>>>>> -+-------------------------------+
>>>>> -| wpath length                  |
>>>>> -+-------------------------------+
>>>>> -| wpath data                    |
>>>>> -...
>>>>> -| pad (0 to 3 octets)           |
>>>>> -+-------------------------------+
>>>>> +    0       1       2       3       4       5       6       7    octet
>>>>> ++-------+-------+-------+-------+-------+-------+-------+-------+
>>>>> +| conn-id                       | pad                           |
>>>>> ++---------------+-----------------------------------------------+
>>>>> +| conn-type     | conn-spec
>>>>>     ...
>>>>
>>>> I asked whether it wouldn't be better to drop the pad and move conn-type
>>>> and a 2-byte (unified) flag field at its position. This together ...
>>>>
>>>>> ++-------------------------------+-------------------------------+
>>>>> +| data-len                      | data
>>>>>     +-------------------------------+
>>>>> -| token length                  |
>>>>> -+-------------------------------+
>>>>> -| token data                    |
>>>>>     ...
>>>>> -| pad (0 to 3 octets)           |
>>>>> -+-------------------------------+
>>>>>     ```
>>>>>
>>>>> -wpath length and token length are specified in octets (excluding the NUL
>>>>> -terminator). The wpath should be as described for the `WATCH` operation in
>>>>> -[2]. The token is an arbitrary string of octets not containing any NUL
>>>>> -values.
>>>>>
>>>>> +| Field       | Description                                     |
>>>>> +|-------------|-------------------------------------------------|
>>>>> +| `conn-id`   | A non-zero number used to identify this         |
>>>>> +|             | connection in subsequent connection-specific    |
>>>>> +|             | records                                         |
>>>>> +|             |                                                 |
>>>>> +| `conn-type` | 0x0000: shared ring                             |
>>>>> +|             | 0x0001: socket                                  |
>>>>> +|             |                                                 |
>>>>> +| `conn-spec` | See below                                       |
>>>>> +|             |                                                 |
>>>>> +| `data-len`  | The length (in octets) of any pending data not  |
>>>>> +|             | yet written to the connection                   |
>>>>> +|             |                                                 |
>>>>> +| `data`      | Pending data (may be empty)                     |
>>>>>
>>>>> -**TRANSACTION_DATA**
>>>>> +The format of `conn-spec` is dependent upon `conn-type`.
>>>>>
>>>>> +\pagebreak
>>>>>
>>>>> -Each TRANSACTION_DATA record specifies an open transaction and is formatted
>>>>> -as follows:
>>>>> +For `shared ring` connections it is as follows:
>>>>>
>>>>>
>>>>>     ```
>>>>> -    0       1       2       3     octet
>>>>> -+-------+-------+-------+-------+
>>>>> -| TRANSACTION_DATA              |
>>>>> -+-------------------------------+
>>>>> -| tx_id                         |
>>>>> -+-------------------------------+
>>>>> +    0       1       2       3       4       5       6       7    octet
>>>>> +                +-------+-------+-------+-------+-------+-------+
>>>>> +                | domid         | tdomid        | flags         |
>>>>> ++---------------+---------------+---------------+---------------+
>>>>> +| revtchn                       | levtchn                       |
>>>>> ++-------------------------------+-------------------------------+
>>>>> +| mfn                                                           |
>>>>> ++---------------------------------------------------------------+
>>>>
>>>> ... with dropping levtchn (which isn't needed IMO) will make it much
>>>> easier to have a union in C (which needs to be aligned to 8 bytes
>>>> and have a length of a multiple of 8 bytes due to mfn).
>>>>
>>>> So something like:
>>>>
>>>> struct xs_state_connection {
>>>>        uint32_t conn_id;
>>>>        uint16_t conn_type;
>>>> #define XS_STATE_CONN_TYPE_RING   0
>>>> #define XS_STATE_CONN_TYPE_SOCKET 1
>>>>        uint16_t flags;
>>>> #define XS_STATE_CONN_INTRODUCED  0x0001
>>>> #define XS_STATE_CONN_RELEASED    0x0002
>>>> #define XS_STATE_CONN_READONLY    0x0004
>>>>        union {
>>>>            struct {
>>>>                uint16_t domid;
>>>>                uint16_t tdomid;
>>>> #define XS_STATE_DOMID_INVALID  0xffffU
>>>>                uint32_t evtchn;
>>>>                uint64_t mfn;
>>>> #define XS_STATE_MFN_INVALID    0xffffffffffffffffUL
>>>>            } ring;
>>>>            int32_t socket_fd;
>>>>        } spec;
>>>>        uint32_t data_out_len;
>>>>        uint8_t  data[];
>>>> };
>>>
>>> The issue is making sure that the mfn is properly aligned. If I can drop the levtchn then this gets
>> easier.
>>>
>>>>
>>>>>     ```
>>>>>
>>>>> -where tx_id is the non-zero identifier values of an open transaction.
>>>>> -
>>>>>
>>>>> -### Protocol Extension
>>>>> +| Field      | Description                                      |
>>>>> +|------------|--------------------------------------------------|
>>>>> +| `domid`    | The domain-id that owns the shared page          |
>>>>> +|            |                                                  |
>>>>> +| `tdomid`   | The domain-id that `domid` acts on behalf of if  |
>>>>> +|            | it has been subject to an SET_TARGET             |
>>>>> +|            | operation [2] or DOMID_INVALID otherwise         |
>>>>
>>>> DOMID_INVALID needs to be defined (or we need a reference where it is
>>>> coming from).
>>>
>>> OK. It's in a public header... I'll reference it.
>>>
>>>>
>>>>> +|            |                                                  |
>>>>> +| `flags`    | A bit-wise OR of:                                |
>>>>> +|            | 0x0001: INTRODUCE has been issued                |
>>
>> Just realized, I think we can drop those flags.
>>
>> Reasoning: if INTRODUCE hasn't been issued, there can't be an active
>> connection to Xenstore for that domain, as Xenstore doesn't know about
>> the parameters to connect (especially the event channel is missing).
>>
>>>>> +|            | 0x0002: RELEASE has been issued                  |
>>
>> And the same applies here: RELEASE will drop the connection to the
>> domain, so it can't appear in a connection record.
>>
> 
> I think the presence of the RESUME command in xenstore.txt makes it non-obvious that we can forget about a domain once RELEASE has been called. The text there does say:
> 
> "It is not clear whether this is possible since one would
> normally expect a domain not to be restarted after being shut
> down without being destroyed in the meantime.  There are
> currently no users of this request in xen-unstable."
> 
> So, perhaps this would be the time to remove RESUME from the spec?

+1


Juergen
diff mbox series

Patch

diff --git a/docs/designs/xenstore-migration.md b/docs/designs/xenstore-migration.md
index 6ab351e8fe..815aaeee59 100644
--- a/docs/designs/xenstore-migration.md
+++ b/docs/designs/xenstore-migration.md
@@ -3,254 +3,403 @@ 
 ## Background
 
 The design for *Non-Cooperative Migration of Guests*[1] explains that extra
-save records are required in the migrations stream to allow a guest running
-PV drivers to be migrated without its co-operation. Moreover the save
-records must include details of registered xenstore watches as well as
-content; information that cannot currently be recovered from `xenstored`,
-and hence some extension to the xenstore protocol[2] will also be required.
-
-The *libxenlight Domain Image Format* specification[3] already defines a
-record type `EMULATOR_XENSTORE_DATA` but this is not suitable for
-transferring xenstore data pertaining to the domain directly as it is
-specified such that keys are relative to the path
-`/local/domain/$dm_domid/device-model/$domid`. Thus it is necessary to
-define at least one new save record type.
+save records are required in the migrations stream to allow a guest running PV
+drivers to be migrated without its co-operation. Moreover the save records must
+include details of registered xenstore watches as well ascontent; information
+that cannot currently be recovered from `xenstored`, and hence some extension
+to the xenstored implementations will also be required.
+
+As a similar set of data is needed for transferring xenstore data from one
+instance to another when live updating xenstored this document proposes an
+image format for a 'migration stream' suitable for both purposes.
 
 ## Proposal
 
-### New Save Record
+The image format consists of a _header_ followed by 1 or more _records_. Each
+record consists of a type and length field, followed by any data mandated by
+the record type. At minimum there will be a single record of type `END`
+(defined below).
 
-A new mandatory record type should be defined within the libxenlight Domain
-Image Format:
+### Header
 
-`0x00000007: DOMAIN_XENSTORE_DATA`
+The header identifies the stream as a `xenstore` stream, including the version
+of the specification that it complies with.
 
-An arbitrary number of these records may be present in the migration
-stream and may appear in any order. The format of each record should be as
-follows:
+All fields in this header must be in _big-endian_ byte order, regardless of
+the setting of the endianness bit.
 
 
 ```
     0       1       2       3       4       5       6       7    octet
 +-------+-------+-------+-------+-------+-------+-------+-------+
-| type                          | record specific data          |
-+-------------------------------+                               |
-...
-+---------------------------------------------------------------+
+| ident                                                         |
++-------------------------------+-------------------------------|
+| version                       | flags                         |
++-------------------------------+-------------------------------+
 ```
 
-where type is one of the following values
 
+| Field     | Description                                       |
+|-----------|---------------------------------------------------|
+| `ident`   | 0x78656e73746f7265 ('xenstore' in ASCII)          |
+|           |                                                   |
+| `version` | 0x00000001 (the version of the specification)     |
+|           |                                                   |
+| `flags`   | 0 (LSB): Endianness: 0 = little, 1 = big          |
+|           |                                                   |
+|           | 1-31: Reserved (must be zero)                     |
 
-| Field  | Description                                      |
-|--------|--------------------------------------------------|
-| `type` | 0x00000000: invalid                              |
-|        | 0x00000001: NODE_DATA                            |
-|        | 0x00000002: WATCH_DATA                           |
-|        | 0x00000003: TRANSACTION_DATA                     |
-|        | 0x00000004 - 0xFFFFFFFF: reserved for future use |
+### Records
 
+Records immediately follow the header and have the following format:
 
-and data is one of the record data formats described in the following
-sections.
 
+```
+    0       1       2       3       4       5       6       7    octet
++-------+-------+-------+-------+-------+-------+-------+-------+
+| type                          | len                           |
++-------------------------------+-------------------------------+
+| body
+...
+|       | padding (0 to 7 octets)                               |
++-------+-------------------------------------------------------+
+```
+
+NOTE: padding octets here and in all subsequent format specifications must be
+      written as zero and should be ignored when the stream is read.
 
-NOTE: The record data does not contain an overall length because the
-libxenlight record header specifies the length.
 
+| Field  | Description                                          |
+|--------|------------------------------------------------------|
+| `type` | 0x00000000: END                                      |
+|        | 0x00000001: GLOBAL_DATA                              |
+|        | 0x00000002: CONNECTION_DATA                          |
+|        | 0x00000003: WATCH_DATA                               |
+|        | 0x00000004: TRANSACTION_DATA                         |
+|        | 0x00000005: NODE_DATA                                |
+|        | 0x00000006 - 0xFFFFFFFF: reserved for future use     |
+|        |                                                      |
+| `len`  | The length (in octets) of `body`                     |
+|        |                                                      |
+| `body` | The type-specific record data                        |
 
-**NODE_DATA**
+Some records will depend on other records in the migration stream. Records
+upon which other records depend must always appear earlier in the stream.
 
+The various formats of the type-specific data are described in the following
+sections:
 
-Each NODE_DATA record specifies a single node in xenstore and is formatted
-as follows:
+\pagebreak
 
+### END
+
+The end record marks the end of the image, and is the final record
+in the stream.
 
 ```
-    0       1       2       3     octet
-+-------+-------+-------+-------+
-| NODE_DATA                     |
-+-------------------------------+
-| path length                   |
-+-------------------------------+
-| path data                     |
-...
-| pad (0 to 3 octets)           |
-+-------------------------------+
-| perm count (N)                |
-+-------------------------------+
-| perm0                         |
-+-------------------------------+
-...
-+-------------------------------+
-| permN                         |
-+-------------------------------+
-| value length                  |
-+-------------------------------+
-| value data                    |
-...
-| pad (0 to 3 octets)           |
-+-------------------------------+
+    0       1       2       3       4       5       6       7    octet
++-------+-------+-------+-------+-------+-------+-------+-------+
 ```
 
-where perm0..N are formatted as follows:
 
+The end record contains no fields; its body length is 0.
+
+\pagebreak
+
+### GLOBAL_DATA
+
+This record is only relevant for live update. It contains details of global
+xenstored state that needs to be restored.
 
 ```
-    0       1       2       3     octet
+    0       1       2       3    octet
 +-------+-------+-------+-------+
-| perm  | pad   | domid         |
+| rw-socket-fd                  |
++-------------------------------+
+| ro-socket-fd                  |
 +-------------------------------+
 ```
 
 
-path length and value length are specified in octets (excluding the NUL
-terminator of the path). perm should be one of the ASCII values `w`, `r`,
-`b` or `n` as described in [2]. All pad values should be 0.
-All paths should be absolute (i.e. start with `/`) and as described in
-[2].
+| Field          | Description                                  |
+|----------------|----------------------------------------------|
+| `rw-socket-fd` | The file descriptor of the socket accepting  |
+|                | read-write connections                       |
+|                |                                              |
+| `ro-socket-fd` | The file descriptor of the socket accepting  |
+|                | read-only connections                        |
+
+xenstored will resume in the original process context. Hence `rw-socket-fd` and
+`ro-socket-fd` simply specify the file descriptors of the sockets.
 
 
-**WATCH_DATA**
+\pagebreak
 
+### CONNECTION_DATA
 
-Each WATCH_DATA record specifies a registered watch and is formatted as
-follows:
+For live update the image format will contain a `CONNECTION_DATA` record for
+each connection to xenstore. For migration it will only contain a record for
+the domain being migrated.
 
 
 ```
-    0       1       2       3     octet
-+-------+-------+-------+-------+
-| WATCH_DATA                    |
-+-------------------------------+
-| wpath length                  |
-+-------------------------------+
-| wpath data                    |
-...
-| pad (0 to 3 octets)           |
-+-------------------------------+
+    0       1       2       3       4       5       6       7    octet
++-------+-------+-------+-------+-------+-------+-------+-------+
+| conn-id                       | pad                           |
++---------------+-----------------------------------------------+
+| conn-type     | conn-spec
 ...
++-------------------------------+-------------------------------+
+| data-len                      | data
 +-------------------------------+
-| token length                  |
-+-------------------------------+
-| token data                    |
 ...
-| pad (0 to 3 octets)           |
-+-------------------------------+
 ```
 
-wpath length and token length are specified in octets (excluding the NUL
-terminator). The wpath should be as described for the `WATCH` operation in
-[2]. The token is an arbitrary string of octets not containing any NUL
-values.
 
+| Field       | Description                                     |
+|-------------|-------------------------------------------------|
+| `conn-id`   | A non-zero number used to identify this         |
+|             | connection in subsequent connection-specific    |
+|             | records                                         |
+|             |                                                 |
+| `conn-type` | 0x0000: shared ring                             |
+|             | 0x0001: socket                                  |
+|             |                                                 |
+| `conn-spec` | See below                                       |
+|             |                                                 |
+| `data-len`  | The length (in octets) of any pending data not  |
+|             | yet written to the connection                   |
+|             |                                                 |
+| `data`      | Pending data (may be empty)                     |
 
-**TRANSACTION_DATA**
+The format of `conn-spec` is dependent upon `conn-type`.
 
+\pagebreak
 
-Each TRANSACTION_DATA record specifies an open transaction and is formatted
-as follows:
+For `shared ring` connections it is as follows:
 
 
 ```
-    0       1       2       3     octet
-+-------+-------+-------+-------+
-| TRANSACTION_DATA              |
-+-------------------------------+
-| tx_id                         |
-+-------------------------------+
+    0       1       2       3       4       5       6       7    octet
+                +-------+-------+-------+-------+-------+-------+
+                | domid         | tdomid        | flags         |
++---------------+---------------+---------------+---------------+
+| revtchn                       | levtchn                       |
++-------------------------------+-------------------------------+
+| mfn                                                           |
++---------------------------------------------------------------+
 ```
 
-where tx_id is the non-zero identifier values of an open transaction.
-
 
-### Protocol Extension
+| Field      | Description                                      |
+|------------|--------------------------------------------------|
+| `domid`    | The domain-id that owns the shared page          |
+|            |                                                  |
+| `tdomid`   | The domain-id that `domid` acts on behalf of if  |
+|            | it has been subject to an SET_TARGET             |
+|            | operation [2] or DOMID_INVALID otherwise         |
+|            |                                                  |
+| `flags`    | A bit-wise OR of:                                |
+|            | 0x0001: INTRODUCE has been issued                |
+|            | 0x0002: RELEASE has been issued                  |
+|            |                                                  |
+| `revtchn`  | The port number of the interdomain channel used  |
+|            | by `domid` to communicate with xenstored         |
+|            |                                                  |
+| `levtchn`  | For a live update this will be the port number   |
+|            | of the interdomain channel used by xenstored     |
+|            | itself otherwise, for migration, it will be -1   |
+|            |                                                  |
+| `mfn`      | The MFN of the shared page for a live update or  |
+|            | INVALID_MFN otherwise                            |
+
+Since the ABI guarantees that entry 1 in `domid`'s grant table will always
+contain the GFN of the shared page, so for a live update `mfn` can be used to
+give confidence that `domid` has not been re-cycled during the update.
+
+
+For `socket` connections it is as follows:
 
-Before xenstore state is migrated it is necessary to wait for any pending
-reads, writes, watch registrations etc. to complete, and also to make sure
-that xenstored does not start processing any new requests (so that new
-requests remain pending on the shared ring for subsequent processing on the
-new host). Hence the following operation is needed:
 
 ```
-QUIESCE                 <domid>|
-
-Complete processing of any request issued by the specified domain, and
-do not process any further requests from the shared ring.
+    0       1       2       3       4       5       6       7    octet
+                +-------+-------+-------+-------+-------+-------+
+                | flags         | socket-fd                     |
+                +---------------+-------------------------------+
 ```
 
-The `WATCH` operation does not allow specification of a `<domid>`; it is
-assumed that the watch pertains to the domain that owns the shared ring
-over which the operation is passed. Hence, for the tool-stack to be able
-to register a watch on behalf of a domain a new operation is needed:
 
-```
-ADD_DOMAIN_WATCHES      <domid>|<watch>|+
+| Field       | Description                                     |
+|-------------|-------------------------------------------------|
+| `flags`     | A bit-wise OR of:                               |
+|             | 0001: read-only                                 |
+|             |                                                 |
+| `socket-fd` | The file descriptor of the connected socket     |
 
-Adds watches on behalf of the specified domain.
+This type of connection is only relevant for live update, where the xenstored
+resumes in the original process context. Hence `socket-fd` simply specify
+the file descriptor of the socket connection.
 
-<watch> is a NUL separated tuple of <path>|<token>. The semantics of this
-operation are identical to the domain issuing WATCH <path>|<token>| for
-each <watch>.
-```
+\pagebreak
+
+### WATCH_DATA
+
+The image format will contain a `WATCH_DATA` record for each watch registered
+by a connection for which there is `CONNECTION_DATA` record previously present.
 
-The watch information for a domain also needs to be extracted from the
-sending xenstored so the following operation is also needed:
 
 ```
-GET_DOMAIN_WATCHES      <domid>|<index>   <gencnt>|<watch>|*
+    0       1       2       3    octet
++-------+-------+-------+-------+
+| conn-id                       |
++---------------+---------------+
+| wpath-len     | token-len     |
++---------------+---------------+
+| wpath
+...
+| token
+...
+```
+
 
-Gets the list of watches that are currently registered for the domain.
+| Field       | Description                                     |
+|-------------|-------------------------------------------------|
+| `conn-id`   | The connection that issued the `WATCH`          |
+|             | operation [2]                                   |
+|             |                                                 |
+| `wpath-len` | The length (in octets) of `wpath` including the |
+|             | NUL terminator                                  |
+|             |                                                 |
+| `token-len` | The length (in octets) of `token` including the |
+|             | NUL terminator                                  |
+|             |                                                 |
+| `wpath`     | The watch path, as specified in the `WATCH`     |
+|             | operation                                       |
+|             |                                                 |
+| `token`     | The watch identifier token, as specified in the |
+|             | `WATCH` operation                               |
+
+\pagebreak
+
+### TRANSACTION_DATA
+
+The image format will contain a `TRANSACTION_DATA` record for each transaction
+that is pending on a connection for which there is `CONNECTION_DATA` record
+previously present.
 
-<watch> is a NUL separated tuple of <path>|<token>. The sub-list returned
-will start at <index> items into the the overall list of watches and may
-be truncated (at a <watch> boundary) such that the returned data fits
-within XENSTORE_PAYLOAD_MAX.
 
-If <index> is beyond the end of the overall list then the returned sub-
-list will be empty. If the value of <gencnt> changes then it indicates
-that the overall watch list has changed and thus it may be necessary
-to re-issue the operation for previous values of <index>.
 ```
+    0       1       2       3    octet
++-------+-------+-------+-------+
+| conn-id                       |
++-------------------------------+
+| tx-id                         |
++-------------------------------+
+```
+
+
+| Field          | Description                                  |
+|----------------|----------------------------------------------|
+| `conn-id`      | The connection that issued the               |
+|                | `TRANSACTION_START` operation [2]            |
+|                |                                              |
+| `tx-id`        | The transaction id passed back to the domain |
+|                | by the `TRANSACTION_START` operation         |
+
+\pagebreak
 
-To deal with transactions that were pending when the domain is migrated
-it is necessary to start transactions with the same tx_id on behalf of the
-domain in the receiving xenstored.
+### NODE_DATA
 
-NOTE: For safety each such transaction should result in an `EAGAIN` when
-the `TRANSACTION_END` operation is performed, as modifications made under
-the tx_id will not be part of the migration stream.
+For live update the image format will contain a `NODE_DATA` record for each
+node in xenstore. For migration it will only contain a record for the nodes
+relating to the domain being migrated. The `NODE_DATA` may be related to
+a _committed_ node (globally visible in xenstored) or a _pending_ node (created
+or modified by a transaction for which there is also a `TRANSACTION_DATA`
+record previously present).
 
-The `TRANSACTION_START` operation does not allow specification of a
-`<domid>`; it is assumed that the transaction pertains to the domain that
-owns the shared ring over which the operation is passed. Neither does it
-allow a `<transid>` to be specified; it is always chosen by xenstored.
-Hence, for the tool-stack to be able to open a transaction on behalf of a
-domain a new operation is needed:
 
 ```
-START_DOMAIN_TRANSACTION    <domid>|<transid>|
+    0       1       2       3    octet
++-------+-------+-------+-------+
+| conn-id                       |
++-------------------------------+
+| tx-id                         |
++---------------+---------------+
+| access        | perm-count    |
++---------------+---------------+
+| perm1                         |
++-------------------------------+
+...
++-------------------------------+
+| permN                         |
++---------------+---------------+
+| path-len      | value-len     |
++---------------+---------------+
+| path
+...
+| value
+...
+```
+
+
+| Field        | Description                                    |
+|--------------|------------------------------------------------|
+| `conn-id`    | If this value is non-zero then this record     |
+|              | related to a pending transaction               |
+|              |                                                |
+| `tx-id`      | This value should be ignored if `conn-id` is   |
+|              | zero. Otherwise it specifies the id of the     |
+|              | pending transaction                            |
+|              |                                                |
+| `access`     | This value should be ignored if this record    |
+|              | does not relate to a pending transaction,      |
+|              | otherwise it specifies the accesses made to    |
+|              | the node and hence is a bitwise OR of:         |
+|              |                                                |
+|              | 0x0001: read                                   |
+|              | 0x0002: written                                |
+|              |                                                |
+|              | The value will be zero for a deleted node      |
+|              |                                                |
+| `perm-count` | The number (N) of node permission specifiers   |
+|              | (which will be 0 for a node deleted in a       |
+|              | pending transaction)                           |
+|              |                                                |
+| `perm1..N`   | A list of zero or more node permission         |
+|              | specifiers (see below)                         |
+|              |                                                |
+| `path-len`   | The length (in octets) of `path` including the |
+|              | NUL terminator                                 |
+|              |                                                |
+| `value-len`  | The length (in octets) of `value` (which will  |
+|              | be zero for a deleted node)                    |
+|              |                                                |
+| `path`       | The absolute path of the node                  |
+|              |                                                |
+| `value`      | The node value (which may be empty or contain  |
+|              | NUL octets)                                    |
+
+
+A node permission specifier has the following format:
 
-Starts a transaction on behalf of a domain.
 
-The semantics of this are similar to the domain issuing
-TRANSACTION_START and receiving the specified <transid> as the response.
-The main difference is that the transaction will be immediately marked as
-'conflicting' such that when the domain issues TRANSACTION_END T|, it will
-result in EAGAIN.
+```
+    0       1       2       3    octet
++-------+-------+-------+-------+
+| perm  | pad   | domid         |
++-------+-------+---------------+
 ```
 
-It may also be desirable to state in the protocol specification that
-the `INTRODUCE` operation should not clear the `<gfn>` specified such that
-a `RELEASE` operation followed by an `INTRODUCE` operation form an
-idempotent pair. The current implementation of *C xentored* does this
-(in the `domain_conn_reset()` function) but this could be dropped as this
-behaviour is not currently specified and the page will always be zeroed
-for a newly created domain.
+| Field   | Description                                         |
+|---------|-----------------------------------------------------|
+| `perm`  | One of the ASCII values `w`, `r`, `b` or `n` as     |
+|         | specified for the `SET_PERMS` operation [2]         |
+|         |                                                     |
+| `domid` | The domain-id to which the permission relates       |
 
 
 * * *
 
 [1] See https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=docs/designs/non-cooperative-migration.md
+
 [2] See https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=docs/misc/xenstore.txt
-[3] See https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=docs/specs/libxl-migration-stream.pandoc