diff mbox series

[v4,2/2] docs/designs: Add a design document for migration of xenstore data

Message ID 20200129144702.1543-3-pdurrant@amazon.com (mailing list archive)
State Superseded
Headers show
Series docs: Migration design documents | expand

Commit Message

Paul Durrant Jan. 29, 2020, 2:47 p.m. UTC
This patch details proposes extra migration data and xenstore protocol
extensions to support non-cooperative live migration of guests.

Signed-off-by: Paul Durrant <pdurrant@amazon.com>
---
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: George Dunlap <George.Dunlap@eu.citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Julien Grall <julien@xen.org>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Wei Liu <wl@xen.org>

v4:
 - Drop the restrictions on special paths

v3:
 - New in v3
---
 docs/designs/xenstore-migration.md | 121 +++++++++++++++++++++++++++++
 1 file changed, 121 insertions(+)
 create mode 100644 docs/designs/xenstore-migration.md

Comments

Marek Marczykowski-Górecki Jan. 29, 2020, 9:23 p.m. UTC | #1
On Wed, Jan 29, 2020 at 02:47:02PM +0000, Paul Durrant wrote:

<snip>

> +**node data**
> +
> +
> +`<path>|<value>|<perm-as-string>|`
> +
> +
> +`<path>` is considered relative to the domain path `/local/domain/$domid`
> +and hence must not begin with `/`.

How backend settings are going to be serialized? For example vif backend
has a bunch of feature-* entries, which should not change under the
guest feet during non-cooperative migration.

> +`<path>` and `<value>` should be suitable to formulate a `WRITE` operation
> +to the receiving xenstore and `<perm-as-string>` should be similarly suitable
> +to formulate a subsequent `SET_PERMS` operation.
> +
> +**watch data**
> +
> +
> +`<path>|<token>|`
> +
> +`<path>` again is considered relative and, together with `<token>`, should
> +be suitable to formulate an `ADD_DOMAIN_WATCHES` operation (see below).
> +
> +
> +### Protocol Extension
> +
> +The `WATCH` operation does not allow specification of a `<domid>`; it is
> +assumed that the watch pertains to the domain that owns the shared ring
> +over which the operation is passed. Hence, for the tool-stack to be able
> +to register a watch on behalf of a domain a new operation is needed:
> +
> +```
> +ADD_DOMAIN_WATCHES      <domid>|<watch>|+
> +
> +Adds watches on behalf of the specified domain.
> +
> +<watch> is a NUL separated tuple of <path>|<token>. The semantics of this
> +operation are identical to the domain issuing WATCH <path>|<token>| for
> +each <watch>.
> +```

Normal WATCH operation triggers an event immediately. Is it intended in
this case too? On the other hand, guest should cope with spurious watch
events, so probably not an issue.

> +The watch information for a domain also needs to be extracted from the
> +sending xenstored so the following operation is also needed:
> +
> +```
> +GET_DOMAIN_WATCHES      <domid>|<index>   <gencnt>|<watch>|* 
> +
> +Gets the list of watches that are currently registered for the domain.
> +
> +<watch> is a NUL separated tuple of <path>|<token>. The sub-list returned
> +will start at <index> into the the overall list of watches and may be
> +truncated such that the returned data fits within XENSTORE_PAYLOAD_MAX.
> +If <index> is beyond the end of the overall list then the returned sub-
> +list will be empty. If the value of <gencnt> changes then it indicates
> +that the overall watch list has changed and thus it may be necessary
> +to re-issue the operation for previous values of <index>.
> +```

In what units <index> is expressed? bytes? entries?
Can the response be truncated at arbitrary place, or only between
records?
Durrant, Paul Jan. 30, 2020, 8:45 a.m. UTC | #2
> -----Original Message-----
> From: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
> Sent: 29 January 2020 21:23
> To: Durrant, Paul <pdurrant@amazon.co.uk>
> Cc: xen-devel@lists.xenproject.org; Stefano Stabellini
> <sstabellini@kernel.org>; Julien Grall <julien@xen.org>; Wei Liu
> <wl@xen.org>; Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>; George
> Dunlap <George.Dunlap@eu.citrix.com>; Andrew Cooper
> <andrew.cooper3@citrix.com>; Ian Jackson <ian.jackson@eu.citrix.com>
> Subject: Re: [Xen-devel] [PATCH v4 2/2] docs/designs: Add a design
> document for migration of xenstore data
> 
> On Wed, Jan 29, 2020 at 02:47:02PM +0000, Paul Durrant wrote:
> 
> <snip>
> 
> > +**node data**
> > +
> > +
> > +`<path>|<value>|<perm-as-string>|`
> > +
> > +
> > +`<path>` is considered relative to the domain path
> `/local/domain/$domid`
> > +and hence must not begin with `/`.
> 
> How backend settings are going to be serialized?

They're not going to be. The toolstack will construct new backends; co-operation will be required from them in that they must support re-binding (which the latest versions of blkback and netback do). We will need to white-list backends that are known to do this and refuse non-cooperative migration if any other backend is use.

> For example vif backend
> has a bunch of feature-* entries, which should not change under the
> guest feet during non-cooperative migration.
> 

The frontend will normally come back in 'connected' state, in which case a change in any feature flags will be irrelevant until the next (if any) re-connection (since the protocols only sample them at that point). If the frontend is not connected then it will sample the feature flags in the usual way. If the frontend/backend are in transition then the locking in the backend should prevent the 'unbind' from completing until some level of stability has been reached.

> > +`<path>` and `<value>` should be suitable to formulate a `WRITE`
> operation
> > +to the receiving xenstore and `<perm-as-string>` should be similarly
> suitable
> > +to formulate a subsequent `SET_PERMS` operation.
> > +
> > +**watch data**
> > +
> > +
> > +`<path>|<token>|`
> > +
> > +`<path>` again is considered relative and, together with `<token>`,
> should
> > +be suitable to formulate an `ADD_DOMAIN_WATCHES` operation (see below).
> > +
> > +
> > +### Protocol Extension
> > +
> > +The `WATCH` operation does not allow specification of a `<domid>`; it
> is
> > +assumed that the watch pertains to the domain that owns the shared ring
> > +over which the operation is passed. Hence, for the tool-stack to be
> able
> > +to register a watch on behalf of a domain a new operation is needed:
> > +
> > +```
> > +ADD_DOMAIN_WATCHES      <domid>|<watch>|+
> > +
> > +Adds watches on behalf of the specified domain.
> > +
> > +<watch> is a NUL separated tuple of <path>|<token>. The semantics of
> this
> > +operation are identical to the domain issuing WATCH <path>|<token>| for
> > +each <watch>.
> > +```
> 
> Normal WATCH operation triggers an event immediately. Is it intended in
> this case too? On the other hand, guest should cope with spurious watch
> events, so probably not an issue.

I don't think it matters if one is triggered or not. As you say, the watch protocol allows them to spuriously fire so the guest should cope either way.

> 
> > +The watch information for a domain also needs to be extracted from the
> > +sending xenstored so the following operation is also needed:
> > +
> > +```
> > +GET_DOMAIN_WATCHES      <domid>|<index>   <gencnt>|<watch>|*
> > +
> > +Gets the list of watches that are currently registered for the domain.
> > +
> > +<watch> is a NUL separated tuple of <path>|<token>. The sub-list
> returned
> > +will start at <index> into the the overall list of watches and may be
> > +truncated such that the returned data fits within XENSTORE_PAYLOAD_MAX.
> > +If <index> is beyond the end of the overall list then the returned sub-
> > +list will be empty. If the value of <gencnt> changes then it indicates
> > +that the overall watch list has changed and thus it may be necessary
> > +to re-issue the operation for previous values of <index>.
> > +```
> 
> In what units <index> is expressed? bytes? entries?

I thought entries was reasonably well implied since I refer to a list, but I can make it explicit.

> Can the response be truncated at arbitrary place, or only between
> records?
> 

Only between records. Again I'll add some words to make that clear.

  Paul

> --
> Best Regards,
> Marek Marczykowski-Górecki
> Invisible Things Lab
> A: Because it messes up the order in which people normally read text.
> Q: Why is top-posting such a bad thing?
Marek Marczykowski-Górecki Jan. 30, 2020, 10:27 a.m. UTC | #3
On Thu, Jan 30, 2020 at 08:45:49AM +0000, Durrant, Paul wrote:
> > -----Original Message-----
> > From: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
> > Sent: 29 January 2020 21:23
> > To: Durrant, Paul <pdurrant@amazon.co.uk>
> > Cc: xen-devel@lists.xenproject.org; Stefano Stabellini
> > <sstabellini@kernel.org>; Julien Grall <julien@xen.org>; Wei Liu
> > <wl@xen.org>; Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>; George
> > Dunlap <George.Dunlap@eu.citrix.com>; Andrew Cooper
> > <andrew.cooper3@citrix.com>; Ian Jackson <ian.jackson@eu.citrix.com>
> > Subject: Re: [Xen-devel] [PATCH v4 2/2] docs/designs: Add a design
> > document for migration of xenstore data
> > 
> > On Wed, Jan 29, 2020 at 02:47:02PM +0000, Paul Durrant wrote:
> > 
> > <snip>
> > 
> > > +**node data**
> > > +
> > > +
> > > +`<path>|<value>|<perm-as-string>|`
> > > +
> > > +
> > > +`<path>` is considered relative to the domain path
> > `/local/domain/$domid`
> > > +and hence must not begin with `/`.
> > 
> > How backend settings are going to be serialized?
> 
> They're not going to be. The toolstack will construct new backends; co-operation will be required from them in that they must support re-binding (which the latest versions of blkback and netback do). We will need to white-list backends that are known to do this and refuse non-cooperative migration if any other backend is use.
> 
> > For example vif backend
> > has a bunch of feature-* entries, which should not change under the
> > guest feet during non-cooperative migration.
> > 
> 
> The frontend will normally come back in 'connected' state, in which case a change in any feature flags will be irrelevant until the next (if any) re-connection (since the protocols only sample them at that point). If the frontend is not connected then it will sample the feature flags in the usual way. If the frontend/backend are in transition then the locking in the backend should prevent the 'unbind' from completing until some level of stability has been reached.

Yes, but regardless of xenstore entries, those do represent what backend
can and is willing to do. If some feature disappears (in actual backend,
not only its xenstore representation) during non-cooperative migration,
I guess frontent won't be happy about that.

How are they related to feature-* entries in frontent? Are frontend
features separate set, or a confirmation what backend's features will be
used?
diff mbox series

Patch

diff --git a/docs/designs/xenstore-migration.md b/docs/designs/xenstore-migration.md
new file mode 100644
index 0000000000..991236e201
--- /dev/null
+++ b/docs/designs/xenstore-migration.md
@@ -0,0 +1,121 @@ 
+# Xenstore Migration
+
+## Background
+
+The design for *Non-Cooperative Migration of Guests*[1] explains that extra
+save records are required in the migrations stream to allow a guest running
+PV drivers to be migrated without its co-operation. Moreover the save
+records must include details of registered xenstore watches as well as
+content; information that cannot currently be recovered from `xenstored`,
+and hence some extension to the xenstore protocol[2] will also be required.
+
+The *libxenlight Domain Image Format* specification[3] already defines a
+record type `EMULATOR_XENSTORE_DATA` but this is not suitable for
+transferring xenstore data pertaining to the domain directly as it is
+specified such that keys are relative to the path
+`/local/domain/$dm_domid/device-model/$domid`. Thus it is necessary to
+define at least one new save record type.
+
+## Proposal
+
+### New Save Record
+
+A new mandatory record type should be defined within the libxenlight Domain
+Image Format:
+
+`0x00000007: DOMAIN_XENSTORE_DATA`
+
+The format of each of these new records should be as follows:
+
+
+```
+0     1     2     3     4     5     6     7 octet
++------------------------+------------------------+
+| type                   | record specific data   |
++------------------------+                        |
+...
++-------------------------------------------------+
+```
+
+
+| Field | Description |
+|---|---|
+| `type` | 0x00000000: invalid |
+|        | 0x00000001: node data |
+|        | 0x00000002: watch data |
+|        | 0x00000003 - 0xFFFFFFFF: reserved for future use |
+
+
+where data is always in the form of a NUL separated and terminated tuple
+as follows
+
+
+**node data**
+
+
+`<path>|<value>|<perm-as-string>|`
+
+
+`<path>` is considered relative to the domain path `/local/domain/$domid`
+and hence must not begin with `/`.
+`<path>` and `<value>` should be suitable to formulate a `WRITE` operation
+to the receiving xenstore and `<perm-as-string>` should be similarly suitable
+to formulate a subsequent `SET_PERMS` operation.
+
+**watch data**
+
+
+`<path>|<token>|`
+
+`<path>` again is considered relative and, together with `<token>`, should
+be suitable to formulate an `ADD_DOMAIN_WATCHES` operation (see below).
+
+
+### Protocol Extension
+
+The `WATCH` operation does not allow specification of a `<domid>`; it is
+assumed that the watch pertains to the domain that owns the shared ring
+over which the operation is passed. Hence, for the tool-stack to be able
+to register a watch on behalf of a domain a new operation is needed:
+
+```
+ADD_DOMAIN_WATCHES      <domid>|<watch>|+
+
+Adds watches on behalf of the specified domain.
+
+<watch> is a NUL separated tuple of <path>|<token>. The semantics of this
+operation are identical to the domain issuing WATCH <path>|<token>| for
+each <watch>.
+```
+
+The watch information for a domain also needs to be extracted from the
+sending xenstored so the following operation is also needed:
+
+```
+GET_DOMAIN_WATCHES      <domid>|<index>   <gencnt>|<watch>|* 
+
+Gets the list of watches that are currently registered for the domain.
+
+<watch> is a NUL separated tuple of <path>|<token>. The sub-list returned
+will start at <index> into the the overall list of watches and may be
+truncated such that the returned data fits within XENSTORE_PAYLOAD_MAX.
+If <index> is beyond the end of the overall list then the returned sub-
+list will be empty. If the value of <gencnt> changes then it indicates
+that the overall watch list has changed and thus it may be necessary
+to re-issue the operation for previous values of <index>.
+```
+
+It may also be desirable to state in the protocol specification that
+the `INTRODUCE` operation should not clear the `<mfn>` specified such that
+a `RELEASE` operation followed by an `INTRODUCE` operation form an
+idempotent pair. The current implementation of *C xentored* does this
+(in the `domain_conn_reset()` function) but this could be dropped as this
+behaviour is not currently specified and the page will always be zeroed
+for a newly created domain.
+
+
+* * *
+
+[1] See https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=docs/designs/non-cooperative-migration.md
+[2] See https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=docs/misc/xenstore.txt
+[3] See https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=docs/specs/libxl-migration-stream.pandoc