Message ID | 20200213105325.3022-3-pdurrant@amazon.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | docs: Migration design documents | expand |
Hi Paul, On 13/02/2020 10:53, Paul Durrant wrote: > This patch details proposes extra migration data and xenstore protocol > extensions to support non-cooperative live migration of guests. > > Signed-off-by: Paul Durrant <pdurrant@amazon.com> > --- > Cc: Andrew Cooper <andrew.cooper3@citrix.com> > Cc: George Dunlap <George.Dunlap@eu.citrix.com> > Cc: Ian Jackson <ian.jackson@eu.citrix.com> > Cc: Jan Beulich <jbeulich@suse.com> > Cc: Julien Grall <julien@xen.org> > Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> > Cc: Stefano Stabellini <sstabellini@kernel.org> > Cc: Wei Liu <wl@xen.org> > > v5: > - Add QUIESCE > - Make semantics of <index> in GET_DOMAIN_WATCHES more clear > > v4: > - Drop the restrictions on special paths > > v3: > - New in v3 > --- > docs/designs/xenstore-migration.md | 136 +++++++++++++++++++++++++++++ > 1 file changed, 136 insertions(+) > create mode 100644 docs/designs/xenstore-migration.md > > diff --git a/docs/designs/xenstore-migration.md b/docs/designs/xenstore-migration.md > new file mode 100644 > index 0000000000..5cfe2d9a7d > --- /dev/null > +++ b/docs/designs/xenstore-migration.md > @@ -0,0 +1,136 @@ > +# Xenstore Migration > + > +## Background > + > +The design for *Non-Cooperative Migration of Guests*[1] explains that extra > +save records are required in the migrations stream to allow a guest running > +PV drivers to be migrated without its co-operation. Moreover the save > +records must include details of registered xenstore watches as well as > +content; information that cannot currently be recovered from `xenstored`, > +and hence some extension to the xenstore protocol[2] will also be required. > + > +The *libxenlight Domain Image Format* specification[3] already defines a > +record type `EMULATOR_XENSTORE_DATA` but this is not suitable for > +transferring xenstore data pertaining to the domain directly as it is > +specified such that keys are relative to the path > +`/local/domain/$dm_domid/device-model/$domid`. Thus it is necessary to > +define at least one new save record type. > + > +## Proposal > + > +### New Save Record > + > +A new mandatory record type should be defined within the libxenlight Domain > +Image Format: > + > +`0x00000007: DOMAIN_XENSTORE_DATA` > + > +The format of each of these new records should be as follows: > + > + > +``` > +0 1 2 3 4 5 6 7 octet > ++------------------------+------------------------+ > +| type | record specific data | > ++------------------------+ | > +... > ++-------------------------------------------------+ > +``` > + > + > +| Field | Description | > +|---|---| Did you indend to add more - so | is on the same column as the onter lines? > +| `type` | 0x00000000: invalid | > +| | 0x00000001: node data | > +| | 0x00000002: watch data | Should not the last | be some of the columns on all the lines? > +| | 0x00000003 - 0xFFFFFFFF: reserved for future use | Looking at the spec, the command TRANSACTION_END *must* be used with an existing transaction. As a guest would be migrate to a new domain, the transaction ID would now be invalid. I understand that xenstored is able to cope with it, but such behavior is not described in the spec. So I am not sure we can expect a guest to cope with an error value other than the ones described for the command. > + > + > +where data is always in the form of a NUL separated and terminated tuple > +as follows > + > + > +**node data** > + > + > +`<path>|<value>|<perm-as-string>|` I don't think this would work. From the spec, <value> is a binary data and therefore it can contain zero or nul. So you would not be able to find out where the <perm-as-string> starts. Regarding the <perm-as-string>, it is only describing the permission for one domain. If multiple domains can access the node, then you would have multiple <perm-as-string>. Do we want to transfer all the permissions, if not how do we define which permissions should be transferred? > + > + > +`<path>` is considered relative to the domain path `/local/domain/$domid` > +and hence must not begin with `/`. > +`<path>` and `<value>` should be suitable to formulate a `WRITE` operation > +to the receiving xenstore and `<perm-as-string>` should be similarly suitable > +to formulate a subsequent `SET_PERMS` operation. > + > +**watch data** > + > + > +`<path>|<token>|` > + > +`<path>` again is considered relative and, together with `<token>`, should > +be suitable to formulate an `ADD_DOMAIN_WATCHES` operation (see below). AFAICT, a guest is allowed to watch /. So is it a sensible thing to only transfer relative watch? Also, how about special watch (i.e @...)? > + > + > +### Protocol Extension > + > +Before xenstore state is migrated it is necessary to wait for any pending > +reads, writes, watch registrations etc. to complete, and also to make sure > +that xenstored does not start processing any new requests (so that new > +requests remain pending on the shared ring for subsequent processing on the > +new host). Hence the following operation is needed: > + > +``` > +QUIESCE <domid>| > + > +Complete processing of any request issued by the specified domain, and > +do not process any further requests from the shared ring. > +``` > + > +The `WATCH` operation does not allow specification of a `<domid>`; it is > +assumed that the watch pertains to the domain that owns the shared ring > +over which the operation is passed. Hence, for the tool-stack to be able > +to register a watch on behalf of a domain a new operation is needed: > + > +``` > +ADD_DOMAIN_WATCHES <domid>|<watch>|+ > + > +Adds watches on behalf of the specified domain. > + > +<watch> is a NUL separated tuple of <path>|<token>. The semantics of this > +operation are identical to the domain issuing WATCH <path>|<token>| for > +each <watch>. > +``` > + > +The watch information for a domain also needs to be extracted from the > +sending xenstored so the following operation is also needed: > + > +``` > +GET_DOMAIN_WATCHES <domid>|<index> <gencnt>|<watch>|* > + > +Gets the list of watches that are currently registered for the domain. > + > +<watch> is a NUL separated tuple of <path>|<token>. The sub-list returned > +will start at <index> items into the the overall list of watches and may > +be truncated (at a <watch> boundary) such that the returned data fits > +within XENSTORE_PAYLOAD_MAX. > + > +If <index> is beyond the end of the overall list then the returned sub- > +list will be empty. If the value of <gencnt> changes then it indicates > +that the overall watch list has changed and thus it may be necessary > +to re-issue the operation for previous values of <index>. > +``` > + > +It may also be desirable to state in the protocol specification that > +the `INTRODUCE` operation should not clear the `<mfn>` specified such that Not directly related to this patch, the '<mfn>' is slightly confusing because, AFAICT, this will actually hold an GFN. To avoid spreading more misuse, it would make sense to update the xenstore accordingly and use the new term here. > +a `RELEASE` operation followed by an `INTRODUCE` operation form an > +idempotent pair. The current implementation of *C xentored* does this > +(in the `domain_conn_reset()` function) but this could be dropped as this > +behaviour is not currently specified and the page will always be zeroed > +for a newly created domain. > + > + > +* * * > + > +[1] See https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=docs/designs/non-cooperative-migration.md > +[2] See https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=docs/misc/xenstore.txt > +[3] See https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=docs/specs/libxl-migration-stream.pandoc > Cheers,
> -----Original Message----- > From: Julien Grall <julien@xen.org> > Sent: 04 March 2020 18:32 > To: Durrant, Paul <pdurrant@amazon.co.uk>; xen-devel@lists.xenproject.org > Cc: Andrew Cooper <andrew.cooper3@citrix.com>; George Dunlap <George.Dunlap@eu.citrix.com>; Ian > Jackson <ian.jackson@eu.citrix.com>; Jan Beulich <jbeulich@suse.com>; Konrad Rzeszutek Wilk > <konrad.wilk@oracle.com>; Stefano Stabellini <sstabellini@kernel.org>; Wei Liu <wl@xen.org> > Subject: RE: [EXTERNAL][PATCH v5 2/2] docs/designs: Add a design document for migration of xenstore > data > > CAUTION: This email originated from outside of the organization. Do not click links or open > attachments unless you can confirm the sender and know the content is safe. > > > > Hi Paul, > > On 13/02/2020 10:53, Paul Durrant wrote: > > This patch details proposes extra migration data and xenstore protocol > > extensions to support non-cooperative live migration of guests. > > > > Signed-off-by: Paul Durrant <pdurrant@amazon.com> > > --- > > Cc: Andrew Cooper <andrew.cooper3@citrix.com> > > Cc: George Dunlap <George.Dunlap@eu.citrix.com> > > Cc: Ian Jackson <ian.jackson@eu.citrix.com> > > Cc: Jan Beulich <jbeulich@suse.com> > > Cc: Julien Grall <julien@xen.org> > > Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> > > Cc: Stefano Stabellini <sstabellini@kernel.org> > > Cc: Wei Liu <wl@xen.org> > > > > v5: > > - Add QUIESCE > > - Make semantics of <index> in GET_DOMAIN_WATCHES more clear > > > > v4: > > - Drop the restrictions on special paths > > > > v3: > > - New in v3 > > --- > > docs/designs/xenstore-migration.md | 136 +++++++++++++++++++++++++++++ > > 1 file changed, 136 insertions(+) > > create mode 100644 docs/designs/xenstore-migration.md > > > > diff --git a/docs/designs/xenstore-migration.md b/docs/designs/xenstore-migration.md > > new file mode 100644 > > index 0000000000..5cfe2d9a7d > > --- /dev/null > > +++ b/docs/designs/xenstore-migration.md > > @@ -0,0 +1,136 @@ > > +# Xenstore Migration > > + > > +## Background > > + > > +The design for *Non-Cooperative Migration of Guests*[1] explains that extra > > +save records are required in the migrations stream to allow a guest running > > +PV drivers to be migrated without its co-operation. Moreover the save > > +records must include details of registered xenstore watches as well as > > +content; information that cannot currently be recovered from `xenstored`, > > +and hence some extension to the xenstore protocol[2] will also be required. > > + > > +The *libxenlight Domain Image Format* specification[3] already defines a > > +record type `EMULATOR_XENSTORE_DATA` but this is not suitable for > > +transferring xenstore data pertaining to the domain directly as it is > > +specified such that keys are relative to the path > > +`/local/domain/$dm_domid/device-model/$domid`. Thus it is necessary to > > +define at least one new save record type. > > + > > +## Proposal > > + > > +### New Save Record > > + > > +A new mandatory record type should be defined within the libxenlight Domain > > +Image Format: > > + > > +`0x00000007: DOMAIN_XENSTORE_DATA` > > + > > +The format of each of these new records should be as follows: > > + > > + > > +``` > > +0 1 2 3 4 5 6 7 octet > > ++------------------------+------------------------+ > > +| type | record specific data | > > ++------------------------+ | > > +... > > ++-------------------------------------------------+ > > +``` > > + > > + > > +| Field | Description | > > +|---|---| > > Did you indend to add more - so | is on the same column as the onter lines? > Yep, cut'n'paste error. > > +| `type` | 0x00000000: invalid | > > +| | 0x00000001: node data | > > +| | 0x00000002: watch data | > > Should not the last | be some of the columns on all the lines? > > > +| | 0x00000003 - 0xFFFFFFFF: reserved for future use | > > Looking at the spec, the command TRANSACTION_END *must* be used with an > existing transaction. As a guest would be migrate to a new domain, the > transaction ID would now be invalid. > > I understand that xenstored is able to cope with it, but such behavior > is not described in the spec. So I am not sure we can expect a guest to > cope with an error value other than the ones described for the command. > And (as we discussed offline) there would be an issue if the migrated guest started a new transaction before completing one that was started pre-migration, as the ids may clash. So, we are going to need a record to transfer open transaction ids so that we can reserve them in the receiving xenstored. > > + > > + > > +where data is always in the form of a NUL separated and terminated tuple > > +as follows > > + > > + > > +**node data** > > + > > + > > +`<path>|<value>|<perm-as-string>|` > > I don't think this would work. From the spec, <value> is a binary data > and therefore it can contain zero or nul. So you would not be able to > find out where the <perm-as-string> starts. > > Regarding the <perm-as-string>, it is only describing the permission for > one domain. If multiple domains can access the node, then you would have > multiple <perm-as-string>. Do we want to transfer all the permissions, > if not how do we define which permissions should be transferred? Yes this should cope with multiple perms and binary data, even though I think we don't necessarily need it in the normal case. > > > + > > + > > +`<path>` is considered relative to the domain path `/local/domain/$domid` > > +and hence must not begin with `/`. > > +`<path>` and `<value>` should be suitable to formulate a `WRITE` operation > > +to the receiving xenstore and `<perm-as-string>` should be similarly suitable > > +to formulate a subsequent `SET_PERMS` operation. > > + > > +**watch data** > > + > > + > > +`<path>|<token>|` > > + > > +`<path>` again is considered relative and, together with `<token>`, should > > +be suitable to formulate an `ADD_DOMAIN_WATCHES` operation (see below). > > AFAICT, a guest is allowed to watch /. So is it a sensible thing to only > transfer relative watch? > > Also, how about special watch (i.e @...)? I guess we need to cope with whatever a guest is allowed to register... which appears to be anything. > > > + > > + > > +### Protocol Extension > > + > > +Before xenstore state is migrated it is necessary to wait for any pending > > +reads, writes, watch registrations etc. to complete, and also to make sure > > +that xenstored does not start processing any new requests (so that new > > +requests remain pending on the shared ring for subsequent processing on the > > +new host). Hence the following operation is needed: > > + > > +``` > > +QUIESCE <domid>| > > + > > +Complete processing of any request issued by the specified domain, and > > +do not process any further requests from the shared ring. > > +``` > > + > > +The `WATCH` operation does not allow specification of a `<domid>`; it is > > +assumed that the watch pertains to the domain that owns the shared ring > > +over which the operation is passed. Hence, for the tool-stack to be able > > +to register a watch on behalf of a domain a new operation is needed: > > + > > +``` > > +ADD_DOMAIN_WATCHES <domid>|<watch>|+ > > + > > +Adds watches on behalf of the specified domain. > > + > > +<watch> is a NUL separated tuple of <path>|<token>. The semantics of this > > +operation are identical to the domain issuing WATCH <path>|<token>| for > > +each <watch>. > > +``` > > + > > +The watch information for a domain also needs to be extracted from the > > +sending xenstored so the following operation is also needed: > > + > > +``` > > +GET_DOMAIN_WATCHES <domid>|<index> <gencnt>|<watch>|* > > + > > +Gets the list of watches that are currently registered for the domain. > > + > > +<watch> is a NUL separated tuple of <path>|<token>. The sub-list returned > > +will start at <index> items into the the overall list of watches and may > > +be truncated (at a <watch> boundary) such that the returned data fits > > +within XENSTORE_PAYLOAD_MAX. > > + > > +If <index> is beyond the end of the overall list then the returned sub- > > +list will be empty. If the value of <gencnt> changes then it indicates > > +that the overall watch list has changed and thus it may be necessary > > +to re-issue the operation for previous values of <index>. > > +``` > > + > > +It may also be desirable to state in the protocol specification that > > +the `INTRODUCE` operation should not clear the `<mfn>` specified such that > > Not directly related to this patch, the '<mfn>' is slightly confusing > because, AFAICT, this will actually hold an GFN. To avoid spreading more > misuse, it would make sense to update the xenstore accordingly and use > the new term here. > Ok, I can add a small patch to modify the doc. Paul > > +a `RELEASE` operation followed by an `INTRODUCE` operation form an > > +idempotent pair. The current implementation of *C xentored* does this > > +(in the `domain_conn_reset()` function) but this could be dropped as this > > +behaviour is not currently specified and the page will always be zeroed > > +for a newly created domain. > > + > > + > > +* * * > > + > > +[1] See https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=docs/designs/non-cooperative- > migration.md > > +[2] See https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=docs/misc/xenstore.txt > > +[3] See https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=docs/specs/libxl-migration-stream.pandoc > > > > Cheers, > > -- > Julien Grall
diff --git a/docs/designs/xenstore-migration.md b/docs/designs/xenstore-migration.md new file mode 100644 index 0000000000..5cfe2d9a7d --- /dev/null +++ b/docs/designs/xenstore-migration.md @@ -0,0 +1,136 @@ +# Xenstore Migration + +## Background + +The design for *Non-Cooperative Migration of Guests*[1] explains that extra +save records are required in the migrations stream to allow a guest running +PV drivers to be migrated without its co-operation. Moreover the save +records must include details of registered xenstore watches as well as +content; information that cannot currently be recovered from `xenstored`, +and hence some extension to the xenstore protocol[2] will also be required. + +The *libxenlight Domain Image Format* specification[3] already defines a +record type `EMULATOR_XENSTORE_DATA` but this is not suitable for +transferring xenstore data pertaining to the domain directly as it is +specified such that keys are relative to the path +`/local/domain/$dm_domid/device-model/$domid`. Thus it is necessary to +define at least one new save record type. + +## Proposal + +### New Save Record + +A new mandatory record type should be defined within the libxenlight Domain +Image Format: + +`0x00000007: DOMAIN_XENSTORE_DATA` + +The format of each of these new records should be as follows: + + +``` +0 1 2 3 4 5 6 7 octet ++------------------------+------------------------+ +| type | record specific data | ++------------------------+ | +... ++-------------------------------------------------+ +``` + + +| Field | Description | +|---|---| +| `type` | 0x00000000: invalid | +| | 0x00000001: node data | +| | 0x00000002: watch data | +| | 0x00000003 - 0xFFFFFFFF: reserved for future use | + + +where data is always in the form of a NUL separated and terminated tuple +as follows + + +**node data** + + +`<path>|<value>|<perm-as-string>|` + + +`<path>` is considered relative to the domain path `/local/domain/$domid` +and hence must not begin with `/`. +`<path>` and `<value>` should be suitable to formulate a `WRITE` operation +to the receiving xenstore and `<perm-as-string>` should be similarly suitable +to formulate a subsequent `SET_PERMS` operation. + +**watch data** + + +`<path>|<token>|` + +`<path>` again is considered relative and, together with `<token>`, should +be suitable to formulate an `ADD_DOMAIN_WATCHES` operation (see below). + + +### Protocol Extension + +Before xenstore state is migrated it is necessary to wait for any pending +reads, writes, watch registrations etc. to complete, and also to make sure +that xenstored does not start processing any new requests (so that new +requests remain pending on the shared ring for subsequent processing on the +new host). Hence the following operation is needed: + +``` +QUIESCE <domid>| + +Complete processing of any request issued by the specified domain, and +do not process any further requests from the shared ring. +``` + +The `WATCH` operation does not allow specification of a `<domid>`; it is +assumed that the watch pertains to the domain that owns the shared ring +over which the operation is passed. Hence, for the tool-stack to be able +to register a watch on behalf of a domain a new operation is needed: + +``` +ADD_DOMAIN_WATCHES <domid>|<watch>|+ + +Adds watches on behalf of the specified domain. + +<watch> is a NUL separated tuple of <path>|<token>. The semantics of this +operation are identical to the domain issuing WATCH <path>|<token>| for +each <watch>. +``` + +The watch information for a domain also needs to be extracted from the +sending xenstored so the following operation is also needed: + +``` +GET_DOMAIN_WATCHES <domid>|<index> <gencnt>|<watch>|* + +Gets the list of watches that are currently registered for the domain. + +<watch> is a NUL separated tuple of <path>|<token>. The sub-list returned +will start at <index> items into the the overall list of watches and may +be truncated (at a <watch> boundary) such that the returned data fits +within XENSTORE_PAYLOAD_MAX. + +If <index> is beyond the end of the overall list then the returned sub- +list will be empty. If the value of <gencnt> changes then it indicates +that the overall watch list has changed and thus it may be necessary +to re-issue the operation for previous values of <index>. +``` + +It may also be desirable to state in the protocol specification that +the `INTRODUCE` operation should not clear the `<mfn>` specified such that +a `RELEASE` operation followed by an `INTRODUCE` operation form an +idempotent pair. The current implementation of *C xentored* does this +(in the `domain_conn_reset()` function) but this could be dropped as this +behaviour is not currently specified and the page will always be zeroed +for a newly created domain. + + +* * * + +[1] See https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=docs/designs/non-cooperative-migration.md +[2] See https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=docs/misc/xenstore.txt +[3] See https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=docs/specs/libxl-migration-stream.pandoc
This patch details proposes extra migration data and xenstore protocol extensions to support non-cooperative live migration of guests. Signed-off-by: Paul Durrant <pdurrant@amazon.com> --- Cc: Andrew Cooper <andrew.cooper3@citrix.com> Cc: George Dunlap <George.Dunlap@eu.citrix.com> Cc: Ian Jackson <ian.jackson@eu.citrix.com> Cc: Jan Beulich <jbeulich@suse.com> Cc: Julien Grall <julien@xen.org> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Wei Liu <wl@xen.org> v5: - Add QUIESCE - Make semantics of <index> in GET_DOMAIN_WATCHES more clear v4: - Drop the restrictions on special paths v3: - New in v3 --- docs/designs/xenstore-migration.md | 136 +++++++++++++++++++++++++++++ 1 file changed, 136 insertions(+) create mode 100644 docs/designs/xenstore-migration.md