Message ID | 20190514080558.14540-1-olaf@aepfle.de (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [v5] libxl: fix migration of PV and PVH domUs with and without qemu | expand |
Am Tue, 14 May 2019 10:05:58 +0200 schrieb Olaf Hering <olaf@aepfle.de>: > @@ -459,7 +461,9 @@ int libxl__domain_resume(libxl__gc *gc, uint32_t domid, int suspend_cancel) > goto out; > } > > - if (type == LIBXL_DOMAIN_TYPE_HVM) { > + if (type == LIBXL_DOMAIN_TYPE_HVM || > + libxl__device_model_version_running(gc, domid) == > + LIBXL_DEVICE_MODEL_VERSION_QEMU_XEN) { > rc = libxl__domain_resume_device_model(gc, domid); > if (rc) { > LOGD(ERROR, domid, "failed to resume device model:%d", rc); I think this could be done like that instead, so that libxl__device_model_version_running is called just once: --- a/tools/libxl/libxl_dom_suspend.c +++ b/tools/libxl/libxl_dom_suspend.c @@ -444,6 +444,8 @@ int libxl__domain_resume_device_model(libxl__gc *gc, uint32_t domid) if (libxl__qmp_resume(gc, domid)) return ERROR_FAIL; break; + case LIBXL_DEVICE_MODEL_VERSION_NONE: + break; default: return ERROR_INVAL; } @@ -461,14 +463,10 @@ int libxl__domain_resume(libxl__gc *gc, uint32_t domid, int suspend_cancel) goto out; } - if (type == LIBXL_DOMAIN_TYPE_HVM || - libxl__device_model_version_running(gc, domid) == - LIBXL_DEVICE_MODEL_VERSION_QEMU_XEN) { - rc = libxl__domain_resume_device_model(gc, domid); - if (rc) { - LOGD(ERROR, domid, "failed to resume device model:%d", rc); - goto out; - } + rc = libxl__domain_resume_device_model(gc, domid); + if (rc) { + LOGD(ERROR, domid, "failed to resume device model:%d", rc); + goto out; } if (xc_domain_resume(CTX->xch, domid, suspend_cancel)) { While it is easy for the resume path, doing the same in the suspend path needs more changes. libxl__domain_suspend_device_model would need to receive the callback and set it if a device model is active. Should this be done on top of this change? Olaf
On Tue, May 14, 2019 at 10:05:58AM +0200, Olaf Hering wrote: > If a domU has a qemu-xen instance attached, it is required to call qemus > "xen-save-devices-state" method. Without it, the receiving side of a PV or > PVH migration may be unable to lock the image: > > xen be: qdisk-51712: xen be: qdisk-51712: error: Failed to get "write" lock > error: Failed to get "write" lock > xen be: qdisk-51712: xen be: qdisk-51712: initialise() failed > initialise() failed > > To fix this bug, libxl__domain_suspend_device_model() and > libxl__domain_resume_device_model() have to be called not only for HVM, > but also if the active device_model is QEMU_XEN. > > Unfortunately, libxl__domain_build_info_setdefault() used to hardcode > b_info->device_model_version to QEMU_XEN if it does not know it any > better. As a result libxl__device_model_version_running() will return > incorrect values. This breaks domUs without a device_model. > libxl__qmp_stop() would wait 10 seconds in qmp_open() for a qemu that > will never appear. During this long timeframe the domU remains in state > paused on the sending side. As a result network connections may be > dropped. Once this bug is fixed as well, by just removing the assumption > that every domU has a QEMU_XEN, there is no code to actually initialise > b_info->device_model_version. > > There is a helper function libxl__need_xenpv_qemu(), which is used in > various places to decide if a device_model has to be spawned. This > function can not be used as is, just to fill device_model_version, > because store_libxl_entry() was already called earlier. > > Introduce LIBXL_DEVICE_MODEL_VERSION_NONE for PV and PVH that have no > need for a device_model to make the state explicit. Indicate this new > state via LIBXL_HAVE macro in libxl.h. > > v05: > - move initialization of device_model_version to extra commit > - return error from libxl__need_xenpv_qemu > - add LIBXL_HAVE_DEVICE_MODEL_VERSION_NONE > v04: > - make sure device_model_stubdomain is initialized > v03: > - rearrange code to make sure device_model_version is initialized before > store_libxl_entry() is called > v02: > - update wording in a comment > - remove stale goto in domcreate_launch_dm > - initialize ret in libxl__need_xenpv_qemu > > Signed-off-by: Olaf Hering <olaf@aepfle.de> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Thanks, Roger.
On Tue, May 14, 2019 at 10:14:52AM +0200, Olaf Hering wrote: > Am Tue, 14 May 2019 10:05:58 +0200 > schrieb Olaf Hering <olaf@aepfle.de>: > > > @@ -459,7 +461,9 @@ int libxl__domain_resume(libxl__gc *gc, uint32_t domid, int suspend_cancel) > > goto out; > > } > > > > - if (type == LIBXL_DOMAIN_TYPE_HVM) { > > + if (type == LIBXL_DOMAIN_TYPE_HVM || > > + libxl__device_model_version_running(gc, domid) == > > + LIBXL_DEVICE_MODEL_VERSION_QEMU_XEN) { > > rc = libxl__domain_resume_device_model(gc, domid); > > if (rc) { > > LOGD(ERROR, domid, "failed to resume device model:%d", rc); > > I think this could be done like that instead, so that libxl__device_model_version_running > is called just once: > > --- a/tools/libxl/libxl_dom_suspend.c > +++ b/tools/libxl/libxl_dom_suspend.c > @@ -444,6 +444,8 @@ int libxl__domain_resume_device_model(libxl__gc *gc, uint32_t domid) > if (libxl__qmp_resume(gc, domid)) > return ERROR_FAIL; > break; > + case LIBXL_DEVICE_MODEL_VERSION_NONE: > + break; > default: > return ERROR_INVAL; > } > @@ -461,14 +463,10 @@ int libxl__domain_resume(libxl__gc *gc, uint32_t domid, int suspend_cancel) > goto out; > } > > - if (type == LIBXL_DOMAIN_TYPE_HVM || > - libxl__device_model_version_running(gc, domid) == > - LIBXL_DEVICE_MODEL_VERSION_QEMU_XEN) { > - rc = libxl__domain_resume_device_model(gc, domid); > - if (rc) { > - LOGD(ERROR, domid, "failed to resume device model:%d", rc); > - goto out; > - } > + rc = libxl__domain_resume_device_model(gc, domid); > + if (rc) { > + LOGD(ERROR, domid, "failed to resume device model:%d", rc); > + goto out; > } > > if (xc_domain_resume(CTX->xch, domid, suspend_cancel)) { > Yeah, from the look of it this is definitely better. > > While it is easy for the resume path, doing the same in the suspend path > needs more changes. libxl__domain_suspend_device_model would need to receive > the callback and set it if a device model is active. What do you mean here? Can't you handle the NONE case just like you do in the resume function? Wei.
Am Tue, 14 May 2019 15:38:55 +0100 schrieb Wei Liu <wei.liu2@citrix.com>: > > While it is easy for the resume path, doing the same in the suspend path > > needs more changes. libxl__domain_suspend_device_model would need to receive > > the callback and set it if a device model is active. > > What do you mean here? Can't you handle the NONE case just like you do > in the resume function? It means calling libxl__domain_suspend_device_model unconditionally, and let that function decide what to do. Maybe there is no downside to set the callback unconditionally, I will check that. Olaf
On Tue, May 14, 2019 at 04:44:19PM +0200, Olaf Hering wrote: > Am Tue, 14 May 2019 15:38:55 +0100 > schrieb Wei Liu <wei.liu2@citrix.com>: > > > > While it is easy for the resume path, doing the same in the suspend path > > > needs more changes. libxl__domain_suspend_device_model would need to receive > > > the callback and set it if a device model is active. > > > > What do you mean here? Can't you handle the NONE case just like you do > > in the resume function? > > It means calling libxl__domain_suspend_device_model unconditionally, and > let that function decide what to do. Maybe there is no downside to set the > callback unconditionally, I will check that. Sure. In the meantime I will apply this patch at some point to fix the bug first. Wei. > > Olaf
diff --git a/tools/libxl/libxl.h b/tools/libxl/libxl.h index 482499a6c0..e0f6916b66 100644 --- a/tools/libxl/libxl.h +++ b/tools/libxl/libxl.h @@ -1182,6 +1182,13 @@ void libxl_mac_copy(libxl_ctx *ctx, libxl_mac *dst, const libxl_mac *src); */ #define LIBXL_HAVE_PVCALLS 1 +/* + * LIBXL_HAVE_DEVICE_MODEL_VERSION_NONE + * + * If this is defined, libxl will only run a device-model if required. + */ +#define LIBXL_HAVE_DEVICE_MODEL_VERSION_NONE 1 + typedef char **libxl_string_list; void libxl_string_list_dispose(libxl_string_list *sl); int libxl_string_list_length(const libxl_string_list *sl); diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c index 3f0431cc84..64336b0d29 100644 --- a/tools/libxl/libxl_create.c +++ b/tools/libxl/libxl_create.c @@ -47,9 +47,20 @@ int libxl__domain_set_device_model(libxl__gc *gc, libxl_domain_config *d_config) } break; default: - b_info->device_model_version = - LIBXL_DEVICE_MODEL_VERSION_QEMU_XEN_TRADITIONAL; - break; + ret = libxl__need_xenpv_qemu(gc, d_config); + switch (ret) { + case 1: + d_config->b_info.device_model_version = + LIBXL_DEVICE_MODEL_VERSION_QEMU_XEN; + break; + case 0: + d_config->b_info.device_model_version = + LIBXL_DEVICE_MODEL_VERSION_NONE; + break; + default: + LOGE(ERROR, "Unable to determine QEMU requisite"); + return ret; + } } if (b_info->device_model_version == LIBXL_DEVICE_MODEL_VERSION_QEMU_XEN) { diff --git a/tools/libxl/libxl_dom_suspend.c b/tools/libxl/libxl_dom_suspend.c index d1af3a6573..c492fe5dd1 100644 --- a/tools/libxl/libxl_dom_suspend.c +++ b/tools/libxl/libxl_dom_suspend.c @@ -379,7 +379,9 @@ static void domain_suspend_common_guest_suspended(libxl__egc *egc, libxl__ev_xswatch_deregister(gc, &dsps->guest_watch); libxl__ev_time_deregister(gc, &dsps->guest_timeout); - if (dsps->type == LIBXL_DOMAIN_TYPE_HVM) { + if (dsps->type == LIBXL_DOMAIN_TYPE_HVM || + libxl__device_model_version_running(gc, dsps->domid) == + LIBXL_DEVICE_MODEL_VERSION_QEMU_XEN) { dsps->callback_device_model_done = domain_suspend_common_done; libxl__domain_suspend_device_model(egc, dsps); /* must be last */ return; @@ -459,7 +461,9 @@ int libxl__domain_resume(libxl__gc *gc, uint32_t domid, int suspend_cancel) goto out; } - if (type == LIBXL_DOMAIN_TYPE_HVM) { + if (type == LIBXL_DOMAIN_TYPE_HVM || + libxl__device_model_version_running(gc, domid) == + LIBXL_DEVICE_MODEL_VERSION_QEMU_XEN) { rc = libxl__domain_resume_device_model(gc, domid); if (rc) { LOGD(ERROR, domid, "failed to resume device model:%d", rc); diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl index cb4702fd7a..75bde095bc 100644 --- a/tools/libxl/libxl_types.idl +++ b/tools/libxl/libxl_types.idl @@ -106,6 +106,7 @@ libxl_device_model_version = Enumeration("device_model_version", [ (0, "UNKNOWN"), (1, "QEMU_XEN_TRADITIONAL"), # Historical qemu-xen device model (qemu-dm) (2, "QEMU_XEN"), # Upstream based qemu-xen device model + (3, "NONE"), ]) libxl_console_type = Enumeration("console_type", [
If a domU has a qemu-xen instance attached, it is required to call qemus "xen-save-devices-state" method. Without it, the receiving side of a PV or PVH migration may be unable to lock the image: xen be: qdisk-51712: xen be: qdisk-51712: error: Failed to get "write" lock error: Failed to get "write" lock xen be: qdisk-51712: xen be: qdisk-51712: initialise() failed initialise() failed To fix this bug, libxl__domain_suspend_device_model() and libxl__domain_resume_device_model() have to be called not only for HVM, but also if the active device_model is QEMU_XEN. Unfortunately, libxl__domain_build_info_setdefault() used to hardcode b_info->device_model_version to QEMU_XEN if it does not know it any better. As a result libxl__device_model_version_running() will return incorrect values. This breaks domUs without a device_model. libxl__qmp_stop() would wait 10 seconds in qmp_open() for a qemu that will never appear. During this long timeframe the domU remains in state paused on the sending side. As a result network connections may be dropped. Once this bug is fixed as well, by just removing the assumption that every domU has a QEMU_XEN, there is no code to actually initialise b_info->device_model_version. There is a helper function libxl__need_xenpv_qemu(), which is used in various places to decide if a device_model has to be spawned. This function can not be used as is, just to fill device_model_version, because store_libxl_entry() was already called earlier. Introduce LIBXL_DEVICE_MODEL_VERSION_NONE for PV and PVH that have no need for a device_model to make the state explicit. Indicate this new state via LIBXL_HAVE macro in libxl.h. v05: - move initialization of device_model_version to extra commit - return error from libxl__need_xenpv_qemu - add LIBXL_HAVE_DEVICE_MODEL_VERSION_NONE v04: - make sure device_model_stubdomain is initialized v03: - rearrange code to make sure device_model_version is initialized before store_libxl_entry() is called v02: - update wording in a comment - remove stale goto in domcreate_launch_dm - initialize ret in libxl__need_xenpv_qemu Signed-off-by: Olaf Hering <olaf@aepfle.de> Cc: Roger Pau Monné <roger.pau@citrix.com> Cc: Anthony PERARD <anthony.perard@citrix.com> --- tools/libxl/libxl.h | 7 +++++++ tools/libxl/libxl_create.c | 17 ++++++++++++++--- tools/libxl/libxl_dom_suspend.c | 8 ++++++-- tools/libxl/libxl_types.idl | 1 + 4 files changed, 28 insertions(+), 5 deletions(-)