libxl: invert xc and domain model resume calls in xc_domain_resume()

Message ID 20161128135357.5543-1-cbosdonnat@suse.com (mailing list archive)
State New, archived

Commit Message

Cedric Bosdonnat Nov. 28, 2016, 1:53 p.m. UTC
Resume is sometimes silently failing for HVM guests. Getting the
xc_domain_resume() and libxl__domain_resume_device_model() in the
reverse order than what is in the suspend code fixes the problem.

Signed-off-by: Cédric Bosdonnat <cbosdonnat@suse.com>
---
 tools/libxl/libxl_dom_suspend.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)
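
To make the ordering argument concrete, here is a minimal sketch in C that only records call order. The step names (suspend_domain, stop_device_model, resume_domain, resume_device_model) and the helpers are hypothetical stand-ins, not the libxl or libxc API; the suspend order shown is the one described in the review comments, and the "after patch" resume order is its mirror image.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins: these helpers only record the order of steps.
 * They do not call libxl, libxc or QEMU. */
static char call_log[128];

static void reset_log(void)
{
    call_log[0] = '\0';
}

static void record(const char *step)
{
    /* Append ",step" (or just "step" for the first entry), never overflowing. */
    if (call_log[0] != '\0')
        strncat(call_log, ",", sizeof(call_log) - strlen(call_log) - 1);
    strncat(call_log, step, sizeof(call_log) - strlen(call_log) - 1);
}

/* Suspend as described in the thread: the domain is suspended first,
 * the device model is stopped afterwards. */
static const char *suspend_order(void)
{
    reset_log();
    record("suspend_domain");
    record("stop_device_model");
    return call_log;
}

/* Resume before the patch: same order as suspend, so the guest can run
 * while the device model is still stopped. */
static const char *resume_order_before_patch(void)
{
    reset_log();
    record("resume_domain");
    record("resume_device_model");
    return call_log;
}

/* Resume after the patch: the reverse of suspend, so the device model
 * is running again before the guest is resumed. */
static const char *resume_order_after_patch(void)
{
    reset_log();
    record("resume_device_model");
    record("resume_domain");
    return call_log;
}
```

Each function returns a pointer to a shared static buffer, so the result should be printed or copied before the next call.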

Comments

Wei Liu Nov. 29, 2016, 7:34 a.m. UTC | #1
On Mon, Nov 28, 2016 at 02:53:57PM +0100, Cédric Bosdonnat wrote:
> Resume is sometimes silently failing for HVM guests. Getting the
> xc_domain_resume() and libxl__domain_resume_device_model() in the
> reverse order than what is in the suspend code fixes the problem.
> 
> Signed-off-by: Cédric Bosdonnat <cbosdonnat@suse.com>
 
I think it would be nice to explain why reversing the order fixes the
problem for you. My guess is because device model needs to be ready when
the guest runs, but I'm not fully convinced by this explanation --
guests should just be trapped in the hypervisor waiting for device model
to come up.

I also CC'ed other people who are more familiar with this area so that
they can provide insight.
 
Wei.
Jürgen Groß Nov. 29, 2016, 8:47 a.m. UTC | #2
On 29/11/16 08:34, Wei Liu wrote:
> On Mon, Nov 28, 2016 at 02:53:57PM +0100, Cédric Bosdonnat wrote:
>> Resume is sometimes silently failing for HVM guests. Getting the
>> xc_domain_resume() and libxl__domain_resume_device_model() in the
>> reverse order than what is in the suspend code fixes the problem.
>>
>> Signed-off-by: Cédric Bosdonnat <cbosdonnat@suse.com>
>  
> I think it would be nice to explain why reversing the order fixes the
> problem for you. My guess is because device model needs to be ready when
> the guest runs, but I'm not fully convinced by this explanation --
> guests should just be trapped in the hypervisor waiting for device model
> to come up.

I'm not completely sure this is true. qemu is in "stopped" state, so it
might be any emulation requests are just silently dropped. In any case
it is just weird to stop qemu in suspend case only after suspending the
domain, but let it continue _after_ resuming the domain. So I'd rather
expect an explanation (not from Cedric) why this should be okay in case
the patch isn't accepted.

> I also CC'ed other people who are more familiar with this area so that
> they can provide insight.

And adding Stefano and Anthony, too.


Juergen
Stefano Stabellini Nov. 29, 2016, 7:15 p.m. UTC | #3
On Tue, 29 Nov 2016, Juergen Gross wrote:
> On 29/11/16 08:34, Wei Liu wrote:
> > On Mon, Nov 28, 2016 at 02:53:57PM +0100, Cédric Bosdonnat wrote:
> >> Resume is sometimes silently failing for HVM guests. Getting the
> >> xc_domain_resume() and libxl__domain_resume_device_model() in the
> >> reverse order than what is in the suspend code fixes the problem.
> >>
> >> Signed-off-by: Cédric Bosdonnat <cbosdonnat@suse.com>
> >  
> > I think it would be nice to explain why reversing the order fixes the
> > problem for you. My guess is because device model needs to be ready when
> > the guest runs, but I'm not fully convinced by this explanation --
> > guests should just be trapped in the hypervisor waiting for device model
> > to come up.
> 
> I'm not completely sure this is true. qemu is in "stopped" state, so it
> might be any emulation requests are just silently dropped. In any case
> it is just weird to stop qemu in suspend case only after suspending the
> domain, but let it continue _after_ resuming the domain. So I'd rather
> expect an explanation (not from Cedric) why this should be okay in case
> the patch isn't accepted.

Calling xc_domain_resume before libxl__domain_resume_device_model seems
wrong to me. For example in libxl_domain_unpause we call
libxl__domain_resume_device_model, then xc_domain_unpause. We should get
the DM ready before resuming the VM, right?

TBH I don't know exactly what would happen if an ioreq comes in QEMU
before we send the QMP "cont" command. It could be silently dropped,
causing the issue described above, but it would be nice if somebody
instrumented QEMU with some debug printf to be sure.
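
The hypothesis discussed above can be illustrated with a toy model in C. It assumes, purely for illustration, that a stopped device model drops any request it receives; whether QEMU really behaves this way is exactly what the thread leaves open. The struct toy_dm type and the toy_* functions are hypothetical and unrelated to the real QEMU or libxl code.

```c
#include <stdbool.h>

/* Toy device model. ASSUMPTION: a stopped device model silently drops
 * requests. This is the unverified hypothesis from the thread, not
 * confirmed QEMU behaviour. */
struct toy_dm {
    bool running;
    int handled;
    int dropped;
};

/* Stand-in for letting the device model continue (the "cont" step). */
static void toy_dm_cont(struct toy_dm *dm)
{
    dm->running = true;
}

/* Stand-in for the resumed guest issuing one I/O request. */
static void toy_guest_io(struct toy_dm *dm)
{
    if (dm->running)
        dm->handled++;
    else
        dm->dropped++;
}

/* Run one resume with the chosen ordering and return how many requests
 * were dropped under the assumption above. */
static int toy_resume(bool device_model_first)
{
    struct toy_dm dm = { false, 0, 0 };

    if (device_model_first) {
        toy_dm_cont(&dm);   /* device model ready first */
        toy_guest_io(&dm);  /* guest request is handled */
    } else {
        toy_guest_io(&dm);  /* guest request hits a stopped device model */
        toy_dm_cont(&dm);
    }
    return dm.dropped;
}
```

Under this assumption, resuming the device model first loses no request, while resuming the guest first loses the one issued before the device model continues. The model says nothing about how often a real guest issues a request in that window, which may be why the failure was only seen sometimes.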
Wei Liu Dec. 1, 2016, 1:29 p.m. UTC | #4
On Tue, Nov 29, 2016 at 11:15:36AM -0800, Stefano Stabellini wrote:
> On Tue, 29 Nov 2016, Juergen Gross wrote:
> > On 29/11/16 08:34, Wei Liu wrote:
> > > On Mon, Nov 28, 2016 at 02:53:57PM +0100, Cédric Bosdonnat wrote:
> > >> Resume is sometimes silently failing for HVM guests. Getting the
> > >> xc_domain_resume() and libxl__domain_resume_device_model() in the
> > >> reverse order than what is in the suspend code fixes the problem.
> > >>
> > >> Signed-off-by: Cédric Bosdonnat <cbosdonnat@suse.com>
> > >  
> > > I think it would be nice to explain why reversing the order fixes the
> > > problem for you. My guess is because device model needs to be ready when
> > > the guest runs, but I'm not fully convinced by this explanation --
> > > guests should just be trapped in the hypervisor waiting for device model
> > > to come up.
> > 
> > I'm not completely sure this is true. qemu is in "stopped" state, so it
> > might be any emulation requests are just silently dropped. In any case
> > it is just weird to stop qemu in suspend case only after suspending the
> > domain, but let it continue _after_ resuming the domain. So I'd rather
> > expect an explanation (not from Cedric) why this should be okay in case
> > the patch isn't accepted.
> 
> Calling xc_domain_resume before libxl__domain_resume_device_model seems
> wrong to me. For example in libxl_domain_unpause we call
> libxl__domain_resume_device_model, then xc_domain_unpause. We should get
> the DM ready before resuming the VM, right?
> 

Yes, I would think so, too.

I'm inclined to accept this patch. At the end of the day, even if QEMU
doesn't drop requests now, it doesn't mean it will never drop requests
in the future.

Wei.
Wei Liu Dec. 7, 2016, 11:14 a.m. UTC | #5
On Mon, Nov 28, 2016 at 02:53:57PM +0100, Cédric Bosdonnat wrote:
> Resume is sometimes silently failing for HVM guests. Getting the
> xc_domain_resume() and libxl__domain_resume_device_model() in the
> reverse order than what is in the suspend code fixes the problem.
> 
> Signed-off-by: Cédric Bosdonnat <cbosdonnat@suse.com>

Acked + applied.

Due to the acceptance of LOG*D patch series, I need to rebase this patch
a bit. Please check the result.

Wei.

Patch

diff --git a/tools/libxl/libxl_dom_suspend.c b/tools/libxl/libxl_dom_suspend.c
index 0648919..3e29c01 100644
--- a/tools/libxl/libxl_dom_suspend.c
+++ b/tools/libxl/libxl_dom_suspend.c
@@ -456,12 +456,6 @@  int libxl__domain_resume(libxl__gc *gc, uint32_t domid, int suspend_cancel)
 {
     int rc = 0;
 
-    if (xc_domain_resume(CTX->xch, domid, suspend_cancel)) {
-        LOGE(ERROR, "xc_domain_resume failed for domain %u", domid);
-        rc = ERROR_FAIL;
-        goto out;
-    }
-
     libxl_domain_type type = libxl__domain_type(gc, domid);
     if (type == LIBXL_DOMAIN_TYPE_INVALID) {
         rc = ERROR_FAIL;
@@ -477,6 +471,12 @@  int libxl__domain_resume(libxl__gc *gc, uint32_t domid, int suspend_cancel)
         }
     }
 
+    if (xc_domain_resume(CTX->xch, domid, suspend_cancel)) {
+        LOGE(ERROR, "xc_domain_resume failed for domain %u", domid);
+        rc = ERROR_FAIL;
+        goto out;
+    }
+
     if (!xs_resume_domain(CTX->xsh, domid)) {
         LOGE(ERROR, "xs_resume_domain failed for domain %u", domid);
         rc = ERROR_FAIL;