Message ID | CAFLBxZaUDqX4W0XLYTd38Ei09rXLYT8V83JAeVGhypW0zvE_bQ@mail.gmail.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On 22/05/17 12:03, George Dunlap wrote: > On Mon, May 22, 2017 at 11:18 AM, George Dunlap > <george.dunlap@citrix.com> wrote: >> On 22/05/17 07:35, Hao, Xudong wrote: >>> Bug detailed description: >>> >>> ---------------- >>> >>> Create one RHEL7.3 HVM and do live migration continuously, while doing the 200+ or 300+ times live-migration, tool stack report error and migration failed. >>> >>> >>> >>> Environment : >>> >>> ---------------- >>> >>> HW: Skylake server >>> >>> Xen: Xen 4.9.0 RC4 >>> >>> Dom0: Linux 4.11.0 >>> >>> >>> >>> Reproduce steps: >>> >>> ---------------- >>> >>> 1. Compile Xen 4.9 Rc4 and dom0 kernel 4.11.0, boot to dom0 >>> >>> 2. Boot RHEL7.3 HVM guest >>> >>> 3. Migrate guest to localhost, sleep 10 seconds >>> >>> 4. Repeat doing the step3. >>> >>> >>> >>> Current result: >>> >>> ---------------- >>> >>> VM Migration fail. >>> >>> >>> >>> Base error log: >>> >>> ---------------- >>> >>> xl migrate 24hrs_lm_guest_2 localhost >>> >>> root@localhost's password: >>> >>> migration target: Ready to receive domain. >>> >>> Saving to migration stream new xl format (info 0x3/0x0/1761) >>> >>> Loading new save file <incoming migration stream> (new xl fmt info 0x3/0x0/1761) >>> >>> Savefile contains xl domain config in JSON format >>> >>> Parsing config from <saved> >>> >>> xc: info: Saving domain 273, type x86 HVM >>> >>> xc: info: Found x86 HVM domain from Xen 4.9 >>> >>> xc: info: Restoring domain >>> >>> xc: error: set HVM param 12 = 0x00000000feffe000 (85 = Interrupted system call should ): Internal error >>> >>> xc: error: Restore failed (85 = Interrupted system call should ): Internal error >> Interesting -- it appears that setting HVM_PARAM_IDENT_PT (#12) can fail >> with -ERESTART. But the comment for ERESTART makes it explicit that it >> should be internal only -- it should cause a hypercall continuation (so >> that the hypercall restarts automatically), rather than returning to the >> guest. >> >> But the hypercall continuation code seems to have disappeared from >> do_hvm_op() at some point? >> >> /me digs a bit more... > The problem turns out to be commit ae20ccf ("dm_op: convert > HVMOP_set_mem_type"), which says: > > This patch removes the need for handling HVMOP restarts, so that > infrastructure is removed. > > While it's true that there are no more operations which need iteration > information restored, but there are two operations which may still > need to be restarted to avoid deadlocks with other operations. > > Attached is a patch which restores hypercall continuation checking. > Xudong, can you give it a test? > > Thanks, > -George > > From 3d4ce135ea3b396bb63752c39e6234366d590c16 Mon Sep 17 00:00:00 2001 > From: George Dunlap <george.dunlap@citrix.com> > Date: Mon, 22 May 2017 11:38:31 +0100 > Subject: [PATCH] Restore HVM_OP hypercall continuation (partial revert of > ae20ccf) > > Commit ae20ccf removed the hypercall continuation logic from the end > of do_hvm_op(), claiming: > > "This patch removes the need for handling HVMOP restarts, so that > infrastructure is removed." > > That turns out to be false. The removal of HVMOP_set_mem_type removed > the need to store a start iteration value in the hypercall > continuation, but a grep through hvm.c for ERESTART turns up at least > two places where do_hvm_op() may still need a hypercall continuation: > > * HVMOP_set_hvm_param can return -ERESTART when setting > HVM_PARAM_IDENT_PT in the event that it fails to acquire the domctl > lock > > * HVMOP_flush_tlbs can return -ERESTART if several vcpus call it at > the same time > > In both cases, a simple restart (with no stored iteration information) > is necessary. > > Add a check for -ERESTART again, along with a comment at the top of > the function regarding the lack of decoding any information from the > op value. > > Reported-by: Xudong Hao <xudong.hao@intel.com> > Signed-off-by: George Dunlap <george.dunlap@citrix.com> > --- > CC: Andrew Cooper <andrew.cooper3@citrix.com> > CC: Jan Beulich <jbeulich@suse.com> > CC: Paul Durrant <paul.durrant@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> (with the final hunk removed) > --- > xen/arch/x86/hvm/hvm.c | 12 +++++++++++- > 1 file changed, 11 insertions(+), 1 deletion(-) > > diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c > index 81691e2..e3e817d 100644 > --- a/xen/arch/x86/hvm/hvm.c > +++ b/xen/arch/x86/hvm/hvm.c > @@ -4544,6 +4544,13 @@ long do_hvm_op(unsigned long op, XEN_GUEST_HANDLE_PARAM(void) arg) > { > long rc = 0; > > + /* > + * NB: hvm_op can be part of a restarted hypercall; but at the > + * moment the only hypercalls which do continuations don't need to > + * store any iteration information (since they're just re-trying > + * the acquisition of a lock). > + */ > + > switch ( op ) > { > case HVMOP_set_evtchn_upcall_vector: > @@ -4636,6 +4643,10 @@ long do_hvm_op(unsigned long op, XEN_GUEST_HANDLE_PARAM(void) arg) > } > } > > + if ( rc == -ERESTART ) > + rc = hypercall_create_continuation(__HYPERVISOR_hvm_op, "lh", > + op, arg); > + > return rc; > } > > @@ -4869,4 +4880,3 @@ void hvm_set_segment_register(struct vcpu *v, enum x86_segment seg, > * indent-tabs-mode: nil > * End: > */ > - > -- 2.1.4
Hi, On 22/05/17 12:11, Andrew Cooper wrote: >> From 3d4ce135ea3b396bb63752c39e6234366d590c16 Mon Sep 17 00:00:00 2001 >> From: George Dunlap <george.dunlap@citrix.com> >> Date: Mon, 22 May 2017 11:38:31 +0100 >> Subject: [PATCH] Restore HVM_OP hypercall continuation (partial revert of >> ae20ccf) >> >> Commit ae20ccf removed the hypercall continuation logic from the end >> of do_hvm_op(), claiming: >> >> "This patch removes the need for handling HVMOP restarts, so that >> infrastructure is removed." >> >> That turns out to be false. The removal of HVMOP_set_mem_type removed >> the need to store a start iteration value in the hypercall >> continuation, but a grep through hvm.c for ERESTART turns up at least >> two places where do_hvm_op() may still need a hypercall continuation: >> >> * HVMOP_set_hvm_param can return -ERESTART when setting >> HVM_PARAM_IDENT_PT in the event that it fails to acquire the domctl >> lock >> >> * HVMOP_flush_tlbs can return -ERESTART if several vcpus call it at >> the same time >> >> In both cases, a simple restart (with no stored iteration information) >> is necessary. >> >> Add a check for -ERESTART again, along with a comment at the top of >> the function regarding the lack of decoding any information from the >> op value. >> >> Reported-by: Xudong Hao <xudong.hao@intel.com> >> Signed-off-by: George Dunlap <george.dunlap@citrix.com> >> --- >> CC: Andrew Cooper <andrew.cooper3@citrix.com> >> CC: Jan Beulich <jbeulich@suse.com> >> CC: Paul Durrant <paul.durrant@citrix.com> > > Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> (with the final > hunk removed) Release-acked-by: Julien Grall <julien.grall@arm.com> Cheers,
George, thanks the fixing. With the patch, the testing is running on 90+ time LM without any error till now, let's wait for the final result. Thanks, -Xudong > -----Original Message----- > From: George Dunlap [mailto:george.dunlap@citrix.com] > Sent: Monday, May 22, 2017 7:03 PM > To: Hao, Xudong <xudong.hao@intel.com>; xen-devel@lists.xen.org > Cc: Lars Kurth <lars.kurth@citrix.com>; Julien Grall <julien.grall@arm.com>; Gao, > Chao <chao.gao@intel.com>; Paul Durrant <paul.durrant@citrix.com>; Andrew > Cooper <andrew.cooper3@citrix.com>; Jan Beulich <JBeulich@suse.com> > Subject: Re: [Xen-devel] [BUG] repeated live migration for VM failed > > On Mon, May 22, 2017 at 11:18 AM, George Dunlap <george.dunlap@citrix.com> > wrote: > > On 22/05/17 07:35, Hao, Xudong wrote: > >> Bug detailed description: > >> > >> ---------------- > >> > >> Create one RHEL7.3 HVM and do live migration continuously, while doing the > 200+ or 300+ times live-migration, tool stack report error and migration failed. > >> > >> > >> > >> Environment : > >> > >> ---------------- > >> > >> HW: Skylake server > >> > >> Xen: Xen 4.9.0 RC4 > >> > >> Dom0: Linux 4.11.0 > >> > >> > >> > >> Reproduce steps: > >> > >> ---------------- > >> > >> 1. Compile Xen 4.9 Rc4 and dom0 kernel 4.11.0, boot to dom0 > >> > >> 2. Boot RHEL7.3 HVM guest > >> > >> 3. Migrate guest to localhost, sleep 10 seconds > >> > >> 4. Repeat doing the step3. > >> > >> > >> > >> Current result: > >> > >> ---------------- > >> > >> VM Migration fail. > >> > >> > >> > >> Base error log: > >> > >> ---------------- > >> > >> xl migrate 24hrs_lm_guest_2 localhost > >> > >> root@localhost's password: > >> > >> migration target: Ready to receive domain. > >> > >> Saving to migration stream new xl format (info 0x3/0x0/1761) > >> > >> Loading new save file <incoming migration stream> (new xl fmt info > >> 0x3/0x0/1761) > >> > >> Savefile contains xl domain config in JSON format > >> > >> Parsing config from <saved> > >> > >> xc: info: Saving domain 273, type x86 HVM > >> > >> xc: info: Found x86 HVM domain from Xen 4.9 > >> > >> xc: info: Restoring domain > >> > >> xc: error: set HVM param 12 = 0x00000000feffe000 (85 = Interrupted > >> system call should ): Internal error > >> > >> xc: error: Restore failed (85 = Interrupted system call should ): > >> Internal error > > > > Interesting -- it appears that setting HVM_PARAM_IDENT_PT (#12) can > > fail with -ERESTART. But the comment for ERESTART makes it explicit > > that it should be internal only -- it should cause a hypercall > > continuation (so that the hypercall restarts automatically), rather > > than returning to the guest. > > > > But the hypercall continuation code seems to have disappeared from > > do_hvm_op() at some point? > > > > /me digs a bit more... > > The problem turns out to be commit ae20ccf ("dm_op: convert > HVMOP_set_mem_type"), which says: > > This patch removes the need for handling HVMOP restarts, so that > infrastructure is removed. > > While it's true that there are no more operations which need iteration > information restored, but there are two operations which may still need to be > restarted to avoid deadlocks with other operations. > > Attached is a patch which restores hypercall continuation checking. > Xudong, can you give it a test? > > Thanks, > -George
George, The live migrate pass over 500+ times with this patch, I think it's fine to merge it into Xen 4.9. Tested-by: Xudong Hao <xudong.hao@intel.com> Thanks, -Xudong > -----Original Message----- > From: Xen-devel [mailto:xen-devel-bounces@lists.xen.org] On Behalf Of Hao, > Xudong > Sent: Tuesday, May 23, 2017 5:23 PM > To: George Dunlap <george.dunlap@citrix.com>; xen-devel@lists.xen.org > Cc: Lars Kurth <lars.kurth@citrix.com>; Andrew Cooper > <andrew.cooper3@citrix.com>; Julien Grall <julien.grall@arm.com>; Paul > Durrant <paul.durrant@citrix.com>; Jan Beulich <JBeulich@suse.com>; Gao, > Chao <chao.gao@intel.com> > Subject: Re: [Xen-devel] [BUG] repeated live migration for VM failed > > George, thanks the fixing. > With the patch, the testing is running on 90+ time LM without any error till now, > let's wait for the final result. > > Thanks, > -Xudong > > > > -----Original Message----- > > From: George Dunlap [mailto:george.dunlap@citrix.com] > > Sent: Monday, May 22, 2017 7:03 PM > > To: Hao, Xudong <xudong.hao@intel.com>; xen-devel@lists.xen.org > > Cc: Lars Kurth <lars.kurth@citrix.com>; Julien Grall > > <julien.grall@arm.com>; Gao, Chao <chao.gao@intel.com>; Paul Durrant > > <paul.durrant@citrix.com>; Andrew Cooper <andrew.cooper3@citrix.com>; > > Jan Beulich <JBeulich@suse.com> > > Subject: Re: [Xen-devel] [BUG] repeated live migration for VM failed > > > > On Mon, May 22, 2017 at 11:18 AM, George Dunlap > > <george.dunlap@citrix.com> > > wrote: > > > On 22/05/17 07:35, Hao, Xudong wrote: > > >> Bug detailed description: > > >> > > >> ---------------- > > >> > > >> Create one RHEL7.3 HVM and do live migration continuously, while > > >> doing the > > 200+ or 300+ times live-migration, tool stack report error and migration failed. > > >> > > >> > > >> > > >> Environment : > > >> > > >> ---------------- > > >> > > >> HW: Skylake server > > >> > > >> Xen: Xen 4.9.0 RC4 > > >> > > >> Dom0: Linux 4.11.0 > > >> > > >> > > >> > > >> Reproduce steps: > > >> > > >> ---------------- > > >> > > >> 1. Compile Xen 4.9 Rc4 and dom0 kernel 4.11.0, boot to dom0 > > >> > > >> 2. Boot RHEL7.3 HVM guest > > >> > > >> 3. Migrate guest to localhost, sleep 10 seconds > > >> > > >> 4. Repeat doing the step3. > > >> > > >> > > >> > > >> Current result: > > >> > > >> ---------------- > > >> > > >> VM Migration fail. > > >> > > >> > > >> > > >> Base error log: > > >> > > >> ---------------- > > >> > > >> xl migrate 24hrs_lm_guest_2 localhost > > >> > > >> root@localhost's password: > > >> > > >> migration target: Ready to receive domain. > > >> > > >> Saving to migration stream new xl format (info 0x3/0x0/1761) > > >> > > >> Loading new save file <incoming migration stream> (new xl fmt info > > >> 0x3/0x0/1761) > > >> > > >> Savefile contains xl domain config in JSON format > > >> > > >> Parsing config from <saved> > > >> > > >> xc: info: Saving domain 273, type x86 HVM > > >> > > >> xc: info: Found x86 HVM domain from Xen 4.9 > > >> > > >> xc: info: Restoring domain > > >> > > >> xc: error: set HVM param 12 = 0x00000000feffe000 (85 = Interrupted > > >> system call should ): Internal error > > >> > > >> xc: error: Restore failed (85 = Interrupted system call should ): > > >> Internal error > > > > > > Interesting -- it appears that setting HVM_PARAM_IDENT_PT (#12) can > > > fail with -ERESTART. But the comment for ERESTART makes it explicit > > > that it should be internal only -- it should cause a hypercall > > > continuation (so that the hypercall restarts automatically), rather > > > than returning to the guest. > > > > > > But the hypercall continuation code seems to have disappeared from > > > do_hvm_op() at some point? > > > > > > /me digs a bit more... > > > > The problem turns out to be commit ae20ccf ("dm_op: convert > > HVMOP_set_mem_type"), which says: > > > > This patch removes the need for handling HVMOP restarts, so that > > infrastructure is removed. > > > > While it's true that there are no more operations which need iteration > > information restored, but there are two operations which may still > > need to be restarted to avoid deadlocks with other operations. > > > > Attached is a patch which restores hypercall continuation checking. > > Xudong, can you give it a test? > > > > Thanks, > > -George > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xen.org > https://lists.xen.org/xen-devel
From 3d4ce135ea3b396bb63752c39e6234366d590c16 Mon Sep 17 00:00:00 2001 From: George Dunlap <george.dunlap@citrix.com> Date: Mon, 22 May 2017 11:38:31 +0100 Subject: [PATCH] Restore HVM_OP hypercall continuation (partial revert of ae20ccf) Commit ae20ccf removed the hypercall continuation logic from the end of do_hvm_op(), claiming: "This patch removes the need for handling HVMOP restarts, so that infrastructure is removed." That turns out to be false. The removal of HVMOP_set_mem_type removed the need to store a start iteration value in the hypercall continuation, but a grep through hvm.c for ERESTART turns up at least two places where do_hvm_op() may still need a hypercall continuation: * HVMOP_set_hvm_param can return -ERESTART when setting HVM_PARAM_IDENT_PT in the event that it fails to acquire the domctl lock * HVMOP_flush_tlbs can return -ERESTART if several vcpus call it at the same time In both cases, a simple restart (with no stored iteration information) is necessary. Add a check for -ERESTART again, along with a comment at the top of the function regarding the lack of decoding any information from the op value. Reported-by: Xudong Hao <xudong.hao@intel.com> Signed-off-by: George Dunlap <george.dunlap@citrix.com> --- CC: Andrew Cooper <andrew.cooper3@citrix.com> CC: Jan Beulich <jbeulich@suse.com> CC: Paul Durrant <paul.durrant@citrix.com> --- xen/arch/x86/hvm/hvm.c | 12 +++++++++++- 1 file changed, 11 insertions(+), 1 deletion(-) diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c index 81691e2..e3e817d 100644 --- a/xen/arch/x86/hvm/hvm.c +++ b/xen/arch/x86/hvm/hvm.c @@ -4544,6 +4544,13 @@ long do_hvm_op(unsigned long op, XEN_GUEST_HANDLE_PARAM(void) arg) { long rc = 0; + /* + * NB: hvm_op can be part of a restarted hypercall; but at the + * moment the only hypercalls which do continuations don't need to + * store any iteration information (since they're just re-trying + * the acquisition of a lock). + */ + switch ( op ) { case HVMOP_set_evtchn_upcall_vector: @@ -4636,6 +4643,10 @@ long do_hvm_op(unsigned long op, XEN_GUEST_HANDLE_PARAM(void) arg) } } + if ( rc == -ERESTART ) + rc = hypercall_create_continuation(__HYPERVISOR_hvm_op, "lh", + op, arg); + return rc; } @@ -4869,4 +4880,3 @@ void hvm_set_segment_register(struct vcpu *v, enum x86_segment seg, * indent-tabs-mode: nil * End: */ - -- 2.1.4