Message ID | 1623898035-18533-5-git-send-email-lei.rao@intel.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | Fixed some bugs and optimized some codes for COLO | expand |
On Thu, 17 Jun 2021 10:47:12 +0800 Lei Rao <lei.rao@intel.com> wrote: > From: "Rao, Lei" <lei.rao@intel.com> > > When a PVM completed its SVM failover steps and begins to run in > the simplex mode, QEMU would encounter a 'Segmentation fault' if > the guest poweroff with the following calltrace: > > Program received signal SIGSEGV, Segmentation fault. > > This is because primary_vm_do_failover() would call "qemu_file_shutdown > (s->rp_state.from_dst_file);" and later the migration_shutdown() would > do it again. So, we should set the s->rp_state.from_dst_file to NULL. Hello, Please provide a backtrace to such bugfixes. It helps the reviewers to better understand the bug and the fix and it helps yourself to check if it is actually correct. I suggest you to enable coredumps in your test (or even production) system, so for every crash you definitely have a backtrace. > Signed-off-by: Like Xu <like.xu@linux.intel.com> > Signed-off-by: Lei Rao <lei.rao@intel.com> > --- > migration/colo.c | 5 +++-- > 1 file changed, 3 insertions(+), 2 deletions(-) > > diff --git a/migration/colo.c b/migration/colo.c > index 616dc00..c25e488 100644 > --- a/migration/colo.c > +++ b/migration/colo.c > @@ -156,14 +156,15 @@ static void primary_vm_do_failover(void) > > /* > * Wake up COLO thread which may blocked in recv() or send(), > - * The s->rp_state.from_dst_file and s->to_dst_file may use the > - * same fd, but we still shutdown the fd for twice, it is harmless. > + * The s->to_dst_file may use the same fd, but we still shutdown > + * the fd for twice, it is harmless. > */ This change to the comment is incorrect. Shutting down a file multiple times is safe and not a bug in it self. If it leads to a crash anyway, that points to a bug in the qemu_file_shutdown() function or similar. > if (s->to_dst_file) { > qemu_file_shutdown(s->to_dst_file); > } > if (s->rp_state.from_dst_file) { > qemu_file_shutdown(s->rp_state.from_dst_file); > + s->rp_state.from_dst_file = NULL; > } This is incorrect, as the file won't be closed after this and is leaked. And indeed, when applying the patch to master, qemu crashes when triggering failover on the primary, due to the leak: qemu-system-x86_64: ../util/yank.c:107: yank_unregister_instance: Assertion `QLIST_EMPTY(&entry->yankfns)' failed. > old_state = failover_set_state(FAILOVER_STATUS_ACTIVE, --
It's been a long time since this bug, I will reproduce it to get the GDB stack, but it may take some time. Thanks, Lei. -----Original Message----- From: Lukas Straub <lukasstraub2@web.de> Sent: Sunday, July 4, 2021 4:36 AM To: Rao, Lei <lei.rao@intel.com> Cc: Zhang, Chen <chen.zhang@intel.com>; lizhijian@cn.fujitsu.com; jasowang@redhat.com; zhang.zhanghailiang@huawei.com; quintela@redhat.com; dgilbert@redhat.com; like.xu.linux@gmail.com; qemu-devel@nongnu.org; Like Xu <like.xu@linux.intel.com> Subject: Re: [PATCH 4/7] colo: fixed 'Segmentation fault' when the simplex mode PVM poweroff On Thu, 17 Jun 2021 10:47:12 +0800 Lei Rao <lei.rao@intel.com> wrote: > From: "Rao, Lei" <lei.rao@intel.com> > > When a PVM completed its SVM failover steps and begins to run in the > simplex mode, QEMU would encounter a 'Segmentation fault' if the guest > poweroff with the following calltrace: > > Program received signal SIGSEGV, Segmentation fault. > > This is because primary_vm_do_failover() would call > "qemu_file_shutdown (s->rp_state.from_dst_file);" and later the > migration_shutdown() would do it again. So, we should set the s->rp_state.from_dst_file to NULL. Hello, Please provide a backtrace to such bugfixes. It helps the reviewers to better understand the bug and the fix and it helps yourself to check if it is actually correct. I suggest you to enable coredumps in your test (or even production) system, so for every crash you definitely have a backtrace. > Signed-off-by: Like Xu <like.xu@linux.intel.com> > Signed-off-by: Lei Rao <lei.rao@intel.com> > --- > migration/colo.c | 5 +++-- > 1 file changed, 3 insertions(+), 2 deletions(-) > > diff --git a/migration/colo.c b/migration/colo.c index > 616dc00..c25e488 100644 > --- a/migration/colo.c > +++ b/migration/colo.c > @@ -156,14 +156,15 @@ static void primary_vm_do_failover(void) > > /* > * Wake up COLO thread which may blocked in recv() or send(), > - * The s->rp_state.from_dst_file and s->to_dst_file may use the > - * same fd, but we still shutdown the fd for twice, it is harmless. > + * The s->to_dst_file may use the same fd, but we still shutdown > + * the fd for twice, it is harmless. > */ This change to the comment is incorrect. Shutting down a file multiple times is safe and not a bug in it self. If it leads to a crash anyway, that points to a bug in the qemu_file_shutdown() function or similar. > if (s->to_dst_file) { > qemu_file_shutdown(s->to_dst_file); > } > if (s->rp_state.from_dst_file) { > qemu_file_shutdown(s->rp_state.from_dst_file); > + s->rp_state.from_dst_file = NULL; > } This is incorrect, as the file won't be closed after this and is leaked. And indeed, when applying the patch to master, qemu crashes when triggering failover on the primary, due to the leak: qemu-system-x86_64: ../util/yank.c:107: yank_unregister_instance: Assertion `QLIST_EMPTY(&entry->yankfns)' failed. > old_state = failover_set_state(FAILOVER_STATUS_ACTIVE, --
I can reproduce this bug successfully and the GDB stack is as follows: Program terminated with signal SIGSEGV, Segmentation fault. #0 object_class_dynamic_cast (class=0x55c8f5d2bf50, typename=0x55c8f2f7379e "qio-channel") at qom/object.c:832 832 if (type->class->interfaces && [Current thread is 1 (Thread 0x7f756e97eb00 (LWP 1811577))] (gdb) bt #0 object_class_dynamic_cast (class=0x55c8f5d2bf50, typename=0x55c8f2f7379e "qio-channel") at qom/object.c:832 #1 0x000055c8f2c3dd14 in object_dynamic_cast (obj=0x55c8f543ac00, typename=0x55c8f2f7379e "qio-channel") at qom/object.c:763 #2 0x000055c8f2c3ddce in object_dynamic_cast_assert (obj=0x55c8f543ac00, typename=0x55c8f2f7379e "qio-channel", file=0x55c8f2f73780 "migration/qemu-file-channel.c", line=117, func=0x55c8f2f73800 <__func__.18724> "channel_shutdown") at qom/object.c:786 #3 0x000055c8f2bbc6ac in channel_shutdown (opaque=0x55c8f543ac00, rd=true, wr=true, errp=0x0) at migration/qemu-file-channel.c:117 #4 0x000055c8f2bba56e in qemu_file_shutdown (f=0x7f7558070f50) at migration/qemu-file.c:67 #5 0x000055c8f2ba5373 in migrate_fd_cancel (s=0x55c8f4ccf3f0) at migration/migration.c:1699 #6 0x000055c8f2ba1992 in migration_shutdown () at migration/migration.c:187 #7 0x000055c8f29a5b77 in main (argc=69, argv=0x7fff3e9e8c08, envp=0x7fff3e9e8e38) at vl.c:4512 From the call trace we can see that the reason for core dump is due to QIOChannel *ioc = QIO_CHANNEL(opaque) in the channel_shutdown(); After analysis, I think what you said is right, Shutting down a file multiple times is safe and not a bug in it self. The reason for the segmentation fault Is that after doing failover, the COLO thread will qemu_fclose(s->rp_state.from_dst_file) in colo_process_checkpoint(); if we power off the guest at the same time. This will cause qemu_file_shutdown after qemu_close(from_dst_file). As a result, the qemu will crash. So, I think the better way is as follows: /* * Must be called after failover BH is completed, * Or the failover BH may shutdown the wrong fd that * re-used by other threads after we release here. */ if (s->rp_state.from_dst_file) { qemu_fclose(s->rp_state.from_dst_file); + s->rp_state.from_dst_file = NULL; } After qemu_close(s->rp_state.from_dst_file), we set the from_dst_file = NULL; Ways to reproduce bugs: You can kill SVM, after executing the failover QMP command, immediately execute the power off command in the guest, which will have a high probability to reproduce this bug. Thanks, Lei. -----Original Message----- From: Rao, Lei Sent: Monday, July 5, 2021 10:54 AM To: Lukas Straub <lukasstraub2@web.de> Cc: Zhang, Chen <chen.zhang@intel.com>; lizhijian@cn.fujitsu.com; jasowang@redhat.com; zhang.zhanghailiang@huawei.com; quintela@redhat.com; dgilbert@redhat.com; like.xu.linux@gmail.com; qemu-devel@nongnu.org; Like Xu <like.xu@linux.intel.com> Subject: RE: [PATCH 4/7] colo: fixed 'Segmentation fault' when the simplex mode PVM poweroff It's been a long time since this bug, I will reproduce it to get the GDB stack, but it may take some time. Thanks, Lei. -----Original Message----- From: Lukas Straub <lukasstraub2@web.de> Sent: Sunday, July 4, 2021 4:36 AM To: Rao, Lei <lei.rao@intel.com> Cc: Zhang, Chen <chen.zhang@intel.com>; lizhijian@cn.fujitsu.com; jasowang@redhat.com; zhang.zhanghailiang@huawei.com; quintela@redhat.com; dgilbert@redhat.com; like.xu.linux@gmail.com; qemu-devel@nongnu.org; Like Xu <like.xu@linux.intel.com> Subject: Re: [PATCH 4/7] colo: fixed 'Segmentation fault' when the simplex mode PVM poweroff On Thu, 17 Jun 2021 10:47:12 +0800 Lei Rao <lei.rao@intel.com> wrote: > From: "Rao, Lei" <lei.rao@intel.com> > > When a PVM completed its SVM failover steps and begins to run in the > simplex mode, QEMU would encounter a 'Segmentation fault' if the guest > poweroff with the following calltrace: > > Program received signal SIGSEGV, Segmentation fault. > > This is because primary_vm_do_failover() would call > "qemu_file_shutdown (s->rp_state.from_dst_file);" and later the > migration_shutdown() would do it again. So, we should set the s->rp_state.from_dst_file to NULL. Hello, Please provide a backtrace to such bugfixes. It helps the reviewers to better understand the bug and the fix and it helps yourself to check if it is actually correct. I suggest you to enable coredumps in your test (or even production) system, so for every crash you definitely have a backtrace. > Signed-off-by: Like Xu <like.xu@linux.intel.com> > Signed-off-by: Lei Rao <lei.rao@intel.com> > --- > migration/colo.c | 5 +++-- > 1 file changed, 3 insertions(+), 2 deletions(-) > > diff --git a/migration/colo.c b/migration/colo.c index > 616dc00..c25e488 100644 > --- a/migration/colo.c > +++ b/migration/colo.c > @@ -156,14 +156,15 @@ static void primary_vm_do_failover(void) > > /* > * Wake up COLO thread which may blocked in recv() or send(), > - * The s->rp_state.from_dst_file and s->to_dst_file may use the > - * same fd, but we still shutdown the fd for twice, it is harmless. > + * The s->to_dst_file may use the same fd, but we still shutdown > + * the fd for twice, it is harmless. > */ This change to the comment is incorrect. Shutting down a file multiple times is safe and not a bug in it self. If it leads to a crash anyway, that points to a bug in the qemu_file_shutdown() function or similar. > if (s->to_dst_file) { > qemu_file_shutdown(s->to_dst_file); > } > if (s->rp_state.from_dst_file) { > qemu_file_shutdown(s->rp_state.from_dst_file); > + s->rp_state.from_dst_file = NULL; > } This is incorrect, as the file won't be closed after this and is leaked. And indeed, when applying the patch to master, qemu crashes when triggering failover on the primary, due to the leak: qemu-system-x86_64: ../util/yank.c:107: yank_unregister_instance: Assertion `QLIST_EMPTY(&entry->yankfns)' failed. > old_state = failover_set_state(FAILOVER_STATUS_ACTIVE, --
diff --git a/migration/colo.c b/migration/colo.c index 616dc00..c25e488 100644 --- a/migration/colo.c +++ b/migration/colo.c @@ -156,14 +156,15 @@ static void primary_vm_do_failover(void) /* * Wake up COLO thread which may blocked in recv() or send(), - * The s->rp_state.from_dst_file and s->to_dst_file may use the - * same fd, but we still shutdown the fd for twice, it is harmless. + * The s->to_dst_file may use the same fd, but we still shutdown + * the fd for twice, it is harmless. */ if (s->to_dst_file) { qemu_file_shutdown(s->to_dst_file); } if (s->rp_state.from_dst_file) { qemu_file_shutdown(s->rp_state.from_dst_file); + s->rp_state.from_dst_file = NULL; } old_state = failover_set_state(FAILOVER_STATUS_ACTIVE,