From patchwork Wed Mar 22 06:59:20 2017
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: Zhanghailiang
X-Patchwork-Id: 9638161
To:
References: <201703220942303543948@zte.com.cn>
From: Hailiang Zhang
Message-ID: <58D220C8.3090402@huawei.com>
Date: Wed, 22 Mar 2017 14:59:20 +0800
In-Reply-To: <201703220942303543948@zte.com.cn>
Subject: Re: [Qemu-devel] Reply: Re: Reply: Re: Reply: Re: [BUG]COLO failover hang
Cc: xuquan8@huawei.com, dgilbert@redhat.com, zhangchen.fnst@cn.fujitsu.com, qemu-devel@nongnu.org

Hi,

On 2017/3/22 9:42, wang.guang55@zte.com.cn wrote:
> diff --git a/migration/socket.c b/migration/socket.c
> index 13966f1..d65a0ea 100644
> --- a/migration/socket.c
> +++ b/migration/socket.c
> @@ -147,8 +147,9 @@ static gboolean socket_accept_incoming_migration(QIOChannel *ioc,
>      }
>
>      trace_migration_socket_incoming_accepted();
>
>      qio_channel_set_name(QIO_CHANNEL(sioc), "migration-socket-incoming");
> +    qio_channel_set_feature(QIO_CHANNEL(sioc), QIO_CHANNEL_FEATURE_SHUTDOWN);
>      migration_channel_process_incoming(migrate_get_current(),
>                                         QIO_CHANNEL(sioc));
>      object_unref(OBJECT(sioc));
>
> Is this patch ok?
>

Yes, I think this works, but a better way may be to call qio_channel_set_feature()
in qio_channel_socket_accept(), because we didn't set the SHUTDOWN feature for the
socket accept fd. Or fix it like this (diff at the end of this message):

Thanks,
Hailiang

> I have tested it. The test does not hang any more.
>
> Original message
> From: <zhang.zhanghailiang@huawei.com>
> To: <dgilbert@redhat.com> <berrange@redhat.com>
> Cc: <xuquan8@huawei.com> <qemu-devel@nongnu.org> <zhangchen.fnst@cn.fujitsu.com> 王广10165992
> Date: 2017-03-22 09:11
> Subject: Re: [Qemu-devel] Reply: Re: Reply: Re: [BUG]COLO failover hang
>
> On 2017/3/21 19:56, Dr. David Alan Gilbert wrote:
> > * Hailiang Zhang (zhang.zhanghailiang@huawei.com) wrote:
> >> Hi,
> >>
> >> Thanks for reporting this. I confirmed it in my test, and it is a bug.
> >>
> >> Though we tried to call qemu_file_shutdown() to shut down the related fd, in
> >> case the COLO thread/incoming thread is stuck in read/write() while doing failover,
> >> it didn't take effect, because all the fds used by COLO (also migration)
> >> have been wrapped by qio channel, and it will not call the shutdown API if
> >> we didn't qio_channel_set_feature(QIO_CHANNEL(sioc), QIO_CHANNEL_FEATURE_SHUTDOWN).
> >>
> >> Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
> >>
> >> I suspect migration cancel has the same problem; it may be stuck in write()
> >> if we try to cancel migration.
> >>
> >> void fd_start_outgoing_migration(MigrationState *s, const char *fdname, Error **errp)
> >> {
> >>     qio_channel_set_name(QIO_CHANNEL(ioc), "migration-fd-outgoing");
> >>     migration_channel_connect(s, ioc, NULL);
> >>     ... ...
> >> We didn't call qio_channel_set_feature(QIO_CHANNEL(sioc), QIO_CHANNEL_FEATURE_SHUTDOWN) above,
> >> and the
> >> migrate_fd_cancel()
> >> {
> >>     ... ...
> >>     if (s->state == MIGRATION_STATUS_CANCELLING && f) {
> >>         qemu_file_shutdown(f);  --> This will not take effect. No ?
> >>     }
> >> }
> >
> > (cc'd in Daniel Berrange).
> > I see that we call qio_channel_set_feature(ioc, QIO_CHANNEL_FEATURE_SHUTDOWN) at the
> > top of qio_channel_socket_new so I think that's safe isn't it?
> >
>
> Hmm, you are right, this problem only exists for the migration incoming fd, thanks.
>
> > Dave
> >
> >> Thanks,
> >> Hailiang
> >>
> >> On 2017/3/21 16:10, wang.guang55@zte.com.cn wrote:
> >>> Thank you.
> >>>
> >>> I have already tested it.
> >>>
> >>> When the Primary Node panics, the Secondary Node qemu hangs at the same place.
> >>>
> >>> According to http://wiki.qemu-project.org/Features/COLO , killing the Primary Node qemu will not produce the problem, but a Primary Node panic can.
> >>>
> >>> I think this is because the channel does not support QIO_CHANNEL_FEATURE_SHUTDOWN.
> >>>
> >>> On failover, channel_shutdown could not shut down the channel,
> >>>
> >>> so the colo_process_incoming_thread will hang at recvmsg.
> >>>
> >>> I tested a patch:
> >>>
> >>> diff --git a/migration/socket.c b/migration/socket.c
> >>> index 13966f1..d65a0ea 100644
> >>> --- a/migration/socket.c
> >>> +++ b/migration/socket.c
> >>> @@ -147,8 +147,9 @@ static gboolean socket_accept_incoming_migration(QIOChannel *ioc,
> >>>      }
> >>>
> >>>      trace_migration_socket_incoming_accepted();
> >>>
> >>>      qio_channel_set_name(QIO_CHANNEL(sioc), "migration-socket-incoming");
> >>> +    qio_channel_set_feature(QIO_CHANNEL(sioc), QIO_CHANNEL_FEATURE_SHUTDOWN);
> >>>      migration_channel_process_incoming(migrate_get_current(),
> >>>                                         QIO_CHANNEL(sioc));
> >>>      object_unref(OBJECT(sioc));
> >>>
> >>> My test does not hang any more.
> >>>
> >>> Original message
> >>> From: <zhangchen.fnst@cn.fujitsu.com>
> >>> To: 王广10165992 <zhang.zhanghailiang@huawei.com>
> >>> Cc: <qemu-devel@nongnu.org> <zhangchen.fnst@cn.fujitsu.com>
> >>> Date: 2017-03-21 15:58
> >>> Subject: Re: [Qemu-devel] Reply: Re: [BUG]COLO failover hang
> >>>
> >>> Hi, Wang.
> >>>
> >>> You can test this branch:
> >>>
> >>> https://github.com/coloft/qemu/tree/colo-v5.1-developing-COLO-frame-v21-with-shared-disk
> >>>
> >>> and please follow the wiki to make sure your own configuration is correct.
> >>>
> >>> http://wiki.qemu-project.org/Features/COLO
> >>>
> >>> Thanks
> >>>
> >>> Zhang Chen
> >>>
> >>> On 03/21/2017 03:27 PM, wang.guang55@zte.com.cn wrote:
> >>> >
> >>> > hi.
> >>> >
> >>> > I tested git qemu master and it has the same problem.
> >>> >
> >>> > (gdb) bt
> >>> > #0  qio_channel_socket_readv (ioc=0x7f65911b4e50, iov=0x7f64ef3fd880,
> >>> > niov=1, fds=0x0, nfds=0x0, errp=0x0) at io/channel-socket.c:461
> >>> > #1  0x00007f658e4aa0c2 in qio_channel_read
> >>> > (ioc=ioc@entry=0x7f65911b4e50, buf=buf@entry=0x7f65907cb838 "",
> >>> > buflen=buflen@entry=32768, errp=errp@entry=0x0) at io/channel.c:114
> >>> > #2  0x00007f658e3ea990 in channel_get_buffer (opaque=<optimized out>,
> >>> > buf=0x7f65907cb838 "", pos=<optimized out>, size=32768) at
> >>> > migration/qemu-file-channel.c:78
> >>> > #3  0x00007f658e3e97fc in qemu_fill_buffer (f=0x7f65907cb800) at
> >>> > migration/qemu-file.c:295
> >>> > #4  0x00007f658e3ea2e1 in qemu_peek_byte (f=f@entry=0x7f65907cb800,
> >>> > offset=offset@entry=0) at migration/qemu-file.c:555
> >>> > #5  0x00007f658e3ea34b in qemu_get_byte (f=f@entry=0x7f65907cb800) at
> >>> > migration/qemu-file.c:568
> >>> > #6  0x00007f658e3ea552 in qemu_get_be32 (f=f@entry=0x7f65907cb800) at
> >>> > migration/qemu-file.c:648
> >>> > #7  0x00007f658e3e66e5 in colo_receive_message (f=0x7f65907cb800,
> >>> > errp=errp@entry=0x7f64ef3fd9b0) at migration/colo.c:244
> >>> > #8  0x00007f658e3e681e in colo_receive_check_message (f=<optimized
> >>> > out>, expect_msg=expect_msg@entry=COLO_MESSAGE_VMSTATE_SEND,
> >>> > errp=errp@entry=0x7f64ef3fda08)
> >>> > at migration/colo.c:264
> >>> > #9  0x00007f658e3e740e in colo_process_incoming_thread
> >>> > (opaque=0x7f658eb30360 <mis_current.31286>) at migration/colo.c:577
> >>> > #10 0x00007f658be09df3 in start_thread () from /lib64/libpthread.so.0
> >>> > #11 0x00007f65881983ed in clone () from /lib64/libc.so.6
> >>> >
> >>> > (gdb) p ioc->name
> >>> > $2 = 0x7f658ff7d5c0 "migration-socket-incoming"
> >>> >
> >>> > (gdb) p ioc->features    (does not support QIO_CHANNEL_FEATURE_SHUTDOWN)
> >>> > $3 = 0
> >>> >
> >>> > (gdb) bt
> >>> > #0  socket_accept_incoming_migration (ioc=0x7fdcceeafa90,
> >>> > condition=G_IO_IN, opaque=0x7fdcceeafa90) at migration/socket.c:137
> >>> > #1  0x00007fdcc6966350 in g_main_dispatch (context=<optimized out>) at
> >>> > gmain.c:3054
> >>> > #2  g_main_context_dispatch (context=<optimized out>,
> >>> > context@entry=0x7fdccce9f590) at gmain.c:3630
> >>> > #3  0x00007fdccb8a6dcc in glib_pollfds_poll () at util/main-loop.c:213
> >>> > #4  os_host_main_loop_wait (timeout=<optimized out>) at
> >>> > util/main-loop.c:258
> >>> > #5  main_loop_wait (nonblocking=nonblocking@entry=0) at
> >>> > util/main-loop.c:506
> >>> > #6  0x00007fdccb526187 in main_loop () at vl.c:1898
> >>> > #7  main (argc=<optimized out>, argv=<optimized out>, envp=<optimized
> >>> > out>) at vl.c:4709
> >>> >
> >>> > (gdb) p ioc->features
> >>> > $1 = 6
> >>> >
> >>> > (gdb) p ioc->name
> >>> > $2 = 0x7fdcce1b1ab0 "migration-socket-listener"
> >>> >
> >>> > Maybe socket_accept_incoming_migration should
> >>> > call qio_channel_set_feature(ioc, QIO_CHANNEL_FEATURE_SHUTDOWN)?
> >>> >
> >>> > thank you.
> >>> >
> >>> > Original message
> >>> > From: <zhangchen.fnst@cn.fujitsu.com>
> >>> > To: 王广10165992 <qemu-devel@nongnu.org>
> >>> > Cc: <zhangchen.fnst@cn.fujitsu.com> <zhang.zhanghailiang@huawei.com>
> >>> > Date: 2017-03-16 14:46
> >>> > Subject: Re: [Qemu-devel] COLO failover hang
> >>> >
> >>> > On 03/15/2017 05:06 PM, wangguang wrote:
> >>> > > I am testing the QEMU COLO feature described here [QEMU
> >>> > > Wiki](http://wiki.qemu-project.org/Features/COLO).
> >>> > >
> >>> > > When the Primary Node panics, the Secondary Node qemu hangs.
> >>> > > It hangs at recvmsg in qio_channel_socket_readv.
> >>> > > And I ran { 'execute': 'nbd-server-stop' } and { "execute":
> >>> > > "x-colo-lost-heartbeat" } in the Secondary VM's
> >>> > > monitor; the Secondary Node qemu still hangs at recvmsg.
> >>> > >
> >>> > > I found that COLO in qemu is not complete yet.
> >>> > > Does COLO have any plan for development?
> >>> >
> >>> > Yes, we are developing it. You can see some of the patches we are pushing.
> >>> >
> >>> > > Has anyone ever run it successfully? Any help is appreciated!
> >>> >
> >>> > Our internal version can run it successfully.
> >>> > For the failover details you can ask Zhanghailiang for help.
> >>> > Next time if you have some question about COLO,
> >>> > please cc me and zhanghailiang <zhang.zhanghailiang@huawei.com>.
> >>> >
> >>> > Thanks
> >>> > Zhang Chen
> >>> >
> >>> > >
> >>> > > centos7.2+qemu2.7.50
> >>> > > (gdb) bt
> >>> > > #0  0x00007f3e00cc86ad in recvmsg () from /lib64/libpthread.so.0
> >>> > > #1  0x00007f3e0332b738 in qio_channel_socket_readv (ioc=<optimized out>,
> >>> > > iov=<optimized out>, niov=<optimized out>, fds=0x0, nfds=0x0, errp=0x0) at
> >>> > > io/channel-socket.c:497
> >>> > > #2  0x00007f3e03329472 in qio_channel_read (ioc=ioc@entry=0x7f3e05110e40,
> >>> > > buf=buf@entry=0x7f3e05910f38 "", buflen=buflen@entry=32768,
> >>> > > errp=errp@entry=0x0) at io/channel.c:97
> >>> > > #3  0x00007f3e032750e0 in channel_get_buffer (opaque=<optimized out>,
> >>> > > buf=0x7f3e05910f38 "", pos=<optimized out>, size=32768) at
> >>> > > migration/qemu-file-channel.c:78
> >>> > > #4  0x00007f3e0327412c in qemu_fill_buffer (f=0x7f3e05910f00) at
> >>> > > migration/qemu-file.c:257
> >>> > > #5  0x00007f3e03274a41 in qemu_peek_byte (f=f@entry=0x7f3e05910f00,
> >>> > > offset=offset@entry=0) at migration/qemu-file.c:510
> >>> > > #6  0x00007f3e03274aab in qemu_get_byte (f=f@entry=0x7f3e05910f00) at
> >>> > > migration/qemu-file.c:523
> >>> > > #7  0x00007f3e03274cb2 in qemu_get_be32 (f=f@entry=0x7f3e05910f00) at
> >>> > > migration/qemu-file.c:603
> >>> > > #8  0x00007f3e03271735 in colo_receive_message (f=0x7f3e05910f00,
> >>> > > errp=errp@entry=0x7f3d62bfaa50) at migration/colo.c:215
> >>> > > #9  0x00007f3e0327250d in colo_wait_handle_message (errp=0x7f3d62bfaa48,
> >>> > > checkpoint_request=<synthetic pointer>, f=<optimized out>) at
> >>> > > migration/colo.c:546
> >>> > > #10 colo_process_incoming_thread (opaque=0x7f3e067245e0) at
> >>> > > migration/colo.c:649
> >>> > > #11 0x00007f3e00cc1df3 in start_thread () from /lib64/libpthread.so.0
> >>> > > #12 0x00007f3dfc9c03ed in clone () from /lib64/libc.so.6
> >>> > >
> >>> > > --
> >>> > > View this message in context: http://qemu.11.n7.nabble.com/COLO-failover-hang-tp473250.html
> >>> > > Sent from the Developer mailing list archive at Nabble.com.
> >>> > >
> >>> >
> >>> > --
> >>> > Thanks
> >>> > Zhang Chen
> >>> >
> >>>
> >>
> >
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> >
> > .
> >
>

diff --git a/io/channel-socket.c b/io/channel-socket.c
index f546c68..ce6894c 100644
--- a/io/channel-socket.c
+++ b/io/channel-socket.c
@@ -330,9 +330,8 @@ qio_channel_socket_accept(QIOChannelSocket *ioc,
                           Error **errp)
 {
     QIOChannelSocket *cioc;
-
-    cioc = QIO_CHANNEL_SOCKET(object_new(TYPE_QIO_CHANNEL_SOCKET));
-    cioc->fd = -1;
+
+    cioc = qio_channel_socket_new();
     cioc->remoteAddrLen = sizeof(ioc->remoteAddr);
     cioc->localAddrLen = sizeof(ioc->localAddr);
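
A note for readers decoding the gdb output quoted in this thread: the listener channel printed features = 6, while the accepted "migration-socket-incoming" channel printed 0. The sketch below shows how such a bitmask could gate a shutdown request. It assumes each feature is stored as 1 << feature and that the enum order is FD_PASS, SHUTDOWN, LISTEN; both are assumptions about QEMU's io layer and are not stated in this thread. All toy_* names are hypothetical and this is not QEMU code.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for QIOChannelFeature; the ordering is an
 * assumption made for illustration only. */
enum toy_feature {
    TOY_FEATURE_FD_PASS,   /* bit 0 */
    TOY_FEATURE_SHUTDOWN,  /* bit 1 */
    TOY_FEATURE_LISTEN     /* bit 2 */
};

struct toy_channel {
    unsigned int features; /* bitmask, one bit per feature */
    bool shut_down;        /* set once a shutdown request is honoured */
};

/* Mark a feature as supported on this channel. */
static void toy_set_feature(struct toy_channel *c, enum toy_feature f)
{
    c->features |= 1u << f;
}

/* Report whether the channel advertises the given feature. */
static bool toy_has_feature(const struct toy_channel *c, enum toy_feature f)
{
    return (c->features & (1u << f)) != 0;
}

/* Honour a shutdown request only if the SHUTDOWN bit is set; otherwise
 * do nothing, which leaves a blocked reader stuck. */
static bool toy_shutdown(struct toy_channel *c)
{
    if (!toy_has_feature(c, TOY_FEATURE_SHUTDOWN)) {
        return false;
    }
    c->shut_down = true;
    return true;
}
```

Under these assumptions a mask of 6 has SHUTDOWN and LISTEN set, and a channel whose mask is 0 ignores the shutdown request until toy_set_feature() is called on it, which would be consistent with the hang at recvmsg reported above.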