From patchwork Wed Apr 27 15:02:49 2016
Date: Wed, 27 Apr 2016 16:02:49 +0100
From: "Daniel P. Berrange"
To: "Dr. David Alan Gilbert"
Cc: liang.z.li@intel.com, qemu-devel@nongnu.org
Subject: Re: [Qemu-devel] Hang with migration multi-thread compression under high load
Message-ID: <20160427150249.GH17937@redhat.com>
In-Reply-To: <20160427142929.GC2290@work-vm>

On Wed, Apr 27, 2016 at 03:29:30PM +0100, Dr. David Alan Gilbert wrote:
> ccing in Liang Li
>
> * Daniel P. Berrange (berrange@redhat.com) wrote:
> > for some reason it isn't shown in the stack trace for thread
> > 1 above, when initially connecting GDB it says the main thread
> > is at:
> >
> > decompress_data_with_multi_threads (len=702, host=0x7fd78fe06000, f=0x55901af09950) at /home/berrange/src/virt/qemu/migration/ram.c:2254
> > 2254            for (idx = 0; idx < thread_count; idx++) {
> >
> >
> > Looking at the target QEMU, we see do_data_decompress method
> > is waiting in a condition var:
> >
> >         while (!param->start && !quit_decomp_thread) {
> >             qemu_cond_wait(&param->cond, &param->mutex);
> >             ....do stuff..
> >             param->start = false
> >         }
> >
> >
> > Now the decompress_data_with_multi_threads is checking param->start without
> > holding the param->mutex lock.
> >
> > Changing decompress_data_with_multi_threads to acquire param->mutex
> > lock makes it work, but isn't ideal, since that now blocks the
> > decompress_data_with_multi_threads() method on the completion of
> > each thread, which defeats the point of having multiple threads.

FWIW, the following patch also appears to "fix" the issue, presumably
by just making the race much less likely to hit:

Incidentally, IIUC, decompress_data_with_multi_threads() is just
busy-waiting for a thread to become free, which seems pretty wasteful
of CPU resources. I wonder if there's a more effective way to structure
this: instead of having decompress_data_with_multi_threads() choose
which thread to pass each decompression job to, it could just put the
job into a shared queue and let all the threads pull from it. IOW,
whichever thread the kernel decides to wake up would get the job,
without us having to explicitly assign a thread to it.

Regards,
Daniel

diff --git a/migration/ram.c b/migration/ram.c
index 3f05738..be0233f 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -2271,6 +2271,7 @@ static void decompress_data_with_multi_threads(QEMUFile *f,
         if (idx < thread_count) {
             break;
         }
+        sched_yield();
     }
 }
 
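A minimal sketch of the locking discipline the thread is circling around, assuming plain POSIX threads instead of QEMU's qemu_mutex/qemu_cond wrappers: the dispatcher reads and sets a worker's busy flag only while holding the same lock the workers use, and when every worker is busy it sleeps on a condition variable instead of spinning. Worker, pool_lock, idle_cond, worker_main(), dispatch_job() and run_pool() are hypothetical names, not QEMU code, and a single pool-wide lock stands in for the per-thread param->mutex.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

#define NUM_WORKERS 4

/* Hypothetical stand-in for a per-thread decompression slot. */
typedef struct {
    pthread_t thread;
    pthread_cond_t start_cond;  /* signalled when a job is assigned */
    bool busy;                  /* protected by pool_lock */
    bool quit;                  /* protected by pool_lock */
    int job;                    /* protected by pool_lock */
} Worker;

static Worker workers[NUM_WORKERS];
static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t idle_cond = PTHREAD_COND_INITIALIZER; /* a worker went idle */
static long jobs_done;                                      /* protected by pool_lock */

static void *worker_main(void *opaque)
{
    Worker *w = opaque;

    pthread_mutex_lock(&pool_lock);
    for (;;) {
        while (!w->busy && !w->quit) {    /* loop also absorbs spurious wakeups */
            pthread_cond_wait(&w->start_cond, &pool_lock);
        }
        if (!w->busy) {                   /* quit requested, nothing pending */
            break;
        }
        int job = w->job;
        pthread_mutex_unlock(&pool_lock);
        (void)job;                        /* ... decompress outside the lock ... */
        pthread_mutex_lock(&pool_lock);
        jobs_done++;
        w->busy = false;                  /* flag only ever changes under the lock */
        pthread_cond_signal(&idle_cond);  /* wake the dispatcher if it is waiting */
    }
    pthread_mutex_unlock(&pool_lock);
    return NULL;
}

/* Assigns job to an idle worker, sleeping (not spinning) while all are busy. */
static void dispatch_job(int job)
{
    pthread_mutex_lock(&pool_lock);
    for (;;) {
        for (int i = 0; i < NUM_WORKERS; i++) {
            if (!workers[i].busy) {       /* read under the lock the worker uses */
                workers[i].job = job;
                workers[i].busy = true;
                pthread_cond_signal(&workers[i].start_cond);
                pthread_mutex_unlock(&pool_lock);
                return;
            }
        }
        pthread_cond_wait(&idle_cond, &pool_lock);
    }
}

/* Runs num_jobs jobs through the pool and returns how many completed. */
static long run_pool(int num_jobs)
{
    jobs_done = 0;
    for (int i = 0; i < NUM_WORKERS; i++) {
        workers[i].busy = false;
        workers[i].quit = false;
        pthread_cond_init(&workers[i].start_cond, NULL);
        pthread_create(&workers[i].thread, NULL, worker_main, &workers[i]);
    }
    for (int j = 0; j < num_jobs; j++) {
        dispatch_job(j);
    }
    pthread_mutex_lock(&pool_lock);
    for (int i = 0; i < NUM_WORKERS; i++) {
        workers[i].quit = true;
        pthread_cond_signal(&workers[i].start_cond);
    }
    pthread_mutex_unlock(&pool_lock);
    for (int i = 0; i < NUM_WORKERS; i++) {
        pthread_join(workers[i].thread, NULL);
        assert(!workers[i].busy);         /* each worker finished its last job */
        pthread_cond_destroy(&workers[i].start_cond);
    }
    return jobs_done;
}
```

Called from a main() built with -pthread, run_pool(100) should return 100. The single pool-wide lock serialises only the hand-off; the (elided) decompression work still runs in parallel outside the lock, so this avoids both the unlocked read of the flag and the busy-wait without blocking on any one thread's completion.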
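The shared-queue alternative suggested in the reply might look roughly like the sketch below, again assuming plain POSIX threads. JobQueue, queue_push(), queue_pop(), worker() and run_queue_pool() are hypothetical names, and an int stands in for the real decompression request (buffer, host address, length). The producer blocks on not_full when the bounded queue is full, and pthread_cond_signal() on not_empty unblocks at least one waiting worker without the producer choosing which one.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

#define QUEUE_CAP 8
#define NUM_THREADS 4

/* Hypothetical bounded job queue shared by all worker threads. */
typedef struct {
    int jobs[QUEUE_CAP];       /* ring buffer of pending jobs */
    int head, count;
    bool closed;               /* no more jobs will be pushed */
    long done;                 /* jobs completed by workers */
    pthread_mutex_t lock;      /* protects every field above */
    pthread_cond_t not_empty;  /* a job was pushed or the queue was closed */
    pthread_cond_t not_full;   /* a job was popped */
} JobQueue;

static void queue_push(JobQueue *q, int job)
{
    pthread_mutex_lock(&q->lock);
    while (q->count == QUEUE_CAP) {
        pthread_cond_wait(&q->not_full, &q->lock);  /* block instead of spinning */
    }
    q->jobs[(q->head + q->count) % QUEUE_CAP] = job;
    q->count++;
    pthread_cond_signal(&q->not_empty);  /* wake some idle worker */
    pthread_mutex_unlock(&q->lock);
}

/* Returns false once the queue is closed and fully drained. */
static bool queue_pop(JobQueue *q, int *job)
{
    pthread_mutex_lock(&q->lock);
    while (q->count == 0 && !q->closed) {
        pthread_cond_wait(&q->not_empty, &q->lock);
    }
    if (q->count == 0) {
        pthread_mutex_unlock(&q->lock);
        return false;
    }
    *job = q->jobs[q->head];
    q->head = (q->head + 1) % QUEUE_CAP;
    q->count--;
    pthread_cond_signal(&q->not_full);
    pthread_mutex_unlock(&q->lock);
    return true;
}

static void *worker(void *opaque)
{
    JobQueue *q = opaque;
    int job;

    while (queue_pop(q, &job)) {
        (void)job;  /* ... decompress the page outside the lock ... */
        pthread_mutex_lock(&q->lock);
        q->done++;
        pthread_mutex_unlock(&q->lock);
    }
    return NULL;
}

/* Pushes num_jobs jobs through NUM_THREADS workers; returns jobs completed. */
static long run_queue_pool(int num_jobs)
{
    JobQueue q = { .head = 0, .count = 0, .closed = false, .done = 0 };
    pthread_t threads[NUM_THREADS];

    pthread_mutex_init(&q.lock, NULL);
    pthread_cond_init(&q.not_empty, NULL);
    pthread_cond_init(&q.not_full, NULL);
    for (int i = 0; i < NUM_THREADS; i++) {
        pthread_create(&threads[i], NULL, worker, &q);
    }
    for (int j = 0; j < num_jobs; j++) {
        queue_push(&q, j);
    }
    pthread_mutex_lock(&q.lock);
    q.closed = true;
    pthread_cond_broadcast(&q.not_empty);  /* every worker must see the close */
    pthread_mutex_unlock(&q.lock);
    for (int i = 0; i < NUM_THREADS; i++) {
        pthread_join(threads[i], NULL);
    }
    assert(q.count == 0);  /* workers drain the queue before exiting */
    pthread_cond_destroy(&q.not_full);
    pthread_cond_destroy(&q.not_empty);
    pthread_mutex_destroy(&q.lock);
    return q.done;
}
```

With this shape, run_queue_pool(100) should return 100. Because all queue state lives under one mutex there is no unlocked flag read to race on; the trade-off is that every worker contends on that one lock, which is likely cheap compared with decompressing a page.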