From patchwork Fri Jun 16 06:58:17 2017
X-Patchwork-Submitter: Ladi Prosek
X-Patchwork-Id: 9790659
From: Ladi Prosek
Date: Fri, 16 Jun 2017 08:58:17 +0200
To: Fernando Casas Schössow
Cc: "qemu-devel@nongnu.org"
Subject: Re: [Qemu-devel] Guest unresponsive after Virtqueue size exceeded error

Hi,

On Wed, Jun 14, 2017 at 11:56 PM, Fernando Casas Schössow wrote:
> Hi there,
>
> I recently migrated a Hyper-V host to qemu/kvm running on Alpine Linux 3.6.1 (kernel 4.9.30 -with grsec patches- and qemu 2.8.1).
>
> Almost on a daily basis at least one of the guests shows the following error in the log, and then it needs to be terminated and restarted to recover it:
>
> qemu-system-x86_64: Virtqueue size exceeded
>
> It's not always the same guest, and the error appears for both Linux (CentOS 7.3) and Windows (2012R2) guests.
> As soon as this error appears the guest is not really working anymore. It may respond to ping, or you can even try to log in, but then everything is very slow or completely unresponsive. Restarting the guest from within the guest OS doesn't work either, and the only thing I can do is terminate it (virsh destroy) and start it again until the next failure.
>
> In Windows guests the error seems to be disk related:
> "Reset to device, \Device\RaidPort2, was issued" and the source is viostor
>
> And in Linux guests the error is always (with the process and pid changing):
>
> INFO: task : blocked for more than 120 seconds
>
> But unfortunately I was not able to find any other indication of a problem in the guest logs, nor in the host logs, except for the error regarding the virtqueue size. The problem happens at different times of day and I couldn't find any pattern yet.
>
> All the Windows guests are using virtio drivers version 126, and all Linux guests are CentOS 7.3 using the latest kernel available in the distribution (3.10.0-514.21.1). They all run qemu-guest-agent as well.
> All the guest disks are qcow2 images with cache=none and aio=threads (tried native mode before, but with the same results).
>
> Example qemu command for a Linux guest:
>
> /usr/bin/qemu-system-x86_64 -name guest=DOCKER01,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-24-DOCKER01/master-key.aes -machine pc-i440fx-2.8,accel=kvm,usb=off,dump-guest-core=off -cpu IvyBridge,+ds,+acpi,+ss,+ht,+tm,+pbe,+dtes64,+monitor,+ds_cpl,+vmx,+smx,+est,+tm2,+xtpr,+pdcm,+pcid,+osxsave,+arat,+xsaveopt -drive file=/usr/share/edk2.git/ovmf-x64/OVMF_CODE-pure-efi.fd,if=pflash,format=raw,unit=0,readonly=on -drive file=/var/lib/libvirt/qemu/nvram/DOCKER01_VARS.fd,if=pflash,format=raw,unit=1 -m 2048 -realtime mlock=off -smp 2,sockets=2,cores=1,threads=1 -uuid 4705b146-3b14-4c20-923c-42105d47e7fc -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-24-DOCKER01/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=delay -no-hpet -no-shutdown -global PIIX4_PM.disable_s3=1 -global PIIX4_PM.disable_s4=1 -boot strict=on -device ich9-usb-ehci1,id=usb,bus=pci.0,addr=0x4.0x7 -device ich9-usb-uhci1,masterbus=usb.0,firstport=0,bus=pci.0,multifunction=on,addr=0x4 -device ich9-usb-uhci2,masterbus=usb.0,firstport=2,bus=pci.0,addr=0x4.0x1 -device ich9-usb-uhci3,masterbus=usb.0,firstport=4,bus=pci.0,addr=0x4.0x2 -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x5 -drive file=/storage/storage-ssd-vms/virtual_machines_ssd/docker01.qcow2,format=qcow2,if=none,id=drive-virtio-disk0,cache=none,aio=threads -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x6,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -netdev tap,fd=35,id=hostnet0,vhost=on,vhostfd=45 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:1c:af:ce,bus=pci.0,addr=0x3 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/channel/target/domain-24-DOCKER01/org.qemu.guest_agent.0,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_agent.0 -chardev spicevmc,id=charchannel1,name=vdagent -device virtserialport,bus=virtio-serial0.0,nr=2,chardev=charchannel1,id=channel1,name=com.redhat.spice.0 -device usb-tablet,id=input0,bus=usb.0,port=1 -spice port=5905,addr=127.0.0.1,disable-ticketing,seamless-migration=on -device qxl-vga,id=video0,ram_size=67108864,vram_size=67108864,vram64_size_mb=0,vgamem_mb=16,max_outputs=1,bus=pci.0,addr=0x2 -chardev spicevmc,id=charredir0,name=usbredir -device usb-redir,chardev=charredir0,id=redir0,bus=usb.0,port=2 -chardev spicevmc,id=charredir1,name=usbredir -device usb-redir,chardev=charredir1,id=redir1,bus=usb.0,port=3 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x7 -object rng-random,id=objrng0,filename=/dev/random -device virtio-rng-pci,rng=objrng0,id=rng0,bus=pci.0,addr=0x8 -msg timestamp=on
>
> For what it's worth, the same guests were working fine for years on Hyper-V on the same hardware (Intel Xeon E3, 32GB RAM, Supermicro mainboard, 6x3TB Western Digital Red disks and 6x120GB Kingston V300 SSDs, all connected to an LSI LSISAS2008 controller).
> Except for this stability issue that I hope to solve, everything else is working great and outperforming Hyper-V.
>
> Any ideas, thoughts or suggestions to try to narrow down the problem?

Would you be able to enhance the error message and rebuild QEMU? This would at least confirm the theory that it's caused by virtio-blk-pci. If rebuilding is not feasible, I would start by removing other virtio devices -- particularly balloon, which has had quite a few virtio-related bugs fixed recently.

Does your environment involve VM migrations or saving/resuming, or does the crashing QEMU process always run the VM from its boot?

Thanks!

> Thanks in advance and sorry for the long email but I wanted to be as descriptive as possible.
>
> Fer

--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -856,7 +856,7 @@ void *virtqueue_pop(VirtQueue *vq, size_t sz)
     max = vq->vring.num;
 
     if (vq->inuse >= vq->vring.num) {
-        virtio_error(vdev, "Virtqueue size exceeded");
+        virtio_error(vdev, "Virtqueue %u device %s size exceeded", vq->queue_index, vdev->name);
         goto done;
     }