From patchwork Fri Jun 16 06:58:17 2017
X-Patchwork-Submitter: Ladi Prosek
X-Patchwork-Id: 9790659
From: Ladi Prosek
Date: Fri, 16 Jun 2017 08:58:17 +0200
To: Fernando Casas Schössow
Cc: "qemu-devel@nongnu.org"
Subject: Re: [Qemu-devel] Guest unresponsive after Virtqueue size exceeded error

Hi,

On Wed, Jun 14, 2017 at 11:56 PM, Fernando Casas Schössow wrote:
> Hi there,
>
> I recently migrated a Hyper-V host to qemu/kvm running on Alpine Linux 3.6.1 (kernel 4.9.30 -with grsec patches- and qemu 2.8.1).
>
> Almost on a daily basis at least one of the guests shows the following error in the log, and then it needs to be terminated and restarted to recover it:
>
> qemu-system-x86_64: Virtqueue size exceeded
>
> It's not always the same guest, and the error appears for both Linux (CentOS 7.3) and Windows (2012R2) guests.
> As soon as this error appears the guest is not really working anymore. It may respond to ping, or you can even try to log in, but then everything is very slow or completely unresponsive. Restarting the guest from within the guest OS doesn't work either, and the only thing I can do is terminate it (virsh destroy) and start it again until the next failure.
>
> In Windows guests the error seems to be disk related:
> "Reset to device, \Device\RaidPort2, was issued" and the source is viostor
>
> And in Linux guests the error is always (with the process and pid changing):
>
> INFO: task : blocked for more than 120 seconds
>
> But unfortunately I was not able to find any other indication of a problem in the guest logs, nor in the host logs, except for the error regarding the virtqueue size. The problem happens at different times of day and I couldn't find any pattern yet.
>
> All the Windows guests are using virtio drivers version 126, and all Linux guests are CentOS 7.3 using the latest kernel available in the distribution (3.10.0-514.21.1). They all run qemu-guest-agent as well.
> All the guest disks are qcow2 images with cache=none and aio=threads (tried native mode before, but with the same results).
>
> Example qemu command for a Linux guest:
>
> /usr/bin/qemu-system-x86_64 -name guest=DOCKER01,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-24-DOCKER01/master-key.aes -machine pc-i440fx-2.8,accel=kvm,usb=off,dump-guest-core=off -cpu IvyBridge,+ds,+acpi,+ss,+ht,+tm,+pbe,+dtes64,+monitor,+ds_cpl,+vmx,+smx,+est,+tm2,+xtpr,+pdcm,+pcid,+osxsave,+arat,+xsaveopt -drive file=/usr/share/edk2.git/ovmf-x64/OVMF_CODE-pure-efi.fd,if=pflash,format=raw,unit=0,readonly=on -drive file=/var/lib/libvirt/qemu/nvram/DOCKER01_VARS.fd,if=pflash,format=raw,unit=1 -m 2048 -realtime mlock=off -smp 2,sockets=2,cores=1,threads=1 -uuid 4705b146-3b14-4c20-923c-42105d47e7fc -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-24-DOCKER01/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=delay -no-hpet -no-shutdown -global PIIX4_PM.disable_s3=1 -global PIIX4_PM.disable_s4=1 -boot strict=on -device ich9-usb-ehci1,id=usb,bus=pci.0,addr=0x4.0x7 -device ich9-usb-uhci1,masterbus=usb.0,firstport=0,bus=pci.0,multifunction=on,addr=0x4 -device ich9-usb-uhci2,masterbus=usb.0,firstport=2,bus=pci.0,addr=0x4.0x1 -device ich9-usb-uhci3,masterbus=usb.0,firstport=4,bus=pci.0,addr=0x4.0x2 -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x5 -drive file=/storage/storage-ssd-vms/virtual_machines_ssd/docker01.qcow2,format=qcow2,if=none,id=drive-virtio-disk0,cache=none,aio=threads -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x6,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -netdev tap,fd=35,id=hostnet0,vhost=on,vhostfd=45 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:1c:af:ce,bus=pci.0,addr=0x3 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/channel/target/domain-24-DOCKER01/org.qemu.guest_agent.0,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_agent.0 -chardev spicevmc,id=charchannel1,name=vdagent -device virtserialport,bus=virtio-serial0.0,nr=2,chardev=charchannel1,id=channel1,name=com.redhat.spice.0 -device usb-tablet,id=input0,bus=usb.0,port=1 -spice port=5905,addr=127.0.0.1,disable-ticketing,seamless-migration=on -device qxl-vga,id=video0,ram_size=67108864,vram_size=67108864,vram64_size_mb=0,vgamem_mb=16,max_outputs=1,bus=pci.0,addr=0x2 -chardev spicevmc,id=charredir0,name=usbredir -device usb-redir,chardev=charredir0,id=redir0,bus=usb.0,port=2 -chardev spicevmc,id=charredir1,name=usbredir -device usb-redir,chardev=charredir1,id=redir1,bus=usb.0,port=3 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x7 -object rng-random,id=objrng0,filename=/dev/random -device virtio-rng-pci,rng=objrng0,id=rng0,bus=pci.0,addr=0x8 -msg timestamp=on
>
> For what it's worth, the same guests were working fine for years on Hyper-V on the same hardware (Intel Xeon E3, 32GB RAM, Supermicro mainboard, 6x3TB Western Digital Red disks and 6x120GB Kingston V300 SSDs, all connected to an LSI LSISAS2008 controller).
> Except for this stability issue that I hope to solve, everything else is working great and outperforming Hyper-V.
>
> Any ideas, thoughts or suggestions to try to narrow down the problem?

Would you be able to enhance the error message and rebuild QEMU? This would at least confirm the theory that it's caused by virtio-blk-pci. If rebuilding is not feasible, I would start by removing other virtio devices -- particularly balloon, which has had quite a few virtio-related bugs fixed recently.

Does your environment involve VM migrations or saving/resuming, or does the crashing QEMU process always run the VM from its boot?

Thanks!

> Thanks in advance and sorry for the long email but I wanted to be as descriptive as possible.
>
> Fer

--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -856,7 +856,7 @@ void *virtqueue_pop(VirtQueue *vq, size_t sz)
     max = vq->vring.num;
 
     if (vq->inuse >= vq->vring.num) {
-        virtio_error(vdev, "Virtqueue size exceeded");
+        virtio_error(vdev, "Virtqueue %u device %s size exceeded", vq->queue_index, vdev->name);
         goto done;
     }