Message ID | CABdb7358YVXbk++Z7s+q6Z5O0Q=6CQiQvYsTdc35eqmQeUnP_Q@mail.gmail.com (mailing list archive) |
---|---|
State | New, archived |
Hi Ladi,

Thanks a lot for looking into this and replying.

I will do my best to rebuild and deploy Alpine's qemu packages with this patch included, but I'm not sure it's feasible yet. In any case, would it be possible to have this patch included in the next qemu release? The current error message is helpful, but knowing which device was involved will be much more helpful.

Regarding the environment, I'm not doing migrations, and a managed save is only done in case the host needs to be rebooted or shut down. The QEMU process has been running the VM since the host was started, and this failure is occurring randomly without any previous managed save.

As part of troubleshooting, on one of the guests I switched from virtio_blk to virtio_scsi for the guest disks, but I will need more time to see if that helped. If I have this problem again I will follow your advice and remove virtio_balloon.

Another question: is there any way to monitor the virtqueue size, either from the guest itself or from the host? Any file in sysfs or proc? This may help to understand in which conditions this is happening and to react faster to mitigate the problem.

Thanks again for your help with this!

Fer

On vie, jun 16, 2017 at 8:58 , Ladi Prosek <lprosek@redhat.com> wrote:

Hi,

Would you be able to enhance the error message and rebuild QEMU?

--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -856,7 +856,7 @@ void *virtqueue_pop(VirtQueue *vq, size_t sz)
     max = vq->vring.num;

     if (vq->inuse >= vq->vring.num) {
-        virtio_error(vdev, "Virtqueue size exceeded");
+        virtio_error(vdev, "Virtqueue %u device %s size exceeded", vq->queue_index, vdev->name);
         goto done;
     }

This would at least confirm the theory that it's caused by virtio-blk-pci. If rebuilding is not feasible, I would start by removing other virtio devices -- particularly balloon, which has had quite a few virtio-related bugs fixed recently.

Does your environment involve VM migrations or saving/resuming, or does the crashing QEMU process always run the VM from its boot?

Thanks!
On Fri, Jun 16, 2017 at 12:11 PM, Fernando Casas Schössow <casasfernando@hotmail.com> wrote:
> In any case, would it be possible to have this patch included in the next
> qemu release?

Yes, I have already added this to my todo list.

> As part of troubleshooting, on one of the guests I switched from virtio_blk
> to virtio_scsi for the guest disks, but I will need more time to see if that
> helped. If I have this problem again I will follow your advice and remove
> virtio_balloon.

Thanks, please keep us posted.

> Another question: is there any way to monitor the virtqueue size, either
> from the guest itself or from the host? Any file in sysfs or proc?

The problem is not in the virtqueue size but in one piece of internal state ("inuse") which is meant to track the number of buffers "checked out" by QEMU. It's being compared to the virtqueue size merely as a sanity check.

I'm afraid that there's no way to expose this variable without rebuilding QEMU. The best you could do is attach gdb to the QEMU process and use some clever data access breakpoints to catch suspicious writes to the variable. Although it's likely that it just creeps up slowly and you won't see anything interesting. It's probably beyond reasonable at this point anyway.

I would continue with the elimination process (virtio_scsi instead of virtio_blk, no balloon, etc.), and then, once we know which device it is, maybe we can add some instrumentation to the code.
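[Editor's note: the "inuse" accounting described above can be illustrated with a minimal C model. This is not QEMU code -- the struct and function names are invented for illustration; only the sanity check mirrors the quoted patch. The point is that inuse counts buffers popped by the device and not yet returned to the guest, so a pop without a matching push makes it creep up until the check fires.]

```c
#include <assert.h>
#include <stdio.h>

/* Toy model: inuse tracks buffers "checked out" by the device model.
 * Comparing it against the ring size is only a sanity check; a leaked
 * buffer (pop with no matching push) eventually trips it. */
typedef struct {
    unsigned vring_num; /* virtqueue size */
    unsigned inuse;     /* buffers currently checked out */
} VirtQueueModel;

/* Check out one buffer. Returns 0 on success, -1 when the sanity
 * check fires (the condition guarded in virtqueue_pop()). */
int vq_pop(VirtQueueModel *vq)
{
    if (vq->inuse >= vq->vring_num) {
        fprintf(stderr, "Virtqueue size exceeded\n");
        return -1;
    }
    vq->inuse++;
    return 0;
}

/* Return a completed buffer to the guest. */
void vq_push(VirtQueueModel *vq)
{
    assert(vq->inuse > 0);
    vq->inuse--;
}
```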
Hi Ladi,

Today two guests failed again at different times of day. One of them was the one I switched from virtio_blk to virtio_scsi, so this change didn't solve the problem. Now in this guest I also disabled virtio_balloon, continuing with the elimination process.

Also, this time I found a different error message in the guest console. In the guest already switched to virtio_scsi:

virtio_scsi virtio2: request:id 44 is not a head!

Followed by the usual "task blocked for more than 120 seconds." error.

On the guest still running on virtio_blk the error was similar:

virtio_blk virtio2: req.0:id 42 is not a head!
blk_update_request: I/O error, dev vda, sector 645657736
Buffer I/O error on dev dm-1, logical block 7413821, lost async page write

Followed by the usual "task blocked for more than 120 seconds." error.

Do you think that the blk_update_request and the buffer I/O errors may be a consequence of the previous "is not a head!" error, or should I be worried about a storage-level issue here?

Now I will wait to see whether disabling virtio_balloon helps or not, and report back.

Thanks.

Fer
Hi Fernando,

On Tue, Jun 20, 2017 at 12:10 AM, Fernando Casas Schössow <casasfernando@hotmail.com> wrote:
> Today two guests failed again at different times of day.
> One of them was the one I switched from virtio_blk to virtio_scsi, so this
> change didn't solve the problem.
>
> In the guest already switched to virtio_scsi:
>
> virtio_scsi virtio2: request:id 44 is not a head!
>
> On the guest still running on virtio_blk the error was similar:
>
> virtio_blk virtio2: req.0:id 42 is not a head!

Honestly, this is starting to look more and more like memory corruption. Two different virtio devices and two different guest operating systems; that would have to be a bug in the common virtio code, and we would have seen it somewhere else already.

Would it be possible to run a thorough memtest on the host, just in case?
Hi Ladi,

In this case both guests are CentOS 7.3 running the same kernel, 3.10.0-514.21.1. Also, the guest that fails most frequently is running Docker with 4 or 5 containers.

Another thing I would like to mention is that the host is running Alpine's default grsec-patched kernel. I have the option to install a vanilla kernel as well. Would it make sense to switch to the vanilla kernel on the host and see if that helps?

And last but not least, KSM is enabled on the host. Should I disable it?

Following your advice I will run memtest on the host and report back. Just as a side comment, the host is running on ECC memory.

Thanks for all your help.

Fer.
On Tue, Jun 20, 2017 at 8:30 AM, Fernando Casas Schössow <casasfernando@hotmail.com> wrote:
> Another thing I would like to mention is that the host is running Alpine's
> default grsec-patched kernel. I have the option to install a vanilla kernel
> as well. Would it make sense to switch to the vanilla kernel on the host and
> see if that helps?

The host kernel is less likely to be responsible for this, in my opinion. I'd hold off on that for now.

> And last but not least, KSM is enabled on the host. Should I disable it?

Could be worth a try.

> Following your advice I will run memtest on the host and report back. Just
> as a side comment, the host is running on ECC memory.

I see. Would it be possible for you, once a guest is in the broken state, to make it available for debugging? By attaching gdb to the QEMU process, for example, and letting me poke around it remotely?

Thanks!
Hi Ladi,

Sorry for the delay in my reply.

I will leave the host kernel alone for now, then. For the last 15 hours or so I have been running memtest86+ on the host. So far so good: two passes, no errors. I will try to leave it running for at least another 24 hours and report back the results. Hopefully we can rule out a memory issue at the hardware level.

Regarding KSM, that will be the next thing I disable if the guests still crash after removing the balloon device.

About leaving a guest in a failed state for you to debug it remotely: that's absolutely an option. We just need to coordinate so I can give you remote access to the host and so on. Let me know if any preparation is needed in advance and which tools you need installed on the host.

Once again I would like to thank you for all your help and your great disposition!

Cheers,

Fer
Hi Fernando,

On Wed, Jun 21, 2017 at 2:19 PM, Fernando Casas Schössow <casasfernando@hotmail.com> wrote:
> About leaving a guest in a failed state for you to debug it remotely: that's
> absolutely an option. We just need to coordinate so I can give you remote
> access to the host and so on. Let me know if any preparation is needed in
> advance and which tools you need installed on the host.

I think that gdbserver attached to the QEMU process should be enough. When the VM gets into the broken state, please do something like:

  gdbserver --attach host:12345 <QEMU pid>

and let me know the host name and port (12345 in the above example).

> Once again I would like to thank you for all your help and your great
> disposition!

You're absolutely welcome; I don't think I've done anything helpful so far :)
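[Editor's note: the client side of the gdbserver setup above would look roughly like the following sketch. The host name, port, and binary path are placeholders, not values from the thread; the binary passed to gdb should match the QEMU build actually running on the host so that symbols resolve.]

```
$ gdb /usr/bin/qemu-system-x86_64        # placeholder path; use the host's QEMU binary
(gdb) target remote examplehost:12345    # host:port from the gdbserver invocation above
(gdb) info threads                       # sanity check: list the QEMU process's threads
```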
Hi Ladi,

Small update: memtest86+ was running on the host for more than 54 hours. Eight passes were completed and no memory errors were found, so for now I think we can assume that the host memory is OK.

I just started all the guests one hour ago. I will monitor them, and once one fails I will attach the debugger and let you know.

Thanks.

Fer
Hi Ladi, After running for about 15hrs two different guests (one Windows, one Linux) crashed with around 1 hour difference and the same error in qemu log "Virqueue size exceeded". The Linux guest was already running on virtio_scsi and without virtio_balloon. :( I compiled and attached gdbserver to the qemu process for this guest but when I did this I got the following warning in gdbserver: warning: Cannot call inferior functions, Linux kernel PaX protection forbids return to non-executable pages! The default Alpine kernel is a grsec kernel. Not sure if this will interfere with debugging or not but I suspect yes. If you need me to replace the grsec kernel with a vanilla one (also available as an option in Alpine) let me know and I will do so. Otherwise send me an email directly so I can share with you the host:port details so you can connect to gdbserver. Thanks, Fer On vie, jun 23, 2017 at 8:29 , Fernando Casas Schössow <casasfernando@hotmail.com> wrote: Hi Ladi, Small update. Memtest86+ was running on the host for more than 54 hours. 8 passes were completed and no memory errors found. For now I think we can assume that the host memory is ok. I just started all the guests one hour ago. I will monitor them and once one fails I will attach the debugger and let you know. Thanks. Fer On jue, jun 22, 2017 at 9:43 , Ladi Prosek <lprosek@redhat.com> wrote: Hi Fernando, On Wed, Jun 21, 2017 at 2:19 PM, Fernando Casas Schössow <casasfernando@hotmail.com<mailto:casasfernando@hotmail.com>> wrote: Hi Ladi, Sorry for the delay in my reply. I will leave the host kernel alone for now then. For the last 15 hours or so I'm running memtest86+ on the host. So far so good. Two passes no errors so far. I will try to leave it running for at least another 24hr and report back the results. Hopefully we can discard the memory issue at hardware level. Regarding KSM, that will be the next thing I will disable if after removing the balloon device guests still crash. 
About leaving a guest in a failed state for you to debug it remotely, that's absolutely an option. We just need to coordinate so I can give you remote access to the host and so on. Let me know if any preparation is needed in advance and which tools you need installed on the host. I think that gdbserver attached to the QEMU process should be enough. When the VM gets into the broken state please do something like: gdbserver --attach host:12345 <QEMU pid> and let me know the host name and port (12345 in the above example). Once again, I would like to thank you for all your help and your great disposition! You're absolutely welcome, I don't think I've done anything helpful so far :) Cheers, Fer On mar, jun 20, 2017 at 9:52 , Ladi Prosek <lprosek@redhat.com> wrote: The host kernel is less likely to be responsible for this, in my opinion. I'd hold off on that for now. And last but not least KSM is enabled on the host. Should I disable it? Could be worth the try. Following your advice I will run memtest on the host and report back. Just as a side comment, the host is running on ECC memory. I see. Would it be possible for you, once a guest is in the broken state, to make it available for debugging? By attaching gdb to the QEMU process for example and letting me poke around it remotely? Thanks!
Hi, Sorry for resurrecting this thread after so long, but I just upgraded the host to Qemu 3.1 and libvirt 4.10 and I'm still facing this problem. At the moment I cannot use virtio disks (neither virtio-blk nor virtio-scsi) with my guests without running into this issue, so as a workaround I'm using emulated SATA storage, which is not ideal but is perfectly stable. Do you have any suggestions on how I can make progress troubleshooting this? Qemu is not crashing, so I don't have any dumps that can be analyzed. The guest is just "stuck" and all I can do is destroy it and start it again. It's really frustrating that after all this time I couldn't find the cause of this issue, so any ideas are welcome. Thanks. Fernando
On Thu, Jan 31, 2019 at 11:32:32AM +0000, Fernando Casas Schössow wrote: > Sorry for resurrecting this thread after so long but I just upgraded the host to Qemu 3.1 and libvirt 4.10 and I'm still facing this problem. > At the moment I cannot use virtio disks (virtio-blk nor virtio-scsi) with my guests in order to avoid this issue so as a workaround I'm using SATA emulated storage which is not ideal but is perfectly stable. > > Do you have any suggestions on how can I progress troubleshooting? > Qemu is not crashing so I don't have any dumps that can be analyzed. The guest is just "stuck" and all I can do is destroy it and start it again. > It's really frustrating that after all this time I couldn't find the cause for this issue so any ideas are welcome. Hi Fernando, Please post your QEMU command-line (ps aux | grep qemu) and the details of the guest operating system and version. Stefan
Hi Stefan, Thanks for looking into this. Please find the requested details below. If you need any further details (like the host storage details) just let me know. Host details: CPU: Intel(R) Xeon(R) CPU E3-1230 V2 Memory: 32GB (ECC) OS: Alpine 3.9 Kernel version: 4.19.18-0-vanilla Qemu version: 3.1 Libvirt version: 4.10 Networking: openvswitch-2.10.1 Windows guest: OS: Windows Server 2012 R2 64bit (up to date) Virtio drivers version: 141 (stable) Command line: /usr/bin/qemu-system-x86_64 -name guest=DCHOMENET02,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-17-DCHOMENET02/master-key.aes -machine pc-i440fx-3.1,accel=kvm,usb=off,dump-guest-core=off -cpu IvyBridge,ss=on,vmx=on,pcid=on,hypervisor=on,arat=on,tsc_adjust=on,umip=on,xsaveopt=on,hv_time,hv_relaxed,hv_vapic,hv_spinlocks=0x1fff -drive file=/usr/share/edk2.git/ovmf-x64/OVMF_CODE-pure-efi.fd,if=pflash,format=raw,unit=0,readonly=on -drive file=/var/lib/libvirt/qemu/nvram/DCHOMENET02_VARS.fd,if=pflash,format=raw,unit=1 -m 2048 -realtime mlock=off -smp 2,sockets=2,cores=1,threads=1 -uuid 84afccf7-f164-4d0d-b9c9-c92ffab00848 -no-user-config -nodefaults -chardev socket,id=charmonitor,fd=42,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=localtime,driftfix=slew -global kvm-pit.lost_tick_policy=delay -no-hpet -no-shutdown -global PIIX4_PM.disable_s3=1 -global PIIX4_PM.disable_s4=1 -boot strict=on -device ich9-usb-ehci1,id=usb,bus=pci.0,addr=0x4.0x7 -device ich9-usb-uhci1,masterbus=usb.0,firstport=0,bus=pci.0,multifunction=on,addr=0x4 -device ich9-usb-uhci2,masterbus=usb.0,firstport=2,bus=pci.0,addr=0x4.0x1 -device ich9-usb-uhci3,masterbus=usb.0,firstport=4,bus=pci.0,addr=0x4.0x2 -device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x5 -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x6 -drive file=/storage/storage-ssd-vms/virtual_machines_ssd/dchomenet02.qcow2,format=qcow2,if=none,id=drive-scsi0-0-0-0,cache=none,aio=threads -device 
scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=1,write-cache=on -drive file=/storage/storage-hdd-vms/virtual_machines_hdd/dchomenet02_storage.qcow2,format=qcow2,if=none,id=drive-scsi0-0-0-1,cache=none,aio=threads -device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=1,drive=drive-scsi0-0-0-1,id=scsi0-0-0-1,write-cache=on -netdev tap,fd=44,id=hostnet0,vhost=on,vhostfd=45 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:91:aa:a1,bus=pci.0,addr=0x3 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -chardev spicevmc,id=charchannel0,name=vdagent -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.spice.0 -chardev socket,id=charchannel1,fd=47,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=2,chardev=charchannel1,id=channel1,name=org.qemu.guest_agent.0 -device usb-tablet,id=input0,bus=usb.0,port=1 -spice port=5905,addr=127.0.0.1,disable-ticketing,seamless-migration=on -device qxl-vga,id=video0,ram_size=67108864,vram_size=67108864,vram64_size_mb=0,vgamem_mb=16,max_outputs=1,bus=pci.0,addr=0x2 -chardev spicevmc,id=charredir0,name=usbredir -device usb-redir,chardev=charredir0,id=redir0,bus=usb.0,port=2 -chardev spicevmc,id=charredir1,name=usbredir -device usb-redir,chardev=charredir1,id=redir1,bus=usb.0,port=3 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x8 -sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny -msg timestamp=on Linux guest: OS: CentOS 7.6 64bit (up to date) Kernel version: 3.10.0-957.1.3.el7.x86_64 Command line: /usr/bin/qemu-system-x86_64 -name guest=DOCKER01,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-21-DOCKER01/master-key.aes -machine pc-i440fx-3.1,accel=kvm,usb=off,dump-guest-core=off -cpu IvyBridge,ss=on,vmx=on,pcid=on,hypervisor=on,arat=on,tsc_adjust=on,umip=on,xsaveopt=on -drive 
file=/usr/share/edk2.git/ovmf-x64/OVMF_CODE-pure-efi.fd,if=pflash,format=raw,unit=0,readonly=on -drive file=/var/lib/libvirt/qemu/nvram/DOCKER01_VARS.fd,if=pflash,format=raw,unit=1 -m 2048 -realtime mlock=off -smp 2,sockets=2,cores=1,threads=1 -uuid 4705b146-3b14-4c20-923c-42105d47e7fc -no-user-config -nodefaults -chardev socket,id=charmonitor,fd=44,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=delay -no-hpet -no-shutdown -global PIIX4_PM.disable_s3=1 -global PIIX4_PM.disable_s4=1 -boot strict=on -device ich9-usb-ehci1,id=usb,bus=pci.0,addr=0x4.0x7 -device ich9-usb-uhci1,masterbus=usb.0,firstport=0,bus=pci.0,multifunction=on,addr=0x4 -device ich9-usb-uhci2,masterbus=usb.0,firstport=2,bus=pci.0,addr=0x4.0x1 -device ich9-usb-uhci3,masterbus=usb.0,firstport=4,bus=pci.0,addr=0x4.0x2 -device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x6 -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x5 -drive file=/storage/storage-ssd-vms/virtual_machines_ssd/docker01.qcow2,format=qcow2,if=none,id=drive-scsi0-0-0-0,cache=none,aio=threads -device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=1,write-cache=on -netdev tap,fd=46,id=hostnet0,vhost=on,vhostfd=48 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:1c:af:ce,bus=pci.0,addr=0x3 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -chardev socket,id=charchannel0,fd=49,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_agent.0 -chardev spicevmc,id=charchannel1,name=vdagent -device virtserialport,bus=virtio-serial0.0,nr=2,chardev=charchannel1,id=channel1,name=com.redhat.spice.0 -spice port=5904,addr=127.0.0.1,disable-ticketing,seamless-migration=on -device qxl-vga,id=video0,ram_size=67108864,vram_size=67108864,vram64_size_mb=0,vgamem_mb=16,max_outputs=1,bus=pci.0,addr=0x2 -chardev 
spicevmc,id=charredir0,name=usbredir -device usb-redir,chardev=charredir0,id=redir0,bus=usb.0,port=2 -chardev spicevmc,id=charredir1,name=usbredir -device usb-redir,chardev=charredir1,id=redir1,bus=usb.0,port=3 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x7 -object rng-random,id=objrng0,filename=/dev/random -device virtio-rng-pci,rng=objrng0,id=rng0,bus=pci.0,addr=0x8 -sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny -msg timestamp=on Thanks. Kind regards, Fernando On vie, feb 1, 2019 at 6:48 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote: On Thu, Jan 31, 2019 at 11:32:32AM +0000, Fernando Casas Schössow wrote: Sorry for resurrecting this thread after so long but I just upgraded the host to Qemu 3.1 and libvirt 4.10 and I'm still facing this problem. At the moment I cannot use virtio disks (virtio-blk nor virtio-scsi) with my guests in order to avoid this issue so as a workaround I'm using SATA emulated storage which is not ideal but is perfectly stable. Do you have any suggestions on how can I progress troubleshooting? Qemu is not crashing so I don't have any dumps that can be analyzed. The guest is just "stuck" and all I can do is destroy it and start it again. It's really frustrating that after all this time I couldn't find the cause for this issue so any ideas are welcome. Hi Fernando, Please post your QEMU command-line (ps aux | grep qemu) and the details of the guest operating system and version. Stefan
Are you sure this happens with both virtio-blk and virtio-scsi? The following patch adds more debug output. You can build as follows: $ git clone https://git.qemu.org/git/qemu.git $ cd qemu $ patch -p1 ...paste the patch here... ^D # For info on build dependencies see https://wiki.qemu.org/Hosts/Linux $ ./configure --target-list=x86_64-softmmu $ make -j4 You can configure a libvirt domain to use your custom QEMU binary by changing the <devices><emulator> tag to the qemu/x86_64-softmmu/qemu-system-x86_64 path. --- diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c index 22bd1ac34e..aa44bffa1f 100644 --- a/hw/virtio/virtio.c +++ b/hw/virtio/virtio.c @@ -879,6 +879,9 @@ void *virtqueue_pop(VirtQueue *vq, size_t sz) max = vq->vring.num; if (vq->inuse >= vq->vring.num) { + fprintf(stderr, "vdev %p (\"%s\")\n", vdev, vdev->name); + fprintf(stderr, "vq %p (idx %u)\n", vq, (unsigned int)(vq - vdev->vq)); + fprintf(stderr, "inuse %u vring.num %u\n", vq->inuse, vq->vring.num); virtio_error(vdev, "Virtqueue size exceeded"); goto done; }
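For reference, the <devices><emulator> change mentioned above looks roughly like this in the domain XML (edited with virsh edit; the path is an example assuming an in-tree build under /home/user/qemu):

```xml
<domain type='kvm'>
  <!-- ... -->
  <devices>
    <!-- Example path; point this at your freshly built binary -->
    <emulator>/home/user/qemu/x86_64-softmmu/qemu-system-x86_64</emulator>
    <!-- ... -->
  </devices>
</domain>
```

libvirt will use this binary the next time the domain is started.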
I can test again with qemu 3.1 but with previous versions yes, it was happening the same with both virtio-blk and virtio-scsi. For 3.1 I can confirm it happens for virtio-scsi (already tested it) and I can test with virtio-blk again if that will add value to the investigation. Also I'm attaching a guest console screenshot showing the errors displayed by the guest when it goes unresponsive in case it can help. Thanks for the patch. I will build the custom qemu binary and reproduce the issue. This may take a couple of days since I cannot reproduce it at will. Sometimes it takes 12 hours, sometimes 2 days until it happens. Hopefully the code below will shed more light on this problem. Thanks, Fernando On lun, feb 4, 2019 at 7:06 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote: Are you sure this happens with both virtio-blk and virtio-scsi? The following patch adds more debug output. You can build as follows: $ git clone https://git.qemu.org/git/qemu.git $ cd qemu $ patch apply -p1 ...paste the patch here... ^D # For info on build dependencies see https://wiki.qemu.org/Hosts/Linux $ ./configure --target-list=x86_64-softmmu $ make -j4 You can configure a libvirt domain to use your custom QEMU binary by changing the <devices><emulator> tag to the qemu/x86_64-softmmu/qemu-system-x86_64 path. --- diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c index 22bd1ac34e..aa44bffa1f 100644 --- a/hw/virtio/virtio.c +++ b/hw/virtio/virtio.c @@ -879,6 +879,9 @@ void *virtqueue_pop(VirtQueue *vq, size_t sz) max = vq->vring.num; if (vq->inuse >= vq->vring.num) { + fprintf(stderr, "vdev %p (\"%s\")\n", vdev, vdev->name); + fprintf(stderr, "vq %p (idx %u)\n", vq, (unsigned int)(vq - vdev->vq)); + fprintf(stderr, "inuse %u vring.num %u\n", vq->inuse, vq->vring.num); virtio_error(vdev, "Virtqueue size exceeded"); goto done; }
I can now confirm that the same happens with virtio-blk and virtio-scsi. Please find below the qemu log enhanced with the new information added by the patch provided by Stefan: vdev 0x55d22b8e10f0 ("virtio-blk") vq 0x55d22b8ebe40 (idx 0) inuse 128 vring.num 128 2019-02-06T00:40:41.742552Z qemu-system-x86_64: Virtqueue size exceeded I just changed the disk back to virtio-scsi so I can repro this again with the patched qemu and report back. Thanks. On lun, feb 4, 2019 at 8:24 AM, Fernando Casas Schössow <casasfernando@outlook.com> wrote: I can test again with qemu 3.1 but with previous versions yes, it was happening the same with both virtio-blk and virtio-scsi. For 3.1 I can confirm it happens for virtio-scsi (already tested it) and I can test with virtio-blk again if that will add value to the investigation. Also I'm attaching a guest console screenshot showing the errors displayed by the guest when it goes unresponsive in case it can help. Thanks for the patch. I will build the custom qemu binary and reproduce the issue. This may take a couple of days since I cannot reproduce it at will. Sometimes it takes 12 hours sometimes 2 days until it happens. Hopefully the code below will add more light on to this problem. Thanks, Fernando On lun, feb 4, 2019 at 7:06 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote: Are you sure this happens with both virtio-blk and virtio-scsi? The following patch adds more debug output. You can build as follows: $ git clone https://git.qemu.org/git/qemu.git $ cd qemu $ patch apply -p1 ...paste the patch here... ^D # For info on build dependencies see https://wiki.qemu.org/Hosts/Linux $ ./configure --target-list=x86_64-softmmu $ make -j4 You can configure a libvirt domain to use your custom QEMU binary by changing the <devices><emulator> tag to the qemu/x86_64-softmmu/qemu-system-x86_64 path. 
--- diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c index 22bd1ac34e..aa44bffa1f 100644 --- a/hw/virtio/virtio.c +++ b/hw/virtio/virtio.c @@ -879,6 +879,9 @@ void *virtqueue_pop(VirtQueue *vq, size_t sz) max = vq->vring.num; if (vq->inuse >= vq->vring.num) { + fprintf(stderr, "vdev %p (\"%s\")\n", vdev, vdev->name); + fprintf(stderr, "vq %p (idx %u)\n", vq, (unsigned int)(vq - vdev->vq)); + fprintf(stderr, "inuse %u vring.num %u\n", vq->inuse, vq->vring.num); virtio_error(vdev, "Virtqueue size exceeded"); goto done; }
I could also repro the same with virtio-scsi on the same guest a couple of hours later: 2019-02-06 07:10:37.672+0000: starting up libvirt version: 4.10.0, qemu version: 3.1.0, kernel: 4.19.18-0-vanilla, hostname: vmsvr01.homenet.local LC_ALL=C PATH=/bin:/sbin:/bin:/sbin:/usr/bin:/usr/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/local/sbin HOME=/root USER=root QEMU_AUDIO_DRV=spice /home/fernando/qemu-system-x86_64 -name guest=DOCKER01,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-32-DOCKER01/master-key.aes -machine pc-i440fx-3.1,accel=kvm,usb=off,dump-guest-core=off -cpu IvyBridge,ss=on,vmx=on,pcid=on,hypervisor=on,arat=on,tsc_adjust=on,umip=on,xsaveopt=on -drive file=/usr/share/edk2.git/ovmf-x64/OVMF_CODE-pure-efi.fd,if=pflash,format=raw,unit=0,readonly=on -drive file=/var/lib/libvirt/qemu/nvram/DOCKER01_VARS.fd,if=pflash,format=raw,unit=1 -m 2048 -realtime mlock=off -smp 2,sockets=2,cores=1,threads=1 -uuid 4705b146-3b14-4c20-923c-42105d47e7fc -no-user-config -nodefaults -chardev socket,id=charmonitor,fd=46,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=delay -no-hpet -no-shutdown -global PIIX4_PM.disable_s3=1 -global PIIX4_PM.disable_s4=1 -boot strict=on -device ich9-usb-ehci1,id=usb,bus=pci.0,addr=0x4.0x7 -device ich9-usb-uhci1,masterbus=usb.0,firstport=0,bus=pci.0,multifunction=on,addr=0x4 -device ich9-usb-uhci2,masterbus=usb.0,firstport=2,bus=pci.0,addr=0x4.0x1 -device ich9-usb-uhci3,masterbus=usb.0,firstport=4,bus=pci.0,addr=0x4.0x2 -device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x6 -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x5 -drive file=/storage/storage-ssd-vms/virtual_machines_ssd/docker01.qcow2,format=qcow2,if=none,id=drive-scsi0-0-0-0,cache=none,aio=threads -device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=1,write-cache=on -netdev 
tap,fd=48,id=hostnet0,vhost=on,vhostfd=50 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:1c:af:ce,bus=pci.0,addr=0x3 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -chardev socket,id=charchannel0,fd=51,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_agent.0 -chardev spicevmc,id=charchannel1,name=vdagent -device virtserialport,bus=virtio-serial0.0,nr=2,chardev=charchannel1,id=channel1,name=com.redhat.spice.0 -spice port=5904,addr=127.0.0.1,disable-ticketing,seamless-migration=on -device qxl-vga,id=video0,ram_size=67108864,vram_size=67108864,vram64_size_mb=0,vgamem_mb=16,max_outputs=1,bus=pci.0,addr=0x2 -chardev spicevmc,id=charredir0,name=usbredir -device usb-redir,chardev=charredir0,id=redir0,bus=usb.0,port=2 -chardev spicevmc,id=charredir1,name=usbredir -device usb-redir,chardev=charredir1,id=redir1,bus=usb.0,port=3 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x7 -object rng-random,id=objrng0,filename=/dev/random -device virtio-rng-pci,rng=objrng0,id=rng0,bus=pci.0,addr=0x8 -sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny -msg timestamp=on 2019-02-06 07:10:37.672+0000: Domain id=32 is tainted: high-privileges char device redirected to /dev/pts/5 (label charserial0) vdev 0x5585456ef6b0 ("virtio-scsi") vq 0x5585456f90a0 (idx 2) inuse 128 vring.num 128 2019-02-06T13:00:46.942424Z qemu-system-x86_64: Virtqueue size exceeded I'm open to any tests or suggestions that can move the investigation forward and find the cause of this issue. Thanks. On mié, feb 6, 2019 at 8:15 AM, Fernando Casas Schössow <casasfernando@outlook.com> wrote: I can now confirm that the same happens with virtio-blk and virtio-scsi. 
Please find below the qemu log enhanced with the new information added by the patch provided by Stefan: vdev 0x55d22b8e10f0 ("virtio-blk") vq 0x55d22b8ebe40 (idx 0) inuse 128 vring.num 128 2019-02-06T00:40:41.742552Z qemu-system-x86_64: Virtqueue size exceeded I just changed the disk back to virtio-scsi so I can repro this again with the patched qemu and report back. Thanks. On lun, feb 4, 2019 at 8:24 AM, Fernando Casas Schössow <casasfernando@outlook.com> wrote: I can test again with qemu 3.1 but with previous versions yes, it was happening the same with both virtio-blk and virtio-scsi. For 3.1 I can confirm it happens for virtio-scsi (already tested it) and I can test with virtio-blk again if that will add value to the investigation. Also I'm attaching a guest console screenshot showing the errors displayed by the guest when it goes unresponsive in case it can help. Thanks for the patch. I will build the custom qemu binary and reproduce the issue. This may take a couple of days since I cannot reproduce it at will. Sometimes it takes 12 hours sometimes 2 days until it happens. Hopefully the code below will add more light on to this problem. Thanks, Fernando On lun, feb 4, 2019 at 7:06 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote: Are you sure this happens with both virtio-blk and virtio-scsi? The following patch adds more debug output. You can build as follows: $ git clone https://git.qemu.org/git/qemu.git $ cd qemu $ patch apply -p1 ...paste the patch here... ^D # For info on build dependencies see https://wiki.qemu.org/Hosts/Linux $ ./configure --target-list=x86_64-softmmu $ make -j4 You can configure a libvirt domain to use your custom QEMU binary by changing the <devices><emulator> tag to the qemu/x86_64-softmmu/qemu-system-x86_64 path. 
--- diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c index 22bd1ac34e..aa44bffa1f 100644 --- a/hw/virtio/virtio.c +++ b/hw/virtio/virtio.c @@ -879,6 +879,9 @@ void *virtqueue_pop(VirtQueue *vq, size_t sz) max = vq->vring.num; if (vq->inuse >= vq->vring.num) { + fprintf(stderr, "vdev %p (\"%s\")\n", vdev, vdev->name); + fprintf(stderr, "vq %p (idx %u)\n", vq, (unsigned int)(vq - vdev->vq)); + fprintf(stderr, "inuse %u vring.num %u\n", vq->inuse, vq->vring.num); virtio_error(vdev, "Virtqueue size exceeded"); goto done; }
On Wed, Feb 06, 2019 at 04:47:19PM +0000, Fernando Casas Schössow wrote: > I could also repro the same with virtio-scsi on the same guest a couple of hours later: > > 2019-02-06 07:10:37.672+0000: starting up libvirt version: 4.10.0, qemu version: 3.1.0, kernel: 4.19.18-0-vanilla, hostname: vmsvr01.homenet.local > LC_ALL=C PATH=/bin:/sbin:/bin:/sbin:/usr/bin:/usr/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/local/sbin HOME=/root USER=root QEMU_AUDIO_DRV=spice /home/fernando/qemu-system-x86_64 -name guest=DOCKER01,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-32-DOCKER01/master-key.aes -machine pc-i440fx-3.1,accel=kvm,usb=off,dump-guest-core=off -cpu IvyBridge,ss=on,vmx=on,pcid=on,hypervisor=on,arat=on,tsc_adjust=on,umip=on,xsaveopt=on -drive file=/usr/share/edk2.git/ovmf-x64/OVMF_CODE-pure-efi.fd,if=pflash,format=raw,unit=0,readonly=on -drive file=/var/lib/libvirt/qemu/nvram/DOCKER01_VARS.fd,if=pflash,format=raw,unit=1 -m 2048 -realtime mlock=off -smp 2,sockets=2,cores=1,threads=1 -uuid 4705b146-3b14-4c20-923c-42105d47e7fc -no-user-config -nodefaults -chardev socket,id=charmonitor,fd=46,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=delay -no-hpet -no-shutdown -global PIIX4_PM.disable_s3=1 -global PIIX4_PM.disable_s4=1 -boot strict=on -device ich9-usb-ehci1,id=usb,bus=pci.0,addr=0x4.0x7 -device ich9-usb-uhci1,masterbus=usb.0,firstport=0,bus=pci.0,multifunction=on,addr=0x4 -device ich9-usb-uhci2,masterbus=usb.0,firstport=2,bus=pci.0,addr=0x4.0x1 -device ich9-usb-uhci3,masterbus=usb.0,firstport=4,bus=pci.0,addr=0x4.0x2 -device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x6 -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x5 -drive file=/storage/storage-ssd-vms/virtual_machines_ssd/docker01.qcow2,format=qcow2,if=none,id=drive-scsi0-0-0-0,cache=none,aio=threads -device 
scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=1,write-cache=on -netdev tap,fd=48,id=hostnet0,vhost=on,vhostfd=50 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:1c:af:ce,bus=pci.0,addr=0x3 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -chardev socket,id=charchannel0,fd=51,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_agent.0 -chardev spicevmc,id=charchannel1,name=vdagent -device virtserialport,bus=virtio-serial0.0,nr=2,chardev=charchannel1,id=channel1,name=com.redhat.spice.0 -spice port=5904,addr=127.0.0.1,disable-ticketing,seamless-migration=on -device qxl-vga,id=video0,ram_size=67108864,vram_size=67108864,vram64_size_mb=0,vgamem_mb=16,max_outputs=1,bus=pci.0,addr=0x2 -chardev spicevmc,id=charredir0,name=usbredir -device usb-redir,chardev=charredir0,id=redir0,bus=usb.0,port=2 -chardev spicevmc,id=charredir1,name=usbredir -device usb-redir,chardev=charredir1,id=redir1,bus=usb.0,port=3 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x7 -object rng-random,id=objrng0,filename=/dev/random -device virtio-rng-pci,rng=objrng0,id=rng0,bus=pci.0,addr=0x8 -sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny -msg timestamp=on > 2019-02-06 07:10:37.672+0000: Domain id=32 is tainted: high-privileges > char device redirected to /dev/pts/5 (label charserial0) > vdev 0x5585456ef6b0 ("virtio-scsi") > vq 0x5585456f90a0 (idx 2) > inuse 128 vring.num 128 > 2019-02-06T13:00:46.942424Z qemu-system-x86_64: Virtqueue size exceeded > > > I'm open to any tests or suggestions that can move the investigation forward and find the cause of this issue. Thanks for collecting the data! The fact that both virtio-blk and virtio-scsi failed suggests it's not a virtqueue element leak in the virtio-blk or virtio-scsi device emulation code. 
The hung task error messages from inside the guest are a consequence of QEMU hitting the "Virtqueue size exceeded" error. QEMU refuses to process further requests after the error, causing tasks inside the guest to get stuck on I/O. I don't have a good theory regarding the root cause. Two ideas: 1. The guest is corrupting the vring or submitting more requests than will fit into the ring. Somewhat unlikely because it happens with both Windows and Linux guests. 2. QEMU's virtqueue code is buggy, maybe the memory region cache which is used for fast guest RAM accesses. Here is an expanded version of the debug patch which might help identify which of these scenarios is likely. Sorry, it requires running the guest again! This time let's make QEMU dump core so both QEMU state and guest RAM are captured for further debugging. That way it will be possible to extract more information using gdb without rerunning. Stefan --- diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c index a1ff647a66..28d89fcbcb 100644 --- a/hw/virtio/virtio.c +++ b/hw/virtio/virtio.c @@ -866,6 +866,7 @@ void *virtqueue_pop(VirtQueue *vq, size_t sz) return NULL; } rcu_read_lock(); + uint16_t old_shadow_avail_idx = vq->shadow_avail_idx; if (virtio_queue_empty_rcu(vq)) { goto done; } @@ -879,6 +880,12 @@ void *virtqueue_pop(VirtQueue *vq, size_t sz) max = vq->vring.num; if (vq->inuse >= vq->vring.num) { + fprintf(stderr, "vdev %p (\"%s\")\n", vdev, vdev->name); + fprintf(stderr, "vq %p (idx %u)\n", vq, (unsigned int)(vq - vdev->vq)); + fprintf(stderr, "inuse %u vring.num %u\n", vq->inuse, vq->vring.num); + fprintf(stderr, "old_shadow_avail_idx %u last_avail_idx %u avail_idx %u\n", old_shadow_avail_idx, vq->last_avail_idx, vq->shadow_avail_idx); + fprintf(stderr, "avail %#" HWADDR_PRIx " avail_idx (cache bypassed) %u\n", vq->vring.avail, virtio_lduw_phys(vdev, vq->vring.avail + offsetof(VRingAvail, idx))); + fprintf(stderr, "used_idx %u\n", vq->used_idx); + abort(); /* <--- core dump! 
*/ virtio_error(vdev, "Virtqueue size exceeded"); goto done; }
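Once a core file exists, the same virtqueue state can be inspected offline with gdb against the patched binary. A rough sketch of such a session follows; the binary path, the core file path, and the VirtQueue address (which would come from the vq pointer the patch prints to stderr) are all placeholders, and the field names assume the QEMU 3.1 VirtQueue layout. Since VirtQueue is defined inside hw/virtio/virtio.c, this needs a binary built with debug info (e.g. ./configure --enable-debug):

```gdb
# Load the patched binary together with the core dump (paths are examples).
file /home/fernando/qemu-system-x86_64
core-file /var/lib/libvirt/qemu/core.qemu
# Use the 'vq' pointer from the stderr output to examine the queue state.
set $vq = (VirtQueue *) 0x5585456f90a0
print $vq->vring.num
print $vq->inuse
print $vq->last_avail_idx
print $vq->used_idx
print $vq->shadow_avail_idx
```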
Thanks for looking into this Stefan. I rebuilt Qemu with the new patch and got a couple of guests running with the new build. Two of them using virtio-scsi and another one using virtio-blk. Now I'm waiting for any of them to crash. I also set libvirt to include the guest memory in the qemu dumps as I understood you will want to look at both (qemu dump and guest memory dump). I will reply to this thread once I have any news. Kind regards. Fernando On lun, feb 11, 2019 at 4:17 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote: On Wed, Feb 06, 2019 at 04:47:19PM +0000, Fernando Casas Schössow wrote: I could also repro the same with virtio-scsi on the same guest a couple of hours later: 2019-02-06 07:10:37.672+0000: starting up libvirt version: 4.10.0, qemu version: 3.1.0, kernel: 4.19.18-0-vanilla, hostname: vmsvr01.homenet.local LC_ALL=C PATH=/bin:/sbin:/bin:/sbin:/usr/bin:/usr/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/local/sbin HOME=/root USER=root QEMU_AUDIO_DRV=spice /home/fernando/qemu-system-x86_64 -name guest=DOCKER01,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-32-DOCKER01/master-key.aes -machine pc-i440fx-3.1,accel=kvm,usb=off,dump-guest-core=off -cpu IvyBridge,ss=on,vmx=on,pcid=on,hypervisor=on,arat=on,tsc_adjust=on,umip=on,xsaveopt=on -drive file=/usr/share/edk2.git/ovmf-x64/OVMF_CODE-pure-efi.fd,if=pflash,format=raw,unit=0,readonly=on -drive file=/var/lib/libvirt/qemu/nvram/DOCKER01_VARS.fd,if=pflash,format=raw,unit=1 -m 2048 -realtime mlock=off -smp 2,sockets=2,cores=1,threads=1 -uuid 4705b146-3b14-4c20-923c-42105d47e7fc -no-user-config -nodefaults -chardev socket,id=charmonitor,fd=46,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=delay -no-hpet -no-shutdown -global PIIX4_PM.disable_s3=1 -global PIIX4_PM.disable_s4=1 -boot strict=on -device ich9-usb-ehci1,id=usb,bus=pci.0,addr=0x4.0x7 -device 
ich9-usb-uhci1,masterbus=usb.0,firstport=0,bus=pci.0,multifunction=on,addr=0x4 -device ich9-usb-uhci2,masterbus=usb.0,firstport=2,bus=pci.0,addr=0x4.0x1 -device ich9-usb-uhci3,masterbus=usb.0,firstport=4,bus=pci.0,addr=0x4.0x2 -device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x6 -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x5 -drive file=/storage/storage-ssd-vms/virtual_machines_ssd/docker01.qcow2,format=qcow2,if=none,id=drive-scsi0-0-0-0,cache=none,aio=threads -device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=1,write-cache=on -netdev tap,fd=48,id=hostnet0,vhost=on,vhostfd=50 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:1c:af:ce,bus=pci.0,addr=0x3 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -chardev socket,id=charchannel0,fd=51,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_agent.0 -chardev spicevmc,id=charchannel1,name=vdagent -device virtserialport,bus=virtio-serial0.0,nr=2,chardev=charchannel1,id=channel1,name=com.redhat.spice.0 -spice port=5904,addr=127.0.0.1,disable-ticketing,seamless-migration=on -device qxl-vga,id=video0,ram_size=67108864,vram_size=67108864,vram64_size_mb=0,vgamem_mb=16,max_outputs=1,bus=pci.0,addr=0x2 -chardev spicevmc,id=charredir0,name=usbredir -device usb-redir,chardev=charredir0,id=redir0,bus=usb.0,port=2 -chardev spicevmc,id=charredir1,name=usbredir -device usb-redir,chardev=charredir1,id=redir1,bus=usb.0,port=3 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x7 -object rng-random,id=objrng0,filename=/dev/random -device virtio-rng-pci,rng=objrng0,id=rng0,bus=pci.0,addr=0x8 -sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny -msg timestamp=on 2019-02-06 07:10:37.672+0000: Domain id=32 is tainted: high-privileges char device redirected to /dev/pts/5 (label charserial0) vdev 0x5585456ef6b0 ("virtio-scsi") vq 
0x5585456f90a0 (idx 2) inuse 128 vring.num 128 2019-02-06T13:00:46.942424Z qemu-system-x86_64: Virtqueue size exceeded I'm open to any tests or suggestions that can move the investigation forward and find the cause of this issue. Thanks for collecting the data! The fact that both virtio-blk and virtio-scsi failed suggests it's not a virtqueue element leak in the virtio-blk or virtio-scsi device emulation code. The hung task error messages from inside the guest are a consequence of QEMU hitting the "Virtqueue size exceeded" error. QEMU refuses to process further requests after the error, causing tasks inside the guest to get stuck on I/O. I don't have a good theory regarding the root cause. Two ideas: 1. The guest is corrupting the vring or submitting more requests than will fit into the ring. Somewhat unlikely because it happens with both Windows and Linux guests. 2. QEMU's virtqueue code is buggy, maybe the memory region cache which is used for fast guest RAM accesses. Here is an expanded version of the debug patch which might help identify which of these scenarios is likely. Sorry, it requires running the guest again! This time let's make QEMU dump core so both QEMU state and guest RAM are captured for further debugging. That way it will be possible to extract more information using gdb without rerunning. 
Stefan --- diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c index a1ff647a66..28d89fcbcb 100644 --- a/hw/virtio/virtio.c +++ b/hw/virtio/virtio.c @@ -866,6 +866,7 @@ void *virtqueue_pop(VirtQueue *vq, size_t sz) return NULL; } rcu_read_lock(); + uint16_t old_shadow_avail_idx = vq->shadow_avail_idx; if (virtio_queue_empty_rcu(vq)) { goto done; } @@ -879,6 +880,12 @@ void *virtqueue_pop(VirtQueue *vq, size_t sz) max = vq->vring.num; if (vq->inuse >= vq->vring.num) { + fprintf(stderr, "vdev %p (\"%s\")\n", vdev, vdev->name); + fprintf(stderr, "vq %p (idx %u)\n", vq, (unsigned int)(vq - vdev->vq)); + fprintf(stderr, "inuse %u vring.num %u\n", vq->inuse, vq->vring.num); + fprintf(stderr, "old_shadow_avail_idx %u last_avail_idx %u avail_idx %u\n", old_shadow_avail_idx, vq->last_avail_idx, vq->shadow_avail_idx); + fprintf(stderr, "avail %#" HWADDR_PRIx " avail_idx (cache bypassed) %u\n", vq->vring.avail, virtio_lduw_phys(vdev, vq->vring.avail + offsetof(VRingAvail, idx))); + fprintf(stderr, "used_idx %u\n", vq->used_idx); + abort(); /* <--- core dump! */ virtio_error(vdev, "Virtqueue size exceeded"); goto done; }
It took a few days but last night the problem was reproduced. This is the information from the log: vdev 0x55f261d940f0 ("virtio-blk") vq 0x55f261d9ee40 (idx 0) inuse 128 vring.num 128 old_shadow_avail_idx 58874 last_avail_idx 58625 avail_idx 58874 avail 0x3d87a800 avail_idx (cache bypassed) 58625 used_idx 58497 2019-02-18 03:20:08.605+0000: shutting down, reason=crashed The dump file, including guest memory, was generated successfully (after gzip the file is around 492MB). I switched the guest now to virtio-scsi to get the information and dump with this setup as well. How should we proceed? Thanks. On lun, feb 11, 2019 at 4:17 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote: Thanks for collecting the data! The fact that both virtio-blk and virtio-scsi failed suggests it's not a virtqueue element leak in the virtio-blk or virtio-scsi device emulation code. The hung task error messages from inside the guest are a consequence of QEMU hitting the "Virtqueue size exceeded" error. QEMU refuses to process further requests after the error, causing tasks inside the guest to get stuck on I/O. I don't have a good theory regarding the root cause. Two ideas: 1. The guest is corrupting the vring or submitting more requests than will fit into the ring. Somewhat unlikely because it happens with both Windows and Linux guests. 2. QEMU's virtqueue code is buggy, maybe the memory region cache which is used for fast guest RAM accesses. Here is an expanded version of the debug patch which might help identify which of these scenarios is likely. Sorry, it requires running the guest again! This time let's make QEMU dump core so both QEMU state and guest RAM are captured for further debugging. That way it will be possible to extract more information using gdb without rerunning. 
Stefan --- diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c index a1ff647a66..28d89fcbcb 100644 --- a/hw/virtio/virtio.c +++ b/hw/virtio/virtio.c @@ -866,6 +866,7 @@ void *virtqueue_pop(VirtQueue *vq, size_t sz) return NULL; } rcu_read_lock(); + uint16_t old_shadow_avail_idx = vq->shadow_avail_idx; if (virtio_queue_empty_rcu(vq)) { goto done; } @@ -879,6 +880,12 @@ void *virtqueue_pop(VirtQueue *vq, size_t sz) max = vq->vring.num; if (vq->inuse >= vq->vring.num) { + fprintf(stderr, "vdev %p (\"%s\")\n", vdev, vdev->name); + fprintf(stderr, "vq %p (idx %u)\n", vq, (unsigned int)(vq - vdev->vq)); + fprintf(stderr, "inuse %u vring.num %u\n", vq->inuse, vq->vring.num); + fprintf(stderr, "old_shadow_avail_idx %u last_avail_idx %u avail_idx %u\n", old_shadow_avail_idx, vq->last_avail_idx, vq->shadow_avail_idx); + fprintf(stderr, "avail %#" HWADDR_PRIx " avail_idx (cache bypassed) %u\n", vq->vring.avail, virtio_lduw_phys(vdev, vq->vring.avail + offsetof(VRingAvail, idx))); + fprintf(stderr, "used_idx %u\n", vq->used_idx); + abort(); /* <--- core dump! */ virtio_error(vdev, "Virtqueue size exceeded"); goto done; }
Problem reproduced with virtio-scsi as well on the same guest, this time it took less than a day. Information from the log file: vdev 0x55823f119f90 ("virtio-scsi") vq 0x55823f122e80 (idx 2) inuse 128 vring.num 128 old_shadow_avail_idx 58367 last_avail_idx 58113 avail_idx 58367 avail 0x3de8a800 avail_idx (cache bypassed) 58113 used_idx 57985 2019-02-19 04:20:43.291+0000: shutting down, reason=crashed Got the dump file as well, including guest memory. Size is around 486MB after compression. Is there any other information I should collect to progress the investigation? Thanks. On lun, feb 18, 2019 at 8:21 AM, Fernando Casas Schössow <casasfernando@outlook.com> wrote: It took a few days but last night the problem was reproduced. This is the information from the log: vdev 0x55f261d940f0 ("virtio-blk") vq 0x55f261d9ee40 (idx 0) inuse 128 vring.num 128 old_shadow_avail_idx 58874 last_avail_idx 58625 avail_idx 58874 avail 0x3d87a800 avail_idx (cache bypassed) 58625 used_idx 58497 2019-02-18 03:20:08.605+0000: shutting down, reason=crashed The dump file, including guest memory, was generated successfully (after gzip the file is around 492MB). I switched the guest now to virtio-scsi to get the information and dump with this setup as well. How should we proceed? Thanks. On lun, feb 11, 2019 at 4:17 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote: Thanks for collecting the data! The fact that both virtio-blk and virtio-scsi failed suggests it's not a virtqueue element leak in the virtio-blk or virtio-scsi device emulation code. The hung task error messages from inside the guest are a consequence of QEMU hitting the "Virtqueue size exceeded" error. QEMU refuses to process further requests after the error, causing tasks inside the guest to get stuck on I/O. I don't have a good theory regarding the root cause. Two ideas: 1. The guest is corrupting the vring or submitting more requests than will fit into the ring. 
Somewhat unlikely because it happens with both Windows and Linux guests. 2. QEMU's virtqueue code is buggy, maybe the memory region cache which is used for fast guest RAM accesses. Here is an expanded version of the debug patch which might help identify which of these scenarios is likely. Sorry, it requires running the guest again! This time let's make QEMU dump core so both QEMU state and guest RAM are captured for further debugging. That way it will be possible to extract more information using gdb without rerunning. Stefan --- diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c index a1ff647a66..28d89fcbcb 100644 --- a/hw/virtio/virtio.c +++ b/hw/virtio/virtio.c @@ -866,6 +866,7 @@ void *virtqueue_pop(VirtQueue *vq, size_t sz) return NULL; } rcu_read_lock(); + uint16_t old_shadow_avail_idx = vq->shadow_avail_idx; if (virtio_queue_empty_rcu(vq)) { goto done; } @@ -879,6 +880,12 @@ void *virtqueue_pop(VirtQueue *vq, size_t sz) max = vq->vring.num; if (vq->inuse >= vq->vring.num) { + fprintf(stderr, "vdev %p (\"%s\")\n", vdev, vdev->name); + fprintf(stderr, "vq %p (idx %u)\n", vq, (unsigned int)(vq - vdev->vq)); + fprintf(stderr, "inuse %u vring.num %u\n", vq->inuse, vq->vring.num); + fprintf(stderr, "old_shadow_avail_idx %u last_avail_idx %u avail_idx %u\n", old_shadow_avail_idx, vq->last_avail_idx, vq->shadow_avail_idx); + fprintf(stderr, "avail %#" HWADDR_PRIx " avail_idx (cache bypassed) %u\n", vq->vring.avail, virtio_lduw_phys(vdev, vq->vring.avail + offsetof(VRingAvail, idx))); + fprintf(stderr, "used_idx %u\n", vq->used_idx); + abort(); /* <--- core dump! */ virtio_error(vdev, "Virtqueue size exceeded"); goto done; }
On Mon, Feb 18, 2019 at 07:21:25AM +0000, Fernando Casas Schössow wrote: > It took a few days but last night the problem was reproduced. > This is the information from the log: > > vdev 0x55f261d940f0 ("virtio-blk") > vq 0x55f261d9ee40 (idx 0) > inuse 128 vring.num 128 > old_shadow_avail_idx 58874 last_avail_idx 58625 avail_idx 58874 > avail 0x3d87a800 avail_idx (cache bypassed) 58625 Hi Paolo, Are you aware of any recent MemoryRegionCache issues? The avail_idx value 58874 was read via the cache while a non-cached read produces 58625! I suspect that 58625 is correct since the vring is already full and the driver wouldn't bump avail_idx any further until requests complete. Fernando also hits this issue with virtio-scsi so it's not a virtio_blk.ko driver bug or a virtio-blk device emulation issue. A QEMU core dump is available for debugging. Here is the patch that produced this debug output: diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c index a1ff647a66..28d89fcbcb 100644 --- a/hw/virtio/virtio.c +++ b/hw/virtio/virtio.c @@ -866,6 +866,7 @@ void *virtqueue_pop(VirtQueue *vq, size_t sz) return NULL; } rcu_read_lock(); + uint16_t old_shadow_avail_idx = vq->shadow_avail_idx; if (virtio_queue_empty_rcu(vq)) { goto done; } @@ -879,6 +880,12 @@ void *virtqueue_pop(VirtQueue *vq, size_t sz) max = vq->vring.num; if (vq->inuse >= vq->vring.num) { + fprintf(stderr, "vdev %p (\"%s\")\n", vdev, vdev->name); + fprintf(stderr, "vq %p (idx %u)\n", vq, (unsigned int)(vq - vdev->vq)); + fprintf(stderr, "inuse %u vring.num %u\n", vq->inuse, vq->vring.num); + fprintf(stderr, "old_shadow_avail_idx %u last_avail_idx %u avail_idx %u\n", old_shadow_avail_idx, vq->last_avail_idx, vq->shadow_avail_idx); + fprintf(stderr, "avail %#" HWADDR_PRIx " avail_idx (cache bypassed) %u\n", vq->vring.avail, virtio_lduw_phys(vdev, vq->vring.avail + offsetof(VRingAvail, idx))); + fprintf(stderr, "used_idx %u\n", vq->used_idx); + abort(); /* <--- core dump! 
*/ virtio_error(vdev, "Virtqueue size exceeded"); goto done; } Stefan > used_idx 58497 > 2019-02-18 03:20:08.605+0000: shutting down, reason=crashed > > The dump file, including guest memory, was generated successfully (after gzip the file is around 492MB). > I switched the guest now to virtio-scsi to get the information and dump with this setup as well. > > How should we proceed? > > Thanks. > > On lun, feb 11, 2019 at 4:17 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote: > Thanks for collecting the data! The fact that both virtio-blk and virtio-scsi failed suggests it's not a virtqueue element leak in the virtio-blk or virtio-scsi device emulation code. The hung task error messages from inside the guest are a consequence of QEMU hitting the "Virtqueue size exceeded" error. QEMU refuses to process further requests after the error, causing tasks inside the guest to get stuck on I/O. I don't have a good theory regarding the root cause. Two ideas: 1. The guest is corrupting the vring or submitting more requests than will fit into the ring. Somewhat unlikely because it happens with both Windows and Linux guests. 2. QEMU's virtqueue code is buggy, maybe the memory region cache which is used for fast guest RAM accesses. Here is an expanded version of the debug patch which might help identify which of these scenarios is likely. Sorry, it requires running the guest again! This time let's make QEMU dump core so both QEMU state and guest RAM are captured for further debugging. That way it will be possible to extract more information using gdb without rerunning. 
Stefan --- diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c index a1ff647a66..28d89fcbcb 100644 --- a/hw/virtio/virtio.c +++ b/hw/virtio/virtio.c @@ -866,6 +866,7 @@ void *virtqueue_pop(VirtQueue *vq, size_t sz) return NULL; } rcu_read_lock(); + uint16_t old_shadow_avail_idx = vq->shadow_avail_idx; if (virtio_queue_empty_rcu(vq)) { goto done; } @@ -879,6 +880,12 @@ void *virtqueue_pop(VirtQueue *vq, size_t sz) max = vq->vring.num; if (vq->inuse >= vq->vring.num) { + fprintf(stderr, "vdev %p (\"%s\")\n", vdev, vdev->name); + fprintf(stderr, "vq %p (idx %u)\n", vq, (unsigned int)(vq - vdev->vq)); + fprintf(stderr, "inuse %u vring.num %u\n", vq->inuse, vq->vring.num); + fprintf(stderr, "old_shadow_avail_idx %u last_avail_idx %u avail_idx %u\n", old_shadow_avail_idx, vq->last_avail_idx, vq->shadow_avail_idx); + fprintf(stderr, "avail %#" HWADDR_PRIx " avail_idx (cache bypassed) %u\n", vq->vring.avail, virtio_lduw_phys(vdev, vq->vring.avail + offsetof(VRingAvail, idx))); + fprintf(stderr, "used_idx %u\n", vq->used_idx); + abort(); /* <--- core dump! */ virtio_error(vdev, "Virtqueue size exceeded"); goto done; } > >
On 20/02/19 17:58, Stefan Hajnoczi wrote: > On Mon, Feb 18, 2019 at 07:21:25AM +0000, Fernando Casas Schössow wrote: >> It took a few days but last night the problem was reproduced. >> This is the information from the log: >> >> vdev 0x55f261d940f0 ("virtio-blk") >> vq 0x55f261d9ee40 (idx 0) >> inuse 128 vring.num 128 >> old_shadow_avail_idx 58874 last_avail_idx 58625 avail_idx 58874 >> avail 0x3d87a800 avail_idx (cache bypassed) 58625 > > Hi Paolo, > Are you aware of any recent MemoryRegionCache issues? The avail_idx > value 58874 was read via the cache while a non-cached read produces > 58625! > > I suspect that 58625 is correct since the vring is already full and the > driver wouldn't bump avail_idx any further until requests complete. > > Fernando also hits this issue with virtio-scsi so it's not a > virtio_blk.ko driver bug or a virtio-blk device emulation issue. No, I am not aware of any issues. How can I get the core dump (and the corresponding executable to get the symbols)? Alternatively, it should be enough to print the vq->vring.caches->avail.mrs from the debugger. Also, one possibility is to add in vring_avail_idx an assertion like assert(vq->shadow_avail_idx == virtio_lduw_phys(vdev, vq->vring.avail + offsetof(VRingAvail, idx))); and try to catch the error earlier. Paolo > A QEMU core dump is available for debugging. 
> > Here is the patch that produced this debug output: > diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c > index a1ff647a66..28d89fcbcb 100644 > --- a/hw/virtio/virtio.c > +++ b/hw/virtio/virtio.c > @@ -866,6 +866,7 @@ void *virtqueue_pop(VirtQueue *vq, size_t sz) > return NULL; > } > rcu_read_lock(); > + uint16_t old_shadow_avail_idx = vq->shadow_avail_idx; > if (virtio_queue_empty_rcu(vq)) { > goto done; > } > @@ -879,6 +880,12 @@ void *virtqueue_pop(VirtQueue *vq, size_t sz) > max = vq->vring.num; > > if (vq->inuse >= vq->vring.num) { > + fprintf(stderr, "vdev %p (\"%s\")\n", vdev, vdev->name); > + fprintf(stderr, "vq %p (idx %u)\n", vq, (unsigned int)(vq - vdev->vq)); > + fprintf(stderr, "inuse %u vring.num %u\n", vq->inuse, vq->vring.num); > + fprintf(stderr, "old_shadow_avail_idx %u last_avail_idx %u avail_idx %u\n", old_shadow_avail_idx, vq->last_avail_idx, vq->shadow_avail_idx); > + fprintf(stderr, "avail %#" HWADDR_PRIx " avail_idx (cache bypassed) %u\n", vq->vring.avail, virtio_lduw_phys(vdev, vq->vring.avail + offsetof(VRingAvail, idx))); > + fprintf(stderr, "used_idx %u\n", vq->used_idx); > + abort(); /* <--- core dump! */ > virtio_error(vdev, "Virtqueue size exceeded"); > goto done; > } > > Stefan > >> used_idx 58497 >> 2019-02-18 03:20:08.605+0000: shutting down, reason=crashed >> >> The dump file, including guest memory, was generated successfully (after gzip the file is around 492MB). >> I switched the guest now to virtio-scsi to get the information and dump with this setup as well. >> >> How should we proceed? >> >> Thanks. >> >> On lun, feb 11, 2019 at 4:17 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote: >> Thanks for collecting the data! The fact that both virtio-blk and virtio-scsi failed suggests it's not a virtqueue element leak in the virtio-blk or virtio-scsi device emulation code. The hung task error messages from inside the guest are a consequence of QEMU hitting the "Virtqueue size exceeded" error. 
QEMU refuses to process further requests after the error, causing tasks inside the guest to get stuck on I/O. I don't have a good theory regarding the root cause. Two ideas: 1. The guest is corrupting the vring or submitting more requests than will fit into the ring. Somewhat unlikely because it happens with both Windows and Linux guests. 2. QEMU's virtqueue code is buggy, maybe the memory region cache which is used for fast guest RAM accesses. Here is an expanded version of the debug patch which might help identify which of these scenarios is likely. Sorry, it requires running the guest again! This time let's make QEMU dump core so both QEMU state and guest RAM are captured for further debugging. That way it will be possible to extract more information using gdb without rerunning. Stefan --- diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c index a1ff647a66..28d89fcbcb 100644 --- a/hw/virtio/virtio.c +++ b/hw/virtio/virtio.c @@ -866,6 +866,7 @@ void *virtqueue_pop(VirtQueue *vq, size_t sz) return NULL; } rcu_read_lock(); + uint16_t old_shadow_avail_idx = vq->shadow_avail_idx; if (virtio_queue_empty_rcu(vq)) { goto done; } @@ -879,6 +880,12 @@ void *virtqueue_pop(VirtQueue *vq, size_t sz) max = vq->vring.num; if (vq->inuse >= vq->vring.num) { + fprintf(stderr, "vdev %p (\"%s\")\n", vdev, vdev->name); + fprintf(stderr, "vq %p (idx %u)\n", vq, (unsigned int)(vq - vdev->vq)); + fprintf(stderr, "inuse %u vring.num %u\n", vq->inuse, vq->vring.num); + fprintf(stderr, "old_shadow_avail_idx %u last_avail_idx %u avail_idx %u\n", old_shadow_avail_idx, vq->last_avail_idx, vq->shadow_avail_idx); + fprintf(stderr, "avail %#" HWADDR_PRIx " avail_idx (cache bypassed) %u\n", vq->vring.avail, virtio_lduw_phys(vdev, vq->vring.avail + offsetof(VRingAvail, idx))); + fprintf(stderr, "used_idx %u\n", vq->used_idx); + abort(); /* <--- core dump! */ virtio_error(vdev, "Virtqueue size exceeded"); goto done; } >> >>
Hi Paolo, This is Fernando, the one who reported the issue. Regarding the dumps I have three of them including guest memory, 2 for virtio-scsi, 1 for virtio-blk, in case a comparison may help to confirm which is the problem. I can upload them to a server you indicate, or I can also put them on a server so you can download them as you see fit. Each dump, compressed, is around 500MB. If it's more convenient for you I can try to get the requested information from gdb. But I will need some guidance since I'm not skilled enough with the debugger. Another option, if you provide me with the right patch, is for me to patch, rebuild QEMU and repro the problem again. With virtio-scsi I'm able to repro this in a matter of hours most of the time, with virtio-blk it will take a couple of days. Just let me know how you prefer to move forward. Thanks a lot for helping with this! Kind regards, Fernando On mié, feb 20, 2019 at 6:53 PM, Paolo Bonzini <pbonzini@redhat.com> wrote: On 20/02/19 17:58, Stefan Hajnoczi wrote: On Mon, Feb 18, 2019 at 07:21:25AM +0000, Fernando Casas Schössow wrote: It took a few days but last night the problem was reproduced. This is the information from the log: vdev 0x55f261d940f0 ("virtio-blk") vq 0x55f261d9ee40 (idx 0) inuse 128 vring.num 128 old_shadow_avail_idx 58874 last_avail_idx 58625 avail_idx 58874 avail 0x3d87a800 avail_idx (cache bypassed) 58625 Hi Paolo, Are you aware of any recent MemoryRegionCache issues? The avail_idx value 58874 was read via the cache while a non-cached read produces 58625! I suspect that 58625 is correct since the vring is already full and the driver wouldn't bump avail_idx any further until requests complete. Fernando also hits this issue with virtio-scsi so it's not a virtio_blk.ko driver bug or a virtio-blk device emulation issue. No, I am not aware of any issues. How can I get the core dump (and the corresponding executable to get the symbols)? 
Alternatively, it should be enough to print the vq->vring.caches->avail.mrs from the debugger. Also, one possibility is to add in vring_avail_idx an assertion like assert(vq->shadow_avail_idx == virtio_lduw_phys(vdev, vq->vring.avail + offsetof(VRingAvail, idx))); and try to catch the error earlier. Paolo A QEMU core dump is available for debugging. Here is the patch that produced this debug output: diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c index a1ff647a66..28d89fcbcb 100644 --- a/hw/virtio/virtio.c +++ b/hw/virtio/virtio.c @@ -866,6 +866,7 @@ void *virtqueue_pop(VirtQueue *vq, size_t sz) return NULL; } rcu_read_lock(); + uint16_t old_shadow_avail_idx = vq->shadow_avail_idx; if (virtio_queue_empty_rcu(vq)) { goto done; } @@ -879,6 +880,12 @@ void *virtqueue_pop(VirtQueue *vq, size_t sz) max = vq->vring.num; if (vq->inuse >= vq->vring.num) { + fprintf(stderr, "vdev %p (\"%s\")\n", vdev, vdev->name); + fprintf(stderr, "vq %p (idx %u)\n", vq, (unsigned int)(vq - vdev->vq)); + fprintf(stderr, "inuse %u vring.num %u\n", vq->inuse, vq->vring.num); + fprintf(stderr, "old_shadow_avail_idx %u last_avail_idx %u avail_idx %u\n", old_shadow_avail_idx, vq->last_avail_idx, vq->shadow_avail_idx); + fprintf(stderr, "avail %#" HWADDR_PRIx " avail_idx (cache bypassed) %u\n", vq->vring.avail, virtio_lduw_phys(vdev, vq->vring.avail + offsetof(VRingAvail, idx))); + fprintf(stderr, "used_idx %u\n", vq->used_idx); + abort(); /* <--- core dump! */ virtio_error(vdev, "Virtqueue size exceeded"); goto done; } Stefan used_idx 58497 2019-02-18 03:20:08.605+0000: shutting down, reason=crashed The dump file, including guest memory, was generated successfully (after gzip the file is around 492MB). I switched the guest now to virtio-scsi to get the information and dump with this setup as well. How should we proceed? Thanks.
On Wed, Feb 20, 2019 at 06:56:04PM +0000, Fernando Casas Schössow wrote:
> Regarding the dumps I have three of them including guest memory, 2 for virtio-scsi, 1 for virtio-blk, in case a comparison may help to confirm which is the problem. I can upload them to a server you indicate, or I can also put them on a server so you can download them as you see fit. Each dump, compressed, is around 500MB.
Hi Fernando,
It would be great if you could make a compressed coredump and the
corresponding QEMU executable (hopefully with debug symbols) available
on a server so Paolo and/or I can download them.
If you're wondering about the debug symbols, since you built QEMU from
source the binary in x86_64-softmmu/qemu-system-x86_64 should have the
debug symbols.
"gdb path/to/qemu-system-x86_64" will print "Reading symbols from
qemu-system-x86_64...done." if symbols are available. Otherwise it will
say "Reading symbols from qemu-system-x86_64...(no debugging symbols
found)...done.".
Thanks,
Stefan
Hi Stefan, I can confirm that the symbols are included in the binary using gdb. I will send you and Paolo an email with the link and credentials (if needed) so you can download everything. Thanks! On jue, feb 21, 2019 at 12:11 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote: On Wed, Feb 20, 2019 at 06:56:04PM +0000, Fernando Casas Schössow wrote: Regarding the dumps I have three of them including guest memory, 2 for virtio-scsi, 1 for virtio-blk, in case a comparison may help to confirm which is the problem. I can upload them to a server you indicate, or I can also put them on a server so you can download them as you see fit. Each dump, compressed, is around 500MB. Hi Fernando, It would be great if you could make a compressed coredump and the corresponding QEMU executable (hopefully with debug symbols) available on a server so Paolo and/or I can download them. If you're wondering about the debug symbols, since you built QEMU from source the binary in x86_64-softmmu/qemu-system-x86_64 should have the debug symbols. "gdb path/to/qemu-system-x86_64" will print "Reading symbols from qemu-system-x86_64...done." if symbols are available. Otherwise it will say "Reading symbols from qemu-system-x86_64...(no debugging symbols found)...done.". Thanks, Stefan
On Fri, Feb 22, 2019 at 12:57 PM Fernando Casas Schössow <casasfernando@outlook.com> wrote: I have CCed Natanael Copa, qemu package maintainer in Alpine Linux. Fernando: Can you confirm that the bug occurs with an unmodified Alpine Linux qemu binary? Richard: Commit 7db2145a6826b14efceb8dd64bfe6ad8647072eb ("bswap: Add host endian unaligned access functions") introduced the unaligned memory access functions in question here. Please see below for details on the bug - basically QEMU code assumes they are atomic, but that is not guaranteed :(. Any ideas for how to fix this? Natanael: It seems likely that the qemu package in Alpine Linux suffers from a compilation issue resulting in a broken QEMU. It may be necessary to leave the compiler optimization flag alone in APKBUILD to work around this problem. Here are the details. QEMU relies on the compiler turning memcpy(&dst, &src, 2) into a load instruction in include/qemu/bswap.h:lduw_he_p() (explanation below): /* Any compiler worth its salt will turn these memcpy into native unaligned operations. Thus we don't need to play games with packed attributes, or inline byte-by-byte stores. */ static inline int lduw_he_p(const void *ptr) { uint16_t r; memcpy(&r, ptr, sizeof(r)); return r; } Here is the disassembly snippet of virtqueue_pop() from Fedora 29 that shows the load instruction: 398166: 0f b7 42 02 movzwl 0x2(%rdx),%eax 39816a: 66 89 43 32 mov %ax,0x32(%rbx) Here is the instruction sequence in the Alpine Linux binary: 455562: ba 02 00 00 00 mov $0x2,%edx 455567: e8 74 24 f3 ff callq 3879e0 <memcpy@plt> 45556c: 0f b7 44 24 42 movzwl 0x42(%rsp),%eax 455571: 66 41 89 47 32 mov %ax,0x32(%r15) It's calling memcpy instead of using a load instruction. 
This is a race condition between one thread loading the 16-bit value and another thread storing it (in this case a guest vcpu thread). Sometimes memcpy() may load one old byte and one new byte, resulting in a bogus value. The symptom that Fernando experienced is a "Virtqueue size exceeded" error message from QEMU and then the virtio-blk or virtio-scsi device stops working. This issue potentially affects other device emulation code in QEMU as well, not just virtio devices. For the time being, I suggest tweaking the APKBUILD so the memcpy() is not generated. Hopefully QEMU can make the code more portable in the future so the compiler always does the expected thing, but this may not be easily possible. Stefan
On 22/02/19 15:04, Stefan Hajnoczi wrote: > On Fri, Feb 22, 2019 at 12:57 PM Fernando Casas Schössow > <casasfernando@outlook.com> wrote: > > I have CCed Natanael Copa, qemu package maintainer in Alpine Linux. > > Fernando: Can you confirm that the bug occurs with an unmodified > Alpine Linux qemu binary? I can confirm that the memcpy calls are there, at least. Paolo > Richard: Commit 7db2145a6826b14efceb8dd64bfe6ad8647072eb ("bswap: Add > host endian unaligned access functions") introduced the unaligned > memory access functions in question here. Please see below for > details on the bug - basically QEMU code assumes they are atomic, but > that is not guaranteed :(. Any ideas for how to fix this? > > Natanael: It seems likely that the qemu package in Alpine Linux > suffers from a compilation issue resulting in a broken QEMU. It may > be necessary to leave the compiler optimization flag alone in APKBUILD > to work around this problem. > > Here are the details. QEMU relies on the compiler turning > memcpy(&dst, &src, 2) turning into a load instruction in > include/qemu/bswap.h:lduw_he_p() (explanation below): > > /* Any compiler worth its salt will turn these memcpy into native unaligned > operations. Thus we don't need to play games with packed attributes, or > inline byte-by-byte stores. */ > > static inline int lduw_he_p(const void *ptr) > { > uint16_t r; > memcpy(&r, ptr, sizeof(r)); > return r; > } > > Here is the disassembly snippet of virtqueue_pop() from Fedora 29 that > shows the load instruction: > > 398166: 0f b7 42 02 movzwl 0x2(%rdx),%eax > 39816a: 66 89 43 32 mov %ax,0x32(%rbx) > > Here is the instruction sequence in the Alpine Linux binary: > > 455562: ba 02 00 00 00 mov $0x2,%edx > 455567: e8 74 24 f3 ff callq 3879e0 <memcpy@plt> > 45556c: 0f b7 44 24 42 movzwl 0x42(%rsp),%eax > 455571: 66 41 89 47 32 mov %ax,0x32(%r15) > > It's calling memcpy instead of using a load instruction. 
> > Fernando found that QEMU's virtqueue_pop() function sees bogus values > when loading a 16-bit guest RAM location. Paolo figured out that the > bogus value can be produced by memcpy() when another thread is > updating the 16-bit memory location simultaneously. This is a race > condition between one thread loading the 16-bit value and another > thread storing it (in this case a guest vcpu thread). Sometimes > memcpy() may load one old byte and one new byte, resulting in a bogus > value. > > The symptom that Fernando experienced is a "Virtqueue size exceeded" > error message from QEMU and then the virtio-blk or virtio-scsi device > stops working. This issue potentially affects other device emulation > code in QEMU as well, not just virtio devices. > > For the time being, I suggest tweaking the APKBUILD so the memcpy() is > not generated. Hopefully QEMU can make the code more portable in the > future so the compiler always does the expected thing, but this may > not be easily possible. > > Stefan >
Hi all, Indeed I can confirm this issue occurs with the stock, unmodified QEMU binary provided with Alpine since at least Alpine 3.6 up to 3.9. Is there any compiler flag I can tweak, add or remove to rebuild qemu and try to repro to confirm a possible workaround? Thanks. Fernando Sent from my iPhone > On 22 Feb 2019, at 15:04, Stefan Hajnoczi <stefanha@gmail.com> wrote: > > On Fri, Feb 22, 2019 at 12:57 PM Fernando Casas Schössow > <casasfernando@outlook.com> wrote: > > I have CCed Natanael Copa, qemu package maintainer in Alpine Linux. > > Fernando: Can you confirm that the bug occurs with an unmodified > Alpine Linux qemu binary? > > Richard: Commit 7db2145a6826b14efceb8dd64bfe6ad8647072eb ("bswap: Add > host endian unaligned access functions") introduced the unaligned > memory access functions in question here. Please see below for > details on the bug - basically QEMU code assumes they are atomic, but > that is not guaranteed :(. Any ideas for how to fix this? > > Natanael: It seems likely that the qemu package in Alpine Linux > suffers from a compilation issue resulting in a broken QEMU. It may > be necessary to leave the compiler optimization flag alone in APKBUILD > to work around this problem. > > Here are the details. QEMU relies on the compiler turning > memcpy(&dst, &src, 2) turning into a load instruction in > include/qemu/bswap.h:lduw_he_p() (explanation below): > > /* Any compiler worth its salt will turn these memcpy into native unaligned > operations. Thus we don't need to play games with packed attributes, or > inline byte-by-byte stores. 
*/ > > static inline int lduw_he_p(const void *ptr) > { > uint16_t r; > memcpy(&r, ptr, sizeof(r)); > return r; > } > > Here is the disassembly snippet of virtqueue_pop() from Fedora 29 that > shows the load instruction: > > 398166: 0f b7 42 02 movzwl 0x2(%rdx),%eax > 39816a: 66 89 43 32 mov %ax,0x32(%rbx) > > Here is the instruction sequence in the Alpine Linux binary: > > 455562: ba 02 00 00 00 mov $0x2,%edx > 455567: e8 74 24 f3 ff callq 3879e0 <memcpy@plt> > 45556c: 0f b7 44 24 42 movzwl 0x42(%rsp),%eax > 455571: 66 41 89 47 32 mov %ax,0x32(%r15) > > It's calling memcpy instead of using a load instruction. > > Fernando found that QEMU's virtqueue_pop() function sees bogus values > when loading a 16-bit guest RAM location. Paolo figured out that the > bogus value can be produced by memcpy() when another thread is > updating the 16-bit memory location simultaneously. This is a race > condition between one thread loading the 16-bit value and another > thread storing it (in this case a guest vcpu thread). Sometimes > memcpy() may load one old byte and one new byte, resulting in a bogus > value. > > The symptom that Fernando experienced is a "Virtqueue size exceeded" > error message from QEMU and then the virtio-blk or virtio-scsi device > stops working. This issue potentially affects other device emulation > code in QEMU as well, not just virtio devices. > > For the time being, I suggest tweaking the APKBUILD so the memcpy() is > not generated. Hopefully QEMU can make the code more portable in the > future so the compiler always does the expected thing, but this may > not be easily possible. > > Stefan
On 22/02/19 15:43, Fernando Casas Schössow wrote:
> Hi all,
>
> Indeed I can confirm this issue occurs with the stock, unmodified
> QEMU binary provided with Alpine since at least Alpine 3.6 up to
> 3.9.
>
> Is there any compiler flag I can tweak, add or remove to rebuild qemu
> and try to repro to confirm a possible workaround?

Nope, not yet. However, I can try some experiments if you can provide
some information on how to rebuild an apk.

Paolo
You can find the information here:
https://wiki.alpinelinux.org/wiki/Creating_an_Alpine_package

For your convenience, I can set up an Alpine VM and give you SSH access
with the build environment and the aports tree already set up.

The summary steps, in case you want to try on your own container/VM,
would be:

1. Set up the build environment as described in the wiki.
2. Clone the aports tree: git clone https://github.com/alpinelinux/aports
3. Go into the qemu package directory in the aports tree: cd aports/main/qemu
4. Run: abuild clean checksum unpack prepare deps build

This would get you the binary built. If you need to add a patch, create
the patch file, add it to the APKBUILD file and then run "abuild checksum"
to compute its checksum and add that to the APKBUILD file.

abuild quick ref:

abuild clean:    Remove temp build and install dirs
abuild unpack:   Unpack sources to $srcdir
abuild checksum: Generate checksums to be included in APKBUILD
abuild prepare:  Apply patches
abuild deps:     Install packages listed in makedepends and depends
abuild build:    Compile and install package into $pkgdir

Once finished, to clean up everything run: abuild undeps clean

Thanks.

Fernando

On vie, feb 22, 2019 at 3:55 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> On 22/02/19 15:43, Fernando Casas Schössow wrote:
>> Hi all,
>>
>> Indeed I can confirm this issue occurs with the stock, unmodified
>> QEMU binary provided with Alpine since at least Alpine 3.6 up to
>> 3.9.
>>
>> Is there any compiler flag I can tweak, add or remove to rebuild qemu
>> and try to repro to confirm a possible workaround?
>
> Nope, not yet. However, I can try some experiments if you can provide
> some information on how to rebuild an apk.
>
> Paolo
* Stefan Hajnoczi (stefanha@gmail.com) wrote:
> On Fri, Feb 22, 2019 at 12:57 PM Fernando Casas Schössow
> <casasfernando@outlook.com> wrote:
>
> I have CCed Natanael Copa, qemu package maintainer in Alpine Linux.
>
> Fernando: Can you confirm that the bug occurs with an unmodified
> Alpine Linux qemu binary?
>
> Richard: Commit 7db2145a6826b14efceb8dd64bfe6ad8647072eb ("bswap: Add
> host endian unaligned access functions") introduced the unaligned
> memory access functions in question here. Please see below for
> details on the bug - basically QEMU code assumes they are atomic, but
> that is not guaranteed :(. Any ideas for how to fix this?

Why does vring_avail_idx use virtio_ld*u*w_phys_cached? (Similar for
vring_avail_ring.) There's no generically safe way to do unaligned
atomic loads - don't we know in virtio that these are naturally
aligned?

Dave

--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
On 22/02/19 17:37, Dr. David Alan Gilbert wrote: > Why does vring_avail_idx use virtio_ld*u*w_phys_cached? > (similar for vring_avail_ring). > There's no generically safe way to do unaligned atomic loads - don't we know > in virtio that these are naturally aligned? u is for unsigned. :) We know that these are aligned and so they will be atomic. Paolo
* Paolo Bonzini (pbonzini@redhat.com) wrote: > On 22/02/19 17:37, Dr. David Alan Gilbert wrote: > > Why does vring_avail_idx use virtio_ld*u*w_phys_cached? > > (similar for vring_avail_ring). > > There's no generically safe way to do unaligned atomic loads - don't we know > > in virtio that these are naturally aligned? > > u is for unsigned. :) We know that these are aligned and so they will > be atomic. Ah OK! Dave > Paolo -- Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
On Fri, 22 Feb 2019 14:04:20 +0000
Stefan Hajnoczi <stefanha@gmail.com> wrote:

> On Fri, Feb 22, 2019 at 12:57 PM Fernando Casas Schössow
> <casasfernando@outlook.com> wrote:
>
> I have CCed Natanael Copa, qemu package maintainer in Alpine Linux.

Hi!

...

> Here are the details. QEMU relies on the compiler turning
> memcpy(&dst, &src, 2) into a load instruction in
> include/qemu/bswap.h:lduw_he_p():
>
> static inline int lduw_he_p(const void *ptr)
> {
>     uint16_t r;
>     memcpy(&r, ptr, sizeof(r));
>     return r;
> }
>
> Here is the instruction sequence in the Alpine Linux binary:
>
>   455562: ba 02 00 00 00   mov    $0x2,%edx
>   455567: e8 74 24 f3 ff   callq  3879e0 <memcpy@plt>
>   45556c: 0f b7 44 24 42   movzwl 0x42(%rsp),%eax
>   455571: 66 41 89 47 32   mov    %ax,0x32(%r15)
>
> It's calling memcpy instead of using a load instruction.

My first reaction to this is: if the intention is to not actually call
the memcpy function, then maybe memcpy should not be used in the C code
in the first place?

> For the time being, I suggest tweaking the APKBUILD so the memcpy() is
> not generated. Hopefully QEMU can make the code more portable in the
> future so the compiler always does the expected thing, but this may
> not be easily possible.

I suspect this happens because the Alpine toolchain enables
_FORTIFY_SOURCE=2 by default, and because of the way this is
implemented via fortify-headers:

http://git.2f30.org/fortify-headers/file/include/string.h.html#l39

Try building with -U_FORTIFY_SOURCE.

> Stefan
On Fri, 22 Feb 2019 14:04:20 +0000
Stefan Hajnoczi <stefanha@gmail.com> wrote:

> On Fri, Feb 22, 2019 at 12:57 PM Fernando Casas Schössow
> <casasfernando@outlook.com> wrote:
>
> I have CCed Natanael Copa, qemu package maintainer in Alpine Linux.
>
> Fernando: Can you confirm that the bug occurs with an unmodified
> Alpine Linux qemu binary?

I wonder exactly which binary it is. What architecture (x86 or x86_64)
and which binary? Is it a dynamically or statically linked
qemu-system-x86_64?

> Here is the instruction sequence in the Alpine Linux binary:
>
>   455562: ba 02 00 00 00   mov    $0x2,%edx
>   455567: e8 74 24 f3 ff   callq  3879e0 <memcpy@plt>
>   45556c: 0f b7 44 24 42   movzwl 0x42(%rsp),%eax
>   455571: 66 41 89 47 32   mov    %ax,0x32(%r15)
>
> It's calling memcpy instead of using a load instruction.

I tried to find this section. How do you get the assembly listing of
the relevant section? I tried "disas virtqueue_pop" from
`gdb /usr/bin/qemu-system-x86_64` on the binary in Alpine edge. I could
find 2 memcpy calls but none of them looks like a 16-bit operation:

   0x00000000004551f1 <+353>:   mov    0x10(%rsp),%rdi
   0x00000000004551f6 <+358>:   mov    $0x10,%edx
   0x00000000004551fb <+363>:   callq  0x3879e0 <memcpy@plt>
   0x0000000000455200 <+368>:   movzwl 0x5c(%rsp),%eax
   0x0000000000455205 <+373>:   test   $0x4,%al
   0x0000000000455207 <+375>:   je     0x4552aa <virtqueue_pop+538>
   ....
   0x0000000000455291 <+513>:   mov    0x10(%rsp),%rdi
   0x0000000000455296 <+518>:   mov    $0x10,%edx
   0x000000000045529b <+523>:   callq  0x3879e0 <memcpy@plt>
   0x00000000004552a0 <+528>:   mov    %rbp,0x20(%rsp)
   0x00000000004552a5 <+533>:   movzwl 0x5c(%rsp),%eax
   0x00000000004552aa <+538>:   lea    0x20e0(%rsp),%rdi
   0x00000000004552b2 <+546>:   xor    %r11d,%r11d
   0x00000000004552b5 <+549>:   mov    %r15,0x38(%rsp)

> For the time being, I suggest tweaking the APKBUILD so the memcpy() is
> not generated. Hopefully QEMU can make the code more portable in the
> future so the compiler always does the expected thing, but this may
> not be easily possible.

I was thinking of something in the lines of:

typedef volatile uint16_t __attribute__((__may_alias__)) volatile_uint16_t;
static inline int lduw_he_p(const void *ptr)
{
    volatile_uint16_t r = *(volatile_uint16_t*)ptr;
    return r;
}

I can test different CFLAGS with and without _FORTIFY_SOURCE and with
different variants of memcpy (like __builtin_memcpy etc) but I need to
find a way to get the correct assembly output so I know if/when I have
found something that works.

Thanks!

-nc

> Stefan
On Sat, 23 Feb 2019 at 16:05, Natanael Copa <ncopa@alpinelinux.org> wrote: > I was thinking of something in the lines of: > > typedef volatile uint16_t __attribute__((__may_alias__)) volatile_uint16_t; > static inline int lduw_he_p(const void *ptr) > { > volatile_uint16_t r = *(volatile_uint16_t*)ptr; > return r; > } This won't correctly handle accesses with unaligned pointers, I'm afraid. We rely on these functions correctly working with pointers that are potentially unaligned. thanks -- PMM
Hi Natanael,

The package information is the following:
https://pkgs.alpinelinux.org/package/v3.9/main/x86_64/qemu-system-x86_64

The binary is for the x86_64 architecture and dynamically linked:

fernando@vmsvr01:~$ file /usr/bin/qemu-system-x86_64
/usr/bin/qemu-system-x86_64: ELF 64-bit LSB shared object, x86-64,
version 1 (SYSV), dynamically linked, interpreter
/lib/ld-musl-x86_64.so.1, stripped

Thanks.

On sáb, feb 23, 2019 at 4:55 PM, Natanael Copa <ncopa@alpinelinux.org> wrote:
> On Fri, 22 Feb 2019 14:04:20 +0000
> Stefan Hajnoczi <stefanha@gmail.com> wrote:
>> On Fri, Feb 22, 2019 at 12:57 PM Fernando Casas Schössow
>> <casasfernando@outlook.com> wrote:
>>
>> I have CCed Natanael Copa, qemu package maintainer in Alpine Linux.
>>
>> Fernando: Can you confirm that the bug occurs with an unmodified
>> Alpine Linux qemu binary?
>
> I wonder exactly which binary it is. What architecture (x86 or x86_64)
> and which binary? Is it a dynamically or statically linked
> qemu-system-x86_64?
On Fri, 22 Feb 2019 at 14:07, Stefan Hajnoczi <stefanha@gmail.com> wrote: > Richard: Commit 7db2145a6826b14efceb8dd64bfe6ad8647072eb ("bswap: Add > host endian unaligned access functions") introduced the unaligned > memory access functions in question here. Please see below for > details on the bug - basically QEMU code assumes they are atomic, but > that is not guaranteed :(. Any ideas for how to fix this? I suspect we want a separate family of access functions for "I guarantee this will be an aligned access and I need the atomicity". (The other place where we've talked about needing the atomicity is in emulation of page-table-walk, where you need the page table loads to be atomic w.r.t. other CPU threads, especially in the case where you need to emulate a hardware update of a dirty/access bit in the page table entry.) Mostly this hasn't bitten us before because any sensible compiler will turn the memcpy into a straight load on most common hosts, which will be atomic (but accidentally rather than on purpose). thanks -- PMM
On Sat, 23 Feb 2019 16:18:15 +0000
Peter Maydell <peter.maydell@linaro.org> wrote:

> On Sat, 23 Feb 2019 at 16:05, Natanael Copa <ncopa@alpinelinux.org> wrote:
> > I was thinking of something in the lines of:
> >
> > typedef volatile uint16_t __attribute__((__may_alias__)) volatile_uint16_t;
> > static inline int lduw_he_p(const void *ptr)
> > {
> >     volatile_uint16_t r = *(volatile_uint16_t*)ptr;
> >     return r;
> > }
>
> This won't correctly handle accesses with unaligned pointers,
> I'm afraid. We rely on these functions correctly working
> with pointers that are potentially unaligned.

Well, the current code does not even handle accesses with aligned
pointers atomically, depending on the _FORTIFY_SOURCE implementation.

My thinking here is that we depend on the assumption that the compiler
will remove the memcpy call, so maybe we can find another way to
generate the same assembly, while still depending on compiler
optimization.

I did some tests and compared the assembly output. The compiler will
generate the same assembly if volatile is not used. The __may_alias__
attribute does not seem to make any difference. So the following
function generates exactly the same assembly:

static inline int lduw_he_p(const void *ptr)
{
    uint16_t r = *(uint16_t*)ptr;
    return r;
}

I created a script to compare on different Linux distros and
architectures. My conclusion is:

- The Alpine fortify implementation is non-optimal, but technically
  correct.
- `uint16_t r = *(uint16_t*)ptr;` generates the same assembly code
  regardless of the fortify implementation, and is clearer about the
  intention than the use of memcpy.
test.sh:

---[CUT HERE]---8<--------------------------------------------------
#!/bin/sh -x

cat > testcase.c <<EOF
#include <stdint.h>

#ifdef WITH_MEMCPY
#include <string.h>
static inline int lduw_he_p(const void *ptr)
{
    uint16_t r;
    memcpy(&r, ptr, sizeof(r));
    return r;
}
#else
static inline int lduw_he_p(const void *ptr)
{
    uint16_t r = *(uint16_t*)ptr;
    return r;
}
#endif

int main(int argc, char *argv[])
{
    void *p = argv;
    int i;
    p++; // make sure we are unaligned
    i = lduw_he_p(p);
    return i;
}
EOF

: ${CC:=gcc}
$CC --version
uname -m

for fortify in "-D_FORTIFY_SOURCE=2" "-U_FORTIFY_SOURCE"; do
    for cflag in "-DWITH_MEMCPY" "-DWITHOUT_MEMCPY"; do
        $CC $cflag $fortify $@ -o testcase$cflag testcase.c
        gdb --batch -ex "disas ${func:-main}" ./testcase$cflag
    done
done
---[CUT HERE]---8<--------------------------------------------------

Output on Alpine:

$ sh -x test.sh -O2
+ cat
+ : gcc
+ gcc --version
gcc (Alpine 8.2.0) 8.2.0
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ uname -m x86_64 + gcc -DWITH_MEMCPY '-D_FORTIFY_SOURCE=2' -O2 -o testcase-DWITH_MEMCPY testcase.c + gdb --batch -ex 'disas main' ./testcase-DWITH_MEMCPY Dump of assembler code for function main: 0x0000000000001070 <+0>: sub $0x18,%rsp 0x0000000000001074 <+4>: mov %fs:0x28,%rax 0x000000000000107d <+13>: mov %rax,0x8(%rsp) 0x0000000000001082 <+18>: xor %eax,%eax 0x0000000000001084 <+20>: lea 0x6(%rsp),%rdi 0x0000000000001089 <+25>: lea 0x1(%rsi),%rax 0x000000000000108d <+29>: lea 0x2(%rdi),%rdx 0x0000000000001091 <+33>: cmp %rdx,%rax 0x0000000000001094 <+36>: jae 0x109b <main+43> 0x0000000000001096 <+38>: cmp %rdi,%rax 0x0000000000001099 <+41>: ja 0x10d0 <main+96> 0x000000000000109b <+43>: cmp %rdi,%rax 0x000000000000109e <+46>: jb 0x10c7 <main+87> 0x00000000000010a0 <+48>: mov $0x2,%edx 0x00000000000010a5 <+53>: mov %rax,%rsi 0x00000000000010a8 <+56>: callq 0x1020 <memcpy@plt> 0x00000000000010ad <+61>: movzwl 0x6(%rsp),%eax 0x00000000000010b2 <+66>: mov 0x8(%rsp),%rcx 0x00000000000010b7 <+71>: xor %fs:0x28,%rcx 0x00000000000010c0 <+80>: jne 0x10d2 <main+98> 0x00000000000010c2 <+82>: add $0x18,%rsp 0x00000000000010c6 <+86>: retq 0x00000000000010c7 <+87>: add $0x3,%rsi 0x00000000000010cb <+91>: cmp %rsi,%rdi 0x00000000000010ce <+94>: jae 0x10a0 <main+48> 0x00000000000010d0 <+96>: ud2 0x00000000000010d2 <+98>: callq 0x1030 <__stack_chk_fail@plt> End of assembler dump. + gcc -DWITHOUT_MEMCPY '-D_FORTIFY_SOURCE=2' -O2 -o testcase-DWITHOUT_MEMCPY testcase.c + gdb --batch -ex 'disas main' ./testcase-DWITHOUT_MEMCPY Dump of assembler code for function main: 0x0000000000001050 <+0>: movzwl 0x1(%rsi),%eax 0x0000000000001054 <+4>: retq End of assembler dump. + gcc -DWITH_MEMCPY -U_FORTIFY_SOURCE -O2 -o testcase-DWITH_MEMCPY testcase.c + gdb --batch -ex 'disas main' ./testcase-DWITH_MEMCPY Dump of assembler code for function main: 0x0000000000001050 <+0>: movzwl 0x1(%rsi),%eax 0x0000000000001054 <+4>: retq End of assembler dump. 
+ gcc -DWITHOUT_MEMCPY -U_FORTIFY_SOURCE -O2 -o testcase-DWITHOUT_MEMCPY testcase.c + gdb --batch -ex 'disas main' ./testcase-DWITHOUT_MEMCPY Dump of assembler code for function main: 0x0000000000001050 <+0>: movzwl 0x1(%rsi),%eax 0x0000000000001054 <+4>: retq End of assembler dump. Output on Debian: root@19c3f13c6023:~# sh -x test.sh -O2 + cat + : gcc + gcc --version gcc (Debian 6.3.0-18+deb9u1) 6.3.0 20170516 Copyright (C) 2016 Free Software Foundation, Inc. This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. + uname -m x86_64 + gcc -DWITH_MEMCPY -D_FORTIFY_SOURCE=2 -O2 -o testcase-DWITH_MEMCPY testcase.c + gdb --batch -ex disas main ./testcase-DWITH_MEMCPY Dump of assembler code for function main: 0x0000000000000530 <+0>: movzwl 0x1(%rsi),%eax 0x0000000000000534 <+4>: retq End of assembler dump. + gcc -DWITHOUT_MEMCPY -D_FORTIFY_SOURCE=2 -O2 -o testcase-DWITHOUT_MEMCPY testcase.c + gdb --batch -ex disas main ./testcase-DWITHOUT_MEMCPY Dump of assembler code for function main: 0x0000000000000530 <+0>: movzwl 0x1(%rsi),%eax 0x0000000000000534 <+4>: retq End of assembler dump. + gcc -DWITH_MEMCPY -U_FORTIFY_SOURCE -O2 -o testcase-DWITH_MEMCPY testcase.c + gdb --batch -ex disas main ./testcase-DWITH_MEMCPY Dump of assembler code for function main: 0x0000000000000530 <+0>: movzwl 0x1(%rsi),%eax 0x0000000000000534 <+4>: retq End of assembler dump. + gcc -DWITHOUT_MEMCPY -U_FORTIFY_SOURCE -O2 -o testcase-DWITHOUT_MEMCPY testcase.c + gdb --batch -ex disas main ./testcase-DWITHOUT_MEMCPY Dump of assembler code for function main: 0x0000000000000530 <+0>: movzwl 0x1(%rsi),%eax 0x0000000000000534 <+4>: retq End of assembler dump. root@19c3f13c6023:~# Output on Fedora: [root@ac68847f5bb1 ~]# sh -x test.sh -O2 + cat + : gcc + gcc --version gcc (GCC) 8.2.1 20181215 (Red Hat 8.2.1-6) Copyright (C) 2018 Free Software Foundation, Inc. 
This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. + uname -m x86_64 + for fortify in "-D_FORTIFY_SOURCE=2" "-U_FORTIFY_SOURCE" + for cflag in "-DWITH_MEMCPY" "-DWITHOUT_MEMCPY" + gcc -DWITH_MEMCPY -D_FORTIFY_SOURCE=2 -O2 -o testcase-DWITH_MEMCPY testcase.c + gdb --batch -ex 'disas main' ./testcase-DWITH_MEMCPY Dump of assembler code for function main: 0x0000000000401020 <+0>: movzwl 0x1(%rsi),%eax 0x0000000000401024 <+4>: retq End of assembler dump. + for cflag in "-DWITH_MEMCPY" "-DWITHOUT_MEMCPY" + gcc -DWITHOUT_MEMCPY -D_FORTIFY_SOURCE=2 -O2 -o testcase-DWITHOUT_MEMCPY testcase.c + gdb --batch -ex 'disas main' ./testcase-DWITHOUT_MEMCPY Dump of assembler code for function main: 0x0000000000401020 <+0>: movzwl 0x1(%rsi),%eax 0x0000000000401024 <+4>: retq End of assembler dump. + for fortify in "-D_FORTIFY_SOURCE=2" "-U_FORTIFY_SOURCE" + for cflag in "-DWITH_MEMCPY" "-DWITHOUT_MEMCPY" + gcc -DWITH_MEMCPY -U_FORTIFY_SOURCE -O2 -o testcase-DWITH_MEMCPY testcase.c + gdb --batch -ex 'disas main' ./testcase-DWITH_MEMCPY Dump of assembler code for function main: 0x0000000000401020 <+0>: movzwl 0x1(%rsi),%eax 0x0000000000401024 <+4>: retq End of assembler dump. + for cflag in "-DWITH_MEMCPY" "-DWITHOUT_MEMCPY" + gcc -DWITHOUT_MEMCPY -U_FORTIFY_SOURCE -O2 -o testcase-DWITHOUT_MEMCPY testcase.c + gdb --batch -ex 'disas main' ./testcase-DWITHOUT_MEMCPY Dump of assembler code for function main: 0x0000000000401020 <+0>: movzwl 0x1(%rsi),%eax 0x0000000000401024 <+4>: retq End of assembler dump. 
Output on Alpine with clang: $ CC=clang sh -x test.sh -O2 + cat + : clang + clang --version Alpine clang version 5.0.2 (tags/RELEASE_502/final) (based on LLVM 5.0.2) Target: x86_64-alpine-linux-musl Thread model: posix InstalledDir: /usr/bin + uname -m x86_64 + clang -DWITH_MEMCPY '-D_FORTIFY_SOURCE=2' -O2 -o testcase-DWITH_MEMCPY testcase.c + gdb --batch -ex 'disas main' ./testcase-DWITH_MEMCPY Dump of assembler code for function main: 0x00000000000011a0 <+0>: movzwl 0x1(%rsi),%eax 0x00000000000011a4 <+4>: retq End of assembler dump. + clang -DWITHOUT_MEMCPY '-D_FORTIFY_SOURCE=2' -O2 -o testcase-DWITHOUT_MEMCPY testcase.c + gdb --batch -ex 'disas main' ./testcase-DWITHOUT_MEMCPY Dump of assembler code for function main: 0x00000000000011a0 <+0>: movzwl 0x1(%rsi),%eax 0x00000000000011a4 <+4>: retq End of assembler dump. + clang -DWITH_MEMCPY -U_FORTIFY_SOURCE -O2 -o testcase-DWITH_MEMCPY testcase.c + gdb --batch -ex 'disas main' ./testcase-DWITH_MEMCPY Dump of assembler code for function main: 0x00000000000011a0 <+0>: movzwl 0x1(%rsi),%eax 0x00000000000011a4 <+4>: retq End of assembler dump. + clang -DWITHOUT_MEMCPY -U_FORTIFY_SOURCE -O2 -o testcase-DWITHOUT_MEMCPY testcase.c + gdb --batch -ex 'disas main' ./testcase-DWITHOUT_MEMCPY Dump of assembler code for function main: 0x00000000000011a0 <+0>: movzwl 0x1(%rsi),%eax 0x00000000000011a4 <+4>: retq End of assembler dump. Output on alpine armv7: $ sh -x test.sh -O2 + cat + : gcc + gcc --version gcc (Alpine 8.2.0) 8.2.0 Copyright (C) 2018 Free Software Foundation, Inc. This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 
+ uname -m armv8l + gcc -DWITH_MEMCPY '-D_FORTIFY_SOURCE=2' -O2 -o testcase-DWITH_MEMCPY testcase.c + gdb --batch -ex 'disas main' ./testcase-DWITH_MEMCPY Dump of assembler code for function main: 0x000003f8 <+0>: push {r4, r5, lr} 0x000003fa <+2>: sub sp, #12 0x000003fc <+4>: adds r3, r1, #1 0x000003fe <+6>: add r4, sp, #4 0x00000400 <+8>: cmp r3, r4 0x00000402 <+10>: add.w r0, sp, #2 0x00000406 <+14>: ite cs 0x00000408 <+16>: movcs r2, #0 0x0000040a <+18>: movcc r2, #1 0x0000040c <+20>: cmp r3, r0 0x0000040e <+22>: it ls 0x00000410 <+24>: movls r2, #0 0x00000412 <+26>: ldr r4, [pc, #52] ; (0x448 <main+80>) 0x00000414 <+28>: ldr r5, [pc, #52] ; (0x44c <main+84>) 0x00000416 <+30>: add r4, pc 0x00000418 <+32>: ldr r4, [r4, r5] 0x0000041a <+34>: ldr r5, [r4, #0] 0x0000041c <+36>: str r5, [sp, #4] 0x0000041e <+38>: cbnz r2, 0x442 <main+74> 0x00000420 <+40>: cmp r3, r0 0x00000422 <+42>: bcc.n 0x43c <main+68> 0x00000424 <+44>: mov r1, r3 0x00000426 <+46>: movs r2, #2 0x00000428 <+48>: blx 0x3b0 <memcpy@plt> 0x0000042c <+52>: ldr r2, [sp, #4] 0x0000042e <+54>: ldr r3, [r4, #0] 0x00000430 <+56>: ldrh.w r0, [sp, #2] 0x00000434 <+60>: cmp r2, r3 0x00000436 <+62>: bne.n 0x444 <main+76> 0x00000438 <+64>: add sp, #12 0x0000043a <+66>: pop {r4, r5, pc} 0x0000043c <+68>: adds r1, #3 0x0000043e <+70>: cmp r0, r1 0x00000440 <+72>: bcs.n 0x424 <main+44> 0x00000442 <+74>: udf #255 ; 0xff 0x00000444 <+76>: blx 0x3c8 <__stack_chk_fail@plt> 0x00000448 <+80>: muleq r1, lr, r11 0x0000044c <+84>: andeq r0, r0, r4, lsr #32 End of assembler dump. + gcc -DWITHOUT_MEMCPY '-D_FORTIFY_SOURCE=2' -O2 -o testcase-DWITHOUT_MEMCPY testcase.c + gdb --batch -ex 'disas main' ./testcase-DWITHOUT_MEMCPY Dump of assembler code for function main: 0x0000036c <+0>: ldrh.w r0, [r1, #1] 0x00000370 <+4>: bx lr End of assembler dump. 
+ gcc -DWITH_MEMCPY -U_FORTIFY_SOURCE -O2 -o testcase-DWITH_MEMCPY testcase.c + gdb --batch -ex 'disas main' ./testcase-DWITH_MEMCPY Dump of assembler code for function main: 0x0000036c <+0>: ldrh.w r0, [r1, #1] 0x00000370 <+4>: bx lr End of assembler dump. + gcc -DWITHOUT_MEMCPY -U_FORTIFY_SOURCE -O2 -o testcase-DWITHOUT_MEMCPY testcase.c + gdb --batch -ex 'disas main' ./testcase-DWITHOUT_MEMCPY Dump of assembler code for function main: 0x0000036c <+0>: ldrh.w r0, [r1, #1] 0x00000370 <+4>: bx lr End of assembler dump. Output on Alpine aarch64: $ sh -x test.sh -O2 + cat + : gcc + gcc --version gcc (Alpine 8.2.0) 8.2.0 Copyright (C) 2018 Free Software Foundation, Inc. This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. + uname -m aarch64 + gcc -DWITH_MEMCPY '-D_FORTIFY_SOURCE=2' -O2 -o testcase-DWITH_MEMCPY testcase.c + gdb --batch -ex 'disas main' ./testcase-DWITH_MEMCPY Dump of assembler code for function main: 0x00000000000006b0 <+0>: stp x29, x30, [sp, #-48]! 
   0x00000000000006b4 <+4>:     add     x2, x1, #0x1
   0x00000000000006b8 <+8>:     mov     x29, sp
   0x00000000000006bc <+12>:    str     x19, [sp, #16]
   0x00000000000006c0 <+16>:    adrp    x19, 0x10000
   0x00000000000006c4 <+20>:    add     x0, sp, #0x26
   0x00000000000006c8 <+24>:    ldr     x3, [x19, #4024]
   0x00000000000006cc <+28>:    add     x4, x0, #0x2
   0x00000000000006d0 <+32>:    cmp     x4, x2
   0x00000000000006d4 <+36>:    ldr     x4, [x3]
   0x00000000000006d8 <+40>:    str     x4, [sp, #40]
   0x00000000000006dc <+44>:    mov     x4, #0x0                   // #0
   0x00000000000006e0 <+48>:    ccmp    x0, x2, #0x2, hi  // hi = pmore
   0x00000000000006e4 <+52>:    b.cc    0x72c <main+124>  // b.lo, b.ul, b.last
   0x00000000000006e8 <+56>:    cmp     x2, x0
   0x00000000000006ec <+60>:    b.cc    0x720 <main+112>  // b.lo, b.ul, b.last
   0x00000000000006f0 <+64>:    mov     x1, x2
   0x00000000000006f4 <+68>:    mov     x2, #0x2                   // #2
   0x00000000000006f8 <+72>:    bl      0x650 <memcpy@plt>
   0x00000000000006fc <+76>:    ldr     x19, [x19, #4024]
   0x0000000000000700 <+80>:    ldrh    w0, [sp, #38]
   0x0000000000000704 <+84>:    ldr     x2, [sp, #40]
   0x0000000000000708 <+88>:    ldr     x1, [x19]
   0x000000000000070c <+92>:    eor     x1, x2, x1
   0x0000000000000710 <+96>:    cbnz    x1, 0x730 <main+128>
   0x0000000000000714 <+100>:   ldr     x19, [sp, #16]
   0x0000000000000718 <+104>:   ldp     x29, x30, [sp], #48
   0x000000000000071c <+108>:   ret
   0x0000000000000720 <+112>:   add     x1, x1, #0x3
   0x0000000000000724 <+116>:   cmp     x0, x1
   0x0000000000000728 <+120>:   b.cs    0x6f0 <main+64>  // b.hs, b.nlast
   0x000000000000072c <+124>:   brk     #0x3e8
   0x0000000000000730 <+128>:   bl      0x670 <__stack_chk_fail@plt>
End of assembler dump.
+ gcc -DWITHOUT_MEMCPY '-D_FORTIFY_SOURCE=2' -O2 -o testcase-DWITHOUT_MEMCPY testcase.c
+ gdb --batch -ex 'disas main' ./testcase-DWITHOUT_MEMCPY
Dump of assembler code for function main:
   0x00000000000005e0 <+0>:     ldurh   w0, [x1, #1]
   0x00000000000005e4 <+4>:     ret
End of assembler dump.
+ gcc -DWITH_MEMCPY -U_FORTIFY_SOURCE -O2 -o testcase-DWITH_MEMCPY testcase.c
+ gdb --batch -ex 'disas main' ./testcase-DWITH_MEMCPY
Dump of assembler code for function main:
   0x00000000000005e0 <+0>:     ldurh   w0, [x1, #1]
   0x00000000000005e4 <+4>:     ret
End of assembler dump.
+ gcc -DWITHOUT_MEMCPY -U_FORTIFY_SOURCE -O2 -o testcase-DWITHOUT_MEMCPY testcase.c
+ gdb --batch -ex 'disas main' ./testcase-DWITHOUT_MEMCPY
Dump of assembler code for function main:
   0x00000000000005e0 <+0>:     ldurh   w0, [x1, #1]
   0x00000000000005e4 <+4>:     ret
End of assembler dump.

Output on Alpine ppc64le:

$ sh -x test.sh -O2
+ cat
+ : gcc
+ gcc --version
gcc (Alpine 8.2.0) 8.2.0
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ uname -m
ppc64le
+ gcc -DWITH_MEMCPY '-D_FORTIFY_SOURCE=2' -O2 -o testcase-DWITH_MEMCPY testcase.c
+ gdb --batch -ex 'disas main' ./testcase-DWITH_MEMCPY
Dump of assembler code for function main:
   0x00000000000005c0 <+0>:     lhz     r3,1(r4)
   0x00000000000005c4 <+4>:     blr
   0x00000000000005c8 <+8>:     .long 0x0
   0x00000000000005cc <+12>:    .long 0x0
   0x00000000000005d0 <+16>:    .long 0x0
End of assembler dump.
+ gcc -DWITHOUT_MEMCPY '-D_FORTIFY_SOURCE=2' -O2 -o testcase-DWITHOUT_MEMCPY testcase.c
+ gdb --batch -ex 'disas main' ./testcase-DWITHOUT_MEMCPY
Dump of assembler code for function main:
   0x00000000000005c0 <+0>:     lhz     r3,1(r4)
   0x00000000000005c4 <+4>:     blr
   0x00000000000005c8 <+8>:     .long 0x0
   0x00000000000005cc <+12>:    .long 0x0
   0x00000000000005d0 <+16>:    .long 0x0
End of assembler dump.
+ gcc -DWITH_MEMCPY -U_FORTIFY_SOURCE -O2 -o testcase-DWITH_MEMCPY testcase.c
+ gdb --batch -ex 'disas main' ./testcase-DWITH_MEMCPY
Dump of assembler code for function main:
   0x00000000000005c0 <+0>:     lhz     r3,1(r4)
   0x00000000000005c4 <+4>:     blr
   0x00000000000005c8 <+8>:     .long 0x0
   0x00000000000005cc <+12>:    .long 0x0
   0x00000000000005d0 <+16>:    .long 0x0
End of assembler dump.
+ gcc -DWITHOUT_MEMCPY -U_FORTIFY_SOURCE -O2 -o testcase-DWITHOUT_MEMCPY testcase.c
+ gdb --batch -ex 'disas main' ./testcase-DWITHOUT_MEMCPY
Dump of assembler code for function main:
   0x00000000000005c0 <+0>:     lhz     r3,1(r4)
   0x00000000000005c4 <+4>:     blr
   0x00000000000005c8 <+8>:     .long 0x0
   0x00000000000005cc <+12>:    .long 0x0
   0x00000000000005d0 <+16>:    .long 0x0
End of assembler dump.

Output on Alpine s390x:

$ sh -x test.sh -O2
+ cat
+ : gcc
+ gcc --version
gcc (Alpine 8.2.0) 8.2.0
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ uname -m
s390x
+ gcc -DWITH_MEMCPY '-D_FORTIFY_SOURCE=2' -O2 -o testcase-DWITH_MEMCPY testcase.c
+ gdb --batch -ex 'disas main' ./testcase-DWITH_MEMCPY
Dump of assembler code for function main:
   0x00000000000006b0 <+0>:     stmg    %r14,%r15,112(%r15)
   0x00000000000006b6 <+6>:     ear     %r1,%a0
   0x00000000000006ba <+10>:    lay     %r15,-176(%r15)
   0x00000000000006c0 <+16>:    sllg    %r1,%r1,32
   0x00000000000006c6 <+22>:    ear     %r1,%a1
   0x00000000000006ca <+26>:    la      %r2,166(%r15)
   0x00000000000006ce <+30>:    mvc     168(8,%r15),40(%r1)
   0x00000000000006d4 <+36>:    la      %r1,1(%r3)
   0x00000000000006d8 <+40>:    clgr    %r1,%r2
   0x00000000000006dc <+44>:    jle     0x6e8 <main+56>
   0x00000000000006e0 <+48>:    la      %r4,168(%r15)
   0x00000000000006e4 <+52>:    clgrtl  %r1,%r4
   0x00000000000006e8 <+56>:    clgr    %r1,%r2
   0x00000000000006ec <+60>:    jl      0x724 <main+116>
   0x00000000000006f0 <+64>:    lgr     %r3,%r1
   0x00000000000006f4 <+68>:    lghi    %r4,2
   0x00000000000006f8 <+72>:    brasl   %r14,0x5f0 <memcpy@plt>
   0x00000000000006fe <+78>:    ear     %r1,%a0
   0x0000000000000702 <+82>:    llgh    %r2,166(%r15)
   0x0000000000000708 <+88>:    sllg    %r1,%r1,32
   0x000000000000070e <+94>:    ear     %r1,%a1
   0x0000000000000712 <+98>:    clc     168(8,%r15),40(%r1)
   0x0000000000000718 <+104>:   jne     0x730 <main+128>
   0x000000000000071c <+108>:   lmg     %r14,%r15,288(%r15)
   0x0000000000000722 <+114>:   br      %r14
   0x0000000000000724 <+116>:   la      %r3,3(%r3)
   0x0000000000000728 <+120>:   clgrtl  %r2,%r3
   0x000000000000072c <+124>:   j       0x6f0 <main+64>
   0x0000000000000730 <+128>:   brasl   %r14,0x630 <__stack_chk_fail@plt>
End of assembler dump.
+ gcc -DWITHOUT_MEMCPY '-D_FORTIFY_SOURCE=2' -O2 -o testcase-DWITHOUT_MEMCPY testcase.c
+ gdb --batch -ex 'disas main' ./testcase-DWITHOUT_MEMCPY
Dump of assembler code for function main:
   0x00000000000005f8 <+0>:     llgh    %r2,1(%r3)
   0x00000000000005fe <+6>:     br      %r14
End of assembler dump.
+ gcc -DWITH_MEMCPY -U_FORTIFY_SOURCE -O2 -o testcase-DWITH_MEMCPY testcase.c
+ gdb --batch -ex 'disas main' ./testcase-DWITH_MEMCPY
Dump of assembler code for function main:
   0x00000000000005f8 <+0>:     llgh    %r2,1(%r3)
   0x00000000000005fe <+6>:     br      %r14
End of assembler dump.
+ gcc -DWITHOUT_MEMCPY -U_FORTIFY_SOURCE -O2 -o testcase-DWITHOUT_MEMCPY testcase.c
+ gdb --batch -ex 'disas main' ./testcase-DWITHOUT_MEMCPY
Dump of assembler code for function main:
   0x00000000000005f8 <+0>:     llgh    %r2,1(%r3)
   0x00000000000005fe <+6>:     br      %r14
End of assembler dump.
On Sat, Feb 23, 2019 at 3:55 PM Natanael Copa <ncopa@alpinelinux.org> wrote:
> On Fri, 22 Feb 2019 14:04:20 +0000
> Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > On Fri, Feb 22, 2019 at 12:57 PM Fernando Casas Schössow
> > <casasfernando@outlook.com> wrote:
> I tried to find this section. How do you get the assembly listing of
> the relevant section? I tried to do "disas virtio_pop" from
> `gdb /usr/bin/qemu-system-x86_64` from the binary in alpine edge. I
> could find 2 memcpy but none of them look like a 16 bit operation after:
>
> 0x00000000004551f1 <+353>:   mov    0x10(%rsp),%rdi
> 0x00000000004551f6 <+358>:   mov    $0x10,%edx
> 0x00000000004551fb <+363>:   callq  0x3879e0 <memcpy@plt>
> 0x0000000000455200 <+368>:   movzwl 0x5c(%rsp),%eax
> 0x0000000000455205 <+373>:   test   $0x4,%al
> 0x0000000000455207 <+375>:   je     0x4552aa <virtqueue_pop+538>
>
> ....
>
> 0x0000000000455291 <+513>:   mov    0x10(%rsp),%rdi
> 0x0000000000455296 <+518>:   mov    $0x10,%edx
> 0x000000000045529b <+523>:   callq  0x3879e0 <memcpy@plt>
> 0x00000000004552a0 <+528>:   mov    %rbp,0x20(%rsp)
> 0x00000000004552a5 <+533>:   movzwl 0x5c(%rsp),%eax
> 0x00000000004552aa <+538>:   lea    0x20e0(%rsp),%rdi
> 0x00000000004552b2 <+546>:   xor    %r11d,%r11d
> 0x00000000004552b5 <+549>:   mov    %r15,0x38(%rsp)

Here is the beginning of the function:

0000000000455090 <virtqueue_pop@@Base>:
  455090:       41 57                   push   %r15
  455092:       49 89 ff                mov    %rdi,%r15
  455095:       b9 0e 00 00 00          mov    $0xe,%ecx
  45509a:       41 56                   push   %r14
  45509c:       41 55                   push   %r13
  45509e:       41 54                   push   %r12
  4550a0:       55                      push   %rbp
  4550a1:       53                      push   %rbx
  4550a2:       48 81 ec f8 60 00 00    sub    $0x60f8,%rsp
  4550a9:       4d 8b 67 58             mov    0x58(%r15),%r12
  4550ad:       48 89 74 24 08          mov    %rsi,0x8(%rsp)
  4550b2:       48 8d 6c 24 70          lea    0x70(%rsp),%rbp
  4550b7:       48 89 ef                mov    %rbp,%rdi
  4550ba:       64 48 8b 04 25 28 00    mov    %fs:0x28,%rax
  4550c1:       00 00
  4550c3:       48 89 84 24 e8 60 00    mov    %rax,0x60e8(%rsp)
  4550ca:       00
  4550cb:       31 c0                   xor    %eax,%eax
  4550cd:       41 80 bc 24 6b 01 00    cmpb   $0x0,0x16b(%r12)
  4550d4:       00 00
  4550d6:       f3 48 ab                rep stos %rax,%es:(%rdi)
  4550d9:       0f 85 c1 04 00 00       jne    4555a0 <virtqueue_pop@@Base+0x510>
  4550df:       48 8b 1d f2 4e 99 00    mov    0x994ef2(%rip),%rbx        # de9fd8 <rcu_reader@@Base+0xde9f20>
  4550e6:       64 8b 43 0c             mov    %fs:0xc(%rbx),%eax
  4550ea:       8d 50 01                lea    0x1(%rax),%edx
  4550ed:       64 89 53 0c             mov    %edx,%fs:0xc(%rbx)
  4550f1:       85 c0                   test   %eax,%eax
  4550f3:       0f 84 d7 03 00 00       je     4554d0 <virtqueue_pop@@Base+0x440>
  4550f9:       49 83 7f 18 00          cmpq   $0x0,0x18(%r15)
  4550fe:       0f 84 7d 03 00 00       je     455481 <virtqueue_pop@@Base+0x3f1>
  455104:       41 0f b7 47 30          movzwl 0x30(%r15),%eax
  455109:       66 41 39 47 32          cmp    %ax,0x32(%r15)
  45510e:       0f 84 0c 04 00 00       je     455520 <virtqueue_pop@@Base+0x490>

This last branch leads to:

  455520:       4c 89 ff                mov    %r15,%rdi
  455523:       e8 78 b4 ff ff          callq  4509a0 <virtio_queue_host_notifier_read@@Base+0x6d0>
  455528:       48 83 b8 90 00 00 00    cmpq   $0x3,0x90(%rax)
  45552f:       03
  455530:       0f 86 98 03 00 00       jbe    4558ce <virtqueue_pop@@Base+0x83e>
  455536:       48 8b 90 80 00 00 00    mov    0x80(%rax),%rdx
  45553d:       48 85 d2                test   %rdx,%rdx
  455540:       0f 84 c2 02 00 00       je     455808 <virtqueue_pop@@Base+0x778>
  455546:       48 8d 72 02             lea    0x2(%rdx),%rsi
  45554a:       48 8d 7c 24 42          lea    0x42(%rsp),%rdi
  45554f:       48 39 fe                cmp    %rdi,%rsi
  455552:       76 09                   jbe    45555d <virtqueue_pop@@Base+0x4cd>
  455554:       48 8d 47 02             lea    0x2(%rdi),%rax
  455558:       48 39 c6                cmp    %rax,%rsi
  45555b:       72 3c                   jb     455599 <virtqueue_pop@@Base+0x509>
  45555d:       48 39 fe                cmp    %rdi,%rsi
  455560:       72 2e                   jb     455590 <virtqueue_pop@@Base+0x500>
  455562:       ba 02 00 00 00          mov    $0x2,%edx
  455567:       e8 74 24 f3 ff          callq  3879e0 <memcpy@plt>

> I can test different CFLAGS with and without the _FORTIFY_SOURCE and
> with different variants of memcpy (like __builtin_memcpy etc) but I
> need to find a way to get the correct assembly output so I know if/when
> I have found something that works.

Good.  I think tweaking the compilation flags is the quickest workaround
to produce a working QEMU binary for Alpine.  Other distros don't seem
to hit this problem.
In the longer term QEMU will need to fix these memory accessor functions
in the way that Peter described, but doing that properly requires
auditing all the code so each instance that relies on atomicity can be
converted properly (and the other ones will stay non-atomic because they
must support misaligned addresses).

Stefan
On Mon, Feb 25, 2019 at 10:30 AM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> On Sat, Feb 23, 2019 at 3:55 PM Natanael Copa <ncopa@alpinelinux.org> wrote:
> > On Fri, 22 Feb 2019 14:04:20 +0000
> > Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > > On Fri, Feb 22, 2019 at 12:57 PM Fernando Casas Schössow
> > > <casasfernando@outlook.com> wrote:
> > I tried to find this section. How do you get the assembly listing of
> > the relevant section? I tried to do "disas virtio_pop" from
> > `gdb /usr/bin/qemu-system-x86_64` from the binary in alpine edge. I
> > could find 2 memcpy but none of them look like a 16 bit operation after:
> >
> > 0x00000000004551f1 <+353>:   mov    0x10(%rsp),%rdi
> > 0x00000000004551f6 <+358>:   mov    $0x10,%edx
> > 0x00000000004551fb <+363>:   callq  0x3879e0 <memcpy@plt>
> > 0x0000000000455200 <+368>:   movzwl 0x5c(%rsp),%eax
> > 0x0000000000455205 <+373>:   test   $0x4,%al
> > 0x0000000000455207 <+375>:   je     0x4552aa <virtqueue_pop+538>
> >
> > ....
> >
> > 0x0000000000455291 <+513>:   mov    0x10(%rsp),%rdi
> > 0x0000000000455296 <+518>:   mov    $0x10,%edx
> > 0x000000000045529b <+523>:   callq  0x3879e0 <memcpy@plt>
> > 0x00000000004552a0 <+528>:   mov    %rbp,0x20(%rsp)
> > 0x00000000004552a5 <+533>:   movzwl 0x5c(%rsp),%eax
> > 0x00000000004552aa <+538>:   lea    0x20e0(%rsp),%rdi
> > 0x00000000004552b2 <+546>:   xor    %r11d,%r11d
> > 0x00000000004552b5 <+549>:   mov    %r15,0x38(%rsp)
>
> Here is the beginning of the function:

This was built from the docker alpine image:

REPOSITORY          TAG       IMAGE ID       CREATED       SIZE
docker.io/alpine    latest    caf27325b298   3 weeks ago   5.52 MB

aports git commit: be41538f0061b406a374564a0043a363efcb0293

gcc (Alpine 8.2.0) 8.2.0

Stefan
On Mon, 25 Feb 2019 at 10:24, Natanael Copa <ncopa@alpinelinux.org> wrote:
>
> On Sat, 23 Feb 2019 16:18:15 +0000
> Peter Maydell <peter.maydell@linaro.org> wrote:
>
> > On Sat, 23 Feb 2019 at 16:05, Natanael Copa <ncopa@alpinelinux.org> wrote:
> > > I was thinking of something in the lines of:
> > >
> > > typedef volatile uint16_t __attribute__((__may_alias__)) volatile_uint16_t;
> > > static inline int lduw_he_p(const void *ptr)
> > > {
> > >     volatile_uint16_t r = *(volatile_uint16_t*)ptr;
> > >     return r;
> > > }
> >
> > This won't correctly handle accesses with unaligned pointers,
> > I'm afraid. We rely on these functions correctly working
> > with pointers that are potentially unaligned.
>
> Well, current code does not even handle access with aligned pointers,
> depending on FORTIFY_SOURCE implementation.

It correctly handles aligned and unaligned pointers for the
API guarantees that the function in QEMU provides, which is
to say "definitely works on any kind of aligned or unaligned
pointer, not guaranteed to be atomic". Unfortunately some
code in QEMU is implicitly assuming it is atomic, which this
QEMU function does not guarantee -- it just happens to provide
that most of the time.

> My thinking here is that we depend on assumption that compiler will
> remove the memcpy call, so maybe find other way to generate same
> assembly, while still depend on compiler optimization.
>
> I did some tests and compared the assembly output. The compiler will
> generate same assembly if volatile is not used. The attribute
> __may_alias__ does not seem to make any difference. So the following
> function generates exactly same assembly:
>
> static inline int lduw_he_p(const void *ptr)
> {
>     uint16_t r = *(uint16_t*)ptr;
>     return r;
> }

This still is not guaranteed to work on unaligned pointers.
(For instance it probably won't work on SPARC hosts.)
More generally there is no way to have a single function
that is guaranteed to handle unaligned pointers and also
guaranteed to produce an atomic access on all host architectures
that we support, because not all host architectures allow
you to do both with the same instruction sequence. So
you can't just change this function to make it provide
atomic access without breaking the other guarantee it is
providing.

The long term fix for this is that we need to separate out
our APIs so we can have a family of functions that guarantee
to work on unaligned pointers, and a family that guarantee
to work atomically, and calling code can use the one that
provides the semantics it requires.

The short term fix is to fix your toolchain/compilation
environment options so that it isn't trying to override
the definition of memcpy().

thanks
-- PMM
I'm running the test guest on another host for the last three days and
so far so good. Yet because of the nature of this bug/issue it can take
a few hours or a few days more to fail. The failure is unpredictable.

Does it make sense to continue running the guest on this different host
to try to repro, or can I migrate the guest back to the original host?
At this point I don't think it will make any difference. The current
host storage is far slower, causing latency issues in the guest, so if
I can migrate it back that would be great.

Sent from my iPhone

> On 25 Feb 2019, at 11:34, Peter Maydell <peter.maydell@linaro.org> wrote:
>
>> On Mon, 25 Feb 2019 at 10:24, Natanael Copa <ncopa@alpinelinux.org> wrote:
>>
>> On Sat, 23 Feb 2019 16:18:15 +0000
>> Peter Maydell <peter.maydell@linaro.org> wrote:
>>
>>>> On Sat, 23 Feb 2019 at 16:05, Natanael Copa <ncopa@alpinelinux.org> wrote:
>>>> I was thinking of something in the lines of:
>>>>
>>>> typedef volatile uint16_t __attribute__((__may_alias__)) volatile_uint16_t;
>>>> static inline int lduw_he_p(const void *ptr)
>>>> {
>>>>     volatile_uint16_t r = *(volatile_uint16_t*)ptr;
>>>>     return r;
>>>> }
>>>
>>> This won't correctly handle accesses with unaligned pointers,
>>> I'm afraid. We rely on these functions correctly working
>>> with pointers that are potentially unaligned.
>>
>> Well, current code does not even handle access with aligned pointers,
>> depending on FORTIFY_SOURCE implementation.
>
> It correctly handles aligned and unaligned pointers for the
> API guarantees that the function in QEMU provides, which is
> to say "definitely works on any kind of aligned or unaligned
> pointer, not guaranteed to be atomic". Unfortunately some
> code in QEMU is implicitly assuming it is atomic, which this
> QEMU function does not guarantee -- it just happens to provide
> that most of the time.
>
>> My thinking here is that we depend on assumption that compiler will
>> remove the memcpy call, so maybe find other way to generate same
>> assembly, while still depend on compiler optimization.
>
>> I did some tests and compared the assembly output. The compiler will
>> generate same assembly if volatile is not used. The attribute
>> __may_alias__ does not seem to make any difference. So the following
>> function generates exactly same assembly:
>>
>> static inline int lduw_he_p(const void *ptr)
>> {
>>     uint16_t r = *(uint16_t*)ptr;
>>     return r;
>> }
>
> This still is not guaranteed to work on unaligned pointers.
> (For instance it probably won't work on SPARC hosts.)
> More generally there is no way to have a single function
> that is guaranteed to handle unaligned pointers and also
> guaranteed to produce an atomic access on all host architectures
> that we support, because not all host architectures allow
> you to do both with the same instruction sequence. So
> you can't just change this function to make it provide
> atomic access without breaking the other guarantee it is
> providing.
>
> The long term fix for this is that we need to separate out
> our APIs so we can have a family of functions that guarantee
> to work on unaligned pointers, and a family that guarantee
> to work atomically, and calling code can use the one that
> provides the semantics it requires.
>
> The short term fix is to fix your toolchain/compilation
> environment options so that it isn't trying to override
> the definition of memcpy().
>
> thanks
> -- PMM
On Mon, 25 Feb 2019 10:34:23 +0000
Peter Maydell <peter.maydell@linaro.org> wrote:

> On Mon, 25 Feb 2019 at 10:24, Natanael Copa <ncopa@alpinelinux.org> wrote:
> >
> > On Sat, 23 Feb 2019 16:18:15 +0000
> > Peter Maydell <peter.maydell@linaro.org> wrote:
> >
> > > On Sat, 23 Feb 2019 at 16:05, Natanael Copa <ncopa@alpinelinux.org> wrote:
> > > > I was thinking of something in the lines of:
> > > >
> > > > typedef volatile uint16_t __attribute__((__may_alias__)) volatile_uint16_t;
> > > > static inline int lduw_he_p(const void *ptr)
> > > > {
> > > >     volatile_uint16_t r = *(volatile_uint16_t*)ptr;
> > > >     return r;
> > > > }
> > >
> > > This won't correctly handle accesses with unaligned pointers,
> > > I'm afraid. We rely on these functions correctly working
> > > with pointers that are potentially unaligned.
> >
> > Well, current code does not even handle access with aligned pointers,
> > depending on FORTIFY_SOURCE implementation.
>
> It correctly handles aligned and unaligned pointers for the
> API guarantees that the function in QEMU provides, which is
> to say "definitely works on any kind of aligned or unaligned
> pointer, not guaranteed to be atomic". Unfortunately some
> code in QEMU is implicitly assuming it is atomic, which this
> QEMU function does not guarantee -- it just happens to provide
> that most of the time.
>
> > My thinking here is that we depend on assumption that compiler will
> > remove the memcpy call, so maybe find other way to generate same
> > assembly, while still depend on compiler optimization.
>
> > I did some tests and compared the assembly output. The compiler will
> > generate same assembly if volatile is not used. The attribute
> > __may_alias__ does not seem to make any difference. So the following
> > function generates exactly same assembly:
> >
> > static inline int lduw_he_p(const void *ptr)
> > {
> >     uint16_t r = *(uint16_t*)ptr;
> >     return r;
> > }
>
> This still is not guaranteed to work on unaligned pointers.
> (For instance it probably won't work on SPARC hosts.)
> More generally there is no way to have a single function
> that is guaranteed to handle unaligned pointers and also
> guaranteed to produce an atomic access on all host architectures
> that we support, because not all host architectures allow
> you to do both with the same instruction sequence. So
> you can't just change this function to make it provide
> atomic access without breaking the other guarantee it is
> providing.

Right, so it is possible that there are other architectures (like SPARC)
that suffer from the same race here as Alpine, because memcpy is not
guaranteed to be atomic.

> The long term fix for this is that we need to separate out
> our APIs so we can have a family of functions that guarantee
> to work on unaligned pointers, and a family that guarantee
> to work atomically, and calling code can use the one that
> provides the semantics it requires.
>
> The short term fix is to fix your toolchain/compilation
> environment options so that it isn't trying to override
> the definition of memcpy().

The easiest workaround is to simply disable FORTIFY_SOURCE, but that
will weaken the security for all implemented string functions, strcpy,
memmove etc, so I don't want to do that.

Is it only lduw_he_p that needs to be atomic, or are the other functions
in include/qemu/bswap.h using memcpy also required to be atomic? I intend
to replace memcpy with __builtin_memcpy there.

> thanks
> -- PMM
On Mon, 25 Feb 2019 at 12:22, Natanael Copa <ncopa@alpinelinux.org> wrote:
>
> On Mon, 25 Feb 2019 10:34:23 +0000
> Peter Maydell <peter.maydell@linaro.org> wrote:
> > The short term fix is to fix your toolchain/compilation
> > environment options so that it isn't trying to override
> > the definition of memcpy().
>
> The easiest workaround is to simply disable FORTIFY_SOURCE, but that
> will weaken the security for all implemented string functions, strcpy,
> memmove etc, so I don't want to do that.
>
> Is it only lduw_he_p that needs to be atomic or are the other functions
> in include/qemu/bswap.h using memcpy also required to be atomic?

Hard to say, since we haven't done the "audit all the callers"
step that Stefan mentioned. If you're going to replace memcpy
with __builtin_memcpy then the safest thing is to do it for
all those uses (this will also give you much better generated
code for performance purposes).

thanks
-- PMM
On Mon, 25 Feb 2019 13:06:16 +0000
Peter Maydell <peter.maydell@linaro.org> wrote:

> On Mon, 25 Feb 2019 at 12:22, Natanael Copa <ncopa@alpinelinux.org> wrote:
> >
> > On Mon, 25 Feb 2019 10:34:23 +0000
> > Peter Maydell <peter.maydell@linaro.org> wrote:
> > > The short term fix is to fix your toolchain/compilation
> > > environment options so that it isn't trying to override
> > > the definition of memcpy().
> >
> > The easiest workaround is to simply disable FORTIFY_SOURCE, but that
> > will weaken the security for all implemented string functions, strcpy,
> > memmove etc, so I don't want to do that.
> >
> > Is it only lduw_he_p that needs to be atomic or are the other functions
> > in include/qemu/bswap.h using memcpy also required to be atomic?
>
> Hard to say, since we haven't done the "audit all the callers"
> step that Stefan mentioned. If you're going to replace memcpy
> with __builtin_memcpy then the safest thing is to do it for
> all those uses (this will also give you much better generated
> code for performance purposes).

I figured that, and that is exactly what I did.

Fernando: Can you please test the binary from qemu-system-x86_64-3.1.0-r3
from alpine edge? I will backport the fix if you can confirm it fixes the
problem.

Thanks!

-nc

PS. Those issues are pretty hard to track down, so big thanks to everyone
who helped find the exact issue here. You have done great work!
Thanks Natanael.

Is the new package ready? I will update as soon as the package is
available, try to repro and report back.

Thanks everyone for looking into this!

On lun, feb 25, 2019 at 2:25 PM, Natanael Copa <ncopa@alpinelinux.org> wrote:

On Mon, 25 Feb 2019 13:06:16 +0000
Peter Maydell <peter.maydell@linaro.org> wrote:

On Mon, 25 Feb 2019 at 12:22, Natanael Copa <ncopa@alpinelinux.org> wrote:
>
> On Mon, 25 Feb 2019 10:34:23 +0000
> Peter Maydell <peter.maydell@linaro.org> wrote:
> > The short term fix is to fix your toolchain/compilation
> > environment options so that it isn't trying to override
> > the definition of memcpy().
>
> The easiest workaround is to simply disable FORTIFY_SOURCE, but that
> will weaken the security for all implemented string functions, strcpy,
> memmove etc, so I don't want to do that.
>
> Is it only lduw_he_p that needs to be atomic or are the other functions
> in include/qemu/bswap.h using memcpy also required to be atomic?

Hard to say, since we haven't done the "audit all the callers"
step that Stefan mentioned. If you're going to replace memcpy
with __builtin_memcpy then the safest thing is to do it for
all those uses (this will also give you much better generated
code for performance purposes).

I figured that and that is exactly what I did.

Fernando: Can you please test the binary from qemu-system-x86_64-3.1.0-r3
from alpine edge? I will backport the fix if you can confirm it fixes the
problem.

Thanks!

-nc

PS. Those issues are pretty hard to track down, so big thanks to everyone
who helped find the exact issue here. You have done great work!
Ok, the new package is deployed. I stopped and started the test guest so
it will use the new QEMU binary. Will monitor and report back.

On lun, feb 25, 2019 at 2:32 PM, Fernando Casas Schössow <casasfernando@outlook.com> wrote:

Thanks Natanael.

Is the new package ready? I will update as soon as the package is
available, try to repro and report back.

Thanks everyone for looking into this!

On lun, feb 25, 2019 at 2:25 PM, Natanael Copa <ncopa@alpinelinux.org> wrote:

On Mon, 25 Feb 2019 13:06:16 +0000
Peter Maydell <peter.maydell@linaro.org> wrote:

On Mon, 25 Feb 2019 at 12:22, Natanael Copa <ncopa@alpinelinux.org> wrote:
>
> On Mon, 25 Feb 2019 10:34:23 +0000
> Peter Maydell <peter.maydell@linaro.org> wrote:
> > The short term fix is to fix your toolchain/compilation
> > environment options so that it isn't trying to override
> > the definition of memcpy().
>
> The easiest workaround is to simply disable FORTIFY_SOURCE, but that
> will weaken the security for all implemented string functions, strcpy,
> memmove etc, so I don't want to do that.
>
> Is it only lduw_he_p that needs to be atomic or are the other functions
> in include/qemu/bswap.h using memcpy also required to be atomic?

Hard to say, since we haven't done the "audit all the callers"
step that Stefan mentioned. If you're going to replace memcpy
with __builtin_memcpy then the safest thing is to do it for
all those uses (this will also give you much better generated
code for performance purposes).

I figured that and that is exactly what I did.

Fernando: Can you please test the binary from qemu-system-x86_64-3.1.0-r3
from alpine edge? I will backport the fix if you can confirm it fixes the
problem.

Thanks!

-nc

PS. Those issues are pretty hard to track down, so big thanks to everyone
who helped find the exact issue here. You have done great work!
On 23/02/19 12:49, Natanael Copa wrote:
> I suspect this happens due to the Alpine toolchain enabling
> _FORTIFY_SOURCE=2 by default and the way this is implemented via
> fortify-headers:
> http://git.2f30.org/fortify-headers/file/include/string.h.html#l39

The call to __orig_memcpy is the culprit there, is there any reason not
to do something like

_FORTIFY_FN(memcpy) void *__memcpy_chk(void *__od, const void *__os, size_t __n)
{
	size_t __bd = __builtin_object_size(__od, 0);
	size_t __bs = __builtin_object_size(__os, 0);
	char *__d = (char *)__od;
	const char *__s = (const char *)__os;

	/* trap if pointers are overlapping but not if dst == src.
	 * gcc seems to like to generate code that relies on dst == src */
	if ((__d < __s && __d + __n > __s) ||
	    (__s < __d && __s + __n > __d))
		__builtin_trap();
	if (__n > __bd || __n > __bs)
		__builtin_trap();
	return memcpy(__od, __os, __n);
}
#define memcpy __memcpy_chk

?  That is, getting rid of _FORTIFY_ORIG altogether.

Paolo
Just wanted to share a small update on the situation after updating QEMU
to the new Alpine package carrying Natanael's fix.

So far so good; moreover I switched a few other guests from SATA to
VirtIO SCSI and after two days no issues. Unless I find any problem I
will report back with an update in a week from now.

Thanks everyone for all you did to help find a solution to this issue.

On mar, feb 26, 2019 at 2:30 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:

On 23/02/19 12:49, Natanael Copa wrote:

I suspect this happens due to the Alpine toolchain enabling
_FORTIFY_SOURCE=2 by default and the way this is implemented via
fortify-headers:
http://git.2f30.org/fortify-headers/file/include/string.h.html#l39

The call to __orig_memcpy is the culprit there, is there any reason not
to do something like

_FORTIFY_FN(memcpy) void *__memcpy_chk(void *__od, const void *__os, size_t __n)
{
	size_t __bd = __builtin_object_size(__od, 0);
	size_t __bs = __builtin_object_size(__os, 0);
	char *__d = (char *)__od;
	const char *__s = (const char *)__os;

	/* trap if pointers are overlapping but not if dst == src.
	 * gcc seems to like to generate code that relies on dst == src */
	if ((__d < __s && __d + __n > __s) ||
	    (__s < __d && __s + __n > __d))
		__builtin_trap();
	if (__n > __bd || __n > __bs)
		__builtin_trap();
	return memcpy(__od, __os, __n);
}
#define memcpy __memcpy_chk

?  That is, getting rid of _FORTIFY_ORIG altogether.

Paolo
On Mon, 25 Feb 2019 at 13:06, Peter Maydell <peter.maydell@linaro.org> wrote:
>
> On Mon, 25 Feb 2019 at 12:22, Natanael Copa <ncopa@alpinelinux.org> wrote:
> >
> > On Mon, 25 Feb 2019 10:34:23 +0000
> > Peter Maydell <peter.maydell@linaro.org> wrote:
> > > The short term fix is to fix your toolchain/compilation
> > > environment options so that it isn't trying to override
> > > the definition of memcpy().
> >
> > The easiest workaround is to simply disable FORTIFY_SOURCE, but that
> > will weaken the security for all implemented string functions, strcpy,
> > memmove etc, so I don't want to do that.
> >
> > Is it only lduw_he_p that needs to be atomic or are the other functions
> > in include/qemu/bswap.h using memcpy also required to be atomic?
>
> Hard to say, since we haven't done the "audit all the callers"
> step that Stefan mentioned. If you're going to replace memcpy
> with __builtin_memcpy then the safest thing is to do it for
> all those uses (this will also give you much better generated
> code for performance purposes).

I mentioned this to Richard on IRC the other day, but I think there's
a good argument for making upstream QEMU use __builtin_memcpy in these
helpers:

* it makes us robust against things like this fortify library in the
  short term (until we fix QEMU to have an equivalent set of functions
  for doing atomic accesses to AddressSpaces)

* in the longer term it will mean that we don't end up with these
  functions being really badly-performing even if the semantics of
  the out-of-line memcpy() are correct

thanks
-- PMM
After two weeks of running the new QEMU package everything is fine. Moreover, I migrated the rest of the guests (both Windows and Linux) to virtio-scsi and there have been no issues so far. I will monitor for another week, but this issue seems pretty much fixed!

Kudos to each and every one of you who helped find and solve this issue.

Fer

On jue, feb 28, 2019 at 10:58 AM, Peter Maydell <peter.maydell@linaro.org> wrote:

On Mon, 25 Feb 2019 at 13:06, Peter Maydell <peter.maydell@linaro.org> wrote:

On Mon, 25 Feb 2019 at 12:22, Natanael Copa <ncopa@alpinelinux.org> wrote:
>
> On Mon, 25 Feb 2019 10:34:23 +0000
> Peter Maydell <peter.maydell@linaro.org> wrote:
> > The short term fix is to fix your toolchain/compilation
> > environment options so that it isn't trying to override
> > the definition of memcpy().
>
> The easiest workaround is to simply disable FORTIFY_SOURCE, but that
> would weaken the security of all the implemented string functions
> (strcpy, memmove, etc.), so I don't want to do that.
>
> Is it only lduw_he_p that needs to be atomic, or are the other
> functions in include/qemu/bswap.h that use memcpy also required
> to be atomic?

Hard to say, since we haven't done the "audit all the callers"
step that Stefan mentioned. If you're going to replace memcpy
with __builtin_memcpy then the safest thing is to do it for
all those uses (this will also give you much better generated
code for performance purposes).
I mentioned this to Richard on IRC the other day, but I think there's
a good argument for making upstream QEMU use __builtin_memcpy in these
helpers:

* it makes us robust against things like this fortify library in the
  short term (until we fix QEMU to have an equivalent set of functions
  for doing atomic accesses to AddressSpaces)
* in the longer term it means we won't end up with these functions
  performing really badly even when the semantics of the out-of-line
  memcpy() are correct

thanks
-- PMM
Hi all,

Wanted to share one more update regarding this issue. Since running the new package with the patch from Natanael, the issue has never happened again. It's been three weeks with all guests running on virtio-scsi and no issues at all, so I guess we can consider this solved.

Thanks again to everyone who helped identify and fix this issue. Great work.

Cheers,
Fer

On jue, mar 7, 2019 at 8:14 AM, Fernando Casas Schössow <casasfernando@outlook.com> wrote:

After two weeks of running the new QEMU package everything is fine. Moreover, I migrated the rest of the guests (both Windows and Linux) to virtio-scsi and there have been no issues so far. I will monitor for another week, but this issue seems pretty much fixed!

Kudos to each and every one of you who helped find and solve this issue.

Fer
On Mon, Mar 18, 2019 at 07:07:25AM +0000, Fernando Casas Schössow wrote:
> Wanted to share one more update regarding this issue.
> Since running the new package with the patch from Natanael, the issue has never happened again.
> It's been three weeks with all guests running on virtio-scsi and no issues at all, so I guess we can consider this solved.
>
> Thanks again to everyone who helped identify and fix this issue. Great work.

Nice, thanks for reporting and providing all the debugging information to track down this bug!

Stefan
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -856,7 +856,7 @@ void *virtqueue_pop(VirtQueue *vq, size_t sz)
     max = vq->vring.num;
 
     if (vq->inuse >= vq->vring.num) {
-        virtio_error(vdev, "Virtqueue size exceeded");
+        virtio_error(vdev, "Virtqueue %u device %s size exceeded", vq->queue_index, vdev->name);
         goto done;
     }