KVM pci-assign - iommu width is not sufficient for mapped address

Message ID 1452279177.29599.226.camel@redhat.com (mailing list archive)
State New, archived

Commit Message

Alex Williamson Jan. 8, 2016, 6:52 p.m. UTC
On Fri, 2016-01-08 at 12:22 +0530, Shyam wrote:
> Hi Alex,
> 
> It will be hard to reproduce this on Fedora/RHEL. We have an
> Ubuntu-based server/VM & I can shift to any kernel/qemu/vfio versions
> that you recommend.
> 
> Both our Host & Guest run Ubuntu Trusty (Ubuntu 14.04.3 LTS) with
> Linux Kernel version 3.18.19 (from
> http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.18.19-vivid/).
> 
> Qemu version on the host is
> QEMU emulator version 2.0.0 (Debian 2.0.0+dfsg-2ubuntu1.21),
> Copyright (c) 2003-2008 Fabrice Bellard
> 
> We are using 8 x Intel RMS3CC080 SSDs for this test. We expose these
> SSDs to the VM (through iSER) & then set up dm-stripe over them
> within the VM. We create two 100GB dm-linear devices out of this &
> expose them through SCST to an external server. The external server
> connects to these devices over iSER & has 4 multipath paths (policy:
> queue-length:0) per device. From the external server we run fio with
> 4 threads, each with 64 outstanding IOs of 100% 4K random reads.
> 
> This is the performance difference we see
> 
> with PCI-assign to the VM
> randrw 100:0 64iodepth 4thr 4kb - R: 550,224K wait_us:2,245 cpu
> tot:85.57 usr:3.96 sys:31.55 iow:50.06
> 
> i.e. we get 137-140K IOPs or 550MB/s
> 
> with VFIO to the VM
> randrw 100:0 64iodepth 4thr 4kb - R: 309,432K wait_us:3,964 cpu
> tot:78.58 usr:2.28 sys:18.00 iow:58.30
> 
> i.e. we get 77-80K IOPs or 310MB/s
> 
> The only change between the two runs is that the VM is spawned with
> VFIO instead of pci-assign. There is no other difference in software
> versions or settings.
> 
> $ grep VFIO /boot/config-`uname -r`
> CONFIG_VFIO_IOMMU_TYPE1=m
> CONFIG_VFIO=m
> CONFIG_VFIO_PCI=m
> CONFIG_VFIO_PCI_VGA=y
> CONFIG_KVM_VFIO=y
> 
> I uploaded QEMU command-line & lspci outputs at
> https://www.dropbox.com/s/imbqn0274i6hhnz/vfio_issue.tgz?dl=0
> 
> Please let me know if you have any issues downloading it.
> 
> Please let us know if you see that any KVM acceleration is disabled,
> & suggest next steps to debug with VFIO tracing. Thanks for your
> help!

Thanks for the logs, everything appears to be set up correctly.  One
suspicion I have is the difference between pci-assign and vfio-pci in
the way the MSI-X Pending Bits Array (PBA) is handled.  Legacy KVM
device assignment handles MSI-X itself and ignores the PBA.  On this
hardware the MSI-X vector table and PBA are nicely aligned on separate
4k pages, which means that pci-assign will give the VM direct access to
everything on the PBA page.  On the other hand, vfio-pci registers MSI-
X with QEMU, which does handle the PBA.  The vast majority of drivers
never use the PBA and the PCI spec includes an implementation note
suggesting that hardware vendors include additional alignment to
prevent MSI-X structures from overlapping with other registers.  My
hypothesis is that this device perhaps does not abide by that
recommendation and may be regularly accessing the PBA page, thus
causing a vfio-pci assigned device to trap through to QEMU more
regularly than a legacy assigned device.
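
(For reference, lspci -vvv prints the MSI-X vector table and PBA
BAR/offset from the capability, so the 4k separation can be confirmed
directly on the device. The standalone sketch below does the same
check programmatically; it is only illustrative, assumes root access,
and reads the standard 256-byte config space through sysfs.)

/*
 * msix_pba_check.c - check whether a device's MSI-X vector table and
 * PBA land on separate 4 KiB pages.
 *
 * Build: gcc -o msix_pba_check msix_pba_check.c
 * Run:   sudo ./msix_pba_check 0000:06:00.0
 * (reading config space past the first 64 bytes requires root)
 */
#include <stdio.h>
#include <stdint.h>

#define PCI_STATUS          0x06
#define PCI_STATUS_CAP_LIST 0x10
#define PCI_CAP_PTR         0x34
#define PCI_CAP_ID_MSIX     0x11
#define PAGE_4K             4096u

static uint32_t rd32(const uint8_t *cfg, unsigned off)
{
    return cfg[off] | (cfg[off + 1] << 8) |
           ((uint32_t)cfg[off + 2] << 16) | ((uint32_t)cfg[off + 3] << 24);
}

int main(int argc, char **argv)
{
    uint8_t cfg[256];
    char path[128];
    FILE *f;
    unsigned cap;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <dddd:bb:dd.f>\n", argv[0]);
        return 1;
    }
    snprintf(path, sizeof(path), "/sys/bus/pci/devices/%s/config", argv[1]);
    f = fopen(path, "rb");
    if (!f || fread(cfg, 1, sizeof(cfg), f) != sizeof(cfg)) {
        perror(path);
        return 1;
    }
    fclose(f);

    if (!(cfg[PCI_STATUS] & PCI_STATUS_CAP_LIST)) {
        fprintf(stderr, "no capability list\n");
        return 1;
    }

    /* Walk the capability list looking for MSI-X (cap ID 0x11). */
    for (cap = cfg[PCI_CAP_PTR] & ~3u; cap && cap + 12 <= sizeof(cfg);
         cap = cfg[cap + 1] & ~3u) {
        if (cfg[cap] == PCI_CAP_ID_MSIX) {
            break;
        }
    }
    if (!cap || cfg[cap] != PCI_CAP_ID_MSIX) {
        fprintf(stderr, "no MSI-X capability\n");
        return 1;
    }

    /* Message Control bits 10:0 hold (table size - 1). */
    unsigned nvec      = ((cfg[cap + 2] | (cfg[cap + 3] << 8)) & 0x7ff) + 1;
    uint32_t table     = rd32(cfg, cap + 4);   /* Table Offset/BIR */
    uint32_t pba       = rd32(cfg, cap + 8);   /* PBA Offset/BIR   */
    unsigned table_bir = table & 7, pba_bir = pba & 7;
    uint32_t table_off = table & ~7u, pba_off = pba & ~7u;
    uint32_t table_end = table_off + nvec * 16;            /* 16 bytes/entry */
    uint32_t pba_end   = pba_off + ((nvec + 63) / 64) * 8; /* 1 bit/vector   */

    printf("MSI-X: %u entries, table BAR%u+0x%x, PBA BAR%u+0x%x\n",
           nvec, table_bir, (unsigned)table_off, pba_bir, (unsigned)pba_off);

    /* Page-granularity interval overlap check between table and PBA. */
    if (table_bir == pba_bir &&
        table_off / PAGE_4K <= (pba_end - 1) / PAGE_4K &&
        pba_off / PAGE_4K <= (table_end - 1) / PAGE_4K) {
        printf("vector table and PBA share a 4 KiB page\n");
    } else {
        printf("vector table and PBA are on separate pages/BARs\n");
    }
    return 0;
}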

If I could ask you to build and run a new QEMU, I think we can easily
test this hypothesis by making vfio-pci behave more like pci-assign.
 The following patch is based on QEMU 2.5 and simply skips the step of
placing the PBA memory region overlapping the device, allowing direct
access in this case.  The patch is easily adaptable to older versions
of QEMU, but if we need to do any further tracing, it's probably best
to do so on 2.5 anyway.  This is only a proof of concept; if it proves
to be the culprit, we'll need to think about how to handle it more
cleanly.  Here's the patch:


Thanks,
Alex

Comments

Shyam Jan. 11, 2016, 10:11 a.m. UTC | #1
Hi Alex,

You are spot on!

Applying your patch to QEMU 2.5.50 (latest from GitHub master) fully
solves the performance issue. We are able to get back to pci-assign
performance numbers. Great!

Can you please look into how to formalize this patch cleanly? I will
be happy to test additional patches for you. Thanks a lot for your help!

--Shyam


Patch

diff --git a/hw/pci/msix.c b/hw/pci/msix.c
index 64c93d8..a5ad18c 100644
--- a/hw/pci/msix.c
+++ b/hw/pci/msix.c
@@ -291,7 +291,7 @@  int msix_init(struct PCIDevice *dev, unsigned short nentries,
     memory_region_add_subregion(table_bar, table_offset, &dev->msix_table_mmio);
     memory_region_init_io(&dev->msix_pba_mmio, OBJECT(dev), &msix_pba_mmio_ops, dev,
                           "msix-pba", pba_size);
-    memory_region_add_subregion(pba_bar, pba_offset, &dev->msix_pba_mmio);
+    /* memory_region_add_subregion(pba_bar, pba_offset, &dev->msix_pba_mmio); */
 
     return 0;
 }
@@ -369,7 +369,7 @@  void msix_uninit(PCIDevice *dev, MemoryRegion *table_bar, MemoryRegion *pba_bar)
     dev->msix_cap = 0;
     msix_free_irq_entries(dev);
     dev->msix_entries_nr = 0;
-    memory_region_del_subregion(pba_bar, &dev->msix_pba_mmio);
+    /* memory_region_del_subregion(pba_bar, &dev->msix_pba_mmio); */
     g_free(dev->msix_pba);
     dev->msix_pba = NULL;
     memory_region_del_subregion(table_bar, &dev->msix_table_mmio);
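
If this does turn out to be the culprit, one possible shape for a
cleaner follow-up (purely a sketch against the msix_init() hunk above;
msix_pba_passthrough is a hypothetical flag, not an existing QEMU
field) would be to keep the PBA region but let a device opt out of
trapping it:

    memory_region_init_io(&dev->msix_pba_mmio, OBJECT(dev), &msix_pba_mmio_ops, dev,
                          "msix-pba", pba_size);
    /* Hypothetical opt-out: an assigned device could set this when the
     * PBA sits on its own page and trapping it costs too much. */
    if (!dev->msix_pba_passthrough) {
        memory_region_add_subregion(pba_bar, pba_offset, &dev->msix_pba_mmio);
    }

msix_uninit() would need the matching condition around its
memory_region_del_subregion() call for the PBA region.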