diff mbox

KVM pci-assign - iommu width is not sufficient for mapped address

Message ID 1452554423.29599.333.camel@redhat.com (mailing list archive)
State New, archived
Headers show

Commit Message

Alex Williamson Jan. 11, 2016, 11:20 p.m. UTC
On Mon, 2016-01-11 at 15:41 +0530, Shyam wrote:
> Hi Alex,
> 
> You are spot on!
> 
> Applying your patch on QEMU 2.5.50 (latest from github master) solves
> the performance issue fully. We are able to get back to pci-assign
> performance numbers. Great!
> 
> Can you please see how to formalize this patch cleanly? I will be
> happy to test additional patches for you. Thanks a lot for your help!

Hi Shyam,

Thanks for the testing.  I'm really tempted to just disable PBA
emulation altogether, but I came up with the below patch which enables
it only in the off chance that it's needed.  Patch is against current
qemu.git, please test.  Thanks!

Alex

commit 4f97c12c9f801fabdd3405758408f516e8ea1a80
Author: Alex Williamson <alex.williamson@redhat.com>
Date:   Mon Jan 11 10:44:13 2016 -0700

    vfio/pci: Lazy PBA emulation
    
    The PCI spec recommends devices use additional alignment for MSI-X
    data structures to allow software to map them to separate processor
    pages.  One advantage of doing this is that we can emulate those data
    structures without a significant performance impact to the operation
    of the device.  Some devices fail to implement that suggestion and
    assigned device performance suffers.
    
    One such case of this is a Mellanox MT27500 series, ConnectX-3 VF,
    where the MSI-X vector table and PBA are aligned on separate 4K
    pages.  If PBA emulation is enabled, performance suffers.  It's not
    clear how much value we get from PBA emulation, but the solution here
    is to only lazily enable the emulated PBA when a masked MSI-X vector
    fires.  We then attempt to more aggresively disable the PBA memory
    region any time a vector is unmasked.  The expectation is then that
    a typical VM will run entirely with PBA emulation disabled, and only
    when used is that emulation re-enabled.
    
    Reported-by: Shyam Kaushik <shyam.kaushik@gmail.com>
    Signed-off-by: Alex Williamson <alex.williamson@redhat.com>

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Shyam Jan. 12, 2016, 7:04 a.m. UTC | #1
Hi Alex,

I tested your new patch & it works great! Yields similar performance
as your disable PBA emulation patch.

So I think you are good to commit. Once you commit we will start using
qemu.git/master.

Thanks again for your great support!

--Shyam


On Tue, Jan 12, 2016 at 4:50 AM, Alex Williamson
<alex.williamson@redhat.com> wrote:
> On Mon, 2016-01-11 at 15:41 +0530, Shyam wrote:
>> Hi Alex,
>>
>> You are spot on!
>>
>> Applying your patch on QEMU 2.5.50 (latest from github master) solves
>> the performance issue fully. We are able to get back to pci-assign
>> performance numbers. Great!
>>
>> Can you please see how to formalize this patch cleanly? I will be
>> happy to test additional patches for you. Thanks a lot for your help!
>
> Hi Shyam,
>
> Thanks for the testing.  I'm really tempted to just disable PBA
> emulation altogether, but I came up with the below patch which enables
> it only in the off chance that it's needed.  Patch is against current
> qemu.git, please test.  Thanks!
>
> Alex
>
> commit 4f97c12c9f801fabdd3405758408f516e8ea1a80
> Author: Alex Williamson <alex.williamson@redhat.com>
> Date:   Mon Jan 11 10:44:13 2016 -0700
>
>     vfio/pci: Lazy PBA emulation
>
>     The PCI spec recommends devices use additional alignment for MSI-X
>     data structures to allow software to map them to separate processor
>     pages.  One advantage of doing this is that we can emulate those data
>     structures without a significant performance impact to the operation
>     of the device.  Some devices fail to implement that suggestion and
>     assigned device performance suffers.
>
>     One such case of this is a Mellanox MT27500 series, ConnectX-3 VF,
>     where the MSI-X vector table and PBA are aligned on separate 4K
>     pages.  If PBA emulation is enabled, performance suffers.  It's not
>     clear how much value we get from PBA emulation, but the solution here
>     is to only lazily enable the emulated PBA when a masked MSI-X vector
>     fires.  We then attempt to more aggresively disable the PBA memory
>     region any time a vector is unmasked.  The expectation is then that
>     a typical VM will run entirely with PBA emulation disabled, and only
>     when used is that emulation re-enabled.
>
>     Reported-by: Shyam Kaushik <shyam.kaushik@gmail.com>
>     Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
>
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 1fb868c..e66c47f 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -356,6 +356,13 @@ static void vfio_msi_interrupt(void *opaque)
>      if (vdev->interrupt == VFIO_INT_MSIX) {
>          get_msg = msix_get_message;
>          notify = msix_notify;
> +
> +        /* A masked vector firing needs to use the PBA, enable it */
> +        if (msix_is_masked(&vdev->pdev, nr)) {
> +            set_bit(nr, vdev->msix->pending);
> +            memory_region_set_enabled(&vdev->pdev.msix_pba_mmio, true);
> +            trace_vfio_msix_pba_enable(vdev->vbasedev.name);
> +        }
>      } else if (vdev->interrupt == VFIO_INT_MSI) {
>          get_msg = msi_get_message;
>          notify = msi_notify;
> @@ -535,6 +542,14 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
>          }
>      }
>
> +    /* Disable PBA emulation when nothing more is pending. */
> +    clear_bit(nr, vdev->msix->pending);
> +    if (find_first_bit(vdev->msix->pending,
> +                       vdev->nr_vectors) == vdev->nr_vectors) {
> +        memory_region_set_enabled(&vdev->pdev.msix_pba_mmio, false);
> +        trace_vfio_msix_pba_disable(vdev->vbasedev.name);
> +    }
> +
>      return 0;
>  }
>
> @@ -738,6 +753,9 @@ static void vfio_msix_disable(VFIOPCIDevice *vdev)
>
>      vfio_msi_disable_common(vdev);
>
> +    memset(vdev->msix->pending, 0,
> +           BITS_TO_LONGS(vdev->msix->entries) * sizeof(unsigned long));
> +
>      trace_vfio_msix_disable(vdev->vbasedev.name);
>  }
>
> @@ -1251,6 +1269,8 @@ static int vfio_msix_setup(VFIOPCIDevice *vdev, int pos)
>  {
>      int ret;
>
> +    vdev->msix->pending = g_malloc0(BITS_TO_LONGS(vdev->msix->entries) *
> +                                    sizeof(unsigned long));
>      ret = msix_init(&vdev->pdev, vdev->msix->entries,
>                      &vdev->bars[vdev->msix->table_bar].region.mem,
>                      vdev->msix->table_bar, vdev->msix->table_offset,
> @@ -1264,6 +1284,24 @@ static int vfio_msix_setup(VFIOPCIDevice *vdev, int pos)
>          return ret;
>      }
>
> +    /*
> +     * The PCI spec suggests that devices provide additional alignment for
> +     * MSI-X structures and avoid overlapping non-MSI-X related registers.
> +     * For an assigned device, this hopefully means that emulation of MSI-X
> +     * structures does not affect the performance of the device.  If devices
> +     * fail to provide that alignment, a significant performance penalty may
> +     * result, for instance Mellanox MT27500 VFs:
> +     * http://www.spinics.net/lists/kvm/msg125881.html
> +     *
> +     * The PBA is simply not that important for such a serious regression and
> +     * most drivers do not appear to look at it.  The solution for this is to
> +     * disable the PBA MemoryRegion unless it's being used.  We disable it
> +     * here and only enable it if a masked vector fires through QEMU.  As the
> +     * vector-use notifier is called, which occurs on unmask, we test whether
> +     * PBA emulation is needed and again disable if not.
> +     */
> +    memory_region_set_enabled(&vdev->pdev.msix_pba_mmio, false);
> +
>      return 0;
>  }
>
> @@ -1275,6 +1313,7 @@ static void vfio_teardown_msi(VFIOPCIDevice *vdev)
>          msix_uninit(&vdev->pdev,
>                      &vdev->bars[vdev->msix->table_bar].region.mem,
>                      &vdev->bars[vdev->msix->pba_bar].region.mem);
> +        g_free(vdev->msix->pending);
>      }
>  }
>
> diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
> index f004d52..6256587 100644
> --- a/hw/vfio/pci.h
> +++ b/hw/vfio/pci.h
> @@ -95,6 +95,7 @@ typedef struct VFIOMSIXInfo {
>      uint32_t pba_offset;
>      MemoryRegion mmap_mem;
>      void *mmap;
> +    unsigned long *pending;
>  } VFIOMSIXInfo;
>
>  typedef struct VFIOPCIDevice {
> diff --git a/trace-events b/trace-events
> index 934a7b6..c9ac144 100644
> --- a/trace-events
> +++ b/trace-events
> @@ -1631,6 +1631,8 @@ vfio_msi_interrupt(const char *name, int index, uint64_t addr, int data) " (%s)
>  vfio_msix_vector_do_use(const char *name, int index) " (%s) vector %d used"
>  vfio_msix_vector_release(const char *name, int index) " (%s) vector %d released"
>  vfio_msix_enable(const char *name) " (%s)"
> +vfio_msix_pba_disable(const char *name) " (%s)"
> +vfio_msix_pba_enable(const char *name) " (%s)"
>  vfio_msix_disable(const char *name) " (%s)"
>  vfio_msi_enable(const char *name, int nr_vectors) " (%s) Enabled %d MSI vectors"
>  vfio_msi_disable(const char *name) " (%s)"
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 1fb868c..e66c47f 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -356,6 +356,13 @@  static void vfio_msi_interrupt(void *opaque)
     if (vdev->interrupt == VFIO_INT_MSIX) {
         get_msg = msix_get_message;
         notify = msix_notify;
+
+        /* A masked vector firing needs to use the PBA, enable it */
+        if (msix_is_masked(&vdev->pdev, nr)) {
+            set_bit(nr, vdev->msix->pending);
+            memory_region_set_enabled(&vdev->pdev.msix_pba_mmio, true);
+            trace_vfio_msix_pba_enable(vdev->vbasedev.name);
+        }
     } else if (vdev->interrupt == VFIO_INT_MSI) {
         get_msg = msi_get_message;
         notify = msi_notify;
@@ -535,6 +542,14 @@  static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
         }
     }
 
+    /* Disable PBA emulation when nothing more is pending. */
+    clear_bit(nr, vdev->msix->pending);
+    if (find_first_bit(vdev->msix->pending,
+                       vdev->nr_vectors) == vdev->nr_vectors) {
+        memory_region_set_enabled(&vdev->pdev.msix_pba_mmio, false);
+        trace_vfio_msix_pba_disable(vdev->vbasedev.name);
+    }
+
     return 0;
 }
 
@@ -738,6 +753,9 @@  static void vfio_msix_disable(VFIOPCIDevice *vdev)
 
     vfio_msi_disable_common(vdev);
 
+    memset(vdev->msix->pending, 0,
+           BITS_TO_LONGS(vdev->msix->entries) * sizeof(unsigned long));
+
     trace_vfio_msix_disable(vdev->vbasedev.name);
 }
 
@@ -1251,6 +1269,8 @@  static int vfio_msix_setup(VFIOPCIDevice *vdev, int pos)
 {
     int ret;
 
+    vdev->msix->pending = g_malloc0(BITS_TO_LONGS(vdev->msix->entries) *
+                                    sizeof(unsigned long));
     ret = msix_init(&vdev->pdev, vdev->msix->entries,
                     &vdev->bars[vdev->msix->table_bar].region.mem,
                     vdev->msix->table_bar, vdev->msix->table_offset,
@@ -1264,6 +1284,24 @@  static int vfio_msix_setup(VFIOPCIDevice *vdev, int pos)
         return ret;
     }
 
+    /*
+     * The PCI spec suggests that devices provide additional alignment for
+     * MSI-X structures and avoid overlapping non-MSI-X related registers.
+     * For an assigned device, this hopefully means that emulation of MSI-X
+     * structures does not affect the performance of the device.  If devices
+     * fail to provide that alignment, a significant performance penalty may
+     * result, for instance Mellanox MT27500 VFs:
+     * http://www.spinics.net/lists/kvm/msg125881.html
+     *
+     * The PBA is simply not that important for such a serious regression and
+     * most drivers do not appear to look at it.  The solution for this is to
+     * disable the PBA MemoryRegion unless it's being used.  We disable it
+     * here and only enable it if a masked vector fires through QEMU.  As the
+     * vector-use notifier is called, which occurs on unmask, we test whether
+     * PBA emulation is needed and again disable if not.
+     */
+    memory_region_set_enabled(&vdev->pdev.msix_pba_mmio, false);
+
     return 0;
 }
 
@@ -1275,6 +1313,7 @@  static void vfio_teardown_msi(VFIOPCIDevice *vdev)
         msix_uninit(&vdev->pdev,
                     &vdev->bars[vdev->msix->table_bar].region.mem,
                     &vdev->bars[vdev->msix->pba_bar].region.mem);
+        g_free(vdev->msix->pending);
     }
 }
 
diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
index f004d52..6256587 100644
--- a/hw/vfio/pci.h
+++ b/hw/vfio/pci.h
@@ -95,6 +95,7 @@  typedef struct VFIOMSIXInfo {
     uint32_t pba_offset;
     MemoryRegion mmap_mem;
     void *mmap;
+    unsigned long *pending;
 } VFIOMSIXInfo;
 
 typedef struct VFIOPCIDevice {
diff --git a/trace-events b/trace-events
index 934a7b6..c9ac144 100644
--- a/trace-events
+++ b/trace-events
@@ -1631,6 +1631,8 @@  vfio_msi_interrupt(const char *name, int index, uint64_t addr, int data) " (%s)
 vfio_msix_vector_do_use(const char *name, int index) " (%s) vector %d used"
 vfio_msix_vector_release(const char *name, int index) " (%s) vector %d released"
 vfio_msix_enable(const char *name) " (%s)"
+vfio_msix_pba_disable(const char *name) " (%s)"
+vfio_msix_pba_enable(const char *name) " (%s)"
 vfio_msix_disable(const char *name) " (%s)"
 vfio_msi_enable(const char *name, int nr_vectors) " (%s) Enabled %d MSI vectors"
 vfio_msi_disable(const char *name) " (%s)"