diff mbox

vfio-pci: Quirk RTL8168 NIC

Message ID 20140510230225.17688.6353.stgit@bling.home (mailing list archive)
State New, archived
Headers show

Commit Message

Alex Williamson May 10, 2014, 11:03 p.m. UTC
This device is ridiculous.  It has two MMIO BARs, BAR4 and BAR2.  BAR4
hosts the MSI-X table, so oviously it would be too easy to access it
directly, instead it creates a window register in BAR2 that, among
other things, provides access to the MSI-X table.  This means MSI-X
doesn't work in the guest because the driver actually manages to
program the physical table.  When interrupt remapping is present, the
device MSI will be blocked.  The Linux driver doesn't make use of this
window, so apparently it's not required to make use of MSI-X.  This
quirk makes the device work with the Windows driver that does use this
window for MSI-X, but I certainly cannot recommend this device for
assignment (the Windows 7 driver also constantly pokes PCI config
space).

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 hw/misc/vfio.c |  145 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 145 insertions(+)


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Francois Romieu May 12, 2014, 8:02 p.m. UTC | #1
Alex Williamson <alex.williamson@redhat.com> :
[...]
> device MSI will be blocked.  The Linux driver doesn't make use of this
> window, so apparently it's not required to make use of MSI-X.  This

It does not really use MSI-X (no RSS).

> quirk makes the device work with the Windows driver that does use this
> window for MSI-X, but I certainly cannot recommend this device for
> assignment (the Windows 7 driver also constantly pokes PCI config
> space).

Do you have some offsets for those ?

> diff --git a/hw/misc/vfio.c b/hw/misc/vfio.c
> index 9cf5b84..c3d2f7a 100644
> --- a/hw/misc/vfio.c
> +++ b/hw/misc/vfio.c
[...]
> + * "address latched" indicator.  Bits 12:15 is a mask field, which we're
> + * going to ignore because we don't really know what it means and the MSI-X
> + * area always seems to be accessed with a full mask.

s/seems to/should always/

Double word accesses requires the full mask. The MSIX area should be accessed
through double words.
Alex Williamson May 12, 2014, 8:28 p.m. UTC | #2
On Mon, 2014-05-12 at 22:02 +0200, Francois Romieu wrote:
> Alex Williamson <alex.williamson@redhat.com> :
> [...]
> > device MSI will be blocked.  The Linux driver doesn't make use of this
> > window, so apparently it's not required to make use of MSI-X.  This
> 
> It does not really use MSI-X (no RSS).

Oh right, I looked for code references to the register but didn't notice
that Linux configures it for MSI, not MSI-X.  In my brief testing I only
saw that Windows generates interrupts on the first vector, so perhaps
not much lost without the extra vectors.  I guess it's this patch that
proves that MSI-X can be configured without this backdoor then.  Do you
have any insight into why this exists?

> > quirk makes the device work with the Windows driver that does use this
> > window for MSI-X, but I certainly cannot recommend this device for
> > assignment (the Windows 7 driver also constantly pokes PCI config
> > space).
> 
> Do you have some offsets for those ?

I believe it was 0x80, which is 0x10 off from the PCIe capability, so
the link control register.  I don't seem to have a log, but I'll
regenerate one tonight to get the exact sequence (the interface is in
use right now).

> > diff --git a/hw/misc/vfio.c b/hw/misc/vfio.c
> > index 9cf5b84..c3d2f7a 100644
> > --- a/hw/misc/vfio.c
> > +++ b/hw/misc/vfio.c
> [...]
> > + * "address latched" indicator.  Bits 12:15 is a mask field, which we're
> > + * going to ignore because we don't really know what it means and the MSI-X
> > + * area always seems to be accessed with a full mask.
> 
> s/seems to/should always/
> 
> Double word accesses requires the full mask. The MSIX area should be accessed
> through double words.

Good to know, I'll amend the comment.  Thanks!

Alex

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alex Williamson May 13, 2014, 3:06 a.m. UTC | #3
On Mon, 2014-05-12 at 14:28 -0600, Alex Williamson wrote:
> On Mon, 2014-05-12 at 22:02 +0200, Francois Romieu wrote:
> > Alex Williamson <alex.williamson@redhat.com> :
> > [...]
> > > device MSI will be blocked.  The Linux driver doesn't make use of this
> > > window, so apparently it's not required to make use of MSI-X.  This
> > 
> > It does not really use MSI-X (no RSS).
> 
> Oh right, I looked for code references to the register but didn't notice
> that Linux configures it for MSI, not MSI-X.  In my brief testing I only
> saw that Windows generates interrupts on the first vector, so perhaps
> not much lost without the extra vectors.  I guess it's this patch that
> proves that MSI-X can be configured without this backdoor then.  Do you
> have any insight into why this exists?
> 
> > > quirk makes the device work with the Windows driver that does use this
> > > window for MSI-X, but I certainly cannot recommend this device for
> > > assignment (the Windows 7 driver also constantly pokes PCI config
> > > space).
> > 
> > Do you have some offsets for those ?
> 
> I believe it was 0x80, which is 0x10 off from the PCIe capability, so
> the link control register.  I don't seem to have a log, but I'll
> regenerate one tonight to get the exact sequence (the interface is in
> use right now).

As promised:

vfio: vfio_pci_read_config(0000:05:00.0, @0x4, len=0x2) 507
vfio: vfio_pci_read_config(0000:05:00.0, @0x80, len=0x1) 40
vfio: vfio_pci_read_config(0000:05:00.0, @0x81, len=0x1) 0
vfio: vfio_pci_read_config(0000:05:00.0, @0x80, len=0x1) 40
vfio: vfio_pci_read_config(0000:05:00.0, @0x81, len=0x1) 0
vfio: vfio_pci_read_config(0000:05:00.0, @0x80, len=0x1) 40
vfio: vfio_pci_read_config(0000:05:00.0, @0x81, len=0x1) 0

The 80/81 access may happen asynchronous to the command register access,
sometimes I see 3 sets between the command register access as above,
sometimes 4.

As I mentioned earlier, the PCIe cap is at 0x70:

Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 06)
Subsystem: ASUSTeK Computer Inc. P8P67 and other motherboards
...
Capabilities: [70] Express Endpoint, MSI 01

So the cycle is:

word read from command register: I/O+ Mem+ BusMaster+ DisINTx+ SERR+

byte read from PCIe link control register[0]: CommClk+
byte read from PCIe link control register[1]: nothing

I'll be interested if that means anything to you.  It's not a very high
rate access, the command register access is maybe 1Hz.  Thanks,

Alex

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Francois Romieu May 13, 2014, 10:22 p.m. UTC | #4
Alex Williamson <alex.williamson@redhat.com> :
[...]
> > Oh right, I looked for code references to the register but didn't notice
> > that Linux configures it for MSI, not MSI-X.  In my brief testing I only
> > saw that Windows generates interrupts on the first vector, so perhaps
> > not much lost without the extra vectors.  I guess it's this patch that
> > proves that MSI-X can be configured without this backdoor then.  Do you
> > have any insight into why this exists ?

No. I can only speculate that bar registers exhaustion and 64 bits decoders
are not completely alien to this. Hayes may provide some hindsight.

[...]
> So the cycle is:
> 
> word read from command register: I/O+ Mem+ BusMaster+ DisINTx+ SERR+
> 
> byte read from PCIe link control register[0]: CommClk+
> byte read from PCIe link control register[1]: nothing
> 
> I'll be interested if that means anything to you.  It's not a very high
> rate access, the command register access is maybe 1Hz.  Thanks,

Nothing specific. It reminds me of rtl8168_esd_timer - kind of periodic
sanity check - in Realtek's own 8168 driver but the pattern is a bit
different. I need to check some recent revision of it though.
diff mbox

Patch

diff --git a/hw/misc/vfio.c b/hw/misc/vfio.c
index 9cf5b84..c3d2f7a 100644
--- a/hw/misc/vfio.c
+++ b/hw/misc/vfio.c
@@ -1668,6 +1668,150 @@  static void vfio_probe_ati_bar4_window_quirk(VFIODevice *vdev, int nr)
             vdev->host.function);
 }
 
+#define PCI_VENDOR_ID_REALTEK 0x10ec
+
+/*
+ * RTL8168 devices have a backdoor that can access the MSI-X table.  At BAR2
+ * offset 0x70 there is a dword data register, offset 0x74 is a dword address
+ * register.  According to the Linux r8169 driver, the MSI-X table is addressed
+ * when the "type" portion of the address register is set to 0x1.  This appears
+ * to be bits 16:30.  Bit 31 is both a write indicator and some sort of
+ * "address latched" indicator.  Bits 12:15 is a mask field, which we're
+ * going to ignore because we don't really know what it means and the MSI-X
+ * area always seems to be accessed with a full mask.  Bits 0:11 is offset
+ * within the type.
+ *
+ * Example trace:
+ *
+ * Read from MSI-X table offset 0
+ * vfio: vfio_bar_write(0000:05:00.0:BAR2+0x74, 0x1f000, 4) // store read addr
+ * vfio: vfio_bar_read(0000:05:00.0:BAR2+0x74, 4) = 0x8001f000 // latch
+ * vfio: vfio_bar_read(0000:05:00.0:BAR2+0x70, 4) = 0xfee00398 // read data
+ *
+ * Write 0xfee00000 to MSI-X table offset 0
+ * vfio: vfio_bar_write(0000:05:00.0:BAR2+0x70, 0xfee00000, 4) // write data
+ * vfio: vfio_bar_write(0000:05:00.0:BAR2+0x74, 0x8001f000, 4) // do write
+ * vfio: vfio_bar_read(0000:05:00.0:BAR2+0x74, 4) = 0x1f000 // complete
+ */
+
+static uint64_t vfio_rtl8168_window_quirk_read(void *opaque,
+                                               hwaddr addr, unsigned size)
+{
+    VFIOQuirk *quirk = opaque;
+    VFIODevice *vdev = quirk->vdev;
+
+    switch (addr) {
+    case 4: /* address */
+        if (quirk->data.flags) {
+            DPRINTF("%s fake read(%04x:%02x:%02x.%xd)\n",
+                    memory_region_name(&quirk->mem), vdev->host.domain,
+                    vdev->host.bus, vdev->host.slot, vdev->host.function);
+
+            return quirk->data.address_match ^ 0x10000000U;
+        }
+        break;
+    case 0: /* data */
+        if (quirk->data.flags) {
+            uint64_t val;
+
+            DPRINTF("%s MSI-X table read(%04x:%02x:%02x.%xd)\n",
+                    memory_region_name(&quirk->mem), vdev->host.domain,
+                    vdev->host.bus, vdev->host.slot, vdev->host.function);
+
+            if (!(vdev->pdev.cap_present & QEMU_PCI_CAP_MSIX)) {
+                return 0;
+            }
+
+            io_mem_read(&vdev->pdev.msix_table_mmio,
+                        (hwaddr)(quirk->data.address_match & 0xfff),
+                        &val, size);
+            return val;
+        }
+    }
+
+    DPRINTF("%s direct read(%04x:%02x:%02x.%xd)\n",
+            memory_region_name(&quirk->mem), vdev->host.domain,
+            vdev->host.bus, vdev->host.slot, vdev->host.function);
+
+    return vfio_bar_read(&vdev->bars[quirk->data.bar], addr + 0x70, size);
+}
+
+static void vfio_rtl8168_window_quirk_write(void *opaque, hwaddr addr,
+                                            uint64_t data, unsigned size)
+{
+    VFIOQuirk *quirk = opaque;
+    VFIODevice *vdev = quirk->vdev;
+
+    switch (addr) {
+    case 4: /* address */
+        if ((data & 0x7fff0000) == 0x10000) {
+            if (data & 0x10000000U &&
+                vdev->pdev.cap_present & QEMU_PCI_CAP_MSIX) {
+
+                DPRINTF("%s MSI-X table write(%04x:%02x:%02x.%xd)\n",
+                        memory_region_name(&quirk->mem), vdev->host.domain,
+                        vdev->host.bus, vdev->host.slot, vdev->host.function);
+
+                io_mem_write(&vdev->pdev.msix_table_mmio,
+                             (hwaddr)(quirk->data.address_match & 0xfff),
+                             data, size);
+            }
+
+            quirk->data.flags = 1;
+            quirk->data.address_match = data;
+
+            return;
+        }
+        quirk->data.flags = 0;
+        break;
+    case 0: /* data */
+        quirk->data.address_mask = data;
+        break;
+    }
+
+    DPRINTF("%s direct write(%04x:%02x:%02x.%xd)\n",
+            memory_region_name(&quirk->mem), vdev->host.domain,
+            vdev->host.bus, vdev->host.slot, vdev->host.function);
+
+    vfio_bar_write(&vdev->bars[quirk->data.bar], addr + 0x70, data, size);
+}
+
+static const MemoryRegionOps vfio_rtl8168_window_quirk = {
+    .read = vfio_rtl8168_window_quirk_read,
+    .write = vfio_rtl8168_window_quirk_write,
+    .valid = {
+        .min_access_size = 4,
+        .max_access_size = 4,
+        .unaligned = false,
+    },
+    .endianness = DEVICE_LITTLE_ENDIAN,
+};
+
+static void vfio_probe_rtl8168_bar2_window_quirk(VFIODevice *vdev, int nr)
+{
+    PCIDevice *pdev = &vdev->pdev;
+    VFIOQuirk *quirk;
+
+    if (pci_get_word(pdev->config + PCI_VENDOR_ID) != PCI_VENDOR_ID_REALTEK ||
+        pci_get_word(pdev->config + PCI_DEVICE_ID) != 0x8168 || nr != 2) {
+        return;
+    }
+
+    quirk = g_malloc0(sizeof(*quirk));
+    quirk->vdev = vdev;
+    quirk->data.bar = nr;
+
+    memory_region_init_io(&quirk->mem, OBJECT(vdev), &vfio_rtl8168_window_quirk,
+                          quirk, "vfio-rtl8168-window-quirk", 8);
+    memory_region_add_subregion_overlap(&vdev->bars[nr].mem,
+                                        0x70, &quirk->mem, 1);
+
+    QLIST_INSERT_HEAD(&vdev->bars[nr].quirks, quirk, next);
+
+    DPRINTF("Enabled RTL8168 BAR2 window quirk for device %04x:%02x:%02x.%x\n",
+            vdev->host.domain, vdev->host.bus, vdev->host.slot,
+            vdev->host.function);
+}
 /*
  * Trap the BAR2 MMIO window to config space as well.
  */
@@ -2071,6 +2215,7 @@  static void vfio_bar_quirk_setup(VFIODevice *vdev, int nr)
     vfio_probe_nvidia_bar5_window_quirk(vdev, nr);
     vfio_probe_nvidia_bar0_88000_quirk(vdev, nr);
     vfio_probe_nvidia_bar0_1800_quirk(vdev, nr);
+    vfio_probe_rtl8168_bar2_window_quirk(vdev, nr);
 }
 
 static void vfio_bar_quirk_teardown(VFIODevice *vdev, int nr)