[v2,5/5] pc/q35: Add pre-plug hook for x86-iommu

Message ID	20211028043129.38871-6-peterx@redhat.com (mailing list archive)
State	New, archived
From: Peter Xu <peterx@redhat.com>
To: qemu-devel@nongnu.org
Cc: Peter Maydell <peter.maydell@linaro.org>, "Daniel P . Berrange" <berrange@redhat.com>, Eduardo Habkost <ehabkost@redhat.com>, David Hildenbrand <david@redhat.com>, Jason Wang <jasowang@redhat.com>, "Michael S . Tsirkin" <mst@redhat.com>, Markus Armbruster <armbru@redhat.com>, peterx@redhat.com, Eric Auger <eric.auger@redhat.com>, Alex Williamson <alex.williamson@redhat.com>, Igor Mammedov <imammedo@redhat.com>, Paolo Bonzini <pbonzini@redhat.com>, Philippe Mathieu-Daudé <philmd@redhat.com>, David Gibson <david@gibson.dropbear.id.au>
Subject: [PATCH v2 5/5] pc/q35: Add pre-plug hook for x86-iommu
Date: Thu, 28 Oct 2021 12:31:29 +0800
In-Reply-To: <20211028043129.38871-1-peterx@redhat.com>
References: <20211028043129.38871-1-peterx@redhat.com>
Series	pci/iommu: Fail early if vfio-pci detected before vIOMMU
  [v2,0/5] pci/iommu: Fail early if vfio-pci detected before vIOMMU
  [v2,1/5] pci: Define pci_bus_dev_fn/pci_bus_fn/pci_bus_ret_fn
  [v2,2/5] pci: Export pci_for_each_device_under_bus*()
  [v2,3/5] qom: object_child_foreach_recursive_type()
  [v2,4/5] pci: Add pci_for_each_root_bus()
  [v2,5/5] pc/q35: Add pre-plug hook for x86-iommu

Peter Xu Oct. 28, 2021, 4:31 a.m. UTC

Add a pre-plug hook for x86-iommu, so that we can detect vfio-pci devices
before realizing the vIOMMU device.

When the guest contains both the x86 vIOMMU and vfio-pci devices, the user
needs to specify the x86 vIOMMU before the vfio-pci devices.  The reason is
that vfio_realize() calls pci_device_iommu_address_space() to fetch the correct
DMA address space for the device, and that API only returns the right result
once the vIOMMU device has been initialized.

For example, the iommu_fn() that is used in pci_device_iommu_address_space() is
only set up in realize() of the vIOMMU devices.

For a long time we have relied on libvirt to make sure that the ordering is
correct; on the QEMU side we never fail a guest from booting, even if the
ordering is wrong.  When the order is wrong, the guest will encounter
mysterious errors when operating on the vfio-pci device, because QEMU will
still assume the vfio-pci devices are in the default DMA domain (which is
normally the direct GPA mapping), so e.g. the DMAs will never go to the right
place.

This patch fails the guest boot when we detect such an erroneous cmdline, so
that the guest at least won't encounter weird device behavior after it has
booted.  The error message also helps the user to know how to fix the issue.

Cc: Alex Williamson <alex.williamson@redhat.com>
Suggested-by: Igor Mammedov <imammedo@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/i386/pc.c                |  4 ++++
 hw/i386/x86-iommu.c         | 14 ++++++++++++++
 include/hw/i386/x86-iommu.h |  8 ++++++++
 3 files changed, 26 insertions(+)
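
The ordering dependency described in the commit message can be illustrated with
a small standalone model.  This is only a sketch and not QEMU code: every type
and function name below (ToyBus, toy_viommu_realize(), and so on) is a
hypothetical stand-in.  It assumes, as the commit message states, that the
lookup hook is installed only when the vIOMMU is realized, and that a lookup
made before that falls back to the default address space.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for QEMU types; none of these are the real definitions. */
typedef struct ToyAddressSpace {
    const char *name;
} ToyAddressSpace;

typedef ToyAddressSpace *(*toy_iommu_fn)(int devfn);

typedef struct ToyBus {
    toy_iommu_fn iommu_fn;   /* NULL until the vIOMMU has been realized */
} ToyBus;

static ToyAddressSpace toy_system_as = { "system-memory" };
static ToyAddressSpace toy_viommu_as = { "viommu-translated" };

static ToyAddressSpace *toy_viommu_lookup(int devfn)
{
    (void)devfn;
    return &toy_viommu_as;
}

/* Models vIOMMU realize(): the only place where the hook gets installed. */
static void toy_viommu_realize(ToyBus *bus)
{
    assert(bus != NULL);
    bus->iommu_fn = toy_viommu_lookup;
}

/* Models the lookup: fall back to system memory when no hook is set. */
static ToyAddressSpace *toy_device_address_space(const ToyBus *bus, int devfn)
{
    assert(bus != NULL);
    return bus->iommu_fn ? bus->iommu_fn(devfn) : &toy_system_as;
}
```

In this model a device that looks up its address space before
toy_viommu_realize() runs keeps a pointer to the untranslated space, which is
the silent misconfiguration the patch wants to turn into a hard error.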

David Hildenbrand Oct. 28, 2021, 7:17 a.m. UTC | #1

On 28.10.21 06:31, Peter Xu wrote:
> Add a pre-plug hook for x86-iommu, so that we can detect vfio-pci devices
> before realizing the vIOMMU device.
> 
> When the guest contains both the x86 vIOMMU and vfio-pci devices, the user
> needs to specify the x86 vIOMMU before the vfio-pci devices.  The reason is,
> vfio_realize() calls pci_device_iommu_address_space() to fetch the correct dma
> address space for the device, while that API can only work right after the
> vIOMMU device initialized first.
> 
> For example, the iommu_fn() that is used in pci_device_iommu_address_space() is
> only setup in realize() of the vIOMMU devices.
> 
> For a long time we have had libvirt making sure that the ordering is correct,
> however from qemu side we never fail a guest from booting even if the ordering
> is specified wrongly.  When the order is wrong, the guest will encounter
> mysterious errors when operating on the vfio-pci device because in QEMU we'll
> still assume the vfio-pci devices are put into the default DMA domain (which is
> normally the direct GPA mapping), so e.g. the DMAs will never go right.
> 
> This patch fails the guest from booting when we detected such erroneous cmdline
> specified, then the guest at least won't encounter weird device behavior after
> booted.  The error message will also help the user to know how to fix the issue.
> 
> Cc: Alex Williamson <alex.williamson@redhat.com>
> Suggested-by: Igor Mammedov <imammedo@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>

I think that's a big improvement. I ran into this issue myself and found
the documentation at https://wiki.qemu.org/Features/VT-d at one time
("Meanwhile, the intel-iommu device must be specified as the first
device in the parameter list (before all the rest of the devices). ").

So feel free to add my

Acked-by: David Hildenbrand <david@redhat.com>

> ---
>  hw/i386/pc.c                |  4 ++++
>  hw/i386/x86-iommu.c         | 14 ++++++++++++++
>  include/hw/i386/x86-iommu.h |  8 ++++++++
>  3 files changed, 26 insertions(+)
> 
> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> index 86223acfd3..b70a04011e 100644
> --- a/hw/i386/pc.c
> +++ b/hw/i386/pc.c
> @@ -81,6 +81,7 @@
>  #include "hw/core/cpu.h"
>  #include "hw/usb.h"
>  #include "hw/i386/intel_iommu.h"
> +#include "hw/i386/x86-iommu.h"
>  #include "hw/net/ne2000-isa.h"
>  #include "standard-headers/asm-x86/bootparam.h"
>  #include "hw/virtio/virtio-pmem-pci.h"
> @@ -1327,6 +1328,8 @@ static void pc_machine_device_pre_plug_cb(HotplugHandler *hotplug_dev,
>          pc_memory_pre_plug(hotplug_dev, dev, errp);
>      } else if (object_dynamic_cast(OBJECT(dev), TYPE_CPU)) {
>          x86_cpu_pre_plug(hotplug_dev, dev, errp);
> +    } else if (object_dynamic_cast(OBJECT(dev), TYPE_X86_IOMMU_DEVICE)) {
> +        x86_iommu_pre_plug(X86_IOMMU_DEVICE(dev), errp);
>      } else if (object_dynamic_cast(OBJECT(dev), TYPE_VIRTIO_PMEM_PCI) ||
>                 object_dynamic_cast(OBJECT(dev), TYPE_VIRTIO_MEM_PCI)) {
>          pc_virtio_md_pci_pre_plug(hotplug_dev, dev, errp);
> @@ -1383,6 +1386,7 @@ static HotplugHandler *pc_get_hotplug_handler(MachineState *machine,
>  {
>      if (object_dynamic_cast(OBJECT(dev), TYPE_PC_DIMM) ||
>          object_dynamic_cast(OBJECT(dev), TYPE_CPU) ||
> +        object_dynamic_cast(OBJECT(dev), TYPE_X86_IOMMU_DEVICE) ||
>          object_dynamic_cast(OBJECT(dev), TYPE_VIRTIO_PMEM_PCI) ||
>          object_dynamic_cast(OBJECT(dev), TYPE_VIRTIO_MEM_PCI)) {
>          return HOTPLUG_HANDLER(machine);
> diff --git a/hw/i386/x86-iommu.c b/hw/i386/x86-iommu.c
> index 86ad03972e..c9ee9041a3 100644
> --- a/hw/i386/x86-iommu.c
> +++ b/hw/i386/x86-iommu.c
> @@ -22,6 +22,7 @@
>  #include "hw/i386/x86-iommu.h"
>  #include "hw/qdev-properties.h"
>  #include "hw/i386/pc.h"
> +#include "hw/vfio/pci.h"
>  #include "qapi/error.h"
>  #include "qemu/error-report.h"
>  #include "trace.h"
> @@ -103,6 +104,19 @@ IommuType x86_iommu_get_type(void)
>      return x86_iommu_default->type;
>  }
>  
> +void x86_iommu_pre_plug(X86IOMMUState *iommu, Error **errp)
> +{
> +    bool ambiguous = false;
> +    Object *object;
> +
> +    object = object_resolve_path_type("", TYPE_VFIO_PCI, &ambiguous);
> +    if (object || ambiguous) {
> +        /* There're one or more vfio-pci devices detected */
> +        error_setg(errp, "Please specify all the vfio-pci devices to be after "
> +                   "the vIOMMU device");
> +    }
> +}
> +
>  static void x86_iommu_realize(DeviceState *dev, Error **errp)
>  {
>      X86IOMMUState *x86_iommu = X86_IOMMU_DEVICE(dev);
> diff --git a/include/hw/i386/x86-iommu.h b/include/hw/i386/x86-iommu.h
> index 9de92d33a1..e8b6c293e0 100644
> --- a/include/hw/i386/x86-iommu.h
> +++ b/include/hw/i386/x86-iommu.h
> @@ -172,4 +172,12 @@ void x86_iommu_iec_notify_all(X86IOMMUState *iommu, bool global,
>   * @out: Output MSI message
>   */
>  void x86_iommu_irq_to_msi_message(X86IOMMUIrq *irq, MSIMessage *out);
> +
> +/**
> + * x86_iommu_pre_plug: called before plugging the iommu device
> + * @iommu: the pointer to x86 iommu state
> + * @errp: the double pointer to Error, set if we want to fail the plug
> + */

I'd drop that documentation because it's essentially just how any other
pre_plug handler works. But maybe it's just me that knows how the whole
hotplug machinery works, so ...

> +void x86_iommu_pre_plug(X86IOMMUState *iommu, Error **errp);
> +
>  #endif
>
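
The `object || ambiguous` test in x86_iommu_pre_plug() can be read as "at least
one object of this type exists": the check in the patch suggests that the
resolver returns the object when exactly one matches, and returns NULL with
the ambiguous flag set when several do.  The following standalone sketch models
that behavior over a plain array.  It is an assumption-laden toy, not the QOM
implementation, and all names in it are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef struct ToyObject {
    const char *type;
} ToyObject;

/*
 * Returns the unique object of the given type.  If more than one object
 * matches, returns NULL and sets *ambiguous to true.
 */
static const ToyObject *toy_resolve_type(const ToyObject *objs, size_t n,
                                         const char *type, bool *ambiguous)
{
    const ToyObject *found = NULL;

    assert(ambiguous != NULL);
    *ambiguous = false;
    for (size_t i = 0; i < n; i++) {
        if (strcmp(objs[i].type, type) != 0) {
            continue;
        }
        if (found) {
            *ambiguous = true;
            return NULL;
        }
        found = &objs[i];
    }
    return found;
}

/* "One or more objects of this type exist", as tested in the pre-plug hook. */
static bool toy_any_of_type(const ToyObject *objs, size_t n, const char *type)
{
    bool ambiguous = false;
    const ToyObject *obj = toy_resolve_type(objs, n, type, &ambiguous);

    return obj != NULL || ambiguous;
}
```

Checking only the returned pointer would miss the case of two or more matching
devices, which is why both conditions are needed.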

Peter Xu Oct. 28, 2021, 8:16 a.m. UTC | #2

On Thu, Oct 28, 2021 at 09:17:35AM +0200, David Hildenbrand wrote:
> On 28.10.21 06:31, Peter Xu wrote:
> > Add a pre-plug hook for x86-iommu, so that we can detect vfio-pci devices
> > before realizing the vIOMMU device.
> > 
> > When the guest contains both the x86 vIOMMU and vfio-pci devices, the user
> > needs to specify the x86 vIOMMU before the vfio-pci devices.  The reason is,
> > vfio_realize() calls pci_device_iommu_address_space() to fetch the correct dma
> > address space for the device, while that API can only work right after the
> > vIOMMU device initialized first.
> > 
> > For example, the iommu_fn() that is used in pci_device_iommu_address_space() is
> > only setup in realize() of the vIOMMU devices.
> > 
> > For a long time we have had libvirt making sure that the ordering is correct,
> > however from qemu side we never fail a guest from booting even if the ordering
> > is specified wrongly.  When the order is wrong, the guest will encounter
> > mysterious errors when operating on the vfio-pci device because in QEMU we'll
> > still assume the vfio-pci devices are put into the default DMA domain (which is
> > normally the direct GPA mapping), so e.g. the DMAs will never go right.
> > 
> > This patch fails the guest from booting when we detected such erroneous cmdline
> > specified, then the guest at least won't encounter weird device behavior after
> > booted.  The error message will also help the user to know how to fix the issue.
> > 
> > Cc: Alex Williamson <alex.williamson@redhat.com>
> > Suggested-by: Igor Mammedov <imammedo@redhat.com>
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> 
> I think that's a big improvement. I ran into this issue myself and found
> the documentation at https://wiki.qemu.org/Features/VT-d at one time
> ("Meanwhile, the intel-iommu device must be specified as the first
> device in the parameter list (before all the rest of the devices). ").
> 
> So feel free to add my
> 
> Acked-by: David Hildenbrand <david@redhat.com>

Thanks, will do.

> > @@ -172,4 +172,12 @@ void x86_iommu_iec_notify_all(X86IOMMUState *iommu, bool global,
> >   * @out: Output MSI message
> >   */
> >  void x86_iommu_irq_to_msi_message(X86IOMMUIrq *irq, MSIMessage *out);
> > +
> > +/**
> > + * x86_iommu_pre_plug: called before plugging the iommu device
> > + * @iommu: the pointer to x86 iommu state
> > + * @errp: the double pointer to Error, set if we want to fail the plug
> > + */
> 
> I'd drop that documentation because it's essentially just how any other
> pre_plug handler works. But maybe it's just me that knows how the whole
> hotplug machinery works, so ...

Yes, the documentation is not very helpful, because the function shouldn't be
called arbitrarily but only from the machine pre-plug hook of x86.  I just
didn't want it to be the first exported function in the header without a
comment.

Thanks,

Alex Williamson Oct. 28, 2021, 2:52 p.m. UTC | #3

On Thu, 28 Oct 2021 12:31:29 +0800
Peter Xu <peterx@redhat.com> wrote:

> Add a pre-plug hook for x86-iommu, so that we can detect vfio-pci devices
> before realizing the vIOMMU device.
> 
> When the guest contains both the x86 vIOMMU and vfio-pci devices, the user
> needs to specify the x86 vIOMMU before the vfio-pci devices.  The reason is,
> vfio_realize() calls pci_device_iommu_address_space() to fetch the correct dma
> address space for the device, while that API can only work right after the
> vIOMMU device initialized first.
> 
> For example, the iommu_fn() that is used in pci_device_iommu_address_space() is
> only setup in realize() of the vIOMMU devices.
> 
> For a long time we have had libvirt making sure that the ordering is correct,
> however from qemu side we never fail a guest from booting even if the ordering
> is specified wrongly.  When the order is wrong, the guest will encounter
> mysterious errors when operating on the vfio-pci device because in QEMU we'll
> still assume the vfio-pci devices are put into the default DMA domain (which is
> normally the direct GPA mapping), so e.g. the DMAs will never go right.
> 
> This patch fails the guest from booting when we detected such erroneous cmdline
> specified, then the guest at least won't encounter weird device behavior after
> booted.  The error message will also help the user to know how to fix the issue.
> 
> Cc: Alex Williamson <alex.williamson@redhat.com>
> Suggested-by: Igor Mammedov <imammedo@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  hw/i386/pc.c                |  4 ++++
>  hw/i386/x86-iommu.c         | 14 ++++++++++++++
>  include/hw/i386/x86-iommu.h |  8 ++++++++
>  3 files changed, 26 insertions(+)
> 
> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> index 86223acfd3..b70a04011e 100644
> --- a/hw/i386/pc.c
> +++ b/hw/i386/pc.c
> @@ -81,6 +81,7 @@
>  #include "hw/core/cpu.h"
>  #include "hw/usb.h"
>  #include "hw/i386/intel_iommu.h"
> +#include "hw/i386/x86-iommu.h"
>  #include "hw/net/ne2000-isa.h"
>  #include "standard-headers/asm-x86/bootparam.h"
>  #include "hw/virtio/virtio-pmem-pci.h"
> @@ -1327,6 +1328,8 @@ static void pc_machine_device_pre_plug_cb(HotplugHandler *hotplug_dev,
>          pc_memory_pre_plug(hotplug_dev, dev, errp);
>      } else if (object_dynamic_cast(OBJECT(dev), TYPE_CPU)) {
>          x86_cpu_pre_plug(hotplug_dev, dev, errp);
> +    } else if (object_dynamic_cast(OBJECT(dev), TYPE_X86_IOMMU_DEVICE)) {
> +        x86_iommu_pre_plug(X86_IOMMU_DEVICE(dev), errp);
>      } else if (object_dynamic_cast(OBJECT(dev), TYPE_VIRTIO_PMEM_PCI) ||
>                 object_dynamic_cast(OBJECT(dev), TYPE_VIRTIO_MEM_PCI)) {
>          pc_virtio_md_pci_pre_plug(hotplug_dev, dev, errp);
> @@ -1383,6 +1386,7 @@ static HotplugHandler *pc_get_hotplug_handler(MachineState *machine,
>  {
>      if (object_dynamic_cast(OBJECT(dev), TYPE_PC_DIMM) ||
>          object_dynamic_cast(OBJECT(dev), TYPE_CPU) ||
> +        object_dynamic_cast(OBJECT(dev), TYPE_X86_IOMMU_DEVICE) ||
>          object_dynamic_cast(OBJECT(dev), TYPE_VIRTIO_PMEM_PCI) ||
>          object_dynamic_cast(OBJECT(dev), TYPE_VIRTIO_MEM_PCI)) {
>          return HOTPLUG_HANDLER(machine);
> diff --git a/hw/i386/x86-iommu.c b/hw/i386/x86-iommu.c
> index 86ad03972e..c9ee9041a3 100644
> --- a/hw/i386/x86-iommu.c
> +++ b/hw/i386/x86-iommu.c
> @@ -22,6 +22,7 @@
>  #include "hw/i386/x86-iommu.h"
>  #include "hw/qdev-properties.h"
>  #include "hw/i386/pc.h"
> +#include "hw/vfio/pci.h"
>  #include "qapi/error.h"
>  #include "qemu/error-report.h"
>  #include "trace.h"
> @@ -103,6 +104,19 @@ IommuType x86_iommu_get_type(void)
>      return x86_iommu_default->type;
>  }
>  
> +void x86_iommu_pre_plug(X86IOMMUState *iommu, Error **errp)
> +{
> +    bool ambiguous = false;
> +    Object *object;
> +
> +    object = object_resolve_path_type("", TYPE_VFIO_PCI, &ambiguous);
> +    if (object || ambiguous) {
> +        /* There're one or more vfio-pci devices detected */
> +        error_setg(errp, "Please specify all the vfio-pci devices to be after "
> +                   "the vIOMMU device");
> +    }

I still really don't buy the argument that vfio-pci is the only driver
that does "this thing", therefore we can just look for vfio-pci devices
by name rather than try to generically detect devices that have this
dependency.  That seems short-sighted.

I've already suggested that pci-core could record on the PCIDevice
structure if the device address space has been accessed.  We could also
do something like create a TYPE_PCI_AS_DEVICE class derived from
TYPE_PCI_DEVICE and any PCI drivers that make use of the device address
space before machine-init-done would be of this class.  That could even
be enforced by pci_device_iommu_address_space() and would allow the
same sort of object resolution as used here.  Thanks,

Alex

> +}
> +
>  static void x86_iommu_realize(DeviceState *dev, Error **errp)
>  {
>      X86IOMMUState *x86_iommu = X86_IOMMU_DEVICE(dev);
> diff --git a/include/hw/i386/x86-iommu.h b/include/hw/i386/x86-iommu.h
> index 9de92d33a1..e8b6c293e0 100644
> --- a/include/hw/i386/x86-iommu.h
> +++ b/include/hw/i386/x86-iommu.h
> @@ -172,4 +172,12 @@ void x86_iommu_iec_notify_all(X86IOMMUState *iommu, bool global,
>   * @out: Output MSI message
>   */
>  void x86_iommu_irq_to_msi_message(X86IOMMUIrq *irq, MSIMessage *out);
> +
> +/**
> + * x86_iommu_pre_plug: called before plugging the iommu device
> + * @iommu: the pointer to x86 iommu state
> + * @errp: the double pointer to Error, set if we want to fail the plug
> + */
> +void x86_iommu_pre_plug(X86IOMMUState *iommu, Error **errp);
> +
>  #endif
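
The class-layering idea above (an intermediate type for drivers that use the
device address space early, enforced in the accessor) can be modeled in a few
lines.  This is a standalone sketch with a hand-rolled type chain instead of
QOM; the type objects and toy_may_fetch_address_space() are hypothetical names
chosen for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct ToyType {
    const char *name;
    const struct ToyType *parent;
} ToyType;

/* pci-device <- pci-as-device <- vfio-pci, and pci-device <- e1000 */
static const ToyType toy_pci_device = { "pci-device", NULL };
static const ToyType toy_pci_as_device = { "pci-as-device", &toy_pci_device };
static const ToyType toy_vfio_pci = { "vfio-pci", &toy_pci_as_device };
static const ToyType toy_e1000 = { "e1000", &toy_pci_device };

/* Walks the parent chain, in the spirit of a dynamic cast check. */
static bool toy_is_a(const ToyType *type, const ToyType *ancestor)
{
    for (; type != NULL; type = type->parent) {
        if (type == ancestor) {
            return true;
        }
    }
    return false;
}

/*
 * Enforcement idea: before machine-init-done, only types derived from the
 * "address space" class may fetch the device address space.
 */
static bool toy_may_fetch_address_space(const ToyType *type,
                                        bool machine_init_done)
{
    assert(type != NULL);
    return machine_init_done || toy_is_a(type, &toy_pci_as_device);
}
```

With such a layering, the vIOMMU pre-plug hook could look for any object
derived from the intermediate type rather than for vfio-pci by name.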

Peter Xu Oct. 28, 2021, 3:36 p.m. UTC | #4

On Thu, Oct 28, 2021 at 08:52:42AM -0600, Alex Williamson wrote:
> > +void x86_iommu_pre_plug(X86IOMMUState *iommu, Error **errp)
> > +{
> > +    bool ambiguous = false;
> > +    Object *object;
> > +
> > +    object = object_resolve_path_type("", TYPE_VFIO_PCI, &ambiguous);
> > +    if (object || ambiguous) {
> > +        /* There're one or more vfio-pci devices detected */
> > +        error_setg(errp, "Please specify all the vfio-pci devices to be after "
> > +                   "the vIOMMU device");
> > +    }
> 
> I still really don't buy the argument that vfio-pci is the only driver
> that does "this thing", therefore we can just look for vfio-pci devices
> by name rather than try to generically detect devices that have this
> dependency.  That seems short sighted.
> 
> I've already suggested that pci-core could record on the PCIDevice
> structure if the device address space has been accessed.  We could also
> do something like create a TYPE_PCI_AS_DEVICE class derived from
> TYPE_PCI_DEVICE and any PCI drivers that make use of the device address
> space before machine-init-done would be of this class.  That could even
> be enforced by pci_device_iommu_address_space() and would allow the
> same sort of object resolution as used here.  Thanks,

Sorry Alex, I didn't receive any follow-up, so I thought you were fine with it.

I have always been fine with either way, though I think another parent class
would be overkill just for this.  Would you find the below acceptable?

---8<---
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 5cdf1d4298..2156b5d3ed 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -3266,6 +3266,7 @@ static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
     pdc->exit = vfio_exitfn;
     pdc->config_read = vfio_pci_read_config;
     pdc->config_write = vfio_pci_write_config;
+    pdc->require_consolidated_iommu_as = true;
 }

 static const TypeInfo vfio_pci_dev_info = {
diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index 6813f128e0..ffddc766ba 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -239,6 +239,14 @@ struct PCIDeviceClass {
      */
     bool is_bridge;

+    /*
+     * Set this to true when a pci device needs a consolidated result from
+     * pci_device_iommu_address_space() in its realize() fn.  This also means
+     * that when specified in cmdline, this "-device" parameter needs to be put
+     * after the vIOMMU devices so as to make it work.
+     */
+    bool require_consolidated_iommu_as;
+
     /* rom bar */
     const char *romfile;
 };
---8<---

Then I'll need to bring back the dropped patch on pci scanning, since I won't
be able to use object_resolve_path_type() anymore and will need to check
PCIDeviceClass instead.

Michael, Igor, others - any objections?
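
For illustration, the class-flag variant combined with a device scan might look
roughly like the standalone sketch below.  It is not QEMU code: the Toy* types
and toy_first_as_dependent() are hypothetical, and only the flag name is
borrowed from the snippet above.  A flat array stands in for the pci scanning
helper mentioned in the message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct ToyPCIDeviceClass {
    const char *name;
    /* Set by drivers that need the final address space at realize time. */
    bool require_consolidated_iommu_as;
} ToyPCIDeviceClass;

typedef struct ToyPCIDevice {
    const ToyPCIDeviceClass *klass;
    const char *id;
} ToyPCIDevice;

/*
 * Scans the already-realized devices and returns the first one whose class
 * carries the flag, or NULL if the vIOMMU can be plugged safely.
 */
static const ToyPCIDevice *toy_first_as_dependent(const ToyPCIDevice *devs,
                                                  size_t n)
{
    assert(devs != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (devs[i].klass->require_consolidated_iommu_as) {
            return &devs[i];
        }
    }
    return NULL;
}
```

Returning the offending device rather than a boolean would let the error
message name the device the user has to move.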

Alex Williamson Oct. 28, 2021, 4:11 p.m. UTC | #5

On Thu, 28 Oct 2021 23:36:33 +0800
Peter Xu <peterx@redhat.com> wrote:

> On Thu, Oct 28, 2021 at 08:52:42AM -0600, Alex Williamson wrote:
> > > +void x86_iommu_pre_plug(X86IOMMUState *iommu, Error **errp)
> > > +{
> > > +    bool ambiguous = false;
> > > +    Object *object;
> > > +
> > > +    object = object_resolve_path_type("", TYPE_VFIO_PCI, &ambiguous);
> > > +    if (object || ambiguous) {
> > > +        /* There're one or more vfio-pci devices detected */
> > > +        error_setg(errp, "Please specify all the vfio-pci devices to be after "
> > > +                   "the vIOMMU device");
> > > +    }  
> > 
> > I still really don't buy the argument that vfio-pci is the only driver
> > that does "this thing", therefore we can just look for vfio-pci devices
> > by name rather than try to generically detect devices that have this
> > dependency.  That seems short sighted.
> > 
> > I've already suggested that pci-core could record on the PCIDevice
> > structure if the device address space has been accessed.  We could also
> > do something like create a TYPE_PCI_AS_DEVICE class derived from
> > TYPE_PCI_DEVICE and any PCI drivers that make use of the device address
> > space before machine-init-done would be of this class.  That could even
> > be enforced by pci_device_iommu_address_space() and would allow the
> > same sort of object resolution as used here.  Thanks,  
> 
> Sorry Alex, I didn't receive any follow up so I thought you were fine with it.
> 
> I was always fine with either way, though I think another parent class would be
> an overkill just for this.  Would you think below acceptable?
> 
> ---8<---
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 5cdf1d4298..2156b5d3ed 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -3266,6 +3266,7 @@ static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
>      pdc->exit = vfio_exitfn;
>      pdc->config_read = vfio_pci_read_config;
>      pdc->config_write = vfio_pci_write_config;
> +    pdc->require_consolidated_iommu_as = true;
>  }
> 
>  static const TypeInfo vfio_pci_dev_info = {
> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
> index 6813f128e0..ffddc766ba 100644
> --- a/include/hw/pci/pci.h
> +++ b/include/hw/pci/pci.h
> @@ -239,6 +239,14 @@ struct PCIDeviceClass {
>       */
>      bool is_bridge;
> 
> +    /*
> +     * Set this to true when a pci device needs a consolidated result from
> +     * pci_device_iommu_address_space() in its realize() fn.  This also means
> +     * that when specified in cmdline, this "-device" parameter needs to be put
> +     * after the vIOMMU devices so as to make it work.
> +     */
> +    bool require_consolidated_iommu_as;

Maybe this is where the naming of the previous attempt along these lines
didn't work.  There's no "consolidation" requirement and the IOMMU is
only related in that it is a driver that provides new address spaces
for devices.  This is why I thought we might be able to make it
automatic if pci_device_iommu_address_space() records that the address
space for a device has been consumed.

What we're trying to describe is that the address space object for a
device must be fixed at the time the device is realized.  Along those
lines, maybe a better name is something like "required_fixed_as_obj".

> +
>      /* rom bar */
>      const char *romfile;
>  };
> ---8<---
> 
> Then I'll need to pick the dropped patch back on pci scanning, since then I
> won't be able to use object_resolve_path_type() anymore, and I'll need to check
> up PCIDeviceClass instead.

Better.  Like the class layering proposal, a downside is that the
driver needs to be aware that it's imposing this requirement to be able
to mark it in the class init function rather than some automatic means,
like an "as_object_consumed" flag set automatically on the device
structure via accessors like pci_device_iommu_address_space().  Thanks,

Alex

Peter Xu Oct. 29, 2021, 2:53 a.m. UTC | #6

On Thu, Oct 28, 2021 at 10:11:35AM -0600, Alex Williamson wrote:
> Better.  Like the class layering proposal, a downside is that the
> driver needs to be aware that it's imposing this requirement to be able
> to mark it in the class init function rather than some automatic means,
> like an "as_object_consumed" flag set automatically on the device
> structure via accessors like pci_device_iommu_address_space().  Thanks,

Do you mean something like this?

---8<---
diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index 258290f4eb..969f4c85fd 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -2729,6 +2729,10 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
     PCIBus *iommu_bus = bus;
     uint8_t devfn = dev->devfn;

+    if (!dev->address_space_consumed) {
+        dev->address_space_consumed = true;
+    }
+
     while (iommu_bus && !iommu_bus->iommu_fn && iommu_bus->parent_dev) {
         PCIBus *parent_bus = pci_get_bus(iommu_bus->parent_dev);

diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index 6813f128e0..704c9bdc6e 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -268,6 +268,13 @@ typedef struct PCIReqIDCache PCIReqIDCache;
 struct PCIDevice {
     DeviceState qdev;
     bool partially_hotplugged;
+    /*
+     * This will be set after the 1st time the device implementation fetches
+     * its dma address space from pci_device_iommu_address_space().  It's used
+     * as a sanity check for platform devices like vIOMMU to detect incorrect
+     * ordering of device realization.
+     */
+    bool address_space_consumed;

     /* PCI config space */
     uint8_t *config;
---8<---

Then do the sanity check in pre-plug of vIOMMU.

The flag will be a bit more "mysterious" than the previous approach, imho, as
its name will be even further from the problem it's going to solve.  However,
the changeset at least looks clean, and it seems to work too.

Thanks,
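
A rough standalone model of the "sanity check in pre-plug" step, assuming the
per-device flag from the snippet above, could look as follows.  None of this is
QEMU code: the Toy* names, the error buffer interface and the message text are
all hypothetical, and a flat array replaces the real bus walk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

typedef struct ToyPCIDevice {
    const char *id;
    bool address_space_consumed;
} ToyPCIDevice;

/* Models the accessor: fetching the address space marks the device. */
static void toy_fetch_address_space(ToyPCIDevice *dev)
{
    assert(dev != NULL);
    dev->address_space_consumed = true;
}

/*
 * Models the vIOMMU pre-plug check.  Returns true if the plug may proceed;
 * otherwise writes a message naming the first offending device.
 */
static bool toy_viommu_pre_plug(const ToyPCIDevice *devs, size_t n,
                                char *err, size_t errlen)
{
    for (size_t i = 0; i < n; i++) {
        if (devs[i].address_space_consumed) {
            snprintf(err, errlen,
                     "device '%s' must be specified after the vIOMMU",
                     devs[i].id);
            return false;
        }
    }
    return true;
}
```

Because the flag is set inside the accessor, a driver does not have to opt in;
any device that fetched its address space early is caught.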

Alex Williamson Oct. 29, 2021, 3:31 p.m. UTC | #7

On Fri, 29 Oct 2021 10:53:07 +0800
Peter Xu <peterx@redhat.com> wrote:

> On Thu, Oct 28, 2021 at 10:11:35AM -0600, Alex Williamson wrote:
> > Better.  Like the class layering proposal, a downside is that the
> > driver needs to be aware that it's imposing this requirement to be able
> > to mark it in the class init function rather than some automatic means,
> > like an "as_object_consumed" flag set automatically on the device
> > structure via accessors like pci_device_iommu_address_space().  Thanks,  
> 
> Do you mean something like this?
> 
> ---8<---
> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> index 258290f4eb..969f4c85fd 100644
> --- a/hw/pci/pci.c
> +++ b/hw/pci/pci.c
> @@ -2729,6 +2729,10 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
>      PCIBus *iommu_bus = bus;
>      uint8_t devfn = dev->devfn;
> 
> +    if (!dev->address_space_consumed) {
> +        dev->address_space_consumed = true;
> +    }

Could just set it unconditionally.

> +
>      while (iommu_bus && !iommu_bus->iommu_fn && iommu_bus->parent_dev) {
>          PCIBus *parent_bus = pci_get_bus(iommu_bus->parent_dev);
> 
> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
> index 6813f128e0..704c9bdc6e 100644
> --- a/include/hw/pci/pci.h
> +++ b/include/hw/pci/pci.h
> @@ -268,6 +268,13 @@ typedef struct PCIReqIDCache PCIReqIDCache;
>  struct PCIDevice {
>      DeviceState qdev;
>      bool partially_hotplugged;
> +    /*
> +     * This will be set after the 1st time the device implementation fetches
> +     * its dma address space from pci_device_iommu_address_space().  It's used
> +     * as a sanity check for platform devices like vIOMMU to detect incorrect
> +     * ordering of device realization.
> +     */
> +    bool address_space_consumed;
> 
>      /* PCI config space */
>      uint8_t *config;
> ---8<---
> 
> Then sanity check in pre-plug of vIOMMU.
> 
> The flag will be a bit more "mysterious" than the previous approach, imho, as
> the name of it will be even further from the problem it's going to solve.
> However it looks at least clean on the changeset and it looks working too.

That seems like a function of how well we name and comment the
variable, right?  We are making an assumption here that if the address
space for a device is provided then that address space is no longer
interchangeable, some decision has already been made based on the
provided address space.  If we look at the callers of
pci_device_iommu_address_space(), we have:

pci_init_bus_master() - It holds true here that the purpose of
accessing the address space is to make the memory of that address space
accessible to the device, the address space cannot be transparently
swapped for another.

vfio_realize() - The case we're concerned about, potentially the
earliest use case.

virtio_pci_iommu_enabled() - AIUI, this is where virtio devices decide
how DMA flows, the address space of the device cannot be changed after
this.

kvm_arch_fixup_msi_route() - ARM KVM decides MSI routing here, the
address space is fixed after this.

Actually, maybe there's a simpler approach: could we further assume
that if the address space for *any* device relative to an IOMMU is
evaluated, then we've passed the point where an IOMMU could be added?
IOW, maybe we don't need a per device flag and a global flag would be
enough.  A function like pci_iommu_as_evaluated() could report the
state of that flag.  As a convenience to the user, though, tracking per
device could still be useful, to be able to report which devices are
mis-ordered.  Thanks,

Alex
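
The global-flag idea, with optional per-device tracking kept for error
reporting, can be sketched as a standalone model.  This is not QEMU code and
not a proposed patch: toy_iommu_as_evaluated() merely mirrors the shape of the
pci_iommu_as_evaluated() suggestion, and every other name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct ToyPCIDevice {
    const char *id;
    bool as_evaluated;       /* per-device: kept only for reporting */
} ToyPCIDevice;

/* Global: set once any device has had its address space evaluated. */
static bool toy_any_as_evaluated;

static void toy_fetch_address_space(ToyPCIDevice *dev)
{
    assert(dev != NULL);
    dev->as_evaluated = true;
    toy_any_as_evaluated = true;
}

/* Cheap yes/no answer for the vIOMMU pre-plug hook. */
static bool toy_iommu_as_evaluated(void)
{
    return toy_any_as_evaluated;
}

/* Collects the ids of mis-ordered devices; returns how many were stored. */
static size_t toy_misordered(const ToyPCIDevice *devs, size_t n,
                             const char **out, size_t out_cap)
{
    size_t count = 0;

    for (size_t i = 0; i < n; i++) {
        if (devs[i].as_evaluated && count < out_cap) {
            out[count++] = devs[i].id;
        }
    }
    return count;
}
```

The global flag alone is enough to reject the vIOMMU; the per-device list only
improves the error message.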
