diff mbox series

[8/8] x86-iommu: Fail early if vIOMMU specified after vfio-pci

Message ID 20211021104259.57754-9-peterx@redhat.com (mailing list archive)
State New, archived
Headers show
Series pci/iommu: Fail early if vfio-pci detected before vIOMMU | expand

Commit Message

Peter Xu Oct. 21, 2021, 10:42 a.m. UTC
Scan the pci bus to make sure there's no vfio-pci device attached before vIOMMU
is realized.

Suggested-by: Igor Mammedov <imammedo@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/i386/x86-iommu.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

Comments

Eric Auger Oct. 21, 2021, 12:38 p.m. UTC | #1
Hi Peter,

On 10/21/21 12:42 PM, Peter Xu wrote:
> Scan the pci bus to make sure there's no vfio-pci device attached before vIOMMU
> is realized.
>
> Suggested-by: Igor Mammedov <imammedo@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  hw/i386/x86-iommu.c | 18 ++++++++++++++++++
>  1 file changed, 18 insertions(+)
>
> diff --git a/hw/i386/x86-iommu.c b/hw/i386/x86-iommu.c
> index 86ad03972e..58abce7edc 100644
> --- a/hw/i386/x86-iommu.c
> +++ b/hw/i386/x86-iommu.c
> @@ -21,6 +21,7 @@
>  #include "hw/sysbus.h"
>  #include "hw/i386/x86-iommu.h"
>  #include "hw/qdev-properties.h"
> +#include "hw/vfio/pci.h"
>  #include "hw/i386/pc.h"
>  #include "qapi/error.h"
>  #include "qemu/error-report.h"
> @@ -103,6 +104,16 @@ IommuType x86_iommu_get_type(void)
>      return x86_iommu_default->type;
>  }
>  
> +static void x86_iommu_pci_dev_hook(PCIBus *bus, PCIDevice *dev, void *opaque)
> +{
> +    Error **errp = (Error **)opaque;
> +
> +    if (object_dynamic_cast(OBJECT(dev), TYPE_VFIO_PCI)) {
> +        error_setg(errp, "Device '%s' must be specified before vIOMMUs",
> +                   TYPE_VFIO_PCI);
if there are several VFIO-PCI devices set before the IOMMU, errp may be
overriden
as we do not exit the loop as soon as there is an error I think

Eric
> +    }
> +}
> +
>  static void x86_iommu_realize(DeviceState *dev, Error **errp)
>  {
>      X86IOMMUState *x86_iommu = X86_IOMMU_DEVICE(dev);
> @@ -120,6 +131,12 @@ static void x86_iommu_realize(DeviceState *dev, Error **errp)
>          return;
>      }
>  
> +    /* Make sure there's no special device plugged before vIOMMU */
> +    pci_for_each_device_all(x86_iommu_pci_dev_hook, (void *)errp);
> +    if (*errp) {
> +        return;
> +    }
> +
>      /* If the user didn't specify IR, choose a default value for it */
>      if (x86_iommu->intr_supported == ON_OFF_AUTO_AUTO) {
>          x86_iommu->intr_supported = irq_all_kernel ?
> @@ -151,6 +168,7 @@ static Property x86_iommu_properties[] = {
>  static void x86_iommu_class_init(ObjectClass *klass, void *data)
>  {
>      DeviceClass *dc = DEVICE_CLASS(klass);
> +
>      dc->realize = x86_iommu_realize;
>      device_class_set_props(dc, x86_iommu_properties);
>  }
Alex Williamson Oct. 21, 2021, 10:30 p.m. UTC | #2
On Thu, 21 Oct 2021 18:42:59 +0800
Peter Xu <peterx@redhat.com> wrote:

> Scan the pci bus to make sure there's no vfio-pci device attached before vIOMMU
> is realized.

Sorry, I'm not onboard with this solution at all.

It would be really useful though if this commit log or a code comment
described exactly the incompatibility for which vfio-pci devices are
being called out here.  Otherwise I see this as a bit of magic voodoo
that gets lost in lore and copied elsewhere and we're constantly trying
to figure out specific incompatibilities when vfio-pci devices are
trying really hard to be "just another device".

I infer from the link of the previous alternate solution that this is
to do with the fact that vfio devices attach a memory listener to the
device address space.  Interestingly that previous cover letter also
discusses how vdpa devices might have a similar issue, which makes it
confusing again that we're calling out vfio-pci devices by name rather
than for a behavior.

If the behavior here is that vfio-pci devices attach a listener to the
device address space, then that provides a couple possible options.  We
could look for devices that have recorded an interest in their address
space, such as by setting a flag on PCIDevice when someone calls
pci_device_iommu_address_space(), where we could walk all devices using
the code in this series to find a device with such a flag.

Another option might be to attach owner objects to memory listeners,
walk all the listeners on the system address space (that the vIOMMU is
about to replace for some devices) and evaluate the owner objects
against TYPE_PCI_DEVICE.

Please lets not just call out arbitrary devices, especially not without
a thorough explanation of the incompatibility.  Thanks,

Alex


> Suggested-by: Igor Mammedov <imammedo@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  hw/i386/x86-iommu.c | 18 ++++++++++++++++++
>  1 file changed, 18 insertions(+)
> 
> diff --git a/hw/i386/x86-iommu.c b/hw/i386/x86-iommu.c
> index 86ad03972e..58abce7edc 100644
> --- a/hw/i386/x86-iommu.c
> +++ b/hw/i386/x86-iommu.c
> @@ -21,6 +21,7 @@
>  #include "hw/sysbus.h"
>  #include "hw/i386/x86-iommu.h"
>  #include "hw/qdev-properties.h"
> +#include "hw/vfio/pci.h"
>  #include "hw/i386/pc.h"
>  #include "qapi/error.h"
>  #include "qemu/error-report.h"
> @@ -103,6 +104,16 @@ IommuType x86_iommu_get_type(void)
>      return x86_iommu_default->type;
>  }
>  
> +static void x86_iommu_pci_dev_hook(PCIBus *bus, PCIDevice *dev, void *opaque)
> +{
> +    Error **errp = (Error **)opaque;
> +
> +    if (object_dynamic_cast(OBJECT(dev), TYPE_VFIO_PCI)) {
> +        error_setg(errp, "Device '%s' must be specified before vIOMMUs",
> +                   TYPE_VFIO_PCI);
> +    }
> +}
> +
>  static void x86_iommu_realize(DeviceState *dev, Error **errp)
>  {
>      X86IOMMUState *x86_iommu = X86_IOMMU_DEVICE(dev);
> @@ -120,6 +131,12 @@ static void x86_iommu_realize(DeviceState *dev, Error **errp)
>          return;
>      }
>  
> +    /* Make sure there's no special device plugged before vIOMMU */
> +    pci_for_each_device_all(x86_iommu_pci_dev_hook, (void *)errp);
> +    if (*errp) {
> +        return;
> +    }
> +
>      /* If the user didn't specify IR, choose a default value for it */
>      if (x86_iommu->intr_supported == ON_OFF_AUTO_AUTO) {
>          x86_iommu->intr_supported = irq_all_kernel ?
> @@ -151,6 +168,7 @@ static Property x86_iommu_properties[] = {
>  static void x86_iommu_class_init(ObjectClass *klass, void *data)
>  {
>      DeviceClass *dc = DEVICE_CLASS(klass);
> +
>      dc->realize = x86_iommu_realize;
>      device_class_set_props(dc, x86_iommu_properties);
>  }
Peter Xu Oct. 22, 2021, 2:14 a.m. UTC | #3
Hi, Alex,

On Thu, Oct 21, 2021 at 04:30:39PM -0600, Alex Williamson wrote:
> On Thu, 21 Oct 2021 18:42:59 +0800
> Peter Xu <peterx@redhat.com> wrote:
> 
> > Scan the pci bus to make sure there's no vfio-pci device attached before vIOMMU
> > is realized.
> 
> Sorry, I'm not onboard with this solution at all.
> 
> It would be really useful though if this commit log or a code comment
> described exactly the incompatibility for which vfio-pci devices are
> being called out here.  Otherwise I see this as a bit of magic voodoo
> that gets lost in lore and copied elsewhere and we're constantly trying
> to figure out specific incompatibilities when vfio-pci devices are
> trying really hard to be "just another device".

Sure, I can enrich the commit message.

> 
> I infer from the link of the previous alternate solution that this is
> to do with the fact that vfio devices attach a memory listener to the
> device address space.

IMHO it's not about the memory listeners, I think that' after vfio detected
some vIOMMU memory regions already, which must be based on an vIOMMU address
space being available.  I think the problem is that when realize() vfio-pci we
fetch the dma address space specifically for getting the vfio group, while that
could happen too early, even before vIOMMU is created.

> Interestingly that previous cover letter also discusses how vdpa devices
> might have a similar issue, which makes it confusing again that we're calling
> out vfio-pci devices by name rather than for a behavior.

Yes I'll need to see whether this approach will be accepted first.  I think
similar thing could help VDPA but it's not required there because VDPA has
already worked around using pci_device_iommu_address_space().  So potentially
the only one to "fix" is the vfio-pci device using along with vIOMMU, when the
device ordering is specified in the wrong order.  I'll leave the VDPA problem
to Jason to see whether he prefers keeping current code, or switch to a simpler
one.  That should be after this one.

> 
> If the behavior here is that vfio-pci devices attach a listener to the
> device address space, then that provides a couple possible options.  We
> could look for devices that have recorded an interest in their address
> space, such as by setting a flag on PCIDevice when someone calls
> pci_device_iommu_address_space(), where we could walk all devices using
> the code in this series to find a device with such a flag.

Right, we can set a flag for all the pci devices that needs to consolidate
pci_device_iommu_address_space() result, however then it'll be vfio-pci only so
far.  Btw, I actually proposed similar things two months ago, and I think Igor
showed concern on that flag being vague on meaning:

https://lore.kernel.org/qemu-devel/20210906104915.7dd5c934@redhat.com/

  > > Does it need to be a pre_plug hook?  I thought we might just need a flag in the
  > > pci device classes showing that it should be after vIOMMUs, then in vIOMMU
  > > realize functions we walk pci bus to make sure no such device exist?
  > > 
  > > We could have a base vIOMMU class, then that could be in the realize() of the
  > > common class.
  > 
  > We basically don't know if device needs IOMMU or not and can work
  > with/without it just fine. In this case I'd think about IOMMU as board
  > feature that morphs PCI buses (some of them) (address space, bus numers, ...).
  > So I don't perceive any iommu flag as a device property at all.
  > 
  > As for realize vs pre_plug, the later is the part of abstract realize
  > (see: device_set_realized) and is already used by some PCI infrastructure:
  >   ex: pcie_cap_slot_pre_plug_cb/spapr_pci_pre_plug

I still think that flag will work, that flag should only shows "whether this
device needs to be specified earlier than vIOMMU", but I can get the point from
Igor that it's at least confusing on what does the flag mean.  Meanwhile I
don't think that flag will be required, as this is not the first time we name a
special device in the code, e.g. pc_machine_device_pre_plug_cb().
intel_iommu.c has it too upon vfio-pci already on making sure caching-mode=on
in vtd_machine_done_notify_one().

If Igor is okay with adding such a flag for PCIDevice class, I can do that in
the new version.  I don't have a strong opinion on this.

Thanks,
Peter Xu Oct. 22, 2021, 2:37 a.m. UTC | #4
On Thu, Oct 21, 2021 at 02:38:54PM +0200, Eric Auger wrote:
> Hi Peter,
> 
> On 10/21/21 12:42 PM, Peter Xu wrote:
> > Scan the pci bus to make sure there's no vfio-pci device attached before vIOMMU
> > is realized.
> >
> > Suggested-by: Igor Mammedov <imammedo@redhat.com>
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >  hw/i386/x86-iommu.c | 18 ++++++++++++++++++
> >  1 file changed, 18 insertions(+)
> >
> > diff --git a/hw/i386/x86-iommu.c b/hw/i386/x86-iommu.c
> > index 86ad03972e..58abce7edc 100644
> > --- a/hw/i386/x86-iommu.c
> > +++ b/hw/i386/x86-iommu.c
> > @@ -21,6 +21,7 @@
> >  #include "hw/sysbus.h"
> >  #include "hw/i386/x86-iommu.h"
> >  #include "hw/qdev-properties.h"
> > +#include "hw/vfio/pci.h"
> >  #include "hw/i386/pc.h"
> >  #include "qapi/error.h"
> >  #include "qemu/error-report.h"
> > @@ -103,6 +104,16 @@ IommuType x86_iommu_get_type(void)
> >      return x86_iommu_default->type;
> >  }
> >  
> > +static void x86_iommu_pci_dev_hook(PCIBus *bus, PCIDevice *dev, void *opaque)
> > +{
> > +    Error **errp = (Error **)opaque;
> > +
> > +    if (object_dynamic_cast(OBJECT(dev), TYPE_VFIO_PCI)) {
> > +        error_setg(errp, "Device '%s' must be specified before vIOMMUs",
> > +                   TYPE_VFIO_PCI);
> if there are several VFIO-PCI devices set before the IOMMU, errp may be
> overriden
> as we do not exit the loop as soon as there is an error I think

Hmm, good point.  I won't worry too much about overriding yet as if there're
more devices violating the rule then reporting any of them should work - then
as the user tune the qemu cmdline it'll finally go right.

But I do see that error_setv() has an assertion on *errp being NULL.. I'll at
least make sure it won't trigger that assert by accident.

Thanks for spotting it!
Igor Mammedov Oct. 26, 2021, 3:11 p.m. UTC | #5
On Fri, 22 Oct 2021 10:14:29 +0800
Peter Xu <peterx@redhat.com> wrote:

> Hi, Alex,
> 
> On Thu, Oct 21, 2021 at 04:30:39PM -0600, Alex Williamson wrote:
> > On Thu, 21 Oct 2021 18:42:59 +0800
> > Peter Xu <peterx@redhat.com> wrote:
> >   
> > > Scan the pci bus to make sure there's no vfio-pci device attached before vIOMMU
> > > is realized.  
> > 
> > Sorry, I'm not onboard with this solution at all.
> > 
> > It would be really useful though if this commit log or a code comment
> > described exactly the incompatibility for which vfio-pci devices are
> > being called out here.  Otherwise I see this as a bit of magic voodoo
> > that gets lost in lore and copied elsewhere and we're constantly trying
> > to figure out specific incompatibilities when vfio-pci devices are
> > trying really hard to be "just another device".  
> 
> Sure, I can enrich the commit message.
> 
> > 
> > I infer from the link of the previous alternate solution that this is
> > to do with the fact that vfio devices attach a memory listener to the
> > device address space.  
> 
> IMHO it's not about the memory listeners, I think that' after vfio detected
> some vIOMMU memory regions already, which must be based on an vIOMMU address
> space being available.  I think the problem is that when realize() vfio-pci we
> fetch the dma address space specifically for getting the vfio group, while that
> could happen too early, even before vIOMMU is created.
> 
> > Interestingly that previous cover letter also discusses how vdpa devices
> > might have a similar issue, which makes it confusing again that we're calling
> > out vfio-pci devices by name rather than for a behavior.  
> 
> Yes I'll need to see whether this approach will be accepted first.  I think
> similar thing could help VDPA but it's not required there because VDPA has
> already worked around using pci_device_iommu_address_space().  So potentially
> the only one to "fix" is the vfio-pci device using along with vIOMMU, when the
> device ordering is specified in the wrong order.  I'll leave the VDPA problem
> to Jason to see whether he prefers keeping current code, or switch to a simpler
> one.  That should be after this one.
> 
> > 
> > If the behavior here is that vfio-pci devices attach a listener to the
> > device address space, then that provides a couple possible options.  We
> > could look for devices that have recorded an interest in their address
> > space, such as by setting a flag on PCIDevice when someone calls
> > pci_device_iommu_address_space(), where we could walk all devices using
> > the code in this series to find a device with such a flag.  
> 
> Right, we can set a flag for all the pci devices that needs to consolidate
> pci_device_iommu_address_space() result, however then it'll be vfio-pci only so
> far.  Btw, I actually proposed similar things two months ago, and I think Igor
> showed concern on that flag being vague on meaning:

(1)
> https://lore.kernel.org/qemu-devel/20210906104915.7dd5c934@redhat.com/

> 
>   > > Does it need to be a pre_plug hook?  I thought we might just need a flag in the
>   > > pci device classes showing that it should be after vIOMMUs, then in vIOMMU
>   > > realize functions we walk pci bus to make sure no such device exist?
>   > > 
>   > > We could have a base vIOMMU class, then that could be in the realize() of the
>   > > common class.  
>   > 
>   > We basically don't know if device needs IOMMU or not and can work
>   > with/without it just fine. In this case I'd think about IOMMU as board
>   > feature that morphs PCI buses (some of them) (address space, bus numers, ...).
>   > So I don't perceive any iommu flag as a device property at all.
>   > 
>   > As for realize vs pre_plug, the later is the part of abstract realize
>   > (see: device_set_realized) and is already used by some PCI infrastructure:
>   >   ex: pcie_cap_slot_pre_plug_cb/spapr_pci_pre_plug  
> 
> I still think that flag will work, that flag should only shows "whether this
> device needs to be specified earlier than vIOMMU", but I can get the point from
> Igor that it's at least confusing on what does the flag mean.

> Meanwhile I
> don't think that flag will be required, as this is not the first time we name a
> special device in the code, e.g. pc_machine_device_pre_plug_cb().
> intel_iommu.c has it too upon vfio-pci already on making sure caching-mode=on
> in vtd_machine_done_notify_one().

I pointed to specifically to _pre_plug() handler and not as
implemented here in realize().
Reasoning behind it is that some_device_realize() should not poke
into other devices, while pc_machine_device_pre_plug_cb() is
part of machine code can/may legitimately access its child devices and verify/
configure them. (Hence I'd drop my suggested-by with current impl.)
 
> If Igor is okay with adding such a flag for PCIDevice class, I can do that in
> the new version.  I don't have a strong opinion on this.

Also, I've suggested to use pre_plug only as the last resort in case
vfio-pci can't be made independent of the order (see [1] for reset time
suggestion).
So why 'reset' approach didn't work out?
(this need to be cover letter/commit message as a reason why
we are resorting to a hack)

> 
> Thanks,
>
Alex Williamson Oct. 26, 2021, 3:38 p.m. UTC | #6
On Tue, 26 Oct 2021 17:11:39 +0200
Igor Mammedov <imammedo@redhat.com> wrote:

> On Fri, 22 Oct 2021 10:14:29 +0800
> Peter Xu <peterx@redhat.com> wrote:
> 
> > Hi, Alex,
> > 
> > On Thu, Oct 21, 2021 at 04:30:39PM -0600, Alex Williamson wrote:  
> > > On Thu, 21 Oct 2021 18:42:59 +0800
> > > Peter Xu <peterx@redhat.com> wrote:
> > >     
> > > > Scan the pci bus to make sure there's no vfio-pci device attached before vIOMMU
> > > > is realized.    
> > > 
> > > Sorry, I'm not onboard with this solution at all.
> > > 
> > > It would be really useful though if this commit log or a code comment
> > > described exactly the incompatibility for which vfio-pci devices are
> > > being called out here.  Otherwise I see this as a bit of magic voodoo
> > > that gets lost in lore and copied elsewhere and we're constantly trying
> > > to figure out specific incompatibilities when vfio-pci devices are
> > > trying really hard to be "just another device".    
> > 
> > Sure, I can enrich the commit message.
> >   
> > > 
> > > I infer from the link of the previous alternate solution that this is
> > > to do with the fact that vfio devices attach a memory listener to the
> > > device address space.    
> > 
> > IMHO it's not about the memory listeners, I think that' after vfio detected
> > some vIOMMU memory regions already, which must be based on an vIOMMU address
> > space being available.  I think the problem is that when realize() vfio-pci we
> > fetch the dma address space specifically for getting the vfio group, while that
> > could happen too early, even before vIOMMU is created.
> >   
> > > Interestingly that previous cover letter also discusses how vdpa devices
> > > might have a similar issue, which makes it confusing again that we're calling
> > > out vfio-pci devices by name rather than for a behavior.    
> > 
> > Yes I'll need to see whether this approach will be accepted first.  I think
> > similar thing could help VDPA but it's not required there because VDPA has
> > already worked around using pci_device_iommu_address_space().  So potentially
> > the only one to "fix" is the vfio-pci device using along with vIOMMU, when the
> > device ordering is specified in the wrong order.  I'll leave the VDPA problem
> > to Jason to see whether he prefers keeping current code, or switch to a simpler
> > one.  That should be after this one.
> >   
> > > 
> > > If the behavior here is that vfio-pci devices attach a listener to the
> > > device address space, then that provides a couple possible options.  We
> > > could look for devices that have recorded an interest in their address
> > > space, such as by setting a flag on PCIDevice when someone calls
> > > pci_device_iommu_address_space(), where we could walk all devices using
> > > the code in this series to find a device with such a flag.    
> > 
> > Right, we can set a flag for all the pci devices that needs to consolidate
> > pci_device_iommu_address_space() result, however then it'll be vfio-pci only so
> > far.  Btw, I actually proposed similar things two months ago, and I think Igor
> > showed concern on that flag being vague on meaning:  
> 
> (1)
> > https://lore.kernel.org/qemu-devel/20210906104915.7dd5c934@redhat.com/  
> 
> >   
> >   > > Does it need to be a pre_plug hook?  I thought we might just need a flag in the
> >   > > pci device classes showing that it should be after vIOMMUs, then in vIOMMU
> >   > > realize functions we walk pci bus to make sure no such device exist?
> >   > > 
> >   > > We could have a base vIOMMU class, then that could be in the realize() of the
> >   > > common class.    
> >   > 
> >   > We basically don't know if device needs IOMMU or not and can work
> >   > with/without it just fine. In this case I'd think about IOMMU as board
> >   > feature that morphs PCI buses (some of them) (address space, bus numers, ...).
> >   > So I don't perceive any iommu flag as a device property at all.
> >   > 
> >   > As for realize vs pre_plug, the later is the part of abstract realize
> >   > (see: device_set_realized) and is already used by some PCI infrastructure:
> >   >   ex: pcie_cap_slot_pre_plug_cb/spapr_pci_pre_plug    
> > 
> > I still think that flag will work, that flag should only shows "whether this
> > device needs to be specified earlier than vIOMMU", but I can get the point from
> > Igor that it's at least confusing on what does the flag mean.  
> 
> > Meanwhile I
> > don't think that flag will be required, as this is not the first time we name a
> > special device in the code, e.g. pc_machine_device_pre_plug_cb().
> > intel_iommu.c has it too upon vfio-pci already on making sure caching-mode=on
> > in vtd_machine_done_notify_one().  
> 
> I pointed to specifically to _pre_plug() handler and not as
> implemented here in realize().
> Reasoning behind it is that some_device_realize() should not poke
> into other devices, while pc_machine_device_pre_plug_cb() is
> part of machine code can/may legitimately access its child devices and verify/
> configure them. (Hence I'd drop my suggested-by with current impl.)
>  
> > If Igor is okay with adding such a flag for PCIDevice class, I can do that in
> > the new version.  I don't have a strong opinion on this.  
> 
> Also, I've suggested to use pre_plug only as the last resort in case
> vfio-pci can't be made independent of the order (see [1] for reset time
> suggestion).
> So why 'reset' approach didn't work out?
> (this need to be cover letter/commit message as a reason why
> we are resorting to a hack)

Attaching MemoryListeners is not a lightweight process, that's where we
get the replay of device mappings and where the IOMMU will pin and map
all of the VM address space for the device.  Minimally, we'd need to
condition this setup based on the previous device address space such
that it's only affected on changes.  That probably leans more towards a
machine-init-done notifier to setup the listeners (assuming those work
correctly for hot-plug).

However, device address space is synonymous with the vfio container
used for a device.  In order to gain access to a vfio device, the vfio
group must first be placed into an IOMMU context, aka container.
Therefore that would imply that essentially the entire realize function
of a vfio device would move to a machine-init-done handler because we
can't actually access the device until we know the address space such
that we can place the group into a container.  Creating a container per
device which is attached to an address space at machine-init-done time
is also not practical as that incurs duplicate mapping overhead (time
and space) and locked memory accounting issues that we already struggle
with in the vIOMMU case.

There might be other issues, but that alone seems pretty impractical
for what seems like it ought not be such a taboo operation during
realize, to simply care about the device address space.  Thanks,

Alex
Peter Xu Oct. 27, 2021, 8:30 a.m. UTC | #7
On Tue, Oct 26, 2021 at 05:11:39PM +0200, Igor Mammedov wrote:
> On Fri, 22 Oct 2021 10:14:29 +0800
> Peter Xu <peterx@redhat.com> wrote:
> 
> > Hi, Alex,
> > 
> > On Thu, Oct 21, 2021 at 04:30:39PM -0600, Alex Williamson wrote:
> > > On Thu, 21 Oct 2021 18:42:59 +0800
> > > Peter Xu <peterx@redhat.com> wrote:
> > >   
> > > > Scan the pci bus to make sure there's no vfio-pci device attached before vIOMMU
> > > > is realized.  
> > > 
> > > Sorry, I'm not onboard with this solution at all.
> > > 
> > > It would be really useful though if this commit log or a code comment
> > > described exactly the incompatibility for which vfio-pci devices are
> > > being called out here.  Otherwise I see this as a bit of magic voodoo
> > > that gets lost in lore and copied elsewhere and we're constantly trying
> > > to figure out specific incompatibilities when vfio-pci devices are
> > > trying really hard to be "just another device".  
> > 
> > Sure, I can enrich the commit message.
> > 
> > > 
> > > I infer from the link of the previous alternate solution that this is
> > > to do with the fact that vfio devices attach a memory listener to the
> > > device address space.  
> > 
> > IMHO it's not about the memory listeners, I think that' after vfio detected
> > some vIOMMU memory regions already, which must be based on an vIOMMU address
> > space being available.  I think the problem is that when realize() vfio-pci we
> > fetch the dma address space specifically for getting the vfio group, while that
> > could happen too early, even before vIOMMU is created.
> > 
> > > Interestingly that previous cover letter also discusses how vdpa devices
> > > might have a similar issue, which makes it confusing again that we're calling
> > > out vfio-pci devices by name rather than for a behavior.  
> > 
> > Yes I'll need to see whether this approach will be accepted first.  I think
> > similar thing could help VDPA but it's not required there because VDPA has
> > already worked around using pci_device_iommu_address_space().  So potentially
> > the only one to "fix" is the vfio-pci device using along with vIOMMU, when the
> > device ordering is specified in the wrong order.  I'll leave the VDPA problem
> > to Jason to see whether he prefers keeping current code, or switch to a simpler
> > one.  That should be after this one.
> > 
> > > 
> > > If the behavior here is that vfio-pci devices attach a listener to the
> > > device address space, then that provides a couple possible options.  We
> > > could look for devices that have recorded an interest in their address
> > > space, such as by setting a flag on PCIDevice when someone calls
> > > pci_device_iommu_address_space(), where we could walk all devices using
> > > the code in this series to find a device with such a flag.  
> > 
> > Right, we can set a flag for all the pci devices that needs to consolidate
> > pci_device_iommu_address_space() result, however then it'll be vfio-pci only so
> > far.  Btw, I actually proposed similar things two months ago, and I think Igor
> > showed concern on that flag being vague on meaning:
> 
> (1)
> > https://lore.kernel.org/qemu-devel/20210906104915.7dd5c934@redhat.com/
> 
> > 
> >   > > Does it need to be a pre_plug hook?  I thought we might just need a flag in the
> >   > > pci device classes showing that it should be after vIOMMUs, then in vIOMMU
> >   > > realize functions we walk pci bus to make sure no such device exist?
> >   > > 
> >   > > We could have a base vIOMMU class, then that could be in the realize() of the
> >   > > common class.  
> >   > 
> >   > We basically don't know if device needs IOMMU or not and can work
> >   > with/without it just fine. In this case I'd think about IOMMU as board
> >   > feature that morphs PCI buses (some of them) (address space, bus numers, ...).
> >   > So I don't perceive any iommu flag as a device property at all.
> >   > 
> >   > As for realize vs pre_plug, the later is the part of abstract realize
> >   > (see: device_set_realized) and is already used by some PCI infrastructure:
> >   >   ex: pcie_cap_slot_pre_plug_cb/spapr_pci_pre_plug  
> > 
> > I still think that flag will work, that flag should only shows "whether this
> > device needs to be specified earlier than vIOMMU", but I can get the point from
> > Igor that it's at least confusing on what does the flag mean.
> 
> > Meanwhile I
> > don't think that flag will be required, as this is not the first time we name a
> > special device in the code, e.g. pc_machine_device_pre_plug_cb().
> > intel_iommu.c has it too upon vfio-pci already on making sure caching-mode=on
> > in vtd_machine_done_notify_one().
> 
> I pointed to specifically to _pre_plug() handler and not as
> implemented here in realize().
> Reasoning behind it is that some_device_realize() should not poke
> into other devices, while pc_machine_device_pre_plug_cb() is
> part of machine code can/may legitimately access its child devices and verify/
> configure them. (Hence I'd drop my suggested-by with current impl.)

I see, I didn't try that because I see that q35 even does not have pre_plug()
installed yet, so I need to add it in the same patchset, am I right?

Frankly I've also no idea why pc has the pre_plug() but not q35.. But if you
think that matters then I can try.

>  
> > If Igor is okay with adding such a flag for PCIDevice class, I can do that in
> > the new version.  I don't have a strong opinion on this.
> 
> Also, I've suggested to use pre_plug only as the last resort in case
> vfio-pci can't be made independent of the order (see [1] for reset time
> suggestion).
> So why 'reset' approach didn't work out?
> (this need to be cover letter/commit message as a reason why
> we are resorting to a hack)

As explained by Alex, I think that's more challenging and it was discussed
quite a few times before, so I didn't mention that in cover letter.

Sorry I should have mentioned some of it.
Peter Xu Oct. 28, 2021, 2:30 a.m. UTC | #8
On Wed, Oct 27, 2021 at 04:30:18PM +0800, Peter Xu wrote:
> On Tue, Oct 26, 2021 at 05:11:39PM +0200, Igor Mammedov wrote:
> > On Fri, 22 Oct 2021 10:14:29 +0800
> > Peter Xu <peterx@redhat.com> wrote:
> > 
> > > Hi, Alex,
> > > 
> > > On Thu, Oct 21, 2021 at 04:30:39PM -0600, Alex Williamson wrote:
> > > > On Thu, 21 Oct 2021 18:42:59 +0800
> > > > Peter Xu <peterx@redhat.com> wrote:
> > > >   
> > > > > Scan the pci bus to make sure there's no vfio-pci device attached before vIOMMU
> > > > > is realized.  
> > > > 
> > > > Sorry, I'm not onboard with this solution at all.
> > > > 
> > > > It would be really useful though if this commit log or a code comment
> > > > described exactly the incompatibility for which vfio-pci devices are
> > > > being called out here.  Otherwise I see this as a bit of magic voodoo
> > > > that gets lost in lore and copied elsewhere and we're constantly trying
> > > > to figure out specific incompatibilities when vfio-pci devices are
> > > > trying really hard to be "just another device".  
> > > 
> > > Sure, I can enrich the commit message.
> > > 
> > > > 
> > > > I infer from the link of the previous alternate solution that this is
> > > > to do with the fact that vfio devices attach a memory listener to the
> > > > device address space.  
> > > 
> > > IMHO it's not about the memory listeners, I think that' after vfio detected
> > > some vIOMMU memory regions already, which must be based on an vIOMMU address
> > > space being available.  I think the problem is that when realize() vfio-pci we
> > > fetch the dma address space specifically for getting the vfio group, while that
> > > could happen too early, even before vIOMMU is created.
> > > 
> > > > Interestingly that previous cover letter also discusses how vdpa devices
> > > > might have a similar issue, which makes it confusing again that we're calling
> > > > out vfio-pci devices by name rather than for a behavior.  
> > > 
> > > Yes I'll need to see whether this approach will be accepted first.  I think
> > > similar thing could help VDPA but it's not required there because VDPA has
> > > already worked around using pci_device_iommu_address_space().  So potentially
> > > the only one to "fix" is the vfio-pci device using along with vIOMMU, when the
> > > device ordering is specified in the wrong order.  I'll leave the VDPA problem
> > > to Jason to see whether he prefers keeping current code, or switch to a simpler
> > > one.  That should be after this one.
> > > 
> > > > 
> > > > If the behavior here is that vfio-pci devices attach a listener to the
> > > > device address space, then that provides a couple possible options.  We
> > > > could look for devices that have recorded an interest in their address
> > > > space, such as by setting a flag on PCIDevice when someone calls
> > > > pci_device_iommu_address_space(), where we could walk all devices using
> > > > the code in this series to find a device with such a flag.  
> > > 
> > > Right, we can set a flag for all the pci devices that needs to consolidate
> > > pci_device_iommu_address_space() result, however then it'll be vfio-pci only so
> > > far.  Btw, I actually proposed similar things two months ago, and I think Igor
> > > showed concern on that flag being vague on meaning:
> > 
> > (1)
> > > https://lore.kernel.org/qemu-devel/20210906104915.7dd5c934@redhat.com/
> > 
> > > 
> > >   > > Does it need to be a pre_plug hook?  I thought we might just need a flag in the
> > >   > > pci device classes showing that it should be after vIOMMUs, then in vIOMMU
> > >   > > realize functions we walk pci bus to make sure no such device exist?
> > >   > > 
> > >   > > We could have a base vIOMMU class, then that could be in the realize() of the
> > >   > > common class.  
> > >   > 
> > >   > We basically don't know if device needs IOMMU or not and can work
> > >   > with/without it just fine. In this case I'd think about IOMMU as board
> > >   > feature that morphs PCI buses (some of them) (address space, bus numers, ...).
> > >   > So I don't perceive any iommu flag as a device property at all.
> > >   > 
> > >   > As for realize vs pre_plug, the later is the part of abstract realize
> > >   > (see: device_set_realized) and is already used by some PCI infrastructure:
> > >   >   ex: pcie_cap_slot_pre_plug_cb/spapr_pci_pre_plug  
> > > 
> > > I still think that flag will work, that flag should only shows "whether this
> > > device needs to be specified earlier than vIOMMU", but I can get the point from
> > > Igor that it's at least confusing on what does the flag mean.
> > 
> > > Meanwhile I
> > > don't think that flag will be required, as this is not the first time we name a
> > > special device in the code, e.g. pc_machine_device_pre_plug_cb().
> > > intel_iommu.c has it too upon vfio-pci already on making sure caching-mode=on
> > > in vtd_machine_done_notify_one().
> > 
> > I pointed to specifically to _pre_plug() handler and not as
> > implemented here in realize().
> > Reasoning behind it is that some_device_realize() should not poke
> > into other devices, while pc_machine_device_pre_plug_cb() is
> > part of machine code can/may legitimately access its child devices and verify/
> > configure them. (Hence I'd drop my suggested-by with current impl.)
> 
> I see, I didn't try that because I see that q35 even does not have pre_plug()
> installed yet, so I need to add it in the same patchset, am I right?
> 
> Frankly I've also no idea why pc has the pre_plug() but not q35.. But if you
> think that matters then I can try.

I think I know what I'm missing.. there's the pc_get_hotplug_handler() that
conditionally fetch the handler, so when I was trying with q35/VT-d the handler
always returned null..

I'll repost with pre_plug(), thanks.
diff mbox series

Patch

diff --git a/hw/i386/x86-iommu.c b/hw/i386/x86-iommu.c
index 86ad03972e..58abce7edc 100644
--- a/hw/i386/x86-iommu.c
+++ b/hw/i386/x86-iommu.c
@@ -21,6 +21,7 @@ 
 #include "hw/sysbus.h"
 #include "hw/i386/x86-iommu.h"
 #include "hw/qdev-properties.h"
+#include "hw/vfio/pci.h"
 #include "hw/i386/pc.h"
 #include "qapi/error.h"
 #include "qemu/error-report.h"
@@ -103,6 +104,16 @@  IommuType x86_iommu_get_type(void)
     return x86_iommu_default->type;
 }
 
+static void x86_iommu_pci_dev_hook(PCIBus *bus, PCIDevice *dev, void *opaque)
+{
+    Error **errp = (Error **)opaque;
+
+    if (object_dynamic_cast(OBJECT(dev), TYPE_VFIO_PCI)) {
+        error_setg(errp, "Device '%s' must be specified before vIOMMUs",
+                   TYPE_VFIO_PCI);
+    }
+}
+
 static void x86_iommu_realize(DeviceState *dev, Error **errp)
 {
     X86IOMMUState *x86_iommu = X86_IOMMU_DEVICE(dev);
@@ -120,6 +131,12 @@  static void x86_iommu_realize(DeviceState *dev, Error **errp)
         return;
     }
 
+    /* Make sure there's no special device plugged before vIOMMU */
+    pci_for_each_device_all(x86_iommu_pci_dev_hook, (void *)errp);
+    if (*errp) {
+        return;
+    }
+
     /* If the user didn't specify IR, choose a default value for it */
     if (x86_iommu->intr_supported == ON_OFF_AUTO_AUTO) {
         x86_iommu->intr_supported = irq_all_kernel ?
@@ -151,6 +168,7 @@  static Property x86_iommu_properties[] = {
 static void x86_iommu_class_init(ObjectClass *klass, void *data)
 {
     DeviceClass *dc = DEVICE_CLASS(klass);
+
     dc->realize = x86_iommu_realize;
     device_class_set_props(dc, x86_iommu_properties);
 }