[v6,4/5,FUTURE] xen/arm: enable vPCI for domUs

Message ID 20231113222118.825758-5-stewart.hildebrand@amd.com
Series Kconfig for PCI passthrough on ARM

Commit Message

Stewart Hildebrand Nov. 13, 2023, 10:21 p.m. UTC
Select HAS_VPCI_GUEST_SUPPORT in Kconfig for enabling vPCI support for
domUs.

Add checks to fail guest creation if the configuration is invalid.

Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
---
As the tag implies, this patch is not intended to be merged (yet).

Note that CONFIG_HAS_VPCI_GUEST_SUPPORT is not currently used in the
upstream code base. It will be used by the vPCI series [1]. This patch
is intended to be merged as part of the vPCI series. I'll coordinate
with Volodymyr to include this in the vPCI series or resend afterwards.
Meanwhile, I'll include it here until the Kconfig and
xen_domctl_createdomain prerequisites have been committed.

v5->v6:
* drop is_pvh_domain(), simply make arch_needs_vpci() return false on
  x86 for now
* leave has_vpci() alone, instead, add HAS_VPCI_GUEST_SUPPORT check in
  domain_create

v4->v5:
* replace is_system_domain() check with dom_io check
* return an error in XEN_DOMCTL_assign_device (thanks Jan!)
* remove CONFIG_ARM check
* add needs_vpci() and arch_needs_vpci()
* add HAS_VPCI_GUEST_SUPPORT check to has_vpci()

v3->v4:
* refuse to create domain if configuration is invalid
* split toolstack change into separate patch

v2->v3:
* set pci flags in toolstack

v1->v2:
* new patch

[1] https://lists.xenproject.org/archives/html/xen-devel/2023-10/msg00660.html
---
 xen/arch/arm/Kconfig              |  1 +
 xen/arch/arm/include/asm/domain.h |  2 ++
 xen/arch/arm/vpci.c               |  8 ++++++++
 xen/arch/x86/include/asm/domain.h |  2 ++
 xen/common/domain.c               |  5 +++++
 xen/drivers/passthrough/pci.c     | 20 ++++++++++++++++++++
 6 files changed, 38 insertions(+)

Comments

Jan Beulich Nov. 14, 2023, 9:13 a.m. UTC | #1
On 13.11.2023 23:21, Stewart Hildebrand wrote:
> --- a/xen/arch/x86/include/asm/domain.h
> +++ b/xen/arch/x86/include/asm/domain.h
> @@ -503,6 +503,8 @@ struct arch_domain
>  #define has_vpit(d)        (!!((d)->arch.emulation_flags & X86_EMU_PIT))
>  #define has_pirq(d)        (!!((d)->arch.emulation_flags & X86_EMU_USE_PIRQ))
>  
> +#define arch_needs_vpci(d) ({ (void)(d); false; })

See my comments on the v5 thread on both this and ...

> --- a/xen/drivers/passthrough/pci.c
> +++ b/xen/drivers/passthrough/pci.c
> @@ -1542,6 +1542,18 @@ void iommu_dev_iotlb_flush_timeout(struct domain *d, struct pci_dev *pdev)
>      pcidevs_unlock();
>  }
>  
> +static bool needs_vpci(const struct domain *d)
> +{
> +    if ( is_hardware_domain(d) )
> +        return false;

... this. (It is generally a good idea to wait a little with sending new
versions, when you can't be sure yet whether the earlier discussion has
settled.)

Jan
Stewart Hildebrand Nov. 30, 2023, 2:47 a.m. UTC | #2
On 11/14/23 04:13, Jan Beulich wrote:
> On 13.11.2023 23:21, Stewart Hildebrand wrote:
>> --- a/xen/arch/x86/include/asm/domain.h
>> +++ b/xen/arch/x86/include/asm/domain.h
>> @@ -503,6 +503,8 @@ struct arch_domain
>>  #define has_vpit(d)        (!!((d)->arch.emulation_flags & X86_EMU_PIT))
>>  #define has_pirq(d)        (!!((d)->arch.emulation_flags & X86_EMU_USE_PIRQ))
>>  
>> +#define arch_needs_vpci(d) ({ (void)(d); false; })
> 
> See my comments on the v5 thread on both this and ...

So, the goal here is to return true for a PVH dom0, and false otherwise (for now). Since dom0 can't feasibly be full HVM, and is_hvm_domain(d) returns true for PVH, how about the following?

/* TODO: re-visit when vPCI is enabled for PVH domUs. */
#define arch_needs_vpci(d) ({                       \
    const struct domain *_d = (d);                  \
    is_hardware_domain(_d) && is_hvm_domain(_d); })


Link to v5 thread for reference [1]
[1] https://lists.xenproject.org/archives/html/xen-devel/2023-11/msg00968.html

> 
>> --- a/xen/drivers/passthrough/pci.c
>> +++ b/xen/drivers/passthrough/pci.c
>> @@ -1542,6 +1542,18 @@ void iommu_dev_iotlb_flush_timeout(struct domain *d, struct pci_dev *pdev)
>>      pcidevs_unlock();
>>  }
>>  
>> +static bool needs_vpci(const struct domain *d)
>> +{
>> +    if ( is_hardware_domain(d) )
>> +        return false;
> 
> ... this.

I'll move this check to the Arm arch_needs_vpci() in xen/arch/arm/include/asm/domain.h
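
Something like this minimal sketch (illustrative only; the idea being that on Arm only domUs would report needing vPCI, while the hardware domain keeps its existing behavior):

/* xen/arch/arm/include/asm/domain.h */
#define arch_needs_vpci(d) (!is_hardware_domain(d))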

> (It is generally a good idea to wait a little with sending new
> versions, when you can't be sure yet whether the earlier discussion has
> settled.)

(Sorry, I'll be better about this going forward.)

> 
> Jan
Jan Beulich Nov. 30, 2023, 8:33 a.m. UTC | #3
On 30.11.2023 03:47, Stewart Hildebrand wrote:
> On 11/14/23 04:13, Jan Beulich wrote:
>> On 13.11.2023 23:21, Stewart Hildebrand wrote:
>>> --- a/xen/arch/x86/include/asm/domain.h
>>> +++ b/xen/arch/x86/include/asm/domain.h
>>> @@ -503,6 +503,8 @@ struct arch_domain
>>>  #define has_vpit(d)        (!!((d)->arch.emulation_flags & X86_EMU_PIT))
>>>  #define has_pirq(d)        (!!((d)->arch.emulation_flags & X86_EMU_USE_PIRQ))
>>>  
>>> +#define arch_needs_vpci(d) ({ (void)(d); false; })
>>
>> See my comments on the v5 thread on both this and ...
> 
> So, the goal here is to return true for a PVH dom0, and false otherwise (for now). Since dom0 can't feasibly be full HVM, and is_hvm_domain(d) returns true for PVH, how about the following?
> 
> /* TODO: re-visit when vPCI is enabled for PVH domUs. */
> #define arch_needs_vpci(d) ({                       \
>     const struct domain *_d = (d);                  \
>     is_hardware_domain(_d) && is_hvm_domain(_d); })

Looks okay to me, except for the leading underscore in _d (see respective
Misra guidelines, merely re-enforcing what the C standard says). Of course
the double evaluate_nospec() isn't quite nice in the result, but I guess
this isn't going to be used in many places?

Jan
Stewart Hildebrand Nov. 30, 2023, 5:06 p.m. UTC | #4
On 11/30/23 03:33, Jan Beulich wrote:
> On 30.11.2023 03:47, Stewart Hildebrand wrote:
>> On 11/14/23 04:13, Jan Beulich wrote:
>>> On 13.11.2023 23:21, Stewart Hildebrand wrote:
>>>> --- a/xen/arch/x86/include/asm/domain.h
>>>> +++ b/xen/arch/x86/include/asm/domain.h
>>>> @@ -503,6 +503,8 @@ struct arch_domain
>>>>  #define has_vpit(d)        (!!((d)->arch.emulation_flags & X86_EMU_PIT))
>>>>  #define has_pirq(d)        (!!((d)->arch.emulation_flags & X86_EMU_USE_PIRQ))
>>>>  
>>>> +#define arch_needs_vpci(d) ({ (void)(d); false; })
>>>
>>> See my comments on the v5 thread on both this and ...
>>
>> So, the goal here is to return true for a PVH dom0, and false otherwise (for now). Since dom0 can't feasibly be full HVM, and is_hvm_domain(d) returns true for PVH, how about the following?
>>
>> /* TODO: re-visit when vPCI is enabled for PVH domUs. */
>> #define arch_needs_vpci(d) ({                       \
>>     const struct domain *_d = (d);                  \
>>     is_hardware_domain(_d) && is_hvm_domain(_d); })
> 
> Looks okay to me, except for the leading underscore in _d (see respective
> Misra guidelines, merely re-enforcing what the C standard says).

Right. If I'm interpreting the standards correctly, it looks like a trailing underscore would work, seeing as we haven't adopted MISRA C:2012 Dir 4.5 (yet?).
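
For concreteness, the macro proposed above with a trailing underscore would then read (sketch only):

/* TODO: re-visit when vPCI is enabled for PVH domUs. */
#define arch_needs_vpci(d) ({                       \
    const struct domain *d_ = (d);                  \
    is_hardware_domain(d_) && is_hvm_domain(d_); })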

> Of course
> the double evaluate_nospec() isn't quite nice in the result, but I guess
> this isn't going to be used in many places?

Only in XEN_DOMCTL_assign_device, as far as I'm aware.

> 
> Jan
Jan Beulich Dec. 1, 2023, 6:57 a.m. UTC | #5
On 30.11.2023 18:06, Stewart Hildebrand wrote:
> On 11/30/23 03:33, Jan Beulich wrote:
>> On 30.11.2023 03:47, Stewart Hildebrand wrote:
>>> On 11/14/23 04:13, Jan Beulich wrote:
>>>> On 13.11.2023 23:21, Stewart Hildebrand wrote:
>>>>> --- a/xen/arch/x86/include/asm/domain.h
>>>>> +++ b/xen/arch/x86/include/asm/domain.h
>>>>> @@ -503,6 +503,8 @@ struct arch_domain
>>>>>  #define has_vpit(d)        (!!((d)->arch.emulation_flags & X86_EMU_PIT))
>>>>>  #define has_pirq(d)        (!!((d)->arch.emulation_flags & X86_EMU_USE_PIRQ))
>>>>>  
>>>>> +#define arch_needs_vpci(d) ({ (void)(d); false; })
>>>>
>>>> See my comments on the v5 thread on both this and ...
>>>
>>> So, the goal here is to return true for a PVH dom0, and false otherwise (for now). Since dom0 can't feasibly be full HVM, and is_hvm_domain(d) returns true for PVH, how about the following?
>>>
>>> /* TODO: re-visit when vPCI is enabled for PVH domUs. */
>>> #define arch_needs_vpci(d) ({                       \
>>>     const struct domain *_d = (d);                  \
>>>     is_hardware_domain(_d) && is_hvm_domain(_d); })
>>
>> Looks okay to me, except for the leading underscore in _d (see respective
>> Misra guidelines, merely re-enforcing what the C standard says).
> 
> Right. If I'm interpreting the standards correctly, it looks like a trailing underscore would work, seeing as we haven't adopted MISRA C:2012 Dir 4.5 (yet?).

Adopting the respective Misra rule would only affirm that we should adhere
to the C spec. Us being free-standing (and hence no runtime library involved)
may have been an argument towards more relaxed treatment, but imo never was a
good justification. And yes, trailing underscores is what I have been
recommending, but some of the other maintainers don't really like them
(without, iirc, indicating what else to use as an underlying naming scheme).

Jan
Roger Pau Monne Dec. 1, 2023, 9:16 a.m. UTC | #6
On Mon, Nov 13, 2023 at 05:21:13PM -0500, Stewart Hildebrand wrote:
> @@ -1618,6 +1630,14 @@ int iommu_do_pci_domctl(
>          bus = PCI_BUS(machine_sbdf);
>          devfn = PCI_DEVFN(machine_sbdf);
>  
> +        if ( needs_vpci(d) && !has_vpci(d) )
> +        {
> +            printk(XENLOG_G_WARNING "Cannot assign %pp to %pd: vPCI support not enabled\n",
> +                   &PCI_SBDF(seg, bus, devfn), d);
> +            ret = -EPERM;
> +            break;

I think this is likely too restrictive going forward.  The current
approach is indeed to enable vPCI on a per-domain basis because that's
how PVH dom0 uses it, due to being unable to use ioreq servers.

If we start to expose vPCI support to guests the interface should be on
a per-device basis, so that vPCI could be enabled for some devices,
while others could still be handled by ioreq servers.

We might want to add a new flag to xen_domctl_assign_device (used by
XEN_DOMCTL_assign_device) in order to signal whether the device will
use vPCI.

Thanks, Roger.
Stefano Stabellini Dec. 2, 2023, 2:56 a.m. UTC | #7
On Fri, 1 Dec 2023, Roger Pau Monné wrote:
> On Mon, Nov 13, 2023 at 05:21:13PM -0500, Stewart Hildebrand wrote:
> > @@ -1618,6 +1630,14 @@ int iommu_do_pci_domctl(
> >          bus = PCI_BUS(machine_sbdf);
> >          devfn = PCI_DEVFN(machine_sbdf);
> >  
> > +        if ( needs_vpci(d) && !has_vpci(d) )
> > +        {
> > +            printk(XENLOG_G_WARNING "Cannot assign %pp to %pd: vPCI support not enabled\n",
> > +                   &PCI_SBDF(seg, bus, devfn), d);
> > +            ret = -EPERM;
> > +            break;
> 
> I think this is likely too restrictive going forward.  The current
> approach is indeed to enable vPCI on a per-domain basis because that's
> how PVH dom0 uses it, due to being unable to use ioreq servers.
> 
> If we start to expose vPCI support to guests the interface should be on
> a per-device basis, so that vPCI could be enabled for some devices,
> while others could still be handled by ioreq servers.
> 
> We might want to add a new flag to xen_domctl_assign_device (used by
> XEN_DOMCTL_assign_device) in order to signal whether the device will
> use vPCI.

Actually I don't think this is a good idea. I am all for flexibility but
supporting multiple different configurations comes at an extra cost for
both maintainers and contributors. I think we should try to reduce the
amount of configurations we support rather than increasing them
(especially on x86 where we have PV, PVH, HVM).

I don't think we should enable IOREQ servers to handle PCI passthrough
for PVH guests and/or guests with vPCI. If the domain has vPCI, PCI
Passthrough can be handled by vPCI just fine. I think this should be a
good anti-feature to have (a goal to explicitly not add this feature) to
reduce complexity. Unless you see a specific usecase to add support for
it?
Stewart Hildebrand Dec. 4, 2023, 3:54 a.m. UTC | #8
On 12/1/23 21:56, Stefano Stabellini wrote:
> On Fri, 1 Dec 2023, Roger Pau Monné wrote:
>> On Mon, Nov 13, 2023 at 05:21:13PM -0500, Stewart Hildebrand wrote:
>>> @@ -1618,6 +1630,14 @@ int iommu_do_pci_domctl(
>>>          bus = PCI_BUS(machine_sbdf);
>>>          devfn = PCI_DEVFN(machine_sbdf);
>>>  
>>> +        if ( needs_vpci(d) && !has_vpci(d) )
>>> +        {
>>> +            printk(XENLOG_G_WARNING "Cannot assign %pp to %pd: vPCI support not enabled\n",
>>> +                   &PCI_SBDF(seg, bus, devfn), d);
>>> +            ret = -EPERM;
>>> +            break;
>>
>> I think this is likely too restrictive going forward.  The current
>> approach is indeed to enable vPCI on a per-domain basis because that's
>> how PVH dom0 uses it, due to being unable to use ioreq servers.
>>
>> If we start to expose vPCI support to guests the interface should be on
>> a per-device basis, so that vPCI could be enabled for some devices,
>> while others could still be handled by ioreq servers.
>>
>> We might want to add a new flag to xen_domctl_assign_device (used by
>> XEN_DOMCTL_assign_device) in order to signal whether the device will
>> use vPCI.
> 
> Actually I don't think this is a good idea. I am all for flexibility but
> supporting multiple different configurations comes at an extra cost for
> both maintainers and contributors. I think we should try to reduce the
> amount of configurations we support rather than increasing them
> (especially on x86 where we have PV, PVH, HVM).
> 
> I don't think we should enable IOREQ servers to handle PCI passthrough
> for PVH guests and/or guests with vPCI. If the domain has vPCI, PCI
> Passthrough can be handled by vPCI just fine. I think this should be a
> good anti-feature to have (a goal to explicitly not add this feature) to
> reduce complexity. Unless you see a specific usecase to add support for
> it?

Just to preemptively clarify: there is a use case for passthrough (vPCI) and emulated virtio-pci devices (ioreq). However, the XEN_DOMCTL_assign_device hypercall, where this check is added, is only used for real physical hardware devices as far as I can tell. So I agree, I can't see a use case for passing through some physical devices with vPCI and some physical devices with qemu/ioreq.

With that said, the plumbing for a new flag does not appear to be particularly complex. It may actually be simpler than introducing a per-arch needs_vpci(d) heuristic.

diff --git a/tools/libs/light/libxl_pci.c b/tools/libs/light/libxl_pci.c
index 96cb4da0794e..2c38088a4772 100644
--- a/tools/libs/light/libxl_pci.c
+++ b/tools/libs/light/libxl_pci.c
@@ -1113,6 +1113,7 @@ typedef struct pci_add_state {
     libxl_device_pci pci;
     libxl_domid pci_domid;
     int retries;
+    bool vpci;
 } pci_add_state;

 static void pci_add_qemu_trad_watch_state_cb(libxl__egc *egc,
@@ -1176,6 +1177,10 @@ static void do_pci_add(libxl__egc *egc,
         }
     }

+    if (type == LIBXL_DOMAIN_TYPE_PVH /* includes Arm guests */) {
+        pas->vpci = true;
+    }
+
     rc = 0;

 out:
@@ -1418,7 +1423,8 @@ static void pci_add_dm_done(libxl__egc *egc,
     unsigned long long start, end, flags, size;
     int irq, i;
     int r;
-    uint32_t flag = XEN_DOMCTL_DEV_RDM_RELAXED;
+    uint32_t flag = XEN_DOMCTL_DEV_RDM_RELAXED |
+                    (pas->vpci ? XEN_DOMCTL_DEV_USES_VPCI : 0);
     uint32_t domainid = domid;
     bool isstubdom = libxl_is_stubdom(ctx, domid, &domainid);

diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index 2203725a2aa6..7786da1cf1e6 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -1630,7 +1630,7 @@ int iommu_do_pci_domctl(
         bus = PCI_BUS(machine_sbdf);
         devfn = PCI_DEVFN(machine_sbdf);

-        if ( needs_vpci(d) && !has_vpci(d) )
+        if ( (flags & XEN_DOMCTL_DEV_USES_VPCI) && !has_vpci(d) )
         {
             printk(XENLOG_G_WARNING "Cannot assign %pp to %pd: vPCI support not enabled\n",
                    &PCI_SBDF(seg, bus, devfn), d);
diff --git a/xen/include/public/domctl.h b/xen/include/public/domctl.h
index 8b3ea62cae06..5735d47219bc 100644
--- a/xen/include/public/domctl.h
+++ b/xen/include/public/domctl.h
@@ -527,6 +527,7 @@ struct xen_domctl_assign_device {
     uint32_t dev;   /* XEN_DOMCTL_DEV_* */
     uint32_t flags;
 #define XEN_DOMCTL_DEV_RDM_RELAXED      1 /* assign only */
+#define XEN_DOMCTL_DEV_USES_VPCI        (1 << 1)
     union {
         struct {
             uint32_t machine_sbdf;   /* machine PCI ID of assigned device */
Jan Beulich Dec. 4, 2023, 8:24 a.m. UTC | #9
On 02.12.2023 03:56, Stefano Stabellini wrote:
> On Fri, 1 Dec 2023, Roger Pau Monné wrote:
>> On Mon, Nov 13, 2023 at 05:21:13PM -0500, Stewart Hildebrand wrote:
>>> @@ -1618,6 +1630,14 @@ int iommu_do_pci_domctl(
>>>          bus = PCI_BUS(machine_sbdf);
>>>          devfn = PCI_DEVFN(machine_sbdf);
>>>  
>>> +        if ( needs_vpci(d) && !has_vpci(d) )
>>> +        {
>>> +            printk(XENLOG_G_WARNING "Cannot assign %pp to %pd: vPCI support not enabled\n",
>>> +                   &PCI_SBDF(seg, bus, devfn), d);
>>> +            ret = -EPERM;
>>> +            break;
>>
>> I think this is likely too restrictive going forward.  The current
>> approach is indeed to enable vPCI on a per-domain basis because that's
>> how PVH dom0 uses it, due to being unable to use ioreq servers.
>>
>> If we start to expose vPCI support to guests the interface should be on
>> a per-device basis, so that vPCI could be enabled for some devices,
>> while others could still be handled by ioreq servers.
>>
>> We might want to add a new flag to xen_domctl_assign_device (used by
>> XEN_DOMCTL_assign_device) in order to signal whether the device will
>> use vPCI.
> 
> Actually I don't think this is a good idea. I am all for flexibility but
> supporting multiple different configurations comes at an extra cost for
> both maintainers and contributors. I think we should try to reduce the
> amount of configurations we support rather than increasing them
> (especially on x86 where we have PV, PVH, HVM).
> 
> I don't think we should enable IOREQ servers to handle PCI passthrough
> for PVH guests and/or guests with vPCI. If the domain has vPCI, PCI
> Passthrough can be handled by vPCI just fine. I think this should be a
> good anti-feature to have (a goal to explicitly not add this feature) to
> reduce complexity. Unless you see a specific usecase to add support for
> it?

I could see very special devices wanting to be accompanied by a special
ioreq server. I could also see tandem pass-through/emulated device pairs
potentially being of use and requiring coordination in the same ioreq
server.

That said, I wouldn't insist on allowing a mix of vPCI and ioreq servers
to be used until an actual use case (with code supporting it) actually
appears.

Jan
Roger Pau Monne Dec. 4, 2023, 10:58 a.m. UTC | #10
On Fri, Dec 01, 2023 at 06:56:32PM -0800, Stefano Stabellini wrote:
> On Fri, 1 Dec 2023, Roger Pau Monné wrote:
> > On Mon, Nov 13, 2023 at 05:21:13PM -0500, Stewart Hildebrand wrote:
> > > @@ -1618,6 +1630,14 @@ int iommu_do_pci_domctl(
> > >          bus = PCI_BUS(machine_sbdf);
> > >          devfn = PCI_DEVFN(machine_sbdf);
> > >  
> > > +        if ( needs_vpci(d) && !has_vpci(d) )
> > > +        {
> > > +            printk(XENLOG_G_WARNING "Cannot assign %pp to %pd: vPCI support not enabled\n",
> > > +                   &PCI_SBDF(seg, bus, devfn), d);
> > > +            ret = -EPERM;
> > > +            break;
> > 
> > I think this is likely too restrictive going forward.  The current
> > approach is indeed to enable vPCI on a per-domain basis because that's
> > how PVH dom0 uses it, due to being unable to use ioreq servers.
> > 
> > If we start to expose vPCI support to guests the interface should be on
> > a per-device basis, so that vPCI could be enabled for some devices,
> > while others could still be handled by ioreq servers.
> > 
> > We might want to add a new flag to xen_domctl_assign_device (used by
> > XEN_DOMCTL_assign_device) in order to signal whether the device will
> > use vPCI.
> 
> Actually I don't think this is a good idea. I am all for flexibility but
> supporting multiple different configurations comes at an extra cost for
> both maintainers and contributors. I think we should try to reduce the
> amount of configurations we support rather than increasing them
> (especially on x86 where we have PV, PVH, HVM).

I think it's perfectly fine to initially require a domain to have all
its devices either passed through using vPCI or ioreqs, but the
interface IMO should allow for such differentiation in the future.
That's why I think introducing a domain wide vPCI flag might not be
the best option going forward.

It would be perfectly fine for XEN_DOMCTL_assign_device to set a
domain wide vPCI flag, iow:

if ( HYPERCALL_VPCI_FLAG_SET && !has_vpci(d) )
{
    if ( has_arch_pdevs(d) )
    {
        printk("All passthrough devices must use the same backend\n");
        return -EINVAL;
    }

    /* Set vPCI domain flag */
}

We have already agreed that we want to aim for a setup where ioreqs
and vPCI could be used for the same domain, but I guess you assumed
that ioreqs would be used for emulated devices exclusively and vPCI
for passthrough devices?

Overall if we agree that ioreqs and vPCI should co-exist for a domain,
I'm not sure there's much reason to limit ioreqs to only handle
emulated devices, seems like an arbitrary limitation.

> I don't think we should enable IOREQ servers to handle PCI passthrough
> for PVH guests and/or guests with vPCI. If the domain has vPCI, PCI
> Passthrough can be handled by vPCI just fine. I think this should be a
> good anti-feature to have (a goal to explicitly not add this feature) to
> reduce complexity. Unless you see a specific usecase to add support for
> it?

There are passthrough devices (GPUs) that might require some extra
mediation on dom0 (like the Intel GVT-g thing) that would force the
usage of ioreqs to passthrough.

It's important that the interfaces we introduce are correct IMO,
because that ends up reflecting on the configuration options that we
expose to users on xl/libxl.  While both XEN_DOMCTL_createdomain and
XEN_DOMCTL_assign_device are unstable interfaces, how the vPCI option
gets placed there will ultimately influence how the option gets
exposed in xl/libxl, and the interface there is relevant to keep
stable for end user sanity.

Thanks, Roger.
Stewart Hildebrand Dec. 4, 2023, 7:09 p.m. UTC | #11
On 12/4/23 05:58, Roger Pau Monné wrote:
> On Fri, Dec 01, 2023 at 06:56:32PM -0800, Stefano Stabellini wrote:
>> On Fri, 1 Dec 2023, Roger Pau Monné wrote:
>>> On Mon, Nov 13, 2023 at 05:21:13PM -0500, Stewart Hildebrand wrote:
>>>> @@ -1618,6 +1630,14 @@ int iommu_do_pci_domctl(
>>>>          bus = PCI_BUS(machine_sbdf);
>>>>          devfn = PCI_DEVFN(machine_sbdf);
>>>>  
>>>> +        if ( needs_vpci(d) && !has_vpci(d) )
>>>> +        {
>>>> +            printk(XENLOG_G_WARNING "Cannot assign %pp to %pd: vPCI support not enabled\n",
>>>> +                   &PCI_SBDF(seg, bus, devfn), d);
>>>> +            ret = -EPERM;
>>>> +            break;
>>>
>>> I think this is likely too restrictive going forward.  The current
>>> approach is indeed to enable vPCI on a per-domain basis because that's
>>> how PVH dom0 uses it, due to being unable to use ioreq servers.
>>>
>>> If we start to expose vPCI support to guests the interface should be on
>>> a per-device basis, so that vPCI could be enabled for some devices,
>>> while others could still be handled by ioreq servers.
>>>
>>> We might want to add a new flag to xen_domctl_assign_device (used by
>>> XEN_DOMCTL_assign_device) in order to signal whether the device will
>>> use vPCI.
>>
>> Actually I don't think this is a good idea. I am all for flexibility but
>> supporting multiple different configurations comes at an extra cost for
>> both maintainers and contributors. I think we should try to reduce the
>> amount of configurations we support rather than increasing them
>> (especially on x86 where we have PV, PVH, HVM).
> 
> I think it's perfectly fine to initially require a domain to have all
> its devices either passed through using vPCI or ioreqs, but the
> interface IMO should allow for such differentiation in the future.
> That's why I think introducing a domain wide vPCI flag might not be
> the best option going forward.
> 
> It would be perfectly fine for XEN_DOMCTL_assign_device to set a
> domain wide vPCI flag, iow:
> 
> if ( HYPERCALL_VPCI_FLAG_SET && !has_vpci(d) )
> {
>     if ( has_arch_pdevs(d) )
>     {
>         printk("All passthrough devices must use the same backend\n");
>         return -EINVAL;
>     }
> 
>     /* Set vPCI domain flag */

There is a vPCI initializer that would need to be called too (on Arm, to set up mmio handlers). It is normally called earlier during arch_domain_create(), but it looks to be okay to call here as long as it is done before the guest boots.

    d->options |= XEN_DOMCTL_CDF_vpci;
    domain_vpci_init(d);

Perhaps the flag should be set inside domain_vpci_init(d).
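
Something along these lines in the assign path, perhaps (a sketch only; it assumes domain_vpci_init() keeps returning an error code, which would then need handling, e.g. by clearing the flag again on failure):

    d->options |= XEN_DOMCTL_CDF_vpci;
    ret = domain_vpci_init(d);
    if ( ret )
    {
        d->options &= ~XEN_DOMCTL_CDF_vpci;
        break;
    }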

> }
> 
> We have already agreed that we want to aim for a setup where ioreqs
> and vPCI could be used for the same domain, but I guess you assumed
> that ioreqs would be used for emulated devices exclusively and vPCI
> for passthrough devices?
> 
> Overall if we agree that ioreqs and vPCI should co-exist for a domain,
> I'm not sure there's much reason to limit ioreqs to only handle
> emulated devices, seems like an arbitrary limitation.
> 
>> I don't think we should enable IOREQ servers to handle PCI passthrough
>> for PVH guests and/or guests with vPCI. If the domain has vPCI, PCI
>> Passthrough can be handled by vPCI just fine. I think this should be a
>> good anti-feature to have (a goal to explicitly not add this feature) to
>> reduce complexity. Unless you see a specific usecase to add support for
>> it?
> 
> There are passthrough devices (GPUs) that might require some extra
> mediation on dom0 (like the Intel GVT-g thing) that would force the
> usage of ioreqs to passthrough.
> 
> It's important that the interfaces we introduce are correct IMO,
> because that ends up reflecting on the configuration options that we
> expose to users on xl/libxl.  While both XEN_DOMCTL_createdomain and
> XEN_DOMCTL_assign_device are unstable interfaces, how the vPCI option
> gets placed there will ultimately influence how the option gets
> exposed in xl/libxl, and the interface there is relevant to keep
> stable for end user sanity.
> 
> Thanks, Roger.
Stefano Stabellini Dec. 4, 2023, 10:07 p.m. UTC | #12
On Mon, 4 Dec 2023, Roger Pau Monné wrote:
> On Fri, Dec 01, 2023 at 06:56:32PM -0800, Stefano Stabellini wrote:
> > On Fri, 1 Dec 2023, Roger Pau Monné wrote:
> > > On Mon, Nov 13, 2023 at 05:21:13PM -0500, Stewart Hildebrand wrote:
> > > > @@ -1618,6 +1630,14 @@ int iommu_do_pci_domctl(
> > > >          bus = PCI_BUS(machine_sbdf);
> > > >          devfn = PCI_DEVFN(machine_sbdf);
> > > >  
> > > > +        if ( needs_vpci(d) && !has_vpci(d) )
> > > > +        {
> > > > +            printk(XENLOG_G_WARNING "Cannot assign %pp to %pd: vPCI support not enabled\n",
> > > > +                   &PCI_SBDF(seg, bus, devfn), d);
> > > > +            ret = -EPERM;
> > > > +            break;
> > > 
> > > I think this is likely too restrictive going forward.  The current
> > > approach is indeed to enable vPCI on a per-domain basis because that's
> > > how PVH dom0 uses it, due to being unable to use ioreq servers.
> > > 
> > > If we start to expose vPCI support to guests the interface should be on
> > > a per-device basis, so that vPCI could be enabled for some devices,
> > > while others could still be handled by ioreq servers.
> > > 
> > > We might want to add a new flag to xen_domctl_assign_device (used by
> > > XEN_DOMCTL_assign_device) in order to signal whether the device will
> > > use vPCI.
> > 
> > Actually I don't think this is a good idea. I am all for flexibility but
> > supporting multiple different configurations comes at an extra cost for
> > both maintainers and contributors. I think we should try to reduce the
> > amount of configurations we support rather than increasing them
> > (especially on x86 where we have PV, PVH, HVM).
> 
> I think it's perfectly fine to initially require a domain to have all
> its devices either passed through using vPCI or ioreqs, but the
> interface IMO should allow for such differentiation in the future.
> That's why I think introducing a domain wide vPCI flag might not be
> the best option going forward.
> 
> It would be perfectly fine for XEN_DOMCTL_assign_device to set a
> domain wide vPCI flag, iow:
> 
> if ( HYPERCALL_VPCI_FLAG_SET && !has_vpci(d) )
> {
>     if ( has_arch_pdevs(d) )
>     {
>         printk("All passthrough devices must use the same backend\n");
>         return -EINVAL;
>     }
> 
>     /* Set vPCI domain flag */
> }

That would be fine by me, but maybe we can avoid this change too. I was
imagining that vPCI would be enabled at domain creation, not at runtime.
And that vPCI would be enabled by default for all PVH guests (once we
are past the initial experimental phase.)


> We have already agreed that we want to aim for a setup where ioreqs
> and vPCI could be used for the same domain, but I guess you assumed
> that ioreqs would be used for emulated devices exclusively and vPCI
> for passthrough devices?

Yes, that's right


> Overall if we agree that ioreqs and vPCI should co-exist for a domain,
> I'm not sure there's much reason to limit ioreqs to only handle
> emulated devices, seems like an arbitrary limitation.

Reply below


> > I don't think we should enable IOREQ servers to handle PCI passthrough
> > for PVH guests and/or guests with vPCI. If the domain has vPCI, PCI
> > Passthrough can be handled by vPCI just fine. I think this should be a
> > good anti-feature to have (a goal to explicitly not add this feature) to
> > reduce complexity. Unless you see a specific usecase to add support for
> > it?
> 
> There are passthrough devices (GPUs) that might require some extra
> mediation on dom0 (like the Intel GVT-g thing) that would force the
> usage of ioreqs to passthrough.

From an architectural perspective, I think it would be cleaner, simpler
to maintain, and simpler to understand if Xen was the sole owner of the
PCI Root Complex and PCI config space mediation (implemented with vPCI).
IOREQ can be used for emulation and it works very well for that. At
least in my mind, that makes things much simpler.

I understand there are non-trivial cases, like virtual GPUs with
hardware access, but I don't classify those as passthrough. That's
because there isn't one device that gets fully assigned to the guest.
Instead, there is an emulated device (hence IOREQ) with certain MMIO
regions and interrupts that end up being directly mapped from real
hardware.

So I think it is natural in those cases to use IOREQ and it is also
natural to have QEMU remap MMIO/IRQs at runtime. From a vPCI
perspective, I hope it will mostly look as if the device is assigned to
Dom0. Even if it ends up being more complex than that, Rome wasn't
built in one day, and I don't think we should try to solve this problem
on day1 (as long as the interfaces are not stable interfaces).


> It's important that the interfaces we introduce are correct IMO,
> because that ends up reflecting on the configuration options that we
> expose to users on xl/libxl.  While both XEN_DOMCTL_createdomain and
> XEN_DOMCTL_assign_device are unstable interfaces, how the vPCI option
> gets placed there will ultimately influence how the option gets
> exposed in xl/libxl, and the interface there is relevant to keep
> stable for end user sanity.

I agree with you on the stable interfaces. The important part is not to
introduce changes to stable interfaces that could limit us in the
future. Specifically that includes xl and libxl, we need to be careful
there. But I don't see a single per-domain vPCI enable/disable option as
a problem. Let's say that in the future we have a mediated vGPU
implementation: if it works together with vPCI then the per-domain vPCI
option in libxl will be enabled (either explicitly or by default), if
it doesn't then vPCI will be disabled (either explicitly or by the
newer vGPU options.)

For *unstable* interfaces (XEN_DOMCTL_assign_device) I would rather wait
before adding more changes on top of them, not because I don't care
about the mediated GPU problem (we do have something similar at AMD),
but because I worry that if we try to change them now we might not do a
good enough job. I would prefer to wait until we know more about the
actual use case, ideally with code supporting it.

I think the difference in points of view comes from the fact that I see
vPCI as the default, QEMU only as a limited peripheral emulator (or
mediator for the vGPU case) but not in control. vPCI and QEMU are not
equal in my view. vPCI is in charge and always present if not in very
uncommon setups (even if we decide to hook it inside Xen by using
internal IOREQ interfaces). QEMU might come and go.

Now that I am writing this, I realize this is also why I wasn't too
happy with the idea of hooking vPCI using IOREQ. It makes them look as
if they are the same, while I don't they should be considered at the
same level of priority, criticality, safety, integration in the system,
etc.
Roger Pau Monne Dec. 5, 2023, 11:08 a.m. UTC | #13
On Mon, Dec 04, 2023 at 02:07:51PM -0800, Stefano Stabellini wrote:
> On Mon, 4 Dec 2023, Roger Pau Monné wrote:
> > On Fri, Dec 01, 2023 at 06:56:32PM -0800, Stefano Stabellini wrote:
> > > On Fri, 1 Dec 2023, Roger Pau Monné wrote:
> > > > On Mon, Nov 13, 2023 at 05:21:13PM -0500, Stewart Hildebrand wrote:
> > > > > @@ -1618,6 +1630,14 @@ int iommu_do_pci_domctl(
> > > > >          bus = PCI_BUS(machine_sbdf);
> > > > >          devfn = PCI_DEVFN(machine_sbdf);
> > > > >  
> > > > > +        if ( needs_vpci(d) && !has_vpci(d) )
> > > > > +        {
> > > > > +            printk(XENLOG_G_WARNING "Cannot assign %pp to %pd: vPCI support not enabled\n",
> > > > > +                   &PCI_SBDF(seg, bus, devfn), d);
> > > > > +            ret = -EPERM;
> > > > > +            break;
> > > > 
> > > > I think this is likely too restrictive going forward.  The current
> > > > approach is indeed to enable vPCI on a per-domain basis because that's
> > > > how PVH dom0 uses it, due to being unable to use ioreq servers.
> > > > 
> > > > If we start to expose vPCI support to guests the interface should be on
> > > > a per-device basis, so that vPCI could be enabled for some devices,
> > > > while others could still be handled by ioreq servers.
> > > > 
> > > > We might want to add a new flag to xen_domctl_assign_device (used by
> > > > XEN_DOMCTL_assign_device) in order to signal whether the device will
> > > > use vPCI.
> > > 
> > > Actually I don't think this is a good idea. I am all for flexibility but
> > > supporting multiple different configurations comes at an extra cost for
> > > both maintainers and contributors. I think we should try to reduce the
> > > amount of configurations we support rather than increasing them
> > > (especially on x86 where we have PV, PVH, HVM).
> > 
> > I think it's perfectly fine to initially require a domain to have all
> > its devices either passed through using vPCI or ioreqs, but the
> > interface IMO should allow for such differentiation in the future.
> > That's why I think introducing a domain wide vPCI flag might not be
> > the best option going forward.
> > 
> > It would be perfectly fine for XEN_DOMCTL_assign_device to set a
> > domain wide vPCI flag, iow:
> > 
> > if ( HYPERCALL_VPCI_FLAG_SET && !has_vpci(d) )
> > {
> >     if ( has_arch_pdevs(d) )
> >     {
> >         printk("All passthrough devices must use the same backend\n");
> >         return -EINVAL;
> >     }
> > 
> >     /* Set vPCI domain flag */
> > }
> 
> That would be fine by me, but maybe we can avoid this change too. I was
> imagining that vPCI would be enabled at domain creation, not at runtime.
> And that vPCI would be enabled by default for all PVH guests (once we
> are past the initial experimental phase.)

Then we don't even need a new CDF flag, and just enable vPCI when
IOMMU is enabled?  IOW: we can key the enabling of vPCI to
XEN_DOMCTL_CDF_iommu for specific domain types?

Maybe that's not so trivial on x86, as there's no x86 PVH domain type
from the hypervisor PoV.

> 
> > We have already agreed that we want to aim for a setup where ioreqs
> > and vPCI could be used for the same domain, but I guess you assumed
> > that ioreqs would be used for emulated devices exclusively and vPCI
> > for passthrough devices?
> 
> Yes, that's right
> 
> 
> > Overall if we agree that ioreqs and vPCI should co-exist for a domain,
> > I'm not sure there's much reason to limit ioreqs to only handle
> > emulated devices, seems like an arbitrary limitation.
> 
> Reply below
> 
> 
> > > I don't think we should enable IOREQ servers to handle PCI passthrough
> > > for PVH guests and/or guests with vPCI. If the domain has vPCI, PCI
> > > Passthrough can be handled by vPCI just fine. I think this should be a
> > > good anti-feature to have (a goal to explicitly not add this feature) to
> > > reduce complexity. Unless you see a specific usecase to add support for
> > > it?
> > 
> > There are passthrough devices (GPUs) that might require some extra
> > mediation on dom0 (like the Intel GVT-g thing) that would force the
> > usage of ioreqs to passthrough.
> 
> From an architectural perspective, I think it would be cleaner, simpler
> to maintain, and simpler to understand if Xen was the sole owner of the
> PCI Root Complex and PCI config space mediation (implemented with vPCI).
> IOREQ can be used for emulation and it works very well for that. At
> least in my mind, that makes things much simpler.

But IOREQ already has all the code to mediate accesses to the PCI
config space, and the interface to register separate servers for
different PCI devices.

We would then need to duplicate this internally for vPCI, so that vPCI
could forward accesses to IOREQ just for IOREQ to forward to yet a
different component?  Seems like a lot of duplication for no benefit.

> I understand there are non-trivial cases, like virtual GPUs with
> hardware access, but I don't classify those as passthrough. That's
> because there isn't one device that gets fully assigned to the guest.
> Instead, there is an emulated device (hence IOREQ) with certain MMIO
> regions and interrupts that end up being directly mapped from real
> hardware.
> 
> So I think it is natural in those cases to use IOREQ and it is also
> natural to have QEMU remap MMIO/IRQs at runtime. From a vPCI
> perspective, I hope it will mostly look as if the device is assigned to
> Dom0. Even if it ends up being more complex than that, Rome wasn't
> built in one day, and I don't think we should try to solve this problem
> on day1 (as long as the interfaces are not stable interfaces).

I don't see IOREQ as dealing explicitly with emulation.  Yes, it does
allow for emulators to be implemented in user-space, but at the end
it's just an interface that allows forwarding accesses to certain
resources (for the case we are speaking about here, PCI config space)
to entities that registered as handlers.

vPCI OTOH just deals with a very specific resource (PCI config space)
and only allows internal handlers to be registered on a byte
granularity.

So your proposal would be to implement a hierarchy like the one on the
diagram below:

    ┌────────┐ ┌──────────┐ ┌──────────────────┐
    │ Memory │ │ IO Ports │ │ PCI config space │
    └───────┬┘ └┬─────────┘ └───┬──────────────┘
            │   │               │
            │   │           ┌───┴──┐
            │   │           │ vPCI │
            │   │           └─┬──┬─┘
         ┌──┴───┴┐            │  │
         │ IOREQ ├────────────┘  │
         └────┬──┘               │
              │                  │
 ┌────────────┴──┐              ┌┴──────────────┐
 │ IOREQ servers │              │ vPCI handlers │
 └───────────────┘              └───────────────┘

While what I'm proposing would look like:

    ┌────────┐ ┌──────────┐ ┌──────────────────┐
    │ Memory │ │ IO Ports │ │ PCI config space │
    └────┬───┘ └────┬─────┘ └────────┬─────────┘
         │          │                │
         └─────┬────┴────┬───────────┘
               │  IOREQ  │
               └─┬─────┬─┘
                 │     │
 ┌───────────────┤     └────┬──────┐
 │ IOREQ servers │          │ vPCI │
 └───────────────┘          └───┬──┘
                                │
                            ┌───┴───────────┐
                            │ vPCI handlers │
                            └───────────────┘

I'm obviously biased, but I think the latter is cleaner, and allows
all resources to be arbitrated by the same component (IOREQ).

If the concern is about the IOREQ hypercall interface, it would be
fine to introduce an option that limits IOREQs to internal users
(vPCI) without supporting external IOREQ servers.

Think of IOREQ as a resource mediator inside of Xen, that just does
the PCI address decoding and forwards the access to the interested
party, either an external IOREQ server or vPCI.

> 
> > It's important that the interfaces we introduce are correct IMO,
> > because that ends up reflecting on the configuration options that we
> > expose to users on xl/libxl.  While both XEN_DOMCTL_createdomain and
> > XEN_DOMCTL_assign_device are unstable interfaces, how the vPCI option
> > gets placed there will ultimately influence how the option gets
> > exposed in xl/libxl, and the interface there is relevant to keep
> > stable for end user sanity.
> 
> I agree with you on the stable interfaces. The important part is not to
> introduce changes to stable interfaces that could limit us in the
> future. Specifically that includes xl and libxl, we need to be careful
> there. But I don't see a single per-domain vPCI enable/disable option as
> a problem. Let's say that in the future we have a mediated vGPU
> implementation: if it works together with vPCI then the per-domain vPCI
> option in libxl will be enabled (either explicitely or by default), if
> it doesn't then vPCI will be disabled (either explicitely or by the
> newer vGPU options.)

If vPCI is hooked into IOREQ there won't be a need anymore to register
the vPCI config space traps, as that would be done by IOREQ, and hence
vPCI managed devices could be registered at runtime with IOREQ.  IOW:
there won't be a need anymore to signal at domain creation whether
vPCI is intended to be used or not.

We would obviously need to enable IOREQ for all domains with IOMMU
enabled, as it would be IOREQ that register the PCI config space
handlers.

> For *unstable* interfaces (XEN_DOMCTL_assign_device) I would rather wait
> before adding more changes on top of them, not because I don't care
> about the mediated GPU problem (we do have something similar at AMD),
> but because I worry that if we try to change them now we might not do a
> good enough job. I would prefer to wait until we know more about the
> actual use case, ideally with code supporting it.
> 
> I think the difference in points of view comes from the fact that I see
> vPCI as the default, QEMU only as a limited peripheral emulator (or
> mediator for the vGPU case) but not in control. vPCI and QEMU are not
> equal in my view. vPCI is in charge and always present if not in very
> uncommon setups (even if we decide to hook it inside Xen by using
> internal IOREQ interfaces). QEMU might come and go.

Xen needs a single component that mediates accesses to resources,
whether that's IOREQ, or something else I don't really care that much.
Having vPCI mediate accesses to the PCI config space, and IOREQ to the
memory (and on x86 IO port) space just seems awfully complicated for
AFAICT no real benefit.

Also, you seem to conflate IOREQ with QEMU, while the latter is
indeed an user of IOREQ, I do see IOREQ as a simple resource mediator
inside of Xen, that has the ability to forward such accesses to
external emulators using an hypercall interface.

> Now that I am writing this, I realize this is also why I wasn't too
> happy with the idea of hooking vPCI using IOREQ. It makes them look as
> if they are the same, while I don't think they should be considered at the
> same level of priority, criticality, safety, integration in the system,
> etc.

I feel there are some fears with IOREQ from a safety PoV?  The code
that does the resource multiplexing is small, and as said above if
there are safety concerns with the hypercall interface it would be
fine to limit its usage to internal handlers only.

Thanks, Roger.
Stewart Hildebrand Dec. 5, 2023, 4:27 p.m. UTC | #14
On 12/5/23 06:08, Roger Pau Monné wrote:
> On Mon, Dec 04, 2023 at 02:07:51PM -0800, Stefano Stabellini wrote:
>> On Mon, 4 Dec 2023, Roger Pau Monné wrote:
>>> On Fri, Dec 01, 2023 at 06:56:32PM -0800, Stefano Stabellini wrote:
>>>> On Fri, 1 Dec 2023, Roger Pau Monné wrote:
>>>>> On Mon, Nov 13, 2023 at 05:21:13PM -0500, Stewart Hildebrand wrote:
>>>>>> @@ -1618,6 +1630,14 @@ int iommu_do_pci_domctl(
>>>>>>          bus = PCI_BUS(machine_sbdf);
>>>>>>          devfn = PCI_DEVFN(machine_sbdf);
>>>>>>  
>>>>>> +        if ( needs_vpci(d) && !has_vpci(d) )
>>>>>> +        {
>>>>>> +            printk(XENLOG_G_WARNING "Cannot assign %pp to %pd: vPCI support not enabled\n",
>>>>>> +                   &PCI_SBDF(seg, bus, devfn), d);
>>>>>> +            ret = -EPERM;
>>>>>> +            break;
>>>>>
>>>>> I think this is likely too restrictive going forward.  The current
>>>>> approach is indeed to enable vPCI on a per-domain basis because that's
>>>>> how PVH dom0 uses it, due to being unable to use ioreq servers.
>>>>>
>>>>> If we start to expose vPCI support to guests the interface should be on
>>>>> a per-device basis, so that vPCI could be enabled for some devices,
>>>>> while others could still be handled by ioreq servers.
>>>>>
>>>>> We might want to add a new flag to xen_domctl_assign_device (used by
>>>>> XEN_DOMCTL_assign_device) in order to signal whether the device will
>>>>> use vPCI.
>>>>
>>>> Actually I don't think this is a good idea. I am all for flexibility but
>>>> supporting multiple different configurations comes at an extra cost for
>>>> both maintainers and contributors. I think we should try to reduce the
>>>> amount of configurations we support rather than increasing them
>>>> (especially on x86 where we have PV, PVH, HVM).
>>>
>>> I think it's perfectly fine to initially require a domain to have all
>>> its devices either passed through using vPCI or ioreqs, but the
>>> interface IMO should allow for such differentiation in the future.
>>> That's why I think introducing a domain wide vPCI flag might not be
>>> the best option going forward.
>>>
>>> It would be perfectly fine for XEN_DOMCTL_assign_device to set a
>>> domain wide vPCI flag, iow:
>>>
>>> if ( HYPERCALL_VPCI_FLAG_SET && !has_vpci(d) )
>>> {
>>>     if ( has_arch_pdevs(d) )
>>>     {
>>>         printk("All passthrough devices must use the same backend\n");
>>>         return -EINVAL;
>>>     }
>>>
>>>     /* Set vPCI domain flag */
>>> }
>>
>> That would be fine by me, but maybe we can avoid this change too. I was
>> imagining that vPCI would be enabled at domain creation, not at runtime.
>> And that vPCI would be enabled by default for all PVH guests (once we
>> are past the initial experimental phase.)
> 
> Then we don't even need a new CDF flag, and just enable vPCI when
> IOMMU is enabled?  IOW: we can key the enabling of vPCI to
> XEN_DOMCTL_CDF_iommu for specific domain types?

There are many Arm based platforms that need to use iommu but don't have (or don't use) PCI, so we'd still like to have a separate vPCI flag.

> 
> Maybe that's not so trivial on x86, as there's no x86 PVH domain type
> from the hypervisor PoV.
> 
>>
>>> We have already agreed that we want to aim for a setup where ioreqs
>>> and vPCI could be used for the same domain, but I guess you assumed
>>> that ioreqs would be used for emulated devices exclusively and vPCI
>>> for passthrough devices?
>>
>> Yes, that's right
>>
>>
>>> Overall if we agree that ioreqs and vPCI should co-exist for a domain,
>>> I'm not sure there's much reason to limit ioreqs to only handle
>>> emulated devices, seems like an arbitrary limitation.
>>
>> Reply below
>>
>>
>>>> I don't think we should enable IOREQ servers to handle PCI passthrough
>>>> for PVH guests and/or guests with vPCI. If the domain has vPCI, PCI
>>>> Passthrough can be handled by vPCI just fine. I think this should be a
>>>> good anti-feature to have (a goal to explicitly not add this feature) to
>>>> reduce complexity. Unless you see a specific usecase to add support for
>>>> it?
>>>
>>> There are passthrough devices (GPUs) that might require some extra
>>> mediation on dom0 (like the Intel GVT-g thing) that would force the
>>> usage of ioreqs to passthrough.
>>
>> From an architectural perspective, I think it would be cleaner, simpler
>> to maintain, and simpler to understand if Xen was the sole owner of the
>> PCI Root Complex and PCI config space mediation (implemented with vPCI).
>> IOREQ can be used for emulation and it works very well for that. At
>> least in my mind, that makes things much simpler.
> 
> But IOREQ already has all the code to mediate accesses to the PCI
> config space, and the interface to register separate servers for
> different PCI devices.
> 
> We would then need to duplicate this internally for vPCI, so that vPCI
> could forward accesses to IOREQ just for IOREQ to forward to yet a
> different component?  Seems like a lot of duplication for no benefit.
> 
>> I understand there are non-trivial cases, like virtual GPUs with
>> hardware access, but I don't classify those as passthrough. That's
>> because there isn't one device that gets fully assigned to the guest.
>> Instead, there is an emulated device (hence IOREQ) with certain MMIO
>> regions and interrupts that end up being directly mapped from real
>> hardware.
>>
>> So I think it is natural in those cases to use IOREQ and it is also
>> natural to have QEMU remap MMIO/IRQs at runtime. From a vPCI
>> perspective, I hope it will mostly look as if the device is assigned to
>> Dom0. Even if it ends up being more complex than that, Rome wasn't
>> built in one day, and I don't think we should try to solve this problem
>> on day1 (as long as the interfaces are not stable interfaces).
> 
> I don't see IOREQ as dealing explicitly with emulation.  Yes, it does
> allow for emulators to be implemented in user-space, but at the end
> it's just an interface that allows forwarding accesses to certain
> resources (for the case we are speaking about here, PCI config space)
> to entities that registered as handlers.
> 
> vPCI OTOH just deals with a very specific resource (PCI config space)
> and only allows internal handlers to be registered on a byte
> granularity.
> 
> So your proposal would be to implement a hierarchy like the one on the
> diagram below:
> 
>     ┌────────┐ ┌──────────┐ ┌──────────────────┐
>     │ Memory │ │ IO Ports │ │ PCI config space │
>     └───────┬┘ └┬─────────┘ └───┬──────────────┘
>             │   │               │
>             │   │           ┌───┴──┐
>             │   │           │ vPCI │
>             │   │           └─┬──┬─┘
>          ┌──┴───┴┐            │  │
>          │ IOREQ ├────────────┘  │
>          └────┬──┘               │
>               │                  │
>  ┌────────────┴──┐              ┌┴──────────────┐
>  │ IOREQ servers │              │ vPCI handlers │
>  └───────────────┘              └───────────────┘
> 
> While what I'm proposing would look like:
> 
>     ┌────────┐ ┌──────────┐ ┌──────────────────┐
>     │ Memory │ │ IO Ports │ │ PCI config space │
>     └────┬───┘ └────┬─────┘ └────────┬─────────┘
>          │          │                │
>          └─────┬────┴────┬───────────┘
>                │  IOREQ  │
>                └─┬─────┬─┘
>                  │     │
>  ┌───────────────┤     └────┬──────┐
>  │ IOREQ servers │          │ vPCI │
>  └───────────────┘          └───┬──┘
>                                 │
>                             ┌───┴───────────┐
>                             │ vPCI handlers │
>                             └───────────────┘
> 
> I'm obviously biased, but I think the latter is cleaner, and allows
> all resources to be arbitrated by the same component (IOREQ).
> 
> If the concern is about the IOREQ hypercall interface, it would be
> fine to introduce an option that limits IOREQs to internal users
> (vPCI) without supporting external IOREQ servers.
> 
> Think of IOREQ as a resource mediator inside of Xen, that just does
> the PCI address decoding and forwards the access to the interested
> party, either an external IOREQ server or vPCI.
> 
>>
>>> It's important that the interfaces we introduce are correct IMO,
>>> because that ends up reflecting on the configuration options that we
>>> expose to users on xl/libxl.  While both XEN_DOMCTL_createdomain and
>>> XEN_DOMCTL_assign_device are unstable interfaces, how the vPCI option
>>> gets placed there will ultimately influence how the option gets
>>> exposed in xl/libxl, and the interface there is relevant to keep
>>> stable for end user sanity.
>>
>> I agree with you on the stable interfaces. The important part is not to
>> introduce changes to stable interfaces that could limit us in the
>> future. Specifically that includes xl and libxl, we need to be careful
>> there. But I don't see a single per-domain vPCI enable/disable option as
>> a problem. Let's say that in the future we have a mediated vGPU
>> implementation: if it works together with vPCI then the per-domain vPCI
>> option in libxl will be enabled (either explicitly or by default), if
>> it doesn't then vPCI will be disabled (either explicitly or by the
>> newer vGPU options.)
> 
> If vPCI is hooked into IOREQ there won't be a need anymore to register
> the vPCI config space traps, as that would be done by IOREQ, and hence
> vPCI managed devices could be registered at runtime with IOREQ.  IOW:
> there won't be a need anymore to signal at domain creation whether
> vPCI is intended to be used or not.
> 
> We would obviously need to enable IOREQ for all domains with IOMMU
> enabled, as it would be IOREQ that register the PCI config space
> handlers.
> 
>> For *unstable* interfaces (XEN_DOMCTL_assign_device) I would rather wait
>> before adding more changes on top of them, not because I don't care
>> about the mediated GPU problem (we do have something similar at AMD),
>> but because I worry that if we try to change them now we might not do a
>> good enough job. I would prefer to wait until we know more about the
>> actual use case, ideally with code supporting it.
>>
>> I think the difference in points of view comes from the fact that I see
>> vPCI as the default, QEMU only as a limited peripheral emulator (or
>> mediator for the vGPU case) but not in control. vPCI and QEMU are not
>> equal in my view. vPCI is in charge and always present if not in very
>> uncommon setups (even if we decide to hook it inside Xen by using
>> internal IOREQ interfaces). QEMU might come and go.
> 
> Xen needs a single component that mediates accesses to resources,
> whether that's IOREQ, or something else I don't really care that much.
> Having vPCI mediate accesses to the PCI config space, and IOREQ to the
> memory (and on x86 IO port) space just seems awfully complicated for
> AFAICT no real benefit.
> 
> Also, you seem to conflate IOREQ with QEMU, while the latter is
> indeed an user of IOREQ, I do see IOREQ as a simple resource mediator
> inside of Xen, that has the ability to forward such accesses to
> external emulators using an hypercall interface.
> 
>> Now that I am writing this, I realize this is also why I wasn't too
>> happy with the idea of hooking vPCI using IOREQ. It makes them look as
>> if they are the same, while I don't think they should be considered at the
>> same level of priority, criticality, safety, integration in the system,
>> etc.
> 
> I feel there are some fears with IOREQ from a safety PoV?  The code
> that does the resource multiplexing is small, and as said above if
> there are safety concerns with the hypercall interface it would be
> fine to limit it's usage to internal handlers only.
> 
> Thanks, Roger.
Roger Pau Monne Dec. 5, 2023, 5:09 p.m. UTC | #15
On Tue, Dec 05, 2023 at 11:27:03AM -0500, Stewart Hildebrand wrote:
> On 12/5/23 06:08, Roger Pau Monné wrote:
> > On Mon, Dec 04, 2023 at 02:07:51PM -0800, Stefano Stabellini wrote:
> >> On Mon, 4 Dec 2023, Roger Pau Monné wrote:
> >>> On Fri, Dec 01, 2023 at 06:56:32PM -0800, Stefano Stabellini wrote:
> >>>> On Fri, 1 Dec 2023, Roger Pau Monné wrote:
> >>>>> On Mon, Nov 13, 2023 at 05:21:13PM -0500, Stewart Hildebrand wrote:
> >>>>>> @@ -1618,6 +1630,14 @@ int iommu_do_pci_domctl(
> >>>>>>          bus = PCI_BUS(machine_sbdf);
> >>>>>>          devfn = PCI_DEVFN(machine_sbdf);
> >>>>>>  
> >>>>>> +        if ( needs_vpci(d) && !has_vpci(d) )
> >>>>>> +        {
> >>>>>> +            printk(XENLOG_G_WARNING "Cannot assign %pp to %pd: vPCI support not enabled\n",
> >>>>>> +                   &PCI_SBDF(seg, bus, devfn), d);
> >>>>>> +            ret = -EPERM;
> >>>>>> +            break;
> >>>>>
> >>>>> I think this is likely too restrictive going forward.  The current
> >>>>> approach is indeed to enable vPCI on a per-domain basis because that's
> >>>>> how PVH dom0 uses it, due to being unable to use ioreq servers.
> >>>>>
> >>>>> If we start to expose vPCI suport to guests the interface should be on
> >>>>> a per-device basis, so that vPCI could be enabled for some devices,
> >>>>> while others could still be handled by ioreq servers.
> >>>>>
> >>>>> We might want to add a new flag to xen_domctl_assign_device (used by
> >>>>> XEN_DOMCTL_assign_device) in order to signal whether the device will
> >>>>> use vPCI.
> >>>>
> >>>> Actually I don't think this is a good idea. I am all for flexibility but
> >>>> supporting multiple different configurations comes at an extra cost for
> >>>> both maintainers and contributors. I think we should try to reduce the
> >>>> amount of configurations we support rather than increasing them
> >>>> (especially on x86 where we have PV, PVH, HVM).
> >>>
> >>> I think it's perfectly fine to initially require a domain to have all
> >>> its devices either passed through using vPCI or ireqs, but the
> >>> interface IMO should allow for such differentiation in the future.
> >>> That's why I think introducing a domain wide vPCI flag might not be
> >>> the best option going forward.
> >>>
> >>> It would be perfectly fine for XEN_DOMCTL_assign_device to set a
> >>> domain wide vPCI flag, iow:
> >>>
> >>> if ( HYPERCALL_VPCI_FLAG_SET && !has_vpci(d) )
> >>> {
> >>>     if ( has_arch_pdevs(d) )
> >>>     {
> >>>         printk("All passthrough devices must use the same backend\n");
> >>>         return -EINVAL;
> >>>     }
> >>>
> >>>     /* Set vPCI domain flag */
> >>> }
> >>
> >> That would be fine by me, but maybe we can avoid this change too. I was
> >> imagining that vPCI would be enabled at domain creation, not at runtime.
> >> And that vPCI would be enabled by default for all PVH guests (once we
> >> are past the initial experimental phase.)
> > 
> > Then we don't even need a new CDF flag, and just enable vPCI when
> > IOMMU is enabled?  IOW: we can key the enabling of vPCI to
> > XEN_DOMCTL_CDF_iommu for specific domain types?
> 
> There are many Arm based platforms that need to use iommu but don't have (or don't use) PCI, so we'd still like to have a separate vPCI flag.

OK, read below though - if we switch to vPCI being a descendant of
IOREQ (so that the PCI config space decoding is done by IOREQ) we
could hotplug vPCI-managed devices at runtime without requiring any
prior initialization at domain creation, since the traps to the PCI
config space would be set up by IOREQ.

We might need a PCI flag in order to signal whether the domain is
intended to use PCI devices or not, and so whether IOREQ needs to
set up PCI config space traps (either fully emulated or passthrough
devices).  But that would be arch-specific AFAICT, as on x86 we
always trap accesses to the PCI IO ports.
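
To make that concrete, here is a minimal sketch of what such an
arch-specific check could look like.  This is purely illustrative:
XEN_DOMCTL_CDF_vpci and the helper name below do not exist in the
tree today.

/* Sketch only: a per-arch predicate that common code could use to
 * decide whether PCI config space traps have to be set up for a
 * domain.  XEN_DOMCTL_CDF_vpci is a hypothetical creation flag. */
static inline bool arch_domain_wants_pci_traps(const struct domain *d)
{
#ifdef CONFIG_X86
    /* x86 always traps the 0xcf8/0xcfc config IO ports for HVM/PVH. */
    return is_hvm_domain(d);
#else
    /* On Arm, only domains created with the hypothetical PCI flag. */
    return d->options & XEN_DOMCTL_CDF_vpci;
#endif
}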

Thanks, Roger.
Stewart Hildebrand Dec. 5, 2023, 5:36 p.m. UTC | #16
On 12/5/23 12:09, Roger Pau Monné wrote:
> On Tue, Dec 05, 2023 at 11:27:03AM -0500, Stewart Hildebrand wrote:
>> On 12/5/23 06:08, Roger Pau Monné wrote:
>>> On Mon, Dec 04, 2023 at 02:07:51PM -0800, Stefano Stabellini wrote:
>>>> On Mon, 4 Dec 2023, Roger Pau Monné wrote:
>>>>> On Fri, Dec 01, 2023 at 06:56:32PM -0800, Stefano Stabellini wrote:
>>>>>> On Fri, 1 Dec 2023, Roger Pau Monné wrote:
>>>>>>> On Mon, Nov 13, 2023 at 05:21:13PM -0500, Stewart Hildebrand wrote:
>>>>>>>> @@ -1618,6 +1630,14 @@ int iommu_do_pci_domctl(
>>>>>>>>          bus = PCI_BUS(machine_sbdf);
>>>>>>>>          devfn = PCI_DEVFN(machine_sbdf);
>>>>>>>>  
>>>>>>>> +        if ( needs_vpci(d) && !has_vpci(d) )
>>>>>>>> +        {
>>>>>>>> +            printk(XENLOG_G_WARNING "Cannot assign %pp to %pd: vPCI support not enabled\n",
>>>>>>>> +                   &PCI_SBDF(seg, bus, devfn), d);
>>>>>>>> +            ret = -EPERM;
>>>>>>>> +            break;
>>>>>>>
>>>>>>> I think this is likely too restrictive going forward.  The current
>>>>>>> approach is indeed to enable vPCI on a per-domain basis because that's
>>>>>>> how PVH dom0 uses it, due to being unable to use ioreq servers.
>>>>>>>
>>>>>>> If we start to expose vPCI suport to guests the interface should be on
>>>>>>> a per-device basis, so that vPCI could be enabled for some devices,
>>>>>>> while others could still be handled by ioreq servers.
>>>>>>>
>>>>>>> We might want to add a new flag to xen_domctl_assign_device (used by
>>>>>>> XEN_DOMCTL_assign_device) in order to signal whether the device will
>>>>>>> use vPCI.
>>>>>>
>>>>>> Actually I don't think this is a good idea. I am all for flexibility but
>>>>>> supporting multiple different configurations comes at an extra cost for
>>>>>> both maintainers and contributors. I think we should try to reduce the
>>>>>> amount of configurations we support rather than increasing them
>>>>>> (especially on x86 where we have PV, PVH, HVM).
>>>>>
>>>>> I think it's perfectly fine to initially require a domain to have all
>>>>> its devices either passed through using vPCI or ireqs, but the
>>>>> interface IMO should allow for such differentiation in the future.
>>>>> That's why I think introducing a domain wide vPCI flag might not be
>>>>> the best option going forward.
>>>>>
>>>>> It would be perfectly fine for XEN_DOMCTL_assign_device to set a
>>>>> domain wide vPCI flag, iow:
>>>>>
>>>>> if ( HYPERCALL_VPCI_FLAG_SET && !has_vpci(d) )
>>>>> {
>>>>>     if ( has_arch_pdevs(d) )
>>>>>     {
>>>>>         printk("All passthrough devices must use the same backend\n");
>>>>>         return -EINVAL;
>>>>>     }
>>>>>
>>>>>     /* Set vPCI domain flag */
>>>>> }
>>>>
>>>> That would be fine by me, but maybe we can avoid this change too. I was
>>>> imagining that vPCI would be enabled at domain creation, not at runtime.
>>>> And that vPCI would be enabled by default for all PVH guests (once we
>>>> are past the initial experimental phase.)
>>>
>>> Then we don't even need a new CDF flag, and just enable vPCI when
>>> IOMMU is enabled?  IOW: we can key the enabling of vPCI to
>>> XEN_DOMCTL_CDF_iommu for specific domain types?
>>
>> There are many Arm based platforms that need to use iommu but don't have (or don't use) PCI, so we'd still like to have a separate vPCI flag.
> 
> OK, read below though - if we switch to vPCI being a descendant of
> IOREQ (so that the PCI config space decoding is done by IOREQ) we
> could hotplug vPCI managed devices at runtime without requiring any
> prior initialization at domain create, since the traps to the PCI
> config space would be setup by IOREQ.
> 
> We might need a PCI flag in order to signal whether the domain is
> intended to use PCI devices or not, and so whether IOREQ needs to
> setup PCI config space traps (either fully emulated or passthrough
> devices).  But that would be arch-specific AFAICT, as on x86 we
> always trap accesses to the PCI IO ports.

On Arm, the toolstack (or dom0less creation code) needs to construct a {v,ioreq}PCI root complex node in the device tree before guest boot. Attempting to hot plug such a device tree node at runtime sounds like a goal for the (far) future. On Arm, we don't trap the {v,ioreq}PCI config space by default, so, yes, we for sure would need such a {v,ioreq}PCI flag for setting up the {v,ioreq}PCI mmio handlers if we go this route.
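
As a minimal sketch of what gating that handler setup on such a flag
might look like on Arm (assuming the guest ECAM layout constants and
the register_mmio_handler()/vpci_mmio_handler interface already used
by the Arm vPCI code; the flag itself is hypothetical):

/* Sketch: only install the guest PCI config space (ECAM) traps when
 * the domain was created with a PCI/vPCI flag.  The flag name is
 * illustrative only. */
int domain_pci_traps_init(struct domain *d)
{
    if ( !(d->options & XEN_DOMCTL_CDF_vpci) )  /* hypothetical flag */
        return 0;

    register_mmio_handler(d, &vpci_mmio_handler,
                          GUEST_VPCI_ECAM_BASE, GUEST_VPCI_ECAM_SIZE,
                          NULL);

    return 0;
}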
Stewart Hildebrand Dec. 5, 2023, 7:01 p.m. UTC | #17
On 12/5/23 06:08, Roger Pau Monné wrote:
> On Mon, Dec 04, 2023 at 02:07:51PM -0800, Stefano Stabellini wrote:
>> On Mon, 4 Dec 2023, Roger Pau Monné wrote:
>>> On Fri, Dec 01, 2023 at 06:56:32PM -0800, Stefano Stabellini wrote:
>>>> On Fri, 1 Dec 2023, Roger Pau Monné wrote:
>>>>> On Mon, Nov 13, 2023 at 05:21:13PM -0500, Stewart Hildebrand wrote:
>>>>>> @@ -1618,6 +1630,14 @@ int iommu_do_pci_domctl(
>>>>>>          bus = PCI_BUS(machine_sbdf);
>>>>>>          devfn = PCI_DEVFN(machine_sbdf);
>>>>>>  
>>>>>> +        if ( needs_vpci(d) && !has_vpci(d) )
>>>>>> +        {
>>>>>> +            printk(XENLOG_G_WARNING "Cannot assign %pp to %pd: vPCI support not enabled\n",
>>>>>> +                   &PCI_SBDF(seg, bus, devfn), d);
>>>>>> +            ret = -EPERM;
>>>>>> +            break;
>>>>>
>>>>> I think this is likely too restrictive going forward.  The current
>>>>> approach is indeed to enable vPCI on a per-domain basis because that's
>>>>> how PVH dom0 uses it, due to being unable to use ioreq servers.
>>>>>
>>>>> If we start to expose vPCI suport to guests the interface should be on
>>>>> a per-device basis, so that vPCI could be enabled for some devices,
>>>>> while others could still be handled by ioreq servers.
>>>>>
>>>>> We might want to add a new flag to xen_domctl_assign_device (used by
>>>>> XEN_DOMCTL_assign_device) in order to signal whether the device will
>>>>> use vPCI.
>>>>
>>>> Actually I don't think this is a good idea. I am all for flexibility but
>>>> supporting multiple different configurations comes at an extra cost for
>>>> both maintainers and contributors. I think we should try to reduce the
>>>> amount of configurations we support rather than increasing them
>>>> (especially on x86 where we have PV, PVH, HVM).
>>>
>>> I think it's perfectly fine to initially require a domain to have all
>>> its devices either passed through using vPCI or ireqs, but the
>>> interface IMO should allow for such differentiation in the future.
>>> That's why I think introducing a domain wide vPCI flag might not be
>>> the best option going forward.
>>>
>>> It would be perfectly fine for XEN_DOMCTL_assign_device to set a
>>> domain wide vPCI flag, iow:
>>>
>>> if ( HYPERCALL_VPCI_FLAG_SET && !has_vpci(d) )
>>> {
>>>     if ( has_arch_pdevs(d) )
>>>     {
>>>         printk("All passthrough devices must use the same backend\n");
>>>         return -EINVAL;
>>>     }
>>>
>>>     /* Set vPCI domain flag */
>>> }
>>
>> That would be fine by me, but maybe we can avoid this change too. I was
>> imagining that vPCI would be enabled at domain creation, not at runtime.
>> And that vPCI would be enabled by default for all PVH guests (once we
>> are past the initial experimental phase.)
> 
> Then we don't even need a new CDF flag, and just enable vPCI when
> IOMMU is enabled?  IOW: we can key the enabling of vPCI to
> XEN_DOMCTL_CDF_iommu for specific domain types?
> 
> Maybe that's not so trivial on x86, as there's no x86 PVH domain type
> from the hypervisor PoV.
> 
>>
>>> We have already agreed that we want to aim for a setup where ioreqs
>>> and vPCI could be used for the same domain, but I guess you assumed
>>> that ioreqs would be used for emulated devices exclusively and vPCI
>>> for passthrough devices?
>>
>> Yes, that's right
>>
>>
>>> Overall if we agree that ioreqs and vPCI should co-exist for a domain,
>>> I'm not sure there's much reason to limit ioreqs to only handle
>>> emulated devices, seems like an arbitrary limitation.
>>
>> Reply below
>>
>>
>>>> I don't think we should enable IOREQ servers to handle PCI passthrough
>>>> for PVH guests and/or guests with vPCI. If the domain has vPCI, PCI
>>>> Passthrough can be handled by vPCI just fine. I think this should be a
>>>> good anti-feature to have (a goal to explicitly not add this feature) to
>>>> reduce complexity. Unless you see a specific usecase to add support for
>>>> it?
>>>
>>> There are passthrough devices (GPUs) that might require some extra
>>> mediation on dom0 (like the Intel GVT-g thing) that would force the
>>> usage of ioreqs to passthrough.
>>
>> From an architectural perspective, I think it would be cleaner, simpler
>> to maintain, and simpler to understand if Xen was the sole owner of the
>> PCI Root Complex and PCI config space mediation (implemented with vPCI).
>> IOREQ can be used for emulation and it works very well for that. At
>> least in my mind, that makes things much simpler.
> 
> But IOREQ already has all the code to mediate accesses to the PCI
> config space, and the interface to register separate servers for
> different PCI devices.
> 
> We would then need to duplicate this internally for vPCI, so that vPCI
> could forward accesses to IOREQ just for IOREQ to forward to yet a
> different component?  Seems like a lot of duplication for no benefit.
> 
>> I understand there are non-trivial cases, like virtual GPUs with
>> hardware access, but I don't classify those as passthrough. That's
>> because there isn't one device that gets fully assigned to the guest.
>> Instead, there is an emulated device (hence IOREQ) with certain MMIO
>> regions and interrupts that end up being directly mapped from real
>> hardware.
>>
>> So I think it is natural in those cases to use IOREQ and it is also
>> natural to have QEMU remap MMIO/IRQs at runtime. From a vPCI
>> perspective, I hope it will mostly look as if the device is assigned to
>> Dom0. Even if it ends up being more complex than that, Rome wasn't
>> built in one day, and I don't think we should try to solve this problem
>> on day1 (as long as the interfaces are not stable interfaces).
> 
> I don't see IOREQ as dealing explicitly with emulation.  Yes, it does
> allow for emulators to be implemented in user-space, but at the end
> it's just an interface that allows forwarding accesses to certain
> resources (for the case we are speaking about here, PCI config space)
> to entities that registered as handlers.
> 
> vPCI OTOH just deals with a very specific resource (PCI config space)
> and only allows internal handlers to be registered on a byte
> granularity.
> 
> So your proposal would be to implement a hierarchy like the one on the
> diagram below:
> 
>     ┌────────┐ ┌──────────┐ ┌──────────────────┐
>     │ Memory │ │ IO Ports │ │ PCI config space │
>     └───────┬┘ └┬─────────┘ └───┬──────────────┘
>             │   │               │
>             │   │           ┌───┴──┐
>             │   │           │ vPCI │
>             │   │           └─┬──┬─┘
>          ┌──┴───┴┐            │  │
>          │ IOREQ ├────────────┘  │
>          └────┬──┘               │
>               │                  │
>  ┌────────────┴──┐              ┌┴──────────────┐
>  │ IOREQ servers │              │ vPCI handlers │
>  └───────────────┘              └───────────────┘
> 
> While what I'm proposing would look like:
> 
>     ┌────────┐ ┌──────────┐ ┌──────────────────┐
>     │ Memory │ │ IO Ports │ │ PCI config space │
>     └────┬───┘ └────┬─────┘ └────────┬─────────┘
>          │          │                │
>          └─────┬────┴────┬───────────┘
>                │  IOREQ  │
>                └─┬─────┬─┘
>                  │     │
>  ┌───────────────┤     └────┬──────┐
>  │ IOREQ servers │          │ vPCI │
>  └───────────────┘          └───┬──┘
>                                 │
>                             ┌───┴───────────┐
>                             │ vPCI handlers │
>                             └───────────────┘
> 
> I'm obviously biased, but I think the latter is cleaner, and allows
> all resources to be arbitrated by the same component (IOREQ).
> 
> If the concern is about the IOREQ hypercall interface, it would be
> fine to introduce an option that limit IOREQs to internal users
> (vPCI) without supporting external IOREQ servers.
> 
> Think of IOREQ as a resource mediator inside of Xen, that just does
> the PCI address decoding and forwards the access to the interested
> party, either an external IOREQ server or vPCI.
> 
>>
>>> It's important that the interfaces we introduce are correct IMO,
>>> because that ends up reflecting on the configuration options that we
>>> expose to users on xl/libxl.  While both XEN_DOMCTL_createdomain and
>>> XEN_DOMCTL_assign_device are unstable interfaces, how the vPCI option
>>> gets placed there will ultimately influence how the option gets
>>> exposed in xl/libxl, and the interface there is relevant to keep
>>> stable for end user sanity.
>>
>> I agree with you on the stable interfaces. The important part is not to
>> introduce changes to stable interfaces that could limit us in the
>> future. Specifically that includes xl and libxl, we need to be careful
>> there. But I don't see a single per-domain vPCI enable/disable option as
>> a problem. Let's say that in the future we have a mediated vGPU
>> implementation: if it works together with vPCI then the per-domain vPCI
>> option in libxl will be enabled (either explicitely or by default), if
>> it doesn't then vPCI will be disabled (either explicitely or by the
>> newer vGPU options.)
> 
> If vPCI is hooked into IOREQ there won't be a need anymore to register
> the vPCI config space traps, as that would be done by IOREQ, and hence
> vPCI managed devices could be registered at runtime with IOREQ.  IOW:
> there won't be a need anymore to signal at domain creation whether
> vPCI is intended to be used or not.
> 
> We would obviously need to enable IOREQ for all domains with IOMMU
> enabled, as it would be IOREQ that register the PCI config space
> handlers.
> 
>> For *unstable* interfaces (XEN_DOMCTL_assign_device) I would rather wait
>> before adding more changes on top of them, not because I don't care
>> about the mediated GPU problem (we do have something similar at AMD),
>> but because I worry that if we try to change them now we might not do a
>> good enough job. I would prefer to wait until we know more about the
>> actual use case, ideally with code supporting it.
>>
>> I think the difference in points of views comes from the fact that I see
>> vPCI as the default, QEMU only as a limited peripheral emulator (or
>> mediator for the vGPU case) but not in control. vPCI and QEMU are not
>> equal in my view. vPCI is in charge and always present if not in very
>> uncommon setups (even if we decide to hook it inside Xen by using
>> internal IOREQ interfaces). QEMU might come and go.
> 
> Xen needs a single component that mediates accesses to resources,
> whether that's IOREQ, or something else I don't really care that much.

I just wanted to share what the "something else" diagram might look like.

    ┌────────┐ ┌──────────┐ ┌──────────────────┐
    │ Memory │ │ IO Ports │ │ PCI config space │
    └────┬───┘ └────┬─────┘ └──────────┬───────┘
         │          │                  │
         └──┬───────┴───────────────┬──┘
            │ PCI Resource Mediator │
            └────┬─────┬────────────┘
                 │     │
         ┌───────┤     └────┬──────┐
         │ IOREQ │          │ vPCI │
         └────┬──┘          └───┬──┘
              │                 │
 ┌────────────┴──┐          ┌───┴───────────┐
 │ IOREQ servers │          │ vPCI handlers │
 └───────────────┘          └───────────────┘

> Having vPCI mediate accesses to the PCI config space, and IOREQ to the
> memory (and on x86 IO port) space just seems awfully complicated for
> AFAICT no real benefit.
> 
> Also, you seem to confabulate IOREQ with QEMU, while the latter is
> indeed an user of IOREQ, I do see IOREQ as a simple resource mediator
> inside of Xen, that has the ability to forward such accesses to
> external emulators using an hypercall interface.
> 
>> Now that I am writing this, I realize this is also why I wasn't too
>> happy with the idea of hooking vPCI using IOREQ. It makes them look as
>> if they are the same, while I don't they should be considered at the
>> same level of priority, criticality, safety, integration in the system,
>> etc.
> 
> I feel there are some fears with IOREQ from a safety PoV?  The code
> that does the resource multiplexing is small, and as said above if
> there are safety concerns with the hypercall interface it would be
> fine to limit it's usage to internal handlers only.

Would it make any sense at all to split the resource multiplexing bits from IOREQ into a new separate PCI resource mediator?

> 
> Thanks, Roger.
Stefano Stabellini Dec. 5, 2023, 11:25 p.m. UTC | #18
On Tue, 5 Dec 2023, Stewart Hildebrand wrote:
> On 12/5/23 12:09, Roger Pau Monné wrote:
> > On Tue, Dec 05, 2023 at 11:27:03AM -0500, Stewart Hildebrand wrote:
> >> On 12/5/23 06:08, Roger Pau Monné wrote:
> >>> On Mon, Dec 04, 2023 at 02:07:51PM -0800, Stefano Stabellini wrote:
> >>>> On Mon, 4 Dec 2023, Roger Pau Monné wrote:
> >>>>> On Fri, Dec 01, 2023 at 06:56:32PM -0800, Stefano Stabellini wrote:
> >>>>>> On Fri, 1 Dec 2023, Roger Pau Monné wrote:
> >>>>>>> On Mon, Nov 13, 2023 at 05:21:13PM -0500, Stewart Hildebrand wrote:
> >>>>>>>> @@ -1618,6 +1630,14 @@ int iommu_do_pci_domctl(
> >>>>>>>>          bus = PCI_BUS(machine_sbdf);
> >>>>>>>>          devfn = PCI_DEVFN(machine_sbdf);
> >>>>>>>>  
> >>>>>>>> +        if ( needs_vpci(d) && !has_vpci(d) )
> >>>>>>>> +        {
> >>>>>>>> +            printk(XENLOG_G_WARNING "Cannot assign %pp to %pd: vPCI support not enabled\n",
> >>>>>>>> +                   &PCI_SBDF(seg, bus, devfn), d);
> >>>>>>>> +            ret = -EPERM;
> >>>>>>>> +            break;
> >>>>>>>
> >>>>>>> I think this is likely too restrictive going forward.  The current
> >>>>>>> approach is indeed to enable vPCI on a per-domain basis because that's
> >>>>>>> how PVH dom0 uses it, due to being unable to use ioreq servers.
> >>>>>>>
> >>>>>>> If we start to expose vPCI suport to guests the interface should be on
> >>>>>>> a per-device basis, so that vPCI could be enabled for some devices,
> >>>>>>> while others could still be handled by ioreq servers.
> >>>>>>>
> >>>>>>> We might want to add a new flag to xen_domctl_assign_device (used by
> >>>>>>> XEN_DOMCTL_assign_device) in order to signal whether the device will
> >>>>>>> use vPCI.
> >>>>>>
> >>>>>> Actually I don't think this is a good idea. I am all for flexibility but
> >>>>>> supporting multiple different configurations comes at an extra cost for
> >>>>>> both maintainers and contributors. I think we should try to reduce the
> >>>>>> amount of configurations we support rather than increasing them
> >>>>>> (especially on x86 where we have PV, PVH, HVM).
> >>>>>
> >>>>> I think it's perfectly fine to initially require a domain to have all
> >>>>> its devices either passed through using vPCI or ireqs, but the
> >>>>> interface IMO should allow for such differentiation in the future.
> >>>>> That's why I think introducing a domain wide vPCI flag might not be
> >>>>> the best option going forward.
> >>>>>
> >>>>> It would be perfectly fine for XEN_DOMCTL_assign_device to set a
> >>>>> domain wide vPCI flag, iow:
> >>>>>
> >>>>> if ( HYPERCALL_VPCI_FLAG_SET && !has_vpci(d) )
> >>>>> {
> >>>>>     if ( has_arch_pdevs(d) )
> >>>>>     {
> >>>>>         printk("All passthrough devices must use the same backend\n");
> >>>>>         return -EINVAL;
> >>>>>     }
> >>>>>
> >>>>>     /* Set vPCI domain flag */
> >>>>> }
> >>>>
> >>>> That would be fine by me, but maybe we can avoid this change too. I was
> >>>> imagining that vPCI would be enabled at domain creation, not at runtime.
> >>>> And that vPCI would be enabled by default for all PVH guests (once we
> >>>> are past the initial experimental phase.)
> >>>
> >>> Then we don't even need a new CDF flag, and just enable vPCI when
> >>> IOMMU is enabled?  IOW: we can key the enabling of vPCI to
> >>> XEN_DOMCTL_CDF_iommu for specific domain types?
> >>
> >> There are many Arm based platforms that need to use iommu but don't have (or don't use) PCI, so we'd still like to have a separate vPCI flag.
> > 
> > OK, read below though - if we switch to vPCI being a descendant of
> > IOREQ (so that the PCI config space decoding is done by IOREQ) we
> > could hotplug vPCI managed devices at runtime without requiring any
> > prior initialization at domain create, since the traps to the PCI
> > config space would be setup by IOREQ.
> > 
> > We might need a PCI flag in order to signal whether the domain is
> > intended to use PCI devices or not, and so whether IOREQ needs to
> > setup PCI config space traps (either fully emulated or passthrough
> > devices).  But that would be arch-specific AFAICT, as on x86 we
> > always trap accesses to the PCI IO ports.
> 
> On Arm, the toolstack (or dom0less creation code) needs to construct a {v,ioreq}PCI root complex node in the device tree before guest boot. Attempting to hot plug such a device tree node at runtime sounds like a goal for the (far) future. On Arm, we don't trap the {v,ioreq}PCI config space by default, so, yes, we for sure would need such a {v,ioreq}PCI flag for setting up the {v,ioreq}PCI mmio handlers if we go this route.

Yes, and also dynamic configuration and hotplug are actually detrimental
in safety configurations, where you need as much as possible, ideally
everything, to be specified at build time.
Stefano Stabellini Dec. 6, 2023, 2:34 a.m. UTC | #19
On Tue, 5 Dec 2023, Roger Pau Monné wrote:
> > > > I don't think we should enable IOREQ servers to handle PCI passthrough
> > > > for PVH guests and/or guests with vPCI. If the domain has vPCI, PCI
> > > > Passthrough can be handled by vPCI just fine. I think this should be a
> > > > good anti-feature to have (a goal to explicitly not add this feature) to
> > > > reduce complexity. Unless you see a specific usecase to add support for
> > > > it?
> > > 
> > > There are passthrough devices (GPUs) that might require some extra
> > > mediation on dom0 (like the Intel GVT-g thing) that would force the
> > > usage of ioreqs to passthrough.
> > 
> > From an architectural perspective, I think it would be cleaner, simpler
> > to maintain, and simpler to understand if Xen was the sole owner of the
> > PCI Root Complex and PCI config space mediation (implemented with vPCI).
> > IOREQ can be used for emulation and it works very well for that. At
> > least in my mind, that makes things much simpler.
> 
> But IOREQ already has all the code to mediate accesses to the PCI
> config space, and the interface to register separate servers for
> different PCI devices.
> 
> We would then need to duplicate this internally for vPCI, so that vPCI
> could forward accesses to IOREQ just for IOREQ to forward to yet a
> different component?  Seems like a lot of duplication for no benefit.

[...] 
 
> Also, you seem to confabulate IOREQ with QEMU, while the latter is
> indeed an user of IOREQ, I do see IOREQ as a simple resource mediator
> inside of Xen, that has the ability to forward such accesses to
> external emulators using an hypercall interface.

We have been using different terminologies until now. IOREQ could mean
anything from the ABI interface to the emulator side (QEMU) or the
hypervisor side (Xen). I am going to align with your wording and say:

IOREQ: only the IOREQ implementation in Xen (xen/common/ioreq.c)
IOREQ server: QEMU or alternative

I think it is OK if we use IOREQ internally within Xen to hook vPCI into
PCI config space accesses and emulation. I don't think it is a good idea
to attempt to enable IOREQ servers (e.g. QEMU) to implement PCI
Passthrough when vPCI is also enabled for the domain, at least
initially.


> > I understand there are non-trivial cases, like virtual GPUs with
> > hardware access, but I don't classify those as passthrough. That's
> > because there isn't one device that gets fully assigned to the guest.
> > Instead, there is an emulated device (hence IOREQ) with certain MMIO
> > regions and interrupts that end up being directly mapped from real
> > hardware.
> > 
> > So I think it is natural in those cases to use IOREQ and it is also
> > natural to have QEMU remap MMIO/IRQs at runtime. From a vPCI
> > perspective, I hope it will mostly look as if the device is assigned to
> > Dom0. Even if it ends up being more complex than that, Rome wasn't
> > built in one day, and I don't think we should try to solve this problem
> > on day1 (as long as the interfaces are not stable interfaces).
> 
> I don't see IOREQ as dealing explicitly with emulation.  Yes, it does
> allow for emulators to be implemented in user-space, but at the end
> it's just an interface that allows forwarding accesses to certain
> resources (for the case we are speaking about here, PCI config space)
> to entities that registered as handlers.
> 
> vPCI OTOH just deals with a very specific resource (PCI config space)
> and only allows internal handlers to be registered on a byte
> granularity.
> 
> So your proposal would be to implement a hierarchy like the one on the
> diagram below:
> 
>     ┌────────┐ ┌──────────┐ ┌──────────────────┐
>     │ Memory │ │ IO Ports │ │ PCI config space │
>     └───────┬┘ └┬─────────┘ └───┬──────────────┘
>             │   │               │
>             │   │           ┌───┴──┐
>             │   │           │ vPCI │
>             │   │           └─┬──┬─┘
>          ┌──┴───┴┐            │  │
>          │ IOREQ ├────────────┘  │
>          └────┬──┘               │
>               │                  │
>  ┌────────────┴──┐              ┌┴──────────────┐
>  │ IOREQ servers │              │ vPCI handlers │
>  └───────────────┘              └───────────────┘

Yes


> While what I'm proposing would look like:
> 
>     ┌────────┐ ┌──────────┐ ┌──────────────────┐
>     │ Memory │ │ IO Ports │ │ PCI config space │
>     └────┬───┘ └────┬─────┘ └────────┬─────────┘
>          │          │                │
>          └─────┬────┴────┬───────────┘
>                │  IOREQ  │
>                └─┬─────┬─┘
>                  │     │
>  ┌───────────────┤     └────┬──────┐
>  │ IOREQ servers │          │ vPCI │
>  └───────────────┘          └───┬──┘
>                                 │
>                             ┌───┴───────────┐
>                             │ vPCI handlers │
>                             └───────────────┘

I don't have a major problem with this, but I find it less clear than
the first one.

Let's say that all domains are PVH (or ARM guests). QEMU is running in
Dom0. If QEMU does emulation, that's fine. If QEMU implements PCI
Passthrough then QEMU uses libpci to do PCI config space reads and
writes, which go to the Linux kernel in Dom0, which ends up doing PCI
config space reads and writes on the device, and those go via vPCI in
Xen (vPCI for Dom0). So actually vPCI is still present. It is a lot
simpler to think that vPCI is in charge of all mediated PCI config space
accesses rather than thinking that, for the same device, vPCI handles
PCI config space accesses for Dom0 but not for DomU.
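
Just to illustrate that path (not a claim about QEMU's actual code): a
Dom0 userspace config space read via libpci looks roughly like the
sketch below, and the resulting config cycle is issued by the Dom0
kernel, so on a PVH dom0 it is still mediated by Xen's vPCI for Dom0.
The SBDF is made up for the example.

/* Minimal libpci (pciutils) sketch of a config space read from Dom0
 * userspace. */
#include <pci/pci.h>
#include <stdio.h>

int main(void)
{
    struct pci_access *pacc = pci_alloc();
    struct pci_dev *dev;
    unsigned int vendor;

    pci_init(pacc);
    dev = pci_get_dev(pacc, 0, 3, 0, 0);        /* illustrative 0000:03:00.0 */
    vendor = pci_read_word(dev, PCI_VENDOR_ID); /* config space read */
    printf("vendor id: %04x\n", vendor);

    pci_free_dev(dev);
    pci_cleanup(pacc);

    return 0;
}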

It is not my preference, but I am OK to compromise and go ahead with the
architecture you proposed; however, please let's keep IOREQ servers out
of the PCI Passthrough picture, at least initially.


> I'm obviously biased, but I think the latter is cleaner, and allows
> all resources to be arbitrated by the same component (IOREQ).
> 
> If the concern is about the IOREQ hypercall interface, it would be
> fine to introduce an option that limit IOREQs to internal users
> (vPCI) without supporting external IOREQ servers.
> 
> Think of IOREQ as a resource mediator inside of Xen, that just does
> the PCI address decoding and forwards the access to the interested
> party, either an external IOREQ server or vPCI.

The part about IOREQ (xen/common/ioreq.c) being a resource mediator is
OKish.

I had many discussions over the years with various members of the larger
open source embedded community (Linaro, etc.), and the problem is that
when one says "IOREQ", people typically think of QEMU or other userspace
emulators. They don't think of the Xen side of it. This becomes very
relevant here because Xen is the only part of the system that is
getting safety-certified, and it is important to convey the message that
nothing else is required to be safety-certified to have a fully working
Xen system that supports PCI Passthrough.

In short, it is important that the community doesn't get the idea that
QEMU needs to be safety-certified to have PCI Passthrough working
correctly with Xen in a safety scenario.

 
> > > It's important that the interfaces we introduce are correct IMO,
> > > because that ends up reflecting on the configuration options that we
> > > expose to users on xl/libxl.  While both XEN_DOMCTL_createdomain and
> > > XEN_DOMCTL_assign_device are unstable interfaces, how the vPCI option
> > > gets placed there will ultimately influence how the option gets
> > > exposed in xl/libxl, and the interface there is relevant to keep
> > > stable for end user sanity.
> > 
> > I agree with you on the stable interfaces. The important part is not to
> > introduce changes to stable interfaces that could limit us in the
> > future. Specifically that includes xl and libxl, we need to be careful
> > there. But I don't see a single per-domain vPCI enable/disable option as
> > a problem. Let's say that in the future we have a mediated vGPU
> > implementation: if it works together with vPCI then the per-domain vPCI
> > option in libxl will be enabled (either explicitely or by default), if
> > it doesn't then vPCI will be disabled (either explicitely or by the
> > newer vGPU options.)
> 
> If vPCI is hooked into IOREQ there won't be a need anymore to register
> the vPCI config space traps, as that would be done by IOREQ, and hence
> vPCI managed devices could be registered at runtime with IOREQ.  IOW:
> there won't be a need anymore to signal at domain creation whether
> vPCI is intended to be used or not.

For safety, we have requirements to specify everything statically before
boot, so typically anything dynamic is a problem.


> We would obviously need to enable IOREQ for all domains with IOMMU
> enabled, as it would be IOREQ that register the PCI config space
> handlers.

This bit might be OK


> > For *unstable* interfaces (XEN_DOMCTL_assign_device) I would rather wait
> > before adding more changes on top of them, not because I don't care
> > about the mediated GPU problem (we do have something similar at AMD),
> > but because I worry that if we try to change them now we might not do a
> > good enough job. I would prefer to wait until we know more about the
> > actual use case, ideally with code supporting it.
> > 
> > I think the difference in points of views comes from the fact that I see
> > vPCI as the default, QEMU only as a limited peripheral emulator (or
> > mediator for the vGPU case) but not in control. vPCI and QEMU are not
> > equal in my view. vPCI is in charge and always present if not in very
> > uncommon setups (even if we decide to hook it inside Xen by using
> > internal IOREQ interfaces). QEMU might come and go.
> 
> Xen needs a single component that mediates accesses to resources,
> whether that's IOREQ, or something else I don't really care that much.
> Having vPCI mediate accesses to the PCI config space, and IOREQ to the
> memory (and on x86 IO port) space just seems awfully complicated for
> AFAICT no real benefit.
>
> > Now that I am writing this, I realize this is also why I wasn't too
> > happy with the idea of hooking vPCI using IOREQ. It makes them look as
> > if they are the same, while I don't they should be considered at the
> > same level of priority, criticality, safety, integration in the system,
> > etc.
> 
> I feel there are some fears with IOREQ from a safety PoV?  The code
> that does the resource multiplexing is small, and as said above if
> there are safety concerns with the hypercall interface it would be
> fine to limit it's usage to internal handlers only.

Yes, it is about safety. Everything within Xen will be safety-certified,
hence usable in a safety critical scenario; everything outside of Xen
might not be.

The fear is not about IOREQ itself, because xen/common/ioreq.c is part
of the certification scope. The fear is that IOREQ servers (e.g. QEMU)
are somehow in the picture when we discuss safety architectures with PCI
Passthrough, or that IOREQ servers could interfere with vPCI. By
"interfere" I mean that QEMU running in dom0 (a deprivileged dom0) could
cause a malfunction in Xen's vPCI.

Yes, limiting the hypercall interface would help in that regard because
it would limit Xen's exposure.
Roger Pau Monne Dec. 11, 2023, 9:42 a.m. UTC | #20
On Tue, Dec 05, 2023 at 02:01:46PM -0500, Stewart Hildebrand wrote:
> On 12/5/23 06:08, Roger Pau Monné wrote:
> > On Mon, Dec 04, 2023 at 02:07:51PM -0800, Stefano Stabellini wrote:
> >> On Mon, 4 Dec 2023, Roger Pau Monné wrote:
> >>> On Fri, Dec 01, 2023 at 06:56:32PM -0800, Stefano Stabellini wrote:
> >>>> On Fri, 1 Dec 2023, Roger Pau Monné wrote:
> >>>>> On Mon, Nov 13, 2023 at 05:21:13PM -0500, Stewart Hildebrand wrote:
> >>>>>> @@ -1618,6 +1630,14 @@ int iommu_do_pci_domctl(
> >>>>>>          bus = PCI_BUS(machine_sbdf);
> >>>>>>          devfn = PCI_DEVFN(machine_sbdf);
> >>>>>>  
> >>>>>> +        if ( needs_vpci(d) && !has_vpci(d) )
> >>>>>> +        {
> >>>>>> +            printk(XENLOG_G_WARNING "Cannot assign %pp to %pd: vPCI support not enabled\n",
> >>>>>> +                   &PCI_SBDF(seg, bus, devfn), d);
> >>>>>> +            ret = -EPERM;
> >>>>>> +            break;
> >>>>>
> >>>>> I think this is likely too restrictive going forward.  The current
> >>>>> approach is indeed to enable vPCI on a per-domain basis because that's
> >>>>> how PVH dom0 uses it, due to being unable to use ioreq servers.
> >>>>>
> >>>>> If we start to expose vPCI suport to guests the interface should be on
> >>>>> a per-device basis, so that vPCI could be enabled for some devices,
> >>>>> while others could still be handled by ioreq servers.
> >>>>>
> >>>>> We might want to add a new flag to xen_domctl_assign_device (used by
> >>>>> XEN_DOMCTL_assign_device) in order to signal whether the device will
> >>>>> use vPCI.
> >>>>
> >>>> Actually I don't think this is a good idea. I am all for flexibility but
> >>>> supporting multiple different configurations comes at an extra cost for
> >>>> both maintainers and contributors. I think we should try to reduce the
> >>>> amount of configurations we support rather than increasing them
> >>>> (especially on x86 where we have PV, PVH, HVM).
> >>>
> >>> I think it's perfectly fine to initially require a domain to have all
> >>> its devices either passed through using vPCI or ireqs, but the
> >>> interface IMO should allow for such differentiation in the future.
> >>> That's why I think introducing a domain wide vPCI flag might not be
> >>> the best option going forward.
> >>>
> >>> It would be perfectly fine for XEN_DOMCTL_assign_device to set a
> >>> domain wide vPCI flag, iow:
> >>>
> >>> if ( HYPERCALL_VPCI_FLAG_SET && !has_vpci(d) )
> >>> {
> >>>     if ( has_arch_pdevs(d) )
> >>>     {
> >>>         printk("All passthrough devices must use the same backend\n");
> >>>         return -EINVAL;
> >>>     }
> >>>
> >>>     /* Set vPCI domain flag */
> >>> }
> >>
> >> That would be fine by me, but maybe we can avoid this change too. I was
> >> imagining that vPCI would be enabled at domain creation, not at runtime.
> >> And that vPCI would be enabled by default for all PVH guests (once we
> >> are past the initial experimental phase.)
> > 
> > Then we don't even need a new CDF flag, and just enable vPCI when
> > IOMMU is enabled?  IOW: we can key the enabling of vPCI to
> > XEN_DOMCTL_CDF_iommu for specific domain types?
> > 
> > Maybe that's not so trivial on x86, as there's no x86 PVH domain type
> > from the hypervisor PoV.
> > 
> >>
> >>> We have already agreed that we want to aim for a setup where ioreqs
> >>> and vPCI could be used for the same domain, but I guess you assumed
> >>> that ioreqs would be used for emulated devices exclusively and vPCI
> >>> for passthrough devices?
> >>
> >> Yes, that's right
> >>
> >>
> >>> Overall if we agree that ioreqs and vPCI should co-exist for a domain,
> >>> I'm not sure there's much reason to limit ioreqs to only handle
> >>> emulated devices, seems like an arbitrary limitation.
> >>
> >> Reply below
> >>
> >>
> >>>> I don't think we should enable IOREQ servers to handle PCI passthrough
> >>>> for PVH guests and/or guests with vPCI. If the domain has vPCI, PCI
> >>>> Passthrough can be handled by vPCI just fine. I think this should be a
> >>>> good anti-feature to have (a goal to explicitly not add this feature) to
> >>>> reduce complexity. Unless you see a specific usecase to add support for
> >>>> it?
> >>>
> >>> There are passthrough devices (GPUs) that might require some extra
> >>> mediation on dom0 (like the Intel GVT-g thing) that would force the
> >>> usage of ioreqs to passthrough.
> >>
> >> From an architectural perspective, I think it would be cleaner, simpler
> >> to maintain, and simpler to understand if Xen was the sole owner of the
> >> PCI Root Complex and PCI config space mediation (implemented with vPCI).
> >> IOREQ can be used for emulation and it works very well for that. At
> >> least in my mind, that makes things much simpler.
> > 
> > But IOREQ already has all the code to mediate accesses to the PCI
> > config space, and the interface to register separate servers for
> > different PCI devices.
> > 
> > We would then need to duplicate this internally for vPCI, so that vPCI
> > could forward accesses to IOREQ just for IOREQ to forward to yet a
> > different component?  Seems like a lot of duplication for no benefit.
> > 
> >> I understand there are non-trivial cases, like virtual GPUs with
> >> hardware access, but I don't classify those as passthrough. That's
> >> because there isn't one device that gets fully assigned to the guest.
> >> Instead, there is an emulated device (hence IOREQ) with certain MMIO
> >> regions and interrupts that end up being directly mapped from real
> >> hardware.
> >>
> >> So I think it is natural in those cases to use IOREQ and it is also
> >> natural to have QEMU remap MMIO/IRQs at runtime. From a vPCI
> >> perspective, I hope it will mostly look as if the device is assigned to
> >> Dom0. Even if it ends up being more complex than that, Rome wasn't
> >> built in one day, and I don't think we should try to solve this problem
> >> on day1 (as long as the interfaces are not stable interfaces).
> > 
> > I don't see IOREQ as dealing explicitly with emulation.  Yes, it does
> > allow for emulators to be implemented in user-space, but at the end
> > it's just an interface that allows forwarding accesses to certain
> > resources (for the case we are speaking about here, PCI config space)
> > to entities that registered as handlers.
> > 
> > vPCI OTOH just deals with a very specific resource (PCI config space)
> > and only allows internal handlers to be registered on a byte
> > granularity.
> > 
> > So your proposal would be to implement a hierarchy like the one on the
> > diagram below:
> > 
> >     ┌────────┐ ┌──────────┐ ┌──────────────────┐
> >     │ Memory │ │ IO Ports │ │ PCI config space │
> >     └───────┬┘ └┬─────────┘ └───┬──────────────┘
> >             │   │               │
> >             │   │           ┌───┴──┐
> >             │   │           │ vPCI │
> >             │   │           └─┬──┬─┘
> >          ┌──┴───┴┐            │  │
> >          │ IOREQ ├────────────┘  │
> >          └────┬──┘               │
> >               │                  │
> >  ┌────────────┴──┐              ┌┴──────────────┐
> >  │ IOREQ servers │              │ vPCI handlers │
> >  └───────────────┘              └───────────────┘
> > 
> > While what I'm proposing would look like:
> > 
> >     ┌────────┐ ┌──────────┐ ┌──────────────────┐
> >     │ Memory │ │ IO Ports │ │ PCI config space │
> >     └────┬───┘ └────┬─────┘ └────────┬─────────┘
> >          │          │                │
> >          └─────┬────┴────┬───────────┘
> >                │  IOREQ  │
> >                └─┬─────┬─┘
> >                  │     │
> >  ┌───────────────┤     └────┬──────┐
> >  │ IOREQ servers │          │ vPCI │
> >  └───────────────┘          └───┬──┘
> >                                 │
> >                             ┌───┴───────────┐
> >                             │ vPCI handlers │
> >                             └───────────────┘
> > 
> > I'm obviously biased, but I think the latter is cleaner, and allows
> > all resources to be arbitrated by the same component (IOREQ).
> > 
> > If the concern is about the IOREQ hypercall interface, it would be
> > fine to introduce an option that limit IOREQs to internal users
> > (vPCI) without supporting external IOREQ servers.
> > 
> > Think of IOREQ as a resource mediator inside of Xen, that just does
> > the PCI address decoding and forwards the access to the interested
> > party, either an external IOREQ server or vPCI.
> > 
> >>
> >>> It's important that the interfaces we introduce are correct IMO,
> >>> because that ends up reflecting on the configuration options that we
> >>> expose to users on xl/libxl.  While both XEN_DOMCTL_createdomain and
> >>> XEN_DOMCTL_assign_device are unstable interfaces, how the vPCI option
> >>> gets placed there will ultimately influence how the option gets
> >>> exposed in xl/libxl, and the interface there is relevant to keep
> >>> stable for end user sanity.
> >>
> >> I agree with you on the stable interfaces. The important part is not to
> >> introduce changes to stable interfaces that could limit us in the
> >> future. Specifically that includes xl and libxl, we need to be careful
> >> there. But I don't see a single per-domain vPCI enable/disable option as
> >> a problem. Let's say that in the future we have a mediated vGPU
> >> implementation: if it works together with vPCI then the per-domain vPCI
> >> option in libxl will be enabled (either explicitely or by default), if
> >> it doesn't then vPCI will be disabled (either explicitely or by the
> >> newer vGPU options.)
> > 
> > If vPCI is hooked into IOREQ there won't be a need anymore to register
> > the vPCI config space traps, as that would be done by IOREQ, and hence
> > vPCI managed devices could be registered at runtime with IOREQ.  IOW:
> > there won't be a need anymore to signal at domain creation whether
> > vPCI is intended to be used or not.
> > 
> > We would obviously need to enable IOREQ for all domains with IOMMU
> > enabled, as it would be IOREQ that register the PCI config space
> > handlers.
> > 
> >> For *unstable* interfaces (XEN_DOMCTL_assign_device) I would rather wait
> >> before adding more changes on top of them, not because I don't care
> >> about the mediated GPU problem (we do have something similar at AMD),
> >> but because I worry that if we try to change them now we might not do a
> >> good enough job. I would prefer to wait until we know more about the
> >> actual use case, ideally with code supporting it.
> >>
> >> I think the difference in points of views comes from the fact that I see
> >> vPCI as the default, QEMU only as a limited peripheral emulator (or
> >> mediator for the vGPU case) but not in control. vPCI and QEMU are not
> >> equal in my view. vPCI is in charge and always present if not in very
> >> uncommon setups (even if we decide to hook it inside Xen by using
> >> internal IOREQ interfaces). QEMU might come and go.
> > 
> > Xen needs a single component that mediates accesses to resources,
> > whether that's IOREQ, or something else I don't really care that much.
> 
> I just wanted to share what the "something else" diagram might look like.
> 
>     ┌────────┐ ┌──────────┐ ┌──────────────────┐
>     │ Memory │ │ IO Ports │ │ PCI config space │
>     └────┬───┘ └────┬─────┘ └──────────┬───────┘
>          │          │                  │
>          └──┬───────┴───────────────┬──┘
>             │ PCI Resource Mediator │
>             └────┬─────┬────────────┘
>                  │     │
>          ┌───────┤     └────┬──────┐
>          │ IOREQ │          │ vPCI │
>          └────┬──┘          └───┬──┘
>               │                 │
>  ┌────────────┴──┐          ┌───┴───────────┐
>  │ IOREQ servers │          │ vPCI handlers │
>  └───────────────┘          └───────────────┘

It's IMO weird that the PCI resource mediator also controls Memory
and IO ports, since those are not PCI specific resources.

Isn't your proposed "PCI Resource Mediator" the same as what IOREQ
currently does?

I'm kind of confused as to what benefit there is in introducing another
layer that multiplexes guest resource accesses.

> > Having vPCI mediate accesses to the PCI config space, and IOREQ to the
> > memory (and on x86 IO port) space just seems awfully complicated for
> > AFAICT no real benefit.
> > 
> > Also, you seem to confabulate IOREQ with QEMU, while the latter is
> > indeed an user of IOREQ, I do see IOREQ as a simple resource mediator
> > inside of Xen, that has the ability to forward such accesses to
> > external emulators using an hypercall interface.
> > 
> >> Now that I am writing this, I realize this is also why I wasn't too
> >> happy with the idea of hooking vPCI using IOREQ. It makes them look as
> >> if they are the same, while I don't they should be considered at the
> >> same level of priority, criticality, safety, integration in the system,
> >> etc.
> > 
> > I feel there are some fears with IOREQ from a safety PoV?  The code
> > that does the resource multiplexing is small, and as said above if
> > there are safety concerns with the hypercall interface it would be
> > fine to limit it's usage to internal handlers only.
> 
> Would it make any sense at all to split the resource multiplexing bits from IOREQ into a new separate PCI resource mediator?

That might be fine, but it seems like a lot of work and more complexity
in Xen for AFAICT no real benefit.

I think I would need to better understand the worries with using
IOREQ, but wouldn't it be easier to just limit the current IOREQ
code/interface to suit your needs?  Again, without knowing exactly what
the issues with using IOREQ are, it's hard to propose solutions.

Thanks, Roger.
Roger Pau Monne Dec. 11, 2023, 10:36 a.m. UTC | #21
On Tue, Dec 05, 2023 at 06:34:35PM -0800, Stefano Stabellini wrote:
> On Tue, 5 Dec 2023, Roger Pau Monné wrote:
> > > > > I don't think we should enable IOREQ servers to handle PCI passthrough
> > > > > for PVH guests and/or guests with vPCI. If the domain has vPCI, PCI
> > > > > Passthrough can be handled by vPCI just fine. I think this should be a
> > > > > good anti-feature to have (a goal to explicitly not add this feature) to
> > > > > reduce complexity. Unless you see a specific usecase to add support for
> > > > > it?
> > > > 
> > > > There are passthrough devices (GPUs) that might require some extra
> > > > mediation on dom0 (like the Intel GVT-g thing) that would force the
> > > > usage of ioreqs to passthrough.
> > > 
> > > From an architectural perspective, I think it would be cleaner, simpler
> > > to maintain, and simpler to understand if Xen was the sole owner of the
> > > PCI Root Complex and PCI config space mediation (implemented with vPCI).
> > > IOREQ can be used for emulation and it works very well for that. At
> > > least in my mind, that makes things much simpler.
> > 
> > But IOREQ already has all the code to mediate accesses to the PCI
> > config space, and the interface to register separate servers for
> > different PCI devices.
> > 
> > We would then need to duplicate this internally for vPCI, so that vPCI
> > could forward accesses to IOREQ just for IOREQ to forward to yet a
> > different component?  Seems like a lot of duplication for no benefit.
> 
> [...] 
>  
> > Also, you seem to confabulate IOREQ with QEMU, while the latter is
> > indeed an user of IOREQ, I do see IOREQ as a simple resource mediator
> > inside of Xen, that has the ability to forward such accesses to
> > external emulators using an hypercall interface.
> 
> We have been using different terminologies until now. IOREQ could mean
> anything from the ABI interface, the emulator side (QEMU) or the
> hypervisor side (Xen). I am going to align with your wording and say:
> 
> IOREQ: only the IOREQ implementation in Xen (xen/common/ioreq.c)
> IOREQ server: QEMU or alternative
> 
> I think it is OK if we use IOREQ internally within Xen to hook vPCI with
> PCI config space accesses and emulation. I don't think it is a good idea
> to attempt to enable IOREQ servers (e.g. QEMU) to implement PCI
> Passthrough when vPCI is also enabled for the domain, at least
> initially.

I agree, it's perfectly fine to initially limit this to vPCI passthrough
devices + QEMU emulated devices only, for example.

I think it was mostly an issue with terminology then :).

> 
> > > I understand there are non-trivial cases, like virtual GPUs with
> > > hardware access, but I don't classify those as passthrough. That's
> > > because there isn't one device that gets fully assigned to the guest.
> > > Instead, there is an emulated device (hence IOREQ) with certain MMIO
> > > regions and interrupts that end up being directly mapped from real
> > > hardware.
> > > 
> > > So I think it is natural in those cases to use IOREQ and it is also
> > > natural to have QEMU remap MMIO/IRQs at runtime. From a vPCI
> > > perspective, I hope it will mostly look as if the device is assigned to
> > > Dom0. Even if it ends up being more complex than that, Rome wasn't
> > > built in one day, and I don't think we should try to solve this problem
> > > on day1 (as long as the interfaces are not stable interfaces).
> > 
> > I don't see IOREQ as dealing explicitly with emulation.  Yes, it does
> > allow for emulators to be implemented in user-space, but at the end
> > it's just an interface that allows forwarding accesses to certain
> > resources (for the case we are speaking about here, PCI config space)
> > to entities that registered as handlers.
> > 
> > vPCI OTOH just deals with a very specific resource (PCI config space)
> > and only allows internal handlers to be registered on a byte
> > granularity.
> > 
> > So your proposal would be to implement a hierarchy like the one on the
> > diagram below:
> > 
> >     ┌────────┐ ┌──────────┐ ┌──────────────────┐
> >     │ Memory │ │ IO Ports │ │ PCI config space │
> >     └───────┬┘ └┬─────────┘ └───┬──────────────┘
> >             │   │               │
> >             │   │           ┌───┴──┐
> >             │   │           │ vPCI │
> >             │   │           └─┬──┬─┘
> >          ┌──┴───┴┐            │  │
> >          │ IOREQ ├────────────┘  │
> >          └────┬──┘               │
> >               │                  │
> >  ┌────────────┴──┐              ┌┴──────────────┐
> >  │ IOREQ servers │              │ vPCI handlers │
> >  └───────────────┘              └───────────────┘
> 
> Yes
> 
> 
> > While what I'm proposing would look like:
> > 
> >     ┌────────┐ ┌──────────┐ ┌──────────────────┐
> >     │ Memory │ │ IO Ports │ │ PCI config space │
> >     └────┬───┘ └────┬─────┘ └────────┬─────────┘
> >          │          │                │
> >          └─────┬────┴────┬───────────┘
> >                │  IOREQ  │
> >                └─┬─────┬─┘
> >                  │     │
> >  ┌───────────────┤     └────┬──────┐
> >  │ IOREQ servers │          │ vPCI │
> >  └───────────────┘          └───┬──┘
> >                                 │
> >                             ┌───┴───────────┐
> >                             │ vPCI handlers │
> >                             └───────────────┘
> 
> I don't have a major problem with this, but I find it less clear than
> the first one.
> 
> Let's say that all domains are PVH (or ARM guests). QEMU is running in
> Dom0. If QEMU does emulation, that's fine. If QEMU implements PCI
> Passthrough then QEMU uses libpci to do PCI config space reads and
> write, which go to the Linux kernel in Dom0, which ends up doing PCI
> config space reads and writes on the device, and that goes via vPCI in
> Xen (vPCI for Dom0). So actually vPCI is still present. It is a lot
> simpler to think that vPCI is in charge of all mediated PCI config space
> accesses rather than thinking that for the same device vPCI handles PCI
> config space accesses for Dom0 but not for DomU.

So most of the issue is again with terminology, I think: you would
like to avoid even having to mention the word IOREQ for PVH domains,
for example, which you could possibly do if vPCI trapped all accesses
to the PCI config space.

I would be fine with renaming that internal IOREQ component to
something else.  What I insist on having is a single component that
multiplexes access to all platform resources (IO ports, MMIO, PCI
config space), so that we can have a (kind of) unified interface to
register handlers.
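
To make that a bit more concrete, a rough standalone sketch of the kind
of unified registration interface I have in mind (all names and types
below are made up for illustration, none of this is existing Xen code):

/* Illustration only: one mediator that multiplexes IO port, MMIO and PCI
 * config space accesses to whoever registered a handler for the range. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

enum res_type { RES_IOPORT, RES_MMIO, RES_PCI_CFG };

struct res_handler {
    enum res_type type;
    uint64_t base, size;          /* decoded address range */
    bool (*read)(uint64_t addr, unsigned int len, uint64_t *val);
};

#define MAX_HANDLERS 8
static struct res_handler handlers[MAX_HANDLERS];
static unsigned int nr_handlers;

static int register_res_handler(const struct res_handler *h)
{
    if ( nr_handlers >= MAX_HANDLERS )
        return -1;
    handlers[nr_handlers++] = *h;
    return 0;
}

/* The mediator only decodes and forwards; the registered party (vPCI,
 * or an external server) does the actual emulation. */
static bool mediate_read(enum res_type type, uint64_t addr,
                         unsigned int len, uint64_t *val)
{
    unsigned int i;

    for ( i = 0; i < nr_handlers; i++ )
        if ( handlers[i].type == type && addr >= handlers[i].base &&
             addr + len <= handlers[i].base + handlers[i].size )
            return handlers[i].read(addr, len, val);

    return false;
}

/* Example user: vPCI claiming the guest PCI config space. */
static bool vpci_cfg_read(uint64_t addr, unsigned int len, uint64_t *val)
{
    (void)addr; (void)len;
    *val = ~0ULL;                 /* pretend there is no device there */
    return true;
}

int main(void)
{
    struct res_handler vpci = {
        .type = RES_PCI_CFG, .base = 0, .size = 1ULL << 28,
        .read = vpci_cfg_read,
    };
    uint64_t val;

    register_res_handler(&vpci);
    if ( mediate_read(RES_PCI_CFG, 0x1000, 4, &val) )
        printf("cfg read -> %#llx\n", (unsigned long long)val);

    return 0;
}

Whether external IOREQ servers plug into the same table or a separate
one is an implementation detail; the point is that vPCI would just be
one more registered handler.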

> It is not my preference but I am OK to compromise and go ahead with the
> architecture you proposed but please let's keep IOREQ servers out of the
> PCI Passthrough picture at least initially.
> 
> 
> > I'm obviously biased, but I think the latter is cleaner, and allows
> > all resources to be arbitrated by the same component (IOREQ).
> > 
> > If the concern is about the IOREQ hypercall interface, it would be
> > fine to introduce an option that limit IOREQs to internal users
> > (vPCI) without supporting external IOREQ servers.
> > 
> > Think of IOREQ as a resource mediator inside of Xen, that just does
> > the PCI address decoding and forwards the access to the interested
> > party, either an external IOREQ server or vPCI.
> 
> The part about IOREQ (xen/common/ioreq.c) being a resource mediator is
> OKish.
> 
> I had many discussions over the years with various members of the larger
> open source embedded community (Linaro, etc.) and the problem is that
> when one says "IOREQ" typically people think of QEMU or other userspace
> emulators. They don't think of the Xen side of it. This becomes very
> relevant here because Xen is the only part of the system that is
> getting safety-certified and it is important to convey the message that
> nothing else is required to be safety-certified to have a fully working
> Xen system that supports PCI Passthrough.
> 
> In short, it is important that the community doesn't get the idea that
> QEMU needs to be safety-certified to have PCI Passthrough working
> correctly with Xen in a safety scenario.

Maybe we need to rename that internal IOREQ component to something
else, and then IOREQ would strictly be limited to the hypercall
interface + IOREQ servers.

Or maybe we just need more education/documentation around the
difference between the internal side of IOREQs vs IOREQ servers vs
QEMU.  See for example demu, which is an emulator for a PC-like
compatible system using IOREQ servers:

https://xenbits.xen.org/gitweb/?p=people/pauldu/demu.git;a=summary

>  
> > > > It's important that the interfaces we introduce are correct IMO,
> > > > because that ends up reflecting on the configuration options that we
> > > > expose to users on xl/libxl.  While both XEN_DOMCTL_createdomain and
> > > > XEN_DOMCTL_assign_device are unstable interfaces, how the vPCI option
> > > > gets placed there will ultimately influence how the option gets
> > > > exposed in xl/libxl, and the interface there is relevant to keep
> > > > stable for end user sanity.
> > > 
> > > I agree with you on the stable interfaces. The important part is not to
> > > introduce changes to stable interfaces that could limit us in the
> > > future. Specifically that includes xl and libxl, we need to be careful
> > > there. But I don't see a single per-domain vPCI enable/disable option as
> > > a problem. Let's say that in the future we have a mediated vGPU
> > > implementation: if it works together with vPCI then the per-domain vPCI
> > > option in libxl will be enabled (either explicitly or by default), if
> > > it doesn't then vPCI will be disabled (either explicitly or by the
> > > newer vGPU options.)
> > 
> > If vPCI is hooked into IOREQ there won't be a need anymore to register
> > the vPCI config space traps, as that would be done by IOREQ, and hence
> > vPCI managed devices could be registered at runtime with IOREQ.  IOW:
> > there won't be a need anymore to signal at domain creation whether
> > vPCI is intended to be used or not.
> 
> For safety, we have requirements to specify everything statically before
> boot so typically anything dynamic is a problem.
> 
> 
> > We would obviously need to enable IOREQ for all domains with IOMMU
> > enabled, as it would be IOREQ that registers the PCI config space
> > handlers.
> 
> This bit might be OK
> 
> 
> > > For *unstable* interfaces (XEN_DOMCTL_assign_device) I would rather wait
> > > before adding more changes on top of them, not because I don't care
> > > about the mediated GPU problem (we do have something similar at AMD),
> > > but because I worry that if we try to change them now we might not do a
> > > good enough job. I would prefer to wait until we know more about the
> > > actual use case, ideally with code supporting it.
> > > 
> > > I think the difference in points of view comes from the fact that I see
> > > vPCI as the default, QEMU only as a limited peripheral emulator (or
> > > mediator for the vGPU case) but not in control. vPCI and QEMU are not
> > > equal in my view. vPCI is in charge and always present if not in very
> > > uncommon setups (even if we decide to hook it inside Xen by using
> > > internal IOREQ interfaces). QEMU might come and go.
> > 
> > Xen needs a single component that mediates accesses to resources,
> > whether that's IOREQ, or something else I don't really care that much.
> > Having vPCI mediate accesses to the PCI config space, and IOREQ to the
> > memory (and on x86 IO port) space just seems awfully complicated for
> > AFAICT no real benefit.
> >
> > > Now that I am writing this, I realize this is also why I wasn't too
> > > happy with the idea of hooking vPCI using IOREQ. It makes them look as
> > > if they are the same, while I don't think they should be considered at the
> > > same level of priority, criticality, safety, integration in the system,
> > > etc.
> > 
> > I feel there are some fears with IOREQ from a safety PoV?  The code
> > that does the resource multiplexing is small, and as said above if
> > there are safety concerns with the hypercall interface it would be
> > fine to limit its usage to internal handlers only.
> 
> Yes, it is about safety. Everything within Xen will be safety-certified,
> hence usable in a safety critical scenario; everything outside of Xen
> might not be.
> 
> The fear is not on the IOREQ itself because xen/common/ioreq.c is part
> of the certification scope. The fear is that IOREQ servers (e.g. QEMU)
> are somehow in the picture when we discuss safety architectures with PCI
> Passthrough, or that IOREQ servers could interfere with vPCI. By
> "interfere" I mean that QEMU running in dom0 (a deprivileged dom0) will
> be able to cause a malfunction in Xen vPCI.

For that purpose it doesn't matter much how IOREQs or vPCI interact,
as any (buggy) interaction could possibly allow IOREQ to cause
malfunctions to vPCI.

> Yes, limiting the hypercall interface would help in that regard because
> it would limit Xen exposure.

That would be fine IMO, it could even be a Kconfig option if that
better suits your needs.
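
As a rough illustration of what such an option could gate (the
CONFIG_IOREQ_SERVER_EXTERNAL symbol and the function below are made up
for the example, this is not the actual Xen code path):

/* Sketch: a hypothetical Kconfig symbol restricting the hypercall-visible
 * IOREQ server interface, while internal users (vPCI) stay available. */
#include <errno.h>
#include <stdio.h>

/* Stand-in for the Kconfig output; set to 1 to allow external servers. */
#define CONFIG_IOREQ_SERVER_EXTERNAL 0

static int example_create_external_ioreq_server(void)
{
    if ( !CONFIG_IOREQ_SERVER_EXTERNAL )
        return -EOPNOTSUPP;       /* only internal handlers allowed */

    /* normal external server creation would go here */
    return 0;
}

int main(void)
{
    printf("create external server: %d\n",
           example_create_external_ioreq_server());
    return 0;
}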

Thanks, Roger.
Stefano Stabellini Dec. 12, 2023, 1:34 a.m. UTC | #22
On Mon, 11 Dec 2023, Roger Pau Monné wrote:
> On Tue, Dec 05, 2023 at 06:34:35PM -0800, Stefano Stabellini wrote:
> > On Tue, 5 Dec 2023, Roger Pau Monné wrote:
> > > > > > I don't think we should enable IOREQ servers to handle PCI passthrough
> > > > > > for PVH guests and/or guests with vPCI. If the domain has vPCI, PCI
> > > > > > Passthrough can be handled by vPCI just fine. I think this should be a
> > > > > > good anti-feature to have (a goal to explicitly not add this feature) to
> > > > > > reduce complexity. Unless you see a specific usecase to add support for
> > > > > > it?
> > > > > 
> > > > > There are passthrough devices (GPUs) that might require some extra
> > > > > mediation on dom0 (like the Intel GVT-g thing) that would force the
> > > > > usage of ioreqs to passthrough.
> > > > 
> > > > From an architectural perspective, I think it would be cleaner, simpler
> > > > to maintain, and simpler to understand if Xen was the sole owner of the
> > > > PCI Root Complex and PCI config space mediation (implemented with vPCI).
> > > > IOREQ can be used for emulation and it works very well for that. At
> > > > least in my mind, that makes things much simpler.
> > > 
> > > But IOREQ already has all the code to mediate accesses to the PCI
> > > config space, and the interface to register separate servers for
> > > different PCI devices.
> > > 
> > > We would then need to duplicate this internally for vPCI, so that vPCI
> > > could forward accesses to IOREQ just for IOREQ to forward to yet a
> > > different component?  Seems like a lot of duplication for no benefit.
> > 
> > [...] 
> >  
> > > Also, you seem to conflate IOREQ with QEMU, while the latter is
> > > indeed an user of IOREQ, I do see IOREQ as a simple resource mediator
> > > inside of Xen, that has the ability to forward such accesses to
> > > external emulators using an hypercall interface.
> > 
> > We have been using different terminologies until now. IOREQ could mean
> > anything from the ABI interface, the emulator side (QEMU) or the
> > hypervisor side (Xen). I am going to align with your wording and say:
> > 
> > IOREQ: only the IOREQ implementation in Xen (xen/common/ioreq.c)
> > IOREQ server: QEMU or alternative
> > 
> > I think it is OK if we use IOREQ internally within Xen to hook vPCI with
> > PCI config space accesses and emulation. I don't think it is a good idea
> > to attempt to enable IOREQ servers (e.g. QEMU) to implement PCI
> > Passthrough when vPCI is also enabled for the domain, at least
> > initially.
> 
> I agree, it's perfectly fine to initially limit to vPCI passthrough
> devices + QEMU emulated devices only for example.

OK good


> I think it was mostly an issue with terminology then :).

Yes :)


> > > > I understand there are non-trivial cases, like virtual GPUs with
> > > > hardware access, but I don't classify those as passthrough. That's
> > > > because there isn't one device that gets fully assigned to the guest.
> > > > Instead, there is an emulated device (hence IOREQ) with certain MMIO
> > > > regions and interrupts that end up being directly mapped from real
> > > > hardware.
> > > > 
> > > > So I think it is natural in those cases to use IOREQ and it is also
> > > > natural to have QEMU remap MMIO/IRQs at runtime. From a vPCI
> > > > perspective, I hope it will mostly look as if the device is assigned to
> > > > Dom0. Even if it ends up being more complex than that, Rome wasn't
> > > > built in one day, and I don't think we should try to solve this problem
> > > > on day1 (as long as the interfaces are not stable interfaces).
> > > 
> > > I don't see IOREQ as dealing explicitly with emulation.  Yes, it does
> > > allow for emulators to be implemented in user-space, but at the end
> > > it's just an interface that allows forwarding accesses to certain
> > > resources (for the case we are speaking about here, PCI config space)
> > > to entities that registered as handlers.
> > > 
> > > vPCI OTOH just deals with a very specific resource (PCI config space)
> > > and only allows internal handlers to be registered on a byte
> > > granularity.
> > > 
> > > So your proposal would be to implement a hierarchy like the one on the
> > > diagram below:
> > > 
> > >     ┌────────┐ ┌──────────┐ ┌──────────────────┐
> > >     │ Memory │ │ IO Ports │ │ PCI config space │
> > >     └───────┬┘ └┬─────────┘ └───┬──────────────┘
> > >             │   │               │
> > >             │   │           ┌───┴──┐
> > >             │   │           │ vPCI │
> > >             │   │           └─┬──┬─┘
> > >          ┌──┴───┴┐            │  │
> > >          │ IOREQ ├────────────┘  │
> > >          └────┬──┘               │
> > >               │                  │
> > >  ┌────────────┴──┐              ┌┴──────────────┐
> > >  │ IOREQ servers │              │ vPCI handlers │
> > >  └───────────────┘              └───────────────┘
> > 
> > Yes
> > 
> > 
> > > While what I'm proposing would look like:
> > > 
> > >     ┌────────┐ ┌──────────┐ ┌──────────────────┐
> > >     │ Memory │ │ IO Ports │ │ PCI config space │
> > >     └────┬───┘ └────┬─────┘ └────────┬─────────┘
> > >          │          │                │
> > >          └─────┬────┴────┬───────────┘
> > >                │  IOREQ  │
> > >                └─┬─────┬─┘
> > >                  │     │
> > >  ┌───────────────┤     └────┬──────┐
> > >  │ IOREQ servers │          │ vPCI │
> > >  └───────────────┘          └───┬──┘
> > >                                 │
> > >                             ┌───┴───────────┐
> > >                             │ vPCI handlers │
> > >                             └───────────────┘
> > 
> > I don't have a major problem with this, but I find it less clear than
> > the first one.
> > 
> > Let's say that all domains are PVH (or ARM guests). QEMU is running in
> > Dom0. If QEMU does emulation, that's fine. If QEMU implements PCI
> > Passthrough then QEMU uses libpci to do PCI config space reads and
> > writes, which go to the Linux kernel in Dom0, which ends up doing PCI
> > config space reads and writes on the device, and that goes via vPCI in
> > Xen (vPCI for Dom0). So actually vPCI is still present. It is a lot
> > simpler to think that vPCI is in charge of all mediated PCI config space
> > accesses rather than thinking that for the same device vPCI handles PCI
> > config space accesses for Dom0 but not for DomU.
> 
> So most of the issue is again with terminology, I think: you would
> like to avoid even having to mention the word IOREQ for PVH domains,
> for example, which you could possibly do if vPCI trapped all accesses
> to the PCI config space.
> 
> I would be fine with renaming that internal IOREQ component to
> something else.  What I insist on having is a single component that
> multiplexes access to all platform resources (IO ports, MMIO, PCI
> config space), so that we can have a (kind of) unified interface to
> register handlers.

Yes I am OK with that.

A single multiplexer is fine; however, we need to be careful, as IOREQ in
Xen has a lot of machinery for handling messages to and from QEMU and the
state changes related to it (see ioreq_send and ioreq_send_buffered).
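
To illustrate the split I have in mind, a purely standalone sketch of
how I picture the two paths staying apart (made-up function names, not
the actual xen/common/ioreq.c flow):

/* Illustration: dispatch an access to an internal handler (vPCI) first,
 * and only fall back to the external-server path (ioreq_send-style
 * message passing to QEMU) if no internal handler claimed it. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static bool vpci_handles(uint64_t addr)
{
    /* made-up check: pretend vPCI owns this config space range */
    return addr < 0x1000;
}

static bool internal_dispatch(uint64_t addr, uint64_t *val)
{
    if ( !vpci_handles(addr) )
        return false;
    *val = 0xfee1deadULL;         /* handled entirely inside the hypervisor */
    return true;
}

static bool forward_to_external_server(uint64_t addr, uint64_t *val)
{
    /* stand-in for the ioreq_send()-style message to an IOREQ server */
    printf("forwarding %#llx to an external server\n",
           (unsigned long long)addr);
    *val = 0;
    return true;
}

static bool dispatch(uint64_t addr, uint64_t *val)
{
    return internal_dispatch(addr, val) ||
           forward_to_external_server(addr, val);
}

int main(void)
{
    uint64_t v;

    dispatch(0x10, &v);           /* served by the internal (vPCI) handler */
    dispatch(0x2000, &v);         /* would go to an external IOREQ server */
    return 0;
}

The property that matters for safety is that the internal branch never
needs anything outside the hypervisor to function.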


> > It is not my preference but I am OK to compromise and go ahead with the
> > architecture you proposed but please let's keep IOREQ servers out of the
> > PCI Passthrough picture at least initially.
> > 
> > 
> > > I'm obviously biased, but I think the latter is cleaner, and allows
> > > all resources to be arbitrated by the same component (IOREQ).
> > > 
> > > If the concern is about the IOREQ hypercall interface, it would be
> > > fine to introduce an option that limit IOREQs to internal users
> > > (vPCI) without supporting external IOREQ servers.
> > > 
> > > Think of IOREQ as a resource mediator inside of Xen, that just does
> > > the PCI address decoding and forwards the access to the interested
> > > party, either an external IOREQ server or vPCI.
> > 
> > The part about IOREQ (xen/common/ioreq.c) being a resource mediator is
> > OKish.
> > 
> > I had many discussions over the years with various members of the larger
> > open source embedded community (Linaro, etc.) and the problem is that
> > when one says "IOREQ" typically people think of QEMU or other userspace
> > emulators. They don't think of the Xen side of it. This becomes very
> > relevant here because Xen is the only part of the system that is
> > getting safety-certified and it is important to convey the message that
> > nothing else is required to be safety-certified to have a fully working
> > Xen system that supports PCI Passthrough.
> > 
> > In short, it is important that the community doesn't get the idea that
> > QEMU needs to be safety-certified to have PCI Passthrough working
> > correctly with Xen in a safety scenario.
> 
> Maybe we need to rename that internal IOREQ component to something
> else, and then IOREQ would strictly be limited to the hypercall
> interface + IOREQ servers.

Right. We could also keep the IOREQ name in Xen strictly for the parts
related to handling the message passing interface to QEMU, e.g. ioreq_send.

 
> Or maybe we just need more education/documentation around the
> difference between the internal side of IOREQs vs IOREQ servers vs
> QEMU.  See for example demu, which is an emulator for a PC-like
> compatible system using IOREQ servers:
> 
> https://xenbits.xen.org/gitweb/?p=people/pauldu/demu.git;a=summary
> 
> >  
> > > > > It's important that the interfaces we introduce are correct IMO,
> > > > > because that ends up reflecting on the configuration options that we
> > > > > expose to users on xl/libxl.  While both XEN_DOMCTL_createdomain and
> > > > > XEN_DOMCTL_assign_device are unstable interfaces, how the vPCI option
> > > > > gets placed there will ultimately influence how the option gets
> > > > > exposed in xl/libxl, and the interface there is relevant to keep
> > > > > stable for end user sanity.
> > > > 
> > > > I agree with you on the stable interfaces. The important part is not to
> > > > introduce changes to stable interfaces that could limit us in the
> > > > future. Specifically that includes xl and libxl, we need to be careful
> > > > there. But I don't see a single per-domain vPCI enable/disable option as
> > > > a problem. Let's say that in the future we have a mediated vGPU
> > > > implementation: if it works together with vPCI then the per-domain vPCI
> > > > option in libxl will be enabled (either explicitly or by default), if
> > > > it doesn't then vPCI will be disabled (either explicitly or by the
> > > > newer vGPU options.)
> > > 
> > > If vPCI is hooked into IOREQ there won't be a need anymore to register
> > > the vPCI config space traps, as that would be done by IOREQ, and hence
> > > vPCI managed devices could be registered at runtime with IOREQ.  IOW:
> > > there won't be a need anymore to signal at domain creation whether
> > > vPCI is intended to be used or not.
> > 
> > For safety, we have requirements to specify everything statically before
> > boot so typically anything dynamic is a problem.
> > 
> > 
> > > We would obviously need to enable IOREQ for all domains with IOMMU
> > > enabled, as it would be IOREQ that registers the PCI config space
> > > handlers.
> > 
> > This bit might be OK
> > 
> > 
> > > > For *unstable* interfaces (XEN_DOMCTL_assign_device) I would rather wait
> > > > before adding more changes on top of them, not because I don't care
> > > > about the mediated GPU problem (we do have something similar at AMD),
> > > > but because I worry that if we try to change them now we might not do a
> > > > good enough job. I would prefer to wait until we know more about the
> > > > actual use case, ideally with code supporting it.
> > > > 
> > > > I think the difference in points of view comes from the fact that I see
> > > > vPCI as the default, QEMU only as a limited peripheral emulator (or
> > > > mediator for the vGPU case) but not in control. vPCI and QEMU are not
> > > > equal in my view. vPCI is in charge and always present if not in very
> > > > uncommon setups (even if we decide to hook it inside Xen by using
> > > > internal IOREQ interfaces). QEMU might come and go.
> > > 
> > > Xen needs a single component that mediates accesses to resources,
> > > whether that's IOREQ, or something else I don't really care that much.
> > > Having vPCI mediate accesses to the PCI config space, and IOREQ to the
> > > memory (and on x86 IO port) space just seems awfully complicated for
> > > AFAICT no real benefit.
> > >
> > > > Now that I am writing this, I realize this is also why I wasn't too
> > > > happy with the idea of hooking vPCI using IOREQ. It makes them look as
> > > > if they are the same, while I don't think they should be considered at the
> > > > same level of priority, criticality, safety, integration in the system,
> > > > etc.
> > > 
> > > I feel there are some fears with IOREQ from a safety PoV?  The code
> > > that does the resource multiplexing is small, and as said above if
> > > there are safety concerns with the hypercall interface it would be
> > > fine to limit its usage to internal handlers only.
> > 
> > Yes, it is about safety. Everything within Xen will be safety-certified,
> > hence usable in a safety critical scenario; everything outside of Xen
> > might not be.
> > 
> > The fear is not on the IOREQ itself because xen/common/ioreq.c is part
> > of the certification scope. The fear is that IOREQ servers (e.g. QEMU)
> > are somehow in the picture when we discuss safety architectures with PCI
> > Passthrough, or that IOREQ servers could interfere with vPCI. By
> > "interfere" I mean that QEMU running in dom0 (a deprivileged dom0) will
> > be able to cause a malfunction in Xen vPCI.
> 
> For that purpose it doesn't matter much how IOREQs or vPCI interact,
> as any (buggy) interaction could possibly allow IOREQ to cause
> malfunctions to vPCI.

yep


> > Yes, limiting the hypercall interface would help in that regard because
> > it would limit Xen exposure.
> 
> That would be fine IMO, it could even be a Kconfig option if that
> better suits your needs.

OK. I think we are aligned.
diff mbox series

Patch

diff --git a/xen/arch/arm/Kconfig b/xen/arch/arm/Kconfig
index 5ff68e5d5979..3845b238a33f 100644
--- a/xen/arch/arm/Kconfig
+++ b/xen/arch/arm/Kconfig
@@ -195,6 +195,7 @@  config PCI_PASSTHROUGH
 	depends on ARM_64
 	select HAS_PCI
 	select HAS_VPCI
+	select HAS_VPCI_GUEST_SUPPORT
 	default n
 	help
 	  This option enables PCI device passthrough
diff --git a/xen/arch/arm/include/asm/domain.h b/xen/arch/arm/include/asm/domain.h
index 3614562eaefe..8e6d5fe9578c 100644
--- a/xen/arch/arm/include/asm/domain.h
+++ b/xen/arch/arm/include/asm/domain.h
@@ -31,6 +31,8 @@  enum domain_type {
 
 #define is_domain_direct_mapped(d) ((d)->cdf & CDF_directmap)
 
+#define arch_needs_vpci(d) ({ (void)(d); true; })
+
 /*
  * Is the domain using the host memory layout?
  *
diff --git a/xen/arch/arm/vpci.c b/xen/arch/arm/vpci.c
index 3bc4bb55082a..61e0edcedea9 100644
--- a/xen/arch/arm/vpci.c
+++ b/xen/arch/arm/vpci.c
@@ -2,6 +2,7 @@ 
 /*
  * xen/arch/arm/vpci.c
  */
+#include <xen/lib.h>
 #include <xen/sched.h>
 #include <xen/vpci.h>
 
@@ -90,8 +91,15 @@  int domain_vpci_init(struct domain *d)
             return ret;
     }
     else
+    {
+        if ( !IS_ENABLED(CONFIG_HAS_VPCI_GUEST_SUPPORT) )
+        {
+            gdprintk(XENLOG_ERR, "vPCI requested but guest support not enabled\n");
+            return -EINVAL;
+        }
         register_mmio_handler(d, &vpci_mmio_handler,
                               GUEST_VPCI_ECAM_BASE, GUEST_VPCI_ECAM_SIZE, NULL);
+    }
 
     return 0;
 }
diff --git a/xen/arch/x86/include/asm/domain.h b/xen/arch/x86/include/asm/domain.h
index cb02a4d1ebb2..d34015391eed 100644
--- a/xen/arch/x86/include/asm/domain.h
+++ b/xen/arch/x86/include/asm/domain.h
@@ -503,6 +503,8 @@  struct arch_domain
 #define has_vpit(d)        (!!((d)->arch.emulation_flags & X86_EMU_PIT))
 #define has_pirq(d)        (!!((d)->arch.emulation_flags & X86_EMU_USE_PIRQ))
 
+#define arch_needs_vpci(d) ({ (void)(d); false; })
+
 #define gdt_ldt_pt_idx(v) \
       ((v)->vcpu_id >> (PAGETABLE_ORDER - GDT_LDT_VCPU_SHIFT))
 #define pv_gdt_ptes(v) \
diff --git a/xen/common/domain.c b/xen/common/domain.c
index 12dc27428972..47d49c57bf83 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -692,6 +692,11 @@  struct domain *domain_create(domid_t domid,
 
     if ( !is_idle_domain(d) )
     {
+        err = -EINVAL;
+        if ( !is_hardware_domain(d) && (config->flags & XEN_DOMCTL_CDF_vpci) &&
+             !IS_ENABLED(CONFIG_HAS_VPCI_GUEST_SUPPORT) )
+            goto fail;
+
         if ( !is_hardware_domain(d) )
             d->nr_pirqs = nr_static_irqs + extra_domU_irqs;
         else
diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index 04d00c7c37df..2203725a2aa6 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -1542,6 +1542,18 @@  void iommu_dev_iotlb_flush_timeout(struct domain *d, struct pci_dev *pdev)
     pcidevs_unlock();
 }
 
+static bool needs_vpci(const struct domain *d)
+{
+    if ( is_hardware_domain(d) )
+        return false;
+
+    if ( d == dom_io )
+        /* xl pci-assignable-add assigns PCI devices to domIO */
+        return false;
+
+    return arch_needs_vpci(d);
+}
+
 int iommu_do_pci_domctl(
     struct xen_domctl *domctl, struct domain *d,
     XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl)
@@ -1618,6 +1630,14 @@  int iommu_do_pci_domctl(
         bus = PCI_BUS(machine_sbdf);
         devfn = PCI_DEVFN(machine_sbdf);
 
+        if ( needs_vpci(d) && !has_vpci(d) )
+        {
+            printk(XENLOG_G_WARNING "Cannot assign %pp to %pd: vPCI support not enabled\n",
+                   &PCI_SBDF(seg, bus, devfn), d);
+            ret = -EPERM;
+            break;
+        }
+
         pcidevs_lock();
         ret = device_assigned(seg, bus, devfn);
         if ( domctl->cmd == XEN_DOMCTL_test_assign_device )