Message ID: 155364082689.15803.7062874513041742278.stgit@gimli.home (mailing list archive)
State: New, archived
Series: [RFC] pci: Use PCI aliases when determining device IOMMU address space
On Tue, Mar 26, 2019 at 04:55:19PM -0600, Alex Williamson wrote: > Conventional PCI buses pre-date requester IDs. An IOMMU cannot > distinguish by devfn & bus between devices in a conventional PCI > topology and therefore we cannot assign them separate AddressSpaces. > By taking this requester ID aliasing into account, QEMU better matches > the bare metal behavior and restrictions, and enables shared > AddressSpace configurations that are otherwise not possible with > guest IOMMU support. > > For the latter case, given any example where an IOMMU group on the > host includes multiple devices: > > $ ls /sys/kernel/iommu_groups/1/devices/ > 0000:00:01.0 0000:01:00.0 0000:01:00.1 [1] > > If we incorporate a vIOMMU into the VM configuration, we're restricted > that we can only assign one of the endpoints to the guest because a > second endpoint will attempt to use a different AddressSpace. VFIO > only supports IOMMU group level granularity at the container level, > preventing this second endpoint from being assigned: > > qemu-system-x86_64 -machine q35... \ > -device intel-iommu,intremap=on \ > -device pcie-root-port,addr=1e.0,id=pcie.1 \ > -device vfio-pci,host=1:00.0,bus=pcie.1,addr=0.0,multifunction=on \ > -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1 > > qemu-system-x86_64: -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1: vfio \ > 0000:01:00.1: group 1 used in multiple address spaces > > However, when QEMU incorporates proper aliasing, we can make use of a > PCIe-to-PCI bridge to mask the requester ID, resulting in a hack that > provides the downstream devices with the same AddressSpace, ex: > > qemu-system-x86_64 -machine q35... 
\ > -device intel-iommu,intremap=on \ > -device pcie-pci-bridge,addr=1e.0,id=pci.1 \ > -device vfio-pci,host=1:00.0,bus=pci.1,addr=1.0,multifunction=on \ > -device vfio-pci,host=1:00.1,bus=pci.1,addr=1.1 > > While the utility of this hack may be limited, this AddressSpace > aliasing is the correct behavior for QEMU to emulate bare metal. > > Signed-off-by: Alex Williamson <alex.williamson@redhat.com> The patch looks sane to me even as a bug fix since otherwise the DMA address spaces used under misc kinds of PCI bridges can be wrong, so: Reviewed-by: Peter Xu <peterx@redhat.com> Though I have a question that confused me even before: Alex, do you know why all the context entries of the devices in the IOMMU root table will be programmed even if the devices are under a pcie-to-pci bridge? I'm giving an example with above [1] to be clear: in that case IIUC we'll program context entries for all three devices (00:01.0, 01:00.0, 01:00.1) but they'll point to the same IOMMU table. DMAs of devices 01:00.0 and 01:00.1 should always be tagged with 01:00.0 on bare metal, so why do we bother to program the context entry of 01:00.1? It seems never used. (It should be used for current QEMU to work with pcie-to-pci bridges without this patch, but I feel like I don't know the real answer behind it) Thanks,
Hi Alex, On 3/26/19 11:55 PM, Alex Williamson wrote: > Conventional PCI buses pre-date requester IDs. An IOMMU cannot > distinguish by devfn & bus between devices in a conventional PCI > topology and therefore we cannot assign them separate AddressSpaces. > By taking this requester ID aliasing into account, QEMU better matches > the bare metal behavior and restrictions, and enables shared > AddressSpace configurations that are otherwise not possible with > guest IOMMU support. > > For the latter case, given any example where an IOMMU group on the > host includes multiple devices: > > $ ls /sys/kernel/iommu_groups/1/devices/ > 0000:00:01.0 0000:01:00.0 0000:01:00.1 > > If we incorporate a vIOMMU into the VM configuration, we're restricted > that we can only assign one of the endpoints to the guest because a > second endpoint will attempt to use a different AddressSpace. VFIO > only supports IOMMU group level granularity at the container level, > preventing this second endpoint from being assigned: > > qemu-system-x86_64 -machine q35... \ > -device intel-iommu,intremap=on \ > -device pcie-root-port,addr=1e.0,id=pcie.1 \ > -device vfio-pci,host=1:00.0,bus=pcie.1,addr=0.0,multifunction=on \ > -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1 > > qemu-system-x86_64: -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1: vfio \ > 0000:01:00.1: group 1 used in multiple address spaces > > However, when QEMU incorporates proper aliasing, we can make use of a > PCIe-to-PCI bridge to mask the requester ID, resulting in a hack that > provides the downstream devices with the same AddressSpace, ex: > > qemu-system-x86_64 -machine q35... \ > -device intel-iommu,intremap=on \ > -device pcie-pci-bridge,addr=1e.0,id=pci.1 \ > -device vfio-pci,host=1:00.0,bus=pci.1,addr=1.0,multifunction=on \ > -device vfio-pci,host=1:00.1,bus=pci.1,addr=1.1 > > While the utility of this hack may be limited, this AddressSpace > aliasing is the correct behavior for QEMU to emulate bare metal. 
> > Signed-off-by: Alex Williamson <alex.williamson@redhat.com> > --- > hw/pci/pci.c | 33 +++++++++++++++++++++++++++++++-- > 1 file changed, 31 insertions(+), 2 deletions(-) > > diff --git a/hw/pci/pci.c b/hw/pci/pci.c > index 35451c1e9987..38467e676f1f 100644 > --- a/hw/pci/pci.c > +++ b/hw/pci/pci.c > @@ -2594,12 +2594,41 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev) > { > PCIBus *bus = pci_get_bus(dev); > PCIBus *iommu_bus = bus; > + uint8_t devfn = dev->devfn; > > while(iommu_bus && !iommu_bus->iommu_fn && iommu_bus->parent_dev) { > - iommu_bus = pci_get_bus(iommu_bus->parent_dev); > + PCIBus *parent_bus = pci_get_bus(iommu_bus->parent_dev); > + > + /* > + * Determine which requester ID alias should be used for the device > + * based on the PCI topology. There are no requester IDs on convetional conventional > + * PCI buses, therefore we push the alias up to the parent on each non- > + * express bus. Which alias we use depends on whether this is a legacy > + * PCI bridge or PCIe-to-PCI/X bridge as in chapter 2.3 of the PCIe-to- > + * PCI bridge spec. Note that we cannot use pci_requester_id() here > + * because the resulting BDF depends on the secondary bridge register Didn't you mean secondary bus number register? > + * programming. We also cannot lookup the PCIBus from the bus number > + * at this point for the iommu_fn. Also, requester_id_cache is the > + * alias to the root bus, which is usually, but not necessarily always > + * where we'll find our iommu_fn. 
> + */ > + if (!pci_bus_is_express(iommu_bus)) { > + PCIDevice *parent = iommu_bus->parent_dev; > + > + if (pci_is_express(parent) && > + pcie_cap_get_type(parent) == PCI_EXP_TYPE_PCI_BRIDGE) { > + devfn = PCI_DEVFN(0, 0); > + bus = iommu_bus; > + } else { > + devfn = parent->devfn; > + bus = parent_bus; > + } > + } > + > + iommu_bus = parent_bus; > } > if (iommu_bus && iommu_bus->iommu_fn) { > - return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, dev->devfn); > + return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, devfn); I think it would make sense to comment this iommu_fn() callback's role somewhere. > } > return &address_space_memory; > } > > Thanks Eric
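To make the control flow in the quoted diff easier to follow, here is a compilable sketch of the alias-resolution walk. The `Bus`/`Dev` structs and the `alias_for()` name are invented stand-ins for illustration only, not QEMU's actual `PCIBus`/`PCIDevice` types; only the loop logic mirrors the patch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PCI_DEVFN(slot, func)  ((((slot) & 0x1f) << 3) | ((func) & 0x07))

typedef struct Bus Bus;

typedef struct Dev {
    Bus *parent_bus;        /* the bus this bridge itself sits on */
    uint8_t devfn;
    bool is_express;
    bool is_pcie_to_pci;    /* PCIe-to-PCI/X bridge (PCI_EXP_TYPE_PCI_BRIDGE) */
} Dev;

struct Bus {
    Dev *parent_dev;        /* bridge leading to this bus, NULL at the root */
    bool is_express;
    bool has_iommu_fn;      /* an IOMMU is registered at this bus */
};

/* Compute the (bus, devfn) alias the IOMMU sees for a device at 'devfn' on
 * 'bus', mirroring the loop in the patch: each conventional bus pushes the
 * alias upward; a PCIe-to-PCI bridge aliases downstream devices to
 * secondary-bus:00.0, while a legacy PCI bridge aliases them to the
 * bridge's own bus/devfn. */
static void alias_for(Bus *bus, uint8_t devfn, Bus **out_bus, uint8_t *out_devfn)
{
    Bus *iommu_bus = bus;

    while (iommu_bus && !iommu_bus->has_iommu_fn && iommu_bus->parent_dev) {
        Bus *parent_bus = iommu_bus->parent_dev->parent_bus;

        if (!iommu_bus->is_express) {
            Dev *parent = iommu_bus->parent_dev;

            if (parent->is_express && parent->is_pcie_to_pci) {
                devfn = PCI_DEVFN(0, 0);    /* secondary-bus:00.0 */
                bus = iommu_bus;
            } else {
                devfn = parent->devfn;      /* legacy bridge's own BDF */
                bus = parent_bus;
            }
        }
        iommu_bus = parent_bus;
    }

    *out_bus = bus;
    *out_devfn = devfn;
}
```

With a conventional bus behind a pcie-pci-bridge, every downstream device resolves to the same secondary-bus:00.0 alias, which is exactly why both vfio-pci devices in the commit-message example end up in one AddressSpace.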
On Tue, Mar 26, 2019 at 04:55:19PM -0600, Alex Williamson wrote: > Conventional PCI buses pre-date requester IDs. An IOMMU cannot > distinguish by devfn & bus between devices in a conventional PCI > topology and therefore we cannot assign them separate AddressSpaces. > By taking this requester ID aliasing into account, QEMU better matches > the bare metal behavior and restrictions, and enables shared > AddressSpace configurations that are otherwise not possible with > guest IOMMU support. > > For the latter case, given any example where an IOMMU group on the > host includes multiple devices: > > $ ls /sys/kernel/iommu_groups/1/devices/ > 0000:00:01.0 0000:01:00.0 0000:01:00.1 > > If we incorporate a vIOMMU into the VM configuration, we're restricted > that we can only assign one of the endpoints to the guest because a > second endpoint will attempt to use a different AddressSpace. VFIO > only supports IOMMU group level granularity at the container level, > preventing this second endpoint from being assigned: > > qemu-system-x86_64 -machine q35... \ > -device intel-iommu,intremap=on \ > -device pcie-root-port,addr=1e.0,id=pcie.1 \ > -device vfio-pci,host=1:00.0,bus=pcie.1,addr=0.0,multifunction=on \ > -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1 > > qemu-system-x86_64: -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1: vfio \ > 0000:01:00.1: group 1 used in multiple address spaces > > However, when QEMU incorporates proper aliasing, we can make use of a > PCIe-to-PCI bridge to mask the requester ID, resulting in a hack that > provides the downstream devices with the same AddressSpace, ex: > > qemu-system-x86_64 -machine q35... \ > -device intel-iommu,intremap=on \ > -device pcie-pci-bridge,addr=1e.0,id=pci.1 \ > -device vfio-pci,host=1:00.0,bus=pci.1,addr=1.0,multifunction=on \ > -device vfio-pci,host=1:00.1,bus=pci.1,addr=1.1 This will break extended address space access though, won't it? 
things like attempts to take PCI Express bandwidth into consideration by drivers for E2E credit math will also not work ... > > While the utility of this hack may be limited, this AddressSpace > aliasing is the correct behavior for QEMU to emulate bare metal. This one's a valid point. > Signed-off-by: Alex Williamson <alex.williamson@redhat.com> > --- > hw/pci/pci.c | 33 +++++++++++++++++++++++++++++++-- > 1 file changed, 31 insertions(+), 2 deletions(-) > > diff --git a/hw/pci/pci.c b/hw/pci/pci.c > index 35451c1e9987..38467e676f1f 100644 > --- a/hw/pci/pci.c > +++ b/hw/pci/pci.c > @@ -2594,12 +2594,41 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev) > { > PCIBus *bus = pci_get_bus(dev); > PCIBus *iommu_bus = bus; > + uint8_t devfn = dev->devfn; > > while(iommu_bus && !iommu_bus->iommu_fn && iommu_bus->parent_dev) { > - iommu_bus = pci_get_bus(iommu_bus->parent_dev); > + PCIBus *parent_bus = pci_get_bus(iommu_bus->parent_dev); > + > + /* > + * Determine which requester ID alias should be used for the device > + * based on the PCI topology. There are no requester IDs on convetional > + * PCI buses, therefore we push the alias up to the parent on each non- > + * express bus. Which alias we use depends on whether this is a legacy > + * PCI bridge or PCIe-to-PCI/X bridge as in chapter 2.3 of the PCIe-to- > + * PCI bridge spec. Note that we cannot use pci_requester_id() here > + * because the resulting BDF depends on the secondary bridge register > + * programming. Could you clarify this part a bit pls? > We also cannot lookup the PCIBus from the bus number > + * at this point for the iommu_fn. Also, requester_id_cache is the > + * alias to the root bus, which is usually, but not necessarily always > + * where we'll find our iommu_fn. 
> + */ > + if (!pci_bus_is_express(iommu_bus)) { > + PCIDevice *parent = iommu_bus->parent_dev; > + > + if (pci_is_express(parent) && > + pcie_cap_get_type(parent) == PCI_EXP_TYPE_PCI_BRIDGE) { > + devfn = PCI_DEVFN(0, 0); Why 0,0? > + bus = iommu_bus; > + } else { > + devfn = parent->devfn; > + bus = parent_bus; > + } > + } > + > + iommu_bus = parent_bus; > } > if (iommu_bus && iommu_bus->iommu_fn) { > - return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, dev->devfn); > + return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, devfn); > } > return &address_space_memory; > }
On Wed, Mar 27, 2019 at 02:25:00PM +0800, Peter Xu wrote: > On Tue, Mar 26, 2019 at 04:55:19PM -0600, Alex Williamson wrote: > > Conventional PCI buses pre-date requester IDs. An IOMMU cannot > > distinguish by devfn & bus between devices in a conventional PCI > > topology and therefore we cannot assign them separate AddressSpaces. > > By taking this requester ID aliasing into account, QEMU better matches > > the bare metal behavior and restrictions, and enables shared > > AddressSpace configurations that are otherwise not possible with > > guest IOMMU support. > > > > For the latter case, given any example where an IOMMU group on the > > host includes multiple devices: > > > > $ ls /sys/kernel/iommu_groups/1/devices/ > > 0000:00:01.0 0000:01:00.0 0000:01:00.1 > > [1] > > > > > If we incorporate a vIOMMU into the VM configuration, we're restricted > > that we can only assign one of the endpoints to the guest because a > > second endpoint will attempt to use a different AddressSpace. VFIO > > only supports IOMMU group level granularity at the container level, > > preventing this second endpoint from being assigned: > > > > qemu-system-x86_64 -machine q35... \ > > -device intel-iommu,intremap=on \ > > -device pcie-root-port,addr=1e.0,id=pcie.1 \ > > -device vfio-pci,host=1:00.0,bus=pcie.1,addr=0.0,multifunction=on \ > > -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1 > > > > qemu-system-x86_64: -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1: vfio \ > > 0000:01:00.1: group 1 used in multiple address spaces > > > > However, when QEMU incorporates proper aliasing, we can make use of a > > PCIe-to-PCI bridge to mask the requester ID, resulting in a hack that > > provides the downstream devices with the same AddressSpace, ex: > > > > qemu-system-x86_64 -machine q35... 
\ > > -device intel-iommu,intremap=on \ > > -device pcie-pci-bridge,addr=1e.0,id=pci.1 \ > > -device vfio-pci,host=1:00.0,bus=pci.1,addr=1.0,multifunction=on \ > > -device vfio-pci,host=1:00.1,bus=pci.1,addr=1.1 > > > > While the utility of this hack may be limited, this AddressSpace > > aliasing is the correct behavior for QEMU to emulate bare metal. > > > > Signed-off-by: Alex Williamson <alex.williamson@redhat.com> > > The patch looks sane to me even as a bug fix since otherwise the DMA > address spaces used under misc kinds of PCI bridges can be wrong, so: > > Reviewed-by: Peter Xu <peterx@redhat.com> > > Though I have a question that confused me even before: Alex, do you > know why all the context entry of the devices in the IOMMU root table > will be programmed even if the devices are under a pcie-to-pci bridge? > I'm giving an example with above [1] to be clear: in that case IIUC > we'll program context entries for all the three devices (00:01.0, > 01:00.0, 01:00.1) but they'll point to the same IOMMU table. DMAs of > devices 01:00.0 and 01:00.1 should always been tagged with 01:00.0 on > bare metal What makes you think so? PCI Express spec says: Requester ID The combination of a Requester's Bus Number, Device Number, and Function Number that uniquely identifies the Requester. With an ARI Requester ID, bits traditionally used for the Device Number field are used instead to expand the Function Number field, and the Device Number is implied to be 0. > and then why we bother to program the context entry of > 01:00.1? It seems never used. > > (It should be used for current QEMU to work with pcie-to-pci bridges > if without this patch, but I feel like I don't know the real answer > behind) > > Thanks, > > -- > Peter Xu
On Wed, 27 Mar 2019 14:25:00 +0800 Peter Xu <peterx@redhat.com> wrote: > On Tue, Mar 26, 2019 at 04:55:19PM -0600, Alex Williamson wrote: > > Conventional PCI buses pre-date requester IDs. An IOMMU cannot > > distinguish by devfn & bus between devices in a conventional PCI > > topology and therefore we cannot assign them separate AddressSpaces. > > By taking this requester ID aliasing into account, QEMU better matches > > the bare metal behavior and restrictions, and enables shared > > AddressSpace configurations that are otherwise not possible with > > guest IOMMU support. > > > > For the latter case, given any example where an IOMMU group on the > > host includes multiple devices: > > > > $ ls /sys/kernel/iommu_groups/1/devices/ > > 0000:00:01.0 0000:01:00.0 0000:01:00.1 > > [1] > > > > > If we incorporate a vIOMMU into the VM configuration, we're restricted > > that we can only assign one of the endpoints to the guest because a > > second endpoint will attempt to use a different AddressSpace. VFIO > > only supports IOMMU group level granularity at the container level, > > preventing this second endpoint from being assigned: > > > > qemu-system-x86_64 -machine q35... \ > > -device intel-iommu,intremap=on \ > > -device pcie-root-port,addr=1e.0,id=pcie.1 \ > > -device vfio-pci,host=1:00.0,bus=pcie.1,addr=0.0,multifunction=on \ > > -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1 > > > > qemu-system-x86_64: -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1: vfio \ > > 0000:01:00.1: group 1 used in multiple address spaces > > > > However, when QEMU incorporates proper aliasing, we can make use of a > > PCIe-to-PCI bridge to mask the requester ID, resulting in a hack that > > provides the downstream devices with the same AddressSpace, ex: > > > > qemu-system-x86_64 -machine q35... 
\ > > -device intel-iommu,intremap=on \ > > -device pcie-pci-bridge,addr=1e.0,id=pci.1 \ > > -device vfio-pci,host=1:00.0,bus=pci.1,addr=1.0,multifunction=on \ > > -device vfio-pci,host=1:00.1,bus=pci.1,addr=1.1 > > > > While the utility of this hack may be limited, this AddressSpace > > aliasing is the correct behavior for QEMU to emulate bare metal. > > > > Signed-off-by: Alex Williamson <alex.williamson@redhat.com> > > The patch looks sane to me even as a bug fix since otherwise the DMA > address spaces used under misc kinds of PCI bridges can be wrong, so: I'm not sure if "as a bug fix" here is encouraging a 4.0 target, but I'd be cautious about this if so. Eric Auger noted that he's seen an SMMU VM hit a guest kernel bug-on, which needs further investigation. It's not clear if it's just an untested or unimplemented scenario for SMMU to see a conventional PCI bus or if there's something wrong in QEMU. I also haven't tested AMD IOMMU and only VT-d to a very limited degree, thus RFC. > Reviewed-by: Peter Xu <peterx@redhat.com> > > Though I have a question that confused me even before: Alex, do you > know why all the context entry of the devices in the IOMMU root table > will be programmed even if the devices are under a pcie-to-pci bridge? > I'm giving an example with above [1] to be clear: in that case IIUC > we'll program context entries for all the three devices (00:01.0, > 01:00.0, 01:00.1) but they'll point to the same IOMMU table. DMAs of > devices 01:00.0 and 01:00.1 should always been tagged with 01:00.0 on > bare metal and then why we bother to program the context entry of > 01:00.1? It seems never used. > > (It should be used for current QEMU to work with pcie-to-pci bridges > if without this patch, but I feel like I don't know the real answer > behind) We actually have two different scenarios that could be represented by [1], the group can be formed by lack of isolation or by lack of visibility. In the group above, it's the former, isolation. 
The PCIe root port does not support ACS, so while the IOMMU has visibility of the individual devices, peer-to-peer between devices may also be possible. Native, trusted, in-kernel drivers for these devices could still make use of separate IOMMU domains per device, but in order to expose the devices to a userspace driver we need to consider them a non-isolated set to prevent side-channel attacks between devices. We therefore consider them as a group within the IOMMU API and it's required that each context entry maps to the same domain as the IOMMU will see transactions for each requester ID. If we had the visibility case, such as if [1] represented a PCIe-to-PCI bridge subgroup, then the IOMMU really does only see the bridge requester ID and there may not be a reason to populate the context entries for the downstream aliased devices. Perhaps the IOMMU might still choose to do so, particularly if the bridge is actually a PCI-X bridge as PCI-X does incorporate a requester ID, but also has strange rules about the bridge being able to claim ownership of the transaction. So it might be paranoia or simplification that causes all the context entries to be programmed, or for alias quirks, uncertainty if a device exclusively uses a quirked requester ID or might sometimes use the proper requester ID. In the example I present, we're taking [1], which could be either case above, and converting it into the visibility case in order to force the IOMMU to handle the devices within a single address space. Thanks, Alex
Hi, On 3/27/19 4:32 PM, Michael S. Tsirkin wrote: > On Tue, Mar 26, 2019 at 04:55:19PM -0600, Alex Williamson wrote: >> Conventional PCI buses pre-date requester IDs. An IOMMU cannot >> distinguish by devfn & bus between devices in a conventional PCI >> topology and therefore we cannot assign them separate AddressSpaces. >> By taking this requester ID aliasing into account, QEMU better matches >> the bare metal behavior and restrictions, and enables shared >> AddressSpace configurations that are otherwise not possible with >> guest IOMMU support. >> >> For the latter case, given any example where an IOMMU group on the >> host includes multiple devices: >> >> $ ls /sys/kernel/iommu_groups/1/devices/ >> 0000:00:01.0 0000:01:00.0 0000:01:00.1 >> >> If we incorporate a vIOMMU into the VM configuration, we're restricted >> that we can only assign one of the endpoints to the guest because a >> second endpoint will attempt to use a different AddressSpace. VFIO >> only supports IOMMU group level granularity at the container level, >> preventing this second endpoint from being assigned: >> >> qemu-system-x86_64 -machine q35... \ >> -device intel-iommu,intremap=on \ >> -device pcie-root-port,addr=1e.0,id=pcie.1 \ >> -device vfio-pci,host=1:00.0,bus=pcie.1,addr=0.0,multifunction=on \ >> -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1 >> >> qemu-system-x86_64: -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1: vfio \ >> 0000:01:00.1: group 1 used in multiple address spaces >> >> However, when QEMU incorporates proper aliasing, we can make use of a >> PCIe-to-PCI bridge to mask the requester ID, resulting in a hack that >> provides the downstream devices with the same AddressSpace, ex: >> >> qemu-system-x86_64 -machine q35... 
\ >> -device intel-iommu,intremap=on \ >> -device pcie-pci-bridge,addr=1e.0,id=pci.1 \ >> -device vfio-pci,host=1:00.0,bus=pci.1,addr=1.0,multifunction=on \ >> -device vfio-pci,host=1:00.1,bus=pci.1,addr=1.1 > > > This will break extended address space access though, won't it? > things like attempts to take PCI Express badnwidth into > consideration by drivers for E2E credit math will also > not work ... > >> >> While the utility of this hack may be limited, this AddressSpace >> aliasing is the correct behavior for QEMU to emulate bare metal. > > > This one's a valid point. > >> Signed-off-by: Alex Williamson <alex.williamson@redhat.com> >> --- >> hw/pci/pci.c | 33 +++++++++++++++++++++++++++++++-- >> 1 file changed, 31 insertions(+), 2 deletions(-) >> >> diff --git a/hw/pci/pci.c b/hw/pci/pci.c >> index 35451c1e9987..38467e676f1f 100644 >> --- a/hw/pci/pci.c >> +++ b/hw/pci/pci.c >> @@ -2594,12 +2594,41 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev) >> { >> PCIBus *bus = pci_get_bus(dev); >> PCIBus *iommu_bus = bus; >> + uint8_t devfn = dev->devfn; >> >> while(iommu_bus && !iommu_bus->iommu_fn && iommu_bus->parent_dev) { >> - iommu_bus = pci_get_bus(iommu_bus->parent_dev); >> + PCIBus *parent_bus = pci_get_bus(iommu_bus->parent_dev); >> + >> + /* >> + * Determine which requester ID alias should be used for the device >> + * based on the PCI topology. There are no requester IDs on convetional >> + * PCI buses, therefore we push the alias up to the parent on each non- >> + * express bus. Which alias we use depends on whether this is a legacy >> + * PCI bridge or PCIe-to-PCI/X bridge as in chapter 2.3 of the PCIe-to- >> + * PCI bridge spec. Note that we cannot use pci_requester_id() here >> + * because the resulting BDF depends on the secondary bridge register >> + * programming. > > Could you clarify this part a bit pls? > >> We also cannot lookup the PCIBus from the bus number >> + * at this point for the iommu_fn. 
Also, requester_id_cache is the >> + * alias to the root bus, which is usually, but not necessarily always >> + * where we'll find our iommu_fn. >> + */ >> + if (!pci_bus_is_express(iommu_bus)) { >> + PCIDevice *parent = iommu_bus->parent_dev; >> + >> + if (pci_is_express(parent) && >> + pcie_cap_get_type(parent) == PCI_EXP_TYPE_PCI_BRIDGE) { >> + devfn = PCI_DEVFN(0, 0); > > Why 0,0? The spec says: "If the bridge generates a new Requester ID for a transaction forwarded from the secondary interface to the primary interface, the bridge assigns the PCI Express Requester ID using its secondary interface’s Bus Number and sets both the Device Number and Function Number fields to zero.". I guess it is the reason? > >> + bus = iommu_bus; >> + } else { >> + devfn = parent->devfn; >> + bus = parent_bus; Alex, please could you clarify this case as well , comment? Thanks Eric >> + } >> + } >> + >> + iommu_bus = parent_bus; >> } >> if (iommu_bus && iommu_bus->iommu_fn) { >> - return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, dev->devfn); >> + return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, devfn); >> } >> return &address_space_memory; >> }
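The bridge-spec text Eric quotes maps directly onto the bit layout of a requester ID, which answers the "why 0,0?" question arithmetically. A small illustrative helper (names are invented for this sketch, not QEMU functions) shows how the taken-ownership alias comes out as secondary-bus:00.0:

```c
#include <assert.h>
#include <stdint.h>

#define PCI_DEVFN(slot, func)  ((((slot) & 0x1f) << 3) | ((func) & 0x07))

/* A requester ID packs bus[15:8], device[7:3], function[2:0]. */
static uint16_t requester_id(uint8_t bus, uint8_t devfn)
{
    return ((uint16_t)bus << 8) | devfn;
}

/* Per the quoted spec text: when a PCIe-to-PCI bridge generates a new
 * Requester ID for a transaction forwarded from its secondary interface,
 * it uses the secondary interface's bus number with device and function
 * numbers both zero. */
static uint16_t bridge_owned_rid(uint8_t secondary_bus)
{
    return requester_id(secondary_bus, PCI_DEVFN(0, 0));
}
```

So every device behind the bridge, whatever its own devfn, is seen upstream under the single ID `secondary_bus:00.0`, which is the alias the patch selects with `devfn = PCI_DEVFN(0, 0)`.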
On Wed, Mar 27, 2019 at 05:43:45PM +0100, Auger Eric wrote: > Hi, > > On 3/27/19 4:32 PM, Michael S. Tsirkin wrote: > > On Tue, Mar 26, 2019 at 04:55:19PM -0600, Alex Williamson wrote: > >> Conventional PCI buses pre-date requester IDs. An IOMMU cannot > >> distinguish by devfn & bus between devices in a conventional PCI > >> topology and therefore we cannot assign them separate AddressSpaces. > >> By taking this requester ID aliasing into account, QEMU better matches > >> the bare metal behavior and restrictions, and enables shared > >> AddressSpace configurations that are otherwise not possible with > >> guest IOMMU support. > >> > >> For the latter case, given any example where an IOMMU group on the > >> host includes multiple devices: > >> > >> $ ls /sys/kernel/iommu_groups/1/devices/ > >> 0000:00:01.0 0000:01:00.0 0000:01:00.1 > >> > >> If we incorporate a vIOMMU into the VM configuration, we're restricted > >> that we can only assign one of the endpoints to the guest because a > >> second endpoint will attempt to use a different AddressSpace. VFIO > >> only supports IOMMU group level granularity at the container level, > >> preventing this second endpoint from being assigned: > >> > >> qemu-system-x86_64 -machine q35... \ > >> -device intel-iommu,intremap=on \ > >> -device pcie-root-port,addr=1e.0,id=pcie.1 \ > >> -device vfio-pci,host=1:00.0,bus=pcie.1,addr=0.0,multifunction=on \ > >> -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1 > >> > >> qemu-system-x86_64: -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1: vfio \ > >> 0000:01:00.1: group 1 used in multiple address spaces > >> > >> However, when QEMU incorporates proper aliasing, we can make use of a > >> PCIe-to-PCI bridge to mask the requester ID, resulting in a hack that > >> provides the downstream devices with the same AddressSpace, ex: > >> > >> qemu-system-x86_64 -machine q35... 
\ > >> -device intel-iommu,intremap=on \ > >> -device pcie-pci-bridge,addr=1e.0,id=pci.1 \ > >> -device vfio-pci,host=1:00.0,bus=pci.1,addr=1.0,multifunction=on \ > >> -device vfio-pci,host=1:00.1,bus=pci.1,addr=1.1 > > > > > > This will break extended address space access though, won't it? > > things like attempts to take PCI Express badnwidth into > > consideration by drivers for E2E credit math will also > > not work ... > > > >> > >> While the utility of this hack may be limited, this AddressSpace > >> aliasing is the correct behavior for QEMU to emulate bare metal. > > > > > > This one's a valid point. > > > >> Signed-off-by: Alex Williamson <alex.williamson@redhat.com> > >> --- > >> hw/pci/pci.c | 33 +++++++++++++++++++++++++++++++-- > >> 1 file changed, 31 insertions(+), 2 deletions(-) > >> > >> diff --git a/hw/pci/pci.c b/hw/pci/pci.c > >> index 35451c1e9987..38467e676f1f 100644 > >> --- a/hw/pci/pci.c > >> +++ b/hw/pci/pci.c > >> @@ -2594,12 +2594,41 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev) > >> { > >> PCIBus *bus = pci_get_bus(dev); > >> PCIBus *iommu_bus = bus; > >> + uint8_t devfn = dev->devfn; > >> > >> while(iommu_bus && !iommu_bus->iommu_fn && iommu_bus->parent_dev) { > >> - iommu_bus = pci_get_bus(iommu_bus->parent_dev); > >> + PCIBus *parent_bus = pci_get_bus(iommu_bus->parent_dev); > >> + > >> + /* > >> + * Determine which requester ID alias should be used for the device > >> + * based on the PCI topology. There are no requester IDs on convetional > >> + * PCI buses, therefore we push the alias up to the parent on each non- > >> + * express bus. Which alias we use depends on whether this is a legacy > >> + * PCI bridge or PCIe-to-PCI/X bridge as in chapter 2.3 of the PCIe-to- > >> + * PCI bridge spec. Note that we cannot use pci_requester_id() here > >> + * because the resulting BDF depends on the secondary bridge register > >> + * programming. > > > > Could you clarify this part a bit pls? 
> > > >> We also cannot lookup the PCIBus from the bus number > >> + * at this point for the iommu_fn. Also, requester_id_cache is the > >> + * alias to the root bus, which is usually, but not necessarily always > >> + * where we'll find our iommu_fn. > >> + */ > >> + if (!pci_bus_is_express(iommu_bus)) { > >> + PCIDevice *parent = iommu_bus->parent_dev; > >> + > >> + if (pci_is_express(parent) && > >> + pcie_cap_get_type(parent) == PCI_EXP_TYPE_PCI_BRIDGE) { > >> + devfn = PCI_DEVFN(0, 0); > > > > Why 0,0? > > The spec says: "If the bridge generates a new Requester ID for a > transaction forwarded from the secondary interface to the primary > interface, the bridge assigns the PCI Express Requester ID using its > secondary interface’s Bus Number and sets both the Device Number and > Function Number fields to zero.". I guess it is the reason? Ah. > > > >> + bus = iommu_bus; > >> + } else { > >> + devfn = parent->devfn; > >> + bus = parent_bus; > Alex, please could you clarify this case as well , comment? > > Thanks > > Eric right. > >> + } > >> + } > >> + > >> + iommu_bus = parent_bus; > >> } > >> if (iommu_bus && iommu_bus->iommu_fn) { > >> - return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, dev->devfn); > >> + return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, devfn); > >> } > >> return &address_space_memory; > >> }
On Wed, 27 Mar 2019 11:32:55 -0400 "Michael S. Tsirkin" <mst@redhat.com> wrote: > On Tue, Mar 26, 2019 at 04:55:19PM -0600, Alex Williamson wrote: > > Conventional PCI buses pre-date requester IDs. An IOMMU cannot > > distinguish by devfn & bus between devices in a conventional PCI > > topology and therefore we cannot assign them separate AddressSpaces. > > By taking this requester ID aliasing into account, QEMU better matches > > the bare metal behavior and restrictions, and enables shared > > AddressSpace configurations that are otherwise not possible with > > guest IOMMU support. > > > > For the latter case, given any example where an IOMMU group on the > > host includes multiple devices: > > > > $ ls /sys/kernel/iommu_groups/1/devices/ > > 0000:00:01.0 0000:01:00.0 0000:01:00.1 > > > > If we incorporate a vIOMMU into the VM configuration, we're restricted > > that we can only assign one of the endpoints to the guest because a > > second endpoint will attempt to use a different AddressSpace. VFIO > > only supports IOMMU group level granularity at the container level, > > preventing this second endpoint from being assigned: > > > > qemu-system-x86_64 -machine q35... \ > > -device intel-iommu,intremap=on \ > > -device pcie-root-port,addr=1e.0,id=pcie.1 \ > > -device vfio-pci,host=1:00.0,bus=pcie.1,addr=0.0,multifunction=on \ > > -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1 > > > > qemu-system-x86_64: -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1: vfio \ > > 0000:01:00.1: group 1 used in multiple address spaces > > > > However, when QEMU incorporates proper aliasing, we can make use of a > > PCIe-to-PCI bridge to mask the requester ID, resulting in a hack that > > provides the downstream devices with the same AddressSpace, ex: > > > > qemu-system-x86_64 -machine q35... 
\ > > -device intel-iommu,intremap=on \ > > -device pcie-pci-bridge,addr=1e.0,id=pci.1 \ > > -device vfio-pci,host=1:00.0,bus=pci.1,addr=1.0,multifunction=on \ > > -device vfio-pci,host=1:00.1,bus=pci.1,addr=1.1 > > > This will break extended address space access though, won't it? > things like attempts to take PCI Express bandwidth into > consideration by drivers for E2E credit math will also > not work ... Correct. As I explain in my reply to Peter, we're forcing the IOMMU to use a common address space for these devices, and the only hack we have to do that is to introduce a conventional PCI bus. PCIe-to-PCI bridges are required to respond with an unsupported request to extended config space on the downstream devices. This is part of the loss of utility I mention below. Perhaps we can add a couple more layers of hacks if those sorts of things are important. PCI-X supports extended config space, so (without digging through the spec) I assume a PCIe-to-PCIX bridge could pass extended config space. Additionally, if we had the same bridge in reverse mode, for something like this: [RC]-[PCIe-to-PCIX]-[PCIX-to-PCIe]-[PCIe endpoint] Then we would have access to link and extended config space while maintaining the single address space. Hack^3 > > While the utility of this hack may be limited, this AddressSpace > > aliasing is the correct behavior for QEMU to emulate bare metal. > > > This one's a valid point.
> > > Signed-off-by: Alex Williamson <alex.williamson@redhat.com> > > --- > > hw/pci/pci.c | 33 +++++++++++++++++++++++++++++++-- > > 1 file changed, 31 insertions(+), 2 deletions(-) > > > > diff --git a/hw/pci/pci.c b/hw/pci/pci.c > > index 35451c1e9987..38467e676f1f 100644 > > --- a/hw/pci/pci.c > > +++ b/hw/pci/pci.c > > @@ -2594,12 +2594,41 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev) > > { > > PCIBus *bus = pci_get_bus(dev); > > PCIBus *iommu_bus = bus; > > + uint8_t devfn = dev->devfn; > > > > while(iommu_bus && !iommu_bus->iommu_fn && iommu_bus->parent_dev) { > > - iommu_bus = pci_get_bus(iommu_bus->parent_dev); > > + PCIBus *parent_bus = pci_get_bus(iommu_bus->parent_dev); > > + > > + /* > > + * Determine which requester ID alias should be used for the device > > + * based on the PCI topology. There are no requester IDs on convetional > > + * PCI buses, therefore we push the alias up to the parent on each non- > > + * express bus. Which alias we use depends on whether this is a legacy > > + * PCI bridge or PCIe-to-PCI/X bridge as in chapter 2.3 of the PCIe-to- > > + * PCI bridge spec. Note that we cannot use pci_requester_id() here > > + * because the resulting BDF depends on the secondary bridge register > > + * programming. > > Could you clarify this part a bit pls? We use the same algorithm here as we do to come up with requester_id_cache on the PCIDevice (modulo stopping at the iommu_fn bus), but pci_req_id_cache_extract() is only useful once the bridge secondary bus number register is programmed. Until then, the bus part of the BDF is always zero, which means we can't even lookup the PCIBus from the bus number. Since we're setting up the IOMMU AddressSpace during device realize, we cannot presume any bus number programming. NB. 
None of the IOMMUs actually seem to care about the PCIBus at this point, the pointer is simply used as an opaque token for a hash for the bus, but that also means there's really no good substitute for it either. > > We also cannot lookup the PCIBus from the bus number > > + * at this point for the iommu_fn. Also, requester_id_cache is the > > + * alias to the root bus, which is usually, but not necessarily always > > + * where we'll find our iommu_fn. > > + */ > > + if (!pci_bus_is_express(iommu_bus)) { > > + PCIDevice *parent = iommu_bus->parent_dev; > > + > > + if (pci_is_express(parent) && > > + pcie_cap_get_type(parent) == PCI_EXP_TYPE_PCI_BRIDGE) { > > + devfn = PCI_DEVFN(0, 0); > > Why 0,0? See also pci_req_id_cache_get(), this is defined in the PCIe-to-PCI bridge spec. Bridges following this spec use a fixed requester ID for all downstream devices in order to differentiate requests from the bridge itself vs downstream devices. pci_req_id_cache_extract() is where this gets applied for the PCIReqIDCache version. Thanks, Alex > > + bus = iommu_bus; > > + } else { > > + devfn = parent->devfn; > > + bus = parent_bus; > > + } > > + } > > + > > + iommu_bus = parent_bus; > > } > > if (iommu_bus && iommu_bus->iommu_fn) { > > - return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, dev->devfn); > > + return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, devfn); > > } > > return &address_space_memory; > > }
On Wed, 27 Mar 2019 17:43:45 +0100 Auger Eric <eric.auger@redhat.com> wrote: > Hi, > > On 3/27/19 4:32 PM, Michael S. Tsirkin wrote: > > On Tue, Mar 26, 2019 at 04:55:19PM -0600, Alex Williamson wrote: > >> Conventional PCI buses pre-date requester IDs. An IOMMU cannot > >> distinguish by devfn & bus between devices in a conventional PCI > >> topology and therefore we cannot assign them separate AddressSpaces. > >> By taking this requester ID aliasing into account, QEMU better matches > >> the bare metal behavior and restrictions, and enables shared > >> AddressSpace configurations that are otherwise not possible with > >> guest IOMMU support. > >> > >> For the latter case, given any example where an IOMMU group on the > >> host includes multiple devices: > >> > >> $ ls /sys/kernel/iommu_groups/1/devices/ > >> 0000:00:01.0 0000:01:00.0 0000:01:00.1 > >> > >> If we incorporate a vIOMMU into the VM configuration, we're restricted > >> that we can only assign one of the endpoints to the guest because a > >> second endpoint will attempt to use a different AddressSpace. VFIO > >> only supports IOMMU group level granularity at the container level, > >> preventing this second endpoint from being assigned: > >> > >> qemu-system-x86_64 -machine q35... \ > >> -device intel-iommu,intremap=on \ > >> -device pcie-root-port,addr=1e.0,id=pcie.1 \ > >> -device vfio-pci,host=1:00.0,bus=pcie.1,addr=0.0,multifunction=on \ > >> -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1 > >> > >> qemu-system-x86_64: -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1: vfio \ > >> 0000:01:00.1: group 1 used in multiple address spaces > >> > >> However, when QEMU incorporates proper aliasing, we can make use of a > >> PCIe-to-PCI bridge to mask the requester ID, resulting in a hack that > >> provides the downstream devices with the same AddressSpace, ex: > >> > >> qemu-system-x86_64 -machine q35... 
\ > >> -device intel-iommu,intremap=on \ > >> -device pcie-pci-bridge,addr=1e.0,id=pci.1 \ > >> -device vfio-pci,host=1:00.0,bus=pci.1,addr=1.0,multifunction=on \ > >> -device vfio-pci,host=1:00.1,bus=pci.1,addr=1.1 > > > > > > This will break extended address space access though, won't it? > > things like attempts to take PCI Express badnwidth into > > consideration by drivers for E2E credit math will also > > not work ... > > > >> > >> While the utility of this hack may be limited, this AddressSpace > >> aliasing is the correct behavior for QEMU to emulate bare metal. > > > > > > This one's a valid point. > > > >> Signed-off-by: Alex Williamson <alex.williamson@redhat.com> > >> --- > >> hw/pci/pci.c | 33 +++++++++++++++++++++++++++++++-- > >> 1 file changed, 31 insertions(+), 2 deletions(-) > >> > >> diff --git a/hw/pci/pci.c b/hw/pci/pci.c > >> index 35451c1e9987..38467e676f1f 100644 > >> --- a/hw/pci/pci.c > >> +++ b/hw/pci/pci.c > >> @@ -2594,12 +2594,41 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev) > >> { > >> PCIBus *bus = pci_get_bus(dev); > >> PCIBus *iommu_bus = bus; > >> + uint8_t devfn = dev->devfn; > >> > >> while(iommu_bus && !iommu_bus->iommu_fn && iommu_bus->parent_dev) { > >> - iommu_bus = pci_get_bus(iommu_bus->parent_dev); > >> + PCIBus *parent_bus = pci_get_bus(iommu_bus->parent_dev); > >> + > >> + /* > >> + * Determine which requester ID alias should be used for the device > >> + * based on the PCI topology. There are no requester IDs on convetional > >> + * PCI buses, therefore we push the alias up to the parent on each non- > >> + * express bus. Which alias we use depends on whether this is a legacy > >> + * PCI bridge or PCIe-to-PCI/X bridge as in chapter 2.3 of the PCIe-to- > >> + * PCI bridge spec. Note that we cannot use pci_requester_id() here > >> + * because the resulting BDF depends on the secondary bridge register > >> + * programming. > > > > Could you clarify this part a bit pls? 
> > > >> We also cannot lookup the PCIBus from the bus number > >> + * at this point for the iommu_fn. Also, requester_id_cache is the > >> + * alias to the root bus, which is usually, but not necessarily always > >> + * where we'll find our iommu_fn. > >> + */ > >> + if (!pci_bus_is_express(iommu_bus)) { > >> + PCIDevice *parent = iommu_bus->parent_dev; > >> + > >> + if (pci_is_express(parent) && > >> + pcie_cap_get_type(parent) == PCI_EXP_TYPE_PCI_BRIDGE) { > >> + devfn = PCI_DEVFN(0, 0); > > > > Why 0,0? > > The spec says: "If the bridge generates a new Requester ID for a > transaction forwarded from the secondary interface to the primary > interface, the bridge assigns the PCI Express Requester ID using its > secondary interface’s Bus Number and sets both the Device Number and > Function Number fields to zero.". I guess it is the reason? Yup, PCI Express to PCI/PCI-X Bridge Spec, rev 1.0, sec 2.3. > > > >> + bus = iommu_bus; > >> + } else { > >> + devfn = parent->devfn; > >> + bus = parent_bus; > Alex, please could you clarify this case as well , comment? IIRC, this one is more of a "because that's how it's implemented" situation. pci_req_id_cache_get() also handles this scenario and refers to it as legacy PCI and the same algorithm works here. So long as we're on a non-express bus, we continue to push the DMA alias device up to the parent device. If we end on a bridge that does not have a PCIe capability to expose itself as a PCIe-to-PCI bridge, then it probably uses itself as the requester ID. This is not without exception though, see drivers/pci/search.c:pci_for_each_dma_alias in the kernel to see that there are some (few) quirked bridges that use the PCIe-to-PCI algorithm. In our case here it barely matters, it's only a difference of whether the bridge itself is considered within the same AddressSpace as the downstream device and we use the algorithm that is more generally correct to match bare metal. 
Ideally we'd not see this case in QEMU, but IIRC it's possible to put a PCI-to-PCI bridge on a PCIe bus to generate it. Thanks, Alex
On Wed, 27 Mar 2019 12:46:41 +0100 Auger Eric <eric.auger@redhat.com> wrote: > Hi Alex, > > On 3/26/19 11:55 PM, Alex Williamson wrote: > > Conventional PCI buses pre-date requester IDs. An IOMMU cannot > > distinguish by devfn & bus between devices in a conventional PCI > > topology and therefore we cannot assign them separate AddressSpaces. > > By taking this requester ID aliasing into account, QEMU better matches > > the bare metal behavior and restrictions, and enables shared > > AddressSpace configurations that are otherwise not possible with > > guest IOMMU support. > > > > For the latter case, given any example where an IOMMU group on the > > host includes multiple devices: > > > > $ ls /sys/kernel/iommu_groups/1/devices/ > > 0000:00:01.0 0000:01:00.0 0000:01:00.1 > > > > If we incorporate a vIOMMU into the VM configuration, we're restricted > > that we can only assign one of the endpoints to the guest because a > > second endpoint will attempt to use a different AddressSpace. VFIO > > only supports IOMMU group level granularity at the container level, > > preventing this second endpoint from being assigned: > > > > qemu-system-x86_64 -machine q35... \ > > -device intel-iommu,intremap=on \ > > -device pcie-root-port,addr=1e.0,id=pcie.1 \ > > -device vfio-pci,host=1:00.0,bus=pcie.1,addr=0.0,multifunction=on \ > > -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1 > > > > qemu-system-x86_64: -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1: vfio \ > > 0000:01:00.1: group 1 used in multiple address spaces > > > > However, when QEMU incorporates proper aliasing, we can make use of a > > PCIe-to-PCI bridge to mask the requester ID, resulting in a hack that > > provides the downstream devices with the same AddressSpace, ex: > > > > qemu-system-x86_64 -machine q35... 
\ > > -device intel-iommu,intremap=on \ > > -device pcie-pci-bridge,addr=1e.0,id=pci.1 \ > > -device vfio-pci,host=1:00.0,bus=pci.1,addr=1.0,multifunction=on \ > > -device vfio-pci,host=1:00.1,bus=pci.1,addr=1.1 > > > > While the utility of this hack may be limited, this AddressSpace > > aliasing is the correct behavior for QEMU to emulate bare metal. > > > > Signed-off-by: Alex Williamson <alex.williamson@redhat.com> > > --- > > hw/pci/pci.c | 33 +++++++++++++++++++++++++++++++-- > > 1 file changed, 31 insertions(+), 2 deletions(-) > > > > diff --git a/hw/pci/pci.c b/hw/pci/pci.c > > index 35451c1e9987..38467e676f1f 100644 > > --- a/hw/pci/pci.c > > +++ b/hw/pci/pci.c > > @@ -2594,12 +2594,41 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev) > > { > > PCIBus *bus = pci_get_bus(dev); > > PCIBus *iommu_bus = bus; > > + uint8_t devfn = dev->devfn; > > > > while(iommu_bus && !iommu_bus->iommu_fn && iommu_bus->parent_dev) { > > - iommu_bus = pci_get_bus(iommu_bus->parent_dev); > > + PCIBus *parent_bus = pci_get_bus(iommu_bus->parent_dev); > > + > > + /* > > + * Determine which requester ID alias should be used for the device > > + * based on the PCI topology. There are no requester IDs on convetional > conventional Oops, fixed > > + * PCI buses, therefore we push the alias up to the parent on each non- > > + * express bus. Which alias we use depends on whether this is a legacy > > + * PCI bridge or PCIe-to-PCI/X bridge as in chapter 2.3 of the PCIe-to- > > + * PCI bridge spec. Note that we cannot use pci_requester_id() here > > + * because the resulting BDF depends on the secondary bridge register > Didn't you mean secondary bus number register? Yes, fixed > > + * programming. We also cannot lookup the PCIBus from the bus number > > + * at this point for the iommu_fn. Also, requester_id_cache is the > > + * alias to the root bus, which is usually, but not necessarily always > > + * where we'll find our iommu_fn. 
> > + */ > > + if (!pci_bus_is_express(iommu_bus)) { > > + PCIDevice *parent = iommu_bus->parent_dev; > > + > > + if (pci_is_express(parent) && > > + pcie_cap_get_type(parent) == PCI_EXP_TYPE_PCI_BRIDGE) { > > + devfn = PCI_DEVFN(0, 0); > > + bus = iommu_bus; > > + } else { > > + devfn = parent->devfn; > > + bus = parent_bus; > > + } > > + } > > + > > + iommu_bus = parent_bus; > > } > > if (iommu_bus && iommu_bus->iommu_fn) { > > - return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, dev->devfn); > > + return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, devfn); > I think it would make sense to comment this iommu_fn() callback's role > somewhere. While I agree, it seems a bit out of scope of this patch. Or are you suggesting that this patch fundamentally changes the role rather than trying to make it work as intended? Thanks, Alex
Hi Alex, On 3/27/19 7:02 PM, Alex Williamson wrote: > On Wed, 27 Mar 2019 12:46:41 +0100 > Auger Eric <eric.auger@redhat.com> wrote: > >> Hi Alex, >> >> On 3/26/19 11:55 PM, Alex Williamson wrote: >>> Conventional PCI buses pre-date requester IDs. An IOMMU cannot >>> distinguish by devfn & bus between devices in a conventional PCI >>> topology and therefore we cannot assign them separate AddressSpaces. >>> By taking this requester ID aliasing into account, QEMU better matches >>> the bare metal behavior and restrictions, and enables shared >>> AddressSpace configurations that are otherwise not possible with >>> guest IOMMU support. >>> >>> For the latter case, given any example where an IOMMU group on the >>> host includes multiple devices: >>> >>> $ ls /sys/kernel/iommu_groups/1/devices/ >>> 0000:00:01.0 0000:01:00.0 0000:01:00.1 >>> >>> If we incorporate a vIOMMU into the VM configuration, we're restricted >>> that we can only assign one of the endpoints to the guest because a >>> second endpoint will attempt to use a different AddressSpace. VFIO >>> only supports IOMMU group level granularity at the container level, >>> preventing this second endpoint from being assigned: >>> >>> qemu-system-x86_64 -machine q35... \ >>> -device intel-iommu,intremap=on \ >>> -device pcie-root-port,addr=1e.0,id=pcie.1 \ >>> -device vfio-pci,host=1:00.0,bus=pcie.1,addr=0.0,multifunction=on \ >>> -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1 >>> >>> qemu-system-x86_64: -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1: vfio \ >>> 0000:01:00.1: group 1 used in multiple address spaces >>> >>> However, when QEMU incorporates proper aliasing, we can make use of a >>> PCIe-to-PCI bridge to mask the requester ID, resulting in a hack that >>> provides the downstream devices with the same AddressSpace, ex: >>> >>> qemu-system-x86_64 -machine q35... 
\ >>> -device intel-iommu,intremap=on \ >>> -device pcie-pci-bridge,addr=1e.0,id=pci.1 \ >>> -device vfio-pci,host=1:00.0,bus=pci.1,addr=1.0,multifunction=on \ >>> -device vfio-pci,host=1:00.1,bus=pci.1,addr=1.1 >>> >>> While the utility of this hack may be limited, this AddressSpace >>> aliasing is the correct behavior for QEMU to emulate bare metal. >>> >>> Signed-off-by: Alex Williamson <alex.williamson@redhat.com> >>> --- >>> hw/pci/pci.c | 33 +++++++++++++++++++++++++++++++-- >>> 1 file changed, 31 insertions(+), 2 deletions(-) >>> >>> diff --git a/hw/pci/pci.c b/hw/pci/pci.c >>> index 35451c1e9987..38467e676f1f 100644 >>> --- a/hw/pci/pci.c >>> +++ b/hw/pci/pci.c >>> @@ -2594,12 +2594,41 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev) >>> { >>> PCIBus *bus = pci_get_bus(dev); >>> PCIBus *iommu_bus = bus; >>> + uint8_t devfn = dev->devfn; >>> >>> while(iommu_bus && !iommu_bus->iommu_fn && iommu_bus->parent_dev) { >>> - iommu_bus = pci_get_bus(iommu_bus->parent_dev); >>> + PCIBus *parent_bus = pci_get_bus(iommu_bus->parent_dev); >>> + >>> + /* >>> + * Determine which requester ID alias should be used for the device >>> + * based on the PCI topology. There are no requester IDs on convetional >> conventional > > Oops, fixed > >>> + * PCI buses, therefore we push the alias up to the parent on each non- >>> + * express bus. Which alias we use depends on whether this is a legacy >>> + * PCI bridge or PCIe-to-PCI/X bridge as in chapter 2.3 of the PCIe-to- >>> + * PCI bridge spec. Note that we cannot use pci_requester_id() here >>> + * because the resulting BDF depends on the secondary bridge register >> Didn't you mean secondary bus number register? > > Yes, fixed > >>> + * programming. We also cannot lookup the PCIBus from the bus number >>> + * at this point for the iommu_fn. Also, requester_id_cache is the >>> + * alias to the root bus, which is usually, but not necessarily always >>> + * where we'll find our iommu_fn. 
>>> + */ >>> + if (!pci_bus_is_express(iommu_bus)) { >>> + PCIDevice *parent = iommu_bus->parent_dev; >>> + >>> + if (pci_is_express(parent) && >>> + pcie_cap_get_type(parent) == PCI_EXP_TYPE_PCI_BRIDGE) { >>> + devfn = PCI_DEVFN(0, 0); >>> + bus = iommu_bus; >>> + } else { >>> + devfn = parent->devfn; >>> + bus = parent_bus; >>> + } >>> + } >>> + >>> + iommu_bus = parent_bus; >>> } >>> if (iommu_bus && iommu_bus->iommu_fn) { >>> - return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, dev->devfn); >>> + return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, devfn); >> I think it would make sense to comment this iommu_fn() callback's role >> somewhere. > > While I agree, it seems a bit out of scope of this patch. Or are you > suggesting that this patch fundamentally changing the role rather than > trying to make it work as intended? The last question is what I tried to figure out when reviewing the patch, since you change the @bus and @devfn arg values passed to the cb. Given the lack of documentation about the role of this function, it is not obvious whether the patch breaks anything without reading the cb implementations. Thanks Eric > > Alex >
On Wed, 27 Mar 2019 19:07:44 +0100 Auger Eric <eric.auger@redhat.com> wrote: > Hi Alex, > > On 3/27/19 7:02 PM, Alex Williamson wrote: > > On Wed, 27 Mar 2019 12:46:41 +0100 > > Auger Eric <eric.auger@redhat.com> wrote: > > > >> Hi Alex, > >> > >> On 3/26/19 11:55 PM, Alex Williamson wrote: > >>> Conventional PCI buses pre-date requester IDs. An IOMMU cannot > >>> distinguish by devfn & bus between devices in a conventional PCI > >>> topology and therefore we cannot assign them separate AddressSpaces. > >>> By taking this requester ID aliasing into account, QEMU better matches > >>> the bare metal behavior and restrictions, and enables shared > >>> AddressSpace configurations that are otherwise not possible with > >>> guest IOMMU support. > >>> > >>> For the latter case, given any example where an IOMMU group on the > >>> host includes multiple devices: > >>> > >>> $ ls /sys/kernel/iommu_groups/1/devices/ > >>> 0000:00:01.0 0000:01:00.0 0000:01:00.1 > >>> > >>> If we incorporate a vIOMMU into the VM configuration, we're restricted > >>> that we can only assign one of the endpoints to the guest because a > >>> second endpoint will attempt to use a different AddressSpace. VFIO > >>> only supports IOMMU group level granularity at the container level, > >>> preventing this second endpoint from being assigned: > >>> > >>> qemu-system-x86_64 -machine q35... 
\ > >>> -device intel-iommu,intremap=on \ > >>> -device pcie-root-port,addr=1e.0,id=pcie.1 \ > >>> -device vfio-pci,host=1:00.0,bus=pcie.1,addr=0.0,multifunction=on \ > >>> -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1 > >>> > >>> qemu-system-x86_64: -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1: vfio \ > >>> 0000:01:00.1: group 1 used in multiple address spaces > >>> > >>> However, when QEMU incorporates proper aliasing, we can make use of a > >>> PCIe-to-PCI bridge to mask the requester ID, resulting in a hack that > >>> provides the downstream devices with the same AddressSpace, ex: > >>> > >>> qemu-system-x86_64 -machine q35... \ > >>> -device intel-iommu,intremap=on \ > >>> -device pcie-pci-bridge,addr=1e.0,id=pci.1 \ > >>> -device vfio-pci,host=1:00.0,bus=pci.1,addr=1.0,multifunction=on \ > >>> -device vfio-pci,host=1:00.1,bus=pci.1,addr=1.1 > >>> > >>> While the utility of this hack may be limited, this AddressSpace > >>> aliasing is the correct behavior for QEMU to emulate bare metal. > >>> > >>> Signed-off-by: Alex Williamson <alex.williamson@redhat.com> > >>> --- > >>> hw/pci/pci.c | 33 +++++++++++++++++++++++++++++++-- > >>> 1 file changed, 31 insertions(+), 2 deletions(-) > >>> > >>> diff --git a/hw/pci/pci.c b/hw/pci/pci.c > >>> index 35451c1e9987..38467e676f1f 100644 > >>> --- a/hw/pci/pci.c > >>> +++ b/hw/pci/pci.c > >>> @@ -2594,12 +2594,41 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev) > >>> { > >>> PCIBus *bus = pci_get_bus(dev); > >>> PCIBus *iommu_bus = bus; > >>> + uint8_t devfn = dev->devfn; > >>> > >>> while(iommu_bus && !iommu_bus->iommu_fn && iommu_bus->parent_dev) { > >>> - iommu_bus = pci_get_bus(iommu_bus->parent_dev); > >>> + PCIBus *parent_bus = pci_get_bus(iommu_bus->parent_dev); > >>> + > >>> + /* > >>> + * Determine which requester ID alias should be used for the device > >>> + * based on the PCI topology. 
There are no requester IDs on convetional > >> conventional > > > > Oops, fixed > > > >>> + * PCI buses, therefore we push the alias up to the parent on each non- > >>> + * express bus. Which alias we use depends on whether this is a legacy > >>> + * PCI bridge or PCIe-to-PCI/X bridge as in chapter 2.3 of the PCIe-to- > >>> + * PCI bridge spec. Note that we cannot use pci_requester_id() here > >>> + * because the resulting BDF depends on the secondary bridge register > >> Didn't you mean secondary bus number register? > > > > Yes, fixed > > > >>> + * programming. We also cannot lookup the PCIBus from the bus number > >>> + * at this point for the iommu_fn. Also, requester_id_cache is the > >>> + * alias to the root bus, which is usually, but not necessarily always > >>> + * where we'll find our iommu_fn. > >>> + */ > >>> + if (!pci_bus_is_express(iommu_bus)) { > >>> + PCIDevice *parent = iommu_bus->parent_dev; > >>> + > >>> + if (pci_is_express(parent) && > >>> + pcie_cap_get_type(parent) == PCI_EXP_TYPE_PCI_BRIDGE) { > >>> + devfn = PCI_DEVFN(0, 0); > >>> + bus = iommu_bus; > >>> + } else { > >>> + devfn = parent->devfn; > >>> + bus = parent_bus; > >>> + } > >>> + } > >>> + > >>> + iommu_bus = parent_bus; > >>> } > >>> if (iommu_bus && iommu_bus->iommu_fn) { > >>> - return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, dev->devfn); > >>> + return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, devfn); > >> I think it would make sense to comment this iommu_fn() callback's role > >> somewhere. > > > > While I agree, it seems a bit out of scope of this patch. Or are you > > suggesting that this patch fundamentally changing the role rather than > > trying to make it work as intended? > > The last question is what I tried to figure out when reviewing the patch > as you change the @bus and @devfn arg values passed to the cb. 
Given the > lack of documentation about the role of this function, it is not obvious > to see if the patch does not break anything without reading the cb > implementations. AIUI, the original semantics are simply "return an AddressSpace for this {PCIBus, devfn}". I don't think that fundamentally changes, it's just that we know the bus hosting the IOMMU (ie. the one with iommu_fn) and we know the common algorithm for determining the effective requester ID the IOMMU will see for that device from its bus. We're therefore simply providing the topology defined alias to the device rather than the device itself. We're not actually changing the semantics of the iommu_fn, we're just limiting the set of {PCIBus, devfn}s that we'll request. It seems like more of a semantic change to make the callee responsible for determining the visibility of the requested device. Thanks, Alex
On Wed, Mar 27, 2019 at 10:37:09AM -0600, Alex Williamson wrote: > On Wed, 27 Mar 2019 14:25:00 +0800 > Peter Xu <peterx@redhat.com> wrote: > > > On Tue, Mar 26, 2019 at 04:55:19PM -0600, Alex Williamson wrote: > > > Conventional PCI buses pre-date requester IDs. An IOMMU cannot > > > distinguish by devfn & bus between devices in a conventional PCI > > > topology and therefore we cannot assign them separate AddressSpaces. > > > By taking this requester ID aliasing into account, QEMU better matches > > > the bare metal behavior and restrictions, and enables shared > > > AddressSpace configurations that are otherwise not possible with > > > guest IOMMU support. > > > > > > For the latter case, given any example where an IOMMU group on the > > > host includes multiple devices: > > > > > > $ ls /sys/kernel/iommu_groups/1/devices/ > > > 0000:00:01.0 0000:01:00.0 0000:01:00.1 > > > > [1] > > > > > > > > If we incorporate a vIOMMU into the VM configuration, we're restricted > > > that we can only assign one of the endpoints to the guest because a > > > second endpoint will attempt to use a different AddressSpace. VFIO > > > only supports IOMMU group level granularity at the container level, > > > preventing this second endpoint from being assigned: > > > > > > qemu-system-x86_64 -machine q35... \ > > > -device intel-iommu,intremap=on \ > > > -device pcie-root-port,addr=1e.0,id=pcie.1 \ > > > -device vfio-pci,host=1:00.0,bus=pcie.1,addr=0.0,multifunction=on \ > > > -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1 > > > > > > qemu-system-x86_64: -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1: vfio \ > > > 0000:01:00.1: group 1 used in multiple address spaces > > > > > > However, when QEMU incorporates proper aliasing, we can make use of a > > > PCIe-to-PCI bridge to mask the requester ID, resulting in a hack that > > > provides the downstream devices with the same AddressSpace, ex: > > > > > > qemu-system-x86_64 -machine q35... 
\ > > > -device intel-iommu,intremap=on \ > > > -device pcie-pci-bridge,addr=1e.0,id=pci.1 \ > > > -device vfio-pci,host=1:00.0,bus=pci.1,addr=1.0,multifunction=on \ > > > -device vfio-pci,host=1:00.1,bus=pci.1,addr=1.1 > > > > > > While the utility of this hack may be limited, this AddressSpace > > > aliasing is the correct behavior for QEMU to emulate bare metal. > > > > > > Signed-off-by: Alex Williamson <alex.williamson@redhat.com> > > > > The patch looks sane to me even as a bug fix since otherwise the DMA > > address spaces used under misc kinds of PCI bridges can be wrong, so: > > I'm not sure if "as a bug fix" here is encouraging a 4.0 target, but > I'd be cautious about this if so. Eric Auger noted that he's seen an > SMMU VM hit a guest kernel bug-on, which needs further > investigation. It's not clear if it's just an untested or > unimplemented scenario for SMMU to see a conventional PCI bus or if > there's something wrong in QEMU. I also haven't tested AMD IOMMU and > only VT-d to a very limited degree, thus RFC. Sorry to be unclear. I didn't mean to target this for 4.0, and I completely agree that it should be after the release since it is still a relatively influential change to the PCI system of QEMU, not to mention that the system mostly works well even without this patch. (except for things like assignment of multi-function devices with an IOMMU, but that is rare, after all) > > > Reviewed-by: Peter Xu <peterx@redhat.com> > > > > Though I have a question that confused me even before: Alex, do you > > know why all the context entry of the devices in the IOMMU root table > > will be programmed even if the devices are under a pcie-to-pci bridge? > > I'm giving an example with above [1] to be clear: in that case IIUC > > we'll program context entries for all the three devices (00:01.0, > > 01:00.0, 01:00.1) but they'll point to the same IOMMU table.
DMAs of > > devices 01:00.0 and 01:00.1 should always been tagged with 01:00.0 on > > bare metal and then why we bother to program the context entry of > > 01:00.1? It seems never used. > > > > (It should be used for current QEMU to work with pcie-to-pci bridges > > if without this patch, but I feel like I don't know the real answer > > behind) > > We actually have two different scenarios that could be represented by > [1], the group can be formed by lack of isolation or by lack of > visibility. In the group above, it's the former, isolation. The PCIe > root port does not support ACS, so while the IOMMU has visibility of > the individual devices, peer-to-peer between devices may also be > possible. Native, trusted, in-kernel drivers for these devices could > still make use of separate IOMMU domains per device, but in order to > expose the devices to a userspace driver we need to consider them a > non-isolated set to prevent side-channel attacks between devices. We > therefore consider them as a group within the IOMMU API and it's > required that each context entry maps to the same domain as the IOMMU > will see transactions for each requester ID. > > If we had the visibility case, such as if [1] represented a PCIe-to-PCI > bridge subgroup, then the IOMMU really does only see the bridge > requester ID and there may not be a reason to populate the context > entries for the downstream aliased devices. Perhaps the IOMMU might > still choose to do so, particularly if the bridge is actually a PCI-X > bridge as PCI-X does incorporate a requester ID, but also has strange > rules about the bridge being able to claim ownership of the > transaction. So it might be paranoia or simplification that causes all > the context entries to be programmed, or for alias quirks, uncertainty > if a device exclusively uses a quirked requester ID or might sometimes > use the proper requester ID. 
> > In the example I present, we're taking [1], which could be either case > above, and converting it into the visibility case in order to force the > IOMMU to handle the devices within a single address space. Thanks, The answers are detailed and clear (as usual :). My thanks!
On Wed, Mar 27, 2019 at 11:35:35AM -0400, Michael S. Tsirkin wrote: > On Wed, Mar 27, 2019 at 02:25:00PM +0800, Peter Xu wrote: > > On Tue, Mar 26, 2019 at 04:55:19PM -0600, Alex Williamson wrote: > > > Conventional PCI buses pre-date requester IDs. An IOMMU cannot > > > distinguish by devfn & bus between devices in a conventional PCI > > > topology and therefore we cannot assign them separate AddressSpaces. > > > By taking this requester ID aliasing into account, QEMU better matches > > > the bare metal behavior and restrictions, and enables shared > > > AddressSpace configurations that are otherwise not possible with > > > guest IOMMU support. > > > > > > For the latter case, given any example where an IOMMU group on the > > > host includes multiple devices: > > > > > > $ ls /sys/kernel/iommu_groups/1/devices/ > > > 0000:00:01.0 0000:01:00.0 0000:01:00.1 > > > > [1] > > > > > > > > If we incorporate a vIOMMU into the VM configuration, we're restricted > > > that we can only assign one of the endpoints to the guest because a > > > second endpoint will attempt to use a different AddressSpace. VFIO > > > only supports IOMMU group level granularity at the container level, > > > preventing this second endpoint from being assigned: > > > > > > qemu-system-x86_64 -machine q35... \ > > > -device intel-iommu,intremap=on \ > > > -device pcie-root-port,addr=1e.0,id=pcie.1 \ > > > -device vfio-pci,host=1:00.0,bus=pcie.1,addr=0.0,multifunction=on \ > > > -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1 > > > > > > qemu-system-x86_64: -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1: vfio \ > > > 0000:01:00.1: group 1 used in multiple address spaces > > > > > > However, when QEMU incorporates proper aliasing, we can make use of a > > > PCIe-to-PCI bridge to mask the requester ID, resulting in a hack that > > > provides the downstream devices with the same AddressSpace, ex: > > > > > > qemu-system-x86_64 -machine q35... 
\ > > > -device intel-iommu,intremap=on \ > > > -device pcie-pci-bridge,addr=1e.0,id=pci.1 \ > > > -device vfio-pci,host=1:00.0,bus=pci.1,addr=1.0,multifunction=on \ > > > -device vfio-pci,host=1:00.1,bus=pci.1,addr=1.1 > > > > > > While the utility of this hack may be limited, this AddressSpace > > > aliasing is the correct behavior for QEMU to emulate bare metal. > > > > > > Signed-off-by: Alex Williamson <alex.williamson@redhat.com> > > > > The patch looks sane to me even as a bug fix since otherwise the DMA > > address spaces used under misc kinds of PCI bridges can be wrong, so: > > > > Reviewed-by: Peter Xu <peterx@redhat.com> > > > > Though I have a question that confused me even before: Alex, do you > > know why all the context entry of the devices in the IOMMU root table > > will be programmed even if the devices are under a pcie-to-pci bridge? > > I'm giving an example with above [1] to be clear: in that case IIUC > > we'll program context entries for all the three devices (00:01.0, > > 01:00.0, 01:00.1) but they'll point to the same IOMMU table. DMAs of > > devices 01:00.0 and 01:00.1 should always been tagged with 01:00.0 on > > bare metal > > What makes you think so? > > PCI Express spec says: > > Requester ID > > The combination of a Requester's Bus Number, Device Number, and Function > Number that uniquely identifies the Requester. With an ARI Requester ID, bits > traditionally used for the Device Number field are used instead to expand the > Function Number field, and the Device Number is implied to be 0. It can be something special when there're bridges like pcie-to-pci bridge because the bridge can take the ownership of the transaction and modify the requester ID (which should be part of the transaction ID). I believe it's clearer now since I saw a lot of discussions happening in the other thread about this too... Thanks,
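Peter's point about a bridge taking ownership of the transaction can be made concrete. The rough model below (hypothetical helpers, Python for brevity, loosely mirroring what the kernel's pci_for_each_dma_alias() walk accounts for; it is a sketch, not the kernel implementation) shows how the requester ID the IOMMU sees depends on what sits between the function and the root:

```python
def rid(bus, dev, fn):
    # 16-bit requester ID: 8-bit bus, 5-bit device, 3-bit function
    return (bus << 8) | (dev << 3) | fn

def iommu_visible_rid(dev_rid, bridges):
    """Walk from the device toward the root; `bridges` lists each bridge
    crossed, innermost first (hypothetical structure for illustration)."""
    cur = dev_rid
    for br in bridges:
        if br["type"] == "pcie-to-pci":
            # Upstream of the bridge the transaction carries
            # (secondary bus, devfn 0) instead of the function's RID
            cur = rid(br["secondary_bus"], 0, 0)
        elif br["type"] == "pci-to-pci":
            # A conventional bridge may take ownership with its own RID
            cur = rid(br["bus"], br["dev"], br["fn"])
        # PCIe root ports/switches forward the original RID unchanged
    return cur

# 01:00.1 under a plain PCIe root port: IOMMU sees the function's own RID
assert iommu_visible_rid(rid(1, 0, 1), []) == 0x0101
# Same function behind a PCIe-to-PCI bridge whose secondary bus is 1:
assert iommu_visible_rid(
    rid(1, 0, 1), [{"type": "pcie-to-pci", "secondary_bus": 1}]) == 0x0100
```

So the Express spec's definition holds for the endpoint's own link, but once a bridge originates the transaction on the upstream side, the ID the IOMMU can key on is the bridge's alias, not the function's.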
Hi Alex, [+ Robin] On 3/27/19 5:37 PM, Alex Williamson wrote: > On Wed, 27 Mar 2019 14:25:00 +0800 > Peter Xu <peterx@redhat.com> wrote: > >> On Tue, Mar 26, 2019 at 04:55:19PM -0600, Alex Williamson wrote: >>> Conventional PCI buses pre-date requester IDs. An IOMMU cannot >>> distinguish by devfn & bus between devices in a conventional PCI >>> topology and therefore we cannot assign them separate AddressSpaces. >>> By taking this requester ID aliasing into account, QEMU better matches >>> the bare metal behavior and restrictions, and enables shared >>> AddressSpace configurations that are otherwise not possible with >>> guest IOMMU support. >>> >>> For the latter case, given any example where an IOMMU group on the >>> host includes multiple devices: >>> >>> $ ls /sys/kernel/iommu_groups/1/devices/ >>> 0000:00:01.0 0000:01:00.0 0000:01:00.1 >> >> [1] >> >>> >>> If we incorporate a vIOMMU into the VM configuration, we're restricted >>> that we can only assign one of the endpoints to the guest because a >>> second endpoint will attempt to use a different AddressSpace. VFIO >>> only supports IOMMU group level granularity at the container level, >>> preventing this second endpoint from being assigned: >>> >>> qemu-system-x86_64 -machine q35... \ >>> -device intel-iommu,intremap=on \ >>> -device pcie-root-port,addr=1e.0,id=pcie.1 \ >>> -device vfio-pci,host=1:00.0,bus=pcie.1,addr=0.0,multifunction=on \ >>> -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1 >>> >>> qemu-system-x86_64: -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1: vfio \ >>> 0000:01:00.1: group 1 used in multiple address spaces >>> >>> However, when QEMU incorporates proper aliasing, we can make use of a >>> PCIe-to-PCI bridge to mask the requester ID, resulting in a hack that >>> provides the downstream devices with the same AddressSpace, ex: >>> >>> qemu-system-x86_64 -machine q35... 
\
>>> -device intel-iommu,intremap=on \
>>> -device pcie-pci-bridge,addr=1e.0,id=pci.1 \
>>> -device vfio-pci,host=1:00.0,bus=pci.1,addr=1.0,multifunction=on \
>>> -device vfio-pci,host=1:00.1,bus=pci.1,addr=1.1
>>>
>>> While the utility of this hack may be limited, this AddressSpace
>>> aliasing is the correct behavior for QEMU to emulate bare metal.
>>>
>>> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
>>
>> The patch looks sane to me even as a bug fix since otherwise the DMA
>> address spaces used under misc kinds of PCI bridges can be wrong, so:
>
> I'm not sure if "as a bug fix" here is encouraging a 4.0 target, but
> I'd be cautious about this if so.  Eric Auger noted that he's seen an
> SMMU VM hit a guest kernel bug-on, which needs further
> investigation.  It's not clear if it's just an untested or
> unimplemented scenario for SMMU to see a conventional PCI bus or if
> there's something wrong in QEMU.  I also haven't tested AMD IOMMU and
> only VT-d to a very limited degree, thus RFC.

So I have tracked this further and here is what I can see.

On the guest side, the 2 assigned devices that I have put downstream of
the pcie-to-pci bridge each get an iommu_fwspec handle with 2 ids: the
first one corresponding to the requester id of the device itself, and
the second one corresponding to the rid with the same bus number and
devfn=0:

dev0 = 0000:02:01.0
       0000:02:00.0

dev1 = 0000:02:01.1
       0000:02:00.0

Then iommu_probe_device is called for 0000:02:01.0 and 0000:02:01.1.
Each time it iterates over the associated ids, so add_device ends up
being called twice for 0000:02:00.0.  The second time, the arm-smmu-v3
driver recognizes a context is already live for 0000:02:00.0 and
triggers a BUG_ON().

At the origin of the creation of the 2 ids for each device,
iort_iommu_configure is called on each downstream device, which calls
pci_for_each_dma_alias().  We enter the
pci_is_pcie(tmp)/PCI_EXP_TYPE_PCI_BRIDGE code path and
iort_pci_iommu_init is called with bus number 2 and devfn=0.
Thanks Eric > >> Reviewed-by: Peter Xu <peterx@redhat.com> >> >> Though I have a question that confused me even before: Alex, do you >> know why all the context entry of the devices in the IOMMU root table >> will be programmed even if the devices are under a pcie-to-pci bridge? >> I'm giving an example with above [1] to be clear: in that case IIUC >> we'll program context entries for all the three devices (00:01.0, >> 01:00.0, 01:00.1) but they'll point to the same IOMMU table. DMAs of >> devices 01:00.0 and 01:00.1 should always been tagged with 01:00.0 on >> bare metal and then why we bother to program the context entry of >> 01:00.1? It seems never used. >> >> (It should be used for current QEMU to work with pcie-to-pci bridges >> if without this patch, but I feel like I don't know the real answer >> behind) > > We actually have two different scenarios that could be represented by > [1], the group can be formed by lack of isolation or by lack of > visibility. In the group above, it's the former, isolation. The PCIe > root port does not support ACS, so while the IOMMU has visibility of > the individual devices, peer-to-peer between devices may also be > possible. Native, trusted, in-kernel drivers for these devices could > still make use of separate IOMMU domains per device, but in order to > expose the devices to a userspace driver we need to consider them a > non-isolated set to prevent side-channel attacks between devices. We > therefore consider them as a group within the IOMMU API and it's > required that each context entry maps to the same domain as the IOMMU > will see transactions for each requester ID. > > If we had the visibility case, such as if [1] represented a PCIe-to-PCI > bridge subgroup, then the IOMMU really does only see the bridge > requester ID and there may not be a reason to populate the context > entries for the downstream aliased devices. 
Perhaps the IOMMU might > still choose to do so, particularly if the bridge is actually a PCI-X > bridge as PCI-X does incorporate a requester ID, but also has strange > rules about the bridge being able to claim ownership of the > transaction. So it might be paranoia or simplification that causes all > the context entries to be programmed, or for alias quirks, uncertainty > if a device exclusively uses a quirked requester ID or might sometimes > use the proper requester ID. > > In the example I present, we're taking [1], which could be either case > above, and converting it into the visibility case in order to force the > IOMMU to handle the devices within a single address space. Thanks, > > Alex >
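Eric's trace above is easy to reproduce arithmetically. A small sketch (assuming plain bus<<8|devfn IDs, as the kernel's PCI_DEVID() produces; the helper names here are made up):

```python
def pci_devid(bus, devfn):
    # Kernel-style requester ID: 8-bit bus, 8-bit devfn
    return (bus << 8) | devfn

def pci_devfn(dev, fn):
    return (dev << 3) | fn

def fwspec_ids_behind_pcie_to_pci(bus, dev, fn):
    """The two IDs the firmware/IORT path ends up registering for a
    function behind a PCIe-to-PCI bridge: its own RID plus the
    (bus, devfn 0) alias contributed by the bridge."""
    return [pci_devid(bus, pci_devfn(dev, fn)), pci_devid(bus, 0)]

ids0 = fwspec_ids_behind_pcie_to_pci(2, 1, 0)   # 0000:02:01.0
ids1 = fwspec_ids_behind_pcie_to_pci(2, 1, 1)   # 0000:02:01.1
assert ids0 == [0x0208, 0x0200]
assert ids1 == [0x0209, 0x0200]
# The shared second ID (0000:02:00.0) is what gets add_device'd twice:
assert ids0[1] == ids1[1]
```

The collision is between two different devices' fwspecs, not within one fwspec, which is why the earlier in-device dedup fix doesn't cover it.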
On 28/03/2019 10:38, Auger Eric wrote: > Hi Alex, > > [+ Robin] > > On 3/27/19 5:37 PM, Alex Williamson wrote: >> On Wed, 27 Mar 2019 14:25:00 +0800 >> Peter Xu <peterx@redhat.com> wrote: >> >>> On Tue, Mar 26, 2019 at 04:55:19PM -0600, Alex Williamson wrote: >>>> Conventional PCI buses pre-date requester IDs. An IOMMU cannot >>>> distinguish by devfn & bus between devices in a conventional PCI >>>> topology and therefore we cannot assign them separate AddressSpaces. >>>> By taking this requester ID aliasing into account, QEMU better matches >>>> the bare metal behavior and restrictions, and enables shared >>>> AddressSpace configurations that are otherwise not possible with >>>> guest IOMMU support. >>>> >>>> For the latter case, given any example where an IOMMU group on the >>>> host includes multiple devices: >>>> >>>> $ ls /sys/kernel/iommu_groups/1/devices/ >>>> 0000:00:01.0 0000:01:00.0 0000:01:00.1 >>> >>> [1] >>> >>>> >>>> If we incorporate a vIOMMU into the VM configuration, we're restricted >>>> that we can only assign one of the endpoints to the guest because a >>>> second endpoint will attempt to use a different AddressSpace. VFIO >>>> only supports IOMMU group level granularity at the container level, >>>> preventing this second endpoint from being assigned: >>>> >>>> qemu-system-x86_64 -machine q35... \ >>>> -device intel-iommu,intremap=on \ >>>> -device pcie-root-port,addr=1e.0,id=pcie.1 \ >>>> -device vfio-pci,host=1:00.0,bus=pcie.1,addr=0.0,multifunction=on \ >>>> -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1 >>>> >>>> qemu-system-x86_64: -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1: vfio \ >>>> 0000:01:00.1: group 1 used in multiple address spaces >>>> >>>> However, when QEMU incorporates proper aliasing, we can make use of a >>>> PCIe-to-PCI bridge to mask the requester ID, resulting in a hack that >>>> provides the downstream devices with the same AddressSpace, ex: >>>> >>>> qemu-system-x86_64 -machine q35... 
\ >>>> -device intel-iommu,intremap=on \ >>>> -device pcie-pci-bridge,addr=1e.0,id=pci.1 \ >>>> -device vfio-pci,host=1:00.0,bus=pci.1,addr=1.0,multifunction=on \ >>>> -device vfio-pci,host=1:00.1,bus=pci.1,addr=1.1 >>>> >>>> While the utility of this hack may be limited, this AddressSpace >>>> aliasing is the correct behavior for QEMU to emulate bare metal. >>>> >>>> Signed-off-by: Alex Williamson <alex.williamson@redhat.com> >>> >>> The patch looks sane to me even as a bug fix since otherwise the DMA >>> address spaces used under misc kinds of PCI bridges can be wrong, so: >> >> I'm not sure if "as a bug fix" here is encouraging a 4.0 target, but >> I'd be cautious about this if so. Eric Auger noted that he's seen an >> SMMU VM hit a guest kernel bug-on, which needs further >> investigation. It's not clear if it's just an untested or >> unimplemented scenario for SMMU to see a conventional PCI bus or if >> there's something wrong in QEMU. I also haven't tested AMD IOMMU and >> only VT-d to a very limited degree, thus RFC. > > So I have tracked this further and here is what I can see. > > On guest side, the 2 assigned devices that I have put downstream to the > pcie-to-pci bridge get an iommu_fwspec handle with 2 ids, the first one > corresponding to the requester id of the very device and the second one > corresponding to the rid matching the same bus number and devfn=0 > > dev0 = 0000:02:01.0 > 0000:02:00.0 > > dev1 = 0000:02:01.1 > 0000:02:00.0 > > Then iommu_probe_device is called for 0000:02:01.0 and 0000:02:01.1. > Each time it iterates over the associated ids and we call add_device > twice for 0000:02:00.0. The second time, the arm-smmu-v3 driver > recognizes a context is already alive for 0000:02:00.0 and triggers a > BUG_ON(). Hmm, aliasing bridges are supposed to be handled as of commit 563b5cbe334e ("iommu/arm-smmu-v3: Cope with duplicated Stream IDs") - what's changed since then? Robin. 
> At the origin of the creation of 2 ids for each device, > iort_iommu_configure is called on each downstream device which calls > pci_for_each_dma_alias(). We enter the > pci_is_pcie(tmp)/PCI_EXP_TYPE_PCI_BRIDGE code path and > iort_pci_iommu_init is called with bus number 2 and devfn=0. > > Thanks > > Eric >> >>> Reviewed-by: Peter Xu <peterx@redhat.com> >>> >>> Though I have a question that confused me even before: Alex, do you >>> know why all the context entry of the devices in the IOMMU root table >>> will be programmed even if the devices are under a pcie-to-pci bridge? >>> I'm giving an example with above [1] to be clear: in that case IIUC >>> we'll program context entries for all the three devices (00:01.0, >>> 01:00.0, 01:00.1) but they'll point to the same IOMMU table. DMAs of >>> devices 01:00.0 and 01:00.1 should always been tagged with 01:00.0 on >>> bare metal and then why we bother to program the context entry of >>> 01:00.1? It seems never used. >>> >>> (It should be used for current QEMU to work with pcie-to-pci bridges >>> if without this patch, but I feel like I don't know the real answer >>> behind) >> >> We actually have two different scenarios that could be represented by >> [1], the group can be formed by lack of isolation or by lack of >> visibility. In the group above, it's the former, isolation. The PCIe >> root port does not support ACS, so while the IOMMU has visibility of >> the individual devices, peer-to-peer between devices may also be >> possible. Native, trusted, in-kernel drivers for these devices could >> still make use of separate IOMMU domains per device, but in order to >> expose the devices to a userspace driver we need to consider them a >> non-isolated set to prevent side-channel attacks between devices. We >> therefore consider them as a group within the IOMMU API and it's >> required that each context entry maps to the same domain as the IOMMU >> will see transactions for each requester ID. 
>> >> If we had the visibility case, such as if [1] represented a PCIe-to-PCI >> bridge subgroup, then the IOMMU really does only see the bridge >> requester ID and there may not be a reason to populate the context >> entries for the downstream aliased devices. Perhaps the IOMMU might >> still choose to do so, particularly if the bridge is actually a PCI-X >> bridge as PCI-X does incorporate a requester ID, but also has strange >> rules about the bridge being able to claim ownership of the >> transaction. So it might be paranoia or simplification that causes all >> the context entries to be programmed, or for alias quirks, uncertainty >> if a device exclusively uses a quirked requester ID or might sometimes >> use the proper requester ID. >> >> In the example I present, we're taking [1], which could be either case >> above, and converting it into the visibility case in order to force the >> IOMMU to handle the devices within a single address space. Thanks, >> >> Alex >>
Hi Robin, On 3/28/19 11:56 AM, Robin Murphy wrote: > On 28/03/2019 10:38, Auger Eric wrote: >> Hi Alex, >> >> [+ Robin] >> >> On 3/27/19 5:37 PM, Alex Williamson wrote: >>> On Wed, 27 Mar 2019 14:25:00 +0800 >>> Peter Xu <peterx@redhat.com> wrote: >>> >>>> On Tue, Mar 26, 2019 at 04:55:19PM -0600, Alex Williamson wrote: >>>>> Conventional PCI buses pre-date requester IDs. An IOMMU cannot >>>>> distinguish by devfn & bus between devices in a conventional PCI >>>>> topology and therefore we cannot assign them separate AddressSpaces. >>>>> By taking this requester ID aliasing into account, QEMU better matches >>>>> the bare metal behavior and restrictions, and enables shared >>>>> AddressSpace configurations that are otherwise not possible with >>>>> guest IOMMU support. >>>>> >>>>> For the latter case, given any example where an IOMMU group on the >>>>> host includes multiple devices: >>>>> >>>>> $ ls /sys/kernel/iommu_groups/1/devices/ >>>>> 0000:00:01.0 0000:01:00.0 0000:01:00.1 >>>> >>>> [1] >>>> >>>>> >>>>> If we incorporate a vIOMMU into the VM configuration, we're restricted >>>>> that we can only assign one of the endpoints to the guest because a >>>>> second endpoint will attempt to use a different AddressSpace. VFIO >>>>> only supports IOMMU group level granularity at the container level, >>>>> preventing this second endpoint from being assigned: >>>>> >>>>> qemu-system-x86_64 -machine q35... 
\ >>>>> -device intel-iommu,intremap=on \ >>>>> -device pcie-root-port,addr=1e.0,id=pcie.1 \ >>>>> -device vfio-pci,host=1:00.0,bus=pcie.1,addr=0.0,multifunction=on \ >>>>> -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1 >>>>> >>>>> qemu-system-x86_64: -device >>>>> vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1: vfio \ >>>>> 0000:01:00.1: group 1 used in multiple address spaces >>>>> >>>>> However, when QEMU incorporates proper aliasing, we can make use of a >>>>> PCIe-to-PCI bridge to mask the requester ID, resulting in a hack that >>>>> provides the downstream devices with the same AddressSpace, ex: >>>>> >>>>> qemu-system-x86_64 -machine q35... \ >>>>> -device intel-iommu,intremap=on \ >>>>> -device pcie-pci-bridge,addr=1e.0,id=pci.1 \ >>>>> -device vfio-pci,host=1:00.0,bus=pci.1,addr=1.0,multifunction=on \ >>>>> -device vfio-pci,host=1:00.1,bus=pci.1,addr=1.1 >>>>> >>>>> While the utility of this hack may be limited, this AddressSpace >>>>> aliasing is the correct behavior for QEMU to emulate bare metal. >>>>> >>>>> Signed-off-by: Alex Williamson <alex.williamson@redhat.com> >>>> >>>> The patch looks sane to me even as a bug fix since otherwise the DMA >>>> address spaces used under misc kinds of PCI bridges can be wrong, so: >>> >>> I'm not sure if "as a bug fix" here is encouraging a 4.0 target, but >>> I'd be cautious about this if so. Eric Auger noted that he's seen an >>> SMMU VM hit a guest kernel bug-on, which needs further >>> investigation. It's not clear if it's just an untested or >>> unimplemented scenario for SMMU to see a conventional PCI bus or if >>> there's something wrong in QEMU. I also haven't tested AMD IOMMU and >>> only VT-d to a very limited degree, thus RFC. >> >> So I have tracked this further and here is what I can see. 
>>
>> On guest side, the 2 assigned devices that I have put downstream to the
>> pcie-to-pci bridge get an iommu_fwspec handle with 2 ids, the first one
>> corresponding to the requester id of the very device and the second one
>> corresponding to the rid matching the same bus number and devfn=0
>>
>> dev0 = 0000:02:01.0
>> 0000:02:00.0
>>
>> dev1 = 0000:02:01.1
>> 0000:02:00.0
>>
>> Then iommu_probe_device is called for 0000:02:01.0 and 0000:02:01.1.
>> Each time it iterates over the associated ids and we call add_device
>> twice for 0000:02:00.0.  The second time, the arm-smmu-v3 driver
>> recognizes a context is already alive for 0000:02:00.0 and triggers a
>> BUG_ON().
>
> Hmm, aliasing bridges are supposed to be handled as of commit
> 563b5cbe334e ("iommu/arm-smmu-v3: Cope with duplicated Stream IDs") -
> what's changed since then?

In our case, the problem is not that we have the same ids[] for a given
device, but that we have 2 functions attached to the pcie-to-pci bridge
whose fwspecs have an ids[] entry in common: the one with the
subordinate bus and devfn=0.
-device pcie-pci-bridge,addr=1e.0,id=pci.1 \
-device vfio-pci,host=1:00.0,bus=pci.1,addr=1.0,multifunction=on \
-device vfio-pci,host=1:00.1,bus=pci.1,addr=1.1

[    2.509381] ixgbe 0000:02:01.0: arm_smmu_attach_dev ARM_SMMU_DOMAIN_S1
[    2.512095] arm_smmu_install_ste_for_dev sid[0]=520
[    2.513524] arm_smmu_write_strtab_ent sid=520 val=1
[    2.514872] arm_smmu_write_strtab_ent sid=520 ABORT
[    2.516266] arm_smmu_install_ste_for_dev sid[1]=512
[    2.517609] arm_smmu_write_strtab_ent sid=512 val=1
[    2.518958] arm_smmu_write_strtab_ent sid=512 ABORT

[    3.301316] ixgbe 0000:02:01.1: arm_smmu_attach_dev ARM_SMMU_DOMAIN_S1
[    3.302456] arm_smmu_install_ste_for_dev sid[0]=521
[    3.303420] arm_smmu_write_strtab_ent sid=521 val=1
[    3.304408] arm_smmu_write_strtab_ent sid=521 ABORT
[    3.305442] arm_smmu_install_ste_for_dev sid[1]=512
[    3.306812] arm_smmu_write_strtab_ent sid=512 val=3758686219
[    3.308390] arm_smmu_write_strtab_ent sid=512 S1 | S2 ste_live

Thanks

Eric
>
> Robin.
>
>> At the origin of the creation of 2 ids for each device,
>> iort_iommu_configure is called on each downstream device which calls
>> pci_for_each_dma_alias(). We enter the
>> pci_is_pcie(tmp)/PCI_EXP_TYPE_PCI_BRIDGE code path and
>> iort_pci_iommu_init is called with bus number 2 and devfn=0.
>>
>> Thanks
>>
>> Eric
>>>
>>>> Reviewed-by: Peter Xu <peterx@redhat.com>
>>>>
>>>> Though I have a question that confused me even before: Alex, do you
>>>> know why all the context entry of the devices in the IOMMU root table
>>>> will be programmed even if the devices are under a pcie-to-pci bridge?
>>>> I'm giving an example with above [1] to be clear: in that case IIUC
>>>> we'll program context entries for all the three devices (00:01.0,
>>>> 01:00.0, 01:00.1) but they'll point to the same IOMMU table.  DMAs of
>>>> devices 01:00.0 and 01:00.1 should always been tagged with 01:00.0 on
>>>> bare metal and then why we bother to program the context entry of
>>>> 01:00.1?  It seems never used.
>>>> >>>> (It should be used for current QEMU to work with pcie-to-pci bridges >>>> if without this patch, but I feel like I don't know the real answer >>>> behind) >>> >>> We actually have two different scenarios that could be represented by >>> [1], the group can be formed by lack of isolation or by lack of >>> visibility. In the group above, it's the former, isolation. The PCIe >>> root port does not support ACS, so while the IOMMU has visibility of >>> the individual devices, peer-to-peer between devices may also be >>> possible. Native, trusted, in-kernel drivers for these devices could >>> still make use of separate IOMMU domains per device, but in order to >>> expose the devices to a userspace driver we need to consider them a >>> non-isolated set to prevent side-channel attacks between devices. We >>> therefore consider them as a group within the IOMMU API and it's >>> required that each context entry maps to the same domain as the IOMMU >>> will see transactions for each requester ID. >>> >>> If we had the visibility case, such as if [1] represented a PCIe-to-PCI >>> bridge subgroup, then the IOMMU really does only see the bridge >>> requester ID and there may not be a reason to populate the context >>> entries for the downstream aliased devices. Perhaps the IOMMU might >>> still choose to do so, particularly if the bridge is actually a PCI-X >>> bridge as PCI-X does incorporate a requester ID, but also has strange >>> rules about the bridge being able to claim ownership of the >>> transaction. So it might be paranoia or simplification that causes all >>> the context entries to be programmed, or for alias quirks, uncertainty >>> if a device exclusively uses a quirked requester ID or might sometimes >>> use the proper requester ID. >>> >>> In the example I present, we're taking [1], which could be either case >>> above, and converting it into the visibility case in order to force the >>> IOMMU to handle the devices within a single address space. 
Thanks, >>> >>> Alex >>>
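The stream IDs in Eric's log are just the guest PCI requester IDs in decimal, assuming the StreamID is the plain bus<<8|devfn RID (as it is for PCI on QEMU's virt machine with SMMUv3). Decoding them shows both functions carrying the shared bridge alias 0000:02:00.0, which is the STE installed twice:

```python
def sid_to_bdf(sid):
    # StreamID == PCI requester ID here: 8-bit bus, 8-bit devfn
    bus, devfn = sid >> 8, sid & 0xff
    return "%02x:%02x.%x" % (bus, devfn >> 3, devfn & 0x7)

assert sid_to_bdf(520) == "02:01.0"  # first ixgbe function's own RID
assert sid_to_bdf(521) == "02:01.1"  # second function's own RID
assert sid_to_bdf(512) == "02:00.0"  # bridge alias, installed by both attaches
```

The second attach finds sid 512's STE already live from the first function's attach, hence the ste_live path in the log.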
On 28/03/2019 12:53, Auger Eric wrote: > Hi Robin, > > On 3/28/19 11:56 AM, Robin Murphy wrote: >> On 28/03/2019 10:38, Auger Eric wrote: >>> Hi Alex, >>> >>> [+ Robin] >>> >>> On 3/27/19 5:37 PM, Alex Williamson wrote: >>>> On Wed, 27 Mar 2019 14:25:00 +0800 >>>> Peter Xu <peterx@redhat.com> wrote: >>>> >>>>> On Tue, Mar 26, 2019 at 04:55:19PM -0600, Alex Williamson wrote: >>>>>> Conventional PCI buses pre-date requester IDs. An IOMMU cannot >>>>>> distinguish by devfn & bus between devices in a conventional PCI >>>>>> topology and therefore we cannot assign them separate AddressSpaces. >>>>>> By taking this requester ID aliasing into account, QEMU better matches >>>>>> the bare metal behavior and restrictions, and enables shared >>>>>> AddressSpace configurations that are otherwise not possible with >>>>>> guest IOMMU support. >>>>>> >>>>>> For the latter case, given any example where an IOMMU group on the >>>>>> host includes multiple devices: >>>>>> >>>>>> $ ls /sys/kernel/iommu_groups/1/devices/ >>>>>> 0000:00:01.0 0000:01:00.0 0000:01:00.1 >>>>> >>>>> [1] >>>>> >>>>>> >>>>>> If we incorporate a vIOMMU into the VM configuration, we're restricted >>>>>> that we can only assign one of the endpoints to the guest because a >>>>>> second endpoint will attempt to use a different AddressSpace. VFIO >>>>>> only supports IOMMU group level granularity at the container level, >>>>>> preventing this second endpoint from being assigned: >>>>>> >>>>>> qemu-system-x86_64 -machine q35... 
\ >>>>>> -device intel-iommu,intremap=on \ >>>>>> -device pcie-root-port,addr=1e.0,id=pcie.1 \ >>>>>> -device vfio-pci,host=1:00.0,bus=pcie.1,addr=0.0,multifunction=on \ >>>>>> -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1 >>>>>> >>>>>> qemu-system-x86_64: -device >>>>>> vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1: vfio \ >>>>>> 0000:01:00.1: group 1 used in multiple address spaces >>>>>> >>>>>> However, when QEMU incorporates proper aliasing, we can make use of a >>>>>> PCIe-to-PCI bridge to mask the requester ID, resulting in a hack that >>>>>> provides the downstream devices with the same AddressSpace, ex: >>>>>> >>>>>> qemu-system-x86_64 -machine q35... \ >>>>>> -device intel-iommu,intremap=on \ >>>>>> -device pcie-pci-bridge,addr=1e.0,id=pci.1 \ >>>>>> -device vfio-pci,host=1:00.0,bus=pci.1,addr=1.0,multifunction=on \ >>>>>> -device vfio-pci,host=1:00.1,bus=pci.1,addr=1.1 >>>>>> >>>>>> While the utility of this hack may be limited, this AddressSpace >>>>>> aliasing is the correct behavior for QEMU to emulate bare metal. >>>>>> >>>>>> Signed-off-by: Alex Williamson <alex.williamson@redhat.com> >>>>> >>>>> The patch looks sane to me even as a bug fix since otherwise the DMA >>>>> address spaces used under misc kinds of PCI bridges can be wrong, so: >>>> >>>> I'm not sure if "as a bug fix" here is encouraging a 4.0 target, but >>>> I'd be cautious about this if so. Eric Auger noted that he's seen an >>>> SMMU VM hit a guest kernel bug-on, which needs further >>>> investigation. It's not clear if it's just an untested or >>>> unimplemented scenario for SMMU to see a conventional PCI bus or if >>>> there's something wrong in QEMU. I also haven't tested AMD IOMMU and >>>> only VT-d to a very limited degree, thus RFC. >>> >>> So I have tracked this further and here is what I can see. 
>>> >>> On guest side, the 2 assigned devices that I have put downstream to the >>> pcie-to-pci bridge get an iommu_fwspec handle with 2 ids, the first one >>> corresponding to the requester id of the very device and the second one >>> corresponding to the rid matching the same bus number and devfn=0 >>> >>> dev0 = 0000:02:01.0 >>> 0000:02:00.0 >>> >>> dev1 = 0000:02:01.1 >>> 0000:02:00.0 >>> >>> Then iommu_probe_device is called for 0000:02:01.0 and 0000:02:01.1. >>> Each time it iterates over the associated ids and we call add_device >>> twice for 0000:02:00.0. The second time, the arm-smmu-v3 driver >>> recognizes a context is already alive for 0000:02:00.0 and triggers a >>> BUG_ON(). >> >> Hmm, aliasing bridges are supposed to be handled as of commit >> 563b5cbe334e ("iommu/arm-smmu-v3: Cope with duplicated Stream IDs") - >> what's changed since then? > > I our case, the problem is not we have the same ids[] for a given device > but we have 2 functions attached to the pcie-to-pci bridge and their > fwspec have an ids[] in common, the one with the subordinate bus and > devfn=0. Oh, right, the recollection of having fixed something like this before distracted me from that important detail, sorry :) Yeah, it's a shortcoming of the driver - in general we don't support arbitrary devices sharing Stream IDs (as arm_smmu_device_group() says), but since we hadn't yet seen (nor expected) a non-trivial legacy PCI topology with SMMUv3, that case has never really been tested properly either. Robin. 
> device pcie-pci-bridge,addr=1e.0,id=pci.1 \ > -device vfio-pci,host=1:00.0,bus=pci.1,addr=1.0,multifunction=on \ > -device vfio-pci,host=1:00.1,bus=pci.1,addr=1.1 > > [ 2.509381] ixgbe 0000:02:01.0: arm_smmu_attach_dev ARM_SMMU_DOMAIN_S1 > [ 2.512095] arm_smmu_install_ste_for_dev sid[0]=520 > [ 2.513524] arm_smmu_write_strtab_ent sid=520 val=1 > [ 2.514872] arm_smmu_write_strtab_ent sid=520 ABORT > [ 2.516266] arm_smmu_install_ste_for_dev sid[1]=512 > [ 2.517609] arm_smmu_write_strtab_ent sid=512 val=1 > [ 2.518958] arm_smmu_write_strtab_ent sid=512 ABORT > > [ 3.301316] ixgbe 0000:02:01.1: arm_smmu_attach_dev ARM_SMMU_DOMAIN_S1 > [ 3.302456] arm_smmu_install_ste_for_dev sid[0]=521 > [ 3.303420] arm_smmu_write_strtab_ent sid=521 val=1 > [ 3.304408] arm_smmu_write_strtab_ent sid=521 ABORT > [ 3.305442] arm_smmu_install_ste_for_dev sid[1]=512 > [ 3.306812] arm_smmu_write_strtab_ent sid=512 val=3758686219 > [ 3.308390] arm_smmu_write_strtab_ent sid=512 S1 | S2 ste_live > > > Thanks > > Eric >> >> Robin. >> >>> At the origin of the creation of 2 ids for each device, >>> iort_iommu_configure is called on each downstream device which calls >>> pci_for_each_dma_alias(). We enter the >>> pci_is_pcie(tmp)/PCI_EXP_TYPE_PCI_BRIDGE code path and >>> iort_pci_iommu_init is called with bus number 2 and devfn=0. >>> >>> Thanks >>> >>> Eric >>>> >>>>> Reviewed-by: Peter Xu <peterx@redhat.com> >>>>> >>>>> Though I have a question that confused me even before: Alex, do you >>>>> know why all the context entry of the devices in the IOMMU root table >>>>> will be programmed even if the devices are under a pcie-to-pci bridge? >>>>> I'm giving an example with above [1] to be clear: in that case IIUC >>>>> we'll program context entries for all the three devices (00:01.0, >>>>> 01:00.0, 01:00.1) but they'll point to the same IOMMU table. 
DMAs of >>>>> devices 01:00.0 and 01:00.1 should always been tagged with 01:00.0 on >>>>> bare metal and then why we bother to program the context entry of >>>>> 01:00.1? It seems never used. >>>>> >>>>> (It should be used for current QEMU to work with pcie-to-pci bridges >>>>> if without this patch, but I feel like I don't know the real answer >>>>> behind) >>>> >>>> We actually have two different scenarios that could be represented by >>>> [1], the group can be formed by lack of isolation or by lack of >>>> visibility. In the group above, it's the former, isolation. The PCIe >>>> root port does not support ACS, so while the IOMMU has visibility of >>>> the individual devices, peer-to-peer between devices may also be >>>> possible. Native, trusted, in-kernel drivers for these devices could >>>> still make use of separate IOMMU domains per device, but in order to >>>> expose the devices to a userspace driver we need to consider them a >>>> non-isolated set to prevent side-channel attacks between devices. We >>>> therefore consider them as a group within the IOMMU API and it's >>>> required that each context entry maps to the same domain as the IOMMU >>>> will see transactions for each requester ID. >>>> >>>> If we had the visibility case, such as if [1] represented a PCIe-to-PCI >>>> bridge subgroup, then the IOMMU really does only see the bridge >>>> requester ID and there may not be a reason to populate the context >>>> entries for the downstream aliased devices. Perhaps the IOMMU might >>>> still choose to do so, particularly if the bridge is actually a PCI-X >>>> bridge as PCI-X does incorporate a requester ID, but also has strange >>>> rules about the bridge being able to claim ownership of the >>>> transaction. So it might be paranoia or simplification that causes all >>>> the context entries to be programmed, or for alias quirks, uncertainty >>>> if a device exclusively uses a quirked requester ID or might sometimes >>>> use the proper requester ID. 
>>>> >>>> In the example I present, we're taking [1], which could be either case >>>> above, and converting it into the visibility case in order to force the >>>> IOMMU to handle the devices within a single address space. Thanks, >>>> >>>> Alex >>>>
[Cc +Brijesh] Hi Brijesh, will the change below require the IVRS to be updated to include aliases for all BDF ranges behind a conventional bridge? I think the Linux code handles this regardless of the firmware provided aliases, but is it required per spec for the ACPI tables to include bridge aliases? Thanks, Alex On Tue, 26 Mar 2019 16:55:19 -0600 Alex Williamson <alex.williamson@redhat.com> wrote: > Conventional PCI buses pre-date requester IDs. An IOMMU cannot > distinguish by devfn & bus between devices in a conventional PCI > topology and therefore we cannot assign them separate AddressSpaces. > By taking this requester ID aliasing into account, QEMU better matches > the bare metal behavior and restrictions, and enables shared > AddressSpace configurations that are otherwise not possible with > guest IOMMU support. > > For the latter case, given any example where an IOMMU group on the > host includes multiple devices: > > $ ls /sys/kernel/iommu_groups/1/devices/ > 0000:00:01.0 0000:01:00.0 0000:01:00.1 > > If we incorporate a vIOMMU into the VM configuration, we're restricted > that we can only assign one of the endpoints to the guest because a > second endpoint will attempt to use a different AddressSpace. VFIO > only supports IOMMU group level granularity at the container level, > preventing this second endpoint from being assigned: > > qemu-system-x86_64 -machine q35... \ > -device intel-iommu,intremap=on \ > -device pcie-root-port,addr=1e.0,id=pcie.1 \ > -device vfio-pci,host=1:00.0,bus=pcie.1,addr=0.0,multifunction=on \ > -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1 > > qemu-system-x86_64: -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1: vfio \ > 0000:01:00.1: group 1 used in multiple address spaces > > However, when QEMU incorporates proper aliasing, we can make use of a > PCIe-to-PCI bridge to mask the requester ID, resulting in a hack that > provides the downstream devices with the same AddressSpace, ex: > > qemu-system-x86_64 -machine q35... 
\ > -device intel-iommu,intremap=on \ > -device pcie-pci-bridge,addr=1e.0,id=pci.1 \ > -device vfio-pci,host=1:00.0,bus=pci.1,addr=1.0,multifunction=on \ > -device vfio-pci,host=1:00.1,bus=pci.1,addr=1.1 > > While the utility of this hack may be limited, this AddressSpace > aliasing is the correct behavior for QEMU to emulate bare metal. > > Signed-off-by: Alex Williamson <alex.williamson@redhat.com> > --- > hw/pci/pci.c | 33 +++++++++++++++++++++++++++++++-- > 1 file changed, 31 insertions(+), 2 deletions(-) > > diff --git a/hw/pci/pci.c b/hw/pci/pci.c > index 35451c1e9987..38467e676f1f 100644 > --- a/hw/pci/pci.c > +++ b/hw/pci/pci.c > @@ -2594,12 +2594,41 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev) > { > PCIBus *bus = pci_get_bus(dev); > PCIBus *iommu_bus = bus; > + uint8_t devfn = dev->devfn; > > while(iommu_bus && !iommu_bus->iommu_fn && iommu_bus->parent_dev) { > - iommu_bus = pci_get_bus(iommu_bus->parent_dev); > + PCIBus *parent_bus = pci_get_bus(iommu_bus->parent_dev); > + > + /* > + * Determine which requester ID alias should be used for the device > + * based on the PCI topology. There are no requester IDs on conventional > + * PCI buses, therefore we push the alias up to the parent on each non- > + * express bus. Which alias we use depends on whether this is a legacy > + * PCI bridge or PCIe-to-PCI/X bridge as in chapter 2.3 of the PCIe-to- > + * PCI bridge spec. Note that we cannot use pci_requester_id() here > + * because the resulting BDF depends on the secondary bridge register > + * programming. We also cannot look up the PCIBus from the bus number > + * at this point for the iommu_fn. Also, requester_id_cache is the > + * alias to the root bus, which is usually, but not necessarily always > + * where we'll find our iommu_fn.
> + */ > + if (!pci_bus_is_express(iommu_bus)) { > + PCIDevice *parent = iommu_bus->parent_dev; > + > + if (pci_is_express(parent) && > + pcie_cap_get_type(parent) == PCI_EXP_TYPE_PCI_BRIDGE) { > + devfn = PCI_DEVFN(0, 0); > + bus = iommu_bus; > + } else { > + devfn = parent->devfn; > + bus = parent_bus; > + } > + } > + > + iommu_bus = parent_bus; > } > if (iommu_bus && iommu_bus->iommu_fn) { > - return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, dev->devfn); > + return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, devfn); > } > return &address_space_memory; > } >
Thanks for adding Alex. Adding Suravee. On 3/29/19 11:49 AM, Alex Williamson wrote: > [Cc +Brijesh] > > Hi Brijesh, will the change below require the IVRS to be updated to > include aliases for all BDF ranges behind a conventional bridge? I > think the Linux code handles this regardless of the firmware provided > aliases, but is it required per spec for the ACPI tables to include > bridge aliases? Thanks, > We do need to include aliases in the ACPI table. We need to populate IVHD entry types 0x43 and 0x04 for the alias range start and end. I believe the host IVRS would contain similar information. Suravee, please correct me if I am missing something? thanks
On Mon, 1 Apr 2019 13:41:39 +0000 "Singh, Brijesh" <brijesh.singh@amd.com> wrote: > Thanks for adding Alex. > > Adding Suravee. > > > On 3/29/19 11:49 AM, Alex Williamson wrote: > > [Cc +Brijesh] > > > > Hi Brijesh, will the change below require the IVRS to be updated to > > include aliases for all BDF ranges behind a conventional bridge? I > > think the Linux code handles this regardless of the firmware provided > > aliases, but is it required per spec for the ACPI tables to include > > bridge aliases? Thanks, > > > > We do need to includes aliases in ACPI table. We need to populate the > IVHD type 0x43 and 0x4 for alias range start and end. I believe host > IVRS would contain similar information. > > Suravee, please correct me if I am missing something? I finally found some time to investigate this a little further, yes the types mentioned are correct for defining start and end of an alias range. The challenge here is that these entries require a DeviceID, which is defined as a BDF, AIUI. The IVRS is created in QEMU, but bus numbers are defined by the guest firmware, and potentially redefined by the guest OS. This makes it non-trivial to insert a few IVHDs into the IVRS to describe alias ranges. I'm wondering if the solution here is to define a new linker-loader command that would instruct the guest to write a bus number byte to a given offset for a described device. These commands would be inserted before the checksum command, such that these bus number updates are calculated as part of the checksum. I'm imagining the command format would need to be able to distinguish between the actual bus number of a described device, the secondary bus number of the device, and the subordinate bus number of the device. For describing the device, I'm envisioning stealing from the DMAR definition, which already includes a bus number invariant mechanism to describe a device, starting with a segment and root bus, follow a chain of devfns to get to the target device. 
Therefore the guest firmware would follow the path to the described device, pick the desired bus number, and write it to the indicated table offset. Does this seem like a reasonable approach? Better ideas? I'm not thrilled with the increased scope demanded by IVRS support, but so long as we have an AMD IOMMU model, I don't see how to avoid it. Thanks, Alex
On Tue, Jul 23, 2019 at 11:26:18AM -0600, Alex Williamson wrote: > On Mon, 1 Apr 2019 13:41:39 +0000 > "Singh, Brijesh" <brijesh.singh@amd.com> wrote: > > > Thanks for adding Alex. > > > > Adding Suravee. > > > > > > On 3/29/19 11:49 AM, Alex Williamson wrote: > > > [Cc +Brijesh] > > > > > > Hi Brijesh, will the change below require the IVRS to be updated to > > > include aliases for all BDF ranges behind a conventional bridge? I > > > think the Linux code handles this regardless of the firmware provided > > > aliases, but is it required per spec for the ACPI tables to include > > > bridge aliases? Thanks, > > > > > > > We do need to includes aliases in ACPI table. We need to populate the > > IVHD type 0x43 and 0x4 for alias range start and end. I believe host > > IVRS would contain similar information. > > > > Suravee, please correct me if I am missing something? > > I finally found some time to investigate this a little further, yes the > types mentioned are correct for defining start and end of an alias > range. The challenge here is that these entries require a DeviceID, > which is defined as a BDF, AIUI. The IVRS is created in QEMU, but bus > numbers are defined by the guest firmware, and potentially redefined by > the guest OS. This makes it non-trivial to insert a few IVHDs into the > IVRS to describe alias ranges. I'm wondering if the solution here is > to define a new linker-loader command that would instruct the guest to > write a bus number byte to a given offset for a described device. > These commands would be inserted before the checksum command, such that > these bus number updates are calculated as part of the checksum. > > I'm imagining the command format would need to be able to distinguish > between the actual bus number of a described device, the secondary bus > number of the device, and the subordinate bus number of the device. 
> For describing the device, I'm envisioning stealing from the DMAR > definition, which already includes a bus number invariant mechanism to > describe a device, starting with a segment and root bus, follow a chain > of devfns to get to the target device. Therefore the guest firmware > would follow the path to the described device, pick the desired bus > number, and write it to the indicated table offset. > > Does this seem like a reasonable approach? Better ideas? I'm not > thrilled with the increased scope demanded by IVRS support, but so long > as we have an AMD IOMMU model, I don't see how to avoid it. Thanks, > > Alex We generally just got the bus #s as programmed into virtual bridges/devices and had qemu program them into acpi tables. If guest os changes bus#s it's responsible for updating acpi tables :)
On Tue, Jul 23, 2019 at 11:26:18AM -0600, Alex Williamson wrote: > > On 3/29/19 11:49 AM, Alex Williamson wrote: > > > [Cc +Brijesh] > > > > > > Hi Brijesh, will the change below require the IVRS to be updated to > > > include aliases for all BDF ranges behind a conventional bridge? I > > > think the Linux code handles this regardless of the firmware provided > > > aliases, but is it required per spec for the ACPI tables to include > > > bridge aliases? Thanks, > > > > > > > We do need to includes aliases in ACPI table. We need to populate the > > IVHD type 0x43 and 0x4 for alias range start and end. I believe host > > IVRS would contain similar information. > > > > Suravee, please correct me if I am missing something? > > I finally found some time to investigate this a little further, yes the > types mentioned are correct for defining start and end of an alias > range. The challenge here is that these entries require a DeviceID, > which is defined as a BDF, AIUI. The IVRS is created in QEMU, but bus > numbers are defined by the guest firmware, and potentially redefined by > the guest OS. This makes it non-trivial to insert a few IVHDs into the > IVRS to describe alias ranges. I'm wondering if the solution here is > to define a new linker-loader command that would instruct the guest to > write a bus number byte to a given offset for a described device. > These commands would be inserted before the checksum command, such that > these bus number updates are calculated as part of the checksum. > > I'm imagining the command format would need to be able to distinguish > between the actual bus number of a described device, the secondary bus > number of the device, and the subordinate bus number of the device. > For describing the device, I'm envisioning stealing from the DMAR > definition, which already includes a bus number invariant mechanism to > describe a device, starting with a segment and root bus, follow a chain > of devfns to get to the target device. 
Therefore the guest firmware > would follow the path to the described device, pick the desired bus > number, and write it to the indicated table offset. > > Does this seem like a reasonable approach? Better ideas? I'm not > thrilled with the increased scope demanded by IVRS support, but so long > as we have an AMD IOMMU model, I don't see how to avoid it. Thanks, I don't have a better idea yet, but I just want to say that, coincidentally, I started looking into this as well this week, and that's mostly what I thought about too (I was still reading a bit of seabios when I saw this email)... so at least this idea makes sense to me. Would the guest OS still change the PCI bus number even after the firmware (BIOS/UEFI)? Could I ask in what case would that happen? Thanks,
On Wed, Jul 24, 2019 at 03:14:39PM +0800, Peter Xu wrote: > On Tue, Jul 23, 2019 at 11:26:18AM -0600, Alex Williamson wrote: > > [...] > > I don't have a better idea yet, but just want to say that accidentally > I was trying to look into this as well starting from this week and I'd > say that's mostly what I thought about too (I was still reading a bit > seabios when I saw this email)... so at least this idea makes sense to > me. > > Would the guest OS still change the PCI bus number even after the > firmware (BIOS/UEFI)? Could I ask in what case would that happen? > > Thanks, Guest OSes can in theory rebalance resources. Changing bus numbers would be useful if new bridges are added by hotplug. In practice at least Linux doesn't do the rebalancing. I think that if we start reporting PNP OS support in BIOS then windows might start doing that more aggressively. > -- > Peter Xu
On Wed, Jul 24, 2019 at 05:39:22AM -0400, Michael S. Tsirkin wrote: > On Wed, Jul 24, 2019 at 03:14:39PM +0800, Peter Xu wrote: > > On Tue, Jul 23, 2019 at 11:26:18AM -0600, Alex Williamson wrote: > > > > On 3/29/19 11:49 AM, Alex Williamson wrote: > > > > > [Cc +Brijesh] > > > > > > > > > > Hi Brijesh, will the change below require the IVRS to be updated to > > > > > include aliases for all BDF ranges behind a conventional bridge? I > > > > > think the Linux code handles this regardless of the firmware provided > > > > > aliases, but is it required per spec for the ACPI tables to include > > > > > bridge aliases? Thanks, > > > > > > > > > > > > > We do need to includes aliases in ACPI table. We need to populate the > > > > IVHD type 0x43 and 0x4 for alias range start and end. I believe host > > > > IVRS would contain similar information. > > > > > > > > Suravee, please correct me if I am missing something? > > > > > > I finally found some time to investigate this a little further, yes the > > > types mentioned are correct for defining start and end of an alias > > > range. The challenge here is that these entries require a DeviceID, > > > which is defined as a BDF, AIUI. The IVRS is created in QEMU, but bus > > > numbers are defined by the guest firmware, and potentially redefined by > > > the guest OS. This makes it non-trivial to insert a few IVHDs into the > > > IVRS to describe alias ranges. I'm wondering if the solution here is > > > to define a new linker-loader command that would instruct the guest to > > > write a bus number byte to a given offset for a described device. > > > These commands would be inserted before the checksum command, such that > > > these bus number updates are calculated as part of the checksum. > > > > > > I'm imagining the command format would need to be able to distinguish > > > between the actual bus number of a described device, the secondary bus > > > number of the device, and the subordinate bus number of the device. 
> > > For describing the device, I'm envisioning stealing from the DMAR
> > > definition, which already includes a bus number invariant mechanism to
> > > describe a device, starting with a segment and root bus, follow a chain
> > > of devfns to get to the target device. Therefore the guest firmware
> > > would follow the path to the described device, pick the desired bus
> > > number, and write it to the indicated table offset.
> > >
> > > Does this seem like a reasonable approach? Better ideas? I'm not
> > > thrilled with the increased scope demanded by IVRS support, but so long
> > > as we have an AMD IOMMU model, I don't see how to avoid it. Thanks,
> >
> > I don't have a better idea yet, but just want to say that accidentally
> > I was trying to look into this as well starting from this week and I'd
> > say that's mostly what I thought about too (I was still reading a bit
> > seabios when I saw this email)... so at least this idea makes sense to
> > me.
> >
> > Would the guest OS still change the PCI bus number even after the
> > firmware (BIOS/UEFI)? Could I ask in what case would that happen?
> >
> > Thanks,
>
> Guest OSes can in theory rebalance resources. Changing bus numbers
> would be useful if new bridges are added by hotplug.
> In practice at least Linux doesn't do the rebalancing.
> I think that if we start reporting PNP OS support in BIOS then windows
> might start doing that more aggressively.

That surprises me a bit... IMHO, if we allow bus numbers to change, then
many scripts that used to work could start failing. E.g., a very common
script might run an "lspci-like" program to list each device and then run
"lspci-like -vvv" again on a BDF it fetched from the previous command.
Any kind of BDF caching, whether in userspace or the kernel, would become
invalid.

Also, the data stored in the IVRS is obviously closely bound to how bus
numbers are assigned.
Even if we can add a new linker-loader command to all the open firmwares
like SeaBIOS or OVMF, we still can't do that for Windows (or could
we?...).

Now, one step back: I'm also curious why the AMD spec requires the IVRS
to carry BDF information, rather than scope information like what the
Intel DMAR spec asks for.

Thanks,
On Wed, 24 Jul 2019 18:03:31 +0800 Peter Xu <zhexu@redhat.com> wrote: > On Wed, Jul 24, 2019 at 05:39:22AM -0400, Michael S. Tsirkin wrote: > > On Wed, Jul 24, 2019 at 03:14:39PM +0800, Peter Xu wrote: > > > On Tue, Jul 23, 2019 at 11:26:18AM -0600, Alex Williamson wrote: > > > > > On 3/29/19 11:49 AM, Alex Williamson wrote: > > > > > > [Cc +Brijesh] > > > > > > > > > > > > Hi Brijesh, will the change below require the IVRS to be updated to > > > > > > include aliases for all BDF ranges behind a conventional bridge? I > > > > > > think the Linux code handles this regardless of the firmware provided > > > > > > aliases, but is it required per spec for the ACPI tables to include > > > > > > bridge aliases? Thanks, > > > > > > > > > > > > > > > > We do need to includes aliases in ACPI table. We need to populate the > > > > > IVHD type 0x43 and 0x4 for alias range start and end. I believe host > > > > > IVRS would contain similar information. > > > > > > > > > > Suravee, please correct me if I am missing something? > > > > > > > > I finally found some time to investigate this a little further, yes the > > > > types mentioned are correct for defining start and end of an alias > > > > range. The challenge here is that these entries require a DeviceID, > > > > which is defined as a BDF, AIUI. The IVRS is created in QEMU, but bus > > > > numbers are defined by the guest firmware, and potentially redefined by > > > > the guest OS. This makes it non-trivial to insert a few IVHDs into the > > > > IVRS to describe alias ranges. I'm wondering if the solution here is > > > > to define a new linker-loader command that would instruct the guest to > > > > write a bus number byte to a given offset for a described device. > > > > These commands would be inserted before the checksum command, such that > > > > these bus number updates are calculated as part of the checksum. 
> > > > > > > > I'm imagining the command format would need to be able to distinguish > > > > between the actual bus number of a described device, the secondary bus > > > > number of the device, and the subordinate bus number of the device. > > > > For describing the device, I'm envisioning stealing from the DMAR > > > > definition, which already includes a bus number invariant mechanism to > > > > describe a device, starting with a segment and root bus, follow a chain > > > > of devfns to get to the target device. Therefore the guest firmware > > > > would follow the path to the described device, pick the desired bus > > > > number, and write it to the indicated table offset. > > > > > > > > Does this seem like a reasonable approach? Better ideas? I'm not > > > > thrilled with the increased scope demanded by IVRS support, but so long > > > > as we have an AMD IOMMU model, I don't see how to avoid it. Thanks, > > > > > > I don't have a better idea yet, but just want to say that accidentally > > > I was trying to look into this as well starting from this week and I'd > > > say that's mostly what I thought about too (I was still reading a bit > > > seabios when I saw this email)... so at least this idea makes sense to > > > me. > > > > > > Would the guest OS still change the PCI bus number even after the > > > firmware (BIOS/UEFI)? Could I ask in what case would that happen? > > > > > > Thanks, > > > > Guest OSes can in theory rebalance resources. Changing bus numbers > > would be useful if new bridges are added by hotplug. > > In practice at least Linux doesn't do the rebalancing. > > I think that if we start reporting PNP OS support in BIOS then windows > > might start doing that more aggressively. > > It's surprising me a bit... IMHO if we allow the bus number to change > then at least many scripts can even fail which might work before. > E.g. 
, a very common script can run "lspci-like" program to list each
> device and then do "lspci-like -vvv" again upon the BDF it fetched
> from previous commands. Any kind of BDF caching would be invalid
> since that from either userspace or kernel.
>
> Also, obviously the data to be stored in IVRS is closely bound to how
> bus number is defined. Even if we can add a new linker-loader command
> to all the open firmwares like seabios or OVMF but still we can't do
> that to Windows (or, could we?...).
>
> Now one step back, I'm also curious on the reason behind on why AMD
> spec required the IVRS with BDF information, rather than the scope
> information like what Intel DMAR spec was asking for.

It's a deficiency of the IVRS spec, but it's really out of scope here.
It's not the responsibility of the hypervisor to resolve this sort of
design issue; we should simply maintain the bare metal behavior and the
bare metal limitations of the design.

Michael did invoke some interesting ideas regarding QEMU updating the
IVRS table though. QEMU does know when bus apertures are programmed on
devices, and the config writes for these updates could trigger IVRS
updates. I think we'd want to allow such updates only between machine
reset and the guest firmware writing the table checksum. This reduces
the scope of the necessary changes, though it still feels a little messy
to have these config writes making table updates.

Another approach, and maybe what Michael was really suggesting, is that
we essentially create the ACPI tables twice AFAICT. Once initially, then
again via a select callback in fw_cfg. For SeaBIOS, it looks like this
second generation would be created after the PCI bus has been enumerated
and initialized. I've been trying to see if the same is likely for OVMF,
though it's not clear to me that this is a reasonable ordering to rely
on.
It would be entirely reasonable that firmware could process ACPI tables in advance of enumerating PCI, even potentially as a prerequisite to enumerating PCI. So ultimately I'm not sure if there are valid ordering assumptions to use these callbacks this way, though I'd appreciate any further discussion. Thanks, Alex
On Wed, 24 Jul 2019 08:43:55 -0600
Alex Williamson <alex.williamson@redhat.com> wrote:

> On Wed, 24 Jul 2019 18:03:31 +0800
> Peter Xu <zhexu@redhat.com> wrote:
>
> > On Wed, Jul 24, 2019 at 05:39:22AM -0400, Michael S. Tsirkin wrote:
> > > On Wed, Jul 24, 2019 at 03:14:39PM +0800, Peter Xu wrote:
> > > > On Tue, Jul 23, 2019 at 11:26:18AM -0600, Alex Williamson wrote:
> > > > > > On 3/29/19 11:49 AM, Alex Williamson wrote:
> > > > > > > [Cc +Brijesh]
> > > > > > >
> > > > > > > Hi Brijesh, will the change below require the IVRS to be updated to
> > > > > > > include aliases for all BDF ranges behind a conventional bridge? I
> > > > > > > think the Linux code handles this regardless of the firmware provided
> > > > > > > aliases, but is it required per spec for the ACPI tables to include
> > > > > > > bridge aliases? Thanks,

[snip]

For a data point, I fired up an old 990FX system which includes a
PCIe-to-PCI bridge and I added a plugin PCIe-to-PCI bridge just to keep
things interesting... guess how many alias ranges are in the IVRS...
Yep, just the one built into the motherboard:

AMD-Vi: Using IVHD type 0x10
AMD-Vi: device: 00:00.2 cap: 0040 seg: 0 flags: 3e info 1300
AMD-Vi: mmio-addr: 00000000fec30000
AMD-Vi: DEV_SELECT_RANGE_START devid: 00:00.0 flags: 00
AMD-Vi: DEV_RANGE_END devid: 00:00.2
AMD-Vi: DEV_SELECT devid: 00:09.0 flags: 00
AMD-Vi: DEV_SELECT devid: 01:00.0 flags: 00
AMD-Vi: DEV_SELECT devid: 00:0a.0 flags: 00
AMD-Vi: DEV_SELECT devid: 02:00.0 flags: 00
AMD-Vi: DEV_SELECT devid: 00:11.0 flags: 00
AMD-Vi: DEV_SELECT_RANGE_START devid: 00:12.0 flags: 00
AMD-Vi: DEV_RANGE_END devid: 00:12.2
AMD-Vi: DEV_SELECT_RANGE_START devid: 00:13.0 flags: 00
AMD-Vi: DEV_RANGE_END devid: 00:13.2
AMD-Vi: DEV_SELECT devid: 00:14.0 flags: d7
AMD-Vi: DEV_SELECT devid: 00:14.2 flags: 00
AMD-Vi: DEV_SELECT devid: 00:14.3 flags: 00
AMD-Vi: DEV_SELECT devid: 00:14.4 flags: 00
AMD-Vi: DEV_ALIAS_RANGE devid: 03:00.0 flags: 00 devid_to: 00:14.4
AMD-Vi: DEV_RANGE_END devid: 03:1f.7

[Everything on bus 03:xx.x is aliased to device 00:14.4, the builtin
PCIe-to-PCI bridge]

AMD-Vi: DEV_SELECT devid: 00:14.5 flags: 00
AMD-Vi: DEV_SELECT devid: 00:15.0 flags: 00
AMD-Vi: DEV_SELECT_RANGE_START devid: 04:00.0 flags: 00
AMD-Vi: DEV_RANGE_END devid: 04:1f.7
AMD-Vi: DEV_SELECT devid: 00:15.1 flags: 00
AMD-Vi: DEV_SELECT_RANGE_START devid: 05:00.0 flags: 00
AMD-Vi: DEV_RANGE_END devid: 05:1f.7
AMD-Vi: DEV_SELECT devid: 00:15.2 flags: 00
AMD-Vi: DEV_SELECT_RANGE_START devid: 06:00.0 flags: 00
AMD-Vi: DEV_RANGE_END devid: 06:1f.7
AMD-Vi: DEV_SELECT devid: 00:15.3 flags: 00
AMD-Vi: DEV_SELECT_RANGE_START devid: 08:00.0 flags: 00
AMD-Vi: DEV_RANGE_END devid: 08:1f.7
AMD-Vi: DEV_SELECT_RANGE_START devid: 00:16.0 flags: 00
AMD-Vi: DEV_RANGE_END devid: 00:16.2
AMD-Vi: DEV_SPECIAL(IOAPIC[8]) devid: 00:14.0
AMD-Vi: DEV_SPECIAL(HPET[0]) devid: 00:14.0
AMD-Vi: DEV_SPECIAL(IOAPIC[8]) devid: 00:00.1

-[0000:00]-+-00.0  Advanced Micro Devices, Inc. [AMD/ATI] RD9x0/RX980 Host Bridge
           +-00.2  Advanced Micro Devices, Inc. [AMD/ATI] RD890S/RD990 I/O Memory Management Unit (IOMMU)
           +-09.0-[01]----00.0  Etron Technology, Inc. EJ168 USB 3.0 Host Controller
           +-0a.0-[02]----00.0  Marvell Technology Group Ltd. 88SE9172 SATA 6Gb/s Controller
           +-11.0  Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 SATA Controller [AHCI mode]
           +-12.0  Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI0 Controller
           +-12.2  Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB EHCI Controller
           +-13.0  Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI0 Controller
           +-13.2  Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB EHCI Controller
           +-14.0  Advanced Micro Devices, Inc. [AMD/ATI] SBx00 SMBus Controller
           +-14.2  Advanced Micro Devices, Inc. [AMD/ATI] SBx00 Azalia (Intel HDA)
           +-14.3  Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 LPC host controller
           +-14.4-[03]--+-06.0  NVidia / SGS Thomson (Joint Venture) Riva128
           |            \-0e.0  VIA Technologies, Inc. VT6306/7/8 [Fire II(M)] IEEE 1394 OHCI Controller
           +-14.5  Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI2 Controller
           +-15.0-[04]----00.0  Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller
           +-15.1-[05]----00.0  Etron Technology, Inc. EJ168 USB 3.0 Host Controller
           +-15.2-[06-07]----00.0-[07]----00.0  Realtek Semiconductor Co., Ltd. Device 8149
           +-15.3-[08]--
           +-16.0  Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI0 Controller
           +-16.2  Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB EHCI Controller
           +-18.0  Advanced Micro Devices, Inc. [AMD] Family 10h Processor HyperTransport Configuration
           +-18.1  Advanced Micro Devices, Inc. [AMD] Family 10h Processor Address Map
           +-18.2  Advanced Micro Devices, Inc. [AMD] Family 10h Processor DRAM Controller
           +-18.3  Advanced Micro Devices, Inc. [AMD] Family 10h Processor Miscellaneous Control
           \-18.4  Advanced Micro Devices, Inc. [AMD] Family 10h Processor Link Control

00:14.4 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] SBx00 PCI to PCI Bridge (rev 40) (prog-if 01 [Subtractive decode])
	Bus: primary=00, secondary=03, subordinate=03, sec-latency=64

06:00.0 PCI bridge: ASMedia Technology Inc. ASM1083/1085 PCIe to PCI Bridge (rev 03) (prog-if 00 [Normal decode])
	Bus: primary=06, secondary=07, subordinate=07, sec-latency=32
	Capabilities: [80] Express (v1) PCI-Express to PCI/PCI-X Bridge, MSI 00

I realized as I was writing that the alias caused by 06:00.0 would be
devfn 00.0 on the secondary bus 07, ie. 07:00.0 would alias to 07:00.0,
so technically there's really not an alias for this small case. So I
replaced the NIC with this:

+-15.2-[06-07]----00.0-[07]--+-00.0  NEC Corporation OHCI USB Controller
                             +-00.1  NEC Corporation OHCI USB Controller
                             \-00.2  NEC Corporation uPD72010x USB 2.0 Controller

Now functions 07:00.[12] also alias to 07:00.0. The IVRS table is
unaffected.

I'm tempted to say that QEMU would be no worse than bare metal if we
simply ignore IVHD device alias entries. I know that Linux will figure
out the aliasing regardless of the IVRS. Will Windows guests? I'd
love to hear from Brijesh or Suravee regarding the behavior above versus
what AMD expects from system vendors. Thanks,

Alex
On Wed, Jul 24, 2019 at 08:43:55AM -0600, Alex Williamson wrote: > On Wed, 24 Jul 2019 18:03:31 +0800 > Peter Xu <zhexu@redhat.com> wrote: > > > On Wed, Jul 24, 2019 at 05:39:22AM -0400, Michael S. Tsirkin wrote: > > > On Wed, Jul 24, 2019 at 03:14:39PM +0800, Peter Xu wrote: > > > > On Tue, Jul 23, 2019 at 11:26:18AM -0600, Alex Williamson wrote: > > > > > > On 3/29/19 11:49 AM, Alex Williamson wrote: > > > > > > > [Cc +Brijesh] > > > > > > > > > > > > > > Hi Brijesh, will the change below require the IVRS to be updated to > > > > > > > include aliases for all BDF ranges behind a conventional bridge? I > > > > > > > think the Linux code handles this regardless of the firmware provided > > > > > > > aliases, but is it required per spec for the ACPI tables to include > > > > > > > bridge aliases? Thanks, > > > > > > > > > > > > > > > > > > > We do need to includes aliases in ACPI table. We need to populate the > > > > > > IVHD type 0x43 and 0x4 for alias range start and end. I believe host > > > > > > IVRS would contain similar information. > > > > > > > > > > > > Suravee, please correct me if I am missing something? > > > > > > > > > > I finally found some time to investigate this a little further, yes the > > > > > types mentioned are correct for defining start and end of an alias > > > > > range. The challenge here is that these entries require a DeviceID, > > > > > which is defined as a BDF, AIUI. The IVRS is created in QEMU, but bus > > > > > numbers are defined by the guest firmware, and potentially redefined by > > > > > the guest OS. This makes it non-trivial to insert a few IVHDs into the > > > > > IVRS to describe alias ranges. I'm wondering if the solution here is > > > > > to define a new linker-loader command that would instruct the guest to > > > > > write a bus number byte to a given offset for a described device. 
> > > > > These commands would be inserted before the checksum command, such that > > > > > these bus number updates are calculated as part of the checksum. > > > > > > > > > > I'm imagining the command format would need to be able to distinguish > > > > > between the actual bus number of a described device, the secondary bus > > > > > number of the device, and the subordinate bus number of the device. > > > > > For describing the device, I'm envisioning stealing from the DMAR > > > > > definition, which already includes a bus number invariant mechanism to > > > > > describe a device, starting with a segment and root bus, follow a chain > > > > > of devfns to get to the target device. Therefore the guest firmware > > > > > would follow the path to the described device, pick the desired bus > > > > > number, and write it to the indicated table offset. > > > > > > > > > > Does this seem like a reasonable approach? Better ideas? I'm not > > > > > thrilled with the increased scope demanded by IVRS support, but so long > > > > > as we have an AMD IOMMU model, I don't see how to avoid it. Thanks, > > > > > > > > I don't have a better idea yet, but just want to say that accidentally > > > > I was trying to look into this as well starting from this week and I'd > > > > say that's mostly what I thought about too (I was still reading a bit > > > > seabios when I saw this email)... so at least this idea makes sense to > > > > me. > > > > > > > > Would the guest OS still change the PCI bus number even after the > > > > firmware (BIOS/UEFI)? Could I ask in what case would that happen? > > > > > > > > Thanks, > > > > > > Guest OSes can in theory rebalance resources. Changing bus numbers > > > would be useful if new bridges are added by hotplug. > > > In practice at least Linux doesn't do the rebalancing. > > > I think that if we start reporting PNP OS support in BIOS then windows > > > might start doing that more aggressively. > > > > It's surprising me a bit... 
IMHO if we allow the bus number to change
> > then at least many scripts can even fail which might work before.
> > E.g., a very common script can run "lspci-like" program to list each
> > device and then do "lspci-like -vvv" again upon the BDF it fetched
> > from previous commands. Any kind of BDF caching would be invalid
> > since that from either userspace or kernel.
> >
> > Also, obviously the data to be stored in IVRS is closely bound to how
> > bus number is defined. Even if we can add a new linker-loader command
> > to all the open firmwares like seabios or OVMF but still we can't do
> > that to Windows (or, could we?...).
> >
> > Now one step back, I'm also curious on the reason behind on why AMD
> > spec required the IVRS with BDF information, rather than the scope
> > information like what Intel DMAR spec was asking for.
>
> It's a deficiency of the IVRS spec, but it's really out of scope here.
> It's not the responsibility of the hypervisor to resolve this sort of
> design issue, we should simply maintain the bare metal behavior and the
> bare metal limitations of the design.

Yes, this is a better perspective. It's not the first time I've forgotten
to go back and reference bare metal, even though that's what we're
emulating and it normally has very similar problems. And the "data point"
email provides proper supporting material for this.

>
> Michael did invoke some interesting ideas regarding QEMU updating the
> IVRS table though. QEMU does know when bus apertures are programmed on
> devices and the config writes for these updates could trigger IVRS
> updates. I think we'd want to allow such updates only between machine
> reset and the guest firmware writing the table checksum. This reduces
> the scope of the necessary changes, though still feels a little messy
> to have these config writes making table updates.
>
> Another approach, and maybe what Michael was really suggesting, is that
> we essentially create the ACPI tables twice AFAICT.
Once initially,
> then again via a select callback in fw_cfg. For SeaBIOS, it looks like
> this second generation would be created after the PCI bus has been
> enumerated and initialized. I've been trying to see if the same is
> likely for OVMF, though it's not clear to me that this is a reasonable
> ordering to rely on. It would be entirely reasonable that firmware
> could process ACPI tables in advance of enumerating PCI, even
> potentially as a prerequisite to enumerating PCI. So ultimately I'm not
> sure if there are valid ordering assumptions to use these callbacks
> this way, though I'd appreciate any further discussion. Thanks,

After re-reading Michael's reply, I feel like what Michael suggested is
that we can simply ignore bus-number changes made by the guest OS for
now, but I might be wrong.

In any case, this is only a problem if "we still need to fill in the BDF
information" somehow, and that premise now seems questionable.

Thanks,
On Thu, Jul 25, 2019 at 02:37:33PM +0800, Peter Xu wrote:
> After re-read Michael's reply, I feel like what Michael suggested is
> that we can simply ignore the bus-number-change case by the guest OS
> for now, but I might be wrong.

That's what I suggested, yes.
On Thu, 25 Jul 2019 06:43:04 -0400
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Thu, Jul 25, 2019 at 02:37:33PM +0800, Peter Xu wrote:
> > After re-read Michael's reply, I feel like what Michael suggested is
> > that we can simply ignore the bus-number-change case by the guest OS
> > for now, but I might be wrong.
>
> That's what I suggested, yes.

"by the guest OS", yes, that's the part that's beyond the scope of this
effort. If we deem it necessary to support IVHD DMA aliases though, it's
the guest firmware that determines the initial bus numbers, which is
part of the DeviceID used to define IVHD entries, and we would not be
able to ignore that initial programming. Everything related to the
guest OS renumbering PCI buses is out of scope; the guest firmware
programming initial bus numbers is not. Thanks,

Alex
On 7/24/19 2:42 PM, Alex Williamson wrote: > On Wed, 24 Jul 2019 08:43:55 -0600 > Alex Williamson <alex.williamson@redhat.com> wrote: > >> On Wed, 24 Jul 2019 18:03:31 +0800 >> Peter Xu <zhexu@redhat.com> wrote: >> >>> On Wed, Jul 24, 2019 at 05:39:22AM -0400, Michael S. Tsirkin wrote: >>>> On Wed, Jul 24, 2019 at 03:14:39PM +0800, Peter Xu wrote: >>>>> On Tue, Jul 23, 2019 at 11:26:18AM -0600, Alex Williamson wrote: >>>>>>> On 3/29/19 11:49 AM, Alex Williamson wrote: >>>>>>>> [Cc +Brijesh] >>>>>>>> >>>>>>>> Hi Brijesh, will the change below require the IVRS to be updated to >>>>>>>> include aliases for all BDF ranges behind a conventional bridge? I >>>>>>>> think the Linux code handles this regardless of the firmware provided >>>>>>>> aliases, but is it required per spec for the ACPI tables to include >>>>>>>> bridge aliases? Thanks, > [snip] > > For a data point, I fired up an old 990FX system which includes a > PCIe-to-PCI bridge and I added a plugin PCIe-to-PCI bridge just to keep > things interesting... guess how many alias ranges are in the IVRS... 
> Yep, just the one built into the motherboard: > > AMD-Vi: Using IVHD type 0x10 > AMD-Vi: device: 00:00.2 cap: 0040 seg: 0 flags: 3e info 1300 > AMD-Vi: mmio-addr: 00000000fec30000 > AMD-Vi: DEV_SELECT_RANGE_START devid: 00:00.0 flags: 00 > AMD-Vi: DEV_RANGE_END devid: 00:00.2 > AMD-Vi: DEV_SELECT devid: 00:09.0 flags: 00 > AMD-Vi: DEV_SELECT devid: 01:00.0 flags: 00 > AMD-Vi: DEV_SELECT devid: 00:0a.0 flags: 00 > AMD-Vi: DEV_SELECT devid: 02:00.0 flags: 00 > AMD-Vi: DEV_SELECT devid: 00:11.0 flags: 00 > AMD-Vi: DEV_SELECT_RANGE_START devid: 00:12.0 flags: 00 > AMD-Vi: DEV_RANGE_END devid: 00:12.2 > AMD-Vi: DEV_SELECT_RANGE_START devid: 00:13.0 flags: 00 > AMD-Vi: DEV_RANGE_END devid: 00:13.2 > AMD-Vi: DEV_SELECT devid: 00:14.0 flags: d7 > AMD-Vi: DEV_SELECT devid: 00:14.2 flags: 00 > AMD-Vi: DEV_SELECT devid: 00:14.3 flags: 00 > AMD-Vi: DEV_SELECT devid: 00:14.4 flags: 00 > AMD-Vi: DEV_ALIAS_RANGE devid: 03:00.0 flags: 00 devid_to: 00:14.4 > AMD-Vi: DEV_RANGE_END devid: 03:1f.7 > > [Everything on bus 03:xx.x is aliased to device 00:14.4, the builtin PCIe-to-PCI bridge] > > AMD-Vi: DEV_SELECT devid: 00:14.5 flags: 00 > AMD-Vi: DEV_SELECT devid: 00:15.0 flags: 00 > AMD-Vi: DEV_SELECT_RANGE_START devid: 04:00.0 flags: 00 > AMD-Vi: DEV_RANGE_END devid: 04:1f.7 > AMD-Vi: DEV_SELECT devid: 00:15.1 flags: 00 > AMD-Vi: DEV_SELECT_RANGE_START devid: 05:00.0 flags: 00 > AMD-Vi: DEV_RANGE_END devid: 05:1f.7 > AMD-Vi: DEV_SELECT devid: 00:15.2 flags: 00 > AMD-Vi: DEV_SELECT_RANGE_START devid: 06:00.0 flags: 00 > AMD-Vi: DEV_RANGE_END devid: 06:1f.7 > AMD-Vi: DEV_SELECT devid: 00:15.3 flags: 00 > AMD-Vi: DEV_SELECT_RANGE_START devid: 08:00.0 flags: 00 > AMD-Vi: DEV_RANGE_END devid: 08:1f.7 > AMD-Vi: DEV_SELECT_RANGE_START devid: 00:16.0 flags: 00 > AMD-Vi: DEV_RANGE_END devid: 00:16.2 > AMD-Vi: DEV_SPECIAL(IOAPIC[8]) devid: 00:14.0 > AMD-Vi: DEV_SPECIAL(HPET[0]) devid: 00:14.0 > AMD-Vi: DEV_SPECIAL(IOAPIC[8]) devid: 00:00.1 > > -[0000:00]-+-00.0 Advanced Micro Devices, Inc. 
[AMD/ATI] RD9x0/RX980 Host Bridge > +-00.2 Advanced Micro Devices, Inc. [AMD/ATI] RD890S/RD990 I/O Memory Management Unit (IOMMU) > +-09.0-[01]----00.0 Etron Technology, Inc. EJ168 USB 3.0 Host Controller > +-0a.0-[02]----00.0 Marvell Technology Group Ltd. 88SE9172 SATA 6Gb/s Controller > +-11.0 Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 SATA Controller [AHCI mode] > +-12.0 Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI0 Controller > +-12.2 Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB EHCI Controller > +-13.0 Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI0 Controller > +-13.2 Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB EHCI Controller > +-14.0 Advanced Micro Devices, Inc. [AMD/ATI] SBx00 SMBus Controller > +-14.2 Advanced Micro Devices, Inc. [AMD/ATI] SBx00 Azalia (Intel HDA) > +-14.3 Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 LPC host controller > +-14.4-[03]--+-06.0 NVidia / SGS Thomson (Joint Venture) Riva128 > | \-0e.0 VIA Technologies, Inc. VT6306/7/8 [Fire II(M)] IEEE 1394 OHCI Controller > +-14.5 Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI2 Controller > +-15.0-[04]----00.0 Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller > +-15.1-[05]----00.0 Etron Technology, Inc. EJ168 USB 3.0 Host Controller > +-15.2-[06-07]----00.0-[07]----00.0 Realtek Semiconductor Co., Ltd. Device 8149 > +-15.3-[08]-- > +-16.0 Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI0 Controller > +-16.2 Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB EHCI Controller > +-18.0 Advanced Micro Devices, Inc. [AMD] Family 10h Processor HyperTransport Configuration > +-18.1 Advanced Micro Devices, Inc. [AMD] Family 10h Processor Address Map > +-18.2 Advanced Micro Devices, Inc. [AMD] Family 10h Processor DRAM Controller > +-18.3 Advanced Micro Devices, Inc. 
[AMD] Family 10h Processor Miscellaneous Control
> \-18.4 Advanced Micro Devices, Inc. [AMD] Family 10h Processor Link Control
>
> 00:14.4 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] SBx00 PCI to PCI Bridge (rev 40) (prog-if 01 [Subtractive decode])
> 	Bus: primary=00, secondary=03, subordinate=03, sec-latency=64
>
> 06:00.0 PCI bridge: ASMedia Technology Inc. ASM1083/1085 PCIe to PCI Bridge (rev 03) (prog-if 00 [Normal decode])
> 	Bus: primary=06, secondary=07, subordinate=07, sec-latency=32
> 	Capabilities: [80] Express (v1) PCI-Express to PCI/PCI-X Bridge, MSI 00
>
> I realized as I was writing, that the alias caused by 06:00.0 would be
> devfn 00.0 on the secondary bus 07, ie. 07:00.0 would alias to
> 07:00.0, so technically there's really not an alias for this small
> case. So I replaced the NIC with this:
>
> +-15.2-[06-07]----00.0-[07]--+-00.0  NEC Corporation OHCI USB Controller
>                              +-00.1  NEC Corporation OHCI USB Controller
>                              \-00.2  NEC Corporation uPD72010x USB 2.0 Controller
>
> Now functions 07:00.[12] also alias to 07:00.0. The IVRS table is
> unaffected.
>
> I'm tempted to say that QEMU would be no worse than bare metal if we
> simply ignore IVHD device alias entries. I know that Linux will figure
> out the aliasing regardless of the IVRS. Will Windows guests? I'd
> love to hear from Brijesh or Suravee regarding the behavior above versus
> what AMD expects from system vendors. Thanks,

Alex, I am not yet well versed in the expected IOMMU address space
behavior; Suravee will know more about it. I will ask him to take a look
at this thread.

thanks

> Alex
>
On Thu, Jul 25, 2019 at 08:00:23AM -0600, Alex Williamson wrote:
> On Thu, 25 Jul 2019 06:43:04 -0400
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
>
> > On Thu, Jul 25, 2019 at 02:37:33PM +0800, Peter Xu wrote:
> > > After re-read Michael's reply, I feel like what Michael suggested is
> > > that we can simply ignore the bus-number-change case by the guest OS
> > > for now, but I might be wrong.
> > That's what I suggested, yes.
>
> "by the guest OS", yes, that's the part that's beyond the scope of this
> effort. If we deem it necessary to support IVHD DMA aliases though, it's
> the guest firmware that determines the initial bus numbers, which is
> part of the DeviceID used to define IVHD entries and we would not be
> able to ignore that initial programming. Everything related to the
> guest OS renumbering PCI buses is out of scope, the guest firmware
> programming initial bus numbers is not.

Right. That's par for the course; we have the same issues with MCFG and
others. The BIOS programs it and then we generate ACPI based on that.

> Thanks,
>
> Alex
diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index 35451c1e9987..38467e676f1f 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -2594,12 +2594,41 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
 {
     PCIBus *bus = pci_get_bus(dev);
     PCIBus *iommu_bus = bus;
+    uint8_t devfn = dev->devfn;
 
     while(iommu_bus && !iommu_bus->iommu_fn && iommu_bus->parent_dev) {
-        iommu_bus = pci_get_bus(iommu_bus->parent_dev);
+        PCIBus *parent_bus = pci_get_bus(iommu_bus->parent_dev);
+
+        /*
+         * Determine which requester ID alias should be used for the device
+         * based on the PCI topology.  There are no requester IDs on
+         * conventional PCI buses, therefore we push the alias up to the
+         * parent on each non-express bus.  Which alias we use depends on
+         * whether this is a legacy PCI bridge or PCIe-to-PCI/X bridge as
+         * in chapter 2.3 of the PCIe-to-PCI bridge spec.  Note that we
+         * cannot use pci_requester_id() here because the resulting BDF
+         * depends on the secondary bridge register programming.  We also
+         * cannot lookup the PCIBus from the bus number at this point for
+         * the iommu_fn.  Also, requester_id_cache is the alias to the
+         * root bus, which is usually, but not necessarily always where
+         * we'll find our iommu_fn.
+         */
+        if (!pci_bus_is_express(iommu_bus)) {
+            PCIDevice *parent = iommu_bus->parent_dev;
+
+            if (pci_is_express(parent) &&
+                pcie_cap_get_type(parent) == PCI_EXP_TYPE_PCI_BRIDGE) {
+                devfn = PCI_DEVFN(0, 0);
+                bus = iommu_bus;
+            } else {
+                devfn = parent->devfn;
+                bus = parent_bus;
+            }
+        }
+
+        iommu_bus = parent_bus;
     }
 
     if (iommu_bus && iommu_bus->iommu_fn) {
-        return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, dev->devfn);
+        return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, devfn);
     }
     return &address_space_memory;
 }
Conventional PCI buses pre-date requester IDs. An IOMMU cannot
distinguish by devfn & bus between devices in a conventional PCI
topology and therefore we cannot assign them separate AddressSpaces.
By taking this requester ID aliasing into account, QEMU better matches
the bare metal behavior and restrictions, and enables shared
AddressSpace configurations that are otherwise not possible with
guest IOMMU support.

For the latter case, given any example where an IOMMU group on the
host includes multiple devices:

$ ls /sys/kernel/iommu_groups/1/devices/
0000:00:01.0  0000:01:00.0  0000:01:00.1

If we incorporate a vIOMMU into the VM configuration, we're restricted
that we can only assign one of the endpoints to the guest because a
second endpoint will attempt to use a different AddressSpace. VFIO
only supports IOMMU group level granularity at the container level,
preventing this second endpoint from being assigned:

qemu-system-x86_64 -machine q35... \
  -device intel-iommu,intremap=on \
  -device pcie-root-port,addr=1e.0,id=pcie.1 \
  -device vfio-pci,host=1:00.0,bus=pcie.1,addr=0.0,multifunction=on \
  -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1

qemu-system-x86_64: -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1: vfio \
  0000:01:00.1: group 1 used in multiple address spaces

However, when QEMU incorporates proper aliasing, we can make use of a
PCIe-to-PCI bridge to mask the requester ID, resulting in a hack that
provides the downstream devices with the same AddressSpace, ex:

qemu-system-x86_64 -machine q35... \
  -device intel-iommu,intremap=on \
  -device pcie-pci-bridge,addr=1e.0,id=pci.1 \
  -device vfio-pci,host=1:00.0,bus=pci.1,addr=1.0,multifunction=on \
  -device vfio-pci,host=1:00.1,bus=pci.1,addr=1.1

While the utility of this hack may be limited, this AddressSpace
aliasing is the correct behavior for QEMU to emulate bare metal.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 hw/pci/pci.c |   33 +++++++++++++++++++++++++++++++--
 1 file changed, 31 insertions(+), 2 deletions(-)