Message ID: 20230116124709.793084-1-eric.auger@redhat.com
State: New, archived
Series: [RFC] virtio-iommu: Take into account possible aliasing in virtio_iommu_mr()
Hi Eric,

On Mon, Jan 16, 2023 at 07:47:09AM -0500, Eric Auger wrote:
[...]
> once we attempt to plug such devices downstream to the pcie-to-pci
> bridge, those devices are put in a singleton group. The pcie-to-pci
> bridge disappears from the radar (not attached to any group), and the
> pcie root port the pcie-to-pci bridge is plugged onto is put in a
> separate singleton group. So the topology is wrong and definitely
> different from the one observed with the intel iommu.
>
> I suspect some other issue in the viot/pci probing on guest kernel
> side. I would appreciate help on that kernel part.
>
> qemu command excerpt:
> for x86_64:
>
> -device ioh3420,bus=pcie.0,chassis=1,id=pcie.1,addr=0x2.0x0
> -device pcie-pci-bridge,id=pcie_pci_bridge1,bus=pcie.1
> -netdev tap,id=mynet0,ifname=tap0,script=qemu-ifup,downscript=qemu-ifdown
> -device e1000,netdev=mynet0,bus=pcie_pci_bridge1
>
> guest pci topology:
>
> -[0000:00]-+-00.0  Intel Corporation 82G33/G31/P35/P31 Express DRAM Controller
>            +-01.0  Device 1234:1111
>            +-02.0-[01-02]----00.0-[02]----00.0  Intel Corporation 82540EM Gigabit Ethernet Controller
>
> wrong guest topology:
> /sys/kernel/iommu_groups/2/devices/0000:00:02.0 (pcie root port)
> /sys/kernel/iommu_groups/1/devices/0000:02:00.0 (E1000)
> no group for 0000:01:00.0 (the pcie-to-pci bridge)

This highlights several problems in the kernel:

(1) The pcie-to-pci bridge doesn't get an IOMMU group. That's a bug in
the VIOT driver, where we use the wrong struct device when dealing with
PCI aliases. I'll send a fix.

(2) e1000 gets a different group than the pcie-to-pci bridge, and than
other devices on that bus. This wrongly enables assigning aliased
devices to different VMs. It's not specific to virtio-iommu, I can get
the same result with SMMUv3, both DT and ACPI:

qemu-system-aarch64 -M virt,gic-version=3,its=on,iommu=smmuv3 -cpu max -m 4G -smp 8 \
 -kernel Image -initrd initrd -append console=ttyAMA0 -nographic \
 -device pcie-root-port,chassis=1,id=pcie.1,bus=pcie.0 \
 -device pcie-pci-bridge,id=pcie.2,bus=pcie.1 \
 -device e1000,netdev=net0,bus=pcie.2,addr=1.0 -netdev user,id=net0 \
 -device e1000,netdev=net1,bus=pcie.2,addr=2.0 -netdev user,id=net1

I think the logic in pci_device_group() expects the devices to be
probed in a specific top-down order, so that when we get to a device,
all its parents already have a group. Starting with arbitrary devices
would require walking down and across the tree to find aliases, which
is too complex.

As you noted, the Intel IOMMU doesn't have this problem, because
probing happens in the expected order: the driver loads at the right
time and bus_iommu_probe() ends up calling pci_device_group() in the
right order. For virtio-iommu and SMMU, the drivers are subject to
probe deferral, and device drivers are bound kinda randomly by the
driver core. In that case it's acpi/of_iommu_configure() that calls
pci_device_group().

I'm not sure what the best approach is to fix this. It seems wrong to
rely on previous driver probes in pci_device_group() at all, because
even if you probe in the right order, you could build a kernel without
the PCIe port driver and the aliased endpoints would still be put in
different groups. Maybe pci_device_group(), when creating a new group,
should immediately assign that group to all parents that alias the
device? Or maybe we can reuse the RID alias infrastructure used by
quirks.
(3) with the SMMU, additional devices on the PCI bus can't get an IOMMU
group:

[    2.019587] e1000 0000:02:01.0: Adding to iommu group 0
[    2.389750] e1000 0000:02:02.0: stream 512 already in tree

Due to cdf315f907d4 ("iommu/arm-smmu-v3: Maintain a SID->device
structure"), the SMMUv3 driver doesn't support RID aliasing anymore.
The structure needs to be extended to support multiple devices per SID,
in which case the fault handler will choose an arbitrary device
associated with the faulting SID.

(4) When everything else works, on Arm ACPI the PCIe root port is put
in a separate group due to ACS being enabled, but on other platforms
(including Arm DT), the root port is in the same group. I haven't
looked into that.

> with intel iommu we get the following topology:
> /sys/kernel/iommu_groups/2/devices/0000:02:00.0
> /sys/kernel/iommu_groups/2/devices/0000:01:00.0
> /sys/kernel/iommu_groups/2/devices/0000:00:02.0
>
> on aarch64 we get the same using those devices:
> -device pcie-root-port,bus=pcie.0,chassis=1,id=pcie.1,addr=0x2.0x0
> -device pcie-pci-bridge,id=pcie_pci_bridge1,bus=pcie.1
> -netdev tap,id=mynet0,ifname=tap0,script=qemu-ifup,downscript=qemu-ifdown
> -device e1000,netdev=mynet0,bus=pcie_pci_bridge1
>
> Signed-off-by: Eric Auger <eric.auger@redhat.com>
> ---
>  hw/virtio/virtio-iommu.c | 15 +++++++++++++++
>  1 file changed, 15 insertions(+)
>
> diff --git a/hw/virtio/virtio-iommu.c b/hw/virtio/virtio-iommu.c
> index 23c470977e..58c367b744 100644
> --- a/hw/virtio/virtio-iommu.c
> +++ b/hw/virtio/virtio-iommu.c
> @@ -178,6 +178,21 @@ static IOMMUMemoryRegion *virtio_iommu_mr(VirtIOIOMMU *s, uint32_t sid)
>          dev = iommu_pci_bus->pbdev[devfn];
>          if (dev) {
>              return &dev->iommu_mr;
> +        } else { /* check possible aliasing */
> +            PCIBus *pbus = iommu_pci_bus->bus;
> +
> +            if (!pci_bus_is_express(pbus)) {
> +                PCIDevice *parent = pbus->parent_dev;
> +
> +                if (pci_is_express(parent) &&
> +                    pcie_cap_get_type(parent) == PCI_EXP_TYPE_PCI_BRIDGE) {
> +                    devfn = PCI_DEVFN(0, 0);
> +                    dev = iommu_pci_bus->pbdev[devfn];
> +                    if (dev) {
> +                        return &dev->iommu_mr;
> +                    }
> +                }
> +            }

Yes, I think that makes sense. Maybe the comment should refer to
pci_device_iommu_address_space(), which explains what is happening.
Since this exactly mirrors what that function does, do we also need the
case where the parent is not PCIe? In that case the bus is the parent
and devfn is that of the bridge.

Thanks,
Jean
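To make the ordering dependency described in point (2) concrete, below
is a small, self-contained toy model of the upstream group walk. It is
not the real pci_device_group() code; the struct and function names are
invented for the illustration. The idea it models is the one stated
above: each device reuses the first group found on an upstream bridge
and only allocates a fresh group when none is found, so a leaf device
probed before its bridges ends up in a singleton group.

/*
 * Toy model of the upstream group walk -- NOT the real
 * pci_device_group() implementation, only an illustration of why probe
 * order matters. All names here are invented for the example.
 */
#include <stdio.h>

struct toy_dev {
    const char *name;
    struct toy_dev *parent_bridge; /* upstream bridge, NULL at root bus */
    int group;                     /* 0 = no group assigned yet */
};

static int next_group = 1;

/* Reuse the first group found upstream, otherwise allocate a new one. */
static void toy_probe(struct toy_dev *dev)
{
    for (struct toy_dev *p = dev->parent_bridge; p; p = p->parent_bridge) {
        if (p->group) {
            dev->group = p->group;
            printf("%s -> reuses group %d\n", dev->name, dev->group);
            return;
        }
    }
    dev->group = next_group++;
    printf("%s -> new group %d\n", dev->name, dev->group);
}

int main(void)
{
    struct toy_dev root_port = { "00:02.0 root port", NULL, 0 };
    struct toy_dev bridge    = { "01:00.0 pcie-to-pci bridge", &root_port, 0 };
    struct toy_dev e1000     = { "02:00.0 e1000", &bridge, 0 };

    /* Top-down order (Intel IOMMU case): all three share one group.
     * toy_probe(&root_port); toy_probe(&bridge); toy_probe(&e1000);
     */

    /* Deferred, leaf-first order (virtio-iommu/SMMU case): */
    toy_probe(&e1000);     /* no parent has a group yet -> singleton */
    toy_probe(&root_port); /* also ends up in its own group */
    /* the bridge may never be probed at all, cf. problem (1) */
    return 0;
}

With the leaf-first order the model prints a new group for the e1000
and another for the root port while the bridge gets none, mirroring the
wrong guest topology quoted above.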
On 2023-01-20 15:28, Jean-Philippe Brucker wrote:

For some reason this came through as a blank mail with a text
attachment, so apologies for the lack of quoting, but for reference
this is a long-standing known issue:

https://lore.kernel.org/linux-iommu/9625faf4-48ef-2dd3-d82f-931d9cf26976@huawei.com/

In summary, yes, hanging {of,acpi}_iommu_configure() off driver probe
is massively wrong, and pulling that step into iommu_probe_device() in
a sane and robust manner is one of the next things to start on after
getting the bus ops out of the way.

Thanks,
Robin.
On Fri, Jan 20, 2023 at 03:50:18PM +0000, Robin Murphy wrote:
> On 2023-01-20 15:28, Jean-Philippe Brucker wrote:
>
> For some reason this came through as blank mail with a text attachment,

Ugh, sorry about that, looks like I hit ^D in mutt before sending.

> so apologies for the lack of quoting, but for reference this is a
> long-standing known issue:
>
> https://lore.kernel.org/linux-iommu/9625faf4-48ef-2dd3-d82f-931d9cf26976@huawei.com/
>
> In summary, yes, hanging {of,acpi}_iommu_configure() off driver probe is
> massively wrong, and pulling that step into iommu_probe_device() in a sane
> and robust manner is one of the next things to start on after getting the
> bus ops out of the way.

Right, thanks for confirming.

Thanks,
Jean
Hi Jean, Robin,

On 1/20/23 17:23, Jean-Philippe Brucker wrote:
> On Fri, Jan 20, 2023 at 03:50:18PM +0000, Robin Murphy wrote:
>> On 2023-01-20 15:28, Jean-Philippe Brucker wrote:
>>
>> For some reason this came through as blank mail with a text attachment,
> Ugh sorry about that, looks like I hit ^D in mutt before sending
>
>> so apologies for the lack of quoting, but for reference this is a
>> long-standing known issue:
>>
>> https://lore.kernel.org/linux-iommu/9625faf4-48ef-2dd3-d82f-931d9cf26976@huawei.com/
>>
>> In summary, yes, hanging {of,acpi}_iommu_configure() off driver probe is
>> massively wrong, and pulling that step into iommu_probe_device() in a sane
>> and robust manner is one of the next things to start on after getting the
>> bus ops out of the way.
> Right, thanks for confirming

Thank you very much for your support and confirmation of the probe issue!

Eric

>
> Thanks,
> Jean
>
diff --git a/hw/virtio/virtio-iommu.c b/hw/virtio/virtio-iommu.c
index 23c470977e..58c367b744 100644
--- a/hw/virtio/virtio-iommu.c
+++ b/hw/virtio/virtio-iommu.c
@@ -178,6 +178,21 @@ static IOMMUMemoryRegion *virtio_iommu_mr(VirtIOIOMMU *s, uint32_t sid)
         dev = iommu_pci_bus->pbdev[devfn];
         if (dev) {
             return &dev->iommu_mr;
+        } else { /* check possible aliasing */
+            PCIBus *pbus = iommu_pci_bus->bus;
+
+            if (!pci_bus_is_express(pbus)) {
+                PCIDevice *parent = pbus->parent_dev;
+
+                if (pci_is_express(parent) &&
+                    pcie_cap_get_type(parent) == PCI_EXP_TYPE_PCI_BRIDGE) {
+                    devfn = PCI_DEVFN(0, 0);
+                    dev = iommu_pci_bus->pbdev[devfn];
+                    if (dev) {
+                        return &dev->iommu_mr;
+                    }
+                }
+            }
         }
     }
     return NULL;
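For reference, the additional case raised at the end of the review (a
conventional PCI bridge as parent, where the alias is the bridge's own
bus and devfn) could look roughly like the sketch below. This is only
an illustration of the suggestion, not a tested change: the function
name virtio_iommu_mr_with_alias() and the loop structure are invented
here, while the helpers and types (iommu_find_iommu_pcibus(),
pci_bus_is_express(), pci_is_express(), pcie_cap_get_type(),
pci_get_bus(), pci_bus_num(), IOMMUPCIBus, IOMMUDevice) are the ones
visible in the quoted code, and the walk is assumed to mirror
pci_device_iommu_address_space().

/*
 * Sketch only: besides the PCIe-to-PCI bridge alias handled by the
 * patch, also handle a conventional parent bridge, where the alias is
 * the bridge itself. Retry the lookup with the parent bus number and
 * the bridge's devfn, mirroring pci_device_iommu_address_space().
 */
static IOMMUMemoryRegion *virtio_iommu_mr_with_alias(VirtIOIOMMU *s,
                                                     uint32_t sid)
{
    uint8_t bus_n = PCI_BUS_NUM(sid);
    uint8_t devfn = sid & 0xff;

    for (;;) {
        IOMMUPCIBus *iommu_pci_bus = iommu_find_iommu_pcibus(s, bus_n);
        IOMMUDevice *dev;
        PCIBus *pbus;
        PCIDevice *parent;

        if (!iommu_pci_bus) {
            return NULL;
        }
        dev = iommu_pci_bus->pbdev[devfn];
        if (dev) {
            return &dev->iommu_mr;
        }

        pbus = iommu_pci_bus->bus;
        if (pci_bus_is_express(pbus) || !pbus->parent_dev) {
            return NULL;               /* no aliasing to fall back on */
        }
        parent = pbus->parent_dev;

        if (pci_is_express(parent) &&
            pcie_cap_get_type(parent) == PCI_EXP_TYPE_PCI_BRIDGE) {
            /* PCIe-to-PCI bridge: downstream devices alias to <bus>:00.0 */
            if (devfn == PCI_DEVFN(0, 0)) {
                return NULL;           /* alias already tried */
            }
            devfn = PCI_DEVFN(0, 0);
        } else {
            /* Conventional parent bridge: the alias is the bridge itself */
            bus_n = pci_bus_num(pci_get_bus(parent));
            devfn = parent->devfn;
        }
    }
}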
When devices are plugged downstream to a pcie-to-pci bridge, the
virtio-iommu driver still sends virtio-iommu commands using the RID
without aliasing (i.e. devfn != 0). However, only an IOMMU MR matching
<bus>:00.0 was registered, so virtio_iommu_mr() fails to identify the
memory region matching the ep_id. This results in the probe command
failing and the devices not being attached. As a result they are not
associated with any IOMMU group.

Recognize the case where the device is plugged onto a pcie-to-pci
bridge and apply the mask to retrieve the right IOMMU MR.

This trick allows having the correct IOMMU group topology in the guest
with virtio-pci devices. However it is not sufficient to fix the IOMMU
group topology issue with basic emulated NICs or VFIO devices. Even
with this patch, once we attempt to plug such devices downstream to the
pcie-to-pci bridge, those devices are put in a singleton group. The
pcie-to-pci bridge disappears from the radar (not attached to any
group), and the pcie root port the pcie-to-pci bridge is plugged onto
is put in a separate singleton group. So the topology is wrong and
definitely different from the one observed with the intel iommu.

I suspect some other issue in the viot/pci probing on guest kernel
side. I would appreciate help on that kernel part.

qemu command excerpt:
for x86_64:

-device ioh3420,bus=pcie.0,chassis=1,id=pcie.1,addr=0x2.0x0
-device pcie-pci-bridge,id=pcie_pci_bridge1,bus=pcie.1
-netdev tap,id=mynet0,ifname=tap0,script=qemu-ifup,downscript=qemu-ifdown
-device e1000,netdev=mynet0,bus=pcie_pci_bridge1

guest pci topology:

-[0000:00]-+-00.0  Intel Corporation 82G33/G31/P35/P31 Express DRAM Controller
           +-01.0  Device 1234:1111
           +-02.0-[01-02]----00.0-[02]----00.0  Intel Corporation 82540EM Gigabit Ethernet Controller

wrong guest topology:
/sys/kernel/iommu_groups/2/devices/0000:00:02.0 (pcie root port)
/sys/kernel/iommu_groups/1/devices/0000:02:00.0 (E1000)
no group for 0000:01:00.0 (the pcie-to-pci bridge)

with intel iommu we get the following topology:
/sys/kernel/iommu_groups/2/devices/0000:02:00.0
/sys/kernel/iommu_groups/2/devices/0000:01:00.0
/sys/kernel/iommu_groups/2/devices/0000:00:02.0

on aarch64 we get the same using those devices:
-device pcie-root-port,bus=pcie.0,chassis=1,id=pcie.1,addr=0x2.0x0
-device pcie-pci-bridge,id=pcie_pci_bridge1,bus=pcie.1
-netdev tap,id=mynet0,ifname=tap0,script=qemu-ifup,downscript=qemu-ifdown
-device e1000,netdev=mynet0,bus=pcie_pci_bridge1

Signed-off-by: Eric Auger <eric.auger@redhat.com>
---
 hw/virtio/virtio-iommu.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)
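As a side note on the arithmetic behind the fix: the endpoint ID
carried in the virtio-iommu requests is just the BDF, while the only MR
registered on the QEMU side corresponds to the alias <secondary
bus>:00.0, which is what masking devfn to 0 recovers. Below is a
minimal standalone illustration; the macros mirror QEMU's PCI helpers,
and the 02:01.0 endpoint is borrowed from the SMMU reproducer earlier
in the thread.

/*
 * Standalone illustration of the RID aliasing the patch handles.
 * The macros mirror QEMU's PCI_DEVFN()/PCI_BUILD_BDF()/PCI_BUS_NUM().
 */
#include <stdio.h>
#include <stdint.h>

#define PCI_DEVFN(slot, func)     ((((slot) & 0x1f) << 3) | ((func) & 0x07))
#define PCI_BUILD_BDF(bus, devfn) (((bus) << 8) | (devfn))
#define PCI_BUS_NUM(bdf)          (((bdf) >> 8) & 0xff)

int main(void)
{
    /* Endpoint 02:01.0 behind the PCIe-to-PCI bridge. */
    uint16_t ep_id = PCI_BUILD_BDF(2, PCI_DEVFN(1, 0));

    /* Alias as seen by the IOMMU: <secondary bus>:00.0, i.e. 02:00.0. */
    uint16_t alias = PCI_BUILD_BDF(PCI_BUS_NUM(ep_id), PCI_DEVFN(0, 0));

    /*
     * The guest driver sends ep_id (0x0208); only an IOMMU MR for the
     * alias (0x0200) was registered, hence the devfn masking above.
     */
    printf("ep_id = 0x%04x, alias = 0x%04x\n", ep_id, alias);
    return 0;
}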