
[RFC] virtio-iommu: Take into account possible aliasing in virtio_iommu_mr()

Message ID 20230116124709.793084-1-eric.auger@redhat.com (mailing list archive)
State New, archived
Series [RFC] virtio-iommu: Take into account possible aliasing in virtio_iommu_mr()

Commit Message

Eric Auger Jan. 16, 2023, 12:47 p.m. UTC
When devices are plugged downstream of a PCIe-to-PCI bridge,
the virtio-iommu driver still sends virtio-iommu commands using
the RID without aliasing (i.e. with devfn != 0). However only an IOMMU
MR matching <bus>:00.0 was registered, so virtio_iommu_mr() fails
to identify the memory region matching the ep_id. This results
in the probe command failing and the devices not being attached.
As a result they are not associated with any iommu group.

Recognize the case where the device is plugged onto a pcie-to-pci
bridge and apply the mask to retrieve the right IOMMU MR.
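
For reference, a minimal illustration of the aliasing in question
(illustration only, not part of the patch; names come from QEMU's
hw/pci API):

#include "hw/pci/pci.h"

/*
 * Illustration only: the requester ID the IOMMU actually sees for a
 * device on a conventional PCI bus behind a PCIe-to-PCI bridge.  All
 * functions on that bus alias to <secondary bus>:00.0, which is the
 * only BDF an IOMMU memory region was registered for.
 */
static inline uint16_t aliased_sid(PCIDevice *dev)
{
    return PCI_BUILD_BDF(pci_bus_num(pci_get_bus(dev)), PCI_DEVFN(0, 0));
}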

This trick makes it possible to get the correct iommu group topology
in the guest with virtio-pci devices. However it is not sufficient to
fix the iommu group topology issue with basic emulated NICs or VFIO
devices. Even with this patch, once we attempt to plug such
devices downstream of the pcie-to-pci bridge, those devices are
put in a singleton group. The pcie-to-pci bridge disappears from
the radar (not attached to any group), and the pcie root port the
pcie-to-pci bridge is plugged onto is put in a separate singleton
group. So the topology is wrong and definitely different from the
one observed with the intel iommu.

I suspect some other issue in the viot/pci probing on the guest kernel
side. I would appreciate some feedback on that kernel part.

qemu command excerpt:
for x86_64:

-device ioh3420,bus=pcie.0,chassis=1,id=pcie.1,addr=0x2.0x0
-device pcie-pci-bridge,id=pcie_pci_bridge1,bus=pcie.1
-netdev tap,id=mynet0,ifname=tap0,script=qemu-ifup,downscript=qemu-ifdown
-device e1000,netdev=mynet0,bus=pcie_pci_bridge1

guest pci topology:

-[0000:00]-+-00.0  Intel Corporation 82G33/G31/P35/P31 Express DRAM Controller
           +-01.0  Device 1234:1111
           +-02.0-[01-02]----00.0-[02]----00.0  Intel Corporation 82540EM Gigabit Ethernet Controller

wrong guest topology:
/sys/kernel/iommu_groups/2/devices/0000:00:02.0 (pcie root port)
/sys/kernel/iommu_groups/1/devices/0000:02:00.0 (E1000)
no group for 0000:01:00.0 (the pcie-to-pci bridge)

with intel iommu we get the following topology:
/sys/kernel/iommu_groups/2/devices/0000:02:00.0
/sys/kernel/iommu_groups/2/devices/0000:01:00.0
/sys/kernel/iommu_groups/2/devices/0000:00:02.0

on aarch64 we get the same using those devices:
-device pcie-root-port,bus=pcie.0,chassis=1,id=pcie.1,addr=0x2.0x0
-device pcie-pci-bridge,id=pcie_pci_bridge1,bus=pcie.1
-netdev tap,id=mynet0,ifname=tap0,script=qemu-ifup,downscript=qemu-ifdown
-device e1000,netdev=mynet0,bus=pcie_pci_bridge1

Signed-off-by: Eric Auger <eric.auger@redhat.com>
---
 hw/virtio/virtio-iommu.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

Comments

Jean-Philippe Brucker Jan. 20, 2023, 3:28 p.m. UTC | #1
Hi Eric,

On Mon, Jan 16, 2023 at 07:47:09AM -0500, Eric Auger wrote:
[...]
> once we attempt to plug such devices downstream of the pcie-to-pci
> bridge, those devices are put in a singleton group. The pcie-to-pci
> bridge disappears from the radar (not attached to any group), and the
> pcie root port the pcie-to-pci bridge is plugged onto is put in a
> separate singleton group. So the topology is wrong and definitely
> different from the one observed with the intel iommu.
> 
> I suspect some other issue in the viot/pci probing on the guest kernel
> side. I would appreciate some feedback on that kernel part.
> 
> qemu command excerpt:
> for x86_64:
> 
> -device ioh3420,bus=pcie.0,chassis=1,id=pcie.1,addr=0x2.0x0
> -device pcie-pci-bridge,id=pcie_pci_bridge1,bus=pcie.1
> -netdev tap,id=mynet0,ifname=tap0,script=qemu-ifup,downscript=qemu-ifdown
> -device e1000,netdev=mynet0,bus=pcie_pci_bridge1
> 
> guest pci topology:
> 
> -[0000:00]-+-00.0  Intel Corporation 82G33/G31/P35/P31 Express DRAM Controller
>            +-01.0  Device 1234:1111
>            +-02.0-[01-02]----00.0-[02]----00.0  Intel Corporation 82540EM Gigabit Ethernet Controller
> 
> wrong guest topology:
> /sys/kernel/iommu_groups/2/devices/0000:00:02.0 (pcie root port)
> /sys/kernel/iommu_groups/1/devices/0000:02:00.0 (E1000)
> no group for 0000:01:00.0 (the pcie-to-pci bridge)

This highlights several problems in the kernel:


(1) The pcie-to-pci bridge doesn't get an IOMMU group. That's a bug in the
VIOT driver, where we use the wrong struct device when dealing with PCI
aliases. I'll send a fix.


(2) e1000 gets a different group from the pcie-to-pci bridge and from
other devices on that bus. This wrongly enables assigning aliased devices
to different VMs.

It's not specific to virtio-iommu, I can get the same result with SMMUv3,
both DT and ACPI:

qemu-system-aarch64 -M virt,gic-version=3,its=on,iommu=smmuv3 -cpu max -m 4G -smp 8 -kernel Image -initrd initrd -append console=ttyAMA0 -nographic \
	-device pcie-root-port,chassis=1,id=pcie.1,bus=pcie.0 \
	-device pcie-pci-bridge,id=pcie.2,bus=pcie.1 \
	-device e1000,netdev=net0,bus=pcie.2,addr=1.0 -netdev user,id=net0 \
	-device e1000,netdev=net1,bus=pcie.2,addr=2.0 -netdev user,id=net1

I think the logic in pci_device_group() expects the devices to be probed
in a specific top-down order, so that when we get to a device, all its
parents already have a group. Starting with arbitrary devices would
require walking down and across the tree to find aliases, which is too
complex.
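
For context, an abridged paraphrase of pci_device_group()
(drivers/iommu/iommu.c), heavily simplified and not the verbatim kernel
code, showing the ordering dependency: groups are only ever looked up
on upstream bridges, never created for them, so an endpoint probed
before its bridges falls through to a fresh singleton group.

#include <linux/iommu.h>
#include <linux/pci.h>

/* Simplified paraphrase, not the verbatim kernel function. */
static struct iommu_group *pci_device_group_sketch(struct device *dev)
{
	struct pci_dev *pdev = to_pci_dev(dev);
	struct pci_bus *bus;
	struct iommu_group *group;

	/*
	 * Walk upstream until PCI ACS isolates the path, reusing any
	 * group already assigned to a bridge along the way.
	 */
	for (bus = pdev->bus; !pci_is_root_bus(bus); bus = bus->parent) {
		if (!bus->self)
			continue;
		if (pci_acs_path_enabled(bus->self, NULL,
					 PCI_ACS_SV | PCI_ACS_RR |
					 PCI_ACS_CR | PCI_ACS_UF))
			break;
		group = iommu_group_get(&bus->self->dev);
		if (group)
			return group;
	}

	/* Nothing found upstream: allocate a brand new group. */
	return iommu_group_alloc();
}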

As you noted, the Intel IOMMU doesn't have this problem, because probing
happens in the expected order. The driver loads at the right time and
bus_iommu_probe() ends up calling pci_device_group() in the right order.
For virtio-iommu and SMMU, the drivers are subject to probe deferral, and
device drivers are bound kind of randomly by the driver core. In that case
it's acpi/of_iommu_configure() that calls pci_device_group().

I'm not sure what the best approach is to fix this. It seems wrong to rely
on previous driver probes in pci_device_group() at all, because even if
you probe in the right order, you could build a kernel without the PCIe
port driver and the aliased endpoints would still be put in different
groups. Maybe pci_device_group(), when creating a new group, should
immediately assign that group to all parents that alias the device?
Or maybe we can reuse the RID alias infrastructure used by quirks.
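
To illustrate the first idea, a hypothetical sketch (not tested or
proposed code, just the shape): after allocating the new group, walk
the DMA alias chain and attach the group to anything that doesn't have
one yet.

/* Hypothetical sketch only, not proposed code. */
static int set_alias_group(struct pci_dev *pdev, u16 alias, void *data)
{
	struct iommu_group *group = iommu_group_get(&pdev->dev);

	if (group)
		iommu_group_put(group);	/* already grouped, leave it */
	else
		iommu_group_add_device(data, &pdev->dev);
	return 0;
}

/*
 * ...and at the end of pci_device_group(), instead of returning a bare
 * iommu_group_alloc():
 */
	group = iommu_group_alloc();
	if (!IS_ERR(group))
		pci_for_each_dma_alias(pdev, set_alias_group, group);
	return group;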


(3) With the SMMU, additional devices on the PCI bus can't get an IOMMU
group:

[    2.019587] e1000 0000:02:01.0: Adding to iommu group 0
[    2.389750] e1000 0000:02:02.0: stream 512 already in tree

Due to cdf315f907d4 ("iommu/arm-smmu-v3: Maintain a SID->device
structure"), the SMMUv3 driver no longer supports RID aliasing.
The structure needs to be extended to hold multiple devices per SID, in
which case the fault handler will choose an arbitrary device associated
with the faulting SID.
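
For illustration, one possible shape for that extension (field names
are assumptions, not actual kernel code):

/*
 * Sketch only: turn the per-SID entry introduced by cdf315f907d4 into
 * a list of masters instead of a single back-pointer, so several
 * aliased devices can share a stream ID.
 */
struct arm_smmu_stream {
	u32			id;
	struct rb_node		node;		/* in the SID->stream rb-tree */
	struct list_head	masters;	/* was: struct arm_smmu_master *master */
};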


(4) When everything else works, on Arm ACPI the PCIe root port is put in a
separate group due to ACS being enabled, but on other platforms (including
Arm DT), the root port is in the same group. I haven't looked into that.


> with intel iommu we get the following topology:
> /sys/kernel/iommu_groups/2/devices/0000:02:00.0
> /sys/kernel/iommu_groups/2/devices/0000:01:00.0
> /sys/kernel/iommu_groups/2/devices/0000:00:02.0
> 
> on aarch64 we get the same using those devices:
> -device pcie-root-port,bus=pcie.0,chassis=1,id=pcie.1,addr=0x2.0x0
> -device pcie-pci-bridge,id=pcie_pci_bridge1,bus=pcie.1
> -netdev tap,id=mynet0,ifname=tap0,script=qemu-ifup,downscript=qemu-ifdown
> -device e1000,netdev=mynet0,bus=pcie_pci_bridge1
> 
> Signed-off-by: Eric Auger <eric.auger@redhat.com>
> ---
>  hw/virtio/virtio-iommu.c | 15 +++++++++++++++
>  1 file changed, 15 insertions(+)
> 
> diff --git a/hw/virtio/virtio-iommu.c b/hw/virtio/virtio-iommu.c
> index 23c470977e..58c367b744 100644
> --- a/hw/virtio/virtio-iommu.c
> +++ b/hw/virtio/virtio-iommu.c
> @@ -178,6 +178,21 @@ static IOMMUMemoryRegion *virtio_iommu_mr(VirtIOIOMMU *s, uint32_t sid)
>          dev = iommu_pci_bus->pbdev[devfn];
>          if (dev) {
>              return &dev->iommu_mr;
> +        } else { /* check possible aliasing */
> +            PCIBus *pbus = iommu_pci_bus->bus;
> +
> +            if (!pci_bus_is_express(pbus)) {
> +                PCIDevice *parent = pbus->parent_dev;
> +
> +                if (pci_is_express(parent) &&
> +                    pcie_cap_get_type(parent) == PCI_EXP_TYPE_PCI_BRIDGE) {
> +                    devfn = PCI_DEVFN(0, 0);
> +                    dev = iommu_pci_bus->pbdev[devfn];
> +                    if (dev) {
> +                        return &dev->iommu_mr;
> +                    }
> +                }
> +            }

Yes, I think that makes sense. Maybe the comment should refer to
pci_device_iommu_address_space(), which explains what is happening. Since
this exactly mirrors what that function does, do we also need the case
where the parent is not PCIe? In that case the bus is the parent bus and
the devfn is that of the bridge.
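
For reference, an untested sketch of what that extra branch could look
like, assuming virtio_iommu_mr() may simply recurse with the bridge's
own BDF, mirroring the else arm of pci_device_iommu_address_space():

/* Untested sketch, not part of the patch. */
if (pci_is_express(parent) &&
    pcie_cap_get_type(parent) == PCI_EXP_TYPE_PCI_BRIDGE) {
    devfn = PCI_DEVFN(0, 0);
    dev = iommu_pci_bus->pbdev[devfn];
    if (dev) {
        return &dev->iommu_mr;
    }
} else {
    /* conventional parent bridge: the alias is the bridge itself */
    return virtio_iommu_mr(s, PCI_BUILD_BDF(pci_bus_num(pci_get_bus(parent)),
                                            parent->devfn));
}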

Thanks,
Jean
Robin Murphy Jan. 20, 2023, 3:50 p.m. UTC | #2
On 2023-01-20 15:28, Jean-Philippe Brucker wrote:

For some reason this came through as a blank mail with a text attachment,
so apologies for the lack of quoting, but for reference this is a
long-standing known issue:

https://lore.kernel.org/linux-iommu/9625faf4-48ef-2dd3-d82f-931d9cf26976@huawei.com/

In summary, yes, hanging {of,acpi}_iommu_configure() off driver probe is 
massively wrong, and pulling that step into iommu_probe_device() in a 
sane and robust manner is one of the next things to start on after 
getting the bus ops out of the way.

Thanks,
Robin.
Jean-Philippe Brucker Jan. 20, 2023, 4:23 p.m. UTC | #3
On Fri, Jan 20, 2023 at 03:50:18PM +0000, Robin Murphy wrote:
> On 2023-01-20 15:28, Jean-Philippe Brucker wrote:
> 
> For some reason this came through as a blank mail with a text attachment,

Ugh sorry about that, looks like I hit ^D in mutt before sending

> so apologies for the lack of quoting, but for reference this is a
> long-standing known issue:
> 
> https://lore.kernel.org/linux-iommu/9625faf4-48ef-2dd3-d82f-931d9cf26976@huawei.com/
> 
> In summary, yes, hanging {of,acpi}_iommu_configure() off driver probe is
> massively wrong, and pulling that step into iommu_probe_device() in a sane
> and robust manner is one of the next things to start on after getting the
> bus ops out of the way.

Right, thanks for confirming

Thanks,
Jean
Eric Auger Jan. 20, 2023, 4:43 p.m. UTC | #4
Hi Jean, Robin,

On 1/20/23 17:23, Jean-Philippe Brucker wrote:
> On Fri, Jan 20, 2023 at 03:50:18PM +0000, Robin Murphy wrote:
>> On 2023-01-20 15:28, Jean-Philippe Brucker wrote:
>>
>> For some reason this came through as a blank mail with a text attachment,
> Ugh sorry about that, looks like I hit ^D in mutt before sending
>
>> so apologies for the lack of quoting, but for reference this is a
>> long-standing known issue:
>>
>> https://lore.kernel.org/linux-iommu/9625faf4-48ef-2dd3-d82f-931d9cf26976@huawei.com/
>>
>> In summary, yes, hanging {of,acpi}_iommu_configure() off driver probe is
>> massively wrong, and pulling that step into iommu_probe_device() in a sane
>> and robust manner is one of the next things to start on after getting the
>> bus ops out of the way.
> Right, thanks for confirming

Thank you very much for your support and confirmation of the probe issue!

Eric
>
> Thanks,
> Jean
>

Patch

diff --git a/hw/virtio/virtio-iommu.c b/hw/virtio/virtio-iommu.c
index 23c470977e..58c367b744 100644
--- a/hw/virtio/virtio-iommu.c
+++ b/hw/virtio/virtio-iommu.c
@@ -178,6 +178,21 @@  static IOMMUMemoryRegion *virtio_iommu_mr(VirtIOIOMMU *s, uint32_t sid)
         dev = iommu_pci_bus->pbdev[devfn];
         if (dev) {
             return &dev->iommu_mr;
+        } else { /* check possible aliasing */
+            PCIBus *pbus = iommu_pci_bus->bus;
+
+            if (!pci_bus_is_express(pbus)) {
+                PCIDevice *parent = pbus->parent_dev;
+
+                if (pci_is_express(parent) &&
+                    pcie_cap_get_type(parent) == PCI_EXP_TYPE_PCI_BRIDGE) {
+                    devfn = PCI_DEVFN(0, 0);
+                    dev = iommu_pci_bus->pbdev[devfn];
+                    if (dev) {
+                        return &dev->iommu_mr;
+                    }
+                }
+            }
         }
     }
     return NULL;