Message ID: 20230421165037.2506-1-Jonathan.Cameron@huawei.com (mailing list archive)
State: New, archived
Series: [RFC] hw/arm/virt: Provide DT binding generation for PCI eXpander Bridges
On Fri, 21 Apr 2023 at 17:50, Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:
>
> This patch and the problem it is trying to resolve will form part of a talk
> I will be giving next week at Linaro Connect. https://sched.co/1K85p
>
> So in the spirit of giving people almost no warning... Plus being able to
> say the discussion has kicked off here is the simplest solution I could
> come up with. If we can agree on an approach to the sizing of memory
> windows question by Thursday even better (I'll update the slides!) ;)
>
> This is a precursor for CXL on arm-virt (for which DT support was
> requested). CXL QEMU emulation uses pxb-cxl which inherits from the
> slightly simpler pxb-pcie. ACPI support for these additional host bridges
> has been available for some time but relies on an interesting dance with
> EDK2:
> * During initial board creation the PXB buses are largely ignored
>   and ACPI tables + DT passed to EDK2 only expose the root bus on
>   which the section of the PXB instance that exists as a normal PCI EP
>   may be discovered.
> * EDK2 then carries out full PCI bus enumeration, including the buses
>   below the PXB host bridges.
> * Finally EDK2 calls back into QEMU to rebuild the tables via
>   virt_acpi_build_update(), at which point the correct DSDT for the
>   PXB host bridges is generated and an adjusted DSDT section is generated
>   for the primary host bridge (holes are punched in the _CRS).
>
> For device tree bindings there is no such dance with the firmware and as
> such we need QEMU to directly present device tree entries for the primary
> PCIe bus and the PXB instances.

So, not knowing anything about CXL, my naive question is: this is PCI,
why can't we handle it the way we handle everything else PCI, i.e.
present the PCI controller in the DTB and it's the guest kernel's job
to probe, enumerate, etc whatever it wants ?

thanks
-- PMM
On Mon, 24 Apr 2023 10:28:30 +0100
Peter Maydell <peter.maydell@linaro.org> wrote:

> On Fri, 21 Apr 2023 at 17:50, Jonathan Cameron
> <Jonathan.Cameron@huawei.com> wrote:
> >
> > This patch and the problem it is trying to resolve will form part of a talk
> > I will be giving next week at Linaro Connect. https://sched.co/1K85p
> >
> > So in the spirit of giving people almost no warning... Plus being able to
> > say the discussion has kicked off here is the simplest solution I could
> > come up with. If we can agree on an approach to the sizing of memory
> > windows question by Thursday even better (I'll update the slides!) ;)
> >
> > This is a precursor for CXL on arm-virt (for which DT support was
> > requested). CXL QEMU emulation uses pxb-cxl which inherits from the
> > slightly simpler pxb-pcie. ACPI support for these additional host bridges
> > has been available for some time but relies on an interesting dance with
> > EDK2:
> > * During initial board creation the PXB buses are largely ignored
> >   and ACPI tables + DT passed to EDK2 only expose the root bus on
> >   which the section of the PXB instance that exists as a normal PCI EP
> >   may be discovered.
> > * EDK2 then carries out full PCI bus enumeration, including the buses
> >   below the PXB host bridges.
> > * Finally EDK2 calls back into QEMU to rebuild the tables via
> >   virt_acpi_build_update(), at which point the correct DSDT for the
> >   PXB host bridges is generated and an adjusted DSDT section is generated
> >   for the primary host bridge (holes are punched in the _CRS).
> >
> > For device tree bindings there is no such dance with the firmware and as
> > such we need QEMU to directly present device tree entries for the primary
> > PCIe bus and the PXB instances.
>
> So, not knowing anything about CXL, my naive question is:
> this is PCI, why can't we handle it the way we handle everything
> else PCI, i.e. present the PCI controller in the DTB and it's
> the guest kernel's job to probe, enumerate, etc whatever it wants ?

Absolutely the kernel will still do the enumeration. But there's a problem -
it won't always work unless there is 'enough room'.

If the aim is to do it in a similar fashion to how PCI Expander Bridges (PXB)
are handled today with ACPI (we could look at doing something different of
course):

We have one set of memory windows for assigning PCI BARs into etc. In the
model used for PXB, that set of resources is shared between different host
bridges including the main one (each one has a separate DT description).

So for virt, VIRT_PCIE_MMIO, VIRT_PCIE_IO, VIRT_PCIE_ECAM, VIRT_HIGH_PCIE_ECAM
and VIRT_HIGH_PCIE_MMIO are split up between the host bridges. Each host
bridge has its own DT description.

For ACPI, how to split the available memory windows up depends on the
requirements of the PCIe devices below the host bridges. For ACPI we 'know'
the answer as to what is required at the point of creating the description
because EDK2 did the work for us. For DT we currently don't. What to do about
that is the question this RFC tries to address.

In theory we could change the kernel to support enumerating PXB instances,
but that's a very different model from today where they are just 'standard'
host bridges, using the generic bindings for ACPI (and after this patch for
DT).

I'll add an example in a reply (on the wrong machine and stuck in a meeting
today).

Jonathan

>
> thanks
> -- PMM
On Mon, 24 Apr 2023 at 16:41, Jonathan Cameron
<Jonathan.Cameron@huawei.com> wrote:
>
> On Mon, 24 Apr 2023 10:28:30 +0100
> Peter Maydell <peter.maydell@linaro.org> wrote:
> > So, not knowing anything about CXL, my naive question is:
> > this is PCI, why can't we handle it the way we handle everything
> > else PCI, i.e. present the PCI controller in the DTB and it's
> > the guest kernel's job to probe, enumerate, etc whatever it wants ?
>
> Absolutely the kernel will still do the enumeration. But there's a problem
> - it won't always work, unless there is 'enough room'.
>
> If the aim is to do it in a similar fashion to how PCI Expander Bridges (PXB)
> are handled today with ACPI (we could look at doing something different of course):
>
> We have one set of memory windows for assigning PCI BARs into etc. In the model
> used for PXB, that set of resources is shared between different host bridges
> including the main one (each one has a separate DT description).
>
> So for virt, VIRT_PCIE_MMIO, VIRT_PCIE_IO, VIRT_PCIE_ECAM, VIRT_HIGH_PCIE_ECAM
> and VIRT_HIGH_PCIE_MMIO are split up between the host bridges.
> Each host bridge has its own DT description.
>
> For ACPI, how to split the available memory windows up depends on the requirements
> of the PCIe devices below the host bridges. For ACPI we 'know' the answer
> as to what is required at the point of creating the description because EDK2
> did the work for us. For DT we currently don't. What to do about that is the
> question this RFC tries to address.
>
> In theory we could change the kernel to support enumerating PXB instances, but
> that's a very different model from today where they are just 'standard'
> host bridges, using the generic bindings for ACPI (and after this patch for DT).

On the other hand, having QEMU enumerate PCI devices is *also* a
very different model from today, where we assume that the guest
code is the one that is going to deal with enumerating PCI devices.
To my mind one of the major advantages of PCI is exactly that it
is entirely probeable and discoverable, so that there is no need
for the dtb to include a lot of information that the kernel can
find out for itself by directly asking the hardware...

thanks
-- PMM
On Mon, 24 Apr 2023 16:45:48 +0100
Peter Maydell <peter.maydell@linaro.org> wrote:

> On Mon, 24 Apr 2023 at 16:41, Jonathan Cameron
> <Jonathan.Cameron@huawei.com> wrote:
> >
> > On Mon, 24 Apr 2023 10:28:30 +0100
> > Peter Maydell <peter.maydell@linaro.org> wrote:
> > > So, not knowing anything about CXL, my naive question is:
> > > this is PCI, why can't we handle it the way we handle everything
> > > else PCI, i.e. present the PCI controller in the DTB and it's
> > > the guest kernel's job to probe, enumerate, etc whatever it wants ?
> >
> > Absolutely the kernel will still do the enumeration. But there's a problem
> > - it won't always work, unless there is 'enough room'.
> >
> > If the aim is to do it in a similar fashion to how PCI Expander Bridges (PXB)
> > are handled today with ACPI (we could look at doing something different of course):
> >
> > We have one set of memory windows for assigning PCI BARs into etc. In the model
> > used for PXB, that set of resources is shared between different host bridges
> > including the main one (each one has a separate DT description).
> >
> > So for virt, VIRT_PCIE_MMIO, VIRT_PCIE_IO, VIRT_PCIE_ECAM, VIRT_HIGH_PCIE_ECAM
> > and VIRT_HIGH_PCIE_MMIO are split up between the host bridges.
> > Each host bridge has its own DT description.
> >
> > For ACPI, how to split the available memory windows up depends on the requirements
> > of the PCIe devices below the host bridges. For ACPI we 'know' the answer
> > as to what is required at the point of creating the description because EDK2
> > did the work for us. For DT we currently don't. What to do about that is the
> > question this RFC tries to address.
> >
> > In theory we could change the kernel to support enumerating PXB instances, but
> > that's a very different model from today where they are just 'standard'
> > host bridges, using the generic bindings for ACPI (and after this patch for DT).
>
> On the other hand, having QEMU enumerate PCI devices is *also* a
> very different model from today, where we assume that the guest
> code is the one that is going to deal with enumerating PCI devices.
> To my mind one of the major advantages of PCI is exactly that it
> is entirely probeable and discoverable, so that there is no need
> for the dtb to include a lot of information that the kernel can
> find out for itself by directly asking the hardware...

I absolutely agree that QEMU enumerating PCI seems a silly level of complexity
to introduce. So the easy route is to just use the bus numbers to partition
resources. We have those available without any complexity. It's not the same
as how it's done with ACPI, but then the alternatives aren't either (though
maybe they are closer). Note the current proposed algorithm may be too
simplistic (perhaps some alignment forcing should adjust the division of the
resources to start on round-number addresses).

A more complex route would be to push the whole problem to the kernel. I
doubt we'll get any agreement to add this to the generic host bridge DT
bindings given it's just to support QEMU-specific host bridges which the
kernel has previously had no knowledge of beyond as generic PCIe host
bridges. So we'd need QEMU to detect the presence of PXB instances, then use
a new DT compatible for the main host bridge and kick off a new type of host
bridge discovery. Probably also need to modify EDK2 to recognize this, which
will give us compatibility problems with existing ACPI usage (EDK2 reads the
DT bindings to kick off its PCIe enumeration, so this gets fiddly). May also
have problems with the kernel having to do late discovery of host bridges
(not tried it, but sounds 'interesting'.)

Anyhow, I promised some illustrative examples. For reference, here's one of
my test configurations (slightly pathological as it exercises some of the
heuristics in enumeration):

-+-[0000:00]-+-00.0  Red Hat, Inc. QEMU PCIe Host bridge
 |           +-01.0-[01]--
 |           +-02.0  Red Hat, Inc. Virtio block device
 |           +-03.0  Red Hat, Inc. Virtio network device
 |           +-04.0  Red Hat, Inc. QEMU PCIe Expander bridge
 |           \-05.0  Red Hat, Inc. QEMU PCIe Expander bridge
 +-[0000:0c]---00.0-[0d-0f]--+-00.0-[0e-0f]----00.0-[0f]----00.0  Red Hat, Inc. Virtio RNG
 |                           \-00.1  Red Hat, Inc. Virtio RNG
 \-[0000:30]---00.0-[31]----00.0  Red Hat, Inc. Virtio RNG

It's somewhat easier to see the current ACPI results vs what this patch
proposes for DT by looking at /proc/iomem than by looking at the firmware
descriptions themselves. /proc/iomem includes the firmware-described host
bridge windows and the PCI buses and devices enumerated below them (EDK2
enumerated for ACPI, Linux kernel for DT).

There are some other subtle differences in how the memory is mapped, but
hopefully you can see the relationship between what happens with ACPI
(requiring enumeration) and the proposal in this RFC for DT, which doesn't.

Relevant chunks of /proc/iomem for ACPI for the 32-bit region (64-bit one
similar):

//FW described main HB
10000000-103fffff : PCI Bus 0000:00
  //Everything indented discovered by enumeration of PCI bus.
  10000000-101fffff : PCI Bus 0000:01
  10200000-1023ffff : 0000:00:03.0
  10240000-10240fff : 0000:00:01.0
  10241000-10241fff : 0000:00:02.0
  10242000-10242fff : 0000:00:03.0
//FW described. 1st PXB on bus 0xc
//Key: start is defined as previous enumerated device end + some heuristic-driven padding.
10400000-10700fff : PCI Bus 0000:0c
  10400000-105fffff : PCI Bus 0000:0d
    10400000-10400fff : PCI Bus 0000:0e
      10400000-10400fff : PCI Bus 0000:0f
        10400000-10400fff : 0000:0f:00.0
    10401000-10401fff : 0000:0d:00.1
  10600000-10600fff : 0000:0c:00.0
//FW described spare bit assigned to main HB
10701000-107fffff : PCI Bus 0000:00
//FW described. 2nd PXB on bus 0x30
//Key: start is defined as previous enumerated device end + some heuristic-driven padding.
10800000-10a00fff : PCI Bus 0000:30
  10800000-108fffff : PCI Bus 0000:31
    10800000-10800fff : 0000:31:00.0
  10900000-10900fff : 0000:30:00.0
10a01000-3efeffff : PCI Bus 0000:00

DT version of the equivalent from this patch. Memory split up based only on
bus numbers:

//FW described.
10000000-1233f3ff : pcie@10000000
  //Everything indented is enumerated by the Linux kernel
  10000000-101fffff : PCI Bus 0000:01
  10200000-1023ffff : 0000:00:03.0
  10240000-10240fff : 0000:00:01.0
  10241000-10241fff : 0000:00:02.0
  10242000-10242fff : 0000:00:03.0
  //lots of space here
//FW described 1st PXB - note way above end of previous enumerated region.
1233f400-18cfcfff : pcie@1233f400
  12340000-12340fff : 0001:0c:00.0
  12400000-126fffff : PCI Bus 0001:0d
    12400000-125fffff : PCI Bus 0001:0e
      12400000-125fffff : PCI Bus 0001:0f
        12400000-12400fff : 0001:0f:00.0
    12600000-12600fff : 0001:0d:00.1
  //lots of space here..
//FW described 2nd PXB - note way above end of previous enumerated region.
18cfd000-3efeffff : pcie@18cfd000
  18cfd000-18cfdfff : 0002:30:00.0
  18d00000-18efffff : PCI Bus 0002:31
    18d00000-18d00fff : 0002:31:00.0
  //lots of space here.

Note the main bridge _CRS and hence /proc/iomem entries have entries after
the PXBs because QEMU provides the left-over bits of the window, even though
there is no easy way to use them. For the DT scheme they end up in the space
for whichever PXB has the highest bus number.

In this case we have masses of room in the DT scheme as all the devices have
small BARs. I have test cases that fail to have enough room, but those are
even worse to read and require hacks to the PCI test device driver.

I'm not great at reading DT binary from sysfs, so some of the DT version of
this is ripped from /proc/iomem above rather than taken directly:

//Only first entry of each _CRS type is actually useful.
//In DT node: pcie@10000000
Device (PCI0)
{
    Name (_HID, "PNP0A08" /* PCI Express Bus */)  // _HID: Hardware ID
    Name (_CID, "PNP0A03" /* PCI Bus */)  // _CID: Compatible ID
    Name (_SEG, Zero)  // _SEG: PCI Segment
    Name (_BBN, Zero)  // _BBN: BIOS Bus Number
    ...
    Name (_CRS, ResourceTemplate ()  // _CRS: Current Resource Settings
    {
        //Bus numbers 0 to B :
        // DT bus-range = [0, b]
        WordBusNumber (ResourceProducer, MinFixed, MaxFixed, PosDecode,
            0x0000,             // Granularity
            0x0000,             // Range Minimum
            0x000B,             // Range Maximum
            0x0000,             // Translation Offset
            0x000C,             // Length
            ,, )
        //First part to cover devices enumerated + space from heuristics (bit of padding)
        // DT ranges first entry 10000000-1233f3ff : pcie@10000000
        DWordMemory (ResourceProducer, PosDecode, MinFixed, MaxFixed, NonCacheable, ReadWrite,
            0x00000000,         // Granularity
            0x10000000,         // Range Minimum
            0x103FFFFF,         // Range Maximum
            0x00000000,         // Translation Offset
            0x00400000,         // Length
            ,, , AddressRangeMemory, TypeStatic)
        //A hole from enumeration of the PXBs. Doesn't exist in DT form...
        DWordMemory (ResourceProducer, PosDecode, MinFixed, MaxFixed, NonCacheable, ReadWrite,
            0x00000000,         // Granularity
            0x10701000,         // Range Minimum
            0x107FFFFF,         // Range Maximum
            0x00000000,         // Translation Offset
            0x000FF000,         // Length
            ,, , AddressRangeMemory, TypeStatic)
        //Left over above PXBs. Doesn't exist in DT form where we assign 'the rest' to
        //the PXB with the highest bus number.
        DWordMemory (ResourceProducer, PosDecode, MinFixed, MaxFixed, NonCacheable, ReadWrite,
            0x00000000,         // Granularity
            0x10A01000,         // Range Minimum
            0x3EFEFFFF,         // Range Maximum
            0x00000000,         // Translation Offset
            0x2E5EF000,         // Length
            ,, , AddressRangeMemory, TypeStatic)
        //First part to cover devices enumerated (+ space from heuristics)
        //I left the whole range for IO on the DT version. Don't really care about this
        //being available for modern PCIe devices below PXBs anyway.
        DWordIO (ResourceProducer, MinFixed, MaxFixed, PosDecode, EntireRange,
            0x00000000,         // Granularity
            0x00000000,         // Range Minimum
            0x00000FFF,         // Range Maximum
            0x3EFF0000,         // Translation Offset
            0x00001000,         // Length
            ,, , TypeStatic, DenseTranslation)
        //Left over of original window - not used by anything
        DWordIO (ResourceProducer, MinFixed, MaxFixed, PosDecode, EntireRange,
            0x00000000,         // Granularity
            0x00003000,         // Range Minimum
            0x0000FFFF,         // Range Maximum
            0x3EFF0000,         // Translation Offset
            0x0000D000,         // Length
            ,, , TypeStatic, DenseTranslation)
        //First part to cover the devices enumerated.
        // DT proportional version: 8000000000-85ffffffff : pcie@10000000
        // Note that in this example all the devices have small BARs so with enumeration
        // they don't need as much space as I gave with the DT version.
        QWordMemory (ResourceProducer, PosDecode, MinFixed, MaxFixed, NonCacheable, ReadWrite,
            0x0000000000000000, // Granularity
            0x0000008000000000, // Range Minimum
            0x00000080000FFFFF, // Range Maximum
            0x0000000000000000, // Translation Offset
            0x0000000000100000, // Length
            ,, , AddressRangeMemory, TypeStatic)
        //Left over of original window - not used by anything.
        QWordMemory (ResourceProducer, PosDecode, MinFixed, MaxFixed, NonCacheable, ReadWrite,
            0x0000000000000000, // Granularity
            0x0000008000400000, // Range Minimum
            0x000000FFFFFFFFFF, // Range Maximum
            0x0000000000000000, // Translation Offset
            0x0000007FFFC00000, // Length
            ,, , AddressRangeMemory, TypeStatic)
    })
    ...

Device (PC0C)
{
    Name (_HID, "PNP0A08" /* PCI Express Bus */)  // _HID: Hardware ID
    Name (_CID, "PNP0A03" /* PCI Bus */)  // _CID: Compatible ID
    Name (_BBN, 0x0C)  // _BBN: BIOS Bus Number
    Name (_UID, 0x0C)  // _UID: Unique ID
    ...
    Name (_CRS, ResourceTemplate ()  // _CRS: Current Resource Settings
    {
        // IO space big enough for enumerated devices on first PXB
        // For DT I gave no IO space to PXBs for now.
        DWordIO (ResourceProducer, MinFixed, MaxFixed, PosDecode, EntireRange,
            0x00000000,         // Granularity
            0x00001000,         // Range Minimum
            0x00001FFF,         // Range Maximum
            0x3EFF0000,         // Translation Offset
            0x00001000,         // Length
            ,, , TypeStatic, DenseTranslation)
        //Lower mem region for enumerated devices on first PXB
        //DT : 1233f400-18cfcfff : pcie@1233f400
        DWordMemory (ResourceProducer, PosDecode, MinFixed, MaxFixed, NonCacheable, ReadWrite,
            0x00000000,         // Granularity
            0x10400000,         // Range Minimum
            0x10700FFF,         // Range Maximum
            0x00000000,         // Translation Offset
            0x00301000,         // Length
            ,, , AddressRangeMemory, TypeStatic)
        //High mem region for enumerated devices on first PXB
        //DT: 8600000000-97ffffffff : pcie@1233f400
        QWordMemory (ResourceProducer, PosDecode, MinFixed, MaxFixed, NonCacheable, ReadWrite,
            0x0000000000000000, // Granularity
            0x0000008000100000, // Range Minimum
            0x00000080002FFFFF, // Range Maximum
            0x0000000000000000, // Translation Offset
            0x0000000000200000, // Length
            ,, , AddressRangeMemory, TypeStatic)
        //PXB covers bus numbers 0xc to 0xf
        WordBusNumber (ResourceProducer, MinFixed, MaxFixed, PosDecode,
            0x0000,             // Granularity
            0x000C,             // Range Minimum
            0x000F,             // Range Maximum
            0x0000,             // Translation Offset
            0x0004,             // Length
            ,, )
    })

Device (PC30)
{
    Name (_HID, "PNP0A08" /* PCI Express Bus */)  // _HID: Hardware ID
    Name (_CID, "PNP0A03" /* PCI Bus */)  // _CID: Compatible ID
    Name (_BBN, 0x30)  // _BBN: BIOS Bus Number
    Name (_UID, 0x30)  // _UID: Unique ID
    Name (_STR, Unicode ("pxb Device"))  // _STR: Description String
    Name (_CCA, One)  // _CCA: Cache Coherency Attribute
    Name (_PRT, Package (0x80)  // _PRT: PCI Routing Table
    ...
    Name (_CRS, ResourceTemplate ()  // _CRS: Current Resource Settings
    {
        //IO space for 2nd PXB - punched out of the main bus space.
        //DT: Again no IO space.
        DWordIO (ResourceProducer, MinFixed, MaxFixed, PosDecode, EntireRange,
            0x00000000,         // Granularity
            0x00002000,         // Range Minimum
            0x00002FFF,         // Range Maximum
            0x3EFF0000,         // Translation Offset
            0x00001000,         // Length
            ,, , TypeStatic, DenseTranslation)
        //low mem space for 2nd PXB punched out of main host bridge space.
        // DT: 18cfd000-3efeffff : pcie@18cfd000
        DWordMemory (ResourceProducer, PosDecode, MinFixed, MaxFixed, NonCacheable, ReadWrite,
            0x00000000,         // Granularity
            0x10800000,         // Range Minimum
            0x10A00FFF,         // Range Maximum
            0x00000000,         // Translation Offset
            0x00201000,         // Length
            ,, , AddressRangeMemory, TypeStatic)
        //high mem space for 2nd PXB punched out of main host bridge space.
        // DT: 9800000000-ffffffffff : pcie@18cfd000
        QWordMemory (ResourceProducer, PosDecode, MinFixed, MaxFixed, NonCacheable, ReadWrite,
            0x0000000000000000, // Granularity
            0x0000008000300000, // Range Minimum
            0x00000080003FFFFF, // Range Maximum
            0x0000000000000000, // Translation Offset
            0x0000000000100000, // Length
            ,, , AddressRangeMemory, TypeStatic)
        // Just 2 bus numbers (directly connected device)
        WordBusNumber (ResourceProducer, MinFixed, MaxFixed, PosDecode,
            0x0000,             // Granularity
            0x0030,             // Range Minimum
            0x0031,             // Range Maximum
            0x0000,             // Translation Offset
            0x0002,             // Length
            ,, )
    })

Maybe this mess will help illustrate the issues and approaches. I personally
think the rough approach of this patch is the most sensible: don't enumerate
the bus at all for the DT case, just do partitioning based on bus numbers.
In corner cases that will mean the kernel enumeration fails when things would
have worked with the current ACPI / EDK2 dance that allows enumeration before
the ACPI tables the kernel uses are written.

Jonathan

>
> thanks
> -- PMM
On Mon, 24 Apr 2023 at 22:56, Jonathan Cameron
<Jonathan.Cameron@huawei.com> wrote:
>
> On Mon, 24 Apr 2023 16:45:48 +0100
> Peter Maydell <peter.maydell@linaro.org> wrote:
> > On the other hand, having QEMU enumerate PCI devices is *also* a
> > very different model from today, where we assume that the guest
> > code is the one that is going to deal with enumerating PCI devices.
> > To my mind one of the major advantages of PCI is exactly that it
> > is entirely probeable and discoverable, so that there is no need
> > for the dtb to include a lot of information that the kernel can
> > find out for itself by directly asking the hardware...
>
> I absolutely agree that QEMU enumerating PCI seems a silly level of complexity
> to introduce. So the easy route is to just use the bus numbers to partition
> resources. We have those available without any complexity. It's not the
> same as how it's done with ACPI, but then the alternatives aren't either
> (though maybe they are closer). Note the current proposed algorithm may be
> too simplistic (perhaps some alignment forcing should adjust the division
> of the resources to start on round-number addresses)

I think we definitely need to talk about this later this week,
but my initial view is that if:

(1) the guest kernel can get the information it needs to do this
    by probing the hardware
(2) doing it in QEMU gives you "this isn't a great allocation",
    "we don't really have the info we need to do it optimally",
    "this is more of a policy decision" effects
    (which is what it's sounding like to me)

this is a really strong argument for "guest software should be
doing this". DTB-booting kernels has always meant the kernel
doing a lot of work that under ACPI/UEFI/x86-PC is typically
done by firmware, and this seems similar to me.

thanks
-- PMM
On Tue, 25 Apr 2023 09:28:44 +0100
Peter Maydell <peter.maydell@linaro.org> wrote:

> On Mon, 24 Apr 2023 at 22:56, Jonathan Cameron
> <Jonathan.Cameron@huawei.com> wrote:
> >
> > On Mon, 24 Apr 2023 16:45:48 +0100
> > Peter Maydell <peter.maydell@linaro.org> wrote:
> > > On the other hand, having QEMU enumerate PCI devices is *also* a
> > > very different model from today, where we assume that the guest
> > > code is the one that is going to deal with enumerating PCI devices.
> > > To my mind one of the major advantages of PCI is exactly that it
> > > is entirely probeable and discoverable, so that there is no need
> > > for the dtb to include a lot of information that the kernel can
> > > find out for itself by directly asking the hardware...
> >
> > I absolutely agree that QEMU enumerating PCI seems a silly level of complexity
> > to introduce. So the easy route is to just use the bus numbers to partition
> > resources. We have those available without any complexity. It's not the
> > same as how it's done with ACPI, but then the alternatives aren't either
> > (though maybe they are closer). Note the current proposed algorithm may be
> > too simplistic (perhaps some alignment forcing should adjust the division
> > of the resources to start on round-number addresses)
>
> I think we definitely need to talk about this later this week,
> but my initial view is that if:
> (1) the guest kernel can get the information it needs to do this
>     by probing the hardware

Not currently (from a quick look) - see below. But we could potentially make
it visible by augmenting the config space of the PCIe devices that are
discoverable. Or we could in theory ignore the bus numbers that we do provide
as QEMU parameters, but that would be rather surprising for users.

> (2) doing it in QEMU gives you "this isn't a great allocation",
>     "we don't really have the info we need to do it optimally",
>     "this is more of a policy decision" effects
>     (which is what it's sounding like to me)
>
> this is a really strong argument for "guest software should be
> doing this". DTB-booting kernels has always meant the kernel
> doing a lot of work that under ACPI/UEFI/x86-PC is typically
> done by firmware, and this seems similar to me.

We could explore only solving the problem for pxb-cxl for now. However, we
would still be talking about infrastructure in the kernel only to support
emulated CXL devices, and I can see that being controversial. A normal CXL
host bridge is not something we can enumerate.

I guess this may call for a PoC on the kernel side of things to see how bad
it looks. I suspect very ugly indeed :( but I could be wrong.

I think we'll also have to augment the PXB PCI devices that appear on the
main bus to provide discoverability they don't currently have (such as bus
numbers). Probably a DVSEC entry in PCI extended config space. Current dump
of what is there:

root@debian:~# lspci -s 05.0 -xxxx -vvv
00:05.0 Host bridge: Red Hat, Inc. QEMU PCIe Expander bridge
        Subsystem: Red Hat, Inc. QEMU PCIe Expander bridge
        Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap- 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
00: 36 1b 0b 00 00 00 a0 00 00 00 00 06 00 00 00 00
10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 f4 1a 00 11
30: 00 00 00 00 00 00 00 00 00 00 00 00 ff 00 00 00
40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Which is pretty light on info.

Jonathan

>
> thanks
> -- PMM
On Tue, 25 Apr 2023 at 18:37, Jonathan Cameron
<Jonathan.Cameron@huawei.com> wrote:
> We could explore only solving the problem for pxb-cxl for now.
> However, we would still be talking infrastructure in kernel only
> to support emulated CXL devices and I can see that being
> controversial. A normal CXL host bridge is not something
> we can enumerate.

Hmm, so what is real hardware doing that our emulation is not?

-- PMM
On Tue, 25 Apr 2023 21:15:25 +0100
Peter Maydell <peter.maydell@linaro.org> wrote:

> On Tue, 25 Apr 2023 at 18:37, Jonathan Cameron
> <Jonathan.Cameron@huawei.com> wrote:
> > We could explore only solving the problem for pxb-cxl for now.
> > However, we would still be talking infrastructure in kernel only
> > to support emulated CXL devices and I can see that being
> > controversial. A normal CXL host bridge is not something
> > we can enumerate.
>
> Hmm, so what is real hardware doing that our emulation is not?

For real hardware, the host bridges would not typically share the various
windows. Various options, but most likely they would be in different PCI
segments, with separate ECAM space and separate space into which the BARs
would be mapped. That separation would be needed as the physical address
routing in the host would need to direct the reads and writes to the correct
bit of hardware, and that tends to be done with very simple address decoders
on the appropriate interconnects - those internal routing tables are normally
effectively fixed for a given system.

We could add another full host bridge to arm-virt. That would require
separate emulation as I don't think we can reuse the pxb-cxl stuff.

The PXB approach is to ignore the normal restrictions on routing being fairly
fixed, by building a fixed configuration before the OS sees it - allowing
different host bridges to use different parts of a single region. Arguably
very early boot firmware could do that with a physical system, but it would
involve some nasty impdef programming of address decoder logic. This would be
similar to what firmware does to change the routing depending on whether it
finds a 2 socket configuration or a 4 socket configuration in servers. What
entity would do this is implementation defined - it may well be before any
application processor boots.

Jonathan

>
> -- PMM
diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index ac626b3bef..4648437bba 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -55,6 +55,8 @@
 #include "qemu/bitops.h"
 #include "qemu/error-report.h"
 #include "qemu/module.h"
+#include "hw/pci/pci_bridge.h"
+#include "hw/pci/pci_bus.h"
 #include "hw/pci-host/gpex.h"
 #include "hw/virtio/virtio-pci.h"
 #include "hw/core/sysbus-fdt.h"
@@ -1648,6 +1650,210 @@ static void virt_build_smbios(VirtMachineState *vms)
     }
 }
 
+static void virt_add_pxb_dt(VirtMachineState *vms, PCIBus *pxb_bus,
+                            int max_bus_num, hwaddr mmio_per_bus,
+                            hwaddr mmio_high_per_bus,
+                            int pci_domain,
+                            Error **errp)
+{
+    void *fdt = MACHINE(vms)->fdt;
+    int ecam_id = VIRT_ECAM_ID(vms->highmem_ecam);
+    char *nodename_pxb;
+    uint8_t bus_num = pci_bus_num(pxb_bus);
+    hwaddr base_mmio_pxb = vms->memmap[VIRT_PCIE_MMIO].base +
+        mmio_per_bus * bus_num;
+    hwaddr size_mmio_pxb;
+    hwaddr base_mmio_high_pxb = vms->memmap[VIRT_HIGH_PCIE_MMIO].base +
+        mmio_high_per_bus * bus_num;
+    hwaddr size_mmio_high_pxb;
+    hwaddr base_ecam_pxb, size_ecam_pxb;
+    int buses;
+
+    buses = pci_bus_count_buses(pxb_bus);
+    if (bus_num + buses >= max_bus_num) {
+        error_setg(errp,
+                   "Insufficient bus numbers (req %d > avail: %d) available for PXB topology",
+                   buses, max_bus_num - bus_num - 1);
+        return;
+    }
+
+    base_ecam_pxb = vms->memmap[ecam_id].base;
+    base_ecam_pxb += bus_num * PCIE_MMCFG_SIZE_MIN;
+    size_ecam_pxb = PCIE_MMCFG_SIZE_MIN * (max_bus_num - bus_num);
+    size_mmio_pxb = mmio_per_bus * (max_bus_num - bus_num);
+    size_mmio_high_pxb = mmio_high_per_bus * (max_bus_num - bus_num);
+    nodename_pxb = g_strdup_printf("/pcie@%" PRIx64, base_mmio_pxb);
+    qemu_fdt_add_subnode(fdt, nodename_pxb);
+    qemu_fdt_setprop_cell(fdt, nodename_pxb, "linux,pci-domain", pci_domain);
+    qemu_fdt_setprop_string(fdt, nodename_pxb, "device_type", "pci");
+    qemu_fdt_setprop_cell(fdt, nodename_pxb, "#address-cells", 3);
+    qemu_fdt_setprop_cell(fdt, nodename_pxb, "#size-cells", 2);
+    qemu_fdt_setprop_string(fdt, nodename_pxb,
+                            "compatible", "pci-host-ecam-generic");
+    /* I'm not sure what this should be. */
+    if (vms->msi_phandle) {
+        qemu_fdt_setprop_cells(fdt, nodename_pxb, "msi-map",
+                               0, vms->msi_phandle, 0, 0x10000);
+    }
+    qemu_fdt_setprop_cells(fdt, nodename_pxb, "bus-range",
+                           bus_num, max_bus_num - 1);
+    qemu_fdt_setprop_sized_cells(fdt, nodename_pxb, "reg", 2, base_ecam_pxb,
+                                 2, size_ecam_pxb);
+
+    if (vms->highmem_mmio) {
+        qemu_fdt_setprop_sized_cells(fdt, nodename_pxb, "ranges",
+                                     1, FDT_PCI_RANGE_IOPORT, 2, 0,
+                                     /* No PIO space for PXB */
+                                     2, 0, 2, 0,
+                                     1, FDT_PCI_RANGE_MMIO,
+                                     2, base_mmio_pxb,
+                                     2, base_mmio_pxb,
+                                     2, size_mmio_pxb,
+                                     1, FDT_PCI_RANGE_MMIO_64BIT,
+                                     2, base_mmio_high_pxb,
+                                     2, base_mmio_high_pxb,
+                                     2, size_mmio_high_pxb);
+    } else {
+        qemu_fdt_setprop_sized_cells(fdt, nodename_pxb, "ranges",
+                                     1, FDT_PCI_RANGE_IOPORT, 2, 0,
+                                     2, 0, 2, 0,
+                                     1, FDT_PCI_RANGE_MMIO,
+                                     2, base_mmio_pxb,
+                                     2, base_mmio_pxb,
+                                     2, size_mmio_pxb);
+    }
+}
+
+static void virt_update_pci_dt(VirtMachineState *vms, Error **errp)
+{
+    void *fdt = MACHINE(vms)->fdt;
+    char *nodename = vms->pciehb_nodename;
+    hwaddr base_mmio = vms->memmap[VIRT_PCIE_MMIO].base;
+    hwaddr base_mmio_high = vms->memmap[VIRT_HIGH_PCIE_MMIO].base;
+    hwaddr size_mmio, size_mmio_high;
+    hwaddr base_pio = vms->memmap[VIRT_PCIE_PIO].base;
+    hwaddr size_pio = vms->memmap[VIRT_PCIE_PIO].size;
+    int ecam_id = VIRT_ECAM_ID(vms->highmem_ecam);
+    hwaddr base_ecam = vms->memmap[ecam_id].base;
+    int total_bus_nr = vms->memmap[ecam_id].size / PCIE_MMCFG_SIZE_MIN;
+    hwaddr mmio_per_bus, mmio_high_per_bus, size_ecam;
+    int pci_domain = 1; /* Main bus is 0 */
+    int buses, max_bus_nr;
+    PCIBus *bus;
+
+    /*
+     * EDK2 ACPI flow for PXB instances breaks if we modify the DT to
+     * incorporate them. So for now only enable DT for PXBs if acpi=off.
+     */
+    if (virt_is_acpi_enabled(vms)) {
+        return;
+    }
+
+    /*
+     * Only support PCI Expander Bridges if highmem_ecam is available.
+     * Hard to squeeze them into the smaller ECAM.
+     */
+    if (!vms->highmem_ecam) {
+        return;
+    }
+
+    /* First update DT for the main PCI bus */
+    bus = vms->bus;
+    max_bus_nr = total_bus_nr;
+    /* Find the max bus nr as the smallest of the PXB bus numbers */
+    QLIST_FOREACH(bus, &bus->child, sibling) {
+        uint8_t bus_num;
+
+        if (!pci_bus_is_root(bus)) {
+            continue;
+        }
+        bus_num = pci_bus_num(bus);
+        if (bus_num < max_bus_nr) {
+            max_bus_nr = bus_num;
+        }
+    }
+
+    buses = pci_bus_count_buses(vms->bus);
+    if (buses >= max_bus_nr) {
+        error_setg(errp,
+                   "Insufficient bus numbers (req %d > avail: %d) available for primary PCIe topology",
+                   buses, max_bus_nr - 1);
+        return;
+    }
+
+    /*
+     * For other resources options are:
+     * 1) Divide them up based on bus number ranges.
+     * 2) Discover what is needed by doing part of bus enumeration.
+     *    This is complex, and we may not want that complexity in QEMU.
+     */
+
+    size_ecam = max_bus_nr * PCIE_MMCFG_SIZE_MIN;
+    if (size_ecam > vms->memmap[ecam_id].size) {
+        size_ecam = vms->memmap[ecam_id].size;
+    }
+
+    mmio_per_bus = vms->memmap[VIRT_PCIE_MMIO].size / total_bus_nr;
+    size_mmio = mmio_per_bus * max_bus_nr;
+    mmio_high_per_bus = vms->memmap[VIRT_HIGH_PCIE_MMIO].size / total_bus_nr;
+    size_mmio_high = mmio_high_per_bus * max_bus_nr;
+    qemu_fdt_setprop_cells(fdt, nodename, "bus-range", 0, max_bus_nr - 1);
+
+    qemu_fdt_setprop_sized_cells(fdt, nodename, "reg", 2, base_ecam,
+                                 2, size_ecam);
+
+    /*
+     * Leave all of PIO space for the main GPEX - avoids issues with
+     * regions that are too small to map in the OS.
+     */
+    if (vms->highmem_mmio) {
+        qemu_fdt_setprop_sized_cells(fdt, nodename, "ranges",
+                                     1, FDT_PCI_RANGE_IOPORT, 2, 0,
+                                     2, base_pio, 2, size_pio,
+                                     1, FDT_PCI_RANGE_MMIO, 2, base_mmio,
+                                     2, base_mmio, 2, size_mmio,
+                                     1, FDT_PCI_RANGE_MMIO_64BIT,
+                                     2, base_mmio_high,
+                                     2, base_mmio_high, 2, size_mmio_high);
+    } else {
+        qemu_fdt_setprop_sized_cells(fdt, nodename, "ranges",
+                                     1, FDT_PCI_RANGE_IOPORT, 2, 0,
+                                     2, base_pio, 2, size_pio,
+                                     1, FDT_PCI_RANGE_MMIO, 2, base_mmio,
+                                     2, base_mmio, 2, size_mmio);
+    }
+
+    /* Now add the PCI Expander Bridges (PXB) */
+    bus = vms->bus;
+    QLIST_FOREACH(bus, &bus->child, sibling) {
+        uint16_t max_bus_num = 0x100;
+        PCIBus *bus_start = vms->bus;
+        int bus_num = pci_bus_num(bus);
+
+        if (!pci_bus_is_root(bus)) {
+            continue;
+        }
+
+        /* Find the minimum PXB bus number greater than this one */
+        QLIST_FOREACH(bus_start, &bus_start->child, sibling) {
+            uint8_t this_bus_num;
+
+            if (!pci_bus_is_root(bus_start)) {
+                continue;
+            }
+
+            this_bus_num = pci_bus_num(bus_start);
+            if (this_bus_num > bus_num &&
+                this_bus_num < max_bus_num) {
+                max_bus_num = this_bus_num;
+            }
+        }
+
+        virt_add_pxb_dt(vms, bus, max_bus_num, mmio_per_bus, mmio_high_per_bus,
+                        pci_domain++, errp);
+    }
+}
+
 static void virt_machine_done(Notifier *notifier, void *data)
 {
@@ -1658,6 +1864,14 @@ void virt_machine_done(Notifier *notifier, void *data)
     struct arm_boot_info *info = &vms->bootinfo;
     AddressSpace *as = arm_boot_address_space(cpu, info);
 
+    /*
+     * If PCI Expander Bridges (pxb-pcie) have been added, the adjustments to
+     * the main PCIe DT entries and creation of those for the PXB host bridge
+     * entries may only be made after all PCI devices are present, as only
+     * at that time can resource requirements (bus numbers etc.) be known.
+     */
+    virt_update_pci_dt(vms, &error_fatal);
+
     /*
      * If the user provided a dtb, we assume the dynamic sysbus nodes
      * already are integrated there. This corresponds to a use case where
diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index def5000e7b..47fd581fe4 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -552,6 +552,27 @@ void pci_root_bus_cleanup(PCIBus *bus)
     qbus_unrealize(BUS(bus));
 }
 
+int pci_bus_count_buses(PCIBus *bus)
+{
+    int buses = 0;
+    int devfn;
+
+    for (devfn = 0; devfn < 256; devfn++) {
+        PCIDevice *d;
+
+        if (!bus->devices[devfn]) {
+            continue;
+        }
+        d = bus->devices[devfn];
+        if (!object_dynamic_cast(OBJECT(d), TYPE_PCI_BRIDGE)) {
+            continue;
+        }
+        buses += pci_bus_count_buses(&PCI_BRIDGE(d)->sec_bus);
+        buses++;
+    }
+    return buses;
+}
+
 void pci_bus_irqs(PCIBus *bus, pci_set_irq_fn set_irq, void *irq_opaque,
                   int nirq)
 {
diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index d5a40cd058..565a968c14 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -282,6 +282,7 @@ PCIBus *pci_root_bus_new(DeviceState *parent, const char *name,
                          MemoryRegion *address_space_io,
                          uint8_t devfn_min, const char *typename);
 void pci_root_bus_cleanup(PCIBus *bus);
+int pci_bus_count_buses(PCIBus *bus);
 void pci_bus_irqs(PCIBus *bus, pci_set_irq_fn set_irq, void *irq_opaque,
                   int nirq);
 void pci_bus_map_irqs(PCIBus *bus, pci_map_irq_fn map_irq);
This patch and the problem it is trying to resolve will form part of a talk
I will be giving next week at Linaro Connect. https://sched.co/1K85p

So in the spirit of giving people almost no warning... Plus being able to
say the discussion has kicked off here is the simplest solution I could
come up with. If we can agree on an approach to the sizing of memory
windows question by Thursday, even better (I'll update the slides!) ;)

This is a precursor for CXL on arm-virt (for which DT support was
requested). CXL QEMU emulation uses pxb-cxl, which inherits from the
slightly simpler pxb-pcie. ACPI support for these additional host bridges
has been available for some time but relies on an interesting dance with
EDK2:
* During initial board creation the PXB buses are largely ignored,
  and the ACPI tables + DT passed to EDK2 only expose the root bus, on
  which the section of the PXB instance that exists as a normal PCI EP
  may be discovered.
* EDK2 then carries out full PCI bus enumeration, including the buses
  below the PXB host bridges.
* Finally EDK2 calls back into QEMU to rebuild the tables via
  virt_acpi_build_update(), at which point the correct DSDT for the
  PXB host bridges is generated and an adjusted DSDT section is generated
  for the primary host bridge (holes are punched in the _CRS).

For device tree bindings there is no such dance with the firmware, and as
such we need QEMU to directly present device tree entries for the primary
PCIe bus and the PXB instances. This cannot be fully done either at
initial PCIe instantiation (PXB instances are not yet instantiated) or in
virt_machine_done() as it is for ACPI (virtio-mem hot plug routines rely
on the presence of the primary PCIe bus). Thus the proposed solution is to
set things up initially without being careful, and later update them with
additional checks that we don't get overlapping bus numbers.

It 'might' be possible to use hotplug handlers to update the DT as we go
along, but that would be rather challenging as each additional PXB
introduction would sometimes require an update of the DT for all other
instances.

Note: the linux,pci-domain DT element is more restrictive than the
equivalent in ACPI, so for now just put each PXB in its own domain. I'll
look at relaxing that on the kernel side, but then this code won't
initially work with older kernels - so this lack of alignment with ACPI
systems may have to persist (you end up with a single segment for ACPI
systems, and multiple for DT).

Another question is how to allocate the resources. A PXB instance uses a
section of the ECAM space (which is fine, as that is defined by bus
numbers) and also part of each of the PCI windows. When EDK2 is involved,
the allocation of the windows is done after enumeration of connected PCI
devices / switches etc, with heuristics adding extra space for hotplug.
Thus the windows configured for the bridges can be easily established and
written into the ACPI _CRS entries.

I've considered two options:
1) (More or less) enumerate the full PCI topology in QEMU, just don't
   actually write the registers for routing. Apply similar heuristics
   (note that EDK2 and the kernel may have different ones, and they have
   changed over time) to establish how much memory is needed in each
   window. This is a complex solution that may still provide allocations
   that aren't big enough for the kernel to then enumerate the PCI
   topology using its own algorithms.
2) Use the one thing we definitely control for PXB instances: the base
   bus number. Based on how many buses are available to each PXB
   instance, allocate a proportion of the available memory windows.
   Again, it's more than possible that this won't leave enough room to
   accommodate all of the necessary BARs, but it is fairly simple to
   implement and fairly simple to understand when the allocation doesn't
   work. This option is what the current code does.

Note that either way, any regressions should be limited to systems using
DT with PXB instances, which would be odd given they don't currently
work.

RFC being posted to get feedback on the general approach we should take.
One option would be to ignore pxb-pcie and only deal with pxb-cxl, on the
basis that that removes the risk of regressions because we don't support
CXL on arm-virt yet. However, the code will end up much the same, so we
can discuss this without the added fun that CXL will bring.

Other more minor nasties:
* EDK2 breaks if we pass it a DT that includes the PXB instances.
* Need to look closer at the interrupt mapping - or potentially make this
  MSI/MSIX only. The equivalent is broken on x86 / q35 :)

Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
---
 hw/arm/virt.c        | 214 +++++++++++++++++++++++++++++++++++++++++++
 hw/pci/pci.c         |  21 +++++
 include/hw/pci/pci.h |   1 +
 3 files changed, 236 insertions(+)