
[V4] docs: add PCIe devices placement guidelines

Message ID 1477930733-30873-1-git-send-email-marcel@redhat.com (mailing list archive)
State New, archived

Commit Message

Marcel Apfelbaum Oct. 31, 2016, 4:18 p.m. UTC
Proposes best practices on how to use PCI Express/PCI devices
in PCI Express based machines and explains the reasoning behind them.

Reviewed-by: Laszlo Ersek <lersek@redhat.com>
Signed-off-by: Marcel Apfelbaum <marcel@redhat.com>
---
 
Hi,

v3->v4:
 - Addressed minor typos spotted by Laszlo, thanks!

v2->v3:
 - Addressed the comments from Andrea Bolognani and Laszlo Ersek, which are
   much appreciated!
 - Added links to presentations that may help the understanding of the document.

RFC->v2:
 - Addressed a lot of comments from the reviewers (many thanks to all, especially to Laszlo)


Thanks,
Marcel
 

 docs/pcie.txt | 306 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 306 insertions(+)
 create mode 100644 docs/pcie.txt

Comments

Laine Stump Oct. 31, 2016, 5:44 p.m. UTC | #1
On 10/31/2016 12:18 PM, Marcel Apfelbaum wrote:

> +
> +2.2.1 Plugging a PCI Express device into a PCI Express Root Port:
> +          -device ioh3420,id=root_port1,slot=x[,chassis=y][,bus=pcie.0][,addr=z]  \
> +          -device <dev>,bus=root_port1
> +      Note that (slot, chassis) pair is mandatory and must be
> +      unique for each PCI Express Root Port.

I keep meaning to ask about this and forgetting - just what is "slot" 
used for? In the past we were told that chassis and *port* were 
mandatory and must be unique, but we hadn't before been told anything 
(that I can remember) about "slot":

* If chassis isn't specified by the user, we will set it to the index of 
the controller (libvirt internally sets an index for each PCI 
controller, with the root bus being index 0 and each subsequent 
controller being the next higher number; this is also the number 
specified as "bus" in the libvirt config for the devices that connect to 
that controller, which is then used to determine the "id" of the 
controller used in the commandline argument for the device).

* If port isn't specified by the user, then we set it according to where 
the port is attached in the root complex:

    port = slot << 3 + function

But what is slot?
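(Tangential nit on the formula above: if it is ever written down as C or
shell arithmetic it needs parentheses, since `+` binds tighter than `<<`
in both languages. The values below are made-up examples, not anything
libvirt actually uses:

```shell
# Hypothetical example values.
slot=2 function=1
echo $(( slot << 3 + function ))     # parses as slot << (3 + function)
echo $(( (slot << 3) + function ))   # the presumably intended (slot << 3) + function
```

The two forms print different numbers, so whichever tool computes this
should carry the explicit parentheses.)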

> +2.2.2 Using multi-function PCI Express Root Ports:
> +      -device ioh3420,id=root_port1,multifunction=on,chassis=x[,bus=pcie.0][,slot=y][,addr=z.0] \

Similar to what Laszlo reported about v3 - these examples show slot as 
optional, where the first example shows it as mandatory. Also, none of 
the examples show "port" (which we were previously told was mandatory, 
or at least that it needed to be unique) at all.

> +      -device ioh3420,id=root_port2,chassis=x1[,bus=pcie.0][,slot=y1][,addr=z.1] \
> +      -device ioh3420,id=root_port3,chassis=x2[,bus=pcie.0][,slot=y2][,addr=z.2] \
> +2.2.3 Plugging a PCI Express device into a Switch:
> +      -device ioh3420,id=root_port1,chassis=x[,bus=pcie.0][,slot=y][,addr=z]  \
> +      -device x3130-upstream,id=upstream_port1,bus=root_port1[,addr=x]          \
> +      -device xio3130-downstream,id=downstream_port1,bus=upstream_port1,chassis=x1[,slot=y1][,addr=z1]] \

...and there it is again, in downstream ports.

> +      -device <dev>,bus=downstream_port1
> +Note that 'addr' parameter can be 0 for all the examples above.

> +Prefer flat hierarchies. For most scenarios a single DMI-PCI Bridge
> +(having 32 slots) and several PCI-PCI Bridges attached to it
> +(each supporting also 32 slots)


> will support hundreds of legacy devices.
> +The recommendation is to populate one PCI-PCI Bridge under the DMI-PCI Bridge
> +until it is full and then plug a new PCI-PCI Bridge...
> +
> +   pcie.0 bus
> +   ----------------------------------------------
> +        |                            |
> +   -----------               ------------------
> +   | PCI Dev |               | DMI-PCI BRIDGE |
> +   ----------                ------------------
> +                               |            |
> +                  ------------------    ------------------
> +                  | PCI-PCI Bridge |    | PCI-PCI Bridge |
> +                  ------------------    ------------------
> +                                         |           |
> +                                  -----------     -----------
> +                                  | PCI Dev |     | PCI Dev |
> +                                  -----------     -----------

It's nit-picking, but it might make the method for expansion more 
obvious if there were a "..." to the right of the 2nd pci-bridge.

> +
> +2.3.1 To plug a PCI device into pcie.0 as Integrated Endpoint use:

s/as/as an/

> +      -device <dev>[,bus=pcie.0]
> +2.3.2 Plugging a PCI device into a PCI-PCI Bridge:
> +      -device i82801b11-bridge,id=dmi_pci_bridge1[,bus=pcie.0]                        \
> +      -device pci-bridge,id=pci_bridge1,bus=dmi_pci_bridge1[,chassis_nr=x][,addr=y]   \
> +      -device <dev>,bus=pci_bridge1[,addr=x]
> +      Note that 'addr' cannot be 0 unless shpc=off parameter is passed to
> +      the PCI Bridge.

(Tangentially related to your document: We need to decide what to do 
about this in libvirt - up until now I'd never heard of the shpc option, 
so it is absent from all libvirt-generated commandlines, and pci-bridges 
in libvirt only allow devices on slots 1-31. If we want to allow slot 0, 
then we'll need to expose an shpc option in the libvirt config, and 
start auto-setting it on when new bridges are defined, then 
allow/disallow use of slot 0 accordingly. I don't know if one slot is 
worth all that trouble though...)

> +
> +3. IO space issues
> +===================
> +The PCI Express Root Ports and PCI Express Downstream ports are seen by
> +Firmware/Guest OS as PCI-PCI Bridges. As required by the PCI spec, each
> +such Port should have a 4K IO range reserved for it, even though only one
> +(multifunction) device can be plugged into each Port. This results in
> +poor IO space utilization.
> +
> +The firmware used by QEMU (SeaBIOS/OVMF) may try further optimizations
> +by not allocating IO space for each PCI Express Root / PCI Express
> +Downstream port if:
> +    (1) the port is empty, or
> +    (2) the device behind the port has no IO BARs.
> +
> +The IO space is very limited, to 65536 byte-wide IO ports, and may even be
> +fragmented by fixed IO ports owned by platform devices resulting in at most
> +10 PCI Express Root Ports or PCI Express Downstream Ports per system
> +if devices with IO BARs are used in the PCI Express hierarchy. Using the
> +proposed device placing strategy solves this issue by using only
> +PCI Express devices within PCI Express hierarchy.
> +
> +The PCI Express spec requires the PCI Express devices to work


s/the PCI Express devices to work/that PCI Express devices work properly/


> +without using IO. The PCI hierarchy has no such limitations.

s/IO/IO space/ ? (or whatever is the correct qualifier)

> +
> +
> +4. Bus numbers issues
> +======================
> +Each PCI domain can have up to only 256 buses and the QEMU PCI Express
> +machines do not support multiple PCI domains even if extra Root
> +Complexes (pxb-pcie) are used.
> +
> +Each element of the PCI Express hierarchy (Root Complexes,
> +PCI Express Root Ports, PCI Express Downstream/Upstream ports)
> +takes up bus numbers. Since only one (multifunction) device


s/takes up bus numbers/uses one bus number/


> +can be attached to a PCI Express Root Port or PCI Express Downstream
> +Port it is advised to plan in advance for the expected number of
> +devices to prevent bus numbers starvation.


s/numbers/number/


> +
> +Avoiding PCI Express Switches (and thereby striving for a flat PCI


s/flat/flatter/ ??


> +Express hierarchy) enables the hierarchy to not spend bus numbers on
> +Upstream Ports.
> +
> +The bus_nr properties of the pxb-pcie devices partition the 0..255 bus
> +number space. All bus numbers assigned to the buses recursively behind a
> +given pxb-pcie device's root bus must fit between the bus_nr property of
> +that pxb-pcie device, and the lowest of the higher bus_nr properties
> +that the command line sets for other pxb-pcie devices.
> +
> +
> +5. Hot-plug
> +============
> +The PCI Express root buses (pcie.0 and the buses exposed by pxb-pcie devices)
> +do not support hot-plug, so any devices plugged into Root Complexes
> +cannot be hot-plugged/hot-unplugged:
> +    (1) PCI Express Integrated Endpoints
> +    (2) PCI Express Root Ports
> +    (3) DMI-PCI Bridges
> +    (4) pxb-pcie

(Maybe you should mention that downstream ports can't be hotplugged into 
an existing upstream port (and thus, because qemu can only attach a 
single device at a time, it is currently not possible to hotplug a 
downstream port in any manner).)

> +
> +PCI devices can be hot-plugged into PCI-PCI Bridges. The PCI hot-plug is ACPI
> +based and can work side by side with the PCI Express native hot-plug.
> +
> +PCI Express devices can be natively hot-plugged/hot-unplugged into/from
> +PCI Express Root Ports (and PCI Express Downstream Ports).
> +
> +5.1 Planning for hot-plug:
> +    (1) PCI hierarchy
> +        Leave enough PCI-PCI Bridge slots empty or add one
> +        or more empty PCI-PCI Bridges to the DMI-PCI Bridge.
> +
> +        For each such PCI-PCI Bridge the Guest Firmware is expected to reserve
> +        4K IO space and 2M MMIO range to be used for all devices behind it.
> +
> +        Because of the hard IO limit of around 10 PCI Bridges (~ 40K space)
> +        per system don't use more than 9 PCI-PCI Bridges, leaving 4K for the
> +        Integrated Endpoints. (The PCI Express Hierarchy needs no IO space).
> +
> +    (2) PCI Express hierarchy:
> +        Leave enough PCI Express Root Ports empty. Use multifunction
> +        PCI Express Root Ports (up to 8 ports per pcie.0 slot)
> +        on the Root Complex(es), for keeping the
> +        hierarchy as flat as possible, thereby saving PCI bus numbers.

FYI: https://www.redhat.com/archives/libvir-list/2016-October/msg01048.html

> +        Don't use PCI Express Switches if you don't have
> +        to, each one of those uses an extra PCI bus (for its Upstream Port)
> +        that could be put to better use with another Root Port or Downstream
> +        Port, which may come handy for hot-plugging another device.
> +
> +
> +5.3 Hot-plug example:
> +Using HMP: (add -monitor stdio to QEMU command line)
> +  device_add <dev>,id=<id>,bus=<PCI Express Root Port Id/PCI Express Downstream Port Id/PCI-PCI Bridge Id/>
> +
> +
> +6. Device assignment
> +====================
> +Host devices are mostly PCI Express and should be plugged only into
> +PCI Express Root Ports or PCI Express Downstream Ports.
> +PCI-PCI Bridge slots can be used for legacy PCI host devices.
> +
> +6.1 How to detect if a device is PCI Express:
> +  > lspci -s 03:00.0 -v (as root)
> +
> +    03:00.0 Network controller: Intel Corporation Wireless 7260 (rev 83)
> +    Subsystem: Intel Corporation Dual Band Wireless-AC 7260
> +    Flags: bus master, fast devsel, latency 0, IRQ 50
> +    Memory at f0400000 (64-bit, non-prefetchable) [size=8K]
> +    Capabilities: [c8] Power Management version 3
> +    Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
> +    Capabilities: [40] Express Endpoint, MSI 00
> +    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> +    Capabilities: [100] Advanced Error Reporting
> +    Capabilities: [140] Device Serial Number 7c-7a-91-ff-ff-90-db-20
> +    Capabilities: [14c] Latency Tolerance Reporting
> +    Capabilities: [154] Vendor Specific Information: ID=cafe Rev=1 Len=014
> +
> +If you can see the "Express Endpoint" capability in the
> +output, then the device is indeed PCI Express.

I also have a libvirt patch queued up that detects whether a device is 
PCI Express, and assigns the guest-side address accordingly.


> +
> +
> +7. Virtio devices
> +=================
> +Virtio devices plugged into the PCI hierarchy or as Integrated Endpoints
> +will remain PCI and have transitional behaviour as default.
> +Transitional virtio devices work in both IO and MMIO modes depending on
> +the guest support. The Guest firmware will assign both IO and MMIO resources
> +to transitional virtio devices.
> +
> +Virtio devices plugged into PCI Express ports are PCI Express devices and
> +have "1.0" behavior by default without IO support.
> +In both cases disable-legacy and disable-modern properties can be used
> +to override the behaviour.
> +
> +Note that setting disable-legacy=off will enable legacy mode (enabling
> +legacy behavior) for PCI Express virtio devices causing them to
> +require IO space, which, given the limited available IO space, may quickly
> +lead to resource exhaustion, and is therefore strongly discouraged.


"...unless required by a guest OS that lacks virtio-1.0 drivers" (but of 
course the hapless user will figure that out for themselves soon enough :-/)
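For anyone who does need that, the override would look roughly like the
following untested sketch (the root port wiring is copied from the
examples earlier in the doc, virtio-net-pci is just one example device):

```shell
# Untested sketch: force transitional (legacy-capable) behavior on a
# PCIe-attached virtio device, e.g. for a guest without virtio-1.0
# drivers, at the cost of consuming scarce IO space.
qemu-system-x86_64 -M q35 \
  -device ioh3420,id=root_port1,chassis=1,slot=1,bus=pcie.0 \
  -device virtio-net-pci,bus=root_port1,disable-legacy=off,disable-modern=off
```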


> +
> +
> +8. Conclusion
> +==============
> +The proposal offers a usage model that is easy to understand and follow
> +and at the same time overcomes the PCI Express architecture limitations.
> +

Aside from the fact that libvirt sets the "port" option for ioh3420 
devices, and doesn't set "slot"
Marcel Apfelbaum Nov. 1, 2016, 1:21 p.m. UTC | #2
On 10/31/2016 07:44 PM, Laine Stump wrote:
> On 10/31/2016 12:18 PM, Marcel Apfelbaum wrote:
>

Hi Laine,
Thank you for having a look at the document.

>> +
>> +2.2.1 Plugging a PCI Express device into a PCI Express Root Port:
>> +          -device ioh3420,id=root_port1,slot=x[,chassis=y][,bus=pcie.0][,addr=z]  \
>> +          -device <dev>,bus=root_port1
>> +      Note that (slot, chassis) pair is mandatory and must be
>> +      unique for each PCI Express Root Port.
>
> I keep meaning to ask about this and forgetting - just what is "slot" used for? In the past we were told that chassis and *port* were mandatory and must be unique, but hadn't before been told anything
> (that I can remember) about "slot":
>

I am glad you asked, here is the PCIe Spec:

6.7.1.7. Slot Numbering
          A Physical Slot Identifier (as defined in PCI Hot-Plug Specification, Revision 1.1, Section 1.5) consists of
          an optional chassis number and the physical slot number of the slot. The physical slot number is a
          chassis unique identifier for a slot. System software determines the physical slot number from
          registers in the Port. Chassis number 0 is reserved for the main chassis. The chassis number for
          other chassis must be a non-zero value obtained from a PCI-to-PCI Bridge’s Chassis Number
          register (see the PCI-to-PCI Bridge Architecture Specification, Revision 1.2, Section 13.4).
          Regardless of the form factor associated with each slot, each physical slot number must be unique
          within a chassis.


So we form the "Physical Slot Identifier" from both command line parameters "chassis" and "slot".
The "Physical Slot Identifier" will be used to identify the PCI Express Root Port.


Regarding the "port" command line parameter:

7.8.6. Link Capabilities Register (Offset 0Ch)
        31:24 Port Number – This field indicates the PCI Express Port
                            number for the given PCI Express Link.
                            Multi-Function devices associated with an Upstream Port must
                            report the same value in this field for all Functions.

While it is set by QEMU, I don't think it has any significance for the emulated hardware
since, as far as I understand it, the 'port' number is used for the implementation
of advanced hardware features such as "Virtual Channels".
So we can get away with leaving it 0...


> * If chassis isn't specified by the user, we will set it to the index of the controller (libvirt internally sets an index for each PCI controller, with the root bus being index 0, and each subsequent
> controller being the next higher number; this also is the number specified as "bus" in the libvirt config for the devices that connect to that controller (which is then used to determine the "id" of
> the controller used in the commandline argument for the device).
>

As you can see, since we don't support more than 256 PCI Express Ports, we can get away with leaving
the chassis at 0 and incrementing the slot for each port.
We don't support more ports because we support a maximum of 256 buses. Once we have multiple PCI domains
we'll need a different chassis per domain, but until then we are covered.
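Concretely, something of this shape (untested sketch, chassis left at
its default everywhere, only slot varied to keep each (chassis, slot)
pair unique):

```shell
# Untested sketch: three root ports sharing one pcie.0 slot as a
# multi-function device; uniqueness comes from the incrementing slot.
qemu-system-x86_64 -M q35 \
  -device ioh3420,id=root_port1,multifunction=on,slot=1,bus=pcie.0,addr=0x2.0x0 \
  -device ioh3420,id=root_port2,slot=2,bus=pcie.0,addr=0x2.0x1 \
  -device ioh3420,id=root_port3,slot=3,bus=pcie.0,addr=0x2.0x2
```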

> * If port isn't specified by the user, then we set it according to where the port is attached in the root complex:
>
>    port = slot << 3 + function
>

I don't think that is necessary.

> But what is slot?
>

I hope the above explanation helps.

>> +2.2.2 Using multi-function PCI Express Root Ports:
>> +      -device ioh3420,id=root_port1,multifunction=on,chassis=x[,bus=pcie.0][,slot=y][,addr=z.0] \
>
> Similar to what Laszlo reported about v3 - these examples show slot as optional, where the first example shows it as mandatory. Also, none of the examples show "port" (which we were previously told
> was mandatory, or at least that it needed to be unique) at all.

Does the note from 2.2.1, specifying that the pair should be unique, help?

Thinking about it more, maybe we should leave chassis optional because we don't support
multiple PCI domains. I'll update the doc.

>
>> +      -device ioh3420,id=root_port2,chassis=x1[,bus=pcie.0][,slot=y1][,addr=z.1] \
>> +      -device ioh3420,id=root_port3,chassis=x2[,bus=pcie.0][,slot=y2][,addr=z.2] \
>> +2.2.3 Plugging a PCI Express device into a Switch:
>> +      -device ioh3420,id=root_port1,chassis=x[,bus=pcie.0][,slot=y][,addr=z]  \
>> +      -device x3130-upstream,id=upstream_port1,bus=root_port1[,addr=x]          \
>> +      -device xio3130-downstream,id=downstream_port1,bus=upstream_port1,chassis=x1[,slot=y1][,addr=z1]] \
>
> ..and there it is again, in downstream ports.
>
>> +      -device <dev>,bus=downstream_port1
>> +Note that 'addr' parameter can be 0 for all the examples above.
>
>> +Prefer flat hierarchies. For most scenarios a single DMI-PCI Bridge
>> +(having 32 slots) and several PCI-PCI Bridges attached to it
>> +(each supporting also 32 slots)
>
>
>> will support hundreds of legacy devices.
>> +The recommendation is to populate one PCI-PCI Bridge under the DMI-PCI Bridge
>> +until it is full and then plug a new PCI-PCI Bridge...
>> +
>> +   pcie.0 bus
>> +   ----------------------------------------------
>> +        |                            |
>> +   -----------               ------------------
>> +   | PCI Dev |               | DMI-PCI BRIDGE |
>> +   ----------                ------------------
>> +                               |            |
>> +                  ------------------    ------------------
>> +                  | PCI-PCI Bridge |    | PCI-PCI Bridge |
>> +                  ------------------    ------------------
>> +                                         |           |
>> +                                  -----------     -----------
>> +                                  | PCI Dev |     | PCI Dev |
>> +                                  -----------     -----------
>
> It's nit-picking, but it might make the method for expansion more obvious if there was a "..." to the right of the 2nd pci-bridge.
>

Sure

>> +
>> +2.3.1 To plug a PCI device into pcie.0 as Integrated Endpoint use:
>
> s/as/as an/
>

OK

>> +      -device <dev>[,bus=pcie.0]
>> +2.3.2 Plugging a PCI device into a PCI-PCI Bridge:
>> +      -device i82801b11-bridge,id=dmi_pci_bridge1[,bus=pcie.0]                        \
>> +      -device pci-bridge,id=pci_bridge1,bus=dmi_pci_bridge1[,chassis_nr=x][,addr=y]   \
>> +      -device <dev>,bus=pci_bridge1[,addr=x]
>> +      Note that 'addr' cannot be 0 unless shpc=off parameter is passed to
>> +      the PCI Bridge.
>
> (Tangentially related to your document: We need to decide what to do about this in libvirt - up until now I'd never heard of the shpc option, so it is absent from all libvirt-generated commandlines,
> and pci-bridges in libvirt only allow devices on slots 1-31. If we want to allow slot 0, then we'll need to expose an shpc option in the libvirt config, and start auto-setting it on when new bridges
> are defined, then allow/disallow use of slot 0 accordingly. I don't know if one slot is worth all that trouble though...)
>

I *personally* think it is worth the effort, not only for the "extra" slot, but also for an easier understanding of the model.
I am also thinking about defaulting the shpc parameter to off for new QEMU versions, so it is worth making libvirt
understand the "shpc" parameter and act accordingly. I would suggest defaulting to "off".
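For reference, the slot-0 case would look roughly like this (untested
sketch; e1000 is only a placeholder device):

```shell
# Untested sketch: with shpc=off on the pci-bridge, addr=0x0 (slot 0)
# becomes usable for an ordinary device instead of the SHPC component.
qemu-system-x86_64 -M q35 \
  -device i82801b11-bridge,id=dmi_pci_bridge1,bus=pcie.0 \
  -device pci-bridge,id=pci_bridge1,bus=dmi_pci_bridge1,chassis_nr=1,shpc=off \
  -device e1000,bus=pci_bridge1,addr=0x0
```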

>> +
>> +3. IO space issues
>> +===================
>> +The PCI Express Root Ports and PCI Express Downstream ports are seen by
>> +Firmware/Guest OS as PCI-PCI Bridges. As required by the PCI spec, each
>> +such Port should have a 4K IO range reserved for it, even though only one
>> +(multifunction) device can be plugged into each Port. This results in
>> +poor IO space utilization.
>> +
>> +The firmware used by QEMU (SeaBIOS/OVMF) may try further optimizations
>> +by not allocating IO space for each PCI Express Root / PCI Express
>> +Downstream port if:
>> +    (1) the port is empty, or
>> +    (2) the device behind the port has no IO BARs.
>> +
>> +The IO space is very limited, to 65536 byte-wide IO ports, and may even be
>> +fragmented by fixed IO ports owned by platform devices resulting in at most
>> +10 PCI Express Root Ports or PCI Express Downstream Ports per system
>> +if devices with IO BARs are used in the PCI Express hierarchy. Using the
>> +proposed device placing strategy solves this issue by using only
>> +PCI Express devices within PCI Express hierarchy.
>> +
>> +The PCI Express spec requires the PCI Express devices to work
>
>
> s/the PCI Express devices to work/that PCI Express devices work properly/
>

Sure

>
>> +without using IO. The PCI hierarchy has no such limitations.
>
> s/IO/IO space/ ? (or whatever is the correct qualifier)
>

OK

>> +
>> +
>> +4. Bus numbers issues
>> +======================
>> +Each PCI domain can have up to only 256 buses and the QEMU PCI Express
>> +machines do not support multiple PCI domains even if extra Root
>> +Complexes (pxb-pcie) are used.
>> +
>> +Each element of the PCI Express hierarchy (Root Complexes,
>> +PCI Express Root Ports, PCI Express Downstream/Upstream ports)
>> +takes up bus numbers. Since only one (multifunction) device
>
>
> s/takes up bus numbers/uses one bus number/
>

OK

>
>> +can be attached to a PCI Express Root Port or PCI Express Downstream
>> +Port it is advised to plan in advance for the expected number of
>> +devices to prevent bus numbers starvation.
>
>
> s/numbers/number/
>

OK

>
>> +
>> +Avoiding PCI Express Switches (and thereby striving for a flat PCI
>
>
> s/flat/flatter/ ??
>

:) I'll change

>
>> +Express hierarchy) enables the hierarchy to not spend bus numbers on
>> +Upstream Ports.
>> +
>> +The bus_nr properties of the pxb-pcie devices partition the 0..255 bus
>> +number space. All bus numbers assigned to the buses recursively behind a
>> +given pxb-pcie device's root bus must fit between the bus_nr property of
>> +that pxb-pcie device, and the lowest of the higher bus_nr properties
>> +that the command line sets for other pxb-pcie devices.
>> +
>> +
>> +5. Hot-plug
>> +============
>> +The PCI Express root buses (pcie.0 and the buses exposed by pxb-pcie devices)
>> +do not support hot-plug, so any devices plugged into Root Complexes
>> +cannot be hot-plugged/hot-unplugged:
>> +    (1) PCI Express Integrated Endpoints
>> +    (2) PCI Express Root Ports
>> +    (3) DMI-PCI Bridges
>> +    (4) pxb-pcie
>
> (Maybe you should mention that downstream ports can't be hotplugged into an existing upstream port (and thus, because qemu can only attach a single device at a time, it is currently not possible to
> hotplug a downstream port in any manner).)
>

A good idea, thanks. (I'll only mention that the downstream ports are not hot-pluggable.)

>> +
>> +PCI devices can be hot-plugged into PCI-PCI Bridges. The PCI hot-plug is ACPI
>> +based and can work side by side with the PCI Express native hot-plug.
>> +
>> +PCI Express devices can be natively hot-plugged/hot-unplugged into/from
>> +PCI Express Root Ports (and PCI Express Downstream Ports).
>> +
>> +5.1 Planning for hot-plug:
>> +    (1) PCI hierarchy
>> +        Leave enough PCI-PCI Bridge slots empty or add one
>> +        or more empty PCI-PCI Bridges to the DMI-PCI Bridge.
>> +
>> +        For each such PCI-PCI Bridge the Guest Firmware is expected to reserve
>> +        4K IO space and 2M MMIO range to be used for all devices behind it.
>> +
>> +        Because of the hard IO limit of around 10 PCI Bridges (~ 40K space)
>> +        per system don't use more than 9 PCI-PCI Bridges, leaving 4K for the
>> +        Integrated Endpoints. (The PCI Express Hierarchy needs no IO space).
>> +
>> +    (2) PCI Express hierarchy:
>> +        Leave enough PCI Express Root Ports empty. Use multifunction
>> +        PCI Express Root Ports (up to 8 ports per pcie.0 slot)
>> +        on the Root Complex(es), for keeping the
>> +        hierarchy as flat as possible, thereby saving PCI bus numbers.
>
> FYI: https://www.redhat.com/archives/libvir-list/2016-October/msg01048.html
>

Cool!

>> +        Don't use PCI Express Switches if you don't have
>> +        to, each one of those uses an extra PCI bus (for its Upstream Port)
>> +        that could be put to better use with another Root Port or Downstream
>> +        Port, which may come handy for hot-plugging another device.
>> +
>> +
>> +5.3 Hot-plug example:
>> +Using HMP: (add -monitor stdio to QEMU command line)
>> +  device_add <dev>,id=<id>,bus=<PCI Express Root Port Id/PCI Express Downstream Port Id/PCI-PCI Bridge Id/>
>> +
>> +
>> +6. Device assignment
>> +====================
>> +Host devices are mostly PCI Express and should be plugged only into
>> +PCI Express Root Ports or PCI Express Downstream Ports.
>> +PCI-PCI Bridge slots can be used for legacy PCI host devices.
>> +
>> +6.1 How to detect if a device is PCI Express:
>> +  > lspci -s 03:00.0 -v (as root)
>> +
>> +    03:00.0 Network controller: Intel Corporation Wireless 7260 (rev 83)
>> +    Subsystem: Intel Corporation Dual Band Wireless-AC 7260
>> +    Flags: bus master, fast devsel, latency 0, IRQ 50
>> +    Memory at f0400000 (64-bit, non-prefetchable) [size=8K]
>> +    Capabilities: [c8] Power Management version 3
>> +    Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
>> +    Capabilities: [40] Express Endpoint, MSI 00
>> +    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> +    Capabilities: [100] Advanced Error Reporting
>> +    Capabilities: [140] Device Serial Number 7c-7a-91-ff-ff-90-db-20
>> +    Capabilities: [14c] Latency Tolerance Reporting
>> +    Capabilities: [154] Vendor Specific Information: ID=cafe Rev=1 Len=014
>> +
>> +If you can see the "Express Endpoint" capability in the
>> +output, then the device is indeed PCI Express.
>
> I also have a libvirt patch queued up that detects whether a device is PCI Express, and assigns the guest-side address accordingly.
>

Thanks for keeping in sync.
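FWIW, a rough scripted version of the same lspci check (the helper name
is mine, not anything from the doc or libvirt):

```shell
# Succeeds if the `lspci -v` output on stdin contains any PCI Express
# capability line ("Express Endpoint", "Express Root Port", ...).
is_pcie() {
  grep -q 'Capabilities: \[[0-9a-f]*\] Express'
}
# Usage: lspci -s 03:00.0 -v | is_pcie && echo "PCI Express"
```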

>
>> +
>> +
>> +7. Virtio devices
>> +=================
>> +Virtio devices plugged into the PCI hierarchy or as Integrated Endpoints
>> +will remain PCI and have transitional behaviour as default.
>> +Transitional virtio devices work in both IO and MMIO modes depending on
>> +the guest support. The Guest firmware will assign both IO and MMIO resources
>> +to transitional virtio devices.
>> +
>> +Virtio devices plugged into PCI Express ports are PCI Express devices and
>> +have "1.0" behavior by default without IO support.
>> +In both cases disable-legacy and disable-modern properties can be used
>> +to override the behaviour.
>> +
>> +Note that setting disable-legacy=off will enable legacy mode (enabling
>> +legacy behavior) for PCI Express virtio devices causing them to
>> +require IO space, which, given the limited available IO space, may quickly
>> +lead to resource exhaustion, and is therefore strongly discouraged.
>
>
> "...unless required by a guest OS that lacks virtio-1.0 drivers" (but of course the hapless user will figure that out for themselves soon enough :-/)
>

It's their fault :)

>
>> +
>> +
>> +8. Conclusion
>> +==============
>> +The proposal offers a usage model that is easy to understand and follow
>> +and at the same time overcomes the PCI Express architecture limitations.
>> +
>
> Aside from the fact that libvirt sets the "port" option for ioh3420 devices, and doesn't set "slot"

I would re-think that, towards setting the 'slot' rather than 'port' (& 'chassis').


Thanks,
Marcel

Patch

diff --git a/docs/pcie.txt b/docs/pcie.txt
new file mode 100644
index 0000000..e9c79c5
--- /dev/null
+++ b/docs/pcie.txt
@@ -0,0 +1,306 @@ 
+PCI EXPRESS GUIDELINES
+======================
+
+1. Introduction
+================
+The doc proposes best practices on how to use PCI Express/PCI devices
+in PCI Express based machines and explains the reasoning behind them.
+
+The following presentations accompany this document:
+ (1) Q35 overview.
+     http://wiki.qemu.org/images/4/4e/Q35.pdf
+ (2) A comparison between PCI and PCI Express technologies.
+     http://wiki.qemu.org/images/f/f6/PCIvsPCIe.pdf
+
+Note: The usage examples are not intended to replace the full
+documentation, please use QEMU help to retrieve all options.
+
+2. Device placement strategy
+============================
+QEMU does not have a clear socket-device matching mechanism
+and allows any PCI/PCI Express device to be plugged into any
+PCI/PCI Express slot.
+Plugging a PCI device into a PCI Express slot might not always work and
+is weird anyway since it cannot be done for "bare metal".
+Plugging a PCI Express device into a PCI slot will hide the Extended
+Configuration Space thus is also not recommended.
+
+The recommendation is to separate the PCI Express and PCI hierarchies.
+PCI Express devices should be plugged only into PCI Express Root Ports and
+PCI Express Downstream ports.
+
+2.1 Root Bus (pcie.0)
+=====================
+Place only the following kinds of devices directly on the Root Complex:
+    (1) PCI Devices (e.g. network card, graphics card, IDE controller),
+        not controllers. Place only legacy PCI devices on
+        the Root Complex. These will be considered Integrated Endpoints.
+        Note: Integrated Endpoints are not hot-pluggable.
+
+        Although the PCI Express spec does not forbid PCI Express devices as
+        Integrated Endpoints, existing hardware mostly integrates legacy PCI
+        devices with the Root Complex. Guest OSes are suspected to behave
+        strangely when PCI Express devices are integrated
+        with the Root Complex.
+
+    (2) PCI Express Root Ports (ioh3420), for starting exclusively PCI Express
+        hierarchies.
+
+    (3) DMI-PCI Bridges (i82801b11-bridge), for starting legacy PCI
+        hierarchies.
+
+    (4) Extra Root Complexes (pxb-pcie), if multiple PCI Express Root Buses
+        are needed.
+
+   pcie.0 bus
+   ----------------------------------------------------------------------------
+        |                |                    |                  |
+   -----------   ------------------   ------------------   --------------
+   | PCI Dev |   | PCIe Root Port |   | DMI-PCI Bridge |   |  pxb-pcie  |
+   -----------   ------------------   ------------------   --------------
+
+2.1.1 To plug a device into pcie.0 as a Root Complex Integrated Endpoint use:
+          -device <dev>[,bus=pcie.0]
+2.1.2 To expose a new PCI Express Root Bus use:
+          -device pxb-pcie,id=pcie.1,bus_nr=x[,numa_node=y][,addr=z]
+      Only PCI Express Root Ports and DMI-PCI bridges can be connected
+      to the pcie.1 bus:
+          -device ioh3420,id=root_port1[,bus=pcie.1][,chassis=x][,slot=y][,addr=z]                                     \
+          -device i82801b11-bridge,id=dmi_pci_bridge1,bus=pcie.1
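+
+The example above can be assembled into a full argument string. The
+following is only a hedged sketch: the ids, bus_nr and NUMA node values
+are made up, and e1000e stands in for any PCI Express device; the
+resulting string would be passed to a qemu-system binary with a Q35
+machine type.

```shell
# Build the arguments for an extra root complex on NUMA node 1,
# with one Root Port and a PCI Express NIC behind it (illustrative values).
ARGS="-device pxb-pcie,id=pcie.1,bus_nr=2,numa_node=1"
ARGS="$ARGS -device ioh3420,id=root_port1,bus=pcie.1,chassis=1,slot=1"
ARGS="$ARGS -device e1000e,bus=root_port1"
echo "$ARGS"
```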
+
+
+2.2 PCI Express only hierarchy
+==============================
+Always use PCI Express Root Ports to start PCI Express hierarchies.
+
+A PCI Express Root Bus supports up to 32 devices. Since each
+PCI Express Root Port is a function and a multi-function
+device may support up to 8 functions, the maximum possible
+number of PCI Express Root Ports per PCI Express Root Bus is 256.
+
+Prefer grouping PCI Express Root Ports into multi-function devices
+to keep a simple flat hierarchy that is enough for most scenarios.
+Only use PCI Express Switches (x3130-upstream, xio3130-downstream)
+if there is no more room for PCI Express Root Ports.
+Please see section 4. for further justifications.
+
+Plug only PCI Express devices into PCI Express Ports.
+
+
+   pcie.0 bus
+   ----------------------------------------------------------------------------------
+        |                 |                                    |
+   -------------    -------------                        -------------
+   | Root Port |    | Root Port |                        | Root Port |
+   -------------    -------------                        -------------
+         |                            -------------------------|------------------------
+    ------------                      |                 -----------------              |
+    | PCIe Dev |                      |    PCI Express  | Upstream Port |              |
+    ------------                      |      Switch     -----------------              |
+                                      |                  |            |                |
+                                      |    -------------------    -------------------  |
+                                      |    | Downstream Port |    | Downstream Port |  |
+                                      |    -------------------    -------------------  |
+                                      -------------|-----------------------|------------
+                                             ------------
+                                             | PCIe Dev |
+                                             ------------
+
+2.2.1 Plugging a PCI Express device into a PCI Express Root Port:
+          -device ioh3420,id=root_port1,slot=x[,chassis=y][,bus=pcie.0][,addr=z]  \
+          -device <dev>,bus=root_port1
+      Note that the (slot, chassis) pair is mandatory and must be
+      unique for each PCI Express Root Port.
+2.2.2 Using multi-function PCI Express Root Ports:
+      -device ioh3420,id=root_port1,multifunction=on,chassis=x[,bus=pcie.0][,slot=y][,addr=z.0] \
+      -device ioh3420,id=root_port2,chassis=x1[,bus=pcie.0][,slot=y1][,addr=z.1] \
+      -device ioh3420,id=root_port3,chassis=x2[,bus=pcie.0][,slot=y2][,addr=z.2]
+2.2.3 Plugging a PCI Express device into a Switch:
+      -device ioh3420,id=root_port1,chassis=x[,bus=pcie.0][,slot=y][,addr=z]  \
+      -device x3130-upstream,id=upstream_port1,bus=root_port1[,addr=x]          \
+      -device xio3130-downstream,id=downstream_port1,bus=upstream_port1,chassis=x1[,slot=y1][,addr=z1] \
+      -device <dev>,bus=downstream_port1
+Note that the 'addr' parameter can be 0 in all the examples above.
+
+
+2.3 PCI only hierarchy
+======================
+Legacy PCI devices can be plugged into pcie.0 as Integrated Endpoints,
+but, as mentioned in section 5, such devices cannot be hot-unplugged.
+Otherwise, use DMI-PCI Bridges (i82801b11-bridge) in combination
+with PCI-PCI Bridges (pci-bridge) to start PCI hierarchies.
+
+Prefer flat hierarchies. For most scenarios a single DMI-PCI Bridge
+(having 32 slots) with several PCI-PCI Bridges attached to it
+(each also supporting 32 slots) will accommodate hundreds of legacy devices.
+The recommendation is to populate one PCI-PCI Bridge under the DMI-PCI Bridge
+until it is full, and only then plug a new PCI-PCI Bridge.
+
+   pcie.0 bus
+   ----------------------------------------------
+        |                            |
+   -----------               ------------------
+   | PCI Dev |               | DMI-PCI Bridge |
+   -----------               ------------------
+                               |            |
+                  ------------------    ------------------
+                  | PCI-PCI Bridge |    | PCI-PCI Bridge |
+                  ------------------    ------------------
+                                         |           |
+                                  -----------     -----------
+                                  | PCI Dev |     | PCI Dev |
+                                  -----------     -----------
+
+2.3.1 To plug a PCI device into pcie.0 as Integrated Endpoint use:
+      -device <dev>[,bus=pcie.0]
+2.3.2 Plugging a PCI device into a PCI-PCI Bridge:
+      -device i82801b11-bridge,id=dmi_pci_bridge1[,bus=pcie.0]                        \
+      -device pci-bridge,id=pci_bridge1,bus=dmi_pci_bridge1[,chassis_nr=x][,addr=y]   \
+      -device <dev>,bus=pci_bridge1[,addr=x]
+      Note that 'addr' cannot be 0 unless shpc=off parameter is passed to
+      the PCI Bridge.
+
+3. IO space issues
+===================
+The PCI Express Root Ports and PCI Express Downstream Ports are seen by
+Firmware/Guest OS as PCI-PCI Bridges. As required by the PCI spec, each
+such Port should be reserved a 4K IO range, even though only one
+(multifunction) device can be plugged into each Port. This results in
+poor IO space utilization.
+
+The firmware used by QEMU (SeaBIOS/OVMF) may try further optimizations
+by not allocating IO space for each PCI Express Root Port / PCI Express
+Downstream Port if:
+    (1) the port is empty, or
+    (2) the device behind the port has no IO BARs.
+
+The IO space is very limited, to 65536 byte-wide IO ports, and may even be
+fragmented by fixed IO ports owned by platform devices, resulting in at most
+10 PCI Express Root Ports or PCI Express Downstream Ports per system
+if devices with IO BARs are used in the PCI Express hierarchy. The
+proposed device placement strategy solves this issue by using only
+PCI Express devices within the PCI Express hierarchy.
+
+The PCI Express spec requires PCI Express devices to work
+without using IO. The PCI hierarchy has no such limitation.
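+
+The "at most 10" figure above can be sanity-checked with simple
+arithmetic; the division below gives only the idealized upper bound,
+before fixed platform IO ports are subtracted:

```shell
# 64K of IO port space split into the 4K windows required per bridge
# gives 16 windows in the ideal case; fixed platform IO ports reduce
# the practical figure to roughly 10.
TOTAL=65536
PER_PORT=4096
echo $((TOTAL / PER_PORT))
```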
+
+
+4. Bus numbers issues
+======================
+Each PCI domain can have at most 256 buses, and the QEMU PCI Express
+machines do not support multiple PCI domains even if extra Root
+Complexes (pxb-pcie) are used.
+
+Each element of the PCI Express hierarchy (Root Complexes,
+PCI Express Root Ports, PCI Express Downstream/Upstream ports)
+takes up bus numbers. Since only one (multifunction) device
+can be attached to a PCI Express Root Port or PCI Express Downstream
+Port, it is advised to plan in advance for the expected number of
+devices to prevent bus number starvation.
+
+Avoiding PCI Express Switches (and thereby striving for a flat PCI
+Express hierarchy) keeps the hierarchy from spending bus numbers on
+Upstream Ports.
+
+The bus_nr properties of the pxb-pcie devices partition the 0..255 bus
+number space. All bus numbers assigned to the buses recursively behind a
+given pxb-pcie device's root bus must fit between the bus_nr property of
+that pxb-pcie device, and the lowest of the higher bus_nr properties
+that the command line sets for other pxb-pcie devices.
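+
+A hedged sketch of this partitioning (the bus_nr values are made up):
+with the two devices below, firmware can assign bus numbers 40..79
+behind pcie.1 and 80..255 behind pcie.2, leaving 0..39 for pcie.0
+itself.

```shell
# Two extra root complexes partitioning the 0..255 bus number space
# (illustrative bus_nr values and ids).
ARGS="-device pxb-pcie,id=pcie.1,bus_nr=40"
ARGS="$ARGS -device pxb-pcie,id=pcie.2,bus_nr=80"
echo "$ARGS"
```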
+
+
+5. Hot-plug
+============
+The PCI Express root buses (pcie.0 and the buses exposed by pxb-pcie devices)
+do not support hot-plug, so any devices plugged into Root Complexes
+cannot be hot-plugged/hot-unplugged:
+    (1) PCI Express Integrated Endpoints
+    (2) PCI Express Root Ports
+    (3) DMI-PCI Bridges
+    (4) pxb-pcie
+
+PCI devices can be hot-plugged into PCI-PCI Bridges. PCI hot-plug is ACPI
+based and can work side by side with PCI Express native hot-plug.
+
+PCI Express devices can be natively hot-plugged/hot-unplugged into/from
+PCI Express Root Ports (and PCI Express Downstream Ports).
+
+5.1 Planning for hot-plug:
+    (1) PCI hierarchy
+        Leave enough PCI-PCI Bridge slots empty or add one
+        or more empty PCI-PCI Bridges to the DMI-PCI Bridge.
+
+        For each such PCI-PCI Bridge the Guest Firmware is expected to reserve
+        4K IO space and 2M MMIO range to be used for all devices behind it.
+
+        Because of the hard IO limit of around 10 PCI-PCI Bridges (~40K space)
+        per system, don't use more than 9 PCI-PCI Bridges, leaving 4K for the
+        Integrated Endpoints. (The PCI Express hierarchy needs no IO space.)
+
+    (2) PCI Express hierarchy:
+        Leave enough PCI Express Root Ports empty. Use multifunction
+        PCI Express Root Ports (up to 8 ports per pcie.0 slot)
+        on the Root Complex(es), for keeping the
+        hierarchy as flat as possible, thereby saving PCI bus numbers.
+        Don't use PCI Express Switches if you don't have to:
+        each one of those uses an extra PCI bus (for its Upstream Port)
+        that could be put to better use with another Root Port or Downstream
+        Port, which may come in handy for hot-plugging another device.
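+
+The advice in (2) can be sketched as follows; the chassis values, ids
+and slot address are made up, and all three ports share a single
+pcie.0 slot:

```shell
# Three empty Root Ports packed as functions 0-2 of one pcie.0 slot,
# left unpopulated so PCI Express devices can be hot-plugged later.
ARGS="-device ioh3420,id=rp0,multifunction=on,chassis=1,addr=0x3.0x0"
ARGS="$ARGS -device ioh3420,id=rp1,chassis=2,addr=0x3.0x1"
ARGS="$ARGS -device ioh3420,id=rp2,chassis=3,addr=0x3.0x2"
echo "$ARGS"
```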
+
+
+5.2 Hot-plug example:
+Using HMP: (add -monitor stdio to QEMU command line)
+  device_add <dev>,id=<id>,bus=<PCI Express Root Port Id/PCI Express Downstream Port Id/PCI-PCI Bridge Id>
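+
+As a concrete illustration (the device model and ids are made up;
+root_port2 must be an existing, empty PCI Express Root Port), the
+monitor command might look like:

```shell
# The string below is what would be typed at the (qemu) monitor prompt;
# hot-unplug is the reverse operation: device_del hp_nic1
HMP_CMD="device_add virtio-net-pci,id=hp_nic1,bus=root_port2"
echo "$HMP_CMD"
```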
+
+
+6. Device assignment
+====================
+Host devices are mostly PCI Express and should be plugged only into
+PCI Express Root Ports or PCI Express Downstream Ports.
+PCI-PCI Bridge slots can be used for legacy PCI host devices.
+
+6.1 How to detect if a device is PCI Express:
+  > lspci -s 03:00.0 -v (as root)
+
+    03:00.0 Network controller: Intel Corporation Wireless 7260 (rev 83)
+    Subsystem: Intel Corporation Dual Band Wireless-AC 7260
+    Flags: bus master, fast devsel, latency 0, IRQ 50
+    Memory at f0400000 (64-bit, non-prefetchable) [size=8K]
+    Capabilities: [c8] Power Management version 3
+    Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
+    Capabilities: [40] Express Endpoint, MSI 00
+    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+    Capabilities: [100] Advanced Error Reporting
+    Capabilities: [140] Device Serial Number 7c-7a-91-ff-ff-90-db-20
+    Capabilities: [14c] Latency Tolerance Reporting
+    Capabilities: [154] Vendor Specific Information: ID=cafe Rev=1 Len=014 
+
+If you can see the "Express Endpoint" capability in the
+output, then the device is indeed PCI Express.
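+
+The check can also be scripted. A hedged sketch follows; the grep
+pattern matches the capability line format shown above, though real
+capability offsets vary:

```shell
# Succeed if `lspci -v` output on stdin describes a PCI Express device,
# by looking for an "Express" capability line.
is_pcie() {
    grep -q 'Capabilities: \[[0-9a-f]*\] Express '
}

# Exercise the helper on the capability line from the listing above.
SAMPLE='Capabilities: [40] Express Endpoint, MSI 00'
if printf '%s\n' "$SAMPLE" | is_pcie; then RESULT=PCIe; else RESULT=legacy; fi
echo "$RESULT"
```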
+
+
+7. Virtio devices
+=================
+Virtio devices plugged into the PCI hierarchy or as Integrated Endpoints
+will remain PCI and have transitional behaviour by default.
+Transitional virtio devices work in both IO and MMIO modes depending on
+guest support. The Guest firmware will assign both IO and MMIO resources
+to transitional virtio devices.
+
+Virtio devices plugged into PCI Express ports are PCI Express devices and
+have "1.0" behavior by default without IO support.
+In both cases disable-legacy and disable-modern properties can be used
+to override the behaviour.
+
+Note that setting disable-legacy=off will enable legacy mode (enabling
+legacy behaviour) for PCI Express virtio devices, causing them to
+require IO space; given the limited available IO space, this may quickly
+lead to resource exhaustion and is therefore strongly discouraged.
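+
+A hedged sketch of such an override (the ids and bus names are made
+up; virtio-net-pci stands in for any virtio device):

```shell
# Force modern-only ("1.0") behaviour for a virtio device on a PCI-PCI
# Bridge, where it would otherwise default to transitional mode.
ARGS="-device virtio-net-pci,bus=pci_bridge1,disable-legacy=on,disable-modern=off"
echo "$ARGS"
```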
+
+
+8. Conclusion
+==============
+The proposal offers a usage model that is easy to understand and follow
+and at the same time overcomes the limitations of the PCI Express architecture.
+