From patchwork Wed Jun 29 05:28:20 2011 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Jon Mason X-Patchwork-Id: 927162 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by demeter2.kernel.org (8.14.4/8.14.4) with ESMTP id p5T6BJGh020792 for ; Wed, 29 Jun 2011 06:11:20 GMT Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751766Ab1F2F2Z (ORCPT ); Wed, 29 Jun 2011 01:28:25 -0400 Received: from mail-pv0-f174.google.com ([74.125.83.174]:54639 "EHLO mail-pv0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750984Ab1F2F2Z (ORCPT ); Wed, 29 Jun 2011 01:28:25 -0400 Received: by pvg12 with SMTP id 12so541330pvg.19 for ; Tue, 28 Jun 2011 22:28:24 -0700 (PDT) Received: by 10.68.17.164 with SMTP id p4mr585269pbd.328.1309325304310; Tue, 28 Jun 2011 22:28:24 -0700 (PDT) Received: from myri.com (cpe-70-113-48-102.austin.res.rr.com [70.113.48.102]) by mx.google.com with ESMTPS id p5sm670416pbd.12.2011.06.28.22.28.22 (version=SSLv3 cipher=OTHER); Tue, 28 Jun 2011 22:28:23 -0700 (PDT) Date: Wed, 29 Jun 2011 00:28:20 -0500 From: Jon Mason To: Benjamin Herrenschmidt Cc: Jesse Barnes , linux-pci@vger.kernel.org, Pratyush Anand , Jordan_Hargrave@dell.com, Andrew Gallatin Subject: [RFC v6] PCI: Set PCI-E Max Payload Size on fabric Message-ID: <20110629052815.GA14550@myri.com> References: <20110624035452.GA12806@myri.com> <1308890971.32158.280.camel@pasglop> <1308956057.32158.334.camel@pasglop> MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: <1308956057.32158.334.camel@pasglop> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-pci-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-pci@vger.kernel.org X-Greylist: IP, sender and recipient auto-whitelisted, not delayed by milter-greylist-4.2.6 (demeter2.kernel.org [140.211.167.43]); Wed, 29 Jun 2011 06:11:20 +0000 (UTC) There is a sizable performance boost for having the largest possible maximum payload size on each PCI-E device. However, the maximum payload size must be uniform on a given PCI-E fabric, and each device, bridge, and root port can have a different max size. To find and configure the optimal MPS settings, one must walk the fabric and determine the largest MPS available on all the devices on the given fabric. However, on some architectures (notably PPC) it is known that the downstream communication will never be larger than the MRRS. With this known, the MPS can be configured to be the largest value supported by the PCI-E switches and root port. In most cases this means that the devices on the fabric are not limited to the lowest common denominator. This requires the setting of the MRRS when configuring the MPS of each device. Use with caution, if the MRRS is not respected or if it is modified after being set, then there is the possibility of a bus error. To choose between the two available options, two kernel boot args have been added to the PCI calls. "mpsmin" will provide the former behavior, while "mpsmax" will perform the latter behavior. By default, the former behavior is used. Along with the potential performance improvement, this works around an issue with buggy BIOS revisions found on various whitebox motherboards which do not configure MPS beyond one level below the root port. There is one issue with the "mpsmin" design. For PCI-E hotplug enabled slots not connected directly to a PCI-E root port, there can be problems when hotplugging devices. This is due to the possibility of hotplugging a device into the fabric with a smaller MPS that the devices currently running have configured. Modifying the MPS on the running devices could cause a fatal bus error due to an incoming frame being larger than the newly configured MPS. To work around this, the MPS for the entire fabric must be set to the minimum size. Any devices hotplugged into this fabric will have the minimum MPS set. If the PCI hotplug slot is directly connected to the root port and there are not other devices on the fabric (which seems to be the most common case), then this is not an issue and MPS discovery will occur as normal. Signed-off-by: Jon Mason --- drivers/pci/hotplug/pcihp_slot.c | 58 ++++------------ drivers/pci/pci.c | 66 ++++++++++++++++++ drivers/pci/probe.c | 138 ++++++++++++++++++++++++++++++++++++++ include/linux/pci.h | 7 ++- 4 files changed, 224 insertions(+), 45 deletions(-) diff --git a/drivers/pci/hotplug/pcihp_slot.c b/drivers/pci/hotplug/pcihp_slot.c index 749fdf0..ee4ff14 100644 --- a/drivers/pci/hotplug/pcihp_slot.c +++ b/drivers/pci/hotplug/pcihp_slot.c @@ -26,6 +26,19 @@ #include #include +extern void pcie_set_maxpayloadsize(struct pci_bus *bus); + +static void pcie_determine_mps(struct pci_dev *dev) +{ + struct pci_bus *bus = dev->bus; + + while (bus->self->pcie_type != PCI_EXP_TYPE_ROOT_PORT && + !pci_is_root_bus(bus)) + bus = bus->parent; + + pcie_set_maxpayloadsize(bus); +} + static struct hpp_type0 pci_default_type0 = { .revision = 1, .cache_line_size = 8, @@ -158,47 +171,6 @@ static void program_hpp_type2(struct pci_dev *dev, struct hpp_type2 *hpp) */ } -/* Program PCIE MaxPayload setting on device: ensure parent maxpayload <= device */ -static int pci_set_payload(struct pci_dev *dev) -{ - int pos, ppos; - u16 pctl, psz; - u16 dctl, dsz, dcap, dmax; - struct pci_dev *parent; - - parent = dev->bus->self; - pos = pci_find_capability(dev, PCI_CAP_ID_EXP); - if (!pos) - return 0; - - /* Read Device MaxPayload capability and setting */ - pci_read_config_word(dev, pos + PCI_EXP_DEVCTL, &dctl); - pci_read_config_word(dev, pos + PCI_EXP_DEVCAP, &dcap); - dsz = (dctl & PCI_EXP_DEVCTL_PAYLOAD) >> 5; - dmax = (dcap & PCI_EXP_DEVCAP_PAYLOAD); - - /* Read Parent MaxPayload setting */ - ppos = pci_find_capability(parent, PCI_CAP_ID_EXP); - if (!ppos) - return 0; - pci_read_config_word(parent, ppos + PCI_EXP_DEVCTL, &pctl); - psz = (pctl & PCI_EXP_DEVCTL_PAYLOAD) >> 5; - - /* If parent payload > device max payload -> error - * If parent payload > device payload -> set speed - * If parent payload <= device payload -> do nothing - */ - if (psz > dmax) - return -1; - else if (psz > dsz) { - dev_info(&dev->dev, "Setting MaxPayload to %d\n", 128 << psz); - pci_write_config_word(dev, pos + PCI_EXP_DEVCTL, - (dctl & ~PCI_EXP_DEVCTL_PAYLOAD) + - (psz << 5)); - } - return 0; -} - void pci_configure_slot(struct pci_dev *dev) { struct pci_dev *cdev; @@ -210,9 +182,7 @@ void pci_configure_slot(struct pci_dev *dev) (dev->class >> 8) == PCI_CLASS_BRIDGE_PCI))) return; - ret = pci_set_payload(dev); - if (ret) - dev_warn(&dev->dev, "could not set device max payload\n"); + pcie_determine_mps(dev); memset(&hpp, 0, sizeof(hpp)); ret = pci_get_hp_params(dev, &hpp); diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c index 56098b3..bc25728 100644 --- a/drivers/pci/pci.c +++ b/drivers/pci/pci.c @@ -77,6 +77,8 @@ unsigned long pci_cardbus_mem_size = DEFAULT_CARDBUS_MEM_SIZE; unsigned long pci_hotplug_io_size = DEFAULT_HOTPLUG_IO_SIZE; unsigned long pci_hotplug_mem_size = DEFAULT_HOTPLUG_MEM_SIZE; +bool pcie_mps_forcemax = true; + /* * The default CLS is used if arch didn't set CLS explicitly and not * all pci devices agree on the same value. Arch can override either @@ -3218,6 +3220,66 @@ out: EXPORT_SYMBOL(pcie_set_readrq); /** + * pcie_set_mps - set PCI Express maximum payload size + * @dev: PCI device to query + * + * Returns maximum payload size in bytes + * or appropriate error value. + */ +int pcie_get_mps(struct pci_dev *dev) +{ + int ret, cap; + u16 ctl; + + cap = pci_pcie_cap(dev); + if (!cap) + return -EINVAL; + + ret = pci_read_config_word(dev, cap + PCI_EXP_DEVCTL, &ctl); + if (!ret) + ret = 128 << ((ctl & PCI_EXP_DEVCTL_PAYLOAD) >> 5); + + return ret; +} + +/** + * pcie_set_mps - set PCI Express maximum payload size + * @dev: PCI device to query + * @rq: maximum payload size in bytes + * valid values are 128, 256, 512, 1024, 2048, 4096 + * + * If possible sets maximum payload size + */ +int pcie_set_mps(struct pci_dev *dev, int mps) +{ + int cap, err = -EINVAL; + u16 ctl, v; + + if (mps < 128 || mps > 4096 || !is_power_of_2(mps)) + goto out; + + v = (ffs(mps) - 8) << 5; + if (v > dev->pcie_mpss) + goto out; + + cap = pci_pcie_cap(dev); + if (!cap) + goto out; + + err = pci_read_config_word(dev, cap + PCI_EXP_DEVCTL, &ctl); + if (err) + goto out; + + if ((ctl & PCI_EXP_DEVCTL_PAYLOAD) != v) { + ctl &= ~PCI_EXP_DEVCTL_PAYLOAD; + ctl |= v; + err = pci_write_config_word(dev, cap + PCI_EXP_DEVCTL, ctl); + } +out: + return err; +} + +/** * pci_select_bars - Make BAR mask from the type of resource * @dev: the PCI device for which BAR mask is made * @flags: resource type mask to be selected @@ -3498,6 +3560,10 @@ static int __init pci_setup(char *str) pci_hotplug_io_size = memparse(str + 9, &str); } else if (!strncmp(str, "hpmemsize=", 10)) { pci_hotplug_mem_size = memparse(str + 10, &str); + } else if (!strncmp(str, "mpsmin", 6)) { + pcie_mps_forcemax = false; + } else if (!strncmp(str, "mpsmax", 6)) { + pcie_mps_forcemax = true; } else { printk(KERN_ERR "PCI: Unknown option `%s'\n", str); diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c index 48849ff..d32c62e 100644 --- a/drivers/pci/probe.c +++ b/drivers/pci/probe.c @@ -860,6 +860,8 @@ void set_pcie_port_type(struct pci_dev *pdev) pdev->pcie_cap = pos; pci_read_config_word(pdev, pos + PCI_EXP_FLAGS, ®16); pdev->pcie_type = (reg16 & PCI_EXP_FLAGS_TYPE) >> 4; + pci_read_config_word(pdev, pos + PCI_EXP_DEVCAP, ®16); + pdev->pcie_mpss = reg16 & PCI_EXP_DEVCAP_PAYLOAD; } void set_pcie_hotplug_bridge(struct pci_dev *pdev) @@ -1327,6 +1329,137 @@ int pci_scan_slot(struct pci_bus *bus, int devfn) return nr; } +static int pcie_find_smpss(struct pci_dev *dev, void *data) +{ + u8 *smpss = data; + + if (!pci_is_pcie(dev)) + return 0; + + /* For PCIE hotplug enabled slots not connected directly to a + * PCI-E root port, there can be problems when hotplugging + * devices. This is due to the possibility of hotplugging a + * device into the fabric with a smaller MPS that the devices + * currently running have configured. Modifying the MPS on the + * running devices could cause a fatal bus error due to an + * incoming frame being larger than the newly configured MPS. + * To work around this, the MPS for the entire fabric must be + * set to the minimum size. Any devices hotplugged into this + * fabric will have the minimum MPS set. If the PCI hotplug + * slot is directly connected to the root port and there are not + * other devices on the fabric (which seems to be the most + * common case), then this is not an issue and MPS discovery + * will occur as normal. + */ + if (dev->is_hotplug_bridge && (!list_is_singular(&dev->bus->devices) || + dev->bus->self->pcie_type != PCI_EXP_TYPE_ROOT_PORT)) + *smpss = 0; + + if (*smpss > dev->pcie_mpss) + *smpss = dev->pcie_mpss; + + return 0; +} + +static int pcie_write_mps(struct pci_dev *dev, void *data) +{ + int rc, mrrs, mps, dev_mpss; + + if (!pci_is_pcie(dev)) + return 0; + + mps = 128 << *(u8 *)data; + dev_mpss = 128 << dev->pcie_mpss; + + if (pcie_mps_forcemax) { + if (dev->bus->self) { + dev_dbg(&dev->bus->dev, "Bus MPSS %d\n", + 128 << dev->bus->self->pcie_mpss); + + /* For "MPS Force Max", the assumption is made that + * downstream communication will never be larger than + * the MRRS. So, the MPS only needs to be configured + * for the upstream communication. This being the case, + * walk from the top down and set the MPS of the child + * to that of the parent bus. + */ + mps = 128 << dev->bus->self->pcie_mpss; + if (mps > dev_mpss) + dev_warn(&dev->dev, "MPS configured higher than" + " maximum supported by the device. If" + " a bus issue occurs, try running with" + " pci=mpsmin.\n"); + } + + /* For Max performance, the MRRS must be set to the largest + * supported value. However, it cannot be configured larger + * than the MPS the device or the bus can support. This assumes + * that the largest MRRS available on the device cannot be + * smaller than the device MPSS. + */ + mrrs = mps < dev_mpss ? mps : dev_mpss; + + dev->pcie_mpss = ffs(mps) - 8; + } else + /* In the "safe" case, configure the MRRS for fairness on the + * bus by making all devices have the same size + */ + mrrs = mps; + + dev_info(&dev->dev, "Setting MRRS to %d\n", mrrs); + rc = pcie_set_readrq(dev, mrrs); + if (rc) + dev_err(&dev->dev, "Failed attempting to set the MRRS\n"); + + if (mrrs != pcie_get_readrq(dev)) { + dev_err(&dev->dev, "MRRS is not what was written. Falling back" + " to the minimum value.\n"); + + rc = pcie_set_readrq(dev, 128); + if (rc) + dev_err(&dev->dev, + "Failed attempting to set the MRRS\n"); + } + + dev_info(&dev->dev, "Setting MPS to %d\n", mps); + rc = pcie_set_mps(dev, mps); + if (rc) + dev_err(&dev->dev, "Failed attempting to set the MPS\n"); + + dev_dbg(&dev->dev, "Dev MPS %d MPSS %d MRRS %d\n", + pcie_get_mps(dev), dev_mpss, pcie_get_readrq(dev)); + + return 0; +} + +/* pcie_set_maxpayloadsize requires that pci_walk_bus work in a top-down, + * parents then children fashion. If this changes, then this code will not + * work as designed. + */ +void pcie_set_maxpayloadsize(struct pci_bus *bus) +{ + u8 smpss; + + if (!bus->self) + return; + + if (!pci_is_pcie(bus->self)) + return; + + if (bus->self->pcie_type != PCI_EXP_TYPE_ROOT_PORT) + return; + + smpss = bus->self->pcie_mpss; + + if (!pcie_mps_forcemax) { + pcie_find_smpss(bus->self, &smpss); + pci_walk_bus(bus, pcie_find_smpss, &smpss); + } + + pcie_write_mps(bus->self, &smpss); + pci_walk_bus(bus, pcie_write_mps, &smpss); +} + unsigned int __devinit pci_scan_child_bus(struct pci_bus *bus) { unsigned int devfn, pass, max = bus->secondary; @@ -1359,6 +1492,11 @@ unsigned int __devinit pci_scan_child_bus(struct pci_bus *bus) max = pci_scan_bridge(bus, dev, max, pass); } + /* After the bus has been walked and all devices discovered, determine + * the MPS of the fabric and set it. + */ + pcie_set_maxpayloadsize(bus); + /* * We've scanned the bus and so we know all about what's on * the other side of any bridges that may be on this bus plus diff --git a/include/linux/pci.h b/include/linux/pci.h index c446b5c..9b35724 100644 --- a/include/linux/pci.h +++ b/include/linux/pci.h @@ -251,7 +251,8 @@ struct pci_dev { u8 revision; /* PCI revision, low byte of class word */ u8 hdr_type; /* PCI header type (`multi' flag masked out) */ u8 pcie_cap; /* PCI-E capability offset */ - u8 pcie_type; /* PCI-E device/port type */ + u8 pcie_type:4; /* PCI-E device/port type */ + u8 pcie_mpss:3; /* PCI-E Max Payload Size Supported */ u8 rom_base_reg; /* which config register controls the ROM */ u8 pin; /* which interrupt pin this device uses */ @@ -617,6 +618,8 @@ struct pci_driver { /* these external functions are only available when PCI support is enabled */ #ifdef CONFIG_PCI +extern bool pcie_mps_forcemax; + extern struct bus_type pci_bus_type; /* Do NOT directly access these two variables, unless you are arch specific pci @@ -796,6 +799,8 @@ int pcix_get_mmrbc(struct pci_dev *dev); int pcix_set_mmrbc(struct pci_dev *dev, int mmrbc); int pcie_get_readrq(struct pci_dev *dev); int pcie_set_readrq(struct pci_dev *dev, int rq); +int pcie_get_mps(struct pci_dev *dev); +int pcie_set_mps(struct pci_dev *dev, int mps); int __pci_reset_function(struct pci_dev *dev); int pci_reset_function(struct pci_dev *dev); void pci_update_resource(struct pci_dev *dev, int resno);