Message ID | 1394222924-28886-1-git-send-email-yinghai@kernel.org (mailing list archive) |
---|---|
State | New, archived |
Delegated to: | Bjorn Helgaas |
Headers | show |
On Fri, Mar 7, 2014 at 1:08 PM, Yinghai Lu <yinghai@kernel.org> wrote: > When one of children resources does not support MEM_64, MEM_64 for > bridge get reset, so pull down whole pref resource on the bridge under 4G. > > If the bridge support pref mem 64, will only allocate that with pref mem64 to > children that support it. > For children resources if they only support pref mem 32, will allocate them > from non pref mem instead. > > If the bridge only support 32bit pref mmio, will still have all children pref > mmio under that. > > -v2: Add release bridge res support with bridge mem res for pref_mem children res. > -v3: refresh and make it can be applied early before for_each_dev_res patchset. > -v4: fix non-pref mmio 64bit support found by Guo Chao. > -v7: repost as ibm's powerpc system work again when > 1. petiboot boot kernel have this patch. > 2. or mellanox driver added new .shutdown member. > discussion could be found at: > http://marc.info/?t=138968064500001&r=1&w=2 I think this fixes some sort of bug, and I suppose if I spent a few hours decoding the discussion you mention (the 44 message-long "mlx4_core probe error after applying Yinghai's patch" discussion), I might be able to figure out what it is. But maybe somebody who already knows what the bug is can summarize it for me, and maybe even open a kernel.org bugzilla with a dmesg log as an example. I do intend to use this message [1] as a source to help construct a changelog, but I'd like to also describe a specific problem that is solved by this patch. Ideally this would already be in the changelog, but it's not, so any help in figuring it out would be appreciated. [1] http://lkml.kernel.org/r/CAErSpo5VWREGptW0MU0wLJUa3h23DXMZPGkdJFMTx5WAGL8J6Q@mail.gmail.com > Signed-off-by: Yinghai Lu <yinghai@kernel.org> > Tested-by: Guo Chao <yan@linux.vnet.ibm.com> > Tested-by: Wei Yang <weiyang@linux.vnet.ibm.com> > > --- > drivers/pci/setup-bus.c | 138 ++++++++++++++++++++++++++++++++---------------- > drivers/pci/setup-res.c | 20 ++++++ > 2 files changed, 111 insertions(+), 47 deletions(-) > > Index: linux-2.6/drivers/pci/setup-bus.c > =================================================================== > --- linux-2.6.orig/drivers/pci/setup-bus.c > +++ linux-2.6/drivers/pci/setup-bus.c > @@ -713,12 +713,11 @@ static void pci_bridge_check_ranges(stru > bus resource of a given type. Note: we intentionally skip > the bus resources which have already been assigned (that is, > have non-NULL parent resource). */ > -static struct resource *find_free_bus_resource(struct pci_bus *bus, unsigned long type) > +static struct resource *find_free_bus_resource(struct pci_bus *bus, > + unsigned long type_mask, unsigned long type) > { > int i; > struct resource *r; > - unsigned long type_mask = IORESOURCE_IO | IORESOURCE_MEM | > - IORESOURCE_PREFETCH; > > pci_bus_for_each_resource(bus, r, i) { > if (r == &ioport_resource || r == &iomem_resource) > @@ -815,7 +814,8 @@ static void pbus_size_io(struct pci_bus > resource_size_t add_size, struct list_head *realloc_head) > { > struct pci_dev *dev; > - struct resource *b_res = find_free_bus_resource(bus, IORESOURCE_IO); > + struct resource *b_res = find_free_bus_resource(bus, IORESOURCE_IO, > + IORESOURCE_IO); > resource_size_t size = 0, size0 = 0, size1 = 0; > resource_size_t children_add_size = 0; > resource_size_t min_align, align; > @@ -915,15 +915,17 @@ static inline resource_size_t calculate_ > * guarantees that all child resources fit in this size. > */ > static int pbus_size_mem(struct pci_bus *bus, unsigned long mask, > - unsigned long type, resource_size_t min_size, > - resource_size_t add_size, > - struct list_head *realloc_head) > + unsigned long type, unsigned long type2, > + unsigned long type3, > + resource_size_t min_size, resource_size_t add_size, > + struct list_head *realloc_head) > { > struct pci_dev *dev; > resource_size_t min_align, align, size, size0, size1; > resource_size_t aligns[12]; /* Alignments from 1Mb to 2Gb */ > int order, max_order; > - struct resource *b_res = find_free_bus_resource(bus, type); > + struct resource *b_res = find_free_bus_resource(bus, > + mask | IORESOURCE_PREFETCH, type); > unsigned int mem64_mask = 0; > resource_size_t children_add_size = 0; > > @@ -944,7 +946,9 @@ static int pbus_size_mem(struct pci_bus > struct resource *r = &dev->resource[i]; > resource_size_t r_size; > > - if (r->parent || (r->flags & mask) != type) > + if (r->parent || ((r->flags & mask) != type && > + (r->flags & mask) != type2 && > + (r->flags & mask) != type3)) > continue; > r_size = resource_size(r); > #ifdef CONFIG_PCI_IOV > @@ -1117,8 +1121,9 @@ void __ref __pci_bus_size_bridges(struct > struct list_head *realloc_head) > { > struct pci_dev *dev; > - unsigned long mask, prefmask; > + unsigned long mask, prefmask, type2 = 0, type3 = 0; > resource_size_t additional_mem_size = 0, additional_io_size = 0; > + struct resource *b_res; > > list_for_each_entry(dev, &bus->devices, bus_list) { > struct pci_bus *b = dev->subordinate; > @@ -1163,15 +1168,34 @@ void __ref __pci_bus_size_bridges(struct > has already been allocated by arch code, try > non-prefetchable range for both types of PCI memory > resources. */ > + b_res = &bus->self->resource[PCI_BRIDGE_RESOURCES]; > mask = IORESOURCE_MEM; > prefmask = IORESOURCE_MEM | IORESOURCE_PREFETCH; > - if (pbus_size_mem(bus, prefmask, prefmask, > + if (b_res[2].flags & IORESOURCE_MEM_64) { > + prefmask |= IORESOURCE_MEM_64; > + if (pbus_size_mem(bus, prefmask, prefmask, > + prefmask, prefmask, > realloc_head ? 0 : additional_mem_size, > - additional_mem_size, realloc_head)) > - mask = prefmask; /* Success, size non-prefetch only. */ > - else > - additional_mem_size += additional_mem_size; > - pbus_size_mem(bus, mask, IORESOURCE_MEM, > + additional_mem_size, realloc_head)) { > + /* Success, size non-pref64 only. */ > + mask = prefmask; > + type2 = prefmask & ~IORESOURCE_MEM_64; > + type3 = prefmask & ~IORESOURCE_PREFETCH; > + } > + } > + if (!type2) { > + prefmask &= ~IORESOURCE_MEM_64; > + if (pbus_size_mem(bus, prefmask, prefmask, > + prefmask, prefmask, > + realloc_head ? 0 : additional_mem_size, > + additional_mem_size, realloc_head)) { > + /* Success, size non-prefetch only. */ > + mask = prefmask; > + } else > + additional_mem_size += additional_mem_size; > + type2 = type3 = IORESOURCE_MEM; > + } > + pbus_size_mem(bus, mask, IORESOURCE_MEM, type2, type3, > realloc_head ? 0 : additional_mem_size, > additional_mem_size, realloc_head); > break; > @@ -1257,42 +1281,66 @@ static void __ref __pci_bridge_assign_re > static void pci_bridge_release_resources(struct pci_bus *bus, > unsigned long type) > { > - int idx; > - bool changed = false; > - struct pci_dev *dev; > + struct pci_dev *dev = bus->self; > struct resource *r; > unsigned long type_mask = IORESOURCE_IO | IORESOURCE_MEM | > - IORESOURCE_PREFETCH; > + IORESOURCE_PREFETCH | IORESOURCE_MEM_64; > + unsigned old_flags = 0; > + struct resource *b_res; > + int idx = 1; > > - dev = bus->self; > - for (idx = PCI_BRIDGE_RESOURCES; idx <= PCI_BRIDGE_RESOURCE_END; > - idx++) { > - r = &dev->resource[idx]; > - if ((r->flags & type_mask) != type) > - continue; > - if (!r->parent) > - continue; > - /* > - * if there are children under that, we should release them > - * all > - */ > - release_child_resources(r); > - if (!release_resource(r)) { > - dev_printk(KERN_DEBUG, &dev->dev, > - "resource %d %pR released\n", idx, r); > - /* keep the old size */ > - r->end = resource_size(r) - 1; > - r->start = 0; > - r->flags = 0; > - changed = true; > - } > - } > + b_res = &dev->resource[PCI_BRIDGE_RESOURCES]; > + > + /* > + * 1. if there is io port assign fail, will release bridge > + * io port. > + * 2. if there is non pref mmio assign fail, release bridge > + * nonpref mmio. > + * 3. if there is 64bit pref mmio assign fail, and bridge pref > + * is 64bit, release bridge pref mmio. > + * 4. if there is pref mmio assign fail, and bridge pref is > + * 32bit mmio, release bridge pref mmio > + * 5. if there is pref mmio assign fail, and bridge pref is not > + * assigned, release bridge nonpref mmio. > + */ > + if (type & IORESOURCE_IO) > + idx = 0; > + else if (!(type & IORESOURCE_PREFETCH)) > + idx = 1; > + else if ((type & IORESOURCE_MEM_64) && > + (b_res[2].flags & IORESOURCE_MEM_64)) > + idx = 2; > + else if (!(b_res[2].flags & IORESOURCE_MEM_64) && > + (b_res[2].flags & IORESOURCE_PREFETCH)) > + idx = 2; > + else > + idx = 1; > + > + r = &b_res[idx]; > + > + if (!r->parent) > + return; > + > + /* > + * if there are children under that, we should release them > + * all > + */ > + release_child_resources(r); > + if (!release_resource(r)) { > + type = old_flags = r->flags & type_mask; > + dev_printk(KERN_DEBUG, &dev->dev, "resource %d %pR released\n", > + PCI_BRIDGE_RESOURCES + idx, r); > + /* keep the old size */ > + r->end = resource_size(r) - 1; > + r->start = 0; > + r->flags = 0; > > - if (changed) { > /* avoiding touch the one without PREF */ > if (type & IORESOURCE_PREFETCH) > type = IORESOURCE_PREFETCH; > __pci_setup_bridge(bus, type); > + /* for next child res under same bridge */ > + r->flags = old_flags; > } > } > > @@ -1471,7 +1519,7 @@ void pci_assign_unassigned_root_bus_reso > LIST_HEAD(fail_head); > struct pci_dev_resource *fail_res; > unsigned long type_mask = IORESOURCE_IO | IORESOURCE_MEM | > - IORESOURCE_PREFETCH; > + IORESOURCE_PREFETCH | IORESOURCE_MEM_64; > int pci_try_num = 1; > enum enable_type enable_local; > > Index: linux-2.6/drivers/pci/setup-res.c > =================================================================== > --- linux-2.6.orig/drivers/pci/setup-res.c > +++ linux-2.6/drivers/pci/setup-res.c > @@ -208,15 +208,31 @@ static int __pci_assign_resource(struct > > /* First, try exact prefetching match.. */ > ret = pci_bus_alloc_resource(bus, res, size, align, min, > - IORESOURCE_PREFETCH, > + IORESOURCE_PREFETCH | IORESOURCE_MEM_64, > pcibios_align_resource, dev); > > - if (ret < 0 && (res->flags & IORESOURCE_PREFETCH)) { > + if (ret < 0 && > + (res->flags & (IORESOURCE_PREFETCH | IORESOURCE_MEM_64)) == > + (IORESOURCE_PREFETCH | IORESOURCE_MEM_64)) { > + /* > + * That failed. > + * > + * Try below 4g pref > + */ > + ret = pci_bus_alloc_resource(bus, res, size, align, min, > + IORESOURCE_PREFETCH, > + pcibios_align_resource, dev); > + } > + > + if (ret < 0 && > + (res->flags & (IORESOURCE_PREFETCH | IORESOURCE_MEM_64))) { > /* > * That failed. > * > * But a prefetching area can handle a non-prefetching > * window (it will just not perform as well). > + * > + * Also can put 64bit under 32bit range. (below 4g). > */ > ret = pci_bus_alloc_resource(bus, res, size, align, min, 0, > pcibios_align_resource, dev); -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Apr 04, 2014 at 04:43:27PM -0600, Bjorn Helgaas wrote: > On Fri, Mar 7, 2014 at 1:08 PM, Yinghai Lu <yinghai@kernel.org> wrote: > > When one of children resources does not support MEM_64, MEM_64 for > > bridge get reset, so pull down whole pref resource on the bridge under 4G. > > > > If the bridge support pref mem 64, will only allocate that with pref mem64 to > > children that support it. > > For children resources if they only support pref mem 32, will allocate them > > from non pref mem instead. > > > > If the bridge only support 32bit pref mmio, will still have all children pref > > mmio under that. > > > > -v2: Add release bridge res support with bridge mem res for pref_mem children res. > > -v3: refresh and make it can be applied early before for_each_dev_res patchset. > > -v4: fix non-pref mmio 64bit support found by Guo Chao. > > -v7: repost as ibm's powerpc system work again when > > 1. petiboot boot kernel have this patch. > > 2. or mellanox driver added new .shutdown member. > > discussion could be found at: > > http://marc.info/?t=138968064500001&r=1&w=2 > > I think this fixes some sort of bug, and I suppose if I spent a few > hours decoding the discussion you mention (the 44 message-long > "mlx4_core probe error after applying Yinghai's patch" discussion), I > might be able to figure out what it is. > The result of 44 message-long discussion is: the Mellanox card's failure is due to a bug in its driver instead of this patch. The patch refines the logic of using prefetchable window, so that 64-bit prefetchable BARs have a chance to be really prefetchable. It does not fix any bugs, instead, it exposes one. Thanks, Guo Chao > But maybe somebody who already knows what the bug is can summarize it > for me, and maybe even open a kernel.org bugzilla with a dmesg log as > an example. > > I do intend to use this message [1] as a source to help construct a > changelog, but I'd like to also describe a specific problem that is > solved by this patch. Ideally this would already be in the changelog, > but it's not, so any help in figuring it out would be appreciated. > > [1] http://lkml.kernel.org/r/CAErSpo5VWREGptW0MU0wLJUa3h23DXMZPGkdJFMTx5WAGL8J6Q@mail.gmail.com > > > Signed-off-by: Yinghai Lu <yinghai@kernel.org> > > Tested-by: Guo Chao <yan@linux.vnet.ibm.com> > > Tested-by: Wei Yang <weiyang@linux.vnet.ibm.com> > > > > --- > > drivers/pci/setup-bus.c | 138 ++++++++++++++++++++++++++++++++---------------- > > drivers/pci/setup-res.c | 20 ++++++ > > 2 files changed, 111 insertions(+), 47 deletions(-) > > > > Index: linux-2.6/drivers/pci/setup-bus.c > > =================================================================== > > --- linux-2.6.orig/drivers/pci/setup-bus.c > > +++ linux-2.6/drivers/pci/setup-bus.c > > @@ -713,12 +713,11 @@ static void pci_bridge_check_ranges(stru > > bus resource of a given type. Note: we intentionally skip > > the bus resources which have already been assigned (that is, > > have non-NULL parent resource). */ > > -static struct resource *find_free_bus_resource(struct pci_bus *bus, unsigned long type) > > +static struct resource *find_free_bus_resource(struct pci_bus *bus, > > + unsigned long type_mask, unsigned long type) > > { > > int i; > > struct resource *r; > > - unsigned long type_mask = IORESOURCE_IO | IORESOURCE_MEM | > > - IORESOURCE_PREFETCH; > > > > pci_bus_for_each_resource(bus, r, i) { > > if (r == &ioport_resource || r == &iomem_resource) > > @@ -815,7 +814,8 @@ static void pbus_size_io(struct pci_bus > > resource_size_t add_size, struct list_head *realloc_head) > > { > > struct pci_dev *dev; > > - struct resource *b_res = find_free_bus_resource(bus, IORESOURCE_IO); > > + struct resource *b_res = find_free_bus_resource(bus, IORESOURCE_IO, > > + IORESOURCE_IO); > > resource_size_t size = 0, size0 = 0, size1 = 0; > > resource_size_t children_add_size = 0; > > resource_size_t min_align, align; > > @@ -915,15 +915,17 @@ static inline resource_size_t calculate_ > > * guarantees that all child resources fit in this size. > > */ > > static int pbus_size_mem(struct pci_bus *bus, unsigned long mask, > > - unsigned long type, resource_size_t min_size, > > - resource_size_t add_size, > > - struct list_head *realloc_head) > > + unsigned long type, unsigned long type2, > > + unsigned long type3, > > + resource_size_t min_size, resource_size_t add_size, > > + struct list_head *realloc_head) > > { > > struct pci_dev *dev; > > resource_size_t min_align, align, size, size0, size1; > > resource_size_t aligns[12]; /* Alignments from 1Mb to 2Gb */ > > int order, max_order; > > - struct resource *b_res = find_free_bus_resource(bus, type); > > + struct resource *b_res = find_free_bus_resource(bus, > > + mask | IORESOURCE_PREFETCH, type); > > unsigned int mem64_mask = 0; > > resource_size_t children_add_size = 0; > > > > @@ -944,7 +946,9 @@ static int pbus_size_mem(struct pci_bus > > struct resource *r = &dev->resource[i]; > > resource_size_t r_size; > > > > - if (r->parent || (r->flags & mask) != type) > > + if (r->parent || ((r->flags & mask) != type && > > + (r->flags & mask) != type2 && > > + (r->flags & mask) != type3)) > > continue; > > r_size = resource_size(r); > > #ifdef CONFIG_PCI_IOV > > @@ -1117,8 +1121,9 @@ void __ref __pci_bus_size_bridges(struct > > struct list_head *realloc_head) > > { > > struct pci_dev *dev; > > - unsigned long mask, prefmask; > > + unsigned long mask, prefmask, type2 = 0, type3 = 0; > > resource_size_t additional_mem_size = 0, additional_io_size = 0; > > + struct resource *b_res; > > > > list_for_each_entry(dev, &bus->devices, bus_list) { > > struct pci_bus *b = dev->subordinate; > > @@ -1163,15 +1168,34 @@ void __ref __pci_bus_size_bridges(struct > > has already been allocated by arch code, try > > non-prefetchable range for both types of PCI memory > > resources. */ > > + b_res = &bus->self->resource[PCI_BRIDGE_RESOURCES]; > > mask = IORESOURCE_MEM; > > prefmask = IORESOURCE_MEM | IORESOURCE_PREFETCH; > > - if (pbus_size_mem(bus, prefmask, prefmask, > > + if (b_res[2].flags & IORESOURCE_MEM_64) { > > + prefmask |= IORESOURCE_MEM_64; > > + if (pbus_size_mem(bus, prefmask, prefmask, > > + prefmask, prefmask, > > realloc_head ? 0 : additional_mem_size, > > - additional_mem_size, realloc_head)) > > - mask = prefmask; /* Success, size non-prefetch only. */ > > - else > > - additional_mem_size += additional_mem_size; > > - pbus_size_mem(bus, mask, IORESOURCE_MEM, > > + additional_mem_size, realloc_head)) { > > + /* Success, size non-pref64 only. */ > > + mask = prefmask; > > + type2 = prefmask & ~IORESOURCE_MEM_64; > > + type3 = prefmask & ~IORESOURCE_PREFETCH; > > + } > > + } > > + if (!type2) { > > + prefmask &= ~IORESOURCE_MEM_64; > > + if (pbus_size_mem(bus, prefmask, prefmask, > > + prefmask, prefmask, > > + realloc_head ? 0 : additional_mem_size, > > + additional_mem_size, realloc_head)) { > > + /* Success, size non-prefetch only. */ > > + mask = prefmask; > > + } else > > + additional_mem_size += additional_mem_size; > > + type2 = type3 = IORESOURCE_MEM; > > + } > > + pbus_size_mem(bus, mask, IORESOURCE_MEM, type2, type3, > > realloc_head ? 0 : additional_mem_size, > > additional_mem_size, realloc_head); > > break; > > @@ -1257,42 +1281,66 @@ static void __ref __pci_bridge_assign_re > > static void pci_bridge_release_resources(struct pci_bus *bus, > > unsigned long type) > > { > > - int idx; > > - bool changed = false; > > - struct pci_dev *dev; > > + struct pci_dev *dev = bus->self; > > struct resource *r; > > unsigned long type_mask = IORESOURCE_IO | IORESOURCE_MEM | > > - IORESOURCE_PREFETCH; > > + IORESOURCE_PREFETCH | IORESOURCE_MEM_64; > > + unsigned old_flags = 0; > > + struct resource *b_res; > > + int idx = 1; > > > > - dev = bus->self; > > - for (idx = PCI_BRIDGE_RESOURCES; idx <= PCI_BRIDGE_RESOURCE_END; > > - idx++) { > > - r = &dev->resource[idx]; > > - if ((r->flags & type_mask) != type) > > - continue; > > - if (!r->parent) > > - continue; > > - /* > > - * if there are children under that, we should release them > > - * all > > - */ > > - release_child_resources(r); > > - if (!release_resource(r)) { > > - dev_printk(KERN_DEBUG, &dev->dev, > > - "resource %d %pR released\n", idx, r); > > - /* keep the old size */ > > - r->end = resource_size(r) - 1; > > - r->start = 0; > > - r->flags = 0; > > - changed = true; > > - } > > - } > > + b_res = &dev->resource[PCI_BRIDGE_RESOURCES]; > > + > > + /* > > + * 1. if there is io port assign fail, will release bridge > > + * io port. > > + * 2. if there is non pref mmio assign fail, release bridge > > + * nonpref mmio. > > + * 3. if there is 64bit pref mmio assign fail, and bridge pref > > + * is 64bit, release bridge pref mmio. > > + * 4. if there is pref mmio assign fail, and bridge pref is > > + * 32bit mmio, release bridge pref mmio > > + * 5. if there is pref mmio assign fail, and bridge pref is not > > + * assigned, release bridge nonpref mmio. > > + */ > > + if (type & IORESOURCE_IO) > > + idx = 0; > > + else if (!(type & IORESOURCE_PREFETCH)) > > + idx = 1; > > + else if ((type & IORESOURCE_MEM_64) && > > + (b_res[2].flags & IORESOURCE_MEM_64)) > > + idx = 2; > > + else if (!(b_res[2].flags & IORESOURCE_MEM_64) && > > + (b_res[2].flags & IORESOURCE_PREFETCH)) > > + idx = 2; > > + else > > + idx = 1; > > + > > + r = &b_res[idx]; > > + > > + if (!r->parent) > > + return; > > + > > + /* > > + * if there are children under that, we should release them > > + * all > > + */ > > + release_child_resources(r); > > + if (!release_resource(r)) { > > + type = old_flags = r->flags & type_mask; > > + dev_printk(KERN_DEBUG, &dev->dev, "resource %d %pR released\n", > > + PCI_BRIDGE_RESOURCES + idx, r); > > + /* keep the old size */ > > + r->end = resource_size(r) - 1; > > + r->start = 0; > > + r->flags = 0; > > > > - if (changed) { > > /* avoiding touch the one without PREF */ > > if (type & IORESOURCE_PREFETCH) > > type = IORESOURCE_PREFETCH; > > __pci_setup_bridge(bus, type); > > + /* for next child res under same bridge */ > > + r->flags = old_flags; > > } > > } > > > > @@ -1471,7 +1519,7 @@ void pci_assign_unassigned_root_bus_reso > > LIST_HEAD(fail_head); > > struct pci_dev_resource *fail_res; > > unsigned long type_mask = IORESOURCE_IO | IORESOURCE_MEM | > > - IORESOURCE_PREFETCH; > > + IORESOURCE_PREFETCH | IORESOURCE_MEM_64; > > int pci_try_num = 1; > > enum enable_type enable_local; > > > > Index: linux-2.6/drivers/pci/setup-res.c > > =================================================================== > > --- linux-2.6.orig/drivers/pci/setup-res.c > > +++ linux-2.6/drivers/pci/setup-res.c > > @@ -208,15 +208,31 @@ static int __pci_assign_resource(struct > > > > /* First, try exact prefetching match.. */ > > ret = pci_bus_alloc_resource(bus, res, size, align, min, > > - IORESOURCE_PREFETCH, > > + IORESOURCE_PREFETCH | IORESOURCE_MEM_64, > > pcibios_align_resource, dev); > > > > - if (ret < 0 && (res->flags & IORESOURCE_PREFETCH)) { > > + if (ret < 0 && > > + (res->flags & (IORESOURCE_PREFETCH | IORESOURCE_MEM_64)) == > > + (IORESOURCE_PREFETCH | IORESOURCE_MEM_64)) { > > + /* > > + * That failed. > > + * > > + * Try below 4g pref > > + */ > > + ret = pci_bus_alloc_resource(bus, res, size, align, min, > > + IORESOURCE_PREFETCH, > > + pcibios_align_resource, dev); > > + } > > + > > + if (ret < 0 && > > + (res->flags & (IORESOURCE_PREFETCH | IORESOURCE_MEM_64))) { > > /* > > * That failed. > > * > > * But a prefetching area can handle a non-prefetching > > * window (it will just not perform as well). > > + * > > + * Also can put 64bit under 32bit range. (below 4g). > > */ > > ret = pci_bus_alloc_resource(bus, res, size, align, min, 0, > > pcibios_align_resource, dev); > -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 08/04/2014 05:57, Guo Chao wrote: > On Fri, Apr 04, 2014 at 04:43:27PM -0600, Bjorn Helgaas wrote: >> On Fri, Mar 7, 2014 at 1:08 PM, Yinghai Lu <yinghai@kernel.org> wrote: >>> When one of children resources does not support MEM_64, MEM_64 for >>> bridge get reset, so pull down whole pref resource on the bridge under 4G. >>> >>> If the bridge support pref mem 64, will only allocate that with pref mem64 to >>> children that support it. >>> For children resources if they only support pref mem 32, will allocate them >>> from non pref mem instead. >>> >>> If the bridge only support 32bit pref mmio, will still have all children pref >>> mmio under that. >>> >>> -v2: Add release bridge res support with bridge mem res for pref_mem children res. >>> -v3: refresh and make it can be applied early before for_each_dev_res patchset. >>> -v4: fix non-pref mmio 64bit support found by Guo Chao. >>> -v7: repost as ibm's powerpc system work again when >>> 1. petiboot boot kernel have this patch. >>> 2. or mellanox driver added new .shutdown member. >>> discussion could be found at: >>> http://marc.info/?t=138968064500001&r=1&w=2 >> I think this fixes some sort of bug, and I suppose if I spent a few >> hours decoding the discussion you mention (the 44 message-long >> "mlx4_core probe error after applying Yinghai's patch" discussion), I >> might be able to figure out what it is. >> > The result of 44 message-long discussion is: the Mellanox card's failure is > due to a bug in its driver instead of this patch. So just to make sure we're on the same page -- fixing the bug is carried out through netdev, e.g this 3.14-rc8 upstream commit 97a5221 "net/mlx4_core: pass pci_device_id.driver_data to __mlx4_init_one during reset" and now this one more fix which is under discussion http://marc.info/?l=linux-pci&m=139675010015646&w=2, right? > > The patch refines the logic of using prefetchable window, so that 64-bit > prefetchable BARs have a chance to be really prefetchable. It does not fix > any bugs, instead, it exposes one. -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Apr 08, 2014 at 10:18:48AM +0300, Or Gerlitz wrote: >On 08/04/2014 05:57, Guo Chao wrote: >>On Fri, Apr 04, 2014 at 04:43:27PM -0600, Bjorn Helgaas wrote: >>>On Fri, Mar 7, 2014 at 1:08 PM, Yinghai Lu <yinghai@kernel.org> wrote: >>>>When one of children resources does not support MEM_64, MEM_64 for >>>>bridge get reset, so pull down whole pref resource on the bridge under 4G. >>>> >>>>If the bridge support pref mem 64, will only allocate that with pref mem64 to >>>>children that support it. >>>>For children resources if they only support pref mem 32, will allocate them >>>>from non pref mem instead. >>>> >>>>If the bridge only support 32bit pref mmio, will still have all children pref >>>>mmio under that. >>>> >>>>-v2: Add release bridge res support with bridge mem res for pref_mem children res. >>>>-v3: refresh and make it can be applied early before for_each_dev_res patchset. >>>>-v4: fix non-pref mmio 64bit support found by Guo Chao. >>>>-v7: repost as ibm's powerpc system work again when >>>> 1. petiboot boot kernel have this patch. >>>> 2. or mellanox driver added new .shutdown member. >>>> discussion could be found at: >>>> http://marc.info/?t=138968064500001&r=1&w=2 >>>I think this fixes some sort of bug, and I suppose if I spent a few >>>hours decoding the discussion you mention (the 44 message-long >>>"mlx4_core probe error after applying Yinghai's patch" discussion), I >>>might be able to figure out what it is. >>> >>The result of 44 message-long discussion is: the Mellanox card's failure is >>due to a bug in its driver instead of this patch. > > >So just to make sure we're on the same page -- fixing the bug is >carried out through netdev, e.g this 3.14-rc8 >upstream commit 97a5221 "net/mlx4_core: pass >pci_device_id.driver_data to __mlx4_init_one during reset" >and now this one more fix which is under discussion >http://marc.info/?l=linux-pci&m=139675010015646&w=2, right? No, this one http://marc.info/?l=linux-pci&m=139675010015646&w=2 is another bug in mlx4 driver, missing a proper pci_dev_data in mlx4_init_one. The bug discussing in this thread is the driver "forget" to release the "owner" during kexec. So this device couldn't be usded after kexec. > > > >> >>The patch refines the logic of using prefetchable window, so that 64-bit >>prefetchable BARs have a chance to be really prefetchable. It does not fix >>any bugs, instead, it exposes one.
On 08/04/2014 10:41, Wei Yang wrote: >> >So just to make sure we're on the same page -- fixing the bug is >> >carried out through netdev, e.g this 3.14-rc8 >> >upstream commit 97a5221 "net/mlx4_core: pass >> >pci_device_id.driver_data to __mlx4_init_one during reset" >> >and now this one more fix which is under discussion >> >http://marc.info/?l=linux-pci&m=139675010015646&w=2, right? > No, this onehttp://marc.info/?l=linux-pci&m=139675010015646&w=2 is another > bug in mlx4 driver, missing a proper pci_dev_data in mlx4_init_one. > > The bug discussing in this thread is the driver "forget" to release the > "owner" during kexec. So this device couldn't be usded after kexec. > Can you guys please open a kernel.org bugzilla ticket, or better... work on a patch? -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Apr 08, 2014 at 10:55:55AM +0300, Or Gerlitz wrote: > On 08/04/2014 10:41, Wei Yang wrote: > >>>So just to make sure we're on the same page -- fixing the bug is > >>>carried out through netdev, e.g this 3.14-rc8 > >>>upstream commit 97a5221 "net/mlx4_core: pass > >>>pci_device_id.driver_data to __mlx4_init_one during reset" > >>>and now this one more fix which is under discussion > >>>http://marc.info/?l=linux-pci&m=139675010015646&w=2, right? > >No, this onehttp://marc.info/?l=linux-pci&m=139675010015646&w=2 is another > >bug in mlx4 driver, missing a proper pci_dev_data in mlx4_init_one. > > > >The bug discussing in this thread is the driver "forget" to release the > >"owner" during kexec. So this device couldn't be usded after kexec. > > > Can you guys please open a kernel.org bugzilla ticket, or better... > work on a patch? > It's already fixed in upstream. commit 367d56f7b4d5ce61e883c64f81786c7a3ae88eea Author: Gavin Shan <shangw@linux.vnet.ibm.com> Date: Tue Mar 4 15:35:20 2014 +0800 net/mlx4: Support shutdown() interface Thanks, Guo Chao -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Apr 7, 2014 at 8:57 PM, Guo Chao <yan@linux.vnet.ibm.com> wrote: > On Fri, Apr 04, 2014 at 04:43:27PM -0600, Bjorn Helgaas wrote: >> On Fri, Mar 7, 2014 at 1:08 PM, Yinghai Lu <yinghai@kernel.org> wrote: >> > When one of children resources does not support MEM_64, MEM_64 for >> > bridge get reset, so pull down whole pref resource on the bridge under 4G. >> > >> > If the bridge support pref mem 64, will only allocate that with pref mem64 to >> > children that support it. >> > For children resources if they only support pref mem 32, will allocate them >> > from non pref mem instead. >> > >> > If the bridge only support 32bit pref mmio, will still have all children pref >> > mmio under that. >> > >> > -v2: Add release bridge res support with bridge mem res for pref_mem children res. >> > -v3: refresh and make it can be applied early before for_each_dev_res patchset. >> > -v4: fix non-pref mmio 64bit support found by Guo Chao. >> > -v7: repost as ibm's powerpc system work again when >> > 1. petiboot boot kernel have this patch. >> > 2. or mellanox driver added new .shutdown member. >> > discussion could be found at: >> > http://marc.info/?t=138968064500001&r=1&w=2 >> >> I think this fixes some sort of bug, and I suppose if I spent a few >> hours decoding the discussion you mention (the 44 message-long >> "mlx4_core probe error after applying Yinghai's patch" discussion), I >> might be able to figure out what it is. >> > > The result of 44 message-long discussion is: the Mellanox card's failure is > due to a bug in its driver instead of this patch. > > The patch refines the logic of using prefetchable window, so that 64-bit > prefetchable BARs have a chance to be really prefetchable. It does not fix > any bugs, instead, it exposes one. OK, then I'm back to square one, and I need an explanation of why we want to merge this patch. I think the patch conserves 32-bit address space by putting more things above 4G than we used to. But this comes at the cost of using non-prefetchable windows in some cases where we used to use prefetchable ones. Somebody has to explain why we want to do this and why the benefit of conserving the 32-bit space is more important losing the prefetchability. Bjorn -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi, Bjorn: On Tue, Apr 08, 2014 at 09:02:10AM -0600, Bjorn Helgaas wrote: > On Mon, Apr 7, 2014 at 8:57 PM, Guo Chao <yan@linux.vnet.ibm.com> wrote: > > On Fri, Apr 04, 2014 at 04:43:27PM -0600, Bjorn Helgaas wrote: > >> On Fri, Mar 7, 2014 at 1:08 PM, Yinghai Lu <yinghai@kernel.org> wrote: > >> > When one of children resources does not support MEM_64, MEM_64 for > >> > bridge get reset, so pull down whole pref resource on the bridge under 4G. > >> > > >> > If the bridge support pref mem 64, will only allocate that with pref mem64 to > >> > children that support it. > >> > For children resources if they only support pref mem 32, will allocate them > >> > from non pref mem instead. > >> > > >> > If the bridge only support 32bit pref mmio, will still have all children pref > >> > mmio under that. > >> > > >> > -v2: Add release bridge res support with bridge mem res for pref_mem children res. > >> > -v3: refresh and make it can be applied early before for_each_dev_res patchset. > >> > -v4: fix non-pref mmio 64bit support found by Guo Chao. > >> > -v7: repost as ibm's powerpc system work again when > >> > 1. petiboot boot kernel have this patch. > >> > 2. or mellanox driver added new .shutdown member. > >> > discussion could be found at: > >> > http://marc.info/?t=138968064500001&r=1&w=2 > >> > >> I think this fixes some sort of bug, and I suppose if I spent a few > >> hours decoding the discussion you mention (the 44 message-long > >> "mlx4_core probe error after applying Yinghai's patch" discussion), I > >> might be able to figure out what it is. > >> > > > > The result of 44 message-long discussion is: the Mellanox card's failure is > > due to a bug in its driver instead of this patch. > > > > The patch refines the logic of using prefetchable window, so that 64-bit > > prefetchable BARs have a chance to be really prefetchable. It does not fix > > any bugs, instead, it exposes one. > > OK, then I'm back to square one, and I need an explanation of why we > want to merge this patch. > > I think the patch conserves 32-bit address space by putting more > things above 4G than we used to. But this comes at the cost of using > non-prefetchable windows in some cases where we used to use > prefetchable ones. > > Somebody has to explain why we want to do this and why the benefit of > conserving the 32-bit space is more important losing the > prefetchability. I want to make sure we are talking about this patch: PCI: Try best to allocate pref mmio 64bit above 4g http://patchwork.ozlabs.org/patch/328067/ instead of this one: PCI: Try to allocate mem64 above 4G at first http://patchwork.ozlabs.org/patch/303197/ This patch does not intend to conserve 32-bit space at all. It makes better use of prefetchable window. The problem with the old code is: when both 32-bit and 64-bit prefetchable BARs present, it's in favor of 32-bit one to use prefetchable window. This window is then not supposed to get 4G-above address, and this 32-bit-address-only property would propagates upwards until reaching root bridge. In later assignment phase, 4G-above space is never touched. This is just caused by a single 32-bit prefetchable BAR (say a ROM BAR). This patch helps by making better decision: * Keep the old behaviour if only 32-bit or 64-bit prefetchable BARs present * If both of them present, put 64-bit ones to prefetchable window, 32-bit ones to non-prefetchable window So 4G-above space can be properly used. Why does enabling 64-bit space matter? * 32-bit space is just too small. If larger-than-4G BAR sounds unrealistic (we do have such devices), there is still a chance that total MMIO size under a domain is too large, especially when SR-IOV enabled (we met this situation). * On PowerNV platform, we can do better platform configuration for 64-bit prefetchable space, that's important for enabling SR-IOV on this platform. Thanks, Guo Chao > > Bjorn > -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Apr 9, 2014 at 1:52 AM, Guo Chao <yan@linux.vnet.ibm.com> wrote: >> OK, then I'm back to square one, and I need an explanation of why we >> want to merge this patch. >> >> I think the patch conserves 32-bit address space by putting more >> things above 4G than we used to. But this comes at the cost of using >> non-prefetchable windows in some cases where we used to use >> prefetchable ones. >> >> Somebody has to explain why we want to do this and why the benefit of >> conserving the 32-bit space is more important losing the >> prefetchability. > > I want to make sure we are talking about this patch: > PCI: Try best to allocate pref mmio 64bit above 4g > http://patchwork.ozlabs.org/patch/328067/ > > instead of this one: > PCI: Try to allocate mem64 above 4G at first > http://patchwork.ozlabs.org/patch/303197/ Yes, I'm talking about http://patchwork.ozlabs.org/patch/328067; if you look at patchwork, the second one (303197) is marked "superseded." > This patch does not intend to conserve 32-bit space at all. It makes > better use of prefetchable window. > > The problem with the old code is: when both 32-bit and 64-bit prefetchable > BARs present, it's in favor of 32-bit one to use prefetchable window. > This window is then not supposed to get 4G-above address, and this > 32-bit-address-only property would propagates upwards until reaching root > bridge. In later assignment phase, 4G-above space is never touched. This > is just caused by a single 32-bit prefetchable BAR (say a ROM BAR). > > This patch helps by making better decision: > > * Keep the old behaviour if only 32-bit or 64-bit prefetchable > BARs present > > * If both of them present, put 64-bit ones to prefetchable window, > 32-bit ones to non-prefetchable window I thought the patch was supposed to change the way we allocate bridge windows. But you are talking about the way we assign 32- and 64-bit device resources, i.e., changing the way we decide whether to put them in prefetchable or non-prefetchable windows. > So 4G-above space can be properly used. > > Why does enabling 64-bit space matter? > > * 32-bit space is just too small. If larger-than-4G BAR sounds > unrealistic (we do have such devices), there is still a chance that > total MMIO size under a domain is too large, especially when SR-IOV > enabled (we met this situation). Please give more details about the problem you saw. A complete dmesg log showing a boot failure or a device that doesn't work, and another log with this patch applied, showing things working as desired, would go a long ways toward clarifying this. This sounds like a specific case that will be fixed by the patch, and if you give more details, maybe I can figure out what's going on. Bjorn -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, 2014-04-10 at 11:26 -0600, Bjorn Helgaas wrote: > > The problem with the old code is: when both 32-bit and 64-bit prefetchable > > BARs present, it's in favor of 32-bit one to use prefetchable window. > > This window is then not supposed to get 4G-above address, and this > > 32-bit-address-only property would propagates upwards until reaching root > > bridge. In later assignment phase, 4G-above space is never touched. This > > is just caused by a single 32-bit prefetchable BAR (say a ROM BAR). > > > > This patch helps by making better decision: > > > > * Keep the old behaviour if only 32-bit or 64-bit prefetchable > > BARs present > > > > * If both of them present, put 64-bit ones to prefetchable window, > > 32-bit ones to non-prefetchable window > > I thought the patch was supposed to change the way we allocate bridge > windows. But you are talking about the way we assign 32- and 64-bit > device resources, i.e., changing the way we decide whether to put them > in prefetchable or non-prefetchable windows. Wait... Are you putting "non-prefetchable" 64-bit BARs in the prefetchable range ? That's bad. Don't do that. Some switches or bridges *will* actively prefetch, and putting a register BAR marked non-prefetchable in the prefetchable range will thus cause nasty bugs if a register with side effects becomes the target of a prefetch operations. We can do that on powerpc *selectively* when and only when we know for sure that no bridge will prefetch on the path to the device. We know our host bridges won't and we *might* (to be verified) know that the PLX switches we have soldered on the mobo won't, but that's the only cases where it is legit. > > So 4G-above space can be properly used. > > > > Why does enabling 64-bit space matter? > > > > * 32-bit space is just too small. If larger-than-4G BAR sounds > > unrealistic (we do have such devices), there is still a chance that > > total MMIO size under a domain is too large, especially when SR-IOV > > enabled (we met this situation). > > Please give more details about the problem you saw. A complete dmesg > log showing a boot failure or a device that doesn't work, and another > log with this patch applied, showing things working as desired, would > go a long ways toward clarifying this. > > This sounds like a specific case that will be fixed by the patch, and > if you give more details, maybe I can figure out what's going on. Well, in our case it's a very complicated story. We have this segmented space where we have 256 segments covering our 32-bit space. We need each "partitionable endpoint" to be in separate sets of segments. PEs are basically iommu domain but also affect MMIO and are our unit of assignment to guests when virtualizing and of error handling when doing EEH (so we freeze access on error by PE). So with something like SR-IOV where each VF has to be a separate PE, the 32-bit space is problematic because we want the VF BARs to stride over segments so each VF lives in a separate PE, but we run out of segments and the 32-bit space has a fixed segment size that may not match the VF BAR stride. So we want our BARs to shoot into 64-bit space where we have additional windows with each 256 segments that we can use, and additionally, we can resize these to make the segment size match the VF BAR stride. Cheers, Ben. -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Apr 10, 2014 at 11:26:14AM -0600, Bjorn Helgaas wrote: > On Wed, Apr 9, 2014 at 1:52 AM, Guo Chao <yan@linux.vnet.ibm.com> wrote: > > So 4G-above space can be properly used. > > > > Why does enabling 64-bit space matter? > > > > * 32-bit space is just too small. If larger-than-4G BAR sounds > > unrealistic (we do have such devices), there is still a chance that > > total MMIO size under a domain is too large, especially when SR-IOV > > enabled (we met this situation). > > Please give more details about the problem you saw. A complete dmesg > log showing a boot failure or a device that doesn't work, and another > log with this patch applied, showing things working as desired, would > go a long ways toward clarifying this. > > This sounds like a specific case that will be fixed by the patch, and > if you give more details, maybe I can figure out what's going on. > Let's see an example. | pci 0003:05:00.0: reg 0x10: [mem 0x3d05801000000-0x3d058010fffff 64bit] | pci 0003:05:00.0: reg 0x18: [mem 0x3d05010000000-0x3d05017ffffff 64bit pref] | pci 0003:05:00.0: reg 0x30: [mem 0x00000000-0x000fffff pref] | pci 0003:05:00.0: reg 0x134: [mem 0x3d05018000000-0x3d0501fffffff 64bit pref] This is printed at enumeration phase. This device has a SRIOV BAR with size of 0x7ffffff (128M). That's the size of a signle VF BAR. The device supports 63 VFs so we need near 8G space in total. Apparanlty we need exploit 64-bit space. | PCI host bridge to bus 0003:00 | pci_bus 0003:00: root bus resource [mem 0x3d05800000000-0x3d0587ffeffff] (bus address [0x80000000-0xfffeffff]) | pci_bus 0003:00: root bus resource [mem 0x3d05008000000-0x3d057ffffffff 64bit pref] And we do have a huge (32G) 64-bit prefetchable window supply. We expect everything to work fine, but: | pci 0003:00:00.0: BAR 15: can't assign mem pref (size 0x206000000) | pci 0003:00:00.0: BAR 14: assigned [mem 0x3d05800000000-0x3d05802ffffff] | pci 0003:00:00.0: BAR 13: can't assign io (size 0x4000) It went wrong at the beginning. Note the error message never considers 64-bit or not, but BAR 15 here has it MEM_64 flag cleared. It first tried to find a 32-bit prefetchable window, but we only supply a 64-bit one. So it fall back to (32-bit) non-prefetchable window, but there is no enough room there. At last it went into complicated steps (not show here) of allocating requested resource first, then try best for the optional ones, etc.. Why is BAR 15 (prefetchable) 32 bit instead of 64? Because PCI core favours 32-bit prefetchable BARs and we have some. This is one of them: | pci 0003:05:00.0: reg 0x30: [mem 0x00000000-0x000fffff pref] PCI core decides to let them enjoy the benefition of prefetch. They can't bear the risk of getting 4G-above address, so its parent, its parent's parent, its parent's parent's parent, finally the root bridge (00:00.0) must have their MEM_64 flag of prefetchable resource (BAR 15) clear. In the end nobody is eligible to use the 64-bit (prefetchable) space even we have huge supply ! Note even the resource is small and successfully fall back into 32-bit non-prefetchable window, that's still not OK for us because we need SRIOV resource be at 64-bit prefetchable space to do platform configuration. With Yinghai's patch, when 64-bit prefetchable BARs found, they're more favoured than the 32-bit prefetchable ones (if any), so all upstream bridges' prefetchable windows have their MEM_64 flag reserved and the huge 64-bit prefetchable space will be exploited: | pci 0003:00:00.0: BAR 15: assigned [mem 0x3d05008000000-0x3d0521fffffff 64bit pref] | pci 0003:00:00.0: BAR 14: assigned [mem 0x3d05800000000-0x3d05802ffffff] | pci 0003:00:00.0: BAR 13: can't assign io (size 0x4000) (The IO resource error here is due to we do not provide IO window) Thanks, Guo Chao > Bjorn > -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Apr 15, 2014 at 5:54 AM, Guo Chao <yan@linux.vnet.ibm.com> wrote: > On Thu, Apr 10, 2014 at 11:26:14AM -0600, Bjorn Helgaas wrote: >> On Wed, Apr 9, 2014 at 1:52 AM, Guo Chao <yan@linux.vnet.ibm.com> wrote: >> > So 4G-above space can be properly used. >> > >> > Why does enabling 64-bit space matter? >> > >> > * 32-bit space is just too small. If larger-than-4G BAR sounds >> > unrealistic (we do have such devices), there is still a chance that >> > total MMIO size under a domain is too large, especially when SR-IOV >> > enabled (we met this situation). >> >> Please give more details about the problem you saw. A complete dmesg >> log showing a boot failure or a device that doesn't work, and another >> log with this patch applied, showing things working as desired, would >> go a long ways toward clarifying this. >> >> This sounds like a specific case that will be fixed by the patch, and >> if you give more details, maybe I can figure out what's going on. >> > > Let's see an example. Thanks for the example. Please open a bug report at http://bugzilla.kernel.org and attach the complete dmesg logs before and after Yinghai's patch. Having the complete logs helps me answer questions myself without having to bother you, and it also helps me figure out whether we can improve our logging to make it easier to diagnose problems like this. > | pci 0003:05:00.0: reg 0x10: [mem 0x3d05801000000-0x3d058010fffff 64bit] > | pci 0003:05:00.0: reg 0x18: [mem 0x3d05010000000-0x3d05017ffffff 64bit pref] > | pci 0003:05:00.0: reg 0x30: [mem 0x00000000-0x000fffff pref] > | pci 0003:05:00.0: reg 0x134: [mem 0x3d05018000000-0x3d0501fffffff 64bit pref] > > This is printed at enumeration phase. This device has a SRIOV BAR with > size of 0x7ffffff (128M). That's the size of a signle VF BAR. The device > supports 63 VFs so we need near 8G space in total. Apparanlty we need > exploit 64-bit space. Yes. Do we print a hint anywhere about how many VFs there are? In other words, can you deduce the number "63" from the dmesg, or do you have to figure that out some other way? It'd be nice if that information were somewhere in dmesg. > | PCI host bridge to bus 0003:00 > | pci_bus 0003:00: root bus resource [mem 0x3d05800000000-0x3d0587ffeffff] (bus address [0x80000000-0xfffeffff]) > | pci_bus 0003:00: root bus resource [mem 0x3d05008000000-0x3d057ffffffff 64bit pref] > > And we do have a huge (32G) 64-bit prefetchable window supply. We expect > everything to work fine, but: > > | pci 0003:00:00.0: BAR 15: can't assign mem pref (size 0x206000000) > | pci 0003:00:00.0: BAR 14: assigned [mem 0x3d05800000000-0x3d05802ffffff] > | pci 0003:00:00.0: BAR 13: can't assign io (size 0x4000) > > It went wrong at the beginning. Note the error message never considers > 64-bit or not, but BAR 15 here has it MEM_64 flag cleared. BAR 15 is a bridge window. I think its resource flags should reflect the capability of the *window*, even if we disable the window or we happen to assign addresses that are under 4GB. So I think it's wrong that we clear the MEM_64 flag in pbus_size_mem() and the IO flag in pbus_size_io(). > It first > tried to find a 32-bit prefetchable window, but we only supply a 64-bit one. > So it fall back to (32-bit) non-prefetchable window, but there is no enough > room there. At last it went into complicated steps (not show here) of > allocating requested resource first, then try best for the optional ones, etc.. > > Why is BAR 15 (prefetchable) 32 bit instead of 64? Because PCI core favours > 32-bit prefetchable BARs and we have some. This is one of them: > > | pci 0003:05:00.0: reg 0x30: [mem 0x00000000-0x000fffff pref] > > PCI core decides to let them enjoy the benefition of prefetch. They can't > bear the risk of getting 4G-above address, so its parent, its parent's parent, > its parent's parent's parent, finally the root bridge (00:00.0) must have their > MEM_64 flag of prefetchable resource (BAR 15) clear. It sounds like we're tracking the resource requirements (prefetchability and BAR width) by using the flags on bridge windows. If that's the case, I think it's wrong. We should preserve the bridge window flags, because those express the bridge hardware capabilities, and we should explicitly keep track of what's required by devices below the bridge in some other way. > In the end nobody > is eligible to use the 64-bit (prefetchable) space even we have huge > supply ! > > Note even the resource is small and successfully fall back into 32-bit > non-prefetchable window, that's still not OK for us because we need > SRIOV resource be at 64-bit prefetchable space to do platform > configuration. > > With Yinghai's patch, when 64-bit prefetchable BARs found, they're more > favoured than the 32-bit prefetchable ones (if any), so all upstream bridges' > prefetchable windows have their MEM_64 flag reserved and the huge 64-bit > prefetchable space will be exploited: > > | pci 0003:00:00.0: BAR 15: assigned [mem 0x3d05008000000-0x3d0521fffffff 64bit pref] > | pci 0003:00:00.0: BAR 14: assigned [mem 0x3d05800000000-0x3d05802ffffff] > | pci 0003:00:00.0: BAR 13: can't assign io (size 0x4000) > > (The IO resource error here is due to we do not provide IO window) Yes. The lack of I/O space is just a constraint of the platform. It'd be nice if we printed a more meaningful error message in this case. One really has to be a PCI expert to distinguish this from a real problem that we need to fix. Bjorn -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Apr 15, 2014 at 5:09 PM, Bjorn Helgaas <bhelgaas@google.com> wrote: >> It first >> tried to find a 32-bit prefetchable window, but we only supply a 64-bit one. >> So it fall back to (32-bit) non-prefetchable window, but there is no enough >> room there. At last it went into complicated steps (not show here) of >> allocating requested resource first, then try best for the optional ones, etc.. >> >> Why is BAR 15 (prefetchable) 32 bit instead of 64? Because PCI core favours >> 32-bit prefetchable BARs and we have some. This is one of them: >> >> | pci 0003:05:00.0: reg 0x30: [mem 0x00000000-0x000fffff pref] >> >> PCI core decides to let them enjoy the benefition of prefetch. They can't >> bear the risk of getting 4G-above address, so its parent, its parent's parent, >> its parent's parent's parent, finally the root bridge (00:00.0) must have their >> MEM_64 flag of prefetchable resource (BAR 15) clear. > > It sounds like we're tracking the resource requirements > (prefetchability and BAR width) by using the flags on bridge windows. > If that's the case, I think it's wrong. We should preserve the bridge > window flags, because those express the bridge hardware capabilities, > and we should explicitly keep track of what's required by devices > below the bridge in some other way. if the BAR support 64bit, then resource could be 64bit. if the BAR does not support 64bit, then resource should be 32bit. That should be best guess to decide resource flag. Yinghai -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, 2014-04-15 at 18:09 -0600, Bjorn Helgaas wrote: > > Thanks for the example. Please open a bug report at > http://bugzilla.kernel.org and attach the complete dmesg logs before > and after Yinghai's patch. > > Having the complete logs helps me answer questions myself without > having to bother you, and it also helps me figure out whether we can > improve our logging to make it easier to diagnose problems like this. Unfortunately, for a *little while* longer (hint !) we can't publish a complete log from a Power8 machine, but we should be able to include everything remotely related to PCI. > > | pci 0003:05:00.0: reg 0x10: [mem 0x3d05801000000-0x3d058010fffff 64bit] > > | pci 0003:05:00.0: reg 0x18: [mem 0x3d05010000000-0x3d05017ffffff 64bit pref] > > | pci 0003:05:00.0: reg 0x30: [mem 0x00000000-0x000fffff pref] > > | pci 0003:05:00.0: reg 0x134: [mem 0x3d05018000000-0x3d0501fffffff 64bit pref] > > > > This is printed at enumeration phase. This device has a SRIOV BAR with > > size of 0x7ffffff (128M). That's the size of a signle VF BAR. The device > > supports 63 VFs so we need near 8G space in total. Apparanlty we need > > exploit 64-bit space. > > Yes. Do we print a hint anywhere about how many VFs there are? In > other words, can you deduce the number "63" from the dmesg, or do you > have to figure that out some other way? It'd be nice if that > information were somewhere in dmesg. > > > | PCI host bridge to bus 0003:00 > > | pci_bus 0003:00: root bus resource [mem 0x3d05800000000-0x3d0587ffeffff] (bus address [0x80000000-0xfffeffff]) > > | pci_bus 0003:00: root bus resource [mem 0x3d05008000000-0x3d057ffffffff 64bit pref] > > > > And we do have a huge (32G) 64-bit prefetchable window supply. We expect > > everything to work fine, but: > > > > | pci 0003:00:00.0: BAR 15: can't assign mem pref (size 0x206000000) > > | pci 0003:00:00.0: BAR 14: assigned [mem 0x3d05800000000-0x3d05802ffffff] > > | pci 0003:00:00.0: BAR 13: can't assign io (size 0x4000) > > > > It went wrong at the beginning. Note the error message never considers > > 64-bit or not, but BAR 15 here has it MEM_64 flag cleared. > > BAR 15 is a bridge window. I think its resource flags should reflect > the capability of the *window*, even if we disable the window or we > happen to assign addresses that are under 4GB. So I think it's wrong > that we clear the MEM_64 flag in pbus_size_mem() and the IO flag in > pbus_size_io(). > > > It first > > tried to find a 32-bit prefetchable window, but we only supply a 64-bit one. > > So it fall back to (32-bit) non-prefetchable window, but there is no enough > > room there. At last it went into complicated steps (not show here) of > > allocating requested resource first, then try best for the optional ones, etc.. > > > > Why is BAR 15 (prefetchable) 32 bit instead of 64? Because PCI core favours > > 32-bit prefetchable BARs and we have some. This is one of them: > > > > | pci 0003:05:00.0: reg 0x30: [mem 0x00000000-0x000fffff pref] > > > > PCI core decides to let them enjoy the benefition of prefetch. They can't > > bear the risk of getting 4G-above address, so its parent, its parent's parent, > > its parent's parent's parent, finally the root bridge (00:00.0) must have their > > MEM_64 flag of prefetchable resource (BAR 15) clear. > > It sounds like we're tracking the resource requirements > (prefetchability and BAR width) by using the flags on bridge windows. > If that's the case, I think it's wrong. We should preserve the bridge > window flags, because those express the bridge hardware capabilities, > and we should explicitly keep track of what's required by devices > below the bridge in some other way. > > > In the end nobody > > is eligible to use the 64-bit (prefetchable) space even we have huge > > supply ! > > > > Note even the resource is small and successfully fall back into 32-bit > > non-prefetchable window, that's still not OK for us because we need > > SRIOV resource be at 64-bit prefetchable space to do platform > > configuration. > > > > With Yinghai's patch, when 64-bit prefetchable BARs found, they're more > > favoured than the 32-bit prefetchable ones (if any), so all upstream bridges' > > prefetchable windows have their MEM_64 flag reserved and the huge 64-bit > > prefetchable space will be exploited: > > > > | pci 0003:00:00.0: BAR 15: assigned [mem 0x3d05008000000-0x3d0521fffffff 64bit pref] > > | pci 0003:00:00.0: BAR 14: assigned [mem 0x3d05800000000-0x3d05802ffffff] > > | pci 0003:00:00.0: BAR 13: can't assign io (size 0x4000) > > > > (The IO resource error here is due to we do not provide IO window) > > Yes. The lack of I/O space is just a constraint of the platform. > It'd be nice if we printed a more meaningful error message in this > case. One really has to be a PCI expert to distinguish this from a > real problem that we need to fix. > > Bjorn -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 2014-04-16 at 16:30 +1000, Benjamin Herrenschmidt wrote: > On Tue, 2014-04-15 at 18:09 -0600, Bjorn Helgaas wrote: > > > > Thanks for the example. Please open a bug report at > > http://bugzilla.kernel.org and attach the complete dmesg logs before > > and after Yinghai's patch. > > > > Having the complete logs helps me answer questions myself without > > having to bother you, and it also helps me figure out whether we can > > improve our logging to make it easier to diagnose problems like this. > > Unfortunately, for a *little while* longer (hint !) we can't publish > a complete log from a Power8 machine, but we should be able to include > everything remotely related to PCI. Hrm, actually, I'm not sure we can disclose the device details just yet, so let's just hide the vendor/device ID to make the lawyers happy and you can print all the rest. Cheers, Ben. -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Apr 15, 2014 at 10:16 PM, Yinghai Lu <yinghai@kernel.org> wrote: > On Tue, Apr 15, 2014 at 5:09 PM, Bjorn Helgaas <bhelgaas@google.com> wrote: >>> It first >>> tried to find a 32-bit prefetchable window, but we only supply a 64-bit one. >>> So it fall back to (32-bit) non-prefetchable window, but there is no enough >>> room there. At last it went into complicated steps (not show here) of >>> allocating requested resource first, then try best for the optional ones, etc.. >>> >>> Why is BAR 15 (prefetchable) 32 bit instead of 64? Because PCI core favours >>> 32-bit prefetchable BARs and we have some. This is one of them: >>> >>> | pci 0003:05:00.0: reg 0x30: [mem 0x00000000-0x000fffff pref] >>> >>> PCI core decides to let them enjoy the benefition of prefetch. They can't >>> bear the risk of getting 4G-above address, so its parent, its parent's parent, >>> its parent's parent's parent, finally the root bridge (00:00.0) must have their >>> MEM_64 flag of prefetchable resource (BAR 15) clear. >> >> It sounds like we're tracking the resource requirements >> (prefetchability and BAR width) by using the flags on bridge windows. >> If that's the case, I think it's wrong. We should preserve the bridge >> window flags, because those express the bridge hardware capabilities, >> and we should explicitly keep track of what's required by devices >> below the bridge in some other way. > > if the BAR support 64bit, then resource could be 64bit. > if the BAR does not support 64bit, then resource should be 32bit. > > That should be best guess to decide resource flag. We don't need to "decide" the resource flags. The hardware tells us what it can support, and *that* should be reflected in the flags. We do need to decide what addresses to assign and whether to put a prefetchable BAR in a prefetchable or non-prefetchable window. Putting decision results in the resource flags obfuscates the whole process. -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Apr 16, 2014 at 12:33 AM, Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote: > On Wed, 2014-04-16 at 16:30 +1000, Benjamin Herrenschmidt wrote: >> On Tue, 2014-04-15 at 18:09 -0600, Bjorn Helgaas wrote: >> > >> > Thanks for the example. Please open a bug report at >> > http://bugzilla.kernel.org and attach the complete dmesg logs before >> > and after Yinghai's patch. >> > >> > Having the complete logs helps me answer questions myself without >> > having to bother you, and it also helps me figure out whether we can >> > improve our logging to make it easier to diagnose problems like this. >> >> Unfortunately, for a *little while* longer (hint !) we can't publish >> a complete log from a Power8 machine, but we should be able to include >> everything remotely related to PCI. > > Hrm, actually, I'm not sure we can disclose the device details just yet, > so let's just hide the vendor/device ID to make the lawyers happy and > you can print all the rest. Sure, no problem (and it looks like Guo has done just that with https://bugzilla.kernel.org/show_bug.cgi?id=74151). I'm pretty sure this problem is not specific to your secret hardware; it should be reproducible on any hardware with similar apertures and similar BARs. I am a little concerned about the PE implications, but I'm ignoring that for now until we sort out the first-order problem of getting the allocation scheme to work. -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[-cc Gavin (addr is bouncing)] On Fri, Mar 07, 2014 at 12:08:44PM -0800, Yinghai Lu wrote: > When one of children resources does not support MEM_64, MEM_64 for > bridge get reset, so pull down whole pref resource on the bridge under 4G. > > If the bridge support pref mem 64, will only allocate that with pref mem64 to > children that support it. > For children resources if they only support pref mem 32, will allocate them > from non pref mem instead. > > If the bridge only support 32bit pref mmio, will still have all children pref > mmio under that. > > -v2: Add release bridge res support with bridge mem res for pref_mem children res. > -v3: refresh and make it can be applied early before for_each_dev_res patchset. > -v4: fix non-pref mmio 64bit support found by Guo Chao. > -v7: repost as ibm's powerpc system work again when > 1. petiboot boot kernel have this patch. > 2. or mellanox driver added new .shutdown member. > discussion could be found at: > http://marc.info/?t=138968064500001&r=1&w=2 > > Signed-off-by: Yinghai Lu <yinghai@kernel.org> > Tested-by: Guo Chao <yan@linux.vnet.ibm.com> > Tested-by: Wei Yang <weiyang@linux.vnet.ibm.com> > > --- > drivers/pci/setup-bus.c | 138 ++++++++++++++++++++++++++++++++---------------- > drivers/pci/setup-res.c | 20 ++++++ > 2 files changed, 111 insertions(+), 47 deletions(-) > > Index: linux-2.6/drivers/pci/setup-bus.c > =================================================================== > --- linux-2.6.orig/drivers/pci/setup-bus.c > +++ linux-2.6/drivers/pci/setup-bus.c > @@ -713,12 +713,11 @@ static void pci_bridge_check_ranges(stru > bus resource of a given type. Note: we intentionally skip > the bus resources which have already been assigned (that is, > have non-NULL parent resource). */ > -static struct resource *find_free_bus_resource(struct pci_bus *bus, unsigned long type) > +static struct resource *find_free_bus_resource(struct pci_bus *bus, > + unsigned long type_mask, unsigned long type) > { > int i; > struct resource *r; > - unsigned long type_mask = IORESOURCE_IO | IORESOURCE_MEM | > - IORESOURCE_PREFETCH; > > pci_bus_for_each_resource(bus, r, i) { > if (r == &ioport_resource || r == &iomem_resource) > @@ -815,7 +814,8 @@ static void pbus_size_io(struct pci_bus > resource_size_t add_size, struct list_head *realloc_head) > { > struct pci_dev *dev; > - struct resource *b_res = find_free_bus_resource(bus, IORESOURCE_IO); > + struct resource *b_res = find_free_bus_resource(bus, IORESOURCE_IO, > + IORESOURCE_IO); > resource_size_t size = 0, size0 = 0, size1 = 0; > resource_size_t children_add_size = 0; > resource_size_t min_align, align; > @@ -915,15 +915,17 @@ static inline resource_size_t calculate_ > * guarantees that all child resources fit in this size. > */ > static int pbus_size_mem(struct pci_bus *bus, unsigned long mask, > - unsigned long type, resource_size_t min_size, > - resource_size_t add_size, > - struct list_head *realloc_head) > + unsigned long type, unsigned long type2, > + unsigned long type3, > + resource_size_t min_size, resource_size_t add_size, > + struct list_head *realloc_head) This no longer matches the function documentation above, but it doesn't really matter, because this interface, with a mask plus three type parameters, is just way too complicated. > { > struct pci_dev *dev; > resource_size_t min_align, align, size, size0, size1; > resource_size_t aligns[12]; /* Alignments from 1Mb to 2Gb */ > int order, max_order; > - struct resource *b_res = find_free_bus_resource(bus, type); > + struct resource *b_res = find_free_bus_resource(bus, > + mask | IORESOURCE_PREFETCH, type); > unsigned int mem64_mask = 0; > resource_size_t children_add_size = 0; > > @@ -944,7 +946,9 @@ static int pbus_size_mem(struct pci_bus > struct resource *r = &dev->resource[i]; > resource_size_t r_size; > > - if (r->parent || (r->flags & mask) != type) > + if (r->parent || ((r->flags & mask) != type && > + (r->flags & mask) != type2 && > + (r->flags & mask) != type3)) > continue; > r_size = resource_size(r); > #ifdef CONFIG_PCI_IOV > @@ -1117,8 +1121,9 @@ void __ref __pci_bus_size_bridges(struct > struct list_head *realloc_head) > { > struct pci_dev *dev; > - unsigned long mask, prefmask; > + unsigned long mask, prefmask, type2 = 0, type3 = 0; > resource_size_t additional_mem_size = 0, additional_io_size = 0; > + struct resource *b_res; > > list_for_each_entry(dev, &bus->devices, bus_list) { > struct pci_bus *b = dev->subordinate; > @@ -1163,15 +1168,34 @@ void __ref __pci_bus_size_bridges(struct > has already been allocated by arch code, try > non-prefetchable range for both types of PCI memory > resources. */ > + b_res = &bus->self->resource[PCI_BRIDGE_RESOURCES]; > mask = IORESOURCE_MEM; > prefmask = IORESOURCE_MEM | IORESOURCE_PREFETCH; > - if (pbus_size_mem(bus, prefmask, prefmask, > + if (b_res[2].flags & IORESOURCE_MEM_64) { > + prefmask |= IORESOURCE_MEM_64; > + if (pbus_size_mem(bus, prefmask, prefmask, > + prefmask, prefmask, > realloc_head ? 0 : additional_mem_size, > - additional_mem_size, realloc_head)) > - mask = prefmask; /* Success, size non-prefetch only. */ > - else > - additional_mem_size += additional_mem_size; > - pbus_size_mem(bus, mask, IORESOURCE_MEM, > + additional_mem_size, realloc_head)) { > + /* Success, size non-pref64 only. */ > + mask = prefmask; > + type2 = prefmask & ~IORESOURCE_MEM_64; > + type3 = prefmask & ~IORESOURCE_PREFETCH; > + } > + } > + if (!type2) { > + prefmask &= ~IORESOURCE_MEM_64; > + if (pbus_size_mem(bus, prefmask, prefmask, > + prefmask, prefmask, > + realloc_head ? 0 : additional_mem_size, > + additional_mem_size, realloc_head)) { > + /* Success, size non-prefetch only. */ > + mask = prefmask; > + } else > + additional_mem_size += additional_mem_size; > + type2 = type3 = IORESOURCE_MEM; > + } > + pbus_size_mem(bus, mask, IORESOURCE_MEM, type2, type3, > realloc_head ? 0 : additional_mem_size, > additional_mem_size, realloc_head); I spent quite a bit of time trying to understand this code yesterday, but I failed. Comments might help, but more importantly, it needs to be simplified so the code makes sense even without comments. > break; > @@ -1257,42 +1281,66 @@ static void __ref __pci_bridge_assign_re > static void pci_bridge_release_resources(struct pci_bus *bus, > unsigned long type) > { > - int idx; > - bool changed = false; > - struct pci_dev *dev; > + struct pci_dev *dev = bus->self; > struct resource *r; > unsigned long type_mask = IORESOURCE_IO | IORESOURCE_MEM | > - IORESOURCE_PREFETCH; > + IORESOURCE_PREFETCH | IORESOURCE_MEM_64; > + unsigned old_flags = 0; > + struct resource *b_res; > + int idx = 1; > > - dev = bus->self; > - for (idx = PCI_BRIDGE_RESOURCES; idx <= PCI_BRIDGE_RESOURCE_END; > - idx++) { > - r = &dev->resource[idx]; > - if ((r->flags & type_mask) != type) > - continue; > - if (!r->parent) > - continue; > - /* > - * if there are children under that, we should release them > - * all > - */ > - release_child_resources(r); > - if (!release_resource(r)) { > - dev_printk(KERN_DEBUG, &dev->dev, > - "resource %d %pR released\n", idx, r); > - /* keep the old size */ > - r->end = resource_size(r) - 1; > - r->start = 0; > - r->flags = 0; > - changed = true; > - } > - } > + b_res = &dev->resource[PCI_BRIDGE_RESOURCES]; > + > + /* > + * 1. if there is io port assign fail, will release bridge > + * io port. > + * 2. if there is non pref mmio assign fail, release bridge > + * nonpref mmio. > + * 3. if there is 64bit pref mmio assign fail, and bridge pref > + * is 64bit, release bridge pref mmio. > + * 4. if there is pref mmio assign fail, and bridge pref is > + * 32bit mmio, release bridge pref mmio > + * 5. if there is pref mmio assign fail, and bridge pref is not > + * assigned, release bridge nonpref mmio. > + */ > + if (type & IORESOURCE_IO) > + idx = 0; > + else if (!(type & IORESOURCE_PREFETCH)) > + idx = 1; > + else if ((type & IORESOURCE_MEM_64) && > + (b_res[2].flags & IORESOURCE_MEM_64)) > + idx = 2; > + else if (!(b_res[2].flags & IORESOURCE_MEM_64) && > + (b_res[2].flags & IORESOURCE_PREFETCH)) > + idx = 2; > + else > + idx = 1; This is pretty hard to figure out. Is there any way we can just pass in the bridge window identifier directly, e.g., 0 == I/O window, 1 == non-prefetchable MEM window, 2 == prefetchable window? It seems like way up in pci_assign_unassigned_root_bus_resources(), we ought to know which window failed, and we could just pass that in somehow instead of having to derive it here. > + > + r = &b_res[idx]; > + > + if (!r->parent) > + return; > + > + /* > + * if there are children under that, we should release them > + * all > + */ > + release_child_resources(r); > + if (!release_resource(r)) { > + type = old_flags = r->flags & type_mask; > + dev_printk(KERN_DEBUG, &dev->dev, "resource %d %pR released\n", > + PCI_BRIDGE_RESOURCES + idx, r); > + /* keep the old size */ > + r->end = resource_size(r) - 1; > + r->start = 0; > + r->flags = 0; > > - if (changed) { > /* avoiding touch the one without PREF */ > if (type & IORESOURCE_PREFETCH) > type = IORESOURCE_PREFETCH; > __pci_setup_bridge(bus, type); > + /* for next child res under same bridge */ > + r->flags = old_flags; > } > } > > @@ -1471,7 +1519,7 @@ void pci_assign_unassigned_root_bus_reso > LIST_HEAD(fail_head); > struct pci_dev_resource *fail_res; > unsigned long type_mask = IORESOURCE_IO | IORESOURCE_MEM | > - IORESOURCE_PREFETCH; > + IORESOURCE_PREFETCH | IORESOURCE_MEM_64; > int pci_try_num = 1; > enum enable_type enable_local; > > Index: linux-2.6/drivers/pci/setup-res.c > =================================================================== > --- linux-2.6.orig/drivers/pci/setup-res.c > +++ linux-2.6/drivers/pci/setup-res.c > @@ -208,15 +208,31 @@ static int __pci_assign_resource(struct > > /* First, try exact prefetching match.. */ > ret = pci_bus_alloc_resource(bus, res, size, align, min, > - IORESOURCE_PREFETCH, > + IORESOURCE_PREFETCH | IORESOURCE_MEM_64, > pcibios_align_resource, dev); > > - if (ret < 0 && (res->flags & IORESOURCE_PREFETCH)) { > + if (ret < 0 && > + (res->flags & (IORESOURCE_PREFETCH | IORESOURCE_MEM_64)) == > + (IORESOURCE_PREFETCH | IORESOURCE_MEM_64)) { > + /* > + * That failed. > + * > + * Try below 4g pref > + */ > + ret = pci_bus_alloc_resource(bus, res, size, align, min, > + IORESOURCE_PREFETCH, > + pcibios_align_resource, dev); > + } pci_bus_alloc_from_region() already has a mechanism for restricting allocation to certain bus addresses. I don't like this second mechanism of also using IORESOURCE_MEM_64. There's too much implicit state -- I think there's a dependency here on how we manage the IORESOURCE_MEM_64 bit in the bridge window resources. That makes this code hard to understand. > + > + if (ret < 0 && > + (res->flags & (IORESOURCE_PREFETCH | IORESOURCE_MEM_64))) { > /* > * That failed. > * > * But a prefetching area can handle a non-prefetching > * window (it will just not perform as well). > + * > + * Also can put 64bit under 32bit range. (below 4g). > */ > ret = pci_bus_alloc_resource(bus, res, size, align, min, 0, > pcibios_align_resource, dev); -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Apr 16, 2014 at 3:11 PM, Bjorn Helgaas <bhelgaas@google.com> wrote: > > I spent quite a bit of time trying to understand this code yesterday, but > I failed. Comments might help, but more importantly, it needs to be > simplified so the code makes sense even without comments. will add some comments. >> + b_res = &dev->resource[PCI_BRIDGE_RESOURCES]; >> + >> + /* >> + * 1. if there is io port assign fail, will release bridge >> + * io port. >> + * 2. if there is non pref mmio assign fail, release bridge >> + * nonpref mmio. >> + * 3. if there is 64bit pref mmio assign fail, and bridge pref >> + * is 64bit, release bridge pref mmio. >> + * 4. if there is pref mmio assign fail, and bridge pref is >> + * 32bit mmio, release bridge pref mmio >> + * 5. if there is pref mmio assign fail, and bridge pref is not >> + * assigned, release bridge nonpref mmio. >> + */ >> + if (type & IORESOURCE_IO) >> + idx = 0; >> + else if (!(type & IORESOURCE_PREFETCH)) >> + idx = 1; >> + else if ((type & IORESOURCE_MEM_64) && >> + (b_res[2].flags & IORESOURCE_MEM_64)) >> + idx = 2; >> + else if (!(b_res[2].flags & IORESOURCE_MEM_64) && >> + (b_res[2].flags & IORESOURCE_PREFETCH)) >> + idx = 2; >> + else >> + idx = 1; > > This is pretty hard to figure out. Is there any way we can just pass in > the bridge window identifier directly, e.g., 0 == I/O window, 1 == > non-prefetchable MEM window, 2 == prefetchable window? It seems like way > up in pci_assign_unassigned_root_bus_resources(), we ought to know which > window failed, and we could just pass that in somehow instead of having to > derive it here. that is for find out which up-level could be scratched... and do not know exact which one should be fit as twisted mmio-pref under mmio. >> >> - if (ret < 0 && (res->flags & IORESOURCE_PREFETCH)) { >> + if (ret < 0 && >> + (res->flags & (IORESOURCE_PREFETCH | IORESOURCE_MEM_64)) == >> + (IORESOURCE_PREFETCH | IORESOURCE_MEM_64)) { >> + /* >> + * That failed. >> + * >> + * Try below 4g pref >> + */ >> + ret = pci_bus_alloc_resource(bus, res, size, align, min, >> + IORESOURCE_PREFETCH, >> + pcibios_align_resource, dev); >> + } > > pci_bus_alloc_from_region() already has a mechanism for restricting > allocation to certain bus addresses. I don't like this second mechanism of > also using IORESOURCE_MEM_64. There's too much implicit state -- I think > there's a dependency here on how we manage the IORESOURCE_MEM_64 bit in the > bridge window resources. That makes this code hard to understand. > will drop that. Yinghai -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Apr 16, 2014 at 9:20 PM, Yinghai Lu <yinghai@kernel.org> wrote: > On Wed, Apr 16, 2014 at 3:11 PM, Bjorn Helgaas <bhelgaas@google.com> wrote: > >>> >>> - if (ret < 0 && (res->flags & IORESOURCE_PREFETCH)) { >>> + if (ret < 0 && >>> + (res->flags & (IORESOURCE_PREFETCH | IORESOURCE_MEM_64)) == >>> + (IORESOURCE_PREFETCH | IORESOURCE_MEM_64)) { >>> + /* >>> + * That failed. >>> + * >>> + * Try below 4g pref >>> + */ >>> + ret = pci_bus_alloc_resource(bus, res, size, align, min, >>> + IORESOURCE_PREFETCH, >>> + pcibios_align_resource, dev); >>> + } >> >> pci_bus_alloc_from_region() already has a mechanism for restricting >> allocation to certain bus addresses. I don't like this second mechanism of >> also using IORESOURCE_MEM_64. There's too much implicit state -- I think >> there's a dependency here on how we manage the IORESOURCE_MEM_64 bit in the >> bridge window resources. That makes this code hard to understand. >> > > will drop that. After more checking, found out that we still need the change. Reason: We size pref,mmio64 at first, and then put pref without 64 under nonpref. If bridge have pref/mmio64, and mmio32, children have pref/mmio64, and pref/mmio32. then during sizing: children pref/mmio64 is under bridge pref/mmio64, children pref/mmio32 is under bridge mmio32. If during allocation, bridge pref/mmio64 still could get allocation under 4G. then without this change we could put children pref/mmio32 under that. could be ok, we would put children pref/mm64 under bridge mmio32. But in some case size could be different, then could be fail if bridge mmio32 is small then bridge pref/mmio64. Please check latest v9, with comments change. Thanks Yinghai -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Index: linux-2.6/drivers/pci/setup-bus.c =================================================================== --- linux-2.6.orig/drivers/pci/setup-bus.c +++ linux-2.6/drivers/pci/setup-bus.c @@ -713,12 +713,11 @@ static void pci_bridge_check_ranges(stru bus resource of a given type. Note: we intentionally skip the bus resources which have already been assigned (that is, have non-NULL parent resource). */ -static struct resource *find_free_bus_resource(struct pci_bus *bus, unsigned long type) +static struct resource *find_free_bus_resource(struct pci_bus *bus, + unsigned long type_mask, unsigned long type) { int i; struct resource *r; - unsigned long type_mask = IORESOURCE_IO | IORESOURCE_MEM | - IORESOURCE_PREFETCH; pci_bus_for_each_resource(bus, r, i) { if (r == &ioport_resource || r == &iomem_resource) @@ -815,7 +814,8 @@ static void pbus_size_io(struct pci_bus resource_size_t add_size, struct list_head *realloc_head) { struct pci_dev *dev; - struct resource *b_res = find_free_bus_resource(bus, IORESOURCE_IO); + struct resource *b_res = find_free_bus_resource(bus, IORESOURCE_IO, + IORESOURCE_IO); resource_size_t size = 0, size0 = 0, size1 = 0; resource_size_t children_add_size = 0; resource_size_t min_align, align; @@ -915,15 +915,17 @@ static inline resource_size_t calculate_ * guarantees that all child resources fit in this size. */ static int pbus_size_mem(struct pci_bus *bus, unsigned long mask, - unsigned long type, resource_size_t min_size, - resource_size_t add_size, - struct list_head *realloc_head) + unsigned long type, unsigned long type2, + unsigned long type3, + resource_size_t min_size, resource_size_t add_size, + struct list_head *realloc_head) { struct pci_dev *dev; resource_size_t min_align, align, size, size0, size1; resource_size_t aligns[12]; /* Alignments from 1Mb to 2Gb */ int order, max_order; - struct resource *b_res = find_free_bus_resource(bus, type); + struct resource *b_res = find_free_bus_resource(bus, + mask | IORESOURCE_PREFETCH, type); unsigned int mem64_mask = 0; resource_size_t children_add_size = 0; @@ -944,7 +946,9 @@ static int pbus_size_mem(struct pci_bus struct resource *r = &dev->resource[i]; resource_size_t r_size; - if (r->parent || (r->flags & mask) != type) + if (r->parent || ((r->flags & mask) != type && + (r->flags & mask) != type2 && + (r->flags & mask) != type3)) continue; r_size = resource_size(r); #ifdef CONFIG_PCI_IOV @@ -1117,8 +1121,9 @@ void __ref __pci_bus_size_bridges(struct struct list_head *realloc_head) { struct pci_dev *dev; - unsigned long mask, prefmask; + unsigned long mask, prefmask, type2 = 0, type3 = 0; resource_size_t additional_mem_size = 0, additional_io_size = 0; + struct resource *b_res; list_for_each_entry(dev, &bus->devices, bus_list) { struct pci_bus *b = dev->subordinate; @@ -1163,15 +1168,34 @@ void __ref __pci_bus_size_bridges(struct has already been allocated by arch code, try non-prefetchable range for both types of PCI memory resources. */ + b_res = &bus->self->resource[PCI_BRIDGE_RESOURCES]; mask = IORESOURCE_MEM; prefmask = IORESOURCE_MEM | IORESOURCE_PREFETCH; - if (pbus_size_mem(bus, prefmask, prefmask, + if (b_res[2].flags & IORESOURCE_MEM_64) { + prefmask |= IORESOURCE_MEM_64; + if (pbus_size_mem(bus, prefmask, prefmask, + prefmask, prefmask, realloc_head ? 0 : additional_mem_size, - additional_mem_size, realloc_head)) - mask = prefmask; /* Success, size non-prefetch only. */ - else - additional_mem_size += additional_mem_size; - pbus_size_mem(bus, mask, IORESOURCE_MEM, + additional_mem_size, realloc_head)) { + /* Success, size non-pref64 only. */ + mask = prefmask; + type2 = prefmask & ~IORESOURCE_MEM_64; + type3 = prefmask & ~IORESOURCE_PREFETCH; + } + } + if (!type2) { + prefmask &= ~IORESOURCE_MEM_64; + if (pbus_size_mem(bus, prefmask, prefmask, + prefmask, prefmask, + realloc_head ? 0 : additional_mem_size, + additional_mem_size, realloc_head)) { + /* Success, size non-prefetch only. */ + mask = prefmask; + } else + additional_mem_size += additional_mem_size; + type2 = type3 = IORESOURCE_MEM; + } + pbus_size_mem(bus, mask, IORESOURCE_MEM, type2, type3, realloc_head ? 0 : additional_mem_size, additional_mem_size, realloc_head); break; @@ -1257,42 +1281,66 @@ static void __ref __pci_bridge_assign_re static void pci_bridge_release_resources(struct pci_bus *bus, unsigned long type) { - int idx; - bool changed = false; - struct pci_dev *dev; + struct pci_dev *dev = bus->self; struct resource *r; unsigned long type_mask = IORESOURCE_IO | IORESOURCE_MEM | - IORESOURCE_PREFETCH; + IORESOURCE_PREFETCH | IORESOURCE_MEM_64; + unsigned old_flags = 0; + struct resource *b_res; + int idx = 1; - dev = bus->self; - for (idx = PCI_BRIDGE_RESOURCES; idx <= PCI_BRIDGE_RESOURCE_END; - idx++) { - r = &dev->resource[idx]; - if ((r->flags & type_mask) != type) - continue; - if (!r->parent) - continue; - /* - * if there are children under that, we should release them - * all - */ - release_child_resources(r); - if (!release_resource(r)) { - dev_printk(KERN_DEBUG, &dev->dev, - "resource %d %pR released\n", idx, r); - /* keep the old size */ - r->end = resource_size(r) - 1; - r->start = 0; - r->flags = 0; - changed = true; - } - } + b_res = &dev->resource[PCI_BRIDGE_RESOURCES]; + + /* + * 1. if there is io port assign fail, will release bridge + * io port. + * 2. if there is non pref mmio assign fail, release bridge + * nonpref mmio. + * 3. if there is 64bit pref mmio assign fail, and bridge pref + * is 64bit, release bridge pref mmio. + * 4. if there is pref mmio assign fail, and bridge pref is + * 32bit mmio, release bridge pref mmio + * 5. if there is pref mmio assign fail, and bridge pref is not + * assigned, release bridge nonpref mmio. + */ + if (type & IORESOURCE_IO) + idx = 0; + else if (!(type & IORESOURCE_PREFETCH)) + idx = 1; + else if ((type & IORESOURCE_MEM_64) && + (b_res[2].flags & IORESOURCE_MEM_64)) + idx = 2; + else if (!(b_res[2].flags & IORESOURCE_MEM_64) && + (b_res[2].flags & IORESOURCE_PREFETCH)) + idx = 2; + else + idx = 1; + + r = &b_res[idx]; + + if (!r->parent) + return; + + /* + * if there are children under that, we should release them + * all + */ + release_child_resources(r); + if (!release_resource(r)) { + type = old_flags = r->flags & type_mask; + dev_printk(KERN_DEBUG, &dev->dev, "resource %d %pR released\n", + PCI_BRIDGE_RESOURCES + idx, r); + /* keep the old size */ + r->end = resource_size(r) - 1; + r->start = 0; + r->flags = 0; - if (changed) { /* avoiding touch the one without PREF */ if (type & IORESOURCE_PREFETCH) type = IORESOURCE_PREFETCH; __pci_setup_bridge(bus, type); + /* for next child res under same bridge */ + r->flags = old_flags; } } @@ -1471,7 +1519,7 @@ void pci_assign_unassigned_root_bus_reso LIST_HEAD(fail_head); struct pci_dev_resource *fail_res; unsigned long type_mask = IORESOURCE_IO | IORESOURCE_MEM | - IORESOURCE_PREFETCH; + IORESOURCE_PREFETCH | IORESOURCE_MEM_64; int pci_try_num = 1; enum enable_type enable_local; Index: linux-2.6/drivers/pci/setup-res.c =================================================================== --- linux-2.6.orig/drivers/pci/setup-res.c +++ linux-2.6/drivers/pci/setup-res.c @@ -208,15 +208,31 @@ static int __pci_assign_resource(struct /* First, try exact prefetching match.. */ ret = pci_bus_alloc_resource(bus, res, size, align, min, - IORESOURCE_PREFETCH, + IORESOURCE_PREFETCH | IORESOURCE_MEM_64, pcibios_align_resource, dev); - if (ret < 0 && (res->flags & IORESOURCE_PREFETCH)) { + if (ret < 0 && + (res->flags & (IORESOURCE_PREFETCH | IORESOURCE_MEM_64)) == + (IORESOURCE_PREFETCH | IORESOURCE_MEM_64)) { + /* + * That failed. + * + * Try below 4g pref + */ + ret = pci_bus_alloc_resource(bus, res, size, align, min, + IORESOURCE_PREFETCH, + pcibios_align_resource, dev); + } + + if (ret < 0 && + (res->flags & (IORESOURCE_PREFETCH | IORESOURCE_MEM_64))) { /* * That failed. * * But a prefetching area can handle a non-prefetching * window (it will just not perform as well). + * + * Also can put 64bit under 32bit range. (below 4g). */ ret = pci_bus_alloc_resource(bus, res, size, align, min, 0, pcibios_align_resource, dev);