diff mbox

[v7] PCI: Try best to allocate pref mmio 64bit above 4g

Message ID 1394222924-28886-1-git-send-email-yinghai@kernel.org (mailing list archive)
State New, archived
Delegated to: Bjorn Helgaas
Headers show

Commit Message

Yinghai Lu March 7, 2014, 8:08 p.m. UTC
When one of children resources does not support MEM_64, MEM_64 for
bridge get reset, so pull down whole pref resource on the bridge under 4G.

If the bridge support pref mem 64, will only allocate that with pref mem64 to
children that support it.
For children resources if they only support pref mem 32, will allocate them
from non pref mem instead.

If the bridge only support 32bit pref mmio, will still have all children pref
mmio under that.

-v2: Add release bridge res support with bridge mem res for pref_mem children res.
-v3: refresh and make it can be applied early before for_each_dev_res patchset.
-v4: fix non-pref mmio 64bit support found by Guo Chao.
-v7: repost as ibm's powerpc system work again when
    1. petiboot boot kernel have this patch.
    2. or mellanox driver added new .shutdown member.
    discussion could be found at:
	 http://marc.info/?t=138968064500001&r=1&w=2

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Tested-by: Guo Chao <yan@linux.vnet.ibm.com>
Tested-by: Wei Yang <weiyang@linux.vnet.ibm.com>

---
 drivers/pci/setup-bus.c |  138 ++++++++++++++++++++++++++++++++----------------
 drivers/pci/setup-res.c |   20 ++++++
 2 files changed, 111 insertions(+), 47 deletions(-)

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Bjorn Helgaas April 4, 2014, 10:43 p.m. UTC | #1
On Fri, Mar 7, 2014 at 1:08 PM, Yinghai Lu <yinghai@kernel.org> wrote:
> When one of children resources does not support MEM_64, MEM_64 for
> bridge get reset, so pull down whole pref resource on the bridge under 4G.
>
> If the bridge support pref mem 64, will only allocate that with pref mem64 to
> children that support it.
> For children resources if they only support pref mem 32, will allocate them
> from non pref mem instead.
>
> If the bridge only support 32bit pref mmio, will still have all children pref
> mmio under that.
>
> -v2: Add release bridge res support with bridge mem res for pref_mem children res.
> -v3: refresh and make it can be applied early before for_each_dev_res patchset.
> -v4: fix non-pref mmio 64bit support found by Guo Chao.
> -v7: repost as ibm's powerpc system work again when
>     1. petiboot boot kernel have this patch.
>     2. or mellanox driver added new .shutdown member.
>     discussion could be found at:
>          http://marc.info/?t=138968064500001&r=1&w=2

I think this fixes some sort of bug, and I suppose if I spent a few
hours decoding the discussion you mention (the 44 message-long
"mlx4_core probe error after applying Yinghai's patch" discussion), I
might be able to figure out what it is.

But maybe somebody who already knows what the bug is can summarize it
for me, and maybe even open a kernel.org bugzilla with a dmesg log as
an example.

I do intend to use this message [1] as a source to help construct a
changelog, but I'd like to also describe a specific problem that is
solved by this patch.  Ideally this would already be in the changelog,
but it's not, so any help in figuring it out would be appreciated.

[1] http://lkml.kernel.org/r/CAErSpo5VWREGptW0MU0wLJUa3h23DXMZPGkdJFMTx5WAGL8J6Q@mail.gmail.com

> Signed-off-by: Yinghai Lu <yinghai@kernel.org>
> Tested-by: Guo Chao <yan@linux.vnet.ibm.com>
> Tested-by: Wei Yang <weiyang@linux.vnet.ibm.com>
>
> ---
>  drivers/pci/setup-bus.c |  138 ++++++++++++++++++++++++++++++++----------------
>  drivers/pci/setup-res.c |   20 ++++++
>  2 files changed, 111 insertions(+), 47 deletions(-)
>
> Index: linux-2.6/drivers/pci/setup-bus.c
> ===================================================================
> --- linux-2.6.orig/drivers/pci/setup-bus.c
> +++ linux-2.6/drivers/pci/setup-bus.c
> @@ -713,12 +713,11 @@ static void pci_bridge_check_ranges(stru
>     bus resource of a given type. Note: we intentionally skip
>     the bus resources which have already been assigned (that is,
>     have non-NULL parent resource). */
> -static struct resource *find_free_bus_resource(struct pci_bus *bus, unsigned long type)
> +static struct resource *find_free_bus_resource(struct pci_bus *bus,
> +                        unsigned long type_mask, unsigned long type)
>  {
>         int i;
>         struct resource *r;
> -       unsigned long type_mask = IORESOURCE_IO | IORESOURCE_MEM |
> -                                 IORESOURCE_PREFETCH;
>
>         pci_bus_for_each_resource(bus, r, i) {
>                 if (r == &ioport_resource || r == &iomem_resource)
> @@ -815,7 +814,8 @@ static void pbus_size_io(struct pci_bus
>                 resource_size_t add_size, struct list_head *realloc_head)
>  {
>         struct pci_dev *dev;
> -       struct resource *b_res = find_free_bus_resource(bus, IORESOURCE_IO);
> +       struct resource *b_res = find_free_bus_resource(bus, IORESOURCE_IO,
> +                                                       IORESOURCE_IO);
>         resource_size_t size = 0, size0 = 0, size1 = 0;
>         resource_size_t children_add_size = 0;
>         resource_size_t min_align, align;
> @@ -915,15 +915,17 @@ static inline resource_size_t calculate_
>   * guarantees that all child resources fit in this size.
>   */
>  static int pbus_size_mem(struct pci_bus *bus, unsigned long mask,
> -                        unsigned long type, resource_size_t min_size,
> -                       resource_size_t add_size,
> -                       struct list_head *realloc_head)
> +                        unsigned long type, unsigned long type2,
> +                        unsigned long type3,
> +                        resource_size_t min_size, resource_size_t add_size,
> +                        struct list_head *realloc_head)
>  {
>         struct pci_dev *dev;
>         resource_size_t min_align, align, size, size0, size1;
>         resource_size_t aligns[12];     /* Alignments from 1Mb to 2Gb */
>         int order, max_order;
> -       struct resource *b_res = find_free_bus_resource(bus, type);
> +       struct resource *b_res = find_free_bus_resource(bus,
> +                                        mask | IORESOURCE_PREFETCH, type);
>         unsigned int mem64_mask = 0;
>         resource_size_t children_add_size = 0;
>
> @@ -944,7 +946,9 @@ static int pbus_size_mem(struct pci_bus
>                         struct resource *r = &dev->resource[i];
>                         resource_size_t r_size;
>
> -                       if (r->parent || (r->flags & mask) != type)
> +                       if (r->parent || ((r->flags & mask) != type &&
> +                                         (r->flags & mask) != type2 &&
> +                                         (r->flags & mask) != type3))
>                                 continue;
>                         r_size = resource_size(r);
>  #ifdef CONFIG_PCI_IOV
> @@ -1117,8 +1121,9 @@ void __ref __pci_bus_size_bridges(struct
>                         struct list_head *realloc_head)
>  {
>         struct pci_dev *dev;
> -       unsigned long mask, prefmask;
> +       unsigned long mask, prefmask, type2 = 0, type3 = 0;
>         resource_size_t additional_mem_size = 0, additional_io_size = 0;
> +       struct resource *b_res;
>
>         list_for_each_entry(dev, &bus->devices, bus_list) {
>                 struct pci_bus *b = dev->subordinate;
> @@ -1163,15 +1168,34 @@ void __ref __pci_bus_size_bridges(struct
>                    has already been allocated by arch code, try
>                    non-prefetchable range for both types of PCI memory
>                    resources. */
> +               b_res = &bus->self->resource[PCI_BRIDGE_RESOURCES];
>                 mask = IORESOURCE_MEM;
>                 prefmask = IORESOURCE_MEM | IORESOURCE_PREFETCH;
> -               if (pbus_size_mem(bus, prefmask, prefmask,
> +               if (b_res[2].flags & IORESOURCE_MEM_64) {
> +                       prefmask |= IORESOURCE_MEM_64;
> +                       if (pbus_size_mem(bus, prefmask, prefmask,
> +                                 prefmask, prefmask,
>                                   realloc_head ? 0 : additional_mem_size,
> -                                 additional_mem_size, realloc_head))
> -                       mask = prefmask; /* Success, size non-prefetch only. */
> -               else
> -                       additional_mem_size += additional_mem_size;
> -               pbus_size_mem(bus, mask, IORESOURCE_MEM,
> +                                 additional_mem_size, realloc_head)) {
> +                                       /* Success, size non-pref64 only. */
> +                                       mask = prefmask;
> +                                       type2 = prefmask & ~IORESOURCE_MEM_64;
> +                                       type3 = prefmask & ~IORESOURCE_PREFETCH;
> +                       }
> +               }
> +               if (!type2) {
> +                       prefmask &= ~IORESOURCE_MEM_64;
> +                       if (pbus_size_mem(bus, prefmask, prefmask,
> +                                        prefmask, prefmask,
> +                                        realloc_head ? 0 : additional_mem_size,
> +                                        additional_mem_size, realloc_head)) {
> +                               /* Success, size non-prefetch only. */
> +                               mask = prefmask;
> +                       } else
> +                               additional_mem_size += additional_mem_size;
> +                       type2 = type3 = IORESOURCE_MEM;
> +               }
> +               pbus_size_mem(bus, mask, IORESOURCE_MEM, type2, type3,
>                                 realloc_head ? 0 : additional_mem_size,
>                                 additional_mem_size, realloc_head);
>                 break;
> @@ -1257,42 +1281,66 @@ static void __ref __pci_bridge_assign_re
>  static void pci_bridge_release_resources(struct pci_bus *bus,
>                                           unsigned long type)
>  {
> -       int idx;
> -       bool changed = false;
> -       struct pci_dev *dev;
> +       struct pci_dev *dev = bus->self;
>         struct resource *r;
>         unsigned long type_mask = IORESOURCE_IO | IORESOURCE_MEM |
> -                                 IORESOURCE_PREFETCH;
> +                                 IORESOURCE_PREFETCH | IORESOURCE_MEM_64;
> +       unsigned old_flags = 0;
> +       struct resource *b_res;
> +       int idx = 1;
>
> -       dev = bus->self;
> -       for (idx = PCI_BRIDGE_RESOURCES; idx <= PCI_BRIDGE_RESOURCE_END;
> -            idx++) {
> -               r = &dev->resource[idx];
> -               if ((r->flags & type_mask) != type)
> -                       continue;
> -               if (!r->parent)
> -                       continue;
> -               /*
> -                * if there are children under that, we should release them
> -                *  all
> -                */
> -               release_child_resources(r);
> -               if (!release_resource(r)) {
> -                       dev_printk(KERN_DEBUG, &dev->dev,
> -                                "resource %d %pR released\n", idx, r);
> -                       /* keep the old size */
> -                       r->end = resource_size(r) - 1;
> -                       r->start = 0;
> -                       r->flags = 0;
> -                       changed = true;
> -               }
> -       }
> +       b_res = &dev->resource[PCI_BRIDGE_RESOURCES];
> +
> +       /*
> +        *     1. if there is io port assign fail, will release bridge
> +        *        io port.
> +        *     2. if there is non pref mmio assign fail, release bridge
> +        *        nonpref mmio.
> +        *     3. if there is 64bit pref mmio assign fail, and bridge pref
> +        *        is 64bit, release bridge pref mmio.
> +        *     4. if there is pref mmio assign fail, and bridge pref is
> +        *        32bit mmio, release bridge pref mmio
> +        *     5. if there is pref mmio assign fail, and bridge pref is not
> +        *        assigned, release bridge nonpref mmio.
> +        */
> +       if (type & IORESOURCE_IO)
> +               idx = 0;
> +       else if (!(type & IORESOURCE_PREFETCH))
> +               idx = 1;
> +       else if ((type & IORESOURCE_MEM_64) &&
> +                (b_res[2].flags & IORESOURCE_MEM_64))
> +               idx = 2;
> +       else if (!(b_res[2].flags & IORESOURCE_MEM_64) &&
> +                (b_res[2].flags & IORESOURCE_PREFETCH))
> +               idx = 2;
> +       else
> +               idx = 1;
> +
> +       r = &b_res[idx];
> +
> +       if (!r->parent)
> +               return;
> +
> +       /*
> +        * if there are children under that, we should release them
> +        *  all
> +        */
> +       release_child_resources(r);
> +       if (!release_resource(r)) {
> +               type = old_flags = r->flags & type_mask;
> +               dev_printk(KERN_DEBUG, &dev->dev, "resource %d %pR released\n",
> +                                       PCI_BRIDGE_RESOURCES + idx, r);
> +               /* keep the old size */
> +               r->end = resource_size(r) - 1;
> +               r->start = 0;
> +               r->flags = 0;
>
> -       if (changed) {
>                 /* avoiding touch the one without PREF */
>                 if (type & IORESOURCE_PREFETCH)
>                         type = IORESOURCE_PREFETCH;
>                 __pci_setup_bridge(bus, type);
> +               /* for next child res under same bridge */
> +               r->flags = old_flags;
>         }
>  }
>
> @@ -1471,7 +1519,7 @@ void pci_assign_unassigned_root_bus_reso
>         LIST_HEAD(fail_head);
>         struct pci_dev_resource *fail_res;
>         unsigned long type_mask = IORESOURCE_IO | IORESOURCE_MEM |
> -                                 IORESOURCE_PREFETCH;
> +                                 IORESOURCE_PREFETCH | IORESOURCE_MEM_64;
>         int pci_try_num = 1;
>         enum enable_type enable_local;
>
> Index: linux-2.6/drivers/pci/setup-res.c
> ===================================================================
> --- linux-2.6.orig/drivers/pci/setup-res.c
> +++ linux-2.6/drivers/pci/setup-res.c
> @@ -208,15 +208,31 @@ static int __pci_assign_resource(struct
>
>         /* First, try exact prefetching match.. */
>         ret = pci_bus_alloc_resource(bus, res, size, align, min,
> -                                    IORESOURCE_PREFETCH,
> +                                    IORESOURCE_PREFETCH | IORESOURCE_MEM_64,
>                                      pcibios_align_resource, dev);
>
> -       if (ret < 0 && (res->flags & IORESOURCE_PREFETCH)) {
> +       if (ret < 0 &&
> +           (res->flags & (IORESOURCE_PREFETCH | IORESOURCE_MEM_64)) ==
> +            (IORESOURCE_PREFETCH | IORESOURCE_MEM_64)) {
> +               /*
> +                * That failed.
> +                *
> +                * Try below 4g pref
> +                */
> +               ret = pci_bus_alloc_resource(bus, res, size, align, min,
> +                                            IORESOURCE_PREFETCH,
> +                                            pcibios_align_resource, dev);
> +       }
> +
> +       if (ret < 0 &&
> +           (res->flags & (IORESOURCE_PREFETCH | IORESOURCE_MEM_64))) {
>                 /*
>                  * That failed.
>                  *
>                  * But a prefetching area can handle a non-prefetching
>                  * window (it will just not perform as well).
> +                *
> +                * Also can put 64bit under 32bit range. (below 4g).
>                  */
>                 ret = pci_bus_alloc_resource(bus, res, size, align, min, 0,
>                                              pcibios_align_resource, dev);
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Guo Chao April 8, 2014, 2:57 a.m. UTC | #2
On Fri, Apr 04, 2014 at 04:43:27PM -0600, Bjorn Helgaas wrote:
> On Fri, Mar 7, 2014 at 1:08 PM, Yinghai Lu <yinghai@kernel.org> wrote:
> > When one of children resources does not support MEM_64, MEM_64 for
> > bridge get reset, so pull down whole pref resource on the bridge under 4G.
> >
> > If the bridge support pref mem 64, will only allocate that with pref mem64 to
> > children that support it.
> > For children resources if they only support pref mem 32, will allocate them
> > from non pref mem instead.
> >
> > If the bridge only support 32bit pref mmio, will still have all children pref
> > mmio under that.
> >
> > -v2: Add release bridge res support with bridge mem res for pref_mem children res.
> > -v3: refresh and make it can be applied early before for_each_dev_res patchset.
> > -v4: fix non-pref mmio 64bit support found by Guo Chao.
> > -v7: repost as ibm's powerpc system work again when
> >     1. petiboot boot kernel have this patch.
> >     2. or mellanox driver added new .shutdown member.
> >     discussion could be found at:
> >          http://marc.info/?t=138968064500001&r=1&w=2
> 
> I think this fixes some sort of bug, and I suppose if I spent a few
> hours decoding the discussion you mention (the 44 message-long
> "mlx4_core probe error after applying Yinghai's patch" discussion), I
> might be able to figure out what it is.
> 

The result of 44 message-long discussion is: the Mellanox card's failure is
due to a bug in its driver instead of this patch.

The patch refines the logic of using prefetchable window, so that 64-bit
prefetchable BARs have a chance to be really prefetchable. It does not fix
any bugs, instead, it exposes one. 

Thanks,
Guo Chao

> But maybe somebody who already knows what the bug is can summarize it
> for me, and maybe even open a kernel.org bugzilla with a dmesg log as
> an example.
> 
> I do intend to use this message [1] as a source to help construct a
> changelog, but I'd like to also describe a specific problem that is
> solved by this patch.  Ideally this would already be in the changelog,
> but it's not, so any help in figuring it out would be appreciated.
> 
> [1] http://lkml.kernel.org/r/CAErSpo5VWREGptW0MU0wLJUa3h23DXMZPGkdJFMTx5WAGL8J6Q@mail.gmail.com
> 
> > Signed-off-by: Yinghai Lu <yinghai@kernel.org>
> > Tested-by: Guo Chao <yan@linux.vnet.ibm.com>
> > Tested-by: Wei Yang <weiyang@linux.vnet.ibm.com>
> >
> > ---
> >  drivers/pci/setup-bus.c |  138 ++++++++++++++++++++++++++++++++----------------
> >  drivers/pci/setup-res.c |   20 ++++++
> >  2 files changed, 111 insertions(+), 47 deletions(-)
> >
> > Index: linux-2.6/drivers/pci/setup-bus.c
> > ===================================================================
> > --- linux-2.6.orig/drivers/pci/setup-bus.c
> > +++ linux-2.6/drivers/pci/setup-bus.c
> > @@ -713,12 +713,11 @@ static void pci_bridge_check_ranges(stru
> >     bus resource of a given type. Note: we intentionally skip
> >     the bus resources which have already been assigned (that is,
> >     have non-NULL parent resource). */
> > -static struct resource *find_free_bus_resource(struct pci_bus *bus, unsigned long type)
> > +static struct resource *find_free_bus_resource(struct pci_bus *bus,
> > +                        unsigned long type_mask, unsigned long type)
> >  {
> >         int i;
> >         struct resource *r;
> > -       unsigned long type_mask = IORESOURCE_IO | IORESOURCE_MEM |
> > -                                 IORESOURCE_PREFETCH;
> >
> >         pci_bus_for_each_resource(bus, r, i) {
> >                 if (r == &ioport_resource || r == &iomem_resource)
> > @@ -815,7 +814,8 @@ static void pbus_size_io(struct pci_bus
> >                 resource_size_t add_size, struct list_head *realloc_head)
> >  {
> >         struct pci_dev *dev;
> > -       struct resource *b_res = find_free_bus_resource(bus, IORESOURCE_IO);
> > +       struct resource *b_res = find_free_bus_resource(bus, IORESOURCE_IO,
> > +                                                       IORESOURCE_IO);
> >         resource_size_t size = 0, size0 = 0, size1 = 0;
> >         resource_size_t children_add_size = 0;
> >         resource_size_t min_align, align;
> > @@ -915,15 +915,17 @@ static inline resource_size_t calculate_
> >   * guarantees that all child resources fit in this size.
> >   */
> >  static int pbus_size_mem(struct pci_bus *bus, unsigned long mask,
> > -                        unsigned long type, resource_size_t min_size,
> > -                       resource_size_t add_size,
> > -                       struct list_head *realloc_head)
> > +                        unsigned long type, unsigned long type2,
> > +                        unsigned long type3,
> > +                        resource_size_t min_size, resource_size_t add_size,
> > +                        struct list_head *realloc_head)
> >  {
> >         struct pci_dev *dev;
> >         resource_size_t min_align, align, size, size0, size1;
> >         resource_size_t aligns[12];     /* Alignments from 1Mb to 2Gb */
> >         int order, max_order;
> > -       struct resource *b_res = find_free_bus_resource(bus, type);
> > +       struct resource *b_res = find_free_bus_resource(bus,
> > +                                        mask | IORESOURCE_PREFETCH, type);
> >         unsigned int mem64_mask = 0;
> >         resource_size_t children_add_size = 0;
> >
> > @@ -944,7 +946,9 @@ static int pbus_size_mem(struct pci_bus
> >                         struct resource *r = &dev->resource[i];
> >                         resource_size_t r_size;
> >
> > -                       if (r->parent || (r->flags & mask) != type)
> > +                       if (r->parent || ((r->flags & mask) != type &&
> > +                                         (r->flags & mask) != type2 &&
> > +                                         (r->flags & mask) != type3))
> >                                 continue;
> >                         r_size = resource_size(r);
> >  #ifdef CONFIG_PCI_IOV
> > @@ -1117,8 +1121,9 @@ void __ref __pci_bus_size_bridges(struct
> >                         struct list_head *realloc_head)
> >  {
> >         struct pci_dev *dev;
> > -       unsigned long mask, prefmask;
> > +       unsigned long mask, prefmask, type2 = 0, type3 = 0;
> >         resource_size_t additional_mem_size = 0, additional_io_size = 0;
> > +       struct resource *b_res;
> >
> >         list_for_each_entry(dev, &bus->devices, bus_list) {
> >                 struct pci_bus *b = dev->subordinate;
> > @@ -1163,15 +1168,34 @@ void __ref __pci_bus_size_bridges(struct
> >                    has already been allocated by arch code, try
> >                    non-prefetchable range for both types of PCI memory
> >                    resources. */
> > +               b_res = &bus->self->resource[PCI_BRIDGE_RESOURCES];
> >                 mask = IORESOURCE_MEM;
> >                 prefmask = IORESOURCE_MEM | IORESOURCE_PREFETCH;
> > -               if (pbus_size_mem(bus, prefmask, prefmask,
> > +               if (b_res[2].flags & IORESOURCE_MEM_64) {
> > +                       prefmask |= IORESOURCE_MEM_64;
> > +                       if (pbus_size_mem(bus, prefmask, prefmask,
> > +                                 prefmask, prefmask,
> >                                   realloc_head ? 0 : additional_mem_size,
> > -                                 additional_mem_size, realloc_head))
> > -                       mask = prefmask; /* Success, size non-prefetch only. */
> > -               else
> > -                       additional_mem_size += additional_mem_size;
> > -               pbus_size_mem(bus, mask, IORESOURCE_MEM,
> > +                                 additional_mem_size, realloc_head)) {
> > +                                       /* Success, size non-pref64 only. */
> > +                                       mask = prefmask;
> > +                                       type2 = prefmask & ~IORESOURCE_MEM_64;
> > +                                       type3 = prefmask & ~IORESOURCE_PREFETCH;
> > +                       }
> > +               }
> > +               if (!type2) {
> > +                       prefmask &= ~IORESOURCE_MEM_64;
> > +                       if (pbus_size_mem(bus, prefmask, prefmask,
> > +                                        prefmask, prefmask,
> > +                                        realloc_head ? 0 : additional_mem_size,
> > +                                        additional_mem_size, realloc_head)) {
> > +                               /* Success, size non-prefetch only. */
> > +                               mask = prefmask;
> > +                       } else
> > +                               additional_mem_size += additional_mem_size;
> > +                       type2 = type3 = IORESOURCE_MEM;
> > +               }
> > +               pbus_size_mem(bus, mask, IORESOURCE_MEM, type2, type3,
> >                                 realloc_head ? 0 : additional_mem_size,
> >                                 additional_mem_size, realloc_head);
> >                 break;
> > @@ -1257,42 +1281,66 @@ static void __ref __pci_bridge_assign_re
> >  static void pci_bridge_release_resources(struct pci_bus *bus,
> >                                           unsigned long type)
> >  {
> > -       int idx;
> > -       bool changed = false;
> > -       struct pci_dev *dev;
> > +       struct pci_dev *dev = bus->self;
> >         struct resource *r;
> >         unsigned long type_mask = IORESOURCE_IO | IORESOURCE_MEM |
> > -                                 IORESOURCE_PREFETCH;
> > +                                 IORESOURCE_PREFETCH | IORESOURCE_MEM_64;
> > +       unsigned old_flags = 0;
> > +       struct resource *b_res;
> > +       int idx = 1;
> >
> > -       dev = bus->self;
> > -       for (idx = PCI_BRIDGE_RESOURCES; idx <= PCI_BRIDGE_RESOURCE_END;
> > -            idx++) {
> > -               r = &dev->resource[idx];
> > -               if ((r->flags & type_mask) != type)
> > -                       continue;
> > -               if (!r->parent)
> > -                       continue;
> > -               /*
> > -                * if there are children under that, we should release them
> > -                *  all
> > -                */
> > -               release_child_resources(r);
> > -               if (!release_resource(r)) {
> > -                       dev_printk(KERN_DEBUG, &dev->dev,
> > -                                "resource %d %pR released\n", idx, r);
> > -                       /* keep the old size */
> > -                       r->end = resource_size(r) - 1;
> > -                       r->start = 0;
> > -                       r->flags = 0;
> > -                       changed = true;
> > -               }
> > -       }
> > +       b_res = &dev->resource[PCI_BRIDGE_RESOURCES];
> > +
> > +       /*
> > +        *     1. if there is io port assign fail, will release bridge
> > +        *        io port.
> > +        *     2. if there is non pref mmio assign fail, release bridge
> > +        *        nonpref mmio.
> > +        *     3. if there is 64bit pref mmio assign fail, and bridge pref
> > +        *        is 64bit, release bridge pref mmio.
> > +        *     4. if there is pref mmio assign fail, and bridge pref is
> > +        *        32bit mmio, release bridge pref mmio
> > +        *     5. if there is pref mmio assign fail, and bridge pref is not
> > +        *        assigned, release bridge nonpref mmio.
> > +        */
> > +       if (type & IORESOURCE_IO)
> > +               idx = 0;
> > +       else if (!(type & IORESOURCE_PREFETCH))
> > +               idx = 1;
> > +       else if ((type & IORESOURCE_MEM_64) &&
> > +                (b_res[2].flags & IORESOURCE_MEM_64))
> > +               idx = 2;
> > +       else if (!(b_res[2].flags & IORESOURCE_MEM_64) &&
> > +                (b_res[2].flags & IORESOURCE_PREFETCH))
> > +               idx = 2;
> > +       else
> > +               idx = 1;
> > +
> > +       r = &b_res[idx];
> > +
> > +       if (!r->parent)
> > +               return;
> > +
> > +       /*
> > +        * if there are children under that, we should release them
> > +        *  all
> > +        */
> > +       release_child_resources(r);
> > +       if (!release_resource(r)) {
> > +               type = old_flags = r->flags & type_mask;
> > +               dev_printk(KERN_DEBUG, &dev->dev, "resource %d %pR released\n",
> > +                                       PCI_BRIDGE_RESOURCES + idx, r);
> > +               /* keep the old size */
> > +               r->end = resource_size(r) - 1;
> > +               r->start = 0;
> > +               r->flags = 0;
> >
> > -       if (changed) {
> >                 /* avoiding touch the one without PREF */
> >                 if (type & IORESOURCE_PREFETCH)
> >                         type = IORESOURCE_PREFETCH;
> >                 __pci_setup_bridge(bus, type);
> > +               /* for next child res under same bridge */
> > +               r->flags = old_flags;
> >         }
> >  }
> >
> > @@ -1471,7 +1519,7 @@ void pci_assign_unassigned_root_bus_reso
> >         LIST_HEAD(fail_head);
> >         struct pci_dev_resource *fail_res;
> >         unsigned long type_mask = IORESOURCE_IO | IORESOURCE_MEM |
> > -                                 IORESOURCE_PREFETCH;
> > +                                 IORESOURCE_PREFETCH | IORESOURCE_MEM_64;
> >         int pci_try_num = 1;
> >         enum enable_type enable_local;
> >
> > Index: linux-2.6/drivers/pci/setup-res.c
> > ===================================================================
> > --- linux-2.6.orig/drivers/pci/setup-res.c
> > +++ linux-2.6/drivers/pci/setup-res.c
> > @@ -208,15 +208,31 @@ static int __pci_assign_resource(struct
> >
> >         /* First, try exact prefetching match.. */
> >         ret = pci_bus_alloc_resource(bus, res, size, align, min,
> > -                                    IORESOURCE_PREFETCH,
> > +                                    IORESOURCE_PREFETCH | IORESOURCE_MEM_64,
> >                                      pcibios_align_resource, dev);
> >
> > -       if (ret < 0 && (res->flags & IORESOURCE_PREFETCH)) {
> > +       if (ret < 0 &&
> > +           (res->flags & (IORESOURCE_PREFETCH | IORESOURCE_MEM_64)) ==
> > +            (IORESOURCE_PREFETCH | IORESOURCE_MEM_64)) {
> > +               /*
> > +                * That failed.
> > +                *
> > +                * Try below 4g pref
> > +                */
> > +               ret = pci_bus_alloc_resource(bus, res, size, align, min,
> > +                                            IORESOURCE_PREFETCH,
> > +                                            pcibios_align_resource, dev);
> > +       }
> > +
> > +       if (ret < 0 &&
> > +           (res->flags & (IORESOURCE_PREFETCH | IORESOURCE_MEM_64))) {
> >                 /*
> >                  * That failed.
> >                  *
> >                  * But a prefetching area can handle a non-prefetching
> >                  * window (it will just not perform as well).
> > +                *
> > +                * Also can put 64bit under 32bit range. (below 4g).
> >                  */
> >                 ret = pci_bus_alloc_resource(bus, res, size, align, min, 0,
> >                                              pcibios_align_resource, dev);
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Or Gerlitz April 8, 2014, 7:18 a.m. UTC | #3
On 08/04/2014 05:57, Guo Chao wrote:
> On Fri, Apr 04, 2014 at 04:43:27PM -0600, Bjorn Helgaas wrote:
>> On Fri, Mar 7, 2014 at 1:08 PM, Yinghai Lu <yinghai@kernel.org> wrote:
>>> When one of children resources does not support MEM_64, MEM_64 for
>>> bridge get reset, so pull down whole pref resource on the bridge under 4G.
>>>
>>> If the bridge support pref mem 64, will only allocate that with pref mem64 to
>>> children that support it.
>>> For children resources if they only support pref mem 32, will allocate them
>>> from non pref mem instead.
>>>
>>> If the bridge only support 32bit pref mmio, will still have all children pref
>>> mmio under that.
>>>
>>> -v2: Add release bridge res support with bridge mem res for pref_mem children res.
>>> -v3: refresh and make it can be applied early before for_each_dev_res patchset.
>>> -v4: fix non-pref mmio 64bit support found by Guo Chao.
>>> -v7: repost as ibm's powerpc system work again when
>>>      1. petiboot boot kernel have this patch.
>>>      2. or mellanox driver added new .shutdown member.
>>>      discussion could be found at:
>>>           http://marc.info/?t=138968064500001&r=1&w=2
>> I think this fixes some sort of bug, and I suppose if I spent a few
>> hours decoding the discussion you mention (the 44 message-long
>> "mlx4_core probe error after applying Yinghai's patch" discussion), I
>> might be able to figure out what it is.
>>
> The result of 44 message-long discussion is: the Mellanox card's failure is
> due to a bug in its driver instead of this patch.


So just to make sure we're on the same page -- fixing the bug is carried 
out through netdev, e.g this 3.14-rc8
upstream commit 97a5221 "net/mlx4_core: pass pci_device_id.driver_data 
to __mlx4_init_one during reset"
and now this one more fix which is under discussion 
http://marc.info/?l=linux-pci&m=139675010015646&w=2, right?



>
> The patch refines the logic of using prefetchable window, so that 64-bit
> prefetchable BARs have a chance to be really prefetchable. It does not fix
> any bugs, instead, it exposes one.

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Wei Yang April 8, 2014, 7:41 a.m. UTC | #4
On Tue, Apr 08, 2014 at 10:18:48AM +0300, Or Gerlitz wrote:
>On 08/04/2014 05:57, Guo Chao wrote:
>>On Fri, Apr 04, 2014 at 04:43:27PM -0600, Bjorn Helgaas wrote:
>>>On Fri, Mar 7, 2014 at 1:08 PM, Yinghai Lu <yinghai@kernel.org> wrote:
>>>>When one of children resources does not support MEM_64, MEM_64 for
>>>>bridge get reset, so pull down whole pref resource on the bridge under 4G.
>>>>
>>>>If the bridge support pref mem 64, will only allocate that with pref mem64 to
>>>>children that support it.
>>>>For children resources if they only support pref mem 32, will allocate them
>>>>from non pref mem instead.
>>>>
>>>>If the bridge only support 32bit pref mmio, will still have all children pref
>>>>mmio under that.
>>>>
>>>>-v2: Add release bridge res support with bridge mem res for pref_mem children res.
>>>>-v3: refresh and make it can be applied early before for_each_dev_res patchset.
>>>>-v4: fix non-pref mmio 64bit support found by Guo Chao.
>>>>-v7: repost as ibm's powerpc system work again when
>>>>     1. petiboot boot kernel have this patch.
>>>>     2. or mellanox driver added new .shutdown member.
>>>>     discussion could be found at:
>>>>          http://marc.info/?t=138968064500001&r=1&w=2
>>>I think this fixes some sort of bug, and I suppose if I spent a few
>>>hours decoding the discussion you mention (the 44 message-long
>>>"mlx4_core probe error after applying Yinghai's patch" discussion), I
>>>might be able to figure out what it is.
>>>
>>The result of 44 message-long discussion is: the Mellanox card's failure is
>>due to a bug in its driver instead of this patch.
>
>
>So just to make sure we're on the same page -- fixing the bug is
>carried out through netdev, e.g this 3.14-rc8
>upstream commit 97a5221 "net/mlx4_core: pass
>pci_device_id.driver_data to __mlx4_init_one during reset"
>and now this one more fix which is under discussion
>http://marc.info/?l=linux-pci&m=139675010015646&w=2, right?

No, this one http://marc.info/?l=linux-pci&m=139675010015646&w=2 is another
bug in mlx4 driver, missing a proper pci_dev_data in mlx4_init_one.

The bug discussing in this thread is the driver "forget" to release the
"owner" during kexec. So this device couldn't be usded after kexec.

>
>
>
>>
>>The patch refines the logic of using prefetchable window, so that 64-bit
>>prefetchable BARs have a chance to be really prefetchable. It does not fix
>>any bugs, instead, it exposes one.
Or Gerlitz April 8, 2014, 7:55 a.m. UTC | #5
On 08/04/2014 10:41, Wei Yang wrote:
>> >So just to make sure we're on the same page -- fixing the bug is
>> >carried out through netdev, e.g this 3.14-rc8
>> >upstream commit 97a5221 "net/mlx4_core: pass
>> >pci_device_id.driver_data to __mlx4_init_one during reset"
>> >and now this one more fix which is under discussion
>> >http://marc.info/?l=linux-pci&m=139675010015646&w=2, right?
> No, this onehttp://marc.info/?l=linux-pci&m=139675010015646&w=2  is another
> bug in mlx4 driver, missing a proper pci_dev_data in mlx4_init_one.
>
> The bug discussing in this thread is the driver "forget" to release the
> "owner" during kexec. So this device couldn't be usded after kexec.
>
Can you guys please open a kernel.org bugzilla ticket, or better... work 
on a patch?
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Guo Chao April 8, 2014, 8:22 a.m. UTC | #6
On Tue, Apr 08, 2014 at 10:55:55AM +0300, Or Gerlitz wrote:
> On 08/04/2014 10:41, Wei Yang wrote:
> >>>So just to make sure we're on the same page -- fixing the bug is
> >>>carried out through netdev, e.g this 3.14-rc8
> >>>upstream commit 97a5221 "net/mlx4_core: pass
> >>>pci_device_id.driver_data to __mlx4_init_one during reset"
> >>>and now this one more fix which is under discussion
> >>>http://marc.info/?l=linux-pci&m=139675010015646&w=2, right?
> >No, this onehttp://marc.info/?l=linux-pci&m=139675010015646&w=2  is another
> >bug in mlx4 driver, missing a proper pci_dev_data in mlx4_init_one.
> >
> >The bug discussing in this thread is the driver "forget" to release the
> >"owner" during kexec. So this device couldn't be usded after kexec.
> >
> Can you guys please open a kernel.org bugzilla ticket, or better...
> work on a patch?
> 

It's already fixed in upstream.

commit 367d56f7b4d5ce61e883c64f81786c7a3ae88eea
Author: Gavin Shan <shangw@linux.vnet.ibm.com>
Date:   Tue Mar 4 15:35:20 2014 +0800

    net/mlx4: Support shutdown() interface


Thanks,
Guo Chao

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Bjorn Helgaas April 8, 2014, 3:02 p.m. UTC | #7
On Mon, Apr 7, 2014 at 8:57 PM, Guo Chao <yan@linux.vnet.ibm.com> wrote:
> On Fri, Apr 04, 2014 at 04:43:27PM -0600, Bjorn Helgaas wrote:
>> On Fri, Mar 7, 2014 at 1:08 PM, Yinghai Lu <yinghai@kernel.org> wrote:
>> > When one of children resources does not support MEM_64, MEM_64 for
>> > bridge get reset, so pull down whole pref resource on the bridge under 4G.
>> >
>> > If the bridge support pref mem 64, will only allocate that with pref mem64 to
>> > children that support it.
>> > For children resources if they only support pref mem 32, will allocate them
>> > from non pref mem instead.
>> >
>> > If the bridge only support 32bit pref mmio, will still have all children pref
>> > mmio under that.
>> >
>> > -v2: Add release bridge res support with bridge mem res for pref_mem children res.
>> > -v3: refresh and make it can be applied early before for_each_dev_res patchset.
>> > -v4: fix non-pref mmio 64bit support found by Guo Chao.
>> > -v7: repost as ibm's powerpc system work again when
>> >     1. petiboot boot kernel have this patch.
>> >     2. or mellanox driver added new .shutdown member.
>> >     discussion could be found at:
>> >          http://marc.info/?t=138968064500001&r=1&w=2
>>
>> I think this fixes some sort of bug, and I suppose if I spent a few
>> hours decoding the discussion you mention (the 44 message-long
>> "mlx4_core probe error after applying Yinghai's patch" discussion), I
>> might be able to figure out what it is.
>>
>
> The result of 44 message-long discussion is: the Mellanox card's failure is
> due to a bug in its driver instead of this patch.
>
> The patch refines the logic of using prefetchable window, so that 64-bit
> prefetchable BARs have a chance to be really prefetchable. It does not fix
> any bugs, instead, it exposes one.

OK, then I'm back to square one, and I need an explanation of why we
want to merge this patch.

I think the patch conserves 32-bit address space by putting more
things above 4G than we used to.  But this comes at the cost of using
non-prefetchable windows in some cases where we used to use
prefetchable ones.

Somebody has to explain why we want to do this and why the benefit of
conserving the 32-bit space is more important losing the
prefetchability.

Bjorn
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Guo Chao April 9, 2014, 7:52 a.m. UTC | #8
Hi, Bjorn:

On Tue, Apr 08, 2014 at 09:02:10AM -0600, Bjorn Helgaas wrote:
> On Mon, Apr 7, 2014 at 8:57 PM, Guo Chao <yan@linux.vnet.ibm.com> wrote:
> > On Fri, Apr 04, 2014 at 04:43:27PM -0600, Bjorn Helgaas wrote:
> >> On Fri, Mar 7, 2014 at 1:08 PM, Yinghai Lu <yinghai@kernel.org> wrote:
> >> > When one of children resources does not support MEM_64, MEM_64 for
> >> > bridge get reset, so pull down whole pref resource on the bridge under 4G.
> >> >
> >> > If the bridge support pref mem 64, will only allocate that with pref mem64 to
> >> > children that support it.
> >> > For children resources if they only support pref mem 32, will allocate them
> >> > from non pref mem instead.
> >> >
> >> > If the bridge only support 32bit pref mmio, will still have all children pref
> >> > mmio under that.
> >> >
> >> > -v2: Add release bridge res support with bridge mem res for pref_mem children res.
> >> > -v3: refresh and make it can be applied early before for_each_dev_res patchset.
> >> > -v4: fix non-pref mmio 64bit support found by Guo Chao.
> >> > -v7: repost as ibm's powerpc system work again when
> >> >     1. petiboot boot kernel have this patch.
> >> >     2. or mellanox driver added new .shutdown member.
> >> >     discussion could be found at:
> >> >          http://marc.info/?t=138968064500001&r=1&w=2
> >>
> >> I think this fixes some sort of bug, and I suppose if I spent a few
> >> hours decoding the discussion you mention (the 44 message-long
> >> "mlx4_core probe error after applying Yinghai's patch" discussion), I
> >> might be able to figure out what it is.
> >>
> >
> > The result of 44 message-long discussion is: the Mellanox card's failure is
> > due to a bug in its driver instead of this patch.
> >
> > The patch refines the logic of using prefetchable window, so that 64-bit
> > prefetchable BARs have a chance to be really prefetchable. It does not fix
> > any bugs, instead, it exposes one.
> 
> OK, then I'm back to square one, and I need an explanation of why we
> want to merge this patch.
> 
> I think the patch conserves 32-bit address space by putting more
> things above 4G than we used to.  But this comes at the cost of using
> non-prefetchable windows in some cases where we used to use
> prefetchable ones.
> 
> Somebody has to explain why we want to do this and why the benefit of
> conserving the 32-bit space is more important losing the
> prefetchability.

I want to make sure we are talking about this patch:
	PCI: Try best to allocate pref mmio 64bit above 4g
	http://patchwork.ozlabs.org/patch/328067/

instead of this one:
	PCI: Try to allocate mem64 above 4G at first
	http://patchwork.ozlabs.org/patch/303197/

This patch does not intend to conserve 32-bit space at all. It makes
better use of prefetchable window.

The problem with the old code is: when both 32-bit and 64-bit prefetchable
BARs present, it's in favor of 32-bit one to use prefetchable window.
This window is then not supposed to get 4G-above address, and this
32-bit-address-only property would propagates upwards until reaching root
bridge. In later assignment phase, 4G-above space is never touched. This
is just caused by a single 32-bit prefetchable BAR (say a ROM BAR).

This patch helps by making better decision:

	* Keep the old behaviour if only 32-bit or 64-bit prefetchable
	  BARs present

	* If both of them present, put 64-bit ones to prefetchable window,
	  32-bit ones to non-prefetchable window

So 4G-above space can be properly used.

Why does enabling 64-bit space matter?

* 32-bit space is just too small. If larger-than-4G BAR sounds
  unrealistic (we do have such devices), there is still a chance that
  total MMIO size under a domain is too large, especially when SR-IOV
  enabled (we met this situation).

* On PowerNV platform, we can do better platform configuration for 64-bit
  prefetchable space, that's important for enabling SR-IOV on this
  platform.


Thanks,
Guo Chao

> 
> Bjorn
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Bjorn Helgaas April 10, 2014, 5:26 p.m. UTC | #9
On Wed, Apr 9, 2014 at 1:52 AM, Guo Chao <yan@linux.vnet.ibm.com> wrote:

>> OK, then I'm back to square one, and I need an explanation of why we
>> want to merge this patch.
>>
>> I think the patch conserves 32-bit address space by putting more
>> things above 4G than we used to.  But this comes at the cost of using
>> non-prefetchable windows in some cases where we used to use
>> prefetchable ones.
>>
>> Somebody has to explain why we want to do this and why the benefit of
>> conserving the 32-bit space is more important losing the
>> prefetchability.
>
> I want to make sure we are talking about this patch:
>         PCI: Try best to allocate pref mmio 64bit above 4g
>         http://patchwork.ozlabs.org/patch/328067/
>
> instead of this one:
>         PCI: Try to allocate mem64 above 4G at first
>         http://patchwork.ozlabs.org/patch/303197/

Yes, I'm talking about http://patchwork.ozlabs.org/patch/328067; if
you look at patchwork, the second one (303197) is marked "superseded."

> This patch does not intend to conserve 32-bit space at all. It makes
> better use of prefetchable window.
>
> The problem with the old code is: when both 32-bit and 64-bit prefetchable
> BARs present, it's in favor of 32-bit one to use prefetchable window.
> This window is then not supposed to get 4G-above address, and this
> 32-bit-address-only property would propagates upwards until reaching root
> bridge. In later assignment phase, 4G-above space is never touched. This
> is just caused by a single 32-bit prefetchable BAR (say a ROM BAR).
>
> This patch helps by making better decision:
>
>         * Keep the old behaviour if only 32-bit or 64-bit prefetchable
>           BARs present
>
>         * If both of them present, put 64-bit ones to prefetchable window,
>           32-bit ones to non-prefetchable window

I thought the patch was supposed to change the way we allocate bridge
windows.  But you are talking about the way we assign 32- and 64-bit
device resources, i.e., changing the way we decide whether to put them
in prefetchable or non-prefetchable windows.

> So 4G-above space can be properly used.
>
> Why does enabling 64-bit space matter?
>
> * 32-bit space is just too small. If larger-than-4G BAR sounds
>   unrealistic (we do have such devices), there is still a chance that
>   total MMIO size under a domain is too large, especially when SR-IOV
>   enabled (we met this situation).

Please give more details about the problem you saw.  A complete dmesg
log showing a boot failure or a device that doesn't work, and another
log with this patch applied, showing things working as desired, would
go a long ways toward clarifying this.

This sounds like a specific case that will be fixed by the patch, and
if you give more details, maybe I can figure out what's going on.

Bjorn
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Benjamin Herrenschmidt April 10, 2014, 9:23 p.m. UTC | #10
On Thu, 2014-04-10 at 11:26 -0600, Bjorn Helgaas wrote:
> > The problem with the old code is: when both 32-bit and 64-bit prefetchable
> > BARs present, it's in favor of 32-bit one to use prefetchable window.
> > This window is then not supposed to get 4G-above address, and this
> > 32-bit-address-only property would propagates upwards until reaching root
> > bridge. In later assignment phase, 4G-above space is never touched. This
> > is just caused by a single 32-bit prefetchable BAR (say a ROM BAR).
> >
> > This patch helps by making better decision:
> >
> >         * Keep the old behaviour if only 32-bit or 64-bit prefetchable
> >           BARs present
> >
> >         * If both of them present, put 64-bit ones to prefetchable window,
> >           32-bit ones to non-prefetchable window
> 
> I thought the patch was supposed to change the way we allocate bridge
> windows.  But you are talking about the way we assign 32- and 64-bit
> device resources, i.e., changing the way we decide whether to put them
> in prefetchable or non-prefetchable windows.

Wait...

Are you putting "non-prefetchable" 64-bit BARs in the prefetchable
range ?

That's bad. Don't do that.

Some switches or bridges *will* actively prefetch, and putting a
register BAR marked non-prefetchable in the prefetchable range will thus
cause nasty bugs if a register with side effects becomes the target of a
prefetch operations.

We can do that on powerpc *selectively* when and only when we know for
sure that no bridge will prefetch on the path to the device. We know our
host bridges won't and we *might* (to be verified) know that the PLX
switches we have soldered on the mobo won't, but that's the only cases
where it is legit.

> > So 4G-above space can be properly used.
> >
> > Why does enabling 64-bit space matter?
> >
> > * 32-bit space is just too small. If larger-than-4G BAR sounds
> >   unrealistic (we do have such devices), there is still a chance that
> >   total MMIO size under a domain is too large, especially when SR-IOV
> >   enabled (we met this situation).
> 
> Please give more details about the problem you saw.  A complete dmesg
> log showing a boot failure or a device that doesn't work, and another
> log with this patch applied, showing things working as desired, would
> go a long ways toward clarifying this.
> 
> This sounds like a specific case that will be fixed by the patch, and
> if you give more details, maybe I can figure out what's going on.

Well, in our case it's a very complicated story. We have this segmented
space where we have 256 segments covering our 32-bit space. We need each
"partitionable endpoint" to be in separate sets of segments. PEs are
basically iommu domain but also affect MMIO and are our unit of
assignment to guests when virtualizing and of error handling when doing
EEH (so we freeze access on error by PE).

So with something like SR-IOV where each VF has to be a separate PE,
the 32-bit space is problematic because we want the VF BARs to stride
over segments so each VF lives in a separate PE, but we run out of
segments and the 32-bit space has a fixed segment size that may not
match the VF BAR stride.

So we want our BARs to shoot into 64-bit space where we have additional
windows with each 256 segments that we can use, and additionally, we can
resize these to make the segment size match the VF BAR stride.

Cheers,
Ben.

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Guo Chao April 15, 2014, 11:54 a.m. UTC | #11
On Thu, Apr 10, 2014 at 11:26:14AM -0600, Bjorn Helgaas wrote:
> On Wed, Apr 9, 2014 at 1:52 AM, Guo Chao <yan@linux.vnet.ibm.com> wrote:
> > So 4G-above space can be properly used.
> >
> > Why does enabling 64-bit space matter?
> >
> > * 32-bit space is just too small. If larger-than-4G BAR sounds
> >   unrealistic (we do have such devices), there is still a chance that
> >   total MMIO size under a domain is too large, especially when SR-IOV
> >   enabled (we met this situation).
> 
> Please give more details about the problem you saw.  A complete dmesg
> log showing a boot failure or a device that doesn't work, and another
> log with this patch applied, showing things working as desired, would
> go a long ways toward clarifying this.
> 
> This sounds like a specific case that will be fixed by the patch, and
> if you give more details, maybe I can figure out what's going on.
> 

Let's see an example.

| pci 0003:05:00.0: reg 0x10: [mem 0x3d05801000000-0x3d058010fffff 64bit]
| pci 0003:05:00.0: reg 0x18: [mem 0x3d05010000000-0x3d05017ffffff 64bit pref]
| pci 0003:05:00.0: reg 0x30: [mem 0x00000000-0x000fffff pref]
| pci 0003:05:00.0: reg 0x134: [mem 0x3d05018000000-0x3d0501fffffff 64bit pref]

This is printed at enumeration phase. This device has a SRIOV BAR with
size of 0x7ffffff (128M). That's the size of a signle VF BAR. The device
supports 63 VFs so we need near 8G space in total. Apparanlty we need
exploit 64-bit space.

| PCI host bridge to bus 0003:00
| pci_bus 0003:00: root bus resource [mem 0x3d05800000000-0x3d0587ffeffff] (bus address [0x80000000-0xfffeffff])
| pci_bus 0003:00: root bus resource [mem 0x3d05008000000-0x3d057ffffffff 64bit pref]

And we do have a huge (32G) 64-bit prefetchable window supply. We expect
everything to work fine, but:

| pci 0003:00:00.0: BAR 15: can't assign mem pref (size 0x206000000)
| pci 0003:00:00.0: BAR 14: assigned [mem 0x3d05800000000-0x3d05802ffffff]
| pci 0003:00:00.0: BAR 13: can't assign io (size 0x4000)

It went wrong at the beginning. Note the error message never considers
64-bit or not, but BAR 15 here has it MEM_64 flag cleared. It first
tried to find a 32-bit prefetchable window, but we only supply a 64-bit one.
So it fall back to (32-bit) non-prefetchable window, but there is no enough
room there. At last it went into complicated steps (not show here) of
allocating requested resource first, then try best for the optional ones, etc..

Why is BAR 15 (prefetchable) 32 bit instead of 64? Because PCI core favours
32-bit prefetchable BARs and we have some. This is one of them:

| pci 0003:05:00.0: reg 0x30: [mem 0x00000000-0x000fffff pref]

PCI core decides to let them enjoy the benefition of prefetch. They can't
bear the risk of getting 4G-above address, so its parent, its parent's parent,
its parent's parent's parent, finally the root bridge (00:00.0) must have their
MEM_64 flag of prefetchable resource (BAR 15) clear. In the end nobody
is eligible to use the 64-bit (prefetchable) space even we have huge
supply !

Note even the resource is small and successfully fall back into 32-bit
non-prefetchable window, that's still not OK for us because we need
SRIOV resource be at 64-bit prefetchable space to do platform
configuration.

With Yinghai's patch, when 64-bit prefetchable BARs found, they're more
favoured than the 32-bit prefetchable ones (if any), so all upstream bridges'
prefetchable windows have their MEM_64 flag reserved and the huge 64-bit
prefetchable space will be exploited:

| pci 0003:00:00.0: BAR 15: assigned [mem 0x3d05008000000-0x3d0521fffffff 64bit pref]
| pci 0003:00:00.0: BAR 14: assigned [mem 0x3d05800000000-0x3d05802ffffff]
| pci 0003:00:00.0: BAR 13: can't assign io (size 0x4000)

(The IO resource error here is due to we do not provide IO window)

Thanks,
Guo Chao


> Bjorn
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Bjorn Helgaas April 16, 2014, 12:09 a.m. UTC | #12
On Tue, Apr 15, 2014 at 5:54 AM, Guo Chao <yan@linux.vnet.ibm.com> wrote:
> On Thu, Apr 10, 2014 at 11:26:14AM -0600, Bjorn Helgaas wrote:
>> On Wed, Apr 9, 2014 at 1:52 AM, Guo Chao <yan@linux.vnet.ibm.com> wrote:
>> > So 4G-above space can be properly used.
>> >
>> > Why does enabling 64-bit space matter?
>> >
>> > * 32-bit space is just too small. If larger-than-4G BAR sounds
>> >   unrealistic (we do have such devices), there is still a chance that
>> >   total MMIO size under a domain is too large, especially when SR-IOV
>> >   enabled (we met this situation).
>>
>> Please give more details about the problem you saw.  A complete dmesg
>> log showing a boot failure or a device that doesn't work, and another
>> log with this patch applied, showing things working as desired, would
>> go a long ways toward clarifying this.
>>
>> This sounds like a specific case that will be fixed by the patch, and
>> if you give more details, maybe I can figure out what's going on.
>>
>
> Let's see an example.

Thanks for the example.  Please open a bug report at
http://bugzilla.kernel.org and attach the complete dmesg logs before
and after Yinghai's patch.

Having the complete logs helps me answer questions myself without
having to bother you, and it also helps me figure out whether we can
improve our logging to make it easier to diagnose problems like this.

> | pci 0003:05:00.0: reg 0x10: [mem 0x3d05801000000-0x3d058010fffff 64bit]
> | pci 0003:05:00.0: reg 0x18: [mem 0x3d05010000000-0x3d05017ffffff 64bit pref]
> | pci 0003:05:00.0: reg 0x30: [mem 0x00000000-0x000fffff pref]
> | pci 0003:05:00.0: reg 0x134: [mem 0x3d05018000000-0x3d0501fffffff 64bit pref]
>
> This is printed at enumeration phase. This device has a SRIOV BAR with
> size of 0x7ffffff (128M). That's the size of a signle VF BAR. The device
> supports 63 VFs so we need near 8G space in total. Apparanlty we need
> exploit 64-bit space.

Yes.  Do we print a hint anywhere about how many VFs there are?  In
other words, can you deduce the number "63" from the dmesg, or do you
have to figure that out some other way?  It'd be nice if that
information were somewhere in dmesg.

> | PCI host bridge to bus 0003:00
> | pci_bus 0003:00: root bus resource [mem 0x3d05800000000-0x3d0587ffeffff] (bus address [0x80000000-0xfffeffff])
> | pci_bus 0003:00: root bus resource [mem 0x3d05008000000-0x3d057ffffffff 64bit pref]
>
> And we do have a huge (32G) 64-bit prefetchable window supply. We expect
> everything to work fine, but:
>
> | pci 0003:00:00.0: BAR 15: can't assign mem pref (size 0x206000000)
> | pci 0003:00:00.0: BAR 14: assigned [mem 0x3d05800000000-0x3d05802ffffff]
> | pci 0003:00:00.0: BAR 13: can't assign io (size 0x4000)
>
> It went wrong at the beginning. Note the error message never considers
> 64-bit or not, but BAR 15 here has it MEM_64 flag cleared.

BAR 15 is a bridge window.  I think its resource flags should reflect
the capability of the *window*, even if we disable the window or we
happen to assign addresses that are under 4GB.  So I think it's wrong
that we clear the MEM_64 flag  in pbus_size_mem() and the IO flag in
pbus_size_io().

> It first
> tried to find a 32-bit prefetchable window, but we only supply a 64-bit one.
> So it fall back to (32-bit) non-prefetchable window, but there is no enough
> room there. At last it went into complicated steps (not show here) of
> allocating requested resource first, then try best for the optional ones, etc..
>
> Why is BAR 15 (prefetchable) 32 bit instead of 64? Because PCI core favours
> 32-bit prefetchable BARs and we have some. This is one of them:
>
> | pci 0003:05:00.0: reg 0x30: [mem 0x00000000-0x000fffff pref]
>
> PCI core decides to let them enjoy the benefition of prefetch. They can't
> bear the risk of getting 4G-above address, so its parent, its parent's parent,
> its parent's parent's parent, finally the root bridge (00:00.0) must have their
> MEM_64 flag of prefetchable resource (BAR 15) clear.

It sounds like we're tracking the resource requirements
(prefetchability and BAR width) by using the flags on bridge windows.
If that's the case, I think it's wrong.  We should preserve the bridge
window flags, because those express the bridge hardware capabilities,
and we should explicitly keep track of what's required by devices
below the bridge in some other way.

> In the end nobody
> is eligible to use the 64-bit (prefetchable) space even we have huge
> supply !
>
> Note even the resource is small and successfully fall back into 32-bit
> non-prefetchable window, that's still not OK for us because we need
> SRIOV resource be at 64-bit prefetchable space to do platform
> configuration.
>
> With Yinghai's patch, when 64-bit prefetchable BARs found, they're more
> favoured than the 32-bit prefetchable ones (if any), so all upstream bridges'
> prefetchable windows have their MEM_64 flag reserved and the huge 64-bit
> prefetchable space will be exploited:
>
> | pci 0003:00:00.0: BAR 15: assigned [mem 0x3d05008000000-0x3d0521fffffff 64bit pref]
> | pci 0003:00:00.0: BAR 14: assigned [mem 0x3d05800000000-0x3d05802ffffff]
> | pci 0003:00:00.0: BAR 13: can't assign io (size 0x4000)
>
> (The IO resource error here is due to we do not provide IO window)

Yes.  The lack of I/O space is just a constraint of the platform.
It'd be nice if we printed a more meaningful error message in this
case.  One really has to be a PCI expert to distinguish this from a
real problem that we need to fix.

Bjorn
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Yinghai Lu April 16, 2014, 4:16 a.m. UTC | #13
On Tue, Apr 15, 2014 at 5:09 PM, Bjorn Helgaas <bhelgaas@google.com> wrote:
>> It first
>> tried to find a 32-bit prefetchable window, but we only supply a 64-bit one.
>> So it fall back to (32-bit) non-prefetchable window, but there is no enough
>> room there. At last it went into complicated steps (not show here) of
>> allocating requested resource first, then try best for the optional ones, etc..
>>
>> Why is BAR 15 (prefetchable) 32 bit instead of 64? Because PCI core favours
>> 32-bit prefetchable BARs and we have some. This is one of them:
>>
>> | pci 0003:05:00.0: reg 0x30: [mem 0x00000000-0x000fffff pref]
>>
>> PCI core decides to let them enjoy the benefition of prefetch. They can't
>> bear the risk of getting 4G-above address, so its parent, its parent's parent,
>> its parent's parent's parent, finally the root bridge (00:00.0) must have their
>> MEM_64 flag of prefetchable resource (BAR 15) clear.
>
> It sounds like we're tracking the resource requirements
> (prefetchability and BAR width) by using the flags on bridge windows.
> If that's the case, I think it's wrong.  We should preserve the bridge
> window flags, because those express the bridge hardware capabilities,
> and we should explicitly keep track of what's required by devices
> below the bridge in some other way.

if the BAR support 64bit, then resource could be 64bit.
if the BAR does not support 64bit, then resource should be 32bit.

That should be best guess to decide resource flag.

Yinghai
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Benjamin Herrenschmidt April 16, 2014, 6:30 a.m. UTC | #14
On Tue, 2014-04-15 at 18:09 -0600, Bjorn Helgaas wrote:
> 
> Thanks for the example.  Please open a bug report at
> http://bugzilla.kernel.org and attach the complete dmesg logs before
> and after Yinghai's patch.
> 
> Having the complete logs helps me answer questions myself without
> having to bother you, and it also helps me figure out whether we can
> improve our logging to make it easier to diagnose problems like this.

Unfortunately, for a *little while* longer (hint !) we can't publish
a complete log from a Power8 machine, but we should be able to include
everything remotely related to PCI.

> > | pci 0003:05:00.0: reg 0x10: [mem 0x3d05801000000-0x3d058010fffff 64bit]
> > | pci 0003:05:00.0: reg 0x18: [mem 0x3d05010000000-0x3d05017ffffff 64bit pref]
> > | pci 0003:05:00.0: reg 0x30: [mem 0x00000000-0x000fffff pref]
> > | pci 0003:05:00.0: reg 0x134: [mem 0x3d05018000000-0x3d0501fffffff 64bit pref]
> >
> > This is printed at enumeration phase. This device has a SRIOV BAR with
> > size of 0x7ffffff (128M). That's the size of a signle VF BAR. The device
> > supports 63 VFs so we need near 8G space in total. Apparanlty we need
> > exploit 64-bit space.
> 
> Yes.  Do we print a hint anywhere about how many VFs there are?  In
> other words, can you deduce the number "63" from the dmesg, or do you
> have to figure that out some other way?  It'd be nice if that
> information were somewhere in dmesg.
>
> > | PCI host bridge to bus 0003:00
> > | pci_bus 0003:00: root bus resource [mem 0x3d05800000000-0x3d0587ffeffff] (bus address [0x80000000-0xfffeffff])
> > | pci_bus 0003:00: root bus resource [mem 0x3d05008000000-0x3d057ffffffff 64bit pref]
> >
> > And we do have a huge (32G) 64-bit prefetchable window supply. We expect
> > everything to work fine, but:
> >
> > | pci 0003:00:00.0: BAR 15: can't assign mem pref (size 0x206000000)
> > | pci 0003:00:00.0: BAR 14: assigned [mem 0x3d05800000000-0x3d05802ffffff]
> > | pci 0003:00:00.0: BAR 13: can't assign io (size 0x4000)
> >
> > It went wrong at the beginning. Note the error message never considers
> > 64-bit or not, but BAR 15 here has it MEM_64 flag cleared.
> 
> BAR 15 is a bridge window.  I think its resource flags should reflect
> the capability of the *window*, even if we disable the window or we
> happen to assign addresses that are under 4GB.  So I think it's wrong
> that we clear the MEM_64 flag  in pbus_size_mem() and the IO flag in
> pbus_size_io().
> 
> > It first
> > tried to find a 32-bit prefetchable window, but we only supply a 64-bit one.
> > So it fall back to (32-bit) non-prefetchable window, but there is no enough
> > room there. At last it went into complicated steps (not show here) of
> > allocating requested resource first, then try best for the optional ones, etc..
> >
> > Why is BAR 15 (prefetchable) 32 bit instead of 64? Because PCI core favours
> > 32-bit prefetchable BARs and we have some. This is one of them:
> >
> > | pci 0003:05:00.0: reg 0x30: [mem 0x00000000-0x000fffff pref]
> >
> > PCI core decides to let them enjoy the benefition of prefetch. They can't
> > bear the risk of getting 4G-above address, so its parent, its parent's parent,
> > its parent's parent's parent, finally the root bridge (00:00.0) must have their
> > MEM_64 flag of prefetchable resource (BAR 15) clear.
> 
> It sounds like we're tracking the resource requirements
> (prefetchability and BAR width) by using the flags on bridge windows.
> If that's the case, I think it's wrong.  We should preserve the bridge
> window flags, because those express the bridge hardware capabilities,
> and we should explicitly keep track of what's required by devices
> below the bridge in some other way.
> 
> > In the end nobody
> > is eligible to use the 64-bit (prefetchable) space even we have huge
> > supply !
> >
> > Note even the resource is small and successfully fall back into 32-bit
> > non-prefetchable window, that's still not OK for us because we need
> > SRIOV resource be at 64-bit prefetchable space to do platform
> > configuration.
> >
> > With Yinghai's patch, when 64-bit prefetchable BARs found, they're more
> > favoured than the 32-bit prefetchable ones (if any), so all upstream bridges'
> > prefetchable windows have their MEM_64 flag reserved and the huge 64-bit
> > prefetchable space will be exploited:
> >
> > | pci 0003:00:00.0: BAR 15: assigned [mem 0x3d05008000000-0x3d0521fffffff 64bit pref]
> > | pci 0003:00:00.0: BAR 14: assigned [mem 0x3d05800000000-0x3d05802ffffff]
> > | pci 0003:00:00.0: BAR 13: can't assign io (size 0x4000)
> >
> > (The IO resource error here is due to we do not provide IO window)
> 
> Yes.  The lack of I/O space is just a constraint of the platform.
> It'd be nice if we printed a more meaningful error message in this
> case.  One really has to be a PCI expert to distinguish this from a
> real problem that we need to fix.
> 
> Bjorn


--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Benjamin Herrenschmidt April 16, 2014, 6:33 a.m. UTC | #15
On Wed, 2014-04-16 at 16:30 +1000, Benjamin Herrenschmidt wrote:
> On Tue, 2014-04-15 at 18:09 -0600, Bjorn Helgaas wrote:
> > 
> > Thanks for the example.  Please open a bug report at
> > http://bugzilla.kernel.org and attach the complete dmesg logs before
> > and after Yinghai's patch.
> > 
> > Having the complete logs helps me answer questions myself without
> > having to bother you, and it also helps me figure out whether we can
> > improve our logging to make it easier to diagnose problems like this.
> 
> Unfortunately, for a *little while* longer (hint !) we can't publish
> a complete log from a Power8 machine, but we should be able to include
> everything remotely related to PCI.

Hrm, actually, I'm not sure we can disclose the device details just yet,
so let's just hide the vendor/device ID to make the lawyers happy and
you can print all the rest.

Cheers,
Ben.


--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Bjorn Helgaas April 16, 2014, 5:10 p.m. UTC | #16
On Tue, Apr 15, 2014 at 10:16 PM, Yinghai Lu <yinghai@kernel.org> wrote:
> On Tue, Apr 15, 2014 at 5:09 PM, Bjorn Helgaas <bhelgaas@google.com> wrote:
>>> It first
>>> tried to find a 32-bit prefetchable window, but we only supply a 64-bit one.
>>> So it fall back to (32-bit) non-prefetchable window, but there is no enough
>>> room there. At last it went into complicated steps (not show here) of
>>> allocating requested resource first, then try best for the optional ones, etc..
>>>
>>> Why is BAR 15 (prefetchable) 32 bit instead of 64? Because PCI core favours
>>> 32-bit prefetchable BARs and we have some. This is one of them:
>>>
>>> | pci 0003:05:00.0: reg 0x30: [mem 0x00000000-0x000fffff pref]
>>>
>>> PCI core decides to let them enjoy the benefition of prefetch. They can't
>>> bear the risk of getting 4G-above address, so its parent, its parent's parent,
>>> its parent's parent's parent, finally the root bridge (00:00.0) must have their
>>> MEM_64 flag of prefetchable resource (BAR 15) clear.
>>
>> It sounds like we're tracking the resource requirements
>> (prefetchability and BAR width) by using the flags on bridge windows.
>> If that's the case, I think it's wrong.  We should preserve the bridge
>> window flags, because those express the bridge hardware capabilities,
>> and we should explicitly keep track of what's required by devices
>> below the bridge in some other way.
>
> if the BAR support 64bit, then resource could be 64bit.
> if the BAR does not support 64bit, then resource should be 32bit.
>
> That should be best guess to decide resource flag.

We don't need to "decide" the resource flags.  The hardware tells us
what it can support, and *that* should be reflected in the flags.  We
do need to decide what addresses to assign and whether to put a
prefetchable BAR in a prefetchable or non-prefetchable window.

Putting decision results in the resource flags obfuscates the whole process.
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Bjorn Helgaas April 16, 2014, 5:15 p.m. UTC | #17
On Wed, Apr 16, 2014 at 12:33 AM, Benjamin Herrenschmidt
<benh@kernel.crashing.org> wrote:
> On Wed, 2014-04-16 at 16:30 +1000, Benjamin Herrenschmidt wrote:
>> On Tue, 2014-04-15 at 18:09 -0600, Bjorn Helgaas wrote:
>> >
>> > Thanks for the example.  Please open a bug report at
>> > http://bugzilla.kernel.org and attach the complete dmesg logs before
>> > and after Yinghai's patch.
>> >
>> > Having the complete logs helps me answer questions myself without
>> > having to bother you, and it also helps me figure out whether we can
>> > improve our logging to make it easier to diagnose problems like this.
>>
>> Unfortunately, for a *little while* longer (hint !) we can't publish
>> a complete log from a Power8 machine, but we should be able to include
>> everything remotely related to PCI.
>
> Hrm, actually, I'm not sure we can disclose the device details just yet,
> so let's just hide the vendor/device ID to make the lawyers happy and
> you can print all the rest.

Sure, no problem (and it looks like Guo has done just that with
https://bugzilla.kernel.org/show_bug.cgi?id=74151).

I'm pretty sure this problem is not specific to your secret hardware;
it should be reproducible on any hardware with similar apertures and
similar BARs.  I am a little concerned about the PE implications, but
I'm ignoring that for now until we sort out the first-order problem of
getting the allocation scheme to work.
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Bjorn Helgaas April 16, 2014, 10:11 p.m. UTC | #18
[-cc Gavin (addr is bouncing)]

On Fri, Mar 07, 2014 at 12:08:44PM -0800, Yinghai Lu wrote:
> When one of children resources does not support MEM_64, MEM_64 for
> bridge get reset, so pull down whole pref resource on the bridge under 4G.
> 
> If the bridge support pref mem 64, will only allocate that with pref mem64 to
> children that support it.
> For children resources if they only support pref mem 32, will allocate them
> from non pref mem instead.
> 
> If the bridge only support 32bit pref mmio, will still have all children pref
> mmio under that.
> 
> -v2: Add release bridge res support with bridge mem res for pref_mem children res.
> -v3: refresh and make it can be applied early before for_each_dev_res patchset.
> -v4: fix non-pref mmio 64bit support found by Guo Chao.
> -v7: repost as ibm's powerpc system work again when
>     1. petiboot boot kernel have this patch.
>     2. or mellanox driver added new .shutdown member.
>     discussion could be found at:
> 	 http://marc.info/?t=138968064500001&r=1&w=2
> 
> Signed-off-by: Yinghai Lu <yinghai@kernel.org>
> Tested-by: Guo Chao <yan@linux.vnet.ibm.com>
> Tested-by: Wei Yang <weiyang@linux.vnet.ibm.com>
> 
> ---
>  drivers/pci/setup-bus.c |  138 ++++++++++++++++++++++++++++++++----------------
>  drivers/pci/setup-res.c |   20 ++++++
>  2 files changed, 111 insertions(+), 47 deletions(-)
> 
> Index: linux-2.6/drivers/pci/setup-bus.c
> ===================================================================
> --- linux-2.6.orig/drivers/pci/setup-bus.c
> +++ linux-2.6/drivers/pci/setup-bus.c
> @@ -713,12 +713,11 @@ static void pci_bridge_check_ranges(stru
>     bus resource of a given type. Note: we intentionally skip
>     the bus resources which have already been assigned (that is,
>     have non-NULL parent resource). */
> -static struct resource *find_free_bus_resource(struct pci_bus *bus, unsigned long type)
> +static struct resource *find_free_bus_resource(struct pci_bus *bus,
> +			 unsigned long type_mask, unsigned long type)
>  {
>  	int i;
>  	struct resource *r;
> -	unsigned long type_mask = IORESOURCE_IO | IORESOURCE_MEM |
> -				  IORESOURCE_PREFETCH;
>  
>  	pci_bus_for_each_resource(bus, r, i) {
>  		if (r == &ioport_resource || r == &iomem_resource)
> @@ -815,7 +814,8 @@ static void pbus_size_io(struct pci_bus
>  		resource_size_t add_size, struct list_head *realloc_head)
>  {
>  	struct pci_dev *dev;
> -	struct resource *b_res = find_free_bus_resource(bus, IORESOURCE_IO);
> +	struct resource *b_res = find_free_bus_resource(bus, IORESOURCE_IO,
> +							IORESOURCE_IO);
>  	resource_size_t size = 0, size0 = 0, size1 = 0;
>  	resource_size_t children_add_size = 0;
>  	resource_size_t min_align, align;
> @@ -915,15 +915,17 @@ static inline resource_size_t calculate_
>   * guarantees that all child resources fit in this size.
>   */
>  static int pbus_size_mem(struct pci_bus *bus, unsigned long mask,
> -			 unsigned long type, resource_size_t min_size,
> -			resource_size_t add_size,
> -			struct list_head *realloc_head)
> +			 unsigned long type, unsigned long type2,
> +			 unsigned long type3,
> +			 resource_size_t min_size, resource_size_t add_size,
> +			 struct list_head *realloc_head)

This no longer matches the function documentation above, but it doesn't
really matter, because this interface, with a mask plus three type
parameters, is just way too complicated.

>  {
>  	struct pci_dev *dev;
>  	resource_size_t min_align, align, size, size0, size1;
>  	resource_size_t aligns[12];	/* Alignments from 1Mb to 2Gb */
>  	int order, max_order;
> -	struct resource *b_res = find_free_bus_resource(bus, type);
> +	struct resource *b_res = find_free_bus_resource(bus,
> +					 mask | IORESOURCE_PREFETCH, type);
>  	unsigned int mem64_mask = 0;
>  	resource_size_t children_add_size = 0;
>  
> @@ -944,7 +946,9 @@ static int pbus_size_mem(struct pci_bus
>  			struct resource *r = &dev->resource[i];
>  			resource_size_t r_size;
>  
> -			if (r->parent || (r->flags & mask) != type)
> +			if (r->parent || ((r->flags & mask) != type &&
> +					  (r->flags & mask) != type2 &&
> +					  (r->flags & mask) != type3))
>  				continue;
>  			r_size = resource_size(r);
>  #ifdef CONFIG_PCI_IOV
> @@ -1117,8 +1121,9 @@ void __ref __pci_bus_size_bridges(struct
>  			struct list_head *realloc_head)
>  {
>  	struct pci_dev *dev;
> -	unsigned long mask, prefmask;
> +	unsigned long mask, prefmask, type2 = 0, type3 = 0;
>  	resource_size_t additional_mem_size = 0, additional_io_size = 0;
> +	struct resource *b_res;
>  
>  	list_for_each_entry(dev, &bus->devices, bus_list) {
>  		struct pci_bus *b = dev->subordinate;
> @@ -1163,15 +1168,34 @@ void __ref __pci_bus_size_bridges(struct
>  		   has already been allocated by arch code, try
>  		   non-prefetchable range for both types of PCI memory
>  		   resources. */
> +		b_res = &bus->self->resource[PCI_BRIDGE_RESOURCES];
>  		mask = IORESOURCE_MEM;
>  		prefmask = IORESOURCE_MEM | IORESOURCE_PREFETCH;
> -		if (pbus_size_mem(bus, prefmask, prefmask,
> +		if (b_res[2].flags & IORESOURCE_MEM_64) {
> +			prefmask |= IORESOURCE_MEM_64;
> +			if (pbus_size_mem(bus, prefmask, prefmask,
> +				  prefmask, prefmask,
>  				  realloc_head ? 0 : additional_mem_size,
> -				  additional_mem_size, realloc_head))
> -			mask = prefmask; /* Success, size non-prefetch only. */
> -		else
> -			additional_mem_size += additional_mem_size;
> -		pbus_size_mem(bus, mask, IORESOURCE_MEM,
> +				  additional_mem_size, realloc_head)) {
> +					/* Success, size non-pref64 only. */
> +					mask = prefmask;
> +					type2 = prefmask & ~IORESOURCE_MEM_64;
> +					type3 = prefmask & ~IORESOURCE_PREFETCH;
> +			}
> +		}
> +		if (!type2) {
> +			prefmask &= ~IORESOURCE_MEM_64;
> +			if (pbus_size_mem(bus, prefmask, prefmask,
> +					 prefmask, prefmask,
> +					 realloc_head ? 0 : additional_mem_size,
> +					 additional_mem_size, realloc_head)) {
> +				/* Success, size non-prefetch only. */
> +				mask = prefmask;
> +			} else
> +				additional_mem_size += additional_mem_size;
> +			type2 = type3 = IORESOURCE_MEM;
> +		}
> +		pbus_size_mem(bus, mask, IORESOURCE_MEM, type2, type3,
>  				realloc_head ? 0 : additional_mem_size,
>  				additional_mem_size, realloc_head);

I spent quite a bit of time trying to understand this code yesterday, but
I failed.  Comments might help, but more importantly, it needs to be
simplified so the code makes sense even without comments.

>  		break;
> @@ -1257,42 +1281,66 @@ static void __ref __pci_bridge_assign_re
>  static void pci_bridge_release_resources(struct pci_bus *bus,
>  					  unsigned long type)
>  {
> -	int idx;
> -	bool changed = false;
> -	struct pci_dev *dev;
> +	struct pci_dev *dev = bus->self;
>  	struct resource *r;
>  	unsigned long type_mask = IORESOURCE_IO | IORESOURCE_MEM |
> -				  IORESOURCE_PREFETCH;
> +				  IORESOURCE_PREFETCH | IORESOURCE_MEM_64;
> +	unsigned old_flags = 0;
> +	struct resource *b_res;
> +	int idx = 1;
>  
> -	dev = bus->self;
> -	for (idx = PCI_BRIDGE_RESOURCES; idx <= PCI_BRIDGE_RESOURCE_END;
> -	     idx++) {
> -		r = &dev->resource[idx];
> -		if ((r->flags & type_mask) != type)
> -			continue;
> -		if (!r->parent)
> -			continue;
> -		/*
> -		 * if there are children under that, we should release them
> -		 *  all
> -		 */
> -		release_child_resources(r);
> -		if (!release_resource(r)) {
> -			dev_printk(KERN_DEBUG, &dev->dev,
> -				 "resource %d %pR released\n", idx, r);
> -			/* keep the old size */
> -			r->end = resource_size(r) - 1;
> -			r->start = 0;
> -			r->flags = 0;
> -			changed = true;
> -		}
> -	}
> +	b_res = &dev->resource[PCI_BRIDGE_RESOURCES];
> +
> +	/*
> +	 *     1. if there is io port assign fail, will release bridge
> +	 *	  io port.
> +	 *     2. if there is non pref mmio assign fail, release bridge
> +	 *	  nonpref mmio.
> +	 *     3. if there is 64bit pref mmio assign fail, and bridge pref
> +	 *	  is 64bit, release bridge pref mmio.
> +	 *     4. if there is pref mmio assign fail, and bridge pref is
> +	 *	  32bit mmio, release bridge pref mmio
> +	 *     5. if there is pref mmio assign fail, and bridge pref is not
> +	 *	  assigned, release bridge nonpref mmio.
> +	 */
> +	if (type & IORESOURCE_IO)
> +		idx = 0;
> +	else if (!(type & IORESOURCE_PREFETCH))
> +		idx = 1;
> +	else if ((type & IORESOURCE_MEM_64) &&
> +		 (b_res[2].flags & IORESOURCE_MEM_64))
> +		idx = 2;
> +	else if (!(b_res[2].flags & IORESOURCE_MEM_64) &&
> +		 (b_res[2].flags & IORESOURCE_PREFETCH))
> +		idx = 2;
> +	else
> +		idx = 1;

This is pretty hard to figure out.  Is there any way we can just pass in
the bridge window identifier directly, e.g., 0 == I/O window, 1 ==
non-prefetchable MEM window, 2 == prefetchable window?  It seems like way
up in pci_assign_unassigned_root_bus_resources(), we ought to know which
window failed, and we could just pass that in somehow instead of having to
derive it here.

> +
> +	r = &b_res[idx];
> +
> +	if (!r->parent)
> +		return;
> +
> +	/*
> +	 * if there are children under that, we should release them
> +	 *  all
> +	 */
> +	release_child_resources(r);
> +	if (!release_resource(r)) {
> +		type = old_flags = r->flags & type_mask;
> +		dev_printk(KERN_DEBUG, &dev->dev, "resource %d %pR released\n",
> +					PCI_BRIDGE_RESOURCES + idx, r);
> +		/* keep the old size */
> +		r->end = resource_size(r) - 1;
> +		r->start = 0;
> +		r->flags = 0;
>  
> -	if (changed) {
>  		/* avoiding touch the one without PREF */
>  		if (type & IORESOURCE_PREFETCH)
>  			type = IORESOURCE_PREFETCH;
>  		__pci_setup_bridge(bus, type);
> +		/* for next child res under same bridge */
> +		r->flags = old_flags;
>  	}
>  }
>  
> @@ -1471,7 +1519,7 @@ void pci_assign_unassigned_root_bus_reso
>  	LIST_HEAD(fail_head);
>  	struct pci_dev_resource *fail_res;
>  	unsigned long type_mask = IORESOURCE_IO | IORESOURCE_MEM |
> -				  IORESOURCE_PREFETCH;
> +				  IORESOURCE_PREFETCH | IORESOURCE_MEM_64;
>  	int pci_try_num = 1;
>  	enum enable_type enable_local;
>  
> Index: linux-2.6/drivers/pci/setup-res.c
> ===================================================================
> --- linux-2.6.orig/drivers/pci/setup-res.c
> +++ linux-2.6/drivers/pci/setup-res.c
> @@ -208,15 +208,31 @@ static int __pci_assign_resource(struct
>  
>  	/* First, try exact prefetching match.. */
>  	ret = pci_bus_alloc_resource(bus, res, size, align, min,
> -				     IORESOURCE_PREFETCH,
> +				     IORESOURCE_PREFETCH | IORESOURCE_MEM_64,
>  				     pcibios_align_resource, dev);
>  
> -	if (ret < 0 && (res->flags & IORESOURCE_PREFETCH)) {
> +	if (ret < 0 &&
> +	    (res->flags & (IORESOURCE_PREFETCH | IORESOURCE_MEM_64)) ==
> +	     (IORESOURCE_PREFETCH | IORESOURCE_MEM_64)) {
> +		/*
> +		 * That failed.
> +		 *
> +		 * Try below 4g pref
> +		 */
> +		ret = pci_bus_alloc_resource(bus, res, size, align, min,
> +					     IORESOURCE_PREFETCH,
> +					     pcibios_align_resource, dev);
> +	}

pci_bus_alloc_from_region() already has a mechanism for restricting
allocation to certain bus addresses.  I don't like this second mechanism of
also using IORESOURCE_MEM_64.  There's too much implicit state -- I think
there's a dependency here on how we manage the IORESOURCE_MEM_64 bit in the
bridge window resources.  That makes this code hard to understand.

> +
> +	if (ret < 0 &&
> +	    (res->flags & (IORESOURCE_PREFETCH | IORESOURCE_MEM_64))) {
>  		/*
>  		 * That failed.
>  		 *
>  		 * But a prefetching area can handle a non-prefetching
>  		 * window (it will just not perform as well).
> +		 *
> +		 * Also can put 64bit under 32bit range. (below 4g).
>  		 */
>  		ret = pci_bus_alloc_resource(bus, res, size, align, min, 0,
>  					     pcibios_align_resource, dev);
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Yinghai Lu April 17, 2014, 4:20 a.m. UTC | #19
On Wed, Apr 16, 2014 at 3:11 PM, Bjorn Helgaas <bhelgaas@google.com> wrote:
>
> I spent quite a bit of time trying to understand this code yesterday, but
> I failed.  Comments might help, but more importantly, it needs to be
> simplified so the code makes sense even without comments.

will add some comments.

>> +     b_res = &dev->resource[PCI_BRIDGE_RESOURCES];
>> +
>> +     /*
>> +      *     1. if there is io port assign fail, will release bridge
>> +      *        io port.
>> +      *     2. if there is non pref mmio assign fail, release bridge
>> +      *        nonpref mmio.
>> +      *     3. if there is 64bit pref mmio assign fail, and bridge pref
>> +      *        is 64bit, release bridge pref mmio.
>> +      *     4. if there is pref mmio assign fail, and bridge pref is
>> +      *        32bit mmio, release bridge pref mmio
>> +      *     5. if there is pref mmio assign fail, and bridge pref is not
>> +      *        assigned, release bridge nonpref mmio.
>> +      */
>> +     if (type & IORESOURCE_IO)
>> +             idx = 0;
>> +     else if (!(type & IORESOURCE_PREFETCH))
>> +             idx = 1;
>> +     else if ((type & IORESOURCE_MEM_64) &&
>> +              (b_res[2].flags & IORESOURCE_MEM_64))
>> +             idx = 2;
>> +     else if (!(b_res[2].flags & IORESOURCE_MEM_64) &&
>> +              (b_res[2].flags & IORESOURCE_PREFETCH))
>> +             idx = 2;
>> +     else
>> +             idx = 1;
>
> This is pretty hard to figure out.  Is there any way we can just pass in
> the bridge window identifier directly, e.g., 0 == I/O window, 1 ==
> non-prefetchable MEM window, 2 == prefetchable window?  It seems like way
> up in pci_assign_unassigned_root_bus_resources(), we ought to know which
> window failed, and we could just pass that in somehow instead of having to
> derive it here.

that is for find out which up-level could be scratched...
and do not know exact which one should be fit as twisted mmio-pref under mmio.

>>
>> -     if (ret < 0 && (res->flags & IORESOURCE_PREFETCH)) {
>> +     if (ret < 0 &&
>> +         (res->flags & (IORESOURCE_PREFETCH | IORESOURCE_MEM_64)) ==
>> +          (IORESOURCE_PREFETCH | IORESOURCE_MEM_64)) {
>> +             /*
>> +              * That failed.
>> +              *
>> +              * Try below 4g pref
>> +              */
>> +             ret = pci_bus_alloc_resource(bus, res, size, align, min,
>> +                                          IORESOURCE_PREFETCH,
>> +                                          pcibios_align_resource, dev);
>> +     }
>
> pci_bus_alloc_from_region() already has a mechanism for restricting
> allocation to certain bus addresses.  I don't like this second mechanism of
> also using IORESOURCE_MEM_64.  There's too much implicit state -- I think
> there's a dependency here on how we manage the IORESOURCE_MEM_64 bit in the
> bridge window resources.  That makes this code hard to understand.
>

will drop that.

Yinghai
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Yinghai Lu April 17, 2014, 4:35 p.m. UTC | #20
On Wed, Apr 16, 2014 at 9:20 PM, Yinghai Lu <yinghai@kernel.org> wrote:
> On Wed, Apr 16, 2014 at 3:11 PM, Bjorn Helgaas <bhelgaas@google.com> wrote:
>
>>>
>>> -     if (ret < 0 && (res->flags & IORESOURCE_PREFETCH)) {
>>> +     if (ret < 0 &&
>>> +         (res->flags & (IORESOURCE_PREFETCH | IORESOURCE_MEM_64)) ==
>>> +          (IORESOURCE_PREFETCH | IORESOURCE_MEM_64)) {
>>> +             /*
>>> +              * That failed.
>>> +              *
>>> +              * Try below 4g pref
>>> +              */
>>> +             ret = pci_bus_alloc_resource(bus, res, size, align, min,
>>> +                                          IORESOURCE_PREFETCH,
>>> +                                          pcibios_align_resource, dev);
>>> +     }
>>
>> pci_bus_alloc_from_region() already has a mechanism for restricting
>> allocation to certain bus addresses.  I don't like this second mechanism of
>> also using IORESOURCE_MEM_64.  There's too much implicit state -- I think
>> there's a dependency here on how we manage the IORESOURCE_MEM_64 bit in the
>> bridge window resources.  That makes this code hard to understand.
>>
>
> will drop that.

After more checking, found out that we still need the change.
Reason:
We size pref,mmio64 at first, and then put pref without 64 under nonpref.

If bridge have pref/mmio64, and mmio32, children have pref/mmio64, and
pref/mmio32.
then during sizing: children pref/mmio64 is under bridge pref/mmio64,
children pref/mmio32
is under bridge mmio32.

If during allocation, bridge pref/mmio64 still could get allocation under 4G.
then without this change we could put children pref/mmio32 under that.
could be ok, we would put children pref/mm64 under bridge mmio32.
But in some case size could be different, then could be fail if bridge
mmio32 is small then bridge pref/mmio64.

Please check latest v9, with comments change.

Thanks

Yinghai
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

Index: linux-2.6/drivers/pci/setup-bus.c
===================================================================
--- linux-2.6.orig/drivers/pci/setup-bus.c
+++ linux-2.6/drivers/pci/setup-bus.c
@@ -713,12 +713,11 @@  static void pci_bridge_check_ranges(stru
    bus resource of a given type. Note: we intentionally skip
    the bus resources which have already been assigned (that is,
    have non-NULL parent resource). */
-static struct resource *find_free_bus_resource(struct pci_bus *bus, unsigned long type)
+static struct resource *find_free_bus_resource(struct pci_bus *bus,
+			 unsigned long type_mask, unsigned long type)
 {
 	int i;
 	struct resource *r;
-	unsigned long type_mask = IORESOURCE_IO | IORESOURCE_MEM |
-				  IORESOURCE_PREFETCH;
 
 	pci_bus_for_each_resource(bus, r, i) {
 		if (r == &ioport_resource || r == &iomem_resource)
@@ -815,7 +814,8 @@  static void pbus_size_io(struct pci_bus
 		resource_size_t add_size, struct list_head *realloc_head)
 {
 	struct pci_dev *dev;
-	struct resource *b_res = find_free_bus_resource(bus, IORESOURCE_IO);
+	struct resource *b_res = find_free_bus_resource(bus, IORESOURCE_IO,
+							IORESOURCE_IO);
 	resource_size_t size = 0, size0 = 0, size1 = 0;
 	resource_size_t children_add_size = 0;
 	resource_size_t min_align, align;
@@ -915,15 +915,17 @@  static inline resource_size_t calculate_
  * guarantees that all child resources fit in this size.
  */
 static int pbus_size_mem(struct pci_bus *bus, unsigned long mask,
-			 unsigned long type, resource_size_t min_size,
-			resource_size_t add_size,
-			struct list_head *realloc_head)
+			 unsigned long type, unsigned long type2,
+			 unsigned long type3,
+			 resource_size_t min_size, resource_size_t add_size,
+			 struct list_head *realloc_head)
 {
 	struct pci_dev *dev;
 	resource_size_t min_align, align, size, size0, size1;
 	resource_size_t aligns[12];	/* Alignments from 1Mb to 2Gb */
 	int order, max_order;
-	struct resource *b_res = find_free_bus_resource(bus, type);
+	struct resource *b_res = find_free_bus_resource(bus,
+					 mask | IORESOURCE_PREFETCH, type);
 	unsigned int mem64_mask = 0;
 	resource_size_t children_add_size = 0;
 
@@ -944,7 +946,9 @@  static int pbus_size_mem(struct pci_bus
 			struct resource *r = &dev->resource[i];
 			resource_size_t r_size;
 
-			if (r->parent || (r->flags & mask) != type)
+			if (r->parent || ((r->flags & mask) != type &&
+					  (r->flags & mask) != type2 &&
+					  (r->flags & mask) != type3))
 				continue;
 			r_size = resource_size(r);
 #ifdef CONFIG_PCI_IOV
@@ -1117,8 +1121,9 @@  void __ref __pci_bus_size_bridges(struct
 			struct list_head *realloc_head)
 {
 	struct pci_dev *dev;
-	unsigned long mask, prefmask;
+	unsigned long mask, prefmask, type2 = 0, type3 = 0;
 	resource_size_t additional_mem_size = 0, additional_io_size = 0;
+	struct resource *b_res;
 
 	list_for_each_entry(dev, &bus->devices, bus_list) {
 		struct pci_bus *b = dev->subordinate;
@@ -1163,15 +1168,34 @@  void __ref __pci_bus_size_bridges(struct
 		   has already been allocated by arch code, try
 		   non-prefetchable range for both types of PCI memory
 		   resources. */
+		b_res = &bus->self->resource[PCI_BRIDGE_RESOURCES];
 		mask = IORESOURCE_MEM;
 		prefmask = IORESOURCE_MEM | IORESOURCE_PREFETCH;
-		if (pbus_size_mem(bus, prefmask, prefmask,
+		if (b_res[2].flags & IORESOURCE_MEM_64) {
+			prefmask |= IORESOURCE_MEM_64;
+			if (pbus_size_mem(bus, prefmask, prefmask,
+				  prefmask, prefmask,
 				  realloc_head ? 0 : additional_mem_size,
-				  additional_mem_size, realloc_head))
-			mask = prefmask; /* Success, size non-prefetch only. */
-		else
-			additional_mem_size += additional_mem_size;
-		pbus_size_mem(bus, mask, IORESOURCE_MEM,
+				  additional_mem_size, realloc_head)) {
+					/* Success, size non-pref64 only. */
+					mask = prefmask;
+					type2 = prefmask & ~IORESOURCE_MEM_64;
+					type3 = prefmask & ~IORESOURCE_PREFETCH;
+			}
+		}
+		if (!type2) {
+			prefmask &= ~IORESOURCE_MEM_64;
+			if (pbus_size_mem(bus, prefmask, prefmask,
+					 prefmask, prefmask,
+					 realloc_head ? 0 : additional_mem_size,
+					 additional_mem_size, realloc_head)) {
+				/* Success, size non-prefetch only. */
+				mask = prefmask;
+			} else
+				additional_mem_size += additional_mem_size;
+			type2 = type3 = IORESOURCE_MEM;
+		}
+		pbus_size_mem(bus, mask, IORESOURCE_MEM, type2, type3,
 				realloc_head ? 0 : additional_mem_size,
 				additional_mem_size, realloc_head);
 		break;
@@ -1257,42 +1281,66 @@  static void __ref __pci_bridge_assign_re
 static void pci_bridge_release_resources(struct pci_bus *bus,
 					  unsigned long type)
 {
-	int idx;
-	bool changed = false;
-	struct pci_dev *dev;
+	struct pci_dev *dev = bus->self;
 	struct resource *r;
 	unsigned long type_mask = IORESOURCE_IO | IORESOURCE_MEM |
-				  IORESOURCE_PREFETCH;
+				  IORESOURCE_PREFETCH | IORESOURCE_MEM_64;
+	unsigned old_flags = 0;
+	struct resource *b_res;
+	int idx = 1;
 
-	dev = bus->self;
-	for (idx = PCI_BRIDGE_RESOURCES; idx <= PCI_BRIDGE_RESOURCE_END;
-	     idx++) {
-		r = &dev->resource[idx];
-		if ((r->flags & type_mask) != type)
-			continue;
-		if (!r->parent)
-			continue;
-		/*
-		 * if there are children under that, we should release them
-		 *  all
-		 */
-		release_child_resources(r);
-		if (!release_resource(r)) {
-			dev_printk(KERN_DEBUG, &dev->dev,
-				 "resource %d %pR released\n", idx, r);
-			/* keep the old size */
-			r->end = resource_size(r) - 1;
-			r->start = 0;
-			r->flags = 0;
-			changed = true;
-		}
-	}
+	b_res = &dev->resource[PCI_BRIDGE_RESOURCES];
+
+	/*
+	 *     1. if there is io port assign fail, will release bridge
+	 *	  io port.
+	 *     2. if there is non pref mmio assign fail, release bridge
+	 *	  nonpref mmio.
+	 *     3. if there is 64bit pref mmio assign fail, and bridge pref
+	 *	  is 64bit, release bridge pref mmio.
+	 *     4. if there is pref mmio assign fail, and bridge pref is
+	 *	  32bit mmio, release bridge pref mmio
+	 *     5. if there is pref mmio assign fail, and bridge pref is not
+	 *	  assigned, release bridge nonpref mmio.
+	 */
+	if (type & IORESOURCE_IO)
+		idx = 0;
+	else if (!(type & IORESOURCE_PREFETCH))
+		idx = 1;
+	else if ((type & IORESOURCE_MEM_64) &&
+		 (b_res[2].flags & IORESOURCE_MEM_64))
+		idx = 2;
+	else if (!(b_res[2].flags & IORESOURCE_MEM_64) &&
+		 (b_res[2].flags & IORESOURCE_PREFETCH))
+		idx = 2;
+	else
+		idx = 1;
+
+	r = &b_res[idx];
+
+	if (!r->parent)
+		return;
+
+	/*
+	 * if there are children under that, we should release them
+	 *  all
+	 */
+	release_child_resources(r);
+	if (!release_resource(r)) {
+		type = old_flags = r->flags & type_mask;
+		dev_printk(KERN_DEBUG, &dev->dev, "resource %d %pR released\n",
+					PCI_BRIDGE_RESOURCES + idx, r);
+		/* keep the old size */
+		r->end = resource_size(r) - 1;
+		r->start = 0;
+		r->flags = 0;
 
-	if (changed) {
 		/* avoiding touch the one without PREF */
 		if (type & IORESOURCE_PREFETCH)
 			type = IORESOURCE_PREFETCH;
 		__pci_setup_bridge(bus, type);
+		/* for next child res under same bridge */
+		r->flags = old_flags;
 	}
 }
 
@@ -1471,7 +1519,7 @@  void pci_assign_unassigned_root_bus_reso
 	LIST_HEAD(fail_head);
 	struct pci_dev_resource *fail_res;
 	unsigned long type_mask = IORESOURCE_IO | IORESOURCE_MEM |
-				  IORESOURCE_PREFETCH;
+				  IORESOURCE_PREFETCH | IORESOURCE_MEM_64;
 	int pci_try_num = 1;
 	enum enable_type enable_local;
 
Index: linux-2.6/drivers/pci/setup-res.c
===================================================================
--- linux-2.6.orig/drivers/pci/setup-res.c
+++ linux-2.6/drivers/pci/setup-res.c
@@ -208,15 +208,31 @@  static int __pci_assign_resource(struct
 
 	/* First, try exact prefetching match.. */
 	ret = pci_bus_alloc_resource(bus, res, size, align, min,
-				     IORESOURCE_PREFETCH,
+				     IORESOURCE_PREFETCH | IORESOURCE_MEM_64,
 				     pcibios_align_resource, dev);
 
-	if (ret < 0 && (res->flags & IORESOURCE_PREFETCH)) {
+	if (ret < 0 &&
+	    (res->flags & (IORESOURCE_PREFETCH | IORESOURCE_MEM_64)) ==
+	     (IORESOURCE_PREFETCH | IORESOURCE_MEM_64)) {
+		/*
+		 * That failed.
+		 *
+		 * Try below 4g pref
+		 */
+		ret = pci_bus_alloc_resource(bus, res, size, align, min,
+					     IORESOURCE_PREFETCH,
+					     pcibios_align_resource, dev);
+	}
+
+	if (ret < 0 &&
+	    (res->flags & (IORESOURCE_PREFETCH | IORESOURCE_MEM_64))) {
 		/*
 		 * That failed.
 		 *
 		 * But a prefetching area can handle a non-prefetching
 		 * window (it will just not perform as well).
+		 *
+		 * Also can put 64bit under 32bit range. (below 4g).
 		 */
 		ret = pci_bus_alloc_resource(bus, res, size, align, min, 0,
 					     pcibios_align_resource, dev);