diff mbox series

[v4,1/2] PCI: Take other bus devices into account when distributing resources

Message ID 20230104091635.63331-2-mika.westerberg@linux.intel.com (mailing list archive)
State Superseded
Headers show
Series PCI: distribute resources for root buses | expand

Commit Message

Mika Westerberg Jan. 4, 2023, 9:16 a.m. UTC
A PCI bridge may reside on a bus with other devices as well. The
resource distribution code does not take this into account properly and
therefore it expands the bridge resource windows too much, not leaving
space for the other devices (or functions of a multifunction device) and
this leads to an issue that Jonathan reported. He runs QEMU with the
following topology (QEMU parameters):

 -device pcie-root-port,port=0,id=root_port13,chassis=0,slot=2  \
 -device x3130-upstream,id=sw1,bus=root_port13,multifunction=on \
 -device e1000,bus=root_port13,addr=0.1                         \
 -device xio3130-downstream,id=fun1,bus=sw1,chassis=0,slot=3    \
 -device e1000,bus=fun1

The first e1000 NIC here is another function in the switch upstream
port. This leads to following errors:

  pci 0000:00:04.0: bridge window [mem 0x10200000-0x103fffff] to [bus 02-04]
  pci 0000:02:00.0: bridge window [mem 0x10200000-0x103fffff] to [bus 03-04]
  pci 0000:02:00.1: BAR 0: failed to assign [mem size 0x00020000]
  e1000 0000:02:00.1: can't ioremap BAR 0: [??? 0x00000000 flags 0x0]

Fix this by taking into account the device BARs on the bus, including
the ones belonging to bridges themselves.

Link: https://lore.kernel.org/linux-pci/20221014124553.0000696f@huawei.com/
Link: https://lore.kernel.org/linux-pci/6053736d-1923-41e7-def9-7585ce1772d9@ixsystems.com/
Reported-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reported-by: Alexander Motin <mav@ixsystems.com>
Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
---
 drivers/pci/setup-bus.c | 205 +++++++++++++++++++++++++---------------
 1 file changed, 129 insertions(+), 76 deletions(-)

Comments

Bjorn Helgaas Jan. 4, 2023, 10:48 p.m. UTC | #1
On Wed, Jan 04, 2023 at 11:16:34AM +0200, Mika Westerberg wrote:
> A PCI bridge may reside on a bus with other devices as well. The
> resource distribution code does not take this into account properly and
> therefore it expands the bridge resource windows too much, not leaving
> space for the other devices (or functions of a multifunction device) and
> this leads to an issue that Jonathan reported. He runs QEMU with the
> following topology (QEMU parameters):
> 
>  -device pcie-root-port,port=0,id=root_port13,chassis=0,slot=2  \
>  -device x3130-upstream,id=sw1,bus=root_port13,multifunction=on \
>  -device e1000,bus=root_port13,addr=0.1                         \
>  -device xio3130-downstream,id=fun1,bus=sw1,chassis=0,slot=3    \
>  -device e1000,bus=fun1
> 
> The first e1000 NIC here is another function in the switch upstream
> port. This leads to following errors:
> 
>   pci 0000:00:04.0: bridge window [mem 0x10200000-0x103fffff] to [bus 02-04]
>   pci 0000:02:00.0: bridge window [mem 0x10200000-0x103fffff] to [bus 03-04]
>   pci 0000:02:00.1: BAR 0: failed to assign [mem size 0x00020000]
>   e1000 0000:02:00.1: can't ioremap BAR 0: [??? 0x00000000 flags 0x0]
> 
> Fix this by taking into account the device BARs on the bus, including
> the ones belonging to bridges themselves.

I think it accounts for bridge windows (and VF BARs from SR-IOV
Capabilities) as well as regular BARs, doesn't it?

> Link: https://lore.kernel.org/linux-pci/20221014124553.0000696f@huawei.com/
> Link: https://lore.kernel.org/linux-pci/6053736d-1923-41e7-def9-7585ce1772d9@ixsystems.com/
> Reported-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Reported-by: Alexander Motin <mav@ixsystems.com>
> Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
> ---
>  drivers/pci/setup-bus.c | 205 +++++++++++++++++++++++++---------------
>  1 file changed, 129 insertions(+), 76 deletions(-)
> 
> diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
> index b4096598dbcb..cf6a7bdf2427 100644
> --- a/drivers/pci/setup-bus.c
> +++ b/drivers/pci/setup-bus.c
> @@ -1750,6 +1750,7 @@ static void adjust_bridge_window(struct pci_dev *bridge, struct resource *res,
>  				 resource_size_t new_size)
>  {
>  	resource_size_t add_size, size = resource_size(res);
> +	resource_size_t min_size;
>  
>  	if (res->parent)
>  		return;
> @@ -1757,30 +1758,87 @@ static void adjust_bridge_window(struct pci_dev *bridge, struct resource *res,
>  	if (!new_size)
>  		return;
>  
> +	/* Minimum granularity of a bridge bridge window */

bridge bridge (fixed in my copy)

> +	min_size = window_alignment(bridge->bus, res->flags);
> +
>  	if (new_size > size) {
>  		add_size = new_size - size;
> +		if (add_size < min_size)
> +			return;
>  		pci_dbg(bridge, "bridge window %pR extended by %pa\n", res,
>  			&add_size);
>  	} else if (new_size < size) {
>  		add_size = size - new_size;
> +		if (add_size < min_size)
> +			return;
>  		pci_dbg(bridge, "bridge window %pR shrunken by %pa\n", res,
>  			&add_size);
> +	} else {
> +		return;
>  	}
>  
>  	res->end = res->start + new_size - 1;
>  	remove_from_list(add_list, res);
>  }
>  
> +static void reduce_dev_resources(struct pci_dev *dev, struct resource *io,
> +				 struct resource *mmio,
> +				 struct resource *mmio_pref)

IIUC, this is removing the space consumed by "dev" from io, mmio,
mmio_pref.

> +{
> +	int i;
> +
> +	for (i = 0; i < PCI_NUM_RESOURCES; i++) {
> +		struct resource *res = &dev->resource[i];
> +		resource_size_t align, tmp, size;
> +
> +		size = resource_size(res);
> +		if (!size)
> +			continue;
> +
> +		align = pci_resource_alignment(dev, res);
> +
> +		if (resource_type(res) == IORESOURCE_IO) {
> +			align = align ? ALIGN(io->start, align) - io->start : 0;
> +			tmp = align + size;
> +			io->start = min(io->start + tmp, io->end + 1);

Do you think it would be worth a helper to do this repeated
calculation?  AFAICT they're all the same.

  static void remove_dev_res(struct resource *avail, struct pci_dev *dev, struct resource *res)
  {
    size = resource_size(res);
    if (!size)
      return;

    align = pci_resource_alignment(dev, res);
    align = align ? ALIGN(avail->start, align) - avail->start : 0;
    tmp = align + size;
    avail->start = min(avail->start + tmp, avail->end + 1);
  }

  if (resource_type(res) == IORESOURCE_IO)
    remove_dev_res(io, dev, res);
  else if (resource_type(res) == IORESOURCE_MEM) {
    if (res->flags & IORESOURCE_PREFETCH)
      remove_dev_res(mmio_pref, dev, res);
    else
      remove_dev_res(mmio, dev, res);
  }

I don't quite understand how it works.  What's the case where "align"
can be zero?  Will we remove VF BARs twice, once while processing the
PF (where pf->resource[PCI_IOV_RESOURCES] contains space for BAR 0 of
all the VFs), and again while processing the VFs?

> +		} else if (resource_type(res) == IORESOURCE_MEM) {
> +			if (res->flags & IORESOURCE_PREFETCH) {
> +				align = align ? ALIGN(mmio_pref->start, align) -
> +						mmio_pref->start : 0;
> +				tmp = align + size;
> +				mmio_pref->start = min(mmio_pref->start + tmp,
> +						       mmio_pref->end + 1);
> +			} else {
> +				align = align ? ALIGN(mmio->start, align) -
> +						mmio->start : 0;
> +				tmp = align + size;
> +				mmio->start = min(mmio->start + tmp,
> +						  mmio->end + 1);
> +			}
> +		}
> +	}
> +}
> +
> +/*
> + * io, mmio and mmio_pref contain the total amount of bridge window
> + * space available. This includes the minimal space needed to cover all
> + * the existing devices on the bus and the possible extra space that can
> + * be shared with the bridges.

So "mmio" is space available on "bus", i.e., this is a window that
bus->self forwards to "bus" and is available for devices on "bus"?

This space would be available for devices on "bus", i.e., for BARs of
those devices and windows of any bridges to devices below "bus"?

> + * The resource space consumed by bus->self (the bridge) is already
> + * reduced.

I don't quite get what this is telling me.  If "mmio" is space
forwarded to "bus" by bus->self, what does it mean for it to be
reduced?  If bus->self forwards it, the entire space should be
available on "bus".

> + */
>  static void pci_bus_distribute_available_resources(struct pci_bus *bus,
>  					    struct list_head *add_list,
>  					    struct resource io,
>  					    struct resource mmio,
>  					    struct resource mmio_pref)
>  {
> +	resource_size_t io_align, mmio_align, mmio_pref_align;
> +	resource_size_t io_per_b, mmio_per_b, mmio_pref_per_b;
>  	unsigned int normal_bridges = 0, hotplug_bridges = 0;
>  	struct resource *io_res, *mmio_res, *mmio_pref_res;
>  	struct pci_dev *dev, *bridge = bus->self;
> -	resource_size_t io_per_hp, mmio_per_hp, mmio_pref_per_hp, align;
>  
>  	io_res = &bridge->resource[PCI_BRIDGE_IO_WINDOW];
>  	mmio_res = &bridge->resource[PCI_BRIDGE_MEM_WINDOW];
> @@ -1790,17 +1848,17 @@ static void pci_bus_distribute_available_resources(struct pci_bus *bus,
>  	 * The alignment of this bridge is yet to be considered, hence it must
>  	 * be done now before extending its bridge window.
>  	 */
> -	align = pci_resource_alignment(bridge, io_res);
> -	if (!io_res->parent && align)
> -		io.start = min(ALIGN(io.start, align), io.end + 1);
> +	io_align = pci_resource_alignment(bridge, io_res);
> +	if (!io_res->parent && io_align)
> +		io.start = min(ALIGN(io.start, io_align), io.end + 1);
>  
> -	align = pci_resource_alignment(bridge, mmio_res);
> -	if (!mmio_res->parent && align)
> -		mmio.start = min(ALIGN(mmio.start, align), mmio.end + 1);
> +	mmio_align = pci_resource_alignment(bridge, mmio_res);
> +	if (!mmio_res->parent && mmio_align)
> +		mmio.start = min(ALIGN(mmio.start, mmio_align), mmio.end + 1);
>  
> -	align = pci_resource_alignment(bridge, mmio_pref_res);
> -	if (!mmio_pref_res->parent && align)
> -		mmio_pref.start = min(ALIGN(mmio_pref.start, align),
> +	mmio_pref_align = pci_resource_alignment(bridge, mmio_pref_res);
> +	if (!mmio_pref_res->parent && mmio_pref_align)
> +		mmio_pref.start = min(ALIGN(mmio_pref.start, mmio_pref_align),
>  			mmio_pref.end + 1);

It looks like this hunk and the corresponding "Make sure the split
resource space is properly aligned" one below could conceivably fix a
separate problem all by themselves?  If so, could be split to a
separate patch?

>  	/*
> @@ -1824,94 +1882,89 @@ static void pci_bus_distribute_available_resources(struct pci_bus *bus,
>  			normal_bridges++;
>  	}
>  
> -	/*
> -	 * There is only one bridge on the bus so it gets all available
> -	 * resources which it can then distribute to the possible hotplug
> -	 * bridges below.
> -	 */
> -	if (hotplug_bridges + normal_bridges == 1) {
> -		dev = list_first_entry(&bus->devices, struct pci_dev, bus_list);
> -		if (dev->subordinate)
> -			pci_bus_distribute_available_resources(dev->subordinate,
> -				add_list, io, mmio, mmio_pref);
> -		return;
> -	}
> -
> -	if (hotplug_bridges == 0)
> +	if (!(hotplug_bridges + normal_bridges))
>  		return;
>  
>  	/*
>  	 * Calculate the total amount of extra resource space we can
> -	 * pass to bridges below this one.  This is basically the
> -	 * extra space reduced by the minimal required space for the
> -	 * non-hotplug bridges.
> +	 * pass to bridges below this one.

The "space we can pass to bridges below this one" is the existing
language, but I think it's a little too restrictive.  A bridge
forwards space to the secondary bus, where it's available for any
device, not just bridges.

Maybe:

  Calculate the amount of space we can forward from "bus" to any
  downstream buses, i.e., the space left over after assigning the BARs
  and windows on "bus".

> +	   This is basically the extra
> +	 * space reduced by the minimal required space for the bridge
> +	 * windows and device BARs on this bus.
>  	 */
> -	for_each_pci_bridge(dev, bus) {
> -		resource_size_t used_size;
> -		struct resource *res;
> -
> -		if (dev->is_hotplug_bridge)
> -			continue;
> -
> -		/*
> -		 * Reduce the available resource space by what the
> -		 * bridge and devices below it occupy.
> -		 */
> -		res = &dev->resource[PCI_BRIDGE_IO_WINDOW];
> -		align = pci_resource_alignment(dev, res);
> -		align = align ? ALIGN(io.start, align) - io.start : 0;
> -		used_size = align + resource_size(res);
> -		if (!res->parent)
> -			io.start = min(io.start + used_size, io.end + 1);
> -
> -		res = &dev->resource[PCI_BRIDGE_MEM_WINDOW];
> -		align = pci_resource_alignment(dev, res);
> -		align = align ? ALIGN(mmio.start, align) - mmio.start : 0;
> -		used_size = align + resource_size(res);
> -		if (!res->parent)
> -			mmio.start = min(mmio.start + used_size, mmio.end + 1);
> +	list_for_each_entry(dev, &bus->devices, bus_list)
> +		reduce_dev_resources(dev, &io, &mmio, &mmio_pref);
>  
> -		res = &dev->resource[PCI_BRIDGE_PREF_MEM_WINDOW];
> -		align = pci_resource_alignment(dev, res);
> -		align = align ? ALIGN(mmio_pref.start, align) -
> -			mmio_pref.start : 0;
> -		used_size = align + resource_size(res);
> -		if (!res->parent)
> -			mmio_pref.start = min(mmio_pref.start + used_size,
> -				mmio_pref.end + 1);
> +	/*
> +	 * If there is at least one hotplug bridge on this bus it gets
> +	 * all the extra resource space that was left after the
> +	 * reductions above.
> +	 *
> +	 * If there are no hotplug bridges the extra resource space is
> +	 * split between non-hotplug bridges. This is to allow possible
> +	 * hotplug bridges below them to get the extra space as well.

I guess "hotplug_bridges" only counts bridges directly on "bus",
right?  I.e., we don't look deeper in the hierarchy for downstream
hotplug bridges?

What happens in a topology like this:

  10:00.0 non-hotplug bridge to [bus 20-3f]
  10:01.0 non-hotplug bridge to [bus 40]
  20:00.0 hotplug bridge
  40:00.0 NIC

where we're distributing space on "bus" 10, hotplug_bridges == 0 and
normal_bridges == 2?  Do we give half the extra space to bus 20 and
the other half to bus 40, even though we could tell up front that bus
20 is the only place that can actually use any extra space?

> +	 */
> +	if (hotplug_bridges) {
> +		io_per_b = div64_ul(resource_size(&io), hotplug_bridges);
> +		mmio_per_b = div64_ul(resource_size(&mmio), hotplug_bridges);
> +		mmio_pref_per_b = div64_ul(resource_size(&mmio_pref),
> +					   hotplug_bridges);
> +	} else {
> +		io_per_b = div64_ul(resource_size(&io), normal_bridges);
> +		mmio_per_b = div64_ul(resource_size(&mmio), normal_bridges);
> +		mmio_pref_per_b = div64_ul(resource_size(&mmio_pref),
> +					   normal_bridges);
>  	}
>  
> -	io_per_hp = div64_ul(resource_size(&io), hotplug_bridges);
> -	mmio_per_hp = div64_ul(resource_size(&mmio), hotplug_bridges);
> -	mmio_pref_per_hp = div64_ul(resource_size(&mmio_pref),
> -		hotplug_bridges);
> -
>  	/*
> -	 * Go over devices on this bus and distribute the remaining
> -	 * resource space between hotplug bridges.
> +	 * Make sure the split resource space is properly aligned for
> +	 * bridge windows (align it down to avoid going above what is
> +	 * available).
>  	 */
> +	if (io_align)
> +		io_per_b = ALIGN_DOWN(io_per_b, io_align);
> +	if (mmio_align)
> +		mmio_per_b = ALIGN_DOWN(mmio_per_b, mmio_align);
> +	if (mmio_pref_align)
> +		mmio_pref_per_b = ALIGN_DOWN(mmio_pref_per_b, mmio_align);
> +
>  	for_each_pci_bridge(dev, bus) {
> +		resource_size_t allocated_io, allocated_mmio, allocated_mmio_pref;
> +		const struct resource *res;
>  		struct pci_bus *b;
>  
>  		b = dev->subordinate;
> -		if (!b || !dev->is_hotplug_bridge)
> +		if (!b)
> +			continue;
> +		if (hotplug_bridges && !dev->is_hotplug_bridge)
>  			continue;
>  
> +		io.end = io.start + io_per_b - 1;
>  		/*
> -		 * Distribute available extra resources equally between
> -		 * hotplug-capable downstream ports taking alignment into
> -		 * account.
> +		 * The x_per_b holds the extra resource space that can
> +		 * be added for each bridge but there is the minimal
> +		 * already reserved as well so adjust x.start down
> +		 * accordingly to cover the whole space.
>  		 */
> -		io.end = io.start + io_per_hp - 1;
> -		mmio.end = mmio.start + mmio_per_hp - 1;
> -		mmio_pref.end = mmio_pref.start + mmio_pref_per_hp - 1;
> +		res = &dev->resource[PCI_BRIDGE_IO_WINDOW];
> +		allocated_io = resource_size(res);
> +		io.start -= allocated_io;
> +
> +		mmio.end = mmio.start + mmio_per_b - 1;
> +		res = &dev->resource[PCI_BRIDGE_MEM_WINDOW];
> +		allocated_mmio = resource_size(res);
> +		mmio.start -= allocated_mmio;
> +
> +		mmio_pref.end = mmio_pref.start + mmio_pref_per_b - 1;
> +		res = &dev->resource[PCI_BRIDGE_PREF_MEM_WINDOW];
> +		allocated_mmio_pref = resource_size(res);
> +		mmio_pref.start -= allocated_mmio_pref;
>  
>  		pci_bus_distribute_available_resources(b, add_list, io, mmio,
>  						       mmio_pref);
>  
> -		io.start += io_per_hp;
> -		mmio.start += mmio_per_hp;
> -		mmio_pref.start += mmio_pref_per_hp;
> +		io.start += allocated_io + io_per_b;
> +		mmio.start += allocated_mmio + mmio_per_b;
> +		mmio_pref.start += allocated_mmio_pref + mmio_pref_per_b;
>  	}
>  }
>  
> -- 
> 2.35.1
>
Mika Westerberg Jan. 5, 2023, 9:12 a.m. UTC | #2
Hi,

On Wed, Jan 04, 2023 at 04:48:09PM -0600, Bjorn Helgaas wrote:
> On Wed, Jan 04, 2023 at 11:16:34AM +0200, Mika Westerberg wrote:
> > A PCI bridge may reside on a bus with other devices as well. The
> > resource distribution code does not take this into account properly and
> > therefore it expands the bridge resource windows too much, not leaving
> > space for the other devices (or functions of a multifunction device) and
> > this leads to an issue that Jonathan reported. He runs QEMU with the
> > following topology (QEMU parameters):
> > 
> >  -device pcie-root-port,port=0,id=root_port13,chassis=0,slot=2  \
> >  -device x3130-upstream,id=sw1,bus=root_port13,multifunction=on \
> >  -device e1000,bus=root_port13,addr=0.1                         \
> >  -device xio3130-downstream,id=fun1,bus=sw1,chassis=0,slot=3    \
> >  -device e1000,bus=fun1
> > 
> > The first e1000 NIC here is another function in the switch upstream
> > port. This leads to following errors:
> > 
> >   pci 0000:00:04.0: bridge window [mem 0x10200000-0x103fffff] to [bus 02-04]
> >   pci 0000:02:00.0: bridge window [mem 0x10200000-0x103fffff] to [bus 03-04]
> >   pci 0000:02:00.1: BAR 0: failed to assign [mem size 0x00020000]
> >   e1000 0000:02:00.1: can't ioremap BAR 0: [??? 0x00000000 flags 0x0]
> > 
> > Fix this by taking into account the device BARs on the bus, including
> > the ones belonging to bridges themselves.
> 
> I think it accounts for bridge windows (and VF BARs from SR-IOV
> Capabilities) as well as regular BARs, doesn't it?

Yes. Do you want me to mention them all here as well?

> > Link: https://lore.kernel.org/linux-pci/20221014124553.0000696f@huawei.com/
> > Link: https://lore.kernel.org/linux-pci/6053736d-1923-41e7-def9-7585ce1772d9@ixsystems.com/
> > Reported-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> > Reported-by: Alexander Motin <mav@ixsystems.com>
> > Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
> > ---
> >  drivers/pci/setup-bus.c | 205 +++++++++++++++++++++++++---------------
> >  1 file changed, 129 insertions(+), 76 deletions(-)
> > 
> > diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
> > index b4096598dbcb..cf6a7bdf2427 100644
> > --- a/drivers/pci/setup-bus.c
> > +++ b/drivers/pci/setup-bus.c
> > @@ -1750,6 +1750,7 @@ static void adjust_bridge_window(struct pci_dev *bridge, struct resource *res,
> >  				 resource_size_t new_size)
> >  {
> >  	resource_size_t add_size, size = resource_size(res);
> > +	resource_size_t min_size;
> >  
> >  	if (res->parent)
> >  		return;
> > @@ -1757,30 +1758,87 @@ static void adjust_bridge_window(struct pci_dev *bridge, struct resource *res,
> >  	if (!new_size)
> >  		return;
> >  
> > +	/* Minimum granularity of a bridge bridge window */
> 
> bridge bridge (fixed in my copy)

Right, thanks.

> > +	min_size = window_alignment(bridge->bus, res->flags);
> > +
> >  	if (new_size > size) {
> >  		add_size = new_size - size;
> > +		if (add_size < min_size)
> > +			return;
> >  		pci_dbg(bridge, "bridge window %pR extended by %pa\n", res,
> >  			&add_size);
> >  	} else if (new_size < size) {
> >  		add_size = size - new_size;
> > +		if (add_size < min_size)
> > +			return;
> >  		pci_dbg(bridge, "bridge window %pR shrunken by %pa\n", res,
> >  			&add_size);
> > +	} else {
> > +		return;
> >  	}
> >  
> >  	res->end = res->start + new_size - 1;
> >  	remove_from_list(add_list, res);
> >  }
> >  
> > +static void reduce_dev_resources(struct pci_dev *dev, struct resource *io,
> > +				 struct resource *mmio,
> > +				 struct resource *mmio_pref)
> 
> IIUC, this is removing the space consumed by "dev" from io, mmio,
> mmio_pref.

Yes.

> > +{
> > +	int i;
> > +
> > +	for (i = 0; i < PCI_NUM_RESOURCES; i++) {
> > +		struct resource *res = &dev->resource[i];
> > +		resource_size_t align, tmp, size;
> > +
> > +		size = resource_size(res);
> > +		if (!size)
> > +			continue;
> > +
> > +		align = pci_resource_alignment(dev, res);
> > +
> > +		if (resource_type(res) == IORESOURCE_IO) {
> > +			align = align ? ALIGN(io->start, align) - io->start : 0;
> > +			tmp = align + size;
> > +			io->start = min(io->start + tmp, io->end + 1);
> 
> Do you think it would be worth a helper to do this repeated
> calculation?  AFAICT they're all the same.
> 
>   static void remove_dev_res(struct resource *avail, struct pci_dev *dev, struct resource *res)
>   {
>     size = resource_size(res);
>     if (!size)
>       return;
> 
>     align = pci_resource_alignment(dev, res);
>     align = align ? ALIGN(avail->start, align) - avail->start : 0;
>     tmp = align + size;
>     avail->start = min(avail->start + tmp, avail->end + 1);
>   }
> 
>   if (resource_type(res) == IORESOURCE_IO)
>     remove_dev_res(io, dev, res);
>   else if (resource_type(res) == IORESOURCE_MEM) {
>     if (res->flags & IORESOURCE_PREFETCH)
>       remove_dev_res(mmio_pref, dev, res);
>     else
>       remove_dev_res(mmio, dev, res);
>   }

Yes, makes sense.

> I don't quite understand how it works.  What's the case where "align"
> can be zero?  Will we remove VF BARs twice, once while processing the
> PF (where pf->resource[PCI_IOV_RESOURCES] contains space for BAR 0 of
> all the VFs), and again while processing the VFs?

pci_resource_alignment() returns 0 if the resource does not have align
set (start/size align). At least this may happen for bridge windows if
they are not yet "sized" by pbus_size_mem()/io(). This is the original
code that I moved here.

Regarding VFs, that's a good question. Should we skip the VFs here and
only account for the PF?

> > +		} else if (resource_type(res) == IORESOURCE_MEM) {
> > +			if (res->flags & IORESOURCE_PREFETCH) {
> > +				align = align ? ALIGN(mmio_pref->start, align) -
> > +						mmio_pref->start : 0;
> > +				tmp = align + size;
> > +				mmio_pref->start = min(mmio_pref->start + tmp,
> > +						       mmio_pref->end + 1);
> > +			} else {
> > +				align = align ? ALIGN(mmio->start, align) -
> > +						mmio->start : 0;
> > +				tmp = align + size;
> > +				mmio->start = min(mmio->start + tmp,
> > +						  mmio->end + 1);
> > +			}
> > +		}
> > +	}
> > +}
> > +
> > +/*
> > + * io, mmio and mmio_pref contain the total amount of bridge window
> > + * space available. This includes the minimal space needed to cover all
> > + * the existing devices on the bus and the possible extra space that can
> > + * be shared with the bridges.
> 
> So "mmio" is space available on "bus", i.e., this is a window that
> bus->self forwards to "bus" and is available for devices on "bus"?

Yes.

> This space would be available for devices on "bus", i.e., for BARs of
> those devices and windows of any bridges to devices below "bus"?

Yes.

> > + * The resource space consumed by bus->self (the bridge) is already
> > + * reduced.
> 
> I don't quite get what this is telling me.  If "mmio" is space
> forwarded to "bus" by bus->self, what does it mean for it to be
> reduced?  If bus->self forwards it, the entire space should be
> available on "bus".

I mean that if the bus->self has BARs itself they are not included here
(as they belong to the bus above this on). I can drop this comment
altogether if it confuses.

> > + */
> >  static void pci_bus_distribute_available_resources(struct pci_bus *bus,
> >  					    struct list_head *add_list,
> >  					    struct resource io,
> >  					    struct resource mmio,
> >  					    struct resource mmio_pref)
> >  {
> > +	resource_size_t io_align, mmio_align, mmio_pref_align;
> > +	resource_size_t io_per_b, mmio_per_b, mmio_pref_per_b;
> >  	unsigned int normal_bridges = 0, hotplug_bridges = 0;
> >  	struct resource *io_res, *mmio_res, *mmio_pref_res;
> >  	struct pci_dev *dev, *bridge = bus->self;
> > -	resource_size_t io_per_hp, mmio_per_hp, mmio_pref_per_hp, align;
> >  
> >  	io_res = &bridge->resource[PCI_BRIDGE_IO_WINDOW];
> >  	mmio_res = &bridge->resource[PCI_BRIDGE_MEM_WINDOW];
> > @@ -1790,17 +1848,17 @@ static void pci_bus_distribute_available_resources(struct pci_bus *bus,
> >  	 * The alignment of this bridge is yet to be considered, hence it must
> >  	 * be done now before extending its bridge window.
> >  	 */
> > -	align = pci_resource_alignment(bridge, io_res);
> > -	if (!io_res->parent && align)
> > -		io.start = min(ALIGN(io.start, align), io.end + 1);
> > +	io_align = pci_resource_alignment(bridge, io_res);
> > +	if (!io_res->parent && io_align)
> > +		io.start = min(ALIGN(io.start, io_align), io.end + 1);
> >  
> > -	align = pci_resource_alignment(bridge, mmio_res);
> > -	if (!mmio_res->parent && align)
> > -		mmio.start = min(ALIGN(mmio.start, align), mmio.end + 1);
> > +	mmio_align = pci_resource_alignment(bridge, mmio_res);
> > +	if (!mmio_res->parent && mmio_align)
> > +		mmio.start = min(ALIGN(mmio.start, mmio_align), mmio.end + 1);
> >  
> > -	align = pci_resource_alignment(bridge, mmio_pref_res);
> > -	if (!mmio_pref_res->parent && align)
> > -		mmio_pref.start = min(ALIGN(mmio_pref.start, align),
> > +	mmio_pref_align = pci_resource_alignment(bridge, mmio_pref_res);
> > +	if (!mmio_pref_res->parent && mmio_pref_align)
> > +		mmio_pref.start = min(ALIGN(mmio_pref.start, mmio_pref_align),
> >  			mmio_pref.end + 1);
> 
> It looks like this hunk and the corresponding "Make sure the split
> resource space is properly aligned" one below could conceivably fix a
> separate problem all by themselves?  If so, could be split to a
> separate patch?

Sure.

> >  	/*
> > @@ -1824,94 +1882,89 @@ static void pci_bus_distribute_available_resources(struct pci_bus *bus,
> >  			normal_bridges++;
> >  	}
> >  
> > -	/*
> > -	 * There is only one bridge on the bus so it gets all available
> > -	 * resources which it can then distribute to the possible hotplug
> > -	 * bridges below.
> > -	 */
> > -	if (hotplug_bridges + normal_bridges == 1) {
> > -		dev = list_first_entry(&bus->devices, struct pci_dev, bus_list);
> > -		if (dev->subordinate)
> > -			pci_bus_distribute_available_resources(dev->subordinate,
> > -				add_list, io, mmio, mmio_pref);
> > -		return;
> > -	}
> > -
> > -	if (hotplug_bridges == 0)
> > +	if (!(hotplug_bridges + normal_bridges))
> >  		return;
> >  
> >  	/*
> >  	 * Calculate the total amount of extra resource space we can
> > -	 * pass to bridges below this one.  This is basically the
> > -	 * extra space reduced by the minimal required space for the
> > -	 * non-hotplug bridges.
> > +	 * pass to bridges below this one.
> 
> The "space we can pass to bridges below this one" is the existing
> language, but I think it's a little too restrictive.  A bridge
> forwards space to the secondary bus, where it's available for any
> device, not just bridges.
> 
> Maybe:
> 
>   Calculate the amount of space we can forward from "bus" to any
>   downstream buses, i.e., the space left over after assigning the BARs
>   and windows on "bus".

Okay.

> > +	   This is basically the extra
> > +	 * space reduced by the minimal required space for the bridge
> > +	 * windows and device BARs on this bus.
> >  	 */
> > -	for_each_pci_bridge(dev, bus) {
> > -		resource_size_t used_size;
> > -		struct resource *res;
> > -
> > -		if (dev->is_hotplug_bridge)
> > -			continue;
> > -
> > -		/*
> > -		 * Reduce the available resource space by what the
> > -		 * bridge and devices below it occupy.
> > -		 */
> > -		res = &dev->resource[PCI_BRIDGE_IO_WINDOW];
> > -		align = pci_resource_alignment(dev, res);
> > -		align = align ? ALIGN(io.start, align) - io.start : 0;
> > -		used_size = align + resource_size(res);
> > -		if (!res->parent)
> > -			io.start = min(io.start + used_size, io.end + 1);
> > -
> > -		res = &dev->resource[PCI_BRIDGE_MEM_WINDOW];
> > -		align = pci_resource_alignment(dev, res);
> > -		align = align ? ALIGN(mmio.start, align) - mmio.start : 0;
> > -		used_size = align + resource_size(res);
> > -		if (!res->parent)
> > -			mmio.start = min(mmio.start + used_size, mmio.end + 1);
> > +	list_for_each_entry(dev, &bus->devices, bus_list)
> > +		reduce_dev_resources(dev, &io, &mmio, &mmio_pref);
> >  
> > -		res = &dev->resource[PCI_BRIDGE_PREF_MEM_WINDOW];
> > -		align = pci_resource_alignment(dev, res);
> > -		align = align ? ALIGN(mmio_pref.start, align) -
> > -			mmio_pref.start : 0;
> > -		used_size = align + resource_size(res);
> > -		if (!res->parent)
> > -			mmio_pref.start = min(mmio_pref.start + used_size,
> > -				mmio_pref.end + 1);
> > +	/*
> > +	 * If there is at least one hotplug bridge on this bus it gets
> > +	 * all the extra resource space that was left after the
> > +	 * reductions above.
> > +	 *
> > +	 * If there are no hotplug bridges the extra resource space is
> > +	 * split between non-hotplug bridges. This is to allow possible
> > +	 * hotplug bridges below them to get the extra space as well.
> 
> I guess "hotplug_bridges" only counts bridges directly on "bus",
> right?  I.e., we don't look deeper in the hierarchy for downstream
> hotplug bridges?

Yes correct.

> What happens in a topology like this:
> 
>   10:00.0 non-hotplug bridge to [bus 20-3f]
>   10:01.0 non-hotplug bridge to [bus 40]
>   20:00.0 hotplug bridge
>   40:00.0 NIC
> 
> where we're distributing space on "bus" 10, hotplug_bridges == 0 and
> normal_bridges == 2?  Do we give half the extra space to bus 20 and
> the other half to bus 40, even though we could tell up front that bus
> 20 is the only place that can actually use any extra space?

Yes we split it into half.
Mika Westerberg Jan. 5, 2023, 1:43 p.m. UTC | #3
Hi,

On Thu, Jan 05, 2023 at 11:12:11AM +0200, Mika Westerberg wrote:
> > What happens in a topology like this:
> > 
> >   10:00.0 non-hotplug bridge to [bus 20-3f]
> >   10:01.0 non-hotplug bridge to [bus 40]
> >   20:00.0 hotplug bridge
> >   40:00.0 NIC
> > 
> > where we're distributing space on "bus" 10, hotplug_bridges == 0 and
> > normal_bridges == 2?  Do we give half the extra space to bus 20 and
> > the other half to bus 40, even though we could tell up front that bus
> > 20 is the only place that can actually use any extra space?
> 
> Yes we split it into half.

Forgot to reply also that would it make sense here to look at below the
non-hotplug bridges and if we find hotplug bridges, distribute the space
equally between those or something like that?
Bjorn Helgaas Jan. 5, 2023, 5:04 p.m. UTC | #4
On Thu, Jan 05, 2023 at 03:43:29PM +0200, Mika Westerberg wrote:
> On Thu, Jan 05, 2023 at 11:12:11AM +0200, Mika Westerberg wrote:
> > > What happens in a topology like this:
> > > 
> > >   10:00.0 non-hotplug bridge to [bus 20-3f]
> > >   10:01.0 non-hotplug bridge to [bus 40]
> > >   20:00.0 hotplug bridge
> > >   40:00.0 NIC
> > > 
> > > where we're distributing space on "bus" 10, hotplug_bridges == 0 and
> > > normal_bridges == 2?  Do we give half the extra space to bus 20 and
> > > the other half to bus 40, even though we could tell up front that bus
> > > 20 is the only place that can actually use any extra space?
> > 
> > Yes we split it into half.
> 
> Forgot to reply also that would it make sense here to look at below the
> non-hotplug bridges and if we find hotplug bridges, distribute the space
> equally between those or something like that?

Yes, I do think ultimately it would make sense to keep track at every
bridge whether it or any descendant is a hotplug bridge so we could
distribute extra space only to bridges that could potentially use it.

But I don't know if that needs to be done in this series.  This code
is so complicated and fragile that I think being ruthless about
defining the minimal problem we're solving and avoiding scope creep
will improve our chances of success.

So treat this as a question to improve my understanding more than
anything.

Bjorn
Mika Westerberg Jan. 9, 2023, 11:11 a.m. UTC | #5
Hi,

On Thu, Jan 05, 2023 at 11:04:13AM -0600, Bjorn Helgaas wrote:
> On Thu, Jan 05, 2023 at 03:43:29PM +0200, Mika Westerberg wrote:
> > On Thu, Jan 05, 2023 at 11:12:11AM +0200, Mika Westerberg wrote:
> > > > What happens in a topology like this:
> > > > 
> > > >   10:00.0 non-hotplug bridge to [bus 20-3f]
> > > >   10:01.0 non-hotplug bridge to [bus 40]
> > > >   20:00.0 hotplug bridge
> > > >   40:00.0 NIC
> > > > 
> > > > where we're distributing space on "bus" 10, hotplug_bridges == 0 and
> > > > normal_bridges == 2?  Do we give half the extra space to bus 20 and
> > > > the other half to bus 40, even though we could tell up front that bus
> > > > 20 is the only place that can actually use any extra space?
> > > 
> > > Yes we split it into half.
> > 
> > Forgot to reply also that would it make sense here to look at below the
> > non-hotplug bridges and if we find hotplug bridges, distribute the space
> > equally between those or something like that?
> 
> Yes, I do think ultimately it would make sense to keep track at every
> bridge whether it or any descendant is a hotplug bridge so we could
> distribute extra space only to bridges that could potentially use it.
> 
> But I don't know if that needs to be done in this series.  This code
> is so complicated and fragile that I think being ruthless about
> defining the minimal problem we're solving and avoiding scope creep
> will improve our chances of success.
> 
> So treat this as a question to improve my understanding more than
> anything.

Okay, undestood ;-)
Alexander Motin Jan. 9, 2023, 6:27 p.m. UTC | #6
On 09.01.2023 06:11, Mika Westerberg wrote:
> On Thu, Jan 05, 2023 at 11:04:13AM -0600, Bjorn Helgaas wrote:
>> On Thu, Jan 05, 2023 at 03:43:29PM +0200, Mika Westerberg wrote:
>>> On Thu, Jan 05, 2023 at 11:12:11AM +0200, Mika Westerberg wrote:
>>>>> What happens in a topology like this:
>>>>>
>>>>>    10:00.0 non-hotplug bridge to [bus 20-3f]
>>>>>    10:01.0 non-hotplug bridge to [bus 40]
>>>>>    20:00.0 hotplug bridge
>>>>>    40:00.0 NIC
>>>>>
>>>>> where we're distributing space on "bus" 10, hotplug_bridges == 0 and
>>>>> normal_bridges == 2?  Do we give half the extra space to bus 20 and
>>>>> the other half to bus 40, even though we could tell up front that bus
>>>>> 20 is the only place that can actually use any extra space?
>>>>
>>>> Yes we split it into half.
>>>
>>> Forgot to reply also that would it make sense here to look at below the
>>> non-hotplug bridges and if we find hotplug bridges, distribute the space
>>> equally between those or something like that?
>>
>> Yes, I do think ultimately it would make sense to keep track at every
>> bridge whether it or any descendant is a hotplug bridge so we could
>> distribute extra space only to bridges that could potentially use it.
>>
>> But I don't know if that needs to be done in this series.  This code
>> is so complicated and fragile that I think being ruthless about
>> defining the minimal problem we're solving and avoiding scope creep
>> will improve our chances of success.
>>
>> So treat this as a question to improve my understanding more than
>> anything.
> 
> Okay, undestood ;-)

I was also wondering about this problem.  But my first though was: if we 
are not going to look down through all the tree, may be we should better 
distribute resources between all bridges no this bus, no matter whether 
they are hot-plug or not?  Because behind any non-hotplug bridge may 
appear a hot-plug one.  As a possible trivial approach, give hot-plug 
bridges twice of what is given to non-hot-plug ones. ;)

PS: Thank you, Mika, for working on this.
Alexander Motin Jan. 9, 2023, 7:33 p.m. UTC | #7
On 04.01.2023 04:16, Mika Westerberg wrote:
> A PCI bridge may reside on a bus with other devices as well. The
> resource distribution code does not take this into account properly and
> therefore it expands the bridge resource windows too much, not leaving
> space for the other devices (or functions of a multifunction device) and
> this leads to an issue that Jonathan reported. He runs QEMU with the
> following topology (QEMU parameters):
> 
>   -device pcie-root-port,port=0,id=root_port13,chassis=0,slot=2  \
>   -device x3130-upstream,id=sw1,bus=root_port13,multifunction=on \
>   -device e1000,bus=root_port13,addr=0.1                         \
>   -device xio3130-downstream,id=fun1,bus=sw1,chassis=0,slot=3    \
>   -device e1000,bus=fun1
> 
> The first e1000 NIC here is another function in the switch upstream
> port. This leads to following errors:
> 
>    pci 0000:00:04.0: bridge window [mem 0x10200000-0x103fffff] to [bus 02-04]
>    pci 0000:02:00.0: bridge window [mem 0x10200000-0x103fffff] to [bus 03-04]
>    pci 0000:02:00.1: BAR 0: failed to assign [mem size 0x00020000]
>    e1000 0000:02:00.1: can't ioremap BAR 0: [??? 0x00000000 flags 0x0]
> 
> Fix this by taking into account the device BARs on the bus, including
> the ones belonging to bridges themselves.
> 
> Link: https://lore.kernel.org/linux-pci/20221014124553.0000696f@huawei.com/
> Link: https://lore.kernel.org/linux-pci/6053736d-1923-41e7-def9-7585ce1772d9@ixsystems.com/
> Reported-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Reported-by: Alexander Motin <mav@ixsystems.com>
> Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
> ---
>   drivers/pci/setup-bus.c | 205 +++++++++++++++++++++++++---------------
>   1 file changed, 129 insertions(+), 76 deletions(-)
> 
> diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
> index b4096598dbcb..cf6a7bdf2427 100644
> --- a/drivers/pci/setup-bus.c
> +++ b/drivers/pci/setup-bus.c
> @@ -1750,6 +1750,7 @@ static void adjust_bridge_window(struct pci_dev *bridge, struct resource *res,
>   				 resource_size_t new_size)
>   {
>   	resource_size_t add_size, size = resource_size(res);
> +	resource_size_t min_size;
>   
>   	if (res->parent)
>   		return;
> @@ -1757,30 +1758,87 @@ static void adjust_bridge_window(struct pci_dev *bridge, struct resource *res,
>   	if (!new_size)
>   		return;
>   
> +	/* Minimum granularity of a bridge bridge window */
> +	min_size = window_alignment(bridge->bus, res->flags);
> +
>   	if (new_size > size) {
>   		add_size = new_size - size;
> +		if (add_size < min_size)
> +			return;
>   		pci_dbg(bridge, "bridge window %pR extended by %pa\n", res,
>   			&add_size);
>   	} else if (new_size < size) {
>   		add_size = size - new_size;
> +		if (add_size < min_size)
> +			return;

May be I don't understand something, but in what situation it may 
happen, and won't it be a problem if you silently do nothing here, while 
the calling code will use the passed new_size as-is?

>   		pci_dbg(bridge, "bridge window %pR shrunken by %pa\n", res,
>   			&add_size);
> +	} else {
> +		return;
>   	}
>   
>   	res->end = res->start + new_size - 1;
>   	remove_from_list(add_list, res);
>   }
>   
> +static void reduce_dev_resources(struct pci_dev *dev, struct resource *io,
> +				 struct resource *mmio,
> +				 struct resource *mmio_pref)
> +{
> +	int i;
> +
> +	for (i = 0; i < PCI_NUM_RESOURCES; i++) {
> +		struct resource *res = &dev->resource[i];
> +		resource_size_t align, tmp, size;
> +
> +		size = resource_size(res);
> +		if (!size)
> +			continue;
> +
> +		align = pci_resource_alignment(dev, res);
> +
> +		if (resource_type(res) == IORESOURCE_IO) {
> +			align = align ? ALIGN(io->start, align) - io->start : 0;
> +			tmp = align + size;
> +			io->start = min(io->start + tmp, io->end + 1);
> +		} else if (resource_type(res) == IORESOURCE_MEM) {
> +			if (res->flags & IORESOURCE_PREFETCH) {
> +				align = align ? ALIGN(mmio_pref->start, align) -
> +						mmio_pref->start : 0;
> +				tmp = align + size;
> +				mmio_pref->start = min(mmio_pref->start + tmp,
> +						       mmio_pref->end + 1);
> +			} else {
> +				align = align ? ALIGN(mmio->start, align) -
> +						mmio->start : 0;
> +				tmp = align + size;
> +				mmio->start = min(mmio->start + tmp,
> +						  mmio->end + 1);
> +			}
> +		}
> +	}
> +}
> +
> +/*
> + * io, mmio and mmio_pref contain the total amount of bridge window
> + * space available. This includes the minimal space needed to cover all
> + * the existing devices on the bus and the possible extra space that can
> + * be shared with the bridges.
> + *
> + * The resource space consumed by bus->self (the bridge) is already
> + * reduced.
> + */
>   static void pci_bus_distribute_available_resources(struct pci_bus *bus,
>   					    struct list_head *add_list,
>   					    struct resource io,
>   					    struct resource mmio,
>   					    struct resource mmio_pref)
>   {
> +	resource_size_t io_align, mmio_align, mmio_pref_align;
> +	resource_size_t io_per_b, mmio_per_b, mmio_pref_per_b;
>   	unsigned int normal_bridges = 0, hotplug_bridges = 0;
>   	struct resource *io_res, *mmio_res, *mmio_pref_res;
>   	struct pci_dev *dev, *bridge = bus->self;
> -	resource_size_t io_per_hp, mmio_per_hp, mmio_pref_per_hp, align;
>   
>   	io_res = &bridge->resource[PCI_BRIDGE_IO_WINDOW];
>   	mmio_res = &bridge->resource[PCI_BRIDGE_MEM_WINDOW];
> @@ -1790,17 +1848,17 @@ static void pci_bus_distribute_available_resources(struct pci_bus *bus,
>   	 * The alignment of this bridge is yet to be considered, hence it must
>   	 * be done now before extending its bridge window.
>   	 */
> -	align = pci_resource_alignment(bridge, io_res);
> -	if (!io_res->parent && align)
> -		io.start = min(ALIGN(io.start, align), io.end + 1);
> +	io_align = pci_resource_alignment(bridge, io_res);
> +	if (!io_res->parent && io_align)
> +		io.start = min(ALIGN(io.start, io_align), io.end + 1);
>   
> -	align = pci_resource_alignment(bridge, mmio_res);
> -	if (!mmio_res->parent && align)
> -		mmio.start = min(ALIGN(mmio.start, align), mmio.end + 1);
> +	mmio_align = pci_resource_alignment(bridge, mmio_res);
> +	if (!mmio_res->parent && mmio_align)
> +		mmio.start = min(ALIGN(mmio.start, mmio_align), mmio.end + 1);
>   
> -	align = pci_resource_alignment(bridge, mmio_pref_res);
> -	if (!mmio_pref_res->parent && align)
> -		mmio_pref.start = min(ALIGN(mmio_pref.start, align),
> +	mmio_pref_align = pci_resource_alignment(bridge, mmio_pref_res);
> +	if (!mmio_pref_res->parent && mmio_pref_align)
> +		mmio_pref.start = min(ALIGN(mmio_pref.start, mmio_pref_align),
>   			mmio_pref.end + 1);
>   
>   	/*
> @@ -1824,94 +1882,89 @@ static void pci_bus_distribute_available_resources(struct pci_bus *bus,
>   			normal_bridges++;
>   	}
>   
> -	/*
> -	 * There is only one bridge on the bus so it gets all available
> -	 * resources which it can then distribute to the possible hotplug
> -	 * bridges below.
> -	 */
> -	if (hotplug_bridges + normal_bridges == 1) {
> -		dev = list_first_entry(&bus->devices, struct pci_dev, bus_list);
> -		if (dev->subordinate)
> -			pci_bus_distribute_available_resources(dev->subordinate,
> -				add_list, io, mmio, mmio_pref);
> -		return;
> -	}
> -
> -	if (hotplug_bridges == 0)
> +	if (!(hotplug_bridges + normal_bridges))
>   		return;
>   
>   	/*
>   	 * Calculate the total amount of extra resource space we can
> -	 * pass to bridges below this one.  This is basically the
> -	 * extra space reduced by the minimal required space for the
> -	 * non-hotplug bridges.
> +	 * pass to bridges below this one. This is basically the extra
> +	 * space reduced by the minimal required space for the bridge
> +	 * windows and device BARs on this bus.
>   	 */
> -	for_each_pci_bridge(dev, bus) {
> -		resource_size_t used_size;
> -		struct resource *res;
> -
> -		if (dev->is_hotplug_bridge)
> -			continue;
> -
> -		/*
> -		 * Reduce the available resource space by what the
> -		 * bridge and devices below it occupy.
> -		 */
> -		res = &dev->resource[PCI_BRIDGE_IO_WINDOW];
> -		align = pci_resource_alignment(dev, res);
> -		align = align ? ALIGN(io.start, align) - io.start : 0;
> -		used_size = align + resource_size(res);
> -		if (!res->parent)
> -			io.start = min(io.start + used_size, io.end + 1);
> -
> -		res = &dev->resource[PCI_BRIDGE_MEM_WINDOW];
> -		align = pci_resource_alignment(dev, res);
> -		align = align ? ALIGN(mmio.start, align) - mmio.start : 0;
> -		used_size = align + resource_size(res);
> -		if (!res->parent)
> -			mmio.start = min(mmio.start + used_size, mmio.end + 1);
> +	list_for_each_entry(dev, &bus->devices, bus_list)
> +		reduce_dev_resources(dev, &io, &mmio, &mmio_pref);
>   
> -		res = &dev->resource[PCI_BRIDGE_PREF_MEM_WINDOW];
> -		align = pci_resource_alignment(dev, res);
> -		align = align ? ALIGN(mmio_pref.start, align) -
> -			mmio_pref.start : 0;
> -		used_size = align + resource_size(res);
> -		if (!res->parent)
> -			mmio_pref.start = min(mmio_pref.start + used_size,
> -				mmio_pref.end + 1);
> +	/*
> +	 * If there is at least one hotplug bridge on this bus it gets
> +	 * all the extra resource space that was left after the
> +	 * reductions above.
> +	 *
> +	 * If there are no hotplug bridges the extra resource space is
> +	 * split between non-hotplug bridges. This is to allow possible
> +	 * hotplug bridges below them to get the extra space as well.
> +	 */
> +	if (hotplug_bridges) {
> +		io_per_b = div64_ul(resource_size(&io), hotplug_bridges);
> +		mmio_per_b = div64_ul(resource_size(&mmio), hotplug_bridges);
> +		mmio_pref_per_b = div64_ul(resource_size(&mmio_pref),
> +					   hotplug_bridges);
> +	} else {
> +		io_per_b = div64_ul(resource_size(&io), normal_bridges);
> +		mmio_per_b = div64_ul(resource_size(&mmio), normal_bridges);
> +		mmio_pref_per_b = div64_ul(resource_size(&mmio_pref),
> +					   normal_bridges);
>   	}
>   
> -	io_per_hp = div64_ul(resource_size(&io), hotplug_bridges);
> -	mmio_per_hp = div64_ul(resource_size(&mmio), hotplug_bridges);
> -	mmio_pref_per_hp = div64_ul(resource_size(&mmio_pref),
> -		hotplug_bridges);
> -
>   	/*
> -	 * Go over devices on this bus and distribute the remaining
> -	 * resource space between hotplug bridges.
> +	 * Make sure the split resource space is properly aligned for
> +	 * bridge windows (align it down to avoid going above what is
> +	 * available).
>   	 */
> +	if (io_align)
> +		io_per_b = ALIGN_DOWN(io_per_b, io_align);
> +	if (mmio_align)
> +		mmio_per_b = ALIGN_DOWN(mmio_per_b, mmio_align);
> +	if (mmio_pref_align)
> +		mmio_pref_per_b = ALIGN_DOWN(mmio_pref_per_b, mmio_align);

If I understand it right, you are applying alignment requirements of the 
parent bridge to the windows of its children.  I don't have examples of 
any bridge with different alignment, but shouldn't we better get and use 
proper alignment inside the loop below?

> +
>   	for_each_pci_bridge(dev, bus) {
> +		resource_size_t allocated_io, allocated_mmio, allocated_mmio_pref;
> +		const struct resource *res;
>   		struct pci_bus *b;
>   
>   		b = dev->subordinate;
> -		if (!b || !dev->is_hotplug_bridge)
> +		if (!b)
> +			continue;
> +		if (hotplug_bridges && !dev->is_hotplug_bridge)
>   			continue;
>   
> +		io.end = io.start + io_per_b - 1;
>   		/*
> -		 * Distribute available extra resources equally between
> -		 * hotplug-capable downstream ports taking alignment into
> -		 * account.
> +		 * The x_per_b holds the extra resource space that can
> +		 * be added for each bridge but there is the minimal
> +		 * already reserved as well so adjust x.start down
> +		 * accordingly to cover the whole space.
>   		 */
> -		io.end = io.start + io_per_hp - 1;
> -		mmio.end = mmio.start + mmio_per_hp - 1;
> -		mmio_pref.end = mmio_pref.start + mmio_pref_per_hp - 1;
> +		res = &dev->resource[PCI_BRIDGE_IO_WINDOW];
> +		allocated_io = resource_size(res);
> +		io.start -= allocated_io;
> +
> +		mmio.end = mmio.start + mmio_per_b - 1;
> +		res = &dev->resource[PCI_BRIDGE_MEM_WINDOW];
> +		allocated_mmio = resource_size(res);
> +		mmio.start -= allocated_mmio;
> +
> +		mmio_pref.end = mmio_pref.start + mmio_pref_per_b - 1;
> +		res = &dev->resource[PCI_BRIDGE_PREF_MEM_WINDOW];
> +		allocated_mmio_pref = resource_size(res);
> +		mmio_pref.start -= allocated_mmio_pref;
>   
>   		pci_bus_distribute_available_resources(b, add_list, io, mmio,
>   						       mmio_pref);
>   
> -		io.start += io_per_hp;
> -		mmio.start += mmio_per_hp;
> -		mmio_pref.start += mmio_pref_per_hp;
> +		io.start += allocated_io + io_per_b;
> +		mmio.start += allocated_mmio + mmio_per_b;
> +		mmio_pref.start += allocated_mmio_pref + mmio_pref_per_b;
>   	}
>   }
>
Mika Westerberg Jan. 10, 2023, 9:54 a.m. UTC | #8
Hi,

On Mon, Jan 09, 2023 at 02:33:35PM -0500, Alexander Motin wrote:
> On 04.01.2023 04:16, Mika Westerberg wrote:
> > A PCI bridge may reside on a bus with other devices as well. The
> > resource distribution code does not take this into account properly and
> > therefore it expands the bridge resource windows too much, not leaving
> > space for the other devices (or functions of a multifunction device) and
> > this leads to an issue that Jonathan reported. He runs QEMU with the
> > following topology (QEMU parameters):
> > 
> >   -device pcie-root-port,port=0,id=root_port13,chassis=0,slot=2  \
> >   -device x3130-upstream,id=sw1,bus=root_port13,multifunction=on \
> >   -device e1000,bus=root_port13,addr=0.1                         \
> >   -device xio3130-downstream,id=fun1,bus=sw1,chassis=0,slot=3    \
> >   -device e1000,bus=fun1
> > 
> > The first e1000 NIC here is another function in the switch upstream
> > port. This leads to following errors:
> > 
> >    pci 0000:00:04.0: bridge window [mem 0x10200000-0x103fffff] to [bus 02-04]
> >    pci 0000:02:00.0: bridge window [mem 0x10200000-0x103fffff] to [bus 03-04]
> >    pci 0000:02:00.1: BAR 0: failed to assign [mem size 0x00020000]
> >    e1000 0000:02:00.1: can't ioremap BAR 0: [??? 0x00000000 flags 0x0]
> > 
> > Fix this by taking into account the device BARs on the bus, including
> > the ones belonging to bridges themselves.
> > 
> > Link: https://lore.kernel.org/linux-pci/20221014124553.0000696f@huawei.com/
> > Link: https://lore.kernel.org/linux-pci/6053736d-1923-41e7-def9-7585ce1772d9@ixsystems.com/
> > Reported-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> > Reported-by: Alexander Motin <mav@ixsystems.com>
> > Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
> > ---
> >   drivers/pci/setup-bus.c | 205 +++++++++++++++++++++++++---------------
> >   1 file changed, 129 insertions(+), 76 deletions(-)
> > 
> > diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
> > index b4096598dbcb..cf6a7bdf2427 100644
> > --- a/drivers/pci/setup-bus.c
> > +++ b/drivers/pci/setup-bus.c
> > @@ -1750,6 +1750,7 @@ static void adjust_bridge_window(struct pci_dev *bridge, struct resource *res,
> >   				 resource_size_t new_size)
> >   {
> >   	resource_size_t add_size, size = resource_size(res);
> > +	resource_size_t min_size;
> >   	if (res->parent)
> >   		return;
> > @@ -1757,30 +1758,87 @@ static void adjust_bridge_window(struct pci_dev *bridge, struct resource *res,
> >   	if (!new_size)
> >   		return;
> > +	/* Minimum granularity of a bridge bridge window */
> > +	min_size = window_alignment(bridge->bus, res->flags);
> > +
> >   	if (new_size > size) {
> >   		add_size = new_size - size;
> > +		if (add_size < min_size)
> > +			return;
> >   		pci_dbg(bridge, "bridge window %pR extended by %pa\n", res,
> >   			&add_size);
> >   	} else if (new_size < size) {
> >   		add_size = size - new_size;
> > +		if (add_size < min_size)
> > +			return;
> 
> May be I don't understand something, but in what situation it may happen,
> and won't it be a problem if you silently do nothing here, while the calling
> code will use the passed new_size as-is?

Good point.

I saw this happening when I run QEMU but now that I started to look at
it again the reason seems to be that there is a 32-bit prefetchable
option ROM BAR for the e1000 NIC and that gets reduced incorrectly by
the reduce_dev_resources() which causes the "imbalance".

> >   		pci_dbg(bridge, "bridge window %pR shrunken by %pa\n", res,
> >   			&add_size);
> > +	} else {
> > +		return;
> >   	}
> >   	res->end = res->start + new_size - 1;
> >   	remove_from_list(add_list, res);
> >   }
> > +static void reduce_dev_resources(struct pci_dev *dev, struct resource *io,
> > +				 struct resource *mmio,
> > +				 struct resource *mmio_pref)
> > +{
> > +	int i;
> > +
> > +	for (i = 0; i < PCI_NUM_RESOURCES; i++) {
> > +		struct resource *res = &dev->resource[i];
> > +		resource_size_t align, tmp, size;
> > +
> > +		size = resource_size(res);
> > +		if (!size)
> > +			continue;
> > +
> > +		align = pci_resource_alignment(dev, res);
> > +
> > +		if (resource_type(res) == IORESOURCE_IO) {
> > +			align = align ? ALIGN(io->start, align) - io->start : 0;
> > +			tmp = align + size;
> > +			io->start = min(io->start + tmp, io->end + 1);
> > +		} else if (resource_type(res) == IORESOURCE_MEM) {
> > +			if (res->flags & IORESOURCE_PREFETCH) {
> > +				align = align ? ALIGN(mmio_pref->start, align) -
> > +						mmio_pref->start : 0;
> > +				tmp = align + size;
> > +				mmio_pref->start = min(mmio_pref->start + tmp,
> > +						       mmio_pref->end + 1);
> > +			} else {
> > +				align = align ? ALIGN(mmio->start, align) -
> > +						mmio->start : 0;
> > +				tmp = align + size;
> > +				mmio->start = min(mmio->start + tmp,
> > +						  mmio->end + 1);
> > +			}
> > +		}
> > +	}
> > +}
> > +
> > +/*
> > + * io, mmio and mmio_pref contain the total amount of bridge window
> > + * space available. This includes the minimal space needed to cover all
> > + * the existing devices on the bus and the possible extra space that can
> > + * be shared with the bridges.
> > + *
> > + * The resource space consumed by bus->self (the bridge) is already
> > + * reduced.
> > + */
> >   static void pci_bus_distribute_available_resources(struct pci_bus *bus,
> >   					    struct list_head *add_list,
> >   					    struct resource io,
> >   					    struct resource mmio,
> >   					    struct resource mmio_pref)
> >   {
> > +	resource_size_t io_align, mmio_align, mmio_pref_align;
> > +	resource_size_t io_per_b, mmio_per_b, mmio_pref_per_b;
> >   	unsigned int normal_bridges = 0, hotplug_bridges = 0;
> >   	struct resource *io_res, *mmio_res, *mmio_pref_res;
> >   	struct pci_dev *dev, *bridge = bus->self;
> > -	resource_size_t io_per_hp, mmio_per_hp, mmio_pref_per_hp, align;
> >   	io_res = &bridge->resource[PCI_BRIDGE_IO_WINDOW];
> >   	mmio_res = &bridge->resource[PCI_BRIDGE_MEM_WINDOW];
> > @@ -1790,17 +1848,17 @@ static void pci_bus_distribute_available_resources(struct pci_bus *bus,
> >   	 * The alignment of this bridge is yet to be considered, hence it must
> >   	 * be done now before extending its bridge window.
> >   	 */
> > -	align = pci_resource_alignment(bridge, io_res);
> > -	if (!io_res->parent && align)
> > -		io.start = min(ALIGN(io.start, align), io.end + 1);
> > +	io_align = pci_resource_alignment(bridge, io_res);
> > +	if (!io_res->parent && io_align)
> > +		io.start = min(ALIGN(io.start, io_align), io.end + 1);
> > -	align = pci_resource_alignment(bridge, mmio_res);
> > -	if (!mmio_res->parent && align)
> > -		mmio.start = min(ALIGN(mmio.start, align), mmio.end + 1);
> > +	mmio_align = pci_resource_alignment(bridge, mmio_res);
> > +	if (!mmio_res->parent && mmio_align)
> > +		mmio.start = min(ALIGN(mmio.start, mmio_align), mmio.end + 1);
> > -	align = pci_resource_alignment(bridge, mmio_pref_res);
> > -	if (!mmio_pref_res->parent && align)
> > -		mmio_pref.start = min(ALIGN(mmio_pref.start, align),
> > +	mmio_pref_align = pci_resource_alignment(bridge, mmio_pref_res);
> > +	if (!mmio_pref_res->parent && mmio_pref_align)
> > +		mmio_pref.start = min(ALIGN(mmio_pref.start, mmio_pref_align),
> >   			mmio_pref.end + 1);
> >   	/*
> > @@ -1824,94 +1882,89 @@ static void pci_bus_distribute_available_resources(struct pci_bus *bus,
> >   			normal_bridges++;
> >   	}
> > -	/*
> > -	 * There is only one bridge on the bus so it gets all available
> > -	 * resources which it can then distribute to the possible hotplug
> > -	 * bridges below.
> > -	 */
> > -	if (hotplug_bridges + normal_bridges == 1) {
> > -		dev = list_first_entry(&bus->devices, struct pci_dev, bus_list);
> > -		if (dev->subordinate)
> > -			pci_bus_distribute_available_resources(dev->subordinate,
> > -				add_list, io, mmio, mmio_pref);
> > -		return;
> > -	}
> > -
> > -	if (hotplug_bridges == 0)
> > +	if (!(hotplug_bridges + normal_bridges))
> >   		return;
> >   	/*
> >   	 * Calculate the total amount of extra resource space we can
> > -	 * pass to bridges below this one.  This is basically the
> > -	 * extra space reduced by the minimal required space for the
> > -	 * non-hotplug bridges.
> > +	 * pass to bridges below this one. This is basically the extra
> > +	 * space reduced by the minimal required space for the bridge
> > +	 * windows and device BARs on this bus.
> >   	 */
> > -	for_each_pci_bridge(dev, bus) {
> > -		resource_size_t used_size;
> > -		struct resource *res;
> > -
> > -		if (dev->is_hotplug_bridge)
> > -			continue;
> > -
> > -		/*
> > -		 * Reduce the available resource space by what the
> > -		 * bridge and devices below it occupy.
> > -		 */
> > -		res = &dev->resource[PCI_BRIDGE_IO_WINDOW];
> > -		align = pci_resource_alignment(dev, res);
> > -		align = align ? ALIGN(io.start, align) - io.start : 0;
> > -		used_size = align + resource_size(res);
> > -		if (!res->parent)
> > -			io.start = min(io.start + used_size, io.end + 1);
> > -
> > -		res = &dev->resource[PCI_BRIDGE_MEM_WINDOW];
> > -		align = pci_resource_alignment(dev, res);
> > -		align = align ? ALIGN(mmio.start, align) - mmio.start : 0;
> > -		used_size = align + resource_size(res);
> > -		if (!res->parent)
> > -			mmio.start = min(mmio.start + used_size, mmio.end + 1);
> > +	list_for_each_entry(dev, &bus->devices, bus_list)
> > +		reduce_dev_resources(dev, &io, &mmio, &mmio_pref);
> > -		res = &dev->resource[PCI_BRIDGE_PREF_MEM_WINDOW];
> > -		align = pci_resource_alignment(dev, res);
> > -		align = align ? ALIGN(mmio_pref.start, align) -
> > -			mmio_pref.start : 0;
> > -		used_size = align + resource_size(res);
> > -		if (!res->parent)
> > -			mmio_pref.start = min(mmio_pref.start + used_size,
> > -				mmio_pref.end + 1);
> > +	/*
> > +	 * If there is at least one hotplug bridge on this bus it gets
> > +	 * all the extra resource space that was left after the
> > +	 * reductions above.
> > +	 *
> > +	 * If there are no hotplug bridges the extra resource space is
> > +	 * split between non-hotplug bridges. This is to allow possible
> > +	 * hotplug bridges below them to get the extra space as well.
> > +	 */
> > +	if (hotplug_bridges) {
> > +		io_per_b = div64_ul(resource_size(&io), hotplug_bridges);
> > +		mmio_per_b = div64_ul(resource_size(&mmio), hotplug_bridges);
> > +		mmio_pref_per_b = div64_ul(resource_size(&mmio_pref),
> > +					   hotplug_bridges);
> > +	} else {
> > +		io_per_b = div64_ul(resource_size(&io), normal_bridges);
> > +		mmio_per_b = div64_ul(resource_size(&mmio), normal_bridges);
> > +		mmio_pref_per_b = div64_ul(resource_size(&mmio_pref),
> > +					   normal_bridges);
> >   	}
> > -	io_per_hp = div64_ul(resource_size(&io), hotplug_bridges);
> > -	mmio_per_hp = div64_ul(resource_size(&mmio), hotplug_bridges);
> > -	mmio_pref_per_hp = div64_ul(resource_size(&mmio_pref),
> > -		hotplug_bridges);
> > -
> >   	/*
> > -	 * Go over devices on this bus and distribute the remaining
> > -	 * resource space between hotplug bridges.
> > +	 * Make sure the split resource space is properly aligned for
> > +	 * bridge windows (align it down to avoid going above what is
> > +	 * available).
> >   	 */
> > +	if (io_align)
> > +		io_per_b = ALIGN_DOWN(io_per_b, io_align);
> > +	if (mmio_align)
> > +		mmio_per_b = ALIGN_DOWN(mmio_per_b, mmio_align);
> > +	if (mmio_pref_align)
> > +		mmio_pref_per_b = ALIGN_DOWN(mmio_pref_per_b, mmio_align);
> 
> If I understand it right, you are applying alignment requirements of the
> parent bridge to the windows of its children.  I don't have examples of any
> bridge with different alignment, but shouldn't we better get and use proper
> alignment inside the loop below?

Yes, I agree.
Mika Westerberg Jan. 10, 2023, 10:07 a.m. UTC | #9
Hi,

On Mon, Jan 09, 2023 at 01:27:07PM -0500, Alexander Motin wrote:
> On 09.01.2023 06:11, Mika Westerberg wrote:
> > On Thu, Jan 05, 2023 at 11:04:13AM -0600, Bjorn Helgaas wrote:
> > > On Thu, Jan 05, 2023 at 03:43:29PM +0200, Mika Westerberg wrote:
> > > > On Thu, Jan 05, 2023 at 11:12:11AM +0200, Mika Westerberg wrote:
> > > > > > What happens in a topology like this:
> > > > > > 
> > > > > >    10:00.0 non-hotplug bridge to [bus 20-3f]
> > > > > >    10:01.0 non-hotplug bridge to [bus 40]
> > > > > >    20:00.0 hotplug bridge
> > > > > >    40:00.0 NIC
> > > > > > 
> > > > > > where we're distributing space on "bus" 10, hotplug_bridges == 0 and
> > > > > > normal_bridges == 2?  Do we give half the extra space to bus 20 and
> > > > > > the other half to bus 40, even though we could tell up front that bus
> > > > > > 20 is the only place that can actually use any extra space?
> > > > > 
> > > > > Yes we split it into half.
> > > > 
> > > > Forgot to reply also that would it make sense here to look at below the
> > > > non-hotplug bridges and if we find hotplug bridges, distribute the space
> > > > equally between those or something like that?
> > > 
> > > Yes, I do think ultimately it would make sense to keep track at every
> > > bridge whether it or any descendant is a hotplug bridge so we could
> > > distribute extra space only to bridges that could potentially use it.
> > > 
> > > But I don't know if that needs to be done in this series.  This code
> > > is so complicated and fragile that I think being ruthless about
> > > defining the minimal problem we're solving and avoiding scope creep
> > > will improve our chances of success.
> > > 
> > > So treat this as a question to improve my understanding more than
> > > anything.
> > 
> > Okay, undestood ;-)
> 
> I was also wondering about this problem.  But my first though was: if we are
> not going to look down through all the tree, may be we should better
> distribute resources between all bridges no this bus, no matter whether they
> are hot-plug or not?  Because behind any non-hotplug bridge may appear a
> hot-plug one.  As a possible trivial approach, give hot-plug bridges twice
> of what is given to non-hot-plug ones. ;)

Yeah, this may end up running out of resources pretty quickly. This is
primarily used with Thunderbolt/USB4 devices and the recent ones have up
to 3 downstream PCIe hotplug ports (along with the non-hotplug ones) so
chaining a couple of those quickly consumes all that space.

I'm all ears if there is a good (and simple) strategy for dealing with
this but at least we should not make the functionality worse than what
the current situation is IMHO. I'm going to try the suggestion to look
behind non-hotplug bridges for hotplug ones and see how complex that
becomes ;-)

> PS: Thank you, Mika, for working on this.

No problem :)
diff mbox series

Patch

diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
index b4096598dbcb..cf6a7bdf2427 100644
--- a/drivers/pci/setup-bus.c
+++ b/drivers/pci/setup-bus.c
@@ -1750,6 +1750,7 @@  static void adjust_bridge_window(struct pci_dev *bridge, struct resource *res,
 				 resource_size_t new_size)
 {
 	resource_size_t add_size, size = resource_size(res);
+	resource_size_t min_size;
 
 	if (res->parent)
 		return;
@@ -1757,30 +1758,87 @@  static void adjust_bridge_window(struct pci_dev *bridge, struct resource *res,
 	if (!new_size)
 		return;
 
+	/* Minimum granularity of a bridge bridge window */
+	min_size = window_alignment(bridge->bus, res->flags);
+
 	if (new_size > size) {
 		add_size = new_size - size;
+		if (add_size < min_size)
+			return;
 		pci_dbg(bridge, "bridge window %pR extended by %pa\n", res,
 			&add_size);
 	} else if (new_size < size) {
 		add_size = size - new_size;
+		if (add_size < min_size)
+			return;
 		pci_dbg(bridge, "bridge window %pR shrunken by %pa\n", res,
 			&add_size);
+	} else {
+		return;
 	}
 
 	res->end = res->start + new_size - 1;
 	remove_from_list(add_list, res);
 }
 
+static void reduce_dev_resources(struct pci_dev *dev, struct resource *io,
+				 struct resource *mmio,
+				 struct resource *mmio_pref)
+{
+	int i;
+
+	for (i = 0; i < PCI_NUM_RESOURCES; i++) {
+		struct resource *res = &dev->resource[i];
+		resource_size_t align, tmp, size;
+
+		size = resource_size(res);
+		if (!size)
+			continue;
+
+		align = pci_resource_alignment(dev, res);
+
+		if (resource_type(res) == IORESOURCE_IO) {
+			align = align ? ALIGN(io->start, align) - io->start : 0;
+			tmp = align + size;
+			io->start = min(io->start + tmp, io->end + 1);
+		} else if (resource_type(res) == IORESOURCE_MEM) {
+			if (res->flags & IORESOURCE_PREFETCH) {
+				align = align ? ALIGN(mmio_pref->start, align) -
+						mmio_pref->start : 0;
+				tmp = align + size;
+				mmio_pref->start = min(mmio_pref->start + tmp,
+						       mmio_pref->end + 1);
+			} else {
+				align = align ? ALIGN(mmio->start, align) -
+						mmio->start : 0;
+				tmp = align + size;
+				mmio->start = min(mmio->start + tmp,
+						  mmio->end + 1);
+			}
+		}
+	}
+}
+
+/*
+ * io, mmio and mmio_pref contain the total amount of bridge window
+ * space available. This includes the minimal space needed to cover all
+ * the existing devices on the bus and the possible extra space that can
+ * be shared with the bridges.
+ *
+ * The resource space consumed by bus->self (the bridge) is already
+ * reduced.
+ */
 static void pci_bus_distribute_available_resources(struct pci_bus *bus,
 					    struct list_head *add_list,
 					    struct resource io,
 					    struct resource mmio,
 					    struct resource mmio_pref)
 {
+	resource_size_t io_align, mmio_align, mmio_pref_align;
+	resource_size_t io_per_b, mmio_per_b, mmio_pref_per_b;
 	unsigned int normal_bridges = 0, hotplug_bridges = 0;
 	struct resource *io_res, *mmio_res, *mmio_pref_res;
 	struct pci_dev *dev, *bridge = bus->self;
-	resource_size_t io_per_hp, mmio_per_hp, mmio_pref_per_hp, align;
 
 	io_res = &bridge->resource[PCI_BRIDGE_IO_WINDOW];
 	mmio_res = &bridge->resource[PCI_BRIDGE_MEM_WINDOW];
@@ -1790,17 +1848,17 @@  static void pci_bus_distribute_available_resources(struct pci_bus *bus,
 	 * The alignment of this bridge is yet to be considered, hence it must
 	 * be done now before extending its bridge window.
 	 */
-	align = pci_resource_alignment(bridge, io_res);
-	if (!io_res->parent && align)
-		io.start = min(ALIGN(io.start, align), io.end + 1);
+	io_align = pci_resource_alignment(bridge, io_res);
+	if (!io_res->parent && io_align)
+		io.start = min(ALIGN(io.start, io_align), io.end + 1);
 
-	align = pci_resource_alignment(bridge, mmio_res);
-	if (!mmio_res->parent && align)
-		mmio.start = min(ALIGN(mmio.start, align), mmio.end + 1);
+	mmio_align = pci_resource_alignment(bridge, mmio_res);
+	if (!mmio_res->parent && mmio_align)
+		mmio.start = min(ALIGN(mmio.start, mmio_align), mmio.end + 1);
 
-	align = pci_resource_alignment(bridge, mmio_pref_res);
-	if (!mmio_pref_res->parent && align)
-		mmio_pref.start = min(ALIGN(mmio_pref.start, align),
+	mmio_pref_align = pci_resource_alignment(bridge, mmio_pref_res);
+	if (!mmio_pref_res->parent && mmio_pref_align)
+		mmio_pref.start = min(ALIGN(mmio_pref.start, mmio_pref_align),
 			mmio_pref.end + 1);
 
 	/*
@@ -1824,94 +1882,89 @@  static void pci_bus_distribute_available_resources(struct pci_bus *bus,
 			normal_bridges++;
 	}
 
-	/*
-	 * There is only one bridge on the bus so it gets all available
-	 * resources which it can then distribute to the possible hotplug
-	 * bridges below.
-	 */
-	if (hotplug_bridges + normal_bridges == 1) {
-		dev = list_first_entry(&bus->devices, struct pci_dev, bus_list);
-		if (dev->subordinate)
-			pci_bus_distribute_available_resources(dev->subordinate,
-				add_list, io, mmio, mmio_pref);
-		return;
-	}
-
-	if (hotplug_bridges == 0)
+	if (!(hotplug_bridges + normal_bridges))
 		return;
 
 	/*
 	 * Calculate the total amount of extra resource space we can
-	 * pass to bridges below this one.  This is basically the
-	 * extra space reduced by the minimal required space for the
-	 * non-hotplug bridges.
+	 * pass to bridges below this one. This is basically the extra
+	 * space reduced by the minimal required space for the bridge
+	 * windows and device BARs on this bus.
 	 */
-	for_each_pci_bridge(dev, bus) {
-		resource_size_t used_size;
-		struct resource *res;
-
-		if (dev->is_hotplug_bridge)
-			continue;
-
-		/*
-		 * Reduce the available resource space by what the
-		 * bridge and devices below it occupy.
-		 */
-		res = &dev->resource[PCI_BRIDGE_IO_WINDOW];
-		align = pci_resource_alignment(dev, res);
-		align = align ? ALIGN(io.start, align) - io.start : 0;
-		used_size = align + resource_size(res);
-		if (!res->parent)
-			io.start = min(io.start + used_size, io.end + 1);
-
-		res = &dev->resource[PCI_BRIDGE_MEM_WINDOW];
-		align = pci_resource_alignment(dev, res);
-		align = align ? ALIGN(mmio.start, align) - mmio.start : 0;
-		used_size = align + resource_size(res);
-		if (!res->parent)
-			mmio.start = min(mmio.start + used_size, mmio.end + 1);
+	list_for_each_entry(dev, &bus->devices, bus_list)
+		reduce_dev_resources(dev, &io, &mmio, &mmio_pref);
 
-		res = &dev->resource[PCI_BRIDGE_PREF_MEM_WINDOW];
-		align = pci_resource_alignment(dev, res);
-		align = align ? ALIGN(mmio_pref.start, align) -
-			mmio_pref.start : 0;
-		used_size = align + resource_size(res);
-		if (!res->parent)
-			mmio_pref.start = min(mmio_pref.start + used_size,
-				mmio_pref.end + 1);
+	/*
+	 * If there is at least one hotplug bridge on this bus it gets
+	 * all the extra resource space that was left after the
+	 * reductions above.
+	 *
+	 * If there are no hotplug bridges the extra resource space is
+	 * split between non-hotplug bridges. This is to allow possible
+	 * hotplug bridges below them to get the extra space as well.
+	 */
+	if (hotplug_bridges) {
+		io_per_b = div64_ul(resource_size(&io), hotplug_bridges);
+		mmio_per_b = div64_ul(resource_size(&mmio), hotplug_bridges);
+		mmio_pref_per_b = div64_ul(resource_size(&mmio_pref),
+					   hotplug_bridges);
+	} else {
+		io_per_b = div64_ul(resource_size(&io), normal_bridges);
+		mmio_per_b = div64_ul(resource_size(&mmio), normal_bridges);
+		mmio_pref_per_b = div64_ul(resource_size(&mmio_pref),
+					   normal_bridges);
 	}
 
-	io_per_hp = div64_ul(resource_size(&io), hotplug_bridges);
-	mmio_per_hp = div64_ul(resource_size(&mmio), hotplug_bridges);
-	mmio_pref_per_hp = div64_ul(resource_size(&mmio_pref),
-		hotplug_bridges);
-
 	/*
-	 * Go over devices on this bus and distribute the remaining
-	 * resource space between hotplug bridges.
+	 * Make sure the split resource space is properly aligned for
+	 * bridge windows (align it down to avoid going above what is
+	 * available).
 	 */
+	if (io_align)
+		io_per_b = ALIGN_DOWN(io_per_b, io_align);
+	if (mmio_align)
+		mmio_per_b = ALIGN_DOWN(mmio_per_b, mmio_align);
+	if (mmio_pref_align)
+		mmio_pref_per_b = ALIGN_DOWN(mmio_pref_per_b, mmio_align);
+
 	for_each_pci_bridge(dev, bus) {
+		resource_size_t allocated_io, allocated_mmio, allocated_mmio_pref;
+		const struct resource *res;
 		struct pci_bus *b;
 
 		b = dev->subordinate;
-		if (!b || !dev->is_hotplug_bridge)
+		if (!b)
+			continue;
+		if (hotplug_bridges && !dev->is_hotplug_bridge)
 			continue;
 
+		io.end = io.start + io_per_b - 1;
 		/*
-		 * Distribute available extra resources equally between
-		 * hotplug-capable downstream ports taking alignment into
-		 * account.
+		 * The x_per_b holds the extra resource space that can
+		 * be added for each bridge but there is the minimal
+		 * already reserved as well so adjust x.start down
+		 * accordingly to cover the whole space.
 		 */
-		io.end = io.start + io_per_hp - 1;
-		mmio.end = mmio.start + mmio_per_hp - 1;
-		mmio_pref.end = mmio_pref.start + mmio_pref_per_hp - 1;
+		res = &dev->resource[PCI_BRIDGE_IO_WINDOW];
+		allocated_io = resource_size(res);
+		io.start -= allocated_io;
+
+		mmio.end = mmio.start + mmio_per_b - 1;
+		res = &dev->resource[PCI_BRIDGE_MEM_WINDOW];
+		allocated_mmio = resource_size(res);
+		mmio.start -= allocated_mmio;
+
+		mmio_pref.end = mmio_pref.start + mmio_pref_per_b - 1;
+		res = &dev->resource[PCI_BRIDGE_PREF_MEM_WINDOW];
+		allocated_mmio_pref = resource_size(res);
+		mmio_pref.start -= allocated_mmio_pref;
 
 		pci_bus_distribute_available_resources(b, add_list, io, mmio,
 						       mmio_pref);
 
-		io.start += io_per_hp;
-		mmio.start += mmio_per_hp;
-		mmio_pref.start += mmio_pref_per_hp;
+		io.start += allocated_io + io_per_b;
+		mmio.start += allocated_mmio + mmio_per_b;
+		mmio_pref.start += allocated_mmio_pref + mmio_pref_per_b;
 	}
 }