
pci-mvebu driver on km_kirkwood

Message ID 20140219102658.76eec91e@skate (mailing list archive)
State New, archived

Commit Message

Thomas Petazzoni Feb. 19, 2014, 9:26 a.m. UTC
Dear Gerlando Falauto,

On Wed, 19 Feb 2014 09:38:48 +0100, Gerlando Falauto wrote:

> >> pci 0000:01:00.0: BAR 3: assigned [mem 0xe8000000-0xe87fffff]
> >> pci 0000:01:00.0: BAR 4: assigned [mem 0xe8800000-0xe8801fff]
> >> pci 0000:01:00.0: BAR 0: assigned [mem 0xe8802000-0xe8802fff]
> >> pci 0000:01:00.0: BAR 2: assigned [mem 0xe8803000-0xe8803fff]
> >> pci 0000:01:00.0: BAR 5: assigned [mem 0xe8804000-0xe8804fff]
> >
> > So in total, for the device 0000:01:00, the memory region should go
> > from 0xe0000000 to 0xe8804fff. This means that a 256 MB window is
> > needed for this device, because only power of two sizes are possible
> > for MBus windows.
> >
> > Can you give me the output of /sys/kernel/debug/mvebu-mbus/devices ? It
> > will tell us how the MBus windows are configured, as I suspect the
> > problem might be here.
> 
> Here it goes:
> 
> [00] disabled
> [01] disabled
> [02] disabled
> [03] disabled
> [04] 00000000ff000000 - 00000000ff010000 : nand
> [05] 00000000f4000000 - 00000000f8000000 : vpcie
> [06] 00000000fe000000 - 00000000fe010000 : dragonite
> [07] 00000000e0000000 - 00000000ec000000 : pcie0.0
> 
> So there's something wrong: a 256MB window should go all the way up to
> 0xf0000000, but we have 192MB instead, and I don't know how that would be
> interpreted.

My understanding is that a 192 MB window is illegal, because the window
size should be encoded as a sequence of 1s followed by a sequence of 0s
from the LSB to the MSB. To me, this means that only power of two
window sizes are possible.
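
To illustrate with rough arithmetic (assuming the usual MBus window
control register layout, where the size field holds the upper bits of
"size - 1"):

  256 MB: size - 1 = 0x0fffffff -> size field 0x0fff (contiguous 1s: valid)
  192 MB: size - 1 = 0x0bffffff -> size field 0x0bff (a 0 in the middle
                                   of the 1s: invalid)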

> I couldn't figure out where this range comes from though, as in the 
> device tree I now have a size of 256MB (I stupidly set it to 192MB at 
> some point, but I now changed it):
> 
> # hexdump -C /proc/device-tree/ocp@f1000000/pcie-controller/ranges
>   | cut -c1-58
> 00000000  82 00 00 00 00 00 00 00  00 04 00 00 00 04 00 00
> 00000010  00 00 00 00 00 00 20 00  82 00 00 00 00 00 00 00
> 00000020  e0 00 00 00 e0 00 00 00  00 00 00 00 10 00 00 00
>                                                 ^^^^^^^^^^^

Wow, that's an old DT representation that you have here :)

But ok, let me try to explain. The 256 MB value that you define in the
DT is the global PCIe memory aperture: it is the maximum amount of
memory that we allow the PCIe driver to allocate for PCIe windows. But
depending on which PCIe devices you have plugged in, and how large
their BARs are, not necessarily all of these 256 MB will be used.

So, you can very well have a 256 MB global PCIe memory aperture, and
still have only one 1 MB PCIe memory window for PCIe 0.0 and a 256 KB
PCIe memory window for PCIe 1.0, and that's it.

Now, the 192 MB comes from the enumeration of your device. Linux
enumerates the BARs of your device:

pci 0000:01:00.0: BAR 1: assigned [mem 0xe0000000-0xe7ffffff]
pci 0000:01:00.0: BAR 3: assigned [mem 0xe8000000-0xe87fffff]
pci 0000:01:00.0: BAR 4: assigned [mem 0xe8800000-0xe8801fff]
pci 0000:01:00.0: BAR 0: assigned [mem 0xe8802000-0xe8802fff]
pci 0000:01:00.0: BAR 2: assigned [mem 0xe8803000-0xe8803fff]
pci 0000:01:00.0: BAR 5: assigned [mem 0xe8804000-0xe8804fff]

and then concludes that at the emulated bridge level, the memory region
to be created is:

pci 0000:00:01.0:   bridge window [mem 0xe0000000-0xebffffff]

which corresponds to the 192 MB window that we see created.

But I believe a 192 MB memory window cannot work with MBus; it should
be rounded up to the next power of 2. Can you try the patch below (not
tested, not even compiled, might need some tweaks to apply to your 3.10
kernel):


I'm obviously interested in seeing the message that gets shown, as well
as the new mvebu-mbus debugfs output.

For good measure, could you also dump the registers of the PCIe
window? In your case, it was window 7, so dumping 0xf1020070 and
0xf1020074 would be useful.
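
In case you don't have a devmem-like tool on the target, a quick and
dirty userspace sketch along these lines (untested, window 7 addresses
hard-coded) should be enough to read them through /dev/mem:

#define _FILE_OFFSET_BITS 64	/* the physical offset is above 2 GB on 32-bit */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	volatile uint32_t *regs;
	int fd = open("/dev/mem", O_RDONLY | O_SYNC);

	if (fd < 0) {
		perror("open /dev/mem");
		return 1;
	}

	/* Map the 4 KB page holding the window registers (0xf1020000). */
	regs = mmap(NULL, 0x1000, PROT_READ, MAP_SHARED, fd,
		    (off_t)0xf1020000u);
	if (regs == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	printf("0xf1020070 = 0x%08x\n", (unsigned int)regs[0x70 / 4]);
	printf("0xf1020074 = 0x%08x\n", (unsigned int)regs[0x74 / 4]);

	munmap((void *)regs, 0x1000);
	close(fd);
	return 0;
}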

> But apart from that, what I still don't understand is how that could 
> have anything to do with my problem. The memory area I'm not able to 
> access starts at 0xe4000000.
> BAR0, on the other hand, spans 0xe8802000-0xe8802fff and seems to work
> fine.

I am not sure, but since we are configuring an invalid memory size,
maybe the MBus behavior is undefined, and we get some completely funky
behavior, where parts of the 192 MB window actually work, but parts
of it do not.

Thomas

Comments

Gerlando Falauto Feb. 19, 2014, 9:39 a.m. UTC | #1
Hi Thomas,

spoiler first: SUCCESS!!!!

On 02/19/2014 10:26 AM, Thomas Petazzoni wrote:
> Dear Gerlando Falauto,
[...]

>>
>> # hexdump -C /proc/device-tree/ocp@f1000000/pcie-controller/ranges
>>    | cut -c1-58
>> 00000000  82 00 00 00 00 00 00 00  00 04 00 00 00 04 00 00
>> 00000010  00 00 00 00 00 00 20 00  82 00 00 00 00 00 00 00
>> 00000020  e0 00 00 00 e0 00 00 00  00 00 00 00 10 00 00 00
>>                                                  ^^^^^^^^^^^
>
> Wow, that's an old DT representation that you have here :)

Indeed... ;-)

> But ok, let me try to explain. The 256 MB value that you define in the
> DT is the global PCIe memory aperture: it is the maximum amount of
> memory that we allow the PCIe driver to allocate for PCIe windows. But
> depending on which PCIe devices you have plugged in, and how large
> their BARs are, not necessarily all of these 256 MB will be used.
>
> So, you can very well have a 256 MB global PCIe memory aperture, and
> still have only one 1 MB PCIe memory window for PCIe 0.0 and a 256 KB
> PCIe memory window for PCIe 1.0, and that's it.
>
> Now, the 192 MB comes from the enumeration of your device. Linux
> enumerates the BARs of your device:
>
> pci 0000:01:00.0: BAR 1: assigned [mem 0xe0000000-0xe7ffffff]
> pci 0000:01:00.0: BAR 3: assigned [mem 0xe8000000-0xe87fffff]
> pci 0000:01:00.0: BAR 4: assigned [mem 0xe8800000-0xe8801fff]
> pci 0000:01:00.0: BAR 0: assigned [mem 0xe8802000-0xe8802fff]
> pci 0000:01:00.0: BAR 2: assigned [mem 0xe8803000-0xe8803fff]
> pci 0000:01:00.0: BAR 5: assigned [mem 0xe8804000-0xe8804fff]
>
> and then concludes that at the emulated bridge level, the memory region
> to be created is:
>
> pci 0000:00:01.0:   bridge window [mem 0xe0000000-0xebffffff]
>
> which corresponds to the 192 MB window that we see created.
>
> But I believe a 192 MB memory window cannot work with MBus; it should
> be rounded up to the next power of 2. Can you try the patch below (not
> tested, not even compiled, might need some tweaks to apply to your 3.10
> kernel):
>
> diff --git a/drivers/pci/host/pci-mvebu.c b/drivers/pci/host/pci-mvebu.c
> index 13478ec..002229a 100644
> --- a/drivers/pci/host/pci-mvebu.c
> +++ b/drivers/pci/host/pci-mvebu.c
> @@ -372,6 +372,11 @@ static void mvebu_pcie_handle_membase_change(struct mvebu_pcie_port *port)
>                  (((port->bridge.memlimit & 0xFFF0) << 16) | 0xFFFFF) -
>                  port->memwin_base;
>
> +       pr_info("PCIE %d.%d: creating window at 0x%x, size 0x%x rounded up to 0x%x\n",
> +               port->port, port->lane, port->memwin_base,
> +               port->memwin_size, roundup_pow_of_two(port->memwin_size));
> +       port->memwin_size = roundup_pow_of_two(port->memwin_size);
> +
>          mvebu_mbus_add_window_by_id(port->mem_target, port->mem_attr,
>                                      port->memwin_base, port->memwin_size);
>   }
>
> I'm obviously interested in seeing the message that gets shown, as well
> as the new mvebu-mbus debugfs output.

----------
pci 0000:00:01.0:   bridge window [mem 0xe0000000-0xebffffff]
PCIE 0.0: creating window at 0xe0000000, size 0xbffffff rounded up to 
0x10000000
----------

  cat /sys/kernel/debug/mvebu-mbus/devices
[00] disabled
[01] disabled
[02] disabled
[03] disabled
[04] 00000000ff000000 - 00000000ff010000 : nand
[05] 00000000f4000000 - 00000000f8000000 : vpcie
[06] 00000000fe000000 - 00000000fe010000 : dragonite
[07] 00000000e0000000 - 00000000f0000000 : pcie0.0

> For good measure, could you also dump the registers of the PCIe
> window? In your case, it was window 7, so dumping 0xf1020070 and
> 0xf1020074 would be useful.

Isn't that where the output of debugfs comes from?

>> But apart from that, what I still don't understand is how that could
>> have anything to do with my problem. The memory area I'm not able to
>> access starts at 0xe4000000.
>> BAR0, on the other hand, spans 0xe8802000-0xe8802fff and seems to work
>> fine.
>
> I am not sure, but since we are configuring an invalid memory size,
> maybe the MBus behavior is undefined, and we get some completely funky
> behavior, where parts of the 192 MB window actually work, but parts
> of it do not.

And... Ladies and gentlemen... it turns out YOU'RE RIGHT!!!
With your patch now everything works fine!!!

No words (or quads, for that matter) can express how grateful I am! ;-)

Thank you so much!!!
Gerlando

>
> Thomas
>
Thomas Petazzoni Feb. 19, 2014, 1:37 p.m. UTC | #2
Gerlando, Bjorn,

Bjorn, I added you as the To: because there is a PCI related question
for you below :)

On Wed, 19 Feb 2014 10:39:07 +0100, Gerlando Falauto wrote:

> spoiler first: SUCCESS!!!!

Awesome :)

> > I'm obviously interested in seeing the message that gets shown, as well
> > as the new mvebu-mbus debugfs output.
> 
> ----------
> pci 0000:00:01.0:   bridge window [mem 0xe0000000-0xebffffff]
> PCIE 0.0: creating window at 0xe0000000, size 0xbffffff rounded up to 
> 0x10000000

Right, rounding from 192 MB to 256 MB.

>   cat /sys/kernel/debug/mvebu-mbus/devices
> [00] disabled
> [01] disabled
> [02] disabled
> [03] disabled
> [04] 00000000ff000000 - 00000000ff010000 : nand
> [05] 00000000f4000000 - 00000000f8000000 : vpcie
> [06] 00000000fe000000 - 00000000fe010000 : dragonite
> [07] 00000000e0000000 - 00000000f0000000 : pcie0.0
> 
> > For good measure, could you also dump the registers of the PCIe
> > window? In your case, it was window 7, so dumping 0xf1020070 and
> > 0xf1020074 would be useful.
> 
> Isn't that where the output of debugfs comes from?

It is, but the mvebu-mbus driver interprets the sequence of 1s and 0s
to give the real size, and this involves a little bit of bit
manipulation magic, which I wanted to check by having a look at the raw
values of the registers.

> > I am not sure, but since we are configuring an invalid memory size,
> > maybe the MBus behavior is undefined, and we get some completely funky
> > behavior, where parts of the 192 MB window actually work, but parts
> > of it do not.
> 
> And... Ladies and gentlemen... it turns out YOU'RE RIGHT!!!
> With your patch now everything works fine!!!
> 
> No words (or quads, for that matter) can express how grateful I am! ;-)

Cool. However, I am not sure my fix is really correct, because if you
had another PCIe device that needed 64 MB of memory space, the PCIe
core would have allocated addresses 0xec000000 -> 0xf0000000 to it,
which would have conflicted with the forced "power of 2 up-rounding"
we've applied on the memory space of the first device.

Therefore, I believe this constraint should be taken into account by
the PCIe core when allocating the different memory regions for each
device.

Bjorn, the mvebu PCIe host driver has the constraint that the I/O and
memory regions associated to each PCIe device of the emulated bridge
have a size that is a power of 2.

I am currently using the ->align_resource() hook to ensure that the
start address of the resource matches certain other constraints, but I
don't see a way of telling the PCI core that I need the resource to
have its size rounded up to the next power of 2 size. Is there a way of
doing this?
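
For reference, the hook I am mentioning looks roughly like this
(simplified, from memory, so take it as a sketch rather than the exact
pci-mvebu code):

static resource_size_t mvebu_pcie_align_resource(struct pci_dev *dev,
						 const struct resource *res,
						 resource_size_t start,
						 resource_size_t size,
						 resource_size_t align)
{
	/* MBus windows must be aligned on their size, so push the start
	 * of memory resources up accordingly.  Note that only 'start'
	 * can be adjusted here, not 'size'. */
	if (res->flags & IORESOURCE_MEM)
		return round_up(start, max_t(resource_size_t, SZ_1M, size));

	return start;
}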

In the case described by Gerlando, the PCI core has assigned a 192 MB
region, but the Marvell hardware can only create windows that have a
power of two size, i.e 256 MB. Therefore, the PCI core should be told
this constraint, so that it doesn't allocate the next resource right
after the 192 MB one.

Thanks!

Thomas
Bjorn Helgaas Feb. 19, 2014, 9:45 p.m. UTC | #3
On Wed, Feb 19, 2014 at 6:37 AM, Thomas Petazzoni
<thomas.petazzoni@free-electrons.com> wrote:
> Gerlando, Bjorn,
>
> Bjorn, I added you as the To: because there is a PCI related question
> for you below :)
>
> On Wed, 19 Feb 2014 10:39:07 +0100, Gerlando Falauto wrote:
>
>> spoiler first: SUCCESS!!!!
>
> Awesome :)
>
>> > I'm obviously interested in seeing the message that gets shown, as well
>> > as the new mvebu-mbus debugfs output.
>>
>> ----------
>> pci 0000:00:01.0:   bridge window [mem 0xe0000000-0xebffffff]
>> PCIE 0.0: creating window at 0xe0000000, size 0xbffffff rounded up to
>> 0x10000000
>
> Right, rounding from 192 MB to 256 MB.
>
>>   cat /sys/kernel/debug/mvebu-mbus/devices
>> [00] disabled
>> [01] disabled
>> [02] disabled
>> [03] disabled
>> [04] 00000000ff000000 - 00000000ff010000 : nand
>> [05] 00000000f4000000 - 00000000f8000000 : vpcie
>> [06] 00000000fe000000 - 00000000fe010000 : dragonite
>> [07] 00000000e0000000 - 00000000f0000000 : pcie0.0
>>
>> > For good measure, could you also dump the registers of the PCIe
>> > window? In your case, it was window 7, so dumping 0xf1020070 and
>> > 0xf1020074 would be useful.
>>
>> Isn't that where the output of debugfs comes from?
>
> It is, but the mvebu-mbus driver interprets the sequence of 1s and 0s
> to give the real size, and this involves a little bit of bit
> manipulation magic, which I wanted to check by having a look at the raw
> values of the registers.
>
>> > I am not sure, but since we are configuring an invalid memory size,
>> > maybe the MBus behavior is undefined, and we get some completely funky
>> > behavior, where parts of the 192 MB window actually work, but parts
>> > of it do not.
>>
>> And... Ladies and gentlemen... it turns out YOU'RE RIGHT!!!
>> With your patch now everything works fine!!!
>>
>> No words (or quads, for that matter) can express how grateful I am! ;-)
>
> Cool. However, I am not sure my fix is really correct, because if you
> had another PCIe device that needed 64 MB of memory space, the PCIe
> core would have allocated addresses 0xec000000 -> 0xf0000000 to it,
> which would have conflicted with the forced "power of 2 up-rounding"
> we've applied on the memory space of the first device.
>
> Therefore, I believe this constraint should be taken into account by
> the PCIe core when allocating the different memory regions for each
> device.
>
> Bjorn, the mvebu PCIe host driver has the constraint that the I/O and
> memory regions associated to each PCIe device of the emulated bridge
> have a size that is a power of 2.
>
> I am currently using the ->align_resource() hook to ensure that the
> start address of the resource matches certain other constraints, but I
> don't see a way of telling the PCI core that I need the resource to
> have its size rounded up to the next power of 2 size. Is there a way of
> doing this?
>
> In the case described by Gerlando, the PCI core has assigned a 192 MB
> region, but the Marvell hardware can only create windows that have a
> power of two size, i.e 256 MB. Therefore, the PCI core should be told
> this constraint, so that it doesn't allocate the next resource right
> after the 192 MB one.

I'm not sure I understand this correctly, but I *think* this 192 MB
region that gets rounded up to 256 MB because of the Marvell
constraint is a host bridge aperture.  If that's the case, it's
entirely up to you (the host bridge driver author) to round it as
needed before passing it to pci_add_resource_offset().

The PCI core will never allocate any space that is outside the host
bridge apertures.

But maybe I don't understand your situation well enough.

Bjorn
Thomas Petazzoni Feb. 20, 2014, 8:55 a.m. UTC | #4
Dear Bjorn Helgaas,

+ Jason Gunthorpe.

On Wed, 19 Feb 2014 14:45:48 -0700, Bjorn Helgaas wrote:

> > Cool. However, I am not sure my fix is really correct, because if you
> > had another PCIe device that needed 64 MB of memory space, the PCIe
> > core would have allocated addresses 0xec000000 -> 0xf0000000 to it,
> > which would have conflicted with the forced "power of 2 up-rounding"
> > we've applied on the memory space of the first device.
> >
> > Therefore, I believe this constraint should be taken into account by
> > the PCIe core when allocating the different memory regions for each
> > device.
> >
> > Bjorn, the mvebu PCIe host driver has the constraint that the I/O and
> > memory regions associated to each PCIe device of the emulated bridge
> > have a size that is a power of 2.
> >
> > I am currently using the ->align_resource() hook to ensure that the
> > start address of the resource matches certain other constraints, but I
> > don't see a way of telling the PCI core that I need the resource to
> > have its size rounded up to the next power of 2 size. Is there a way of
> > doing this?
> >
> > In the case described by Gerlando, the PCI core has assigned a 192 MB
> > region, but the Marvell hardware can only create windows that have a
> > power of two size, i.e 256 MB. Therefore, the PCI core should be told
> > this constraint, so that it doesn't allocate the next resource right
> > after the 192 MB one.
> 
> I'm not sure I understand this correctly, but I *think* this 192 MB
> region that gets rounded up to 256 MB because of the Marvell
> constraint is a host bridge aperture.  If that's the case, it's
> entirely up to you (the host bridge driver author) to round it as
> needed before passing it to pci_add_resource_offset().
> 
> The PCI core will never allocate any space that is outside the host
> bridge apertures.

Hum, I believe there is a misunderstanding here. We are already using
pci_add_resource_offset() to define the global aperture for the entire
PCI bridge. This is not causing any problem.

Let me give a little bit of background first.

On Marvell hardware, the physical address space layout is configurable,
through the use of "MBus windows". A "MBus window" is defined by a base
address, a size, and a target device. So if the CPU needs to access a
given device (such as PCIe 0.0 for example), then we need to create a
"MBus window" whose size and target device match PCIe 0.0.

Since Armada XP has 10 PCIe interfaces, we cannot just statically
create as many MBus windows as there are PCIe interfaces: it would both
exhaust the number of MBus windows available, and also exhaust the
physical address space, because we would have to create very large
windows, just in case the PCIe device plugged behind this interface
needs large BARs.

So, what the pci-mvebu.c driver does is that it creates an emulated PCI
bridge. This emulated bridge is used to let the Linux PCI core
enumerate the real physical PCI devices behind the bridge, allocate a
range of physical addresses that is available for each of these
devices, and write them to the bridge registers. Since the bridge is
not a real one, but emulated, we trap those writes and use them to
create the MBus windows that will allow the CPU to actually access the
device, at the base address chosen by the Linux PCI core during the
enumeration process.

However, MBus windows have a certain constraint that they must have a
power of two size, so the Linux PCI core should not write to one of the
bridge PCI_MEMORY_BASE / PCI_MEMORY_LIMIT registers any range of
addresses whose size is not a power of 2.

Let me take the example of Gerlando:

pci 0000:01:00.0: BAR 1: assigned [mem 0xe0000000-0xe7ffffff]
pci 0000:01:00.0: BAR 3: assigned [mem 0xe8000000-0xe87fffff]
pci 0000:01:00.0: BAR 4: assigned [mem 0xe8800000-0xe8801fff]
pci 0000:01:00.0: BAR 0: assigned [mem 0xe8802000-0xe8802fff]
pci 0000:01:00.0: BAR 2: assigned [mem 0xe8803000-0xe8803fff]
pci 0000:01:00.0: BAR 5: assigned [mem 0xe8804000-0xe8804fff]
pci 0000:00:01.0: PCI bridge to [bus 01]
pci 0000:00:01.0:   bridge window [mem 0xe0000000-0xebffffff]

So, pci 0000:01:00 is the real device, which has a number of BARs of a
certain size. Taking into account all those BARs, the Linux PCI core
decides to assign [mem 0xe0000000-0xebffffff] to the bridge (last line
of the log above). The problem is that [mem 0xe0000000-0xebffffff] is
192 MB, but we would like the Linux PCI core to extend that to 256 MB.
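
(To make the arithmetic explicit, using the base/limit decoding visible
in the patch quoted earlier in this thread: base = (membase & 0xFFF0) <<
16 = 0xe0000000 and limit end = ((memlimit & 0xFFF0) << 16) | 0xFFFFF =
0xebffffff, i.e. a 0x0c000000 = 192 MB range, which is what then gets
handed to mvebu_mbus_add_window_by_id().)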

As you can see it is not about the global aperture associated to the
bridge, but about the size of the window associated to each "port" of
the bridge.

Does that make sense? Keep in mind that I'm still not completely
familiar with the PCI terminology, so maybe the above explanation does
not use the right terms.

Thanks for your feedback,

Thomas
Jason Gunthorpe Feb. 20, 2014, 5:35 p.m. UTC | #5
On Thu, Feb 20, 2014 at 09:55:18AM +0100, Thomas Petazzoni wrote:

> Does that make sense? Keep in mind that I'm still not completely
> familiar with the PCI terminology, so maybe the above explanation does
> not use the right terms.

Stated another way, the Marvell PCI-E to PCI-E bridge config space has
a quirk that requires the window BARs to be aligned on their size and
sized to a power of 2.

The first requirement is already being handled by hooking through
ARM's 'align_resource' callback.

One avenue would be to have mvebu_pcie_align_resource return a struct
resource and manipulate the size as well. Assuming the PCI core will
accommodate that.

Jason
Bjorn Helgaas Feb. 20, 2014, 7:18 p.m. UTC | #6
On Thu, Feb 20, 2014 at 1:55 AM, Thomas Petazzoni
<thomas.petazzoni@free-electrons.com> wrote:
> Dear Bjorn Helgaas,
>
> + Jason Gunthorpe.
>
> On Wed, 19 Feb 2014 14:45:48 -0700, Bjorn Helgaas wrote:
>
>> > Cool. However, I am not sure my fix is really correct, because if you
>> > had another PCIe device that needed 64 MB of memory space, the PCIe
>> > core would have allocated addresses 0xec000000 -> 0xf0000000 to it,
>> > which would have conflicted with the forced "power of 2 up-rounding"
>> > we've applied on the memory space of the first device.
>> >
>> > Therefore, I believe this constraint should be taken into account by
>> > the PCIe core when allocating the different memory regions for each
>> > device.
>> >
>> > Bjorn, the mvebu PCIe host driver has the constraint that the I/O and
>> > memory regions associated to each PCIe device of the emulated bridge
>> > have a size that is a power of 2.
>> >
>> > I am currently using the ->align_resource() hook to ensure that the
>> > start address of the resource matches certain other constraints, but I
>> > don't see a way of telling the PCI core that I need the resource to
>> > have its size rounded up to the next power of 2 size. Is there a way of
>> > doing this?
>> >
>> > In the case described by Gerlando, the PCI core has assigned a 192 MB
>> > region, but the Marvell hardware can only create windows that have a
>> > power of two size, i.e 256 MB. Therefore, the PCI core should be told
>> > this constraint, so that it doesn't allocate the next resource right
>> > after the 192 MB one.
>>
>> I'm not sure I understand this correctly, but I *think* this 192 MB
>> region that gets rounded up to 256 MB because of the Marvell
>> constraint is a host bridge aperture.  If that's the case, it's
>> entirely up to you (the host bridge driver author) to round it as
>> needed before passing it to pci_add_resource_offset().
>>
>> The PCI core will never allocate any space that is outside the host
>> bridge apertures.
>
> Hum, I believe there is a misunderstanding here. We are already using
> pci_add_resource_offset() to define the global aperture for the entire
> PCI bridge. This is not causing any problem.
>
> Let me give a little bit of background first.
>
> On Marvell hardware, the physical address space layout is configurable,
> through the use of "MBus windows". A "MBus window" is defined by a base
> address, a size, and a target device. So if the CPU needs to access a
> given device (such as PCIe 0.0 for example), then we need to create a
> "MBus window" whose size and target device match PCIe 0.0.

I was assuming "PCIe 0.0" was a host bridge, but it sounds like maybe
that's not true.  Is it really a PCIe root port?  That would mean the
MBus windows are some non-PCIe-compliant thing between the root
complex and the root ports, I guess.

> Since Armada XP has 10 PCIe interfaces, we cannot just statically
> create as many MBus windows as there are PCIe interfaces: it would both
> exhaust the number of MBus windows available, and also exhaust the
> physical address space, because we would have to create very large
> windows, just in case the PCIe device plugged behind this interface
> needs large BARs.

Everybody else in the world *does* statically configure host bridge
apertures before enumerating the devices below the bridge.  I see why
you want to know what devices are there before deciding whether and
how large to make an MBus window.  But that is new functionality that
we don't have today, and the general idea is not Marvell-specific, so
other systems might want something like this, too.  So I'm not sure if
using quirks to try to wedge it into the current PCI core is the right
approach.  I don't have another proposal, but we should at least think
about what direction we want to take.

> So, what the pci-mvebu.c driver does is that it creates an emulated PCI
> bridge. This emulated bridge is used to let the Linux PCI core
> enumerate the real physical PCI devices behind the bridge, allocate a
> range of physical addresses that is available for each of these
> devices, and write them to the bridge registers. Since the bridge is
> not a real one, but emulated, we trap those writes and use them to
> create the MBus windows that will allow the CPU to actually access the
> device, at the base address chosen by the Linux PCI core during the
> enumeration process.
>
> However, MBus windows have a certain constraint that they must have a
> power of two size, so the Linux PCI core should not write to one of the
> bridge PCI_MEMORY_BASE / PCI_MEMORY_LIMIT registers any range of
> addresses whose size is not a power of 2.

I'm still not sure I understand what's going on here.  It sounds like
your emulated bridge basically wraps the host bridge and makes it look
like a PCI-PCI bridge.  But I assume the host bridge itself is also
visible, and has apertures (I guess these are the MBus windows?)  So
when you first discover the host bridge, before enumerating anything
below it, what apertures does it have?  Do you leave them disabled
until after we enumerate the devices, figure out how much space they
need, and configure the emulated PCI-PCI bridge to enable the MBus
windows?

It'd be nice if dmesg mentioned the host bridge explicitly as we do on
other architectures; maybe that would help understand what's going on
under the covers.  Maybe a longer excerpt would already have this; you
already use pci_add_resource_offset(), which is used when creating the
root bus, so you must have some sort of aperture before enumerating.

> Let me take the example of Gerlando:
>
> pci 0000:01:00.0: BAR 1: assigned [mem 0xe0000000-0xe7ffffff]
> pci 0000:01:00.0: BAR 3: assigned [mem 0xe8000000-0xe87fffff]
> pci 0000:01:00.0: BAR 4: assigned [mem 0xe8800000-0xe8801fff]
> pci 0000:01:00.0: BAR 0: assigned [mem 0xe8802000-0xe8802fff]
> pci 0000:01:00.0: BAR 2: assigned [mem 0xe8803000-0xe8803fff]
> pci 0000:01:00.0: BAR 5: assigned [mem 0xe8804000-0xe8804fff]
> pci 0000:00:01.0: PCI bridge to [bus 01]
> pci 0000:00:01.0:   bridge window [mem 0xe0000000-0xebffffff]
>
> So, pci 0000:01:00 is the real device, which has a number of BARs of a
> certain size. Taking into account all those BARs, the Linux PCI core
> decides to assign [mem 0xe0000000-0xebffffff] to the bridge (last line
> of the log above). The problem is that [mem 0xe0000000-0xebffffff] is
> 192 MB, but we would like the Linux PCI core to extend that to 256 MB.

If 01:00.0 is a PCIe endpoint, it must have a root port above it, so
that means 00:01.0 must be the root port.  But I think you're saying
that 00:01.0 is actually *emulated* and isn't PCIe-compliant, e.g., it
has extra window alignment restrictions.  I'm scared about what other
non-PCIe-compliant things there might be.  What happens when the PCI
core configures MPS, ASPM, etc.?

> As you can see it is not about the global aperture associated to the
> bridge, but about the size of the window associated to each "port" of
> the bridge.

Bjorn
Thomas Petazzoni Feb. 20, 2014, 8:29 p.m. UTC | #7
Dear Jason Gunthorpe,

On Thu, 20 Feb 2014 10:35:18 -0700, Jason Gunthorpe wrote:
> On Thu, Feb 20, 2014 at 09:55:18AM +0100, Thomas Petazzoni wrote:
> 
> > Does that make sense? Keep in mind that I'm still not completely
> > familiar with the PCI terminology, so maybe the above explanation does
> > not use the right terms.
> 
> Stated another way, the Marvell PCI-E to PCI-E bridge config space has
> a quirk that requires the window BARs to be aligned on their size and
> sized to a power of 2.

Correct.

> The first requirement is already being handled by hooking through
> ARM's 'align_resource' callback.

Absolutely.

> One avenue would be to have mvebu_pcie_align_resource return a struct
> resource and manipulate the size as well. Assuming the PCI core will
> accommodate that.

That would effectively be the easiest solution from the point of view
of the PCIe driver.

In practice, the story is a little bit more subtle than that: the PCIe
driver may want to decide to either tell the PCI core to enlarge the
window BAR up to the next power of two size, or to dedicate two windows
to it.

For example:

 * If the PCI core allocates a 96 KB BAR, we clearly want it to be
   enlarged to 128 KB, so that we have to create a single window for it.

 * However, if the PCI core allocates a 192 MB BAR, we may want to
   instead create two windows: a first one of 128 MB and a second one
   of 64 MB. This consumes two windows, but saves 64 MB of physical
   address space.

(Note that I haven't tested myself the creation of two windows for the
same target device, but I was told by Lior that it should work).

As you can see from the two examples above, we may not necessarily want
to enforce this power-of-two constraint in all cases. We may want to
accept a non-power-of-2 size in the case of the 192 MB BAR, and let the
mvebu-mbus driver figure out that it should allocate several
consecutive windows to cover these 192 MB.
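
Something along these lines on the mvebu-mbus side, for example
(hypothetical helper, untested, error handling mostly left out):

static int mvebu_mbus_add_window_split(u8 target, u8 attr,
				       phys_addr_t base, size_t size)
{
	/* Hypothetical helper: cover a non-power-of-two range with
	 * several power-of-two, size-aligned windows, largest first. */
	while (size) {
		size_t chunk = rounddown_pow_of_two(size);
		int ret;

		/* A window must also be aligned on its own size. */
		while (base & (chunk - 1))
			chunk >>= 1;

		ret = mvebu_mbus_add_window_by_id(target, attr, base, chunk);
		if (ret)
			return ret;

		base += chunk;
		size -= chunk;
	}

	return 0;
}

For Gerlando's 0xe0000000 + 192 MB case, this would create a 128 MB
window at 0xe0000000 and a 64 MB window at 0xe8000000.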

But to begin with, rounding all window BARs up to the next power of
two size would be perfectly OK.

Jason, would you mind maybe replying to Bjorn Helgaas email (Thu, 20
Feb 2014 12:18:42 -0700) ? I believe that a lot of the misunderstanding
between Bjorn and me is due to the fact that I don't use the correct
PCI terminology to describe how the Marvell hardware works, and how the
Marvell PCIe driver copes with it. I'm sure you would explain it in a
way that would be more easily understood by someone very familiar with
the PCI terminology such as Bjorn. Thanks a lot!

Best regards,

Thomas
Jason Gunthorpe Feb. 21, 2014, 12:24 a.m. UTC | #8
On Thu, Feb 20, 2014 at 12:18:42PM -0700, Bjorn Helgaas wrote:

> > On Marvell hardware, the physical address space layout is configurable,
> > through the use of "MBus windows". A "MBus window" is defined by a base
> > address, a size, and a target device. So if the CPU needs to access a
> > given device (such as PCIe 0.0 for example), then we need to create a
> > "MBus window" whose size and target device match PCIe 0.0.
> 
> I was assuming "PCIe 0.0" was a host bridge, but it sounds like maybe
> that's not true.  Is it really a PCIe root port?  That would mean the
> MBus windows are some non-PCIe-compliant thing between the root
> complex and the root ports, I guess.

It really is a root port. The hardware acts like a root port at the
TLP level. It has all the root port specific stuff in some format but
critically, completely lacks a compliant config space for a root
port bridge.

So the driver creates a 'compliant' config space for the root
port. Building the config space requires harmonizing registers related
to PCI-E and registers related to internal routing, and dealing
with the mismatch between what the hardware can actually provide and
what the PCI spec requires it to provide.

The only mismatch we know about that gets exposed to the PCI core is
the bridge window address alignment restriction.

This is what Thomas has been asking about.

> > Since Armada XP has 10 PCIe interfaces, we cannot just statically
> > create as many MBus windows as there are PCIe interfaces: it would both
> > exhaust the number of MBus windows available, and also exhaust the
> > physical address space, because we would have to create very large
> > windows, just in case the PCIe device plugged behind this interface
> > needs large BARs.
> 
> Everybody else in the world *does* statically configure host bridge
> apertures before enumerating the devices below the bridge.  

The original PCI-E driver for this hardware did use a 1 root port per
host bridge model, with static host bridge aperture allocation and so
forth.

It works fine, just like everyone else in the world, as long as you
have only 1 or 2 ports. The XP hardware had *10* ports on a single
32 bit machine. You run out of address space, you run out of
HW routing resources, it just doesn't work acceptably.

> I see why you want to know what devices are there before deciding
> whether and how large to make an MBus window.  But that is new
> functionality that we don't have today, and the general idea is not

Well, in general, it isn't new core functionality, it is functionality
that already exists to support PCI bridges.

Choosing to use a one host bridge to N root port bridge model lets the
driver use all that functionality and the only wrinkle that becomes
visible to the PCI core as a whole is the non-compliant alignment
restriction on the bridge window BAR.

This also puts the driver in alignment with the PCI-E specs for root
complexes, which means user space can actually see things like the
PCI-E root port link capability block, and makes hot plug work
properly (I am actively using hot plug with this driver).

I personally think this is a reasonable way to support this highly
flexible HW.

> I'm still not sure I understand what's going on here.  It sounds like
> your emulated bridge basically wraps the host bridge and makes it look
> like a PCI-PCI bridge.  But I assume the host bridge itself is also
> visible, and has apertures (I guess these are the MBus windows?)  

No, there is only one bridge, it is a per-physical-port MBUS / PCI-E
bridge. It performs an identical function to the root port bridge
described in PCI-E. MBUS serves as the root-complex internal bus 0.

There aren't 2 levels of bridging, so the MBUS / PCI-E bridge can
claim any system address and there is no such thing as a 'host
bridge'.

What Linux calls 'the host bridge aperture' is simply a whack of
otherwise unused physical address space; it has no special properties.

> It'd be nice if dmesg mentioned the host bridge explicitly as we do on
> other architectures; maybe that would help understand what's going on
> under the covers.  Maybe a longer excerpt would already have this; you
> already use pci_add_resource_offset(), which is used when creating the
> root bus, so you must have some sort of aperture before enumerating.

Well, /proc/iomem looks like this:

e0000000-efffffff : PCI MEM 0000
  e0000000-e00fffff : PCI Bus 0000:01
    e0000000-e001ffff : 0000:01:00.0

'PCI MEM 0000' is the 'host bridge aperture'; it is an arbitrary
range of address space that doesn't overlap anything.

'PCI Bus 0000:01' is the MBUS / PCI-E root port bridge for physical
port 0

'0000:01:00.0' is BAR 0 of an off-chip device.

> If 01:00.0 is a PCIe endpoint, it must have a root port above it, so
> that means 00:01.0 must be the root port.  But I think you're saying
> that 00:01.0 is actually *emulated* and isn't PCIe-compliant, e.g., it
> has extra window alignment restrictions.  

It is important to understand that the emulation is only of the root
port bridge configuration space. The underlying TLP processing is done
in HW and is compliant.

> I'm scared about what other non-PCIe-compliant things there might
> be.  What happens when the PCI core configures MPS, ASPM, etc.?

As the TLP processing and the underlying PHY are all compliant, these
things are all supported in HW.

MPS is supported directly by the HW

ASPM is supported by the HW, as is the entire link capability and
status block.

AER is supported directly by the HW

But here is the thing: without the software-emulated config space
there would be no sane way for the Linux PCI core to access these
features. The HW simply does not present them in a way that the core
code can understand without a SW intervention of some kind.

Jason
Jason Gunthorpe Feb. 21, 2014, 12:32 a.m. UTC | #9
On Thu, Feb 20, 2014 at 09:29:14PM +0100, Thomas Petazzoni wrote:

> In practice, the story is a little bit more subtle than that: the PCIe
> driver may want to decide to either tell the PCI core to enlarge the
> window BAR up to the next power of two size, or to dedicate two windows
> to it.

That is a smart, easy solution! Maybe that is the least invasive way
to proceed for now?

I have no idea how you decide when to round up and when to allocate
more windows, that feels like a fairly complex optimization problem!

Alternatively, I suspect you can use the PCI quirk mechanism to alter
the resource sizing on a bridge?

> Jason, would you mind maybe replying to Bjorn Helgaas email (Thu, 20
> Feb 2014 12:18:42 -0700) ? I believe that a lot of the misunderstanding
> between Bjorn and me is due to the fact that I don't use the correct
> PCI terminology to describe how the Marvell hardware works, and how the
> Marvell PCIe driver copes with it. I'm sure you would explain it in a
> way that would be more easily understood by someone very familiar with
> the PCI terminology such as Bjorn. Thanks a lot!

Done!

Hope it helps,
Jason
Thomas Petazzoni Feb. 21, 2014, 8:34 a.m. UTC | #10
Dear Jason Gunthorpe,

On Thu, 20 Feb 2014 17:32:27 -0700, Jason Gunthorpe wrote:

> > In practice, the story is a little bit more subtle than that: the PCIe
> > driver may want to decide to either tell the PCI core to enlarge the
> > window BAR up to the next power of two size, or to dedicate two windows
> > to it.
> 
> That is a smart, easy solution! Maybe that is the least invasive way
> to proceed for now?

So you suggest that the mvebu-mbus driver should accept a
non power-of-two window size, and do internally the job of cutting that
into several power-of-two sized areas and creating the corresponding
windows?

> I have no idea how you decide when to round up and when to allocate
> more windows, that feels like a fairly complex optimization problem!

Yes, it is a fairly complex problem. I was thinking of a threshold of
"lost space". Below this threshold, it's better to enlarge the window,
above the threshold it's better to create two windows. But not easy.

> Alternatively, I suspect you can use the PCI quirk mechanism to alter
> the resource sizing on a bridge?

Can you give more details about this mechanism, and how it could be
used to alter the size of resources on a bridge?

> > Jason, would you mind maybe replying to Bjorn Helgaas email (Thu, 20
> > Feb 2014 12:18:42 -0700) ? I believe that a lot of the misunderstanding
> > between Bjorn and me is due to the fact that I don't use the correct
> > PCI terminology to describe how the Marvell hardware works, and how the
> > Marvell PCIe driver copes with it. I'm sure you would explain it in a
> > way that would be more easily understood by someone very familiar with
> > the PCI terminology such as Bjorn. Thanks a lot!
> 
> Done!

Thanks a lot! Really appreciated.

Best regards,

Thomas
Gerlando Falauto Feb. 21, 2014, 8:58 a.m. UTC | #11
Hi guys,

first of all thank you for your support and the explanations.
I'm slowly starting to understand something more about this kind of stuff.

On 02/21/2014 09:34 AM, Thomas Petazzoni wrote:
> Dear Jason Gunthorpe,
>
> On Thu, 20 Feb 2014 17:32:27 -0700, Jason Gunthorpe wrote:
>
>>> In practice, the story is a little bit more subtle than that: the PCIe
>>> driver may want to decide to either tell the PCI core to enlarge the
>>> window BAR up to the next power of two size, or to dedicate two windows
>>> to it.
>>
>> That is a smart, easy solution! Maybe that is the least invasive way
>> to proceed for now?
>
> So you suggest that the mvebu-mbus driver should accept a
> non power-of-two window size, and do internally the job of cutting that
> into several power-of-two sized areas and creating the corresponding
> windows?
>
>> I have no idea how you decide when to round up and when to allocate
>> more windows, that feels like a fairly complex optimization problem!
>
> Yes, it is a fairly complex problem. I was thinking of a threshold of
> "lost space". Below this threshold, it's better to enlarge the window,
> above the threshold it's better to create two windows. But not easy.
>
>> Alternatively, I suspect you can use the PCI quirk mechanism to alter
>> the resource sizing on a bridge?
>
> Can you give more details about this mechanism, and how it could be
> used to alter the size of resources on a bridge?

I'm not sure I understand all the details... but I guess some sort of 
rounding mechanism is indeed already in place somewhere:

pci 0000:00:01.0: BAR 8: assigned [mem 0xe0000000-0xebffffff]
pci 0000:01:00.0: BAR 1: assigned [mem 0xe0000000-0xe7ffffff]
pci 0000:01:00.0: BAR 3: assigned [mem 0xe8000000-0xe87fffff]
pci 0000:01:00.0: BAR 4: assigned [mem 0xe8800000-0xe8801fff]
pci 0000:01:00.0: BAR 0: assigned [mem 0xe8802000-0xe8802fff]
pci 0000:01:00.0: BAR 2: assigned [mem 0xe8803000-0xe8803fff]
pci 0000:01:00.0: BAR 5: assigned [mem 0xe8804000-0xe8804fff]
pci 0000:00:01.0: PCI bridge to [bus 01]

If you look at the numbers, the total size required by BAR0-5 is 
0x8805000, so around 136MB, that is 128MB+8MB+8KB+4KB+4KB+4KB.
This gets rounded up (on this 'virtual' BAR 8) to 192MB (I don't know 
where or why), which is 1.5x a power of two (i.e. two consecutive bits 
followed by all zeroes).

If that's not just a coincidence, finding a coverage subset becomes a 
trivial matter (128MB+64MB).

In any case, even if we have an odd number like the above (0x8805000), I 
believe we could easily find a suboptimal coverage by just taking the 
most significant one and the second most significant one (possibly left 
shifted by 1 if there's a third one somewhere else).
In the above case, that would be 0x8000000 + 0x1000000. That's 
128MB+16MB, which is even smaller than the rounding above (192MB).
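
Just to make the idea concrete, here is a quick userspace toy (purely
illustrative, untested):

#include <stdio.h>

/* Highest power of two <= x (for x > 0). */
static unsigned long highest_bit(unsigned long x)
{
	unsigned long b = 1;

	while (x >>= 1)
		b <<= 1;
	return b;
}

int main(void)
{
	unsigned long size = 0x8805000;	/* 128MB + 8MB + 20KB, as above */
	unsigned long w1 = highest_bit(size);
	unsigned long rest = size - w1;
	unsigned long w2 = highest_bit(rest);

	/* Left-shift the second window if lower bits remain uncovered. */
	if (rest & (w2 - 1))
		w2 <<= 1;

	printf("cover 0x%lx with 0x%lx + 0x%lx\n", size, w1, w2);
	/* -> cover 0x8805000 with 0x8000000 + 0x1000000 (128MB + 16MB) */
	return 0;
}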

What do you think?

Thanks again!
Gerlando
Thomas Petazzoni Feb. 21, 2014, 9:12 a.m. UTC | #12
Dear Gerlando Falauto,

On Fri, 21 Feb 2014 09:58:21 +0100, Gerlando Falauto wrote:

> > Can you give more details about this mechanism, and how it could be
> > used to alter the size of resources on a bridge?
> 
> I'm not sure I understand all the details... but I guess some sort of 
> rounding mechanism is indeed already in place somewhere:
> 
> pci 0000:00:01.0: BAR 8: assigned [mem 0xe0000000-0xebffffff]
> pci 0000:01:00.0: BAR 1: assigned [mem 0xe0000000-0xe7ffffff]
> pci 0000:01:00.0: BAR 3: assigned [mem 0xe8000000-0xe87fffff]
> pci 0000:01:00.0: BAR 4: assigned [mem 0xe8800000-0xe8801fff]
> pci 0000:01:00.0: BAR 0: assigned [mem 0xe8802000-0xe8802fff]
> pci 0000:01:00.0: BAR 2: assigned [mem 0xe8803000-0xe8803fff]
> pci 0000:01:00.0: BAR 5: assigned [mem 0xe8804000-0xe8804fff]
> pci 0000:00:01.0: PCI bridge to [bus 01]
> 
> If you look at the numbers, the total size required by BAR0-5 is 
> 0x8805000, so around 136MB, that is 128MB+8MB+8KB+4KB+4KB+4KB.
> This gets rounded up (on this 'virtual' BAR 8) to 192MB (I don't know 
> where or why), which is 1.5x a power of two (i.e. two consecutive bits 
> followed by all zeroes).

Would indeed be interesting to know who does this rounding, and why,
and according to what rules.

> If that's not just a coincidence, finding a coverage subset becomes a 
> trivial matter (128MB+64MB).
> 
> In any case, even if we have an odd number like the above (0x8805000), I 
> believe we could easily find a suboptimal coverage by just taking the 
> most significant one and the second most significant one (possibly left 
> shifted by 1 if there's a third one somewhere else).
> In the above case, that would be 0x8000000 + 0x1000000. That's 
> 128MB+16MB, which is even smaller than the rounding above (192MB).
> 
> What do you think?

Sure, but whichever choice we make, the Linux PCI core must know by how
much we've enlarged the bridge window BAR, otherwise the Linux PCI core
may allocate for the next bridge window BAR a range of addresses that
doesn't overlap with what it has allocated for the previous bridge
window BAR, but that ends up overlapping due to us "extending" the
previous bridge window BAR to match the MBus requirements.

Gerlando, would you be able to test a quick hack that creates 2 windows
to cover exactly 128 MB + 64 MB ? This would at least allow us to
confirm that the strategy of splitting in multiple windows is usable.

Thanks!

Thomas
Gerlando Falauto Feb. 21, 2014, 9:16 a.m. UTC | #13
Hi Thomas,

On 02/21/2014 10:12 AM, Thomas Petazzoni wrote:
> Dear Gerlando Falauto,
>
> On Fri, 21 Feb 2014 09:58:21 +0100, Gerlando Falauto wrote:
>
>>> Can you give more details about this mechanism, and how it could be
>>> used to alter the size of resources on a bridge?
>>
>> I'm not sure I understand all the details... but I guess some sort of
>> rounding mechanism is indeed already in place somewhere:
>>
>> pci 0000:00:01.0: BAR 8: assigned [mem 0xe0000000-0xebffffff]
>> pci 0000:01:00.0: BAR 1: assigned [mem 0xe0000000-0xe7ffffff]
>> pci 0000:01:00.0: BAR 3: assigned [mem 0xe8000000-0xe87fffff]
>> pci 0000:01:00.0: BAR 4: assigned [mem 0xe8800000-0xe8801fff]
>> pci 0000:01:00.0: BAR 0: assigned [mem 0xe8802000-0xe8802fff]
>> pci 0000:01:00.0: BAR 2: assigned [mem 0xe8803000-0xe8803fff]
>> pci 0000:01:00.0: BAR 5: assigned [mem 0xe8804000-0xe8804fff]
>> pci 0000:00:01.0: PCI bridge to [bus 01]
>>
>> If you look at the numbers, the total size required by BAR0-5 is
>> 0x8805000, so around 136MB, that is 128MB+8MB+8KB+4KB+4KB+4KB.
>> This gets rounded up (on this 'virtual' BAR 8) to 192MB (I don't know
>> where or why), which is 1.5x a power of two (i.e. two consecutive bits
>> followed by all zeroes).
>
> Would indeed be interesting to know who does this rounding, and why,
> and according to what rules.
>
>> If that's not just a coincidence, finding a coverage subset becomes a
>> trivial matter (128MB+64MB).
>>
>> In any case, even if we have an odd number like the above (0x8805000), I
>> believe we could easily find a suboptimal coverage by just taking the
>> most significant one and the second most significant one (possibly left
>> shifted by 1 if there's a third one somewhere else).
>> In the above case, that would be 0x8000000 + 0x1000000. That's
>> 128MB+16MB, which is even smaller than the rounding above (192MB).
>>
>> What do you think?
>
> Sure, but whichever choice we make, the Linux PCI core must know by how
> much we've enlarged the bridge window BAR, otherwise the Linux PCI core
> may allocate for the next bridge window BAR a range of addresses that
> doesn't overlap with what it has allocated for the previous bridge
> window BAR, but that ends up overlapping due to us "extending" the
> previous bridge window BAR to match the MBus requirements.
>
> Gerlando, would you be able to test a quick hack that creates 2 windows
> to cover exactly 128 MB + 64 MB ? This would at least allow us to
> confirm that the strategy of splitting in multiple windows is usable.

Sure, though probably not until next week.
I guess it would then also be useful to restore my previous setup, where 
the total PCIe aperture is 192MB, right?

Thank you guys!
Gerlando
Bjorn Helgaas Feb. 21, 2014, 7:05 p.m. UTC | #14
[+cc Gavin, Ben for EEH alignment question below]

On Thu, Feb 20, 2014 at 5:24 PM, Jason Gunthorpe
<jgunthorpe@obsidianresearch.com> wrote:
> On Thu, Feb 20, 2014 at 12:18:42PM -0700, Bjorn Helgaas wrote:
>
>> > On Marvell hardware, the physical address space layout is configurable,
>> > through the use of "MBus windows". A "MBus window" is defined by a base
>> > address, a size, and a target device. So if the CPU needs to access a
>> > given device (such as PCIe 0.0 for example), then we need to create a
>> > "MBus window" whose size and target device match PCIe 0.0.
> ...

> So the driver creates a 'compliant' config space for the root
> port. Building the config space requires harmonizing registers related
> to PCI-E and registers related to internal routing, and dealing
> with the mismatch between what the hardware can actually provide and
> what the PCI spec requires it to provide.
> ...

>> > Since Armada XP has 10 PCIe interfaces, we cannot just statically
>> > create as many MBus windows as there are PCIe interfaces: it would both
>> > exhaust the number of MBus windows available, and also exhaust the
>> > physical address space, because we would have to create very large
>> > windows, just in case the PCIe device plugged behind this interface
>> > needs large BARs.
> ...

>> I'm still not sure I understand what's going on here.  It sounds like
>> your emulated bridge basically wraps the host bridge and makes it look
>> like a PCI-PCI bridge.  But I assume the host bridge itself is also
>> visible, and has apertures (I guess these are the MBus windows?)
>
> No, there is only one bridge, it is a per-physical-port MBUS / PCI-E
> bridge. It performs an identical function to the root port bridge
> described in PCI-E. MBUS serves as the root-complex internal bus 0.
>
> There aren't 2 levels of bridging, so the MBUS / PCI-E bridge can
> claim any system address and there is no such thing as a 'host
> bridge'.
>
> What Linux calls 'the host bridge aperture' is simply a whack of
> otherwise unused physical address space; it has no special properties.
>
>> It'd be nice if dmesg mentioned the host bridge explicitly as we do on
>> other architectures; maybe that would help understand what's going on
>> under the covers.  Maybe a longer excerpt would already have this; you
>> already use pci_add_resource_offset(), which is used when creating the
>> root bus, so you must have some sort of aperture before enumerating.
>
> Well, /proc/iomem looks like this:
>
> e0000000-efffffff : PCI MEM 0000
>   e0000000-e00fffff : PCI Bus 0000:01
>     e0000000-e001ffff : 0000:01:00.0
>
> 'PCI MEM 0000' is the 'host bridge aperture'; it is an arbitrary
> range of address space that doesn't overlap anything.
>
> 'PCI Bus 0000:01' is the MBUS / PCI-E root port bridge for physical
> port 0

Thanks for making this more concrete.  Let me see if I understand any better:

- e0000000-efffffff is the "host bridge aperture" but it doesn't
correspond to an actual aperture in hardware (there are no registers
where you set this range).  The only real use for this range is to be
the arena within which the PCI core can assign space to the Root
Ports.  This is static and you don't need to change it based on what
devices we discover.

- There may be several MBus/PCIe Root Ports, and you want to configure
their apertures at enumeration-time based on what devices are below
them.  As you say, the PCI core supports this except that MBus
apertures must be a power-of-two in size and aligned on their size,
while ordinary PCI bridge windows only need to start and end on 1MB
boundaries.

- e0000000-e00fffff is an example of one MBus/PCIe aperture, and this
space is available on PCI bus 01.  This one happens to be 1MB in size,
but it could be 2MB, 4MB, etc., but not 3MB like a normal bridge
window could be.

- You're currently using the ARM ->align_resource() hook (part of
pcibios_align_resource()), which is used in the bowels of the
allocator (__find_resource()) and affects the starting address of the
region we allocate, but not the size.  So you can force the start of
an MBus aperture to be power-of-two aligned, but not the end.

The allocate_resource() alignf argument is only used by PCI and
PCMCIA, so it doesn't seem like it would be too terrible to extend the
alignf interface so it could control the size, too.  Would something
like that solve this problem?
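
For reference, the current callback type is just (quoting from memory,
please double-check against your tree):

	resource_size_t (*alignf)(void *data,
				  const struct resource *res,
				  resource_size_t size,
				  resource_size_t align);

i.e. the return value is only a new start address.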

I first wondered if you could use pcibios_window_alignment(), but it
doesn't know the amount of space we need below the bridge, and it also
can't affect the size of the window or the ending address, so I don't
think it will help.

But I wonder if powerpc has a similar issue here: I think EEH might
need, for example 16MB bridge window alignment.  Since
pcibios_window_alignment() only affects the *starting* address, could
the core assign a 9MB window whose starting address is 16MB-aligned?
Could EEH deal with that?  What if the PCI core assigned the space
right after the 9MB window to another device?

Bjorn
Thomas Petazzoni Feb. 21, 2014, 7:21 p.m. UTC | #15
Dear Bjorn Helgaas,

On Fri, 21 Feb 2014 12:05:49 -0700, Bjorn Helgaas wrote:

> Thanks for making this more concrete.  Let me see if I understand any better:

Good to see that Jason Gunthorpe could explain this in better words.
I'll try to answer to your questions below, I'm sure Jason will correct
me if I say incorrect things, or things that are imprecise.

> - e0000000-efffffff is the "host bridge aperture" but it doesn't
> correspond to an actual aperture in hardware (there are no registers
> where you set this range).  The only real use for this range is to be
> the arena within which the PCI core can assign space to the Root
> Ports.  This is static and you don't need to change it based on what
> devices we discover.

Correct. We don't configure this in any hardware register. We just give
this aperture to the Linux PCI core to tell it "please allocate all
BAR physical ranges from this global aperture".

> - There may be several MBus/PCIe Root Ports, and you want to configure
> their apertures at enumeration-time based on what devices are below
> them.  As you say, the PCI core supports this except that MBus
> apertures must be a power-of-two in size and aligned on their size,
> while ordinary PCI bridge windows only need to start and end on 1MB
> boundaries.

Exactly.

> - e0000000-e00fffff is an example of one MBus/PCIe aperture, and this
> space is available on PCI bus 01.  This one happens to be 1MB in size,
> but it could be 2MB, 4MB, etc., but not 3MB like a normal bridge
> window could be.

Absolutely.

Note that we have the possibility of mapping a 3 MB BAR, by using a 2
MB window followed by a 1 MB window. However, since the number of
windows is limited (8 on Kirkwood, 20 on Armada 370/XP), we will prefer
to enlarge the BAR if its size is fairly small, and only resort
to using multiple windows if the amount of lost physical space is big.

So, for a 3 MB BAR, we will definitely prefer to extend it to a single 4
MB window, because losing 1 MB of physical address space is preferable
over losing one window.

For a 192 MB BAR, we may prefer to use one 128 MB window followed by
one 64 MB window.

But as long as the pci-mvebu driver can control the size of the BAR, it
can decide on its own whether it prefers enlarging the BAR, or using
multiple windows.

> - You're currently using the ARM ->align_resource() hook (part of
> pcibios_align_resource()), which is used in the bowels of the
> allocator (__find_resource()) and affects the starting address of the
> region we allocate, but not the size.  So you can force the start of
> an MBus aperture to be power-of-two aligned, but not the end.

Correct.

Happy to see that we've managed to get an understanding of what the
problem is.

> The allocate_resource() alignf argument is only used by PCI and
> PCMCIA, so it doesn't seem like it would be too terrible to extend the
> alignf interface so it could control the size, too.  Would something
> like that solve this problem?

I don't know, I would have to look more precisely into this alignf
argument, and see how it could be extended to solve our constraints.

> I first wondered if you could use pcibios_window_alignment(), but it
> doesn't know the amount of space we need below the bridge, and it also
> can't affect the size of the window or the ending address, so I don't
> think it will help.
> 
> But I wonder if powerpc has a similar issue here: I think EEH might
> need, for example 16MB bridge window alignment.  Since
> pcibios_window_alignment() only affects the *starting* address, could
> the core assign a 9MB window whose starting address is 16MB-aligned?
> Could EEH deal with that?  What if the PCI core assigned the space
> right after the 9MB window to another device?

I'll let the other PCI people answer this :-)

Thanks a lot for your feedback!

Thomas
Benjamin Herrenschmidt Feb. 21, 2014, 7:53 p.m. UTC | #16
On Fri, 2014-02-21 at 12:05 -0700, Bjorn Helgaas wrote:
> But I wonder if powerpc has a similar issue here: I think EEH might
> need, for example 16MB bridge window alignment.  Since
> pcibios_window_alignment() only affects the *starting* address, could
> the core assign a 9MB window whose starting address is 16MB-aligned?
> Could EEH deal with that?  What if the PCI core assigned the space
> right after the 9MB window to another device?

Gavin, did you guys deal with that at all ? Are we aligning the size as
well somewhat ?

Cheers,
Ben.
Gavin Shan Feb. 23, 2014, 3:43 a.m. UTC | #17
On Sat, Feb 22, 2014 at 06:53:23AM +1100, Benjamin Herrenschmidt wrote:
>On Fri, 2014-02-21 at 12:05 -0700, Bjorn Helgaas wrote:
>> But I wonder if powerpc has a similar issue here: I think EEH might
>> need, for example 16MB bridge window alignment.  Since
>> pcibios_window_alignment() only affects the *starting* address, could
>> the core assign a 9MB window whose starting address is 16MB-aligned?
>> Could EEH deal with that?  What if the PCI core assigned the space
>> right after the 9MB window to another device?
>
>Gavin, did you guys deal with that at all ? Are we aligning the size as
>well somewhat ?
>

Yeah, we can handle it well because pcibios_window_alignment() affects
both the starting address and the size of the PCI bridge window. More
details can be found in drivers/pci/setup-bus.c::pbus_size_mem(): the
starting address, "size0", "size1", and "size1 - size0" are all aligned
to "min_align", which comes from pcibios_window_alignment() (16MB as
mentioned).

Thanks,
Gavin

Patch

diff --git a/drivers/pci/host/pci-mvebu.c b/drivers/pci/host/pci-mvebu.c
index 13478ec..002229a 100644
--- a/drivers/pci/host/pci-mvebu.c
+++ b/drivers/pci/host/pci-mvebu.c
@@ -372,6 +372,11 @@  static void mvebu_pcie_handle_membase_change(struct mvebu_pcie_port *port)
                (((port->bridge.memlimit & 0xFFF0) << 16) | 0xFFFFF) -
                port->memwin_base;
 
+       pr_info("PCIE %d.%d: creating window at 0x%x, size 0x%x rounded up to 0x%x\n",
+               port->port, port->lane, port->memwin_base,
+               port->memwin_size, roundup_pow_of_two(port->memwin_size));
+       port->memwin_size = roundup_pow_of_two(port->memwin_size);
+
        mvebu_mbus_add_window_by_id(port->mem_target, port->mem_attr,
                                    port->memwin_base, port->memwin_size);
 }