diff mbox

Intel I350 mini-PCIe card (igb) on Mirabox (mvebu / Armada 370)

Message ID 20140408180828.GC32490@obsidianresearch.com (mailing list archive)
State New, archived
Headers show

Commit Message

Jason Gunthorpe April 8, 2014, 6:08 p.m. UTC
On Tue, Apr 08, 2014 at 07:53:24PM +0200, Willy Tarreau wrote:
> Hi Jason,
> 
> On Tue, Apr 08, 2014 at 11:14:17AM -0600, Jason Gunthorpe wrote:
> > The mbus hardware requires a power of two size, if a non-power
> > of two is passed in to the low level routines they configure the
> > register in a way that results in undefined behaviour.
> > 
> > - WARN_ON to make this invalid usage really obvious
> > - Round down to the nearest power of two so we only stuff a valid
> >   size into the HW
> 
> So if I understand it right, for example when my first NIC registers
> e0000000-e02fffff, then we'll truncate it down to e0000000-e01fffff,
> won't that cause any issue, for example if the NIC needs to access
> data beyond that limit ?

The mbus windows are assigned to the bridge, not to the NIC bars:

00:0a.0 PCI bridge [0604]: Marvell Technology Group Ltd. Device [11ab:7846] (rev 02) (prog-if 00 [Normal decode])
        Memory behind bridge: e0000000-e17fffff

So the above is not OK, 17fffff is not a power of two. The patch
causes the mbus to WARN_ON and then round down so that the hardware is
at least in a defined configuration.

> BTW I can try your patch with the myricom NIC which started to work
> when rounding up, I'll quickly see if that fixes the issue as well,
> but I'm now a little bit confused.

This patch won't fix anything, it just fixes the mbus driver to always
program the hardware with valid values, even if the upper layer
requests something invalid. Basically, we spent a few weeks tracking
this behavior down, the patch would ensure that time doesn't get spent
again.

The goal now is to avoid the WARN_ON in the patch from firing.

For a proper fix, something like this to create multiple aligned
windows is a simple option (untested, need the inverse on the
de-register, loop should probably live in mbus code):

Jason

Comments

Thomas Petazzoni April 8, 2014, 6:15 p.m. UTC | #1
Dear Jason Gunthorpe,

On Tue, 8 Apr 2014 12:08:28 -0600, Jason Gunthorpe wrote:

> > BTW I can try your patch with the myricom NIC which started to work
> > when rounding up, I'll quickly see if that fixes the issue as well,
> > but I'm now a little bit confused.
> 
> This patch won't fix anything, it just fixes the mbus driver to always
> program the hardware with valid values, even if the upper layer
> requests something invalid. Basically, we spent a few weeks tracking
> this behavior down, the patch would ensure that time doesn't get spent
> again.

Agreed.

> The goal now is to avoid the WARN_ON in the patch from firing.
> 
> For a proper fix, something like this to create multiple aligned
> windows is a simple option (untested, need the inverse on the
> de-register, loop should probably live in mbus code):
> 
> diff --git a/drivers/pci/host/pci-mvebu.c b/drivers/pci/host/pci-mvebu.c
> index 789cdb2..7312c82 100644
> --- a/drivers/pci/host/pci-mvebu.c
> +++ b/drivers/pci/host/pci-mvebu.c
> @@ -382,6 +382,9 @@ static void mvebu_pcie_handle_iobase_change(struct mvebu_pcie_port *port)
>  
>  static void mvebu_pcie_handle_membase_change(struct mvebu_pcie_port *port)
>  {
> +       phys_addr_t base;
> +       unsigned int mapped_size;
> +
>         /* Are the new membase/memlimit values invalid? */
>         if (port->bridge.memlimit < port->bridge.membase ||
>             !(port->bridge.command & PCI_COMMAND_MEMORY)) {
> @@ -408,8 +411,16 @@ static void mvebu_pcie_handle_membase_change(struct mvebu_pcie_port *port)
>                 (((port->bridge.memlimit & 0xFFF0) << 16) | 0xFFFFF) -
>                 port->memwin_base;
>  
> -       mvebu_mbus_add_window_by_id(port->mem_target, port->mem_attr,
> -                                   port->memwin_base, port->memwin_size);
> +       base = port->memwin_base;
> +       mapped_size = 0;
> +       while (mapped_size < port->memwin_size) {
> +               unsigned int size =
> +                   rounddown_pow_of_two(port->memwin_size - mapped_size);
> +               mvebu_mbus_add_window_by_id(port->mem_target, port->mem_attr,
> +                                           base, size);
> +               base += size;
> +               mapped_size += size;
> +       }
>  }

The problem I have with this approach is the consumption of windows. We
have a limited number of them, and this approach could easily consume
quite a few windows for one single bridge BAR, depending on its size.
Like a bridge BAR of 120 MB would consume four windows (one 64 MB, one
32 MB, one 16 MB and one 8 MB).

Instead, I would really like to see the PCI core being told about this
constraint, so that it can oversize the bridge BAR (to 128 MB in the
above example of a 120 MB bridge BAR). Of course, it would be better if
the PCI core would not blindly apply this logic: if there is a 129 MB
bridge BAR, we'd prefer to have two windows (one 128 MB and one 1 MB),
so as to not loose too much physical address space.

Thomas
Jason Gunthorpe April 8, 2014, 6:40 p.m. UTC | #2
On Tue, Apr 08, 2014 at 08:15:47PM +0200, Thomas Petazzoni wrote:

> The problem I have with this approach is the consumption of windows. We
> have a limited number of them, and this approach could easily consume
> quite a few windows for one single bridge BAR, depending on its size.
> Like a bridge BAR of 120 MB would consume four windows (one 64 MB, one
> 32 MB, one 16 MB and one 8 MB).

I agree, but on the other hand, the three people hitting this problem
have lots and lots of free windows. This won't work for someone using
a large number of PEX ports, but it does work right now, and it is
small enough to go into -stable..
 
> Instead, I would really like to see the PCI core being told about this
> constraint, so that it can oversize the bridge BAR (to 128 MB in the
> above example of a 120 MB bridge BAR). Of course, it would be better if
> the PCI core would not blindly apply this logic: if there is a 129 MB
> bridge BAR, we'd prefer to have two windows (one 128 MB and one 1 MB),
> so as to not loose too much physical address space.

Right, it would be great if the alignment function could alter the
size as well - but as you say, once we get there you will still want
to allocate more than 1 mbus window per bridge window, so the code
will still be required ..

Also, as you pointed out, the base alignment is also important to
consider, so this needs fixing too: 

	if (res->flags & IORESOURCE_IO)
		return round_up(start, max_t(resource_size_t, SZ_64K, size));
	else if (res->flags & IORESOURCE_MEM)
		return round_up(start, max_t(resource_size_t, SZ_1M, size));

What does round_up do when size is not a power of 2? Not the right
thing, I think. Size should be rounded first..

And the rough loop should consider the base alignment when
constraining the size as well...

Jason
Willy Tarreau April 8, 2014, 7:15 p.m. UTC | #3
On Tue, Apr 08, 2014 at 12:08:28PM -0600, Jason Gunthorpe wrote:
> > So if I understand it right, for example when my first NIC registers
> > e0000000-e02fffff, then we'll truncate it down to e0000000-e01fffff,
> > won't that cause any issue, for example if the NIC needs to access
> > data beyond that limit ?
> 
> The mbus windows are assigned to the bridge, not to the NIC bars:
> 
> 00:0a.0 PCI bridge [0604]: Marvell Technology Group Ltd. Device [11ab:7846] (rev 02) (prog-if 00 [Normal decode])
>         Memory behind bridge: e0000000-e17fffff
> 
> So the above is not OK, 17fffff is not a power of two. The patch
> causes the mbus to WARN_ON and then round down so that the hardware is
> at least in a defined configuration.
>
> > BTW I can try your patch with the myricom NIC which started to work
> > when rounding up, I'll quickly see if that fixes the issue as well,
> > but I'm now a little bit confused.
> 
> This patch won't fix anything, it just fixes the mbus driver to always
> program the hardware with valid values, even if the upper layer
> requests something invalid.

OK that's what I understood, but I mean why rounding down instead of up
in order to correctly cover the window the device expects ? And we all
found that rounding up fixed our respective devices (3 igb NICs, one
myri10ge NIC, one SATA controller I believe).

By rounding up you'd have had 1ffffff instead, which at least covers the
window the device expects.

It's just this specific point of the reason for rounding down versus up
that concerns me.

Thanks,
Willy
Jason Gunthorpe April 8, 2014, 7:21 p.m. UTC | #4
On Tue, Apr 08, 2014 at 09:15:14PM +0200, Willy Tarreau wrote:

> OK that's what I understood, but I mean why rounding down instead of up
> in order to correctly cover the window the device expects ? And we all
> found that rounding up fixed our respective devices (3 igb NICs, one
> myri10ge NIC, one SATA controller I believe).
> 
> By rounding up you'd have had 1ffffff instead, which at least covers the
> window the device expects.

It is also an error to configure the mbus to have overlaping windows
and the code checks for overlaps with the base/size given. If we round
up then it might create an overlap.

You guys were OK with the round up in the PCI code only because your
systems have a single used PEX so the rounding didn't create an
overlaping situation..

The generic code should either round down, or compeltely bail, as
Thomas suggested.

Fundementally we shouldn't get to the WARN_ON, so what happens after
is just an attempt to salvage something from the situation.

Jason
Matthew Minter April 8, 2014, 8:17 p.m. UTC | #5
This is just a guess but perhaps the best we could do would be to round up but also add logic to fail completely if adding a PCI device would cause a window to overlap?

 This would mean situations will be fixed where we can without design changes and situations that cannot be fixed will fail with clear errors?

On 8 Apr 2014, at 20:21, Jason Gunthorpe <jgunthorpe@obsidianresearch.com> wrote:

> On Tue, Apr 08, 2014 at 09:15:14PM +0200, Willy Tarreau wrote:
> 
>> OK that's what I understood, but I mean why rounding down instead of up
>> in order to correctly cover the window the device expects ? And we all
>> found that rounding up fixed our respective devices (3 igb NICs, one
>> myri10ge NIC, one SATA controller I believe).
>> 
>> By rounding up you'd have had 1ffffff instead, which at least covers the
>> window the device expects.
> 
> It is also an error to configure the mbus to have overlaping windows
> and the code checks for overlaps with the base/size given. If we round
> up then it might create an overlap.
> 
> You guys were OK with the round up in the PCI code only because your
> systems have a single used PEX so the rounding didn't create an
> overlaping situation..
> 
> The generic code should either round down, or compeltely bail, as
> Thomas suggested.
> 
> Fundementally we shouldn't get to the WARN_ON, so what happens after
> is just an attempt to salvage something from the situation.
> 
> Jason
Neil Greatorex April 8, 2014, 8:19 p.m. UTC | #6
On Tue, 8 Apr 2014, Jason Gunthorpe wrote:
> On Tue, Apr 08, 2014 at 09:15:14PM +0200, Willy Tarreau wrote:
>
>> OK that's what I understood, but I mean why rounding down instead of up
>> in order to correctly cover the window the device expects ? And we all
>> found that rounding up fixed our respective devices (3 igb NICs, one
>> myri10ge NIC, one SATA controller I believe).
>>
>> By rounding up you'd have had 1ffffff instead, which at least covers the
>> window the device expects.
>
> It is also an error to configure the mbus to have overlaping windows
> and the code checks for overlaps with the base/size given. If we round
> up then it might create an overlap.

If I use only the rounding up part of my patch and not the 4M alignment 
part then this is the exact scenario I get:

00:01.0 PCI bridge [0604]: Marvell Technology Group Ltd. Device 
[11ab:6710] (rev 01) (prog-if 00 [Normal decode])
 	Memory behind bridge: e0000000-e02fffff
 	Prefetchable memory behind bridge: 00000000-000fffff
00:02.0 PCI bridge [0604]: Marvell Technology Group Ltd. Device 
[11ab:6710] (rev 01) (prog-if 00 [Normal decode])
 	Memory behind bridge: e0300000-e03fffff
 	Prefetchable memory behind bridge: 00000000-000fffff

The window for bridge 00:01.0 gets rounded up to cover e0000000-e03fffff 
and the attempt to add the window for bridge 00:02.0 fails.

> You guys were OK with the round up in the PCI code only because your
> systems have a single used PEX so the rounding didn't create an
> overlaping situation..

In my case it worked because I changed the alignment to 4M to prevent the 
overlap.

> The generic code should either round down, or compeltely bail, as
> Thomas suggested.
>
> Fundementally we shouldn't get to the WARN_ON, so what happens after
> is just an attempt to salvage something from the situation.

Agreed.

Cheers,
Neil
Willy Tarreau April 8, 2014, 8:43 p.m. UTC | #7
Hi Jason,

On Tue, Apr 08, 2014 at 01:21:41PM -0600, Jason Gunthorpe wrote:
> On Tue, Apr 08, 2014 at 09:15:14PM +0200, Willy Tarreau wrote:
> 
> > OK that's what I understood, but I mean why rounding down instead of up
> > in order to correctly cover the window the device expects ? And we all
> > found that rounding up fixed our respective devices (3 igb NICs, one
> > myri10ge NIC, one SATA controller I believe).
> > 
> > By rounding up you'd have had 1ffffff instead, which at least covers the
> > window the device expects.
> 
> It is also an error to configure the mbus to have overlaping windows
> and the code checks for overlaps with the base/size given. If we round
> up then it might create an overlap.

OK, Thomas just explained this to me over the phone, it makes sense now.
In fact, I think that a mix between your patch to create multiple windows
and something Thomas wanted to do (ie allocate larger and inform the PCI
core) would be a good solution by preferring to round up small windows
(low waste) and create multiple windows where the waste is too large
(eg: address space divided by the max number of windows).

Thank you for your explanation.

Willy
Thomas Petazzoni April 8, 2014, 9:50 p.m. UTC | #8
Dear Matthew Minter,

On Tue, 8 Apr 2014 21:17:47 +0100, Matthew Minter wrote:

> This is just a guess but perhaps the best we could do would be to round up but also add logic to fail completely if adding a PCI device would cause a window to overlap?

Rounding up does not work. Let me try to explain why.

The PCI core enumerates all the BARs into your device and decides that
it should be mapped from 0xe0000000 to 0xe0900000 (9 MB). The PCI core
writes those values into the BAR of the emulated bridge in the
pci-mvebu driver. The pci-mvebu driver captures this write, and figures
out that 9 MB is not good for a MBus window. So it rounds it up to 16
MB, from 0xe0000000 to 0xe1000000, and configures the MBus window. Your
device will work perfectly fine.

Now, the PCI core goes on and enumerate the next device. This device
needs 1 MB of memory space. From the PCI core point of view, only
0xe0000000 to 0xe0900000 is occupied, so for the next device, it may
well decide to use 0xe0900000 -> 0xe0a00000, write this to the BAR of
the emulated bridge, which will lead the emulated bridge to try to
create a window for this area... which isn't possible as it overlaps
the previous window.

Conclusion: rounding up the size of the BARs cannot be done without the
cooperation of the PCI core.

Thomas
diff mbox

Patch

diff --git a/drivers/pci/host/pci-mvebu.c b/drivers/pci/host/pci-mvebu.c
index 789cdb2..7312c82 100644
--- a/drivers/pci/host/pci-mvebu.c
+++ b/drivers/pci/host/pci-mvebu.c
@@ -382,6 +382,9 @@  static void mvebu_pcie_handle_iobase_change(struct mvebu_pcie_port *port)
 
 static void mvebu_pcie_handle_membase_change(struct mvebu_pcie_port *port)
 {
+       phys_addr_t base;
+       unsigned int mapped_size;
+
        /* Are the new membase/memlimit values invalid? */
        if (port->bridge.memlimit < port->bridge.membase ||
            !(port->bridge.command & PCI_COMMAND_MEMORY)) {
@@ -408,8 +411,16 @@  static void mvebu_pcie_handle_membase_change(struct mvebu_pcie_port *port)
                (((port->bridge.memlimit & 0xFFF0) << 16) | 0xFFFFF) -
                port->memwin_base;
 
-       mvebu_mbus_add_window_by_id(port->mem_target, port->mem_attr,
-                                   port->memwin_base, port->memwin_size);
+       base = port->memwin_base;
+       mapped_size = 0;
+       while (mapped_size < port->memwin_size) {
+               unsigned int size =
+                   rounddown_pow_of_two(port->memwin_size - mapped_size);
+               mvebu_mbus_add_window_by_id(port->mem_target, port->mem_attr,
+                                           base, size);
+               base += size;
+               mapped_size += size;
+       }
 }