diff mbox

Intel I350 mini-PCIe card (igb) on Mirabox (mvebu / Armada 370)

Message ID 20140408171417.GB27776@obsidianresearch.com (mailing list archive)
State New, archived
Headers show

Commit Message

Jason Gunthorpe April 8, 2014, 5:14 p.m. UTC
On Tue, Apr 08, 2014 at 05:55:45PM +0200, Thomas Petazzoni wrote:

> mvebu-mbus/devices output:
> 
> # cat /sys/kernel/debug/mvebu-mbus/devices 
> [10] 00000000e0000000 - 00000000e0900000 : 0004:00e8

> I'm not sure why I don't have the non-power-of-two problem for the MBus
> windows.

Well, you do, 0x900000 is not aligned. It converts to size=0b10001111,
which is undefined behavior for mbus. 

It looks like you are lucky, for your system the undefined behavior
creates a window that hits all the BARs, but the others were not so
lucky.

What do you think about this:

From 235b0bf637242a50ec45c8766d18a942bff601cf Mon Sep 17 00:00:00 2001
From: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Date: Tue, 8 Apr 2014 11:12:41 -0600
Subject: [PATCH] bus: mvebu-mbus: Avoid setting an undefined window size

The mbus hardware requires a power of two size, if a non-power
of two is passed in to the low level routines they configure the
register in a way that results in undefined behaviour.

- WARN_ON to make this invalid usage really obvious
- Round down to the nearest power of two so we only stuff a valid
  size into the HW
- When reading interpret undefined values in a conservative way,
  the value is assumed to be the lowest power of two. This avoids
  the debugfs output showing a value that looks 'correct'

All together this makes the recent problems with silent failure
of PCI very obvious, noisy and debuggable.

Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
---
 drivers/bus/mvebu-mbus.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

Comments

Willy Tarreau April 8, 2014, 5:53 p.m. UTC | #1
Hi Jason,

On Tue, Apr 08, 2014 at 11:14:17AM -0600, Jason Gunthorpe wrote:
> The mbus hardware requires a power of two size, if a non-power
> of two is passed in to the low level routines they configure the
> register in a way that results in undefined behaviour.
> 
> - WARN_ON to make this invalid usage really obvious
> - Round down to the nearest power of two so we only stuff a valid
>   size into the HW

So if I understand it right, for example when my first NIC registers
e0000000-e02fffff, then we'll truncate it down to e0000000-e01fffff,
won't that cause any issue, for example if the NIC needs to access
data beyond that limit ?

BTW I can try your patch with the myricom NIC which started to work
when rounding up, I'll quickly see if that fixes the issue as well,
but I'm now a little bit confused.

Regards,
Willy
Thomas Petazzoni April 8, 2014, 6:01 p.m. UTC | #2
Dear Jason Gunthorpe,

On Tue, 8 Apr 2014 11:14:17 -0600, Jason Gunthorpe wrote:

> > mvebu-mbus/devices output:
> > 
> > # cat /sys/kernel/debug/mvebu-mbus/devices 
> > [10] 00000000e0000000 - 00000000e0900000 : 0004:00e8
> 
> > I'm not sure why I don't have the non-power-of-two problem for the MBus
> > windows.
> 
> Well, you do, 0x900000 is not aligned. It converts to size=0b10001111,
> which is undefined behavior for mbus. 

Ah correct. Though I'm still puzzled as to why the undefined behavior
works for me, and not for Willy, who I believe has the same NIC as me.

> What do you think about this:
> 
> From 235b0bf637242a50ec45c8766d18a942bff601cf Mon Sep 17 00:00:00 2001
> From: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
> Date: Tue, 8 Apr 2014 11:12:41 -0600
> Subject: [PATCH] bus: mvebu-mbus: Avoid setting an undefined window size
> 
> The mbus hardware requires a power of two size, if a non-power
> of two is passed in to the low level routines they configure the
> register in a way that results in undefined behaviour.
> 
> - WARN_ON to make this invalid usage really obvious
> - Round down to the nearest power of two so we only stuff a valid
>   size into the HW

I perfectly fine with those two.

> - When reading interpret undefined values in a conservative way,
>   the value is assumed to be the lowest power of two. This avoids
>   the debugfs output showing a value that looks 'correct'

But I am not sure with this one. Since now you're anyway rounding down
sizes, how is it possible to get a non-power-of-two size in the
registers?

I would probably prefer to have mvebu_devs_debug_show() do something
like:

                seq_printf(seq, "[%02d] %016llx - %016llx : %04x:%04x%s",
                           win, (unsigned long long)wbase,
                           (unsigned long long)(wbase + wsize), wtarget, wattr,
			   (!is_power_of_2(wsize)) ? " non-pow2 undefined behavior!" : "");

or something like that. This way at least the function does not "hide"
the fact that the configured value is invalid.

See one more comment below.

> All together this makes the recent problems with silent failure
> of PCI very obvious, noisy and debuggable.
> 
> Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
> ---
>  drivers/bus/mvebu-mbus.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/bus/mvebu-mbus.c b/drivers/bus/mvebu-mbus.c
> index 2ac754e..d26f63c 100644
> --- a/drivers/bus/mvebu-mbus.c
> +++ b/drivers/bus/mvebu-mbus.c
> @@ -56,6 +56,7 @@
>  #include <linux/of.h>
>  #include <linux/of_address.h>
>  #include <linux/debugfs.h>
> +#include <linux/log2.h>
>  
>  /*
>   * DDR target is the same on all platforms.
> @@ -147,7 +148,7 @@ static void mvebu_mbus_read_window(struct mvebu_mbus_state *mbus,
>  	*enabled = 1;
>  	*base = ((u64)basereg & WIN_BASE_HIGH) << 32;
>  	*base |= (basereg & WIN_BASE_LOW);
> -	*size = (ctrlreg | ~WIN_CTRL_SIZE_MASK) + 1;
> +	*size = 1 << (ffs(~(ctrlreg | ~WIN_CTRL_SIZE_MASK)) - 1);
>  
>  	if (target)
>  		*target = (ctrlreg & WIN_CTRL_TGT_MASK) >> WIN_CTRL_TGT_SHIFT;
> @@ -266,6 +267,9 @@ static int mvebu_mbus_setup_window(struct mvebu_mbus_state *mbus,
>  		mbus->soc->win_cfg_offset(win);
>  	u32 ctrl, remap_addr;
>  
> +	WARN_ON(!is_power_of_2(size));
> +	size = rounddown_pow_of_two(size);

Maybe something like:

	if (!is_power_of_2(size)) {
		WARN(true, "Invalid MBus window size: 0x%x, rounding down to 0x%x\n",
		     size, rounddown_pow_of_two(size));
		size = rounddown_pow_of_two(size);
	}

And while we're adding checks, why not also verify that the base
address is a multiple of the window size? I think it's the other
requirement.

That being said, this warning doesn't really solve the problem that the
PCI core may allocate BARs whose size cannot be represented through
MBus windows :-)

Thomas
Jason Gunthorpe April 8, 2014, 6:22 p.m. UTC | #3
On Tue, Apr 08, 2014 at 08:01:40PM +0200, Thomas Petazzoni wrote:

> > Well, you do, 0x900000 is not aligned. It converts to size=0b10001111,
> > which is undefined behavior for mbus. 
> 
> Ah correct. Though I'm still puzzled as to why the undefined behavior
> works for me, and not for Willy, who I believe has the same NIC as me.

If we knew the algorithm in the HW it would probably make sense :)

> > - When reading interpret undefined values in a conservative way,
> >   the value is assumed to be the lowest power of two. This avoids
> >   the debugfs output showing a value that looks 'correct'
> 
> But I am not sure with this one. Since now you're anyway rounding down
> sizes, how is it possible to get a non-power-of-two size in the
> registers?

I agree, it should not happen if everything is correct, but the
apparently correct debugfs output obscured the root cause of the
problem..

> I would probably prefer to have mvebu_devs_debug_show() do something
> like:
> 
>                 seq_printf(seq, "[%02d] %016llx - %016llx : %04x:%04x%s",
>                            win, (unsigned long long)wbase,
>                            (unsigned long long)(wbase + wsize), wtarget, wattr,
> 			   (!is_power_of_2(wsize)) ? " non-pow2 undefined behavior!" : "");

That sounds good
 
> > +	WARN_ON(!is_power_of_2(size));
> > +	size = rounddown_pow_of_two(size);
> 
> Maybe something like:
> 
> 	if (!is_power_of_2(size)) {
> 		WARN(true, "Invalid MBus window size: 0x%x, rounding down to 0x%x\n",
> 		     size, rounddown_pow_of_two(size));
> 		size = rounddown_pow_of_two(size);
> 	}
> 
> And while we're adding checks, why not also verify that the base
> address is a multiple of the window size? I think it's the other
> requirement.

Yes, agree, sounds good

> That being said, this warning doesn't really solve the problem that the
> PCI core may allocate BARs whose size cannot be represented through
> MBus windows :-)

Right, but Will pointed out it took 3 months to get to the root cause,
so it might save that time in future :)

I'll revise/resend the patch.

Jason
Thomas Petazzoni April 8, 2014, 6:32 p.m. UTC | #4
Dear Jason Gunthorpe,

On Tue, 8 Apr 2014 12:22:30 -0600, Jason Gunthorpe wrote:

> > Ah correct. Though I'm still puzzled as to why the undefined behavior
> > works for me, and not for Willy, who I believe has the same NIC as me.
> 
> If we knew the algorithm in the HW it would probably make sense :)

Indeed :)

> > But I am not sure with this one. Since now you're anyway rounding down
> > sizes, how is it possible to get a non-power-of-two size in the
> > registers?
> 
> I agree, it should not happen if everything is correct, but the
> apparently correct debugfs output obscured the root cause of the
> problem..

Absolutely.

> > That being said, this warning doesn't really solve the problem that the
> > PCI core may allocate BARs whose size cannot be represented through
> > MBus windows :-)
> 
> Right, but Will pointed out it took 3 months to get to the root cause,
> so it might save that time in future :)

Yes, I completely agree, and I'm totally in favor of making things more
robust in mvebu-mbus.

Actually, should we simply return an error if something tries to set up
a window with an invalid size? But maybe it's a bit too strong for the
moment, until the power of two size problem gets resolved on the PCI
side.

But the other day, I also had funky issues with another window, and it
turned out to be caused by a non-power-of-two window as well...

> I'll revise/resend the patch.

Thanks!

Thomas
diff mbox

Patch

diff --git a/drivers/bus/mvebu-mbus.c b/drivers/bus/mvebu-mbus.c
index 2ac754e..d26f63c 100644
--- a/drivers/bus/mvebu-mbus.c
+++ b/drivers/bus/mvebu-mbus.c
@@ -56,6 +56,7 @@ 
 #include <linux/of.h>
 #include <linux/of_address.h>
 #include <linux/debugfs.h>
+#include <linux/log2.h>
 
 /*
  * DDR target is the same on all platforms.
@@ -147,7 +148,7 @@  static void mvebu_mbus_read_window(struct mvebu_mbus_state *mbus,
 	*enabled = 1;
 	*base = ((u64)basereg & WIN_BASE_HIGH) << 32;
 	*base |= (basereg & WIN_BASE_LOW);
-	*size = (ctrlreg | ~WIN_CTRL_SIZE_MASK) + 1;
+	*size = 1 << (ffs(~(ctrlreg | ~WIN_CTRL_SIZE_MASK)) - 1);
 
 	if (target)
 		*target = (ctrlreg & WIN_CTRL_TGT_MASK) >> WIN_CTRL_TGT_SHIFT;
@@ -266,6 +267,9 @@  static int mvebu_mbus_setup_window(struct mvebu_mbus_state *mbus,
 		mbus->soc->win_cfg_offset(win);
 	u32 ctrl, remap_addr;
 
+	WARN_ON(!is_power_of_2(size));
+	size = rounddown_pow_of_two(size);
+
 	ctrl = ((size - 1) & WIN_CTRL_SIZE_MASK) |
 		(attr << WIN_CTRL_ATTR_SHIFT)    |
 		(target << WIN_CTRL_TGT_SHIFT)   |