diff mbox

[2/2] mtd: orion-nand: fix build error with ARMv4

Message ID 20140509212810.GF18257@obsidianresearch.com (mailing list archive)
State New, archived
Headers show

Commit Message

Jason Gunthorpe May 9, 2014, 9:28 p.m. UTC
On Fri, May 09, 2014 at 03:45:05PM -0300, Ezequiel Garcia wrote:

> I gave this a try in order to answer Arnd's performance
> question. First of all, the patch seems wrong. I guess it's because
> readsl reads 4-bytes pieces, instead of 8-bytes.
> 
> This patch below is tested (but not completely, see below) and works:

Compilers are better now, I think you can just ditch the weirdness:

uint64_t *from;
uint64_t *to;
void foo()
{
	for (unsigned int I = 0; I != 1000; I++)
		*to++ = *from;
}

Using even gcc 4.6.3 gives good code:

(v6)
.L2:
        ldrd    r2, [ip]
        strd    r2, [r1], #8
        cmp     r1, r0

(v4)
.L2:
        ldmia   ip, {r0-r1}
        stmia   r3!, {r0-r1}
        cmp     r3, r2

For correctness this v4 version does require that the cpu executes the
ldmia reads in increasing address order, and never in any other
order. AFAIK the periphal is just a simple fifo that basically ignores
the address.

memcpy_fromio is not as good since it will never align if the buffer
is unaligned, while this version does.

The below gives:

  c8:   ea000002        b       d8 <orion_nand_read_buf+0x84>
  cc:   e5dc0000        ldrb    r0, [ip]
  d0:   e7c30001        strb    r0, [r3, r1]
  d4:   e2811001        add     r1, r1, #1
  d8:   e1510002        cmp     r1, r2

Which looks the same as the asm version to me.

Comments

Ezequiel Garcia May 9, 2014, 10:09 p.m. UTC | #1
On 09 May 03:28 PM, Jason Gunthorpe wrote:
> 
> > I gave this a try in order to answer Arnd's performance
> > question. First of all, the patch seems wrong. I guess it's because
> > readsl reads 4-bytes pieces, instead of 8-bytes.
> > 
> > This patch below is tested (but not completely, see below) and works:
> 
> Compilers are better now, I think you can just ditch the weirdness:
> 
[..]
> 
> The below gives:
> 
>   c8:   ea000002        b       d8 <orion_nand_read_buf+0x84>
>   cc:   e5dc0000        ldrb    r0, [ip]
>   d0:   e7c30001        strb    r0, [r3, r1]
>   d4:   e2811001        add     r1, r1, #1
>   d8:   e1510002        cmp     r1, r2
> 
> Which looks the same as the asm version to me.
> 

Nice! It wasn't really needed but since I have the board here:

# time nanddump /dev/mtd5 -f /dev/null -q
real	0m 5.82s
user	0m 0.20s
sys	0m 5.60s

Jason: Care to submit a proper patch?

On 08 May 04:56 PM, Arnd Bergmann wrote:
> Ok, that is a noticeable difference. For scale, what is the size of that partition?

The board is Openblocks A6, running mainline.

# cat /proc/mtd 
dev:    size   erasesize  name
mtd0: 00090000 00004000 "uboot"
mtd1: 00044000 00004000 "env"
mtd2: 00024000 00004000 "test"
mtd3: 00400000 00004000 "conf"
mtd4: 01d20000 00004000 "linux"
mtd5: 01dec000 00004000 "user"
Arnd Bergmann May 9, 2014, 10:24 p.m. UTC | #2
On Friday 09 May 2014 19:09:15 Ezequiel Garcia wrote:
> # time nanddump /dev/mtd5 -f /dev/null -q
> real    0m 5.82s
> user    0m 0.20s
> sys     0m 5.60s
> 
> Jason: Care to submit a proper patch?
> 
> On 08 May 04:56 PM, Arnd Bergmann wrote:
> > Ok, that is a noticeable difference. For scale, what is the size of that partition?
> 
> The board is Openblocks A6, running mainline.
> 
> # cat /proc/mtd 
> dev:    size   erasesize  name
> mtd0: 00090000 00004000 "uboot"
> mtd1: 00044000 00004000 "env"
> mtd2: 00024000 00004000 "test"
> mtd3: 00400000 00004000 "conf"
> mtd4: 01d20000 00004000 "linux"
> mtd5: 01dec000 00004000 "user"

Ok, so it takes 5.6 seconds in kernel mode to access 31MB, which comes down
to 5.60MB/s. That isn't very fast compared to the time the CPU should take
for those instructions, so I'm surprised it actually makes any difference
at all.

There isn't a usable slave DMA engine in Armada XP by chance?

	Arnd
Ezequiel Garcia May 9, 2014, 11:55 p.m. UTC | #3
On 10 May 12:24 AM, Arnd Bergmann wrote:
> On Friday 09 May 2014 19:09:15 Ezequiel Garcia wrote:
> > # time nanddump /dev/mtd5 -f /dev/null -q
> > real    0m 5.82s
> > user    0m 0.20s
> > sys     0m 5.60s
> > 
> > Jason: Care to submit a proper patch?
> > 
> > On 08 May 04:56 PM, Arnd Bergmann wrote:
> > > Ok, that is a noticeable difference. For scale, what is the size of that partition?
> > 
> > The board is Openblocks A6, running mainline.
> > 
> > # cat /proc/mtd 
> > dev:    size   erasesize  name
> > mtd0: 00090000 00004000 "uboot"
> > mtd1: 00044000 00004000 "env"
> > mtd2: 00024000 00004000 "test"
> > mtd3: 00400000 00004000 "conf"
> > mtd4: 01d20000 00004000 "linux"
> > mtd5: 01dec000 00004000 "user"
> 
> Ok, so it takes 5.6 seconds in kernel mode to access 31MB, which comes down
> to 5.60MB/s. That isn't very fast compared to the time the CPU should take
> for those instructions, so I'm surprised it actually makes any difference
> at all.
>
> There isn't a usable slave DMA engine in Armada XP by chance?
> 

The Openblocks A6 is a Kirkwood, not Armada XP.
Jason Gunthorpe May 13, 2014, 8:55 p.m. UTC | #4
On Fri, May 09, 2014 at 07:09:15PM -0300, Ezequiel Garcia wrote:
> On 09 May 03:28 PM, Jason Gunthorpe wrote:
> > 
> > > I gave this a try in order to answer Arnd's performance
> > > question. First of all, the patch seems wrong. I guess it's because
> > > readsl reads 4-bytes pieces, instead of 8-bytes.
> > > 
> > > This patch below is tested (but not completely, see below) and works:
> > 
> > Compilers are better now, I think you can just ditch the weirdness:
> > 
> [..]
> > 
> > The below gives:
> > 
> >   c8:   ea000002        b       d8 <orion_nand_read_buf+0x84>
> >   cc:   e5dc0000        ldrb    r0, [ip]
> >   d0:   e7c30001        strb    r0, [r3, r1]
> >   d4:   e2811001        add     r1, r1, #1
> >   d8:   e1510002        cmp     r1, r2
> > 
> > Which looks the same as the asm version to me.
> > 
> 
> Nice! It wasn't really needed but since I have the board here:
> 
> # time nanddump /dev/mtd5 -f /dev/null -q
> real	0m 5.82s
> user	0m 0.20s
> sys	0m 5.60s
> 
> Jason: Care to submit a proper patch?

Sure, but did anyone (Arnd?) have thoughts on a better way to do this:

+#ifdef CONFIG_64BIT
+               buf64[i++] = readq_relaxed(io_base);
+#else
+               buf64[i++] = *(const volatile u64 __force *)io_base;
+#endif

IMHO, readq should exist on any platform that can issue a 64 bit bus
transaction, and I expect many ARM's qualify.

> On 08 May 04:56 PM, Arnd Bergmann wrote:

> Ok, so it takes 5.6 seconds in kernel mode to access 31MB, which
> comes down to 5.60MB/s. That isn't very fast compared to the time
> the CPU should take for those instructions, so I'm surprised it
> actually makes any difference at all.

Likely, what is happening is that the bus interface is holding off
returning the read data until it complets the bus cycles, then the
response travels to the CPU which turns around another.

This creates a dead time where the bus isn't do anything.

The larger bus transfer the CPU can do the less percentage of time the
turnaround takes as overhead.

If the cpu could pipeline two reads then it could be highest-possible,
but I guess the memory ordering for the mapping prevents that??

Regarding DMA, who knows if the interface can handle a burst
transfer..

Jason
Arnd Bergmann May 14, 2014, 11:47 a.m. UTC | #5
On Tuesday 13 May 2014 14:55:48 Jason Gunthorpe wrote:
> On Fri, May 09, 2014 at 07:09:15PM -0300, Ezequiel Garcia wrote:
> > On 09 May 03:28 PM, Jason Gunthorpe wrote:
> > > 
> > > > I gave this a try in order to answer Arnd's performance
> > > > question. First of all, the patch seems wrong. I guess it's because
> > > > readsl reads 4-bytes pieces, instead of 8-bytes.
> > > > 
> > > > This patch below is tested (but not completely, see below) and works:
> > > 
> > > Compilers are better now, I think you can just ditch the weirdness:
> > > 
> > [..]
> > > 
> > > The below gives:
> > > 
> > >   c8:   ea000002        b       d8 <orion_nand_read_buf+0x84>
> > >   cc:   e5dc0000        ldrb    r0, [ip]
> > >   d0:   e7c30001        strb    r0, [r3, r1]
> > >   d4:   e2811001        add     r1, r1, #1
> > >   d8:   e1510002        cmp     r1, r2
> > > 
> > > Which looks the same as the asm version to me.
> > > 
> > 
> > Nice! It wasn't really needed but since I have the board here:
> > 
> > # time nanddump /dev/mtd5 -f /dev/null -q
> > real  0m 5.82s
> > user  0m 0.20s
> > sys   0m 5.60s
> > 
> > Jason: Care to submit a proper patch?
> 
> Sure, but did anyone (Arnd?) have thoughts on a better way to do this:
> 
> +#ifdef CONFIG_64BIT
> +               buf64[i++] = readq_relaxed(io_base);
> +#else
> +               buf64[i++] = *(const volatile u64 __force *)io_base;
> +#endif
>
> IMHO, readq should exist on any platform that can issue a 64 bit bus
> transaction, and I expect many ARM's qualify.

Well, the original problem happened specifically for the case that doesn't
have a 64-bit bus transaction (building for ARMv4). If we define
readq_relaxed, it has to be an inline assembly, in order to work for
drivers that require an atomic transaction, so that would have the
same problem. We are a bit inconsistent here though: most 32-bit
architectures have no readq, parisc has one that does two 32-bit accesses,
sh relies on the compiler, and tile apparently has a native instruction.

It seems reasonable to replace the inline assembly in this driver with
a new function as a cleanup, but then how do you want to solve the case
of building a combined armv4/v5 kernel?

	Arnd
Geert Uytterhoeven May 14, 2014, 12:35 p.m. UTC | #6
On Wed, May 14, 2014 at 1:47 PM, Arnd Bergmann <arnd@arndb.de> wrote:
>> Sure, but did anyone (Arnd?) have thoughts on a better way to do this:
>>
>> +#ifdef CONFIG_64BIT
>> +               buf64[i++] = readq_relaxed(io_base);
>> +#else
>> +               buf64[i++] = *(const volatile u64 __force *)io_base;
>> +#endif
>>
>> IMHO, readq should exist on any platform that can issue a 64 bit bus
>> transaction, and I expect many ARM's qualify.
>
> Well, the original problem happened specifically for the case that doesn't
> have a 64-bit bus transaction (building for ARMv4). If we define
> readq_relaxed, it has to be an inline assembly, in order to work for
> drivers that require an atomic transaction, so that would have the
> same problem. We are a bit inconsistent here though: most 32-bit
> architectures have no readq, parisc has one that does two 32-bit accesses,
> sh relies on the compiler, and tile apparently has a native instruction.
>
> It seems reasonable to replace the inline assembly in this driver with
> a new function as a cleanup, but then how do you want to solve the case
> of building a combined armv4/v5 kernel?

Provide two helper functions, one for v4, one for v5, and call
them through a function pointer that's set up during driver probe?

Gr{oetje,eeting}s,

                        Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds
Arnd Bergmann May 14, 2014, 1:09 p.m. UTC | #7
On Wednesday 14 May 2014 14:35:28 Geert Uytterhoeven wrote:
> On Wed, May 14, 2014 at 1:47 PM, Arnd Bergmann <arnd@arndb.de> wrote:
> >> Sure, but did anyone (Arnd?) have thoughts on a better way to do this:
> >>
> >> +#ifdef CONFIG_64BIT
> >> +               buf64[i++] = readq_relaxed(io_base);
> >> +#else
> >> +               buf64[i++] = *(const volatile u64 __force *)io_base;
> >> +#endif
> >>
> >> IMHO, readq should exist on any platform that can issue a 64 bit bus
> >> transaction, and I expect many ARM's qualify.
> >
> > Well, the original problem happened specifically for the case that doesn't
> > have a 64-bit bus transaction (building for ARMv4). If we define
> > readq_relaxed, it has to be an inline assembly, in order to work for
> > drivers that require an atomic transaction, so that would have the
> > same problem. We are a bit inconsistent here though: most 32-bit
> > architectures have no readq, parisc has one that does two 32-bit accesses,
> > sh relies on the compiler, and tile apparently has a native instruction.
> >
> > It seems reasonable to replace the inline assembly in this driver with
> > a new function as a cleanup, but then how do you want to solve the case
> > of building a combined armv4/v5 kernel?
> 
> Provide two helper functions, one for v4, one for v5, and call
> them through a function pointer that's set up during driver probe?

No, that won't help. It's a compile-time problem only: This driver
is never actually used on ARMv4, but if we build a kernel that supports
both ARMv4 and later, gcc passes -march=armv4 to the assembler, which
barfs on an invalid instruction. It would be fine to make that error
message just go away because we know it will only be used on CPUs that
do have this instruction.

	Arnd
diff mbox

Patch

diff --git a/drivers/mtd/nand/orion_nand.c b/drivers/mtd/nand/orion_nand.c
index dc9d07f34b8a..fea1597f623e 100644
--- a/drivers/mtd/nand/orion_nand.c
+++ b/drivers/mtd/nand/orion_nand.c
@@ -95,16 +95,13 @@  static void orion_nand_read_buf(struct mtd_info *mtd, uint8_t *buf, int len)
 	}
 	buf64 = (uint64_t *)buf;
 	while (i < len/8) {
-		/*
-		 * Since GCC has no proper constraint (PR 43518)
-		 * force x variable to r2/r3 registers as ldrd instruction
-		 * requires first register to be even.
-		 */
-		register uint64_t x asm ("r2");
-
-		asm volatile ("ldrd\t%0, [%1]" : "=&r" (x) : "r" (io_base));
-		buf64[i++] = x;
+#ifdef CONFIG_64BIT
+		buf64[i++] = readq_relaxed(io_base);
+#else
+		buf64[i++] = *(const volatile u64 __force *)io_base;
+#endif
 	}
+
 	i *= 8;
 	while (i < len)
 		buf[i++] = readb(io_base);