diff mbox series

[RFC] drivers: ata: ahci_sunxi: Increased SATA/AHCI DMA TX/RX FIFOs

Message ID 20190510192550.17458-1-um@mutluit.com (mailing list archive)
State RFC
Headers show
Series [RFC] drivers: ata: ahci_sunxi: Increased SATA/AHCI DMA TX/RX FIFOs | expand

Commit Message

Uenal Mutlu May 10, 2019, 7:25 p.m. UTC
Increasing the SATA/AHCI DMA TX/RX FIFOs (P0DMACR.TXTS and .RXTS) from
default 0x0 each to 0x3 each gives a write performance boost of 120MB/s
from lame 36MB/s to 45MB/s previously. Read performance is about 200MB/s
[tested on SSD using dd bs=4K count=512K].

Tested on the Banana Pi R1 (aka Lamobo R1) and Banana Pi M1 SBCs
with Allwinner A20 32bit-SoCs (ARMv7-a / arm-linux-gnueabihf).
These devices are RaspberryPi-like small devices.

RFC: Since more than about 25 similar SBC/SoC models do use the
ahci_sunxi driver, users are encouraged to test it on all the
affected boards and give feedback.

List of the affected sunxi and other boards and SoCs with SATA using
the ahci_sunxi driver:
  $ grep -i -e "^&ahci" arch/arm/boot/dts/sun*dts
  and http://linux-sunxi.org/Category:Devices_with_SATA_port

Signed-off-by: Uenal Mutlu <um@mutluit.com>
---
 drivers/ata/ahci_sunxi.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Uenal Mutlu May 11, 2019, 6:12 p.m. UTC | #1
Stefan Monnier wrote on 05/11/2019 03:37 PM:
>> Increasing the SATA/AHCI DMA TX/RX FIFOs (P0DMACR.TXTS and .RXTS) from
>> default 0x0 each to 0x3 each gives a write performance boost of 120MB/s
>> from lame 36MB/s to 45MB/s previously. Read performance is about 200MB/s
>> [tested on SSD using dd bs=4K count=512K].
>
> Such a simple patch to fix such a long-standing performance problem that
> everyone [ well, apparently not quite everyone ] assumed was a hardware
> limitation...
>
> And yet, April 1st is long gone.
>
> Is it really for real?

Yes, it's indeed real, Stefan; really no April 1st joke.  :-)

As you indicated, this problem of slow SATA write-speed
with these small devices lasts now for more than 5 years.
This patch finally solves the problem.

On my test device (BPI-R1) the optimum blocksize seems to be 12K
as it then gives even 129 MB/s write speed.

Here are some test results with different blocksizes, all giving
a write speed of 125 to 129 MB/s:

time sh -c "dd if=/dev/zero of=test.tmp bs=$bs count=$count conv=fdatasync"


------------ bs=8K / count=256K / 1 ------------------
262144+0 records in
262144+0 records out
2147483648 bytes (2.1 GB) copied, 16.9237 s, 127 MB/s

real	0m16.935s
user	0m0.388s
sys	0m15.777s

------------ bs=8K / count=256K / 2 ------------------
262144+0 records in
262144+0 records out
2147483648 bytes (2.1 GB) copied, 16.9916 s, 126 MB/s

real	0m17.973s
user	0m0.326s
sys	0m16.806s

------------ bs=8K / count=256K / 3 ------------------
262144+0 records in
262144+0 records out
2147483648 bytes (2.1 GB) copied, 17.0085 s, 126 MB/s

real	0m17.993s
user	0m0.442s
sys	0m16.588s

------------ bs=12K / count=171K / 1 ------------------
175104+0 records in
175104+0 records out
2151677952 bytes (2.2 GB) copied, 16.8474 s, 128 MB/s

real	0m16.860s
user	0m0.205s
sys	0m15.705s

------------ bs=12K / count=171K / 2 ------------------
175104+0 records in
175104+0 records out
2151677952 bytes (2.2 GB) copied, 16.6934 s, 129 MB/s

real	0m17.669s
user	0m0.227s
sys	0m16.355s

------------ bs=12K / count=171K / 3 ------------------
175104+0 records in
175104+0 records out
2151677952 bytes (2.2 GB) copied, 16.6684 s, 129 MB/s

real	0m17.654s
user	0m0.388s
sys	0m16.118s

------------ bs=16K / count=128K / 1 ------------------
131072+0 records in
131072+0 records out
2147483648 bytes (2.1 GB) copied, 17.1845 s, 125 MB/s

real	0m17.200s
user	0m0.251s
sys	0m16.060s

------------ bs=16K / count=128K / 2 ------------------
131072+0 records in
131072+0 records out
2147483648 bytes (2.1 GB) copied, 16.9221 s, 127 MB/s

real	0m17.902s
user	0m0.170s
sys	0m16.763s

------------ bs=16K / count=128K / 3 ------------------
131072+0 records in
131072+0 records out
2147483648 bytes (2.1 GB) copied, 16.8845 s, 127 MB/s

real	0m17.868s
user	0m0.167s
sys	0m16.736s
Maxime Ripard May 12, 2019, 12:12 p.m. UTC | #2
Hi,

On Fri, May 10, 2019 at 09:25:50PM +0200, Uenal Mutlu wrote:
> Increasing the SATA/AHCI DMA TX/RX FIFOs (P0DMACR.TXTS and .RXTS) from
> default 0x0 each to 0x3 each gives a write performance boost of 120MB/s
> from lame 36MB/s to 45MB/s previously. Read performance is about 200MB/s
> [tested on SSD using dd bs=4K count=512K].
>
> Tested on the Banana Pi R1 (aka Lamobo R1) and Banana Pi M1 SBCs
> with Allwinner A20 32bit-SoCs (ARMv7-a / arm-linux-gnueabihf).
> These devices are RaspberryPi-like small devices.
>
> RFC: Since more than about 25 similar SBC/SoC models do use the
> ahci_sunxi driver, users are encouraged to test it on all the
> affected boards and give feedback.
>
> List of the affected sunxi and other boards and SoCs with SATA using
> the ahci_sunxi driver:
>   $ grep -i -e "^&ahci" arch/arm/boot/dts/sun*dts
>   and http://linux-sunxi.org/Category:Devices_with_SATA_port
>
> Signed-off-by: Uenal Mutlu <um@mutluit.com>
> ---
>  drivers/ata/ahci_sunxi.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/ata/ahci_sunxi.c b/drivers/ata/ahci_sunxi.c
> index 911710643305..257986431c79 100644
> --- a/drivers/ata/ahci_sunxi.c
> +++ b/drivers/ata/ahci_sunxi.c
> @@ -158,7 +158,7 @@ static void ahci_sunxi_start_engine(struct ata_port *ap)
>  	struct ahci_host_priv *hpriv = ap->host->private_data;
>
>  	/* Setup DMA before DMA start */
> -	sunxi_clrsetbits(hpriv->mmio + AHCI_P0DMACR, 0x0000ff00, 0x00004400);
> +	sunxi_clrsetbits(hpriv->mmio + AHCI_P0DMACR, 0x0000ffff, 0x00004433);

Having comments / defines here would be great, once fixed:
Acked-by: Maxime Ripard <maxime.ripard@bootlin.com>

Maxime

--
Maxime Ripard, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com
Uenal Mutlu May 12, 2019, 4:08 p.m. UTC | #3
Hi Maxime & Others,

what follows is a somewhat lengthy technical story behind this patch;
you can just skip it and jump to the end.


As can be seen in the ahci_sunxi.c, the port used in this patch
is this one (32bit):
   #define AHCI_P0DMACR    0x0170
It's a so called "Vendor Specific Port" according to the SATA/AHCI specs by Intel.
The data behind it is actually a struct, consisting of 4 fields,
each 4bits long, plus a 16bits long field that is marked as Reserved
in secondary literature (see below):

struct AHCI_P0DMACR_t
{
   unsigned TXTS  : 4,
            RXTS  : 4,
            TXABL : 4,
            RXABL : 4,
            Res1  : 16;
};

This struct is just my creation for my own tests as it's not part of the
driver source. The patch touches only the first 2 fields: TXTS and RXTS.

See this similar product documentation by Texas Instruments for the above struct:
https://www.ti.com/lit/ug/sprugj8c/sprugj8c.pdf
TMS320C674x/OMAP-L1x Processor, Serial ATA (SATA) Controller, User's Guide,
Literature Number: SPRUGJ8C, March 2011,
Page 68, Chapter 4.33 "Port DMA Control Register (P0DMACR)"

The above TI document describes the two fields as follows:

TXTS:
   Transmit Transaction Size (TX_TRANSACTION_SIZE). This field defines the DMA 
transaction size in
   DWORDs for transmit (system bus read, device write) operation. [...]

RXTS:
   Receive Transaction Size (RX_TRANSACTION_SIZE). This field defines the Port 
DMA transaction size
   in DWORDs for receive (system bus write, device read) operation. [...]


So, in my patch the fields TXTS and RXTS are set to 3 each.
Without the patch, these fields seem to have some random content
(I'vee seen 5 and 6, 4 and 4, and 0 and 0 on different devices),
as the previous code doesn't touch these 2 fields (ie. these two fields
are not within the used old mask of 0xff00; cf. ahci_sunxi.c, function 
ahci_sunxi_start_engine(...)).


Some background story in my hunt for obtaining product documentation:

I couldn't find any product documentation for the SATA/AHCI
in these SoCs by Allwinner Technology (allwinnertech.com),
unlike with such products from other such companies.

I asked Allwinner, but they just said that the A20 of my SBC
would (allegedly) no more be actual and that the support for it
has ended (but this statement somehow cannot be true as the
A20 SoC is still continued being marketed at their website).
They instead sent me a bunch of really irrelevant PDFs which have
nothing to do with SATA/AHCI.

So, the company Allwinner Technology unfortunately was not cooperative
to provide me information on their SATA/AHCI-implementation in their SoCs :-(
Even the ports used in the actual ahci_sunxi.c in the linux tree are undocumented;
it is even commented with "/* This magic is from the original code */"
and below it many ports are used for which no documentation is available,
or at least I couldn't find any on the Internet. And the initial programmer
in 2014 and prior was Daniel Wang (danielwang@allwinnertech.com),
but email to that address bounces.

So, I was forced to research secondary literature from other vendors
like Texas Instruments (thanks TI !) and Intel, and also studying
very old source codes in the old Linux repositories (as it differs
much from the current version) going back to the year 2014, and had
to do many (blind) experiments until I found this solution.

The above given User's Guide by Texas Instruments (and their such
documents for their newer such products) helped me much to find the solution.
It's of course not really the correct documentation for the Allwinner SoCs,
but still better than nothing.

If I only had the right documentation, then I for sure could try
to further improve that already achieved result by this patch,
as with SATA-II upto 300 MiB/s is possible.


Yes, I'll resend the patch with some appropriate comments.

Uenal Mutlu



Maxime Ripard wrote on 05/12/2019 02:12 PM:
> Hi,
>
> On Fri, May 10, 2019 at 09:25:50PM +0200, Uenal Mutlu wrote:
>> Increasing the SATA/AHCI DMA TX/RX FIFOs (P0DMACR.TXTS and .RXTS) from
>> default 0x0 each to 0x3 each gives a write performance boost of 120MB/s
>> from lame 36MB/s to 45MB/s previously. Read performance is about 200MB/s
>> [tested on SSD using dd bs=4K count=512K].
>>
>> Tested on the Banana Pi R1 (aka Lamobo R1) and Banana Pi M1 SBCs
>> with Allwinner A20 32bit-SoCs (ARMv7-a / arm-linux-gnueabihf).
>> These devices are RaspberryPi-like small devices.
>>
>> RFC: Since more than about 25 similar SBC/SoC models do use the
>> ahci_sunxi driver, users are encouraged to test it on all the
>> affected boards and give feedback.
>>
>> List of the affected sunxi and other boards and SoCs with SATA using
>> the ahci_sunxi driver:
>>    $ grep -i -e "^&ahci" arch/arm/boot/dts/sun*dts
>>    and http://linux-sunxi.org/Category:Devices_with_SATA_port
>>
>> Signed-off-by: Uenal Mutlu <um@mutluit.com>
>> ---
>>   drivers/ata/ahci_sunxi.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/ata/ahci_sunxi.c b/drivers/ata/ahci_sunxi.c
>> index 911710643305..257986431c79 100644
>> --- a/drivers/ata/ahci_sunxi.c
>> +++ b/drivers/ata/ahci_sunxi.c
>> @@ -158,7 +158,7 @@ static void ahci_sunxi_start_engine(struct ata_port *ap)
>>   	struct ahci_host_priv *hpriv = ap->host->private_data;
>>
>>   	/* Setup DMA before DMA start */
>> -	sunxi_clrsetbits(hpriv->mmio + AHCI_P0DMACR, 0x0000ff00, 0x00004400);
>> +	sunxi_clrsetbits(hpriv->mmio + AHCI_P0DMACR, 0x0000ffff, 0x00004433);
>
> Having comments / defines here would be great, once fixed:
> Acked-by: Maxime Ripard <maxime.ripard@bootlin.com>
>
> Maxime
>
> --
> Maxime Ripard, Bootlin
> Embedded Linux and Kernel engineering
> https://bootlin.com
Maxime Ripard May 12, 2019, 5:40 p.m. UTC | #4
Hi,

On Sun, May 12, 2019 at 06:08:19PM +0200, U.Mutlu wrote:
> Hi Maxime & Others,
>
> what follows is a somewhat lengthy technical story behind this patch;
> you can just skip it and jump to the end.
>
>
> As can be seen in the ahci_sunxi.c, the port used in this patch
> is this one (32bit):
>   #define AHCI_P0DMACR    0x0170
> It's a so called "Vendor Specific Port" according to the SATA/AHCI specs by Intel.
> The data behind it is actually a struct, consisting of 4 fields,
> each 4bits long, plus a 16bits long field that is marked as Reserved
> in secondary literature (see below):
>
> struct AHCI_P0DMACR_t
> {
>   unsigned TXTS  : 4,
>            RXTS  : 4,
>            TXABL : 4,
>            RXABL : 4,
>            Res1  : 16;
> };
>
> This struct is just my creation for my own tests as it's not part of the
> driver source. The patch touches only the first 2 fields: TXTS and RXTS.
>
> See this similar product documentation by Texas Instruments for the above struct:
> https://www.ti.com/lit/ug/sprugj8c/sprugj8c.pdf
> TMS320C674x/OMAP-L1x Processor, Serial ATA (SATA) Controller, User's Guide,
> Literature Number: SPRUGJ8C, March 2011,
> Page 68, Chapter 4.33 "Port DMA Control Register (P0DMACR)"
>
> The above TI document describes the two fields as follows:
>
> TXTS:
>   Transmit Transaction Size (TX_TRANSACTION_SIZE). This field defines the
> DMA transaction size in
>   DWORDs for transmit (system bus read, device write) operation. [...]
>
> RXTS:
>   Receive Transaction Size (RX_TRANSACTION_SIZE). This field defines the
> Port DMA transaction size
>   in DWORDs for receive (system bus write, device read) operation. [...]
>
>
> So, in my patch the fields TXTS and RXTS are set to 3 each.
> Without the patch, these fields seem to have some random content
> (I'vee seen 5 and 6, 4 and 4, and 0 and 0 on different devices),
> as the previous code doesn't touch these 2 fields (ie. these two fields
> are not within the used old mask of 0xff00; cf. ahci_sunxi.c, function
> ahci_sunxi_start_engine(...)).
>
>
> Some background story in my hunt for obtaining product documentation:
>
> I couldn't find any product documentation for the SATA/AHCI
> in these SoCs by Allwinner Technology (allwinnertech.com),
> unlike with such products from other such companies.
>
> I asked Allwinner, but they just said that the A20 of my SBC
> would (allegedly) no more be actual and that the support for it
> has ended (but this statement somehow cannot be true as the
> A20 SoC is still continued being marketed at their website).
> They instead sent me a bunch of really irrelevant PDFs which have
> nothing to do with SATA/AHCI.
>
> So, the company Allwinner Technology unfortunately was not cooperative
> to provide me information on their SATA/AHCI-implementation in their SoCs :-(
> Even the ports used in the actual ahci_sunxi.c in the linux tree are undocumented;
> it is even commented with "/* This magic is from the original code */"
> and below it many ports are used for which no documentation is available,
> or at least I couldn't find any on the Internet. And the initial programmer
> in 2014 and prior was Daniel Wang (danielwang@allwinnertech.com),
> but email to that address bounces.
>
> So, I was forced to research secondary literature from other vendors
> like Texas Instruments (thanks TI !) and Intel, and also studying
> very old source codes in the old Linux repositories (as it differs
> much from the current version) going back to the year 2014, and had
> to do many (blind) experiments until I found this solution.
>
> The above given User's Guide by Texas Instruments (and their such
> documents for their newer such products) helped me much to find the solution.
> It's of course not really the correct documentation for the Allwinner SoCs,
> but still better than nothing.
>
> If I only had the right documentation, then I for sure could try
> to further improve that already achieved result by this patch,
> as with SATA-II upto 300 MiB/s is possible.
>
>
> Yes, I'll resend the patch with some appropriate comments.

That's awesome research and explanation, thanks! :)

Maxime

--
Maxime Ripard, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com
diff mbox series

Patch

diff --git a/drivers/ata/ahci_sunxi.c b/drivers/ata/ahci_sunxi.c
index 911710643305..257986431c79 100644
--- a/drivers/ata/ahci_sunxi.c
+++ b/drivers/ata/ahci_sunxi.c
@@ -158,7 +158,7 @@  static void ahci_sunxi_start_engine(struct ata_port *ap)
 	struct ahci_host_priv *hpriv = ap->host->private_data;
 
 	/* Setup DMA before DMA start */
-	sunxi_clrsetbits(hpriv->mmio + AHCI_P0DMACR, 0x0000ff00, 0x00004400);
+	sunxi_clrsetbits(hpriv->mmio + AHCI_P0DMACR, 0x0000ffff, 0x00004433);
 
 	/* Start DMA */
 	sunxi_setbits(port_mmio + PORT_CMD, PORT_CMD_START);