[00/10] Raspberry Pi SPI speedups
mbox series

Message ID cover.1564825752.git.lukas@wunner.de
Headers show
Series
  • Raspberry Pi SPI speedups
Related show

Message

Lukas Wunner Aug. 3, 2019, 10:10 a.m. UTC
So far the BCM2835 SPI driver cannot cope with TX-only and RX-only
transfers (rx_buf or tx_buf is NULL) when using DMA:  It relies on
the SPI core to convert them to full-duplex transfers by allocating
and DMA-mapping a dummy rx_buf or tx_buf.  This costs performance.

Resolve by pre-allocating reusable DMA descriptors which cyclically
clear the RX FIFO (for TX-only transfers) or zero-fill the TX FIFO
(for RX-only transfers).  Patch [07/10] provides some numbers for
the achieved latency improvement and CPU time reduction with an
SPI Ethernet controller.  SPI displays should see a similar speedup.
I've also made an effort to reduce peripheral and memory bus accesses.

The series is meant to be applied on top of broonie/for-next.
It can be applied to Linus' current tree if commit
8d8bef503658 ("spi: bcm2835: Fix 3-wire mode if DMA is enabled")
is cherry-picked from broonie's repo beforehand.

Please review and test.  Thank you.

Lukas Wunner (10):
  dmaengine: bcm2835: Allow reusable descriptors
  dmaengine: bcm2835: Allow cyclic transactions without interrupt
  spi: Guarantee cacheline alignment of driver-private data
  spi: bcm2835: Drop dma_pending flag
  spi: bcm2835: Work around DONE bit erratum
  spi: bcm2835: Cache CS register value for ->prepare_message()
  spi: bcm2835: Speed up TX-only DMA transfers by clearing RX FIFO
  dmaengine: bcm2835: Document struct bcm2835_dmadev
  dmaengine: bcm2835: Avoid accessing memory when copying zeroes
  spi: bcm2835: Speed up RX-only DMA transfers by zero-filling TX FIFO

 drivers/dma/bcm2835-dma.c |  38 +++-
 drivers/spi/spi-bcm2835.c | 408 ++++++++++++++++++++++++++++++++------
 drivers/spi/spi.c         |  18 +-
 3 files changed, 390 insertions(+), 74 deletions(-)

Comments

Noralf Trønnes Aug. 3, 2019, 4:01 p.m. UTC | #1
Den 03.08.2019 12.10, skrev Lukas Wunner:
> So far the BCM2835 SPI driver cannot cope with TX-only and RX-only
> transfers (rx_buf or tx_buf is NULL) when using DMA:  It relies on
> the SPI core to convert them to full-duplex transfers by allocating
> and DMA-mapping a dummy rx_buf or tx_buf.  This costs performance.
> 
> Resolve by pre-allocating reusable DMA descriptors which cyclically
> clear the RX FIFO (for TX-only transfers) or zero-fill the TX FIFO
> (for RX-only transfers).  Patch [07/10] provides some numbers for
> the achieved latency improvement and CPU time reduction with an
> SPI Ethernet controller.  SPI displays should see a similar speedup.
> I've also made an effort to reduce peripheral and memory bus accesses.
> 
> The series is meant to be applied on top of broonie/for-next.
> It can be applied to Linus' current tree if commit
> 8d8bef503658 ("spi: bcm2835: Fix 3-wire mode if DMA is enabled")
> is cherry-picked from broonie's repo beforehand.
> 
> Please review and test.  Thank you.
> 

Tested-by: Noralf Trønnes <noralf@tronnes.org>

I've tested on a 320x240 RGB565 display, flipping 2 framebuffers as fast
as possible. The buffers are 320x240x2=150kB which are maxsize split by
spi-bcm2835.

I see a small increase in speed from ~34.75 to ~35.55 fps using this series.

Details:

$ ~/libdrm/tests/modetest/modetest -M mi0283qt -s 31:320x240@RG16 -v
setting mode 320x240-0Hz@RG16 on connectors 31, crtc 34
freq: 30.36Hz
freq: 35.32Hz
freq: 35.56Hz
freq: 35.57Hz
freq: 35.54Hz
freq: 35.55Hz
freq: 35.60Hz
freq: 35.62Hz
freq: 35.50Hz
freq: 35.23Hz
freq: 35.49Hz
freq: 35.44Hz
freq: 35.39Hz
freq: 35.58Hz

$ od --endian big -tu4 -An
/sys/bus/spi/devices/spi0.0/of_node/spi-max-frequency
   64000000

<skip those that are zero>
pi@pi2835:~$ tail -n +1 /sys/bus/spi/devices/spi0.0/statistics/*
==> /sys/bus/spi/devices/spi0.0/statistics/bytes <==
131644720
==> /sys/bus/spi/devices/spi0.0/statistics/bytes_rx <==
2
==> /sys/bus/spi/devices/spi0.0/statistics/bytes_tx <==
131644718
==> /sys/bus/spi/devices/spi0.0/statistics/messages <==
5188
==> /sys/bus/spi/devices/spi0.0/statistics/spi_sync <==
5188
==> /sys/bus/spi/devices/spi0.0/statistics/spi_sync_immediate <==
5188
==> /sys/bus/spi/devices/spi0.0/statistics/transfer_bytes_histo_0-1 <==
2609
==>
/sys/bus/spi/devices/spi0.0/statistics/transfer_bytes_histo_16384-32767 <==
857
==> /sys/bus/spi/devices/spi0.0/statistics/transfer_bytes_histo_2-3 <==
5
==>
/sys/bus/spi/devices/spi0.0/statistics/transfer_bytes_histo_32768-65535 <==
1714
==> /sys/bus/spi/devices/spi0.0/statistics/transfer_bytes_histo_4-7 <==
1717
==> /sys/bus/spi/devices/spi0.0/statistics/transfer_bytes_histo_8-15 <==
2
==> /sys/bus/spi/devices/spi0.0/statistics/transfers <==
6904
==> /sys/bus/spi/devices/spi0.0/statistics/transfers_split_maxsize <==
857


Noralf.
Stefan Wahren Aug. 11, 2019, 7:50 p.m. UTC | #2
Hi Lukas,

Am 03.08.19 um 12:10 schrieb Lukas Wunner:
> So far the BCM2835 SPI driver cannot cope with TX-only and RX-only
> transfers (rx_buf or tx_buf is NULL) when using DMA:  It relies on
> the SPI core to convert them to full-duplex transfers by allocating
> and DMA-mapping a dummy rx_buf or tx_buf.  This costs performance.
>
> Resolve by pre-allocating reusable DMA descriptors which cyclically
> clear the RX FIFO (for TX-only transfers) or zero-fill the TX FIFO
> (for RX-only transfers).  Patch [07/10] provides some numbers for
> the achieved latency improvement and CPU time reduction with an
> SPI Ethernet controller.  SPI displays should see a similar speedup.
> I've also made an effort to reduce peripheral and memory bus accesses.

i know the BCM2711 / Raspberry Pi 4 isn't upstreamed yet, but this
series hasn't been tested with RPi 4?

I only want to know.
Lukas Wunner Aug. 11, 2019, 7:52 p.m. UTC | #3
On Sun, Aug 11, 2019 at 09:50:17PM +0200, Stefan Wahren wrote:
> Am 03.08.19 um 12:10 schrieb Lukas Wunner:
> > So far the BCM2835 SPI driver cannot cope with TX-only and RX-only
> > transfers (rx_buf or tx_buf is NULL) when using DMA:  It relies on
> > the SPI core to convert them to full-duplex transfers by allocating
> > and DMA-mapping a dummy rx_buf or tx_buf.  This costs performance.
> >
> > Resolve by pre-allocating reusable DMA descriptors which cyclically
> > clear the RX FIFO (for TX-only transfers) or zero-fill the TX FIFO
> > (for RX-only transfers).  Patch [07/10] provides some numbers for
> > the achieved latency improvement and CPU time reduction with an
> > SPI Ethernet controller.  SPI displays should see a similar speedup.
> > I've also made an effort to reduce peripheral and memory bus accesses.
> 
> i know the BCM2711 / Raspberry Pi 4 isn't upstreamed yet, but this
> series hasn't been tested with RPi 4?

No, I only have the CM1 and CM3 at my disposal for testing,
the series seemed to work fine on both.

Thanks,

Lukas
Stefan Wahren Aug. 19, 2019, 7:22 p.m. UTC | #4
Am 03.08.19 um 12:10 schrieb Lukas Wunner:
> So far the BCM2835 SPI driver cannot cope with TX-only and RX-only
> transfers (rx_buf or tx_buf is NULL) when using DMA:  It relies on
> the SPI core to convert them to full-duplex transfers by allocating
> and DMA-mapping a dummy rx_buf or tx_buf.  This costs performance.
>
> Resolve by pre-allocating reusable DMA descriptors which cyclically
> clear the RX FIFO (for TX-only transfers) or zero-fill the TX FIFO
> (for RX-only transfers).  Patch [07/10] provides some numbers for
> the achieved latency improvement and CPU time reduction with an
> SPI Ethernet controller.  SPI displays should see a similar speedup.
> I've also made an effort to reduce peripheral and memory bus accesses.
>
> The series is meant to be applied on top of broonie/for-next.
> It can be applied to Linus' current tree if commit
> 8d8bef503658 ("spi: bcm2835: Fix 3-wire mode if DMA is enabled")
> is cherry-picked from broonie's repo beforehand.
>
> Please review and test.  Thank you.
>
> Lukas Wunner (10):
>   dmaengine: bcm2835: Allow reusable descriptors
>   dmaengine: bcm2835: Allow cyclic transactions without interrupt
>   spi: Guarantee cacheline alignment of driver-private data
>   spi: bcm2835: Drop dma_pending flag
>   spi: bcm2835: Work around DONE bit erratum
>   spi: bcm2835: Cache CS register value for ->prepare_message()
>   spi: bcm2835: Speed up TX-only DMA transfers by clearing RX FIFO
>   dmaengine: bcm2835: Document struct bcm2835_dmadev
>   dmaengine: bcm2835: Avoid accessing memory when copying zeroes
>   spi: bcm2835: Speed up RX-only DMA transfers by zero-filling TX FIFO
>
Acked-by: Stefan Wahren <wahrenst@gmx.net>

Sorry, for this late reply
Martin Sperl Aug. 21, 2019, 3:21 p.m. UTC | #5
> On 03.08.2019, at 12:10, Lukas Wunner <lukas@wunner.de> wrote:
> 
> Lukas Wunner (10):
>  dmaengine: bcm2835: Allow reusable descriptors
>  dmaengine: bcm2835: Allow cyclic transactions without interrupt
>  spi: Guarantee cacheline alignment of driver-private data
>  spi: bcm2835: Drop dma_pending flag
>  spi: bcm2835: Work around DONE bit erratum
>  spi: bcm2835: Cache CS register value for ->prepare_message()
>  spi: bcm2835: Speed up TX-only DMA transfers by clearing RX FIFO
>  dmaengine: bcm2835: Document struct bcm2835_dmadev
>  dmaengine: bcm2835: Avoid accessing memory when copying zeroes
>  spi: bcm2835: Speed up RX-only DMA transfers by zero-filling TX FIFO
> 
Acked-by: Martin Sperl <kernel@martin.sperl.org>
Lukas Wunner Aug. 24, 2019, 10:33 a.m. UTC | #6
Dear Marc,

to alleviate you of having to add all the Acked-by and Tested-by tags
to this series, I've prepared a "prêt-à-porter" branch complete with
all tags which you can cherry-pick or merge from.

If you decide to instead apply the patches yourself, you can double-check
the result by comparing it to my branch with "git range-diff".

If you have any comments on the series or would like to have anything
changed, please let me know.

Thanks!

----------------------------------------------------------------

The following changes since commit c55be305915974db160ce6472722ff74f45b8d4e:

  spi: spi-fsl-dspi: Use poll mode in case the platform IRQ is missing (2019-08-23 12:01:44 +0100)

are available in the git repository at:

  https://github.com/l1k/linux bcm2835_spi_simplex_v1

for you to fetch changes up to 37ad33d4bee27d9a24f1deffd675e327d1bb899e:

  spi: bcm2835: Speed up RX-only DMA transfers by zero-filling TX FIFO (2019-08-24 11:54:11 +0200)

----------------------------------------------------------------
So far the BCM2835 SPI driver cannot cope with TX-only and RX-only
transfers (rx_buf or tx_buf is NULL) when using DMA:  It relies on
the SPI core to convert them to full-duplex transfers by allocating
and DMA-mapping a dummy rx_buf or tx_buf.  This costs performance.

Resolve by pre-allocating reusable DMA descriptors which cyclically
clear the RX FIFO (for TX-only transfers) or zero-fill the TX FIFO
(for RX-only transfers).  Commit fecf4ba3f248 provides some numbers
for the achieved latency improvement and CPU time reduction with an
SPI Ethernet controller.  SPI displays should see a similar speedup.
I've also made an effort to reduce peripheral and memory bus accesses.
----------------------------------------------------------------
Lukas Wunner (10):
      dmaengine: bcm2835: Allow reusable descriptors
      dmaengine: bcm2835: Allow cyclic transactions without interrupt
      spi: Guarantee cacheline alignment of driver-private data
      spi: bcm2835: Drop dma_pending flag
      spi: bcm2835: Work around DONE bit erratum
      spi: bcm2835: Cache CS register value for ->prepare_message()
      spi: bcm2835: Speed up TX-only DMA transfers by clearing RX FIFO
      dmaengine: bcm2835: Document struct bcm2835_dmadev
      dmaengine: bcm2835: Avoid accessing memory when copying zeroes
      spi: bcm2835: Speed up RX-only DMA transfers by zero-filling TX FIFO

 drivers/dma/bcm2835-dma.c |  38 ++++-
 drivers/spi/spi-bcm2835.c | 408 ++++++++++++++++++++++++++++++++++++++--------
 drivers/spi/spi.c         |  18 +-
 3 files changed, 390 insertions(+), 74 deletions(-)
Lukas Wunner Sept. 7, 2019, 9:06 a.m. UTC | #7
Dear Mark,

On Sat, Aug 03, 2019 at 12:10:00PM +0200, Lukas Wunner wrote:
> So far the BCM2835 SPI driver cannot cope with TX-only and RX-only
> transfers (rx_buf or tx_buf is NULL) when using DMA:  It relies on
> the SPI core to convert them to full-duplex transfers by allocating
> and DMA-mapping a dummy rx_buf or tx_buf.  This costs performance.
> 
> Resolve by pre-allocating reusable DMA descriptors which cyclically
> clear the RX FIFO (for TX-only transfers) or zero-fill the TX FIFO
> (for RX-only transfers).  Patch [07/10] provides some numbers for
> the achieved latency improvement and CPU time reduction with an
> SPI Ethernet controller.  SPI displays should see a similar speedup.
> I've also made an effort to reduce peripheral and memory bus accesses.

Just a gentle ping, this patch set was posted to the list 5 weeks ago,
has all necessary acks and has been tested successfully by 2 people
besides myself.

Do you have any thoughts on it?  Any objections?

In case the patches no longer apply cleanly I've prepared this branch
based on your for-next branch of Aug 23 from which you can merge if
you prefer that:

https://github.com/l1k/linux/commits/bcm2835_spi_simplex_v1

However I can also repost if necessary.

(PS: Apologies for misspelling your name as "Marc" in my e-mail of Aug 24.)

Thanks,

Lukas

> Lukas Wunner (10):
>   dmaengine: bcm2835: Allow reusable descriptors
>   dmaengine: bcm2835: Allow cyclic transactions without interrupt
>   spi: Guarantee cacheline alignment of driver-private data
>   spi: bcm2835: Drop dma_pending flag
>   spi: bcm2835: Work around DONE bit erratum
>   spi: bcm2835: Cache CS register value for ->prepare_message()
>   spi: bcm2835: Speed up TX-only DMA transfers by clearing RX FIFO
>   dmaengine: bcm2835: Document struct bcm2835_dmadev
>   dmaengine: bcm2835: Avoid accessing memory when copying zeroes
>   spi: bcm2835: Speed up RX-only DMA transfers by zero-filling TX FIFO
> 
>  drivers/dma/bcm2835-dma.c |  38 +++-
>  drivers/spi/spi-bcm2835.c | 408 ++++++++++++++++++++++++++++++++------
>  drivers/spi/spi.c         |  18 +-
>  3 files changed, 390 insertions(+), 74 deletions(-)
> 
> -- 
> 2.20.1
Mark Brown Sept. 9, 2019, 4:56 p.m. UTC | #8
On Sat, Sep 07, 2019 at 11:06:37AM +0200, Lukas Wunner wrote:

> Just a gentle ping, this patch set was posted to the list 5 weeks ago,
> has all necessary acks and has been tested successfully by 2 people
> besides myself.

Please don't send content free pings and please allow a reasonable time
for review.  People get busy, go on holiday, attend conferences and so 
on so unless there is some reason for urgency (like critical bug fixes)
please allow at least a couple of weeks for review.  If there have been
review comments then people may be waiting for those to be addressed.

Sending content free pings adds to the mail volume (if they are seen at
all) which is often the problem and since they can't be reviewed
directly if something has gone wrong you'll have to resend the patches
anyway, so sending again is generally a better approach though there are
some other maintainers who like them - if in doubt look at how patches
for the subsystem are normally handled.
Mark Brown Sept. 10, 2019, 11:21 a.m. UTC | #9
On Sat, Sep 07, 2019 at 11:06:37AM +0200, Lukas Wunner wrote:

> Do you have any thoughts on it?  Any objections?

Having found this in the archives something went quite wrong with
the posting, the patches appear to be in a random non-sequential
order:

   20466 N T 08/03 Lukas Wunner    (1.7K) [PATCH 00/10] Raspberry Pi SPI speedup
   20467 N T 08/03 Lukas Wunner    (2.0K) ├─>[PATCH 02/10] dmaengine: bcm2835: A
   20468 N C 08/08 Vinod Koul      (2.2K) │ └─>
   20469 N T 08/03 Lukas Wunner    (1.1K) ├─>[PATCH 01/10] dmaengine: bcm2835: A
   20470 N C 08/08 Vinod Koul      (1.3K) │ └─>
   20471 N T 08/03 Lukas Wunner    (3.9K) ├─>[PATCH 04/10] spi: bcm2835: Drop dm
   20472 N T 08/03 Lukas Wunner    (3.5K) ├─>[PATCH 05/10] spi: bcm2835: Work ar

I have no recollection of what happened with this but I'd expect
it's something to do with that and the lengthy discussion of some
bits.

It also doesn't apply any more, please resend.  In addition it
looks like patch 5 is a fix but it's in the middle of the series
for some reason, in general bug fixes should always be sent as
the first so that they don't depend on anything else.

To repeat what I said in the standard mail please don't send
pings, they're just noise - I can't apply a ping.