[v6] RFT: mmc: sdhci: Implement an SDHCI-specific bounce buffer

The bounce buffer is gone from the MMC core, and now we found out
that there are some (crippled) i.MX boards out there that have broken
ADMA (cannot do scatter-gather), and also broken PIO so they must
use SDMA. Closer examination shows a less significant slowdown
also on SDMA-only capable Laptop hosts.

SDMA sets down the number of segments to one, so that each segment
gets turned into a singular request that ping-pongs to the block
layer before the next request/segment is issued.

Apparently it happens a lot that the block layer send requests
that include a lot of physically discontigous segments. My guess
is that this phenomenon is coming from the file system.

These devices that cannot handle scatterlists in hardware can see
major benefits from a DMA-contigous bounce buffer.

This patch accumulates those fragmented scatterlists in a physically
contigous bounce buffer so that we can issue bigger DMA data chunks
to/from the card.

When tested with thise PCI-integrated host (1217:8221) that
only supports SDMA:
0b:00.0 SD Host controller: O2 Micro, Inc. OZ600FJ0/OZ900FJ0/OZ600FJS
        SD/MMC Card Reader Controller (rev 05)
This patch gave ~1Mbyte/s improved throughput on large reads and
writes when testing using iozone than without the patch.

dmesg:
sdhci-pci 0000:0b:00.0: SDHCI controller found [1217:8221] (rev 5)
mmc0 bounce up to 128 segments into one, max segment size 65536 bytes
mmc0: SDHCI controller on PCI [0000:0b:00.0] using DMA

On the i.MX SDHCI controllers on the crippled i.MX 25 and i.MX 35
the patch restores the performance to what it was before we removed
the bounce buffers, and then some: performance is better than ever
because we now allocate a bounce buffer the size of the maximum
single request the SDMA engine can handle. On the PCI laptop this
is 256K, whereas with the old bounce buffer code it was 64K max.

Cc: Benjamin Beckmeyer <beckmeyer.b@rittal.de>
Cc: Pierre Ossman <pierre@ossman.eu>
Cc: Benoît Thébaudeau <benoit@wsystem.com>
Cc: Fabio Estevam <fabio.estevam@nxp.com>
Cc: stable@vger.kernel.org
Fixes: de3ee99b097d ("mmc: Delete bounce buffer handling")
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
---
ChangeLog v5->v6:
- Again switch back to explicit sync of buffers. I want to get this
  solution to work because it gives more control and it's more
  elegant.
- Update host->max_req_size as noted by Adrian, hopefully this
  fixes the i.MX. I was just lucky on my Intel laptop I guess:
  the block stack never requested anything bigger than 64KB and
  that was why it worked even if max_req_size was bigger than
  what would fit in the bounce buffer.
- Copy the number of bytes in the mmc_data instead of the number
  of bytes in the bounce buffer. For RX this is blksize * blocks
  and for TX this is bytes_xfered.
- Break out a sdhci_sdma_address() for getting the DMA address
  for either the raw sglist or the bounce buffer depending on
  configuration.
- Add some explicit bounds check for the data so that we do not
  attempt to copy more than the bounce buffer size even if the
  block layer is erroneously configured.
- Move allocation of bounce buffer out to its own function.
- Use pr_[info|err] throughout so all debug prints from the
  driver come out in the same manner and style.
- Use unsigned int for the bounce buffer size.
- Re-tested with iozone: we still get the same nice performance
  improvements.
- Request a text on i.MX (hi Benjamin)
ChangeLog v4->v5:
- Go back to dma_alloc_coherent() as this apparently works better.
- Keep the other changes, cap for 64KB, fall back to single segments.
- Requesting a test of this on i.MX. (Sorry Benjamin.)
ChangeLog v3->v4:
- Cap the bounce buffer to 64KB instead of the biggest segment
  as we experience diminishing returns with buffers > 64KB.
- Instead of using dma_alloc_coherent(), use good old devm_kmalloc()
  and issue dma_sync_single_for*() to explicitly switch
  ownership between CPU and the device. This way we exercise the
  cache better and may consume less CPU.
- Bail out with single segments if we cannot allocate a bounce
  buffer.
- Tested on the PCI SDHCI on my laptop: requesting a new test
  on i.MX from Benjamin. (Please!)
ChangeLog v2->v3:
- Rewrite the commit message a bit
- Add Benjamin's Tested-by
- Add Fixes and stable tags
ChangeLog v1->v2:
- Skip the remapping and fiddling with the buffer, instead use
  dma_alloc_coherent() and use a simple, coherent bounce buffer.
- Couple kernel messages to ->parent of the mmc_host as it relates
  to the hardware characteristics.
---
 drivers/mmc/host/sdhci.c | 162 ++++++++++++++++++++++++++++++++++++++++++++---
 drivers/mmc/host/sdhci.h |   3 +
 2 files changed, 157 insertions(+), 8 deletions(-)

[v6] RFT: mmc: sdhci: Implement an SDHCI-specific bounce buffer

Commit Message

Comments

Patch