[v5] RFT: mmc: sdhci: Implement an SDHCI-specific bounce buffer

Message ID 20180112141907.18203-1-linus.walleij@linaro.org (mailing list archive)
State New, archived

Commit Message

Linus Walleij Jan. 12, 2018, 2:19 p.m. UTC
The bounce buffer is gone from the MMC core, and now we have found out
that there are some (crippled) i.MX boards out there that have broken
ADMA (cannot do scatter-gather), and broken PIO, so they must use
SDMA. Closer examination shows a less significant slowdown also on
SDMA-only capable laptop hosts.

SDMA forces the number of segments down to one, so that each segment
gets turned into a singular request that ping-pongs to the block
layer before the next request/segment is issued.

Apparently it happens a lot that the block layer sends requests
that include a lot of physically discontiguous segments. My guess
is that this phenomenon comes from the file system.

These devices that cannot handle scatterlists in hardware can see
major benefits from a DMA-contiguous bounce buffer.

This patch accumulates those fragmented scatterlists in a physically
contiguous bounce buffer so that we can issue bigger DMA data chunks
to/from the card.

When tested with this PCI-integrated host (1217:8221) that
only supports SDMA:
0b:00.0 SD Host controller: O2 Micro, Inc. OZ600FJ0/OZ900FJ0/OZ600FJS
        SD/MMC Card Reader Controller (rev 05)
This patch gave ~1 MByte/s higher throughput on large reads and
writes when testing with iozone, compared to without the patch.

On the i.MX SDHCI controllers on the crippled i.MX 25 and i.MX 35
the patch restores the performance to what it was before we removed
the bounce buffers, and then some: performance is better than ever
because we now allocate a bounce buffer the size of the maximum
single request the SDMA engine can handle. On the PCI laptop this
is 256K, whereas with the old bounce buffer code it was 64K max.

Cc: Benjamin Beckmeyer <beckmeyer.b@rittal.de>
Cc: Pierre Ossman <pierre@ossman.eu>
Cc: Benoît Thébaudeau <benoit@wsystem.com>
Cc: Fabio Estevam <fabio.estevam@nxp.com>
Cc: stable@vger.kernel.org
Fixes: de3ee99b097d ("mmc: Delete bounce buffer handling")
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
---
ChangeLog v4->v5:
- Go back to dma_alloc_coherent() as this apparently works better.
- Keep the other changes: cap at 64KB, fall back to single segments.
- Requesting a test of this on i.MX. (Sorry Benjamin.)
ChangeLog v3->v4:
- Cap the bounce buffer to 64KB instead of the biggest segment
  as we experience diminishing returns with buffers > 64KB.
- Instead of using dma_alloc_coherent(), use good old devm_kmalloc()
  and issue dma_sync_single_for*() to explicitly switch
  ownership between CPU and the device. This way we exercise the
  cache better and may consume less CPU.
- Bail out with single segments if we cannot allocate a bounce
  buffer.
- Tested on the PCI SDHCI on my laptop: requesting a new test
  on i.MX from Benjamin. (Please!)
ChangeLog v2->v3:
- Rewrite the commit message a bit
- Add Benjamin's Tested-by
- Add Fixes and stable tags
ChangeLog v1->v2:
- Skip the remapping and fiddling with the buffer, instead use
  dma_alloc_coherent() and use a simple, coherent bounce buffer.
- Couple kernel messages to ->parent of the mmc_host as it relates
  to the hardware characteristics.
---
 drivers/mmc/host/sdhci.c | 105 +++++++++++++++++++++++++++++++++++++++++++----
 drivers/mmc/host/sdhci.h |   3 ++
 2 files changed, 100 insertions(+), 8 deletions(-)

Comments

Benjamin Beckmeyer Jan. 15, 2018, 6:07 a.m. UTC | #1
Good morning,

> - Requesting a test of this on i.MX. (Sorry Benjamin.)

No worries. I did another test with your v5 patch, but it doesn't work.
Here is the kernel log: 

sdhci: Secure Digital Host Controller Interface driver
sdhci: Copyright(c) Pierre Ossman
sdhci-pltfm: SDHCI platform and OF driver helper
sdhci-esdhc-imx 53fb4000.esdhc: host only supports SDMA, activate bounce buffer
sdhci-esdhc-imx 53fb4000.esdhc: bounce buffer: bounce up to 128 segments into one, max segment size 65536 bytes
mmc0: SDHCI controller on 53fb4000.esdhc [53fb4000.esdhc] using DMA
sdhci-esdhc-imx 53fb8000.esdhc: Got CD GPIO
sdhci-esdhc-imx 53fb8000.esdhc: Got WP GPIO
sdhci-esdhc-imx 53fb8000.esdhc: host only supports SDMA, activate bounce buffer
sdhci-esdhc-imx 53fb8000.esdhc: bounce buffer: bounce up to 128 segments into one, max segment size 65536 bytes
mmc1: SDHCI controller on 53fb8000.esdhc [53fb8000.esdhc] using DMA
mmc0: new high speed MMC card at address 0001
oprofile: no performance counters
oprofile: using timer interrupt.
mmcblk0: mmc0:0001 002G00 1.83 GiB 
mmcblk0boot0: mmc0:0001 002G00 partition 1 512 KiB
NET: Registered protocol family 10
mmcblk0boot1: mmc0:0001 002G00 partition 2 512 KiB
Segment Routing with IPv6
sit: IPv6, IPv4 and MPLS over IPv4 tunneling driver
NET: Registered protocol family 17
can: controller area network core (rev 20170425 abi 9)
NET: Registered protocol family 29
can: raw protocol (rev 20170425)
can: broadcast manager protocol (rev 20170425 t)
can: netlink gateway (rev 20170425) max_hops=1
input: gpio-keys as /devices/platform/gpio-keys/input/input1
mmcblk0rpmb: mmc0:0001 002G00 partition 3 128 KiB
imxdi_rtc 53ffc000.dryice: setting system clock to 1970-01-05 20:50:24 UTC (420624)
 mmcblk0: p1 p2 p3 p4 < p5 p6 p7 p8 p9 p10 >
EXT4-fs (mmcblk0p8): couldn't mount as ext3 due to feature incompatibilities
EXT4-fs (mmcblk0p8): couldn't mount as ext2 due to feature incompatibilities
EXT4-fs (mmcblk0p8): INFO: recovery required on readonly filesystem
EXT4-fs (mmcblk0p8): write access will be enabled during recovery
EXT4-fs (mmcblk0p8): recovery complete
EXT4-fs (mmcblk0p8): mounted filesystem with ordered data mode. Opts: (null)
VFS: Mounted root (ext4 filesystem) readonly on device 259:0.
devtmpfs: mounted
Freeing unused kernel memory: 248K
This architecture does not have kernel memory protection.
Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b

---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b


Kind regards,
Benjamin Beckmeyer
Linus Walleij Jan. 15, 2018, 9:53 a.m. UTC | #2
On Mon, Jan 15, 2018 at 7:07 AM, Benjamin Beckmeyer
<beckmeyer.b@rittal.de> wrote:

> Good morning,
>
>> - Requesting a test of this on i.MX. (Sorry Benjamin.)
>
> No worries. I did another test with your v5 patch, but it doesn't work.

Wow that is weird.

So v3 works.

And it's not the code that bails out if we have nr_segs == 1
because that code path is not taken.

It is the 64K instead of 512K buffer that makes this happen,
then.

I guess I'll just respin v3 with the change that we fall back to
1 segment if we can't allocate coherent memory for the
bounce buffer.

Yours,
Linus Walleij
Adrian Hunter Jan. 15, 2018, 10:54 a.m. UTC | #3
On 15/01/18 11:53, Linus Walleij wrote:
> On Mon, Jan 15, 2018 at 7:07 AM, Benjamin Beckmeyer
> <beckmeyer.b@rittal.de> wrote:
> 
>> Good morning,
>>
>>> - Requesting a test of this on i.MX. (Sorry Benjamin.)
>>
>> No worries. I did another test with your v5 patch, but it doesn't work.
> 
> Wow that is weird.
> 
> So v3 works.
> 
> And it's not the code that bails out if we have nr_segs == 1
> because that code path is not taken.
> 
> It is the 64K instead of 512K buffer that makes this happen,
> then.

That might be because you are not updating max_req_size, so a request
bigger than the bounce buffer would DMA past the end of it.

> I guess I'll just respin v3 with the change that we fall back to
> 1 segment if we can't allocate coherent memory for the
> bounce buffer.

I would like to support pre_req() and post_req().  I have some code for that
if you are interested.
Adrian Hunter Jan. 15, 2018, 11:58 a.m. UTC | #4
On 15/01/18 12:54, Adrian Hunter wrote:
> On 15/01/18 11:53, Linus Walleij wrote:
>> On Mon, Jan 15, 2018 at 7:07 AM, Benjamin Beckmeyer
>> <beckmeyer.b@rittal.de> wrote:
>>
>>> Good morning,
>>>
>>>> - Requesting a test of this on i.MX. (Sorry Benjamin.)
>>>
>>> No worries. I did another test with your v5 patch, but it doesn't work.
>>
>> Wow that is weird.
>>
>> So v3 works.
>>
>> And it's not the code that bails out if we have nr_segs == 1
>> because that code path is not taken.
>>
>> It is the 64K instead of 512K buffer that makes this happen,
>> then.
> 
> That might be because you are not updating max_req_size.
> 
>> I guess I'll just respin v3 with the change that we fall back to
>> 1 segment if we can't allocate coherent memory for the
>> bounce buffer.
> 
> I would like to support pre_req() and post_req().  I have some code for that
> if you are interested.
> 

Then again, it depends on the amount of memory allocated. In some cases,
one big buffer will probably be better than two of half the size.
Adrian Hunter Jan. 15, 2018, 12:22 p.m. UTC | #5
On 12/01/18 16:19, Linus Walleij wrote:
> The bounce buffer is gone from the MMC core, and now we have found out
> that there are some (crippled) i.MX boards out there that have broken
> ADMA (cannot do scatter-gather), and broken PIO, so they must use
> SDMA. Closer examination shows a less significant slowdown also on
> SDMA-only capable laptop hosts.
> 
> SDMA forces the number of segments down to one, so that each segment
> gets turned into a singular request that ping-pongs to the block
> layer before the next request/segment is issued.
> 
> Apparently it happens a lot that the block layer sends requests
> that include a lot of physically discontiguous segments. My guess
> is that this phenomenon comes from the file system.
> 
> These devices that cannot handle scatterlists in hardware can see
> major benefits from a DMA-contiguous bounce buffer.
> 
> This patch accumulates those fragmented scatterlists in a physically
> contiguous bounce buffer so that we can issue bigger DMA data chunks
> to/from the card.
> 
> When tested with this PCI-integrated host (1217:8221) that
> only supports SDMA:
> 0b:00.0 SD Host controller: O2 Micro, Inc. OZ600FJ0/OZ900FJ0/OZ600FJS
>         SD/MMC Card Reader Controller (rev 05)
> This patch gave ~1 MByte/s higher throughput on large reads and
> writes when testing with iozone, compared to without the patch.
> 
> On the i.MX SDHCI controllers on the crippled i.MX 25 and i.MX 35
> the patch restores the performance to what it was before we removed
> the bounce buffers, and then some: performance is better than ever
> because we now allocate a bounce buffer the size of the maximum
> single request the SDMA engine can handle. On the PCI laptop this
> is 256K, whereas with the old bounce buffer code it was 64K max.
> 
> Cc: Benjamin Beckmeyer <beckmeyer.b@rittal.de>
> Cc: Pierre Ossman <pierre@ossman.eu>
> Cc: Benoît Thébaudeau <benoit@wsystem.com>
> Cc: Fabio Estevam <fabio.estevam@nxp.com>
> Cc: stable@vger.kernel.org
> Fixes: de3ee99b097d ("mmc: Delete bounce buffer handling")
> Signed-off-by: Linus Walleij <linus.walleij@linaro.org>

Thanks for doing this.  A few comments below.

> ---
> ChangeLog v4->v5:
> - Go back to dma_alloc_coherent() as this apparently works better.
> - Keep the other changes: cap at 64KB, fall back to single segments.
> - Requesting a test of this on i.MX. (Sorry Benjamin.)
> ChangeLog v3->v4:
> - Cap the bounce buffer to 64KB instead of the biggest segment
>   as we experience diminishing returns with buffers > 64KB.
> - Instead of using dma_alloc_coherent(), use good old devm_kmalloc()
>   and issue dma_sync_single_for*() to explicitly switch
>   ownership between CPU and the device. This way we exercise the
>   cache better and may consume less CPU.
> - Bail out with single segments if we cannot allocate a bounce
>   buffer.
> - Tested on the PCI SDHCI on my laptop: requesting a new test
>   on i.MX from Benjamin. (Please!)
> ChangeLog v2->v3:
> - Rewrite the commit message a bit
> - Add Benjamin's Tested-by
> - Add Fixes and stable tags
> ChangeLog v1->v2:
> - Skip the remapping and fiddling with the buffer, instead use
>   dma_alloc_coherent() and use a simple, coherent bounce buffer.
> - Couple kernel messages to ->parent of the mmc_host as it relates
>   to the hardware characteristics.
> ---
>  drivers/mmc/host/sdhci.c | 105 +++++++++++++++++++++++++++++++++++++++++++----
>  drivers/mmc/host/sdhci.h |   3 ++
>  2 files changed, 100 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/mmc/host/sdhci.c b/drivers/mmc/host/sdhci.c
> index e9290a3439d5..4e594d5e3185 100644
> --- a/drivers/mmc/host/sdhci.c
> +++ b/drivers/mmc/host/sdhci.c
> @@ -19,6 +19,7 @@
>  #include <linux/io.h>
>  #include <linux/module.h>
>  #include <linux/dma-mapping.h>
> +#include <linux/sizes.h>
>  #include <linux/slab.h>
>  #include <linux/scatterlist.h>
>  #include <linux/swiotlb.h>
> @@ -502,8 +503,22 @@ static int sdhci_pre_dma_transfer(struct sdhci_host *host,
>  	if (data->host_cookie == COOKIE_PRE_MAPPED)
>  		return data->sg_count;
>  
> -	sg_count = dma_map_sg(mmc_dev(host->mmc), data->sg, data->sg_len,
> -			      mmc_get_dma_dir(data));
> +	/* Bounce write requests to the bounce buffer */
> +	if (host->bounce_buffer) {
> +		if (mmc_get_dma_dir(data) == DMA_TO_DEVICE) {
> +			/* Copy the data to the bounce buffer */
> +			sg_copy_to_buffer(data->sg, data->sg_len,
> +					  host->bounce_buffer,
> +					  host->bounce_buffer_size);

Shouldn't this be the data size instead of host->bounce_buffer_size?
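I.e. presumably something like this (data->blksz * data->blocks being
the actual transfer length):

	sg_copy_to_buffer(data->sg, data->sg_len,
			  host->bounce_buffer,
			  data->blksz * data->blocks);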

> +		}
> +		/* Just a dummy value */
> +		sg_count = 1;
> +	} else {
> +		/* Just access the data directly from memory */
> +		sg_count = dma_map_sg(mmc_dev(host->mmc), data->sg,
> +				      data->sg_len,
> +				      mmc_get_dma_dir(data));
> +	}
>  
>  	if (sg_count == 0)
>  		return -ENOSPC;
> @@ -858,8 +873,13 @@ static void sdhci_prepare_data(struct sdhci_host *host, struct mmc_command *cmd)
>  					     SDHCI_ADMA_ADDRESS_HI);
>  		} else {
>  			WARN_ON(sg_cnt != 1);
> -			sdhci_writel(host, sg_dma_address(data->sg),
> -				SDHCI_DMA_ADDRESS);
> +			/* Bounce buffer goes to work */
> +			if (host->bounce_buffer)
> +				sdhci_writel(host, host->bounce_addr,
> +					     SDHCI_DMA_ADDRESS);
> +			else
> +				sdhci_writel(host, sg_dma_address(data->sg),
> +					     SDHCI_DMA_ADDRESS);
>  		}
>  	}
>  
> @@ -2248,7 +2268,12 @@ static void sdhci_pre_req(struct mmc_host *mmc, struct mmc_request *mrq)
>  
>  	mrq->data->host_cookie = COOKIE_UNMAPPED;
>  
> -	if (host->flags & SDHCI_REQ_USE_DMA)
> +	/*
> +	 * No pre-mapping in the pre hook if we're using the bounce buffer,
> +	 * for that we would need two bounce buffers since one buffer is
> +	 * in flight when this is getting called.
> +	 */
> +	if (host->flags & SDHCI_REQ_USE_DMA && !host->bounce_buffer)
>  		sdhci_pre_dma_transfer(host, mrq->data, COOKIE_PRE_MAPPED);
>  }
>  
> @@ -2352,8 +2377,23 @@ static bool sdhci_request_done(struct sdhci_host *host)
>  		struct mmc_data *data = mrq->data;
>  
>  		if (data && data->host_cookie == COOKIE_MAPPED) {
> -			dma_unmap_sg(mmc_dev(host->mmc), data->sg, data->sg_len,
> -				     mmc_get_dma_dir(data));
> +			if (host->bounce_buffer) {
> +				/*
> +				 * On reads, copy the bounced data into the
> +				 * sglist
> +				 */
> +				if (mmc_get_dma_dir(data) == DMA_FROM_DEVICE) {
> +					sg_copy_from_buffer(data->sg,
> +						data->sg_len,
> +						host->bounce_buffer,
> +						host->bounce_buffer_size);

And this should be data->bytes_xfered

> +				}
> +			} else {
> +				/* Unmap the raw data */
> +				dma_unmap_sg(mmc_dev(host->mmc), data->sg,
> +					     data->sg_len,
> +					     mmc_get_dma_dir(data));
> +			}
>  			data->host_cookie = COOKIE_UNMAPPED;
>  		}
>  	}
> @@ -2636,7 +2676,12 @@ static void sdhci_data_irq(struct sdhci_host *host, u32 intmask)
>  		 */
>  		if (intmask & SDHCI_INT_DMA_END) {
>  			u32 dmastart, dmanow;
> -			dmastart = sg_dma_address(host->data->sg);
> +
> +			if (host->bounce_buffer)
> +				dmastart = host->bounce_addr;
> +			else
> +				dmastart = sg_dma_address(host->data->sg);

How about making a function for this e.g. called sdhci_sdma_address()
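Just to sketch what I mean, folding the two branches into one helper:

static dma_addr_t sdhci_sdma_address(struct sdhci_host *host)
{
	/* When bouncing, the hardware always DMAs to/from one buffer */
	if (host->bounce_buffer)
		return host->bounce_addr;

	return sg_dma_address(host->data->sg);
}

The same helper could then also replace the open-coded if/else in
sdhci_prepare_data().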

> +
>  			dmanow = dmastart + host->data->bytes_xfered;
>  			/*
>  			 * Force update to the next DMA block boundary.
> @@ -3713,6 +3758,47 @@ int sdhci_setup_host(struct sdhci_host *host)
>  	 */
>  	mmc->max_blk_count = (host->quirks & SDHCI_QUIRK_NO_MULTIBLOCK) ? 1 : 65535;
>  
> +	if (mmc->max_segs == 1) {

Please make the bounce buffer allocation a separate function and put it at
the start of __sdhci_add_host()
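I.e. something like this (helper name just a suggestion):

	/* At the start of __sdhci_add_host() */
	if (host->mmc->max_segs == 1)
		sdhci_allocate_bounce_buffer(host);

with the allocation code below moved into that helper.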

> +		unsigned int max_blocks;

Perhaps drop the max_blocks local variable, and just set
mmc->max_segs = host->bounce_buffer_size / 512

> +		unsigned int max_seg_size;

It would read better if you called this bounce_size or buffer_size

> +
> +		max_seg_size = SZ_64K;
> +		if (mmc->max_req_size < max_seg_size)
> +			max_seg_size = mmc->max_req_size;

Maybe make the bounce buffer size calculation a separate function and follow
the same logic as the old mmc_queue_calc_bouncesz()
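A rough sketch (helper name hypothetical) that keeps this patch's
capping logic in one place; if I remember right, the old
mmc_queue_calc_bouncesz() additionally clamped against max_seg_size
and max_blk_count * 512:

static unsigned int sdhci_calc_bounce_size(struct mmc_host *mmc)
{
	/* Cap at 64KB, bigger buffers give diminishing returns */
	unsigned int bounce_size = SZ_64K;

	if (bounce_size > mmc->max_req_size)
		bounce_size = mmc->max_req_size;

	return bounce_size;
}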

> +		max_blocks = max_seg_size / 512;
> +		dev_info(mmc->parent,
> +			 "host only supports SDMA, activate bounce buffer\n");

There are other prints below so this seems redundant

> +
> +		/*
> +		 * When we just support one segment, we can get significant
> +		 * speedup by the help of a bounce buffer to group scattered
> +		 * reads/writes together.
> +		 */
> +		host->bounce_buffer = dma_alloc_coherent(mmc->parent,
> +							 max_seg_size,
> +							 &host->bounce_addr,
> +							 GFP_KERNEL);
> +		if (!host->bounce_buffer) {
> +			dev_err(mmc->parent,
> +				"failed to allocate %u bytes for bounce buffer, falling back to single segments\n",
> +				max_seg_size);

Please use pr_err and mmc_hostname(mmc)

> +			/*
> +			 * Exiting with zero here makes sure we proceed with
> +			 * mmc->max_segs == 1.
> +			 */
> +			return 0;
> +		}
> +		host->bounce_buffer_size = max_seg_size;
> +
> +		/* Lie about this since we're bouncing */
> +		mmc->max_segs = max_blocks;
> +		mmc->max_seg_size = max_seg_size;

Don't you need to set max_req_size also?
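I.e. presumably:

	mmc->max_req_size = max_seg_size;

so that the core never builds a request bigger than the bounce buffer.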

> +
> +		dev_info(mmc->parent,
> +			 "bounce buffer: bounce up to %u segments into one, max segment size %u bytes\n",
> +			 max_blocks, max_seg_size);

Please use pr_info and mmc_hostname(mmc)

> +	}
> +
>  	return 0;
>  
>  unreg:
> @@ -3743,6 +3829,9 @@ void sdhci_cleanup_host(struct sdhci_host *host)
>  				  host->align_addr);
>  	host->adma_table = NULL;
>  	host->align_buffer = NULL;

Please make freeing the bounce buffer a separate function

> +	if (host->bounce_buffer)
> +		dma_free_coherent(mmc->parent, host->bounce_buffer_size,
> +				  host->bounce_buffer, host->bounce_addr);
>  }
>  EXPORT_SYMBOL_GPL(sdhci_cleanup_host);
>  
> diff --git a/drivers/mmc/host/sdhci.h b/drivers/mmc/host/sdhci.h
> index 54bc444c317f..865e09618d22 100644
> --- a/drivers/mmc/host/sdhci.h
> +++ b/drivers/mmc/host/sdhci.h
> @@ -440,6 +440,9 @@ struct sdhci_host {
>  
>  	int irq;		/* Device IRQ */
>  	void __iomem *ioaddr;	/* Mapped address */
> +	char *bounce_buffer;	/* For packing SDMA reads/writes */
> +	dma_addr_t bounce_addr;
> +	size_t bounce_buffer_size;

SDMA doesn't support large transfers so unsigned int is better than size_t here.

>  
>  	const struct sdhci_ops *ops;	/* Low level hw interface */
>  
> 


Patch

diff --git a/drivers/mmc/host/sdhci.c b/drivers/mmc/host/sdhci.c
index e9290a3439d5..4e594d5e3185 100644
--- a/drivers/mmc/host/sdhci.c
+++ b/drivers/mmc/host/sdhci.c
@@ -19,6 +19,7 @@ 
 #include <linux/io.h>
 #include <linux/module.h>
 #include <linux/dma-mapping.h>
+#include <linux/sizes.h>
 #include <linux/slab.h>
 #include <linux/scatterlist.h>
 #include <linux/swiotlb.h>
@@ -502,8 +503,22 @@  static int sdhci_pre_dma_transfer(struct sdhci_host *host,
 	if (data->host_cookie == COOKIE_PRE_MAPPED)
 		return data->sg_count;
 
-	sg_count = dma_map_sg(mmc_dev(host->mmc), data->sg, data->sg_len,
-			      mmc_get_dma_dir(data));
+	/* Bounce write requests to the bounce buffer */
+	if (host->bounce_buffer) {
+		if (mmc_get_dma_dir(data) == DMA_TO_DEVICE) {
+			/* Copy the data to the bounce buffer */
+			sg_copy_to_buffer(data->sg, data->sg_len,
+					  host->bounce_buffer,
+					  host->bounce_buffer_size);
+		}
+		/* Just a dummy value */
+		sg_count = 1;
+	} else {
+		/* Just access the data directly from memory */
+		sg_count = dma_map_sg(mmc_dev(host->mmc), data->sg,
+				      data->sg_len,
+				      mmc_get_dma_dir(data));
+	}
 
 	if (sg_count == 0)
 		return -ENOSPC;
@@ -858,8 +873,13 @@  static void sdhci_prepare_data(struct sdhci_host *host, struct mmc_command *cmd)
 					     SDHCI_ADMA_ADDRESS_HI);
 		} else {
 			WARN_ON(sg_cnt != 1);
-			sdhci_writel(host, sg_dma_address(data->sg),
-				SDHCI_DMA_ADDRESS);
+			/* Bounce buffer goes to work */
+			if (host->bounce_buffer)
+				sdhci_writel(host, host->bounce_addr,
+					     SDHCI_DMA_ADDRESS);
+			else
+				sdhci_writel(host, sg_dma_address(data->sg),
+					     SDHCI_DMA_ADDRESS);
 		}
 	}
 
@@ -2248,7 +2268,12 @@  static void sdhci_pre_req(struct mmc_host *mmc, struct mmc_request *mrq)
 
 	mrq->data->host_cookie = COOKIE_UNMAPPED;
 
-	if (host->flags & SDHCI_REQ_USE_DMA)
+	/*
+	 * No pre-mapping in the pre hook if we're using the bounce buffer,
+	 * for that we would need two bounce buffers since one buffer is
+	 * in flight when this is getting called.
+	 */
+	if (host->flags & SDHCI_REQ_USE_DMA && !host->bounce_buffer)
 		sdhci_pre_dma_transfer(host, mrq->data, COOKIE_PRE_MAPPED);
 }
 
@@ -2352,8 +2377,23 @@  static bool sdhci_request_done(struct sdhci_host *host)
 		struct mmc_data *data = mrq->data;
 
 		if (data && data->host_cookie == COOKIE_MAPPED) {
-			dma_unmap_sg(mmc_dev(host->mmc), data->sg, data->sg_len,
-				     mmc_get_dma_dir(data));
+			if (host->bounce_buffer) {
+				/*
+				 * On reads, copy the bounced data into the
+				 * sglist
+				 */
+				if (mmc_get_dma_dir(data) == DMA_FROM_DEVICE) {
+					sg_copy_from_buffer(data->sg,
+						data->sg_len,
+						host->bounce_buffer,
+						host->bounce_buffer_size);
+				}
+			} else {
+				/* Unmap the raw data */
+				dma_unmap_sg(mmc_dev(host->mmc), data->sg,
+					     data->sg_len,
+					     mmc_get_dma_dir(data));
+			}
 			data->host_cookie = COOKIE_UNMAPPED;
 		}
 	}
@@ -2636,7 +2676,12 @@  static void sdhci_data_irq(struct sdhci_host *host, u32 intmask)
 		 */
 		if (intmask & SDHCI_INT_DMA_END) {
 			u32 dmastart, dmanow;
-			dmastart = sg_dma_address(host->data->sg);
+
+			if (host->bounce_buffer)
+				dmastart = host->bounce_addr;
+			else
+				dmastart = sg_dma_address(host->data->sg);
+
 			dmanow = dmastart + host->data->bytes_xfered;
 			/*
 			 * Force update to the next DMA block boundary.
@@ -3713,6 +3758,47 @@  int sdhci_setup_host(struct sdhci_host *host)
 	 */
 	mmc->max_blk_count = (host->quirks & SDHCI_QUIRK_NO_MULTIBLOCK) ? 1 : 65535;
 
+	if (mmc->max_segs == 1) {
+		unsigned int max_blocks;
+		unsigned int max_seg_size;
+
+		max_seg_size = SZ_64K;
+		if (mmc->max_req_size < max_seg_size)
+			max_seg_size = mmc->max_req_size;
+		max_blocks = max_seg_size / 512;
+		dev_info(mmc->parent,
+			 "host only supports SDMA, activate bounce buffer\n");
+
+		/*
+		 * When we just support one segment, we can get significant
+		 * speedup by the help of a bounce buffer to group scattered
+		 * reads/writes together.
+		 */
+		host->bounce_buffer = dma_alloc_coherent(mmc->parent,
+							 max_seg_size,
+							 &host->bounce_addr,
+							 GFP_KERNEL);
+		if (!host->bounce_buffer) {
+			dev_err(mmc->parent,
+				"failed to allocate %u bytes for bounce buffer, falling back to single segments\n",
+				max_seg_size);
+			/*
+			 * Exiting with zero here makes sure we proceed with
+			 * mmc->max_segs == 1.
+			 */
+			return 0;
+		}
+		host->bounce_buffer_size = max_seg_size;
+
+		/* Lie about this since we're bouncing */
+		mmc->max_segs = max_blocks;
+		mmc->max_seg_size = max_seg_size;
+
+		dev_info(mmc->parent,
+			 "bounce buffer: bounce up to %u segments into one, max segment size %u bytes\n",
+			 max_blocks, max_seg_size);
+	}
+
 	return 0;
 
 unreg:
@@ -3743,6 +3829,9 @@  void sdhci_cleanup_host(struct sdhci_host *host)
 				  host->align_addr);
 	host->adma_table = NULL;
 	host->align_buffer = NULL;
+	if (host->bounce_buffer)
+		dma_free_coherent(mmc->parent, host->bounce_buffer_size,
+				  host->bounce_buffer, host->bounce_addr);
 }
 EXPORT_SYMBOL_GPL(sdhci_cleanup_host);
 
diff --git a/drivers/mmc/host/sdhci.h b/drivers/mmc/host/sdhci.h
index 54bc444c317f..865e09618d22 100644
--- a/drivers/mmc/host/sdhci.h
+++ b/drivers/mmc/host/sdhci.h
@@ -440,6 +440,9 @@  struct sdhci_host {
 
 	int irq;		/* Device IRQ */
 	void __iomem *ioaddr;	/* Mapped address */
+	char *bounce_buffer;	/* For packing SDMA reads/writes */
+	dma_addr_t bounce_addr;
+	size_t bounce_buffer_size;
 
 	const struct sdhci_ops *ops;	/* Low level hw interface */