
[v6,00/11] mmc: use nonblock mmc requests to minimize latency

Message ID 20110621075319.GN26089@n2100.arm.linux.org.uk (mailing list archive)
State New, archived

Commit Message

Russell King - ARM Linux June 21, 2011, 7:53 a.m. UTC
On Sun, Jun 19, 2011 at 11:17:26PM +0200, Per Forlin wrote:
> How significant is the cache maintenance overhead?

Per,

Can you measure how much difference this has before and after your
patch set please?  This moves the dsb() out of the individual cache
maintenance functions, such that we will only perform one dsb() per
dma_*_sg call rather than one per SG entry.

Thanks.

 arch/arm/include/asm/dma-mapping.h |   11 +++++++++++
 arch/arm/mm/cache-fa.S             |    6 ------
 arch/arm/mm/cache-v4wb.S           |    2 --
 arch/arm/mm/cache-v6.S             |    6 ------
 arch/arm/mm/cache-v7.S             |    3 ---
 arch/arm/mm/dma-mapping.c          |    8 ++++++++
 6 files changed, 19 insertions(+), 17 deletions(-)
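
A minimal sketch of the idea in plain C, with stand-in helpers
(clean_range() and barrier() below are stubs, not the real dmac_*_range
operations or dsb()):

#include <stddef.h>

struct sg_entry { void *addr; size_t len; };

static void clean_range(void *addr, size_t len) { /* cache op, no barrier */ }
static void barrier(void) { /* stands in for dsb() */ }

/* Before: every cache-*.S range operation drained the write buffer itself. */
static void map_sg_before(struct sg_entry *sg, int nents)
{
	int i;

	for (i = 0; i < nents; i++) {
		clean_range(sg[i].addr, sg[i].len);
		barrier();		/* one barrier per SG entry */
	}
}

/* After: the range operations lose their dsb; the caller issues one at the end. */
static void map_sg_after(struct sg_entry *sg, int nents)
{
	int i;

	for (i = 0; i < nents; i++)
		clean_range(sg[i].addr, sg[i].len);
	barrier();			/* one barrier per dma_map_sg() call */
}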

Comments

Per Forlin June 21, 2011, 8:09 a.m. UTC | #1
On 21 June 2011 09:53, Russell King - ARM Linux <linux@arm.linux.org.uk> wrote:
> On Sun, Jun 19, 2011 at 11:17:26PM +0200, Per Forlin wrote:
>> How significant is the cache maintenance overhead?
>
> Per,
>
> Can you measure how much difference this has before and after your
> patch set please?
Absolutely, I can run the mmc_tests to get the measurements. The cache
effect is greater the faster the flash memory is. Currently I only
have access to an SD card (20 MiB/s). By the end of this week I can run
on eMMC (45 MiB/s) if that is needed.

Thanks for your input,
Per
saeed bishara June 27, 2011, 10:34 a.m. UTC | #2
>
> +static inline void __dma_sync(void)
> +{
> +       dsb();
> +}
> +
>  /*
>  * Return whether the given device DMA address mask can be supported
>  * properly.  For example, if your device can only drive the low 24-bits
> @@ -378,6 +383,7 @@ static inline dma_addr_t dma_map_single(struct device *dev, void *cpu_addr,
>        BUG_ON(!valid_dma_direction(dir));
>
>        addr = __dma_map_single(dev, cpu_addr, size, dir);
> +       __dma_sync();
Russell,
  I'm curious about the correctness of this patch for systems with
outer cache. shouldn't the dsb be issued before the outer cache
maintenance?
saeed
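
The ordering being questioned, sketched with stubbed helpers
(inner_clean(), dsb_stub() and outer_clean() are illustrative stand-ins,
not the real dmac_map_area()/dsb()/outer_clean_range()):

static void inner_clean(void *virt, unsigned long size) {}	/* L1 clean, no barrier */
static void dsb_stub(void) {}					/* stands in for dsb() */
static void outer_clean(unsigned long phys, unsigned long size) {}	/* L2 clean by PA */

static void map_for_device(void *virt, unsigned long phys, unsigned long size)
{
	inner_clean(virt, size);	/* push dirty L1 lines towards L2/memory */
	dsb_stub();			/* ensure the L1 maintenance has completed ... */
	outer_clean(phys, size);	/* ... before the outer cache cleans the same lines */
	/* an outer-cache sync is still needed before the DMA master is started */
}
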
Russell King - ARM Linux June 27, 2011, 11:02 a.m. UTC | #3
On Mon, Jun 27, 2011 at 01:34:48PM +0300, saeed bishara wrote:
> Russell,
>   I'm curious about the correctness of this patch for systems with
> outer cache. shouldn't the dsb be issued before the outer cache
> maintenance?

Maybe we should do two passes over SG lists then - one for the inner and
another for the outer cache?

In effect we could do three passes:

1. Calculate the total size of the SG list to determine whether full
   cache flush is more efficient.
2. Flush inner cache
   Then dsb()
3. Flush outer cache
   Another dsb()
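
A sketch of what those passes could look like for the map-for-device
direction; the threshold and all helpers are illustrative stubs, not
proposed kernel code:

#include <stddef.h>

struct sg_entry { void *addr; unsigned long phys; size_t len; };

static void inner_clean(void *addr, size_t len) {}
static void inner_clean_all(void) {}			/* full D-cache clean */
static void outer_clean(unsigned long phys, size_t len) {}
static void outer_clean_all(void) {}
static void dsb_stub(void) {}

static void map_sg_three_pass(struct sg_entry *sg, int nents, size_t threshold)
{
	size_t total = 0;
	int i;

	/* Pass 1: total size, to decide whether a full flush is cheaper. */
	for (i = 0; i < nents; i++)
		total += sg[i].len;

	if (total >= threshold) {
		inner_clean_all();
		dsb_stub();
		outer_clean_all();
		dsb_stub();
		return;
	}

	/* Pass 2: inner cache for every entry, then one dsb(). */
	for (i = 0; i < nents; i++)
		inner_clean(sg[i].addr, sg[i].len);
	dsb_stub();

	/* Pass 3: outer cache for every entry, then another dsb(). */
	for (i = 0; i < nents; i++)
		outer_clean(sg[i].phys, sg[i].len);
	dsb_stub();
}
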
saeed bishara June 28, 2011, 6:22 a.m. UTC | #4
On Mon, Jun 27, 2011 at 2:02 PM, Russell King - ARM Linux
<linux@arm.linux.org.uk> wrote:
> On Mon, Jun 27, 2011 at 01:34:48PM +0300, saeed bishara wrote:
>> Russell,
>>   I'm curious about the correctness of this patch for systems with
>> outer cache. shouldn't the dsb be issued before the outer cache
>> maintenance?
>
> Maybe we should do two passes over SG lists then - one for the inner and
> another for the outer cache?
>
> In effect we could do three passes:
>
> 1. Calculate the total size of the SG list to determine whether full
>   cache flush is more efficient.
> 2. Flush inner cache
>   Then dsb()
> 3. Flush outer cache
>   Another dsb()
>
Looking at the l2x0 cache, it seems to me that it would be possible to
do asynchronous l2 cache maintenance for range operations, so maybe
that could be used to get some parallelism between inner and outer
cache maintenance. The flow could be:
2. Flush sg buffer from inner cache
3. dsb()
4. Start flushing the outer cache for that buffer
5. Flush next sg buffer from inner cache
6. dsb()
7. goto 4
8. When no more buffers are left, wait for the outer cache operations

saeed
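
A sketch of that pipeline; note that mainline l2x0 has no asynchronous
range operation, so outer_clean_start() and outer_wait() below are
hypothetical:

#include <stddef.h>

struct sg_entry { void *addr; unsigned long phys; size_t len; };

static void inner_clean(void *addr, size_t len) {}		/* L1 clean, no barrier */
static void dsb_stub(void) {}					/* stands in for dsb() */
static void outer_clean_start(unsigned long phys, size_t len) {}	/* hypothetical: kick off L2 clean */
static void outer_wait(void) {}					/* hypothetical: drain L2 ops + sync */

static void map_sg_pipelined(struct sg_entry *sg, int nents)
{
	int i;

	for (i = 0; i < nents; i++) {
		inner_clean(sg[i].addr, sg[i].len);	/* steps 2/5 */
		dsb_stub();				/* steps 3/6 */
		outer_clean_start(sg[i].phys, sg[i].len); /* step 4: overlaps with the next L1 pass */
	}
	outer_wait();					/* step 8: wait for outstanding L2 operations */
}
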
Russell King - ARM Linux July 3, 2011, 2:47 p.m. UTC | #5
On Tue, Jun 28, 2011 at 09:22:20AM +0300, saeed bishara wrote:
> On Mon, Jun 27, 2011 at 2:02 PM, Russell King - ARM Linux
> <linux@arm.linux.org.uk> wrote:
> > On Mon, Jun 27, 2011 at 01:34:48PM +0300, saeed bishara wrote:
> >> Russell,
> >>   I'm curious about the correctness of this patch for systems with
> >> outer cache. shouldn't the dsb be issued before the outer cache
> >> maintenance?
> >
> > Maybe we should do two passes over SG lists then - one for the inner and
> > another for the outer cache?
> >
> > In effect we could do three passes:
> >
> > 1. Calculate the total size of the SG list to determine whether full
> >   cache flush is more efficient.
> > 2. Flush inner cache
> >   Then dsb()
> > 3. Flush outer cache
> >   Another dsb()
> >
> Looking at the l2x0 cache, it seems to me that it would be possible to
> do asynchronous l2 cache maintenance for range operations, so maybe
> that could be used to get some parallelism between inner and outer
> cache maintenance. The flow could be:
> 2. Flush sg buffer from inner cache
> 3. dsb()
> 4. Start flushing the outer cache for that buffer
> 5. Flush next sg buffer from inner cache
> 6. dsb()
> 7. goto 4
> 8. When no more buffers are left, wait for the outer cache operations

I'm not sure how practical that is given the architecture of the L2x0
controllers - where we need to wait for the previous operation to
complete by reading the operation-specific register and then issue a
sync.
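
The pattern being referred to, sketched with stub register accessors
(the offsets and helpers are illustrative, not the cache-l2x0.c code):

static unsigned int l2_read(unsigned long reg) { return 0; }	/* stub for readl_relaxed() */
static void l2_write(unsigned long val, unsigned long reg) {}	/* stub for writel_relaxed() */

#define L2_OP_REG	0x7B0	/* e.g. a clean-by-PA background operation register */
#define L2_SYNC_REG	0x730	/* cache sync register */

static void l2_finish(void)
{
	/* wait for the previous background operation to complete ... */
	while (l2_read(L2_OP_REG) & 1)
		;
	/* ... then issue a cache sync and wait for that as well */
	l2_write(0, L2_SYNC_REG);
	while (l2_read(L2_SYNC_REG) & 1)
		;
}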

Having looked at this again, I think trying to do any optimization of
this code will be fraught, because of the dmabounce stuff getting in
the way.  If only dmabounce didn't exist... if only crap hardware didn't
exist...  dmabounce really needs to die.

Patch

diff --git a/arch/arm/include/asm/dma-mapping.h b/arch/arm/include/asm/dma-mapping.h
index 4fff837..853eba5 100644
--- a/arch/arm/include/asm/dma-mapping.h
+++ b/arch/arm/include/asm/dma-mapping.h
@@ -115,6 +115,11 @@  static inline void __dma_page_dev_to_cpu(struct page *page, unsigned long off,
 		___dma_page_dev_to_cpu(page, off, size, dir);
 }
 
+static inline void __dma_sync(void)
+{
+	dsb();
+}
+
 /*
  * Return whether the given device DMA address mask can be supported
  * properly.  For example, if your device can only drive the low 24-bits
@@ -378,6 +383,7 @@  static inline dma_addr_t dma_map_single(struct device *dev, void *cpu_addr,
 	BUG_ON(!valid_dma_direction(dir));
 
 	addr = __dma_map_single(dev, cpu_addr, size, dir);
+	__dma_sync();
 	debug_dma_map_page(dev, virt_to_page(cpu_addr),
 			(unsigned long)cpu_addr & ~PAGE_MASK, size,
 			dir, addr, true);
@@ -407,6 +413,7 @@  static inline dma_addr_t dma_map_page(struct device *dev, struct page *page,
 	BUG_ON(!valid_dma_direction(dir));
 
 	addr = __dma_map_page(dev, page, offset, size, dir);
+	__dma_sync();
 	debug_dma_map_page(dev, page, offset, size, dir, addr, false);
 
 	return addr;
@@ -431,6 +438,7 @@  static inline void dma_unmap_single(struct device *dev, dma_addr_t handle,
 {
 	debug_dma_unmap_page(dev, handle, size, dir, true);
 	__dma_unmap_single(dev, handle, size, dir);
+	__dma_sync();
 }
 
 /**
@@ -452,6 +460,7 @@  static inline void dma_unmap_page(struct device *dev, dma_addr_t handle,
 {
 	debug_dma_unmap_page(dev, handle, size, dir, false);
 	__dma_unmap_page(dev, handle, size, dir);
+	__dma_sync();
 }
 
 /**
@@ -484,6 +493,7 @@  static inline void dma_sync_single_range_for_cpu(struct device *dev,
 		return;
 
 	__dma_single_dev_to_cpu(dma_to_virt(dev, handle) + offset, size, dir);
+	__dma_sync();
 }
 
 static inline void dma_sync_single_range_for_device(struct device *dev,
@@ -498,6 +508,7 @@  static inline void dma_sync_single_range_for_device(struct device *dev,
 		return;
 
 	__dma_single_cpu_to_dev(dma_to_virt(dev, handle) + offset, size, dir);
+	__dma_sync();
 }
 
 static inline void dma_sync_single_for_cpu(struct device *dev,
diff --git a/arch/arm/mm/cache-fa.S b/arch/arm/mm/cache-fa.S
index 1fa6f71..6eeb734 100644
--- a/arch/arm/mm/cache-fa.S
+++ b/arch/arm/mm/cache-fa.S
@@ -179,8 +179,6 @@  fa_dma_inv_range:
 	add	r0, r0, #CACHE_DLINESIZE
 	cmp	r0, r1
 	blo	1b
-	mov	r0, #0
-	mcr	p15, 0, r0, c7, c10, 4		@ drain write buffer
 	mov	pc, lr
 
 /*
@@ -197,8 +195,6 @@  fa_dma_clean_range:
 	add	r0, r0, #CACHE_DLINESIZE
 	cmp	r0, r1
 	blo	1b
-	mov	r0, #0	
-	mcr	p15, 0, r0, c7, c10, 4		@ drain write buffer
 	mov	pc, lr
 
 /*
@@ -212,8 +208,6 @@  ENTRY(fa_dma_flush_range)
 	add	r0, r0, #CACHE_DLINESIZE
 	cmp	r0, r1
 	blo	1b
-	mov	r0, #0	
-	mcr	p15, 0, r0, c7, c10, 4		@ drain write buffer
 	mov	pc, lr
 
 /*
diff --git a/arch/arm/mm/cache-v4wb.S b/arch/arm/mm/cache-v4wb.S
index f40c696..523c0cb 100644
--- a/arch/arm/mm/cache-v4wb.S
+++ b/arch/arm/mm/cache-v4wb.S
@@ -194,7 +194,6 @@  v4wb_dma_inv_range:
 	add	r0, r0, #CACHE_DLINESIZE
 	cmp	r0, r1
 	blo	1b
-	mcr	p15, 0, r0, c7, c10, 4		@ drain write buffer
 	mov	pc, lr
 
 /*
@@ -211,7 +210,6 @@  v4wb_dma_clean_range:
 	add	r0, r0, #CACHE_DLINESIZE
 	cmp	r0, r1
 	blo	1b
-	mcr	p15, 0, r0, c7, c10, 4		@ drain write buffer
 	mov	pc, lr
 
 /*
diff --git a/arch/arm/mm/cache-v6.S b/arch/arm/mm/cache-v6.S
index 73b4a8b..7a842dd 100644
--- a/arch/arm/mm/cache-v6.S
+++ b/arch/arm/mm/cache-v6.S
@@ -239,8 +239,6 @@  v6_dma_inv_range:
 	strlo	r2, [r0]			@ write for ownership
 #endif
 	blo	1b
-	mov	r0, #0
-	mcr	p15, 0, r0, c7, c10, 4		@ drain write buffer
 	mov	pc, lr
 
 /*
@@ -262,8 +260,6 @@  v6_dma_clean_range:
 	add	r0, r0, #D_CACHE_LINE_SIZE
 	cmp	r0, r1
 	blo	1b
-	mov	r0, #0
-	mcr	p15, 0, r0, c7, c10, 4		@ drain write buffer
 	mov	pc, lr
 
 /*
@@ -290,8 +286,6 @@  ENTRY(v6_dma_flush_range)
 	strlob	r2, [r0]			@ write for ownership
 #endif
 	blo	1b
-	mov	r0, #0
-	mcr	p15, 0, r0, c7, c10, 4		@ drain write buffer
 	mov	pc, lr
 
 /*
diff --git a/arch/arm/mm/cache-v7.S b/arch/arm/mm/cache-v7.S
index d32f02b..18dcef6 100644
--- a/arch/arm/mm/cache-v7.S
+++ b/arch/arm/mm/cache-v7.S
@@ -257,7 +257,6 @@  v7_dma_inv_range:
 	add	r0, r0, r2
 	cmp	r0, r1
 	blo	1b
-	dsb
 	mov	pc, lr
 ENDPROC(v7_dma_inv_range)
 
@@ -275,7 +274,6 @@  v7_dma_clean_range:
 	add	r0, r0, r2
 	cmp	r0, r1
 	blo	1b
-	dsb
 	mov	pc, lr
 ENDPROC(v7_dma_clean_range)
 
@@ -293,7 +291,6 @@  ENTRY(v7_dma_flush_range)
 	add	r0, r0, r2
 	cmp	r0, r1
 	blo	1b
-	dsb
 	mov	pc, lr
 ENDPROC(v7_dma_flush_range)
 
diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
index 82a093c..042b056 100644
--- a/arch/arm/mm/dma-mapping.c
+++ b/arch/arm/mm/dma-mapping.c
@@ -97,6 +97,7 @@  static struct page *__dma_alloc_buffer(struct device *dev, size_t size, gfp_t gf
 	memset(ptr, 0, size);
 	dmac_flush_range(ptr, ptr + size);
 	outer_flush_range(__pa(ptr), __pa(ptr) + size);
+	__dma_sync();
 
 	return page;
 }
@@ -572,6 +573,7 @@  int dma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
 		if (dma_mapping_error(dev, s->dma_address))
 			goto bad_mapping;
 	}
+	__dma_sync();
 	debug_dma_map_sg(dev, sg, nents, nents, dir);
 	return nents;
 
@@ -602,6 +604,8 @@  void dma_unmap_sg(struct device *dev, struct scatterlist *sg, int nents,
 
 	for_each_sg(sg, s, nents, i)
 		__dma_unmap_page(dev, sg_dma_address(s), sg_dma_len(s), dir);
+
+	__dma_sync();
 }
 EXPORT_SYMBOL(dma_unmap_sg);
 
@@ -627,6 +631,8 @@  void dma_sync_sg_for_cpu(struct device *dev, struct scatterlist *sg,
 				      s->length, dir);
 	}
 
+	__dma_sync();
+
 	debug_dma_sync_sg_for_cpu(dev, sg, nents, dir);
 }
 EXPORT_SYMBOL(dma_sync_sg_for_cpu);
@@ -653,6 +659,8 @@  void dma_sync_sg_for_device(struct device *dev, struct scatterlist *sg,
 				      s->length, dir);
 	}
 
+	__dma_sync();
+
 	debug_dma_sync_sg_for_device(dev, sg, nents, dir);
 }
 EXPORT_SYMBOL(dma_sync_sg_for_device);