diff mbox

usb: ehci: make HC see up-to-date qh/qtd descriptor ASAP

Message ID 1314720193-26577-1-git-send-email-ming.lei@canonical.com (mailing list archive)
State New, archived
Headers show

Commit Message

Ming Lei Aug. 30, 2011, 4:03 p.m. UTC
From: Ming Lei <ming.lei@canonical.com>

This patch introduces the helper of ehci_sync_mem to flush
qtd/qh into memory immediately on some ARM, so that HC can
see the up-to-date qtd/qh descriptor asap.

This patch fixs one performance bug on ARM Cortex A9 dual core
platform, which has been reported on quite a few ARM machines
(OMAP4, Tegra 2, snowball...), see details from link of
https://bugs.launchpad.net/bugs/709245.

The patch has been tested ok on OMAP4 panda A1 board, and the
performance of 'dd' over usb mass storage can be increased from
4~5MB/sec to 14~16MB/sec after applying this patch.

Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Russell King <linux@arm.linux.org.uk>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
---
 drivers/usb/host/ehci-q.c |   18 ++++++++++++++++++
 drivers/usb/host/ehci.h   |   17 +++++++++++++++++
 2 files changed, 35 insertions(+), 0 deletions(-)

Comments

Alan Stern Aug. 30, 2011, 4:15 p.m. UTC | #1
On Wed, 31 Aug 2011 ming.lei@canonical.com wrote:

> From: Ming Lei <ming.lei@canonical.com>
> 
> This patch introduces the helper of ehci_sync_mem to flush
> qtd/qh into memory immediately on some ARM, so that HC can
> see the up-to-date qtd/qh descriptor asap.
> 
> This patch fixs one performance bug on ARM Cortex A9 dual core
> platform, which has been reported on quite a few ARM machines
> (OMAP4, Tegra 2, snowball...), see details from link of
> https://bugs.launchpad.net/bugs/709245.
> 
> The patch has been tested ok on OMAP4 panda A1 board, and the
> performance of 'dd' over usb mass storage can be increased from
> 4~5MB/sec to 14~16MB/sec after applying this patch.
> 
> Cc: Alan Stern <stern@rowland.harvard.edu>
> Cc: Russell King <linux@arm.linux.org.uk>
> Signed-off-by: Ming Lei <ming.lei@canonical.com>
> ---
>  drivers/usb/host/ehci-q.c |   18 ++++++++++++++++++
>  drivers/usb/host/ehci.h   |   17 +++++++++++++++++
>  2 files changed, 35 insertions(+), 0 deletions(-)
> 
> diff --git a/drivers/usb/host/ehci-q.c b/drivers/usb/host/ehci-q.c
> index 0917e3a..2719879 100644
> --- a/drivers/usb/host/ehci-q.c
> +++ b/drivers/usb/host/ehci-q.c
> @@ -995,6 +995,12 @@ static void qh_link_async (struct ehci_hcd *ehci, struct ehci_qh *qh)
>  	head->qh_next.qh = qh;
>  	head->hw->hw_next = dma;
>  
> +	/*
> +	 * flush qh descriptor into memory immediately,
> +	 * see comments in qh_append_tds.
> +	 * */

Comments are supposed to look like this:

	/*
	 * Blah blah blah
	 * blah blah blah
	 */

> +	ehci_sync_mem();
> +
>  	qh_get(qh);
>  	qh->xacterrs = 0;
>  	qh->qh_state = QH_STATE_LINKED;
> @@ -1082,6 +1088,18 @@ static struct ehci_qh *qh_append_tds (
>  			wmb ();
>  			dummy->hw_token = token;
>  
> +			/*
> +			 * Writing to dma coherent buffer on ARM may
> +			 * be delayed to reach memory, so HC may not see
> +			 * hw_token of dummy qtd in time, which can cause
> +			 * the qtd transaction to be executed very late,
> +			 * and degrade performance a lot. ehci_sync_mem
> +			 * is added to flush 'token' immediatelly into
> +			 * memory, so that ehci can execute the transaction
> +			 * ASAP.
> +			 * */

Here too.

> +			ehci_sync_mem();
> +
>  			urb->hcpriv = qh_get (qh);
>  		}
>  	}
> diff --git a/drivers/usb/host/ehci.h b/drivers/usb/host/ehci.h
> index cc7d337..313d9d6 100644
> --- a/drivers/usb/host/ehci.h
> +++ b/drivers/usb/host/ehci.h
> @@ -738,6 +738,23 @@ static inline u32 hc32_to_cpup (const struct ehci_hcd *ehci, const __hc32 *x)
>  
>  #endif
>  
> +/*
> + * Writing to dma coherent memory on ARM may be delayed via L2
> + * writing buffer, so introduce the helper which can flush L2 writing
> + * buffer into memory immediately, especially used to flush ehci
> + * descriptor to memory.
> + * */

And here.

> +#ifdef	CONFIG_ARM_DMA_MEM_BUFFERABLE
> +static inline void ehci_sync_mem()
> +{
> +	mb();
> +}
> +#else
> +static inline void ehci_sync_mem()
> +{
> +}
> +#endif
> +

Except for the formatting of the comments, this is fine.  When you fix 
up the comments, you can add:

Signed-off-by: Alan Stern <stern@rowland.harvard.edu>

Alan Stern
Mark Salter Aug. 30, 2011, 4:38 p.m. UTC | #2
On Wed, 2011-08-31 at 00:03 +0800, ming.lei@canonical.com wrote:
> +/*
> + * Writing to dma coherent memory on ARM may be delayed via L2
> + * writing buffer, so introduce the helper which can flush L2 writing
> + * buffer into memory immediately, especially used to flush ehci
> + * descriptor to memory.
> + * */
> +#ifdef CONFIG_ARM_DMA_MEM_BUFFERABLE
> +static inline void ehci_sync_mem()
> +{
> +       mb();
> +}
> +#else
> +static inline void ehci_sync_mem()
> +{
> +}
> +#endif
> +

I'm wondering if this doesn't really belong in the DMA API for any
future architectures that can't avoid prolonged write buffering to DMA
coherent memory. IIUC, ARM mitigates this for most drivers by including
an implicit write buffer flush in the mmio write routines. This takes
care of the drivers which write to a mmio device register after writing
something to shared DMA memory. IIUC, this doesn't help ehci because the
host controller is polling to see what the cpu writes to the shared
memory. Other hardware which polls shared memory like that will likely
have the same problem and could use buffer drain helpers as well.

--Mark
Alan Stern Aug. 30, 2011, 5:15 p.m. UTC | #3
On Tue, 30 Aug 2011, Mark Salter wrote:

> On Wed, 2011-08-31 at 00:03 +0800, ming.lei@canonical.com wrote:
> > +/*
> > + * Writing to dma coherent memory on ARM may be delayed via L2
> > + * writing buffer, so introduce the helper which can flush L2 writing
> > + * buffer into memory immediately, especially used to flush ehci
> > + * descriptor to memory.
> > + * */
> > +#ifdef CONFIG_ARM_DMA_MEM_BUFFERABLE
> > +static inline void ehci_sync_mem()
> > +{
> > +       mb();
> > +}
> > +#else
> > +static inline void ehci_sync_mem()
> > +{
> > +}
> > +#endif
> > +
> 
> I'm wondering if this doesn't really belong in the DMA API for any
> future architectures that can't avoid prolonged write buffering to DMA
> coherent memory. IIUC, ARM mitigates this for most drivers by including
> an implicit write buffer flush in the mmio write routines. This takes
> care of the drivers which write to a mmio device register after writing
> something to shared DMA memory. IIUC, this doesn't help ehci because the
> host controller is polling to see what the cpu writes to the shared
> memory. Other hardware which polls shared memory like that will likely
> have the same problem and could use buffer drain helpers as well.

This would be a good thing to define centrally.  Would you like to 
post an RFC on LKML?

Do you know of any other examples of hardware that polls shared DMA 
memory?

Alan Stern
Will Deacon Aug. 30, 2011, 5:26 p.m. UTC | #4
On Tue, Aug 30, 2011 at 05:38:30PM +0100, Mark Salter wrote:
> On Wed, 2011-08-31 at 00:03 +0800, ming.lei@canonical.com wrote:
> > +/*
> > + * Writing to dma coherent memory on ARM may be delayed via L2
> > + * writing buffer, so introduce the helper which can flush L2 writing
> > + * buffer into memory immediately, especially used to flush ehci
> > + * descriptor to memory.
> > + * */
> > +#ifdef CONFIG_ARM_DMA_MEM_BUFFERABLE
> > +static inline void ehci_sync_mem()
> > +{
> > +       mb();
> > +}
> > +#else
> > +static inline void ehci_sync_mem()
> > +{
> > +}
> > +#endif
> > +
> 
> I'm wondering if this doesn't really belong in the DMA API for any
> future architectures that can't avoid prolonged write buffering to DMA
> coherent memory. IIUC, ARM mitigates this for most drivers by including
> an implicit write buffer flush in the mmio write routines. This takes
> care of the drivers which write to a mmio device register after writing
> something to shared DMA memory. IIUC, this doesn't help ehci because the
> host controller is polling to see what the cpu writes to the shared
> memory. Other hardware which polls shared memory like that will likely
> have the same problem and could use buffer drain helpers as well.

Right. In this case the buffering is happening at L2 which becomes
noticeable when measuring performance. We also buffer stores at the
CPU (regardless of memory type) but because these tend to become visible
fairly quickly, there isn't a comparable performance problem.

Given that I would expect other architectures to buffer writes at the CPU,
would it not be worth having an API for flushing to L3 (devices)? It seems
like this would be a useful addition to the coherent DMA API on platforms
that handle coherency with non-cacheable memory attributes.

Will
Greg KH Aug. 30, 2011, 5:48 p.m. UTC | #5
On Tue, Aug 30, 2011 at 06:26:42PM +0100, Will Deacon wrote:
> On Tue, Aug 30, 2011 at 05:38:30PM +0100, Mark Salter wrote:
> > On Wed, 2011-08-31 at 00:03 +0800, ming.lei@canonical.com wrote:
> > > +/*
> > > + * Writing to dma coherent memory on ARM may be delayed via L2
> > > + * writing buffer, so introduce the helper which can flush L2 writing
> > > + * buffer into memory immediately, especially used to flush ehci
> > > + * descriptor to memory.
> > > + * */
> > > +#ifdef CONFIG_ARM_DMA_MEM_BUFFERABLE
> > > +static inline void ehci_sync_mem()
> > > +{
> > > +       mb();
> > > +}
> > > +#else
> > > +static inline void ehci_sync_mem()
> > > +{
> > > +}
> > > +#endif
> > > +
> > 
> > I'm wondering if this doesn't really belong in the DMA API for any
> > future architectures that can't avoid prolonged write buffering to DMA
> > coherent memory. IIUC, ARM mitigates this for most drivers by including
> > an implicit write buffer flush in the mmio write routines. This takes
> > care of the drivers which write to a mmio device register after writing
> > something to shared DMA memory. IIUC, this doesn't help ehci because the
> > host controller is polling to see what the cpu writes to the shared
> > memory. Other hardware which polls shared memory like that will likely
> > have the same problem and could use buffer drain helpers as well.
> 
> Right. In this case the buffering is happening at L2 which becomes
> noticeable when measuring performance. We also buffer stores at the
> CPU (regardless of memory type) but because these tend to become visible
> fairly quickly, there isn't a comparable performance problem.
> 
> Given that I would expect other architectures to buffer writes at the CPU,
> would it not be worth having an API for flushing to L3 (devices)? It seems
> like this would be a useful addition to the coherent DMA API on platforms
> that handle coherency with non-cacheable memory attributes.

I agree, this seems to be a "new" type of barrier that is needed, as the
code comment above seems to go against what the kernel memory barrier
documentation says about what a memory barrier really does on the
hardware.

greg k-h
Will Deacon Aug. 30, 2011, 5:54 p.m. UTC | #6
On Tue, Aug 30, 2011 at 06:48:59PM +0100, Greg KH wrote:
> On Tue, Aug 30, 2011 at 06:26:42PM +0100, Will Deacon wrote:
> > On Tue, Aug 30, 2011 at 05:38:30PM +0100, Mark Salter wrote:
> > > On Wed, 2011-08-31 at 00:03 +0800, ming.lei@canonical.com wrote:
> > > > +/*
> > > > + * Writing to dma coherent memory on ARM may be delayed via L2
> > > > + * writing buffer, so introduce the helper which can flush L2 writing
> > > > + * buffer into memory immediately, especially used to flush ehci
> > > > + * descriptor to memory.
> > > > + * */
> > > > +#ifdef CONFIG_ARM_DMA_MEM_BUFFERABLE
> > > > +static inline void ehci_sync_mem()
> > > > +{
> > > > +       mb();
> > > > +}
> > > > +#else
> > > > +static inline void ehci_sync_mem()
> > > > +{
> > > > +}
> > > > +#endif
> > > > +
> > > 
> > > I'm wondering if this doesn't really belong in the DMA API for any
> > > future architectures that can't avoid prolonged write buffering to DMA
> > > coherent memory. IIUC, ARM mitigates this for most drivers by including
> > > an implicit write buffer flush in the mmio write routines. This takes
> > > care of the drivers which write to a mmio device register after writing
> > > something to shared DMA memory. IIUC, this doesn't help ehci because the
> > > host controller is polling to see what the cpu writes to the shared
> > > memory. Other hardware which polls shared memory like that will likely
> > > have the same problem and could use buffer drain helpers as well.
> > 
> > Right. In this case the buffering is happening at L2 which becomes
> > noticeable when measuring performance. We also buffer stores at the
> > CPU (regardless of memory type) but because these tend to become visible
> > fairly quickly, there isn't a comparable performance problem.
> > 
> > Given that I would expect other architectures to buffer writes at the CPU,
> > would it not be worth having an API for flushing to L3 (devices)? It seems
> > like this would be a useful addition to the coherent DMA API on platforms
> > that handle coherency with non-cacheable memory attributes.
> 
> I agree, this seems to be a "new" type of barrier that is needed, as the
> code comment above seems to go against what the kernel memory barrier
> documentation says about what a memory barrier really does on the
> hardware.

Although this doesn't have anything to do with ordering; it's all to do with
immediacy so I think we should try to avoiding using the term `barrier'. If
this can be made part of the coherent DMA API, that might be the best place
for it (I can't think of any other areas this is needed given that the
streaming DMA API and I/O accessors already deal with it).

Will
Mark Salter Aug. 30, 2011, 6:45 p.m. UTC | #7
On Tue, 2011-08-30 at 13:15 -0400, Alan Stern wrote:
> On Tue, 30 Aug 2011, Mark Salter wrote:
> 
> > On Wed, 2011-08-31 at 00:03 +0800, ming.lei@canonical.com wrote:
> > > +/*
> > > + * Writing to dma coherent memory on ARM may be delayed via L2
> > > + * writing buffer, so introduce the helper which can flush L2 writing
> > > + * buffer into memory immediately, especially used to flush ehci
> > > + * descriptor to memory.
> > > + * */
> > > +#ifdef CONFIG_ARM_DMA_MEM_BUFFERABLE
> > > +static inline void ehci_sync_mem()
> > > +{
> > > +       mb();
> > > +}
> > > +#else
> > > +static inline void ehci_sync_mem()
> > > +{
> > > +}
> > > +#endif
> > > +
> > 
> > I'm wondering if this doesn't really belong in the DMA API for any
> > future architectures that can't avoid prolonged write buffering to DMA
> > coherent memory. IIUC, ARM mitigates this for most drivers by including
> > an implicit write buffer flush in the mmio write routines. This takes
> > care of the drivers which write to a mmio device register after writing
> > something to shared DMA memory. IIUC, this doesn't help ehci because the
> > host controller is polling to see what the cpu writes to the shared
> > memory. Other hardware which polls shared memory like that will likely
> > have the same problem and could use buffer drain helpers as well.
> 
> This would be a good thing to define centrally.  Would you like to 
> post an RFC on LKML?

Yes, I can take a stab at that.

> 
> Do you know of any other examples of hardware that polls shared DMA 
> memory?

Not offhand nor after a quick search. I don't think it is a common
way of doing things.

--Mark
Chen Peter-B29397 Aug. 31, 2011, 12:23 a.m. UTC | #8
On Aug 31, 2011, at 1:54 AM, Will Deacon wrote:

> On Tue, Aug 30, 2011 at 06:48:59PM +0100, Greg KH wrote:
>> On Tue, Aug 30, 2011 at 06:26:42PM +0100, Will Deacon wrote:
>>> On Tue, Aug 30, 2011 at 05:38:30PM +0100, Mark Salter wrote:
>>>> On Wed, 2011-08-31 at 00:03 +0800, ming.lei@canonical.com wrote:
>>>>> +/*
>>>>> + * Writing to dma coherent memory on ARM may be delayed via L2
>>>>> + * writing buffer, so introduce the helper which can flush L2 writing
>>>>> + * buffer into memory immediately, especially used to flush ehci
>>>>> + * descriptor to memory.
>>>>> + * */
>>>>> +#ifdef CONFIG_ARM_DMA_MEM_BUFFERABLE
>>>>> +static inline void ehci_sync_mem()
>>>>> +{
>>>>> +       mb();
>>>>> +}
>>>>> +#else
>>>>> +static inline void ehci_sync_mem()
>>>>> +{
>>>>> +}
>>>>> +#endif
>>>>> +
>>>> 
>>>> I'm wondering if this doesn't really belong in the DMA API for any
>>>> future architectures that can't avoid prolonged write buffering to DMA
>>>> coherent memory. IIUC, ARM mitigates this for most drivers by including
>>>> an implicit write buffer flush in the mmio write routines. This takes
>>>> care of the drivers which write to a mmio device register after writing
>>>> something to shared DMA memory. IIUC, this doesn't help ehci because the
>>>> host controller is polling to see what the cpu writes to the shared
>>>> memory. Other hardware which polls shared memory like that will likely
>>>> have the same problem and could use buffer drain helpers as well.
>>> 
>>> Right. In this case the buffering is happening at L2 which becomes
>>> noticeable when measuring performance. We also buffer stores at the
>>> CPU (regardless of memory type) but because these tend to become visible
>>> fairly quickly, there isn't a comparable performance problem.
>>> 
>>> Given that I would expect other architectures to buffer writes at the CPU,
>>> would it not be worth having an API for flushing to L3 (devices)? It seems
>>> like this would be a useful addition to the coherent DMA API on platforms
>>> that handle coherency with non-cacheable memory attributes.
>> 
>> I agree, this seems to be a "new" type of barrier that is needed, as the
>> code comment above seems to go against what the kernel memory barrier
>> documentation says about what a memory barrier really does on the
>> hardware.
> 
> Although this doesn't have anything to do with ordering; it's all to do with
> immediacy so I think we should try to avoiding using the term `barrier'. If
> this can be made part of the coherent DMA API, that might be the best place
> for it (I can't think of any other areas this is needed given that the
> streaming DMA API and I/O accessors already deal with it).
I am agree with you. I met the same issue at both usb device driver (adding next dTD pointer which
the current one is handling) and usb host driver (performance issue this thread have discussed) at
Freescale i.MX6Q platform (4 Cores, ARM SMP).
So, now I need to add two barriers at two different drivers.

One question: why this write buffer issue did not happen at UP ARM V7 platform, whose dma buffer
also uncache, but bufferable?

> 
> Will
> --
> To unsubscribe from this list: send the line "unsubscribe linux-usb" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

Best Regard,
Peter Chen
Ming Lei Aug. 31, 2011, 12:56 a.m. UTC | #9
On Wed, Aug 31, 2011 at 1:54 AM, Will Deacon <will.deacon@arm.com> wrote:

> Although this doesn't have anything to do with ordering; it's all to do with
> immediacy so I think we should try to avoiding using the term `barrier'. If
> this can be made part of the coherent DMA API, that might be the best place
> for it (I can't think of any other areas this is needed given that the
> streaming DMA API and I/O accessors already deal with it).

Agree too.


thanks,
--
Ming Lei
Will Deacon Aug. 31, 2011, 8:49 a.m. UTC | #10
On Wed, Aug 31, 2011 at 01:23:47AM +0100, Chen Peter-B29397 wrote:
> One question: why this write buffer issue did not happen at UP ARM V7 platform, whose dma buffer
> also uncache, but bufferable?

Which CPU was on this platform?

Will
Chen Peter-B29397 Aug. 31, 2011, 12:33 p.m. UTC | #11
Best Regard,
Peter Chen
 



On Aug 31, 2011, at 4:49 PM, Will Deacon wrote:

> On Wed, Aug 31, 2011 at 01:23:47AM +0100, Chen Peter-B29397 wrote:
>> One question: why this write buffer issue did not happen at UP ARM V7 platform, whose dma buffer
>> also uncache, but bufferable?
> 
> Which CPU was on this platform?

Cortex A8 UP system (freescale i.MX5x platform)
> 
> Will
>
Mark Salter Aug. 31, 2011, 1:43 p.m. UTC | #12
On Wed, 2011-08-31 at 09:49 +0100, Will Deacon wrote:
> On Wed, Aug 31, 2011 at 01:23:47AM +0100, Chen Peter-B29397 wrote:
> > One question: why this write buffer issue did not happen at UP ARM V7 platform, whose dma buffer
> > also uncache, but bufferable?
> 
> Which CPU was on this platform?

Using a 3.1.0-rc4+ kernel on a Pandaboard, and running 'hdparm -t' on a
usb disk drive, I see ~5.8MB/s read speed. Same kernel, but passing
nosmp on the commandline, I see 20.3MB/s.

Can someone explain why nosmp would make such a difference?

--Mark
Will Deacon Aug. 31, 2011, 3:21 p.m. UTC | #13
On Wed, Aug 31, 2011 at 02:43:33PM +0100, Mark Salter wrote:
> On Wed, 2011-08-31 at 09:49 +0100, Will Deacon wrote:
> > On Wed, Aug 31, 2011 at 01:23:47AM +0100, Chen Peter-B29397 wrote:
> > > One question: why this write buffer issue did not happen at UP ARM V7 platform, whose dma buffer
> > > also uncache, but bufferable?
> > 
> > Which CPU was on this platform?
> 
> Using a 3.1.0-rc4+ kernel on a Pandaboard, and running 'hdparm -t' on a
> usb disk drive, I see ~5.8MB/s read speed. Same kernel, but passing
> nosmp on the commandline, I see 20.3MB/s.
> 
> Can someone explain why nosmp would make such a difference?

Oh gawd, that's horrible. I have a feeling it's probably a separate issue
though, caused by:

omap_modify_auxcoreboot0(0x200, 0xfffffdff);

in boot_secondary for OMAP. Unfortunately I have no idea what that line is
doing because it ends up talking to the secure monitor.

Will
Mark Salter Aug. 31, 2011, 3:27 p.m. UTC | #14
On Wed, 2011-08-31 at 16:21 +0100, Will Deacon wrote:
> On Wed, Aug 31, 2011 at 02:43:33PM +0100, Mark Salter wrote:
> > On Wed, 2011-08-31 at 09:49 +0100, Will Deacon wrote:
> > > On Wed, Aug 31, 2011 at 01:23:47AM +0100, Chen Peter-B29397 wrote:
> > > > One question: why this write buffer issue did not happen at UP ARM V7 platform, whose dma buffer
> > > > also uncache, but bufferable?
> > > 
> > > Which CPU was on this platform?
> > 
> > Using a 3.1.0-rc4+ kernel on a Pandaboard, and running 'hdparm -t' on a
> > usb disk drive, I see ~5.8MB/s read speed. Same kernel, but passing
> > nosmp on the commandline, I see 20.3MB/s.
> > 
> > Can someone explain why nosmp would make such a difference?
> 
> Oh gawd, that's horrible. I have a feeling it's probably a separate issue
> though, caused by:
> 
> omap_modify_auxcoreboot0(0x200, 0xfffffdff);
> 
> in boot_secondary for OMAP. Unfortunately I have no idea what that line is
> doing because it ends up talking to the secure monitor.

Okay, I may poke around a bit with that to see I can get a better
understanding.

With the patched ehci-q.c, I see no noticeable difference between smp
and nosmp. Both get me around 23.5MB/s with my setup.

--Mark
Marc Zyngier Aug. 31, 2011, 4:12 p.m. UTC | #15
On 31/08/11 16:27, Mark Salter wrote:
> On Wed, 2011-08-31 at 16:21 +0100, Will Deacon wrote:
>> On Wed, Aug 31, 2011 at 02:43:33PM +0100, Mark Salter wrote:
>>> On Wed, 2011-08-31 at 09:49 +0100, Will Deacon wrote:
>>>> On Wed, Aug 31, 2011 at 01:23:47AM +0100, Chen Peter-B29397 wrote:
>>>>> One question: why this write buffer issue did not happen at UP ARM V7 platform, whose dma buffer
>>>>> also uncache, but bufferable?
>>>>
>>>> Which CPU was on this platform?
>>>
>>> Using a 3.1.0-rc4+ kernel on a Pandaboard, and running 'hdparm -t' on a
>>> usb disk drive, I see ~5.8MB/s read speed. Same kernel, but passing
>>> nosmp on the commandline, I see 20.3MB/s.
>>>
>>> Can someone explain why nosmp would make such a difference?
>>
>> Oh gawd, that's horrible. I have a feeling it's probably a separate issue
>> though, caused by:
>>
>> omap_modify_auxcoreboot0(0x200, 0xfffffdff);
>>
>> in boot_secondary for OMAP. Unfortunately I have no idea what that line is
>> doing because it ends up talking to the secure monitor.
> 
> Okay, I may poke around a bit with that to see I can get a better
> understanding.
> 
> With the patched ehci-q.c, I see no noticeable difference between smp
> and nosmp. Both get me around 23.5MB/s with my setup.

Oddly enough, this patch doesn't do anything on my Tegra setup. In both
cases, I get around 17MB/s from a crap SD card plugged in a USB reader.

This leads me to suspect that this issue is very much OMAP4 specific.
Can anyone verify this theory on other some A9 platforms?

Cheers,

	M.
Marc Dietrich Aug. 31, 2011, 4:55 p.m. UTC | #16
Am Mittwoch 31 August 2011, 18:12:48 schrieb Marc Zyngier:
> [...]
> Oddly enough, this patch doesn't do anything on my Tegra setup. In both
> cases, I get around 17MB/s from a crap SD card plugged in a USB reader.
> 
> This leads me to suspect that this issue is very much OMAP4 specific.
> Can anyone verify this theory on other some A9 platforms?

That's odd. On my Tegra2 (on ac100) it boosts the transfer rate from 7 to 
17 MB/s. 

Marc
Nicolas Pitre Aug. 31, 2011, 5:46 p.m. UTC | #17
On Wed, 31 Aug 2011, Will Deacon wrote:

> On Wed, Aug 31, 2011 at 02:43:33PM +0100, Mark Salter wrote:
> > On Wed, 2011-08-31 at 09:49 +0100, Will Deacon wrote:
> > > On Wed, Aug 31, 2011 at 01:23:47AM +0100, Chen Peter-B29397 wrote:
> > > > One question: why this write buffer issue did not happen at UP ARM V7 platform, whose dma buffer
> > > > also uncache, but bufferable?
> > > 
> > > Which CPU was on this platform?
> > 
> > Using a 3.1.0-rc4+ kernel on a Pandaboard, and running 'hdparm -t' on a
> > usb disk drive, I see ~5.8MB/s read speed. Same kernel, but passing
> > nosmp on the commandline, I see 20.3MB/s.
> > 
> > Can someone explain why nosmp would make such a difference?
> 
> Oh gawd, that's horrible. I have a feeling it's probably a separate issue
> though, caused by:
> 
> omap_modify_auxcoreboot0(0x200, 0xfffffdff);
> 
> in boot_secondary for OMAP. Unfortunately I have no idea what that line is
> doing because it ends up talking to the secure monitor.

Well, this issue is apparently affecting other ARMv9 implementations 
too.  In which case this code in arch/arm/mm/mmu.c could be responsible:

                if (is_smp()) {
                        /*
                         * Mark memory with the "shared" attribute
                         * for SMP systems
                         */
                        user_pgprot |= L_PTE_SHARED;
                        kern_pgprot |= L_PTE_SHARED;
                        vecs_pgprot |= L_PTE_SHARED;
                        mem_types[MT_DEVICE_WC].prot_sect |= PMD_SECT_S;
                        mem_types[MT_DEVICE_WC].prot_pte |= L_PTE_SHARED;
                        mem_types[MT_DEVICE_CACHED].prot_sect |= PMD_SECT_S;
                        mem_types[MT_DEVICE_CACHED].prot_pte |= L_PTE_SHARED;
                        mem_types[MT_MEMORY].prot_sect |= PMD_SECT_S;
                        mem_types[MT_MEMORY].prot_pte |= L_PTE_SHARED;
                        mem_types[MT_MEMORY_NONCACHED].prot_sect |= PMD_SECT_S;
                        mem_types[MT_MEMORY_NONCACHED].prot_pte |= L_PTE_SHARED;
                }

However I don't see the nosmp kernel argument having any effect on the 
result from is_smp().


Nicolas
Will Deacon Aug. 31, 2011, 5:51 p.m. UTC | #18
On Wed, Aug 31, 2011 at 06:46:50PM +0100, Nicolas Pitre wrote:
> On Wed, 31 Aug 2011, Will Deacon wrote:
> 
> > On Wed, Aug 31, 2011 at 02:43:33PM +0100, Mark Salter wrote:
> > > On Wed, 2011-08-31 at 09:49 +0100, Will Deacon wrote:
> > > > On Wed, Aug 31, 2011 at 01:23:47AM +0100, Chen Peter-B29397 wrote:
> > > > > One question: why this write buffer issue did not happen at UP ARM V7 platform, whose dma buffer
> > > > > also uncache, but bufferable?
> > > > 
> > > > Which CPU was on this platform?
> > > 
> > > Using a 3.1.0-rc4+ kernel on a Pandaboard, and running 'hdparm -t' on a
> > > usb disk drive, I see ~5.8MB/s read speed. Same kernel, but passing
> > > nosmp on the commandline, I see 20.3MB/s.
> > > 
> > > Can someone explain why nosmp would make such a difference?
> > 
> > Oh gawd, that's horrible. I have a feeling it's probably a separate issue
> > though, caused by:
> > 
> > omap_modify_auxcoreboot0(0x200, 0xfffffdff);
> > 
> > in boot_secondary for OMAP. Unfortunately I have no idea what that line is
> > doing because it ends up talking to the secure monitor.
> 
> Well, this issue is apparently affecting other ARMv9 implementations 
> too.  In which case this code in arch/arm/mm/mmu.c could be responsible:
> 
>                 if (is_smp()) {
>                         /*
>                          * Mark memory with the "shared" attribute
>                          * for SMP systems
>                          */
>                         user_pgprot |= L_PTE_SHARED;
>                         kern_pgprot |= L_PTE_SHARED;
>                         vecs_pgprot |= L_PTE_SHARED;
>                         mem_types[MT_DEVICE_WC].prot_sect |= PMD_SECT_S;
>                         mem_types[MT_DEVICE_WC].prot_pte |= L_PTE_SHARED;
>                         mem_types[MT_DEVICE_CACHED].prot_sect |= PMD_SECT_S;
>                         mem_types[MT_DEVICE_CACHED].prot_pte |= L_PTE_SHARED;
>                         mem_types[MT_MEMORY].prot_sect |= PMD_SECT_S;
>                         mem_types[MT_MEMORY].prot_pte |= L_PTE_SHARED;
>                         mem_types[MT_MEMORY_NONCACHED].prot_sect |= PMD_SECT_S;
>                         mem_types[MT_MEMORY_NONCACHED].prot_pte |= L_PTE_SHARED;
>                 }
> 
> However I don't see the nosmp kernel argument having any effect on the 
> result from is_smp().

Yes, the first thing that sprung to mind was the shared attribute, but like
you say, that doesn't seem to be affected by the nosmp command line
argument.

Another thing that Marc and I tried on OMAP4 was not bringing up the secondary
CPU during boot (by commenting out most of smp_init). In this case, I/O
performance was good until we tried to online the secondary CPU. The online
failed but after that the I/O performance was certainly degraded.

Will
Rob Herring Aug. 31, 2011, 6:19 p.m. UTC | #19
On 08/31/2011 12:51 PM, Will Deacon wrote:
> On Wed, Aug 31, 2011 at 06:46:50PM +0100, Nicolas Pitre wrote:
>> On Wed, 31 Aug 2011, Will Deacon wrote:
>>
>>> On Wed, Aug 31, 2011 at 02:43:33PM +0100, Mark Salter wrote:
>>>> On Wed, 2011-08-31 at 09:49 +0100, Will Deacon wrote:
>>>>> On Wed, Aug 31, 2011 at 01:23:47AM +0100, Chen Peter-B29397 wrote:
>>>>>> One question: why this write buffer issue did not happen at UP ARM V7 platform, whose dma buffer
>>>>>> also uncache, but bufferable?
>>>>>
>>>>> Which CPU was on this platform?
>>>>
>>>> Using a 3.1.0-rc4+ kernel on a Pandaboard, and running 'hdparm -t' on a
>>>> usb disk drive, I see ~5.8MB/s read speed. Same kernel, but passing
>>>> nosmp on the commandline, I see 20.3MB/s.
>>>>
>>>> Can someone explain why nosmp would make such a difference?
>>>
>>> Oh gawd, that's horrible. I have a feeling it's probably a separate issue
>>> though, caused by:
>>>
>>> omap_modify_auxcoreboot0(0x200, 0xfffffdff);
>>>
>>> in boot_secondary for OMAP. Unfortunately I have no idea what that line is
>>> doing because it ends up talking to the secure monitor.
>>
>> Well, this issue is apparently affecting other ARMv9 implementations 
>> too.  In which case this code in arch/arm/mm/mmu.c could be responsible:
>>
>>                 if (is_smp()) {
>>                         /*
>>                          * Mark memory with the "shared" attribute
>>                          * for SMP systems
>>                          */
>>                         user_pgprot |= L_PTE_SHARED;
>>                         kern_pgprot |= L_PTE_SHARED;
>>                         vecs_pgprot |= L_PTE_SHARED;
>>                         mem_types[MT_DEVICE_WC].prot_sect |= PMD_SECT_S;
>>                         mem_types[MT_DEVICE_WC].prot_pte |= L_PTE_SHARED;
>>                         mem_types[MT_DEVICE_CACHED].prot_sect |= PMD_SECT_S;
>>                         mem_types[MT_DEVICE_CACHED].prot_pte |= L_PTE_SHARED;
>>                         mem_types[MT_MEMORY].prot_sect |= PMD_SECT_S;
>>                         mem_types[MT_MEMORY].prot_pte |= L_PTE_SHARED;
>>                         mem_types[MT_MEMORY_NONCACHED].prot_sect |= PMD_SECT_S;
>>                         mem_types[MT_MEMORY_NONCACHED].prot_pte |= L_PTE_SHARED;
>>                 }
>>
>> However I don't see the nosmp kernel argument having any effect on the 
>> result from is_smp().
> 
> Yes, the first thing that sprung to mind was the shared attribute, but like
> you say, that doesn't seem to be affected by the nosmp command line
> argument.
> 
> Another thing that Marc and I tried on OMAP4 was not bringing up the secondary
> CPU during boot (by commenting out most of smp_init). In this case, I/O
> performance was good until we tried to online the secondary CPU. The online
> failed but after that the I/O performance was certainly degraded.
> 

Was the SCU enabled at that point? One diff between nosmp boot and
offlining the 2nd core would be that the SCU remains enabled in the
latter case. I think the SCU does not get enabled for nosmp.

Do we really know which write buffer the data is sitting? Some
experiments to only flush the L1 write buffer would be interesting.
Perhaps something executed on the 2nd core has a mb which doesn't help
for SMP because the other core's L1 write buffer is not flushed, but it
helps for nosmp because everything runs on 1 core and any occurrence of
a mb will flush all data out. I wouldn't expect the behavior to be so
consistent though. Could it be something is not visible to the other
core rather than not visible to the EHCI controller?

Rob
Mark Salter Aug. 31, 2011, 6:35 p.m. UTC | #20
On Wed, 2011-08-31 at 13:19 -0500, Rob Herring wrote:
> On 08/31/2011 12:51 PM, Will Deacon wrote:
> > On Wed, Aug 31, 2011 at 06:46:50PM +0100, Nicolas Pitre wrote:
> >> On Wed, 31 Aug 2011, Will Deacon wrote:
> >>
> >>> On Wed, Aug 31, 2011 at 02:43:33PM +0100, Mark Salter wrote:
> >>>> On Wed, 2011-08-31 at 09:49 +0100, Will Deacon wrote:
> >>>>> On Wed, Aug 31, 2011 at 01:23:47AM +0100, Chen Peter-B29397 wrote:
> >>>>>> One question: why this write buffer issue did not happen at UP ARM V7 platform, whose dma buffer
> >>>>>> also uncache, but bufferable?
> >>>>>
> >>>>> Which CPU was on this platform?
> >>>>
> >>>> Using a 3.1.0-rc4+ kernel on a Pandaboard, and running 'hdparm -t' on a
> >>>> usb disk drive, I see ~5.8MB/s read speed. Same kernel, but passing
> >>>> nosmp on the commandline, I see 20.3MB/s.
> >>>>
> >>>> Can someone explain why nosmp would make such a difference?
> >>>
> >>> Oh gawd, that's horrible. I have a feeling it's probably a separate issue
> >>> though, caused by:
> >>>
> >>> omap_modify_auxcoreboot0(0x200, 0xfffffdff);
> >>>
> >>> in boot_secondary for OMAP. Unfortunately I have no idea what that line is
> >>> doing because it ends up talking to the secure monitor.
> >>
> >> Well, this issue is apparently affecting other ARMv9 implementations 
> >> too.  In which case this code in arch/arm/mm/mmu.c could be responsible:
> >>
> >>                 if (is_smp()) {
> >>                         /*
> >>                          * Mark memory with the "shared" attribute
> >>                          * for SMP systems
> >>                          */
> >>                         user_pgprot |= L_PTE_SHARED;
> >>                         kern_pgprot |= L_PTE_SHARED;
> >>                         vecs_pgprot |= L_PTE_SHARED;
> >>                         mem_types[MT_DEVICE_WC].prot_sect |= PMD_SECT_S;
> >>                         mem_types[MT_DEVICE_WC].prot_pte |= L_PTE_SHARED;
> >>                         mem_types[MT_DEVICE_CACHED].prot_sect |= PMD_SECT_S;
> >>                         mem_types[MT_DEVICE_CACHED].prot_pte |= L_PTE_SHARED;
> >>                         mem_types[MT_MEMORY].prot_sect |= PMD_SECT_S;
> >>                         mem_types[MT_MEMORY].prot_pte |= L_PTE_SHARED;
> >>                         mem_types[MT_MEMORY_NONCACHED].prot_sect |= PMD_SECT_S;
> >>                         mem_types[MT_MEMORY_NONCACHED].prot_pte |= L_PTE_SHARED;
> >>                 }
> >>
> >> However I don't see the nosmp kernel argument having any effect on the 
> >> result from is_smp().
> > 
> > Yes, the first thing that sprung to mind was the shared attribute, but like
> > you say, that doesn't seem to be affected by the nosmp command line
> > argument.
> > 
> > Another thing that Marc and I tried on OMAP4 was not bringing up the secondary
> > CPU during boot (by commenting out most of smp_init). In this case, I/O
> > performance was good until we tried to online the secondary CPU. The online
> > failed but after that the I/O performance was certainly degraded.
> > 
> 
> Was the SCU enabled at that point? One diff between nosmp boot and
> offlining the 2nd core would be that the SCU remains enabled in the
> latter case. I think the SCU does not get enabled for nosmp.
> 
> Do we really know which write buffer the data is sitting? Some
> experiments to only flush the L1 write buffer would be interesting.
> Perhaps something executed on the 2nd core has a mb which doesn't help
> for SMP because the other core's L1 write buffer is not flushed, but it
> helps for nosmp because everything runs on 1 core and any occurrence of
> a mb will flush all data out. I wouldn't expect the behavior to be so
> consistent though. Could it be something is not visible to the other
> core rather than not visible to the EHCI controller?

One experiment I did a few days ago was to pin processes and interrupts
to core#0 (except IPI and local timer). This didn't make any noticeable
difference.

My current understanding is that the writes are getting hung up in a
cache and not a write buffer. I am seeing delays of 10-15ms between
queuing the urb and getting an interrupt for urb completion. That
drops to a few hundred microseconds with the explicit flushing added
to the ehci driver. I don't see how any write buffer could hold data
that long without draining out on its own. What I see seems to suggest
that the memory is only coherent among the cores and not coherent for
CPU writes/device reads. Adding just a dsb() for the ehci flush does
not help. An outer_sync() is also necessary.

--Mark
Rob Herring Aug. 31, 2011, 6:49 p.m. UTC | #21
On 08/31/2011 01:35 PM, Mark Salter wrote:
> On Wed, 2011-08-31 at 13:19 -0500, Rob Herring wrote:
>> On 08/31/2011 12:51 PM, Will Deacon wrote:
>>> On Wed, Aug 31, 2011 at 06:46:50PM +0100, Nicolas Pitre wrote:
>>>> On Wed, 31 Aug 2011, Will Deacon wrote:
>>>>
>>>>> On Wed, Aug 31, 2011 at 02:43:33PM +0100, Mark Salter wrote:
>>>>>> On Wed, 2011-08-31 at 09:49 +0100, Will Deacon wrote:
>>>>>>> On Wed, Aug 31, 2011 at 01:23:47AM +0100, Chen Peter-B29397 wrote:
>>>>>>>> One question: why this write buffer issue did not happen at UP ARM V7 platform, whose dma buffer
>>>>>>>> also uncache, but bufferable?
>>>>>>>
>>>>>>> Which CPU was on this platform?
>>>>>>
>>>>>> Using a 3.1.0-rc4+ kernel on a Pandaboard, and running 'hdparm -t' on a
>>>>>> usb disk drive, I see ~5.8MB/s read speed. Same kernel, but passing
>>>>>> nosmp on the commandline, I see 20.3MB/s.
>>>>>>
>>>>>> Can someone explain why nosmp would make such a difference?
>>>>>
>>>>> Oh gawd, that's horrible. I have a feeling it's probably a separate issue
>>>>> though, caused by:
>>>>>
>>>>> omap_modify_auxcoreboot0(0x200, 0xfffffdff);
>>>>>
>>>>> in boot_secondary for OMAP. Unfortunately I have no idea what that line is
>>>>> doing because it ends up talking to the secure monitor.
>>>>
>>>> Well, this issue is apparently affecting other ARMv9 implementations 
>>>> too.  In which case this code in arch/arm/mm/mmu.c could be responsible:
>>>>
>>>>                 if (is_smp()) {
>>>>                         /*
>>>>                          * Mark memory with the "shared" attribute
>>>>                          * for SMP systems
>>>>                          */
>>>>                         user_pgprot |= L_PTE_SHARED;
>>>>                         kern_pgprot |= L_PTE_SHARED;
>>>>                         vecs_pgprot |= L_PTE_SHARED;
>>>>                         mem_types[MT_DEVICE_WC].prot_sect |= PMD_SECT_S;
>>>>                         mem_types[MT_DEVICE_WC].prot_pte |= L_PTE_SHARED;
>>>>                         mem_types[MT_DEVICE_CACHED].prot_sect |= PMD_SECT_S;
>>>>                         mem_types[MT_DEVICE_CACHED].prot_pte |= L_PTE_SHARED;
>>>>                         mem_types[MT_MEMORY].prot_sect |= PMD_SECT_S;
>>>>                         mem_types[MT_MEMORY].prot_pte |= L_PTE_SHARED;
>>>>                         mem_types[MT_MEMORY_NONCACHED].prot_sect |= PMD_SECT_S;
>>>>                         mem_types[MT_MEMORY_NONCACHED].prot_pte |= L_PTE_SHARED;
>>>>                 }
>>>>
>>>> However I don't see the nosmp kernel argument having any effect on the 
>>>> result from is_smp().
>>>
>>> Yes, the first thing that sprung to mind was the shared attribute, but like
>>> you say, that doesn't seem to be affected by the nosmp command line
>>> argument.
>>>
>>> Another thing that Marc and I tried on OMAP4 was not bringing up the secondary
>>> CPU during boot (by commenting out most of smp_init). In this case, I/O
>>> performance was good until we tried to online the secondary CPU. The online
>>> failed but after that the I/O performance was certainly degraded.
>>>
>>
>> Was the SCU enabled at that point? One diff between nosmp boot and
>> offlining the 2nd core would be that the SCU remains enabled in the
>> latter case. I think the SCU does not get enabled for nosmp.
>>
>> Do we really know which write buffer the data is sitting? Some
>> experiments to only flush the L1 write buffer would be interesting.
>> Perhaps something executed on the 2nd core has a mb which doesn't help
>> for SMP because the other core's L1 write buffer is not flushed, but it
>> helps for nosmp because everything runs on 1 core and any occurrence of
>> a mb will flush all data out. I wouldn't expect the behavior to be so
>> consistent though. Could it be something is not visible to the other
>> core rather than not visible to the EHCI controller?
> 
> One experiment I did a few days ago was to pin processes and interrupts
> to core#0 (except IPI and local timer). This didn't make any noticeable
> difference.
> 
> My current understanding is that the writes are getting hung up in a
> cache and not a write buffer. I am seeing delays of 10-15ms between
> queuing the urb and getting an interrupt for urb completion. That
> drops to a few hundred microseconds with the explicit flushing added
> to the ehci driver. I don't see how any write buffer could hold data
> that long without draining out on its own. What I see seems to suggest
> that the memory is only coherent among the cores and not coherent for
> CPU writes/device reads. Adding just a dsb() for the ehci flush does
> not help. An outer_sync() is also necessary.
> 
An outer_sync will only drain the write buffer of the L2. It does not
flush the cache though. If the write buffer does in fact keep data as
long as possible (until it needs a free slot or the line is full), then
long delays to write out data are certainly possible. The exact
operation is not documented AFAIR.

Rob
Mark Salter Aug. 31, 2011, 6:58 p.m. UTC | #22
On Wed, 2011-08-31 at 13:49 -0500, Rob Herring wrote:
> An outer_sync will only drain the write buffer of the L2. It does not
> flush the cache though. If the write buffer does in fact keep data as
> long as possible (until it needs a free slot or the line is full), then
> long delays to write out data are certainly possible. The exact
> operation is not documented AFAIR.

Ah, thanks for that. I really haven't been paying close attention to
ARMv6/7 hardware and I'm in the process of playing catch up.

--Mark
Will Deacon Aug. 31, 2011, 7:35 p.m. UTC | #23
On Wed, Aug 31, 2011 at 07:19:33PM +0100, Rob Herring wrote:
> On 08/31/2011 12:51 PM, Will Deacon wrote:
> > Another thing that Marc and I tried on OMAP4 was not bringing up the secondary
> > CPU during boot (by commenting out most of smp_init). In this case, I/O
> > performance was good until we tried to online the secondary CPU. The online
> > failed but after that the I/O performance was certainly degraded.
> > 
> 
> Was the SCU enabled at that point? One diff between nosmp boot and
> offlining the 2nd core would be that the SCU remains enabled in the
> latter case. I think the SCU does not get enabled for nosmp.

Our rudimentary test (printing out the SCU control register during boot)
showed that it *was* enabled for nosmp. I think this is due to the secure
world having to do that on OMAP so it's probably not true for other
platforms.

Will
Marc Zyngier Sept. 1, 2011, 10:34 a.m. UTC | #24
Hi Marc,

On 31/08/11 17:55, Marc Dietrich wrote:
> Am Mittwoch 31 August 2011, 18:12:48 schrieb Marc Zyngier:
>> [...]
>> Oddly enough, this patch doesn't do anything on my Tegra setup. In both
>> cases, I get around 17MB/s from a crap SD card plugged in a USB reader.
>>
>> This leads me to suspect that this issue is very much OMAP4 specific.
>> Can anyone verify this theory on other some A9 platforms?
> 
> That's odd. On my Tegra2 (on ac100) it boosts the transfer rate from 7 to 
> 17 MB/s. 

I'm using a Harmony board. Could you share your kernel version, .config
and dmesg?

Thanks,

	M.
Marc Dietrich Sept. 1, 2011, 11:13 a.m. UTC | #25
> Hi Marc,

^dito,

> On 31/08/11 17:55, Marc Dietrich wrote:
> > Am Mittwoch 31 August 2011, 18:12:48 schrieb Marc Zyngier:
> >> [...]
> >> Oddly enough, this patch doesn't do anything on my Tegra setup. In both
> >> cases, I get around 17MB/s from a crap SD card plugged in a USB reader.
> >> 
> >> This leads me to suspect that this issue is very much OMAP4 specific.
> >> Can anyone verify this theory on other some A9 platforms?
> > 
> > That's odd. On my Tegra2 (on ac100) it boosts the transfer rate from 7 to
> > 17 MB/s.
> 
> I'm using a Harmony board. Could you share your kernel version, .config
> and dmesg?
> 
> Thanks,
> 
> 	M.

I use the chromiumos tree (for chromiumos 2.6.38 kernel: 
http://git.chromium.org/chromiumos/third_party/kernel-next.git) with some 
additions to make it run on the AC100. This modified tree is on 
git://gitorious.org/~marvin24/ac100/marvin24s-kernel.git. The config is 
paz00_defconfig and a dmesg you can get e.g. from http://pastebin.com/9uVfDWma
(it's not very current, but it should be sufficient).

I'll add Stephen Warren from NVIDIA to the CC list. He has more HW to test on. 

Btw, this is the patch I used: http://gitorious.org/~marvin24/ac100/marvin24s-
kernel/commit/cce8d9e25d009a45c219a6ad0b9ac4e27d034ab0

	Marc
Stephen Warren Sept. 1, 2011, 7:08 p.m. UTC | #26
Marc Dietich wrote at Thursday, September 01, 2011 5:14 AM:
> I'll add Stephen Warren from NVIDIA to the CC list. He has more HW to test on.

Here are the results I found:

Harmony:
Tegra USB3 -> SMSC9514 hub: NOT affected
(Unplugging LAN cable, or disabling SMSC9514 LAN driver doesn't change this)

Seaboard (springbank; clamshell):
Tegra USB1 -> no hub: Affected

Seaboard (seaboard non-clamshell):
Tegra USB1 -> no hub: Affected
Tegra USB3 -> no hub: Affected

TrimSlice:
Tegra USB3 -> unknown hub: Affected

This implies there's something different about Harmony.

Is the USB hub a clue? Seaboard doesn't have one, and although I don't
know what model TrimSlice uses, I assume it's different since I know
TrimSlice's Ethernet is not the same as Harmony's.

I don't see anything in board-harmony.c vs. board-seaboard.c that'd affect
anything USB-related.

Perhaps there's some kind of bootloader or BCT difference. However, my
Harmony and both Seaboards both use (a very old) U-Boot and BCT from
ChromeOS, so I don't imagine there's actually much difference there.
Marc Zyngier Sept. 2, 2011, 9:50 a.m. UTC | #27
On 01/09/11 20:08, Stephen Warren wrote:
> Marc Dietich wrote at Thursday, September 01, 2011 5:14 AM:
>> I'll add Stephen Warren from NVIDIA to the CC list. He has more HW to test on.
> 
> Here are the results I found:
> 
> Harmony:
> Tegra USB3 -> SMSC9514 hub: NOT affected
> (Unplugging LAN cable, or disabling SMSC9514 LAN driver doesn't change this)
> 
> Seaboard (springbank; clamshell):
> Tegra USB1 -> no hub: Affected
> 
> Seaboard (seaboard non-clamshell):
> Tegra USB1 -> no hub: Affected
> Tegra USB3 -> no hub: Affected
> 
> TrimSlice:
> Tegra USB3 -> unknown hub: Affected
> 
> This implies there's something different about Harmony.
> 
> Is the USB hub a clue? Seaboard doesn't have one, and although I don't
> know what model TrimSlice uses, I assume it's different since I know
> TrimSlice's Ethernet is not the same as Harmony's.

Panda has the exact same USB hub configuration, and is affected. So we
can rule this out.

> I don't see anything in board-harmony.c vs. board-seaboard.c that'd affect
> anything USB-related.
> 
> Perhaps there's some kind of bootloader or BCT difference. However, my
> Harmony and both Seaboards both use (a very old) U-Boot and BCT from
> ChromeOS, so I don't imagine there's actually much difference there.

I just noticed something else. Harmony is fast *most of the time*. In
about one in 3 reboots, I get the slow behavior. When USB is fast, I
also have I2C interrupts "screaming":

 85:     294321          0       GIC  tegra-i2c
116:          0          0       GIC  tegra-i2c
118:      98542          0       GIC  tps6586x

This is a couple of seconds after boot.

When USB is slow, I see the following:
[    0.385270] tps6586x 3-0034: Chip ID read failed: -121
[    0.390584] tps6586x: probe of 3-0034 failed with error -5

... and I2C interrupt is quiet.

The I2C interrupt handler calls writel(), which does a cache sync. That
would explain the "fast" behavior of Harmony.

Do you see the same this on your board?

	M.
Marc Dietrich Sept. 2, 2011, 11:13 a.m. UTC | #28
just another measurement point

> Stephen Warren wrote at Thursday:
>
> Here are the results I found:
> 
> Harmony:
> Tegra USB3 -> SMSC9514 hub: NOT affected
> (Unplugging LAN cable, or disabling SMSC9514 LAN driver doesn't change
> this)
> 
> Seaboard (springbank; clamshell):
> Tegra USB1 -> no hub: Affected
> 
> Seaboard (seaboard non-clamshell):
> Tegra USB1 -> no hub: Affected
> Tegra USB3 -> no hub: Affected
> 
> TrimSlice:
> Tegra USB3 -> unknown hub: Affected

PAZ00:
ULPI -> SMCS 2512: Affected
Tegra USB3 -> SMSC 2514: Affected

The patch also cures high latencies/packet drops on wifi connected to ULPI via 
the 2512 hub. The pen drive was connected to USB3/2514.

> This implies there's something different about Harmony.
> 
> Is the USB hub a clue? Seaboard doesn't have one, and although I don't
> know what model TrimSlice uses, I assume it's different since I know
> TrimSlice's Ethernet is not the same as Harmony's.
> 
> I don't see anything in board-harmony.c vs. board-seaboard.c that'd affect
> anything USB-related.
> 
> Perhaps there's some kind of bootloader or BCT difference. However, my
> Harmony and both Seaboards both use (a very old) U-Boot and BCT from
> ChromeOS, so I don't imagine there's actually much difference there.
Stephen Warren Sept. 2, 2011, 5:07 p.m. UTC | #29
Marc Zyngier wrote at Friday, September 02, 2011 3:51 AM:
> On 01/09/11 20:08, Stephen Warren wrote:
> > Marc Dietich wrote at Thursday, September 01, 2011 5:14 AM:
> >> I'll add Stephen Warren from NVIDIA to the CC list. He has more HW to test on.
> >
> > Here are the results I found:
> >
> > Harmony:
> > Tegra USB3 -> SMSC9514 hub: NOT affected
> > (Unplugging LAN cable, or disabling SMSC9514 LAN driver doesn't change this)
...
> I just noticed something else. Harmony is fast *most of the time*. In
> about one in 3 reboots, I get the slow behavior. When USB is fast, I
> also have I2C interrupts "screaming":
> 
>  85:     294321          0       GIC  tegra-i2c
> 116:          0          0       GIC  tegra-i2c
> 118:      98542          0       GIC  tps6586x
> 
> This is a couple of seconds after boot.
> 
> When USB is slow, I see the following:
> [    0.385270] tps6586x 3-0034: Chip ID read failed: -121
> [    0.390584] tps6586x: probe of 3-0034 failed with error -5
> 
> ... and I2C interrupt is quiet.
> 
> The I2C interrupt handler calls writel(), which does a cache sync. That
> would explain the "fast" behavior of Harmony.
> 
> Do you see the same this on your board?

Yes, I re-ran the test a few more times and see those exact same symptoms.

In a case with the screaming I2C interrupts and fast USB, I then did:

echo 3-0034 > /sys/bus/i2c/drivers/tps6586x/unbind
(I got a kernel BUG and bash crashed here, but just logged back in)

which caused the I2C interrupt handler to stop, then re-ran the test.
I then saw the slow USB speed.

So, now I think *all* platforms(boards) are affected, right?
diff mbox

Patch

diff --git a/drivers/usb/host/ehci-q.c b/drivers/usb/host/ehci-q.c
index 0917e3a..2719879 100644
--- a/drivers/usb/host/ehci-q.c
+++ b/drivers/usb/host/ehci-q.c
@@ -995,6 +995,12 @@  static void qh_link_async (struct ehci_hcd *ehci, struct ehci_qh *qh)
 	head->qh_next.qh = qh;
 	head->hw->hw_next = dma;
 
+	/*
+	 * flush qh descriptor into memory immediately,
+	 * see comments in qh_append_tds.
+	 * */
+	ehci_sync_mem();
+
 	qh_get(qh);
 	qh->xacterrs = 0;
 	qh->qh_state = QH_STATE_LINKED;
@@ -1082,6 +1088,18 @@  static struct ehci_qh *qh_append_tds (
 			wmb ();
 			dummy->hw_token = token;
 
+			/*
+			 * Writing to dma coherent buffer on ARM may
+			 * be delayed to reach memory, so HC may not see
+			 * hw_token of dummy qtd in time, which can cause
+			 * the qtd transaction to be executed very late,
+			 * and degrade performance a lot. ehci_sync_mem
+			 * is added to flush 'token' immediatelly into
+			 * memory, so that ehci can execute the transaction
+			 * ASAP.
+			 * */
+			ehci_sync_mem();
+
 			urb->hcpriv = qh_get (qh);
 		}
 	}
diff --git a/drivers/usb/host/ehci.h b/drivers/usb/host/ehci.h
index cc7d337..313d9d6 100644
--- a/drivers/usb/host/ehci.h
+++ b/drivers/usb/host/ehci.h
@@ -738,6 +738,23 @@  static inline u32 hc32_to_cpup (const struct ehci_hcd *ehci, const __hc32 *x)
 
 #endif
 
+/*
+ * Writing to dma coherent memory on ARM may be delayed via L2
+ * writing buffer, so introduce the helper which can flush L2 writing
+ * buffer into memory immediately, especially used to flush ehci
+ * descriptor to memory.
+ * */
+#ifdef	CONFIG_ARM_DMA_MEM_BUFFERABLE
+static inline void ehci_sync_mem()
+{
+	mb();
+}
+#else
+static inline void ehci_sync_mem()
+{
+}
+#endif
+
 /*-------------------------------------------------------------------------*/
 
 #ifndef DEBUG