
usb: ehci: fix update qtd->token in qh_append_tds

Message ID 1314456515-16419-1-git-send-email-ming.lei@canonical.com (mailing list archive)
State New, archived

Commit Message

Ming Lei Aug. 27, 2011, 2:48 p.m. UTC
From: Ming Lei <ming.lei@canonical.com>

This patch fixes a performance bug on the ARM Cortex-A9 dual-core platform,
which has been reported on quite a few ARM machines (OMAP4, Tegra 2, Snowball...);
see https://bugs.launchpad.net/bugs/709245 for details.

In fact, one mb() on ARM is enough to flush L2 cache, but
'dummy->hw_token = token;' after mb() is added just for obeying
correct mb() usage.

The patch has been tested OK on an OMAP4 Panda A1 board; the throughput
of 'dd' over USB mass storage increases from 4~5MB/sec to 14~16MB/sec
after applying this patch.

Signed-off-by: Ming Lei <ming.lei@canonical.com>
---
 drivers/usb/host/ehci-q.c |   14 ++++++++++++++
 1 files changed, 14 insertions(+), 0 deletions(-)

Comments

Santosh Shilimkar Aug. 27, 2011, 3:03 p.m. UTC | #1
Hi,

On Saturday 27 August 2011 08:18 PM, ming.lei@canonical.com wrote:
> From: Ming Lei<ming.lei@canonical.com>
>
> This patch fixs one performance bug on ARM Cortex A9 dual core platform,
> which has been reported on quite a few ARM machines(OMAP4, Tegra 2, snowball...),
> see details from link of https://bugs.launchpad.net/bugs/709245.
>
> In fact, one mb() on ARM is enough to flush L2 cache, but
> 'dummy->hw_token = token;' after mb() is added just for obeying
> correct mb() usage.
>
Who said "one mb() on ARM is enough to flush L2 cache" ?
It's just a memory barrier and it doesn't flush any cache.
What it cleans is the CPU write buffers and the L2 cache
write buffers.

> The patch has been tested ok on OMAP4 panda A1 board, the performance
> of 'dd' over usb mass storage can be increased from 4~5MB/sec to
> 14~16MB/sec after applying this patch.
>
Though the numbers look great, how does the patch below help to get better
numbers?

> Signed-off-by: Ming Lei<ming.lei@canonical.com>
> ---
>   drivers/usb/host/ehci-q.c |   14 ++++++++++++++
>   1 files changed, 14 insertions(+), 0 deletions(-)
>
> diff --git a/drivers/usb/host/ehci-q.c b/drivers/usb/host/ehci-q.c
> index 0917e3a..65b5021 100644
> --- a/drivers/usb/host/ehci-q.c
> +++ b/drivers/usb/host/ehci-q.c
> @@ -1082,6 +1082,20 @@ static struct ehci_qh *qh_append_tds (
>   			wmb ();
>   			dummy->hw_token = token;
>
> +			/* The mb() below is added to make sure that
> +			 * 'token' can be writen into qtd, so that ehci
> +			 * HC can see the up-to-date qtd descriptor. On
> +			 * some archs(at least on ARM Cortex A9 dual core),
> +			 * writing into coherenet memory doesn't mean the
> +			 * value written can reach physical memory
> +			 * immediately, and the value may be buffered
> +			 * inside L2 cache. 'dummy->hw_token = token;'
> +			 * after mb() is added for obeying correct mb()
> +			 * usage.
> +			 * */
> +			mb();
> +			token = dummy->hw_token;
> +

At most, this patch fixes some corruption if the memory buffer
used is bufferable. In fact, I see there is already a write memory
barrier above, so just pushing that down by one line should
be enough.

 >   			dummy->hw_token = token;
 >   			wmb ();

Is there another patch along with this which removes some cache clean
on this buffer?

Regards
Santosh
Greg KH Aug. 27, 2011, 3:13 p.m. UTC | #2
On Sat, Aug 27, 2011 at 10:48:35PM +0800, ming.lei@canonical.com wrote:
> From: Ming Lei <ming.lei@canonical.com>
> 
> This patch fixs one performance bug on ARM Cortex A9 dual core platform,
> which has been reported on quite a few ARM machines(OMAP4, Tegra 2, snowball...),
> see details from link of https://bugs.launchpad.net/bugs/709245.
> 
> In fact, one mb() on ARM is enough to flush L2 cache, but
> 'dummy->hw_token = token;' after mb() is added just for obeying
> correct mb() usage.

Really?  A mb() should not be flushing any caches, it's just a memory
barrier.  Or is ARM somehow "special" in this way?

> The patch has been tested ok on OMAP4 panda A1 board, the performance
> of 'dd' over usb mass storage can be increased from 4~5MB/sec to
> 14~16MB/sec after applying this patch.

That's impressive, but I don't think this is really the proper way to do
this...

> Signed-off-by: Ming Lei <ming.lei@canonical.com>
> ---
>  drivers/usb/host/ehci-q.c |   14 ++++++++++++++
>  1 files changed, 14 insertions(+), 0 deletions(-)
> 
> diff --git a/drivers/usb/host/ehci-q.c b/drivers/usb/host/ehci-q.c
> index 0917e3a..65b5021 100644
> --- a/drivers/usb/host/ehci-q.c
> +++ b/drivers/usb/host/ehci-q.c
> @@ -1082,6 +1082,20 @@ static struct ehci_qh *qh_append_tds (
>  			wmb ();
>  			dummy->hw_token = token;
>  
> +			/* The mb() below is added to make sure that
> +			 * 'token' can be writen into qtd, so that ehci
> +			 * HC can see the up-to-date qtd descriptor. On
> +			 * some archs(at least on ARM Cortex A9 dual core),
> +			 * writing into coherenet memory doesn't mean the
> +			 * value written can reach physical memory
> +			 * immediately, and the value may be buffered
> +			 * inside L2 cache. 'dummy->hw_token = token;'
> +			 * after mb() is added for obeying correct mb()
> +			 * usage.
> +			 * */
> +			mb();
> +			token = dummy->hw_token;

Your comment does not match the code, so something is wrong here.

greg k-h
Ming Lei Aug. 27, 2011, 3:18 p.m. UTC | #3
On Sat, Aug 27, 2011 at 11:03 PM, Santosh <santosh.shilimkar@ti.com> wrote:
> Hi,
>
> On Saturday 27 August 2011 08:18 PM, ming.lei@canonical.com wrote:
>>
>> From: Ming Lei<ming.lei@canonical.com>
>>
>> This patch fixs one performance bug on ARM Cortex A9 dual core platform,
>> which has been reported on quite a few ARM machines(OMAP4, Tegra 2,
>> snowball...),
>> see details from link of https://bugs.launchpad.net/bugs/709245.
>>
>> In fact, one mb() on ARM is enough to flush L2 cache, but
>> 'dummy->hw_token = token;' after mb() is added just for obeying
>> correct mb() usage.
>>
> Who said "one mb() on ARM is enough to flush L2 cache" ?
> It's just a memory barrier and it doesn't flush any cache.
> What it cleans is the CPU write buffers and the L2 cache
> write buffers

Yes, your description is more accurate; it should be the L2 write buffer.
I see that mb() will call outer_sync() on ARM.
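
For reference, a rough sketch of what the barriers expand to on ARMv7 with an
outer (PL310) L2 cache, as far as I recall the 3.0-era definitions in
arch/arm/include/asm/system.h (illustration only, please double-check the tree):

	#define mb()	do { dsb(); outer_sync(); } while (0)	/* drains CPU and L2 write buffers */
	#define wmb()	mb()
	#define rmb()	dsb()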

>
>> The patch has been tested ok on OMAP4 panda A1 board, the performance
>> of 'dd' over usb mass storage can be increased from 4~5MB/sec to
>> 14~16MB/sec after applying this patch.
>>
> Though number looks great, how is the below patch helping to get better
> numbers.

The patch makes the EHCI HC see the up-to-date qtd, so the USB transaction
is executed correctly. If qtd->token is not updated, IOC may not be set, or
may be set very late, so the interrupt can't be triggered in time; also, a
mistaken 'total bytes to transfer' can make the HC work badly.

In fact, I have traced the problem and found that the EHCI irq is often
delayed by the EHCI HC; sometimes the EHCI irq is even lost. That is why I
started to trace the ehci driver and found the problem here.
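
For reference, the fields in question live in the qTD token dword; the
relevant bits (these should match the QTD_* macros in drivers/usb/host/ehci.h
and the EHCI spec, shown here only as a sketch) are:

	#define QTD_LENGTH(tok)	(((tok) >> 16) & 0x7fff)	/* total bytes to transfer */
	#define QTD_IOC		(1 << 15)			/* interrupt on complete */
	#define QTD_STS_ACTIVE	(1 << 7)			/* HC may execute this qTD */
	#define QTD_STS_HALT	(1 << 6)			/* halted on error */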

>
>> Signed-off-by: Ming Lei<ming.lei@canonical.com>
>> ---
>>  drivers/usb/host/ehci-q.c |   14 ++++++++++++++
>>  1 files changed, 14 insertions(+), 0 deletions(-)
>>
>> diff --git a/drivers/usb/host/ehci-q.c b/drivers/usb/host/ehci-q.c
>> index 0917e3a..65b5021 100644
>> --- a/drivers/usb/host/ehci-q.c
>> +++ b/drivers/usb/host/ehci-q.c
>> @@ -1082,6 +1082,20 @@ static struct ehci_qh *qh_append_tds (
>>                        wmb ();
>>                        dummy->hw_token = token;
>>
>> +                       /* The mb() below is added to make sure that
>> +                        * 'token' can be writen into qtd, so that ehci
>> +                        * HC can see the up-to-date qtd descriptor. On
>> +                        * some archs(at least on ARM Cortex A9 dual
>> core),
>> +                        * writing into coherenet memory doesn't mean the
>> +                        * value written can reach physical memory
>> +                        * immediately, and the value may be buffered
>> +                        * inside L2 cache. 'dummy->hw_token = token;'
>> +                        * after mb() is added for obeying correct mb()
>> +                        * usage.
>> +                        * */
>> +                       mb();
>> +                       token = dummy->hw_token;
>> +
>
> This patch at max fix some corruption if the memory buffer
> used is buffer-able. Infact I see there is already a write memory
> barrier above. So just pushing that down by one line should
> be enough.

The above wmb is used to order updating qtd->hw_next and
dummy->hw_token.

>
>>                       dummy->hw_token = token;
>>                       wmb ();
>
> Is there another patch along with this which removes, some cache clean
> on this buffer ?

No, I am not sure the wmb should be merged with mb().

thanks,
--
Ming Lei
Ming Lei Aug. 27, 2011, 3:33 p.m. UTC | #4
Hi,

On Sat, Aug 27, 2011 at 11:13 PM, Greg KH <greg@kroah.com> wrote:
> On Sat, Aug 27, 2011 at 10:48:35PM +0800, ming.lei@canonical.com wrote:
>> From: Ming Lei <ming.lei@canonical.com>
>>
>> This patch fixs one performance bug on ARM Cortex A9 dual core platform,
>> which has been reported on quite a few ARM machines(OMAP4, Tegra 2, snowball...),
>> see details from link of https://bugs.launchpad.net/bugs/709245.
>>
>> In fact, one mb() on ARM is enough to flush L2 cache, but
>> 'dummy->hw_token = token;' after mb() is added just for obeying
>> correct mb() usage.
>
> Really?  A mb() should not be flushing any caches, it's just a memory
> barrier.  Or is ARM somehow "special" in this way?

As Santosh pointed out, mb on ARM will flush L2 write buffer. The
description here is wrong.

I think the code below should make the write reach memory on all
architectures after 'token = dummy->hw_token;' is executed.

                       dummy->hw_token = token;
                       mb()
                       token = dummy->hw_token;

The above is the idea introduced to fix the problem.

>
>> The patch has been tested ok on OMAP4 panda A1 board, the performance
>> of 'dd' over usb mass storage can be increased from 4~5MB/sec to
>> 14~16MB/sec after applying this patch.
>
> That's impressive, but I don't think this is really the proper way to do
> this...
>
>> Signed-off-by: Ming Lei <ming.lei@canonical.com>
>> ---
>>  drivers/usb/host/ehci-q.c |   14 ++++++++++++++
>>  1 files changed, 14 insertions(+), 0 deletions(-)
>>
>> diff --git a/drivers/usb/host/ehci-q.c b/drivers/usb/host/ehci-q.c
>> index 0917e3a..65b5021 100644
>> --- a/drivers/usb/host/ehci-q.c
>> +++ b/drivers/usb/host/ehci-q.c
>> @@ -1082,6 +1082,20 @@ static struct ehci_qh *qh_append_tds (
>>                       wmb ();
>>                       dummy->hw_token = token;
>>
>> +                     /* The mb() below is added to make sure that
>> +                      * 'token' can be writen into qtd, so that ehci
>> +                      * HC can see the up-to-date qtd descriptor. On
>> +                      * some archs(at least on ARM Cortex A9 dual core),
>> +                      * writing into coherenet memory doesn't mean the
>> +                      * value written can reach physical memory
>> +                      * immediately, and the value may be buffered
>> +                      * inside L2 cache. 'dummy->hw_token = token;'
>> +                      * after mb() is added for obeying correct mb()
>> +                      * usage.
>> +                      * */
>> +                     mb();
>> +                     token = dummy->hw_token;
>
> Your comment does not match the code, so something is wrong here.

If you mean "L2 cache flush", I confess to the mistaken description,
and will update it later. If you mean others, could you help to point it out?

thanks,
--
Ming Lei
Santosh Shilimkar Aug. 27, 2011, 3:46 p.m. UTC | #5
On Saturday 27 August 2011 08:48 PM, Ming Lei wrote:
> On Sat, Aug 27, 2011 at 11:03 PM, Santosh<santosh.shilimkar@ti.com>  wrote:
>> Hi,
>>
>> On Saturday 27 August 2011 08:18 PM, ming.lei@canonical.com wrote:
>>>
>>> From: Ming Lei<ming.lei@canonical.com>
>>>
>>> This patch fixs one performance bug on ARM Cortex A9 dual core platform,
>>> which has been reported on quite a few ARM machines(OMAP4, Tegra 2,
>>> snowball...),
>>> see details from link of https://bugs.launchpad.net/bugs/709245.
>>>
>>> In fact, one mb() on ARM is enough to flush L2 cache, but
>>> 'dummy->hw_token = token;' after mb() is added just for obeying
>>> correct mb() usage.
>>>
>> Who said "one mb() on ARM is enough to flush L2 cache" ?
>> It's just a memory barrier and it doesn't flush any cache.
>> What it cleans is the CPU write buffers and the L2 cache
>> write buffers
>
> Yes,  your description is more accurate, it should be L2 write buffer,
> I see mb() will call outer_sync() on ARM.
>
>>
>>> The patch has been tested ok on OMAP4 panda A1 board, the performance
>>> of 'dd' over usb mass storage can be increased from 4~5MB/sec to
>>> 14~16MB/sec after applying this patch.
>>>
>> Though number looks great, how is the below patch helping to get better
>> numbers.
>
> The patch can make ehci HC see the up-to-date qtd, so make usb transaction
> executed correctly. If a qtd->token is not updated, maybe IOC is not
> set or set very
> late, so interrupt can't be triggered in time, also mistaken 'total
> bytes to transfer'
> can make HC work badly.
>
> In fact, I have traced the problem and found ehci irq is often delayed
> by ehci HC.
> also sometimes ehci irq is lost, so I start to trace ehci driver and
> find the problem here.
>
You are missing all of this important description in the changelog.

>>
>>> Signed-off-by: Ming Lei<ming.lei@canonical.com>
>>> ---
>>>   drivers/usb/host/ehci-q.c |   14 ++++++++++++++
>>>   1 files changed, 14 insertions(+), 0 deletions(-)
>>>
>>> diff --git a/drivers/usb/host/ehci-q.c b/drivers/usb/host/ehci-q.c
>>> index 0917e3a..65b5021 100644
>>> --- a/drivers/usb/host/ehci-q.c
>>> +++ b/drivers/usb/host/ehci-q.c
>>> @@ -1082,6 +1082,20 @@ static struct ehci_qh *qh_append_tds (
>>>                         wmb ();
>>>                         dummy->hw_token = token;
>>>
>>> +                       /* The mb() below is added to make sure that
>>> +                        * 'token' can be writen into qtd, so that ehci
>>> +                        * HC can see the up-to-date qtd descriptor. On
>>> +                        * some archs(at least on ARM Cortex A9 dual
>>> core),
>>> +                        * writing into coherenet memory doesn't mean the
>>> +                        * value written can reach physical memory
>>> +                        * immediately, and the value may be buffered
>>> +                        * inside L2 cache. 'dummy->hw_token = token;'
>>> +                        * after mb() is added for obeying correct mb()
>>> +                        * usage.
>>> +                        * */
>>> +                       mb();
>>> +                       token = dummy->hw_token;
>>> +
>>
>> This patch at max fix some corruption if the memory buffer
>> used is buffer-able. Infact I see there is already a write memory
>> barrier above. So just pushing that down by one line should
>> be enough.
>
> The above wmb is used to order updating qtd->hw_next and
> dummy->hw_token.
>
>>
>>>                        dummy->hw_token = token;
>>>                        wmb ();
>>
>> Is there another patch along with this which removes, some cache clean
>> on this buffer ?
>
> No, I am not sure the wmb should be merged with mb().
>
It will work, but I leave that call to you since I don't understand much of
what that function is doing. In fact, I see excessive usage of wmb()
in qh_append_tds().

Regards
Santosh
Greg KH Aug. 27, 2011, 4:07 p.m. UTC | #6
On Sat, Aug 27, 2011 at 11:33:26PM +0800, Ming Lei wrote:
> Hi,
> 
> On Sat, Aug 27, 2011 at 11:13 PM, Greg KH <greg@kroah.com> wrote:
> > On Sat, Aug 27, 2011 at 10:48:35PM +0800, ming.lei@canonical.com wrote:
> >> From: Ming Lei <ming.lei@canonical.com>
> >>
> >> This patch fixs one performance bug on ARM Cortex A9 dual core platform,
> >> which has been reported on quite a few ARM machines(OMAP4, Tegra 2, snowball...),
> >> see details from link of https://bugs.launchpad.net/bugs/709245.
> >>
> >> In fact, one mb() on ARM is enough to flush L2 cache, but
> >> 'dummy->hw_token = token;' after mb() is added just for obeying
> >> correct mb() usage.
> >
> > Really?  A mb() should not be flushing any caches, it's just a memory
> > barrier.  Or is ARM somehow "special" in this way?
> 
> As Santosh pointed out, mb on ARM will flush L2 write buffer. The
> description here is wrong.

Then this can't be accepted as-is :)

> I think the below should make the writing reach into memory on all
> ARCH after ' token = dummy->hw_token;' is executed.
> 
>                        dummy->hw_token = token;
>                        mb()
>                        token = dummy->hw_token;
> 
> The above is the idea introduced to fix the problem.

Are you sure?  Have you read the documentation about memory barriers to
confirm this?

> >> The patch has been tested ok on OMAP4 panda A1 board, the performance
> >> of 'dd' over usb mass storage can be increased from 4~5MB/sec to
> >> 14~16MB/sec after applying this patch.
> >
> > That's impressive, but I don't think this is really the proper way to do
> > this...
> >
> >> Signed-off-by: Ming Lei <ming.lei@canonical.com>
> >> ---
> >>  drivers/usb/host/ehci-q.c |   14 ++++++++++++++
> >>  1 files changed, 14 insertions(+), 0 deletions(-)
> >>
> >> diff --git a/drivers/usb/host/ehci-q.c b/drivers/usb/host/ehci-q.c
> >> index 0917e3a..65b5021 100644
> >> --- a/drivers/usb/host/ehci-q.c
> >> +++ b/drivers/usb/host/ehci-q.c
> >> @@ -1082,6 +1082,20 @@ static struct ehci_qh *qh_append_tds (
> >>                       wmb ();
> >>                       dummy->hw_token = token;
> >>
> >> +                     /* The mb() below is added to make sure that
> >> +                      * 'token' can be writen into qtd, so that ehci
> >> +                      * HC can see the up-to-date qtd descriptor. On
> >> +                      * some archs(at least on ARM Cortex A9 dual core),
> >> +                      * writing into coherenet memory doesn't mean the
> >> +                      * value written can reach physical memory
> >> +                      * immediately, and the value may be buffered
> >> +                      * inside L2 cache. 'dummy->hw_token = token;'
> >> +                      * after mb() is added for obeying correct mb()
> >> +                      * usage.
> >> +                      * */
> >> +                     mb();
> >> +                     token = dummy->hw_token;
> >
> > Your comment does not match the code, so something is wrong here.
> 
> If you mean "L2 cache flush", I confess to the mistaken description,
> and will update it later. If you mean others, could you help to point it out?

I mean others; please read the last 3 lines of the comment and
compare them to the code lines you added.

greg k-h
Sergei Shtylyov Aug. 27, 2011, 4:31 p.m. UTC | #7
Hello.

On 27-08-2011 18:48, ming.lei@canonical.com wrote:

> From: Ming Lei<ming.lei@canonical.com>

> This patch fixs one performance bug on ARM Cortex A9 dual core platform,
> which has been reported on quite a few ARM machines(OMAP4, Tegra 2, snowball...),
> see details from link of https://bugs.launchpad.net/bugs/709245.

> In fact, one mb() on ARM is enough to flush L2 cache, but
> 'dummy->hw_token = token;' after mb() is added just for obeying
> correct mb() usage.

> The patch has been tested ok on OMAP4 panda A1 board, the performance
> of 'dd' over usb mass storage can be increased from 4~5MB/sec to
> 14~16MB/sec after applying this patch.

> Signed-off-by: Ming Lei<ming.lei@canonical.com>
> ---
>   drivers/usb/host/ehci-q.c |   14 ++++++++++++++
>   1 files changed, 14 insertions(+), 0 deletions(-)

> diff --git a/drivers/usb/host/ehci-q.c b/drivers/usb/host/ehci-q.c
> index 0917e3a..65b5021 100644
> --- a/drivers/usb/host/ehci-q.c
> +++ b/drivers/usb/host/ehci-q.c
> @@ -1082,6 +1082,20 @@ static struct ehci_qh *qh_append_tds (
>   			wmb ();
>   			dummy->hw_token = token;
>
> +			/* The mb() below is added to make sure that
> +			 * 'token' can be writen into qtd, so that ehci
> +			 * HC can see the up-to-date qtd descriptor. On
> +			 * some archs(at least on ARM Cortex A9 dual core),
> +			 * writing into coherenet memory doesn't mean the
> +			 * value written can reach physical memory
> +			 * immediately, and the value may be buffered
> +			 * inside L2 cache. 'dummy->hw_token = token;'

    You meant 'token = dummy->hw_token;'?

> +			 * after mb() is added for obeying correct mb()
> +			 * usage.
> +			 * */
> +			mb();
> +			token = dummy->hw_token;
> +
>   			urb->hcpriv = qh_get (qh);
>   		}
>   	}

WBR, Sergei
Ming Lei Aug. 27, 2011, 4:57 p.m. UTC | #8
Hi,

Thanks for your comment.

On Sun, Aug 28, 2011 at 12:07 AM, Greg KH <greg@kroah.com> wrote:

>> As Santosh pointed out, mb on ARM will flush L2 write buffer. The
>> description here is wrong.
>
> Then this can't be accepted as-is :)

Yes, I will update it in v1, :-)

>> I think the below should make the writing reach into memory on all
>> ARCH after ' token = dummy->hw_token;' is executed.
>>
>>                        dummy->hw_token = token;
>>                        mb()
>>                        token = dummy->hw_token;
>>
>> The above is the idea introduced to fix the problem.
>
> Are you sure?  Have you read the documentation about memory barriers to
> confirm this?

I read the doc again, :-), and it says little about mb/wmb/rmb. I think
my description above is still not correct. Generally speaking, mb() only
means there is an ordering between two accesses.

Now I think only one mb() after 'dummy->hw_token = token;' is enough:
HC will read the up-to-date value of qtd->hw_token after mb() is executed
because of the effect of the mb(), which should be guaranteed by mb.
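
In other words, the proposal would reduce to something like this in
qh_append_tds() (just a sketch of the idea, not a tested patch):

	/* existing code */
	wmb ();
	dummy->hw_token = token;

	/* proposed addition: push the token write out to memory promptly,
	 * with no read-back needed */
	mb ();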

> I mean others, please read the the last 3 lines of the comment and
> compare that to the code lines you added.

I see now; the last 3 lines of the comment are wrong and should be

                        * inside L2 cache. 'token = dummy->hw_token'
                         * after mb() is added for obeying correct mb()
                         * usage.

But the 'token = dummy->hw_token' after the mb() isn't needed any
more, as described above, is it?


thanks,
--
Ming Lei
Ming Lei Aug. 27, 2011, 5:20 p.m. UTC | #9
Hi,

On Sun, Aug 28, 2011 at 12:57 AM, Ming Lei <ming.lei@canonical.com> wrote:

>> Are you sure?  Have you read the documentation about memory barriers to
>> confirm this?
>
> I read the doc again, :-), and it mentions few about mb/wmb/rmb, I think
> my above description is still not correct. Generally speaking, mb only
> means there is a order between two accesses.
>
> Now I think only one mb() after 'dummy->hw_token = token;' is enough:
> HC will read the up-to-date value of qtd->hw_token after mb() is executed
> because of the effect of the mb(), which should be guaranteed by mb.

It looks like there is still another, similar problem in qh_link_async():
the last wmb there should be changed into mb, because the HC will read
'head->hw->hw_next' from the qh descriptor, and that pointer in the qh is
read-only for the HC. But this problem can't be observed on ARM, since wmb
on ARM is the same as mb.
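
For context, the splice in qh_link_async() looks roughly like this
(paraphrased from ehci-q.c, so details may differ):

	/* splice right after start of the async schedule */
	qh->qh_next = head->qh_next;
	qh->hw->hw_next = head->hw->hw_next;
	wmb ();			/* the barrier being discussed */

	head->qh_next.qh = qh;
	head->hw->hw_next = dma;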

thanks,
--
Ming Lei
Alan Stern Aug. 27, 2011, 8:06 p.m. UTC | #10
On Sun, 28 Aug 2011, Ming Lei wrote:

> I read the doc again, :-), and it mentions few about mb/wmb/rmb, I think
> my above description is still not correct. Generally speaking, mb only
> means there is a order between two accesses.
> 
> Now I think only one mb() after 'dummy->hw_token = token;' is enough:
> HC will read the up-to-date value of qtd->hw_token after mb() is executed
> because of the effect of the mb(), which should be guaranteed by mb.

I've been following this whole discussion.  I didn't understand the
idea behind the original patch, and I don't understand this.  What
reason is there for adding a memory barrier after the "dummy->hw_token
= token" line?  Such a barrier would not accomplish anything useful,
because there are no later memory accesses which need to be ordered
with respect to that line.

(By the way, Santosh appears to be right about the earlier wmb() in
this routine.  As far as I can see, it isn't needed.  David Brownell
probably wrote it just as a precaution.)

> > I mean others, please read the the last 3 lines of the comment and
> > compare that to the code lines you added.
> 
> I see now, the comment of the last 3 lines is wrong, should be
> 
>                         * inside L2 cache. 'token = dummy->hw_token'
>                          * after mb() is added for obeying correct mb()
>                          * usage.
> 
> But the 'token = dummy->hw_token' after mb() isn't needed any
> more as described above,  is it?

I don't see why it was ever needed at all.  The compiler will realize
that token is a dead variable at that point in the routine and will
optimize the new line away.
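
To make the point concrete (a sketch of the discussed hunk; ACCESS_ONCE is
shown only as the usual way to force a read the compiler must keep):

	dummy->hw_token = token;
	mb();
	token = dummy->hw_token;		/* 'token' is dead afterwards, so the
						 * compiler may drop this read entirely */

	token = ACCESS_ONCE(dummy->hw_token);	/* a volatile read the compiler cannot
						 * drop -- though it still would not
						 * make the HC see the write any sooner */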

> The patch can make ehci HC see the up-to-date qtd, so make usb transaction
> executed correctly. If a qtd->token is not updated, maybe IOC is not set or set very
> late, so interrupt can't be triggered in time, also mistaken 'total bytes to transfer'
> can make HC work badly.

This doesn't make any sense.  qtd becomes the new dummy; by the time
the HC sees qtd the only important bit set in qtd->hw_token is the HALT
bit.

> In fact, I have traced the problem and found ehci irq is often delayed by ehci HC.
> also sometimes ehci irq is lost, so I start to trace ehci driver and find the problem here.

I don't think you have identified the underlying cause correctly.  
Something else must be responsible.

Alan Stern
Alan Stern Aug. 27, 2011, 8:11 p.m. UTC | #11
On Sun, 28 Aug 2011, Ming Lei wrote:

> Looks like there is still another similar problem in qh_link_async():
> the last wmb
> should be changed into mb, because HC will read 'head->hw->hw_next' from qh
> descriptor and this pointer in qh is read only for HC. But this problem can't be
> observed on ARM, since wmb on ARM is same with mb.

It doesn't matter what the HC does -- the wmb() instruction is executed
by the CPU, not the HC.  The point of that instruction is to make sure 
that the

	qh->hw->hw_next = head->hw->hw_next;

line (and all the preceding lines as well) is ordered before the

	head->hw->hw_next = dma;
line.  Since both of these lines are writes, not reads, it suffices to 
use wmb() rather than mb().

Alan Stern
Ming Lei Aug. 28, 2011, 3:13 a.m. UTC | #12
Hi,

Thanks for your comment.

On Sun, Aug 28, 2011 at 4:06 AM, Alan Stern <stern@rowland.harvard.edu> wrote:
> On Sun, 28 Aug 2011, Ming Lei wrote:
>
>> I read the doc again, :-), and it mentions few about mb/wmb/rmb, I think
>> my above description is still not correct. Generally speaking, mb only
>> means there is a order between two accesses.
>>
>> Now I think only one mb() after 'dummy->hw_token = token;' is enough:
>> HC will read the up-to-date value of qtd->hw_token after mb() is executed
>> because of the effect of the mb(), which should be guaranteed by mb.
>
> I've been following this whole discussion.  I didn't understand the
> idea behind the original patch, and I don't understand this.  What
> reason is there for adding a memory barrier after the "dummy->hw_token
> = token" line?  Such a barrier would not accomplish anything useful,
> because there are no later memory accesses which need to be ordered
> with respect to that line.

Here, the mb is used to synchronize between the CPU's write and the
EHCI HC's read, see below:

	CPU								EHCI
	dummy->hw_token = token;
	mb
									read  dummy->hw_token

The mb() introduced here was supposed to make sure that the EHCI can see
the correct value of dummy->hw_token. If the EHCI can't get the correct
hw_token in time, it will work badly, with delayed or lost irqs that may be
caused by a mistaken 'IOC' or 'total bytes to transfer' in hw_token. This is
just what the patch is meant to fix.

But I think the above is still not correct, and the correct way may be
as below:

	CPU								EHCI
	dummy->hw_token = token;
	wmb
	qh->hw->hw_qtd_next = QTD_NEXT(ehci, dummy->qtd_dma);
									fetch next qtd from qh->hw->hw_qtd_next
									read  qtd->hw_token

The problem is that the qh has already linked dummy into its queue
beforehand (as done in qh_update), so even after 'dummy->hw_token = token;'
is executed on the CPU, the EHCI may still not fetch the up-to-date value of
the qtd (dummy here), and this should be the root cause, IMO.

I will figure out an elegant way to handle this race.

> (By the way, Santosh appears to be right about the earlier wmb() in
> this routine.  As far as I can see, it isn't needed.  David Brownell
> probably wrote it just as a precaution.)
>
>> > I mean others, please read the the last 3 lines of the comment and
>> > compare that to the code lines you added.
>>
>> I see now, the comment of the last 3 lines is wrong, should be
>>
>>                         * inside L2 cache. 'token = dummy->hw_token'
>>                          * after mb() is added for obeying correct mb()
>>                          * usage.
>>
>> But the 'token = dummy->hw_token' after mb() isn't needed any
>> more as described above,  is it?
>
> I don't see why it was ever needed at all.  The compiler will realize
> that token is a dead variable at that point in the routine and will
> optimize the new line away.

Yes, it is wrong, as I said before.

>
>> The patch can make ehci HC see the up-to-date qtd, so make usb transaction
>> executed correctly. If a qtd->token is not updated, maybe IOC is not set or set very
>> late, so interrupt can't be triggered in time, also mistaken 'total bytes to transfer'
>> can make HC work badly.
>
> This doesn't make any sense.  qtd becomes the new dummy; by the time
> the HC sees qtd the only important bit set in qtd->hw_token is the HALT
> bit.

Here, by 'qtd' I mean the 'dummy' in the code; I described it that way
because 'dummy' has been scheduled into the qh queue, so qtd should be its
new identity. Sorry for the slightly confusing description, :-)

>
>> In fact, I have traced the problem and found ehci irq is often delayed by ehci HC.
>> also sometimes ehci irq is lost, so I start to trace ehci driver and find the problem here.
>
> I don't think you have identified the underlying cause correctly.
> Something else must be responsible.

I have explained above, so please point it out if anything is still wrong.

thanks,
--
Ming Lei
Ming Lei Aug. 28, 2011, 3:35 a.m. UTC | #13
Hi,

On Sun, Aug 28, 2011 at 4:11 AM, Alan Stern <stern@rowland.harvard.edu> wrote:
> On Sun, 28 Aug 2011, Ming Lei wrote:
>
>> Looks like there is still another similar problem in qh_link_async():
>> the last wmb
>> should be changed into mb, because HC will read 'head->hw->hw_next' from qh
>> descriptor and this pointer in qh is read only for HC. But this problem can't be
>> observed on ARM, since wmb on ARM is same with mb.
>
> It doesn't matter what the HC does -- the wmb() instruction is executed
> by the CPU, not the HC.  The point of that instruction is to make sure
> that the
>
>        qh->hw->hw_next = head->hw->hw_next;
>
> line (and all the preceding lines as well) is ordered before the
>
>        head->hw->hw_next = dma;
> line.  Since both of these lines are writes, not reads, it suffices to
> use wmb() rather than mb().

Oops, I had added a wmb() after the 'head->hw->hw_next = dma;' line in my
debug code, and did not notice that it was my local modification before
sending out the last mail (I described it as 'the last wmb' in fact). Sorry
for the noise; please ignore it.

thanks,
--
Ming Lei
Russell King - ARM Linux Aug. 28, 2011, 11:36 p.m. UTC | #14
On Sun, Aug 28, 2011 at 01:00:07PM -0400, Alan Stern wrote:
> It won't do that.  All it will do is guarantee that the CPU writes out 
> dumy->hw_token before it writes out or reads in any values executed 
> after the mb.

You're right from the perspective of how things are defined today.  However,
that isn't how things work on ARM.

With ARMv6 and ARMv7, we have weak memory ordering.  This includes so
called "DMA coherent" memory.  This means that the architecture does not
guarantee the order of writes to DMA coherent memory (which is non-
cacheable normal memory) without an intervening 'data synchronization
barrier' (dsb).  Even that may not be sufficient without also poking
at the L2 cache controller.

We get around some of that by ensuring that our MMIO read/write macros
contain the necessary barriers to ensure that DMA memory is up to date
before the DMA agent is programmed.  However, this doesn't cater for
agents which continue to run in the background.
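
The accessor arrangement referred to here looks roughly like this on ARM
(recalled from arch/arm/include/asm/io.h of that era; treat it as a sketch):

	#define __iormb()	rmb()
	#define __iowmb()	wmb()

	#define readl(c)	({ u32 __v = readl_relaxed(c); __iormb(); __v; })
	#define writel(v,c)	({ __iowmb(); writel_relaxed(v,c); })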

These agents will need some kind of barrier to ensure that the write
becomes visible - there's no way to get around that.  Maybe we need
yet another new barrier macro...
Alan Stern Aug. 29, 2011, 1:51 a.m. UTC | #15
On Mon, 29 Aug 2011, Russell King - ARM Linux wrote:

> On Sun, Aug 28, 2011 at 01:00:07PM -0400, Alan Stern wrote:
> > It won't do that.  All it will do is guarantee that the CPU writes out 
> > dumy->hw_token before it writes out or reads in any values executed 
> > after the mb.
> 
> You're right from the perspective of how things are defined today.  However,
> that isn't how things work on ARM.
> 
> With ARMv6 and ARMv7, we have weak memory ordering.  This includes so
> called "DMA coherent" memory.  This means that the architecture does not
> guarantee the order of writes to DMA coherent memory (which is non-
> cacheable normal memory) without an intervening 'data synchronization
> barrier' (dsb).  Even that may not be sufficient without also poking
> at the L2 cache controller.
> 
> We get around some of that by ensuring that our MMIO read/write macros
> contain the necessary barriers to ensure that DMA memory is up to date
> before the DMA agent is programmed.  However, this doesn't cater for
> agents which continue to run in the background.
> 
> These agents will need some kind of barrier to ensure that the write
> becomes visible - there's no way to get around that.  Maybe we need
> yet another new barrier macro...

Hmmm.  Although the semantics of the various mb() macros were
originally defined only for inter-CPU synchronization, I believe they
are also supposed to work for guaranteeing the order of accesses to
DMA-coherent memory.  If that's not the case with ARM, something is
seriously wrong.  (Maybe I'm wrong about this, but if I am then there's
currently _no_ way for the kernel to order DMA-coherent accesses on
ARM.)

You know better than I do what is needed to resolve the ordering issue.  
However, contrary to what the original patch description said, this
isn't entirely a matter of making the write visible to the host
controller: No doubt in time the write will eventually become visible
anyway.  It's a matter of making the write become visible reasonably
quickly and in the correct order with respect to other writes.

Is this extra L2-cache "poke" needed for proper ordering, or is it 
needed merely to flush the write out to memory in a timely manner?

Alan Stern
Russell King - ARM Linux Aug. 29, 2011, 8:52 a.m. UTC | #16
On Sun, Aug 28, 2011 at 09:51:10PM -0400, Alan Stern wrote:
> Hmmm.  Although the semantics of the various mb() macros were
> originally defined only for inter-CPU synchronization, I believe they
> are also supposed to work for guaranteeing the order of accesses to
> DMA-coherent memory.  If that's not the case with ARM, something is
> seriously wrong.  (Maybe I'm wrong about this, but if I am then there's
> currently _no_ way for the kernel to order DMA-coherent accesses on
> ARM.)

That is the case with ARM - mb() and wmb() does everything that's
required.  rmb() is weaker than the other two.

> You know better than I do what is needed to resolve the ordering issue.  
> However, contrary to what the original patch description said, this
> isn't entirely a matter of making the write visible to the host
> controller: No doubt in time the write will eventually become visible
> anyway.  It's a matter of making the write become visible reasonably
> quickly and in the correct order with respect to other writes.

I'm not entirely sure what the problem is - I think it's about a write
by the CPU to dma coherent memory being delayed and not being visible
to the HC in a timely manner.  Either mb() or wmb() placed after the
write on ARM will do that - and ARM has no requirement to do a read-
back after the barrier.

> Is this extra L2-cache "poke" needed for proper ordering, or is it 
> needed merely to flush the write out to memory in a timely manner?

Both, though primarily it's about ensuring correct ordering.  A side
effect of it is that it will flush all pending writes in L2 before
completing.

From the theoretical viewpoint, I think I'm right to say that mb()
doesn't need to provide that level of ordering as it's supposed to be
an inter-CPU barrier - which probably means we need to invent a new
barrier to deal with DMA memory ordering.  However, given the
difficulty of getting the existing barriers placed correctly, I don't
think inventing new barriers is a very good idea.

What we can do is view devices which perform DMA as being strongly
ordered with respect to their memory accesses - iow, they have an
implicit memory barrier before and after their accesses to memory.
This would make the CPUs use of mb() have a conceptual pairing with
the DMA agents.
Alan Stern Aug. 29, 2011, 1:57 p.m. UTC | #17
On Mon, 29 Aug 2011, Russell King - ARM Linux wrote:

> > You know better than I do what is needed to resolve the ordering issue.  
> > However, contrary to what the original patch description said, this
> > isn't entirely a matter of making the write visible to the host
> > controller: No doubt in time the write will eventually become visible
> > anyway.  It's a matter of making the write become visible reasonably
> > quickly and in the correct order with respect to other writes.
> 
> I'm not entirely sure what the problem is - I think its about a write
> by the CPU to dma coherent memory being delayed and not being visible
> to the HC in a timely manner.  Either mb() or wmb() placed after the
> write on ARM will do that - and ARM has no requirement to do a read-
> back after the barrier.

Okay, then this needs to be done in a way that won't slow down other
architectures with an unnecessary memory barrier.  And there needs to
be a comment in the code explaining that the new mb() instruction isn't
being used as a memory barrier but rather to expedite writeback of the
L2 cache.
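
One way to do that might be a small helper along these lines (purely
hypothetical; 'ehci_sync_mem' is an illustrative name, not an existing
function):

	/* hypothetical, e.g. in drivers/usb/host/ehci.h */
	#ifdef CONFIG_ARM_DMA_MEM_BUFFERABLE
	/* not for ordering: expedite writeback of descriptor writes that may
	 * still sit in the outer-cache write buffer */
	#define ehci_sync_mem()		mb()
	#else
	#define ehci_sync_mem()		do { } while (0)
	#endif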

This certainly is starting to sound like something that needs to be 
addressed in the arch-specific #include files...

> > Is this extra L2-cache "poke" needed for proper ordering, or is it 
> > needed merely to flush the write out to memory in a timely manner?
> 
> Both, though primerily it's about ensuring correct ordering.  A side
> effect of it is that it will flush all pending writes in L2 before
> completing.
> 
> From the theoretical viewpoint, I think I'm right to say that mb()
> doesn't need to provide that level of ordering as its supposed to be
> an inter-CPU barrier - which probably means we need to invent a new
> barrier to deal with DMA memory ordering.  However, given the
> difficulty of getting the existing barriers placed correctly, I don't
> think inventing new barriers is a very good idea.
> 
> What we can do is view devices which perform DMA as being strongly
> ordered with respect to their memory accesses - iow, they have an
> implicit memory barrier before and after their accesses to memory.
> This would make the CPUs use of mb() have a conceptual pairing with
> the DMA agents.

Yes, that's the model I have been using all along.  After all, if a DMA 
master carries out its memory accesses in some random order then it's 
impossible for the CPU to make any guarantees.

Alan Stern
Ming Lei Aug. 29, 2011, 2:25 p.m. UTC | #18
Hi,

On Mon, Aug 29, 2011 at 1:00 AM, Alan Stern <stern@rowland.harvard.edu> wrote:
> On Sun, 28 Aug 2011, Ming Lei wrote:
>
>> > I've been following this whole discussion.  I didn't understand the
>> > idea behind the original patch, and I don't understand this.  What
>> > reason is there for adding a memory barrier after the "dummy->hw_token
>> > = token" line?  Such a barrier would not accomplish anything useful,
>> > because there are no later memory accesses which need to be ordered
>> > with respect to that line.
>>
>> Here, mb is used to synchronize  between writing of CPU and reading of
>> ehci HC, see below:
>
> A memory barrier never synchronizes events between two processors (for
> example, between the CPU and HC).  It only affects ordering for the
> processor it executes on.  That's why you always have to use memory
> barriers in pairs for inter-CPU synchronization: one memory barrier
> instruction for each CPU.

IMO, for the SMP case, memory barrier pairs (smp_*) are required to
synchronize events between two processors. mb/wmb/rmb are mainly used to
synchronize events between the CPU and a device, and one barrier may be
enough, for example:

	CPU							device	
	A=1;
	wmb
	B=2;
									read B
									read A

One wmb is used to order 'A=1' and 'B=2', which makes the two writes reach
physical memory in that order: 'A=1' first, 'B=2' second. Then the device
can observe the two write events in the order above, so if the device has
seen 'B==2', it will surely see 'A==1'.
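
A minimal C sketch of that pattern (the struct and names are made up for
illustration):

	struct shared_desc {
		u32 data;	/* "A" */
		u32 ready;	/* "B" */
	};

	/* CPU side (producer) */
	static void publish(struct shared_desc *s, u32 v)
	{
		s->data = v;	/* A = 1 */
		wmb();		/* make sure A reaches memory before B */
		s->ready = 1;	/* B = 2 */
	}

	/*
	 * Device side (consumer): if it observes ready != 0, it must also
	 * observe the new data, because the writes became visible in order.
	 */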

>
>>       CPU                                                             EHCI
>>       dummy->hw_token = token;
>>       mb
>>                                                                       read  dummy->hw_token
>>
>> The mb() introduced  was supposed to be used to make sure that EHCI
>> can see the correct value of  dummy->hw_token.
>
> It won't do that.  All it will do is guarantee that the CPU writes out
> dumy->hw_token before it writes out or reads in any values executed
> after the mb.
>
>>  If EHCI can't get the correct
>> hw_token in time, it will work badly, such as irq delay or irq lost which may be
>> caused by mistaken 'IOC' or 'total bytes to transfer' in hw_token.
>
> No.  All that will happen is that the HC will execute the qTD a little
> later, when it does read the correct hw_token.  No IRQs will be lost,
> and there will be no mistaken IOC or "total bytes to transfer" values.

IMO, the dummy has already been linked into the queue pointed to by
qh->hw->hw_qtd_next, so the EHCI will fetch the dummy qtd and execute the
transaction without any delay.

Let me explain the problem again: on ARM, the wmb() before
'dummy->hw_token = token;' will flush the L2 write buffer into memory, and
all parts of 'dummy' except for the hw_token field have already reached
memory; but dummy->hw_token will stay in the L2 write buffer and not reach
memory at this time, so the EHCI may fetch an inconsistent qtd and execute
it, and then a mistaken IOC or "total bytes to transfer" is read by the EHCI
and causes a delayed or lost irq.

This is not just reasoning or a guess; I have definitely traced this kind of
behaviour.

>
>>  This is just
>> what the patch is to fix.
>
> The patch won't fix this.  You're trying to force the CPU to write out
> dummy->hw_token more quickly.  A memory barrier won't do that.  And
> even if the CPU does write out the value more quickly, that doesn't
> guarantee it will be in time for the HC to see the new value right
> away.
>
>> But I think the above is still not correct, and the correct way may be
>> like below:
>>
>>       CPU                                                             EHCI
>>       dummy->hw_token = token;
>>       wmb
>>       qh->hw->hw_qtd_next = QTD_NEXT(ehci, dummy->qtd_dma);
>
> If you do this write, you will corrupt the hardware list.  You _must_
> _not_ modify the hardware fields of an active QH.

Yes, I agree. hw_qtd_next is written by the HC and by qh_refresh(), and it
should be written by the CPU only when the qh is in the idle state.

> Besides, qh_link_async() calls qh_refresh() -> qh_update(), which does
> this already.  And it does this correctly, with the appropriate checks
> to make sure the QH isn't active.
>
>>                                                                       fetch next qtd from qh->hw->hw_qtd_next
>>                                                                       read  qtd->hw_token
>>
>> The problem is that qh has already linked dummy into its queue before(as did in
>> qh_update), so even after 'dummy->hw_token = token;' is executed on CPU, EHCI
>> still may not fetch the up-to-date value of the qtd(dummy in here),
>> and this should
>> be the root cause, IMO.
>
> There's no way to fix this.  The CPU and the HC are independent.  The
> HC is allowed to cache values whenever it likes (within certain
> limits).  If it has already cached an old value from the qTD, there's
> no way to force it to use a new value.  You just have to wait until it
> refreshes its cache.

I don't think the EHCI caches anything inside the hardware; it will always
fetch a qtd linked into the qh from memory (not from a cache) into the qh
transfer overlay. This can be found in EHCI spec section 4.10.2.

As I explained above in the 'wmb' example, I hope we can use the current
memory barrier rules to fix this kind of problem and avoid inventing a new
barrier. But because we can't write qh->hw->hw_qtd_next when the qh has been
linked into the EHCI async queue, that is not doable.

thanks,
--
Ming Lei
Alan Stern Aug. 29, 2011, 3:03 p.m. UTC | #19
On Mon, 29 Aug 2011, Ming Lei wrote:

> IMO, the dummy has been linked into queue pointed by qh->hw->hw_qtd_next,
> so EHCI will fetch dummy qtd and execute the transaction and will not have
> any delay on the transaction.
> 
> Let me explain the problem again: On ARM, the wmb() before
> 'dummy->hw_token = token;'
> will flush l2 write buffer into memory and all parts of 'dummy' except
> for hw_token field
> have reached into memory already, but dummy->hw_token will stay at l2
> write buffer
> and not reach into memory at this time, so ehci may fetch a
> inconsistent qtd and execute it,
> then mistaken IOC or "total bytes to transfer"  are read by EHCI and
> cause delayed irq
> or lost irq.

No.  Even if the HC reads dummy before dummy->hw_token has been written
out to memory from the L2 cache, it will not see any inconsistencies.  
It will see the old value in hw_token, which has the ACTIVE bit clear.  
Therefore it will not try to execute the qTD but will move on to the
next QH.  See what the fourth paragraph in section 4.10.2 of the EHCI 
spec says about the case where a qTD's ACTIVE bit is set to 0.

Some EHCI implementations have a quirk, in which they perform the 
overlay even when ACTIVE is clear.  But even these implementations 
won't try to execute the qTD, because the old value of dummy->hw_token 
also has the HALT bit set.
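
(For reference, the dummy qTD starts out with only the HALT status bit set;
its initialization looks roughly like the following, paraphrased from
ehci-mem.c, so details may differ:)

	static inline void ehci_qtd_init(struct ehci_hcd *ehci,
			struct ehci_qtd *qtd, dma_addr_t dma)
	{
		memset(qtd, 0, sizeof *qtd);
		qtd->qtd_dma = dma;
		qtd->hw_token = cpu_to_hc32(ehci, QTD_STS_HALT);	/* not ACTIVE */
		qtd->hw_next = EHCI_LIST_END(ehci);
		qtd->hw_alt_next = EHCI_LIST_END(ehci);
		INIT_LIST_HEAD(&qtd->qtd_list);
	}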

> It is not only a reasoning or guess, and I have traced this kind of
> fact certainly.

If your controller behaves as you suggest then it is buggy.  And in
that case, adding another memory barrier won't fix it.  There is still
the possibility that the HC will read dummy during the brief time after
the existing wmb() and before the CPU has written dummy->hw_token.

Alan Stern
Ming Lei Aug. 29, 2011, 3:21 p.m. UTC | #20
Hi,

On Mon, Aug 29, 2011 at 11:03 PM, Alan Stern <stern@rowland.harvard.edu> wrote:
> On Mon, 29 Aug 2011, Ming Lei wrote:
>
>> IMO, the dummy has been linked into queue pointed by qh->hw->hw_qtd_next,
>> so EHCI will fetch dummy qtd and execute the transaction and will not have
>> any delay on the transaction.
>>
>> Let me explain the problem again: On ARM, the wmb() before
>> 'dummy->hw_token = token;'
>> will flush l2 write buffer into memory and all parts of 'dummy' except
>> for hw_token field
>> have reached into memory already, but dummy->hw_token will stay at l2
>> write buffer
>> and not reach into memory at this time, so ehci may fetch a
>> inconsistent qtd and execute it,
>> then mistaken IOC or "total bytes to transfer"  are read by EHCI and
>> cause delayed irq
>> or lost irq.
>
> No.  Even if the HC reads dummy before dummy->hw_token has been written
> out to memory from the L2 cache, it will not see any inconsistencies.
> It will see the old value in hw_token, which has the ACTIVE bit clear.
> Therefore it will not try to execute the qTD but will move on to the
> next QH.  See what the fourth paragraph in section 4.10.2 of the EHCI
> spec says about the case where a qTD's ACTIVE bit is set to 0.

Suppose HC can see the old value in hw_token, which has the ACTIVE bit clear.

The qtd transaction will not be executed until the new token is flushed
into memory. From the CPU's point of view, the irq is still delayed, because
the IOC irq is generated only after the qtd transaction is executed, which
happens once the new token has been flushed into memory. The delay depends
on how long it takes for the token to be flushed into memory.

From my observation, the delay is commonly over 5ms for the CSW, and
sometimes more than 20ms, which causes the bad performance.

Also, I am not sure whether the EHCI can read the old hw_token correctly in
this kind of inconsistent memory state.

>
> Some EHCI implementations have a quirk, in which they perform the
> overlay even when ACTIVE is clear.  But even these implementations
> won't try to execute the qTD, because the old value of dummy->hw_token
> also has the HALT bit set.
>
>> It is not only a reasoning or guess, and I have traced this kind of
>> fact certainly.
>
> If your controller behaves as you suggest then it is buggy.  And in
> that case, adding another memory barrier won't fix it.  There is still
> the possibility that the HC will read dummy during the brief time after
> the existing wmb() and before the CPU has written dummy->hw_token.

No, the mb() after 'dummy->hw_token = token' does fix the problem. As said
above, the IOC IRQ is definitely delayed from the CPU's point of view.


thanks,
--
Ming Lei
Ming Lei Aug. 29, 2011, 3:55 p.m. UTC | #21
Hi,

On Mon, Aug 29, 2011 at 9:57 PM, Alan Stern <stern@rowland.harvard.edu> wrote:
> On Mon, 29 Aug 2011, Russell King - ARM Linux wrote:
>
>> > You know better than I do what is needed to resolve the ordering issue.
>> > However, contrary to what the original patch description said, this
>> > isn't entirely a matter of making the write visible to the host
>> > controller: No doubt in time the write will eventually become visible
>> > anyway.  It's a matter of making the write become visible reasonably
>> > quickly and in the correct order with respect to other writes.
>>
>> I'm not entirely sure what the problem is - I think its about a write
>> by the CPU to dma coherent memory being delayed and not being visible
>> to the HC in a timely manner.  Either mb() or wmb() placed after the
>> write on ARM will do that - and ARM has no requirement to do a read-
>> back after the barrier.
>
> Okay, then this needs to be done in a way that won't slow down other
> architectures with an unnecessary memory barrier.  And there needs to
> be a comment in the code explaining that the new mb() instruction isn't
> being used as a memory barrier but rather to expedite writeback of the
> L2 cache.

If a write to coherent memory can't reach physical memory immediately on
other architectures, the problem can still happen there as well. But I am
not sure whether there are such architectures other than ARM.

Anyway, the current memory barriers in qh_append_tds() can't prevent the
problem from happening on ARM.

If there is no better solution, maybe we have to use mb() after
'dummy->hw_token = token' to fix the problem.

>
> This certainly is starting to sound like something that needs to be
> addressed in the arch-specific #include files...
>
>> > Is this extra L2-cache "poke" needed for proper ordering, or is it
>> > needed merely to flush the write out to memory in a timely manner?
>>
>> Both, though primerily it's about ensuring correct ordering.  A side
>> effect of it is that it will flush all pending writes in L2 before
>> completing.
>>
>> From the theoretical viewpoint, I think I'm right to say that mb()
>> doesn't need to provide that level of ordering as its supposed to be
>> an inter-CPU barrier - which probably means we need to invent a new
>> barrier to deal with DMA memory ordering.  However, given the
>> difficulty of getting the existing barriers placed correctly, I don't
>> think inventing new barriers is a very good idea.
>>
>> What we can do is view devices which perform DMA as being strongly
>> ordered with respect to their memory accesses - iow, they have an
>> implicit memory barrier before and after their accesses to memory.
>> This would make the CPUs use of mb() have a conceptual pairing with
>> the DMA agents.
>
> Yes, that's the model I have been using all along.  After all, if a DMA
> master carries out its memory accesses in some random order then it's
> impossible for the CPU to make any guarantees.
>
> Alan Stern
>
>
Mark Salter Aug. 29, 2011, 4:24 p.m. UTC | #22
On Mon, 2011-08-29 at 23:55 +0800, Ming Lei wrote:
> If writing to coherent memory can't reach physical memory immediately on
> other ARCHs,  the problem can still happen on these ARCHs. But I am
> not sure if there are these kind of ARCHs except for ARM.

If there is write buffering which prevents outside bus masters from
seeing the data, can we really call it coherent memory?

--Mark
Ming Lei Aug. 30, 2011, 2:02 p.m. UTC | #23
Hi Alan,

Thanks for your comment.

On Tue, Aug 30, 2011 at 12:33 AM, Alan Stern <stern@rowland.harvard.edu> wrote:
> Nevertheless, it remains true that you want to add a memory barrier
> instruction simply in order to speed up a cache writeback, not to force
> any sort of ordering.  This needs to be explained carefully in the code
> (not just in the patch description!) and it needs to be done in a way
> that won't affect other architectures.

OK, will do it.

> Also, as you mentioned before, you may want to do the same thing in
> qh_link_async() just after the
>
>        head->hw->hw_next = dma;
>
> line.  Delays in flushing that write would also slow down performance.

Will do it too.


thanks,
--
Ming Lei

Patch

diff --git a/drivers/usb/host/ehci-q.c b/drivers/usb/host/ehci-q.c
index 0917e3a..65b5021 100644
--- a/drivers/usb/host/ehci-q.c
+++ b/drivers/usb/host/ehci-q.c
@@ -1082,6 +1082,20 @@  static struct ehci_qh *qh_append_tds (
 			wmb ();
 			dummy->hw_token = token;
 
+			/* The mb() below is added to make sure that
+			 * 'token' can be writen into qtd, so that ehci
+			 * HC can see the up-to-date qtd descriptor. On
+			 * some archs(at least on ARM Cortex A9 dual core),
+			 * writing into coherenet memory doesn't mean the
+			 * value written can reach physical memory
+			 * immediately, and the value may be buffered
+			 * inside L2 cache. 'dummy->hw_token = token;'
+			 * after mb() is added for obeying correct mb()
+			 * usage.
+			 * */
+			mb();
+			token = dummy->hw_token;
+
 			urb->hcpriv = qh_get (qh);
 		}
 	}