
[RFC,3/4] ARM: bL_entry: Match memory barriers to architectural requirements

Message ID 1358268498-8086-4-git-send-email-dave.martin@linaro.org (mailing list archive)
State New, archived

Commit Message

tip-bot for Dave Martin Jan. 15, 2013, 4:48 p.m. UTC
For architectural correctness even Strongly-Ordered memory accesses
require barriers in order to guarantee that multiple CPUs have a
coherent view of the ordering of memory accesses.

Virtually everything done by this early code is done via explicit
memory access only, so DSBs are seldom required.  Existing barriers
are demoted to DMB, except where a DSB is needed to synchronise
non-memory signalling (i.e., before a SEV).  If a particular
platform performs cache maintenance in its power_up_setup function,
it should force it to complete explicitly including a DSB, instead
of relying on the bL_head framework code to do it.
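
A minimal sketch of such a hook (an editorial illustration, not code from
this series; the name and the body are assumptions, though r0 does carry
the affinity level, as the patch shows for the cluster-level call):

@ Hypothetical platform power_up_setup: it completes its own maintenance
@ with a DSB instead of relying on a barrier in bL_head.S.
my_power_up_setup:
	cmp	r0, #0			@ affinity level: 0 = CPU, 1 = cluster
	bxne	lr			@ this sketch only does CPU-level work
	@ ... platform cache/coherency maintenance for this CPU ...
	dsb				@ force that maintenance to complete
	bx	lr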

Some additional DMBs are added to ensure all the memory ordering
properties required by the race avoidance algorithm.  DMBs are also
moved out of loops, and for clarity some are moved so that most
directly follow the memory operation which needs to be
synchronised.

The setting of a CPU's bL_entry_vectors[] entry is also required to
act as a synchronisation point, so a DMB is added after checking
that entry to ensure that other CPUs do not observe gated
operations leaking across the opening of the gate.
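
As a rough illustration of the SEV and gate points above (an editorial
sketch, not code from the patch; r4, r5 and r6 are used as in the final
hunk, where "ldr r5, [r6, r4, lsl #2]" fetches this CPU's entry vector):

@ A CPU opening the gate for a peer (sketched only for symmetry with the
@ waiter below):
	str	r1, [r6, r4, lsl #2]	@ publish the peer's entry vector
	dsb				@ DSB, because the SEV below is
	sev				@ non-memory signalling

@ A CPU waiting at the gate (cf. bL_entry_gated):
1:	ldr	r5, [r6, r4, lsl #2]	@ r5 = this CPU's entry vector
	cmp	r5, #0
	wfeeq				@ sleep while the gate is closed
	beq	1b
	dmb				@ gated operations must not be observed
					@ before the vector read that opened it
	bx	r5			@ (illustrative) jump through the gate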

Signed-off-by: Dave Martin <dave.martin@linaro.org>
---
 arch/arm/common/bL_head.S |   21 +++++++++++----------
 1 files changed, 11 insertions(+), 10 deletions(-)

Comments

Santosh Shilimkar Jan. 16, 2013, 6:50 a.m. UTC | #1
+ Catalin, RMK

Dave,

On Tuesday 15 January 2013 10:18 PM, Dave Martin wrote:
> For architectural correctness even Strongly-Ordered memory accesses
> require barriers in order to guarantee that multiple CPUs have a
> coherent view of the ordering of memory accesses.
>
> Virtually everything done by this early code is done via explicit
> memory access only, so DSBs are seldom required.  Existing barriers
> are demoted to DMB, except where a DSB is needed to synchronise
> non-memory signalling (i.e., before a SEV).  If a particular
> platform performs cache maintenance in its power_up_setup function,
> it should force it to complete explicitly including a DSB, instead
> of relying on the bL_head framework code to do it.
>
> Some additional DMBs are added to ensure all the memory ordering
> properties required by the race avoidance algorithm.  DMBs are also
> moved out of loops, and for clarity some are moved so that most
> directly follow the memory operation which needs to be
> synchronised.
>
> The setting of a CPU's bL_entry_vectors[] entry is also required to
> act as a synchronisation point, so a DMB is added after checking
> that entry to ensure that other CPUs do not observe gated
> operations leaking across the opening of the gate.
>
> Signed-off-by: Dave Martin <dave.martin@linaro.org>
> ---

Sorry to pick on this again, but I am not able to understand why
strongly ordered accesses need barriers.  At least from the
ARM point of view, a strongly ordered write will be more of a blocking
write, and the interconnect further out is also supposed to respect that
rule.  SO reads and writes are like adding a barrier after every load and
store, so adding explicit barriers doesn't make sense.  Is this a side
effect of some "write early response" kind of optimization at the
interconnect level?
Will you be able to point to specs or documents which state
this requirement?

Regards
santosh
tip-bot for Dave Martin Jan. 16, 2013, 11:49 a.m. UTC | #2
On Wed, Jan 16, 2013 at 12:20:47PM +0530, Santosh Shilimkar wrote:
> + Catalin, RMK
> 
> Dave,
> 
> On Tuesday 15 January 2013 10:18 PM, Dave Martin wrote:
> >For architectural correctness even Strongly-Ordered memory accesses
> >require barriers in order to guarantee that multiple CPUs have a
> >coherent view of the ordering of memory accesses.
> >
> >Virtually everything done by this early code is done via explicit
> >memory access only, so DSBs are seldom required.  Existing barriers
> >are demoted to DMB, except where a DSB is needed to synchronise
> >non-memory signalling (i.e., before a SEV).  If a particular
> >platform performs cache maintenance in its power_up_setup function,
> >it should force it to complete explicitly including a DSB, instead
> >of relying on the bL_head framework code to do it.
> >
> >Some additional DMBs are added to ensure all the memory ordering
> >properties required by the race avoidance algorithm.  DMBs are also
> >moved out of loops, and for clarity some are moved so that most
> >directly follow the memory operation which needs to be
> >synchronised.
> >
> >The setting of a CPU's bL_entry_vectors[] entry is also required to
> >act as a synchronisation point, so a DMB is added after checking
> >that entry to ensure that other CPUs do not observe gated
> >operations leaking across the opening of the gate.
> >
> >Signed-off-by: Dave Martin <dave.martin@linaro.org>
> >---
> 
> Sorry to pick on this again but I am not able to understand why
> the strongly ordered access needs barriers. At least from the
> ARM point of view, a strongly ordered write will be more of blocking
> write and the further interconnect also is suppose to respect that

This is what I originally assumed (hence the absence of barriers in
the initial patch).

> rule. SO read writes are like adding barrier after every load store

This assumption turns out to be wrong, unfortunately, although in
a uniprocessor scenario it makes no difference.  An SO memory access
does block the CPU making the access, but explicitly does not
block the interconnect.

In a typical boot scenario for example, all secondary CPUs are
quiescent or powered down, so there's no problem.  But we can't make
the same assumptions when we're trying to coordinate between
multiple active CPUs.

> so adding explicit barriers doesn't make sense. Is this a side
> effect of some "write early response" kind of optimizations at
> interconnect level ?

Strongly-Ordered accesses are always non-shareable, so there is
no explicit guarantee of coherency between multiple masters.

If there is only one master, it makes no difference, but if there
are multiple masters, there is no guarantee that they are connected
to a slave device (DRAM controller in this case) via a single
slave port.

The architecture only guarantees global serialisation when there is a
single slave device, but provides no way to know whether two accesses
from different masters will reach the same slave port.  This is in the
realms of "implementation defined."

Unfortunately, a high-performance component like a DRAM controller
is exactly the kind of component which may implement multiple
master ports, so you can't guarantee that accesses are serialised
in the same order from the perspective of all masters.  There may
be some pipelining and caching between each master port and the actual
memory, for example.  This is allowed, because there is no requirement
for the DMC to look like a single slave device from the perspective
of multiple masters.

A multi-ported slave might provide transparent coherency between master
ports, but it is only required to guarantee this when the accesses
are shareable (SO is always non-shared), or when explicit barriers
are used to force synchronisation between the device's master ports.
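
As a concrete editorial sketch of that failure mode, take two hypothetical
Strongly-Ordered locations LOC_A and LOC_B whose addresses are held in r2
and r4, and which may be reached through different ports of such a slave:

@ CPU0 publishes some data and then a flag:
	str	r1, [r2]		@ LOC_A: the data
	dmb				@ without this, the two stores may
					@ become visible to CPU1 reversed
	str	r3, [r4]		@ LOC_B: the flag

@ CPU1 polls the flag, then reads the data:
1:	ldr	r0, [r4]		@ LOC_B
	cmp	r0, #0
	beq	1b
	dmb				@ pairs with CPU0's dmb
	ldr	r5, [r2]		@ LOC_A: now guaranteed to see the data

On a controller with a single slave port the two dmbs are typically
redundant, but nothing lets generic code rely on that.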

Of course, a given platform may have a DMC with only one slave
port, in which case the barriers should not be needed.  But I wanted
this code to be generic enough to be reusable -- hence the
addition of the barriers.  The CPU does not need to wait for a DMB
to "complete" in any sense, so this does not necessarily have a
meaningful impact on performance.

This is my understanding anyway.

> Will you be able to point to specs or documents which puts
> this requirement ?

Unfortunately, this is one of those things which we require not because
there is a statement in the ARM ARM to say that we need it -- rather,
there is no statement in the ARM ARM to say that we don't.

Cheers
---Dave
Santosh Shilimkar Jan. 16, 2013, 12:11 p.m. UTC | #3
On Wednesday 16 January 2013 05:19 PM, Dave Martin wrote:
> On Wed, Jan 16, 2013 at 12:20:47PM +0530, Santosh Shilimkar wrote:
>> + Catalin, RMK
>>
>> Dave,
>>
>> On Tuesday 15 January 2013 10:18 PM, Dave Martin wrote:
>>> For architectural correctness even Strongly-Ordered memory accesses
>>> require barriers in order to guarantee that multiple CPUs have a
>>> coherent view of the ordering of memory accesses.
>>>
>>> Virtually everything done by this early code is done via explicit
>>> memory access only, so DSBs are seldom required.  Existing barriers
>>> are demoted to DMB, except where a DSB is needed to synchronise
>>> non-memory signalling (i.e., before a SEV).  If a particular
>>> platform performs cache maintenance in its power_up_setup function,
>>> it should force it to complete explicitly including a DSB, instead
>>> of relying on the bL_head framework code to do it.
>>>
>>> Some additional DMBs are added to ensure all the memory ordering
>>> properties required by the race avoidance algorithm.  DMBs are also
>>> moved out of loops, and for clarity some are moved so that most
>>> directly follow the memory operation which needs to be
>>> synchronised.
>>>
>>> The setting of a CPU's bL_entry_vectors[] entry is also required to
>>> act as a synchronisation point, so a DMB is added after checking
>>> that entry to ensure that other CPUs do not observe gated
>>> operations leaking across the opening of the gate.
>>>
>>> Signed-off-by: Dave Martin <dave.martin@linaro.org>
>>> ---
>>
>> Sorry to pick on this again but I am not able to understand why
>> the strongly ordered access needs barriers. At least from the
>> ARM point of view, a strongly ordered write will be more of blocking
>> write and the further interconnect also is suppose to respect that
>
> This is what I originally assumed (hence the absence of barriers in
> the initial patch).
>
>> rule. SO read writes are like adding barrier after every load store
>
> This assumption turns out to be wrong, unfortunately, although in
> a uniprocessor scenario it makes no difference.  An SO memory access
> does block the CPU making the access, but explicitly does not
> block the interconnect.
>
I suspected the interconnect part when you described the barrier
need for SO memory region.

> In a typical boot scenario for example, all secondary CPUs are
> quiescent or powered down, so there's no problem.  But we can't make
> the same assumptions when we're trying to coordinate between
> multiple active CPUs.
>
>> so adding explicit barriers doesn't make sense. Is this a side
>> effect of some "write early response" kind of optimizations at
>> interconnect level ?
>
> Strongly-Ordered accesses are always non-shareable, so there is
> no explicit guarantee of coherency between multiple masters.
>
This is probably where the issue is, then. My understanding is exactly
the opposite here, and hence I wasn't worried about the multi-master
CPU scenario, since the shareable attribute would be taking care of it,
considering the same page tables being used in an SMP system.

ARM documentation says -
------------
Shareability and the S bit, with TEX remap
The memory type of a region, as indicated in the Memory type column of 
Table B3-12 on page B3-1350, provides
the first level of control of whether the region is shareable:
• If the memory type is Strongly-ordered then the region is Shareable
------------------------------------------------------------

> If there is only one master, it makes no difference, but if there
> are multiple masters, there is no guarantee that they are connected
> to a slave device (DRAM controller in this case) via a single
> slave port.
>
See above. We are talking about multiple CPUs here and not
the DSP or other co-processors. In either case we are discussing
code which is getting executed on ARM CPUs, so we can safely limit
it to the multi-master ARM CPU case.

> The architecture only guarantees global serialisation when there is a
> single slave device, but provides no way to know whether two accesses
> from different masters will reach the same slave port.  This is in the
> realms of "implementation defined."
>
> Unfortunately, a high-performance component like a DRAM controller
> is exactly the kind of component which may implement multiple
> master ports, so you can't guarantee that accesses are serialised
> in the same order from the perspective of all masters.  There may
> be some pipelining and caching between each master port and the actual
> memory, for example.  This is allowed, because there is no requirement
> for the DMC to look like a single slave device from the perspective
> of multiple masters.
>
> A multi-ported slave might provide transparent coherency between master
> ports, but it is only required to guarantee this when the accesses
> are shareable (SO is always non-shared), or when explicit barriers
> are used to force synchronisation between the device's master ports.
>
> Of course, a given platform may have a DMC with only one slave
> port, in which case the barriers should not be needed.  But I wanted
> this code to be generic enough to be reusable -- hence the
> addition of the barriers.  The CPU does not need to wait for a DMB
> to "complete" in any sense, so this does not necessarily have a
> meaningful impact on performance.
>
> This is my understanding anyway.
>
>> Will you be able to point to specs or documents which puts
>> this requirement ?
>
> Unfortunately, this is one of those things which we require not because
> there is a statement in the ARM ARM to say that we need it -- rather,
> there is no statement in the ARM ARM to say that we don't.
>
Thanks a lot for the elaborate answer. It helps to understand the rationale
at least.

Regards
Santosh
tip-bot for Dave Martin Jan. 16, 2013, 12:47 p.m. UTC | #4
On Wed, Jan 16, 2013 at 05:41:00PM +0530, Santosh Shilimkar wrote:
> On Wednesday 16 January 2013 05:19 PM, Dave Martin wrote:
> >On Wed, Jan 16, 2013 at 12:20:47PM +0530, Santosh Shilimkar wrote:
> >>+ Catalin, RMK
> >>
> >>Dave,
> >>
> >>On Tuesday 15 January 2013 10:18 PM, Dave Martin wrote:
> >>>For architectural correctness even Strongly-Ordered memory accesses
> >>>require barriers in order to guarantee that multiple CPUs have a
> >>>coherent view of the ordering of memory accesses.
> >>>
> >>>Virtually everything done by this early code is done via explicit
> >>>memory access only, so DSBs are seldom required.  Existing barriers
> >>>are demoted to DMB, except where a DSB is needed to synchronise
> >>>non-memory signalling (i.e., before a SEV).  If a particular
> >>>platform performs cache maintenance in its power_up_setup function,
> >>>it should force it to complete explicitly including a DSB, instead
> >>>of relying on the bL_head framework code to do it.
> >>>
> >>>Some additional DMBs are added to ensure all the memory ordering
> >>>properties required by the race avoidance algorithm.  DMBs are also
> >>>moved out of loops, and for clarity some are moved so that most
> >>>directly follow the memory operation which needs to be
> >>>synchronised.
> >>>
> >>>The setting of a CPU's bL_entry_vectors[] entry is also required to
> >>>act as a synchronisation point, so a DMB is added after checking
> >>>that entry to ensure that other CPUs do not observe gated
> >>>operations leaking across the opening of the gate.
> >>>
> >>>Signed-off-by: Dave Martin <dave.martin@linaro.org>
> >>>---
> >>
> >>Sorry to pick on this again but I am not able to understand why
> >>the strongly ordered access needs barriers. At least from the
> >>ARM point of view, a strongly ordered write will be more of blocking
> >>write and the further interconnect also is suppose to respect that
> >
> >This is what I originally assumed (hence the absence of barriers in
> >the initial patch).
> >
> >>rule. SO read writes are like adding barrier after every load store
> >
> >This assumption turns out to be wrong, unfortunately, although in
> >a uniprocessor scenario it makes no difference.  An SO memory access
> >does block the CPU making the access, but explicitly does not
> >block the interconnect.
> >
> I suspected the interconnect part when you described the barrier
> need for SO memory region.
> 
> >In a typical boot scenario for example, all secondary CPUs are
> >quiescent or powered down, so there's no problem.  But we can't make
> >the same assumptions when we're trying to coordinate between
> >multiple active CPUs.
> >
> >>so adding explicit barriers doesn't make sense. Is this a side
> >>effect of some "write early response" kind of optimizations at
> >>interconnect level ?
> >
> >Strongly-Ordered accesses are always non-shareable, so there is
> >no explicit guarantee of coherency between multiple masters.
> >
> This is where probably issue then. My understanding is exactly
> opposite here and hence I wasn't worried about multi-master
> CPU scenario since sharable attributes would be taking care of it
> considering the same page tables being used in SMP system.
> 
> ARM documentation says -
> ------------
> Shareability and the S bit, with TEX remap
> The memory type of a region, as indicated in the Memory type column
> of Table B3-12 on page B3-1350, provides
> the first level of control of whether the region is shareable:
> • If the memory type is Strongly-ordered then the region is Shareable
> ------------------------------------------------------------

Hmmm, it looks like you're right here.  My assumption that SO implies
non-shareable is wrong.  This is backed up by:

A3.5.6 Device and Strongly-ordered memory

"Address locations marked as Strongly-ordered [...] are always treated
as Shareable."


I think this is sufficient to ensure that if two CPUs access the same
location with SO accesses, each will see an access order to any single
location which is consistent with the program order of the accesses on
the other CPUs.  (This comes from the glossary definition of Coherent.)

However, I can't see any general guarantee for accesses to _different_
locations, beyond the guarantees for certain special cases given in
A3.8.2 Ordering requirements for memory accesses (address and control
dependencies etc.)
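
One of those special cases, sketched for illustration, is an address
dependency (r1 here is assumed to hold the address of a pointer published
by another CPU): because the second load's address is computed from the
value returned by the first, the two reads are observed in program order
without a barrier, whereas two independent loads would need an
intervening DMB.

	ldr	r0, [r1]		@ load the published pointer
	ldr	r2, [r0]		@ address-dependent load through it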

This may make some of the dmbs unnecessary, but it is not clear whether
they are all unnecessary.


I'll need to follow up on this and see if we can get an answer.

Cheers
---Dave
Santosh Shilimkar Jan. 16, 2013, 2:36 p.m. UTC | #5
On Wednesday 16 January 2013 06:17 PM, Dave Martin wrote:
> On Wed, Jan 16, 2013 at 05:41:00PM +0530, Santosh Shilimkar wrote:
>> On Wednesday 16 January 2013 05:19 PM, Dave Martin wrote:
>>> On Wed, Jan 16, 2013 at 12:20:47PM +0530, Santosh Shilimkar wrote:
>>>> + Catalin, RMK
>>>>
>>>> Dave,
>>>>
>>>> On Tuesday 15 January 2013 10:18 PM, Dave Martin wrote:
>>>>> For architectural correctness even Strongly-Ordered memory accesses
>>>>> require barriers in order to guarantee that multiple CPUs have a
>>>>> coherent view of the ordering of memory accesses.
>>>>>
>>>>> Virtually everything done by this early code is done via explicit
>>>>> memory access only, so DSBs are seldom required.  Existing barriers
>>>>> are demoted to DMB, except where a DSB is needed to synchronise
>>>>> non-memory signalling (i.e., before a SEV).  If a particular
>>>>> platform performs cache maintenance in its power_up_setup function,
>>>>> it should force it to complete explicitly including a DSB, instead
>>>>> of relying on the bL_head framework code to do it.
>>>>>
>>>>> Some additional DMBs are added to ensure all the memory ordering
>>>>> properties required by the race avoidance algorithm.  DMBs are also
>>>>> moved out of loops, and for clarity some are moved so that most
>>>>> directly follow the memory operation which needs to be
>>>>> synchronised.
>>>>>
>>>>> The setting of a CPU's bL_entry_vectors[] entry is also required to
>>>>> act as a synchronisation point, so a DMB is added after checking
>>>>> that entry to ensure that other CPUs do not observe gated
>>>>> operations leaking across the opening of the gate.
>>>>>
>>>>> Signed-off-by: Dave Martin <dave.martin@linaro.org>
>>>>> ---
>>>>
>>>> Sorry to pick on this again but I am not able to understand why
>>>> the strongly ordered access needs barriers. At least from the
>>>> ARM point of view, a strongly ordered write will be more of blocking
>>>> write and the further interconnect also is suppose to respect that
>>>
>>> This is what I originally assumed (hence the absence of barriers in
>>> the initial patch).
>>>
>>>> rule. SO read writes are like adding barrier after every load store
>>>
>>> This assumption turns out to be wrong, unfortunately, although in
>>> a uniprocessor scenario it makes no difference.  An SO memory access
>>> does block the CPU making the access, but explicitly does not
>>> block the interconnect.
>>>
>> I suspected the interconnect part when you described the barrier
>> need for SO memory region.
>>
>>> In a typical boot scenario for example, all secondary CPUs are
>>> quiescent or powered down, so there's no problem.  But we can't make
>>> the same assumptions when we're trying to coordinate between
>>> multiple active CPUs.
>>>
>>>> so adding explicit barriers doesn't make sense. Is this a side
>>>> effect of some "write early response" kind of optimizations at
>>>> interconnect level ?
>>>
>>> Strongly-Ordered accesses are always non-shareable, so there is
>>> no explicit guarantee of coherency between multiple masters.
>>>
>> This is where probably issue then. My understanding is exactly
>> opposite here and hence I wasn't worried about multi-master
>> CPU scenario since sharable attributes would be taking care of it
>> considering the same page tables being used in SMP system.
>>
>> ARM documentation says -
>> ------------
>> Shareability and the S bit, with TEX remap
>> The memory type of a region, as indicated in the Memory type column
>> of Table B3-12 on page B3-1350, provides
>> the first level of control of whether the region is shareable:
>> • If the memory type is Strongly-ordered then the region is Shareable
>> ------------------------------------------------------------
>
> Hmmm, it looks like you're right here.  My assumption that SO implies
> non-shareable is wrong.  This is backed up by:
>
> A3.5.6 Device and Strongly-ordered memory
>
> "Address locations marked as Strongly-ordered [...] are always treated
> as Shareable."
>
>
> I think this is sufficient to ensure that if two CPUs access the same
> location with SO accesses, each will see an access order to any single
> location which is consistent with the program order of the accesses on
> the other CPUs.  (This comes from the glossary definition of Coherent.)
>
> However, I can't see any general guarantee for accesses to _different_
> locations, beyond the guarantees for certain special cases given in
> A3.8.2 Ordering requirements for memory accesses (address and control
> dependencies etc.)
>
> This may make some of the dmbs unnecessary, but it is not clear whether
> they are all unnecessary.
>
>
> I'll need to follow up on this and see if we can get an answer.
>
Thanks, David. I am looking forward to hearing more on this.

Regards,
Santosh
Catalin Marinas Jan. 16, 2013, 3:05 p.m. UTC | #6
Santosh,

On Wed, Jan 16, 2013 at 06:50:47AM +0000, Santosh Shilimkar wrote:
> On Tuesday 15 January 2013 10:18 PM, Dave Martin wrote:
> > For architectural correctness even Strongly-Ordered memory accesses
> > require barriers in order to guarantee that multiple CPUs have a
> > coherent view of the ordering of memory accesses.
> >
> > Virtually everything done by this early code is done via explicit
> > memory access only, so DSBs are seldom required.  Existing barriers
> > are demoted to DMB, except where a DSB is needed to synchronise
> > non-memory signalling (i.e., before a SEV).  If a particular
> > platform performs cache maintenance in its power_up_setup function,
> > it should force it to complete explicitly including a DSB, instead
> > of relying on the bL_head framework code to do it.
> >
> > Some additional DMBs are added to ensure all the memory ordering
> > properties required by the race avoidance algorithm.  DMBs are also
> > moved out of loops, and for clarity some are moved so that most
> > directly follow the memory operation which needs to be
> > synchronised.
> >
> > The setting of a CPU's bL_entry_vectors[] entry is also required to
> > act as a synchronisation point, so a DMB is added after checking
> > that entry to ensure that other CPUs do not observe gated
> > operations leaking across the opening of the gate.
> >
> > Signed-off-by: Dave Martin <dave.martin@linaro.org>
> 
> Sorry to pick on this again but I am not able to understand why
> the strongly ordered access needs barriers. At least from the
> ARM point of view, a strongly ordered write will be more of blocking
> write and the further interconnect also is suppose to respect that
> rule. SO read writes are like adding barrier after every load store
> so adding explicit barriers doesn't make sense. Is this a side
> effect of some "write early response" kind of optimizations at
> interconnect level ?

SO or Device memory accesses are *not* like putting a proper barrier
between each access, though they may behave in some situations like having
a barrier. The ARM ARM (A3.8.3, fig 3.5) defines how accesses must
*arrive* at a peripheral or block of memory depending on the memory type,
and in the case of Device or SO we don't need additional barriers because
such accesses would *arrive* in order (given the minimum 1KB range
restriction). But it does not say anything about *observability* by a
different *master*. That's because you can't guarantee that your memory
accesses go to the same slave port.

For observability by a different master, you need an explicit DMB even
though the memory type is Device or SO. While it may work fine in most
cases (especially when the accesses by various masters go to the same
slave port), you can't be sure what the memory controller or whatever
interconnect might do.

As Dave said, it's more about what the ARM ARM doesn't say rather than
what it explicitly states.
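
To put the arrival/observability distinction in code terms (an editorial
sketch; r0 is assumed to point at some Device or Strongly-ordered
peripheral, and the offsets and the flag location in r4 are hypothetical):

	str	r1, [r0, #0x00]		@ two writes to the same peripheral,
	str	r2, [r0, #0x04]		@ within 1KB: they *arrive* there in
					@ program order with no barrier
	dmb				@ but before another CPU (a different
					@ observer) is told the device is set
	str	r3, [r4]		@ up, a DMB is still needed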
tip-bot for Dave Martin Jan. 16, 2013, 3:37 p.m. UTC | #7
On Wed, Jan 16, 2013 at 03:05:34PM +0000, Catalin Marinas wrote:
> Santosh,
> 
> On Wed, Jan 16, 2013 at 06:50:47AM +0000, Santosh Shilimkar wrote:
> > On Tuesday 15 January 2013 10:18 PM, Dave Martin wrote:
> > > For architectural correctness even Strongly-Ordered memory accesses
> > > require barriers in order to guarantee that multiple CPUs have a
> > > coherent view of the ordering of memory accesses.
> > >
> > > Virtually everything done by this early code is done via explicit
> > > memory access only, so DSBs are seldom required.  Existing barriers
> > > are demoted to DMB, except where a DSB is needed to synchronise
> > > non-memory signalling (i.e., before a SEV).  If a particular
> > > platform performs cache maintenance in its power_up_setup function,
> > > it should force it to complete explicitly including a DSB, instead
> > > of relying on the bL_head framework code to do it.
> > >
> > > Some additional DMBs are added to ensure all the memory ordering
> > > properties required by the race avoidance algorithm.  DMBs are also
> > > moved out of loops, and for clarity some are moved so that most
> > > directly follow the memory operation which needs to be
> > > synchronised.
> > >
> > > The setting of a CPU's bL_entry_vectors[] entry is also required to
> > > act as a synchronisation point, so a DMB is added after checking
> > > that entry to ensure that other CPUs do not observe gated
> > > operations leaking across the opening of the gate.
> > >
> > > Signed-off-by: Dave Martin <dave.martin@linaro.org>
> > 
> > Sorry to pick on this again but I am not able to understand why
> > the strongly ordered access needs barriers. At least from the
> > ARM point of view, a strongly ordered write will be more of blocking
> > write and the further interconnect also is suppose to respect that
> > rule. SO read writes are like adding barrier after every load store
> > so adding explicit barriers doesn't make sense. Is this a side
> > effect of some "write early response" kind of optimizations at
> > interconnect level ?
> 
> SO or Device memory accesses are *not* like putting a proper barrier
> between each access, though it may behave in some situations like having
> a barrier. The ARM ARM (A3.8.3, fig 3.5) defines how accesses must
> *arrive* at a peripheral or block of memory depending on the memory type
> and in case of Device or SO we don't need additional barriers because
> such accesses would *arrive* in order (given the minimum 1KB range
> restriction). But it does not say anything about *observability* by a
> different *master*. That's because you can't guarantee that your memory
> accesses go to the same slave port.
> 
> For observability by a different master, you need an explicit DMB even
> though the memory type is Device or SO. While it may work fine in most
> cases (especially when the accesses by various masters go to the same
> slave port), you can't be sure what the memory controller or whatever
> interconnect do.
> 
> As Dave said, it's more about what the ARM ARM doesn't say rather than
> what it explicitly states.

OK, so I talked to one of the ARM architects and he strongly recommends
having the DMBs.

The fact that SO accesses are shareable guarantees a coherent view of
access order only per memory location (i.e., per address), _not_
per slave.  This is what was missing from my original assumption.

For different memory locations, even within a single block of memory,
you don't get that guarantee about what other masters see, except for
the precise documented situations where there are address, control,
or data dependencies between instructions which imply a particular
observation order by other masters.  (See A3.8.2, Ordering requirements
for memory accesses).

For this locking code, we're concerned about ordering between different
memory locations in almost all instances, so most or all of the
barriers will be needed, for strict correctness.
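
For instance, the CPU's own state byte and the cluster-level state
examined afterwards are different locations, so it is the dmb rather than
per-location coherence that makes the newly written CPU state visible to
other CPUs before cluster state is acted upon.  A commented excerpt from
the patch (the comments are editorial and intervening code is elided):

	strb	r0, [r5]		@ publish this CPU's state byte
	dmb				@ order it against the *different*
					@ cluster/lock locations used below
	@ ...
	ldrb	r0, [r8, #BL_SYNC_CLUSTER_CLUSTER]	@ a different location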

Cheers
---Dave
Santosh Shilimkar Jan. 17, 2013, 6:39 a.m. UTC | #8
On Wednesday 16 January 2013 08:35 PM, Catalin Marinas wrote:
> Santosh,
>
> On Wed, Jan 16, 2013 at 06:50:47AM +0000, Santosh Shilimkar wrote:
>> On Tuesday 15 January 2013 10:18 PM, Dave Martin wrote:
>>> For architectural correctness even Strongly-Ordered memory accesses
>>> require barriers in order to guarantee that multiple CPUs have a
>>> coherent view of the ordering of memory accesses.
>>>
>>> Virtually everything done by this early code is done via explicit
>>> memory access only, so DSBs are seldom required.  Existing barriers
>>> are demoted to DMB, except where a DSB is needed to synchronise
>>> non-memory signalling (i.e., before a SEV).  If a particular
>>> platform performs cache maintenance in its power_up_setup function,
>>> it should force it to complete explicitly including a DSB, instead
>>> of relying on the bL_head framework code to do it.
>>>
>>> Some additional DMBs are added to ensure all the memory ordering
>>> properties required by the race avoidance algorithm.  DMBs are also
>>> moved out of loops, and for clarity some are moved so that most
>>> directly follow the memory operation which needs to be
>>> synchronised.
>>>
>>> The setting of a CPU's bL_entry_vectors[] entry is also required to
>>> act as a synchronisation point, so a DMB is added after checking
>>> that entry to ensure that other CPUs do not observe gated
>>> operations leaking across the opening of the gate.
>>>
>>> Signed-off-by: Dave Martin <dave.martin@linaro.org>
>>
>> Sorry to pick on this again but I am not able to understand why
>> the strongly ordered access needs barriers. At least from the
>> ARM point of view, a strongly ordered write will be more of blocking
>> write and the further interconnect also is suppose to respect that
>> rule. SO read writes are like adding barrier after every load store
>> so adding explicit barriers doesn't make sense. Is this a side
>> effect of some "write early response" kind of optimizations at
>> interconnect level ?
>
> SO or Device memory accesses are *not* like putting a proper barrier
> between each access, though it may behave in some situations like having
> a barrier. The ARM ARM (A3.8.3, fig 3.5) defines how accesses must
> *arrive* at a peripheral or block of memory depending on the memory type
> and in case of Device or SO we don't need additional barriers because
> such accesses would *arrive* in order (given the minimum 1KB range
> restriction). But it does not say anything about *observability* by a
> different *master*. That's because you can't guarantee that your memory
> accesses go to the same slave port.
>
> For observability by a different master, you need an explicit DMB even
> though the memory type is Device or SO. While it may work fine in most
> cases (especially when the accesses by various masters go to the same
> slave port), you can't be sure what the memory controller or whatever
> interconnect do.
>
I agree that the interconnect behavior and the number of slave ports used
are implementation specific, and we can't assume anything here. So the
safer side is to have those additional barriers so as not to hit the
corner cases.

> As Dave said, it's more about what the ARM ARM doesn't say rather than
> what it explicitly states.
>
I understand it better now. Thanks for the good discussion.

Regards,
Santosh

Patch

diff --git a/arch/arm/common/bL_head.S b/arch/arm/common/bL_head.S
index fd71ff6..a4a20e5 100644
--- a/arch/arm/common/bL_head.S
+++ b/arch/arm/common/bL_head.S
@@ -87,8 +87,7 @@  ENTRY(bL_entry_point)
 	mov	r5, #BL_SYNC_CPU_SIZE
 	mla	r5, r9, r5, r8			@ r5 = bL_sync cpu address
 	strb	r0, [r5]
-
-	dsb
+	dmb
 
 	@ At this point, the cluster cannot unexpectedly enter the GOING_DOWN
 	@ state, because there is at least one active CPU (this CPU).
@@ -97,7 +96,7 @@  ENTRY(bL_entry_point)
 	mla	r11, r0, r10, r11		@ r11 = cluster first man lock
 	mov	r0, r11
 	mov	r1, r9				@ cpu
-	bl	vlock_trylock
+	bl	vlock_trylock			@ implies DSB
 
 	cmp	r0, #0				@ failed to get the lock?
 	bne	cluster_setup_wait		@ wait for cluster setup if so
@@ -115,11 +114,12 @@  cluster_setup:
 	@ Wait for any previously-pending cluster teardown operations to abort
 	@ or complete:
 
-	dsb
-	ldrb	r0, [r8, #BL_SYNC_CLUSTER_CLUSTER]
+	dmb
+0:	ldrb	r0, [r8, #BL_SYNC_CLUSTER_CLUSTER]
 	cmp	r0, #CLUSTER_GOING_DOWN
 	wfeeq
-	beq	cluster_setup
+	beq	0b
+	dmb
 
 	@ If the outbound gave up before teardown started, skip cluster setup:
 
@@ -131,8 +131,8 @@  cluster_setup:
 	cmp	r7, #0
 	mov	r0, #1		@ second (cluster) affinity level
 	blxne	r7		@ Call power_up_setup if defined
+	dmb
 
-	dsb
 	mov	r0, #CLUSTER_UP
 	strb	r0, [r8, #BL_SYNC_CLUSTER_CLUSTER]
 	dsb
@@ -146,11 +146,11 @@  cluster_setup_leave:
 	@ In the contended case, non-first men wait here for cluster setup
 	@ to complete:
 cluster_setup_wait:
-	dsb
 	ldrb	r0, [r8, #BL_SYNC_CLUSTER_CLUSTER]
 	cmp	r0, #CLUSTER_UP
 	wfene
 	bne	cluster_setup_wait
+	dmb
 
 cluster_setup_complete:
 	@ If a platform-specific CPU setup hook is needed, it is
@@ -162,13 +162,14 @@  cluster_setup_complete:
 
 	@ Mark the CPU as up:
 
-	dsb
+	dmb
 	mov	r0, #CPU_UP
 	strb	r0, [r5]
+	dmb
 
 bL_entry_gated:
-	dsb
 	ldr	r5, [r6, r4, lsl #2]		@ r5 = CPU entry vector
+	dmb
 	cmp	r5, #0
 	wfeeq
 	beq	bL_entry_gated