diff mbox

[03/16] ARM: b.L: introduce helpers for platform coherency exit/setup

Message ID 1357777251-13541-4-git-send-email-nicolas.pitre@linaro.org (mailing list archive)
State New, archived
Headers show

Commit Message

Nicolas Pitre Jan. 10, 2013, 12:20 a.m. UTC
From: Dave Martin <dave.martin@linaro.org>

This provides helper methods to coordinate between CPUs coming down
and CPUs going up, as well as documentation on the used algorithms,
so that cluster teardown and setup
operations are not done for a cluster simultaneously.

For use in the power_down() implementation:
  * __bL_cpu_going_down(unsigned int cluster, unsigned int cpu)
  * __bL_outbound_enter_critical(unsigned int cluster)
  * __bL_outbound_leave_critical(unsigned int cluster)
  * __bL_cpu_down(unsigned int cluster, unsigned int cpu)

The power_up_setup() helper should do platform-specific setup in
preparation for turning the CPU on, such as invalidating local caches
or entering coherency.  It must be assembler for now, since it must
run before the MMU can be switched on.  It is passed the affinity level
which should be initialized.

Because the bL_cluster_sync_struct content is looked-up and modified
with the cache enabled or disabled depending on the code path, it is
crucial to always ensure proper cache maintenance to update main memory
right away.  Therefore, any cached write must be followed by a cache clean
operation and any cached read must be preceded by a cache invalidate
operation on the accessed memory.

To avoid races where a reader would invalidate the cache and discard the
latest update from a writer before that writer had a chance to clean it
to RAM, we simply use cache flush (clean+invalidate) operations
everywhere.

Also, in order to prevent a cached writer from interfering with an
adjacent non-cached writer, we ensure each state variable is located to
a separate cache line.

Thanks to Nicolas Pitre and Achin Gupta for the help with this
patch.

Signed-off-by: Dave Martin <dave.martin@linaro.org>
---
 .../arm/big.LITTLE/cluster-pm-race-avoidance.txt   | 498 +++++++++++++++++++++
 arch/arm/common/bL_entry.c                         | 160 +++++++
 arch/arm/common/bL_head.S                          |  88 +++-
 arch/arm/include/asm/bL_entry.h                    |  62 +++
 4 files changed, 806 insertions(+), 2 deletions(-)
 create mode 100644 Documentation/arm/big.LITTLE/cluster-pm-race-avoidance.txt

Comments

Catalin Marinas Jan. 10, 2013, 4:53 p.m. UTC | #1
On 10 January 2013 00:20, Nicolas Pitre <nicolas.pitre@linaro.org> wrote:
> --- /dev/null
> +++ b/Documentation/arm/big.LITTLE/cluster-pm-race-avoidance.txt
> @@ -0,0 +1,498 @@
> +Big.LITTLE cluster Power-up/power-down race avoidance algorithm
> +===============================================================

Nice description and ascii art :).

> --- a/arch/arm/common/bL_entry.c
> +++ b/arch/arm/common/bL_entry.c
> @@ -116,3 +116,163 @@ int bL_cpu_powered_up(void)
>                 platform_ops->powered_up();
>         return 0;
>  }
> +
> +struct bL_sync_struct bL_sync;
> +
> +static void __sync_range(volatile void *p, size_t size)
> +{
> +       char *_p = (char *)p;
> +
> +       __cpuc_flush_dcache_area(_p, size);
> +       outer_flush_range(__pa(_p), __pa(_p + size));
> +       outer_sync();

The outer flush-range operations already contain a cache_sync, so an
additional outer_sync() operation is not necessary.

You (well, Dave) said that you use the flush instead of
clean/invalidate to avoid races with other CPUs writing the location.
However, on the same CPU you can get a speculative load into L1 after
the L1 flush but before the L2 flush, so the reader case can fail.

The sequence for readers is (note *L2* inval first):

L2 inval
L1 inval

The sequence for writers is:

L1 clean
L2 clean

The bi-directional sequence (that's what you need) is:

L1 clean
L2 clean+inval
L1 clean+inval

The last L1 op must be clean+inval in case another CPU writes to this
location to avoid discarding the write.

If you don't have an L2, you just end up with two L1 clean ops, so you
can probably put some checks.

> +#define sync_mem(ptr) __sync_range(ptr, sizeof *(ptr))
> +
> +/*
> + * __bL_cpu_going_down: Indicates that the cpu is being torn down.
> + *    This must be called at the point of committing to teardown of a CPU.
> + *    The CPU cache (SCTRL.C bit) is expected to still be active.
> + */
> +void __bL_cpu_going_down(unsigned int cpu, unsigned int cluster)
> +{
> +       bL_sync.clusters[cluster].cpus[cpu].cpu = CPU_GOING_DOWN;
> +       sync_mem(&bL_sync.clusters[cluster].cpus[cpu].cpu);
> +}
> +
> +/*
> + * __bL_cpu_down: Indicates that cpu teardown is complete and that the
> + *    cluster can be torn down without disrupting this CPU.
> + *    To avoid deadlocks, this must be called before a CPU is powered down.
> + *    The CPU cache (SCTRL.C bit) is expected to be off.
> + */
> +void __bL_cpu_down(unsigned int cpu, unsigned int cluster)
> +{
> +       dsb();
> +       bL_sync.clusters[cluster].cpus[cpu].cpu = CPU_DOWN;
> +       sync_mem(&bL_sync.clusters[cluster].cpus[cpu].cpu);
> +       sev();

For the sev() here (and other functions in this patch) you need a
dsb() before. I'm not sure outer_sync() has one.
Nicolas Pitre Jan. 10, 2013, 5:59 p.m. UTC | #2
On Thu, 10 Jan 2013, Catalin Marinas wrote:

> On 10 January 2013 00:20, Nicolas Pitre <nicolas.pitre@linaro.org> wrote:
> > --- /dev/null
> > +++ b/Documentation/arm/big.LITTLE/cluster-pm-race-avoidance.txt
> > @@ -0,0 +1,498 @@
> > +Big.LITTLE cluster Power-up/power-down race avoidance algorithm
> > +===============================================================
> 
> Nice description and ascii art :).
> 
> > --- a/arch/arm/common/bL_entry.c
> > +++ b/arch/arm/common/bL_entry.c
> > @@ -116,3 +116,163 @@ int bL_cpu_powered_up(void)
> >                 platform_ops->powered_up();
> >         return 0;
> >  }
> > +
> > +struct bL_sync_struct bL_sync;
> > +
> > +static void __sync_range(volatile void *p, size_t size)
> > +{
> > +       char *_p = (char *)p;
> > +
> > +       __cpuc_flush_dcache_area(_p, size);
> > +       outer_flush_range(__pa(_p), __pa(_p + size));
> > +       outer_sync();
> 
> The outer flush-range operations already contain a cache_sync, so an
> additional outer_sync() operation is not necessary.
> 
> You (well, Dave) said that you use the flush instead of
> clean/invalidate to avoid races with other CPUs writing the location.

Yes.  To clarify for everyone, the issue here is that those state values 
are being written and/or read by different CPUs which may or may not 
have their cache enabled.  And in some cases the L1 cache is disabled 
but L2 is still enabled.

So a cached reader must invalidate the cache to ensure it reads an 
up-to-date value from RAM since the last update might have come from a 
CPU with its cache disabled.  But invalidating the cache might discard 
the newly updated state from a writer with an active cache before that 
writer had the chance to clean its cache to RAM.  Therefore, using a 
cache flush rather than a cache invalidate before every reads solves 
this race.

> However, on the same CPU you can get a speculative load into L1 after
> the L1 flush but before the L2 flush, so the reader case can fail.
> 
> The sequence for readers is (note *L2* inval first):
> 
> L2 inval
> L1 inval

As you noted below and as I explained above, this can't be an inval 
operation as that could discard a concurrent writer's update.

> The sequence for writers is:
> 
> L1 clean
> L2 clean
> 
> The bi-directional sequence (that's what you need) is:
> 
> L1 clean
> L2 clean+inval
> L1 clean+inval
> 
> The last L1 op must be clean+inval in case another CPU writes to this
> location to avoid discarding the write.
> 
> If you don't have an L2, you just end up with two L1 clean ops, so you
> can probably put some checks.

In fact, since this is only used on A7/A15 right now, there is no outer 
cache and the outer calls are effectively no-ops.  I'm wondering if 
those should simply be removed until/unless there is some system showing 
up with a need for them.

> > +#define sync_mem(ptr) __sync_range(ptr, sizeof *(ptr))
> > +
> > +/*
> > + * __bL_cpu_going_down: Indicates that the cpu is being torn down.
> > + *    This must be called at the point of committing to teardown of a CPU.
> > + *    The CPU cache (SCTRL.C bit) is expected to still be active.
> > + */
> > +void __bL_cpu_going_down(unsigned int cpu, unsigned int cluster)
> > +{
> > +       bL_sync.clusters[cluster].cpus[cpu].cpu = CPU_GOING_DOWN;
> > +       sync_mem(&bL_sync.clusters[cluster].cpus[cpu].cpu);
> > +}
> > +
> > +/*
> > + * __bL_cpu_down: Indicates that cpu teardown is complete and that the
> > + *    cluster can be torn down without disrupting this CPU.
> > + *    To avoid deadlocks, this must be called before a CPU is powered down.
> > + *    The CPU cache (SCTRL.C bit) is expected to be off.
> > + */
> > +void __bL_cpu_down(unsigned int cpu, unsigned int cluster)
> > +{
> > +       dsb();
> > +       bL_sync.clusters[cluster].cpus[cpu].cpu = CPU_DOWN;
> > +       sync_mem(&bL_sync.clusters[cluster].cpus[cpu].cpu);
> > +       sev();
> 
> For the sev() here (and other functions in this patch) you need a
> dsb() before. I'm not sure outer_sync() has one.

__cpuc_flush_dcache_area() does though, via v7_flush_kern_dcache_area.


Nicolas
Catalin Marinas Jan. 10, 2013, 9:50 p.m. UTC | #3
On 10 January 2013 17:59, Nicolas Pitre <nicolas.pitre@linaro.org> wrote:
> On Thu, 10 Jan 2013, Catalin Marinas wrote:
>
>> On 10 January 2013 00:20, Nicolas Pitre <nicolas.pitre@linaro.org> wrote:
>> > --- a/arch/arm/common/bL_entry.c
>> > +++ b/arch/arm/common/bL_entry.c
>> > @@ -116,3 +116,163 @@ int bL_cpu_powered_up(void)
>> >                 platform_ops->powered_up();
>> >         return 0;
>> >  }
>> > +
>> > +struct bL_sync_struct bL_sync;
>> > +
>> > +static void __sync_range(volatile void *p, size_t size)
>> > +{
>> > +       char *_p = (char *)p;
>> > +
>> > +       __cpuc_flush_dcache_area(_p, size);
>> > +       outer_flush_range(__pa(_p), __pa(_p + size));
>> > +       outer_sync();
...
>> However, on the same CPU you can get a speculative load into L1 after
>> the L1 flush but before the L2 flush, so the reader case can fail.
>>
>> The sequence for readers is (note *L2* inval first):
>>
>> L2 inval
>> L1 inval
>
> As you noted below and as I explained above, this can't be an inval
> operation as that could discard a concurrent writer's update.
>
>> The sequence for writers is:
>>
>> L1 clean
>> L2 clean
>>
>> The bi-directional sequence (that's what you need) is:
>>
>> L1 clean
>> L2 clean+inval
>> L1 clean+inval
>>
>> The last L1 op must be clean+inval in case another CPU writes to this
>> location to avoid discarding the write.
>>
>> If you don't have an L2, you just end up with two L1 clean ops, so you
>> can probably put some checks.
>
> In fact, since this is only used on A7/A15 right now, there is no outer
> cache and the outer calls are effectively no-ops.  I'm wondering if
> those should simply be removed until/unless there is some system showing
> up with a need for them.

You could. I expect multi-cluster systems to have integrated L2 cache
and avoid explicit outer cache maintenance. But is there a chance that
your patches could be generalised to existing systems with A9 (not b.L
configuration but just hotplug or cpuidle support)? I haven't finished
reading all the patches, so maybe that's not the case at all.

Anyway, my point is that if L1 is inner and L2 outer, the correct
bi-derectional flushing sequence is slightly different.
Nicolas Pitre Jan. 10, 2013, 10:31 p.m. UTC | #4
On Thu, 10 Jan 2013, Catalin Marinas wrote:

> On 10 January 2013 17:59, Nicolas Pitre <nicolas.pitre@linaro.org> wrote:
> > On Thu, 10 Jan 2013, Catalin Marinas wrote:
> >
> >> On 10 January 2013 00:20, Nicolas Pitre <nicolas.pitre@linaro.org> wrote:
> >> > --- a/arch/arm/common/bL_entry.c
> >> > +++ b/arch/arm/common/bL_entry.c
> >> > @@ -116,3 +116,163 @@ int bL_cpu_powered_up(void)
> >> >                 platform_ops->powered_up();
> >> >         return 0;
> >> >  }
> >> > +
> >> > +struct bL_sync_struct bL_sync;
> >> > +
> >> > +static void __sync_range(volatile void *p, size_t size)
> >> > +{
> >> > +       char *_p = (char *)p;
> >> > +
> >> > +       __cpuc_flush_dcache_area(_p, size);
> >> > +       outer_flush_range(__pa(_p), __pa(_p + size));
> >> > +       outer_sync();
> ...
> >> However, on the same CPU you can get a speculative load into L1 after
> >> the L1 flush but before the L2 flush, so the reader case can fail.
> >>
> >> The sequence for readers is (note *L2* inval first):
> >>
> >> L2 inval
> >> L1 inval
> >
> > As you noted below and as I explained above, this can't be an inval
> > operation as that could discard a concurrent writer's update.
> >
> >> The sequence for writers is:
> >>
> >> L1 clean
> >> L2 clean
> >>
> >> The bi-directional sequence (that's what you need) is:
> >>
> >> L1 clean
> >> L2 clean+inval
> >> L1 clean+inval
> >>
> >> The last L1 op must be clean+inval in case another CPU writes to this
> >> location to avoid discarding the write.
> >>
> >> If you don't have an L2, you just end up with two L1 clean ops, so you
> >> can probably put some checks.
> >
> > In fact, since this is only used on A7/A15 right now, there is no outer
> > cache and the outer calls are effectively no-ops.  I'm wondering if
> > those should simply be removed until/unless there is some system showing
> > up with a need for them.
> 
> You could. I expect multi-cluster systems to have integrated L2 cache
> and avoid explicit outer cache maintenance. But is there a chance that
> your patches could be generalised to existing systems with A9 (not b.L
> configuration but just hotplug or cpuidle support)? I haven't finished
> reading all the patches, so maybe that's not the case at all.

I suppose it could, although the special requirements put on the first 
man / last man exist only for multi-cluster systems.  OTOH, existing A9 
systems are already served by far less complex code already, so it is 
really a matter of figuring out if the backend for those A9 systems 
needed by this cluster code would be simpler than the existing code, in 
which case that would certainly be beneficial.

> Anyway, my point is that if L1 is inner and L2 outer, the correct
> bi-derectional flushing sequence is slightly different.

Agreed, I'll make sure to capture that in the code somehow.


Nicolas
Nicolas Pitre Jan. 10, 2013, 10:32 p.m. UTC | #5
On Thu, 10 Jan 2013, Catalin Marinas wrote:

> On 10 January 2013 00:20, Nicolas Pitre <nicolas.pitre@linaro.org> wrote:
> > --- /dev/null
> > +++ b/Documentation/arm/big.LITTLE/cluster-pm-race-avoidance.txt
> > @@ -0,0 +1,498 @@
> > +Big.LITTLE cluster Power-up/power-down race avoidance algorithm
> > +===============================================================
> 
> Nice description and ascii art :).

Credits go to Dave Martin.


Nicolas
Will Deacon Jan. 10, 2013, 11:13 p.m. UTC | #6
On Thu, Jan 10, 2013 at 12:20:38AM +0000, Nicolas Pitre wrote:
> From: Dave Martin <dave.martin@linaro.org>
> 
> This provides helper methods to coordinate between CPUs coming down
> and CPUs going up, as well as documentation on the used algorithms,
> so that cluster teardown and setup
> operations are not done for a cluster simultaneously.

[...]

> +int __init bL_cluster_sync_init(void (*power_up_setup)(void))
> +{
> +       unsigned int i, j, mpidr, this_cluster;
> +
> +       BUILD_BUG_ON(BL_SYNC_CLUSTER_SIZE * BL_NR_CLUSTERS != sizeof bL_sync);
> +       BUG_ON((unsigned long)&bL_sync & (__CACHE_WRITEBACK_GRANULE - 1));
> +
> +       /*
> +        * Set initial CPU and cluster states.
> +        * Only one cluster is assumed to be active at this point.
> +        */
> +       for (i = 0; i < BL_NR_CLUSTERS; i++) {
> +               bL_sync.clusters[i].cluster = CLUSTER_DOWN;
> +               bL_sync.clusters[i].inbound = INBOUND_NOT_COMING_UP;
> +               for (j = 0; j < BL_CPUS_PER_CLUSTER; j++)
> +                       bL_sync.clusters[i].cpus[j].cpu = CPU_DOWN;
> +       }
> +       asm ("mrc p15, 0, %0, c0, c0, 5" : "=r" (mpidr));

We have a helper for this...

> +       this_cluster = (mpidr >> 8) & 0xf;

... and also this, thanks to Lorenzo's recent patches.

> +       for_each_online_cpu(i)
> +               bL_sync.clusters[this_cluster].cpus[i].cpu = CPU_UP;
> +       bL_sync.clusters[this_cluster].cluster = CLUSTER_UP;
> +       sync_mem(&bL_sync);
> +
> +       if (power_up_setup) {
> +               bL_power_up_setup_phys = virt_to_phys(power_up_setup);
> +               sync_mem(&bL_power_up_setup_phys);
> +       }
> +
> +       return 0;
> +}
> diff --git a/arch/arm/common/bL_head.S b/arch/arm/common/bL_head.S
> index 9d351f2b4c..f7a64ac127 100644
> --- a/arch/arm/common/bL_head.S
> +++ b/arch/arm/common/bL_head.S
> @@ -7,11 +7,19 @@
>   * This program is free software; you can redistribute it and/or modify
>   * it under the terms of the GNU General Public License version 2 as
>   * published by the Free Software Foundation.
> + *
> + *
> + * Refer to Documentation/arm/big.LITTLE/cluster-pm-race-avoidance.txt
> + * for details of the synchronisation algorithms used here.
>   */
> 
>  #include <linux/linkage.h>
>  #include <asm/bL_entry.h>
> 
> +.if BL_SYNC_CLUSTER_CPUS
> +.error "cpus must be the first member of struct bL_cluster_sync_struct"
> +.endif
> +
>         .macro  pr_dbg  cpu, string
>  #if defined(CONFIG_DEBUG_LL) && defined(DEBUG)
>         b       1901f
> @@ -52,12 +60,82 @@ ENTRY(bL_entry_point)
>  2:     pr_dbg  r4, "kernel bL_entry_point\n"
> 
>         /*
> -        * MMU is off so we need to get to bL_entry_vectors in a
> +        * MMU is off so we need to get to various variables in a
>          * position independent way.
>          */
>         adr     r5, 3f
> -       ldr     r6, [r5]
> +       ldmia   r5, {r6, r7, r8}
>         add     r6, r5, r6                      @ r6 = bL_entry_vectors
> +       ldr     r7, [r5, r7]                    @ r7 = bL_power_up_setup_phys
> +       add     r8, r5, r8                      @ r8 = bL_sync
> +
> +       mov     r0, #BL_SYNC_CLUSTER_SIZE
> +       mla     r8, r0, r10, r8                 @ r8 = bL_sync cluster base
> +
> +       @ Signal that this CPU is coming UP:
> +       mov     r0, #CPU_COMING_UP
> +       mov     r5, #BL_SYNC_CPU_SIZE
> +       mla     r5, r9, r5, r8                  @ r5 = bL_sync cpu address
> +       strb    r0, [r5]
> +
> +       dsb

Why is a dmb not enough here? In fact, the same goes for most of these
other than the one preceeding the sev. Is there an interaction with the
different mappings for the cluster data that I've missed?

> +
> +       @ At this point, the cluster cannot unexpectedly enter the GOING_DOWN
> +       @ state, because there is at least one active CPU (this CPU).
> +
> +       @ Check if the cluster has been set up yet:
> +       ldrb    r0, [r8, #BL_SYNC_CLUSTER_CLUSTER]
> +       cmp     r0, #CLUSTER_UP
> +       beq     cluster_already_up
> +
> +       @ Signal that the cluster is being brought up:
> +       mov     r0, #INBOUND_COMING_UP
> +       strb    r0, [r8, #BL_SYNC_CLUSTER_INBOUND]
> +
> +       dsb
> +
> +       @ Any CPU trying to take the cluster into CLUSTER_GOING_DOWN from this
> +       @ point onwards will observe INBOUND_COMING_UP and abort.
> +
> +       @ Wait for any previously-pending cluster teardown operations to abort
> +       @ or complete:
> +cluster_teardown_wait:
> +       ldrb    r0, [r8, #BL_SYNC_CLUSTER_CLUSTER]
> +       cmp     r0, #CLUSTER_GOING_DOWN
> +       wfeeq
> +       beq     cluster_teardown_wait
> +
> +       @ power_up_setup is responsible for setting up the cluster:
> +
> +       cmp     r7, #0
> +       mov     r0, #1          @ second (cluster) affinity level
> +       blxne   r7              @ Call power_up_setup if defined
> +
> +       @ Leave the cluster setup critical section:
> +
> +       dsb
> +       mov     r0, #INBOUND_NOT_COMING_UP
> +       strb    r0, [r8, #BL_SYNC_CLUSTER_INBOUND]
> +       mov     r0, #CLUSTER_UP
> +       strb    r0, [r8, #BL_SYNC_CLUSTER_CLUSTER]
> +       dsb
> +       sev
> +
> +cluster_already_up:
> +       @ If a platform-specific CPU setup hook is needed, it is
> +       @ called from here.
> +
> +       cmp     r7, #0
> +       mov     r0, #0          @ first (CPU) affinity level
> +       blxne   r7              @ Call power_up_setup if defined
> +
> +       @ Mark the CPU as up:
> +
> +       dsb
> +       mov     r0, #CPU_UP
> +       strb    r0, [r5]
> +       dsb
> +       sev
> 
>  bL_entry_gated:
>         ldr     r5, [r6, r4, lsl #2]            @ r5 = CPU entry vector
> @@ -70,6 +148,8 @@ bL_entry_gated:
>         .align  2
> 
>  3:     .word   bL_entry_vectors - .
> +       .word   bL_power_up_setup_phys - 3b
> +       .word   bL_sync - 3b
> 
>  ENDPROC(bL_entry_point)
> 
> @@ -79,3 +159,7 @@ ENDPROC(bL_entry_point)
>         .type   bL_entry_vectors, #object
>  ENTRY(bL_entry_vectors)
>         .space  4 * BL_NR_CLUSTERS * BL_CPUS_PER_CLUSTER
> +
> +       .type   bL_power_up_setup_phys, #object
> +ENTRY(bL_power_up_setup_phys)
> +       .space  4               @ set by bL_cluster_sync_init()
> diff --git a/arch/arm/include/asm/bL_entry.h b/arch/arm/include/asm/bL_entry.h
> index 942d7f9f19..167394d9a0 100644
> --- a/arch/arm/include/asm/bL_entry.h
> +++ b/arch/arm/include/asm/bL_entry.h
> @@ -15,8 +15,37 @@
>  #define BL_CPUS_PER_CLUSTER    4
>  #define BL_NR_CLUSTERS         2
> 
> +/* Definitions for bL_cluster_sync_struct */
> +#define CPU_DOWN               0x11
> +#define CPU_COMING_UP          0x12
> +#define CPU_UP                 0x13
> +#define CPU_GOING_DOWN         0x14
> +
> +#define CLUSTER_DOWN           0x21
> +#define CLUSTER_UP             0x22
> +#define CLUSTER_GOING_DOWN     0x23
> +
> +#define INBOUND_NOT_COMING_UP  0x31
> +#define INBOUND_COMING_UP      0x32

Do these numbers signify anything? Why not 0, 1, 2 etc?

> +
> +/* This is a complete guess. */
> +#define __CACHE_WRITEBACK_ORDER        6

Is this CONFIG_ARM_L1_CACHE_SHIFT?

> +#define __CACHE_WRITEBACK_GRANULE (1 << __CACHE_WRITEBACK_ORDER)
> +
> +/* Offsets for the bL_cluster_sync_struct members, for use in asm: */
> +#define BL_SYNC_CLUSTER_CPUS   0

Why not use asm-offsets.h for this?

> +#define BL_SYNC_CPU_SIZE       __CACHE_WRITEBACK_GRANULE
> +#define BL_SYNC_CLUSTER_CLUSTER \
> +       (BL_SYNC_CLUSTER_CPUS + BL_SYNC_CPU_SIZE * BL_CPUS_PER_CLUSTER)
> +#define BL_SYNC_CLUSTER_INBOUND \
> +       (BL_SYNC_CLUSTER_CLUSTER + __CACHE_WRITEBACK_GRANULE)
> +#define BL_SYNC_CLUSTER_SIZE \
> +       (BL_SYNC_CLUSTER_INBOUND + __CACHE_WRITEBACK_GRANULE)
> +

Hmm, this looks pretty fragile to me but again, you need this stuff at
compile time. Is there an architected maximum value for the writeback
granule? Failing that, we may as well just use things like
__cacheline_aligned if we're only using the L1 alignment anyway.

Will
Nicolas Pitre Jan. 11, 2013, 1:50 a.m. UTC | #7
On Thu, 10 Jan 2013, Will Deacon wrote:

> On Thu, Jan 10, 2013 at 12:20:38AM +0000, Nicolas Pitre wrote:
> > From: Dave Martin <dave.martin@linaro.org>
> > 
> > This provides helper methods to coordinate between CPUs coming down
> > and CPUs going up, as well as documentation on the used algorithms,
> > so that cluster teardown and setup
> > operations are not done for a cluster simultaneously.
> 
> [...]
> 
> > +int __init bL_cluster_sync_init(void (*power_up_setup)(void))
> > +{
> > +       unsigned int i, j, mpidr, this_cluster;
> > +
> > +       BUILD_BUG_ON(BL_SYNC_CLUSTER_SIZE * BL_NR_CLUSTERS != sizeof bL_sync);
> > +       BUG_ON((unsigned long)&bL_sync & (__CACHE_WRITEBACK_GRANULE - 1));
> > +
> > +       /*
> > +        * Set initial CPU and cluster states.
> > +        * Only one cluster is assumed to be active at this point.
> > +        */
> > +       for (i = 0; i < BL_NR_CLUSTERS; i++) {
> > +               bL_sync.clusters[i].cluster = CLUSTER_DOWN;
> > +               bL_sync.clusters[i].inbound = INBOUND_NOT_COMING_UP;
> > +               for (j = 0; j < BL_CPUS_PER_CLUSTER; j++)
> > +                       bL_sync.clusters[i].cpus[j].cpu = CPU_DOWN;
> > +       }
> > +       asm ("mrc p15, 0, %0, c0, c0, 5" : "=r" (mpidr));
> 
> We have a helper for this...
> 
> > +       this_cluster = (mpidr >> 8) & 0xf;
> 
> ... and also this, thanks to Lorenzo's recent patches.

Indeed, I'll have a closer look at them.

> > +       for_each_online_cpu(i)
> > +               bL_sync.clusters[this_cluster].cpus[i].cpu = CPU_UP;
> > +       bL_sync.clusters[this_cluster].cluster = CLUSTER_UP;
> > +       sync_mem(&bL_sync);
> > +
> > +       if (power_up_setup) {
> > +               bL_power_up_setup_phys = virt_to_phys(power_up_setup);
> > +               sync_mem(&bL_power_up_setup_phys);
> > +       }
> > +
> > +       return 0;
> > +}
> > diff --git a/arch/arm/common/bL_head.S b/arch/arm/common/bL_head.S
> > index 9d351f2b4c..f7a64ac127 100644
> > --- a/arch/arm/common/bL_head.S
> > +++ b/arch/arm/common/bL_head.S
> > @@ -7,11 +7,19 @@
> >   * This program is free software; you can redistribute it and/or modify
> >   * it under the terms of the GNU General Public License version 2 as
> >   * published by the Free Software Foundation.
> > + *
> > + *
> > + * Refer to Documentation/arm/big.LITTLE/cluster-pm-race-avoidance.txt
> > + * for details of the synchronisation algorithms used here.
> >   */
> > 
> >  #include <linux/linkage.h>
> >  #include <asm/bL_entry.h>
> > 
> > +.if BL_SYNC_CLUSTER_CPUS
> > +.error "cpus must be the first member of struct bL_cluster_sync_struct"
> > +.endif
> > +
> >         .macro  pr_dbg  cpu, string
> >  #if defined(CONFIG_DEBUG_LL) && defined(DEBUG)
> >         b       1901f
> > @@ -52,12 +60,82 @@ ENTRY(bL_entry_point)
> >  2:     pr_dbg  r4, "kernel bL_entry_point\n"
> > 
> >         /*
> > -        * MMU is off so we need to get to bL_entry_vectors in a
> > +        * MMU is off so we need to get to various variables in a
> >          * position independent way.
> >          */
> >         adr     r5, 3f
> > -       ldr     r6, [r5]
> > +       ldmia   r5, {r6, r7, r8}
> >         add     r6, r5, r6                      @ r6 = bL_entry_vectors
> > +       ldr     r7, [r5, r7]                    @ r7 = bL_power_up_setup_phys
> > +       add     r8, r5, r8                      @ r8 = bL_sync
> > +
> > +       mov     r0, #BL_SYNC_CLUSTER_SIZE
> > +       mla     r8, r0, r10, r8                 @ r8 = bL_sync cluster base
> > +
> > +       @ Signal that this CPU is coming UP:
> > +       mov     r0, #CPU_COMING_UP
> > +       mov     r5, #BL_SYNC_CPU_SIZE
> > +       mla     r5, r9, r5, r8                  @ r5 = bL_sync cpu address
> > +       strb    r0, [r5]
> > +
> > +       dsb
> 
> Why is a dmb not enough here? In fact, the same goes for most of these
> other than the one preceeding the sev. Is there an interaction with the
> different mappings for the cluster data that I've missed?

Probably Dave could comment more on this as this is his code, or Achin 
who also reviewed it.  I don't know the level of discussion that 
happened inside ARM around those barriers.

When the TC2 firmware didn't properly handle the ACP snoops, the dsb's 
couldn't be used at this point.  The replacement for a dsb was a read 
back followed by a dmb in that case, and then the general sentiment was 
that this was an A15 specific workaround which wasn't architecturally 
guaranteed on all ARMv7 compliant implementations, or something along 
those lines.

Given that the TC2 firmware properly handles the snoops now, and that 
the dsb apparently doesn't require a readback, we just decided to revert 
to having simple dsb's.

> > +
> > +       @ At this point, the cluster cannot unexpectedly enter the GOING_DOWN
> > +       @ state, because there is at least one active CPU (this CPU).
> > +
> > +       @ Check if the cluster has been set up yet:
> > +       ldrb    r0, [r8, #BL_SYNC_CLUSTER_CLUSTER]
> > +       cmp     r0, #CLUSTER_UP
> > +       beq     cluster_already_up
> > +
> > +       @ Signal that the cluster is being brought up:
> > +       mov     r0, #INBOUND_COMING_UP
> > +       strb    r0, [r8, #BL_SYNC_CLUSTER_INBOUND]
> > +
> > +       dsb
> > +
> > +       @ Any CPU trying to take the cluster into CLUSTER_GOING_DOWN from this
> > +       @ point onwards will observe INBOUND_COMING_UP and abort.
> > +
> > +       @ Wait for any previously-pending cluster teardown operations to abort
> > +       @ or complete:
> > +cluster_teardown_wait:
> > +       ldrb    r0, [r8, #BL_SYNC_CLUSTER_CLUSTER]
> > +       cmp     r0, #CLUSTER_GOING_DOWN
> > +       wfeeq
> > +       beq     cluster_teardown_wait
> > +
> > +       @ power_up_setup is responsible for setting up the cluster:
> > +
> > +       cmp     r7, #0
> > +       mov     r0, #1          @ second (cluster) affinity level
> > +       blxne   r7              @ Call power_up_setup if defined
> > +
> > +       @ Leave the cluster setup critical section:
> > +
> > +       dsb
> > +       mov     r0, #INBOUND_NOT_COMING_UP
> > +       strb    r0, [r8, #BL_SYNC_CLUSTER_INBOUND]
> > +       mov     r0, #CLUSTER_UP
> > +       strb    r0, [r8, #BL_SYNC_CLUSTER_CLUSTER]
> > +       dsb
> > +       sev
> > +
> > +cluster_already_up:
> > +       @ If a platform-specific CPU setup hook is needed, it is
> > +       @ called from here.
> > +
> > +       cmp     r7, #0
> > +       mov     r0, #0          @ first (CPU) affinity level
> > +       blxne   r7              @ Call power_up_setup if defined
> > +
> > +       @ Mark the CPU as up:
> > +
> > +       dsb
> > +       mov     r0, #CPU_UP
> > +       strb    r0, [r5]
> > +       dsb
> > +       sev
> > 
> >  bL_entry_gated:
> >         ldr     r5, [r6, r4, lsl #2]            @ r5 = CPU entry vector
> > @@ -70,6 +148,8 @@ bL_entry_gated:
> >         .align  2
> > 
> >  3:     .word   bL_entry_vectors - .
> > +       .word   bL_power_up_setup_phys - 3b
> > +       .word   bL_sync - 3b
> > 
> >  ENDPROC(bL_entry_point)
> > 
> > @@ -79,3 +159,7 @@ ENDPROC(bL_entry_point)
> >         .type   bL_entry_vectors, #object
> >  ENTRY(bL_entry_vectors)
> >         .space  4 * BL_NR_CLUSTERS * BL_CPUS_PER_CLUSTER
> > +
> > +       .type   bL_power_up_setup_phys, #object
> > +ENTRY(bL_power_up_setup_phys)
> > +       .space  4               @ set by bL_cluster_sync_init()
> > diff --git a/arch/arm/include/asm/bL_entry.h b/arch/arm/include/asm/bL_entry.h
> > index 942d7f9f19..167394d9a0 100644
> > --- a/arch/arm/include/asm/bL_entry.h
> > +++ b/arch/arm/include/asm/bL_entry.h
> > @@ -15,8 +15,37 @@
> >  #define BL_CPUS_PER_CLUSTER    4
> >  #define BL_NR_CLUSTERS         2
> > 
> > +/* Definitions for bL_cluster_sync_struct */
> > +#define CPU_DOWN               0x11
> > +#define CPU_COMING_UP          0x12
> > +#define CPU_UP                 0x13
> > +#define CPU_GOING_DOWN         0x14
> > +
> > +#define CLUSTER_DOWN           0x21
> > +#define CLUSTER_UP             0x22
> > +#define CLUSTER_GOING_DOWN     0x23
> > +
> > +#define INBOUND_NOT_COMING_UP  0x31
> > +#define INBOUND_COMING_UP      0x32
> 
> Do these numbers signify anything? Why not 0, 1, 2 etc?

Initially that's what they were.  But durring debugging (as we faced a 
few cache coherency issues here) it was more useful to use numbers with 
an easily distinguishable signature.  For example, a 0 may come from 
anywhere and could mean anything so that is about the worst choice.
Other than that, those numbers have no particular significance.

> > +
> > +/* This is a complete guess. */
> > +#define __CACHE_WRITEBACK_ORDER        6
> 
> Is this CONFIG_ARM_L1_CACHE_SHIFT?

No.  That has to cover L2 as well.

> > +#define __CACHE_WRITEBACK_GRANULE (1 << __CACHE_WRITEBACK_ORDER)
> > +
> > +/* Offsets for the bL_cluster_sync_struct members, for use in asm: */
> > +#define BL_SYNC_CLUSTER_CPUS   0
> 
> Why not use asm-offsets.h for this?

That's how that was done initially. But that ended up cluttering 
asm-offsets.h for stuff that actually is really a local implementation 
detail which doesn't need kernel wide scope.  In other words, the end 
result looked worse.

One could argue that they are still exposed too much as the only files 
that need to know about those defines are bL_head.S and bL_entry.c.

> > +#define BL_SYNC_CPU_SIZE       __CACHE_WRITEBACK_GRANULE
> > +#define BL_SYNC_CLUSTER_CLUSTER \
> > +       (BL_SYNC_CLUSTER_CPUS + BL_SYNC_CPU_SIZE * BL_CPUS_PER_CLUSTER)
> > +#define BL_SYNC_CLUSTER_INBOUND \
> > +       (BL_SYNC_CLUSTER_CLUSTER + __CACHE_WRITEBACK_GRANULE)
> > +#define BL_SYNC_CLUSTER_SIZE \
> > +       (BL_SYNC_CLUSTER_INBOUND + __CACHE_WRITEBACK_GRANULE)
> > +
> 
> Hmm, this looks pretty fragile to me but again, you need this stuff at
> compile time.

There are compile time and run time assertions in bL_entry.c to ensure 
those offsets and the corresponding C structure don't get out of sync.

> Is there an architected maximum value for the writeback
> granule? Failing that, we may as well just use things like
> __cacheline_aligned if we're only using the L1 alignment anyway.

See above -- we need L2 alignment.


Nicolas
tip-bot for Dave Martin Jan. 11, 2013, 10:36 a.m. UTC | #8
On Thu, Jan 10, 2013 at 05:31:08PM -0500, Nicolas Pitre wrote:
> On Thu, 10 Jan 2013, Catalin Marinas wrote:
> 
> > On 10 January 2013 17:59, Nicolas Pitre <nicolas.pitre@linaro.org> wrote:
> > > On Thu, 10 Jan 2013, Catalin Marinas wrote:
> > >
> > >> On 10 January 2013 00:20, Nicolas Pitre <nicolas.pitre@linaro.org> wrote:
> > >> > --- a/arch/arm/common/bL_entry.c
> > >> > +++ b/arch/arm/common/bL_entry.c
> > >> > @@ -116,3 +116,163 @@ int bL_cpu_powered_up(void)
> > >> >                 platform_ops->powered_up();
> > >> >         return 0;
> > >> >  }
> > >> > +
> > >> > +struct bL_sync_struct bL_sync;
> > >> > +
> > >> > +static void __sync_range(volatile void *p, size_t size)
> > >> > +{
> > >> > +       char *_p = (char *)p;
> > >> > +
> > >> > +       __cpuc_flush_dcache_area(_p, size);
> > >> > +       outer_flush_range(__pa(_p), __pa(_p + size));
> > >> > +       outer_sync();
> > ...
> > >> However, on the same CPU you can get a speculative load into L1 after
> > >> the L1 flush but before the L2 flush, so the reader case can fail.
> > >>
> > >> The sequence for readers is (note *L2* inval first):
> > >>
> > >> L2 inval
> > >> L1 inval
> > >
> > > As you noted below and as I explained above, this can't be an inval
> > > operation as that could discard a concurrent writer's update.
> > >
> > >> The sequence for writers is:
> > >>
> > >> L1 clean
> > >> L2 clean
> > >>
> > >> The bi-directional sequence (that's what you need) is:
> > >>
> > >> L1 clean
> > >> L2 clean+inval
> > >> L1 clean+inval

Agreed.  My bad, sorry... I was focusing on other aspects; plus we have
no actual outer cache, so the mis-ordering is hidden in our testing.

This code has been through a few iterations, some of which had separate
sequences for reads and writes, though possibly the ordering is still
wrong.

If our cache is enabled, we may end up with the responsibility of writing
out another CPU's dirty lines due to speculative migration, so for most
or all of the flushes here, we do need the third sequence (inner clean;
outer clean+invalidate; inner clean+invalidate)

> > >>
> > >> The last L1 op must be clean+inval in case another CPU writes to this
> > >> location to avoid discarding the write.
> > >>
> > >> If you don't have an L2, you just end up with two L1 clean ops, so you
> > >> can probably put some checks.
> > >
> > > In fact, since this is only used on A7/A15 right now, there is no outer
> > > cache and the outer calls are effectively no-ops.  I'm wondering if
> > > those should simply be removed until/unless there is some system showing
> > > up with a need for them.
> > 
> > You could. I expect multi-cluster systems to have integrated L2 cache
> > and avoid explicit outer cache maintenance. But is there a chance that
> > your patches could be generalised to existing systems with A9 (not b.L
> > configuration but just hotplug or cpuidle support)? I haven't finished
> > reading all the patches, so maybe that's not the case at all.
> 
> I suppose it could, although the special requirements put on the first 
> man / last man exist only for multi-cluster systems.  OTOH, existing A9 
> systems are already served by far less complex code already, so it is 
> really a matter of figuring out if the backend for those A9 systems 
> needed by this cluster code would be simpler than the existing code, in 
> which case that would certainly be beneficial.

The outer operations just expand to nothing if there is no outer cache; the
optimisation would be that instead of L1 clean; L1 clean+inval, we just
need L1 clean+inval.

> > Anyway, my point is that if L1 is inner and L2 outer, the correct
> > bi-derectional flushing sequence is slightly different.
> 
> Agreed, I'll make sure to capture that in the code somehow.

I'll have a go at this today (but I won't over-elaborate in case you've
already done it...)

Cheers
---Dave
tip-bot for Dave Martin Jan. 11, 2013, 11:09 a.m. UTC | #9
On Thu, Jan 10, 2013 at 08:50:59PM -0500, Nicolas Pitre wrote:
> On Thu, 10 Jan 2013, Will Deacon wrote:
> 
> > On Thu, Jan 10, 2013 at 12:20:38AM +0000, Nicolas Pitre wrote:
> > > From: Dave Martin <dave.martin@linaro.org>
> > > 
> > > This provides helper methods to coordinate between CPUs coming down
> > > and CPUs going up, as well as documentation on the used algorithms,
> > > so that cluster teardown and setup
> > > operations are not done for a cluster simultaneously.
> > 
> > [...]
> > 
> > > +int __init bL_cluster_sync_init(void (*power_up_setup)(void))
> > > +{
> > > +       unsigned int i, j, mpidr, this_cluster;
> > > +
> > > +       BUILD_BUG_ON(BL_SYNC_CLUSTER_SIZE * BL_NR_CLUSTERS != sizeof bL_sync);
> > > +       BUG_ON((unsigned long)&bL_sync & (__CACHE_WRITEBACK_GRANULE - 1));
> > > +
> > > +       /*
> > > +        * Set initial CPU and cluster states.
> > > +        * Only one cluster is assumed to be active at this point.
> > > +        */
> > > +       for (i = 0; i < BL_NR_CLUSTERS; i++) {
> > > +               bL_sync.clusters[i].cluster = CLUSTER_DOWN;
> > > +               bL_sync.clusters[i].inbound = INBOUND_NOT_COMING_UP;
> > > +               for (j = 0; j < BL_CPUS_PER_CLUSTER; j++)
> > > +                       bL_sync.clusters[i].cpus[j].cpu = CPU_DOWN;
> > > +       }
> > > +       asm ("mrc p15, 0, %0, c0, c0, 5" : "=r" (mpidr));
> > 
> > We have a helper for this...

Agreed, we would ideally use a single definition for that.

> > 
> > > +       this_cluster = (mpidr >> 8) & 0xf;
> > 
> > ... and also this, thanks to Lorenzo's recent patches.
> 
> Indeed, I'll have a closer look at them.
> 
> > > +       for_each_online_cpu(i)
> > > +               bL_sync.clusters[this_cluster].cpus[i].cpu = CPU_UP;
> > > +       bL_sync.clusters[this_cluster].cluster = CLUSTER_UP;
> > > +       sync_mem(&bL_sync);
> > > +
> > > +       if (power_up_setup) {
> > > +               bL_power_up_setup_phys = virt_to_phys(power_up_setup);
> > > +               sync_mem(&bL_power_up_setup_phys);
> > > +       }
> > > +
> > > +       return 0;
> > > +}
> > > diff --git a/arch/arm/common/bL_head.S b/arch/arm/common/bL_head.S
> > > index 9d351f2b4c..f7a64ac127 100644
> > > --- a/arch/arm/common/bL_head.S
> > > +++ b/arch/arm/common/bL_head.S
> > > @@ -7,11 +7,19 @@
> > >   * This program is free software; you can redistribute it and/or modify
> > >   * it under the terms of the GNU General Public License version 2 as
> > >   * published by the Free Software Foundation.
> > > + *
> > > + *
> > > + * Refer to Documentation/arm/big.LITTLE/cluster-pm-race-avoidance.txt
> > > + * for details of the synchronisation algorithms used here.
> > >   */
> > > 
> > >  #include <linux/linkage.h>
> > >  #include <asm/bL_entry.h>
> > > 
> > > +.if BL_SYNC_CLUSTER_CPUS
> > > +.error "cpus must be the first member of struct bL_cluster_sync_struct"
> > > +.endif
> > > +
> > >         .macro  pr_dbg  cpu, string
> > >  #if defined(CONFIG_DEBUG_LL) && defined(DEBUG)
> > >         b       1901f
> > > @@ -52,12 +60,82 @@ ENTRY(bL_entry_point)
> > >  2:     pr_dbg  r4, "kernel bL_entry_point\n"
> > > 
> > >         /*
> > > -        * MMU is off so we need to get to bL_entry_vectors in a
> > > +        * MMU is off so we need to get to various variables in a
> > >          * position independent way.
> > >          */
> > >         adr     r5, 3f
> > > -       ldr     r6, [r5]
> > > +       ldmia   r5, {r6, r7, r8}
> > >         add     r6, r5, r6                      @ r6 = bL_entry_vectors
> > > +       ldr     r7, [r5, r7]                    @ r7 = bL_power_up_setup_phys
> > > +       add     r8, r5, r8                      @ r8 = bL_sync
> > > +
> > > +       mov     r0, #BL_SYNC_CLUSTER_SIZE
> > > +       mla     r8, r0, r10, r8                 @ r8 = bL_sync cluster base
> > > +
> > > +       @ Signal that this CPU is coming UP:
> > > +       mov     r0, #CPU_COMING_UP
> > > +       mov     r5, #BL_SYNC_CPU_SIZE
> > > +       mla     r5, r9, r5, r8                  @ r5 = bL_sync cpu address
> > > +       strb    r0, [r5]
> > > +
> > > +       dsb
> > 
> > Why is a dmb not enough here? In fact, the same goes for most of these
> > other than the one preceeding the sev. Is there an interaction with the
> > different mappings for the cluster data that I've missed?
> 
> Probably Dave could comment more on this as this is his code, or Achin 
> who also reviewed it.  I don't know the level of discussion that 
> happened inside ARM around those barriers.
> 
> When the TC2 firmware didn't properly handle the ACP snoops, the dsb's 
> couldn't be used at this point.  The replacement for a dsb was a read 
> back followed by a dmb in that case, and then the general sentiment was 
> that this was an A15 specific workaround which wasn't architecturally 
> guaranteed on all ARMv7 compliant implementations, or something along 
> those lines.
> 
> Given that the TC2 firmware properly handles the snoops now, and that 
> the dsb apparently doesn't require a readback, we just decided to revert 
> to having simple dsb's.

I'll take another look at the code and think about this again.  This code
was initially a bit conservative.  Because we are S-O at this point,
most of your potential dmbs should actually require no barrier at all
(as in the vlock code).  I was cautious about that, but we've now seen
the principle work successfully with the vlock code (which postdates
the cluster state handling code here).

The one exception is before SEV.  Also, before WFE (opinions differ, but
since we are about to wait anyway the extra time cost of the dsb is not
really a concern here).

> 
> > > +
> > > +       @ At this point, the cluster cannot unexpectedly enter the GOING_DOWN
> > > +       @ state, because there is at least one active CPU (this CPU).
> > > +
> > > +       @ Check if the cluster has been set up yet:
> > > +       ldrb    r0, [r8, #BL_SYNC_CLUSTER_CLUSTER]
> > > +       cmp     r0, #CLUSTER_UP
> > > +       beq     cluster_already_up
> > > +
> > > +       @ Signal that the cluster is being brought up:
> > > +       mov     r0, #INBOUND_COMING_UP
> > > +       strb    r0, [r8, #BL_SYNC_CLUSTER_INBOUND]
> > > +
> > > +       dsb
> > > +
> > > +       @ Any CPU trying to take the cluster into CLUSTER_GOING_DOWN from this
> > > +       @ point onwards will observe INBOUND_COMING_UP and abort.
> > > +
> > > +       @ Wait for any previously-pending cluster teardown operations to abort
> > > +       @ or complete:
> > > +cluster_teardown_wait:
> > > +       ldrb    r0, [r8, #BL_SYNC_CLUSTER_CLUSTER]
> > > +       cmp     r0, #CLUSTER_GOING_DOWN
> > > +       wfeeq
> > > +       beq     cluster_teardown_wait
> > > +
> > > +       @ power_up_setup is responsible for setting up the cluster:
> > > +
> > > +       cmp     r7, #0
> > > +       mov     r0, #1          @ second (cluster) affinity level
> > > +       blxne   r7              @ Call power_up_setup if defined
> > > +
> > > +       @ Leave the cluster setup critical section:
> > > +
> > > +       dsb
> > > +       mov     r0, #INBOUND_NOT_COMING_UP
> > > +       strb    r0, [r8, #BL_SYNC_CLUSTER_INBOUND]
> > > +       mov     r0, #CLUSTER_UP
> > > +       strb    r0, [r8, #BL_SYNC_CLUSTER_CLUSTER]
> > > +       dsb
> > > +       sev
> > > +
> > > +cluster_already_up:
> > > +       @ If a platform-specific CPU setup hook is needed, it is
> > > +       @ called from here.
> > > +
> > > +       cmp     r7, #0
> > > +       mov     r0, #0          @ first (CPU) affinity level
> > > +       blxne   r7              @ Call power_up_setup if defined
> > > +
> > > +       @ Mark the CPU as up:
> > > +
> > > +       dsb
> > > +       mov     r0, #CPU_UP
> > > +       strb    r0, [r5]
> > > +       dsb
> > > +       sev
> > > 
> > >  bL_entry_gated:
> > >         ldr     r5, [r6, r4, lsl #2]            @ r5 = CPU entry vector
> > > @@ -70,6 +148,8 @@ bL_entry_gated:
> > >         .align  2
> > > 
> > >  3:     .word   bL_entry_vectors - .
> > > +       .word   bL_power_up_setup_phys - 3b
> > > +       .word   bL_sync - 3b
> > > 
> > >  ENDPROC(bL_entry_point)
> > > 
> > > @@ -79,3 +159,7 @@ ENDPROC(bL_entry_point)
> > >         .type   bL_entry_vectors, #object
> > >  ENTRY(bL_entry_vectors)
> > >         .space  4 * BL_NR_CLUSTERS * BL_CPUS_PER_CLUSTER
> > > +
> > > +       .type   bL_power_up_setup_phys, #object
> > > +ENTRY(bL_power_up_setup_phys)
> > > +       .space  4               @ set by bL_cluster_sync_init()
> > > diff --git a/arch/arm/include/asm/bL_entry.h b/arch/arm/include/asm/bL_entry.h
> > > index 942d7f9f19..167394d9a0 100644
> > > --- a/arch/arm/include/asm/bL_entry.h
> > > +++ b/arch/arm/include/asm/bL_entry.h
> > > @@ -15,8 +15,37 @@
> > >  #define BL_CPUS_PER_CLUSTER    4
> > >  #define BL_NR_CLUSTERS         2
> > > 
> > > +/* Definitions for bL_cluster_sync_struct */
> > > +#define CPU_DOWN               0x11
> > > +#define CPU_COMING_UP          0x12
> > > +#define CPU_UP                 0x13
> > > +#define CPU_GOING_DOWN         0x14
> > > +
> > > +#define CLUSTER_DOWN           0x21
> > > +#define CLUSTER_UP             0x22
> > > +#define CLUSTER_GOING_DOWN     0x23
> > > +
> > > +#define INBOUND_NOT_COMING_UP  0x31
> > > +#define INBOUND_COMING_UP      0x32
> > 
> > Do these numbers signify anything? Why not 0, 1, 2 etc?
> 
> Initially that's what they were.  But durring debugging (as we faced a 
> few cache coherency issues here) it was more useful to use numbers with 
> an easily distinguishable signature.  For example, a 0 may come from 
> anywhere and could mean anything so that is about the worst choice.
> Other than that, those numbers have no particular significance.
> 
> > > +
> > > +/* This is a complete guess. */
> > > +#define __CACHE_WRITEBACK_ORDER        6
> > 
> > Is this CONFIG_ARM_L1_CACHE_SHIFT?
> 
> No.  That has to cover L2 as well.

Of course, I seem to remember that there are assumptions elsewhere in 
the kernel that 1 << CONFIG_ARM_L1_CACHE_SHIFT is (at least) the cache
writeback granule.

I prefer not to use a macro with a wholly misleading name, but I would
like a "proper" way to get this value, if there is one ... ?

One reason for adding a #define here was to document the fact that the
value used really is a guess and that we have no correct way to discover
it.

> 
> > > +#define __CACHE_WRITEBACK_GRANULE (1 << __CACHE_WRITEBACK_ORDER)
> > > +
> > > +/* Offsets for the bL_cluster_sync_struct members, for use in asm: */
> > > +#define BL_SYNC_CLUSTER_CPUS   0
> > 
> > Why not use asm-offsets.h for this?
> 
> That's how that was done initially. But that ended up cluttering 
> asm-offsets.h for stuff that actually is really a local implementation 
> detail which doesn't need kernel wide scope.  In other words, the end 
> result looked worse.
> 
> One could argue that they are still exposed too much as the only files 
> that need to know about those defines are bL_head.S and bL_entry.c.
> 
> > > +#define BL_SYNC_CPU_SIZE       __CACHE_WRITEBACK_GRANULE
> > > +#define BL_SYNC_CLUSTER_CLUSTER \
> > > +       (BL_SYNC_CLUSTER_CPUS + BL_SYNC_CPU_SIZE * BL_CPUS_PER_CLUSTER)
> > > +#define BL_SYNC_CLUSTER_INBOUND \
> > > +       (BL_SYNC_CLUSTER_CLUSTER + __CACHE_WRITEBACK_GRANULE)
> > > +#define BL_SYNC_CLUSTER_SIZE \
> > > +       (BL_SYNC_CLUSTER_INBOUND + __CACHE_WRITEBACK_GRANULE)
> > > +
> > 
> > Hmm, this looks pretty fragile to me but again, you need this stuff at
> > compile time.
> 
> There are compile time and run time assertions in bL_entry.c to ensure 
> those offsets and the corresponding C structure don't get out of sync.
> 
> > Is there an architected maximum value for the writeback
> > granule? Failing that, we may as well just use things like

There is an architectural maximum, bit it is 2K (which although "safe"
feels a bit excessive for our purposes.  A 2+3 CPU system would require
at least 22K for the synchronisation data with this assumption, rising to
28K for 4+4.  Not the end of the world for .bss data on modern hardware
with GB of DRAM, but it still feels wasteful.

Does anyone have a view on how much we care?

If there is no outer cache, the actual granule size can be determined
via CP15 at run-time; if there is an outer cache, we would also need
to find out its granule somehow.

> > __cacheline_aligned if we're only using the L1 alignment anyway.
> 
> See above -- we need L2 alignment.

This partly depends on whether __cacheline_aligned is supposed to
guarantee cache writeback granule alignment.  Is it?  At best I was highly
uncertain about this.

Cheers
---Dave
Santosh Shilimkar Jan. 11, 2013, 5:46 p.m. UTC | #10
On Thursday 10 January 2013 05:50 AM, Nicolas Pitre wrote:
> From: Dave Martin <dave.martin@linaro.org>
>
> This provides helper methods to coordinate between CPUs coming down
> and CPUs going up, as well as documentation on the used algorithms,
> so that cluster teardown and setup
> operations are not done for a cluster simultaneously.
>
> For use in the power_down() implementation:
>    * __bL_cpu_going_down(unsigned int cluster, unsigned int cpu)
>    * __bL_outbound_enter_critical(unsigned int cluster)
>    * __bL_outbound_leave_critical(unsigned int cluster)
>    * __bL_cpu_down(unsigned int cluster, unsigned int cpu)
>
> The power_up_setup() helper should do platform-specific setup in
> preparation for turning the CPU on, such as invalidating local caches
> or entering coherency.  It must be assembler for now, since it must
> run before the MMU can be switched on.  It is passed the affinity level
> which should be initialized.
>
> Because the bL_cluster_sync_struct content is looked-up and modified
> with the cache enabled or disabled depending on the code path, it is
> crucial to always ensure proper cache maintenance to update main memory
> right away.  Therefore, any cached write must be followed by a cache clean
> operation and any cached read must be preceded by a cache invalidate
> operation on the accessed memory.
>
> To avoid races where a reader would invalidate the cache and discard the
> latest update from a writer before that writer had a chance to clean it
> to RAM, we simply use cache flush (clean+invalidate) operations
> everywhere.
>
> Also, in order to prevent a cached writer from interfering with an
> adjacent non-cached writer, we ensure each state variable is located to
> a separate cache line.
>
> Thanks to Nicolas Pitre and Achin Gupta for the help with this
> patch.
>
> Signed-off-by: Dave Martin <dave.martin@linaro.org>
> ---
>   .../arm/big.LITTLE/cluster-pm-race-avoidance.txt   | 498 +++++++++++++++++++++
>   arch/arm/common/bL_entry.c                         | 160 +++++++
>   arch/arm/common/bL_head.S                          |  88 +++-
>   arch/arm/include/asm/bL_entry.h                    |  62 +++
>   4 files changed, 806 insertions(+), 2 deletions(-)
>   create mode 100644 Documentation/arm/big.LITTLE/cluster-pm-race-avoidance.txt
>
> diff --git a/Documentation/arm/big.LITTLE/cluster-pm-race-avoidance.txt b/Documentation/arm/big.LITTLE/cluster-pm-race-avoidance.txt
> new file mode 100644
> index 0000000000..d6151e0235
> --- /dev/null
> +++ b/Documentation/arm/big.LITTLE/cluster-pm-race-avoidance.txt
> @@ -0,0 +1,498 @@
> +Big.LITTLE cluster Power-up/power-down race avoidance algorithm
> +===============================================================
> +
> +This file documents the algorithm which is used to coordinate CPU and
> +cluster setup and teardown operations and to manage hardware coherency
> +controls safely.
> +
> +The section "Rationale" explains what the algorithm is for and why it is
> +needed.  "Basic model" explains general concepts using a simplified view
> +of the system.  The other sections explain the actual details of the
> +algorithm in use.
> +
> +
> +Rationale
> +---------
> +
> +In a system containing multiple CPUs, it is desirable to have the
> +ability to turn off individual CPUs when the system is idle, reducing
> +power consumption and thermal dissipation.
> +
> +In a system containing multiple clusters of CPUs, it is also desirable
> +to have the ability to turn off entire clusters.
> +
> +Turning entire clusters off and on is a risky business, because it
> +involves performing potentially destructive operations affecting a group
> +of independently running CPUs, while the OS continues to run.  This
> +means that we need some coordination in order to ensure that critical
> +cluster-level operations are only performed when it is truly safe to do
> +so.
> +
> +Simple locking may not be sufficient to solve this problem, because
> +mechanisms like Linux spinlocks may rely on coherency mechanisms which
> +are not immediately enabled when a cluster powers up.  Since enabling or
> +disabling those mechanisms may itself be a non-atomic operation (such as
> +writing some hardware registers and invalidating large caches), other
> +methods of coordination are required in order to guarantee safe
> +power-down and power-up at the cluster level.
> +
> +The mechanism presented in this document describes a coherent memory
> +based protocol for performing the needed coordination.  It aims to be as
> +lightweight as possible, while providing the required safety properties.
> +
> +
> +Basic model
> +-----------
> +
> +Each cluster and CPU is assigned a state, as follows:
> +
> +	DOWN
> +	COMING_UP
> +	UP
> +	GOING_DOWN
> +
> +	    +---------> UP ----------+
> +	    |                        v
> +
> +	COMING_UP                GOING_DOWN
> +
> +	    ^                        |
> +	    +--------- DOWN <--------+
> +
> +
> +DOWN:	The CPU or cluster is not coherent, and is either powered off or
> +	suspended, or is ready to be powered off or suspended.
> +
> +COMING_UP: The CPU or cluster has committed to moving to the UP state.
> +	It may be part way through the process of initialisation and
> +	enabling coherency.
> +
> +UP:	The CPU or cluster is active and coherent at the hardware
> +	level.  A CPU in this state is not necessarily being used
> +	actively by the kernel.
> +
> +GOING_DOWN: The CPU or cluster has committed to moving to the DOWN
> +	state.  It may be part way through the process of teardown and
> +	coherency exit.
> +
> +
> +Each CPU has one of these states assigned to it at any point in time.
> +The CPU states are described in the "CPU state" section, below.
> +
> +Each cluster is also assigned a state, but it is necessary to split the
> +state value into two parts (the "cluster" state and "inbound" state) and
> +to introduce additional states in order to avoid races between different
> +CPUs in the cluster simultaneously modifying the state.  The cluster-
> +level states are described in the "Cluster state" section.
> +
> +To help distinguish the CPU states from cluster states in this
> +discussion, the state names are given a CPU_ prefix for the CPU states,
> +and a CLUSTER_ or INBOUND_ prefix for the cluster states.
> +
> +
> +CPU state
> +---------
> +
> +In this algorithm, each individual core in a multi-core processor is
> +referred to as a "CPU".  CPUs are assumed to be single-threaded:
> +therefore, a CPU can only be doing one thing at a single point in time.
> +
> +This means that CPUs fit the basic model closely.
> +
> +The algorithm defines the following states for each CPU in the system:
> +
> +	CPU_DOWN
> +	CPU_COMING_UP
> +	CPU_UP
> +	CPU_GOING_DOWN
> +
> +	 cluster setup and
> +	CPU setup complete          policy decision
> +	      +-----------> CPU_UP ------------+
> +	      |                                v
> +
> +	CPU_COMING_UP                   CPU_GOING_DOWN
> +
> +	      ^                                |
> +	      +----------- CPU_DOWN <----------+
> +	 policy decision           CPU teardown complete
> +	or hardware event
> +
> +
> +The definitions of the four states correspond closely to the states of
> +the basic model.
> +
> +Transitions between states occur as follows.
> +
> +A trigger event (spontaneous) means that the CPU can transition to the
> +next state as a result of making local progress only, with no
> +requirement for any external event to happen.
> +
> +
> +CPU_DOWN:
> +
> +	A CPU reaches the CPU_DOWN state when it is ready for
> +	power-down.  On reaching this state, the CPU will typically
> +	power itself down or suspend itself, via a WFI instruction or a
> +	firmware call.
> +
> +	Next state:	CPU_COMING_UP
> +	Conditions:	none
> +
> +	Trigger events:
> +
> +		a) an explicit hardware power-up operation, resulting
> +		   from a policy decision on another CPU;
> +
> +		b) a hardware event, such as an interrupt.
> +
> +
> +CPU_COMING_UP:
> +
> +	A CPU cannot start participating in hardware coherency until the
> +	cluster is set up and coherent.  If the cluster is not ready,
> +	then the CPU will wait in the CPU_COMING_UP state until the
> +	cluster has been set up.
> +
> +	Next state:	CPU_UP
> +	Conditions:	The CPU's parent cluster must be in CLUSTER_UP.
> +	Trigger events:	Transition of the parent cluster to CLUSTER_UP.
> +
> +	Refer to the "Cluster state" section for a description of the
> +	CLUSTER_UP state.
> +
> +
> +CPU_UP:
> +	When a CPU reaches the CPU_UP state, it is safe for the CPU to
> +	start participating in local coherency.
> +
> +	This is done by jumping to the kernel's CPU resume code.
> +
> +	Note that the definition of this state is slightly different
> +	from the basic model definition: CPU_UP does not mean that the
> +	CPU is coherent yet, but it does mean that it is safe to resume
> +	the kernel.  The kernel handles the rest of the resume
> +	procedure, so the remaining steps are not visible as part of the
> +	race avoidance algorithm.
> +
> +	The CPU remains in this state until an explicit policy decision
> +	is made to shut down or suspend the CPU.
> +
> +	Next state:	CPU_GOING_DOWN
> +	Conditions:	none
> +	Trigger events:	explicit policy decision
> +
> +
> +CPU_GOING_DOWN:
> +
> +	While in this state, the CPU exits coherency, including any
> +	operations required to achieve this (such as cleaning data
> +	caches).
> +
> +	Next state:	CPU_DOWN
> +	Conditions:	local CPU teardown complete
> +	Trigger events:	(spontaneous)
> +
> +
> +Cluster state
> +-------------
> +
> +A cluster is a group of connected CPUs with some common resources.
> +Because a cluster contains multiple CPUs, it can be doing multiple
> +things at the same time.  This has some implications.  In particular, a
> +CPU can start up while another CPU is tearing the cluster down.
> +
> +In this discussion, the "outbound side" is the view of the cluster state
> +as seen by a CPU tearing the cluster down.  The "inbound side" is the
> +view of the cluster state as seen by a CPU setting the CPU up.
> +
> +In order to enable safe coordination in such situations, it is important
> +that a CPU which is setting up the cluster can advertise its state
> +independently of the CPU which is tearing down the cluster.  For this
> +reason, the cluster state is split into two parts:
> +
> +	"cluster" state: The global state of the cluster; or the state
> +		on the outbound side:
> +
> +		CLUSTER_DOWN
> +		CLUSTER_UP
> +		CLUSTER_GOING_DOWN
> +
> +	"inbound" state: The state of the cluster on the inbound side.
> +
> +		INBOUND_NOT_COMING_UP
> +		INBOUND_COMING_UP
> +
> +
> +	The different pairings of these states results in six possible
> +	states for the cluster as a whole:
> +
> +	                            CLUSTER_UP
> +	          +==========> INBOUND_NOT_COMING_UP -------------+
> +	          #                                               |
> +	                                                          |
> +	     CLUSTER_UP     <----+                                |
> +	  INBOUND_COMING_UP      |                                v
> +
> +	          ^             CLUSTER_GOING_DOWN       CLUSTER_GOING_DOWN
> +	          #              INBOUND_COMING_UP <=== INBOUND_NOT_COMING_UP
> +
> +	    CLUSTER_DOWN         |                                |
> +	  INBOUND_COMING_UP <----+                                |
> +	                                                          |
> +	          ^                                               |
> +	          +===========     CLUSTER_DOWN      <------------+
> +	                       INBOUND_NOT_COMING_UP
> +
> +	Transitions -----> can only be made by the outbound CPU, and
> +	only involve changes to the "cluster" state.
> +
> +	Transitions ===##> can only be made by the inbound CPU, and only
> +	involve changes to the "inbound" state, except where there is no
> +	further transition possible on the outbound side (i.e., the
> +	outbound CPU has put the cluster into the CLUSTER_DOWN state).
> +
> +	The race avoidance algorithm does not provide a way to determine
> +	which exact CPUs within the cluster play these roles.  This must
> +	be decided in advance by some other means.  Refer to the section
> +	"Last man and first man selection" for more explanation.
> +
> +
> +	CLUSTER_DOWN/INBOUND_NOT_COMING_UP is the only state where the
> +	cluster can actually be powered down.
> +
> +	The parallelism of the inbound and outbound CPUs is observed by
> +	the existence of two different paths from CLUSTER_GOING_DOWN/
> +	INBOUND_NOT_COMING_UP (corresponding to GOING_DOWN in the basic
> +	model) to CLUSTER_DOWN/INBOUND_COMING_UP (corresponding to
> +	COMING_UP in the basic model).  The second path avoids cluster
> +	teardown completely.
> +
> +	CLUSTER_UP/INBOUND_COMING_UP is equivalent to UP in the basic
> +	model.  The final transition to CLUSTER_UP/INBOUND_NOT_COMING_UP
> +	is trivial and merely resets the state machine ready for the
> +	next cycle.
> +
> +	Details of the allowable transitions follow.
> +
> +	The next state in each case is notated
> +
> +		<cluster state>/<inbound state> (<transitioner>)
> +
> +	where the <transitioner> is the side on which the transition
> +	can occur; either the inbound or the outbound side.
> +
> +
> +CLUSTER_DOWN/INBOUND_NOT_COMING_UP:
> +
> +	Next state:	CLUSTER_DOWN/INBOUND_COMING_UP (inbound)
> +	Conditions:	none
> +	Trigger events:
> +
> +		a) an explicit hardware power-up operation, resulting
> +		   from a policy decision on another CPU;
> +
> +		b) a hardware event, such as an interrupt.
> +
> +
> +CLUSTER_DOWN/INBOUND_COMING_UP:
> +
> +	In this state, an inbound CPU sets up the cluster, including
> +	enabling of hardware coherency at the cluster level and any
> +	other operations (such as cache invalidation) which are required
> +	in order to achieve this.
> +
> +	The purpose of this state is to do sufficient cluster-level
> +	setup to enable other CPUs in the cluster to enter coherency
> +	safely.
> +
> +	Next state:	CLUSTER_UP/INBOUND_COMING_UP (inbound)
> +	Conditions:	cluster-level setup and hardware coherency complete
> +	Trigger events:	(spontaneous)
> +
> +
> +CLUSTER_UP/INBOUND_COMING_UP:
> +
> +	Cluster-level setup is complete and hardware coherency is
> +	enabled for the cluster.  Other CPUs in the cluster can safely
> +	enter coherency.
> +
> +	This is a transient state, leading immediately to
> +	CLUSTER_UP/INBOUND_NOT_COMING_UP.  All other CPUs on the cluster
> +	should consider treat these two states as equivalent.
> +
> +	Next state:	CLUSTER_UP/INBOUND_NOT_COMING_UP (inbound)
> +	Conditions:	none
> +	Trigger events:	(spontaneous)
> +
> +
> +CLUSTER_UP/INBOUND_NOT_COMING_UP:
> +
> +	Cluster-level setup is complete and hardware coherency is
> +	enabled for the cluster.  Other CPUs in the cluster can safely
> +	enter coherency.
> +
> +	The cluster will remain in this state until a policy decision is
> +	made to power the cluster down.
> +
> +	Next state:	CLUSTER_GOING_DOWN/INBOUND_NOT_COMING_UP (outbound)
> +	Conditions:	none
> +	Trigger events:	policy decision to power down the cluster
> +
> +
> +CLUSTER_GOING_DOWN/INBOUND_NOT_COMING_UP:
> +
> +	An outbound CPU is tearing the cluster down.  The selected CPU
> +	must wait in this state until all CPUs in the cluster are in the
> +	CPU_DOWN state.
> +
> +	When all CPUs are in the CPU_DOWN state, the cluster can be torn
> +	down, for example by cleaning data caches and exiting
> +	cluster-level coherency.
> +
> +	To avoid wasteful unnecessary teardown operations, the outbound
> +	should check the inbound cluster state for asynchronous
> +	transitions to INBOUND_COMING_UP.  Alternatively, individual
> +	CPUs can be checked for entry into CPU_COMING_UP or CPU_UP.
> +
> +
> +	Next states:
> +
> +	CLUSTER_DOWN/INBOUND_NOT_COMING_UP (outbound)
> +		Conditions:	cluster torn down and ready to power off
> +		Trigger events:	(spontaneous)
> +
> +	CLUSTER_GOING_DOWN/INBOUND_COMING_UP (inbound)
> +		Conditions:	none
> +		Trigger events:
> +
> +			a) an explicit hardware power-up operation,
> +			   resulting from a policy decision on another
> +			   CPU;
> +
> +			b) a hardware event, such as an interrupt.
> +
> +
> +CLUSTER_GOING_DOWN/INBOUND_COMING_UP:
> +
> +	The cluster is (or was) being torn down, but another CPU has
> +	come online in the meantime and is trying to set up the cluster
> +	again.
> +
> +	If the outbound CPU observes this state, it has two choices:
> +
> +		a) back out of teardown, restoring the cluster to the
> +		   CLUSTER_UP state;
> +
> +		b) finish tearing the cluster down and put the cluster
> +		   in the CLUSTER_DOWN state; the inbound CPU will
> +		   set up the cluster again from there.
> +
> +	Choice (a) permits the removal of some latency by avoiding
> +	unnecessary teardown and setup operations in situations where
> +	the cluster is not really going to be powered down.
> +
> +
> +	Next states:
> +
> +	CLUSTER_UP/INBOUND_COMING_UP (outbound)
> +		Conditions:	cluster-level setup and hardware
> +				coherency complete
> +		Trigger events:	(spontaneous)
> +
> +	CLUSTER_DOWN/INBOUND_COMING_UP (outbound)
> +		Conditions:	cluster torn down and ready to power off
> +		Trigger events:	(spontaneous)
> +
> +
> +Last man and First man selection
> +--------------------------------
> +
> +The CPU which performs cluster tear-down operations on the outbound side
> +is commonly referred to as the "last man".
> +
> +The CPU which performs cluster setup on the inbound side is commonly
> +referred to as the "first man".
> +
> +The race avoidance algorithm documented above does not provide a
> +mechanism to choose which CPUs should play these roles.
> +
> +
> +Last man:
> +
> +When shutting down the cluster, all the CPUs involved are initially
> +executing Linux and hence coherent.  Therefore, ordinary spinlocks can
> +be used to select a last man safely, before the CPUs become
> +non-coherent.
> +
> +
> +First man:
> +
> +Because CPUs may power up asynchronously in response to external wake-up
> +events, a dynamic mechanism is needed to make sure that only one CPU
> +attempts to play the first man role and do the cluster-level
> +initialisation: any other CPUs must wait for this to complete before
> +proceeding.
> +
> +Cluster-level initialisation may involve actions such as configuring
> +coherency controls in the bus fabric.
> +
> +The current implementation in bL_head.S uses a separate mutual exclusion
> +mechanism to do this arbitration.  This mechanism is documented in
> +detail in vlocks.txt.
> +
> +
> +Features and Limitations
> +------------------------
> +
> +Implementation:
> +
> +	The current ARM-based implementation is split between
> +	arch/arm/common/bL_head.S (low-level inbound CPU operations) and
> +	arch/arm/common/bL_entry.c (everything else):
> +
> +	__bL_cpu_going_down() signals the transition of a CPU to the
> +		CPU_GOING_DOWN state.
> +
> +	__bL_cpu_down() signals the transition of a CPU to the CPU_DOWN
> +		state.
> +
> +	A CPU transitions to CPU_COMING_UP and then to CPU_UP via the
> +		low-level power-up code in bL_head.S.  This could
> +		involve CPU-specific setup code, but in the current
> +		implementation it does not.
> +
> +	__bL_outbound_enter_critical() and __bL_outbound_leave_critical()
> +		handle transitions from CLUSTER_UP to CLUSTER_GOING_DOWN
> +		and from there to CLUSTER_DOWN or back to CLUSTER_UP (in
> +		the case of an aborted cluster power-down).
> +
> +		These functions are more complex than the __bL_cpu_*()
> +		functions due to the extra inter-CPU coordination which
> +		is needed for safe transitions at the cluster level.
> +
> +	A cluster transitions from CLUSTER_DOWN back to CLUSTER_UP via
> +		the low-level power-up code in bL_head.S.  This
> +		typically involves platform-specific setup code,
> +		provided by the platform-specific power_up_setup
> +		function registered via bL_cluster_sync_init.
> +
> +Deep topologies:
> +
> +	As currently described and implemented, the algorithm does not
> +	support CPU topologies involving more than two levels (i.e.,
> +	clusters of clusters are not supported).  The algorithm could be
> +	extended by replicating the cluster-level states for the
> +	additional topological levels, and modifying the transition
> +	rules for the intermediate (non-outermost) cluster levels.
> +
> +
> +Colophon
> +--------
> +
> +Originally created and documented by Dave Martin for Linaro Limited, in
> +collaboration with Nicolas Pitre and Achin Gupta.
> +
Great write-up Dave!! I might have to do couple of more passes on it to
get overall idea, but surely this documentation is good start for
anybody reading/reviewing the big.LITTLE switcher code.

> +Copyright (C) 2012  Linaro Limited
> +Distributed under the terms of Version 2 of the GNU General Public
> +License, as defined in linux/COPYING.
> diff --git a/arch/arm/common/bL_entry.c b/arch/arm/common/bL_entry.c
> index 41de0622de..1ea4ec9df0 100644
> --- a/arch/arm/common/bL_entry.c
> +++ b/arch/arm/common/bL_entry.c
> @@ -116,3 +116,163 @@ int bL_cpu_powered_up(void)
>   		platform_ops->powered_up();
>   	return 0;
>   }
> +
> +struct bL_sync_struct bL_sync;
> +
> +static void __sync_range(volatile void *p, size_t size)
> +{
> +	char *_p = (char *)p;
> +
> +	__cpuc_flush_dcache_area(_p, size);
> +	outer_flush_range(__pa(_p), __pa(_p + size));
> +	outer_sync();
> +}
> +
> +#define sync_mem(ptr) __sync_range(ptr, sizeof *(ptr))
> +
> +/*
/** as per kerneldoc.

> + * __bL_cpu_going_down: Indicates that the cpu is being torn down.
> + *    This must be called at the point of committing to teardown of a CPU.
> + *    The CPU cache (SCTRL.C bit) is expected to still be active.
> + */
> +void __bL_cpu_going_down(unsigned int cpu, unsigned int cluster)
> +{
> +	bL_sync.clusters[cluster].cpus[cpu].cpu = CPU_GOING_DOWN;
> +	sync_mem(&bL_sync.clusters[cluster].cpus[cpu].cpu);
> +}
> +

[..]

> diff --git a/arch/arm/common/bL_head.S b/arch/arm/common/bL_head.S
> index 9d351f2b4c..f7a64ac127 100644
> --- a/arch/arm/common/bL_head.S
> +++ b/arch/arm/common/bL_head.S
> @@ -7,11 +7,19 @@
>    * This program is free software; you can redistribute it and/or modify
>    * it under the terms of the GNU General Public License version 2 as
>    * published by the Free Software Foundation.
> + *
> + *
> + * Refer to Documentation/arm/big.LITTLE/cluster-pm-race-avoidance.txt
> + * for details of the synchronisation algorithms used here.
>    */
>
>   #include <linux/linkage.h>
>   #include <asm/bL_entry.h>
>
> +.if BL_SYNC_CLUSTER_CPUS
> +.error "cpus must be the first member of struct bL_cluster_sync_struct"
> +.endif
> +
>   	.macro	pr_dbg	cpu, string
>   #if defined(CONFIG_DEBUG_LL) && defined(DEBUG)
>   	b	1901f
> @@ -52,12 +60,82 @@ ENTRY(bL_entry_point)
>   2:	pr_dbg	r4, "kernel bL_entry_point\n"
>
>   	/*
> -	 * MMU is off so we need to get to bL_entry_vectors in a
> +	 * MMU is off so we need to get to various variables in a
>   	 * position independent way.
>   	 */
>   	adr	r5, 3f
> -	ldr	r6, [r5]
> +	ldmia	r5, {r6, r7, r8}
>   	add	r6, r5, r6			@ r6 = bL_entry_vectors
> +	ldr	r7, [r5, r7]			@ r7 = bL_power_up_setup_phys
> +	add	r8, r5, r8			@ r8 = bL_sync
> +
> +	mov	r0, #BL_SYNC_CLUSTER_SIZE
> +	mla	r8, r0, r10, r8			@ r8 = bL_sync cluster base
> +
> +	@ Signal that this CPU is coming UP:
> +	mov	r0, #CPU_COMING_UP
> +	mov	r5, #BL_SYNC_CPU_SIZE
> +	mla	r5, r9, r5, r8			@ r5 = bL_sync cpu address
> +	strb	r0, [r5]
> +
> +	dsb
Do you really need above dsb(). With MMU off, the the store should any 
way make it to the main memory, No ?

> +
> +	@ At this point, the cluster cannot unexpectedly enter the GOING_DOWN
> +	@ state, because there is at least one active CPU (this CPU).
> +
> +	@ Check if the cluster has been set up yet:
> +	ldrb	r0, [r8, #BL_SYNC_CLUSTER_CLUSTER]
> +	cmp	r0, #CLUSTER_UP
> +	beq	cluster_already_up
> +
> +	@ Signal that the cluster is being brought up:
> +	mov	r0, #INBOUND_COMING_UP
> +	strb	r0, [r8, #BL_SYNC_CLUSTER_INBOUND]
> +
> +	dsb
Same comment.

Regards,
Santosh
tip-bot for Dave Martin Jan. 11, 2013, 6:07 p.m. UTC | #11
On Fri, Jan 11, 2013 at 11:16:18PM +0530, Santosh Shilimkar wrote:

[...]

> >+Originally created and documented by Dave Martin for Linaro Limited, in
> >+collaboration with Nicolas Pitre and Achin Gupta.
> >+
> Great write-up Dave!! I might have to do couple of more passes on it to
> get overall idea, but surely this documentation is good start for
> anybody reading/reviewing the big.LITTLE switcher code.

Thanks for reading through it.  Partly, this was insurance against me
forgetting how the code worked in between writing and posting it...
but this is all quite subtle code, so it felt important to document
it thoroughly.

> 
> >+Copyright (C) 2012  Linaro Limited
> >+Distributed under the terms of Version 2 of the GNU General Public
> >+License, as defined in linux/COPYING.
> >diff --git a/arch/arm/common/bL_entry.c b/arch/arm/common/bL_entry.c
> >index 41de0622de..1ea4ec9df0 100644
> >--- a/arch/arm/common/bL_entry.c
> >+++ b/arch/arm/common/bL_entry.c
> >@@ -116,3 +116,163 @@ int bL_cpu_powered_up(void)
> >  		platform_ops->powered_up();
> >  	return 0;
> >  }
> >+
> >+struct bL_sync_struct bL_sync;
> >+
> >+static void __sync_range(volatile void *p, size_t size)
> >+{
> >+	char *_p = (char *)p;
> >+
> >+	__cpuc_flush_dcache_area(_p, size);
> >+	outer_flush_range(__pa(_p), __pa(_p + size));
> >+	outer_sync();
> >+}
> >+
> >+#define sync_mem(ptr) __sync_range(ptr, sizeof *(ptr))
> >+
> >+/*
> /** as per kerneldoc.

Does kerneldoc not require the comment to be specially formatted?

I haven't played with that, so far.

> 
> >+ * __bL_cpu_going_down: Indicates that the cpu is being torn down.
> >+ *    This must be called at the point of committing to teardown of a CPU.
> >+ *    The CPU cache (SCTRL.C bit) is expected to still be active.
> >+ */
> >+void __bL_cpu_going_down(unsigned int cpu, unsigned int cluster)
> >+{
> >+	bL_sync.clusters[cluster].cpus[cpu].cpu = CPU_GOING_DOWN;
> >+	sync_mem(&bL_sync.clusters[cluster].cpus[cpu].cpu);
> >+}
> >+
> 
> [..]
> 
> >diff --git a/arch/arm/common/bL_head.S b/arch/arm/common/bL_head.S
> >index 9d351f2b4c..f7a64ac127 100644
> >--- a/arch/arm/common/bL_head.S
> >+++ b/arch/arm/common/bL_head.S
> >@@ -7,11 +7,19 @@
> >   * This program is free software; you can redistribute it and/or modify
> >   * it under the terms of the GNU General Public License version 2 as
> >   * published by the Free Software Foundation.
> >+ *
> >+ *
> >+ * Refer to Documentation/arm/big.LITTLE/cluster-pm-race-avoidance.txt
> >+ * for details of the synchronisation algorithms used here.
> >   */
> >
> >  #include <linux/linkage.h>
> >  #include <asm/bL_entry.h>
> >
> >+.if BL_SYNC_CLUSTER_CPUS
> >+.error "cpus must be the first member of struct bL_cluster_sync_struct"
> >+.endif
> >+
> >  	.macro	pr_dbg	cpu, string
> >  #if defined(CONFIG_DEBUG_LL) && defined(DEBUG)
> >  	b	1901f
> >@@ -52,12 +60,82 @@ ENTRY(bL_entry_point)
> >  2:	pr_dbg	r4, "kernel bL_entry_point\n"
> >
> >  	/*
> >-	 * MMU is off so we need to get to bL_entry_vectors in a
> >+	 * MMU is off so we need to get to various variables in a
> >  	 * position independent way.
> >  	 */
> >  	adr	r5, 3f
> >-	ldr	r6, [r5]
> >+	ldmia	r5, {r6, r7, r8}
> >  	add	r6, r5, r6			@ r6 = bL_entry_vectors
> >+	ldr	r7, [r5, r7]			@ r7 = bL_power_up_setup_phys
> >+	add	r8, r5, r8			@ r8 = bL_sync
> >+
> >+	mov	r0, #BL_SYNC_CLUSTER_SIZE
> >+	mla	r8, r0, r10, r8			@ r8 = bL_sync cluster base
> >+
> >+	@ Signal that this CPU is coming UP:
> >+	mov	r0, #CPU_COMING_UP
> >+	mov	r5, #BL_SYNC_CPU_SIZE
> >+	mla	r5, r9, r5, r8			@ r5 = bL_sync cpu address
> >+	strb	r0, [r5]
> >+
> >+	dsb
> Do you really need above dsb(). With MMU off, the the store should

The short answer is "maybe not".  Some of the barriers can be
eliminated; some can be demoted to DSBs.  Others may be required but
unnecessarily duplicated e.g., between bL_head.S and vlock.S.

> any way make it to the main memory, No ?

Yes, but this raises issues about precisely what the architecture
guarantees about memory ordering in these scenarios.  The only obvious
thing about that is that it's non-obvious.

Strongly-Ordered memory is not quite the same as having explicit
barriers everywhere.

I need to have a careful think, but it should be possible to optimise
a bit here.

> 
> >+
> >+	@ At this point, the cluster cannot unexpectedly enter the GOING_DOWN
> >+	@ state, because there is at least one active CPU (this CPU).
> >+
> >+	@ Check if the cluster has been set up yet:
> >+	ldrb	r0, [r8, #BL_SYNC_CLUSTER_CLUSTER]
> >+	cmp	r0, #CLUSTER_UP
> >+	beq	cluster_already_up
> >+
> >+	@ Signal that the cluster is being brought up:
> >+	mov	r0, #INBOUND_COMING_UP
> >+	strb	r0, [r8, #BL_SYNC_CLUSTER_INBOUND]
> >+
> >+	dsb
> Same comment.

Same answer... for now

Cheers
---Dave
Santosh Shilimkar Jan. 11, 2013, 6:34 p.m. UTC | #12
On Friday 11 January 2013 11:37 PM, Dave Martin wrote:
> On Fri, Jan 11, 2013 at 11:16:18PM +0530, Santosh Shilimkar wrote:
>
> [...]
>
>>> +Originally created and documented by Dave Martin for Linaro Limited, in
>>> +collaboration with Nicolas Pitre and Achin Gupta.
>>> +
>> Great write-up Dave!! I might have to do couple of more passes on it to
>> get overall idea, but surely this documentation is good start for
>> anybody reading/reviewing the big.LITTLE switcher code.
>
> Thanks for reading through it.  Partly, this was insurance against me
> forgetting how the code worked in between writing and posting it...
> but this is all quite subtle code, so it felt important to document
> it thoroughly.
>
>>
>>> +Copyright (C) 2012  Linaro Limited
>>> +Distributed under the terms of Version 2 of the GNU General Public
>>> +License, as defined in linux/COPYING.
>>> diff --git a/arch/arm/common/bL_entry.c b/arch/arm/common/bL_entry.c
>>> index 41de0622de..1ea4ec9df0 100644
>>> --- a/arch/arm/common/bL_entry.c
>>> +++ b/arch/arm/common/bL_entry.c
>>> @@ -116,3 +116,163 @@ int bL_cpu_powered_up(void)
>>>   		platform_ops->powered_up();
>>>   	return 0;
>>>   }
>>> +
>>> +struct bL_sync_struct bL_sync;
>>> +
>>> +static void __sync_range(volatile void *p, size_t size)
>>> +{
>>> +	char *_p = (char *)p;
>>> +
>>> +	__cpuc_flush_dcache_area(_p, size);
>>> +	outer_flush_range(__pa(_p), __pa(_p + size));
>>> +	outer_sync();
>>> +}
>>> +
>>> +#define sync_mem(ptr) __sync_range(ptr, sizeof *(ptr))
>>> +
>>> +/*
>> /** as per kerneldoc.
>
> Does kerneldoc not require the comment to be specially formatted?
>
> I haven't played with that, so far.
>
>>
>>> + * __bL_cpu_going_down: Indicates that the cpu is being torn down.
>>> + *    This must be called at the point of committing to teardown of a CPU.
>>> + *    The CPU cache (SCTRL.C bit) is expected to still be active.
>>> + */
>>> +void __bL_cpu_going_down(unsigned int cpu, unsigned int cluster)
>>> +{
>>> +	bL_sync.clusters[cluster].cpus[cpu].cpu = CPU_GOING_DOWN;
>>> +	sync_mem(&bL_sync.clusters[cluster].cpus[cpu].cpu);
>>> +}
>>> +
>>
>> [..]
>>
>>> diff --git a/arch/arm/common/bL_head.S b/arch/arm/common/bL_head.S
>>> index 9d351f2b4c..f7a64ac127 100644
>>> --- a/arch/arm/common/bL_head.S
>>> +++ b/arch/arm/common/bL_head.S
>>> @@ -7,11 +7,19 @@
>>>    * This program is free software; you can redistribute it and/or modify
>>>    * it under the terms of the GNU General Public License version 2 as
>>>    * published by the Free Software Foundation.
>>> + *
>>> + *
>>> + * Refer to Documentation/arm/big.LITTLE/cluster-pm-race-avoidance.txt
>>> + * for details of the synchronisation algorithms used here.
>>>    */
>>>
>>>   #include <linux/linkage.h>
>>>   #include <asm/bL_entry.h>
>>>
>>> +.if BL_SYNC_CLUSTER_CPUS
>>> +.error "cpus must be the first member of struct bL_cluster_sync_struct"
>>> +.endif
>>> +
>>>   	.macro	pr_dbg	cpu, string
>>>   #if defined(CONFIG_DEBUG_LL) && defined(DEBUG)
>>>   	b	1901f
>>> @@ -52,12 +60,82 @@ ENTRY(bL_entry_point)
>>>   2:	pr_dbg	r4, "kernel bL_entry_point\n"
>>>
>>>   	/*
>>> -	 * MMU is off so we need to get to bL_entry_vectors in a
>>> +	 * MMU is off so we need to get to various variables in a
>>>   	 * position independent way.
>>>   	 */
>>>   	adr	r5, 3f
>>> -	ldr	r6, [r5]
>>> +	ldmia	r5, {r6, r7, r8}
>>>   	add	r6, r5, r6			@ r6 = bL_entry_vectors
>>> +	ldr	r7, [r5, r7]			@ r7 = bL_power_up_setup_phys
>>> +	add	r8, r5, r8			@ r8 = bL_sync
>>> +
>>> +	mov	r0, #BL_SYNC_CLUSTER_SIZE
>>> +	mla	r8, r0, r10, r8			@ r8 = bL_sync cluster base
>>> +
>>> +	@ Signal that this CPU is coming UP:
>>> +	mov	r0, #CPU_COMING_UP
>>> +	mov	r5, #BL_SYNC_CPU_SIZE
>>> +	mla	r5, r9, r5, r8			@ r5 = bL_sync cpu address
>>> +	strb	r0, [r5]
>>> +
>>> +	dsb
>> Do you really need above dsb(). With MMU off, the the store should
>
> The short answer is "maybe not".  Some of the barriers can be
> eliminated; some can be demoted to DSBs.  Others may be required but
> unnecessarily duplicated e.g., between bL_head.S and vlock.S.
>
>> any way make it to the main memory, No ?
>
> Yes, but this raises issues about precisely what the architecture
> guarantees about memory ordering in these scenarios.  The only obvious
> thing about that is that it's non-obvious.
>
Well at least ARM documents clearly says the memory accesses will be
treated as strongly ordered with MMU OFF and that means they expect
to make it to main memory.

> Strongly-Ordered memory is not quite the same as having explicit
> barriers everywhere.
>
> I need to have a careful think, but it should be possible to optimise
> a bit here.
>
If the CCI comes in between that rule and if it needs a barrier to let
it flush is WB to main memory then thats a different story.

Anyway thanks for the answer.
Regards
Santosh
diff mbox

Patch

diff --git a/Documentation/arm/big.LITTLE/cluster-pm-race-avoidance.txt b/Documentation/arm/big.LITTLE/cluster-pm-race-avoidance.txt
new file mode 100644
index 0000000000..d6151e0235
--- /dev/null
+++ b/Documentation/arm/big.LITTLE/cluster-pm-race-avoidance.txt
@@ -0,0 +1,498 @@ 
+Big.LITTLE cluster Power-up/power-down race avoidance algorithm
+===============================================================
+
+This file documents the algorithm which is used to coordinate CPU and
+cluster setup and teardown operations and to manage hardware coherency
+controls safely.
+
+The section "Rationale" explains what the algorithm is for and why it is
+needed.  "Basic model" explains general concepts using a simplified view
+of the system.  The other sections explain the actual details of the
+algorithm in use.
+
+
+Rationale
+---------
+
+In a system containing multiple CPUs, it is desirable to have the
+ability to turn off individual CPUs when the system is idle, reducing
+power consumption and thermal dissipation.
+
+In a system containing multiple clusters of CPUs, it is also desirable
+to have the ability to turn off entire clusters.
+
+Turning entire clusters off and on is a risky business, because it
+involves performing potentially destructive operations affecting a group
+of independently running CPUs, while the OS continues to run.  This
+means that we need some coordination in order to ensure that critical
+cluster-level operations are only performed when it is truly safe to do
+so.
+
+Simple locking may not be sufficient to solve this problem, because
+mechanisms like Linux spinlocks may rely on coherency mechanisms which
+are not immediately enabled when a cluster powers up.  Since enabling or
+disabling those mechanisms may itself be a non-atomic operation (such as
+writing some hardware registers and invalidating large caches), other
+methods of coordination are required in order to guarantee safe
+power-down and power-up at the cluster level.
+
+The mechanism presented in this document describes a coherent memory
+based protocol for performing the needed coordination.  It aims to be as
+lightweight as possible, while providing the required safety properties.
+
+
+Basic model
+-----------
+
+Each cluster and CPU is assigned a state, as follows:
+
+	DOWN
+	COMING_UP
+	UP
+	GOING_DOWN
+
+	    +---------> UP ----------+
+	    |                        v
+
+	COMING_UP                GOING_DOWN
+
+	    ^                        |
+	    +--------- DOWN <--------+
+
+
+DOWN:	The CPU or cluster is not coherent, and is either powered off or
+	suspended, or is ready to be powered off or suspended.
+
+COMING_UP: The CPU or cluster has committed to moving to the UP state.
+	It may be part way through the process of initialisation and
+	enabling coherency.
+
+UP:	The CPU or cluster is active and coherent at the hardware
+	level.  A CPU in this state is not necessarily being used
+	actively by the kernel.
+
+GOING_DOWN: The CPU or cluster has committed to moving to the DOWN
+	state.  It may be part way through the process of teardown and
+	coherency exit.
+
+
+Each CPU has one of these states assigned to it at any point in time.
+The CPU states are described in the "CPU state" section, below.
+
+Each cluster is also assigned a state, but it is necessary to split the
+state value into two parts (the "cluster" state and "inbound" state) and
+to introduce additional states in order to avoid races between different
+CPUs in the cluster simultaneously modifying the state.  The cluster-
+level states are described in the "Cluster state" section.
+
+To help distinguish the CPU states from cluster states in this
+discussion, the state names are given a CPU_ prefix for the CPU states,
+and a CLUSTER_ or INBOUND_ prefix for the cluster states.
+
+
+CPU state
+---------
+
+In this algorithm, each individual core in a multi-core processor is
+referred to as a "CPU".  CPUs are assumed to be single-threaded:
+therefore, a CPU can only be doing one thing at a single point in time.
+
+This means that CPUs fit the basic model closely.
+
+The algorithm defines the following states for each CPU in the system:
+
+	CPU_DOWN
+	CPU_COMING_UP
+	CPU_UP
+	CPU_GOING_DOWN
+
+	 cluster setup and
+	CPU setup complete          policy decision
+	      +-----------> CPU_UP ------------+
+	      |                                v
+
+	CPU_COMING_UP                   CPU_GOING_DOWN
+
+	      ^                                |
+	      +----------- CPU_DOWN <----------+
+	 policy decision           CPU teardown complete
+	or hardware event
+
+
+The definitions of the four states correspond closely to the states of
+the basic model.
+
+Transitions between states occur as follows.
+
+A trigger event (spontaneous) means that the CPU can transition to the
+next state as a result of making local progress only, with no
+requirement for any external event to happen.
+
+
+CPU_DOWN:
+
+	A CPU reaches the CPU_DOWN state when it is ready for
+	power-down.  On reaching this state, the CPU will typically
+	power itself down or suspend itself, via a WFI instruction or a
+	firmware call.
+
+	Next state:	CPU_COMING_UP
+	Conditions:	none
+
+	Trigger events:
+
+		a) an explicit hardware power-up operation, resulting
+		   from a policy decision on another CPU;
+
+		b) a hardware event, such as an interrupt.
+
+
+CPU_COMING_UP:
+
+	A CPU cannot start participating in hardware coherency until the
+	cluster is set up and coherent.  If the cluster is not ready,
+	then the CPU will wait in the CPU_COMING_UP state until the
+	cluster has been set up.
+
+	Next state:	CPU_UP
+	Conditions:	The CPU's parent cluster must be in CLUSTER_UP.
+	Trigger events:	Transition of the parent cluster to CLUSTER_UP.
+
+	Refer to the "Cluster state" section for a description of the
+	CLUSTER_UP state.
+
+
+CPU_UP:
+	When a CPU reaches the CPU_UP state, it is safe for the CPU to
+	start participating in local coherency.
+
+	This is done by jumping to the kernel's CPU resume code.
+
+	Note that the definition of this state is slightly different
+	from the basic model definition: CPU_UP does not mean that the
+	CPU is coherent yet, but it does mean that it is safe to resume
+	the kernel.  The kernel handles the rest of the resume
+	procedure, so the remaining steps are not visible as part of the
+	race avoidance algorithm.
+
+	The CPU remains in this state until an explicit policy decision
+	is made to shut down or suspend the CPU.
+
+	Next state:	CPU_GOING_DOWN
+	Conditions:	none
+	Trigger events:	explicit policy decision
+
+
+CPU_GOING_DOWN:
+
+	While in this state, the CPU exits coherency, including any
+	operations required to achieve this (such as cleaning data
+	caches).
+
+	Next state:	CPU_DOWN
+	Conditions:	local CPU teardown complete
+	Trigger events:	(spontaneous)
+
+
+Cluster state
+-------------
+
+A cluster is a group of connected CPUs with some common resources.
+Because a cluster contains multiple CPUs, it can be doing multiple
+things at the same time.  This has some implications.  In particular, a
+CPU can start up while another CPU is tearing the cluster down.
+
+In this discussion, the "outbound side" is the view of the cluster state
+as seen by a CPU tearing the cluster down.  The "inbound side" is the
+view of the cluster state as seen by a CPU setting the CPU up.
+
+In order to enable safe coordination in such situations, it is important
+that a CPU which is setting up the cluster can advertise its state
+independently of the CPU which is tearing down the cluster.  For this
+reason, the cluster state is split into two parts:
+
+	"cluster" state: The global state of the cluster; or the state
+		on the outbound side:
+
+		CLUSTER_DOWN
+		CLUSTER_UP
+		CLUSTER_GOING_DOWN
+
+	"inbound" state: The state of the cluster on the inbound side.
+
+		INBOUND_NOT_COMING_UP
+		INBOUND_COMING_UP
+
+
+	The different pairings of these states results in six possible
+	states for the cluster as a whole:
+
+	                            CLUSTER_UP
+	          +==========> INBOUND_NOT_COMING_UP -------------+
+	          #                                               |
+	                                                          |
+	     CLUSTER_UP     <----+                                |
+	  INBOUND_COMING_UP      |                                v
+
+	          ^             CLUSTER_GOING_DOWN       CLUSTER_GOING_DOWN
+	          #              INBOUND_COMING_UP <=== INBOUND_NOT_COMING_UP
+
+	    CLUSTER_DOWN         |                                |
+	  INBOUND_COMING_UP <----+                                |
+	                                                          |
+	          ^                                               |
+	          +===========     CLUSTER_DOWN      <------------+
+	                       INBOUND_NOT_COMING_UP
+
+	Transitions -----> can only be made by the outbound CPU, and
+	only involve changes to the "cluster" state.
+
+	Transitions ===##> can only be made by the inbound CPU, and only
+	involve changes to the "inbound" state, except where there is no
+	further transition possible on the outbound side (i.e., the
+	outbound CPU has put the cluster into the CLUSTER_DOWN state).
+
+	The race avoidance algorithm does not provide a way to determine
+	which exact CPUs within the cluster play these roles.  This must
+	be decided in advance by some other means.  Refer to the section
+	"Last man and first man selection" for more explanation.
+
+
+	CLUSTER_DOWN/INBOUND_NOT_COMING_UP is the only state where the
+	cluster can actually be powered down.
+
+	The parallelism of the inbound and outbound CPUs is observed by
+	the existence of two different paths from CLUSTER_GOING_DOWN/
+	INBOUND_NOT_COMING_UP (corresponding to GOING_DOWN in the basic
+	model) to CLUSTER_DOWN/INBOUND_COMING_UP (corresponding to
+	COMING_UP in the basic model).  The second path avoids cluster
+	teardown completely.
+
+	CLUSTER_UP/INBOUND_COMING_UP is equivalent to UP in the basic
+	model.  The final transition to CLUSTER_UP/INBOUND_NOT_COMING_UP
+	is trivial and merely resets the state machine ready for the
+	next cycle.
+
+	Details of the allowable transitions follow.
+
+	The next state in each case is notated
+
+		<cluster state>/<inbound state> (<transitioner>)
+
+	where the <transitioner> is the side on which the transition
+	can occur; either the inbound or the outbound side.
+
+
+CLUSTER_DOWN/INBOUND_NOT_COMING_UP:
+
+	Next state:	CLUSTER_DOWN/INBOUND_COMING_UP (inbound)
+	Conditions:	none
+	Trigger events:
+
+		a) an explicit hardware power-up operation, resulting
+		   from a policy decision on another CPU;
+
+		b) a hardware event, such as an interrupt.
+
+
+CLUSTER_DOWN/INBOUND_COMING_UP:
+
+	In this state, an inbound CPU sets up the cluster, including
+	enabling of hardware coherency at the cluster level and any
+	other operations (such as cache invalidation) which are required
+	in order to achieve this.
+
+	The purpose of this state is to do sufficient cluster-level
+	setup to enable other CPUs in the cluster to enter coherency
+	safely.
+
+	Next state:	CLUSTER_UP/INBOUND_COMING_UP (inbound)
+	Conditions:	cluster-level setup and hardware coherency complete
+	Trigger events:	(spontaneous)
+
+
+CLUSTER_UP/INBOUND_COMING_UP:
+
+	Cluster-level setup is complete and hardware coherency is
+	enabled for the cluster.  Other CPUs in the cluster can safely
+	enter coherency.
+
+	This is a transient state, leading immediately to
+	CLUSTER_UP/INBOUND_NOT_COMING_UP.  All other CPUs on the cluster
+	should consider treat these two states as equivalent.
+
+	Next state:	CLUSTER_UP/INBOUND_NOT_COMING_UP (inbound)
+	Conditions:	none
+	Trigger events:	(spontaneous)
+
+
+CLUSTER_UP/INBOUND_NOT_COMING_UP:
+
+	Cluster-level setup is complete and hardware coherency is
+	enabled for the cluster.  Other CPUs in the cluster can safely
+	enter coherency.
+
+	The cluster will remain in this state until a policy decision is
+	made to power the cluster down.
+
+	Next state:	CLUSTER_GOING_DOWN/INBOUND_NOT_COMING_UP (outbound)
+	Conditions:	none
+	Trigger events:	policy decision to power down the cluster
+
+
+CLUSTER_GOING_DOWN/INBOUND_NOT_COMING_UP:
+
+	An outbound CPU is tearing the cluster down.  The selected CPU
+	must wait in this state until all CPUs in the cluster are in the
+	CPU_DOWN state.
+
+	When all CPUs are in the CPU_DOWN state, the cluster can be torn
+	down, for example by cleaning data caches and exiting
+	cluster-level coherency.
+
+	To avoid wasteful unnecessary teardown operations, the outbound
+	should check the inbound cluster state for asynchronous
+	transitions to INBOUND_COMING_UP.  Alternatively, individual
+	CPUs can be checked for entry into CPU_COMING_UP or CPU_UP.
+
+
+	Next states:
+
+	CLUSTER_DOWN/INBOUND_NOT_COMING_UP (outbound)
+		Conditions:	cluster torn down and ready to power off
+		Trigger events:	(spontaneous)
+
+	CLUSTER_GOING_DOWN/INBOUND_COMING_UP (inbound)
+		Conditions:	none
+		Trigger events:
+
+			a) an explicit hardware power-up operation,
+			   resulting from a policy decision on another
+			   CPU;
+
+			b) a hardware event, such as an interrupt.
+
+
+CLUSTER_GOING_DOWN/INBOUND_COMING_UP:
+
+	The cluster is (or was) being torn down, but another CPU has
+	come online in the meantime and is trying to set up the cluster
+	again.
+
+	If the outbound CPU observes this state, it has two choices:
+
+		a) back out of teardown, restoring the cluster to the
+		   CLUSTER_UP state;
+
+		b) finish tearing the cluster down and put the cluster
+		   in the CLUSTER_DOWN state; the inbound CPU will
+		   set up the cluster again from there.
+
+	Choice (a) permits the removal of some latency by avoiding
+	unnecessary teardown and setup operations in situations where
+	the cluster is not really going to be powered down.
+
+
+	Next states:
+
+	CLUSTER_UP/INBOUND_COMING_UP (outbound)
+		Conditions:	cluster-level setup and hardware
+				coherency complete
+		Trigger events:	(spontaneous)
+
+	CLUSTER_DOWN/INBOUND_COMING_UP (outbound)
+		Conditions:	cluster torn down and ready to power off
+		Trigger events:	(spontaneous)
+
+
+Last man and First man selection
+--------------------------------
+
+The CPU which performs cluster tear-down operations on the outbound side
+is commonly referred to as the "last man".
+
+The CPU which performs cluster setup on the inbound side is commonly
+referred to as the "first man".
+
+The race avoidance algorithm documented above does not provide a
+mechanism to choose which CPUs should play these roles.
+
+
+Last man:
+
+When shutting down the cluster, all the CPUs involved are initially
+executing Linux and hence coherent.  Therefore, ordinary spinlocks can
+be used to select a last man safely, before the CPUs become
+non-coherent.
+
+
+First man:
+
+Because CPUs may power up asynchronously in response to external wake-up
+events, a dynamic mechanism is needed to make sure that only one CPU
+attempts to play the first man role and do the cluster-level
+initialisation: any other CPUs must wait for this to complete before
+proceeding.
+
+Cluster-level initialisation may involve actions such as configuring
+coherency controls in the bus fabric.
+
+The current implementation in bL_head.S uses a separate mutual exclusion
+mechanism to do this arbitration.  This mechanism is documented in
+detail in vlocks.txt.
+
+
+Features and Limitations
+------------------------
+
+Implementation:
+
+	The current ARM-based implementation is split between
+	arch/arm/common/bL_head.S (low-level inbound CPU operations) and
+	arch/arm/common/bL_entry.c (everything else):
+
+	__bL_cpu_going_down() signals the transition of a CPU to the
+		CPU_GOING_DOWN state.
+
+	__bL_cpu_down() signals the transition of a CPU to the CPU_DOWN
+		state.
+
+	A CPU transitions to CPU_COMING_UP and then to CPU_UP via the
+		low-level power-up code in bL_head.S.  This could
+		involve CPU-specific setup code, but in the current
+		implementation it does not.
+
+	__bL_outbound_enter_critical() and __bL_outbound_leave_critical()
+		handle transitions from CLUSTER_UP to CLUSTER_GOING_DOWN
+		and from there to CLUSTER_DOWN or back to CLUSTER_UP (in
+		the case of an aborted cluster power-down).
+
+		These functions are more complex than the __bL_cpu_*()
+		functions due to the extra inter-CPU coordination which
+		is needed for safe transitions at the cluster level.
+
+	A cluster transitions from CLUSTER_DOWN back to CLUSTER_UP via
+		the low-level power-up code in bL_head.S.  This
+		typically involves platform-specific setup code,
+		provided by the platform-specific power_up_setup
+		function registered via bL_cluster_sync_init.
+
+Deep topologies:
+
+	As currently described and implemented, the algorithm does not
+	support CPU topologies involving more than two levels (i.e.,
+	clusters of clusters are not supported).  The algorithm could be
+	extended by replicating the cluster-level states for the
+	additional topological levels, and modifying the transition
+	rules for the intermediate (non-outermost) cluster levels.
+
+
+Colophon
+--------
+
+Originally created and documented by Dave Martin for Linaro Limited, in
+collaboration with Nicolas Pitre and Achin Gupta.
+
+Copyright (C) 2012  Linaro Limited
+Distributed under the terms of Version 2 of the GNU General Public
+License, as defined in linux/COPYING.
diff --git a/arch/arm/common/bL_entry.c b/arch/arm/common/bL_entry.c
index 41de0622de..1ea4ec9df0 100644
--- a/arch/arm/common/bL_entry.c
+++ b/arch/arm/common/bL_entry.c
@@ -116,3 +116,163 @@  int bL_cpu_powered_up(void)
 		platform_ops->powered_up();
 	return 0;
 }
+
+struct bL_sync_struct bL_sync;
+
+static void __sync_range(volatile void *p, size_t size)
+{
+	char *_p = (char *)p;
+
+	__cpuc_flush_dcache_area(_p, size);
+	outer_flush_range(__pa(_p), __pa(_p + size));
+	outer_sync();
+}
+
+#define sync_mem(ptr) __sync_range(ptr, sizeof *(ptr))
+
+/*
+ * __bL_cpu_going_down: Indicates that the cpu is being torn down.
+ *    This must be called at the point of committing to teardown of a CPU.
+ *    The CPU cache (SCTRL.C bit) is expected to still be active.
+ */
+void __bL_cpu_going_down(unsigned int cpu, unsigned int cluster)
+{
+	bL_sync.clusters[cluster].cpus[cpu].cpu = CPU_GOING_DOWN;
+	sync_mem(&bL_sync.clusters[cluster].cpus[cpu].cpu);
+}
+
+/*
+ * __bL_cpu_down: Indicates that cpu teardown is complete and that the
+ *    cluster can be torn down without disrupting this CPU.
+ *    To avoid deadlocks, this must be called before a CPU is powered down.
+ *    The CPU cache (SCTRL.C bit) is expected to be off.
+ */
+void __bL_cpu_down(unsigned int cpu, unsigned int cluster)
+{
+	dsb();
+	bL_sync.clusters[cluster].cpus[cpu].cpu = CPU_DOWN;
+	sync_mem(&bL_sync.clusters[cluster].cpus[cpu].cpu);
+	sev();
+}
+
+/*
+ * __bL_outbound_leave_critical: Leave the cluster teardown critical section.
+ * @state: the final state of the cluster:
+ *     CLUSTER_UP: no destructive teardown was done and the cluster has been
+ *         restored to the previous state (CPU cache still active); or
+ *     CLUSTER_DOWN: the cluster has been torn-down, ready for power-off
+ *         (CPU cache disabled).
+ */
+void __bL_outbound_leave_critical(unsigned int cluster, int state)
+{
+	dsb();
+	bL_sync.clusters[cluster].cluster = state;
+	sync_mem(&bL_sync.clusters[cluster].cluster);
+	sev();
+}
+
+/*
+ * __bL_outbound_enter_critical: Enter the cluster teardown critical section.
+ * This function should be called by the last man, after local CPU teardown
+ * is complete.  CPU cache expected to be active.
+ *
+ * Returns:
+ *     false: the critical section was not entered because an inbound CPU was
+ *         observed, or the cluster is already being set up;
+ *     true: the critical section was entered: it is now safe to tear down the
+ *         cluster.
+ */
+bool __bL_outbound_enter_critical(unsigned int cpu, unsigned int cluster)
+{
+	unsigned int i;
+	struct bL_cluster_sync_struct *c = &bL_sync.clusters[cluster];
+
+	/* Warn inbound CPUs that the cluster is being torn down: */
+	c->cluster = CLUSTER_GOING_DOWN;
+	sync_mem(&c->cluster);
+
+	/* Back out if the inbound cluster is already in the critical region: */
+	sync_mem(&c->inbound);
+	if (c->inbound == INBOUND_COMING_UP)
+		goto abort;
+
+	/*
+	 * Wait for all CPUs to get out of the GOING_DOWN state, so that local
+	 * teardown is complete on each CPU before tearing down the cluster.
+	 *
+	 * If any CPU has been woken up again from the DOWN state, then we
+	 * shouldn't be taking the cluster down at all: abort in that case.
+	 */
+	sync_mem(&c->cpus);
+	for (i = 0; i < BL_CPUS_PER_CLUSTER; i++) {
+		int cpustate;
+
+		if (i == cpu)
+			continue;
+
+		while (1) {
+			cpustate = c->cpus[i].cpu;
+			if (cpustate != CPU_GOING_DOWN)
+				break;
+
+			wfe();
+			sync_mem(&c->cpus[i].cpu);
+		}
+
+		switch (cpustate) {
+		case CPU_DOWN:
+			continue;
+
+		default:
+			goto abort;
+		}
+	}
+
+	dsb();
+
+	return true;
+
+abort:
+	__bL_outbound_leave_critical(cluster, CLUSTER_UP);
+	return false;
+}
+
+int __bL_cluster_state(unsigned int cluster)
+{
+	sync_mem(&bL_sync.clusters[cluster].cluster);
+	return bL_sync.clusters[cluster].cluster;
+}
+
+extern unsigned long bL_power_up_setup_phys;
+
+int __init bL_cluster_sync_init(void (*power_up_setup)(void))
+{
+	unsigned int i, j, mpidr, this_cluster;
+
+	BUILD_BUG_ON(BL_SYNC_CLUSTER_SIZE * BL_NR_CLUSTERS != sizeof bL_sync);
+	BUG_ON((unsigned long)&bL_sync & (__CACHE_WRITEBACK_GRANULE - 1));
+
+	/*
+	 * Set initial CPU and cluster states.
+	 * Only one cluster is assumed to be active at this point.
+	 */
+	for (i = 0; i < BL_NR_CLUSTERS; i++) {
+		bL_sync.clusters[i].cluster = CLUSTER_DOWN;
+		bL_sync.clusters[i].inbound = INBOUND_NOT_COMING_UP;
+		for (j = 0; j < BL_CPUS_PER_CLUSTER; j++)
+			bL_sync.clusters[i].cpus[j].cpu = CPU_DOWN;
+	}
+	asm ("mrc p15, 0, %0, c0, c0, 5" : "=r" (mpidr));
+	this_cluster = (mpidr >> 8) & 0xf;
+	for_each_online_cpu(i)
+		bL_sync.clusters[this_cluster].cpus[i].cpu = CPU_UP;
+	bL_sync.clusters[this_cluster].cluster = CLUSTER_UP;
+	sync_mem(&bL_sync);
+
+	if (power_up_setup) {
+		bL_power_up_setup_phys = virt_to_phys(power_up_setup);
+		sync_mem(&bL_power_up_setup_phys);
+	}
+
+	return 0;
+}
diff --git a/arch/arm/common/bL_head.S b/arch/arm/common/bL_head.S
index 9d351f2b4c..f7a64ac127 100644
--- a/arch/arm/common/bL_head.S
+++ b/arch/arm/common/bL_head.S
@@ -7,11 +7,19 @@ 
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License version 2 as
  * published by the Free Software Foundation.
+ *
+ *
+ * Refer to Documentation/arm/big.LITTLE/cluster-pm-race-avoidance.txt
+ * for details of the synchronisation algorithms used here.
  */
 
 #include <linux/linkage.h>
 #include <asm/bL_entry.h>
 
+.if BL_SYNC_CLUSTER_CPUS
+.error "cpus must be the first member of struct bL_cluster_sync_struct"
+.endif
+
 	.macro	pr_dbg	cpu, string
 #if defined(CONFIG_DEBUG_LL) && defined(DEBUG)
 	b	1901f
@@ -52,12 +60,82 @@  ENTRY(bL_entry_point)
 2:	pr_dbg	r4, "kernel bL_entry_point\n"
 
 	/*
-	 * MMU is off so we need to get to bL_entry_vectors in a
+	 * MMU is off so we need to get to various variables in a
 	 * position independent way.
 	 */
 	adr	r5, 3f
-	ldr	r6, [r5]
+	ldmia	r5, {r6, r7, r8}
 	add	r6, r5, r6			@ r6 = bL_entry_vectors
+	ldr	r7, [r5, r7]			@ r7 = bL_power_up_setup_phys
+	add	r8, r5, r8			@ r8 = bL_sync
+
+	mov	r0, #BL_SYNC_CLUSTER_SIZE
+	mla	r8, r0, r10, r8			@ r8 = bL_sync cluster base
+
+	@ Signal that this CPU is coming UP:
+	mov	r0, #CPU_COMING_UP
+	mov	r5, #BL_SYNC_CPU_SIZE
+	mla	r5, r9, r5, r8			@ r5 = bL_sync cpu address
+	strb	r0, [r5]
+
+	dsb
+
+	@ At this point, the cluster cannot unexpectedly enter the GOING_DOWN
+	@ state, because there is at least one active CPU (this CPU).
+
+	@ Check if the cluster has been set up yet:
+	ldrb	r0, [r8, #BL_SYNC_CLUSTER_CLUSTER]
+	cmp	r0, #CLUSTER_UP
+	beq	cluster_already_up
+
+	@ Signal that the cluster is being brought up:
+	mov	r0, #INBOUND_COMING_UP
+	strb	r0, [r8, #BL_SYNC_CLUSTER_INBOUND]
+
+	dsb
+
+	@ Any CPU trying to take the cluster into CLUSTER_GOING_DOWN from this
+	@ point onwards will observe INBOUND_COMING_UP and abort.
+
+	@ Wait for any previously-pending cluster teardown operations to abort
+	@ or complete:
+cluster_teardown_wait:
+	ldrb	r0, [r8, #BL_SYNC_CLUSTER_CLUSTER]
+	cmp	r0, #CLUSTER_GOING_DOWN
+	wfeeq
+	beq	cluster_teardown_wait
+
+	@ power_up_setup is responsible for setting up the cluster:
+
+	cmp	r7, #0
+	mov	r0, #1		@ second (cluster) affinity level
+	blxne	r7		@ Call power_up_setup if defined
+
+	@ Leave the cluster setup critical section:
+
+	dsb
+	mov	r0, #INBOUND_NOT_COMING_UP
+	strb	r0, [r8, #BL_SYNC_CLUSTER_INBOUND]
+	mov	r0, #CLUSTER_UP
+	strb	r0, [r8, #BL_SYNC_CLUSTER_CLUSTER]
+	dsb
+	sev
+
+cluster_already_up:
+	@ If a platform-specific CPU setup hook is needed, it is
+	@ called from here.
+
+	cmp	r7, #0
+	mov	r0, #0		@ first (CPU) affinity level
+	blxne	r7		@ Call power_up_setup if defined
+
+	@ Mark the CPU as up:
+
+	dsb
+	mov	r0, #CPU_UP
+	strb	r0, [r5]
+	dsb
+	sev
 
 bL_entry_gated:
 	ldr	r5, [r6, r4, lsl #2]		@ r5 = CPU entry vector
@@ -70,6 +148,8 @@  bL_entry_gated:
 	.align	2
 
 3:	.word	bL_entry_vectors - .
+	.word	bL_power_up_setup_phys - 3b
+	.word	bL_sync - 3b
 
 ENDPROC(bL_entry_point)
 
@@ -79,3 +159,7 @@  ENDPROC(bL_entry_point)
 	.type	bL_entry_vectors, #object
 ENTRY(bL_entry_vectors)
 	.space	4 * BL_NR_CLUSTERS * BL_CPUS_PER_CLUSTER
+
+	.type	bL_power_up_setup_phys, #object
+ENTRY(bL_power_up_setup_phys)
+	.space  4		@ set by bL_cluster_sync_init()
diff --git a/arch/arm/include/asm/bL_entry.h b/arch/arm/include/asm/bL_entry.h
index 942d7f9f19..167394d9a0 100644
--- a/arch/arm/include/asm/bL_entry.h
+++ b/arch/arm/include/asm/bL_entry.h
@@ -15,8 +15,37 @@ 
 #define BL_CPUS_PER_CLUSTER	4
 #define BL_NR_CLUSTERS		2
 
+/* Definitions for bL_cluster_sync_struct */
+#define CPU_DOWN		0x11
+#define CPU_COMING_UP		0x12
+#define CPU_UP			0x13
+#define CPU_GOING_DOWN		0x14
+
+#define CLUSTER_DOWN		0x21
+#define CLUSTER_UP		0x22
+#define CLUSTER_GOING_DOWN	0x23
+
+#define INBOUND_NOT_COMING_UP	0x31
+#define INBOUND_COMING_UP	0x32
+
+/* This is a complete guess. */
+#define __CACHE_WRITEBACK_ORDER	6
+#define __CACHE_WRITEBACK_GRANULE (1 << __CACHE_WRITEBACK_ORDER)
+
+/* Offsets for the bL_cluster_sync_struct members, for use in asm: */
+#define BL_SYNC_CLUSTER_CPUS	0
+#define BL_SYNC_CPU_SIZE	__CACHE_WRITEBACK_GRANULE
+#define BL_SYNC_CLUSTER_CLUSTER \
+	(BL_SYNC_CLUSTER_CPUS + BL_SYNC_CPU_SIZE * BL_CPUS_PER_CLUSTER)
+#define BL_SYNC_CLUSTER_INBOUND \
+	(BL_SYNC_CLUSTER_CLUSTER + __CACHE_WRITEBACK_GRANULE)
+#define BL_SYNC_CLUSTER_SIZE \
+	(BL_SYNC_CLUSTER_INBOUND + __CACHE_WRITEBACK_GRANULE)
+
 #ifndef __ASSEMBLY__
 
+#include <linux/types.h>
+
 /*
  * Platform specific code should use this symbol to set up secondary
  * entry location for processors to use when released from reset.
@@ -123,5 +152,38 @@  struct bL_platform_power_ops {
  */
 int __init bL_platform_power_register(const struct bL_platform_power_ops *ops);
 
+/* Synchronisation structures for coordinating safe cluster setup/teardown: */
+
+/*
+ * When modifying this structure, make sure you update the BL_SYNC_ defines
+ * to match.
+ */
+struct bL_cluster_sync_struct {
+	/* individual CPU states */
+	struct {
+		volatile s8 cpu __aligned(__CACHE_WRITEBACK_GRANULE);
+	} cpus[BL_CPUS_PER_CLUSTER];
+
+	/* cluster state */
+	volatile s8 cluster __aligned(__CACHE_WRITEBACK_GRANULE);
+
+	/* inbound-side state */
+	volatile s8 inbound __aligned(__CACHE_WRITEBACK_GRANULE);
+};
+
+struct bL_sync_struct {
+	struct bL_cluster_sync_struct clusters[BL_NR_CLUSTERS];
+};
+
+extern unsigned long bL_sync_phys;	/* physical address of *bL_sync */
+
+void __bL_cpu_going_down(unsigned int cpu, unsigned int cluster);
+void __bL_cpu_down(unsigned int cpu, unsigned int cluster);
+void __bL_outbound_leave_critical(unsigned int cluster, int state);
+bool __bL_outbound_enter_critical(unsigned int this_cpu, unsigned int cluster);
+int __bL_cluster_state(unsigned int cluster);
+
+int __init bL_cluster_sync_init(void (*power_up_setup)(void));
+
 #endif /* ! __ASSEMBLY__ */
 #endif