diff mbox

ARM: rockchip: Convert resume code to C

Message ID 1417461694-5129-1-git-send-email-dianders@chromium.org (mailing list archive)
State New, archived
Headers show

Commit Message

Doug Anderson Dec. 1, 2014, 7:21 p.m. UTC
This CL introduces the concept of "embedded blobs" for Rockchip, which
allows you to compile some C code into a binary blob that is linked
into the kernel.  This binary blob is self contained and is intended
to run in cases where SDRAM (and thus the rest of Linux) isn't
available.  Resume is a prime candiate for this as is the planned
DDRFreq code.

We convert the existing assembly resume code into C as proof that this
works and to prepare for linking in SDRAM reinit code.

At the moment suspend/resume is only implemented for rk3288 and I'm
told that suspend/resume on older Rockchip variants is very limited
(can't shut off the ARM?), so I haven't tried to make this generic.
When we have more than one Rockchip SoC doing S2R we can adjust.

Signed-off-by: Doug Anderson <dianders@chromium.org>
---
A few ground rules...

I'm aware of the patches from Russ Dill adding support for Embeddable
PIE [0], but I have chosen not to try to take over those patches or
base my patch atop them.  Why?

1. The selfish reason that I just don't have time to try to address
   all comments and land an 11-part patch series that's been sitting
   idle for 1+ years.

2. Dave Martin had some good questions about the embedded PIE patches
   [1] that made me question whether they were the right away to go.
   Among other things he pointed out similarities between them and the
   preexsiting module loader.

3. This code is small, simple, and I think it matches the approach
   taken elsewhere in the kernel for things like the decompressor stub.

I could certainly expect that others might not agree with my decision
and I'm half expecting a NAK of this patch asking for me to come up
with a more general solution.  As you could probably guess from #1
above, I won't be upset by such a response but it will lead to me
dropping this patch.  ;) Minor or even medium-sized requests will be
happily addressed, however.

OK, with that out of the way...

These patches are based atop v10 of Chris Zhong's Rockchip
suspend/resume patches.  They were tested atop next-20141201 on
rk3288-evb-rk808.  Total patches atop that version of Linux were:

1. https://patchwork.kernel.org/patch/5051881/ - clocksource:
   arch_timer: Allow the device tree to specify uninitialized timer
   registers

2. https://patchwork.kernel.org/patch/5363671/ - clocksource:
   arch_timer: Fix code to use physical timers when requested

3. https://patchwork.kernel.org/patch/5382141/ - ARM: dts: rk3288: add
   arm,cpu-registers-not-fw-configured

4. Revert (b77d439 ARM: dts: rockchip: temporarily disable smp on
   rk3288)

5. https://patchwork.kernel.org/patch/5325111/ - usb: dwc2: resume
   root hub when device detect with suspend state

6. https://patchwork.kernel.org/patch/5410611/ - ARM: rockchip: add
   suspend and resume for RK3288

7. https://patchwork.kernel.org/patch/5410621/ - ARM: rockchip: Add
   pmu-sram binding

8. https://patchwork.kernel.org/patch/5410631/ - ARM: dts: add RK3288
   suspend support

9. https://patchwork.kernel.org/patch/5410641/ - ARM: dts: rockchip:
   add suspend settings for rk3288-evb-rk808

It looks like my pinctrl patches might be dropped due to cross
dependency problems, so tomorrow's linux-next will probably also need
(https://patchwork.kernel.org/patch/5344551/ - pinctrl: rockchip:
Handle wakeup pins).

I've also got a local hack to the Rockchip "pm.c" to replace the usage
of "PMU_ARMINT_WAKEUP_EN" with 0x0e.  There seems to be some sort of
ARM Interrupt waking us up all the time right when we go to sleep and
the above will hack it so that only GPIOs + SDMMC Card Detect can wake
us up.  Someone should track down what's going on there, but for now
I've used the hack to prove that the basic code actually works.  The
issue happens both with and without my patch.

[0]: https://lkml.org/lkml/2013/9/17/193
[1]: https://lkml.org/lkml/2013/11/8/351

 arch/arm/mach-rockchip/Makefile                    |  4 +-
 arch/arm/mach-rockchip/embedded/.gitignore         |  1 +
 arch/arm/mach-rockchip/embedded/Makefile           | 53 +++++++++++++
 arch/arm/mach-rockchip/embedded/rk3288_resume.c    | 92 ++++++++++++++++++++++
 arch/arm/mach-rockchip/embedded/rk3288_resume.h    | 44 +++++++++++
 .../arm/mach-rockchip/embedded/rk3288_resume.lds.S | 40 ++++++++++
 .../embedded/rk3288_resume_embedded.h              | 17 ++++
 arch/arm/mach-rockchip/pm.c                        | 37 ++++++---
 arch/arm/mach-rockchip/pm.h                        |  8 --
 arch/arm/mach-rockchip/sleep.S                     | 73 -----------------
 10 files changed, 275 insertions(+), 94 deletions(-)
 create mode 100644 arch/arm/mach-rockchip/embedded/.gitignore
 create mode 100644 arch/arm/mach-rockchip/embedded/Makefile
 create mode 100644 arch/arm/mach-rockchip/embedded/rk3288_resume.c
 create mode 100644 arch/arm/mach-rockchip/embedded/rk3288_resume.h
 create mode 100644 arch/arm/mach-rockchip/embedded/rk3288_resume.lds.S
 create mode 100644 arch/arm/mach-rockchip/embedded/rk3288_resume_embedded.h
 delete mode 100644 arch/arm/mach-rockchip/sleep.S

Comments

Russell King - ARM Linux Dec. 1, 2014, 10:50 p.m. UTC | #1
On Mon, Dec 01, 2014 at 11:21:34AM -0800, Doug Anderson wrote:
> We convert the existing assembly resume code into C as proof that this
> works and to prepare for linking in SDRAM reinit code.

...

> base my patch atop them.  Why?
                            ^^^^

That's a very good question...

> +static void __noreturn rk3288_resume_c(void)
> +{
> +	if (rk3288_resume_params.l2ctlr_f)
> +		asm("mcr p15, 1, %0, c9, c0, 2" : :
> +			"r" (rk3288_resume_params.l2ctlr));

Assembly...

> +static void __naked __noreturn rk3288_resume(void)
> +{
> +	/* Make sure we're on CPU0, no IRQs and get a stack setup */
> +	asm volatile (
> +			"msr	cpsr_cxf, %0\n"
> +
> +			/* Only cpu0 continues to run, the others halt here */
> +			"mrc	p15, 0, r1, c0, c0, 5\n"
> +			"and	r1, r1, #0xf\n"
> +			"cmp	r1, #0\n"
> +			"beq	cpu0run\n"
> +		"secondary_loop:\n"
> +			"wfe\n"
> +			"b	secondary_loop\n"
> +
> +		"cpu0run:\n"
> +			"mov	sp, %1\n"
> +		:
> +		: "i" (INIT_CPSR), "r" (&__stack_start)
> +		: "cc", "r1", "sp");

Big load of assembly.

What I see here is a load of complexity which achieves very little.
The result doesn't get rid of much assembly, but it does make stuff
more complicated.  And the diffstat speaks volumes about this:

 10 files changed, 275 insertions(+), 94 deletions(-)                           

There's a lot of words in the description, but it's missing the most
important bit: why do we want to take this approach - what benefits
does it bring?
Doug Anderson Dec. 1, 2014, 11:04 p.m. UTC | #2
Russel,

On Mon, Dec 1, 2014 at 2:50 PM, Russell King - ARM Linux
<linux@arm.linux.org.uk> wrote:
> What I see here is a load of complexity which achieves very little.
> The result doesn't get rid of much assembly, but it does make stuff
> more complicated.  And the diffstat speaks volumes about this:
>
>  10 files changed, 275 insertions(+), 94 deletions(-)
>
> There's a lot of words in the description, but it's missing the most
> important bit: why do we want to take this approach - what benefits
> does it bring?

Sure.  I guess the most important words in the description are:

> We convert the existing assembly resume code into C as proof that this
> works and to prepare for linking in SDRAM reinit code.

I can't say that the SDRAM reinit code is ready for prime time yet,
but you can get a preview of what it could look like at:

https://chromium-review.googlesource.com/#/c/227366/25/arch/arm/mach-rockchip/embedded/rk3288_ddr_resume.c

Adding that code in assembly seems like a very, very bad idea.
Certainly my patch could wait until the DDR code is ready to be posted
upstream if that made sense.  One advantage of waiting is that it's
possible that the DDR code might end up moving elsewhere if it made
sense to have it part of a memory controller driver or something like
that.

-Doug
Arnd Bergmann Dec. 2, 2014, 9:33 a.m. UTC | #3
On Monday 01 December 2014 15:04:59 Doug Anderson wrote:
> Russel,
> 
> On Mon, Dec 1, 2014 at 2:50 PM, Russell King - ARM Linux
> <linux@arm.linux.org.uk> wrote:
> > What I see here is a load of complexity which achieves very little.
> > The result doesn't get rid of much assembly, but it does make stuff
> > more complicated.  And the diffstat speaks volumes about this:
> >
> >  10 files changed, 275 insertions(+), 94 deletions(-)
> >
> > There's a lot of words in the description, but it's missing the most
> > important bit: why do we want to take this approach - what benefits
> > does it bring?
> 
> Sure.  I guess the most important words in the description are:
> 
> > We convert the existing assembly resume code into C as proof that this
> > works and to prepare for linking in SDRAM reinit code.
> 
> I can't say that the SDRAM reinit code is ready for prime time yet,
> but you can get a preview of what it could look like at:
> 
> https://chromium-review.googlesource.com/#/c/227366/25/arch/arm/mach-rockchip/embedded/rk3288_ddr_resume.c
> 
> Adding that code in assembly seems like a very, very bad idea.
> Certainly my patch could wait until the DDR code is ready to be posted
> upstream if that made sense.  One advantage of waiting is that it's
> possible that the DDR code might end up moving elsewhere if it made
> sense to have it part of a memory controller driver or something like
> that.

I recently looked at another vendor tree (quantenna wifi access point,
based on arch/arc), which was putting arbitrary functions into SRAM
for performance reasons, in their case the entire hot path for network
switching. Having at least the infrastructure to do this seems like
a great idea, even though it's very hard to do in a general-purpose
kernel, as you'd have a hard time squeezing as much code as possible
into the available SRAM.

Unfortunately you already said that you're not that interested in
making it completely generic, and I also don't think I want to have
the infrastructure for it in mach-rockchip and would want to see that
at least shared across arch/arm if it's too hard to do
cross-architecture. If you were to include code from drivers/memory/
in the blob, you couldn't keep it in mach-rockchip anyway.

AFAICT, the quantenna implementation is similar to the itcm/dtcm
stuff we already have (but are not using upstream), so I wonder
why we can't use that here too, see Documentation/arm/tcm.txt

	Arnd
Linus Walleij Dec. 2, 2014, 1:34 p.m. UTC | #4
On Tue, Dec 2, 2014 at 10:33 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> On Monday 01 December 2014 15:04:59 Doug Anderson wrote:

>> Adding that code in assembly seems like a very, very bad idea.
>> Certainly my patch could wait until the DDR code is ready to be posted
>> upstream if that made sense.  One advantage of waiting is that it's
>> possible that the DDR code might end up moving elsewhere if it made
>> sense to have it part of a memory controller driver or something like
>> that.
(...)
> AFAICT, the quantenna implementation is similar to the itcm/dtcm
> stuff we already have (but are not using upstream), so I wonder
> why we can't use that here too, see Documentation/arm/tcm.txt

I agree. The TCM __tcm* macros to tag code and data for the TCM
can be used by any mechanism by modifying mem_init() in
arch/arm/mm/init.c

If the name is disturbing the __tcm* function can be renamed
__onchip* simply.

It will never work on multiplatform however, and if some real change
shall come to that, something like Russ Dills patches and a
generic approach to dynamic loading of onchip executables is
needed.

I had hopes of replacing the TCM mechanism with that
scheme, people also want to load onchip programs from
userspace as it happens, that would be the right way to go.

Yours,
Linus Walleij
Doug Anderson Dec. 2, 2014, 5:36 p.m. UTC | #5
Hi,

On Tue, Dec 2, 2014 at 1:33 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> On Monday 01 December 2014 15:04:59 Doug Anderson wrote:
>> Russel,
>>
>> On Mon, Dec 1, 2014 at 2:50 PM, Russell King - ARM Linux
>> <linux@arm.linux.org.uk> wrote:
>> > What I see here is a load of complexity which achieves very little.
>> > The result doesn't get rid of much assembly, but it does make stuff
>> > more complicated.  And the diffstat speaks volumes about this:
>> >
>> >  10 files changed, 275 insertions(+), 94 deletions(-)
>> >
>> > There's a lot of words in the description, but it's missing the most
>> > important bit: why do we want to take this approach - what benefits
>> > does it bring?
>>
>> Sure.  I guess the most important words in the description are:
>>
>> > We convert the existing assembly resume code into C as proof that this
>> > works and to prepare for linking in SDRAM reinit code.
>>
>> I can't say that the SDRAM reinit code is ready for prime time yet,
>> but you can get a preview of what it could look like at:
>>
>> https://chromium-review.googlesource.com/#/c/227366/25/arch/arm/mach-rockchip/embedded/rk3288_ddr_resume.c
>>
>> Adding that code in assembly seems like a very, very bad idea.
>> Certainly my patch could wait until the DDR code is ready to be posted
>> upstream if that made sense.  One advantage of waiting is that it's
>> possible that the DDR code might end up moving elsewhere if it made
>> sense to have it part of a memory controller driver or something like
>> that.
>
> I recently looked at another vendor tree (quantenna wifi access point,
> based on arch/arc), which was putting arbitrary functions into SRAM
> for performance reasons, in their case the entire hot path for network
> switching. Having at least the infrastructure to do this seems like
> a great idea, even though it's very hard to do in a general-purpose
> kernel, as you'd have a hard time squeezing as much code as possible
> into the available SRAM.

I'm always a fan of seeing general infrastructure introduced, though
we always need to make sure that the general infrastructure makes
things easier and not harder.  There's always the danger of adding so
much abstraction for a small thing that using it is like pulling
teeth.  I'm not saying that's the case here, but it is always a
danger.

Note: I will point out a critical differences between the "hotpath"
problem and the one I'm solving here.  When you're just trying to
speed up a hotpath, it's not the end of the world if there's a stray
access to SDRAM.  If you happen to access a global variable in SDRAM,
or use a libc function to do division, or have a WARN_ON, those things
are OK.  It might also be OK if the stack was still in SDRAM.  When
you're compiling code that has to run with no other kernel function
present it's really nice to link them into a separate executable.


> Unfortunately you already said that you're not that interested in
> making it completely generic,

Yes.  I felt like a jerk with my list of caveats, but at the same time
I know that realistically I just don't have time.  It was either send
it upstream with the caveats or don't send it.  To me it seemed better
to send it.

Thank you for providing a constructive review despite my caveats!  :)


> and I also don't think I want to have
> the infrastructure for it in mach-rockchip and would want to see that
> at least shared across arch/arm if it's too hard to do
> cross-architecture. If you were to include code from drivers/memory/
> in the blob, you couldn't keep it in mach-rockchip anyway.

I guess I was envisioning that if other places need similar
functionality that they would copy the ideas here.  Some of the
Makefile bits could possibly be shared through some type of Makefile
library.  I know copying code / Makefiles is bad, but sometimes it's
the cleanest way to do something.  If we start seeing a lot of
duplication then we can make things common and we can truly evaluate
whether the common solution is better than the duplication.


> AFAICT, the quantenna implementation is similar to the itcm/dtcm
> stuff we already have (but are not using upstream), so I wonder
> why we can't use that here too, see Documentation/arm/tcm.txt

I wasn't aware of the TCM stuff.  Thanks for the pointer!  It looks
pretty neat...

Ah, but the TCM stuff has a critical difference from my problem.  By
the very definition of TCM you don't need to do relocation.

TCM has the magical property that you can assign the physical address.
That means that you instantly sidestep multiplatform problems of
having SRAM at different physical addresses.  You can compile the code
to assume it's at 0xfffe0000 and it will work on every single machine
out there that needs TCM.  So if you've got a generic "udelay"
function you could just mark it as a "tcmfunc" and it will work
everywhere.  No relocation needed and the compiler knows exactly where
things will be.

Unfortunately, the rk3288 doesn't have TCM.  I tried enabling it and
got these nice printouts at boot:

    DTCM    : 0xfffe8000 - 0xfffe8000   (   0 kB)
    ITCM    : 0xfffe0000 - 0xfffe0000   (   0 kB)

Instead of TCM I'm using the "PMU SRAM" on the rk3288 which is
designed to keep code and data across deep sleep.  Adding relocation
to the existing TCM support gets back into the rats nest of issues
that I was trying to avoid tackling.

A few other TCM thoughts:

1. It sure seems unlikely that the current TCM solution would scale to
multiplatform.  Oh right, Linus W said this in his reply, too.  If
you've got SoC_A, SoC_B, and SoC_C all marking their functions
"tcmfunc" then they'll all be placed in the TCM section, right?
There's no way to detect that you're on SoC_A and that you only need
the SoC_A code.  Given the marching orders of multiplatform,
multiplatform, multiplatform then I _think_ that means we shouldn't
let anyone merge any code to mainline that uses TCM (unless TCM gets
revamped).

2. I haven't tried it, but it seems like the compiler still might not
catch stray (accidental) accesses from the TCM section to the non-TCM
section.  Again, this isn't a showstopper because you'd just track
each one down, but it is a nice feature of adding a separate
executable.


As always, please correct me where I'm misunderstanding things.


Unless any of the above has changed your minds, feel free to consider
this patch abandoned and thus we part ways amicably.  :)  At least
rk3288 looks like it can get _basic_ suspend/resume (without shutting
off SDRAM) working without this...

-Doug
Linus Walleij Dec. 3, 2014, 1:21 p.m. UTC | #6
On Tue, Dec 2, 2014 at 6:36 PM, Doug Anderson <dianders@chromium.org> wrote:

> Ah, but the TCM stuff has a critical difference from my problem.  By
> the very definition of TCM you don't need to do relocation.
>
> TCM has the magical property that you can assign the physical address.
> That means that you instantly sidestep multiplatform problems of
> having SRAM at different physical addresses.

An SRAM behaves like an ITCM and no DTCM. So just
patch the code to believe it is an ITCM and test it out.

Then I guess you're implictly referring to the fact that the code
that you're running in your TCM/SRAM will turn of the MMU,
so it can't be used for remapping. This is correct, BUT if you're
just compiling for one single machine anyway it doesn't matter,
just use 0xBFEA0000 or wherever it is physically as the phys+virt
address.

They are just defined like so in arch/arm/include/asm/memory.h:

/*
 * We fix the TCM memories max 32 KiB ITCM resp DTCM at these
 * locations
 */
#ifdef CONFIG_HAVE_TCM
#define ITCM_OFFSET     UL(0xfffe0000)
#define DTCM_OFFSET     UL(0xfffe8000)
#endif

Patch it and see if it works.

>  You can compile the code
> to assume it's at 0xfffe0000 and it will work on every single machine
> out there that needs TCM.

Doesn't matter as long as we're not multiplatform! It's more like
ITCM_OFFSET would be a Kconfig constant and you could change
it per-platform that need this.

>  So if you've got a generic "udelay"
> function you could just mark it as a "tcmfunc" and it will work
> everywhere.  No relocation needed and the compiler knows exactly where
> things will be.

This would still work as long as you're not multiplatform.

But udelay is probably a bad idea. Maybe sched_clock() or others.

> 1. It sure seems unlikely that the current TCM solution would scale to
> multiplatform.  Oh right, Linus W said this in his reply, too.

Yes. But it is no less elegant than this patch AFAICT.

> you've got SoC_A, SoC_B, and SoC_C all marking their functions
> "tcmfunc" then they'll all be placed in the TCM section, right?
> There's no way to detect that you're on SoC_A and that you only need
> the SoC_A code.  Given the marching orders of multiplatform,
> multiplatform, multiplatform then I _think_ that means we shouldn't
> let anyone merge any code to mainline that uses TCM (unless TCM gets
> revamped).
>
> 2. I haven't tried it, but it seems like the compiler still might not
> catch stray (accidental) accesses from the TCM section to the non-TCM
> section.  Again, this isn't a showstopper because you'd just track
> each one down, but it is a nice feature of adding a separate
> executable.

This is correct I think.

I think something in the direction of Russ Dill's patch set is the right
way forward to fix the problem for multiplatform.

Yours,
Linus Walleij
Arnd Bergmann Dec. 3, 2014, 2:42 p.m. UTC | #7
On Tuesday 02 December 2014 09:36:00 Doug Anderson wrote:
> On Tue, Dec 2, 2014 at 1:33 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> > On Monday 01 December 2014 15:04:59 Doug Anderson wrote:
> >> On Mon, Dec 1, 2014 at 2:50 PM, Russell King - ARM Linux <linux@arm.linux.org.uk> wrote:
> > I recently looked at another vendor tree (quantenna wifi access point,
> > based on arch/arc), which was putting arbitrary functions into SRAM
> > for performance reasons, in their case the entire hot path for network
> > switching. Having at least the infrastructure to do this seems like
> > a great idea, even though it's very hard to do in a general-purpose
> > kernel, as you'd have a hard time squeezing as much code as possible
> > into the available SRAM.
> 
> I'm always a fan of seeing general infrastructure introduced, though
> we always need to make sure that the general infrastructure makes
> things easier and not harder.  There's always the danger of adding so
> much abstraction for a small thing that using it is like pulling
> teeth.  I'm not saying that's the case here, but it is always a
> danger.
> 
> Note: I will point out a critical differences between the "hotpath"
> problem and the one I'm solving here.  When you're just trying to
> speed up a hotpath, it's not the end of the world if there's a stray
> access to SDRAM.  If you happen to access a global variable in SDRAM,
> or use a libc function to do division, or have a WARN_ON, those things
> are OK.  It might also be OK if the stack was still in SDRAM.  When
> you're compiling code that has to run with no other kernel function
> present it's really nice to link them into a separate executable.

Yes, makes sense. We might be able to use the same trick that we have
for verifying __init sections though: During the final link of the
vmlinux or module, check that an SRAM function only calls other
functions that are in SRAM and accesses global variables that way
too.

It wouldn't cover any pointers you pass using function arguments
though, and I don't yet understand the requirements for stack accesses.
How do you currently deal with local variables that are put on the
stack by a blob?

> > and I also don't think I want to have
> > the infrastructure for it in mach-rockchip and would want to see that
> > at least shared across arch/arm if it's too hard to do
> > cross-architecture. If you were to include code from drivers/memory/
> > in the blob, you couldn't keep it in mach-rockchip anyway.
> 
> I guess I was envisioning that if other places need similar
> functionality that they would copy the ideas here.  Some of the
> Makefile bits could possibly be shared through some type of Makefile
> library.  I know copying code / Makefiles is bad, but sometimes it's
> the cleanest way to do something.  If we start seeing a lot of
> duplication then we can make things common and we can truly evaluate
> whether the common solution is better than the duplication.

The makefile parts should be really easy to share by putting them
into scripts/Makefile.lib.

> > AFAICT, the quantenna implementation is similar to the itcm/dtcm
> > stuff we already have (but are not using upstream), so I wonder
> > why we can't use that here too, see Documentation/arm/tcm.txt
> 
> I wasn't aware of the TCM stuff.  Thanks for the pointer!  It looks
> pretty neat...
> 
> Ah, but the TCM stuff has a critical difference from my problem.  By
> the very definition of TCM you don't need to do relocation.
> 
> TCM has the magical property that you can assign the physical address.
> That means that you instantly sidestep multiplatform problems of
> having SRAM at different physical addresses.  You can compile the code
> to assume it's at 0xfffe0000 and it will work on every single machine
> out there that needs TCM.  So if you've got a generic "udelay"
> function you could just mark it as a "tcmfunc" and it will work
> everywhere.  No relocation needed and the compiler knows exactly where
> things will be.

Ok, I have to admit that I don't actually understand the differences
myself. Why does the physical address of the TCM matter? Can't we
just map the SRAM to a sufficiently large well-known virtual address?

> Unfortunately, the rk3288 doesn't have TCM.  I tried enabling it and
> got these nice printouts at boot:
> 
>     DTCM    : 0xfffe8000 - 0xfffe8000   (   0 kB)
>     ITCM    : 0xfffe0000 - 0xfffe0000   (   0 kB)
> 
> Instead of TCM I'm using the "PMU SRAM" on the rk3288 which is
> designed to keep code and data across deep sleep.  Adding relocation
> to the existing TCM support gets back into the rats nest of issues
> that I was trying to avoid tackling.
> 
> A few other TCM thoughts:
> 
> 1. It sure seems unlikely that the current TCM solution would scale to
> multiplatform.  Oh right, Linus W said this in his reply, too.  If
> you've got SoC_A, SoC_B, and SoC_C all marking their functions
> "tcmfunc" then they'll all be placed in the TCM section, right?

Correct, that would be a problem, at least if the total size grows
to more than the minimum of any of the chips' physical SRAM.

> There's no way to detect that you're on SoC_A and that you only need
> the SoC_A code.  Given the marching orders of multiplatform,
> multiplatform, multiplatform then I _think_ that means we shouldn't
> let anyone merge any code to mainline that uses TCM (unless TCM gets
> revamped).

Just out of curiosity, what sizes are we looking at here, for the
code you currently have and the available SRAM on rk3288?

> 2. I haven't tried it, but it seems like the compiler still might not
> catch stray (accidental) accesses from the TCM section to the non-TCM
> section.  Again, this isn't a showstopper because you'd just track
> each one down, but it is a nice feature of adding a separate
> executable.

No, the compiler won't care about this, but as mentioned above we
can have the kernel linker scripts help us a bit here.

	Arnd
Doug Anderson Dec. 3, 2014, 4:44 p.m. UTC | #8
Linus,

On Wed, Dec 3, 2014 at 5:21 AM, Linus Walleij <linus.walleij@linaro.org> wrote:
> On Tue, Dec 2, 2014 at 6:36 PM, Doug Anderson <dianders@chromium.org> wrote:
>
>> Ah, but the TCM stuff has a critical difference from my problem.  By
>> the very definition of TCM you don't need to do relocation.
>>
>> TCM has the magical property that you can assign the physical address.
>> That means that you instantly sidestep multiplatform problems of
>> having SRAM at different physical addresses.
>
> An SRAM behaves like an ITCM and no DTCM. So just
> patch the code to believe it is an ITCM and test it out.
>
> Then I guess you're implictly referring to the fact that the code
> that you're running in your TCM/SRAM will turn of the MMU,
> so it can't be used for remapping.

Right.  In my case the resume code is running at resume time after the
CPUs have lost state and are just coming back on.  All addresses need
to be physical.  Sorry, I should have been explicit.


> This is correct, BUT if you're
> just compiling for one single machine anyway it doesn't matter,
> just use 0xBFEA0000 or wherever it is physically as the phys+virt
> address.
>
> They are just defined like so in arch/arm/include/asm/memory.h:
>
> /*
>  * We fix the TCM memories max 32 KiB ITCM resp DTCM at these
>  * locations
>  */
> #ifdef CONFIG_HAVE_TCM
> #define ITCM_OFFSET     UL(0xfffe0000)
> #define DTCM_OFFSET     UL(0xfffe8000)
> #endif
>
> Patch it and see if it works.

I'm nearly certainly it could be made to work.  ...but the solution I
have also works just fine.  I was looking for something that upstream
would take, and I don't think the solution you're proposing is
something that I could actually land upstream (could it?)

Also: it is possible (probable?) we'll want to have more than one
blob.  One (resume code) may need to fit in the 4K PMU SRAM that's
saved across suspend/resume.  The other (ddrfreq) may be able to go in
the larger SRAM.  I could try to fool the TCM code to handle this,
but...


>>  You can compile the code
>> to assume it's at 0xfffe0000 and it will work on every single machine
>> out there that needs TCM.
>
> Doesn't matter as long as we're not multiplatform! It's more like
> ITCM_OFFSET would be a Kconfig constant and you could change
> it per-platform that need this.
>
>>  So if you've got a generic "udelay"
>> function you could just mark it as a "tcmfunc" and it will work
>> everywhere.  No relocation needed and the compiler knows exactly where
>> things will be.
>
> This would still work as long as you're not multiplatform.

Yup, but I don't think I can land something upstream that's not multiplatform.


> But udelay is probably a bad idea. Maybe sched_clock() or others.

Some type of delay function is incredibly likely to be needed in SRAM
and it's possible to implement this fairly generically by using
arch_counter_get_cntpct() (or you could get the virtual one).


>> 1. It sure seems unlikely that the current TCM solution would scale to
>> multiplatform.  Oh right, Linus W said this in his reply, too.
>
> Yes. But it is no less elegant than this patch AFAICT.

I guess it's a matter of opinion.  What I have could scale to
multiplatform just fine, which I think is a nice benefit.  It also
doesn't involve coopting the TCM code into behaving in a way it wasn't
quite designed to work.  It also is very self-contained.  ...but I can
certainly see your side of the argument, too.


>> you've got SoC_A, SoC_B, and SoC_C all marking their functions
>> "tcmfunc" then they'll all be placed in the TCM section, right?
>> There's no way to detect that you're on SoC_A and that you only need
>> the SoC_A code.  Given the marching orders of multiplatform,
>> multiplatform, multiplatform then I _think_ that means we shouldn't
>> let anyone merge any code to mainline that uses TCM (unless TCM gets
>> revamped).
>>
>> 2. I haven't tried it, but it seems like the compiler still might not
>> catch stray (accidental) accesses from the TCM section to the non-TCM
>> section.  Again, this isn't a showstopper because you'd just track
>> each one down, but it is a nice feature of adding a separate
>> executable.
>
> This is correct I think.
>
> I think something in the direction of Russ Dill's patch set is the right
> way forward to fix the problem for multiplatform.

Sure.  I look forward to someone solving it.  ...and as I said, until
someone does solve it I think we're blocked from landing deep sleep
mode for rk3288 upstream (unless someone decides that my patches are
OK).

-Doug
Russell King - ARM Linux Dec. 3, 2014, 4:49 p.m. UTC | #9
On Wed, Dec 03, 2014 at 08:44:55AM -0800, Doug Anderson wrote:
> Sure.  I look forward to someone solving it.  ...and as I said, until
> someone does solve it I think we're blocked from landing deep sleep
> mode for rk3288 upstream (unless someone decides that my patches are
> OK).

What we definitely don't want to do is to have each SoC doing their
own thing on this, so we really need it to either be generic across
all ARMs, or fully generic.

So, if it can be turned into some kind of generic infrastructure, I'd
be happier with it, because that would help it be re-used across other
SoCs.
Doug Anderson Dec. 3, 2014, 4:57 p.m. UTC | #10
Arnd,

On Wed, Dec 3, 2014 at 6:42 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> On Tuesday 02 December 2014 09:36:00 Doug Anderson wrote:
>> On Tue, Dec 2, 2014 at 1:33 AM, Arnd Bergmann <arnd@arndb.de> wrote:
>> > On Monday 01 December 2014 15:04:59 Doug Anderson wrote:
>> >> On Mon, Dec 1, 2014 at 2:50 PM, Russell King - ARM Linux <linux@arm.linux.org.uk> wrote:
>> > I recently looked at another vendor tree (quantenna wifi access point,
>> > based on arch/arc), which was putting arbitrary functions into SRAM
>> > for performance reasons, in their case the entire hot path for network
>> > switching. Having at least the infrastructure to do this seems like
>> > a great idea, even though it's very hard to do in a general-purpose
>> > kernel, as you'd have a hard time squeezing as much code as possible
>> > into the available SRAM.
>>
>> I'm always a fan of seeing general infrastructure introduced, though
>> we always need to make sure that the general infrastructure makes
>> things easier and not harder.  There's always the danger of adding so
>> much abstraction for a small thing that using it is like pulling
>> teeth.  I'm not saying that's the case here, but it is always a
>> danger.
>>
>> Note: I will point out a critical differences between the "hotpath"
>> problem and the one I'm solving here.  When you're just trying to
>> speed up a hotpath, it's not the end of the world if there's a stray
>> access to SDRAM.  If you happen to access a global variable in SDRAM,
>> or use a libc function to do division, or have a WARN_ON, those things
>> are OK.  It might also be OK if the stack was still in SDRAM.  When
>> you're compiling code that has to run with no other kernel function
>> present it's really nice to link them into a separate executable.
>
> Yes, makes sense. We might be able to use the same trick that we have
> for verifying __init sections though: During the final link of the
> vmlinux or module, check that an SRAM function only calls other
> functions that are in SRAM and accesses global variables that way
> too.

Yup, I thought about this.  You might want some way to make decisions
about whether accesses are OK.  If you're optimizing a hotpath maybe
all accesses are OK (but deserve a warning?).  If you're running code
where SDRAM is not available then no accesses are OK.


> It wouldn't cover any pointers you pass using function arguments
> though, and I don't yet understand the requirements for stack accesses.
> How do you currently deal with local variables that are put on the
> stack by a blob?

The blob sets up its own stack in assembly code.


>> > and I also don't think I want to have
>> > the infrastructure for it in mach-rockchip and would want to see that
>> > at least shared across arch/arm if it's too hard to do
>> > cross-architecture. If you were to include code from drivers/memory/
>> > in the blob, you couldn't keep it in mach-rockchip anyway.
>>
>> I guess I was envisioning that if other places need similar
>> functionality that they would copy the ideas here.  Some of the
>> Makefile bits could possibly be shared through some type of Makefile
>> library.  I know copying code / Makefiles is bad, but sometimes it's
>> the cleanest way to do something.  If we start seeing a lot of
>> duplication then we can make things common and we can truly evaluate
>> whether the common solution is better than the duplication.
>
> The makefile parts should be really easy to share by putting them
> into scripts/Makefile.lib.

Agreed.


>> > AFAICT, the quantenna implementation is similar to the itcm/dtcm
>> > stuff we already have (but are not using upstream), so I wonder
>> > why we can't use that here too, see Documentation/arm/tcm.txt
>>
>> I wasn't aware of the TCM stuff.  Thanks for the pointer!  It looks
>> pretty neat...
>>
>> Ah, but the TCM stuff has a critical difference from my problem.  By
>> the very definition of TCM you don't need to do relocation.
>>
>> TCM has the magical property that you can assign the physical address.
>> That means that you instantly sidestep multiplatform problems of
>> having SRAM at different physical addresses.  You can compile the code
>> to assume it's at 0xfffe0000 and it will work on every single machine
>> out there that needs TCM.  So if you've got a generic "udelay"
>> function you could just mark it as a "tcmfunc" and it will work
>> everywhere.  No relocation needed and the compiler knows exactly where
>> things will be.
>
> Ok, I have to admit that I don't actually understand the differences
> myself. Why does the physical address of the TCM matter? Can't we
> just map the SRAM to a sufficiently large well-known virtual address?

Linus W. got it right when he said I was implicitly alluding to the
fact that I needed to be running with the MMU off.  I'm running resume
code which runs with everything off and all addresses are physical.

In my case I could compile the code PIC/PID if needed, but I don't
think it's so easy with the TCM approach.


>> Unfortunately, the rk3288 doesn't have TCM.  I tried enabling it and
>> got these nice printouts at boot:
>>
>>     DTCM    : 0xfffe8000 - 0xfffe8000   (   0 kB)
>>     ITCM    : 0xfffe0000 - 0xfffe0000   (   0 kB)
>>
>> Instead of TCM I'm using the "PMU SRAM" on the rk3288 which is
>> designed to keep code and data across deep sleep.  Adding relocation
>> to the existing TCM support gets back into the rats nest of issues
>> that I was trying to avoid tackling.
>>
>> A few other TCM thoughts:
>>
>> 1. It sure seems unlikely that the current TCM solution would scale to
>> multiplatform.  Oh right, Linus W said this in his reply, too.  If
>> you've got SoC_A, SoC_B, and SoC_C all marking their functions
>> "tcmfunc" then they'll all be placed in the TCM section, right?
>
> Correct, that would be a problem, at least if the total size grows
> to more than the minimum of any of the chips' physical SRAM.

See below.  I have 4K, which means that the total size of all SoC's
TCM code has to be less than 4K.


>> There's no way to detect that you're on SoC_A and that you only need
>> the SoC_A code.  Given the marching orders of multiplatform,
>> multiplatform, multiplatform then I _think_ that means we shouldn't
>> let anyone merge any code to mainline that uses TCM (unless TCM gets
>> revamped).
>
> Just out of curiosity, what sizes are we looking at here, for the
> code you currently have and the available SRAM on rk3288?

I'm running in 4K of SRAM.  I think my current code is just over 2K.
It's unlikely any other platform would fit.


>> 2. I haven't tried it, but it seems like the compiler still might not
>> catch stray (accidental) accesses from the TCM section to the non-TCM
>> section.  Again, this isn't a showstopper because you'd just track
>> each one down, but it is a nice feature of adding a separate
>> executable.
>
> No, the compiler won't care about this, but as mentioned above we
> can have the kernel linker scripts help us a bit here.

Yup, true.

-Doug
Doug Anderson Dec. 3, 2014, 4:59 p.m. UTC | #11
Russell,

On Wed, Dec 3, 2014 at 8:49 AM, Russell King - ARM Linux
<linux@arm.linux.org.uk> wrote:
> On Wed, Dec 03, 2014 at 08:44:55AM -0800, Doug Anderson wrote:
>> Sure.  I look forward to someone solving it.  ...and as I said, until
>> someone does solve it I think we're blocked from landing deep sleep
>> mode for rk3288 upstream (unless someone decides that my patches are
>> OK).
>
> What we definitely don't want to do is to have each SoC doing their
> own thing on this, so we really need it to either be generic across
> all ARMs, or fully generic.
>
> So, if it can be turned into some kind of generic infrastructure, I'd
> be happier with it, because that would help it be re-used across other
> SoCs.

If folks can come up with a way to adapt my current approach to be
acceptable then I'm happy to do it.

...but if not then I'll happily drop this patch and patiently await
the day when this problem is solved upstream.  Once that happens,
someone can try landing the deep sleep for rk3288 using that new
approach.


-Doug
diff mbox

Patch

diff --git a/arch/arm/mach-rockchip/Makefile b/arch/arm/mach-rockchip/Makefile
index 5c3a9b2..ad2b56e 100644
--- a/arch/arm/mach-rockchip/Makefile
+++ b/arch/arm/mach-rockchip/Makefile
@@ -1,5 +1,7 @@ 
 CFLAGS_platsmp.o := -march=armv7-a
 
 obj-$(CONFIG_ARCH_ROCKCHIP) += rockchip.o
-obj-$(CONFIG_PM_SLEEP) += pm.o sleep.o
+obj-$(CONFIG_PM_SLEEP) += pm.o
 obj-$(CONFIG_SMP) += headsmp.o platsmp.o
+
+obj-y += embedded/
diff --git a/arch/arm/mach-rockchip/embedded/.gitignore b/arch/arm/mach-rockchip/embedded/.gitignore
new file mode 100644
index 0000000..84709d7
--- /dev/null
+++ b/arch/arm/mach-rockchip/embedded/.gitignore
@@ -0,0 +1 @@ 
+rk3288_resume.lds
diff --git a/arch/arm/mach-rockchip/embedded/Makefile b/arch/arm/mach-rockchip/embedded/Makefile
new file mode 100644
index 0000000..2e5a253
--- /dev/null
+++ b/arch/arm/mach-rockchip/embedded/Makefile
@@ -0,0 +1,53 @@ 
+# Makefile for embedded code blobs for Rockchip SoCs
+#
+# These code blobs are emedded into vmlinux and copied into SRAM
+# at times when SDRAM is not available.  Each blob is self contained.
+#
+# Some blobs may be linked to expect to run at a very specific address.
+# A good example is resume code blobs that always expect to run in a
+# very specific bit of SRAM that keeps power during sleep.  This code
+# is also running with the cache off so it can predict the address it
+# will be at.
+#
+# Other blobs may be linked with -fpic (by adding CFLAGS_file.o := -fpic).
+# These can be located anywhere.  I believe gcc will support this by
+# assuming that the .text and .data sections are relative to each other.
+#
+# That brings up the point that all blobs here:
+# - Are generally very small
+# - Generally have code and data jammed together in one blob.
+# - Generally have "parameters" at the beginning that are filled in by
+#   the kernel.
+
+obj-$(CONFIG_PM_SLEEP) += rk3288_resume.bin.o
+
+targets := rk3288_resume.o \
+	rk3288_resume.elf rk3288_resume.lds \
+	rk3288_resume.bin rk3288_resume.bin.o
+
+# Reset objcopy flags, ARM puts "-O binary" here.
+OBJCOPYFLAGS :=
+
+# Our embedded code can't handle this flag.
+ifeq ($(CONFIG_FUNCTION_TRACER),y)
+ORIG_CFLAGS := $(KBUILD_CFLAGS)
+KBUILD_CFLAGS = $(subst -pg, , $(ORIG_CFLAGS))
+endif
+
+KBUILD_CFLAGS += -fno-stack-protector
+
+# This is the ELF for the embedded binary
+LDFLAGS_rk3288_resume.elf := -Bstatic -nostdlib -T
+$(obj)/rk3288_resume.elf: $(obj)/rk3288_resume.lds $(obj)/rk3288_resume.o \
+			  FORCE
+	$(call if_changed,ld)
+
+# Create binary data for the kernel
+OBJCOPYFLAGS_rk3288_resume.bin := -O binary
+$(obj)/rk3288_resume.bin: $(obj)/rk3288_resume.elf FORCE
+	$(call if_changed,objcopy)
+
+# Import the data into the kernel
+OBJCOPYFLAGS_rk3288_resume.bin.o += -B $(ARCH) -I binary -O elf32-littlearm
+$(obj)/rk3288_resume.bin.o: $(obj)/rk3288_resume.bin FORCE
+	$(call if_changed,objcopy)
diff --git a/arch/arm/mach-rockchip/embedded/rk3288_resume.c b/arch/arm/mach-rockchip/embedded/rk3288_resume.c
new file mode 100644
index 0000000..997f874
--- /dev/null
+++ b/arch/arm/mach-rockchip/embedded/rk3288_resume.c
@@ -0,0 +1,92 @@ 
+/*
+ * Rockchip rk3288 resume code
+ *
+ * This code is intended to be linked into the embedded resume binary
+ * for the rk3288 SoC
+ *
+ * Copyright (c) 2014 Google, Inc
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/kernel.h>
+
+#include "rk3288_resume.h"
+#include "rk3288_resume_embedded.h"
+
+#define INIT_CPSR	(PSR_I_BIT | PSR_F_BIT | SVC_MODE)
+
+static __noreturn void rk3288_resume(void);
+
+/* Parameters of early board initialization in SPL */
+struct rk3288_resume_params rk3288_resume_params
+		__attribute__((section(".resume_params"))) = {
+	.resume_loc	= rk3288_resume,
+};
+
+/**
+ * rk3288_resume_c - Main C entry point for rk3288 at resume time
+ *
+ * This function is called by rk3288_resume() and does the brunt of
+ * any resume processing.  After it's done it will call the kernel's
+ * cpu_resume() function (which it finds through its params structure).
+ *
+ * At the time this function is called:
+ * - We know we're on CPU0.
+ * - Interrupts are disabled.
+ * - We've got a stack.
+ * - The cache is turned off, so all addresses are physical.
+ * - SDRAM hasn't been restored yet (if it was off).
+ *
+ * WARNING: This code, the stack and the params structure are all sitting in
+ * PMU SRAM.  If you try to write to that memory using an 8-bit access (or even
+ * 16-bit) you'll get an imprecise data abort and it will be very hard to debug.
+ * Keep everything in here as 32-bit wide and aligned.  YOU'VE BEEN WARNED.
+ */
+static void __noreturn rk3288_resume_c(void)
+{
+	if (rk3288_resume_params.l2ctlr_f)
+		asm("mcr p15, 1, %0, c9, c0, 2" : :
+			"r" (rk3288_resume_params.l2ctlr));
+
+	rk3288_resume_params.cpu_resume();
+}
+
+/**
+ * rk3288_resume - First entry point for rk3288 at resume time
+ *
+ * A pointer to this function is stored in rk3288_resume_params.  The
+ * kernel uses the pointer in that structure to find this function and
+ * to put its (physical) address in a location that it will get jumped
+ * to at resume time.
+ *
+ * There is no stack at the time this function is called, so this
+ * function is in charge of setting it up.  We get to a function with
+ * a normal stack pointer ASAP.
+ */
+static void __naked __noreturn rk3288_resume(void)
+{
+	/* Make sure we're on CPU0, no IRQs and get a stack setup */
+	asm volatile (
+			"msr	cpsr_cxf, %0\n"
+
+			/* Only cpu0 continues to run, the others halt here */
+			"mrc	p15, 0, r1, c0, c0, 5\n"
+			"and	r1, r1, #0xf\n"
+			"cmp	r1, #0\n"
+			"beq	cpu0run\n"
+		"secondary_loop:\n"
+			"wfe\n"
+			"b	secondary_loop\n"
+
+		"cpu0run:\n"
+			"mov	sp, %1\n"
+		:
+		: "i" (INIT_CPSR), "r" (&__stack_start)
+		: "cc", "r1", "sp");
+
+	/* Now get into a normal function that can use a stack */
+	rk3288_resume_c();
+}
diff --git a/arch/arm/mach-rockchip/embedded/rk3288_resume.h b/arch/arm/mach-rockchip/embedded/rk3288_resume.h
new file mode 100644
index 0000000..9d3cfc6
--- /dev/null
+++ b/arch/arm/mach-rockchip/embedded/rk3288_resume.h
@@ -0,0 +1,44 @@ 
+/*
+ * Rockchip resume header (API from kernel to embedded code)
+ *
+ * Copyright (c) 2014 Google, Inc
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#ifndef __MACH_ROCKCHIP_RK3288_RESUME_H
+#define __MACH_ROCKCHIP_RK3288_RESUME_H
+
+/**
+ * rk3288_resume_params - Parameter space for the resume code
+ *
+ * This structure is at the start of the resume blob and is used to communicate
+ * between the resume blob and the callers.
+ *
+ * WARNING: This structure is sitting in PMU SRAM.  If you try to write to that
+ * memory using an 8-bit access (or even 16-bit) you'll get an imprecise data
+ * abort and it will be very hard to debug.  Keep everything in here as 32-bit
+ * wide and aligned.  YOU'VE BEEN WARNED.
+ *
+ * @resume_loc:		The value here should be the resume address that the CPU
+ *			is programmed to go to at resume time.
+ *
+ * @l2ctlr_f:		If non-zero we'll set l2ctlr at resume time.
+ * @l2ctlr:		The value to set l2ctlr to at resume time.
+ *
+ * @cpu_resume:		The function to jump to when we're all done.
+ */
+struct rk3288_resume_params {
+	/* This is compiled in and can be read to find the resume location */
+	__noreturn void (*resume_loc)(void);
+
+	/* Filled in by the client of the resume code */
+	u32 l2ctlr_f;		/* u32 not bool to avoid 8-bit SRAM access */
+	u32 l2ctlr;
+
+	__noreturn void (*cpu_resume)(void);
+};
+
+#endif /* __MACH_ROCKCHIP_RK3288_RESUME_H */
diff --git a/arch/arm/mach-rockchip/embedded/rk3288_resume.lds.S b/arch/arm/mach-rockchip/embedded/rk3288_resume.lds.S
new file mode 100644
index 0000000..0643169
--- /dev/null
+++ b/arch/arm/mach-rockchip/embedded/rk3288_resume.lds.S
@@ -0,0 +1,40 @@ 
+MEMORY {
+	pmu_sram_code  : ORIGIN = 0xff720000, LENGTH = 0xf00
+	pmu_sram_stack : ORIGIN = 0xff720f00, LENGTH = 0x100
+}
+
+OUTPUT_FORMAT("elf32-littlearm", "elf32-littlearm", "elf32-littlearm")
+OUTPUT_ARCH(arm)
+
+SECTIONS
+{
+	/* Don't need unwind tables */
+	/DISCARD/ : {
+		*(.ARM.exidx*)
+		*(.ARM.extab*)
+	}
+
+	/* Kernel code finds params because it knows they are first */
+	.params : { *(.resume_params*) } > pmu_sram_code
+	. = ALIGN(4);
+
+	.text : { *(.text*) } > pmu_sram_code
+	. = ALIGN(4);
+
+	.rodata : { *(SORT_BY_ALIGNMENT(.rodata*)) } > pmu_sram_code
+	. = ALIGN(4);
+
+	.data : {
+		*(SORT_BY_ALIGNMENT(.data*))
+		. = ALIGN(4);
+
+		/* We purposely put bss as part of data to avoid initting */
+		*(SORT_BY_ALIGNMENT(.bss*))
+		. = ALIGN(4);
+	} > pmu_sram_code
+
+	.stack : {
+		. += LENGTH(pmu_sram_stack) - 8;
+		__stack_start = .;
+	} > pmu_sram_stack
+}
diff --git a/arch/arm/mach-rockchip/embedded/rk3288_resume_embedded.h b/arch/arm/mach-rockchip/embedded/rk3288_resume_embedded.h
new file mode 100644
index 0000000..923d88a
--- /dev/null
+++ b/arch/arm/mach-rockchip/embedded/rk3288_resume_embedded.h
@@ -0,0 +1,17 @@ 
+/*
+ * Rockchip resume header (API between files in embedded code)
+ *
+ * Copyright (c) 2014 Google, Inc
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#ifndef __MACH_ROCKCHIP_RK3288_RESUME_EMBEDDED_H
+#define __MACH_ROCKCHIP_RK3288_RESUME_EMBEDDED_H
+
+/* Defined in the linker script */
+extern u32 *__stack_start;
+
+#endif /* __MACH_ROCKCHIP_RK3288_RESUME_EMBEDDED_H */
diff --git a/arch/arm/mach-rockchip/pm.c b/arch/arm/mach-rockchip/pm.c
index 50cb781..f213caa 100644
--- a/arch/arm/mach-rockchip/pm.c
+++ b/arch/arm/mach-rockchip/pm.c
@@ -28,6 +28,7 @@ 
 #include <asm/suspend.h>
 
 #include "pm.h"
+#include "embedded/rk3288_resume.h"
 
 /* These enum are option of low power mode */
 enum {
@@ -57,17 +58,33 @@  static inline u32 rk3288_l2_config(void)
 	return l2ctlr;
 }
 
-static void rk3288_config_bootdata(void)
+static void __init rk3288_init_pmu_sram(void)
 {
-	rkpm_bootdata_cpusp = rk3288_bootram_phy + (SZ_4K - 8);
-	rkpm_bootdata_cpu_code = virt_to_phys(cpu_resume);
+	extern char _binary_arch_arm_mach_rockchip_embedded_rk3288_resume_bin_start;
+	extern char _binary_arch_arm_mach_rockchip_embedded_rk3288_resume_bin_end;
+	u32 size = &_binary_arch_arm_mach_rockchip_embedded_rk3288_resume_bin_end -
+		   &_binary_arch_arm_mach_rockchip_embedded_rk3288_resume_bin_start;
+	struct rk3288_resume_params *params;
 
-	rkpm_bootdata_l2ctlr_f  = 1;
-	rkpm_bootdata_l2ctlr = rk3288_l2_config();
+	/* move resume code and data to PMU sram */
+	memcpy(rk3288_bootram_base,
+	       &_binary_arch_arm_mach_rockchip_embedded_rk3288_resume_bin_start,
+	       size);
+
+	/* setup the params that we know at boot time */
+	params = (struct rk3288_resume_params *)rk3288_bootram_base;
+
+	params->cpu_resume = (void *)virt_to_phys(cpu_resume);
+
+	params->l2ctlr_f = 1;
+	params->l2ctlr = rk3288_l2_config();
 }
 
 static void rk3288_slp_mode_set(int level)
 {
+	struct rk3288_resume_params *params =
+		(struct rk3288_resume_params *)rk3288_bootram_base;
+
 	u32 mode_set, mode_set1;
 
 	regmap_read(sgrf_regmap, RK3288_SGRF_SOC_CON0, &rk3288_sgrf_soc_con0);
@@ -81,7 +98,7 @@  static void rk3288_slp_mode_set(int level)
 
 	/* booting address of resuming system is from this register value */
 	regmap_write(sgrf_regmap, RK3288_SGRF_FAST_BOOT_ADDR,
-		     rk3288_bootram_phy);
+		     (u32)params->resume_loc);
 
 	regmap_write(pmu_regmap, RK3288_PMU_WAKEUP_CFG1,
 		     PMU_ARMINT_WAKEUP_EN);
@@ -162,7 +179,7 @@  static void rk3288_suspend_finish(void)
 		pr_err("%s: Suspend finish failed\n", __func__);
 }
 
-static int rk3288_suspend_init(struct device_node *np)
+static int __init rk3288_suspend_init(struct device_node *np)
 {
 	struct device_node *sram_np;
 	struct resource res;
@@ -203,11 +220,7 @@  static int rk3288_suspend_init(struct device_node *np)
 
 	of_node_put(sram_np);
 
-	rk3288_config_bootdata();
-
-	/* copy resume code and data to bootsram */
-	memcpy(rk3288_bootram_base, rockchip_slp_cpu_resume,
-	       rk3288_bootram_sz);
+	rk3288_init_pmu_sram();
 
 	return 0;
 }
diff --git a/arch/arm/mach-rockchip/pm.h b/arch/arm/mach-rockchip/pm.h
index 99722d0..0d1f127 100644
--- a/arch/arm/mach-rockchip/pm.h
+++ b/arch/arm/mach-rockchip/pm.h
@@ -15,14 +15,6 @@ 
 #ifndef __MACH_ROCKCHIP_PM_H
 #define __MACH_ROCKCHIP_PM_H
 
-extern unsigned long rkpm_bootdata_cpusp;
-extern unsigned long rkpm_bootdata_cpu_code;
-extern unsigned long rkpm_bootdata_l2ctlr_f;
-extern unsigned long rkpm_bootdata_l2ctlr;
-extern unsigned long rkpm_bootdata_ddr_code;
-extern unsigned long rkpm_bootdata_ddr_data;
-extern unsigned long rk3288_bootram_sz;
-
 void rockchip_slp_cpu_resume(void);
 void __init rockchip_suspend_init(void);
 
diff --git a/arch/arm/mach-rockchip/sleep.S b/arch/arm/mach-rockchip/sleep.S
deleted file mode 100644
index 2eec9a3..0000000
--- a/arch/arm/mach-rockchip/sleep.S
+++ /dev/null
@@ -1,73 +0,0 @@ 
-/*
- * Copyright (c) 2014, Fuzhou Rockchip Electronics Co., Ltd
- * Author: Tony Xie <tony.xie@rock-chips.com>
- *
- * This program is free software; you can redistribute it and/or modify it
- * under the terms and conditions of the GNU General Public License,
- * version 2, as published by the Free Software Foundation.
- *
- * This program is distributed in the hope it will be useful, but WITHOUT
- * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
- * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
- * more details.
- *
- */
-
-#include <linux/linkage.h>
-#include <asm/assembler.h>
-#include <asm/memory.h>
-
-.data
-/*
- * this code will be copied from
- * ddr to sram for system resumeing.
- * so it is ".data section".
- */
-.align
-
-ENTRY(rockchip_slp_cpu_resume)
-	setmode	PSR_I_BIT | PSR_F_BIT | SVC_MODE, r1  @ set svc, irqs off
-	mrc	p15, 0, r1, c0, c0, 5
-	and	r1, r1, #0xf
-	cmp	r1, #0
-	/* olny cpu0 can continue to run, the others is halt here */
-	beq	cpu0run
-secondary_loop:
-	wfe
-	b	secondary_loop
-cpu0run:
-	ldr	r3, rkpm_bootdata_l2ctlr_f
-	cmp	r3, #0
-	beq	sp_set
-	ldr	r3, rkpm_bootdata_l2ctlr
-	mcr	p15, 1, r3, c9, c0, 2
-sp_set:
-	ldr	sp, rkpm_bootdata_cpusp
-	ldr	r1, rkpm_bootdata_cpu_code
-	bx	r1
-ENDPROC(rockchip_slp_cpu_resume)
-
-/* Parameters filled in by the kernel */
-
-/* Flag for whether to restore L2CTLR on resume */
-	.global rkpm_bootdata_l2ctlr_f
-rkpm_bootdata_l2ctlr_f:
-	.long 0
-
-/* Saved L2CTLR to restore on resume */
-	.global rkpm_bootdata_l2ctlr
-rkpm_bootdata_l2ctlr:
-	.long 0
-
-/* CPU resume SP addr */
-	.globl rkpm_bootdata_cpusp
-rkpm_bootdata_cpusp:
-	.long 0
-
-/* CPU resume function (physical address) */
-	.globl rkpm_bootdata_cpu_code
-rkpm_bootdata_cpu_code:
-	.long 0
-
-ENTRY(rk3288_bootram_sz)
-        .word   . - rockchip_slp_cpu_resume