diff mbox

[RFC,v5] ARM hibernation / suspend-to-disk (fwd)

Message ID 20110616213128.GC32629@n2100.arm.linux.org.uk (mailing list archive)
State New, archived
Headers show

Commit Message

Russell King - ARM Linux June 16, 2011, 9:31 p.m. UTC
On Wed, Jun 15, 2011 at 02:35:26PM +0100, Frank Hofmann wrote:
> this change is perfect; with this, the hibernation support code turns 
> into the attached.

Should I assume that's an ack for all the patches?

> That's both better and simpler to perform a full suspend/resume cycle 
> (via resetting in the cpu_suspend "finisher") after the snapshot image 
> has been created, instead of shoehorning a return into this.

It's now not soo difficult to have an error code returned from the
finisher function - the only thing we have to make sure is that
the cpu_do_suspend helper just saves state and does not make any
state changes.

We can then do this, which makes it possible for the finisher to
return, and we propagate its return value.  We also ensure that a
path through from cpu_resume will result in a zero return value
from cpu_suspend.

This isn't an invitation for people to make the S2RAM path return
after they time out waiting for suspend to happen - that's potentially
dangerous because in that case the suspend may happen while we're
resuming devices which wouldn't be nice.

Comments

Frank Hofmann June 20, 2011, 12:32 p.m. UTC | #1
On Thu, 16 Jun 2011, Russell King - ARM Linux wrote:

> On Wed, Jun 15, 2011 at 02:35:26PM +0100, Frank Hofmann wrote:
>> this change is perfect; with this, the hibernation support code turns
>> into the attached.
>
> Should I assume that's an ack for all the patches?

Sorry the late reply, been away.

I've only tested on s3c64xx (for s2ram and s2disk) and omap3 (s2disk only 
due to the omap code not having been "genericized" yet), the changes are 
ok there.
The remainder apply and compile ok.

If that's sufficient, then:
Acked-by: Frank Hofmann <frank.hofmann@tomtom.com>

>
>> That's both better and simpler to perform a full suspend/resume cycle
>> (via resetting in the cpu_suspend "finisher") after the snapshot image
>> has been created, instead of shoehorning a return into this.
>
> It's now not soo difficult to have an error code returned from the
> finisher function - the only thing we have to make sure is that
> the cpu_do_suspend helper just saves state and does not make any
> state changes.
>
> We can then do this, which makes it possible for the finisher to
> return, and we propagate its return value.  We also ensure that a
> path through from cpu_resume will result in a zero return value
> from cpu_suspend.
>
> This isn't an invitation for people to make the S2RAM path return
> after they time out waiting for suspend to happen - that's potentially
> dangerous because in that case the suspend may happen while we're
> resuming devices which wouldn't be nice.

You're right there's some risk that the ability to return an error here is 
misinterpreted as an invitation to use error returns for indicating state 
machine conditions.

What I'm wondering about is the usecase for having an error return 
opportunity in this case (beyond reporting it as such). Isn't it rather so 
that most finishers would succeed and at best do a "BUG_ON()" at failure, 
because system state then isn't such to make it trivially recoverable ?

For hibernation, the ability to force the entire down/up transition before 
writing the snapshot image out is actually very beneficial, for 
reliability - one knows that the device/cpu side of suspend/resume has 
already worked when the snapshot is being written, without having to wait 
for reboot/image load/resume to test that. One would want to go through 
suspend/resume even if the memory snapshot operation, swsusp_save, errors.
A failure of swsusp_save doesn't make suspend/resume itself a failure, 
therefore it's desirable to propagate that return code in other ways (and 
keep the finisher "unconditional").

I'm not opposed to this addition as such, but I'm curious:
* If an error occurs, how can the caller of cpu_suspend recover from it ?
* What's the state the system is in after an error in this codepath ?
* What subsystems / users would do anything else with it than BUG_ON() ?

Also, consider possible errors in the SoC-specific code on the resume 
side; situations like a failure to perform SoC-iROM-calls for e.g. an 
attempted secure state restore would result in errors that can't be 
propagated by this mechanism; i.e. there are still failure modes which 
require platform-specific intervention of sorts, and platform-specific 
error propagation/handling, even were cpu_suspend to return error codes.

FrankH.

>
> diff -u b/arch/arm/kernel/sleep.S b/arch/arm/kernel/sleep.S
> --- b/arch/arm/kernel/sleep.S
> +++ b/arch/arm/kernel/sleep.S
> @@ -12,7 +12,6 @@
>  *  r1 = v:p offset
>  *  r2 = suspend function arg0
>  *  r3 = suspend function
> - * Note: does not return until system resumes
>  */
> ENTRY(cpu_suspend)
> 	stmfd	sp!, {r4 - r11, lr}
> @@ -26,7 +25,7 @@
> #endif
> 	mov	r6, sp			@ current virtual SP
> 	sub	sp, sp, r5		@ allocate CPU state on stack
> -	mov	r0, sp			@ save pointer
> +	mov	r0, sp			@ save pointer to CPU save block
> 	add	ip, ip, r1		@ convert resume fn to phys
> 	stmfd	sp!, {r1, r6, ip}	@ save v:p, virt SP, phys resume fn
> 	ldr	r5, =sleep_save_sp
> @@ -55,10 +54,17 @@
> #else
> 	bl	__cpuc_flush_kern_all
> #endif
> +	adr	lr, BSYM(cpu_suspend_abort)
> 	ldmfd	sp!, {r0, pc}		@ call suspend fn
> ENDPROC(cpu_suspend)
> 	.ltorg
>
> +cpu_suspend_abort:
> +	ldmia	sp!, {r1 - r3}		@ pop v:p, virt SP, phys resume fn
> +	mov	sp, r2
> +	ldmfd	sp!, {r4 - r11, pc}
> +ENDPROC(cpu_suspend_abort)
> +
> /*
>  * r0 = control register value
>  * r1 = v:p offset (preserved by cpu_do_resume)
> @@ -88,6 +94,7 @@
> cpu_resume_after_mmu:
> 	str	r5, [r2, r4, lsl #2]	@ restore old mapping
> 	mcr	p15, 0, r0, c1, c0, 0	@ turn on D-cache
> +	mov	r0, #0			@ return zero on success
> 	ldmfd	sp!, {r4 - r11, pc}
> ENDPROC(cpu_resume_after_mmu)
>
>
>
Russell King - ARM Linux June 21, 2011, 2:35 p.m. UTC | #2
On Mon, Jun 20, 2011 at 01:32:47PM +0100, Frank Hofmann wrote:
> I've only tested on s3c64xx (for s2ram and s2disk) and omap3 (s2disk only 
> due to the omap code not having been "genericized" yet), the changes are  
> ok there.
> The remainder apply and compile ok.
>
> If that's sufficient, then:
> Acked-by: Frank Hofmann <frank.hofmann@tomtom.com>

I've added that to the core stuff and s3c64xx.

> You're right there's some risk that the ability to return an error here 
> is misinterpreted as an invitation to use error returns for indicating 
> state machine conditions.

Well, an error return undoes (and probably corrupts) the state which
cpu_suspend saved, so it wouldn't be useful for state machine stuff -
I wonder whether we should poison the saved pointers just to reinforce
this point.

> What I'm wondering about is the usecase for having an error return  
> opportunity in this case (beyond reporting it as such). Isn't it rather 
> so that most finishers would succeed and at best do a "BUG_ON()" at 
> failure, because system state then isn't such to make it trivially 
> recoverable ?

For s2ram, there's no reason at this point to fail.  The only thing left
to do at this point is to do whatever is required to cause the system to
enter the low power mode.

In the case where you need to place the SDRAM into self refresh mode
manually, the timing of stuff after self refresh mode has been entered
to the point where power is cut to the CPU is normally well defined,
and if for whatever reason power doesn't get cut, there's very little
which can be done to re-awake the system (you'd have to ensure that
the code to do that was already in the I-cache and whatever data was
required was already in registers.)

And as I mentioned, if you timeout waiting for the system to power off,
and you've done everything you should have done, are you sure that the
system won't power off on you unexpectedly if you return with an error,
which would then mean you have corrupted state saved (and possibly
corrupted filesystems.)

So, I don't think S2RAM would have any practical use for an error return
path.  It's more for hibernation.

> For hibernation, the ability to force the entire down/up transition 
> before writing the snapshot image out is actually very beneficial, for  
> reliability - one knows that the device/cpu side of suspend/resume has  
> already worked when the snapshot is being written, without having to wait 
> for reboot/image load/resume to test that.

It's not a test of the suspend/resume path though - device state is very
different between suspending in some way followed by an immediate resume
without resetting, and a real suspend, power off, resume cycle.

The former doesn't really test the resume paths - a register which has
been forgotten won't be found by any other way than cutting the power,
and you'll only ever find that on a proper resume-from-power-off.

> I'm not opposed to this addition as such, but I'm curious:
> * If an error occurs, how can the caller of cpu_suspend recover from it ?

If an error occurs from cpu_suspend(), then the caller needs to undo any
modification to state which it did while saving, and return an error.

> * What's the state the system is in after an error in this codepath ?

We can arrange for another callback into CPU specific code to sort out
anything they've done - iow, an error return from cpu_suspend should
ensure that we have the same state present as we had at the point it
was called.  We currently do not need that as cpu_suspend() just saves
state without modifying anything, but if we ended up with a CPU which
required stuff to be modified, we'd have to undo those modifications
before returning an error.

> * What subsystems / users would do anything else with it than BUG_ON() ?

The error code from the platform_suspend_ops ->enter method is already
dealt with - by going through the resume steps before returning the
error code.  Whether hibernate has similar functionality I don't know.

What I do think is that if hibernate needs to return an error code, we'd
better do that without doing things like taking the MMU down to run the
resume paths in order to do that.  If we have some error to propagate,
we want to propagate it back as easily as possible.
diff mbox

Patch

diff -u b/arch/arm/kernel/sleep.S b/arch/arm/kernel/sleep.S
--- b/arch/arm/kernel/sleep.S
+++ b/arch/arm/kernel/sleep.S
@@ -12,7 +12,6 @@ 
  *  r1 = v:p offset
  *  r2 = suspend function arg0
  *  r3 = suspend function
- * Note: does not return until system resumes
  */
 ENTRY(cpu_suspend)
 	stmfd	sp!, {r4 - r11, lr}
@@ -26,7 +25,7 @@ 
 #endif
 	mov	r6, sp			@ current virtual SP
 	sub	sp, sp, r5		@ allocate CPU state on stack
-	mov	r0, sp			@ save pointer
+	mov	r0, sp			@ save pointer to CPU save block
 	add	ip, ip, r1		@ convert resume fn to phys
 	stmfd	sp!, {r1, r6, ip}	@ save v:p, virt SP, phys resume fn
 	ldr	r5, =sleep_save_sp
@@ -55,10 +54,17 @@ 
 #else
 	bl	__cpuc_flush_kern_all
 #endif
+	adr	lr, BSYM(cpu_suspend_abort)
 	ldmfd	sp!, {r0, pc}		@ call suspend fn
 ENDPROC(cpu_suspend)
 	.ltorg
 
+cpu_suspend_abort:
+	ldmia	sp!, {r1 - r3}		@ pop v:p, virt SP, phys resume fn
+	mov	sp, r2
+	ldmfd	sp!, {r4 - r11, pc}
+ENDPROC(cpu_suspend_abort)
+
 /*
  * r0 = control register value
  * r1 = v:p offset (preserved by cpu_do_resume)
@@ -88,6 +94,7 @@ 
 cpu_resume_after_mmu:
 	str	r5, [r2, r4, lsl #2]	@ restore old mapping
 	mcr	p15, 0, r0, c1, c0, 0	@ turn on D-cache
+	mov	r0, #0			@ return zero on success
 	ldmfd	sp!, {r4 - r11, pc}
 ENDPROC(cpu_resume_after_mmu)