2.6.30-git(16 and 17) system hangs after resume from suspend to disk, mce related?

Submitter Hidetoshi Seto
Date June 23, 2009, 3:40 a.m.
Message ID <4A404EC6.8080902@jp.fujitsu.com>
Permalink /patch/31907/
State New, archived

Comments

Hidetoshi Seto - June 23, 2009, 3:40 a.m.
Maciej Rutecki wrote:
> 2009/6/22 Andi Kleen <ak@linux.intel.com>:
> 
>> Here's a debug patch for the poller: http://firstfloor.org/ak/mcp-debug
>> Can you apply that and try again and send me the output?
>>
> 
> Dmesg after resume:
> http://unixy.pl/maciek/download/kernel/2.6.30-git17/pc/dmesg-2.6.30-git17-patch.txt
> 
> System hangs when uptime is roughly 5-6 minutes (when I don't change
> check_interval). netconsole doesn't show anything.
> 

I found in the dmesg that mce_init() and mce_cpu_features() are called
on cpu0 twice in short time:

[   82.989005] mcp on cpu 0 flags 2 banks ecc39e70
[   82.989005] [0] arch/x86/kernel/cpu/mcheck/mce.c:502
[   82.989005] [0] arch/x86/kernel/cpu/mcheck/mce.c:506
[   82.989005] bank 0
[   82.989005] [0] arch/x86/kernel/cpu/mcheck/mce.c:518
[   82.989005] bank 1
[   82.989005] [0] arch/x86/kernel/cpu/mcheck/mce.c:518
[   82.989005] bank 2
[   82.989005] [0] arch/x86/kernel/cpu/mcheck/mce.c:518
[   82.989005] bank 3
[   82.989005] [0] arch/x86/kernel/cpu/mcheck/mce.c:518
[   82.989005] bank 4
[   82.989005] [0] arch/x86/kernel/cpu/mcheck/mce.c:518
[   82.989005] bank 5
[   82.989005] [0] arch/x86/kernel/cpu/mcheck/mce.c:518
[   82.989005] mcp on cpu 0 finished
[   82.989005] CPU0: Thermal LVT vector (0xfa) already installed
[   82.989005] PM: Restoring platform NVS memory
[   82.989005] mcp on cpu 0 flags 2 banks ecc39e70
[   82.989005] [0] arch/x86/kernel/cpu/mcheck/mce.c:502
[   82.989005] [0] arch/x86/kernel/cpu/mcheck/mce.c:506
[   82.989005] bank 0
[   82.989005] [0] arch/x86/kernel/cpu/mcheck/mce.c:518
[   82.989005] bank 1
[   82.989005] [0] arch/x86/kernel/cpu/mcheck/mce.c:518
[   82.989005] bank 2
[   82.989005] [0] arch/x86/kernel/cpu/mcheck/mce.c:518
[   82.989005] bank 3
[   82.989005] [0] arch/x86/kernel/cpu/mcheck/mce.c:518
[   82.989005] bank 4
[   82.989005] [0] arch/x86/kernel/cpu/mcheck/mce.c:518
[   82.989005] bank 5
[   82.989005] [0] arch/x86/kernel/cpu/mcheck/mce.c:518
[   82.989005] mcp on cpu 0 finished
[   82.989005] CPU0: Thermal LVT vector (0xfa) already installed

mce_cpu_features() (which prints "Thermal ...") is always paired with
mce_init(), and is called only from mcheck_init() and mce_resume().

One of the above would be from mce_resume(), and if another was from
mcheck_init(), then setup_timer() in mce_init_timer() will break the
pending timer...
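
To illustrate why re-running setup on a pending timer is harmful, here is a
minimal userspace sketch. It is a toy model, not the kernel's timer code:
struct toy_timer, toy_setup_timer(), toy_add_timer() and
queue_ok_after_inits() are hypothetical names, and the list handling is only
loosely modeled on a doubly linked queue. The assumption is that setup clears
the timer's link fields without unlinking it, and that the following add
links it into the queue again.

```c
#include <assert.h>
#include <stddef.h>

/* Toy doubly linked list node (illustrative, not the kernel's type). */
struct node {
    struct node *prev, *next;
};

/* Hypothetical stand-in for a timer: "pending" while linked in a queue. */
struct toy_timer {
    struct node entry;
    int pending;
};

static void list_init(struct node *head)
{
    head->prev = head->next = head;
}

static void list_add_tail(struct node *n, struct node *head)
{
    n->prev = head->prev;
    n->next = head;
    head->prev->next = n;
    head->prev = n;
}

/* Assumed behaviour: setup resets the link fields but does NOT unlink. */
static void toy_setup_timer(struct toy_timer *t)
{
    t->entry.prev = t->entry.next = NULL;
    t->pending = 0;
}

static void toy_add_timer(struct toy_timer *t, struct node *queue)
{
    list_add_tail(&t->entry, queue);
    t->pending = 1;
}

/* Walk at most limit+1 links; 1 if all links agree and we return to head. */
static int list_is_consistent(const struct node *head, int limit)
{
    const struct node *p = head;
    for (int i = 0; i <= limit; i++) {
        if (p->next->prev != p)
            return 0;           /* forward and backward links disagree */
        p = p->next;
        if (p == head)
            return 1;
    }
    return 0;                   /* never came back to head */
}

/* Run setup+add `inits` times on one timer, then check the queue. */
int queue_ok_after_inits(int inits)
{
    struct node queue;
    struct toy_timer t;

    assert(inits >= 0);
    list_init(&queue);
    for (int i = 0; i < inits; i++) {
        toy_setup_timer(&t);
        toy_add_timer(&t, &queue);
    }
    return list_is_consistent(&queue, 8);
}
```

In this model queue_ok_after_inits(1) reports a well-formed queue, while
queue_ok_after_inits(2) reports corruption: the second add links the entry to
itself, so a walk of the queue would never get back to the head.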

[arch/x86/power/cpu.c]
> static void __restore_processor_state(struct saved_context *ctxt)
> {
>  :
> #ifdef CONFIG_X86_32
>         mcheck_init(&boot_cpu_data);
> #endif
> }

Hum?

Maciej, could you try this patch? 

Thanks,
H.Seto
Hugh Dickins - June 23, 2009, 10:54 a.m.
On Tue, 23 Jun 2009, Hidetoshi Seto wrote:
> Maciej Rutecki wrote:
> > 2009/6/22 Andi Kleen <ak@linux.intel.com>:
> > 
> >> Here's a debug patch for the poller: http://firstfloor.org/ak/mcp-debug
> >> Can you apply that and try again and send me the output?
> >>
> > 
> > Dmesg after resume:
> > http://unixy.pl/maciek/download/kernel/2.6.30-git17/pc/dmesg-2.6.30-git17-patch.txt
> > 
> > System hangs when uptime is roughly 5-6 minutes (when I don't change
> > check_interval). netconsole doesn't show anything.
> > 
> 
> I found in the dmesg that mce_init() and mce_cpu_features() are called
> on cpu0 twice in short time:
> 
> [   82.989005] mcp on cpu 0 flags 2 banks ecc39e70
> [   82.989005] [0] arch/x86/kernel/cpu/mcheck/mce.c:502
> [   82.989005] [0] arch/x86/kernel/cpu/mcheck/mce.c:506
> [   82.989005] bank 0
> [   82.989005] [0] arch/x86/kernel/cpu/mcheck/mce.c:518
> [   82.989005] bank 1
> [   82.989005] [0] arch/x86/kernel/cpu/mcheck/mce.c:518
> [   82.989005] bank 2
> [   82.989005] [0] arch/x86/kernel/cpu/mcheck/mce.c:518
> [   82.989005] bank 3
> [   82.989005] [0] arch/x86/kernel/cpu/mcheck/mce.c:518
> [   82.989005] bank 4
> [   82.989005] [0] arch/x86/kernel/cpu/mcheck/mce.c:518
> [   82.989005] bank 5
> [   82.989005] [0] arch/x86/kernel/cpu/mcheck/mce.c:518
> [   82.989005] mcp on cpu 0 finished
> [   82.989005] CPU0: Thermal LVT vector (0xfa) already installed
> [   82.989005] PM: Restoring platform NVS memory
> [   82.989005] mcp on cpu 0 flags 2 banks ecc39e70
> [   82.989005] [0] arch/x86/kernel/cpu/mcheck/mce.c:502
> [   82.989005] [0] arch/x86/kernel/cpu/mcheck/mce.c:506
> [   82.989005] bank 0
> [   82.989005] [0] arch/x86/kernel/cpu/mcheck/mce.c:518
> [   82.989005] bank 1
> [   82.989005] [0] arch/x86/kernel/cpu/mcheck/mce.c:518
> [   82.989005] bank 2
> [   82.989005] [0] arch/x86/kernel/cpu/mcheck/mce.c:518
> [   82.989005] bank 3
> [   82.989005] [0] arch/x86/kernel/cpu/mcheck/mce.c:518
> [   82.989005] bank 4
> [   82.989005] [0] arch/x86/kernel/cpu/mcheck/mce.c:518
> [   82.989005] bank 5
> [   82.989005] [0] arch/x86/kernel/cpu/mcheck/mce.c:518
> [   82.989005] mcp on cpu 0 finished
> [   82.989005] CPU0: Thermal LVT vector (0xfa) already installed
> 
> mce_cpu_features() (which prints "Thermal ...") is always paired with
> mce_init(), and is called only from mcheck_init() and mce_resume().
> 
> One of the above would be from mce_resume(), and if another was from
> mcheck_init(), then setup_timer() in mce_init_timer() will break the
> pending timer...
> 
> [arch/x86/power/cpu.c]
> > static void __restore_processor_state(struct saved_context *ctxt)
> > {
> >  :
> > #ifdef CONFIG_X86_32
> >         mcheck_init(&boot_cpu_data);
> > #endif
> > }
> 
> Hum?
> 
> Maciej, could you try this patch? 

Well found, thanks so much for that, I can confirm that this patch
works for me (though I'm not Maciej).

Since running recent gits on Core2 Duo, following resume from RAM,
within five minutes I'd get either a silent freeze, or a BUG at
kernel/timer.c:911, and/or DEBUG_LIST list.h warnings from
internal_add_timer().

Your patch puts an end to all that.  Something for Andi to rush to
Linus for -rc1 I think - though I won't be in the least surprised if
Andi decides on something a little different, it's rather surprising
to have that mcheck_init() call from over there in just the one case.

Thanks again,
Hugh

> 
> Thanks,
> H.Seto
> 
> ===
> [PATCH] x86: Fix mce resume on 32bit
> 
> Calling mcheck_init() on resume is required only with CONFIG_X86_OLD_MCE=y.
> 
> Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
> ---
>  arch/x86/power/cpu.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/arch/x86/power/cpu.c b/arch/x86/power/cpu.c
> index d277ef1..b3d20b9 100644
> --- a/arch/x86/power/cpu.c
> +++ b/arch/x86/power/cpu.c
> @@ -244,7 +244,7 @@ static void __restore_processor_state(struct saved_context *ctxt)
>  	do_fpu_end();
>  	mtrr_ap_init();
>  
> -#ifdef CONFIG_X86_32
> +#ifdef CONFIG_X86_OLD_MCE
>  	mcheck_init(&boot_cpu_data);
>  #endif
>  }
> -- 
> 1.6.3
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Andi Kleen - June 23, 2009, 11:06 a.m.
> Your patch puts an end to all that.  Something for Andi to rush to
> Linus for -rc1 I think - though I won't be in the least surprised if
> Andi decides on something a little different, it's rather surprising
> to have that mcheck_init() call from over there in just the one case.
> 
> Thanks again,
> Hugh


Yes, the patch looks good. I was just waiting for reports on whether it
really fixes the problem.

-Andi
Maciej Rutecki - June 23, 2009, 2:47 p.m.
2009/6/23 Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>:

>
> Maciej, could you try this patch?

Tested in 2 systems (for 17 and 30 minutes). Works without any
problems. Thanks for patch!

>
> ===
> [PATCH] x86: Fix mce resume on 32bit
>
> Calling mcheck_init() on resume is required only with CONFIG_X86_OLD_MCE=y.
>
> Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
> ---
>  arch/x86/power/cpu.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/arch/x86/power/cpu.c b/arch/x86/power/cpu.c
> index d277ef1..b3d20b9 100644
> --- a/arch/x86/power/cpu.c
> +++ b/arch/x86/power/cpu.c
> @@ -244,7 +244,7 @@ static void __restore_processor_state(struct saved_context *ctxt)
>        do_fpu_end();
>        mtrr_ap_init();
>
> -#ifdef CONFIG_X86_32
> +#ifdef CONFIG_X86_OLD_MCE
>        mcheck_init(&boot_cpu_data);
>  #endif
>  }

Tested-by: Maciej Rutecki <maciej.rutecki@gmail.com>
Andi Kleen - June 23, 2009, 7:57 p.m.
Maciej Rutecki wrote:
> 2009/6/23 Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>:
> 
>> Maciej, could you try this patch?
> 
> Tested in 2 systems (for 17 and 30 minutes). Works without any
> problems. Thanks for patch!

OK, great. Thanks for testing. Peter, please put Seto-san's patch into urgent.

Acked-by: Andi Kleen <ak@linux.intel.com>

Thanks.

-Andi

Patch

===
[PATCH] x86: Fix mce resume on 32bit

Calling mcheck_init() on resume is required only with CONFIG_X86_OLD_MCE=y.

Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
---
 arch/x86/power/cpu.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/x86/power/cpu.c b/arch/x86/power/cpu.c
index d277ef1..b3d20b9 100644
--- a/arch/x86/power/cpu.c
+++ b/arch/x86/power/cpu.c
@@ -244,7 +244,7 @@  static void __restore_processor_state(struct saved_context *ctxt)
 	do_fpu_end();
 	mtrr_ap_init();
 
-#ifdef CONFIG_X86_32
+#ifdef CONFIG_X86_OLD_MCE
 	mcheck_init(&boot_cpu_data);
 #endif
 }
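
The effect of changing the guard can be sketched in plain C. This is only an
illustration under stated assumptions: toy_mcheck_init(), toy_mce_resume(),
restore_before_patch(), restore_after_patch() and count_inits() are
hypothetical stand-ins, the build is assumed to be 32-bit with the new MCE
code selected (CONFIG_X86_OLD_MCE not set), and the resume hook of the new
MCE code is assumed to run on every resume.

```c
/* Assumed configuration: 32-bit build, new MCE code selected. */
#define CONFIG_X86_32 1
/* CONFIG_X86_OLD_MCE is deliberately left undefined. */

static int init_calls;

/* Hypothetical stand-ins: each one counts as one MCE initialization. */
static void toy_mcheck_init(void) { init_calls++; }
static void toy_mce_resume(void)  { init_calls++; }

/* Guard as it was before the patch: taken on every 32-bit build. */
static void restore_before_patch(void)
{
#ifdef CONFIG_X86_32
    toy_mcheck_init();
#endif
}

/* Guard as changed by the patch: taken only with the old MCE code. */
static void restore_after_patch(void)
{
#ifdef CONFIG_X86_OLD_MCE
    toy_mcheck_init();
#endif
}

/* Count MCE initializations performed during one simulated resume. */
int count_inits(int patched)
{
    init_calls = 0;
    if (patched)
        restore_after_patch();
    else
        restore_before_patch();
    toy_mce_resume();   /* assumed to run regardless of the guard */
    return init_calls;
}
```

Under these assumptions count_inits(0) yields two initializations per resume
and count_inits(1) yields one, which matches the doubled "mcp on cpu 0" output
in the dmesg quoted in the thread.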