
[RFC,v3] x86, hotplug: Use hlt instead of mwait if invoked from disable_nonboot_cpus

Message ID 209704957.TWElLMTLfP@vostro.rjw.lan (mailing list archive)
State Superseded, archived

Commit Message

Rafael J. Wysocki July 7, 2016, 12:33 a.m. UTC
On Tuesday, June 28, 2016 05:16:43 PM Chen Yu wrote:
> A stress test from Varun Koyyalagunta reports that a nonboot
> CPU would occasionally hang when resuming from hibernation.
> Further investigation shows that the hang happens at the point
> where the nonboot CPU has been woken up incorrectly and tries
> to monitor the mwait_ptr for the second time; an exception is
> then triggered due to an illegal vaddr access, e.g. something
> like 'Unable to handle kernel address of 0xffff8800ba800010...'
> 
> This exception is caused by accessing a page whose PRESENT flag
> is not set, because the pte entry for this vaddr is zero. Here
> is how the problem happens: the page table for the direct
> mapping is allocated dynamically by kernel_physical_mapping_init,
> so during resume it is possible that, while the boot CPU is
> writing pages back to their original addresses, it happens to
> write to the monitored mwait_ptr and thereby wakes up one of the
> nonboot CPUs. Since the page table currently used by that nonboot
> CPU might not be the same as the one used before hibernation, an
> exception might occur due to the inconsistent page table.
> 
> A first attempt to get rid of this problem was to change the
> monitor address from task.flags to the zero page, because no one
> writes data to the zero page. But there is still a problem,
> because of a ping-pong wake-up scenario in mwait_play_dead:
> 
> One possible implementation of a clflush is a read-invalidate
> snoop, which is what a store might look like, so a clflush might
> break the mwait.
> 
> 1. CPU1 waits at the zero page.
> 2. CPU2 clflushes the zero page, waking CPU1 up, then CPU2 waits
>    at the zero page.
> 3. CPU1 is woken up and invokes clflush on the zero page, thus
>    waking up CPU2 again.
> As a result, the nonboot CPUs never sleep for long.
> 
> So it would be better to monitor a different address for each
> nonboot CPU. However, since there is only one zero page, at most
> PAGE_SIZE/L1_CACHE_LINE CPUs can be accommodated, which is
> usually 64 on x86_64; apparently that is not enough for servers,
> so more zero pages would be required.
> 
> So choose a new solution, as Brian suggested: put the nonboot
> CPUs into hlt before resume, so that no memory is touched during
> suspend/resume. Theoretically there might still be a problem if
> some of the CPUs have already been put offline, but since that
> case is very rare and users can work around it, we do not deal
> with this special case in the kernel for now.
> 
> BTW, as James mentioned, he might want to make
> disable_nonboot_cpus arch-specific, so this patch might need a
> small change after that.
> 
> Comments and suggestions would be appreciated.
> 
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=106371
> Reported-and-tested-by: Varun Koyyalagunta <cpudebug@centtech.com>
> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
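
For reference, the "play dead" loop discussed above is mwait_play_dead() in
arch/x86/kernel/smpboot.c, where mwait_ptr points into the dying CPU's
thread_info flags (the "task.flags" address mentioned above). A simplified
sketch of its core loop, with the CPUID handling omitted, shows why any store
to the monitored cache line, or a clflush implemented as a read-invalidate
snoop, wakes the CPU and makes it touch memory again:

	/*
	 * Rough sketch of the mwait_play_dead() loop, not the literal
	 * kernel code.  The CPU arms MONITOR on mwait_ptr and sleeps in
	 * MWAIT; any access that looks like a store to that cache line
	 * wakes it, after which it loops and re-arms the monitor.
	 */
	while (1) {
		mb();
		clflush(mwait_ptr);		/* avoid a stale-line false wake-up */
		mb();
		__monitor(mwait_ptr, 0, 0);	/* arm the monitor on this line */
		mb();
		__mwait(eax, 0);		/* sleep until the line is touched */
	}

Once woken, the loop dereferences mwait_ptr again, which is exactly the access
that faults when the page tables have changed underneath it during image
restoration.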

Below is my sort of version of this (untested) and I did it this way, because
the issue is specific to resume from hibernation (the workaround need not be
applied anywhere else) and the hibernate_resume_nonboot_cpu_disable() thing may
be useful to arm64 too if I'm not mistaken (James?).

Actually, if arm64 uses it too, the __weak implementation can be dropped,
because it will be possible to make it depend on ARCH_HIBERNATION_HEADER
(x86 and arm64 are the only users of that).

Thanks,
Rafael


---
 arch/x86/include/asm/cpu.h |    2 ++
 arch/x86/kernel/smpboot.c  |    5 +++++
 arch/x86/power/cpu.c       |   19 +++++++++++++++++++
 kernel/power/hibernate.c   |    7 ++++++-
 kernel/power/power.h       |    2 ++
 5 files changed, 34 insertions(+), 1 deletion(-)



Comments

Chen Yu July 7, 2016, 2:50 a.m. UTC | #1
> -----Original Message-----
> From: Rafael J. Wysocki [mailto:rjw@rjwysocki.net]
> Sent: Thursday, July 07, 2016 8:33 AM
> To: Chen, Yu C; James Morse
> Cc: linux-pm@vger.kernel.org; Thomas Gleixner; H. Peter Anvin; Pavel Machek;
> Borislav Petkov; Peter Zijlstra; Ingo Molnar; Len Brown; x86@kernel.org;
> linux-kernel@vger.kernel.org
> Subject: Re: [PATCH][RFC v3] x86, hotplug: Use hlt instead of mwait if invoked
> from disable_nonboot_cpus
> 

> On Tuesday, June 28, 2016 05:16:43 PM Chen Yu wrote:
> > [changelog snipped; it is quoted in full above]
> 

> Below is my sort of version of this (untested) and I did it this way, because the
> issue is specific to resume from hibernation (the workaround need not be
> applied anywhere else) and the hibernate_resume_nonboot_cpu_disable()
> thing may be useful to arm64 too if I'm not mistaken (James?).


James might want a flag to distinguish whether it is called from suspend or
from resume in his arch-specific disable_nonboot_cpus()?

And this patch works on my Xeon.
Tested-by: Chen Yu <yu.c.chen@intel.com>


> 
> Actually, if arm64 uses it too, the __weak implementation can be dropped,
> because it will be possible to make it depend on ARCH_HIBERNATION_HEADER
> (x86 and arm64 are the only users of that).
> 
> Thanks,
> Rafael
>
James Morse July 7, 2016, 8:38 a.m. UTC | #2
Hi Rafael,

On 07/07/16 01:33, Rafael J. Wysocki wrote:
> Below is my sort of version of this (untested) and I did it this way, because
> the issue is specific to resume from hibernation (the workaround need not be
> applied anywhere else) and the hibernate_resume_nonboot_cpu_disable() thing may
> be useful to arm64 too if I'm not mistaken (James?).

Yes, we will always need to do something extra (based on data in the
arch_hibernation_header) to resume if CPU0 was offline, or if a kexec means we
no longer know which CPU the firmware will boot us on.


> Actually, if arm64 uses it too, the __weak implementation can be dropped,
> because it will be possible to make it depend on ARCH_HIBERNATION_HEADER
> (x86 and arm64 are the only users of that).

Heh, I avoided that as it felt too much like a hack!



Thanks,

James
Rafael J. Wysocki July 7, 2016, 12:25 p.m. UTC | #3
On Thursday, July 07, 2016 09:38:14 AM James Morse wrote:
> Hi Rafael,
> 
> On 07/07/16 01:33, Rafael J. Wysocki wrote:
> > Below is my sort of version of this (untested) and I did it this way, because
> > the issue is specific to resume from hibernation (the workaround need not be
> > applied anywhere else) and the hibernate_resume_nonboot_cpu_disable() thing may
> > be useful to arm64 too if I'm not mistaken (James?).
> 
> Yes, we will always need to do something extra (based on data in the
> arch_hibernation_header) to resume if CPU0 was offline, or kexec meant we no
> longer know which CPU the firmware will boot us on.
> 
> 
> > Actually, if arm64 uses it too, the __weak implementation can be dropped,
> > because it will be possible to make it depend on ARCH_HIBERNATION_HEADER
> > (x86 and arm64 are the only users of that).
> 
> Heh, I avoided that as it felt too much like a hack!

OK, let's do as follows, then.

I'll queue up this patch for 4.8 if people don't object.

Then you can implement hibernate_resume_nonboot_cpu_disable() as needed on arm64
and we'll drop the __weak thing next.

Since both users of ARCH_HIBERNATION_HEADER will have their own implementations
of hibernate_resume_nonboot_cpu_disable(), we can just make it a static inline
wrapper around disable_nonboot_cpus() if ARCH_HIBERNATION_HEADER is unset.

That actually makes sense, because when ARCH_HIBERNATION_HEADER is unset, then
(a) the layout of the kernel text and static data during image restoration must
be the same as before hibernation (in which case issues like the MWAIT/MONITOR one
on x86 simply cannot happen) and (b) the restore kernel is unable to handle any
differences between the current (ie. image restoration time) and pre-hibernation
configurations of the system.
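
A minimal sketch (hypothetical; not part of the patch below) of what that
could look like in kernel/power/power.h, once both architectures carry their
own implementations:

	#ifdef CONFIG_ARCH_HIBERNATION_HEADER
	/* x86 and arm64 each provide their own implementation. */
	extern int hibernate_resume_nonboot_cpu_disable(void);
	#else
	static inline int hibernate_resume_nonboot_cpu_disable(void)
	{
		return disable_nonboot_cpus();
	}
	#endif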

Thanks,
Rafael

James Morse July 7, 2016, 4:03 p.m. UTC | #4
Hi,

On 07/07/16 03:50, Chen, Yu C wrote:
>> From: Rafael J. Wysocki [mailto:rjw@rjwysocki.net]
>> Below is my sort of version of this (untested) and I did it this way, because the
>> issue is specific to resume from hibernation (the workaround need not be
>> applied anywhere else) and the hibernate_resume_nonboot_cpu_disable()
>> thing may be useful to arm64 too if I'm not mistaken (James?).
> 
> James might want a flag to distinguish whether it is from suspend or resume,
> in his arch-specific disabled_nonboot_cpus?

That isn't really necessary; we can work out whether it is hibernate or resume
based on whether we've read data out of the arch header. I added it in the
other series because it looked cleaner to pass the value in instead of
inferring it.


Thanks,

James

Patch

Index: linux-pm/kernel/power/hibernate.c
===================================================================
--- linux-pm.orig/kernel/power/hibernate.c
+++ linux-pm/kernel/power/hibernate.c
@@ -409,6 +409,11 @@  int hibernation_snapshot(int platform_mo
 	goto Close;
 }
 
+int __weak hibernate_resume_nonboot_cpu_disable(void)
+{
+	return disable_nonboot_cpus();
+}
+
 /**
  * resume_target_kernel - Restore system state from a hibernation image.
  * @platform_mode: Whether or not to use the platform driver.
@@ -433,7 +438,7 @@  static int resume_target_kernel(bool pla
 	if (error)
 		goto Cleanup;
 
-	error = disable_nonboot_cpus();
+	error = hibernate_resume_nonboot_cpu_disable();
 	if (error)
 		goto Enable_cpus;
 
Index: linux-pm/kernel/power/power.h
===================================================================
--- linux-pm.orig/kernel/power/power.h
+++ linux-pm/kernel/power/power.h
@@ -38,6 +38,8 @@  static inline char *check_image_kernel(s
 }
 #endif /* CONFIG_ARCH_HIBERNATION_HEADER */
 
+extern int hibernate_resume_nonboot_cpu_disable(void);
+
 /*
  * Keep some memory free so that I/O operations can succeed without paging
  * [Might this be more than 4 MB?]
Index: linux-pm/arch/x86/power/cpu.c
===================================================================
--- linux-pm.orig/arch/x86/power/cpu.c
+++ linux-pm/arch/x86/power/cpu.c
@@ -266,6 +266,25 @@  void notrace restore_processor_state(voi
 EXPORT_SYMBOL(restore_processor_state);
 #endif
 
+#if defined(CONFIG_HIBERNATION) && defined(CONFIG_HOTPLUG_CPU)
+int hibernate_resume_nonboot_cpu_disable(void)
+{
+	int ret;
+
+	/*
+	 * Ensure that MONITOR/MWAIT will not be used in the "play dead" loop
+	 * during image restoration, because it is likely that the monitored
+	 * address will be actually written to at that time and then the "dead"
+	 * CPU may start executing instructions from an image kernel's page
+	 * (and that may not be the "play dead" loop any more).
+	 */
+	force_hlt_play_dead = true;
+	ret = disable_nonboot_cpus();
+	force_hlt_play_dead = false;
+	return ret;
+}
+#endif
+
 /*
  * When bsp_check() is called in hibernate and suspend, cpu hotplug
  * is disabled already. So it's unnessary to handle race condition between
Index: linux-pm/arch/x86/kernel/smpboot.c
===================================================================
--- linux-pm.orig/arch/x86/kernel/smpboot.c
+++ linux-pm/arch/x86/kernel/smpboot.c
@@ -1441,6 +1441,8 @@  __init void prefill_possible_map(void)
 
 #ifdef CONFIG_HOTPLUG_CPU
 
+bool force_hlt_play_dead;
+
 static void remove_siblinginfo(int cpu)
 {
 	int sibling;
@@ -1642,6 +1644,9 @@  void native_play_dead(void)
 	play_dead_common();
 	tboot_shutdown(TB_SHUTDOWN_WFS);
 
+	if (force_hlt_play_dead)
+		hlt_play_dead();
+
 	mwait_play_dead();	/* Only returns on failure */
 	if (cpuidle_play_dead())
 		hlt_play_dead();
Index: linux-pm/arch/x86/include/asm/cpu.h
===================================================================
--- linux-pm.orig/arch/x86/include/asm/cpu.h
+++ linux-pm/arch/x86/include/asm/cpu.h
@@ -26,6 +26,8 @@  struct x86_cpu {
 };
 
 #ifdef CONFIG_HOTPLUG_CPU
+extern bool force_hlt_play_dead;
+
 extern int arch_register_cpu(int num);
 extern void arch_unregister_cpu(int);
 extern void start_cpu0(void);
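
For completeness, the hlt-based fallback that the patch forces during image
restoration is hlt_play_dead(), also in arch/x86/kernel/smpboot.c. A rough
sketch (the real function additionally handles waking up CPU0 via NMI) shows
the key property, namely that it spins in HLT without monitoring or touching
any memory:

	/* Rough sketch of hlt_play_dead(), not the literal kernel code. */
	if (__this_cpu_read(cpu_info.x86) >= 4)
		wbinvd();		/* write back and invalidate caches once */

	while (1)
		native_halt();		/* HLT; nothing here arms MONITOR/MWAIT */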