
[RFC,v3] x86, hotplug: Use hlt instead of mwait if invoked from disable_nonboot_cpus

Message ID 209704957.TWElLMTLfP@vostro.rjw.lan (mailing list archive)
State Superseded, archived

Commit Message

Rafael J. Wysocki July 7, 2016, 12:33 a.m. UTC
On Tuesday, June 28, 2016 05:16:43 PM Chen Yu wrote:
> A stress test from Varun Koyyalagunta reports that a nonboot
> CPU would occasionally hang when resuming from hibernation.
> Further investigation shows that the hang happens at the point
> where the nonboot CPU has been woken up incorrectly and tries
> to monitor the mwait_ptr for the second time; an exception is
> then triggered due to an illegal vaddr access, e.g. something
> like 'Unable to handle kernel address of 0xffff8800ba800010...'
> 
> This exception is caused by accessing a page whose PRESENT flag
> is not set, because the pte entry for this vaddr is zero. Here
> is how the problem happens: the page table for the direct
> mapping is allocated dynamically by kernel_physical_mapping_init,
> so during resume it is possible that, while the boot CPU is
> writing pages back to their original addresses, it happens to
> write to the monitored mwait_ptr and thereby wakes up one of the
> nonboot CPUs. Since the page table currently used by that nonboot
> CPU might not be the same as the one used before hibernation, an
> exception might occur due to the inconsistent page table.
> 
> A first attempt to get rid of this problem was to change the
> monitor address from task.flags to the zero page, because no one
> writes data to the zero page. But there is still a problem,
> because of a ping-pong wake-up scenario in mwait_play_dead:
> 
> One possible implementation of a clflush is a read-invalidate
> snoop, which is what a store might look like, so a clflush might
> break the mwait.
> 
> 1. CPU1 waits at the zero page.
> 2. CPU2 clflushes the zero page, waking CPU1 up, then CPU2 waits
>    at the zero page.
> 3. CPU1 is woken up and invokes clflush on the zero page, thus
>    waking up CPU2 again.
> As a result, the nonboot CPUs never sleep for long.
> 
> So it would be better to monitor a different address for each
> nonboot CPU. However, since there is only one zero page, at most
> PAGE_SIZE/L1_CACHE_LINE CPUs can be accommodated, which is
> usually 64 on x86_64; apparently that is not enough for servers,
> so more zero pages would be required.
> 
> So choose a new solution, as Brian suggested: put the nonboot
> CPUs into hlt before resume, so that no memory is touched during
> suspend/resume. Theoretically there might still be a problem if
> some of the CPUs have already been put offline, but since that
> case is very rare and users can work around it, we do not deal
> with this special case in the kernel for now.
> 
> BTW, as James mentioned, he might want to make
> disable_nonboot_cpus arch-specific, so this patch might need a
> small change after that.
> 
> Comments and suggestions would be appreciated.
> 
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=106371
> Reported-and-tested-by: Varun Koyyalagunta <cpudebug@centtech.com>
> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
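
For reference, the "play dead" loop discussed above is mwait_play_dead() in
arch/x86/kernel/smpboot.c, where mwait_ptr points into the dying CPU's
thread_info flags (the "task.flags" address mentioned above). A simplified
sketch of its core loop, with the CPUID handling omitted, shows why any store
to the monitored cache line, or a clflush implemented as a read-invalidate
snoop, wakes the CPU and makes it touch memory again:

	/*
	 * Rough sketch of the mwait_play_dead() loop, not the literal
	 * kernel code.  The CPU arms MONITOR on mwait_ptr and sleeps in
	 * MWAIT; any access that looks like a store to that cache line
	 * wakes it, after which it loops and re-arms the monitor.
	 */
	while (1) {
		mb();
		clflush(mwait_ptr);		/* avoid a stale-line false wake-up */
		mb();
		__monitor(mwait_ptr, 0, 0);	/* arm the monitor on this line */
		mb();
		__mwait(eax, 0);		/* sleep until the line is touched */
	}

Once woken, the loop dereferences mwait_ptr again, which is exactly the access
that faults when the page tables have changed underneath it during image
restoration.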

Below is my sort of version of this (untested) and I did it this way, because
the issue is specific to resume from hibernation (the workaround need not be
applied anywhere else) and the hibernate_resume_nonboot_cpu_disable() thing may
be useful to arm64 too if I'm not mistaken (James?).

Actually, if arm64 uses it too, the __weak implementation can be dropped,
because it will be possible to make it depend on ARCH_HIBERNATION_HEADER
(x86 and arm64 are the only users of that).

Thanks,
Rafael


---
 arch/x86/include/asm/cpu.h |    2 ++
 arch/x86/kernel/smpboot.c  |    5 +++++
 arch/x86/power/cpu.c       |   19 +++++++++++++++++++
 kernel/power/hibernate.c   |    7 ++++++-
 kernel/power/power.h       |    2 ++
 5 files changed, 34 insertions(+), 1 deletion(-)



Comments

Chen Yu July 7, 2016, 2:50 a.m. UTC | #1
> -----Original Message-----
> From: Rafael J. Wysocki [mailto:rjw@rjwysocki.net]
> Sent: Thursday, July 07, 2016 8:33 AM
> To: Chen, Yu C; James Morse
> Cc: linux-pm@vger.kernel.org; Thomas Gleixner; H. Peter Anvin; Pavel Machek;
> Borislav Petkov; Peter Zijlstra; Ingo Molnar; Len Brown; x86@kernel.org;
> linux-kernel@vger.kernel.org
> Subject: Re: [PATCH][RFC v3] x86, hotplug: Use hlt instead of mwait if invoked
> from disable_nonboot_cpus
> 

> On Tuesday, June 28, 2016 05:16:43 PM Chen Yu wrote:
> > [changelog snipped; it is quoted in full above]
> 

> Below is my sort of version of this (untested) and I did it this way, because the
> issue is specific to resume from hibernation (the workaround need not be
> applied anywhere else) and the hibernate_resume_nonboot_cpu_disable()
> thing may be useful to arm64 too if I'm not mistaken (James?).


James might want a flag to distinguish whether it is called from suspend or
from resume in his arch-specific disable_nonboot_cpus()?

And this patch works on my Xeon.
Tested-by: Chen Yu <yu.c.chen@intel.com>


> 
> Actually, if arm64 uses it too, the __weak implementation can be dropped,
> because it will be possible to make it depend on ARCH_HIBERNATION_HEADER
> (x86 and arm64 are the only users of that).
> 
> Thanks,
> Rafael
>
James Morse July 7, 2016, 8:38 a.m. UTC | #2
Hi Rafael,

On 07/07/16 01:33, Rafael J. Wysocki wrote:
> Below is my sort of version of this (untested) and I did it this way, because
> the issue is specific to resume from hibernation (the workaround need not be
> applied anywhere else) and the hibernate_resume_nonboot_cpu_disable() thing may
> be useful to arm64 too if I'm not mistaken (James?).

Yes, we will always need to do something extra (based on data in the
arch_hibernation_header) to resume if CPU0 was offline, or if a kexec means we
no longer know which CPU the firmware will boot us on.


> Actually, if arm64 uses it too, the __weak implementation can be dropped,
> because it will be possible to make it depend on ARCH_HIBERNATION_HEADER
> (x86 and arm64 are the only users of that).

Heh, I avoided that as it felt too much like a hack!



Thanks,

James
Rafael J. Wysocki July 7, 2016, 12:25 p.m. UTC | #3
On Thursday, July 07, 2016 09:38:14 AM James Morse wrote:
> Hi Rafael,
> 
> On 07/07/16 01:33, Rafael J. Wysocki wrote:
> > Below is my sort of version of this (untested) and I did it this way, because
> > the issue is specific to resume from hibernation (the workaround need not be
> > applied anywhere else) and the hibernate_resume_nonboot_cpu_disable() thing may
> > be useful to arm64 too if I'm not mistaken (James?).
> 
> Yes, we will always need to do something extra (based on data in the
> arch_hibernation_header) to resume if CPU0 was offline, or kexec meant we no
> longer know which CPU the firmware will boot us on.
> 
> 
> > Actually, if arm64 uses it too, the __weak implementation can be dropped,
> > because it will be possible to make it depend on ARCH_HIBERNATION_HEADER
> > (x86 and arm64 are the only users of that).
> 
> Heh, I avoided that as it felt too much like a hack!

OK, let's do as follows, then.

I'll queue up this patch for 4.8 if people don't object.

Then you can implement hibernate_resume_nonboot_cpu_disable() as needed on arm64
and we'll drop the __weak thing next.

Since both users of ARCH_HIBERNATION_HEADER will have their own implementations
of hibernate_resume_nonboot_cpu_disable(), we can just make it a static inline
wrapper around disable_nonboot_cpus() if ARCH_HIBERNATION_HEADER is unset.

That actually makes sense, because when ARCH_HIBERNATION_HEADER is unset, then
(a) the layout of the kernel text and static data during image restoration must
be the same as before hibernation (in which case issues like the MWAIT/MONITOR one
on x86 simply cannot happen) and (b) the restore kernel is unable to handle any
differences between the current (ie. image restoration time) and pre-hibernation
configurations of the system.
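
A minimal sketch (hypothetical; not part of the patch below) of what that
could look like in kernel/power/power.h, once both architectures carry their
own implementations:

	#ifdef CONFIG_ARCH_HIBERNATION_HEADER
	/* x86 and arm64 each provide their own implementation. */
	extern int hibernate_resume_nonboot_cpu_disable(void);
	#else
	static inline int hibernate_resume_nonboot_cpu_disable(void)
	{
		return disable_nonboot_cpus();
	}
	#endif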

Thanks,
Rafael

James Morse July 7, 2016, 4:03 p.m. UTC | #4
Hi,

On 07/07/16 03:50, Chen, Yu C wrote:
>> From: Rafael J. Wysocki [mailto:rjw@rjwysocki.net]
>> Below is my sort of version of this (untested) and I did it this way, because the
>> issue is specific to resume from hibernation (the workaround need not be
>> applied anywhere else) and the hibernate_resume_nonboot_cpu_disable()
>> thing may be useful to arm64 too if I'm not mistaken (James?).
> 
> James might want a flag to distinguish whether it is from suspend or resume,
> in his arch-specific disabled_nonboot_cpus?

That isn't really necessary; we can work out whether it is hibernate or resume
based on whether we've read data out of the arch header. I added it in the
other series because it looked cleaner to pass the value in instead of
inferring it.


Thanks,

James

Patch

Index: linux-pm/kernel/power/hibernate.c
===================================================================
--- linux-pm.orig/kernel/power/hibernate.c
+++ linux-pm/kernel/power/hibernate.c
@@ -409,6 +409,11 @@  int hibernation_snapshot(int platform_mo
 	goto Close;
 }
 
+int __weak hibernate_resume_nonboot_cpu_disable(void)
+{
+	return disable_nonboot_cpus();
+}
+
 /**
  * resume_target_kernel - Restore system state from a hibernation image.
  * @platform_mode: Whether or not to use the platform driver.
@@ -433,7 +438,7 @@  static int resume_target_kernel(bool pla
 	if (error)
 		goto Cleanup;
 
-	error = disable_nonboot_cpus();
+	error = hibernate_resume_nonboot_cpu_disable();
 	if (error)
 		goto Enable_cpus;
 
Index: linux-pm/kernel/power/power.h
===================================================================
--- linux-pm.orig/kernel/power/power.h
+++ linux-pm/kernel/power/power.h
@@ -38,6 +38,8 @@  static inline char *check_image_kernel(s
 }
 #endif /* CONFIG_ARCH_HIBERNATION_HEADER */
 
+extern int hibernate_resume_nonboot_cpu_disable(void);
+
 /*
  * Keep some memory free so that I/O operations can succeed without paging
  * [Might this be more than 4 MB?]
Index: linux-pm/arch/x86/power/cpu.c
===================================================================
--- linux-pm.orig/arch/x86/power/cpu.c
+++ linux-pm/arch/x86/power/cpu.c
@@ -266,6 +266,25 @@  void notrace restore_processor_state(voi
 EXPORT_SYMBOL(restore_processor_state);
 #endif
 
+#if defined(CONFIG_HIBERNATION) && defined(CONFIG_HOTPLUG_CPU)
+int hibernate_resume_nonboot_cpu_disable(void)
+{
+	int ret;
+
+	/*
+	 * Ensure that MONITOR/MWAIT will not be used in the "play dead" loop
+	 * during image restoration, because it is likely that the monitored
+	 * address will be actually written to at that time and then the "dead"
+	 * CPU may start executing instructions from an image kernel's page
+	 * (and that may not be the "play dead" loop any more).
+	 */
+	force_hlt_play_dead = true;
+	ret = disable_nonboot_cpus();
+	force_hlt_play_dead = false;
+	return ret;
+}
+#endif
+
 /*
  * When bsp_check() is called in hibernate and suspend, cpu hotplug
  * is disabled already. So it's unnessary to handle race condition between
Index: linux-pm/arch/x86/kernel/smpboot.c
===================================================================
--- linux-pm.orig/arch/x86/kernel/smpboot.c
+++ linux-pm/arch/x86/kernel/smpboot.c
@@ -1441,6 +1441,8 @@  __init void prefill_possible_map(void)
 
 #ifdef CONFIG_HOTPLUG_CPU
 
+bool force_hlt_play_dead;
+
 static void remove_siblinginfo(int cpu)
 {
 	int sibling;
@@ -1642,6 +1644,9 @@  void native_play_dead(void)
 	play_dead_common();
 	tboot_shutdown(TB_SHUTDOWN_WFS);
 
+	if (force_hlt_play_dead)
+		hlt_play_dead();
+
 	mwait_play_dead();	/* Only returns on failure */
 	if (cpuidle_play_dead())
 		hlt_play_dead();
Index: linux-pm/arch/x86/include/asm/cpu.h
===================================================================
--- linux-pm.orig/arch/x86/include/asm/cpu.h
+++ linux-pm/arch/x86/include/asm/cpu.h
@@ -26,6 +26,8 @@  struct x86_cpu {
 };
 
 #ifdef CONFIG_HOTPLUG_CPU
+extern bool force_hlt_play_dead;
+
 extern int arch_register_cpu(int num);
 extern void arch_unregister_cpu(int);
 extern void start_cpu0(void);
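
For completeness, the hlt-based fallback that the patch forces during image
restoration is hlt_play_dead(), also in arch/x86/kernel/smpboot.c. A rough
sketch (the real function additionally handles waking up CPU0 via NMI) shows
the key property, namely that it spins in HLT without monitoring or touching
any memory:

	/* Rough sketch of hlt_play_dead(), not the literal kernel code. */
	if (__this_cpu_read(cpu_info.x86) >= 4)
		wbinvd();		/* write back and invalidate caches once */

	while (1)
		native_halt();		/* HLT; nothing here arms MONITOR/MWAIT */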