
[v4] x86/power: Fix 'nosmt' vs. hibernation triple fault during resume

Message ID nycvar.YFH.7.76.1905300007470.1962@cbobk.fhfr.pm (mailing list archive)
State Accepted, archived
Delegated to: Rafael Wysocki
Series [v4] x86/power: Fix 'nosmt' vs. hibernation triple fault during resume

Commit Message

Jiri Kosina May 29, 2019, 10:09 p.m. UTC
From: Jiri Kosina <jkosina@suse.cz>

As explained in

	0cc3cd21657b ("cpu/hotplug: Boot HT siblings at least once")

we always, no matter what, have to bring up x86 HT siblings during boot at
least once in order to avoid the first MCE bringing the system to its knees.

That means that whenever 'nosmt' is supplied on the kernel command line,
all the HT siblings are as a result sitting in mwait or cpuidle after
going through the online-offline cycle at least once.

This causes a serious issue though when a kernel which saw 'nosmt' on its
command line is about to resume from hibernation: if the resume from the
hibernated image is successful, cr3 is flipped in order to point to the
address space of the kernel that is being resumed, which in turn means
that all the HT siblings are all of a sudden mwaiting on an address which
is no longer valid.

That results in a triple fault shortly after cr3 is switched, and the
machine reboots.

Fix this by always waking up all the SMT siblings before initiating the
'restore from hibernation' process; this guarantees that all the HT
siblings will be properly carried over to the resumed kernel waiting in
resume_play_dead(), and acted upon accordingly afterwards, based on the
target kernel configuration.
Symmetrically, the resumed kernel has to push the SMT siblings to mwait
again in case it has SMT disabled; this means it has to online all
the siblings when resuming (so that they come out of hlt) and offline
them again to let them reach mwait.

Cc: stable@vger.kernel.org # v4.19+
Debugged-by: Thomas Gleixner <tglx@linutronix.de>
Fixes: 0cc3cd21657b ("cpu/hotplug: Boot HT siblings at least once")
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
---

v1 -> v2:
        - restructure error handling as suggested by peterz
        - add Rafael's ack

v2 -> v3:
        - added extra online/offline dance for nosmt case during
          resume, as we want the siblings to be in mwait, not hlt
        - dropped peterz's and Rafael's acks for now due to the above

v3 -> v4:
	- fix undefined return value from arch_resume_nosmt() in case
	  it's not overridden by arch

 arch/x86/power/cpu.c       | 10 ++++++++++
 arch/x86/power/hibernate.c | 33 +++++++++++++++++++++++++++++++++
 include/linux/cpu.h        |  4 ++++
 kernel/cpu.c               |  4 ++--
 kernel/power/hibernate.c   |  9 +++++++++
 5 files changed, 58 insertions(+), 2 deletions(-)

Comments

Rafael J. Wysocki May 30, 2019, 8:46 a.m. UTC | #1
On Thu, May 30, 2019 at 12:09 AM Jiri Kosina <jikos@kernel.org> wrote:
>
> From: Jiri Kosina <jkosina@suse.cz>
>
> As explained in
>
>         0cc3cd21657b ("cpu/hotplug: Boot HT siblings at least once")
>
> we always, no matter what, have to bring up x86 HT siblings during boot at
> least once in order to avoid the first MCE bringing the system to its knees.
>
> That means that whenever 'nosmt' is supplied on the kernel command line,
> all the HT siblings are as a result sitting in mwait or cpuidle after
> going through the online-offline cycle at least once.
>
> This causes a serious issue though when a kernel which saw 'nosmt' on its
> command line is about to resume from hibernation: if the resume from the
> hibernated image is successful, cr3 is flipped in order to point to the
> address space of the kernel that is being resumed, which in turn means
> that all the HT siblings are all of a sudden mwaiting on an address which
> is no longer valid.
>
> That results in a triple fault shortly after cr3 is switched, and the
> machine reboots.
>
> Fix this by always waking up all the SMT siblings before initiating the
> 'restore from hibernation' process; this guarantees that all the HT
> siblings will be properly carried over to the resumed kernel waiting in
> resume_play_dead(), and acted upon accordingly afterwards, based on the
> target kernel configuration.
> Symmetrically, the resumed kernel has to push the SMT siblings to mwait
> again in case it has SMT disabled; this means it has to online all
> the siblings when resuming (so that they come out of hlt) and offline
> them again to let them reach mwait.
>
> Cc: stable@vger.kernel.org # v4.19+
> Debugged-by: Thomas Gleixner <tglx@linutronix.de>
> Fixes: 0cc3cd21657b ("cpu/hotplug: Boot HT siblings at least once")
> Signed-off-by: Jiri Kosina <jkosina@suse.cz>

LGTM

And I would prefer this one to go in through the PM tree due to the
hibernate core changes,
so can I get an ACK from the x86 arch side here, please?

> ---
>
> v1 -> v2:
>         - restructure error handling as suggested by peterz
>         - add Rafael's ack
>
> v2 -> v3:
>         - added extra online/offline dance for nosmt case during
>           resume, as we want the siblings to be in mwait, not hlt
>         - dropped peterz's and Rafael's acks for now due to the above
>
> v3 -> v4:
>         - fix undefined return value from arch_resume_nosmt() in case
>           it's not overridden by arch
>
>  arch/x86/power/cpu.c       | 10 ++++++++++
>  arch/x86/power/hibernate.c | 33 +++++++++++++++++++++++++++++++++
>  include/linux/cpu.h        |  4 ++++
>  kernel/cpu.c               |  4 ++--
>  kernel/power/hibernate.c   |  9 +++++++++
>  5 files changed, 58 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/power/cpu.c b/arch/x86/power/cpu.c
> index a7d966964c6f..513ce09e9950 100644
> --- a/arch/x86/power/cpu.c
> +++ b/arch/x86/power/cpu.c
> @@ -299,7 +299,17 @@ int hibernate_resume_nonboot_cpu_disable(void)
>          * address in its instruction pointer may not be possible to resolve
>          * any more at that point (the page tables used by it previously may
>          * have been overwritten by hibernate image data).
> +        *
> +        * First, make sure that we wake up all the potentially disabled SMT
> +        * threads which have been initially brought up and then put into
> +        * mwait/cpuidle sleep.
> +        * Those will be put to proper (not interfering with hibernation
> +        * resume) sleep afterwards, and the resumed kernel will decide itself
> +        * what to do with them.
>          */
> +       ret = cpuhp_smt_enable();
> +       if (ret)
> +               return ret;
>         smp_ops.play_dead = resume_play_dead;
>         ret = disable_nonboot_cpus();
>         smp_ops.play_dead = play_dead;
> diff --git a/arch/x86/power/hibernate.c b/arch/x86/power/hibernate.c
> index 4845b8c7be7f..fc413717a45f 100644
> --- a/arch/x86/power/hibernate.c
> +++ b/arch/x86/power/hibernate.c
> @@ -11,6 +11,7 @@
>  #include <linux/suspend.h>
>  #include <linux/scatterlist.h>
>  #include <linux/kdebug.h>
> +#include <linux/cpu.h>
>
>  #include <crypto/hash.h>
>
> @@ -245,3 +246,35 @@ int relocate_restore_code(void)
>         __flush_tlb_all();
>         return 0;
>  }
> +
> +int arch_resume_nosmt(void)
> +{
> +       int ret = 0;
> +       /*
> +        * We reached this while coming out of hibernation. This means
> +        * that SMT siblings are sleeping in hlt, as mwait is not safe
> +        * against control transition during resume (see comment in
> +        * hibernate_resume_nonboot_cpu_disable()).
> +        *
> +        * If the resumed kernel has SMT disabled, we have to take all the
> +        * SMT siblings out of hlt, and offline them again so that they
> +        * end up in mwait proper.
> +        *
> +        * Called with hotplug disabled.
> +        */
> +       cpu_hotplug_enable();
> +       if (cpu_smt_control == CPU_SMT_DISABLED ||
> +                       cpu_smt_control == CPU_SMT_FORCE_DISABLED) {
> +               enum cpuhp_smt_control old = cpu_smt_control;
> +
> +               ret = cpuhp_smt_enable();
> +               if (ret)
> +                       goto out;
> +               ret = cpuhp_smt_disable(old);
> +               if (ret)
> +                       goto out;
> +       }
> +out:
> +       cpu_hotplug_disable();
> +       return ret;
> +}
> diff --git a/include/linux/cpu.h b/include/linux/cpu.h
> index 3813fe45effd..fcb1386bb0d4 100644
> --- a/include/linux/cpu.h
> +++ b/include/linux/cpu.h
> @@ -201,10 +201,14 @@ enum cpuhp_smt_control {
>  extern enum cpuhp_smt_control cpu_smt_control;
>  extern void cpu_smt_disable(bool force);
>  extern void cpu_smt_check_topology(void);
> +extern int cpuhp_smt_enable(void);
> +extern int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval);
>  #else
>  # define cpu_smt_control               (CPU_SMT_NOT_IMPLEMENTED)
>  static inline void cpu_smt_disable(bool force) { }
>  static inline void cpu_smt_check_topology(void) { }
> +static inline int cpuhp_smt_enable(void) { return 0; }
> +static inline int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval) { return 0; }
>  #endif
>
>  /*
> diff --git a/kernel/cpu.c b/kernel/cpu.c
> index f2ef10460698..077fde6fb953 100644
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -2061,7 +2061,7 @@ static void cpuhp_online_cpu_device(unsigned int cpu)
>         kobject_uevent(&dev->kobj, KOBJ_ONLINE);
>  }
>
> -static int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval)
> +int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval)
>  {
>         int cpu, ret = 0;
>
> @@ -2093,7 +2093,7 @@ static int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval)
>         return ret;
>  }
>
> -static int cpuhp_smt_enable(void)
> +int cpuhp_smt_enable(void)
>  {
>         int cpu, ret = 0;
>
> diff --git a/kernel/power/hibernate.c b/kernel/power/hibernate.c
> index c8c272df7154..b65635753e8e 100644
> --- a/kernel/power/hibernate.c
> +++ b/kernel/power/hibernate.c
> @@ -257,6 +257,11 @@ void swsusp_show_speed(ktime_t start, ktime_t stop,
>                 (kps % 1000) / 10);
>  }
>
> +__weak int arch_resume_nosmt(void)
> +{
> +       return 0;
> +}
> +
>  /**
>   * create_image - Create a hibernation image.
>   * @platform_mode: Whether or not to use the platform driver.
> @@ -324,6 +329,10 @@ static int create_image(int platform_mode)
>   Enable_cpus:
>         suspend_enable_secondary_cpus();
>
> +       /* Allow architectures to do nosmt-specific post-resume dances */
> +       if (!in_suspend)
> +               error = arch_resume_nosmt();
> +
>   Platform_finish:
>         platform_finish(platform_mode);
>
>
> --
> Jiri Kosina
> SUSE Labs
>
Pavel Machek May 30, 2019, 10:47 a.m. UTC | #2
On Thu 2019-05-30 00:09:39, Jiri Kosina wrote:
> From: Jiri Kosina <jkosina@suse.cz>
> 
> As explained in
> 
> 	0cc3cd21657b ("cpu/hotplug: Boot HT siblings at least once")
> 
> we always, no matter what, have to bring up x86 HT siblings during boot at
> least once in order to avoid the first MCE bringing the system to its knees.
> 
> That means that whenever 'nosmt' is supplied on the kernel command line,
> all the HT siblings are as a result sitting in mwait or cpuidle after
> going through the online-offline cycle at least once.
> 
> This causes a serious issue though when a kernel which saw 'nosmt' on its
> command line is about to resume from hibernation: if the resume from the
> hibernated image is successful, cr3 is flipped in order to point to the
> address space of the kernel that is being resumed, which in turn means
> that all the HT siblings are all of a sudden mwaiting on an address which
> is no longer valid.
> 
> That results in a triple fault shortly after cr3 is switched, and the
> machine reboots.
> 
> Fix this by always waking up all the SMT siblings before initiating the
> 'restore from hibernation' process; this guarantees that all the HT
> siblings will be properly carried over to the resumed kernel waiting in
> resume_play_dead(), and acted upon accordingly afterwards, based on the
> target kernel configuration.
> Symmetrically, the resumed kernel has to push the SMT siblings to mwait
> again in case it has SMT disabled; this means it has to online all
> the siblings when resuming (so that they come out of hlt) and offline
> them again to let them reach mwait.
> 
> Cc: stable@vger.kernel.org # v4.19+
> Debugged-by: Thomas Gleixner <tglx@linutronix.de>
> Fixes: 0cc3cd21657b ("cpu/hotplug: Boot HT siblings at least once")
> Signed-off-by: Jiri Kosina <jkosina@suse.cz>

Acked-by: Pavel Machek <pavel@ucw.cz>
Thomas Gleixner May 30, 2019, 9:27 p.m. UTC | #3
On Thu, 30 May 2019, Rafael J. Wysocki wrote:
> >
> > Cc: stable@vger.kernel.org # v4.19+
> > Debugged-by: Thomas Gleixner <tglx@linutronix.de>
> > Fixes: 0cc3cd21657b ("cpu/hotplug: Boot HT siblings at least once")
> > Signed-off-by: Jiri Kosina <jkosina@suse.cz>
> 
> LGTM
> 
> And I would prefer this one to go in through the PM tree due to the
> hibernate core changes,

Ok.

> so can I get an ACK from the x86 arch side here, please?

No. Is the following good enough?

    Reviewed-by: Thomas Gleixner <tglx@linutronix.de>

Thanks,

	tglx
Rafael J. Wysocki May 30, 2019, 9:38 p.m. UTC | #4
On Thu, May 30, 2019 at 11:27 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Thu, 30 May 2019, Rafael J. Wysocki wrote:
> > >
> > > Cc: stable@vger.kernel.org # v4.19+
> > > Debugged-by: Thomas Gleixner <tglx@linutronix.de>
> > > Fixes: 0cc3cd21657b ("cpu/hotplug: Boot HT siblings at least once")
> > > Signed-off-by: Jiri Kosina <jkosina@suse.cz>
> >
> > LGTM
> >
> > And I would prefer this one to go in through the PM tree due to the
> > hibernate core changes,
>
> Ok.
>
> > so can I get an ACK from the x86 arch side here, please?
>
> No. Is the following good enough?
>
>     Reviewed-by: Thomas Gleixner <tglx@linutronix.de>

Yes, it is, thanks!
Josh Poimboeuf May 30, 2019, 11:38 p.m. UTC | #5
On Thu, May 30, 2019 at 11:38:51PM +0200, Rafael J. Wysocki wrote:
> On Thu, May 30, 2019 at 11:27 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> >
> > On Thu, 30 May 2019, Rafael J. Wysocki wrote:
> > > >
> > > > Cc: stable@vger.kernel.org # v4.19+
> > > > Debugged-by: Thomas Gleixner <tglx@linutronix.de>
> > > > Fixes: 0cc3cd21657b ("cpu/hotplug: Boot HT siblings at least once")
> > > > Signed-off-by: Jiri Kosina <jkosina@suse.cz>
> > >
> > > LGTM
> > >
> > > And I would prefer this one to go in through the PM tree due to the
> > > hibernate core changes,
> >
> > Ok.
> >
> > > so can I get an ACK from the x86 arch side here, please?
> >
> > No. Is the following good enough?
> >
> >     Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
> 
> Yes, it is, thanks!

I still think changing monitor/mwait to use a fixmap address would be a
much cleaner way to fix this.  I can try to work up a patch tomorrow.
Jiri Kosina May 30, 2019, 11:42 p.m. UTC | #6
On Thu, 30 May 2019, Josh Poimboeuf wrote:

> > >     Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
> > 
> > Yes, it is, thanks!
> 
> I still think changing monitor/mwait to use a fixmap address would be a
> much cleaner way to fix this.  I can try to work up a patch tomorrow.

I disagree with that from the backwards compatibility point of view.

I personally am quite frequently using different combinations of 
resumer/resumee kernels, and I've never been bitten by it so far. I'd guess 
I am not the only one.
Fixmap sort of breaks that invariant.

Thanks,
Josh Poimboeuf May 31, 2019, 5:14 a.m. UTC | #7
On Fri, May 31, 2019 at 01:42:02AM +0200, Jiri Kosina wrote:
> On Thu, 30 May 2019, Josh Poimboeuf wrote:
> 
> > > >     Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
> > > 
> > > Yes, it is, thanks!
> > 
> > I still think changing monitor/mwait to use a fixmap address would be a
> > much cleaner way to fix this.  I can try to work up a patch tomorrow.
> 
> I disagree with that from the backwards compatibility point of view.
> 
> > I personally am quite frequently using different combinations of 
> > resumer/resumee kernels, and I've never been bitten by it so far. I'd guess 
> I am not the only one.
> Fixmap sort of breaks that invariant.

Right now there is no backwards compatibility because nosmt resume is
already broken.

For "future" backwards compatibility we could just define a hard-coded
reserved fixmap page address, adjacent to the vsyscall reserved address.

Something like this (not yet tested)?  Maybe we could also remove the
resume_play_dead() hack?

diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h
index 9da8cccdf3fb..1c328624162c 100644
--- a/arch/x86/include/asm/fixmap.h
+++ b/arch/x86/include/asm/fixmap.h
@@ -80,6 +80,7 @@ enum fixed_addresses {
 #ifdef CONFIG_X86_VSYSCALL_EMULATION
 	VSYSCALL_PAGE = (FIXADDR_TOP - VSYSCALL_ADDR) >> PAGE_SHIFT,
 #endif
+	FIX_MWAIT = (FIXADDR_TOP - VSYSCALL_ADDR - 1) >> PAGE_SHIFT,
 #endif
 	FIX_DBGP_BASE,
 	FIX_EARLYCON_MEM_BASE,
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 73e69aaaa117..9804fbe25d03 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -108,6 +108,8 @@ int __read_mostly __max_smt_threads = 1;
 /* Flag to indicate if a complete sched domain rebuild is required */
 bool x86_topology_update;
 
+static char __mwait_page[PAGE_SIZE];
+
 int arch_update_cpu_topology(void)
 {
 	int retval = x86_topology_update;
@@ -1319,6 +1321,8 @@ void __init native_smp_prepare_cpus(unsigned int max_cpus)
 	smp_quirk_init_udelay();
 
 	speculative_store_bypass_ht_init();
+
+	set_fixmap(FIX_MWAIT, __pa_symbol(&__mwait_page));
 }
 
 void arch_enable_nonboot_cpus_begin(void)
@@ -1631,11 +1635,12 @@ static inline void mwait_play_dead(void)
 	}
 
 	/*
-	 * This should be a memory location in a cache line which is
-	 * unlikely to be touched by other processors.  The actual
-	 * content is immaterial as it is not actually modified in any way.
+	 * This memory location is never actually written to.  It's mapped at a
+	 * reserved fixmap address to ensure the monitored address remains
+	 * valid across a hibernation resume operation.  Otherwise a triple
+	 * fault can occur.
 	 */
-	mwait_ptr = &current_thread_info()->flags;
+	mwait_ptr = (void *)fix_to_virt(FIX_MWAIT);
 
 	wbinvd();
Rafael J. Wysocki May 31, 2019, 8:26 a.m. UTC | #8
On Friday, May 31, 2019 7:14:56 AM CEST Josh Poimboeuf wrote:
> On Fri, May 31, 2019 at 01:42:02AM +0200, Jiri Kosina wrote:
> > On Thu, 30 May 2019, Josh Poimboeuf wrote:
> > 
> > > > >     Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
> > > > 
> > > > Yes, it is, thanks!
> > > 
> > > I still think changing monitor/mwait to use a fixmap address would be a
> > > much cleaner way to fix this.  I can try to work up a patch tomorrow.
> > 
> > I disagree with that from the backwards compatibility point of view.
> > 
> > I personally am quite frequently using different combinations of 
> > resumer/resumee kernels, and I've never been bitten by it so far. I'd guess 
> > I am not the only one.
> > Fixmap sort of breaks that invariant.
> 
> Right now there is no backwards compatibility because nosmt resume is
> already broken.
> 
> For "future" backwards compatibility we could just define a hard-coded
> reserved fixmap page address, adjacent to the vsyscall reserved address.
> 
> Something like this (not yet tested)?  Maybe we could also remove the
> resume_play_dead() hack?

Yes, we can IMO, but in a separate patch, please.

> diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h
> index 9da8cccdf3fb..1c328624162c 100644
> --- a/arch/x86/include/asm/fixmap.h
> +++ b/arch/x86/include/asm/fixmap.h
> @@ -80,6 +80,7 @@ enum fixed_addresses {
>  #ifdef CONFIG_X86_VSYSCALL_EMULATION
>  	VSYSCALL_PAGE = (FIXADDR_TOP - VSYSCALL_ADDR) >> PAGE_SHIFT,
>  #endif
> +	FIX_MWAIT = (FIXADDR_TOP - VSYSCALL_ADDR - 1) >> PAGE_SHIFT,
>  #endif
>  	FIX_DBGP_BASE,
>  	FIX_EARLYCON_MEM_BASE,
> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> index 73e69aaaa117..9804fbe25d03 100644
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -108,6 +108,8 @@ int __read_mostly __max_smt_threads = 1;
>  /* Flag to indicate if a complete sched domain rebuild is required */
>  bool x86_topology_update;
>  
> +static char __mwait_page[PAGE_SIZE];
> +
>  int arch_update_cpu_topology(void)
>  {
>  	int retval = x86_topology_update;
> @@ -1319,6 +1321,8 @@ void __init native_smp_prepare_cpus(unsigned int max_cpus)
>  	smp_quirk_init_udelay();
>  
>  	speculative_store_bypass_ht_init();
> +
> +	set_fixmap(FIX_MWAIT, __pa_symbol(&__mwait_page));
>  }
>  
>  void arch_enable_nonboot_cpus_begin(void)
> @@ -1631,11 +1635,12 @@ static inline void mwait_play_dead(void)
>  	}
>  
>  	/*
> -	 * This should be a memory location in a cache line which is
> -	 * unlikely to be touched by other processors.  The actual
> -	 * content is immaterial as it is not actually modified in any way.
> +	 * This memory location is never actually written to.  It's mapped at a
> +	 * reserved fixmap address to ensure the monitored address remains
> +	 * valid across a hibernation resume operation.  Otherwise a triple
> +	 * fault can occur.
>  	 */
> -	mwait_ptr = &current_thread_info()->flags;
> +	mwait_ptr = (void *)fix_to_virt(FIX_MWAIT);
>  
>  	wbinvd();
>  
> 

Jiri, any chance to test this?
Jiri Kosina May 31, 2019, 8:47 a.m. UTC | #9
On Fri, 31 May 2019, Josh Poimboeuf wrote:

> > I disagree with that from the backwards compatibility point of view.
> > 
> > I personally am quite frequently using different combinations of 
> > resumer/resumee kernels, and I've never been bitten by it so far. I'd guess 
> > I am not the only one.
> > Fixmap sort of breaks that invariant.
> 
> Right now there is no backwards compatibility because nosmt resume is
> already broken.

Yeah, well, but that's "only" for nosmt kernels at least.

> For "future" backwards compatibility we could just define a hard-coded 
> reserved fixmap page address, adjacent to the vsyscall reserved address.
> 
> Something like this (not yet tested)?  Maybe we could also remove the
> resume_play_dead() hack?

Does it also solve the cpuidle case? I have no overview of what all the cpuidle 
drivers might potentially be doing in their ->enter_dead() callbacks. 
Rafael?

Thanks,
Rafael J. Wysocki May 31, 2019, 8:57 a.m. UTC | #10
On Friday, May 31, 2019 10:47:21 AM CEST Jiri Kosina wrote:
> On Fri, 31 May 2019, Josh Poimboeuf wrote:
> 
> > > I disagree with that from the backwards compatibility point of view.
> > > 
> > > I personally am quite frequently using different combinations of 
> > > resumer/resumee kernels, and I've never been bitten by it so far. I'd guess 
> > > I am not the only one.
> > > Fixmap sort of breaks that invariant.
> > 
> > Right now there is no backwards compatibility because nosmt resume is
> > already broken.
> 
> Yeah, well, but that's "only" for nosmt kernels at least.
> 
> > For "future" backwards compatibility we could just define a hard-coded 
> > reserved fixmap page address, adjacent to the vsyscall reserved address.
> > 
> > Something like this (not yet tested)?  Maybe we could also remove the
> > resume_play_dead() hack?
> 
> Does it also solve cpuidle case? I have no overview what all the cpuidle 
> drivers might be potentially doing in their ->enter_dead() callbacks. 
> Rafael?

There are just two of them, ACPI cpuidle and intel_idle, and they both should
be covered.

In any case, I think that this is the way to go here even though it may be somewhat
problematic to start with.

Cheers,
Rafael
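
For readers not familiar with the hook being discussed: ->enter_dead() is the
cpuidle callback used when a CPU is being offlined for good. A minimal sketch of
its general shape (illustrative only, made-up name, assuming a hlt-style entry
method; the real ACPI cpuidle and intel_idle handlers pick the entry method from
firmware tables or CPUID):

#include <linux/cpuidle.h>
#include <linux/irqflags.h>

static int example_enter_dead(struct cpuidle_device *dev, int index)
{
	/*
	 * Never returns on success: the CPU is parked in its "dead"
	 * idle state until something external wakes it up.
	 */
	local_irq_disable();
	while (1)
		native_halt();	/* real drivers may use an I/O-port C-state instead */

	return 0;		/* not reached */
}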
Jiri Kosina May 31, 2019, 12:09 p.m. UTC | #11
On Fri, 31 May 2019, Josh Poimboeuf wrote:

> Something like this (not yet tested)?  Maybe we could also remove the
> resume_play_dead() hack?

I tried to test this, but the resumed kernel doesn't seem to be healthy 
for reasons I don't understand yet. 
Symptoms I've seen so far -- 'dazed and confused NMI', spontaneous reboot, 
userspace segfault.

> diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h
> index 9da8cccdf3fb..1c328624162c 100644
> --- a/arch/x86/include/asm/fixmap.h
> +++ b/arch/x86/include/asm/fixmap.h
> @@ -80,6 +80,7 @@ enum fixed_addresses {
>  #ifdef CONFIG_X86_VSYSCALL_EMULATION
>  	VSYSCALL_PAGE = (FIXADDR_TOP - VSYSCALL_ADDR) >> PAGE_SHIFT,
>  #endif
> +	FIX_MWAIT = (FIXADDR_TOP - VSYSCALL_ADDR - 1) >> PAGE_SHIFT,

Two things to this:

- you don't seem to fix x86_32
- shouldn't it rather be FIXADDR_TOP - VSYSCALL_ADDR + 1 instead?

Thanks,
Pavel Machek May 31, 2019, 12:18 p.m. UTC | #12
On Fri 2019-05-31 01:42:02, Jiri Kosina wrote:
> On Thu, 30 May 2019, Josh Poimboeuf wrote:
> 
> > > >     Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
> > > 
> > > Yes, it is, thanks!
> > 
> > I still think changing monitor/mwait to use a fixmap address would be a
> > much cleaner way to fix this.  I can try to work up a patch tomorrow.
> 
> I disagree with that from the backwards compatibility point of view.
> 
> I personally am quite frequently using different combinations of 
> resumer/resumee kernels, and I've never been bitten by it so far. I'd guess 
> I am not the only one.
> Fixmap sort of breaks that invariant.

If we get less tricky code, it may be worth it...
									Pavel
Andy Lutomirski May 31, 2019, 2:24 p.m. UTC | #13
On Fri, May 31, 2019 at 1:57 AM Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
>
> On Friday, May 31, 2019 10:47:21 AM CEST Jiri Kosina wrote:
> > On Fri, 31 May 2019, Josh Poimboeuf wrote:
> >
> > > > I disagree with that from the backwards compatibility point of view.
> > > >
> > > > I personally am quite frequently using different combinations of
> > > > resumer/resumee kernels, and I've never been bitten by it so far. I'd guess
> > > > I am not the only one.
> > > > Fixmap sort of breaks that invariant.
> > >
> > > Right now there is no backwards compatibility because nosmt resume is
> > > already broken.
> >
> > Yeah, well, but that's "only" for nosmt kernels at least.
> >
> > > For "future" backwards compatibility we could just define a hard-coded
> > > reserved fixmap page address, adjacent to the vsyscall reserved address.
> > >
> > > Something like this (not yet tested)?  Maybe we could also remove the
> > > resume_play_dead() hack?
> >
> > Does it also solve cpuidle case? I have no overview what all the cpuidle
> > drivers might be potentially doing in their ->enter_dead() callbacks.
> > Rafael?
>
> There are just two of them, ACPI cpuidle and intel_idle, and they both should
> be covered.
>
> In any case, I think that this is the way to go here even though it may be somewhat
> problematic to start with.
>

Given that there seems to be a genuine compatibility issue right now,
can we design an actual sane way to hand off control of all CPUs
rather than adding duct tape to an extremely fragile mechanism?  I can
think of at least two sensible solutions:

1. Have a self-contained "play dead for kexec/resume" function that
touches only a few well-defined physical pages: a set of page tables and
a page of code.  Load CR3 to point to those page tables, fill in the
code with some form of infinite loop, and run it.  Or just turn off
paging entirely and run the infinite loop.  Have the kernel doing the
resuming inform the kernel being resumed of which pages these are, and
have the kernel being resumed take over all CPUs before reusing the
pages.

2. Put the CPU all the way to sleep by sending it an INIT IPI.

Version 2 seems very simple and robust.  Is there a reason we can't do
it?  We obviously don't want to do it for normal offline because it
might be a high-power state, but a cpu in the wait-for-SIPI state is
not going to exit that state all by itself.

The patch to implement #2 should be short and sweet as long as we are
careful to only put genuine APs to sleep like this.  The only downside
I can see is that a new kernel resuming an old kernel that was
booted with nosmt is going to waste power, but I don't think that's a
showstopper.
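
A minimal sketch of what #2 could look like, purely illustrative (the helper name
is made up, error handling and the BSP/AP sanity checks are omitted, and the
INIT assert/de-assert sequence mirrors the usual startup-IPI dance):

#include <linux/delay.h>
#include <asm/apic.h>

/* Hypothetical helper: park one AP in wait-for-SIPI via INIT. */
static void park_ap_via_init(int apicid)
{
	/* Assert INIT on the target CPU... */
	apic_icr_write(APIC_INT_LEVELTRIG | APIC_INT_ASSERT | APIC_DM_INIT,
		       apicid);
	safe_apic_wait_icr_idle();
	udelay(100);

	/* ...and de-assert it; the CPU now sits in wait-for-SIPI. */
	apic_icr_write(APIC_INT_LEVELTRIG | APIC_DM_INIT, apicid);
	safe_apic_wait_icr_idle();
}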
Jiri Kosina May 31, 2019, 2:31 p.m. UTC | #14
On Fri, 31 May 2019, Andy Lutomirski wrote:

> 2. Put the CPU all the way to sleep by sending it an INIT IPI.
> 
> Version 2 seems very simple and robust.  Is there a reason we can't do
> it?  We obviously don't want to do it for normal offline because it
> might be a high-power state, but a cpu in the wait-for-SIPI state is
> not going to exit that state all by itself.
> 
> The patch to implement #2 should be short and sweet as long as we are
> careful to only put genuine APs to sleep like this.  The only downside
> I can see is that a new kernel resuming an old kernel that was
> booted with nosmt is going to waste power, but I don't think that's a
> showstopper.

Well, if *that* is not an issue, then the original 3-liner that just 
forces them to 'hlt' [1] would be good enough as well.

[1] https://lore.kernel.org/lkml/nycvar.YFH.7.76.1905291230130.1962@cbobk.fhfr.pm/

Thanks,
Jiri Kosina May 31, 2019, 2:33 p.m. UTC | #15
On Fri, 31 May 2019, Jiri Kosina wrote:

> On Fri, 31 May 2019, Andy Lutomirski wrote:
> 
> > 2. Put the CPU all the way to sleep by sending it an INIT IPI.
> > 
> > Version 2 seems very simple and robust.  Is there a reason we can't do
> > it?  We obviously don't want to do it for normal offline because it
> > might be a high-power state, but a cpu in the wait-for-SIPI state is
> > not going to exit that state all by itself.
> > 
> > The patch to implement #2 should be short and sweet as long as we are
> > careful to only put genuine APs to sleep like this.  The only downside
> > I can see is that a new kernel resuming an old kernel that was
> > booted with nosmt is going to waste power, but I don't think that's a
> > showstopper.
> 
> Well, if *that* is not an issue, then the original 3-liner that just 
> forces them to 'hlt' [1] would be good enough as well.

Actually no, scratch that, I misunderstood your proposal, sorry.
Andy Lutomirski May 31, 2019, 2:46 p.m. UTC | #16
> On May 31, 2019, at 7:31 AM, Jiri Kosina <jikos@kernel.org> wrote:
> 
>> On Fri, 31 May 2019, Andy Lutomirski wrote:
>> 
>> 2. Put the CPU all the way to sleep by sending it an INIT IPI.
>> 
>> Version 2 seems very simple and robust.  Is there a reason we can't do
>> it?  We obviously don't want to do it for normal offline because it
>> might be a high-power state, but a cpu in the wait-for-SIPI state is
>> not going to exit that state all by itself.
>> 
>> The patch to implement #2 should be short and sweet as long as we are
>> careful to only put genuine APs to sleep like this.  The only downside
>> I can see is that a new kernel resuming an old kernel that was
>> booted with nosmt is going to waste power, but I don't think that's a
>> showstopper.
> 
> Well, if *that* is not an issue, then the original 3-liner that just 
> forces them to 'hlt' [1] would be good enough as well.
> 
> 

Seems okay to me as long as we’re confident we won’t get a spurious interrupt.

In general, I don’t think we’re ever supposed to rely on mwait *staying* asleep.  As I understand it, mwait can wake up whenever it wants, and the only real guarantee we have is that the CPU makes some effort to stay asleep until an interrupt is received or the monitor address is poked.

As a trivial example, if we are in a VM and we get scheduled out at any point between MONITOR and the eventual intentional wakeup, we’re toast. Same if we get an SMI due to bad luck or due to a thermal event happening shortly after pushing the power button to resume from hibernate.

For that matter, what actually happens if we get an SMI while halted?  Does RSM go directly to sleep or does it re-fetch the HLT?

It seems to me that we should just avoid the scenario where we have IP pointed to a bogus address and we just cross our fingers and hope the CPU doesn’t do anything.

I think that, as a short term fix, we should use HLT and, as a long term fix, we should either keep the CPU state fully valid or we should hard-offline the CPU using documented mechanisms, e.g. the WAIT-for-SIPI state.
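
For reference, the hlt-based parking that the short-term fix relies on is, in
essence, just the following (a sketch in the spirit of hlt_play_dead(); whether
the halt state is reliably re-entered after an SMI is exactly what is debated
below):

#include <linux/irqflags.h>
#include <asm/special_insns.h>

static void park_in_hlt(void)
{
	wbinvd();		/* flush caches before going quiet */
	local_irq_disable();
	for (;;)
		native_halt();	/* plain hlt; only NMI/SMI/INIT-level events break out */
}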
Jiri Kosina May 31, 2019, 2:51 p.m. UTC | #17
On Fri, 31 May 2019, Josh Poimboeuf wrote:

> > I personally am quite frequently using different combinations of 
> > resumer/resumee kernels, and I've never been bitten by it so far. I'd guess 
> > I am not the only one.
> > Fixmap sort of breaks that invariant.
> 
> Right now there is no backwards compatibility because nosmt resume is
> already broken.
> 
> For "future" backwards compatibility we could just define a hard-coded
> reserved fixmap page address, adjacent to the vsyscall reserved address.
> 
> Something like this (not yet tested)?  Maybe we could also remove the
> resume_play_dead() hack?

Looking into SDM:

=====
A store to the address range armed by the MONITOR instruction, an 
interrupt, an NMI or SMI, a debug exception, a machine check exception, 
the BINIT# signal, the INIT# signal, or the RESET# signal will exit the 
implementation-dependent-optimized state.
=====

And mwait doesn't have the 'auto-restart on SMM exit' like hlt does. So I 
guess that's why I am seeing the triple faults even with your (fixed, see 
below) patch as well.

So I don't think we can safely use this approach.

> 
> diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h
> index 9da8cccdf3fb..1c328624162c 100644
> --- a/arch/x86/include/asm/fixmap.h
> +++ b/arch/x86/include/asm/fixmap.h
> @@ -80,6 +80,7 @@ enum fixed_addresses {
>  #ifdef CONFIG_X86_VSYSCALL_EMULATION
>  	VSYSCALL_PAGE = (FIXADDR_TOP - VSYSCALL_ADDR) >> PAGE_SHIFT,
>  #endif
> +	FIX_MWAIT = (FIXADDR_TOP - VSYSCALL_ADDR - 1) >> PAGE_SHIFT,
>  #endif
>  	FIX_DBGP_BASE,
>  	FIX_EARLYCON_MEM_BASE,
> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> index 73e69aaaa117..9804fbe25d03 100644
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -108,6 +108,8 @@ int __read_mostly __max_smt_threads = 1;
>  /* Flag to indicate if a complete sched domain rebuild is required */
>  bool x86_topology_update;
>  
> +static char __mwait_page[PAGE_SIZE];

This needs to be __aligned(PAGE_SIZE) in order for the fixmap to work 
properly.
Jiri Kosina May 31, 2019, 2:54 p.m. UTC | #18
On Fri, 31 May 2019, Andy Lutomirski wrote:

> For that matter, what actually happens if we get an SMI while halted?  
> Does RSM go directly to sleep or does it re-fetch the HLT?

Our mails just crossed, I replied to Josh's mwait() proposal patch a 
minute ago.

HLT is guaranteed to be re-entered if SMM interrupted it, while MWAIT is 
not.

So as a short-term fix for 5.2, I still believe in v4 of my patch that 
does the mwait->hlt->mwait transition across hibernate/resume, and for 5.3 
I can look into forcing it to wait-for-SIPI proper.
Josh Poimboeuf May 31, 2019, 3:26 p.m. UTC | #19
On Fri, May 31, 2019 at 04:54:20PM +0200, Jiri Kosina wrote:
> On Fri, 31 May 2019, Andy Lutomirski wrote:
> 
> > For that matter, what actually happens if we get an SMI while halted?  
> > Does RSM go directly to sleep or does it re-fetch the HLT?
> 
> Our mails just crossed, I replied to Josh's mwait() proposal patch a 
> minute ago.

Good catch.  I agree that mwait seems unsafe across resume and my patch
is bogus.

Andy, in the short term it sounds like you're proposing to make
native_play_dead() use hlt_play_dead() unconditionally.  Right?

That would simplify things and also would fix Jiri's bug I think.  The
only question I'd have is if we have data on the power savings
difference between hlt and mwait.  mwait seems to wake up on a lot of
different conditions which might negate its deeper sleep state.

Andy, for your long term idea to use INIT IPI, I wonder if that would
work with SMT siblings?  Specifically I wonder about the Intel issue
that requires siblings to have CR4.MCE set.
Jiri Kosina May 31, 2019, 3:41 p.m. UTC | #20
On Fri, 31 May 2019, Josh Poimboeuf wrote:

> The only question I'd have is if we have data on the power savings 
> difference between hlt and mwait.  mwait seems to wake up on a lot of 
> different conditions which might negate its deeper sleep state.

hlt wakes up on basically the same set of events, but has the 
auto-restarting semantics on some of them (especially SMM). So the wakeup 
frequency itself shouldn't really contribute to the power consumption 
difference; it's the C-state that mwait allows the CPU to enter.

Thanks,
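
For context, the C-state side of that trade-off comes from the MWAIT hint:
mwait_play_dead() walks CPUID leaf 5 to find the deepest enumerated C-state and
encodes it into EAX. A simplified sketch of that hint computation (abbreviated,
not the full function, and omitting the feature/extension checks):

#include <asm/mwait.h>
#include <asm/processor.h>

static unsigned int deepest_mwait_hint(void)
{
	unsigned int eax = CPUID_MWAIT_LEAF, ebx, ecx = 0, edx;
	unsigned int cstate = 0, substate = 1, i;

	native_cpuid(&eax, &ebx, &ecx, &edx);

	edx >>= MWAIT_SUBSTATE_SIZE;	/* skip the C0 nibble */
	for (i = 0; i < 7 && edx; i++, edx >>= MWAIT_SUBSTATE_SIZE) {
		if (edx & MWAIT_SUBSTATE_MASK) {
			cstate = i;
			substate = edx & MWAIT_SUBSTATE_MASK;
		}
	}
	/* Deepest C-state/sub-state, encoded as the EAX hint for MWAIT. */
	return (cstate << MWAIT_SUBSTATE_SIZE) | (substate - 1);
}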
Josh Poimboeuf May 31, 2019, 4:19 p.m. UTC | #21
On Fri, May 31, 2019 at 05:41:18PM +0200, Jiri Kosina wrote:
> On Fri, 31 May 2019, Josh Poimboeuf wrote:
> 
> > The only question I'd have is if we have data on the power savings 
> > difference between hlt and mwait.  mwait seems to wake up on a lot of 
> > different conditions which might negate its deeper sleep state.
> 
> hlt wakes up on basically the same set of events, but has the 
> auto-restarting semantics on some of them (especially SMM). So the wakeup 
> frequency itself shouldn't really contribute to power consumption 
> difference; it's the C-state that mwait allows CPU to enter.

Ok.  I reluctantly surrender :-)  For your v4:

  Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>

It works as a short term fix, but it's fragile, and it does feel like
we're just adding more duct tape, as Andy said.
Andy Lutomirski May 31, 2019, 4:23 p.m. UTC | #22
On Fri, May 31, 2019 at 7:54 AM Jiri Kosina <jikos@kernel.org> wrote:
>
> On Fri, 31 May 2019, Andy Lutomirski wrote:
>
> > For that matter, what actually happens if we get an SMI while halted?
> > Does RSM go directly to sleep or does it re-fetch the HLT?
>
> Our mails just crossed, I replied to Josh's mwait() proposal patch a
> minute ago.
>
> HLT is guaranteed to be re-entered if SMM interrupted it, while MWAIT is
> not.
>
> So as a short-term fix for 5.2, I still believe in v4 of my patch that
> does the mwait->hlt->mwait transition across hibernate/resume, and for 5.3
> I can look into forcing it to wait-for-SIPI proper.
>

How sure are you?  From http://www.rcollins.org/ddj/Mar97/Mar97.html I see:

In general, the AutoHALT field directs the microprocessor whether or
not to restart the HLT instruction upon exit of SMM. This is
accomplished by decrementing EIP and executing whatever instruction
resides at that position. AutoHALT restart behavior is consistent,
regardless of whether or not EIP-1 contains a HLT instruction. If the
SMM handler set Auto HALT[bit0]=1 when the interrupted instruction was
not a HLT instruction(AutoHALT[bit0]= 0 upon entrance), they would run
the risk of resuming execution at an undesired location. The RSM
microcode doesn’t know the length of the interrupted instruction.
Therefore when AutoHALT[bit0]=1 upon exit, the RSM microcode blindly
decrements the EIP register by 1 and resumes execution. This explains
why Intel warns that unpredictable behavior may result from setting
this field to restart a HLT instruction when the microprocessor wasn’t
in a HALT state upon entrance. Listing One presents an algorithm that
describes the AutoHALT Restart feature.

The AMD APM says:

If the return from SMM takes the processor back to the halt state, the
HLT instruction is not re-
executed. However, the halt special bus-cycle is driven on the
processor bus after the RSM instruction
executes.

The Intel SDM Vol 3 34.10 says:

If the HLT instruction is restarted, the processor will generate a
memory access to fetch the HLT instruction (if it is
not in the internal cache), and execute a HLT bus transaction. This
behavior results in multiple HLT bus transactions
for the same HLT instruction.

And the SDM is very clear that one should *not* do RSM with auto-halt
set if the instruction that was interrupted wasn't HLT.

It sounds to me like Intel does not architecturally guarantee that
it's okay to do HLT from one CPU and then remove the HLT instruction
out from under it on another CPU.
Andy Lutomirski May 31, 2019, 4:51 p.m. UTC | #23
On Fri, May 31, 2019 at 9:19 AM Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>
> On Fri, May 31, 2019 at 05:41:18PM +0200, Jiri Kosina wrote:
> > On Fri, 31 May 2019, Josh Poimboeuf wrote:
> >
> > > The only question I'd have is if we have data on the power savings
> > > difference between hlt and mwait.  mwait seems to wake up on a lot of
> > > different conditions which might negate its deeper sleep state.
> >
> > hlt wakes up on basically the same set of events, but has the
> > auto-restarting semantics on some of them (especially SMM). So the wakeup
> > frequency itself shouldn't really contribute to power consumption
> > difference; it's the C-state that mwait allows CPU to enter.
>
> Ok.  I reluctantly surrender :-)  For your v4:
>
>   Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
>
> It works as a short term fix, but it's fragile, and it does feel like
> we're just adding more duct tape, as Andy said.
>

Just to clarify what I was thinking, it seems like soft-offlining a
CPU and resuming a kernel have fundamentally different requirements.
To soft-offline a CPU, we want to get power consumption as low as
possible and make sure that MCE won't kill the system.  It's okay for
the CPU to occasionally execute some code.  For resume, what we're
really doing is trying to hand control of all CPUs from kernel A to
kernel B.  There are two basic ways to hand off control of a given
CPU: we can jump (with JMP, RET, horrible self-modifying code, etc)
from one kernel to the other, or we can attempt to make a given CPU
stop executing code from either kernel at all and then forcibly wrench
control of it in kernel B.  Either approach seems okay, but the latter
approach depends on getting the CPU to reliably stop executing code.
We don't care about power consumption for resume, and I'm not even
convinced that we need to be able to survive an MCE that happens while
we're resuming, although surviving MCE would be nice.

So if we don't want to depend on nasty system details at all, we could
have the first kernel explicitly wake up all CPUs and hand them all
off to the new kernel, more or less the same way that we hand over
control of the BSP right now.  Or we can look for a way to tell all
the APs to stop executing kernel code, and the only architectural way
I know of to do that is to sent an INIT IPI (and then presumably
deassert INIT -- the SDM is a bit vague).

Or we could allocate a page, stick a GDT, a TSS, and a 1: hlt; jmp 1b
in it, turn off paging, and run that code.  And then somehow convince
the kernel we load not to touch that page until it finishes waking up
all CPUs.  This seems conceptually simple and very robust, but I'm not
sure it fits in with the way hibernation works right now at all.
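
To make that last idea a bit more concrete, a back-of-the-envelope sketch of the
parking page itself (purely illustrative, made-up helper; it ignores the
GDT/TSS/page-table setup, the NX/executability of the page, and the handoff
protocol to the resumed kernel):

#include <linux/gfp.h>
#include <linux/string.h>
#include <linux/types.h>

/* "1: hlt; jmp 1b" as raw bytes, so the page is fully self-contained. */
static const u8 park_stub[] = {
	0xf4,		/* 1: hlt    */
	0xeb, 0xfd,	/*    jmp 1b */
};

static void *make_park_page(void)
{
	void *page = (void *)get_zeroed_page(GFP_KERNEL);

	if (page)
		memcpy(page, park_stub, sizeof(park_stub));
	return page;
}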
Josh Poimboeuf May 31, 2019, 6:11 p.m. UTC | #24
On Fri, May 31, 2019 at 09:51:09AM -0700, Andy Lutomirski wrote:
> Just to clarify what I was thinking, it seems like soft-offlining a
> CPU and resuming a kernel have fundamentally different requirements.
> To soft-offline a CPU, we want to get power consumption as low as
> possible and make sure that MCE won't kill the system.  It's okay for
> the CPU to occasionally execute some code.  For resume, what we're
> really doing is trying to hand control of all CPUs from kernel A to
> kernel B.  There are two basic ways to hand off control of a given
> CPU: we can jump (with JMP, RET, horrible self-modifying code, etc)
> from one kernel to the other, or we can attempt to make a given CPU
> stop executing code from either kernel at all and then forcibly wrench
> control of it in kernel B.  Either approach seems okay, but the latter
> approach depends on getting the CPU to reliably stop executing code.
> We don't care about power consumption for resume, and I'm not even
> convinced that we need to be able to survive an MCE that happens while
> we're resuming, although surviving MCE would be nice.

I'd thought you were proposing a global improvement: we get rid of
mwait_play_dead() everywhere, i.e. all the time, not just for the resume
path.

Instead it sounds like you were proposing a local improvement to the
resume path, to continue doing what
hibernate_resume_nonboot_cpu_disable() is already doing, but use an INIT
IPI instead of HLT to make sure the CPU is completely dead.

That may be a theoretical improvement but we'd still need to do the
whole "wake and play dead" dance which Jiri's patch is doing for offline
CPUs.  So Jiri's patch looks ok to me.
Jiri Kosina May 31, 2019, 9:05 p.m. UTC | #25
On Fri, 31 May 2019, Andy Lutomirski wrote:

> The Intel SDM Vol 3 34.10 says:
> 
> If the HLT instruction is restarted, the processor will generate a
> memory access to fetch the HLT instruction (if it is
> not in the internal cache), and execute a HLT bus transaction. This
> behavior results in multiple HLT bus transactions
> for the same HLT instruction.

Which basically means that both hibernation and kexec have been broken in 
this respect for gazillions of years, and it seems like no one noticed. Makes 
one wonder what the reason for that might be.

Either SDM is not precise and the refetch actually never happens for real 
(or is always in these cases satisfied from I$ perhaps?), or ... ?

So my patch basically puts things back where they have been for ages 
(while mwait is obviously much worse, as that gets woken up by the write 
to the monitored address, which inevitably does happen during resume), but it 
seems like the SDM is suggesting that we've been in a grey zone wrt RSM at 
least for all those ages.

So perhaps we really should ditch resume_play_dead() altogether 
eventually, and replace it with sending an INIT IPI around instead (and then 
waking the CPUs properly via INIT INIT START). I'd still like to do that 
for 5.3 though, as that'd be slightly bigger surgery, and conservatively 
put things basically back to the state they have been in up to now for 5.2.

Thanks,
Andy Lutomirski May 31, 2019, 9:22 p.m. UTC | #26
> On May 31, 2019, at 2:05 PM, Jiri Kosina <jikos@kernel.org> wrote:
> 
>> On Fri, 31 May 2019, Andy Lutomirski wrote:
>> 
>> The Intel SDM Vol 3 34.10 says:
>> 
>> If the HLT instruction is restarted, the processor will generate a
>> memory access to fetch the HLT instruction (if it is
>> not in the internal cache), and execute a HLT bus transaction. This
>> behavior results in multiple HLT bus transactions
>> for the same HLT instruction.
> 
> Which basically means that both hibernation and kexec have been broken in 
> this respect for gazillions of years, and it seems like no one noticed. Makes 
> one wonder what the reason for that might be.
> 
> Either SDM is not precise and the refetch actually never happens for real 
> (or is always in these cases satisfied from I$ perhaps?), or ... ?
> 
> So my patch basically puts things back where they have been for ages 
> (while mwait is obviously much worse, as that gets woken up by the write 
> to the monitored address, which inevitably does happen during resume), but 
> seems like SDM is suggesting that we've been in a grey zone wrt RSM at 
> least for all those ages.
> 
> So perhaps we really should ditch resume_play_dead() altogether 
> eventually, and replace it with sending INIT IPI around instead (and then 
> waking the CPUs properly via INIT INIT START). I'd still like to do that 
> for 5.3 though, as that'd be slightly bigger surgery, and conservatively 
> put things basically back to state they have been up to now for 5.2.
> 


Seems reasonable to me.  I would guess that it mostly works because SMI isn’t all that common and the window where it matters is short.  Or maybe the SDM is misleading.
Rafael J. Wysocki June 3, 2019, 10:03 a.m. UTC | #27
On Friday, May 31, 2019 6:19:52 PM CEST Josh Poimboeuf wrote:
> On Fri, May 31, 2019 at 05:41:18PM +0200, Jiri Kosina wrote:
> > On Fri, 31 May 2019, Josh Poimboeuf wrote:
> > 
> > > The only question I'd have is if we have data on the power savings 
> > > difference between hlt and mwait.  mwait seems to wake up on a lot of 
> > > different conditions which might negate its deeper sleep state.
> > 
> > hlt wakes up on basically the same set of events, but has the 
> > auto-restarting semantics on some of them (especially SMM). So the wakeup 
> > frequency itself shouldn't really contribute to power consumption 
> > difference; it's the C-state that mwait allows CPU to enter.
> 
> Ok.  I reluctantly surrender :-)  For your v4:
> 
>   Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
> 
> It works as a short term fix, but it's fragile, and it does feel like
> we're just adding more duct tape, as Andy said.

OK, the v4 queued up then, thanks!
Sean Christopherson June 3, 2019, 2:23 p.m. UTC | #28
On Fri, May 31, 2019 at 02:22:27PM -0700, Andy Lutomirski wrote:
> 
> > On May 31, 2019, at 2:05 PM, Jiri Kosina <jikos@kernel.org> wrote:
> > 
> >> On Fri, 31 May 2019, Andy Lutomirski wrote:
> >> 
> >> The Intel SDM Vol 3 34.10 says:
> >> 
> >> If the HLT instruction is restarted, the processor will generate a
> >> memory access to fetch the HLT instruction (if it is
> >> not in the internal cache), and execute a HLT bus transaction. This
> >> behavior results in multiple HLT bus transactions
> >> for the same HLT instruction.
> > 
> > Which basically means that both hibernation and kexec have been broken in 
> > this respect for gazillions of years, and it seems like no one noticed. Makes 
> > one wonder what the reason for that might be.
> > 
> > Either SDM is not precise and the refetch actually never happens for real 
> > (or is always in these cases satisfied from I$ perhaps?), or ... ?
> > 
> > So my patch basically puts things back where they have been for ages 
> > (while mwait is obviously much worse, as that gets woken up by the write 
> > to the monitored address, which inevitably does happen during resume), but 
> > seems like SDM is suggesting that we've been in a grey zone wrt RSM at 
> > least for all those ages.
> > 
> > So perhaps we really should ditch resume_play_dead() altogether 
> > eventually, and replace it with sending INIT IPI around instead (and then 
> > waking the CPUs properly via INIT INIT START). I'd still like to do that 
> > for 5.3 though, as that'd be slightly bigger surgery, and conservatively 
> > put things basically back to state they have been up to now for 5.2.
> > 
> 
> 
> Seems reasonable to me.  I would guess that it mostly works because SMI isn’t
> all that common and the window where it matters is short.  Or maybe the SDM
> is misleading.

For P6 and later, i.e. all modern CPUs, Intel processors go straight to
halted state and don't fetch/decode the HLT instruction.

P5 actually did a fetch, but from what I can tell that behavior wasn't
carried forward to KNC, unlike other legacy interrupt crud from P5:

[1] https://lkml.kernel.org/r/20190430004504.GH31379@linux.intel.com
Jiri Kosina June 3, 2019, 3:24 p.m. UTC | #29
On Mon, 3 Jun 2019, Sean Christopherson wrote:

> For P6 and later, i.e. all modern CPUs, Intel processors go straight to
> halted state and don't fetch/decode the HLT instruction.

That'd be a rather relieving fact actually. Do you happen to know if this 
is stated in some Intel documentation and we've just overlooked it, or 
whether it's rather information that's being carried over from 
generation to generation by whispering through the grapevine?

Thanks,
Sean Christopherson June 3, 2019, 4:18 p.m. UTC | #30
On Mon, Jun 03, 2019 at 05:24:26PM +0200, Jiri Kosina wrote:
> On Mon, 3 Jun 2019, Sean Christopherson wrote:
> 
> > For P6 and later, i.e. all modern CPUs, Intel processors go straight to
> > halted state and don't fetch/decode the HLT instruction.
> 
> That'd be a rather relieving fact actually. Do you happen to know if this 
> is stated in some Intel documentation and we've just overlooked it, or 
> whether it's rather information that's being carried over from 
> generation to generation by whispering through the grapevine?

I highly doubt it's officially stated anywhere.  Intel's approach to this
type of micro-architecture specific behavior is (usually) to word the SDM
in such a way that both approaches are legal.  E.g. a 1993 version of the
SDM says "Returns to interrupted HLT instruction", whereas in 1995, which
just so happens to coincide with the introduction of the P6 architecture,
the SDM started saying "Returns to HALT state" and added the blurb about
"will generate a memory access to fetch the HLT instruction (if it is not
in the internal cache)" so that the old behavior is still legal.

All that being said, the "straight to HALT" behavior is now the de facto
standard since lots of people will yell loudly if it changes.

Patch

diff --git a/arch/x86/power/cpu.c b/arch/x86/power/cpu.c
index a7d966964c6f..513ce09e9950 100644
--- a/arch/x86/power/cpu.c
+++ b/arch/x86/power/cpu.c
@@ -299,7 +299,17 @@  int hibernate_resume_nonboot_cpu_disable(void)
 	 * address in its instruction pointer may not be possible to resolve
 	 * any more at that point (the page tables used by it previously may
 	 * have been overwritten by hibernate image data).
+	 *
+	 * First, make sure that we wake up all the potentially disabled SMT
+	 * threads which have been initially brought up and then put into
+	 * mwait/cpuidle sleep.
+	 * Those will be put to proper (not interfering with hibernation
+	 * resume) sleep afterwards, and the resumed kernel will decide itself
+	 * what to do with them.
 	 */
+	ret = cpuhp_smt_enable();
+	if (ret)
+		return ret;
 	smp_ops.play_dead = resume_play_dead;
 	ret = disable_nonboot_cpus();
 	smp_ops.play_dead = play_dead;
diff --git a/arch/x86/power/hibernate.c b/arch/x86/power/hibernate.c
index 4845b8c7be7f..fc413717a45f 100644
--- a/arch/x86/power/hibernate.c
+++ b/arch/x86/power/hibernate.c
@@ -11,6 +11,7 @@ 
 #include <linux/suspend.h>
 #include <linux/scatterlist.h>
 #include <linux/kdebug.h>
+#include <linux/cpu.h>
 
 #include <crypto/hash.h>
 
@@ -245,3 +246,35 @@  int relocate_restore_code(void)
 	__flush_tlb_all();
 	return 0;
 }
+
+int arch_resume_nosmt(void)
+{
+	int ret = 0;
+	/*
+	 * We reached this while coming out of hibernation. This means
+	 * that SMT siblings are sleeping in hlt, as mwait is not safe
+	 * against control transition during resume (see comment in
+	 * hibernate_resume_nonboot_cpu_disable()).
+	 *
+	 * If the resumed kernel has SMT disabled, we have to take all the
+	 * SMT siblings out of hlt, and offline them again so that they
+	 * end up in mwait proper.
+	 *
+	 * Called with hotplug disabled.
+	 */
+	cpu_hotplug_enable();
+	if (cpu_smt_control == CPU_SMT_DISABLED ||
+			cpu_smt_control == CPU_SMT_FORCE_DISABLED) {
+		enum cpuhp_smt_control old = cpu_smt_control;
+
+		ret = cpuhp_smt_enable();
+		if (ret)
+			goto out;
+		ret = cpuhp_smt_disable(old);
+		if (ret)
+			goto out;
+	}
+out:
+	cpu_hotplug_disable();
+	return ret;
+}
diff --git a/include/linux/cpu.h b/include/linux/cpu.h
index 3813fe45effd..fcb1386bb0d4 100644
--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -201,10 +201,14 @@  enum cpuhp_smt_control {
 extern enum cpuhp_smt_control cpu_smt_control;
 extern void cpu_smt_disable(bool force);
 extern void cpu_smt_check_topology(void);
+extern int cpuhp_smt_enable(void);
+extern int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval);
 #else
 # define cpu_smt_control		(CPU_SMT_NOT_IMPLEMENTED)
 static inline void cpu_smt_disable(bool force) { }
 static inline void cpu_smt_check_topology(void) { }
+static inline int cpuhp_smt_enable(void) { return 0; }
+static inline int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval) { return 0; }
 #endif
 
 /*
diff --git a/kernel/cpu.c b/kernel/cpu.c
index f2ef10460698..077fde6fb953 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -2061,7 +2061,7 @@  static void cpuhp_online_cpu_device(unsigned int cpu)
 	kobject_uevent(&dev->kobj, KOBJ_ONLINE);
 }
 
-static int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval)
+int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval)
 {
 	int cpu, ret = 0;
 
@@ -2093,7 +2093,7 @@  static int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval)
 	return ret;
 }
 
-static int cpuhp_smt_enable(void)
+int cpuhp_smt_enable(void)
 {
 	int cpu, ret = 0;
 
diff --git a/kernel/power/hibernate.c b/kernel/power/hibernate.c
index c8c272df7154..b65635753e8e 100644
--- a/kernel/power/hibernate.c
+++ b/kernel/power/hibernate.c
@@ -257,6 +257,11 @@  void swsusp_show_speed(ktime_t start, ktime_t stop,
 		(kps % 1000) / 10);
 }
 
+__weak int arch_resume_nosmt(void)
+{
+	return 0;
+}
+
 /**
  * create_image - Create a hibernation image.
  * @platform_mode: Whether or not to use the platform driver.
@@ -324,6 +329,10 @@  static int create_image(int platform_mode)
  Enable_cpus:
 	suspend_enable_secondary_cpus();
 
+	/* Allow architectures to do nosmt-specific post-resume dances */
+	if (!in_suspend)
+		error = arch_resume_nosmt();
+
  Platform_finish:
 	platform_finish(platform_mode);