diff mbox series

[v2,2/3] x86/smp: Allow forcing the mwait hint for play dead loop

Message ID 20241029101507.7188-3-patryk.wlazlyn@linux.intel.com (mailing list archive)
State New
Headers show
Series SRF: Fix offline CPU preventing pc6 entry | expand

Commit Message

Patryk Wlazlyn Oct. 29, 2024, 10:15 a.m. UTC
The current implementation for looking up the mwait hint for the deepest
cstate depends on them to be continuous in range [0, NUM_SUBSTATES-1].
While that is correct on most Intel x86 platforms, it is not
architectural and may not result in reaching the most optimized idle
state on some of them.

For example Intel's Sierra Forest report two C6 substates in cpuid leaf 5:
    C6S  (hint 0x22)
    C6SP (hint 0x23)

Hints 0x20 and 0x21 are skipped entirely, causing the current
implementation to compute the wrong hint, when looking for the deepest
cstate for offlined CPU to enter. As a result, package with an offlined
CPU can never reach PC6.

Allow the idle driver to communicate the deepest idle cstate to the x86
offline code.

Signed-off-by: Patryk Wlazlyn <patryk.wlazlyn@linux.intel.com>
---
 arch/x86/include/asm/smp.h |  3 +++
 arch/x86/kernel/smpboot.c  | 12 +++++++++++-
 2 files changed, 14 insertions(+), 1 deletion(-)

Comments

Dave Hansen Oct. 29, 2024, 6:30 p.m. UTC | #1
On 10/29/24 03:15, Patryk Wlazlyn wrote:
> +void smp_set_mwait_play_dead_hint(unsigned int hint)
> +{
> +	WRITE_ONCE(play_dead_mwait_hint, hint);
> +}

This all feels a bit hacky and unstructured to me.

Could we at least set up a few rules here?  Like, say what the hints
are, what values can they have?  Where do they come from?  Can this get
called more than once?  Does it _need_ to be set?  What's the behavior
when it is not set?  Who is responsible for calling this?

What good does the smp_ prefix do?  I don't think _callers_ care whether
this is getting optimized out or not.

> -	hint = get_deepest_mwait_hint();
> +	hint = READ_ONCE(play_dead_mwait_hint);
> +	if (hint == PLAY_DEAD_MWAIT_HINT_UNSET)
> +		hint = get_deepest_mwait_hint();

This is also rather opaque.

Why are there two hints?  What makes one better than the other one?
Artem Bityutskiy Oct. 30, 2024, 9:58 a.m. UTC | #2
On Tue, 2024-10-29 at 11:30 -0700, Dave Hansen wrote:
> On 10/29/24 03:15, Patryk Wlazlyn wrote:
> > +void smp_set_mwait_play_dead_hint(unsigned int hint)
> > +{
> > +	WRITE_ONCE(play_dead_mwait_hint, hint);
> > +}
> 
> This all feels a bit hacky and unstructured to me.
> 
> Could we at least set up a few rules here?  Like, say what the hints
> are, what values can they have?  Where do they come from?  Can this get
> called more than once?  Does it _need_ to be set?  What's the behavior
> when it is not set?  Who is responsible for calling this?
> 
> What good does the smp_ prefix do?  I don't think _callers_ care whether
> this is getting optimized out or not.
> 

The goal of 'get_deepest_mwait_hint()' is to find the mwait hint of the deepest
available C-state, in order to request it for the offline CPU. On Intel CPUs,
the C-states and their mwait hint values are platform-specific.

Generally, there is no architectural way for enumerating mwait hints on Intel
CPUs. In the idle path (different to the CPU offline path), idle drivers (if
enabled) enumerate and request C-states using either ACPI mechanisms or a
compiled-in, per-platform custom C-states table, provided by Intel for specific
platforms.

In the CPU offline path, only the deepest C-state hint is needed. Historically,
it was determined using a simple algorithm, which happened to provide the
correct result on most Intel platforms. This algorithm is based on scanning
CPUID leaf 5 EDX bits and building the hint value from the C-state and sub-state
numbers. 

Generally speaking, mwait hints are opaque numbers, and the algorithm is not
architectural. While it produces the correct results for most Intel CPUs, it
produces sub-optimal result for some CPUs. For example Intel Sierra Forest Xeon
CPU, the algorithm produces hint 0x21, while the actual deepest C-state hint is
0x23. If hint 0x21 is used, the result is that the offline CPU does not enter
the deepest available C-state. While this is not fatal, the CPU ends up saving
less energy than it could have saved.

The 'set_mwait_play_dead_hint()' function provides a mechanism for defining the
mwait hint for the offline CPU, and can be used for platforms where the generic
non-architectural algorithm provides a sub-optimal result.

Q&A.

1. Could we at least set up a few rules here?  Like, say what the hints
are, what values can they have?

The hints are 8-bit values, lower 4 bits define "sub-state", higher 4 bits
define the state.

The state value (higher 4 bits) correspond to the state enumerated by CPUID leaf
5 (Value 0 is C0, value 1 is C1, etc). The sub-state value is an opaque number.

The hint is provided to the mwait instruction via EAX.

2. Where do they come from?

Hardware C-states are defined by the specific platform (e.g., C1, C1E, CC6,
PC6). Then they are "mapped" to the SDM C-states (C0, C1, C2, etc). The specific
platform defines the hint values.

Intel typically provides the hint values in the EDS (External Design
Specification) document. It is typically non-public.

Intel also discloses the hint values for open-source projects like Linux, and
then Intel engineers submit them to the intel_idle driver.

Some of the hints may also be found via ACPI _CST table.

3. Can this get called more than once?

It is not supposed to. The idea is that if a driver like intel_idle is used, it
can call 'set_mwait_play_dead_hint()' and provide the most optimal hint number
for the offline code.

4. Does it _need_ to be set?

No. It is more of an optimization. But it is an important optimization which may
result in saving a lot of money in a datacenter.

Typically using a "wrong" hint value is non-fatal, at least I did not see it
being fatal so far. The CPU will map it to some hardware C-state request, but
which one - depends on the "wrong" value and the CPU. It just may be sub-
optimal.

5. What's the behavior when it is not set?

The offline code will fall-back to the generic non-architectural algorithm,
which provides correct results for all server platforms I dealt with since 2017.
It should provide the correct hint for most client platforms, as far as I am
aware.

Sierra Forest Xeon is the first platform where the generic algorithm provides a
sub-optimal value 0x21. It is not fatal, just sub-optimal.

Note: I am working with Intel firmware team on having the FW "re-mapping" hint
0x21 to hint 0x23, so that "unaware" Linux kernel also ends up with requesting
the deepest C-state for an offline CPU.

6. Who is responsible for calling this?

The idea for now is that the intel_idle driver calls it.

But in theory, in the future, any driver/platform code may call it if it "knows"
what's the most optimal hint, I suppose. I do not have a good example though.

Artem.
Patryk Wlazlyn Oct. 30, 2024, 1:33 p.m. UTC | #3
>> +void smp_set_mwait_play_dead_hint(unsigned int hint)
>> +{
>> +    WRITE_ONCE(play_dead_mwait_hint, hint);
>> +}
>
> This all feels a bit hacky and unstructured to me.
>
> Could we at least set up a few rules here?  Like, say what the hints
> are, what values can they have?  Where do they come from?  Can this get
> called more than once?  Does it _need_ to be set?  What's the behavior
> when it is not set?  Who is responsible for calling this?

The other idea is to first check if currently loaded idle driver provides
enter_dead() callback first and leave the current, deepest mwait hint
computation code as a fallback.

Does that sound less hacky?

Unfortunately, it comes with a little problem. In case of kexec, we need to
have a way to exit from the mwait loop and enter hlt to prevent offlined CPU
from crashing when the old memory is being overwritten.

I think, we can solve it by bringing the CPU back online before we proceed
with kexec, but I would appreciate some feedback from someone who is more
familiar with kexec, before merging that.

We may also signal that by touching the resched flag on which enter_dead()
code will monitor in case of mwait and enter hlt right after, but that's a
bit hackier IMO.


> What good does the smp_ prefix do?  I don't think _callers_ care whether
> this is getting optimized out or not.

The prefix makes it a little bit cleaner by not exporting new global symbol
with "set_mwait_play_dead_hint" name.
Dave Hansen Oct. 30, 2024, 7:32 p.m. UTC | #4
On 10/30/24 02:58, Artem Bityutskiy wrote:
> On Tue, 2024-10-29 at 11:30 -0700, Dave Hansen wrote:
> 1. Could we at least set up a few rules here?  Like, say what the hints
> are, what values can they have?
> 
> The hints are 8-bit values, lower 4 bits define "sub-state", higher 4 bits
> define the state.
> 
> The state value (higher 4 bits) correspond to the state enumerated by CPUID leaf
> 5 (Value 0 is C0, value 1 is C1, etc). The sub-state value is an opaque number.
> 
> The hint is provided to the mwait instruction via EAX.

OK, so can you distill that down to something succinct and get it in a
comment above the new function, please?

> 2. Where do they come from?
> 
> Hardware C-states are defined by the specific platform (e.g., C1, C1E, CC6,
> PC6). Then they are "mapped" to the SDM C-states (C0, C1, C2, etc). The specific
> platform defines the hint values.
> 
> Intel typically provides the hint values in the EDS (External Design
> Specification) document. It is typically non-public.
> 
> Intel also discloses the hint values for open-source projects like Linux, and
> then Intel engineers submit them to the intel_idle driver.
> 
> Some of the hints may also be found via ACPI _CST table.

What about the mwait_play_dead() loop that calculates the hint?  Doesn't
that derive the hint from CPUID?

> 3. Can this get called more than once?
> 
> It is not supposed to. The idea is that if a driver like intel_idle is used, it
> can call 'set_mwait_play_dead_hint()' and provide the most optimal hint number
> for the offline code.

There are two important nuggets in there:

First, an idle driver can but is not required to set the hint.  This
would be good comment material.

Second, only one thing is supposed to set the hint.  This is a good
thing to WARN() about.

> 4. Does it _need_ to be set?
> 
> No. It is more of an optimization. But it is an important optimization which may
> result in saving a lot of money in a datacenter.
> 
> Typically using a "wrong" hint value is non-fatal, at least I did not see it
> being fatal so far. The CPU will map it to some hardware C-state request, but
> which one - depends on the "wrong" value and the CPU. It just may be sub-
> optimal.

OK, so this tells me we *don't* want some kind of:

	WARN_ON(play_dead_mwait_hint == PLAY_DEAD_MWAIT_HINT_UNSET);

warning.

> 5. What's the behavior when it is not set?
> 
> The offline code will fall-back to the generic non-architectural algorithm,
> which provides correct results for all server platforms I dealt with since 2017.
> It should provide the correct hint for most client platforms, as far as I am
> aware.
> 
> Sierra Forest Xeon is the first platform where the generic algorithm provides a
> sub-optimal value 0x21. It is not fatal, just sub-optimal.

What is the non-architectural algorithm?  Which Linux code are you
referring to?

> Note: I am working with Intel firmware team on having the FW "re-mapping" hint
> 0x21 to hint 0x23, so that "unaware" Linux kernel also ends up with requesting
> the deepest C-state for an offline CPU.

That would be great as well.  Thanks for doing that!

> 6. Who is responsible for calling this?
> 
> The idea for now is that the intel_idle driver calls it.
> 
> But in theory, in the future, any driver/platform code may call it if it "knows"
> what's the most optimal hint, I suppose. I do not have a good example though.

So let's look at how this works:

void native_play_dead(void)
{
...
        mwait_play_dead();
        if (cpuidle_play_dead())
                hlt_play_dead();
}

This _existing_ code has three different ways of playing dead (in this
order of preference):

 1. mwait
 2. cpuidle
 3. hlt

It has (at least) two different mechanisms for telling which of these to
call:

  1. mwait has a bunch of built-in logic that will ensure the CPU
     should use for playing dead.  If not, it does nothing and returns.
  2. cpuidle_play_dead() (only used by acpi_idle_driver as far as I can
     tell) will return an error if it does not support playing dead

If 1 and 2 fail, then hlt_play_dead() gets called.

But the really fun part of this is that idle driver is *called* here.
The driver that is also responsible for overriding the mwait hint.

So this series opts to have the boot code plumb the hint back into a
basically undocumented global variable while also assuming that the
system is *going* to use mwait.  It then does *nothing* with the
callback just adjacent to the code it wants to modify.

Seems rather spaghetti-like to me.

To make it worse, go look at da6fa7ef67f0 ("x86/smpboot: Don't use
mwait_play_dead() on AMD systems").  It hacks AMD-specific code in
mwait_play_dead() just to force the cpuidle code to get called.

What if we did this?  First, introduce a helper:

	bool mwait_play_dead_with_hint(u32 hint)

and then restructure native_play_dead() to look like this:

static mwait_play_dead_generic(void)
{
	u32 hint = get_deepest_mwait_hint();

	return mwait_play_dead_with_hint(hint);
}

void native_play_dead(void)
{
	bool used;

	used = cpuidle_play_dead();
	if (used)
		return;

	used = mwait_play_dead_generic();
	if (used)
		return;

	hlt_play_dead();
}

If the cpuidle drivers want to use mwait with a different hint, they
override the *EXISTING* drv->states[].enter_dead() functionality and
call mwait_play_dead_with_hint() with their new hint.  Then they don't
need to pass anything _over_ to the mwait code.

Wouldn't something like that makes this all much more straightforward?
Rafael J. Wysocki Oct. 30, 2024, 7:53 p.m. UTC | #5
On Wed, Oct 30, 2024 at 8:32 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 10/30/24 02:58, Artem Bityutskiy wrote:
> > On Tue, 2024-10-29 at 11:30 -0700, Dave Hansen wrote:
> > 1. Could we at least set up a few rules here?  Like, say what the hints
> > are, what values can they have?
> >
> > The hints are 8-bit values, lower 4 bits define "sub-state", higher 4 bits
> > define the state.
> >
> > The state value (higher 4 bits) correspond to the state enumerated by CPUID leaf
> > 5 (Value 0 is C0, value 1 is C1, etc). The sub-state value is an opaque number.
> >
> > The hint is provided to the mwait instruction via EAX.
>
> OK, so can you distill that down to something succinct and get it in a
> comment above the new function, please?
>
> > 2. Where do they come from?
> >
> > Hardware C-states are defined by the specific platform (e.g., C1, C1E, CC6,
> > PC6). Then they are "mapped" to the SDM C-states (C0, C1, C2, etc). The specific
> > platform defines the hint values.
> >
> > Intel typically provides the hint values in the EDS (External Design
> > Specification) document. It is typically non-public.
> >
> > Intel also discloses the hint values for open-source projects like Linux, and
> > then Intel engineers submit them to the intel_idle driver.
> >
> > Some of the hints may also be found via ACPI _CST table.
>
> What about the mwait_play_dead() loop that calculates the hint?  Doesn't
> that derive the hint from CPUID?
>
> > 3. Can this get called more than once?
> >
> > It is not supposed to. The idea is that if a driver like intel_idle is used, it
> > can call 'set_mwait_play_dead_hint()' and provide the most optimal hint number
> > for the offline code.
>
> There are two important nuggets in there:
>
> First, an idle driver can but is not required to set the hint.  This
> would be good comment material.
>
> Second, only one thing is supposed to set the hint.  This is a good
> thing to WARN() about.
>
> > 4. Does it _need_ to be set?
> >
> > No. It is more of an optimization. But it is an important optimization which may
> > result in saving a lot of money in a datacenter.
> >
> > Typically using a "wrong" hint value is non-fatal, at least I did not see it
> > being fatal so far. The CPU will map it to some hardware C-state request, but
> > which one - depends on the "wrong" value and the CPU. It just may be sub-
> > optimal.
>
> OK, so this tells me we *don't* want some kind of:
>
>         WARN_ON(play_dead_mwait_hint == PLAY_DEAD_MWAIT_HINT_UNSET);
>
> warning.
>
> > 5. What's the behavior when it is not set?
> >
> > The offline code will fall-back to the generic non-architectural algorithm,
> > which provides correct results for all server platforms I dealt with since 2017.
> > It should provide the correct hint for most client platforms, as far as I am
> > aware.
> >
> > Sierra Forest Xeon is the first platform where the generic algorithm provides a
> > sub-optimal value 0x21. It is not fatal, just sub-optimal.
>
> What is the non-architectural algorithm?  Which Linux code are you
> referring to?
>
> > Note: I am working with Intel firmware team on having the FW "re-mapping" hint
> > 0x21 to hint 0x23, so that "unaware" Linux kernel also ends up with requesting
> > the deepest C-state for an offline CPU.
>
> That would be great as well.  Thanks for doing that!
>
> > 6. Who is responsible for calling this?
> >
> > The idea for now is that the intel_idle driver calls it.
> >
> > But in theory, in the future, any driver/platform code may call it if it "knows"
> > what's the most optimal hint, I suppose. I do not have a good example though.
>
> So let's look at how this works:
>
> void native_play_dead(void)
> {
> ...
>         mwait_play_dead();
>         if (cpuidle_play_dead())
>                 hlt_play_dead();
> }
>
> This _existing_ code has three different ways of playing dead (in this
> order of preference):
>
>  1. mwait
>  2. cpuidle
>  3. hlt
>
> It has (at least) two different mechanisms for telling which of these to
> call:
>
>   1. mwait has a bunch of built-in logic that will ensure the CPU
>      should use for playing dead.  If not, it does nothing and returns.
>   2. cpuidle_play_dead() (only used by acpi_idle_driver as far as I can
>      tell) will return an error if it does not support playing dead
>
> If 1 and 2 fail, then hlt_play_dead() gets called.
>
> But the really fun part of this is that idle driver is *called* here.

Currently, cpuidle_play_dead() is for the cases when MWAIT is not supported.

> The driver that is also responsible for overriding the mwait hint.

So no, intel_idle is not called there because it only uses MWAIT.

> So this series opts to have the boot code plumb the hint back into a
> basically undocumented global variable while also assuming that the
> system is *going* to use mwait.  It then does *nothing* with the
> callback just adjacent to the code it wants to modify.
>
> Seems rather spaghetti-like to me.
>
> To make it worse, go look at da6fa7ef67f0 ("x86/smpboot: Don't use
> mwait_play_dead() on AMD systems").  It hacks AMD-specific code in
> mwait_play_dead() just to force the cpuidle code to get called.
>
> What if we did this?  First, introduce a helper:
>
>         bool mwait_play_dead_with_hint(u32 hint)
>
> and then restructure native_play_dead() to look like this:
>
> static mwait_play_dead_generic(void)
> {
>         u32 hint = get_deepest_mwait_hint();
>
>         return mwait_play_dead_with_hint(hint);
> }
>
> void native_play_dead(void)
> {
>         bool used;
>
>         used = cpuidle_play_dead();
>         if (used)
>                 return;
>
>         used = mwait_play_dead_generic();
>         if (used)
>                 return;
>
>         hlt_play_dead();
> }
>
> If the cpuidle drivers want to use mwait with a different hint, they
> override the *EXISTING* drv->states[].enter_dead() functionality and
> call mwait_play_dead_with_hint() with their new hint.  Then they don't
> need to pass anything _over_ to the mwait code.
>
> Wouldn't something like that makes this all much more straightforward?

Well, except for one detail which is this beautiful thing in mwait_play_dead():

if (READ_ONCE(md->control) == CPUDEAD_MWAIT_KEXEC_HLT) {
    /*
     * Kexec is about to happen. Don't go back into mwait() as
     * the kexec kernel might overwrite text and data including
     * page tables and stack. So mwait() would resume when the
     * monitor cache line is written to and then the CPU goes
     * south due to overwritten text, page tables and stack.
     *
     * Note: This does _NOT_ protect against a stray MCE, NMI,
     * SMI. They will resume execution at the instruction
     * following the HLT instruction and run into the problem
     * which this is trying to prevent.
     */
    WRITE_ONCE(md->status, CPUDEAD_MWAIT_KEXEC_HLT);
    while(1)
         native_halt();

clearly referred to as a kexec() hack, which cannot be done in
cpuidle_play_dead() because the cpuidle driver doesn't know how to get
to md->control.

And even if it did, it is kind of not its business to deal with this stuff.
Dave Hansen Oct. 30, 2024, 8:11 p.m. UTC | #6
On 10/30/24 12:53, Rafael J. Wysocki wrote:
> clearly referred to as a kexec() hack, which cannot be done in
> cpuidle_play_dead() because the cpuidle driver doesn't know how to get
> to md->control.

What if we have an mwait_play_dead() _helper_?  It takes the hint as an
argument and retains all the kexec hacks.  All the cpuidle driver has to
do is call the helper with the hint that the cpuidle driver determines.
Rafael J. Wysocki Oct. 30, 2024, 8:14 p.m. UTC | #7
On Wed, Oct 30, 2024 at 9:11 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 10/30/24 12:53, Rafael J. Wysocki wrote:
> > clearly referred to as a kexec() hack, which cannot be done in
> > cpuidle_play_dead() because the cpuidle driver doesn't know how to get
> > to md->control.
>
> What if we have an mwait_play_dead() _helper_?  It takes the hint as an
> argument and retains all the kexec hacks.  All the cpuidle driver has to
> do is call the helper with the hint that the cpuidle driver determines.

The same idea has occurred to me in the meantime, so yes, I think that
it would work.
Dave Hansen Oct. 30, 2024, 10:55 p.m. UTC | #8
On 10/30/24 06:33, Patryk Wlazlyn wrote:
>>> +void smp_set_mwait_play_dead_hint(unsigned int hint)
>>> +{
>>> +    WRITE_ONCE(play_dead_mwait_hint, hint);
>>> +}
>>
>> This all feels a bit hacky and unstructured to me.
>>
>> Could we at least set up a few rules here?  Like, say what the hints
>> are, what values can they have?  Where do they come from?  Can this get
>> called more than once?  Does it _need_ to be set?  What's the behavior
>> when it is not set?  Who is responsible for calling this?
> 
> The other idea is to first check if currently loaded idle driver provides
> enter_dead() callback first and leave the current, deepest mwait hint
> computation code as a fallback.
> 
> Does that sound less hacky?

Yes.

> Unfortunately, it comes with a little problem. In case of kexec, we need to
> have a way to exit from the mwait loop and enter hlt to prevent offlined CPU
> from crashing when the old memory is being overwritten.

Why is this a problem?  Like I mentioning to Rafael, just *call*
mwait_play_dead() from the idle driver.

mwait_play_dead() could probably also use some refactoring because it
has 3 pieces:

	1. Should the code run at all or defer to another play dead
	   implementation?
	2. Calculating the hint (obviously not needed in your new case)
	3. Actually running mwait (including the kexec hack)

The "should the code run?" bit is superfluous but harmless if called
from the idle driver.  Ditto on the hint calculation, but you already
factored it out.  That leaves the "actually run mwait" bit which you
100% need.

> I think, we can solve it by bringing the CPU back online before we proceed
> with kexec, but I would appreciate some feedback from someone who is more
> familiar with kexec, before merging that.
> 
> We may also signal that by touching the resched flag on which enter_dead()
> code will monitor in case of mwait and enter hlt right after, but that's a
> bit hackier IMO.

That route is also fine with me, but I'm not sure it's necessary.

>> What good does the smp_ prefix do?  I don't think _callers_ care whether
>> this is getting optimized out or not.
> 
> The prefix makes it a little bit cleaner by not exporting new global symbol
> with "set_mwait_play_dead_hint" name.

I'm not following.
diff mbox series

Patch

diff --git a/arch/x86/include/asm/smp.h b/arch/x86/include/asm/smp.h
index ca073f40698f..2cb083a84225 100644
--- a/arch/x86/include/asm/smp.h
+++ b/arch/x86/include/asm/smp.h
@@ -114,6 +114,7 @@  void wbinvd_on_cpu(int cpu);
 int wbinvd_on_all_cpus(void);
 
 void smp_kick_mwait_play_dead(void);
+void smp_set_mwait_play_dead_hint(unsigned int hint);
 
 void native_smp_send_reschedule(int cpu);
 void native_send_call_func_ipi(const struct cpumask *mask);
@@ -164,6 +165,8 @@  static inline struct cpumask *cpu_llc_shared_mask(int cpu)
 {
 	return (struct cpumask *)cpumask_of(0);
 }
+
+static inline void smp_set_mwait_play_dead_hint(unsigned int hint) { }
 #endif /* CONFIG_SMP */
 
 #ifdef CONFIG_DEBUG_NMI_SELFTEST
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 683898e3b20e..08f7b43f3fc3 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -127,6 +127,9 @@  int __read_mostly __max_smt_threads = 1;
 /* Flag to indicate if a complete sched domain rebuild is required */
 bool x86_topology_update;
 
+#define PLAY_DEAD_MWAIT_HINT_UNSET 0U
+static unsigned int __read_mostly play_dead_mwait_hint;
+
 int arch_update_cpu_topology(void)
 {
 	int retval = x86_topology_update;
@@ -1270,6 +1273,11 @@  void play_dead_common(void)
 	local_irq_disable();
 }
 
+void smp_set_mwait_play_dead_hint(unsigned int hint)
+{
+	WRITE_ONCE(play_dead_mwait_hint, hint);
+}
+
 /* Computes mwait hint for the deepest mwait hint based on cpuid leaf 0x5 */
 static inline unsigned int get_deepest_mwait_hint(void)
 {
@@ -1322,7 +1330,9 @@  static inline void mwait_play_dead(void)
 	if (__this_cpu_read(cpu_info.cpuid_level) < CPUID_MWAIT_LEAF)
 		return;
 
-	hint = get_deepest_mwait_hint();
+	hint = READ_ONCE(play_dead_mwait_hint);
+	if (hint == PLAY_DEAD_MWAIT_HINT_UNSET)
+		hint = get_deepest_mwait_hint();
 
 	/* Set up state for the kexec() hack below */
 	md->status = CPUDEAD_MWAIT_WAIT;