Message ID | 20241108122909.763663-3-patryk.wlazlyn@linux.intel.com (mailing list archive) |
---|---|
State | Changes Requested, archived |
Headers | show |
Series | SRF: Fix offline CPU preventing pc6 entry | expand |
On 11/8/24 04:29, Patryk Wlazlyn wrote: > The generic implementation, based on cpuid leaf 0x5, for looking up the > mwait hint for the deepest cstate, depends on them to be continuous in > range [0, NUM_SUBSTATES-1]. While that is correct on most Intel x86 > platforms, it is not architectural and may not result in reaching the > most optimized idle state on some of them. > > Prefer cpuidle_play_dead() over the generic mwait_play_dead() loop and > fallback to the later in case of missing enter_dead() handler. I don't think the bug has anything to do with this patch, really. There's no need to rehash it here. The issue here is that the only way to call mwait today is via mwait_play_dead() directly, using its internally-calculated hint. What you want is for a cpuidle-driver-calculated hint to be used. So, you're using the new hint function via the cpuidle driver. But you just need the cpuidle driver to be called first, not the old mwait_play_dead()-calculated hint. The new code will still do mwait, just via a different path and with a different hint. The thing this doesn't mention is what the impact on everyone else is. I _think_ the ACPI cpuidle driver is the only worry. Today, if there's a system that supports mwait and ACPI cpuidle, it'll use mwait. After this patch, it'll use ACPI cpuidle. The changelog doesn't mention that behavior change or why that's OK. Also, looking at this: > - mwait_play_dead(); > if (cpuidle_play_dead()) > - hlt_play_dead(); > + mwait_play_dead(); > + hlt_play_dead(); None of those return on success, right? Is there any reason this couldn't just be: /* The first successful play_dead() will not return: */ cpuidle_play_dead(); mwait_play_dead(); hlt_play_dead(); That has the added bonus of not needing to return anything from mwait_play_dead() and the resulting churn from the last patch.
> I don't think the bug has anything to do with this patch, really. > There's no need to rehash it here. > > The issue here is that the only way to call mwait today is via > mwait_play_dead() directly, using its internally-calculated hint. > > What you want is for a cpuidle-driver-calculated hint to be used. So, > you're using the new hint function via the cpuidle driver. But you just > need the cpuidle driver to be called first, not the old > mwait_play_dead()-calculated hint. The new code will still do mwait, > just via a different path and with a different hint. Ok. I'll just say that we change the order because idle driver may know better. > The thing this doesn't mention is what the impact on everyone else is. > I _think_ the ACPI cpuidle driver is the only worry. Today, if there's > a system that supports mwait and ACPI cpuidle, it'll use mwait. After > this patch, it'll use ACPI cpuidle. > > The changelog doesn't mention that behavior change or why that's OK. True, but I think the mwait_play_dead() is exclusively for Intel. Other target platforms get an early return. I'll include that in the commit message. > Also, looking at this: > >> - mwait_play_dead(); >> if (cpuidle_play_dead()) >> - hlt_play_dead(); >> + mwait_play_dead(); >> + hlt_play_dead(); > > None of those return on success, right? > > Is there any reason this couldn't just be: > > /* The first successful play_dead() will not return: */ > cpuidle_play_dead(); > mwait_play_dead(); > hlt_play_dead(); > > That has the added bonus of not needing to return anything from > mwait_play_dead() and the resulting churn from the last patch. mwait_play_dead may return if mwait_play_dead_with_hint returns and it only does on non-smp builds. That being said, we do ignore the return value right now, because in either case we want to enter hlt_play_dead() as a fallback, so I guess we can make mwait_play_dead return void, but leave mwait_play_dead_with_hint returning int or add ifdef CONFIG_SMP guards in intel_idle. When going with the return types proposed in this patch set, on non-smp builds intel_idle would call mwait_play_dead_with_hint() which would "return 1"; and propagate through cpuidle_play_dead().
On Fri, Nov 08, 2024 at 01:29:08PM +0100, Patryk Wlazlyn wrote: > The generic implementation, based on cpuid leaf 0x5, for looking up the > mwait hint for the deepest cstate, depends on them to be continuous in > range [0, NUM_SUBSTATES-1]. While that is correct on most Intel x86 > platforms, it is not architectural and may not result in reaching the > most optimized idle state on some of them. > > Prefer cpuidle_play_dead() over the generic mwait_play_dead() loop and > fallback to the later in case of missing enter_dead() handler. > > Signed-off-by: Patryk Wlazlyn <patryk.wlazlyn@linux.intel.com> > --- > arch/x86/kernel/smpboot.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c > index 44c40781bad6..721bb931181c 100644 > --- a/arch/x86/kernel/smpboot.c > +++ b/arch/x86/kernel/smpboot.c > @@ -1416,9 +1416,9 @@ void native_play_dead(void) > play_dead_common(); > tboot_shutdown(TB_SHUTDOWN_WFS); > > - mwait_play_dead(); > if (cpuidle_play_dead()) > - hlt_play_dead(); > + mwait_play_dead(); > + hlt_play_dead(); > } Yeah, I don't think so. we don't want to accidentally hit acpi_idle_play_dead().
On Tue, Nov 12, 2024 at 12:47 PM Peter Zijlstra <peterz@infradead.org> wrote: > > On Fri, Nov 08, 2024 at 01:29:08PM +0100, Patryk Wlazlyn wrote: > > The generic implementation, based on cpuid leaf 0x5, for looking up the > > mwait hint for the deepest cstate, depends on them to be continuous in > > range [0, NUM_SUBSTATES-1]. While that is correct on most Intel x86 > > platforms, it is not architectural and may not result in reaching the > > most optimized idle state on some of them. > > > > Prefer cpuidle_play_dead() over the generic mwait_play_dead() loop and > > fallback to the later in case of missing enter_dead() handler. > > > > Signed-off-by: Patryk Wlazlyn <patryk.wlazlyn@linux.intel.com> > > --- > > arch/x86/kernel/smpboot.c | 4 ++-- > > 1 file changed, 2 insertions(+), 2 deletions(-) > > > > diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c > > index 44c40781bad6..721bb931181c 100644 > > --- a/arch/x86/kernel/smpboot.c > > +++ b/arch/x86/kernel/smpboot.c > > @@ -1416,9 +1416,9 @@ void native_play_dead(void) > > play_dead_common(); > > tboot_shutdown(TB_SHUTDOWN_WFS); > > > > - mwait_play_dead(); > > if (cpuidle_play_dead()) > > - hlt_play_dead(); > > + mwait_play_dead(); > > + hlt_play_dead(); > > } > > Yeah, I don't think so. we don't want to accidentally hit > acpi_idle_play_dead(). Fair enough. Then we are back to the original approach though: https://lore.kernel.org/linux-pm/20241029101507.7188-3-patryk.wlazlyn@linux.intel.com/
On Tue, Nov 12, 2024 at 01:03:49PM +0100, Rafael J. Wysocki wrote: > On Tue, Nov 12, 2024 at 12:47 PM Peter Zijlstra <peterz@infradead.org> wrote: > > > > On Fri, Nov 08, 2024 at 01:29:08PM +0100, Patryk Wlazlyn wrote: > > > The generic implementation, based on cpuid leaf 0x5, for looking up the > > > mwait hint for the deepest cstate, depends on them to be continuous in > > > range [0, NUM_SUBSTATES-1]. While that is correct on most Intel x86 > > > platforms, it is not architectural and may not result in reaching the > > > most optimized idle state on some of them. > > > > > > Prefer cpuidle_play_dead() over the generic mwait_play_dead() loop and > > > fallback to the later in case of missing enter_dead() handler. > > > > > > Signed-off-by: Patryk Wlazlyn <patryk.wlazlyn@linux.intel.com> > > > --- > > > arch/x86/kernel/smpboot.c | 4 ++-- > > > 1 file changed, 2 insertions(+), 2 deletions(-) > > > > > > diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c > > > index 44c40781bad6..721bb931181c 100644 > > > --- a/arch/x86/kernel/smpboot.c > > > +++ b/arch/x86/kernel/smpboot.c > > > @@ -1416,9 +1416,9 @@ void native_play_dead(void) > > > play_dead_common(); > > > tboot_shutdown(TB_SHUTDOWN_WFS); > > > > > > - mwait_play_dead(); > > > if (cpuidle_play_dead()) > > > - hlt_play_dead(); > > > + mwait_play_dead(); > > > + hlt_play_dead(); > > > } > > > > Yeah, I don't think so. we don't want to accidentally hit > > acpi_idle_play_dead(). > > Fair enough. > > Then we are back to the original approach though: > > https://lore.kernel.org/linux-pm/20241029101507.7188-3-patryk.wlazlyn@linux.intel.com/ Well, that won't be brilliant for hybrid systems where the available states are different per CPU. Also, all of this is a bit of a trainwreck... AFAICT AMD wants IO based idle (per the 2018 commit). So they want the ACPI thing. But on Intel we really don't want HLT, and had that MWAIT, but that has real problems with KEXEC. And I don't think we can rely on INTEL_IDLE=y. The ACPI thing doesn't support FFh states for it's enter_dead(), should it? Anyway, ideally x86 would grow a new instruction to offline a CPU, both MWAIT and HLT have problems vs non-maskable interrupts. I really don't know what is best here, maybe moving that whole CPUID loop to boot, store the value in a per-cpu mwait_play_dead_hint. Have AMD explicitly clear the value, and avoid mwait when 0 -- hint 0 is equal to HLT anyway. But as said, we need a new instruction.
On Tue, Nov 12, 2024 at 1:18 PM Peter Zijlstra <peterz@infradead.org> wrote: > > On Tue, Nov 12, 2024 at 01:03:49PM +0100, Rafael J. Wysocki wrote: > > On Tue, Nov 12, 2024 at 12:47 PM Peter Zijlstra <peterz@infradead.org> wrote: > > > > > > On Fri, Nov 08, 2024 at 01:29:08PM +0100, Patryk Wlazlyn wrote: > > > > The generic implementation, based on cpuid leaf 0x5, for looking up the > > > > mwait hint for the deepest cstate, depends on them to be continuous in > > > > range [0, NUM_SUBSTATES-1]. While that is correct on most Intel x86 > > > > platforms, it is not architectural and may not result in reaching the > > > > most optimized idle state on some of them. > > > > > > > > Prefer cpuidle_play_dead() over the generic mwait_play_dead() loop and > > > > fallback to the later in case of missing enter_dead() handler. > > > > > > > > Signed-off-by: Patryk Wlazlyn <patryk.wlazlyn@linux.intel.com> > > > > --- > > > > arch/x86/kernel/smpboot.c | 4 ++-- > > > > 1 file changed, 2 insertions(+), 2 deletions(-) > > > > > > > > diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c > > > > index 44c40781bad6..721bb931181c 100644 > > > > --- a/arch/x86/kernel/smpboot.c > > > > +++ b/arch/x86/kernel/smpboot.c > > > > @@ -1416,9 +1416,9 @@ void native_play_dead(void) > > > > play_dead_common(); > > > > tboot_shutdown(TB_SHUTDOWN_WFS); > > > > > > > > - mwait_play_dead(); > > > > if (cpuidle_play_dead()) > > > > - hlt_play_dead(); > > > > + mwait_play_dead(); > > > > + hlt_play_dead(); > > > > } > > > > > > Yeah, I don't think so. we don't want to accidentally hit > > > acpi_idle_play_dead(). > > > > Fair enough. > > > > Then we are back to the original approach though: > > > > https://lore.kernel.org/linux-pm/20241029101507.7188-3-patryk.wlazlyn@linux.intel.com/ > > Well, that won't be brilliant for hybrid systems where the available > states are different per CPU. But they aren't. At least so far that has not been the case on any platform known to me and I'm not aware of any plans to make that happen (guess what, some other OSes may be unhappy). > Also, all of this is a bit of a trainwreck... AFAICT AMD wants IO based > idle (per the 2018 commit). So they want the ACPI thing. Yes. > But on Intel we really don't want HLT, and had that MWAIT, but that has > real problems with KEXEC. And I don't think we can rely on INTEL_IDLE=y. We could because it handles ACPI now and ACPI idle doesn't add any value on top of it except for the IO-based idle case. > The ACPI thing doesn't support FFh states for it's enter_dead(), should it? It does AFAICS, but the FFH is still MWAIT. > Anyway, ideally x86 would grow a new instruction to offline a CPU, both > MWAIT and HLT have problems vs non-maskable interrupts. > > I really don't know what is best here, maybe moving that whole CPUID > loop to boot, store the value in a per-cpu mwait_play_dead_hint. Have > AMD explicitly clear the value, and avoid mwait when 0 -- hint 0 is > equal to HLT anyway. > > But as said, we need a new instruction. Before that, there is the problem with the MWAIT hint computation in mwait_play_dead() and in fact intel_idle does know what hint to use in there.
On Tue, Nov 12, 2024 at 1:30 PM Rafael J. Wysocki <rafael@kernel.org> wrote: > > On Tue, Nov 12, 2024 at 1:18 PM Peter Zijlstra <peterz@infradead.org> wrote: > > > > On Tue, Nov 12, 2024 at 01:03:49PM +0100, Rafael J. Wysocki wrote: > > > On Tue, Nov 12, 2024 at 12:47 PM Peter Zijlstra <peterz@infradead.org> wrote: > > > > > > > > On Fri, Nov 08, 2024 at 01:29:08PM +0100, Patryk Wlazlyn wrote: > > > > > The generic implementation, based on cpuid leaf 0x5, for looking up the > > > > > mwait hint for the deepest cstate, depends on them to be continuous in > > > > > range [0, NUM_SUBSTATES-1]. While that is correct on most Intel x86 > > > > > platforms, it is not architectural and may not result in reaching the > > > > > most optimized idle state on some of them. > > > > > > > > > > Prefer cpuidle_play_dead() over the generic mwait_play_dead() loop and > > > > > fallback to the later in case of missing enter_dead() handler. > > > > > > > > > > Signed-off-by: Patryk Wlazlyn <patryk.wlazlyn@linux.intel.com> > > > > > --- > > > > > arch/x86/kernel/smpboot.c | 4 ++-- > > > > > 1 file changed, 2 insertions(+), 2 deletions(-) > > > > > > > > > > diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c > > > > > index 44c40781bad6..721bb931181c 100644 > > > > > --- a/arch/x86/kernel/smpboot.c > > > > > +++ b/arch/x86/kernel/smpboot.c > > > > > @@ -1416,9 +1416,9 @@ void native_play_dead(void) > > > > > play_dead_common(); > > > > > tboot_shutdown(TB_SHUTDOWN_WFS); > > > > > > > > > > - mwait_play_dead(); > > > > > if (cpuidle_play_dead()) > > > > > - hlt_play_dead(); > > > > > + mwait_play_dead(); > > > > > + hlt_play_dead(); > > > > > } > > > > > > > > Yeah, I don't think so. we don't want to accidentally hit > > > > acpi_idle_play_dead(). > > > > > > Fair enough. > > > > > > Then we are back to the original approach though: > > > > > > https://lore.kernel.org/linux-pm/20241029101507.7188-3-patryk.wlazlyn@linux.intel.com/ > > > > Well, that won't be brilliant for hybrid systems where the available > > states are different per CPU. > > But they aren't. > > At least so far that has not been the case on any platform known to me > and I'm not aware of any plans to make that happen (guess what, some > other OSes may be unhappy). > > > Also, all of this is a bit of a trainwreck... AFAICT AMD wants IO based > > idle (per the 2018 commit). So they want the ACPI thing. > > Yes. > > > But on Intel we really don't want HLT, and had that MWAIT, but that has > > real problems with KEXEC. And I don't think we can rely on INTEL_IDLE=y. > > We could because it handles ACPI now and ACPI idle doesn't add any > value on top of it except for the IO-based idle case. > > > The ACPI thing doesn't support FFh states for it's enter_dead(), should it? > > It does AFAICS, but the FFH is still MWAIT. Sorry, no, it doesn't. It might I guess, but it would have been dead code in the vast majority of configurations in the field.
On Tue, 2024-11-12 at 13:18 +0100, Peter Zijlstra wrote: > But on Intel we really don't want HLT, and had that MWAIT, but that has > real problems with KEXEC. And I don't think we can rely on INTEL_IDLE=y. If INTEL_IDLE is not set, then we'll just use existing mwait creation algorithm in 'mwait_play_dead()', which works too, just not ideal. > Anyway, ideally x86 would grow a new instruction to offline a CPU, both > MWAIT and HLT have problems vs non-maskable interrupts. ... snip ... > But as said, we need a new instruction. FYI, I already started discussing a special "gimme the deepest C-state" mwait hint - just a constant like 0xFF. CPUID leaf 5 has many reserved bits, one could be used for enumeration of this feature. But this is just a quick idea so far, and informal discussions so far. Artem.
On Tue, Nov 12, 2024 at 12:47 PM Peter Zijlstra <peterz@infradead.org> wrote: > > On Fri, Nov 08, 2024 at 01:29:08PM +0100, Patryk Wlazlyn wrote: > > The generic implementation, based on cpuid leaf 0x5, for looking up the > > mwait hint for the deepest cstate, depends on them to be continuous in > > range [0, NUM_SUBSTATES-1]. While that is correct on most Intel x86 > > platforms, it is not architectural and may not result in reaching the > > most optimized idle state on some of them. > > > > Prefer cpuidle_play_dead() over the generic mwait_play_dead() loop and > > fallback to the later in case of missing enter_dead() handler. > > > > Signed-off-by: Patryk Wlazlyn <patryk.wlazlyn@linux.intel.com> > > --- > > arch/x86/kernel/smpboot.c | 4 ++-- > > 1 file changed, 2 insertions(+), 2 deletions(-) > > > > diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c > > index 44c40781bad6..721bb931181c 100644 > > --- a/arch/x86/kernel/smpboot.c > > +++ b/arch/x86/kernel/smpboot.c > > @@ -1416,9 +1416,9 @@ void native_play_dead(void) > > play_dead_common(); > > tboot_shutdown(TB_SHUTDOWN_WFS); > > > > - mwait_play_dead(); > > if (cpuidle_play_dead()) > > - hlt_play_dead(); > > + mwait_play_dead(); > > + hlt_play_dead(); > > } > > Yeah, I don't think so. we don't want to accidentally hit > acpi_idle_play_dead(). Having inspected the code once again, I'm not sure what your concern is. :enter.dead() is set to acpi_idle_play_dead() for all states in ACPI idle - see acpi_processor_setup_cstates() and the role of the type check is to filter out bogus table entries (the "type" must be 1, 2, or 3 as per the spec). Then cpuidle_play_dead() calls drv->states[i].enter_dead() for the deepest state where it is set and if this is FFH, acpi_idle_play_dead() will return an error. So after the change, the code above will fall back to mwait_play_dead() then. Or am I missing anything?
On Tue, Nov 12, 2024 at 01:30:29PM +0100, Rafael J. Wysocki wrote: > > > Then we are back to the original approach though: > > > > > > https://lore.kernel.org/linux-pm/20241029101507.7188-3-patryk.wlazlyn@linux.intel.com/ > > > > Well, that won't be brilliant for hybrid systems where the available > > states are different per CPU. > > But they aren't. > > At least so far that has not been the case on any platform known to me > and I'm not aware of any plans to make that happen (guess what, some > other OSes may be unhappy). Well, that's something at least. > > Also, all of this is a bit of a trainwreck... AFAICT AMD wants IO based > > idle (per the 2018 commit). So they want the ACPI thing. > > Yes. > > > But on Intel we really don't want HLT, and had that MWAIT, but that has > > real problems with KEXEC. And I don't think we can rely on INTEL_IDLE=y. > > We could because it handles ACPI now and ACPI idle doesn't add any > value on top of it except for the IO-based idle case. You're saying we can mandate INTEL_IDLE=y? Because currently defconfig doesn't even have it on. > > The ACPI thing doesn't support FFh states for it's enter_dead(), should it? > > It does AFAICS, but the FFH is still MWAIT. What I'm trying to say is that acpi_idle_play_dead() doesn't seem to support FFh and as such won't ever use MWAIT. > > Anyway, ideally x86 would grow a new instruction to offline a CPU, both > > MWAIT and HLT have problems vs non-maskable interrupts. > > > > I really don't know what is best here, maybe moving that whole CPUID > > loop to boot, store the value in a per-cpu mwait_play_dead_hint. Have > > AMD explicitly clear the value, and avoid mwait when 0 -- hint 0 is > > equal to HLT anyway. > > > > But as said, we need a new instruction. > > Before that, there is the problem with the MWAIT hint computation in > mwait_play_dead() and in fact intel_idle does know what hint to use in > there. But we need to deal witn INTEL_IDLE=n. Also, I don't see any MWAIT_LEAF parsing in intel_idle.c. Yes, it requests the information, but then it mostly ignores it -- it only consumes two ECX bits or so. I don't see it finding a max-cstate from mwait_substates anywhere. So given we don't have any such code, why can't we simply fix the cstate parsing we have in mwait_play_dead() and call it a day?
On Tue, Nov 12, 2024 at 02:44:49PM +0200, Artem Bityutskiy wrote: > On Tue, 2024-11-12 at 13:18 +0100, Peter Zijlstra wrote: > > But on Intel we really don't want HLT, and had that MWAIT, but that has > > real problems with KEXEC. And I don't think we can rely on INTEL_IDLE=y. > > If INTEL_IDLE is not set, then we'll just use existing mwait creation algorithm > in 'mwait_play_dead()', which works too, just not ideal. So why not fix the substate detectoring function and ignore everything? > > Anyway, ideally x86 would grow a new instruction to offline a CPU, both > > MWAIT and HLT have problems vs non-maskable interrupts. > ... snip ... > > But as said, we need a new instruction. > > FYI, I already started discussing a special "gimme the deepest C-state" mwait > hint - just a constant like 0xFF. CPUID leaf 5 has many reserved bits, one could > be used for enumeration of this feature. > > But this is just a quick idea so far, and informal discussions so far. No, not mwait hint. We need an instruction that: - goes to deepest C state - drops into WAIT-for-Start-IPI (SIPI) Notably, it should not wake from: - random memory writes - NMI, MCE, SMI and other such non-maskable thingies - anything else -- the memory pointed to by RIP might no longer exist Lets call the instruction: DEAD. With the mwait 'hack', kexec still goes belly up if it gets a spurious NMI (and them others) at an inopportune time, and this does happen afaik. Just not enough to worry the data center guys like the mwait thing.
On Tue, Nov 12, 2024 at 02:23:14PM +0100, Rafael J. Wysocki wrote: > On Tue, Nov 12, 2024 at 12:47 PM Peter Zijlstra <peterz@infradead.org> wrote: > > > > On Fri, Nov 08, 2024 at 01:29:08PM +0100, Patryk Wlazlyn wrote: > > > The generic implementation, based on cpuid leaf 0x5, for looking up the > > > mwait hint for the deepest cstate, depends on them to be continuous in > > > range [0, NUM_SUBSTATES-1]. While that is correct on most Intel x86 > > > platforms, it is not architectural and may not result in reaching the > > > most optimized idle state on some of them. > > > > > > Prefer cpuidle_play_dead() over the generic mwait_play_dead() loop and > > > fallback to the later in case of missing enter_dead() handler. > > > > > > Signed-off-by: Patryk Wlazlyn <patryk.wlazlyn@linux.intel.com> > > > --- > > > arch/x86/kernel/smpboot.c | 4 ++-- > > > 1 file changed, 2 insertions(+), 2 deletions(-) > > > > > > diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c > > > index 44c40781bad6..721bb931181c 100644 > > > --- a/arch/x86/kernel/smpboot.c > > > +++ b/arch/x86/kernel/smpboot.c > > > @@ -1416,9 +1416,9 @@ void native_play_dead(void) > > > play_dead_common(); > > > tboot_shutdown(TB_SHUTDOWN_WFS); > > > > > > - mwait_play_dead(); > > > if (cpuidle_play_dead()) > > > - hlt_play_dead(); > > > + mwait_play_dead(); > > > + hlt_play_dead(); > > > } > > > > Yeah, I don't think so. we don't want to accidentally hit > > acpi_idle_play_dead(). > > Having inspected the code once again, I'm not sure what your concern is. > > :enter.dead() is set to acpi_idle_play_dead() for all states in ACPI > idle - see acpi_processor_setup_cstates() and the role of the type > check is to filter out bogus table entries (the "type" must be 1, 2, > or 3 as per the spec). > > Then cpuidle_play_dead() calls drv->states[i].enter_dead() for the > deepest state where it is set and if this is FFH, > acpi_idle_play_dead() will return an error. So after the change, the > code above will fall back to mwait_play_dead() then. > > Or am I missing anything? So it relies on there being a C2/C3 state enumerated and that being FFh. Otherwise it will find a 'working' state and we're up a creek. Typically I expect C2/C3 FFh states will be there on Intel stuff, but it seems awefully random to rely on this hole. AMD might unwittinly change the ACPI driver (they're the main user) and then we'd be up a creek. Robustly we'd teach the ACPI driver about FFh and set enter_dead on every state -- but we'd have to double check that with AMD. At the same time, intel_idle should then also set enter_dead on all states. And then the mwait case is only ever reached if CPUIDLE=n.
On Tue, Nov 12, 2024 at 2:50 PM Peter Zijlstra <peterz@infradead.org> wrote: > > On Tue, Nov 12, 2024 at 01:30:29PM +0100, Rafael J. Wysocki wrote: > > > > > Then we are back to the original approach though: > > > > > > > > https://lore.kernel.org/linux-pm/20241029101507.7188-3-patryk.wlazlyn@linux.intel.com/ > > > > > > Well, that won't be brilliant for hybrid systems where the available > > > states are different per CPU. > > > > But they aren't. > > > > At least so far that has not been the case on any platform known to me > > and I'm not aware of any plans to make that happen (guess what, some > > other OSes may be unhappy). > > Well, that's something at least. > > > > Also, all of this is a bit of a trainwreck... AFAICT AMD wants IO based > > > idle (per the 2018 commit). So they want the ACPI thing. > > > > Yes. > > > > > But on Intel we really don't want HLT, and had that MWAIT, but that has > > > real problems with KEXEC. And I don't think we can rely on INTEL_IDLE=y. > > > > We could because it handles ACPI now and ACPI idle doesn't add any > > value on top of it except for the IO-based idle case. > > You're saying we can mandate INTEL_IDLE=y? Because currently defconfig > doesn't even have it on. It is conceivable. > > > The ACPI thing doesn't support FFh states for it's enter_dead(), should it? > > > > It does AFAICS, but the FFH is still MWAIT. > > What I'm trying to say is that acpi_idle_play_dead() doesn't seem to > support FFh and as such won't ever use MWAIT. Right, but if it finds an FFH state deeper than C1, it will fall back to the next play_dead method. > > > Anyway, ideally x86 would grow a new instruction to offline a CPU, both > > > MWAIT and HLT have problems vs non-maskable interrupts. > > > > > > I really don't know what is best here, maybe moving that whole CPUID > > > loop to boot, store the value in a per-cpu mwait_play_dead_hint. Have > > > AMD explicitly clear the value, and avoid mwait when 0 -- hint 0 is > > > equal to HLT anyway. > > > > > > But as said, we need a new instruction. > > > > Before that, there is the problem with the MWAIT hint computation in > > mwait_play_dead() and in fact intel_idle does know what hint to use in > > there. > > But we need to deal witn INTEL_IDLE=n. Then the code would do what it is doing today, as a matter of fallback. > Also, I don't see any MWAIT_LEAF > parsing in intel_idle.c. Yes, it requests the information, but then it > mostly ignores it -- it only consumes two ECX bits or so. > > I don't see it finding a max-cstate from mwait_substates anywhere. No, it gets this either from _CST or from a built-in table for the given processor model. > So given we don't have any such code, why can't we simply fix the cstate > parsing we have in mwait_play_dead() and call it a day? I'll leave this one to Artem, but there is at least one reason to avoid doing that I know about: There is no guarantee that whatever has been found was actually validated.
On Tue, Nov 12, 2024 at 3:56 PM Peter Zijlstra <peterz@infradead.org> wrote: > > On Tue, Nov 12, 2024 at 02:23:14PM +0100, Rafael J. Wysocki wrote: > > On Tue, Nov 12, 2024 at 12:47 PM Peter Zijlstra <peterz@infradead.org> wrote: > > > > > > On Fri, Nov 08, 2024 at 01:29:08PM +0100, Patryk Wlazlyn wrote: > > > > The generic implementation, based on cpuid leaf 0x5, for looking up the > > > > mwait hint for the deepest cstate, depends on them to be continuous in > > > > range [0, NUM_SUBSTATES-1]. While that is correct on most Intel x86 > > > > platforms, it is not architectural and may not result in reaching the > > > > most optimized idle state on some of them. > > > > > > > > Prefer cpuidle_play_dead() over the generic mwait_play_dead() loop and > > > > fallback to the later in case of missing enter_dead() handler. > > > > > > > > Signed-off-by: Patryk Wlazlyn <patryk.wlazlyn@linux.intel.com> > > > > --- > > > > arch/x86/kernel/smpboot.c | 4 ++-- > > > > 1 file changed, 2 insertions(+), 2 deletions(-) > > > > > > > > diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c > > > > index 44c40781bad6..721bb931181c 100644 > > > > --- a/arch/x86/kernel/smpboot.c > > > > +++ b/arch/x86/kernel/smpboot.c > > > > @@ -1416,9 +1416,9 @@ void native_play_dead(void) > > > > play_dead_common(); > > > > tboot_shutdown(TB_SHUTDOWN_WFS); > > > > > > > > - mwait_play_dead(); > > > > if (cpuidle_play_dead()) > > > > - hlt_play_dead(); > > > > + mwait_play_dead(); > > > > + hlt_play_dead(); > > > > } > > > > > > Yeah, I don't think so. we don't want to accidentally hit > > > acpi_idle_play_dead(). > > > > Having inspected the code once again, I'm not sure what your concern is. > > > > :enter.dead() is set to acpi_idle_play_dead() for all states in ACPI > > idle - see acpi_processor_setup_cstates() and the role of the type > > check is to filter out bogus table entries (the "type" must be 1, 2, > > or 3 as per the spec). > > > > Then cpuidle_play_dead() calls drv->states[i].enter_dead() for the > > deepest state where it is set and if this is FFH, > > acpi_idle_play_dead() will return an error. So after the change, the > > code above will fall back to mwait_play_dead() then. > > > > Or am I missing anything? > > So it relies on there being a C2/C3 state enumerated and that being FFh. > Otherwise it will find a 'working' state and we're up a creek. > > Typically I expect C2/C3 FFh states will be there on Intel stuff, but it > seems awefully random to rely on this hole. AMD might unwittinly change > the ACPI driver (they're the main user) and then we'd be up a creek. > > Robustly we'd teach the ACPI driver about FFh and set enter_dead on > every state -- but we'd have to double check that with AMD. > > At the same time, intel_idle should then also set enter_dead on all > states. > > And then the mwait case is only ever reached if CPUIDLE=n. So that's why I would prefer intel_idle, if configured, to give mwait_play_dead() a hint on the MWAIT hint to use. Otherwise the latter would just fall back to the current method. This would not be bullet-proof, but it would take the opportunity to work better if it could.
On Tue, Nov 12, 2024 at 03:56:22PM +0100, Rafael J. Wysocki wrote: > > So given we don't have any such code, why can't we simply fix the cstate > > parsing we have in mwait_play_dead() and call it a day? > > I'll leave this one to Artem, but there is at least one reason to > avoid doing that I know about: There is no guarantee that whatever has > been found was actually validated. It's a bit daft to explicitly advertise a state in CPUID that's not validated. I realize that MSFT will likely only ever use the ACPI table, but at the same time, the CPUID bits and ACPI tables both come from the same BIOS image, no? .... and we've never seen the BIOS be self contradictory before ... /me runs
On Tue, Nov 12, 2024 at 4:08 PM Peter Zijlstra <peterz@infradead.org> wrote: > > On Tue, Nov 12, 2024 at 03:56:22PM +0100, Rafael J. Wysocki wrote: > > > > So given we don't have any such code, why can't we simply fix the cstate > > > parsing we have in mwait_play_dead() and call it a day? > > > > I'll leave this one to Artem, but there is at least one reason to > > avoid doing that I know about: There is no guarantee that whatever has > > been found was actually validated. > > It's a bit daft to explicitly advertise a state in CPUID that's not > validated. I realize that MSFT will likely only ever use the ACPI table, Right. > but at the same time, the CPUID bits and ACPI tables both come from the > same BIOS image, no? Yes, but the list of C-states advertised as supported in CPUID is usually longer than the _CST list size (at most 3) ...
On Tue, Nov 12, 2024 at 03:56:18PM +0100, Peter Zijlstra wrote: > On Tue, Nov 12, 2024 at 02:23:14PM +0100, Rafael J. Wysocki wrote: > > On Tue, Nov 12, 2024 at 12:47 PM Peter Zijlstra <peterz@infradead.org> wrote: > > > > > > On Fri, Nov 08, 2024 at 01:29:08PM +0100, Patryk Wlazlyn wrote: > > > > The generic implementation, based on cpuid leaf 0x5, for looking up the > > > > mwait hint for the deepest cstate, depends on them to be continuous in > > > > range [0, NUM_SUBSTATES-1]. While that is correct on most Intel x86 > > > > platforms, it is not architectural and may not result in reaching the > > > > most optimized idle state on some of them. > > > > > > > > Prefer cpuidle_play_dead() over the generic mwait_play_dead() loop and > > > > fallback to the later in case of missing enter_dead() handler. > > > > > > > > Signed-off-by: Patryk Wlazlyn <patryk.wlazlyn@linux.intel.com> > > > > --- > > > > arch/x86/kernel/smpboot.c | 4 ++-- > > > > 1 file changed, 2 insertions(+), 2 deletions(-) > > > > > > > > diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c > > > > index 44c40781bad6..721bb931181c 100644 > > > > --- a/arch/x86/kernel/smpboot.c > > > > +++ b/arch/x86/kernel/smpboot.c > > > > @@ -1416,9 +1416,9 @@ void native_play_dead(void) > > > > play_dead_common(); > > > > tboot_shutdown(TB_SHUTDOWN_WFS); > > > > > > > > - mwait_play_dead(); > > > > if (cpuidle_play_dead()) > > > > - hlt_play_dead(); > > > > + mwait_play_dead(); > > > > + hlt_play_dead(); > > > > } > > > > > > Yeah, I don't think so. we don't want to accidentally hit > > > acpi_idle_play_dead(). > > > > Having inspected the code once again, I'm not sure what your concern is. > > > > :enter.dead() is set to acpi_idle_play_dead() for all states in ACPI > > idle - see acpi_processor_setup_cstates() and the role of the type > > check is to filter out bogus table entries (the "type" must be 1, 2, > > or 3 as per the spec). > > > > Then cpuidle_play_dead() calls drv->states[i].enter_dead() for the > > deepest state where it is set and if this is FFH, > > acpi_idle_play_dead() will return an error. So after the change, the > > code above will fall back to mwait_play_dead() then. > > > > Or am I missing anything? > > So it relies on there being a C2/C3 state enumerated and that being FFh. > Otherwise it will find a 'working' state and we're up a creek. > > Typically I expect C2/C3 FFh states will be there on Intel stuff, but it > seems awefully random to rely on this hole. AMD might unwittinly change > the ACPI driver (they're the main user) and then we'd be up a creek. AMD platforms won't be using FFH based states for offlined CPUs. We prefer IO based states when available, and HLT otherwise. > > Robustly we'd teach the ACPI driver about FFh and set enter_dead on > every state -- but we'd have to double check that with AMD. Works for us as long as those FFh states aren't used for play_dead on AMD platforms. How about something like this (completely untested) ---------------------x8---------------------------------------------------- diff --git a/arch/x86/kernel/acpi/cstate.c b/arch/x86/kernel/acpi/cstate.c index f3ffd0a3a012..bd611771fa6c 100644 --- a/arch/x86/kernel/acpi/cstate.c +++ b/arch/x86/kernel/acpi/cstate.c @@ -215,6 +215,24 @@ void __cpuidle acpi_processor_ffh_cstate_enter(struct acpi_processor_cx *cx) } EXPORT_SYMBOL_GPL(acpi_processor_ffh_cstate_enter); +static int acpi_processor_ffh_play_dead(struct acpi_processor_cx *cx) +{ + unsigned int cpu = smp_processor_id(); + struct cstate_entry *percpu_entry; + + /* + * This is ugly. But AMD processors don't prefer MWAIT based + * C-states when processors are offlined. + */ + if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD || + boot_cpu_data.x86_vendor == X86_VENDOR_HYGON) + return -ENODEV; + + percpu_entry = per_cpu_ptr(cpu_cstate_entry, cpu); + return mwait_play_dead_with_hints(percpu_entry->states[cx->index].eax); +} +EXPORT_SYMBOL_GPL(acpi_processor_ffh_play_dead); + static int __init ffh_cstate_init(void) { struct cpuinfo_x86 *c = &boot_cpu_data; diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c index 831fa4a12159..c535f5df9081 100644 --- a/drivers/acpi/processor_idle.c +++ b/drivers/acpi/processor_idle.c @@ -590,6 +590,8 @@ static int acpi_idle_play_dead(struct cpuidle_device *dev, int index) raw_safe_halt(); else if (cx->entry_method == ACPI_CSTATE_SYSTEMIO) { io_idle(cx->address); + } else if (cx->entry_method == ACPI_CSTATE_FFH) { + return acpi_procesor_ffh_play_dead(cx); } else return -ENODEV; } diff --git a/include/acpi/processor.h b/include/acpi/processor.h index e6f6074eadbf..38329dcdd2b9 100644 --- a/include/acpi/processor.h +++ b/include/acpi/processor.h @@ -280,6 +280,7 @@ int acpi_processor_ffh_cstate_probe(unsigned int cpu, struct acpi_processor_cx *cx, struct acpi_power_register *reg); void acpi_processor_ffh_cstate_enter(struct acpi_processor_cx *cstate); +int acpi_processor_ffh_dead(struct acpi_processor_cx *cstate); #else static inline void acpi_processor_power_init_bm_check(struct acpi_processor_flags @@ -300,6 +301,11 @@ static inline void acpi_processor_ffh_cstate_enter(struct acpi_processor_cx { return; } + +static inline acpi_processor_ffh_dead(struct acpi_processor_cx *cstate) +{ + return -ENODEV; +} #endif static inline int call_on_cpu(int cpu, long (*fn)(void *), void *arg, ---------------------x8-------------------------------------------------------- -- Thanks and Regards gautham.
On 11/13/24 03:41, Gautham R. Shenoy wrote: > + /* > + * This is ugly. But AMD processors don't prefer MWAIT based > + * C-states when processors are offlined. > + */ > + if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD || > + boot_cpu_data.x86_vendor == X86_VENDOR_HYGON) > + return -ENODEV; Can we get an X86_FEATURE for this, please? Either a positive one: X86_FEATURE_MWAIT_OK_FOR_OFFLINE or a negative one: X86_FEATURE_MWAIT_BUSTED_FOR_OFFLINE ... with better names. Or even a helper. Because if you add this AMD||HYGON check, it'll be at _least_ the second one of these for the same logical reason.
On Wed, Nov 13, 2024 at 05:11:38PM +0530, Gautham R. Shenoy wrote: > How about something like this (completely untested) > > ---------------------x8---------------------------------------------------- > > diff --git a/arch/x86/kernel/acpi/cstate.c b/arch/x86/kernel/acpi/cstate.c > index f3ffd0a3a012..bd611771fa6c 100644 > --- a/arch/x86/kernel/acpi/cstate.c > +++ b/arch/x86/kernel/acpi/cstate.c > @@ -215,6 +215,24 @@ void __cpuidle acpi_processor_ffh_cstate_enter(struct acpi_processor_cx *cx) > } > EXPORT_SYMBOL_GPL(acpi_processor_ffh_cstate_enter); > > +static int acpi_processor_ffh_play_dead(struct acpi_processor_cx *cx) > +{ > + unsigned int cpu = smp_processor_id(); > + struct cstate_entry *percpu_entry; > + > + /* > + * This is ugly. But AMD processors don't prefer MWAIT based > + * C-states when processors are offlined. > + */ > + if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD || > + boot_cpu_data.x86_vendor == X86_VENDOR_HYGON) > + return -ENODEV; Are there AMD systems with FFh idle states at all? Also, I don't think this works right, the loop in cpuidle_play_dead() will terminate on this, and not try a shallower state which might have IO/HLT on. > + > + percpu_entry = per_cpu_ptr(cpu_cstate_entry, cpu); > + return mwait_play_dead_with_hints(percpu_entry->states[cx->index].eax); > +} > +EXPORT_SYMBOL_GPL(acpi_processor_ffh_play_dead);
On 11/13/2024 5:22 PM, Peter Zijlstra wrote: > On Wed, Nov 13, 2024 at 05:11:38PM +0530, Gautham R. Shenoy wrote: > >> How about something like this (completely untested) >> >> ---------------------x8---------------------------------------------------- >> >> diff --git a/arch/x86/kernel/acpi/cstate.c b/arch/x86/kernel/acpi/cstate.c >> index f3ffd0a3a012..bd611771fa6c 100644 >> --- a/arch/x86/kernel/acpi/cstate.c >> +++ b/arch/x86/kernel/acpi/cstate.c >> @@ -215,6 +215,24 @@ void __cpuidle acpi_processor_ffh_cstate_enter(struct acpi_processor_cx *cx) >> } >> EXPORT_SYMBOL_GPL(acpi_processor_ffh_cstate_enter); >> >> +static int acpi_processor_ffh_play_dead(struct acpi_processor_cx *cx) >> +{ >> + unsigned int cpu = smp_processor_id(); >> + struct cstate_entry *percpu_entry; >> + >> + /* >> + * This is ugly. But AMD processors don't prefer MWAIT based >> + * C-states when processors are offlined. >> + */ >> + if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD || >> + boot_cpu_data.x86_vendor == X86_VENDOR_HYGON) >> + return -ENODEV; > Are there AMD systems with FFh idle states at all? I don't know. > Also, I don't think this works right, the loop in cpuidle_play_dead() > will terminate on this, and not try a shallower state which might have > IO/HLT on. I think you are right. >> + >> + percpu_entry = per_cpu_ptr(cpu_cstate_entry, cpu); >> + return mwait_play_dead_with_hints(percpu_entry->states[cx->index].eax); >> +} >> +EXPORT_SYMBOL_GPL(acpi_processor_ffh_play_dead);
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c index 44c40781bad6..721bb931181c 100644 --- a/arch/x86/kernel/smpboot.c +++ b/arch/x86/kernel/smpboot.c @@ -1416,9 +1416,9 @@ void native_play_dead(void) play_dead_common(); tboot_shutdown(TB_SHUTDOWN_WFS); - mwait_play_dead(); if (cpuidle_play_dead()) - hlt_play_dead(); + mwait_play_dead(); + hlt_play_dead(); } #else /* ... !CONFIG_HOTPLUG_CPU */
The generic implementation, based on cpuid leaf 0x5, for looking up the mwait hint for the deepest cstate, depends on them to be continuous in range [0, NUM_SUBSTATES-1]. While that is correct on most Intel x86 platforms, it is not architectural and may not result in reaching the most optimized idle state on some of them. Prefer cpuidle_play_dead() over the generic mwait_play_dead() loop and fallback to the later in case of missing enter_dead() handler. Signed-off-by: Patryk Wlazlyn <patryk.wlazlyn@linux.intel.com> --- arch/x86/kernel/smpboot.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)