diff mbox series

[RESEND] Do not modify perf bias performance setting by default at boot

Message ID 6369897.qxlu8PgE1t@house (mailing list archive)
State Rejected, archived
Headers show
Series [RESEND] Do not modify perf bias performance setting by default at boot | expand

Commit Message

Thomas Renninger March 14, 2019, 2:42 p.m. UTC
This is a revert of mainline git commits:
commit b51ef52df71cb28e9d90cd1d48b79bf19f0bab06
commit 17edf2d79f1ea6dfdb4c444801d928953b9f98d6
commit abe48b108247e9b90b4c6739662a2e5c765ed114

It is about this kernel message showing up on quite a lot servers:
[    0.072652] ENERGY_PERF_BIAS: Set to 'normal', was 'performance'
[    0.076003] ENERGY_PERF_BIAS: View and update with x86_energy_perf_policy(8)

The background afaik were some BIOS which "forgot to set a default value".
Certainly some (one?) broken laptop BIOS was meant, because
the value zero (performance) is quite a reasonable default value on server systems
and this is how a lot CPUs typically come up there.

Another sever bug with above patches:
If you really set performance BIAS MSR value to 'performance'
as mentioned in kernel boot log, it is reset on next CPU offline/online cycle
(see below). Unfortunately one does not see this in the logs anymore, because
of the printk_once() usage:

dmesg |grep -i bias
[    0.026642] ENERGY_PERF_BIAS: Set to 'normal', was 'performance'
[    0.028002] ENERGY_PERF_BIAS: View and update with x86_energy_perf_policy(8)
liszt:~/:[0]# x86_energy_perf_policy 
cpu0: EPB 6
cpu1: EPB 6
cpu2: EPB 6
cpu3: EPB 6
liszt:~/:[0]# x86_energy_perf_policy 0
liszt:~/:[0]# x86_energy_perf_policy 
cpu0: EPB 0
cpu1: EPB 0
cpu2: EPB 0
cpu3: EPB 0
liszt:~/:[0]# echo offline > /sys/devices/system/cpu/cpu1/online
liszt:~/:[0]# echo online > /sys/devices/system/cpu/cpu1/online
liszt:~/:[0]# x86_energy_perf_policy 
cpu0: EPB 0
cpu1: EPB 6
cpu2: EPB 0
cpu3: EPB 0


Signed-off-by: Thomas Renninger <trenn@suse.de>
Tested-by: Simon Schricker <sschricker@suse.de>

Comments

Rafael J. Wysocki March 14, 2019, 10:08 p.m. UTC | #1
On Thu, Mar 14, 2019 at 3:42 PM Thomas Renninger <trenn@suse.de> wrote:
>
> This is a revert of mainline git commits:
> commit b51ef52df71cb28e9d90cd1d48b79bf19f0bab06
> commit 17edf2d79f1ea6dfdb4c444801d928953b9f98d6
> commit abe48b108247e9b90b4c6739662a2e5c765ed114

I'm not quite convinced that reverting these is the right thing to do here.

> It is about this kernel message showing up on quite a lot servers:
> [    0.072652] ENERGY_PERF_BIAS: Set to 'normal', was 'performance'
> [    0.076003] ENERGY_PERF_BIAS: View and update with x86_energy_perf_policy(8)

What happens on boot is a matter of convention and the convention by
which the EPB is set to "neutral" at that point has been used for
quite a while.  Changing it now may confuse some users or even worse.

It would be fine to change the log level of these messages in my view, though.

> The background afaik were some BIOS which "forgot to set a default value".
> Certainly some (one?) broken laptop BIOS was meant, because
> the value zero (performance) is quite a reasonable default value on server systems
> and this is how a lot CPUs typically come up there.

That depends on the expected power/perf profile of the system and the
kernel has no information regarding that.  Arguably, user space can
set the EPB to whatever it wants at the init time according to the
information it has.  It is not quite realistic to expect that either
the BIOS or the kernel will do the right thing on all systems in that
respect anyway.

> Another sever bug with above patches:
> If you really set performance BIAS MSR value to 'performance'
> as mentioned in kernel boot log, it is reset on next CPU offline/online cycle
> (see below).

That is a bug.  There is no reason to change the EPB on CPU
offline/online and I agree that this needs to be fixed.  The reverts
would make this problem go away, but that's not the only possible way
to fix it.

Moreover, what is done during system-wide resume appears to be a bug
too.  Whatever EPB value is there during suspend should be saved and
it should be restored during resume. Reverting the commits above will
not fix that particular issue.

Thanks,
Rafael
Thomas Renninger March 15, 2019, 3:36 p.m. UTC | #2
Hi Rafael,

On Thursday, March 14, 2019 11:08:03 PM CET Rafael J. Wysocki wrote:
> On Thu, Mar 14, 2019 at 3:42 PM Thomas Renninger <trenn@suse.de> wrote:
> > This is a revert of mainline git commits:
> > commit b51ef52df71cb28e9d90cd1d48b79bf19f0bab06
> > commit 17edf2d79f1ea6dfdb4c444801d928953b9f98d6
> > commit abe48b108247e9b90b4c6739662a2e5c765ed114
> 
> I'm not quite convinced that reverting these is the right thing to do here.

Ok, I try harder.
 
> > It is about this kernel message showing up on quite a lot servers:
> > [    0.072652] ENERGY_PERF_BIAS: Set to 'normal', was 'performance'
> > [    0.076003] ENERGY_PERF_BIAS: View and update with
> > x86_energy_perf_policy(8)
> What happens on boot is a matter of convention and the convention by
> which the EPB is set to "neutral" at that point has been used for
> quite a while.  Changing it now may confuse some users or even worse.

SUSE has this patch included for quite a while (in all kind of
SLE 12 SP flavors), possibly other distributions as well.
Unfortunately the patch got removed recently, because my commit (this one)
was ignored by Len a while ago.

I also like to see more examples of BIOSes not initializing this value.
This message is rare (so the setting back does not happen often, probably
only wrongly with latest BIOSes overriding intended BIOS performance settings 
on servers).

On my workstation the BIOS initializes perf bias to:
cpupower info
analyzing CPU 0:
perf-bias: 7

I could grep through quite some dozens of machines..., but these are mostly
servers and probably show either "zero"/"performance" or "6"/"normal" because
the Linux kernel overrides the INTENDED performance perf bias value to 6.

So we (SUSE) are going with this patch forever.
Otherwise we would run into a similar support nightmare we ran into, when
Intel decided to ignore CPU idle states as exported by BIOS through ACPI.
BIOS documentation of all big server vendors mentioned "performance" settings.
With a kernel update these BIOS C states settings have been ignored (some long 
latency once were not exported on purpose).

The list of breaking conventions and specifications is long...
People mostly blame the "bad BIOS developer". In this case things have been
broke by the kernel.
It's now (with the resume patch) broken in way, that "performance" setting
in the mainline kernel is unusable for years already.

Can we please get this finally properly fixed!

> That is a bug.  There is no reason to change the EPB on CPU
> offline/online and I agree that this needs to be fixed.  The reverts
> would make this problem go away, but that's not the only possible way
> to fix it.

It is the only proper way to fix this.
 
> Moreover, what is done during system-wide resume appears to be a bug
> too.  Whatever EPB value is there during suspend should be saved and
> it should be restored during resume. Reverting the commits above will
> not fix that particular issue.

Someone could have a look what happens on a recent system
after suspend (with this patch on top).
I do not have laptops to test suspend/resume with latest kernels.
These broken patches have been added a bit too quickly.

I doubt much BIOSes "forgot to initialize this value" and I have my doubts
that resume let's the CPU fall back to "uninitialized" value.
I more expect BIOS resets performance value after resume and the kernel is
doing it all wrong.
Hm, in fact this could be proofed by modifying perf bias value, do a suspend
resume cyle. If the value is kept we should avoid another try of complicating
things and simply reverting touching perf bias value altogether, correct?

BIOS probably does not store perf BIAS value and re-applies it after
suspend resume. Some might re-set it to the intial value (performance..., 
D'oh).

        Thomas
Rafael J. Wysocki March 18, 2019, 10:26 a.m. UTC | #3
On Fri, Mar 15, 2019 at 4:36 PM Thomas Renninger <trenn@suse.de> wrote:
>
> Hi Rafael,
>
> On Thursday, March 14, 2019 11:08:03 PM CET Rafael J. Wysocki wrote:
> > On Thu, Mar 14, 2019 at 3:42 PM Thomas Renninger <trenn@suse.de> wrote:
> > > This is a revert of mainline git commits:
> > > commit b51ef52df71cb28e9d90cd1d48b79bf19f0bab06
> > > commit 17edf2d79f1ea6dfdb4c444801d928953b9f98d6
> > > commit abe48b108247e9b90b4c6739662a2e5c765ed114
> >
> > I'm not quite convinced that reverting these is the right thing to do here.
>
> Ok, I try harder.
>
> > > It is about this kernel message showing up on quite a lot servers:
> > > [    0.072652] ENERGY_PERF_BIAS: Set to 'normal', was 'performance'
> > > [    0.076003] ENERGY_PERF_BIAS: View and update with
> > > x86_energy_perf_policy(8)
> > What happens on boot is a matter of convention and the convention by
> > which the EPB is set to "neutral" at that point has been used for
> > quite a while.  Changing it now may confuse some users or even worse.
>
> SUSE has this patch included for quite a while (in all kind of
> SLE 12 SP flavors), possibly other distributions as well.
> Unfortunately the patch got removed recently, because my commit (this one)
> was ignored by Len a while ago.
>
> I also like to see more examples of BIOSes not initializing this value.
> This message is rare (so the setting back does not happen often, probably
> only wrongly with latest BIOSes overriding intended BIOS performance settings
> on servers).
>
> On my workstation the BIOS initializes perf bias to:
> cpupower info
> analyzing CPU 0:
> perf-bias: 7
>
> I could grep through quite some dozens of machines..., but these are mostly
> servers and probably show either "zero"/"performance" or "6"/"normal" because
> the Linux kernel overrides the INTENDED performance perf bias value to 6.

The perf bias Intended by whom?

Yes, the kernel replaces whatever the original BIOS setting is with
its own one.  It may not match every setup perfectly, but at least it
is consistent.  Why exactly is it worse than whatever the BIOS has
set?

> So we (SUSE) are going with this patch forever.
>
> Otherwise we would run into a similar support nightmare we ran into, when
> Intel decided to ignore CPU idle states as exported by BIOS through ACPI.
> BIOS documentation of all big server vendors mentioned "performance" settings.
> With a kernel update these BIOS C states settings have been ignored (some long
> latency once were not exported on purpose).
>
> The list of breaking conventions and specifications is long...
> People mostly blame the "bad BIOS developer". In this case things have been
> broke by the kernel.

I agree that the kernel should not modify the EPB on system-wide
resume and on CPU online, but I don't see why changing the BIOS
setting at init time is a problem really.

> It's now (with the resume patch) broken in way, that "performance" setting

So the point seems to be that the BIOS setting should be preserved or
people are not able to configure the systems for performance through
setting things in the BIOS.  However, that only means that setting
things in the BIOS is not sufficient to configure a system for
performance and that has always been the case AFAICS, with or without
the EPB.

If you want to configure a system for performance, you need to do that
not just in the BIOS, but also in the OS (and, quite arguably, I would
expect the latter to be sufficient).

> in the mainline kernel is unusable for years already.
>
> Can we please get this finally properly fixed!
>
> > That is a bug.  There is no reason to change the EPB on CPU
> > offline/online and I agree that this needs to be fixed.  The reverts
> > would make this problem go away, but that's not the only possible way
> > to fix it.
>
> It is the only proper way to fix this.

I beg to differ.

The system-wide resume part will still not be working properly after
the reverts.

> > Moreover, what is done during system-wide resume appears to be a bug
> > too.  Whatever EPB value is there during suspend should be saved and
> > it should be restored during resume. Reverting the commits above will
> > not fix that particular issue.
>
> Someone could have a look what happens on a recent system
> after suspend (with this patch on top).
> I do not have laptops to test suspend/resume with latest kernels.
> These broken patches have been added a bit too quickly.
>
> I doubt much BIOSes "forgot to initialize this value" and I have my doubts
> that resume let's the CPU fall back to "uninitialized" value.
> I more expect BIOS resets performance value after resume and the kernel is
> doing it all wrong.
> Hm, in fact this could be proofed by modifying perf bias value, do a suspend
> resume cyle. If the value is kept we should avoid another try of complicating
> things and simply reverting touching perf bias value altogether, correct?

No, AFAICS.

If the EPB value has been modified from user space after initializing
the system, that new value should be preserved across system-side
suspend/resume (and hibernation) and also (arguably) across CPU
online/offline.  Not the BIOS-set value, but the new one set by user
space.  The only way to guarantee the preservation of it across
suspend/resume is to save it at suspend time and restore it during
resume.  Relying on the BIOS to avoid touching it may be a bit overly
optimistic.

> BIOS probably does not store perf BIAS value and re-applies it after
> suspend resume. Some might re-set it to the intial value (performance...,
> D'oh).

Right.

Thanks,
Rafael
Thomas Renninger March 18, 2019, 11:15 a.m. UTC | #4
On Monday, March 18, 2019 11:26:10 AM CET Rafael J. Wysocki wrote:
> On Fri, Mar 15, 2019 at 4:36 PM Thomas Renninger <trenn@suse.de> wrote:
> > Hi Rafael,
> >
 
...

> > On my workstation the BIOS initializes perf bias to:
> > cpupower info
> > analyzing CPU 0:
> > perf-bias: 7
> > 
> > I could grep through quite some dozens of machines..., but these are
> > mostly
> > servers and probably show either "zero"/"performance" or "6"/"normal"
> > because the Linux kernel overrides the INTENDED performance perf bias
> > value to 6.
> The perf bias Intended by whom?

The instance who should be in charge to set/init such a value:
BIOS?
 
> Yes, the kernel replaces whatever the original BIOS setting is with
> its own one.

No, it only replaces the "performance" (0) value with "normal" (6).
This does not makes sense and is broken.

> It may not match every setup perfectly, but at least it
> is consistent.  Why exactly is it worse than whatever the BIOS has
> set?

Because there may be BIOS settings for the CPU which justify initialization
of the Perf BIAS value by BIOS.

What sense does it make to unconditionally set perf BIAS value from
performance to balanced?
Why is this done?
 
> > So we (SUSE) are going with this patch forever.
> > 
> > Otherwise we would run into a similar support nightmare we ran into, when
> > Intel decided to ignore CPU idle states as exported by BIOS through ACPI.
> > BIOS documentation of all big server vendors mentioned "performance"
> > settings. With a kernel update these BIOS C states settings have been
> > ignored (some long latency once were not exported on purpose).
> > 
> > The list of breaking conventions and specifications is long...
> > People mostly blame the "bad BIOS developer". In this case things have
> > been
> > broke by the kernel.
> 
> I agree that the kernel should not modify the EPB on system-wide
> resume and on CPU online, but I don't see why changing the BIOS
> setting at init time is a problem really.

I would agree if we differ a tablet/laptop system and set the performance
value to normal/powersave.
And on a server we set it from normal/powersave to performance.

But we should not touch this value anyway.
Again: Why should the kernel touch it?

There may be BIOSes initialzing it via BIOS options. And this is a very valid 
thing to do.

> > It's now (with the resume patch) broken in way, that "performance" setting
> 
> So the point seems to be that the BIOS setting should be preserved or
> people are not able to configure the systems for performance through
> setting things in the BIOS.
> However, that only means that setting
> things in the BIOS is not sufficient to configure a system for
> performance and that has always been the case AFAICS, with or without
> the EPB.

?!?
If the kernel unconditionally, without documentation overrides such values...
(and in this case only because of a workaround of some buggy BIOSes not 
initialzing this value)...

> If you want to configure a system for performance, you need to do that
> not just in the BIOS, but also in the OS (and, quite arguably, I would
> expect the latter to be sufficient).

No. You must not ignore BIOS settings. Even worse, you must not override
these without any sane reason.

Your assumption above might be right. But we want to do it better, right?

...
 
> The system-wide resume part will still not be working properly after
> the reverts.

But it must never blindly (unconditionally) be set to specific value.
Correct?

You mean the kernel should store the pre-hibernation perf BIAS value
in NVRAM and write it back when waking up again?
This would make sense.

It would also mean perf BIAS never really worked, at least did not survive
suspend. On servers (no hibernation) it would works but is overridden
to a value you typically do not want to have on a server...
So the current situation is rather broken in the kernel.


      Thomas
Rafael J. Wysocki March 18, 2019, 11:40 a.m. UTC | #5
On Mon, Mar 18, 2019 at 12:15 PM Thomas Renninger <trenn@suse.de> wrote:
>
> On Monday, March 18, 2019 11:26:10 AM CET Rafael J. Wysocki wrote:
> > On Fri, Mar 15, 2019 at 4:36 PM Thomas Renninger <trenn@suse.de> wrote:
> > > Hi Rafael,
> > >
>
> ...
>
> > > On my workstation the BIOS initializes perf bias to:
> > > cpupower info
> > > analyzing CPU 0:
> > > perf-bias: 7
> > >
> > > I could grep through quite some dozens of machines..., but these are
> > > mostly
> > > servers and probably show either "zero"/"performance" or "6"/"normal"
> > > because the Linux kernel overrides the INTENDED performance perf bias
> > > value to 6.
> > The perf bias Intended by whom?
>
> The instance who should be in charge to set/init such a value:
> BIOS?

And who's BIOS, really?  I guess you mean the OEM?  Note, however,
that the user and the OEM may not agree on that, but whatever.

> > Yes, the kernel replaces whatever the original BIOS setting is with
> > its own one.
>
> No, it only replaces the "performance" (0) value with "normal" (6).
> This does not makes sense and is broken.

OK, fair enough.

I guess it would have been better to set it to 6 unconditionally.

What about the systems that will misbehave when it is left at 0?

> > It may not match every setup perfectly, but at least it
> > is consistent.  Why exactly is it worse than whatever the BIOS has
> > set?
>
> Because there may be BIOS settings for the CPU which justify initialization
> of the Perf BIAS value by BIOS.

Well, the EPB is there for users to set it via the OS.  The BIOS
setting is not guaranteed to work for all users anyway.

> What sense does it make to unconditionally set perf BIAS value from
> performance to balanced?
> Why is this done?

Basically, for HW health reasons AFAICS.

Apparently, on some systems EPB=0 is (was?) special and means (meant?)
very aggressive use of turbo etc. which is not healthy in general.

> > > So we (SUSE) are going with this patch forever.
> > >
> > > Otherwise we would run into a similar support nightmare we ran into, when
> > > Intel decided to ignore CPU idle states as exported by BIOS through ACPI.
> > > BIOS documentation of all big server vendors mentioned "performance"
> > > settings. With a kernel update these BIOS C states settings have been
> > > ignored (some long latency once were not exported on purpose).
> > >
> > > The list of breaking conventions and specifications is long...
> > > People mostly blame the "bad BIOS developer". In this case things have
> > > been
> > > broke by the kernel.
> >
> > I agree that the kernel should not modify the EPB on system-wide
> > resume and on CPU online, but I don't see why changing the BIOS
> > setting at init time is a problem really.
>
> I would agree if we differ a tablet/laptop system and set the performance
> value to normal/powersave.
> And on a server we set it from normal/powersave to performance.
>
> But we should not touch this value anyway.
> Again: Why should the kernel touch it?
>
> There may be BIOSes initialzing it via BIOS options. And this is a very valid
> thing to do.

Yes, and there may be BIOSes leaving it at 0 with the assumption that
the OS will adjust it.  The kernel cannot know which is the case.

> > > It's now (with the resume patch) broken in way, that "performance" setting
> >
> > So the point seems to be that the BIOS setting should be preserved or
> > people are not able to configure the systems for performance through
> > setting things in the BIOS.
> > However, that only means that setting
> > things in the BIOS is not sufficient to configure a system for
> > performance and that has always been the case AFAICS, with or without
> > the EPB.
>
> ?!?
> If the kernel unconditionally, without documentation overrides such values...
> (and in this case only because of a workaround of some buggy BIOSes not
> initialzing this value)...

I acknowledge that the lack of documentation is a problem.

> > If you want to configure a system for performance, you need to do that
> > not just in the BIOS, but also in the OS (and, quite arguably, I would
> > expect the latter to be sufficient).
>
> No. You must not ignore BIOS settings. Even worse, you must not override
> these without any sane reason.

While there are BIOS settings that better should not be overridden,
the EPB is not one of them.

> Your assumption above might be right. But we want to do it better, right?
>
> ...
>
> > The system-wide resume part will still not be working properly after
> > the reverts.
>
> But it must never blindly (unconditionally) be set to specific value.
> Correct?

Yes.  I've already said that.

> You mean the kernel should store the pre-hibernation perf BIAS value
> in NVRAM and write it back when waking up again?

Or in the image and yes, it should write it back.

> This would make sense.
>
> It would also mean perf BIAS never really worked, at least did not survive
> suspend.

Right.

> On servers (no hibernation) it would works but is overridden
> to a value you typically do not want to have on a server...
> So the current situation is rather broken in the kernel.

Well, you can say so, but fixing it really means something more than
reverting the commits that your patch is reverting and *that* is my
point.

Yes, I think that this needs to be fixed.

No, I don't think that the reverts you are proposing are the way to go here.

Thanks,
Rafael
Thomas Renninger March 18, 2019, 1:22 p.m. UTC | #6
On Monday, March 18, 2019 12:40:46 PM CET Rafael J. Wysocki wrote:
> On Mon, Mar 18, 2019 at 12:15 PM Thomas Renninger <trenn@suse.de> wrote:

...
 
> And who's BIOS, really?  I guess you mean the OEM?  Note, however,
> that the user and the OEM may not agree on that, but whatever.

I mean both!
The OEM.
And the user who might choose a "performance" BIOS setting.

> > > Yes, the kernel replaces whatever the original BIOS setting is with
> > > its own one.
> > 
> > No, it only replaces the "performance" (0) value with "normal" (6).
> > This does not makes sense and is broken.

Both know it better than this...

> OK, fair enough.
> 
> I guess it would have been better to set it to 6 unconditionally.

No, I guess keeping the original value is the only sane thing to do.
Sorry, Rafael, I have to insist on this.
There might be secrets Len and you cannot share with public, but you
should at least explain this privately then in a way that the current
code could somehow make sense. I absolutely cannot see that.

> What about the systems that will misbehave when it is left at 0?

Good question.
What exactly happens there? The CPU is faster in general?
And may consume a tiny bit more of power.
I had reports that Turbo Mode is not entered on some server CPUs when
perf BIAS is switched back to normal mode.
It is hard to identify what Intel implemented in microcode, but it's not
that much altogether. It would be nice if Intel would be a bit more verbose
about this and show some performance vs powersave figures.

> > > It may not match every setup perfectly, but at least it
> > > is consistent.  Why exactly is it worse than whatever the BIOS has
> > > set?
> > 
> > Because there may be BIOS settings for the CPU which justify
> > initialization
> > of the Perf BIAS value by BIOS.
> 
> Well, the EPB is there for users to set it via the OS.  The BIOS
> setting is not guaranteed to work for all users anyway.

Who says that? Is this documented?

I would say it is exactly the other way around. Energy Perf BIAS hint is
a MSR which must be changed in Ring 0 kernel environment.

x86_energy_perf_policy is a nice tool to play with and to try out what
this value is for. But for example in secure boot mode userspace must not
modify this HW setting in any way.
So it would not be possible for a secure booted server to switch this
setting back to performance mode.
 
> > What sense does it make to unconditionally set perf BIAS value from
> > performance to balanced?
> > Why is this done?
> 
> Basically, for HW health reasons AFAICS.
>  
> Apparently, on some systems EPB=0 is (was?) special and means (meant?)
> very aggressive use of turbo etc. which is not healthy in general.

Hmm, maybe you want to explain this privately.
Performance is a valid setting.
x86_energy_perf_policy tool has a "performance" string for this and I expect it is very used.
I would switch to it if I am constantly connected to AC.
man x86_energy_perf_policy
also does not tell the user anything about "be careful with performance setting"

And until today every CPU online/offline cycle or machine resume cycle switches
the value back to performance (if kernel tried to switch to 6).

...
 
> > There may be BIOSes initialzing it via BIOS options. And this is a very
> > valid thing to do.
> 
> Yes, and there may be BIOSes leaving it at 0 with the assumption that
> the OS will adjust it.  The kernel cannot know which is the case.

Correct. The kernel cannot know.

You know this better than anyone else:
- A specific Microsoft version is doing it wrong.
- We adopt.
- Microsoft is doing it correctly with the next version
- We are busted

I am ok with leaving the message as a hint, I would still check
ACPI prefered profile variable that we are not running on a server.
On these systems the message would be wrong rather sure and performance is intended.

...

> > No. You must not ignore BIOS settings. Even worse, you must not override
> > these without any sane reason.
> 
> While there are BIOS settings that better should not be overridden,
> the EPB is not one of them.

Again: Why or where is this documented?

> > Your assumption above might be right. But we want to do it better, right?
> > 
> > ...
> > 
> > > The system-wide resume part will still not be working properly after
> > > the reverts.
> > 
> > But it must never blindly (unconditionally) be set to specific value.
> > Correct?
> 
> Yes.  I've already said that.

How about below solution then for the initial boot up sequence?
 
> > You mean the kernel should store the pre-hibernation perf BIAS value
> > in NVRAM and write it back when waking up again?
> 
> Or in the image and yes, it should write it back.

Can you point me to a similar case where a variable is stored through
hibernation?
If it's not too much I can try to come up with an (compiled but untested) patch.

> > This would make sense.
> > 
> > It would also mean perf BIAS never really worked, at least did not survive
> > suspend.
> 
> Right.
> 
> > On servers (no hibernation) it would works but is overridden
> > to a value you typically do not want to have on a server...
> > So the current situation is rather broken in the kernel.
> 
> Well, you can say so, but fixing it really means something more than
> reverting the commits that your patch is reverting and *that* is my
> point.
> 
> Yes, I think that this needs to be fixed.
> 
> No, I don't think that the reverts you are proposing are the way to go here.

Why?
It is the first step to go.

You are missing the save and restore part?
But these never existed anyway.

       Thomas


Patch compile tested only on a specific .config as an RFC:

--- arch/x86/kernel/cpu/intel.c.orig	2019-03-18 13:41:08.380418119 +0100
+++ arch/x86/kernel/cpu/intel.c	2019-03-18 14:03:40.092490953 +0100
@@ -9,6 +9,7 @@
 #include <linux/thread_info.h>
 #include <linux/init.h>
 #include <linux/uaccess.h>
+#include <linux/acpi.h>
 
 #include <asm/cpufeature.h>
 #include <asm/pgtable.h>
@@ -611,10 +612,12 @@
 	if ((epb & 0xF) != ENERGY_PERF_BIAS_PERFORMANCE)
 		return;
 
-	pr_warn_once("ENERGY_PERF_BIAS: Set to 'normal', was 'performance'\n");
-	pr_warn_once("ENERGY_PERF_BIAS: View and update with x86_energy_perf_policy(8)\n");
-	epb = (epb & ~0xF) | ENERGY_PERF_BIAS_NORMAL;
-	wrmsrl(MSR_IA32_ENERGY_PERF_BIAS, epb);
+	if (acpi_gbl_FADT.preferred_profile == PM_PERFORMANCE_SERVER ||
+	    acpi_gbl_FADT.preferred_profile == PM_ENTERPRISE_SERVER)
+		return;
+
+	pr_info_once("ENERGY_PERF_BIAS: has value: 'performance'\n");
+	pr_info_once("ENERGY_PERF_BIAS: View and update with x86_energy_perf_policy(8)\n");
 }
 
 static void intel_bsp_resume(struct cpuinfo_x86 *c)
Rafael J. Wysocki March 18, 2019, 10:57 p.m. UTC | #7
On Mon, Mar 18, 2019 at 2:22 PM Thomas Renninger <trenn@suse.de> wrote:
>
> On Monday, March 18, 2019 12:40:46 PM CET Rafael J. Wysocki wrote:
> > On Mon, Mar 18, 2019 at 12:15 PM Thomas Renninger <trenn@suse.de> wrote:
>
> ...
>
> > And who's BIOS, really?  I guess you mean the OEM?  Note, however,
> > that the user and the OEM may not agree on that, but whatever.
>
> I mean both!
> The OEM.
> And the user who might choose a "performance" BIOS setting.
>
> > > > Yes, the kernel replaces whatever the original BIOS setting is with
> > > > its own one.
> > >
> > > No, it only replaces the "performance" (0) value with "normal" (6).
> > > This does not makes sense and is broken.
>
> Both know it better than this...
>
> > OK, fair enough.
> >
> > I guess it would have been better to set it to 6 unconditionally.
>
> No, I guess keeping the original value is the only sane thing to do.
> Sorry, Rafael, I have to insist on this.

Which doesn't make your argument more convincing at all.

> There might be secrets Len and you cannot share with public, but you
> should at least explain this privately then in a way that the current
> code could somehow make sense. I absolutely cannot see that.

Anyway, I just wanted to say that using the same initial value in all
cases would have been more consistent than what the current code does.

> > What about the systems that will misbehave when it is left at 0?
>
> Good question.
> What exactly happens there? The CPU is faster in general?
> And may consume a tiny bit more of power.
> I had reports that Turbo Mode is not entered on some server CPUs when
> perf BIAS is switched back to normal mode.
> It is hard to identify what Intel implemented in microcode, but it's not
> that much altogether. It would be nice if Intel would be a bit more verbose
> about this and show some performance vs powersave figures.

Can you please stop insinuating things? That really doesn't help or
make the discussion clearer.

> > > > It may not match every setup perfectly, but at least it
> > > > is consistent.  Why exactly is it worse than whatever the BIOS has
> > > > set?
> > >
> > > Because there may be BIOS settings for the CPU which justify
> > > initialization
> > > of the Perf BIAS value by BIOS.
> >
> > Well, the EPB is there for users to set it via the OS.  The BIOS
> > setting is not guaranteed to work for all users anyway.
>
> Who says that?

I do.

If users have different needs, the single BIOS-provided value may not
work for all of them just as well as the value set by the kernel
doesn't.  IOW, it really doesn't matter who sets the initial value,
the BIOS or the kernel.  In both cases it may not work for everybody.

> Is this documented?
>
> I would say it is exactly the other way around. Energy Perf BIAS hint is
> a MSR which must be changed in Ring 0 kernel environment.
>
> x86_energy_perf_policy is a nice tool to play with and to try out what
> this value is for. But for example in secure boot mode userspace must not
> modify this HW setting in any way.
> So it would not be possible for a secure booted server to switch this
> setting back to performance mode.

I suspect that this was the real motivation for the reverts. :-)

Well, if x86_energy_perf_policy doesn't work in hardened environments,
then this is a problem, but making all systems use BIOS-provided
values is not the most straightforward way to address it in my view.

> > > What sense does it make to unconditionally set perf BIAS value from
> > > performance to balanced?
> > > Why is this done?
> >
> > Basically, for HW health reasons AFAICS.
> >
> > Apparently, on some systems EPB=0 is (was?) special and means (meant?)
> > very aggressive use of turbo etc. which is not healthy in general.
>
> Hmm, maybe you want to explain this privately.
> Performance is a valid setting.
> x86_energy_perf_policy tool has a "performance" string for this and I expect it is very used.
> I would switch to it if I am constantly connected to AC.
> man x86_energy_perf_policy
> also does not tell the user anything about "be careful with performance setting"
>
> And until today every CPU online/offline cycle or machine resume cycle switches
> the value back to performance (if kernel tried to switch to 6).
>
> ...
>
> > > There may be BIOSes initialzing it via BIOS options. And this is a very
> > > valid thing to do.
> >
> > Yes, and there may be BIOSes leaving it at 0 with the assumption that
> > the OS will adjust it.  The kernel cannot know which is the case.
>
> Correct. The kernel cannot know.
>
> You know this better than anyone else:
> - A specific Microsoft version is doing it wrong.
> - We adopt.
> - Microsoft is doing it correctly with the next version
> - We are busted
>
> I am ok with leaving the message as a hint, I would still check
> ACPI prefered profile variable that we are not running on a server.
> On these systems the message would be wrong rather sure and performance is intended.

So yes, ACPI profiles could be used in there as they also are used
elsewhere in a similar context.

>
> ...
>
> > > No. You must not ignore BIOS settings. Even worse, you must not override
> > > these without any sane reason.
> >
> > While there are BIOS settings that better should not be overridden,
> > the EPB is not one of them.
>
> Again: Why or where is this documented?

In the SDM if you will, under "Performance and Energy Bias Hint support":

"Software can use whatever criteria it sees fit to program the MSR
with an appropriate value."

If that doesn't mean "Yes, the OS can override the BIOS setting", then
I'm not sure what would.

> > > Your assumption above might be right. But we want to do it better, right?
> > >
> > > ...
> > >
> > > > The system-wide resume part will still not be working properly after
> > > > the reverts.
> > >
> > > But it must never blindly (unconditionally) be set to specific value.
> > > Correct?
> >
> > Yes.  I've already said that.
>
> How about below solution then for the initial boot up sequence?

The MSR would never be written to with that change, is that intentional?

> > > You mean the kernel should store the pre-hibernation perf BIAS value
> > > in NVRAM and write it back when waking up again?
> >
> > Or in the image and yes, it should write it back.
>
> Can you point me to a similar case where a variable is stored through
> hibernation?
> If it's not too much I can try to come up with an (compiled but untested) patch.
>
> > > This would make sense.
> > >
> > > It would also mean perf BIAS never really worked, at least did not survive
> > > suspend.
> >
> > Right.
> >
> > > On servers (no hibernation) it would works but is overridden
> > > to a value you typically do not want to have on a server...
> > > So the current situation is rather broken in the kernel.
> >
> > Well, you can say so, but fixing it really means something more than
> > reverting the commits that your patch is reverting and *that* is my
> > point.
> >
> > Yes, I think that this needs to be fixed.
> >
> > No, I don't think that the reverts you are proposing are the way to go here.
>
> Why?
> It is the first step to go.

Well, I'll just go and fix the real problems here I guess, that will
be way more productive than just talking about them ...

And then we can get back to the initial setting discussion.
Thomas Renninger March 20, 2019, 2:17 p.m. UTC | #8
Rafael,

I top post the general things and answer in only a few sentences embedded in 
context below:

I very much honour your work and your neutral opinions and reasoning and
I always have.

This patch is a resend and while I try to come up with alternative hacks, 
there still is no solution, not even a suggestion.

I sent the patch 3 years ago:
https://lkml.org/lkml/2016/2/26/675
And I found this when doing performance analysis with Mel (Gorman).

This time Hannes (Reinicke) stumbled over it, while he was working on 
performance tests on NVE over fabrics.

We need this fixed and I am going to repush this into our kernel(s) now.

On Monday, March 18, 2019 11:57:08 PM CET Rafael J. Wysocki wrote:
> On Mon, Mar 18, 2019 at 2:22 PM Thomas Renninger <trenn@suse.de> wrote:
> > On Monday, March 18, 2019 12:40:46 PM CET Rafael J. Wysocki wrote:
> > > On Mon, Mar 18, 2019 at 12:15 PM Thomas Renninger <trenn@suse.de> wrote:
> > ...

...

> > > > > It may not match every setup perfectly, but at least it
> > > > > is consistent.  Why exactly is it worse than whatever the BIOS has
> > > > > set?
> > > > 
> > > > Because there may be BIOS settings for the CPU which justify
> > > > initialization
> > > > of the Perf BIAS value by BIOS.
> > > 
> > > Well, the EPB is there for users to set it via the OS.  The BIOS
> > > setting is not guaranteed to work for all users anyway.
> > 
> > Who says that?
> 
> I do.

And this is the reason you do not see much patches from myself anymore over
the last years.
It's certainly not your fault. I had quite some discussions with Len about
specification and BIOS breakages.

Especially in the CPU powersave area, idle states and cpufreq drivers, Intel
was doing it differently all time long the last 5 years. Ignoring their own
specifications, ignoring possible BIOS settings and changing kernel and 
userspace interfaces all the time.

And now I have the discussion again...
While it is related to this patch, it gets off topic.
I guess there should be a more general thread on lkml:
"Do not change APIs every second day"
Up to userspace, but also to BIOS.

> And then we can get back to the initial setting discussion.

Let's stick to this topic in this thread.
There is no reason to not find a proper fix for this meanwhile.

Overriding the BIOS setting should IMO only take place:
- if lifetime of CPU could be affected as you mentioned. But in this case the
  affected CPUs should be matched
- if we expect that there are BIOSes which "want to set this value to 6,
  but may have forgotten to do so", as mentioned in the original patch
  from Len. BIOSes where this is the case should get a quirk to set it to 6.

Still, ideally the whole overriding and the message should vanish IMO.

Last time I sent this patch your answer was:
"I need to talk to Len about that,"
https://lkml.org/lkml/2016/3/4/371

But as this is x86 kernel core code, I guess this should be discussed and 
pushed by the general x86 maintainers anyway. Sigh.

     Thomas
Rafael J. Wysocki March 20, 2019, 10:18 p.m. UTC | #9
On Wed, Mar 20, 2019 at 3:17 PM Thomas Renninger <trenn@suse.de> wrote:
>
> Rafael,
>
> I top post the general things and answer in only a few sentences embedded in
> context below:
>
> I very much honour your work and your neutral opinions and reasoning and
> I always have.
>
> This patch is a resend and while I try to come up with alternative hacks,
> there still is no solution, not even a suggestion.

I have a patch to rework the EPB handling to avoid the offline/online
and suspend/resume issues which I'm going to post shortly.  It doesn't
change the current behavior on the first CPU bring-up, however, so
feel free to adjust it to your needs.

> I sent the patch 3 years ago:
> https://lkml.org/lkml/2016/2/26/675
> And I found this when doing performance analysis with Mel (Gorman).
>
> This time Hannes (Reinicke) stumbled over it, while he was working on
> performance tests on NVE over fabrics.
>
> We need this fixed and I am going to repush this into our kernel(s) now.

I'd recommend to wait for a while with that.

> On Monday, March 18, 2019 11:57:08 PM CET Rafael J. Wysocki wrote:
> > On Mon, Mar 18, 2019 at 2:22 PM Thomas Renninger <trenn@suse.de> wrote:
> > > On Monday, March 18, 2019 12:40:46 PM CET Rafael J. Wysocki wrote:
> > > > On Mon, Mar 18, 2019 at 12:15 PM Thomas Renninger <trenn@suse.de> wrote:
> > > ...
>
> ...
>
> > > > > > It may not match every setup perfectly, but at least it
> > > > > > is consistent.  Why exactly is it worse than whatever the BIOS has
> > > > > > set?
> > > > >
> > > > > Because there may be BIOS settings for the CPU which justify
> > > > > initialization
> > > > > of the Perf BIAS value by BIOS.
> > > >
> > > > Well, the EPB is there for users to set it via the OS.  The BIOS
> > > > setting is not guaranteed to work for all users anyway.
> > >
> > > Who says that?
> >
> > I do.
>
> And this is the reason you do not see much patches from myself anymore over
> the last years.
> It's certainly not your fault. I had quite some discussions with Len about
> specification and BIOS breakages.
>
> Especially in the CPU powersave area, idle states and cpufreq drivers, Intel
> was doing it differently all time long the last 5 years. Ignoring their own
> specifications, ignoring possible BIOS settings and changing kernel and
> userspace interfaces all the time.

It looks like you are generally frustrated and hopefully this isn't my fault.

I would like you to be more specific, though.

> And now I have the discussion again...

What discussion?

> While it is related to this patch, it gets off topic.
> I guess there should be a more general thread on lkml:
> "Do not change APIs every second day"
> Up to userspace, but also to BIOS.

I have no idea what you mean here, sorry.

> > And then we can get back to the initial setting discussion.
>
> Let's stick to this topic in this thread.
> There is no reason to not find a proper fix for this meanwhile.
>
> Overriding the BIOS setting should IMO only take place:
> - if lifetime of CPU could be affected as you mentioned. But in this case the
>   affected CPUs should be matched
> - if we expect that there are BIOSes which "want to set this value to 6,
>   but may have forgotten to do so", as mentioned in the original patch
>   from Len. BIOSes where this is the case should get a quirk to set it to 6.

On all of the laptops in my office where EPB is present the initial
value of it is 0 and I don't honestly think that this is really
intentional.  It looks to me like they don't initialize the EPB at
all, which unfortunately is hard to distinguish from an intentional
'performance' setting.

>
> Still, ideally the whole overriding and the message should vanish IMO.

We may consider doing that, but let's fix clear bugs first.

> Last time I sent this patch your answer was:
> "I need to talk to Len about that,"
> https://lkml.org/lkml/2016/3/4/371
>
> But as this is x86 kernel core code, I guess this should be discussed and
> pushed by the general x86 maintainers anyway. Sigh.
diff mbox series

Patch

Index: do_not_modify_perf_bias/arch/x86/kernel/cpu/common.c
===================================================================
--- do_not_modify_perf_bias.orig/arch/x86/kernel/cpu/common.c	2019-03-13 17:33:06.849890801 +0100
+++ do_not_modify_perf_bias/arch/x86/kernel/cpu/common.c	2019-03-13 18:01:54.781983906 +0100
@@ -18,7 +18,6 @@ 
 #include <linux/kgdb.h>
 #include <linux/smp.h>
 #include <linux/io.h>
-#include <linux/syscore_ops.h>
 
 #include <asm/stackprotector.h>
 #include <asm/perf_event.h>
@@ -1864,23 +1863,6 @@  void cpu_init(void)
 }
 #endif
 
-static void bsp_resume(void)
-{
-	if (this_cpu->c_bsp_resume)
-		this_cpu->c_bsp_resume(&boot_cpu_data);
-}
-
-static struct syscore_ops cpu_syscore_ops = {
-	.resume		= bsp_resume,
-};
-
-static int __init init_cpu_syscore(void)
-{
-	register_syscore_ops(&cpu_syscore_ops);
-	return 0;
-}
-core_initcall(init_cpu_syscore);
-
 /*
  * The microcode loader calls this upon late microcode load to recheck features,
  * only when microcode has been updated. Caller holds microcode_mutex and CPU
Index: do_not_modify_perf_bias/arch/x86/kernel/cpu/cpu.h
===================================================================
--- do_not_modify_perf_bias.orig/arch/x86/kernel/cpu/cpu.h	2019-03-13 17:33:06.849890801 +0100
+++ do_not_modify_perf_bias/arch/x86/kernel/cpu/cpu.h	2019-03-13 18:01:53.973983863 +0100
@@ -14,7 +14,6 @@  struct cpu_dev {
 	void		(*c_init)(struct cpuinfo_x86 *);
 	void		(*c_identify)(struct cpuinfo_x86 *);
 	void		(*c_detect_tlb)(struct cpuinfo_x86 *);
-	void		(*c_bsp_resume)(struct cpuinfo_x86 *);
 	int		c_x86_vendor;
 #ifdef CONFIG_X86_32
 	/* Optional vendor specific routine to obtain the cache size. */
Index: do_not_modify_perf_bias/arch/x86/kernel/cpu/intel.c
===================================================================
--- do_not_modify_perf_bias.orig/arch/x86/kernel/cpu/intel.c	2019-03-13 17:33:06.853890801 +0100
+++ do_not_modify_perf_bias/arch/x86/kernel/cpu/intel.c	2019-03-13 18:01:52.789983799 +0100
@@ -596,36 +596,6 @@  detect_keyid_bits:
 	c->x86_phys_bits -= keyid_bits;
 }
 
-static void init_intel_energy_perf(struct cpuinfo_x86 *c)
-{
-	u64 epb;
-
-	/*
-	 * Initialize MSR_IA32_ENERGY_PERF_BIAS if not already initialized.
-	 * (x86_energy_perf_policy(8) is available to change it at run-time.)
-	 */
-	if (!cpu_has(c, X86_FEATURE_EPB))
-		return;
-
-	rdmsrl(MSR_IA32_ENERGY_PERF_BIAS, epb);
-	if ((epb & 0xF) != ENERGY_PERF_BIAS_PERFORMANCE)
-		return;
-
-	pr_warn_once("ENERGY_PERF_BIAS: Set to 'normal', was 'performance'\n");
-	pr_warn_once("ENERGY_PERF_BIAS: View and update with x86_energy_perf_policy(8)\n");
-	epb = (epb & ~0xF) | ENERGY_PERF_BIAS_NORMAL;
-	wrmsrl(MSR_IA32_ENERGY_PERF_BIAS, epb);
-}
-
-static void intel_bsp_resume(struct cpuinfo_x86 *c)
-{
-	/*
-	 * MSR_IA32_ENERGY_PERF_BIAS is lost across suspend/resume,
-	 * so reinitialize it properly like during bootup:
-	 */
-	init_intel_energy_perf(c);
-}
-
 static void init_cpuid_fault(struct cpuinfo_x86 *c)
 {
 	u64 msr;
@@ -763,8 +733,6 @@  static void init_intel(struct cpuinfo_x8
 	if (cpu_has(c, X86_FEATURE_TME))
 		detect_tme(c);
 
-	init_intel_energy_perf(c);
-
 	init_intel_misc_features(c);
 }
 
@@ -1023,7 +991,6 @@  static const struct cpu_dev intel_cpu_de
 	.c_detect_tlb	= intel_detect_tlb,
 	.c_early_init   = early_init_intel,
 	.c_init		= init_intel,
-	.c_bsp_resume	= intel_bsp_resume,
 	.c_x86_vendor	= X86_VENDOR_INTEL,
 };