[RESEND,V9,3/7] cpufreq: amd-pstate: Enable amd-pstate preferred core support.

Message ID 20231013033118.3759311-4-li.meng@amd.com (mailing list archive)
State Superseded, archived
Series amd-pstate preferred core

Commit Message

Meng, Li (Jassmine) Oct. 13, 2023, 3:31 a.m. UTC
The amd-pstate driver utilizes the functions and data structures
provided by the ITMT architecture to enable the scheduler to
favor scheduling on cores that can achieve a higher frequency
at lower voltage. We call this amd-pstate preferred core.

Here, sched_set_itmt_core_prio() is called to set priorities and
sched_set_itmt_support() is called to enable the ITMT feature.
The amd-pstate driver uses the highest performance value to indicate
the priority of a CPU; a higher value means a higher priority.
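
(For reference, the ITMT interface used here is declared in
arch/x86/include/asm/topology.h; a rough sketch, worth checking
against the tree:

	int sched_set_itmt_support(void);
	void sched_set_itmt_core_prio(int prio, int core_cpu);

A CPU whose recorded prio is higher is preferred by the scheduler.)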

The initial core rankings are set up by amd-pstate when the
system boots.

Add a hw_prefcore variable to the cpudata structure. It indicates
whether the processor and power firmware support the preferred
core feature.

Add a new early parameter value `disable` to allow the user to
disable preferred core.

The amd-pstate driver supports the preferred core feature only when
the hardware supports it and the user has not disabled it via the
early parameter.
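
For example, the feature can be disabled at boot (handled by the
amd_prefcore_param() hunk below) with:

	amd_prefcore=disable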

Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Reviewed-by: Huang Rui <ray.huang@amd.com>
Reviewed-by: Wyes Karny <wyes.karny@amd.com>
Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Co-developed-by: Perry Yuan <Perry.Yuan@amd.com>
Signed-off-by: Perry Yuan <Perry.Yuan@amd.com>
Signed-off-by: Meng Li <li.meng@amd.com>
---
 drivers/cpufreq/amd-pstate.c | 155 +++++++++++++++++++++++++++++++----
 include/linux/amd-pstate.h   |   4 +
 2 files changed, 143 insertions(+), 16 deletions(-)

Comments

Peter Zijlstra Oct. 13, 2023, 4:01 p.m. UTC | #1
On Fri, Oct 13, 2023 at 11:31:14AM +0800, Meng Li wrote:

> +#define AMD_PSTATE_PREFCORE_THRESHOLD	166
> +#define AMD_PSTATE_MAX_CPPC_PERF	255

> +static void amd_pstate_init_prefcore(struct amd_cpudata *cpudata)
> +{
> +	int ret, prio;
> +	u32 highest_perf;
> +	static u32 max_highest_perf = 0, min_highest_perf = U32_MAX;

What serializes these things?

Also, *why* are you using u32 here, what's wrong with something like:

	int max_hp = INT_MIN, min_hp = INT_MAX;

> +
> +	ret = amd_pstate_get_highest_perf(cpudata->cpu, &highest_perf);
> +	if (ret)
> +		return;
> +
> +	cpudata->hw_prefcore = true;
> +	/* check if CPPC preferred core feature is enabled*/
> +	if (highest_perf == AMD_PSTATE_MAX_CPPC_PERF) {

Which effectively means <255 (also, seems to suggest MAX_CPPC_PERF might
not be the best name, hmm?)

Should you not write '>= 255' then? Just in case something 'funny'
happens?

> +		pr_debug("AMD CPPC preferred core is unsupported!\n");
> +		cpudata->hw_prefcore = false;
> +		return;
> +	}
> +
> +	if (!amd_pstate_prefcore)
> +		return;
> +
> +	/* The maximum value of highest perf is 255 */
> +	prio = (int)(highest_perf & 0xff);

If for some weird reason you get 0x1ff or whatever above (dodgy BIOS
never happens, right) then this makes sense how?

Perhaps stop sending patches at break-neck speed and think for a little
while on how to write this and not be confused?


> +	/*
> +	 * The priorities can be set regardless of whether or not
> +	 * sched_set_itmt_support(true) has been called and it is valid to
> +	 * update them at any time after it has been called.
> +	 */
> +	sched_set_itmt_core_prio(prio, cpudata->cpu);
> +
> +	if (max_highest_perf <= min_highest_perf) {
> +		if (highest_perf > max_highest_perf)
> +			max_highest_perf = highest_perf;
> +
> +		if (highest_perf < min_highest_perf)
> +			min_highest_perf = highest_perf;
> +
> +		if (max_highest_perf > min_highest_perf) {
> +			/*
> +			 * This code can be run during CPU online under the
> +			 * CPU hotplug locks, so sched_set_itmt_support()
> +			 * cannot be called from here.  Queue up a work item
> +			 * to invoke it.
> +			 */
> +			schedule_work(&sched_prefcore_work);
> +		}
> +	}

Not a word about what serializes these variables.

> +}
Meng, Li (Jassmine) Oct. 16, 2023, 6:20 a.m. UTC | #2

Hi Peter:

> -----Original Message-----
> From: Peter Zijlstra <peterz@infradead.org>
> Sent: Saturday, October 14, 2023 12:01 AM
> To: Meng, Li (Jassmine) <Li.Meng@amd.com>
> Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>; Huang, Ray
> <Ray.Huang@amd.com>; linux-pm@vger.kernel.org; linux-
> kernel@vger.kernel.org; x86@kernel.org; linux-acpi@vger.kernel.org; Shuah
> Khan <skhan@linuxfoundation.org>; linux-kselftest@vger.kernel.org;
> Fontenot, Nathan <Nathan.Fontenot@amd.com>; Sharma, Deepak
> <Deepak.Sharma@amd.com>; Deucher, Alexander
> <Alexander.Deucher@amd.com>; Limonciello, Mario
> <Mario.Limonciello@amd.com>; Huang, Shimmer
> <Shimmer.Huang@amd.com>; Yuan, Perry <Perry.Yuan@amd.com>; Du,
> Xiaojian <Xiaojian.Du@amd.com>; Viresh Kumar <viresh.kumar@linaro.org>;
> Borislav Petkov <bp@alien8.de>; Oleksandr Natalenko
> <oleksandr@natalenko.name>; Karny, Wyes <Wyes.Karny@amd.com>
> Subject: Re: [RESEND PATCH V9 3/7] cpufreq: amd-pstate: Enable amd-
> pstate preferred core supporting.
>
> On Fri, Oct 13, 2023 at 11:31:14AM +0800, Meng Li wrote:
>
> > +#define AMD_PSTATE_PREFCORE_THRESHOLD        166
> > +#define AMD_PSTATE_MAX_CPPC_PERF     255
>
> > +static void amd_pstate_init_prefcore(struct amd_cpudata *cpudata) {
> > +     int ret, prio;
> > +     u32 highest_perf;
> > +     static u32 max_highest_perf = 0, min_highest_perf = U32_MAX;
>
> What serializes these things?
>
> Also, *why* are you using u32 here, what's wrong with something like:
>
>         int max_hp = INT_MIN, min_hp = INT_MAX;
>
[Meng, Li (Jassmine)]
We use the ITMT architecture to utilize preferred core features.
Therefore, we need to be consistent with Intel's implementation as
much as possible. For details, please refer to the
intel_pstate_set_itmt_prio() function in intel_pstate.c (line 355).

I think using the u32 data type is consistent with data structures
such as cppc_perf_ctrls and amd_cpudata.

> > +
> > +     ret = amd_pstate_get_highest_perf(cpudata->cpu, &highest_perf);
> > +     if (ret)
> > +             return;
> > +
> > +     cpudata->hw_prefcore = true;
> > +     /* check if CPPC preferred core feature is enabled*/
> > +     if (highest_perf == AMD_PSTATE_MAX_CPPC_PERF) {
>
> Which effectively means <255 (also, seems to suggest MAX_CPPC_PERF
> might not be the best name, hmm?)
>
> Should you not write '>= 255' then? Just in case something 'funny'
> happens?
>
[Meng, Li (Jassmine)]
OK, I will modify these.

> > +             pr_debug("AMD CPPC preferred core is unsupported!\n");
> > +             cpudata->hw_prefcore = false;
> > +             return;
> > +     }
> > +
> > +     if (!amd_pstate_prefcore)
> > +             return;
> > +
> > +     /* The maximum value of highest perf is 255 */
> > +     prio = (int)(highest_perf & 0xff);
>
> If for some weird reason you get 0x1ff or whatever above (dodgy BIOS never
> happens, right) then this makes sense how?
>
> Perhaps stop sending patches at break-nek speed and think for a little while
> on how to write this and not be confused?
>
[Meng, Li (Jassmine)]
If I use '>= 255' for the check, the issue mentioned will not exist,
because the function will return early when highest_perf > 0xff.
>
> > +     /*
> > +      * The priorities can be set regardless of whether or not
> > +      * sched_set_itmt_support(true) has been called and it is valid to
> > +      * update them at any time after it has been called.
> > +      */
> > +     sched_set_itmt_core_prio(prio, cpudata->cpu);
> > +
> > +     if (max_highest_perf <= min_highest_perf) {
> > +             if (highest_perf > max_highest_perf)
> > +                     max_highest_perf = highest_perf;
> > +
> > +             if (highest_perf < min_highest_perf)
> > +                     min_highest_perf = highest_perf;
> > +
> > +             if (max_highest_perf > min_highest_perf) {
> > +                     /*
> > +                      * This code can be run during CPU online under the
> > +                      * CPU hotplug locks, so sched_set_itmt_support()
> > +                      * cannot be called from here.  Queue up a work item
> > +                      * to invoke it.
> > +                      */
> > +                     schedule_work(&sched_prefcore_work);
> > +             }
> > +     }
>
> Not a word about what serializes these variables.
>
> > +}
Peter Zijlstra Oct. 16, 2023, 10:58 a.m. UTC | #3
On Mon, Oct 16, 2023 at 06:20:53AM +0000, Meng, Li (Jassmine) wrote:
> > > +static void amd_pstate_init_prefcore(struct amd_cpudata *cpudata) {
> > > +     int ret, prio;
> > > +     u32 highest_perf;
> > > +     static u32 max_highest_perf = 0, min_highest_perf = U32_MAX;
> >
> > What serializes these things?
> >
> > Also, *why* are you using u32 here, what's wrong with something like:
> >
> >         int max_hp = INT_MIN, min_hp = INT_MAX;
> >
> [Meng, Li (Jassmine)]
> We use ITMT architecture to utilize preferred core features.
> Therefore, we need to try to be consistent with Intel's implementation
> as much as possible.  For details, please refer to the
> intel_pstate_set_itmt_prio function in file intel_pstate.c. (Line 355)
> 
> I think using the data type of u32 is consistent with the data
> structures of cppc_perf_ctrls and amd_cpudata etc.

Rafael, should we fix intel_pstate too?

The point is, that sched_asym_prefer(), the final consumer of these
values uses int and thus an explicitly signed compare.

Using u32 and U32_MAX anywhere near the setting the priority makes
absolutely no sense.

If you were to have the high bit set, things do not behave as expected.
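
For context, a sketch of that consumer (kernel/sched/sched.h at the
time of this thread; worth verifying against the tree):

	static inline bool sched_asym_prefer(int a, int b)
	{
		return arch_asym_cpu_priority(a) > arch_asym_cpu_priority(b);
	}

so a stored priority with the high bit set would compare as negative.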

Also, same question as to the amd folks; what serializes those static
variables?
Wysocki, Rafael J Oct. 16, 2023, 5:27 p.m. UTC | #4
On 10/16/2023 12:58 PM, Peter Zijlstra wrote:
> On Mon, Oct 16, 2023 at 06:20:53AM +0000, Meng, Li (Jassmine) wrote:
>>>> +static void amd_pstate_init_prefcore(struct amd_cpudata *cpudata) {
>>>> +     int ret, prio;
>>>> +     u32 highest_perf;
>>>> +     static u32 max_highest_perf = 0, min_highest_perf = U32_MAX;
>>> What serializes these things?
>>>
>>> Also, *why* are you using u32 here, what's wrong with something like:
>>>
>>>          int max_hp = INT_MIN, min_hp = INT_MAX;
>>>
>> [Meng, Li (Jassmine)]
>> We use ITMT architecture to utilize preferred core features.
>> Therefore, we need to try to be consistent with Intel's implementation
>> as much as possible.  For details, please refer to the
>> intel_pstate_set_itmt_prio function in file intel_pstate.c. (Line 355)
>>
>> I think using the data type of u32 is consistent with the data
>> structures of cppc_perf_ctrls and amd_cpudata etc.
> Rafael, should we fix intel_pstate too?

Srinivas should be more familiar with this code than I am, so adding him.


> The point is, that sched_asym_prefer(), the final consumer of these
> values uses int and thus an explicitly signed compare.
>
> Using u32 and U32_MAX anywhere near the setting the priority makes
> absolutely no sense.
>
> If you were to have the high bit set, things do not behave as expected.

Right, but in practice these values are always between 0 and 255 
inclusive AFAICS.

It would have been better to use u8 I suppose.


> Also, same question as to the amd folks; what serializes those static
> variables?

That's a good one.
srinivas pandruvada Oct. 16, 2023, 6:50 p.m. UTC | #5
On Mon, 2023-10-16 at 19:27 +0200, Wysocki, Rafael J wrote:
> On 10/16/2023 12:58 PM, Peter Zijlstra wrote:
> > On Mon, Oct 16, 2023 at 06:20:53AM +0000, Meng, Li (Jassmine)
> > wrote:
> > > > > +static void amd_pstate_init_prefcore(struct amd_cpudata
> > > > > *cpudata) {
> > > > > +     int ret, prio;
> > > > > +     u32 highest_perf;
> > > > > +     static u32 max_highest_perf = 0, min_highest_perf =
> > > > > U32_MAX;
> > > > What serializes these things?
> > > > 
> > > > Also, *why* are you using u32 here, what's wrong with something
> > > > like:
> > > > 
> > > >          int max_hp = INT_MIN, min_hp = INT_MAX;
> > > > 
> > > [Meng, Li (Jassmine)]
> > > We use ITMT architecture to utilize preferred core features.
> > > Therefore, we need to try to be consistent with Intel's
> > > implementation
> > > as much as possible.  For details, please refer to the
> > > intel_pstate_set_itmt_prio function in file intel_pstate.c. (Line
> > > 355)
> > > 
> > > I think using the data type of u32 is consistent with the data
> > > structures of cppc_perf_ctrls and amd_cpudata etc.
> > Rafael, should we fix intel_pstate too?
> 
> Srinivas should be more familiar with this code than I am, so adding
> him.
> 
If we change
	static u32 max_highest_perf = 0, min_highest_perf = U32_MAX;
to
	static int max_highest_perf = INT_MIN, min_highest_perf = INT_MAX;

then in intel_pstate we will end up with a signed vs. unsigned
comparison, as cppc_perf.highest_perf is u32.
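
(Concretely, a hypothetical sketch of the pitfall: with

	int max_highest_perf = INT_MIN;
	...
	if (cppc_perf.highest_perf > max_highest_perf)

the usual arithmetic conversions make the compare unsigned, INT_MIN
becomes 0x80000000u, and the test never fires for realistic perf
values.)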


In reality it will be fine to change this to "int", as we will never
reach the u32 max as a performance value on any Intel platform.

> 
> > The point is, that sched_asym_prefer(), the final consumer of these
> > values uses int and thus an explicitly signed compare.
> > 
> > Using u32 and U32_MAX anywhere near the setting the priority makes
> > absolutely no sense.
> > 
> > If you were to have the high bit set, things do not behave as
> > expected.
> 
> Right, but in practice these values are always between 0 and 255 
> inclusive AFAICS.
> 
> It would have been better to use u8 I suppose.
Should be fine, as overclocked parts will set at most 0xff.

> 
> 
> > Also, same question as to the amd folks; what serializes those
> > static
> > variables?
> 
> That's a good one.

This function, which checks the static variables, is called from the
cpufreq ->init callback, which in turn is called from a function that
is passed as the startup() function pointer to
cpuhp_setup_state_nocalls_cpuslocked().

I see that startup() callbacks are called under the mutex
cpuhp_state_mutex for each present CPU. So if some teardown happens,
that is also protected by the same mutex. The assumption here is that
cpuhp_invoke_callback() in the hotplug state machine is not called in
parallel on two CPUs. But I see activity on parallel bringup, so this
is questionable now.
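
FWIW, the registration path in question looks roughly like this
(a sketch from cpufreq_register_driver() in drivers/cpufreq/cpufreq.c;
worth verifying against the tree):

	ret = cpuhp_setup_state_nocalls_cpuslocked(CPUHP_AP_ONLINE_DYN,
						   "cpufreq:online",
						   cpuhp_cpufreq_online,
						   cpuhp_cpufreq_offline);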

Thanks,
Srinivas

> 
>
Peter Zijlstra Oct. 16, 2023, 9:55 p.m. UTC | #6
On Mon, Oct 16, 2023 at 11:50:34AM -0700, srinivas pandruvada wrote:

I'll respond to the rest tomorrow, it's far too late.

> > > Also, same question as to the amd folks; what serializes those
> > > static
> > > variables?
> > 
> > That's a good one.
> 
> This function which is checking static variables is called from cpufreq
> ->init callback. Which in turn is called from a function which is
> passed as startup() function pointer to
> cpuhp_setup_state_nocalls_cpuslocked().
> 
> I see that startup() callbacks are called under a mutex
> cpuhp_state_mutex for each present CPUs. So if some tear down happen,
> that is also protected by the same mutex. The assumption is here is
> that cpuhp_invoke_callback() in hotplug state machine is not called in
> parallel on two CPUs by the hotplug state machine. But I see activity
> on parallel bringup, so this is questionable now.

Parallel bringup should still serialise this. It mostly only does the
hardware bringup in parallel.

Having a pointer back to the cpu hotplug lock would make it easier to
untangle this code though.
Meng, Li (Jassmine) Oct. 17, 2023, 8:22 a.m. UTC | #7

Hi Peter:

After our internal discussion, we plan to make the following modifications.
Do you think they are feasible?
        1. Add a check of "highest_perf": when it is less than 255, the preferred core feature is enabled and the priority is set.
        2. Delete "static u32 max_highest_perf/min_highest_perf", because amd-pstate preferred core does not require special processing for hotplug.

+#define CPPC_MAX_PERF  U8_MAX
+
+static void amd_pstate_init_prefcore(struct amd_cpudata *cpudata)
+{
+       int ret, prio;
+       u32 highest_perf;
+
+       ret = amd_pstate_get_highest_perf(cpudata->cpu, &highest_perf);
+       if (ret)
+               return;
+
+       cpudata->hw_prefcore = true;
+       /* check if CPPC preferred core feature is enabled */
+       if (highest_perf < CPPC_MAX_PERF)
+               prio = (int)highest_perf;
+       else {
+               pr_debug("AMD CPPC preferred core is unsupported!\n");
+               cpudata->hw_prefcore = false;
+               return;
+       }
+
+       if (!amd_pstate_prefcore)
+               return;
+
+       /*
+        * The priorities can be set regardless of whether or not
+        * sched_set_itmt_support(true) has been called and it is valid to
+        * update them at any time after it has been called.
+        */
+       sched_set_itmt_core_prio(prio, cpudata->cpu);
+
+       schedule_work(&sched_prefcore_work);
+}

> -----Original Message-----
> From: srinivas pandruvada <srinivas.pandruvada@linux.intel.com>
> Sent: Tuesday, October 17, 2023 2:51 AM
> To: Wysocki, Rafael J <rafael.j.wysocki@intel.com>; Peter Zijlstra
> <peterz@infradead.org>; Meng, Li (Jassmine) <Li.Meng@amd.com>
> Cc: Huang, Ray <Ray.Huang@amd.com>; linux-pm@vger.kernel.org; linux-
> kernel@vger.kernel.org; x86@kernel.org; linux-acpi@vger.kernel.org; Shuah
> Khan <skhan@linuxfoundation.org>; linux-kselftest@vger.kernel.org;
> Fontenot, Nathan <Nathan.Fontenot@amd.com>; Sharma, Deepak
> <Deepak.Sharma@amd.com>; Deucher, Alexander
> <Alexander.Deucher@amd.com>; Limonciello, Mario
> <Mario.Limonciello@amd.com>; Huang, Shimmer
> <Shimmer.Huang@amd.com>; Yuan, Perry <Perry.Yuan@amd.com>; Du,
> Xiaojian <Xiaojian.Du@amd.com>; Viresh Kumar <viresh.kumar@linaro.org>;
> Borislav Petkov <bp@alien8.de>; Oleksandr Natalenko
> <oleksandr@natalenko.name>; Karny, Wyes <Wyes.Karny@amd.com>
> Subject: Re: [RESEND PATCH V9 3/7] cpufreq: amd-pstate: Enable amd-
> pstate preferred core supporting.
>
> On Mon, 2023-10-16 at 19:27 +0200, Wysocki, Rafael J wrote:
> > On 10/16/2023 12:58 PM, Peter Zijlstra wrote:
> > > On Mon, Oct 16, 2023 at 06:20:53AM +0000, Meng, Li (Jassmine)
> > > wrote:
> > > > > > +static void amd_pstate_init_prefcore(struct amd_cpudata
> > > > > > *cpudata) {
> > > > > > +     int ret, prio;
> > > > > > +     u32 highest_perf;
> > > > > > +     static u32 max_highest_perf = 0, min_highest_perf =
> > > > > > U32_MAX;
> > > > > What serializes these things?
> > > > >
> > > > > Also, *why* are you using u32 here, what's wrong with something
> > > > > like:
> > > > >
> > > > >          int max_hp = INT_MIN, min_hp = INT_MAX;
> > > > >
> > > > [Meng, Li (Jassmine)]
> > > > We use ITMT architecture to utilize preferred core features.
> > > > Therefore, we need to try to be consistent with Intel's
> > > > implementation as much as possible.  For details, please refer to
> > > > the intel_pstate_set_itmt_prio function in file intel_pstate.c.
> > > > (Line
> > > > 355)
> > > >
> > > > I think using the data type of u32 is consistent with the data
> > > > structures of cppc_perf_ctrls and amd_cpudata etc.
> > > Rafael, should we fix intel_pstate too?
> >
> > Srinivas should be more familiar with this code than I am, so adding
> > him.
> >
> If we make
>         static u32 max_highest_perf = 0, min_highest_perf = U32_MAX; to
>         static int max_highest_perf = INT_MIN, min_highest_perf = INT_MAX;
>
> Then in intel_pstate we will compare signed vs unsigned comparison as
> cppc_perf.highest_perf is u32.
>
>
> In reality this will be fine to change to "int" as we will never reach
> u32 max as performance on any Intel platform.
>
> >
> > > The point is, that sched_asym_prefer(), the final consumer of these
> > > values uses int and thus an explicitly signed compare.
> > >
> > > Using u32 and U32_MAX anywhere near the setting the priority makes
> > > absolutely no sense.
> > >
> > > If you were to have the high bit set, things do not behave as
> > > expected.
> >
> > Right, but in practice these values are always between 0 and 255
> > inclusive AFAICS.
> >
> > It would have been better to use u8 I suppose.
> Should be fine as over clocked parts will set to max 0xff.
>
> >
> >
> > > Also, same question as to the amd folks; what serializes those
> > > static variables?
> >
> > That's a good one.
>
> This function which is checking static variables is called from cpufreq
> ->init callback. Which in turn is called from a function which is
> passed as startup() function pointer to
> cpuhp_setup_state_nocalls_cpuslocked().
>
> I see that startup() callbacks are called under a mutex cpuhp_state_mutex
> for each present CPUs. So if some tear down happen, that is also protected
> by the same mutex. The assumption is here is that cpuhp_invoke_callback()
> in hotplug state machine is not called in parallel on two CPUs by the hotplug
> state machine. But I see activity on parallel bringup, so this is questionable
> now.
>
> Thanks,
> Srinivas
>
> >
> >

Patch

diff --git a/drivers/cpufreq/amd-pstate.c b/drivers/cpufreq/amd-pstate.c
index 9a1e194d5cf8..6aae383990f1 100644
--- a/drivers/cpufreq/amd-pstate.c
+++ b/drivers/cpufreq/amd-pstate.c
@@ -37,6 +37,7 @@ 
 #include <linux/uaccess.h>
 #include <linux/static_call.h>
 #include <linux/amd-pstate.h>
+#include <linux/topology.h>
 
 #include <acpi/processor.h>
 #include <acpi/cppc_acpi.h>
@@ -49,6 +50,8 @@ 
 
 #define AMD_PSTATE_TRANSITION_LATENCY	20000
 #define AMD_PSTATE_TRANSITION_DELAY	1000
+#define AMD_PSTATE_PREFCORE_THRESHOLD	166
+#define AMD_PSTATE_MAX_CPPC_PERF	255
 
 /*
  * TODO: We need more time to fine tune processors with shared memory solution
@@ -64,6 +67,7 @@  static struct cpufreq_driver amd_pstate_driver;
 static struct cpufreq_driver amd_pstate_epp_driver;
 static int cppc_state = AMD_PSTATE_UNDEFINED;
 static bool cppc_enabled;
+static bool amd_pstate_prefcore = true;
 
 /*
  * AMD Energy Preference Performance (EPP)
@@ -290,23 +294,21 @@  static inline int amd_pstate_enable(bool enable)
 static int pstate_init_perf(struct amd_cpudata *cpudata)
 {
 	u64 cap1;
-	u32 highest_perf;
 
 	int ret = rdmsrl_safe_on_cpu(cpudata->cpu, MSR_AMD_CPPC_CAP1,
 				     &cap1);
 	if (ret)
 		return ret;
 
-	/*
-	 * TODO: Introduce AMD specific power feature.
-	 *
-	 * CPPC entry doesn't indicate the highest performance in some ASICs.
+	/* For platforms that do not support the preferred core feature, the
+	 * highest_perf may be configured as 166 or 255. To avoid the max
+	 * frequency being calculated wrongly, we take the
+	 * AMD_CPPC_HIGHEST_PERF(cap1) value as the default max perf.
 	 */
-	highest_perf = amd_get_highest_perf();
-	if (highest_perf > AMD_CPPC_HIGHEST_PERF(cap1))
-		highest_perf = AMD_CPPC_HIGHEST_PERF(cap1);
-
-	WRITE_ONCE(cpudata->highest_perf, highest_perf);
+	if (cpudata->hw_prefcore)
+		WRITE_ONCE(cpudata->highest_perf, AMD_PSTATE_PREFCORE_THRESHOLD);
+	else
+		WRITE_ONCE(cpudata->highest_perf, AMD_CPPC_HIGHEST_PERF(cap1));
 
 	WRITE_ONCE(cpudata->nominal_perf, AMD_CPPC_NOMINAL_PERF(cap1));
 	WRITE_ONCE(cpudata->lowest_nonlinear_perf, AMD_CPPC_LOWNONLIN_PERF(cap1));
@@ -318,17 +320,15 @@  static int pstate_init_perf(struct amd_cpudata *cpudata)
 static int cppc_init_perf(struct amd_cpudata *cpudata)
 {
 	struct cppc_perf_caps cppc_perf;
-	u32 highest_perf;
 
 	int ret = cppc_get_perf_caps(cpudata->cpu, &cppc_perf);
 	if (ret)
 		return ret;
 
-	highest_perf = amd_get_highest_perf();
-	if (highest_perf > cppc_perf.highest_perf)
-		highest_perf = cppc_perf.highest_perf;
-
-	WRITE_ONCE(cpudata->highest_perf, highest_perf);
+	if (cpudata->hw_prefcore)
+		WRITE_ONCE(cpudata->highest_perf, AMD_PSTATE_PREFCORE_THRESHOLD);
+	else
+		WRITE_ONCE(cpudata->highest_perf, cppc_perf.highest_perf);
 
 	WRITE_ONCE(cpudata->nominal_perf, cppc_perf.nominal_perf);
 	WRITE_ONCE(cpudata->lowest_nonlinear_perf,
@@ -676,6 +676,93 @@  static void amd_perf_ctl_reset(unsigned int cpu)
 	wrmsrl_on_cpu(cpu, MSR_AMD_PERF_CTL, 0);
 }
 
+/*
+ * Enabling amd-pstate preferred core can't be done directly from cpufreq
+ * callbacks due to locking, so queue the work for later.
+ */
+static void amd_pstate_sched_prefcore_workfn(struct work_struct *work)
+{
+	sched_set_itmt_support();
+}
+static DECLARE_WORK(sched_prefcore_work, amd_pstate_sched_prefcore_workfn);
+
+/*
+ * Get the highest performance register value.
+ * @cpu: CPU from which to get highest performance.
+ * @highest_perf: Return address.
+ *
+ * Return: 0 for success, -EIO otherwise.
+ */
+static int amd_pstate_get_highest_perf(int cpu, u32 *highest_perf)
+{
+	int ret;
+
+	if (boot_cpu_has(X86_FEATURE_CPPC)) {
+		u64 cap1;
+
+		ret = rdmsrl_safe_on_cpu(cpu, MSR_AMD_CPPC_CAP1, &cap1);
+		if (ret)
+			return ret;
+		WRITE_ONCE(*highest_perf, AMD_CPPC_HIGHEST_PERF(cap1));
+	} else {
+		u64 cppc_highest_perf;
+
+		ret = cppc_get_highest_perf(cpu, &cppc_highest_perf);
+		WRITE_ONCE(*highest_perf, cppc_highest_perf);
+	}
+
+	return ret;
+}
+
+static void amd_pstate_init_prefcore(struct amd_cpudata *cpudata)
+{
+	int ret, prio;
+	u32 highest_perf;
+	static u32 max_highest_perf = 0, min_highest_perf = U32_MAX;
+
+	ret = amd_pstate_get_highest_perf(cpudata->cpu, &highest_perf);
+	if (ret)
+		return;
+
+	cpudata->hw_prefcore = true;
+	/* check if CPPC preferred core feature is enabled */
+	if (highest_perf == AMD_PSTATE_MAX_CPPC_PERF) {
+		pr_debug("AMD CPPC preferred core is unsupported!\n");
+		cpudata->hw_prefcore = false;
+		return;
+	}
+
+	if (!amd_pstate_prefcore)
+		return;
+
+	/* The maximum value of highest perf is 255 */
+	prio = (int)(highest_perf & 0xff);
+	/*
+	 * The priorities can be set regardless of whether or not
+	 * sched_set_itmt_support(true) has been called and it is valid to
+	 * update them at any time after it has been called.
+	 */
+	sched_set_itmt_core_prio(prio, cpudata->cpu);
+
+	if (max_highest_perf <= min_highest_perf) {
+		if (highest_perf > max_highest_perf)
+			max_highest_perf = highest_perf;
+
+		if (highest_perf < min_highest_perf)
+			min_highest_perf = highest_perf;
+
+		if (max_highest_perf > min_highest_perf) {
+			/*
+			 * This code can be run during CPU online under the
+			 * CPU hotplug locks, so sched_set_itmt_support()
+			 * cannot be called from here.  Queue up a work item
+			 * to invoke it.
+			 */
+			schedule_work(&sched_prefcore_work);
+		}
+	}
+}
+
 static int amd_pstate_cpu_init(struct cpufreq_policy *policy)
 {
 	int min_freq, max_freq, nominal_freq, lowest_nonlinear_freq, ret;
@@ -697,6 +784,8 @@  static int amd_pstate_cpu_init(struct cpufreq_policy *policy)
 
 	cpudata->cpu = policy->cpu;
 
+	amd_pstate_init_prefcore(cpudata);
+
 	ret = amd_pstate_init_perf(cpudata);
 	if (ret)
 		goto free_cpudata1;
@@ -845,6 +934,17 @@  static ssize_t show_amd_pstate_highest_perf(struct cpufreq_policy *policy,
 	return sysfs_emit(buf, "%u\n", perf);
 }
 
+static ssize_t show_amd_pstate_hw_prefcore(struct cpufreq_policy *policy,
+					   char *buf)
+{
+	bool hw_prefcore;
+	struct amd_cpudata *cpudata = policy->driver_data;
+
+	hw_prefcore = READ_ONCE(cpudata->hw_prefcore);
+
+	return sysfs_emit(buf, "%s\n", hw_prefcore ? "supported" : "unsupported");
+}
+
 static ssize_t show_energy_performance_available_preferences(
 				struct cpufreq_policy *policy, char *buf)
 {
@@ -1037,18 +1137,27 @@  static ssize_t status_store(struct device *a, struct device_attribute *b,
 	return ret < 0 ? ret : count;
 }
 
+static ssize_t prefcore_show(struct device *dev,
+			     struct device_attribute *attr, char *buf)
+{
+	return sysfs_emit(buf, "%s\n", amd_pstate_prefcore ? "enabled" : "disabled");
+}
+
 cpufreq_freq_attr_ro(amd_pstate_max_freq);
 cpufreq_freq_attr_ro(amd_pstate_lowest_nonlinear_freq);
 
 cpufreq_freq_attr_ro(amd_pstate_highest_perf);
+cpufreq_freq_attr_ro(amd_pstate_hw_prefcore);
 cpufreq_freq_attr_rw(energy_performance_preference);
 cpufreq_freq_attr_ro(energy_performance_available_preferences);
 static DEVICE_ATTR_RW(status);
+static DEVICE_ATTR_RO(prefcore);
 
 static struct freq_attr *amd_pstate_attr[] = {
 	&amd_pstate_max_freq,
 	&amd_pstate_lowest_nonlinear_freq,
 	&amd_pstate_highest_perf,
+	&amd_pstate_hw_prefcore,
 	NULL,
 };
 
@@ -1056,6 +1165,7 @@  static struct freq_attr *amd_pstate_epp_attr[] = {
 	&amd_pstate_max_freq,
 	&amd_pstate_lowest_nonlinear_freq,
 	&amd_pstate_highest_perf,
+	&amd_pstate_hw_prefcore,
 	&energy_performance_preference,
 	&energy_performance_available_preferences,
 	NULL,
@@ -1063,6 +1173,7 @@  static struct freq_attr *amd_pstate_epp_attr[] = {
 
 static struct attribute *pstate_global_attributes[] = {
 	&dev_attr_status.attr,
+	&dev_attr_prefcore.attr,
 	NULL
 };
 
@@ -1114,6 +1225,8 @@  static int amd_pstate_epp_cpu_init(struct cpufreq_policy *policy)
 	cpudata->cpu = policy->cpu;
 	cpudata->epp_policy = 0;
 
+	amd_pstate_init_prefcore(cpudata);
+
 	ret = amd_pstate_init_perf(cpudata);
 	if (ret)
 		goto free_cpudata1;
@@ -1527,7 +1640,17 @@  static int __init amd_pstate_param(char *str)
 
 	return amd_pstate_set_driver(mode_idx);
 }
+
+static int __init amd_prefcore_param(char *str)
+{
+	if (!strcmp(str, "disable"))
+		amd_pstate_prefcore = false;
+
+	return 0;
+}
+
 early_param("amd_pstate", amd_pstate_param);
+early_param("amd_prefcore", amd_prefcore_param);
 
 MODULE_AUTHOR("Huang Rui <ray.huang@amd.com>");
 MODULE_DESCRIPTION("AMD Processor P-state Frequency Driver");
diff --git a/include/linux/amd-pstate.h b/include/linux/amd-pstate.h
index 446394f84606..87e140e9e6db 100644
--- a/include/linux/amd-pstate.h
+++ b/include/linux/amd-pstate.h
@@ -52,6 +52,9 @@  struct amd_aperf_mperf {
  * @prev: Last Aperf/Mperf/tsc count value read from register
  * @freq: current cpu frequency value
  * @boost_supported: check whether the Processor or SBIOS supports boost mode
+ * @hw_prefcore: check whether HW supports the preferred core feature.
+ * 		  Only when hw_prefcore and the early prefcore param are true,
+ * 		  the AMD P-State driver supports the preferred core feature.
  * @epp_policy: Last saved policy used to set energy-performance preference
  * @epp_cached: Cached CPPC energy-performance preference value
  * @policy: Cpufreq policy value
@@ -81,6 +84,7 @@  struct amd_cpudata {
 
 	u64	freq;
 	bool	boost_supported;
+	bool	hw_prefcore;
 
 	/* EPP feature related attributes*/
 	s16	epp_policy;