[v4,2/2] cpufreq: intel_pstate: Implement passive mode with HWP enabled

Message ID 4684795.LlGW2geaUc@kreacher (mailing list archive)
State Superseded, archived
Series cpufreq: intel_pstate: Implement passive mode with HWP enabled

Commit Message

Rafael J. Wysocki July 28, 2020, 3:13 p.m. UTC
From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Allow intel_pstate to work in the passive mode with HWP enabled and
make it set the HWP minimum performance limit (HWP floor) to the
P-state value given by the target frequency supplied by the cpufreq
governor, so as to prevent the HWP algorithm and the CPU scheduler
from working against each other, at least when the schedutil governor
is in use, and update the intel_pstate documentation accordingly.

Among other things, this allows utilization clamps to be taken
into account, at least to a certain extent, when intel_pstate is
in use and makes it more likely that sufficient capacity for
deadline tasks will be provided.

After this change, the resulting behavior of an HWP system with
intel_pstate in the passive mode should be close to the behavior
of the analogous non-HWP system with intel_pstate in the passive
mode, except that in the frequency range below the base frequency
(i.e. the frequency returned by the base_frequency cpufreq attribute
in sysfs on HWP systems) the HWP algorithm is allowed to go above
the floor P-state set by intel_pstate with or without hardware
coordination of P-states among CPUs in the same package.

Also note that the setting of the HWP floor may not be taken into
account by the processor in the following cases:

 * For the HWP floor in the range of P-states above the base
   frequency, referred to as the turbo range, the processor has a
   license to choose any P-state from that range, either below or
   above the HWP floor, just like a non-HWP processor in the case
   when the target P-state falls into the turbo range.

 * If P-states of the CPUs in the same package are coordinated
   at the hardware level, the processor may choose a P-state
   above the HWP floor, just like a non-HWP processor in the
   analogous case.

With this change applied, intel_pstate in the passive mode
assumes complete control over the HWP request MSR and concurrent
changes of that MSR (e.g. via the direct MSR access interface) are
overridden by it.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---

v1 -> v2:
   * Avoid a race condition when updating the HWP request register while
     setting a new EPP value via sysfs.

v2 -> v3:
   * Rebase.

v3 -> v4:
   * Avoid exposing the hwp_dynamic_boost sysfs switch in the passive mode.

---
 Documentation/admin-guide/pm/intel_pstate.rst |   89 +++++------
 drivers/cpufreq/intel_pstate.c                |  204 ++++++++++++++++++++------
 2 files changed, 204 insertions(+), 89 deletions(-)
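
For illustration only, and not part of the patch: the floor update described in the commit message amounts to rewriting three fields of the cached HWP request value. The sketch below is a user-space model of that arithmetic, assuming the IA32_HWP_REQUEST layout with min in bits 7:0, max in bits 15:8 and EPP in bits 31:24. The function name hwp_request_with_floor() and the macro names are hypothetical and do not exist in the driver.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed field positions in the low 32 bits of IA32_HWP_REQUEST (MSR 0x774). */
#define HWP_SKETCH_MIN_SHIFT	0
#define HWP_SKETCH_MAX_SHIFT	8
#define HWP_SKETCH_EPP_SHIFT	24
#define HWP_SKETCH_FIELD_MASK	0xffULL

/*
 * Hypothetical user-space model of the floor update: replace the min and
 * max fields of a cached request value, and replace the EPP field only
 * when a non-negative cached EPP is supplied. Returns the new value.
 */
static inline uint64_t hwp_request_with_floor(uint64_t cached,
					      unsigned int floor,
					      unsigned int ceiling, int epp)
{
	uint64_t value = cached;

	/* Each field is 8 bits wide; reject anything that would not fit. */
	assert(floor <= 0xff && ceiling <= 0xff && epp <= 0xff);

	value &= ~(HWP_SKETCH_FIELD_MASK << HWP_SKETCH_MIN_SHIFT);
	value |= (uint64_t)floor << HWP_SKETCH_MIN_SHIFT;

	value &= ~(HWP_SKETCH_FIELD_MASK << HWP_SKETCH_MAX_SHIFT);
	value |= (uint64_t)ceiling << HWP_SKETCH_MAX_SHIFT;

	/* A negative EPP means "nothing cached", so keep the old field. */
	if (epp >= 0) {
		value &= ~(HWP_SKETCH_FIELD_MASK << HWP_SKETCH_EPP_SHIFT);
		value |= (uint64_t)epp << HWP_SKETCH_EPP_SHIFT;
	}

	return value;
}
```

With a cached value of 0x80002704, a floor and ceiling of 0x27 and no cached EPP, the model returns 0x80002727. The driver additionally skips the MSR write when the new value equals the cached one; that step is left out here.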

Comments

srinivas pandruvada Aug. 1, 2020, 11:21 p.m. UTC | #1
On Tue, 2020-07-28 at 17:13 +0200, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> 
> Allow intel_pstate to work in the passive mode with HWP enabled and
> make it set the HWP minimum performance limit (HWP floor) to the
> P-state value given by the target frequency supplied by the cpufreq
> governor, so as to prevent the HWP algorithm and the CPU scheduler
> from working against each other, at least when the schedutil governor
> is in use, and update the intel_pstate documentation accordingly.
> 
> Among other things, this allows utilization clamps to be taken
> into account, at least to a certain extent, when intel_pstate is
> in use and makes it more likely that sufficient capacity for
> deadline tasks will be provided.
> 
> After this change, the resulting behavior of an HWP system with
> intel_pstate in the passive mode should be close to the behavior
> of the analogous non-HWP system with intel_pstate in the passive
> mode, except that in the frequency range below the base frequency
> (i.e. the frequency returned by the base_frequency cpufreq attribute
> in sysfs on HWP systems) the HWP algorithm is allowed to go above
> the floor P-state set by intel_pstate with or without hardware
> coordination of P-states among CPUs in the same package.
> 
Do you mean HWP.req.min will be below base_freq (unless user overrides
it)?
With a busy workload I see HWP req.min = HWP req.max.
The base freq: 1.3GHz (ratio 0x0d), MAX 1C turbo: 3.9GHz (ratio: 0x27)
When I monitor MSR 0x774 (HWP_REQ), I see
0x80002727

Normally MSR 0x774 reads
0x80002704

Thanks,
Srinivas

> Also note that the setting of the HWP floor may not be taken into
> account by the processor in the following cases:
> 
>  * For the HWP floor in the range of P-states above the base
>    frequency, referred to as the turbo range, the processor has a
>    license to choose any P-state from that range, either below or
>    above the HWP floor, just like a non-HWP processor in the case
>    when the target P-state falls into the turbo range.
> 
>  * If P-states of the CPUs in the same package are coordinated
>    at the hardware level, the processor may choose a P-state
>    above the HWP floor, just like a non-HWP processor in the
>    analogous case.
> 
> With this change applied, intel_pstate in the passive mode
> assumes complete control over the HWP request MSR and concurrent
> changes of that MSR (eg. via the direct MSR access interface) are
> overridden by it.
> 
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
> 
> v1 -> v2:
>    * Avoid a race condition when updating the HWP request register while
>      setting a new EPP value via sysfs.
> 
> v2 -> v3:
>    * Rebase.
> 
> v3 -> v4:
>    * Avoid exposing the hwp_dynamic_boost sysfs switch in the passive mode.
> 
> ---
>  Documentation/admin-guide/pm/intel_pstate.rst |   89 +++++------
>  drivers/cpufreq/intel_pstate.c                |  204 ++++++++++++++++++++------
>  2 files changed, 204 insertions(+), 89 deletions(-)
> 
> Index: linux-pm/drivers/cpufreq/intel_pstate.c
> ===================================================================
> --- linux-pm.orig/drivers/cpufreq/intel_pstate.c
> +++ linux-pm/drivers/cpufreq/intel_pstate.c
> @@ -36,6 +36,7 @@
>  #define INTEL_PSTATE_SAMPLING_INTERVAL	(10 * NSEC_PER_MSEC)
>  
>  #define INTEL_CPUFREQ_TRANSITION_LATENCY	20000
> +#define INTEL_CPUFREQ_TRANSITION_DELAY_HWP	5000
>  #define INTEL_CPUFREQ_TRANSITION_DELAY		500
>  
>  #ifdef CONFIG_ACPI
> @@ -220,6 +221,7 @@ struct global_params {
>   *			preference/bias
>   * @epp_saved:		Saved EPP/EPB during system suspend or
> CPU offline
>   *			operation
> + * @epp_cached		Cached HWP energy-performance
> preference value
>   * @hwp_req_cached:	Cached value of the last HWP Request MSR
>   * @hwp_cap_cached:	Cached value of the last HWP Capabilities MSR
>   * @last_io_update:	Last time when IO wake flag was set
> @@ -257,6 +259,7 @@ struct cpudata {
>  	s16 epp_policy;
>  	s16 epp_default;
>  	s16 epp_saved;
> +	s16 epp_cached;
>  	u64 hwp_req_cached;
>  	u64 hwp_cap_cached;
>  	u64 last_io_update;
> @@ -690,6 +693,8 @@ static ssize_t show_energy_performance_a
>  
>  cpufreq_freq_attr_ro(energy_performance_available_preferences);
>  
> +static struct cpufreq_driver intel_pstate;
> +
>  static ssize_t store_energy_performance_preference(
>  		struct cpufreq_policy *policy, const char *buf, size_t
> count)
>  {
> @@ -718,14 +723,35 @@ static ssize_t store_energy_performance_
>  		raw = true;
>  	}
>  
> +	mutex_lock(&intel_pstate_driver_lock);
> +
> +	if (!intel_pstate_driver) {
> +		mutex_unlock(&intel_pstate_driver_lock);
> +		return -EAGAIN;
> +	}
> +
>  	mutex_lock(&intel_pstate_limits_lock);
>  
> -	ret = intel_pstate_set_energy_pref_index(cpu_data, ret, raw,
> epp);
> -	if (!ret)
> +	if (intel_pstate_driver == &intel_pstate) {
> +		ret = intel_pstate_set_energy_pref_index(cpu_data, ret,
> raw, epp);
> +		if (!ret)
> +			ret = count;
> +	} else {
> +		/*
> +		 * In the passive mode simply update the cached EPP
> value and
> +		 * rely on intel_cpufreq_adjust_hwp() to pick it up
> later.
> +		 */
> +		if (!raw)
> +			epp = ret ? epp_values[ret - 1] : cpu_data-
> >epp_default;
> +
> +		WRITE_ONCE(cpu_data->epp_cached, epp);
>  		ret = count;
> +	}
>  
>  	mutex_unlock(&intel_pstate_limits_lock);
>  
> +	mutex_unlock(&intel_pstate_driver_lock);
> +
>  	return ret;
>  }
>  
> @@ -1138,8 +1164,6 @@ static ssize_t store_no_turbo(struct kob
>  	return count;
>  }
>  
> -static struct cpufreq_driver intel_pstate;
> -
>  static void update_qos_request(enum freq_qos_req_type type)
>  {
>  	int max_state, turbo_max, freq, i, perf_pct;
> @@ -1323,9 +1347,10 @@ static const struct attribute_group inte
>  
>  static const struct x86_cpu_id intel_pstate_cpu_ee_disable_ids[];
>  
> +static struct kobject *intel_pstate_kobject;
> +
>  static void __init intel_pstate_sysfs_expose_params(void)
>  {
> -	struct kobject *intel_pstate_kobject;
>  	int rc;
>  
>  	intel_pstate_kobject = kobject_create_and_add("intel_pstate",
> @@ -1350,17 +1375,31 @@ static void __init intel_pstate_sysfs_ex
>  	rc = sysfs_create_file(intel_pstate_kobject,
> &min_perf_pct.attr);
>  	WARN_ON(rc);
>  
> -	if (hwp_active) {
> -		rc = sysfs_create_file(intel_pstate_kobject,
> -				       &hwp_dynamic_boost.attr);
> -		WARN_ON(rc);
> -	}
> -
>  	if (x86_match_cpu(intel_pstate_cpu_ee_disable_ids)) {
>  		rc = sysfs_create_file(intel_pstate_kobject,
> &energy_efficiency.attr);
>  		WARN_ON(rc);
>  	}
>  }
> +
> +static void intel_pstate_sysfs_expose_hwp_dynamic_boost(void)
> +{
> +	int rc;
> +
> +	if (!hwp_active)
> +		return;
> +
> +	rc = sysfs_create_file(intel_pstate_kobject,
> &hwp_dynamic_boost.attr);
> +	WARN_ON_ONCE(rc);
> +}
> +
> +static void intel_pstate_sysfs_hide_hwp_dynamic_boost(void)
> +{
> +	if (!hwp_active)
> +		return;
> +
> +	sysfs_remove_file(intel_pstate_kobject,
> &hwp_dynamic_boost.attr);
> +}
> +
>  /************************** sysfs end ************************/
>  
>  static void intel_pstate_hwp_enable(struct cpudata *cpudata)
> @@ -2041,6 +2080,7 @@ static int intel_pstate_init_cpu(unsigne
>  		cpu->epp_default = -EINVAL;
>  		cpu->epp_powersave = -EINVAL;
>  		cpu->epp_saved = -EINVAL;
> +		WRITE_ONCE(cpu->epp_cached, -EINVAL);
>  	}
>  
>  	cpu = all_cpu_data[cpunum];
> @@ -2239,7 +2279,10 @@ static int intel_pstate_verify_policy(st
>  
>  static void intel_cpufreq_stop_cpu(struct cpufreq_policy *policy)
>  {
> -	intel_pstate_set_min_pstate(all_cpu_data[policy->cpu]);
> +	if (hwp_active)
> +		intel_pstate_hwp_force_min_perf(policy->cpu);
> +	else
> +		intel_pstate_set_min_pstate(all_cpu_data[policy->cpu]);
>  }
>  
>  static void intel_pstate_stop_cpu(struct cpufreq_policy *policy)
> @@ -2247,12 +2290,10 @@ static void intel_pstate_stop_cpu(struct
>  	pr_debug("CPU %d exiting\n", policy->cpu);
>  
>  	intel_pstate_clear_update_util_hook(policy->cpu);
> -	if (hwp_active) {
> +	if (hwp_active)
>  		intel_pstate_hwp_save_state(policy);
> -		intel_pstate_hwp_force_min_perf(policy->cpu);
> -	} else {
> -		intel_cpufreq_stop_cpu(policy);
> -	}
> +
> +	intel_cpufreq_stop_cpu(policy);
>  }
>  
>  static int intel_pstate_cpu_exit(struct cpufreq_policy *policy)
> @@ -2382,13 +2423,82 @@ static void intel_cpufreq_trace(struct c
>  		fp_toint(cpu->iowait_boost * 100));
>  }
>  
> +static void intel_cpufreq_adjust_hwp(struct cpudata *cpu, u32 target_pstate,
> +				     bool fast_switch)
> +{
> +	u64 prev = READ_ONCE(cpu->hwp_req_cached), value = prev;
> +	s16 epp;
> +
> +	value &= ~HWP_MIN_PERF(~0L);
> +	value |= HWP_MIN_PERF(target_pstate);
> +
> +	/*
> +	 * The entire MSR needs to be updated in order to update the HWP min
> +	 * field in it, so opportunistically update the max too if needed.
> +	 */
> +	value &= ~HWP_MAX_PERF(~0L);
> +	value |= HWP_MAX_PERF(cpu->max_perf_ratio);
> +
> +	/*
> +	 * In case the EPP has been adjusted via sysfs, write the last cached
> +	 * value of it to the MSR as well.
> +	 */
> +	epp = READ_ONCE(cpu->epp_cached);
> +	if (epp >= 0) {
> +		value &= ~GENMASK_ULL(31, 24);
> +		value |= (u64)epp << 24;
> +	}
> +
> +	if (value == prev)
> +		return;
> +
> +	WRITE_ONCE(cpu->hwp_req_cached, value);
> +	if (fast_switch)
> +		wrmsrl(MSR_HWP_REQUEST, value);
> +	else
> +		wrmsrl_on_cpu(cpu->cpu, MSR_HWP_REQUEST, value);
> +}
> +
> +static void intel_cpufreq_adjust_perf_ctl(struct cpudata *cpu,
> +					  u32 target_pstate, bool fast_switch)
> +{
> +	if (fast_switch)
> +		wrmsrl(MSR_IA32_PERF_CTL,
> +		       pstate_funcs.get_val(cpu, target_pstate));
> +	else
> +		wrmsrl_on_cpu(cpu->cpu, MSR_IA32_PERF_CTL,
> +			      pstate_funcs.get_val(cpu, target_pstate));
> +}
> +
> +static int intel_cpufreq_update_pstate(struct cpudata *cpu, int target_pstate,
> +				       bool fast_switch)
> +{
> +	int old_pstate = cpu->pstate.current_pstate;
> +
> +	target_pstate = intel_pstate_prepare_request(cpu, target_pstate);
> +	if (target_pstate != old_pstate) {
> +		cpu->pstate.current_pstate = target_pstate;
> +		if (hwp_active)
> +			intel_cpufreq_adjust_hwp(cpu, target_pstate,
> +						 fast_switch);
> +		else
> +			intel_cpufreq_adjust_perf_ctl(cpu, target_pstate,
> +						      fast_switch);
> +	}
> +
> +	intel_cpufreq_trace(cpu, fast_switch ? INTEL_PSTATE_TRACE_FAST_SWITCH :
> +			    INTEL_PSTATE_TRACE_TARGET, old_pstate);
> +
> +	return target_pstate;
> +}
> +
>  static int intel_cpufreq_target(struct cpufreq_policy *policy,
>  				unsigned int target_freq,
>  				unsigned int relation)
>  {
>  	struct cpudata *cpu = all_cpu_data[policy->cpu];
>  	struct cpufreq_freqs freqs;
> -	int target_pstate, old_pstate;
> +	int target_pstate;
>  
>  	update_turbo_state();
>  
> @@ -2396,6 +2506,7 @@ static int intel_cpufreq_target(struct c
>  	freqs.new = target_freq;
>  
>  	cpufreq_freq_transition_begin(policy, &freqs);
> +
>  	switch (relation) {
>  	case CPUFREQ_RELATION_L:
>  		target_pstate = DIV_ROUND_UP(freqs.new, cpu-
> >pstate.scaling);
> @@ -2407,15 +2518,11 @@ static int intel_cpufreq_target(struct c
>  		target_pstate = DIV_ROUND_CLOSEST(freqs.new, cpu-
> >pstate.scaling);
>  		break;
>  	}
> -	target_pstate = intel_pstate_prepare_request(cpu,
> target_pstate);
> -	old_pstate = cpu->pstate.current_pstate;
> -	if (target_pstate != cpu->pstate.current_pstate) {
> -		cpu->pstate.current_pstate = target_pstate;
> -		wrmsrl_on_cpu(policy->cpu, MSR_IA32_PERF_CTL,
> -			      pstate_funcs.get_val(cpu,
> target_pstate));
> -	}
> +
> +	target_pstate = intel_cpufreq_update_pstate(cpu, target_pstate,
> false);
> +
>  	freqs.new = target_pstate * cpu->pstate.scaling;
> -	intel_cpufreq_trace(cpu, INTEL_PSTATE_TRACE_TARGET,
> old_pstate);
> +
>  	cpufreq_freq_transition_end(policy, &freqs, false);
>  
>  	return 0;
> @@ -2425,15 +2532,14 @@ static unsigned int intel_cpufreq_fast_s
>  					      unsigned int target_freq)
>  {
>  	struct cpudata *cpu = all_cpu_data[policy->cpu];
> -	int target_pstate, old_pstate;
> +	int target_pstate;
>  
>  	update_turbo_state();
>  
>  	target_pstate = DIV_ROUND_UP(target_freq, cpu->pstate.scaling);
> -	target_pstate = intel_pstate_prepare_request(cpu,
> target_pstate);
> -	old_pstate = cpu->pstate.current_pstate;
> -	intel_pstate_update_pstate(cpu, target_pstate);
> -	intel_cpufreq_trace(cpu, INTEL_PSTATE_TRACE_FAST_SWITCH,
> old_pstate);
> +
> +	target_pstate = intel_cpufreq_update_pstate(cpu, target_pstate,
> true);
> +
>  	return target_pstate * cpu->pstate.scaling;
>  }
>  
> @@ -2453,7 +2559,6 @@ static int intel_cpufreq_cpu_init(struct
>  		return ret;
>  
>  	policy->cpuinfo.transition_latency =
> INTEL_CPUFREQ_TRANSITION_LATENCY;
> -	policy->transition_delay_us = INTEL_CPUFREQ_TRANSITION_DELAY;
>  	/* This reflects the intel_pstate_get_cpu_pstates() setting. */
>  	policy->cur = policy->cpuinfo.min_freq;
>  
> @@ -2465,10 +2570,17 @@ static int intel_cpufreq_cpu_init(struct
>  
>  	cpu = all_cpu_data[policy->cpu];
>  
> -	if (hwp_active)
> +	if (hwp_active) {
> +		u64 value;
> +
>  		intel_pstate_get_hwp_max(policy->cpu, &turbo_max,
> &max_state);
> -	else
> +		policy->transition_delay_us =
> INTEL_CPUFREQ_TRANSITION_DELAY_HWP;
> +		rdmsrl_on_cpu(cpu->cpu, MSR_HWP_REQUEST, &value);
> +		WRITE_ONCE(cpu->hwp_req_cached, value);
> +	} else {
>  		turbo_max = cpu->pstate.turbo_pstate;
> +		policy->transition_delay_us =
> INTEL_CPUFREQ_TRANSITION_DELAY;
> +	}
>  
>  	min_freq = DIV_ROUND_UP(turbo_max * global.min_perf_pct, 100);
>  	min_freq *= cpu->pstate.scaling;
> @@ -2545,6 +2657,10 @@ static void intel_pstate_driver_cleanup(
>  		}
>  	}
>  	put_online_cpus();
> +
> +	if (intel_pstate_driver == &intel_pstate)
> +		intel_pstate_sysfs_hide_hwp_dynamic_boost();
> +
>  	intel_pstate_driver = NULL;
>  }
>  
> @@ -2552,6 +2668,9 @@ static int intel_pstate_register_driver(
>  {
>  	int ret;
>  
> +	if (driver == &intel_pstate)
> +		intel_pstate_sysfs_expose_hwp_dynamic_boost();
> +
>  	memset(&global, 0, sizeof(global));
>  	global.max_perf_pct = 100;
>  
> @@ -2569,9 +2688,6 @@ static int intel_pstate_register_driver(
>  
>  static int intel_pstate_unregister_driver(void)
>  {
> -	if (hwp_active)
> -		return -EBUSY;
> -
>  	cpufreq_unregister_driver(intel_pstate_driver);
>  	intel_pstate_driver_cleanup();
>  
> @@ -2827,7 +2943,10 @@ static int __init intel_pstate_init(void
>  			hwp_active++;
>  			hwp_mode_bdw = id->driver_data;
>  			intel_pstate.attr = hwp_cpufreq_attrs;
> -			default_driver = &intel_pstate;
> +			intel_cpufreq.attr = hwp_cpufreq_attrs;
> +			if (!default_driver)
> +				default_driver = &intel_pstate;
> +
>  			goto hwp_cpu_matched;
>  		}
>  	} else {
> @@ -2898,14 +3017,13 @@ static int __init intel_pstate_setup(cha
>  	if (!str)
>  		return -EINVAL;
>  
> -	if (!strcmp(str, "disable")) {
> +	if (!strcmp(str, "disable"))
>  		no_load = 1;
> -	} else if (!strcmp(str, "active")) {
> +	else if (!strcmp(str, "active"))
>  		default_driver = &intel_pstate;
> -	} else if (!strcmp(str, "passive")) {
> +	else if (!strcmp(str, "passive"))
>  		default_driver = &intel_cpufreq;
> -		no_hwp = 1;
> -	}
> +
>  	if (!strcmp(str, "no_hwp")) {
>  		pr_info("HWP disabled\n");
>  		no_hwp = 1;
> Index: linux-pm/Documentation/admin-guide/pm/intel_pstate.rst
> ===================================================================
> --- linux-pm.orig/Documentation/admin-guide/pm/intel_pstate.rst
> +++ linux-pm/Documentation/admin-guide/pm/intel_pstate.rst
> @@ -54,10 +54,13 @@ registered (see `below <status_attr_>`_)
>  Operation Modes
>  ===============
>  
> -``intel_pstate`` can operate in three different modes: in the active
> mode with
> -or without hardware-managed P-states support and in the passive
> mode.  Which of
> -them will be in effect depends on what kernel command line options
> are used and
> -on the capabilities of the processor.
> +``intel_pstate`` can operate in two different modes, active or
> passive.  In the
> +active mode, it uses its own internal preformance scaling governor
> algorithm or
> +allows the hardware to do preformance scaling by itself, while in
> the passive
> +mode it responds to requests made by a generic ``CPUFreq`` governor
> implementing
> +a certain performance scaling algorithm.  Which of them will be in
> effect
> +depends on what kernel command line options are used and on the
> capabilities of
> +the processor.
>  
>  Active Mode
>  -----------
> @@ -194,10 +197,11 @@ This is the default operation mode of ``
>  hardware-managed P-states (HWP) support.  It is always used if the
>  ``intel_pstate=passive`` argument is passed to the kernel in the
> command line
>  regardless of whether or not the given processor supports
> HWP.  [Note that the
> -``intel_pstate=no_hwp`` setting implies ``intel_pstate=passive`` if
> it is used
> -without ``intel_pstate=active``.]  Like in the active mode without
> HWP support,
> -in this mode ``intel_pstate`` may refuse to work with processors
> that are not
> -recognized by it.
> +``intel_pstate=no_hwp`` setting causes the driver to start in the
> passive mode
> +if it is not combined with ``intel_pstate=active``.]  Like in the
> active mode
> +without HWP support, in this mode ``intel_pstate`` may refuse to
> work with
> +processors that are not recognized by it if HWP is prevented from
> being enabled
> +through the kernel command line.
>  
>  If the driver works in this mode, the ``scaling_driver`` policy
> attribute in
>  ``sysfs`` for all ``CPUFreq`` policies contains the string
> "intel_cpufreq".
> @@ -318,10 +322,9 @@ manuals need to be consulted to get to i
>  
>  For this reason, there is a list of supported processors in
> ``intel_pstate`` and
>  the driver initialization will fail if the detected processor is not
> in that
> -list, unless it supports the `HWP feature <Active Mode_>`_.  [The
> interface to
> -obtain all of the information listed above is the same for all of
> the processors
> -supporting the HWP feature, which is why they all are supported by
> -``intel_pstate``.]
> +list, unless it supports the HWP feature.  [The interface to obtain
> all of the
> +information listed above is the same for all of the processors
> supporting the
> +HWP feature, which is why ``intel_pstate`` works with all of them.]
>  
>  
>  User Space Interface in ``sysfs``
> @@ -425,22 +428,16 @@ argument is passed to the kernel in the
>  	as well as the per-policy ones) are then reset to their default
>  	values, possibly depending on the target operation mode.]
>  
> -	That only is supported in some configurations, though (for
> example, if
> -	the `HWP feature is enabled in the processor <Active Mode With
> HWP_>`_,
> -	the operation mode of the driver cannot be changed), and if it
> is not
> -	supported in the current configuration, writes to this
> attribute will
> -	fail with an appropriate error.
> -
>  ``energy_efficiency``
> -	This attribute is only present on platforms, which have CPUs
> matching
> -	Kaby Lake or Coffee Lake desktop CPU model. By default
> -	energy efficiency optimizations are disabled on these CPU
> models in HWP
> -	mode by this driver. Enabling energy efficiency may limit
> maximum
> -	operating frequency in both HWP and non HWP mode. In non HWP
> mode,
> -	optimizations are done only in the turbo frequency range. In
> HWP mode,
> -	optimizations are done in the entire frequency range. Setting
> this
> -	attribute to "1" enables energy efficiency optimizations and
> setting
> -	to "0" disables energy efficiency optimizations.
> +	This attribute is only present on platforms with CPUs matching
> the Kaby
> +	Lake or Coffee Lake desktop CPU model. By default, energy-
> efficiency
> +	optimizations are disabled on these CPU models if HWP is
> enabled.
> +	Enabling energy-efficiency optimizations may limit maximum
> operating
> +	frequency with or without the HWP feature.  With HWP enabled,
> the
> +	optimizations are done only in the turbo frequency
> range.  Without it,
> +	they are done in the entire available frequency range.  Setting
> this
> +	attribute to "1" enables the energy-efficiency optimizations
> and setting
> +	to "0" disables them.
>  
>  Interpretation of Policy Attributes
>  -----------------------------------
> @@ -484,8 +481,8 @@ Next, the following policy attributes ha
>  	policy for the time interval between the last two invocations
> of the
>  	driver's utilization update callback by the CPU scheduler for
> that CPU.
>  
> -One more policy attribute is present if the `HWP feature is enabled
> in the
> -processor <Active Mode With HWP_>`_:
> +One more policy attribute is present if the HWP feature is enabled
> in the
> +processor:
>  
>  ``base_frequency``
>  	Shows the base frequency of the CPU. Any frequency above this
> will be
> @@ -526,11 +523,11 @@ on the following rules, regardless of th
>  
>   3. The global and per-policy limits can be set independently.
>  
> -If the `HWP feature is enabled in the processor <Active Mode With
> HWP_>`_, the
> -resulting effective values are written into its registers whenever
> the limits
> -change in order to request its internal P-state selection logic to
> always set
> -P-states within these limits.  Otherwise, the limits are taken into
> account by
> -scaling governors (in the `passive mode <Passive Mode_>`_) and by
> the driver
> +In the `active mode with the HWP feature enabled <Active Mode With
> HWP_>`_, the
> +resulting effective values are written into hardware registers
> whenever the
> +limits change in order to request its internal P-state selection
> logic to always
> +set P-states within these limits.  Otherwise, the limits are taken
> into account
> +by scaling governors (in the `passive mode <Passive Mode_>`_) and by
> the driver
>  every time before setting a new P-state for a CPU.
>  
>  Additionally, if the ``intel_pstate=per_cpu_perf_limits`` command
> line argument
> @@ -541,12 +538,11 @@ at all and the only way to set the limit
>  Energy vs Performance Hints
>  ---------------------------
>  
> -If ``intel_pstate`` works in the `active mode with the HWP feature
> enabled
> -<Active Mode With HWP_>`_ in the processor, additional attributes
> are present
> -in every ``CPUFreq`` policy directory in ``sysfs``.  They are
> intended to allow
> -user space to help ``intel_pstate`` to adjust the processor's
> internal P-state
> -selection logic by focusing it on performance or on energy-
> efficiency, or
> -somewhere between the two extremes:
> +If the hardware-managed P-states (HWP) is enabled in the processor,
> additional
> +attributes, intended to allow user space to help ``intel_pstate`` to
> adjust the
> +processor's internal P-state selection logic by focusing it on
> performance or on
> +energy-efficiency, or somewhere between the two extremes, are
> present in every
> +``CPUFreq`` policy directory in ``sysfs``.  They are :
>  
>  ``energy_performance_preference``
>  	Current value of the energy vs performance hint for the given
> policy
> @@ -650,12 +646,14 @@ of them have to be prepended with the ``
>  	Do not register ``intel_pstate`` as the scaling driver even if
> the
>  	processor is supported by it.
>  
> +``active``
> +	Register ``intel_pstate`` in the `active mode <Active Mode_>`_
> to start
> +	with.
> +
>  ``passive``
>  	Register ``intel_pstate`` in the `passive mode <Passive
> Mode_>`_ to
>  	start with.
>  
> -	This option implies the ``no_hwp`` one described below.
> -
>  ``force``
>  	Register ``intel_pstate`` as the scaling driver instead of
>  	``acpi-cpufreq`` even if the latter is preferred on the given
> system.
> @@ -670,13 +668,12 @@ of them have to be prepended with the ``
>  	driver is used instead of ``acpi-cpufreq``.
>  
>  ``no_hwp``
> -	Do not enable the `hardware-managed P-states (HWP) feature
> -	<Active Mode With HWP_>`_ even if it is supported by the
> processor.
> +	Do not enable the hardware-managed P-states (HWP) feature even
> if it is
> +	supported by the processor.
>  
>  ``hwp_only``
>  	Register ``intel_pstate`` as the scaling driver only if the
> -	`hardware-managed P-states (HWP) feature <Active Mode With
> HWP_>`_ is
> -	supported by the processor.
> +	hardware-managed P-states (HWP) feature is supported by the
> processor.
>  
>  ``support_acpi_ppc``
>  	Take ACPI ``_PPC`` performance limits into account.
> 
> 
>
Doug Smythies Aug. 2, 2020, 2:14 p.m. UTC | #2
On 2020.08.01 16:41 Srinivas Pandruvada wrote:
> On Tue, 2020-07-28 at 17:13 +0200, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >
> > Allow intel_pstate to work in the passive mode with HWP enabled and
> > make it set the HWP minimum performance limit (HWP floor) to the
> > P-state value given by the target frequency supplied by the cpufreq
> > governor, so as to prevent the HWP algorithm and the CPU scheduler
> > from working against each other, at least when the schedutil governor
> > is in use, and update the intel_pstate documentation accordingly.
> >
> > Among other things, this allows utilization clamps to be taken
> > into account, at least to a certain extent, when intel_pstate is
> > in use and makes it more likely that sufficient capacity for
> > deadline tasks will be provided.
> >
> > After this change, the resulting behavior of an HWP system with
> > intel_pstate in the passive mode should be close to the behavior
> > of the analogous non-HWP system with intel_pstate in the passive
> > mode, except that in the frequency range below the base frequency
> > (i.e. the frequency returned by the base_frequency cpufreq attribute
> > in sysfs on HWP systems) the HWP algorithm is allowed to go above
> > the floor P-state set by intel_pstate with or without hardware
> > coordination of P-states among CPUs in the same package.
> >
> Do you mean HWP.req.min will be below base_freq (unless user overrides
> it)?

No.

> With busy workload I see HWP req.min = HWP req.max.
> The base freq: 1.3GHz (ratio 0x0d), MAX 1C turbo: 3.9GHz (ratio: 0x27)
> When I monitor MSR 0x774 (HWP_REQ), I see
> 0x80002727

Yes, that is what I expect to see.

> 
> Normally msr 0x774
> 0x80002704

That would be "active" mode and the powersave governor, correct?
And yes that is what I expect for your processor.
For mine, load or no load, decoded:
0x774: IA32_HWP_REQUEST:    CPU 0-5 :
    raw: 80002E08 : 80002E08 : 80002E08 : 80002E08 : 80002E08 : 80002E08 :
    min:        8 :        8 :        8 :        8 :        8 :        8 :
    max:       46 :       46 :       46 :       46 :       46 :       46 :
    des:        0 :        0 :        0 :        0 :        0 :        0 :
    epp:      128 :      128 :      128 :      128 :      128 :      128 :
    act:        0 :        0 :        0 :        0 :        0 :        0 :

This thread is about passive mode, and I do not expect the last byte to be
4 (8 for mine) under load.
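
For reference, the raw values quoted in this thread can be decoded by hand with a few shifts. The sketch below is illustrative only: it assumes the IA32_HWP_REQUEST layout implied by the decoded dump above (min in bits 7:0, max in bits 15:8, desired in bits 23:16, EPP in bits 31:24) and a 100 MHz ratio step, which matches the 1.3GHz / 0x0d and 3.9GHz / 0x27 pairs reported earlier. The names struct hwp_request, hwp_decode() and ratio_to_mhz() are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Decoded low 32 bits of an IA32_HWP_REQUEST value (hypothetical type). */
struct hwp_request {
	unsigned int min;	/* bits 7:0   */
	unsigned int max;	/* bits 15:8  */
	unsigned int desired;	/* bits 23:16 */
	unsigned int epp;	/* bits 31:24 */
};

/* Split a raw MSR value into its four 8-bit fields. */
static inline struct hwp_request hwp_decode(uint64_t raw)
{
	struct hwp_request r = {
		.min	 = (unsigned int)(raw & 0xff),
		.max	 = (unsigned int)((raw >> 8) & 0xff),
		.desired = (unsigned int)((raw >> 16) & 0xff),
		.epp	 = (unsigned int)((raw >> 24) & 0xff),
	};

	return r;
}

/* Convert a performance ratio to MHz, assuming a 100 MHz step per ratio. */
static inline unsigned int ratio_to_mhz(unsigned int ratio)
{
	assert(ratio <= 0xff);
	return ratio * 100;
}
```

Under these assumptions, 0x80002727 decodes to min = max = 0x27 (3900 MHz) with EPP 0x80, and 0x80002E08 decodes to min 8, max 46, desired 0, EPP 128, which agrees with the dump above.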

... Doug
srinivas pandruvada Aug. 2, 2020, 7:20 p.m. UTC | #3
On Sun, 2020-08-02 at 07:14 -0700, Doug Smythies wrote:
> On 2020.08.01 16:41 Srinivas Pandruvada wrote:
> > On Tue, 2020-07-28 at 17:13 +0200, Rafael J. Wysocki wrote:
> > > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > > 
> > > Allow intel_pstate to work in the passive mode with HWP enabled
> > > and
> > > make it set the HWP minimum performance limit (HWP floor) to the
> > > P-state value given by the target frequency supplied by the
> > > cpufreq
> > > governor, so as to prevent the HWP algorithm and the CPU
> > > scheduler
> > > from working against each other, at least when the schedutil
> > > governor
> > > is in use, and update the intel_pstate documentation accordingly.
> > > 
> > > Among other things, this allows utilization clamps to be taken
> > > into account, at least to a certain extent, when intel_pstate is
> > > in use and makes it more likely that sufficient capacity for
> > > deadline tasks will be provided.
> > > 
> > > After this change, the resulting behavior of an HWP system with
> > > intel_pstate in the passive mode should be close to the behavior
> > > of the analogous non-HWP system with intel_pstate in the passive
> > > mode, except that in the frequency range below the base frequency
> > > (i.e. the frequency returned by the base_frequency cpufreq
> > > attribute
> > > in sysfs on HWP systems) the HWP algorithm is allowed to go above
> > > the floor P-state set by intel_pstate with or without hardware
> > > coordination of P-states among CPUs in the same package.
> > > 
> > Do you mean HWP.req.min will be below base_freq (unless user
> > overrides
> > it)?
> 
> No.
Correct. I was just thinking about the relation to base_freq.
I can set the floor above or below base_freq, and HWP will still reach up to
the ceiling/max.

For example:

Floor above base of 0x0d

Busy%	Bzy_MHz	TSC_MHz	            M0X774
51.33	3500	1498	0x0000000000000000
99.70	3500	1498	0x000000008000270e
2.74	3500	1498	0x000000008000270e
2.92	3500	1498	0x000000008000270e
99.77	3500	1498	0x000000008000270e
99.78	3500	1498	0x000000008000270e
2.98	3500	1498	0x000000008000270e
99.75	3500	1498	0x000000008000270e
3.01	3500	1498	0x000000008000270e

Floor Below base of 0x0d

Busy%	Bzy_MHz	TSC_MHz	            M0X774
51.33	3500	1498	0x0000000000000000
3.08	3500	1498	0x000000008000270c
99.77	3500	1498	0x000000008000270c
2.87	3500	1498	0x000000008000270c
99.75	3500	1498	0x000000008000270c
2.81	3500	1498	0x000000008000270c
99.76	3500	1498	0x000000008000270c
99.78	3500	1498	0x000000008000270c
2.82	3500	1498	0x000000008000270c
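
For readers following the raw values above, here is a minimal user-space sketch of how an IA32_HWP_REQUEST (MSR 0x774) value splits into its fields. The field layout (min in bits 7:0, max in 15:8, desired in 23:16, EPP in 31:24) matches the decoded output quoted later in this thread; the struct and function names are hypothetical and are not part of the patch.

```c
#include <stdint.h>

/* Hypothetical helper type: the low four byte-wide fields of MSR 0x774. */
struct hwp_request {
	unsigned int min_perf;	/* bits  7:0,  the HWP floor   */
	unsigned int max_perf;	/* bits 15:8,  the HWP ceiling */
	unsigned int desired;	/* bits 23:16, 0 = autonomous  */
	unsigned int epp;	/* bits 31:24, energy-performance preference */
};

/* Split a raw IA32_HWP_REQUEST value into its byte-wide fields. */
static inline struct hwp_request hwp_request_decode(uint64_t raw)
{
	struct hwp_request r = {
		.min_perf = (unsigned int)(raw & 0xff),
		.max_perf = (unsigned int)((raw >> 8) & 0xff),
		.desired  = (unsigned int)((raw >> 16) & 0xff),
		.epp      = (unsigned int)((raw >> 24) & 0xff),
	};

	return r;
}
```

With this decoding, 0x8000270e has a floor of 0x0e (one above the base ratio 0x0d) and 0x8000270c has a floor of 0x0c (one below it); both keep the ceiling at 0x27 and the EPP at 0x80.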


Thanks,
Srinivas

> > With busy workload I see HWP req.min = HWP req.max.
> > The base freq: 1.3GHz (ratio 0x0d), MAX 1C turbo: 3.9GHz (ratio: 0x27)
> > When I monitor MSR 0x774 (HWP_REQ), I see
> > 0x80002727
> 
> Yes, that is what I expect to see.
> 
> > Normally msr 0x774
> > 0x80002704
> 
> That would be "active" mode and the powersave governor, correct?
> And yes that is what I expect for your processor.
> For mine, load or no load, decoded:
> 0x774: IA32_HWP_REQUEST:    CPU 0-5 :
>     raw: 80002E08 : 80002E08 : 80002E08 : 80002E08 : 80002E08 : 80002E08 :
>     min:        8 :        8 :        8 :        8 :        8 :        8 :
>     max:       46 :       46 :       46 :       46 :       46 :       46 :
>     des:        0 :        0 :        0 :        0 :        0 :        0 :
>     epp:      128 :      128 :      128 :      128 :      128 :      128 :
>     act:        0 :        0 :        0 :        0 :        0 :        0 :
> 
> This thread is about passive mode, and myself, I do not expect the
> last byte to be 4 (8 for mine) under load.
> 
> ... Doug
> 
>
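
The values discussed in this thread can be reproduced with a small user-space model of the MSR update that the patch below performs in intel_cpufreq_adjust_hwp(). This is only a sketch under stated assumptions: the two macros mirror the kernel's HWP_MIN_PERF()/HWP_MAX_PERF() helpers as used in the patch, hwp_request_with_floor() is a hypothetical name, and no MSR is actually written.

```c
#include <stdint.h>

/* Modeled on the kernel macros used by the patch (assumed definitions). */
#define HWP_MIN_PERF(x)	((uint64_t)(x) & 0xff)
#define HWP_MAX_PERF(x)	(((uint64_t)(x) & 0xff) << 8)
#define HWP_EPP_MASK	(0xffULL << 24)

/*
 * Compute the next HWP request value from the cached one:
 * replace the floor, refresh the ceiling, and apply a cached EPP
 * value only if one has been set (negative means "leave it alone").
 */
static inline uint64_t hwp_request_with_floor(uint64_t prev,
					      unsigned int target_pstate,
					      unsigned int max_perf_ratio,
					      int epp)
{
	uint64_t value = prev;

	value &= ~HWP_MIN_PERF(~0ULL);
	value |= HWP_MIN_PERF(target_pstate);

	value &= ~HWP_MAX_PERF(~0ULL);
	value |= HWP_MAX_PERF(max_perf_ratio);

	if (epp >= 0) {
		value &= ~HWP_EPP_MASK;
		value |= ((uint64_t)epp & 0xff) << 24;
	}

	return value;
}
```

Starting from the idle active-mode value 0x80002704, a target P-state of 0x0e with a ceiling of 0x27 yields 0x8000270e, and a target equal to the ceiling yields 0x80002727, which is the min == max case reported under a busy workload.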

Patch

Index: linux-pm/drivers/cpufreq/intel_pstate.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/intel_pstate.c
+++ linux-pm/drivers/cpufreq/intel_pstate.c
@@ -36,6 +36,7 @@ 
 #define INTEL_PSTATE_SAMPLING_INTERVAL	(10 * NSEC_PER_MSEC)
 
 #define INTEL_CPUFREQ_TRANSITION_LATENCY	20000
+#define INTEL_CPUFREQ_TRANSITION_DELAY_HWP	5000
 #define INTEL_CPUFREQ_TRANSITION_DELAY		500
 
 #ifdef CONFIG_ACPI
@@ -220,6 +221,7 @@  struct global_params {
  *			preference/bias
  * @epp_saved:		Saved EPP/EPB during system suspend or CPU offline
  *			operation
+ * @epp_cached:		Cached HWP energy-performance preference value
  * @hwp_req_cached:	Cached value of the last HWP Request MSR
  * @hwp_cap_cached:	Cached value of the last HWP Capabilities MSR
  * @last_io_update:	Last time when IO wake flag was set
@@ -257,6 +259,7 @@  struct cpudata {
 	s16 epp_policy;
 	s16 epp_default;
 	s16 epp_saved;
+	s16 epp_cached;
 	u64 hwp_req_cached;
 	u64 hwp_cap_cached;
 	u64 last_io_update;
@@ -690,6 +693,8 @@  static ssize_t show_energy_performance_a
 
 cpufreq_freq_attr_ro(energy_performance_available_preferences);
 
+static struct cpufreq_driver intel_pstate;
+
 static ssize_t store_energy_performance_preference(
 		struct cpufreq_policy *policy, const char *buf, size_t count)
 {
@@ -718,14 +723,35 @@  static ssize_t store_energy_performance_
 		raw = true;
 	}
 
+	mutex_lock(&intel_pstate_driver_lock);
+
+	if (!intel_pstate_driver) {
+		mutex_unlock(&intel_pstate_driver_lock);
+		return -EAGAIN;
+	}
+
 	mutex_lock(&intel_pstate_limits_lock);
 
-	ret = intel_pstate_set_energy_pref_index(cpu_data, ret, raw, epp);
-	if (!ret)
+	if (intel_pstate_driver == &intel_pstate) {
+		ret = intel_pstate_set_energy_pref_index(cpu_data, ret, raw, epp);
+		if (!ret)
+			ret = count;
+	} else {
+		/*
+		 * In the passive mode simply update the cached EPP value and
+		 * rely on intel_cpufreq_adjust_hwp() to pick it up later.
+		 */
+		if (!raw)
+			epp = ret ? epp_values[ret - 1] : cpu_data->epp_default;
+
+		WRITE_ONCE(cpu_data->epp_cached, epp);
 		ret = count;
+	}
 
 	mutex_unlock(&intel_pstate_limits_lock);
 
+	mutex_unlock(&intel_pstate_driver_lock);
+
 	return ret;
 }
 
@@ -1138,8 +1164,6 @@  static ssize_t store_no_turbo(struct kob
 	return count;
 }
 
-static struct cpufreq_driver intel_pstate;
-
 static void update_qos_request(enum freq_qos_req_type type)
 {
 	int max_state, turbo_max, freq, i, perf_pct;
@@ -1323,9 +1347,10 @@  static const struct attribute_group inte
 
 static const struct x86_cpu_id intel_pstate_cpu_ee_disable_ids[];
 
+static struct kobject *intel_pstate_kobject;
+
 static void __init intel_pstate_sysfs_expose_params(void)
 {
-	struct kobject *intel_pstate_kobject;
 	int rc;
 
 	intel_pstate_kobject = kobject_create_and_add("intel_pstate",
@@ -1350,17 +1375,31 @@  static void __init intel_pstate_sysfs_ex
 	rc = sysfs_create_file(intel_pstate_kobject, &min_perf_pct.attr);
 	WARN_ON(rc);
 
-	if (hwp_active) {
-		rc = sysfs_create_file(intel_pstate_kobject,
-				       &hwp_dynamic_boost.attr);
-		WARN_ON(rc);
-	}
-
 	if (x86_match_cpu(intel_pstate_cpu_ee_disable_ids)) {
 		rc = sysfs_create_file(intel_pstate_kobject, &energy_efficiency.attr);
 		WARN_ON(rc);
 	}
 }
+
+static void intel_pstate_sysfs_expose_hwp_dynamic_boost(void)
+{
+	int rc;
+
+	if (!hwp_active)
+		return;
+
+	rc = sysfs_create_file(intel_pstate_kobject, &hwp_dynamic_boost.attr);
+	WARN_ON_ONCE(rc);
+}
+
+static void intel_pstate_sysfs_hide_hwp_dynamic_boost(void)
+{
+	if (!hwp_active)
+		return;
+
+	sysfs_remove_file(intel_pstate_kobject, &hwp_dynamic_boost.attr);
+}
+
 /************************** sysfs end ************************/
 
 static void intel_pstate_hwp_enable(struct cpudata *cpudata)
@@ -2041,6 +2080,7 @@  static int intel_pstate_init_cpu(unsigne
 		cpu->epp_default = -EINVAL;
 		cpu->epp_powersave = -EINVAL;
 		cpu->epp_saved = -EINVAL;
+		WRITE_ONCE(cpu->epp_cached, -EINVAL);
 	}
 
 	cpu = all_cpu_data[cpunum];
@@ -2239,7 +2279,10 @@  static int intel_pstate_verify_policy(st
 
 static void intel_cpufreq_stop_cpu(struct cpufreq_policy *policy)
 {
-	intel_pstate_set_min_pstate(all_cpu_data[policy->cpu]);
+	if (hwp_active)
+		intel_pstate_hwp_force_min_perf(policy->cpu);
+	else
+		intel_pstate_set_min_pstate(all_cpu_data[policy->cpu]);
 }
 
 static void intel_pstate_stop_cpu(struct cpufreq_policy *policy)
@@ -2247,12 +2290,10 @@  static void intel_pstate_stop_cpu(struct
 	pr_debug("CPU %d exiting\n", policy->cpu);
 
 	intel_pstate_clear_update_util_hook(policy->cpu);
-	if (hwp_active) {
+	if (hwp_active)
 		intel_pstate_hwp_save_state(policy);
-		intel_pstate_hwp_force_min_perf(policy->cpu);
-	} else {
-		intel_cpufreq_stop_cpu(policy);
-	}
+
+	intel_cpufreq_stop_cpu(policy);
 }
 
 static int intel_pstate_cpu_exit(struct cpufreq_policy *policy)
@@ -2382,13 +2423,82 @@  static void intel_cpufreq_trace(struct c
 		fp_toint(cpu->iowait_boost * 100));
 }
 
+static void intel_cpufreq_adjust_hwp(struct cpudata *cpu, u32 target_pstate,
+				     bool fast_switch)
+{
+	u64 prev = READ_ONCE(cpu->hwp_req_cached), value = prev;
+	s16 epp;
+
+	value &= ~HWP_MIN_PERF(~0L);
+	value |= HWP_MIN_PERF(target_pstate);
+
+	/*
+	 * The entire MSR needs to be updated in order to update the HWP min
+	 * field in it, so opportunistically update the max too if needed.
+	 */
+	value &= ~HWP_MAX_PERF(~0L);
+	value |= HWP_MAX_PERF(cpu->max_perf_ratio);
+
+	/*
+	 * In case the EPP has been adjusted via sysfs, write the last cached
+	 * value of it to the MSR as well.
+	 */
+	epp = READ_ONCE(cpu->epp_cached);
+	if (epp >= 0) {
+		value &= ~GENMASK_ULL(31, 24);
+		value |= (u64)epp << 24;
+	}
+
+	if (value == prev)
+		return;
+
+	WRITE_ONCE(cpu->hwp_req_cached, value);
+	if (fast_switch)
+		wrmsrl(MSR_HWP_REQUEST, value);
+	else
+		wrmsrl_on_cpu(cpu->cpu, MSR_HWP_REQUEST, value);
+}
+
+static void intel_cpufreq_adjust_perf_ctl(struct cpudata *cpu,
+					  u32 target_pstate, bool fast_switch)
+{
+	if (fast_switch)
+		wrmsrl(MSR_IA32_PERF_CTL,
+		       pstate_funcs.get_val(cpu, target_pstate));
+	else
+		wrmsrl_on_cpu(cpu->cpu, MSR_IA32_PERF_CTL,
+			      pstate_funcs.get_val(cpu, target_pstate));
+}
+
+static int intel_cpufreq_update_pstate(struct cpudata *cpu, int target_pstate,
+				       bool fast_switch)
+{
+	int old_pstate = cpu->pstate.current_pstate;
+
+	target_pstate = intel_pstate_prepare_request(cpu, target_pstate);
+	if (target_pstate != old_pstate) {
+		cpu->pstate.current_pstate = target_pstate;
+		if (hwp_active)
+			intel_cpufreq_adjust_hwp(cpu, target_pstate,
+						 fast_switch);
+		else
+			intel_cpufreq_adjust_perf_ctl(cpu, target_pstate,
+						      fast_switch);
+	}
+
+	intel_cpufreq_trace(cpu, fast_switch ? INTEL_PSTATE_TRACE_FAST_SWITCH :
+			    INTEL_PSTATE_TRACE_TARGET, old_pstate);
+
+	return target_pstate;
+}
+
 static int intel_cpufreq_target(struct cpufreq_policy *policy,
 				unsigned int target_freq,
 				unsigned int relation)
 {
 	struct cpudata *cpu = all_cpu_data[policy->cpu];
 	struct cpufreq_freqs freqs;
-	int target_pstate, old_pstate;
+	int target_pstate;
 
 	update_turbo_state();
 
@@ -2396,6 +2506,7 @@  static int intel_cpufreq_target(struct c
 	freqs.new = target_freq;
 
 	cpufreq_freq_transition_begin(policy, &freqs);
+
 	switch (relation) {
 	case CPUFREQ_RELATION_L:
 		target_pstate = DIV_ROUND_UP(freqs.new, cpu->pstate.scaling);
@@ -2407,15 +2518,11 @@  static int intel_cpufreq_target(struct c
 		target_pstate = DIV_ROUND_CLOSEST(freqs.new, cpu->pstate.scaling);
 		break;
 	}
-	target_pstate = intel_pstate_prepare_request(cpu, target_pstate);
-	old_pstate = cpu->pstate.current_pstate;
-	if (target_pstate != cpu->pstate.current_pstate) {
-		cpu->pstate.current_pstate = target_pstate;
-		wrmsrl_on_cpu(policy->cpu, MSR_IA32_PERF_CTL,
-			      pstate_funcs.get_val(cpu, target_pstate));
-	}
+
+	target_pstate = intel_cpufreq_update_pstate(cpu, target_pstate, false);
+
 	freqs.new = target_pstate * cpu->pstate.scaling;
-	intel_cpufreq_trace(cpu, INTEL_PSTATE_TRACE_TARGET, old_pstate);
+
 	cpufreq_freq_transition_end(policy, &freqs, false);
 
 	return 0;
@@ -2425,15 +2532,14 @@  static unsigned int intel_cpufreq_fast_s
 					      unsigned int target_freq)
 {
 	struct cpudata *cpu = all_cpu_data[policy->cpu];
-	int target_pstate, old_pstate;
+	int target_pstate;
 
 	update_turbo_state();
 
 	target_pstate = DIV_ROUND_UP(target_freq, cpu->pstate.scaling);
-	target_pstate = intel_pstate_prepare_request(cpu, target_pstate);
-	old_pstate = cpu->pstate.current_pstate;
-	intel_pstate_update_pstate(cpu, target_pstate);
-	intel_cpufreq_trace(cpu, INTEL_PSTATE_TRACE_FAST_SWITCH, old_pstate);
+
+	target_pstate = intel_cpufreq_update_pstate(cpu, target_pstate, true);
+
 	return target_pstate * cpu->pstate.scaling;
 }
 
@@ -2453,7 +2559,6 @@  static int intel_cpufreq_cpu_init(struct
 		return ret;
 
 	policy->cpuinfo.transition_latency = INTEL_CPUFREQ_TRANSITION_LATENCY;
-	policy->transition_delay_us = INTEL_CPUFREQ_TRANSITION_DELAY;
 	/* This reflects the intel_pstate_get_cpu_pstates() setting. */
 	policy->cur = policy->cpuinfo.min_freq;
 
@@ -2465,10 +2570,17 @@  static int intel_cpufreq_cpu_init(struct
 
 	cpu = all_cpu_data[policy->cpu];
 
-	if (hwp_active)
+	if (hwp_active) {
+		u64 value;
+
 		intel_pstate_get_hwp_max(policy->cpu, &turbo_max, &max_state);
-	else
+		policy->transition_delay_us = INTEL_CPUFREQ_TRANSITION_DELAY_HWP;
+		rdmsrl_on_cpu(cpu->cpu, MSR_HWP_REQUEST, &value);
+		WRITE_ONCE(cpu->hwp_req_cached, value);
+	} else {
 		turbo_max = cpu->pstate.turbo_pstate;
+		policy->transition_delay_us = INTEL_CPUFREQ_TRANSITION_DELAY;
+	}
 
 	min_freq = DIV_ROUND_UP(turbo_max * global.min_perf_pct, 100);
 	min_freq *= cpu->pstate.scaling;
@@ -2545,6 +2657,10 @@  static void intel_pstate_driver_cleanup(
 		}
 	}
 	put_online_cpus();
+
+	if (intel_pstate_driver == &intel_pstate)
+		intel_pstate_sysfs_hide_hwp_dynamic_boost();
+
 	intel_pstate_driver = NULL;
 }
 
@@ -2552,6 +2668,9 @@  static int intel_pstate_register_driver(
 {
 	int ret;
 
+	if (driver == &intel_pstate)
+		intel_pstate_sysfs_expose_hwp_dynamic_boost();
+
 	memset(&global, 0, sizeof(global));
 	global.max_perf_pct = 100;
 
@@ -2569,9 +2688,6 @@  static int intel_pstate_register_driver(
 
 static int intel_pstate_unregister_driver(void)
 {
-	if (hwp_active)
-		return -EBUSY;
-
 	cpufreq_unregister_driver(intel_pstate_driver);
 	intel_pstate_driver_cleanup();
 
@@ -2827,7 +2943,10 @@  static int __init intel_pstate_init(void
 			hwp_active++;
 			hwp_mode_bdw = id->driver_data;
 			intel_pstate.attr = hwp_cpufreq_attrs;
-			default_driver = &intel_pstate;
+			intel_cpufreq.attr = hwp_cpufreq_attrs;
+			if (!default_driver)
+				default_driver = &intel_pstate;
+
 			goto hwp_cpu_matched;
 		}
 	} else {
@@ -2898,14 +3017,13 @@  static int __init intel_pstate_setup(cha
 	if (!str)
 		return -EINVAL;
 
-	if (!strcmp(str, "disable")) {
+	if (!strcmp(str, "disable"))
 		no_load = 1;
-	} else if (!strcmp(str, "active")) {
+	else if (!strcmp(str, "active"))
 		default_driver = &intel_pstate;
-	} else if (!strcmp(str, "passive")) {
+	else if (!strcmp(str, "passive"))
 		default_driver = &intel_cpufreq;
-		no_hwp = 1;
-	}
+
 	if (!strcmp(str, "no_hwp")) {
 		pr_info("HWP disabled\n");
 		no_hwp = 1;
Index: linux-pm/Documentation/admin-guide/pm/intel_pstate.rst
===================================================================
--- linux-pm.orig/Documentation/admin-guide/pm/intel_pstate.rst
+++ linux-pm/Documentation/admin-guide/pm/intel_pstate.rst
@@ -54,10 +54,13 @@  registered (see `below <status_attr_>`_)
 Operation Modes
 ===============
 
-``intel_pstate`` can operate in three different modes: in the active mode with
-or without hardware-managed P-states support and in the passive mode.  Which of
-them will be in effect depends on what kernel command line options are used and
-on the capabilities of the processor.
+``intel_pstate`` can operate in two different modes, active or passive.  In the
+active mode, it uses its own internal performance scaling governor algorithm or
+allows the hardware to do performance scaling by itself, while in the passive
+mode it responds to requests made by a generic ``CPUFreq`` governor implementing
+a certain performance scaling algorithm.  Which of them will be in effect
+depends on what kernel command line options are used and on the capabilities of
+the processor.
 
 Active Mode
 -----------
@@ -194,10 +197,11 @@  This is the default operation mode of ``
 hardware-managed P-states (HWP) support.  It is always used if the
 ``intel_pstate=passive`` argument is passed to the kernel in the command line
 regardless of whether or not the given processor supports HWP.  [Note that the
-``intel_pstate=no_hwp`` setting implies ``intel_pstate=passive`` if it is used
-without ``intel_pstate=active``.]  Like in the active mode without HWP support,
-in this mode ``intel_pstate`` may refuse to work with processors that are not
-recognized by it.
+``intel_pstate=no_hwp`` setting causes the driver to start in the passive mode
+if it is not combined with ``intel_pstate=active``.]  Like in the active mode
+without HWP support, in this mode ``intel_pstate`` may refuse to work with
+processors that are not recognized by it if HWP is prevented from being enabled
+through the kernel command line.
 
 If the driver works in this mode, the ``scaling_driver`` policy attribute in
 ``sysfs`` for all ``CPUFreq`` policies contains the string "intel_cpufreq".
@@ -318,10 +322,9 @@  manuals need to be consulted to get to i
 
 For this reason, there is a list of supported processors in ``intel_pstate`` and
 the driver initialization will fail if the detected processor is not in that
-list, unless it supports the `HWP feature <Active Mode_>`_.  [The interface to
-obtain all of the information listed above is the same for all of the processors
-supporting the HWP feature, which is why they all are supported by
-``intel_pstate``.]
+list, unless it supports the HWP feature.  [The interface to obtain all of the
+information listed above is the same for all of the processors supporting the
+HWP feature, which is why ``intel_pstate`` works with all of them.]
 
 
 User Space Interface in ``sysfs``
@@ -425,22 +428,16 @@  argument is passed to the kernel in the
 	as well as the per-policy ones) are then reset to their default
 	values, possibly depending on the target operation mode.]
 
-	That only is supported in some configurations, though (for example, if
-	the `HWP feature is enabled in the processor <Active Mode With HWP_>`_,
-	the operation mode of the driver cannot be changed), and if it is not
-	supported in the current configuration, writes to this attribute will
-	fail with an appropriate error.
-
 ``energy_efficiency``
-	This attribute is only present on platforms, which have CPUs matching
-	Kaby Lake or Coffee Lake desktop CPU model. By default
-	energy efficiency optimizations are disabled on these CPU models in HWP
-	mode by this driver. Enabling energy efficiency may limit maximum
-	operating frequency in both HWP and non HWP mode. In non HWP mode,
-	optimizations are done only in the turbo frequency range. In HWP mode,
-	optimizations are done in the entire frequency range. Setting this
-	attribute to "1" enables energy efficiency optimizations and setting
-	to "0" disables energy efficiency optimizations.
+	This attribute is only present on platforms with CPUs matching the Kaby
+	Lake or Coffee Lake desktop CPU model. By default, energy-efficiency
+	optimizations are disabled on these CPU models if HWP is enabled.
+	Enabling energy-efficiency optimizations may limit maximum operating
+	frequency with or without the HWP feature.  With HWP enabled, the
+	optimizations are done only in the turbo frequency range.  Without it,
+	they are done in the entire available frequency range.  Setting this
+	attribute to "1" enables the energy-efficiency optimizations and setting
+	to "0" disables them.
 
 Interpretation of Policy Attributes
 -----------------------------------
@@ -484,8 +481,8 @@  Next, the following policy attributes ha
 	policy for the time interval between the last two invocations of the
 	driver's utilization update callback by the CPU scheduler for that CPU.
 
-One more policy attribute is present if the `HWP feature is enabled in the
-processor <Active Mode With HWP_>`_:
+One more policy attribute is present if the HWP feature is enabled in the
+processor:
 
 ``base_frequency``
 	Shows the base frequency of the CPU. Any frequency above this will be
@@ -526,11 +523,11 @@  on the following rules, regardless of th
 
  3. The global and per-policy limits can be set independently.
 
-If the `HWP feature is enabled in the processor <Active Mode With HWP_>`_, the
-resulting effective values are written into its registers whenever the limits
-change in order to request its internal P-state selection logic to always set
-P-states within these limits.  Otherwise, the limits are taken into account by
-scaling governors (in the `passive mode <Passive Mode_>`_) and by the driver
+In the `active mode with the HWP feature enabled <Active Mode With HWP_>`_, the
+resulting effective values are written into hardware registers whenever the
+limits change in order to request its internal P-state selection logic to always
+set P-states within these limits.  Otherwise, the limits are taken into account
+by scaling governors (in the `passive mode <Passive Mode_>`_) and by the driver
 every time before setting a new P-state for a CPU.
 
 Additionally, if the ``intel_pstate=per_cpu_perf_limits`` command line argument
@@ -541,12 +538,11 @@  at all and the only way to set the limit
 Energy vs Performance Hints
 ---------------------------
 
-If ``intel_pstate`` works in the `active mode with the HWP feature enabled
-<Active Mode With HWP_>`_ in the processor, additional attributes are present
-in every ``CPUFreq`` policy directory in ``sysfs``.  They are intended to allow
-user space to help ``intel_pstate`` to adjust the processor's internal P-state
-selection logic by focusing it on performance or on energy-efficiency, or
-somewhere between the two extremes:
+If the hardware-managed P-states (HWP) feature is enabled in the processor,
+additional attributes, intended to allow user space to help ``intel_pstate`` to
+adjust the processor's internal P-state selection logic by focusing it on
+performance or on energy-efficiency, or somewhere between the two extremes, are
+present in every ``CPUFreq`` policy directory in ``sysfs``.  They are:
 
 ``energy_performance_preference``
 	Current value of the energy vs performance hint for the given policy
@@ -650,12 +646,14 @@  of them have to be prepended with the ``
 	Do not register ``intel_pstate`` as the scaling driver even if the
 	processor is supported by it.
 
+``active``
+	Register ``intel_pstate`` in the `active mode <Active Mode_>`_ to start
+	with.
+
 ``passive``
 	Register ``intel_pstate`` in the `passive mode <Passive Mode_>`_ to
 	start with.
 
-	This option implies the ``no_hwp`` one described below.
-
 ``force``
 	Register ``intel_pstate`` as the scaling driver instead of
 	``acpi-cpufreq`` even if the latter is preferred on the given system.
@@ -670,13 +668,12 @@  of them have to be prepended with the ``
 	driver is used instead of ``acpi-cpufreq``.
 
 ``no_hwp``
-	Do not enable the `hardware-managed P-states (HWP) feature
-	<Active Mode With HWP_>`_ even if it is supported by the processor.
+	Do not enable the hardware-managed P-states (HWP) feature even if it is
+	supported by the processor.
 
 ``hwp_only``
 	Register ``intel_pstate`` as the scaling driver only if the
-	`hardware-managed P-states (HWP) feature <Active Mode With HWP_>`_ is
-	supported by the processor.
+	hardware-managed P-states (HWP) feature is supported by the processor.
 
 ``support_acpi_ppc``
 	Take ACPI ``_PPC`` performance limits into account.