cpufreq: intel_pstate: Implement passive mode with HWP enabled

Message ID 3955470.QvD6XneCf3@kreacher
State Superseded, archived
Series
  • cpufreq: intel_pstate: Implement passive mode with HWP enabled

Commit Message

Rafael J. Wysocki July 14, 2020, 6:16 p.m. UTC
From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Allow intel_pstate to work in the passive mode with HWP enabled and
make it set the HWP minimum performance limit (HWP floor) to the
P-state value given by the target frequency supplied by the cpufreq
governor, so as to prevent the HWP algorithm and the CPU scheduler
from working against each other, at least when the schedutil governor
is in use, and update the intel_pstate documentation accordingly.

Among other things, this allows utilization clamps to be taken
into account, at least to a certain extent, when intel_pstate is
in use and makes it more likely that sufficient capacity for
deadline tasks will be provided.

After this change, the resulting behavior of an HWP system with
intel_pstate in the passive mode should be close to the behavior
of the analogous non-HWP system with intel_pstate in the passive
mode, except that in the frequency range below the base frequency
(i.e. the frequency returned by the base_frequency cpufreq attribute
in sysfs on HWP systems) the HWP algorithm is allowed to go above
the floor P-state set by intel_pstate with or without hardware
coordination of P-states among CPUs in the same package.

Also note that the setting of the HWP floor may not be taken into
account by the processor in the following cases:

 * For the HWP floor in the range of P-states above the base
   frequency, referred to as the turbo range, the processor has a
   license to choose any P-state from that range, either below or
   above the HWP floor, just like a non-HWP processor in the case
   when the target P-state falls into the turbo range.

 * If P-states of the CPUs in the same package are coordinated
   at the hardware level, the processor may choose a P-state
   above the HWP floor, just like a non-HWP processor in the
   analogous case.

With this change applied, intel_pstate in the passive mode
assumes complete control over the HWP request MSR and concurrent
changes of that MSR (e.g. via the direct MSR access interface) are
overridden by it.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---

This basically unifies the passive mode behavior of intel_pstate for systems
with and without HWP enabled.  The only case in which there is a difference
between the two (after this patch) is below the turbo range, where the HWP
algorithm can go above the floor regardless of whether or not P-states are
coordinated package-wide (which means the difference will mostly be visible
on systems with per-core P-states).

Since the passive mode hasn't worked with HWP at all, and it is not going to be
the default for HWP systems anyway, I don't see any drawbacks related to making
this change, so I would consider this as 5.9 material unless there are any
serious objections.

Thanks!

---
 Documentation/admin-guide/pm/intel_pstate.rst |   89 +++++++---------
 drivers/cpufreq/intel_pstate.c                |  141 ++++++++++++++++++++------
 2 files changed, 152 insertions(+), 78 deletions(-)

Comments

Francisco Jerez July 15, 2020, 12:09 a.m. UTC | #1
"Rafael J. Wysocki" <rjw@rjwysocki.net> writes:

> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>
> Allow intel_pstate to work in the passive mode with HWP enabled and
> make it set the HWP minimum performance limit (HWP floor) to the
> P-state value given by the target frequency supplied by the cpufreq
> governor, so as to prevent the HWP algorithm and the CPU scheduler
> from working against each other, at least when the schedutil governor
> is in use, and update the intel_pstate documentation accordingly.
>
> Among other things, this allows utilization clamps to be taken
> into account, at least to a certain extent, when intel_pstate is
> in use and makes it more likely that sufficient capacity for
> deadline tasks will be provided.
>
> After this change, the resulting behavior of an HWP system with
> intel_pstate in the passive mode should be close to the behavior
> of the analogous non-HWP system with intel_pstate in the passive
> mode, except that in the frequency range below the base frequency
> (ie. the frequency retured by the base_frequency cpufreq attribute
> in sysfs on HWP systems) the HWP algorithm is allowed to go above
> the floor P-state set by intel_pstate with or without hardware
> coordination of P-states among CPUs in the same package.
>
> Also note that the setting of the HWP floor may not be taken into
> account by the processor in the following cases:
>
>  * For the HWP floor in the range of P-states above the base
>    frequency, referred to as the turbo range, the processor has a
>    license to choose any P-state from that range, either below or
>    above the HWP floor, just like a non-HWP processor in the case
>    when the target P-state falls into the turbo range.
>
>  * If P-states of the CPUs in the same package are coordinated
>    at the hardware level, the processor may choose a P-state
>    above the HWP floor, just like a non-HWP processor in the
>    analogous case.
>
> With this change applied, intel_pstate in the passive mode
> assumes complete control over the HWP request MSR and concurrent
> changes of that MSR (eg. via the direct MSR access interface) are
> overridden by it.
>
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
>
> This basically unifies the passive mode behavior of intel_pstate for systems
> with and without HWP enabled.  The only case in which there is a difference
> between the two (after this patch) is below the turbo range, where the HWP
> algorithm can go above the floor regardless of whether or not P-state are
> coordinated package-wide (this means the systems with per-core P-states
> mostly is where the difference can be somewhat visible).
>
> Since the passive mode hasn't worked with HWP at all, and it is not going to
> the default for HWP systems anyway, I don't see any drawbacks related to making
> this change, so I would consider this as 5.9 material unless there are any
> serious objections.
>
> Thanks!
>
> ---
>  Documentation/admin-guide/pm/intel_pstate.rst |   89 +++++++---------
>  drivers/cpufreq/intel_pstate.c                |  141 ++++++++++++++++++++------
>  2 files changed, 152 insertions(+), 78 deletions(-)
>
> Index: linux-pm/drivers/cpufreq/intel_pstate.c
> ===================================================================
> --- linux-pm.orig/drivers/cpufreq/intel_pstate.c
> +++ linux-pm/drivers/cpufreq/intel_pstate.c
> @@ -36,6 +36,7 @@
>  #define INTEL_PSTATE_SAMPLING_INTERVAL	(10 * NSEC_PER_MSEC)
>  
>  #define INTEL_CPUFREQ_TRANSITION_LATENCY	20000
> +#define INTEL_CPUFREQ_TRANSITION_DELAY_HWP	5000
>  #define INTEL_CPUFREQ_TRANSITION_DELAY		500
>  
>  #ifdef CONFIG_ACPI
> @@ -222,6 +223,7 @@ struct global_params {
>   *			preference/bias
>   * @epp_saved:		Saved EPP/EPB during system suspend or CPU offline
>   *			operation
> + * @epp_cached:		Cached HWP energy-performance preference value
>   * @hwp_req_cached:	Cached value of the last HWP Request MSR
>   * @hwp_cap_cached:	Cached value of the last HWP Capabilities MSR
>   * @last_io_update:	Last time when IO wake flag was set
> @@ -259,6 +261,7 @@ struct cpudata {
>  	s16 epp_policy;
>  	s16 epp_default;
>  	s16 epp_saved;
> +	s16 epp_cached;
>  	u64 hwp_req_cached;
>  	u64 hwp_cap_cached;
>  	u64 last_io_update;
> @@ -676,6 +679,8 @@ static int intel_pstate_set_energy_pref_
>  
>  		value |= (u64)epp << 24;
>  		ret = wrmsrl_on_cpu(cpu_data->cpu, MSR_HWP_REQUEST, value);
> +
> +		WRITE_ONCE(cpu_data->epp_cached, epp);

Why introduce a new EPP cache variable if there is already
hwp_req_cached?  If intel_pstate_set_energy_pref_index() is failing to
update hwp_req_cached maybe we should fix that instead.  That will save
you a little bit of work in intel_cpufreq_adjust_hwp().

>  	} else {
>  		if (epp == -EINVAL)
>  			epp = (pref_index - 1) << 2;
> @@ -2047,6 +2052,7 @@ static int intel_pstate_init_cpu(unsigne
>  		cpu->epp_default = -EINVAL;
>  		cpu->epp_powersave = -EINVAL;
>  		cpu->epp_saved = -EINVAL;
> +		WRITE_ONCE(cpu->epp_cached, -EINVAL);
>  	}
>  
>  	cpu = all_cpu_data[cpunum];
> @@ -2245,7 +2251,10 @@ static int intel_pstate_verify_policy(st
>  
>  static void intel_cpufreq_stop_cpu(struct cpufreq_policy *policy)
>  {
> -	intel_pstate_set_min_pstate(all_cpu_data[policy->cpu]);
> +	if (hwp_active)
> +		intel_pstate_hwp_force_min_perf(policy->cpu);
> +	else
> +		intel_pstate_set_min_pstate(all_cpu_data[policy->cpu]);
>  }
>  
>  static void intel_pstate_stop_cpu(struct cpufreq_policy *policy)
> @@ -2253,12 +2262,10 @@ static void intel_pstate_stop_cpu(struct
>  	pr_debug("CPU %d exiting\n", policy->cpu);
>  
>  	intel_pstate_clear_update_util_hook(policy->cpu);
> -	if (hwp_active) {
> +	if (hwp_active)
>  		intel_pstate_hwp_save_state(policy);
> -		intel_pstate_hwp_force_min_perf(policy->cpu);
> -	} else {
> -		intel_cpufreq_stop_cpu(policy);
> -	}
> +
> +	intel_cpufreq_stop_cpu(policy);
>  }
>  
>  static int intel_pstate_cpu_exit(struct cpufreq_policy *policy)
> @@ -2388,13 +2395,82 @@ static void intel_cpufreq_trace(struct c
>  		fp_toint(cpu->iowait_boost * 100));
>  }
>  
> +static void intel_cpufreq_adjust_hwp(struct cpudata *cpu, u32 target_pstate,
> +				     bool fast_switch)
> +{
> +	u64 prev = READ_ONCE(cpu->hwp_req_cached), value = prev;
> +	s16 epp;
> +
> +	value &= ~HWP_MIN_PERF(~0L);
> +	value |= HWP_MIN_PERF(target_pstate);
> +
> +	/*
> +	 * The entire MSR needs to be updated in order to update the HWP min
> +	 * field in it, so opportunistically update the max too if needed.
> +	 */
> +	value &= ~HWP_MAX_PERF(~0L);
> +	value |= HWP_MAX_PERF(cpu->max_perf_ratio);
> +
> +	/*
> +	 * In case the EPP has been adjusted via sysfs, write the last cached
> +	 * value of it to the MSR as well.
> +	 */
> +	epp = READ_ONCE(cpu->epp_cached);
> +	if (epp >= 0) {
> +		value &= ~GENMASK_ULL(31, 24);
> +		value |= (u64)epp << 24;
> +	}
> +
> +	if (value == prev)
> +		return;
> +
> +	WRITE_ONCE(cpu->hwp_req_cached, value);
> +	if (fast_switch)
> +		wrmsrl(MSR_HWP_REQUEST, value);
> +	else
> +		wrmsrl_on_cpu(cpu->cpu, MSR_HWP_REQUEST, value);
> +}

I've asked this question already but you may have missed it: Given that
you are of the opinion that [1] should be implemented in schedutil
instead with intel_pstate in HWP passive mode, what's your plan for
exposing the HWP_MAX_PERF knob to the governor in addition to
HWP_MIN_PERF, since the interface implemented here only allows the
governor to provide a single frequency?

[1] https://lwn.net/ml/linux-pm/20200428032258.2518-1-currojerez@riseup.net/

> +
> +static void intel_cpufreq_adjust_perf_ctl(struct cpudata *cpu,
> +					  u32 target_pstate, bool fast_switch)
> +{
> +	if (fast_switch)
> +		wrmsrl(MSR_IA32_PERF_CTL,
> +		       pstate_funcs.get_val(cpu, target_pstate));
> +	else
> +		wrmsrl_on_cpu(cpu->cpu, MSR_IA32_PERF_CTL,
> +			      pstate_funcs.get_val(cpu, target_pstate));
> +}
> +
> +static int intel_cpufreq_update_pstate(struct cpudata *cpu, int target_pstate,
> +				       bool fast_switch)
> +{
> +	int old_pstate = cpu->pstate.current_pstate;
> +
> +	target_pstate = intel_pstate_prepare_request(cpu, target_pstate);
> +	if (target_pstate != old_pstate) {
> +		cpu->pstate.current_pstate = target_pstate;
> +		if (hwp_active)
> +			intel_cpufreq_adjust_hwp(cpu, target_pstate,
> +						 fast_switch);
> +		else
> +			intel_cpufreq_adjust_perf_ctl(cpu, target_pstate,
> +						      fast_switch);
> +	}
> +
> +	intel_cpufreq_trace(cpu, fast_switch ? INTEL_PSTATE_TRACE_FAST_SWITCH :
> +			    INTEL_PSTATE_TRACE_TARGET, old_pstate);
> +
> +	return target_pstate;
> +}
> +
>  static int intel_cpufreq_target(struct cpufreq_policy *policy,
>  				unsigned int target_freq,
>  				unsigned int relation)
>  {
>  	struct cpudata *cpu = all_cpu_data[policy->cpu];
>  	struct cpufreq_freqs freqs;
> -	int target_pstate, old_pstate;
> +	int target_pstate;
>  
>  	update_turbo_state();
>  
> @@ -2402,6 +2478,7 @@ static int intel_cpufreq_target(struct c
>  	freqs.new = target_freq;
>  
>  	cpufreq_freq_transition_begin(policy, &freqs);
> +
>  	switch (relation) {
>  	case CPUFREQ_RELATION_L:
>  		target_pstate = DIV_ROUND_UP(freqs.new, cpu->pstate.scaling);
> @@ -2413,15 +2490,11 @@ static int intel_cpufreq_target(struct c
>  		target_pstate = DIV_ROUND_CLOSEST(freqs.new, cpu->pstate.scaling);
>  		break;
>  	}
> -	target_pstate = intel_pstate_prepare_request(cpu, target_pstate);
> -	old_pstate = cpu->pstate.current_pstate;
> -	if (target_pstate != cpu->pstate.current_pstate) {
> -		cpu->pstate.current_pstate = target_pstate;
> -		wrmsrl_on_cpu(policy->cpu, MSR_IA32_PERF_CTL,
> -			      pstate_funcs.get_val(cpu, target_pstate));
> -	}
> +
> +	target_pstate = intel_cpufreq_update_pstate(cpu, target_pstate, false);
> +
>  	freqs.new = target_pstate * cpu->pstate.scaling;
> -	intel_cpufreq_trace(cpu, INTEL_PSTATE_TRACE_TARGET, old_pstate);
> +
>  	cpufreq_freq_transition_end(policy, &freqs, false);
>  
>  	return 0;
> @@ -2431,15 +2504,14 @@ static unsigned int intel_cpufreq_fast_s
>  					      unsigned int target_freq)
>  {
>  	struct cpudata *cpu = all_cpu_data[policy->cpu];
> -	int target_pstate, old_pstate;
> +	int target_pstate;
>  
>  	update_turbo_state();
>  
>  	target_pstate = DIV_ROUND_UP(target_freq, cpu->pstate.scaling);
> -	target_pstate = intel_pstate_prepare_request(cpu, target_pstate);
> -	old_pstate = cpu->pstate.current_pstate;
> -	intel_pstate_update_pstate(cpu, target_pstate);
> -	intel_cpufreq_trace(cpu, INTEL_PSTATE_TRACE_FAST_SWITCH, old_pstate);
> +
> +	target_pstate = intel_cpufreq_update_pstate(cpu, target_pstate, true);
> +
>  	return target_pstate * cpu->pstate.scaling;
>  }
>  
> @@ -2459,7 +2531,6 @@ static int intel_cpufreq_cpu_init(struct
>  		return ret;
>  
>  	policy->cpuinfo.transition_latency = INTEL_CPUFREQ_TRANSITION_LATENCY;
> -	policy->transition_delay_us = INTEL_CPUFREQ_TRANSITION_DELAY;
>  	/* This reflects the intel_pstate_get_cpu_pstates() setting. */
>  	policy->cur = policy->cpuinfo.min_freq;
>  
> @@ -2471,10 +2542,17 @@ static int intel_cpufreq_cpu_init(struct
>  
>  	cpu = all_cpu_data[policy->cpu];
>  
> -	if (hwp_active)
> +	if (hwp_active) {
> +		u64 value;
> +
>  		intel_pstate_get_hwp_max(policy->cpu, &turbo_max, &max_state);
> -	else
> +		policy->transition_delay_us = INTEL_CPUFREQ_TRANSITION_DELAY_HWP;
> +		rdmsrl_on_cpu(cpu->cpu, MSR_HWP_REQUEST, &value);
> +		WRITE_ONCE(cpu->hwp_req_cached, value);
> +	} else {
>  		turbo_max = cpu->pstate.turbo_pstate;
> +		policy->transition_delay_us = INTEL_CPUFREQ_TRANSITION_DELAY;
> +	}
>  
>  	min_freq = DIV_ROUND_UP(turbo_max * global.min_perf_pct, 100);
>  	min_freq *= cpu->pstate.scaling;
> @@ -2575,9 +2653,6 @@ static int intel_pstate_register_driver(
>  
>  static int intel_pstate_unregister_driver(void)
>  {
> -	if (hwp_active)
> -		return -EBUSY;
> -
>  	cpufreq_unregister_driver(intel_pstate_driver);
>  	intel_pstate_driver_cleanup();
>  
> @@ -2828,7 +2903,10 @@ static int __init intel_pstate_init(void
>  			hwp_active++;
>  			hwp_mode_bdw = id->driver_data;
>  			intel_pstate.attr = hwp_cpufreq_attrs;
> -			default_driver = &intel_pstate;
> +			intel_cpufreq.attr = hwp_cpufreq_attrs;
> +			if (!default_driver)
> +				default_driver = &intel_pstate;
> +
>  			goto hwp_cpu_matched;
>  		}
>  	} else {
> @@ -2899,14 +2977,13 @@ static int __init intel_pstate_setup(cha
>  	if (!str)
>  		return -EINVAL;
>  
> -	if (!strcmp(str, "disable")) {
> +	if (!strcmp(str, "disable"))
>  		no_load = 1;
> -	} else if (!strcmp(str, "active")) {
> +	else if (!strcmp(str, "active"))
>  		default_driver = &intel_pstate;
> -	} else if (!strcmp(str, "passive")) {
> +	else if (!strcmp(str, "passive"))
>  		default_driver = &intel_cpufreq;
> -		no_hwp = 1;
> -	}
> +
>  	if (!strcmp(str, "no_hwp")) {
>  		pr_info("HWP disabled\n");
>  		no_hwp = 1;
> Index: linux-pm/Documentation/admin-guide/pm/intel_pstate.rst
> ===================================================================
> --- linux-pm.orig/Documentation/admin-guide/pm/intel_pstate.rst
> +++ linux-pm/Documentation/admin-guide/pm/intel_pstate.rst
> @@ -54,10 +54,13 @@ registered (see `below <status_attr_>`_)
>  Operation Modes
>  ===============
>  
> -``intel_pstate`` can operate in three different modes: in the active mode with
> -or without hardware-managed P-states support and in the passive mode.  Which of
> -them will be in effect depends on what kernel command line options are used and
> -on the capabilities of the processor.
> +``intel_pstate`` can operate in two different modes, active or passive.  In the
> +active mode, it uses its own internal performance scaling governor algorithm or
> +allows the hardware to do performance scaling by itself, while in the passive
> +mode it responds to requests made by a generic ``CPUFreq`` governor implementing
> +a certain performance scaling algorithm.  Which of them will be in effect
> +depends on what kernel command line options are used and on the capabilities of
> +the processor.
>  
>  Active Mode
>  -----------
> @@ -194,10 +197,11 @@ This is the default operation mode of ``
>  hardware-managed P-states (HWP) support.  It is always used if the
>  ``intel_pstate=passive`` argument is passed to the kernel in the command line
>  regardless of whether or not the given processor supports HWP.  [Note that the
> -``intel_pstate=no_hwp`` setting implies ``intel_pstate=passive`` if it is used
> -without ``intel_pstate=active``.]  Like in the active mode without HWP support,
> -in this mode ``intel_pstate`` may refuse to work with processors that are not
> -recognized by it.
> +``intel_pstate=no_hwp`` setting causes the driver to start in the passive mode
> +if it is not combined with ``intel_pstate=active``.]  Like in the active mode
> +without HWP support, in this mode ``intel_pstate`` may refuse to work with
> +processors that are not recognized by it if HWP is prevented from being enabled
> +through the kernel command line.
>  
>  If the driver works in this mode, the ``scaling_driver`` policy attribute in
>  ``sysfs`` for all ``CPUFreq`` policies contains the string "intel_cpufreq".
> @@ -318,10 +322,9 @@ manuals need to be consulted to get to i
>  
>  For this reason, there is a list of supported processors in ``intel_pstate`` and
>  the driver initialization will fail if the detected processor is not in that
> -list, unless it supports the `HWP feature <Active Mode_>`_.  [The interface to
> -obtain all of the information listed above is the same for all of the processors
> -supporting the HWP feature, which is why they all are supported by
> -``intel_pstate``.]
> +list, unless it supports the HWP feature.  [The interface to obtain all of the
> +information listed above is the same for all of the processors supporting the
> +HWP feature, which is why ``intel_pstate`` works with all of them.]
>  
>  
>  User Space Interface in ``sysfs``
> @@ -425,22 +428,16 @@ argument is passed to the kernel in the
>  	as well as the per-policy ones) are then reset to their default
>  	values, possibly depending on the target operation mode.]
>  
> -	That only is supported in some configurations, though (for example, if
> -	the `HWP feature is enabled in the processor <Active Mode With HWP_>`_,
> -	the operation mode of the driver cannot be changed), and if it is not
> -	supported in the current configuration, writes to this attribute will
> -	fail with an appropriate error.
> -
>  ``energy_efficiency``
> -	This attribute is only present on platforms, which have CPUs matching
> -	Kaby Lake or Coffee Lake desktop CPU model. By default
> -	energy efficiency optimizations are disabled on these CPU models in HWP
> -	mode by this driver. Enabling energy efficiency may limit maximum
> -	operating frequency in both HWP and non HWP mode. In non HWP mode,
> -	optimizations are done only in the turbo frequency range. In HWP mode,
> -	optimizations are done in the entire frequency range. Setting this
> -	attribute to "1" enables energy efficiency optimizations and setting
> -	to "0" disables energy efficiency optimizations.
> +	This attribute is only present on platforms with CPUs matching the Kaby
> +	Lake or Coffee Lake desktop CPU model. By default, energy-efficiency
> +	optimizations are disabled on these CPU models if HWP is enabled.
> +	Enabling energy-efficiency optimizations may limit maximum operating
> +	frequency with or without the HWP feature.  With HWP enabled, the
> +	optimizations are done only in the turbo frequency range.  Without it,
> +	they are done in the entire available frequency range.  Setting this
> +	attribute to "1" enables the energy-efficiency optimizations and setting
> +	to "0" disables them.
>  
>  Interpretation of Policy Attributes
>  -----------------------------------
> @@ -484,8 +481,8 @@ Next, the following policy attributes ha
>  	policy for the time interval between the last two invocations of the
>  	driver's utilization update callback by the CPU scheduler for that CPU.
>  
> -One more policy attribute is present if the `HWP feature is enabled in the
> -processor <Active Mode With HWP_>`_:
> +One more policy attribute is present if the HWP feature is enabled in the
> +processor:
>  
>  ``base_frequency``
>  	Shows the base frequency of the CPU. Any frequency above this will be
> @@ -526,11 +523,11 @@ on the following rules, regardless of th
>  
>   3. The global and per-policy limits can be set independently.
>  
> -If the `HWP feature is enabled in the processor <Active Mode With HWP_>`_, the
> -resulting effective values are written into its registers whenever the limits
> -change in order to request its internal P-state selection logic to always set
> -P-states within these limits.  Otherwise, the limits are taken into account by
> -scaling governors (in the `passive mode <Passive Mode_>`_) and by the driver
> +In the `active mode with the HWP feature enabled <Active Mode With HWP_>`_, the
> +resulting effective values are written into hardware registers whenever the
> +limits change in order to request its internal P-state selection logic to always
> +set P-states within these limits.  Otherwise, the limits are taken into account
> +by scaling governors (in the `passive mode <Passive Mode_>`_) and by the driver
>  every time before setting a new P-state for a CPU.
>  
>  Additionally, if the ``intel_pstate=per_cpu_perf_limits`` command line argument
> @@ -541,12 +538,11 @@ at all and the only way to set the limit
>  Energy vs Performance Hints
>  ---------------------------
>  
> -If ``intel_pstate`` works in the `active mode with the HWP feature enabled
> -<Active Mode With HWP_>`_ in the processor, additional attributes are present
> -in every ``CPUFreq`` policy directory in ``sysfs``.  They are intended to allow
> -user space to help ``intel_pstate`` to adjust the processor's internal P-state
> -selection logic by focusing it on performance or on energy-efficiency, or
> -somewhere between the two extremes:
> +If the hardware-managed P-states (HWP) is enabled in the processor, additional
> +attributes, intended to allow user space to help ``intel_pstate`` to adjust the
> +processor's internal P-state selection logic by focusing it on performance or on
> +energy-efficiency, or somewhere between the two extremes, are present in every
> +	``CPUFreq`` policy directory in ``sysfs``.  They are:
>  
>  ``energy_performance_preference``
>  	Current value of the energy vs performance hint for the given policy
> @@ -650,12 +646,14 @@ of them have to be prepended with the ``
>  	Do not register ``intel_pstate`` as the scaling driver even if the
>  	processor is supported by it.
>  
> +``active``
> +	Register ``intel_pstate`` in the `active mode <Active Mode_>`_ to start
> +	with.
> +
>  ``passive``
>  	Register ``intel_pstate`` in the `passive mode <Passive Mode_>`_ to
>  	start with.
>  
> -	This option implies the ``no_hwp`` one described below.
> -
>  ``force``
>  	Register ``intel_pstate`` as the scaling driver instead of
>  	``acpi-cpufreq`` even if the latter is preferred on the given system.
> @@ -670,13 +668,12 @@ of them have to be prepended with the ``
>  	driver is used instead of ``acpi-cpufreq``.
>  
>  ``no_hwp``
> -	Do not enable the `hardware-managed P-states (HWP) feature
> -	<Active Mode With HWP_>`_ even if it is supported by the processor.
> +	Do not enable the hardware-managed P-states (HWP) feature even if it is
> +	supported by the processor.
>  
>  ``hwp_only``
>  	Register ``intel_pstate`` as the scaling driver only if the
> -	`hardware-managed P-states (HWP) feature <Active Mode With HWP_>`_ is
> -	supported by the processor.
> +	hardware-managed P-states (HWP) feature is supported by the processor.
>  
>  ``support_acpi_ppc``
>  	Take ACPI ``_PPC`` performance limits into account.
Rafael J. Wysocki July 15, 2020, 12:04 p.m. UTC | #2
On Wed, Jul 15, 2020 at 2:09 AM Francisco Jerez <currojerez@riseup.net> wrote:
>
> "Rafael J. Wysocki" <rjw@rjwysocki.net> writes:
>
> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >
> > Allow intel_pstate to work in the passive mode with HWP enabled and
> > make it set the HWP minimum performance limit (HWP floor) to the
> > P-state value given by the target frequency supplied by the cpufreq
> > governor, so as to prevent the HWP algorithm and the CPU scheduler
> > from working against each other, at least when the schedutil governor
> > is in use, and update the intel_pstate documentation accordingly.
> >
> > Among other things, this allows utilization clamps to be taken
> > into account, at least to a certain extent, when intel_pstate is
> > in use and makes it more likely that sufficient capacity for
> > deadline tasks will be provided.
> >
> > After this change, the resulting behavior of an HWP system with
> > intel_pstate in the passive mode should be close to the behavior
> > of the analogous non-HWP system with intel_pstate in the passive
> > mode, except that in the frequency range below the base frequency
> > (ie. the frequency retured by the base_frequency cpufreq attribute
> > in sysfs on HWP systems) the HWP algorithm is allowed to go above
> > the floor P-state set by intel_pstate with or without hardware
> > coordination of P-states among CPUs in the same package.
> >
> > Also note that the setting of the HWP floor may not be taken into
> > account by the processor in the following cases:
> >
> >  * For the HWP floor in the range of P-states above the base
> >    frequency, referred to as the turbo range, the processor has a
> >    license to choose any P-state from that range, either below or
> >    above the HWP floor, just like a non-HWP processor in the case
> >    when the target P-state falls into the turbo range.
> >
> >  * If P-states of the CPUs in the same package are coordinated
> >    at the hardware level, the processor may choose a P-state
> >    above the HWP floor, just like a non-HWP processor in the
> >    analogous case.
> >
> > With this change applied, intel_pstate in the passive mode
> > assumes complete control over the HWP request MSR and concurrent
> > changes of that MSR (eg. via the direct MSR access interface) are
> > overridden by it.
> >
> > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > ---
> >
> > This basically unifies the passive mode behavior of intel_pstate for systems
> > with and without HWP enabled.  The only case in which there is a difference
> > between the two (after this patch) is below the turbo range, where the HWP
> > algorithm can go above the floor regardless of whether or not P-state are
> > coordinated package-wide (this means the systems with per-core P-states
> > mostly is where the difference can be somewhat visible).
> >
> > Since the passive mode hasn't worked with HWP at all, and it is not going to
> > the default for HWP systems anyway, I don't see any drawbacks related to making
> > this change, so I would consider this as 5.9 material unless there are any
> > serious objections.
> >
> > Thanks!
> >
> > ---
> >  Documentation/admin-guide/pm/intel_pstate.rst |   89 +++++++---------
> >  drivers/cpufreq/intel_pstate.c                |  141 ++++++++++++++++++++------
> >  2 files changed, 152 insertions(+), 78 deletions(-)
> >
> > Index: linux-pm/drivers/cpufreq/intel_pstate.c
> > ===================================================================
> > --- linux-pm.orig/drivers/cpufreq/intel_pstate.c
> > +++ linux-pm/drivers/cpufreq/intel_pstate.c
> > @@ -36,6 +36,7 @@
> >  #define INTEL_PSTATE_SAMPLING_INTERVAL       (10 * NSEC_PER_MSEC)
> >
> >  #define INTEL_CPUFREQ_TRANSITION_LATENCY     20000
> > +#define INTEL_CPUFREQ_TRANSITION_DELAY_HWP   5000
> >  #define INTEL_CPUFREQ_TRANSITION_DELAY               500
> >
> >  #ifdef CONFIG_ACPI
> > @@ -222,6 +223,7 @@ struct global_params {
> >   *                   preference/bias
> >   * @epp_saved:               Saved EPP/EPB during system suspend or CPU offline
> >   *                   operation
> > + * @epp_cached               Cached HWP energy-performance preference value
> >   * @hwp_req_cached:  Cached value of the last HWP Request MSR
> >   * @hwp_cap_cached:  Cached value of the last HWP Capabilities MSR
> >   * @last_io_update:  Last time when IO wake flag was set
> > @@ -259,6 +261,7 @@ struct cpudata {
> >       s16 epp_policy;
> >       s16 epp_default;
> >       s16 epp_saved;
> > +     s16 epp_cached;
> >       u64 hwp_req_cached;
> >       u64 hwp_cap_cached;
> >       u64 last_io_update;
> > @@ -676,6 +679,8 @@ static int intel_pstate_set_energy_pref_
> >
> >               value |= (u64)epp << 24;
> >               ret = wrmsrl_on_cpu(cpu_data->cpu, MSR_HWP_REQUEST, value);
> > +
> > +             WRITE_ONCE(cpu_data->epp_cached, epp);
>
> Why introduce a new EPP cache variable if there is already
> hwp_req_cached?  If intel_pstate_set_energy_pref_index() is failing to
> update hwp_req_cached maybe we should fix that instead.  That will save
> you a little bit of work in intel_cpufreq_adjust_hwp().

Yes, it would, but then we'd need to explicitly synchronize
intel_pstate_set_energy_pref_index() with the scheduler context which
I'd rather avoid.

> >       } else {
> >               if (epp == -EINVAL)
> >                       epp = (pref_index - 1) << 2;
> > @@ -2047,6 +2052,7 @@ static int intel_pstate_init_cpu(unsigne
> >               cpu->epp_default = -EINVAL;
> >               cpu->epp_powersave = -EINVAL;
> >               cpu->epp_saved = -EINVAL;
> > +             WRITE_ONCE(cpu->epp_cached, -EINVAL);
> >       }
> >
> >       cpu = all_cpu_data[cpunum];
> > @@ -2245,7 +2251,10 @@ static int intel_pstate_verify_policy(st
> >
> >  static void intel_cpufreq_stop_cpu(struct cpufreq_policy *policy)
> >  {
> > -     intel_pstate_set_min_pstate(all_cpu_data[policy->cpu]);
> > +     if (hwp_active)
> > +             intel_pstate_hwp_force_min_perf(policy->cpu);
> > +     else
> > +             intel_pstate_set_min_pstate(all_cpu_data[policy->cpu]);
> >  }
> >
> >  static void intel_pstate_stop_cpu(struct cpufreq_policy *policy)
> > @@ -2253,12 +2262,10 @@ static void intel_pstate_stop_cpu(struct
> >       pr_debug("CPU %d exiting\n", policy->cpu);
> >
> >       intel_pstate_clear_update_util_hook(policy->cpu);
> > -     if (hwp_active) {
> > +     if (hwp_active)
> >               intel_pstate_hwp_save_state(policy);
> > -             intel_pstate_hwp_force_min_perf(policy->cpu);
> > -     } else {
> > -             intel_cpufreq_stop_cpu(policy);
> > -     }
> > +
> > +     intel_cpufreq_stop_cpu(policy);
> >  }
> >
> >  static int intel_pstate_cpu_exit(struct cpufreq_policy *policy)
> > @@ -2388,13 +2395,82 @@ static void intel_cpufreq_trace(struct c
> >               fp_toint(cpu->iowait_boost * 100));
> >  }
> >
> > +static void intel_cpufreq_adjust_hwp(struct cpudata *cpu, u32 target_pstate,
> > +                                  bool fast_switch)
> > +{
> > +     u64 prev = READ_ONCE(cpu->hwp_req_cached), value = prev;
> > +     s16 epp;
> > +
> > +     value &= ~HWP_MIN_PERF(~0L);
> > +     value |= HWP_MIN_PERF(target_pstate);
> > +
> > +     /*
> > +      * The entire MSR needs to be updated in order to update the HWP min
> > +      * field in it, so opportunistically update the max too if needed.
> > +      */
> > +     value &= ~HWP_MAX_PERF(~0L);
> > +     value |= HWP_MAX_PERF(cpu->max_perf_ratio);
> > +
> > +     /*
> > +      * In case the EPP has been adjusted via sysfs, write the last cached
> > +      * value of it to the MSR as well.
> > +      */
> > +     epp = READ_ONCE(cpu->epp_cached);
> > +     if (epp >= 0) {
> > +             value &= ~GENMASK_ULL(31, 24);
> > +             value |= (u64)epp << 24;
> > +     }
> > +
> > +     if (value == prev)
> > +             return;
> > +
> > +     WRITE_ONCE(cpu->hwp_req_cached, value);
> > +     if (fast_switch)
> > +             wrmsrl(MSR_HWP_REQUEST, value);
> > +     else
> > +             wrmsrl_on_cpu(cpu->cpu, MSR_HWP_REQUEST, value);
> > +}
>
> I've asked this question already but you may have missed it: Given that
> you are of the opinion that [1] should be implemented in schedutil
> instead with intel_pstate in HWP passive mode, what's your plan for
> exposing the HWP_MAX_PERF knob to the governor in addition to
> HWP_MIN_PERF, since the interface implemented here only allows the
> governor to provide a single frequency?
>
> [1] https://lwn.net/ml/linux-pm/20200428032258.2518-1-currojerez@riseup.net/

This is not just about the schedutil governor, but about cpufreq
governors in general (someone may still want to use the performance
governor on top of intel_pstate, for example).

And while governors can only provide one frequency, the policy limits
in the cpufreq framework are based on QoS lists now and so it is
possible to add a max limit request, say from a driver, to the max QoS
list, and update it as needed, causing the max policy limit to be
adjusted.
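
The QoS aggregation described above can be sketched as a toy model: the effective max policy limit is simply the minimum of all registered max-frequency requests. This is an editorial illustration in plain C, not the kernel's actual freq_qos_* interface (the real one lives on the policy's constraint lists via freq_qos_add_request()/freq_qos_update_request()); the names and the fixed-size list here are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy model of a cpufreq "max" QoS list: the effective policy limit is
 * the minimum of all registered max-frequency requests and the
 * hardware maximum.  Illustration only; not the kernel API.
 */
#define MAX_REQS 8

struct max_qos_list {
	unsigned int req_khz[MAX_REQS];
	size_t nreq;
};

/* Register a new max request; returns a handle (slot index) or -1. */
static int qos_add_request(struct max_qos_list *l, unsigned int khz)
{
	if (l->nreq >= MAX_REQS)
		return -1;
	l->req_khz[l->nreq] = khz;
	return (int)l->nreq++;
}

/* Update an existing request, e.g. from a driver reacting to load. */
static void qos_update_request(struct max_qos_list *l, int handle,
			       unsigned int khz)
{
	l->req_khz[handle] = khz;
}

/* The aggregate limit: minimum over all requests and the HW max. */
static unsigned int qos_effective_max(const struct max_qos_list *l,
				      unsigned int hw_max_khz)
{
	unsigned int eff = hw_max_khz;

	for (size_t i = 0; i < l->nreq; i++)
		if (l->req_khz[i] < eff)
			eff = l->req_khz[i];
	return eff;
}
```

The point of the model is only the aggregation rule: adding or updating any one request can tighten or relax the policy's max limit without the governor itself having to provide more than one frequency.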

That said I'm not exactly sure how useful the max limit generally is
in practice on HWP systems, given that setting it above the base
frequency causes it to be ignored, effectively, and the turbo range
may be wider than the range of P-states below the base frequency.

Generally, I'm not quite convinced that limiting the max frequency is
really the right choice for controlling the processor's power draw on
the systems in question.  There are other ways to do that, which in
theory should be more effective.  I mentioned RAPL somewhere in this
context and there's the GUC firmware too.
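
For readers following the intel_cpufreq_adjust_hwp() hunk quoted earlier, the read-modify-write it performs on the cached IA32_HWP_REQUEST value can be modeled in userspace C. The field layout follows the HWP request MSR (min perf in bits 7:0, max perf in bits 15:8, EPP in bits 31:24); the wrmsrl() call is replaced by returning the new value, and the function names are this sketch's own, not the patch's.

```c
#include <assert.h>
#include <stdint.h>

/* Same field packing as the kernel's HWP_MIN_PERF()/HWP_MAX_PERF(). */
#define HWP_MIN_PERF(x)	((uint64_t)((x) & 0xffULL))
#define HWP_MAX_PERF(x)	((uint64_t)(((x) & 0xffULL) << 8))
#define HWP_EPP_MASK	(0xffULL << 24)

static uint64_t adjust_hwp_req(uint64_t cached, uint32_t target_pstate,
			       uint32_t max_perf_ratio, int16_t epp_cached)
{
	uint64_t value = cached;

	/* Put the governor's target P-state into the HWP min field. */
	value &= ~HWP_MIN_PERF(~0ULL);
	value |= HWP_MIN_PERF(target_pstate);

	/* The whole MSR is written anyway, so refresh the max too. */
	value &= ~HWP_MAX_PERF(~0ULL);
	value |= HWP_MAX_PERF(max_perf_ratio);

	/* A non-negative epp_cached means sysfs has adjusted the EPP. */
	if (epp_cached >= 0) {
		value &= ~HWP_EPP_MASK;
		value |= (uint64_t)epp_cached << 24;
	}
	return value;
}
```

With a cached value of 0x80001f04 (EPP 0x80, max 0x1f, min 0x04), a target P-state of 0x0c only moves the min field unless a cached EPP is pending, in which case the EPP byte is replaced as well.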
Doug Smythies July 15, 2020, 8:39 p.m. UTC | #3
On 2020.07.14 11:16 Rafael J. Wysocki wrote:
> 
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
...
> Since the passive mode hasn't worked with HWP at all, and it is not going to
> the default for HWP systems anyway, I don't see any drawbacks related to making
> this change, so I would consider this as 5.9 material unless there are any
> serious objections.

Good point.
Some of the tests I do involve labour intensive post processing of data.
I want to automate some of that work, and it will take time.
We might be into the 5.9-rc series before I have detailed feedback.

However, so far:

Inverse impulse response test [1]:

High level test, i5-9600K, HWP-passive (this patch), ondemand:
3101 tests. 0 failures. (GOOD)

From [1], re-stated:
> . High level: i5-9600K: 2453 tests, 60 failures, 2.45% fail rate. (HWP-active - powersave)
> . Verify acpi-cpufreq/ondemand works fine: i5-9600K: 8975 tests. 0 failures.

My version of that cool Alexander named pipe test [2] serialized workflow:

HWP-passive (this patch), performance: PASS.

From [2], re-stated, and also re-tested.
HWP-disabled passive - performance: FAIL.
Although, I believe the issue to be EPB management, [3].

And yes, I did see the reply to [3] that came earlier,
And have now re-done the test, with the referenced patch added.
It still is FAIL. I will reply to the [3] thread eventually.

[1] https://marc.info/?l=linux-pm&m=159354421400342&w=2
[2] https://marc.info/?l=linux-pm&m=159155067328641&w=2
[3] https://marc.info/?l=linux-pm&m=159438804230744&w=2

Kernel:

b08284a541ad (HEAD -> k58rc5-doug) cpufreq: intel_pstate: Avoid enabling HWP if EPP is not supported
063fd7ccabfe cpufreq: intel_pstate: Implement passive mode with HWP enabled
730ccf5054e9 cpufreq: intel_pstate: Allow raw energy performance preference value
bee36df01c68 cpufreq: intel_pstate: Allow enable/disable energy efficiency
199629d8200e cpufreq: intel_pstate: Fix active mode setting from command line
11ba468877bb (tag: v5.8-rc5, origin/master, origin/HEAD, master) Linux 5.8-rc5

Rules for this work:

. never use x86_energy_perf_policy.
. For HWP disabled: never change from active to passive or vice versa, but rather do it via boot.
. after boot always check and reset the various power limit log bits that are set.
. never compile the kernel (well, until after any tests), which will set those bits again.
. never run prime95 high heat torture test, which will set those bits again.
. try to never do anything else that will set those bits again.

To be clear, I do allow changing governors within the context of the above rules.

... Doug
Francisco Jerez July 15, 2020, 9:35 p.m. UTC | #4
"Rafael J. Wysocki" <rafael@kernel.org> writes:

> On Wed, Jul 15, 2020 at 2:09 AM Francisco Jerez <currojerez@riseup.net> wrote:
>>
>> "Rafael J. Wysocki" <rjw@rjwysocki.net> writes:
>>
>> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>> >
>> > Allow intel_pstate to work in the passive mode with HWP enabled and
>> > make it set the HWP minimum performance limit (HWP floor) to the
>> > P-state value given by the target frequency supplied by the cpufreq
>> > governor, so as to prevent the HWP algorithm and the CPU scheduler
>> > from working against each other, at least when the schedutil governor
>> > is in use, and update the intel_pstate documentation accordingly.
>> >
>> > Among other things, this allows utilization clamps to be taken
>> > into account, at least to a certain extent, when intel_pstate is
>> > in use and makes it more likely that sufficient capacity for
>> > deadline tasks will be provided.
>> >
>> > After this change, the resulting behavior of an HWP system with
>> > intel_pstate in the passive mode should be close to the behavior
>> > of the analogous non-HWP system with intel_pstate in the passive
>> > mode, except that in the frequency range below the base frequency
>> > (i.e. the frequency returned by the base_frequency cpufreq attribute
>> > in sysfs on HWP systems) the HWP algorithm is allowed to go above
>> > the floor P-state set by intel_pstate with or without hardware
>> > coordination of P-states among CPUs in the same package.
>> >
>> > Also note that the setting of the HWP floor may not be taken into
>> > account by the processor in the following cases:
>> >
>> >  * For the HWP floor in the range of P-states above the base
>> >    frequency, referred to as the turbo range, the processor has a
>> >    license to choose any P-state from that range, either below or
>> >    above the HWP floor, just like a non-HWP processor in the case
>> >    when the target P-state falls into the turbo range.
>> >
>> >  * If P-states of the CPUs in the same package are coordinated
>> >    at the hardware level, the processor may choose a P-state
>> >    above the HWP floor, just like a non-HWP processor in the
>> >    analogous case.
>> >
>> > With this change applied, intel_pstate in the passive mode
>> > assumes complete control over the HWP request MSR and concurrent
>> > changes of that MSR (eg. via the direct MSR access interface) are
>> > overridden by it.
>> >
>> > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>> > ---
>> >
>> > This basically unifies the passive mode behavior of intel_pstate for systems
>> > with and without HWP enabled.  The only case in which there is a difference
>> > between the two (after this patch) is below the turbo range, where the HWP
>> > algorithm can go above the floor regardless of whether or not P-state are
>> > coordinated package-wide (this means the systems with per-core P-states
>> > mostly is where the difference can be somewhat visible).
>> >
>> > Since the passive mode hasn't worked with HWP at all, and it is not going to
>> > the default for HWP systems anyway, I don't see any drawbacks related to making
>> > this change, so I would consider this as 5.9 material unless there are any
>> > serious objections.
>> >
>> > Thanks!
>> >
>> > ---
>> >  Documentation/admin-guide/pm/intel_pstate.rst |   89 +++++++---------
>> >  drivers/cpufreq/intel_pstate.c                |  141 ++++++++++++++++++++------
>> >  2 files changed, 152 insertions(+), 78 deletions(-)
>> >
>> > Index: linux-pm/drivers/cpufreq/intel_pstate.c
>> > ===================================================================
>> > --- linux-pm.orig/drivers/cpufreq/intel_pstate.c
>> > +++ linux-pm/drivers/cpufreq/intel_pstate.c
>> > @@ -36,6 +36,7 @@
>> >  #define INTEL_PSTATE_SAMPLING_INTERVAL       (10 * NSEC_PER_MSEC)
>> >
>> >  #define INTEL_CPUFREQ_TRANSITION_LATENCY     20000
>> > +#define INTEL_CPUFREQ_TRANSITION_DELAY_HWP   5000
>> >  #define INTEL_CPUFREQ_TRANSITION_DELAY               500
>> >
>> >  #ifdef CONFIG_ACPI
>> > @@ -222,6 +223,7 @@ struct global_params {
>> >   *                   preference/bias
>> >   * @epp_saved:               Saved EPP/EPB during system suspend or CPU offline
>> >   *                   operation
>> > + * @epp_cached               Cached HWP energy-performance preference value
>> >   * @hwp_req_cached:  Cached value of the last HWP Request MSR
>> >   * @hwp_cap_cached:  Cached value of the last HWP Capabilities MSR
>> >   * @last_io_update:  Last time when IO wake flag was set
>> > @@ -259,6 +261,7 @@ struct cpudata {
>> >       s16 epp_policy;
>> >       s16 epp_default;
>> >       s16 epp_saved;
>> > +     s16 epp_cached;
>> >       u64 hwp_req_cached;
>> >       u64 hwp_cap_cached;
>> >       u64 last_io_update;
>> > @@ -676,6 +679,8 @@ static int intel_pstate_set_energy_pref_
>> >
>> >               value |= (u64)epp << 24;
>> >               ret = wrmsrl_on_cpu(cpu_data->cpu, MSR_HWP_REQUEST, value);
>> > +
>> > +             WRITE_ONCE(cpu_data->epp_cached, epp);
>>
>> Why introduce a new EPP cache variable if there is already
>> hwp_req_cached?  If intel_pstate_set_energy_pref_index() is failing to
>> update hwp_req_cached maybe we should fix that instead.  That will save
>> you a little bit of work in intel_cpufreq_adjust_hwp().
>
> Yes, it would, but then we'd need to explicitly synchronize
> intel_pstate_set_energy_pref_index() with the scheduler context which
> I'd rather avoid.
>

How does using a differently named variable save you from doing that?

And won't the EPP setting programmed by intel_pstate_set_energy_pref_index()
be lost if intel_pstate_hwp_boost_up() or some other user of
hwp_req_cached is executed afterwards with the current approach?  Seems
like a bug to me.

>> >       } else {
>> >               if (epp == -EINVAL)
>> >                       epp = (pref_index - 1) << 2;
>> > @@ -2047,6 +2052,7 @@ static int intel_pstate_init_cpu(unsigne
>> >               cpu->epp_default = -EINVAL;
>> >               cpu->epp_powersave = -EINVAL;
>> >               cpu->epp_saved = -EINVAL;
>> > +             WRITE_ONCE(cpu->epp_cached, -EINVAL);
>> >       }
>> >
>> >       cpu = all_cpu_data[cpunum];
>> > @@ -2245,7 +2251,10 @@ static int intel_pstate_verify_policy(st
>> >
>> >  static void intel_cpufreq_stop_cpu(struct cpufreq_policy *policy)
>> >  {
>> > -     intel_pstate_set_min_pstate(all_cpu_data[policy->cpu]);
>> > +     if (hwp_active)
>> > +             intel_pstate_hwp_force_min_perf(policy->cpu);
>> > +     else
>> > +             intel_pstate_set_min_pstate(all_cpu_data[policy->cpu]);
>> >  }
>> >
>> >  static void intel_pstate_stop_cpu(struct cpufreq_policy *policy)
>> > @@ -2253,12 +2262,10 @@ static void intel_pstate_stop_cpu(struct
>> >       pr_debug("CPU %d exiting\n", policy->cpu);
>> >
>> >       intel_pstate_clear_update_util_hook(policy->cpu);
>> > -     if (hwp_active) {
>> > +     if (hwp_active)
>> >               intel_pstate_hwp_save_state(policy);
>> > -             intel_pstate_hwp_force_min_perf(policy->cpu);
>> > -     } else {
>> > -             intel_cpufreq_stop_cpu(policy);
>> > -     }
>> > +
>> > +     intel_cpufreq_stop_cpu(policy);
>> >  }
>> >
>> >  static int intel_pstate_cpu_exit(struct cpufreq_policy *policy)
>> > @@ -2388,13 +2395,82 @@ static void intel_cpufreq_trace(struct c
>> >               fp_toint(cpu->iowait_boost * 100));
>> >  }
>> >
>> > +static void intel_cpufreq_adjust_hwp(struct cpudata *cpu, u32 target_pstate,
>> > +                                  bool fast_switch)
>> > +{
>> > +     u64 prev = READ_ONCE(cpu->hwp_req_cached), value = prev;
>> > +     s16 epp;
>> > +
>> > +     value &= ~HWP_MIN_PERF(~0L);
>> > +     value |= HWP_MIN_PERF(target_pstate);
>> > +
>> > +     /*
>> > +      * The entire MSR needs to be updated in order to update the HWP min
>> > +      * field in it, so opportunistically update the max too if needed.
>> > +      */
>> > +     value &= ~HWP_MAX_PERF(~0L);
>> > +     value |= HWP_MAX_PERF(cpu->max_perf_ratio);
>> > +
>> > +     /*
>> > +      * In case the EPP has been adjusted via sysfs, write the last cached
>> > +      * value of it to the MSR as well.
>> > +      */
>> > +     epp = READ_ONCE(cpu->epp_cached);
>> > +     if (epp >= 0) {
>> > +             value &= ~GENMASK_ULL(31, 24);
>> > +             value |= (u64)epp << 24;
>> > +     }
>> > +
>> > +     if (value == prev)
>> > +             return;
>> > +
>> > +     WRITE_ONCE(cpu->hwp_req_cached, value);
>> > +     if (fast_switch)
>> > +             wrmsrl(MSR_HWP_REQUEST, value);
>> > +     else
>> > +             wrmsrl_on_cpu(cpu->cpu, MSR_HWP_REQUEST, value);
>> > +}
>>
>> I've asked this question already but you may have missed it: Given that
>> you are of the opinion that [1] should be implemented in schedutil
>> instead with intel_pstate in HWP passive mode, what's your plan for
>> exposing the HWP_MAX_PERF knob to the governor in addition to
>> HWP_MIN_PERF, since the interface implemented here only allows the
>> governor to provide a single frequency?
>>
>> [1] https://lwn.net/ml/linux-pm/20200428032258.2518-1-currojerez@riseup.net/
>
> This is not just about the schedutil governor, but about cpufreq
> governors in general (someone may still want to use the performance
> governor on top of intel_pstate, for example).
>
> And while governors can only provide one frequency, the policy limits
> in the cpufreq framework are based on QoS lists now and so it is
> possible to add a max limit request, say from a driver, to the max QoS
> list, and update it as needed, causing the max policy limit to be
> adjusted.
>
> That said I'm not exactly sure how useful the max limit generally is
> in practice on HWP systems, given that setting it above the base
> frequency causes it to be ignored, effectively, and the turbo range
> may be wider than the range of P-states below the base frequency.
>

I don't think that's accurate.  I've looked at hundreds of traces while
my series [1] was in control of HWP_REQ_MAX and I've never seen an
excursion above the maximum HWP_REQ_MAX control specified by it within a
given P-state domain, even while that maximum specified was well into
the turbo range.  So, yeah, I agree that HWP_REQ_MAX is nothing like a
hard limit, particularly when multiple threads are running on the same
clock domain, but the processor will still make its best effort to limit
the clock frequency to the maximum of the requested maximums, even if it
happens to be within the turbo range.  That doesn't make it useless.
The exact same thing can be said about controlling HWP_REQ_MIN as you're
doing now in this revision of your patch, BTW.

If you don't believe me here is the turbostat sample with maximum
Bzy_MHz I get on the computer I'm sitting on right now while compiling a
kernel on CPU0 if I set HWP_REQ_MAX to 0x1c (within the turbo range):

| Core    CPU     Avg_MHz Busy%   Bzy_MHz            HWP_REQ      PkgWatt CorWatt
| -       -       757     27.03   2800    0x0000000000000000      7.13    4.90
| 0       0       2794    99.77   2800    0x0000000080001c04      7.13    4.90
| 0       2       83      2.98    2800    0x0000000080001c04
| 1       1       73      2.60    2800    0x0000000080001c04
| 1       3       78      2.79    2800    0x0000000080001c04

With the default HWP_REQUEST:

| Core    CPU     Avg_MHz Busy%   Bzy_MHz            HWP_REQ      PkgWatt CorWatt
| -       -       814     27.00   3015    0x0000000000000000      8.49    6.18
| 0       0       2968    98.24   3021    0x0000000080001f04      8.49    6.18
| 0       2       84      2.81    2982    0x0000000080001f04
| 1       1       99      3.34    2961    0x0000000080001f04
| 1       3       105     3.60    2921    0x0000000080001f04

> Generally, I'm not quite convinced that limiting the max frequency is
> really the right choice for controlling the processor's power draw on
> the systems in question.  There are other ways to do that, which in
> theory should be more effective.  I mentioned RAPL somewhere in this
> context and there's the GUC firmware too.

I feel like we've had that conversation before and it's somewhat
off-topic so I'll keep it short: Yes, in theory RAPL is more effective
than HWP_REQ_MAX as a mechanism to limit the absolute power consumption
of the processor package, but that's not the purpose of [1], its purpose
is setting a lower limit to the energy efficiency of the processor when
the maximum usable CPU frequency is known (due to the existence of an IO
device bottleneck) -- And if the maximum usable CPU frequency is the
information we have at hand, controlling the maximum CPU frequency
directly is optimal, rather than trying to find the RAPL constraint that
achieves the same average frequency by trial and error.  Also, in theory,
even if you had an oracle to tell you what the appropriate RAPL
constraint is, the result would necessarily be more energy-inefficient
than controlling the maximum CPU frequency directly, since you're giving
the processor additional freedom to run at frequencies above the one you
want to average, which is guaranteed to be more energy-inefficient than
running at that fixed frequency, assuming we are in the region of
convexity of the processor's power curve.

Anyway, if you still have some disagreement on the theoretical details
you're more than welcome to bring up the matter on the other thread [1],
or accept the invitation for a presentation I sent you months ago... ;)
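
The convexity argument above can be checked numerically. The sketch below assumes an illustrative cubic power model P(f) = f^3 in arbitrary units (an assumption for the sake of the example, not measured data) and compares the energy of doing a fixed amount of work at one frequency against splitting the same work between a lower and a higher frequency with the same time-averaged frequency:

```c
#include <assert.h>

/*
 * Energy to retire `cycles` cycles at a fixed frequency f, with the
 * convex power model P(f) = f^3:  E = P(f) * t = f^3 * cycles / f.
 */
static double energy_fixed(double f, double cycles)
{
	double t = cycles / f;

	return f * f * f * t;
}

/*
 * Energy to retire the same cycles spending equal time at f_lo and
 * f_hi.  Total cycles = t_half * (f_lo + f_hi), so the time-averaged
 * frequency is (f_lo + f_hi) / 2 -- the same as the fixed case when
 * f = (f_lo + f_hi) / 2.
 */
static double energy_mixed(double f_lo, double f_hi, double cycles)
{
	double t_half = cycles / (f_lo + f_hi);

	return (f_lo * f_lo * f_lo + f_hi * f_hi * f_hi) * t_half;
}
```

For 2e9 cycles at a fixed 2.0 GHz versus alternating between 1.0 and 3.0 GHz (same average frequency, same work), the mixed schedule costs more energy, which is Jensen's inequality applied to the convex power curve.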
Srinivas Pandruvada July 16, 2020, 1:14 a.m. UTC | #5
On Wed, 2020-07-15 at 14:35 -0700, Francisco Jerez wrote:
> "Rafael J. Wysocki" <rafael@kernel.org> writes:
> 
> > On Wed, Jul 15, 2020 at 2:09 AM Francisco Jerez <
> > currojerez@riseup.net> wrote:
> > > "Rafael J. Wysocki" <rjw@rjwysocki.net> writes:
> > > 
> > > > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > > > 

[...]

> I don't think that's accurate.  I've looked at hundreds of traces
> while my series [1] was in control of HWP_REQ_MAX and I've never seen
> an excursion above the maximum HWP_REQ_MAX control specified by it
> within a given P-state domain, even while that maximum specified was
> well into the turbo range.  So, yeah, I agree that HWP_REQ_MAX is
> nothing like a hard limit, particularly when multiple threads are
> running on the same clock domain, but the processor will still make
> its best effort to limit the clock frequency to the maximum of the
> requested maximums, even if it happens to be within the turbo range.
> That doesn't make it useless.  The exact same thing can be said about
> controlling HWP_REQ_MIN as you're doing now in this revision of your
> patch, BTW.
> 
> If you don't believe me here is the turbostat sample with maximum
> Bzy_MHz I get on the computer I'm sitting on right now while
> compiling a kernel on CPU0 if I set HWP_REQ_MAX to 0x1c (within the
> turbo range):
> 
> > Core    CPU     Avg_MHz Busy%   Bzy_MHz            HWP_REQ      PkgWatt CorWatt
> > -       -       757     27.03   2800    0x0000000000000000      7.13    4.90
> > 0       0       2794    99.77   2800    0x0000000080001c04      7.13    4.90
> > 0       2       83      2.98    2800    0x0000000080001c04
> > 1       1       73      2.60    2800    0x0000000080001c04
> > 1       3       78      2.79    2800    0x0000000080001c04
> 
> With the default HWP_REQUEST:
> 
> > Core    CPU     Avg_MHz Busy%   Bzy_MHz            HWP_REQ      PkgWatt CorWatt
> > -       -       814     27.00   3015    0x0000000000000000      8.49    6.18
> > 0       0       2968    98.24   3021    0x0000000080001f04      8.49    6.18
> > 0       2       84      2.81    2982    0x0000000080001f04
> > 1       1       99      3.34    2961    0x0000000080001f04
> > 1       3       105     3.60    2921    0x0000000080001f04

Correct. In HWP mode it is possible to conditionally lower the limit in
the turbo region. In legacy mode you can't, because of the turbo
activation ratio.

But what we don't want is to set max and min perf and use them like the
"desired" value to run at a fixed P-state, overriding HWP, or to limit
the range so much that HWP can't do any meaningful selection.

> > Generally, I'm not quite convinced that limiting the max frequency
> > is really the right choice for controlling the processor's power
> > draw on the systems in question.  There are other ways to do that,
> > which in theory should be more effective.  I mentioned RAPL
> > somewhere in this context and there's the GUC firmware too.
> 
> I feel like we've had that conversation before and it's somewhat
> off-topic so I'll keep it short: Yes, in theory RAPL is more
> effective than HWP_REQ_MAX as a mechanism to limit the absolute power
> consumption of the processor package, but that's not the purpose of
> [1], its purpose is setting a lower limit to the energy efficiency of
> the processor when the maximum usable CPU frequency is known (due to
> the existence of an IO device bottleneck) -- And if the maximum
> usable CPU frequency is the information we have at hand, controlling
> the maximum CPU frequency directly is optimal, rather than trying to
> find the RAPL constraint that achieves the same average frequency by
> trial and error.  Also, in theory, even if you had an oracle to tell
> you what the appropriate RAPL constraint is, the result would
> necessarily be more energy-inefficient than controlling the maximum
> CPU frequency directly, since you're giving the processor additional
> freedom to run at frequencies above the one you want to average,
> which is guaranteed to be more energy-inefficient than running at
> that fixed frequency, assuming we are in the region of convexity of
> the processor's power curve.
> 
> Anyway, if you still have some disagreement on the theoretical
> details you're more than welcome to bring up the matter on the other
> thread [1], or accept the invitation for a presentation I sent you
> months ago... ;)
Rafael J. Wysocki July 16, 2020, 12:07 p.m. UTC | #6
On Wed, Jul 15, 2020 at 10:39 PM Doug Smythies <dsmythies@telus.net> wrote:
>
> On 2020.07.14 11:16 Rafael J. Wysocki wrote:
> >
> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ...
> > Since the passive mode hasn't worked with HWP at all, and it is not going to
> > the default for HWP systems anyway, I don't see any drawbacks related to making
> > this change, so I would consider this as 5.9 material unless there are any
> > serious objections.
>
> Good point.
> Some of the tests I do involve labour intensive post processing of data.
> I want to automate some of that work, and it will take time.
> We might be into the 5.9-rc series before I have detailed feedback.
>
> However, so far:
>
> Inverse impulse response test [1]:
>
> High level test, i5-9600K, HWP-passive (this patch), ondemand:
> 3101 tests. 0 failures. (GOOD)
>
> From [1], re-stated:
> > . High level: i5-9600K: 2453 tests, 60 failures, 2.45% fail rate. (HWP-active - powersave)
> > . Verify acpi-cpufreq/ondemand works fine: i5-9600K: 8975 tests. 0 failures.
>
> My version of that cool Alexander named pipe test [2] serialized workflow:
>
> HWP-passive (this patch), performance: PASS.
>
> From [2], re-stated, and also re-tested.
> HWP-disabled passive - performance: FAIL.

But I'm not quite sure how this is related to this patch?

This test would still fail without the patch if the kernel was started
with intel_pstate=passive in the kernel command line, wouldn't it?

> Although, I believe the issue to be EPB management, [3].

Well, that's kind of unexpected.

If this really is the case, then it looks like the effect of the EPB
setting on the processor is different depending on whether or not HWP
is enabled.

> And yes, I did see the reply to [3] that came earlier,
> And have now re-done the test, with the referenced patch added.
> It still is FAIL. I will reply to the [3] thread eventually.
>
> [1] https://marc.info/?l=linux-pm&m=159354421400342&w=2
> [2] https://marc.info/?l=linux-pm&m=159155067328641&w=2
> [3] https://marc.info/?l=linux-pm&m=159438804230744&w=2
>
> Kernel:
>
> b08284a541ad (HEAD -> k58rc5-doug) cpufreq: intel_pstate: Avoid enabling HWP if EPP is not supported
> 063fd7ccabfe cpufreq: intel_pstate: Implement passive mode with HWP enabled
> 730ccf5054e9 cpufreq: intel_pstate: Allow raw energy performance preference value
> bee36df01c68 cpufreq: intel_pstate: Allow enable/disable energy efficiency
> 199629d8200e cpufreq: intel_pstate: Fix active mode setting from command line
> 11ba468877bb (tag: v5.8-rc5, origin/master, origin/HEAD, master) Linux 5.8-rc5
>
> Rules for this work:
>
> . never use x86_energy_perf_policy.
> . For HWP disabled: never change from active to passive or vice versa, but rather do it via boot.
> . after boot always check and reset the various power limit log bits that are set.
> . never compile the kernel (well, until after any tests), which will set those bits again.
> . never run prime95 high heat torture test, which will set those bits again.
> . try to never do anything else that will set those bits again.
>
> To be clear, I do allow changing governors within the context of the above rules.

Thanks for the feedback!
Rafael J. Wysocki July 16, 2020, 2:32 p.m. UTC | #7
On Wed, Jul 15, 2020 at 11:35 PM Francisco Jerez <currojerez@riseup.net> wrote:
>
> "Rafael J. Wysocki" <rafael@kernel.org> writes:
>
> > On Wed, Jul 15, 2020 at 2:09 AM Francisco Jerez <currojerez@riseup.net> wrote:
> >>
> >> "Rafael J. Wysocki" <rjw@rjwysocki.net> writes:
> >>
> >> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >> >
> >> > Allow intel_pstate to work in the passive mode with HWP enabled and
> >> > make it set the HWP minimum performance limit (HWP floor) to the
> >> > P-state value given by the target frequency supplied by the cpufreq
> >> > governor, so as to prevent the HWP algorithm and the CPU scheduler
> >> > from working against each other, at least when the schedutil governor
> >> > is in use, and update the intel_pstate documentation accordingly.
> >> >
> >> > Among other things, this allows utilization clamps to be taken
> >> > into account, at least to a certain extent, when intel_pstate is
> >> > in use and makes it more likely that sufficient capacity for
> >> > deadline tasks will be provided.
> >> >
> >> > After this change, the resulting behavior of an HWP system with
> >> > intel_pstate in the passive mode should be close to the behavior
> >> > of the analogous non-HWP system with intel_pstate in the passive
> >> > mode, except that in the frequency range below the base frequency
> >> > (i.e. the frequency returned by the base_frequency cpufreq attribute
> >> > in sysfs on HWP systems) the HWP algorithm is allowed to go above
> >> > the floor P-state set by intel_pstate with or without hardware
> >> > coordination of P-states among CPUs in the same package.
> >> >
> >> > Also note that the setting of the HWP floor may not be taken into
> >> > account by the processor in the following cases:
> >> >
> >> >  * For the HWP floor in the range of P-states above the base
> >> >    frequency, referred to as the turbo range, the processor has a
> >> >    license to choose any P-state from that range, either below or
> >> >    above the HWP floor, just like a non-HWP processor in the case
> >> >    when the target P-state falls into the turbo range.
> >> >
> >> >  * If P-states of the CPUs in the same package are coordinated
> >> >    at the hardware level, the processor may choose a P-state
> >> >    above the HWP floor, just like a non-HWP processor in the
> >> >    analogous case.
> >> >
> >> > With this change applied, intel_pstate in the passive mode
> >> > assumes complete control over the HWP request MSR and concurrent
> >> > changes of that MSR (eg. via the direct MSR access interface) are
> >> > overridden by it.
> >> >
> >> > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >> > ---
> >> >
> >> > This basically unifies the passive mode behavior of intel_pstate for systems
> >> > with and without HWP enabled.  The only case in which there is a difference
> >> > between the two (after this patch) is below the turbo range, where the HWP
> >> > algorithm can go above the floor regardless of whether or not P-state are
> >> > coordinated package-wide (this means the systems with per-core P-states
> >> > mostly is where the difference can be somewhat visible).
> >> >
> >> > Since the passive mode hasn't worked with HWP at all, and it is not going to
> >> > the default for HWP systems anyway, I don't see any drawbacks related to making
> >> > this change, so I would consider this as 5.9 material unless there are any
> >> > serious objections.
> >> >
> >> > Thanks!
> >> >
> >> > ---
> >> >  Documentation/admin-guide/pm/intel_pstate.rst |   89 +++++++---------
> >> >  drivers/cpufreq/intel_pstate.c                |  141 ++++++++++++++++++++------
> >> >  2 files changed, 152 insertions(+), 78 deletions(-)
> >> >
> >> > Index: linux-pm/drivers/cpufreq/intel_pstate.c
> >> > ===================================================================
> >> > --- linux-pm.orig/drivers/cpufreq/intel_pstate.c
> >> > +++ linux-pm/drivers/cpufreq/intel_pstate.c
> >> > @@ -36,6 +36,7 @@
> >> >  #define INTEL_PSTATE_SAMPLING_INTERVAL       (10 * NSEC_PER_MSEC)
> >> >
> >> >  #define INTEL_CPUFREQ_TRANSITION_LATENCY     20000
> >> > +#define INTEL_CPUFREQ_TRANSITION_DELAY_HWP   5000
> >> >  #define INTEL_CPUFREQ_TRANSITION_DELAY               500
> >> >
> >> >  #ifdef CONFIG_ACPI
> >> > @@ -222,6 +223,7 @@ struct global_params {
> >> >   *                   preference/bias
> >> >   * @epp_saved:               Saved EPP/EPB during system suspend or CPU offline
> >> >   *                   operation
> >> > + * @epp_cached               Cached HWP energy-performance preference value
> >> >   * @hwp_req_cached:  Cached value of the last HWP Request MSR
> >> >   * @hwp_cap_cached:  Cached value of the last HWP Capabilities MSR
> >> >   * @last_io_update:  Last time when IO wake flag was set
> >> > @@ -259,6 +261,7 @@ struct cpudata {
> >> >       s16 epp_policy;
> >> >       s16 epp_default;
> >> >       s16 epp_saved;
> >> > +     s16 epp_cached;
> >> >       u64 hwp_req_cached;
> >> >       u64 hwp_cap_cached;
> >> >       u64 last_io_update;
> >> > @@ -676,6 +679,8 @@ static int intel_pstate_set_energy_pref_
> >> >
> >> >               value |= (u64)epp << 24;
> >> >               ret = wrmsrl_on_cpu(cpu_data->cpu, MSR_HWP_REQUEST, value);
> >> > +
> >> > +             WRITE_ONCE(cpu_data->epp_cached, epp);
> >>
> >> Why introduce a new EPP cache variable if there is already
> >> hwp_req_cached?  If intel_pstate_set_energy_pref_index() is failing to
> >> update hwp_req_cached maybe we should fix that instead.  That will save
> >> you a little bit of work in intel_cpufreq_adjust_hwp().
> >
> > Yes, it would, but then we'd need to explicitly synchronize
> > intel_pstate_set_energy_pref_index() with the scheduler context which
> > I'd rather avoid.
> >
>
> How does using a differently named variable save you from doing that?

It is a separate variable.

The only updater of epp_cached, except for the initialization, is
intel_pstate_set_energy_pref_index() and it cannot race with another
instance of itself, so there are no concurrent writes to epp_cached.

In the passive mode the only updater of hwp_req_cached, except for the
initialization, is intel_cpufreq_adjust_hwp() (or there is a bug in
the patch that I have missed) and it cannot race with another instance
of itself for the same CPU, so there are no concurrent writes to
hwp_req_cached.

If intel_pstate_set_energy_pref_index() updated hwp_req_cached
directly, however, it might be updated in two places concurrently and
so explicit synchronization would be necessary.
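The single-writer scheme described in the last two paragraphs can be sketched in plain C11 (a userspace model, not the kernel code: relaxed C11 atomics stand in for the kernel's WRITE_ONCE()/READ_ONCE(), and the function names mirror the driver's only loosely):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/*
 * Userspace model of the lock-free scheme described above (a sketch,
 * not the kernel code).  Each cached field has exactly one writer
 * context, so relaxed stores/loads suffice and no lock is needed as
 * long as that single-writer invariant holds.
 */
static _Atomic int16_t epp_cached = -1;      /* sole writer: sysfs path */
static _Atomic uint64_t hwp_req_cached = 0;  /* sole writer: scheduler path */

/* Models intel_pstate_set_energy_pref_index(): only touches epp_cached. */
static void set_energy_pref(int16_t epp)
{
	atomic_store_explicit(&epp_cached, epp, memory_order_relaxed);
}

/* Models intel_cpufreq_adjust_hwp(): sole writer of hwp_req_cached,
 * opportunistically picking up the latest EPP set via the other path. */
static uint64_t adjust_hwp(uint64_t value)
{
	int16_t epp = atomic_load_explicit(&epp_cached, memory_order_relaxed);

	if (epp >= 0) {
		value &= ~(0xffULL << 24);	/* GENMASK_ULL(31, 24) */
		value |= (uint64_t)epp << 24;
	}
	atomic_store_explicit(&hwp_req_cached, value, memory_order_relaxed);
	return value;
}
```

Even if an adjust_hwp() overwrites the EPP written to the register by the sysfs path, the next adjust_hwp() invocation restores it from epp_cached, which is the behavior described below.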

> And won't the EPP setting programmed by intel_pstate_set_energy_pref_index()
> be lost if intel_pstate_hwp_boost_up() or some other user of
> hwp_req_cached is executed afterwards with the current approach?

The value written to the register by it may be overwritten by a
concurrent intel_cpufreq_adjust_hwp(), but that is not a problem,
because next time intel_cpufreq_adjust_hwp() runs for the target CPU,
it will pick up the updated epp_cached value which will be written to
the register.  So there may be a short time window after the
intel_pstate_set_energy_pref_index() invocation in which the new EPP
value may not be in effect, but in general there is no guarantee that
the new EPP will take effect immediately after updating the MSR
anyway, so that race doesn't matter.

That said, that race is avoidable, but I was thinking that trying to
avoid it might not be worth it.  Now I see a better way to avoid it,
though, so I'm going to update the patch to that end.
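For reference, the field layout of the HWP request MSR that the quoted intel_cpufreq_adjust_hwp() manipulates can be sketched as follows (bit positions per the Intel SDM; these helpers are illustrative, not the kernel's HWP_MIN_PERF()/HWP_MAX_PERF() macros):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Sketch of the IA32_HWP_REQUEST field layout (per the Intel SDM):
 *
 *   bits  7:0   minimum performance
 *   bits 15:8   maximum performance
 *   bits 23:16  desired performance
 *   bits 31:24  energy-performance preference (EPP)
 */
static uint64_t hwp_req_set_min(uint64_t req, uint8_t min)
{
	return (req & ~0xffULL) | min;
}

static uint64_t hwp_req_set_max(uint64_t req, uint8_t max)
{
	return (req & ~(0xffULL << 8)) | ((uint64_t)max << 8);
}

static uint64_t hwp_req_set_epp(uint64_t req, uint8_t epp)
{
	/* the quoted patch clears this field with GENMASK_ULL(31, 24) */
	return (req & ~(0xffULL << 24)) | ((uint64_t)epp << 24);
}
```

For instance, min = 0x04, max = 0x1c and EPP = 0x80 pack to 0x80001c04, which matches the HWP_REQ value visible in the turbostat output quoted later in the thread.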

> Seems like a bug to me.

It is racy, but not every race is a bug.

> >> >       } else {
> >> >               if (epp == -EINVAL)
> >> >                       epp = (pref_index - 1) << 2;
> >> > @@ -2047,6 +2052,7 @@ static int intel_pstate_init_cpu(unsigne
> >> >               cpu->epp_default = -EINVAL;
> >> >               cpu->epp_powersave = -EINVAL;
> >> >               cpu->epp_saved = -EINVAL;
> >> > +             WRITE_ONCE(cpu->epp_cached, -EINVAL);
> >> >       }
> >> >
> >> >       cpu = all_cpu_data[cpunum];
> >> > @@ -2245,7 +2251,10 @@ static int intel_pstate_verify_policy(st
> >> >
> >> >  static void intel_cpufreq_stop_cpu(struct cpufreq_policy *policy)
> >> >  {
> >> > -     intel_pstate_set_min_pstate(all_cpu_data[policy->cpu]);
> >> > +     if (hwp_active)
> >> > +             intel_pstate_hwp_force_min_perf(policy->cpu);
> >> > +     else
> >> > +             intel_pstate_set_min_pstate(all_cpu_data[policy->cpu]);
> >> >  }
> >> >
> >> >  static void intel_pstate_stop_cpu(struct cpufreq_policy *policy)
> >> > @@ -2253,12 +2262,10 @@ static void intel_pstate_stop_cpu(struct
> >> >       pr_debug("CPU %d exiting\n", policy->cpu);
> >> >
> >> >       intel_pstate_clear_update_util_hook(policy->cpu);
> >> > -     if (hwp_active) {
> >> > +     if (hwp_active)
> >> >               intel_pstate_hwp_save_state(policy);
> >> > -             intel_pstate_hwp_force_min_perf(policy->cpu);
> >> > -     } else {
> >> > -             intel_cpufreq_stop_cpu(policy);
> >> > -     }
> >> > +
> >> > +     intel_cpufreq_stop_cpu(policy);
> >> >  }
> >> >
> >> >  static int intel_pstate_cpu_exit(struct cpufreq_policy *policy)
> >> > @@ -2388,13 +2395,82 @@ static void intel_cpufreq_trace(struct c
> >> >               fp_toint(cpu->iowait_boost * 100));
> >> >  }
> >> >
> >> > +static void intel_cpufreq_adjust_hwp(struct cpudata *cpu, u32 target_pstate,
> >> > +                                  bool fast_switch)
> >> > +{
> >> > +     u64 prev = READ_ONCE(cpu->hwp_req_cached), value = prev;
> >> > +     s16 epp;
> >> > +
> >> > +     value &= ~HWP_MIN_PERF(~0L);
> >> > +     value |= HWP_MIN_PERF(target_pstate);
> >> > +
> >> > +     /*
> >> > +      * The entire MSR needs to be updated in order to update the HWP min
> >> > +      * field in it, so opportunistically update the max too if needed.
> >> > +      */
> >> > +     value &= ~HWP_MAX_PERF(~0L);
> >> > +     value |= HWP_MAX_PERF(cpu->max_perf_ratio);
> >> > +
> >> > +     /*
> >> > +      * In case the EPP has been adjusted via sysfs, write the last cached
> >> > +      * value of it to the MSR as well.
> >> > +      */
> >> > +     epp = READ_ONCE(cpu->epp_cached);
> >> > +     if (epp >= 0) {
> >> > +             value &= ~GENMASK_ULL(31, 24);
> >> > +             value |= (u64)epp << 24;
> >> > +     }
> >> > +
> >> > +     if (value == prev)
> >> > +             return;
> >> > +
> >> > +     WRITE_ONCE(cpu->hwp_req_cached, value);
> >> > +     if (fast_switch)
> >> > +             wrmsrl(MSR_HWP_REQUEST, value);
> >> > +     else
> >> > +             wrmsrl_on_cpu(cpu->cpu, MSR_HWP_REQUEST, value);
> >> > +}
> >>
> >> I've asked this question already but you may have missed it: Given that
> >> you are of the opinion that [1] should be implemented in schedutil
> >> instead with intel_pstate in HWP passive mode, what's your plan for
> >> exposing the HWP_MAX_PERF knob to the governor in addition to
> >> HWP_MIN_PERF, since the interface implemented here only allows the
> >> governor to provide a single frequency?
> >>
> >> [1] https://lwn.net/ml/linux-pm/20200428032258.2518-1-currojerez@riseup.net/
> >
> > This is not just about the schedutil governor, but about cpufreq
> > governors in general (someone may still want to use the performance
> > governor on top of intel_pstate, for example).
> >
> > And while governors can only provide one frequency, the policy limits
> > in the cpufreq framework are based on QoS lists now and so it is
> > possible to add a max limit request, say from a driver, to the max QoS
> > list, and update it as needed, causing the max policy limit to be
> > adjusted.
> >
> > That said I'm not exactly sure how useful the max limit generally is
> > in practice on HWP systems, given that setting it above the base
> > frequency causes it to be ignored, effectively, and the turbo range
> > may be wider than the range of P-states below the base frequency.
> >
>
> I don't think that's accurate.  I've looked at hundreds of traces while
> my series [1] was in control of HWP_REQ_MAX and I've never seen an
> excursion above the maximum HWP_REQ_MAX control specified by it within a
> given P-state domain, even while that maximum specified was well into
> the turbo range.

I'm not going to argue with your experience. :-)

What I'm saying is that there is no guarantee that the processor will
always select P-states below HWP_REQ_MAX in the turbo range.  That may
not happen in practice, but it is not precluded AFAICS.

Also while HWP_REQ_MAX can work in practice most of the time with HWP
enabled, without HWP there is no easy way to limit the max frequency
if the current request falls into the turbo range.  The HWP case is
more important nowadays, but there still are systems without it and
ideally they should be covered as well.

> So, yeah, I agree that HWP_REQ_MAX is nothing like a
> hard limit, particularly when multiple threads are running on the same
> clock domain, but the processor will still make its best effort to limit
> the clock frequency to the maximum of the requested maximums, even if it
> happens to be within the turbo range.  That doesn't make it useless.

I haven't used the word "useless" anywhere in my previous message.

Using the max frequency to control power has merit, but how much of it
is there depends on some factors that may change from one system to
another.

The alternative power control methods may be more reliable in general.

> The exact same thing can be said about controlling HWP_REQ_MIN as you're
> doing now in this revision of your patch, BTW.

Which has been clearly stated in the changelog I believe.

The point here is that this is as good as using the perf control
register to ask for a given P-state without HWP, whereas trying to
drive the max too is added complexity.

> If you don't believe me here is the turbostat sample with maximum
> Bzy_MHz I get on the computer I'm sitting on right now while compiling a
> kernel on CPU0 if I set HWP_REQ_MAX to 0x1c (within the turbo range):
>
> | Core    CPU     Avg_MHz Busy%   Bzy_MHz            HWP_REQ      PkgWatt CorWatt
> | -       -       757     27.03   2800    0x0000000000000000      7.13    4.90
> | 0       0       2794    99.77   2800    0x0000000080001c04      7.13    4.90
> | 0       2       83      2.98    2800    0x0000000080001c04
> | 1       1       73      2.60    2800    0x0000000080001c04
> | 1       3       78      2.79    2800    0x0000000080001c04
>
> With the default HWP_REQUEST:
>
> | Core    CPU     Avg_MHz Busy%   Bzy_MHz            HWP_REQ      PkgWatt CorWatt
> | -       -       814     27.00   3015    0x0000000000000000      8.49    6.18
> | 0       0       2968    98.24   3021    0x0000000080001f04      8.49    6.18
> | 0       2       84      2.81    2982    0x0000000080001f04
> | 1       1       99      3.34    2961    0x0000000080001f04
> | 1       3       105     3.60    2921    0x0000000080001f04
>
> > Generally, I'm not quite convinced that limiting the max frequency is
> > really the right choice for controlling the processor's power draw on
> > the systems in question.  There are other ways to do that, which in
> > theory should be more effective.  I mentioned RAPL somewhere in this
> > context and there's the GUC firmware too.
>
> I feel like we've had that conversation before and it's somewhat
> off-topic so I'll keep it short: Yes, in theory RAPL is more effective
> than HWP_REQ_MAX as a mechanism to limit the absolute power consumption
> of the processor package, but that's not the purpose of [1], its purpose
> is setting a lower limit to the energy efficiency of the processor when
> the maximum usable CPU frequency is known (due to the existence of an IO
> device bottleneck)

Whether or not that frequency is actually known seems quite
questionable to me, but that's an aside.

More important, it is unclear to me what you mean by "a lower limit to
the energy efficiency of the processor".

I guess what you mean is that the processor might decide to go for a
more energy-efficient configuration by increasing its frequency in a
"race to idle" fashion (in response to a perceived utilization spike)
and you want to prevent that from occurring.

Or, generally speaking, that the CPU performance scaling logic, either
in the kernel or in the processor itself, might select a higher
operating frequency of a CPU in response to a perceived utilization
spike, but that may be a mistake in the presence of another data
processing device sharing the power budget with the processor, so you
want to prevent that from taking place.

In both cases, I wouldn't call that limiting the energy-efficiency of
the processor.  Effectively, this means putting a limit on the
processor's power budget, which is exactly what RAPL is for.

> -- And if the maximum usable CPU frequency is the
> information we have at hand,

How so?

How can you tell what that frequency is?

> controlling the maximum CPU frequency
> directly is optimal, rather than trying to find the RAPL constraint that
> achieves the same average frequency by trial and error.  Also, in theory,
> even if you had an oracle to tell you what the appropriate RAPL
> constraint is, the result would necessarily be more energy-inefficient
> than controlling the maximum CPU frequency directly, since you're giving
> the processor additional freedom to run at frequencies above the one you
> want to average, which is guaranteed to be more energy-inefficient than
> running at that fixed frequency, assuming we are in the region of
> convexity of the processor's power curve.
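As a side note, the convexity argument quoted above can be checked numerically under an idealized dynamic-power model P(f) = f^3 (an assumption: real CPUs also have leakage and discrete voltage/frequency points, so this is only a sketch):

```c
#include <assert.h>
#include <math.h>

/*
 * Numerical check of the convexity argument, assuming P(f) = f^3.
 * Finishing a fixed amount of work at one constant frequency costs
 * less energy than alternating between two frequencies with the same
 * time-weighted average frequency.
 */
static double energy_fixed(double work, double f)
{
	double t = work / f;		/* time to finish the work */

	return f * f * f * t;		/* E = P(f) * t */
}

static double energy_mixed(double work, double f_lo, double f_hi)
{
	double f_avg = 0.5 * (f_lo + f_hi);
	double t = work / f_avg;	/* same total work, same total time */

	/* half the time at each frequency */
	return 0.5 * t * (f_lo * f_lo * f_lo + f_hi * f_hi * f_hi);
}
```

With work = 100 and 2.4/3.2 GHz averaging 2.8 GHz, the mixed schedule costs 832 energy units against 784 for running fixed at 2.8 GHz, illustrating the quoted claim.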

So the reason why you want to limit the processor's max frequency in
the first place is because it is sharing the power budget with
something else.  If there's no sharing of the power budget or thermal
constraints, there is no need to limit the CPU frequency other than
for the sake of saving energy.

What you can achieve by limiting the max CPU frequency is to make the
processor draw less power (and cause it to use either less or more
energy, depending on the energy-efficiency curve).  You don't know how
much less power it will draw then, however.

You seem to be saying "I know exactly what the maximum frequency of
the CPU can be, so why don't I set it as the upper limit", but I'm
questioning the source of that knowledge.  Does it not come from
knowing the power budget you want to give to the processor?

> Anyway, if you still have some disagreement on the theoretical details
> you're more than welcome to bring up the matter on the other thread [1],
> or accept the invitation for a presentation I sent you months ago... ;)

Why don't we continue the discussion here instead?

I think we are getting to the bottom of things here.
Rafael J. Wysocki July 16, 2020, 2:33 p.m. UTC | #8
On Thu, Jul 16, 2020 at 3:14 AM Srinivas Pandruvada
<srinivas.pandruvada@linux.intel.com> wrote:
>
> On Wed, 2020-07-15 at 14:35 -0700, Francisco Jerez wrote:
> > "Rafael J. Wysocki" <rafael@kernel.org> writes:
> >
> > > On Wed, Jul 15, 2020 at 2:09 AM Francisco Jerez <
> > > currojerez@riseup.net> wrote:
> > > > "Rafael J. Wysocki" <rjw@rjwysocki.net> writes:
> > > >
> > > > > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > > > >
>
> [...]
>
> > I don't think that's accurate.  I've looked at hundreds of traces
> > while my series [1] was in control of HWP_REQ_MAX and I've never seen
> > an excursion above the maximum HWP_REQ_MAX control specified by it
> > within a given P-state domain, even while that maximum specified was
> > well into the turbo range.  So, yeah, I agree that HWP_REQ_MAX is
> > nothing like a hard limit, particularly when multiple threads are
> > running on the same clock domain, but the processor will still make
> > its best effort to limit the clock frequency to the maximum of the
> > requested maximums, even if it happens to be within the turbo range.
> > That doesn't make it useless.
> > The exact same thing can be said about controlling HWP_REQ_MIN as
> > you're doing now in this revision of your patch, BTW.
> >
> > If you don't believe me here is the turbostat sample with maximum
> > Bzy_MHz I get on the computer I'm sitting on right now while compiling
> > a kernel on CPU0 if I set HWP_REQ_MAX to 0x1c (within the turbo range):
> >
> > > Core    CPU     Avg_MHz Busy%   Bzy_MHz            HWP_REQ      PkgWatt CorWatt
> > > -       -       757     27.03   2800    0x0000000000000000      7.13    4.90
> > > 0       0       2794    99.77   2800    0x0000000080001c04      7.13    4.90
> > > 0       2       83      2.98    2800    0x0000000080001c04
> > > 1       1       73      2.60    2800    0x0000000080001c04
> > > 1       3       78      2.79    2800    0x0000000080001c04
> >
> > With the default HWP_REQUEST:
> >
> > > Core    CPU     Avg_MHz Busy%   Bzy_MHz            HWP_REQ      PkgWatt CorWatt
> > > -       -       814     27.00   3015    0x0000000000000000      8.49    6.18
> > > 0       0       2968    98.24   3021    0x0000000080001f04      8.49    6.18
> > > 0       2       84      2.81    2982    0x0000000080001f04
> > > 1       1       99      3.34    2961    0x0000000080001f04
> > > 1       3       105     3.60    2921    0x0000000080001f04
>
> Correct. In HWP mode it is possible to conditionally lower the limit in
> the turbo region. In legacy mode you can't, because of the turbo
> activation ratio.
>
> But what we don't want is to set both max and min perf and use them like
> a desired value to run at a fixed P-state, overriding HWP or limiting
> the range so that HWP can't make any meaningful selection.

That's a good point too IMO.
Francisco Jerez July 17, 2020, 12:21 a.m. UTC | #9
"Rafael J. Wysocki" <rafael@kernel.org> writes:

> On Wed, Jul 15, 2020 at 11:35 PM Francisco Jerez <currojerez@riseup.net> wrote:
>>
>> "Rafael J. Wysocki" <rafael@kernel.org> writes:
>>
>> > On Wed, Jul 15, 2020 at 2:09 AM Francisco Jerez <currojerez@riseup.net> wrote:
>> >>
>> >> "Rafael J. Wysocki" <rjw@rjwysocki.net> writes:
>> >>
>> >> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>> >> >
>> >> > Allow intel_pstate to work in the passive mode with HWP enabled and
>> >> > make it set the HWP minimum performance limit (HWP floor) to the
>> >> > P-state value given by the target frequency supplied by the cpufreq
>> >> > governor, so as to prevent the HWP algorithm and the CPU scheduler
>> >> > from working against each other, at least when the schedutil governor
>> >> > is in use, and update the intel_pstate documentation accordingly.
>> >> >
>> >> > Among other things, this allows utilization clamps to be taken
>> >> > into account, at least to a certain extent, when intel_pstate is
>> >> > in use and makes it more likely that sufficient capacity for
>> >> > deadline tasks will be provided.
>> >> >
>> >> > After this change, the resulting behavior of an HWP system with
>> >> > intel_pstate in the passive mode should be close to the behavior
>> >> > of the analogous non-HWP system with intel_pstate in the passive
>> >> > mode, except that in the frequency range below the base frequency
>> >> > (ie. the frequency returned by the base_frequency cpufreq attribute
>> >> > in sysfs on HWP systems) the HWP algorithm is allowed to go above
>> >> > the floor P-state set by intel_pstate with or without hardware
>> >> > coordination of P-states among CPUs in the same package.
>> >> >
>> >> > Also note that the setting of the HWP floor may not be taken into
>> >> > account by the processor in the following cases:
>> >> >
>> >> >  * For the HWP floor in the range of P-states above the base
>> >> >    frequency, referred to as the turbo range, the processor has a
>> >> >    license to choose any P-state from that range, either below or
>> >> >    above the HWP floor, just like a non-HWP processor in the case
>> >> >    when the target P-state falls into the turbo range.
>> >> >
>> >> >  * If P-states of the CPUs in the same package are coordinated
>> >> >    at the hardware level, the processor may choose a P-state
>> >> >    above the HWP floor, just like a non-HWP processor in the
>> >> >    analogous case.
>> >> >
>> >> > With this change applied, intel_pstate in the passive mode
>> >> > assumes complete control over the HWP request MSR and concurrent
>> >> > changes of that MSR (eg. via the direct MSR access interface) are
>> >> > overridden by it.
>> >> >
>> >> > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>> >> > ---
>> >> >
>> >> > This basically unifies the passive mode behavior of intel_pstate for systems
>> >> > with and without HWP enabled.  The only case in which there is a difference
>> >> > between the two (after this patch) is below the turbo range, where the HWP
>> >> > algorithm can go above the floor regardless of whether or not P-state are
>> >> > coordinated package-wide (this means the systems with per-core P-states
>> >> > mostly is where the difference can be somewhat visible).
>> >> >
>> >> > Since the passive mode hasn't worked with HWP at all, and it is not going to
>> >> > the default for HWP systems anyway, I don't see any drawbacks related to making
>> >> > this change, so I would consider this as 5.9 material unless there are any
>> >> > serious objections.
>> >> >
>> >> > Thanks!
>> >> >
>> >> > ---
>> >> >  Documentation/admin-guide/pm/intel_pstate.rst |   89 +++++++---------
>> >> >  drivers/cpufreq/intel_pstate.c                |  141 ++++++++++++++++++++------
>> >> >  2 files changed, 152 insertions(+), 78 deletions(-)
>> >> >
>> >> > Index: linux-pm/drivers/cpufreq/intel_pstate.c
>> >> > ===================================================================
>> >> > --- linux-pm.orig/drivers/cpufreq/intel_pstate.c
>> >> > +++ linux-pm/drivers/cpufreq/intel_pstate.c
>> >> > @@ -36,6 +36,7 @@
>> >> >  #define INTEL_PSTATE_SAMPLING_INTERVAL       (10 * NSEC_PER_MSEC)
>> >> >
>> >> >  #define INTEL_CPUFREQ_TRANSITION_LATENCY     20000
>> >> > +#define INTEL_CPUFREQ_TRANSITION_DELAY_HWP   5000
>> >> >  #define INTEL_CPUFREQ_TRANSITION_DELAY               500
>> >> >
>> >> >  #ifdef CONFIG_ACPI
>> >> > @@ -222,6 +223,7 @@ struct global_params {
>> >> >   *                   preference/bias
>> >> >   * @epp_saved:               Saved EPP/EPB during system suspend or CPU offline
>> >> >   *                   operation
>> >> > + * @epp_cached:      Cached HWP energy-performance preference value
>> >> >   * @hwp_req_cached:  Cached value of the last HWP Request MSR
>> >> >   * @hwp_cap_cached:  Cached value of the last HWP Capabilities MSR
>> >> >   * @last_io_update:  Last time when IO wake flag was set
>> >> > @@ -259,6 +261,7 @@ struct cpudata {
>> >> >       s16 epp_policy;
>> >> >       s16 epp_default;
>> >> >       s16 epp_saved;
>> >> > +     s16 epp_cached;
>> >> >       u64 hwp_req_cached;
>> >> >       u64 hwp_cap_cached;
>> >> >       u64 last_io_update;
>> >> > @@ -676,6 +679,8 @@ static int intel_pstate_set_energy_pref_
>> >> >
>> >> >               value |= (u64)epp << 24;
>> >> >               ret = wrmsrl_on_cpu(cpu_data->cpu, MSR_HWP_REQUEST, value);
>> >> > +
>> >> > +             WRITE_ONCE(cpu_data->epp_cached, epp);
>> >>
>> >> Why introduce a new EPP cache variable if there is already
>> >> hwp_req_cached?  If intel_pstate_set_energy_pref_index() is failing to
>> >> update hwp_req_cached maybe we should fix that instead.  That will save
>> >> you a little bit of work in intel_cpufreq_adjust_hwp().
>> >
>> > Yes, it would, but then we'd need to explicitly synchronize
>> > intel_pstate_set_energy_pref_index() with the scheduler context which
>> > I'd rather avoid.
>> >
>>
>> How does using a differently named variable save you from doing that?
>
> It is a separate variable.
>
> The only updater of epp_cached, except for the initialization, is
> intel_pstate_set_energy_pref_index() and it cannot race with another
> instance of itself, so there are no concurrent writes to epp_cached.
>
> In the passive mode the only updater of hwp_req_cached, except for the
> initialization, is intel_cpufreq_adjust_hwp() (or there is a bug in
> the patch that I have missed) and it cannot race with another instance
> of itself for the same CPU, so there are no concurrent writes to
> hwp_req_cached.
>
> If intel_pstate_set_energy_pref_index() updated hwp_req_cached
> directly, however, it might be updated in two places concurrently and
> so explicit synchronization would be necessary.
>

That's fair, but we may need to add such synchronization anyway due to
the bug I pointed out above, so it might be simpler to avoid introducing
additional state and simply stick to hwp_req_cached with proper
synchronization.

>> And won't the EPP setting programmed by intel_pstate_set_energy_pref_index()
>> be lost if intel_pstate_hwp_boost_up() or some other user of
>> hwp_req_cached is executed afterwards with the current approach?
>
> The value written to the register by it may be overwritten by a
> concurrent intel_cpufreq_adjust_hwp(), but that is not a problem,
> because next time intel_cpufreq_adjust_hwp() runs for the target CPU,
> it will pick up the updated epp_cached value which will be written to
> the register.

However, intel_cpufreq_adjust_hwp() may never be executed afterwards if
intel_pstate is in active mode, in which case the overwritten value may
potentially remain there forever.

> So there may be a short time window after the
> intel_pstate_set_energy_pref_index() invocation in which the new EPP
> value may not be in effect, but in general there is no guarantee that
> the new EPP will take effect immediately after updating the MSR
> anyway, so that race doesn't matter.
>
> That said, that race is avoidable, but I was thinking that trying to
> avoid it might not be worth it.  Now I see a better way to avoid it,
> though, so I'm going to update the patch to that end.
>
>> Seems like a bug to me.
>
> It is racy, but not every race is a bug.
>

Still seems like there is a bug in intel_pstate_set_energy_pref_index()
AFAICT.

>> >> >       } else {
>> >> >               if (epp == -EINVAL)
>> >> >                       epp = (pref_index - 1) << 2;
>> >> > @@ -2047,6 +2052,7 @@ static int intel_pstate_init_cpu(unsigne
>> >> >               cpu->epp_default = -EINVAL;
>> >> >               cpu->epp_powersave = -EINVAL;
>> >> >               cpu->epp_saved = -EINVAL;
>> >> > +             WRITE_ONCE(cpu->epp_cached, -EINVAL);
>> >> >       }
>> >> >
>> >> >       cpu = all_cpu_data[cpunum];
>> >> > @@ -2245,7 +2251,10 @@ static int intel_pstate_verify_policy(st
>> >> >
>> >> >  static void intel_cpufreq_stop_cpu(struct cpufreq_policy *policy)
>> >> >  {
>> >> > -     intel_pstate_set_min_pstate(all_cpu_data[policy->cpu]);
>> >> > +     if (hwp_active)
>> >> > +             intel_pstate_hwp_force_min_perf(policy->cpu);
>> >> > +     else
>> >> > +             intel_pstate_set_min_pstate(all_cpu_data[policy->cpu]);
>> >> >  }
>> >> >
>> >> >  static void intel_pstate_stop_cpu(struct cpufreq_policy *policy)
>> >> > @@ -2253,12 +2262,10 @@ static void intel_pstate_stop_cpu(struct
>> >> >       pr_debug("CPU %d exiting\n", policy->cpu);
>> >> >
>> >> >       intel_pstate_clear_update_util_hook(policy->cpu);
>> >> > -     if (hwp_active) {
>> >> > +     if (hwp_active)
>> >> >               intel_pstate_hwp_save_state(policy);
>> >> > -             intel_pstate_hwp_force_min_perf(policy->cpu);
>> >> > -     } else {
>> >> > -             intel_cpufreq_stop_cpu(policy);
>> >> > -     }
>> >> > +
>> >> > +     intel_cpufreq_stop_cpu(policy);
>> >> >  }
>> >> >
>> >> >  static int intel_pstate_cpu_exit(struct cpufreq_policy *policy)
>> >> > @@ -2388,13 +2395,82 @@ static void intel_cpufreq_trace(struct c
>> >> >               fp_toint(cpu->iowait_boost * 100));
>> >> >  }
>> >> >
>> >> > +static void intel_cpufreq_adjust_hwp(struct cpudata *cpu, u32 target_pstate,
>> >> > +                                  bool fast_switch)
>> >> > +{
>> >> > +     u64 prev = READ_ONCE(cpu->hwp_req_cached), value = prev;
>> >> > +     s16 epp;
>> >> > +
>> >> > +     value &= ~HWP_MIN_PERF(~0L);
>> >> > +     value |= HWP_MIN_PERF(target_pstate);
>> >> > +
>> >> > +     /*
>> >> > +      * The entire MSR needs to be updated in order to update the HWP min
>> >> > +      * field in it, so opportunistically update the max too if needed.
>> >> > +      */
>> >> > +     value &= ~HWP_MAX_PERF(~0L);
>> >> > +     value |= HWP_MAX_PERF(cpu->max_perf_ratio);
>> >> > +
>> >> > +     /*
>> >> > +      * In case the EPP has been adjusted via sysfs, write the last cached
>> >> > +      * value of it to the MSR as well.
>> >> > +      */
>> >> > +     epp = READ_ONCE(cpu->epp_cached);
>> >> > +     if (epp >= 0) {
>> >> > +             value &= ~GENMASK_ULL(31, 24);
>> >> > +             value |= (u64)epp << 24;
>> >> > +     }
>> >> > +
>> >> > +     if (value == prev)
>> >> > +             return;
>> >> > +
>> >> > +     WRITE_ONCE(cpu->hwp_req_cached, value);
>> >> > +     if (fast_switch)
>> >> > +             wrmsrl(MSR_HWP_REQUEST, value);
>> >> > +     else
>> >> > +             wrmsrl_on_cpu(cpu->cpu, MSR_HWP_REQUEST, value);
>> >> > +}
>> >>
>> >> I've asked this question already but you may have missed it: Given that
>> >> you are of the opinion that [1] should be implemented in schedutil
>> >> instead with intel_pstate in HWP passive mode, what's your plan for
>> >> exposing the HWP_MAX_PERF knob to the governor in addition to
>> >> HWP_MIN_PERF, since the interface implemented here only allows the
>> >> governor to provide a single frequency?
>> >>
>> >> [1] https://lwn.net/ml/linux-pm/20200428032258.2518-1-currojerez@riseup.net/
>> >
>> > This is not just about the schedutil governor, but about cpufreq
>> > governors in general (someone may still want to use the performance
>> > governor on top of intel_pstate, for example).
>> >
>> > And while governors can only provide one frequency, the policy limits
>> > in the cpufreq framework are based on QoS lists now and so it is
>> > possible to add a max limit request, say from a driver, to the max QoS
>> > list, and update it as needed, causing the max policy limit to be
>> > adjusted.
>> >
>> > That said I'm not exactly sure how useful the max limit generally is
>> > in practice on HWP systems, given that setting it above the base
>> > frequency causes it to be ignored, effectively, and the turbo range
>> > may be wider than the range of P-states below the base frequency.
>> >
>>
>> I don't think that's accurate.  I've looked at hundreds of traces while
>> my series [1] was in control of HWP_REQ_MAX and I've never seen an
>> excursion above the maximum HWP_REQ_MAX control specified by it within a
>> given P-state domain, even while that maximum specified was well into
>> the turbo range.
>
> I'm not going to argue with your experience. :-)
>
> What I'm saying is that there is no guarantee that the processor will
> always select P-states below HWP_REQ_MAX in the turbo range.  That may
> not happen in practice, but it is not precluded AFAICS.
>
> Also while HWP_REQ_MAX can work in practice most of the time with HWP
> enabled, without HWP there is no easy way to limit the max frequency
> if the current request falls into the turbo range.  The HWP case is
> more important nowadays, but there still are systems without it and
> ideally they should be covered as well.
>

In the non-HWP case we have a single P-state control so the question of
how to plumb an extra P-state control from the CPUFREQ governor seems
largely irrelevant.  The current interface seems fine as-is for such
systems.

>> So, yeah, I agree that HWP_REQ_MAX is nothing like a
>> hard limit, particularly when multiple threads are running on the same
>> clock domain, but the processor will still make its best effort to limit
>> the clock frequency to the maximum of the requested maximums, even if it
>> happens to be within the turbo range.  That doesn't make it useless.
>
> I haven't used the word "useless" anywhere in my previous message.
>
> Using the max frequency to control power has merit, but how much of it
> is there depends on some factors that may change from one system to
> another.
>
> The alternative power control methods may be more reliable in general.
>

That's precisely what I've been calling into question.  IIRC the
alternative power control methods we have discussed in the past are:

 - EPP: The semantics of this control are largely unspecified beyond
   higher values being more energy-efficient than lower values.  The set
   of energy efficiency optimizations controlled by it and the
   thresholds at which they become active are fully platform-specific.
   I guess you didn't mean this one as an example of a more reliable
   and less platform-specific control.

 - RAPL: The semantics of this control are indeed well-defined, it's
   able to set an absolute average power constraint to the involved
   power planes.  However, the relationship between the information
   available to the kernel about a given workload (e.g. from CPU
   performance counters) and the optimal value of the RAPL constraints
   is highly platform-specific, requiring multiple iterations of
   adjustments and performance monitoring in order to approach the
   optimal value (unlike HWP_REQ_MAX since there is a simple,
   platform-independent relationship between observed frequency
   and... frequency -- More on that below).

 - P-code mailbox interface: Available to the graphics driver when GuC
   submission is in use, which is not available currently on any
   production platform.  It won't allow the energy efficiency
   optimization I'm proposing to be taken advantage of by discrete
   graphics or IO devices other than the GPU.  Like HWP_REQ_MAX it sets
   a constraint on the CPU P-states, so most caveats of HWP_REQ_MAX would
   apply to it too.  But unlike HWP_REQ_MAX it has a global effect on the
   system, limiting its usefulness in a multitasking environment.
   Requires a governor to run in a GPU microcontroller with more limited
   information than CPUFREQ.

So I'm either missing some alternative power control method or I
strongly disagree that there is a more reliable and platform-independent
alternative to HWP_REQ_MAX.

>> The exact same thing can be said about controlling HWP_REQ_MIN as you're
>> doing now in this revision of your patch, BTW.
>
> Which has been clearly stated in the changelog I believe.
>

Right, which is why I found it surprising to hear the same point as a
counterargument against HWP_REQ_MAX.

> The point here is that this is as good as using the perf control
> register to ask for a given P-state without HWP, whereas trying to
> drive the max too is added complexity.
>
>> If you don't believe me here is the turbostat sample with maximum
>> Bzy_MHz I get on the computer I'm sitting on right now while compiling a
>> kernel on CPU0 if I set HWP_REQ_MAX to 0x1c (within the turbo range):
>>
>> | Core    CPU     Avg_MHz Busy%   Bzy_MHz            HWP_REQ      PkgWatt CorWatt
>> | -       -       757     27.03   2800    0x0000000000000000      7.13    4.90
>> | 0       0       2794    99.77   2800    0x0000000080001c04      7.13    4.90
>> | 0       2       83      2.98    2800    0x0000000080001c04
>> | 1       1       73      2.60    2800    0x0000000080001c04
>> | 1       3       78      2.79    2800    0x0000000080001c04
>>
>> With the default HWP_REQUEST:
>>
>> | Core    CPU     Avg_MHz Busy%   Bzy_MHz            HWP_REQ      PkgWatt CorWatt
>> | -       -       814     27.00   3015    0x0000000000000000      8.49    6.18
>> | 0       0       2968    98.24   3021    0x0000000080001f04      8.49    6.18
>> | 0       2       84      2.81    2982    0x0000000080001f04
>> | 1       1       99      3.34    2961    0x0000000080001f04
>> | 1       3       105     3.60    2921    0x0000000080001f04
>>
>> > Generally, I'm not quite convinced that limiting the max frequency is
>> > really the right choice for controlling the processor's power draw on
>> > the systems in question.  There are other ways to do that, which in
>> > theory should be more effective.  I mentioned RAPL somewhere in this
>> > context and there's the GUC firmware too.
>>
>> I feel like we've had that conversation before and it's somewhat
>> off-topic so I'll keep it short: Yes, in theory RAPL is more effective
>> than HWP_REQ_MAX as a mechanism to limit the absolute power consumption
>> of the processor package, but that's not the purpose of [1], its purpose
>> is setting a lower limit to the energy efficiency of the processor when
>> the maximum usable CPU frequency is known (due to the existence of an IO
>> device bottleneck)
>
> Whether or not that frequency is actually known seems quite
> questionable to me, but that's aside.
>

It's not actually known, but it can be approximated easily under a
widely-applicable assumption -- More on that below.

> More important, it is unclear to me what you mean by "a lower limit to
> the energy efficiency of the processor".
>

If we define the instantaneous energy efficiency of a CPU (eta) to be
the ratio between its instantaneous frequency (f) and power consumption
(P), I want to be able to set a lower limit to that ratio in cases where
I can determine that doing so won't impact the performance of the
application:

|  eta_min <= eta = f / P

Setting such a lower limit to the instantaneous energy efficiency of the
processor can only lower the total amount of energy consumed by the
processor in order to perform a given amount of work (If you don't
believe me on that feel free to express it as the integral of P over
time, with P recovered from the expression above), therefore it can only
improve the average energy efficiency of the workload in the long run.

Because of the convex relationship between P and f above a certain
inflection point (AKA maximum efficiency ratio, AKA min_pstate in
intel_pstate.c), eta is monotonically decreasing with respect to
frequency above that point, therefore setting a lower limit to the
energy efficiency of the processor is equivalent to setting an upper
limit to its frequency within that range.

> I guess what you mean is that the processor might decide to go for a
> more energy-efficient configuration by increasing its frequency in a
> "race to idle" fashion (in response to a perceived utilization spike)
> and you want to prevent that from occurring.
>

No, a race to idle response to a utilization spike would only be more
energy efficient than the performance-equivalent constant-frequency
response in cases where the latter constant frequency is in the
concavity region of the system's power curve (below the inflection
point).  I certainly don't want to prevent that from occurring when it's
the most energy-efficient thing to do.

> Or, generally speaking, that the CPU performance scaling logic, either
> in the kernel or in the processor itself, might select a higher
> operating frequency of a CPU in response to a perceived utilization
> spike, but that may be a mistake in the presence of another data
> processing device sharing the power budget with the processor, so you
> want to prevent that from taking place.
>

Yes, I do.

> In both cases, I wouldn't call that limiting the energy-efficiency of
> the processor.  Effectively, this means putting a limit on the
> processor's power budget, which is exactly what RAPL is for.
>

No, limiting the processor frequency also imposes a lower limit on its
energy efficiency, for the reason explained above.

>> -- And if the maximum usable CPU frequency is the
>> information we have at hand,
>
> How so?
>
> How can you tell what that frequency is?
>

In the general case it would take a crystal ball to know the amount of
work the CPU is going to have to do in the future.  However, as soon as
the system has reached a steady state (which amounts to a large fraction
of the time and energy consumption of many workloads, so it's an
interesting case to optimize for), its previous behavior can be taken as
a proxy for its future behavior (by definition of steady state).  We can
therefore measure the performance delivered by the processor in the
immediate past and make sure that the governor's response doesn't
prevent it from achieving the same performance (plus some margin in
order to account for potential fluctuations in the workload).

Yes, that's an essentially heuristic assumption, but one that underlies
every other CPU frequency governor in the Linux kernel tree to a greater
or lesser extent.

>> controlling the maximum CPU frequency
>> directly is optimal, rather than trying to find the RAPL constraint that
>> achieves the same average frequency by trial and error.  Also, in theory,
>> even if you had an oracle to tell you what the appropriate RAPL
>> constraint is, the result would necessarily be more energy-inefficient
>> than controlling the maximum CPU frequency directly, since you're giving
>> the processor additional freedom to run at frequencies above the one you
>> want to average, which is guaranteed to be more energy-inefficient than
>> running at that fixed frequency, assuming we are in the region of
>> convexity of the processor's power curve.
>
> So the reason why you want to limit the processor's max frequency in
> the first place is because it is sharing the power budget with
> something else.

No, my ultimate goal is to optimize the energy efficiency of the CPU in
cases where the system has a bottleneck elsewhere.

> If there's no sharing of the power budget or thermal constraints,
> there is no need to limit the CPU frequency other than for the sake of
> saving energy.
>

Yes!

> What you can achieve by limiting the max CPU frequency is to make the
> processor draw less power (and cause it to use either less or more
> energy, depending on the energy-efficiency curve).

Yes, in order to make sure that limiting the maximum CPU frequency
doesn't lead to increased energy usage the response of the governor is
clamped to the convexity range of the CPU power curve (which yes, I'm
aware is only an approximation to the convexity range of the
whole-system power curve).

> You don't know how much less power it will draw then, however.
>

I don't see any need to care how much less power is drawn in absolute
terms, as long as the energy efficiency of the system is improved *and*
its performance is at least the same as it was before.

> You seem to be saying "I know exactly what the maximum frequency of
> the CPU can be, so why I don't set it as the upper limit", but I'm
> questioning the source of that knowledge.  Does it not come from
> knowing the power budget you want to give to the processor?
>

No, it comes from CPU performance counters -- More on that above.

>> Anyway, if you still have some disagreement on the theoretical details
>> you're more than welcome to bring up the matter on the other thread [1],
>> or accept the invitation for a presentation I sent you months ago... ;)
>
> Why don't we continue the discussion here instead?
>
> I think we are getting to the bottom of things here.

Up to you.
Doug Smythies July 17, 2020, 1:37 p.m. UTC | #10
Hi Rafael,

Thank you for your reply.

On 2020.07.16 05:08 Rafael J. Wysocki wrote:
> On Wed, Jul 15, 2020 at 10:39 PM Doug Smythies <dsmythies@telus.net> wrote:
>> On 2020.07.14 11:16 Rafael J. Wysocki wrote:
>> >
>> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>> ...
>> > Since the passive mode hasn't worked with HWP at all, and it is not going to
>> > be the default for HWP systems anyway, I don't see any drawbacks related to making
>> > this change, so I would consider this as 5.9 material unless there are any
>> > serious objections.
>>
>> Good point.

Actually, for those users that default to passive mode upon boot,
this would mean they would find themselves using this.
Also, it isn't obvious from the typical "what driver and what governor"
inquiry.

>> Some of the tests I do involve labour intensive post processing of data.
>> I want to automate some of that work, and it will take time.
>> We might be into the 5.9-rc series before I have detailed feedback.
>>
>> However, so far:
>>
>> Inverse impulse response test [1]:
>>
>> High level test, i5-9600K, HWP-passive (this patch), ondemand:
>> 3101 tests. 0 failures. (GOOD)
>>
>> From [1], re-stated:
>> > . High level: i5-9600K: 2453 tests, 60 failures, 2.45% fail rate. (HWP-active - powersave)
>> > . Verify acpi-cpufreq/ondemand works fine: i5-9600K: 8975 tests. 0 failures.
>>
>> My version of that cool Alexander named pipe test [2] serialized workflow:
>>
>> HWP-passive (this patch), performance: PASS.
>>
>> From [2], re-stated, and also re-tested.
>> HWP-disabled passive - performance: FAIL.
> 
> But I'm not quite sure how this is related to this patch?

It isn't. The point being that it is different.
But yes, that failure is because of our other discussion [3].

> 
> This test would still fail without the patch if the kernel was started
> with intel_pstate=passive in the kernel command line, wouldn't it?

Yes.

> 
> > Although, I believe the issue to be EPB management, [3].
> 
> Well, that's kind of unexpected.
> 
> If this really is the case, then it looks like the effect of the EPB
> setting on the processor is different depending on whether or not HWP
> is enabled.
> 
>> And yes, I did see the reply to [3] that came earlier,
>> And have now re-done the test, with the referenced patch added.
>> It still is FAIL. I reply to the [3] thread, eventually.
>>
>> [1] https://marc.info/?l=linux-pm&m=159354421400342&w=2
>> [2] https://marc.info/?l=linux-pm&m=159155067328641&w=2
>> [3] https://marc.info/?l=linux-pm&m=159438804230744&w=2
>>
>> Kernel:
>>
>> b08284a541ad (HEAD -> k58rc5-doug) cpufreq: intel_pstate: Avoid enabling HWP if EPP is not supported
>> 063fd7ccabfe cpufreq: intel_pstate: Implement passive mode with HWP enabled
>> 730ccf5054e9 cpufreq: intel_pstate: Allow raw energy performance preference value
>> bee36df01c68 cpufreq: intel_pstate: Allow enable/disable energy efficiency
>> 199629d8200e cpufreq: intel_pstate: Fix active mode setting from command line
>> 11ba468877bb (tag: v5.8-rc5, origin/master, origin/HEAD, master) Linux 5.8-rc5
>>
>> Rules for this work:
>>
>> . never use x86_energy_perf_policy.
>> . For HWP disabled: never change from active to passive or vice versa, but rather do it via boot.
>> . after boot always check and reset the various power limit log bits that are set.
>> . never compile the kernel (well, until after any tests), which will set those bits again.
>> . never run prime95 high heat torture test, which will set those bits again.
>> . try to never do anything else that will set those bits again.
>>
>> To be clear, I do allow changing governors within the context of the above rules.
> 
> Thanks for the feedback!
Rafael J. Wysocki July 19, 2020, 11:42 a.m. UTC | #11
Hi Doug,

On Fri, Jul 17, 2020 at 3:37 PM Doug Smythies <dsmythies@telus.net> wrote:
>
> Hi Rafael,
>
> Thank you for your reply.
>
> On 2020.07.16 05:08 Rafael J. Wysocki wrote:
> > On Wed, Jul 15, 2020 at 10:39 PM Doug Smythies <dsmythies@telus.net> wrote:
> >> On 2020.07.14 11:16 Rafael J. Wysocki wrote:
> >> >
> >> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >> ...
> >> > Since the passive mode hasn't worked with HWP at all, and it is not going to
> >> > be the default for HWP systems anyway, I don't see any drawbacks related to making
> >> > this change, so I would consider this as 5.9 material unless there are any
> >> > serious objections.
> >>
> >> Good point.
>
> Actually, for those users that default to passive mode upon boot,
> this would mean they would find themselves using this.
> Also, it isn't obvious from the typical "what driver and what governor"
> inquiry.

So the change in behavior is that after this patch
intel_pstate=passive doesn't imply no_hwp any more.

That's a very minor difference though and I'm not aware of any adverse
effects it can cause on HWP systems anyway.

The "what governor" is straightforward in the passive mode: that's
whatever cpufreq governor has been selected.

The driver is "intel_cpufreq" which means that the processor is
requested to run at a frequency selected by the governor or higher,
unless in the turbo range.  This works similarly in both the HWP and
non-HWP cases, except that in the HWP case it is possible to adjust
the EPP (through the additional sysfs knob) and the base frequency is
exported (the latter two things can be used to distinguish between the
two cases just fine IMO).

> >> Some of the tests I do involve labour intensive post processing of data.
> >> I want to automate some of that work, and it will take time.
> >> We might be into the 5.9-rc series before I have detailed feedback.
> >>
> >> However, so far:
> >>
> >> Inverse impulse response test [1]:
> >>
> >> High level test, i5-9600K, HWP-passive (this patch), ondemand:
> >> 3101 tests. 0 failures. (GOOD)
> >>
> >> From [1], re-stated:
> >> > . High level: i5-9600K: 2453 tests, 60 failures, 2.45% fail rate. (HWP-active - powersave)
> >> > . Verify acpi-cpufreq/ondemand works fine: i5-9600K: 8975 tests. 0 failures.
> >>
> >> My version of that cool Alexander named pipe test [2] serialized workflow:
> >>
> >> HWP-passive (this patch), performance: PASS.
> >>
> >> From [2], re-stated, and also re-tested.
> >> HWP-disabled passive - performance: FAIL.
> >
> > But I'm not quite sure how this is related to this patch?
>
> It isn't. The point being that it is different.

It is different, but kind of in a positive way IMO.

> But yes, that failure is because of our other discussion [3].

OK

> >
> > This test would still fail without the patch if the kernel was started
> > with intel_pstate=passive in the kernel command line, wouldn't it?
>
> Yes.

OK

Thanks!
Rafael J. Wysocki July 19, 2020, 7:06 p.m. UTC | #12
On Fri, Jul 17, 2020 at 2:21 AM Francisco Jerez <currojerez@riseup.net> wrote:
>
> "Rafael J. Wysocki" <rafael@kernel.org> writes:
>
> > On Wed, Jul 15, 2020 at 11:35 PM Francisco Jerez <currojerez@riseup.net> wrote:
> >>
> >> "Rafael J. Wysocki" <rafael@kernel.org> writes:
> >>
> >> > On Wed, Jul 15, 2020 at 2:09 AM Francisco Jerez <currojerez@riseup.net> wrote:
> >> >>
> >> >> "Rafael J. Wysocki" <rjw@rjwysocki.net> writes:
> >> >>
> >> >> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >> >> >
> >> >> > Allow intel_pstate to work in the passive mode with HWP enabled and
> >> >> > make it set the HWP minimum performance limit (HWP floor) to the
> >> >> > P-state value given by the target frequency supplied by the cpufreq
> >> >> > governor, so as to prevent the HWP algorithm and the CPU scheduler
> >> >> > from working against each other, at least when the schedutil governor
> >> >> > is in use, and update the intel_pstate documentation accordingly.
> >> >> >
> >> >> > Among other things, this allows utilization clamps to be taken
> >> >> > into account, at least to a certain extent, when intel_pstate is
> >> >> > in use and makes it more likely that sufficient capacity for
> >> >> > deadline tasks will be provided.
> >> >> >
> >> >> > After this change, the resulting behavior of an HWP system with
> >> >> > intel_pstate in the passive mode should be close to the behavior
> >> >> > of the analogous non-HWP system with intel_pstate in the passive
> >> >> > mode, except that in the frequency range below the base frequency
> >> >> > (ie. the frequency returned by the base_frequency cpufreq attribute
> >> >> > in sysfs on HWP systems) the HWP algorithm is allowed to go above
> >> >> > the floor P-state set by intel_pstate with or without hardware
> >> >> > coordination of P-states among CPUs in the same package.
> >> >> >
> >> >> > Also note that the setting of the HWP floor may not be taken into
> >> >> > account by the processor in the following cases:
> >> >> >
> >> >> >  * For the HWP floor in the range of P-states above the base
> >> >> >    frequency, referred to as the turbo range, the processor has a
> >> >> >    license to choose any P-state from that range, either below or
> >> >> >    above the HWP floor, just like a non-HWP processor in the case
> >> >> >    when the target P-state falls into the turbo range.
> >> >> >
> >> >> >  * If P-states of the CPUs in the same package are coordinated
> >> >> >    at the hardware level, the processor may choose a P-state
> >> >> >    above the HWP floor, just like a non-HWP processor in the
> >> >> >    analogous case.
> >> >> >
> >> >> > With this change applied, intel_pstate in the passive mode
> >> >> > assumes complete control over the HWP request MSR and concurrent
> >> >> > changes of that MSR (eg. via the direct MSR access interface) are
> >> >> > overridden by it.
> >> >> >
> >> >> > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >> >> > ---
> >> >> >
> >> >> > This basically unifies the passive mode behavior of intel_pstate for systems
> >> >> > with and without HWP enabled.  The only case in which there is a difference
> >> >> > between the two (after this patch) is below the turbo range, where the HWP
> >> >> > algorithm can go above the floor regardless of whether or not P-state are
> >> >> > coordinated package-wide (this means the systems with per-core P-states
> >> >> > mostly is where the difference can be somewhat visible).
> >> >> >
> >> >> > Since the passive mode hasn't worked with HWP at all, and it is not going to
> >> >> > be the default for HWP systems anyway, I don't see any drawbacks related to making
> >> >> > this change, so I would consider this as 5.9 material unless there are any
> >> >> > serious objections.
> >> >> >
> >> >> > Thanks!
> >> >> >
> >> >> > ---
> >> >> >  Documentation/admin-guide/pm/intel_pstate.rst |   89 +++++++---------
> >> >> >  drivers/cpufreq/intel_pstate.c                |  141 ++++++++++++++++++++------
> >> >> >  2 files changed, 152 insertions(+), 78 deletions(-)
> >> >> >
> >> >> > Index: linux-pm/drivers/cpufreq/intel_pstate.c
> >> >> > ===================================================================
> >> >> > --- linux-pm.orig/drivers/cpufreq/intel_pstate.c
> >> >> > +++ linux-pm/drivers/cpufreq/intel_pstate.c
> >> >> > @@ -36,6 +36,7 @@
> >> >> >  #define INTEL_PSTATE_SAMPLING_INTERVAL       (10 * NSEC_PER_MSEC)
> >> >> >
> >> >> >  #define INTEL_CPUFREQ_TRANSITION_LATENCY     20000
> >> >> > +#define INTEL_CPUFREQ_TRANSITION_DELAY_HWP   5000
> >> >> >  #define INTEL_CPUFREQ_TRANSITION_DELAY               500
> >> >> >
> >> >> >  #ifdef CONFIG_ACPI
> >> >> > @@ -222,6 +223,7 @@ struct global_params {
> >> >> >   *                   preference/bias
> >> >> >   * @epp_saved:               Saved EPP/EPB during system suspend or CPU offline
> >> >> >   *                   operation
> >> >> > + * @epp_cached:      Cached HWP energy-performance preference value
> >> >> >   * @hwp_req_cached:  Cached value of the last HWP Request MSR
> >> >> >   * @hwp_cap_cached:  Cached value of the last HWP Capabilities MSR
> >> >> >   * @last_io_update:  Last time when IO wake flag was set
> >> >> > @@ -259,6 +261,7 @@ struct cpudata {
> >> >> >       s16 epp_policy;
> >> >> >       s16 epp_default;
> >> >> >       s16 epp_saved;
> >> >> > +     s16 epp_cached;
> >> >> >       u64 hwp_req_cached;
> >> >> >       u64 hwp_cap_cached;
> >> >> >       u64 last_io_update;
> >> >> > @@ -676,6 +679,8 @@ static int intel_pstate_set_energy_pref_
> >> >> >
> >> >> >               value |= (u64)epp << 24;
> >> >> >               ret = wrmsrl_on_cpu(cpu_data->cpu, MSR_HWP_REQUEST, value);
> >> >> > +
> >> >> > +             WRITE_ONCE(cpu_data->epp_cached, epp);
> >> >>
> >> >> Why introduce a new EPP cache variable if there is already
> >> >> hwp_req_cached?  If intel_pstate_set_energy_pref_index() is failing to
> >> >> update hwp_req_cached maybe we should fix that instead.  That will save
> >> >> you a little bit of work in intel_cpufreq_adjust_hwp().
> >> >
> >> > Yes, it would, but then we'd need to explicitly synchronize
> >> > intel_pstate_set_energy_pref_index() with the scheduler context which
> >> > I'd rather avoid.
> >> >
> >>
> >> How does using a differently named variable save you from doing that?
> >
> > It is a separate variable.
> >
> > The only updater of epp_cached, except for the initialization, is
> > intel_pstate_set_energy_pref_index() and it cannot race with another
> > instance of itself, so there are no concurrent writes to epp_cached.
> >
> > In the passive mode the only updater of hwp_req_cached, except for the
> > initialization, is intel_cpufreq_adjust_hwp() (or there is a bug in
> > the patch that I have missed) and it cannot race with another instance
> > of itself for the same CPU, so there are no concurrent writes to
> > hwp_req_cached.
> >
> >  if intel_pstate_set_energy_pref_index() updated hwp_req_cached
> > directly, however, it might be updated in two places concurrently and
> > so explicit synchronization would be necessary.
> >
>
> That's fair, but we may need to add such synchronization anyway due to
> the bug I pointed out above,

I guess you've not regarded my explanation of this as sufficient.

> so it might be simpler to avoid introducing
> additional state and simply stick to hwp_req_cached with proper
> synchronization.

Not really.  In the v2 of the patch the race (which was not a bug
anyway) is explicitly avoided.

> >> And won't the EPP setting programmed by intel_pstate_set_energy_pref_index()
> >> be lost if intel_pstate_hwp_boost_up() or some other user of
> >> hwp_req_cached is executed afterwards with the current approach?
> >
> > The value written to the register by it may be overwritten by a
> > concurrent intel_cpufreq_adjust_hwp(), but that is not a problem,
> > because next time intel_cpufreq_adjust_hwp() runs for the target CPU,
> > it will pick up the updated epp_cached value which will be written to
> > the register.
>
> However intel_cpufreq_adjust_hwp() may never be executed afterwards if
> intel_pstate is in active mode,

intel_cpufreq_adjust_hwp() is not executed in the active mode at all.

> in which case the overwritten value may
> remain there forever potentially.

However, in the active mode the only updater of hwp_req_cached is
intel_pstate_hwp_set() and this patch doesn't introduce any
differences in behavior in that case.

> > So there may be a short time window after the
> > intel_pstate_set_energy_pref_index() invocation in which the new EPP
> > value may not be in effect, but in general there is no guarantee that
> > the new EPP will take effect immediately after updating the MSR
> > anyway, so that race doesn't matter.
> >
> > That said, that race is avoidable, but I was thinking that trying to
> > avoid it might not be worth it.  Now I see a better way to avoid it,
> > though, so I'm going to update the patch to that end.
> >
> >> Seems like a bug to me.
> >
> > It is racy, but not every race is a bug.
> >
>
> Still seems like there is a bug in intel_pstate_set_energy_pref_index()
> AFAICT.

If there is a bug, then what exactly is it, from the users' perspective?

> >> >> >       } else {
> >> >> >               if (epp == -EINVAL)
> >> >> >                       epp = (pref_index - 1) << 2;
> >> >> > @@ -2047,6 +2052,7 @@ static int intel_pstate_init_cpu(unsigne
> >> >> >               cpu->epp_default = -EINVAL;
> >> >> >               cpu->epp_powersave = -EINVAL;
> >> >> >               cpu->epp_saved = -EINVAL;
> >> >> > +             WRITE_ONCE(cpu->epp_cached, -EINVAL);
> >> >> >       }
> >> >> >
> >> >> >       cpu = all_cpu_data[cpunum];
> >> >> > @@ -2245,7 +2251,10 @@ static int intel_pstate_verify_policy(st
> >> >> >
> >> >> >  static void intel_cpufreq_stop_cpu(struct cpufreq_policy *policy)
> >> >> >  {
> >> >> > -     intel_pstate_set_min_pstate(all_cpu_data[policy->cpu]);
> >> >> > +     if (hwp_active)
> >> >> > +             intel_pstate_hwp_force_min_perf(policy->cpu);
> >> >> > +     else
> >> >> > +             intel_pstate_set_min_pstate(all_cpu_data[policy->cpu]);
> >> >> >  }
> >> >> >
> >> >> >  static void intel_pstate_stop_cpu(struct cpufreq_policy *policy)
> >> >> > @@ -2253,12 +2262,10 @@ static void intel_pstate_stop_cpu(struct
> >> >> >       pr_debug("CPU %d exiting\n", policy->cpu);
> >> >> >
> >> >> >       intel_pstate_clear_update_util_hook(policy->cpu);
> >> >> > -     if (hwp_active) {
> >> >> > +     if (hwp_active)
> >> >> >               intel_pstate_hwp_save_state(policy);
> >> >> > -             intel_pstate_hwp_force_min_perf(policy->cpu);
> >> >> > -     } else {
> >> >> > -             intel_cpufreq_stop_cpu(policy);
> >> >> > -     }
> >> >> > +
> >> >> > +     intel_cpufreq_stop_cpu(policy);
> >> >> >  }
> >> >> >
> >> >> >  static int intel_pstate_cpu_exit(struct cpufreq_policy *policy)
> >> >> > @@ -2388,13 +2395,82 @@ static void intel_cpufreq_trace(struct c
> >> >> >               fp_toint(cpu->iowait_boost * 100));
> >> >> >  }
> >> >> >
> >> >> > +static void intel_cpufreq_adjust_hwp(struct cpudata *cpu, u32 target_pstate,
> >> >> > +                                  bool fast_switch)
> >> >> > +{
> >> >> > +     u64 prev = READ_ONCE(cpu->hwp_req_cached), value = prev;
> >> >> > +     s16 epp;
> >> >> > +
> >> >> > +     value &= ~HWP_MIN_PERF(~0L);
> >> >> > +     value |= HWP_MIN_PERF(target_pstate);
> >> >> > +
> >> >> > +     /*
> >> >> > +      * The entire MSR needs to be updated in order to update the HWP min
> >> >> > +      * field in it, so opportunistically update the max too if needed.
> >> >> > +      */
> >> >> > +     value &= ~HWP_MAX_PERF(~0L);
> >> >> > +     value |= HWP_MAX_PERF(cpu->max_perf_ratio);
> >> >> > +
> >> >> > +     /*
> >> >> > +      * In case the EPP has been adjusted via sysfs, write the last cached
> >> >> > +      * value of it to the MSR as well.
> >> >> > +      */
> >> >> > +     epp = READ_ONCE(cpu->epp_cached);
> >> >> > +     if (epp >= 0) {
> >> >> > +             value &= ~GENMASK_ULL(31, 24);
> >> >> > +             value |= (u64)epp << 24;
> >> >> > +     }
> >> >> > +
> >> >> > +     if (value == prev)
> >> >> > +             return;
> >> >> > +
> >> >> > +     WRITE_ONCE(cpu->hwp_req_cached, value);
> >> >> > +     if (fast_switch)
> >> >> > +             wrmsrl(MSR_HWP_REQUEST, value);
> >> >> > +     else
> >> >> > +             wrmsrl_on_cpu(cpu->cpu, MSR_HWP_REQUEST, value);
> >> >> > +}
> >> >>
> >> >> I've asked this question already but you may have missed it: Given that
> >> >> you are of the opinion that [1] should be implemented in schedutil
> >> >> instead with intel_pstate in HWP passive mode, what's your plan for
> >> >> exposing the HWP_MAX_PERF knob to the governor in addition to
> >> >> HWP_MIN_PERF, since the interface implemented here only allows the
> >> >> governor to provide a single frequency?
> >> >>
> >> >> [1] https://lwn.net/ml/linux-pm/20200428032258.2518-1-currojerez@riseup.net/
> >> >
> >> > This is not just about the schedutil governor, but about cpufreq
> >> > governors in general (someone may still want to use the performance
> >> > governor on top of intel_pstate, for example).
> >> >
> >> > And while governors can only provide one frequency, the policy limits
> >> > in the cpufreq framework are based on QoS lists now and so it is
> >> > possible to add a max limit request, say from a driver, to the max QoS
> >> > list, and update it as needed, causing the max policy limit to be
> >> > adjusted.
> >> >
> >> > That said I'm not exactly sure how useful the max limit generally is
> >> > in practice on HWP systems, given that setting it above the base
> >> > frequency causes it to be ignored, effectively, and the turbo range
> >> > may be wider than the range of P-states below the base frequency.
> >> >
> >>
> >> I don't think that's accurate.  I've looked at hundreds of traces while
> >> my series [1] was in control of HWP_REQ_MAX and I've never seen an
> >> excursion above the maximum HWP_REQ_MAX control specified by it within a
> >> given P-state domain, even while that maximum specified was well into
> >> the turbo range.
> >
> > I'm not going to argue with your experience. :-)
> >
> > What I'm saying is that there is no guarantee that the processor will
> > always select P-states below HWP_REQ_MAX in the turbo range.  That may
> > not happen in practice, but it is not precluded AFAICS.
> >
> > Also while HWP_REQ_MAX can work in practice most of the time with HWP
> > enabled, without HWP there is no easy way to limit the max frequency
> > if the current request falls into the turbo range.  The HWP case is
> > more important nowadays, but there still are systems without it and
> > ideally they should be covered as well.
> >
>
> In the non-HWP case we have a single P-state control so the question of
> how to plumb an extra P-state control from the CPUFREQ governor seems
> largely irrelevant.  The current interface seems fine as-is for such
> systems.

Which doesn't mean that the non-HWP systems aren't affected by the
problem you want addressed, does it?

And so if they are affected by it, then ideally the way it is
addressed should also be applicable to them.

> >> So, yeah, I agree that HWP_REQ_MAX is nothing like a
> >> hard limit, particularly when multiple threads are running on the same
> >> clock domain, but the processor will still make its best effort to limit
> >> the clock frequency to the maximum of the requested maximums, even if it
> >> happens to be within the turbo range.  That doesn't make it useless.
> >
> > I haven't used the word "useless" anywhere in my previous message.
> >
> > Using the max frequency to control power has merit, but how much of it
> > is there depends on some factors that may change from one system to
> > another.
> >
> > The alternative power control methods may be more reliable in general.
> >
>
> That's precisely what I've been calling into question.  IIRC the
> alternative power control methods we have discussed in the past are:
>
>  - EPP: The semantics of this control are largely unspecified beyond
>    higher values being more energy-efficient than lower values.  The set
>    of energy efficiency optimizations controlled by it and the
>    thresholds at which they become active are fully platform-specific.
>    I guess you didn't mean this one as example of a more reliable and
>    less platform-specific control.
>
>  - RAPL: The semantics of this control are indeed well-defined, it's
>    able to set an absolute average power constraint to the involved
>    power planes.  However, the relationship between the information
>    available to the kernel about a given workload (e.g. from CPU
>    performance counters) and the optimal value of the RAPL constraints
>    is highly platform-specific, requiring multiple iterations of
>    adjustments and performance monitoring in order to approach the
>    optimal value (unlike HWP_REQ_MAX since there is a simple,
>    platform-independent relationship between observed frequency
>    and... frequency -- More on that below).

But the relationship between the power budget and HWP_REQ_MAX is more
convoluted.

>  - P-code mailbox interface: Available to the graphics driver when GuC
>    submission is in use, which is not available currently on any
>    production platform.  It won't allow the energy efficiency
>    optimization I'm proposing to be taken advantage of by discrete
>    graphics nor IO devices other than the GPU.  Like HWP_REQ_MAX it sets
>    a constraint on the CPU P-states so most caveats of HWP_REQ_MAX would
>    apply to it too.  But unlike HWP_REQ_MAX it has global effect on the
>    system limiting its usefulness in a multitasking environment.
>    Requires a governor to run in a GPU microcontroller with more limited
>    information than CPUFREQ.

The way it works is beyond the scope of discussion here.  The point is
that it can be used to achieve the goal at least in some cases, which
wouldn't require complexity to be added to the CPU performance scaling
subsystem of the kernel.

To really decide which is better, the two alternatives would need to
be compared quantitatively, but it doesn't look like they have been.

> So I'm either missing some alternative power control method or I
> strongly disagree that there is a more reliable and platform-independent
> alternative to HWP_REQ_MAX.
>
> >> The exact same thing can be said about controlling HWP_REQ_MIN as you're
> >> doing now in this revision of your patch, BTW.
> >
> > Which has been clearly stated in the changelog I believe.
> >
>
> Right, which is why I found it surprising to hear the same point as a
> counterargument against HWP_REQ_MAX.

The argument is against adding an extra constraint, beyond what's
there already, that isn't even guaranteed to be effective.

We kind of know that what is there already is not ideal, so adding
more stuff that is not ideal on top of it is questionable, at least
in principle.

> > The point here is that this is as good as using the perf control
> > register to ask for a given P-state without HWP, whereas trying to
> > drive the max too is added complexity.
> >
> >> If you don't believe me here is the turbostat sample with maximum
> >> Bzy_MHz I get on the computer I'm sitting on right now while compiling a
> >> kernel on CPU0 if I set HWP_REQ_MAX to 0x1c (within the turbo range):
> >>
> >> | Core    CPU     Avg_MHz Busy%   Bzy_MHz            HWP_REQ      PkgWatt CorWatt
> >> | -       -       757     27.03   2800    0x0000000000000000      7.13    4.90
> >> | 0       0       2794    99.77   2800    0x0000000080001c04      7.13    4.90
> >> | 0       2       83      2.98    2800    0x0000000080001c04
> >> | 1       1       73      2.60    2800    0x0000000080001c04
> >> | 1       3       78      2.79    2800    0x0000000080001c04
> >>
> >> With the default HWP_REQUEST:
> >>
> >> | Core    CPU     Avg_MHz Busy%   Bzy_MHz            HWP_REQ      PkgWatt CorWatt
> >> | -       -       814     27.00   3015    0x0000000000000000      8.49    6.18
> >> | 0       0       2968    98.24   3021    0x0000000080001f04      8.49    6.18
> >> | 0       2       84      2.81    2982    0x0000000080001f04
> >> | 1       1       99      3.34    2961    0x0000000080001f04
> >> | 1       3       105     3.60    2921    0x0000000080001f04
> >>
> >> > Generally, I'm not quite convinced that limiting the max frequency is
> >> > really the right choice for controlling the processor's power draw on
> >> > the systems in question.  There are other ways to do that, which in
> >> > theory should be more effective.  I mentioned RAPL somewhere in this
> >> > context and there's the GUC firmware too.
> >>
> >> I feel like we've had that conversation before and it's somewhat
> >> off-topic so I'll keep it short: Yes, in theory RAPL is more effective
> >> than HWP_REQ_MAX as a mechanism to limit the absolute power consumption
> >> of the processor package, but that's not the purpose of [1], its purpose
> >> is setting a lower limit to the energy efficiency of the processor when
> >> the maximum usable CPU frequency is known (due to the existence of an IO
> >> device bottleneck)
> >
> > Whether or not that frequency is actually known seems quite
> > questionable to me, but that's aside.
> >
>
> It's not actually known, but it can be approximated easily under a
> widely-applicable assumption -- More on that below.
>
> > More important, it is unclear to me what you mean by "a lower limit to
> > the energy efficiency of the processor".
> >
>
> If we define the instantaneous energy efficiency of a CPU (eta) to be
> the ratio between its instantaneous frequency (f) and power consumption
> (P),

I'm sorry, but this definition is conceptually misguided.

Energy-efficiency (denote it as \phi) can be defined as work/energy which means

\phi = dW / dE

for the instantaneous one and in general that is not the same as the
simple fraction below.

> I want to be able to set a lower limit to that ratio in cases where
> I can determine that doing so won't impact the performance of the
> application:
>
> |  eta_min <= eta = f / P
>
> Setting such a lower limit to the instantaneous energy efficiency of the
> processor can only lower the total amount of energy consumed by the
> processor in order to perform a given amount of work (If you don't
> believe me on that feel free to express it as the integral of P over
> time, with P recovered from the expression above), therefore it can only
> improve the average energy efficiency of the workload in the long run.
>
> Because of the convex relationship between P and f above a certain
> inflection point (AKA maximum efficiency ratio, AKA min_pstate in
> intel_pstate.c), eta is monotonically decreasing with respect to
> frequency above that point, therefore setting a lower limit to the
> energy efficiency of the processor is equivalent to setting an upper
> limit to its frequency within that range.

So under the usual assumptions, in order to increase the frequency it
is necessary to increase power at least linearly which roughly means
what you said above.
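(Spelling the distinction out, under the extra assumption -- not stated in the definition above -- that the rate of useful work is proportional to frequency, dW/dt = k * f, the two notions coincide up to a constant:

  \phi = dW / dE = (dW/dt) / (dE/dt) = k * f / P

so the f / P ratio above is proportional to \phi only insofar as that linearity assumption holds for the workload.)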

And if increasing the frequency above a certain limit is not going to
increase the amount of work done, which very well may be the case,
then it makes sense to set a limit on the frequency.

The entire CPU performance scaling is based on that concept and the
trick is to determine the "useful max" frequency in question reliably
enough.
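As a toy illustration of the convexity argument (all constants below are made up for illustration, not measurements of any real processor):

```python
# Model package power as a convex function of frequency,
# P(f) = P_static + c * f**3, and check that the f / P ratio
# decreases monotonically above its maximum-efficiency point.

P_STATIC = 2.0   # hypothetical frequency-independent power, watts
C = 0.5          # hypothetical cubic dynamic-power coefficient

def power(f):
    """Package power draw (W) at frequency f (GHz) in the toy model."""
    return P_STATIC + C * f ** 3

def eta(f):
    """The instantaneous f / P ratio under discussion."""
    return f / power(f)

# d(eta)/df = 0 where P_STATIC = 2 * C * f**3, which is the model's
# maximum-efficiency frequency; above it, eta only decreases, so an
# upper limit on f is equivalent to a lower limit on eta there.
f_peak = (P_STATIC / (2 * C)) ** (1 / 3)

samples = [eta(f_peak + 0.1 * i) for i in range(1, 30)]
assert all(a > b for a, b in zip(samples, samples[1:]))
```

The shape of the real power curve is platform-specific, of course; only the convexity above the inflection point is being relied on.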

> > I guess what you mean is that the processor might decide to go for a
> > more energy-efficient configuration by increasing its frequency in a
> > "race to idle" fashion (in response to a perceived utilization spike)
> > and you want to prevent that from occurring.
> >
>
> No, a race to idle response to a utilization spike would only be more
> energy efficient than the performance-equivalent constant-frequency
> response in cases where the latter constant frequency is in the
> concavity region of the system's power curve (below the inflection
> point).  I certainly don't want to prevent that from occurring when it's
> the most energy-efficient thing to do.
>
> > Or, generally speaking, that the CPU performance scaling logic, either
> > in the kernel or in the processor itself, might select a higher
> > operating frequency of a CPU in response to a perceived utilization
> > spike, but that may be a mistake in the presence of another data
> > processing device sharing the power budget with the processor, so you
> > want to prevent that from taking place.
> >
>
> Yes, I do.
>
> > In both cases, I wouldn't call that limiting the energy-efficiency of
> > the processor.  Effectively, this means putting a limit on the
> > processor's power budget, which is exactly what RAPL is for.
> >
>
> No, limiting the processor frequency also imposes a limit to its energy
> efficiency due to the reason explained above.

I suppose that you mean the instantaneous quantity, which I don't
think can really be referred to as "energy-efficiency".

Regardless, the ultimate goal appears to be to allow the non-CPU
component you care about to draw more power.

> >> -- And if the maximum usable CPU frequency is the
> >> information we have at hand,
> >
> > How so?
> >
> > How can you tell what that frequency is?
> >
>
> In the general case it would take a crystal ball to know the amount of
> work the CPU is going to have to do in the future, however as soon as
> the system has reached a steady state (which amounts to a large fraction
> of the time and energy consumption of many workloads, therefore it's an
> interesting case to optimize for) its previous behavior can be taken as
> proxy for its future behavior (by definition of steady state), therefore
> we can measure the performance delivered by the processor in the
> immediate past and make sure that the governor's response doesn't
> prevent it from achieving the same performance (plus some margin in
> order to account for potential fluctuations in the workload).
>
> That's, yes, an essentially heuristic assumption, but one that underlies
> every other CPU frequency governor in the Linux kernel tree to a greater
> or lower extent.
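(For concreteness, the kind of steady-state estimate being described might look like the following sketch; the constants and the APERF/TSC-based formula are illustrative assumptions, not code from the series under discussion, and TSC is assumed to tick at the base frequency, as is usually the case.)

```python
# Take the average delivered frequency over the last sampling interval,
# derived from APERF/TSC counter deltas, as a proxy for future demand,
# and inflate it by a margin to absorb workload fluctuations.

BASE_FREQ_KHZ = 2_400_000   # hypothetical base frequency
MAX_FREQ_KHZ = 3_200_000    # hypothetical turbo limit

def useful_max_freq_khz(aperf_delta, tsc_delta, margin=1.25):
    """Equivalent constant frequency that would have delivered the same
    number of active cycles as the last interval, plus some headroom."""
    if tsc_delta == 0:
        return MAX_FREQ_KHZ           # no data yet: don't clamp
    delivered = BASE_FREQ_KHZ * aperf_delta / tsc_delta
    return min(MAX_FREQ_KHZ, int(delivered * margin))

# A CPU that spent the interval 50% busy at base frequency only needs
# about 0.5 * base * margin to keep up:
cap = useful_max_freq_khz(aperf_delta=500_000, tsc_delta=1_000_000)
```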

Yes, it is, and so I don't quite see the connection between it and my question.

Apparently, the unmodified performance scaling governors are not
sufficient, so there must be something beyond the above which allows
you to determine the frequency in question and so I'm asking what that
is.

> >> controlling the maximum CPU frequency
> >> directly is optimal, rather than trying to find the RAPL constraint that
> >> achieves the same average frequency by trial and error.  Also, in theory,
> >> even if you had an oracle to tell you what the appropriate RAPL
> >> constraint is, the result would necessarily be more energy-inefficient
> >> than controlling the maximum CPU frequency directly, since you're giving
> >> the processor additional freedom to run at frequencies above the one you
> >> want to average, which is guaranteed to be more energy-inefficient than
> >> running at that fixed frequency, assuming we are in the region of
> >> convexity of the processor's power curve.
> >
> > So the reason why you want to limit the processor's max frequency in
> > the first place is because it is sharing the power budget with
> > something else.
>
> No, my ultimate goal is to optimize the energy efficiency of the CPU in
> cases where the system has a bottleneck elsewhere.

I don't know what you mean here, sorry.

> > If there's no sharing of the power budget or thermal constraints,
> > there is no need to limit the CPU frequency other than for the sake of
> > saving energy.
> >
>
> Yes!
>
> > What you can achieve by limiting the max CPU frequency is to make the
> > processor draw less power (and cause it to use either less or more
> > energy, depending on the energy-efficiency curve).
>
> Yes, in order to make sure that limiting the maximum CPU frequency
> doesn't lead to increased energy usage the response of the governor is
> clamped to the convexity range of the CPU power curve (which yes, I'm
> aware is only an approximation to the convexity range of the
> whole-system power curve).
>
> > You don't know how much less power it will draw then, however.
> >
>
> I don't see any need to care how much less power is drawn in absolute
> terms, as long as the energy efficiency of the system is improved *and*
> its performance is at least the same as it was before.

There is obviously a connection between power and energy and you were
talking about steady states above.

In a steady state energy is a product of power and time, approximately.

Besides, I was talking about the average, which is what matters for power budgets.

> > You seem to be saying "I know exactly what the maximum frequency of
> > the CPU can be, so why I don't set it as the upper limit", but I'm
> > questioning the source of that knowledge.  Does it not come from
> > knowing the power budget you want to give to the processor?
> >
>
> No, it comes from CPU performance counters -- More on that above.

I guess you mean the paragraph regarding reaching a steady state etc.,
but there's nothing about the CPU performance counters in there, so it
is kind of hard for me to understand this remark.

Overall, so far, I'm seeing a claim that the CPU subsystem can be made
to use less energy and do as much work as before (which is what improving
the energy-efficiency means in general) if the maximum frequency of
CPUs is limited in a clever way.

I'm failing to see what that clever way is, though.
Francisco Jerez July 20, 2020, 11:20 p.m. UTC | #13
"Rafael J. Wysocki" <rafael@kernel.org> writes:

> On Fri, Jul 17, 2020 at 2:21 AM Francisco Jerez <currojerez@riseup.net> wrote:
>>
>> "Rafael J. Wysocki" <rafael@kernel.org> writes:
>>
>> > On Wed, Jul 15, 2020 at 11:35 PM Francisco Jerez <currojerez@riseup.net> wrote:
>> >>
>> >> "Rafael J. Wysocki" <rafael@kernel.org> writes:
>> >>
>> >> > On Wed, Jul 15, 2020 at 2:09 AM Francisco Jerez <currojerez@riseup.net> wrote:
>> >> >>
>> >> >> "Rafael J. Wysocki" <rjw@rjwysocki.net> writes:
>> >> >>
>> >> >> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>> >> >> >
>> >> >> > Allow intel_pstate to work in the passive mode with HWP enabled and
>> >> >> > make it set the HWP minimum performance limit (HWP floor) to the
>> >> >> > P-state value given by the target frequency supplied by the cpufreq
>> >> >> > governor, so as to prevent the HWP algorithm and the CPU scheduler
>> >> >> > from working against each other, at least when the schedutil governor
>> >> >> > is in use, and update the intel_pstate documentation accordingly.
>> >> >> >
>> >> >> > Among other things, this allows utilization clamps to be taken
>> >> >> > into account, at least to a certain extent, when intel_pstate is
>> >> >> > in use and makes it more likely that sufficient capacity for
>> >> >> > deadline tasks will be provided.
>> >> >> >
>> >> >> > After this change, the resulting behavior of an HWP system with
>> >> >> > intel_pstate in the passive mode should be close to the behavior
>> >> >> > of the analogous non-HWP system with intel_pstate in the passive
>> >> >> > mode, except that in the frequency range below the base frequency
>> >> >> > (i.e. the frequency returned by the base_frequency cpufreq attribute
>> >> >> > in sysfs on HWP systems) the HWP algorithm is allowed to go above
>> >> >> > the floor P-state set by intel_pstate with or without hardware
>> >> >> > coordination of P-states among CPUs in the same package.
>> >> >> >
>> >> >> > Also note that the setting of the HWP floor may not be taken into
>> >> >> > account by the processor in the following cases:
>> >> >> >
>> >> >> >  * For the HWP floor in the range of P-states above the base
>> >> >> >    frequency, referred to as the turbo range, the processor has a
>> >> >> >    license to choose any P-state from that range, either below or
>> >> >> >    above the HWP floor, just like a non-HWP processor in the case
>> >> >> >    when the target P-state falls into the turbo range.
>> >> >> >
>> >> >> >  * If P-states of the CPUs in the same package are coordinated
>> >> >> >    at the hardware level, the processor may choose a P-state
>> >> >> >    above the HWP floor, just like a non-HWP processor in the
>> >> >> >    analogous case.
>> >> >> >
>> >> >> > With this change applied, intel_pstate in the passive mode
>> >> >> > assumes complete control over the HWP request MSR and concurrent
>> >> >> > changes of that MSR (eg. via the direct MSR access interface) are
>> >> >> > overridden by it.
>> >> >> >
>> >> >> > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>> >> >> > ---
>> >> >> >
>> >> >> > This basically unifies the passive mode behavior of intel_pstate for systems
>> >> >> > with and without HWP enabled.  The only case in which there is a difference
>> >> >> > between the two (after this patch) is below the turbo range, where the HWP
>> >> >> > algorithm can go above the floor regardless of whether or not P-state are
>> >> >> > coordinated package-wide (this means the systems with per-core P-states
>> >> >> > mostly is where the difference can be somewhat visible).
>> >> >> >
>> >> >> > Since the passive mode hasn't worked with HWP at all, and it is not going to
>> >> >> > the default for HWP systems anyway, I don't see any drawbacks related to making
>> >> >> > this change, so I would consider this as 5.9 material unless there are any
>> >> >> > serious objections.
>> >> >> >
>> >> >> > Thanks!
>> >> >> >
>> >> >> > ---
>> >> >> >  Documentation/admin-guide/pm/intel_pstate.rst |   89 +++++++---------
>> >> >> >  drivers/cpufreq/intel_pstate.c                |  141 ++++++++++++++++++++------
>> >> >> >  2 files changed, 152 insertions(+), 78 deletions(-)
>> >> >> >
>> >> >> > Index: linux-pm/drivers/cpufreq/intel_pstate.c
>> >> >> > ===================================================================
>> >> >> > --- linux-pm.orig/drivers/cpufreq/intel_pstate.c
>> >> >> > +++ linux-pm/drivers/cpufreq/intel_pstate.c
>> >> >> > @@ -36,6 +36,7 @@
>> >> >> >  #define INTEL_PSTATE_SAMPLING_INTERVAL       (10 * NSEC_PER_MSEC)
>> >> >> >
>> >> >> >  #define INTEL_CPUFREQ_TRANSITION_LATENCY     20000
>> >> >> > +#define INTEL_CPUFREQ_TRANSITION_DELAY_HWP   5000
>> >> >> >  #define INTEL_CPUFREQ_TRANSITION_DELAY               500
>> >> >> >
>> >> >> >  #ifdef CONFIG_ACPI
>> >> >> > @@ -222,6 +223,7 @@ struct global_params {
>> >> >> >   *                   preference/bias
>> >> >> >   * @epp_saved:               Saved EPP/EPB during system suspend or CPU offline
>> >> >> >   *                   operation
>> >> >> > + * @epp_cached:              Cached HWP energy-performance preference value
>> >> >> >   * @hwp_req_cached:  Cached value of the last HWP Request MSR
>> >> >> >   * @hwp_cap_cached:  Cached value of the last HWP Capabilities MSR
>> >> >> >   * @last_io_update:  Last time when IO wake flag was set
>> >> >> > @@ -259,6 +261,7 @@ struct cpudata {
>> >> >> >       s16 epp_policy;
>> >> >> >       s16 epp_default;
>> >> >> >       s16 epp_saved;
>> >> >> > +     s16 epp_cached;
>> >> >> >       u64 hwp_req_cached;
>> >> >> >       u64 hwp_cap_cached;
>> >> >> >       u64 last_io_update;
>> >> >> > @@ -676,6 +679,8 @@ static int intel_pstate_set_energy_pref_
>> >> >> >
>> >> >> >               value |= (u64)epp << 24;
>> >> >> >               ret = wrmsrl_on_cpu(cpu_data->cpu, MSR_HWP_REQUEST, value);
>> >> >> > +
>> >> >> > +             WRITE_ONCE(cpu_data->epp_cached, epp);
>> >> >>
>> >> >> Why introduce a new EPP cache variable if there is already
>> >> >> hwp_req_cached?  If intel_pstate_set_energy_pref_index() is failing to
>> >> >> update hwp_req_cached maybe we should fix that instead.  That will save
>> >> >> you a little bit of work in intel_cpufreq_adjust_hwp().
>> >> >
>> >> > Yes, it would, but then we'd need to explicitly synchronize
>> >> > intel_pstate_set_energy_pref_index() with the scheduler context which
>> >> > I'd rather avoid.
>> >> >
>> >>
>> >> How does using a differently named variable save you from doing that?
>> >
>> > It is a separate variable.
>> >
>> > The only updater of epp_cached, except for the initialization, is
>> > intel_pstate_set_energy_pref_index() and it cannot race with another
>> > instance of itself, so there are no concurrent writes to epp_cached.
>> >
>> > In the passive mode the only updater of hwp_req_cached, except for the
>> > initialization, is intel_cpufreq_adjust_hwp() (or there is a bug in
>> > the patch that I have missed) and it cannot race with another instance
>> > of itself for the same CPU, so there are no concurrent writes to
>> > hwp_req_cached.
>> >
>> > If intel_pstate_set_energy_pref_index() updated hwp_req_cached
>> > directly, however, it might be updated in two places concurrently and
>> > so explicit synchronization would be necessary.
>> >
>>
>> That's fair, but we may need to add such synchronization anyway due to
>> the bug I pointed out above,
>
> I guess you've not regarded my explanation of this as sufficient.
>

No, and I just checked that the bug can be reproduced successfully on my
system with a couple of bash commands -- See below.

>> so it might be simpler to avoid introducing
>> additional state and simply stick to hwp_req_cached with proper
>> synchronization.
>
> Not really.  In the v2 of the patch the race (which was not a bug
> anyway) is explicitly avoided.
>
>> >> And won't the EPP setting programmed by intel_pstate_set_energy_pref_index()
>> >> be lost if intel_pstate_hwp_boost_up() or some other user of
>> >> hwp_req_cached is executed afterwards with the current approach?
>> >
>> > The value written to the register by it may be overwritten by a
>> > concurrent intel_cpufreq_adjust_hwp(), but that is not a problem,
>> > because next time intel_cpufreq_adjust_hwp() runs for the target CPU,
>> > it will pick up the updated epp_cached value which will be written to
>> > the register.
>>
>> However intel_cpufreq_adjust_hwp() may never be executed afterwards if
>> intel_pstate is in active mode,
>
> intel_cpufreq_adjust_hwp() is not executed in the active mode at all.
>

On that we agree, in which case...

>> in which case the overwritten value may remain there forever
>> potentially.
>
> However, in the active mode the only updater of hwp_req_cached is
> intel_pstate_hwp_set() and this patch doesn't introduce any
> differences in behavior in that case.
>

intel_pstate_hwp_set() is the only updater, but there are other
consumers that can get out of sync with the HWP request value written by
intel_pstate_set_energy_pref_index().  intel_pstate_hwp_boost_up() seems
like the most concerning example I named earlier.

>> > So there may be a short time window after the
>> > intel_pstate_set_energy_pref_index() invocation in which the new EPP
>> > value may not be in effect, but in general there is no guarantee that
>> > the new EPP will take effect immediately after updating the MSR
>> > anyway, so that race doesn't matter.
>> >
>> > That said, that race is avoidable, but I was thinking that trying to
>> > avoid it might not be worth it.  Now I see a better way to avoid it,
>> > though, so I'm going to update the patch to that end.
>> >
>> >> Seems like a bug to me.
>> >
>> > It is racy, but not every race is a bug.
>> >
>>
>> Still seems like there is a bug in intel_pstate_set_energy_pref_index()
>> AFAICT.
>
> If there is a bug, then what exactly is it, from the users' perspective?
>

It can be reproduced easily as follows:

| echo 1 > /sys/devices/system/cpu/intel_pstate/hwp_dynamic_boost
| for p in /sys/devices/system/cpu/cpufreq/policy*/energy_performance_preference; do echo performance > $p; done

Let's make sure that the EPP updates landed on the turbostat output:

|[..]
| Core    CPU     Avg_MHz Busy%   Bzy_MHz            HWP_REQ
| -       -       1       0.05    2396    0x0000000000000000
| 0       0       1       0.05    2153    0x0000000000002704
| 0       4       1       0.04    2062    0x0000000000002704
| 1       1       1       0.02    2938    0x0000000000002704
| 1       5       2       0.09    2609    0x0000000000002704
| 2       2       1       0.04    1857    0x0000000000002704
| 2       6       1       0.05    2561    0x0000000000002704
| 3       3       0       0.01    1883    0x0000000000002704
| 3       7       2       0.07    2703    0x0000000000002704
|[..]

Now let's do some non-trivial IO activity in order to trigger HWP
dynamic boost, and watch while random CPUs start losing their EPP
setting requested via sysfs:

|[..]
| Core    CPU     Avg_MHz Busy%   Bzy_MHz            HWP_REQ
| -       -       16      0.81    2023    0x0000000000000000
| 0       0       7       0.66    1069    0x0000000080002704
                                                    ^^
| 0       4       24      2.19    1116    0x0000000080002704
                                                    ^^
| 1       1       18      0.68    2618    0x0000000000002704
| 1       5       1       0.03    2005    0x0000000000002704
| 2       2       2       0.07    2512    0x0000000000002704
| 2       6       33      1.35    2402    0x0000000000002704
| 3       3       1       0.04    2470    0x0000000000002704
| 3       7       45      1.42    3185    0x0000000080002704
                                                    ^^
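Decoding the MSR_HWP_REQUEST values from the turbostat output above (field layout per the Intel SDM: bits 7:0 minimum performance, 15:8 maximum performance, 23:16 desired performance, 31:24 EPP) makes the overwrite explicit:

```python
# Decode the relevant MSR_HWP_REQUEST fields of a raw register value.
def decode_hwp_req(value):
    return {
        "min_perf":  value        & 0xff,
        "max_perf": (value >> 8)  & 0xff,
        "desired":  (value >> 16) & 0xff,
        "epp":      (value >> 24) & 0xff,
    }

before = decode_hwp_req(0x0000000000002704)  # after the sysfs EPP write
after  = decode_hwp_req(0x0000000080002704)  # after dynamic boost ran

# The sysfs write set EPP to 0 ("performance"); the boosted CPUs are
# back at the hardware default EPP of 0x80, while the performance
# limit fields are untouched.
assert before["epp"] == 0x00 and after["epp"] == 0x80
assert before["max_perf"] == after["max_perf"] == 0x27
```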

>> >> >> >       } else {
>> >> >> >               if (epp == -EINVAL)
>> >> >> >                       epp = (pref_index - 1) << 2;
>> >> >> > @@ -2047,6 +2052,7 @@ static int intel_pstate_init_cpu(unsigne
>> >> >> >               cpu->epp_default = -EINVAL;
>> >> >> >               cpu->epp_powersave = -EINVAL;
>> >> >> >               cpu->epp_saved = -EINVAL;
>> >> >> > +             WRITE_ONCE(cpu->epp_cached, -EINVAL);
>> >> >> >       }
>> >> >> >
>> >> >> >       cpu = all_cpu_data[cpunum];
>> >> >> > @@ -2245,7 +2251,10 @@ static int intel_pstate_verify_policy(st
>> >> >> >
>> >> >> >  static void intel_cpufreq_stop_cpu(struct cpufreq_policy *policy)
>> >> >> >  {
>> >> >> > -     intel_pstate_set_min_pstate(all_cpu_data[policy->cpu]);
>> >> >> > +     if (hwp_active)
>> >> >> > +             intel_pstate_hwp_force_min_perf(policy->cpu);
>> >> >> > +     else
>> >> >> > +             intel_pstate_set_min_pstate(all_cpu_data[policy->cpu]);
>> >> >> >  }
>> >> >> >
>> >> >> >  static void intel_pstate_stop_cpu(struct cpufreq_policy *policy)
>> >> >> > @@ -2253,12 +2262,10 @@ static void intel_pstate_stop_cpu(struct
>> >> >> >       pr_debug("CPU %d exiting\n", policy->cpu);
>> >> >> >
>> >> >> >       intel_pstate_clear_update_util_hook(policy->cpu);
>> >> >> > -     if (hwp_active) {
>> >> >> > +     if (hwp_active)
>> >> >> >               intel_pstate_hwp_save_state(policy);
>> >> >> > -             intel_pstate_hwp_force_min_perf(policy->cpu);
>> >> >> > -     } else {
>> >> >> > -             intel_cpufreq_stop_cpu(policy);
>> >> >> > -     }
>> >> >> > +
>> >> >> > +     intel_cpufreq_stop_cpu(policy);
>> >> >> >  }
>> >> >> >
>> >> >> >  static int intel_pstate_cpu_exit(struct cpufreq_policy *policy)
>> >> >> > @@ -2388,13 +2395,82 @@ static void intel_cpufreq_trace(struct c
>> >> >> >               fp_toint(cpu->iowait_boost * 100));
>> >> >> >  }
>> >> >> >
>> >> >> > +static void intel_cpufreq_adjust_hwp(struct cpudata *cpu, u32 target_pstate,
>> >> >> > +                                  bool fast_switch)
>> >> >> > +{
>> >> >> > +     u64 prev = READ_ONCE(cpu->hwp_req_cached), value = prev;
>> >> >> > +     s16 epp;
>> >> >> > +
>> >> >> > +     value &= ~HWP_MIN_PERF(~0L);
>> >> >> > +     value |= HWP_MIN_PERF(target_pstate);
>> >> >> > +
>> >> >> > +     /*
>> >> >> > +      * The entire MSR needs to be updated in order to update the HWP min
>> >> >> > +      * field in it, so opportunistically update the max too if needed.
>> >> >> > +      */
>> >> >> > +     value &= ~HWP_MAX_PERF(~0L);
>> >> >> > +     value |= HWP_MAX_PERF(cpu->max_perf_ratio);
>> >> >> > +
>> >> >> > +     /*
>> >> >> > +      * In case the EPP has been adjusted via sysfs, write the last cached
>> >> >> > +      * value of it to the MSR as well.
>> >> >> > +      */
>> >> >> > +     epp = READ_ONCE(cpu->epp_cached);
>> >> >> > +     if (epp >= 0) {
>> >> >> > +             value &= ~GENMASK_ULL(31, 24);
>> >> >> > +             value |= (u64)epp << 24;
>> >> >> > +     }
>> >> >> > +
>> >> >> > +     if (value == prev)
>> >> >> > +             return;
>> >> >> > +
>> >> >> > +     WRITE_ONCE(cpu->hwp_req_cached, value);
>> >> >> > +     if (fast_switch)
>> >> >> > +             wrmsrl(MSR_HWP_REQUEST, value);
>> >> >> > +     else
>> >> >> > +             wrmsrl_on_cpu(cpu->cpu, MSR_HWP_REQUEST, value);
>> >> >> > +}
>> >> >>
>> >> >> I've asked this question already but you may have missed it: Given that
>> >> >> you are of the opinion that [1] should be implemented in schedutil
>> >> >> instead with intel_pstate in HWP passive mode, what's your plan for
>> >> >> exposing the HWP_MAX_PERF knob to the governor in addition to
>> >> >> HWP_MIN_PERF, since the interface implemented here only allows the
>> >> >> governor to provide a single frequency?
>> >> >>
>> >> >> [1] https://lwn.net/ml/linux-pm/20200428032258.2518-1-currojerez@riseup.net/
>> >> >
>> >> > This is not just about the schedutil governor, but about cpufreq
>> >> > governors in general (someone may still want to use the performance
>> >> > governor on top of intel_pstate, for example).
>> >> >
>> >> > And while governors can only provide one frequency, the policy limits
>> >> > in the cpufreq framework are based on QoS lists now and so it is
>> >> > possible to add a max limit request, say from a driver, to the max QoS
>> >> > list, and update it as needed, causing the max policy limit to be
>> >> > adjusted.
>> >> >
>> >> > That said I'm not exactly sure how useful the max limit generally is
>> >> > in practice on HWP systems, given that setting it above the base
>> >> > frequency causes it to be ignored, effectively, and the turbo range
>> >> > may be wider than the range of P-states below the base frequency.
>> >> >
>> >>
>> >> I don't think that's accurate.  I've looked at hundreds of traces while
>> >> my series [1] was in control of HWP_REQ_MAX and I've never seen an
>> >> excursion above the maximum HWP_REQ_MAX control specified by it within a
>> >> given P-state domain, even while that maximum specified was well into
>> >> the turbo range.
>> >
>> > I'm not going to argue with your experience. :-)
>> >
>> > What I'm saying is that there is no guarantee that the processor will
>> > always select P-states below HWP_REQ_MAX in the turbo range.  That may
>> > not happen in practice, but it is not precluded AFAICS.
>> >
>> > Also while HWP_REQ_MAX can work in practice most of the time with HWP
>> > enabled, without HWP there is no easy way to limit the max frequency
>> > if the current request falls into the turbo range.  The HWP case is
>> > more important nowadays, but there still are systems without it and
>> > ideally they should be covered as well.
>> >
>>
>> In the non-HWP case we have a single P-state control so the question of
>> how to plumb an extra P-state control from the CPUFREQ governor seems
>> largely irrelevant.  The current interface seems fine as-is for such
>> systems.
>
> Which doesn't mean that the non-HWP systems aren't affected by the
> problem you want addressed, does it?
>
> And so if they are affected by it, then ideally the way it is
> addressed should also be applicable to them.
>

Yes they are affected, and I have patches for them that give
energy-efficiency improvements comparable to the HWP-specific ones, but
addressing the issue on non-HWP parts doesn't require any changes to the
CPUFREQ interface we're discussing, even if the logic is moved into a
generic governor from intel_pstate.c.

>> >> So, yeah, I agree that HWP_REQ_MAX is nothing like a
>> >> hard limit, particularly when multiple threads are running on the same
>> >> clock domain, but the processor will still make its best effort to limit
>> >> the clock frequency to the maximum of the requested maximums, even if it
>> >> happens to be within the turbo range.  That doesn't make it useless.
>> >
>> > I haven't used the word "useless" anywhere in my previous message.
>> >
>> > Using the max frequency to control power has merit, but how much of it
>> > is there depends on some factors that may change from one system to
>> > another.
>> >
>> > The alternative power control methods may be more reliable in general.
>> >
>>
>> That's precisely what I've been calling into question.  IIRC the
>> alternative power control methods we have discussed in the past are:
>>
>>  - EPP: The semantics of this control are largely unspecified beyond
>>    higher values being more energy-efficient than lower values.  The set
>>    of energy efficiency optimizations controlled by it and the
>>    thresholds at which they become active are fully platform-specific.
>>    I guess you didn't mean this one as example of a more reliable and
>>    less platform-specific control.
>>
>>  - RAPL: The semantics of this control are indeed well-defined, it's
>>    able to set an absolute average power constraint to the involved
>>    power planes.  However, the relationship between the information
>>    available to the kernel about a given workload (e.g. from CPU
>>    performance counters) and the optimal value of the RAPL constraints
>>    is highly platform-specific, requiring multiple iterations of
>>    adjustments and performance monitoring in order to approach the
>>    optimal value (unlike HWP_REQ_MAX since there is a simple,
>>    platform-independent relationship between observed frequency
>>    and... frequency -- More on that below).
>
> But the relationship between the power budget and HWP_REQ_MAX is more
> convoluted.
>

Actually the energy efficiency optimization I'm working on is fully
independent of the processor's absolute power budget.  IOW the
relationship between power budget and HWP_REQ_MAX is trivial (please
take a look at the code if you don't believe me).

>>  - P-code mailbox interface: Available to the graphics driver when GuC
>>    submission is in use, which is not available currently on any
>>    production platform.  It won't allow the energy efficiency
>>    optimization I'm proposing to be taken advantage of by discrete
>>    graphics nor IO devices other than the GPU.  Like HWP_REQ_MAX it sets
>>    a constraint on the CPU P-states so most caveats of HWP_REQ_MAX would
>>    apply to it too.  But unlike HWP_REQ_MAX it has global effect on the
>>    system limiting its usefulness in a multitasking environment.
>>    Requires a governor to run in a GPU microcontroller with more limited
>>    information than CPUFREQ.
>
> The way it works is beyond the scope of discussion here.

Oh?  If one of the proposed solutions imposes some architectural
constraints like running the controller in an environment with limited
information or forces its effect to be global across the whole system,
why shouldn't it be a matter of discussion?

> The point is that it can be used to achieve the goal at least in some
> case, which wouldn't require complexity to be added to the CPU
> performance scaling subsystem of the kernel.
>

Sure, but the complexity of interfacing with the microcontroller, shipping
and maintaining a proprietary binary dependency also needs to be considered.

> To really decide which is better, the two alternatives would need to
> be compared quantitatively, but it doesn't look like they have been.
>

That's a fair request, I'm all for justifying design decisions
quantitatively -- And in fact we did compare my non-HWP controller with
the GuC-based solution quantitatively in a BXT platform, and results
showed the kernel-based solution to be superior by a significant margin.
That was a couple of years ago though and many things have changed, we
can get you the results of the previous comparison or an updated
comparison if you don't find it appealing to look at old numbers.

>> So I'm either missing some alternative power control method or I
>> strongly disagree that there is a more reliable and platform-independent
>> alternative to HWP_REQ_MAX.
>>
>> >> The exact same thing can be said about controlling HWP_REQ_MIN as you're
>> >> doing now in this revision of your patch, BTW.
>> >
>> > Which has been clearly stated in the changelog I believe.
>> >
>>
>> Right, which is why I found it surprising to hear the same point as a
>> counterargument against HWP_REQ_MAX.
>
> The argument is against adding an extra constraint, beyond what's
> there already, that isn't guaranteed to be effective even.
>
> We kind of know that what is there already is not ideal, so adding
> more stuff that is not ideal on top of that is kind of questionable at
> least in principle.
>
>> > The point here is that this is as good as using the perf control
>> > register to ask for a given P-state without HWP which trying to drive
>> > the max too is added complexity.
>> >
>> >> If you don't believe me here is the turbostat sample with maximum
>> >> Bzy_MHz I get on the computer I'm sitting on right now while compiling a
>> >> kernel on CPU0 if I set HWP_REQ_MAX to 0x1c (within the turbo range):
>> >>
>> >> | Core    CPU     Avg_MHz Busy%   Bzy_MHz            HWP_REQ      PkgWatt CorWatt
>> >> | -       -       757     27.03   2800    0x0000000000000000      7.13    4.90
>> >> | 0       0       2794    99.77   2800    0x0000000080001c04      7.13    4.90
>> >> | 0       2       83      2.98    2800    0x0000000080001c04
>> >> | 1       1       73      2.60    2800    0x0000000080001c04
>> >> | 1       3       78      2.79    2800    0x0000000080001c04
>> >>
>> >> With the default HWP_REQUEST:
>> >>
>> >> | Core    CPU     Avg_MHz Busy%   Bzy_MHz            HWP_REQ      PkgWatt CorWatt
>> >> | -       -       814     27.00   3015    0x0000000000000000      8.49    6.18
>> >> | 0       0       2968    98.24   3021    0x0000000080001f04      8.49    6.18
>> >> | 0       2       84      2.81    2982    0x0000000080001f04
>> >> | 1       1       99      3.34    2961    0x0000000080001f04
>> >> | 1       3       105     3.60    2921    0x0000000080001f04
>> >>
>> >> > Generally, I'm not quite convinced that limiting the max frequency is
>> >> > really the right choice for controlling the processor's power draw on
>> >> > the systems in question.  There are other ways to do that, which in
>> >> > theory should be more effective.  I mentioned RAPL somewhere in this
>> >> > context and there's the GUC firmware too.
>> >>
>> >> I feel like we've had that conversation before and it's somewhat
>> >> off-topic so I'll keep it short: Yes, in theory RAPL is more effective
>> >> than HWP_REQ_MAX as a mechanism to limit the absolute power consumption
>> >> of the processor package, but that's not the purpose of [1], its purpose
>> >> is setting a lower limit to the energy efficiency of the processor when
>> >> the maximum usable CPU frequency is known (due to the existence of an IO
>> >> device bottleneck)
>> >
>> > Whether or not that frequency is actually known seems quite
>> > questionable to me, but that's aside.
>> >
>>
>> It's not actually known, but it can be approximated easily under a
>> widely-applicable assumption -- More on that below.
>>
>> > More important, it is unclear to me what you mean by "a lower limit to
>> > the energy efficiency of the processor".
>> >
>>
>> If we define the instantaneous energy efficiency of a CPU (eta) to be
>> the ratio between its instantaneous frequency (f) and power consumption
>> (P),
>
> I'm sorry, but this definition is conceptually misguided.
>
> Energy-efficiency (denote it as \phi) can be defined as work/energy which means
>
> \phi = dW / dE
>
> for the instantaneous one and in general that is not the same as the
> simple fraction below.
>

Hah!  Actually both definitions are mathematically equivalent everywhere
they're both defined.  I assume that the 'd' symbols in your expression
denote Leibniz's notation for a total derivative (if they didn't --
e.g. if they denoted some sort of finite increment instead, then I would
disagree that your expression is a valid definition of *instantaneous*
energy efficiency).  In addition I assume that we agree on there being
two well-defined functions of time in the case that concerns us, which
we call "instantaneous power consumption" [P(t) = dE(t)/dt] and
"instantaneous frequency" [f(t) = dW(t)/dt].  With that in mind you just
have to apply the standard chain rule from calculus in order to prove
the equivalence of both expressions:

| \phi = dW(t(E))/dE = dW(t)/dt * dt(E)/dE = dW(t)/dt * (dE(t)/dt)^-1 =
|      = f(t) * P(t)^-1 = eta
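
(For the skeptical reader, the equivalence is also easy to check
numerically -- the W(t) and E(t) below are arbitrary smooth synthetic
functions, not a model of any real processor:)

```python
# Numeric sanity check of \phi = dW/dE == f/P = eta, using central
# differences on synthetic W(t) and E(t).
import math

def W(t):  # work done by time t (e.g. unhalted cycles)
    return 3.0e9 * t + 1.0e9 * math.sin(t)

def E(t):  # energy consumed by time t, monotonically increasing
    return 15.0 * t + 2.0 * (1.0 - math.cos(t))

def deriv(g, t, h=1e-6):
    return (g(t + h) - g(t - h)) / (2.0 * h)

t = 1.7
eta = deriv(W, t) / deriv(E, t)  # f(t) / P(t)
h = 1e-5                         # different step, so the check isn't circular
phi = (W(t + h) - W(t - h)) / (E(t + h) - E(t - h))  # dW/dE
assert abs(phi - eta) < 1e-6 * abs(eta)
```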

>> I want to be able to set a lower limit to that ratio in cases where
>> I can determine that doing so won't impact the performance of the
>> application:
>>
>> |  eta_min <= eta = f / P
>>
>> Setting such a lower limit to the instantaneous energy efficiency of the
>> processor can only lower the total amount of energy consumed by the
>> processor in order to perform a given amount of work (If you don't
>> believe me on that feel free to express it as the integral of P over
>> time, with P recovered from the expression above), therefore it can only
>> improve the average energy efficiency of the workload in the long run.
>>
>> Because of the convex relationship between P and f above a certain
>> inflection point (AKA maximum efficiency ratio, AKA min_pstate in
>> intel_pstate.c), eta is monotonically decreasing with respect to
>> frequency above that point, therefore setting a lower limit to the
>> energy efficiency of the processor is equivalent to setting an upper
>> limit to its frequency within that range.
>
> So under the usual assumptions, in order to increase the frequency it
> is necessary to increase power at least linearly which roughly means
> what you said above.
>
> And if increasing the frequency above a certain limit is not going to
> increase the amount of work done, which very well may be the case,
> then it makes sense to set a limit on the frequency.
>
> The entire CPU performance scaling is based on that concept and the
> trick is to determine the "useful max" frequency in question reliably
> enough.
>

Yes, seems like we are on the same page there.

>> > I guess what you mean is that the processor might decide to go for a
>> > more energy-efficient configuration by increasing its frequency in a
>> > "race to idle" fashion (in response to a perceived utilization spike)
>> > and you want to prevent that from occurring.
>> >
>>
>> No, a race to idle response to a utilization spike would only be more
>> energy efficient than the performance-equivalent constant-frequency
>> response in cases where the latter constant frequency is in the
>> concavity region of the system's power curve (below the inflection
>> point).  I certainly don't want to prevent that from occurring when it's
>> the most energy-efficient thing to do.
>>
>> > Or, generally speaking, that the CPU performance scaling logic, either
>> > in the kernel or in the processor itself, might select a higher
>> > operating frequency of a CPU in response to a perceived utilization
>> > spike, but that may be a mistake in the presence of another data
>> > processing device sharing the power budget with the processor, so you
>> > want to prevent that from taking place.
>> >
>>
>> Yes, I do.
>>
>> > In both cases, I wouldn't call that limiting the energy-efficiency of
>> > the processor.  Effectively, this means putting a limit on the
>> > processor's power budget, which is exactly what RAPL is for.
>> >
>>
>> No, limiting the processor frequency also imposes a limit to its energy
>> efficiency due to the reason explained above.
>
> I suppose that you mean the instantaneous thing which I don't think
> can really be referred to as "energy-efficiency".
>

Well I showed above that it's equivalent to your definition of energy
efficiency everywhere they're both defined.  So I meant the same thing
as you.

> Regardless, the ultimate goal appears to be to allow the non-CPU
> component you care about draw more power.
>

No, I explicitly dismissed that in my previous reply.

>> >> -- And if the maximum usable CPU frequency is the
>> >> information we have at hand,
>> >
>> > How so?
>> >
>> > How can you tell what that frequency is?
>> >
>>
>> In the general case it would take a crystal ball to know the amount of
>> work the CPU is going to have to do in the future, however as soon as
>> the system has reached a steady state (which amounts to a large fraction
>> of the time and energy consumption of many workloads, therefore it's an
>> interesting case to optimize for) its previous behavior can be taken as
>> proxy for its future behavior (by definition of steady state), therefore
>> we can measure the performance delivered by the processor in the
>> immediate past and make sure that the governor's response doesn't
>> prevent it from achieving the same performance (plus some margin in
>> order to account for potential fluctuations in the workload).
>>
>> That's, yes, an essentially heuristic assumption, but one that underlies
>> every other CPU frequency governor in the Linux kernel tree to a greater
>> or lower extent.
>
> Yes, it is, and so I don't quite see the connection between it and my question.
>
> Apparently, the unmodified performance scaling governors are not
> sufficient, so there must be something beyond the above which allows
> you to determine the frequency in question and so I'm asking what that
> is.
>

The underlying heuristic assumption is the same as I outlined above, but
in any implementation of such a heuristic there is necessarily a
trade-off between responsiveness to short-term fluctuations and
long-term energy usage.  This trade-off is a function of the somewhat
arbitrary time interval I was referring to as "immediate past" -- A
longer time parameter allows the controller to consider a greater
portion of the workload's history while computing the response with
optimal energy usage, at the cost of increasing its reaction time to
discontinuous changes in the behavior of the workload (AKA increased
latency).

One of the key differences between the governor I proposed and the
pre-existing ones is that it doesn't attempt to come up with a magic
time parameter that works for everybody, because there isn't such a
thing, since different devices and applications have latency
requirements which often differ by orders of magnitude.  Instead, a PM
QoS-based interface is introduced [2] which aggregates the latency
requirements from multiple clients, and forwards the result to the
CPUFREQ governor which dynamically adjusts its time parameter to suit
the workloads (hence the name "variably low-pass filtering").

Individual device drivers are generally in a better position to decide
what their latency requirements are than any central PM agent (including
the HWP) -- We can talk more about the algorithm used to do that as soon
as we've reached some agreement on the basics.

[2] https://lwn.net/ml/linux-pm/20200428032258.2518-2-currojerez@riseup.net/
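
A minimal sketch of the filtering idea follows (the class and all
names are illustrative only, not the interface actually introduced in
[2], where the aggregation is done by the PM QoS core):

```python
# Sketch: a first-order low-pass filter whose time parameter tracks
# the tightest response-latency requirement aggregated from multiple
# clients.  Illustrative only -- not the actual CPUFREQ/PM QoS code.
import math

class VariableLowPass:
    def __init__(self):
        self.state = 0.0
        self.latency_requests = []  # per-client latency targets (seconds)

    def aggregate_latency(self):
        # The effective target is the tightest (smallest) request;
        # fall back to a long horizon when no client has constraints.
        return min(self.latency_requests, default=1.0)

    def update(self, sample, dt):
        tau = self.aggregate_latency()      # filter time parameter
        alpha = 1.0 - math.exp(-dt / tau)   # first-order IIR coefficient
        self.state += alpha * (sample - self.state)
        return self.state
```

A tighter latency request shrinks tau, so the governor reacts faster
to utilization spikes at the cost of spending more energy chasing
short-lived fluctuations; a relaxed request lets it average over a
longer portion of the workload's history.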

>> >> controlling the maximum CPU frequency
>> >> directly is optimal, rather than trying to find the RAPL constraint that
>> >> achieves the same average frequency by trial an error.  Also, in theory,
>> >> even if you had an oracle to tell you what the appropriate RAPL
>> >> constraint is, the result would necessarily be more energy-inefficient
>> >> than controlling the maximum CPU frequency directly, since you're giving
>> >> the processor additional freedom to run at frequencies above the one you
>> >> want to average, which is guaranteed to be more energy-inefficient than
>> >> running at that fixed frequency, assuming we are in the region of
>> >> convexity of the processor's power curve.
>> >
>> > So the reason why you want to limit the processor's max frequency in
>> > the first place is because it is sharing the power budget with
>> > something else.
>>
>> No, my ultimate goal is to optimize the energy efficiency of the CPU in
>> cases where the system has a bottleneck elsewhere.
>
> I don't know what you mean here, sorry.
>

I'm having trouble coming up with simpler words to express the same
thing: My ultimate goal is to improve the energy efficiency of the CPU.
Improved energy balancing behavior is only a nice bonus.

>> > If there's no sharing of the power budget or thermal constraints,
>> > there is no need to limit the CPU frequency other than for the sake of
>> > saving energy.
>> >
>>
>> Yes!
>>
>> > What you can achieve by limiting the max CPU frequency is to make the
>> > processor draw less power (and cause it to use either less or more
>> > energy, depending on the energy-efficiency curve).
>>
>> Yes, in order to make sure that limiting the maximum CPU frequency
>> doesn't lead to increased energy usage the response of the governor is
>> clamped to the convexity range of the CPU power curve (which yes, I'm
>> aware is only an approximation to the convexity range of the
>> whole-system power curve).
>>
>> > You don't know how much less power it will draw then, however.
>> >
>>
>> I don't see any need to care how much less power is drawn in absolute
>> terms, as long as the energy efficiency of the system is improved *and*
>> its performance is at least the same as it was before.
>
> There is obviously a connection between power and energy and you were
> talking about steady states above.
>
> In a steady state energy is a product of power and time, approximately.
>
> Besides, I was talking about the average, and that is what matters for power budgets.
>

My point is that even if I knew the exact absolute power consumption of
the CPU as a function of its P-state, that wouldn't help me construct a
governor which is significantly more energy-efficient in the typical
steady-state case, since the governor's response is already optimal
energy-wise under the heuristic assumptions outlined above and in my
series.

>> > You seem to be saying "I know exactly what the maximum frequency of
>> > the CPU can be, so why I don't set it as the upper limit", but I'm
>> > questioning the source of that knowledge.  Does it not come from
>> > knowing the power budget you want to give to the processor?
>> >
>>
>> No, it comes from CPU performance counters -- More on that above.
>
> I guess you mean the paragraph regarding reaching a steady state etc.,
> but there's nothing about the CPU performance counters in there, so it
> is kind of hard for me to understand this remark.
>

It comes from monitoring the CPU performance counters in the immediate
past (the exact definition of "immediate past" being a function of the
PM QoS constraints in place) in order to compute the average amount of
work delivered by the CPU thread per unit of time, which allows us to
find the most energy-efficient P-state which won't negatively impact the
performance of the workload under the assumption of steady state.
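
In rough sketch form, under the steady-state assumption above (the
margin, step size and counter bookkeeping here are illustrative, not
the actual implementation in my series):

```python
# Sketch: derive the average delivered frequency over the last
# sampling interval from an APERF/TSC delta pair and pick the lowest
# P-state that still covers it plus a fluctuation margin.  Simplified:
# ignores idle-state and per-thread attribution caveats.
import math

def target_pstate(aperf_delta, tsc_delta, tsc_hz, pstate_step_hz,
                  min_pstate, max_pstate, margin=1.125):
    # APERF counts core cycles at the actual frequency while in C0, so
    # over a wall-time interval of tsc_delta / tsc_hz seconds the
    # average delivered frequency is:
    delivered_hz = aperf_delta * tsc_hz / tsc_delta
    wanted_hz = delivered_hz * margin   # headroom for fluctuations
    pstate = math.ceil(wanted_hz / pstate_step_hz)
    return max(min_pstate, min(max_pstate, pstate))

# E.g. 1.6e9 active cycles over one second on a 2.4 GHz TSC with
# 100 MHz P-state steps:
# target_pstate(1.6e9, 2.4e9, 2.4e9, 1e8, 4, 28) -> 18
```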

> Overall, so far, I'm seeing a claim that the CPU subsystem can be made
> use less energy and do as much work as before (which is what improving
> the energy-efficiency means in general) if the maximum frequency of
> CPUs is limited in a clever way.
>
> I'm failing to see what that clever way is, though.

Hopefully the clarifications above help some.
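
(As an aside, the HWP_REQ column in the turbostat samples quoted
earlier can be decoded with the IA32_HWP_REQUEST field layout from the
Intel SDM -- a quick sketch:)

```python
# Decode an IA32_HWP_REQUEST MSR value (field layout per the Intel SDM).
def decode_hwp_request(val):
    return {
        "min_perf":        val & 0xff,           # bits  7:0
        "max_perf":        (val >> 8) & 0xff,    # bits 15:8
        "desired_perf":    (val >> 16) & 0xff,   # bits 23:16
        "epp":             (val >> 24) & 0xff,   # bits 31:24
        "activity_window": (val >> 32) & 0x3ff,  # bits 41:32
        "package_control": (val >> 42) & 0x1,    # bit  42
    }

req = decode_hwp_request(0x0000000080001c04)
# -> min_perf 0x04, max_perf 0x1c (the clamped turbostat sample),
#    epp 0x80 (the hardware default)
```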
Srinivas Pandruvada July 21, 2020, 4:25 p.m. UTC | #14
On Mon, 2020-07-20 at 16:20 -0700, Francisco Jerez wrote:
> "Rafael J. Wysocki" <rafael@kernel.org> writes:
> 
> > On Fri, Jul 17, 2020 at 2:21 AM Francisco Jerez <
> > currojerez@riseup.net> wrote:
> > > "Rafael J. Wysocki" <rafael@kernel.org> writes:
> > > 
[...]

> > Overall, so far, I'm seeing a claim that the CPU subsystem can be
> > made
> > use less energy and do as much work as before (which is what
> > improving
> > the energy-efficiency means in general) if the maximum frequency of
> > CPUs is limited in a clever way.
> > 
> > I'm failing to see what that clever way is, though.
> Hopefully the clarifications above help some.

To simplify:

Suppose I call numpy.multiply() to multiply two big arrays and the
thread is pegged to a CPU. Say the CPU finishes the job in 10 ms while
running at a P-state of 0x20, but the same job could have been done in
10 ms at a P-state of 0x16. Then we are not energy-efficient. To
really know where the bottleneck is there are a number of perf
counters; maybe the cache was the issue and we could instead raise the
uncore frequency a little. The simple APERF/MPERF counters are not
enough, or we would have to characterize the workload at different
P-states and set limits. I think this is not what you mean by energy
efficiency with your changes.

The way you are trying to improve "performance" is to have the caller
(a device driver) say how important its job at hand is. Suppose a
device driver offloads some calculation to a GPU and can wait up to
10 ms; you want to tell the CPU to be slow. But the moment the P-state
driver observes that there is a chance of overshooting the latency
target, it will immediately ask for a higher P-state. So you want
P-state limits based on the latency requirements of the caller. Since
the caller has more knowledge of its latency requirements, this allows
other devices sharing the power budget to get more or less power and
improves overall energy efficiency, as the combined performance of the
system is improved.
Is this correct?
Francisco Jerez July 21, 2020, 11:14 p.m. UTC | #15
Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> writes:

> On Mon, 2020-07-20 at 16:20 -0700, Francisco Jerez wrote:
>> "Rafael J. Wysocki" <rafael@kernel.org> writes:
>> 
>> > On Fri, Jul 17, 2020 at 2:21 AM Francisco Jerez <
>> > currojerez@riseup.net> wrote:
>> > > "Rafael J. Wysocki" <rafael@kernel.org> writes:
>> > > 
> [...]
>
>> > Overall, so far, I'm seeing a claim that the CPU subsystem can be
>> > made
>> > use less energy and do as much work as before (which is what
>> > improving
>> > the energy-efficiency means in general) if the maximum frequency of
>> > CPUs is limited in a clever way.
>> > 
>> > I'm failing to see what that clever way is, though.
>> Hopefully the clarifications above help some.
>
> To simplify:
>
> Suppose I call numpy.multiply() to multiply two big arrays and the
> thread is pegged to a CPU. Say the CPU finishes the job in 10 ms
> while running at a P-state of 0x20, but the same job could have been
> done in 10 ms at a P-state of 0x16. Then we are not energy-efficient.
> To really know where the bottleneck is there are a number of perf
> counters; maybe the cache was the issue and we could instead raise
> the uncore frequency a little. The simple APERF/MPERF counters are
> not enough. 

Yes, that's right, APERF and MPERF aren't sufficient to identify every
kind of possible bottleneck, some visibility of the utilization of other
subsystems is necessary in addition -- Like e.g the instrumentation
introduced in my series to detect a GPU bottleneck.  A bottleneck
condition in an IO device can be communicated to CPUFREQ by adjusting a
PM QoS latency request (link [2] in my previous reply) that effectively
gives the governor permission to rearrange CPU work arbitrarily within
the specified time frame (which should be of the order of the natural
latency of the IO device -- e.g. at least the rendering time of a frame
for a GPU) in order to minimize energy usage.

> Or we would have to characterize the workload at different P-states
> and set limits.  I think this is not what you mean by energy
> efficiency with your changes.
>
> The way you are trying to improve "performance" is to have the caller
> (a device driver) say how important its job at hand is. Suppose a
> device driver offloads some calculation to a GPU and can wait up to
> 10 ms; you want to tell the CPU to be slow. But the moment the P-state
> driver observes that there is a chance of overshooting the latency
> target, it will immediately ask for a higher P-state. So you want
> P-state limits based on the latency requirements of the caller. Since
> the caller has more knowledge of its latency requirements, this allows
> other devices sharing the power budget to get more or less power and
> improves overall energy efficiency, as the combined performance of the
> system is improved.
> Is this correct?

Yes, pretty much.
Rafael J. Wysocki July 27, 2020, 5:15 p.m. UTC | #16
On Tuesday, July 21, 2020 1:20:14 AM CEST Francisco Jerez wrote:

[cut]

> >
> > However, in the active mode the only updater of hwp_req_cached is
> > intel_pstate_hwp_set() and this patch doesn't introduce any
> > differences in behavior in that case.
> >
> 
> intel_pstate_hwp_set() is the only updater, but there are other
> consumers that can get out of sync with the HWP request value written by
> intel_pstate_set_energy_pref_index().  intel_pstate_hwp_boost_up() seems
> like the most concerning example I named earlier.
> 
> >> > So there may be a short time window after the
> >> > intel_pstate_set_energy_pref_index() invocation in which the new EPP
> >> > value may not be in effect, but in general there is no guarantee that
> >> > the new EPP will take effect immediately after updating the MSR
> >> > anyway, so that race doesn't matter.
> >> >
> >> > That said, that race is avoidable, but I was thinking that trying to
> >> > avoid it might not be worth it.  Now I see a better way to avoid it,
> >> > though, so I'm going to update the patch to that end.
> >> >
> >> >> Seems like a bug to me.
> >> >
> >> > It is racy, but not every race is a bug.
> >> >
> >>
> >> Still seems like there is a bug in intel_pstate_set_energy_pref_index()
> >> AFAICT.
> >
> > If there is a bug, then what exactly is it, from the users' perspective?
> >
> 
> It can be reproduced easily as follows:
> 
> | echo 1 > /sys/devices/system/cpu/intel_pstate/hwp_dynamic_boost
> | for p in /sys/devices/system/cpu/cpufreq/policy*/energy_performance_preference; do echo performance > $p; done

Is this the active mode or the passive mode with the $subject patch applied?

If the former, the issue is there regardless of the patch, so it needs to be
fixed.

If the latter, there should be no effect of hwp_dynamic_boost (which was
overlooked by me).

[cut]
 
> > To really decide which is better, the two alternatives would need to
> > be compared quantitatively, but it doesn't look like they have been.
> >
> 
> That's a fair request, I'm all for justifying design decisions
> quantitatively -- And in fact we did compare my non-HWP controller with
> the GuC-based solution quantitatively in a BXT platform, and results
> showed the kernel-based solution to be superior by a significant margin.

Why do you think that this result is significant beyond BXT?

> That was a couple of years ago though and many things have changed, we
> can get you the results of the previous comparison or an updated
> comparison if you don't find it appealing to look at old numbers.
> 

[cut]

> >>
> >> If we define the instantaneous energy efficiency of a CPU (eta) to be
> >> the ratio between its instantaneous frequency (f) and power consumption
> >> (P),
> >
> > I'm sorry, but this definition is conceptually misguided.
> >
> > Energy-efficiency (denote it as \phi) can be defined as work/energy which means
> >
> > \phi = dW / dE
> >
> > for the instantaneous one and in general that is not the same as the
> > simple fraction below.
> >
> 
> Hah!  Actually both definitions are mathematically equivalent everywhere
> they're both defined.  I assume that the 'd' symbols in your expression
> denote Leibniz's notation for a total derivative (if they didn't --
> e.g. if they denoted some sort of finite increment instead, then I would
> disagree that your expression is a valid definition of *instantaneous*
> energy efficiency).  In addition I assume that we agree on there being
> two well-defined functions of time in the case that concerns us, which
> we call "instantaneous power consumption" [P(t) = dE(t)/dt] and
> "instantaneous frequency" [f(t) = dW(t)/dt].  With that in mind you just
> have to apply the standard chain rule from calculus in order to prove
> the equivalence of both expressions:
> 
> | \phi = dW(t(E))/dE = dW(t)/dt * dt(E)/dE = dW(t)/dt * (dE(t)/dt)^-1 =
> |      = f(t) * P(t)^-1 = eta

Well, under certain assumptions, but OK, I stand corrected.

[cut]

> 
> > Regardless, the ultimate goal appears to be to allow the non-CPU
> > component you care about draw more power.
> >
> 
> No, I explicitly dismissed that in my previous reply.

But at the same time you seem to agree that without the non-CPU component
(or thermal pressure) the existing CPU performance scaling would be
sufficient.

[cut]

> > Yes, it is, and so I don't quite see the connection between it and my question.
> >
> > Apparently, the unmodified performance scaling governors are not
> > sufficient, so there must be something beyond the above which allows
> > you to determine the frequency in question and so I'm asking what that
> > is.
> >
> 
> The underlying heuristic assumption is the same as I outlined above, but
> in any implementation of such a heuristic there is necessarily a
> trade-off between responsiveness to short-term fluctuations and
> long-term energy usage.  This trade-off is a function of the somewhat
> arbitrary time interval I was referring to as "immediate past" -- A
> longer time parameter allows the controller to consider a greater
> portion of the workload's history while computing the response with
> optimal energy usage, at the cost of increasing its reaction time to
> discontinuous changes in the behavior of the workload (AKA increased
> latency).

OK

> One of the key differences between the governor I proposed and the
> pre-existing ones is that it doesn't attempt to come up with a magic
> time parameter that works for everybody, because there isn't such a
> thing, since different devices and applications have latency
> requirements which often differ by orders of magnitude.

The problem with this approach is that, generally speaking, the kernel
has a definition of "close past" already, which comes from the PELT
signal in the scheduler.

That signal is used for more than just CPU performance scaling and there
is a reason for that, as the scheduler's decisions generally need to be
aligned with CPU performance scaling decisions.

> Instead, a PM QoS-based interface is introduced [2] which aggregates the
> latency requirements from multiple clients, and forwards the result to the
> CPUFREQ governor which dynamically adjusts its time parameter to suit
> the workloads (hence the name "variably low-pass filtering").
> 
> Individual device drivers are generally in a better position to decide
> what their latency requirements are than any central PM agent (including
> the HWP) -- We can talk more about the algorithm used to do that as soon
> as we've reached some agreement on the basics.
> 
> [2] https://lwn.net/ml/linux-pm/20200428032258.2518-2-currojerez@riseup.net/

Because "latency" is a bit overloaded as a technical term (there is the task
wakeup latency, the CPU idle state exit latency, network latency and so on),
it would be good to come up with a better name for this concept.

> I'm having trouble coming up with simpler words to express the same
> thing: My ultimate goal is to improve the energy efficiency of the CPU.
> Improved energy balancing behavior is only a nice bonus.

Which means doing the same amount of work in the same time while using less
energy.

So yes, I would like to talk about the algorithm.

[cut]

> > I guess you mean the paragraph regarding reaching a steady state etc.,
> > but there's nothing about the CPU performance counters in there, so it
> > is kind of hard for me to understand this remark.
> >
> 
> It comes from monitoring the CPU performance counters in the immediate
> past (the exact definition of "immediate past" being a function of the
> PM QoS constraints in place) in order to compute the average amount of
> work delivered by the CPU thread per unit of time, which allows us to
> find the most energy-efficient P-state which won't negatively impact the
> performance of the workload under the assumption of steady state.

That's what all CPU performance scaling governors attempt to do.

To be more precise, they attempt to find the minimum P-state (or frequency)
that won't negatively impact performance and they attempt to give a reasonable
performance ramp-up response at the same time.

What you are saying basically means that you have a better CPU performance
scaling algorithm than the ones used in the existing governors (which very
well may be true) that can use additional input from entities like device
drivers (which may know something about the workload that is not known to
the scheduler).
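To make the preceding description concrete, the steady-state selection step can be sketched as: estimate the average delivered work rate over the recent window, then pick the lowest P-state whose capacity still covers it. The frequencies and the headroom factor below are made-up example values, not intel_pstate internals:

```python
# Illustrative steady-state P-state selection: given the average work
# rate observed over the "immediate past" (expressed here as MHz of
# delivered capacity), pick the lowest P-state that covers it with
# some headroom so performance is not hurt under steady state.
def pick_pstate(avg_work_rate, pstate_freqs, headroom=1.25):
    """Return the lowest available frequency >= avg_work_rate * headroom."""
    target = avg_work_rate * headroom
    for freq in sorted(pstate_freqs):
        if freq >= target:
            return freq
    return max(pstate_freqs)       # saturated: nothing left but max

# Hypothetical P-states from 800 MHz to 3200 MHz in 100 MHz steps.
freqs = list(range(800, 3300, 100))
idle_choice = pick_pstate(200.0, freqs)     # light load -> floor P-state
busy_choice = pick_pstate(2000.0, freqs)    # needs 2500 MHz of capacity
```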
Rafael J. Wysocki July 27, 2020, 5:23 p.m. UTC | #17
On Wednesday, July 22, 2020 1:14:42 AM CEST Francisco Jerez wrote:
> 
> Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> writes:
> 
> > On Mon, 2020-07-20 at 16:20 -0700, Francisco Jerez wrote:
> >> "Rafael J. Wysocki" <rafael@kernel.org> writes:
> >>
> >> > On Fri, Jul 17, 2020 at 2:21 AM Francisco Jerez <
> >> > currojerez@riseup.net> wrote:
> >> > > "Rafael J. Wysocki" <rafael@kernel.org> writes:
> >> > >
> > [...]
> >
> >> > Overall, so far, I'm seeing a claim that the CPU subsystem can be
> >> > made to use less energy and do as much work as before (which is
> >> > what improving the energy-efficiency means in general) if the
> >> > maximum frequency of CPUs is limited in a clever way.
> >> >
> >> > I'm failing to see what that clever way is, though.
> >> Hopefully the clarifications above help some.
> >
> > To simplify:
> >
> > Suppose I called a function numpy.multiply() to multiply two big arrays
> > and the thread is pegged to a CPU. Let's say it causes the CPU to
> > finish the job in 10ms while using a P-state of 0x20, but the same
> > job could have been done in 10ms even using a P-state of 0x16.
> > So we are not energy-efficient. To really know where the bottleneck
> > is, there are a number of perf counters; maybe the cache was the
> > issue and we could rather raise the uncore frequency a little. The
> > simple APERF/MPERF counters are not enough.
> 
> Yes, that's right: APERF and MPERF aren't sufficient to identify every
> kind of possible bottleneck; some visibility into the utilization of
> other subsystems is necessary in addition -- like, e.g., the
> instrumentation introduced in my series to detect a GPU bottleneck.  A
> bottleneck condition in an IO device can be communicated to CPUFREQ

It generally is not sufficient to communicate it to cpufreq.  It needs to be
communicated to the CPU scheduler.
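As background for the APERF/MPERF point above: those counters only yield the average frequency while the CPU is not idle (this is how turbostat derives the Bzy_MHz and Busy% columns shown later in the thread), which is why they cannot say *where* a bottleneck is:

```python
# What APERF/MPERF deltas do (and don't) tell you: MPERF counts at a
# fixed reference rate while the CPU is not idle, APERF at the actual
# delivered rate, so their ratio gives the average busy frequency --
# but nothing about *why* the core ran at that speed (cache misses,
# memory/uncore saturation, and so on).
def busy_mhz(d_aperf, d_mperf, base_mhz):
    """Average frequency while not idle (turbostat's Bzy_MHz)."""
    return base_mhz * d_aperf / d_mperf

def busy_pct(d_mperf, d_tsc):
    """Percentage of the interval spent out of idle (turbostat's Busy%)."""
    return 100.0 * d_mperf / d_tsc

# One sampling interval on a hypothetical 2400 MHz base-frequency part:
mhz = busy_mhz(d_aperf=3_000_000, d_mperf=2_400_000, base_mhz=2400)
pct = busy_pct(d_mperf=2_400_000, d_tsc=240_000_000)
```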

> by adjusting a
> PM QoS latency request (link [2] in my previous reply) that effectively
> gives the governor permission to rearrange CPU work arbitrarily within
> the specified time frame (which should be of the order of the natural
> latency of the IO device -- e.g. at least the rendering time of a frame
> for a GPU) in order to minimize energy usage.

OK, we need to talk more about this.

> > or we characterize the workload at different P-states and set limits.
> > I think this is not what you want to say for energy efficiency with
> > your changes.
> >
> > The way you are trying to improve "performance" is by having the caller
> > (device driver) say how important the job at hand is. Here, suppose the
> > device driver offloads this calculation to some GPU and can wait up to
> > 10 ms; you want to tell the CPU to be slow. But the moment the P-state
> > driver observes that there is a chance of overshooting the latency, it
> > will immediately ask for a higher P-state. So you want P-state limits
> > based on the latency requirements of the caller. Since the caller has
> > more knowledge of the latency requirement, this allows other devices
> > sharing the power budget to get more or less power, and improves the
> > overall energy efficiency as the combined performance of the system is
> > improved.
> > Is this correct?
> 
> Yes, pretty much.

OK
Francisco Jerez July 28, 2020, 2:32 a.m. UTC | #18
"Rafael J. Wysocki" <rjw@rjwysocki.net> writes:

> On Tuesday, July 21, 2020 1:20:14 AM CEST Francisco Jerez wrote:
>
> [cut]
>
>> >
>> > However, in the active mode the only updater of hwp_req_cached is
>> > intel_pstate_hwp_set() and this patch doesn't introduce any
>> > differences in behavior in that case.
>> >
>> 
>> intel_pstate_hwp_set() is the only updater, but there are other
>> consumers that can get out of sync with the HWP request value written by
>> intel_pstate_set_energy_pref_index().  intel_pstate_hwp_boost_up() seems
>> like the most concerning example I named earlier.
>> 
>> >> > So there may be a short time window after the
>> >> > intel_pstate_set_energy_pref_index() invocation in which the new EPP
>> >> > value may not be in effect, but in general there is no guarantee that
>> >> > the new EPP will take effect immediately after updating the MSR
>> >> > anyway, so that race doesn't matter.
>> >> >
>> >> > That said, that race is avoidable, but I was thinking that trying to
>> >> > avoid it might not be worth it.  Now I see a better way to avoid it,
>> >> > though, so I'm going to update the patch to that end.
>> >> >
>> >> >> Seems like a bug to me.
>> >> >
>> >> > It is racy, but not every race is a bug.
>> >> >
>> >>
>> >> Still seems like there is a bug in intel_pstate_set_energy_pref_index()
>> >> AFAICT.
>> >
>> > If there is a bug, then what exactly is it, from the users' perspective?
>> >
>> 
>> It can be reproduced easily as follows:
>> 
>> | echo 1 > /sys/devices/system/cpu/intel_pstate/hwp_dynamic_boost
>> | for p in /sys/devices/system/cpu/cpufreq/policy*/energy_performance_preference; do echo performance > $p; done
>
> Is this the active mode or the passive mode with the $subject patch applied?
>
> If the former, the issue is there regardless of the patch, so it needs to be
> fixed.
>
> If the latter, there should be no effect of hwp_dynamic_boost (which was
> overlooked by me).
>

This seems to be a problem in active mode only, so yeah the bug exists
regardless of your patch, but the fix is likely to allow you to simplify
this series slightly if it allows you to take full advantage of
hwp_req_cached and drop the additional EPP cache.

> [cut]
>  
>> > To really decide which is better, the two alternatives would need to
>> > be compared quantitatively, but it doesn't look like they have been.
>> >
>> 
>> That's a fair request, I'm all for justifying design decisions
>> quantitatively -- And in fact we did compare my non-HWP controller with
>> the GuC-based solution quantitatively in a BXT platform, and results
>> showed the kernel-based solution to be superior by a significant margin.
>
> Why do you think that this result is significant beyond BXT?
>

Because the limitations of GuC power management relative to the
kernel-driven solution haven't fundamentally changed since BXT.  Of
course it might be that the gap we observed was due to more accidental
reasons, like bugs in the BXT GuC power management code, it would
certainly make sense to recheck.

>> That was a couple of years ago though and many things have changed, we
>> can get you the results of the previous comparison or an updated
>> comparison if you don't find it appealing to look at old numbers.
>> 
>
> [cut]
>
>> >>
>> >> If we define the instantaneous energy efficiency of a CPU (eta) to be
>> >> the ratio between its instantaneous frequency (f) and power consumption
>> >> (P),
>> >
>> > I'm sorry, but this definition is conceptually misguided.
>> >
>> > Energy-efficiency (denote it as \phi) can be defined as work/energy which means
>> >
>> > \phi = dW / dE
>> >
>> > for the instantaneous one and in general that is not the same as the
>> > simple fraction below.
>> >
>> 
>> Hah!  Actually both definitions are mathematically equivalent everywhere
>> they're both defined.  I assume that the 'd' symbols in your expression
>> denote Leibniz's notation for a total derivative (if they didn't --
>> e.g. if they denoted some sort of finite increment instead then I would
>> disagree that your expression is a valid definition of *instantaneous*
>> energy efficiency).  In addition I assume that we agree on there being
>> two well-defined functions of time in the case that concerns us, which
>> we call "instantaneous power consumption" [P(t) = dE(t)/dt] and
>> "instantaneous frequency" [f(t) = dW(t)/dt].  With that in mind you just
>> have to apply the standard chain rule from calculus in order to prove
>> the equivalence of both expressions:
>> 
>> | \phi = dW(t(E))/dE = dW(t)/dt * dt(E)/dE = dW(t)/dt * (dE(t)/dt)^-1 =
>> |      = f(t) * P(t)^-1 = eta
>
> Well, under certain assumptions, but OK, I stand corrected.
>
> [cut]
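The chain-rule equivalence quoted above is also easy to check numerically with finite differences; W and E below are arbitrary smooth test functions, not models of any particular CPU:

```python
# Numerical check of eta = f/P = dW/dE for smooth W(t) (work done) and
# E(t) (energy consumed): the finite-difference quotient dW/dE matches
# f(t)/P(t) at any instant where P(t) != 0.
import math

def W(t):                          # cumulative work (arbitrary, smooth)
    return t ** 2 + math.sin(t)

def E(t):                          # cumulative energy (monotonic near t0)
    return t ** 3 + 2.0 * t

def deriv(g, t, h=1e-6):           # central finite difference
    return (g(t + h) - g(t - h)) / (2.0 * h)

t0 = 1.5
f = deriv(W, t0)                   # instantaneous "frequency" dW/dt
P = deriv(E, t0)                   # instantaneous power       dE/dt
h = 1e-6
dW_dE = (W(t0 + h) - W(t0 - h)) / (E(t0 + h) - E(t0 - h))
```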
>
>> 
>> > Regardless, the ultimate goal appears to be to allow the non-CPU
>> > component you care about draw more power.
>> >
>> 
>> No, I explicitly dismissed that in my previous reply.
>
> But at the same time you seem to agree that without the non-CPU component
> (or thermal pressure) the existing CPU performance scaling would be
> sufficient.
>

Yes, but not necessarily in order to allow the non-CPU component to draw
more power as you said above, but also because the existence of a
bottleneck in a non-CPU component gives us an opportunity to improve the
energy efficiency of the CPU, regardless of whether that allows the
workload to run faster.

> [cut]
>
>> > Yes, it is, and so I don't quite see the connection between it and my question.
>> >
>> > Apparently, the unmodified performance scaling governors are not
>> > sufficient, so there must be something beyond the above which allows
>> > you to determine the frequency in question and so I'm asking what that
>> > is.
>> >
>> 
>> The underlying heuristic assumption is the same as I outlined above, but
>> in any implementation of such a heuristic there is necessarily a
>> trade-off between responsiveness to short-term fluctuations and
>> long-term energy usage.  This trade-off is a function of the somewhat
>> arbitrary time interval I was referring to as "immediate past" -- A
>> longer time parameter allows the controller to consider a greater
>> portion of the workload's history while computing the response with
>> optimal energy usage, at the cost of increasing its reaction time to
>> discontinuous changes in the behavior of the workload (AKA increased
>> latency).
>
> OK
>
>> One of the key differences between the governor I proposed and the
>> pre-existing ones is that it doesn't attempt to come up with a magic
>> time parameter that works for everybody, because there isn't such a
>> thing, since different devices and applications have latency
>> requirements which often differ by orders of magnitude.
>
> The problem with this approach is that, generally speaking, the kernel
> has a definition of "close past" already, which comes from the PELT
> signal in the scheduler.
>
> That signal is used for more than just CPU performance scaling and there
> is a reason for that, as the scheduler's decisions generally need to be
> aligned with CPU performance scaling decisions.
>

Yes, I fully agree that in an ideal world the response latency
constraint I was referring to above would be tracked per-scheduling
entity and used as definition of "close past" by PELT too -- Actually I
think I mentioned I was working on a prototype with scheduler-level
tracking of latency constraints, but other folks requested the RFC to be
based on a simpler interface not requiring scheduler surgery to
implement, which is why I came up with the PM QoS-based interface.  I
believe we have discussed exposing this latency constraint as a third
clamp similar to utilization clamps -- I would be fine with such an
interface if you think it's the way to go.

That said, in most practical cases it should be possible to take close
to full advantage of the response latency information from the schedutil
governor, even if it's provided via PM QoS rather than having
per-scheduling entity granularity -- The time parameter used to control
CPU frequency would just be the most strict among the applications
running in the system, which should prevent performance loss in
applications with a low latency constraint, but might cause us to miss
out some opportunities for energy optimization in a multitasking
environment compared to the full scheduling-based solution.  Doesn't
seem like a deal-breaker to me though and it makes the code
substantially easier to review.

>> Instead, a PM QoS-based interface is introduced [2] which aggregates the
>> latency requirements from multiple clients, and forwards the result to the
>> CPUFREQ governor which dynamically adjusts its time parameter to suit
>> the workloads (hence the name "variably low-pass filtering").
>> 
>> Individual device drivers are generally in a better position to decide
>> what their latency requirements are than any central PM agent (including
>> the HWP) -- We can talk more about the algorithm used to do that as soon
>> as we've reached some agreement on the basics.
>> 
>> [2] https://lwn.net/ml/linux-pm/20200428032258.2518-2-currojerez@riseup.net/
>
> Because "latency" is a bit overloaded as a technical term (there is the task
> wakeup latency, the CPU idle state exit latency, network latency and so on),
> it would be good to come up with a better name for this concept.
>

Ah, yes, I think you suggested calling it scaling response frequency
instead; hopefully I updated every reference to response latency in the
code and comments -- let me know if I missed any.

>> I'm having trouble coming up with simpler words to express the same
>> thing: My ultimate goal is to improve the energy efficiency of the CPU.
>> Improved energy balancing behavior is only a nice bonus.
>
> Which means doing the same amount of work in the same time while using less
> energy.
>
> So yes, I would like to talk about the algorithm.
>
> [cut]
>
>> > I guess you mean the paragraph regarding reaching a steady state etc.,
>> > but there's nothing about the CPU performance counters in there, so it
>> > is kind of hard for me to understand this remark.
>> >
>> 
>> It comes from monitoring the CPU performance counters in the immediate
>> past (the exact definition of "immediate past" being a function of the
>> PM QoS constraints in place) in order to compute the average amount of
>> work delivered by the CPU thread per unit of time, which allows us to
>> find the most energy-efficient P-state which won't negatively impact the
>> performance of the workload under the assumption of steady state.
>
> That's what all CPU performance scaling governors attempt to do.
>
> To be more precise, they attempt to find the minimum P-state (or frequency)
> that won't negatively impact performance and they attempt to give a reasonable
> performance ramp-up response at the same time.
>
> What you are saying basically means that you have a better CPU performance
> scaling algorithm than the ones used in the existing governors (which very
> well may be true) that can use additional input from entities like device
> drivers (which may know something about the workload that is not known to
> the scheduler).

Yeah, pretty much.
Rafael J. Wysocki July 28, 2020, 3:41 p.m. UTC | #19
On Tuesday, July 21, 2020 1:20:14 AM CEST Francisco Jerez wrote:
> 

[cut]

> > If there is a bug, then what exactly is it, from the users' perspective?
> >
> 
> It can be reproduced easily as follows:
> 
> | echo 1 > /sys/devices/system/cpu/intel_pstate/hwp_dynamic_boost
> | for p in /sys/devices/system/cpu/cpufreq/policy*/energy_performance_preference; do echo performance > $p; done
> 
> Let's make sure that the EPP updates landed on the turbostat output:
> 
> |[..]
> | Core    CPU     Avg_MHz Busy%   Bzy_MHz            HWP_REQ
> | -       -       1       0.05    2396    0x0000000000000000
> | 0       0       1       0.05    2153    0x0000000000002704
> | 0       4       1       0.04    2062    0x0000000000002704
> | 1       1       1       0.02    2938    0x0000000000002704
> | 1       5       2       0.09    2609    0x0000000000002704
> | 2       2       1       0.04    1857    0x0000000000002704
> | 2       6       1       0.05    2561    0x0000000000002704
> | 3       3       0       0.01    1883    0x0000000000002704
> | 3       7       2       0.07    2703    0x0000000000002704
> |[..]
> 
> Now let's do some non-trivial IO activity in order to trigger HWP
> dynamic boost, and watch while random CPUs start losing their EPP
> setting requested via sysfs:
> 
> |[..]
> | Core    CPU     Avg_MHz Busy%   Bzy_MHz            HWP_REQ
> | -       -       16      0.81    2023    0x0000000000000000
> | 0       0       7       0.66    1069    0x0000000080002704
>                                                     ^^
> | 0       4       24      2.19    1116    0x0000000080002704
>                                                     ^^
> | 1       1       18      0.68    2618    0x0000000000002704
> | 1       5       1       0.03    2005    0x0000000000002704
> | 2       2       2       0.07    2512    0x0000000000002704
> | 2       6       33      1.35    2402    0x0000000000002704
> | 3       3       1       0.04    2470    0x0000000000002704
> | 3       7       45      1.42    3185    0x0000000080002704
>                                                     ^^

Actually, that's because intel_pstate_hwp_boost_up() and
intel_pstate_hwp_boost_down() use the hwp_req_cached value
for updating the HWP Request MSR and that is only written to
by intel_pstate_hwp_set() which is only invoked on policy changes,
so the MSR writes from intel_pstate_set_energy_pref_index()
basically get discarded.

So this is a matter of synchronizing intel_pstate_set_policy() with
intel_pstate_set_energy_pref_index() and they both acquire
intel_pstate_limits_lock already, so this shouldn't be too difficult to fix.

Let me cut a patch for that.
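The stale-cache scenario described in this message can be reduced to a few lines. This is a simplified Python model of the C code paths, not the driver itself; only the EPP field layout (bits 24-31 of IA32_HWP_REQUEST) matches the real MSR:

```python
EPP_SHIFT = 24                     # EPP lives in bits 24-31 of the MSR
EPP_MASK = 0xff << EPP_SHIFT

class HwpModel:
    """Simplified stand-ins for hwp_req_cached and the HWP Request MSR."""
    def __init__(self):
        # EPP 0x80 ("balance_performance") as in the turbostat dump.
        self.msr = 0x80002704
        self.hwp_req_cached = 0x80002704

    def set_epp_buggy(self, epp):
        """intel_pstate_set_energy_pref_index() pre-fix: MSR write only."""
        self.msr = (self.msr & ~EPP_MASK) | (epp << EPP_SHIFT)
        # BUG: hwp_req_cached is left stale here.

    def boost(self):
        """intel_pstate_hwp_boost_up/down(): rewrites MSR from the cache."""
        self.msr = self.hwp_req_cached

hwp = HwpModel()
hwp.set_epp_buggy(0x00)            # echo performance > ..._preference
after_write = hwp.msr              # 0x2704: the EPP update landed
hwp.boost()                        # dynamic boost fires on this CPU...
after_boost = hwp.msr              # ...and the EPP update is discarded
```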
Rafael J. Wysocki July 28, 2020, 6:27 p.m. UTC | #20
On Tuesday, July 28, 2020 4:32:22 AM CEST Francisco Jerez wrote:
>
> "Rafael J. Wysocki" <rjw@rjwysocki.net> writes:
> 
> > On Tuesday, July 21, 2020 1:20:14 AM CEST Francisco Jerez wrote:
> >
> > [cut]
> >
> >> >
> >> > However, in the active mode the only updater of hwp_req_cached is
> >> > intel_pstate_hwp_set() and this patch doesn't introduce any
> >> > differences in behavior in that case.
> >> >
> >>
> >> intel_pstate_hwp_set() is the only updater, but there are other
> >> consumers that can get out of sync with the HWP request value written by
> >> intel_pstate_set_energy_pref_index().  intel_pstate_hwp_boost_up() seems
> >> like the most concerning example I named earlier.
> >>
> >> >> > So there may be a short time window after the
> >> >> > intel_pstate_set_energy_pref_index() invocation in which the new EPP
> >> >> > value may not be in effect, but in general there is no guarantee that
> >> >> > the new EPP will take effect immediately after updating the MSR
> >> >> > anyway, so that race doesn't matter.
> >> >> >
> >> >> > That said, that race is avoidable, but I was thinking that trying to
> >> >> > avoid it might not be worth it.  Now I see a better way to avoid it,
> >> >> > though, so I'm going to update the patch to that end.
> >> >> >
> >> >> >> Seems like a bug to me.
> >> >> >
> >> >> > It is racy, but not every race is a bug.
> >> >> >
> >> >>
> >> >> Still seems like there is a bug in intel_pstate_set_energy_pref_index()
> >> >> AFAICT.
> >> >
> >> > If there is a bug, then what exactly is it, from the users' perspective?
> >> >
> >>
> >> It can be reproduced easily as follows:
> >>
> >> | echo 1 > /sys/devices/system/cpu/intel_pstate/hwp_dynamic_boost
> >> | for p in /sys/devices/system/cpu/cpufreq/policy*/energy_performance_preference; do echo performance > $p; done
> >
> > Is this the active mode or the passive mode with the $subject patch applied?
> >
> > If the former, the issue is there regardless of the patch, so it needs to be
> > fixed.
> >
> > If the latter, there should be no effect of hwp_dynamic_boost (which was
> > overlooked by me).
> >
> 
> This seems to be a problem in active mode only, so yeah the bug exists
> regardless of your patch, but the fix is likely to allow you to simplify
> this series slightly if it allows you to take full advantage of
> hwp_req_cached and drop the additional EPP cache.

The additional EPP cache is there to avoid synchronizing the scheduler
context directly with a random process running on another CPU and doing
things that may take time.

The difference between the active mode and the passive mode in this respect
is that in the latter case hwp_req_cached generally needs to be updated from
the scheduler context, whereas in the former case it does not.

[cut]

> >> No, I explicitly dismissed that in my previous reply.
> >
> > But at the same time you seem to agree that without the non-CPU component
> > (or thermal pressure) the existing CPU performance scaling would be
> > sufficient.
> >
> 
> Yes, but not necessarily in order to allow the non-CPU component to draw
> more power as you said above, but also because the existence of a
> bottleneck in a non-CPU component gives us an opportunity to improve the
> energy efficiency of the CPU, regardless of whether that allows the
> workload to run faster.

But why would the bottleneck be there otherwise?

> > [cut]
> >
> >> >> > Yes, it is, and so I don't quite see the connection between it and my question.
> >> >
> >> > Apparently, the unmodified performance scaling governors are not
> >> > sufficient, so there must be something beyond the above which allows
> >> > you to determine the frequency in question and so I'm asking what that
> >> > is.
> >> >
> >>
> >> The underlying heuristic assumption is the same as I outlined above, but
> >> in any implementation of such a heuristic there is necessarily a
> >> trade-off between responsiveness to short-term fluctuations and
> >> long-term energy usage.  This trade-off is a function of the somewhat
> >> arbitrary time interval I was referring to as "immediate past" -- A
> >> longer time parameter allows the controller to consider a greater
> >> portion of the workload's history while computing the response with
> >> optimal energy usage, at the cost of increasing its reaction time to
> >> discontinuous changes in the behavior of the workload (AKA increased
> >> latency).
> >
> > OK
> >
> >> One of the key differences between the governor I proposed and the
> >> pre-existing ones is that it doesn't attempt to come up with a magic
> >> time parameter that works for everybody, because there isn't such a
> >> thing, since different devices and applications have latency
> >> requirements which often differ by orders of magnitude.
> >
> > The problem with this approach is that, generally speaking, the kernel
> > has a definition of "close past" already, which comes from the PELT
> > signal in the scheduler.
> >
> > That signal is used for more than just CPU performance scaling and there
> > is a reason for that, as the scheduler's decisions generally need to be
> > aligned with CPU performance scaling decisions.
> >
> 
> Yes, I fully agree that in an ideal world the response latency
> constraint I was referring to above would be tracked per-scheduling
> entity and used as definition of "close past" by PELT too -- Actually I
> think I mentioned I was working on a prototype with scheduler-level
> tracking of latency constraints, but other folks requested the RFC to be
> based on a simpler interface not requiring scheduler surgery to
> implement, which is why I came up with the PM QoS-based interface.  I
> believe we have discussed exposing this latency constraint as a third
> clamp similar to utilization clamps -- I would be fine with such an
> interface if you think it's the way to go.

I'm not sure yet to be honest.

> That said, in most practical cases it should be possible to take close
> to full advantage of the response latency information from the schedutil
> governor, even if it's provided via PM QoS rather than having
> per-scheduling entity granularity -- The time parameter used to control
> CPU frequency would just be the most strict among the applications
> running in the system, which should prevent performance loss in
> applications with a low latency constraint, but might cause us to miss
> out some opportunities for energy optimization in a multitasking
> environment compared to the full scheduling-based solution.  Doesn't
> seem like a deal-breaker to me though and it makes the code
> substantially easier to review.

I agree on the simplicity side, but the reason why I think that the scheduler
needs to be involved ultimately is because it may have a reason to ignore the
bottleneck and go ahead with its decisions anyway.
Francisco Jerez July 29, 2020, 5:46 a.m. UTC | #21
"Rafael J. Wysocki" <rjw@rjwysocki.net> writes:

> On Tuesday, July 28, 2020 4:32:22 AM CEST Francisco Jerez wrote:
>>
>> "Rafael J. Wysocki" <rjw@rjwysocki.net> writes:
>> 
>> > On Tuesday, July 21, 2020 1:20:14 AM CEST Francisco Jerez wrote:
>> >
>> > [cut]
>> >
>> >> >
>> >> > However, in the active mode the only updater of hwp_req_cached is
>> >> > intel_pstate_hwp_set() and this patch doesn't introduce any
>> >> > differences in behavior in that case.
>> >> >
>> >>
>> >> intel_pstate_hwp_set() is the only updater, but there are other
>> >> consumers that can get out of sync with the HWP request value written by
>> >> intel_pstate_set_energy_pref_index().  intel_pstate_hwp_boost_up() seems
>> >> like the most concerning example I named earlier.
>> >>
>> >> >> > So there may be a short time window after the
>> >> >> > intel_pstate_set_energy_pref_index() invocation in which the new EPP
>> >> >> > value may not be in effect, but in general there is no guarantee that
>> >> >> > the new EPP will take effect immediately after updating the MSR
>> >> >> > anyway, so that race doesn't matter.
>> >> >> >
>> >> >> > That said, that race is avoidable, but I was thinking that trying to
>> >> >> > avoid it might not be worth it.  Now I see a better way to avoid it,
>> >> >> > though, so I'm going to update the patch to that end.
>> >> >> >
>> >> >> >> Seems like a bug to me.
>> >> >> >
>> >> >> > It is racy, but not every race is a bug.
>> >> >> >
>> >> >>
>> >> >> Still seems like there is a bug in intel_pstate_set_energy_pref_index()
>> >> >> AFAICT.
>> >> >
>> >> > If there is a bug, then what exactly is it, from the users' perspective?
>> >> >
>> >>
>> >> It can be reproduced easily as follows:
>> >>
>> >> | echo 1 > /sys/devices/system/cpu/intel_pstate/hwp_dynamic_boost
>> >> | for p in /sys/devices/system/cpu/cpufreq/policy*/energy_performance_preference; do echo performance > $p; done
>> >
>> > Is this the active mode or the passive mode with the $subject patch applied?
>> >
>> > If the former, the issue is there regardless of the patch, so it needs to be
>> > fixed.
>> >
>> > If the latter, there should be no effect of hwp_dynamic_boost (which was
>> > overlooked by me).
>> >
>> 
>> This seems to be a problem in active mode only, so yeah the bug exists
>> regardless of your patch, but the fix is likely to allow you to simplify
>> this series slightly if it allows you to take full advantage of
>> hwp_req_cached and drop the additional EPP cache.
>
> The additional EPP cache is there to avoid synchronizing the scheduler
> context directly with a random process running on another CPU and doing
> things that may take time.
>
> The difference between the active mode and the passive mode in this respect
> is that in the latter case hwp_req_cached generally needs to be updated from
> the scheduler context, whereas in the former case it does not.
>

Hm, that's unfortunate.  Though I'd be surprised to see any appreciable
performance penalty from synchronizing with a sysfs handler that should
hardly ever be called.  Anyway thanks for the fix.

> [cut]
>
>> >> No, I explicitly dismissed that in my previous reply.
>> >
>> > But at the same time you seem to agree that without the non-CPU component
>> > (or thermal pressure) the existing CPU performance scaling would be
>> > sufficient.
>> >
>> 
>> Yes, but not necessarily in order to allow the non-CPU component to draw
>> more power as you said above, but also because the existence of a
>> bottleneck in a non-CPU component gives us an opportunity to improve the
>> energy efficiency of the CPU, regardless of whether that allows the
>> workload to run faster.
>
> But why would the bottleneck be there otherwise?
>

Because some resource of the system (e.g. memory bandwidth, GPU fill
rate) may be close to 100% utilized, causing a bottleneck for reasons
unrelated to its energy usage.

>> > [cut]
>> >
>> >> > Yes, it is, and so I don't quite see the connection between it and my question.
>> >> >
>> >> > Apparently, the unmodified performance scaling governors are not
>> >> > sufficient, so there must be something beyond the above which allows
>> >> > you to determine the frequency in question and so I'm asking what that
>> >> > is.
>> >> >
>> >>
>> >> The underlying heuristic assumption is the same as I outlined above, but
>> >> in any implementation of such a heuristic there is necessarily a
>> >> trade-off between responsiveness to short-term fluctuations and
>> >> long-term energy usage.  This trade-off is a function of the somewhat
>> >> arbitrary time interval I was referring to as "immediate past" -- A
>> >> longer time parameter allows the controller to consider a greater
>> >> portion of the workload's history while computing the response with
>> >> optimal energy usage, at the cost of increasing its reaction time to
>> >> discontinuous changes in the behavior of the workload (AKA increased
>> >> latency).
>> >
>> > OK
>> >
>> >> One of the key differences between the governor I proposed and the
>> >> pre-existing ones is that it doesn't attempt to come up with a magic
>> >> time parameter that works for everybody, because there isn't such a
>> >> thing, since different devices and applications have latency
>> >> requirements which often differ by orders of magnitude.
>> >
>> > The problem with this approach is that, generally speaking, the kernel
>> > has a definition of "close past" already, which comes from the PELT
>> > signal in the scheduler.
>> >
>> > That signal is used for more than just CPU performance scaling and there
>> > is a reason for that, as the scheduler's decisions generally need to be
>> > aligned with CPU performance scaling decisions.
>> >
>> 
>> Yes, I fully agree that in an ideal world the response latency
>> constraint I was referring to above would be tracked per-scheduling
>> entity and used as definition of "close past" by PELT too -- Actually I
>> think I mentioned I was working on a prototype with scheduler-level
>> tracking of latency constraints, but other folks requested the RFC to be
>> based on a simpler interface not requiring scheduler surgery to
>> implement, which is why I came up with the PM QoS-based interface.  I
>> believe we have discussed exposing this latency constraint as a third
>> clamp similar to utilization clamps -- I would be fine with such an
>> interface if you think it's the way to go.
>
> I'm not sure yet to be honest.
>
>> That said, in most practical cases it should be possible to take close
>> to full advantage of the response latency information from the schedutil
>> governor, even if it's provided via PM QoS rather than having
>> per-scheduling entity granularity -- The time parameter used to control
>> CPU frequency would just be the most strict among the applications
>> running in the system, which should prevent performance loss in
>> applications with a low latency constraint, but might cause us to miss
>> out some opportunities for energy optimization in a multitasking
>> environment compared to the full scheduling-based solution.  Doesn't
>> seem like a deal-breaker to me though and it makes the code
>> substantially easier to review.
>
> I agree on the simplicity side, but the reason why I think that the scheduler
> needs to be involved ultimately is because it may have a reason to ignore the
> bottleneck and go ahead with its decisions anyway.

Yes, definitely.  One of the cases I was considering where it would make
sense to do that is whenever a process with e.g. realtime priority
starts executing on a given GPU.  But that can be done regardless of
whether the latency constraint is tracked through the scheduler or PM
QoS, similar to the way you handle realtime priority in schedutil right
now.  The main functional difference between the scheduling and PM QoS
approach would be the granularity of the latency constraint computation,
we could always arrange for the same set of processes to be excluded
from the optimization.
Rafael J. Wysocki July 29, 2020, 5:52 p.m. UTC | #22
On Wednesday, July 29, 2020 7:46:08 AM CEST Francisco Jerez wrote:
> 
> "Rafael J. Wysocki" <rjw@rjwysocki.net> writes:
> 
> > On Tuesday, July 28, 2020 4:32:22 AM CEST Francisco Jerez wrote:
> >>
> >> "Rafael J. Wysocki" <rjw@rjwysocki.net> writes:
> >>=20
> >> > On Tuesday, July 21, 2020 1:20:14 AM CEST Francisco Jerez wrote:
> >> >
> >> > [cut]
> >> >
> >> >> >
> >> >> > However, in the active mode the only updater of hwp_req_cached is
> >> >> > intel_pstate_hwp_set() and this patch doesn't introduce any
> >> >> > differences in behavior in that case.
> >> >> >
> >> >>
> >> >> intel_pstate_hwp_set() is the only updater, but there are other
> >> >> consumers that can get out of sync with the HWP request value written by
> >> >> intel_pstate_set_energy_pref_index().  intel_pstate_hwp_boost_up() seems
> >> >> like the most concerning example I named earlier.
> >> >>
> >> >> >> > So there may be a short time window after the
> >> >> >> > intel_pstate_set_energy_pref_index() invocation in which the new EPP
> >> >> >> > value may not be in effect, but in general there is no guarantee that
> >> >> >> > the new EPP will take effect immediately after updating the MSR
> >> >> >> > anyway, so that race doesn't matter.
> >> >> >> >
> >> >> >> > That said, that race is avoidable, but I was thinking that trying to
> >> >> >> > avoid it might not be worth it.  Now I see a better way to avoid it,
> >> >> >> > though, so I'm going to update the patch to that end.
> >> >> >> >
> >> >> >> >> Seems like a bug to me.
> >> >> >> >
> >> >> >> > It is racy, but not every race is a bug.
> >> >> >> >
> >> >> >>
> >> >> >> >> Still seems like there is a bug in intel_pstate_set_energy_pref_index()
> >> >> >> >> AFAICT.
> >> >> >> >
> >> >> >> > If there is a bug, then what exactly is it, from the users' perspective?
> >> >> >> >
> >> >>
> >> >> It can be reproduced easily as follows:
> >> >>
> >> >> | echo 1 > /sys/devices/system/cpu/intel_pstate/hwp_dynamic_boost
> >> >> | for p in /sys/devices/system/cpu/cpufreq/policy*/energy_performance_preference; do echo performance > $p; done
> >> >
> >> > Is this the active mode or the passive mode with the $subject patch applied?
> >> >
> >> > If the former, the issue is there regardless of the patch, so it needs to be
> >> > fixed.
> >> >
> >> > If the latter, there should be no effect of hwp_dynamic_boost (which was
> >> > overlooked by me).
> >> >
> >>
> >> This seems to be a problem in active mode only, so yeah the bug exists
> >> regardless of your patch, but the fix is likely to allow you to simplify
> >> this series slightly if it allows you to take full advantage of
> >> hwp_req_cached and drop the additional EPP cache.
> >
> > The additional EPP cache is there to avoid synchronizing the scheduler
> > context directly with a random process running on another CPU and doing
> > things that may take time.
> >
> > The difference between the active mode and the passive mode in this respect
> > is that in the latter case hwp_req_cached generally needs to be updated from
> > the scheduler context, whereas in the former case it does not.
> >
> 
> Hm, that's unfortunate.  Though I'd be surprised to see any appreciable
> performance penalty from synchronizing with a sysfs handler that should
> hardly ever be called.  Anyway thanks for the fix.

No problem.

> > [cut]
> >
> >> >> No, I explicitly dismissed that in my previous reply.
> >> >
> >> > But at the same time you seem to agree that without the non-CPU component
> >> > (or thermal pressure) the existing CPU performance scaling would be
> >> > sufficient.
> >> >
> >>
> >> Yes, but not necessarily in order to allow the non-CPU component to draw
> >> more power as you said above, but also because the existence of a
> >> bottleneck in a non-CPU component gives us an opportunity to improve the
> >> energy efficiency of the CPU, regardless of whether that allows the
> >> workload to run faster.
> >
> > But why would the bottleneck be there otherwise?
> >
> 
> Because some resource of the system (e.g. memory bandwidth, GPU fill
> rate) may be close to 100% utilized, causing a bottleneck for reasons
> unrelated to its energy usage.

Well, not quite.  Or at least in that case the performance cannot be improved
by limiting the CPU frequency below the frequency looked for by scaling
governors, AFAICS.

Scaling governors generally look for the maximum frequency at which there is no
CPU idle time in the workload.  At that frequency the CPU time required by the
workload to achieve the maximum performance is equal to the total CPU time
available to it.  I will refer to that frequency as the maximum effective
frequency (MEF) of the workload.

By definition, running at frequencies above the MEF does not improve
performance, but it causes CPU idle time to appear.  OTOH running at
frequencies below the MEF increases the CPU time required by the workload
to achieve the maximum performance, so effectively the workload does
not get enough CPU time for the performance to be maximum, so it is lower
than at the MEF.
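The MEF concept described in the last two paragraphs can be sketched as a toy numerical model (all constants here are invented for illustration; this is not how any real scaling governor is implemented):

```python
# Toy model of the MEF: a workload needs a fixed number of CPU cycles per
# period.  Below the MEF it is starved of CPU time; above it, idle appears.

def performance(freq_hz, work_cycles, period_s):
    """Fraction of maximum performance at a given CPU frequency."""
    available_cycles = freq_hz * period_s
    return min(1.0, available_cycles / work_cycles)

def idle_fraction(freq_hz, work_cycles, period_s):
    """CPU idle time, which appears only when running above the MEF."""
    busy = min(1.0, work_cycles / (freq_hz * period_s))
    return 1.0 - busy

# A workload needing 2e9 cycles per second has MEF = 2 GHz:
assert performance(2.0e9, 2e9, 1.0) == 1.0    # full performance at the MEF
assert idle_fraction(2.0e9, 2e9, 1.0) == 0.0  # and no idle time yet
assert performance(1.5e9, 2e9, 1.0) == 0.75   # below the MEF: perf loss
assert idle_fraction(3.0e9, 2e9, 1.0) > 0.0   # above the MEF: idle appears
```

The asserts restate the definition: at the MEF the workload just barely gets all the CPU time it needs, frequencies above it only create idle time, and frequencies below it cost performance.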

Of course, the MEF is well-defined as long as the processor does not share
the power budget with another component that is also used by the workload
(say, a GPU).  Without the sharing of a power budget, the MEF can be determined
by looking at the CPU idle time (or CPU busy time, or CPU load, whichever is
the most convenient) alone, because it already depends on the speed of any
memory etc. accessed by the workload, and slowing down the processor doesn't
improve the performance (because the other components don't run any faster
as a result of that).

However, if the processor is sharing the power budget with a GPU (say), it
is not enough to look at the CPU idle time to determine the MEF, because
slowing down the processor generally causes the GPU to get more power, which
allows it to run faster; the CPUs can then do more work, because they spend less
time waiting for the GPU, so the CPU time available to the workload effectively
increases and it can achieve the maximum performance at a lower frequency.
So there is an "effective MEF" depending on the relative performance balance
between the processor and the GPU and on what "fraction" of the workload
runs on the GPU.
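The power-budget effect described above can be illustrated with a deliberately crude model (the cubic CPU power curve, the linear GPU scaling and every constant are invented; only the direction of the effect matters):

```python
# Crude model of a shared package power budget: lowering the CPU frequency
# frees power for the GPU, so a GPU-bound workload can speed up even though
# the CPU alone got "slower".

TDP_W = 15.0  # fixed package power budget shared by CPU and GPU

def cpu_power(freq_ghz):
    """Dynamic power grows roughly with f^3 (P ~ f * V^2, V ~ f)."""
    return 1.2 * freq_ghz ** 3

def workload_throughput(cpu_freq_ghz):
    """Throughput of a GPU-bound workload.

    The GPU gets whatever is left of the budget, and in this toy model its
    throughput scales with that power share, assuming the CPU stays fast
    enough to keep the GPU fed over the range sketched here.
    """
    return max(0.0, TDP_W - cpu_power(cpu_freq_ghz))

# Limiting the CPU to 1.6 GHz instead of 2.2 GHz frees power for the GPU,
# so this GPU-bound workload runs faster despite the "slower" CPU:
assert workload_throughput(1.6) > workload_throughput(2.2)
```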

This means that the power budget sharing is essential here and the "if the
energy-efficiency of the processor is improved, the other components get
more power as a bonus" argument is not really valid.

The frequency of the processor gets limited in order for the other components
to get more power, which then allows the processor to do more work in the
same time at the same frequency, so it becomes more energy-efficient.
Francisco Jerez July 30, 2020, 12:49 a.m. UTC | #23
"Rafael J. Wysocki" <rjw@rjwysocki.net> writes:

> On Wednesday, July 29, 2020 7:46:08 AM CEST Francisco Jerez wrote:
>> 
>> "Rafael J. Wysocki" <rjw@rjwysocki.net> writes:
>> 
>> > On Tuesday, July 28, 2020 4:32:22 AM CEST Francisco Jerez wrote:
>> >>
>> >> "Rafael J. Wysocki" <rjw@rjwysocki.net> writes:
>> >>=20
>> >> > On Tuesday, July 21, 2020 1:20:14 AM CEST Francisco Jerez wrote:
>> >> >
>> >> > [cut]
>> >> >
>> >> >> >
>> >> >> > However, in the active mode the only updater of hwp_req_cached is
>> >> >> > intel_pstate_hwp_set() and this patch doesn't introduce any
>> >> >> > differences in behavior in that case.
>> >> >> >
>> >> >>
>> >> >> intel_pstate_hwp_set() is the only updater, but there are other
>> >> >> consumers that can get out of sync with the HWP request value written by
>> >> >> intel_pstate_set_energy_pref_index().  intel_pstate_hwp_boost_up() seems
>> >> >> like the most concerning example I named earlier.
>> >> >>
>> >> >> >> > So there may be a short time window after the
>> >> >> >> > intel_pstate_set_energy_pref_index() invocation in which the new EPP
>> >> >> >> > value may not be in effect, but in general there is no guarantee that
>> >> >> >> > the new EPP will take effect immediately after updating the MSR
>> >> >> >> > anyway, so that race doesn't matter.
>> >> >> >> >
>> >> >> >> > That said, that race is avoidable, but I was thinking that trying to
>> >> >> >> > avoid it might not be worth it.  Now I see a better way to avoid it,
>> >> >> >> > though, so I'm going to update the patch to that end.
>> >> >> >> >
>> >> >> >> >> Seems like a bug to me.
>> >> >> >> >
>> >> >> >> > It is racy, but not every race is a bug.
>> >> >> >> >
>> >> >> >>
>> >> >> >> Still seems like there is a bug in intel_pstate_set_energy_pref_index()
>> >> >> >> AFAICT.
>> >> >> >
>> >> >> > If there is a bug, then what exactly is it, from the users' perspective?
>> >> >> >
>> >> >>
>> >> >> It can be reproduced easily as follows:
>> >> >>
>> >> >> | echo 1 > /sys/devices/system/cpu/intel_pstate/hwp_dynamic_boost
>> >> >> | for p in /sys/devices/system/cpu/cpufreq/policy*/energy_performance_preference; do echo performance > $p; done
>> >> >
>> >> > Is this the active mode or the passive mode with the $subject patch applied?
>> >> >
>> >> > If the former, the issue is there regardless of the patch, so it needs to be
>> >> > fixed.
>> >> >
>> >> > If the latter, there should be no effect of hwp_dynamic_boost (which was
>> >> > overlooked by me).
>> >> >
>> >>
>> >> This seems to be a problem in active mode only, so yeah the bug exists
>> >> regardless of your patch, but the fix is likely to allow you to simplify
>> >> this series slightly if it allows you to take full advantage of
>> >> hwp_req_cached and drop the additional EPP cache.
>> >
>> > The additional EPP cache is there to avoid synchronizing the scheduler
>> > context directly with a random process running on another CPU and doing
>> > things that may take time.
>> >
>> > The difference between the active mode and the passive mode in this respect
>> > is that in the latter case hwp_req_cached generally needs to be updated from
>> > the scheduler context, whereas in the former case it does not.
>> >
>> 
>> Hm, that's unfortunate.  Though I'd be surprised to see any appreciable
>> performance penalty from synchronizing with a sysfs handler that should
>> hardly ever be called.  Anyway thanks for the fix.
>
> No problem.
>
>> > [cut]
>> >
>> >> >> No, I explicitly dismissed that in my previous reply.
>> >> >
>> >> > But at the same time you seem to agree that without the non-CPU component
>> >> > (or thermal pressure) the existing CPU performance scaling would be
>> >> > sufficient.
>> >> >
>> >>
>> >> Yes, but not necessarily in order to allow the non-CPU component to draw
>> >> more power as you said above, but also because the existence of a
>> >> bottleneck in a non-CPU component gives us an opportunity to improve the
>> >> energy efficiency of the CPU, regardless of whether that allows the
>> >> workload to run faster.
>> >
>> > But why would the bottleneck be there otherwise?
>> >
>> 
>> Because some resource of the system (e.g. memory bandwidth, GPU fill
>> rate) may be close to 100% utilized, causing a bottleneck for reasons
>> unrelated to its energy usage.
>
> Well, not quite.  Or at least in that case the performance cannot be improved
> by limiting the CPU frequency below the frequency looked for by scaling
> governors, AFAICS.
>

Right, but it might still be possible to improve the energy efficiency
of the workload even if its throughput cannot be improved, which seems
like a worthwhile purpose in itself.

> Scaling governors generally look for the maximum frequency at which there is no
> CPU idle time in the workload.  At that frequency the CPU time required by the
> workload to achieve the maximum performance is equal to the total CPU time
> available to it.  I will refer to that frequency as the maximum effective
> frequency (MEF) of the workload.
>
> By definition, running at frequencies above the MEF does not improve
> performance, but it causes CPU idle time to appear.  OTOH running at
> frequencies below the MEF increases the CPU time required by the workload
> to achieve the maximum performance, so effectively the workload does
> not get enough CPU time for the performance to be maximum, so it is lower
> than at the MEF.
>

Yes, we agree on that.

> Of course, the MEF is well-defined as long as the processor does not share
> the power budget with another component that is also used by the workload
> (say, a GPU).  Without the sharing of a power budget, the MEF can be determined
> by looking at the CPU idle time (or CPU busy time, or CPU load, whichever is
> the most convenient) alone, because it already depends on the speed of any
> memory etc accessed by the workload and slowing down the processor doesn't
> improve the performance (because the other components don't run any faster
> as a result of that).
>
> However, if the processor is sharing the power budget with a GPU (say), it
> is not enough to look at the CPU idle time to determine the MEF, because
> slowing down the processor generally causes the GPU to get more power which
> allows it to run faster and CPUs can do more work, because they spend less
> time waiting for the GPU, so the CPU time available to the workload effectively
> increases and it can achieve the maximum performance at a lower frequency.
> So there is "effective MEF" depending on the relative performance balance
> between the processor and the GPU and on what "fraction" of the workload
> runs on the GPU.
>

That doesn't mean that the MEF isn't well-defined in systems with a
shared power budget.  If you define it as the lowest frequency at which
the workload reaches maximum throughput, then there still is one even if
the system is TDP-bound: the maximum frequency at which the CPU and
device maintain their optimal power balance -- Above it the power budget
of the device will be constrained, causing it to run at less than
optimal throughput, also causing the workload as a whole to run at less
than optimal throughput.

That said I agree with you that it takes more than looking at the CPU
utilization in order to determine the MEF in a system with a shared
power budget -- Until you've reached a steady state close enough to the
optimal power balance, at which point looking at the CPU utilization is
really all it takes in order for the governor to estimate the MEF.

IOW, introducing additional power budget variables (in combination with
additional power curve information from both the CPU and device) *might*
help you reach the optimal steady state from a suboptimal state more
quickly in principle, but it won't have any effect on the optimality of
the final steady state as soon as it's reached.

Does that mean it's essential to introduce such power variables in order
for the controller to approach the optimal steady state?  No, because
any realistic controller attempting to approximate the MEF of the
workload (whether it's monitoring the power consumption variables or
not) necessarily needs to overshoot that MEF estimate by some factor in
order to avoid getting stuck at a low frequency whenever the load
fluctuates above the current MEF.  This means that even if at some point
the power balance is far enough from the optimal ratio that the initial
MEF estimate is off by a fraction greater than the overshoot factor of
the controller, the controller will get immediate feedback of the
situation as soon as the device throughput ramps up due to the released
power budget, allowing it to make a more accurate approximation of the
real MEF in a small number of iterations (of the order of 1 iteration
surprisingly frequently).
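The overshoot-based feedback loop described above can be sketched roughly as follows (the overshoot factor, the update rule and the workload model are all hypothetical; this is an illustration of the argument, not the controller from the patch series):

```python
# Sketch of a governor that estimates the MEF from observed utilization and
# deliberately overshoots the estimate so that load fluctuations above the
# current estimate remain observable.

OVERSHOOT = 1.25  # deliberately run above the MEF estimate

def next_frequency(cur_freq_hz, busy_fraction, f_min, f_max):
    """One controller iteration.

    A busy fraction observed at cur_freq_hz implies the workload consumed
    cur_freq_hz * busy_fraction cycles per second, which is taken as the
    MEF estimate for the next period.
    """
    mef_estimate = cur_freq_hz * busy_fraction
    return max(f_min, min(f_max, mef_estimate * OVERSHOOT))

# A workload with a true MEF of 2 GHz, starting from the maximum frequency:
true_mef, f = 2.0e9, 3.6e9
for _ in range(5):
    busy = min(1.0, true_mef / f)  # idle time appears only above the MEF
    f = next_frequency(f, busy, 0.8e9, 3.6e9)

# The loop settles at OVERSHOOT * true_mef after very few iterations:
assert abs(f - OVERSHOOT * true_mef) / true_mef < 0.01
```

In this simplified setting the estimate converges in one iteration, matching the "of the order of 1 iteration" observation above; a real workload with fluctuating utilization would of course take longer.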

> This means that the power budget sharing is essential here and the "if the
> energy-efficiency of the processor is improved, the other components get
> more power as a bonus" argument is not really valid.
>

That was just a statement of my goals while working on the algorithm
[improve the energy efficiency of the CPU in presence of an IO
bottleneck], it's therefore axiomatic in nature rather than some sort of
logical conclusion that can be dismissed as invalid.  You might say you
have different goals in mind but that doesn't mean other people's are
not valid.

> The frequency of the processor gets limited in order for the other components
> to get more power, which then allows the processor to do more work in the
> same time at the same frequency, so it becomes more energy-efficient.

Whenever the throughput of the workload is limited by its power budget,
then yes, sure, but even when that's not the case it can be valuable to
reduce the amount of energy the system is consuming in order to perform
a certain task.  Ask the guy playing a videogame or watching a movie on
battery power while sitting on a train.  Or the datacenter operator
unable to bring another node online because of their power/thermal
dissipation budget.
Rafael J. Wysocki July 31, 2020, 5:52 p.m. UTC | #24
On Thursday, July 30, 2020 2:49:34 AM CEST Francisco Jerez wrote:
> 

[cut]

> >> >
> >> >> >> No, I explicitly dismissed that in my previous reply.
> >> >> >
> >> >> > But at the same time you seem to agree that without the non-CPU component
> >> >> > (or thermal pressure) the existing CPU performance scaling would be
> >> >> > sufficient.
> >> >> >
> >> >>
> >> >> Yes, but not necessarily in order to allow the non-CPU component to draw
> >> >> more power as you said above, but also because the existence of a
> >> >> bottleneck in a non-CPU component gives us an opportunity to improve the
> >> >> energy efficiency of the CPU, regardless of whether that allows the
> >> >> workload to run faster.
> >> >
> >> > But why would the bottleneck be there otherwise?
> >> >
> >>
> >> Because some resource of the system (e.g. memory bandwidth, GPU fill
> >> rate) may be close to 100% utilized, causing a bottleneck for reasons
> >> unrelated to its energy usage.
> >
> > Well, not quite.  Or at least in that case the performance cannot be improved
> > by limiting the CPU frequency below the frequency looked for by scaling
> > governors, AFAICS.
> >
> 
> Right, but it might still be possible to improve the energy efficiency
> of the workload even if its throughput cannot be improved, which seems
> like a worthwhile purpose in itself.

My point is that in this case the energy-efficiency of the processor cannot
be improved without decreasing performance.

For the processors that are relevant here, the most energy-efficient way to
run them is in the minimum P-state, but that rarely provides the required
performance.  Without knowledge of how much performance really is required,
we assume the maximum achievable.  Anything else would need to involve some extra
policy knobs (which are missing ATM) and that is another problem.

I'm not saying that it is not a problem, but it is not possible to say how much
performance to sacrifice without any input from the user on that.

IOW, this is a topic for another discussion.

> > Scaling governors generally look for the maximum frequency at which there is no
> > CPU idle time in the workload.  At that frequency the CPU time required by the
> > workload to achieve the maximum performance is equal to the total CPU time
> > available to it.  I will refer to that frequency as the maximum effective
> > frequency (MEF) of the workload.
> >
> > By definition, running at frequencies above the MEF does not improve
> > performance, but it causes CPU idle time to appear.  OTOH running at
> > frequencies below the MEF increases the CPU time required by the workload
> > to achieve the maximum performance, so effectively the workload does
> > not get enough CPU time for the performance to be maximum, so it is lower
> > than at the MEF.
> >
> 
> Yes, we agree on that.
> 
> > Of course, the MEF is well-defined as long as the processor does not share
> > the power budget with another component that is also used by the workload
> > (say, a GPU).  Without the sharing of a power budget, the MEF can be determined
> > by looking at the CPU idle time (or CPU busy time, or CPU load, whichever is
> > the most convenient) alone, because it already depends on the speed of any
> > memory etc accessed by the workload and slowing down the processor doesn't
> > improve the performance (because the other components don't run any faster
> > as a result of that).
> >
> > However, if the processor is sharing the power budget with a GPU (say), it
> > is not enough to look at the CPU idle time to determine the MEF, because
> > slowing down the processor generally causes the GPU to get more power which
> > allows it to run faster and CPUs can do more work, because they spend less
> > time waiting for the GPU, so the CPU time available to the workload effectively
> > increases and it can achieve the maximum performance at a lower frequency.
> > So there is "effective MEF" depending on the relative performance balance
> > between the processor and the GPU and on what "fraction" of the workload
> > runs on the GPU.
> >
> 
> That doesn't mean that the MEF isn't well-defined in systems with a
> shared power budget.  If you define it as the lowest frequency at which
> the workload reaches maximum throughput, then there still is one even if
> the system is TDP-bound: the maximum frequency at which the CPU and
> device maintain their optimal power balance -- Above it the power budget
> of the device will be constrained, causing it to run at less than
> optimal throughput, also causing the workload as a whole to run at less
> than optimal throughput.

That is what I am calling the "effective MEF" and the main point is that it
cannot be determined by looking at the CPU idle time alone.

> That said I agree with you that it takes more than looking at the CPU
> utilization in order to determine the MEF in a system with a shared
> power budget -- Until you've reached a steady state close enough to the
> optimal power balance, at which point looking at the CPU utilization is
> really all it takes in order for the governor to estimate the MEF.

I'm not convinced, because in that case there may in principle be many steady
states at "boundary" frequencies, such that the CPU idle time goes from zero to
nonzero when crossing the "boundary".

That generally depends on what OPPs are available (physically) to the
processor and the GPU (I'm using the GPU as an example, but that may be
something else sharing the power budget with the processor, like an FPGA).

For example, if increasing the CPU frequency above a "boundary" does not
cause it to take up enough of the power budget to force the GPU to switch
over to a lower-frequency OPP, it may very well cause some CPU idle time
to appear, but that doesn't mean that the optimum power balance has been
reached.

In general, the GPU needs to be monitored as well as the CPU and that's
why the GPU bottleneck detection in your patches is key.  But having
that in place one could simply put an upper limit on the CPU frequency
through the existing policy max QoS in the cpufreq framework in response
to the GPU bottleneck without changing the scaling governors.
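The clamping mechanism suggested here can be modeled in a few lines (a simplified stand-in for the cpufreq frequency QoS limit, not the kernel's actual freq_qos API):

```python
# Minimal model: the scaling governor keeps estimating the MEF as usual, and
# a separate entity (standing in for a GPU driver that has detected a
# bottleneck) merely clamps the result through a "policy max" limit.

class FreqPolicy:
    def __init__(self, f_min, f_max):
        self.f_min, self.f_max = f_min, f_max
        self.qos_max = f_max  # external limit; defaults to "no limit"

    def set_max_qos(self, freq_hz):
        """E.g. lowered by a GPU driver while it sees a GPU bottleneck."""
        self.qos_max = freq_hz

    def resolve(self, governor_request_hz):
        """Frequency actually programmed: the governor's choice, clamped."""
        return max(self.f_min,
                   min(governor_request_hz, self.f_max, self.qos_max))

policy = FreqPolicy(f_min=0.8e9, f_max=3.6e9)
assert policy.resolve(3.0e9) == 3.0e9  # no GPU bottleneck: unconstrained
policy.set_max_qos(1.6e9)              # GPU bottleneck detected
assert policy.resolve(3.0e9) == 1.6e9  # the governor request gets clamped
policy.set_max_qos(3.6e9)              # bottleneck gone: limit lifted
assert policy.resolve(3.0e9) == 3.0e9
```

The design point being illustrated: the governor itself stays unmodified, and all bottleneck knowledge lives in whoever drives the max limit.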

> IOW, introducing additional power budget variables (in combination with
> additional power curve information from both the CPU and device) *might*
> help you reach the optimal steady state from a suboptimal state more
> quickly in principle, but it won't have any effect on the optimality of
> the final steady state as soon as it's reached.
> 
> Does that mean it's essential to introduce such power variables in order
> for the controller to approach the optimal steady state?  No, because
> any realistic controller attempting to approximate the MEF of the
> workload (whether it's monitoring the power consumption variables or
> not) necessarily needs to overshoot that MEF estimate by some factor in
> order to avoid getting stuck at a low frequency whenever the load
> fluctuates above the current MEF.  This means that even if at some point
> the power balance is far enough from the optimal ratio that the initial
> MEF estimate is off by a fraction greater than the overshoot factor of
> the controller, the controller will get immediate feedback of the
> situation as soon as the device throughput ramps up due to the released
> power budget, allowing it to make a more accurate approximation of the
> real MEF in a small number of iterations (of the order of 1 iteration
> surprisingly frequently).
> 
> > This means that the power budget sharing is essential here and the "if the
> > energy-efficiency of the processor is improved, the other components get
> > more power as a bonus" argument is not really valid.
> >
> 
> That was just a statement of my goals while working on the algorithm
> [improve the energy efficiency of the CPU in presence of an IO
> bottleneck], it's therefore axiomatic in nature rather than some sort of
> logical conclusion that can be dismissed as invalid.  You might say you
> have different goals in mind but that doesn't mean other people's are
> not valid.

Well, this really isn't about the goals but about understanding of what
really happens.

What I'm trying to say is that the sharing of energy budget is a necessary
condition allowing the processor's energy-efficiency to be improved without
sacrificing performance.

> > The frequency of the processor gets limited in order for the other components
> > to get more power, which then allows the processor to do more work in the
> > same time at the same frequency, so it becomes more energy-efficient.
> 
> Whenever the throughput of the workload is limited by its power budget,
> then yes, sure, but even when that's not the case it can be valuable to
> reduce the amount of energy the system is consuming in order to perform
> a certain task.

Yes, it is valuable, but this is a separate problem and addressing it
requires taking additional user input (regarding the energy vs performance
policy) into account.
Francisco Jerez July 31, 2020, 10:43 p.m. UTC | #25
"Rafael J. Wysocki" <rjw@rjwysocki.net> writes:

> On Thursday, July 30, 2020 2:49:34 AM CEST Francisco Jerez wrote:
>> 
>
> [cut]
>
>> >> >
>> >> >> >> No, I explicitly dismissed that in my previous reply.
>> >> >> >
>> >> >> > But at the same time you seem to agree that without the non-CPU component
>> >> >> > (or thermal pressure) the existing CPU performance scaling would be
>> >> >> > sufficient.
>> >> >> >
>> >> >>
>> >> >> Yes, but not necessarily in order to allow the non-CPU component to draw
>> >> >> more power as you said above, but also because the existence of a
>> >> >> bottleneck in a non-CPU component gives us an opportunity to improve the
>> >> >> energy efficiency of the CPU, regardless of whether that allows the
>> >> >> workload to run faster.
>> >> >
>> >> > But why would the bottleneck be there otherwise?
>> >> >
>> >>
>> >> Because some resource of the system (e.g. memory bandwidth, GPU fill
>> >> rate) may be close to 100% utilized, causing a bottleneck for reasons
>> >> unrelated to its energy usage.
>> >
>> > Well, not quite.  Or at least in that case the performance cannot be improved
>> > by limiting the CPU frequency below the frequency looked for by scaling
>> > governors, AFAICS.
>> >
>> 
>> Right, but it might still be possible to improve the energy efficiency
>> of the workload even if its throughput cannot be improved, which seems
>> like a worthwhile purpose in itself.
>
> My point is that in this case the energy-efficiency of the processor cannot
> be improved without decreasing performance.
>

Well in principle it can whenever there is a bottleneck in a non-CPU
component.

> For the processors that are relevant here, the most energy-efficient way to
> run them is in the minimum P-state, but that rarely provides the required
> performance.  Without the knowledge on how much performance really is required
> we assume maximum achievable.  Anything else would need to involve some extra
> policy knobs (which are missing ATM) and that is another problem.
>

But there is a middle ground between limiting the workload to run at the
minimum P-state and having it run at the maximum achievable P-state.
The MEF estimation we were just talking about can help.  However due to
its heuristic nature I certainly see the merit of having policy knobs
allowing the optimization to be controlled by users, which is why I
added such controls in v2.99, at your request.

> I'm not saying that it is not a problem, but it is not possible to say how much
> performance to sacrifice without any input from the user on that.
>
> IOW, this is a topic for another discussion.
>

I have the vague recollection that you brought that up already and I
agreed and implemented the changes you asked for.

>> > Scaling governors generally look for the maximum frequency at which there is no
>> > CPU idle time in the workload.  At that frequency the CPU time required by the
>> > workload to achieve the maximum performance is equal to the total CPU time
>> > available to it.  I will refer to that frequency as the maximum effective
>> > frequency (MEF) of the workload.
>> >
>> > By definition, running at frequencies above the MEF does not improve
>> > performance, but it causes CPU idle time to appear.  OTOH running at
>> > frequencies below the MEF increases the CPU time required by the workload
>> > to achieve the maximum performance, so effectively the workload does
>> > not get enough CPU time for the performance to be maximum, so it is lower
>> > than at the MEF.
>> >
>> 
>> Yes, we agree on that.
>> 
>> > Of course, the MEF is well-defined as long as the processor does not share
>> > the power budget with another component that is also used by the workload
>> > (say, a GPU).  Without the sharing of a power budget, the MEF can be determined
>> > by looking at the CPU idle time (or CPU busy time, or CPU load, whichever is
>> > the most convenient) alone, because it already depends on the speed of any
>> > memory etc accessed by the workload and slowing down the processor doesn't
>> > improve the performance (because the other components don't run any faster
>> > as a result of that).
>> >
>> > However, if the processor is sharing the power budget with a GPU (say), it
>> > is not enough to look at the CPU idle time to determine the MEF, because
>> > slowing down the processor generally causes the GPU to get more power which
>> > allows it to run faster and CPUs can do more work, because they spend less
>> > time waiting for the GPU, so the CPU time available to the workload effectively
>> > increases and it can achieve the maximum performance at a lower frequency.
>> > So there is an "effective MEF" depending on the relative performance balance
>> > between the processor and the GPU and on what "fraction" of the workload
>> > runs on the GPU.
>> >
>> 
>> That doesn't mean that the MEF isn't well-defined in systems with a
>> shared power budget.  If you define it as the lowest frequency at which
>> the workload reaches maximum throughput, then there still is one even if
>> the system is TDP-bound: the maximum frequency at which the CPU and
>> device maintain their optimal power balance -- Above it the power budget
>> of the device will be constrained, causing it to run at less than
>> optimal throughput, also causing the workload as a whole to run at less
>> than optimal throughput.
>
> That is what I am calling the "effective MEF" and the main point is that it
> cannot be determined by looking at the CPU idle time alone.
>

And my main point is that it can (by looking at its average frequency
alone -- closely related but not fully equivalent to CPU idle time) *if*
the workload has reached a steady state close enough to its optimal
power balance.  Then I tried to explain how the governor I implemented
approaches such an optimal steady state from any arbitrary sub-optimal
state *without* relying on monitoring power consumption via RAPL.

>> That said I agree with you that it takes more than looking at the CPU
>> utilization in order to determine the MEF in a system with a shared
>> power budget -- Until you've reached a steady state close enough to the
>> optimal power balance, at which point looking at the CPU utilization is
>> really all it takes in order for the governor to estimate the MEF.
>
> I'm not convinced, because in principle there may be many steady states
> at "boundary" frequencies, such that the CPU idle time goes from zero to
> nonzero when crossing the "boundary", in that case.
>
> That generally depends on what OPPs are available (physically) to the
> processor and the GPU (I'm using the GPU as an example, but that may be
> something else sharing the power budget with the processor, like an FPGA).
>
> For example, if increasing the CPU frequency above a "boundary" does not
> cause it to take up enough of the power budget to force the GPU to switch
> over to a lower-frequency OPP, it may very well cause some CPU idle time
> to appear, but that doesn't mean that the optimum power balance has been
> reached.
>

But you do agree that, under the assumption (1) of steady state, and the
assumption (2) that the processor has already crossed the right
boundaries for the power balance to be close to optimal, it is
sufficient to look at its average frequency ("average" defined with a
time parameter consistent with the latency constraints of the workload)
in order to estimate its MEF.  Or?
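
For concreteness, the "average frequency with a time parameter" could be
something like an exponentially weighted moving average (this is just a
sketch of mine, not code from the patches; the time constant tau standing
in for the workload's latency constraint is an assumption):

```python
import math

def update_avg_freq(avg_khz, sample_khz, dt_s, tau_s):
    # Exponentially weighted moving average: tau_s sets the effective
    # averaging window, chosen to match the latency constraints of the
    # workload; dt_s is the time since the previous frequency sample.
    alpha = 1.0 - math.exp(-dt_s / tau_s)
    return avg_khz + alpha * (sample_khz - avg_khz)
```

With a short tau the average tracks the instantaneous frequency; with a
long tau it smooths over short bursts.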

Due to the optimality assumption (2), the computational power that the
workload can extract out of the bottlenecking device is maximal, which
means that the amount of time the CPU spends waiting for the device per
unit of CPU work is minimal, which means that even if we run the CPU at
a higher frequency it won't take less time per unit of CPU work, which
means that we have reached the MEF.  Due to the steady state assumption
(1) we can then extrapolate the MEF estimate computed for the immediate
past into the immediate future, which is the heuristic part of this.
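
As a rough illustration of that estimation step (the helper name and the
overshoot parameter are mine, not taken from the patches under discussion),
the heuristic could be sketched as:

```python
def estimate_mef_khz(avg_freq_khz, busy_fraction, overshoot=1.25):
    # Under assumptions (1) and (2) above, the work done per unit time is
    # proportional to avg_freq_khz * busy_fraction, so the lowest frequency
    # that would leave no CPU idle time is approximately their product.
    mef = avg_freq_khz * busy_fraction
    # Overshoot the estimate so the controller does not get stuck at a low
    # frequency when the load fluctuates above the current MEF.
    return mef * overshoot

# Example: a CPU averaging 2 GHz while 50% busy suggests a MEF near 1 GHz
# (before the overshoot factor is applied).
```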

Then it remains to show that the controller will approach such an
optimal steady state even if assumption (2) doesn't hold initially,
which is what I tried to do in my previous e-mail at a high level.  Let
me know if my explanation wasn't clear enough.

> In general, the GPU needs to be monitored as well as the CPU and that's
> why the GPU bottleneck detection in your patches is key.  But having
> that in place one could simply put an upper limit on the CPU frequency
> through the existing policy max QoS in the cpufreq framework in response
> to the GPU bottleneck without changing the scaling governors.
>

Yes, at the cost of monitoring the average frequency of every CPU from
every IO device driver that can potentially benefit from improved CPU
energy efficiency, in order for each of them to compute an appropriate
MEF estimate for each CPU.  And at the cost of performance degradation
in a multitasking environment whenever two or more different processes
impose conflicting frequency QoS constraints based on conflicting
latency requirements -- Or in a multitasking environment where a certain
process needs to be excluded from the frequency constraint, which you
were advocating for earlier in this thread.

>> IOW, introducing additional power budget variables (in combination with
>> additional power curve information from both the CPU and device) *might*
>> help you reach the optimal steady state from a suboptimal state more
>> quickly in principle, but it won't have any effect on the optimality of
>> the final steady state as soon as it's reached.
>> 
>> Does that mean it's essential to introduce such power variables in order
>> for the controller to approach the optimal steady state?  No, because
>> any realistic controller attempting to approximate the MEF of the
>> workload (whether it's monitoring the power consumption variables or
>> not) necessarily needs to overshoot that MEF estimate by some factor in
>> order to avoid getting stuck at a low frequency whenever the load
>> fluctuates above the current MEF.  This means that even if at some point
>> the power balance is far enough from the optimal ratio that the initial
>> MEF estimate is off by a fraction greater than the overshoot factor of
>> the controller, the controller will get immediate feedback of the
>> situation as soon as the device throughput ramps up due to the released
>> power budget, allowing it to make a more accurate approximation of the
>> real MEF in a small number of iterations (of the order of 1 iteration
>> surprisingly frequently).
>> 
>> > This means that the power budget sharing is essential here and the "if the
>> > energy-efficiency of the processor is improved, the other components get
>> > more power as a bonus" argument is not really valid.
>> >
>> 
>> That was just a statement of my goals while working on the algorithm
>> [improve the energy efficiency of the CPU in presence of an IO
>> bottleneck], it's therefore axiomatic in nature rather than some sort of
>> logical conclusion that can be dismissed as invalid.  You might say you
>> have different goals in mind but that doesn't mean other people's are
>> not valid.
>
> Well, this really isn't about the goals but about understanding of what
> really happens.
>
> What I'm trying to say is that the sharing of energy budget is a necessary
> condition allowing the processor's energy-efficiency to be improved without
> sacrificing performance.
>

And I disagree it's a necessary condition.  As a counterexample consider
a video game being rendered at 60 FPS on a discrete GPU without energy
budget sharing.  Suppose that only 40% of the CPU computational capacity
is needed in order to achieve that, but the CPU frequency peaks at 80%
of its maximum.  Then assume that we clamp the CPU frequency at 40% of
its maximum, still within the convexity range of its power curve.  While
doing that we have improved the processor's energy efficiency without
sacrificing performance.  And there was no sharing of energy budget
whatsoever.
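
To put rough numbers on that counterexample (assuming, purely for
illustration, an idealized cubic power curve and negligible idle power --
real curves differ):

```python
def rel_power(freq_frac, exponent=3.0):
    # Idealized convex power curve: dynamic power goes roughly as f * V^2
    # and voltage scales with frequency, giving approximately f^3.
    return freq_frac ** exponent

def rel_energy_per_frame(freq_frac, exponent=3.0):
    # The CPU work per frame is fixed, so time per frame scales as 1/f and
    # energy per frame scales as power / f (idle power ignored).
    return rel_power(freq_frac, exponent) / freq_frac

# Clamping from 80% to 40% of the maximum frequency, while the workload
# still only needs 40% of the CPU capacity, cuts the CPU energy per frame
# to a quarter in this model -- without touching the 60 FPS target.
ratio = rel_energy_per_frame(0.4) / rel_energy_per_frame(0.8)
```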

>> > The frequency of the processor gets limited in order for the other components
>> > to get more power, which then allows the processor to do more work in the
>> > same time at the same frequency, so it becomes more energy-efficient.
>> 
>> Whenever the throughput of the workload is limited by its power budget,
>> then yes, sure, but even when that's not the case it can be valuable to
>> reduce the amount of energy the system is consuming in order to perform
>> a certain task.
>
> Yes, it is valuable, but this is a separate problem and addressing it
> requires taking additional user input (regarding the energy vs performance
> policy) into account.

Yes, I agree that taking user input is valuable.  Feel free to provide
any feed-back on that matter if you don't consider the current policy
mechanism to be satisfactory.
Doug Smythies Aug. 2, 2020, 3:17 p.m. UTC | #26
Hi Rafael,

On 2020.07.19 04:43 Rafael J. Wysocki wrote:
> On Fri, Jul 17, 2020 at 3:37 PM Doug Smythies <dsmythies@telus.net> wrote:
> > On 2020.07.16 05:08 Rafael J. Wysocki wrote:
> > > On Wed, Jul 15, 2020 at 10:39 PM Doug Smythies <dsmythies@telus.net> wrote:
> > >> On 2020.07.14 11:16 Rafael J. Wysocki wrote:
> > >> >
> > >> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > >> ...
> > >> > Since the passive mode hasn't worked with HWP at all, and it is not going to
> > >> > the default for HWP systems anyway, I don't see any drawbacks related to making
> > >> > this change, so I would consider this as 5.9 material unless there are any
> > >> > serious objections.
> > >>
> > >> Good point.
> >
> > Actually, for those users that default to passive mode upon boot,
> > this would mean they would find themselves using this.
> > Also, it isn't obvious, from the typical "what driver and what governor"
> > inquiry.
> 
> So the change in behavior is that after this patch
> intel_pstate=passive doesn't imply no_hwp any more.
> 
> That's a very minor difference though and I'm not aware of any adverse
> effects it can cause on HWP systems anyway.

My point was, that it will now default to something where
testing has not been completed.

> The "what governor" is straightforward in the passive mode: that's
> whatever cpufreq governor has been selected.

I think you might have missed my point.
From the normal methods of inquiry one does not know
if HWP is being used or not. Why? Because with
or without HWP one gets the same answers under:

/sys/devices/system/cpu/cpu*/cpufreq/scaling_driver
/sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
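
One way to at least tell whether the processor supports HWP, independent of
those cpufreq attributes, is the "hwp" CPU feature flag (a sketch of mine,
not something from the patch; note it shows hardware support, not whether
the driver actually enabled HWP, which is part of the point above):

```python
def cpu_has_hwp(cpuinfo_text):
    # The x86 "hwp" feature flag appears in the "flags" line of
    # /proc/cpuinfo on processors supporting hardware-managed P-states.
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            return "hwp" in line.split(":", 1)[1].split()
    return False

# Typical use:
# with open("/proc/cpuinfo") as f:
#     print(cpu_has_hwp(f.read()))
```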

... Doug
Rafael J. Wysocki Aug. 3, 2020, 5:08 p.m. UTC | #27
On Sunday, August 2, 2020 5:17:39 PM CEST Doug Smythies wrote:
> Hi Rafael,
> 
> On 2020.07.19 04:43 Rafael J. Wysocki wrote:
> > On Fri, Jul 17, 2020 at 3:37 PM Doug Smythies <dsmythies@telus.net> wrote:
> > > On 2020.07.16 05:08 Rafael J. Wysocki wrote:
> > > > On Wed, Jul 15, 2020 at 10:39 PM Doug Smythies <dsmythies@telus.net> wrote:
> > > >> On 2020.07.14 11:16 Rafael J. Wysocki wrote:
> > > >> >
> > > >> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > > >> ...
> > > >> > Since the passive mode hasn't worked with HWP at all, and it is not going to
> > > >> > the default for HWP systems anyway, I don't see any drawbacks related to making
> > > >> > this change, so I would consider this as 5.9 material unless there are any
> > > >> > serious objections.
> > > >>
> > > >> Good point.
> > >
> > > Actually, for those users that default to passive mode upon boot,
> > > this would mean they would find themselves using this.
> > > Also, it isn't obvious, from the typical "what driver and what governor"
> > > inquiry.
> > 
> > So the change in behavior is that after this patch
> > intel_pstate=passive doesn't imply no_hwp any more.
> > 
> > That's a very minor difference though and I'm not aware of any adverse
> > effects it can cause on HWP systems anyway.
> 
> My point was, that it will now default to something where
> testing has not been completed.
> 
> > The "what governor" is straightforward in the passive mode: that's
> > whatever cpufreq governor has been selected.
> 
> I think you might have missed my point.
> From the normal methods of inquiry one does not know
> if HWP is being used or not. Why? Because with
> or without HWP one gets the same answers under:
> 
> /sys/devices/system/cpu/cpu*/cpufreq/scaling_driver
> /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

Yes, but this is also the case in the active mode, isn't it?

Thanks!
Doug Smythies Aug. 6, 2020, 5:54 a.m. UTC | #28
On 2020.08.03 10:09 Rafael J. Wysocki wrote:
> On Sunday, August 2, 2020 5:17:39 PM CEST Doug Smythies wrote:
> > On 2020.07.19 04:43 Rafael J. Wysocki wrote:
> > > On Fri, Jul 17, 2020 at 3:37 PM Doug Smythies <dsmythies@telus.net> wrote:
> > > > On 2020.07.16 05:08 Rafael J. Wysocki wrote:
> > > > > On Wed, Jul 15, 2020 at 10:39 PM Doug Smythies <dsmythies@telus.net> wrote:
> > > > >> On 2020.07.14 11:16 Rafael J. Wysocki wrote:
> > > > >> >
> > > > >> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > > > >> ...
> > > > >> > Since the passive mode hasn't worked with HWP at all, and it is not going to
> > > > >> > the default for HWP systems anyway, I don't see any drawbacks related to making
> > > > >> > this change, so I would consider this as 5.9 material unless there are any
> > > > >> > serious objections.
> > > > >>
> > > > >> Good point.
> > > >
> > > > Actually, for those users that default to passive mode upon boot,
> > > > this would mean they would find themselves using this.
> > > > Also, it isn't obvious, from the typical "what driver and what governor"
> > > > inquiry.
> > >
> > > So the change in behavior is that after this patch
> > > intel_pstate=passive doesn't imply no_hwp any more.
> > >
> > > That's a very minor difference though and I'm not aware of any adverse
> > > effects it can cause on HWP systems anyway.
> >
> > My point was, that it will now default to something where
> > testing has not been completed.
> >
> > > The "what governor" is straightforward in the passive mode: that's
> > > whatever cpufreq governor has been selected.
> >
> > I think you might have missed my point.
> > From the normal methods of inquiry one does not know
> > if HWP is being used or not. Why? Because with
> > or without HWP one gets the same answers under:
> >
> > /sys/devices/system/cpu/cpu*/cpufreq/scaling_driver
> > /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
> 
> Yes, but this is also the case in the active mode, isn't it?

Yes, fair enough.
But we aren't changing what it means by default
between kernel 5.8 and 5.9-rc1.

... Doug
Rafael J. Wysocki Aug. 6, 2020, 11:39 a.m. UTC | #29
On Thursday, August 6, 2020 7:54:47 AM CEST Doug Smythies wrote:
> On 2020.08.03 10:09 Rafael J. Wysocki wrote:
> > On Sunday, August 2, 2020 5:17:39 PM CEST Doug Smythies wrote:
> > > On 2020.07.19 04:43 Rafael J. Wysocki wrote:
> > > > On Fri, Jul 17, 2020 at 3:37 PM Doug Smythies <dsmythies@telus.net> wrote:
> > > > > On 2020.07.16 05:08 Rafael J. Wysocki wrote:
> > > > > > On Wed, Jul 15, 2020 at 10:39 PM Doug Smythies <dsmythies@telus.net> wrote:
> > > > > >> On 2020.07.14 11:16 Rafael J. Wysocki wrote:
> > > > > >> >
> > > > > >> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > > > > >> ...
> > > > > >> > Since the passive mode hasn't worked with HWP at all, and it is not going to
> > > > > >> > the default for HWP systems anyway, I don't see any drawbacks related to making
> > > > > >> > this change, so I would consider this as 5.9 material unless there are any
> > > > > >> > serious objections.
> > > > > >>
> > > > > >> Good point.
> > > > >
> > > > > Actually, for those users that default to passive mode upon boot,
> > > > > this would mean they would find themselves using this.
> > > > > Also, it isn't obvious, from the typical "what driver and what governor"
> > > > > inquiry.
> > > >
> > > > So the change in behavior is that after this patch
> > > > intel_pstate=passive doesn't imply no_hwp any more.
> > > >
> > > > That's a very minor difference though and I'm not aware of any adverse
> > > > effects it can cause on HWP systems anyway.
> > >
> > > My point was, that it will now default to something where
> > > testing has not been completed.
> > >
> > > > The "what governor" is straightforward in the passive mode: that's
> > > > whatever cpufreq governor has been selected.
> > >
> > > I think you might have missed my point.
> > > From the normal methods of inquiry one does not know
> > > if HWP is being used or not. Why? Because with
> > > or without HWP one gets the same answers under:
> > >
> > > /sys/devices/system/cpu/cpu*/cpufreq/scaling_driver
> > > /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
> > 
> > Yes, but this is also the case in the active mode, isn't it?
> 
> Yes, fair enough.
> But we aren't changing what it means by default
> between kernel 5.8 and 5.9-rc1.

No, we aren't.

The only (expected) change is when booting with intel_pstate=passive and
without intel_pstate=no_hwp in the command line.

Which should be easy enough to address by adding intel_pstate=no_hwp to the
command line in 5.9-rc1 and later (to achieve the same behavior after a
fresh boot).
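
On a GRUB-based distribution, for example, that could look like the
following (file paths and the update command vary by distribution; this is
only an illustration):

```
# /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="... intel_pstate=passive intel_pstate=no_hwp"
```

followed by regenerating the GRUB configuration (e.g. "sudo update-grub" on
Debian-style systems) and rebooting.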

Cheers!

Patch
diff mbox series

Index: linux-pm/drivers/cpufreq/intel_pstate.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/intel_pstate.c
+++ linux-pm/drivers/cpufreq/intel_pstate.c
@@ -36,6 +36,7 @@ 
 #define INTEL_PSTATE_SAMPLING_INTERVAL	(10 * NSEC_PER_MSEC)
 
 #define INTEL_CPUFREQ_TRANSITION_LATENCY	20000
+#define INTEL_CPUFREQ_TRANSITION_DELAY_HWP	5000
 #define INTEL_CPUFREQ_TRANSITION_DELAY		500
 
 #ifdef CONFIG_ACPI
@@ -222,6 +223,7 @@  struct global_params {
  *			preference/bias
  * @epp_saved:		Saved EPP/EPB during system suspend or CPU offline
  *			operation
+ * @epp_cached:		Cached HWP energy-performance preference value
  * @hwp_req_cached:	Cached value of the last HWP Request MSR
  * @hwp_cap_cached:	Cached value of the last HWP Capabilities MSR
  * @last_io_update:	Last time when IO wake flag was set
@@ -259,6 +261,7 @@  struct cpudata {
 	s16 epp_policy;
 	s16 epp_default;
 	s16 epp_saved;
+	s16 epp_cached;
 	u64 hwp_req_cached;
 	u64 hwp_cap_cached;
 	u64 last_io_update;
@@ -676,6 +679,8 @@  static int intel_pstate_set_energy_pref_
 
 		value |= (u64)epp << 24;
 		ret = wrmsrl_on_cpu(cpu_data->cpu, MSR_HWP_REQUEST, value);
+
+		WRITE_ONCE(cpu_data->epp_cached, epp);
 	} else {
 		if (epp == -EINVAL)
 			epp = (pref_index - 1) << 2;
@@ -2047,6 +2052,7 @@  static int intel_pstate_init_cpu(unsigne
 		cpu->epp_default = -EINVAL;
 		cpu->epp_powersave = -EINVAL;
 		cpu->epp_saved = -EINVAL;
+		WRITE_ONCE(cpu->epp_cached, -EINVAL);
 	}
 
 	cpu = all_cpu_data[cpunum];
@@ -2245,7 +2251,10 @@  static int intel_pstate_verify_policy(st
 
 static void intel_cpufreq_stop_cpu(struct cpufreq_policy *policy)
 {
-	intel_pstate_set_min_pstate(all_cpu_data[policy->cpu]);
+	if (hwp_active)
+		intel_pstate_hwp_force_min_perf(policy->cpu);
+	else
+		intel_pstate_set_min_pstate(all_cpu_data[policy->cpu]);
 }
 
 static void intel_pstate_stop_cpu(struct cpufreq_policy *policy)
@@ -2253,12 +2262,10 @@  static void intel_pstate_stop_cpu(struct
 	pr_debug("CPU %d exiting\n", policy->cpu);
 
 	intel_pstate_clear_update_util_hook(policy->cpu);
-	if (hwp_active) {
+	if (hwp_active)
 		intel_pstate_hwp_save_state(policy);
-		intel_pstate_hwp_force_min_perf(policy->cpu);
-	} else {
-		intel_cpufreq_stop_cpu(policy);
-	}
+
+	intel_cpufreq_stop_cpu(policy);
 }
 
 static int intel_pstate_cpu_exit(struct cpufreq_policy *policy)
@@ -2388,13 +2395,82 @@  static void intel_cpufreq_trace(struct c
 		fp_toint(cpu->iowait_boost * 100));
 }
 
+static void intel_cpufreq_adjust_hwp(struct cpudata *cpu, u32 target_pstate,
+				     bool fast_switch)
+{
+	u64 prev = READ_ONCE(cpu->hwp_req_cached), value = prev;
+	s16 epp;
+
+	value &= ~HWP_MIN_PERF(~0L);
+	value |= HWP_MIN_PERF(target_pstate);
+
+	/*
+	 * The entire MSR needs to be updated in order to update the HWP min
+	 * field in it, so opportunistically update the max too if needed.
+	 */
+	value &= ~HWP_MAX_PERF(~0L);
+	value |= HWP_MAX_PERF(cpu->max_perf_ratio);
+
+	/*
+	 * In case the EPP has been adjusted via sysfs, write the last cached
+	 * value of it to the MSR as well.
+	 */
+	epp = READ_ONCE(cpu->epp_cached);
+	if (epp >= 0) {
+		value &= ~GENMASK_ULL(31, 24);
+		value |= (u64)epp << 24;
+	}
+
+	if (value == prev)
+		return;
+
+	WRITE_ONCE(cpu->hwp_req_cached, value);
+	if (fast_switch)
+		wrmsrl(MSR_HWP_REQUEST, value);
+	else
+		wrmsrl_on_cpu(cpu->cpu, MSR_HWP_REQUEST, value);
+}
+
+static void intel_cpufreq_adjust_perf_ctl(struct cpudata *cpu,
+					  u32 target_pstate, bool fast_switch)
+{
+	if (fast_switch)
+		wrmsrl(MSR_IA32_PERF_CTL,
+		       pstate_funcs.get_val(cpu, target_pstate));
+	else
+		wrmsrl_on_cpu(cpu->cpu, MSR_IA32_PERF_CTL,
+			      pstate_funcs.get_val(cpu, target_pstate));
+}
+
+static int intel_cpufreq_update_pstate(struct cpudata *cpu, int target_pstate,
+				       bool fast_switch)
+{
+	int old_pstate = cpu->pstate.current_pstate;
+
+	target_pstate = intel_pstate_prepare_request(cpu, target_pstate);
+	if (target_pstate != old_pstate) {
+		cpu->pstate.current_pstate = target_pstate;
+		if (hwp_active)
+			intel_cpufreq_adjust_hwp(cpu, target_pstate,
+						 fast_switch);
+		else
+			intel_cpufreq_adjust_perf_ctl(cpu, target_pstate,
+						      fast_switch);
+	}
+
+	intel_cpufreq_trace(cpu, fast_switch ? INTEL_PSTATE_TRACE_FAST_SWITCH :
+			    INTEL_PSTATE_TRACE_TARGET, old_pstate);
+
+	return target_pstate;
+}
+
 static int intel_cpufreq_target(struct cpufreq_policy *policy,
 				unsigned int target_freq,
 				unsigned int relation)
 {
 	struct cpudata *cpu = all_cpu_data[policy->cpu];
 	struct cpufreq_freqs freqs;
-	int target_pstate, old_pstate;
+	int target_pstate;
 
 	update_turbo_state();
 
@@ -2402,6 +2478,7 @@  static int intel_cpufreq_target(struct c
 	freqs.new = target_freq;
 
 	cpufreq_freq_transition_begin(policy, &freqs);
+
 	switch (relation) {
 	case CPUFREQ_RELATION_L:
 		target_pstate = DIV_ROUND_UP(freqs.new, cpu->pstate.scaling);
@@ -2413,15 +2490,11 @@  static int intel_cpufreq_target(struct c
 		target_pstate = DIV_ROUND_CLOSEST(freqs.new, cpu->pstate.scaling);
 		break;
 	}
-	target_pstate = intel_pstate_prepare_request(cpu, target_pstate);
-	old_pstate = cpu->pstate.current_pstate;
-	if (target_pstate != cpu->pstate.current_pstate) {
-		cpu->pstate.current_pstate = target_pstate;
-		wrmsrl_on_cpu(policy->cpu, MSR_IA32_PERF_CTL,
-			      pstate_funcs.get_val(cpu, target_pstate));
-	}
+
+	target_pstate = intel_cpufreq_update_pstate(cpu, target_pstate, false);
+
 	freqs.new = target_pstate * cpu->pstate.scaling;
-	intel_cpufreq_trace(cpu, INTEL_PSTATE_TRACE_TARGET, old_pstate);
+
 	cpufreq_freq_transition_end(policy, &freqs, false);
 
 	return 0;
@@ -2431,15 +2504,14 @@  static unsigned int intel_cpufreq_fast_s
 					      unsigned int target_freq)
 {
 	struct cpudata *cpu = all_cpu_data[policy->cpu];
-	int target_pstate, old_pstate;
+	int target_pstate;
 
 	update_turbo_state();
 
 	target_pstate = DIV_ROUND_UP(target_freq, cpu->pstate.scaling);
-	target_pstate = intel_pstate_prepare_request(cpu, target_pstate);
-	old_pstate = cpu->pstate.current_pstate;
-	intel_pstate_update_pstate(cpu, target_pstate);
-	intel_cpufreq_trace(cpu, INTEL_PSTATE_TRACE_FAST_SWITCH, old_pstate);
+
+	target_pstate = intel_cpufreq_update_pstate(cpu, target_pstate, true);
+
 	return target_pstate * cpu->pstate.scaling;
 }
 
@@ -2459,7 +2531,6 @@  static int intel_cpufreq_cpu_init(struct
 		return ret;
 
 	policy->cpuinfo.transition_latency = INTEL_CPUFREQ_TRANSITION_LATENCY;
-	policy->transition_delay_us = INTEL_CPUFREQ_TRANSITION_DELAY;
 	/* This reflects the intel_pstate_get_cpu_pstates() setting. */
 	policy->cur = policy->cpuinfo.min_freq;
 
@@ -2471,10 +2542,17 @@  static int intel_cpufreq_cpu_init(struct
 
 	cpu = all_cpu_data[policy->cpu];
 
-	if (hwp_active)
+	if (hwp_active) {
+		u64 value;
+
 		intel_pstate_get_hwp_max(policy->cpu, &turbo_max, &max_state);
-	else
+		policy->transition_delay_us = INTEL_CPUFREQ_TRANSITION_DELAY_HWP;
+		rdmsrl_on_cpu(cpu->cpu, MSR_HWP_REQUEST, &value);
+		WRITE_ONCE(cpu->hwp_req_cached, value);
+	} else {
 		turbo_max = cpu->pstate.turbo_pstate;
+		policy->transition_delay_us = INTEL_CPUFREQ_TRANSITION_DELAY;
+	}
 
 	min_freq = DIV_ROUND_UP(turbo_max * global.min_perf_pct, 100);
 	min_freq *= cpu->pstate.scaling;
@@ -2575,9 +2653,6 @@  static int intel_pstate_register_driver(
 
 static int intel_pstate_unregister_driver(void)
 {
-	if (hwp_active)
-		return -EBUSY;
-
 	cpufreq_unregister_driver(intel_pstate_driver);
 	intel_pstate_driver_cleanup();
 
@@ -2828,7 +2903,10 @@  static int __init intel_pstate_init(void
 			hwp_active++;
 			hwp_mode_bdw = id->driver_data;
 			intel_pstate.attr = hwp_cpufreq_attrs;
-			default_driver = &intel_pstate;
+			intel_cpufreq.attr = hwp_cpufreq_attrs;
+			if (!default_driver)
+				default_driver = &intel_pstate;
+
 			goto hwp_cpu_matched;
 		}
 	} else {
@@ -2899,14 +2977,13 @@  static int __init intel_pstate_setup(cha
 	if (!str)
 		return -EINVAL;
 
-	if (!strcmp(str, "disable")) {
+	if (!strcmp(str, "disable"))
 		no_load = 1;
-	} else if (!strcmp(str, "active")) {
+	else if (!strcmp(str, "active"))
 		default_driver = &intel_pstate;
-	} else if (!strcmp(str, "passive")) {
+	else if (!strcmp(str, "passive"))
 		default_driver = &intel_cpufreq;
-		no_hwp = 1;
-	}
+
 	if (!strcmp(str, "no_hwp")) {
 		pr_info("HWP disabled\n");
 		no_hwp = 1;
Index: linux-pm/Documentation/admin-guide/pm/intel_pstate.rst
===================================================================
--- linux-pm.orig/Documentation/admin-guide/pm/intel_pstate.rst
+++ linux-pm/Documentation/admin-guide/pm/intel_pstate.rst
@@ -54,10 +54,13 @@  registered (see `below <status_attr_>`_)
 Operation Modes
 ===============
 
-``intel_pstate`` can operate in three different modes: in the active mode with
-or without hardware-managed P-states support and in the passive mode.  Which of
-them will be in effect depends on what kernel command line options are used and
-on the capabilities of the processor.
+``intel_pstate`` can operate in two different modes, active or passive.  In the
+active mode, it uses its own internal performance scaling governor algorithm or
+allows the hardware to do performance scaling by itself, while in the passive
+mode it responds to requests made by a generic ``CPUFreq`` governor implementing
+a certain performance scaling algorithm.  Which of them will be in effect
+depends on what kernel command line options are used and on the capabilities of
+the processor.
 
 Active Mode
 -----------
@@ -194,10 +197,11 @@  This is the default operation mode of ``
 hardware-managed P-states (HWP) support.  It is always used if the
 ``intel_pstate=passive`` argument is passed to the kernel in the command line
 regardless of whether or not the given processor supports HWP.  [Note that the
-``intel_pstate=no_hwp`` setting implies ``intel_pstate=passive`` if it is used
-without ``intel_pstate=active``.]  Like in the active mode without HWP support,
-in this mode ``intel_pstate`` may refuse to work with processors that are not
-recognized by it.
+``intel_pstate=no_hwp`` setting causes the driver to start in the passive mode
+if it is not combined with ``intel_pstate=active``.]  Like in the active mode
+without HWP support, in this mode ``intel_pstate`` may refuse to work with
+processors that are not recognized by it if HWP is prevented from being enabled
+through the kernel command line.
 
 If the driver works in this mode, the ``scaling_driver`` policy attribute in
 ``sysfs`` for all ``CPUFreq`` policies contains the string "intel_cpufreq".
@@ -318,10 +322,9 @@  manuals need to be consulted to get to i
 
 For this reason, there is a list of supported processors in ``intel_pstate`` and
 the driver initialization will fail if the detected processor is not in that
-list, unless it supports the `HWP feature <Active Mode_>`_.  [The interface to
-obtain all of the information listed above is the same for all of the processors
-supporting the HWP feature, which is why they all are supported by
-``intel_pstate``.]
+list, unless it supports the HWP feature.  [The interface to obtain all of the
+information listed above is the same for all of the processors supporting the
+HWP feature, which is why ``intel_pstate`` works with all of them.]
 
 
 User Space Interface in ``sysfs``
@@ -425,22 +428,16 @@  argument is passed to the kernel in the
 	as well as the per-policy ones) are then reset to their default
 	values, possibly depending on the target operation mode.]
 
-	That only is supported in some configurations, though (for example, if
-	the `HWP feature is enabled in the processor <Active Mode With HWP_>`_,
-	the operation mode of the driver cannot be changed), and if it is not
-	supported in the current configuration, writes to this attribute will
-	fail with an appropriate error.
-
 ``energy_efficiency``
-	This attribute is only present on platforms, which have CPUs matching
-	Kaby Lake or Coffee Lake desktop CPU model. By default
-	energy efficiency optimizations are disabled on these CPU models in HWP
-	mode by this driver. Enabling energy efficiency may limit maximum
-	operating frequency in both HWP and non HWP mode. In non HWP mode,
-	optimizations are done only in the turbo frequency range. In HWP mode,
-	optimizations are done in the entire frequency range. Setting this
-	attribute to "1" enables energy efficiency optimizations and setting
-	to "0" disables energy efficiency optimizations.
+	This attribute is only present on platforms with CPUs matching the Kaby
+	Lake or Coffee Lake desktop CPU model. By default, energy-efficiency
+	optimizations are disabled on these CPU models if HWP is enabled.
+	Enabling energy-efficiency optimizations may limit maximum operating
+	frequency with or without the HWP feature.  With HWP enabled, the
+	optimizations are done only in the turbo frequency range.  Without it,
+	they are done in the entire available frequency range.  Setting this
+	attribute to "1" enables the energy-efficiency optimizations and setting
+	to "0" disables them.
 
 Interpretation of Policy Attributes
 -----------------------------------
@@ -484,8 +481,8 @@  Next, the following policy attributes ha
 	policy for the time interval between the last two invocations of the
 	driver's utilization update callback by the CPU scheduler for that CPU.
 
-One more policy attribute is present if the `HWP feature is enabled in the
-processor <Active Mode With HWP_>`_:
+One more policy attribute is present if the HWP feature is enabled in the
+processor:
 
 ``base_frequency``
 	Shows the base frequency of the CPU. Any frequency above this will be
@@ -526,11 +523,11 @@  on the following rules, regardless of th
 
  3. The global and per-policy limits can be set independently.
 
-If the `HWP feature is enabled in the processor <Active Mode With HWP_>`_, the
-resulting effective values are written into its registers whenever the limits
-change in order to request its internal P-state selection logic to always set
-P-states within these limits.  Otherwise, the limits are taken into account by
-scaling governors (in the `passive mode <Passive Mode_>`_) and by the driver
+In the `active mode with the HWP feature enabled <Active Mode With HWP_>`_, the
+resulting effective values are written into the processor's registers whenever
+the limits change in order to request its internal P-state selection logic to
+always set P-states within these limits.  Otherwise, the limits are taken into
+account by scaling governors (in the `passive mode <Passive Mode_>`_) and by the driver
 every time before setting a new P-state for a CPU.
 
 Additionally, if the ``intel_pstate=per_cpu_perf_limits`` command line argument
@@ -541,12 +538,11 @@  at all and the only way to set the limit
 Energy vs Performance Hints
 ---------------------------
 
-If ``intel_pstate`` works in the `active mode with the HWP feature enabled
-<Active Mode With HWP_>`_ in the processor, additional attributes are present
-in every ``CPUFreq`` policy directory in ``sysfs``.  They are intended to allow
-user space to help ``intel_pstate`` to adjust the processor's internal P-state
-selection logic by focusing it on performance or on energy-efficiency, or
-somewhere between the two extremes:
+If the hardware-managed P-states (HWP) feature is enabled in the processor,
+additional attributes are present in every ``CPUFreq`` policy directory in
+``sysfs``.  They are intended to allow user space to help ``intel_pstate`` to
+adjust the processor's internal P-state selection logic by focusing it on
+performance or on energy-efficiency, or somewhere between the two extremes:
 
 ``energy_performance_preference``
 	Current value of the energy vs performance hint for the given policy
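As an illustrative sketch of how the hint can be inspected and changed from user
space (the paths below follow the standard ``CPUFreq`` policy ``sysfs`` layout
for CPU 0; the hint names are taken from the attribute itself, and writing
requires root):

```shell
# Standard cpufreq sysfs policy directory for CPU 0.
pol=/sys/devices/system/cpu/cpufreq/policy0

if [ -r "$pol/energy_performance_preference" ]; then
    # List the hints accepted on this system, then show the current one.
    cat "$pol/energy_performance_available_preferences"
    cat "$pol/energy_performance_preference"
    # Request the "balance_power" hint (root only):
    # echo balance_power > "$pol/energy_performance_preference"
fi
```

On systems without HWP (or without ``intel_pstate``) the attribute is absent
and the snippet does nothing.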
@@ -650,12 +646,14 @@  of them have to be prepended with the ``
 	Do not register ``intel_pstate`` as the scaling driver even if the
 	processor is supported by it.
 
+``active``
+	Register ``intel_pstate`` in the `active mode <Active Mode_>`_ to start
+	with.
+
 ``passive``
 	Register ``intel_pstate`` in the `passive mode <Passive Mode_>`_ to
 	start with.
 
-	This option implies the ``no_hwp`` one described below.
-
 ``force``
 	Register ``intel_pstate`` as the scaling driver instead of
 	``acpi-cpufreq`` even if the latter is preferred on the given system.
@@ -670,13 +668,12 @@  of them have to be prepended with the ``
 	driver is used instead of ``acpi-cpufreq``.
 
 ``no_hwp``
-	Do not enable the `hardware-managed P-states (HWP) feature
-	<Active Mode With HWP_>`_ even if it is supported by the processor.
+	Do not enable the hardware-managed P-states (HWP) feature even if it is
+	supported by the processor.
 
 ``hwp_only``
 	Register ``intel_pstate`` as the scaling driver only if the
-	`hardware-managed P-states (HWP) feature <Active Mode With HWP_>`_ is
-	supported by the processor.
+	hardware-managed P-states (HWP) feature is supported by the processor.
 
 ``support_acpi_ppc``
 	Take ACPI ``_PPC`` performance limits into account.
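A minimal sketch of how to check which operation mode the driver ended up in
after boot, using the driver's existing ``status`` attribute under
``/sys/devices/system/cpu/intel_pstate/`` (it reports ``active``, ``passive``,
or ``off``; the fallback value below is only for systems where the driver is
not loaded):

```shell
# Query the current intel_pstate operation mode via its status attribute.
status_file=/sys/devices/system/cpu/intel_pstate/status

if [ -r "$status_file" ]; then
    mode=$(cat "$status_file")
else
    # intel_pstate is not registered on this system.
    mode=unavailable
fi
echo "intel_pstate mode: ${mode}"
```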