[RFC/RFT,v4,1/2] cpufreq: New governor using utilization data from the scheduler

Message ID 1825489.pc33SqXSIB@vostro.rjw.lan (mailing list archive)
State Superseded, archived
Delegated to: Rafael Wysocki

Commit Message

Rafael J. Wysocki Feb. 25, 2016, 9:14 p.m. UTC
From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Add a new cpufreq scaling governor, called "schedutil", that uses
scheduler-provided CPU utilization information as input for making
its decisions.

Doing that is possible after commit fe7034338ba0 (cpufreq: Add
mechanism for registering utilization update callbacks) that
introduced cpufreq_update_util() called by the scheduler on
utilization changes (from CFS) and RT/DL task status updates.
In particular, CPU frequency scaling decisions may be based on
the utilization data passed to cpufreq_update_util() by CFS.

The new governor is very simple.  It is almost as simple as it
can be and remain reasonably functional.

The frequency selection formula it uses is essentially the same as the
one used by the "ondemand" governor, except that it doesn't use the
additional up_threshold parameter and, instead of computing the load as
the "non-idle CPU time" to "total CPU time" ratio, it takes the
utilization data provided by CFS as input.  More specifically, it
represents "load" as the util/max ratio, where util and max are the
utilization and CPU capacity coming from CFS.
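
As an illustration (the numbers here are invented for this example, not
taken from the patch): with policy->min = 500 MHz, policy->max = 2000 MHz,
util = 512 and max = 1024, the resulting request is
500 + 512 * (2000 - 500) / 1024 = 1250 MHz, i.e. the selected frequency
scales linearly from policy->min at zero utilization to policy->max at
full utilization.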

All of the computations are carried out in the utilization update
handlers provided by the new governor.  One of those handlers is
used for cpufreq policies shared between multiple CPUs and the other
one is for policies with one CPU only (and therefore it doesn't need
to use any extra synchronization means).  The only operation carried
out by the new governor's ->gov_dbs_timer callback, sugov_set_freq(),
is a __cpufreq_driver_target() call to trigger a frequency update (to
a value already computed beforehand in one of the utilization update
handlers).  This means that, at least for some cpufreq drivers that
can update CPU frequency by doing simple register writes, it should
be possible to set the frequency in the utilization update handlers
too, in which case all of the governor's activity would take place in
the scheduler paths invoking cpufreq_update_util() without the need
to run anything in process context.
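
For illustration, such a handler might look like the sketch below (this
is hypothetical: the driver_fast_set() hook is invented for this example
and nothing like it exists in this patch):

	static void sugov_update_single_fast(struct update_util_data *data,
					     u64 time, unsigned long util,
					     unsigned long max)
	{
		struct sugov_cpu *sg_cpu = container_of(data, struct sugov_cpu,
							update_util);
		struct cpufreq_policy *policy = sg_cpu->policy_dbs->policy;
		unsigned int min_f = policy->min, max_f = policy->max, next_f;

		/* The same linear mapping as in sugov_update_single(). */
		next_f = util > max || min_f >= max_f ? max_f :
				min_f + util * (max_f - min_f) / max;

		/* Hypothetical driver hook: a register write that is safe
		 * to execute in scheduler context. */
		driver_fast_set(policy, next_f);
	}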

Currently, the governor treats all of the RT and DL tasks as
"unknown utilization" and sets the frequency to the allowed
maximum when updated from the RT or DL sched classes.  That
heavy-handed approach should be replaced with something more
specifically targeted at RT and DL tasks.

To some extent it relies on the common governor code in
cpufreq_governor.c and it uses that code in a somewhat unusual
way (different from what the "ondemand" and "conservative"
governors do), so some small and rather unintrusive changes
have to be made in that code and the other governors to support it.

However, after making it possible to set the CPU frequency from
the utilization update handlers, that new governor's interactions
with the common code might be limited to the initialization, cleanup
and handling of sysfs attributes (currently only one attribute,
sampling_rate, is supported in addition to the standard policy
attributes handled by the cpufreq core).

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---

There was no v3 of this patch, but patch [2/2] had a v3 in the meantime.

Changes from v2:
- Avoid requesting the same frequency that was requested last time for
  the given policy.

Changes from v1:
- Use policy->min and policy->max as min/max frequency in computations.

---
 drivers/cpufreq/Kconfig                |   15 +
 drivers/cpufreq/Makefile               |    1 
 drivers/cpufreq/cpufreq_conservative.c |    3 
 drivers/cpufreq/cpufreq_governor.c     |   21 +-
 drivers/cpufreq/cpufreq_governor.h     |    2 
 drivers/cpufreq/cpufreq_ondemand.c     |    3 
 drivers/cpufreq/cpufreq_schedutil.c    |  253 +++++++++++++++++++++++++++++++++
 7 files changed, 288 insertions(+), 10 deletions(-)



Comments

Rafael J. Wysocki Feb. 27, 2016, 12:21 a.m. UTC | #1
On Thursday, February 25, 2016 10:14:35 PM Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> 
> Add a new cpufreq scaling governor, called "schedutil", that uses
> scheduler-provided CPU utilization information as input for making
> its decisions.
> 
> Doing that is possible after commit fe7034338ba0 (cpufreq: Add
> mechanism for registering utilization update callbacks) that
> introduced cpufreq_update_util() called by the scheduler on
> utilization changes (from CFS) and RT/DL task status updates.
> In particular, CPU frequency scaling decisions may be based on
> the utilization data passed to cpufreq_update_util() by CFS.
> 
> The new governor is very simple.  It is almost as simple as it
> can be and remain reasonably functional.
> 
> The frequency selection formula used by it is essentially the same
> as the one used by the "ondemand" governor, although it doesn't use
> the additional up_threshold parameter, but instead of computing the
> load as the "non-idle CPU time" to "total CPU time" ratio, it takes
> the utilization data provided by CFS as input.  More specifically,
> it represents "load" as the util/max ratio, where util and max
> are the utilization and CPU capacity coming from CFS.
> 
> All of the computations are carried out in the utilization update
> handlers provided by the new governor.  One of those handlers is
> used for cpufreq policies shared between multiple CPUs and the other
> one is for policies with one CPU only (and therefore it doesn't need
> to use any extra synchronization means).  The only operation carried
> out by the new governor's ->gov_dbs_timer callback, sugov_set_freq(),
> is a __cpufreq_driver_target() call to trigger a frequency update (to
> a value already computed beforehand in one of the utilization update
> handlers).  This means that, at least for some cpufreq drivers that
> can update CPU frequency by doing simple register writes, it should
> be possible to set the frequency in the utilization update handlers
> too in which case all of the governor's activity would take place in
> the scheduler paths invoking cpufreq_update_util() without the need
> to run anything in process context.
> 
> Currently, the governor treats all of the RT and DL tasks as
> "unknown utilization" and sets the frequency to the allowed
> maximum when updated from the RT or DL sched classes.  That
> heavy-handed approach should be replaced with something more
> specifically targeted at RT and DL tasks.
> 
> To some extent it relies on the common governor code in
> cpufreq_governor.c and it uses that code in a somewhat unusual
> way (different from what the "ondemand" and "conservative"
> governors do), so some small and rather unintrusive changes
> have to be made in that code and the other governors to support it.
> 
> However, after making it possible to set the CPU frequency from
> the utilization update handlers, that new governor's interactions
> with the common code might be limited to the initialization, cleanup
> and handling of sysfs attributes (currently only one attribute,
> sampling_rate, is supported in addition to the standard policy
> attributes handled by the cpufreq core).
> 
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
> 
> There was no v3 of this patch, but patch [2/2] had a v3 in the meantime.
> 
> Changes from v2:
> - Avoid requesting the same frequency that was requested last time for
>   the given policy.
> 
> Changes from v1:
> - Use policy->min and policy->max as min/max frequency in computations.
> 

Well, this is getting interesting. :-)

Some preliminary results from SpecPower indicate that this governor (with
patch [2/2] on top), as simple as it is, beats "ondemand" in performance/power
and generally allows the system to achieve better performance.  The difference
is not significant, but measurable.

If that is confirmed in further testing, I will be inclined to drop this thing
in during the next cycle.

Thanks,
Rafael

Steve Muckle Feb. 27, 2016, 4:33 a.m. UTC | #2
On 02/25/2016 01:14 PM, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> 
> Add a new cpufreq scaling governor, called "schedutil", that uses
> scheduler-provided CPU utilization information as input for making
> its decisions.
> 
> Doing that is possible after commit fe7034338ba0 (cpufreq: Add
> mechanism for registering utilization update callbacks) that
> introduced cpufreq_update_util() called by the scheduler on
> utilization changes (from CFS) and RT/DL task status updates.
> In particular, CPU frequency scaling decisions may be based on
> the utilization data passed to cpufreq_update_util() by CFS.
> 
> The new governor is very simple.  It is almost as simple as it
> can be and remain reasonably functional.
> 
> The frequency selection formula used by it is essentially the same
> as the one used by the "ondemand" governor, although it doesn't use
> the additional up_threshold parameter, but instead of computing the
> load as the "non-idle CPU time" to "total CPU time" ratio, it takes
> the utilization data provided by CFS as input.  More specifically,
> it represents "load" as the util/max ratio, where util and max
> are the utilization and CPU capacity coming from CFS.
> 
> All of the computations are carried out in the utilization update
> handlers provided by the new governor.  One of those handlers is
> used for cpufreq policies shared between multiple CPUs and the other
> one is for policies with one CPU only (and therefore it doesn't need
> to use any extra synchronization means).  The only operation carried
> out by the new governor's ->gov_dbs_timer callback, sugov_set_freq(),
> is a __cpufreq_driver_target() call to trigger a frequency update (to
> a value already computed beforehand in one of the utilization update
> handlers).  This means that, at least for some cpufreq drivers that
> can update CPU frequency by doing simple register writes, it should
> be possible to set the frequency in the utilization update handlers
> too in which case all of the governor's activity would take place in
> the scheduler paths invoking cpufreq_update_util() without the need
> to run anything in process context.
> 
> Currently, the governor treats all of the RT and DL tasks as
> "unknown utilization" and sets the frequency to the allowed
> maximum when updated from the RT or DL sched classes.  That
> heavy-handed approach should be replaced with something more
> specifically targeted at RT and DL tasks.
> 
> To some extent it relies on the common governor code in
> cpufreq_governor.c and it uses that code in a somewhat unusual
> way (different from what the "ondemand" and "conservative"
> governors do), so some small and rather unintrusive changes
> have to be made in that code and the other governors to support it.
> 
> However, after making it possible to set the CPU frequency from
> the utilization update handlers, that new governor's interactions
> with the common code might be limited to the initialization, cleanup
> and handling of sysfs attributes (currently only one attribute,
> sampling_rate, is supported in addition to the standard policy
> attributes handled by the cpufreq core).

We'll still need a slow path for platforms that don't support fast
transitions, right?

Or is this referring to plans to add an RT thread for slow-path changes
instead of using the dbs stuff? :)

> 
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
> 
> There was no v3 of this patch, but patch [2/2] had a v3 in the meantime.
> 
> Changes from v2:
> - Avoid requesting the same frequency that was requested last time for
>   the given policy.
> 
> Changes from v1:
> - Use policy->min and policy->max as min/max frequency in computations.
> 
> ---
>  drivers/cpufreq/Kconfig                |   15 +
>  drivers/cpufreq/Makefile               |    1 
>  drivers/cpufreq/cpufreq_conservative.c |    3 
>  drivers/cpufreq/cpufreq_governor.c     |   21 +-
>  drivers/cpufreq/cpufreq_governor.h     |    2 
>  drivers/cpufreq/cpufreq_ondemand.c     |    3 
>  drivers/cpufreq/cpufreq_schedutil.c    |  253 +++++++++++++++++++++++++++++++++
>  7 files changed, 288 insertions(+), 10 deletions(-)
> 
> Index: linux-pm/drivers/cpufreq/cpufreq_governor.h
> ===================================================================
> --- linux-pm.orig/drivers/cpufreq/cpufreq_governor.h
> +++ linux-pm/drivers/cpufreq/cpufreq_governor.h
> @@ -164,7 +164,7 @@ struct dbs_governor {
>  	void (*free)(struct policy_dbs_info *policy_dbs);
>  	int (*init)(struct dbs_data *dbs_data, bool notify);
>  	void (*exit)(struct dbs_data *dbs_data, bool notify);
> -	void (*start)(struct cpufreq_policy *policy);
> +	bool (*start)(struct cpufreq_policy *policy);
>  };
>  
>  static inline struct dbs_governor *dbs_governor_of(struct cpufreq_policy *policy)
> Index: linux-pm/drivers/cpufreq/cpufreq_schedutil.c
> ===================================================================
> --- /dev/null
> +++ linux-pm/drivers/cpufreq/cpufreq_schedutil.c
> @@ -0,0 +1,253 @@
> +/*
> + * CPUFreq governor based on scheduler-provided CPU utilization data.
> + *
> + * Copyright (C) 2016, Intel Corporation
> + * Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/percpu-defs.h>
> +#include <linux/slab.h>
> +
> +#include "cpufreq_governor.h"
> +
> +struct sugov_policy {
> +	struct policy_dbs_info policy_dbs;
> +	unsigned int next_freq;
> +	raw_spinlock_t update_lock;  /* For shared policies */
> +};
> +
> +static inline struct sugov_policy *to_sg_policy(struct policy_dbs_info *policy_dbs)
> +{
> +	return container_of(policy_dbs, struct sugov_policy, policy_dbs);
> +}
> +
> +struct sugov_cpu {
> +	struct update_util_data update_util;
> +	struct policy_dbs_info *policy_dbs;
> +	/* The fields below are only needed when sharing a policy. */
> +	unsigned long util;
> +	unsigned long max;
> +	u64 last_update;
> +};
> +
> +static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu);
> +
> +/************************ Governor internals ***********************/
> +
> +static unsigned int sugov_next_freq(struct policy_dbs_info *policy_dbs,
> +				    unsigned long util, unsigned long max,
> +				    u64 last_sample_time)
> +{
> +	struct cpufreq_policy *policy = policy_dbs->policy;
> +	unsigned int min_f = policy->min;
> +	unsigned int max_f = policy->max;
> +	unsigned int j;
> +
> +	if (util > max || min_f >= max_f)
> +		return max_f;
> +
> +	for_each_cpu(j, policy->cpus) {
> +		struct sugov_cpu *j_sg_cpu;
> +		unsigned long j_util, j_max;
> +
> +		if (j == smp_processor_id())
> +			continue;
> +
> +		j_sg_cpu = &per_cpu(sugov_cpu, j);
> +		/*
> +		 * If the CPU was last updated before the previous sampling
> +		 * time, we don't take it into account as it probably is idle
> +		 * now.
> +		 */

If the sampling rate is less than the tick, it seems possible a busy CPU
may not manage to get an update in within the current sampling period.

What about checking to see if there was an update within the last tick
period, rather than the sample period?

There's also NO_HZ_FULL. I don't know that anyone using NO_HZ_FULL would
want to use a governor other than performance, but I thought I'd mention
it just in case.

> +		if (j_sg_cpu->last_update < last_sample_time)
> +			continue;
> +
> +		j_util = j_sg_cpu->util;
> +		j_max = j_sg_cpu->max;
> +		if (j_util > j_max)
> +			return max_f;
> +
> +		if (j_util * max > j_max * util) {
> +			util = j_util;
> +			max = j_max;
> +		}
> +	}
> +
> +	return min_f + util * (max_f - min_f) / max;
> +}
> +
> +static void sugov_update_commit(struct policy_dbs_info *policy_dbs, u64 time,
> +				unsigned int next_freq)
> +{
> +	struct sugov_policy *sg_policy = to_sg_policy(policy_dbs);
> +
> +	policy_dbs->last_sample_time = time;
> +	if (sg_policy->next_freq != next_freq) {
> +		sg_policy->next_freq = next_freq;
> +		policy_dbs->work_in_progress = true;
> +		irq_work_queue(&policy_dbs->irq_work);
> +	}
> +}
> +
> +static void sugov_update_shared(struct update_util_data *data, u64 time,
> +				unsigned long util, unsigned long max)
> +{
> +	struct sugov_cpu *sg_cpu = container_of(data, struct sugov_cpu, update_util);
> +	struct policy_dbs_info *policy_dbs = sg_cpu->policy_dbs;
> +	struct sugov_policy *sg_policy = to_sg_policy(policy_dbs);
> +	unsigned int next_f;
> +	u64 delta_ns, lst;
> +
> +	raw_spin_lock(&sg_policy->update_lock);
> +
> +	sg_cpu->util = util;
> +	sg_cpu->max = max;
> +	sg_cpu->last_update = time;
> +
> +	if (policy_dbs->work_in_progress)
> +		goto out;
> +
> +	/*
> +	 * This assumes that dbs_data_handler() will not change sample_delay_ns.
> +	 */
> +	lst = policy_dbs->last_sample_time;
> +	delta_ns = time - lst;
> +	if ((s64)delta_ns < policy_dbs->sample_delay_ns)
> +		goto out;
> +
> +	next_f = sugov_next_freq(policy_dbs, util, max, lst);
> +
> +	sugov_update_commit(policy_dbs, time, next_f);
> +
> + out:
> +	raw_spin_unlock(&sg_policy->update_lock);
> +}
> +
> +static void sugov_update_single(struct update_util_data *data, u64 time,
> +				unsigned long util, unsigned long max)
> +{
> +	struct sugov_cpu *sg_cpu = container_of(data, struct sugov_cpu, update_util);
> +	struct policy_dbs_info *policy_dbs = sg_cpu->policy_dbs;
> +	unsigned int min_f, max_f, next_f;
> +	u64 delta_ns;
> +
> +	if (policy_dbs->work_in_progress)
> +		return;
> +
> +	/*
> +	 * This assumes that dbs_data_handler() will not change sample_delay_ns.
> +	 */
> +	delta_ns = time - policy_dbs->last_sample_time;
> +	if ((s64)delta_ns < policy_dbs->sample_delay_ns)
> +		return;
> +
> +	min_f = policy_dbs->policy->min;
> +	max_f = policy_dbs->policy->max;
> +	next_f = util > max || min_f >= max_f ? max_f :
> +			min_f + util * (max_f - min_f) / max;
> +
> +	sugov_update_commit(policy_dbs, time, next_f);
> +}
> +
> +/************************** sysfs interface ************************/
> +
> +gov_show_one_common(sampling_rate);
> +
> +gov_attr_rw(sampling_rate);
> +
> +static struct attribute *sugov_attributes[] = {
> +	&sampling_rate.attr,
> +	NULL
> +};
> +
> +/********************** dbs_governor interface *********************/
> +
> +static struct policy_dbs_info *sugov_alloc(void)
> +{
> +	struct sugov_policy *sg_policy;
> +
> +	sg_policy = kzalloc(sizeof(*sg_policy), GFP_KERNEL);
> +	if (!sg_policy)
> +		return NULL;
> +
> +	raw_spin_lock_init(&sg_policy->update_lock);
> +	return &sg_policy->policy_dbs;
> +}
> +
> +static void sugov_free(struct policy_dbs_info *policy_dbs)
> +{
> +	kfree(to_sg_policy(policy_dbs));

Is it possible that a CPU could still be in the sugov_update_* path
above when this runs? I see that cpufreq_governor_stop will call
gov_cancel_work which clears the update_util hook, but I don't see
anything that synchronizes with all CPUs having exited the update path.

> +}
> +
> +static bool sugov_start(struct cpufreq_policy *policy)
> +{
> +	struct policy_dbs_info *policy_dbs = policy->governor_data;
> +	struct sugov_policy *sg_policy = to_sg_policy(policy_dbs);
> +	unsigned int cpu;
> +
> +	gov_update_sample_delay(policy_dbs, policy_dbs->dbs_data->sampling_rate);
> +	policy_dbs->last_sample_time = 0;
> +	sg_policy->next_freq = UINT_MAX;
> +
> +	for_each_cpu(cpu, policy->cpus) {
> +		struct sugov_cpu *sg_cpu = &per_cpu(sugov_cpu, cpu);
> +
> +		sg_cpu->policy_dbs = policy_dbs;
> +		if (policy_dbs->is_shared) {
> +			sg_cpu->util = ULONG_MAX;
> +			sg_cpu->max = 0;
> +			sg_cpu->last_update = 0;
> +			sg_cpu->update_util.func = sugov_update_shared;
> +		} else {
> +			sg_cpu->update_util.func = sugov_update_single;
> +		}
> +		cpufreq_set_update_util_data(cpu, &sg_cpu->update_util);
> +	}
> +	return false;
> +}
> +
> +static unsigned int sugov_set_freq(struct cpufreq_policy *policy)
> +{
> +	struct policy_dbs_info *policy_dbs = policy->governor_data;
> +	struct sugov_policy *sg_policy = to_sg_policy(policy_dbs);
> +
> +	__cpufreq_driver_target(policy, sg_policy->next_freq, CPUFREQ_RELATION_C);

Similar to the concern raised in the ACPI changes, I think this should be
CPUFREQ_RELATION_L; otherwise you may not run fast enough to meet demand.

> +	return policy_dbs->dbs_data->sampling_rate;
> +}
> +
> +static struct dbs_governor schedutil_gov = {
> +	.gov = {
> +		.name = "schedutil",
> +		.governor = cpufreq_governor_dbs,
> +		.max_transition_latency	= TRANSITION_LATENCY_LIMIT,
> +		.owner = THIS_MODULE,
> +	},
> +	.kobj_type = { .default_attrs = sugov_attributes },
> +	.gov_dbs_timer = sugov_set_freq,
> +	.alloc = sugov_alloc,
> +	.free = sugov_free,
> +	.start = sugov_start,
> +};
> +
> +#define CPU_FREQ_GOV_SCHEDUTIL	(&schedutil_gov.gov)
> +
> +static int __init sugov_init(void)
> +{
> +	return cpufreq_register_governor(CPU_FREQ_GOV_SCHEDUTIL);
> +}
> +
> +static void __exit sugov_exit(void)
> +{
> +	cpufreq_unregister_governor(CPU_FREQ_GOV_SCHEDUTIL);
> +}
> +
> +MODULE_AUTHOR("Rafael J. Wysocki <rafael.j.wysocki@intel.com>");
> +MODULE_DESCRIPTION("Utilization-based CPU frequency selection");
> +MODULE_LICENSE("GPL");
> +
> +module_init(sugov_init);
> +module_exit(sugov_exit);
> Index: linux-pm/drivers/cpufreq/cpufreq_conservative.c
> ===================================================================
> --- linux-pm.orig/drivers/cpufreq/cpufreq_conservative.c
> +++ linux-pm/drivers/cpufreq/cpufreq_conservative.c
> @@ -299,12 +299,13 @@ static void cs_exit(struct dbs_data *dbs
>  	kfree(dbs_data->tuners);
>  }
>  
> -static void cs_start(struct cpufreq_policy *policy)
> +static bool cs_start(struct cpufreq_policy *policy)
>  {
>  	struct cs_policy_dbs_info *dbs_info = to_dbs_info(policy->governor_data);
>  
>  	dbs_info->down_skip = 0;
>  	dbs_info->requested_freq = policy->cur;
> +	return true;
>  }
>  
>  static struct dbs_governor cs_dbs_gov = {
> Index: linux-pm/drivers/cpufreq/cpufreq_governor.c
> ===================================================================
> --- linux-pm.orig/drivers/cpufreq/cpufreq_governor.c
> +++ linux-pm/drivers/cpufreq/cpufreq_governor.c
> @@ -472,9 +472,11 @@ static int cpufreq_governor_init(struct
>  	INIT_LIST_HEAD(&dbs_data->policy_dbs_list);
>  	mutex_init(&dbs_data->mutex);
>  
> -	ret = gov->init(dbs_data, !policy->governor->initialized);
> -	if (ret)
> -		goto free_policy_dbs_info;
> +	if (gov->init) {
> +		ret = gov->init(dbs_data, !policy->governor->initialized);
> +		if (ret)
> +			goto free_policy_dbs_info;
> +	}
>  
>  	/* policy latency is in ns. Convert it to us first */
>  	latency = policy->cpuinfo.transition_latency / 1000;
> @@ -510,7 +512,10 @@ static int cpufreq_governor_init(struct
>  
>  	if (!have_governor_per_policy())
>  		gov->gdbs_data = NULL;
> -	gov->exit(dbs_data, !policy->governor->initialized);
> +
> +	if (gov->exit)
> +		gov->exit(dbs_data, !policy->governor->initialized);
> +
>  	kfree(dbs_data);
>  
>  free_policy_dbs_info:
> @@ -544,7 +549,9 @@ static int cpufreq_governor_exit(struct
>  		if (!have_governor_per_policy())
>  			gov->gdbs_data = NULL;
>  
> -		gov->exit(dbs_data, policy->governor->initialized == 1);
> +		if (gov->exit)
> +			gov->exit(dbs_data, policy->governor->initialized == 1);
> +
>  		mutex_destroy(&dbs_data->mutex);
>  		kfree(dbs_data);
>  	} else {
> @@ -588,9 +595,9 @@ static int cpufreq_governor_start(struct
>  			j_cdbs->prev_cpu_nice = kcpustat_cpu(j).cpustat[CPUTIME_NICE];
>  	}
>  
> -	gov->start(policy);
> +	if (gov->start(policy))
> +		gov_set_update_util(policy_dbs, sampling_rate);
>  
> -	gov_set_update_util(policy_dbs, sampling_rate);
>  	return 0;
>  }
>  
> Index: linux-pm/drivers/cpufreq/cpufreq_ondemand.c
> ===================================================================
> --- linux-pm.orig/drivers/cpufreq/cpufreq_ondemand.c
> +++ linux-pm/drivers/cpufreq/cpufreq_ondemand.c
> @@ -402,12 +402,13 @@ static void od_exit(struct dbs_data *dbs
>  	kfree(dbs_data->tuners);
>  }
>  
> -static void od_start(struct cpufreq_policy *policy)
> +static bool od_start(struct cpufreq_policy *policy)
>  {
>  	struct od_policy_dbs_info *dbs_info = to_dbs_info(policy->governor_data);
>  
>  	dbs_info->sample_type = OD_NORMAL_SAMPLE;
>  	ondemand_powersave_bias_init(policy);
> +	return true;
>  }
>  
>  static struct od_ops od_ops = {
> Index: linux-pm/drivers/cpufreq/Kconfig
> ===================================================================
> --- linux-pm.orig/drivers/cpufreq/Kconfig
> +++ linux-pm/drivers/cpufreq/Kconfig
> @@ -184,6 +184,21 @@ config CPU_FREQ_GOV_CONSERVATIVE
>  
>  	  If in doubt, say N.
>  
> +config CPU_FREQ_GOV_SCHEDUTIL
> +	tristate "'schedutil' cpufreq policy governor"
> +	depends on CPU_FREQ

Do you need to select IRQ_WORK here?

> +	select CPU_FREQ_GOV_COMMON
> +	help
> +	  The frequency selection formula used by this governor is analogous
> +	  to the one used by 'ondemand', but instead of computing CPU load
> +	  as the "non-idle CPU time" to "total CPU time" ratio, it uses CPU
> +	  utilization data provided by the scheduler as input.
> +
> +	  To compile this driver as a module, choose M here: the
> +	  module will be called cpufreq_schedutil.
> +
> +	  If in doubt, say N.
> +
>  comment "CPU frequency scaling drivers"
>  
>  config CPUFREQ_DT
> Index: linux-pm/drivers/cpufreq/Makefile
> ===================================================================
> --- linux-pm.orig/drivers/cpufreq/Makefile
> +++ linux-pm/drivers/cpufreq/Makefile
> @@ -10,6 +10,7 @@ obj-$(CONFIG_CPU_FREQ_GOV_POWERSAVE)	+=
>  obj-$(CONFIG_CPU_FREQ_GOV_USERSPACE)	+= cpufreq_userspace.o
>  obj-$(CONFIG_CPU_FREQ_GOV_ONDEMAND)	+= cpufreq_ondemand.o
>  obj-$(CONFIG_CPU_FREQ_GOV_CONSERVATIVE)	+= cpufreq_conservative.o
> +obj-$(CONFIG_CPU_FREQ_GOV_SCHEDUTIL)	+= cpufreq_schedutil.o
>  obj-$(CONFIG_CPU_FREQ_GOV_COMMON)		+= cpufreq_governor.o
>  
>  obj-$(CONFIG_CPUFREQ_DT)		+= cpufreq-dt.o
> 

thanks,
Steve
Rafael J. Wysocki Feb. 27, 2016, 3:24 p.m. UTC | #3
On Friday, February 26, 2016 08:33:19 PM Steve Muckle wrote:
> On 02/25/2016 01:14 PM, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > 
> > Add a new cpufreq scaling governor, called "schedutil", that uses
> > scheduler-provided CPU utilization information as input for making
> > its decisions.
> > 
> > Doing that is possible after commit fe7034338ba0 (cpufreq: Add
> > mechanism for registering utilization update callbacks) that
> > introduced cpufreq_update_util() called by the scheduler on
> > utilization changes (from CFS) and RT/DL task status updates.
> > In particular, CPU frequency scaling decisions may be based on
> > the utilization data passed to cpufreq_update_util() by CFS.
> > 
> > The new governor is very simple.  It is almost as simple as it
> > can be and remain reasonably functional.
> > 
> > The frequency selection formula used by it is essentially the same
> > as the one used by the "ondemand" governor, although it doesn't use
> > the additional up_threshold parameter, but instead of computing the
> > load as the "non-idle CPU time" to "total CPU time" ratio, it takes
> > the utilization data provided by CFS as input.  More specifically,
> > it represents "load" as the util/max ratio, where util and max
> > are the utilization and CPU capacity coming from CFS.
> > 
> > All of the computations are carried out in the utilization update
> > handlers provided by the new governor.  One of those handlers is
> > used for cpufreq policies shared between multiple CPUs and the other
> > one is for policies with one CPU only (and therefore it doesn't need
> > to use any extra synchronization means).  The only operation carried
> > out by the new governor's ->gov_dbs_timer callback, sugov_set_freq(),
> > is a __cpufreq_driver_target() call to trigger a frequency update (to
> > a value already computed beforehand in one of the utilization update
> > handlers).  This means that, at least for some cpufreq drivers that
> > can update CPU frequency by doing simple register writes, it should
> > be possible to set the frequency in the utilization update handlers
> > too in which case all of the governor's activity would take place in
> > the scheduler paths invoking cpufreq_update_util() without the need
> > to run anything in process context.
> > 
> > Currently, the governor treats all of the RT and DL tasks as
> > "unknown utilization" and sets the frequency to the allowed
> > maximum when updated from the RT or DL sched classes.  That
> > heavy-handed approach should be replaced with something more
> > specifically targeted at RT and DL tasks.
> > 
> > To some extent it relies on the common governor code in
> > cpufreq_governor.c and it uses that code in a somewhat unusual
> > way (different from what the "ondemand" and "conservative"
> > governors do), so some small and rather unintrusive changes
> > have to be made in that code and the other governors to support it.
> > 
> > However, after making it possible to set the CPU frequency from
> > the utilization update handlers, that new governor's interactions
> > with the common code might be limited to the initialization, cleanup
> > and handling of sysfs attributes (currently only one attribute,
> > sampling_rate, is supported in addition to the standard policy
> > attributes handled by the cpufreq core).
> 
> We'll still need a slow path for platforms that don't support fast
> transitions right?

Right.

> Or is this referring to plans to add an RT thread for slowpath changes
> instead of using the dbs stuff :) .

I did not consider that when I was working on the $subject patch. :-)

> > 
> > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > ---
> > 
> > There was no v3 of this patch, but patch [2/2] had a v3 in the meantime.
> > 
> > Changes from v2:
> > - Avoid requesting the same frequency that was requested last time for
> >   the given policy.
> > 
> > Changes from v1:
> > - Use policy->min and policy->max as min/max frequency in computations.
> > 
> > ---
> >  drivers/cpufreq/Kconfig                |   15 +
> >  drivers/cpufreq/Makefile               |    1 
> >  drivers/cpufreq/cpufreq_conservative.c |    3 
> >  drivers/cpufreq/cpufreq_governor.c     |   21 +-
> >  drivers/cpufreq/cpufreq_governor.h     |    2 
> >  drivers/cpufreq/cpufreq_ondemand.c     |    3 
> >  drivers/cpufreq/cpufreq_schedutil.c    |  253 +++++++++++++++++++++++++++++++++
> >  7 files changed, 288 insertions(+), 10 deletions(-)
> > 
> > Index: linux-pm/drivers/cpufreq/cpufreq_governor.h
> > ===================================================================
> > --- linux-pm.orig/drivers/cpufreq/cpufreq_governor.h
> > +++ linux-pm/drivers/cpufreq/cpufreq_governor.h
> > @@ -164,7 +164,7 @@ struct dbs_governor {
> >  	void (*free)(struct policy_dbs_info *policy_dbs);
> >  	int (*init)(struct dbs_data *dbs_data, bool notify);
> >  	void (*exit)(struct dbs_data *dbs_data, bool notify);
> > -	void (*start)(struct cpufreq_policy *policy);
> > +	bool (*start)(struct cpufreq_policy *policy);
> >  };
> >  
> >  static inline struct dbs_governor *dbs_governor_of(struct cpufreq_policy *policy)
> > Index: linux-pm/drivers/cpufreq/cpufreq_schedutil.c
> > ===================================================================
> > --- /dev/null
> > +++ linux-pm/drivers/cpufreq/cpufreq_schedutil.c
> > @@ -0,0 +1,253 @@
> > +/*
> > + * CPUFreq governor based on scheduler-provided CPU utilization data.
> > + *
> > + * Copyright (C) 2016, Intel Corporation
> > + * Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > + *
> > + * This program is free software; you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License version 2 as
> > + * published by the Free Software Foundation.
> > + */
> > +
> > +#include <linux/percpu-defs.h>
> > +#include <linux/slab.h>
> > +
> > +#include "cpufreq_governor.h"
> > +
> > +struct sugov_policy {
> > +	struct policy_dbs_info policy_dbs;
> > +	unsigned int next_freq;
> > +	raw_spinlock_t update_lock;  /* For shared policies */
> > +};
> > +
> > +static inline struct sugov_policy *to_sg_policy(struct policy_dbs_info *policy_dbs)
> > +{
> > +	return container_of(policy_dbs, struct sugov_policy, policy_dbs);
> > +}
> > +
> > +struct sugov_cpu {
> > +	struct update_util_data update_util;
> > +	struct policy_dbs_info *policy_dbs;
> > +	/* The fields below are only needed when sharing a policy. */
> > +	unsigned long util;
> > +	unsigned long max;
> > +	u64 last_update;
> > +};
> > +
> > +static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu);
> > +
> > +/************************ Governor internals ***********************/
> > +
> > +static unsigned int sugov_next_freq(struct policy_dbs_info *policy_dbs,
> > +				    unsigned long util, unsigned long max,
> > +				    u64 last_sample_time)
> > +{
> > +	struct cpufreq_policy *policy = policy_dbs->policy;
> > +	unsigned int min_f = policy->min;
> > +	unsigned int max_f = policy->max;
> > +	unsigned int j;
> > +
> > +	if (util > max || min_f >= max_f)
> > +		return max_f;
> > +
> > +	for_each_cpu(j, policy->cpus) {
> > +		struct sugov_cpu *j_sg_cpu;
> > +		unsigned long j_util, j_max;
> > +
> > +		if (j == smp_processor_id())
> > +			continue;
> > +
> > +		j_sg_cpu = &per_cpu(sugov_cpu, j);
> > +		/*
> > +		 * If the CPU was last updated before the previous sampling
> > +		 * time, we don't take it into account as it probably is idle
> > +		 * now.
> > +		 */
> 
> If the sampling rate is less than the tick, it seems possible a busy CPU
> may not manage to get an update in within the current sampling period.

Right.

sampling_rate is more of a rate limit here, though, because the governor doesn't
really do any sampling (I was talking about that in the intro message at
http://marc.info/?l=linux-pm&m=145609673008122&w=2).

It was easy to re-use the existing show/store for that, so I took the shortcut,
but I'm considering introducing a new attribute with a more suitable name for
that.  That would help to avoid a couple of unclean things in patch [2/2] as
well if I'm not mistaken.

> What about checking to see if there was an update within the last tick
> period, rather than sample period?

If your rate limit is set below the rate of the updates (scheduling rate more
or less), it simply has no effect.  To me, that case doesn't require any
special care.

> There's also NO_HZ_FULL. I don't know that anyone using NO_HZ_FULL would
> want to use a governor other than performance, but I thought I'd mention
> it just in case.

Yup.  That's broken already with anything different from "performance".

I think we should fix it, but it can be taken care of later.  Plus that's one
of the things that need to be fixed in general rather than for a single
governor.

> > +		if (j_sg_cpu->last_update < last_sample_time)
> > +			continue;
> > +
> > +		j_util = j_sg_cpu->util;
> > +		j_max = j_sg_cpu->max;
> > +		if (j_util > j_max)
> > +			return max_f;
> > +
> > +		if (j_util * max > j_max * util) {
> > +			util = j_util;
> > +			max = j_max;
> > +		}
> > +	}
> > +
> > +	return min_f + util * (max_f - min_f) / max;
> > +}
> > +
> > +static void sugov_update_commit(struct policy_dbs_info *policy_dbs, u64 time,
> > +				unsigned int next_freq)
> > +{
> > +	struct sugov_policy *sg_policy = to_sg_policy(policy_dbs);
> > +
> > +	policy_dbs->last_sample_time = time;
> > +	if (sg_policy->next_freq != next_freq) {
> > +		sg_policy->next_freq = next_freq;
> > +		policy_dbs->work_in_progress = true;
> > +		irq_work_queue(&policy_dbs->irq_work);
> > +	}
> > +}
> > +
> > +static void sugov_update_shared(struct update_util_data *data, u64 time,
> > +				unsigned long util, unsigned long max)
> > +{
> > +	struct sugov_cpu *sg_cpu = container_of(data, struct sugov_cpu, update_util);
> > +	struct policy_dbs_info *policy_dbs = sg_cpu->policy_dbs;
> > +	struct sugov_policy *sg_policy = to_sg_policy(policy_dbs);
> > +	unsigned int next_f;
> > +	u64 delta_ns, lst;
> > +
> > +	raw_spin_lock(&sg_policy->update_lock);
> > +
> > +	sg_cpu->util = util;
> > +	sg_cpu->max = max;
> > +	sg_cpu->last_update = time;
> > +
> > +	if (policy_dbs->work_in_progress)
> > +		goto out;
> > +
> > +	/*
> > +	 * This assumes that dbs_data_handler() will not change sample_delay_ns.
> > +	 */
> > +	lst = policy_dbs->last_sample_time;
> > +	delta_ns = time - lst;
> > +	if ((s64)delta_ns < policy_dbs->sample_delay_ns)
> > +		goto out;
> > +
> > +	next_f = sugov_next_freq(policy_dbs, util, max, lst);
> > +
> > +	sugov_update_commit(policy_dbs, time, next_f);
> > +
> > + out:
> > +	raw_spin_unlock(&sg_policy->update_lock);
> > +}
> > +
> > +static void sugov_update_single(struct update_util_data *data, u64 time,
> > +				unsigned long util, unsigned long max)
> > +{
> > +	struct sugov_cpu *sg_cpu = container_of(data, struct sugov_cpu, update_util);
> > +	struct policy_dbs_info *policy_dbs = sg_cpu->policy_dbs;
> > +	unsigned int min_f, max_f, next_f;
> > +	u64 delta_ns;
> > +
> > +	if (policy_dbs->work_in_progress)
> > +		return;
> > +
> > +	/*
> > +	 * This assumes that dbs_data_handler() will not change sample_delay_ns.
> > +	 */
> > +	delta_ns = time - policy_dbs->last_sample_time;
> > +	if ((s64)delta_ns < policy_dbs->sample_delay_ns)
> > +		return;
> > +
> > +	min_f = policy_dbs->policy->min;
> > +	max_f = policy_dbs->policy->max;
> > +	next_f = util > max || min_f >= max_f ? max_f :
> > +			min_f + util * (max_f - min_f) / max;
> > +
> > +	sugov_update_commit(policy_dbs, time, next_f);
> > +}
> > +
> > +/************************** sysfs interface ************************/
> > +
> > +gov_show_one_common(sampling_rate);
> > +
> > +gov_attr_rw(sampling_rate);
> > +
> > +static struct attribute *sugov_attributes[] = {
> > +	&sampling_rate.attr,
> > +	NULL
> > +};
> > +
> > +/********************** dbs_governor interface *********************/
> > +
> > +static struct policy_dbs_info *sugov_alloc(void)
> > +{
> > +	struct sugov_policy *sg_policy;
> > +
> > +	sg_policy = kzalloc(sizeof(*sg_policy), GFP_KERNEL);
> > +	if (!sg_policy)
> > +		return NULL;
> > +
> > +	raw_spin_lock_init(&sg_policy->update_lock);
> > +	return &sg_policy->policy_dbs;
> > +}
> > +
> > +static void sugov_free(struct policy_dbs_info *policy_dbs)
> > +{
> > +	kfree(to_sg_policy(policy_dbs));
> 
> Is it possible that a CPU could still be in the sugov_update_* path
> above when this runs?

No, it isn't.

> I see that cpufreq_governor_stop will call
> gov_cancel_work which clears the update_util hook, but I don't see
> anything that synchronizes with all CPUs having exited the update path.

gov_cancel_work() calls gov_clear_update_util() which clears the update_util
pointers for all policy CPUs and invokes synchronize_rcu() (or synchronize_sched()
with my latest patch applied: https://patchwork.kernel.org/patch/8443191/).

This guarantees that sugov_update_* won't be running on any of those CPUs
when that has returned.

> > +}
> > +
> > +static bool sugov_start(struct cpufreq_policy *policy)
> > +{
> > +	struct policy_dbs_info *policy_dbs = policy->governor_data;
> > +	struct sugov_policy *sg_policy = to_sg_policy(policy_dbs);
> > +	unsigned int cpu;
> > +
> > +	gov_update_sample_delay(policy_dbs, policy_dbs->dbs_data->sampling_rate);
> > +	policy_dbs->last_sample_time = 0;
> > +	sg_policy->next_freq = UINT_MAX;
> > +
> > +	for_each_cpu(cpu, policy->cpus) {
> > +		struct sugov_cpu *sg_cpu = &per_cpu(sugov_cpu, cpu);
> > +
> > +		sg_cpu->policy_dbs = policy_dbs;
> > +		if (policy_dbs->is_shared) {
> > +			sg_cpu->util = ULONG_MAX;
> > +			sg_cpu->max = 0;
> > +			sg_cpu->last_update = 0;
> > +			sg_cpu->update_util.func = sugov_update_shared;
> > +		} else {
> > +			sg_cpu->update_util.func = sugov_update_single;
> > +		}
> > +		cpufreq_set_update_util_data(cpu, &sg_cpu->update_util);
> > +	}
> > +	return false;
> > +}
> > +
> > +static unsigned int sugov_set_freq(struct cpufreq_policy *policy)
> > +{
> > +	struct policy_dbs_info *policy_dbs = policy->governor_data;
> > +	struct sugov_policy *sg_policy = to_sg_policy(policy_dbs);
> > +
> > +	__cpufreq_driver_target(policy, sg_policy->next_freq, CPUFREQ_RELATION_C);
> 
> Similar to the concern raised in the acpi changes I think this should be
> CPUFREQ_RELATION_L, otherwise you may not run fast enough to meet demand.

"ondemand" uses CPUFREQ_RELATION_C when powersave_bias is unset and that's why
I used it here.

Like I said in my reply to Peter in that thread, using RELATION_L here is likely
to make us avoid the min frequency almost entirely even if the system is almost
completely idle.  I don't think that would be OK really.

That said my opinion about this particular item isn't really strong.

> > +	return policy_dbs->dbs_data->sampling_rate;
> > +}
> > +
> > +static struct dbs_governor schedutil_gov = {
> > +	.gov = {
> > +		.name = "schedutil",
> > +		.governor = cpufreq_governor_dbs,
> > +		.max_transition_latency	= TRANSITION_LATENCY_LIMIT,
> > +		.owner = THIS_MODULE,
> > +	},
> > +	.kobj_type = { .default_attrs = sugov_attributes },
> > +	.gov_dbs_timer = sugov_set_freq,
> > +	.alloc = sugov_alloc,
> > +	.free = sugov_free,
> > +	.start = sugov_start,
> > +};
> > +
> > +#define CPU_FREQ_GOV_SCHEDUTIL	(&schedutil_gov.gov)
> > +
> > +static int __init sugov_init(void)
> > +{
> > +	return cpufreq_register_governor(CPU_FREQ_GOV_SCHEDUTIL);
> > +}
> > +
> > +static void __exit sugov_exit(void)
> > +{
> > +	cpufreq_unregister_governor(CPU_FREQ_GOV_SCHEDUTIL);
> > +}
> > +
> > +MODULE_AUTHOR("Rafael J. Wysocki <rafael.j.wysocki@intel.com>");
> > +MODULE_DESCRIPTION("Utilization-based CPU frequency selection");
> > +MODULE_LICENSE("GPL");
> > +
> > +module_init(sugov_init);
> > +module_exit(sugov_exit);
> > Index: linux-pm/drivers/cpufreq/cpufreq_conservative.c
> > ===================================================================
> > --- linux-pm.orig/drivers/cpufreq/cpufreq_conservative.c
> > +++ linux-pm/drivers/cpufreq/cpufreq_conservative.c
> > @@ -299,12 +299,13 @@ static void cs_exit(struct dbs_data *dbs
> >  	kfree(dbs_data->tuners);
> >  }
> >  
> > -static void cs_start(struct cpufreq_policy *policy)
> > +static bool cs_start(struct cpufreq_policy *policy)
> >  {
> >  	struct cs_policy_dbs_info *dbs_info = to_dbs_info(policy->governor_data);
> >  
> >  	dbs_info->down_skip = 0;
> >  	dbs_info->requested_freq = policy->cur;
> > +	return true;
> >  }
> >  
> >  static struct dbs_governor cs_dbs_gov = {
> > Index: linux-pm/drivers/cpufreq/cpufreq_governor.c
> > ===================================================================
> > --- linux-pm.orig/drivers/cpufreq/cpufreq_governor.c
> > +++ linux-pm/drivers/cpufreq/cpufreq_governor.c
> > @@ -472,9 +472,11 @@ static int cpufreq_governor_init(struct
> >  	INIT_LIST_HEAD(&dbs_data->policy_dbs_list);
> >  	mutex_init(&dbs_data->mutex);
> >  
> > -	ret = gov->init(dbs_data, !policy->governor->initialized);
> > -	if (ret)
> > -		goto free_policy_dbs_info;
> > +	if (gov->init) {
> > +		ret = gov->init(dbs_data, !policy->governor->initialized);
> > +		if (ret)
> > +			goto free_policy_dbs_info;
> > +	}
> >  
> >  	/* policy latency is in ns. Convert it to us first */
> >  	latency = policy->cpuinfo.transition_latency / 1000;
> > @@ -510,7 +512,10 @@ static int cpufreq_governor_init(struct
> >  
> >  	if (!have_governor_per_policy())
> >  		gov->gdbs_data = NULL;
> > -	gov->exit(dbs_data, !policy->governor->initialized);
> > +
> > +	if (gov->exit)
> > +		gov->exit(dbs_data, !policy->governor->initialized);
> > +
> >  	kfree(dbs_data);
> >  
> >  free_policy_dbs_info:
> > @@ -544,7 +549,9 @@ static int cpufreq_governor_exit(struct
> >  		if (!have_governor_per_policy())
> >  			gov->gdbs_data = NULL;
> >  
> > -		gov->exit(dbs_data, policy->governor->initialized == 1);
> > +		if (gov->exit)
> > +			gov->exit(dbs_data, policy->governor->initialized == 1);
> > +
> >  		mutex_destroy(&dbs_data->mutex);
> >  		kfree(dbs_data);
> >  	} else {
> > @@ -588,9 +595,9 @@ static int cpufreq_governor_start(struct
> >  			j_cdbs->prev_cpu_nice = kcpustat_cpu(j).cpustat[CPUTIME_NICE];
> >  	}
> >  
> > -	gov->start(policy);
> > +	if (gov->start(policy))
> > +		gov_set_update_util(policy_dbs, sampling_rate);
> >  
> > -	gov_set_update_util(policy_dbs, sampling_rate);
> >  	return 0;
> >  }
> >  
> > Index: linux-pm/drivers/cpufreq/cpufreq_ondemand.c
> > ===================================================================
> > --- linux-pm.orig/drivers/cpufreq/cpufreq_ondemand.c
> > +++ linux-pm/drivers/cpufreq/cpufreq_ondemand.c
> > @@ -402,12 +402,13 @@ static void od_exit(struct dbs_data *dbs
> >  	kfree(dbs_data->tuners);
> >  }
> >  
> > -static void od_start(struct cpufreq_policy *policy)
> > +static bool od_start(struct cpufreq_policy *policy)
> >  {
> >  	struct od_policy_dbs_info *dbs_info = to_dbs_info(policy->governor_data);
> >  
> >  	dbs_info->sample_type = OD_NORMAL_SAMPLE;
> >  	ondemand_powersave_bias_init(policy);
> > +	return true;
> >  }
> >  
> >  static struct od_ops od_ops = {
> > Index: linux-pm/drivers/cpufreq/Kconfig
> > ===================================================================
> > --- linux-pm.orig/drivers/cpufreq/Kconfig
> > +++ linux-pm/drivers/cpufreq/Kconfig
> > @@ -184,6 +184,21 @@ config CPU_FREQ_GOV_CONSERVATIVE
> >  
> >  	  If in doubt, say N.
> >  
> > +config CPU_FREQ_GOV_SCHEDUTIL
> > +	tristate "'schedutil' cpufreq policy governor"
> > +	depends on CPU_FREQ
> 
> Do you need to select IRQ_WORK here?

No, I don't.

It already is selected by CPU_FREQ in linux-next, but it really should be
selected by CPU_FREQ_GOV_COMMON, so let me cut a patch to make that change.

> > +	select CPU_FREQ_GOV_COMMON
> > +	help
> > +	  The frequency selection formula used by this governor is analogous
> > +	  to the one used by 'ondemand', but instead of computing CPU load
> > +	  as the "non-idle CPU time" to "total CPU time" ratio, it uses CPU
> > +	  utilization data provided by the scheduler as input.
> > +
> > +	  To compile this driver as a module, choose M here: the
> > +	  module will be called cpufreq_schedutil.
> > +
> > +	  If in doubt, say N.
> > +
> >  comment "CPU frequency scaling drivers"
> >  
> >  config CPUFREQ_DT
> > Index: linux-pm/drivers/cpufreq/Makefile
> > ===================================================================
> > --- linux-pm.orig/drivers/cpufreq/Makefile
> > +++ linux-pm/drivers/cpufreq/Makefile
> > @@ -10,6 +10,7 @@ obj-$(CONFIG_CPU_FREQ_GOV_POWERSAVE)	+=
> >  obj-$(CONFIG_CPU_FREQ_GOV_USERSPACE)	+= cpufreq_userspace.o
> >  obj-$(CONFIG_CPU_FREQ_GOV_ONDEMAND)	+= cpufreq_ondemand.o
> >  obj-$(CONFIG_CPU_FREQ_GOV_CONSERVATIVE)	+= cpufreq_conservative.o
> > +obj-$(CONFIG_CPU_FREQ_GOV_SCHEDUTIL)	+= cpufreq_schedutil.o
> >  obj-$(CONFIG_CPU_FREQ_GOV_COMMON)		+= cpufreq_governor.o
> >  
> >  obj-$(CONFIG_CPUFREQ_DT)		+= cpufreq-dt.o
> > 

Thanks for your comments,
Rafael

Steve Muckle March 1, 2016, 4:10 a.m. UTC | #4
On 02/27/2016 07:24 AM, Rafael J. Wysocki wrote:
>>> +static unsigned int sugov_next_freq(struct policy_dbs_info *policy_dbs,
>>> +				    unsigned long util, unsigned long max,
>>> +				    u64 last_sample_time)
>>> +{
>>> +	struct cpufreq_policy *policy = policy_dbs->policy;
>>> +	unsigned int min_f = policy->min;
>>> +	unsigned int max_f = policy->max;
>>> +	unsigned int j;
>>> +
>>> +	if (util > max || min_f >= max_f)
>>> +		return max_f;
>>> +
>>> +	for_each_cpu(j, policy->cpus) {
>>> +		struct sugov_cpu *j_sg_cpu;
>>> +		unsigned long j_util, j_max;
>>> +
>>> +		if (j == smp_processor_id())
>>> +			continue;
>>> +
>>> +		j_sg_cpu = &per_cpu(sugov_cpu, j);
>>> +		/*
>>> +		 * If the CPU was last updated before the previous sampling
>>> +		 * time, we don't take it into account as it probably is idle
>>> +		 * now.
>>> +		 */
>>
>> If the sampling rate is less than the tick, it seems possible a busy CPU
>> may not manage to get an update in within the current sampling period.
> 
> Right.
> 
> sampling_rate is more of a rate limit here, though, because the governor doesn't
> really do any sampling (I was talking about that in the intro message at
> http://marc.info/?l=linux-pm&m=145609673008122&w=2).
> 
> It was easy to re-use the existing show/store for that, so I took the shortcut,
> but I'm considering introducing a new attribute with a more suitable name for
> that.  That would help to avoid a couple of unclean things in patch [2/2] as
> well if I'm not mistaken.
> 
>> What about checking to see if there was an update within the last tick
>> period, rather than sample period?
> 
> If your rate limit is set below the rate of the updates (scheduling rate more
> or less), it simply has no effect.  To me, that case doesn't require any
> special care.

I'm specifically worried about the check below where we omit a CPU's
capacity request if its last update came before the last sample time.

Say there are 2 CPUs in a frequency domain, HZ is 100 and the sample
delay here is 4ms. At t=0ms CPU0 starts running a CPU hog and reports
being fully busy. At t=5ms CPU1 starts running a task - CPU0's vote is
kept and factored in and last_sample_time is updated to t=5ms. At t=9ms
CPU1 stops running the task and updates again, but this time CPU0's vote
is discarded because its last update is t=0ms, and last_sample_time is
5ms. But CPU0 is fully busy and the freq is erroneously dropped.

> 
>>> +		if (j_sg_cpu->last_update < last_sample_time)
>>> +			continue;
>>> +
>>> +		j_util = j_sg_cpu->util;
>>> +		j_max = j_sg_cpu->max;
>>> +		if (j_util > j_max)
>>> +			return max_f;
>>> +
>>> +		if (j_util * max > j_max * util) {
>>> +			util = j_util;
>>> +			max = j_max;
>>> +		}
>>> +	}
>>> +
>>> +	return min_f + util * (max_f - min_f) / max;
>>> +}
...
>>> +static unsigned int sugov_set_freq(struct cpufreq_policy *policy)
>>> +{
>>> +	struct policy_dbs_info *policy_dbs = policy->governor_data;
>>> +	struct sugov_policy *sg_policy = to_sg_policy(policy_dbs);
>>> +
>>> +	__cpufreq_driver_target(policy, sg_policy->next_freq, CPUFREQ_RELATION_C);
>>
>> Similar to the concern raised in the acpi changes I think this should be
>> CPUFREQ_RELATION_L, otherwise you may not run fast enough to meet demand.
> 
> "ondemand" uses CPUFREQ_RELATION_C when powersave_bias is unset and that's why
> I used it here.
> 
> Like I said in my reply to Peter in that thread, using RELATION_L here is likely
> to make us avoid the min frequency almost entirely even if the system is almost
> completely idle.  I don't think that would be OK really.
> 
> That said my opinion about this particular item isn't really strong.

I think the calculation for required CPU bandwidth needs tweaking.

Aside from always wanting something past fmin, currently the amount of
extra CPU capacity given for a particular % utilization depends on how
high the platform's fmin happens to be, even if the fmax speeds are the
same. For example given two platforms with the following available
frequencies (MHz):

platform A: 100, 300, 500, 700, 900, 1100
platform B: 500, 700, 900, 1100

A 50% utilization load on platform A will want 600 MHz (rounding up to
700 MHz perhaps) whereas platform B will want 800 MHz (again likely
rounding up to 900 MHz), even though the load consumes 550 MHz on both
platforms.
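
Spelling out the arithmetic with the patch's linear mapping, min_f +
(util/max) * (max_f - min_f), at util/max = 0.5: platform A requests
100 + 0.5 * (1100 - 100) = 600 MHz while platform B requests
500 + 0.5 * (1100 - 500) = 800 MHz, even though the absolute demand,
(util/max) * fmax = 550 MHz, is the same on both.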

One possibility would be something like we had in schedfreq: getting the
absolute CPU bw requirement, (util/max) * fmax, and then adding some %
margin. It is true that it means figuring out what the right margin is
and now there's a magic number (and potentially a tunable), but it would
be more consistent across platforms.
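
A minimal sketch of that alternative mapping (illustrative only; the
function and the margin_pct tunable are invented for this example rather
than taken from schedfreq):

	static unsigned int freq_from_abs_bw(unsigned int max_f,
					     unsigned long util,
					     unsigned long max,
					     unsigned int margin_pct)
	{
		/* Absolute bandwidth requirement: (util / max) * fmax. */
		unsigned long f = util * max_f / max;

		/* Add a fixed percentage of headroom, then clamp. */
		f += f * margin_pct / 100;
		return f > max_f ? max_f : f;
	}

With margin_pct = 25, the 50% load above would map to 687 MHz on both
platforms (before rounding up to the next available OPP), independent
of fmin.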

thanks,
Steve

Rafael J. Wysocki March 1, 2016, 8:20 p.m. UTC | #5
On Tue, Mar 1, 2016 at 5:10 AM, Steve Muckle <steve.muckle@linaro.org> wrote:
> On 02/27/2016 07:24 AM, Rafael J. Wysocki wrote:
>>>> +static unsigned int sugov_next_freq(struct policy_dbs_info *policy_dbs,
>>>> +                               unsigned long util, unsigned long max,
>>>> +                               u64 last_sample_time)
>>>> +{
>>>> +   struct cpufreq_policy *policy = policy_dbs->policy;
>>>> +   unsigned int min_f = policy->min;
>>>> +   unsigned int max_f = policy->max;
>>>> +   unsigned int j;
>>>> +
>>>> +   if (util > max || min_f >= max_f)
>>>> +           return max_f;
>>>> +
>>>> +   for_each_cpu(j, policy->cpus) {
>>>> +           struct sugov_cpu *j_sg_cpu;
>>>> +           unsigned long j_util, j_max;
>>>> +
>>>> +           if (j == smp_processor_id())
>>>> +                   continue;
>>>> +
>>>> +           j_sg_cpu = &per_cpu(sugov_cpu, j);
>>>> +           /*
>>>> +            * If the CPU was last updated before the previous sampling
>>>> +            * time, we don't take it into account as it probably is idle
>>>> +            * now.
>>>> +            */
>>>
>>> If the sampling rate is less than the tick, it seems possible a busy CPU
>>> may not manage to get an update in within the current sampling period.
>>
>> Right.
>>
>> sampling_rate is more of a rate limit here, though, because the governor doesn't
>> really do any sampling (I was talking about that in the intro message at
>> http://marc.info/?l=linux-pm&m=145609673008122&w=2).
>>
>> It was easy to re-use the existing show/store for that, so I took the shortcut,
>> but I'm considering introducing a new attribute with a more suitable name for
>> that.  That would help to avoid a couple of unclean things in patch [2/2] as
>> well if I'm not mistaken.
>>
>>> What about checking to see if there was an update within the last tick
>>> period, rather than sample period?
>>
>> If your rate limit is set below the rate of the updates (scheduling rate more
>> or less), it simply has no effect.  To me, that case doesn't require any
>> special care.
>
> I'm specifically worried about the check below where we omit a CPU's
> capacity request if its last update came before the last sample time.
>
> Say there are 2 CPUs in a frequency domain, HZ is 100 and the sample
> delay here is 4ms.

Yes, that's the case I clearly didn't take into consideration. :-)

My assumption was that the sample delay would always be greater than
the typical update interval, which of course need not be the case.

The reason I added the check at all was that the numbers from the
other CPUs may become stale if those CPUs are idle for too long, so at
some point the contributions from them need to be discarded.  The
question is when that point is, and since the sample delay may be
arbitrary, that mechanism has to be more complex.

> At t=0ms CPU0 starts running a CPU hog and reports
> being fully busy. At t=5ms CPU1 starts running a task - CPU0's vote is
> kept and factored in and last_sample_time is updated to t=5ms. At t=9ms
> CPU1 stops running the task and updates again, but this time CPU0's vote
> is discarded because its last update is t=0ms, and last_sample_time is
> 5ms. But CPU0 is fully busy and the freq is erroneously dropped.
>

[cut]

>>
>> Like I said in my reply to Peter in that thread, using RELATION_L here is likely
>> to make us avoid the min frequency almost entirely even if the system is almost
>> completely idle.  I don't think that would be OK really.
>>
>> That said my opinion about this particular item isn't really strong.
>
> I think the calculation for required CPU bandwidth needs tweaking.

The reason why I used that particular formula was that ondemand used
it.  Of course, the input to it is different in ondemand, but the idea
here is to avoid departing from it too much.

> Aside from always wanting something past fmin, currently the amount of
> extra CPU capacity given for a particular % utilization depends on how
> high the platform's fmin happens to be, even if the fmax speeds are the
> same. For example given two platforms with the following available
> frequencies (MHz):
>
> platform A: 100, 300, 500, 700, 900, 1100
> platform B: 500, 700, 900, 1100

The frequencies may not determine raw performance, though, so 500 MHz
in platform A may correspond to 700 MHz in platform B.  You never
know.

>
> A 50% utilization load on platform A will want 600 MHz (rounding up to
> 700 MHz perhaps) whereas platform B will want 800 MHz (again likely
> rounding up to 900 MHz), even though the load consumes 550 MHz on both
> platforms.
>
> One possibility would be something like what we had in schedfreq: getting
> the absolute CPU bw requirement (util/max) * fmax and then adding some %
> margin. It is true that it means figuring out what the right margin is,
> and now there's a magic number (and potentially a tunable), but it would
> be more consistent.
>

What the picture is missing is the information on how much more
performance you get by running in a higher P-state (or OPP if you
will).  We don't have that information, however, and relying on
frequency values here generally doesn't help.

Moreover, since 0 utilization gets you to run in f_min no matter what,
if you treat f_max as an absolute, you're going to underutilize the
P-states in the upper half of the available range.

Thanks,
Rafael
Steve Muckle March 3, 2016, 3:20 a.m. UTC | #6
On 03/01/2016 12:20 PM, Rafael J. Wysocki wrote:
>> I'm specifically worried about the check below where we omit a CPU's
>> capacity request if its last update came before the last sample time.
>>
>> Say there are 2 CPUs in a frequency domain, HZ is 100 and the sample
>> delay here is 4ms.
> 
> Yes, that's the case I clearly didn't take into consideration. :-)
> 
> My assumption was that the sample delay would always be greater than
> the typical update interval, which of course need not be the case.
> 
> The reason I added the check at all was that the numbers from the
> other CPUs may become stale if those CPUs are idle for too long, so at
> some point their contributions need to be discarded.  The question is
> when that point is, and since the sample delay may be arbitrary, that
> mechanism has to be more complex.

Yeah, this has been an open issue on our end as well. Sampling-based
governors of course solved this primarily via their fundamental nature
and sampling rate. The interactive governor also has a separate tunable
IIRC which specifies how long a CPU may have its sampling timer deferred
due to idle when running @ > fmin (the "slack timer").

Decoupling the CPU update staleness limit from the freq change rate
limit via a separate tunable would be valuable IMO. Would you be
amenable to a patch that did that?

>>> Like I said in my reply to Peter in that thread, using RELATION_L here is likely
>>> to make us avoid the min frequency almost entirely even if the system is almost
>>> completely idle.  I don't think that would be OK really.
>>>
>>> That said my opinion about this particular item isn't really strong.
>>
>> I think the calculation for required CPU bandwidth needs tweaking.
> 
> The reason why I used that particular formula was that ondemand used
> it.  Of course, the input to it is different in ondemand, but the idea
> here is to avoid departing from it too much.
> 
>> Aside from always wanting something past fmin, currently the amount of
>> extra CPU capacity given for a particular % utilization depends on how
>> high the platform's fmin happens to be, even if the fmax speeds are the
>> same. For example given two platforms with the following available
>> frequencies (MHz):
>>
>> platform A: 100, 300, 500, 700, 900, 1100
>> platform B: 500, 700, 900, 1100
> 
> The frequencies may not determine raw performance, though, so 500 MHz
> in platform A may correspond to 700 MHz in platform B.  You never
> know.

My example here was solely intended to illustrate that the current
algorithm itself introduces an inconsistency in policy when other things
are equal: the higher a platform's fmin, the more generous the amount of
CPU bandwidth headroom this ondemand-style calculation will give it.

It'd be good to be able to express the desired amount of CPU bandwidth
headroom in such a way that it doesn't depend on the platform's fmin
value, since CPU headroom is a critical factor in tuning a platform's
governor for optimal power and performance.
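
To put numbers on that, here is a minimal standalone sketch (plain C;
the fmin/fmax values are the hypothetical platforms A and B from the
example quoted above, at 50% utilization) contrasting the ondemand-style
mapping with the absolute schedfreq-style one:

#include <stdio.h>

int main(void)
{
	unsigned long util = 50, max = 100;	/* 50% utilization */
	unsigned int fmax = 1100;		/* same on both platforms */
	unsigned int fmin_a = 100, fmin_b = 500;

	/* ondemand-style: min_f + util * (max_f - min_f) / max */
	printf("A: %lu MHz, B: %lu MHz\n",
	       fmin_a + util * (fmax - fmin_a) / max,	/* 600 MHz */
	       fmin_b + util * (fmax - fmin_b) / max);	/* 800 MHz */

	/* schedfreq-style: (util / max) * fmax, before any margin */
	printf("either: %lu MHz\n", util * fmax / max);	/* 550 MHz */

	return 0;
}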

> 
>>
>> A 50% utilization load on platform A will want 600 MHz (rounding up to
>> 700 MHz perhaps) whereas platform B will want 800 MHz (again likely
>> rounding up to 900 MHz), even though the load consumes 550 MHz on both
>> platforms.
>>
>> One possibility would be something like what we had in schedfreq: getting
>> the absolute CPU bw requirement (util/max) * fmax and then adding some %
>> margin. It is true that it means figuring out what the right margin is,
>> and now there's a magic number (and potentially a tunable), but it would
>> be more consistent.
>>
> 
> What the picture is missing is the information on how much more
> performance you get by running in a higher P-state (or OPP if you
> will).  We don't have that information, however, and relying on
> frequency values here generally doesn't help.

Why does the frequency value not help? It is true there may be issues with
a workload being memory bound and not responding quite linearly to
increasing frequency, but that would pose a problem for the current
algorithm also. Surely it's better to attempt a consistent policy which
doesn't vary based on a platform's fmin value?

> Moreover, since 0 utilization gets you to run in f_min no matter what,
> if you treat f_max as an absolute, you're going to underutilize the
> P-states in the upper half of the available range.

Sorry, I didn't follow. What do you mean by underutilizing the upper half
of the range? I don't see how using RELATION_L with (util/max) * fmax *
(headroom) wouldn't be correct in that regard.

thanks,
Steve

Steve Muckle March 3, 2016, 3:35 a.m. UTC | #7
On 03/02/2016 07:20 PM, Steve Muckle wrote:
> Why does the frequency value not help? It is true there may be issues of
> a workload being memory bound and not responding quite linearly to
> increasing frequency, but that would pose a problem for the current
> algorithm also. Surely it's better to attempt a consistent policy which
> doesn't vary based on a platform's fmin value?

FWIW I'm not trying to hold up this series - rather just to discuss
possibilities and differences with the now deprecated solution that
might be integrated here sometime in the near future.
Rafael J. Wysocki March 3, 2016, 7:20 p.m. UTC | #8
On Thu, Mar 3, 2016 at 4:20 AM, Steve Muckle <steve.muckle@linaro.org> wrote:
> On 03/01/2016 12:20 PM, Rafael J. Wysocki wrote:
>>> I'm specifically worried about the check below where we omit a CPU's
>>> capacity request if its last update came before the last sample time.
>>>
>>> Say there are 2 CPUs in a frequency domain, HZ is 100 and the sample
>>> delay here is 4ms.
>>
>> Yes, that's the case I clearly didn't take into consideration. :-)
>>
>> My assumption was that the sample delay would always be greater than
>> the typical update interval, which of course need not be the case.
>>
>> The reason I added the check at all was that the numbers from the
>> other CPUs may become stale if those CPUs are idle for too long, so at
>> some point their contributions need to be discarded.  The question is
>> when that point is, and since the sample delay may be arbitrary, that
>> mechanism has to be more complex.
>
> Yeah, this has been an open issue on our end as well. Sampling-based
> governors of course solved this primarily via their fundamental nature
> and sampling rate. The interactive governor also has a separate tunable
> IIRC which specifies how long a CPU may have its sampling timer deferred
> due to idle when running @ > fmin (the "slack timer").
>
> Decoupling the CPU update staleness limit from the freq change rate
> limit via a separate tunable would be valuable IMO. Would you be
> amenable to a patch that did that?

Yes, I would.

It still would be better, though, if that didn't have to be a tunable.

What do you think about my idea to use NSEC_PER_SEC / HZ as the
staleness limit (like in https://patchwork.kernel.org/patch/8477261/)?
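
For illustration, a minimal sketch of how that check could look in
sugov_next_freq() (assuming the current time were passed in; the posted
patch only passes last_sample_time):

	/*
	 * Hypothetical staleness test: discard a CPU's contribution if its
	 * last update is older than one tick, independently of the
	 * frequency change rate limit.
	 */
	if (time - j_sg_cpu->last_update > NSEC_PER_SEC / HZ)
		continue;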

[cut]

>> Moreover, since 0 utilization gets you to run in f_min no matter what,
>> if you treat f_max as an absolute, you're going to underutilize the
>> P-states in the upper half of the available range.
>
> Sorry, I didn't follow. What do you mean by underutilizing the upper half
> of the range? I don't see how using RELATION_L with (util/max) * fmax *
> (headroom) wouldn't be correct in that regard.

Suppose all of the util values from 0 to max are equally probable (or
equally frequent) and the available frequencies are close enough to
each other that it doesn't really matter whether _C or _L is used.

Say f_min is 400 and f_max is 1000.

Then, if you take next_freq = f_max * util / max, 50% of requests will
fall into the 400-500 section of the available frequency range.  Of
course, 40% of them will fall to f_min, but that means that the other
available states will be used less frequently, on average.

I would prefer that to be more balanced.
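
Spelling out the arithmetic behind those percentages (util uniform over
0..max):

	util/max < 0.4:  f_max * util / max < 400, clamped to f_min  (40%)
	util/max < 0.5:  f_max * util / max < 500  (50% in the 400-500 range)

whereas next_f = min_f + util * (max_f - min_f) / max spreads the same
requests evenly across the whole 400-1000 range.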

Thanks,
Rafael
diff mbox

Patch

Index: linux-pm/drivers/cpufreq/cpufreq_governor.h
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_governor.h
+++ linux-pm/drivers/cpufreq/cpufreq_governor.h
@@ -164,7 +164,7 @@  struct dbs_governor {
 	void (*free)(struct policy_dbs_info *policy_dbs);
 	int (*init)(struct dbs_data *dbs_data, bool notify);
 	void (*exit)(struct dbs_data *dbs_data, bool notify);
-	void (*start)(struct cpufreq_policy *policy);
+	bool (*start)(struct cpufreq_policy *policy);
 };
 
 static inline struct dbs_governor *dbs_governor_of(struct cpufreq_policy *policy)
Index: linux-pm/drivers/cpufreq/cpufreq_schedutil.c
===================================================================
--- /dev/null
+++ linux-pm/drivers/cpufreq/cpufreq_schedutil.c
@@ -0,0 +1,253 @@ 
+/*
+ * CPUFreq governor based on scheduler-provided CPU utilization data.
+ *
+ * Copyright (C) 2016, Intel Corporation
+ * Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/percpu-defs.h>
+#include <linux/slab.h>
+
+#include "cpufreq_governor.h"
+
+struct sugov_policy {
+	struct policy_dbs_info policy_dbs;
+	unsigned int next_freq;
+	raw_spinlock_t update_lock;  /* For shared policies */
+};
+
+static inline struct sugov_policy *to_sg_policy(struct policy_dbs_info *policy_dbs)
+{
+	return container_of(policy_dbs, struct sugov_policy, policy_dbs);
+}
+
+struct sugov_cpu {
+	struct update_util_data update_util;
+	struct policy_dbs_info *policy_dbs;
+	/* The fields below are only needed when sharing a policy. */
+	unsigned long util;
+	unsigned long max;
+	u64 last_update;
+};
+
+static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu);
+
+/************************ Governor internals ***********************/
+
+static unsigned int sugov_next_freq(struct policy_dbs_info *policy_dbs,
+				    unsigned long util, unsigned long max,
+				    u64 last_sample_time)
+{
+	struct cpufreq_policy *policy = policy_dbs->policy;
+	unsigned int min_f = policy->min;
+	unsigned int max_f = policy->max;
+	unsigned int j;
+
+	if (util > max || min_f >= max_f)
+		return max_f;
+
+	for_each_cpu(j, policy->cpus) {
+		struct sugov_cpu *j_sg_cpu;
+		unsigned long j_util, j_max;
+
+		if (j == smp_processor_id())
+			continue;
+
+		j_sg_cpu = &per_cpu(sugov_cpu, j);
+		/*
+		 * If the CPU was last updated before the previous sampling
+		 * time, we don't take it into account as it probably is idle
+		 * now.
+		 */
+		if (j_sg_cpu->last_update < last_sample_time)
+			continue;
+
+		j_util = j_sg_cpu->util;
+		j_max = j_sg_cpu->max;
+		if (j_util > j_max)
+			return max_f;
+
+		if (j_util * max > j_max * util) {
+			util = j_util;
+			max = j_max;
+		}
+	}
+
+	return min_f + util * (max_f - min_f) / max;
+}
+
+static void sugov_update_commit(struct policy_dbs_info *policy_dbs, u64 time,
+				unsigned int next_freq)
+{
+	struct sugov_policy *sg_policy = to_sg_policy(policy_dbs);
+
+	policy_dbs->last_sample_time = time;
+	if (sg_policy->next_freq != next_freq) {
+		sg_policy->next_freq = next_freq;
+		policy_dbs->work_in_progress = true;
+		irq_work_queue(&policy_dbs->irq_work);
+	}
+}
+
+static void sugov_update_shared(struct update_util_data *data, u64 time,
+				unsigned long util, unsigned long max)
+{
+	struct sugov_cpu *sg_cpu = container_of(data, struct sugov_cpu, update_util);
+	struct policy_dbs_info *policy_dbs = sg_cpu->policy_dbs;
+	struct sugov_policy *sg_policy = to_sg_policy(policy_dbs);
+	unsigned int next_f;
+	u64 delta_ns, lst;
+
+	raw_spin_lock(&sg_policy->update_lock);
+
+	sg_cpu->util = util;
+	sg_cpu->max = max;
+	sg_cpu->last_update = time;
+
+	if (policy_dbs->work_in_progress)
+		goto out;
+
+	/*
+	 * This assumes that dbs_data_handler() will not change sample_delay_ns.
+	 */
+	lst = policy_dbs->last_sample_time;
+	delta_ns = time - lst;
+	if ((s64)delta_ns < policy_dbs->sample_delay_ns)
+		goto out;
+
+	next_f = sugov_next_freq(policy_dbs, util, max, lst);
+
+	sugov_update_commit(policy_dbs, time, next_f);
+
+ out:
+	raw_spin_unlock(&sg_policy->update_lock);
+}
+
+static void sugov_update_single(struct update_util_data *data, u64 time,
+				unsigned long util, unsigned long max)
+{
+	struct sugov_cpu *sg_cpu = container_of(data, struct sugov_cpu, update_util);
+	struct policy_dbs_info *policy_dbs = sg_cpu->policy_dbs;
+	unsigned int min_f, max_f, next_f;
+	u64 delta_ns;
+
+	if (policy_dbs->work_in_progress)
+		return;
+
+	/*
+	 * This assumes that dbs_data_handler() will not change sample_delay_ns.
+	 */
+	delta_ns = time - policy_dbs->last_sample_time;
+	if ((s64)delta_ns < policy_dbs->sample_delay_ns)
+		return;
+
+	min_f = policy_dbs->policy->min;
+	max_f = policy_dbs->policy->max;
+	next_f = util > max || min_f >= max_f ? max_f :
+			min_f + util * (max_f - min_f) / max;
+
+	sugov_update_commit(policy_dbs, time, next_f);
+}
+
+/************************** sysfs interface ************************/
+
+gov_show_one_common(sampling_rate);
+
+gov_attr_rw(sampling_rate);
+
+static struct attribute *sugov_attributes[] = {
+	&sampling_rate.attr,
+	NULL
+};
+
+/********************** dbs_governor interface *********************/
+
+static struct policy_dbs_info *sugov_alloc(void)
+{
+	struct sugov_policy *sg_policy;
+
+	sg_policy = kzalloc(sizeof(*sg_policy), GFP_KERNEL);
+	if (!sg_policy)
+		return NULL;
+
+	raw_spin_lock_init(&sg_policy->update_lock);
+	return &sg_policy->policy_dbs;
+}
+
+static void sugov_free(struct policy_dbs_info *policy_dbs)
+{
+	kfree(to_sg_policy(policy_dbs));
+}
+
+static bool sugov_start(struct cpufreq_policy *policy)
+{
+	struct policy_dbs_info *policy_dbs = policy->governor_data;
+	struct sugov_policy *sg_policy = to_sg_policy(policy_dbs);
+	unsigned int cpu;
+
+	gov_update_sample_delay(policy_dbs, policy_dbs->dbs_data->sampling_rate);
+	policy_dbs->last_sample_time = 0;
+	sg_policy->next_freq = UINT_MAX;
+
+	for_each_cpu(cpu, policy->cpus) {
+		struct sugov_cpu *sg_cpu = &per_cpu(sugov_cpu, cpu);
+
+		sg_cpu->policy_dbs = policy_dbs;
+		if (policy_dbs->is_shared) {
+			sg_cpu->util = ULONG_MAX;
+			sg_cpu->max = 0;
+			sg_cpu->last_update = 0;
+			sg_cpu->update_util.func = sugov_update_shared;
+		} else {
+			sg_cpu->update_util.func = sugov_update_single;
+		}
+		cpufreq_set_update_util_data(cpu, &sg_cpu->update_util);
+	}
+	return false;
+}
+
+static unsigned int sugov_set_freq(struct cpufreq_policy *policy)
+{
+	struct policy_dbs_info *policy_dbs = policy->governor_data;
+	struct sugov_policy *sg_policy = to_sg_policy(policy_dbs);
+
+	__cpufreq_driver_target(policy, sg_policy->next_freq, CPUFREQ_RELATION_C);
+	return policy_dbs->dbs_data->sampling_rate;
+}
+
+static struct dbs_governor schedutil_gov = {
+	.gov = {
+		.name = "schedutil",
+		.governor = cpufreq_governor_dbs,
+		.max_transition_latency	= TRANSITION_LATENCY_LIMIT,
+		.owner = THIS_MODULE,
+	},
+	.kobj_type = { .default_attrs = sugov_attributes },
+	.gov_dbs_timer = sugov_set_freq,
+	.alloc = sugov_alloc,
+	.free = sugov_free,
+	.start = sugov_start,
+};
+
+#define CPU_FREQ_GOV_SCHEDUTIL	(&schedutil_gov.gov)
+
+static int __init sugov_init(void)
+{
+	return cpufreq_register_governor(CPU_FREQ_GOV_SCHEDUTIL);
+}
+
+static void __exit sugov_exit(void)
+{
+	cpufreq_unregister_governor(CPU_FREQ_GOV_SCHEDUTIL);
+}
+
+MODULE_AUTHOR("Rafael J. Wysocki <rafael.j.wysocki@intel.com>");
+MODULE_DESCRIPTION("Utilization-based CPU frequency selection");
+MODULE_LICENSE("GPL");
+
+module_init(sugov_init);
+module_exit(sugov_exit);
Index: linux-pm/drivers/cpufreq/cpufreq_conservative.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_conservative.c
+++ linux-pm/drivers/cpufreq/cpufreq_conservative.c
@@ -299,12 +299,13 @@  static void cs_exit(struct dbs_data *dbs
 	kfree(dbs_data->tuners);
 }
 
-static void cs_start(struct cpufreq_policy *policy)
+static bool cs_start(struct cpufreq_policy *policy)
 {
 	struct cs_policy_dbs_info *dbs_info = to_dbs_info(policy->governor_data);
 
 	dbs_info->down_skip = 0;
 	dbs_info->requested_freq = policy->cur;
+	return true;
 }
 
 static struct dbs_governor cs_dbs_gov = {
Index: linux-pm/drivers/cpufreq/cpufreq_governor.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_governor.c
+++ linux-pm/drivers/cpufreq/cpufreq_governor.c
@@ -472,9 +472,11 @@  static int cpufreq_governor_init(struct
 	INIT_LIST_HEAD(&dbs_data->policy_dbs_list);
 	mutex_init(&dbs_data->mutex);
 
-	ret = gov->init(dbs_data, !policy->governor->initialized);
-	if (ret)
-		goto free_policy_dbs_info;
+	if (gov->init) {
+		ret = gov->init(dbs_data, !policy->governor->initialized);
+		if (ret)
+			goto free_policy_dbs_info;
+	}
 
 	/* policy latency is in ns. Convert it to us first */
 	latency = policy->cpuinfo.transition_latency / 1000;
@@ -510,7 +512,10 @@  static int cpufreq_governor_init(struct
 
 	if (!have_governor_per_policy())
 		gov->gdbs_data = NULL;
-	gov->exit(dbs_data, !policy->governor->initialized);
+
+	if (gov->exit)
+		gov->exit(dbs_data, !policy->governor->initialized);
+
 	kfree(dbs_data);
 
 free_policy_dbs_info:
@@ -544,7 +549,9 @@  static int cpufreq_governor_exit(struct
 		if (!have_governor_per_policy())
 			gov->gdbs_data = NULL;
 
-		gov->exit(dbs_data, policy->governor->initialized == 1);
+		if (gov->exit)
+			gov->exit(dbs_data, policy->governor->initialized == 1);
+
 		mutex_destroy(&dbs_data->mutex);
 		kfree(dbs_data);
 	} else {
@@ -588,9 +595,9 @@  static int cpufreq_governor_start(struct
 			j_cdbs->prev_cpu_nice = kcpustat_cpu(j).cpustat[CPUTIME_NICE];
 	}
 
-	gov->start(policy);
+	if (gov->start(policy))
+		gov_set_update_util(policy_dbs, sampling_rate);
 
-	gov_set_update_util(policy_dbs, sampling_rate);
 	return 0;
 }
 
Index: linux-pm/drivers/cpufreq/cpufreq_ondemand.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_ondemand.c
+++ linux-pm/drivers/cpufreq/cpufreq_ondemand.c
@@ -402,12 +402,13 @@  static void od_exit(struct dbs_data *dbs
 	kfree(dbs_data->tuners);
 }
 
-static void od_start(struct cpufreq_policy *policy)
+static bool od_start(struct cpufreq_policy *policy)
 {
 	struct od_policy_dbs_info *dbs_info = to_dbs_info(policy->governor_data);
 
 	dbs_info->sample_type = OD_NORMAL_SAMPLE;
 	ondemand_powersave_bias_init(policy);
+	return true;
 }
 
 static struct od_ops od_ops = {
Index: linux-pm/drivers/cpufreq/Kconfig
===================================================================
--- linux-pm.orig/drivers/cpufreq/Kconfig
+++ linux-pm/drivers/cpufreq/Kconfig
@@ -184,6 +184,21 @@  config CPU_FREQ_GOV_CONSERVATIVE
 
 	  If in doubt, say N.
 
+config CPU_FREQ_GOV_SCHEDUTIL
+	tristate "'schedutil' cpufreq policy governor"
+	depends on CPU_FREQ
+	select CPU_FREQ_GOV_COMMON
+	help
+	  The frequency selection formula used by this governor is analogous
+	  to the one used by 'ondemand', but instead of computing CPU load
+	  as the "non-idle CPU time" to "total CPU time" ratio, it uses CPU
+	  utilization data provided by the scheduler as input.
+
+	  To compile this driver as a module, choose M here: the
+	  module will be called cpufreq_schedutil.
+
+	  If in doubt, say N.
+
 comment "CPU frequency scaling drivers"
 
 config CPUFREQ_DT
Index: linux-pm/drivers/cpufreq/Makefile
===================================================================
--- linux-pm.orig/drivers/cpufreq/Makefile
+++ linux-pm/drivers/cpufreq/Makefile
@@ -10,6 +10,7 @@  obj-$(CONFIG_CPU_FREQ_GOV_POWERSAVE)	+=
 obj-$(CONFIG_CPU_FREQ_GOV_USERSPACE)	+= cpufreq_userspace.o
 obj-$(CONFIG_CPU_FREQ_GOV_ONDEMAND)	+= cpufreq_ondemand.o
 obj-$(CONFIG_CPU_FREQ_GOV_CONSERVATIVE)	+= cpufreq_conservative.o
+obj-$(CONFIG_CPU_FREQ_GOV_SCHEDUTIL)	+= cpufreq_schedutil.o
 obj-$(CONFIG_CPU_FREQ_GOV_COMMON)		+= cpufreq_governor.o
 
 obj-$(CONFIG_CPUFREQ_DT)		+= cpufreq-dt.o