diff mbox

[v6,7/7,Resend] cpufreq: schedutil: New governor based on scheduler utilization data

Message ID 6666532.7ULg06hQ7e@vostro.rjw.lan (mailing list archive)
State Superseded, archived
Delegated to: Rafael Wysocki
Headers show

Commit Message

Rafael J. Wysocki March 22, 2016, 1:54 a.m. UTC
From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Add a new cpufreq scaling governor, called "schedutil", that uses
scheduler-provided CPU utilization information as input for making
its decisions.

Doing that is possible after commit 34e2c555f3e1 (cpufreq: Add
mechanism for registering utilization update callbacks) that
introduced cpufreq_update_util() called by the scheduler on
utilization changes (from CFS) and RT/DL task status updates.
In particular, CPU frequency scaling decisions may be based on
the the utilization data passed to cpufreq_update_util() by CFS.

The new governor is relatively simple.

The frequency selection formula used by it depends on whether or not
the utilization is frequency-invariant.  In the frequency-invariant
case the new CPU frequency is given by

	next_freq = 1.25 * max_freq * util / max

where util and max are the last two arguments of cpufreq_update_util().
In turn, if util is not frequency-invariant, the maximum frequency in
the above formula is replaced with the current frequency of the CPU:

	next_freq = 1.25 * curr_freq * util / max

The coefficient 1.25 corresponds to the frequency tipping point at
(util / max) = 0.8.

All of the computations are carried out in the utilization update
handlers provided by the new governor.  One of those handlers is
used for cpufreq policies shared between multiple CPUs and the other
one is for policies with one CPU only (and therefore it doesn't need
to use any extra synchronization means).

The governor supports fast frequency switching if that is supported
by the cpufreq driver in use and possible for the given policy.
In the fast switching case, all operations of the governor take
place in its utilization update handlers.  If fast switching cannot
be used, the frequency switch operations are carried out with the
help of a work item which only calls __cpufreq_driver_target()
(under a mutex) to trigger a frequency update (to a value already
computed beforehand in one of the utilization update handlers).

Currently, the governor treats all of the RT and DL tasks as
"unknown utilization" and sets the frequency to the allowed
maximum when updated from the RT or DL sched classes.  That
heavy-handed approach should be replaced with something more
subtle and specifically targeted at RT and DL tasks.

The governor shares some tunables management code with the
"ondemand" and "conservative" governors and uses some common
definitions from cpufreq_governor.h, but apart from that it
is stand-alone.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---

Changes from v5:
- Fixed sugov_update_commit() to set sg_policy->next_freq properly
  in the "work item" branch.
- Used smp_processor_id() in sugov_irq_work() and restored work_in_progress.

Changes from v4:
- Use TICK_NSEC in sugov_next_freq_shared().
- Use schedule_work_on() to schedule work items and replace
  work_in_progress with work_cpu (which is used both for scheduling
  work items and as a "work in progress" marker).
- Rearrange sugov_update_commit() to only check policy->min/max if
  fast switching is enabled.
- Replace util > max checks with util == ULONG_MAX checks to make
  it clear that they are about a special case (RT/DL).

Changes from v3:
- The "next frequency" formula based on
  http://marc.info/?l=linux-acpi&m=145756618321500&w=4 and
  http://marc.info/?l=linux-kernel&m=145760739700716&w=4
- The governor goes into kernel/sched/ (again).

Changes from v2:
- The governor goes into drivers/cpufreq/.
- The "next frequency" formula has an additional 1.1 factor to allow
  more util/max values to map onto the top-most frequency in case the
  distance between that and the previous one is unproportionally small.
- sugov_update_commit() traces CPU frequency even if the new one is
  the same as the previous one (otherwise, if the system is 100% loaded
  for long enough, powertop starts to report that all CPUs are 100% idle).

---
 drivers/cpufreq/Kconfig          |   26 +
 kernel/sched/Makefile            |    1 
 kernel/sched/cpufreq_schedutil.c |  528 +++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h             |    8 
 4 files changed, 563 insertions(+)


--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Steve Muckle March 26, 2016, 1:12 a.m. UTC | #1
Hi Rafael,

On 03/21/2016 06:54 PM, Rafael J. Wysocki wrote:
...
> +config CPU_FREQ_GOV_SCHEDUTIL
> +	tristate "'schedutil' cpufreq policy governor"
> +	depends on CPU_FREQ
> +	select CPU_FREQ_GOV_ATTR_SET
> +	select IRQ_WORK
> +	help
> +	  The frequency selection formula used by this governor is analogous
> +	  to the one used by 'ondemand', but instead of computing CPU load
> +	  as the "non-idle CPU time" to "total CPU time" ratio, it uses CPU
> +	  utilization data provided by the scheduler as input.

The formula's changed a bit from ondemand - can the formula description
in the commit text be repackaged a bit and used here?

...
> +
> +static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time,
> +				unsigned int next_freq)
> +{
> +	struct cpufreq_policy *policy = sg_policy->policy;
> +
> +	sg_policy->last_freq_update_time = time;
> +
> +	if (policy->fast_switch_enabled) {
> +		if (next_freq > policy->max)
> +			next_freq = policy->max;
> +		else if (next_freq < policy->min)
> +			next_freq = policy->min;

The __cpufreq_driver_target() interface has this capping in it. For
uniformity should this be pushed into cpufreq_driver_fast_switch()?

> +
> +		if (sg_policy->next_freq == next_freq) {
> +			trace_cpu_frequency(policy->cur, smp_processor_id());
> +			return;
> +		}

I fear this may bloat traces unnecessarily as there may be long
stretches when a frequency domain is at the same frequency (especially
fmin or fmax).

...
> +static unsigned int sugov_next_freq_shared(struct sugov_policy *sg_policy,
> +					   unsigned long util, unsigned long max)
> +{
> +	struct cpufreq_policy *policy = sg_policy->policy;
> +	unsigned int max_f = policy->cpuinfo.max_freq;
> +	u64 last_freq_update_time = sg_policy->last_freq_update_time;
> +	unsigned int j;
> +
> +	if (util == ULONG_MAX)
> +		return max_f;
> +
> +	for_each_cpu(j, policy->cpus) {
> +		struct sugov_cpu *j_sg_cpu;
> +		unsigned long j_util, j_max;
> +		u64 delta_ns;
> +
> +		if (j == smp_processor_id())
> +			continue;
> +
> +		j_sg_cpu = &per_cpu(sugov_cpu, j);
> +		/*
> +		 * If the CPU utilization was last updated before the previous
> +		 * frequency update and the time elapsed between the last update
> +		 * of the CPU utilization and the last frequency update is long
> +		 * enough, don't take the CPU into account as it probably is
> +		 * idle now.
> +		 */
> +		delta_ns = last_freq_update_time - j_sg_cpu->last_update;
> +		if ((s64)delta_ns > TICK_NSEC)

Why not declare delta_ns as an s64 (also in suguv_should_update_freq)
and avoid the cast?

...
> +static int sugov_limits(struct cpufreq_policy *policy)
> +{
> +	struct sugov_policy *sg_policy = policy->governor_data;
> +
> +	if (!policy->fast_switch_enabled) {
> +		mutex_lock(&sg_policy->work_lock);
> +
> +		if (policy->max < policy->cur)
> +			__cpufreq_driver_target(policy, policy->max,
> +						CPUFREQ_RELATION_H);
> +		else if (policy->min > policy->cur)
> +			__cpufreq_driver_target(policy, policy->min,
> +						CPUFREQ_RELATION_L);
> +
> +		mutex_unlock(&sg_policy->work_lock);
> +	}

Is the expectation that in the fast_switch_enabled case we should
re-evaluate soon enough that an explicit fixup is not required here? I'm
worried as to whether that will always be true given the possible
criticality of applying frequency limits (thermal for example).

thanks,
Steve
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rafael J. Wysocki March 26, 2016, 2:05 a.m. UTC | #2
On Sat, Mar 26, 2016 at 2:12 AM, Steve Muckle <steve.muckle@linaro.org> wrote:
> Hi Rafael,
>
> On 03/21/2016 06:54 PM, Rafael J. Wysocki wrote:
> ...
>> +config CPU_FREQ_GOV_SCHEDUTIL
>> +     tristate "'schedutil' cpufreq policy governor"
>> +     depends on CPU_FREQ
>> +     select CPU_FREQ_GOV_ATTR_SET
>> +     select IRQ_WORK
>> +     help
>> +       The frequency selection formula used by this governor is analogous
>> +       to the one used by 'ondemand', but instead of computing CPU load
>> +       as the "non-idle CPU time" to "total CPU time" ratio, it uses CPU
>> +       utilization data provided by the scheduler as input.
>
> The formula's changed a bit from ondemand - can the formula description
> in the commit text be repackaged a bit and used here?

Right, I forgot to update this help text.

I'll figure out what to do here.

> ...
>> +
>> +static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time,
>> +                             unsigned int next_freq)
>> +{
>> +     struct cpufreq_policy *policy = sg_policy->policy;
>> +
>> +     sg_policy->last_freq_update_time = time;
>> +
>> +     if (policy->fast_switch_enabled) {
>> +             if (next_freq > policy->max)
>> +                     next_freq = policy->max;
>> +             else if (next_freq < policy->min)
>> +                     next_freq = policy->min;
>
> The __cpufreq_driver_target() interface has this capping in it. For
> uniformity should this be pushed into cpufreq_driver_fast_switch()?

It could, but see below.

>> +
>> +             if (sg_policy->next_freq == next_freq) {
>> +                     trace_cpu_frequency(policy->cur, smp_processor_id());
>> +                     return;
>> +             }
>
> I fear this may bloat traces unnecessarily as there may be long
> stretches when a frequency domain is at the same frequency (especially
> fmin or fmax).

I put it here, because without it powertop reports that the CPU is
idle in situations like these.

> ...
>> +static unsigned int sugov_next_freq_shared(struct sugov_policy *sg_policy,
>> +                                        unsigned long util, unsigned long max)
>> +{
>> +     struct cpufreq_policy *policy = sg_policy->policy;
>> +     unsigned int max_f = policy->cpuinfo.max_freq;
>> +     u64 last_freq_update_time = sg_policy->last_freq_update_time;
>> +     unsigned int j;
>> +
>> +     if (util == ULONG_MAX)
>> +             return max_f;
>> +
>> +     for_each_cpu(j, policy->cpus) {
>> +             struct sugov_cpu *j_sg_cpu;
>> +             unsigned long j_util, j_max;
>> +             u64 delta_ns;
>> +
>> +             if (j == smp_processor_id())
>> +                     continue;
>> +
>> +             j_sg_cpu = &per_cpu(sugov_cpu, j);
>> +             /*
>> +              * If the CPU utilization was last updated before the previous
>> +              * frequency update and the time elapsed between the last update
>> +              * of the CPU utilization and the last frequency update is long
>> +              * enough, don't take the CPU into account as it probably is
>> +              * idle now.
>> +              */
>> +             delta_ns = last_freq_update_time - j_sg_cpu->last_update;
>> +             if ((s64)delta_ns > TICK_NSEC)
>
> Why not declare delta_ns as an s64 (also in suguv_should_update_freq)
> and avoid the cast?

I took this from __update_load_avg(), but it shouldn't matter here.

> ...
>> +static int sugov_limits(struct cpufreq_policy *policy)
>> +{
>> +     struct sugov_policy *sg_policy = policy->governor_data;
>> +
>> +     if (!policy->fast_switch_enabled) {
>> +             mutex_lock(&sg_policy->work_lock);
>> +
>> +             if (policy->max < policy->cur)
>> +                     __cpufreq_driver_target(policy, policy->max,
>> +                                             CPUFREQ_RELATION_H);
>> +             else if (policy->min > policy->cur)
>> +                     __cpufreq_driver_target(policy, policy->min,
>> +                                             CPUFREQ_RELATION_L);
>> +
>> +             mutex_unlock(&sg_policy->work_lock);
>> +     }
>
> Is the expectation that in the fast_switch_enabled case we should
> re-evaluate soon enough that an explicit fixup is not required here?

Yes, it is.

> I'm worried as to whether that will always be true given the possible
> criticality of applying frequency limits (thermal for example).

The part of the patch below that you cut actually takes care of that:

    sg_policy->need_freq_update = true;

which causes the rate limit to be ignored essentially, so the
frequency will be changed on the first update from the scheduler.
Which also is why the min/max check is before the sg_policy->next_freq
== next_freq check in sugov_update_commit().

I wanted to avoid locking in the fast switch/one CPU per policy case
which otherwise would be necessary just for the handling of this
thing.  I'd like to keep it the way it is unless it can be clearly
demonstrated that it really would lead to problems in practice in a
real system.

Thanks,
Rafael
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rafael J. Wysocki March 27, 2016, 1:36 a.m. UTC | #3
On Sat, Mar 26, 2016 at 3:05 AM, Rafael J. Wysocki <rafael@kernel.org> wrote:
> On Sat, Mar 26, 2016 at 2:12 AM, Steve Muckle <steve.muckle@linaro.org> wrote:
>> Hi Rafael,
>>
>> On 03/21/2016 06:54 PM, Rafael J. Wysocki wrote:
>> ...
>>> +config CPU_FREQ_GOV_SCHEDUTIL
>>> +     tristate "'schedutil' cpufreq policy governor"
>>> +     depends on CPU_FREQ
>>> +     select CPU_FREQ_GOV_ATTR_SET
>>> +     select IRQ_WORK
>>> +     help
>>> +       The frequency selection formula used by this governor is analogous
>>> +       to the one used by 'ondemand', but instead of computing CPU load
>>> +       as the "non-idle CPU time" to "total CPU time" ratio, it uses CPU
>>> +       utilization data provided by the scheduler as input.
>>
>> The formula's changed a bit from ondemand - can the formula description
>> in the commit text be repackaged a bit and used here?
>
> Right, I forgot to update this help text.
>
> I'll figure out what to do here.
>
>> ...
>>> +
>>> +static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time,
>>> +                             unsigned int next_freq)
>>> +{
>>> +     struct cpufreq_policy *policy = sg_policy->policy;
>>> +
>>> +     sg_policy->last_freq_update_time = time;
>>> +
>>> +     if (policy->fast_switch_enabled) {
>>> +             if (next_freq > policy->max)
>>> +                     next_freq = policy->max;
>>> +             else if (next_freq < policy->min)
>>> +                     next_freq = policy->min;
>>
>> The __cpufreq_driver_target() interface has this capping in it. For
>> uniformity should this be pushed into cpufreq_driver_fast_switch()?
>
> It could, but see below.

It should be doable regardless unless I'm overlooking something.  Will try.

[cut]

>> ...
>>> +static int sugov_limits(struct cpufreq_policy *policy)
>>> +{
>>> +     struct sugov_policy *sg_policy = policy->governor_data;
>>> +
>>> +     if (!policy->fast_switch_enabled) {
>>> +             mutex_lock(&sg_policy->work_lock);
>>> +
>>> +             if (policy->max < policy->cur)
>>> +                     __cpufreq_driver_target(policy, policy->max,
>>> +                                             CPUFREQ_RELATION_H);
>>> +             else if (policy->min > policy->cur)
>>> +                     __cpufreq_driver_target(policy, policy->min,
>>> +                                             CPUFREQ_RELATION_L);
>>> +
>>> +             mutex_unlock(&sg_policy->work_lock);
>>> +     }
>>
>> Is the expectation that in the fast_switch_enabled case we should
>> re-evaluate soon enough that an explicit fixup is not required here?
>
> Yes, it is.
>
>> I'm worried as to whether that will always be true given the possible
>> criticality of applying frequency limits (thermal for example).
>
> The part of the patch below that you cut actually takes care of that:
>
>     sg_policy->need_freq_update = true;
>
> which causes the rate limit to be ignored essentially, so the
> frequency will be changed on the first update from the scheduler.
> Which also is why the min/max check is before the sg_policy->next_freq
> == next_freq check in sugov_update_commit().
>
> I wanted to avoid locking in the fast switch/one CPU per policy case
> which otherwise would be necessary just for the handling of this
> thing.  I'd like to keep it the way it is unless it can be clearly
> demonstrated that it really would lead to problems in practice in a
> real system.

Besides, even if frequency is updated directly from here in the "fast
switch" case, that still doesn't guarantee that it will be updated
immediately, because the task running this code may be preempted and
only scheduled again in the next cycle.  Not to mention the fact that
it may not run on the CPU to be updated, so it would need to use
something like smp_call_function_single() for the update and that
would complicate things even more.

Overall, I don't really think that doing the update directly from here
in the "fast switch" case would improve things much latency-wise and
it would increase complexity and introduce overhead into the fast
path.  So this really is a tradeoff and the current choice is the
right one IMO.

Thanks,
Rafael
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Viresh Kumar March 28, 2016, 9:03 a.m. UTC | #4
On 22-03-16, 02:54, Rafael J. Wysocki wrote:
> Index: linux-pm/kernel/sched/cpufreq_schedutil.c
> ===================================================================
> --- /dev/null
> +++ linux-pm/kernel/sched/cpufreq_schedutil.c
> @@ -0,0 +1,528 @@
> +/*
> + * CPUFreq governor based on scheduler-provided CPU utilization data.
> + *
> + * Copyright (C) 2016, Intel Corporation
> + * Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/cpufreq.h>
> +#include <linux/module.h>
> +#include <linux/slab.h>
> +#include <trace/events/power.h>
> +
> +#include "sched.h"
> +
> +struct sugov_tunables {
> +	struct gov_attr_set attr_set;
> +	unsigned int rate_limit_us;
> +};
> +
> +struct sugov_policy {
> +	struct cpufreq_policy *policy;
> +
> +	struct sugov_tunables *tunables;
> +	struct list_head tunables_hook;
> +
> +	raw_spinlock_t update_lock;  /* For shared policies */
> +	u64 last_freq_update_time;
> +	s64 freq_update_delay_ns;

And why isn't it part of sugov_tunables? Its gonna be same for all policies
sharing tunables ..

> +	unsigned int next_freq;
> +
> +	/* The next fields are only needed if fast switch cannot be used. */
> +	struct irq_work irq_work;
> +	struct work_struct work;
> +	struct mutex work_lock;
> +	bool work_in_progress;
> +
> +	bool need_freq_update;
> +};
> +
> +struct sugov_cpu {
> +	struct update_util_data update_util;
> +	struct sugov_policy *sg_policy;
> +
> +	/* The fields below are only needed when sharing a policy. */
> +	unsigned long util;
> +	unsigned long max;
> +	u64 last_update;
> +};
> +
> +static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu);
> +
> +/************************ Governor internals ***********************/
> +
> +static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)

To make its purpose clear, maybe name it as: sugov_should_reevaluate_freq(),
because we aren't updating the freq just yet, but deciding if we need to
reevaluate again or not.

As its going to be called from hotpath, maybe mark it as inline and let compiler
decide ?

> +{
> +	u64 delta_ns;
> +
> +	if (sg_policy->work_in_progress)
> +		return false;
> +
> +	if (unlikely(sg_policy->need_freq_update)) {
> +		sg_policy->need_freq_update = false;
> +		return true;
> +	}
> +
> +	delta_ns = time - sg_policy->last_freq_update_time;
> +	return (s64)delta_ns >= sg_policy->freq_update_delay_ns;
> +}
> +
> +static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time,

Maybe sugov_update_freq() ?

> +				unsigned int next_freq)
> +{
> +	struct cpufreq_policy *policy = sg_policy->policy;
> +
> +	sg_policy->last_freq_update_time = time;
> +
> +	if (policy->fast_switch_enabled) {
> +		if (next_freq > policy->max)
> +			next_freq = policy->max;
> +		else if (next_freq < policy->min)
> +			next_freq = policy->min;
> +
> +		if (sg_policy->next_freq == next_freq) {
> +			trace_cpu_frequency(policy->cur, smp_processor_id());
> +			return;
> +		}
> +		sg_policy->next_freq = next_freq;

Why not do all of above stuff as part of else block as well and move it before
the if {} block ?

> +		next_freq = cpufreq_driver_fast_switch(policy, next_freq);
> +		if (next_freq == CPUFREQ_ENTRY_INVALID)
> +			return;
> +
> +		policy->cur = next_freq;
> +		trace_cpu_frequency(next_freq, smp_processor_id());
> +	} else if (sg_policy->next_freq != next_freq) {
> +		sg_policy->next_freq = next_freq;
> +		sg_policy->work_in_progress = true;
> +		irq_work_queue(&sg_policy->irq_work);
> +	}
> +}
> +
> +/**
> + * get_next_freq - Compute a new frequency for a given cpufreq policy.
> + * @policy: cpufreq policy object to compute the new frequency for.
> + * @util: Current CPU utilization.
> + * @max: CPU capacity.
> + *
> + * If the utilization is frequency-invariant, choose the new frequency to be
> + * proportional to it, that is
> + *
> + * next_freq = C * max_freq * util / max
> + *
> + * Otherwise, approximate the would-be frequency-invariant utilization by
> + * util_raw * (curr_freq / max_freq) which leads to
> + *
> + * next_freq = C * curr_freq * util_raw / max
> + *
> + * Take C = 1.25 for the frequency tipping point at (util / max) = 0.8.
> + */
> +static unsigned int get_next_freq(struct cpufreq_policy *policy,
> +				  unsigned long util, unsigned long max)
> +{
> +	unsigned int freq = arch_scale_freq_invariant() ?
> +				policy->cpuinfo.max_freq : policy->cur;
> +
> +	return (freq + (freq >> 2)) * util / max;
> +}
> +
> +static void sugov_update_single(struct update_util_data *hook, u64 time,
> +				unsigned long util, unsigned long max)
> +{
> +	struct sugov_cpu *sg_cpu = container_of(hook, struct sugov_cpu, update_util);
> +	struct sugov_policy *sg_policy = sg_cpu->sg_policy;
> +	struct cpufreq_policy *policy = sg_policy->policy;
> +	unsigned int next_f;
> +
> +	if (!sugov_should_update_freq(sg_policy, time))
> +		return;
> +
> +	next_f = util == ULONG_MAX ? policy->cpuinfo.max_freq :
> +			get_next_freq(policy, util, max);
> +	sugov_update_commit(sg_policy, time, next_f);
> +}
> +
> +static unsigned int sugov_next_freq_shared(struct sugov_policy *sg_policy,
> +					   unsigned long util, unsigned long max)
> +{
> +	struct cpufreq_policy *policy = sg_policy->policy;
> +	unsigned int max_f = policy->cpuinfo.max_freq;
> +	u64 last_freq_update_time = sg_policy->last_freq_update_time;
> +	unsigned int j;
> +
> +	if (util == ULONG_MAX)
> +		return max_f;
> +
> +	for_each_cpu(j, policy->cpus) {
> +		struct sugov_cpu *j_sg_cpu;
> +		unsigned long j_util, j_max;
> +		u64 delta_ns;
> +
> +		if (j == smp_processor_id())
> +			continue;

Why skip local CPU completely ? And if we really want to do that, what about
something like for_each_cpu_and_not to kill the unnecessary if {} statement ?

> +
> +		j_sg_cpu = &per_cpu(sugov_cpu, j);
> +		/*
> +		 * If the CPU utilization was last updated before the previous
> +		 * frequency update and the time elapsed between the last update
> +		 * of the CPU utilization and the last frequency update is long
> +		 * enough, don't take the CPU into account as it probably is
> +		 * idle now.
> +		 */
> +		delta_ns = last_freq_update_time - j_sg_cpu->last_update;
> +		if ((s64)delta_ns > TICK_NSEC)
> +			continue;
> +
> +		j_util = j_sg_cpu->util;
> +		if (j_util == ULONG_MAX)
> +			return max_f;
> +
> +		j_max = j_sg_cpu->max;
> +		if (j_util * max > j_max * util) {
> +			util = j_util;
> +			max = j_max;
> +		}
> +	}
> +
> +	return get_next_freq(policy, util, max);
> +}
> +
> +static void sugov_update_shared(struct update_util_data *hook, u64 time,
> +				unsigned long util, unsigned long max)
> +{
> +	struct sugov_cpu *sg_cpu = container_of(hook, struct sugov_cpu, update_util);
> +	struct sugov_policy *sg_policy = sg_cpu->sg_policy;
> +	unsigned int next_f;
> +
> +	raw_spin_lock(&sg_policy->update_lock);
> +
> +	sg_cpu->util = util;
> +	sg_cpu->max = max;
> +	sg_cpu->last_update = time;
> +
> +	if (sugov_should_update_freq(sg_policy, time)) {
> +		next_f = sugov_next_freq_shared(sg_policy, util, max);
> +		sugov_update_commit(sg_policy, time, next_f);
> +	}
> +
> +	raw_spin_unlock(&sg_policy->update_lock);
> +}
> +
> +static void sugov_work(struct work_struct *work)
> +{
> +	struct sugov_policy *sg_policy = container_of(work, struct sugov_policy, work);
> +
> +	mutex_lock(&sg_policy->work_lock);
> +	__cpufreq_driver_target(sg_policy->policy, sg_policy->next_freq,
> +				CPUFREQ_RELATION_L);
> +	mutex_unlock(&sg_policy->work_lock);
> +
> +	sg_policy->work_in_progress = false;
> +}
> +
> +static void sugov_irq_work(struct irq_work *irq_work)
> +{
> +	struct sugov_policy *sg_policy;
> +
> +	sg_policy = container_of(irq_work, struct sugov_policy, irq_work);
> +	schedule_work_on(smp_processor_id(), &sg_policy->work);
> +}
> +
> +/************************** sysfs interface ************************/
> +
> +static struct sugov_tunables *global_tunables;
> +static DEFINE_MUTEX(global_tunables_lock);
> +
> +static inline struct sugov_tunables *to_sugov_tunables(struct gov_attr_set *attr_set)
> +{
> +	return container_of(attr_set, struct sugov_tunables, attr_set);
> +}
> +
> +static ssize_t rate_limit_us_show(struct gov_attr_set *attr_set, char *buf)
> +{
> +	struct sugov_tunables *tunables = to_sugov_tunables(attr_set);
> +
> +	return sprintf(buf, "%u\n", tunables->rate_limit_us);
> +}
> +
> +static ssize_t rate_limit_us_store(struct gov_attr_set *attr_set, const char *buf,
> +				   size_t count)
> +{
> +	struct sugov_tunables *tunables = to_sugov_tunables(attr_set);
> +	struct sugov_policy *sg_policy;
> +	unsigned int rate_limit_us;
> +	int ret;
> +
> +	ret = sscanf(buf, "%u", &rate_limit_us);

checkpatch warns for this, we should be using kstrtou32 here ..

> +	if (ret != 1)
> +		return -EINVAL;
> +
> +	tunables->rate_limit_us = rate_limit_us;
> +
> +	list_for_each_entry(sg_policy, &attr_set->policy_list, tunables_hook)
> +		sg_policy->freq_update_delay_ns = rate_limit_us * NSEC_PER_USEC;
> +
> +	return count;
> +}
> +
> +static struct governor_attr rate_limit_us = __ATTR_RW(rate_limit_us);

Why not reuse gov_attr_rw() ?

> +
> +static struct attribute *sugov_attributes[] = {
> +	&rate_limit_us.attr,
> +	NULL
> +};
> +
> +static struct kobj_type sugov_tunables_ktype = {
> +	.default_attrs = sugov_attributes,
> +	.sysfs_ops = &governor_sysfs_ops,
> +};
> +
> +/********************** cpufreq governor interface *********************/
> +
> +static struct cpufreq_governor schedutil_gov;
> +
> +static struct sugov_policy *sugov_policy_alloc(struct cpufreq_policy *policy)
> +{
> +	struct sugov_policy *sg_policy;
> +
> +	sg_policy = kzalloc(sizeof(*sg_policy), GFP_KERNEL);
> +	if (!sg_policy)
> +		return NULL;
> +
> +	sg_policy->policy = policy;
> +	init_irq_work(&sg_policy->irq_work, sugov_irq_work);
> +	INIT_WORK(&sg_policy->work, sugov_work);
> +	mutex_init(&sg_policy->work_lock);
> +	raw_spin_lock_init(&sg_policy->update_lock);
> +	return sg_policy;
> +}
> +
> +static void sugov_policy_free(struct sugov_policy *sg_policy)
> +{
> +	mutex_destroy(&sg_policy->work_lock);
> +	kfree(sg_policy);
> +}
> +
> +static struct sugov_tunables *sugov_tunables_alloc(struct sugov_policy *sg_policy)
> +{
> +	struct sugov_tunables *tunables;
> +
> +	tunables = kzalloc(sizeof(*tunables), GFP_KERNEL);
> +	if (tunables)
> +		gov_attr_set_init(&tunables->attr_set, &sg_policy->tunables_hook);
> +
> +	return tunables;
> +}
> +
> +static void sugov_tunables_free(struct sugov_tunables *tunables)
> +{
> +	if (!have_governor_per_policy())
> +		global_tunables = NULL;
> +
> +	kfree(tunables);
> +}
> +
> +static int sugov_init(struct cpufreq_policy *policy)
> +{
> +	struct sugov_policy *sg_policy;
> +	struct sugov_tunables *tunables;
> +	unsigned int lat;
> +	int ret = 0;
> +
> +	/* State should be equivalent to EXIT */
> +	if (policy->governor_data)
> +		return -EBUSY;
> +
> +	sg_policy = sugov_policy_alloc(policy);
> +	if (!sg_policy)
> +		return -ENOMEM;
> +
> +	mutex_lock(&global_tunables_lock);
> +
> +	if (global_tunables) {
> +		if (WARN_ON(have_governor_per_policy())) {
> +			ret = -EINVAL;
> +			goto free_sg_policy;
> +		}
> +		policy->governor_data = sg_policy;
> +		sg_policy->tunables = global_tunables;
> +
> +		gov_attr_set_get(&global_tunables->attr_set, &sg_policy->tunables_hook);
> +		goto out;
> +	}
> +
> +	tunables = sugov_tunables_alloc(sg_policy);
> +	if (!tunables) {
> +		ret = -ENOMEM;
> +		goto free_sg_policy;
> +	}
> +
> +	tunables->rate_limit_us = LATENCY_MULTIPLIER;
> +	lat = policy->cpuinfo.transition_latency / NSEC_PER_USEC;
> +	if (lat)
> +		tunables->rate_limit_us *= lat;
> +
> +	if (!have_governor_per_policy())
> +		global_tunables = tunables;

To make sugov_tunables_alloc/free() symmetric to each other, should we move
above into sugov_tunables_alloc() ?

> +
> +	policy->governor_data = sg_policy;
> +	sg_policy->tunables = tunables;
> +
> +	ret = kobject_init_and_add(&tunables->attr_set.kobj, &sugov_tunables_ktype,
> +				   get_governor_parent_kobj(policy), "%s",
> +				   schedutil_gov.name);
> +	if (!ret)
> +		goto out;
> +
> +	/* Failure, so roll back. */
> +	policy->governor_data = NULL;
> +	sugov_tunables_free(tunables);
> +
> + free_sg_policy:
> +	pr_err("cpufreq: schedutil governor initialization failed (error %d)\n", ret);
> +	sugov_policy_free(sg_policy);

I didn't like the way we have mixed success and failure path here, just to save
a single line of code (unlock).

Over that it does things, that aren't symmetric anymore. For example, we have
called sugov_policy_alloc() without locks and are freeing it from within locks.

> +
> + out:
> +	mutex_unlock(&global_tunables_lock);
> +	return ret;
> +}
> +
> +static int sugov_exit(struct cpufreq_policy *policy)
> +{
> +	struct sugov_policy *sg_policy = policy->governor_data;
> +	struct sugov_tunables *tunables = sg_policy->tunables;
> +	unsigned int count;
> +
> +	mutex_lock(&global_tunables_lock);
> +
> +	count = gov_attr_set_put(&tunables->attr_set, &sg_policy->tunables_hook);
> +	policy->governor_data = NULL;
> +	if (!count)
> +		sugov_tunables_free(tunables);
> +
> +	mutex_unlock(&global_tunables_lock);
> +
> +	sugov_policy_free(sg_policy);
> +	return 0;
> +}
> +
> +static int sugov_start(struct cpufreq_policy *policy)
> +{
> +	struct sugov_policy *sg_policy = policy->governor_data;
> +	unsigned int cpu;
> +
> +	cpufreq_enable_fast_switch(policy);

Why should we be doing this from START, which gets called a lot compared to
INIT/EXIT? This is something which should be moved to INIT IMHO.

> +	sg_policy->freq_update_delay_ns = sg_policy->tunables->rate_limit_us * NSEC_PER_USEC;
> +	sg_policy->last_freq_update_time = 0;
> +	sg_policy->next_freq = UINT_MAX;
> +	sg_policy->work_in_progress = false;
> +	sg_policy->need_freq_update = false;
> +
> +	for_each_cpu(cpu, policy->cpus) {
> +		struct sugov_cpu *sg_cpu = &per_cpu(sugov_cpu, cpu);
> +
> +		sg_cpu->sg_policy = sg_policy;
> +		if (policy_is_shared(policy)) {
> +			sg_cpu->util = ULONG_MAX;
> +			sg_cpu->max = 0;
> +			sg_cpu->last_update = 0;
> +			cpufreq_add_update_util_hook(cpu, &sg_cpu->update_util,
> +						     sugov_update_shared);
> +		} else {
> +			cpufreq_add_update_util_hook(cpu, &sg_cpu->update_util,
> +						     sugov_update_single);
> +		}
> +	}
> +	return 0;
> +}
> +
> +static int sugov_stop(struct cpufreq_policy *policy)
> +{
> +	struct sugov_policy *sg_policy = policy->governor_data;
> +	unsigned int cpu;
> +
> +	for_each_cpu(cpu, policy->cpus)
> +		cpufreq_remove_update_util_hook(cpu);
> +
> +	synchronize_sched();
> +
> +	irq_work_sync(&sg_policy->irq_work);
> +	cancel_work_sync(&sg_policy->work);

And again, we should have a disable-fast-switch as well..

> +	return 0;
> +}
> +
> +static int sugov_limits(struct cpufreq_policy *policy)
> +{
> +	struct sugov_policy *sg_policy = policy->governor_data;
> +
> +	if (!policy->fast_switch_enabled) {
> +		mutex_lock(&sg_policy->work_lock);
> +
> +		if (policy->max < policy->cur)
> +			__cpufreq_driver_target(policy, policy->max,
> +						CPUFREQ_RELATION_H);
> +		else if (policy->min > policy->cur)
> +			__cpufreq_driver_target(policy, policy->min,
> +						CPUFREQ_RELATION_L);
> +
> +		mutex_unlock(&sg_policy->work_lock);

Maybe we can try to take lock only if we are going to switch the freq, i.e. only
if sugov_limits is called for policy->min/max update?

i.e.

void __sugov_limits(policy, freq, relation)
{
		mutex_lock(&sg_policy->work_lock);
                __cpufreq_driver_target(policy, freq, relation);
		mutex_unlock(&sg_policy->work_lock);
}

static int sugov_limits(struct cpufreq_policy *policy)
{
	struct sugov_policy *sg_policy = policy->governor_data;

	if (!policy->fast_switch_enabled) {
		if (policy->max < policy->cur)
                        __sugov_limits(policy, policy->max, CPUFREQ_RELATION_H);
		else if (policy->min > policy->cur)
                        __sugov_limits(policy, policy->min, CPUFREQ_RELATION_L);
	}

	sg_policy->need_freq_update = true;
	return 0;
}

??

And maybe the same for current governors? (ofcourse in a separate patch, I can
do that if you want).


Also, why not just always do 'sg_policy->need_freq_update = true' from this
routine and remove everything else? It will be taken care of on next evaluation.

> +
> +int sugov_governor(struct cpufreq_policy *policy, unsigned int event)
> +{
> +	if (event == CPUFREQ_GOV_POLICY_INIT) {
> +		return sugov_init(policy);
> +	} else if (policy->governor_data) {
> +		switch (event) {
> +		case CPUFREQ_GOV_POLICY_EXIT:
> +			return sugov_exit(policy);
> +		case CPUFREQ_GOV_START:
> +			return sugov_start(policy);
> +		case CPUFREQ_GOV_STOP:
> +			return sugov_stop(policy);
> +		case CPUFREQ_GOV_LIMITS:
> +			return sugov_limits(policy);
> +		}
> +	}
> +	return -EINVAL;
> +}
> +
> +static struct cpufreq_governor schedutil_gov = {
> +	.name = "schedutil",
> +	.governor = sugov_governor,
> +	.owner = THIS_MODULE,
> +};
> +
> +static int __init sugov_module_init(void)
> +{
> +	return cpufreq_register_governor(&schedutil_gov);
> +}
> +
> +static void __exit sugov_module_exit(void)
> +{
> +	cpufreq_unregister_governor(&schedutil_gov);
> +}
> +
> +MODULE_AUTHOR("Rafael J. Wysocki <rafael.j.wysocki@intel.com>");
> +MODULE_DESCRIPTION("Utilization-based CPU frequency selection");
> +MODULE_LICENSE("GPL");

Maybe a MODULE_ALIAS as well ?
Steve Muckle March 28, 2016, 6:17 p.m. UTC | #5
On 03/26/2016 06:36 PM, Rafael J. Wysocki wrote:
>>>> +static int sugov_limits(struct cpufreq_policy *policy)
>>>> >>> +{
>>>> >>> +     struct sugov_policy *sg_policy = policy->governor_data;
>>>> >>> +
>>>> >>> +     if (!policy->fast_switch_enabled) {
>>>> >>> +             mutex_lock(&sg_policy->work_lock);
>>>> >>> +
>>>> >>> +             if (policy->max < policy->cur)
>>>> >>> +                     __cpufreq_driver_target(policy, policy->max,
>>>> >>> +                                             CPUFREQ_RELATION_H);
>>>> >>> +             else if (policy->min > policy->cur)
>>>> >>> +                     __cpufreq_driver_target(policy, policy->min,
>>>> >>> +                                             CPUFREQ_RELATION_L);
>>>> >>> +
>>>> >>> +             mutex_unlock(&sg_policy->work_lock);
>>>> >>> +     }
>>> >>
>>> >> Is the expectation that in the fast_switch_enabled case we should
>>> >> re-evaluate soon enough that an explicit fixup is not required here?
>> >
>> > Yes, it is.
>> >
>>> >> I'm worried as to whether that will always be true given the possible
>>> >> criticality of applying frequency limits (thermal for example).
>> >
>> > The part of the patch below that you cut actually takes care of that:
>> >
>> >     sg_policy->need_freq_update = true;
>> >
>> > which causes the rate limit to be ignored essentially, so the
>> > frequency will be changed on the first update from the scheduler.

The scenario I'm contemplating is that while a CPU-intensive task is
running a thermal interrupt goes off. The driver for this thermal
interrupt responds by capping fmax. If this happens just after the tick,
it seems possible that we could wait a full tick before changing the
frequency. Given a 10ms tick it could be rather annoying for thermal
management algorithms on some platforms (I'm familiar with a few).

>> > Which also is why the min/max check is before the sg_policy->next_freq
>> > == next_freq check in sugov_update_commit().
>> >
>> > I wanted to avoid locking in the fast switch/one CPU per policy case
>> > which otherwise would be necessary just for the handling of this
>> > thing.  I'd like to keep it the way it is unless it can be clearly
>> > demonstrated that it really would lead to problems in practice in a
>> > real system.
>
> Besides, even if frequency is updated directly from here in the "fast
> switch" case, that still doesn't guarantee that it will be updated
> immediately, because the task running this code may be preempted and
> only scheduled again in the next cycle.
>
> Not to mention the fact that it may not run on the CPU to be updated,
> so it would need to use something like smp_call_function_single() for
> the update and that would complicate things even more.
> 
> Overall, I don't really think that doing the update directly from here
> in the "fast switch" case would improve things much latency-wise and
> it would increase complexity and introduce overhead into the fast
> path.  So this really is a tradeoff and the current choice is the
> right one IMO.

On the desire to avoid locking in the fast switch/one CPU per policy
case, I wondered about whether disabling interrupts in sugov_limits()
would suffice. That's a rarely called function and I was hoping that the
update hook would already have interrupts disabled due to its being
called in scheduler paths that may do raw_spin_lock_irqsave. But I'm not
sure offhand that will always be true. If it isn't though then I'm not
sure what's necessarily stopping say the sched tick calling the hook
while the hook is already in progress from some other path.

Agreed there would need to be some additional complexity somewhere to
get things running on the correct CPU.

Anyway I have nothing against deferring this for now.

thanks,
Steve
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rafael J. Wysocki March 29, 2016, 12:23 p.m. UTC | #6
On Monday, March 28, 2016 11:17:44 AM Steve Muckle wrote:
> On 03/26/2016 06:36 PM, Rafael J. Wysocki wrote:
> >>>> +static int sugov_limits(struct cpufreq_policy *policy)
> >>>> >>> +{
> >>>> >>> +     struct sugov_policy *sg_policy = policy->governor_data;
> >>>> >>> +
> >>>> >>> +     if (!policy->fast_switch_enabled) {
> >>>> >>> +             mutex_lock(&sg_policy->work_lock);
> >>>> >>> +
> >>>> >>> +             if (policy->max < policy->cur)
> >>>> >>> +                     __cpufreq_driver_target(policy, policy->max,
> >>>> >>> +                                             CPUFREQ_RELATION_H);
> >>>> >>> +             else if (policy->min > policy->cur)
> >>>> >>> +                     __cpufreq_driver_target(policy, policy->min,
> >>>> >>> +                                             CPUFREQ_RELATION_L);
> >>>> >>> +
> >>>> >>> +             mutex_unlock(&sg_policy->work_lock);
> >>>> >>> +     }
> >>> >>
> >>> >> Is the expectation that in the fast_switch_enabled case we should
> >>> >> re-evaluate soon enough that an explicit fixup is not required here?
> >> >
> >> > Yes, it is.
> >> >
> >>> >> I'm worried as to whether that will always be true given the possible
> >>> >> criticality of applying frequency limits (thermal for example).
> >> >
> >> > The part of the patch below that you cut actually takes care of that:
> >> >
> >> >     sg_policy->need_freq_update = true;
> >> >
> >> > which causes the rate limit to be ignored essentially, so the
> >> > frequency will be changed on the first update from the scheduler.
> 
> The scenario I'm contemplating is that while a CPU-intensive task is
> running a thermal interrupt goes off. The driver for this thermal
> interrupt responds by capping fmax. If this happens just after the tick,
> it seems possible that we could wait a full tick before changing the
> frequency. Given a 10ms tick it could be rather annoying for thermal
> management algorithms on some platforms (I'm familiar with a few).

The thermal driver has to do something like cpufreq_update_policy() then
which can only happen in process context.  I'm not sure how it is possible
to guarantee any latency better than that full tick here anyway.

> >> > Which also is why the min/max check is before the sg_policy->next_freq
> >> > == next_freq check in sugov_update_commit().
> >> >
> >> > I wanted to avoid locking in the fast switch/one CPU per policy case
> >> > which otherwise would be necessary just for the handling of this
> >> > thing.  I'd like to keep it the way it is unless it can be clearly
> >> > demonstrated that it really would lead to problems in practice in a
> >> > real system.
> >
> > Besides, even if frequency is updated directly from here in the "fast
> > switch" case, that still doesn't guarantee that it will be updated
> > immediately, because the task running this code may be preempted and
> > only scheduled again in the next cycle.
> >
> > Not to mention the fact that it may not run on the CPU to be updated,
> > so it would need to use something like smp_call_function_single() for
> > the update and that would complicate things even more.
> > 
> > Overall, I don't really think that doing the update directly from here
> > in the "fast switch" case would improve things much latency-wise and
> > it would increase complexity and introduce overhead into the fast
> > path.  So this really is a tradeoff and the current choice is the
> > right one IMO.
> 
> On the desire to avoid locking in the fast switch/one CPU per policy
> case, I wondered about whether disabling interrupts in sugov_limits()
> would suffice. That's a rarely called function and I was hoping that the
> update hook would already have interrupts disabled due to its being
> called in scheduler paths that may do raw_spin_lock_irqsave. But I'm not
> sure offhand that will always be true.

It will.

That's why we can use RCU-sched in cpufreq_update_util() etc.

> If it isn't though then I'm not
> sure what's necessarily stopping say the sched tick calling the hook
> while the hook is already in progress from some other path.
> 
> Agreed there would need to be some additional complexity somewhere to
> get things running on the correct CPU.
> 
> Anyway I have nothing against deferring this for now.

OK

Thanks,
Rafael

--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rafael J. Wysocki March 29, 2016, 12:58 p.m. UTC | #7
On Monday, March 28, 2016 02:33:33 PM Viresh Kumar wrote:
> On 22-03-16, 02:54, Rafael J. Wysocki wrote:
> > Index: linux-pm/kernel/sched/cpufreq_schedutil.c
> > ===================================================================
> > --- /dev/null
> > +++ linux-pm/kernel/sched/cpufreq_schedutil.c
> > @@ -0,0 +1,528 @@
> > +/*
> > + * CPUFreq governor based on scheduler-provided CPU utilization data.
> > + *
> > + * Copyright (C) 2016, Intel Corporation
> > + * Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > + *
> > + * This program is free software; you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License version 2 as
> > + * published by the Free Software Foundation.
> > + */
> > +
> > +#include <linux/cpufreq.h>
> > +#include <linux/module.h>
> > +#include <linux/slab.h>
> > +#include <trace/events/power.h>
> > +
> > +#include "sched.h"
> > +
> > +struct sugov_tunables {
> > +	struct gov_attr_set attr_set;
> > +	unsigned int rate_limit_us;
> > +};
> > +
> > +struct sugov_policy {
> > +	struct cpufreq_policy *policy;
> > +
> > +	struct sugov_tunables *tunables;
> > +	struct list_head tunables_hook;
> > +
> > +	raw_spinlock_t update_lock;  /* For shared policies */
> > +	u64 last_freq_update_time;
> > +	s64 freq_update_delay_ns;
> 
> And why isn't it part of sugov_tunables?

Because it is not a tunable.

> Its gonna be same for all policies sharing tunables ..

The value will be the same, but the cacheline won't.

> 
> > +	unsigned int next_freq;
> > +
> > +	/* The next fields are only needed if fast switch cannot be used. */
> > +	struct irq_work irq_work;
> > +	struct work_struct work;
> > +	struct mutex work_lock;
> > +	bool work_in_progress;
> > +
> > +	bool need_freq_update;
> > +};
> > +
> > +struct sugov_cpu {
> > +	struct update_util_data update_util;
> > +	struct sugov_policy *sg_policy;
> > +
> > +	/* The fields below are only needed when sharing a policy. */
> > +	unsigned long util;
> > +	unsigned long max;
> > +	u64 last_update;
> > +};
> > +
> > +static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu);
> > +
> > +/************************ Governor internals ***********************/
> > +
> > +static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
> 
> To make its purpose clear, maybe name it as: sugov_should_reevaluate_freq(),
> because we aren't updating the freq just yet, but deciding if we need to
> reevaluate again or not.

Splitting hairs anyone?

> As its going to be called from hotpath, maybe mark it as inline and let compiler
> decide ?

The compiler will make it inline if it decides it's worth it anyway.

> > +{
> > +	u64 delta_ns;
> > +
> > +	if (sg_policy->work_in_progress)
> > +		return false;
> > +
> > +	if (unlikely(sg_policy->need_freq_update)) {
> > +		sg_policy->need_freq_update = false;
> > +		return true;
> > +	}
> > +
> > +	delta_ns = time - sg_policy->last_freq_update_time;
> > +	return (s64)delta_ns >= sg_policy->freq_update_delay_ns;
> > +}
> > +
> > +static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time,
> 
> Maybe sugov_update_freq() ?

Can you please give up suggesting the names?

What's wrong with the original one?  Is it confusing in some way or something?

> > +				unsigned int next_freq)
> > +{
> > +	struct cpufreq_policy *policy = sg_policy->policy;
> > +
> > +	sg_policy->last_freq_update_time = time;
> > +
> > +	if (policy->fast_switch_enabled) {
> > +		if (next_freq > policy->max)
> > +			next_freq = policy->max;
> > +		else if (next_freq < policy->min)
> > +			next_freq = policy->min;
> > +
> > +		if (sg_policy->next_freq == next_freq) {
> > +			trace_cpu_frequency(policy->cur, smp_processor_id());
> > +			return;
> > +		}
> > +		sg_policy->next_freq = next_freq;
> 
> Why not do all of above stuff as part of else block as well and move it before
> the if {} block ?

Because the trace_cpu_frequency() is needed only in the fast switch case.

> > +		next_freq = cpufreq_driver_fast_switch(policy, next_freq);
> > +		if (next_freq == CPUFREQ_ENTRY_INVALID)
> > +			return;
> > +
> > +		policy->cur = next_freq;
> > +		trace_cpu_frequency(next_freq, smp_processor_id());
> > +	} else if (sg_policy->next_freq != next_freq) {
> > +		sg_policy->next_freq = next_freq;
> > +		sg_policy->work_in_progress = true;
> > +		irq_work_queue(&sg_policy->irq_work);
> > +	}
> > +}
> > +
> > +/**
> > + * get_next_freq - Compute a new frequency for a given cpufreq policy.
> > + * @policy: cpufreq policy object to compute the new frequency for.
> > + * @util: Current CPU utilization.
> > + * @max: CPU capacity.
> > + *
> > + * If the utilization is frequency-invariant, choose the new frequency to be
> > + * proportional to it, that is
> > + *
> > + * next_freq = C * max_freq * util / max
> > + *
> > + * Otherwise, approximate the would-be frequency-invariant utilization by
> > + * util_raw * (curr_freq / max_freq) which leads to
> > + *
> > + * next_freq = C * curr_freq * util_raw / max
> > + *
> > + * Take C = 1.25 for the frequency tipping point at (util / max) = 0.8.
> > + */
> > +static unsigned int get_next_freq(struct cpufreq_policy *policy,
> > +				  unsigned long util, unsigned long max)
> > +{
> > +	unsigned int freq = arch_scale_freq_invariant() ?
> > +				policy->cpuinfo.max_freq : policy->cur;
> > +
> > +	return (freq + (freq >> 2)) * util / max;
> > +}
> > +
> > +static void sugov_update_single(struct update_util_data *hook, u64 time,
> > +				unsigned long util, unsigned long max)
> > +{
> > +	struct sugov_cpu *sg_cpu = container_of(hook, struct sugov_cpu, update_util);
> > +	struct sugov_policy *sg_policy = sg_cpu->sg_policy;
> > +	struct cpufreq_policy *policy = sg_policy->policy;
> > +	unsigned int next_f;
> > +
> > +	if (!sugov_should_update_freq(sg_policy, time))
> > +		return;
> > +
> > +	next_f = util == ULONG_MAX ? policy->cpuinfo.max_freq :
> > +			get_next_freq(policy, util, max);
> > +	sugov_update_commit(sg_policy, time, next_f);
> > +}
> > +
> > +static unsigned int sugov_next_freq_shared(struct sugov_policy *sg_policy,
> > +					   unsigned long util, unsigned long max)
> > +{
> > +	struct cpufreq_policy *policy = sg_policy->policy;
> > +	unsigned int max_f = policy->cpuinfo.max_freq;
> > +	u64 last_freq_update_time = sg_policy->last_freq_update_time;
> > +	unsigned int j;
> > +
> > +	if (util == ULONG_MAX)
> > +		return max_f;
> > +
> > +	for_each_cpu(j, policy->cpus) {
> > +		struct sugov_cpu *j_sg_cpu;
> > +		unsigned long j_util, j_max;
> > +		u64 delta_ns;
> > +
> > +		if (j == smp_processor_id())
> > +			continue;
> 
> Why skip local CPU completely ?

Because the original util and max come from it.

> And if we really want to do that, what about something like for_each_cpu_and_not
> to kill the unnecessary if {} statement ?

That will work.
 
> > +
> > +		j_sg_cpu = &per_cpu(sugov_cpu, j);
> > +		/*
> > +		 * If the CPU utilization was last updated before the previous
> > +		 * frequency update and the time elapsed between the last update
> > +		 * of the CPU utilization and the last frequency update is long
> > +		 * enough, don't take the CPU into account as it probably is
> > +		 * idle now.
> > +		 */
> > +		delta_ns = last_freq_update_time - j_sg_cpu->last_update;
> > +		if ((s64)delta_ns > TICK_NSEC)
> > +			continue;
> > +
> > +		j_util = j_sg_cpu->util;
> > +		if (j_util == ULONG_MAX)
> > +			return max_f;
> > +
> > +		j_max = j_sg_cpu->max;
> > +		if (j_util * max > j_max * util) {
> > +			util = j_util;
> > +			max = j_max;
> > +		}
> > +	}
> > +
> > +	return get_next_freq(policy, util, max);
> > +}
> > +
> > +static void sugov_update_shared(struct update_util_data *hook, u64 time,
> > +				unsigned long util, unsigned long max)
> > +{
> > +	struct sugov_cpu *sg_cpu = container_of(hook, struct sugov_cpu, update_util);
> > +	struct sugov_policy *sg_policy = sg_cpu->sg_policy;
> > +	unsigned int next_f;
> > +
> > +	raw_spin_lock(&sg_policy->update_lock);
> > +
> > +	sg_cpu->util = util;
> > +	sg_cpu->max = max;
> > +	sg_cpu->last_update = time;
> > +
> > +	if (sugov_should_update_freq(sg_policy, time)) {
> > +		next_f = sugov_next_freq_shared(sg_policy, util, max);
> > +		sugov_update_commit(sg_policy, time, next_f);
> > +	}
> > +
> > +	raw_spin_unlock(&sg_policy->update_lock);
> > +}
> > +
> > +static void sugov_work(struct work_struct *work)
> > +{
> > +	struct sugov_policy *sg_policy = container_of(work, struct sugov_policy, work);
> > +
> > +	mutex_lock(&sg_policy->work_lock);
> > +	__cpufreq_driver_target(sg_policy->policy, sg_policy->next_freq,
> > +				CPUFREQ_RELATION_L);
> > +	mutex_unlock(&sg_policy->work_lock);
> > +
> > +	sg_policy->work_in_progress = false;
> > +}
> > +
> > +static void sugov_irq_work(struct irq_work *irq_work)
> > +{
> > +	struct sugov_policy *sg_policy;
> > +
> > +	sg_policy = container_of(irq_work, struct sugov_policy, irq_work);
> > +	schedule_work_on(smp_processor_id(), &sg_policy->work);
> > +}
> > +
> > +/************************** sysfs interface ************************/
> > +
> > +static struct sugov_tunables *global_tunables;
> > +static DEFINE_MUTEX(global_tunables_lock);
> > +
> > +static inline struct sugov_tunables *to_sugov_tunables(struct gov_attr_set *attr_set)
> > +{
> > +	return container_of(attr_set, struct sugov_tunables, attr_set);
> > +}
> > +
> > +static ssize_t rate_limit_us_show(struct gov_attr_set *attr_set, char *buf)
> > +{
> > +	struct sugov_tunables *tunables = to_sugov_tunables(attr_set);
> > +
> > +	return sprintf(buf, "%u\n", tunables->rate_limit_us);
> > +}
> > +
> > +static ssize_t rate_limit_us_store(struct gov_attr_set *attr_set, const char *buf,
> > +				   size_t count)
> > +{
> > +	struct sugov_tunables *tunables = to_sugov_tunables(attr_set);
> > +	struct sugov_policy *sg_policy;
> > +	unsigned int rate_limit_us;
> > +	int ret;
> > +
> > +	ret = sscanf(buf, "%u", &rate_limit_us);
> 
> checkpatch warns for this, we should be using kstrtou32 here ..

Hmm.  checkpatch.  Oh well.

> > +	if (ret != 1)
> > +		return -EINVAL;
> > +
> > +	tunables->rate_limit_us = rate_limit_us;
> > +
> > +	list_for_each_entry(sg_policy, &attr_set->policy_list, tunables_hook)
> > +		sg_policy->freq_update_delay_ns = rate_limit_us * NSEC_PER_USEC;
> > +
> > +	return count;
> > +}
> > +
> > +static struct governor_attr rate_limit_us = __ATTR_RW(rate_limit_us);
> 
> Why not reuse gov_attr_rw() ?

Would it work?

> > +
> > +static struct attribute *sugov_attributes[] = {
> > +	&rate_limit_us.attr,
> > +	NULL
> > +};
> > +
> > +static struct kobj_type sugov_tunables_ktype = {
> > +	.default_attrs = sugov_attributes,
> > +	.sysfs_ops = &governor_sysfs_ops,
> > +};
> > +
> > +/********************** cpufreq governor interface *********************/
> > +
> > +static struct cpufreq_governor schedutil_gov;
> > +
> > +static struct sugov_policy *sugov_policy_alloc(struct cpufreq_policy *policy)
> > +{
> > +	struct sugov_policy *sg_policy;
> > +
> > +	sg_policy = kzalloc(sizeof(*sg_policy), GFP_KERNEL);
> > +	if (!sg_policy)
> > +		return NULL;
> > +
> > +	sg_policy->policy = policy;
> > +	init_irq_work(&sg_policy->irq_work, sugov_irq_work);
> > +	INIT_WORK(&sg_policy->work, sugov_work);
> > +	mutex_init(&sg_policy->work_lock);
> > +	raw_spin_lock_init(&sg_policy->update_lock);
> > +	return sg_policy;
> > +}
> > +
> > +static void sugov_policy_free(struct sugov_policy *sg_policy)
> > +{
> > +	mutex_destroy(&sg_policy->work_lock);
> > +	kfree(sg_policy);
> > +}
> > +
> > +static struct sugov_tunables *sugov_tunables_alloc(struct sugov_policy *sg_policy)
> > +{
> > +	struct sugov_tunables *tunables;
> > +
> > +	tunables = kzalloc(sizeof(*tunables), GFP_KERNEL);
> > +	if (tunables)
> > +		gov_attr_set_init(&tunables->attr_set, &sg_policy->tunables_hook);
> > +
> > +	return tunables;
> > +}
> > +
> > +static void sugov_tunables_free(struct sugov_tunables *tunables)
> > +{
> > +	if (!have_governor_per_policy())
> > +		global_tunables = NULL;
> > +
> > +	kfree(tunables);
> > +}
> > +
> > +static int sugov_init(struct cpufreq_policy *policy)
> > +{
> > +	struct sugov_policy *sg_policy;
> > +	struct sugov_tunables *tunables;
> > +	unsigned int lat;
> > +	int ret = 0;
> > +
> > +	/* State should be equivalent to EXIT */
> > +	if (policy->governor_data)
> > +		return -EBUSY;
> > +
> > +	sg_policy = sugov_policy_alloc(policy);
> > +	if (!sg_policy)
> > +		return -ENOMEM;
> > +
> > +	mutex_lock(&global_tunables_lock);
> > +
> > +	if (global_tunables) {
> > +		if (WARN_ON(have_governor_per_policy())) {
> > +			ret = -EINVAL;
> > +			goto free_sg_policy;
> > +		}
> > +		policy->governor_data = sg_policy;
> > +		sg_policy->tunables = global_tunables;
> > +
> > +		gov_attr_set_get(&global_tunables->attr_set, &sg_policy->tunables_hook);
> > +		goto out;
> > +	}
> > +
> > +	tunables = sugov_tunables_alloc(sg_policy);
> > +	if (!tunables) {
> > +		ret = -ENOMEM;
> > +		goto free_sg_policy;
> > +	}
> > +
> > +	tunables->rate_limit_us = LATENCY_MULTIPLIER;
> > +	lat = policy->cpuinfo.transition_latency / NSEC_PER_USEC;
> > +	if (lat)
> > +		tunables->rate_limit_us *= lat;
> > +
> > +	if (!have_governor_per_policy())
> > +		global_tunables = tunables;
> 
> To make sugov_tunables_alloc/free() symmetric to each other, should we move
> above into sugov_tunables_alloc() ?

It doesn't matter too much, does it?

> > +
> > +	policy->governor_data = sg_policy;
> > +	sg_policy->tunables = tunables;
> > +
> > +	ret = kobject_init_and_add(&tunables->attr_set.kobj, &sugov_tunables_ktype,
> > +				   get_governor_parent_kobj(policy), "%s",
> > +				   schedutil_gov.name);
> > +	if (!ret)
> > +		goto out;
> > +
> > +	/* Failure, so roll back. */
> > +	policy->governor_data = NULL;
> > +	sugov_tunables_free(tunables);
> > +
> > + free_sg_policy:
> > +	pr_err("cpufreq: schedutil governor initialization failed (error %d)\n", ret);
> > +	sugov_policy_free(sg_policy);
> 
> I didn't like the way we have mixed success and failure path here, just to save
> a single line of code (unlock).

I don't follow, sorry.  Yes, I can do unlock/return instead of the "goto out",
but then the goto label is still needed.

> Over that it does things, that aren't symmetric anymore. For example, we have
> called sugov_policy_alloc() without locks

Are you sure?

> and are freeing it from within locks.

Both are under global_tunables_lock.

> > +
> > + out:
> > +	mutex_unlock(&global_tunables_lock);
> > +	return ret;
> > +}
> > +
> > +static int sugov_exit(struct cpufreq_policy *policy)
> > +{
> > +	struct sugov_policy *sg_policy = policy->governor_data;
> > +	struct sugov_tunables *tunables = sg_policy->tunables;
> > +	unsigned int count;
> > +
> > +	mutex_lock(&global_tunables_lock);
> > +
> > +	count = gov_attr_set_put(&tunables->attr_set, &sg_policy->tunables_hook);
> > +	policy->governor_data = NULL;
> > +	if (!count)
> > +		sugov_tunables_free(tunables);
> > +
> > +	mutex_unlock(&global_tunables_lock);
> > +
> > +	sugov_policy_free(sg_policy);
> > +	return 0;
> > +}
> > +
> > +static int sugov_start(struct cpufreq_policy *policy)
> > +{
> > +	struct sugov_policy *sg_policy = policy->governor_data;
> > +	unsigned int cpu;
> > +
> > +	cpufreq_enable_fast_switch(policy);
> 
> Why should we be doing this from START, which gets called a lot compared to
> INIT/EXIT? This is something which should be moved to INIT IMHO.

Yes, INIT would be a better call site.

> > +	sg_policy->freq_update_delay_ns = sg_policy->tunables->rate_limit_us * NSEC_PER_USEC;
> > +	sg_policy->last_freq_update_time = 0;
> > +	sg_policy->next_freq = UINT_MAX;
> > +	sg_policy->work_in_progress = false;
> > +	sg_policy->need_freq_update = false;
> > +
> > +	for_each_cpu(cpu, policy->cpus) {
> > +		struct sugov_cpu *sg_cpu = &per_cpu(sugov_cpu, cpu);
> > +
> > +		sg_cpu->sg_policy = sg_policy;
> > +		if (policy_is_shared(policy)) {
> > +			sg_cpu->util = ULONG_MAX;
> > +			sg_cpu->max = 0;
> > +			sg_cpu->last_update = 0;
> > +			cpufreq_add_update_util_hook(cpu, &sg_cpu->update_util,
> > +						     sugov_update_shared);
> > +		} else {
> > +			cpufreq_add_update_util_hook(cpu, &sg_cpu->update_util,
> > +						     sugov_update_single);
> > +		}
> > +	}
> > +	return 0;
> > +}
> > +
> > +static int sugov_stop(struct cpufreq_policy *policy)
> > +{
> > +	struct sugov_policy *sg_policy = policy->governor_data;
> > +	unsigned int cpu;
> > +
> > +	for_each_cpu(cpu, policy->cpus)
> > +		cpufreq_remove_update_util_hook(cpu);
> > +
> > +	synchronize_sched();
> > +
> > +	irq_work_sync(&sg_policy->irq_work);
> > +	cancel_work_sync(&sg_policy->work);
> 
> And again, we should have a disable-fast-switch as well..

That's not necessary.

> > +	return 0;
> > +}
> > +
> > +static int sugov_limits(struct cpufreq_policy *policy)
> > +{
> > +	struct sugov_policy *sg_policy = policy->governor_data;
> > +
> > +	if (!policy->fast_switch_enabled) {
> > +		mutex_lock(&sg_policy->work_lock);
> > +
> > +		if (policy->max < policy->cur)
> > +			__cpufreq_driver_target(policy, policy->max,
> > +						CPUFREQ_RELATION_H);
> > +		else if (policy->min > policy->cur)
> > +			__cpufreq_driver_target(policy, policy->min,
> > +						CPUFREQ_RELATION_L);
> > +
> > +		mutex_unlock(&sg_policy->work_lock);
> 
> Maybe we can try to take lock only if we are going to switch the freq, i.e. only
> if sugov_limits is called for policy->min/max update?

The __cpufreq_driver_target() in sugov_work() potentially updates policy->cur
that's checked here, so I don't really think this is a good idea.

> i.e.
> 
> void __sugov_limits(policy, freq, relation)
> {
> 		mutex_lock(&sg_policy->work_lock);
>                 __cpufreq_driver_target(policy, freq, relation);
> 		mutex_unlock(&sg_policy->work_lock);
> }
> 
> static int sugov_limits(struct cpufreq_policy *policy)
> {
> 	struct sugov_policy *sg_policy = policy->governor_data;
> 
> 	if (!policy->fast_switch_enabled) {
> 		if (policy->max < policy->cur)
>                         __sugov_limits(policy, policy->max, CPUFREQ_RELATION_H);
> 		else if (policy->min > policy->cur)
>                         __sugov_limits(policy, policy->min, CPUFREQ_RELATION_L);
> 	}
> 
> 	sg_policy->need_freq_update = true;
> 	return 0;
> }
> 
> ??
> 
> And maybe the same for current governors? (ofcourse in a separate patch, I can
> do that if you want).
> 
> 
> Also, why not just always do 'sg_policy->need_freq_update = true' from this
> routine and remove everything else? It will be taken care of on next evaluation.

It only would be taken care of quickly enough in the fast switch case.

> > +
> > +int sugov_governor(struct cpufreq_policy *policy, unsigned int event)
> > +{
> > +	if (event == CPUFREQ_GOV_POLICY_INIT) {
> > +		return sugov_init(policy);
> > +	} else if (policy->governor_data) {
> > +		switch (event) {
> > +		case CPUFREQ_GOV_POLICY_EXIT:
> > +			return sugov_exit(policy);
> > +		case CPUFREQ_GOV_START:
> > +			return sugov_start(policy);
> > +		case CPUFREQ_GOV_STOP:
> > +			return sugov_stop(policy);
> > +		case CPUFREQ_GOV_LIMITS:
> > +			return sugov_limits(policy);
> > +		}
> > +	}
> > +	return -EINVAL;
> > +}
> > +
> > +static struct cpufreq_governor schedutil_gov = {
> > +	.name = "schedutil",
> > +	.governor = sugov_governor,
> > +	.owner = THIS_MODULE,
> > +};
> > +
> > +static int __init sugov_module_init(void)
> > +{
> > +	return cpufreq_register_governor(&schedutil_gov);
> > +}
> > +
> > +static void __exit sugov_module_exit(void)
> > +{
> > +	cpufreq_unregister_governor(&schedutil_gov);
> > +}
> > +
> > +MODULE_AUTHOR("Rafael J. Wysocki <rafael.j.wysocki@intel.com>");
> > +MODULE_DESCRIPTION("Utilization-based CPU frequency selection");
> > +MODULE_LICENSE("GPL");
> 
> Maybe a MODULE_ALIAS as well ?

Sorry, I don't follow.

Thanks,
Rafael

--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rafael J. Wysocki March 30, 2016, 1:12 a.m. UTC | #8
On Tue, Mar 29, 2016 at 2:58 PM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> On Monday, March 28, 2016 02:33:33 PM Viresh Kumar wrote:
>> On 22-03-16, 02:54, Rafael J. Wysocki wrote:

[cut]

>> > +static unsigned int sugov_next_freq_shared(struct sugov_policy *sg_policy,
>> > +                                      unsigned long util, unsigned long max)
>> > +{
>> > +   struct cpufreq_policy *policy = sg_policy->policy;
>> > +   unsigned int max_f = policy->cpuinfo.max_freq;
>> > +   u64 last_freq_update_time = sg_policy->last_freq_update_time;
>> > +   unsigned int j;
>> > +
>> > +   if (util == ULONG_MAX)
>> > +           return max_f;
>> > +
>> > +   for_each_cpu(j, policy->cpus) {
>> > +           struct sugov_cpu *j_sg_cpu;
>> > +           unsigned long j_util, j_max;
>> > +           u64 delta_ns;
>> > +
>> > +           if (j == smp_processor_id())
>> > +                   continue;
>>
>> Why skip local CPU completely ?
>
> Because the original util and max come from it.
>
>> And if we really want to do that, what about something like for_each_cpu_and_not
>> to kill the unnecessary if {} statement ?
>
> That will work.

Except that for_each_cpu_and_not is not defined as of today.

I guess I can play with cpumasks, but then I'm not sure that will end
up actually more efficient.
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Viresh Kumar March 30, 2016, 4:10 a.m. UTC | #9
On 29-03-16, 14:58, Rafael J. Wysocki wrote:
> On Monday, March 28, 2016 02:33:33 PM Viresh Kumar wrote:
> > Its gonna be same for all policies sharing tunables ..
> 
> The value will be the same, but the cacheline won't.

Fair enough. So this information is replicated for each policy for performance
benefits. Would it make sense to add a comment for that? Its not obvious today
why we are keeping this per-policy.

> > > +static ssize_t rate_limit_us_store(struct gov_attr_set *attr_set, const char *buf,
> > > +				   size_t count)
> > > +{
> > > +	struct sugov_tunables *tunables = to_sugov_tunables(attr_set);
> > > +	struct sugov_policy *sg_policy;
> > > +	unsigned int rate_limit_us;
> > > +	int ret;
> > > +
> > > +	ret = sscanf(buf, "%u", &rate_limit_us);
> > > +	if (ret != 1)
> > > +		return -EINVAL;
> > > +
> > > +	tunables->rate_limit_us = rate_limit_us;
> > > +
> > > +	list_for_each_entry(sg_policy, &attr_set->policy_list, tunables_hook)
> > > +		sg_policy->freq_update_delay_ns = rate_limit_us * NSEC_PER_USEC;
> > > +
> > > +	return count;
> > > +}
> > > +
> > > +static struct governor_attr rate_limit_us = __ATTR_RW(rate_limit_us);
> > 
> > Why not reuse gov_attr_rw() ?
> 
> Would it work?

Why wouldn't it? I had a look again at that and I couldn't find a reason for it
to not work. Probably I missed something.

> > > +	ret = kobject_init_and_add(&tunables->attr_set.kobj, &sugov_tunables_ktype,
> > > +				   get_governor_parent_kobj(policy), "%s",
> > > +				   schedutil_gov.name);
> > > +	if (!ret)
> > > +		goto out;
> > > +
> > > +	/* Failure, so roll back. */
> > > +	policy->governor_data = NULL;
> > > +	sugov_tunables_free(tunables);
> > > +
> > > + free_sg_policy:
> > > +	pr_err("cpufreq: schedutil governor initialization failed (error %d)\n", ret);
> > > +	sugov_policy_free(sg_policy);
> > 
> > I didn't like the way we have mixed success and failure path here, just to save
> > a single line of code (unlock).
> 
> I don't follow, sorry.  Yes, I can do unlock/return instead of the "goto out",
> but then the goto label is still needed.

Sorry for not being clear earlier, but this what I was suggesting it to look like:

---
static int sugov_init(struct cpufreq_policy *policy)
{
       struct sugov_policy *sg_policy;
       struct sugov_tunables *tunables;
       unsigned int lat;
       int ret = 0;

       /* State should be equivalent to EXIT */
       if (policy->governor_data)
               return -EBUSY;

       sg_policy = sugov_policy_alloc(policy);
       if (!sg_policy)
               return -ENOMEM;

       mutex_lock(&global_tunables_lock);

       if (global_tunables) {
               if (WARN_ON(have_governor_per_policy())) {
                       ret = -EINVAL;
                       goto free_sg_policy;
               }
               policy->governor_data = sg_policy;
               sg_policy->tunables = global_tunables;

               gov_attr_set_get(&global_tunables->attr_set, &sg_policy->tunables_hook);

       	       mutex_unlock(&global_tunables_lock);
       	       return 0;
       }

       tunables = sugov_tunables_alloc(sg_policy);
       if (!tunables) {
               ret = -ENOMEM;
               goto free_sg_policy;
       }

       tunables->rate_limit_us = LATENCY_MULTIPLIER;
       lat = policy->cpuinfo.transition_latency / NSEC_PER_USEC;
       if (lat)
               tunables->rate_limit_us *= lat;

       if (!have_governor_per_policy())
               global_tunables = tunables;

       policy->governor_data = sg_policy;
       sg_policy->tunables = tunables;

       ret = kobject_init_and_add(&tunables->attr_set.kobj, &sugov_tunables_ktype,
                                  get_governor_parent_kobj(policy), "%s",
                                  schedutil_gov.name);
       if (!ret) {
       	       mutex_unlock(&global_tunables_lock);
       	       return 0;
       }

       /* Failure, so roll back. */
       policy->governor_data = NULL;
       sugov_tunables_free(tunables);

 free_sg_policy:
       mutex_unlock(&global_tunables_lock);

       pr_err("cpufreq: schedutil governor initialization failed (error %d)\n", ret);
       sugov_policy_free(sg_policy);

       return ret;

---

> > Over that it does things, that aren't symmetric anymore. For example, we have
> > called sugov_policy_alloc() without locks
> 
> Are you sure?

Yes.

> > and are freeing it from within locks.
> 
> Both are under global_tunables_lock.

No, sugov_policy_alloc() isn't called from within locks.

> > > +static int __init sugov_module_init(void)
> > > +{
> > > +	return cpufreq_register_governor(&schedutil_gov);
> > > +}
> > > +
> > > +static void __exit sugov_module_exit(void)
> > > +{
> > > +	cpufreq_unregister_governor(&schedutil_gov);
> > > +}
> > > +
> > > +MODULE_AUTHOR("Rafael J. Wysocki <rafael.j.wysocki@intel.com>");
> > > +MODULE_DESCRIPTION("Utilization-based CPU frequency selection");
> > > +MODULE_LICENSE("GPL");
> > 
> > Maybe a MODULE_ALIAS as well ?
> 
> Sorry, I don't follow.

Oh, I was just saying that we may also want to add a MODULE_ALIAS() line here
to help auto-loading if it is built as a module?
Peter Zijlstra March 31, 2016, 12:24 p.m. UTC | #10
On Mon, Mar 28, 2016 at 11:17:44AM -0700, Steve Muckle wrote:
> The scenario I'm contemplating is that while a CPU-intensive task is
> running a thermal interrupt goes off. The driver for this thermal
> interrupt responds by capping fmax. If this happens just after the tick,
> it seems possible that we could wait a full tick before changing the
> frequency. Given a 10ms tick it could be rather annoying for thermal
> management algorithms on some platforms (I'm familiar with a few).

So I'm blissfully unaware of all the thermal stuffs we have; but it
looks like its somehow bolten onto cpufreq without feedback.

The thing I worry about is thermal scaling the CPU back past where RT/DL
tasks can still complete in time. It should not be able to do that, or
rather, missing deadlines because thermal is about as useful as
rebooting the device.

I guess I'm saying is, the whole cpufreq/thermal 'interface' needs work
anyhow.
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Peter Zijlstra March 31, 2016, 12:28 p.m. UTC | #11
On Wed, Mar 30, 2016 at 03:12:40AM +0200, Rafael J. Wysocki wrote:
> Except that for_each_cpu_and_not is not defined as of today.
> 
> I guess I can play with cpumasks, but then I'm not sure that will end
> up actually more efficient.

It will not indeed.
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rafael J. Wysocki March 31, 2016, 12:32 p.m. UTC | #12
On Thu, Mar 31, 2016 at 2:24 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, Mar 28, 2016 at 11:17:44AM -0700, Steve Muckle wrote:
>> The scenario I'm contemplating is that while a CPU-intensive task is
>> running a thermal interrupt goes off. The driver for this thermal
>> interrupt responds by capping fmax. If this happens just after the tick,
>> it seems possible that we could wait a full tick before changing the
>> frequency. Given a 10ms tick it could be rather annoying for thermal
>> management algorithms on some platforms (I'm familiar with a few).
>
> So I'm blissfully unaware of all the thermal stuffs we have; but it
> looks like its somehow bolten onto cpufreq without feedback.
>
> The thing I worry about is thermal scaling the CPU back past where RT/DL
> tasks can still complete in time. It should not be able to do that, or
> rather, missing deadlines because thermal is about as useful as
> rebooting the device.

Right.  If thermal throttling kicks in, the game is pretty much over.

That's why ideas float about taking the thermal constraints into
account upfront, but that's a different discussion entirely.

> I guess I'm saying is, the whole cpufreq/thermal 'interface' needs work
> anyhow.

Yes, it does.
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Steve Muckle April 1, 2016, 6:15 p.m. UTC | #13
On 03/31/2016 05:32 AM, Rafael J. Wysocki wrote:
> On Thu, Mar 31, 2016 at 2:24 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>> On Mon, Mar 28, 2016 at 11:17:44AM -0700, Steve Muckle wrote:
>>> The scenario I'm contemplating is that while a CPU-intensive task is
>>> running a thermal interrupt goes off. The driver for this thermal
>>> interrupt responds by capping fmax. If this happens just after the tick,
>>> it seems possible that we could wait a full tick before changing the
>>> frequency. Given a 10ms tick it could be rather annoying for thermal
>>> management algorithms on some platforms (I'm familiar with a few).
>>
>> So I'm blissfully unaware of all the thermal stuffs we have; but it
>> looks like its somehow bolten onto cpufreq without feedback.
>>
>> The thing I worry about is thermal scaling the CPU back past where RT/DL
>> tasks can still complete in time. It should not be able to do that, or
>> rather, missing deadlines because thermal is about as useful as
>> rebooting the device.

I'd agree that impacting RT/DL activity because of throttling may be as
bad as as a reset, but that seems worst case. There could be some
graceful shutdown or notification/alarm that can be done. Or a platform
can simply choose to reset.

Shouldn't we try to give the system designer the option of doing
something in software (by throttling the CPUs as low as necessary to
continue operation) rather than giving up and relying on a hardware reset?

> Right.  If thermal throttling kicks in, the game is pretty much over.
> 
> That's why ideas float about taking the thermal constraints into
> account upfront, but that's a different discussion entirely.

Current mainstream mobile platforms frequently throttle during normal
operation. I think it's important to have a robust throttling mechanism
at least until the more proactive thermal management scheme is fully
developed and proves to be equally capable (if and when that happens).

>> I guess I'm saying is, the whole cpufreq/thermal 'interface' needs work
>> anyhow.
> 
> Yes, it does.

Agreed!

thanks,
Steve

--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

Index: linux-pm/drivers/cpufreq/Kconfig
===================================================================
--- linux-pm.orig/drivers/cpufreq/Kconfig
+++ linux-pm/drivers/cpufreq/Kconfig
@@ -107,6 +107,16 @@  config CPU_FREQ_DEFAULT_GOV_CONSERVATIVE
 	  Be aware that not all cpufreq drivers support the conservative
 	  governor. If unsure have a look at the help section of the
 	  driver. Fallback governor will be the performance governor.
+
+config CPU_FREQ_DEFAULT_GOV_SCHEDUTIL
+	bool "schedutil"
+	select CPU_FREQ_GOV_SCHEDUTIL
+	select CPU_FREQ_GOV_PERFORMANCE
+	help
+	  Use the 'schedutil' CPUFreq governor by default. If unsure,
+	  have a look at the help section of that governor. The fallback
+	  governor will be 'performance'.
+
 endchoice
 
 config CPU_FREQ_GOV_PERFORMANCE
@@ -188,6 +198,22 @@  config CPU_FREQ_GOV_CONSERVATIVE
 
 	  If in doubt, say N.
 
+config CPU_FREQ_GOV_SCHEDUTIL
+	tristate "'schedutil' cpufreq policy governor"
+	depends on CPU_FREQ
+	select CPU_FREQ_GOV_ATTR_SET
+	select IRQ_WORK
+	help
+	  The frequency selection formula used by this governor is analogous
+	  to the one used by 'ondemand', but instead of computing CPU load
+	  as the "non-idle CPU time" to "total CPU time" ratio, it uses CPU
+	  utilization data provided by the scheduler as input.
+
+	  To compile this driver as a module, choose M here: the
+	  module will be called cpufreq_schedutil.
+
+	  If in doubt, say N.
+
 comment "CPU frequency scaling drivers"
 
 config CPUFREQ_DT
Index: linux-pm/kernel/sched/cpufreq_schedutil.c
===================================================================
--- /dev/null
+++ linux-pm/kernel/sched/cpufreq_schedutil.c
@@ -0,0 +1,528 @@ 
+/*
+ * CPUFreq governor based on scheduler-provided CPU utilization data.
+ *
+ * Copyright (C) 2016, Intel Corporation
+ * Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/cpufreq.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <trace/events/power.h>
+
+#include "sched.h"
+
+struct sugov_tunables {
+	struct gov_attr_set attr_set;
+	unsigned int rate_limit_us;
+};
+
+struct sugov_policy {
+	struct cpufreq_policy *policy;
+
+	struct sugov_tunables *tunables;
+	struct list_head tunables_hook;
+
+	raw_spinlock_t update_lock;  /* For shared policies */
+	u64 last_freq_update_time;
+	s64 freq_update_delay_ns;
+	unsigned int next_freq;
+
+	/* The next fields are only needed if fast switch cannot be used. */
+	struct irq_work irq_work;
+	struct work_struct work;
+	struct mutex work_lock;
+	bool work_in_progress;
+
+	bool need_freq_update;
+};
+
+struct sugov_cpu {
+	struct update_util_data update_util;
+	struct sugov_policy *sg_policy;
+
+	/* The fields below are only needed when sharing a policy. */
+	unsigned long util;
+	unsigned long max;
+	u64 last_update;
+};
+
+static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu);
+
+/************************ Governor internals ***********************/
+
+static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
+{
+	u64 delta_ns;
+
+	if (sg_policy->work_in_progress)
+		return false;
+
+	if (unlikely(sg_policy->need_freq_update)) {
+		sg_policy->need_freq_update = false;
+		return true;
+	}
+
+	delta_ns = time - sg_policy->last_freq_update_time;
+	return (s64)delta_ns >= sg_policy->freq_update_delay_ns;
+}
+
+static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time,
+				unsigned int next_freq)
+{
+	struct cpufreq_policy *policy = sg_policy->policy;
+
+	sg_policy->last_freq_update_time = time;
+
+	if (policy->fast_switch_enabled) {
+		if (next_freq > policy->max)
+			next_freq = policy->max;
+		else if (next_freq < policy->min)
+			next_freq = policy->min;
+
+		if (sg_policy->next_freq == next_freq) {
+			trace_cpu_frequency(policy->cur, smp_processor_id());
+			return;
+		}
+		sg_policy->next_freq = next_freq;
+		next_freq = cpufreq_driver_fast_switch(policy, next_freq);
+		if (next_freq == CPUFREQ_ENTRY_INVALID)
+			return;
+
+		policy->cur = next_freq;
+		trace_cpu_frequency(next_freq, smp_processor_id());
+	} else if (sg_policy->next_freq != next_freq) {
+		sg_policy->next_freq = next_freq;
+		sg_policy->work_in_progress = true;
+		irq_work_queue(&sg_policy->irq_work);
+	}
+}
+
+/**
+ * get_next_freq - Compute a new frequency for a given cpufreq policy.
+ * @policy: cpufreq policy object to compute the new frequency for.
+ * @util: Current CPU utilization.
+ * @max: CPU capacity.
+ *
+ * If the utilization is frequency-invariant, choose the new frequency to be
+ * proportional to it, that is
+ *
+ * next_freq = C * max_freq * util / max
+ *
+ * Otherwise, approximate the would-be frequency-invariant utilization by
+ * util_raw * (curr_freq / max_freq) which leads to
+ *
+ * next_freq = C * curr_freq * util_raw / max
+ *
+ * Take C = 1.25 for the frequency tipping point at (util / max) = 0.8.
+ */
+static unsigned int get_next_freq(struct cpufreq_policy *policy,
+				  unsigned long util, unsigned long max)
+{
+	unsigned int freq = arch_scale_freq_invariant() ?
+				policy->cpuinfo.max_freq : policy->cur;
+
+	return (freq + (freq >> 2)) * util / max;
+}
+
+static void sugov_update_single(struct update_util_data *hook, u64 time,
+				unsigned long util, unsigned long max)
+{
+	struct sugov_cpu *sg_cpu = container_of(hook, struct sugov_cpu, update_util);
+	struct sugov_policy *sg_policy = sg_cpu->sg_policy;
+	struct cpufreq_policy *policy = sg_policy->policy;
+	unsigned int next_f;
+
+	if (!sugov_should_update_freq(sg_policy, time))
+		return;
+
+	next_f = util == ULONG_MAX ? policy->cpuinfo.max_freq :
+			get_next_freq(policy, util, max);
+	sugov_update_commit(sg_policy, time, next_f);
+}
+
+static unsigned int sugov_next_freq_shared(struct sugov_policy *sg_policy,
+					   unsigned long util, unsigned long max)
+{
+	struct cpufreq_policy *policy = sg_policy->policy;
+	unsigned int max_f = policy->cpuinfo.max_freq;
+	u64 last_freq_update_time = sg_policy->last_freq_update_time;
+	unsigned int j;
+
+	if (util == ULONG_MAX)
+		return max_f;
+
+	for_each_cpu(j, policy->cpus) {
+		struct sugov_cpu *j_sg_cpu;
+		unsigned long j_util, j_max;
+		u64 delta_ns;
+
+		if (j == smp_processor_id())
+			continue;
+
+		j_sg_cpu = &per_cpu(sugov_cpu, j);
+		/*
+		 * If the CPU utilization was last updated before the previous
+		 * frequency update and the time elapsed between the last update
+		 * of the CPU utilization and the last frequency update is long
+		 * enough, don't take the CPU into account as it probably is
+		 * idle now.
+		 */
+		delta_ns = last_freq_update_time - j_sg_cpu->last_update;
+		if ((s64)delta_ns > TICK_NSEC)
+			continue;
+
+		j_util = j_sg_cpu->util;
+		if (j_util == ULONG_MAX)
+			return max_f;
+
+		j_max = j_sg_cpu->max;
+		if (j_util * max > j_max * util) {
+			util = j_util;
+			max = j_max;
+		}
+	}
+
+	return get_next_freq(policy, util, max);
+}
+
+static void sugov_update_shared(struct update_util_data *hook, u64 time,
+				unsigned long util, unsigned long max)
+{
+	struct sugov_cpu *sg_cpu = container_of(hook, struct sugov_cpu, update_util);
+	struct sugov_policy *sg_policy = sg_cpu->sg_policy;
+	unsigned int next_f;
+
+	raw_spin_lock(&sg_policy->update_lock);
+
+	sg_cpu->util = util;
+	sg_cpu->max = max;
+	sg_cpu->last_update = time;
+
+	if (sugov_should_update_freq(sg_policy, time)) {
+		next_f = sugov_next_freq_shared(sg_policy, util, max);
+		sugov_update_commit(sg_policy, time, next_f);
+	}
+
+	raw_spin_unlock(&sg_policy->update_lock);
+}
+
+static void sugov_work(struct work_struct *work)
+{
+	struct sugov_policy *sg_policy = container_of(work, struct sugov_policy, work);
+
+	mutex_lock(&sg_policy->work_lock);
+	__cpufreq_driver_target(sg_policy->policy, sg_policy->next_freq,
+				CPUFREQ_RELATION_L);
+	mutex_unlock(&sg_policy->work_lock);
+
+	sg_policy->work_in_progress = false;
+}
+
+static void sugov_irq_work(struct irq_work *irq_work)
+{
+	struct sugov_policy *sg_policy;
+
+	sg_policy = container_of(irq_work, struct sugov_policy, irq_work);
+	schedule_work_on(smp_processor_id(), &sg_policy->work);
+}
+
+/************************** sysfs interface ************************/
+
+static struct sugov_tunables *global_tunables;
+static DEFINE_MUTEX(global_tunables_lock);
+
+static inline struct sugov_tunables *to_sugov_tunables(struct gov_attr_set *attr_set)
+{
+	return container_of(attr_set, struct sugov_tunables, attr_set);
+}
+
+static ssize_t rate_limit_us_show(struct gov_attr_set *attr_set, char *buf)
+{
+	struct sugov_tunables *tunables = to_sugov_tunables(attr_set);
+
+	return sprintf(buf, "%u\n", tunables->rate_limit_us);
+}
+
+static ssize_t rate_limit_us_store(struct gov_attr_set *attr_set, const char *buf,
+				   size_t count)
+{
+	struct sugov_tunables *tunables = to_sugov_tunables(attr_set);
+	struct sugov_policy *sg_policy;
+	unsigned int rate_limit_us;
+	int ret;
+
+	ret = sscanf(buf, "%u", &rate_limit_us);
+	if (ret != 1)
+		return -EINVAL;
+
+	tunables->rate_limit_us = rate_limit_us;
+
+	list_for_each_entry(sg_policy, &attr_set->policy_list, tunables_hook)
+		sg_policy->freq_update_delay_ns = rate_limit_us * NSEC_PER_USEC;
+
+	return count;
+}
+
+static struct governor_attr rate_limit_us = __ATTR_RW(rate_limit_us);
+
+static struct attribute *sugov_attributes[] = {
+	&rate_limit_us.attr,
+	NULL
+};
+
+static struct kobj_type sugov_tunables_ktype = {
+	.default_attrs = sugov_attributes,
+	.sysfs_ops = &governor_sysfs_ops,
+};
+
+/********************** cpufreq governor interface *********************/
+
+static struct cpufreq_governor schedutil_gov;
+
+static struct sugov_policy *sugov_policy_alloc(struct cpufreq_policy *policy)
+{
+	struct sugov_policy *sg_policy;
+
+	sg_policy = kzalloc(sizeof(*sg_policy), GFP_KERNEL);
+	if (!sg_policy)
+		return NULL;
+
+	sg_policy->policy = policy;
+	init_irq_work(&sg_policy->irq_work, sugov_irq_work);
+	INIT_WORK(&sg_policy->work, sugov_work);
+	mutex_init(&sg_policy->work_lock);
+	raw_spin_lock_init(&sg_policy->update_lock);
+	return sg_policy;
+}
+
+static void sugov_policy_free(struct sugov_policy *sg_policy)
+{
+	mutex_destroy(&sg_policy->work_lock);
+	kfree(sg_policy);
+}
+
+static struct sugov_tunables *sugov_tunables_alloc(struct sugov_policy *sg_policy)
+{
+	struct sugov_tunables *tunables;
+
+	tunables = kzalloc(sizeof(*tunables), GFP_KERNEL);
+	if (tunables)
+		gov_attr_set_init(&tunables->attr_set, &sg_policy->tunables_hook);
+
+	return tunables;
+}
+
+static void sugov_tunables_free(struct sugov_tunables *tunables)
+{
+	if (!have_governor_per_policy())
+		global_tunables = NULL;
+
+	kfree(tunables);
+}
+
+static int sugov_init(struct cpufreq_policy *policy)
+{
+	struct sugov_policy *sg_policy;
+	struct sugov_tunables *tunables;
+	unsigned int lat;
+	int ret = 0;
+
+	/* State should be equivalent to EXIT */
+	if (policy->governor_data)
+		return -EBUSY;
+
+	sg_policy = sugov_policy_alloc(policy);
+	if (!sg_policy)
+		return -ENOMEM;
+
+	mutex_lock(&global_tunables_lock);
+
+	if (global_tunables) {
+		if (WARN_ON(have_governor_per_policy())) {
+			ret = -EINVAL;
+			goto free_sg_policy;
+		}
+		policy->governor_data = sg_policy;
+		sg_policy->tunables = global_tunables;
+
+		gov_attr_set_get(&global_tunables->attr_set, &sg_policy->tunables_hook);
+		goto out;
+	}
+
+	tunables = sugov_tunables_alloc(sg_policy);
+	if (!tunables) {
+		ret = -ENOMEM;
+		goto free_sg_policy;
+	}
+
+	tunables->rate_limit_us = LATENCY_MULTIPLIER;
+	lat = policy->cpuinfo.transition_latency / NSEC_PER_USEC;
+	if (lat)
+		tunables->rate_limit_us *= lat;
+
+	if (!have_governor_per_policy())
+		global_tunables = tunables;
+
+	policy->governor_data = sg_policy;
+	sg_policy->tunables = tunables;
+
+	ret = kobject_init_and_add(&tunables->attr_set.kobj, &sugov_tunables_ktype,
+				   get_governor_parent_kobj(policy), "%s",
+				   schedutil_gov.name);
+	if (!ret)
+		goto out;
+
+	/* Failure, so roll back. */
+	policy->governor_data = NULL;
+	sugov_tunables_free(tunables);
+
+ free_sg_policy:
+	pr_err("cpufreq: schedutil governor initialization failed (error %d)\n", ret);
+	sugov_policy_free(sg_policy);
+
+ out:
+	mutex_unlock(&global_tunables_lock);
+	return ret;
+}
+
+static int sugov_exit(struct cpufreq_policy *policy)
+{
+	struct sugov_policy *sg_policy = policy->governor_data;
+	struct sugov_tunables *tunables = sg_policy->tunables;
+	unsigned int count;
+
+	mutex_lock(&global_tunables_lock);
+
+	count = gov_attr_set_put(&tunables->attr_set, &sg_policy->tunables_hook);
+	policy->governor_data = NULL;
+	if (!count)
+		sugov_tunables_free(tunables);
+
+	mutex_unlock(&global_tunables_lock);
+
+	sugov_policy_free(sg_policy);
+	return 0;
+}
+
+static int sugov_start(struct cpufreq_policy *policy)
+{
+	struct sugov_policy *sg_policy = policy->governor_data;
+	unsigned int cpu;
+
+	cpufreq_enable_fast_switch(policy);
+
+	sg_policy->freq_update_delay_ns = sg_policy->tunables->rate_limit_us * NSEC_PER_USEC;
+	sg_policy->last_freq_update_time = 0;
+	sg_policy->next_freq = UINT_MAX;
+	sg_policy->work_in_progress = false;
+	sg_policy->need_freq_update = false;
+
+	for_each_cpu(cpu, policy->cpus) {
+		struct sugov_cpu *sg_cpu = &per_cpu(sugov_cpu, cpu);
+
+		sg_cpu->sg_policy = sg_policy;
+		if (policy_is_shared(policy)) {
+			sg_cpu->util = ULONG_MAX;
+			sg_cpu->max = 0;
+			sg_cpu->last_update = 0;
+			cpufreq_add_update_util_hook(cpu, &sg_cpu->update_util,
+						     sugov_update_shared);
+		} else {
+			cpufreq_add_update_util_hook(cpu, &sg_cpu->update_util,
+						     sugov_update_single);
+		}
+	}
+	return 0;
+}
+
+static int sugov_stop(struct cpufreq_policy *policy)
+{
+	struct sugov_policy *sg_policy = policy->governor_data;
+	unsigned int cpu;
+
+	for_each_cpu(cpu, policy->cpus)
+		cpufreq_remove_update_util_hook(cpu);
+
+	synchronize_sched();
+
+	irq_work_sync(&sg_policy->irq_work);
+	cancel_work_sync(&sg_policy->work);
+	return 0;
+}
+
+static int sugov_limits(struct cpufreq_policy *policy)
+{
+	struct sugov_policy *sg_policy = policy->governor_data;
+
+	if (!policy->fast_switch_enabled) {
+		mutex_lock(&sg_policy->work_lock);
+
+		if (policy->max < policy->cur)
+			__cpufreq_driver_target(policy, policy->max,
+						CPUFREQ_RELATION_H);
+		else if (policy->min > policy->cur)
+			__cpufreq_driver_target(policy, policy->min,
+						CPUFREQ_RELATION_L);
+
+		mutex_unlock(&sg_policy->work_lock);
+	}
+
+	sg_policy->need_freq_update = true;
+	return 0;
+}
+
+int sugov_governor(struct cpufreq_policy *policy, unsigned int event)
+{
+	if (event == CPUFREQ_GOV_POLICY_INIT) {
+		return sugov_init(policy);
+	} else if (policy->governor_data) {
+		switch (event) {
+		case CPUFREQ_GOV_POLICY_EXIT:
+			return sugov_exit(policy);
+		case CPUFREQ_GOV_START:
+			return sugov_start(policy);
+		case CPUFREQ_GOV_STOP:
+			return sugov_stop(policy);
+		case CPUFREQ_GOV_LIMITS:
+			return sugov_limits(policy);
+		}
+	}
+	return -EINVAL;
+}
+
+static struct cpufreq_governor schedutil_gov = {
+	.name = "schedutil",
+	.governor = sugov_governor,
+	.owner = THIS_MODULE,
+};
+
+static int __init sugov_module_init(void)
+{
+	return cpufreq_register_governor(&schedutil_gov);
+}
+
+static void __exit sugov_module_exit(void)
+{
+	cpufreq_unregister_governor(&schedutil_gov);
+}
+
+MODULE_AUTHOR("Rafael J. Wysocki <rafael.j.wysocki@intel.com>");
+MODULE_DESCRIPTION("Utilization-based CPU frequency selection");
+MODULE_LICENSE("GPL");
+
+#ifdef CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL
+struct cpufreq_governor *cpufreq_default_governor(void)
+{
+	return &schedutil_gov;
+}
+
+fs_initcall(sugov_module_init);
+#else
+module_init(sugov_module_init);
+#endif
+module_exit(sugov_module_exit);
Index: linux-pm/kernel/sched/Makefile
===================================================================
--- linux-pm.orig/kernel/sched/Makefile
+++ linux-pm/kernel/sched/Makefile
@@ -20,3 +20,4 @@  obj-$(CONFIG_SCHEDSTATS) += stats.o
 obj-$(CONFIG_SCHED_DEBUG) += debug.o
 obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o
 obj-$(CONFIG_CPU_FREQ) += cpufreq.o
+obj-$(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) += cpufreq_schedutil.o
Index: linux-pm/kernel/sched/sched.h
===================================================================
--- linux-pm.orig/kernel/sched/sched.h
+++ linux-pm/kernel/sched/sched.h
@@ -1841,3 +1841,11 @@  static inline void cpufreq_trigger_updat
 static inline void cpufreq_update_util(u64 time, unsigned long util, unsigned long max) {}
 static inline void cpufreq_trigger_update(u64 time) {}
 #endif /* CONFIG_CPU_FREQ */
+
+#ifdef arch_scale_freq_capacity
+#ifndef arch_scale_freq_invariant
+#define arch_scale_freq_invariant()	(true)
+#endif
+#else /* arch_scale_freq_capacity */
+#define arch_scale_freq_invariant()	(false)
+#endif