[RFC,2/6] sched: Introduce energy models of CPUs

Message ID 20180320094312.24081-3-dietmar.eggemann@arm.com (mailing list archive)
State RFC, archived

Commit Message

Dietmar Eggemann March 20, 2018, 9:43 a.m. UTC
From: Quentin Perret <quentin.perret@arm.com>

The energy consumption of each CPU in the system is modeled with a list
of values representing its dissipated power and compute capacity at each
available Operating Performance Point (OPP). These values are derived
from existing information in the kernel (currently used by the thermal
subsystem) and don't require the introduction of new platform-specific
tunables. The energy model is also provided with a simple representation
of all frequency domains as cpumasks, hence enabling the scheduler to be
aware of dependencies between CPUs. The data required to build the energy
model is provided by the OPP library which enables an abstract view of
the platform from the scheduler. The new data structures holding these
models and the routines to populate them are stored in
kernel/sched/energy.c.
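
For instance (illustrative numbers only), a CPU whose arch-provided
capacity is 1024 and which has OPPs at 500 MHz and 1 GHz would get two
<capacity, power> tuples, with capacities 512 and 1024 respectively,
each paired with the power dissipated at that OPP.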

For the sake of simplicity, it is assumed in the energy model that all
CPUs in a frequency domain share the same micro-architecture. As long as
this assumption is correct, the energy models of different CPUs belonging
to the same frequency domain are equal. Hence, this commit builds only one
energy model per frequency domain, and links all relevant CPUs to it in
order to save time and memory. If needed for future hardware platforms,
relaxing this assumption should imply relatively simple modifications in
the code but a significantly higher algorithmic complexity.

As it appears that energy-aware scheduling really makes a difference on
heterogeneous systems (e.g. big.LITTLE platforms), it is restricted to
systems having:

   1. the SD_ASYM_CPUCAPACITY flag set
   2. Dynamic Voltage and Frequency Scaling (DVFS) enabled
   3. power estimates available for the OPPs of all possible CPUs

Moreover, the scheduler is notified of the energy model availability
using a static key in order to minimize the overhead on non-energy-aware
systems.
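
For illustration, a scheduler code path could then be gated on that key
roughly as follows (hypothetical call site; the actual users of the key
appear in later patches of this series):

if (static_branch_unlikely(&sched_energy_present)) {
	/* Consult the energy model when placing tasks. */
}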

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>

---
This patch depends on additional infrastructure being merged in the OPP
core. As this infrastructure can also be useful for other clients, the
related patches have been posted separately [1].

[1] https://marc.info/?l=linux-pm&m=151635516419249&w=2
---
 include/linux/sched/energy.h |  31 +++++++
 kernel/sched/Makefile        |   2 +-
 kernel/sched/energy.c        | 190 +++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 222 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/sched/energy.h
 create mode 100644 kernel/sched/energy.c

Comments

Greg Kroah-Hartman March 20, 2018, 9:52 a.m. UTC | #1
On Tue, Mar 20, 2018 at 09:43:08AM +0000, Dietmar Eggemann wrote:
> From: Quentin Perret <quentin.perret@arm.com>

[...]

> diff --git a/include/linux/sched/energy.h b/include/linux/sched/energy.h
> new file mode 100644
> index 000000000000..b4f43564ffe4
> --- /dev/null
> +++ b/include/linux/sched/energy.h
> @@ -0,0 +1,31 @@
> +#ifndef _LINUX_SCHED_ENERGY_H
> +#define _LINUX_SCHED_ENERGY_H

No copyright or license info?  Not good :(

> --- /dev/null
> +++ b/kernel/sched/energy.c
> @@ -0,0 +1,190 @@
> +/*
> + * Released under the GPLv2 only.
> + * SPDX-License-Identifier: GPL-2.0

Please read the documentation for the SPDX lines on how to do them
correctly.  Newer versions of checkpatch.pl will catch this, but that is
in linux-next for the moment.

And once you have the SPDX line, the "Released under..." line is not
needed.


> + *
> + * Energy-aware scheduling models
> + *
> + * Copyright (C) 2018, Arm Ltd.
> + * Written by: Quentin Perret, Arm Ltd.
> + *
> + * This file is subject to the terms and conditions of the GNU General Public
> + * License.  See the file "COPYING" in the main directory of this archive
> + * for more details.

This paragraph is not needed at all.

> + */
> +
> +#define pr_fmt(fmt) "sched-energy: " fmt
> +
> +#include <linux/sched/topology.h>
> +#include <linux/sched/energy.h>
> +#include <linux/pm_opp.h>
> +
> +#include "sched.h"
> +
> +DEFINE_STATIC_KEY_FALSE(sched_energy_present);
> +struct sched_energy_model ** __percpu energy_model;
> +
> +/*
> + * A copy of the cpumasks representing the frequency domains is kept private
> + * to the scheduler. They are stacked in a dynamically allocated linked list
> + * as we don't know how many frequency domains the system has.
> + */
> +LIST_HEAD(freq_domains);

global variable?  If so, please prefix it with something more unique
than "freq_".

> +#ifdef CONFIG_PM_OPP

#ifdefs go in .h files, not .c files, right?

thanks,

greg k-h
Quentin Perret March 21, 2018, 12:45 a.m. UTC | #2
On Tuesday 20 Mar 2018 at 10:52:15 (+0100), Greg Kroah-Hartman wrote:
> On Tue, Mar 20, 2018 at 09:43:08AM +0000, Dietmar Eggemann wrote:
> > From: Quentin Perret <quentin.perret@arm.com>

[...]

> > diff --git a/include/linux/sched/energy.h b/include/linux/sched/energy.h
> > new file mode 100644
> > index 000000000000..b4f43564ffe4
> > --- /dev/null
> > +++ b/include/linux/sched/energy.h
> > @@ -0,0 +1,31 @@
> > +#ifndef _LINUX_SCHED_ENERGY_H
> > +#define _LINUX_SCHED_ENERGY_H
> 
> No copyright or license info?  Not good :(
> 
> > --- /dev/null
> > +++ b/kernel/sched/energy.c
> > @@ -0,0 +1,190 @@
> > +/*
> > + * Released under the GPLv2 only.
> > + * SPDX-License-Identifier: GPL-2.0
> 
> Please read the documentation for the SPDX lines on how to do them
> correctly.  Newer versions of checkpatch.pl will catch this, but that is
> in linux-next for the moment.
> 
> And once you have the SPDX line, the "Released under..." line is not
> needed.
> 
> 
> > + *
> > + * Energy-aware scheduling models
> > + *
> > + * Copyright (C) 2018, Arm Ltd.
> > + * Written by: Quentin Perret, Arm Ltd.
> > + *
> > + * This file is subject to the terms and conditions of the GNU General Public
> > + * License.  See the file "COPYING" in the main directory of this archive
> > + * for more details.
> 
> This paragraph is not needed at all.

Right, I will fix all the licence issues and add one to the new header
file. I based these on existing files a while ago when I first wrote the
patches and forgot to update them later on. Sorry about that.

> 
> > + */
> > +
> > +#define pr_fmt(fmt) "sched-energy: " fmt
> > +
> > +#include <linux/sched/topology.h>
> > +#include <linux/sched/energy.h>
> > +#include <linux/pm_opp.h>
> > +
> > +#include "sched.h"
> > +
> > +DEFINE_STATIC_KEY_FALSE(sched_energy_present);
> > +struct sched_energy_model ** __percpu energy_model;
> > +
> > +/*
> > + * A copy of the cpumasks representing the frequency domains is kept private
> > + * to the scheduler. They are stacked in a dynamically allocated linked list
> > + * as we don't know how many frequency domains the system has.
> > + */
> > +LIST_HEAD(freq_domains);
> 
> global variable?  If so, please prefix it with something more unique
> than "freq_".

Will do.

> 
> > +#ifdef CONFIG_PM_OPP
> 
> #ifdefs go in .h files, not .c files, right?

Yes, good point. Actually, I might be able to tweak only kernel/sched/Makefile
to ensure we have CONFIG_PM_OPP. I will look into it.

> 
> thanks,
> 
> greg k-h

Thanks,
Quentin
Quentin Perret March 25, 2018, 1:48 p.m. UTC | #3
On Tuesday 20 Mar 2018 at 10:52:15 (+0100), Greg Kroah-Hartman wrote:
> On Tue, Mar 20, 2018 at 09:43:08AM +0000, Dietmar Eggemann wrote:
> > From: Quentin Perret <quentin.perret@arm.com>

[...]

> > +#ifdef CONFIG_PM_OPP
> 
> #ifdefs go in .h files, not .c files, right?
> 

So, after looking into this, my suggestion would be to: 1) remove the
#ifdef CONFIG_PM_OPP from energy.c entirely; 2) make sure
init_sched_energy() is stubbed properly for !CONFIG_SMP and
!CONFIG_PM_OPP in include/linux/sched/energy.h (see the sketch below);
3) relocate the global variables (energy_model, freq_domains, ...) to
fair.c; and 4) modify
kernel/sched/Makefile with something like:

ifeq ($(CONFIG_PM_OPP),y)
obj-$(CONFIG_SMP) += energy.o
endif

That way, energy.c is not compiled if not needed by the arch, and the
#ifdefs are kept within header files and Makefiles.
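
For point 2), the stubbing in include/linux/sched/energy.h might look
something like this (rough sketch, untested):

#if defined(CONFIG_SMP) && defined(CONFIG_PM_OPP)
void init_sched_energy(void);
#else
static inline void init_sched_energy(void) { }
#endif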

Would that work?

Thanks,
Quentin
Peter Zijlstra April 9, 2018, 12:01 p.m. UTC | #4
On Tue, Mar 20, 2018 at 09:43:08AM +0000, Dietmar Eggemann wrote:
> From: Quentin Perret <quentin.perret@arm.com>
> 
> The energy consumption of each CPU in the system is modeled with a list
> of values representing its dissipated power and compute capacity at each
> available Operating Performance Point (OPP). These values are derived
> from existing information in the kernel (currently used by the thermal
> subsystem) and don't require the introduction of new platform-specific
> tunables. The energy model is also provided with a simple representation
> of all frequency domains as cpumasks, hence enabling the scheduler to be
> aware of dependencies between CPUs. The data required to build the energy
> model is provided by the OPP library which enables an abstract view of
> the platform from the scheduler. The new data structures holding these
> models and the routines to populate them are stored in
> kernel/sched/energy.c.
> 
> For the sake of simplicity, it is assumed in the energy model that all
> CPUs in a frequency domain share the same micro-architecture. As long as
> this assumption is correct, the energy models of different CPUs belonging
> to the same frequency domain are equal. Hence, this commit builds only one
> energy model per frequency domain, and links all relevant CPUs to it in
> order to save time and memory. If needed for future hardware platforms,
> relaxing this assumption should imply relatively simple modifications in
> the code but a significantly higher algorithmic complexity.

What this doesn't mention is why this isn't part of the regular topology
bits. IIRC this is because the frequency domains don't necessarily need
to align with the existing topology, but this completely fails to state
any of that.

Also, since I'm not at all familiar with DT and the OPP library stuff,
this code is completely unreadable to me and there isn't a nice comment
to help me along.
Quentin Perret April 9, 2018, 1:45 p.m. UTC | #5
On Monday 09 Apr 2018 at 14:01:11 (+0200), Peter Zijlstra wrote:
> On Tue, Mar 20, 2018 at 09:43:08AM +0000, Dietmar Eggemann wrote:
> > From: Quentin Perret <quentin.perret@arm.com>

[...]

> What this doesn't mention is why this isn't part of the regular topology
> bits. IIRC this is because the frequency domains don't necessarily need
> to align with the existing topology, but this completely fails to state
> any of that.

Yes, that's the main reason. Frequency domains and scheduling domains
don't necessarily align. They used to on big.LITTLE platforms, but that's
no longer the case with DynamIQ ...

> 
> Also, since I'm not at all familiar with DT and the OPP library stuff,
> this code is completely unreadable to me and there isn't a nice comment
> to help me along.

Right, I can definitely fix that. Comments in the code and a better
commit message should hopefully help. It has also already been suggested
that a documentation file be added alongside the code for this patchset,
so I'll make sure we add that for the next version. In the meantime,
here is a (hopefully) better explanation.

In this specific patch, we are basically trying to figure out the
boundaries of frequency domains, and the power consumed by each CPU
at each OPP, to make them available to the scheduler. The important
thing here is that, in both cases, we rely on the OPP library to
keep the code as platform-agnostic as possible.

In the case of the frequency domains for example, the cpufreq driver is
in charge of specifying the CPUs that are sharing frequencies. That
information can come from DT, or SCPI, or SCMI, or whatever -- we
probably shouldn't have to care about that from the scheduler's
standpoint. That's why using dev_pm_opp_get_sharing_cpus() is handy,
the OPP library gives us the digested information we need.
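
Concretely, the lookup boils down to the following (abridged from the
patch below, error handling omitted):

cpu_dev = get_cpu_device(cpu);
/* Ask the OPP library which CPUs share their OPP table with 'cpu'. */
ret = dev_pm_opp_get_sharing_cpus(cpu_dev, &fdom->span);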

The power values (dev_pm_opp_get_power) we use right now are those
already used by the thermal subsystem (IPA), which means we don't have
to introduce any new DT binding whatsoever. In the near future, the power
values could also come from other sources (SCMI, for example), and again it's
probably not the scheduler's job to care about those things, so the OPP
library is helping us again. As mentioned in the notes, as of today, this
approach has dependencies on other patches relating to these things which
are already on the list [1].

The rest of the code in this patch is just about iterating over the
CPUs/freq. domains/OPPs. The algorithm is more or less the following:

 1. find a frequency domain which hasn't been visited yet;
 2. estimate the power and capacity of a CPU in this freq domain at each
    possible OPP;
 3. map all CPUs in the freq domain to this list of <capacity, power> tuples;
 4. go to 1.
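
In (heavily simplified) code form, with error handling and the
freq-domain bookkeeping omitted, init_sched_energy() in the patch below
does roughly:

for_each_possible_cpu(cpu) {
	if (*per_cpu_ptr(energy_model, cpu))
		continue;	/* Frequency domain already visited. */

	/* fdom->span: the CPUs sharing a frequency domain with 'cpu'. */
	em = build_energy_model(cpumask_first(&fdom->span));

	for_each_cpu(i, &fdom->span)	/* Link all CPUs of the domain. */
		*per_cpu_ptr(energy_model, i) = em;
}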

I hope that makes sense.

Thanks,
Quentin

[1] https://marc.info/?l=linux-pm&m=151635516419249&w=2
Peter Zijlstra April 9, 2018, 3:32 p.m. UTC | #6
On Mon, Apr 09, 2018 at 02:45:11PM +0100, Quentin Perret wrote:

> In this specific patch, we are basically trying to figure out the
> boundaries of frequency domains, and the power consumed by each CPU
> at each OPP, to make them available to the scheduler. The important
> thing here is that, in both cases, we rely on the OPP library to
> keep the code as platform-agnostic as possible.

AFAICT the only users of this PM_OPP stuff are a bunch of ARM platforms.
Granted, nobody else has built a big.LITTLE style system, so that might
all be fine I suppose.

It won't be until some !ARM chip comes along that we'll know how
generically usable any of this really is.

> In the case of the frequency domains for example, the cpufreq driver is
> in charge of specifying the CPUs that are sharing frequencies. That
> information can come from DT, or SCPI, or SCMI, or whatever -- we
> probably shouldn't have to care about that from the scheduler's
> standpoint. That's why using dev_pm_opp_get_sharing_cpus() is handy,
> the OPP library gives us the digested information we need.

So I kinda would've expected to just ask cpufreq, that after all already
knows these things. Why did we need to invent this pm_opp thing?

Cpufreq has tons of supported architectures, pm_opp not so much.

> The power values (dev_pm_opp_get_power) we use right now are those
> already used by the thermal subsystem (IPA), which means we don't have

I love an IPA style beer, but I'm thinking that's not the same IPA,
right :-)

> to introduce any new DT binding whatsoever. In the near future, the power
> values could also come from other sources (SCMI, for example), and again it's
> probably not the scheduler's job to care about those things, so the OPP
> library is helping us again. As mentioned in the notes, as of today, this
> approach has dependencies on other patches relating to these things which
> are already on the list [1].

Is there any !ARM thermal driver? (clearly I'm not up-to-date on things
thermal).
Quentin Perret April 9, 2018, 4:42 p.m. UTC | #7
On Monday 09 Apr 2018 at 17:32:33 (+0200), Peter Zijlstra wrote:
> On Mon, Apr 09, 2018 at 02:45:11PM +0100, Quentin Perret wrote:
> 
> > In this specific patch, we are basically trying to figure out the
> > boundaries of frequency domains, and the power consumed by each CPU
> > at each OPP, to make them available to the scheduler. The important
> > thing here is that, in both cases, we rely on the OPP library to
> > keep the code as platform-agnostic as possible.
> 
> AFAICT the only users of this PM_OPP stuff are a bunch of ARM platforms.

That's correct.

> Granted, nobody else has built a big.LITTLE style system, so that might
> all be fine I suppose.
> 
> It won't be until some !ARM chip comes along that we'll know how
> generically usable any of this really is.
> 

Right. There is already a lot of diversity in the Arm ecosystem that has
to be managed. That's what I meant by platform-agnostic. Now, I agree
that it should be discussed whether or not this is enough for other
archs ...

It might be reasonable to expect from the archs who want to use EAS that
they expose their OPPs in the OPP lib. That should be harmless, and EAS
needs to know about the OPPs, so they should be made visible, ideally
somewhere generic. Otherwise, that means the interface with the
EAS has to be defined only by the energy model data structures, and the
actual energy model loading procedure becomes free-form arch code.

I quite like the first idea from a pure design standpoint, but I could
also understand if maintainers of other archs were reluctant to
have new dependencies on PM_OPP ...

> > In the case of the frequency domains for example, the cpufreq driver is
> > in charge of specifying the CPUs that are sharing frequencies. That
> > information can come from DT, or SCPI, or SCMI, or whatever -- we
> > probably shouldn't have to care about that from the scheduler's
> > standpoint. That's why using dev_pm_opp_get_sharing_cpus() is handy,
> > the OPP library gives us the digested information we need.
> 
> So I kinda would've expected to just ask cpufreq, that after all already
> knows these things. Why did we need to invent this pm_opp thing?

Yes, we can definitely rely on cpufreq for this one. There is a "strong"
dependency on PM_OPP to get power values, so I decided to use PM_OPP for
the frequency domains as well, for consistency. But I can change that if
needed.
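
For reference, a cpufreq-based variant would presumably look roughly
like this (sketch, untested), using policy->related_cpus instead of
dev_pm_opp_get_sharing_cpus():

struct cpufreq_policy *policy = cpufreq_cpu_get(cpu);

if (policy) {
	cpumask_copy(&fdom->span, policy->related_cpus);
	cpufreq_cpu_put(policy);
}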

> 
> Cpufreq has tons of supported architectures, pm_opp not so much.
> 
> > The power values (dev_pm_opp_get_power) we use right now are those
> > already used by the thermal subsystem (IPA), which means we don't have
> 
> I love an IPA style beer, but I'm thinking that's not the same IPA,
> right :-)

Well, both can help to chill down in a way ... :-)

The IPA I'm talking about means Intelligent Power Allocator. It's a
thermal governor that uses a power model of the platform to allocate
power budgets to CPUs & GPUs using a control loop. The code is in
drivers/thermal/power_allocator.c if this is of interest.

> 
> > to introduce any new DT binding whatsoever. In the near future, the power
> > values could also come from other sources (SCMI, for example), and again it's
> > probably not the scheduler's job to care about those things, so the OPP
> > library is helping us again. As mentioned in the notes, as of today, this
> > approach has dependencies on other patches relating to these things which
> > are already on the list [1].
> 
> Is there any !ARM thermal driver? (clearly I'm not up-to-date on things
> thermal).

I don't think so.

Thanks,
Quentin
Rafael J. Wysocki April 10, 2018, 6:55 a.m. UTC | #8
On Mon, Apr 9, 2018 at 6:42 PM, Quentin Perret <quentin.perret@arm.com> wrote:
> On Monday 09 Apr 2018 at 17:32:33 (+0200), Peter Zijlstra wrote:
>> On Mon, Apr 09, 2018 at 02:45:11PM +0100, Quentin Perret wrote:
>>
>> > In this specific patch, we are basically trying to figure out the
>> > boundaries of frequency domains, and the power consumed by each CPU
>> > at each OPP, to make them available to the scheduler. The important
>> > thing here is that, in both cases, we rely on the OPP library to
>> > keep the code as platform-agnostic as possible.
>>
>> AFAICT the only users of this PM_OPP stuff are a bunch of ARM platforms.
>
> That's correct.
>
>> Granted, nobody else has built a big.LITTLE style system, so that might
>> all be fine I suppose.
>>
>> It won't be until some !ARM chip comes along that we'll know how
>> generically usable any of this really is.
>>
>
> Right. There is already a lot of diversity in the Arm ecosystem that has
> to be managed. That's what I meant by platform-agnostic. Now, I agree
> that it should be discussed whether or not this is enough for other
> archs ...

Even for ARM64 w/ ACPI, mind you.

> It might be reasonable to expect from the archs who want to use EAS that
> they expose their OPPs in the OPP lib. That should be harmless, and EAS
> needs to know about the OPPs, so they should be made visible, ideally
> somewhere generic. Otherwise, that means the interface with the
> EAS has to be defined only by the energy model data structures, and the
> actual energy model loading procedure becomes free-form arch code.
>
> I quite like the first idea from a pure design standpoint, but I could
> also understand if maintainers of other archs were reluctant to
> have new dependencies on PM_OPP ...

Not just reluctant I would think.

Depending on PM_OPP directly here is like depending on ACPI directly.
Would you agree with the latter?

>> > In the case of the frequency domains for example, the cpufreq driver is
>> > in charge of specifying the CPUs that are sharing frequencies. That
>> > information can come from DT, or SCPI, or SCMI, or whatever -- we
>> > probably shouldn't have to care about that from the scheduler's
>> > standpoint. That's why using dev_pm_opp_get_sharing_cpus() is handy,
>> > the OPP library gives us the digested information we need.
>>
>> So I kinda would've expected to just ask cpufreq, that after all already
>> knows these things. Why did we need to invent this pm_opp thing?
>
> Yes, we can definitely rely on cpufreq for this one. There is a "strong"
> dependency on PM_OPP to get power values, so I decided to use PM_OPP for
> the frequency domains as well, for consistency. But I can change that if
> needed.

Yes, please.

>>
>> Cpufreq has tons of supported architectures, pm_opp not so much.
>>
>> > The power values (dev_pm_opp_get_power) we use right now are those
>> > already used by the thermal subsystem (IPA), which means we don't have
>>
>> I love an IPA style beer, but I'm thinking that's not the same IPA,
>> right :-)
>
> Well, both can help to chill down in a way ... :-)
>
> The IPA I'm talking about means Intelligent Power Allocator. It's a
> thermal governor that uses a power model of the platform to allocate
> power budgets to CPUs & GPUs using a control loop. The code is in
> drivers/thermal/power_allocator.c if this is of interest.
>
>>
>> > to introduce any new DT binding whatsoever. In the near future, the power
>> > values could also come from other sources (SCMI, for example), and again it's
>> > probably not the scheduler's job to care about those things, so the OPP
>> > library is helping us again. As mentioned in the notes, as of today, this
>> > approach has dependencies on other patches relating to these things which
>> > are already on the list [1].
>>
>> Is there any !ARM thermal driver? (clearly I'm not up-to-date on things
>> thermal).
>
> I don't think so.

No, there isn't, AFAICS.

Thanks!
Quentin Perret April 10, 2018, 9:31 a.m. UTC | #9
On Tuesday 10 Apr 2018 at 08:55:14 (+0200), Rafael J. Wysocki wrote:
> On Mon, Apr 9, 2018 at 6:42 PM, Quentin Perret <quentin.perret@arm.com> wrote:
> > On Monday 09 Apr 2018 at 17:32:33 (+0200), Peter Zijlstra wrote:
> >> On Mon, Apr 09, 2018 at 02:45:11PM +0100, Quentin Perret wrote:

[...]

> > I quite like the first idea from a pure design standpoint, but I could
> > also understand if maintainers of other archs were reluctant to
> > have new dependencies on PM_OPP ...
> 
> Not just reluctant I would think.
> 
> Depending on PM_OPP directly here is like depending on ACPI directly.
> Would you agree with the latter?

Right, I see your point. I was suggesting to use PM_OPP only to make the
OPPs *visible*, nothing else. That doesn't mean all archs would have
to use dev_pm_opp_set_rate() or anything, they could just keep on doing
DVFS their own way. PM_OPP would just be a common way to make OPPs
visible outside of their subsystem, which should be harmless. The point
is to keep the energy model loading code common to all archs.

Another solution would be to let the archs populate the energy model
data-structures themselves, and turn the current energy.c file into
arm/arm64-specific code for ex.

Overall, I guess the question is whether or not PM_OPP is the right
interface for EAS of multiple archs ... That sounds like an interesting
discussion topic for OSPM next week, so thanks a lot for raising this
point!

Regards,
Quentin
Rafael J. Wysocki April 10, 2018, 10:20 a.m. UTC | #10
On Tue, Apr 10, 2018 at 11:31 AM, Quentin Perret <quentin.perret@arm.com> wrote:
> On Tuesday 10 Apr 2018 at 08:55:14 (+0200), Rafael J. Wysocki wrote:
>> On Mon, Apr 9, 2018 at 6:42 PM, Quentin Perret <quentin.perret@arm.com> wrote:
>> > On Monday 09 Apr 2018 at 17:32:33 (+0200), Peter Zijlstra wrote:
>> >> On Mon, Apr 09, 2018 at 02:45:11PM +0100, Quentin Perret wrote:
>
> [...]
>
>> > I quite like the first idea from a pure design standpoint, but I could
>> > also understand if maintainers of other archs were reluctant to
>> > have new dependencies on PM_OPP ...
>>
>> Not just reluctant I would think.
>>
>> Depending on PM_OPP directly here is like depending on ACPI directly.
>> Would you agree with the latter?
>
> Right, I see your point. I was suggesting to use PM_OPP only to make the
> OPPs *visible*, nothing else. That doesn't mean all archs would have
> to use dev_pm_opp_set_rate() or anything, they could just keep on doing
> DVFS their own way. PM_OPP would just be a common way to make OPPs
> visible outside of their subsystem, which should be harmless. The point
> is to keep the energy model loading code common to all archs.
>
> Another solution would be to let the archs populate the energy model
> data-structures themselves, and turn the current energy.c file into
> arm/arm64-specific code for ex.
>
> Overall, I guess the question is whether or not PM_OPP is the right
> interface for EAS of multiple archs ... That sounds like an interesting
> discussion topic for OSPM next week,

I agree.

> so thanks a lot for raising this point!

And moreover, we already have cpufreq and cpuidle that use their own
representations of the same information, generally coming from lower
layers.  They do that because they need to work with different
platforms that generally represent the low-level information
differently.  I don't see why that principle doesn't apply to EAS.

Maybe there should be a common data structure to be used by them all,
but I'm quite confident that PM_OPP is not suitable for this purpose
in general.

Patch

diff --git a/include/linux/sched/energy.h b/include/linux/sched/energy.h
new file mode 100644
index 000000000000..b4f43564ffe4
--- /dev/null
+++ b/include/linux/sched/energy.h
@@ -0,0 +1,31 @@ 
+#ifndef _LINUX_SCHED_ENERGY_H
+#define _LINUX_SCHED_ENERGY_H
+
+#ifdef CONFIG_SMP
+struct capacity_state {
+	unsigned long cap;	/* compute capacity */
+	unsigned long power;	/* power consumption at this compute capacity */
+};
+
+struct sched_energy_model {
+	int nr_cap_states;
+	struct capacity_state *cap_states;
+};
+
+struct freq_domain {
+	struct list_head next;
+	cpumask_t span;
+};
+
+extern struct sched_energy_model ** __percpu energy_model;
+extern struct static_key_false sched_energy_present;
+extern struct list_head freq_domains;
+#define for_each_freq_domain(fdom) \
+			list_for_each_entry(fdom, &freq_domains, next)
+
+void init_sched_energy(void);
+#else
+static inline void init_sched_energy(void) { }
+#endif
+
+#endif
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index d9a02b318108..912972ad4dbc 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -20,7 +20,7 @@  obj-y += core.o loadavg.o clock.o cputime.o
 obj-y += idle.o fair.o rt.o deadline.o
 obj-y += wait.o wait_bit.o swait.o completion.o
 
-obj-$(CONFIG_SMP) += cpupri.o cpudeadline.o topology.o stop_task.o
+obj-$(CONFIG_SMP) += cpupri.o cpudeadline.o topology.o stop_task.o energy.o
 obj-$(CONFIG_SCHED_AUTOGROUP) += autogroup.o
 obj-$(CONFIG_SCHEDSTATS) += stats.o
 obj-$(CONFIG_SCHED_DEBUG) += debug.o
diff --git a/kernel/sched/energy.c b/kernel/sched/energy.c
new file mode 100644
index 000000000000..4662c993e096
--- /dev/null
+++ b/kernel/sched/energy.c
@@ -0,0 +1,190 @@ 
+/*
+ * Released under the GPLv2 only.
+ * SPDX-License-Identifier: GPL-2.0
+ *
+ * Energy-aware scheduling models
+ *
+ * Copyright (C) 2018, Arm Ltd.
+ * Written by: Quentin Perret, Arm Ltd.
+ *
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License.  See the file "COPYING" in the main directory of this archive
+ * for more details.
+ */
+
+#define pr_fmt(fmt) "sched-energy: " fmt
+
+#include <linux/sched/topology.h>
+#include <linux/sched/energy.h>
+#include <linux/pm_opp.h>
+
+#include "sched.h"
+
+DEFINE_STATIC_KEY_FALSE(sched_energy_present);
+struct sched_energy_model ** __percpu energy_model;
+
+/*
+ * A copy of the cpumasks representing the frequency domains is kept private
+ * to the scheduler. They are stacked in a dynamically allocated linked list
+ * as we don't know how many frequency domains the system has.
+ */
+LIST_HEAD(freq_domains);
+
+#ifdef CONFIG_PM_OPP
+static struct sched_energy_model *build_energy_model(int cpu)
+{
+	unsigned long cap_scale = arch_scale_cpu_capacity(NULL, cpu);
+	unsigned long cap, freq, power, max_freq = ULONG_MAX;
+	unsigned long opp_eff, prev_opp_eff = ULONG_MAX;
+	struct sched_energy_model *em = NULL;
+	struct device *cpu_dev;
+	struct dev_pm_opp *opp;
+	int opp_cnt, i;
+
+	cpu_dev = get_cpu_device(cpu);
+	if (!cpu_dev) {
+		pr_err("CPU%d: Failed to get device\n", cpu);
+		return NULL;
+	}
+
+	opp_cnt = dev_pm_opp_get_opp_count(cpu_dev);
+	if (opp_cnt <= 0) {
+		pr_err("CPU%d: Failed to get # of available OPPs.\n", cpu);
+		return NULL;
+	}
+
+	opp = dev_pm_opp_find_freq_floor(cpu_dev, &max_freq);
+	if (IS_ERR(opp)) {
+		pr_err("CPU%d: Failed to get max frequency.\n", cpu);
+		return NULL;
+	}
+
+	dev_pm_opp_put(opp);
+	if (!max_freq) {
+		pr_err("CPU%d: Found null max frequency.\n", cpu);
+		return NULL;
+	}
+
+	em = kzalloc(sizeof(*em), GFP_KERNEL);
+	if (!em)
+		return NULL;
+
+	em->cap_states = kcalloc(opp_cnt, sizeof(*em->cap_states), GFP_KERNEL);
+	if (!em->cap_states)
+		goto free_em;
+
+	for (i = 0, freq = 0; i < opp_cnt; i++, freq++) {
+		opp = dev_pm_opp_find_freq_ceil(cpu_dev, &freq);
+		if (IS_ERR(opp)) {
+			pr_err("CPU%d: Failed to get OPP %d.\n", cpu, i+1);
+			goto free_cs;
+		}
+
+		power = dev_pm_opp_get_power(opp);
+		dev_pm_opp_put(opp);
+		if (!power || !freq)
+			goto free_cs;
+
+		cap = freq * cap_scale / max_freq;
+		em->cap_states[i].power = power;
+		em->cap_states[i].cap = cap;
+
+		/*
+		 * The capacity/watts efficiency ratio should decrease as the
+		 * frequency grows on sane platforms. If not, warn the user
+		 * that some high OPPs are more power efficient than some
+		 * of the lower ones.
+		 */
+		opp_eff = (cap << 20) / power;
+		if (opp_eff >= prev_opp_eff)
+			pr_warn("CPU%d: cap/pwr: OPP%d > OPP%d\n", cpu, i, i-1);
+		prev_opp_eff = opp_eff;
+	}
+
+	em->nr_cap_states = opp_cnt;
+	return em;
+
+free_cs:
+	kfree(em->cap_states);
+free_em:
+	kfree(em);
+	return NULL;
+}
+
+static void free_energy_model(void)
+{
+	struct sched_energy_model *em;
+	struct freq_domain *tmp, *pos;
+	int cpu;
+
+	list_for_each_entry_safe(pos, tmp, &freq_domains, next) {
+		cpu = cpumask_first(&(pos->span));
+		em = *per_cpu_ptr(energy_model, cpu);
+		if (em) {
+			kfree(em->cap_states);
+			kfree(em);
+		}
+
+		list_del(&(pos->next));
+		kfree(pos);
+	}
+
+	free_percpu(energy_model);
+}
+
+void init_sched_energy(void)
+{
+	struct freq_domain *fdom;
+	struct sched_energy_model *em;
+	struct device *cpu_dev;
+	int cpu, ret, fdom_cpu;
+
+	/* Energy Aware Scheduling is used for asymmetric systems only. */
+	if (!lowest_flag_domain(smp_processor_id(), SD_ASYM_CPUCAPACITY))
+		return;
+
+	energy_model = alloc_percpu(struct sched_energy_model *);
+	if (!energy_model)
+		goto exit_fail;
+
+	for_each_possible_cpu(cpu) {
+		if (*per_cpu_ptr(energy_model, cpu))
+			continue;
+
+		/* Keep a copy of the sharing_cpus mask */
+		fdom = kzalloc(sizeof(struct freq_domain), GFP_KERNEL);
+		if (!fdom)
+			goto free_em;
+
+		cpu_dev = get_cpu_device(cpu);
+		ret = dev_pm_opp_get_sharing_cpus(cpu_dev, &(fdom->span));
+		if (ret)
+			goto free_em;
+		list_add(&(fdom->next), &freq_domains);
+
+		/*
+		 * Build the energy model of one CPU, and link it to all CPUs
+		 * in its frequency domain. This should be correct as long as
+		 * they share the same micro-architecture.
+		 */
+		fdom_cpu = cpumask_first(&(fdom->span));
+		em = build_energy_model(fdom_cpu);
+		if (!em)
+			goto free_em;
+
+		for_each_cpu(fdom_cpu, &(fdom->span))
+			*per_cpu_ptr(energy_model, fdom_cpu) = em;
+	}
+
+	static_branch_enable(&sched_energy_present);
+
+	pr_info("Energy Aware Scheduling started.\n");
+	return;
+free_em:
+	free_energy_model();
+exit_fail:
+	pr_err("Energy Aware Scheduling initialization failed.\n");
+}
+#else
+void init_sched_energy(void) {}
+#endif