diff mbox

[8/8] sched: prefer cpufreq_scale_freq_capacity

Message ID 1457932932-28444-9-git-send-email-mturquette+renesas@baylibre.com (mailing list archive)
State RFC, archived
Headers show

Commit Message

Michael Turquette March 14, 2016, 5:22 a.m. UTC
arch_scale_freq_capacity is weird. It specifies an arch hook for an
implementation that could easily vary within an architecture or even a
chip family.

This patch helps to mitigate this weirdness by defaulting to the
cpufreq-provided implementation, which should work for all cases where
CONFIG_CPU_FREQ is set.

If CONFIG_CPU_FREQ is not set, then try to use an implementation
provided by the architecture. Failing that, fall back to
SCHED_CAPACITY_SCALE.

It may be desirable for cpufreq drivers to specify their own
implementation of arch_scale_freq_capacity in the future. The same is
true for platform code within an architecture. In both cases an
efficient implementation selector will need to be created and this patch
adds a comment to that effect.

Signed-off-by: Michael Turquette <mturquette+renesas@baylibre.com>
---
 kernel/sched/sched.h | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

Comments

Dietmar Eggemann March 15, 2016, 7:13 p.m. UTC | #1
On 14/03/16 05:22, Michael Turquette wrote:
> arch_scale_freq_capacity is weird. It specifies an arch hook for an
> implementation that could easily vary within an architecture or even a
> chip family.
> 
> This patch helps to mitigate this weirdness by defaulting to the
> cpufreq-provided implementation, which should work for all cases where
> CONFIG_CPU_FREQ is set.
> 
> If CONFIG_CPU_FREQ is not set, then try to use an implementation
> provided by the architecture. Failing that, fall back to
> SCHED_CAPACITY_SCALE.
> 
> It may be desirable for cpufreq drivers to specify their own
> implementation of arch_scale_freq_capacity in the future. The same is
> true for platform code within an architecture. In both cases an
> efficient implementation selector will need to be created and this patch
> adds a comment to that effect.

For me this independence of the scheduler code towards the actual
implementation of the Frequency Invariant Engine (FEI) was actually a
feature.

In EAS RFC5.2 (linux-arm.org/linux-power.git energy_model_rfc_v5.2 ,
which hasn't been posted to LKML) we establish the link in the ARCH code
(arch/arm64/include/asm/topology.h).

#ifdef CONFIG_CPU_FREQ
#define arch_scale_freq_capacity cpufreq_scale_freq_capacity
...
+#endif

> 
> Signed-off-by: Michael Turquette <mturquette+renesas@baylibre.com>
> ---
>  kernel/sched/sched.h | 16 +++++++++++++++-
>  1 file changed, 15 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 469d11d..37502ea 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1368,7 +1368,21 @@ static inline int hrtick_enabled(struct rq *rq)
>  #ifdef CONFIG_SMP
>  extern void sched_avg_update(struct rq *rq);
>  
> -#ifndef arch_scale_freq_capacity
> +/*
> + * arch_scale_freq_capacity can be implemented by cpufreq, platform code or
> + * arch code. We select the cpufreq-provided implementation first. If it
> + * doesn't exist then we default to any other implementation provided from
> + * platform/arch code. If those do not exist then we use the default
> + * SCHED_CAPACITY_SCALE value below.
> + *
> + * Note that if cpufreq drivers or platform/arch code have competing
> + * implementations it is up to those subsystems to select one at runtime with
> + * an efficient solution, as we cannot tolerate the overhead of indirect
> + * functions (e.g. function pointers) in the scheduler fast path
> + */
> +#ifdef CONFIG_CPU_FREQ
> +#define arch_scale_freq_capacity cpufreq_scale_freq_capacity
> +#elif !defined(arch_scale_freq_capacity)
>  static __always_inline
>  unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
>  {
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michael Turquette March 15, 2016, 8:46 p.m. UTC | #2
Quoting Dietmar Eggemann (2016-03-15 12:13:58)
> On 14/03/16 05:22, Michael Turquette wrote:
> > arch_scale_freq_capacity is weird. It specifies an arch hook for an
> > implementation that could easily vary within an architecture or even a
> > chip family.
> > 
> > This patch helps to mitigate this weirdness by defaulting to the
> > cpufreq-provided implementation, which should work for all cases where
> > CONFIG_CPU_FREQ is set.
> > 
> > If CONFIG_CPU_FREQ is not set, then try to use an implementation
> > provided by the architecture. Failing that, fall back to
> > SCHED_CAPACITY_SCALE.
> > 
> > It may be desirable for cpufreq drivers to specify their own
> > implementation of arch_scale_freq_capacity in the future. The same is
> > true for platform code within an architecture. In both cases an
> > efficient implementation selector will need to be created and this patch
> > adds a comment to that effect.
> 
> For me this independence of the scheduler code towards the actual
> implementation of the Frequency Invariant Engine (FEI) was actually a
> feature.

I do not agree that it is a strength; I think it is confusing. My
opinion is that cpufreq drivers should implement
arch_scale_freq_capacity. Having a sane fallback
(cpufreq_scale_freq_capacity) simply means that you can remove the
boilerplate from the arm32 and arm64 code, which is a win.

Furthermore, if we have multiple competing implementations of
arch_scale_freq_invariance, wouldn't it be better for all of them to
live in cpufreq drivers? This means we would only need to implement a
single run-time "selector".

On the other hand, if the implementation lives in arch code and we have
various implementations of arch_scale_freq_capacity within an
architecture, then each arch would need to implement this selector
function. Even worse then if we have a split where some implementations
live in drivers/cpufreq (e.g. intel_pstate) and others in arch/arm and
others in arch/arm64 ... now we have three selectors.

Note that this has nothing to do with cpu microarch invariance. I'm
happy for that to stay in arch code because we can have heterogeneous
cpus that do not scale frequency, and thus would not enable cpufreq.
But if your platform scales cpu frequency, then really cpufreq should be
in the loop.

> 
> In EAS RFC5.2 (linux-arm.org/linux-power.git energy_model_rfc_v5.2 ,
> which hasn't been posted to LKML) we establish the link in the ARCH code
> (arch/arm64/include/asm/topology.h).

Right, sorry again about preemptively posting the patch. Total brainfart
on my part.

> 
> #ifdef CONFIG_CPU_FREQ
> #define arch_scale_freq_capacity cpufreq_scale_freq_capacity
> ...
> +#endif

The above is no longer necessary with this patch. Same question as
above: why insist on the arch boilerplate?

Regards,
Mike

> 
> > 
> > Signed-off-by: Michael Turquette <mturquette+renesas@baylibre.com>
> > ---
> >  kernel/sched/sched.h | 16 +++++++++++++++-
> >  1 file changed, 15 insertions(+), 1 deletion(-)
> > 
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index 469d11d..37502ea 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -1368,7 +1368,21 @@ static inline int hrtick_enabled(struct rq *rq)
> >  #ifdef CONFIG_SMP
> >  extern void sched_avg_update(struct rq *rq);
> >  
> > -#ifndef arch_scale_freq_capacity
> > +/*
> > + * arch_scale_freq_capacity can be implemented by cpufreq, platform code or
> > + * arch code. We select the cpufreq-provided implementation first. If it
> > + * doesn't exist then we default to any other implementation provided from
> > + * platform/arch code. If those do not exist then we use the default
> > + * SCHED_CAPACITY_SCALE value below.
> > + *
> > + * Note that if cpufreq drivers or platform/arch code have competing
> > + * implementations it is up to those subsystems to select one at runtime with
> > + * an efficient solution, as we cannot tolerate the overhead of indirect
> > + * functions (e.g. function pointers) in the scheduler fast path
> > + */
> > +#ifdef CONFIG_CPU_FREQ
> > +#define arch_scale_freq_capacity cpufreq_scale_freq_capacity
> > +#elif !defined(arch_scale_freq_capacity)
> >  static __always_inline
> >  unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
> >  {
> > 
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Peter Zijlstra March 15, 2016, 9:37 p.m. UTC | #3
On Sun, Mar 13, 2016 at 10:22:12PM -0700, Michael Turquette wrote:
> +++ b/kernel/sched/sched.h
> @@ -1368,7 +1368,21 @@ static inline int hrtick_enabled(struct rq *rq)
>  #ifdef CONFIG_SMP
>  extern void sched_avg_update(struct rq *rq);
>  
> -#ifndef arch_scale_freq_capacity

> +#ifdef CONFIG_CPU_FREQ
> +#define arch_scale_freq_capacity cpufreq_scale_freq_capacity
> +#elif !defined(arch_scale_freq_capacity)
>  static __always_inline
>  unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
>  {

This could not allow x86 to use the APERF/MPERF thing, so no, can't be
right.

Maybe something like

#ifndef arch_scale_freq_capacity
#ifdef CONFIG_CPU_FREQ
#define arch_scale_freq_capacity	cpufreq_scale_freq_capacity
#else
static __always_inline
unsigned long arch_scale_freq_capacity(..)
{
	return SCHED_CAPACITY_SCALE;
}
#endif
#endif

Which will let the arch override and only falls back to cpufreq if
the arch doesn't do anything.


--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Peter Zijlstra March 16, 2016, 7:47 a.m. UTC | #4
On Tue, Mar 15, 2016 at 03:27:21PM -0700, Michael Turquette wrote:

> That solution scales for the case where architectures have different
> methods. It doesn't scale for the case where cpufreq drivers or platform
> code within the same arch have competing implementations.

Sure it does; no matter what interface we use on x86 to set the DVFS
hints (ACPI, intel_p_state, whatever), using APERF/MPERF is the only
actual way of telling WTH the actual frequency was.

> I'm happy with it as a stop-gap, because it will initially work for
> arm{64} and x86, but we'll still need run-time selection of
> arch_scale_freq_capacity some day. Once we have that, I think that we
> should favor a run-time provided implementation over the arch-provided
> one.

Also, I'm thinking we don't need any of this. Your
cpufreq_scale_freq_capacity() is completely and utterly pointless. Since
its implementation simply provides whatever frequency we selected its
identical to not using frequency invariant load metrics and having
cpufreq use the !inv formula.

See:

  lkml.kernel.org/r/20160309163930.GP6356@twins.programming.kicks-ass.net

Now, something else (power aware scheduling etc..) might need the freq
invariant stuff, but cpufreq (which we're concerned with here) does not
unless arch_scale_freq_capacity() does something else than simply return
the value we've set earlier.
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Peter Zijlstra March 16, 2016, 12:41 p.m. UTC | #5
On Wed, Mar 16, 2016 at 08:47:52AM +0100, Peter Zijlstra wrote:
> On Tue, Mar 15, 2016 at 03:27:21PM -0700, Michael Turquette wrote:

> > I'm happy with it as a stop-gap, because it will initially work for
> > arm{64} and x86, but we'll still need run-time selection of
> > arch_scale_freq_capacity some day. Once we have that, I think that we
> > should favor a run-time provided implementation over the arch-provided
> > one.
> 
> Also, I'm thinking we don't need any of this. Your
> cpufreq_scale_freq_capacity() is completely and utterly pointless. 

Scrap that, while true to instant utilization, this isn't true for our
case, since our utilization numbers carry history, and any frequency
change in that window is relevant.


--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dietmar Eggemann March 16, 2016, 7:44 p.m. UTC | #6
On 15/03/16 20:46, Michael Turquette wrote:
> Quoting Dietmar Eggemann (2016-03-15 12:13:58)
>> On 14/03/16 05:22, Michael Turquette wrote:

[...]

>> For me this independence of the scheduler code towards the actual
>> implementation of the Frequency Invariant Engine (FEI) was actually a
>> feature.
> 
> I do not agree that it is a strength; I think it is confusing. My
> opinion is that cpufreq drivers should implement
> arch_scale_freq_capacity. Having a sane fallback
> (cpufreq_scale_freq_capacity) simply means that you can remove the
> boilerplate from the arm32 and arm64 code, which is a win.
> 
> Furthermore, if we have multiple competing implementations of
> arch_scale_freq_invariance, wouldn't it be better for all of them to
> live in cpufreq drivers? This means we would only need to implement a
> single run-time "selector".
> 
> On the other hand, if the implementation lives in arch code and we have
> various implementations of arch_scale_freq_capacity within an
> architecture, then each arch would need to implement this selector
> function. Even worse then if we have a split where some implementations
> live in drivers/cpufreq (e.g. intel_pstate) and others in arch/arm and
> others in arch/arm64 ... now we have three selectors.

OK, now I see your point. What I don't understand is the fact why you
want different foo_scale_freq_capacity() implementations per cpufreq
drivers. IMHO we want to do the cpufreq.c based implementation to
abstract from that (at least for target_index() cpufreq drivers).

intel_pstate (setpolicy()) is an exception but my humble guess is that
systems with intel_pstate driver have X86_FEATURE_APERFMPERF support.

> Note that this has nothing to do with cpu microarch invariance. I'm
> happy for that to stay in arch code because we can have heterogeneous
> cpus that do not scale frequency, and thus would not enable cpufreq.
> But if your platform scales cpu frequency, then really cpufreq should be
> in the loop.

Agreed.

> 
>>
>> In EAS RFC5.2 (linux-arm.org/linux-power.git energy_model_rfc_v5.2 ,
>> which hasn't been posted to LKML) we establish the link in the ARCH code
>> (arch/arm64/include/asm/topology.h).
> 
> Right, sorry again about preemptively posting the patch. Total brainfart
> on my part.
> 
>>
>> #ifdef CONFIG_CPU_FREQ
>> #define arch_scale_freq_capacity cpufreq_scale_freq_capacity
>> ...
>> +#endif
> 
> The above is no longer necessary with this patch. Same question as
> above: why insist on the arch boilerplate?

OK.

[...]
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Peter Zijlstra March 16, 2016, 8:07 p.m. UTC | #7
On Wed, Mar 16, 2016 at 07:44:33PM +0000, Dietmar Eggemann wrote:
> intel_pstate (setpolicy()) is an exception but my humble guess is that
> systems with intel_pstate driver have X86_FEATURE_APERFMPERF support.

A quick browse of the Intel SDM says you're right. It looks like
everything after Pentium-M; so Core-Solo/Core-Duo and onwards have
APERF/MPERF.

And it looks like P6 class systems didn't have DVFS support at all,
which basically leaves P4 and Pentium-M as the only chips to have DVFS
support lacking APERF/MPERF.

And while I haven't got any P4 based space heaters left, I might still
have a Pentium-M class laptop somewhere (if it still boots).

--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rafael J. Wysocki March 16, 2016, 9:32 p.m. UTC | #8
On Wednesday, March 16, 2016 09:07:52 PM Peter Zijlstra wrote:
> On Wed, Mar 16, 2016 at 07:44:33PM +0000, Dietmar Eggemann wrote:
> > intel_pstate (setpolicy()) is an exception but my humble guess is that
> > systems with intel_pstate driver have X86_FEATURE_APERFMPERF support.
> 
> A quick browse of the Intel SDM says you're right. It looks like
> everything after Pentium-M; so Core-Solo/Core-Duo and onwards have
> APERF/MPERF.
> 
> And it looks like P6 class systems didn't have DVFS support at all,
> which basically leaves P4 and Pentium-M as the only chips to have DVFS
> support lacking APERF/MPERF.
> 
> And while I haven't got any P4 based space heaters left, I might still
> have a Pentium-M class laptop somewhere (if it still boots).

intel_pstate depends on APERF/MPERF, so it won't work with anything
without them.

--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 469d11d..37502ea 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1368,7 +1368,21 @@  static inline int hrtick_enabled(struct rq *rq)
 #ifdef CONFIG_SMP
 extern void sched_avg_update(struct rq *rq);
 
-#ifndef arch_scale_freq_capacity
+/*
+ * arch_scale_freq_capacity can be implemented by cpufreq, platform code or
+ * arch code. We select the cpufreq-provided implementation first. If it
+ * doesn't exist then we default to any other implementation provided from
+ * platform/arch code. If those do not exist then we use the default
+ * SCHED_CAPACITY_SCALE value below.
+ *
+ * Note that if cpufreq drivers or platform/arch code have competing
+ * implementations it is up to those subsystems to select one at runtime with
+ * an efficient solution, as we cannot tolerate the overhead of indirect
+ * functions (e.g. function pointers) in the scheduler fast path
+ */
+#ifdef CONFIG_CPU_FREQ
+#define arch_scale_freq_capacity cpufreq_scale_freq_capacity
+#elif !defined(arch_scale_freq_capacity)
 static __always_inline
 unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
 {