Message ID | 1457932932-28444-9-git-send-email-mturquette+renesas@baylibre.com (mailing list archive) |
---|---|
State | RFC, archived |
Headers | show |
On 14/03/16 05:22, Michael Turquette wrote: > arch_scale_freq_capacity is weird. It specifies an arch hook for an > implementation that could easily vary within an architecture or even a > chip family. > > This patch helps to mitigate this weirdness by defaulting to the > cpufreq-provided implementation, which should work for all cases where > CONFIG_CPU_FREQ is set. > > If CONFIG_CPU_FREQ is not set, then try to use an implementation > provided by the architecture. Failing that, fall back to > SCHED_CAPACITY_SCALE. > > It may be desirable for cpufreq drivers to specify their own > implementation of arch_scale_freq_capacity in the future. The same is > true for platform code within an architecture. In both cases an > efficient implementation selector will need to be created and this patch > adds a comment to that effect. For me this independence of the scheduler code towards the actual implementation of the Frequency Invariant Engine (FEI) was actually a feature. In EAS RFC5.2 (linux-arm.org/linux-power.git energy_model_rfc_v5.2 , which hasn't been posted to LKML) we establish the link in the ARCH code (arch/arm64/include/asm/topology.h). #ifdef CONFIG_CPU_FREQ #define arch_scale_freq_capacity cpufreq_scale_freq_capacity ... +#endif > > Signed-off-by: Michael Turquette <mturquette+renesas@baylibre.com> > --- > kernel/sched/sched.h | 16 +++++++++++++++- > 1 file changed, 15 insertions(+), 1 deletion(-) > > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h > index 469d11d..37502ea 100644 > --- a/kernel/sched/sched.h > +++ b/kernel/sched/sched.h > @@ -1368,7 +1368,21 @@ static inline int hrtick_enabled(struct rq *rq) > #ifdef CONFIG_SMP > extern void sched_avg_update(struct rq *rq); > > -#ifndef arch_scale_freq_capacity > +/* > + * arch_scale_freq_capacity can be implemented by cpufreq, platform code or > + * arch code. We select the cpufreq-provided implementation first. If it > + * doesn't exist then we default to any other implementation provided from > + * platform/arch code. If those do not exist then we use the default > + * SCHED_CAPACITY_SCALE value below. > + * > + * Note that if cpufreq drivers or platform/arch code have competing > + * implementations it is up to those subsystems to select one at runtime with > + * an efficient solution, as we cannot tolerate the overhead of indirect > + * functions (e.g. function pointers) in the scheduler fast path > + */ > +#ifdef CONFIG_CPU_FREQ > +#define arch_scale_freq_capacity cpufreq_scale_freq_capacity > +#elif !defined(arch_scale_freq_capacity) > static __always_inline > unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu) > { > -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Quoting Dietmar Eggemann (2016-03-15 12:13:58) > On 14/03/16 05:22, Michael Turquette wrote: > > arch_scale_freq_capacity is weird. It specifies an arch hook for an > > implementation that could easily vary within an architecture or even a > > chip family. > > > > This patch helps to mitigate this weirdness by defaulting to the > > cpufreq-provided implementation, which should work for all cases where > > CONFIG_CPU_FREQ is set. > > > > If CONFIG_CPU_FREQ is not set, then try to use an implementation > > provided by the architecture. Failing that, fall back to > > SCHED_CAPACITY_SCALE. > > > > It may be desirable for cpufreq drivers to specify their own > > implementation of arch_scale_freq_capacity in the future. The same is > > true for platform code within an architecture. In both cases an > > efficient implementation selector will need to be created and this patch > > adds a comment to that effect. > > For me this independence of the scheduler code towards the actual > implementation of the Frequency Invariant Engine (FEI) was actually a > feature. I do not agree that it is a strength; I think it is confusing. My opinion is that cpufreq drivers should implement arch_scale_freq_capacity. Having a sane fallback (cpufreq_scale_freq_capacity) simply means that you can remove the boilerplate from the arm32 and arm64 code, which is a win. Furthermore, if we have multiple competing implementations of arch_scale_freq_invariance, wouldn't it be better for all of them to live in cpufreq drivers? This means we would only need to implement a single run-time "selector". On the other hand, if the implementation lives in arch code and we have various implementations of arch_scale_freq_capacity within an architecture, then each arch would need to implement this selector function. Even worse then if we have a split where some implementations live in drivers/cpufreq (e.g. intel_pstate) and others in arch/arm and others in arch/arm64 ... now we have three selectors. Note that this has nothing to do with cpu microarch invariance. I'm happy for that to stay in arch code because we can have heterogeneous cpus that do not scale frequency, and thus would not enable cpufreq. But if your platform scales cpu frequency, then really cpufreq should be in the loop. > > In EAS RFC5.2 (linux-arm.org/linux-power.git energy_model_rfc_v5.2 , > which hasn't been posted to LKML) we establish the link in the ARCH code > (arch/arm64/include/asm/topology.h). Right, sorry again about preemptively posting the patch. Total brainfart on my part. > > #ifdef CONFIG_CPU_FREQ > #define arch_scale_freq_capacity cpufreq_scale_freq_capacity > ... > +#endif The above is no longer necessary with this patch. Same question as above: why insist on the arch boilerplate? Regards, Mike > > > > > Signed-off-by: Michael Turquette <mturquette+renesas@baylibre.com> > > --- > > kernel/sched/sched.h | 16 +++++++++++++++- > > 1 file changed, 15 insertions(+), 1 deletion(-) > > > > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h > > index 469d11d..37502ea 100644 > > --- a/kernel/sched/sched.h > > +++ b/kernel/sched/sched.h > > @@ -1368,7 +1368,21 @@ static inline int hrtick_enabled(struct rq *rq) > > #ifdef CONFIG_SMP > > extern void sched_avg_update(struct rq *rq); > > > > -#ifndef arch_scale_freq_capacity > > +/* > > + * arch_scale_freq_capacity can be implemented by cpufreq, platform code or > > + * arch code. We select the cpufreq-provided implementation first. If it > > + * doesn't exist then we default to any other implementation provided from > > + * platform/arch code. If those do not exist then we use the default > > + * SCHED_CAPACITY_SCALE value below. > > + * > > + * Note that if cpufreq drivers or platform/arch code have competing > > + * implementations it is up to those subsystems to select one at runtime with > > + * an efficient solution, as we cannot tolerate the overhead of indirect > > + * functions (e.g. function pointers) in the scheduler fast path > > + */ > > +#ifdef CONFIG_CPU_FREQ > > +#define arch_scale_freq_capacity cpufreq_scale_freq_capacity > > +#elif !defined(arch_scale_freq_capacity) > > static __always_inline > > unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu) > > { > > -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sun, Mar 13, 2016 at 10:22:12PM -0700, Michael Turquette wrote: > +++ b/kernel/sched/sched.h > @@ -1368,7 +1368,21 @@ static inline int hrtick_enabled(struct rq *rq) > #ifdef CONFIG_SMP > extern void sched_avg_update(struct rq *rq); > > -#ifndef arch_scale_freq_capacity > +#ifdef CONFIG_CPU_FREQ > +#define arch_scale_freq_capacity cpufreq_scale_freq_capacity > +#elif !defined(arch_scale_freq_capacity) > static __always_inline > unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu) > { This could not allow x86 to use the APERF/MPERF thing, so no, can't be right. Maybe something like #ifndef arch_scale_freq_capacity #ifdef CONFIG_CPU_FREQ #define arch_scale_freq_capacity cpufreq_scale_freq_capacity #else static __always_inline unsigned long arch_scale_freq_capacity(..) { return SCHED_CAPACITY_SCALE; } #endif #endif Which will let the arch override and only falls back to cpufreq if the arch doesn't do anything. -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Mar 15, 2016 at 03:27:21PM -0700, Michael Turquette wrote: > That solution scales for the case where architectures have different > methods. It doesn't scale for the case where cpufreq drivers or platform > code within the same arch have competing implementations. Sure it does; no matter what interface we use on x86 to set the DVFS hints (ACPI, intel_p_state, whatever), using APERF/MPERF is the only actual way of telling WTH the actual frequency was. > I'm happy with it as a stop-gap, because it will initially work for > arm{64} and x86, but we'll still need run-time selection of > arch_scale_freq_capacity some day. Once we have that, I think that we > should favor a run-time provided implementation over the arch-provided > one. Also, I'm thinking we don't need any of this. Your cpufreq_scale_freq_capacity() is completely and utterly pointless. Since its implementation simply provides whatever frequency we selected its identical to not using frequency invariant load metrics and having cpufreq use the !inv formula. See: lkml.kernel.org/r/20160309163930.GP6356@twins.programming.kicks-ass.net Now, something else (power aware scheduling etc..) might need the freq invariant stuff, but cpufreq (which we're concerned with here) does not unless arch_scale_freq_capacity() does something else than simply return the value we've set earlier. -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Mar 16, 2016 at 08:47:52AM +0100, Peter Zijlstra wrote: > On Tue, Mar 15, 2016 at 03:27:21PM -0700, Michael Turquette wrote: > > I'm happy with it as a stop-gap, because it will initially work for > > arm{64} and x86, but we'll still need run-time selection of > > arch_scale_freq_capacity some day. Once we have that, I think that we > > should favor a run-time provided implementation over the arch-provided > > one. > > Also, I'm thinking we don't need any of this. Your > cpufreq_scale_freq_capacity() is completely and utterly pointless. Scrap that, while true to instant utilization, this isn't true for our case, since our utilization numbers carry history, and any frequency change in that window is relevant. -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 15/03/16 20:46, Michael Turquette wrote: > Quoting Dietmar Eggemann (2016-03-15 12:13:58) >> On 14/03/16 05:22, Michael Turquette wrote: [...] >> For me this independence of the scheduler code towards the actual >> implementation of the Frequency Invariant Engine (FEI) was actually a >> feature. > > I do not agree that it is a strength; I think it is confusing. My > opinion is that cpufreq drivers should implement > arch_scale_freq_capacity. Having a sane fallback > (cpufreq_scale_freq_capacity) simply means that you can remove the > boilerplate from the arm32 and arm64 code, which is a win. > > Furthermore, if we have multiple competing implementations of > arch_scale_freq_invariance, wouldn't it be better for all of them to > live in cpufreq drivers? This means we would only need to implement a > single run-time "selector". > > On the other hand, if the implementation lives in arch code and we have > various implementations of arch_scale_freq_capacity within an > architecture, then each arch would need to implement this selector > function. Even worse then if we have a split where some implementations > live in drivers/cpufreq (e.g. intel_pstate) and others in arch/arm and > others in arch/arm64 ... now we have three selectors. OK, now I see your point. What I don't understand is the fact why you want different foo_scale_freq_capacity() implementations per cpufreq drivers. IMHO we want to do the cpufreq.c based implementation to abstract from that (at least for target_index() cpufreq drivers). intel_pstate (setpolicy()) is an exception but my humble guess is that systems with intel_pstate driver have X86_FEATURE_APERFMPERF support. > Note that this has nothing to do with cpu microarch invariance. I'm > happy for that to stay in arch code because we can have heterogeneous > cpus that do not scale frequency, and thus would not enable cpufreq. > But if your platform scales cpu frequency, then really cpufreq should be > in the loop. Agreed. > >> >> In EAS RFC5.2 (linux-arm.org/linux-power.git energy_model_rfc_v5.2 , >> which hasn't been posted to LKML) we establish the link in the ARCH code >> (arch/arm64/include/asm/topology.h). > > Right, sorry again about preemptively posting the patch. Total brainfart > on my part. > >> >> #ifdef CONFIG_CPU_FREQ >> #define arch_scale_freq_capacity cpufreq_scale_freq_capacity >> ... >> +#endif > > The above is no longer necessary with this patch. Same question as > above: why insist on the arch boilerplate? OK. [...] -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Mar 16, 2016 at 07:44:33PM +0000, Dietmar Eggemann wrote: > intel_pstate (setpolicy()) is an exception but my humble guess is that > systems with intel_pstate driver have X86_FEATURE_APERFMPERF support. A quick browse of the Intel SDM says you're right. It looks like everything after Pentium-M; so Core-Solo/Core-Duo and onwards have APERF/MPERF. And it looks like P6 class systems didn't have DVFS support at all, which basically leaves P4 and Pentium-M as the only chips to have DVFS support lacking APERF/MPERF. And while I haven't got any P4 based space heaters left, I might still have a Pentium-M class laptop somewhere (if it still boots). -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wednesday, March 16, 2016 09:07:52 PM Peter Zijlstra wrote: > On Wed, Mar 16, 2016 at 07:44:33PM +0000, Dietmar Eggemann wrote: > > intel_pstate (setpolicy()) is an exception but my humble guess is that > > systems with intel_pstate driver have X86_FEATURE_APERFMPERF support. > > A quick browse of the Intel SDM says you're right. It looks like > everything after Pentium-M; so Core-Solo/Core-Duo and onwards have > APERF/MPERF. > > And it looks like P6 class systems didn't have DVFS support at all, > which basically leaves P4 and Pentium-M as the only chips to have DVFS > support lacking APERF/MPERF. > > And while I haven't got any P4 based space heaters left, I might still > have a Pentium-M class laptop somewhere (if it still boots). intel_pstate depends on APERF/MPERF, so it won't work with anything without them. -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 469d11d..37502ea 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -1368,7 +1368,21 @@ static inline int hrtick_enabled(struct rq *rq) #ifdef CONFIG_SMP extern void sched_avg_update(struct rq *rq); -#ifndef arch_scale_freq_capacity +/* + * arch_scale_freq_capacity can be implemented by cpufreq, platform code or + * arch code. We select the cpufreq-provided implementation first. If it + * doesn't exist then we default to any other implementation provided from + * platform/arch code. If those do not exist then we use the default + * SCHED_CAPACITY_SCALE value below. + * + * Note that if cpufreq drivers or platform/arch code have competing + * implementations it is up to those subsystems to select one at runtime with + * an efficient solution, as we cannot tolerate the overhead of indirect + * functions (e.g. function pointers) in the scheduler fast path + */ +#ifdef CONFIG_CPU_FREQ +#define arch_scale_freq_capacity cpufreq_scale_freq_capacity +#elif !defined(arch_scale_freq_capacity) static __always_inline unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu) {
arch_scale_freq_capacity is weird. It specifies an arch hook for an implementation that could easily vary within an architecture or even a chip family. This patch helps to mitigate this weirdness by defaulting to the cpufreq-provided implementation, which should work for all cases where CONFIG_CPU_FREQ is set. If CONFIG_CPU_FREQ is not set, then try to use an implementation provided by the architecture. Failing that, fall back to SCHED_CAPACITY_SCALE. It may be desirable for cpufreq drivers to specify their own implementation of arch_scale_freq_capacity in the future. The same is true for platform code within an architecture. In both cases an efficient implementation selector will need to be created and this patch adds a comment to that effect. Signed-off-by: Michael Turquette <mturquette+renesas@baylibre.com> --- kernel/sched/sched.h | 16 +++++++++++++++- 1 file changed, 15 insertions(+), 1 deletion(-)