diff mbox series

[v2,2/3] x86/sched: Add basic support for CPU capacity scaling

Message ID 13573795.uLZWGnKmhe@rjwysocki.net (mailing list archive)
State Superseded, archived
Headers show
Series x86 / intel_pstate: Set asymmetric CPU capacity on hybrid systems | expand

Commit Message

Rafael J. Wysocki Aug. 12, 2024, 12:42 p.m. UTC
From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

In order be able to compute the sizes of tasks consistently across all
CPUs in a hybrid system, it is necessary to provide CPU capacity scaling
information to the scheduler via arch_scale_cpu_capacity().  Moreover,
the value returned by arch_scale_freq_capacity() for the given CPU must
correspond to the arch_scale_cpu_capacity() return value for it, or
utilization computations will be inaccurate.

Add support for it through per-CPU variables holding the capacity and
maximum-to-base frequency ratio (times SCHED_CAPACITY_SCALE) that will
be returned by arch_scale_cpu_capacity() and used by scale_freq_tick()
to compute arch_freq_scale for the current CPU, respectively.

In order to avoid adding measurable overhead for non-hybrid x86 systems,
which are the vast majority in the field, whether or not the new hybrid
CPU capacity scaling will be in effect is controlled by a static key.
This static key is set by calling arch_enable_hybrid_capacity_scale()
which also allocates memory for the per-CPU data and initializes it.
Next, arch_set_cpu_capacity() is used to set the per-CPU variables
mentioned above for each CPU and arch_rebuild_sched_domains() needs
to be called for the scheduler to realize that capacity-aware
scheduling can be used going forward.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---

v1 -> v2:
   * Allow arch_enable_hybrid_capacity_scale() to succeed when frequency-
     invariance is not enabled and do not disable capacity scaling in
     disable_freq_invariance_workfn() (it is not really necessary to do
     so and the code is simpler without doing that).  Having done that,
     move arch_hybrid_cap_scale_key definition closer to the code using it.
   * Replace WARN_ON_ONCE() with WARN_ONCE() (2 places).
   * Fix arch_enable_hybrid_capacity_scale() return value when hybrid
     capacity scaling is already enabled.
   * Fix arch_set_cpu_capacity() kerneldoc comment.

---
 arch/x86/include/asm/topology.h  |   13 +++++
 arch/x86/kernel/cpu/aperfmperf.c |   87 ++++++++++++++++++++++++++++++++++++++-
 2 files changed, 99 insertions(+), 1 deletion(-)

Comments

Chen Yu Aug. 13, 2024, 1:27 a.m. UTC | #1
Hi Rafael,

On 2024-08-12 at 14:42:26 +0200, Rafael J. Wysocki wrote:
> +void arch_set_cpu_capacity(int cpu, unsigned long cap, unsigned long base_cap,
> +			   unsigned long max_freq, unsigned long base_freq)
> +{
> +	if (static_branch_likely(&arch_hybrid_cap_scale_key)) {
> +		WRITE_ONCE(per_cpu_ptr(arch_cpu_scale, cpu)->capacity,
> +			   div_u64(cap << SCHED_CAPACITY_SHIFT, base_cap));
> +		WRITE_ONCE(per_cpu_ptr(arch_cpu_scale, cpu)->freq_ratio,
> +			   div_u64(max_freq << SCHED_CAPACITY_SHIFT, base_freq));
>

Would the capacity update be frequently invoked? Just wonder if we could
first READ_ONCE() to check if the value is already the value we want to
change to, to avoid one write and less cache snoop overhead (in case other
CPU reads this CPU's capacity)

thanks,
Chenyu
Rafael J. Wysocki Aug. 13, 2024, 11:07 a.m. UTC | #2
Hi,

On Tue, Aug 13, 2024 at 3:27 AM Chen Yu <yu.c.chen@intel.com> wrote:
>
> Hi Rafael,
>
> On 2024-08-12 at 14:42:26 +0200, Rafael J. Wysocki wrote:
> > +void arch_set_cpu_capacity(int cpu, unsigned long cap, unsigned long base_cap,
> > +                        unsigned long max_freq, unsigned long base_freq)
> > +{
> > +     if (static_branch_likely(&arch_hybrid_cap_scale_key)) {
> > +             WRITE_ONCE(per_cpu_ptr(arch_cpu_scale, cpu)->capacity,
> > +                        div_u64(cap << SCHED_CAPACITY_SHIFT, base_cap));
> > +             WRITE_ONCE(per_cpu_ptr(arch_cpu_scale, cpu)->freq_ratio,
> > +                        div_u64(max_freq << SCHED_CAPACITY_SHIFT, base_freq));
> >
>
> Would the capacity update be frequently invoked?

I hope not.

> Just wonder if we could
> first READ_ONCE() to check if the value is already the value we want to
> change to, to avoid one write and less cache snoop overhead (in case other
> CPU reads this CPU's capacity)

Well, I'd rather not complicate the code beyond what is necessary
unless this is demonstrated to make a measurable difference.

Besides, AFAICS the only case in the caller in which the same values
can be passed to arch_set_cpu_capacity() is the CPU offline one and
that should not happen too often.
Ricardo Neri Aug. 26, 2024, 10:08 p.m. UTC | #3
On Mon, Aug 12, 2024 at 02:42:26PM +0200, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

[...]

> +bool arch_enable_hybrid_capacity_scale(void)
> +{
> +	int cpu;
> +
> +	if (static_branch_unlikely(&arch_hybrid_cap_scale_key)) {
> +		WARN_ONCE(1, "Hybrid CPU capacity scaling already enabled");
> +		return true;
> +	}

Maybe an empty line here for readability?

> +	arch_cpu_scale = alloc_percpu(struct arch_hybrid_cpu_scale);
> +	if (!arch_cpu_scale)
> +		return false;
> +
> +	for_each_possible_cpu(cpu) {
> +		per_cpu_ptr(arch_cpu_scale, cpu)->capacity = SCHED_CAPACITY_SCALE;
> +		per_cpu_ptr(arch_cpu_scale, cpu)->freq_ratio = arch_max_freq_ratio;
> +	}
> +
> +	static_branch_enable(&arch_hybrid_cap_scale_key);
> +
> +	pr_info("Hybrid CPU capacity scaling enabled\n");
> +
> +	return true;
> +}
> +
> +/**
> + * arch_set_cpu_capacity - Set scale-invariance parameters for a CPU
> + * @cpu: Target CPU.
> + * @cap: Capacity of @cpu, relative to @base_cap, at its maximum frequency.
> + * @base_cap: System-wide maximum CPU capacity.

It is confusing to e that @base_cap is the maximum capacity of the system.
Maybe @max_cap?

> + * @max_freq: Frequency of @cpu corresponding to @cap.
> + * @base_freq: Frequency of @cpu at which MPERF counts.
> + *
> + * The units in which @cap and @base_cap are expressed do not matter, so long
> + * as they are consistent, because the former is effectively divided by the
> + * latter.  Analogously for @max_freq and @base_freq.
> + *
> + * After calling this function for all CPUs, call arch_rebuild_sched_domains()
> + * to let the scheduler know that capacity-aware scheduling can be used going
> + * forward.
> + */
> +void arch_set_cpu_capacity(int cpu, unsigned long cap, unsigned long base_cap,
> +			   unsigned long max_freq, unsigned long base_freq)
> +{
> +	if (static_branch_likely(&arch_hybrid_cap_scale_key)) {
> +		WRITE_ONCE(per_cpu_ptr(arch_cpu_scale, cpu)->capacity,
> +			   div_u64(cap << SCHED_CAPACITY_SHIFT, base_cap));
> +		WRITE_ONCE(per_cpu_ptr(arch_cpu_scale, cpu)->freq_ratio,
> +			   div_u64(max_freq << SCHED_CAPACITY_SHIFT, base_freq));
> +	} else {
> +		WARN_ONCE(1, "Hybrid CPU capacity scaling not enabled");
> +	}
> +}
> +
> +unsigned long arch_scale_cpu_capacity(int cpu)
> +{
> +	if (static_branch_unlikely(&arch_hybrid_cap_scale_key))
> +		return READ_ONCE(per_cpu_ptr(arch_cpu_scale, cpu)->capacity);
> +
> +	return SCHED_CAPACITY_SCALE;
> +}
> +EXPORT_SYMBOL_GPL(arch_scale_cpu_capacity);
> +
>  static void scale_freq_tick(u64 acnt, u64 mcnt)
>  {
>  	u64 freq_scale;
> +	u64 freq_ratio;

Why can't freq_ratio be declared on the same line as freq_scale?

>  
>  	if (!arch_scale_freq_invariant())
>  		return;
> @@ -359,7 +439,12 @@ static void scale_freq_tick(u64 acnt, u6
>  	if (check_shl_overflow(acnt, 2*SCHED_CAPACITY_SHIFT, &acnt))
>  		goto error;
>  
> -	if (check_mul_overflow(mcnt, arch_max_freq_ratio, &mcnt) || !mcnt)
> +	if (static_branch_unlikely(&arch_hybrid_cap_scale_key))
> +		freq_ratio = READ_ONCE(this_cpu_ptr(arch_cpu_scale)->freq_ratio);
> +	else
> +		freq_ratio = arch_max_freq_ratio;

It seems that arch_max_freq_ratio will never be used on hybrid processors
and computing arch_turbo_freq_ratio will be a waste of cycles.

Unfortunately, intel_set_max_freq_ratio() is called before the
arch_hybrid_cap_scale_key static key is set.

Maybe some rework is in order?

Thanks and BR,
Ricardo
Rafael J. Wysocki Aug. 27, 2024, 11:57 a.m. UTC | #4
On Tue, Aug 27, 2024 at 12:02 AM Ricardo Neri
<ricardo.neri-calderon@linux.intel.com> wrote:
>
> On Mon, Aug 12, 2024 at 02:42:26PM +0200, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>
> [...]
>
> > +bool arch_enable_hybrid_capacity_scale(void)
> > +{
> > +     int cpu;
> > +
> > +     if (static_branch_unlikely(&arch_hybrid_cap_scale_key)) {
> > +             WARN_ONCE(1, "Hybrid CPU capacity scaling already enabled");
> > +             return true;
> > +     }
>
> Maybe an empty line here for readability?

Sure, if this helps.

> > +     arch_cpu_scale = alloc_percpu(struct arch_hybrid_cpu_scale);
> > +     if (!arch_cpu_scale)
> > +             return false;
> > +
> > +     for_each_possible_cpu(cpu) {
> > +             per_cpu_ptr(arch_cpu_scale, cpu)->capacity = SCHED_CAPACITY_SCALE;
> > +             per_cpu_ptr(arch_cpu_scale, cpu)->freq_ratio = arch_max_freq_ratio;
> > +     }
> > +
> > +     static_branch_enable(&arch_hybrid_cap_scale_key);
> > +
> > +     pr_info("Hybrid CPU capacity scaling enabled\n");
> > +
> > +     return true;
> > +}
> > +
> > +/**
> > + * arch_set_cpu_capacity - Set scale-invariance parameters for a CPU
> > + * @cpu: Target CPU.
> > + * @cap: Capacity of @cpu, relative to @base_cap, at its maximum frequency.
> > + * @base_cap: System-wide maximum CPU capacity.
>
> It is confusing to e that @base_cap is the maximum capacity of the system.
> Maybe @max_cap?

But the max_cap and max_freq are sort of confusing again.

I guess I can call it max_cap and also rename max_freq to cap_freq.

> > + * @max_freq: Frequency of @cpu corresponding to @cap.
> > + * @base_freq: Frequency of @cpu at which MPERF counts.
> > + *
> > + * The units in which @cap and @base_cap are expressed do not matter, so long
> > + * as they are consistent, because the former is effectively divided by the
> > + * latter.  Analogously for @max_freq and @base_freq.
> > + *
> > + * After calling this function for all CPUs, call arch_rebuild_sched_domains()
> > + * to let the scheduler know that capacity-aware scheduling can be used going
> > + * forward.
> > + */
> > +void arch_set_cpu_capacity(int cpu, unsigned long cap, unsigned long base_cap,
> > +                        unsigned long max_freq, unsigned long base_freq)
> > +{
> > +     if (static_branch_likely(&arch_hybrid_cap_scale_key)) {
> > +             WRITE_ONCE(per_cpu_ptr(arch_cpu_scale, cpu)->capacity,
> > +                        div_u64(cap << SCHED_CAPACITY_SHIFT, base_cap));
> > +             WRITE_ONCE(per_cpu_ptr(arch_cpu_scale, cpu)->freq_ratio,
> > +                        div_u64(max_freq << SCHED_CAPACITY_SHIFT, base_freq));
> > +     } else {
> > +             WARN_ONCE(1, "Hybrid CPU capacity scaling not enabled");
> > +     }
> > +}
> > +
> > +unsigned long arch_scale_cpu_capacity(int cpu)
> > +{
> > +     if (static_branch_unlikely(&arch_hybrid_cap_scale_key))
> > +             return READ_ONCE(per_cpu_ptr(arch_cpu_scale, cpu)->capacity);
> > +
> > +     return SCHED_CAPACITY_SCALE;
> > +}
> > +EXPORT_SYMBOL_GPL(arch_scale_cpu_capacity);
> > +
> >  static void scale_freq_tick(u64 acnt, u64 mcnt)
> >  {
> >       u64 freq_scale;
> > +     u64 freq_ratio;
>
> Why can't freq_ratio be declared on the same line as freq_scale?

It can.

> >
> >       if (!arch_scale_freq_invariant())
> >               return;
> > @@ -359,7 +439,12 @@ static void scale_freq_tick(u64 acnt, u6
> >       if (check_shl_overflow(acnt, 2*SCHED_CAPACITY_SHIFT, &acnt))
> >               goto error;
> >
> > -     if (check_mul_overflow(mcnt, arch_max_freq_ratio, &mcnt) || !mcnt)
> > +     if (static_branch_unlikely(&arch_hybrid_cap_scale_key))
> > +             freq_ratio = READ_ONCE(this_cpu_ptr(arch_cpu_scale)->freq_ratio);
> > +     else
> > +             freq_ratio = arch_max_freq_ratio;
>
> It seems that arch_max_freq_ratio will never be used on hybrid processors
> and computing arch_turbo_freq_ratio will be a waste of cycles.

Well, what if the memory allocation is
arch_enable_hybrid_capacity_scale() fails?

> Unfortunately, intel_set_max_freq_ratio() is called before the
> arch_hybrid_cap_scale_key static key is set.
>
> Maybe some rework is in order?

I'd rather not do it.  This is all initialization and done once.

However, a driver mode change can mess up with it which I have
overlooked.  I'll fix this (and make the above changes) and send a new
version of the series.
diff mbox series

Patch

Index: linux-pm/arch/x86/include/asm/topology.h
===================================================================
--- linux-pm.orig/arch/x86/include/asm/topology.h
+++ linux-pm/arch/x86/include/asm/topology.h
@@ -280,11 +280,24 @@  static inline long arch_scale_freq_capac
 }
 #define arch_scale_freq_capacity arch_scale_freq_capacity
 
+bool arch_enable_hybrid_capacity_scale(void);
+void arch_set_cpu_capacity(int cpu, unsigned long cap, unsigned long base_cap,
+			   unsigned long max_freq, unsigned long base_freq);
+
+unsigned long arch_scale_cpu_capacity(int cpu);
+#define arch_scale_cpu_capacity arch_scale_cpu_capacity
+
 extern void arch_set_max_freq_ratio(bool turbo_disabled);
 extern void freq_invariance_set_perf_ratio(u64 ratio, bool turbo_disabled);
 
 void arch_rebuild_sched_domains(void);
 #else
+static inline bool arch_enable_hybrid_capacity_scale(void) { return false; }
+static inline void arch_set_cpu_capacity(int cpu, unsigned long cap,
+					 unsigned long base_cap,
+					 unsigned long max_freq,
+					 unsigned long base_freq) { }
+
 static inline void arch_set_max_freq_ratio(bool turbo_disabled) { }
 static inline void freq_invariance_set_perf_ratio(u64 ratio, bool turbo_disabled) { }
 
Index: linux-pm/arch/x86/kernel/cpu/aperfmperf.c
===================================================================
--- linux-pm.orig/arch/x86/kernel/cpu/aperfmperf.c
+++ linux-pm/arch/x86/kernel/cpu/aperfmperf.c
@@ -349,9 +349,89 @@  static DECLARE_WORK(disable_freq_invaria
 DEFINE_PER_CPU(unsigned long, arch_freq_scale) = SCHED_CAPACITY_SCALE;
 EXPORT_PER_CPU_SYMBOL_GPL(arch_freq_scale);
 
+static DEFINE_STATIC_KEY_FALSE(arch_hybrid_cap_scale_key);
+
+struct arch_hybrid_cpu_scale {
+	unsigned long capacity;
+	unsigned long freq_ratio;
+};
+
+static struct arch_hybrid_cpu_scale __percpu *arch_cpu_scale;
+
+/**
+ * arch_enable_hybrid_capacity_scale - Enable hybrid CPU capacity scaling
+ *
+ * Allocate memory for per-CPU data used by hybrid CPU capacity scaling,
+ * initialize it and set the static key controlling its code paths.
+ *
+ * Must be called before arch_set_cpu_capacity().
+ */
+bool arch_enable_hybrid_capacity_scale(void)
+{
+	int cpu;
+
+	if (static_branch_unlikely(&arch_hybrid_cap_scale_key)) {
+		WARN_ONCE(1, "Hybrid CPU capacity scaling already enabled");
+		return true;
+	}
+	arch_cpu_scale = alloc_percpu(struct arch_hybrid_cpu_scale);
+	if (!arch_cpu_scale)
+		return false;
+
+	for_each_possible_cpu(cpu) {
+		per_cpu_ptr(arch_cpu_scale, cpu)->capacity = SCHED_CAPACITY_SCALE;
+		per_cpu_ptr(arch_cpu_scale, cpu)->freq_ratio = arch_max_freq_ratio;
+	}
+
+	static_branch_enable(&arch_hybrid_cap_scale_key);
+
+	pr_info("Hybrid CPU capacity scaling enabled\n");
+
+	return true;
+}
+
+/**
+ * arch_set_cpu_capacity - Set scale-invariance parameters for a CPU
+ * @cpu: Target CPU.
+ * @cap: Capacity of @cpu, relative to @base_cap, at its maximum frequency.
+ * @base_cap: System-wide maximum CPU capacity.
+ * @max_freq: Frequency of @cpu corresponding to @cap.
+ * @base_freq: Frequency of @cpu at which MPERF counts.
+ *
+ * The units in which @cap and @base_cap are expressed do not matter, so long
+ * as they are consistent, because the former is effectively divided by the
+ * latter.  Analogously for @max_freq and @base_freq.
+ *
+ * After calling this function for all CPUs, call arch_rebuild_sched_domains()
+ * to let the scheduler know that capacity-aware scheduling can be used going
+ * forward.
+ */
+void arch_set_cpu_capacity(int cpu, unsigned long cap, unsigned long base_cap,
+			   unsigned long max_freq, unsigned long base_freq)
+{
+	if (static_branch_likely(&arch_hybrid_cap_scale_key)) {
+		WRITE_ONCE(per_cpu_ptr(arch_cpu_scale, cpu)->capacity,
+			   div_u64(cap << SCHED_CAPACITY_SHIFT, base_cap));
+		WRITE_ONCE(per_cpu_ptr(arch_cpu_scale, cpu)->freq_ratio,
+			   div_u64(max_freq << SCHED_CAPACITY_SHIFT, base_freq));
+	} else {
+		WARN_ONCE(1, "Hybrid CPU capacity scaling not enabled");
+	}
+}
+
+unsigned long arch_scale_cpu_capacity(int cpu)
+{
+	if (static_branch_unlikely(&arch_hybrid_cap_scale_key))
+		return READ_ONCE(per_cpu_ptr(arch_cpu_scale, cpu)->capacity);
+
+	return SCHED_CAPACITY_SCALE;
+}
+EXPORT_SYMBOL_GPL(arch_scale_cpu_capacity);
+
 static void scale_freq_tick(u64 acnt, u64 mcnt)
 {
 	u64 freq_scale;
+	u64 freq_ratio;
 
 	if (!arch_scale_freq_invariant())
 		return;
@@ -359,7 +439,12 @@  static void scale_freq_tick(u64 acnt, u6
 	if (check_shl_overflow(acnt, 2*SCHED_CAPACITY_SHIFT, &acnt))
 		goto error;
 
-	if (check_mul_overflow(mcnt, arch_max_freq_ratio, &mcnt) || !mcnt)
+	if (static_branch_unlikely(&arch_hybrid_cap_scale_key))
+		freq_ratio = READ_ONCE(this_cpu_ptr(arch_cpu_scale)->freq_ratio);
+	else
+		freq_ratio = arch_max_freq_ratio;
+
+	if (check_mul_overflow(mcnt, freq_ratio, &mcnt) || !mcnt)
 		goto error;
 
 	freq_scale = div64_u64(acnt, mcnt);