[RFC,04/16] sched/fair: Remove magic hardcoded margin in fits_capacity()

Message ID 20240820163512.1096301-5-qyousef@layalina.io (mailing list archive)
State RFC, archived
Series sched/fair/schedutil: Better manage system response time

Commit Message

Qais Yousef Aug. 20, 2024, 4:35 p.m. UTC
Replace the hardcoded margin value in fits_capacity() with better dynamic
logic.

The 80% margin is a magic value that has served its purpose for now, but
it no longer fits the variety of systems that exist today. If a system is
particularly overpowered, this 80% means we leave a lot of capacity
unused before we decide to upmigrate on an HMP system.

On many systems the little cores are underpowered, and the ability to
migrate away from them faster is desired.

Redefine misfit migration to mean the utilization threshold at which the
task would become misfit at the next load balance event, assuming it
becomes an always-running task.

To calculate this threshold, use the new approximate_util_avg() function
to find the utilization at which, based on arch_scale_cpu_capacity(), the
task would become misfit if it continues to run for TICK_USEC, which is
our worst-case delay before misfit migration kicks in.
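
For illustration, here is a rough user-space sketch of the PELT arithmetic
involved (approximate_util_avg()/approximate_runtime() added earlier in this
series are the kernel-side counterparts of roughly this math; the half-life,
capacity and tick numbers below are example assumptions and kernel-side
integer rounding is ignored):

#include <math.h>
#include <stdio.h>

#define SCALE		1024.0	/* SCHED_CAPACITY_SCALE */
#define PERIOD_US	1024.0	/* one PELT window */
#define HALFLIFE	32.0	/* periods for the signal to halve */

/* util_avg of an always-running task after run_us, starting from util */
static double util_after(double util, double run_us)
{
	double decay = pow(0.5, run_us / (HALFLIFE * PERIOD_US));

	return SCALE * (1.0 - decay) + util * decay;
}

/* runtime (in us) an always-running task needs to reach util from zero */
static double runtime_to(double util)
{
	return -HALFLIFE * PERIOD_US * log2(1.0 - util / SCALE);
}

int main(void)
{
	/* e.g. a little CPU of capacity 160 with a 4ms (250Hz) tick */
	double cap = 160.0, tick_us = 4000.0;
	double limit = runtime_to(cap) - tick_us;

	/* prints ~84, i.e. well below the old 80% cut-off of 128 */
	printf("fits_capacity threshold ~= %.0f\n", util_after(0.0, limit));
	return 0;
}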

Signed-off-by: Qais Yousef <qyousef@layalina.io>
---
 kernel/sched/core.c  |  1 +
 kernel/sched/fair.c  | 40 ++++++++++++++++++++++++++++++++--------
 kernel/sched/sched.h |  1 +
 3 files changed, 34 insertions(+), 8 deletions(-)

Comments

Sultan Alsawaf (unemployed) Aug. 22, 2024, 5:09 a.m. UTC | #1
Hi Qais,

On Tue, Aug 20, 2024 at 05:35:00PM +0100, Qais Yousef wrote:
> Replace the hardcoded margin value in fits_capacity() with better dynamic
> logic.
> 
> The 80% margin is a magic value that has served its purpose for now, but
> it no longer fits the variety of systems that exist today. If a system is
> particularly overpowered, this 80% means we leave a lot of capacity
> unused before we decide to upmigrate on an HMP system.
> 
> On many systems the little cores are underpowered, and the ability to
> migrate away from them faster is desired.
> 
> Redefine misfit migration to mean the utilization threshold at which the
> task would become misfit at the next load balance event, assuming it
> becomes an always-running task.
> 
> To calculate this threshold, use the new approximate_util_avg() function
> to find the utilization at which, based on arch_scale_cpu_capacity(), the
> task would become misfit if it continues to run for TICK_USEC, which is
> our worst-case delay before misfit migration kicks in.
> 
> Signed-off-by: Qais Yousef <qyousef@layalina.io>
> ---
>  kernel/sched/core.c  |  1 +
>  kernel/sched/fair.c  | 40 ++++++++++++++++++++++++++++++++--------
>  kernel/sched/sched.h |  1 +
>  3 files changed, 34 insertions(+), 8 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 6d35c48239be..402ee4947ef0 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -8266,6 +8266,7 @@ void __init sched_init(void)
>  		rq->sd = NULL;
>  		rq->rd = NULL;
>  		rq->cpu_capacity = SCHED_CAPACITY_SCALE;
> +		rq->fits_capacity_threshold = SCHED_CAPACITY_SCALE;
>  		rq->balance_callback = &balance_push_callback;
>  		rq->active_balance = 0;
>  		rq->next_balance = jiffies;
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 9057584ec06d..e5e986af18dc 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -95,11 +95,15 @@ int __weak arch_asym_cpu_priority(int cpu)
>  }
>  
>  /*
> - * The margin used when comparing utilization with CPU capacity.
> - *
> - * (default: ~20%)
> + * fits_capacity() must ensure that a task will not be 'stuck' on a CPU with
> + * lower capacity for too long. The threshold is the util value at which, if
> + * a task becomes always busy, it could miss the next misfit migration load
> + * balance event. So we consider a task misfit before it reaches this point.
>   */
> -#define fits_capacity(cap, max)	((cap) * 1280 < (max) * 1024)
> +static inline bool fits_capacity(unsigned long util, int cpu)
> +{
> +	return util < cpu_rq(cpu)->fits_capacity_threshold;
> +}
>  
>  /*
>   * The margin used when comparing CPU capacities.
> @@ -4978,14 +4982,13 @@ static inline int util_fits_cpu(unsigned long util,
>  				unsigned long uclamp_max,
>  				int cpu)
>  {
> -	unsigned long capacity = capacity_of(cpu);
>  	unsigned long capacity_orig;
>  	bool fits, uclamp_max_fits;
>  
>  	/*
>  	 * Check if the real util fits without any uclamp boost/cap applied.
>  	 */
> -	fits = fits_capacity(util, capacity);
> +	fits = fits_capacity(util, cpu);
>  
>  	if (!uclamp_is_used())
>  		return fits;
> @@ -9592,12 +9595,33 @@ static void update_cpu_capacity(struct sched_domain *sd, int cpu)
>  {
>  	unsigned long capacity = scale_rt_capacity(cpu);
>  	struct sched_group *sdg = sd->groups;
> +	struct rq *rq = cpu_rq(cpu);
> +	u64 limit;
>  
>  	if (!capacity)
>  		capacity = 1;
>  
> -	cpu_rq(cpu)->cpu_capacity = capacity;
> -	trace_sched_cpu_capacity_tp(cpu_rq(cpu));
> +	rq->cpu_capacity = capacity;
> +	trace_sched_cpu_capacity_tp(rq);
> +
> +	/*
> +	 * Calculate the util at which the task must be considered a misfit.
> +	 *
> +	 * We must ensure that a task experiences the same ramp-up time to
> +	 * reach the max performance point of the system regardless of the CPU
> +	 * it is running on (due to invariance, time will stretch and the task
> +	 * will take longer to achieve the same util value compared to a task
> +	 * running on a big CPU), and that the delay in misfit migration,
> +	 * which depends on TICK, doesn't end up hurting it as it can happen
> +	 * after we would have crossed this threshold.
> +	 *
> +	 * To ensure that invariance is taken into account, we don't scale time
> +	 * and use it as-is; approximate_util_avg() will then give us the
> +	 * threshold.
> +	 */
> +	limit = approximate_runtime(arch_scale_cpu_capacity(cpu)) * USEC_PER_MSEC;

Perhaps it makes more sense to use `capacity` here instead of
`arch_scale_cpu_capacity(cpu)`? Seems like reduced capacity due to HW pressure
(and IRQs + RT util) should be considered, e.g. for a capacity inversion due to
HW pressure on a mid core that results in a little core being faster.

Also, multiplying by the PELT period (1024 us) rather than USEC_PER_MSEC would
be more accurate.

> +	limit -= TICK_USEC; /* sd->balance_interval is more accurate */

I think `limit` could easily wrap here, especially with a 100 Hz tick, and make
it seem like an ultra-slow core (e.g. due to HW pressure) can suddenly fit any
task.

How about `lsub_positive(&limit, TICK_USEC)` instead?
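
Something like this (untested sketch of the above, assuming the
approximate_*() helpers from earlier in the series and the lsub_positive()
helper from sched.h):

	/* use capacity and the PELT period (1024us) rather than USEC_PER_MSEC */
	limit = approximate_runtime(capacity) * 1024;
	/* clamp at zero instead of wrapping around */
	lsub_positive(&limit, TICK_USEC);
	rq->fits_capacity_threshold = approximate_util_avg(0, limit);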

> +	rq->fits_capacity_threshold = approximate_util_avg(0, limit);
>  
>  	sdg->sgc->capacity = capacity;
>  	sdg->sgc->min_capacity = capacity;
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 47f158b2cdc2..ab4672675b84 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1093,6 +1093,7 @@ struct rq {
>  	struct sched_domain __rcu	*sd;
>  
>  	unsigned long		cpu_capacity;
> +	unsigned long		fits_capacity_threshold;
>  
>  	struct balance_callback *balance_callback;
>  
> -- 
> 2.34.1
> 

Cheers,
Sultan
Dietmar Eggemann Sept. 17, 2024, 7:41 p.m. UTC | #2
On 22/08/2024 07:09, Sultan Alsawaf (unemployed) wrote:
> Hi Qais,
> 
> On Tue, Aug 20, 2024 at 05:35:00PM +0100, Qais Yousef wrote:
>> Replace the hardcoded margin value in fits_capacity() with better dynamic
>> logic.
>>
>> The 80% margin is a magic value that has served its purpose for now, but
>> it no longer fits the variety of systems that exist today. If a system is
>> particularly overpowered, this 80% means we leave a lot of capacity
>> unused before we decide to upmigrate on an HMP system.
>>
>> On many systems the little cores are underpowered, and the ability to
>> migrate away from them faster is desired.
>>
>> Redefine misfit migration to mean the utilization threshold at which the
>> task would become misfit at the next load balance event, assuming it
>> becomes an always-running task.
>>
>> To calculate this threshold, use the new approximate_util_avg() function
>> to find the utilization at which, based on arch_scale_cpu_capacity(), the
>> task would become misfit if it continues to run for TICK_USEC, which is
>> our worst-case delay before misfit migration kicks in.

[...]

>> +	/*
>> +	 * Calculate the util at which the task must be considered a misfit.
>> +	 *
>> +	 * We must ensure that a task experiences the same ramp-up time to
>> +	 * reach the max performance point of the system regardless of the CPU
>> +	 * it is running on (due to invariance, time will stretch and the task
>> +	 * will take longer to achieve the same util value compared to a task
>> +	 * running on a big CPU), and that the delay in misfit migration,
>> +	 * which depends on TICK, doesn't end up hurting it as it can happen
>> +	 * after we would have crossed this threshold.
>> +	 *
>> +	 * To ensure that invariance is taken into account, we don't scale time
>> +	 * and use it as-is; approximate_util_avg() will then give us the
>> +	 * threshold.
>> +	 */
>> +	limit = approximate_runtime(arch_scale_cpu_capacity(cpu)) * USEC_PER_MSEC;
> 
> Perhaps it makes more sense to use `capacity` here instead of
> `arch_scale_cpu_capacity(cpu)`? Seems like reduced capacity due to HW pressure
> (and IRQs + RT util) should be considered, e.g. for a capacity inversion due to
> HW pressure on a mid core that results in a little core being faster.

If you want to keep it strictly 'uarch & freq-invariant' based, then it
wouldn't have to be called periodically in update_cpu_capacity(). Just
set rq->fits_capacity_threshold once after cpu_scale has been fully
(uArch & Freq) normalized.
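
I.e. something like this (rough sketch only; helper name and call site are
hypothetical):

	/* once per CPU, e.g. after init, instead of every update_cpu_capacity() */
	static void set_fits_capacity_threshold(int cpu)
	{
		u64 limit;

		limit = approximate_runtime(arch_scale_cpu_capacity(cpu)) * USEC_PER_MSEC;
		limit -= TICK_USEC;
		cpu_rq(cpu)->fits_capacity_threshold = approximate_util_avg(0, limit);
	}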

[...]

Patch

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6d35c48239be..402ee4947ef0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8266,6 +8266,7 @@  void __init sched_init(void)
 		rq->sd = NULL;
 		rq->rd = NULL;
 		rq->cpu_capacity = SCHED_CAPACITY_SCALE;
+		rq->fits_capacity_threshold = SCHED_CAPACITY_SCALE;
 		rq->balance_callback = &balance_push_callback;
 		rq->active_balance = 0;
 		rq->next_balance = jiffies;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9057584ec06d..e5e986af18dc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -95,11 +95,15 @@  int __weak arch_asym_cpu_priority(int cpu)
 }
 
 /*
- * The margin used when comparing utilization with CPU capacity.
- *
- * (default: ~20%)
+ * fits_capacity() must ensure that a task will not be 'stuck' on a CPU with
+ * lower capacity for too long. The threshold is the util value at which, if
+ * a task becomes always busy, it could miss the next misfit migration load
+ * balance event. So we consider a task misfit before it reaches this point.
  */
-#define fits_capacity(cap, max)	((cap) * 1280 < (max) * 1024)
+static inline bool fits_capacity(unsigned long util, int cpu)
+{
+	return util < cpu_rq(cpu)->fits_capacity_threshold;
+}
 
 /*
  * The margin used when comparing CPU capacities.
@@ -4978,14 +4982,13 @@  static inline int util_fits_cpu(unsigned long util,
 				unsigned long uclamp_max,
 				int cpu)
 {
-	unsigned long capacity = capacity_of(cpu);
 	unsigned long capacity_orig;
 	bool fits, uclamp_max_fits;
 
 	/*
 	 * Check if the real util fits without any uclamp boost/cap applied.
 	 */
-	fits = fits_capacity(util, capacity);
+	fits = fits_capacity(util, cpu);
 
 	if (!uclamp_is_used())
 		return fits;
@@ -9592,12 +9595,33 @@  static void update_cpu_capacity(struct sched_domain *sd, int cpu)
 {
 	unsigned long capacity = scale_rt_capacity(cpu);
 	struct sched_group *sdg = sd->groups;
+	struct rq *rq = cpu_rq(cpu);
+	u64 limit;
 
 	if (!capacity)
 		capacity = 1;
 
-	cpu_rq(cpu)->cpu_capacity = capacity;
-	trace_sched_cpu_capacity_tp(cpu_rq(cpu));
+	rq->cpu_capacity = capacity;
+	trace_sched_cpu_capacity_tp(rq);
+
+	/*
+	 * Calculate the util at which the task must be considered a misfit.
+	 *
+	 * We must ensure that a task experiences the same ramp-up time to
+	 * reach the max performance point of the system regardless of the CPU
+	 * it is running on (due to invariance, time will stretch and the task
+	 * will take longer to achieve the same util value compared to a task
+	 * running on a big CPU), and that the delay in misfit migration,
+	 * which depends on TICK, doesn't end up hurting it as it can happen
+	 * after we would have crossed this threshold.
+	 *
+	 * To ensure that invariance is taken into account, we don't scale time
+	 * and use it as-is; approximate_util_avg() will then give us the
+	 * threshold.
+	 */
+	limit = approximate_runtime(arch_scale_cpu_capacity(cpu)) * USEC_PER_MSEC;
+	limit -= TICK_USEC; /* sd->balance_interval is more accurate */
+	rq->fits_capacity_threshold = approximate_util_avg(0, limit);
 
 	sdg->sgc->capacity = capacity;
 	sdg->sgc->min_capacity = capacity;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 47f158b2cdc2..ab4672675b84 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1093,6 +1093,7 @@  struct rq {
 	struct sched_domain __rcu	*sd;
 
 	unsigned long		cpu_capacity;
+	unsigned long		fits_capacity_threshold;
 
 	struct balance_callback *balance_callback;