
sysbench throughput degradation in 4.13+

Message ID 20170927135820.61cd077f@cuia.usersys.redhat.com (mailing list archive)
State New, archived

Commit Message

Rik van Riel Sept. 27, 2017, 5:58 p.m. UTC
On Wed, 27 Sep 2017 11:35:30 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Fri, Sep 22, 2017 at 12:12:45PM -0400, Eric Farman wrote:
> > 
> > MySQL.  We've tried a few different configs with both test=oltp and
> > test=threads, but both show the same behavior.  What I have settled on for
> > my repro is the following:
> >   
> 
> Right, didn't even need to run it in a guest to observe a regression.
> 
> So the below cures native sysbench and NAS bench for me, does it also
> work for you virt thingy?
> 
> 
> PRE (current tip/master):
> 
> ivb-ex sysbench:
> 
>   2: [30 secs]     transactions:                        64110  (2136.94 per sec.)
>   5: [30 secs]     transactions:                        143644 (4787.99 per sec.)
>  10: [30 secs]     transactions:                        274298 (9142.93 per sec.)
>  20: [30 secs]     transactions:                        418683 (13955.45 per sec.)
>  40: [30 secs]     transactions:                        320731 (10690.15 per sec.)
>  80: [30 secs]     transactions:                        355096 (11834.28 per sec.)
> 
> hsw-ex NAS:
> 
> OMP_PROC_BIND/lu.C.x_threads_144_run_1.log: Time in seconds =                    18.01
> OMP_PROC_BIND/lu.C.x_threads_144_run_2.log: Time in seconds =                    17.89
> OMP_PROC_BIND/lu.C.x_threads_144_run_3.log: Time in seconds =                    17.93
> lu.C.x_threads_144_run_1.log: Time in seconds =                   434.68
> lu.C.x_threads_144_run_2.log: Time in seconds =                   405.36
> lu.C.x_threads_144_run_3.log: Time in seconds =                   433.83
> 
> 
> POST (+patch):
> 
> ivb-ex sysbench:
> 
>   2: [30 secs]     transactions:                        64494  (2149.75 per sec.)
>   5: [30 secs]     transactions:                        145114 (4836.99 per sec.)
>  10: [30 secs]     transactions:                        278311 (9276.69 per sec.)
>  20: [30 secs]     transactions:                        437169 (14571.60 per sec.)
>  40: [30 secs]     transactions:                        669837 (22326.73 per sec.)
>  80: [30 secs]     transactions:                        631739 (21055.88 per sec.)
> 
> hsw-ex NAS:
> 
> lu.C.x_threads_144_run_1.log: Time in seconds =                    23.36
> lu.C.x_threads_144_run_2.log: Time in seconds =                    22.96
> lu.C.x_threads_144_run_3.log: Time in seconds =                    22.52
> 
> 
> This patch takes out all the shiny wake_affine stuff and goes back to
> utter basics. Rik, was there another NUMA benchmark that wanted your
> fancy stuff? Because NAS isn't it.

I like the simplicity of your approach!  I hope it does not break
stuff like netperf...

I have been working on the patch below, which is much less optimistic
about when to do an affine wakeup than before.

It may be worth testing, in case it works better with some workload,
though relying on cached values still makes me somewhat uneasy.

I will try to get kernels tested here that implement both approaches,
to see what ends up working best.

---8<---
Subject: sched: make wake_affine_llc less eager

With the wake_affine_llc logic, tasks get moved around too eagerly,
and then moved back later, leading to poor performance for some
workloads.

Make wake_affine_llc less eager by comparing the minimum load of
the source LLC with the maximum load of the destination LLC, similar
to how source_load and target_load work for regular migration.

Also, get rid of an overly optimistic test that could potentially
pull across a lot of tasks if the target LLC happened to have fewer
runnable tasks at load balancing time.

Conversely, allow sync wakeups to happen without taking LLC loads
into account, if the waker would leave an idle CPU behind on
the target LLC.

Signed-off-by: Rik van Riel <riel@redhat.com>

---
 include/linux/sched/topology.h |  3 ++-
 kernel/sched/fair.c            | 56 +++++++++++++++++++++++++++++++++---------
 2 files changed, 46 insertions(+), 13 deletions(-)
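
The changelog's reference to source_load() and target_load() is the heart of the idea: a potential migration source is viewed through a min() and the target through a max(), so borderline moves are avoided. A rough standalone model of that pairing (toy names only; the real kernel helpers operate on the rq->cpu_load[] history) looks like this:

/*
 * Toy model of the source_load()/target_load() bias the changelog
 * refers to.  "hist" stands in for a decayed cpu_load[] sample and
 * "now" for the instantaneous weighted load; both names are made up.
 */
unsigned long model_source_load(unsigned long hist, unsigned long now)
{
	return hist < now ? hist : now;	/* min: underestimate the source */
}

unsigned long model_target_load(unsigned long hist, unsigned long now)
{
	return hist > now ? hist : now;	/* max: overestimate the target */
}

The patch applies the same bias at the LLC level: min_max_load() records each CPU's lowest and highest cpu_load[] sample, those values are summed per LLC, and wake_affine_llc() then weighs the destination's max_load against the source's min_load.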

Comments

Eric Farman Sept. 28, 2017, 11:04 a.m. UTC | #1
On 09/27/2017 01:58 PM, Rik van Riel wrote:
> On Wed, 27 Sep 2017 11:35:30 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
>> On Fri, Sep 22, 2017 at 12:12:45PM -0400, Eric Farman wrote:
>>>
>>> MySQL.  We've tried a few different configs with both test=oltp and
>>> test=threads, but both show the same behavior.  What I have settled on for
>>> my repro is the following:
>>>    
>>
>> Right, didn't even need to run it in a guest to observe a regression.
>>
>> So the below cures native sysbench and NAS bench for me, does it also
>> work for you virt thingy?
>>
>>
>> PRE (current tip/master):
>>
>> ivb-ex sysbench:
>>
>>    2: [30 secs]     transactions:                        64110  (2136.94 per sec.)
>>    5: [30 secs]     transactions:                        143644 (4787.99 per sec.)
>>   10: [30 secs]     transactions:                        274298 (9142.93 per sec.)
>>   20: [30 secs]     transactions:                        418683 (13955.45 per sec.)
>>   40: [30 secs]     transactions:                        320731 (10690.15 per sec.)
>>   80: [30 secs]     transactions:                        355096 (11834.28 per sec.)
>>
>> hsw-ex NAS:
>>
>> OMP_PROC_BIND/lu.C.x_threads_144_run_1.log: Time in seconds =                    18.01
>> OMP_PROC_BIND/lu.C.x_threads_144_run_2.log: Time in seconds =                    17.89
>> OMP_PROC_BIND/lu.C.x_threads_144_run_3.log: Time in seconds =                    17.93
>> lu.C.x_threads_144_run_1.log: Time in seconds =                   434.68
>> lu.C.x_threads_144_run_2.log: Time in seconds =                   405.36
>> lu.C.x_threads_144_run_3.log: Time in seconds =                   433.83
>>
>>
>> POST (+patch):
>>
>> ivb-ex sysbench:
>>
>>    2: [30 secs]     transactions:                        64494  (2149.75 per sec.)
>>    5: [30 secs]     transactions:                        145114 (4836.99 per sec.)
>>   10: [30 secs]     transactions:                        278311 (9276.69 per sec.)
>>   20: [30 secs]     transactions:                        437169 (14571.60 per sec.)
>>   40: [30 secs]     transactions:                        669837 (22326.73 per sec.)
>>   80: [30 secs]     transactions:                        631739 (21055.88 per sec.)
>>
>> hsw-ex NAS:
>>
>> lu.C.x_threads_144_run_1.log: Time in seconds =                    23.36
>> lu.C.x_threads_144_run_2.log: Time in seconds =                    22.96
>> lu.C.x_threads_144_run_3.log: Time in seconds =                    22.52
>>
>>
>> This patch takes out all the shiny wake_affine stuff and goes back to
>> utter basics. Rik, was there another NUMA benchmark that wanted your
>> fancy stuff? Because NAS isn't it.
> 
> I like the simplicity of your approach!  I hope it does not break
> stuff like netperf...
> 
> I have been working on the patch below, which is much less optimistic
> about when to do an affine wakeup than before.
> 
> It may be worth testing, in case it works better with some workload,
> though relying on cached values still makes me somewhat uneasy.
> 

Here are numbers for our environment, to compare the two patches:

sysbench --test=threads:
next-20170926:		25470.8
-with-Peters-patch:	29559.1
-with-Riks-patch:	29283

sysbench --test=oltp:
next-20170926:		5722.37
-with-Peters-patch:	9623.45
-with-Riks-patch:	9360.59

We didn't record host CPU migrations in all scenarios, but a spot check 
showed a similar reduction with both patches.

  - Eric

> I will try to get kernels tested here that implement both approaches,
> to see what ends up working best.
> 
> ---8<---
> Subject: sched: make wake_affine_llc less eager
> 
> With the wake_affine_llc logic, tasks get moved around too eagerly,
> and then moved back later, leading to poor performance for some
> workloads.
> 
> Make wake_affine_llc less eager by comparing the minimum load of
> the source LLC with the maximum load of the destination LLC, similar
> to how source_load and target_load work for regular migration.
> 
> Also, get rid of an overly optimistic test that could potentially
> pull across a lot of tasks if the target LLC happened to have fewer
> runnable tasks at load balancing time.
> 
> Conversely, allow sync wakeups to happen without taking LLC loads
> into account, if the waker would leave an idle CPU behind on
> the target LLC.
> 
> Signed-off-by: Rik van Riel <riel@redhat.com>
> 
> ---
>   include/linux/sched/topology.h |  3 ++-
>   kernel/sched/fair.c            | 56 +++++++++++++++++++++++++++++++++---------
>   2 files changed, 46 insertions(+), 13 deletions(-)
> 
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index d7b6dab956ec..0c295ff5049b 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -77,7 +77,8 @@ struct sched_domain_shared {
>   	 * used by wake_affine().
>   	 */
>   	unsigned long	nr_running;
> -	unsigned long	load;
> +	unsigned long	min_load;
> +	unsigned long	max_load;
>   	unsigned long	capacity;
>   };
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 86195add977f..7740c6776e08 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5239,6 +5239,23 @@ static unsigned long target_load(int cpu, int type)
>   	return max(rq->cpu_load[type-1], total);
>   }
> 
> +static void min_max_load(int cpu, unsigned long *min_load,
> +		 	 unsigned long *max_load)
> +{
> +	struct rq *rq = cpu_rq(cpu);
> +	unsigned long minl = ULONG_MAX;
> +	unsigned long maxl = 0;
> +	int i;
> +
> +	for (i = 0; i < CPU_LOAD_IDX_MAX; i++) {
> +		minl = min(minl, rq->cpu_load[i]);
> +		maxl = max(maxl, rq->cpu_load[i]);
> +	}
> +
> +	*min_load = minl;
> +	*max_load = maxl;
> +}
> +
>   static unsigned long capacity_of(int cpu)
>   {
>   	return cpu_rq(cpu)->cpu_capacity;
> @@ -5310,7 +5327,8 @@ static int wake_wide(struct task_struct *p)
> 
>   struct llc_stats {
>   	unsigned long	nr_running;
> -	unsigned long	load;
> +	unsigned long	min_load;
> +	unsigned long	max_load;
>   	unsigned long	capacity;
>   	int		has_capacity;
>   };
> @@ -5323,7 +5341,8 @@ static bool get_llc_stats(struct llc_stats *stats, int cpu)
>   		return false;
> 
>   	stats->nr_running	= READ_ONCE(sds->nr_running);
> -	stats->load		= READ_ONCE(sds->load);
> +	stats->min_load		= READ_ONCE(sds->min_load);
> +	stats->max_load		= READ_ONCE(sds->max_load);
>   	stats->capacity		= READ_ONCE(sds->capacity);
>   	stats->has_capacity	= stats->nr_running < per_cpu(sd_llc_size, cpu);
> 
> @@ -5359,10 +5378,14 @@ wake_affine_llc(struct sched_domain *sd, struct task_struct *p,
>   		unsigned long current_load = task_h_load(current);
> 
>   		/* in this case load hits 0 and this LLC is considered 'idle' */
> -		if (current_load > this_stats.load)
> +		if (current_load > this_stats.max_load)
> +			return true;
> +
> +		/* allow if the CPU would go idle, regardless of LLC load */
> +		if (current_load >= target_load(this_cpu, sd->wake_idx))
>   			return true;
> 
> -		this_stats.load -= current_load;
> +		this_stats.max_load -= current_load;
>   	}
> 
>   	/*
> @@ -5375,10 +5398,6 @@ wake_affine_llc(struct sched_domain *sd, struct task_struct *p,
>   	if (prev_stats.has_capacity && prev_stats.nr_running < this_stats.nr_running+1)
>   		return false;
> 
> -	/* if this cache has capacity, come here */
> -	if (this_stats.has_capacity && this_stats.nr_running+1 < prev_stats.nr_running)
> -		return true;
> -
>   	/*
>   	 * Check to see if we can move the load without causing too much
>   	 * imbalance.
> @@ -5391,8 +5410,8 @@ wake_affine_llc(struct sched_domain *sd, struct task_struct *p,
>   	prev_eff_load = 100 + (sd->imbalance_pct - 100) / 2;
>   	prev_eff_load *= this_stats.capacity;
> 
> -	this_eff_load *= this_stats.load + task_load;
> -	prev_eff_load *= prev_stats.load - task_load;
> +	this_eff_load *= this_stats.max_load + task_load;
> +	prev_eff_load *= prev_stats.min_load - task_load;
> 
>   	return this_eff_load <= prev_eff_load;
>   }
> @@ -7033,6 +7052,8 @@ enum group_type {
>   struct sg_lb_stats {
>   	unsigned long avg_load; /*Avg load across the CPUs of the group */
>   	unsigned long group_load; /* Total load over the CPUs of the group */
> +	unsigned long min_load;
> +	unsigned long max_load;
>   	unsigned long sum_weighted_load; /* Weighted load of group's tasks */
>   	unsigned long load_per_task;
>   	unsigned long group_capacity;
> @@ -7059,6 +7080,8 @@ struct sd_lb_stats {
>   	unsigned long total_load;	/* Total load of all groups in sd */
>   	unsigned long total_capacity;	/* Total capacity of all groups in sd */
>   	unsigned long avg_load;	/* Average load across all groups in sd */
> +	unsigned long min_load;		/* Sum of lowest loadavg on CPUs */
> +	unsigned long max_load;		/* Sum of highest loadavg on CPUs */
> 
>   	struct sg_lb_stats busiest_stat;/* Statistics of the busiest group */
>   	struct sg_lb_stats local_stat;	/* Statistics of the local group */
> @@ -7077,6 +7100,8 @@ static inline void init_sd_lb_stats(struct sd_lb_stats *sds)
>   		.local = NULL,
>   		.total_running = 0UL,
>   		.total_load = 0UL,
> +		.min_load = 0UL,
> +		.max_load = 0UL,
>   		.total_capacity = 0UL,
>   		.busiest_stat = {
>   			.avg_load = 0UL,
> @@ -7358,7 +7383,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>   			int local_group, struct sg_lb_stats *sgs,
>   			bool *overload)
>   {
> -	unsigned long load;
> +	unsigned long load, min_load, max_load;
>   	int i, nr_running;
> 
>   	memset(sgs, 0, sizeof(*sgs));
> @@ -7372,7 +7397,11 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>   		else
>   			load = source_load(i, load_idx);
> 
> +		min_max_load(i, &min_load, &max_load);
> +
>   		sgs->group_load += load;
> +		sgs->min_load += min_load;
> +		sgs->max_load += max_load;
>   		sgs->group_util += cpu_util(i);
>   		sgs->sum_nr_running += rq->cfs.h_nr_running;
> 
> @@ -7569,6 +7598,8 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>   		/* Now, start updating sd_lb_stats */
>   		sds->total_running += sgs->sum_nr_running;
>   		sds->total_load += sgs->group_load;
> +		sds->min_load += sgs->min_load;
> +		sds->max_load += sgs->max_load;
>   		sds->total_capacity += sgs->group_capacity;
> 
>   		sg = sg->next;
> @@ -7596,7 +7627,8 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>   	 * XXX fix that.
>   	 */
>   	WRITE_ONCE(shared->nr_running,	sds->total_running);
> -	WRITE_ONCE(shared->load,	sds->total_load);
> +	WRITE_ONCE(shared->min_load,	sds->min_load);
> +	WRITE_ONCE(shared->max_load,	sds->max_load);
>   	WRITE_ONCE(shared->capacity,	sds->total_capacity);
>   }
> 
> 
>
Peter Zijlstra Sept. 28, 2017, 12:36 p.m. UTC | #2
On Wed, Sep 27, 2017 at 01:58:20PM -0400, Rik van Riel wrote:
> I like the simplicity of your approach!  I hope it does not break
> stuff like netperf...

So the old approach that looks at the weight of the two CPUs behaves
slightly better in the overloaded case. On the threads==nr_cpus load
points they match fairly evenly.

I seem to have misplaced my netperf scripts, but I'll have a play with
it.
Peter Zijlstra Sept. 28, 2017, 12:37 p.m. UTC | #3
On Wed, Sep 27, 2017 at 01:58:20PM -0400, Rik van Riel wrote:
> @@ -5359,10 +5378,14 @@ wake_affine_llc(struct sched_domain *sd, struct task_struct *p,
>  		unsigned long current_load = task_h_load(current);
>  
>  		/* in this case load hits 0 and this LLC is considered 'idle' */
> -		if (current_load > this_stats.load)
> +		if (current_load > this_stats.max_load)
> +			return true;
> +
> +		/* allow if the CPU would go idle, regardless of LLC load */
> +		if (current_load >= target_load(this_cpu, sd->wake_idx))
>  			return true;
>  
> -		this_stats.load -= current_load;
> +		this_stats.max_load -= current_load;
>  	}
>  
>  	/*
> @@ -5375,10 +5398,6 @@ wake_affine_llc(struct sched_domain *sd, struct task_struct *p,
>  	if (prev_stats.has_capacity && prev_stats.nr_running < this_stats.nr_running+1)
>  		return false;
>  
> -	/* if this cache has capacity, come here */
> -	if (this_stats.has_capacity && this_stats.nr_running+1 < prev_stats.nr_running)
> -		return true;
> -
>  	/*
>  	 * Check to see if we can move the load without causing too much
>  	 * imbalance.
> @@ -5391,8 +5410,8 @@ wake_affine_llc(struct sched_domain *sd, struct task_struct *p,
>  	prev_eff_load = 100 + (sd->imbalance_pct - 100) / 2;
>  	prev_eff_load *= this_stats.capacity;
>  
> -	this_eff_load *= this_stats.load + task_load;
> -	prev_eff_load *= prev_stats.load - task_load;
> +	this_eff_load *= this_stats.max_load + task_load;
> +	prev_eff_load *= prev_stats.min_load - task_load;
>  
>  	return this_eff_load <= prev_eff_load;
>  }

So I would really like a workload that needs this LLC/NUMA stuff.
Because I much prefer the simpler 'on which of these two CPUs can I run
soonest' approach.
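
A minimal standalone sketch of that simpler two-CPU idea (hypothetical struct and function names; this is not the patch Peter actually benchmarked, which is not included in this thread) could look like:

/*
 * Illustrative-only model of "on which of these two CPUs can I run
 * soonest": prefer an idle CPU, otherwise the lighter runqueue.
 */
struct cpu_state {
	int idle;		/* non-zero if the CPU is currently idle */
	unsigned long load;	/* runnable load tracked for the CPU     */
};

int pick_wake_cpu(int this_cpu, const struct cpu_state *this,
		  int prev_cpu, const struct cpu_state *prev)
{
	if (this->idle)
		return this_cpu;
	if (prev->idle)
		return prev_cpu;

	/* Neither CPU is idle: go where the runnable load is lower. */
	return this->load <= prev->load ? this_cpu : prev_cpu;
}

The appeal is that no cached LLC-wide statistics are needed at wakeup time; as noted elsewhere in the thread, the cost is a small regression in the overloaded case.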
Matt Fleming Oct. 2, 2017, 10:53 p.m. UTC | #4
On Wed, 27 Sep, at 01:58:20PM, Rik van Riel wrote:
> 
> I like the simplicity of your approach!  I hope it does not break
> stuff like netperf...
> 
> I have been working on the patch below, which is much less optimistic
> about when to do an affine wakeup than before.

Running netperf for this patch and Peter's patch shows that Peter's
comes out on top, with scores pretty close to v4.12 in most places on
my 2-NUMA node 48-CPU Xeon box.

I haven't dug any further into why v4.13-peterz+ is worse than v4.12,
but I will next week.

netperf-tcp
                                4.12.0                 4.13.0                 4.13.0                 4.13.0
                               default                default                peterz+                  riel+
Min       64        1653.72 (   0.00%)     1554.29 (  -6.01%)     1601.71 (  -3.15%)     1627.01 (  -1.62%)
Min       128       3240.96 (   0.00%)     2858.49 ( -11.80%)     3122.62 (  -3.65%)     3063.67 (  -5.47%)
Min       256       5840.55 (   0.00%)     4077.43 ( -30.19%)     5529.98 (  -5.32%)     5362.18 (  -8.19%)
Min       1024     16812.11 (   0.00%)    11899.72 ( -29.22%)    16335.83 (  -2.83%)    15075.24 ( -10.33%)
Min       2048     26875.79 (   0.00%)    18852.35 ( -29.85%)    25902.85 (  -3.62%)    22804.82 ( -15.15%)
Min       3312     33060.18 (   0.00%)    20984.28 ( -36.53%)    32817.82 (  -0.73%)    29161.49 ( -11.79%)
Min       4096     34513.24 (   0.00%)    23253.94 ( -32.62%)    34167.80 (  -1.00%)    29349.09 ( -14.96%)
Min       8192     39836.88 (   0.00%)    28881.63 ( -27.50%)    39613.28 (  -0.56%)    35307.95 ( -11.37%)
Min       16384    44203.84 (   0.00%)    31616.74 ( -28.48%)    43608.86 (  -1.35%)    38130.44 ( -13.74%)
Hmean     64        1686.58 (   0.00%)     1613.25 (  -4.35%)     1657.04 (  -1.75%)     1655.38 (  -1.85%)
Hmean     128       3361.84 (   0.00%)     2945.34 ( -12.39%)     3173.47 (  -5.60%)     3122.38 (  -7.12%)
Hmean     256       5993.92 (   0.00%)     4423.32 ( -26.20%)     5618.26 (  -6.27%)     5523.72 (  -7.84%)
Hmean     1024     17225.83 (   0.00%)    12314.23 ( -28.51%)    16574.85 (  -3.78%)    15644.71 (  -9.18%)
Hmean     2048     27944.22 (   0.00%)    21301.63 ( -23.77%)    26395.38 (  -5.54%)    24067.57 ( -13.87%)
Hmean     3312     33760.48 (   0.00%)    22361.07 ( -33.77%)    33198.32 (  -1.67%)    30055.64 ( -10.97%)
Hmean     4096     35077.74 (   0.00%)    29153.73 ( -16.89%)    34479.40 (  -1.71%)    31215.64 ( -11.01%)
Hmean     8192     40674.31 (   0.00%)    33493.01 ( -17.66%)    40443.22 (  -0.57%)    37298.58 (  -8.30%)
Hmean     16384    45492.12 (   0.00%)    37177.64 ( -18.28%)    44308.62 (  -2.60%)    40728.33 ( -10.47%)
Max       64        1745.95 (   0.00%)     1649.03 (  -5.55%)     1710.52 (  -2.03%)     1702.65 (  -2.48%)
Max       128       3509.96 (   0.00%)     3082.35 ( -12.18%)     3204.19 (  -8.71%)     3174.41 (  -9.56%)
Max       256       6138.35 (   0.00%)     4687.62 ( -23.63%)     5694.52 (  -7.23%)     5722.08 (  -6.78%)
Max       1024     17732.13 (   0.00%)    13270.42 ( -25.16%)    16838.69 (  -5.04%)    16580.18 (  -6.50%)
Max       2048     28907.99 (   0.00%)    24816.39 ( -14.15%)    26792.86 (  -7.32%)    25003.60 ( -13.51%)
Max       3312     34512.60 (   0.00%)    23510.32 ( -31.88%)    33762.47 (  -2.17%)    31676.54 (  -8.22%)
Max       4096     35918.95 (   0.00%)    35245.77 (  -1.87%)    34866.23 (  -2.93%)    32537.07 (  -9.42%)
Max       8192     41749.55 (   0.00%)    41068.52 (  -1.63%)    42164.53 (   0.99%)    40105.50 (  -3.94%)
Max       16384    47234.74 (   0.00%)    41728.66 ( -11.66%)    45387.40 (  -3.91%)    44107.25 (  -6.62%)
Peter Zijlstra Oct. 3, 2017, 8:39 a.m. UTC | #5
On Mon, Oct 02, 2017 at 11:53:12PM +0100, Matt Fleming wrote:
> On Wed, 27 Sep, at 01:58:20PM, Rik van Riel wrote:
> > 
> > I like the simplicity of your approach!  I hope it does not break
> > stuff like netperf...
> > 
> > I have been working on the patch below, which is much less optimistic
> > about when to do an affine wakeup than before.
> 
> Running netperf for this patch and Peter's patch shows that Peter's
> comes out on top, with scores pretty close to v4.12 in most places on
> my 2-NUMA node 48-CPU Xeon box.
> 
> I haven't dug any further into why v4.13-peterz+ is worse than v4.12,
> but I will next week.

So I was waiting for Rik, who promised to run a bunch of NUMA workloads
over the weekend.

The trivial thing regresses a wee bit in the overloaded case; I've not
yet tried to fix it.
Rik van Riel Oct. 3, 2017, 4:02 p.m. UTC | #6
On Tue, 2017-10-03 at 10:39 +0200, Peter Zijlstra wrote:
> On Mon, Oct 02, 2017 at 11:53:12PM +0100, Matt Fleming wrote:
> > On Wed, 27 Sep, at 01:58:20PM, Rik van Riel wrote:
> > > 
> > > I like the simplicity of your approach!  I hope it does not break
> > > stuff like netperf...
> > > 
> > > I have been working on the patch below, which is much less optimistic
> > > about when to do an affine wakeup than before.
> > 
> > Running netperf for this patch and Peter's patch shows that Peter's
> > comes out on top, with scores pretty close to v4.12 in most places on
> > my 2-NUMA node 48-CPU Xeon box.
> > 
> > I haven't dug any further into why v4.13-peterz+ is worse than v4.12,
> > but I will next week.
> 
> So I was waiting for Rik, who promised to run a bunch of NUMA workloads
> over the weekend.
> 
> The trivial thing regresses a wee bit in the overloaded case; I've not
> yet tried to fix it.

In Jirka's tests, your simple patch also came out
on top.

Patch

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index d7b6dab956ec..0c295ff5049b 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -77,7 +77,8 @@  struct sched_domain_shared {
 	 * used by wake_affine().
 	 */
 	unsigned long	nr_running;
-	unsigned long	load;
+	unsigned long	min_load;
+	unsigned long	max_load;
 	unsigned long	capacity;
 };
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 86195add977f..7740c6776e08 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5239,6 +5239,23 @@  static unsigned long target_load(int cpu, int type)
 	return max(rq->cpu_load[type-1], total);
 }
 
+static void min_max_load(int cpu, unsigned long *min_load,
+		 	 unsigned long *max_load)
+{
+	struct rq *rq = cpu_rq(cpu);
+	unsigned long minl = ULONG_MAX;
+	unsigned long maxl = 0;
+	int i;
+
+	for (i = 0; i < CPU_LOAD_IDX_MAX; i++) {
+		minl = min(minl, rq->cpu_load[i]);
+		maxl = max(maxl, rq->cpu_load[i]);
+	}
+
+	*min_load = minl;
+	*max_load = maxl;
+}
+
 static unsigned long capacity_of(int cpu)
 {
 	return cpu_rq(cpu)->cpu_capacity;
@@ -5310,7 +5327,8 @@  static int wake_wide(struct task_struct *p)
 
 struct llc_stats {
 	unsigned long	nr_running;
-	unsigned long	load;
+	unsigned long	min_load;
+	unsigned long	max_load;
 	unsigned long	capacity;
 	int		has_capacity;
 };
@@ -5323,7 +5341,8 @@  static bool get_llc_stats(struct llc_stats *stats, int cpu)
 		return false;
 
 	stats->nr_running	= READ_ONCE(sds->nr_running);
-	stats->load		= READ_ONCE(sds->load);
+	stats->min_load		= READ_ONCE(sds->min_load);
+	stats->max_load		= READ_ONCE(sds->max_load);
 	stats->capacity		= READ_ONCE(sds->capacity);
 	stats->has_capacity	= stats->nr_running < per_cpu(sd_llc_size, cpu);
 
@@ -5359,10 +5378,14 @@  wake_affine_llc(struct sched_domain *sd, struct task_struct *p,
 		unsigned long current_load = task_h_load(current);
 
 		/* in this case load hits 0 and this LLC is considered 'idle' */
-		if (current_load > this_stats.load)
+		if (current_load > this_stats.max_load)
+			return true;
+
+		/* allow if the CPU would go idle, regardless of LLC load */
+		if (current_load >= target_load(this_cpu, sd->wake_idx))
 			return true;
 
-		this_stats.load -= current_load;
+		this_stats.max_load -= current_load;
 	}
 
 	/*
@@ -5375,10 +5398,6 @@  wake_affine_llc(struct sched_domain *sd, struct task_struct *p,
 	if (prev_stats.has_capacity && prev_stats.nr_running < this_stats.nr_running+1)
 		return false;
 
-	/* if this cache has capacity, come here */
-	if (this_stats.has_capacity && this_stats.nr_running+1 < prev_stats.nr_running)
-		return true;
-
 	/*
 	 * Check to see if we can move the load without causing too much
 	 * imbalance.
@@ -5391,8 +5410,8 @@  wake_affine_llc(struct sched_domain *sd, struct task_struct *p,
 	prev_eff_load = 100 + (sd->imbalance_pct - 100) / 2;
 	prev_eff_load *= this_stats.capacity;
 
-	this_eff_load *= this_stats.load + task_load;
-	prev_eff_load *= prev_stats.load - task_load;
+	this_eff_load *= this_stats.max_load + task_load;
+	prev_eff_load *= prev_stats.min_load - task_load;
 
 	return this_eff_load <= prev_eff_load;
 }
@@ -7033,6 +7052,8 @@  enum group_type {
 struct sg_lb_stats {
 	unsigned long avg_load; /*Avg load across the CPUs of the group */
 	unsigned long group_load; /* Total load over the CPUs of the group */
+	unsigned long min_load;
+	unsigned long max_load;
 	unsigned long sum_weighted_load; /* Weighted load of group's tasks */
 	unsigned long load_per_task;
 	unsigned long group_capacity;
@@ -7059,6 +7080,8 @@  struct sd_lb_stats {
 	unsigned long total_load;	/* Total load of all groups in sd */
 	unsigned long total_capacity;	/* Total capacity of all groups in sd */
 	unsigned long avg_load;	/* Average load across all groups in sd */
+	unsigned long min_load;		/* Sum of lowest loadavg on CPUs */
+	unsigned long max_load;		/* Sum of highest loadavg on CPUs */
 
 	struct sg_lb_stats busiest_stat;/* Statistics of the busiest group */
 	struct sg_lb_stats local_stat;	/* Statistics of the local group */
@@ -7077,6 +7100,8 @@  static inline void init_sd_lb_stats(struct sd_lb_stats *sds)
 		.local = NULL,
 		.total_running = 0UL,
 		.total_load = 0UL,
+		.min_load = 0UL,
+		.max_load = 0UL,
 		.total_capacity = 0UL,
 		.busiest_stat = {
 			.avg_load = 0UL,
@@ -7358,7 +7383,7 @@  static inline void update_sg_lb_stats(struct lb_env *env,
 			int local_group, struct sg_lb_stats *sgs,
 			bool *overload)
 {
-	unsigned long load;
+	unsigned long load, min_load, max_load;
 	int i, nr_running;
 
 	memset(sgs, 0, sizeof(*sgs));
@@ -7372,7 +7397,11 @@  static inline void update_sg_lb_stats(struct lb_env *env,
 		else
 			load = source_load(i, load_idx);
 
+		min_max_load(i, &min_load, &max_load);
+
 		sgs->group_load += load;
+		sgs->min_load += min_load;
+		sgs->max_load += max_load;
 		sgs->group_util += cpu_util(i);
 		sgs->sum_nr_running += rq->cfs.h_nr_running;
 
@@ -7569,6 +7598,8 @@  static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 		/* Now, start updating sd_lb_stats */
 		sds->total_running += sgs->sum_nr_running;
 		sds->total_load += sgs->group_load;
+		sds->min_load += sgs->min_load;
+		sds->max_load += sgs->max_load;
 		sds->total_capacity += sgs->group_capacity;
 
 		sg = sg->next;
@@ -7596,7 +7627,8 @@  static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 	 * XXX fix that.
 	 */
 	WRITE_ONCE(shared->nr_running,	sds->total_running);
-	WRITE_ONCE(shared->load,	sds->total_load);
+	WRITE_ONCE(shared->min_load,	sds->min_load);
+	WRITE_ONCE(shared->max_load,	sds->max_load);
 	WRITE_ONCE(shared->capacity,	sds->total_capacity);
 }
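
To make the final comparison concrete, here is a toy recomputation of the patched check with invented numbers. The imbalance_pct value, capacities and loads are illustrative only, and the initialisation of this_eff_load sits above the quoted hunk, so the 100 * prev_capacity factor is an assumption by analogy with the visible prev_eff_load lines:

#include <stdio.h>

/*
 * Toy recomputation of the comparison at the end of the patched
 * wake_affine_llc().  Every input below is invented for illustration;
 * none of these numbers come from the benchmarks in this thread.
 */
int main(void)
{
	unsigned long imbalance_pct = 117;	/* assumed domain setting       */
	unsigned long this_capacity = 1024;	/* destination LLC capacity     */
	unsigned long prev_capacity = 1024;	/* source LLC capacity          */
	unsigned long task_load     = 100;	/* task_h_load(p)               */
	unsigned long this_max_load = 800;	/* pessimistic destination view */
	unsigned long prev_min_load = 1100;	/* optimistic source view       */

	/* this_eff_load = 100 * prev_capacity is assumed from context above the hunk */
	unsigned long this_eff_load = 100 * prev_capacity *
				      (this_max_load + task_load);
	unsigned long prev_eff_load = (100 + (imbalance_pct - 100) / 2) *
				      this_capacity *
				      (prev_min_load - task_load);

	/* 100*1024*900 = 92,160,000 vs 108*1024*1000 = 110,592,000: allowed */
	printf("affine wakeup %s\n",
	       this_eff_load <= prev_eff_load ? "allowed" : "rejected");
	return 0;
}

With the destination viewed through its summed maxima and the source through its summed minima, the wakeup is only pulled across when the gap is large enough to survive both the pessimistic view and the imbalance_pct margin.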