
sched: Consider capacity for certain load balancing decisions

Message ID 20230201012032.2874481-1-xii@google.com (mailing list archive)
State New, archived

Commit Message

Xi Wang Feb. 1, 2023, 1:20 a.m. UTC
After load balancing was split into different scenarios, CPU capacity
is ignored for the "migrate_task" case, which means a thread can stay
on a softirq-heavy CPU for an extended amount of time.

By comparing nr_running/capacity instead of just nr_running, we can add
CPU capacity back into "migrate_task" decisions. This benefits
workloads running on machines with heavy network traffic. The change
is unlikely to cause serious problems for other workloads, though some
corner cases may still need to be considered.

Signed-off-by: Xi Wang <xii@google.com>
---
 kernel/sched/fair.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Comments

Peter Zijlstra Feb. 3, 2023, 9:51 a.m. UTC | #1
On Tue, Jan 31, 2023 at 05:20:32PM -0800, Xi Wang wrote:
> After load balancing was split into different scenarios, CPU capacity
> is ignored for the "migrate_task" case, which means a thread can stay
> on a softirq heavy cpu for an extended amount of time.
> 
> By comparing nr_running/capacity instead of just nr_running we can add
> CPU capacity back into "migrate_task" decisions. This benefits
> workloads running on machines with heavy network traffic. The change
> is unlikely to cause serious problems for other workloads but maybe
> some corner cases still need to be considered.
> 
> Signed-off-by: Xi Wang <xii@google.com>
> ---
>  kernel/sched/fair.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 0f8736991427..aad14bc04544 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10368,8 +10368,9 @@ static struct rq *find_busiest_queue(struct lb_env *env,
>  			break;
>  
>  		case migrate_task:
> -			if (busiest_nr < nr_running) {
> +			if (busiest_nr * capacity < nr_running * busiest_capacity) {
>  				busiest_nr = nr_running;
> +				busiest_capacity = capacity;
>  				busiest = rq;
>  			}
>  			break;

I don't think this is correct. The migrate_task case is work-conserving,
and your change can severely break that I think.
Xi Wang Feb. 3, 2023, 6:47 p.m. UTC | #2
On Fri, Feb 3, 2023 at 1:51 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Jan 31, 2023 at 05:20:32PM -0800, Xi Wang wrote:
> > After load balancing was split into different scenarios, CPU capacity
> > is ignored for the "migrate_task" case, which means a thread can stay
> > on a softirq heavy cpu for an extended amount of time.
> >
> > By comparing nr_running/capacity instead of just nr_running we can add
> > CPU capacity back into "migrate_task" decisions. This benefits
> > workloads running on machines with heavy network traffic. The change
> > is unlikely to cause serious problems for other workloads but maybe
> > some corner cases still need to be considered.
> >
> > Signed-off-by: Xi Wang <xii@google.com>
> > ---
> >  kernel/sched/fair.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 0f8736991427..aad14bc04544 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -10368,8 +10368,9 @@ static struct rq *find_busiest_queue(struct lb_env *env,
> >                       break;
> >
> >               case migrate_task:
> > -                     if (busiest_nr < nr_running) {
> > +                     if (busiest_nr * capacity < nr_running * busiest_capacity) {
> >                               busiest_nr = nr_running;
> > +                             busiest_capacity = capacity;
> >                               busiest = rq;
> >                       }
> >                       break;
>
> I don't think this is correct. The migrate_task case is work-conserving,
> and your change can severely break that I think.
>

I think you meant this kind of scenario:
cpu 0: idle
cpu 1: 2 tasks
cpu 2: 1 task but only has 30% of capacity
Pulling from cpu 2 is good for the task but lowers the overall cpu
throughput.

The problem we have is:
cpu 0: idle
cpu 1: 1 task
cpu 2: 1 task but only has 60% of capacity due to net softirq
The task on cpu 2 stays there and runs slower. (This can also be
considered non-work-conserving if we account for softirq like a task.)

Maybe the logic can be merged like this: Use capacity but pick from
nr_running > 1 cpus first, then nr_running == 1 cpus if not found.
Vincent Guittot Feb. 6, 2023, 9:28 a.m. UTC | #3
On Fri, 3 Feb 2023 at 19:47, Xi Wang <xii@google.com> wrote:
>
> On Fri, Feb 3, 2023 at 1:51 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Tue, Jan 31, 2023 at 05:20:32PM -0800, Xi Wang wrote:
> > > After load balancing was split into different scenarios, CPU capacity
> > > is ignored for the "migrate_task" case, which means a thread can stay
> > > on a softirq heavy cpu for an extended amount of time.
> > >
> > > By comparing nr_running/capacity instead of just nr_running we can add
> > > CPU capacity back into "migrate_task" decisions. This benefits
> > > workloads running on machines with heavy network traffic. The change
> > > is unlikely to cause serious problems for other workloads but maybe
> > > some corner cases still need to be considered.
> > >
> > > Signed-off-by: Xi Wang <xii@google.com>
> > > ---
> > >  kernel/sched/fair.c | 3 ++-
> > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 0f8736991427..aad14bc04544 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -10368,8 +10368,9 @@ static struct rq *find_busiest_queue(struct lb_env *env,
> > >                       break;
> > >
> > >               case migrate_task:
> > > -                     if (busiest_nr < nr_running) {
> > > +                     if (busiest_nr * capacity < nr_running * busiest_capacity) {
> > >                               busiest_nr = nr_running;
> > > +                             busiest_capacity = capacity;
> > >                               busiest = rq;
> > >                       }
> > >                       break;
> >
> > I don't think this is correct. The migrate_task case is work-conserving,
> > and your change can severely break that I think.
> >
>
> I think you meant this kind of scenario:
> cpu 0: idle
> cpu 1: 2 tasks
> cpu 2: 1 task but only has 30% of capacity
> Pulling from cpu 2 is good for the task but lowers the overall cpu
> throughput.
>
> The problem we have is:
> cpu 0: idle
> cpu 1: 1 task
> cpu 2: 1 task but only has 60% of capacity due to net softirq
> The task on cpu 2 stays there and runs slower. (This can also be
> considered non work-conserving if we account softirq like a task.)

When load_balance runs for these 2 cpus, cpu2 should be tagged as
misfit_task because of its reduced capacity, and cpu0 should then pull
the task from it in priority. Do you have more details on your
topology?


>
> Maybe the logic can be merged like this: Use capacity but pick from
> nr_running > 1 cpus first, then nr_running == 1 cpus if not found.
Xi Wang Feb. 7, 2023, 1:03 a.m. UTC | #4
On Mon, Feb 6, 2023 at 1:28 AM Vincent Guittot
<vincent.guittot@linaro.org> wrote:
>
> On Fri, 3 Feb 2023 at 19:47, Xi Wang <xii@google.com> wrote:
> >
> > On Fri, Feb 3, 2023 at 1:51 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > On Tue, Jan 31, 2023 at 05:20:32PM -0800, Xi Wang wrote:
> > > > After load balancing was split into different scenarios, CPU capacity
> > > > is ignored for the "migrate_task" case, which means a thread can stay
> > > > on a softirq heavy cpu for an extended amount of time.
> > > >
> > > > By comparing nr_running/capacity instead of just nr_running we can add
> > > > CPU capacity back into "migrate_task" decisions. This benefits
> > > > workloads running on machines with heavy network traffic. The change
> > > > is unlikely to cause serious problems for other workloads but maybe
> > > > some corner cases still need to be considered.
> > > >
> > > > Signed-off-by: Xi Wang <xii@google.com>
> > > > ---
> > > >  kernel/sched/fair.c | 3 ++-
> > > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > index 0f8736991427..aad14bc04544 100644
> > > > --- a/kernel/sched/fair.c
> > > > +++ b/kernel/sched/fair.c
> > > > @@ -10368,8 +10368,9 @@ static struct rq *find_busiest_queue(struct lb_env *env,
> > > >                       break;
> > > >
> > > >               case migrate_task:
> > > > -                     if (busiest_nr < nr_running) {
> > > > +                     if (busiest_nr * capacity < nr_running * busiest_capacity) {
> > > >                               busiest_nr = nr_running;
> > > > +                             busiest_capacity = capacity;
> > > >                               busiest = rq;
> > > >                       }
> > > >                       break;
> > >
> > > I don't think this is correct. The migrate_task case is work-conserving,
> > > and your change can severely break that I think.
> > >
> >
> > I think you meant this kind of scenario:
> > cpu 0: idle
> > cpu 1: 2 tasks
> > cpu 2: 1 task but only has 30% of capacity
> > Pulling from cpu 2 is good for the task but lowers the overall cpu
> > throughput.
> >
> > The problem we have is:
> > cpu 0: idle
> > cpu 1: 1 task
> > cpu 2: 1 task but only has 60% of capacity due to net softirq
> > The task on cpu 2 stays there and runs slower. (This can also be
> > considered non work-conserving if we account softirq like a task.)
>
> When load_balance runs for this 2 cpus, cpu2 should be tagged as
> misfit_task because of reduce_capacity and should be selected in
> priority by cpu0 to pull the task. Do you have more details on your
> topology ?

The topology is a 64-core AMD part with 2 hyperthreads per core.

I am not familiar with the related code, but I think there are cases
where a task fits CPU capacity yet could still run faster elsewhere,
e.g. bursty workloads: thread pool threads have variable utilization
because they process more or fewer requests based on CPU availability
(picking the next request from a shared queue when the previous one is
done). A thread that has enough CPU cycles but runs slower due to
softirqs can also directly affect application performance.

> >
> > Maybe the logic can be merged like this: Use capacity but pick from
> > nr_running > 1 cpus first, then nr_running == 1 cpus if not found.
Vincent Guittot Feb. 7, 2023, 8:08 a.m. UTC | #5
On Tue, 7 Feb 2023 at 02:03, Xi Wang <xii@google.com> wrote:
>
> On Mon, Feb 6, 2023 at 1:28 AM Vincent Guittot
> <vincent.guittot@linaro.org> wrote:
> >
> > On Fri, 3 Feb 2023 at 19:47, Xi Wang <xii@google.com> wrote:
> > >
> > > On Fri, Feb 3, 2023 at 1:51 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > > >
> > > > On Tue, Jan 31, 2023 at 05:20:32PM -0800, Xi Wang wrote:
> > > > > After load balancing was split into different scenarios, CPU capacity
> > > > > is ignored for the "migrate_task" case, which means a thread can stay
> > > > > on a softirq heavy cpu for an extended amount of time.
> > > > >
> > > > > By comparing nr_running/capacity instead of just nr_running we can add
> > > > > CPU capacity back into "migrate_task" decisions. This benefits
> > > > > workloads running on machines with heavy network traffic. The change
> > > > > is unlikely to cause serious problems for other workloads but maybe
> > > > > some corner cases still need to be considered.
> > > > >
> > > > > Signed-off-by: Xi Wang <xii@google.com>
> > > > > ---
> > > > >  kernel/sched/fair.c | 3 ++-
> > > > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > > index 0f8736991427..aad14bc04544 100644
> > > > > --- a/kernel/sched/fair.c
> > > > > +++ b/kernel/sched/fair.c
> > > > > @@ -10368,8 +10368,9 @@ static struct rq *find_busiest_queue(struct lb_env *env,
> > > > >                       break;
> > > > >
> > > > >               case migrate_task:
> > > > > -                     if (busiest_nr < nr_running) {
> > > > > +                     if (busiest_nr * capacity < nr_running * busiest_capacity) {
> > > > >                               busiest_nr = nr_running;
> > > > > +                             busiest_capacity = capacity;
> > > > >                               busiest = rq;
> > > > >                       }
> > > > >                       break;
> > > >
> > > > I don't think this is correct. The migrate_task case is work-conserving,
> > > > and your change can severely break that I think.
> > > >
> > >
> > > I think you meant this kind of scenario:
> > > cpu 0: idle
> > > cpu 1: 2 tasks
> > > cpu 2: 1 task but only has 30% of capacity
> > > Pulling from cpu 2 is good for the task but lowers the overall cpu
> > > throughput.
> > >
> > > The problem we have is:
> > > cpu 0: idle
> > > cpu 1: 1 task
> > > cpu 2: 1 task but only has 60% of capacity due to net softirq
> > > The task on cpu 2 stays there and runs slower. (This can also be
> > > considered non work-conserving if we account softirq like a task.)
> >
> > When load_balance runs for this 2 cpus, cpu2 should be tagged as
> > misfit_task because of reduce_capacity and should be selected in
> > priority by cpu0 to pull the task. Do you have more details on your
> > topology ?
>
> The topology is 64 core AMD with 2 hyperthreads.
>
> I am not familiar with the related code but I think there are cases
> where a task fits cpu capacity but it can still run faster elsewhere,
> e.g.: Bursty workloads. Thread pool threads with variable utilization
> because it would process more or less requests based on cpu
> availability (pick the next request from a shared queue when the
> previous one is done). A thread having enough cpu cycles but runs
> slower due to softirqs can also directly affect application
> performance.

misfit_task is also used to detect CPUs with reduced capacity. We can
have a look at update_sg_lb_stats(). Your use case above should fall
into this case.

>
> > >
> > > Maybe the logic can be merged like this: Use capacity but pick from
> > > nr_running > 1 cpus first, then nr_running == 1 cpus if not found.

Patch

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0f8736991427..aad14bc04544 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10368,8 +10368,9 @@  static struct rq *find_busiest_queue(struct lb_env *env,
 			break;
 
 		case migrate_task:
-			if (busiest_nr < nr_running) {
+			if (busiest_nr * capacity < nr_running * busiest_capacity) {
 				busiest_nr = nr_running;
+				busiest_capacity = capacity;
 				busiest = rq;
 			}
 			break;