| Message ID | 20230201012032.2874481-1-xii@google.com (mailing list archive) |
|---|---|
| State | New, archived |
| Series | sched: Consider capacity for certain load balancing decisions |
On Tue, Jan 31, 2023 at 05:20:32PM -0800, Xi Wang wrote:
> After load balancing was split into different scenarios, CPU capacity
> is ignored for the "migrate_task" case, which means a thread can stay
> on a softirq heavy cpu for an extended amount of time.
>
> By comparing nr_running/capacity instead of just nr_running we can add
> CPU capacity back into "migrate_task" decisions. This benefits
> workloads running on machines with heavy network traffic. The change
> is unlikely to cause serious problems for other workloads but maybe
> some corner cases still need to be considered.
>
> Signed-off-by: Xi Wang <xii@google.com>
> ---
>  kernel/sched/fair.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 0f8736991427..aad14bc04544 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10368,8 +10368,9 @@ static struct rq *find_busiest_queue(struct lb_env *env,
>  			break;
>
>  		case migrate_task:
> -			if (busiest_nr < nr_running) {
> +			if (busiest_nr * capacity < nr_running * busiest_capacity) {
>  				busiest_nr = nr_running;
> +				busiest_capacity = capacity;
>  				busiest = rq;
>  			}
>  			break;

I don't think this is correct. The migrate_task case is work-conserving, and your change can severely break that I think.
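Peter's objection can be made concrete with a toy model. Under the patch's capacity-weighted rule, a slow single-task CPU can be picked as "busiest" over a CPU with two runnable tasks, which is where work conservation breaks. A minimal standalone sketch (not kernel code; the struct, helper names, and capacity values are illustrative, on the kernel's 1024 == 100% scale):

```c
/* Illustrative stand-in for a runqueue, not kernel code. */
struct cpu_model {
	unsigned int nr_running;	/* runnable tasks on this cpu */
	unsigned long capacity;		/* 1024 == full capacity */
};

/* The old rule: @rq is busier than @busiest iff it has more runnable tasks. */
static int busier_nr(const struct cpu_model *busiest,
		     const struct cpu_model *rq)
{
	return busiest->nr_running < rq->nr_running;
}

/* The proposed rule: compare nr_running / capacity, cross-multiplied
 * so the comparison stays in integer arithmetic. */
static int busier_weighted(const struct cpu_model *busiest,
			   const struct cpu_model *rq)
{
	return busiest->nr_running * rq->capacity <
	       rq->nr_running * busiest->capacity;
}
```

With busiest = {2 tasks, full capacity} and rq = {1 task, ~30% capacity}, busier_weighted() prefers the single-task CPU (2 * 307 < 1 * 1024) while busier_nr() does not, which illustrates both the intended softirq case and Peter's work-conserving counterexample.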
On Fri, Feb 3, 2023 at 1:51 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Jan 31, 2023 at 05:20:32PM -0800, Xi Wang wrote:
> > After load balancing was split into different scenarios, CPU capacity
> > is ignored for the "migrate_task" case, which means a thread can stay
> > on a softirq heavy cpu for an extended amount of time.
> >
> > [...]
>
> I don't think this is correct. The migrate_task case is work-conserving,
> and your change can severely break that I think.

I think you meant this kind of scenario:

  cpu 0: idle
  cpu 1: 2 tasks
  cpu 2: 1 task but only has 30% of capacity

Pulling from cpu 2 is good for the task but lowers the overall cpu throughput.

The problem we have is:

  cpu 0: idle
  cpu 1: 1 task
  cpu 2: 1 task but only has 60% of capacity due to net softirq

The task on cpu 2 stays there and runs slower. (This can also be considered non work-conserving if we account softirq like a task.)

Maybe the logic can be merged like this: use capacity but pick from nr_running > 1 cpus first, then nr_running == 1 cpus if not found.
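The two-pass idea in the last paragraph can be sketched outside the kernel as follows. This is an illustrative model, not the actual find_busiest_queue() implementation; the struct and helper names are hypothetical:

```c
#include <stddef.h>

/* Illustrative stand-in for a runqueue, not kernel code. */
struct cpu_model {
	unsigned int nr_running;	/* runnable tasks on this cpu */
	unsigned long capacity;		/* 1024 == full capacity */
};

/* One pass over the CPUs: among those with at least @min_nr runnable
 * tasks, pick the one with the highest nr_running / capacity ratio,
 * compared via cross-multiplication to avoid integer division. */
static const struct cpu_model *
pick_busiest(const struct cpu_model *cpus, size_t n, unsigned int min_nr)
{
	const struct cpu_model *busiest = NULL;
	size_t i;

	for (i = 0; i < n; i++) {
		const struct cpu_model *rq = &cpus[i];

		if (rq->nr_running < min_nr)
			continue;
		if (!busiest ||
		    busiest->nr_running * rq->capacity <
		    rq->nr_running * busiest->capacity)
			busiest = rq;
	}
	return busiest;
}

/* The merged logic proposed above: prefer genuinely overloaded CPUs
 * (nr_running > 1) so work-conserving behaviour is preserved, and only
 * fall back to capacity-reduced single-task CPUs when none exist. */
static const struct cpu_model *
pick_busiest_two_pass(const struct cpu_model *cpus, size_t n)
{
	const struct cpu_model *busiest = pick_busiest(cpus, n, 2);

	return busiest ? busiest : pick_busiest(cpus, n, 1);
}
```

In Peter's scenario the first pass still selects the two-task CPU, while in the softirq scenario (all single-task CPUs) the fallback pass selects the capacity-reduced one.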
On Fri, 3 Feb 2023 at 19:47, Xi Wang <xii@google.com> wrote:
>
> I think you meant this kind of scenario:
>   cpu 0: idle
>   cpu 1: 2 tasks
>   cpu 2: 1 task but only has 30% of capacity
> Pulling from cpu 2 is good for the task but lowers the overall cpu
> throughput.
>
> The problem we have is:
>   cpu 0: idle
>   cpu 1: 1 task
>   cpu 2: 1 task but only has 60% of capacity due to net softirq
> The task on cpu 2 stays there and runs slower. (This can also be
> considered non work-conserving if we account softirq like a task.)

When load_balance runs for these 2 cpus, cpu2 should be tagged as misfit_task because of its reduced capacity and should be selected in priority by cpu0 to pull the task. Do you have more details on your topology?

> Maybe the logic can be merged like this: use capacity but pick from
> nr_running > 1 cpus first, then nr_running == 1 cpus if not found.
On Mon, Feb 6, 2023 at 1:28 AM Vincent Guittot <vincent.guittot@linaro.org> wrote:
>
> When load_balance runs for these 2 cpus, cpu2 should be tagged as
> misfit_task because of its reduced capacity and should be selected in
> priority by cpu0 to pull the task. Do you have more details on your
> topology?

The topology is 64 core AMD with 2 hyperthreads.

I am not familiar with the related code, but I think there are cases where a task fits cpu capacity yet could still run faster elsewhere, e.g. bursty workloads: thread pool threads with variable utilization, because they process more or fewer requests based on cpu availability (picking the next request from a shared queue when the previous one is done). A thread that has enough cpu cycles but runs slower due to softirqs can also directly affect application performance.
On Tue, 7 Feb 2023 at 02:03, Xi Wang <xii@google.com> wrote:
>
> The topology is 64 core AMD with 2 hyperthreads.
>
> I am not familiar with the related code, but I think there are cases
> where a task fits cpu capacity yet could still run faster elsewhere,
> e.g. bursty workloads: thread pool threads with variable utilization,
> because they process more or fewer requests based on cpu availability.
> A thread that has enough cpu cycles but runs slower due to softirqs
> can also directly affect application performance.

misfit_task is also used to detect CPUs with reduced capacity. We can have a look at update_sg_lb_stats(). Your use case above should fall into this case.
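For readers following along, the reduced-capacity detection Vincent points at boils down to a margin comparison between a CPU's remaining and original capacity. A simplified standalone model, patterned after check_cpu_capacity() in kernel/sched/fair.c (the helper name, parameters, and values here are illustrative, not the kernel API):

```c
/* A CPU counts as capacity-reduced when its remaining capacity,
 * inflated by the sched domain's imbalance percentage (e.g. 117
 * == a 17% margin), still falls short of its original capacity.
 * Illustrative model; see check_cpu_capacity() in the kernel. */
static int capacity_reduced(unsigned long cpu_capacity,
			    unsigned long cpu_capacity_orig,
			    unsigned int imbalance_pct)
{
	return cpu_capacity * imbalance_pct < cpu_capacity_orig * 100;
}
```

Under these assumptions a CPU at ~60% capacity (614/1024) is flagged with a 17% margin, while one at ~88% (900/1024) is not, so whether the softirq case is caught depends on how far capacity actually drops.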
After load balancing was split into different scenarios, CPU capacity is ignored for the "migrate_task" case, which means a thread can stay on a softirq heavy cpu for an extended amount of time.

By comparing nr_running/capacity instead of just nr_running we can add CPU capacity back into "migrate_task" decisions. This benefits workloads running on machines with heavy network traffic. The change is unlikely to cause serious problems for other workloads but maybe some corner cases still need to be considered.

Signed-off-by: Xi Wang <xii@google.com>
---
 kernel/sched/fair.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0f8736991427..aad14bc04544 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10368,8 +10368,9 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 			break;
 
 		case migrate_task:
-			if (busiest_nr < nr_running) {
+			if (busiest_nr * capacity < nr_running * busiest_capacity) {
 				busiest_nr = nr_running;
+				busiest_capacity = capacity;
 				busiest = rq;
 			}
 			break;