diff mbox series

block/blk-mq: Don't complete locally if capacities are different

Message ID 20240122224220.1206234-1-qyousef@layalina.io (mailing list archive)
State New, archived
Headers show
Series block/blk-mq: Don't complete locally if capacities are different | expand

Commit Message

Qais Yousef Jan. 22, 2024, 10:42 p.m. UTC
The logic in blk_mq_complete_need_ipi() assumes SMP systems where all
CPUs have equal capacities and only LLC cache can make a different on
perceived performance. But this assumption falls apart on HMP systems
where LLC is shared, but the CPUs have different capacities. Staying
local then can have a big performance impact if the IO request was done
from a CPU with higher capacity but the interrupt is serviced on a lower
capacity CPU.

Introduce new cpus_gte_capacity() function to enable do the additional
check.

Without the patch I see the BLOCK softirq always running on little cores
(where the hardirq is serviced). With it I can see it running on all
cores.

This was noticed after the topology change [1] where now on a big.LITTLE
we truly get that the LLC is shared between all cores where as in the
past it was being misrepresented for historical reasons. The logic
exposed a missing dependency on capacities for such systems where there
can be a big performance difference between the CPUs.

This of course introduced a noticeable change in behavior depending on
how the topology is presented. Leading to regressions in some workloads
as the performance of the BLOCK softirq on littles can be noticeably
worse.

[1] https://lpc.events/event/16/contributions/1342/attachments/962/1883/LPC-2022-Android-MC-Phantom-Domains.pdf

Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io>
---
 block/blk-mq.c                 | 5 +++--
 include/linux/sched/topology.h | 6 ++++++
 kernel/sched/core.c            | 8 ++++++++
 3 files changed, 17 insertions(+), 2 deletions(-)

Comments

Christoph Hellwig Jan. 23, 2024, 8:47 a.m. UTC | #1
On Mon, Jan 22, 2024 at 10:42:20PM +0000, Qais Yousef wrote:
> The logic in blk_mq_complete_need_ipi() assumes SMP systems where all
> CPUs have equal capacities

What is a capacity here?

> +	return arch_scale_cpu_capacity(this_cpu) >= arch_scale_cpu_capacity(that_cpu);

oerly long line here.

Also pleas split patches for different subsystems.
Jens Axboe Jan. 23, 2024, 3:58 p.m. UTC | #2
On 1/23/24 1:47 AM, Christoph Hellwig wrote:
> On Mon, Jan 22, 2024 at 10:42:20PM +0000, Qais Yousef wrote:
>> The logic in blk_mq_complete_need_ipi() assumes SMP systems where all
>> CPUs have equal capacities
> 
> What is a capacity here?

It seems to be the chosen word to describe the performance potential of
the core in question, we use it elsewhere in the kernel. But yes, could
do with a bit more of an explanation.

>> +	return arch_scale_cpu_capacity(this_cpu) >= arch_scale_cpu_capacity(that_cpu);
> 
> oerly long line here.
> 
> Also pleas split patches for different subsystems.

Yes please, the sched/topology thing should be a separate prep patch.
Qais Yousef Jan. 24, 2024, 12:24 a.m. UTC | #3
On 01/23/24 08:58, Jens Axboe wrote:
> On 1/23/24 1:47 AM, Christoph Hellwig wrote:
> > On Mon, Jan 22, 2024 at 10:42:20PM +0000, Qais Yousef wrote:
> >> The logic in blk_mq_complete_need_ipi() assumes SMP systems where all
> >> CPUs have equal capacities
> > 
> > What is a capacity here?
> 
> It seems to be the chosen word to describe the performance potential of
> the core in question, we use it elsewhere in the kernel. But yes, could
> do with a bit more of an explanation.

Is referring to it as compute capacity makes it clearer? Sorry I thought that's
a common term.

> 
> >> +	return arch_scale_cpu_capacity(this_cpu) >= arch_scale_cpu_capacity(that_cpu);
> > 
> > oerly long line here.

This is consistent with similar long lines in the same file and it's more
readable as one line. checkpatch doesn't complain about this being long;
I think they look for 100 or 120 now. This is 86.

> > 
> > Also pleas split patches for different subsystems.
> 
> Yes please, the sched/topology thing should be a separate prep patch.

Okay. I thought the norm to keep such small patches self contained as adding
dead code followed by a one liner isn't always seen as better. But a split it
is :)

I'll give the sched/arm folks time to have a look before I post a new version.


Thanks!

--
Qais Yousef
Qais Yousef Feb. 2, 2024, 11:20 p.m. UTC | #4
On 01/22/24 22:42, Qais Yousef wrote:
> The logic in blk_mq_complete_need_ipi() assumes SMP systems where all
> CPUs have equal capacities and only LLC cache can make a different on
> perceived performance. But this assumption falls apart on HMP systems
> where LLC is shared, but the CPUs have different capacities. Staying
> local then can have a big performance impact if the IO request was done
> from a CPU with higher capacity but the interrupt is serviced on a lower
> capacity CPU.
> 
> Introduce new cpus_gte_capacity() function to enable do the additional
> check.

As I was preparing to send a new version. I thought it is worth noting here
that I initially had cpus_equal_capacity() to check if the performance on the
two cpus is equal, but then opted to change it to >= so that we favour perf.

But now I am having 2nd thoughts again.

If the interrupt was on a bigger core but the request comes from a little core,
is it better to move it to little core (which can save power) to match the
requester, or better honour the fact that someone has put the interrupt on
a bigger core and maybe they prefer to keep requests there by default?

I am leaning back towards cpus_equal_capacity() so we match the performance
level the requester is running at. irqbalancer can end up shuffling interrupts
potentially and I believe it is better for the scheduler to handle the
performance level the requester should be running at, and then here our job
should be to match that level without making more assumptions.


--
Qais Yousef

> 
> Without the patch I see the BLOCK softirq always running on little cores
> (where the hardirq is serviced). With it I can see it running on all
> cores.
> 
> This was noticed after the topology change [1] where now on a big.LITTLE
> we truly get that the LLC is shared between all cores where as in the
> past it was being misrepresented for historical reasons. The logic
> exposed a missing dependency on capacities for such systems where there
> can be a big performance difference between the CPUs.
> 
> This of course introduced a noticeable change in behavior depending on
> how the topology is presented. Leading to regressions in some workloads
> as the performance of the BLOCK softirq on littles can be noticeably
> worse.
> 
> [1] https://lpc.events/event/16/contributions/1342/attachments/962/1883/LPC-2022-Android-MC-Phantom-Domains.pdf
> 
> Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io>
> ---
>  block/blk-mq.c                 | 5 +++--
>  include/linux/sched/topology.h | 6 ++++++
>  kernel/sched/core.c            | 8 ++++++++
>  3 files changed, 17 insertions(+), 2 deletions(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index ac18f802c027..9b2d278a7ae7 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1163,10 +1163,11 @@ static inline bool blk_mq_complete_need_ipi(struct request *rq)
>  	if (force_irqthreads())
>  		return false;
>  
> -	/* same CPU or cache domain?  Complete locally */
> +	/* same CPU or cache domain and capacity?  Complete locally */
>  	if (cpu == rq->mq_ctx->cpu ||
>  	    (!test_bit(QUEUE_FLAG_SAME_FORCE, &rq->q->queue_flags) &&
> -	     cpus_share_cache(cpu, rq->mq_ctx->cpu)))
> +	     cpus_share_cache(cpu, rq->mq_ctx->cpu) &&
> +	     cpus_gte_capacity(cpu, rq->mq_ctx->cpu)))
>  		return false;
>  
>  	/* don't try to IPI to an offline CPU */
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index a6e04b4a21d7..31cef5780ba4 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -176,6 +176,7 @@ extern void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
>  cpumask_var_t *alloc_sched_domains(unsigned int ndoms);
>  void free_sched_domains(cpumask_var_t doms[], unsigned int ndoms);
>  
> +bool cpus_gte_capacity(int this_cpu, int that_cpu);
>  bool cpus_share_cache(int this_cpu, int that_cpu);
>  bool cpus_share_resources(int this_cpu, int that_cpu);
>  
> @@ -226,6 +227,11 @@ partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
>  {
>  }
>  
> +static inline bool cpus_gte_capacity(int this_cpu, int that_cpu)
> +{
> +	return true;
> +}
> +
>  static inline bool cpus_share_cache(int this_cpu, int that_cpu)
>  {
>  	return true;
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index db4be4921e7f..db5ab4b3cee7 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3954,6 +3954,14 @@ void wake_up_if_idle(int cpu)
>  	}
>  }
>  
> +bool cpus_gte_capacity(int this_cpu, int that_cpu)
> +{
> +	if (this_cpu == that_cpu)
> +		return true;
> +
> +	return arch_scale_cpu_capacity(this_cpu) >= arch_scale_cpu_capacity(that_cpu);
> +}
> +
>  bool cpus_share_cache(int this_cpu, int that_cpu)
>  {
>  	if (this_cpu == that_cpu)
> -- 
> 2.34.1
>
diff mbox series

Patch

diff --git a/block/blk-mq.c b/block/blk-mq.c
index ac18f802c027..9b2d278a7ae7 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1163,10 +1163,11 @@  static inline bool blk_mq_complete_need_ipi(struct request *rq)
 	if (force_irqthreads())
 		return false;
 
-	/* same CPU or cache domain?  Complete locally */
+	/* same CPU or cache domain and capacity?  Complete locally */
 	if (cpu == rq->mq_ctx->cpu ||
 	    (!test_bit(QUEUE_FLAG_SAME_FORCE, &rq->q->queue_flags) &&
-	     cpus_share_cache(cpu, rq->mq_ctx->cpu)))
+	     cpus_share_cache(cpu, rq->mq_ctx->cpu) &&
+	     cpus_gte_capacity(cpu, rq->mq_ctx->cpu)))
 		return false;
 
 	/* don't try to IPI to an offline CPU */
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index a6e04b4a21d7..31cef5780ba4 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -176,6 +176,7 @@  extern void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
 cpumask_var_t *alloc_sched_domains(unsigned int ndoms);
 void free_sched_domains(cpumask_var_t doms[], unsigned int ndoms);
 
+bool cpus_gte_capacity(int this_cpu, int that_cpu);
 bool cpus_share_cache(int this_cpu, int that_cpu);
 bool cpus_share_resources(int this_cpu, int that_cpu);
 
@@ -226,6 +227,11 @@  partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
 {
 }
 
+static inline bool cpus_gte_capacity(int this_cpu, int that_cpu)
+{
+	return true;
+}
+
 static inline bool cpus_share_cache(int this_cpu, int that_cpu)
 {
 	return true;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index db4be4921e7f..db5ab4b3cee7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3954,6 +3954,14 @@  void wake_up_if_idle(int cpu)
 	}
 }
 
+bool cpus_gte_capacity(int this_cpu, int that_cpu)
+{
+	if (this_cpu == that_cpu)
+		return true;
+
+	return arch_scale_cpu_capacity(this_cpu) >= arch_scale_cpu_capacity(that_cpu);
+}
+
 bool cpus_share_cache(int this_cpu, int that_cpu)
 {
 	if (this_cpu == that_cpu)