diff mbox series

blk-mq: Allow complete locally if capacities are different

Message ID 20240828114958.29422-1-quic_mapa@quicinc.com (mailing list archive)
State New
Series blk-mq: Allow complete locally if capacities are different

Commit Message

MANISH PANDEY Aug. 28, 2024, 11:49 a.m. UTC
Commit af550e4c9682 ("block/blk-mq: Don't complete locally if
capacities are different") only completes a request locally if the
submission and completion CPUs have the same capacity.

For optimal IO load balancing, or to avoid contention between the
submission and completion paths, a user may need to complete IO
requests submitted on large-capacity CPU(s) on small-capacity CPU(s),
or vice versa.

Hence introduce a QUEUE_FLAG_ALLOW_DIFF_CAPACITY blk queue flag to let
the user decide whether a request is completed locally or via an IPI
even if the capacities of the submitting and completing CPUs differ.
This gives the user the flexibility to choose the completion CPU that
yields the best performance for their system.

Link: https://lore.kernel.org/all/66912a22-540d-4b9a-bd06-cce55b9ad304@quicinc.com/T/
Co-developed-by: Can Guo <quic_cang@quicinc.com>
Co-developed-by: Nitin Rawat <quic_nitirawa@quicinc.com>
Signed-off-by: Manish Pandey <quic_mapa@quicinc.com>
---
 block/blk-mq-debugfs.c |  1 +
 block/blk-mq.c         |  3 ++-
 block/blk-sysfs.c      | 12 ++++++++++--
 include/linux/blkdev.h |  1 +
 4 files changed, 14 insertions(+), 3 deletions(-)

Comments

Bart Van Assche Aug. 28, 2024, 12:13 p.m. UTC | #1
On 8/28/24 7:49 AM, Manish Pandey wrote:
> 'Commit af550e4c9682 ("block/blk-mq: Don't complete locally if
> capacities are different")' enforces to complete the request locally
> only if the submission and completion CPUs have same capacity.
> 
> To have optimal IO load balancing or to avoid contention b/w submission
> path and completion path, user may need to complete IO request of large
> capacity CPU(s) on Small Capacity CPU(s) or vice versa.
> 
> Hence introduce a QUEUE_FLAG_ALLOW_DIFF_CAPACITY blk queue flag to let
> user decide if it wants to complete the request locally or need an IPI
> even if the capacity of the requesting and completion queue is different.
> This gives flexibility to user to choose best CPU for their completion
> to give best performance for their system.

I think that the following is missing from the above description:
- A mention that this is for an unusual interrupt routing technology
   (the SoC sends the interrupt to a CPU core other than the one
    specified in the smp_affinity mask).
- An explanation of why the desired effect cannot be achieved by
   setting rq_affinity to 0.

>   block/blk-mq-debugfs.c |  1 +
>   block/blk-mq.c         |  3 ++-
>   block/blk-sysfs.c      | 12 ++++++++++--
>   include/linux/blkdev.h |  1 +
>   4 files changed, 14 insertions(+), 3 deletions(-)

Since the semantics of a sysfs attribute are modified,
Documentation/ABI/stable/sysfs-block should be updated.

Thanks,

Bart.
Christian Loehle Aug. 28, 2024, 12:26 p.m. UTC | #2
On 8/28/24 13:13, Bart Van Assche wrote:
> On 8/28/24 7:49 AM, Manish Pandey wrote:
>> 'Commit af550e4c9682 ("block/blk-mq: Don't complete locally if
>> capacities are different")' enforces to complete the request locally
>> only if the submission and completion CPUs have same capacity.
>>
>> To have optimal IO load balancing or to avoid contention b/w submission
>> path and completion path, user may need to complete IO request of large
>> capacity CPU(s) on Small Capacity CPU(s) or vice versa.
>>
>> Hence introduce a QUEUE_FLAG_ALLOW_DIFF_CAPACITY blk queue flag to let
>> user decide if it wants to complete the request locally or need an IPI
>> even if the capacity of the requesting and completion queue is different.
>> This gives flexibility to user to choose best CPU for their completion
>> to give best performance for their system.

Why have the flag be
QUEUE_FLAG_ALLOW_DIFF_CAPACITY
instead of just
QUEUE_FLAG_SAME_LLC
so that it isn't worded so HMP-specifically?
(And of course then having QUEUE_FLAG_SAME_COMP be strictly
stronger than QUEUE_FLAG_SAME_LLC. On !HMP they are equal.)
That would also answer Bart's question below how this is
different to rq_affinity=0.

> 
> I think that the following is missing from the above description:
> - Mentioning that this is for an unusual interrupt routing technology
>   (SoC sends the interrupt to another CPU core than what has been
>    specified in the smp_affinity mask).

FWIW on !mcq that doesn't have to be the case.

> - An explanation why the desired effect cannot be achieved by changing
>   rq_affinity into 0.
> 
>>   block/blk-mq-debugfs.c |  1 +
>>   block/blk-mq.c         |  3 ++-
>>   block/blk-sysfs.c      | 12 ++++++++++--
>>   include/linux/blkdev.h |  1 +
>>   4 files changed, 14 insertions(+), 3 deletions(-)
> 
> Since the semantics of a sysfs attribute are modified,
> Documentation/ABI/stable/sysfs-block should be updated.
> 
> Thanks,
> 
> Bart.
>
Bart Van Assche Aug. 28, 2024, 1:47 p.m. UTC | #3
On 8/28/24 8:26 AM, Christian Loehle wrote:
> On 8/28/24 13:13, Bart Van Assche wrote:
>> I think that the following is missing from the above description:
>> - Mentioning that this is for an unusual interrupt routing technology
>>    (SoC sends the interrupt to another CPU core than what has been
>>     specified in the smp_affinity mask).
> 
> FWIW on !mcq that doesn't have to be the case.

Hmm ... is there any x86 architecture that ignores the smp_affinity
mask? I have not yet encountered an x86 system that does not respect
the smp_affinity mask.

Thanks,

Bart.
Qais Yousef Sept. 1, 2024, 5:32 p.m. UTC | #4
Thanks for the CC Bart.

Manish, if you're going to send a patch to address an issue in another
merged patch, the etiquette is to keep the CC list of the original
patch the same and to include the author of that patch in the loop.

On 08/28/24 08:13, Bart Van Assche wrote:
> On 8/28/24 7:49 AM, Manish Pandey wrote:
> > 'Commit af550e4c9682 ("block/blk-mq: Don't complete locally if
> > capacities are different")' enforces to complete the request locally
> > only if the submission and completion CPUs have same capacity.
> > 
> > To have optimal IO load balancing or to avoid contention b/w submission
> > path and completion path, user may need to complete IO request of large
> > capacity CPU(s) on Small Capacity CPU(s) or vice versa.
> > 
> > Hence introduce a QUEUE_FLAG_ALLOW_DIFF_CAPACITY blk queue flag to let
> > user decide if it wants to complete the request locally or need an IPI

I answered you here

	https://lore.kernel.org/lkml/20240901171317.bm5z3vplqgdwp4bc@airbuntu/

This approach is not acceptable. I think you need to better explain why
rq_affinity=0 is not usable, instead of insisting that rq_affinity=1
needs to be hacked in this manner.

The right extension would be to teach the system to detect cases where
it is better not to keep submission and completion on the same
LLC/capacity because of some scenario XYZ that is known (generically
and scalably) to break and requires an exception.

rq_affinity=0 would give you what you want AFAICT and don't see a reason for
this hack.

> > even if the capacity of the requesting and completion queue is different.
> > This gives flexibility to user to choose best CPU for their completion
> > to give best performance for their system.
> 
> I think that the following is missing from the above description:
> - Mentioning that this is for an unusual interrupt routing technology
>   (SoC sends the interrupt to another CPU core than what has been
>    specified in the smp_affinity mask).
> - An explanation why the desired effect cannot be achieved by changing
>   rq_affinity into 0.

Sadly, it fails to mention a lot of things from the discussion in the
previous thread... including the strange argument about a regression on
a platform that is easily fixed by using rq_affinity=0, where the
reason given for not using rq_affinity=0 is that some other platforms
don't need it.

I'm not sure rq_affinity=1 is supposed to work for all cases,
especially with the specific and custom setup Manish has.

Anyway. The submission has a broken CC list that omits a lot of folks from the
discussion.

> 
> >   block/blk-mq-debugfs.c |  1 +
> >   block/blk-mq.c         |  3 ++-
> >   block/blk-sysfs.c      | 12 ++++++++++--
> >   include/linux/blkdev.h |  1 +
> >   4 files changed, 14 insertions(+), 3 deletions(-)
> 
> Since the semantics of a sysfs attribute are modified,
> Documentation/ABI/stable/sysfs-block should be updated.
> 
> Thanks,
> 
> Bart.

Patch

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 5463697a8442..af048dad9667 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -93,6 +93,7 @@  static const char *const blk_queue_flag_name[] = {
 	QUEUE_FLAG_NAME(RQ_ALLOC_TIME),
 	QUEUE_FLAG_NAME(HCTX_ACTIVE),
 	QUEUE_FLAG_NAME(SQ_SCHED),
+	QUEUE_FLAG_NAME(ALLOW_DIFF_CAPACITY),
 };
 #undef QUEUE_FLAG_NAME
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index aa28157b1aaf..1584312d870a 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1164,7 +1164,8 @@  static inline bool blk_mq_complete_need_ipi(struct request *rq)
 	if (cpu == rq->mq_ctx->cpu ||
 	    (!test_bit(QUEUE_FLAG_SAME_FORCE, &rq->q->queue_flags) &&
 	     cpus_share_cache(cpu, rq->mq_ctx->cpu) &&
-	     cpus_equal_capacity(cpu, rq->mq_ctx->cpu)))
+	     (test_bit(QUEUE_FLAG_ALLOW_DIFF_CAPACITY, &rq->q->queue_flags) ||
+	      cpus_equal_capacity(cpu, rq->mq_ctx->cpu))))
 		return false;
 
 	/* don't try to IPI to an offline CPU */
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 60116d13cb80..37d6ab325180 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -300,8 +300,9 @@  static ssize_t queue_rq_affinity_show(struct gendisk *disk, char *page)
 {
 	bool set = test_bit(QUEUE_FLAG_SAME_COMP, &disk->queue->queue_flags);
 	bool force = test_bit(QUEUE_FLAG_SAME_FORCE, &disk->queue->queue_flags);
+	bool allow = test_bit(QUEUE_FLAG_ALLOW_DIFF_CAPACITY, &disk->queue->queue_flags);
 
-	return queue_var_show(set << force, page);
+	return queue_var_show((set << force) | (allow << set), page);
 }
 
 static ssize_t
@@ -316,15 +317,22 @@  queue_rq_affinity_store(struct gendisk *disk, const char *page, size_t count)
 	if (ret < 0)
 		return ret;
 
-	if (val == 2) {
+	if (val == 3) {
+		blk_queue_flag_set(QUEUE_FLAG_SAME_COMP, q);
+		blk_queue_flag_clear(QUEUE_FLAG_SAME_FORCE, q);
+		blk_queue_flag_set(QUEUE_FLAG_ALLOW_DIFF_CAPACITY, q);
+	} else if (val == 2) {
 		blk_queue_flag_set(QUEUE_FLAG_SAME_COMP, q);
 		blk_queue_flag_set(QUEUE_FLAG_SAME_FORCE, q);
+		blk_queue_flag_clear(QUEUE_FLAG_ALLOW_DIFF_CAPACITY, q);
 	} else if (val == 1) {
 		blk_queue_flag_set(QUEUE_FLAG_SAME_COMP, q);
 		blk_queue_flag_clear(QUEUE_FLAG_SAME_FORCE, q);
+		blk_queue_flag_clear(QUEUE_FLAG_ALLOW_DIFF_CAPACITY, q);
 	} else if (val == 0) {
 		blk_queue_flag_clear(QUEUE_FLAG_SAME_COMP, q);
 		blk_queue_flag_clear(QUEUE_FLAG_SAME_FORCE, q);
+		blk_queue_flag_clear(QUEUE_FLAG_ALLOW_DIFF_CAPACITY, q);
 	}
 #endif
 	return ret;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index b7664d593486..902fb726ebe1 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -602,6 +602,7 @@  enum {
 	QUEUE_FLAG_RQ_ALLOC_TIME,	/* record rq->alloc_time_ns */
 	QUEUE_FLAG_HCTX_ACTIVE,		/* at least one blk-mq hctx is active */
 	QUEUE_FLAG_SQ_SCHED,		/* single queue style io dispatch */
+	QUEUE_FLAG_ALLOW_DIFF_CAPACITY,	/* complete on different capacity CPU-group */
 	QUEUE_FLAG_MAX
 };