block: skip QUEUE_FLAG_STATS and rq-qos for passthrough io

Message ID 20231123102431.6804-1-kundan.kumar@samsung.com (mailing list archive)
State New, archived
Series block: skip QUEUE_FLAG_STATS and rq-qos for passthrough io

Commit Message

Kundan Kumar Nov. 23, 2023, 10:24 a.m. UTC
From: Kundan Kumar <kundan.kumar@samsung.com>

Write-back throttling (WBT) enables QUEUE_FLAG_STATS on the request
queue. But WBT does not make sense for passthrough io, so skip
QUEUE_FLAG_STATS processing.

Also skip rq_qos_issue/done for passthrough io.

Overall, the change gives a ~11% improvement in peak performance.

Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
---
 block/blk-mq.c     | 3 ++-
 block/blk-rq-qos.h | 2 +-
 2 files changed, 3 insertions(+), 2 deletions(-)
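
For context on the check this series adds: a request is "passthrough" when
it carries a driver-private opcode, for example an NVMe command submitted
through the /dev/ngXnY generic char devices used in the benchmarks below.
A simplified rendering of the helpers involved, paraphrased from the block
layer headers of that era rather than copied from any particular tree:

static inline bool blk_op_is_passthrough(blk_opf_t op)
{
	/* REQ_OP_DRV_IN/OUT are the driver-private (passthrough) opcodes */
	op &= REQ_OP_MASK;
	return op == REQ_OP_DRV_IN || op == REQ_OP_DRV_OUT;
}

static inline bool blk_rq_is_passthrough(struct request *rq)
{
	return blk_op_is_passthrough(rq->cmd_flags);
}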

Comments

Christoph Hellwig Nov. 23, 2023, 3:30 p.m. UTC | #1
On Thu, Nov 23, 2023 at 03:54:31PM +0530, kundan.kumar wrote:
> -	if (test_bit(QUEUE_FLAG_STATS, &q->queue_flags)) {
> +	if (test_bit(QUEUE_FLAG_STATS, &q->queue_flags)
> +			&& !blk_rq_is_passthrough(rq)) {

&& goes on the starting line in the kernel code style.

The rest looks good, but that stats overhead seems pretty horrible..
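
For reference, the condition reworked to the kernel style Christoph is
pointing at, with the operator kept at the end of the first line, would
look roughly like this:

	if (test_bit(QUEUE_FLAG_STATS, &q->queue_flags) &&
	    !blk_rq_is_passthrough(rq)) {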
Kanchan Joshi Nov. 23, 2023, 6:12 p.m. UTC | #2
On 11/23/2023 9:00 PM, Christoph Hellwig wrote:
> The rest looks good, but that stats overhead seems pretty horrible..

On my setup
Before[1]: 7.06M
After[2]: 8.29M

[1]
# taskset -c 2,3 t/io_uring -b512 -d256 -c32 -s32 -p1 -F1 -B1 -O0 -n2 
-u1 -r4 /dev/ng0n1 /dev/ng1n1
submitter=0, tid=2076, file=/dev/ng0n1, node=-1
submitter=1, tid=2077, file=/dev/ng1n1, node=-1
polled=1, fixedbufs=1/0, register_files=1, buffered=1, QD=256
Engine=io_uring, sq_ring=256, cq_ring=256
polled=1, fixedbufs=1/0, register_files=1, buffered=1, QD=256
Engine=io_uring, sq_ring=256, cq_ring=256
IOPS=6.95M, BW=3.39GiB/s, IOS/call=32/31
IOPS=7.06M, BW=3.45GiB/s, IOS/call=32/32
IOPS=7.06M, BW=3.45GiB/s, IOS/call=32/31
Exiting on timeout
Maximum IOPS=7.06M

[2]
  # taskset -c 2,3 t/io_uring -b512 -d256 -c32 -s32 -p1 -F1 -B1 -O0 -n2 
-u1 -r4 /dev/ng0n1 /dev/ng1n1
submitter=0, tid=2123, file=/dev/ng0n1, node=-1
submitter=1, tid=2124, file=/dev/ng1n1, node=-1
polled=1, fixedbufs=1/0, register_files=1, buffered=1, QD=256
Engine=io_uring, sq_ring=256, cq_ring=256
IOPS=8.27M, BW=4.04GiB/s, IOS/call=32/31
IOPS=8.29M, BW=4.05GiB/s, IOS/call=32/31
IOPS=8.29M, BW=4.05GiB/s, IOS/call=31/31
Exiting on timeout
Maximum IOPS=8.29M
Jens Axboe Nov. 23, 2023, 8:07 p.m. UTC | #3
On 11/23/23 8:30 AM, Christoph Hellwig wrote:
> On Thu, Nov 23, 2023 at 03:54:31PM +0530, kundan.kumar wrote:
>> -	if (test_bit(QUEUE_FLAG_STATS, &q->queue_flags)) {
>> +	if (test_bit(QUEUE_FLAG_STATS, &q->queue_flags)
>> +			&& !blk_rq_is_passthrough(rq)) {
> 
> && goes on the starting line in the kernel code style.
> 
> The rest looks good, but that stats overhead seems pretty horrible..

It is, but at least it's better than what it used to be. Here's an
example from my box; you can pretty easily tell when iostats got enabled
while it was running:

[...]
submitter=22, tid=645761, file=/dev/nvme22n1, node=6
submitter=23, tid=645762, file=/dev/nvme23n1, node=6
polled=1, fixedbufs=1/0, register_files=1, buffered=0, QD=128
Engine=io_uring, sq_ring=128, cq_ring=128
IOPS=119.09M, BW=58.15GiB/s, IOS/call=31/31
IOPS=122.03M, BW=59.59GiB/s, IOS/call=31/31
IOPS=122.01M, BW=59.58GiB/s, IOS/call=31/32
IOPS=122.02M, BW=59.58GiB/s, IOS/call=31/32
IOPS=122.02M, BW=59.58GiB/s, IOS/call=31/31
IOPS=112.72M, BW=55.04GiB/s, IOS/call=31/31
IOPS=108.43M, BW=52.94GiB/s, IOS/call=31/32
IOPS=108.46M, BW=52.96GiB/s, IOS/call=32/31
IOPS=108.58M, BW=53.02GiB/s, IOS/call=31/31
Jens Axboe Nov. 23, 2023, 8:19 p.m. UTC | #4
On 11/23/23 11:12 AM, Kanchan Joshi wrote:
> On 11/23/2023 9:00 PM, Christoph Hellwig wrote:
>> The rest looks good, but that stats overhead seems pretty horrible..
> 
> On my setup
> Before[1]: 7.06M
> After[2]: 8.29M
> 
> [1]
> # taskset -c 2,3 t/io_uring -b512 -d256 -c32 -s32 -p1 -F1 -B1 -O0 -n2 
> -u1 -r4 /dev/ng0n1 /dev/ng1n1
> submitter=0, tid=2076, file=/dev/ng0n1, node=-1
> submitter=1, tid=2077, file=/dev/ng1n1, node=-1
> polled=1, fixedbufs=1/0, register_files=1, buffered=1, QD=256
> Engine=io_uring, sq_ring=256, cq_ring=256
> polled=1, fixedbufs=1/0, register_files=1, buffered=1, QD=256
> Engine=io_uring, sq_ring=256, cq_ring=256
> IOPS=6.95M, BW=3.39GiB/s, IOS/call=32/31
> IOPS=7.06M, BW=3.45GiB/s, IOS/call=32/32
> IOPS=7.06M, BW=3.45GiB/s, IOS/call=32/31
> Exiting on timeout
> Maximum IOPS=7.06M
> 
> [2]
>   # taskset -c 2,3 t/io_uring -b512 -d256 -c32 -s32 -p1 -F1 -B1 -O0 -n2 
> -u1 -r4 /dev/ng0n1 /dev/ng1n1
> submitter=0, tid=2123, file=/dev/ng0n1, node=-1
> submitter=1, tid=2124, file=/dev/ng1n1, node=-1
> polled=1, fixedbufs=1/0, register_files=1, buffered=1, QD=256
> Engine=io_uring, sq_ring=256, cq_ring=256
> IOPS=8.27M, BW=4.04GiB/s, IOS/call=32/31
> IOPS=8.29M, BW=4.05GiB/s, IOS/call=32/31
> IOPS=8.29M, BW=4.05GiB/s, IOS/call=31/31
> Exiting on timeout
> Maximum IOPS=8.29M

It's all really down to how expensive getting the current time is on
your box; some will be better and some worse.

One idea that has been bounced around in the past is to have a
blk_ktime_get_ns() and have it be something ala:

u64 blk_ktime_get_ns(void)
{
	struct blk_plug *plug = current->plug;

	if (!plug)
		return ktime_get_ns();

	if (!plug->ktime_valid)
		plug->ktime = ktime_get_ns();

	return plug->ktime;
}

in freestyle form, with the idea being that we don't care about
granularity to the extent that we'd need a new stamp every time.

If the task is scheduled out, the plug is flushed anyway, which should
invalidate the stamp. For preemption this isn't true iirc, so we'd need
some kind of blk_flush_plug_ts() or something for that case to
invalidate it.

Hopefully this could then also let us get away from passing in a cached
value, which we do in various spots exactly because all of this time
stamping is expensive. It's also a bit of a game of whack-a-mole, as
users get added and distro kernels tend to turn on basically everything
anyway.
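
As a minimal sketch of that invalidation hook, using the blk_flush_plug_ts()
name floated above and the ktime_valid field from the freestyle snippet
(neither of which exists in a real tree), the preemption path would just
clear the validity flag so the next blk_ktime_get_ns() call re-reads the
clock:

static inline void blk_flush_plug_ts(struct task_struct *tsk)
{
	struct blk_plug *plug = tsk->plug;

	/* Force the next blk_ktime_get_ns() to call ktime_get_ns() again. */
	if (plug)
		plug->ktime_valid = false;
}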
Jens Axboe Nov. 23, 2023, 8:37 p.m. UTC | #5
On 11/23/23 1:19 PM, Jens Axboe wrote:
> On 11/23/23 11:12 AM, Kanchan Joshi wrote:
>> On 11/23/2023 9:00 PM, Christoph Hellwig wrote:
>>> The rest looks good, but that stats overhead seems pretty horrible..
>>
>> On my setup
>> Before[1]: 7.06M
>> After[2]: 8.29M
>>
>> [1]
>> # taskset -c 2,3 t/io_uring -b512 -d256 -c32 -s32 -p1 -F1 -B1 -O0 -n2 
>> -u1 -r4 /dev/ng0n1 /dev/ng1n1
>> submitter=0, tid=2076, file=/dev/ng0n1, node=-1
>> submitter=1, tid=2077, file=/dev/ng1n1, node=-1
>> polled=1, fixedbufs=1/0, register_files=1, buffered=1, QD=256
>> Engine=io_uring, sq_ring=256, cq_ring=256
>> polled=1, fixedbufs=1/0, register_files=1, buffered=1, QD=256
>> Engine=io_uring, sq_ring=256, cq_ring=256
>> IOPS=6.95M, BW=3.39GiB/s, IOS/call=32/31
>> IOPS=7.06M, BW=3.45GiB/s, IOS/call=32/32
>> IOPS=7.06M, BW=3.45GiB/s, IOS/call=32/31
>> Exiting on timeout
>> Maximum IOPS=7.06M
>>
>> [2]
>>   # taskset -c 2,3 t/io_uring -b512 -d256 -c32 -s32 -p1 -F1 -B1 -O0 -n2 
>> -u1 -r4 /dev/ng0n1 /dev/ng1n1
>> submitter=0, tid=2123, file=/dev/ng0n1, node=-1
>> submitter=1, tid=2124, file=/dev/ng1n1, node=-1
>> polled=1, fixedbufs=1/0, register_files=1, buffered=1, QD=256
>> Engine=io_uring, sq_ring=256, cq_ring=256
>> IOPS=8.27M, BW=4.04GiB/s, IOS/call=32/31
>> IOPS=8.29M, BW=4.05GiB/s, IOS/call=32/31
>> IOPS=8.29M, BW=4.05GiB/s, IOS/call=31/31
>> Exiting on timeout
>> Maximum IOPS=8.29M
> 
> It's all really down to how expensive getting the current time is on
> your box, some will be better and some worse
> 
> One idea that has been bounced around in the past is to have a
> blk_ktime_get_ns() and have it be something ala:
> 
> u64 blk_ktime_get_ns(void)
> {
> 	struct blk_plug *plug = current->plug;
> 
> 	if (!plug)
> 		return ktime_get_ns();
> 
> 	if (!plug->ktime_valid)
> 		plug->ktime = ktime_get_ns();
> 
> 	return plug->ktime;
> }
> 
> in freestyle form, with the idea being that we don't care granularity to
> the extent that we'd need a new stamp every time.
> 
> If the task is scheduled out, the plug is flushed anyway, which should
> invalidate the stamp. For preemption this isn't true iirc, so we'd need
> some kind of blk_flush_plug_ts() or something for that case to
> invalidate it.
> 
> Hopefully this could then also get away from passing in a cached value
> that we do in various spots, exactly because all of this time stamping
> is expensive. It's also a bit of a game of whack-a-mole, as users get
> added and distro kernels tend to turn on basically everything anyway.

Did a quick'n dirty (see below), which recoups half of the lost
performance for me. And my kernel deliberately doesn't enable all of the
gunk in block/ that slows everything down, so I suspect it'll be a bigger
win for you percentage-wise:

IOPS=121.42M, BW=59.29GiB/s, IOS/call=31/31
IOPS=121.47M, BW=59.31GiB/s, IOS/call=32/32
IOPS=121.44M, BW=59.30GiB/s, IOS/call=31/31
IOPS=121.47M, BW=59.31GiB/s, IOS/call=31/31
IOPS=121.45M, BW=59.30GiB/s, IOS/call=32/32
IOPS=119.95M, BW=58.57GiB/s, IOS/call=31/32
IOPS=115.30M, BW=56.30GiB/s, IOS/call=32/31
IOPS=115.38M, BW=56.34GiB/s, IOS/call=32/32
IOPS=115.35M, BW=56.32GiB/s, IOS/call=32/32


diff --git a/block/blk-core.c b/block/blk-core.c
index fdf25b8d6e78..6e74af442b94 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1055,6 +1055,7 @@ void blk_start_plug_nr_ios(struct blk_plug *plug, unsigned short nr_ios)
 	plug->rq_count = 0;
 	plug->multiple_queues = false;
 	plug->has_elevator = false;
+	plug->cur_ktime = 0;
 	INIT_LIST_HEAD(&plug->cb_list);
 
 	/*
diff --git a/block/blk-flush.c b/block/blk-flush.c
index 3f4d41952ef2..e4e9f7eed346 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -143,7 +143,7 @@ static void blk_account_io_flush(struct request *rq)
 	part_stat_lock();
 	part_stat_inc(part, ios[STAT_FLUSH]);
 	part_stat_add(part, nsecs[STAT_FLUSH],
-		      ktime_get_ns() - rq->start_time_ns);
+		      blk_ktime_get_ns() - rq->start_time_ns);
 	part_stat_unlock();
 }
 
diff --git a/block/blk-iocost.c b/block/blk-iocost.c
index 089fcb9cfce3..d06cea625462 100644
--- a/block/blk-iocost.c
+++ b/block/blk-iocost.c
@@ -829,7 +829,7 @@ static int ioc_autop_idx(struct ioc *ioc, struct gendisk *disk)
 
 	/* step up/down based on the vrate */
 	vrate_pct = div64_u64(ioc->vtime_base_rate * 100, VTIME_PER_USEC);
-	now_ns = ktime_get_ns();
+	now_ns = blk_ktime_get_ns();
 
 	if (p->too_fast_vrate_pct && p->too_fast_vrate_pct <= vrate_pct) {
 		if (!ioc->autop_too_fast_at)
@@ -1044,7 +1044,7 @@ static void ioc_now(struct ioc *ioc, struct ioc_now *now)
 	unsigned seq;
 	u64 vrate;
 
-	now->now_ns = ktime_get();
+	now->now_ns = blk_ktime_get();
 	now->now = ktime_to_us(now->now_ns);
 	vrate = atomic64_read(&ioc->vtime_rate);
 
@@ -2810,7 +2810,7 @@ static void ioc_rqos_done(struct rq_qos *rqos, struct request *rq)
 		return;
 	}
 
-	on_q_ns = ktime_get_ns() - rq->alloc_time_ns;
+	on_q_ns = blk_ktime_get_ns() - rq->alloc_time_ns;
 	rq_wait_ns = rq->start_time_ns - rq->alloc_time_ns;
 	size_nsec = div64_u64(calc_size_vtime_cost(rq, ioc), VTIME_PER_NSEC);
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 900c1be1fee1..9c96dee9e584 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -323,7 +323,7 @@ void blk_rq_init(struct request_queue *q, struct request *rq)
 	RB_CLEAR_NODE(&rq->rb_node);
 	rq->tag = BLK_MQ_NO_TAG;
 	rq->internal_tag = BLK_MQ_NO_TAG;
-	rq->start_time_ns = ktime_get_ns();
+	rq->start_time_ns = blk_ktime_get_ns();
 	rq->part = NULL;
 	blk_crypto_rq_set_defaults(rq);
 }
@@ -333,7 +333,7 @@ EXPORT_SYMBOL(blk_rq_init);
 static inline void blk_mq_rq_time_init(struct request *rq, u64 alloc_time_ns)
 {
 	if (blk_mq_need_time_stamp(rq))
-		rq->start_time_ns = ktime_get_ns();
+		rq->start_time_ns = blk_ktime_get_ns();
 	else
 		rq->start_time_ns = 0;
 
@@ -444,7 +444,7 @@ static struct request *__blk_mq_alloc_requests(struct blk_mq_alloc_data *data)
 
 	/* alloc_time includes depth and tag waits */
 	if (blk_queue_rq_alloc_time(q))
-		alloc_time_ns = ktime_get_ns();
+		alloc_time_ns = blk_ktime_get_ns();
 
 	if (data->cmd_flags & REQ_NOWAIT)
 		data->flags |= BLK_MQ_REQ_NOWAIT;
@@ -629,7 +629,7 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
 
 	/* alloc_time includes depth and tag waits */
 	if (blk_queue_rq_alloc_time(q))
-		alloc_time_ns = ktime_get_ns();
+		alloc_time_ns = blk_ktime_get_ns();
 
 	/*
 	 * If the tag allocator sleeps we could get an allocation for a
@@ -1037,7 +1037,7 @@ static inline void __blk_mq_end_request_acct(struct request *rq, u64 now)
 inline void __blk_mq_end_request(struct request *rq, blk_status_t error)
 {
 	if (blk_mq_need_time_stamp(rq))
-		__blk_mq_end_request_acct(rq, ktime_get_ns());
+		__blk_mq_end_request_acct(rq, blk_ktime_get_ns());
 
 	blk_mq_finish_request(rq);
 
@@ -1080,7 +1080,7 @@ void blk_mq_end_request_batch(struct io_comp_batch *iob)
 	u64 now = 0;
 
 	if (iob->need_ts)
-		now = ktime_get_ns();
+		now = blk_ktime_get_ns();
 
 	while ((rq = rq_list_pop(&iob->req_list)) != NULL) {
 		prefetch(rq->bio);
@@ -1249,7 +1249,7 @@ void blk_mq_start_request(struct request *rq)
 	trace_block_rq_issue(rq);
 
 	if (test_bit(QUEUE_FLAG_STATS, &q->queue_flags)) {
-		rq->io_start_time_ns = ktime_get_ns();
+		rq->io_start_time_ns = blk_ktime_get_ns();
 		rq->stats_sectors = blk_rq_sectors(rq);
 		rq->rq_flags |= RQF_STATS;
 		rq_qos_issue(q, rq);
@@ -3066,7 +3066,7 @@ blk_status_t blk_insert_cloned_request(struct request *rq)
 	blk_mq_run_dispatch_ops(q,
 			ret = blk_mq_request_issue_directly(rq, true));
 	if (ret)
-		blk_account_io_done(rq, ktime_get_ns());
+		blk_account_io_done(rq, blk_ktime_get_ns());
 	return ret;
 }
 EXPORT_SYMBOL_GPL(blk_insert_cloned_request);
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 13e4377a8b28..5919919ba1c3 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -1813,7 +1813,7 @@ static bool throtl_tg_is_idle(struct throtl_grp *tg)
 	time = min_t(unsigned long, MAX_IDLE_TIME, 4 * tg->idletime_threshold);
 	ret = tg->latency_target == DFL_LATENCY_TARGET ||
 	      tg->idletime_threshold == DFL_IDLE_THRESHOLD ||
-	      (ktime_get_ns() >> 10) - tg->last_finish_time > time ||
+	      (blk_ktime_get_ns() >> 10) - tg->last_finish_time > time ||
 	      tg->avg_idletime > tg->idletime_threshold ||
 	      (tg->latency_target && tg->bio_cnt &&
 		tg->bad_bio_cnt * 5 < tg->bio_cnt);
@@ -2058,7 +2058,7 @@ static void blk_throtl_update_idletime(struct throtl_grp *tg)
 	if (last_finish_time == 0)
 		return;
 
-	now = ktime_get_ns() >> 10;
+	now = blk_ktime_get_ns() >> 10;
 	if (now <= last_finish_time ||
 	    last_finish_time == tg->checked_last_finish_time)
 		return;
@@ -2325,7 +2325,7 @@ void blk_throtl_bio_endio(struct bio *bio)
 	if (!tg->td->limit_valid[LIMIT_LOW])
 		return;
 
-	finish_time_ns = ktime_get_ns();
+	finish_time_ns = blk_ktime_get_ns();
 	tg->last_finish_time = finish_time_ns >> 10;
 
 	start_time = bio_issue_time(&bio->bi_issue) >> 10;
diff --git a/block/blk.h b/block/blk.h
index 08a358bc0919..4f081a00e644 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -518,4 +518,17 @@ static inline int req_ref_read(struct request *req)
 	return atomic_read(&req->ref);
 }
 
+static inline u64 blk_ktime_get_ns(void)
+{
+	struct blk_plug *plug = current->plug;
+
+	if (!plug)
+		return ktime_get_ns();
+	if (!(plug->cur_ktime & 1ULL)) {
+		plug->cur_ktime = ktime_get_ns();
+		plug->cur_ktime |= 1ULL;
+	}
+	return plug->cur_ktime;
+}
+
 #endif /* BLK_INTERNAL_H */
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 51fa7ffdee83..081830279f70 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -972,6 +972,9 @@ struct blk_plug {
 	bool multiple_queues;
 	bool has_elevator;
 
+	/* bit0 set if valid */
+	u64 cur_ktime;
+
 	struct list_head cb_list; /* md requires an unplug callback */
 };
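
The diff above folds the validity flag into bit 0 of the cached nanosecond
value, so no separate bool is needed and the cost is at most 1 ns of
accuracy. A self-contained userspace demo of the same trick (plain C, not
kernel code; all names here are made up for the demo):

#include <stdint.h>
#include <stdio.h>
#include <time.h>

static uint64_t cur_ktime;		/* stands in for plug->cur_ktime */

static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

/* Bit 0 set means "cached value is valid"; costs at most 1 ns of precision. */
static uint64_t cached_time_ns(void)
{
	if (!(cur_ktime & 1ULL))
		cur_ktime = now_ns() | 1ULL;
	return cur_ktime;
}

int main(void)
{
	uint64_t a = cached_time_ns();	/* reads the clock */
	uint64_t b = cached_time_ns();	/* served from the cache, b == a */
	uint64_t c;

	cur_ktime = 0;			/* "plug flushed": invalidate the cache */
	c = cached_time_ns();		/* reads the clock again */

	printf("a=%llu b=%llu c=%llu\n", (unsigned long long)a,
	       (unsigned long long)b, (unsigned long long)c);
	return 0;
}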
Kundan Kumar Nov. 24, 2023, 11:28 a.m. UTC | #6
On Fri, Nov 24, 2023 at 2:07 AM Jens Axboe <axboe@kernel.dk> wrote:
>
> On 11/23/23 1:19 PM, Jens Axboe wrote:
> > On 11/23/23 11:12 AM, Kanchan Joshi wrote:
> >> On 11/23/2023 9:00 PM, Christoph Hellwig wrote:
> >>> The rest looks good, but that stats overhead seems pretty horrible..
> >>
> >> On my setup
> >> Before[1]: 7.06M
> >> After[2]: 8.29M
> >>
> >> [1]
> >> # taskset -c 2,3 t/io_uring -b512 -d256 -c32 -s32 -p1 -F1 -B1 -O0 -n2
> >> -u1 -r4 /dev/ng0n1 /dev/ng1n1
> >> submitter=0, tid=2076, file=/dev/ng0n1, node=-1
> >> submitter=1, tid=2077, file=/dev/ng1n1, node=-1
> >> polled=1, fixedbufs=1/0, register_files=1, buffered=1, QD=256
> >> Engine=io_uring, sq_ring=256, cq_ring=256
> >> polled=1, fixedbufs=1/0, register_files=1, buffered=1, QD=256
> >> Engine=io_uring, sq_ring=256, cq_ring=256
> >> IOPS=6.95M, BW=3.39GiB/s, IOS/call=32/31
> >> IOPS=7.06M, BW=3.45GiB/s, IOS/call=32/32
> >> IOPS=7.06M, BW=3.45GiB/s, IOS/call=32/31
> >> Exiting on timeout
> >> Maximum IOPS=7.06M
> >>
> >> [2]
> >>   # taskset -c 2,3 t/io_uring -b512 -d256 -c32 -s32 -p1 -F1 -B1 -O0 -n2
> >> -u1 -r4 /dev/ng0n1 /dev/ng1n1
> >> submitter=0, tid=2123, file=/dev/ng0n1, node=-1
> >> submitter=1, tid=2124, file=/dev/ng1n1, node=-1
> >> polled=1, fixedbufs=1/0, register_files=1, buffered=1, QD=256
> >> Engine=io_uring, sq_ring=256, cq_ring=256
> >> IOPS=8.27M, BW=4.04GiB/s, IOS/call=32/31
> >> IOPS=8.29M, BW=4.05GiB/s, IOS/call=32/31
> >> IOPS=8.29M, BW=4.05GiB/s, IOS/call=31/31
> >> Exiting on timeout
> >> Maximum IOPS=8.29M
> >
> > It's all really down to how expensive getting the current time is on
> > your box, some will be better and some worse
> >
> > One idea that has been bounced around in the past is to have a
> > blk_ktime_get_ns() and have it be something ala:
> >
> > u64 blk_ktime_get_ns(void)
> > {
> >       struct blk_plug *plug = current->plug;
> >
> >       if (!plug)
> >               return ktime_get_ns();
> >
> >       if (!plug->ktime_valid)
> >               plug->ktime = ktime_get_ns();
> >
> >       return plug->ktime;
> > }
> >
> > in freestyle form, with the idea being that we don't care granularity to
> > the extent that we'd need a new stamp every time.
> >
> > If the task is scheduled out, the plug is flushed anyway, which should
> > invalidate the stamp. For preemption this isn't true iirc, so we'd need
> > some kind of blk_flush_plug_ts() or something for that case to
> > invalidate it.
> >
> > Hopefully this could then also get away from passing in a cached value
> > that we do in various spots, exactly because all of this time stamping
> > is expensive. It's also a bit of a game of whack-a-mole, as users get
> > added and distro kernels tend to turn on basically everything anyway.
>
> Did a quick'n dirty (see below), which recoups half of the lost
> performance for me. And my kernel deliberately doesn't enable all of the
> gunk in block/ that slows everything down, I suspect it'll be a bigger
> win for you percentage wise:
>
> IOPS=121.42M, BW=59.29GiB/s, IOS/call=31/31
> IOPS=121.47M, BW=59.31GiB/s, IOS/call=32/32
> IOPS=121.44M, BW=59.30GiB/s, IOS/call=31/31
> IOPS=121.47M, BW=59.31GiB/s, IOS/call=31/31
> IOPS=121.45M, BW=59.30GiB/s, IOS/call=32/32
> IOPS=119.95M, BW=58.57GiB/s, IOS/call=31/32
> IOPS=115.30M, BW=56.30GiB/s, IOS/call=32/31
> IOPS=115.38M, BW=56.34GiB/s, IOS/call=32/32
> IOPS=115.35M, BW=56.32GiB/s, IOS/call=32/32
>
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index fdf25b8d6e78..6e74af442b94 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -1055,6 +1055,7 @@ void blk_start_plug_nr_ios(struct blk_plug *plug, unsigned short nr_ios)
>         plug->rq_count = 0;
>         plug->multiple_queues = false;
>         plug->has_elevator = false;
> +       plug->cur_ktime = 0;
>         INIT_LIST_HEAD(&plug->cb_list);
>
>         /*
> diff --git a/block/blk-flush.c b/block/blk-flush.c
> index 3f4d41952ef2..e4e9f7eed346 100644
> --- a/block/blk-flush.c
> +++ b/block/blk-flush.c
> @@ -143,7 +143,7 @@ static void blk_account_io_flush(struct request *rq)
>         part_stat_lock();
>         part_stat_inc(part, ios[STAT_FLUSH]);
>         part_stat_add(part, nsecs[STAT_FLUSH],
> -                     ktime_get_ns() - rq->start_time_ns);
> +                     blk_ktime_get_ns() - rq->start_time_ns);
>         part_stat_unlock();
>  }
>
> diff --git a/block/blk-iocost.c b/block/blk-iocost.c
> index 089fcb9cfce3..d06cea625462 100644
> --- a/block/blk-iocost.c
> +++ b/block/blk-iocost.c
> @@ -829,7 +829,7 @@ static int ioc_autop_idx(struct ioc *ioc, struct gendisk *disk)
>
>         /* step up/down based on the vrate */
>         vrate_pct = div64_u64(ioc->vtime_base_rate * 100, VTIME_PER_USEC);
> -       now_ns = ktime_get_ns();
> +       now_ns = blk_ktime_get_ns();
>
>         if (p->too_fast_vrate_pct && p->too_fast_vrate_pct <= vrate_pct) {
>                 if (!ioc->autop_too_fast_at)
> @@ -1044,7 +1044,7 @@ static void ioc_now(struct ioc *ioc, struct ioc_now *now)
>         unsigned seq;
>         u64 vrate;
>
> -       now->now_ns = ktime_get();
> +       now->now_ns = blk_ktime_get();
>         now->now = ktime_to_us(now->now_ns);
>         vrate = atomic64_read(&ioc->vtime_rate);
>
> @@ -2810,7 +2810,7 @@ static void ioc_rqos_done(struct rq_qos *rqos, struct request *rq)
>                 return;
>         }
>
> -       on_q_ns = ktime_get_ns() - rq->alloc_time_ns;
> +       on_q_ns = blk_ktime_get_ns() - rq->alloc_time_ns;
>         rq_wait_ns = rq->start_time_ns - rq->alloc_time_ns;
>         size_nsec = div64_u64(calc_size_vtime_cost(rq, ioc), VTIME_PER_NSEC);
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 900c1be1fee1..9c96dee9e584 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -323,7 +323,7 @@ void blk_rq_init(struct request_queue *q, struct request *rq)
>         RB_CLEAR_NODE(&rq->rb_node);
>         rq->tag = BLK_MQ_NO_TAG;
>         rq->internal_tag = BLK_MQ_NO_TAG;
> -       rq->start_time_ns = ktime_get_ns();
> +       rq->start_time_ns = blk_ktime_get_ns();
>         rq->part = NULL;
>         blk_crypto_rq_set_defaults(rq);
>  }
> @@ -333,7 +333,7 @@ EXPORT_SYMBOL(blk_rq_init);
>  static inline void blk_mq_rq_time_init(struct request *rq, u64 alloc_time_ns)
>  {
>         if (blk_mq_need_time_stamp(rq))
> -               rq->start_time_ns = ktime_get_ns();
> +               rq->start_time_ns = blk_ktime_get_ns();
>         else
>                 rq->start_time_ns = 0;
>
> @@ -444,7 +444,7 @@ static struct request *__blk_mq_alloc_requests(struct blk_mq_alloc_data *data)
>
>         /* alloc_time includes depth and tag waits */
>         if (blk_queue_rq_alloc_time(q))
> -               alloc_time_ns = ktime_get_ns();
> +               alloc_time_ns = blk_ktime_get_ns();
>
>         if (data->cmd_flags & REQ_NOWAIT)
>                 data->flags |= BLK_MQ_REQ_NOWAIT;
> @@ -629,7 +629,7 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
>
>         /* alloc_time includes depth and tag waits */
>         if (blk_queue_rq_alloc_time(q))
> -               alloc_time_ns = ktime_get_ns();
> +               alloc_time_ns = blk_ktime_get_ns();
>
>         /*
>          * If the tag allocator sleeps we could get an allocation for a
> @@ -1037,7 +1037,7 @@ static inline void __blk_mq_end_request_acct(struct request *rq, u64 now)
>  inline void __blk_mq_end_request(struct request *rq, blk_status_t error)
>  {
>         if (blk_mq_need_time_stamp(rq))
> -               __blk_mq_end_request_acct(rq, ktime_get_ns());
> +               __blk_mq_end_request_acct(rq, blk_ktime_get_ns());
>
>         blk_mq_finish_request(rq);
>
> @@ -1080,7 +1080,7 @@ void blk_mq_end_request_batch(struct io_comp_batch *iob)
>         u64 now = 0;
>
>         if (iob->need_ts)
> -               now = ktime_get_ns();
> +               now = blk_ktime_get_ns();
>
>         while ((rq = rq_list_pop(&iob->req_list)) != NULL) {
>                 prefetch(rq->bio);
> @@ -1249,7 +1249,7 @@ void blk_mq_start_request(struct request *rq)
>         trace_block_rq_issue(rq);
>
>         if (test_bit(QUEUE_FLAG_STATS, &q->queue_flags)) {
> -               rq->io_start_time_ns = ktime_get_ns();
> +               rq->io_start_time_ns = blk_ktime_get_ns();
>                 rq->stats_sectors = blk_rq_sectors(rq);
>                 rq->rq_flags |= RQF_STATS;
>                 rq_qos_issue(q, rq);
> @@ -3066,7 +3066,7 @@ blk_status_t blk_insert_cloned_request(struct request *rq)
>         blk_mq_run_dispatch_ops(q,
>                         ret = blk_mq_request_issue_directly(rq, true));
>         if (ret)
> -               blk_account_io_done(rq, ktime_get_ns());
> +               blk_account_io_done(rq, blk_ktime_get_ns());
>         return ret;
>  }
>  EXPORT_SYMBOL_GPL(blk_insert_cloned_request);
> diff --git a/block/blk-throttle.c b/block/blk-throttle.c
> index 13e4377a8b28..5919919ba1c3 100644
> --- a/block/blk-throttle.c
> +++ b/block/blk-throttle.c
> @@ -1813,7 +1813,7 @@ static bool throtl_tg_is_idle(struct throtl_grp *tg)
>         time = min_t(unsigned long, MAX_IDLE_TIME, 4 * tg->idletime_threshold);
>         ret = tg->latency_target == DFL_LATENCY_TARGET ||
>               tg->idletime_threshold == DFL_IDLE_THRESHOLD ||
> -             (ktime_get_ns() >> 10) - tg->last_finish_time > time ||
> +             (blk_ktime_get_ns() >> 10) - tg->last_finish_time > time ||
>               tg->avg_idletime > tg->idletime_threshold ||
>               (tg->latency_target && tg->bio_cnt &&
>                 tg->bad_bio_cnt * 5 < tg->bio_cnt);
> @@ -2058,7 +2058,7 @@ static void blk_throtl_update_idletime(struct throtl_grp *tg)
>         if (last_finish_time == 0)
>                 return;
>
> -       now = ktime_get_ns() >> 10;
> +       now = blk_ktime_get_ns() >> 10;
>         if (now <= last_finish_time ||
>             last_finish_time == tg->checked_last_finish_time)
>                 return;
> @@ -2325,7 +2325,7 @@ void blk_throtl_bio_endio(struct bio *bio)
>         if (!tg->td->limit_valid[LIMIT_LOW])
>                 return;
>
> -       finish_time_ns = ktime_get_ns();
> +       finish_time_ns = blk_ktime_get_ns();
>         tg->last_finish_time = finish_time_ns >> 10;
>
>         start_time = bio_issue_time(&bio->bi_issue) >> 10;
> diff --git a/block/blk.h b/block/blk.h
> index 08a358bc0919..4f081a00e644 100644
> --- a/block/blk.h
> +++ b/block/blk.h
> @@ -518,4 +518,17 @@ static inline int req_ref_read(struct request *req)
>         return atomic_read(&req->ref);
>  }
>
> +static inline u64 blk_ktime_get_ns(void)
> +{
> +       struct blk_plug *plug = current->plug;
> +
> +       if (!plug)
> +               return ktime_get_ns();
> +       if (!(plug->cur_ktime & 1ULL)) {
> +               plug->cur_ktime = ktime_get_ns();
> +               plug->cur_ktime |= 1ULL;
> +       }
> +       return plug->cur_ktime;
> +}
> +
>  #endif /* BLK_INTERNAL_H */
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 51fa7ffdee83..081830279f70 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -972,6 +972,9 @@ struct blk_plug {
>         bool multiple_queues;
>         bool has_elevator;
>
> +       /* bit0 set if valid */
> +       u64 cur_ktime;
> +
>         struct list_head cb_list; /* md requires an unplug callback */
>  };
>
>
> --
> Jens Axboe
>
>

Hi Jens,

This is what I see with your changes on my setup.

Before[1]: 7.06M
After[2]: 7.52M

[1]
# taskset -c 2,3 t/io_uring -b512 -d256 -c32 -s32 -p1 -F1 -B1 -O0 -n2 \
-u1 -r4 /dev/ng0n1 /dev/ng1n1
submitter=0, tid=2076, file=/dev/ng0n1, node=-1
submitter=1, tid=2077, file=/dev/ng1n1, node=-1
polled=1, fixedbufs=1/0, register_files=1, buffered=1, QD=256
Engine=io_uring, sq_ring=256, cq_ring=256
polled=1, fixedbufs=1/0, register_files=1, buffered=1, QD=256
Engine=io_uring, sq_ring=256, cq_ring=256
IOPS=6.95M, BW=3.39GiB/s, IOS/call=32/31
IOPS=7.06M, BW=3.45GiB/s, IOS/call=32/32
IOPS=7.06M, BW=3.45GiB/s, IOS/call=32/31
Exiting on timeout
Maximum IOPS=7.06M

[2]
# taskset -c 2,3 t/io_uring -b512 -d256 -c32 -s32 -p1 -F1 -B1 -O0 -n2 \
-u1 -r4 /dev/ng0n1 /dev/ng1n1
submitter=0, tid=2204, file=/dev/ng0n1, node=-1
submitter=1, tid=2205, file=/dev/ng1n1, node=-1
polled=1, fixedbufs=1/0, register_files=1, buffered=1, QD=256
Engine=io_uring, sq_ring=256, cq_ring=256
IOPS=7.40M, BW=3.62GiB/s, IOS/call=32/31
IOPS=7.51M, BW=3.67GiB/s, IOS/call=32/31
IOPS=7.52M, BW=3.67GiB/s, IOS/call=32/32
Exiting on timeout
Maximum IOPS=7.52M

The original patch avoids processing throttle stats and wbt_issue/done
stats for the passthrough-io path.

Improvement with the original patch:
7.06M -> 8.29M

It seems that the two optimizations are different. The original patch is
about "completely disabling stats for passthrough-io", while your changes
optimize getting the current time, which would improve performance for
everyone.

I think both of them are independent.

--
Kundan Kumar
Kundan Kumar Dec. 1, 2023, 12:11 p.m. UTC | #7
On Fri, Nov 24, 2023 at 4:58 PM Kundan Kumar <kundanthebest@gmail.com> wrote:
>
> On Fri, Nov 24, 2023 at 2:07 AM Jens Axboe <axboe@kernel.dk> wrote:
> >
> > On 11/23/23 1:19 PM, Jens Axboe wrote:
> > > On 11/23/23 11:12 AM, Kanchan Joshi wrote:
> > >> On 11/23/2023 9:00 PM, Christoph Hellwig wrote:
> > >>> The rest looks good, but that stats overhead seems pretty horrible..
> > >>
> > >> On my setup
> > >> Before[1]: 7.06M
> > >> After[2]: 8.29M
> > >>
> > >> [1]
> > >> # taskset -c 2,3 t/io_uring -b512 -d256 -c32 -s32 -p1 -F1 -B1 -O0 -n2
> > >> -u1 -r4 /dev/ng0n1 /dev/ng1n1
> > >> submitter=0, tid=2076, file=/dev/ng0n1, node=-1
> > >> submitter=1, tid=2077, file=/dev/ng1n1, node=-1
> > >> polled=1, fixedbufs=1/0, register_files=1, buffered=1, QD=256
> > >> Engine=io_uring, sq_ring=256, cq_ring=256
> > >> polled=1, fixedbufs=1/0, register_files=1, buffered=1, QD=256
> > >> Engine=io_uring, sq_ring=256, cq_ring=256
> > >> IOPS=6.95M, BW=3.39GiB/s, IOS/call=32/31
> > >> IOPS=7.06M, BW=3.45GiB/s, IOS/call=32/32
> > >> IOPS=7.06M, BW=3.45GiB/s, IOS/call=32/31
> > >> Exiting on timeout
> > >> Maximum IOPS=7.06M
> > >>
> > >> [2]
> > >>   # taskset -c 2,3 t/io_uring -b512 -d256 -c32 -s32 -p1 -F1 -B1 -O0 -n2
> > >> -u1 -r4 /dev/ng0n1 /dev/ng1n1
> > >> submitter=0, tid=2123, file=/dev/ng0n1, node=-1
> > >> submitter=1, tid=2124, file=/dev/ng1n1, node=-1
> > >> polled=1, fixedbufs=1/0, register_files=1, buffered=1, QD=256
> > >> Engine=io_uring, sq_ring=256, cq_ring=256
> > >> IOPS=8.27M, BW=4.04GiB/s, IOS/call=32/31
> > >> IOPS=8.29M, BW=4.05GiB/s, IOS/call=32/31
> > >> IOPS=8.29M, BW=4.05GiB/s, IOS/call=31/31
> > >> Exiting on timeout
> > >> Maximum IOPS=8.29M
> > >
> > > It's all really down to how expensive getting the current time is on
> > > your box, some will be better and some worse
> > >
> > > One idea that has been bounced around in the past is to have a
> > > blk_ktime_get_ns() and have it be something ala:
> > >
> > > u64 blk_ktime_get_ns(void)
> > > {
> > >       struct blk_plug *plug = current->plug;
> > >
> > >       if (!plug)
> > >               return ktime_get_ns();
> > >
> > >       if (!plug->ktime_valid)
> > >               plug->ktime = ktime_get_ns();
> > >
> > >       return plug->ktime;
> > > }
> > >
> > > in freestyle form, with the idea being that we don't care granularity to
> > > the extent that we'd need a new stamp every time.
> > >
> > > If the task is scheduled out, the plug is flushed anyway, which should
> > > invalidate the stamp. For preemption this isn't true iirc, so we'd need
> > > some kind of blk_flush_plug_ts() or something for that case to
> > > invalidate it.
> > >
> > > Hopefully this could then also get away from passing in a cached value
> > > that we do in various spots, exactly because all of this time stamping
> > > is expensive. It's also a bit of a game of whack-a-mole, as users get
> > > added and distro kernels tend to turn on basically everything anyway.
> >
> > Did a quick'n dirty (see below), which recoups half of the lost
> > performance for me. And my kernel deliberately doesn't enable all of the
> > gunk in block/ that slows everything down, I suspect it'll be a bigger
> > win for you percentage wise:
> >
> > IOPS=121.42M, BW=59.29GiB/s, IOS/call=31/31
> > IOPS=121.47M, BW=59.31GiB/s, IOS/call=32/32
> > IOPS=121.44M, BW=59.30GiB/s, IOS/call=31/31
> > IOPS=121.47M, BW=59.31GiB/s, IOS/call=31/31
> > IOPS=121.45M, BW=59.30GiB/s, IOS/call=32/32
> > IOPS=119.95M, BW=58.57GiB/s, IOS/call=31/32
> > IOPS=115.30M, BW=56.30GiB/s, IOS/call=32/31
> > IOPS=115.38M, BW=56.34GiB/s, IOS/call=32/32
> > IOPS=115.35M, BW=56.32GiB/s, IOS/call=32/32
> >
> >
> > diff --git a/block/blk-core.c b/block/blk-core.c
> > index fdf25b8d6e78..6e74af442b94 100644
> > --- a/block/blk-core.c
> > +++ b/block/blk-core.c
> > @@ -1055,6 +1055,7 @@ void blk_start_plug_nr_ios(struct blk_plug *plug, unsigned short nr_ios)
> >         plug->rq_count = 0;
> >         plug->multiple_queues = false;
> >         plug->has_elevator = false;
> > +       plug->cur_ktime = 0;
> >         INIT_LIST_HEAD(&plug->cb_list);
> >
> >         /*
> > diff --git a/block/blk-flush.c b/block/blk-flush.c
> > index 3f4d41952ef2..e4e9f7eed346 100644
> > --- a/block/blk-flush.c
> > +++ b/block/blk-flush.c
> > @@ -143,7 +143,7 @@ static void blk_account_io_flush(struct request *rq)
> >         part_stat_lock();
> >         part_stat_inc(part, ios[STAT_FLUSH]);
> >         part_stat_add(part, nsecs[STAT_FLUSH],
> > -                     ktime_get_ns() - rq->start_time_ns);
> > +                     blk_ktime_get_ns() - rq->start_time_ns);
> >         part_stat_unlock();
> >  }
> >
> > diff --git a/block/blk-iocost.c b/block/blk-iocost.c
> > index 089fcb9cfce3..d06cea625462 100644
> > --- a/block/blk-iocost.c
> > +++ b/block/blk-iocost.c
> > @@ -829,7 +829,7 @@ static int ioc_autop_idx(struct ioc *ioc, struct gendisk *disk)
> >
> >         /* step up/down based on the vrate */
> >         vrate_pct = div64_u64(ioc->vtime_base_rate * 100, VTIME_PER_USEC);
> > -       now_ns = ktime_get_ns();
> > +       now_ns = blk_ktime_get_ns();
> >
> >         if (p->too_fast_vrate_pct && p->too_fast_vrate_pct <= vrate_pct) {
> >                 if (!ioc->autop_too_fast_at)
> > @@ -1044,7 +1044,7 @@ static void ioc_now(struct ioc *ioc, struct ioc_now *now)
> >         unsigned seq;
> >         u64 vrate;
> >
> > -       now->now_ns = ktime_get();
> > +       now->now_ns = blk_ktime_get();
> >         now->now = ktime_to_us(now->now_ns);
> >         vrate = atomic64_read(&ioc->vtime_rate);
> >
> > @@ -2810,7 +2810,7 @@ static void ioc_rqos_done(struct rq_qos *rqos, struct request *rq)
> >                 return;
> >         }
> >
> > -       on_q_ns = ktime_get_ns() - rq->alloc_time_ns;
> > +       on_q_ns = blk_ktime_get_ns() - rq->alloc_time_ns;
> >         rq_wait_ns = rq->start_time_ns - rq->alloc_time_ns;
> >         size_nsec = div64_u64(calc_size_vtime_cost(rq, ioc), VTIME_PER_NSEC);
> >
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index 900c1be1fee1..9c96dee9e584 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -323,7 +323,7 @@ void blk_rq_init(struct request_queue *q, struct request *rq)
> >         RB_CLEAR_NODE(&rq->rb_node);
> >         rq->tag = BLK_MQ_NO_TAG;
> >         rq->internal_tag = BLK_MQ_NO_TAG;
> > -       rq->start_time_ns = ktime_get_ns();
> > +       rq->start_time_ns = blk_ktime_get_ns();
> >         rq->part = NULL;
> >         blk_crypto_rq_set_defaults(rq);
> >  }
> > @@ -333,7 +333,7 @@ EXPORT_SYMBOL(blk_rq_init);
> >  static inline void blk_mq_rq_time_init(struct request *rq, u64 alloc_time_ns)
> >  {
> >         if (blk_mq_need_time_stamp(rq))
> > -               rq->start_time_ns = ktime_get_ns();
> > +               rq->start_time_ns = blk_ktime_get_ns();
> >         else
> >                 rq->start_time_ns = 0;
> >
> > @@ -444,7 +444,7 @@ static struct request *__blk_mq_alloc_requests(struct blk_mq_alloc_data *data)
> >
> >         /* alloc_time includes depth and tag waits */
> >         if (blk_queue_rq_alloc_time(q))
> > -               alloc_time_ns = ktime_get_ns();
> > +               alloc_time_ns = blk_ktime_get_ns();
> >
> >         if (data->cmd_flags & REQ_NOWAIT)
> >                 data->flags |= BLK_MQ_REQ_NOWAIT;
> > @@ -629,7 +629,7 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
> >
> >         /* alloc_time includes depth and tag waits */
> >         if (blk_queue_rq_alloc_time(q))
> > -               alloc_time_ns = ktime_get_ns();
> > +               alloc_time_ns = blk_ktime_get_ns();
> >
> >         /*
> >          * If the tag allocator sleeps we could get an allocation for a
> > @@ -1037,7 +1037,7 @@ static inline void __blk_mq_end_request_acct(struct request *rq, u64 now)
> >  inline void __blk_mq_end_request(struct request *rq, blk_status_t error)
> >  {
> >         if (blk_mq_need_time_stamp(rq))
> > -               __blk_mq_end_request_acct(rq, ktime_get_ns());
> > +               __blk_mq_end_request_acct(rq, blk_ktime_get_ns());
> >
> >         blk_mq_finish_request(rq);
> >
> > @@ -1080,7 +1080,7 @@ void blk_mq_end_request_batch(struct io_comp_batch *iob)
> >         u64 now = 0;
> >
> >         if (iob->need_ts)
> > -               now = ktime_get_ns();
> > +               now = blk_ktime_get_ns();
> >
> >         while ((rq = rq_list_pop(&iob->req_list)) != NULL) {
> >                 prefetch(rq->bio);
> > @@ -1249,7 +1249,7 @@ void blk_mq_start_request(struct request *rq)
> >         trace_block_rq_issue(rq);
> >
> >         if (test_bit(QUEUE_FLAG_STATS, &q->queue_flags)) {
> > -               rq->io_start_time_ns = ktime_get_ns();
> > +               rq->io_start_time_ns = blk_ktime_get_ns();
> >                 rq->stats_sectors = blk_rq_sectors(rq);
> >                 rq->rq_flags |= RQF_STATS;
> >                 rq_qos_issue(q, rq);
> > @@ -3066,7 +3066,7 @@ blk_status_t blk_insert_cloned_request(struct request *rq)
> >         blk_mq_run_dispatch_ops(q,
> >                         ret = blk_mq_request_issue_directly(rq, true));
> >         if (ret)
> > -               blk_account_io_done(rq, ktime_get_ns());
> > +               blk_account_io_done(rq, blk_ktime_get_ns());
> >         return ret;
> >  }
> >  EXPORT_SYMBOL_GPL(blk_insert_cloned_request);
> > diff --git a/block/blk-throttle.c b/block/blk-throttle.c
> > index 13e4377a8b28..5919919ba1c3 100644
> > --- a/block/blk-throttle.c
> > +++ b/block/blk-throttle.c
> > @@ -1813,7 +1813,7 @@ static bool throtl_tg_is_idle(struct throtl_grp *tg)
> >         time = min_t(unsigned long, MAX_IDLE_TIME, 4 * tg->idletime_threshold);
> >         ret = tg->latency_target == DFL_LATENCY_TARGET ||
> >               tg->idletime_threshold == DFL_IDLE_THRESHOLD ||
> > -             (ktime_get_ns() >> 10) - tg->last_finish_time > time ||
> > +             (blk_ktime_get_ns() >> 10) - tg->last_finish_time > time ||
> >               tg->avg_idletime > tg->idletime_threshold ||
> >               (tg->latency_target && tg->bio_cnt &&
> >                 tg->bad_bio_cnt * 5 < tg->bio_cnt);
> > @@ -2058,7 +2058,7 @@ static void blk_throtl_update_idletime(struct throtl_grp *tg)
> >         if (last_finish_time == 0)
> >                 return;
> >
> > -       now = ktime_get_ns() >> 10;
> > +       now = blk_ktime_get_ns() >> 10;
> >         if (now <= last_finish_time ||
> >             last_finish_time == tg->checked_last_finish_time)
> >                 return;
> > @@ -2325,7 +2325,7 @@ void blk_throtl_bio_endio(struct bio *bio)
> >         if (!tg->td->limit_valid[LIMIT_LOW])
> >                 return;
> >
> > -       finish_time_ns = ktime_get_ns();
> > +       finish_time_ns = blk_ktime_get_ns();
> >         tg->last_finish_time = finish_time_ns >> 10;
> >
> >         start_time = bio_issue_time(&bio->bi_issue) >> 10;
> > diff --git a/block/blk.h b/block/blk.h
> > index 08a358bc0919..4f081a00e644 100644
> > --- a/block/blk.h
> > +++ b/block/blk.h
> > @@ -518,4 +518,17 @@ static inline int req_ref_read(struct request *req)
> >         return atomic_read(&req->ref);
> >  }
> >
> > +static inline u64 blk_ktime_get_ns(void)
> > +{
> > +       struct blk_plug *plug = current->plug;
> > +
> > +       if (!plug)
> > +               return ktime_get_ns();
> > +       if (!(plug->cur_ktime & 1ULL)) {
> > +               plug->cur_ktime = ktime_get_ns();
> > +               plug->cur_ktime |= 1ULL;
> > +       }
> > +       return plug->cur_ktime;
> > +}
> > +
> >  #endif /* BLK_INTERNAL_H */
> > diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> > index 51fa7ffdee83..081830279f70 100644
> > --- a/include/linux/blkdev.h
> > +++ b/include/linux/blkdev.h
> > @@ -972,6 +972,9 @@ struct blk_plug {
> >         bool multiple_queues;
> >         bool has_elevator;
> >
> > +       /* bit0 set if valid */
> > +       u64 cur_ktime;
> > +
> >         struct list_head cb_list; /* md requires an unplug callback */
> >  };
> >
> >
> > --
> > Jens Axboe
> >
> >
>
> Hi Jens,
>
> This is what I see with your changes on my setup.
>
> Before[1]: 7.06M
> After[2]: 7.52M
>
> [1]
> # taskset -c 2,3 t/io_uring -b512 -d256 -c32 -s32 -p1 -F1 -B1 -O0 -n2 \
> -u1 -r4 /dev/ng0n1 /dev/ng1n1
> submitter=0, tid=2076, file=/dev/ng0n1, node=-1
> submitter=1, tid=2077, file=/dev/ng1n1, node=-1
> polled=1, fixedbufs=1/0, register_files=1, buffered=1, QD=256
> Engine=io_uring, sq_ring=256, cq_ring=256
> polled=1, fixedbufs=1/0, register_files=1, buffered=1, QD=256
> Engine=io_uring, sq_ring=256, cq_ring=256
> IOPS=6.95M, BW=3.39GiB/s, IOS/call=32/31
> IOPS=7.06M, BW=3.45GiB/s, IOS/call=32/32
> IOPS=7.06M, BW=3.45GiB/s, IOS/call=32/31
> Exiting on timeout
> Maximum IOPS=7.06M
>
> [2]
> # taskset -c 2,3 t/io_uring -b512 -d256 -c32 -s32 -p1 -F1 -B1 -O0 -n2 \
> -u1 -r4 /dev/ng0n1 /dev/ng1n1
> submitter=0, tid=2204, file=/dev/ng0n1, node=-1
> submitter=1, tid=2205, file=/dev/ng1n1, node=-1
> polled=1, fixedbufs=1/0, register_files=1, buffered=1, QD=256
> Engine=io_uring, sq_ring=256, cq_ring=256
> IOPS=7.40M, BW=3.62GiB/s, IOS/call=32/31
> IOPS=7.51M, BW=3.67GiB/s, IOS/call=32/31
> IOPS=7.52M, BW=3.67GiB/s, IOS/call=32/32
> Exiting on timeout
> Maximum IOPS=7.52M
>
> The original patch avoids processing throttle stats and wbt_issue/done
> stats for passthrough-io path.
>
> Improvement with original-patch :
> 7.06M -> 8.29M
>
> It seems that both the optimizations are different. The original patch is about
> "completely disabling stats for passthrough-io" and your changes
> optimize getting the
> current time which would improve performance for everyone.
>
> I think both of them are independent.
>
> --
> Kundan Kumar

Any feedback on this?
--
Kundan Kumar
Jens Axboe Dec. 1, 2023, 2:31 p.m. UTC | #8
On 11/24/23 4:28 AM, Kundan Kumar wrote:
> This is what I see with your changes on my setup.
> 
> Before[1]: 7.06M
> After[2]: 7.52M
> 
> [1]
> # taskset -c 2,3 t/io_uring -b512 -d256 -c32 -s32 -p1 -F1 -B1 -O0 -n2 \
> -u1 -r4 /dev/ng0n1 /dev/ng1n1
> submitter=0, tid=2076, file=/dev/ng0n1, node=-1
> submitter=1, tid=2077, file=/dev/ng1n1, node=-1
> polled=1, fixedbufs=1/0, register_files=1, buffered=1, QD=256
> Engine=io_uring, sq_ring=256, cq_ring=256
> polled=1, fixedbufs=1/0, register_files=1, buffered=1, QD=256
> Engine=io_uring, sq_ring=256, cq_ring=256
> IOPS=6.95M, BW=3.39GiB/s, IOS/call=32/31
> IOPS=7.06M, BW=3.45GiB/s, IOS/call=32/32
> IOPS=7.06M, BW=3.45GiB/s, IOS/call=32/31
> Exiting on timeout
> Maximum IOPS=7.06M
> 
> [2]
> # taskset -c 2,3 t/io_uring -b512 -d256 -c32 -s32 -p1 -F1 -B1 -O0 -n2 \
> -u1 -r4 /dev/ng0n1 /dev/ng1n1
> submitter=0, tid=2204, file=/dev/ng0n1, node=-1
> submitter=1, tid=2205, file=/dev/ng1n1, node=-1
> polled=1, fixedbufs=1/0, register_files=1, buffered=1, QD=256
> Engine=io_uring, sq_ring=256, cq_ring=256
> IOPS=7.40M, BW=3.62GiB/s, IOS/call=32/31
> IOPS=7.51M, BW=3.67GiB/s, IOS/call=32/31
> IOPS=7.52M, BW=3.67GiB/s, IOS/call=32/32
> Exiting on timeout
> Maximum IOPS=7.52M
> 
> The original patch avoids processing throttle stats and wbt_issue/done
> stats for passthrough-io path.
> 
> Improvement with original-patch :
> 7.06M -> 8.29M
> 
> It seems that both the optimizations are different. The original patch is about
> "completely disabling stats for passthrough-io" and your changes
> optimize getting the
> current time which would improve performance for everyone.
> 
> I think both of them are independent.

Yes they are; mine is just a general "we should do something like this
rather than play whack-a-mole on the issue side for time stamping". It
doesn't solve the completion side, which is why your patch is better for
passthrough as a whole.

I do think we should add your patch; they are orthogonal. Did you send
out a v2 we can queue up for 6.8?

Patch

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 900c1be1fee1..2de9c14dd43f 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1248,7 +1248,8 @@  void blk_mq_start_request(struct request *rq)
 
 	trace_block_rq_issue(rq);
 
-	if (test_bit(QUEUE_FLAG_STATS, &q->queue_flags)) {
+	if (test_bit(QUEUE_FLAG_STATS, &q->queue_flags)
+			&& !blk_rq_is_passthrough(rq)) {
 		rq->io_start_time_ns = ktime_get_ns();
 		rq->stats_sectors = blk_rq_sectors(rq);
 		rq->rq_flags |= RQF_STATS;
diff --git a/block/blk-rq-qos.h b/block/blk-rq-qos.h
index f48ee150d667..37245c97ee61 100644
--- a/block/blk-rq-qos.h
+++ b/block/blk-rq-qos.h
@@ -118,7 +118,7 @@  static inline void rq_qos_cleanup(struct request_queue *q, struct bio *bio)
 
 static inline void rq_qos_done(struct request_queue *q, struct request *rq)
 {
-	if (q->rq_qos)
+	if (q->rq_qos && !blk_rq_is_passthrough(rq))
 		__rq_qos_done(q->rq_qos, rq);
 }
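
For reference, with the blk-mq.c hunk above applied and the operator moved
per Christoph's style comment, the guarded block in blk_mq_start_request()
would read roughly as below. Note that rq_qos_issue() already sits inside
this block (visible in the context quoted earlier in the thread), so skipping
the block for passthrough requests also skips rq_qos_issue(), matching the
commit message. Reconstructed for illustration; trailing lines elided:

	if (test_bit(QUEUE_FLAG_STATS, &q->queue_flags) &&
	    !blk_rq_is_passthrough(rq)) {
		rq->io_start_time_ns = ktime_get_ns();
		rq->stats_sectors = blk_rq_sectors(rq);
		rq->rq_flags |= RQF_STATS;
		rq_qos_issue(q, rq);
		...
	}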