Message ID: 20231123102431.6804-1-kundan.kumar@samsung.com (mailing list archive)
State: New, archived
Series: block: skip QUEUE_FLAG_STATS and rq-qos for passthrough io
On Thu, Nov 23, 2023 at 03:54:31PM +0530, kundan.kumar wrote:
> -	if (test_bit(QUEUE_FLAG_STATS, &q->queue_flags)) {
> +	if (test_bit(QUEUE_FLAG_STATS, &q->queue_flags)
> +		&& !blk_rq_is_passthrough(rq)) {

&& goes on the starting line in the kernel code style.

The rest looks good, but that stats overhead seems pretty horrible..
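For reference, the two changed lines with the operator kept at the end
of the first line, as suggested above (illustrative only, how a respun
hunk might read):

	/* kernel coding style: && stays at the end of the first line */
	if (test_bit(QUEUE_FLAG_STATS, &q->queue_flags) &&
	    !blk_rq_is_passthrough(rq)) {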
On 11/23/2023 9:00 PM, Christoph Hellwig wrote:
> The rest looks good, but that stats overhead seems pretty horrible..
On my setup
Before[1]: 7.06M
After[2]: 8.29M
[1]
# taskset -c 2,3 t/io_uring -b512 -d256 -c32 -s32 -p1 -F1 -B1 -O0 -n2 \
  -u1 -r4 /dev/ng0n1 /dev/ng1n1
submitter=0, tid=2076, file=/dev/ng0n1, node=-1
submitter=1, tid=2077, file=/dev/ng1n1, node=-1
polled=1, fixedbufs=1/0, register_files=1, buffered=1, QD=256
Engine=io_uring, sq_ring=256, cq_ring=256
polled=1, fixedbufs=1/0, register_files=1, buffered=1, QD=256
Engine=io_uring, sq_ring=256, cq_ring=256
IOPS=6.95M, BW=3.39GiB/s, IOS/call=32/31
IOPS=7.06M, BW=3.45GiB/s, IOS/call=32/32
IOPS=7.06M, BW=3.45GiB/s, IOS/call=32/31
Exiting on timeout
Maximum IOPS=7.06M
[2]
# taskset -c 2,3 t/io_uring -b512 -d256 -c32 -s32 -p1 -F1 -B1 -O0 -n2 \
  -u1 -r4 /dev/ng0n1 /dev/ng1n1
submitter=0, tid=2123, file=/dev/ng0n1, node=-1
submitter=1, tid=2124, file=/dev/ng1n1, node=-1
polled=1, fixedbufs=1/0, register_files=1, buffered=1, QD=256
Engine=io_uring, sq_ring=256, cq_ring=256
IOPS=8.27M, BW=4.04GiB/s, IOS/call=32/31
IOPS=8.29M, BW=4.05GiB/s, IOS/call=32/31
IOPS=8.29M, BW=4.05GiB/s, IOS/call=31/31
Exiting on timeout
Maximum IOPS=8.29M
On 11/23/23 8:30 AM, Christoph Hellwig wrote:
> On Thu, Nov 23, 2023 at 03:54:31PM +0530, kundan.kumar wrote:
>> -	if (test_bit(QUEUE_FLAG_STATS, &q->queue_flags)) {
>> +	if (test_bit(QUEUE_FLAG_STATS, &q->queue_flags)
>> +		&& !blk_rq_is_passthrough(rq)) {
>
> && goes on the starting line in the kernel code style.
>
> The rest looks good, but that stats overhead seems pretty horrible..

It is, but at least it's better than what it used to be. Here's an
example from my box, you can pretty easily tell when iostats got enabled
while it was running:

[...]
submitter=22, tid=645761, file=/dev/nvme22n1, node=6
submitter=23, tid=645762, file=/dev/nvme23n1, node=6
polled=1, fixedbufs=1/0, register_files=1, buffered=0, QD=128
Engine=io_uring, sq_ring=128, cq_ring=128
IOPS=119.09M, BW=58.15GiB/s, IOS/call=31/31
IOPS=122.03M, BW=59.59GiB/s, IOS/call=31/31
IOPS=122.01M, BW=59.58GiB/s, IOS/call=31/32
IOPS=122.02M, BW=59.58GiB/s, IOS/call=31/32
IOPS=122.02M, BW=59.58GiB/s, IOS/call=31/31
IOPS=112.72M, BW=55.04GiB/s, IOS/call=31/31
IOPS=108.43M, BW=52.94GiB/s, IOS/call=31/32
IOPS=108.46M, BW=52.96GiB/s, IOS/call=32/31
IOPS=108.58M, BW=53.02GiB/s, IOS/call=31/31
On 11/23/23 11:12 AM, Kanchan Joshi wrote:
> On 11/23/2023 9:00 PM, Christoph Hellwig wrote:
>> The rest looks good, but that stats overhead seems pretty horrible..
>
> On my setup
> Before[1]: 7.06M
> After[2]: 8.29M
>
> [...]

It's all really down to how expensive getting the current time is on
your box, some will be better and some worse.

One idea that has been bounced around in the past is to have a
blk_ktime_get_ns() and have it be something ala:

u64 blk_ktime_get_ns(void)
{
	struct blk_plug *plug = current->plug;

	if (!plug)
		return ktime_get_ns();

	if (!plug->ktime_valid)
		plug->ktime = ktime_get_ns();

	return plug->ktime;
}

in freestyle form, with the idea being that we don't care about
granularity to the extent that we'd need a new stamp every time.

If the task is scheduled out, the plug is flushed anyway, which should
invalidate the stamp. For preemption this isn't true iirc, so we'd need
some kind of blk_flush_plug_ts() or something for that case to
invalidate it.

Hopefully this could then also get away from passing in a cached value
that we do in various spots, exactly because all of this time stamping
is expensive. It's also a bit of a game of whack-a-mole, as users get
added and distro kernels tend to turn on basically everything anyway.
On 11/23/23 1:19 PM, Jens Axboe wrote:
> On 11/23/23 11:12 AM, Kanchan Joshi wrote:
>> On my setup
>> Before[1]: 7.06M
>> After[2]: 8.29M
>>
>> [...]
>
> It's all really down to how expensive getting the current time is on
> your box, some will be better and some worse.
>
> One idea that has been bounced around in the past is to have a
> blk_ktime_get_ns() and have it be something ala:
>
> [...]
>
> Hopefully this could then also get away from passing in a cached value
> that we do in various spots, exactly because all of this time stamping
> is expensive. It's also a bit of a game of whack-a-mole, as users get
> added and distro kernels tend to turn on basically everything anyway.

Did a quick'n dirty (see below), which recoups half of the lost
performance for me.
And my kernel deliberately doesn't enable all of the gunk in block/
that slows everything down, I suspect it'll be a bigger win for you
percentage wise:

IOPS=121.42M, BW=59.29GiB/s, IOS/call=31/31
IOPS=121.47M, BW=59.31GiB/s, IOS/call=32/32
IOPS=121.44M, BW=59.30GiB/s, IOS/call=31/31
IOPS=121.47M, BW=59.31GiB/s, IOS/call=31/31
IOPS=121.45M, BW=59.30GiB/s, IOS/call=32/32
IOPS=119.95M, BW=58.57GiB/s, IOS/call=31/32
IOPS=115.30M, BW=56.30GiB/s, IOS/call=32/31
IOPS=115.38M, BW=56.34GiB/s, IOS/call=32/32
IOPS=115.35M, BW=56.32GiB/s, IOS/call=32/32


diff --git a/block/blk-core.c b/block/blk-core.c
index fdf25b8d6e78..6e74af442b94 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1055,6 +1055,7 @@ void blk_start_plug_nr_ios(struct blk_plug *plug, unsigned short nr_ios)
 	plug->rq_count = 0;
 	plug->multiple_queues = false;
 	plug->has_elevator = false;
+	plug->cur_ktime = 0;
 	INIT_LIST_HEAD(&plug->cb_list);
 
 	/*
diff --git a/block/blk-flush.c b/block/blk-flush.c
index 3f4d41952ef2..e4e9f7eed346 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -143,7 +143,7 @@ static void blk_account_io_flush(struct request *rq)
 	part_stat_lock();
 	part_stat_inc(part, ios[STAT_FLUSH]);
 	part_stat_add(part, nsecs[STAT_FLUSH],
-		      ktime_get_ns() - rq->start_time_ns);
+		      blk_ktime_get_ns() - rq->start_time_ns);
 	part_stat_unlock();
 }
 
diff --git a/block/blk-iocost.c b/block/blk-iocost.c
index 089fcb9cfce3..d06cea625462 100644
--- a/block/blk-iocost.c
+++ b/block/blk-iocost.c
@@ -829,7 +829,7 @@ static int ioc_autop_idx(struct ioc *ioc, struct gendisk *disk)
 
 	/* step up/down based on the vrate */
 	vrate_pct = div64_u64(ioc->vtime_base_rate * 100, VTIME_PER_USEC);
-	now_ns = ktime_get_ns();
+	now_ns = blk_ktime_get_ns();
 
 	if (p->too_fast_vrate_pct && p->too_fast_vrate_pct <= vrate_pct) {
 		if (!ioc->autop_too_fast_at)
@@ -1044,7 +1044,7 @@ static void ioc_now(struct ioc *ioc, struct ioc_now *now)
 	unsigned seq;
 	u64 vrate;
 
-	now->now_ns = ktime_get();
+	now->now_ns = blk_ktime_get();
 	now->now = ktime_to_us(now->now_ns);
 	vrate = atomic64_read(&ioc->vtime_rate);
 
@@ -2810,7 +2810,7 @@ static void ioc_rqos_done(struct rq_qos *rqos, struct request *rq)
 		return;
 	}
 
-	on_q_ns = ktime_get_ns() - rq->alloc_time_ns;
+	on_q_ns = blk_ktime_get_ns() - rq->alloc_time_ns;
 	rq_wait_ns = rq->start_time_ns - rq->alloc_time_ns;
 	size_nsec = div64_u64(calc_size_vtime_cost(rq, ioc), VTIME_PER_NSEC);
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 900c1be1fee1..9c96dee9e584 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -323,7 +323,7 @@ void blk_rq_init(struct request_queue *q, struct request *rq)
 	RB_CLEAR_NODE(&rq->rb_node);
 	rq->tag = BLK_MQ_NO_TAG;
 	rq->internal_tag = BLK_MQ_NO_TAG;
-	rq->start_time_ns = ktime_get_ns();
+	rq->start_time_ns = blk_ktime_get_ns();
 	rq->part = NULL;
 	blk_crypto_rq_set_defaults(rq);
 }
@@ -333,7 +333,7 @@ EXPORT_SYMBOL(blk_rq_init);
 static inline void blk_mq_rq_time_init(struct request *rq, u64 alloc_time_ns)
 {
 	if (blk_mq_need_time_stamp(rq))
-		rq->start_time_ns = ktime_get_ns();
+		rq->start_time_ns = blk_ktime_get_ns();
 	else
 		rq->start_time_ns = 0;
 
@@ -444,7 +444,7 @@ static struct request *__blk_mq_alloc_requests(struct blk_mq_alloc_data *data)
 
 	/* alloc_time includes depth and tag waits */
 	if (blk_queue_rq_alloc_time(q))
-		alloc_time_ns = ktime_get_ns();
+		alloc_time_ns = blk_ktime_get_ns();
 
 	if (data->cmd_flags & REQ_NOWAIT)
 		data->flags |= BLK_MQ_REQ_NOWAIT;
@@ -629,7 +629,7 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
 
 	/* alloc_time includes depth and tag waits */
 	if (blk_queue_rq_alloc_time(q))
-		alloc_time_ns = ktime_get_ns();
+		alloc_time_ns = blk_ktime_get_ns();
 
 	/*
 	 * If the tag allocator sleeps we could get an allocation for a
@@ -1037,7 +1037,7 @@ static inline void __blk_mq_end_request_acct(struct request *rq, u64 now)
 inline void __blk_mq_end_request(struct request *rq, blk_status_t error)
 {
 	if (blk_mq_need_time_stamp(rq))
-		__blk_mq_end_request_acct(rq, ktime_get_ns());
+		__blk_mq_end_request_acct(rq, blk_ktime_get_ns());
 
 	blk_mq_finish_request(rq);
 
@@ -1080,7 +1080,7 @@ void blk_mq_end_request_batch(struct io_comp_batch *iob)
 	u64 now = 0;
 
 	if (iob->need_ts)
-		now = ktime_get_ns();
+		now = blk_ktime_get_ns();
 
 	while ((rq = rq_list_pop(&iob->req_list)) != NULL) {
 		prefetch(rq->bio);
@@ -1249,7 +1249,7 @@ void blk_mq_start_request(struct request *rq)
 	trace_block_rq_issue(rq);
 
 	if (test_bit(QUEUE_FLAG_STATS, &q->queue_flags)) {
-		rq->io_start_time_ns = ktime_get_ns();
+		rq->io_start_time_ns = blk_ktime_get_ns();
 		rq->stats_sectors = blk_rq_sectors(rq);
 		rq->rq_flags |= RQF_STATS;
 		rq_qos_issue(q, rq);
@@ -3066,7 +3066,7 @@ blk_status_t blk_insert_cloned_request(struct request *rq)
 	blk_mq_run_dispatch_ops(q,
 			ret = blk_mq_request_issue_directly(rq, true));
 	if (ret)
-		blk_account_io_done(rq, ktime_get_ns());
+		blk_account_io_done(rq, blk_ktime_get_ns());
 	return ret;
 }
 EXPORT_SYMBOL_GPL(blk_insert_cloned_request);
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 13e4377a8b28..5919919ba1c3 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -1813,7 +1813,7 @@ static bool throtl_tg_is_idle(struct throtl_grp *tg)
 	time = min_t(unsigned long, MAX_IDLE_TIME, 4 * tg->idletime_threshold);
 	ret = tg->latency_target == DFL_LATENCY_TARGET ||
 	      tg->idletime_threshold == DFL_IDLE_THRESHOLD ||
-	      (ktime_get_ns() >> 10) - tg->last_finish_time > time ||
+	      (blk_ktime_get_ns() >> 10) - tg->last_finish_time > time ||
 	      tg->avg_idletime > tg->idletime_threshold ||
 	      (tg->latency_target && tg->bio_cnt &&
 		tg->bad_bio_cnt * 5 < tg->bio_cnt);
@@ -2058,7 +2058,7 @@ static void blk_throtl_update_idletime(struct throtl_grp *tg)
 	if (last_finish_time == 0)
 		return;
 
-	now = ktime_get_ns() >> 10;
+	now = blk_ktime_get_ns() >> 10;
 	if (now <= last_finish_time ||
 	    last_finish_time == tg->checked_last_finish_time)
 		return;
@@ -2325,7 +2325,7 @@ void blk_throtl_bio_endio(struct bio *bio)
 	if (!tg->td->limit_valid[LIMIT_LOW])
 		return;
 
-	finish_time_ns = ktime_get_ns();
+	finish_time_ns = blk_ktime_get_ns();
 	tg->last_finish_time = finish_time_ns >> 10;
 
 	start_time = bio_issue_time(&bio->bi_issue) >> 10;
diff --git a/block/blk.h b/block/blk.h
index 08a358bc0919..4f081a00e644 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -518,4 +518,17 @@ static inline int req_ref_read(struct request *req)
 	return atomic_read(&req->ref);
 }
 
+static inline u64 blk_ktime_get_ns(void)
+{
+	struct blk_plug *plug = current->plug;
+
+	if (!plug)
+		return ktime_get_ns();
+	if (!(plug->cur_ktime & 1ULL)) {
+		plug->cur_ktime = ktime_get_ns();
+		plug->cur_ktime |= 1ULL;
+	}
+	return plug->cur_ktime;
+}
+
 #endif /* BLK_INTERNAL_H */
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 51fa7ffdee83..081830279f70 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -972,6 +972,9 @@ struct blk_plug {
 	bool multiple_queues;
 	bool has_elevator;
 
+	/* bit0 set if valid */
+	u64 cur_ktime;
+
 	struct list_head cb_list; /* md requires an unplug callback */
 };
 
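For illustration only, the blk_flush_plug_ts()-style invalidation
mentioned earlier for the preemption case is not part of this diff; a
minimal sketch, assuming the cur_ktime field from the patch above and a
hypothetical caller on the preemption path, might be no more than:

/*
 * Hypothetical sketch, not from the posted diff: clearing cur_ktime
 * also clears the bit0 "valid" flag, so the next blk_ktime_get_ns()
 * re-reads the clock. Name and call site are assumptions.
 */
static inline void blk_flush_plug_ts(struct task_struct *tsk)
{
	struct blk_plug *plug = tsk->plug;

	if (plug)
		plug->cur_ktime = 0;
}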
On Fri, Nov 24, 2023 at 2:07 AM Jens Axboe <axboe@kernel.dk> wrote:
>
> On 11/23/23 1:19 PM, Jens Axboe wrote:
> > It's all really down to how expensive getting the current time is on
> > your box, some will be better and some worse.
> >
> > [...]
>
> Did a quick'n dirty (see below), which recoups half of the lost
> performance for me. And my kernel deliberately doesn't enable all of
> the gunk in block/ that slows everything down, I suspect it'll be a
> bigger win for you percentage wise:
>
> [...]

Hi Jens,

This is what I see with your changes on my setup.

Before[1]: 7.06M
After[2]: 7.52M

[1]
# taskset -c 2,3 t/io_uring -b512 -d256 -c32 -s32 -p1 -F1 -B1 -O0 -n2 \
  -u1 -r4 /dev/ng0n1 /dev/ng1n1
submitter=0, tid=2076, file=/dev/ng0n1, node=-1
submitter=1, tid=2077, file=/dev/ng1n1, node=-1
polled=1, fixedbufs=1/0, register_files=1, buffered=1, QD=256
Engine=io_uring, sq_ring=256, cq_ring=256
polled=1, fixedbufs=1/0, register_files=1, buffered=1, QD=256
Engine=io_uring, sq_ring=256, cq_ring=256
IOPS=6.95M, BW=3.39GiB/s, IOS/call=32/31
IOPS=7.06M, BW=3.45GiB/s, IOS/call=32/32
IOPS=7.06M, BW=3.45GiB/s, IOS/call=32/31
Exiting on timeout
Maximum IOPS=7.06M

[2]
# taskset -c 2,3 t/io_uring -b512 -d256 -c32 -s32 -p1 -F1 -B1 -O0 -n2 \
  -u1 -r4 /dev/ng0n1 /dev/ng1n1
submitter=0, tid=2204, file=/dev/ng0n1, node=-1
submitter=1, tid=2205, file=/dev/ng1n1, node=-1
polled=1, fixedbufs=1/0, register_files=1, buffered=1, QD=256
Engine=io_uring, sq_ring=256, cq_ring=256
IOPS=7.40M, BW=3.62GiB/s, IOS/call=32/31
IOPS=7.51M, BW=3.67GiB/s, IOS/call=32/31
IOPS=7.52M, BW=3.67GiB/s, IOS/call=32/32
Exiting on timeout
Maximum IOPS=7.52M

The original patch avoids processing throttle stats and wbt_issue/done
stats for the passthrough I/O path.

Improvement with the original patch: 7.06M -> 8.29M

It seems that the two optimizations are different. The original patch
completely disables stats for passthrough I/O, while your change
optimizes getting the current time, which improves performance for
everyone.

I think both of them are independent.

--
Kundan Kumar
On Fri, Nov 24, 2023 at 4:58 PM Kundan Kumar <kundanthebest@gmail.com> wrote:
>
> On Fri, Nov 24, 2023 at 2:07 AM Jens Axboe <axboe@kernel.dk> wrote:
> >
> > [...]
>
> Hi Jens,
>
> This is what I see with your changes on my setup.
>
> Before[1]: 7.06M
> After[2]: 7.52M
>
> [...]
>
> It seems that the two optimizations are different. The original patch
> completely disables stats for passthrough I/O, while your change
> optimizes getting the current time, which improves performance for
> everyone.
>
> I think both of them are independent.

Any feedback on this?

--
Kundan Kumar
On 11/24/23 4:28 AM, Kundan Kumar wrote:
> This is what I see with your changes on my setup.
>
> Before[1]: 7.06M
> After[2]: 7.52M
>
> [...]
>
> The original patch avoids processing throttle stats and wbt_issue/done
> stats for the passthrough I/O path.
>
> Improvement with the original patch: 7.06M -> 8.29M
>
> It seems that the two optimizations are different. The original patch
> completely disables stats for passthrough I/O, while your change
> optimizes getting the current time, which improves performance for
> everyone.
>
> I think both of them are independent.

Yes they are, mine is just a general "we should do something like this
rather than play whack-a-mole on the issue side for time stamping". It
doesn't solve the completion side, which is why your patch is better
for passthrough as a whole.

I do think we should add your patch, they are orthogonal. Did you send
out a v2 we can queue up for 6.8?
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 900c1be1fee1..2de9c14dd43f 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1248,7 +1248,8 @@ void blk_mq_start_request(struct request *rq)
 
 	trace_block_rq_issue(rq);
 
-	if (test_bit(QUEUE_FLAG_STATS, &q->queue_flags)) {
+	if (test_bit(QUEUE_FLAG_STATS, &q->queue_flags)
+		&& !blk_rq_is_passthrough(rq)) {
 		rq->io_start_time_ns = ktime_get_ns();
 		rq->stats_sectors = blk_rq_sectors(rq);
 		rq->rq_flags |= RQF_STATS;
diff --git a/block/blk-rq-qos.h b/block/blk-rq-qos.h
index f48ee150d667..37245c97ee61 100644
--- a/block/blk-rq-qos.h
+++ b/block/blk-rq-qos.h
@@ -118,7 +118,7 @@ static inline void rq_qos_cleanup(struct request_queue *q, struct bio *bio)
 
 static inline void rq_qos_done(struct request_queue *q, struct request *rq)
 {
-	if (q->rq_qos)
+	if (q->rq_qos && !blk_rq_is_passthrough(rq))
 		__rq_qos_done(q->rq_qos, rq);
 }
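For context on what these hunks exempt: a passthrough request carries a
driver-private operation, which is what the NVMe generic character
devices (/dev/ngXnY) used in the benchmarks above issue. A rough
paraphrase of the check, under a hypothetical helper name and not the
exact upstream definition:

/*
 * Roughly what blk_rq_is_passthrough() boils down to: driver-private
 * commands use the DRV_IN/DRV_OUT ops, so the stats and rq-qos hooks
 * above are skipped for exactly those requests. Paraphrased for
 * illustration; the real helper lives in the blk-mq/blkdev headers.
 */
static inline bool rq_is_passthrough_sketch(struct request *rq)
{
	const unsigned int op = req_op(rq);

	return op == REQ_OP_DRV_IN || op == REQ_OP_DRV_OUT;
}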