From patchwork Mon Oct 26 06:15:34 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Weiping Zhang
X-Patchwork-Id: 11855783
Date: Mon, 26 Oct 2020 14:15:34 +0800
From: Weiping Zhang
Subject: [PATCH v4 1/2] block: fix inaccurate io_ticks
Message-ID: <20201026061533.GA23974@192.168.3.9>
Mail-Followup-To: axboe@kernel.dk, ming.lei@redhat.com, snitzer@redhat.com,
 mpatocka@redhat.com, linux-block@vger.kernel.org
Content-Disposition: inline
User-Agent: Mutt/1.5.21 (2010-09-15)
Precedence: bulk
X-Mailing-List: linux-block@vger.kernel.org

Do not add io_ticks if there is no in-flight IO when starting a new IO;
otherwise an extra 1 jiffy is added for this IO.

I ran the following command on a host with different kernel versions:

fio -name=test -ioengine=sync -bs=4K -rw=write
-filename=/home/test.fio.log -size=100M -time_based=1 -direct=1
-runtime=300 -rate=2m,2m

When fio runs in sync direct IO mode, IOs are processed one by one, and
you can see that 512 IOs complete per second.

kernel: 4.19.0

Device:  rrqm/s  wrqm/s    r/s     w/s   rMB/s   wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
vda        0.00    0.00   0.00  512.00    0.00    2.00     8.00     0.21    0.40    0.00    0.40   0.40  20.60

The average IO latency is 0.4 ms, so the disk time spent in one second
should be 0.4 * 512 = 204.8 ms, which means %util should be about 20%.

Because update_io_ticks adds an extra 1 jiffy (1 ms) for every IO, the
accounted IO latency becomes 1 + 0.4 = 1.4 ms, and 1.4 * 512 = 716.8 ms,
so %util is reported as about 72%.
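
The arithmetic above can be reproduced with a small user-space model. This
is only a sketch, not kernel code. It assumes HZ=1000 (1 jiffy = 1 ms),
evenly spaced IOs with a fixed 0.4 ms latency, and that the completion path
calls the updater with inflight=true. The names part_model, update_old,
update_new and simulate_io_ticks are made up for illustration.

```c
#include <stdbool.h>
#include <stdio.h>

#define NR_IOS        512      /* IOs completed in one second (from iostat) */
#define LATENCY_US    400      /* per-IO latency in microseconds (0.4 ms) */
#define US_PER_JIFFY  1000     /* assumption: HZ=1000 */

/* Hypothetical stand-in for the two hd_struct fields involved. */
struct part_model {
	unsigned long stamp;     /* jiffy of the last io_ticks update */
	unsigned long io_ticks;  /* accumulated busy time, in jiffies */
};

/* Before the patch: an IO start always charges one jiffy. */
static void update_old(struct part_model *p, unsigned long now, bool end)
{
	if (p->stamp != now) {
		p->io_ticks += end ? now - p->stamp : 1;
		p->stamp = now;
	}
}

/* After the patch: charge elapsed time only if IO was in flight. */
static void update_new(struct part_model *p, unsigned long now, bool inflight)
{
	if (p->stamp != now) {
		if (inflight)
			p->io_ticks += now - p->stamp;
		p->stamp = now;
	}
}

/* Replay one second of the sync fio workload; return io_ticks in jiffies. */
unsigned long simulate_io_ticks(bool patched)
{
	struct part_model p = { 0, 0 };

	for (int i = 0; i < NR_IOS; i++) {
		unsigned long start_us = (unsigned long)i * 1000000UL / NR_IOS;
		unsigned long start = start_us / US_PER_JIFFY;
		unsigned long end = (start_us + LATENCY_US) / US_PER_JIFFY;

		if (patched) {
			update_new(&p, start, false); /* queue is idle at submit */
			update_new(&p, end, true);    /* request still in flight */
		} else {
			update_old(&p, start, false);
			update_old(&p, end, true);
		}
	}
	return p.io_ticks;
}

#ifdef IO_TICKS_DEMO_MAIN
int main(void)
{
	/* One second is 1000 jiffies, so %util = io_ticks / 10. */
	printf("old %%util: %.1f\n", simulate_io_ticks(false) / 10.0);
	printf("new %%util: %.1f\n", simulate_io_ticks(true) / 10.0);
	return 0;
}
#endif
```

Building with -DIO_TICKS_DEMO_MAIN and running it prints the modelled %util
for both variants. Under these assumptions the unpatched model lands near
70% and the patched one near 20%, in line with the iostat samples in this
message.
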
Device     r/s     w/s   rMB/s   wMB/s  rrqm/s  wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
vda       0.00  512.00    0.00    2.00    0.00    0.00   0.00   0.00    0.00    0.40   0.20     0.00     4.00   1.41  72.10

After this patch:

Device     r/s     w/s   rMB/s   wMB/s  rrqm/s  wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
vda       0.00  512.00    0.00    2.00    0.00    0.00   0.00   0.00    0.00    0.40   0.20     0.00     4.00   0.39  20.00

Fixes: 5b18b5a73760 ("block: delete part_round_stats and switch to less precise counting")
Fixes: 2b8bd423614c ("block/diskstats: more accurate approximation of io_ticks for slow disks")
Reported-by: Yabin Li
Signed-off-by: Weiping Zhang
---
 block/blk-core.c | 19 ++++++++++++++-----
 block/blk-mq.c   | 26 ++++++++++++++++++++++++++
 block/blk-mq.h   |  1 +
 block/blk.h      |  1 +
 block/genhd.c    | 13 +++++++++++++
 5 files changed, 55 insertions(+), 5 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index ac00d2fa4eb4..9dad92355125 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1256,14 +1256,14 @@ unsigned int blk_rq_err_bytes(const struct request *rq)
 }
 EXPORT_SYMBOL_GPL(blk_rq_err_bytes);
 
-static void update_io_ticks(struct hd_struct *part, unsigned long now, bool end)
+static void update_io_ticks(struct hd_struct *part, unsigned long now, bool inflight)
 {
 	unsigned long stamp;
 again:
 	stamp = READ_ONCE(part->stamp);
 	if (unlikely(stamp != now)) {
-		if (likely(cmpxchg(&part->stamp, stamp, now) == stamp))
-			__part_stat_add(part, io_ticks, end ? now - stamp : 1);
+		if (likely(cmpxchg(&part->stamp, stamp, now) == stamp) && inflight)
+			__part_stat_add(part, io_ticks, now - stamp);
 	}
 	if (part->partno) {
 		part = &part_to_disk(part)->part0;
@@ -1310,13 +1310,20 @@ void blk_account_io_done(struct request *req, u64 now)
 
 void blk_account_io_start(struct request *rq)
 {
+	struct hd_struct *part;
+	struct request_queue *q;
+	bool inflight;
+
 	if (!blk_do_io_stat(rq))
 		return;
 
 	rq->part = disk_map_sector_rcu(rq->rq_disk, blk_rq_pos(rq));
 
 	part_stat_lock();
-	update_io_ticks(rq->part, jiffies, false);
+	part = rq->part;
+	q = part_to_disk(part)->queue;
+	inflight = blk_mq_part_is_in_flight(q, part);
+	update_io_ticks(part, jiffies, inflight);
 	part_stat_unlock();
 }
 
@@ -1325,9 +1332,11 @@ static unsigned long __part_start_io_acct(struct hd_struct *part,
 {
 	const int sgrp = op_stat_group(op);
 	unsigned long now = READ_ONCE(jiffies);
+	bool inflight;
 
 	part_stat_lock();
-	update_io_ticks(part, now, false);
+	inflight = part_is_in_flight(part);
+	update_io_ticks(part, now, inflight);
 	part_stat_inc(part, ios[sgrp]);
 	part_stat_add(part, sectors[sgrp], sectors);
 	part_stat_local_inc(part, in_flight[op_is_write(op)]);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 696450257ac1..126a6a6f7035 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -130,6 +130,32 @@ void blk_mq_in_flight_rw(struct request_queue *q, struct hd_struct *part,
 	inflight[1] = mi.inflight[1];
 }
 
+static bool blk_mq_part_check_inflight(struct blk_mq_hw_ctx *hctx,
+				       struct request *rq, void *priv,
+				       bool reserved)
+{
+	struct mq_inflight *mi = priv;
+
+	if (rq->part == mi->part && blk_mq_rq_state(rq) == MQ_RQ_IN_FLIGHT) {
+		mi->inflight[rq_data_dir(rq)]++;
+		/* return false to break loop early */
+		return false;
+	}
+
+	return true;
+}
+
+bool blk_mq_part_is_in_flight(struct request_queue *q, struct hd_struct *part)
+{
+	struct mq_inflight mi = { .part = part };
+
+	mi.inflight[0] = mi.inflight[1] = 0;
+
+	blk_mq_queue_tag_busy_iter(q, blk_mq_part_check_inflight, &mi);
+
+	return mi.inflight[0] + mi.inflight[1] > 0;
+}
+
 void blk_freeze_queue_start(struct request_queue *q)
 {
 	mutex_lock(&q->mq_freeze_lock);
diff --git a/block/blk-mq.h b/block/blk-mq.h
index a52703c98b77..bb7e22d746e1 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -76,6 +76,7 @@ void blk_mq_insert_requests(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
 blk_status_t blk_mq_request_issue_directly(struct request *rq, bool last);
 void blk_mq_try_issue_list_directly(struct blk_mq_hw_ctx *hctx,
 				    struct list_head *list);
+bool blk_mq_part_is_in_flight(struct request_queue *q, struct hd_struct *part);
 
 /*
  * CPU -> queue mappings
diff --git a/block/blk.h b/block/blk.h
index dfab98465db9..2572b7aadcbb 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -443,5 +443,6 @@ static inline void part_nr_sects_write(struct hd_struct *part, sector_t size)
 int bio_add_hw_page(struct request_queue *q, struct bio *bio,
 		struct page *page, unsigned int len, unsigned int offset,
 		unsigned int max_sectors, bool *same_page);
+bool part_is_in_flight(struct hd_struct *part);
 
 #endif /* BLK_INTERNAL_H */
diff --git a/block/genhd.c b/block/genhd.c
index 0a273211fec2..4a089bed9dcb 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -109,6 +109,19 @@ static void part_stat_read_all(struct hd_struct *part, struct disk_stats *stat)
 	}
 }
 
+bool part_is_in_flight(struct hd_struct *part)
+{
+	int cpu;
+
+	for_each_possible_cpu(cpu) {
+		if (part_stat_local_read_cpu(part, in_flight[0], cpu) ||
+		    part_stat_local_read_cpu(part, in_flight[1], cpu))
+			return true;
+	}
+
+	return false;
+}
+
 static unsigned int part_in_flight(struct hd_struct *part)
 {
 	unsigned int inflight = 0;

From patchwork Mon Oct 26 06:15:51 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Weiping Zhang
X-Patchwork-Id: 11855779
Date: Mon, 26 Oct 2020 14:15:51 +0800
From: Weiping Zhang
Subject: [PATCH v4 2/2] blk-mq: break more earlier when interate hctx
Message-ID: <20201026061550.GA24417@192.168.3.9>
Mail-Followup-To: axboe@kernel.dk, ming.lei@redhat.com, snitzer@redhat.com,
 mpatocka@redhat.com, linux-block@vger.kernel.org
Content-Disposition: inline
User-Agent: Mutt/1.5.21 (2010-09-15)
Precedence: bulk
X-Mailing-List: linux-block@vger.kernel.org

blk_mq_part_is_in_flight and blk_mq_queue_inflight do not care how many
IOs are in flight, so they can stop iterating over the remaining hctxs as
soon as one request meets their requirement. This saves some CPU cycles.

Signed-off-by: Weiping Zhang
---
 block/blk-mq-tag.c     | 11 +++++++++--
 block/blk-mq-tag.h     |  2 +-
 block/blk-mq.c         | 34 +++++++++++++++++++++++++++++-----
 include/linux/blk-mq.h |  1 +
 4 files changed, 40 insertions(+), 8 deletions(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 9c92053e704d..f364682dabe1 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -401,14 +401,17 @@ EXPORT_SYMBOL(blk_mq_tagset_wait_completed_request);
  *		reserved) where rq is a pointer to a request and hctx points
  *		to the hardware queue associated with the request. 'reserved'
  *		indicates whether or not @rq is a reserved request.
- * @priv:	Will be passed as third argument to @fn.
+ * @check_break: Pointer to the function that will be called for each hctx on @q.
+ *		@check_break will break the loop for hctx when it returns false;
+ *		if you want to iterate all hctx, set it to NULL.
+ * @priv:	Will be passed as third argument to @fn, or as arg to @check_break.
  *
  * Note: if @q->tag_set is shared with other request queues then @fn will be
  * called for all requests on all queues that share that tag set and not only
  * for requests associated with @q.
  */
 void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
-		void *priv)
+		check_break_fn *check_break, void *priv)
 {
 	struct blk_mq_hw_ctx *hctx;
 	int i;
@@ -434,7 +437,11 @@ void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
 		if (tags->nr_reserved_tags)
 			bt_for_each(hctx, tags->breserved_tags, fn, priv, true);
 		bt_for_each(hctx, tags->bitmap_tags, fn, priv, false);
+
+		if (check_break && !check_break(priv))
+			goto out;
 	}
+out:
 	blk_queue_exit(q);
 }
 
diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
index 7d3e6b333a4a..d122be9f87cb 100644
--- a/block/blk-mq-tag.h
+++ b/block/blk-mq-tag.h
@@ -42,7 +42,7 @@ extern void blk_mq_tag_resize_shared_sbitmap(struct blk_mq_tag_set *set,
 
 extern void blk_mq_tag_wakeup_all(struct blk_mq_tags *tags, bool);
 void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
-		void *priv);
+		check_break_fn *check_break, void *priv);
 void blk_mq_all_tag_iter(struct blk_mq_tags *tags, busy_tag_iter_fn *fn,
 		void *priv);
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 126a6a6f7035..458ade751b01 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -115,7 +115,7 @@ unsigned int blk_mq_in_flight(struct request_queue *q, struct hd_struct *part)
 {
 	struct mq_inflight mi = { .part = part };
 
-	blk_mq_queue_tag_busy_iter(q, blk_mq_check_inflight, &mi);
+	blk_mq_queue_tag_busy_iter(q, blk_mq_check_inflight, NULL, &mi);
 
 	return mi.inflight[0] + mi.inflight[1];
 }
@@ -125,11 +125,22 @@ void blk_mq_in_flight_rw(struct request_queue *q, struct hd_struct *part,
 {
 	struct mq_inflight mi = { .part = part };
 
-	blk_mq_queue_tag_busy_iter(q, blk_mq_check_inflight, &mi);
+	blk_mq_queue_tag_busy_iter(q, blk_mq_check_inflight, NULL, &mi);
 	inflight[0] = mi.inflight[0];
 	inflight[1] = mi.inflight[1];
 }
 
+static bool blk_mq_part_check_break(void *priv)
+{
+	struct mq_inflight *mi = priv;
+
+	/* return false to stop iterating over other hctx */
+	if (mi->inflight[0] || mi->inflight[1])
+		return false;
+
+	return true;
+}
+
 static bool blk_mq_part_check_inflight(struct blk_mq_hw_ctx *hctx,
 				       struct request *rq, void *priv,
 				       bool reserved)
@@ -151,7 +162,8 @@ bool blk_mq_part_is_in_flight(struct request_queue *q, struct hd_struct *part)
 
 	mi.inflight[0] = mi.inflight[1] = 0;
 
-	blk_mq_queue_tag_busy_iter(q, blk_mq_part_check_inflight, &mi);
+	blk_mq_queue_tag_busy_iter(q, blk_mq_part_check_inflight,
+				   blk_mq_part_check_break, &mi);
 
 	return mi.inflight[0] + mi.inflight[1] > 0;
 }
@@ -909,11 +921,23 @@ static bool blk_mq_rq_inflight(struct blk_mq_hw_ctx *hctx, struct request *rq,
 	return true;
 }
 
+static bool blk_mq_rq_check_break(void *priv)
+{
+	bool *busy = priv;
+
+	/* return false to stop iterating over other hctx */
+	if (*busy)
+		return false;
+
+	return true;
+}
+
 bool blk_mq_queue_inflight(struct request_queue *q)
 {
 	bool busy = false;
 
-	blk_mq_queue_tag_busy_iter(q, blk_mq_rq_inflight, &busy);
+	blk_mq_queue_tag_busy_iter(q, blk_mq_rq_inflight,
+				   blk_mq_rq_check_break, &busy);
 	return busy;
 }
 EXPORT_SYMBOL_GPL(blk_mq_queue_inflight);
@@ -1018,7 +1042,7 @@ static void blk_mq_timeout_work(struct work_struct *work)
 	if (!percpu_ref_tryget(&q->q_usage_counter))
 		return;
 
-	blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &next);
+	blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, NULL, &next);
 
 	if (next != 0) {
 		mod_timer(&q->timeout, next);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index b23eeca4d677..efd1e8269f0b 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -280,6 +280,7 @@ struct blk_mq_queue_data {
 typedef bool (busy_iter_fn)(struct blk_mq_hw_ctx *, struct request *, void *,
 		bool);
 typedef bool (busy_tag_iter_fn)(struct request *, void *, bool);
+typedef bool (check_break_fn)(void *);
 
 /**
  * struct blk_mq_ops - Callback functions that implements block driver