
blk-mq: fix incorrect rq start_time_ns and alloc_time_ns after throttled

Message ID 20230601053919.3639954-1-chengming.zhou@linux.dev (mailing list archive)
State New, archived
Series blk-mq: fix incorrect rq start_time_ns and alloc_time_ns after throttled

Commit Message

Chengming Zhou June 1, 2023, 5:39 a.m. UTC
From: Chengming Zhou <zhouchengming@bytedance.com>

iocost relies on rq start_time_ns and alloc_time_ns to tell the saturation
state of the block device.

If any qos ->throttle() ends up blocking, the cached rq start_time_ns and
alloc_time_ns will include the throttled time, which can confuse its users.

This patch adds an nr_flush counter to blk_plug, so we can tell whether the
task has been throttled in any qos ->throttle(), in which case we need to
correct the rq start_time_ns and alloc_time_ns.

Another solution would be to make rq_qos_throttle() return a bool to indicate
whether it throttled in any qos ->throttle(), but that needs more changes.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
 block/blk-core.c       |  3 +++
 block/blk-mq.c         | 18 ++++++++++++++++++
 include/linux/blkdev.h |  8 +++++---
 3 files changed, 26 insertions(+), 3 deletions(-)

Comments

Tejun Heo June 5, 2023, 6:31 p.m. UTC | #1
Hello,

On Thu, Jun 01, 2023 at 01:39:19PM +0800, chengming.zhou@linux.dev wrote:
> From: Chengming Zhou <zhouchengming@bytedance.com>
> 
> iocost rely on rq start_time_ns and alloc_time_ns to tell the saturation
> state of the block device.
> 
> If any qos ->throttle() end up blocking, the cached rq start_time_ns and
> alloc_time_ns will include its throtted time, which can confuse its user.

I don't follow. rq_qos_throttle() happens before a request is allocated, so
whether ->throttle() blocks or not doesn't affect alloc_time_ns or
start_time_ns.

> This patch add nr_flush counter in blk_plug, so we can tell if the task
> has throttled in any qos ->throttle(), in which case we need to correct
> the rq start_time_ns and alloc_time_ns.
> 
> Another solution may be make rq_qos_throttle() return bool to indicate
> if it has throttled in any qos ->throttle(). But this need more changes.
> 
> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>

Depending on the flush behavior and adjusting alloc_time_ns seems fragile to
me and will likely confuse other users of alloc_time_ns too.

Maybe I'm misunderstanding the problem you're describing. Can you give a
concrete example of how the current code would misbehave?

Thanks.
Chengming Zhou June 6, 2023, 10:22 a.m. UTC | #2
On 2023/6/6 02:31, Tejun Heo wrote:
> Hello,
> 
> On Thu, Jun 01, 2023 at 01:39:19PM +0800, chengming.zhou@linux.dev wrote:
>> From: Chengming Zhou <zhouchengming@bytedance.com>
>>
>> iocost rely on rq start_time_ns and alloc_time_ns to tell the saturation
>> state of the block device.
>>
>> If any qos ->throttle() end up blocking, the cached rq start_time_ns and
>> alloc_time_ns will include its throtted time, which can confuse its user.
> 
> I don't follow. rq_qos_throttle() happens before a request is allocated, so
> whether ->throttle() blocks or not doesn't affect alloc_time_ns or
> start_time_ns.

Yes, most of the time the request is allocated after rq_qos_throttle(), so
its alloc_time_ns and start_time_ns obviously aren't affected.

But with the plug batched allocation introduced by commit 47c122e35d7e
("block: pre-allocate requests if plug is started and is a batch"),
rq_qos_throttle() can run after the request has already been allocated.
This is what blk_mq_get_cached_request() does.

In this case, the cached request's alloc_time_ns and start_time_ns end up
far in the past if the task blocks in any qos ->throttle().
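
To make the ordering concrete, here is a simplified sketch of the
blk_mq_get_cached_request() path (not the exact kernel code):

/*
 * The request in plug->cached_rq was pre-allocated on an earlier
 * submission, so rq->alloc_time_ns and rq->start_time_ns were recorded
 * back then.
 */
rq = rq_list_peek(&plug->cached_rq);
if (!rq || rq->q != q)
	return NULL;

plug->cached_rq = rq_list_next(rq);

/*
 * This may sleep for a long time if e.g. iocost decides to throttle,
 * but the timestamps above are not updated, so they end up stale by
 * however long we were blocked here.
 */
rq_qos_throttle(q, *bio);

return rq;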

> 
>> This patch add nr_flush counter in blk_plug, so we can tell if the task
>> has throttled in any qos ->throttle(), in which case we need to correct
>> the rq start_time_ns and alloc_time_ns.
>>
>> Another solution may be make rq_qos_throttle() return bool to indicate
>> if it has throttled in any qos ->throttle(). But this need more changes.
>>
>> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
> 
> Depending on the flush behavior and adjusting alloc_time_ns seems fragile to
> me and will likely confuse other users of alloc_time_ns too.

I agree with you that this code is not great. My basic idea was to adjust the
cached request's alloc_time_ns and start_time_ns when it has been throttled.

> 
> Maybe I'm misunderstanding the problem you're describing. Can you give a
> concrete example of how the current code would misbehave?
> 

I tried using fio to reproduce it:

1. set the iocost qos (a somewhat strict QoS setting to make throttling easy to reproduce)

echo "259:0 enable=1 rpct=5 rlat=500 wpct=5 wlat=500" > io.cost.qos

2. run fio using the io_uring ioengine (for now only io_uring uses batched allocation)

fio --name global --runtime 30 --time_based --size 10G --ioengine io_uring \
--iodepth 256 --buffered 0 --sqthread_poll \
--name job1 --rw read --cgroup job1 --numjobs 10 \
--name job2 --rw write --cgroup job2 --numjobs 10

3. run bpftrace to check request start_time_ns

bpftrace -e 'kprobe:__rq_qos_track { $rq = (struct request *)arg1; if ($rq->start_time_ns) { @delta = hist((nsecs - $rq->start_time_ns)/1000); } }'


If we go through blk_mq_get_cached_request() -> rq_qos_throttle() and get
throttled for some time, the returned cached request's start_time_ns will be
far in the past.

Like below (delta values are in us):

@delta:
[0]               170090 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1]                  898 |                                                    |
[2, 4)               418 |                                                    |
[4, 8)               284 |                                                    |
[8, 16)               54 |                                                    |
[16, 32)             198 |                                                    |
[32, 64)            5416 |@                                                   |
[64, 128)           5082 |@                                                   |
[128, 256)          1296 |                                                    |
[256, 512)            23 |                                                    |
[512, 1K)           2632 |                                                    |
[1K, 2K)           21143 |@@@@@@                                              |
[2K, 4K)           26349 |@@@@@@@@                                            |
[4K, 8K)            4559 |@                                                   |
[8K, 16K)           4273 |@                                                   |
[16K, 32K)            14 |                                                    |
Tejun Heo June 8, 2023, 10:56 p.m. UTC | #3
Hello,

On Tue, Jun 06, 2023 at 06:22:28PM +0800, Chengming Zhou wrote:
...
> But for plug batched allocation introduced by the commit 47c122e35d7e
> ("block: pre-allocate requests if plug is started and is a batch"), we can
> rq_qos_throttle() after the allocation of the request. This is what the
> blk_mq_get_cached_request() does.
> 
> In this case, the cached request alloc_time_ns or start_time_ns is much ahead
> if block in any qos ->throttle().

Ah, okay, that's problematic.

> >> This patch add nr_flush counter in blk_plug, so we can tell if the task
> >> has throttled in any qos ->throttle(), in which case we need to correct
> >> the rq start_time_ns and alloc_time_ns.
> >>
> >> Another solution may be make rq_qos_throttle() return bool to indicate
> >> if it has throttled in any qos ->throttle(). But this need more changes.
> >>
> >> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
> > 
> > Depending on the flush behavior and adjusting alloc_time_ns seems fragile to
> > me and will likely confuse other users of alloc_time_ns too.
> 
> I agree with you, this code is not good. My basic idea is to adjust the cached
> request alloc_time_ns and start_time_ns when throttled.

Would it make sense to skip setting the alloc_time_ns during pre-allocation
and set it later when the pre-allocated rq is actually used? That should
jive better.

Thanks.
Chengming Zhou June 24, 2023, 3:24 p.m. UTC | #4
Hello,

On 2023/6/9 06:56, Tejun Heo wrote:
> Hello,
> 
> On Tue, Jun 06, 2023 at 06:22:28PM +0800, Chengming Zhou wrote:
> ...
>> But for plug batched allocation introduced by the commit 47c122e35d7e
>> ("block: pre-allocate requests if plug is started and is a batch"), we can
>> rq_qos_throttle() after the allocation of the request. This is what the
>> blk_mq_get_cached_request() does.
>>
>> In this case, the cached request alloc_time_ns or start_time_ns is much ahead
>> if block in any qos ->throttle().
> 
> Ah, okay, that's problematic.
> 

Thanks for your review!

Sorry for my delay, I was out of the office.

>>>> This patch add nr_flush counter in blk_plug, so we can tell if the task
>>>> has throttled in any qos ->throttle(), in which case we need to correct
>>>> the rq start_time_ns and alloc_time_ns.
>>>>
>>>> Another solution may be make rq_qos_throttle() return bool to indicate
>>>> if it has throttled in any qos ->throttle(). But this need more changes.
>>>>
>>>> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
>>>
>>> Depending on the flush behavior and adjusting alloc_time_ns seems fragile to
>>> me and will likely confuse other users of alloc_time_ns too.
>>
>> I agree with you, this code is not good. My basic idea is to adjust the cached
>> request alloc_time_ns and start_time_ns when throttled.
> 
> Would it make sense to skip setting the alloc_time_ns during pre-allocation
> and set it later when the pre-allocated rq is actually used? That should
> jive better.
> 

OK, I think it's much clearer if we set alloc_time_ns and start_time_ns
to "now" when the pre-allocated rq is actually used.

I will send an updated version later.

Thanks.

Patch

diff --git a/block/blk-core.c b/block/blk-core.c
index 00c74330fa92..5109f7f5606c 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1053,6 +1053,7 @@  void blk_start_plug_nr_ios(struct blk_plug *plug, unsigned short nr_ios)
 	plug->cached_rq = NULL;
 	plug->nr_ios = min_t(unsigned short, nr_ios, BLK_MAX_REQUEST_COUNT);
 	plug->rq_count = 0;
+	plug->nr_flush = 0;
 	plug->multiple_queues = false;
 	plug->has_elevator = false;
 	plug->nowait = false;
@@ -1150,6 +1151,8 @@  void __blk_flush_plug(struct blk_plug *plug, bool from_schedule)
 	 */
 	if (unlikely(!rq_list_empty(plug->cached_rq)))
 		blk_mq_free_plug_rqs(plug);
+
+	plug->nr_flush++;
 }
 
 /**
diff --git a/block/blk-mq.c b/block/blk-mq.c
index f6dad0886a2f..8731f2815790 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2871,6 +2871,7 @@  static inline struct request *blk_mq_get_cached_request(struct request_queue *q,
 {
 	struct request *rq;
 	enum hctx_type type, hctx_type;
+	unsigned short nr_flush;
 
 	if (!plug)
 		return NULL;
@@ -2897,8 +2898,25 @@  static inline struct request *blk_mq_get_cached_request(struct request_queue *q,
 	 * before we throttle.
 	 */
 	plug->cached_rq = rq_list_next(rq);
+	nr_flush = plug->nr_flush;
 	rq_qos_throttle(q, *bio);
 
+	/*
+	 * If any qos ->throttle() end up blocking, we will have flushed the
+	 * plug and we need to correct the rq start_time_ns and alloc_time_ns.
+	 */
+	if (nr_flush != plug->nr_flush) {
+		if (blk_mq_need_time_stamp(rq)) {
+			u64 now = ktime_get_ns();
+
+#ifdef CONFIG_BLK_RQ_ALLOC_TIME
+			if (rq->alloc_time_ns)
+				rq->alloc_time_ns += now - rq->start_time_ns;
+#endif
+			rq->start_time_ns = now;
+		}
+	}
+
 	rq->cmd_flags = (*bio)->bi_opf;
 	INIT_LIST_HEAD(&rq->queuelist);
 	return rq;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index e3242e67a8e3..cf66871a1844 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -978,9 +978,11 @@  struct blk_plug {
 
 	unsigned short rq_count;
 
-	bool multiple_queues;
-	bool has_elevator;
-	bool nowait;
+	unsigned short nr_flush;
+
+	bool multiple_queues:1;
+	bool has_elevator:1;
+	bool nowait:1;
 
 	struct list_head cb_list; /* md requires an unplug callback */
 };