Message ID | 20230601053919.3639954-1-chengming.zhou@linux.dev (mailing list archive)
---|---
State | New, archived
Series | blk-mq: fix incorrect rq start_time_ns and alloc_time_ns after throttled
Hello,

On Thu, Jun 01, 2023 at 01:39:19PM +0800, chengming.zhou@linux.dev wrote:
> From: Chengming Zhou <zhouchengming@bytedance.com>
>
> iocost relies on rq start_time_ns and alloc_time_ns to tell the saturation
> state of the block device.
>
> If any qos ->throttle() ends up blocking, the cached rq start_time_ns and
> alloc_time_ns will include its throttled time, which can confuse its users.

I don't follow. rq_qos_throttle() happens before a request is allocated, so
whether ->throttle() blocks or not doesn't affect alloc_time_ns or
start_time_ns.

> This patch adds an nr_flush counter in blk_plug, so we can tell if the task
> has throttled in any qos ->throttle(), in which case we need to correct
> the rq start_time_ns and alloc_time_ns.
>
> Another solution may be to make rq_qos_throttle() return bool to indicate
> whether it has throttled in any qos ->throttle(). But that needs more changes.
>
> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>

Depending on the flush behavior and adjusting alloc_time_ns seems fragile to
me and will likely confuse other users of alloc_time_ns too.

Maybe I'm misunderstanding the problem you're describing. Can you give a
concrete example of how the current code would misbehave?

Thanks.
On 2023/6/6 02:31, Tejun Heo wrote:
> Hello,
>
> On Thu, Jun 01, 2023 at 01:39:19PM +0800, chengming.zhou@linux.dev wrote:
>> From: Chengming Zhou <zhouchengming@bytedance.com>
>>
>> iocost relies on rq start_time_ns and alloc_time_ns to tell the saturation
>> state of the block device.
>>
>> If any qos ->throttle() ends up blocking, the cached rq start_time_ns and
>> alloc_time_ns will include its throttled time, which can confuse its users.
>
> I don't follow. rq_qos_throttle() happens before a request is allocated, so
> whether ->throttle() blocks or not doesn't affect alloc_time_ns or
> start_time_ns.

Yes, most of the time the request is allocated after rq_qos_throttle(), so
its alloc_time_ns and start_time_ns clearly won't be affected.

But with the plug batched allocation introduced by commit 47c122e35d7e
("block: pre-allocate requests if plug is started and is a batch"), we can
call rq_qos_throttle() after the allocation of the request. This is what
blk_mq_get_cached_request() does.

In this case, the cached request's alloc_time_ns and start_time_ns are far
in the past if we block in any qos ->throttle().

>
>> This patch adds an nr_flush counter in blk_plug, so we can tell if the task
>> has throttled in any qos ->throttle(), in which case we need to correct
>> the rq start_time_ns and alloc_time_ns.
>>
>> Another solution may be to make rq_qos_throttle() return bool to indicate
>> whether it has throttled in any qos ->throttle(). But that needs more changes.
>>
>> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
>
> Depending on the flush behavior and adjusting alloc_time_ns seems fragile to
> me and will likely confuse other users of alloc_time_ns too.

I agree with you, this code is not good. My basic idea is to adjust the cached
request's alloc_time_ns and start_time_ns when throttled.

>
> Maybe I'm misunderstanding the problem you're describing. Can you give a
> concrete example of how the current code would misbehave?
>

I tried using fio to reproduce it:

1. Set the iocost qos (a fairly strict setting, to make throttling easy to
   reproduce):

   echo "259:0 enable=1 rpct=5 rlat=500 wpct=5 wlat=500" > io.cost.qos

2. Run fio using the io_uring ioengine (for now only io_uring uses batched
   allocation):

   fio --name global --runtime 30 --time_based --size 10G --ioengine io_uring \
       --iodepth 256 --buffered 0 --sqthread_poll \
       --name job1 --rw read --cgroup job1 --numjobs 10 \
       --name job2 --rw write --cgroup job2 --numjobs 10

3. Run bpftrace to check request start_time_ns:

   bpftrace -e 'kprobe:__rq_qos_track {
       $rq = (struct request *)arg1;
       if ($rq->start_time_ns) {
           @delta = hist((nsecs - $rq->start_time_ns)/1000);
       }
   }'

If we go blk_mq_get_cached_request() -> throttle() and are throttled for some
time, then the returned cached request's start_time_ns will be far in the
past, like below (delta values are in us):

@delta:
[0]              170090 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1]                 898 |                                                    |
[2, 4)              418 |                                                    |
[4, 8)              284 |                                                    |
[8, 16)              54 |                                                    |
[16, 32)            198 |                                                    |
[32, 64)           5416 |@                                                   |
[64, 128)          5082 |@                                                   |
[128, 256)         1296 |                                                    |
[256, 512)           23 |                                                    |
[512, 1K)          2632 |                                                    |
[1K, 2K)          21143 |@@@@@@                                              |
[2K, 4K)          26349 |@@@@@@@@                                            |
[4K, 8K)           4559 |@                                                   |
[8K, 16K)          4273 |@                                                   |
[16K, 32K)           14 |                                                    |
Hello,

On Tue, Jun 06, 2023 at 06:22:28PM +0800, Chengming Zhou wrote:
...
> But with the plug batched allocation introduced by commit 47c122e35d7e
> ("block: pre-allocate requests if plug is started and is a batch"), we can
> call rq_qos_throttle() after the allocation of the request. This is what
> blk_mq_get_cached_request() does.
>
> In this case, the cached request's alloc_time_ns and start_time_ns are far
> in the past if we block in any qos ->throttle().

Ah, okay, that's problematic.

> >> This patch adds an nr_flush counter in blk_plug, so we can tell if the task
> >> has throttled in any qos ->throttle(), in which case we need to correct
> >> the rq start_time_ns and alloc_time_ns.
> >>
> >> Another solution may be to make rq_qos_throttle() return bool to indicate
> >> whether it has throttled in any qos ->throttle(). But that needs more changes.
> >>
> >> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
> >
> > Depending on the flush behavior and adjusting alloc_time_ns seems fragile to
> > me and will likely confuse other users of alloc_time_ns too.
>
> I agree with you, this code is not good. My basic idea is to adjust the cached
> request's alloc_time_ns and start_time_ns when throttled.

Would it make sense to skip setting the alloc_time_ns during pre-allocation
and set it later when the pre-allocated rq is actually used? That should
jive better.

Thanks.
Hello,

On 2023/6/9 06:56, Tejun Heo wrote:
> Hello,
>
> On Tue, Jun 06, 2023 at 06:22:28PM +0800, Chengming Zhou wrote:
> ...
>> But with the plug batched allocation introduced by commit 47c122e35d7e
>> ("block: pre-allocate requests if plug is started and is a batch"), we can
>> call rq_qos_throttle() after the allocation of the request. This is what
>> blk_mq_get_cached_request() does.
>>
>> In this case, the cached request's alloc_time_ns and start_time_ns are far
>> in the past if we block in any qos ->throttle().
>
> Ah, okay, that's problematic.
>

Thanks for your review! Sorry for my delay, I was out of the office.

>>>> This patch adds an nr_flush counter in blk_plug, so we can tell if the task
>>>> has throttled in any qos ->throttle(), in which case we need to correct
>>>> the rq start_time_ns and alloc_time_ns.
>>>>
>>>> Another solution may be to make rq_qos_throttle() return bool to indicate
>>>> whether it has throttled in any qos ->throttle(). But that needs more changes.
>>>>
>>>> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
>>>
>>> Depending on the flush behavior and adjusting alloc_time_ns seems fragile to
>>> me and will likely confuse other users of alloc_time_ns too.
>>
>> I agree with you, this code is not good. My basic idea is to adjust the cached
>> request's alloc_time_ns and start_time_ns when throttled.
>
> Would it make sense to skip setting the alloc_time_ns during pre-allocation
> and set it later when the pre-allocated rq is actually used? That should
> jive better.
>

Ok, I think it's much clearer to set the alloc_time_ns and start_time_ns to
"now" when the pre-allocated rq is actually used. I will send an updated
version later.

Thanks.
diff --git a/block/blk-core.c b/block/blk-core.c
index 00c74330fa92..5109f7f5606c 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1053,6 +1053,7 @@ void blk_start_plug_nr_ios(struct blk_plug *plug, unsigned short nr_ios)
 	plug->cached_rq = NULL;
 	plug->nr_ios = min_t(unsigned short, nr_ios, BLK_MAX_REQUEST_COUNT);
 	plug->rq_count = 0;
+	plug->nr_flush = 0;
 	plug->multiple_queues = false;
 	plug->has_elevator = false;
 	plug->nowait = false;
@@ -1150,6 +1151,8 @@ void __blk_flush_plug(struct blk_plug *plug, bool from_schedule)
 	 */
 	if (unlikely(!rq_list_empty(plug->cached_rq)))
 		blk_mq_free_plug_rqs(plug);
+
+	plug->nr_flush++;
 }
 
 /**
diff --git a/block/blk-mq.c b/block/blk-mq.c
index f6dad0886a2f..8731f2815790 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2871,6 +2871,7 @@ static inline struct request *blk_mq_get_cached_request(struct request_queue *q,
 {
 	struct request *rq;
 	enum hctx_type type, hctx_type;
+	unsigned short nr_flush;
 
 	if (!plug)
 		return NULL;
@@ -2897,8 +2898,25 @@ static inline struct request *blk_mq_get_cached_request(struct request_queue *q,
 	 * before we throttle.
 	 */
 	plug->cached_rq = rq_list_next(rq);
+	nr_flush = plug->nr_flush;
 	rq_qos_throttle(q, *bio);
 
+	/*
+	 * If any qos ->throttle() end up blocking, we will have flushed the
+	 * plug and we need to correct the rq start_time_ns and alloc_time_ns.
+	 */
+	if (nr_flush != plug->nr_flush) {
+		if (blk_mq_need_time_stamp(rq)) {
+			u64 now = ktime_get_ns();
+
+#ifdef CONFIG_BLK_RQ_ALLOC_TIME
+			if (rq->alloc_time_ns)
+				rq->alloc_time_ns += now - rq->start_time_ns;
+#endif
+			rq->start_time_ns = now;
+		}
+	}
+
 	rq->cmd_flags = (*bio)->bi_opf;
 	INIT_LIST_HEAD(&rq->queuelist);
 	return rq;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index e3242e67a8e3..cf66871a1844 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -978,9 +978,11 @@ struct blk_plug {
 
 	unsigned short rq_count;
 
-	bool multiple_queues;
-	bool has_elevator;
-	bool nowait;
+	unsigned short nr_flush;
+
+	bool multiple_queues:1;
+	bool has_elevator:1;
+	bool nowait:1;
 
 	struct list_head cb_list; /* md requires an unplug callback */
 };